Web Agent Browsing
Status: 📋 Planned
Version: Future
Last Updated: November 15, 2025
Overview
Agents autonomously browse and interact with web applications, using Gemini 2.5 Computer Use and OpenAI Operator to complete complex multi-step web tasks, with performance measured against the Online Mind2Web Leaderboard.
Description
Goal
Enable agents to navigate real websites, fill forms, click buttons, and complete multi-step web tasks autonomously, competing with state-of-the-art web agents on standardized benchmarks.
Key Features
- Visual Web Understanding
  - Screenshot analysis to understand page layout
  - Element detection (buttons, forms, links, menus)
  - Context understanding (what page is this, what can I do)
  - Dynamic content handling (JavaScript-rendered pages)
- Intelligent Interaction (a perception-action loop is sketched after this list)
  - Click appropriate elements
  - Fill forms with correct information
  - Navigate multi-page workflows
  - Handle popups, modals, and cookie banners
  - Scroll and search for off-screen elements
- Task Planning & Execution
  - Decompose complex tasks into steps
  - Maintain context across page navigations
  - Recover from errors and dead ends
  - Verify task completion
- Multi-Modal Integration
  - Process images and videos on web pages
  - Handle file uploads and downloads
  - Extract structured data from tables and lists
  - Operate interactive widgets
- Benchmark Performance
  - Target: Online Mind2Web Leaderboard
  - Metrics: task success rate, action efficiency
  - Compare to: GPT-4V, Claude Computer Use, Gemini
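To make these features concrete, the sketch below wires them into a single perception-action loop: capture a screenshot, ask a vision-capable backend for the next action, execute it with Playwright, and repeat. `WebAgentModel` and the `Action` schema are illustrative assumptions, not actual MassGen interfaces.

```python
# Hypothetical perception-action loop; WebAgentModel and the Action
# schema are illustrative assumptions, not MassGen interfaces.
from dataclasses import dataclass
from playwright.sync_api import sync_playwright

@dataclass
class Action:
    kind: str          # "click" | "type" | "scroll" | "done"
    selector: str = ""
    text: str = ""

class WebAgentModel:
    """Placeholder for a vision-capable backend (e.g. Gemini 2.5 Computer Use)."""
    def next_action(self, task: str, screenshot: bytes, url: str) -> Action:
        raise NotImplementedError  # would call the model API here

def run_task(task: str, start_url: str, model: WebAgentModel,
             max_steps: int = 20) -> bool:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=False).new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            shot = page.screenshot()                 # perception
            action = model.next_action(task, shot, page.url)
            if action.kind == "done":
                return True                          # model judged task complete
            if action.kind == "click":
                page.click(action.selector)
            elif action.kind == "type":
                page.fill(action.selector, action.text)
            elif action.kind == "scroll":
                page.mouse.wheel(0, 600)             # reveal off-screen elements
            page.wait_for_load_state("networkidle")  # let dynamic content settle
    return False                                     # step budget exhausted
```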
Example Tasks
- E-commerce: Search product, add to cart, checkout (decomposed in the sketch after this list)
- Travel Booking: Find flights, select dates, book tickets
- Form Submission: Fill multi-page survey, upload documents
- Data Extraction: Scrape information from multiple pages
- Account Management: Login, navigate settings, update profile
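As an illustration, the e-commerce task above might decompose into a linear plan like this (hand-written; the step schema is an assumption, since the planner's output format is not yet defined):

```python
# Hand-written illustration of decomposing "search product, add to cart,
# checkout"; the step schema is an assumption, not a MassGen format.
ECOMMERCE_PLAN = [
    {"step": 1, "action": "navigate", "target": "https://shop.example.com"},
    {"step": 2, "action": "type",     "target": "search box", "value": "laptop"},
    {"step": 3, "action": "click",    "target": "search button"},
    {"step": 4, "action": "click",    "target": "first result"},
    {"step": 5, "action": "click",    "target": "add-to-cart button"},
    {"step": 6, "action": "click",    "target": "checkout button"},
    {"step": 7, "action": "verify",   "target": "order confirmation page"},
]
```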
Testing Guidelines
Test Scenarios
- Simple Navigation Test (sketched as a pytest case after this list)
  - Task: "Go to Wikipedia and search for 'Artificial Intelligence'"
  - Expected: Opens Wikipedia, uses search, navigates to article
  - Validation: Correct page reached in <30 seconds
- Form Filling Test
  - Task: "Fill out contact form with name, email, message"
  - Expected: All fields filled correctly, form submitted
  - Validation: Form submission successful, no validation errors
- Multi-Step Workflow Test
  - Task: "Search for 'laptop' on Amazon, filter by price under $1000, add cheapest to cart"
  - Expected: Completes all steps correctly
  - Validation: Correct item in cart at end
- Error Recovery Test
  - Task: Intentionally provide invalid input (e.g., a badly formatted email address)
  - Expected: Agent recognizes the error message and corrects its input
  - Validation: Successfully recovers and completes the task
- Complex Interaction Test
  - Task: "Book a flight from NYC to SF on a travel site"
  - Expected: Handles date pickers, dropdowns, and popups
  - Validation: Booking reaches confirmation page
- Mind2Web Benchmark Test
  - Setup: Standard Mind2Web test suite
  - Test: Run agent on benchmark tasks
  - Expected: Success rate competitive with the leaderboard
  - Validation: Submit results to the leaderboard
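The first scenario translates naturally into a pytest case, as sketched below; the `agent` fixture is a hypothetical wrapper around whichever backend is under test, not an existing MassGen API.

```python
# Sketch of the "Simple Navigation Test"; the agent fixture is a
# hypothetical stand-in for the backend under test.
import time
import pytest

@pytest.fixture
def agent():
    pytest.skip("wire a real backend into this fixture")  # placeholder

def test_simple_navigation(agent):
    start = time.monotonic()
    final_url = agent.run(
        task="Go to Wikipedia and search for 'Artificial Intelligence'",
        start_url="https://www.wikipedia.org",
    )
    # Validation: correct article reached in under 30 seconds.
    assert "Artificial_intelligence" in final_url
    assert time.monotonic() - start < 30
```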
Benchmark Metrics
- Success Rate: Percentage of tasks completed correctly
- Action Efficiency: Number of actions to complete task
- Time Efficiency: Time to complete task
- Error Rate: Failed actions / total actions
- Recovery Rate: Successful recoveries / errors encountered (all five metrics are computed in the sketch after this list)
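These metrics fall out of per-task logs. A minimal sketch, assuming a hypothetical `TaskResult` record:

```python
# Computes the metrics above from per-task results; the TaskResult
# shape is an assumption for illustration.
from dataclasses import dataclass

@dataclass
class TaskResult:
    succeeded: bool
    actions_taken: int
    actions_failed: int
    recoveries: int        # errors the agent corrected and moved past
    seconds_elapsed: float

def summarize(results: list[TaskResult]) -> dict[str, float]:
    total_actions = sum(r.actions_taken for r in results)
    total_failed = sum(r.actions_failed for r in results)
    return {
        "success_rate": sum(r.succeeded for r in results) / len(results),
        "avg_actions_per_task": total_actions / len(results),
        "avg_seconds_per_task": sum(r.seconds_elapsed for r in results) / len(results),
        "error_rate": total_failed / total_actions if total_actions else 0.0,
        "recovery_rate": (sum(r.recoveries for r in results) / total_failed
                          if total_failed else 1.0),
    }
```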
Evaluation Methodology
- Functional Correctness: Did the task complete successfully?
- Efficiency: Minimal number of actions required?
- Robustness: Handles edge cases and errors gracefully?
- Generalization: Works across different websites?
Validation Criteria
- ✅ >60% success rate on Mind2Web benchmark
- ✅ Average action efficiency within 1.5x of human baseline
- ✅ Handles 90%+ of common web UI patterns
- ✅ Error recovery rate >70%
- ✅ Leaderboard placement in top 10 (the measurable thresholds above are checked in the sketch after this list)
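The measurable thresholds translate into a simple gate over the `summarize()` output sketched under Benchmark Metrics; UI-pattern coverage and leaderboard placement need external evaluation and are omitted here.

```python
# Hypothetical pass/fail gate; thresholds come from the criteria above.
def meets_validation_criteria(m: dict[str, float],
                              human_baseline_actions: float) -> bool:
    return (
        m["success_rate"] > 0.60                                       # Mind2Web
        and m["avg_actions_per_task"] <= 1.5 * human_baseline_actions  # efficiency
        and m["recovery_rate"] > 0.70                                  # recovery
    )
```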
Implementation Notes
Technology Stack
Computer Use Tools:
- Gemini 2.5 Computer Use (browser environment)
- OpenAI Operator (web agent capabilities)
- Claude Computer Use (browser-specific actions)
Browser Automation:
- Playwright for reliable browser control (minimal example after this list)
- Chrome DevTools Protocol for advanced features
- Headless/headed mode support
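Playwright's Python API covers these control requirements directly. The example below uses real Playwright calls; the target URL is a placeholder.

```python
# Minimal Playwright control example: headed Chromium with a CDP session.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # headed mode for debugging
    page = browser.new_page()
    page.goto("https://example.com")
    page.screenshot(path="page.png")             # input for the vision model
    cdp = page.context.new_cdp_session(page)     # Chrome DevTools Protocol
    cdp.send("Network.enable")                   # e.g. watch network traffic
    browser.close()
```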
Visual Understanding:
- Screenshot analysis with vision models (sketched after this list)
- Element detection and classification
- Layout understanding
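A minimal sketch of screenshot analysis, here using the OpenAI Python client's image-input format as the vision backend; the prompt and model choice are placeholders rather than the actual pipeline.

```python
# Sends a page screenshot to a vision model and asks for interactable
# elements; prompt and model are illustrative placeholders.
import base64
from openai import OpenAI

def describe_page(screenshot_png: bytes, model: str = "gpt-4o") -> str:
    b64 = base64.b64encode(screenshot_png).decode()
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the clickable elements and form fields "
                         "visible in this page screenshot."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```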
Configuration Example
web_browsing:
  agent:
    backend: gemini-2.5-computer-use-preview
    environment: browser
    browser_type: chromium
    headless: false
  capabilities:
    screenshot_analysis: true
    element_detection: true
    multi_page_navigation: true
    form_filling: true
    error_recovery: true
  benchmarks:
    - mind2web
    - webarena
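A config like this can be loaded and sanity-checked with PyYAML; a minimal sketch, assuming the key layout shown above:

```python
# Loads the example config; the key layout mirrors the YAML above.
import yaml

with open("web_browsing.yaml") as f:
    cfg = yaml.safe_load(f)["web_browsing"]

assert cfg["agent"]["environment"] == "browser"
assert "mind2web" in cfg["benchmarks"]
print(f"Backend: {cfg['agent']['backend']}, headless: {cfg['agent']['headless']}")
```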
Execution Command
# Single task
DISPLAY=:20 massgen --config web_browsing.yaml \
--query "Search for 'machine learning' on Google Scholar"
# Benchmark evaluation
massgen --config web_browsing_benchmark.yaml \
--benchmark mind2web \
--output results.json
Dependencies
- Computer Use Tools (v0.1.9, v0.1.12) - Foundation for web browsing
- Gemini Computer Use (v0.1.12) - Browser automation capabilities
- Claude Computer Use - Browser environment support
References
Target Benchmark
Compete on the Online Mind2Web Leaderboard, demonstrating MassGen’s ability to handle real-world web automation tasks with state-of-the-art performance.