Google Integrates Computer Use into Gemini 3.5 Flash for Cross-Platform Multi-Step Tasks
Google has announced that Computer Use capabilities are now integrated directly into the Gemini 3.5 Flash model. The feature lets the model identify on-screen UI elements from screenshots and simulate actions such as clicking, typing, scrolling, and switching tabs. It supports multi-step tasks across web pages, desktop software, and mobile interfaces, and can sustain loops of 70 or more operations.
Core Capabilities and Implementation
- Vision-driven: The model reads screenshots and UI structure information to understand the current interface state.
- Task execution: It autonomously performs actions such as clicking, typing, scrolling, and switching tabs, forming a "read screen → select action → execute" loop.
- Cross-platform coverage: Unlike browser-only agents, this capability supports web pages, desktop software, and mobile interfaces.
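The "read screen → select action → execute" loop described above can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: `Action`, `run_agent_loop`, `scripted_policy`, and the three callables are hypothetical names, the model call is replaced by a scripted stand-in, and the default of 70 steps is only an illustrative cap borrowed from the loop length mentioned in the announcement.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One UI action; the field names are illustrative, not a real API schema."""
    kind: str          # e.g. "click", "type", "scroll", "switch_tab", or "done"
    target: str = ""   # description of the UI element to act on
    text: str = ""     # text to enter for "type" actions


def run_agent_loop(
    capture_screen: Callable[[], bytes],
    choose_action: Callable[[bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 70,  # illustrative cap, not a documented limit
) -> List[Action]:
    """Repeat: read the screen, pick an action, execute it, until done or capped."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                # read screen
        action = choose_action(screenshot, history)  # select action (model call)
        if action.kind == "done":
            break
        execute(action)                              # execute on the device
        history.append(action)
    return history


def scripted_policy(script: List[Action]) -> Callable[[bytes, List[Action]], Action]:
    """Stand-in for the model: replays a fixed list of actions, then stops."""
    def choose(screenshot: bytes, history: List[Action]) -> Action:
        if len(history) < len(script):
            return script[len(history)]
        return Action("done")
    return choose
```

In a real integration, `choose_action` would send the screenshot and action history to the model, and `execute` would drive a browser, desktop, or mobile automation layer. Keeping the three steps as separate callables makes the loop itself platform-neutral.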
Safety Mechanisms
Google has added safety constraints to the model's execution pipeline. When an operation is sensitive or has irreversible consequences, the system interrupts the process and requires user confirmation. The model can also autonomously identify indirect attacks delivered through page content or input data.
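One way to picture the confirmation step is a gate placed in front of the executor. The sketch below is an assumption about how such a gate could look, not a description of Google's pipeline: the `SENSITIVE_KINDS` set, the action names in it, and `execute_with_confirmation` are all hypothetical.

```python
from typing import Callable

# Hypothetical set of action kinds treated as sensitive or irreversible.
SENSITIVE_KINDS = {"submit_payment", "delete", "send_message"}


def execute_with_confirmation(
    action_kind: str,
    execute: Callable[[str], None],
    confirm: Callable[[str], bool],
) -> bool:
    """Run the action, pausing for user confirmation if it is sensitive.

    Returns True if the action ran and False if the user declined it.
    """
    if action_kind in SENSITIVE_KINDS and not confirm(action_kind):
        return False  # interrupted: the user did not approve
    execute(action_kind)
    return True
```

Here `confirm` stands for whatever prompt the host application shows the user. Non-sensitive actions such as scrolling pass straight through without calling it.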
Performance and Positioning
- Benchmarks: With Computer Use, Gemini 3.5 Flash matches frontier models on several benchmark tasks and can complete complex, long-running browser tasks at lower cost.
- Positioning considerations: Google integrated Computer Use into the lightweight Flash model rather than Pro, primarily for reasons of cost and speed. Long task loops require frequent model calls, so Flash's lower unit price and faster responses are a better fit.
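The cost argument comes down to simple arithmetic: every step in the loop is a separate model call, so per-call price is multiplied by the number of steps. The sketch below is a simplified estimate under stated assumptions. `estimate_task_cost` and its parameters are hypothetical, it assumes the same token counts on every step (real loops usually grow as history accumulates), and any prices passed in are placeholders rather than published rates.

```python
def estimate_task_cost(
    steps: int,
    input_tokens_per_step: int,
    output_tokens_per_step: int,
    input_price_per_million: float,   # placeholder price per 1M input tokens
    output_price_per_million: float,  # placeholder price per 1M output tokens
) -> float:
    """Estimate total cost of a task loop, assuming identical calls at every step."""
    per_step = (
        input_tokens_per_step * input_price_per_million
        + output_tokens_per_step * output_price_per_million
    ) / 1_000_000
    return steps * per_step
```

Because the total scales linearly with the step count, a model that is a few times cheaper per call is also a few times cheaper over an entire 70-step task, which is the trade-off the positioning argument relies on.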
Industry Context and Comparison
- Pioneers: Anthropic first launched computer use capabilities in October 2024, followed by OpenAI's Operator in January 2025.
- Differentiation: Google's Computer Use has broader coverage (not limited to browsers) and is directly built into the main model.
Application Scenarios
The feature suits operations, product testing, data organization, and other work that requires frequent switching between multiple web pages, admin backends, and spreadsheets, for example cross-site information extraction followed by structured organization.