Graphical user interface (GUI) agents are moving from demos to real usage—handling multi-step tasks on phones, understanding what’s on screen, and deciding what to tap next. Alibaba Tongyi Lab’s newly released MAI-UI aims to make that transition practical by combining multimodal perception, tool use, and a privacy-aware deployment design that splits work between device and cloud.
What is MAI-UI?
MAI-UI is a family of multimodal GUI agents built on Qwen3 VL. The release spans multiple model sizes—2B, 8B, 32B, and 235B A22B—designed to take two inputs: natural-language instructions from a user and rendered UI screenshots. From there, MAI-UI produces structured actions that can be executed in a live Android environment.
Those actions cover standard mobile interactions such as:
- Clicking UI elements
- Swiping and scrolling
- Entering text
- Pressing system buttons
What distinguishes MAI-UI from many earlier GUI agents is that it expands the action space beyond “touch-only” navigation. In addition to GUI actions, the system introduces explicit operations that allow the agent to:
- Answer user questions directly in natural language
- Ask the user for clarification when goals are underspecified or ambiguous
- Invoke external tools through MCP tool calls
This design allows a single trajectory to mix UI manipulation, conversational turns, and tool/API-level operations—an approach that can matter in real tasks where a user may change their intent mid-way, or where an app action is more reliably completed through a tool invocation than through fragile UI taps.
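The released action format is not reproduced here, but as a rough illustration of such a mixed action space, the sketch below defines a hypothetical schema in Python; the action names and fields are assumptions for exposition, not MAI-UI's actual interface.

```python
# Hypothetical action schema for a GUI agent that mixes UI manipulation,
# conversational turns, and MCP tool calls. Field names and action types
# are illustrative assumptions, not MAI-UI's released format.
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class AgentAction:
    kind: Literal["click", "swipe", "type", "press_key", "answer", "ask_user", "mcp_call"]
    # GUI actions: screen coordinates and/or a text payload.
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None
    # Conversational actions: a message returned to the user.
    message: Optional[str] = None
    # MCP tool calls: tool name plus structured arguments.
    tool: Optional[str] = None
    arguments: dict = field(default_factory=dict)

# A single trajectory can then interleave UI steps, a clarifying question,
# and a tool invocation:
trajectory = [
    AgentAction(kind="click", x=540, y=1210),
    AgentAction(kind="ask_user", message="Which month's bill do you want to see?"),
    AgentAction(kind="mcp_call", tool="billing.get_statement", arguments={"month": "2025-11"}),
    AgentAction(kind="answer", message="Your November statement total is shown on screen."),
]
```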
Three gaps MAI-UI is designed to address
According to the underlying report, Alibaba Tongyi Lab positions MAI-UI as a response to three gaps that many earlier GUI agents overlook:
- Native agent–user interaction: The agent isn’t limited to silent navigation; it can ask clarifying questions and respond to users when appropriate.
- MCP tool integration: Tool calls are treated as first-class actions, enabling hybrid execution paths that combine UI steps with tool-driven operations.
- Device–cloud collaboration architecture: The system can route execution depending on privacy sensitivity and task state—keeping privacy-sensitive work on-device while leveraging large cloud models when needed.
From a modeling and training standpoint, MAI-UI is described as unifying three major components: a self-evolving navigation data pipeline (including user interaction and MCP cases), an online reinforcement learning (RL) framework that scales across many parallel Android instances with long contexts, and a device–cloud collaboration system to balance privacy and capability.
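The report does not spell out the routing policy itself. As a minimal sketch, assuming a simple sensitivity check decides where each step runs, device-cloud routing could look like the following; the keyword heuristic and function names are placeholders, not MAI-UI's actual mechanism.

```python
# Minimal sketch of a device-cloud routing policy based on privacy sensitivity.
# The keyword check is a crude placeholder; a real system would likely use a
# learned classifier and also consider task state.
from typing import Callable

PRIVACY_KEYWORDS = ("password", "otp", "bank", "id number", "medical")

def is_privacy_sensitive(screen_text: str, instruction: str) -> bool:
    """Placeholder sensitivity check over the current screen and instruction."""
    blob = f"{screen_text} {instruction}".lower()
    return any(keyword in blob for keyword in PRIVACY_KEYWORDS)

def route_step(screen_text: str, instruction: str,
               on_device_model: Callable, cloud_model: Callable):
    """Run sensitive steps locally; escalate the rest to the larger cloud model."""
    if is_privacy_sensitive(screen_text, instruction):
        return on_device_model(screen_text, instruction)
    return cloud_model(screen_text, instruction)
```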
GUI grounding with instruction reasoning
For a GUI agent to function reliably, it must solve a core problem: grounding. Grounding is the process of mapping a free-form instruction—like “open monthly billing settings”—to the correct on-screen control. In practice, this means identifying the right UI element and outputting an action (often a click) at the correct location.
MAI-UI applies a grounding strategy inspired by UI-Ins, which centers on multi-perspective instruction descriptions. Rather than relying on a single caption per element, the training pipeline generates multiple “views” of the same UI element. These can describe, for example:
- How an element looks (appearance)
- What it does (function)
- Where it is (spatial location)
- Why a user would use it (user intent)
These alternative descriptions become reasoning evidence for the model. The model’s task is then to select a point inside the correct bounding box for the target element. The multi-view approach is intended to reduce brittleness caused by flawed, incomplete, or overly vague instructions—an issue earlier work has highlighted in common GUI datasets.
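As a concrete illustration, a single grounding sample with multiple instruction views might be represented like this; the structure and field names are assumptions for exposition, not the released training format.

```python
# Illustrative grounding sample with multi-perspective instruction "views"
# and a ground-truth bounding box. Structure and values are invented.
sample = {
    "instruction": "open monthly billing settings",
    "views": {
        "appearance": "gear icon with the label 'Billing' in the settings list",
        "function": "opens the page where billing cycle and invoices are managed",
        "spatial": "third row of the account settings menu, below 'Profile'",
        "intent": "the user wants to review or change how they are billed each month",
    },
    # Target element bounding box in screen pixels: (x1, y1, x2, y2).
    "bbox": (64, 742, 1016, 838),
}
```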
For training labels, ground-truth boxes are assembled from a combination of curated GUI datasets and large-scale exploration of virtualized operating systems in containerized environments. To align text metadata with pixel-level element locations, MAI-UI uses accessibility trees or OCR-based parsers.
The objective mixes supervised fine-tuning with reinforcement-style signals, rewarding correct “point-in-box” predictions and valid output formatting.
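A simplified version of such a combined reward, assuming the model emits a JSON-like point prediction, could look like the sketch below; the output syntax, parsing, and weights are assumptions rather than the paper's exact objective.

```python
# Illustrative "point-in-box plus format" reward. The expected output syntax
# and the 0.1 / 0.9 weighting are assumptions for exposition.
import json
import re

def grounding_reward(model_output: str, bbox: tuple) -> float:
    """Return 1.0 for a well-formed prediction inside the ground-truth box,
    a small credit for valid formatting alone, and 0.0 otherwise."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        return 0.0                      # unparseable output: no reward
    try:
        action = json.loads(match.group(0))
        x, y = float(action["x"]), float(action["y"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0
    format_reward = 0.1                 # valid, well-formed prediction
    x1, y1, x2, y2 = bbox
    hit = x1 <= x <= x2 and y1 <= y <= y2
    return format_reward + (0.9 if hit else 0.0)
```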
Reported benchmark results for grounding
On public GUI grounding benchmarks, MAI-UI reports the following results:
- 73.5% accuracy on ScreenSpot-Pro with adaptive zoom-in
- 91.3% on MMBench-GUI L2
- 70.9% on OSWorld-G
- 49.2% on UI-Vision
The report states these results surpass Gemini 3 Pro and Seed1.8 on ScreenSpot-Pro, and that MAI-UI performs notably better than earlier open models on UI-Vision.
Self-evolving navigation data and the MobileWorld benchmark
Navigation is typically more difficult than grounding because it requires long-horizon planning across many steps. A mobile assistant may need to maintain context, switch between screens (or even apps), recover from mistakes, and sometimes ask the user questions or call tools.
To train that kind of behavior, Tongyi Lab uses what it describes as a self-evolving navigation data pipeline. The process starts with seed tasks collected from:
- App manuals
- Hand-designed scenarios
- Filtered public data
Those tasks are expanded by perturbing parameters such as dates, limits, and filter values to increase coverage. The pipeline also uses object-level substitutions that remain within the same use case to diversify the training distribution.
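As a small illustration of this kind of parameter perturbation, one seed task template can be expanded into many variants; the template and values below are invented for exposition.

```python
# Expand a seed task template by varying parameters such as dates, party
# sizes, and budgets. Template and values are made up for illustration.
import itertools

seed_template = (
    "Book a table for {party_size} at an Italian restaurant on {date}, "
    "budget under {budget} yuan"
)

party_sizes = [2, 4, 6]
dates = ["2025-12-24", "2026-01-01", "next Friday"]
budgets = [200, 500]

expanded_tasks = [
    seed_template.format(party_size=p, date=d, budget=b)
    for p, d, b in itertools.product(party_sizes, dates, budgets)
]
print(len(expanded_tasks))  # 18 perturbed variants from one seed task
```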
Task trajectories are then produced by multiple agents together with human annotators executing tasks inside Android environments. A judge model evaluates the resulting trajectories, retains the longest correct prefixes, and filters out low-quality segments. Subsequent supervised training rounds use a combination of fresh human traces and high-quality model rollouts, allowing the data distribution to track the current policy over time.
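The prefix-retention step can be sketched abstractly as follows, with the judge reduced to a per-step callable; this illustrates the idea rather than the actual pipeline code.

```python
# Keep the longest correct prefix of each rollout, as judged step by step,
# and drop rollouts whose usable prefix is too short. The judge interface
# and the minimum-length threshold are assumptions.
from typing import Callable, List

def longest_correct_prefix(trajectory: List[dict],
                           step_is_correct: Callable[[dict], bool]) -> List[dict]:
    """Keep steps up to (not including) the first one the judge rejects."""
    kept = []
    for step in trajectory:
        if not step_is_correct(step):
            break
        kept.append(step)
    return kept

def filter_rollouts(rollouts, step_is_correct, min_len: int = 3):
    """Discard rollouts whose correct prefix is too short to be useful."""
    prefixes = (longest_correct_prefix(r, step_is_correct) for r in rollouts)
    return [p for p in prefixes if len(p) >= min_len]
```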
MobileWorld: 201 tasks across 20 applications
MAI-UI is evaluated on MobileWorld, a benchmark created by the same team. MobileWorld includes 201 tasks spanning 20 applications and explicitly includes three task categories:
- Pure GUI tasks (classic navigation)
- Agent–user interaction tasks (natural-language back-and-forth)
- MCP-augmented tasks (requiring tool calls)
On MobileWorld, MAI-UI reports 41.7% overall success. The report describes this as an improvement of about 20.8 points over the strongest end-to-end GUI baselines, and notes that the performance is competitive with agentic frameworks that rely on larger proprietary planners such as Gemini 3 Pro.
Online RL in containerized Android environments
Because mobile apps are dynamic—and UI layouts, state, and flows can change—MAI-UI also leans on online reinforcement learning. In the described setup, agents interact directly with containerized Android Virtual Devices (AVDs), rather than learning purely from static offline data.
The environment stack is packaged to scale: rooted AVD images and backend services are placed into Docker containers, with standard reset and step operations exposed over a service layer. The stack supports more than 35 self-hosted apps across categories including e-commerce, social, productivity, and enterprise.
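The service API is not published here; a gym-style client wrapping such reset and step endpoints might look roughly like the sketch below, where the endpoint paths and payload shapes are assumptions.

```python
# Sketch of a gym-style client for a containerized Android environment that
# exposes reset/step over HTTP. Endpoints and payloads are assumptions.
import requests

class AndroidEnvClient:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def reset(self, task_id: str) -> dict:
        """Restore the AVD to a clean state, load the task, and return the
        first observation (e.g. a screenshot reference plus UI metadata)."""
        resp = requests.post(f"{self.base_url}/reset",
                             json={"task_id": task_id}, timeout=120)
        resp.raise_for_status()
        return resp.json()

    def step(self, action: dict) -> dict:
        """Execute one agent action; return the next observation and done flag."""
        resp = requests.post(f"{self.base_url}/step",
                             json={"action": action}, timeout=60)
        resp.raise_for_status()
        return resp.json()

# Many such containers can then be driven in parallel for online RL:
# envs = [AndroidEnvClient(f"http://avd-{i}:8000") for i in range(512)]
```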
GRPO on verl, with long-horizon trajectories
The RL method is described as an asynchronous on-policy approach called GRPO (Group Relative Policy Optimization), implemented on top of verl. The system combines tensor, pipeline, and context parallelism in a Megatron-style setup, enabling learning from trajectories up to 50 steps with very long token sequences.
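At its core, GRPO replaces a learned value function with group-relative advantages: each rollout's reward is normalized against the other rollouts sampled for the same task. The sketch below shows only that normalization step; it is not the verl implementation.

```python
# Conceptual sketch of GRPO-style advantages: normalize each rollout's reward
# within its own group of rollouts for the same task prompt, with no critic.
import numpy as np

def group_relative_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """group_rewards: rewards for G rollouts sampled from the same task."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.1, 0.0, 1.0, 0.0])  # 8 rollouts of one task
print(group_relative_advantages(rewards).round(2))
```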
Rewards are generated using rule-based verifiers or model judges that detect task completion, plus penalties intended to discourage obvious looping behaviors. To keep learning stable, only recent successful trajectories are stored in task-specific buffers.
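A rough sketch of loop penalties and per-task success buffers is given below; the repetition threshold and buffer size are chosen arbitrarily for illustration and are not taken from the report.

```python
# Illustrative reward shaping and buffering: penalize obvious action loops
# and keep only the most recent successful trajectories per task.
from collections import defaultdict, deque

def loop_penalty(actions, window: int = 4, penalty: float = -0.5) -> float:
    """Penalize trajectories that repeat the same action several times in a row."""
    repeats = sum(
        1 for i in range(len(actions) - window + 1)
        if len(set(map(str, actions[i:i + window]))) == 1
    )
    return penalty if repeats > 0 else 0.0

# Task-specific buffers that hold only recent successful rollouts.
success_buffers = defaultdict(lambda: deque(maxlen=16))

def record(task_id: str, trajectory, succeeded: bool) -> None:
    if succeeded:
        success_buffers[task_id].append(trajectory)
```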
Scaling effects reported in the study
The report highlights two scaling observations from the RL environment:
- Increasing parallel GUI environments from 32 to 512 improves navigation success by about 5.2 percentage points.
- Increasing the allowed environment steps from 15 to 50 adds about 4.3 points.
AndroidWorld results: surpassing leading baselines
For online navigation evaluation, the report points to AndroidWorld, which measures task success across a standard Android app suite. On this benchmark, the largest MAI-UI variant reaches 76.7% success.
In the same reported comparison, MAI-UI surpasses UI-TARS-2, Gemini 2.5 Pro, and Seed1.8 on AndroidWorld.
Why MCP tool calls and device–cloud collaboration matter
Beyond leaderboard numbers, MAI-UI’s architecture is aimed at practical deployment constraints that commonly limit GUI agents:
- Reliability: Tool calls can reduce dependence on brittle UI sequences when an action is better handled at an API level.
- User experience: Built-in interaction actions (asking for clarification, answering questions) let the agent handle uncertainty more gracefully instead of guessing.
- Privacy and capability trade-offs: A device–cloud split can keep sensitive steps local while still accessing more powerful cloud models when necessary.
In combination, these choices frame MAI-UI not just as a model, but as a system designed to operate across long, mixed-modality workflows typical of mobile use.
Conclusion
MAI-UI positions Alibaba Tongyi Lab’s Qwen3 VL-based GUI agents as a unified approach to grounding, long-horizon mobile navigation, tool-augmented execution, and privacy-aware deployment. With reported improvements across ScreenSpot-Pro, MobileWorld, and AndroidWorld—reaching 76.7% success on AndroidWorld—the release underscores how quickly GUI agents are evolving from simple “tap predictors” into more complete interactive systems.
This article is based on reporting originally published by www.marktechpost.com.
Sources
- www.marktechpost.com
- https://arxiv.org/pdf/2512.22047
- https://github.com/Tongyi-MAI/MAI-UI