
Despite having just 3 billion parameters, Ferret-UI Lite matches or surpasses the benchmark performance of models up to 24 times larger. Here are the details.
A bit of background on Ferret
In December 2023, a team of 9 researchers published a study called “FERRET: Refer and Ground Anything Anywhere at Any Granularity”. In it, they presented a multimodal large language model (MLLM) that was capable of understanding natural language references to specific parts of an image:

Since then, Apple has published a series of follow-up papers expanding the Ferret family of models, including Ferret-v2, Ferret-UI, and Ferret-UI 2.
The Ferret-UI variants, specifically, expanded on the original capabilities of FERRET and were trained to overcome what the researchers identified as a shortcoming of general-domain MLLMs.
From the original Ferret-UI paper:
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate “any resolution” on top of Ferret to magnify details and leverage enhanced visual features.

A few days ago, Apple expanded the Ferret-UI family of models even further with a study called “Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents.”
Ferret-UI was built on a 13B-parameter model and focused primarily on mobile UI understanding and fixed-resolution screenshots. Ferret-UI 2, in turn, expanded the system to support multiple platforms and higher-resolution perception.
By contrast, Ferret-UI Lite is a much more lightweight model, designed to run on-device, while remaining competitive with significantly larger GUI agents.
Ferret-UI Lite
According to the researchers of the new paper, “the majority of existing methods of GUI agents […] focus on large foundation models.” That is because “the strong reasoning and planning capabilities of large server-side models allow these agentic systems to achieve impressive capabilities in diverse GUI navigation tasks.”
They note that while there has been a lot of progress on both multi-agent and end-to-end GUI systems, which take different approaches to streamlining the many tasks involved in agentic interaction with GUIs (“low-level GUI grounding, screen understanding, multi-step planning, and self-reflection”), these systems are simply too large and compute-hungry to run well on-device.
So, they set out to develop Ferret-UI Lite, a 3-billion-parameter variant of Ferret-UI, which “is built with several key components, guided by insights on training small-scale” language models.
Ferret-UI Lite leverages:
- Real and synthetic training data from multiple GUI domains;
- On-the-fly (that is, inference-time) cropping and zooming-in techniques to better understand specific segments of the GUI;
- Supervised fine-tuning and reinforcement learning techniques.
The result is a model that closely matches or even outperforms competing GUI agent models with up to 24 times as many parameters.

While the entire architecture (which is thoroughly detailed in the study) is interesting, the real-time cropping and zooming-in techniques are particularly noteworthy.
The model makes an initial prediction, crops around it, then re-predicts on that cropped region. This helps such a small model compensate for its limited capacity to process large numbers of image tokens.
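To make that loop concrete, here is a minimal sketch of how a two-pass, crop-and-re-predict grounding step could work. The `model.predict` call, its coordinate output, and the crop size are illustrative assumptions, not Apple’s actual API or the paper’s exact procedure.

```python
from PIL import Image


def ground_with_zoom(model, screenshot: Image.Image, instruction: str,
                     crop_size: int = 768):
    """Illustrative two-pass grounding: predict, crop around the guess, re-predict.
    `model.predict` is a hypothetical call returning (x, y) screen coordinates."""
    # Pass 1: coarse prediction on the full screenshot.
    x, y = model.predict(screenshot, instruction)

    # Crop a window centered on the first guess, clamped to the screen bounds.
    w, h = screenshot.size
    left = max(0, min(int(x) - crop_size // 2, w - crop_size))
    top = max(0, min(int(y) - crop_size // 2, h - crop_size))
    crop = screenshot.crop((left, top, left + crop_size, top + crop_size))

    # Pass 2: refined prediction on the zoomed-in region, mapped back to
    # full-screen coordinates.
    rx, ry = model.predict(crop, instruction)
    return left + rx, top + ry
```

The idea is that the second pass works with far fewer, far more relevant pixels, which is how a small model can stay accurate on fine-grained targets without processing huge numbers of image tokens.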

Another notable contribution of the paper is how Ferret-UI Lite basically generates its own training data. The researchers built a multi-agent system that interacts directly with live GUI platforms to produce synthetic training examples at scale.
A curriculum task generator proposes goals of increasing difficulty, a planning agent breaks them into steps, a grounding agent executes them on-screen, and a critic model evaluates the results.

With this pipeline, the training system captures the fuzziness of real-world interaction (such as errors, unexpected states, and recovery strategies), something that would be much harder to achieve with clean, human-annotated data alone.
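As a rough sketch (not the paper’s actual implementation), the generate-plan-execute-critique loop described above could be wired together roughly like this; every agent and environment interface here is a hypothetical placeholder:

```python
def generate_trajectories(env, task_gen, planner, grounder, critic,
                          difficulty_levels=range(1, 6), episodes_per_level=100):
    """Illustrative curriculum loop: propose a goal, plan it, execute it on a
    live GUI environment, and keep only trajectories the critic accepts."""
    dataset = []
    for level in difficulty_levels:
        for _ in range(episodes_per_level):
            goal = task_gen.propose(level)                # curriculum task generator
            steps = planner.plan(goal, env.screenshot())  # break the goal into steps
            trajectory = []
            for step in steps:
                action = grounder.ground(step, env.screenshot())  # locate + act on-screen
                observation = env.execute(action)
                trajectory.append((step, action, observation))
            if critic.evaluate(goal, trajectory):         # keep successful rollouts
                dataset.append({"goal": goal, "trajectory": trajectory})
    return dataset
```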
Interestingly, while Ferret-UI and Ferret-UI 2 used iPhone screenshots and other Apple interfaces in their evaluations, Ferret-UI Lite was trained and evaluated on Android, web, and desktop GUI environments, using benchmarks like AndroidWorld and OSWorld.
The researchers don’t note explicitly why they chose this route for Ferret-UI Lite, but it likely reflects where reproducible, large-scale GUI-agent testbeds are available today.
Be that as it may, the researchers found that while Ferret-UI Lite performed well on short-horizon, low-level tasks, it did not perform as strongly on more complicated, multi-step interactions, a trade-off that is largely expected given the constraints of a small, on-device model.
On the other hand, Ferret-UI Lite offers a local, and by extension private, agent (no data needs to leave the device for processing on remote servers) that autonomously interacts with app interfaces based on user requests, which, by all accounts, is pretty cool.
To learn more about the study, including benchmark breakdowns and results, follow this link.