On-device AI agents hit a hard memory limit. Apple’s new architecture works around it.



On-device AI models have stayed small because the entire weight set has to live in DRAM, which holds the practical parameter count well below what server-side deployments use. Enterprise architects evaluating agent workloads have had to choose between capable models that depend on the cloud and limited models that run on the device. Apple’s third-generation foundation models, announced at WWDC26, break that restriction by moving the weights out of DRAM entirely.

The AFM 3 family was developed in collaboration with Google and spans five models: two on-device and three server-based, all running within the boundaries of Apple’s Private Cloud Compute. Server-side models, including AFM 3 Cloud Pro for agent tooling and complex reasoning, run on Nvidia GPUs on Google Cloud. The on-device architecture is Apple’s own. AFM 3 Core Advanced is a 20-billion-parameter model that stores its weights in NAND flash memory instead of DRAM.

"Instead of forcing the entire model into DRAM, the entire model is stored in flash memory," the Apple research team wrote. "Because NAND-to-DRAM bandwidth is too slow to exchange token-for-token weights, as required by standard MoE models, AFM 3 Core Advanced makes routing decisions based on the message."

How the architecture actually works

The memory wall that Apple is working with is one that all local AI developers run into.

"You can’t put 20B of parameters into RAM with reasonable accuracy," Awni Hannun, Anthropic researcher and former Apple research scientist, posted on X. "To make it work, they are using a fairly exotic architecture by today’s standards. A small model predicts from the query (or prompt) which experts to load from NAND to RAM."

That prediction and loading mechanism has three distinct components, each driven by the hardware limitations of consumer silicon.

The full set of 20B weights is in flash, not DRAM. AFM 3 Core Advanced stores its entire parameter set in NAND flash memory instead of active memory. Standard on-device implementations require the entire model to fit within DRAM, which is what limits the parameter count. Apple’s approach, which it calls Instruction Following Pruning (IFP) and which was developed by its own researchers, treats flash memory as the model’s permanent home and DRAM as a working buffer that holds only the experts a given prompt requires.
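The DRAM constraint is easy to see with rough arithmetic. The sketch below computes the weight footprint of a 20-billion-parameter model at several bit widths and compares it with a memory budget. The 8 GB budget is an illustrative assumption, not an Apple specification, and the article does not say which precision Apple uses.

```python
# Back-of-the-envelope weight footprint for a 20B-parameter model.
# The DRAM budget below is a hypothetical figure, not an Apple specification.

PARAMS = 20_000_000_000   # total parameters, as reported in the article
DRAM_BUDGET_GB = 8.0      # ASSUMPTION: memory a device could spare for weights


def footprint_gb(params: int, bits_per_weight: int) -> float:
    """Return the weight storage size in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4, 2):
    size = footprint_gb(PARAMS, bits)
    verdict = "fits" if size <= DRAM_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {size:5.1f} GB -> {verdict} in {DRAM_BUDGET_GB:.0f} GB")
```

Under that assumed budget, only the most aggressive quantization of the full model would fit. By the same formula, a 4-billion-parameter active set at 4 bits per weight needs about 2 GB, which suggests why keeping only the active experts resident is attractive.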

Expert routing occurs once per prompt, not per token. In a conventional Mixture of Experts (MoE) model, a router selects different experts for each generated token, which would require moving weights between flash memory and DRAM continuously at inference speed. NAND-to-DRAM bandwidth cannot support that. AFM 3 Core Advanced routes once per prompt, selects a fixed expert set, loads it into DRAM along with always-on shared experts, and generates all tokens from that same configuration.

"The key difference with a typical MoE is that this is done once per query and then all tokens are generated with the same experts," Hannun wrote.
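The difference between the two routing schemes can be illustrated with a toy simulation that counts flash reads. This is a minimal sketch, not Apple’s implementation: `FlashStore`, `route`, the expert counts, and the hash-style stand-in router are all hypothetical, and the always-on shared experts are left out for brevity.

```python
# Toy contrast between per-token and per-prompt expert routing.
# Every name and number here is hypothetical; this is not Apple's code.
import random

NUM_EXPERTS = 16      # ASSUMPTION: experts stored in flash
EXPERTS_PER_STEP = 2  # ASSUMPTION: experts active at any one time


class FlashStore:
    """Stands in for NAND flash and counts how many expert reads occur."""

    def __init__(self) -> None:
        self.reads = 0

    def load(self, expert_id: int) -> str:
        self.reads += 1
        return f"expert-{expert_id}-weights"


def route(seed: str) -> list[int]:
    """Stand-in router: deterministically picks experts from a string."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(NUM_EXPERTS), EXPERTS_PER_STEP))


def generate_per_token(prompt: str, n_tokens: int, flash: FlashStore) -> None:
    """Conventional MoE: the expert choice can change on every token."""
    resident: dict[int, str] = {}
    for step in range(n_tokens):
        wanted = route(f"{prompt}:{step}")
        # Keep experts that are still needed, read the rest from flash.
        resident = {e: w for e, w in resident.items() if e in wanted}
        for e in wanted:
            if e not in resident:
                resident[e] = flash.load(e)


def generate_per_prompt(prompt: str, n_tokens: int, flash: FlashStore) -> None:
    """Per-prompt routing: choose experts once, reuse them for every token."""
    resident = {e: flash.load(e) for e in route(prompt)}
    for _ in range(n_tokens):
        _ = list(resident.values())  # every token uses the same resident set


per_token, per_prompt = FlashStore(), FlashStore()
generate_per_token("summarize this email", 100, per_token)
generate_per_prompt("summarize this email", 100, per_prompt)
print("per-token flash reads:", per_token.reads)
print("per-prompt flash reads:", per_prompt.reads)
```

In the per-prompt case the number of flash reads equals the number of selected experts no matter how many tokens are generated. In the per-token case it grows with output length, which is the traffic the NAND-to-DRAM link cannot sustain.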

The active parameter count scales from 1B to 4B depending on the complexity of the task. Instead of running a fixed model size for every request, AFM 3 Core Advanced adjusts the number of parameters it activates to what the task requires: 1 billion for simpler operations and up to 4 billion for more difficult ones, all drawn from the set of 20 billion parameters in flash.
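One way to picture a variable active set is a budget that grows with a complexity score. The sketch below is purely illustrative: only the 1B and 4B bounds come from the article, while the linear interpolation, the complexity score, and the shared and per-expert sizes are assumptions. Apple has not described how it determines task complexity.

```python
# Hypothetical mapping from task complexity to an active-parameter budget.
# The 1B and 4B bounds come from the article; everything else is assumed.

MIN_ACTIVE = 1_000_000_000   # lower bound reported for simpler operations
MAX_ACTIVE = 4_000_000_000   # upper bound reported for harder operations
SHARED_PARAMS = 500_000_000  # ASSUMPTION: size of always-on shared experts
EXPERT_PARAMS = 250_000_000  # ASSUMPTION: size of one routed expert


def active_budget(complexity: float) -> int:
    """Linearly interpolate a parameter budget from a score in [0, 1]."""
    c = min(max(complexity, 0.0), 1.0)  # clamp out-of-range scores
    return int(MIN_ACTIVE + c * (MAX_ACTIVE - MIN_ACTIVE))


def routed_experts(complexity: float) -> int:
    """Count routed experts that fit in the budget after shared weights."""
    return (active_budget(complexity) - SHARED_PARAMS) // EXPERT_PARAMS


for score in (0.0, 0.5, 1.0):
    n = routed_experts(score)
    total = SHARED_PARAMS + n * EXPERT_PARAMS
    print(f"complexity {score:.1f}: {n} routed experts, {total / 1e9:.2f}B active")
```

With these assumed sizes, the simplest request loads 2 routed experts and the hardest loads 14, so the DRAM footprint and the flash read volume both scale with the task rather than with the full 20B model.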

What Apple has disclosed and what it hasn’t

The architecture document details the memory design and sparse activation mechanism. It is less forthcoming about the practical limitations of implementation.

Apple’s profiling tools expose timing data, but not the metrics that determine production viability. "Power, memory bandwidth, thermal? Not in the documents," Marco Abis, who is building Ziraph, a profiler for local AI on Apple silicon, posted on X. "A notable gap, given that they are the ones who decide most of the device’s performance."

Abis also found no statement in Apple’s documentation (in the Core AI docs, the Foundation Models docs, or the Private Cloud Compute security post) about when a request is transparently offloaded from the device to the cloud, or whether that routing is visible to the developer or user. For companies that need to document where inference is executed, that is a straightforward compliance issue.

Not all information is currently available. Apple has indicated that a full whitepaper with benchmarks will arrive later this summer.

What this means for enterprise architects

Regulated industries evaluating agent AI implementations now have a concrete architectural decision to make.

  • The DRAM wall for on-device agents has just moved. Companies evaluating agents that need to run without a round trip to the cloud now have a 20-billion-parameter on-device option to consider. The constraint shifts from model capacity to device hardware.

  • The on-device/cloud boundary is now an architectural decision, not a default. Simpler requests remain on the device; complex agent tasks are directed to AFM 3 Cloud Pro in Private Cloud Compute. Apple has not publicly specified when a request is offloaded or whether that path is visible to the developer, a gap that complicates policy decisions for organizations that need to document where inference is executed.

  • The agentic server tier depends on Google Cloud. AFM 3 Cloud Pro runs on Nvidia GPUs on Google Cloud. The Private Cloud Compute guarantee covers data privacy. It does not remove the dependency on Google Cloud for server-side inference.

AFM 3 Core Advanced gives enterprises a 20-billion-parameter on-device option that didn’t exist before WWDC26. Whether it can be deployed at scale depends on answers that Apple has not yet published. Those details are expected in the technical report due this summer.


