Unpacking AFM 3: The Architecture Behind Apple's Local Inference Updates

We just wrapped up WWDC 2026, and while the consumer-facing UI updates in iOS 27 and macOS 27 are taking up most of the oxygen, the real story is the architectural leap happening at the foundational layer. Apple just introduced the third generation of Apple Foundation Models (AFM 3), and their approach to local inference is a masterclass in hardware-software co-design.

If you care about memory bandwidth, inference optimization, or the mechanics of running massive models outside of the data center, this update is worth dissecting.

AFM 3: Overcoming the Memory Bandwidth Bottleneck

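Running a highly capable large language model locally usually means slamming into a wall of DRAM constraints: every generated token requires reading the active weights from memory, so both capacity and bandwidth cap what a device can run. You either quantize the model until it loses its reasoning capabilities, or you starve the rest of the system of memory.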
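To put a rough number on that wall, here is a back-of-envelope sketch in Python. It assumes decoding is purely memory-bound, so each token requires reading every active weight once. The function name, the 100 GB/s bandwidth figure, and the 4-bit weights are illustrative assumptions, not measurements of any Apple device.

```python
def max_tokens_per_second(bandwidth_gb_s, active_params, bits_per_weight):
    """Upper bound on decode speed when every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical device: 100 GB/s of memory bandwidth, 4-bit weights.
dense_20b = max_tokens_per_second(100, 20e9, 4)  # all 20B weights read per token
small_3b = max_tokens_per_second(100, 3e9, 4)    # only 3B weights read per token
print(f"{dense_20b:.1f} vs {small_3b:.1f} tokens/s")
```

The real ceiling is lower once compute, KV-cache reads, and storage transfers are counted, but the ratio shows why reading fewer weights per token matters so much.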

Apple’s engineering teams have delivered an elegant solution to this with AFM 3 Core Advanced. Instead of brute-forcing a dense architecture, they’ve deployed a 20-billion-parameter sparse Mixture-of-Experts (MoE) model. Depending on the complexity of the prompt, the model activates only 1 to 4 billion of those parameters during inference.
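If sparse activation is new to you, the general idea can be sketched in a few lines of Python. This is a generic top-k MoE router, not Apple's implementation: the function names, the expert counts, and the parameter sizes are all hypothetical, and AFM 3's actual routing rule is not described in the announcement.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def active_params(shared_params, params_per_expert, k):
    """Parameters touched for one token: always-on shared weights plus k routed experts."""
    return shared_params + k * params_per_expert

# Toy example: four experts, two active per token (all sizes made up).
chosen = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
print(chosen)
print(active_params(shared_params=100, params_per_expert=10, k=2))
```

Only the experts returned by the router run for a given token, which is how a model with a large total parameter count can keep its per-token work small.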

But here is where the architecture team truly deserves a standing ovation: streaming weights from NAND flash. Instead of pre-loading the entire 20B footprint into unified memory, the OS keeps the full model stored in flash memory. By utilizing “shared experts” that stay active in DRAM, and dynamically swapping “routed experts” from NAND to RAM exactly when the router dictates, they’ve mitigated the latency hit usually associated with flash storage speeds. Managing that memory bandwidth without crippling the tokens-per-second output is an astonishing engineering achievement. Hats off to the teams who made that pipeline a reality.
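The shape of that idea, pinned shared experts plus routed experts pulled in on demand, can be sketched as a small cache in Python. To be clear about the assumptions: the class name, the `load_from_flash` callback, and the least-recently-used eviction policy are my own stand-ins for illustration. How Apple actually decides which experts stay resident has not been described here.

```python
from collections import OrderedDict

class ExpertCache:
    """Toy model of expert residency: shared experts pinned, routed experts cached."""

    def __init__(self, load_from_flash, shared_ids, capacity):
        self._load = load_from_flash
        # Shared experts are loaded once and never evicted.
        self._shared = {i: load_from_flash(i) for i in shared_ids}
        self._routed = OrderedDict()  # least recently used entry first
        self._capacity = capacity
        self.flash_reads = 0  # counts routed-expert loads only

    def get(self, expert_id):
        if expert_id in self._shared:
            return self._shared[expert_id]
        if expert_id in self._routed:
            self._routed.move_to_end(expert_id)  # mark as recently used
            return self._routed[expert_id]
        weights = self._load(expert_id)  # simulated flash read
        self.flash_reads += 1
        self._routed[expert_id] = weights
        if len(self._routed) > self._capacity:
            self._routed.popitem(last=False)  # evict the least recently used
        return weights

    def resident(self):
        return list(self._routed)

# Simulated flash: "weights" are just labeled strings.
cache = ExpertCache(lambda i: f"weights-{i}", shared_ids=["shared"], capacity=2)
for expert in [1, 2, 1, 3, 2]:
    cache.get(expert)
print(cache.flash_reads, cache.resident())
```

Every cache miss stands in for a flash read, so the interesting engineering question is how often the router asks for an expert that is not already resident.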

The “Core AI” Framework: Strategic Local Deployment

For developers looking to integrate customized inference into their own applications, the new Core AI framework provides a direct, highly optimized pipeline to the Neural Engine and Apple Silicon’s unified memory.

This isn’t just a wrapper; it’s a foundational shift in how we handle localized deployment. You can now deploy full-scale LLMs natively, bypassing network latency and external API dependencies entirely. Furthermore, Core AI introduces dynamic profiling, allowing developers to adjust how models interact with application states at runtime. For those of us building systems that require high privacy, inference with no network round trip, and adaptive behaviors, this framework provides the primitives we’ve been waiting for.

Agentic Workflows Natively in Xcode 27

Finally, the introduction of local AI coding agents in Xcode 27 is a fascinating look at the future of development environments.

Rather than relying on basic autocomplete, Xcode 27 implements a true agentic workflow. The local models run a reasoning-action-observation loop directly within your IDE. You can engage in multi-turn architectural planning, and the agent can preview structural code changes in a markdown canvas alongside your project. Because the inference is local, the agent can draw on your entire local repository for context without exposing proprietary code to external servers.
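For anyone who hasn't seen a reasoning-action-observation loop spelled out, here is a minimal generic sketch in Python. It is not Xcode's agent: `run_agent`, the scripted policy standing in for the model, and the `count_lines` tool are all hypothetical names invented for this illustration.

```python
def run_agent(policy, tools, task, max_steps=5):
    """Loop: ask the policy for a step, run the chosen tool, feed the result back."""
    history = [("task", task)]
    for _ in range(max_steps):
        step = policy(history)  # "reasoning": decide what to do next
        if step["type"] == "finish":
            return step["answer"], history
        history.append(("action", (step["tool"], step["arg"])))
        observation = tools[step["tool"]](step["arg"])  # "action"
        history.append(("observation", observation))    # "observation"
    return None, history  # gave up after max_steps

def scripted_policy(history):
    """Stand-in for a model: inspect once, then answer from what was observed."""
    observations = [item for kind, item in history if kind == "observation"]
    if not observations:
        return {"type": "act", "tool": "count_lines", "arg": "a\nb\nc"}
    return {"type": "finish", "answer": f"file has {observations[-1]} lines"}

tools = {"count_lines": lambda text: len(text.splitlines())}
answer, history = run_agent(scripted_policy, tools, "How long is the file?")
print(answer)
```

In a real agent the policy is a model call and the tools read files, run builds, or apply edits, but the control flow keeps this same shape.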

Time to Dig Deeper

The architecture is impressive, but the most exciting part is what we can actually build with it. If you want to understand how these localized models work without getting lost in memory bandwidth constraints and NAND flash routing right away, I highly recommend checking out the introductory materials from this year’s WWDC.

A fantastic starting point is the Meet Core AI session, which gives a clear, high-level overview of the new framework and how it abstracts the complexity of model execution. For those of you who are already comfortable experimenting with training and fine-tuning, Apple has also rolled out new Core AI PyTorch extensions to help bridge your current workflows seamlessly into this new ecosystem.

It’s an incredibly accessible time to start experimenting with local inference. The hardware and software teams have given us a powerful new set of primitives, and now it’s our turn to see what we can create with them.
