There is a distinct lack of articles on the internet explaining why we don't get the expected performance from NPUs after model deployment. What exists is either research papers or marketing fluff.
This document serves as a middle ground, and explains the hardware underneath in a simplified way, to help diagnose bottlenecks faced during model inference on edge hardware.
Most graph fragmentation problems are caused by a small set of well-defined hardware constraints, and can be avoided or bypassed.
Hence, we will focus on the categories of operation the silicon refuses, why it refuses them, and where in our model-export pipeline those refusals originate.
An NPU is a chip block designed to run neural network math, mostly matrix multiplies and a small set of element-wise ops, at a very high arithmetic/watt rate but at the cost of losing generality. Unlike a CPU or GPU, which fetch and decode instructions every cycle, an NPU is configured once per subgraph. A specialized compiler emits a binary that wires the chip’s compute blocks (matrix unit, vector unit, on-chip SRAM, DMA engines) into a fixed dataflow for your specific computation, and then the data streams through it with almost no per-cycle control overhead. The cost of this efficiency, as stated earlier, is that anything outside the supported set falls off a cliff.
The architectural family is well described in the recent paper Scaling LLM Test-Time Compute with Mobile NPU on Smartphones (Hao et al., EuroSys ‘26), which calls out the standard pattern:

Qualcomm Hexagon will be used repeatedly as the concrete example in this document (it’s the most widely deployed mobile NPU and has the most accessible SDK).

Refer to Figure 3 of the paper linked above
We first need a working picture of what happens when we hand a model to an NPU.
When we point a tool like QAIRT (for Qualcomm), Vitis AI (for AMD XDNA), or CoreML’s compiler at our model, it doesn’t produce a stream of instructions in the CPU/GPU sense. Instead, it produces a configuration binary: a description of how to wire the chip’s compute blocks together for one specific computation. This is the single most important conceptual difference from a GPU compiler.

The compiler constructs a computation graph from the given source (a .tflite flatbuffer, an ONNX model, etc.), then walks its nodes. For every node, it checks the two gates mentioned in the introduction: "do I have a builder for this op?" and "does this specific config validate?" The result is a boolean mask over the graph.
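The two-gate check can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `BUILDERS` table, and the specific validation rules are hypothetical and do not correspond to any real SDK's API or constraints.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    op: str
    attrs: dict = field(default_factory=dict)


# Hypothetical builder table: op type -> config validator.
# Gate 1 is "is the op in this table?"; gate 2 is "does the validator pass?".
# The rules below are made up for illustration only.
BUILDERS = {
    "MatMul": lambda attrs: True,
    "Relu": lambda attrs: True,
    "Conv2D": lambda attrs: attrs.get("dilation", 1) == 1,
}


def acceptance_mask(nodes):
    """Return one boolean per node: True if the NPU path accepts it."""
    mask = []
    for node in nodes:
        validate = BUILDERS.get(node.op)           # gate 1: builder exists?
        accepted = validate is not None and bool(validate(node.attrs))  # gate 2
        mask.append(accepted)
    return mask


graph = [
    Node("fc1", "MatMul"),
    Node("conv1", "Conv2D", {"dilation": 2}),  # builder exists, config rejected
    Node("act1", "Relu"),
    Node("topk", "TopK"),                      # no builder at all
]
mask = acceptance_mask(graph)
```

Note that `conv1` and `topk` are rejected for different reasons: the first fails the config gate, the second fails the builder gate. Both end up as `False` in the mask, which is why fallback logs that only report "unsupported" can be hard to act on.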
Then contiguous runs of accepted ops get extracted as subgraphs. Everything else stays in the host graph as fallback ops. Each accepted subgraph becomes its own independent compile target. Within a subgraph, the compiler finds sequences it can wire together without intermediate memory round trips, for example matmul -> bias -> activation -> layernorm. The matrix unit’s output can feed directly into the vector unit’s input, the vector unit chains its own stages, and the whole sequence becomes a single configuration rather than four.
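A minimal sketch of the partitioning step follows, assuming the graph has already been flattened into a topologically ordered list and that an accept/reject mask is available. Real compilers partition a DAG rather than a flat list, so this is a simplification; the op names and the `partition` helper are hypothetical.

```python
from itertools import groupby


def partition(ops, mask):
    """Group consecutive ops with the same accept/reject flag.

    Returns a list of (on_npu, [op, ...]) segments. Each on_npu=True segment
    would become its own compile target; the rest stay on the host.
    """
    segments = []
    for accepted, group in groupby(zip(ops, mask), key=lambda pair: pair[1]):
        segments.append((accepted, [op for op, _ in group]))
    return segments


ops = ["matmul", "bias", "relu", "custom_topk", "matmul", "softmax"]
mask = [True, True, True, False, True, True]
segments = partition(ops, mask)

# Every boundary between segments is a host <-> NPU handoff at runtime.
handoffs = len(segments) - 1
```

In this example a single rejected op in the middle splits one potential subgraph into two, with two handoffs between the NPU and the host. That is the fragmentation effect in miniature.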
The on-chip SRAM is quite small (single-digit megabytes on Hexagon’s VTCM, similar elsewhere*), whereas a 4096x4096 FP16 weight tensor is 32 MB on its own, which is too large to fit. Hence the compiler has to slice the operation into tiles that fit in the scratchpad alongside their corresponding activation slices, schedule DMAs to bring the next tile in while the current tile computes, and decide which intermediates stay on-chip versus spilling to DDR.
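The arithmetic behind the tiling decision can be worked through in a short sketch. The 2 MiB per-tile weight budget is an assumption chosen for illustration (it is not a documented Hexagon figure), and restricting tiles to power-of-two squares is a simplification; real compilers choose tile shapes based on hardware layout rules.

```python
import math

MIB = 1024 * 1024


def weight_bytes(rows, cols, bytes_per_elem=2):
    """Size of a dense weight tensor; FP16 uses 2 bytes per element."""
    return rows * cols * bytes_per_elem


def largest_square_tile(budget_bytes, bytes_per_elem=2):
    """Largest power-of-two side length whose square tile fits the budget."""
    side = 1
    while (side * 2) ** 2 * bytes_per_elem <= budget_bytes:
        side *= 2
    return side


def tile_count(rows, cols, side):
    """Number of side x side tiles needed to cover the tensor."""
    return math.ceil(rows / side) * math.ceil(cols / side)


total = weight_bytes(4096, 4096)        # 33_554_432 bytes = 32 MiB
side = largest_square_tile(2 * MIB)     # assumed per-tile weight budget
n_tiles = tile_count(4096, 4096, side)
```

Under these assumptions the weight tensor is split into 1024x1024 tiles, 16 in total, and each one has to be brought in by DMA. Halving the budget (for example, to leave room for double buffering) forces a smaller tile side and therefore more tiles and more DMA transfers.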
Finally, the compiler writes out the binary, which contains the wiring map for the chip’s compute fabric, the DMA schedule, the scalar controller’s sequencing program, weights pre-arranged into hardware-native layouts (because the systolic array consumes data in a specific order; the EuroSys paper has a nice illustration of Hexagon’s “every two rows are permuted” tile format), and the quantization parameters needed for runtime scaling.