Batching is the primary lever ML infrastructure engineers reach for when GPU utilization numbers look disappointing. The logic is straightforward: larger batches mean larger matrix multiplies, larger matrix multiplies mean better Tensor Core utilization, better utilization means lower cost per inference. On paper, this is correct. In production, the lever has a hard stop, and that stop is determined by your latency SLA, not your hardware.
Understanding where that stop is — and why the hardware behaves the way it does on either side of it — is essential for making honest infrastructure tradeoffs.
The Latency-Throughput Tradeoff Is Not Symmetric
Take a 7B parameter transformer model running on an A100 80GB. At batch size 1, you get roughly 1,400 tokens/second and 12ms first-token latency. At batch size 32, you get roughly 18,000 tokens/second and 85ms first-token latency. The throughput improvement is real — about 13x. The latency cost is real too — about 7x. Both numbers matter to someone.
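The asymmetry is easier to see when the same figures are expressed per request. The sketch below uses only the numbers quoted above; the per-request rate assumes aggregate throughput is shared evenly across the batch, which is a simplification rather than a measured property.

```python
# Figures quoted in the text for a 7B model on an A100 80GB.
points = {
    1: {"tokens_per_s": 1_400, "first_token_ms": 12},
    32: {"tokens_per_s": 18_000, "first_token_ms": 85},
}

def ratios(small, large):
    """Compare two batch sizes: aggregate gain, latency cost, per-request rate."""
    a, b = points[small], points[large]
    return {
        "throughput_gain": b["tokens_per_s"] / a["tokens_per_s"],
        "latency_cost": b["first_token_ms"] / a["first_token_ms"],
        # Assumption: aggregate tokens/s is split evenly across the batch.
        "per_request_tokens_per_s": b["tokens_per_s"] / large,
    }

r = ratios(1, 32)
print(f"throughput gain: {r['throughput_gain']:.1f}x")
print(f"latency cost: {r['latency_cost']:.1f}x")
print(f"per-request rate at batch 32: {r['per_request_tokens_per_s']:.1f} tokens/s")
```

Under that even-split assumption, each request at batch 32 sees a lower token rate than the single request at batch 1, even though the GPU as a whole is doing far more work.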
The problem is that for real-time serving applications — chat interfaces, API services, document analysis pipelines — first-token latency is often the binding constraint. A P99 first-token requirement of 50ms puts your maximum batch size at roughly 12–16 for that model on that hardware, regardless of what the hardware could achieve at batch 128 in a controlled benchmark.
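In practice the batch ceiling is read off a latency profile rather than derived. A minimal sketch of that lookup follows; the `profile` values are hypothetical placeholders for illustration, not measurements from the text, and `max_batch_under_sla` is an illustrative helper name.

```python
def max_batch_under_sla(p99_latency_ms_by_batch, sla_ms):
    """Return the largest profiled batch size whose P99 latency meets the SLA.

    Returns None if no profiled batch size fits.
    """
    feasible = [b for b, ms in p99_latency_ms_by_batch.items() if ms <= sla_ms]
    return max(feasible) if feasible else None

# Hypothetical P99 first-token latencies in ms, keyed by batch size.
profile = {1: 14, 4: 22, 8: 33, 12: 44, 16: 52, 24: 71, 32: 96}

ceiling = max_batch_under_sla(profile, sla_ms=50)
print(ceiling)  # 12
```

The point of the sketch is that the ceiling is a property of the SLA and the measured curve; peak hardware throughput never enters the calculation.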
When batch size is capped at 12–16 by the latency SLA, the A100's 312 TFLOPS FP16 throughput is operating in a regime where arithmetic intensity is too low to saturate the compute. You're memory-bandwidth bound at that batch size, not compute-bound. The teraflops you're paying for in the hardware purchase are architecturally unreachable at the batch sizes your workload requires.
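A rough roofline estimate makes the regime concrete. The sketch below assumes an HBM bandwidth of about 2,039 GB/s for the A100 80GB SXM part and a 4096×4096 FP16 weight matrix as a stand-in for one linear layer of a 7B model; both are assumptions brought in for illustration, and the model ignores caching and kernel overheads.

```python
PEAK_FP16_FLOPS = 312e12    # dense FP16 Tensor Core peak quoted in the text
HBM_BYTES_PER_S = 2.039e12  # assumption: A100 80GB SXM memory bandwidth

def ridge_point():
    """Arithmetic intensity (FLOPs/byte) needed to become compute-bound."""
    return PEAK_FP16_FLOPS / HBM_BYTES_PER_S

def gemm_intensity(batch_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for a (batch_tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * batch_tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (
        d_in * d_out               # weights, read once
        + batch_tokens * d_in      # input activations
        + batch_tokens * d_out     # output activations
    )
    return flops / bytes_moved

def attainable_fraction(batch_tokens, d_in=4096, d_out=4096):
    """Upper bound on the fraction of peak compute reachable at this batch."""
    return min(1.0, gemm_intensity(batch_tokens, d_in, d_out) / ridge_point())

for b in (1, 16, 128):
    print(b, round(gemm_intensity(b, 4096, 4096), 1), f"{attainable_fraction(b):.0%}")
```

Because the weights dominate the bytes moved, arithmetic intensity is close to the batch size itself, so a batch of 16 sits roughly an order of magnitude below the ridge point under these assumptions.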
SM Occupancy at Low Batch Sizes: The Numbers
An A100 has 108 Streaming Multiprocessors. Each SM can concurrently run up to 2,048 threads across 64 warps. For a batch-1 transformer forward pass, the total thread count needed for the matrix multiply kernels across all layers is orders of magnitude smaller than the hardware's thread capacity. The result: most SMs sit idle waiting for work.
NVIDIA's Nsight Compute profiler will show you SM occupancy broken down by kernel. For a well-optimized batch-1 inference workload, theoretical occupancy might be 85% per active SM, but active SM count might be 20–30% of the total pool for most of the forward pass. The effective utilization — what matters for throughput-per-dollar — is the product of those two numbers.
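Multiplying the two figures through, using the ranges quoted above (the function name is illustrative):

```python
def effective_utilization(occupancy_per_active_sm, active_sm_fraction):
    """Effective utilization = per-SM occupancy x fraction of SMs doing work."""
    return occupancy_per_active_sm * active_sm_fraction

low = effective_utilization(0.85, 0.20)
high = effective_utilization(0.85, 0.30)
print(f"{low:.1%} to {high:.1%}")  # 17.0% to 25.5%
```

An 85% occupancy figure read in isolation therefore corresponds to roughly a sixth to a quarter of the chip doing useful work.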
Continuous batching, as implemented in systems like vLLM, addresses part of this by multiplexing tokens from multiple concurrent requests across a single forward pass. Instead of batching at the request level, you batch at the token level — a request that has generated 200 tokens contributes to the same matrix multiply as a request that just arrived. This meaningfully improves effective batch size for generation workloads, and it's the right technique to deploy. But it operates within the CUDA execution model — the SM utilization ceiling it can reach is still determined by the hardware's compute-to-memory-bandwidth ratio at the effective token batch size.
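The token-level scheduling idea can be sketched as a toy simulation. This is not vLLM's scheduler or API: `Request`, `run_continuous_batching`, and `max_batch_tokens` are hypothetical names, prefill is ignored, and every running request is assumed to contribute exactly one token per forward pass.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    arrival_step: int
    tokens_to_generate: int
    generated: int = 0

def run_continuous_batching(requests, max_batch_tokens):
    """Return, for each forward pass, the ids of the requests sharing it."""
    pending = deque(sorted(requests, key=lambda r: r.arrival_step))
    running, schedule, step = [], [], 0
    while pending or running:
        # Admit arrived requests as soon as a slot is free, without waiting
        # for the requests already in flight to finish.
        while (pending and pending[0].arrival_step <= step
               and len(running) < max_batch_tokens):
            running.append(pending.popleft())
        schedule.append([r.rid for r in running])
        for r in running:
            r.generated += 1  # one decode token per request per pass
        # Finished requests leave immediately, freeing their slot.
        running = [r for r in running if r.generated < r.tokens_to_generate]
        step += 1
    return schedule

schedule = run_continuous_batching(
    [Request("A", 0, 4), Request("B", 1, 2), Request("C", 2, 3)],
    max_batch_tokens=2,
)
for step, ids in enumerate(schedule):
    print(step, ids)
```

In this run, request C takes over B's slot on the pass after B finishes, while A is still generating. With request-level batching, C would have waited for the whole batch to drain.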
A Worked Scenario: Embedding Service at Scale
Consider a production embedding pipeline — a BERT-large style encoder, 340M parameters, running at 8,000 requests per second with a 30ms P95 latency requirement. This is a common shape: content moderation, retrieval augmentation, fraud signal generation.
At 30ms latency and BERT-large's typical throughput curve on an A100, the maximum batch size for consistent P95 is around 48–64. At batch 48, the A100 is at approximately 45% SM occupancy. The remaining 55% of compute capacity is idle. Two A100s handle this workload with room to spare.
Now the team projects traffic growth to 20,000 req/s over 18 months. Linear scaling says 5 A100s. The power budget for 5 A100 SXM4s at inference load is roughly 1.8 kW. Annual power cost at $0.08/kWh and PUE 1.4: ~$1.8K (1.8 kW × 1.4 × 8,760 h × $0.08/kWh). Hardware amortization adds another $300–350K/year. Total annual run rate: roughly $300–350K, almost all of it hardware, for a workload that fits in roughly 45% of each GPU's compute capacity.
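A minimal sketch of the scaling and power arithmetic, using only the inputs stated above (two GPUs at 8,000 req/s, a 1.8 kW load, PUE 1.4, $0.08/kWh). The function names are illustrative, and the model assumes the load is constant all year.

```python
import math

HOURS_PER_YEAR = 24 * 365

def gpus_needed(current_gpus, current_rps, target_rps):
    """Linear scaling of GPU count with request rate, rounded up."""
    return math.ceil(current_gpus * target_rps / current_rps)

def annual_power_cost_usd(it_load_kw, pue, usd_per_kwh):
    """Facility energy cost: IT load scaled by PUE, run for a full year."""
    return it_load_kw * pue * HOURS_PER_YEAR * usd_per_kwh

gpu_count = gpus_needed(current_gpus=2, current_rps=8_000, target_rps=20_000)
power_cost = annual_power_cost_usd(it_load_kw=1.8, pue=1.4, usd_per_kwh=0.08)
print(gpu_count, f"${power_cost:,.0f} per year")
```

Changing any one input (electricity price, PUE, per-GPU draw) scales the power line linearly, which makes it easy to see how small that line is relative to hardware amortization.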
The mismatch is structural. The workload's batch size ceiling is set by the latency SLA. The latency SLA is set by the application. The application isn't going to relax its latency target so the GPU vendor can improve utilization numbers.
Where Operator Fusion Helps (and Where It Doesn't)
One real optimization for GPU inference is operator fusion — combining adjacent operations in the compute graph into a single kernel to reduce memory round-trips. Layer normalization followed by attention projection, or GELU activation followed by the next linear layer, can often be fused so that intermediate activations stay in SRAM or registers instead of being written to HBM and re-read.
FlashAttention is the most widely known example. By fusing the attention score computation, softmax, and value aggregation into a single tiled kernel that keeps each tile's Q, K, and V blocks in SRAM and never writes the full attention score matrix to HBM, FlashAttention eliminates several HBM round-trips and substantially reduces memory bandwidth consumption for the attention block, particularly at longer sequence lengths where the unfused algorithm was bandwidth-limited.
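The trick that makes a tiled softmax possible is a running maximum and running sum that are rescaled as each new tile arrives. The sketch below shows that rescaling for a single query vector in plain Python. It illustrates the online-softmax idea only, not FlashAttention's actual kernel, and the `tile` parameter stands in for whatever fits in on-chip memory.

```python
import math

def naive_attention(q, K, V):
    """Reference: softmax(q.K^T / sqrt(d)) . V with all scores held at once."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]

def tiled_attention(q, K, V, tile=2):
    """Same result, visiting K and V one tile at a time."""
    d = len(q)
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), tile):
        k_tile, v_tile = K[start:start + tile], V[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, K, V))
print(tiled_attention(q, K, V, tile=2))
```

Only one tile of scores exists at any time in `tiled_attention`, which is why the full score matrix never needs to be stored.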
Operator fusion is a genuine improvement, and any production inference stack should be using it. The important constraint: operator fusion works within the limitations of the GPU's SRAM budget per SM. On an A100, each SM has 192 KB of combined L1 cache and shared memory, of which up to 164 KB can be configured as shared memory. A 13B model's attention working set at sequence length 512 with 40 attention heads exceeds that budget many times over. FlashAttention tiles specifically to manage within this constraint, but the tiling is a workaround for a memory hierarchy that wasn't designed for inference, not an escape from it.
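A back-of-envelope check of that claim, assuming a head dimension of 128 and FP16 keys and values. The head dimension is an assumption typical of 13B-class models and is not stated in the text; the 192 KB figure is the per-SM budget quoted above.

```python
SRAM_PER_SM_BYTES = 192 * 1024  # per-SM on-chip budget quoted in the text

def kv_bytes_per_head(seq_len, head_dim, bytes_per_elem=2):
    """Bytes for one head's K and V matrices (two tensors of seq_len x head_dim)."""
    return 2 * seq_len * head_dim * bytes_per_elem

def kv_bytes_per_layer(seq_len, num_heads, head_dim, bytes_per_elem=2):
    """Bytes for K and V across all heads of one layer."""
    return num_heads * kv_bytes_per_head(seq_len, head_dim, bytes_per_elem)

# Assumption: head_dim=128; seq_len and num_heads come from the text.
per_head = kv_bytes_per_head(seq_len=512, head_dim=128)
per_layer = kv_bytes_per_layer(seq_len=512, num_heads=40, head_dim=128)

print(f"per head:  {per_head / 1024:.0f} KB ({per_head / SRAM_PER_SM_BYTES:.2f}x budget)")
print(f"per layer: {per_layer / 2**20:.0f} MB ({per_layer / SRAM_PER_SM_BYTES:.1f}x budget)")
```

Under these assumptions even a single head's K and V do not fit in one SM's on-chip memory, and a full layer overshoots it by a factor of about fifty.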
INT8 and FP8 Quantization: Real Gains, Real Limits
Quantization from FP16 to INT8 halves weight storage and memory bandwidth requirements, and on hardware with dedicated INT8 compute paths (Tensor Cores on Ampere/Hopper support INT8 at 2x the FP16 rate), it improves both memory-bound and compute-bound throughput. For a 7B FP16 model that was memory-bandwidth-bound at small batches, INT8 quantization can improve tokens/second by 1.5–1.8x with acceptable accuracy loss on many workloads.
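For readers who have not seen it written out, the simplest form of INT8 weight quantization is symmetric per-tensor scaling. The sketch below is illustrative only. Production stacks typically use per-channel or per-group scales plus calibration, and the function names here are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: returns (int8 values, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate floating-point weights."""
    return [v * scale for v in q]

def weight_bytes(num_params, bytes_per_param):
    """Storage for the weights alone, ignoring scales and activations."""
    return num_params * bytes_per_param

weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)

fp16_gb = weight_bytes(7_000_000_000, 2) / 1e9
int8_gb = weight_bytes(7_000_000_000, 1) / 1e9
print(fp16_gb, int8_gb)  # 14.0 7.0
```

The halved footprint is what drives the bandwidth saving for memory-bound batches. The rounding error per weight is bounded by half the scale, which is why outlier-heavy tensors, where the scale is large, are the ones that tend to lose accuracy.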
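FP8 quantization (available on H100 Hopper with the new FP8 tensor core instructions) takes a different route to the same footprint: 8 bits per weight, so 2x less memory than FP16 (the same as INT8), and Tensor Core throughput at 2x the FP16 rate (on par with INT8), while keeping a floating-point format with more dynamic range than INT8. For models where FP8 accuracy holds, which varies significantly by architecture and task, this is meaningful.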
We're not saying quantization is insufficient — for many teams it's the right call and shouldn't be skipped. We're saying it shifts the utilization ceiling rather than removing the fundamental batch-size constraint. A quantized model at INT8 still hits the same latency-SLA-imposed batch ceiling; it's just more efficient within that ceiling. The 45% SM occupancy number we cited for the embedding workload improves somewhat, but the structural dynamic doesn't change.
What Changes with Purpose-Built Inference Hardware
The batching efficiency problem on GPUs is a consequence of architecture, not software configuration. The design decisions that lead to low utilization at small batches — large warp size requiring many parallel threads to amortize scheduling overhead, memory hierarchy sized for training arithmetic intensity, no fixed-function attention path — are correct for the GPU's intended workload and wrong for inference.
An accelerator designed around the inference workload profile makes the opposite tradeoffs: narrower but deeper execution paths sized for the target batch range, on-chip memory allocated to the specific working sets of production inference patterns, and fixed-function datapaths for the attention operator that eliminate the tiling workaround entirely. The result is that the hardware operates at high utilization at the batch sizes the application actually requires — not at batch sizes that only occur in benchmarks.
The batching efficiency problem isn't solved by running GPU clusters harder. It's solved by matching hardware architecture to the workload's actual compute pattern. Those are different problems with different solutions.