Throughput-per-watt is the metric that matters most for production inference infrastructure economics, and it is also the metric most frequently cited under conditions that make it useless for actual procurement decisions. Published numbers from hardware vendors — and from accelerator startups including this one — are almost universally measured at peak throughput, fixed batch size, optimal sequence length, and full TDP power. None of those conditions describe a production system.
This article walks through the calculation methodology for measuring actual throughput-per-watt under your specific workload configuration. The goal is a number you can use to compare hardware options honestly, including against each other and against our own published figures.
Why Published Specs Are Misleading
The standard published format is something like: "4,200 tokens/second at 250W TDP." Both numbers are real, but neither is what you'll measure in your rack.
The throughput figure is typically measured at the batch size that saturates the hardware — batch 128, batch 256, fixed input length 512, fixed output length 128. Your production workload runs at batch 8–16 because your latency SLA forces it. The relationship between throughput and batch size is superlinear for GPUs — going from batch 128 to batch 8 often reduces throughput by 6–10x, not 16x, because memory bandwidth rather than compute becomes the bottleneck. But it still reduces, and that reduction is specific to your workload's sequence distribution.
The power figure is TDP — the maximum rated thermal design power. This is the worst-case number. Under actual inference load at moderate batch size, GPU power draw is typically 60–80% of TDP. But that means the published "tokens/second per watt" number was calculated with a denominator that's 25–40% too high in absolute watts while the numerator was measured at a throughput point you'll never reach. The combination can make a chip look 2–3x more efficient than it is at your operating point.
The Correct Measurement Framework
Step 1: Define Your Operating Point
Before measuring anything, characterize your production workload distribution precisely:
- Batch size distribution — not the maximum batch size, but the P50 and P95 effective batch size under your actual request arrival rate and latency SLA. For a real-time serving system with a 100ms first-token P95 target, this is often 4–16.
- Input token length distribution — P50, P95, and P99. Variable-length inputs affect arithmetic intensity and, on GPUs, warp utilization.
- Output token length distribution — relevant for autoregressive generation workloads. Longer outputs mean more decoding iterations, which are more memory-bandwidth-bound than prefill.
- Request arrival pattern — sustained load vs. bursty. Sustained load is what matters for steady-state TCO; burst handling determines headroom requirements.
Step 2: Measure System Power at Operating Point
Do not use TDP. Measure actual wall power under your representative workload using a rack-mounted power distribution unit (PDU) with per-outlet power metering, or with IPMI/BMC power telemetry for the node. Measure the whole server node, not just the accelerator — CPUs, memory, NVMe, and motherboard components are real costs that the accelerator's TDP doesn't capture.
For a four-GPU inference node under representative inference load, total server wall power is typically 1.8–2.4 kW. The GPU TDP sum alone would be 2.8 kW (4 × 700W for H100). The difference is real and changes the denominator of your calculation.
Take a sustained measurement over at least 30 minutes under representative load, not a peak reading during a burst. Power draw varies with memory bandwidth utilization, not just compute utilization, and inference workloads have different power profiles than training.
Step 3: Measure Throughput at That Operating Point
Use your production framework's metrics (tokens generated per second, requests completed per second, etc.) captured during the same measurement window as the power data. Do not mix benchmark throughput with production power, or vice versa — the numbers must come from the same operating condition.
For autoregressive models, decide whether you're measuring prefill throughput, decode throughput, or end-to-end per-request throughput. These are different numbers with different hardware sensitivity profiles. Decode is almost always memory-bandwidth-bound; prefill is compute-bound at larger batch sizes. The right choice depends on which phase dominates your workload's wall-clock time.
Step 4: Calculate and Account for the Full System
effective_throughput_per_watt = (tokens_per_second) / (system_wall_watts)
# Example:
# 2,400 tokens/second (measured at batch=12, mixed input lengths)
# 2,200W wall power (four-GPU node)
# = 1.09 tokens/second/watt (system level)
# vs. published spec:
# 18,000 tokens/second at batch=128
# 700W TDP per GPU, 4 GPUs = 2,800W (but TDP ≠ actual)
# = 6.43 tokens/second/watt (benchmark level)
# Real-world vs. published: 0.17x — a 5.9x difference
The 5.9x difference in this example is not atypical for transformer inference at production batch sizes versus published benchmarks. The variance range across different model architectures and latency targets is roughly 3–8x.
Adjusting for Data Center Overhead
The calculation above gives you chip-level and node-level throughput-per-watt. For infrastructure TCO, you need data-center-level throughput-per-watt, which accounts for cooling, power conversion, and distribution losses.
PUE (Power Usage Effectiveness) is the standard metric: total facility power divided by IT equipment power. Modern hyperscale data centers achieve PUE of 1.1–1.2. Typical co-location facilities run 1.3–1.6. At PUE 1.4:
datacenter_effective_throughput_per_watt =
(tokens_per_second) / (system_wall_watts × PUE)
# Using the example above:
# = 2,400 / (2,200 × 1.4)
# = 2,400 / 3,080
# = 0.78 tokens/second/watt (facility level)
This is the number that maps directly to your power cost: divide your facility's kWh rate by this number to get dollars per thousand tokens.
Comparing Hardware Configurations Honestly
When comparing two hardware options, both calculations must be run at the same operating point — your operating point, not each vendor's chosen benchmark conditions. This means:
- Same model architecture (or as close as the software stack allows)
- Same effective batch size (matching your latency SLA, not each hardware's optimal)
- Same input/output length distribution
- Whole-system power measurement for both, not accelerator TDP
We're not saying published vendor benchmarks are dishonest — they're measuring what they're measuring, and peak throughput at peak batch is a real data point. We're saying that a comparison based on published specs without normalization to your operating point is not a valid comparison, and drawing hardware procurement conclusions from it is a category error.
Applying This to Custom Accelerator Claims
Procunit publishes a 4x throughput-per-watt figure versus commodity GPU for workloads that match our target compute graph profile. That number was measured at our target operating point — batch size 8–24, mixed real-world sequence lengths, whole-system power — not at peak throughput on a controlled benchmark. We publish the measurement methodology alongside the number because it's only meaningful in context.
If you're evaluating that claim, the right test is to run the same measurement framework described above, for both systems, at the operating point that represents your actual production workload. If the workload is outside our target profile — very large batch sizes with relaxed latency constraints, highly variable model architectures — the 4x figure may not reproduce. That constraint is part of the spec, not a footnote to it.
The throughput-per-watt calculation isn't technically difficult. The discipline is in committing to measure at the conditions that actually represent your infrastructure, not the conditions that produce the best-looking number for a marketing slide.