Technology

Inference at the precision of the problem.

General-purpose GPU architecture is optimized to handle any tensor operation in any order. Procunit inverts this premise — building hardware that handles exactly the operations your frozen model performs, in exactly the order they occur.

Compiler Input: ONNX / TorchScript
Op Coverage: Full graph
Specialization: Model-locked
Precision: INT8 / FP16

Graph Compiler

From frozen model to custom IR.

Procunit's compiler ingests a frozen model graph in ONNX or TorchScript format and produces a custom dataflow IR — a precise enumeration of every compute primitive, tensor dimension, memory access pattern, and data dependency in your model. This IR is not an optimization of a general-purpose program — it's a hardware specification.

Standard ML compiler stacks (XLA, TVM, IREE) apply operator fusion heuristics that assume a fixed target ISA: the hardware's capabilities are decided first, and your model is fitted to them. Procunit's compiler has no fixed ISA to target. It reads your model's exact op graph and generates a hardware description that matches it. No heuristics. No hardware spent on ops your model never runs.

Full ONNX opset 17 coverage for inference subgraphs
No operator approximation — every op mapped exactly
Output IR drives the hardware synthesis stage directly
Deterministic output for a given frozen model
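
As a rough illustration of the analysis step, the sketch below counts ops and unique op types in a toy graph, collects its data dependencies, and hashes a canonical summary so that the same graph always yields the same fingerprint. The graph layout and the names `GRAPH`, `summarize`, and `fingerprint` are assumptions made for this example; they are not Procunit's IR format or compiler API.

```python
import hashlib
import json
from collections import Counter

# Hypothetical stand-in for a frozen model graph: each node has a name,
# an op type, and the names of the nodes it consumes. Illustrative only.
GRAPH = [
    {"name": "embed", "op": "Gather", "inputs": []},
    {"name": "qkv", "op": "MatMul", "inputs": ["embed"]},
    {"name": "scores", "op": "Softmax", "inputs": ["qkv"]},
    {"name": "attn", "op": "MatMul", "inputs": ["scores", "qkv"]},
    {"name": "out", "op": "Add", "inputs": ["attn", "embed"]},
]


def summarize(graph):
    """Count ops, list unique op types, and collect data dependencies."""
    op_counts = Counter(node["op"] for node in graph)
    edges = sorted((src, node["name"]) for node in graph for src in node["inputs"])
    return {
        "total_ops": len(graph),
        "op_types": dict(sorted(op_counts.items())),
        "edges": [list(edge) for edge in edges],
    }


def fingerprint(graph):
    """Hash a canonical serialization so equal graphs give equal digests."""
    canonical = json.dumps(summarize(graph), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


summary = summarize(GRAPH)
print(summary["total_ops"], "ops,", len(summary["op_types"]), "unique op types")
print("fingerprint:", fingerprint(GRAPH)[:16])
```

Because the summary sorts op types and edges before hashing, the fingerprint does not depend on the order in which nodes are listed, which is one simple way to get deterministic output for a given frozen model.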

procunit — graph ingestion

$ procunit ingest \
    --model model.onnx \
    --target procunit-eval-1
Procunit Compiler v0.9.2 (eval)
Analyzing 847 ops...
23 unique op types identified
Mapping to 6 custom datapaths
Memory working set: 3.2 GB SRAM
IR written to ./procunit-ir/model-eval1.pir
Done. Estimated throughput gain: 3.8–4.3x

Figure: abstract dataflow topology showing compute primitive interconnects.

Dataflow Architecture

Custom dataflow, not systolic arrays.

GPUs and similar accelerators lean on dense matrix engines such as tensor cores and systolic arrays: grids of multiply-accumulate units that work well when tensor dimensions are large and regular. Production inference models often have irregular graphs, with mixed sequence lengths, conditional execution paths, and non-uniform memory access.

Procunit's dataflow architecture is derived directly from the model graph. Each node in the hardware corresponds to a compute primitive in the model. Data flows along paths that reflect actual model dependencies — not a generic SIMT execution model with warp-level scheduling overhead.

Dataflow topology derived from model graph, not assumed
No warp divergence penalty — conditional execution handled natively
Mixed-precision scheduling at per-op granularity
Memory bandwidth budget allocated to actual access patterns
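
To make "topology derived from the model graph" concrete, here is a minimal sketch that groups the nodes of a toy dependency graph into stages, where every node in a stage depends only on earlier stages and could in principle run concurrently. It uses a standard topological (Kahn-style) traversal; the `DEPS` graph and the `dataflow_stages` helper are hypothetical and say nothing about how Procunit actually lays out hardware.

```python
from collections import defaultdict, deque

# Hypothetical dependency map: each node -> the producers it waits on.
DEPS = {
    "embed": [],
    "q_proj": ["embed"],
    "k_proj": ["embed"],
    "scores": ["q_proj", "k_proj"],
    "softmax": ["scores"],
    "out": ["softmax", "embed"],
}


def dataflow_stages(deps):
    """Group nodes into stages; a node sits one stage past its latest producer."""
    consumers = defaultdict(list)
    pending = {}
    for node, producers in deps.items():
        pending[node] = len(producers)
        for producer in producers:
            consumers[producer].append(node)

    stage = {node: 0 for node in deps}
    ready = deque(node for node in deps if pending[node] == 0)
    processed = 0
    while ready:
        node = ready.popleft()
        processed += 1
        for consumer in consumers[node]:
            stage[consumer] = max(stage[consumer], stage[node] + 1)
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)

    if processed != len(deps):
        raise ValueError("graph has a cycle or references a missing producer")

    grouped = defaultdict(list)
    for node, s in stage.items():
        grouped[s].append(node)
    return [sorted(grouped[s]) for s in sorted(grouped)]


for index, nodes in enumerate(dataflow_stages(DEPS)):
    print(f"stage {index}: {', '.join(nodes)}")
```

In this toy graph `q_proj` and `k_proj` land in the same stage because neither depends on the other, which is the kind of parallelism a graph-derived layout can expose without a warp scheduler.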

Memory Hierarchy

Working set sized to your model, not the broadest case.

The highest-cost memory is the memory you provision for bursts that don't happen. Procunit's on-chip SRAM allocation is derived from the model's actual working set: the minimum footprint required to keep active tensors resident and execute the frozen graph without stalling on DRAM bandwidth. There is no unified L2 cache doing speculative prefetch for op types that aren't in your graph.

On-Chip SRAM First

Working set analysis identifies the minimum on-chip SRAM to keep your model's active tensors resident. No external DRAM bandwidth bottleneck during execution.
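
One simple way to think about working set analysis is tensor liveness: a tensor must be resident from the first step that touches it to the last, and the peak sum of live tensor sizes bounds the on-chip memory needed. The sketch below applies that idea to a three-op toy schedule. The schedule, the tensor sizes, and the assumption that weights are only resident between their first and last use are all illustrative, not Procunit's method.

```python
# Hypothetical schedule: ops in execution order, with the tensors each op
# reads and writes. Sizes are in bytes and are illustrative only.
TENSOR_BYTES = {"x": 4096, "w1": 8192, "h": 4096, "w2": 8192, "y": 2048}
SCHEDULE = [
    {"op": "matmul_1", "reads": ["x", "w1"], "writes": ["h"]},
    {"op": "relu", "reads": ["h"], "writes": ["h"]},
    {"op": "matmul_2", "reads": ["h", "w2"], "writes": ["y"]},
]


def peak_working_set(schedule, tensor_bytes):
    """Return the largest number of bytes that must be resident at any step."""
    first_use, last_use = {}, {}
    for step, op in enumerate(schedule):
        for tensor in op["reads"] + op["writes"]:
            first_use.setdefault(tensor, step)  # keep the earliest step
            last_use[tensor] = step             # overwrite with the latest step

    peak = 0
    for step in range(len(schedule)):
        live = sum(
            tensor_bytes[tensor]
            for tensor in first_use
            if first_use[tensor] <= step <= last_use[tensor]
        )
        peak = max(peak, live)
    return peak


print("peak working set:", peak_working_set(SCHEDULE, TENSOR_BYTES), "bytes")
```

Here the peak occurs at the first op, where `x`, `w1`, and `h` are all live at once (16,384 bytes); a real analysis would also have to account for alignment, double buffering, and tensors that stay resident across requests.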

Prefetch Schedule

Weight prefetch schedule is derived from the execution graph — weights arrive at the compute unit the cycle before they're needed, not after a cache miss.
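
The idea can be sketched as working backward from each op's start cycle: if the start time and the weight load latency are both known from the frozen graph, the load can be issued early enough to land one cycle ahead. The op list, cycle counts, and `prefetch_schedule` helper below are hypothetical, and the sketch ignores contention between overlapping loads.

```python
# Hypothetical per-op timing: the cycle each op starts, and how many cycles
# its weights take to stream into the compute unit.
OPS = [
    {"op": "matmul_1", "start_cycle": 100, "weight_load_cycles": 40},
    {"op": "matmul_2", "start_cycle": 180, "weight_load_cycles": 40},
    {"op": "softmax", "start_cycle": 230, "weight_load_cycles": 0},
]


def prefetch_schedule(ops):
    """Issue each weight load so it completes one cycle before the op starts."""
    plan = []
    for op in ops:
        if op["weight_load_cycles"] == 0:
            continue  # this op has no weights to fetch
        arrive = op["start_cycle"] - 1
        issue = arrive - op["weight_load_cycles"]
        if issue < 0:
            raise ValueError(f"{op['op']}: load cannot finish before the op starts")
        plan.append({"op": op["op"], "issue_cycle": issue, "arrive_cycle": arrive})
    return plan


for entry in prefetch_schedule(OPS):
    print(f"{entry['op']}: issue at cycle {entry['issue_cycle']}, "
          f"arrive at cycle {entry['arrive_cycle']}")
```

Because the execution order is fixed, every issue cycle can be computed ahead of time rather than discovered at run time through cache misses.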

Bandwidth Efficiency

Memory bandwidth is allocated proportionally to each op's actual access footprint. No bandwidth is held in reserve for traffic that a general-purpose controller would over-provision for.
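
As a minimal sketch of proportional allocation, the function below splits a fixed bandwidth budget across ops according to the bytes each one accesses. The footprint numbers, the 100 GB/s budget, and the `allocate_bandwidth` name are assumptions chosen for the example, not Procunit specifications.

```python
def allocate_bandwidth(footprint_bytes, total_gbps):
    """Split total bandwidth across ops in proportion to bytes accessed."""
    total_bytes = sum(footprint_bytes.values())
    if total_bytes == 0:
        raise ValueError("no memory traffic to allocate bandwidth for")
    return {
        op: total_gbps * nbytes / total_bytes
        for op, nbytes in footprint_bytes.items()
    }


# Hypothetical per-op access footprints, in bytes.
FOOTPRINT = {"matmul_1": 6_000_000, "softmax": 1_000_000, "matmul_2": 3_000_000}

shares = allocate_bandwidth(FOOTPRINT, total_gbps=100.0)
for op, gbps in shares.items():
    print(f"{op}: {gbps:.1f} GB/s")
```

An op type that never appears in the graph has no entry in the footprint map, so it receives no share of the budget.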

Performance Projections

Engineering estimates on throughput-per-watt.

These are simulation projections, not certified benchmarks. The numbers are derived from Procunit's silicon characterization model applied to common stable-model workload profiles.

Platform	Tokens/sec (est.)	TDP (W)	Tokens/sec/W	Relative Efficiency
NVIDIA A100 (80GB)	~2,800	400	~7.0	1.0× (baseline)
NVIDIA H100 (SXM5)	~4,200	700	~6.0	0.86×
Procunit PCU-1 Eval (proj.)	~3,200	150	~21.3	3.9–4.2× (proj.)

All figures are engineering projections derived from simulation of a GPT-2-sized stable inference workload profile. Actual performance will vary by model graph. GPU figures are approximated from published data sheets and independent benchmarking literature. These are not certified benchmarks.
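
The Tokens/sec/W column is the estimated tokens per second divided by TDP, and a point-estimate efficiency ratio follows from dividing by the A100 value. The short sketch below reproduces that arithmetic from the rounded figures in the table; the dictionary layout and the `tokens_per_watt` helper are illustrative. Dividing the PCU-1 point estimate by the A100 baseline this way gives roughly 3.0×, which is lower than the 3.9–4.2× projected range in the last column, so that range presumably rests on inputs beyond the rounded values shown here.

```python
# Rounded estimates copied from the table above.
PLATFORMS = {
    "NVIDIA A100 (80GB)": {"tokens_per_sec": 2800, "tdp_w": 400},
    "NVIDIA H100 (SXM5)": {"tokens_per_sec": 4200, "tdp_w": 700},
    "Procunit PCU-1 Eval (proj.)": {"tokens_per_sec": 3200, "tdp_w": 150},
}


def tokens_per_watt(row):
    """Throughput per watt: estimated tokens/sec divided by TDP in watts."""
    return row["tokens_per_sec"] / row["tdp_w"]


baseline = tokens_per_watt(PLATFORMS["NVIDIA A100 (80GB)"])
for name, row in PLATFORMS.items():
    tpw = tokens_per_watt(row)
    print(f"{name}: {tpw:.1f} tokens/sec/W, {tpw / baseline:.2f}x vs A100")
```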

Next Step

See how the architecture is built.

Die floor plan, full stack diagram, and integration specs.
