Technology

Inference at the precision of the problem.

General-purpose GPU architecture is optimized to handle any tensor operation in any order. Procunit inverts this premise by building hardware that handles exactly the operations your frozen model performs, in exactly the order they occur.

Compiler Input: ONNX / TorchScript
Op Coverage: Full graph
Specialization: Model-locked
Precision: INT8 / FP16
Graph Compiler

From frozen model to custom IR.

Procunit's compiler ingests a frozen model graph in ONNX or TorchScript format and produces a custom dataflow IR — a precise enumeration of every compute primitive, tensor dimension, memory access pattern, and data dependency in your model. This IR is not an optimization of a general-purpose program — it's a hardware specification.

Standard ML compiler stacks (XLA, TVM, IREE) apply operator fusion heuristics that assume a fixed ISA — they decide what the hardware can do and fit your model into it. Procunit's compiler has no fixed ISA to target. It observes your model's exact op graph and generates a hardware description that matches it. No heuristics. No dead op coverage.

  • Full ONNX opset 17 coverage for inference subgraphs
  • No operator approximation — every op mapped exactly
  • Output IR drives the hardware synthesis stage directly
  • Deterministic output for a given frozen model
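
As a rough illustration of what enumerating a frozen graph could look like, the sketch below builds a small IR-like listing of ops, tensor shapes, and data dependencies, then hashes it to show deterministic output. This is not Procunit's compiler or IR format: the `Node` class, `build_ir`, `ir_digest`, and the three-node graph are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass
import hashlib
import json


@dataclass(frozen=True)
class Node:
    """One compute primitive in a frozen graph (hypothetical representation)."""
    name: str
    op: str
    inputs: tuple
    outputs: tuple


def build_ir(nodes, shapes):
    """Enumerate each node's op, tensor shapes, and producer dependencies."""
    # Map every tensor to the node that produces it.
    producer = {out: n.name for n in nodes for out in n.outputs}
    ir = []
    for n in nodes:
        ir.append({
            "node": n.name,
            "op": n.op,
            "input_shapes": {t: list(shapes[t]) for t in n.inputs},
            "output_shapes": {t: list(shapes[t]) for t in n.outputs},
            # Graph inputs and weights have no producer, so they are skipped.
            "depends_on": sorted({producer[t] for t in n.inputs if t in producer}),
        })
    return ir


def ir_digest(ir):
    """Hash a canonical JSON form so the same graph always yields the same digest."""
    blob = json.dumps(ir, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Hypothetical graph: MatMul -> Add -> Relu
nodes = [
    Node("mm0", "MatMul", ("x", "w0"), ("h0",)),
    Node("add0", "Add", ("h0", "b0"), ("h1",)),
    Node("relu0", "Relu", ("h1",), ("y",)),
]
shapes = {
    "x": (1, 64), "w0": (64, 32), "h0": (1, 32),
    "b0": (32,), "h1": (1, 32), "y": (1, 32),
}

ir = build_ir(nodes, shapes)
for entry in ir:
    print(entry["node"], entry["op"], "depends on", entry["depends_on"])
print("digest:", ir_digest(ir)[:16])
```

Running `build_ir` twice on the same inputs gives the same digest, which is the property the last bullet describes.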
[Image: abstract dataflow topology showing compute primitive interconnects]
Dataflow Architecture

Custom dataflow, not systolic arrays.

GPU and TPU-class accelerators rely on dense matrix engines (tensor cores, systolic arrays): grids of multiply-accumulate units that work well when tensor dimensions are large and regular. Production inference models often have irregular graphs: mixed sequence lengths, conditional execution paths, non-uniform memory access.

Procunit's dataflow architecture is derived directly from the model graph. Each node in the hardware corresponds to a compute primitive in the model. Data flows along paths that reflect actual model dependencies — not a generic SIMT execution model with warp-level scheduling overhead.

  • Dataflow topology derived from model graph, not assumed
  • No warp divergence penalty — conditional execution handled natively
  • Mixed-precision scheduling at per-op granularity
  • Memory bandwidth budget allocated to actual access patterns
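
To make the idea of a graph-derived topology concrete, here is a minimal sketch of dataflow-style scheduling in which a node fires as soon as all of its producers have fired. The dependency map, the per-op precision table, and the function name `dataflow_schedule` are assumptions made for illustration; they do not describe Procunit's actual hardware or scheduler.

```python
def dataflow_schedule(deps):
    """Group nodes into steps; a node fires once all of its producers have fired.

    deps maps each node name to the list of node names it depends on.
    """
    remaining = {n: len(d) for n, d in deps.items()}
    consumers = {n: [] for n in deps}
    for n, d in deps.items():
        for p in d:
            consumers[p].append(n)

    ready = sorted(n for n, count in remaining.items() if count == 0)
    steps = []
    while ready:
        steps.append(ready)
        newly_ready = []
        for n in ready:
            for c in consumers[n]:
                remaining[c] -= 1
                if remaining[c] == 0:
                    newly_ready.append(c)
        ready = sorted(newly_ready)

    if sum(len(s) for s in steps) != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return steps


# Hypothetical attention-like subgraph.
deps = {
    "q_proj": [], "k_proj": [], "v_proj": [],
    "scores": ["q_proj", "k_proj"],
    "softmax": ["scores"],
    "context": ["softmax", "v_proj"],
}
# Hypothetical per-op precision assignment (INT8 / FP16).
precision = {
    "q_proj": "INT8", "k_proj": "INT8", "v_proj": "INT8",
    "scores": "INT8", "softmax": "FP16", "context": "INT8",
}

steps = dataflow_schedule(deps)
for i, step in enumerate(steps):
    print(i, [(n, precision[n]) for n in step])
```

The three projections have no producers, so they fire together in the first step; everything after that follows the model's own dependencies rather than a fixed execution grid.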
Memory Hierarchy

Working set sized to your model, not the broadest case.

The highest-cost memory is the memory you provision for bursts that don't happen. Procunit's on-chip SRAM allocation is derived from the model's actual working set: the minimum footprint required to keep active tensors resident and execute the frozen graph without spilling to DRAM. There is no unified L2 cache doing speculative prefetch for op types that aren't in your graph.

On-Chip SRAM First

Working set analysis identifies the minimum on-chip SRAM to keep your model's active tensors resident. No external DRAM bandwidth bottleneck during execution.
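
One common way to estimate a working set is liveness analysis over a fixed execution order. The sketch below is a simplified version under stated assumptions: each tensor occupies SRAM only from its first use to its last use, weights are treated like any other tensor, and the schedule and byte sizes (INT8, one byte per element) are hypothetical. It is not Procunit's analysis, only an instance of the general technique.

```python
def peak_working_set(schedule, tensor_bytes):
    """Return (peak_bytes, op_name) for a fixed execution order.

    schedule is a list of (op_name, inputs, outputs) tuples in execution order.
    """
    # Index of the last op that touches each tensor.
    last_use = {}
    for i, (_, ins, outs) in enumerate(schedule):
        for t in tuple(ins) + tuple(outs):
            last_use[t] = i

    live = set()
    peak, peak_op = 0, None
    for i, (name, ins, outs) in enumerate(schedule):
        # Inputs and outputs must all be resident while the op runs.
        live.update(ins)
        live.update(outs)
        current = sum(tensor_bytes[t] for t in live)
        if current > peak:
            peak, peak_op = current, name
        # Free tensors that no later op needs.
        live = {t for t in live if last_use[t] > i}
    return peak, peak_op


# Hypothetical schedule: MatMul -> Add -> Relu, INT8 tensors.
schedule = [
    ("mm0", ("x", "w0"), ("h0",)),
    ("add0", ("h0", "b0"), ("h1",)),
    ("relu0", ("h1",), ("y",)),
]
tensor_bytes = {"x": 64, "w0": 64 * 32, "h0": 32, "b0": 32, "h1": 32, "y": 32}

peak, op = peak_working_set(schedule, tensor_bytes)
print(f"peak working set: {peak} bytes during {op}")
```

In this toy case the peak is set by the matrix multiply, because its weight tensor dominates every activation in the graph.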

Prefetch Schedule

Weight prefetch schedule is derived from the execution graph — weights arrive at the compute unit the cycle before they're needed, not after a cache miss.
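
The timing rule above can be sketched as simple arithmetic: if an op starts at a known cycle and its weights take a known number of cycles to load, the load is issued early enough to complete one cycle before the op starts. The op names, cycle counts, and the `prefetch_schedule` function are hypothetical, and the sketch ignores contention between overlapping loads.

```python
def prefetch_schedule(ops):
    """Compute when to issue each weight load so it lands one cycle before use.

    ops is a list of dicts with 'name', 'start_cycle', and 'load_cycles'.
    Returns a list of (name, issue_cycle, arrival_cycle).
    """
    plan = []
    for op in ops:
        arrival = op["start_cycle"] - 1          # weights ready the cycle before
        issue = arrival - op["load_cycles"]      # back off by the load latency
        if issue < 0:
            raise ValueError(f"{op['name']}: load cannot finish before start")
        plan.append((op["name"], issue, arrival))
    return plan


# Hypothetical execution timeline.
ops = [
    {"name": "mm0", "start_cycle": 10, "load_cycles": 4},
    {"name": "mm1", "start_cycle": 20, "load_cycles": 6},
]

plan = prefetch_schedule(ops)
for name, issue, arrival in plan:
    print(f"{name}: issue load at cycle {issue}, weights arrive at cycle {arrival}")
```

Because the execution graph is frozen, every start cycle is known ahead of time, which is what makes a static schedule like this possible in principle.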

Bandwidth Efficiency

Memory bandwidth is allocated in proportion to each op's actual access footprint. No bandwidth is reserved for access patterns that a general-purpose controller would over-provision.
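
Proportional allocation is straightforward to express in code. The sketch below splits a total bandwidth budget across ops by their share of bytes accessed; the footprints, the 100 GB/s budget, and the name `allocate_bandwidth` are illustrative assumptions, not Procunit figures.

```python
def allocate_bandwidth(footprints, total_gbps):
    """Split total_gbps across ops in proportion to bytes accessed per op."""
    total_bytes = sum(footprints.values())
    if total_bytes <= 0:
        raise ValueError("footprints must sum to a positive number of bytes")
    return {op: total_gbps * b / total_bytes for op, b in footprints.items()}


# Hypothetical per-op access footprints in bytes and a hypothetical budget.
footprints = {"mm0": 2144, "add0": 96, "relu0": 64}
allocation = allocate_bandwidth(footprints, total_gbps=100.0)

for op, gbps in allocation.items():
    print(f"{op}: {gbps:.2f} GB/s")
```

An op that touches 1.5 times as many bytes as another receives 1.5 times the bandwidth, and the shares always sum to the full budget.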

Performance Projections

Engineering estimates on throughput-per-watt.

These figures are simulation projections, not certified benchmarks. They are derived from Procunit's silicon characterization model applied to common stable-model workload profiles.

Platform                      Tokens/sec (est.)   TDP (W)   Tokens/sec/W   Relative Efficiency
NVIDIA A100 (80GB)            ~2,800              400       ~7.0           1.0× (baseline)
NVIDIA H100 (SXM5)            ~4,200              700       ~6.0           0.86×
Procunit PCU-1 Eval (proj.)   ~3,200              150       ~21.3          ~3.0× (proj.)

All figures are engineering projections derived from simulation of a GPT-2-sized stable inference workload profile. Actual performance will vary by model graph. GPU figures are approximated from published data sheets and independent benchmarking literature. These are not certified benchmarks.
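
For readers who want to check the derived columns, the sketch below recomputes tokens/sec/W and efficiency relative to the A100 baseline from the throughput and TDP figures in the table. It assumes TDP is the power figure used for every platform; a ratio computed on a different power basis would differ. The function name `tokens_per_watt` is introduced here for illustration.

```python
def tokens_per_watt(tokens_per_sec, tdp_watts):
    """Throughput per watt, using TDP as the power figure."""
    return tokens_per_sec / tdp_watts


# (tokens/sec estimate, TDP in watts), taken from the table above.
platforms = {
    "NVIDIA A100 (80GB)": (2800, 400),
    "NVIDIA H100 (SXM5)": (4200, 700),
    "Procunit PCU-1 Eval (proj.)": (3200, 150),
}

baseline = tokens_per_watt(*platforms["NVIDIA A100 (80GB)"])
efficiency = {}
for name, (tps, tdp) in platforms.items():
    eff = tokens_per_watt(tps, tdp)
    efficiency[name] = (eff, eff / baseline)
    print(f"{name}: {eff:.1f} tokens/sec/W, {eff / baseline:.2f}x vs. A100")
```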

Next Step

See how the architecture is built.

Die floor plan, full stack diagram, and integration specs.
