Inference Silicon Co-Design: The Software Engineer Between the Model and the Chip
Why This Field Matters
Running a model well on a GPU and shaping a chip built solely for that model are two different layers of work. In June 2026, OpenAI unveiled Jalapeño, its first custom inference processor, co-designed with Broadcom. It targets inference rather than training, and it is a reticle-sized ASIC. The company says it went from design start to tape-out in nine months, which it frames as one of the fastest cycles ever for a high-performance ASIC. The telling detail: OpenAI’s own models helped with the design itself.
This is not OpenAI alone. Google has run its TPUs for years, and Amazon designs and ships Trainium. The pattern is now clear: the largest inference operators no longer lean on a single class of general-purpose GPU, but build silicon shaped to their own workloads. The economics are simple. With inference sitting at the center of revenue, a chip that produces the same answer on less power is margin. OpenAI would only say perf-per-watt is “substantially better” without giving hard numbers, but the plan to deploy at gigawatt scale by the end of 2026 makes plain why power efficiency is the business. And chips like this do not emerge when hardware architects and model researchers work in separate rooms. Someone has to bridge them directly. That is the inference silicon co-design engineer.
Required Skills
You first need computer architecture in your bones: a feel for where memory bandwidth becomes the bottleneck, how to lay out compute units and on-chip memory so data moves less, and which dataflow structures stream matrix multiplies efficiently. In Silicon Valley this work lives in the chip teams of FAANG-scale hyperscalers and at silicon startups like Groq, Tenstorrent, and Cerebras. To stand on the chip-design side, you work in RTL or HLS, or at minimum can express an accelerator's dataflow in a hardware description language.
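One common way to build that feel for memory-bandwidth bottlenecks is a roofline estimate: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the ratio of the chip's peak compute to its memory bandwidth. The sketch below is a minimal illustration in Python. The accelerator numbers (PEAK_FLOPS, MEM_BANDWIDTH) and the matrix shapes are hypothetical and do not describe any real chip, and the model assumes each matrix is read or written exactly once.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_element):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], assuming each matrix moves once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical accelerator, chosen only for illustration.
PEAK_FLOPS = 400e12      # 400 TFLOP/s
MEM_BANDWIDTH = 2e12     # 2 TB/s
ridge_point = PEAK_FLOPS / MEM_BANDWIDTH  # intensity at which the bound switches

# Batch-1 decode step versus a 4096-token prefill through an 8192x8192 layer (16-bit weights).
decode = matmul_arithmetic_intensity(1, 8192, 8192, 2)
prefill = matmul_arithmetic_intensity(4096, 8192, 8192, 2)

for name, intensity in (("decode", decode), ("prefill", prefill)):
    bound = "memory-bound" if intensity < ridge_point else "compute-bound"
    tflops = attainable_flops(intensity, PEAK_FLOPS, MEM_BANDWIDTH) / 1e12
    print(f"{name}: {intensity:.1f} FLOPs/byte, {bound}, at most {tflops:.1f} TFLOP/s")
```

Under these assumptions the batch-1 decode step sits far below the ridge point, so adding compute units would not help it, while the prefill sits above it. That asymmetry is one reason inference-specific chips tend to be designed around data movement.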
Next comes the compiler layer that connects model to silicon. The core of it is working with stacks like MLIR, TVM, and XLA to lower an ML graph into accelerator instructions. On top of that sits a sense for numerics in hardware: deciding how far you can quantize a model to int8 or int4 while holding accuracy, and which operations map to which bit width. The last piece is measurement. You profile perf-per-watt and throughput directly, find the bottleneck, and rewrite kernels to lift efficiency. The toolchain usually pairs Python on the model side with C++ on the high-performance paths, plus CUDA or an equivalent accelerator programming model. Going deep on one axis is not enough; the value of this seat comes from being fluent in three languages at once: model, compiler, and hardware.
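To make the quantization trade-off concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is one simple scheme among many, not the method of any particular compiler stack, and the weight values are made up for illustration. It maps floats to signed integers of a given bit width, maps them back, and reports the worst-case error.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] using a single shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


def max_abs_error(reference, approx):
    return max(abs(r - a) for r, a in zip(reference, approx))


# Made-up weights, for illustration only.
weights = [0.82, -0.41, 0.05, -1.27, 0.33, 0.90, -0.08, 0.61]

errors, scales = {}, {}
for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    errors[bits] = max_abs_error(weights, dequantize(codes, scale))
    scales[bits] = scale
    print(f"int{bits}: scale={scale:.4f}, max abs error={errors[bits]:.4f}")
```

With round-to-nearest, the error per value is bounded by half the scale, so each bit removed roughly doubles the worst-case error. Whether that error is acceptable is a question about model accuracy, which has to be measured on the model itself.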
Career Path
Juniors usually start on a single slice of kernel or compiler work. They write a pass that lowers a specific operation to accelerator instructions, verify that a quantized kernel does not break accuracy, and run benchmarks until they can read perf-per-watt and latency at a glance. This is where you learn, by hand, how a model actually flows across a chip and where data movement is wasted.
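Reading a benchmark at a glance mostly means reducing raw measurements to a few numbers. The sketch below assumes you already have per-request latencies, a token count, a wall-clock duration, and an average power reading from some external meter. All of the sample values are invented, and the function names are illustrative rather than part of any profiling tool.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_ms, tokens_generated, wall_seconds, avg_power_watts):
    """Reduce one benchmark run to throughput, efficiency, and tail latency."""
    ordered = sorted(latencies_ms)
    tokens_per_second = tokens_generated / wall_seconds
    return {
        "tokens_per_second": tokens_per_second,
        # Watts are joules per second, so this is tokens produced per joule.
        "tokens_per_joule": tokens_per_second / avg_power_watts,
        "p50_ms": percentile(ordered, 50),
        "p99_ms": percentile(ordered, 99),
    }


# Invented measurements for a single run.
latencies_ms = [12.1, 11.8, 12.4, 30.2, 11.9, 12.0, 12.3, 11.7, 12.2, 12.5]
report = summarize(latencies_ms, tokens_generated=5000, wall_seconds=2.0, avg_power_watts=500.0)

for key, value in report.items():
    print(f"{key}: {value}")
```

Keeping the median and the tail side by side matters: in this made-up run a single slow request leaves the median untouched but dominates the p99, which is the kind of outlier that usually points at a stall rather than at steady-state kernel speed.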
At the senior level, the weight shifts from a single kernel to accelerator co-design. You read ahead of time which hardware resources a changing model architecture will starve, and you rework the dataflow and the memory hierarchy to fit the model. The person who translates constraints between the model team and the hardware team is forged here. Higher still is the silicon-software architect, who decides early in the design cycle which models the next-generation chip will target and how far to carry the compiler and runtime alongside it. As the Jalapeño timeline suggests, taping out a chip in nine months is possible only when hardware and software see the same picture from the start. Drawing that picture is where this path ends. Now that custom inference silicon has crossed from one or two experiments into an industry-standard strategy, the hands that bridge the gap are the first ones needed.
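As one example of reading ahead which resource a model change will starve, consider how a longer context window grows the attention key/value cache. The sketch below uses the standard sizing for a transformer KV cache (keys plus values, per layer, per KV head). The model shape and the memory figures are hypothetical and are not taken from any real model or chip.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element):
    """Bytes of key and value state stored for each token of context."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element  # 2 = keys + values


def max_concurrent_sequences(memory_bytes, weight_bytes, context_tokens, per_token_bytes):
    """How many full-length sequences fit in the memory left over after the weights."""
    free_bytes = memory_bytes - weight_bytes
    return max(0, free_bytes // (context_tokens * per_token_bytes))


# Hypothetical model shape and device memory, for illustration only.
per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_element=2)
DEVICE_MEMORY = 96 * GIB
WEIGHT_BYTES = 16 * GIB

for context in (8192, 131072):
    fit = max_concurrent_sequences(DEVICE_MEMORY, WEIGHT_BYTES, context, per_token)
    print(f"context {context}: {context * per_token / GIB:.0f} GiB per sequence, {fit} sequences fit")
```

In this sketch, a 16x longer context cuts the number of sequences that fit by 16x with no change to the weights or the compute. A shift like that moves the pressure from compute units to memory capacity and bandwidth, which is the kind of constraint a co-design engineer would raise with both teams early.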
References
- https://techcrunch.com/2026/06/24/openai-unveils-its-first-custom-chip-built-by-broadcom/
- https://www.tomshardware.com/tech-industry/artificial-intelligence/broadcom-and-openai-unveil-custom-built-jalapeno-inference-processor-openais-first-chip-is-a-massive-reticle-sized-asic-built-in-an-ultra-fast-nine-month-development-cycle