
Dharmendra S. Modha

My Work and Thoughts.

PNAS: Can neuromorphic computing help reduce AI’s high energy cost?

November 4, 2025 By dmodha

Excerpts from an article in PNAS (Proceedings of the National Academy of Sciences):

NorthPole is an “AI Accelerator” that’s “designed with energy efficiency in mind,” says Dharmendra Modha, IBM’s chief scientist for brain-inspired computing.

“We are driven not so much by neuroscience, but more by the intrinsic mathematical potential of the architecture,” he says.

In a 2023 paper, Modha and his team at IBM reported that the NorthPole neuromorphic chip successfully classified images from a dataset—a task often used to benchmark the performance of AI systems. The chip did so using a tiny fraction of the energy required by a conventional system, and it was five times faster. Modha believes that building chips differently, rather than only finding ways to shrink circuit dimensions and pack more processors onto integrated circuits, can lead to greater gains in energy efficiency. “Architecture trumps Moore’s Law,” he says.

Filed Under: Press

Computer History Museum Interview

September 7, 2025 By dmodha

Computer History Museum interview on the occasion of NorthPole’s induction into the Museum. Other interviewees include: John Backus (Fortran), Brian Kernighan (UNIX), Robert Metcalfe (Ethernet, 3Com), Gordon Moore (Moore’s Law), Robert Kahn (TCP/IP), Douglas Engelbart (hypertext), Ronald Rivest (RSA), John McCarthy (LISP), Donald Knuth (analysis of algorithms), James Gosling (JAVA), John Hennessy (RISC), Ken Thompson (UNIX, B), Rodney Brooks (robotics).

Filed Under: Press

EE Times Interview by Sunny Bains

September 7, 2025 By dmodha

Sunny Bains interviewed me for Brains and Machines. It captures our journey through DARPA SyNAPSE, TrueNorth, and NorthPole. Listen here.

Filed Under: Press

SiLQ: Simple Large Language Model Quantization-Aware Training

September 6, 2025 By dmodha

Thrilled to share the latest work from the IBM Research NorthPole Team pushing the cutting edge of quantized large language model performance. In a recent paper, we introduce a new quantization recipe and apply it to 8-billion-parameter Granite and Llama models. With 8-bit activations and KV cache and 4-bit weights, these models show minimal accuracy degradation across three leaderboards spanning 20 distinct tasks.

Our method is high-accuracy, outperforming all previously published quantization methods on the models and precisions examined. It is simple, reusing existing training code with only quantization and knowledge-distillation steps added, and relatively low-cost, reusing existing training data or publicly available datasets while increasing the total training budget by less than 0.1%. We believe this will be a powerful enabling tool for deploying models on ultra-low-latency inference accelerators like NorthPole, greatly enhancing the performance of latency-critical applications such as interactive dialog and agentic workflows.
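The core building block of quantization-aware training is “fake quantization”: during training, weights and activations are rounded to a low-precision grid and immediately dequantized, so the model learns to tolerate the rounding. Below is a minimal NumPy sketch of symmetric per-tensor fake quantization at 4 and 8 bits; it is an illustrative toy, not the SiLQ recipe itself, and the tensor shapes are arbitrary.

```python
import numpy as np

def fake_quantize(x, bits):
    """Symmetric per-tensor fake quantization: snap values to a
    bits-wide integer grid, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for INT4, 127 for INT8
    scale = np.abs(x).max() / qmax             # one scale for the tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                           # values the model "sees" in QAT

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)

w4 = fake_quantize(w, bits=4)  # 4-bit weight analogue
a8 = fake_quantize(w, bits=8)  # 8-bit activation/cache analogue

# The coarser 4-bit grid incurs a larger worst-case rounding error.
print(np.abs(w - w4).max() >= np.abs(w - a8).max())  # True
```

In a full QAT pipeline, this rounding step is placed inside the forward pass (with a straight-through gradient), and a knowledge-distillation loss against the full-precision model guides the quantized one.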

The paper, written with co-authors Steven Esser, Jeffrey McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra Modha, can be found here.

Filed Under: Papers

Breakthrough low-latency, high-energy-efficiency LLM inference performance using NorthPole

September 26, 2024 By dmodha

New: As presented today at the IEEE HPEC (High Performance Extreme Computing) Conference, exciting new results from IBM Research demonstrate that for a 3-billion-parameter LLM, a compact 2U research prototype system using the IBM AIU NorthPole inference chip delivers an astounding 28,356 tokens/sec of system throughput at sub-1-ms/token (per-user) latency. NorthPole is optimized for the two conflicting objectives of energy efficiency and low latency. In the low-latency regime, NorthPole (12 nm) provides 72.7x better energy efficiency (tokens/sec/W) than a state-of-the-art 4 nm GPU. In the high-energy-efficiency regime, NorthPole (12 nm) provides 46.9x better latency (ms/token) than a 5 nm GPU.
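System throughput and per-user latency are related through the number of users served concurrently: each user sees roughly the system throughput divided by the user count. The sketch below illustrates that arithmetic using the reported throughput; the concurrent-user count is a hypothetical placeholder chosen for illustration, not a figure from the post.

```python
# Illustrative throughput-vs-latency arithmetic for batched LLM serving.
system_throughput = 28_356   # tokens/sec, reported system throughput
concurrent_users = 28        # hypothetical batch of simultaneous users

per_user_tokens_per_sec = system_throughput / concurrent_users
latency_ms_per_token = 1_000 / per_user_tokens_per_sec

print(f"{latency_ms_per_token:.3f} ms/token")  # 0.987 ms/token, sub-1 ms
```

This is why the two objectives conflict: serving more users raises aggregate throughput (and energy efficiency) but stretches each user's per-token latency.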

NorthPole is a brain-inspired, silicon-optimized chip architecture for neural inference, published in October 2023 in Science. It is the result of nearly two decades of work at IBM Research and a 14+ year partnership with the United States Department of Defense (Defense Advanced Research Projects Agency, Office of the Under Secretary of Defense for Research and Engineering, and Air Force Research Laboratory).

NorthPole balances two conflicting objectives of energy efficiency and low latency.

First, because LLMs demand substantial energy resources for both training and inference, a sustainable future computational infrastructure is needed to enable their efficient and widespread deployment. Energy efficiency of data centers is becoming critical as their carbon footprints expand, and as they become increasingly energy-constrained. According to the World Economic Forum, “At present, the environmental footprint is split, with training responsible for about 20% and inference taking up the lion’s share at 80%. As AI models gain traction across diverse sectors, the need for inference and its environmental footprint will escalate.”

Second, many applications such as interactive dialog and agentic workflows require very low latencies. Within a given computer architecture, latency can be decreased by decreasing throughput; however, that in turn decreases energy efficiency. To paraphrase a classic systems maxim: “Throughput problems can be cured with money. Latency problems are harder because the speed of light is fixed.”

Caption: NorthPole (12 nm) performance relative to current state-of-the-art GPUs (7 / 5 / 4 nm) on energy and system latency metrics, where system latency is the total latency experienced by each user. At the lowest GPU latency (H100, point P2), NorthPole provides 72.7x better energy metric (tokens/sec/W). At the best GPU energy metric (L4, point P1), NorthPole provides 46.9x lower latency.
Caption: Exploded view of the research prototype appliance showing installation of the 16 NorthPole PCIe cards. NorthPole cards can communicate via the standard PCIe endpoint model through the host or directly, and more efficiently, with one another via additional hardware features on each card.
Caption: Strategy for mapping the 3-billion-parameter LLM to the 16-card NorthPole appliance. Each transformer layer is mapped to one NorthPole card and the output layer is mapped to two cards (left). For each layer, all weights and KV cache are stored on-chip, so only the small embedding tensor produced by each card’s layer must be forwarded to the next card over low-bandwidth PCIe when generating a token. Within each transformer layer (right), weights and KV cache are stored at INT4 precision. Activations are also INT4 except when higher dynamic range is needed for accumulations.
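The mapping strategy works because the data that must cross PCIe per token (one layer's output embedding) is many orders of magnitude smaller than the weights and KV cache, which stay on-chip. A back-of-the-envelope sketch, where the hidden size and per-layer parameter count are hypothetical placeholders for a ~3B model, not figures from the paper:

```python
# Why on-chip weights keep inter-card traffic tiny: per generated token,
# only the embedding tensor moves between cards. All sizes assume INT4.
hidden_size = 2560                    # assumed embedding width
params_per_layer = 100e6              # assumed weights in one transformer layer

embedding_bytes = hidden_size * 0.5   # INT4 = 4 bits = 0.5 byte per value
weight_bytes = params_per_layer * 0.5 # resident on-chip, never transferred

print(f"PCIe per token: {embedding_bytes:.0f} B; "
      f"on-chip weights: {weight_bytes/1e6:.0f} MB")  # 1280 B vs 50 MB
```

Under these placeholder numbers, the per-token PCIe transfer is tens of thousands of times smaller than the layer's weights, which is what makes low-bandwidth card-to-card links sufficient.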

PDF of the Accepted Version.


Future: Next research and development steps include further optimization of energy efficiency; mapping larger LLMs (8B, 13B, 20B, 34B, 70B) onto correspondingly larger NorthPole appliances; new LLM models co-optimized with the NorthPole architecture; and future system and chip architectures.

Caption: IBM AIU NorthPole rack under construction!
Design Credit: Ryan Mellody, Susana Rodriguez de Tembleque, William Risk, Map Project Office

Filed Under: Papers

