April 08, 2016

Mighty oaks from little acorns grow.

A 65,536 times growth in number of neurons in less than 6 years. From a chip with 256 neurons in August 2011 to 1 million neurons in TrueNorth August 2014 to 16 million neurons for LLNL NS16e System in March 2016.

March 28, 2016

PREPRINT: Convolution Networks for Fast, Energy-Efficient Neuromorphic Computing

Guest Blog by Steven K. Esser

Today, to seek feedback from fellow scientists, my colleagues and I are very excited to share a preprint with the community.

Title: Convolution Networks for Fast, Energy-Efficient Neuromorphic Computing

Authors: Steven K. Esser, Paul A. Merolla, John V. Arthur, Andrew S. Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J. Berg, Jeff rey L. McKinstry, Timothy Melano, Davis R. Barch,  Carmelo di Nolfo, Pallab Datta, Arnon Amir, Brian Taba, Myron D. Flickner, and Dharmendra S. Modha 

Abstract: Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that i) approach state-of-the-art classi cation accuracy across 8 standard datasets, encompassing vision and speech, ii) perform inference while preserving the hardware's underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1100 and 2300 frames per second and using between 25 and 325 mW (effectively > 5000 frames / sec / W) and iii) can be specifi ed and trained using backpropagation with same ease-of-use as contemporary deep learning. For the fi rst time, the algorithmic power of deep learning can be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.

Link: http://arxiv.org/abs/1603.08270

At this juncture, I hope that a personal retrospective will help you in sharing my enthusiasm.

In the Winter of 2008, I graduated school in Wisconsin and moved to California over the New Year to begin working in what was to become IBM’s Brain Inspired Computing lab. The Team had just won DARPA SyNAPSE contract. Arriving, I joined a handful of researchers whose enthusiasm quickly pulled me into sharing their lofty vision -- to build a computer designed from the ground up along the lines of the mammalian brain, and to have that system make a beneficial impact on society. Our goal was clear; our route was not. From the beginning, we chose to take a different approach from rule-based artificial intelligence, whose algorithms, though able to do some of the things our brain can do, under-the-hood work nothing like the brain. We also chose to take a different path from traditional artificial neural networks, which use neurons and synapses -- the basic computational elements of the brain -- but ignore limits on precision and connectivity that are critical for low power operation. Energy-efficiency is, after all, critical to creating scalable or embedded systems (and indeed our own brain uses only as much power as a typical lightbulb). At the start, this decision left us with no suitable existing approaches capable of tackling real world problems, as many of our scientific peers were more than willing to remind us! This gave me pause, but the challenge was fascinating, and so with the optimism of a recent graduate, I dove in.

As work progressed, the lab had the fortune to build an incredible team of hardware researchers. They created a prototype brain-inspired core in 2011 and the four thousand core TrueNorth chip in 2014, giving us the hardware needed for fast, low energy “neurosynaptic” computation. On the algorithm front, we began to internalize the architecture and created a number of small scale demos in 2012 and 2013 and also built a programming language that provided the necessary foundation for later work. These showcased certain capabilities, but were custom solutions to specific problems. To have a major impact, we needed a truly general purpose, scalable approach for network creation.

To this end, our path led back to neural networks in the form of modern deep learning, which can achieve state-of-the-art accuracy across a broad range of perceptual tasks. Though the traditional methods of deep learning are not directly compatible with our chip, to our great surprise and relief, we found that they are extremely resilient to reduced precision and connectivity, as necessitated by TrueNorth. From this insight, we were able to adapt the canonical backpropagation learning rule used in deep learning for compatibility with TrueNorth. At NIPS 2015 last year, we demonstrated near state-of-the-art accuracy on a canonical handwritten digit recognition challenge, using orders of magnitude less energy than the best previous results. As this work was ongoing, we were very encouraged to see research from a number of other laboratories demonstrating the power of deep learning for low precision computation.

Today, this preprint builds on upon many previous efforts. In this work, we demonstrate a method for adapting convolutional neural networks, a powerful tool for deep learning, to create networks that run on the TrueNorth chip. We achieve near state-of-the-art accuracy on 8 datasets spanning color image and speech, while running on real hardware at between 1100 and 2300 frames per second and using between 25 and 325 mW, which is effectively > 5000 frames / sec / W. This is important because approaching state-of-the-art accuracy within neuromorphic constraints was previously believed to be difficult, if not, impossible, and because the ensuing speed and energy efficiencies open up an entirely new operating regime not accessible via conventional computing.

This work is a major personal milestone in a journey that began over 7 years ago and I cannot wait to see the breakthroughs that the next 7 years will bring!

A Scale-up Synaptic Supercomputer (NS16e): Four Perspectives

Today, Lawrence Livermore National Lab (LLNL) and IBM announce the development of a new Scale-up Synaptic Supercomputer (NS16e) that highly integrates 16 TrueNorth Chips in a 4×4 array to deliver 16 million neurons and 256 million synapses. LLNL will also receive an end-to-end software ecosystem that consists of a simulator; a programming language; an integrated programming environment; a library of algorithms as well as applications; firmware; tools for composing neural networks for deep learning; a teaching curriculum; and cloud enablement. Also, don't miss the story in The Wall Street Journal (sign-in required) and the perspective and a video by LLNL's Brian Van Essen.

To provide insights into what it took to achieve this significant milestone in the history of our project, following are four intertwined perspectives from my colleagues:

  • Filipp Akopyan -- First Steps to an Efficient Scalable NeuroSynaptic Supercomputer.
  • Bill Risk and Ben Shaw -- Creating an Iconic Enclosure for the NS16e.
  • Jun Sawada -- NS16e System as a Neural Network Development Workstation.
  • Brian Taba -- How to Program a Synaptic Supercomputer.

The following timeline provides context for today's milestone in terms of the continued evolution of our project.

Timeline
Illustration Credit: William Risk

Line Separator

 

First Steps to an Efficient Scalable NeuroSynaptic Supercomputer

Guest Blog by Filipp Akopyan

 

Recently, IBM's Brain-inspired Computing Team has revealed the world's first 1 million-neuron evaluation platform for mobile applications (NS1e), based on IBM's TrueNorth (TN) neurosynaptic chip. Afterwards, we demonstrated the first 16 million-neuron scale-out system (NS1e-16) assembled using 16 instances of the NS1e board along with supporting periphery, which includes a host server, network router, power supervisors, and other components. Detailed Information on NS1e-16 may be found at [Revealed: A Scale-Out Synaptic Supercomputer (NS1e-16)].

NS1e-16 is a powerful system, but we needed a bigger challenge. Why not build a more compact, more efficient 16 million neuron system that can fit in your shoe box?! We have had some initial prototypes of such a system before, but nothing that could actually be delivered to our customers and partners. So, with the support of LLNL, we embarked on a journey to NeuroSynaptic 16 million-neuron evaluation platform, NS16e (note the subtle but significant lettering difference from the earlier NS1e-16 system).

NS16e evaluation platform consists of the following main components: a custom 4×4 board, a custom Interposer board, and an Avnet off-the-shelf Zynq SOC mini module, AES-MMP-7Z045-G. The three boards are assembled into a single NS16e structure using vertical stacking/mating connectors to supply power to all the boards and to exchange data and control signals.

The 4×4 board contains 16 programmable TrueNorth chips capable of implementing large-scale (up to 16 million neurons) neural models for various applications. Each TrueNorth neurosynaptic chip has 4092 usable cores with each core containing 256 neurons, 256 axons, and 65536 synapses. To maximize the communication bandwidth with the TrueNorth chips on the 4×4 board, we use data port expanders, which access the TrueNorth input/output ports. A similar set of IC-s is used to perform the configuration of all the TrueNorth chips using scan chains. The 4×4 board also hosts all the TrueNorth power domain regulators, power supervisor, current sensing circuitry and several SPI/I2C programmable devices.

The Interposer board provides high speed interfaces and power domains for all the NS16e system components. It contains a PCIe x4 connector for high-speed communication with a host machine and an SFP+ Ethernet cage (which is not currently enabled). The Interposer also contains a power supervisor and a series of electronic fuses (along with the 4×4 board) meant to shut down the system in case of a power failure. For debugging purposes the Interposer board (as well as the 4×4 board) contains LED-s, which may indicate failures on power domains. The Interposer also hosts a JTAG connector for programing the Zynq module.

The Zynq module provides an FPGA fabric for implementing custom glue logic to control and potential modify (if there is such a requirement) the incoming/outgoing data to and from NS16e. Specifically, the Zynq fabric may be used for DMA, data-to-spike transduction, filtering, etc. The high speed transceivers on the SOC (system-on-chip) are used to provide a PCIe communication bridge between NS16e and the host system. The Zynq SOC also contains two ARM processor cores, which are not enabled on the current revision of NS16e to minimize power consumption.

The team created a specification, schematics and performed the physical design (layout) of the system. We have then manufactured the boards and assembled them using the latest cutting-edge electronic components.

So without further ado, we present to you the first NS16e prototype system. Figure 1 shows the assembled 4×4 board (no chips); Figure 2 shows the assembled Interposer board.

Top of 4×4 board Bottom of 4×4 board

Figure 1. NS16e: 4×4 board TOP on the left, and BOTTOM on the right.

 

Top of interposer board Bottom of interposer board

Figure 2. NS16e: Interposer board TOP on the left, and BOTTOM on the right.

 

Since the current number of TrueNorth chips is limited, we have decided to use an assembly risk aversion technique and populate the initial few systems without soldering down the TrueNorth chips. This gives us an opportunity to test the complete design of the system up to (but not including) the actual TrueNorth chips. Figure 3 shows the two boards side-by-side, ready for stacking on top of one another (Zynq module not depicted).

4×4 and Interposer boards

Figure 3. NS16e: 4×4 and Interposer boards side-by-side (ready for mating).

 

As one can imagine, the bring-up of such an unconventional system is not a trivial task. We have developed and put in place various hardware, firmware and software tests to verify the functionality of the system.

On these first prototypes we have provisioned for special type of sockets, which can be populated post-assembly to host the actual TrueNorth chips. Once we have fully verified the functionality of the two newly designed boards without the chips, we have populated the 4×4 board with TrueNorth chips by the means of aforementioned sockets, as shown in Figure 4.

NS16e: System populated with TN-chips using sockets

Figure 4. NS16e: System populated with TN-chips using sockets.

 

Power up of this platform has to be performed by stages with extra care to make sure that the current draw of the system is within reason (i.e. no shorts); all the key interface signals between TrueNorth chips and glue logic are monitored with oscilloscopes for proper operation and signal integrity. The low level bring-up of the full system is depicted in Figure 5.

NS16e: Low level bring-up with TN-chips in sockets

Figure 5. NS16e: Low level bring-up with TN-chips in sockets.

 

Bringing-up and debugging complex boards takes at least several days (with luck on your side). This platform was no exception. Once we had all the low level tests running correctly on system with sockets, we have given a green light to our board assembler to create the first prototypes with soldered down TrueNorth chips. The first 4×4 with soldered chips is shown in Figure 6.

First NS16e prototype with soldered down TN-chips

Figure 6: First 4×4 prototype with soldered down TN-chips.

 

Now that all the low level tests have been successfully completed, we connect the new NS16e systems to the host servers (Figure 7) using several interfaces: Xilinx JTAG for Zynq programming, Lattice JTAG for power supervisor and FPGA programming, PCIe gen2 link for data exchange between host server and TrueNorth chips.

NS16e systems connected to host machines NS16e systems connected to host machines

Figure 7: NS16e systems connected to host machines.

 

At this point the high level testing takes over and we start running neural network algorithms and applications right on the new NS16e system.

Some of us (myself included) like to work on the weekends, since it is quiet and the janitors don't bother you when you stay late! Also there are no mandatory meetings, so it is just you and the pure drive to create the world's most advanced neurosynaptic system! The only problem is that the cafeteria is closed ... so we have to get pizza!

NS16e application testing

Figure 8. Scott Lekuch having a bite to eat and performing NS16e application testing simultaneously.

 

At the end of this exciting journey we have successfully brought up the NS16e prototypes. Now with the help of our highly-talented software team we can run complex neural applications on this powerful platform. Check out the blogs by Jun Sawada and Brian Taba for the application details. Advanced image/sound recognition and classification in real time on 16 million artificial spiking neurons, anyone?!

Line Separator

 

Creating an Iconic Enclosure for the NS16e

Guest Blog by Bill Risk and Ben Shaw

 

For deployment to LLNL, the three-board stack that Filipp Akopyan described above needed to be housed in a protective enclosure. Since the NS16e system is a first-of-its-kind neurosynaptic supercomputer, we wanted to create an interesting, iconic enclosure design that would reflect this novelty, while securing the boards and allowing access to the required connectors, switches, and indicators. Since the board configuration was already fixed, the enclosure had to be designed to accommodate the existing placement of these components and the irregular shape of the three-board stack.

These goals created a challenging design problem, but fortunately, we have been collaborating for several years with the highly creative IBM Design Team, who had previously designed concept models showing potential applications of the TrueNorth chip [Cognitive Apps ]. Aaron Cox (Industrial Design) had worked with us earlier to create an iconic gold cover for the chip (Figure 9). The design of this cover—with tabs in the four cardinal directions—expresses a key feature of the TrueNorth chip: its built-in ability to be tiled with other chips in a two-dimensional array. Since the NS16e is the first system we've built that fully exploits this tileability, we wanted to ensure that the enclosure made the chip caps visible, to dramatically highlight the 4×4 array of chips contained in the system (Figure 10). Working on the overall personality of this research prototype system, Camillo Sassano (Industrial Design) and Kevin Schultz (User Experience) created a design that uses sharp angled surfaces to give the enclosure a shape that appears to change with the viewing angle. A 4×4 array of chamfered pockets on the front features and emphasizes the 3D effect of the golden chips caps, and light accents complement the sculptural geometry of the device.

TrueNorth chip cap

Figure 9. TrueNorth chip cap. The 414 refers to the number of synapses on a single chip (414=256 million).


 

NS16e enclosure

Figure 10. NS16e enclosure.

 

The need to adapt the enclosure to the existing board stack—for example, to actuate pushbutton switches on the edge of one of the PC boards—required the design of some clever mechanisms. A 3D-printed slider provided the ability to press one switch (Figure 11, left panel). In the case of the other pushbutton, a rocker mechanism was used; however, since we also had to provide visual access to an adjacent LED, a light pipe was incorporated into the rocker mechanism, which was 3-D printed in a transparent material to serve both functions (Figure 11, right panel).

Slide mechanism Rocker mechanism

Figure 11. Slide mechanism (left) and rocker mechanism (right).

 

We wanted the enclosure to support two modes of use: one sitting on a desktop as a standalone unit; the other, placement within a 2U-high drawer for mounting in a server rack. The enclosure was designed so that it could be used in either mode, by attaching either a support foot to allow it to stand on its own (Figure 12) or an adapter that allows it to be mounted in the drawer. In the rackmount mode, the NS16e can be laid flat in the closed drawer for normal operation; the drawer can be opened and the NS16e tilted up for display or maintenance. (Figure 13). Since the enclosure is not visible when the drawer is closed, we added some panel graphics and LED lighting to identify and distinguish the unit.

Desktop mode

Figure 12. Desktop mode.

 

Rackmount mode

Figure 13. Rackmount mode.

 

Once mounted in the enclosure, the system is ready for final testing and delivery to LLNL!

Line Separator

 

NS16e System as a Neural Network Development Workstation

Guest Blog by Jun Sawada

 

It all started very small, I still remember the first time when we first simulated a single TrueNorth neuron model. It was only on a circuit simulator. A few years later, we have fabricated it into the TrueNorth chip, with 1 million neurons on a stamp-size silicon piece. Today, we have built the NS16e system.

Unlike the scale-out system we built earlier using single-chip boards [Revealed: A Scale-Out Synaptic Supercomputer (NS1e-16)], this machine is intended toward running larger, more powerful neural networks in real time. In a typical setup, the NS16e system is connected to a host x86 server computer by a PCI Express link. The host server can pump large data in and out of the NS16e system. The host server with GPUs can build and train a large neural network, and NS16e can immediately start running it. We can run with a single command the entire end-to-end development process of preprocessing training data, neural network construction, training the network, and optimizing the network for hardware, and running it on the NS16e. Because this development process is very quick on NS16e, NS16e system combined with a powerful host server is a dream machine for large scale neural network development.

In order to tell you how the machine works, please see the system overview in Figure 14. The core neural computation in NS16e takes place in the grid of TrueNorth chips, tiled in an 4x4 grid array. The TrueNorth chip is designed in such a way that, if you connect chip to chip by wires, chips start talking to each other using spikes. All the communication takes place using an asynchronous protocol, without any external clocks or additional interfacing chips.

NS16e system overview

Figure 14. NS16e system overview.

 

A host server is connected to NS16e by PCI Express link. It can send and receive 500Mb/sec of data between the NS16e and the host server. The FPGA's on the NS16e system work like a bridge between the TrueNorth chips and the host server. In a sense, it is a translator between a conventional von-Neumann computer and a TrueNorth-based neurosynaptic computer, which speak in different languages. A digital von-Neumann computer operates in instructions and binary data, while the neurosynaptic computer talks in neuron spike signals.

When you enter a command to run a neural network model to the host computer, here is what happens. The host computer sends the network model over the PCI Express link, and loads the neural network model to the TrueNorth chips on NS16e. The terminal window on the computer shows the progress of uploading data to NS16e. After 30 seconds or so (depending on the size of the network model), the uploading is completed, and the neural network starts spiking actively. On the host computer screen, you can only see a summary of how many output spikes are generated and how fast all the neurons are updated. Neuron spikes themselves are encoded in a way that is not easy for a person to understand. That is where we use visualizers. It decodes the spikes coming out of NS16e, and visualizes what the NS16e is trying to say.

Let's say we are going to run an image recognition task on NS16e. We upload the neural network to NS16e, and then we send neuron spikes encoding photo (A) in Figure 15 to NS16e. NS16e produces answers in neuron spikes, and send them back to the visualizer program. The visualizer decodes the spikes and shows picture NS16e chooses like (B) in Figure 15.

Image recognition by NS16e

Figure 15. Image recognition by NS16e.

 

The system is still evolving. We continuously create and test new learning algorithms, new model generation techniques, and new optimization algorithms. One unique thing about this machine is that it needs some optimization on how we map the logical representation of a neural network to physical circuits. In our brain, a certain portion of the cortex is responsible for visual recognition and other parts are responsible for motor function. The NS16e has something similar. You may assign which chip is holding which part of a large neural network. For example, we can make each chip to hold a single layer of multi-layer neural network, or we may assign each chip a dissection of the entire multi-layer network corresponding to a patch of an image. This optimization problem, we call "core placement", is a very unique problem to NS16e-like neurosynaptic machines. This makes some network runs more efficiently and faster, because inter-chip communication is much slower than intra-chip communication. This is one place where we are trying different algorithms and techniques one after another.

The NS16e system is truly a product of great team work. Based on the TrueNorth chip design we worked on in the past, NS16e requires new circuit boards, FPGA programmable logic, system software, placement optimizers, neural network development tools and training algorithms. Only when all of these components start to work together, the whole system begins to produce meaningful answers. It is like creating an entire ecosystem from scratch and integrating them together. We also spent a lot of effort to perfect the system. We have been running countless tests on NS16e and compared the results against our simulation results, in order to make sure that the finished NS16e system has no malfunctioning components. We tracked down every hardware and software issue very earnestly. System testing and documentation team cleaned up many imperfections in our software and documentation. This makes me very proud of the NS16e system, which is a result of diligent work of many people over many years.

At the end of this story, I should say that the NS16e system is just a step forward to much larger scaled systems. Today's convolutional deep learning network is growing rapidly in size, we would like to build much bigger machines than NS16e. Future scaled-up version of NS16e will certainly run much larger neural network using a fraction of energy consumed by CPUs and GPUs. Perhaps one day we may see a single rack of neurosynaptic system with as many neurons and synapses as in a human brain.

Image Recognition by NS16e

 

How to Program a Synaptic Supercomputer

Guest Blog by Brian Taba

 

So how do you actually write programs for a 16-million-neuron synaptic supercomputer? Over the last year, we have built an integrated software ecosystem around a stack of end-to-end TrueNorth development tools. We have bundled this ecosystem into a TrueNorth DevKit for release to partners like LLNL and the many other alumni of last year's Boot Camp.

Let's examine one of the reference applications included in our DevKit, a simple TrueNorth image classifier. This example uses a GPU to train a convolutional neural network (CNN) on a standard benchmark dataset, and then deploys the trained network on the NS16e. For more about our algorithm for training CNNs for TrueNorth, see the paper just posted by Steve Esser. Here, we will focus on the developer workflow.

Image classifier run flow

Figure 16. Run flow for a TrueNorth image classifier.

 

The basic structure of an generic TrueNorth image classifier is laid out in Figure 16. We first acquire a stream of input data, which could be image files read from disk, frames from a webcam, or even natively generated spikes from a Dynamic Vision Sensor (DVS). Raw data might be preprocessed into features like multi-scale edge maps, before being encoded as spike streams for input to TrueNorth. Depending on how the application is distributed across system components, preprocessing and encoding might happen in a server CPU, in the board's embedded ARM cores or FPGA, or intrinsically in a spiking sensor. Input spikes can be processed by TrueNorth chips on an NS1e or NS16e board, or by the NSCS functional simulator. Finally, output spikes are decoded into classification predictions that are typically sent off-board for visualization.

In this instance, we train a TrueNorth network to classify images from the CIFAR-10 and CIFAR-100 datasets. We simplify the problem by preprocessing and encoding the entire dataset offline. The resulting input spike file is looped through TrueNorth hardware to generate class predictions for the images in the test set, which are streamed to a workstation for visualization.

Image classifier development flow

Figure 17. Development flow for a TrueNorth image classifier.

 

The chain of actions required to train and deploy a TrueNorth image classifier is shown in Figure 17. Out of the box, the DevKit provides at least one tool to fill in every block in this chain, but you can swap out any of our tools for your own, as long as you implement the same interface. For example, the DevKit includes two equivalent tools for reading and writing LMDB databases—a command-line utility called tn-signal-processor, of which more later; or a set of MATLAB MEX files—but as long as the downstream blocks get an LMDB in the right format, they don't care whether you used one of our tools to make it, or rolled your own.

Dataset/Preprocessor

The first steps in the training workflow are just basic data science. We import the original dataset into a standard format that all of our tools understand, then preprocess it into the features we will use at runtime. We store data in LMDB databases, which are popular for deep learning because their fast read access lets training data pump quickly into a GPU.

To import and preprocess data, we provide a fast C++ command-line utility called tn-signal-processor. Written by David Berg, tn-signal-processor is a Swiss Army knife for manipulating data—it can import JPG and PNG files into our LMDB format, crop, rotate, apply center-surround filters, encode images as spikes, decode spikes into images, visualize data, and more. It can also convert Caffe's LMDB format to ours and back, so anyone who has already imported their data into Caffe can easily transfer it to the TrueNorth ecosystem, and vice versa.

Trainer/Corelet

Now that we have our data in the proper format, it's time to train a classifier. TrueNorth Convolutional Networks (TNCN) is a framework for composing and training convolutional neural networks that automatically satisfy the TrueNorth design constraints. Created by Steve Esser using the Corelet Programming Environment (CPE), TNCN frees a data scientist to focus on abstract network properties like layer order, filter size, and data precision; without getting bogged down in the nuts and bolts of how to configure the 23 free parameters in a TrueNorth neuron to represent data as 1-bit spikes instead of 32-bit floats, for example.

TNCN implements high-level classes for common CNN layers like convolution, pooling, dropout, loss, etc. Each layer class has a corresponding corelet that automatically compiles its parameters into legal TrueNorth core configurations. The result is a model file containing a list of logical core parameters that collectively implement the trained TNCN network.

To accelerate training with a GPU, we use a deep learning framework called MatConvNet, which wraps NVIDIA's cuDNN primitives in binary MEX files that are easy to invoke in MATLAB. MatConvNet is not one of the famous frameworks like Caffe or Torch, but we picked it because we already needed MATLAB for CPE and it's very convenient to have a single development environment for codesigning TNCN layers and corelets. Of course, when we want to run many training instances in parallel on a GPU cluster, it's awkward and expensive to activate a MATLAB license on every compute node, so we also provide a compiled version of TNCN that can be run at the command line using the free MATLAB Runtime.

Placer

The final step before deploying this network to hardware is to assign the logical cores in the model file to physical cores in the NS16e chip array. This can be a challenge in multi-chip systems, due to the bandwidth bottleneck at the interface between chips. If cores are sending too many spikes across chip boundaries, spikes might not all arrive at their destinations within the design window of 1 millisecond per simulation tick. Of course, we can always increase the tick period to close timing. But it's better to reduce cross-chip traffic if possible, by being smarter about where we place cores in the chips.

For the LLNL system, Pallab Datta wrote a new tool from scratch that heuristically places logical cores in the TNCN model file at physical locations in the NS16e core array (Figure 18), based on minimizing chip crossings in the core-to-core connectivity graph. The Neuro Synaptic Core Placer (NSCP) is critical for making large TrueNorth models like a multi-chip TNCN network run successfully in hardware.

Physical core placement on NS16e

Figure 18. Physical core placement for a 4-chip CIFAR-10 network on NS16e.

 

Application

Now we can run our trained TrueNorth classifier in hardware, and see it in action! To decode, evaluate, and visualize the test image labels predicted by hardware, we have another command-line utility called PatchClassifier that can be configured to lay out an array of category icons for a given image dataset, with the predicted category outlined in green if correct and red if incorrect. The length of the bar beneath each category indicates the confidence of the classification.

Figure 19 is a screencap of the predicted CIFAR-100 test image classifications being streamed from the NS16e to the PatchClassifier visualizer running on a laptop. In this case, the NS16e correctly classified the test image as a whale (green bar), and its next guess would have been a dolphin (white bar).

Physical core placement on NS16e

Figure 19. PatchClassifier visualizer configured for CIFAR-100 dataset.

February 29, 2016

A Beautiful Animation

To illustrate one of my recent patents, IBM created the following beautiful animation! Enjoy.

Mystery Box

December 15, 2015

Revealed: A Scale-Out Synaptic Supercomputer (NS1e-16)

Guest Blog by William P. Risk and Michael V. Debole with Contributions from Rodrigo Alvarez-Icaza and Filipp Akopyan.

A few months ago, we unveiled the NeuroSynaptic Evaluation (NS1e) board, which contained a single TrueNorth chip, along with circuitry for interfacing the chip to sensors and real-world data. These boards were used in our August 2015 “Boot Camp” event, in which participants learned how to program the chip to implement cognitive systems [Brain-Inspired Computing Boot Camp Begins]. During BootCamp, each NS1e board was housed in its own plastic case, and for convenience, we built a rack to hold the 48 boards used during that event. Although the rack nicely organized and displayed the boards, a bulky assembly of power strips, ethernet switches, and servers were also required for their use.

Recently, a government client requested that we build a system of 16 NS1e boards, with power unit, ethernet switch, and Linux server all housed in a compact, self-contained unit, where each of the NS1e boards can be seamlessly integrated, but mounted in such a way that any individual board could be swapped in or out easily. This requirement led us to explore designs in which individual NS1e boards are mounted on cards that could be inserted vertically into a card rack (Figure 1) and all elements were mounted in a small desktop rack unit (shown below).

“Fig1
Figure 1. NS1e Card Rack

 

We initially considered two similar designs, both using a 6U high desktop rack with components stacked as follows: (bottom) 1U – power-strip / network switch, 3U – NS1e card rack, 1U NS1e card power, 1U server. We ultimately chose the design with wiring in the back as it provided a cleaner looking front panel.

“Fig2
Figure 2. Final Design Concept

 

The next step was determining how to make the concept become a reality. For the most part, this was a straightforward process since we were able to use many off-the-shelf components (server, network switch, power strip, etc..). However, powering 16 NS1e boards required a bit of engineering to reduce the space required. As standalone boards, each is typically powered by an AC-DC adapter which simply plugs into a standard outlet, but including 16 bulky “wall warts” in a 1U form factor was impractical. In addition, we wanted to provide the capability to remotely monitor the current consumption of each individual board and to control its power state. To solve this problem we turned to a USB-style power distribution module developed by Cambrionix. While normally intended to charge and sync cell phones and tablets, it’s port capacity (16 USB ports) and current limits were suitable for our purposes. However, with typical USB connectors plugged into the Cambrionix board, the height required was close to 2U (3.5″), greater than the 1U we had allocated for the power distribution unit in the initial design. Fortunately, the card rack holding the NS1e boards did not occupy the full depth of the rack and we had just enough room to design a step-down enclosure using 1U of space above the NS1e drawer and dropping down to 2U in the back (See below). Finally, to give some visual appeal, we united the 16 individual NS1e boards by spreading a graphic (our award-winning visualization of the network diagram of the monkey brain) across their front panels and added some accentuating LED strip lighting on both sides of the drawer and below the chassis.

Building the system, once all the planning was complete, was relatively straightforward:

Build Evolution

“Fig3
Figure 3. Initial Skeleton

 

“Fig4
Figure 4. Early Prototype Front and Back

 

“Fig5
Figure 5. Functional Prototype (Alpha)

 

“Fig6
Figure 6. Functional Prototype (Beta)

 

“Fig7
Figure 7. Custom Power Enclosure

 

“Fig8
Figure 8. Final Lab Photo

 

“Fig9
Figure 9. Final Photo

 

Then crated and shipped!

 

“Fig10
Figure 10. Preparing to ship system to clients

 

The end result is a system that provides 16 million neurons and 4 billion synapses in a package about the size of a carry-on suitcase!

December 14, 2015

Mystery!

What did Santa Claus bring from IBM Research's Brain-inspired Computing Team? I will reveal this week here on http://modha.org and on Twitter at @DharmendraModha.

Mystery Box

December 03, 2015

NIPS 2015: Backpropagation for Energy-Efficient Neuromorphic Computing

Guest Post by Steven K. Esser, Rathinakumar Appuswamy, Paul A. Merolla, and John V. Arthur.

At the 2015 Neural Information Processing Systems (NIPS) conference, in a paper entitled Backpropagation for Energy-Efficient Neuromorphic Computing, we will be presenting our latest research in adapting machine learning techniques for use in training TrueNorth networks. In essence, this is our first step towards bringing together deep learning (for offline learning) together with brain-inspired computing (for online delivery).

This work is driven by an interest in using neural networks in embedded systems to solve real world problems. Such systems must satisfy both performance requirements, namely accuracy and generalizability, as well as platform requirements, such as a small footprint, low power consumption and real time capabilities. We have seen many recent examples demonstrating machine learning is able to meet performance needs, and other examples that neuromorphic approaches such as the TrueNorth are well suited to the platform needs.

An interesting challenge arises in trying to bring machine learning and neuromorphic hardware together. To achieve high efficiency, TrueNorth uses spiking neurons, discrete synapses and constrained connectivity. However, backpropagation, the algorithm at the core of much of machine learning, uses continuous-output neurons, high precision synapses, and typically operates with no limits on the number of inputs per neuron. How then can we build systems that take advantage of algorithmic insights from machine learning and the operational efficiency of neuromorphic hardware?

In our work, we demonstrate a learning rule and network topology that reconcile this apparent incompatibility by training in a continuous and differentiable probabilistic space that has a direct correspondence to spikes and discrete synaptic states in the hardware domain. Using this approach, we achieved near state-of-the-art performance on the MNIST handwritten digit dataset (99.42%), and the best accuracy to date using spiking neurons and/or low-precision discrete synapses. We have demonstrated three orders of magnitude less energy per classification than the next best low power approach.

NIPS 2015
Accuracy and energy of network trained using our approach running on the TrueNorth chip. Ensembles of multiple networks were tested, with ensemble size indicated next to each data point. It is possible to trade-off accuracy versus energy.

The software behind the published algorithm is already in the hands of nearly 100 developers most of whom attended the August 2015 Boot Camp, and we expect a number of new results in 2016.

Looking to the future, we are working to expand the repertoire of machine learning approaches for training TrueNorth networks. We have exciting work brewing in our lab using TrueNorth with convolution networks, and have achieved near state of the art on a number of additional datasets.

The paper can be found at https://papers.nips.cc/paper/5862-backpropagation-for-energy-efficient-neuromorphic-computing

Stay tuned!

November 30, 2015

R&D 100 Award, Editor's Choice

TrueNorth received 2015 R&D 100 Award and was named Editor's Choice under IT/Electrical Category.

Summer 2016 Internships

Apply here.
modha-web.jpg

Categories

  • Talk
Creative Commons License
This weblog is licensed under a Creative Commons License.
The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.