Lt-Wt Net


Fast, Energy-Efficient Deep Neural Net

The Lt-Wt (Lightweight) net is our fast, energy-efficient deep neural net that can be embedded directly into resource-constrained IoT devices. Because these devices can do all of their processing locally, they do not need to broadcast any data, which improves security and privacy.

The Lt-Wt net requires 95% fewer ops, 95% less memory, and 95% less logic than a conventional neural net (CNN), making it suitable for fast and economical ASIC, FPGA, or 8-bit microcontroller implementations. Its inference accuracy is comparable to that of the CNN.

The Lt-Wt net can approximate arbitrary continuous functions to any desired accuracy. It requires modest storage and has a multiplication-free forward pass, making it suitable for deployment on inexpensive hardware. The Lt-Wt learning process automatically drops insignificant inputs, unnecessary weights, and unneeded hidden neurons. The resulting sparse weight matrices loosen the coupling between layers, making the Lt-Wt net more tolerant of the failure of individual neurons.

In a CNN, the learned information is distributed over all of the weights. In a Lt-Wt net, the picture is less fuzzy: the large number of zero-valued weights makes the localized nature of the computation much more apparent. For image processing, a CNN does not scale well as image resolution increases because of its fully-connected structure; a Lt-Wt net scales much better owing to the sparsity of its weight matrices.

The small magnitude of the Lt-Wt net weights should result in smooth mappings and the small number of non-zero weights should result in low generalization error.

The Lt-Wt net has been successfully tested with up to 16 hidden layers and 4.4 million weights on problems having input vectors consisting of tens of thousands of elements.

Implementation in 8-bit Hardware

The inputs are mapped to an 8-bit fixed-point representation. All neurons have 8-bit outputs, and all arithmetic is integer-only, consisting of additions and subtractions alone. The activation function is implemented as a 2 kB lookup table.
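As an illustration of this input mapping, the short sketch below quantizes a real-valued reading to an 8-bit code. The scaling to a known range [lo, hi] and the rounding scheme are assumptions made for illustration; the actual Lt-Wt preprocessing is not specified here.

    /* Hypothetical host-side quantization of one input to an 8-bit fixed-point
       code; the range [lo, hi] (with hi > lo) and the rounding are assumptions. */
    #include <stdint.h>

    uint8_t quantize_input(double x, double lo, double hi)
    {
        if (x < lo) x = lo;                                   /* clamp to range   */
        if (x > hi) x = hi;
        return (uint8_t)((x - lo) * 255.0 / (hi - lo) + 0.5); /* 0..255 code      */
    }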

The executable Lt-Wt net consists of the following elements:

  • RAM (input and outputs)
  • Random access ROM for the lookup table
  • Read-once ROM to hold the network definition
  • Control logic
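To make the interplay of these elements concrete, here is a minimal C sketch of one forward pass. The network-definition encoding used here (per neuron: a count of excitatory source indices, those indices, then a count of inhibitory source indices and those indices), the LUT indexing, and the array sizes are illustrative assumptions, not the documented layout; the add/subtract structure corresponds to the excitor and inhibitor connections counted in the case-study tables below.

    /* Minimal forward-pass sketch using the elements above (hypothetical layout). */
    #include <stdint.h>
    #include <stdio.h>

    #define LUT_SIZE 2048                     /* 2 kB activation lookup table (ROM) */

    static uint8_t act_lut[LUT_SIZE];         /* random-access ROM: activation      */
    static uint8_t value[64];                 /* RAM: 8-bit inputs then neuron outs */

    /* Read-once ROM: neuron count, then per neuron an excitatory index list and an
       inhibitory index list. Toy net: value[2] = act(v[0] - v[1]); value[3] = act(v[2]). */
    static const uint16_t net_def[] = {
        2,                                    /* number of neurons                  */
        1, 0,  1, 1,                          /* neuron 0: +value[0], -value[1]     */
        1, 2,  0                              /* neuron 1: +value[2]                */
    };

    static void forward(uint16_t n_inputs)    /* control logic                      */
    {
        const uint16_t *p = net_def;          /* sequential (read-once) access      */
        uint16_t n_neurons = *p++;
        for (uint16_t n = 0; n < n_neurons; n++) {
            int16_t acc = 0;                  /* 16-bit accumulator                 */
            uint16_t ne = *p++;
            while (ne--) acc += value[*p++];  /* integer additions only             */
            uint16_t ni = *p++;
            while (ni--) acc -= value[*p++];  /* integer subtractions only          */
            if (acc < 0) acc = 0;             /* clamp into LUT range (assumption)  */
            if (acc >= LUT_SIZE) acc = LUT_SIZE - 1;
            value[n_inputs + n] = act_lut[acc];   /* 8-bit result back to RAM       */
        }
    }

    int main(void)
    {
        for (int i = 0; i < LUT_SIZE; i++)    /* toy activation: saturate at 255    */
            act_lut[i] = (uint8_t)(i > 255 ? 255 : i);
        value[0] = 200; value[1] = 60;        /* two quantized inputs               */
        forward(2);
        printf("outputs: %d %d\n", value[2], value[3]);   /* expect 140 140         */
        return 0;
    }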

Configuring these elements for fast execution on an 8-bit microcontroller, or for efficient implementation on an FPGA, is straightforward. An end-to-end 16-bit implementation of the Lt-Wt net on a Cortex-M4 requires only 0.1 kB of code. In the case of the Human Activity Recognizer, this enables operation in the sub-µW range on microprocessors, and at 10-20% of that power on FPGAs.

Case Studies

Case Study 1: Air Pressure System Failure in Scania Trucks

Problem
Predict failure of the Air Pressure System (APS) in heavy Scania trucks based on sensor data.*
Data
There are 170 features and a single yes/no outcome to be predicted. The training dataset consists of 60 k instances, of which 59 k belong to the negative class and 1 k to the positive class; 850 k feature values are missing. The test dataset consists of 16 k instances with 229 k missing values.
Training Results
The trained Lt-Wt net had an F1 score of 0.73, recall of 0.68, precision of 0.79, and accuracy of 0.99 on the test data.
Comparison With 32-Bit Floating-Point CNN
TABLE 1: Key differences between the 32-bit floating-point CNN and the 8/16-bit fixed-point Lt-Wt net.
Aspect               CNN       Lt-Wt Net
ROM (excl. LUT)      3.4 MB    130 kB
Memory fetches       1.7 M     63 k
Multiplications      871 k     0
Additions            871 k     62 k

The small number of memory fetches (1/72 in terms of bytes) and additions (1/14), combined with integer-only arithmetic, the absence of multiplications, and 8-bit data flows, results in a very fast and highly energy-efficient Lt-Wt net. Moreover, the tiny memory footprint makes it possible to place the whole network in the L1 cache of some microprocessors.

TABLE 2: The Lt-Wt deep neural net is fast, energy-efficient and compact.
Benefit    Key Contributors
Speed      • Reduced number of memory fetches (1/108 in terms of bytes)
           • Reduced number of arithmetic (1/27) and other operations
Energy     • Reduced number of bytes fetched (1/108)
           • No multiplications, floating-point or otherwise
           • Integer-only additions and subtractions
Size       • Lower RAM (1/4) and ROM (1/26) requirement
           • 8-bit input-neuron, neuron-neuron, and neuron-output data paths
           • No multiplier, floating-point or otherwise
           • Integer-only adder
TABLE 3: Detailed comparison of the 8/16-bit Lt-Wt net and the 32-bit floating-point CNN.
Lt-Wt Net: CONFIG
Config: Initial 170:1024:512:256:128:64:2; Final 170:1024:512:155:14:3:2
  Inputs                        I                      170
  Neurons                       N                      1.7 k
  Excitor connections           Ce                     31 k
  Inhibitor connections         Ci                     31 k

Lt-Wt Net: MEMORY
  ROM (excl. 0.6 kB LUT)        2(Ce + Ci) + 6N + 8    130 kB
  RAM                           I + N                  1.9 kB

Lt-Wt Net: OPERATIONS
  RAM/ROM fetch                 2(Ce + Ci) + 3N + 4    130 k
  16-bit copy to accum          N                      1.7 k
  8-bit int add to the accum    Ce                     31 k
  8-bit int accum subtract      Ci                     31 k
  8-bit write to RAM            N                      1.7 k
  Boolean comparison            Ce + Ci + N + 1        63 k
  Arithmetic comparison         2N                     3.4 k
  Single/double increment       Ce + Ci + N + 1        63 k
  Total ops                     5(Ce + Ci) + 9N + 6    324 k

Lt-Wt Net: MCU IMPLEMENTATION
  Ops                           K                      324 k
  Overhead                      O                      ~1 M
  Single-cycle equiv ops        K + O                  1.3 M
  ATmega2560 8-bit 16 MHz                              ~16 MIPS
  Inference pass duration                              0.08 s

32-bit Floating-Point Neural Net: CONFIG
Config: 170:1024:512:256:128:64:2
  Inputs                        I                      170
  Neurons                       N                      2 k
  Weights                       W                      871 k

32-bit Floating-Point Neural Net: MEMORY
  ROM (excl. LUT)               4(W + N)               3.4 MB
  RAM                           4(I + N)               8.5 kB

32-bit Floating-Point Neural Net: OPERATIONS
  32-bit memory fetch           2W + N                 1.7 M
  32-bit flt-pt multiply        W                      871 k
  32-bit flt-pt add             W                      871 k
  32-bit write to RAM           N                      2 k
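As a sanity check of the TABLE 3 cost formulas, the short snippet below plugs in the rounded APS case-study parameters (I, N, Ce, Ci) and reproduces the quoted ROM, fetch, operation, and timing figures to within rounding. It is an illustrative calculation only, not part of the Lt-Wt implementation.

    /* Worked check of the TABLE 3 formulas for the APS case study (illustrative). */
    #include <stdio.h>

    int main(void)
    {
        long I = 170, N = 1700, Ce = 31000, Ci = 31000;   /* Lt-Wt Net: CONFIG      */

        long rom_bytes = 2*(Ce + Ci) + 6*N + 8;   /* 134,208 B  ~ 130 kB            */
        long ram_bytes = I + N;                   /* 1,870 B    ~ 1.9 kB            */
        long fetches   = 2*(Ce + Ci) + 3*N + 4;   /* 129,104    ~ 130 k             */
        long total_ops = 5*(Ce + Ci) + 9*N + 6;   /* 325,306    ~ 324 k             */

        /* MCU estimate: add ~1 M cycles of overhead, divide by ~16 MIPS            */
        double seconds = (total_ops + 1.0e6) / 16.0e6;    /* ~0.08 s                */

        printf("ROM %ld B, RAM %ld B, fetches %ld, ops %ld, %.2f s\n",
               rom_bytes, ram_bytes, fetches, total_ops, seconds);
        return 0;
    }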
*Costa C.F., Nascimento M.A. (2016) IDA 2016 Industrial Challenge: Using Machine Learning for Predicting Failures. In: Bostrom H., Knobbe A., Soares C., Papapetrou P. (eds) Advances in Intelligent Data Analysis XV. IDA 2016. Lecture Notes in Computer Science, vol 9897. Springer, Cham

Case Study 2: Human Activity Recognition

Problem
Human activity recognition based on triaxial acceleration and triaxial angular velocity readings from a smartphone attached to the waist.*
Data
There are 561 features and six possible outputs: laying; sitting; standing; walking; walking downstairs; walking upstairs. The features are based on the time-series data from the two sensors, and include time domain as well as frequency domain components. The training dataset consists of 7,352 instances, out of which 986 represent the minority class and 1,407 the majority class. The test dataset consists of 2,947 instances, out of which 420 represent the minority and 537 the majority class.
Training Results
The trained Lt-Wt net had an F1 score, recall, precision, and accuracy equal to 0.95 on test data.
Comparison With 32-Bit Floating-Point CNN
TABLE 1: Key differences between the 32-bit floating-point CNN and the 8/16-bit fixed-point Lt-Wt net.
Aspect               CNN       Lt-Wt Net
Test data accuracy   0.95**    0.95
ROM (excl. LUT)      67 kB     1.4 kB
Memory fetches       34 k      1.4 k
Multiplications      17 k      0
Additions            17 k      0.7 k

The small number of memory fetches (1/66 in terms of bytes) and additions (1/26), combined with integer-only arithmetic, the absence of multiplications, and 8-bit data flows, results in a very fast and highly energy-efficient Lt-Wt net. Moreover, the tiny memory footprint makes it possible to place the whole network in the L1 cache of some microprocessors.

TABLE 2: The Lt-Wt deep neural net is fast, energy-efficient and compact.
Benefit    Key Contributors
Speed      • Reduced number of memory fetches (1/66 in terms of bytes)
           • Reduced number of arithmetic (1/52) and other operations
Energy     • Reduced number of bytes fetched (1/66)
           • No multiplications, floating-point or otherwise
           • Integer-only additions and subtractions
Size       • Lower RAM (1/7) and ROM (1/48) requirement
           • 8-bit input-neuron, neuron-neuron, and neuron-output data paths
           • No multiplier, floating-point or otherwise
           • Integer-only adder
TABLE 3: Detailed comparison of the 8/16-bit Lt-Wt net and the 32-bit floating-point CNN.
Lt-Wt Net: CONFIG
Config: Initial 561:8:3:6; Final 333:7:3:6
  Inputs                        I                      333
  Neurons                       N                      16
  Excitor connections           Ce                     0.3 k
  Inhibitor connections         Ci                     0.3 k

Lt-Wt Net: MEMORY
  ROM (excl. 0.6 kB LUT)        2(Ce + Ci) + 6N + 8    1.4 kB
  RAM                           I + N                  0.4 kB

Lt-Wt Net: OPERATIONS
  RAM/ROM fetch                 2(Ce + Ci) + 3N + 4    1.4 k
  16-bit copy to accum          N                      16
  8-bit int add to the accum    Ce                     0.3 k
  8-bit int accum subtract      Ci                     0.3 k
  8-bit write to RAM            N                      16
  Boolean comparison            Ce + Ci + N + 1        0.7 k
  Arithmetic comparison         2N                     32
  Single/double increment       Ce + Ci + N + 1        0.7 k
  Total ops                     5(Ce + Ci) + 9N + 6    3.4 k

Lt-Wt Net: MCU IMPLEMENTATION
  Ops                           K                      3.4 k
  Overhead                      O                      10 k
  Single-cycle equiv ops        K + O                  14 k
  ATmega2560 8-bit 16 MHz                              ~16 MIPS
  Inference pass duration                              0.9 ms

32-bit Floating-Point Neural Net: CONFIG
Config: 561:30:6**
  Inputs                        I                      0.6 k
  Neurons                       N                      36
  Weights                       W                      17 k

32-bit Floating-Point Neural Net: MEMORY
  ROM (excl. LUT)               4(W + N)               67 kB
  RAM                           4(I + N)               2.3 kB

32-bit Floating-Point Neural Net: OPERATIONS
  32-bit memory fetch           2W + N                 34 k
  32-bit flt-pt multiply        W                      17 k
  32-bit flt-pt add             W                      17 k
  32-bit write to RAM           N                      36
*Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.
**Human Activity Recognition with Smartphones

Case Study 3: Aircraft Engine Predictive Maintenance

Problem
Predict failure of aircraft engines based on data collected during run-to-failure simulations.*
Data
Inputs from 21 sensors, including temperature, pressure, RPM, fuel-flow, fuel-air-ratio, and bleed-enthalpy measurements, are used to make one of three possible predictions: engine failure in 1-15, 16-30, or 30+ operational cycles. The training and test datasets consist of 21 k and 13 k instances, respectively.
Training Results
The trained Lt-Wt net had an F1 score, recall, precision, and accuracy equal to 0.95 on test data.
Comparison With 32-Bit Floating-Point CNN
TABLE 1: Key differences between the 32-bit floating-point CNN and the 8/16-bit fixed-point Lt-Wt net.
Aspect               CNN       Lt-Wt Net
Test data accuracy   0.92**    0.95
ROM (excl. LUT)      16 kB     7.5 kB
Memory fetches       7.7 k     6.9 k
Multiplications      3.8 k     0
Additions            3.8 k     3.1 k

The lower number of memory fetches (1/3 in terms of bytes) and additions (1/1.2), combined with integer-only arithmetic, the absence of multiplications, and 8-bit data flows, results in a very fast and highly energy-efficient Lt-Wt net. Moreover, the small memory footprint makes it possible to place the whole network in the L1 cache of some microprocessors.

TABLE 2: The Lt-Wt deep neural net is fast, energy-efficient and compact.
Benefit    Key Contributors
Speed      • Reduced number of memory fetches (1/3 in terms of bytes)
           • Reduced number of arithmetic (1/2.5) and other operations
Energy     • Reduced number of bytes fetched (1/3)
           • No multiplications, floating-point or otherwise
           • Integer-only additions and subtractions
Size       • Lower RAM (1/2) and ROM (1/2) requirement
           • 8-bit input-neuron, neuron-neuron, and neuron-output data paths
           • No multiplier, floating-point or otherwise
           • Integer-only adder
TABLE 3: Detailed comparison of the 8/16-bit Lt-Wt net and the 32-bit floating-point CNN.
Lt-Wt Net: CONFIG
Config: Initial 35:128:64:32:16:8:3; Final 35:128:64:32:15:5:3
  Inputs                        I                      35
  Neurons                       N                      247
  Excitor connections           Ce                     1.6 k
  Inhibitor connections         Ci                     1.5 k

Lt-Wt Net: MEMORY
  ROM (excl. 0.6 kB LUT)        2(Ce + Ci) + 6N + 8    7.5 kB
  RAM                           I + N                  0.3 kB

Lt-Wt Net: OPERATIONS
  RAM/ROM fetch                 2(Ce + Ci) + 3N + 4    6.9 k
  16-bit copy to accum          N                      247
  8-bit int add to the accum    Ce                     1.6 k
  8-bit int accum subtract      Ci                     1.5 k
  8-bit write to RAM            N                      247
  Boolean comparison            Ce + Ci + N + 1        3.3 k
  Arithmetic comparison         2N                     494
  Single/double increment       Ce + Ci + N + 1        3.3 k
  Total ops                     5(Ce + Ci) + 9N + 6    18 k

Lt-Wt Net: MCU IMPLEMENTATION
  Ops                           K                      18 k
  Overhead                      O                      52 k
  Single-cycle equiv ops        K + O                  70 k
  ATmega2560 8-bit 16 MHz                              ~16 MIPS
  Inference pass duration                              4 ms

32-bit Floating-Point Neural Net: CONFIG
Config: 35:100:3
  Inputs                        I                      35
  Neurons                       N                      103
  Weights                       W                      3.8 k

32-bit Floating-Point Neural Net: MEMORY
  ROM (excl. LUT)               4(W + N)               16 kB
  RAM                           4(I + N)               0.6 kB

32-bit Floating-Point Neural Net: OPERATIONS
  32-bit memory fetch           2W + N                 7.7 k
  32-bit flt-pt multiply        W                      3.8 k
  32-bit flt-pt add             W                      3.8 k
  32-bit write to RAM           N                      103
*A. Saxena, K. Goebel, D. Simon, and N. Eklund, “Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation”, in the Proceedings of the 1st International Conference on Prognostics and Health Management (PHM08), Denver CO, Oct 2008.
**Predictive Maintenance: Simulated aircraft engine run-to-failure

Downloads

  • Lightweight Neural Networks (arXiv preprint, PDF)
  • Case Study: Real-Time Condition Monitoring (PDF)
  • Case Study: Predictive Maintenance (PDF)
  • Case Study: Wearable Intelligence (PDF)