Lt-Wt Net


Fast, Energy-Efficient Deep Neural Net

The Lt-Wt (Lightweight) net is our fast, energy-efficient deep neural net that can be embedded directly into resource-constrained IoT devices. Because these devices can do all of their processing locally, they do not need to broadcast any data, which improves security and privacy.

The Lt-Wt net requires 95% fewer ops, 95% less memory, and 95% less logic than a conventional neural net (CNN), making it suitable for fast and economical ASIC, FPGA, or 8-bit microcontroller implementations. Its inference accuracy is comparable to that of the CNN.

The Lt-Wt net can approximate arbitrary continuous functions to any desired accuracy. It requires modest storage and has a multiplication-free forward pass, making it suitable for deployment on inexpensive hardware. The Lt-Wt learning process automatically drops insignificant inputs, unnecessary weights, and unneeded hidden neurons. The resulting sparse weight matrices loosen the coupling between layers, making the Lt-Wt net more tolerant of the failure of individual neurons.

In a CNN, the learned information is distributed over all of the weights. In a Lt-Wt net, the picture is less fuzzy: the large number of zero-valued weights makes the localized nature of the computation much more apparent. For image processing, a CNN does not scale well as image resolution increases because of its fully-connected structure; a Lt-Wt net scales much better owing to the sparsity of its weight matrices.

The small magnitude of the Lt-Wt net weights should result in smooth mappings and the small number of non-zero weights should result in low generalization error.

The Lt-Wt net has been successfully tested with up to 16 hidden layers and 4.4 million weights on problems having input vectors consisting of tens of thousands of elements.

Implementation in 8-bit Hardware

The inputs are mapped to an 8-bit fixed-point representation. All neurons have 8-bit outputs, and all arithmetic is integer-only, consisting of additions and subtractions alone. The activation function is implemented as a 2 kB lookup table.
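As an illustration of this input mapping, the short sketch below quantizes a real-valued reading to an 8-bit code. The scaling to a known range [lo, hi] and the rounding scheme are assumptions made for illustration; the actual Lt-Wt preprocessing is not specified here.

    /* Hypothetical host-side quantization of one input to an 8-bit fixed-point
       code; the range [lo, hi] (with hi > lo) and the rounding are assumptions. */
    #include <stdint.h>

    uint8_t quantize_input(double x, double lo, double hi)
    {
        if (x < lo) x = lo;                                   /* clamp to range   */
        if (x > hi) x = hi;
        return (uint8_t)((x - lo) * 255.0 / (hi - lo) + 0.5); /* 0..255 code      */
    }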

The executable Lt-Wt net consists of the following elements:

  • RAM (input and outputs)
  • Random access ROM for the lookup table
  • Read-once ROM to hold the network definition
  • Control logic
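To make the interplay of these elements concrete, here is a minimal C sketch of one forward pass. The network-definition encoding used here (per neuron: a count of excitatory source indices, those indices, then a count of inhibitory source indices and those indices), the LUT indexing, and the array sizes are illustrative assumptions, not the documented layout; the add/subtract structure corresponds to the excitor and inhibitor connections counted in the case-study tables below.

    /* Minimal forward-pass sketch using the elements above (hypothetical layout). */
    #include <stdint.h>
    #include <stdio.h>

    #define LUT_SIZE 2048                     /* 2 kB activation lookup table (ROM) */

    static uint8_t act_lut[LUT_SIZE];         /* random-access ROM: activation      */
    static uint8_t value[64];                 /* RAM: 8-bit inputs then neuron outs */

    /* Read-once ROM: neuron count, then per neuron an excitatory index list and an
       inhibitory index list. Toy net: value[2] = act(v[0] - v[1]); value[3] = act(v[2]). */
    static const uint16_t net_def[] = {
        2,                                    /* number of neurons                  */
        1, 0,  1, 1,                          /* neuron 0: +value[0], -value[1]     */
        1, 2,  0                              /* neuron 1: +value[2]                */
    };

    static void forward(uint16_t n_inputs)    /* control logic                      */
    {
        const uint16_t *p = net_def;          /* sequential (read-once) access      */
        uint16_t n_neurons = *p++;
        for (uint16_t n = 0; n < n_neurons; n++) {
            int16_t acc = 0;                  /* 16-bit accumulator                 */
            uint16_t ne = *p++;
            while (ne--) acc += value[*p++];  /* integer additions only             */
            uint16_t ni = *p++;
            while (ni--) acc -= value[*p++];  /* integer subtractions only          */
            if (acc < 0) acc = 0;             /* clamp into LUT range (assumption)  */
            if (acc >= LUT_SIZE) acc = LUT_SIZE - 1;
            value[n_inputs + n] = act_lut[acc];   /* 8-bit result back to RAM       */
        }
    }

    int main(void)
    {
        for (int i = 0; i < LUT_SIZE; i++)    /* toy activation: saturate at 255    */
            act_lut[i] = (uint8_t)(i > 255 ? 255 : i);
        value[0] = 200; value[1] = 60;        /* two quantized inputs               */
        forward(2);
        printf("outputs: %d %d\n", value[2], value[3]);   /* expect 140 140         */
        return 0;
    }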

Configuring these elements for fast execution on an 8-bit microcontroller, or for efficient implementation on an FPGA, is straightforward. An end-to-end 16-bit implementation of the Lt-Wt net on a Cortex-M4 requires only 0.1 kB of code. In the case of the Human Activity Recognizer, this enables operation in the sub-µW range on microprocessors, and at 10-20% of that power on FPGAs.

Case Studies

Case Study 1: Air Pressure System Failure in Scania Trucks

Problem
Predict failure of the Air Pressure System (APS) in heavy Scania trucks based on sensor data.*
Data
There are 170 features and a single yes/no outcome to be predicted. The training dataset consists of 60 k instances, of which 59 k belong to the negative class and 1 k to the positive class; 850 k feature values are missing. The test dataset consists of 16 k instances with 229 k missing values.
Training Results
The trained Lt-Wt net had an F1 score of 0.73, recall of 0.68, precision of 0.79, and accuracy of 0.99 on the test data.
Comparison With 32-Bit Floating-Point CNN
TABLE 1: Key differences between the 32-bit floating-point CNN and the 8/16-bit fixed-point Lt-Wt net.
Aspect               CNN       Lt-Wt Net
ROM (excl. LUT)      3.4 MB    130 kB
Memory fetches       1.7 M     63 k
Multiplications      871 k     0
Additions            871 k     62 k

The small number of memory fetches (1/72 in terms of bytes) and additions (1/14), combined with integer-only arithmetic, the absence of multiplications, and 8-bit data flows, results in a very fast and highly energy-efficient Lt-Wt net. Moreover, the tiny memory footprint makes it possible to place the whole network in the L1 cache of some microprocessors.

TABLE 2: The Lt-Wt deep neural net is fast, energy-efficient and compact.
Benefit    Key Contributors
Speed      • Reduced number of memory fetches (1/108 in terms of bytes)
           • Reduced number of arithmetic (1/27) and other operations
Energy     • Reduced number of bytes fetched (1/108)
           • No multiplications, floating-point or otherwise
           • Integer-only additions and subtractions
Size       • Lower RAM (1/4) and ROM (1/26) requirement
           • 8-bit input-neuron, neuron-neuron, and neuron-output data paths
           • No multiplier, floating-point or otherwise
           • Integer-only adder
TABLE 3: Detailed comparison of the 8/16-bit Lt-Wt net and the 32-bit floating-point CNN.
Lt-Wt Net: CONFIG
Config: Initial 170:1024:512:256:128:64:2; Final 170:1024:512:155:14:3:2
  Inputs                        I                      170
  Neurons                       N                      1.7 k
  Excitor connections           Ce                     31 k
  Inhibitor connections         Ci                     31 k

Lt-Wt Net: MEMORY
  ROM (excl. 0.6 kB LUT)        2(Ce + Ci) + 6N + 8    130 kB
  RAM                           I + N                  1.9 kB

Lt-Wt Net: OPERATIONS
  RAM/ROM fetch                 2(Ce + Ci) + 3N + 4    130 k
  16-bit copy to accum          N                      1.7 k
  8-bit int add to the accum    Ce                     31 k
  8-bit int accum subtract      Ci                     31 k
  8-bit write to RAM            N                      1.7 k
  Boolean comparison            Ce + Ci + N + 1        63 k
  Arithmetic comparison         2N                     3.4 k
  Single/double increment       Ce + Ci + N + 1        63 k
  Total ops                     5(Ce + Ci) + 9N + 6    324 k

Lt-Wt Net: MCU IMPLEMENTATION
  Ops                           K                      324 k
  Overhead                      O                      ~1 M
  Single-cycle equiv ops        K + O                  1.3 M
  ATmega2560 8-bit 16 MHz                              ~16 MIPS
  Inference pass duration                              0.08 s

32-bit Floating-Point Neural Net: CONFIG
Config: 170:1024:512:256:128:64:2
  Inputs                        I                      170
  Neurons                       N                      2 k
  Weights                       W                      871 k

32-bit Floating-Point Neural Net: MEMORY
  ROM (excl. LUT)               4(W + N)               3.4 MB
  RAM                           4(I + N)               8.5 kB

32-bit Floating-Point Neural Net: OPERATIONS
  32-bit memory fetch           2W + N                 1.7 M
  32-bit flt-pt multiply        W                      871 k
  32-bit flt-pt add             W                      871 k
  32-bit write to RAM           N                      2 k
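As a sanity check of the TABLE 3 cost formulas, the short snippet below plugs in the rounded APS case-study parameters (I, N, Ce, Ci) and reproduces the quoted ROM, fetch, operation, and timing figures to within rounding. It is an illustrative calculation only, not part of the Lt-Wt implementation.

    /* Worked check of the TABLE 3 formulas for the APS case study (illustrative). */
    #include <stdio.h>

    int main(void)
    {
        long I = 170, N = 1700, Ce = 31000, Ci = 31000;   /* Lt-Wt Net: CONFIG      */

        long rom_bytes = 2*(Ce + Ci) + 6*N + 8;   /* 134,208 B  ~ 130 kB            */
        long ram_bytes = I + N;                   /* 1,870 B    ~ 1.9 kB            */
        long fetches   = 2*(Ce + Ci) + 3*N + 4;   /* 129,104    ~ 130 k             */
        long total_ops = 5*(Ce + Ci) + 9*N + 6;   /* 325,306    ~ 324 k             */

        /* MCU estimate: add ~1 M cycles of overhead, divide by ~16 MIPS            */
        double seconds = (total_ops + 1.0e6) / 16.0e6;    /* ~0.08 s                */

        printf("ROM %ld B, RAM %ld B, fetches %ld, ops %ld, %.2f s\n",
               rom_bytes, ram_bytes, fetches, total_ops, seconds);
        return 0;
    }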
*Costa C.F., Nascimento M.A. (2016) IDA 2016 Industrial Challenge: Using Machine Learning for Predicting Failures. In: Bostrom H., Knobbe A., Soares C., Papapetrou P. (eds) Advances in Intelligent Data Analysis XV. IDA 2016. Lecture Notes in Computer Science, vol 9897. Springer, Cham

Case Study 2: Human Activity Recognition

Problem
Human activity recognition based on triaxial acceleration and triaxial angular velocity readings from a smartphone attached to the waist.*
Data
There are 561 features and six possible outputs: laying; sitting; standing; walking; walking downstairs; walking upstairs. The features are based on the time-series data from the two sensors, and include time domain as well as frequency domain components. The training dataset consists of 7,352 instances, out of which 986 represent the minority class and 1,407 the majority class. The test dataset consists of 2,947 instances, out of which 420 represent the minority and 537 the majority class.
Training Results
The trained Lt-Wt net had an F1 score, recall, precision, and accuracy equal to 0.95 on test data.
Comparison With 32-Bit Floating-Point CNN
TABLE 1: Key differences between the 32-bit floating-point CNN and the 8/16-bit fixed-point Lt-Wt net.
Aspect               CNN       Lt-Wt Net
Test data accuracy   0.95**    0.95
ROM (excl. LUT)      67 kB     1.4 kB
Memory fetches       34 k      1.4 k
Multiplications      17 k      0
Additions            17 k      0.7 k

The small number of memory fetches (1/66 in terms of bytes) and additions (1/26), combined with integer-only arithmetic, the absence of multiplications, and 8-bit data flows, results in a very fast and highly energy-efficient Lt-Wt net. Moreover, the tiny memory footprint makes it possible to place the whole network in the L1 cache of some microprocessors.

TABLE 2: The Lt-Wt deep neural net is fast, energy-efficient and compact.
Benefit    Key Contributors
Speed      • Reduced number of memory fetches (1/66 in terms of bytes)
           • Reduced number of arithmetic (1/52) and other operations
Energy     • Reduced number of bytes fetched (1/66)
           • No multiplications, floating-point or otherwise
           • Integer-only additions and subtractions
Size       • Lower RAM (1/7) and ROM (1/48) requirement
           • 8-bit input-neuron, neuron-neuron, and neuron-output data paths
           • No multiplier, floating-point or otherwise
           • Integer-only adder
TABLE 3: Detailed comparison of the 8/16-bit Lt-Wt net and the 32-bit floating-point CNN.
Lt-Wt Net: CONFIG
Config: Initial 561:8:3:6; Final 333:7:3:6
  Inputs                        I                      333
  Neurons                       N                      16
  Excitor connections           Ce                     0.3 k
  Inhibitor connections         Ci                     0.3 k

Lt-Wt Net: MEMORY
  ROM (excl. 0.6 kB LUT)        2(Ce + Ci) + 6N + 8    1.4 kB
  RAM                           I + N                  0.4 kB

Lt-Wt Net: OPERATIONS
  RAM/ROM fetch                 2(Ce + Ci) + 3N + 4    1.4 k
  16-bit copy to accum          N                      16
  8-bit int add to the accum    Ce                     0.3 k
  8-bit int accum subtract      Ci                     0.3 k
  8-bit write to RAM            N                      16
  Boolean comparison            Ce + Ci + N + 1        0.7 k
  Arithmetic comparison         2N                     32
  Single/double increment       Ce + Ci + N + 1        0.7 k
  Total ops                     5(Ce + Ci) + 9N + 6    3.4 k

Lt-Wt Net: MCU IMPLEMENTATION
  Ops                           K                      3.4 k
  Overhead                      O                      10 k
  Single-cycle equiv ops        K + O                  14 k
  ATmega2560 8-bit 16 MHz                              ~16 MIPS
  Inference pass duration                              0.9 ms

32-bit Floating-Point Neural Net: CONFIG
Config: 561:30:6**
  Inputs                        I                      0.6 k
  Neurons                       N                      36
  Weights                       W                      17 k

32-bit Floating-Point Neural Net: MEMORY
  ROM (excl. LUT)               4(W + N)               67 kB
  RAM                           4(I + N)               2.3 kB

32-bit Floating-Point Neural Net: OPERATIONS
  32-bit memory fetch           2W + N                 34 k
  32-bit flt-pt multiply        W                      17 k
  32-bit flt-pt add             W                      17 k
  32-bit write to RAM           N                      36
*Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.
**Human Activity Recognition with Smartphones

Case Study 3: Aircraft Engine Predictive Maintenance

Problem
Predict failure of aircraft engines based on data collected during run-to-failure simulations.*
Data
Inputs from 21 sensors, including temperature, pressure, RPM, fuel-flow, fuel-air-ratio, and bleed-enthalpy measurements, are used to make one of three possible predictions: engine failure in 1-15, 16-30, or 30+ operational cycles. The training and test datasets consist of 21 k and 13 k instances, respectively.
Training Results
The trained Lt-Wt net had an F1 score, recall, precision, and accuracy equal to 0.95 on test data.
Comparison With 32-Bit Floating-Point CNN
TABLE 1: Key differences between the 32-bit floating-point CNN and the 8/16-bit fixed-point Lt-Wt net.
Aspect               CNN       Lt-Wt Net
Test data accuracy   0.92**    0.95
ROM (excl. LUT)      16 kB     7.5 kB
Memory fetches       7.7 k     6.9 k
Multiplications      3.8 k     0
Additions            3.8 k     3.1 k

The lower number of memory fetches (1/3 in terms of bytes) and additions (1/1.2), combined with integer-only arithmetic, the absence of multiplications, and 8-bit data flows, results in a very fast and highly energy-efficient Lt-Wt net. Moreover, the small memory footprint makes it possible to place the whole network in the L1 cache of some microprocessors.

TABLE 2: The Lt-Wt deep neural net is fast, energy-efficient and compact.
Benefit    Key Contributors
Speed      • Reduced number of memory fetches (1/3 in terms of bytes)
           • Reduced number of arithmetic (1/2.5) and other operations
Energy     • Reduced number of bytes fetched (1/3)
           • No multiplications, floating-point or otherwise
           • Integer-only additions and subtractions
Size       • Lower RAM (1/2) and ROM (1/2) requirement
           • 8-bit input-neuron, neuron-neuron, and neuron-output data paths
           • No multiplier, floating-point or otherwise
           • Integer-only adder
TABLE 3: Detailed comparison of the 8/16-bit Lt-Wt net and the 32-bit floating-point CNN.
Lt-Wt Net: CONFIG
Config: Initial 35:128:64:32:16:8:3; Final 35:128:64:32:15:5:3
  Inputs                        I                      35
  Neurons                       N                      247
  Excitor connections           Ce                     1.6 k
  Inhibitor connections         Ci                     1.5 k

Lt-Wt Net: MEMORY
  ROM (excl. 0.6 kB LUT)        2(Ce + Ci) + 6N + 8    7.5 kB
  RAM                           I + N                  0.3 kB

Lt-Wt Net: OPERATIONS
  RAM/ROM fetch                 2(Ce + Ci) + 3N + 4    6.9 k
  16-bit copy to accum          N                      247
  8-bit int add to the accum    Ce                     1.6 k
  8-bit int accum subtract      Ci                     1.5 k
  8-bit write to RAM            N                      247
  Boolean comparison            Ce + Ci + N + 1        3.3 k
  Arithmetic comparison         2N                     494
  Single/double increment       Ce + Ci + N + 1        3.3 k
  Total ops                     5(Ce + Ci) + 9N + 6    18 k

Lt-Wt Net: MCU IMPLEMENTATION
  Ops                           K                      18 k
  Overhead                      O                      52 k
  Single-cycle equiv ops        K + O                  70 k
  ATmega2560 8-bit 16 MHz                              ~16 MIPS
  Inference pass duration                              4 ms

32-bit Floating-Point Neural Net: CONFIG
Config: 35:100:3
  Inputs                        I                      35
  Neurons                       N                      103
  Weights                       W                      3.8 k

32-bit Floating-Point Neural Net: MEMORY
  ROM (excl. LUT)               4(W + N)               16 kB
  RAM                           4(I + N)               0.6 kB

32-bit Floating-Point Neural Net: OPERATIONS
  32-bit memory fetch           2W + N                 7.7 k
  32-bit flt-pt multiply        W                      3.8 k
  32-bit flt-pt add             W                      3.8 k
  32-bit write to RAM           N                      103
*A. Saxena, K. Goebel, D. Simon, and N. Eklund, “Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation”, in the Proceedings of the 1st International Conference on Prognostics and Health Management (PHM08), Denver CO, Oct 2008.
**Predictive Maintenance: Simulated aircraft engine run-to-failure

Downloads

  • Lightweight Neural Networks (arXiv preprint, PDF)
  • Case Study: Real-Time Condition Monitoring (PDF)
  • Case Study: Predictive Maintenance (PDF)
  • Case Study: Wearable Intelligence (PDF)