In this project, you are required to design a systolic array that efficiently implements the logic required to support per-channel activation tensor quantization for a convolutional neural network. You are required to implement the design in SystemVerilog, simulate and synthesize it, and then design its layout. Area, power, and energy will be analyzed and compared with those of a conventional systolic array.

Skills you will acquire: SystemVerilog, Synopsys Design Compiler, Cadence Innovus.

Convolutional neural networks (CNNs) achieve state-of-the-art results in numerous applications. Yet they are compute-intensive: classifying a single RGB image from the ImageNet dataset, for example, requires billions of multiply-and-accumulate (MAC) operations. Quantization is used to reduce this computational load. Quantizing the inputs (activations) and weights from FP32 to INT8, for example, simplifies the hardware implementation.
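To give a feel for where the MAC count comes from, the following minimal sketch counts the MACs of a single convolution layer. The layer dimensions are hypothetical and chosen only for illustration; they are not taken from any specific network.

```python
def conv_macs(out_height, out_width, in_channels, out_channels, kernel_size):
    """MACs of one convolution layer: every output element needs
    in_channels * kernel_size * kernel_size multiply-accumulates."""
    macs_per_output = in_channels * kernel_size * kernel_size
    num_outputs = out_height * out_width * out_channels
    return num_outputs * macs_per_output

# Hypothetical layer: 56x56 output map, 64 input and 64 output channels, 3x3 kernels.
layer_macs = conv_macs(56, 56, 64, 64, 3)
print(f"{layer_macs:,} MACs for one layer")
```

A single layer of this size already needs more than a hundred million MACs, so a network with tens of such layers reaches the billions.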

The common quantization approach is kernel-wise: each weight kernel is coupled to its own scaling factor, while the input (activation) tensor is quantized globally, that is, with a single scaling factor. Each output activation is then simply multiplied by the two scaling factors after the convolution operation, which is easy to implement in hardware.
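The conventional scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: symmetric INT8 quantization with a max-abs scale, a single flattened activation window, and made-up data. The helper name `quantize` is hypothetical.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization (assumed scheme): values ~= integers * scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale

# Illustrative data: one flattened activation window and two weight kernels.
activations = [0.5, -1.2, 0.3, 2.0]
kernels = [[0.1, -0.4, 0.25, 0.05], [1.0, 0.6, -0.75, 0.2]]

q_act, s_act = quantize(activations)  # one scale for the whole activation tensor

outputs = []
for kernel in kernels:
    q_w, s_w = quantize(kernel)  # one scale per kernel
    acc = sum(a * w for a, w in zip(q_act, q_w))  # integer-only MACs
    outputs.append(acc * s_act * s_w)  # single rescale after accumulation

# Full-precision result, for comparison only.
reference = [sum(a * w for a, w in zip(activations, k)) for k in kernels]
```

Because both scaling factors are constant over the whole accumulation, the integer sum is read out once and rescaled once per output.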

That said, quantizing the input (activation) tensor per channel as well leads to superior results [1]. A hardware implementation of per-channel activation tensor quantization has never been done because of the difficulties involved. Per-channel quantization requires many scale factors; as a consequence, the partial sums must be fetched from the processing elements (PEs) more often, and more floating-point (FP) operations must be performed on them.
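The extra cost can be illustrated with a small Python sketch that computes one output value both ways. It assumes symmetric INT8 quantization with max-abs scales and uses made-up data in which the two input channels have very different ranges. The function names and the readout counter are hypothetical and serve only to make the extra work visible.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization (assumed scheme): values ~= integers * scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale

def dot_per_tensor(act_channels, kernel_channels):
    """One activation scale for all channels: one readout, one rescale."""
    flat_a = [v for ch in act_channels for v in ch]
    flat_w = [v for ch in kernel_channels for v in ch]
    q_a, s_a = quantize(flat_a)
    q_w, s_w = quantize(flat_w)
    acc = sum(a * w for a, w in zip(q_a, q_w))
    return acc * s_a * s_w, 1

def dot_per_channel(act_channels, kernel_channels):
    """One activation scale per channel: each channel's partial sum
    must be read out and rescaled before it can be accumulated."""
    _, s_w = quantize([v for ch in kernel_channels for v in ch])
    total, readouts = 0.0, 0
    for a_ch, w_ch in zip(act_channels, kernel_channels):
        q_a, s_a = quantize(a_ch)
        q_w = [round(w / s_w) for w in w_ch]
        partial = sum(a * w for a, w in zip(q_a, q_w))  # integer partial sum
        total += partial * s_a  # FP multiply-add per channel
        readouts += 1
    return total * s_w, readouts

# Illustrative data: channel 0 has a much smaller range than channel 1.
acts = [[0.02, -0.03], [8.0, -5.0]]
kernel = [[1.0, -1.0], [0.2, 0.3]]
exact = sum(a * w for a_ch, w_ch in zip(acts, kernel) for a, w in zip(a_ch, w_ch))

tensor_out, tensor_readouts = dot_per_tensor(acts, kernel)
channel_out, channel_readouts = dot_per_channel(acts, kernel)
err_tensor = abs(tensor_out - exact)
err_channel = abs(channel_out - exact)
```

In this example the global scale rounds the small-range channel to zero, while the per-channel variant preserves it at the price of one partial-sum readout and one FP operation per channel.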

Systolic arrays are arrays of simple processing elements built for fast and efficient execution of regular algorithms that perform the same task on different data at different points in time. Each processing element performs a computation and then passes data to its immediate neighbor. In general, systolic arrays are very regular and easily scalable.
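As a rough behavioral model of this idea, the Python sketch below simulates a systolic array that multiplies two matrices cycle by cycle. The output-stationary dataflow (each PE keeps its own accumulator while operands flow right and down) and the skewed input schedule are assumptions made for illustration; the project does not prescribe a particular dataflow.

```python
def systolic_matmul(A, B):
    """Cycle-level model of an M x N output-stationary systolic array.

    PE(i, j) accumulates C[i][j]. Rows of A enter from the left and
    columns of B enter from the top, each delayed by its row/column index.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]    # per-PE accumulators
    a_reg = [[0] * N for _ in range(M)]  # operand each PE forwards to the right
    b_reg = [[0] * N for _ in range(M)]  # operand each PE forwards downward
    cycles = M + N + K - 2               # last operand pair reaches PE(M-1, N-1)

    for t in range(cycles):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                # Operand from the left neighbor, or skewed input at the edge.
                if j > 0:
                    a_in = a_reg[i][j - 1]
                else:
                    a_in = A[i][t - i] if 0 <= t - i < K else 0
                # Operand from the upper neighbor, or skewed input at the edge.
                if i > 0:
                    b_in = b_reg[i - 1][j]
                else:
                    b_in = B[t - j][j] if 0 <= t - j < K else 0
                acc[i][j] += a_in * b_in  # one MAC per PE per cycle
                new_a[i][j], new_b[i][j] = a_in, b_in
        a_reg, b_reg = new_a, new_b       # all registers update together
    return acc, cycles

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C, num_cycles = systolic_matmul(A, B)
```

Each PE only ever talks to its left and upper neighbors, which is what makes the structure regular and easy to scale.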

A dedicated hardware implementation can dramatically speed up the execution of an algorithm.

__Prerequisites:__ computer architecture (EE or CS).

* This is a research-oriented project

[1] Banner, Ron, Yury Nahshan, and Daniel Soudry. “Post training 4-bit quantization of convolutional networks for rapid-deployment.” Advances in Neural Information Processing Systems. 2019.