We are living in the era of Big Data, in which the amount of generated data is growing exponentially, driven primarily by the growing use of connected devices and embedded sensors. Machine learning plays a critical role in extracting meaningful information from this data. Furthermore, in a growing number of application domains, resource-constrained hardware platforms are required to run machine-learning applications locally. These constraints typically take the form of limits on energy, power consumption, throughput, latency, and area, and they call for new design strategies. Recently, hardware acceleration and parallelization have been established as key approaches for improving the energy efficiency and performance of computational kernels. Although accelerators provide an opportunity to address resource constraints, they primarily improve computational operations rather than memory-access and data-movement operations. In data-centric workloads, however, data movement becomes the bottleneck. Thus, both the leverage of hardware acceleration and the range of kernels for which it is beneficial are limited.
This dissertation focuses on three topics. First, at the technology level, we analyze the leverage that 3D IC technology offers for increasing the performance gains from hardware acceleration. This is demonstrated by introducing a three-layer architecture, with vertical 3D interconnects between adjacent layers, that implements the sparse matrix-vector multiplication kernel. Second, for dense linear-algebra computation kernels, we demonstrate the first charge-domain in-memory-computing accelerator that integrates dense weight storage and multiplication in order to reduce overall data movement. This is achieved by performing the computations inside very compact memory bit cells and by using highly linear and stable interdigitated Metal-Oxide-Metal (MOM) capacitors that are laid out on top of the bit cells, occupying no additional area. Third, we examine an important theoretical approach that can be used to mitigate the memory-access bottleneck. In particular, we study the problem of matrix factorization, which, in its low-rank form, can drastically reduce the required memory footprint. Furthermore, low-rank factorization identifies the important structures in the data, thereby creating regularities in the memory-access patterns.
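To make the memory-footprint argument for low-rank factorization concrete, the following is a minimal illustrative sketch, not the method developed in the dissertation. It assumes an m-by-n matrix A that can be written exactly as the product of an m-by-r factor U and an r-by-n factor V_t. The function names (dense_storage, low_rank_storage, low_rank_matvec) are hypothetical and chosen only for this example.

```python
# Illustrative sketch: storage and matrix-vector product for a
# rank-r factorization A = U @ V_t (names here are hypothetical).

def dense_storage(m, n):
    """Number of stored values for a dense m-by-n matrix."""
    return m * n


def low_rank_storage(m, n, r):
    """Number of stored values for factors U (m-by-r) and V_t (r-by-n)."""
    return r * (m + n)


def matvec(A, x):
    """Plain dense matrix-vector product on lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def low_rank_matvec(U, V_t, x):
    """Compute A @ x as U @ (V_t @ x) without ever forming A."""
    return matvec(U, matvec(V_t, x))


if __name__ == "__main__":
    m, n, r = 1000, 1000, 10
    print(dense_storage(m, n), low_rank_storage(m, n, r))

    # Tiny rank-1 check: A = U @ V_t = [[3, 4], [6, 8]]
    U = [[1], [2]]
    V_t = [[3, 4]]
    print(low_rank_matvec(U, V_t, [1, 1]))
```

For a 1000-by-1000 matrix of rank 10, the factors hold 20,000 values instead of 1,000,000, a 50-fold reduction, and the product reads the two factors in a regular, streaming order.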