Autotuning the Performance of Matrix Multiplication and Convolution for Deep Learning on CPU
*Changbo Chen (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences)
Haoyu Chi (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences)
Deep learning (DL) compilers have emerged to close the gap between abundant, fast-growing DL models and the lagging high-performance implementations of these models on diverse hardware devices. In this work, we introduce several strategies and integrate them into a unified autotuning framework, called AutoMCL, that improves the performance of DL compilers by combining human expertise with machine-learned intelligence. Preliminary experiments on different CPU platforms show that, for fully connected neural networks on an Intel CPU, the proposed framework achieves an average $29.07\times$ speedup over TensorFlow and an average $1.55\times$ speedup over the state-of-the-art DL compiler AutoTVM while consuming only $0.47\times$ of its optimization time,
and, for several well-known convolutional neural networks on multiple CPUs, an average $1.36\times$ speedup over TensorFlow and an average $1.09\times$ speedup over AutoTVM with similar compilation time.
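To illustrate the kind of search-based autotuning such frameworks perform, the following is a minimal, self-contained sketch: it times a blocked matrix multiplication under several candidate tile sizes and keeps the fastest one for the current machine. This is only an illustrative toy (the function names, candidate set, and problem size are our own assumptions), not the AutoMCL implementation, which searches a far richer schedule space with learned cost models.

```python
import time
import numpy as np

def blocked_matmul(A, B, tile):
    """Square matrix multiplication computed tile by tile.

    `tile` is the blocking factor; choosing it well improves cache reuse,
    which is exactly the kind of knob an autotuner searches over.
    """
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C

def autotune_tile(n=128, candidates=(16, 32, 64)):
    """Pick the fastest tile size by measuring each candidate on real input.

    A toy stand-in for autotuning: enumerate configurations, benchmark each,
    verify correctness against a reference, and return the best one found.
    """
    rng = np.random.default_rng(0)
    A, B = rng.random((n, n)), rng.random((n, n))
    reference = A @ B
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        start = time.perf_counter()
        C = blocked_matmul(A, B, tile)
        elapsed = time.perf_counter() - start
        assert np.allclose(C, reference)  # reject incorrect configurations
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile, best_time
```

Real DL compilers such as AutoTVM replace this exhaustive timing loop with learned cost models and guided search, since the schedule space (tiling, loop ordering, vectorization, unrolling) is far too large to benchmark exhaustively.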