I. Introduction
The enthusiasm for deep neural network (DNN) research, initially driven by convolutional neural networks (CNNs), has expanded to a variety of model families, including sequence-to-sequence (seq2seq) models, recurrent neural networks (RNNs), and graph neural networks (GNNs) [1]–[3]. Accordingly, DNN applications have become indispensable tools in numerous fields [4]–[15].

These prevalent networks share a common computational feature. Convolution layers, the primary building blocks of CNNs, generally go through a process called lowering (im2col) [16]. This strategy improves thread-level parallelism by untying the deeply nested loops of convolution into a single general matrix-matrix multiplication (GEMM), as sketched below. Transformers, which became widely known through BERT [13] and GPT [17], include an attention mechanism consisting of multiple GEMMs that produce the key, query, and value matrices and ultimately compute the attention distribution. Likewise, GEMM is the principal operation of RNN-type networks for obtaining hidden states and other state vectors; this is a general matrix-vector multiplication (GEMV) for a single input but becomes a GEMM once inputs are batched. The same is true for the fully connected layers and embedding operations that many types of networks adopt to extract inference results or specific information.
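To make the lowering step concrete, the following minimal NumPy sketch unfolds input patches into columns and performs the convolution as one GEMM. The helper names (im2col, conv2d_gemm, conv2d_direct) and the chosen data layout are illustrative assumptions, not the scheme of any particular library; padding and dilation are omitted for brevity.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold an (N, C, H, W) input into a (C*kh*kw, N*out_h*out_w) matrix."""
    n, c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, n * out_h * out_w), dtype=x.dtype)
    idx = 0
    for img in range(n):
        for i in range(out_h):
            for j in range(out_w):
                patch = x[img, :, i*stride:i*stride+kh, j*stride:j*stride+kw]
                cols[:, idx] = patch.ravel()  # one receptive field per column
                idx += 1
    return cols, out_h, out_w

def conv2d_gemm(x, weights, stride=1):
    """Convolution via lowering: (K, C*kh*kw) @ (C*kh*kw, N*out_h*out_w)."""
    k, c, kh, kw = weights.shape
    n = x.shape[0]
    cols, out_h, out_w = im2col(x, kh, kw, stride)
    out = weights.reshape(k, -1) @ cols  # the single GEMM
    return out.reshape(k, n, out_h, out_w).transpose(1, 0, 2, 3)

def conv2d_direct(x, weights, stride=1):
    """Reference implementation with the deeply nested convolution loops."""
    n, c, h, w = x.shape
    k, _, kh, kw = weights.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((n, k, out_h, out_w), dtype=x.dtype)
    for img in range(n):
        for f in range(k):
            for i in range(out_h):
                for j in range(out_w):
                    patch = x[img, :, i*stride:i*stride+kh, j*stride:j*stride+kw]
                    out[img, f, i, j] = np.sum(patch * weights[f])
    return out

x = np.random.randn(2, 3, 8, 8).astype(np.float32)
w = np.random.randn(4, 3, 3, 3).astype(np.float32)
assert np.allclose(conv2d_gemm(x, w), conv2d_direct(x, w), atol=1e-4)
```

The nested loops in conv2d_direct expose little parallelism per iteration, whereas the lowered form delegates all of the arithmetic to one large GEMM, which highly tuned BLAS kernels parallelize across threads.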
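Similarly, the sketch below shows how scaled dot-product attention decomposes into a chain of GEMMs. It is a simplified illustration for a single unbatched, unmasked head; the helper name single_head_attention and the dimensions are assumptions for exposition, not the formulation of any cited model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, wq, wk, wv):
    """Scaled dot-product attention expressed purely as GEMMs plus a softmax."""
    q = x @ wq                         # GEMM 1: queries
    k = x @ wk                         # GEMM 2: keys
    v = x @ wv                         # GEMM 3: values
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)    # GEMM 4: attention scores
    return softmax(scores) @ v         # GEMM 5: attention-weighted values

seq_len, d_model, d_head = 6, 16, 8
x = np.random.randn(seq_len, d_model)
wq, wk, wv = (np.random.randn(d_model, d_head) for _ in range(3))
out = single_head_attention(x, wq, wk, wv)
assert out.shape == (seq_len, d_head)
```

The same pattern extends to the RNN case mentioned above: each projection such as x @ wq is a GEMV for a single input vector and becomes a GEMM as soon as a batch dimension is added.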