Abstract: To address the difficulty of deploying convolutional neural networks on resource-constrained devices, a high-performance fast convolution algorithm (FastInfer) was proposed for the FT-2000/4 multi-core processor. The algorithm optimized general matrix multiplication with a blocking strategy that packs frequently accessed data into the cache levels closest to the processor, improving memory access efficiency during computation. In addition, a high-performance matrix multiplication microkernel was designed and implemented; it updates the result tile with vector outer-product operations to raise the ratio of computation to memory access, thereby hiding the latency of memory instructions as much as possible. Experimental results showed that FastInfer achieved a peak performance of 99.56 GFLOPS on the FT-2000/4 processor. In general matrix multiplication tests across a range of input sizes, FastInfer achieved speedups of 1.07 and 1.52 times over OpenBLAS; in convolution tests, it ran 1.32 times faster than the ARM Compute Library, delivering high-performance convolution on the FT-2000/4 multi-core processor.
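The outer-product microkernel described above can be illustrated with a minimal sketch using AArch64 NEON intrinsics (the FT-2000/4 implements ARMv8). The 4x4 tile size, the packing layouts, and the function name below are assumptions made for illustration only; they are not taken from the paper and need not match the actual FastInfer kernel dimensions.

/* Hypothetical 4x4 FP32 outer-product microkernel sketch (illustration only).
 * Assumptions: A is packed as an MR x kc panel, one MR-element column per k;
 * B is packed as a kc x NR panel, one NR-element row per k; C is column-major
 * with leading dimension ldc. */
#include <arm_neon.h>

enum { MR = 4, NR = 4 };

void micro_kernel_4x4(int kc, const float *a, const float *b,
                      float *c, int ldc)
{
    /* Keep the 4x4 C tile in four vector registers, one per column. */
    float32x4_t c0 = vld1q_f32(c + 0 * ldc);
    float32x4_t c1 = vld1q_f32(c + 1 * ldc);
    float32x4_t c2 = vld1q_f32(c + 2 * ldc);
    float32x4_t c3 = vld1q_f32(c + 3 * ldc);

    for (int k = 0; k < kc; ++k) {
        /* One column of the packed A panel and one row of the packed B panel. */
        float32x4_t ak = vld1q_f32(a + k * MR);
        float32x4_t bk = vld1q_f32(b + k * NR);

        /* Rank-1 (outer-product) update: C(:,j) += A(:,k) * B(k,j). */
        c0 = vfmaq_laneq_f32(c0, ak, bk, 0);
        c1 = vfmaq_laneq_f32(c1, ak, bk, 1);
        c2 = vfmaq_laneq_f32(c2, ak, bk, 2);
        c3 = vfmaq_laneq_f32(c3, ak, bk, 3);
    }

    vst1q_f32(c + 0 * ldc, c0);
    vst1q_f32(c + 1 * ldc, c1);
    vst1q_f32(c + 2 * ldc, c2);
    vst1q_f32(c + 3 * ldc, c3);
}

In such a kernel, each iteration performs 32 floating-point operations while loading only 8 input elements, which is the sense in which an outer-product formulation raises the computation-to-memory-access ratio and helps overlap loads with fused multiply-add instructions.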