When I was at AMD, one of the most exciting parts of my work was optimizing math libraries at the CPU level. I got the chance to contribute to libFLAME, AMD's high-performance linear algebra library, and that experience gave me a front-row seat to the art (and science) of squeezing every drop of performance out of silicon.
Matrix multiplication was at the heart of it. On paper, multiplying two matrices is simple. In practice, making it run fast on a modern CPU is a different beast entirely. That's where AVX instructions came in: powerful SIMD (Single Instruction, Multiple Data) extensions that let me process multiple floating-point values in parallel. Every cache miss, misaligned load, and wasted cycle becomes dramatically apparent when you scale to billions of operations per second.

What makes this work even more exciting is how relevant it remains today. Whether you're training large language models, running recommendation systems, or crunching simulations in scientific computing, matrix multiplication is everywhere. While GPUs dominate AI training, CPUs - especially modern AMD EPYC chips with AVX-512 - still handle a huge share of inference, analytics, and mixed workloads in the cloud.
The power of AVX lies in processing many floating-point values with a single instruction. On AMD's Zen 4 and Zen 5 architectures, AVX-512 isn't just about wide registers; it's about sustaining high clock speeds, handling real-world HPC and AI workloads, and letting CPUs punch far above their weight in performance per watt.
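To make "one instruction, many values" concrete, here is a minimal sketch of AVX-512 intrinsics - my own illustration, not library code. A single _mm512_mul_ps multiplies 16 single-precision floats at once:

#include <immintrin.h>
#include <stdio.h>

// Minimal AVX-512 illustration: one instruction multiplies 16 floats.
// Build with something like: gcc -O2 -mavx512f simd_demo.c
int main(void) {
    float a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 2.0f; }

    __m512 va = _mm512_loadu_ps(a);    // load 16 single-precision floats
    __m512 vb = _mm512_loadu_ps(b);
    __m512 vc = _mm512_mul_ps(va, vb); // 16 multiplies in one instruction
    _mm512_storeu_ps(c, vc);

    for (int i = 0; i < 16; i++) printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}

The scalar equivalent would loop 16 times; here the hardware does all 16 lanes in a single operation.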
By the time I was working on this, AMD CPUs had excellent AVX2 and later AVX-512 support. These weren’t just “bigger registers” - they came with features like fused multiply-add (FMA), which computes a * b + c in a single instruction with minimal rounding error, since the multiply and add share one rounding step.

On AMD’s Zen 4 architecture, the beauty was that, unlike Intel’s early AVX-512 implementations (which often throttled clock frequencies under heavy vector loads), AMD managed to sustain high clock speeds. That meant we could push performance without hitting frequency cliffs.
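As a rough illustration of how FMA shows up at the intrinsics level - again my own sketch, with fma8 as a made-up helper name, not anything from libFLAME - _mm256_fmadd_ps computes a * b + c across eight lanes with a single rounding, where a separate multiply and add would round twice:

#include <immintrin.h>

// Sketch: d[i] = a[i] * b[i] + c[i] for 8 floats using AVX2 + FMA3.
// The fused form rounds once, so the intermediate product keeps full
// precision. Build with something like: gcc -O2 -mavx2 -mfma
static void fma8(const float *a, const float *b, const float *c, float *d) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(c);
    __m256 vd = _mm256_fmadd_ps(va, vb, vc); // a*b + c, one rounding, one instruction
    _mm256_storeu_ps(d, vd);
}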
The baseline was always the three-nested-loop matrix multiplication:
// Baseline: C (m x n) += A (m x k_size) * B (k_size x n), row-major.
void matmul_naive(int m, int n, int k_size,
                  const float A[m][k_size], const float B[k_size][n],
                  float C[m][n]) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < k_size; k++) {
                C[i][j] += A[i][k] * B[k][j]; // strides down a column of B
            }
        }
    }
}
It’s correct but painfully slow: the innermost loop strides down a column of B (cache-hostile in row-major storage), nothing gets vectorized, and the CPU’s wide execution units sit idle.
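To show the gap, here is one common first step - my own sketch under stated assumptions, not the actual libFLAME kernel, and matmul_ikj_avx2 is an illustrative name. Reordering the loops to i-k-j makes both B and C accesses unit-stride, and the inner loop then vectorizes naturally with AVX2 FMA. For brevity it assumes n is a multiple of 8:

#include <immintrin.h>

// Sketch: same computation as matmul_naive, but loop order i-k-j and an
// 8-wide vectorized inner loop. Build with: gcc -O2 -mavx2 -mfma
void matmul_ikj_avx2(int m, int n, int k_size,
                     const float A[m][k_size], const float B[k_size][n],
                     float C[m][n]) {
    for (int i = 0; i < m; i++) {
        for (int k = 0; k < k_size; k++) {
            __m256 va = _mm256_set1_ps(A[i][k]); // broadcast A[i][k] to 8 lanes
            for (int j = 0; j < n; j += 8) {     // unit-stride over B and C
                __m256 vb = _mm256_loadu_ps(&B[k][j]);
                __m256 vc = _mm256_loadu_ps(&C[i][j]);
                vc = _mm256_fmadd_ps(va, vb, vc); // C[i][j..j+7] += A[i][k] * B[k][j..j+7]
                _mm256_storeu_ps(&C[i][j], vc);
            }
        }
    }
}

Even this simple reordering plus vectorization changes the memory access pattern from column walks to sequential streams, which is exactly the kind of cache behavior the naive version ignores. Real library kernels go much further, with blocking, packing, and register tiling on top.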