How To Optimize Gemm -- a revised version

There is a very good guide on how to optimize GEMM:

https://github.com/flame/how-to-optimize-gemm/wiki

Unfortunately, it doesn't emphasize the key ideas and motivations behind each optimization, and the best performance it reaches is much lower than what a modern library (like Eigen) can achieve. Here I revise it and try to provide a more focused version.

[Figure: performance curves of each optimization step]

Note: the last curve is the Eigen implementation, included as a high-performance reference.

To reproduce the results shown above, run the following commands under the src/HowToOptimizeGemm folder:

# change the NEW variable in the makefile to MMult0
make run
# change the NEW variable in the makefile to MMultL0
make run
...
# change the NEW variable in the makefile to MMultLEigen
make run

# the performance numbers are now recorded in the output_MMult*.m files;
# plot them with python
python3 ./PlotAll.py output_MMult0.m output_MMultL*.m

Original code: MMult0

/* Create macros so that the matrices are stored in column-major order */
#define A(i, j) a[(j)*lda + (i)]
#define B(i, j) b[(j)*ldb + (i)]
#define C(i, j) c[(j)*ldc + (i)]

/* Routine for computing C = A * B + C */
void MY_MMult(int m, int n, int k, double *a, int lda, double *b, int ldb,
              double *c, int ldc) {
  int i, j, p;

  for (i = 0; i < m; i++) {     /* Loop over the rows of C */
    for (j = 0; j < n; j++) {   /* Loop over the columns of C */
      for (p = 0; p < k; p++) { /* Update C( i,j ) with the inner
                                product of the ith row of A and
                                the jth column of B */
        C(i, j) = C(i, j) + A(i, p) * B(p, j); // <==== S0
      }
    }
  }
}
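
Each instance of S0 performs one multiply and one add, so the full routine executes 2*m*n*k floating-point operations; dividing this count by the measured time is the usual way GEMM GFLOPS figures are computed (presumably also the convention behind the curves above).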

From the performance curve, we see:

  • for smaller matrices, it doesn't fully exploit the GFLOPS capability of the CPU;
  • for bigger matrices (especially when the matrix size exceeds the L3 cache), performance drops even further, so cache efficiency is poor.

Fundamental strategy

We find that there are no dependencies between the instances of the inner-loop statement S0(i,j,p): in theory they can be executed in any order without affecting the accuracy of the result. The original code chose this particular execution order just because it aligns best with the mathematical definition of matrix multiplication.

In practice a small and acceptable error is introduced when the order of the additions changes.

S0(i,j,p):
    C(i, j) = C(i, j) + A(i, p) * B(p, j);
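
Why can the order change the result at all? Because floating-point addition is not associative: rounding happens after every addition. A minimal standalone C sketch (not part of this repository) illustrating the effect:

#include <stdio.h>

int main(void) {
  double a = 1e16, b = -1e16, c = 1.0;
  /* The same three numbers summed in two groupings give
     different results because of intermediate rounding. */
  printf("(a + b) + c = %.17g\n", (a + b) + c); /* prints 1 */
  printf("a + (b + c) = %.17g\n", a + (b + c)); /* prints 0 */
  return 0;
}

For the well-scaled data used in a GEMM benchmark, the differences from reordering stay at the level of rounding error, which is why the re-organization below is acceptable.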

So the fundamental optimization strategy is:

Re-organize these instances of statement S0 to find an order that maximizes register, cache, and ALU efficiency.

At first you may doubt whether the execution order can make any big difference. Well, the order actually matters a lot; let's check it out.

Optimize L0

From the definitions of A(i,j)/B(i,j)/C(i,j), we see that the innermost loop can be changed from p to i to increase the cache hit-rate, since i is the physically innermost (fastest-varying) dimension: with column-major storage, A(i,p) and A(i+1,p) are adjacent in memory, whereas A(i,p) and A(i,p+1) are lda elements apart.

/* Create macros so that the matrices are stored in column-major order */
#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Routine for computing C = A * B + C */
void MY_MMult( int m, int n, int k, double *a, int lda, 
                                    double *b, int ldb,
                                    double *c, int ldc )
{
  int i, j, p;
  for ( j=0; j<n; j++ ){       /* Loop over the columns of C */
    for ( p=0; p<k; p++ ){     /* Update C( i,j ) with the inner */
      for ( i=0; i<m; i++ ){   /* Loop over the rows of C */
        C( i,j ) = C( i,j ) + A( i,p ) * B( p,j );
      }
    }
  }
}

We can see that this change increases performance a lot, especially for big matrices, thanks to better cache efficiency.

Optimize L1

Now that cache efficiency is decent, let's focus on the issue of low CPU GFLOPS efficiency.

We set the theoretical maximum GFLOPS according to the SIMD specification: with double-precision SIMD instructions the CPU can execute 2 multiplies + 2 additions per cycle, so we set our target to 4 times the clock frequency.
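
For example, on a hypothetical 3.0 GHz core this target would be 4 FLOPs/cycle x 3.0 GHz = 12 GFLOPS; the actual target of course depends on the clock frequency of the test machine.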

So we must use SIMD, and the natural choice is to read neighbouring data into the SIMD lanes:

C(i:i+2, j) = C(i:i+2, j) +  A(i:i+2, p) * B(p,j);
 2xdouble      2xdouble        2xdouble   2xdouble (broadcast)

the code is:

/* Create macros so that the matrices are stored in column-major order */
#define A(i, j) a[(j)*lda + (i)]
#define B(i, j) b[(j)*ldb + (i)]
#define C(i, j) c[(j)*ldc + (i)]

/* Routine for computing C = A * B + C */
#include <emmintrin.h> // SSE2
#include <mmintrin.h>  // MMX
#include <pmmintrin.h> // SSE3
#include <xmmintrin.h> // SSE

typedef union {
  __m128d v;
  double d[2];
} v2df_t;

void MY_MMult(int m, int n, int k, double *a, int lda, double *b, int ldb,
              double *c, int ldc) {
  int i, j, p;
  double *ptrC;
  v2df_t regA, regC, regB;

  for (j = 0; j < n; j++) {   /* Loop over the columns of C */
    for (p = 0; p < k; p++) { /* Update C( i,j ) with the inner */
      regB.v = _mm_loaddup_pd(&B(p, j));
      for (i = 0; i < m; i += 2) { /* Loop over the rows of C */
        ptrC = &C(i, j);
        regC.v = _mm_load_pd(ptrC);
        regA.v = _mm_load_pd(&A(i, p));

        regC.v = regC.v + regA.v * regB.v;

        _mm_store_pd(ptrC, regC.v);
      }
    }
  }
}

Here B(p,j) does not change inside the innermost loop, so we keep it in a register (broadcast to both lanes with _mm_loaddup_pd).

We can see that performance almost doubles, but it doesn't hit our target yet, because we still have the overhead of the load/store instructions.

Optimize L2

Even though the load/store instructions are hitting the cache, they still occupy CPU cycles. To mitigate this overhead we should avoid loading/storing too often, i.e. increase the register hit-rate.

An obvious way to achieve that: since we are accumulating into C, why not accumulate more terms in a register before storing it back to memory?

C(i:i+2, j) = C(i:i+2, j) +  A(i:i+2, p) * B(p,j);

can be changed into :

C(i:i+2, j) = C(i:i+2, j) +  A(i:i+2, p)   * B(p,j)
                          +  A(i:i+2, p+1) * B(p+1,j)
                          + ...;
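
A rough per-iteration count (a sketch that ignores loop overhead and the B broadcasts, which are hoisted out of the inner loop): in L1 each inner iteration issues 2 vector loads (C and A) plus 1 vector store for 4 FLOPs; in L2 each inner iteration issues 3 vector loads (C, A at p, A at p+1) plus 1 vector store for 8 FLOPs, so the amount of arithmetic per load/store instruction roughly doubles.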

Here is the code for L2:

/* Create macros so that the matrices are stored in column-major order */
#define A(i, j) a[(j)*lda + (i)]
#define B(i, j) b[(j)*ldb + (i)]
#define C(i, j) c[(j)*ldc + (i)]

/* Routine for computing C = A * B + C */
#include <emmintrin.h> // SSE2
#include <mmintrin.h>  // MMX
#include <pmmintrin.h> // SSE3
#include <xmmintrin.h> // SSE

typedef union {
  __m128d v;
  double d[2];
} v2df_t;

void MY_MMult(int m, int n, int k, double *a, int lda, double *b, int ldb,
              double *c, int ldc) {
  int i, j, p;
  double *ptrA0;
  double *ptrA1;
  double *ptrC;
  v2df_t regC;
  v2df_t regA0, regB0;
  v2df_t regA1, regB1;

  for (j = 0; j < n; j++) {      /* Loop over the columns of C */
    for (p = 0; p < k; p += 2) { /* Update C( i,j ) with the inner */
      regB0.v = _mm_loaddup_pd(&B(p, j));
      regB1.v = _mm_loaddup_pd(&B(p + 1, j));

      for (i = 0; i < m; i += 2) { /* Loop over the rows of C */
        ptrC = &C(i, j);
        ptrA0 = &A(i, p + 0);
        ptrA1 = &A(i, p + 1);
        regC.v = _mm_load_pd(ptrC);

        regA0.v = _mm_load_pd(ptrA0);
        regA1.v = _mm_load_pd(ptrA1);

        regC.v += regA0.v * regB0.v + regA1.v * regB1.v;

        _mm_store_pd(ptrC, regC.v);
      }
    }
  }
}

Now performance increases again.

Optimize L3

To further unroll or re-organize the execution, we look at the inner loop and notice that A(i:i+2, p) and A(i:i+2, p+1) are already loaded into registers, so why not re-use them by moving/unrolling into the loop the next instances of S0 that need them?

C(i:i+2, j) = C(i:i+2, j) +  A(i:i+2, p)   * B(p,j)
                          +  A(i:i+2, p+1) * B(p+1,j);

Can be changed into:

C(i:i+2, j) = C(i:i+2, j) +  A(i:i+2, p)   * B(p,j)
                          +  A(i:i+2, p+1) * B(p+1,j);

C(i:i+2, j+1) = C(i:i+2, j+1) +  A(i:i+2, p)   * B(p,j+1)
                              +  A(i:i+2, p+1) * B(p+1,j+1);

So we save two loads of A by grouping/unrolling neighbouring instances of S0 along the j axis:

void MY_MMult(int m, int n, int k, double *a, int lda, double *b, int ldb,
              double *c, int ldc) {
  int i, j, p;
  double *ptrA0;
  double *ptrA1;
  double *ptrC0;
  double *ptrC1;
  v2df_t regCj0;
  v2df_t regCj1;
  v2df_t regA0, regB0j0, regB0j1;
  v2df_t regA1, regB1j0, regB1j1;

  for (j = 0; j < n; j+=2) {   /* Loop over the columns of C */
    for (p = 0; p < k; p+=2) { /* Update C( i,j ) with the inner */
      regB0j0.v = _mm_loaddup_pd(&B(p, j));
      regB1j0.v = _mm_loaddup_pd(&B(p+1, j));

      regB0j1.v = _mm_loaddup_pd(&B(p, j+1));
      regB1j1.v = _mm_loaddup_pd(&B(p+1, j+1));

      for (i = 0; i < m; i += 2) { /* Loop over the rows of C */
        ptrC0 = &C(i, j);
        ptrC1 = &C(i, j + 1);
        ptrA0 = &A(i, p + 0);
        ptrA1 = &A(i, p + 1);
        regCj0.v = _mm_load_pd(ptrC0);
        regCj1.v = _mm_load_pd(ptrC1);

        regA0.v = _mm_load_pd(ptrA0);
        regA1.v = _mm_load_pd(ptrA1);

        regCj0.v += regA0.v * regB0j0.v + regA1.v * regB1j0.v;
        regCj1.v += regA0.v * regB0j1.v + regA1.v * regB1j1.v;

        _mm_store_pd(ptrC0, regCj0.v);
        _mm_store_pd(ptrC1, regCj1.v);
      }
    }
  }
}

Performance increases again; for small matrices we now have pretty high CPU GFLOPS efficiency (near 87%).

Optimize L4

We can see that performance still drops for big matrices. The inner loop should already achieve close to the maximum possible cache efficiency for A and C, since they are loaded contiguously. That leaves only the loading of B: we load only a couple of doubles from each column of B per iteration of loop p, which wastes the prefetched data following them in the same cache line (a typical 64-byte line holds 8 doubles, and consecutive p within a column of B are contiguous). Why not use more of them by unrolling along the p axis?

C(i:i+2, j) = C(i:i+2, j) +  A(i:i+2, p)   * B(p,j)
                          +  A(i:i+2, p+1) * B(p+1,j)
                          +  A(i:i+2, p+2) * B(p+2,j)
                          +  A(i:i+2, p+3) * B(p+3,j);

C(i:i+2, j+1) = C(i:i+2, j+1) +  A(i:i+2, p)   * B(p,j+1)
                              +  A(i:i+2, p+1) * B(p+1,j+1)
                              +  A(i:i+2, p+2) * B(p+2,j+1)
                              +  A(i:i+2, p+3) * B(p+3,j+1);

So here not only B's cache efficiency increases but also C's, because C now accumulates four terms per column instead of two before being stored back to memory.

void MY_MMult(int m, int n, int k, double *a, int lda, double *b, int ldb,
              double *c, int ldc) {
  int i, j, p;
  double *ptrA0;
  double *ptrA1;
  double *ptrA2;
  double *ptrA3;

  double *ptrC0;
  double *ptrC1;

  v2df_t regCj0;
  v2df_t regCj1;
  v2df_t regA0, regB0j0, regB0j1;
  v2df_t regA1, regB1j0, regB1j1;
  v2df_t regA2, regB2j0, regB2j1;
  v2df_t regA3, regB3j0, regB3j1;

  for (j = 0; j < n; j+=2) {   /* Loop over the columns of C */
    for (p = 0; p < k; p+=4) { /* Update C( i,j ) with the inner */
      regB0j0.v = _mm_loaddup_pd(&B(p+0, j));
      regB1j0.v = _mm_loaddup_pd(&B(p+1, j));
      regB2j0.v = _mm_loaddup_pd(&B(p+2, j));
      regB3j0.v = _mm_loaddup_pd(&B(p+3, j));

      regB0j1.v = _mm_loaddup_pd(&B(p+0, j+1));
      regB1j1.v = _mm_loaddup_pd(&B(p+1, j+1));
      regB2j1.v = _mm_loaddup_pd(&B(p+2, j+1));
      regB3j1.v = _mm_loaddup_pd(&B(p+3, j+1));

      for (i = 0; i < m; i += 2) { /* Loop over the rows of C */
        ptrC0 = &C(i, j);
        ptrC1 = &C(i, j + 1);

        ptrA0 = &A(i, p+0);
        ptrA1 = &A(i, p+1);
        ptrA2 = &A(i, p+2);
        ptrA3 = &A(i, p+3);

        regCj0.v = _mm_load_pd(ptrC0);
        regCj1.v = _mm_load_pd(ptrC1);

        regA0.v = _mm_load_pd(ptrA0);
        regA1.v = _mm_load_pd(ptrA1);
        regA2.v = _mm_load_pd(ptrA2);
        regA3.v = _mm_load_pd(ptrA3);

        regCj0.v += regA0.v * regB0j0.v + regA1.v * regB1j0.v;
        regCj0.v += regA2.v * regB2j0.v + regA3.v * regB3j0.v;

        regCj1.v += regA0.v * regB0j1.v + regA1.v * regB1j1.v;
        regCj1.v += regA2.v * regB2j1.v + regA3.v * regB3j1.v;

        _mm_store_pd(ptrC0, regCj0.v);
        _mm_store_pd(ptrC1, regCj1.v);
      }
    }
  }
}

Optimize L5

This is just a reinforcement of L4, increasing the unroll factor along the p axis from 4 to 8, but we still see a stable improvement.

#include <stdio.h>

/* Create macros so that the matrices are stored in column-major order */

#define A(i, j) a[(j)*lda + (i)]
#define B(i, j) b[(j)*ldb + (i)]
#define C(i, j) c[(j)*ldc + (i)]

/* Routine for computing C = A * B + C */

#include <emmintrin.h> // SSE2
#include <mmintrin.h>  // MMX
#include <pmmintrin.h> // SSE3
#include <xmmintrin.h> // SSE

typedef union {
  __m128d v;
  double d[2];
} v2df_t;

void MY_MMult(int m, int n, int k, double *a, int lda, double *b, int ldb,
              double *c, int ldc) {
  int i, j, p;
  double *ptrA0;
  double *ptrA1;
  double *ptrA2;
  double *ptrA3;
  double *ptrA4;
  double *ptrA5;
  double *ptrA6;
  double *ptrA7;

  double *ptrC0;
  double *ptrC1;

  v2df_t regCj0;
  v2df_t regCj1;
  v2df_t regA0, regB0j0, regB0j1;
  v2df_t regA1, regB1j0, regB1j1;
  v2df_t regA2, regB2j0, regB2j1;
  v2df_t regA3, regB3j0, regB3j1;

  v2df_t regA4, regB4j0, regB4j1;
  v2df_t regA5, regB5j0, regB5j1;
  v2df_t regA6, regB6j0, regB6j1;
  v2df_t regA7, regB7j0, regB7j1;

  for (j = 0; j < n; j+=2) {   /* Loop over the columns of C */
    for (p = 0; p < k; p+=8) { /* Update C( i,j ) with the inner */
      regB0j0.v = _mm_loaddup_pd(&B(p+0, j));
      regB1j0.v = _mm_loaddup_pd(&B(p+1, j));
      regB2j0.v = _mm_loaddup_pd(&B(p+2, j));
      regB3j0.v = _mm_loaddup_pd(&B(p+3, j));
      regB4j0.v = _mm_loaddup_pd(&B(p+4, j));
      regB5j0.v = _mm_loaddup_pd(&B(p+5, j));
      regB6j0.v = _mm_loaddup_pd(&B(p+6, j));
      regB7j0.v = _mm_loaddup_pd(&B(p+7, j));

      regB0j1.v = _mm_loaddup_pd(&B(p+0, j+1));
      regB1j1.v = _mm_loaddup_pd(&B(p+1, j+1));
      regB2j1.v = _mm_loaddup_pd(&B(p+2, j+1));
      regB3j1.v = _mm_loaddup_pd(&B(p+3, j+1));
      regB4j1.v = _mm_loaddup_pd(&B(p+4, j+1));
      regB5j1.v = _mm_loaddup_pd(&B(p+5, j+1));
      regB6j1.v = _mm_loaddup_pd(&B(p+6, j+1));
      regB7j1.v = _mm_loaddup_pd(&B(p+7, j+1));

      for (i = 0; i < m; i += 2) { /* Loop over the rows of C */
        ptrC0 = &C(i, j);
        ptrC1 = &C(i, j + 1);

        ptrA0 = &A(i, p+0);
        ptrA1 = &A(i, p+1);
        ptrA2 = &A(i, p+2);
        ptrA3 = &A(i, p+3);
        ptrA4 = &A(i, p+4);
        ptrA5 = &A(i, p+5);
        ptrA6 = &A(i, p+6);
        ptrA7 = &A(i, p+7);

        regCj0.v = _mm_load_pd(ptrC0);
        regCj1.v = _mm_load_pd(ptrC1);

        regA0.v = _mm_load_pd(ptrA0);
        regA1.v = _mm_load_pd(ptrA1);
        regA2.v = _mm_load_pd(ptrA2);
        regA3.v = _mm_load_pd(ptrA3);
        regA4.v = _mm_load_pd(ptrA4);
        regA5.v = _mm_load_pd(ptrA5);
        regA6.v = _mm_load_pd(ptrA6);
        regA7.v = _mm_load_pd(ptrA7);

        regCj0.v += regA0.v * regB0j0.v + regA1.v * regB1j0.v
                  + regA2.v * regB2j0.v + regA3.v * regB3j0.v;

        regCj0.v += regA4.v * regB4j0.v + regA5.v * regB5j0.v
                  + regA6.v * regB6j0.v + regA7.v * regB7j0.v;

        regCj1.v += regA0.v * regB0j1.v + regA1.v * regB1j1.v
                  + regA2.v * regB2j1.v + regA3.v * regB3j1.v;

        regCj1.v += regA4.v * regB4j1.v + regA5.v * regB5j1.v
                  + regA6.v * regB6j1.v + regA7.v * regB7j1.v;

        _mm_store_pd(ptrC0, regCj0.v);
        _mm_store_pd(ptrC1, regCj1.v);
      }
    }
  }
}

Compared to Eigen, we still see a performance drop for big matrices, indicating possible improvement in cache usage.

Optimize L6

Going back to the performance curve, we see that L5 achieves a much smaller performance gain than L4 does; further unrolling along the p axis doesn't bring any obvious improvement.

In the innermost loop, the code accesses the same B values in each iteration, which means they stay cached persistently, so the B values should not be a caching issue.

So what makes optimization L4 so successful, if it is not the caching of B? Let's compare L3 and L4.

L3:

C(i:i+2, j) = C(i:i+2, j) +  A(i:i+2, p)   * B(p,j)
                          +  A(i:i+2, p+1) * B(p+1,j);

C(i:i+2, j+1) = C(i:i+2, j+1) +  A(i:i+2, p)   * B(p,j+1)
                              +  A(i:i+2, p+1) * B(p+1,j+1);

L4:

C(i:i+2, j) = C(i:i+2, j) +  A(i:i+2, p)   * B(p,j)
                          +  A(i:i+2, p+1) * B(p+1,j)
                          +  A(i:i+2, p+2) * B(p+2,j)
                          +  A(i:i+2, p+3) * B(p+3,j);

C(i:i+2, j+1) = C(i:i+2, j+1) +  A(i:i+2, p)   * B(p,j+1)
                              +  A(i:i+2, p+1) * B(p+1,j+1)
                              +  A(i:i+2, p+2) * B(p+2,j+1)
                              +  A(i:i+2, p+3) * B(p+3,j+1);

In L4, the volume of unrolled instances of statement S0 is (2,4,2) in (i,p,j) order; in L3 the volume is only (2,2,2), so to cover the same volume we need two L3 iterations:

loop (i,p):
  iteration: (i,p,j) covers (2,2,2) volume
    C(i:i+2, j) = C(i:i+2, j) +  A(i:i+2, p)   * B(p,j)
                              +  A(i:i+2, p+1) * B(p+1,j);
    C(i:i+2, j+1) = C(i:i+2, j+1) +  A(i:i+2, p)   * B(p,j+1)
                                  +  A(i:i+2, p+1) * B(p+1,j+1);

loop (i,p+2):
  iteration: (i,p+2,j) covers another (2,2,2) volume
    C(i:i+2, j) = C(i:i+2, j) +  A(i:i+2, p+2) * B(p+2,j)
                              +  A(i:i+2, p+3) * B(p+3,j);
    C(i:i+2, j+1) = C(i:i+2, j+1) +  A(i:i+2, p+2) * B(p+2,j+1)
                                  +  A(i:i+2, p+3) * B(p+3,j+1);

Ignoring B (which is supposed to stay in registers or be cached persistently), the total cost comparison is:

optimization   volume of S0   iterations   total loads   total stores
L3             (2,2,2)        2x           12            4
L4             (2,4,2)        1x           10            2

The difference in the number of loads/stores is not that big, and those instructions can even be nearly free when the data is cached well (for example, when they access a fixed address, like B(p,j), or an address falling into the same cache line as a previous access, like B(p+1,j)). The real problem is that L3 needs two inner-loop passes to do the same job, so its impact on the cache is much bigger than the extra instructions: the additional reads and writes of C compete for cache-line capacity with A. (Reading from the same cache line is not considered competing, but reading and storing C(i,j) competes heavily with reading A(i,p).)

So to further reduce this cache competition, we should unroll along the j axis instead of the p axis:

void MY_MMult(int m, int n, int k, double *a, int lda, double *b, int ldb,
              double *c, int ldc) {
  int i, j, p;
  double *ptrA0;
  double *ptrA1;
  double *ptrA2;
  double *ptrA3;

  double *ptrC0;
  double *ptrC1;
  double *ptrC2;
  double *ptrC3;

  v2df_t regCj0;
  v2df_t regCj1;
  v2df_t regCj2;
  v2df_t regCj3;
  v2df_t regA0, regB0j0, regB0j1, regB0j2, regB0j3;
  v2df_t regA1, regB1j0, regB1j1, regB1j2, regB1j3;
  v2df_t regA2, regB2j0, regB2j1, regB2j2, regB2j3;
  v2df_t regA3, regB3j0, regB3j1, regB3j2, regB3j3;

  for (j = 0; j < n; j+=4) {   /* Loop over the columns of C */
    for (p = 0; p < k; p+=4) { /* Update C( i,j ) with the inner */
      regB0j0.v = _mm_loaddup_pd(&B(p+0, j));
      regB1j0.v = _mm_loaddup_pd(&B(p+1, j));
      regB2j0.v = _mm_loaddup_pd(&B(p+2, j));
      regB3j0.v = _mm_loaddup_pd(&B(p+3, j));

      regB0j1.v = _mm_loaddup_pd(&B(p+0, j+1));
      regB1j1.v = _mm_loaddup_pd(&B(p+1, j+1));
      regB2j1.v = _mm_loaddup_pd(&B(p+2, j+1));
      regB3j1.v = _mm_loaddup_pd(&B(p+3, j+1));

      regB0j2.v = _mm_loaddup_pd(&B(p+0, j+2));
      regB1j2.v = _mm_loaddup_pd(&B(p+1, j+2));
      regB2j2.v = _mm_loaddup_pd(&B(p+2, j+2));
      regB3j2.v = _mm_loaddup_pd(&B(p+3, j+2));

      regB0j3.v = _mm_loaddup_pd(&B(p+0, j+3));
      regB1j3.v = _mm_loaddup_pd(&B(p+1, j+3));
      regB2j3.v = _mm_loaddup_pd(&B(p+2, j+3));
      regB3j3.v = _mm_loaddup_pd(&B(p+3, j+3));

      for (i = 0; i < m; i += 2) { /* Loop over the rows of C */
        ptrC0 = &C(i, j);
        ptrC1 = &C(i, j + 1);
        ptrC2 = &C(i, j + 2);
        ptrC3 = &C(i, j + 3);

        ptrA0 = &A(i, p+0);
        ptrA1 = &A(i, p+1);
        ptrA2 = &A(i, p+2);
        ptrA3 = &A(i, p+3);

        regCj0.v = _mm_load_pd(ptrC0);
        regCj1.v = _mm_load_pd(ptrC1);
        regCj2.v = _mm_load_pd(ptrC2);
        regCj3.v = _mm_load_pd(ptrC3);

        regA0.v = _mm_load_pd(ptrA0);
        regA1.v = _mm_load_pd(ptrA1);
        regA2.v = _mm_load_pd(ptrA2);
        regA3.v = _mm_load_pd(ptrA3);

        regCj0.v += regA0.v * regB0j0.v + regA1.v * regB1j0.v;
        regCj0.v += regA2.v * regB2j0.v + regA3.v * regB3j0.v;

        regCj1.v += regA0.v * regB0j1.v + regA1.v * regB1j1.v;
        regCj1.v += regA2.v * regB2j1.v + regA3.v * regB3j1.v;

        regCj2.v += regA0.v * regB0j2.v + regA1.v * regB1j2.v;
        regCj2.v += regA2.v * regB2j2.v + regA3.v * regB3j2.v;

        regCj3.v += regA0.v * regB0j3.v + regA1.v * regB1j3.v;
        regCj3.v += regA2.v * regB2j3.v + regA3.v * regB3j3.v;

        _mm_store_pd(ptrC0, regCj0.v);
        _mm_store_pd(ptrC1, regCj1.v);
        _mm_store_pd(ptrC2, regCj2.v);
        _mm_store_pd(ptrC3, regCj3.v);
      }
    }
  }
}

We can see that L6 approaches Eigen's performance much more closely than L5 does.


How To Optimize Gemm wiki pages

https://github.com/flame/how-to-optimize-gemm/wiki

Copyright by Prof. Robert van de Geijn ([email protected]).

Adapted to Github Markdown Wiki by Jianyu Huang ([email protected]).

Acknowledgement

This material was partially sponsored by grants from the National Science Foundation (Awards ACI-1148125/1340293 and ACI-1550493).

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

