
Optimization_4x4_12

Jianyu Huang edited this page Aug 11, 2016 · 4 revisions

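Copy the contents of file MMult_4x4_11.c into a file named MMult_4x4_12.c and change the contents so that each 4xk block of A is packed before AddDot4x4 is called (the resulting inner kernel is shown further down).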

Change the first lines in the makefile to

OLD  := MMult_4x4_11
NEW  := MMult_4x4_12

Then execute

make run
octave:3> PlotAll        % this will create the plot

This time the performance graph will look something like

[Figure: performance graph, MMult_4x4_11 (OLD) vs. MMult_4x4_12 (NEW)]

We now pack each 4xk block of A before calling AddDot4x4. We see a performance drop. If one examines the inner kernel

void InnerKernel( int m, int n, int k, double *a, int lda, 
                                       double *b, int ldb,
                                       double *c, int ldc )
{
  int i, j;
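  double packedA[ m * k ];       /* Buffer for the packed 4xk blocks of A */

  for ( j=0; j<n; j+=4 ){        /* Loop over the columns of C, unrolled by 4 */
    for ( i=0; i<m; i+=4 ){        /* Loop over the rows of C, unrolled by 4 */
      /* Pack the 4xk block of A that starts at row i into contiguous
         memory.  Note that this is done for every value of j. */
      PackMatrixA( k, &A( i, 0 ), lda, &packedA[ i*k ] );
      /* Update the 4x4 block of C whose top-left element is C( i,j ) in
         one routine (sixteen inner products) */
      AddDot4x4( k, &packedA[ i*k ], 4, &B( 0,j ), ldb, &C( i,j ), ldc );
    }
  }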
    }
  }
}

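one notices that each 4xk block of A is packed repeatedly, once for every iteration of the outer loop (the loop over j). Within one call to InnerKernel, every block is therefore packed n/4 times, even though its contents do not depend on j.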
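
The repeated packing can be made concrete with a small counting sketch. This is not the tutorial's PackMatrixA: the names pack_4xk, count_packs, and the pack_calls counter are hypothetical and introduced only for illustration. The sketch assumes column-major storage and that m and n are multiples of 4. It mirrors only the loop structure of InnerKernel and leaves out the AddDot4x4 computation.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major indexing: element (i,j) of the array a with leading
   dimension lda.  (Assumed storage layout for this sketch.) */
#define A(i, j) a[(size_t)(j) * lda + (i)]

/* Hypothetical counter, present only to make the repeated packing visible. */
static long pack_calls = 0;

/* Sketch of a packing routine: copy a 4 x k block of A into contiguous
   memory, storing the 4 entries of each column next to each other. */
static void pack_4xk(int k, const double *a, int lda, double *a_to)
{
  int j;

  pack_calls++;
  for (j = 0; j < k; j++) {            /* Loop over the columns of the block */
    const double *a_ij_pntr = &A(0, j);

    *a_to++ = a_ij_pntr[0];
    *a_to++ = a_ij_pntr[1];
    *a_to++ = a_ij_pntr[2];
    *a_to++ = a_ij_pntr[3];
  }
}

/* Same loop nest as InnerKernel, with the computation removed.
   Returns how many times the packing routine was invoked. */
static long count_packs(int m, int n, int k,
                        const double *a, int lda, double *packedA)
{
  int i, j;

  assert(m % 4 == 0 && n % 4 == 0);    /* Assumption of this sketch */

  pack_calls = 0;
  for (j = 0; j < n; j += 4) {         /* Loop over the columns of C */
    for (i = 0; i < m; i += 4) {       /* Loop over the rows of C */
      pack_4xk(k, &A(i, 0), lda, &packedA[i * k]);
    }
  }
  return pack_calls;
}
```

For m = 8 and n = 12, count_packs returns 6: each of the 2 blocks of A is packed 3 times (n/4), although the packed data is identical every time.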
