Dipl.-Phys. Ing. Gordon Taft
Matrix Multiplication
Blocked Algorithm

The blocked version of the i-j-k algorithm is written simply as

for (i=0;i<N/B;i++)
    for (j=0;j<N/B;j++)
        for(k=0;k<N/B;k++)
            c[i][j] += A[i][k]*B[k][j];

- B = Block Size (which we assume divides N)

The blocking optimization works only if the blocks fit in cache.


Types of Cache Misses

1. Compulsory misses: Cache misses caused by the first access to the block that has never been in cache (also known as cold-start misses)

2. Capacity misses: Cache misses caused when the cache cannot contain all the blocks needed during execution of a program. Occur because of
                            blocks being replaced and later retrieved when accessed.

3. Conflict misses: Cache misses that occur when multiple blocks complete for the same set.


Main Cache Principles

Temporal Locality: If an item is referenced, it will tend to be referenced again soon.
                         (Keep more recently accessed data items closer to the processor)

Spatial Locality:   If an item is referenced, items whose addresses are close by tend to be referenced soon.
                         (Move blocks consists of contiguous words to the cache)

ijk & jik
misses / iteration = 1.25
kij & ikj
misses / iteration = 0.5

jki & kji
misses / iteration = 2.0

for (i=0;i<n;i++) {
    for(j=0;j<n;j++) {
        sum = 0;
        for(k=0;k<n;k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
for (k=0;k<n;k++) {
    for(i=0;i<n;i++) {
        r = a[i][k];
        for(j=0;j<n;j++)
            c[i][j] = r * b[k][j];
    }
}

for (j=0;j<n;j++) {
    for(k=0;k<n;k++) {
        r = b[k][j];
        for(i=0;i<n;i++)
            c[i][j] = r * a[i][k];
    }
}