Recently, I posted a basic introduction to CUDA C for programming GPUs, which showed how to do a vector addition. That illustrated some of the basic CUDA syntax, but the example wasn't complex enough to bring out one of the trickier issues: designing algorithms carefully to minimise data movement. Here we move on to a more complicated algorithm, matrix multiplication, C = AB. We'll see that elements of the matrices get used multiple times, so we'll want to put them in shared memory to minimise the number of times they are retrieved from the much slower global (or device) memory. We'll also see that, because data a thread puts into shared memory is only accessible to the other threads in the same thread block, we need to be careful about how we do this.
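To make the reuse concrete, here is a minimal sketch in plain C (so it runs without a GPU) of the naive multiplication, with a counter for how often the input matrices are read. The function name `naive_matmul_count_reads` and the square, row-major layout are assumptions made for this illustration, not code from the earlier post. Each (row, col) iteration stands in for the work a single CUDA thread would do for one element of C.

```c
#include <stddef.h>

/* Naive C = A * B for square n x n matrices stored row-major.
 * Each (row, col) iteration mimics one CUDA thread computing one
 * element of C. Returns the total number of element reads from a and b,
 * which on a GPU would all be global-memory reads if nothing is cached
 * in shared memory. */
unsigned long naive_matmul_count_reads(const float *a, const float *b,
                                       float *c, size_t n)
{
    unsigned long reads = 0;
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k) {
                /* One read from A's row and one from B's column. */
                sum += a[row * n + k] * b[k * n + col];
                reads += 2;
            }
            c[row * n + col] = sum;
        }
    }
    return reads;
}
```

The loop performs 2n³ reads, although A and B together hold only 2n² distinct values, so each input element is read n times. That factor of n is the redundancy that staging data in shared memory aims to reduce.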
The purpose of CUDA is to let developers program GPUs much more easily than was previously possible. Since its inception in 2007, the use of GPUs has opened up beyond graphics to more general computing, e.g. scientific computing, which is often referred to as general-purpose GPU computing (GPGPU).