Small Benchmark: CUDA's Dynamic Parallelism
Introduction
Over the last weeks I implemented some statistical methods for selectivity
estimation in a DBMS and tried out the dynamic parallelism feature of
compute capability 3.5, which is available on the K20 as well as the GTX
780 and the GTX Titan. Dynamic parallelism means, in short, that CUDA
kernels can launch other kernels. You can use a different number of workgroups
and threads for the nested kernels, hence the term dynamic. Together with
the possibility to allocate global memory from inside kernels, which has
been available since Fermi, this allows the developer to do almost everything
inside kernels without involving the CPU.
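To illustrate both features, here is a minimal sketch (not part of the benchmark below; the parent and child kernels are just for illustration): a parent kernel allocates global memory with the device-side malloc and then launches a child grid on it.
// Hypothetical child kernel: fills the buffer with each thread's index.
__global__ void child(int* data, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) data[i] = i;
}
// Hypothetical parent kernel: combines device-side allocation (Fermi)
// with a device-side kernel launch (compute capability 3.5).
__global__ void parent(int n) {
    int* data = (int*)malloc(n * sizeof(int));  // global memory, allocated on the GPU
    if (data == NULL) return;
    // Grid and block dimensions are computed at runtime, hence "dynamic".
    child<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();  // wait for the child grid before freeing
    free(data);
}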
After the first working implementation of my algorithms I also implemented
a version without dynamic parallelism and was surprised that it ran around
10% faster on the same GPU. So I did a little test run to verify that the
slower execution time was indeed caused by dynamic parallelism.
Experiment
I launch a simple multiply kernel a thousand times, once from inside a parent kernel (dynamic parallelism) and once from the host (traditional).
#include <stdio.h>
#include <stdlib.h>     // atoi
#include <sys/time.h>
#include <sys/types.h>  // uint
#include "cuda_runtime.h"
// Wall-clock time in seconds, based on gettimeofday.
double time_in_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}
__global__ void multiply(const double* a, const double* b, double* c, const uint size) {
    const uint tidx = threadIdx.x + blockDim.x * blockIdx.x;
    if (tidx < size)  // guard: the grid is rounded up, so it may overshoot
        c[tidx] = a[tidx] * b[tidx];
}
// Parent kernel: launches the multiply kernel "runs" times from the device.
__global__ void multiply_dynpar(const double* a, const double* b, double* c,
                                const uint size, const uint runs) {
    const uint numThreads = 256;
    const uint numWG = (size + numThreads - 1) / numThreads;  // round up
    for (uint r = 0; r < runs; ++r) {
        multiply<<<numWG, numThreads>>>(a, b, c, size);
        cudaDeviceSynchronize();  // wait for the child grid to finish
    }
}
// Host-side version: launches the same kernel "runs" times from the CPU.
void multiply_trad(const double* a, const double* b, double* c,
                   const uint size, const uint runs) {
    const uint numThreads = 256;
    const uint numWG = (size + numThreads - 1) / numThreads;  // round up
    for (uint r = 0; r < runs; ++r) {
        multiply<<<numWG, numThreads>>>(a, b, c, size);
        cudaDeviceSynchronize();  // cudaThreadSynchronize() is deprecated
    }
}
int main(int argc, char** argv) {
    uint s = 1000 * 10;
    const uint runs = 1000;
    if (argc > 1) {
        s = atoi(argv[1]);
        printf("Size: %u\n", s);
    }
    double *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, s * sizeof(double));
    cudaMalloc((void**)&dev_b, s * sizeof(double));
    cudaMalloc((void**)&dev_c, s * sizeof(double));
    double t1, t2;
    // Run each variant twice; the first iteration also serves as a warm-up.
    for (int i = 0; i < 2; ++i) {
        t1 = time_in_seconds();
        multiply_trad(dev_a, dev_b, dev_c, s, runs);
        t2 = time_in_seconds();
        printf("Time trad: %.3f\n", t2 - t1);
        t1 = time_in_seconds();
        multiply_dynpar<<<1, 1>>>(dev_a, dev_b, dev_c, s, runs);
        cudaDeviceSynchronize();
        t2 = time_in_seconds();
        printf("Time dynp: %.3f\n", t2 - t1);
    }
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
}
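Note that device-side launches require compiling for compute capability 3.5 with relocatable device code and linking against the device runtime. Assuming the source file is called test_dynamicpar.cu, the build line looks roughly like this:
nvcc -arch=sm_35 -rdc=true -o test_dynamicpar test_dynamicpar.cu -lcudadevrt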
And here are the results:
> test_dynamicpar/Release/test_dynamicpar
Time trad: 0.023
Time dynp: 0.030
Time trad: 0.021
Time dynp: 0.029
> test_dynamicpar/Release/test_dynamicpar 100000
Size: 100000
Time trad: 0.039
Time dynp: 0.053
Time trad: 0.038
Time dynp: 0.052
> test_dynamicpar/Release/test_dynamicpar 1000000
Size: 1000000
Time trad: 0.181
Time dynp: 0.196
Time trad: 0.180
Time dynp: 0.195
> test_dynamicpar/Release/test_dynamicpar 10000000
Size: 10000000
Time trad: 1.604
Time dynp: 1.675
Time trad: 1.601
Time dynp: 1.675
As you can see, the execution time of the traditional approach is consistently a bit lower. If you have an explanation, I'll be happy to hear it.