PGStrom for Postgres
Introduction
A lot of colleagues mailed me the link to PGStrom, because they knew that this
is my domain, and some of them also knew that in my field of work, claims of
10x speedups and more from using a GPU usually make me skeptical.
PGStrom executes statements of the form
SELECT * FROM table WHERE <complex formula>
...on the GPU. This is not a bad idea at all. If you take a look at my
last blog post, you can see that this easily passes 5 of the 6 checks:
the algorithm can be processed in a SIMD fashion, it can be streamed without
overhead, and the data does not need to be transferred more than once
(a.k.a. it "fits into GPU memory"). However, the most important check is the
missing one, and that is where my skepticism comes from: I am not sure that
the transfer to the GPU is faster than the computation on the CPU.
I don't think that it is appropriate to compare Postgres against an optimized
implementation on the GPU. In their experiment, Postgres processes 10 million
records in 7 seconds. 10 million sounds like a huge number, but the example
statement actually touches only two columns of floating point numbers.
Assuming these are 32-bit floats, the CPU processes only 76 MB -- in 7
seconds! Remember that a single CPU thread is capable of transferring
around 4 GB of data per second!
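For reference, here is the back-of-envelope math behind the 76 MB figure, written out as a tiny C program (the two-column, 4-byte-per-value layout is taken from the example statement):

#include <stdio.h>

int main(void)
{
    const double rows  = 10e6;           /* 10 million records       */
    const double bytes = rows * 2 * 4;   /* two 32-bit float columns */
    const double mib   = bytes / (1024 * 1024);
    printf("data volume: %.1f MiB\n", mib);       /* ~76.3 MiB            */
    printf("throughput:  %.1f MiB/s\n", mib / 7); /* ~10.9 MiB/s over 7 s */
    return 0;
}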
Experiment and Implementation
In my opinion, one fair way to compare GPU and CPU is to hard-code this query in C and in CUDA and compare the run-times. Thanks to the detailed description of the PGStrom test setup on the wiki page (better than in most research papers...), we can easily reproduce their tests. This SQL statement
SELECT COUNT(*) FROM t1 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 15;
becomes the following C code:
for (int i = 0; i < (int)t_size; i++) {
    result += (sqrt(pow(x[i] - 25.6, 2) + pow(y[i] - 12.8, 2)) < 15);
}
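This loop can easily be parallelized with OpenMP; here is a minimal sketch (the function wrapper, the signed loop index, and the reduction clause are my choices, not necessarily what the benchmark code does):

#include <math.h>
#include <stddef.h>

/* hypothetical wrapper around the loop above; compile with -fopenmp */
long count_matches(const float *x, const float *y, size_t t_size)
{
    long result = 0;
    #pragma omp parallel for reduction(+:result)
    for (long i = 0; i < (long)t_size; i++) {
        result += (sqrt(pow(x[i] - 25.6, 2) + pow(y[i] - 12.8, 2)) < 15);
    }
    return result;
}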
The GPU implementation is very similar to the one from the last post: we copy and process data on the CPU and GPU in parallel. This requires five pieces of work running concurrently (see the CUDA sketch after this list):
- copy data from the columns to a buffer in main memory (this must be pinned memory to allow async. transfers)
- copy data from the main memory buffer to the GPU memory buffer
- process/filter data in the GPU memory buffer by executing the kernel on the GPU
- copy the result of the kernel (a bitvector) from GPU to CPU
- count the 1-bits on the CPU and add this to the final result
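Condensed into CUDA, the pipeline looks roughly like the sketch below. This is my reconstruction, not the actual pgstrom_test code: the kernel, the buffer names, and the single-stream structure are simplifications, and the real implementation double-buffers so that the five steps actually overlap across chunks.

#include <cuda_runtime.h>
#include <math.h>
#include <string.h>

#define CHUNK   (64 * 1024)   /* floats per stream buffer */
#define THREADS 512           /* threads per block        */

/* One thread checks one record and sets one bit in the result vector. */
__global__ void filter_kernel(const float *x, const float *y,
                              unsigned int *bits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && sqrtf(powf(x[i] - 25.6f, 2) + powf(y[i] - 12.8f, 2)) < 15.0f)
        atomicOr(&bits[i / 32], 1u << (i % 32));
}

long filter_streamed(const float *x, const float *y, long t_size)
{
    const int words = CHUNK / 32;

    /* pinned host staging buffers -- required for asynchronous transfers */
    float *hx, *hy; unsigned int *hbits;
    cudaMallocHost((void **)&hx, CHUNK * sizeof(float));
    cudaMallocHost((void **)&hy, CHUNK * sizeof(float));
    cudaMallocHost((void **)&hbits, words * sizeof(unsigned int));

    /* device buffers */
    float *dx, *dy; unsigned int *dbits;
    cudaMalloc((void **)&dx, CHUNK * sizeof(float));
    cudaMalloc((void **)&dy, CHUNK * sizeof(float));
    cudaMalloc((void **)&dbits, words * sizeof(unsigned int));

    cudaStream_t s;
    cudaStreamCreate(&s);

    long result = 0;
    for (long off = 0; off < t_size; off += CHUNK) {
        int n = (int)(t_size - off < CHUNK ? t_size - off : CHUNK);

        /* 1) copy the columns into the pinned staging buffers */
        memcpy(hx, x + off, n * sizeof(float));
        memcpy(hy, y + off, n * sizeof(float));

        /* 2) host -> device */
        cudaMemcpyAsync(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice, s);
        cudaMemcpyAsync(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice, s);

        /* 3) filter on the GPU */
        cudaMemsetAsync(dbits, 0, words * sizeof(unsigned int), s);
        filter_kernel<<<(n + THREADS - 1) / THREADS, THREADS, 0, s>>>(dx, dy, dbits, n);

        /* 4) device -> host: only the bit vector comes back */
        cudaMemcpyAsync(hbits, dbits, words * sizeof(unsigned int),
                        cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);

        /* 5) count the 1-bits on the CPU (GCC builtin) */
        for (int w = 0; w < (n + 31) / 32; w++)
            result += __builtin_popcount(hbits[w]);
    }

    cudaStreamDestroy(s);
    cudaFree(dx); cudaFree(dy); cudaFree(dbits);
    cudaFreeHost(hx); cudaFreeHost(hy); cudaFreeHost(hbits);
    return result;
}

Returning a bit vector instead of the matching rows keeps the device-to-host copy tiny: one bit per record instead of eight bytes.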
The buffers hold 64*1024 floating point numbers each. You can adjust this by setting the following in pgstrom_test.h:
static const int STREAM_BUF_SIZE = 128*1024; //in number of floats
#define NUMBER_OF_THREADS 512
Additionally, I implemented a naive version that copies everything to the GPU, processes it, and copies the result back. This way we can also compare the GPU's and the CPU's pure execution times without the transfer. If you keep a replica of the table in the GPU's memory, this would be the number to look at. PGStrom, however, streams the data.
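A sketch of this naive variant, again my reconstruction, reusing filter_kernel and THREADS from the streaming sketch above (it additionally needs <stdlib.h> for malloc):

/* naive: one big transfer in, one kernel launch, one transfer back */
long filter_naive(const float *x, const float *y, long t_size)
{
    long words = (t_size + 31) / 32;
    float *dx, *dy; unsigned int *dbits;
    cudaMalloc((void **)&dx, t_size * sizeof(float));
    cudaMalloc((void **)&dy, t_size * sizeof(float));
    cudaMalloc((void **)&dbits, words * sizeof(unsigned int));
    cudaMemset(dbits, 0, words * sizeof(unsigned int));

    cudaMemcpy(dx, x, t_size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, t_size * sizeof(float), cudaMemcpyHostToDevice);
    filter_kernel<<<(int)((t_size + THREADS - 1) / THREADS), THREADS>>>(dx, dy, dbits, (int)t_size);

    unsigned int *hbits = (unsigned int *)malloc(words * sizeof(unsigned int));
    cudaMemcpy(hbits, dbits, words * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    long result = 0;
    for (long w = 0; w < words; w++)
        result += __builtin_popcount(hbits[w]);

    free(hbits);
    cudaFree(dx); cudaFree(dy); cudaFree(dbits);
    return result;
}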
Hardware
Our test machine is a Linux workstation equipped with two octa-core Xeon E5-2690 CPUs and an NVIDIA Tesla K20 with 8 GB of RAM. The GPU is connected via PCI Express 3.0 with a peak bandwidth of 11.2 GB/s.
Results
The CPU takes 73 ms with one thread and 12 ms with 32 threads. The naive
GPU version needs 3 ms to execute the kernel and return the result, but the
whole run takes 76 ms; with streaming it takes 67 ms. However, 40 ms of the
streaming approach are spent on setup, i.e., allocating buffers on the GPU
and pinned memory on the CPU. I think it would be possible to do this once
at the startup of the DBMS. Therefore, in my opinion it is fair to compare
the 15 ms on the GPU to the 12 ms on the parallel CPU.
./pgstrom_test
Size is 76.294 MB
Result: 9934; CPU 0.073 sec
Result: 9934; p CPU 0.012 sec
Result: 9934; GPU naive 0.003 (0.076) sec
Result: 9934; GPU stream 0.015 (0.067) sec
As always with data-intensive processing, the transfer is the bottleneck. And yes, this is data-intensive: everything can be computed in linear time, and at 11.2 GB/s just pushing the 76 MB over PCI Express already takes about 7 ms. It would take a lot more square roots and exponents in the formula to make this compute-intensive.
Although at first glance my simple implementation looks faster than PGStrom, it really is not: PGStrom just needs 150 ms more to parse the SQL statement, generate the kernel code, load the data from disk, and deal with all the additional overhead of a DBMS.
Conclusion
Et voilà: a speed-up of 100x for the sequential CPU implementation and 650x
for the parallel implementation compared to Postgres. Not because of faster
hardware (on this machine Postgres needed a bit over 5 seconds for the
query) but because of an optimization for the problem. Postgres is built
to execute generic statements on generic data under ACID properties. It
loads data from disk and stores tables row-wise. It can execute statements
of the discussed form, but it is not optimized for that kind of problem.
I guess it makes virtual function calls for every arithmetic operator in
the filter formula, for every record it evaluates.
If I had to optimize my database system for this kind of statement, I
would use a JIT compiler to generate executable code from the formula
and store the data column-wise. I guess PGStrom does something similar,
but with a parallel CPU implementation instead of CUDA it would not be
much slower, maybe even faster.
tl;dr: Do not compare apples and oranges.