PGStrom for Postgres

PGStrom is a tool to execute a special type of SQL query on the GPU. The developers(?) claim to achieve a speedup of 50x compared to standard Postgres because they use the GPU. I do not doubt that PGStrom is faster than Postgres, but this is not because of the GPU. In this post I show that a multi-core CPU can process this queries as fast as the GPU. For the test-case described on the PGStrom wiki page the CPU is even a bit faster.

PGStrom apples oranges

2013-04-28

Introduction

A lot of colleagues mailed me the link to PGStrom, because they knew that this is in my domain and some of them also knew that 10x speedups and more by using a GPU in my field of work usually makes me skeptical.
PGStrom executes statements in the form of

SELECT * FROM table WHERE <complex formula>

...on the GPU. This is not a bad idea at all. If you take a look at my last blog post, you can see that we have easily 5 checks out of 6. The algorithm can be processed in a SIMD fashion, streamed without overhead and we do not need to transfer data more than once (a.k.a. "fits into GPU memory"). However, the most important check is missing, and that's the part where I am skeptical: I am not sure if the transfer to the GPU is faster than the computation on the CPU.
I don't think that it is appropriate to compare Postgres against an optimized implementation on the GPU. Postgres processes 10 Mio records in 7 seconds in their experiment. 10 Mio sounds like a huge number, but actually the example statement processes only two columns with floating point numbers. Assuming these numbers are 32 bit-float types, the CPU processes only 76 MB -- in 7 seconds! Remember that one CPU thread is capable of transferring around 4 GB of data per second!

Experiment and Implementation

In my opinion one way of a fair comparison of GPU vs CPU is to hard-code this query in C and in CUDA and compare the run-time. Because of the detailed description of the PGStrom test setup on the wiki-page (better than in most research papers...), we can easily reproduce their tests. This SQL statement

SELECT COUNT(*) FROM t1 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 15;

becomes the following C code:


for(int i=0; i<(int)t_size; i++) {

  result += (sqrt( pow((x[i]-25.6),2) + pow((y[i]-12.8),2)) < 15);

}

This can easily be parallelized with openmp. The GPU implementation is very similar to the one from the last post. We copy and process data on CPU and GPU in parallel. This requires 5 parallel lines of work:

copy data from the columns to a buffer in main memory (this must be pinned memory to allow async. transfers)
copy data from the main memory buffer to the GPU memory buffer
process/filter data in the GPU memory buffer by executing the kernel on the GPU
copy the result of the kernel (a bitvector) from GPU to CPU
count the 1-bits on the CPU and add this to the final result

The buffers hold 64*1024 floating point numbers each. You can adjust this by setting the following in pgstrom_test.h:


static const int STREAM_BUF_SIZE = 128*1024; //in number of floats

#define NUMBER_OF_THREADS 512

Additionally I implemented a naive version that copies everything to the GPU, processes it and copies the result back. This way we can also compare the GPU's and the CPU's pure execution time without the transfer. If you hold a replication of the table in the GPU's memory, this would be the number to look at. However, PGStrom also streams the data.

Hardware

Our test machine is a Linux workstation equipped with two Xeon E5-2690 hexa core CPUs and a NVidia Tesla K20 with 8 GB of RAM. The GPUs are connected via PCI-Express version 3 with a peak bandwidth of 11.2 GB/s.

Results

The CPU takes 73 ms with one thread and 12 ms with 32 threads. The naive GPU version needs 3 ms to execute the kernel and give the result back. But the whole setup without streaming takes 76 ms, 67 ms with streaming. However, 40 ms of the streaming approach are needed for the setup, i.e., allocating buffers on the GPU and pinned memory on the CPU. I think it is possible to do this once at the startup of the DBMS.
Therefore, in my opinion it is fair to compare the 15 ms on the GPU to the 12 ms on the parallel CPU.


./pgstrom_test

Size is 76.294 MB

Result: 9934; CPU 0.073 sec

Result: 9934; p CPU 0.012 sec

Result: 9934; GPU naive 0.003 (0.076) sec

Result: 9934; GPU stream 0.015 (0.067) sec

As always with data-intensive processes the transfer is the bottleneck. And yes, this is data-intensive. Everything can be computed in linear time. It takes a lot of square roots and exponents in the formula to make this compute-intensive.

Although at a first glance my simple implementation looks faster than PGStrom, it really is not. PGStrom just needs 150 ms more to parse the SQL-statement, create the kernel code, load the data from disk, and deal with all the additional overhead of the DBMS.

Conclusion

Et voila, Speed-Up of 100x for the sequential CPU implementation and 650x for the parallel implementation compared to Postgres. Not because of faster hardware (on this machine Postgres needed a bit over 5 seconds for the query) but because of an optimization to the problem. Postgres is there to execute generic statements on generic data under ACID-properties. It loads data from disk and stores tables row-wise. It can execute statements of the discuessed form but it is not optimized for that kind of problem. I guess it makes virtual function calls for every arithmetic operator in the filter-formula every time it evaluates one record.
If I had to optimize my database system for this kind of statement, I would use a JIT-compiler to generate executable code from the formula, and store data column-wise. I guess, PGStrom does something similar. But with a parallel CPU implementation instead of CUDA, they would not be much slower, maybe even faster.

tl;dr; Do not compare apples and oranges.

The source code is on github. The project files for nsight are also in the repository.

Not A Paper - Research Snippets that were not published.