GPUs for Query Execution: Ocelot, gpudb and our approach
I already wrote a short post about the state of research on query processing on the GPU. This year, a lot has changed: at VLDB, two papers were presented that focus on SQL query execution on GPUs. Apart from the shared proof of concept that GPUs can execute queries, their results are quite different. Thankfully, both teams provide their implementations as open source software.
The first paper, The Yin and Yang of Processing Data Warehousing Queries on GPU Devices, focuses on queries from the Star Schema Benchmark and compares the results of their implementation, gpudb, on different desktop GPUs and Intel's i7. They mention limitations, e.g., that data must reside in pinned memory for fast transfer, but to my surprise they conclude that a GPU is always faster than the CPU despite the data transfer (with speed-ups from 2x to 7x).
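As a side note for readers who have not worked with GPUs: pinned (page-locked) memory lets the DMA engine copy data directly from host RAM, which typically improves PCIe transfer bandwidth considerably. A minimal sketch of the pattern in OpenCL terms could look like this; the helper is hypothetical and all names (ctx, queue, column_data, n_bytes) are mine, not gpudb's:
#include <string.h>
#include <CL/cl.h>
/* Hypothetical helper: copy one column to the device through a
 * pinned staging buffer. Error handling is omitted for brevity. */
cl_mem copy_column_pinned(cl_context ctx, cl_command_queue queue,
                          const void *column_data, size_t n_bytes) {
    cl_int err;
    /* host-side buffer backed by pinned (page-locked) memory */
    cl_mem staging = clCreateBuffer(ctx,
        CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, n_bytes, NULL, &err);
    void *pinned = clEnqueueMapBuffer(queue, staging, CL_TRUE,
        CL_MAP_WRITE, 0, n_bytes, 0, NULL, NULL, &err);
    memcpy(pinned, column_data, n_bytes); /* fill with column values */
    /* device buffer; the DMA copy reads from the pinned region */
    cl_mem d_column = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
        n_bytes, NULL, &err);
    clEnqueueWriteBuffer(queue, d_column, CL_TRUE, 0, n_bytes,
        pinned, 0, NULL, NULL);
    clEnqueueUnmapMemObject(queue, staging, pinned, 0, NULL, NULL);
    clReleaseMemObject(staging);
    return d_column;
}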
The authors of the second paper, Hardware-Oblivious Parallelism for In-Memory Column-Stores, do not intend to show that GPUs are better query executors, but argue that a DBMS must be able to use the new architectures that are integrated into the CPU. AMD's Accelerated Processing Units and the integrated GPUs in Intel's Haswell architecture are becoming more and more popular; not using these chips is similar to not supporting multi-threaded execution today. Their implementation, Ocelot, also takes a different approach: they replaced MonetDB's execution engine with a similar one based on OpenCL. It also uses a column-at-a-time processing model and supports all queries MonetDB can handle. An additional data manager transfers columns to the GPU when they are needed and caches them there. The measurements are all done on a hot cache, so transfer times are ignored. However, at SF 8 the GPU's memory is needed for the input data, and intermediate results have to be swapped to the host memory. For this reason the results are a bit counter-intuitive: the GPU is fast for the small dataset, slow for the intermediate one, and cannot compute the large one at all. It would be quite interesting to see Ocelot working on an APU or another truly heterogeneous processor, where data transfer is not necessary.
In the previous post I already mentioned that we tried OLAP query execution on GPUs and published the results at ADBIS. We tried an even more radical approach than the authors of the first paper and wrote the queries by hand according to a map/reduce-like model described in our paper.
We already have a prototype that generates code from SQL queries for the CPU; it is possible to adjust it to produce OpenCL code for queries that can run on the GPU. The limits of our approach are explained in the paper. However, it was not our goal to find a general model for all queries but a simple and fast one for TPC-H-like queries. We show that this runs well on a multi-core CPU and on a GPU. Still, in contrast to gpudb, the transfer takes significantly longer than the execution: for most queries the execution times are below 20 ms, while it takes around 100 ms to transfer the input data and the necessary indexes. Similar to Ocelot, we cache the data on the GPU and ignore the transfer times (of the input data) in our measurements.
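To make the measurement setup concrete, here is a minimal sketch of how execution can be timed separately from the transfer using OpenCL's event profiling. The helper is hypothetical (not the code from our experiments) and assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that the columns are already cached on the device:
#include <CL/cl.h>
/* Hypothetical helper: run a 1-D kernel and return its pure
 * execution time in milliseconds, excluding any data transfer. */
double run_and_time_ms(cl_command_queue queue, cl_kernel kernel,
                       size_t global_size, size_t local_size) {
    cl_event ev;
    cl_ulong t_start, t_end;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                           &local_size, 0, NULL, &ev);
    clWaitForEvents(1, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof t_start, &t_start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof t_end, &t_end, NULL);
    clReleaseEvent(ev);
    return (t_end - t_start) * 1e-6; /* nanoseconds to milliseconds */
}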
Please note that you cannot expect results like these on an integrated GPU. First, integrated GPUs are much less powerful (cooling and the limited space on the CPU die are the problems), and second, the memory bandwidth is lower. Also, it is not trivial to share memory between CPU and GPU even if they are on the same chip.
Nevertheless, we compared pure execution times to see if a GPU can be
faster for query execution than a CPU. The results show that it mostly
is, but not in general: Q1 takes longer to execute on the GPU than on the
CPU.
Similar to the authors of Ocelot, we were able to show that in most cases the OpenCL implementation performs well on the CPU compared with the native implementation, which uses Intel's TBB for parallel execution.
Therefore, I support the conclusion of Heimel et al.: if CPUs with integrated GPUs show up in machines used for DBMSs (most of Intel's Xeons still do not have an integrated GPU), we can benefit from using OpenCL at the price of slightly worse performance. The alternatives are to not use the available hardware, or to provide and maintain two (or more) implementations, one for each device/architecture. However, in some cases vendors will do exactly that for perfect performance. In Vectorwise something similar is already done for the CPU, with different compilers and parameters for each operator, as described in Micro Adaptivity in Vectorwise.
But in my opinion there are things we regularly do in a DBMS that are better suited to the GPU's execution model, and we should focus on doing these on the GPU. Selectivity estimation, for instance, often requires a lot of computation. Because only a small amount of data needs to be transferred, it can even be executed on external cards with high speed-ups, as we will show soon. For APUs it is important to schedule operations wisely.
Additional information about our paper
In the first version of our paper we had Q4 as a complete example, but because of limited space we had to leave it out for ADBIS. Here it is:
/*
select o_orderpriority, count(*) from orders
where o_orderdate >= '1993-07-01'
  and o_orderdate < '1993-10-01'
  and exists (select * from lineitem
              where l_orderkey = o_orderkey
              and l_commitdate < l_receiptdate)
group by o_orderpriority
order by o_orderpriority
*/
__kernel void global_compute (
    __global unsigned short* p_c1,
    ... // columns/indexes as parameters
    __global long* interim_result,
    int bs // num of tuples processed by one thread
) {
    __local long l_thread_data[RESULT_SIZE];
    long p_thread_data[RESULT_SIZE];
    ... // init memory with 0
    uint start = (get_group_id(0) * WG_SIZE * bs)
                 + get_local_id(0);
    uint end = start + WG_SIZE * bs;
    /* local compute */
    for (unsigned i0 = start; i0 < end; i0 += WG_SIZE) {
        unsigned short c1 = p_c1[i0];
        if (!(c1 >= 8582)) continue;  // o_orderdate >= '1993-07-01'
        if (!(c1 < 8674)) continue;   // o_orderdate <  '1993-10-01'
        unsigned c0 = p_c0[i0];       // o_orderpriority (group key)
        unsigned ind1e = ind1[i0 + 1];
        for (unsigned i1 = ind1[i0]; i1 < ind1e; ++i1) {
            if (!(p_c2[i1] < p_c3[i1])) continue;
            ++p_thread_data[c0];
            break;                    // exists: one match is enough
        }
    }
    /* local accumulate */
    for (int j = 0; j < RESULT_SIZE; ++j) {
        local_sum(&l_thread_data[j], p_thread_data[j]);
    }
    ... // copy l_thread_data to interim_result
}
Parameters of the kernel are pointers to the columns in global memory, a pointer to the result memory, and the size of the block that is processed by each thread. The query parameters, such as the date range and the number of expected results, are compiled into the code at runtime. We use WG_SIZE as an abbreviation for get_local_size(0), which is the number of threads per workgroup. Before the local Compute phase starts, private (per-thread) and local (per-workgroup) memory segments are initialized, and every thread calculates its starting position.
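The initialization itself is elided in the listing above; a possible reconstruction (mine, not the original code) looks like this:
// hypothetical reconstruction of the elided initialization
for (int j = 0; j < RESULT_SIZE; ++j)
    p_thread_data[j] = 0;                    // private, per thread
for (uint j = get_local_id(0); j < RESULT_SIZE; j += WG_SIZE)
    l_thread_data[j] = 0;                    // local, per workgroup
barrier(CLK_LOCAL_MEM_FENCE);                // zeros visible to all threads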
The actual Compute phase is represented by a for-loop. Inside the loop the tuples are filtered according to the where clause, which is a date range in Q4. The inner loop after the filter instructions represents the join between lineitem and orders: we iterate over the join index and filter tuples according to the inner select's where clause. As soon as one matching tuple is found, the private counter for this orderpriority is incremented and the rest of the inner loop is skipped. At the end of the Compute phase, every thread of a workgroup holds the result for the tuples it processed in private memory.
In the local Accumulate phase these results are added to the interim result of each workgroup. A sum function optimized for the GPU is executed on every line of the result to sum up the tuples counted during the Compute phase.
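The body of local_sum is not shown in the listing. One classic way to implement such a function is a tree reduction in local memory. The sketch below is hypothetical and deliberately differs from the original call signature: OpenCL C only allows __local declarations at kernel scope, so the scratch buffer has to be passed in, and a power-of-two workgroup size is assumed.
// hypothetical work-group reduction, not the published local_sum
void local_sum_sketch(__local long* dst, long val,
                      __local long* scratch) {
    uint lid = get_local_id(0);
    scratch[lid] = val;
    barrier(CLK_LOCAL_MEM_FENCE);
    // tree reduction: halve the number of active threads per step
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        *dst += scratch[0];       // accumulate into the result slot
    barrier(CLK_LOCAL_MEM_FENCE); // scratch is reused for the next line
}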
The global Accumulate phase (not shown in the code) copies the local results of all workgroups to the shared memory of one workgroup and executes the same sum function as the local Accumulate.
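For illustration, a strongly simplified variant of such a second kernel could look like the following. Unlike the version described above, it sums directly over global memory instead of staging the results in shared memory first; NUM_GROUPS is a hypothetical compile-time constant like RESULT_SIZE.
// hypothetical, simplified global Accumulate kernel
__kernel void global_accumulate(__global const long* interim_result,
                                __global long* result) {
    // each thread sums one line of the result over all workgroups
    for (uint j = get_local_id(0); j < RESULT_SIZE;
         j += get_local_size(0)) {
        long sum = 0;
        for (uint g = 0; g < NUM_GROUPS; ++g)
            sum += interim_result[g * RESULT_SIZE + j];
        result[j] = sum;  // final count per o_orderpriority
    }
}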
The full code for our experiments (with OpenCL) can be found on GitHub, and here are the presentation slides.
References
- Yuan Yuan, Rubao Lee, Xiaodong Zhang: The Yin and Yang of Processing Data Warehousing Queries on GPU Devices. VLDB 2013
- Star Schema Benchmark
- Max Heimel, Michael Saecker, Holger Pirk, Stefan Manegold, Volker Markl: Hardware-Oblivious Parallelism for In-Memory Column-Stores. VLDB 2013
- Hannes Rauhe, Jonathan Dees, Kai-Uwe Sattler, Franz Faerber: Multi-Level Parallel Query Execution Framework for CPU and GPU. ADBIS 2013: 330-343
- Bogdan Răducanu, Peter Boncz, Marcin Zukowski: Micro Adaptivity in Vectorwise. SIGMOD 2013
- Code on GitHub
- Slides of the ADBIS 2013 presentation