Not A Paper - Research Snippets that were not published.

GPUs for Query Execution: Ocelot, gpudb and our approach

At ADBIS we published a paper about query execution on GPUs. A week earlier, two papers on the same general topic were presented at VLDB. I'm comparing the approaches and the findings. In my opinion, the Ocelot approach is the right one if we see heterogeneous processors in the future. My conclusion for external GPUs connected via PCIe is different: although high-end GPUs can often, but not always, outperform comparable CPUs, they are not suitable for query execution in general because of the data transfer bottleneck. The paper about gpudb, however, shows different results.
ADBIS paper query execution
2013-09-06

Ocelot, gpudb and our approach compared

I already wrote a short post about the state of research on query processing on the GPU. This year, a lot has changed. At VLDB, two papers were presented that focus on SQL query execution on GPUs. Apart from the general proof of concept that GPUs can execute queries, their results are quite different. Thankfully, both teams provide their implementations as open source software.

The first paper, The Yin and Yang of Processing Data Warehousing Queries on GPU Devices, focuses on queries from the Star Schema Benchmark and compares the results of their implementation gpudb on different desktop GPUs and Intel's i7. They mention limitations, e.g., that data must reside in pinned memory for fast transfer, but to my surprise they conclude that a GPU is always faster than the CPU despite the data transfer (with speedups from 2x to 7x).
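
For context, "pinned" here refers to the usual page-locked staging pattern for PCIe transfers. The following is only a sketch of that pattern with the standard OpenCL API (gpudb's own transfer code is not reproduced here, and upload_pinned is a made-up helper name):

#include <string.h>
#include <CL/cl.h>

/* Allocate a page-locked staging buffer, fill it, and copy to the device. */
cl_mem upload_pinned(cl_context ctx, cl_command_queue q,
                     const void *src, size_t bytes) {
  cl_int err;
  /* CL_MEM_ALLOC_HOST_PTR asks the runtime for pinned host memory */
  cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                 bytes, NULL, &err);
  void *staging = clEnqueueMapBuffer(q, pinned, CL_TRUE, CL_MAP_WRITE,
                                     0, bytes, 0, NULL, NULL, &err);
  memcpy(staging, src, bytes);
  /* the write from pinned memory runs at full PCIe bandwidth */
  cl_mem dev = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
  clEnqueueWriteBuffer(q, dev, CL_TRUE, 0, bytes, staging, 0, NULL, NULL);
  clEnqueueUnmapMemObject(q, pinned, staging, 0, NULL, NULL);
  clReleaseMemObject(pinned);
  return dev;
}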

The authors of the second paper, Hardware-Oblivious Parallelism for In-Memory Column-Stores, do not intend to show that GPUs are better query executors, but argue that a DBMS must be able to use the new architectures that are integrated into the CPU. AMD's Accelerated Processing Units and the integrated GPUs in Intel's Haswell architecture are becoming more and more popular. Not using these chips is similar to not supporting multi-threaded execution today. Their implementation also takes a different approach: they replaced the MonetDB execution engine with a similar one based on OpenCL. It also uses a column-at-a-time processing model and supports all queries MonetDB can handle. An additional data manager transfers columns to the GPU when they are needed and caches them there. The measurements are all done on a hot cache, so transfer times are ignored. However, at SF 8 the GPU's memory is needed for the input data, and intermediate results have to be swapped to the host memory. For this reason the results are a bit counter-intuitive: the GPU is fast for the small dataset, slow for the intermediate one, and cannot process the large one at all. It would be quite interesting to see Ocelot working on an APU or another real heterogeneous processor, where data transfer is not necessary.
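
The lazy transfer-and-cache behavior of such a data manager can be pictured roughly as follows. This is a simplified sketch, not Ocelot's actual interface: column_t, get_column, and MAX_COLS are invented for illustration, and there is no eviction.

#include <stddef.h>
#include <CL/cl.h>

#define MAX_COLS 64

typedef struct {
  const void *host_ptr; /* column data on the host, set during setup */
  size_t      bytes;
  cl_mem      dev_buf;  /* NULL until the column is first used */
} column_t;

static column_t cache[MAX_COLS];

cl_mem get_column(cl_context ctx, cl_command_queue q, int col_id) {
  column_t *c = &cache[col_id];
  if (c->dev_buf == NULL) { /* cache miss: transfer the column once */
    cl_int err;
    c->dev_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, c->bytes, NULL, &err);
    clEnqueueWriteBuffer(q, c->dev_buf, CL_TRUE, 0, c->bytes,
                         c->host_ptr, 0, NULL, NULL);
  }
  return c->dev_buf; /* cache hit: reuse the device-resident copy */
}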

In the previous post I already mentioned that we tried OLAP query execution on GPUs and published the results at ADBIS. We tried an even more radical approach than the authors of the first paper and wrote the queries by hand according to a map/reduce-like model described in our paper. We already have a prototype that generates such code from SQL queries for the CPU; it should be possible to adjust it to produce OpenCL code for queries that can run on the GPU. The limits of our approach are explained in the paper. However, it was not our goal to find a general model for all queries but a simple and fast one for TPC-H-like queries. We show that this runs well on a multi-core CPU and on a GPU. Still, in contrast to gpudb, the transfer takes significantly longer than the execution: for most queries the execution times are below 20 ms, while it takes around 100 ms to transfer the input data and the necessary indices. Similar to Ocelot, we cache the data on the GPU and ignore the transfer times (of the input data) in our measurements. Please note that you cannot expect results like that on an integrated GPU. First, they are much less powerful (because of cooling problems and the limited space on a CPU die) and second, the memory bandwidth is lower. Also, it is not trivial to share memory between CPU and GPU even if they are on the same chip.
Nevertheless, we compared pure execution times to see if a GPU can execute queries faster than a CPU. The results show that it mostly can, but not always: Q1 takes longer to execute on the GPU than on the CPU.
Similar to the authors of Ocelot, we were able to show that the OpenCL implementation performs well on the CPU in most cases compared with the native implementation, which uses Intel's TBB for parallel execution.

Therefore, I support the conclusion of Heimel et al.: if CPUs with integrated GPUs show up in machines used for DBMSs (most of Intel's Xeons still do not have an integrated GPU), we can benefit from using OpenCL at the price of slightly worse performance. The alternatives are not to use the available hardware at all, or to provide and maintain two (or more) implementations, one for each device/architecture. However, in some cases vendors will do exactly that for perfect performance. In Vectorwise something similar is already done for the CPU, with different compilers and parameters for each operator, as described in Micro Adaptivity in Vectorwise.

But in my opinion there are things we regularly do in a DBMS that are better suited to the GPU's execution model, and we should focus on doing these on the GPU. Selectivity estimation, for instance, often requires a lot of calculation. Because only a small amount of data needs to be transferred, it can even be executed on external cards with high speedups, as we will show soon. For APUs it is important to schedule operations wisely.

Additional information about our paper

In a first version of our paper we had Q4 as a complete example, but because of limited space we had to leave it out for ADBIS. Here it is:


/*
select o_orderpriority, count(*) from orders
where o_orderdate >= '1993-07-01' 
and o_orderdate < '1993-10-01'
and exists (select * from lineitem 
 where l_orderkey = o_orderkey 
 and l_commitdate < l_receiptdate)
group by o_orderpriority
order by o_orderpriority
*/

__kernel void global_compute (
    __global unsigned short* p_c1,
    ... //columns/indexes as parameters
    __global long* interim_result,
    int bs//num of tuples processed by 1 thread
    ) {
  __local long l_thread_data[RESULT_SIZE];
  long p_thread_data[RESULT_SIZE];
  ... //init memory with 0
  uint start = (get_group_id(0) * WG_SIZE * bs)
               + get_local_id(0);
  uint end = start + WG_SIZE * bs;
/* local compute */
  for (unsigned i0=start; i0<end; i0+=WG_SIZE) {
    unsigned short c1 = p_c1[i0];
    if (!(c1 >= 8582)) continue; // o_orderdate >= '1993-07-01'
    if (!(c1 < 8674))  continue; // o_orderdate <  '1993-10-01'
    unsigned c0 = p_c0[i0];      // o_orderpriority (group key)
    unsigned ind1e = ind1[i0 + 1];
    // join with lineitem via the join index ind1
    for (unsigned i1=ind1[i0]; i1<ind1e; ++i1) {
      if (!(p_c2[i1] < p_c3[i1])) continue; // l_commitdate < l_receiptdate
      ++p_thread_data[c0]; // exists: count once, skip remaining matches
      break;
    }
  }
/* local accumulate */
  for (int j=0; j < RESULT_SIZE; ++j) {
    local_sum(&l_thread_data[j],p_thread_data[j]);
  }
  ... //copy l_thread_data to interim_result
}

Parameters for the kernel are pointers to the columns in global memory, a pointer to the result memory, and the size of the block that is processed by each thread. The query parameters, such as the date range and the number of expected results, are compiled into the code at runtime; a host-side sketch of this step follows below. We use WG_SIZE as an abbreviation for get_local_size(0), which is the number of threads per workgroup. Before the local Compute phase starts, private (per-thread) and local (per-workgroup) memory segments are initialized and every thread calculates its starting position.
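
This is not our actual code generator, just a minimal sketch of how such runtime compilation can look with the standard OpenCL API; the two %d placeholders in the kernel template and the helper name build_query are assumptions for this example.

#include <stdio.h>
#include <CL/cl.h>

/* Splice the date constants into the kernel source and pass the
   remaining query parameters as preprocessor definitions. */
cl_program build_query(cl_context ctx, cl_device_id dev,
                       const char *kernel_template, /* contains two %d */
                       int date_lo, int date_hi, int result_size) {
  char src[65536], options[128];
  snprintf(src, sizeof(src), kernel_template, date_lo, date_hi);
  snprintf(options, sizeof(options),
           "-DRESULT_SIZE=%d -DWG_SIZE=get_local_size(0)", result_size);
  const char *p = src;
  cl_int err;
  cl_program prog = clCreateProgramWithSource(ctx, 1, &p, NULL, &err);
  clBuildProgram(prog, 1, &dev, options, NULL, NULL);
  return prog;
}

For Q4 this would be called with the dates encoded as day numbers, e.g. build_query(ctx, dev, template, 8582, 8674, RESULT_SIZE).
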
The actual Compute phase is represented as a for-loop. Inside the loop the tuples are filtered according to the where clause, which is a date range in Q4. The inner loop after the filter instructions represents the join operation between lineitem and orders. In this loop we iterate over the join index and filter tuples according to the inner select's where clause. As soon as one tuple is found, the private counter for this orderpriority is incremented and the rest of the inner loop is skipped. At the end of the Compute phase every thread of a workgroup holds the result for the tuples it processed in private memory.
In the local Accumulate phase these results are added to each workgroup's interim result. A sum function optimized for the GPU is executed on every entry of the result to add up the tuples counted in the Compute phase.
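
The implementation of local_sum is not shown above. The simplest variant that matches its signature is an atomic add into local memory, assuming the device supports 64-bit atomics via the cl_khr_int64_base_atomics extension; the version optimized for the GPU may well be a barrier-based reduction instead.

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

// Illustrative only: every thread atomically folds its private count
// into the workgroup-local result slot.
void local_sum(__local long *dst, long val) {
  atom_add(dst, val);
}
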
The global Accumulate phase (not shown in the code) copies the local results of all workgroups to the shared memory of one workgroup and executes the same sum function as the local Accumulate.
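
A sketch of what this could look like, launched as a single workgroup after global_compute; the layout of interim_result (one slice of RESULT_SIZE entries per workgroup) is an assumption here.

__kernel void global_accumulate(
    __global long* interim_result, // one RESULT_SIZE slice per group
    __global long* result,
    int num_groups) {
  __local long l_sum[RESULT_SIZE];
  for (int j = get_local_id(0); j < RESULT_SIZE; j += get_local_size(0))
    l_sum[j] = 0;
  barrier(CLK_LOCAL_MEM_FENCE);
  // every thread folds a subset of the workgroup partials into local memory
  for (int g = get_local_id(0); g < num_groups; g += get_local_size(0))
    for (int j = 0; j < RESULT_SIZE; ++j)
      local_sum(&l_sum[j], interim_result[g * RESULT_SIZE + j]);
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int j = get_local_id(0); j < RESULT_SIZE; j += get_local_size(0))
    result[j] = l_sum[j];
}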

The full code for our experiments (with OpenCL) can be found on GitHub, and here are the presentation slides.
