Processing strings on the GPU
IntroductionThe GPU is well suited to handle data types such as numbers with the same size. A good indicator for the fact that GPUs are not well at handling variable length strings is, that no official library is able to handle them. Even very simple string-operations, such as comparisons, are not available and have to be implemented by the application developer.
There are two main problems when dealing with strings: First, Strings - even with constant length - are data intensive and require only a very small amount of processing in general, e.g., a string comparison often requires just the comparison of the first character. The transfer to the GPU therefore often dominates the processing time. Second, variable length values do not fit well to the SIMD processing architecture of GPUs. There is always a fixed number of processing cores doing the same instruction. We cannot dynamically assign work to each of them, but variable length values require this.
A popular suggestion to cope with the problem is dictionary encoding. If there is only a small number of distinct values, this limits the amount of actual string operations to a minimum. In case of a lot of unique values, we can process the dictionary itself on the the CPU and the references on the GPU. This leads to the situation, that either the GPU and the CPU have to communicate every time a string operation is needed or that the CPU has to pre-process everything, which can easily lead to unnecessary overhead. E.g., an equi-join on a string column would mean, that the CPU has to prepare a mapping for the string references in the joined columns, but in most cases the GPU may only need parts of these mappings. Therefore dictionary encoding with large dictionaries only makes sense if it is the minor part of the workload, otherwise the GPU will barely be used.
To evaluate these thoughts we compared CPU/GPU performance of two operations which are often used in databases with columns containing strings. The first operation checks how many strings in one column of a column-wise stored table start with a given string (startswith-operation). The second operation searches all values for a sub-string (substring-operation) and returns teh number of values that match.
For both operations we implemented three different versions:
std: In the first approach we used the containers of the C++ standard library. The column is a simple vector of strings. Since there is no library available which handles strings on the GPU, this was only tested on the CPU. It is our baseline. (It doesn't matter for our experiments, but this is the structure which handles updates best.)
Custom: Our hand-written implementation uses one block of memory to store all strings contigously. An array holds the information on where one value starts and how long it is (of course one of these variables would be enough, but for the GPU this is gives a few advantages). The core of the substring operation is the Boyer-Moore-Horspool algorithm. We tried different implementations and used the one we found here . As a side note: we had to implement basic functions such as strstr() on our own and used them for both implementations. Interestingly our strstr() is around 20% faster, because it does only check for equality not for greater/lesser.
Dictionary: We used our Custom structure as dictionary and a simple array holds the references to the dictionary. We implemented it as read-only structure, i.e., the strings in the dictionary are sorted. Column stores, such as the one in HANA, do this in their main storage as well. To avoid sorting when new values are inserted updates are buffered in another structure until the system decides to recreate the main storage. C-Store handles this similarly: they differentiate between Read-optimized Store and Writeable Store .
These three implementations of the two operations are now executed with three different approaches:
Single: The sequential CPU implementation is straight-forward. With one thread we just compare every value with the given string in a loop.
Parallel: The parallel implementation uses OpenMP  to execute the loop on all available cores, which requires one additional line of code and one additional compiler switch... I love it.
GPU: Although the basic execution logic on the GPU is the same - a high number of values is compared to the search string in parallel, the implementation requires more effort.
- allocate memory on the GPU
- transfer the data to the GPU
- execute the kernel
- transfer the result back to the CPU
- process the result on the CPU
- free the memory on the GPU
Step 5 is necessary, since the kernel does not give the final result - the number of matching values - but a bitvector with a bit set on every matching position. With a second kernel call we could add these numbers up to the final result, but we do this on the CPU instead. This two-phase approach is needed, because we cannot synchronize all threads of the GPU efficiently within one kernel. A more detailed description can be found in the reduction sample of the CUDA SDK .
Our test machine is a Linux workstation equipped with two Xeon E5-2690 hexa core CPUs and a NVidia Tesla K10 with 8 GB of RAM. The GPUs are connected via PCI-Express version 3 with a peak bandwidth of 11.2 GB/s. We measured the run-time for the operation with different data structures executed with one thread on the CPU (Single), 32 threads on the CPU (Multi) and on the GPU.
As data set we used the L_COMMENT column of the TPC-H benchmark (SF 10), which consists of 60 mio strings with 10 to 43 characters. The column consists of approx. 70% distinct values. For our experiment we searched for the term "the". 935,863 entries start with this term and it's a substring of 2,068,744 values.
The one-threaded std implementation of the substring implementation takes around 600 ms to execute. With the help of OpenMP we could achieve a speedup of about 4 by executing the loop in parallel. The Custom implementation runs approx. in half the time for the sequential as well as the parallel implementation. The third bar shows that 90% of the time is needed to allocate/free memory (named GPU overhead) and transfer the data to the GPU, less than a milli-second is spent on the CPU to process the interim result. Because of the high transfer costs the GPU implementation needs three times as long as the parallel CPU implementation, although the kernel is four times faster. We added a third group (Custom w/o transfer), where the column is already stored in the GPUs memory. As expected, the overhead for allocating and transferring the result is negligible. The third group shows the same operation, when we keep the data on the CPU and on the GPU. The allocation of memory for the interim results and its transfer take approx. one ms and are therefore negligible. The operation on the dictionary encoded column are two simple binary searches on the dictionary to find the range of values that match the term we search for. 99% of the time is spent to find the matching values in the array.
The single-threaded execution of the second operation takes between 2 and 3 seconds for the three different implementation. The focus of this experiment is the comparison of multi-core CPU and GPU, therefore I zoomed the plot to the interesting results. The plain execution time for all variants with CPU and GPU lie between 110 and 150 ms. However, the necessary transfer to the GPU adds 200 ms again. There is no bar for the GPU-dictionary version because there was an error in the result. There was no surprise in the "wrong" version, maybe I'll fix this some time...
Interpretation of the results
The startwith operation is much faster on the dictionary encoded column because it benefits from the sorted dictionary. The dictionary encoding is even disadvantageous for the substr operation, because a full scan is necessary in the dictionary first to get the matching references and then another lookup is necessary to find the references in the array. If there were no matching values in the column the operation would most likely be a bit faster on the dictionary encoded column. The same counts for small dictionaries.
The startswith-operation is clearly memory bound. We can only achieve a speedup of around 4 with 16 cores. There is a significant difference between the std and the custom implementation, because the contiguous memory layout allows much faster access while the usage of std::string forces the algorithm to dereference another pointer. The substring-operation is obviously CPU bound. We achieve a speedup of 20 with 16 cores resp. 32 threads. The Horspool algorithms reads a string more than once. Therefore the cores operate on values in the cache and are not limited by the memory bus.
When executing the operations on the GPU the transfer dominates the execution time, even if we transfer only the dictionary-encoded values. However, it is a surprise that the plain execution of the startswith-operation is significantly faster on the GPU than on the multi-core CPU. We think this is because, although the code doesn't show it, the actual execution fits well to the SIMD processing model. All threads compare the first character with the constant value at the same time and then in most cases the loop continues to the next iteration because there was no hit. This is not the case for the substring operation, which is indeed slower on the GPU than on the CPU. The reason for this might be the high branch divergence when executing the Horspool algorithm. Another problem is that we cannot use the faster Shared Memory for our implementation because it is shared between the threads of one workgroup and every thread needs a different amount because of the variable length values. Dealing with this problem would induce more overhead for the memory management and more branch divergence because every thread would execute different code depending on where the string is stored. Maybe just comparing all values in parallel is not the right approach, but a massively parallel SIMD-like algorithm for searches in short strings may not be possible. Let me know, if you know more about this.
ConclusionUsing the GPU for string-operations does not seem to be beneficial. The biggest problem is the data transfer which dominates the execution time. Another practical problem is that even the most commonly used functions have to be reimplemented for the GPU. Dictionary encoding might be a solution for this problem, because we can do the actual string operations on the CPU and leave the rest to the CPU. However, in many cases the operations on the dictionary dominate the execution time so there is no benefit in using the GPU.
Update 28.4.2014 The source code for the tests is now on Github.
 C-Store Homepage
 Stonebraker et al: "C-Store: A Column-oriented DBMS" (2005)