Parallel programming: Hello World with Intel TBB

Introduction

To many people out there, parallel programming may not sound very useful, and actually pretty complicated. However, everybody expects processors to have several cores and software to make use of them. The whole purpose of parallel programming is to leverage the capabilities of multi-core processors, but this comes with a cost: we need to rethink our way of programming, moving from the old single-threaded or “a thread per task” programs to applications that describe their work-flow in tasks that can be concurrently processed: what is called task-based programming.

This may sound difficult, but it is necessary to produce code that can actually use several cores, and also scale properly (ie. keep a decent performance level) from single-core processors to computer grids with hundreds of cores. The problem of multi-thread apps made by hand is that often, threads are planned per task, but even worse, the number of logical threads in the program is not equal to the number of physical threads on the CPU, but to the number of tasks, which is fixed. This means the program does not scale at all, unless thread pools are created and managed by hand, which means a huge code overhead for the thread pool management system. Besides, this means you still have to do load balancing between threads on your own, which is again a difficult task.

To the difference of multi-threaded apps with synchronisation code, the goal here is to leave all thread management and task assignment to a specifically built library. In this article, I am going to explain the very basics of Intel Threading Building Blocks, most likely one of the most efficient and simple multi-core programming libraries in the wild.

Intel TBB in a nutshell

Intel Threading Building Blocks is a library designed to help expressing parallel algorithms in C++, without having to consider all the technical aspects of thread management (resource usage, communication between threads, mutual exclusion). Quoting from the official website, “Intel TBB is not just a threads-replacement library. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for scalability and performance.”

With Intel TBB, you will usually write a class whenever you want to setup a parallel algorithm. In this class, you will describe how you want your input data to be split between the different processes that will do the computation. You will then describe the computation itself, as a method of the class (the parenthesis operator), and finally, you will write a join method that is used to perform whatever additional steps are necessary on the results of each local computation to put them together.

A quick example of how easy to use TBB is may be better than a long and meaningless speech. We’ll get back to the details later.

Hello, World (sort of)

The code I’m going to show is an example of how to use the parallel_reduce() function (documentation), which is useful for parallelizing loops when you will perform a task on a set of data but where the order of processing is not relevant. For instance, when searching through an unordered set. The idea is somehow similar to the Divide and Conquer approach used in serial algorithms: TBB is going to split the set into subsets (called ranges) that are reasonably calculable by a single CPU, and then join the results of the subsets back together in order to return the result of the whole original set.

My sample class is pretty simple: it prints every number of a given set (from 1 to 10000) to the screen. What we will do in the parallel version is print these numbers when each range is being processed. The example was chosen because it illustrates a side-effect of using multiple cores: the order of processing every element is not preserved, which is exactly why parallel_reduce should only be used for code where every iteration is independent from others. Indeed, you will very probably see that the numbers are not being printed in their lexicographical order, as would have in a serial program.

However, a noticeable thing here is that results are joined in the same order than if the algorithm had been ran with a single CPU. This is easy to check: when you print the numbers during the joining operation, they will all appear in the lexicographical order. It’s time now for us to have a glance at the code:

numberprinter.h

#ifndef NUMBERPRINTER_H
#define NUMBERPRINTER_H

#include <tbb/blocked_range.h>
#include <vector>

using namespace std;
using namespace tbb;

class NumberPrinter {
private:
    static char *develop(size_t i);

public:
    vector<char *> my_vect;

    void operator()(const blocked_range<size_t>& r);
    NumberPrinter(NumberPrinter& x, split);
    NumberPrinter(int highbit);

    void join(const NumberPrinter& y);

};
#endif // NUMBERPRINTER_H

numberprinter.cpp

#include "numberprinter.h"
#include <vector>
#include <iostream>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

using namespace std;

char * NumberPrinter::develop(size_t i)
{
    cout << i << std::endl;
    char *foo=(char*) malloc((log2(i)+3)*sizeof(char));
    sprintf(foo, "s%d", i);
    return foo;
}


void NumberPrinter::operator()( const blocked_range<size_t>& r) {
    for(size_t i=r.begin(); i!=r.end(); ++i)
        my_vect.push_back(develop(i));
}

NumberPrinter::NumberPrinter( NumberPrinter& x, split) {}
NumberPrinter::NumberPrinter(int highbit) {}

void NumberPrinter::join(const NumberPrinter& y) {
    my_vect.reserve(my_vect.size()+  y.my_vect.size());
    for (int i=0; i<y.my_vect.size(); ++i)
            my_vect.push_back(y.my_vect.at(i));
//  my_vect.insert(my_vect.end(), y.my_vect.begin(), y.my_vect.end());
}

The file using the class

#include "tbb/parallel_reduce.h"
#include "tbb/task_scheduler_init.h"
#include "numberprinter.h"

using namespace tbb;

void printNumbers (int n) {
    NumberPrinter p(n);
    parallel_reduce(blocked_range<size_t>(0,n), p, auto_partitioner());
}

int main(int argc, char *argv[]) {
    task_scheduler_init init;

    printNumbers(1000);
    return 0;
}

Source code explanation

Let’s start with the header file of the class. You can see that we define a vector<char *> my_vector that we’ll use to store the output data of the computation.

There are two constructors defined: These just match the requirements of Intel TBB to make the class run. In fact, the copy constructor is called a split constructor (because it has a useless argument so that Intel TBB differentiates it from a normal copy constructor). TBB calls this split constructor when it decides that there is too much data too be handled by a single core. It then creates a copy of one of the class instances, and distributes the data evenly among them (this is entirely transparent for the user, and it’s one of the many things that makes Intel TBB so cool).

We’ll now have a look at the three methods defined. The private method develop(size_t i) is the one where we’ll perform our local computation. i is the number to compute, and our method just displays it and returns a string version of it. Nothing thrilling here, but you get the idea of how it works. You’ll notice when running the code on several CPUs that most of the time, numbers are not displayed in the right order: that’s because several develop(i) methods are concurrently running from several instances of the class.

The operator(const blocked_range<size_t>& r) override is used by Intel TBB, as it’s the function that is called to loop through the items assigned to this instance of the class. TBB uses a blocked_range<size_t> to present the data to class instances. Here, we just call develop(i) on every item, and save the result in my_vector.

Finally, the join(const NumberPrinter& y) method is the one that is called by TBB when all the local computations are done on several processes that manage adjacent data in the original blocked_range. TBB will always call join(const NumberPrinter& y) in the order in which the data has been distributed. For instance, if processes a, b and c are all given a third of the data (in that order), then a will always join with b and b with c. This can be easily checked: just print my_vector after the joining has been done and you’ll notice that in there, all the elements are sorted.

There is only the main file left to explain: first thing, you should initialize the Intel TBB library by creating a task_scheduler_init object. You should only use TBB during the lifespan of this object, so you may just put it at the beginning of your main. We then call the printNumbers(int n) function that creates a NumberPrinter object. Note that I give it 0 as a parameter: actually, in this code, I don’t use the parameter at all (but you’ll need a one parameter constructor for Intel TBB to use parallel_reduce). I then call the parallel_reduce(blocked_range<size_t>(0,n), p, auto_partitioner()) function.

The first parameter is a range of integers between 0 and n, that will be used as input data. There are different kinds of ranges for different data types, and you can also define your own ranges to use with algorithms. This is all explained in the Intel TBB book, and most likely on their website’s tutorial). The second parameter is the object we created, that will run the computation on the data.

The last object is a function that will determine automatically into how many parts the input data should be split, depending on how many CPUs are available. For a purely educational purpose, I don’t really care if I have the right parameter here, so I use the auto_partitioner() rather than choosing how many processors to use myself. Intel TBB’s documentation states that it’s not a big issue if you don’t tune this value too much. Anyway, if your code is run on several architectures, it might be better not to impose a static value here, or to make it tunable by the user of your code.

Want to go further?

I wanted to put a lot of things in this section, but unfortunately, it’s been months since I started writing this blog post, and I forgot what I had to say here. You should just be aware that Intel TBB is much more than just templates for parallelizing algorithms (and there are many more than just the parallel_reduce one !). You will also get to use highly concurrent containers, that can be safely used between threads. There are two kinds of such containers: the first ones use fine-grained locking so that threads wait only when data consistency would be endangered. The second kind uses lock-free algorithms, in which separate threads correct the inconsistencies caused by concurrent use of the containers. Of course, these containers still have a little impact on parallel speed-up, and are slower to use than STL containers. But at least, they’re safe to use, and this will take a huge load of your shoulders when you’ll need to share data between threads.

Another important thing is that you will be able to write new memory allocators, designed for concurrent execution of code. The first advantage they have is that they are scalable: several threads can allocate memory at the same time with them, thus considerably reducing bottlenecks when threads need to allocate. The other important aspect they have is that they will try to align the memory they allocate on cache lines, in order to get rid of the false sharing problem (in a few words, when your data is not aligned in the cache, you may have to load a new cache line just to get the last few bits). Indeed, false sharing occurs in concurrent programs when several processors share a cache line, because one of them may move it to get the end of the data it needs, while the other one is still using it. The Intel TBB allocators will reduce these problems and improve performance a great deal.

If you want to read more, you can either read the tutorial), or buy the excellent book from James Reinders. I learnt Intel TBB with this book, and everything was explained in a very clear way (unlike in this post :D).

I hope that this post has been useful to you, and that you now understand what task-based programming is, how to represent a simple computation in a task template, how to parallelize it with Intel TBB, and finally, how exactly do you get performance improvements (and avoid performance bottlenecks) with concurrent programming. Any questions are welcome (except “why did it take you 8 months to write a damn blog post?”).