4.7.2. Programming

An SMP system enables you to run multiple threads efficiently and concurrently across multiple cores. However, the hardware alone is insufficient in many cases. To improve application performance, you must rewrite code so that it exploits the benefit of parallelization.

The operating system cannot automatically parallelize an application. It can only treat the application as a single scheduling unit. In such cases, you must split the application into multiple smaller tasks. Each of these tasks can be independently scheduled by the OS, as separate threads. A thread is a part of a program that can be run independently and concurrently with other parts of a program. If you decompose an application into smaller execution entities that can be separately scheduled, the OS can spread the threads of the application across multiple cores.

Decomposition methods

It is good practice to decompose your application into smaller tasks capable of parallel execution. The best way to do it depends on the characteristics of the original application. You can break down large data-processing algorithms into smaller pieces, with a number of similar threads that execute in parallel on smaller portions of a dataset. This method is known as data decomposition.
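As an illustration of data decomposition, the following sketch splits an array into contiguous slices and sums each slice on its own thread. The names parallel_sum(), sum_slice(), struct slice, and the thread count of four are assumptions made for this example, not part of any library.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4	/* assumed thread count, chosen for illustration */

/* One slice of the dataset, plus a slot for that slice's partial result. */
struct slice {
	const int *data;
	size_t count;
	long partial_sum;
};

/* Thread routine: sums only the elements in its own slice. */
static void *sum_slice(void *arg)
{
	struct slice *s = arg;
	long total = 0;

	for (size_t i = 0; i < s->count; i++)
		total += s->data[i];
	s->partial_sum = total;
	return NULL;
}

/* Splits data[0..n) into NUM_THREADS contiguous slices and sums them in parallel. */
long parallel_sum(const int *data, size_t n)
{
	pthread_t tids[NUM_THREADS];
	struct slice slices[NUM_THREADS];
	size_t chunk = n / NUM_THREADS;
	long total = 0;

	for (int t = 0; t < NUM_THREADS; t++) {
		size_t start = (size_t)t * chunk;
		int rc;

		slices[t].data = data + start;
		/* The last slice also takes any remainder. */
		slices[t].count = (t == NUM_THREADS - 1) ? n - start : chunk;
		slices[t].partial_sum = 0;
		rc = pthread_create(&tids[t], NULL, sum_slice, &slices[t]);
		assert(rc == 0);
		(void)rc;
	}
	/* Wait for every slice, then combine the partial results. */
	for (int t = 0; t < NUM_THREADS; t++) {
		pthread_join(tids[t], NULL);
		total += slices[t].partial_sum;
	}
	return total;
}
```

Each thread writes only its own partial_sum, so no locking is needed until the results are combined after the joins.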

A different approach is task decomposition. You can identify areas of code that are independent of each other and capable of being executed concurrently. This is more difficult because you must consider the discrete operations being carried out and the interactions among them.

For algorithms that you cannot handle through data or task decomposition, you must analyze the program to identify functional blocks. These are independent pieces of code with defined inputs and outputs that have some scope to be parallelized. Such functional blocks often depend on input from other blocks, but do not have a corresponding dependency on time.

When decomposing an application using these techniques, you must consider the overheads associated with task creation and management. An appropriate level of granularity is required for best performance. If you make your datasets too small, too big, or have too many datasets, it can reduce performance.

Threading models

There are two widely used threading models, the fork-join model and the workers pool model.

In the fork-join model, individual threads have explicit start and end conditions. There is an overhead associated with creating and destroying the threads, and there are latencies associated with the synchronization point. Therefore, threads must be sufficiently long-lived to justify these costs.

If some execution threads are repeatedly required to consume input data, you can use the workers pool threading model. You can create a pool of worker threads at the start of the application. The pool can consist of multiple instances of the same algorithm, where the distributor, also called producer or boss, dispatches the task to the first available worker thread. Alternatively, the workers pool can contain several different data processing operators, and data items are tagged to show which worker can consume the data.
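A minimal sketch of the first form of workers pool might look like the following. It assumes a fixed-size queue guarded by a mutex and two counting semaphores, and a hypothetical STOP_ITEM sentinel that tells a worker to exit. All names, sizes, and the squaring workload are illustrative only.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

#define NUM_WORKERS 3		/* assumed pool size */
#define QUEUE_SIZE 8		/* assumed queue capacity */
#define STOP_ITEM (-1)		/* hypothetical sentinel: tells a worker to exit */

static int queue[QUEUE_SIZE];
static int head, tail;
static long result_total;	/* sum of squares, protected by pool_lock */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t items_available;	/* counts queued items */
static sem_t slots_free;	/* counts empty queue slots */

/* Boss side: blocks while the queue is full, then queues one item. */
static void enqueue(int item)
{
	sem_wait(&slots_free);
	pthread_mutex_lock(&pool_lock);
	queue[tail] = item;
	tail = (tail + 1) % QUEUE_SIZE;
	pthread_mutex_unlock(&pool_lock);
	sem_post(&items_available);
}

/* Worker side: blocks while the queue is empty, then takes one item. */
static int dequeue(void)
{
	int item;

	sem_wait(&items_available);
	pthread_mutex_lock(&pool_lock);
	item = queue[head];
	head = (head + 1) % QUEUE_SIZE;
	pthread_mutex_unlock(&pool_lock);
	sem_post(&slots_free);
	return item;
}

/* Every worker runs the same algorithm: square the item and accumulate it. */
static void *worker(void *arg)
{
	(void)arg;
	for (;;) {
		int item = dequeue();

		if (item == STOP_ITEM)
			break;
		pthread_mutex_lock(&pool_lock);
		result_total += (long)item * item;
		pthread_mutex_unlock(&pool_lock);
	}
	return NULL;
}

/* Boss: creates the pool, dispatches 1..n, then shuts the pool down. */
long sum_of_squares_with_pool(int n)
{
	pthread_t tids[NUM_WORKERS];

	head = tail = 0;
	result_total = 0;
	sem_init(&items_available, 0, 0);
	sem_init(&slots_free, 0, QUEUE_SIZE);
	for (int w = 0; w < NUM_WORKERS; w++) {
		int rc = pthread_create(&tids[w], NULL, worker, NULL);

		assert(rc == 0);
		(void)rc;
	}
	for (int i = 1; i <= n; i++)
		enqueue(i);
	for (int w = 0; w < NUM_WORKERS; w++)
		enqueue(STOP_ITEM);	/* one sentinel per worker */
	for (int w = 0; w < NUM_WORKERS; w++)
		pthread_join(tids[w], NULL);
	sem_destroy(&items_available);
	sem_destroy(&slots_free);
	return result_total;
}
```

Because the queue is first-in, first-out, every real item is taken before any sentinel, and each worker consumes exactly one sentinel before it exits.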

In each of these models, the amount of work to be performed by a thread can be variable and unpredictable. Even for threads that operate on a fixed quantity of data, data dependencies can cause different execution times for similar threads. There can be some synchronization overhead. For example, a parent thread must wait for all spawned threads to return in the fork-join model; a pool of workers must complete data consumption before execution can be resumed.

Threading libraries

You can modify your source code to use a threading library, and so make your target application capable of concurrent execution. Multi-threading support is available in the OS. When modifying existing code, you must ensure that all shared resources are protected by proper synchronization.

This includes any libraries used by the code, because not all libraries are reentrant. In some cases, there can be separate reentrant libraries for use in multi-threaded applications. A library that is designed to be used in multi-threaded applications is called thread-safe. If a library is not known to be thread-safe, only one thread is allowed to make calls to the library functions.

The most commonly used standard in this area is POSIX threads (Pthreads), a subset of the wider POSIX standard. POSIX (IEEE std. 1003) is the Portable Operating System Interface, a collection of OS interface standards. Its goal is to ensure interoperability and portability of code among systems. Pthreads defines a set of API calls for creating and managing threads.

Pthreads libraries are available for Linux, Solaris, and Windows. There are several other multi-threading frameworks. For example, OpenMP can simplify multi-threaded development by providing high-level primitives, or even automatic multi-threading. OpenMP is a multi-platform, multi-language API that supports shared memory multi-processing through a set of libraries, compiler directives, and environment variables.
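As a brief illustration of an OpenMP compiler directive, the following sketch shares the iterations of a loop among threads. The function name dot_product() is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Dot product with the loop iterations shared among OpenMP threads.
 * The reduction clause gives each thread a private copy of sum and
 * combines the copies when the loop ends.
 */
double dot_product(const double *a, const double *b, size_t n)
{
	double sum = 0.0;

	assert(a != NULL && b != NULL);

	#pragma omp parallel for reduction(+:sum)
	for (long i = 0; i < (long)n; i++)
		sum += a[i] * b[i];

	return sum;
}
```

With GCC, the directive takes effect only when you compile with the switch -fopenmp. Without that switch the pragma is ignored and the loop runs serially.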

The Pthreads standard provides a set of C primitives that enable you to create, manage, and terminate threads and to control thread synchronization and scheduling attributes. You can use Pthreads to build multi-threaded software to run on an SMP system.

Pthreads provides the following types:

  • pthread_t - thread identifier.

  • pthread_mutex_t - mutex.

  • sem_t - semaphore.

You must modify your code to include the appropriate header files:

  • #include <pthread.h>

  • #include <semaphore.h>

You must also link your code using the pthread library with the switch -lpthread.

To create a thread, you must call pthread_create(), a library function that requires four arguments:

  • The first argument is a pointer to a pthread_t, which is where you want to store the thread identifier.

  • The second argument is a pointer to a structure that modifies the thread's attributes, for example, scheduling priority. Set it to NULL if no special attributes are required.

  • The third argument is the function that the new thread starts by executing. The thread is terminated if this function returns.

  • The fourth argument is a void * pointer supplied to the thread. This can receive a pointer to a variable or data structure containing relevant information to the thread function.
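The four arguments can be seen together in the following sketch, which passes a small structure to the new thread and collects the thread's return value with pthread_join(). The names struct square_job, square_thread(), and run_square_job() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical argument block handed to the thread through the void * parameter. */
struct square_job {
	int input;
	long output;
};

/* Thread function: receives the fourth pthread_create() argument as arg. */
static void *square_thread(void *arg)
{
	struct square_job *job = arg;

	job->output = (long)job->input * job->input;
	return job;	/* the return value is made available to pthread_join() */
}

/* Runs one job on a new thread and waits for it. Returns 0 on success. */
int run_square_job(struct square_job *job)
{
	pthread_t tid;		/* first argument: where the identifier is stored */
	void *result = NULL;
	int rc;

	assert(job != NULL);
	/* Second argument NULL: default attributes. */
	rc = pthread_create(&tid, NULL, square_thread, job);
	if (rc != 0)
		return rc;
	rc = pthread_join(tid, &result);
	if (rc == 0 && result != job)
		return -1;
	return rc;
}
```

The structure must remain valid for as long as the thread uses it. Here the caller owns it and waits for the thread before returning.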

A thread completes either by returning from its start function or by calling pthread_exit(). A thread can be detached, using pthread_detach(). A detached thread automatically has its associated data structures released on exit.

For a thread that has not been detached, this resource cleanup occurs as part of a pthread_join() call from another thread. The library function pthread_join() enables you to make a thread stall and wait for completion of another thread. Take care that every thread that is not detached is eventually joined, because a thread that has completed but is never joined remains as a so-called zombie thread that still holds its resources. It is not possible to join a detached thread, that is, one on which pthread_detach() has been called.

Mutexes are created with the pthread_mutex_init() function. The functions pthread_mutex_lock() and pthread_mutex_unlock() are used to lock or unlock a mutex.

The function pthread_mutex_lock() blocks the thread until the mutex can be locked. pthread_mutex_trylock() checks whether the mutex can be claimed and returns an error if it cannot, rather than blocking.

A mutex can be deleted when it is no longer required with the pthread_mutex_destroy() function. Semaphores are created in a similar way, using sem_init(). However, you must specify the initial value of the semaphore. The functions sem_post() and sem_wait() are used to increment and decrement the semaphore.
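The mutex calls described above might be combined as in the following sketch. It assumes four threads that increment a shared counter, and a second helper that shows the non-blocking behavior of pthread_mutex_trylock(). The function names and counts are illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

#define NUM_THREADS 4		/* assumed thread count */
#define INCREMENTS 10000	/* assumed work per thread */

static pthread_mutex_t counter_lock;
static long counter;

/* Thread routine: every increment is made while holding the mutex. */
static void *increment_many(void *arg)
{
	(void)arg;
	for (int i = 0; i < INCREMENTS; i++) {
		pthread_mutex_lock(&counter_lock);
		counter++;
		pthread_mutex_unlock(&counter_lock);
	}
	return NULL;
}

/* Creates, uses, and destroys a mutex. Returns the final counter value. */
long count_with_mutex(void)
{
	pthread_t tids[NUM_THREADS];

	counter = 0;
	pthread_mutex_init(&counter_lock, NULL);
	for (int t = 0; t < NUM_THREADS; t++) {
		int rc = pthread_create(&tids[t], NULL, increment_many, NULL);

		assert(rc == 0);
		(void)rc;
	}
	for (int t = 0; t < NUM_THREADS; t++)
		pthread_join(tids[t], NULL);
	pthread_mutex_destroy(&counter_lock);
	return counter;
}

struct try_result {
	pthread_mutex_t *mutex;
	int rc;
};

/* Thread routine: tries the mutex once and records the result code. */
static void *try_once(void *arg)
{
	struct try_result *t = arg;

	t->rc = pthread_mutex_trylock(t->mutex);
	if (t->rc == 0)
		pthread_mutex_unlock(t->mutex);
	return NULL;
}

/* Returns what a second thread gets from trylock while the caller holds the mutex. */
int trylock_while_held(void)
{
	pthread_mutex_t m;
	struct try_result t = { &m, 0 };
	pthread_t tid;

	pthread_mutex_init(&m, NULL);
	pthread_mutex_lock(&m);
	pthread_create(&tid, NULL, try_once, &t);
	pthread_join(tid, NULL);
	pthread_mutex_unlock(&m);
	pthread_mutex_destroy(&m);
	return t.rc;
}
```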

The GNU tools for ARM cores support full thread-local storage using the Native POSIX Thread Library (NPTL) that enables efficient use of POSIX threads with the Linux kernel. There is a one-to-one correspondence between threads created with pthread_create() and kernel tasks.

The following example shows how to use the Pthreads library.

Example 4.11. Pthreads code

#include <pthread.h>
#include <stdio.h>

void *thread(void *vargp);

int main(void)
{
	pthread_t tid;
	pthread_create(&tid, NULL, thread, NULL);
	/* Parallel execution area */
	pthread_join(tid, NULL);
	return 0;
}

/* thread routine */
void *thread(void *vargp)
{
	/* Parallel execution area */
	printf("Hello World from a POSIX thread!\n");
	return NULL;
}

Inter-thread communications

A thread can use a semaphore to signal to another thread. For example, where one thread produces a buffer containing shared data, it can use a semaphore to indicate to another thread that the data can now be processed.

For more complex signaling, a message passing protocol might be required. Threads within a process use the same memory space, so an easy way to implement message passing is by posting in a previously agreed-upon mailbox and then incrementing a semaphore.
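The mailbox approach might be sketched as follows. The producer writes to an agreed-upon variable and then increments a semaphore. A second semaphore, which is an addition made for this example, stops the producer from overwriting a message that has not yet been read. All names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

#define NUM_MESSAGES 5		/* assumed message count */

static int mailbox;		/* previously agreed-upon location for the message */
static sem_t mailbox_full;	/* posted by the producer when a message is waiting */
static sem_t mailbox_empty;	/* posted by the consumer when the mailbox can be reused */

/* Producer: posts the messages 10, 20, ... one at a time. */
static void *producer(void *arg)
{
	(void)arg;
	for (int i = 1; i <= NUM_MESSAGES; i++) {
		sem_wait(&mailbox_empty);
		mailbox = i * 10;
		sem_post(&mailbox_full);
	}
	return NULL;
}

/* Consumer side: receives NUM_MESSAGES messages and returns their sum. */
long exchange_messages(void)
{
	pthread_t tid;
	long total = 0;
	int rc;

	sem_init(&mailbox_full, 0, 0);	/* no message yet */
	sem_init(&mailbox_empty, 0, 1);	/* mailbox starts out free */
	rc = pthread_create(&tid, NULL, producer, NULL);
	assert(rc == 0);
	(void)rc;
	for (int i = 0; i < NUM_MESSAGES; i++) {
		sem_wait(&mailbox_full);
		total += mailbox;
		sem_post(&mailbox_empty);
	}
	pthread_join(tid, NULL);
	sem_destroy(&mailbox_full);
	sem_destroy(&mailbox_empty);
	return total;
}
```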

Threaded performance

There are a few general points to consider when writing a multi-threaded application:

  • Each thread has its own stack space. You must be careful with its size if large numbers of threads are in use.

  • Multiple threads competing for the same mutex or semaphore spend time waiting for each other, which wastes core cycles.
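One way to control the stack size of a thread is through the attribute argument of pthread_create(). The following sketch assumes a hypothetical helper, create_thread_with_stack(), and leaves the choice of size to the caller. The size must be at least PTHREAD_STACK_MIN.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Trivial thread routine used to exercise the helper: sets *arg to 1. */
static void *set_flag(void *arg)
{
	*(int *)arg = 1;
	return NULL;
}

/*
 * Creates a thread whose stack is stack_bytes long.
 * Returns 0 on success, or the error number reported by Pthreads.
 */
int create_thread_with_stack(pthread_t *tid, void *(*start)(void *),
			     void *arg, size_t stack_bytes)
{
	pthread_attr_t attr;
	int rc;

	assert(tid != NULL && start != NULL);
	rc = pthread_attr_init(&attr);
	if (rc != 0)
		return rc;
	/* Fails if stack_bytes is smaller than PTHREAD_STACK_MIN. */
	rc = pthread_attr_setstacksize(&attr, stack_bytes);
	if (rc == 0)
		rc = pthread_create(tid, &attr, start, arg);
	/* The attribute object is no longer needed once the thread exists. */
	pthread_attr_destroy(&attr);
	return rc;
}
```

For example, create_thread_with_stack(&tid, set_flag, &flag, 256 * 1024) requests a 256KB stack, a value chosen here only for illustration.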

Thread affinity

Thread affinity refers to the practice of assigning a thread to a particular core or cores. When the scheduler wants to run a particular thread, it uses only the selected cores even if others are idle. This can be a problem if too many threads have an affinity set to a specific core. By default, threads can run on any core in an SMP system.
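On Linux with glibc, affinity can be set with pthread_setaffinity_np(). This is a non-portable extension rather than part of the Pthreads standard, and it requires _GNU_SOURCE to be defined before any header is included. The following sketch assumes that environment. The helper names are hypothetical.

```c
#define _GNU_SOURCE	/* needed for the glibc affinity extensions */
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Pins the calling thread to the lowest-numbered core that it is
 * currently allowed to use. Returns that core number, or -1 on failure.
 */
int pin_self_to_first_allowed_core(void)
{
	cpu_set_t allowed, wanted;

	if (pthread_getaffinity_np(pthread_self(), sizeof(allowed), &allowed) != 0)
		return -1;
	for (int core = 0; core < CPU_SETSIZE; core++) {
		if (!CPU_ISSET(core, &allowed))
			continue;
		CPU_ZERO(&wanted);
		CPU_SET(core, &wanted);
		assert(CPU_COUNT(&wanted) == 1);
		if (pthread_setaffinity_np(pthread_self(), sizeof(wanted), &wanted) != 0)
			return -1;
		return core;
	}
	return -1;
}

/* Returns how many cores the calling thread may run on, or -1 on failure. */
int allowed_core_count(void)
{
	cpu_set_t set;

	if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
		return -1;
	return CPU_COUNT(&set);
}
```

The sketch starts from the current affinity mask instead of assuming that core 0 is available, because the set of usable cores can be restricted by the system.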

ARM DS-5 Streamline can reveal the affinity of a thread by using a display mode called Core map. You can use this mode to visualize how tasks are divided up by the kernel and shared among several cores.

Thread safety and reentrancy

Functions that can be used concurrently by more than one thread must be both thread-safe and reentrant. This is important for device drivers and for library functions.

For a function to be reentrant, it must meet the following conditions:

  • All data must be supplied by the caller.

  • The function must not hold static or global data over successive calls.

  • The function cannot return a pointer to static data.

  • The function cannot itself call functions that are not reentrant.

For a function to be thread-safe, it must protect shared data with locks. This means that you must change the implementation by adding synchronization blocks to protect concurrent accesses to shared resources from different threads. Reentrancy is a stronger property: not every thread-safe function is reentrant.

There are common library functions that are not reentrant. For example, the function ctime() returns a pointer to static data that is over-written on each call.
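The difference can be illustrated with two hypothetical functions, core_label() and core_label_r(). The first returns a pointer to static data, in the same way as ctime(). The second takes all of its storage from the caller, which is the pattern that POSIX uses for ctime_r().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define LABEL_SIZE 32	/* assumed buffer size for this example */

/* Not reentrant: the result lives in static storage shared by every caller. */
const char *core_label(int core)
{
	static char buffer[LABEL_SIZE];

	snprintf(buffer, sizeof(buffer), "core-%d", core);
	return buffer;
}

/* Reentrant: all storage is supplied by the caller. */
char *core_label_r(int core, char *buffer, size_t size)
{
	assert(buffer != NULL && size > 0);
	snprintf(buffer, size, "core-%d", core);
	return buffer;
}
```

Two calls to core_label() return the same pointer, so the second call overwrites the result of the first. Two calls to core_label_r() with separate buffers do not interfere with each other.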

Performance issues

There are several multi-core specific issues related to performance of threads:

  • Bandwidth.

    The connection to external memory is shared among all cores within a cluster. Individual cores run at speeds far higher than the external memory and so are potentially limited in I/O-intensive code by the available bandwidth.

  • Thread dependencies and priority inversion.

    The execution of a higher-priority thread can be stalled by a lower-priority thread holding a lock to some shared data. Alternatively, an incorrect split in thread functionality can lead to a situation where no benefit is seen because the threads have fully serialized dependencies.

  • Cache contention and false sharing.

    If multiple threads are using data that resides within the same coherent cache lines, there can be cache line migration overhead even if the actual variables are not shared.

Bandwidth concerns

Bandwidth issues can be optimized in a number of ways. The code itself must be optimized to minimize cache misses, and therefore reduce the bandwidth utilization.

Another option is to control thread allocation. The kernel scheduler does not monitor data usage by threads. Instead, it uses priority to decide which threads to run. You can provide hints that enable more efficient scheduling by using thread affinity.

Thread dependencies

A program that relies on threads executing in a particular sequence to work correctly might have a race condition. Single-core real-time systems often implicitly rely on tasks being executed in a priority-based order. Tasks then execute to completion, without preemption, so later tasks can rely on earlier tasks having completed. This can cause problems if such software is moved to a multi-core system without checking for such assumptions.

On a multi-core system, a lower-priority task can run at the same time as a higher-priority task, so the expected execution order of the original single-core system is no longer guaranteed. There are several ways to resolve this problem. A simple approach is to set task affinity to make those tasks run on the same core. This requires little change to the legacy code, but it breaks the symmetry of the system and removes scope for load balancing. A better approach is to enforce serial execution by using the kernel synchronization mechanisms, which give you explicit control over the execution flow and better SMP performance. However, this approach requires the legacy code to be modified.

Cache thrashing

Cortex-A series processors use physically tagged caches that remove the requirement for flushing caches on context switch.

In an SMP system, tasks can migrate among different cores in the system. The scheduler starts a task on a core. It runs for a certain period and is then replaced by a different task. When that task is restarted later by the scheduler, this could be on a different core. This means that the task does not get the potential benefit of cache data already in the core cache.

Memory-intensive tasks that quickly fill the data cache might thrash each other’s cached data. This results in poor performance, because of the higher number of cache misses; this also increases system energy usage, because of additional interaction with external memory.

Multi-core optimizations for cache line migration mitigate the effects of cache thrashing. In addition, the OS scheduler can try to reduce the problem by keeping tasks on the same core. You can also do this by setting core affinity to threads and processes.

False sharing

This is a problem of systems with shared coherent caches and is a form of involuntary memory contention.

It can happen when a core regularly accesses data that is never changed, and shares a cache line with data that is altered by another core. The MESI protocol can end up migrating data that is not truly shared among different parts of the memory system, costing clock cycles and power.

Even though there is no actual coherency to be maintained, the MESI protocol invalidates the cache line, forcing it to be reloaded on each write. However, the cache-to-cache migration capability of multi-core clusters reduces the overhead.

Therefore, you must avoid having cores operating on independent data that is stored within the same cache line and increasing the level of detail for inner loop parallelization.
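One common way to keep independent data in separate cache lines is to align each item to the line size. The following sketch assumes a 64-byte cache line, which you must check against the target core, and uses the C11 alignas specifier. The names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>

#define CACHE_LINE_BYTES 64	/* assumption: check the line size of the target core */
#define NUM_THREADS 4		/* assumed thread count */
#define ITERATIONS 100000L	/* assumed work per thread */

/* The alignment also pads the structure, so each counter fills its own line. */
struct padded_counter {
	alignas(CACHE_LINE_BYTES) long value;
};

static struct padded_counter counters[NUM_THREADS];

/* Thread routine: updates only the counter that it was given. */
static void *count_private(void *arg)
{
	struct padded_counter *c = arg;

	for (long i = 0; i < ITERATIONS; i++)
		c->value++;
	return NULL;
}

/* Runs one thread per counter and returns the sum of all counters. */
long count_without_false_sharing(void)
{
	pthread_t tids[NUM_THREADS];
	long total = 0;

	for (int t = 0; t < NUM_THREADS; t++) {
		int rc;

		counters[t].value = 0;
		rc = pthread_create(&tids[t], NULL, count_private, &counters[t]);
		assert(rc == 0);
		(void)rc;
	}
	for (int t = 0; t < NUM_THREADS; t++) {
		pthread_join(tids[t], NULL);
		total += counters[t].value;
	}
	return total;
}
```

Without the alignment, the four long values would sit next to each other and could share a single cache line, even though no thread reads another thread's counter.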

Deadlock and livelock

When writing code that includes critical sections, you must know that the following common problems can prevent correct execution of the program:

  • Deadlock is the situation where two or more threads are waiting for each other to release a resource. Such threads are blocked, waiting for a lock that can never be released.

  • Livelock occurs when multiple threads can execute, without blocking indefinitely as in the deadlock case. However, the system as a whole cannot proceed, because of a repeated pattern of resource contention.

Both deadlocks and livelocks can be avoided either by correct software design, or by the use of lock-free software techniques.
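One example of a design rule that avoids deadlock is to acquire multiple locks in the same order in every thread. The following sketch, which uses hypothetical account structures, orders the two mutexes by address so that two opposing transfers cannot each hold one lock while waiting for the other.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct account {
	pthread_mutex_t lock;
	long balance;
};

struct transfer_job {
	struct account *from;
	struct account *to;
	int repeats;
};

/* Moves amount between accounts, always locking the lower address first. */
void transfer(struct account *from, struct account *to, long amount)
{
	struct account *first = from;
	struct account *second = to;

	assert(from != to);
	if ((uintptr_t)first > (uintptr_t)second) {
		first = to;
		second = from;
	}
	pthread_mutex_lock(&first->lock);
	pthread_mutex_lock(&second->lock);
	from->balance -= amount;
	to->balance += amount;
	pthread_mutex_unlock(&second->lock);
	pthread_mutex_unlock(&first->lock);
}

/* Thread routine: repeats a one-unit transfer. */
static void *transfer_many(void *arg)
{
	struct transfer_job *job = arg;

	for (int i = 0; i < job->repeats; i++)
		transfer(job->from, job->to, 1);
	return NULL;
}

/* Runs opposing transfers concurrently. Returns the combined balance afterwards. */
long run_opposing_transfers(void)
{
	struct account x, y;
	struct transfer_job xy = { &x, &y, 10000 };
	struct transfer_job yx = { &y, &x, 10000 };
	pthread_t t1, t2;

	x.balance = 1000;
	y.balance = 1000;
	pthread_mutex_init(&x.lock, NULL);
	pthread_mutex_init(&y.lock, NULL);
	pthread_create(&t1, NULL, transfer_many, &xy);
	pthread_create(&t2, NULL, transfer_many, &yx);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	pthread_mutex_destroy(&x.lock);
	pthread_mutex_destroy(&y.lock);
	return x.balance + y.balance;
}
```

If each thread instead locked its own from account first, the two threads could each take one lock and then wait indefinitely for the other.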

Copyright © 2014 ARM. All rights reserved. ARM DAI0425