Multicore Application Programming

CPU Affinity

by Robert Love on July 1, 2003

The ability in Linux to bind one or more processes to one or more processors is called CPU affinity.
The 2.5 kernel introduced a set of system calls for setting and retrieving the CPU affinity of a process.

When processes bounce between processors, they constantly cause cache invalidations, and the data they want is never in the cache when they need it.
Thus, cache miss rates grow very large. CPU affinity protects against this and improves cache performance.

In a multiprocessor system, setting the CPU affinity mask can be used to obtain performance benefits. For example, by dedicating one CPU to a particular thread, it is possible to ensure maximum execution speed for that thread.
Restricting a thread to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a thread ceases to execute on one CPU and then recommences execution on a different CPU.


Soft vs. Hard CPU Affinity


  • soft affinity
  • Soft affinity, also called natural affinity, is the tendency of the scheduler to try to keep processes on the same CPU as long as possible.
  • hard affinity
  • Hard affinity is a binding enforced by the kernel and is provided via the CPU affinity system calls.

Using the New System Calls


#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <sched.h>
int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);

A CPU affinity mask is represented by the cpu_set_t structure, a "CPU set", pointed to by mask. A set of macros for manipulating CPU sets is described in CPU_SET(3).
  • sched_setaffinity()
  • sets the CPU affinity mask of the thread whose ID is pid to the value specified by mask. If pid is zero, then the calling thread is used. The argument cpusetsize is the length (in bytes) of the data pointed to by mask. Normally this argument would be specified as sizeof(cpu_set_t). If the thread specified by pid is not currently running on one of the CPUs specified in mask, then that thread is migrated to one of the CPUs specified in mask.
  • sched_getaffinity()
  • writes the affinity mask of the thread whose ID is pid into the cpu_set_t structure pointed to by mask. The cpusetsize argument specifies the size (in bytes) of mask. If pid is zero, then the mask of the calling thread is returned.
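As a small sketch of the CPU_SET(3) macros referenced above (the mask contents chosen here are purely illustrative), a CPU set can be built up and queried like this:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);        /* start with an empty CPU set */
    CPU_SET(0, &mask);      /* add CPU 0 to the set        */
    CPU_SET(2, &mask);      /* add CPU 2 to the set        */

    printf("CPUs in mask: %d\n", CPU_COUNT(&mask));        /* prints 2  */
    printf("CPU 1 in mask? %s\n",
           CPU_ISSET(1, &mask) ? "yes" : "no");            /* prints no */
    return 0;
}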


#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <pthread.h>

int pthread_setaffinity_np(pthread_t thread, size_t cpusetsize, const cpu_set_t *cpuset);
int pthread_getaffinity_np(pthread_t thread, size_t cpusetsize, cpu_set_t *cpuset);
  • pthread_setaffinity_np()
  • The pthread_setaffinity_np() function sets the CPU affinity mask of the thread thread to the CPU set pointed to by cpuset. If the call is successful, and the thread is not currently running on one of the CPUs in cpuset, then it is migrated to one of those CPUs.
  • pthread_getaffinity_np()
  • The pthread_getaffinity_np() function returns the CPU affinity mask of the thread thread in the buffer pointed to by cpuset.

Examples


The following code sets the affinity of the calling process to a specific CPU core.
In this example, the target core is selected with the variable core_id.

#define _GNU_SOURCE             /* expose CPU_* macros and sched_setaffinity() */
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <errno.h>
#include <unistd.h>
 
#define print_error_then_terminate(en, msg) \
  do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)
 
 
int main(int argc, char *argv[]) {
 
  // We want to camp on the 2nd CPU. The ID of that core is #1.
  const int core_id = 1;
  const pid_t pid = getpid();
 
  // cpu_set_t: This data set is a bitset where each bit represents a CPU.
  cpu_set_t cpuset;
  // CPU_ZERO: This macro initializes the CPU set set to be the empty set.
  CPU_ZERO(&cpuset);
  // CPU_SET: This macro adds cpu to the CPU set set.
  CPU_SET(core_id, &cpuset);
 
  // sched_setaffinity: returns 0 on success; on error it returns -1 and sets errno.
  const int set_result = sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);
  if (set_result != 0) {
    print_error_then_terminate(errno, "sched_setaffinity");
  }
 
  // Check what is the actual affinity mask that was assigned to the thread.
  // sched_getaffinity: returns 0 on success and fills cpuset with the current affinity mask.
  const int get_affinity = sched_getaffinity(pid, sizeof(cpu_set_t), &cpuset);
  if (get_affinity != 0) {
    print_error_then_terminate(errno, "sched_getaffinity");
  }
 
  // CPU_ISSET: This macro returns a nonzero value (true) if cpu is a member of the CPU set set, and zero (false) otherwise.
  if (CPU_ISSET(core_id, &cpuset)) {
    fprintf(stdout, "Successfully set thread %d to affinity to CPU %d\n", pid, core_id);
  } else {
    fprintf(stderr, "Failed to set thread %d to affinity to CPU %d\n", pid, core_id);
  }
 
  return 0;
}
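A possible way to build and run the program above, assuming the source is saved as set_affinity.c (the file name is arbitrary):

$ gcc set_affinity.c -o set_affinity
$ ./set_affinity
Successfully set thread <pid> affinity to CPU 1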

The following code sets the affinity of each pthread to a different, specific CPU core.
The core is selected with the user-defined variable speid, which holds a number from 0 to (number of CPUs - 1); a complete sketch showing how speid might be supplied to each thread follows after the snippet.

int s, j;
cpu_set_t cpuset;
pthread_t thread;
 
thread = pthread_self();
 
/* Restrict the affinity mask of this thread to the single CPU speid */
 
CPU_ZERO(&cpuset);
CPU_SET(speid, &cpuset);
 
s = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
if (s != 0) {
    handle_error_en(s, "pthread_setaffinity_np");
}
 
/* Check the actual affinity mask assigned to the thread */
s = pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
if (s != 0) {
    handle_error_en(s, "pthread_getaffinity_np");
}
 
printf("Set returned by pthread_getaffinity_np() contained:\n");
for (j = 0; j < CPU_SETSIZE; j++) {
    if (CPU_ISSET(j, &cpuset)) {
        fprintf(stderr,"%d CPU %d\n",speid, j);
    }
}
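The snippet above is a fragment; as a minimal, self-contained sketch (not from the original notes), the program below shows one way speid might be supplied: each thread receives its core index through the pthread_create() argument and then pins itself with the same calls. The function and variable names are illustrative.

#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define handle_error_en(en, msg) \
  do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

static void *pin_to_core(void *arg)
{
  const int speid = (int)(intptr_t)arg;  /* core index chosen by main() */
  cpu_set_t cpuset;

  CPU_ZERO(&cpuset);
  CPU_SET(speid, &cpuset);

  int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
  if (s != 0) {
    handle_error_en(s, "pthread_setaffinity_np");
  }

  printf("Thread %d pinned to CPU %d\n", speid, speid);
  return NULL;
}

int main(void)
{
  enum { NTHREADS = 4 };                 /* assumption: at least 4 online CPUs */
  pthread_t threads[NTHREADS];

  for (intptr_t i = 0; i < NTHREADS; i++) {
    pthread_create(&threads[i], NULL, pin_to_core, (void *)i);
  }
  for (int i = 0; i < NTHREADS; i++) {
    pthread_join(threads[i], NULL);
  }
  return 0;
}

It would be built with the pthread flag, e.g. gcc -pthread pin_threads.c -o pin_threads (the file name is arbitrary).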


Multicore Application Programming: For Windows, Linux, and Oracle® Solaris
Darryl Gove



1 Hardware, Processes, and Threads


Computer Architecture


Performance is affected by hardware configuration
  • Memory or CPU architecture
  • Numbers of cores/processor
  • Network speed and architecture

Asymmetric multiprocessing (AMP or ASMP)


Asymmetric multiprocessing was the only method for handling multiple CPUs before symmetric multiprocessing (SMP) was available.
Multiple CPUs are tied together via a LAN or another interconnect strategy.
  • Each CPU has its own copy of the OS.
  • No single standard exists for AMP.
  • It requires the developer to decide which code needs to run where.

Symmetric multiprocessing (SMP)


The architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes.

  • all processors see everything
  • there is only one kernel
  • applications can migrate between processors
Problems with SMP:
  • SMP may be complex because of duplicated hardware such as processor fans
  • SMP does not scale perfectly
  • Multiple processors can lead to race conditions

Multi-core processor


A multi-core processor is a computer processor integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions, as if the computer had several processors.
An execution thread is the smallest unit of processing that can be scheduled by an operating system.
A thread is typically contained inside a process.
Multiple threads can exist within the same process and share resources.
On a uniprocessor, multithreading generally occurs by switching between different threads.
On a multi-core system, the threads or tasks will actually run at the same time, with each processor or core running a particular thread or task.
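A small illustrative sketch (not from Gove's book; the names are arbitrary): two threads in the same process update the same global variable, while each keeps private data on its own stack.

/* build: gcc -pthread shared.c -o shared */
#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;                 /* shared by all threads in the process */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
  (void)arg;
  int local = 0;                                /* private: lives on this thread's stack */
  for (int i = 0; i < 100000; i++) {
    local++;
    pthread_mutex_lock(&lock);                  /* shared data needs synchronization */
    shared_counter++;
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

int main(void)
{
  pthread_t t1, t2;
  pthread_create(&t1, NULL, increment, NULL);
  pthread_create(&t2, NULL, increment, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("shared_counter = %ld\n", shared_counter);   /* prints 200000 */
  return 0;
}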

2 Coding for Performance



A system can be both multiprocessor and multicore (e.g., two quad-core processors give eight processing cores per system unit).

7 Using Automatic Parallelization and OpenMP


Using Automatic Parallelization to Produce a Parallel Application

Automatic parallelization, also called auto-parallelization or autoparallelization, refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is challenging because it requires complex program analysis and because the best approach may depend upon parameter values that are not known at compilation time. Most compilers are able to perform some degree of automatic parallelization, but current compilers can only automatically parallelize loops. Loops are a very good target for parallelization because they are iterated many times, so the block of code accumulates significant execution time. In GCC, automatic parallelization must be requested explicitly by the user via a compiler flag such as -ftree-parallelize-loops=4.
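As a minimal sketch of the kind of loop current compilers can auto-parallelize (the file name and loop body are illustrative):

/* autopar.c - each iteration is independent of the others, so the loop is a
 * good candidate for automatic parallelization.
 *
 * Possible build with GCC: gcc -O2 -ftree-parallelize-loops=4 autopar.c -o autopar
 */
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
  for (int i = 0; i < N; i++) {      /* initialization */
    a[i] = i;
    b[i] = 2.0 * i;
  }

  for (int i = 0; i < N; i++) {      /* independent iterations: parallelizable */
    c[i] = a[i] + b[i];
  }

  printf("c[N-1] = %f\n", c[N - 1]);
  return 0;
}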

Using OpenMP to Produce a Parallel Application


OpenMP is an API of compiler directives, library routines, and environment variables for executing C, C++ and Fortran code on multiple processors at the same time.


OpenMP (Open Multi-Processing) is completely independent of MPI.
  • MPI is for Process
  • OpenMP is for Thread
Process vs. Thread:
  • Program starts with a single process
  • Processes have their own (private) memory space
  • A process can create one or more threads
  • Threads created by a process share its memory space
    • Read and write to same memory addresses
    • Share the same process ID and file descriptors
  • Each thread has a unique instruction counter and stack pointer
  • A thread can have private storage on the stack

The OpenMP specification defines an API that enables a developer to add directives to their serial code that will cause the compiler to produce a parallel version of the application.
In OpenMP, directives in the source code are used to express parallel constructs.
These directives start with the phrase #pragma omp.
Under appropriate compilation flags, they are read by the compiler and used to generate a parallel version of the application.
If the required compiler flags are not provided, the directives are ignored.
Some other advantages to using OpenMP are as follows:
  • The directives are recognized only when compiled with a particular compiler flag, so the same source base can be used to generate the serial and parallel versions of the code.
  • Each directive is limited in scope to the region of code to which it applies.
  • The compiler and supporting library are responsible for all the parallelization work and the runtime management of threads.
Loop parallelized using OpenMP: the directive tells the compiler to add machine code for parallel execution of the following block.


void calc( double *array1, double *array2, int length )
{
  #pragma omp parallel for
  for ( int i = 0; i < length; i++ )
  {
    array1[i] += array2[i];
  }
}

Setting up OpenMP on Ubuntu / Linux


  • Run
  • 
    $ sudo apt-get install libomp-dev
    
  • code
  • 
    /*
        -lgomp -fopenmp
    */
    
    #include <omp.h> // necessary header file for OpenMP API
    #include <stdio.h>
    
    int main(int argc, char *argv[]){
        printf("OpenMP running with %d threads\n", omp_get_max_threads());
    
    #pragma omp parallel
        {
            // Code here will be executed by all threads
            printf("Hello World from thread %d\n", omp_get_thread_num());
        }
    
        return 0;
    }
    
  • build
    • -fopenmp
    • Compiler flag to enable OpenMP
    
    $ gcc -lgomp -fopenmp omp.c -o omp
    
  • test
  • 
    $ gcc -lgomp -fopenmp omp.c -o omp
    $ ./omp
    OpenMP running with 4 threads
    Hello World from thread 0
    Hello World from thread 1
    Hello World from thread 2
    Hello World from thread 3
    
    
OpenMP may use all available CPUs if OMP_NUM_THREADS is not set.

Runtime Behavior of an OpenMP Application


OpenMP follows a fork-join model: the runtime library creates a team of threads.
The number of threads that will be used is set by the environment variable OMP_NUM_THREADS, but the application can adjust this at runtime by calling into the runtime support library.
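A small sketch of adjusting the team size from within the application (the value 2 is arbitrary):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(2);             /* overrides OMP_NUM_THREADS for subsequent regions */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)  /* report the team size from one thread */
            printf("Team of %d threads\n", omp_get_num_threads());
    }
    return 0;
}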

9 Scaling with Multicore Processors


To maximize performance:
  • each thread of the application needs to be efficient, and
  • the application needs to be able to effectively utilize multiple threads.

Constraints to Application Scaling



Chapter 10 Other Parallelization Technologies


Clustering Technologies


Message Passing Interface (MPI) is a parallelization model that allows a single application to span over multiple nodes.
Communication between the nodes is, as the name suggests, accomplished by passing messages between the nodes.


Each node executes the same application.
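As a minimal sketch of the MPI model (it assumes an MPI implementation such as Open MPI or MPICH is installed; the file name and commands in the comment are illustrative):

/* mpi_hello.c - every node (rank) runs this same program.
 * Possible build/run: mpicc mpi_hello.c -o mpi_hello && mpirun -np 4 ./mpi_hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                     /* start the MPI runtime              */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* this process's id within the job   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);       /* total number of processes          */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                             /* shut down the MPI runtime          */
    return 0;
}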

