Multicore Application Programming

CPU Affinity

by Robert Love on July 1, 2003

The ability in Linux to bind one or more processes to one or more processors is called CPU affinity.
The 2.5 kernel introduced a set of system calls for setting and retrieving the CPU affinity of a process.

When processes bounce between processors, they constantly cause cache invalidations, and the data they want is never in the cache when they need it.
Thus, cache miss rates grow very large. CPU affinity protects against this and improves cache performance.

In a multiprocessor system, setting the CPU affinity mask can be used to obtain performance benefits. For example, by dedicating one CPU to a particular thread, it is possible to ensure maximum execution speed for that thread.
Restricting a thread to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a thread ceases to execute on one CPU and then recommences execution on a different CPU.


Soft vs. Hard CPU Affinity


  • soft affinity
  • Soft affinity, also called natural affinity, is the tendency of the scheduler to try to keep processes on the same CPU as long as possible.
  • hard affinity
  • Hard affinity is a binding enforced by the kernel and is provided via the CPU affinity system calls.

Using the New System Calls


#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <sched.h>
int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);

A CPU affinity mask is represented by the cpu_set_t structure, a "CPU set", pointed to by mask. A set of macros for manipulating CPU sets is described in CPU_SET(3).
  • sched_setaffinity()
  • sets the CPU affinity mask of the thread whose ID is pid to the value specified by mask. If pid is zero, then the calling thread is used. The argument cpusetsize is the length (in bytes) of the data pointed to by mask. Normally this argument would be specified as sizeof(cpu_set_t). If the thread specified by pid is not currently running on one of the CPUs specified in mask, then that thread is migrated to one of the CPUs specified in mask.
  • sched_getaffinity()
  • writes the affinity mask of the thread whose ID is pid into the cpu_set_t structure pointed to by mask. The cpusetsize argument specifies the size (in bytes) of mask. If pid is zero, then the mask of the calling thread is returned.
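As a small sketch of the CPU_SET(3) macros referenced above (the mask contents chosen here are purely illustrative), a CPU set can be built up and queried like this:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);        /* start with an empty CPU set */
    CPU_SET(0, &mask);      /* add CPU 0 to the set        */
    CPU_SET(2, &mask);      /* add CPU 2 to the set        */

    printf("CPUs in mask: %d\n", CPU_COUNT(&mask));        /* prints 2  */
    printf("CPU 1 in mask? %s\n",
           CPU_ISSET(1, &mask) ? "yes" : "no");            /* prints no */
    return 0;
}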


#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <pthread.h>

int pthread_setaffinity_np(pthread_t thread, size_t cpusetsize, const cpu_set_t *cpuset);
int pthread_getaffinity_np(pthread_t thread, size_t cpusetsize, cpu_set_t *cpuset);
  • pthread_setaffinity_np()
  • The pthread_setaffinity_np() function sets the CPU affinity mask of the thread thread to the CPU set pointed to by cpuset. If the call is successful, and the thread is not currently running on one of the CPUs in cpuset, then it is migrated to one of those CPUs.
  • pthread_getaffinity_np()
  • The pthread_getaffinity_np() function returns the CPU affinity mask of the thread thread in the buffer pointed to by cpuset.

Examples


The following code sets the affinity of the calling process to a specific CPU core.
In this example, the target core is selected with the variable core_id.

#define _GNU_SOURCE             /* expose CPU_* macros and sched_setaffinity() */
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <errno.h>
#include <unistd.h>
 
#define print_error_then_terminate(en, msg) \
  do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)
 
 
int main(int argc, char *argv[]) {
 
  // We want to camp on the 2nd CPU. The ID of that core is #1.
  const int core_id = 1;
  const pid_t pid = getpid();
 
  // cpu_set_t: This data set is a bitset where each bit represents a CPU.
  cpu_set_t cpuset;
  // CPU_ZERO: This macro initializes the CPU set set to be the empty set.
  CPU_ZERO(&cpuset);
  // CPU_SET: This macro adds cpu to the CPU set set.
  CPU_SET(core_id, &cpuset);
 
  // sched_setaffinity: returns 0 on success; on error it returns -1 and sets errno.
  const int set_result = sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);
  if (set_result != 0) {
    print_error_then_terminate(errno, "sched_setaffinity");
  }
 
  // Check what is the actual affinity mask that was assigned to the thread.
  // sched_getaffinity: returns 0 on success and fills cpuset with the current affinity mask.
  const int get_affinity = sched_getaffinity(pid, sizeof(cpu_set_t), &cpuset);
  if (get_affinity != 0) {
    print_error_then_terminate(errno, "sched_getaffinity");
  }
 
  // CPU_ISSET: This macro returns a nonzero value (true) if cpu is a member of the CPU set set, and zero (false) otherwise.
  if (CPU_ISSET(core_id, &cpuset)) {
    fprintf(stdout, "Successfully set thread %d to affinity to CPU %d\n", pid, core_id);
  } else {
    fprintf(stderr, "Failed to set thread %d to affinity to CPU %d\n", pid, core_id);
  }
 
  return 0;
}
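A possible way to build and run the program above, assuming the source is saved as set_affinity.c (the file name is arbitrary):

$ gcc set_affinity.c -o set_affinity
$ ./set_affinity
Successfully set thread <pid> affinity to CPU 1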

The following code sets the affinity of each pthread to a different, specific CPU core.
The core is selected with the user-defined variable speid, which holds a number from 0 to (number of CPUs - 1); a complete sketch showing how speid might be supplied to each thread follows after the snippet.

int s, j;
cpu_set_t cpuset;
pthread_t thread;
 
thread = pthread_self();
 
/* Restrict the affinity mask of this thread to the single CPU speid */
 
CPU_ZERO(&cpuset);
CPU_SET(speid, &cpuset);
 
s = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
if (s != 0) {
    handle_error_en(s, "pthread_setaffinity_np");
}
 
/* Check the actual affinity mask assigned to the thread */
s = pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
if (s != 0) {
    handle_error_en(s, "pthread_getaffinity_np");
}
 
printf("Set returned by pthread_getaffinity_np() contained:\n");
for (j = 0; j < CPU_SETSIZE; j++) {
    if (CPU_ISSET(j, &cpuset)) {
        fprintf(stderr,"%d CPU %d\n",speid, j);
    }
}
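The snippet above is a fragment; as a minimal, self-contained sketch (not from the original notes), the program below shows one way speid might be supplied: each thread receives its core index through the pthread_create() argument and then pins itself with the same calls. The function and variable names are illustrative.

#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define handle_error_en(en, msg) \
  do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

static void *pin_to_core(void *arg)
{
  const int speid = (int)(intptr_t)arg;  /* core index chosen by main() */
  cpu_set_t cpuset;

  CPU_ZERO(&cpuset);
  CPU_SET(speid, &cpuset);

  int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
  if (s != 0) {
    handle_error_en(s, "pthread_setaffinity_np");
  }

  printf("Thread %d pinned to CPU %d\n", speid, speid);
  return NULL;
}

int main(void)
{
  enum { NTHREADS = 4 };                 /* assumption: at least 4 online CPUs */
  pthread_t threads[NTHREADS];

  for (intptr_t i = 0; i < NTHREADS; i++) {
    pthread_create(&threads[i], NULL, pin_to_core, (void *)i);
  }
  for (int i = 0; i < NTHREADS; i++) {
    pthread_join(threads[i], NULL);
  }
  return 0;
}

It would be built with the pthread flag, e.g. gcc -pthread pin_threads.c -o pin_threads (the file name is arbitrary).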


Multicore Application Programming: For Windows, Linux, and Oracle® Solaris
Darryl Gove



1 Hardware, Processes, and Threads


Computer Architecture


Performance is affected by hardware configuration
  • Memory or CPU architecture
  • Numbers of cores/processor
  • Network speed and architecture

Asymmetric multiprocessing (AMP or ASMP)


Asymmetric multiprocessing was the only method for handling multiple CPUs before symmetric multiprocessing (SMP) was available.
Multiple CPUs are tied together via a LAN or another interconnect strategy.
  • Each CPU has its own copy of the OS.
  • No single standard exists for AMP.
  • It requires the developer to decide which code needs to run where.

Symmetric multiprocessing (SMP)


The architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes.

  • all processors see everything
  • there is only one kernel
  • applications can migrate between processors
Problems with SMP:
  • SMP may be complex because of duplicated hardware such as processor fans
  • SMP does not scale perfectly
  • Multiple processors can lead to race conditions

Multi-core processor


A multi-core processor is a computer processor integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions, as if the computer had several processors.
An execution thread is the smallest unit of processing that can be scheduled by an operating system.
A thread is typically contained inside a process.
Multiple threads can exist within the same process and share resources.
On a uniprocessor, multithreading generally occurs by switching between different threads.
On a multi-core system, the threads or tasks will actually run at the same time, with each processor or core running a particular thread or task.
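A small illustrative sketch (not from Gove's book; the names are arbitrary): two threads in the same process update the same global variable, while each keeps private data on its own stack.

/* build: gcc -pthread shared.c -o shared */
#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;                 /* shared by all threads in the process */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
  (void)arg;
  int local = 0;                                /* private: lives on this thread's stack */
  for (int i = 0; i < 100000; i++) {
    local++;
    pthread_mutex_lock(&lock);                  /* shared data needs synchronization */
    shared_counter++;
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

int main(void)
{
  pthread_t t1, t2;
  pthread_create(&t1, NULL, increment, NULL);
  pthread_create(&t2, NULL, increment, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("shared_counter = %ld\n", shared_counter);   /* prints 200000 */
  return 0;
}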

2 Coding for Performance



A system can be both multiprocessor and multicore (e.g., two quad-core processors give eight processing cores per system unit).

7 Using Automatic Parallelization and OpenMP


Using Automatic Parallelization to Produce a Parallel Application

Automatic parallelization, also called auto-parallelization or autoparallelization, refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is challenging because it requires complex program analysis and because the best approach may depend upon parameter values that are not known at compilation time. Most compilers are able to perform some degree of automatic parallelization, but current compilers can only automatically parallelize loops. Loops are a very good target for parallelization because they are iterated many times, so the block of code accumulates significant execution time. In GCC, automatic parallelization must be requested explicitly by the user via a compiler flag such as -ftree-parallelize-loops=4.
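As a minimal sketch of the kind of loop current compilers can auto-parallelize (the file name and loop body are illustrative):

/* autopar.c - each iteration is independent of the others, so the loop is a
 * good candidate for automatic parallelization.
 *
 * Possible build with GCC: gcc -O2 -ftree-parallelize-loops=4 autopar.c -o autopar
 */
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
  for (int i = 0; i < N; i++) {      /* initialization */
    a[i] = i;
    b[i] = 2.0 * i;
  }

  for (int i = 0; i < N; i++) {      /* independent iterations: parallelizable */
    c[i] = a[i] + b[i];
  }

  printf("c[N-1] = %f\n", c[N - 1]);
  return 0;
}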

Using OpenMP to Produce a Parallel Application


OpenMP is an API of compiler directives, library routines, and environment variables for executing C, C++ and Fortran code on multiple processors at the same time.


OpenMP (Open Multi-Processing) is completely independent of MPI.
  • MPI is for Process
  • OpenMP is for Thread
Process vs. Thread:
  • Program starts with a single process
  • Processes have their own (private) memory space
  • A process can create one or more threads
  • Threads created by a process share its memory space
    • Read and write to same memory addresses
    • Share the same process ID and file descriptors
  • Each thread has a unique instruction counter and stack pointer
  • A thread can have private storage on the stack

The OpenMP specification defines an API that enables a developer to add directives to their serial code that will cause the compiler to produce a parallel version of the application.
In OpenMP, directives in the source code are used to express parallel constructs.
These directives start with the phrase #pragma omp.
Under appropriate compilation flags, they are read by the compiler and used to generate a parallel version of the application.
If the required compiler flags are not provided, the directives are ignored.
Some other advantages to using OpenMP are as follows:
  • The directives are recognized only when compiled with a particular compiler flag, so the same source base can be used to generate the serial and parallel versions of the code.
  • Each directive is limited in scope to the region of code to which it applies.
  • The compiler and supporting library are responsible for all the parallelization work and the runtime management of threads.
Loop parallelized using OpenMP: the directive tells the compiler to add machine code for parallel execution of the following block.


void calc( double *array1, double *array2, int length )
{
  #pragma omp parallel for
  for ( int i = 0; i < length; i++ )
  {
    array1[i] += array2[i];
  }
}

Setting up OpenMP on Ubuntu / Linux


  • Run
  • 
    $ sudo apt-get install libomp-dev
    
  • code
  • 
    /*
        -lgomp -fopenmp
    */
    
    #include <omp.h> // necessary header file for OpenMP API
    #include <stdio.h>
    
    int main(int argc, char *argv[]){
        printf("OpenMP running with %d threads\n", omp_get_max_threads());
    
    #pragma omp parallel
        {
            // Code here will be executed by all threads
            printf("Hello World from thread %d\n", omp_get_thread_num());
        }
    
        return 0;
    }
    
  • build
    • -fopenmp
    • Compiler flag to enable OpenMP
    
    $ gcc -lgomp -fopenmp omp.c -o omp
    
  • test
  • 
    $ gcc -lgomp -fopenmp omp.c -o omp
    $ ./omp
    OpenMP running with 4 threads
    Hello World from thread 0
    Hello World from thread 1
    Hello World from thread 2
    Hello World from thread 3
    
    
OpenMP may use all available CPUs if OMP_NUM_THREADS is not set.

Runtime Behavior of an OpenMP Application


OpenMP follows a fork-join model: the runtime library creates a team of threads.
The number of threads that will be used is set by the environment variable OMP_NUM_THREADS, but the application can adjust this at runtime by calling into the runtime support library.
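A small sketch of adjusting the team size from within the application (the value 2 is arbitrary):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(2);             /* overrides OMP_NUM_THREADS for subsequent regions */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)  /* report the team size from one thread */
            printf("Team of %d threads\n", omp_get_num_threads());
    }
    return 0;
}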

9 Scaling with Multicore Processors


To maximize performance:
  • each thread of the application needs to be efficient, and
  • the application needs to be able to effectively utilize multiple threads.

Constraints to Application Scaling



Chapter 10 Other Parallelization Technologies


Clustering Technologies


Message Passing Interface (MPI) is a parallelization model that allows a single application to span over multiple nodes.
Communication between the nodes is, as the name suggests, accomplished by passing messages between the nodes.


Each node executes the same application.
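As a minimal sketch of the MPI model (it assumes an MPI implementation such as Open MPI or MPICH is installed; the file name and commands in the comment are illustrative):

/* mpi_hello.c - every node (rank) runs this same program.
 * Possible build/run: mpicc mpi_hello.c -o mpi_hello && mpirun -np 4 ./mpi_hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                     /* start the MPI runtime              */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* this process's id within the job   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);       /* total number of processes          */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                             /* shut down the MPI runtime          */
    return 0;
}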

