Hyperthreading

Although Hyperthreading (officially called Hyper-Threading Technology) has been available since 2002, the cluster btrzx2, installed in 2017, is the first cluster in Bayreuth to have this feature enabled. Hyper-Threading Technology is a form of simultaneous multithreading introduced by Intel, while the concept behind the technology was patented by Sun Microsystems. Architecturally, a processor with Hyper-Threading Technology presents two logical processors per physical core, each of which has its own architectural state. Each logical processor can be individually halted, interrupted or directed to execute a specified thread, independently of the other logical processor sharing the same physical core.
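Whether Hyper-Threading is enabled on a given machine can be checked with standard Linux tools, for example (a minimal sketch; the values mentioned below are what one would expect on a compute8 node, not output captured on btrzx2):
lscpu | grep -E 'Thread|Core|Socket'

With Hyper-Threading enabled, lscpu reports Thread(s) per core: 2; on a compute8 node one additionally expects Core(s) per socket: 4 and Socket(s): 2.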
While the following examples focus on btrzx2's nodes labeled compute8, the other "CPU-based" node types labeled compute20 and compute40 are analogous, just wider. The compute8 nodes got their name from housing two Intel E5-2623 v4 CPUs with a clock speed of 2.60 GHz and 4 physical cores each. The output of the command numactl --hardware shows this nicely:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 32673 MB
node 0 free: 30863 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 32768 MB
node 1 free: 31225 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
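
Which logical CPUs are hyperthread siblings on the same physical core is not visible in this output, but it can be read directly from the kernel's sysfs topology files. A minimal check, assuming the standard Linux sysfs layout (the value given below is the pairing described in the next paragraph, not output captured on btrzx2):
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

On a compute8 node this should print 0,8, i.e. logical CPUs 0 and 8 are the two hyperthreads of one physical core.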

Please note that in the numactl output above, node means a processing node within the machine, not a node of the cluster. This system houses two Intel E5-2623 v4 CPUs, labeled node 0 and node 1, each with 32 GB of RAM attached. (For considerations about RAM access please see the page about NUMA.) In this setup the first hyperthread of the first core of node 0 bears the label 0, while the second hyperthread of the first core of node 0 bears the label 8. This can easily be seen by submitting the following job, requesting two threads (ppn=2), to btrzx2's queueing system:
#PBS -l nodes=1:ppn=2:compute8,walltime=00:50:00
#PBS -j oe
numactl --show

The result will be an output file containing
policy: default
preferred node: current
physcpubind: 0 8 
cpubind: 0 
nodebind: 0 
membind: 0 

with the line physcpubind telling which hyperthread cores were allocated to this job. Interestingly, when requesting 4 physical cores in the form of 8 hyperthread cores by using the parameter ppn=8, the output shows that all threads are, whenever possible, allocated on the same CPU chip:
policy: default
preferred node: current
physcpubind: 0 1 2 3 8 9 10 11 
cpubind: 0 
nodebind: 0 
membind: 0 

All hyperthreads of a compute node of type compute8 are requested by using ppn=16, which produces the output
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1 

A sequential job should therefore request nodes=1:ppn=2, since using one physical core always occupies both of its hyperthread cores. While jobs parallelized with OpenMP can benefit from using all hyperthreads of a compute node (see the sketch after the MPI example below), jobs parallelized with MPI will usually be slowed down when using more MPI processes than physical cores are available. The following example executes an MPI job on two cluster nodes.

#PBS -l nodes=2:ppn=16:compute8,walltime=00:05:00
#PBS -j oe
module load intel_parallel_studio_xe_2016_update4
awk 'NR%2==1' $PBS_NODEFILE > my_nodefile   # keep every other line: one slot per physical core
$MPI_RUN -ordered-output -prepend-rank -machinefile my_nodefile /bin/tcsh -c 'taskset -c -p $$; hostname' | sort
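
For reference, under Torque/PBS the file named by $PBS_NODEFILE lists each allocated host once per granted slot, so for nodes=2:ppn=16 it can be expected to hold 32 lines: each of the two host names repeated 16 times. Keeping only the odd-numbered lines therefore leaves 8 entries per host. This can be verified from within the job, e.g. (a sketch, not captured output):
sort $PBS_NODEFILE | uniq -c    # expected: two host names, 16 entries each
sort my_nodefile | uniq -c      # expected: two host names, 8 entries each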

This example uses the commands taskset -c -p $$ and hostname to show which hyperthread cores on which compute node have been allocated to each MPI process. While the job requests all hyperthread cores of two nodes (nodes=2:ppn=16:compute8; the analogous request for the other node type would be nodes=2:ppn=40:compute20), the line awk 'NR%2==1' $PBS_NODEFILE > my_nodefile singles out every other entry, so that only one MPI process per physical core is started. The PBS queueing system passes the name of a file listing all hyperthread cores to be used in the environment variable $PBS_NODEFILE; here awk 'NR%2==1' prints only the odd-numbered lines of that file into the file my_nodefile. The resulting job output will look like
[0] pid 44595's current affinity list: 0,8
[0] r03n30
[10] pid 43651's current affinity list: 2,10
[10] r03n29
[11] pid 43652's current affinity list: 3,11
[11] r03n29
[12] pid 43653's current affinity list: 4,12
[12] r03n29
[13] pid 43654's current affinity list: 5,13
[13] r03n29
[14] pid 43655's current affinity list: 6,14
[14] r03n29
[15] pid 43656's current affinity list: 7,15
[15] r03n29
[1] pid 44596's current affinity list: 1,9
[1] r03n30
[2] pid 44597's current affinity list: 2,10
[2] r03n30
[3] pid 44598's current affinity list: 3,11
[3] r03n30
[4] pid 44599's current affinity list: 4,12
[4] r03n30
[5] pid 44600's current affinity list: 5,13
[5] r03n30
[6] pid 44601's current affinity list: 6,14
[6] r03n30
[7] pid 44602's current affinity list: 7,15
[7] r03n30
[8] pid 43649's current affinity list: 0,8
[8] r03n29
[9] pid 43650's current affinity list: 1,9
[9] r03n29
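
As noted above, a job parallelized with OpenMP can often make use of all hyperthreads of a node. A minimal sketch of such a job, assuming the job script runs under bash and using a placeholder executable ./my_openmp_program (not software provided on the cluster):
#PBS -l nodes=1:ppn=16:compute8,walltime=00:50:00
#PBS -j oe
cd $PBS_O_WORKDIR            # change to the directory the job was submitted from
export OMP_NUM_THREADS=16    # one OpenMP thread per hyperthread of the node
./my_openmp_program

Whether the extra hyperthreads actually pay off depends on the program; it is worth comparing the runtime against a run with OMP_NUM_THREADS=8, i.e. one thread per physical core.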


Last modified January 2017 by Dr. Bernhard L. Winkler