HPL Parameter Tuning - Meng Hao's Homepage

¶1. Experimental platform: KNL configuration

Intel Xeon Phi Processor 7210 ( 16GB, 1.30 GHz, 64 core )
Processor name : Intel® Xeon Phi™ 7210
Packages (sockets) : 1
Cores : 64
Processors (CPUs) : 256
Cores per package : 64
Threads per core : 4

RAM: 96GB
MCDRAM: 16 GB

Theoretical peak
1 × 64 × 1.3 × 32 = 2662.4 Gflops
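
The peak figure above is the product of socket count, cores, clock frequency in GHz, and double-precision flops per cycle per core. A minimal sketch of that arithmetic follows; the function name `theoretical_peak_gflops` is hypothetical, and the value of 32 flops per cycle assumes two AVX-512 FMA units per KNL core (2 units × 8 doubles × 2 operations).

```python
def theoretical_peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Peak double-precision rate in Gflops."""
    return sockets * cores_per_socket * ghz * flops_per_cycle

# Values for the Xeon Phi 7210 listed above.
# Assumption: 32 double-precision flops per cycle per core.
peak = theoretical_peak_gflops(1, 64, 1.3, 32)
print(f"{peak:.1f} Gflops")
```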

¶2. Parameters to tune in the HPL.dat file

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
50000 100000 150000 200000  Ns
10            # of NBs
1 2 4 8 16 32 64 128 256 512     NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

The main parameters to tune are N, NB, and P x Q.

¶3. Deriving the theoretically optimal N from the available memory

The KNL's MCDRAM is 16 GB. Solving N × N × 8 bytes = 16 GB gives N ≈ 46000.
Start with N around 46000 and test from there.
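
The estimate above follows from storing an N × N matrix of 8-byte doubles. A small sketch of the calculation, assuming 16 GB is read as 16 GiB (2^34 bytes); the helper name `max_n_for_memory` is hypothetical.

```python
import math

def max_n_for_memory(mem_bytes, bytes_per_element=8):
    """Largest N such that an N x N matrix of doubles fits in mem_bytes."""
    return math.isqrt(mem_bytes // bytes_per_element)

mcdram_bytes = 16 * 2**30  # assumption: 16 GB interpreted as 16 GiB
n_max = max_n_for_memory(mcdram_bytes)
print(n_max)
```

This is an upper bound on the matrix alone; it does not account for HPL's internal buffers, so practical values of N are lower.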

Command for extracting results from the log

awk -F: '/WR/' knl003_hpl.o31829 > 30000_128_8_8.log
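
As an alternative to the awk filter, the result lines can be parsed into named fields. The sketch below assumes the standard HPL output layout, in which each result line starts with a `WR...` variant code followed by N, NB, P, Q, time in seconds, and Gflops. The function name `parse_hpl_results` is hypothetical, and the sample line is synthetic, not taken from the author's logs.

```python
def parse_hpl_results(text):
    """Return a list of dicts, one per HPL result line (7 fields, first starts with 'WR')."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[0].startswith("WR"):
            rows.append({
                "variant": fields[0],
                "N": int(fields[1]),
                "NB": int(fields[2]),
                "P": int(fields[3]),
                "Q": int(fields[4]),
                "seconds": float(fields[5]),
                "gflops": float(fields[6]),
            })
    return rows

# Synthetic sample line, for illustration only
sample = "WR00L2L2       35000   128     8     8              24.88              1.149e+03\n"
rows = parse_hpl_results(sample)
print(rows[0]["N"], rows[0]["NB"], rows[0]["gflops"])
```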

Interim results

N	NB	Ps	Qs	Result (Gflops)
30000	128	8	16	9.435e+02
30000	128	16	16	9.086e+01
30000	256	8	16	6.042e+02
35000	128	8	16	1.034e+03
35000	64	8	16	8.820e+02
35000	128	14	14	6.002e+02
35000	128	1	128	6.019e+02
35000	128	10	16	7.427e+02
35000	128	8	8	1.149e+03
35000	128	8	10	8.718e+02
39200	128	8	8	1.227e+03
39200	175	8	8	1.208e+03
41600	128	8	8	1.074e+03
32768	128	8	8	1.111e+03
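
To put the table in context, a measured result can be compared with the 2662.4 Gflops theoretical peak from section 1. A minimal sketch using the best row above (N = 39200, NB = 128, 8 x 8 grid); the helper name `efficiency` is hypothetical.

```python
PEAK_GFLOPS = 2662.4  # theoretical peak from section 1

def efficiency(measured_gflops, peak_gflops=PEAK_GFLOPS):
    """Fraction of theoretical peak achieved by a run."""
    return measured_gflops / peak_gflops

best = 1.227e+03  # row with N=39200, NB=128, P=8, Q=8
eff = efficiency(best)
print(f"{eff:.1%} of peak")
```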

Modified run script

#!/usr/bin/bash
 
 
#PBS -N knl003_hpl
#PBS -l nodes=1,walltime=01:00:00
 
cd /home/asc0146/haomeng/code/hpl/bin/Linux_Intel64/test1/
export KMP_AFFINITY=scatter,verbose
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
mpiexec -np 64 ./xhpl | tee HPL.out
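
The script launches 64 MPI ranks with 4 OpenMP threads each, so two conditions are worth checking before a run: P × Q in HPL.dat must equal the number of MPI ranks (as the Intel guidance quoted later in this post states), and ranks × threads should not exceed the 256 logical CPUs of the node. A sketch of such a check, with the hypothetical helper `check_layout`:

```python
def check_layout(mpi_ranks, threads_per_rank, p, q, logical_cpus):
    """Return a list of problems found in a hybrid MPI/OpenMP layout."""
    problems = []
    if p * q != mpi_ranks:
        problems.append(f"P*Q = {p * q} does not match {mpi_ranks} MPI ranks")
    total_threads = mpi_ranks * threads_per_rank
    if total_threads > logical_cpus:
        problems.append(f"{total_threads} threads exceed {logical_cpus} logical CPUs")
    return problems

# mpiexec -np 64 with OMP_NUM_THREADS=4 on 256 logical CPUs, 8 x 8 grid
print(check_layout(64, 4, 8, 8, 256))
# An 8 x 16 grid needs 128 ranks, so this reports a mismatch
print(check_layout(64, 4, 8, 16, 256))
```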

Running the experiment

N	NB	Ps	Qs	Result (Gflops)
35000	128	8	8	very poor

Parameter selection guidance

See the Developer Guide for Intel.
The most significant parameters in HPL.dat are P, Q, NB, and N. Specify them as follows:

P and Q - the number of rows and columns in the process grid, respectively.
P*Q must be the number of MPI processes that HPL is using.
Choose P ≤ Q.
NB - the block size of the data distribution.
The table below shows recommended values of NB for different Intel® processors:

N - the problem size:
For homogeneous runs, choose N divisible by NB*LCM(P,Q), where LCM is the least common multiple of the two numbers.
For heterogeneous runs, see Heterogeneous Support in the Intel Optimized MP LINPACK Benchmark for how to choose N.
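
The divisibility rule for homogeneous runs can be applied by rounding a candidate N down to the nearest multiple of NB*LCM(P,Q). A minimal sketch; the helper name `round_down_n` is hypothetical.

```python
import math

def round_down_n(n, nb, p, q):
    """Largest value <= n that is divisible by NB * LCM(P, Q)."""
    lcm = p * q // math.gcd(p, q)
    step = nb * lcm
    return (n // step) * step

# Candidate N of 46000 with NB = 128 on an 8 x 8 grid
n_aligned = round_down_n(46000, 128, 8, 8)
print(n_aligned)
```

With NB = 128 and an 8 x 8 grid the step is 128 × 8 = 1024, so 46000 rounds down to 45056.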

NOTE

Increasing N usually increases performance, but the size of N is bounded by memory. In general, you can compute the memory required to store the matrix (which does not count internal buffers) as 8*N*N/(P*Q) bytes, where N is the problem size and P and Q are the process grids in HPL.dat. A general rule of thumb is to choose a problem size that fills 80% of memory. When offloading to Intel Xeon Phi coprocessors, you may choose a problem size that fills 70% of memory, to leave room for additional buffers needed for offloading. Choose N and NB such that N >> NB.
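
The memory formula in the note can be evaluated directly to see how close a given N comes to the 80% rule of thumb. The sketch below uses hypothetical helper names and assumes the 16 GB MCDRAM is read as 16 GiB; it covers the matrix only, not HPL's internal buffers.

```python
def matrix_bytes_per_process(n, p, q):
    """Matrix storage per process in bytes: 8*N*N/(P*Q), excluding internal buffers."""
    return 8 * n * n / (p * q)

def fill_fraction(n, total_mem_bytes):
    """Fraction of total memory occupied by the N x N matrix of doubles."""
    return 8 * n * n / total_mem_bytes

mcdram = 16 * 2**30  # assumption: 16 GB MCDRAM interpreted as 16 GiB
per_proc = matrix_bytes_per_process(39200, 8, 8)
frac = fill_fraction(39200, mcdram)
print(per_proc, f"{frac:.0%}")
```

For N = 39200 on an 8 x 8 grid this gives about 192 MB per process and roughly 72% of MCDRAM for the whole matrix.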

Latest tuning status
- Tuning results

N	NB	Ps	Qs	Result (Gflops)
3920	128	8	16	9.435e+02