Compiled below are benchmarks for a range of hardware and a range of problems, roughly 38 benchmarks in total. The systems covered use from 1 up to 448 cpus, and the PME problems range in size from ~23k to ~193k atoms; there is also one medium-sized generalized Born benchmark. Both the systems and problems overlap somewhat with benchmarks run by Ross Walker at SDSC (thanks Ross!), but there are enough differences that both sets should be considered.
All these benchmarks were run using formatted output for trajectories (ioutfm = 0). There is a new binary trajectory format based on netCDF that works very well in most instances and can improve scaling at the very high end by as much as 5-10%; it is worth a try if you do runs at very high cpu counts.
Bob Duke
NIEHS and UNC-Chapel Hill
May 19, 2006
********************************************************************************
Human Factor Va (blood coagulation cascade) protein in TIP3P water.
Benchmark contributed by Dr. Chang Jun Lee of Prof. Lee Pedersen's lab.
This benchmark is not yet available for general public use.
Orthogonal periodic unit cell, 105.28 x 148.49 x 125.35 angstrom.
Full Particle Mesh Ewald simulation, direct nonbonded cutoff 9.0 angstrom,
reciprocal space FFT grid dimensions 108 x 150 x 128.
2 femtosecond timestep, bonds to H constrained (SHAKE).
Energies printed every 100 steps (mdout).
Trajectory printed every 250 steps (mdcrd).
Equilibration temperature = 300K.
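The grid dimensions quoted above can be sanity-checked against the cell edges. The sketch below is only an illustration and rests on two assumptions that are not stated in the benchmark description: that PME grid sizes are normally chosen as products of small primes (2, 3 and 5) so the FFT is efficient, and that a grid spacing of roughly 1 angstrom is the target. The function names are made up for this example.

```python
def has_only_small_prime_factors(n, primes=(2, 3, 5)):
    """Return True if n factors completely into the given primes."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1


def grid_spacing(cell_edges, grid_dims):
    """Grid spacing in angstrom along each axis (cell edge / grid points)."""
    return [edge / n for edge, n in zip(cell_edges, grid_dims)]


# HFVa values taken from the benchmark description above.
cell = (105.28, 148.49, 125.35)   # unit cell edges, angstrom
grid = (108, 150, 128)            # reciprocal space FFT grid

spacings = grid_spacing(cell, grid)
fft_friendly = all(has_only_small_prime_factors(n) for n in grid)

print("spacing per axis (angstrom):", [round(s, 3) for s in spacings])
print("all dimensions factor into 2, 3, 5:", fft_friendly)
```

Under those assumptions, each grid dimension is the cell edge rounded up to the next convenient FFT size, which keeps the spacing just under 1 angstrom.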
Human Factor IX (blood coagulation cascade) protein in TIP3P water.
Benchmark contributed by Dr. Lalith Perera of Prof. Lee Pedersen's lab.
Orthogonal periodic unit cell, 142.09 x 83.34 x 78.68 angstrom.
Full Particle Mesh Ewald simulation, direct nonbonded cutoff 8.0 angstrom,
reciprocal space FFT grid dimensions 144 x 90 x 80.
1.5 femtosecond timestep, bonds to H constrained (SHAKE).
Energies printed every 50 steps (mdout).
Trajectory printed every 250 steps (mdcrd).
Equilibration temperature = 300K.
This Factor IX benchmark differs from the default Amber Factor IX benchmark in that it uses the NVT (vs. NVE) ensemble (i.e., there is a thermostat), it outputs energies more frequently, and it outputs a trajectory at a frequency that is more than adequate to meet most needs (the general consensus seems to be that a trajectory snapshot every picosecond is more than sufficient). There is slight additional overhead associated with the NVT ensemble compared to the NVE ensemble. Also, printing a trajectory at high frequency tests the file i/o capability of both the hardware and software.
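To make the "snapshot every picosecond" comparison concrete, the output intervals can be converted from steps to simulated time. This is a minimal sketch using the timestep and output frequencies listed for the benchmarks; the function name is illustrative only.

```python
def output_interval_ps(timestep_fs, steps_between_writes):
    """Simulated time between successive writes, in picoseconds."""
    return timestep_fs * steps_between_writes / 1000.0


# Factor IX NVT/NPT benchmarks: 1.5 fs timestep.
factor_ix_traj_ps = output_interval_ps(1.5, 250)    # trajectory (mdcrd)
factor_ix_energy_ps = output_interval_ps(1.5, 50)   # energies (mdout)

# HFVa benchmarks: 2 fs timestep.
hfva_traj_ps = output_interval_ps(2.0, 250)         # trajectory (mdcrd)

print("Factor IX trajectory interval, ps:", factor_ix_traj_ps)
print("Factor IX energy interval, ps:", factor_ix_energy_ps)
print("HFVa trajectory interval, ps:", hfva_traj_ps)
```

Both trajectory intervals come out well under 1 ps, so these benchmarks write trajectory frames more often than the consensus figure would require.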
Human Factor IX (blood coagulation cascade) protein in TIP3P water.
Benchmark contributed by Dr. Lalith Perera of Prof. Lee Pedersen's lab.
Orthogonal periodic unit cell, 142.09 x 83.34 x 78.68 angstrom.
Full Particle Mesh Ewald simulation, direct nonbonded cutoff 8.0 angstrom,
reciprocal space FFT grid dimensions 144 x 90 x 80.
1.5 femtosecond timestep, bonds to H constrained (SHAKE).
Energies printed every 50 steps (mdout).
Trajectory printed every 250 steps (mdcrd).
Equilibration temperature = 300K.
This Factor IX benchmark differs from the default Amber Factor IX benchmark in that it uses the NPT (vs. NVE) ensemble (i.e., there is a thermostat and a barostat), it outputs energies more frequently, and it outputs a trajectory at a frequency that is more than adequate to meet most needs (the general consensus seems to be that a trajectory snapshot every picosecond is more than sufficient). The NPT ensemble poses additional challenges to scaling compared with both the NVE and NVT ensembles. Also, printing a trajectory at high frequency tests the file i/o capability of both the hardware and software.
Human Factor IX (blood coagulation cascade) protein in TIP3P water.
This is the Amber Factor IX benchmark and is run exactly as
distributed in the Amber 9 benchmarks tree aside from total dynamics step
count.
Benchmark contributed by Dr. Lalith Perera of Prof. Lee Pedersen's lab.
Orthogonal periodic unit cell, 142.09 x 83.34 x 78.68 angstrom.
Full Particle Mesh Ewald simulation, direct nonbonded cutoff 8.0 angstrom,
reciprocal space FFT grid dimensions 144 x 90 x 80.
1.5 femtosecond timestep, bonds to H constrained (SHAKE).
Energies printed every 100 steps (mdout).
Trajectory NOT printed.
Equilibration temperature = 300K.
Dihydrofolate Reductase (DHFR) protein in TIP3P water.
This is the Joint Amber-Charmm Benchmark, and is run exactly as distributed
in the Amber 9 benchmarks tree aside from total dynamics step count.
Orthogonal periodic unit cell, 62.23 x 62.24 x 62.23 angstrom.
Full Particle Mesh Ewald simulation, direct nonbonded cutoff 9.0 angstrom,
reciprocal space FFT grid dimensions 64 x 64 x 64.
1.0 femtosecond timestep, bonds to H constrained (SHAKE).
Energies printed every 100 steps (mdout).
Trajectory NOT printed.
Equilibration temperature = 300K.
Myoglobin protein, implicit solvent.
This is the Amber generalized Born myoglobin benchmark and is run exactly as
distributed in the Amber 9 benchmarks tree aside from total dynamics step
count.
Generalized Born simulation (igb=1), nonbonded cutoff 20.0 angstrom.
1.0 femtosecond timestep, bonds to H constrained (SHAKE).
Slowly varying forces evaluation every 4 steps (nrespa=4).
Energies printed every 200 steps (mdout).
Trajectory NOT printed.
********************************************************************************
IBM p575 Power 5, bassi.nersc.gov, 118 8-cpu nodes
1.9 GHz Power 5+ cpu, 2 MB L2 cache, 36 MB L3 cache, 32 GB memory per node
IBM "Federation" HPS switch (node interconnect)
GPFS file system
IBM XL Fortran, MASSV vector math library
This is an incredibly well tuned sp5, and pmemd scales better here than on other sp5's I have seen. My hat is off to the NERSC folks for how they are getting this machine to perform.
IBM p655+ Power 4, dslogin.sdsc.edu, 272 8-cpu nodes
1.7 GHz Power 4 cpu, 32 GB memory per node
IBM "Federation" HPS switch (node interconnect)
GPFS file system
IBM XL Fortran, MASSV vector math library
Cray XT3 MPP, bigben.psc.edu, 2068 1-cpu nodes
2.4 GHz Opteron cpu, 1 GB memory per node
Cray proprietary switch (node interconnect)
Lustre file system
The file system software on this machine can cause high scaling jobs to stall on disk i/o. Cray has recently released a buffering library to help alleviate this problem (suggestions for PMEMD configuration and runtime environment variables will be posted soon on the Amber web site). The problem is mostly seen when jobs that do huge amounts of file i/o are running concurrently with your job. The benchmark data given below was taken before the machine was overloaded with this type of job. For this machine or any other machine, if the RunMD time in the system logfile for the master process (0) is significantly larger than the time for other processes (say more than about 4-5 times larger), you may be encountering this problem. Using binary trajectory files will not help, but using the Cray iobuf library and ioutfm=0 should at least reduce the impact of this system imbalance. See the Amber web site for more details. The Cray XD1 is probably subject to similar problems.
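The rule of thumb above (master RunMD time more than about 4-5 times that of the other processes) can be written down as a simple check. This is a rough sketch only: it assumes you have already collected the per-process RunMD times by hand, with process 0 first; it does not parse the logfile, and the function name, the use of the median, and the sample numbers are all made up for illustration.

```python
from statistics import median


def master_io_stall_suspected(runmd_times, ratio_threshold=4.0):
    """Flag a likely i/o stall on the master process.

    runmd_times: RunMD times, one per process, with the master
                 (process 0) as the first entry.
    ratio_threshold: how many times larger than the typical worker
                     time the master time must be to raise the flag.
    """
    if len(runmd_times) < 2:
        raise ValueError("need the master time and at least one other process")
    master_time = runmd_times[0]
    typical_worker_time = median(runmd_times[1:])
    return master_time > ratio_threshold * typical_worker_time


# Hypothetical numbers, not taken from any real run.
print(master_io_stall_suspected([50.0, 10.0, 11.0, 9.0]))   # master far slower
print(master_io_stall_suspected([12.0, 10.0, 11.0, 9.0]))   # roughly balanced
```

The median is used here so that one unusually slow worker does not hide a stalled master; a plain average would serve nearly as well.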
SGI Altix 3700 Bx2, cobalt.ncsa.uiuc.edu, 2 512-cpu nodes
1.6 GHz Itanium 2 cpu, 9 MB L3 cache, 1024 GB memory per node
SGI NUMAlink 4 node interconnect
File system details unknown
Linux Opteron cluster, jacquard.nersc.gov, 356 2-cpu nodes
2.2 GHz Opteron cpu, 6 GB memory per node
Mellanox Infiniband node interconnect / MVAPICH
GPFS file system
Pathscale Fortran 90 compiler, math library not used
Dell Linux Xeon cluster, topsail.unc.edu, 512 2-cpu nodes
3.6 GHz Xeon (EM64T) cpu, 2 MB L2 cache, 4 GB memory per node
Infiniband node interconnect / MVAPICH
IBRIX parallel file system
Intel Fortran 90 compiler, Intel MKL
HP Alphaserver ES45, lemieux.psc.edu, 750 4-cpu nodes
1 GHz Alpha EV6.8CB cpu, 4 GB memory per node
Quadrics switch (node interconnect)
There are both local file systems on the nodes and a shared scratch file
system. Access to the scratch file system is prone to stalling jobs, so I
would recommend working out how to use the $LOCAL file system (which I have
NOT done myself; I typically just benchmark to $HOME, but there is limited
storage there). See the PSC lemieux user manual online (www.psc.edu).
NOTE - You should run using the "2-rail" option for the interconnect when using 64 or more cpus (at 32 it may also help, depending on problem size). See the PSC lemieux user manual online (www.psc.edu).
IBM BlueGene, IBM Rochester BlueGene Capacity on Demand Center, 1024 2-cpu nodes
700 MHz PowerPC 440 cpu, 4 MB L3 cache, 0.5 GB memory per node
IBM proprietary switch (node interconnect)
GPFS file system
IBM XL Fortran, MASSV vector math library
The machine was used with the processors in "virtual-node" mode instead of "coprocessor" mode (performance was significantly better). Code was compiled with -qarch=440 instead of -qarch=440d (i.e., the special math instructions were not used; performance was better). Code was compiled without -qhot (using it causes incorrect results). The interconnect for BG/L is configurable; we used a partition with an 8 x 8 x 8 geometry for all benchmarks.
The MASSV library significantly improves generalized Born performance, but at the time of the tests could not be used for really large simulations due to a bug in the library. It worked fine for the MB benchmark (2492 atoms).
The IBM BG/L architecture is unique in that a less powerful cpu without much memory was intentionally used with a very high performance configurable cpu interconnect. To take full advantage of this architecture, features must be added to PMEMD; this work is anticipated to be completed for PMEMD 10.
SGI Origin 3800, zephyr.isis.unc.edu, 1 64-cpu node
0.5 GHz MIPS R14000 cpu, 8 MB L2 cache, 64 GB memory per node
Symmetric multiprocessor, memory-interconnected
File system details unknown
Duke's workstation cluster (tiger, lion), 2 2-cpu nodes
3.2 GHz Intel Xeon cpu, 0.5 MB cache, 2 GB memory per node
Gigabit ethernet, dedicated Intel server NICs,
dedicated crossover (XO) cable (no switch needed) / MPICH-1.2.6
Intel Fortran 90 compiler, Intel MKL
Duke's standalone workstation (cheetah), 1 2-cpu node
3.0 GHz Intel Pentium D cpu (EM64T), 1 MB cache, 2 GB memory per node
Shared memory interconnect / MPICH-1.2.6
Intel Fortran 90 compiler, Intel MKL
*******************************************************************************
IBM SP5 - HFVa - NVT ensemble, PME, 193,013 atoms
#procs nsec/day scaling, %
2 0.112 100
4 0.220 98
8 0.427 95
16 0.824 92
32 1.58 88
64 2.98 83
96 4.15 77
128 5.24 73
160 6.17 69
192 7.02 65
224 7.78 62
256 8.07 56
288 8.55 53
320 9.0 50
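The "scaling, %" column is not defined in the text. From the numbers it appears to be parallel efficiency relative to the 2-processor run, i.e. the measured nsec/day divided by what perfect linear scaling from the 2-processor rate would give. The sketch below encodes that reading; treat it as my interpretation of the tables rather than a stated definition, and the function name as illustrative.

```python
def scaling_percent(base_procs, base_rate, procs, rate):
    """Parallel efficiency in percent, relative to a baseline run.

    base_procs, base_rate: processor count and nsec/day of the baseline.
    procs, rate:           processor count and nsec/day of the run of interest.
    """
    ideal_rate = base_rate * procs / base_procs   # perfect linear scaling
    return 100.0 * rate / ideal_rate


# Rows from the IBM SP5 HFVa NVT table (2-processor baseline: 0.112 nsec/day).
for procs, rate in [(4, 0.220), (64, 2.98), (320, 9.0)]:
    print(procs, round(scaling_percent(2, 0.112, procs, rate)))
```

For the 320-processor row this gives about 50, in agreement with the table. Small differences of a percent or so in other tables are to be expected, since the published nsec/day values are rounded.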
*******************************************************************************
IBM SP5 - HFVa - NPT ensemble, PME, 193,013 atoms
#procs nsec/day scaling, %
2 0.112 100
4 0.218 96
8 0.422 93
16 0.816 89
32 1.55 85
64 2.94 81
96 4.09 75
128 5.14 72
160 6.00 67
192 6.80 63
224 7.02 56
256 7.58 53
288 7.71 48
320 8.0 45
*******************************************************************************
IBM SP5 - FACTOR IX - NVE ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.245 100
4 0.476 97
8 0.946 96
16 1.80 92
32 3.28 84
64 6.17 79
96 8.58 73
128 10.45 67
160 11.78 60
192 13.22 56
224 13.79 50
256 14.6 47
288 15.4 44
320 15.4 39
*******************************************************************************
IBM SP5 - FACTOR IX - NPT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.243 100
4 0.475 98
8 0.931 96
16 1.78 92
32 3.26 84
64 6.06 78
96 8.15 70
128 9.74 63
160 10.08 52
192 11.08 47
224 11.78 43
256 12.23 39
288 12.58 36
320 13.5 35
*******************************************************************************
IBM SP5 - JAC - NVE ensemble, PME, 23,558 atoms
#procs nsec/day scaling, %
2 0.506 100
4 0.995 98
8 1.95 96
16 3.69 91
32 7.08 87
64 12.52 77
96 16.62 68
128 18.78 58
160 20.1 50
192 19.2 40
224 22.2 39
256 22.2 34
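It can help to translate nsec/day into wall-clock time per dynamics step, since that shows how little time is left for communication at high processor counts. The sketch below uses the JAC timestep (1.0 fs) from the benchmark description and two rows from the table above; the function name is illustrative, and the interpretation is mine.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000    # milliseconds in one day
FS_PER_NS = 1.0e6                   # femtoseconds in one nanosecond


def ms_per_step(nsec_per_day, timestep_fs):
    """Average wall-clock milliseconds spent on one dynamics step."""
    steps_per_day = nsec_per_day * FS_PER_NS / timestep_fs
    return MS_PER_DAY / steps_per_day


# JAC on the IBM SP5, 1.0 fs timestep.
print("2 procs:  ", round(ms_per_step(0.506, 1.0), 1), "ms/step")   # roughly 171
print("256 procs:", round(ms_per_step(22.2, 1.0), 2), "ms/step")    # roughly 3.9
```

At 256 processors each step takes only a few milliseconds in total, so interconnect latency and any file i/o on the master become a large fraction of the step.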
*******************************************************************************
IBM SP5 - MB - Generalized Born, 2492 atoms
#procs nsec/day scaling, %
2 0.511 100
4 1.016 100
8 2.02 99
16 3.93 96
32 7.32 90
64 12.71 78
128 19.63 60
*******************************************************************************
IBM SP4 - FACTOR IX - NVE ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.245 100
4 0.476 97
8 0.946 96
16 1.80 92
32 3.28 84
64 6.17 79
96 8.58 73
128 10.45 67
160 11.78 60
192 13.22 56
224 13.79 50
256 14.6 47
288 15.4 44
320 15.4 39
*******************************************************************************
IBM SP4 - FACTOR IX - NPT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.208 100
4 0.385 93
8 0.689 83
16 1.301 78
32 2.409 72
64 4.334 65
96 5.972 60
128 7.082 53
192 8.698 44
256 8.471 32
*******************************************************************************
Cray XT3 - HFVa - NVT ensemble, PME, 193,013 atoms
#procs nsec/day scaling, %
2 0.098 100
4 0.184 94
8 0.365 93
16 0.711 91
32 1.34 85
64 2.49 79
96 3.08 65
128 4.24 67
160 4.94 63
192 5.38 57
224 5.88 53
256 5.57 44
288 5.86 41
320 6.15 39
*******************************************************************************
Cray XT3 - HFVa - NPT ensemble, PME, 193,013 atoms
#procs nsec/day scaling, %
2 0.085 100
4 0.171 101
8 0.343 101
16 0.649 96
32 1.26 93
64 2.30 85
96 3.15 77
128 3.98 73
160 4.58 68
192 4.91 60
224 5.37 57
256 5.10 47
288 5.28 43
320 5.57 41
*******************************************************************************
Cray XT3 - FACTOR IX - NVT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.217 100
4 0.398 92
8 0.820 94
16 1.57 90
32 2.83 82
64 4.93 71
96 6.89 66
128 7.85 57
160 8.70 50
192 9.39 45
224 10.05 41
256 10.20 37
288 10.29 33
320 10.05 29
*******************************************************************************
Cray XT3 - FACTOR IX - NPT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.218 100
4 0.400 92
8 0.795 91
16 1.56 89
32 2.80 80
64 4.70 67
96 6.23 59
128 7.69 55
160 7.67 44
192 8.53 41
224 9.00 37
256 9.60 34
288 9.74 31
320 9.89 28
*******************************************************************************
Cray XT3 - JAC - NVE ensemble, PME, 23,558 atoms
#procs nsec/day scaling, %
2 0.482 100
4 0.904 94
8 1.80 93
16 3.46 90
32 6.26 81
64 10.54 68
96 13.09 57
128 15.7 51
160 14.9 39
192 16.0 35
224 15.7 29
*******************************************************************************
Cray XT3 - MB - Generalized Born, 2492 atoms
#procs nsec/day scaling, %
2 0.435 100
4 0.867 100
8 1.71 99
16 3.35 96
32 6.45 93
64 11.37 82
128 18.78 67
*******************************************************************************
SGI Altix - FACTOR IX - NVT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.311 100
4 0.592 95
8 1.17 94
16 2.10 85
32 3.54 71
64 6.29 63
96 7.71 52
128 6.17 31
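Throughput in this table actually drops past 96 processors, so it is worth picking a processor count from the data rather than simply using the largest available. The sketch below finds the peak-throughput row and the largest processor count that still meets an efficiency floor. The 60% floor is an arbitrary example, efficiency is taken relative to the first row (my reading of the "scaling, %" column), and the function names are illustrative.

```python
# (procs, nsec/day) rows from the SGI Altix FACTOR IX NVT table.
altix_factor_ix_nvt = [
    (2, 0.311), (4, 0.592), (8, 1.17), (16, 2.10),
    (32, 3.54), (64, 6.29), (96, 7.71), (128, 6.17),
]


def peak_throughput(rows):
    """Return the (procs, nsec/day) row with the highest nsec/day."""
    return max(rows, key=lambda row: row[1])


def largest_efficient_count(rows, min_efficiency_pct):
    """Largest processor count whose efficiency, relative to the first
    row, is at least min_efficiency_pct. Returns None if no row qualifies."""
    base_procs, base_rate = rows[0]
    qualifying = [
        procs for procs, rate in rows
        if 100.0 * rate * base_procs / (base_rate * procs) >= min_efficiency_pct
    ]
    return max(qualifying) if qualifying else None


print("peak:", peak_throughput(altix_factor_ix_nvt))
print("largest count at >= 60%:", largest_efficient_count(altix_factor_ix_nvt, 60.0))
```

For this table the peak is at 96 processors, while a 60% efficiency floor would stop at 64; which of the two matters more depends on whether you are limited by wall-clock time or by your cpu allocation.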
*******************************************************************************
SGI Altix - FACTOR IX - NPT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.293 100
4 0.586 100
8 1.14 97
16 2.03 87
32 3.41 73
64 5.94 64
96 7.28 52
128 6.00 32
*******************************************************************************
SGI Altix - JAC - NVE ensemble, PME, 23,558 atoms
#procs nsec/day scaling, %
2 0.738 100
4 1.38 94
8 2.86 97
16 5.30 90
32 9.19 78
64 12.17 51
96 15.16 43
128 11.84 25
*******************************************************************************
SGI Altix - MB - Generalized Born, 2492 atoms
#procs nsec/day scaling, %
2 0.385 100
4 0.762 99
8 1.35 87
16 2.88 93
32 4.91 80
64 7.71 63
128 7.71 31
*******************************************************************************
Opteron Infiniband Cluster - FACTOR IX - NPT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.226 100
4 0.438 97
8 0.842 93
16 1.58 87
32 2.77 76
64 4.91 68
96 6.35 58
128 7.04 49
160 7.20 40
*******************************************************************************
Opteron Infiniband Cluster - JAC - NVE ensemble, PME, 23,558 atoms
#procs nsec/day scaling, %
2 0.491 100
4 0.947 96
8 1.82 92
16 3.22 82
32 6.08 77
64 10.05 64
96 11.84 50
128 12.00 38
*******************************************************************************
Opteron Infiniband Cluster - MB - Generalized Born, 2492 atoms
#procs nsec/day scaling, %
2 0.426 100
4 0.842 99
8 1.66 97
16 3.25 95
32 6.00 88
64 10.80 79
96 14.40 70
*******************************************************************************
Dell/Intel Infiniband Cluster - HFVa - NVT ensemble, PME, 193,013 atoms
#procs nsec/day scaling, %
2 0.101 100
4 0.190 94
8 0.355 87
16 0.690 85
32 1.25 77
64 2.30 71
96 3.10 64
128 3.62 56
160 4.15 51
192 4.45 46
224 4.50 40
*******************************************************************************
Dell/Intel Infiniband Cluster - HFVa - NPT ensemble, PME, 193,013 atoms
#procs nsec/day scaling, %
2 0.101 100
4 0.188 93
8 0.356 88
16 0.653 81
32 1.22 76
64 2.25 70
96 3.02 63
128 3.63 56
160 3.98 49
192 4.13 43
224 4.06 36
*******************************************************************************
Dell/Intel Infiniband Cluster - FACTOR IX - NVT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.217 100
4 0.393 92
8 0.778 88
16 1.48 86
32 2.59 76
64 4.66 67
96 5.94 57
128 6.48 47
160 6.89 40
*******************************************************************************
Dell/Intel Infiniband Cluster - FACTOR IX - NPT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.211 100
4 0.393 93
8 0.741 88
16 1.45 86
32 2.54 75
64 4.47 66
96 5.63 56
128 6.42 47
160 6.65 39
*******************************************************************************
Dell/Intel Infiniband Cluster - JAC - NVE ensemble, PME, 23,558 atoms
#procs nsec/day scaling, %
2 0.470 100
4 0.895 95
8 1.71 91
16 3.03 81
32 5.84 78
64 9.60 64
96 11.1 49
*******************************************************************************
Dell/Intel Infiniband Cluster - MB - Generalized Born, 2492 atoms
#procs nsec/day scaling, %
2 0.584 100
4 1.15 99
8 2.22 95
16 4.15 89
32 6.86 73
64 10.7 57
96 10.8 39
*******************************************************************************
HP Alphaserver/Quadrics Cluster - FACTOR IX - NVT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.132 100
4 0.254 96
8 0.455 86
16 0.825 78
32 1.36 65
64 2.74 65 *
96 3.37 53 *
128 3.65 43 *
160 4.47 42 *
* - 2-rail option used on interconnect
*******************************************************************************
HP Alphaserver/Quadrics Cluster - FACTOR IX - NPT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.130 100
4 0.246 94
8 0.445 85
16 0.790 76
32 1.34 64
64 2.66 64 *
96 3.43 55 *
128 3.37 40 *
160 4.02 39 *
* - 2-rail option used on interconnect
*******************************************************************************
HP Alphaserver/Quadrics Cluster - JAC - NVE ensemble, PME, 23,558 atoms
#procs nsec/day scaling, %
2 0.282 100
4 0.540 96
8 0.982 87
16 1.69 75
32 3.20 71
64 5.61 62 *
80 6.52 58 *
96 4.49 33 *
* - 2-rail option used on interconnect
*******************************************************************************
IBM BG/L - FACTOR IX - NVE ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.054 100
4 0.103 96
8 0.200 93
16 0.388 90
32 0.721 84
64 1.33 78
96 1.92 74
128 2.42 70
160 2.84 66
192 3.29 64
224 3.43 57
256 3.77 55
288 4.00 52
320 4.00 47
352 4.18 44
384 4.10 40
416 4.21 38
448 4.15 35
*******************************************************************************
IBM BG/L - JAC - NVE ensemble, PME, 23,558 atoms
#procs nsec/day scaling, %
2 0.109 100
4 0.211 97
8 0.408 94
16 0.758 87
32 1.49 86
64 2.84 82
96 3.89 75
128 4.85 70
160 5.54 64
192 5.68 54
224 6.08 50
256 6.35 46
288 6.55 42
320 6.35 36
*******************************************************************************
IBM BG/L - MB - Generalized Born, 2492 atoms
#procs nsec/day scaling, %
2 0.099 100
4 0.196 99
8 0.391 99
16 0.771 98
32 1.50 95
64 2.84 90
96 3.89 82
112 4.32 78
128 5.14 81
144 5.33 75
*******************************************************************************
SGI Origin 3800 - FACTOR IX - NVT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
2 0.074 100
4 0.143 97
8 0.284 96
12 0.410 93
16 0.527 89
*******************************************************************************
SGI Origin 3800 - JAC - NVE ensemble, PME, 23,558 atoms
#procs nsec/day scaling, %
2 0.166 100
4 0.320 96
8 0.640 96
12 0.909 91
16 1.200 90
*******************************************************************************
Intel Xeon gigabit ethernet cluster - FACTOR IX - NPT ensemble, PME, 90,906 atoms
#procs nsec/day scaling, %
1 0.116 --
2 0.182 100
4 0.293 80
*******************************************************************************
Intel Xeon gigabit ethernet cluster - JAC - NVE ensemble, PME, 23,558 atoms
#procs nsec/day scaling, %
1 0.254 --
2 0.432 100
4 0.702 81
*******************************************************************************
Intel Xeon gigabit ethernet cluster - MB - Generalized Born, 2492 atoms
#procs nsec/day scaling, %
1 0.292 --
2 0.554 100
4 1.094 99
*******************************************************************************
Dell EM64T Pentium dual cpu - FACTOR IX - NPT ensemble, PME, 90,906 atoms
#procs nsec/day
1 0.108
2 0.189
*******************************************************************************
Dell EM64T Pentium dual cpu - JAC - NVE ensemble, PME, 23,558 atoms
#procs nsec/day
1 0.300
2 0.520
*******************************************************************************