
UAHPC

Parallel Programming Methods (Algorithms, MPI) and Linux Clustering
Shahram
Posts: 293
Joined: Sun Feb 05, 2017 8:31 am

UAHPC

Postby Shahram » Tue Mar 21, 2017 10:00 pm


UAHPC (formerly RC2) is an 84-node (1,400-core) cluster of Dell PowerEdge M610 and M620 blades with ~21.8 teraflops theoretical sustained performance. 18 nodes contain two six-core Intel Nehalem Xeon X5650 processors and 48GB of SDRAM each, while the 59 newest nodes contain two eight-core Intel Xeon E5-2650 or E5-2640v2 processors and 64GB of RAM per node. 3 nodes contain two quad-core Intel Nehalem Xeon X5550 processors and 64GB of SDRAM each, and 1 node contains two six-core Intel Nehalem Xeon X5650 processors and 48GB of SDRAM. There are also three high-memory nodes.

These compute nodes are controlled by a Dell PowerEdge M830 master node containing two 10-core processors and 3TB of 15,000 RPM SAS6 hard drive capacity for sharing applications and home directories across the cluster. In addition, two dedicated storage nodes allow efficient handling of data between the compute nodes and the data storage devices. The storage nodes are connected via PERC H700 or H810 controllers to a total of approximately 100TB of storage in five Dell PowerVault MD1200s, plus another 20TB of internal disks in the second storage node. The storage nodes have 10Gb connectivity to the internet.

All nodes are connected internally within their Dell M1000e chassis by 4x QDR InfiniBand at a throughput of 40 Gbit/s, and the chassis are interconnected through a pair of external InfiniBand switches (2:1 oversubscribed). Storage is shared between nodes using NFS over IPoIB.

UAHPC Configuration

Dell Blade architecture
Rocks 6.2
CentOS 6.6
SLURM 15.08
2-seat license for Intel Cluster Studio for Linux
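
For newcomers, here is a minimal sketch of an MPI batch script for this setup. The module line, executable name, and resource numbers are placeholders (adjust them to your own code and environment); only sbatch/srun and the partition/QOS names discussed in the next post come from this thread.

    #!/bin/bash
    #SBATCH --job-name=mpi_test       # name shown in squeue
    #SBATCH --nodes=2                 # number of compute nodes
    #SBATCH --ntasks-per-node=16      # MPI ranks per node (e.g. on the 16-core E5-2650 nodes)
    #SBATCH --time=02:00:00           # wall-clock limit
    #SBATCH --partition=main          # partition (see the QOS post below)
    #SBATCH --qos=main                # QOS; not the default, must be requested

    # Load your MPI environment here; the module name is a placeholder.
    # module load intel-cluster-studio

    # srun starts one MPI rank per task; ./my_mpi_app is a placeholder binary.
    srun ./my_mpi_app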



Shahram
Posts: 293
Joined: Sun Feb 05, 2017 8:31 am

Re: UAHPC

Postby Shahram » Wed Apr 19, 2017 5:38 pm

What partitions and QOSes are available?

-p main : This is the partition that we expect to be used most of the time.
-p owners : This is the partition for system stakeholders who own nodes. It carries higher priority, and its jobs may preempt jobs running in the main partition.

--qos main : This is the QOS that we expect to be used most of the time. It has a limit of 24 hours per job so that users with pending jobs can expect theirs to begin in a reasonable time; it has no resource limits other than run length. Users should submit and requeue their jobs in this QOS, and the use of checkpointing is highly advised. Note that this is not the default QOS; you will have to request it explicitly (see the example submission lines after this list).

--qos long : This QOS will run jobs for up to 1 week. It is limited to a total of 170 CPUs and is intended for jobs that cannot easily be set up to checkpoint for the main QOS.

--qos debug : This is a 15-minute debug QOS. You can test a new job here without worrying that it might hang for too long and cause problems for other people. It has a little extra priority so it can sneak in and get your test done.

Additionally, each owner has their own QOS, named after their username or department, e.g.,
--qos math : gives math department users access to 16 cores in the owners partition.
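
To make these combinations concrete, here is a rough sketch of typical submission lines. The script name, time limits, and the particular partition/QOS pairings below are illustrative assumptions, not site policy; check with us if in doubt.

    # quick 15-minute test in the debug QOS
    sbatch -p main --qos debug -t 00:15:00 myjob.sh

    # normal production run: main QOS, under 24 hours, requeue- and checkpoint-friendly
    sbatch -p main --qos main -t 23:59:00 --requeue myjob.sh

    # long-running job that cannot checkpoint: up to one week, shares the 170-CPU pool
    sbatch -p main --qos long -t 7-00:00:00 myjob.sh

    # stakeholder job: higher priority on owned nodes, may preempt main-partition jobs
    sbatch -p owners --qos math myjob.sh

You can list the QOSes available to your account with something like sacctmgr show assoc user=$USER format=user,qos.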




