Architecture

What is the HPC?

HPC stands for High Performance Computing. Here it refers to a cluster of powerful computers connected over a fast network. Researchers use it to run jobs that are too big or too slow to run on a personal laptop.

You connect to the cluster over SSH, write a PBS script that describes your job (how many cores, how much memory, how long it will run), and submit it. The PBS scheduler then places your job on a suitable compute node and runs it.
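As a rough illustration of the workflow above, a job script might look like the sketch below. It assumes Torque-style resource syntax (`nodes=1:ppn=4`, matching the `ppn` example used later on this page). The job name, the memory request, the walltime, and the program name `my_program` are hypothetical placeholders, not values taken from this cluster's configuration.

```shell
#!/bin/bash
#PBS -N example_job           # job name (hypothetical)
#PBS -q default               # queue to submit to
#PBS -l nodes=1:ppn=4         # 1 node, 4 cores on that node
#PBS -l mem=4gb               # memory request (assumed value)
#PBS -l walltime=01:00:00     # maximum run time, hh:mm:ss (assumed value)

# PBS sets PBS_O_WORKDIR to the directory where qsub was run.
# Fall back to the current directory if the script is run by hand.
cd "${PBS_O_WORKDIR:-$PWD}"

# Keep the thread count in line with the ppn value requested above.
NCORES=4
export OMP_NUM_THREADS="$NCORES"

echo "Job started on $(uname -n) with ${NCORES} cores"

# Replace with your own program (hypothetical name):
# ./my_program > output.log
```

Saved as, say, `job.pbs`, such a script would be submitted from the login node with `qsub job.pbs` and monitored with `qstat`.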

Key Points
The cluster uses PBS (Portable Batch System) as the job scheduler. Jobs run on compute nodes. The login node is only for light tasks like editing files and submitting jobs. Resources are shared among all users, so please use them responsibly.

Cluster Architecture

HPC Cluster Architecture (diagram)

Access Layer: Master Node (job scheduler)
High-Speed Network
Compute Layer, CPU Nodes: gpc1 to gpc32 and bmc1–7, 52 cores each
Compute Layer, GPU Nodes: gpu1, gpu2, gpu3, each with 4x Tesla T4 (16 GB each)

Cluster Summary

Item	Value
Total Compute Nodes	~39
Cores per CPU Node	52
Total CPU Cores	~1872
RAM per CPU Node	384 GB
GPUs per GPU Node	4x T4
VRAM per GPU	16 GB

Node Type	Names	Purpose
Login Nodes	`login1`, `login2`	SSH in, edit files, submit jobs
CPU Compute	`gpc1–gpc32`, `bmc1–bmc7`	CPU batch jobs
GPU Compute	`gpu1–gpu3`	GPU accelerated workloads

What is a Node and a Core?

A node is one computer inside the cluster. A core is a single processing unit inside that computer. Each node has many cores that can work at the same time.

Example: A node has 52 cores. If you request `ppn=4` in your PBS script, your job uses 4 of those cores. The other 48 cores remain free for other users' jobs running on the same node.
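The arithmetic in this example can be checked with a few lines of shell. This is only a sketch: the variable names are made up for illustration, and the 52-core figure is the one quoted above.

```shell
cores_per_node=52   # cores on one CPU node, as quoted above
ppn=4               # cores requested per node in the PBS script

# Cores left on the node for other jobs
free_cores=$((cores_per_node - ppn))

echo "nodes=1:ppn=${ppn} uses ${ppn} cores and leaves ${free_cores} free"
```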

What is the Login Node?

When you SSH into the cluster, you land on the login node. It is a regular computer shared by everyone who is logged in at that moment. It is meant for light tasks only.

The login node is not meant for running actual computations. For that, you use the PBS scheduler to submit your job to a compute node. You will learn more about this in the PBS section.

✅ Allowed on Login Node	❌ Not Allowed on Login Node
Editing scripts with `nano` or `vim`	Running Python training scripts
Submitting jobs with `qsub`	Running simulations directly
Checking jobs with `qstat`	Compiling large codebases
Creating folders and small files	Processing large datasets
Transferring files with `scp`	Running any heavy program

Important: The login node is shared by all users. If someone runs a heavy job on it, the whole system slows down for everyone. In bad cases it can cause the login node to crash and no one will be able to connect to the cluster until the admin fixes it. Always use the scheduler for any real computation.

Checking Login Node Load with `top`

You can run `top` on the login node to see the current CPU and memory usage. If the load is very high, someone is probably running something heavy on the login node and you should inform the admin.

login1: top

(base) [rk21045@login1 ~]$ top
top - 01:29:18 up 4 days,  7:26,  9 users,  load average: 168.21, 171.62, 171.92
Tasks: 591 total,   5 running, 581 sleeping,   5 stopped,   0 zombie
%Cpu(s): 99.7 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 39487868+total, 10684056 free, 55634912 used, 32855971+buff/cache
KiB Swap: 67108860 total, 67108860 free,        0 used. 33805238+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 61813 ab19032   20   0  105.7g  16.5g   6112 R  1693  4.4  62936:45 l914.exe
 40196 ph17086   20   0  104.2g   1.4g   5704 R  1209  0.4  71422:49 l508.exe
 60007 ph18011   20   0  104.2g  11.8g   6104 R  1167  3.1  62691:54 l914.exe
 43326 ms21099   20   0  104.2g  11.5g   5976 R  1135  3.1  65828:04 l914.exe
  2490 nobody    20   0  176864   1872   1256 S   0.3  0.0   1:52.56 gmond
 61885 rk21045   20   0  173236   2924   1700 R   0.3  0.0   0:00.08 top
146608 ph25042   20   0  185300   2936   1052 S   0.3  0.0   0:10.02 sshd
     1 root      20   0  191684   4764   2636 S   0.0  0.0   0:17.08 systemd
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.51 kthreadd
     4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker

In the output above, the 1-minute load average is about 168 and CPU usage is 99.7%. This is very high. You can also see that four users are running jobs directly on the login node with CPU usage above 1000% each. In `top`, 100% corresponds to one fully busy core, so each of these processes is using more than 10 cores. This is wrong and will slow down the entire system.

What to look for in top:

%MEM (MOST IMPORTANT)
High memory usage is the biggest danger on the login node.
If any process is using a lot of memory (for example more than 1–2% of RAM, or several GB), it can crash the node for everyone.

Free memory (KiB Mem free)
If this becomes very low, the system may start swapping or freeze.

%CPU per process
Should be very low on login nodes. Values of 100% or more indicate misuse, but CPU overload mainly slows the system down rather than crashing it.

Load average
Should be less than the total number of cores on the node. A high load means heavy usage, but it is less dangerous than memory exhaustion.

Important:
CPU abuse makes the system slow.
Memory abuse can crash the entire login node.

If you see high memory usage, report it to the system admin immediately.
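The checks described above can also be scripted. The sketch below assumes a Linux login node with `/proc/loadavg`, `nproc`, and the procps version of `ps`. The function name `load_status` is made up for illustration and is not a cluster tool.

```shell
# Compare a load average against a core count.
# Prints "high" when the load exceeds the core count, otherwise "ok".
load_status() {
  awk -v l="$1" -v c="$2" 'BEGIN { if (l + 0 > c + 0) print "high"; else print "ok" }'
}

# The first field of /proc/loadavg is the 1-minute load average.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
  read -r load1 _ < /proc/loadavg
  cores=$(nproc)
  echo "load ${load1} on ${cores} cores: $(load_status "$load1" "$cores")"
fi

# Header plus the five processes using the most memory.
ps -eo user,pid,pmem,rss,comm --sort=-pmem 2>/dev/null | head -n 6
```

For instance, `load_status 168.21 52` prints `high`, which matches the overloaded state in the `top` output shown earlier, assuming a 52-core node.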

What are Compute Nodes?

Compute nodes are dedicated machines reserved for running jobs. Each node has its own CPUs, large RAM, and in the case of GPU nodes, graphics cards. You do not SSH into them directly. The PBS scheduler sends your job there automatically after you submit it from the login node.

You can choose which queue to submit your job to depending on how long your job will run and how many cores it needs. Each queue has different limits.

Available Queues

Queue	Max Walltime	Max Cores per User	Use For
`default`	8 hours	200 cores	Quick jobs, testing, short calculations
`short`	72 hours (3 days)	100 cores	Medium length production jobs
`long`	1080 hours (45 days)	100 cores	Long running simulations
`infinity`	4380 hours (about 6 months)	50 cores	Very long calculations, use sparingly

Tip: Start with the `default` queue when testing your script. Only move to `long` or `infinity` once you are sure the job runs correctly. Submitting a broken job to a long queue wastes your allocation and blocks resources for others.
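One way to read the queue table is to pick the first queue whose walltime limit covers your job. The helper below is a sketch of that rule only. The function name `pick_queue` is hypothetical, it takes whole hours, and it ignores the per-user core limits, which you still need to check against the table.

```shell
# Print the shortest queue whose walltime limit covers the requested hours.
# Limits are taken from the queue table above.
pick_queue() {
  hours=$1
  if   [ "$hours" -le 8 ];    then echo default
  elif [ "$hours" -le 72 ];   then echo short
  elif [ "$hours" -le 1080 ]; then echo long
  elif [ "$hours" -le 4380 ]; then echo infinity
  else
    echo "no queue allows ${hours} hours" >&2
    return 1
  fi
}

pick_queue 48
```

The chosen queue can then be set in the job script with `#PBS -q` or on the command line, for example `qsub -q short job.pbs` for a hypothetical script named `job.pbs`.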
