IISER Mohali Unofficial HPC Guide

Architecture

Read in Order: This page assumes familiarity with basic Linux commands and the PBS scheduler.
1️⃣ Linux Basics 2️⃣ PBS Jobs 3️⃣ Architecture (you are here)

Skipping ahead may make concepts harder to follow. Come back here after completing the first two sections.

What is the HPC?

HPC stands for High Performance Computing. It is a cluster of powerful computers connected over a fast network. Researchers use it to run jobs that are too big or too slow to run on a personal laptop.

You connect to the cluster over SSH, write a PBS script that describes your job (how many cores, how much memory, how long it will run), and submit it. The PBS scheduler then places your job on a suitable compute node and runs it.

Key Points
The cluster uses PBS (Portable Batch System) as the job scheduler.
Jobs run on compute nodes.
The login node is only for light tasks like editing files and submitting jobs.
Resources are shared among all users, so please use them responsibly.

Cluster Architecture

[Diagram: HPC cluster architecture]
Master Node: runs the job scheduler
High-Speed Network: links the master node and the compute nodes
CPU nodes gpc1 to gpc32: 52 cores each
CPU nodes bmc1 to bmc7: 52 cores each
GPU nodes gpu1 to gpu3: 4x Tesla T4 each, 16 GB per GPU

Cluster Summary

Total Compute Nodes: ~39
Cores per CPU Node: 52
Total CPU Cores: ~1872
RAM per CPU Node: 384 GB
GPUs per GPU Node: 4x T4
VRAM per GPU: 16 GB
Node Type | Names | Purpose
Login Nodes | login1, login2 | SSH in, edit files, submit jobs
CPU Compute | gpc1–gpc32, bmc1–bmc7 | CPU batch jobs
GPU Compute | gpu1–gpu3 | GPU accelerated workloads
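
The summary figures can be combined into a few numbers that help when sizing a request. The sketch below is plain shell arithmetic on the values listed in the Cluster Summary. The even per-core split of RAM is an assumption made for illustration, not a limit the scheduler is known to enforce.

```shell
#!/bin/bash
# Values copied from the Cluster Summary.
cores_per_node=52
ram_per_node_gb=384
gpu_nodes=3
gpus_per_node=4
vram_per_gpu_gb=16

# RAM per core if a node's memory were split evenly (illustrative assumption).
ram_per_core_gb=$(awk -v r="$ram_per_node_gb" -v c="$cores_per_node" \
    'BEGIN { printf "%.1f", r / c }')

# GPU totals across the three GPU nodes.
total_gpus=$((gpu_nodes * gpus_per_node))
total_vram_gb=$((total_gpus * vram_per_gpu_gb))

echo "RAM per core: ${ram_per_core_gb} GB"    # 7.4 GB
echo "Total GPUs: ${total_gpus}"              # 12
echo "Total GPU memory: ${total_vram_gb} GB"  # 192 GB
```

On this reading, the even share of a CPU node for a 4-core job is roughly 30 GB of RAM.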

What is a Node and a Core?

A node is one computer inside the cluster. A core is a single processing unit inside that computer. Each node has many cores that can work at the same time.

Example: A node has 52 cores. If you request ppn=4 in your PBS script, your job uses 4 of those cores. The other 48 cores remain free for other users' jobs running on the same node.
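
As a minimal sketch of this example, a job script could request 4 cores on one node as follows. The `nodes=1:ppn=4` form is the Torque-style resource syntax that the ppn=4 example suggests. The job name, the walltime, and the fallback value used outside the scheduler are illustrative assumptions.

```shell
#!/bin/bash
# Illustrative job script: the name and walltime are placeholders.
#PBS -N core_demo
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00

# PBS sets PBS_O_WORKDIR to the directory qsub was run from;
# fall back to the current directory when run outside the scheduler.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# In Torque-style PBS, PBS_NODEFILE holds one line per allocated core.
if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$PBS_NODEFILE" ]; then
    ncores=$(wc -l < "$PBS_NODEFILE")
else
    ncores=4   # assumed value, matches ppn above when testing outside PBS
fi

echo "Cores available to this job: $ncores"
```

Passing `$ncores` to your program's thread or process count keeps the job within the cores it asked for, so the rest of the node stays usable by others.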

What is the Login Node?

When you SSH into the cluster, you land on a login node (login1 or login2). It is a regular computer shared by everyone who is logged in at that moment, and it is meant for light tasks only.

The login node is not meant for running actual computations. For that, you use the PBS scheduler to submit your job to a compute node, as covered in the PBS Jobs section.

✅ Allowed on Login Node | ❌ Not Allowed on Login Node
Editing scripts with nano or vim | Running Python training scripts
Submitting jobs with qsub | Running simulations directly
Checking jobs with qstat | Compiling large codebases
Creating folders and small files | Processing large datasets
Transferring files with scp | Running any heavy program
Important: The login node is shared by all users. If someone runs a heavy job on it, the whole system slows down for everyone. In bad cases it can cause the login node to crash and no one will be able to connect to the cluster until the admin fixes it. Always use the scheduler for any real computation.

Checking Login Node Load with top

You can run top on the login node to see the current CPU and memory usage. If the load is very high, it means someone is running something heavy on the login node and you should inform the admin.

login1 — top
(base) [rk21045@login1 ~]$ top
top - 01:29:18 up 4 days,  7:26,  9 users,  load average: 168.21, 171.62, 171.92
Tasks: 591 total,   5 running, 581 sleeping,   5 stopped,   0 zombie
%Cpu(s): 99.7 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 39487868+total, 10684056 free, 55634912 used, 32855971+buff/cache
KiB Swap: 67108860 total, 67108860 free,        0 used. 33805238+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 61813 ab19032   20   0  105.7g  16.5g   6112 R  1693  4.4  62936:45 l914.exe
 40196 ph17086   20   0  104.2g   1.4g   5704 R  1209  0.4  71422:49 l508.exe
 60007 ph18011   20   0  104.2g  11.8g   6104 R  1167  3.1  62691:54 l914.exe
 43326 ms21099   20   0  104.2g  11.5g   5976 R  1135  3.1  65828:04 l914.exe
  2490 nobody    20   0  176864   1872   1256 S   0.3  0.0   1:52.56 gmond
 61885 rk21045   20   0  173236   2924   1700 R   0.3  0.0   0:00.08 top
146608 ph25042   20   0  185300   2936   1052 S   0.3  0.0   0:10.02 sshd
     1 root      20   0  191684   4764   2636 S   0.0  0.0   0:17.08 systemd
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.51 kthreadd
     4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker

In the output above, the load average is about 168 and CPU usage is 99.7%, both very high. Four users are running programs directly on the login node, each at more than 1000% CPU. Since top counts 100% as one fully busy core, each of these processes is occupying more than ten cores. This is misuse of the login node and slows down the entire system.
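
To put the %CPU column in terms of cores, the values can be summed and divided by 100. The helper below is a small sketch. `cpu_to_cores` is a hypothetical name, and it assumes top is in its default mode where 100% equals one core.

```shell
#!/bin/bash
# cpu_to_cores PCT...: sum top-style %CPU values and print them as cores,
# assuming 100% corresponds to one fully busy core.
cpu_to_cores() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / 100 }'
}

cpu_to_cores 1693                  # prints 16.9 (the top process alone)
cpu_to_cores 1693 1209 1167 1135   # prints 52.0 (the four heavy processes)
```

Together, the four processes in the example keep about 52 cores busy, which is why the node reports 0.0% idle.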

What to look for in top:

%MEM (MOST IMPORTANT)
High memory usage is the biggest danger on the login node.
If any process is using a large amount of memory (e.g. more than 1–2% or several GB), it can crash the node for everyone.

Free memory (the free value on the KiB Mem line)
If this becomes very low, the system may start swapping or freeze.

%CPU per process
Should be very low on login nodes. Values of 100 or more (one full core) indicate misuse, although CPU overload mainly slows the system rather than crashing it.

Load average
Should stay below the node's total core count. A high load means heavy usage, but it is less dangerous than memory exhaustion.

Important:
CPU abuse makes the system slow.
Memory abuse can crash the entire login node.

If you see high memory usage, report it to the system admin immediately.
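
The checks listed above can be sketched as two small shell helpers. `load_status` and `heavy_mem` are hypothetical names, the 2% memory limit follows the rough 1–2% guideline given earlier, and the 52-core count in the example call is an assumption, since the login node's core count is not stated here.

```shell
#!/bin/bash
# load_status LOAD CORES: the load average should stay below the core count.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 >= cores + 0) print "overloaded"; else print "ok"
    }'
}

# heavy_mem LIMIT: read "USER PID %MEM COMMAND" rows on stdin and
# print the rows whose %MEM is above LIMIT.
heavy_mem() {
    awk -v limit="$1" '$3 + 0 > limit + 0 { print $1, $2, $3, $4 }'
}

# Load average from the example output; 52 cores is an assumed count.
load_status 168.21 52    # prints "overloaded"

# Two rows taken from the example output.
printf '%s\n' 'ab19032 61813 4.4 l914.exe' 'nobody 2490 0.0 gmond' | heavy_mem 2
# prints: ab19032 61813 4.4 l914.exe
```

On a login node, live values could be fed in with `load_status "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"` and `ps -eo user,pid,pmem,comm | heavy_mem 2`.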

What are Compute Nodes?

Compute nodes are dedicated machines reserved for running jobs. Each node has its own CPUs, a large amount of RAM (384 GB on each CPU node), and in the case of GPU nodes, graphics cards. You do not SSH into them directly. The PBS scheduler sends your job there automatically after you submit it from the login node.

You can choose which queue to submit your job to depending on how long your job will run and how many cores it needs. Each queue has different limits.

Available Queues

Queue | Max Walltime | Max Cores per User | Use For
default | 8 hours | 200 cores | Quick jobs, testing, short calculations
short | 72 hours (3 days) | 100 cores | Medium length production jobs
long | 1080 hours (45 days) | 100 cores | Long running simulations
infinity | 4380 hours (about 6 months) | 50 cores | Very long calculations, use sparingly
Tip: Start with the default queue when testing your script. Only move to long or infinity once you are sure the job runs correctly. Submitting a broken job to a long queue wastes your allocation and blocks resources for others.
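
One way to read the queue limits is as a lookup: pick the shortest queue whose walltime and core limits both cover the request. The function below is a sketch of that rule. `pick_queue` is a hypothetical helper, and it simplifies by treating the per-user core limit as the cap for a single job.

```shell
#!/bin/bash
# pick_queue HOURS CORES: print the shortest queue that fits the request.
# Each entry is name:max_hours:max_cores, taken from the queue limits listed earlier.
pick_queue() {
    hours=$1
    cores=$2
    for entry in default:8:200 short:72:100 long:1080:100 infinity:4380:50; do
        name=${entry%%:*}
        rest=${entry#*:}
        max_hours=${rest%%:*}
        max_cores=${rest#*:}
        if [ "$hours" -le "$max_hours" ] && [ "$cores" -le "$max_cores" ]; then
            echo "$name"
            return 0
        fi
    done
    echo "no queue fits ${hours} hours on ${cores} cores" >&2
    return 1
}

pick_queue 5 4       # prints "default"
pick_queue 48 16     # prints "short"
pick_queue 1000 100  # prints "long"
```

The result could then be passed to the scheduler, for example `qsub -q "$(pick_queue 48 16)" job.pbs`, where `job.pbs` is a placeholder script name.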