IISER Mohali High Performance Computing Cluster
High-Performance Computing (HPC) enables researchers to solve complex computational problems that would be impractical on regular computers. The IISER Mohali HPC cluster provides shared computing resources for the research community.
A node is an individual computer within the cluster. Think of each node as a separate, powerful workstation. The cluster consists of multiple nodes connected via a high-speed network.
┌─────────────────────────────────────────────────────────────────┐
│                           HPC CLUSTER                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                  │
│   │  Login   │    │  Login   │    │  Master  │                  │
│   │  Node 1  │    │  Node 2  │    │  Nodes   │                  │
│   └────┬─────┘    └────┬─────┘    └────┬─────┘                  │
│        │               │               │                        │
│        └───────────────┴───────────────┘                        │
│                        │                                        │
│            ┌───────────┴───────────┐                            │
│            │  High-Speed Network   │                            │
│            └───────────┬───────────┘                            │
│                        │                                        │
│       ┌────────────────┼────────────────┐                       │
│       │                │                │                       │
│       ▼                ▼                ▼                       │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐                 │
│  │ Compute  │     │ Compute  │ ... │ Compute  │                 │
│  │  Node    │     │  Node    │     │  Node    │                 │
│  │  (gpc1)  │     │  (gpc2)  │     │ (gpc32)  │                 │
│  │          │     │          │     │          │                 │
│  │ 52 cores │     │ 52 cores │     │ 52 cores │                 │
│  └──────────┘     └──────────┘     └──────────┘                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
| Node Type | Names | Purpose |
|---|---|---|
| Login Nodes | login1, login2 | Where you log in, edit files, and submit jobs |
| CPU Compute Nodes | gpc1-gpc32, bmc1-bmc7 | Where your CPU jobs run |
| GPU Compute Nodes | gpu1-gpu3 | For GPU-accelerated computations |
A core (also called CPU core or processor) is an individual processing unit within a node. Each node contains multiple cores that can work independently or together.
Example: If a node has 52 cores and you request ppn=4,
your job will use 4 cores, leaving 48 cores available for other users.
If your program is NOT parallelized (single-threaded), you MUST use:
#PBS -l nodes=1:ppn=1
Requesting more cores than your program can use is RESOURCE THEFT from other researchers!
| Component | Specification |
|---|---|
| Total Compute Nodes | ~39 nodes (gpc1-gpc32, bmc1-bmc7, gpu1-gpu3) |
| Cores per Node (typical) | 52 cores (2 × Intel Xeon Gold 6230R) |
| Total CPU Cores | ~1872 cores |
| Memory per Node | ~384 GB (CPU nodes) |
| GPUs per GPU Node | 4 × NVIDIA Tesla T4 (16 GB, PCIe) |
The HPC cluster is a SHARED RESOURCE funded by the institute for the entire research community. Wasting resources directly impacts your colleagues' research and is considered a SERIOUS VIOLATION of usage policy.
Misusing resources may result in account suspension!
| Violation | Problem | Correct Approach |
|---|---|---|
| Requesting more cores than your program uses | Idle cores are blocked from other users | Request only what you need |
| Running single-threaded code with ppn=40 | 39 cores sit completely idle while blocked | Use ppn=1 for serial jobs |
| Requesting entire nodes unnecessarily | Blocks all 52 cores on that node | Request specific core count needed |
| Setting excessive walltime | Resources reserved but unused | Estimate realistic runtime + small buffer |
| Running jobs on login nodes | Slows down system for everyone | Always use job submission (qsub) |
| Not cleaning up scratch space | Fills shared storage | Delete temporary files after job completes (see the sketch below) |
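For the last row, a job script can create and clean up its own temporary directory. The sketch below is only illustrative, and the /gscratch path is an assumption; use whatever scratch location applies to your group:

# Illustrative scratch cleanup (the path is an assumed example, adapt it)
SCRATCH_DIR=/gscratch/$USER/$PBS_JOBID    # hypothetical per-job scratch directory
mkdir -p "$SCRATCH_DIR"
# ... run your program, writing temporary files into $SCRATCH_DIR ...
rm -rf "$SCRATCH_DIR"                     # remove temporaries once the job is done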
If your code is NOT parallelized → Use nodes=1:ppn=1
Most Python scripts, serial Fortran/C programs, and simple simulations are NOT parallel!
Before requesting multiple cores, TEST your program to see actual CPU usage:
# Step 1: Get an interactive session
qsub -I -l nodes=1:ppn=4 -q short
# Step 2: Once on the compute node, run your program in background
cd /path/to/your/project
./your_program &
# Step 3: Check CPU usage with 'top'
top -u $USER
# Step 4: Look at %CPU column:
# ~100% = using 1 core → use ppn=1
# ~200% = using 2 cores → use ppn=2
# ~400% = using 4 cores → use ppn=4
# ~5200% = using all 52 → use ppn=52
# Step 5: If %CPU is ~100%, YOUR CODE IS NOT PARALLEL!
# Use ppn=1 only!
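If you prefer a non-interactive cross-check of the top method, standard ps options give a rough picture; interpret the numbers the same way as in Step 4 above:

# Rough alternative to 'top': CPU% and thread count (NLWP) per process
ps -u $USER -o pid,pcpu,nlwp,comm
# pcpu is the average CPU% over the process lifetime; ~100 means roughly one core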
Requesting more cores does NOT make your program faster!
If your program is not written to use multiple cores (parallel programming), it will only use 1 core regardless of how many you request. The other cores will sit idle while being blocked from other users.
Jobs are submitted to queues based on their expected runtime and resource requirements. Each queue has different time limits, core limits, and priorities.
| Queue Name | Max Walltime | Max Cores/User | Typical Use Case |
|---|---|---|---|
| `default` | 8 hours | 200 cores | Quick jobs, testing, short calculations |
| `short` | 72 hours (3 days) | 100 cores | Medium-length production jobs |
| `long` | 1080 hours (45 days) | 100 cores | Long-running simulations |
| `infinity` | 4380 hours (~6 months) | 50 cores | Extended calculations (use sparingly!) |
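For example, reading the table above: a run you expect to take about five days (120 hours) no longer fits the short queue (72-hour limit), so it belongs in long:

#PBS -q long
#PBS -l walltime=120:00:00   # 5 days: above the 72 h 'short' limit, well below the 1080 h 'long' limit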
| Queue Name | Description |
|---|---|
| `gpushort` | Short GPU jobs |
| `gpulong` | Long GPU jobs |
# View detailed information about all queues
qstat -Qf
# Example output (partial):
Queue: long
queue_type = Execution
total_jobs = 48
state_count = Transit:0 Queued:1 Held:12 Waiting:0 Running:35 Exiting:0
resources_max.ncpus = 416
resources_max.walltime = 1080:00:00
max_user_run = 100
max_user_res.ncpus = 100
enabled = True
started = True
Use the `default` queue for testing, and use `infinity` only when absolutely necessary.
A PBS job script is a shell script with special #PBS directives that tell
the scheduler what resources you need.
#!/bin/bash
#===============================================
# PBS DIRECTIVES - Resource requests
#===============================================
#PBS -N my_job_name # Job name (appears in qstat)
#PBS -l nodes=1:ppn=1 # 1 node, 1 core (for serial jobs!)
#PBS -l walltime=08:00:00 # Maximum runtime: 8 hours
#PBS -q default # Queue name
#PBS -o output.log # Standard output file
#PBS -e error.log # Standard error file
#===============================================
# CHANGE TO WORKING DIRECTORY
#===============================================
cd $PBS_O_WORKDIR
#===============================================
# LOAD REQUIRED MODULES
#===============================================
module load anaconda3
#===============================================
# RUN YOUR PROGRAM
#===============================================
python my_script.py
Unless you KNOW your code is parallelized, always use:
#PBS -l nodes=1:ppn=1
| Directive | Description | Example |
|---|---|---|
| `#PBS -N` | Job name (max 15 characters recommended) | `#PBS -N simulation01` |
| `#PBS -l nodes=X:ppn=Y` | Request X nodes with Y processors per node | `#PBS -l nodes=1:ppn=1` |
| `#PBS -l walltime=` | Maximum job runtime (HH:MM:SS) | `#PBS -l walltime=08:00:00` |
| `#PBS -q` | Queue selection | `#PBS -q long` |
| `#PBS -o` | Output file path | `#PBS -o logs/output.log` |
| `#PBS -e` | Error file path | `#PBS -e logs/error.log` |
| `#PBS -j oe` | Join output and error into one file | `#PBS -j oe` |
| `#PBS -M` | Email address for notifications | `#PBS -M user@iisermohali.ac.in` |
| `#PBS -m` | When to send email (a=abort, b=begin, e=end) | `#PBS -m abe` |
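Putting several of these directives together, a job-script header might look like the sketch below (the job name, log path, and email address are placeholders, not site conventions):

#!/bin/bash
#PBS -N relaxation_run              # placeholder job name
#PBS -l nodes=1:ppn=1               # serial job: 1 core only
#PBS -l walltime=24:00:00           # 24 hours
#PBS -q short
#PBS -j oe                          # merge stderr into stdout
#PBS -o logs/relaxation_run.log     # combined log file (path is illustrative)
#PBS -M user@iisermohali.ac.in      # replace with your own address
#PBS -m abe                         # mail on abort, begin, and end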
Walltime is the maximum time your job is allowed to run. Format: HH:MM:SS
# Examples of walltime settings
#PBS -l walltime=01:00:00 # 1 hour
#PBS -l walltime=08:00:00 # 8 hours (max for 'default' queue)
#PBS -l walltime=72:00:00 # 72 hours / 3 days (max for 'short' queue)
#PBS -l walltime=168:00:00 # 168 hours (7 days)
#PBS -l walltime=720:00:00 # 720 hours (30 days)
#PBS -l walltime=1080:00:00 # 1080 hours / 45 days (max for 'long' queue)
If your job exceeds the walltime, it will be killed immediately without saving any progress. Always add a buffer to your estimated runtime, and implement checkpointing for long jobs.
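Checkpointing itself has to come from your program. As a hedged sketch, if your code can periodically save its state and restart from it (the state.chk filename and --resume flag below are purely hypothetical), the job script can resume automatically after a resubmission:

# Illustrative restart logic; 'state.chk' and '--resume' are hypothetical,
# adapt them to whatever checkpoint mechanism your program provides.
cd $PBS_O_WORKDIR
if [ -f state.chk ]; then
    echo "Checkpoint found, resuming previous run"
    ./my_program --resume state.chk
else
    echo "No checkpoint found, starting from scratch"
    ./my_program
fi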
# Method 1: Separate output and error files
#PBS -o output.log
#PBS -e error.log
# Method 2: Combined output and error
#PBS -o combined.log
#PBS -j oe
# Method 3: With full path (recommended)
#PBS -o /persistent/data1/username/project/logs/job_output.log
#PBS -e /persistent/data1/username/project/logs/job_error.log
# Method 4: Include job ID in filename (in your script)
exec > "$PBS_O_WORKDIR/output_$PBS_JOBID.log" 2>&1
Targeting specific nodes is generally NOT recommended: your job then has to wait for that particular node even when others are free, and you bypass the scheduler's load balancing. Let the scheduler choose a node automatically unless you have a specific reason (e.g., GPU nodes, or specific software installed on certain nodes).
# RECOMMENDED: Let scheduler choose automatically
#PBS -l nodes=1:ppn=1
# Target a specific node (use only if necessary)
#PBS -l nodes=gpc30:ppn=4
# Request multiple specific nodes
#PBS -l nodes=gpc30:ppn=4+gpc31:ppn=4
# Request GPU nodes
#PBS -l nodes=gpu1:ppn=4
Sometimes a node may show as "free" but still have problems. For example,
gpc25 in the long queue has been known to cause job failures
even when pbsnodes shows it as available.
If your job keeps failing on a specific node, delete it, resubmit without targeting that node (or target a different one), and follow the troubleshooting steps later in this guide.
# Submit a job script
qsub job.sh
# Output example:
430105.iisermhpc1
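Since qsub prints the job ID to standard output, you can capture it for later monitoring or deletion, for example:

# Capture the job ID at submission time and reuse it
JOBID=$(qsub job.sh)
echo "Submitted job: $JOBID"
qstat -f "$JOBID"      # inspect it later, or cancel with: qdel "$JOBID"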
Interactive jobs give you a shell on a compute node for testing and debugging.
# Basic interactive job (1 core)
qsub -I -l nodes=1:ppn=1 -q default
# Interactive job with specific walltime
qsub -I -l nodes=1:ppn=4 -l walltime=02:00:00 -q short
# Interactive job on specific node (if needed)
qsub -I -l nodes=gpc30:ppn=1 -q long
# Delete a specific job
qdel 430105.iisermhpc1
# Delete multiple jobs
qdel 430105.iisermhpc1 430106.iisermhpc1
# Delete all your jobs (be careful!)
qdel $(qstat -u $USER | grep $USER | awk '{print $1}')
# View all your jobs
qstat -u $USER
# Example:
qstat -u ms21080
# Sample output:
Job ID             Name        User     Time   S  Queue
-----------------  ----------  -------  -----  -  -------
430035.iisermhpc1  simulation  ms21080  03:52  R  long
430036.iisermhpc1  test_job    ms21080  02:45  R  default
430045.iisermhpc1  analysis    ms21080  00:00  Q  short
430103.iisermhpc1  e04         ms21080  00:00  H  long
# Status codes:
# R = Running
# Q = Queued (waiting to start)
# H = Held (job has been held, check why)
# E = Exiting
# C = Completed
# Get detailed information about a specific job
qstat -f 430103.iisermhpc1
# Example output:
Job Id: 430103.iisermhpc1
Job_Name = e04
Job_Owner = ms21080@login1
job_state = H
queue = long
server = iisermhpc1
...
Resource_List.nodes = gpc25:ppn=2
Resource_List.walltime = 1080:00:00
...
comment = job held, too many failed attempts to run
run_count = 21
Exit_status = -3
...
# Key fields to look for:
# - job_state: Current status (R, Q, H, etc.)
# - exec_host: Which node(s) the job is running on
# - resources_used: CPU time, memory, walltime used
# - Resource_List: What was requested
# - comment: Error messages or hold reasons
# - Exit_status: Exit code (0 = success, non-zero = error)
# - run_count: How many times job tried to run
# View all jobs in the cluster
qstat
# View all queues with their status
qstat -Q
# View detailed queue information
qstat -Qf
Before submitting jobs (especially to specific nodes), you should check what resources are available. This helps you choose the right queue and avoid problematic nodes.
# View status of a specific node
pbsnodes gpc25
# Example output:
gpc25
Mom = gpc25
ntype = PBS
state = free
pcpus = 52
resources_available.arch = linux
resources_available.host = gpc25
resources_available.mem = 394860832kb
resources_available.ncpus = 52
resources_available.vnode = gpc25
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = long
resv_enable = True
sharing = default_shared
last_state_change_time = Thu Jan 29 00:43:49 2026
last_used_time = Fri Jan 9 12:07:53 2026
# Key fields:
# state = free → Node is available
# state = job-exclusive → All cores in use
# state = offline → Node is not available
# state = down → Node has problems
# pcpus = 52 → Total cores on node
# resources_assigned.ncpus = 0 → Currently used cores
# queue = long → Which queue this node belongs to
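For a quick one-line-per-node overview built from the same fields, a small awk sketch:

# Print every node together with its current state
pbsnodes -a | awk '/^[a-z]/ {node=$1} $1 == "state" {print node, $3}'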
A node may show state = free but still have issues. For example,
network problems or filesystem mount issues can cause jobs to fail even on
"free" nodes. If your job keeps failing on a specific node, try a different one!
# View status of all nodes
pbsnodes -a
Use this script to find which nodes have free cores in a specific queue. This is very useful before submitting jobs!
pbsnodes -a | awk '
function report() {                             # summarize the node block just finished
    if (node != "" && queue == "long") {        # change "long" to check another queue
        free = total - used
        printf "%-8s : %3d free cores\n", node, free
        sum += free
        if (free == total) fullfree++
    }
}
/^[a-z]/ { report(); node=$1; total=0; used=0; queue="" }   # unindented line = new node
$1 == "pcpus"                    { total = $3 }             # total cores on the node
$1 == "resources_assigned.ncpus" { used = $3 }              # cores currently assigned
$1 == "queue"                    { queue = $3 }             # queue the node belongs to
END {
    report()                                                # report the last node too
    print "------------------------"
    print "Total free cores =", sum + 0
    print "Fully free nodes =", fullfree + 0
}'
gpc27 : 46 free cores
gpc28 : 20 free cores
gpc29 : 12 free cores
gpc30 : 49 free cores
gpc31 : 52 free cores
gpc32 : 20 free cores
gpc25 : 52 free cores
------------------------
Total free cores = 251
Fully free nodes = 2
Once you have this output, you can target a node that has plenty of free cores, for example:

#PBS -l nodes=gpc31:ppn=4

To check a different queue, change the queue name in the script above:

# For the 'short' queue    - use queue == "short"
# For the 'default' queue  - use queue == "default"
# For the 'infinity' queue - use queue == "infinity"
# Add this to your ~/.bashrc file for easy access
checkfree() {
    local queue=${1:-long}
    echo "=== Free cores in '$queue' queue ==="
    pbsnodes -a | awk -v q="$queue" '
    function report() {
        if (node != "" && nodeq == q) {
            free = total - used
            if (free > 0) printf "%-8s : %3d free cores\n", node, free
            sum += free
            if (free == total) fullfree++
        }
    }
    /^[a-z]/ { report(); node=$1; total=0; used=0; nodeq="" }
    $1 == "pcpus"                    { total = $3 }
    $1 == "resources_assigned.ncpus" { used = $3 }
    $1 == "queue"                    { nodeq = $3 }
    END {
        report()
        print "------------------------"
        print "Total free cores =", sum + 0
        print "Fully free nodes =", fullfree + 0
    }'
}
# After adding to ~/.bashrc, reload it:
source ~/.bashrc
# Now you can use:
checkfree long
checkfree short
checkfree default
checkfree infinity
checkfree gpushort
checkfree gpulong
Once you have identified the free nodes, and you know that a specific node is NOT working (like gpc25 in the example), target a different node or let the scheduler choose automatically:
# Option 1: Let scheduler choose (SAFEST)
#PBS -l nodes=1:ppn=1
# Option 2: Target a known working node
#PBS -l nodes=gpc31:ppn=1
Remember: Use ppn=1 unless your code is actually parallelized!
#!/bin/bash
#PBS -N python_analysis
#PBS -l nodes=1:ppn=1 # ONLY 1 CORE for serial job!
#PBS -l walltime=04:00:00
#PBS -q default
#PBS -o python_out.log
#PBS -e python_err.log
cd $PBS_O_WORKDIR
module load anaconda3
echo "Job started at: $(date)"
echo "Running on node: $(hostname)"
python analysis.py
echo "Job finished at: $(date)"
#!/bin/bash
#PBS -N openmp_sim
#PBS -l nodes=1:ppn=16 # 16 cores on 1 node
#PBS -l walltime=48:00:00
#PBS -q short
#PBS -o omp_output.log
#PBS -e omp_error.log
cd $PBS_O_WORKDIR
# Set OpenMP threads to match requested cores
export OMP_NUM_THREADS=16
echo "Using $OMP_NUM_THREADS threads"
./my_openmp_program
#!/bin/bash
#PBS -N mpi_simulation
#PBS -l nodes=4:ppn=20 # 4 nodes, 20 cores each = 80 total
#PBS -l walltime=72:00:00
#PBS -q short
#PBS -o mpi_output.log
#PBS -e mpi_error.log
cd $PBS_O_WORKDIR
module load openmpi-4.1.0
# Calculate total processes
NPROCS=$(wc -l < $PBS_NODEFILE)
echo "Running on $NPROCS processors"
mpirun -np $NPROCS ./my_mpi_program
#!/bin/bash
#PBS -N geant4_sim
#PBS -l nodes=1:ppn=2
#PBS -l walltime=168:00:00
#PBS -q long
#PBS -o geant4_out.log
#PBS -e geant4_err.log
cd $PBS_O_WORKDIR
# Load Geant4
module load codes/geant4/11.1
# Set Geant4 data paths
export G4ENSDFSTATEDATA=/gscratch/apps/root/geant4/install/share/Geant4/data/G4ENSDFSTATE2.3
export G4LEVELGAMMADATA=/gscratch/apps/root/geant4/install/share/Geant4/data/PhotonEvaporation5.7
export G4LEDATA=/gscratch/apps/root/geant4/install/share/Geant4/data/G4EMLOW8.2
export G4PARTICLEXSDATA=/gscratch/apps/root/geant4/install/share/Geant4/data/G4PARTICLEXS4.0
# Record timing
START=$(date +%s)
echo "Started at: $(date)"
./sim run.mac
END=$(date +%s)
echo "Finished at: $(date)"
echo "Duration: $((END-START)) seconds"
#!/bin/bash
# Use this when you've checked available nodes and want to avoid a problematic one
# First run: checkfree long (see Section 8)
# Then choose a working node from the output
#PBS -N my_simulation
#PBS -l nodes=gpc31:ppn=1 # Targeting gpc31 specifically
#PBS -l walltime=720:00:00
#PBS -q long
#PBS -o output.log
#PBS -e error.log
cd $PBS_O_WORKDIR
echo "Running on node: $(hostname)"
./my_program
| Possible Cause | Solution |
|---|---|
| Requested resources not available | Use the free cores checking script to find available resources, then reduce core/node request |
| Targeting a busy node | Remove specific node requirement, let scheduler choose |
| Queue is full | Wait, or try a different queue |
| Exceeded user limits | Check qstat -Qf for max_user_res.ncpus limits (see the example below) |
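For the last row above, one way to inspect the per-user limits on a queue (here long) is:

# Show the limits enforced on the 'long' queue
qstat -Qf long | grep -E "max_user_run|max_user_res|resources_max"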
# Check why job is held
qstat -f JOB_ID | grep -E "(comment|run_count|Exit_status)"
# Example output:
comment = job held, too many failed attempts to run
run_count = 21
Exit_status = -3
# This means the node has issues! Steps to fix:
# 1. Delete the held job
qdel JOB_ID
# 2. Check which nodes are free (see Section 8)
# 3. Either let scheduler choose or pick a different node
# 4. Resubmit
qsub job.sh
# Check exit status
qstat -f JOB_ID | grep Exit_status
# Common exit codes:
# 0 = Success
# 1 = General error in your program
# -3 = Job couldn't start (node/environment issue)
# 137 = Killed (memory limit exceeded or SIGKILL)
# 265 = Walltime exceeded
# Check error log
cat error.log
# Check the node status
pbsnodes gpc25
# Even if state = free, the node might have issues!
# Check last_used_time - if it's old, node might be problematic
# Solution: Use a different node or let scheduler choose
#PBS -l nodes=1:ppn=1 # Let scheduler choose
# OR
#PBS -l nodes=gpc30:ppn=1 # Target known working node
Before submitting a long job, always test with a short run first: run a small test case in the `default` queue with a short walltime, and only then submit the full job to `long` or `infinity`, as in the sketch below.
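A minimal sketch of that workflow, assuming your script is job.sh (resource options given on the qsub command line take precedence over the #PBS directives inside the script):

# 1. Quick test run: same script, small walltime, 'default' queue
qsub -l walltime=00:30:00 -q default job.sh
# 2. Once the test finishes cleanly, submit the full production run
qsub -l walltime=720:00:00 -q long job.sh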
| Command | Description |
|---|---|
| `qsub job.sh` | Submit a job script |
| `qsub -I -l nodes=1:ppn=1 -q default` | Start an interactive session |
| `qdel JOB_ID` | Delete/cancel a job |
| `qhold JOB_ID` | Hold a queued job |
| `qrls JOB_ID` | Release a held job |
| Command | Description |
|---|---|
| `qstat` | Show all jobs in the queue |
| `qstat -u $USER` | Show only your jobs |
| `qstat -f JOB_ID` | Detailed job information |
| `qstat -Q` | Show queue summary |
| `qstat -Qf` | Detailed queue information |
| Command | Description |
|---|---|
| `pbsnodes -a` | Show the status of all nodes |
| `pbsnodes NODE_NAME` | Show the status of a specific node |
| `checkfree long` | Check free cores in the long queue (after adding the function to ~/.bashrc) |
| Command | Description |
|---|---|
| `module avail` | List available modules |
| `module load NAME` | Load a module |
| `module unload NAME` | Unload a module |
| `module list` | Show loaded modules |
| `module purge` | Unload all modules |
| Variable | Description |
|---|---|
| `$PBS_O_WORKDIR` | Directory where qsub was executed |
| `$PBS_JOBID` | Unique job identifier |
| `$PBS_NODEFILE` | File containing the list of assigned nodes |
| `$PBS_JOBNAME` | Name of the job |
| `$PBS_QUEUE` | Queue the job is running in |
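These variables are also handy for logging; a small sketch that records them at the top of a job script:

# Record PBS job information at the start of the job for easier debugging
echo "Job     : $PBS_JOBNAME ($PBS_JOBID)"
echo "Queue   : $PBS_QUEUE"
echo "Workdir : $PBS_O_WORKDIR"
echo "Nodes   : $(sort -u $PBS_NODEFILE | tr '\n' ' ')"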
Remember to include cd $PBS_O_WORKDIR in your script, and to use ppn=1 for serial (non-parallel) jobs.

For technical issues, contact the HPC support team at: helpdesk-hpc@iisermohali.ac.in
For community discussions: hpc-community@iisermohali.ac.in