GPU & Sessions | IISER Mohali Unofficial HPC Guide

Critical: GPU Access Only Works from login1
When you SSH into the cluster, you may land on login1 or login2. GPU node access only works from login1. If you're on login2 and try to request a GPU node, the command will hang or fail.

Check which login node you're on: hostname
If you see login2: exit and SSH again until you land on login1.
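The check above can be wrapped in a small helper for your shell startup file. This is a minimal sketch: the function name `is_login1` is hypothetical, and it assumes the node reports its hostname as `login1` or `login1.<domain>`.

```shell
# Succeeds only when the given hostname (default: this machine's) is login1.
# Assumption: the node reports itself as "login1" or "login1.<domain>".
is_login1() {
    case "${1:-$(hostname)}" in
        login1|login1.*) return 0 ;;
        *)               return 1 ;;
    esac
}
```

For example, `is_login1 || echo "Not on login1: exit and SSH again"` prints a reminder whenever you land on another login node.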

How to Access GPU Nodes

Use an interactive PBS session to get a shell on a GPU node. This works only from login1.

# From login1 only — will NOT work from login2
[user@login1 ~]$ qsub -I -q gpulong

Queue	GPU Nodes	Current Status	Use For
`gpushort`	gpu1	⚠️ Currently down (may be revived)	Quick tests, short jobs
`gpulong`	gpu2, gpu3	✅ Active	Training jobs, production runs

What happens: After running the command, you'll be dropped directly into a shell on a GPU node (e.g., gpu2 or gpu3). Your prompt will change to something like [user@gpu2 ~]$.

Target a Specific GPU Node

If you want to request a specific GPU node (e.g., gpu2 or gpu3), add the -l host= flag:

# Request gpu2 specifically
[user@login1 ~]$ qsub -I -q gpulong -l host=gpu2

# Request gpu3 specifically
[user@login1 ~]$ qsub -I -q gpulong -l host=gpu3

💡 Use this when you know a particular node is free (for example, from an earlier nvidia-smi check on that node) or when you need to return to the node where your tmux session is running.

Small Jobs vs Long Jobs

Job Type	Recommended Approach
Small/quick jobs (< 30 min, testing, debugging)	Just run directly in the interactive shell. Keep your terminal open. If it disconnects, you can restart quickly.
Long jobs (training, production runs)	Always use tmux so your job survives SSH disconnects, terminal closes, or network drops.

What is tmux?

tmux is a terminal multiplexer. It lets you create sessions that keep running even after you disconnect. You can reattach later to check progress or interact with your job.

tmux may not be pre-installed: If tmux is not available on your account, you'll need to install it yourself in your home directory. Search online for "install tmux without root" or ask ChatGPT for step-by-step help.

tmux Quick Start

1. Start a new tmux session

[user@gpu2 ~]$ tmux new -s my_training

You'll see a green status bar at the bottom. You're now inside the tmux session.

2. Run your code

[user@gpu2 ~]$ python train.py
# or whatever command you need

3. Detach safely (job keeps running!)

Press Ctrl + B, release both, then press D.

# You'll see: [detached]
[user@gpu2 ~]$ # back to normal shell, job still runs in tmux

4. Reattach later to check progress

[user@gpu2 ~]$ tmux attach -t my_training

5. List or kill sessions

tmux ls                    # list all your tmux sessions
tmux kill-session -t my_training  # end a session when done
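Steps 1 and 4 can be folded into one helper that reattaches when the session already exists and creates it otherwise. This is a sketch rather than a required setup: the function name `tmux_attach_or_new` is hypothetical, and it relies only on the standard `tmux has-session`, `tmux attach`, and `tmux new` commands.

```shell
# Attach to the named tmux session if tmux reports it exists; otherwise create it.
tmux_attach_or_new() {
    session="${1:?usage: tmux_attach_or_new SESSION_NAME}"
    if tmux has-session -t "$session" 2>/dev/null; then
        tmux attach -t "$session"
    else
        tmux new -s "$session"
    fi
}
```

Running `tmux_attach_or_new my_training` on the GPU node then works both the first time and on every later visit.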

Full Workflow Example

# Step 1: From login1, get interactive GPU access
[user@login1 ~]$ qsub -I -q gpulong
qsub: waiting for job 12345.iisermhpc1 to start
qsub: job 12345.iisermhpc1 ready

# Step 2: You're now on gpu2
[user@gpu2 ~]$ tmux new -s resnet_train

# Step 3: Inside tmux, run your job
[user@gpu2 ~]$ python train_resnet.py --epochs 100
Epoch 1/100 - loss: 2.341 - acc: 0.12
Epoch 2/100 - loss: 1.987 - acc: 0.24
...

# Step 4: Detach (Ctrl+B, then D)
# [detached]

# Step 5: Later, reattach to check
[you@laptop ~]$ ssh user@hpc.iisermohali.ac.in
[user@login1 ~]$ qsub -I -q gpulong -l host=gpu2  # request the node that holds your tmux session
[user@gpu2 ~]$ tmux attach -t resnet_train
Epoch 47/100 - loss: 0.412 - acc: 0.86
...
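A tmux session lives on the node where it was started, so reattaching means requesting that same node again. One way to avoid forgetting which node that was is to record it. The sketch below makes two assumptions: the file name `~/.last_gpu_node` and both function names are hypothetical, and your home directory is shared between the login and GPU nodes.

```shell
# Hypothetical location for the recorded node name; override by setting GPU_NODE_FILE.
GPU_NODE_FILE="${GPU_NODE_FILE:-$HOME/.last_gpu_node}"

# Run on the GPU node before starting tmux: record this node's hostname.
remember_gpu_node() {
    hostname > "$GPU_NODE_FILE"
}

# Run on login1 later: print the qsub command that targets the recorded node.
resume_gpu_cmd() {
    [ -s "$GPU_NODE_FILE" ] || { echo "no recorded GPU node" >&2; return 1; }
    node="$(cut -d. -f1 "$GPU_NODE_FILE")"   # drop any domain suffix
    echo "qsub -I -q gpulong -l host=$node"
}
```

`resume_gpu_cmd` only prints the command, so you can read it before running it.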

tmux Quick Reference

Command	What it does
`tmux new -s name`	Create new session named "name"
`tmux ls`	List all your tmux sessions
`tmux attach -t name`	Reattach to session "name"
`tmux kill-session -t name`	End session "name"
Ctrl+B then D	Detach from current session
Ctrl+B then %	Split pane into left and right halves
Ctrl+B then "	Split pane into top and bottom halves
Ctrl+B then Arrow	Navigate between panes

Check GPU Usage with nvidia-smi

Before starting your job, always check if GPUs are already in use. Use nvidia-smi to see real-time GPU status.

[user@gpu3 ~]$ nvidia-smi

How to Read nvidia-smi Output

Column	What to look for
GPU-Util	Percentage of GPU currently in use. 0% = idle, 90%+ = heavily used.
Memory-Usage	How much VRAM is occupied. Usage close to the 15360MiB total means the GPU is full.
Processes	Lists active jobs. "No running processes" = GPU is free.
Temp / Pwr	High temperature (>80°C) or power draw close to the cap suggests heavy load.

Rule of thumb: If a GPU shows >80% utilization or >12GB memory used, assume someone is actively using it. Choose a different GPU or wait.
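The rule of thumb can be applied automatically to nvidia-smi's CSV output. This is a rough sketch: the function name `classify_gpus` is hypothetical, and reading ">12GB" as 12288 MiB is an assumption. It expects one line per GPU in the form `index, utilization, memory used`, as produced by `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`.

```shell
# Read "index, util%, memory-used-MiB" lines on stdin and label each GPU.
# Thresholds follow the rule of thumb: >80% utilization or >12 GB used.
# Assumption: "12GB" is taken as 12288 MiB.
classify_gpus() {
    awk -F', *' '{
        state = ($2 + 0 > 80 || $3 + 0 > 12288) ? "busy" : "free"
        printf "GPU %s: %s\n", $1, state
    }'
}
```

On a GPU node, pipe the query command shown above into `classify_gpus`. A GPU labelled "free" can still have someone's idle process on it, so check the Processes section as well.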

Watch GPU Status Live

Use watch to refresh nvidia-smi every few seconds:

watch -n 2 nvidia-smi

Press Ctrl + C to stop watching.

Serious Warning: Use Responsibly

GPU nodes are a shared, limited resource. This interactive access method gives you direct control — use it thoughtfully.

Never kill another user's job. Use nvidia-smi and ps aux to check before starting work.
Do not monopolize GPUs. If you're done, exit cleanly so others can use the resource.
Do not run destructive commands. You have elevated access — misuse can affect everyone.
Report issues, don't exploit them. If you find a problem, tell the admin, don't take advantage.

Good Practices Checklist

✅ Always run nvidia-smi before starting a job
✅ Use tmux for long jobs so they survive disconnections
✅ Name your tmux sessions clearly (e.g., myproject_train)
✅ Clean up: run tmux kill-session -t name when your job is done
✅ Exit GPU node cleanly: type exit when finished
✅ Be considerate: GPU time is shared — don't block others unnecessarily
✅ If gpushort (gpu1) comes back online, use it for quick tests to free up gpulong nodes

What If I Accidentally Kill Someone's Job?

Stop immediately. Don't try to hide it or restart it yourself.
Notify the user if you know who was running the job.
Inform HPC support at helpdesk-hpc@iisermohali.ac.in with details.
Learn from it. Double-check nvidia-smi and process lists next time.

Remember: Accidents happen. Honesty and quick communication minimize harm and maintain trust in the shared resource.

About GPU Queue Walltime

Despite the queue names gpushort and gpulong, GPU nodes do not enforce walltime limits in the current configuration. Your interactive session will continue until:

You manually exit the node
The node is rebooted by admin
Your tmux session is killed

This is why tmux is critical: no timer will kill your job, but without tmux the job stops as soon as your SSH connection drops.