IISER Mohali Unofficial HPC Guide

GPU & Sessions

How to Access GPU Nodes

Use an interactive PBS session to get a shell on a GPU node. This works only from login1.

# From login1 only — will NOT work from login2
[user@login1 ~]$ qsub -I -q gpulong

Queue    | GPU Nodes  | Current Status                     | Use For
gpushort | gpu1       | ⚠️ Currently down (may be revived) | Training jobs, production runs
gpulong  | gpu2, gpu3 | ✅ Active                          | Training jobs, production runs

What happens: After running the command, you'll be dropped directly into a shell on a GPU node (e.g., gpu2 or gpu3). Your prompt will change to something like [user@gpu2 ~]$.

Target a Specific GPU Node

If you want to request a specific GPU node (e.g., gpu2 or gpu3), add the -l host= flag:

# Request gpu2 specifically
[user@login1 ~]$ qsub -I -q gpulong -l host=gpu2

# Request gpu3 specifically
[user@login1 ~]$ qsub -I -q gpulong -l host=gpu3
💡 Use this when you know a particular node is free (check with nvidia-smi first) or if you need to resume work on the same node.
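
If you request specific nodes often, a small wrapper can save typing. The sketch below is only an illustration: gpu_qsub_cmd is a hypothetical helper name, and the node list (gpu2, gpu3) is taken from the queue table above, so adjust it if the cluster layout changes. It prints the command instead of running it, so you can inspect it first.

```shell
# Hypothetical helper: build the qsub command for an interactive GPU session.
# With no argument, PBS picks any node in gpulong; with a node name, that
# node is requested via -l host=<node>.
gpu_qsub_cmd() {
    node="${1:-}"
    case "$node" in
        "")        printf '%s\n' "qsub -I -q gpulong" ;;
        gpu2|gpu3) printf '%s\n' "qsub -I -q gpulong -l host=$node" ;;
        *)         printf 'unknown gpulong node: %s\n' "$node" >&2
                   return 1 ;;
    esac
}
```

From login1, running $(gpu_qsub_cmd gpu2) executes the printed command; calling gpu_qsub_cmd gpu2 on its own only shows it.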

Small Jobs vs Long Jobs

Job Type | Recommended Approach
Small/quick jobs (< 30 min, testing, debugging) | Just run directly in the interactive shell. Keep your terminal open. If it disconnects, you can restart quickly.
Long jobs (training, production runs) | Always use tmux so your job survives SSH disconnects, terminal closes, or network drops.

What is tmux?

tmux is a terminal multiplexer. It lets you create sessions that keep running even after you disconnect. You can reattach later to check progress or interact with your job.

tmux may not be pre-installed: If tmux is not available on your account, you'll need to install it yourself in your home directory. Search online for "install tmux without root" or ask ChatGPT for step-by-step help.
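
Before relying on tmux, it is worth checking whether it is already on your PATH. A minimal sketch, where have_cmd is a hypothetical helper name and the messages are placeholders:

```shell
# Hypothetical helper: succeed if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd tmux; then
    echo "tmux found: $(command -v tmux)"
else
    echo "tmux not found; install it under your home directory first"
fi
```

Run this on the GPU node itself, since the software available there may differ from what is on login1.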

tmux Quick Start

1. Start a new tmux session

[user@gpu2 ~]$ tmux new -s my_training

You'll see a green status bar at the bottom. You're now inside the tmux session.

2. Run your code

[user@gpu2 ~]$ python train.py
# or whatever command you need

3. Detach safely (job keeps running!)

Press Ctrl + B, release both, then press D.

# You'll see: [detached]
[user@gpu2 ~]$ # back to normal shell, job still runs in tmux

4. Reattach later to check progress

[user@gpu2 ~]$ tmux attach -t my_training

5. List or kill sessions

tmux ls                    # list all your tmux sessions
tmux kill-session -t my_training  # end a session when done

Full Workflow Example

# Step 1: From login1, get interactive GPU access
[user@login1 ~]$ qsub -I -q gpulong
qsub: waiting for job 12345.iisermhpc1 to start
qsub: job 12345.iisermhpc1 ready

# Step 2: You're now on gpu2
[user@gpu2 ~]$ tmux new -s resnet_train

# Step 3: Inside tmux, run your job
[user@gpu2 ~]$ python train_resnet.py --epochs 100
Epoch 1/100 - loss: 2.341 - acc: 0.12
Epoch 2/100 - loss: 1.987 - acc: 0.24
...

# Step 4: Detach (Ctrl+B, then D)
# [detached]

# Step 5: Later, reattach to check
[you@laptop ~]$ ssh user@hpc.iisermohali.ac.in
[user@login1 ~]$ qsub -I -q gpulong -l host=gpu2  # request the same GPU node
[user@gpu2 ~]$ tmux attach -t resnet_train
Epoch 47/100 - loss: 0.412 - acc: 0.86
...
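
The workflow above can also be started in one step by launching the job in a detached tmux session and saving its output to a log file. This is a sketch under assumptions: start_in_tmux, the DRY_RUN switch, and the <session>.log file name are all hypothetical choices, not part of the cluster setup.

```shell
# Hypothetical helper: run a command inside a new detached tmux session and
# copy its output to <session>.log. Set DRY_RUN=1 to print the tmux command
# instead of running it.
start_in_tmux() {
    session="$1"
    shift
    # Everything after the session name is the job's command line.
    job="$* 2>&1 | tee ${session}.log"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'tmux new-session -d -s %s "%s"\n' "$session" "$job"
    else
        tmux new-session -d -s "$session" "$job"
    fi
}

# Example (on the GPU node):
#   start_in_tmux resnet_train python train_resnet.py --epochs 100
#   tmux attach -t resnet_train    # watch progress; Ctrl+B then D to detach
```

Note that a session started this way closes when the command finishes, so the log file is what remains for checking the final output.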

tmux Quick Reference

Command | What it does
tmux new -s name | Create new session named "name"
tmux ls | List all your tmux sessions
tmux attach -t name | Reattach to session "name"
tmux kill-session -t name | End session "name"
Ctrl+B then D | Detach from current session
Ctrl+B then % | Split into left and right panes
Ctrl+B then " | Split into top and bottom panes
Ctrl+B then Arrow | Navigate between panes

Check GPU Usage with nvidia-smi

Before starting your job, always check if GPUs are already in use. Use nvidia-smi to see real-time GPU status.

[user@gpu3 ~]$ nvidia-smi
Fri May 22 19:59:37 2026
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:60:00.0 Off |                    0 |
| N/A   44C    P0    25W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:61:00.0 Off |                    0 |
| N/A   48C    P0    27W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:DA:00.0 Off |                    0 |
| N/A   48C    P0    26W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:DB:00.0 Off |                    0 |
| N/A   40C    P0    25W /  70W |      0MiB / 15360MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

How to Read nvidia-smi Output

Column | What to look for
GPU-Util | Percentage of GPU currently in use. 0% = idle, 90%+ = heavily used.
Memory-Usage | How much VRAM is occupied. If near 15360MiB, the GPU is full.
Processes | Lists active jobs. "No running processes found" = GPU is free.
Temp / Pwr | High temperatures (>80°C) or power draw close to the cap (70W on these Tesla T4s) may indicate heavy load.

Rule of thumb: If a GPU shows >80% utilization or >12GB memory used, assume someone is actively using it. Choose a different GPU or wait.
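
The rule of thumb above can be applied automatically to nvidia-smi's machine-readable output. The sketch below is illustrative only: classify_gpus is a hypothetical helper name, and 12GB is interpreted as 12288 MiB, which is an assumption about the intended unit.

```shell
# Hypothetical helper: classify GPUs using the rule of thumb above.
# Reads lines of "index, utilization_percent, memory_used_MiB" on stdin.
# 12 GB is taken as 12288 MiB here (an assumption about the intended unit).
classify_gpus() {
    awk -F',' '{
        idx = $1 + 0; util = $2 + 0; mem = $3 + 0
        status = (util > 80 || mem > 12288) ? "busy" : "possibly-free"
        print idx, status
    }'
}

# On a GPU node, feed it live data:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#              --format=csv,noheader,nounits | classify_gpus
```

A GPU reported as possibly-free may still be running a light job, so also check the Processes section before using it. Most CUDA-based frameworks honor the CUDA_VISIBLE_DEVICES environment variable, so something like CUDA_VISIBLE_DEVICES=2 python train.py is one common way to restrict a job to a single GPU.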

Watch GPU Status Live

Use watch to refresh nvidia-smi every few seconds:

watch -n 2 nvidia-smi

Press Ctrl + C to stop watching.
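
If you would rather record usage than watch it, nvidia-smi can write periodic samples to a file that you summarize afterwards. A minimal sketch, where peak_util and the gpu_usage.csv file name are hypothetical:

```shell
# Hypothetical helper: report the peak utilization seen per GPU.
# Reads lines of "index, utilization_percent" on stdin.
peak_util() {
    awk -F',' '{
        idx = $1 + 0; util = $2 + 0
        if (!(idx in peak) || util > peak[idx]) peak[idx] = util
    }
    END { for (i in peak) print i, peak[i] }' | sort -n
}

# On a GPU node, collect samples every 2 seconds (Ctrl + C to stop):
#   nvidia-smi --query-gpu=index,utilization.gpu \
#              --format=csv,noheader,nounits -l 2 > gpu_usage.csv
# Then summarize the log:
#   peak_util < gpu_usage.csv
```

Each output line is a GPU index followed by the highest utilization percentage recorded for it, which helps show whether a GPU that looks idle right now was busy a few minutes ago.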

Serious Warning: Use Responsibly

GPU nodes are a shared, limited resource. This interactive access method gives you direct control — use it thoughtfully.

  • Never kill another user's job. Use nvidia-smi and ps aux to check before starting work.
  • Do not monopolize GPUs. If you're done, exit cleanly so others can use the resource.
  • Do not run destructive commands. You have elevated access — misuse can affect everyone.
  • Report issues, don't exploit them. If you find a problem, tell the admin, don't take advantage.

Good Practices Checklist

  • ✅ Always run nvidia-smi before starting a job
  • ✅ Use tmux for long jobs so they survive disconnections
  • ✅ Name your tmux sessions clearly (e.g., myproject_train)
  • ✅ Clean up: tmux kill-session when your job is done
  • ✅ Exit GPU node cleanly: type exit when finished
  • ✅ Be considerate: GPU time is shared — don't block others unnecessarily
  • ✅ If gpushort (gpu1) comes back online, use it for quick tests to free up gpulong nodes

What If I Accidentally Kill Someone's Job?

  1. Stop immediately. Don't try to hide it or restart it yourself.
  2. Notify the user if you know who was running the job.
  3. Inform HPC support at helpdesk-hpc@iisermohali.ac.in with details.
  4. Learn from it. Double-check nvidia-smi and process lists next time.
Remember: Accidents happen. Honesty and quick communication minimize harm and maintain trust in the shared resource.

About GPU Queue Walltime

Despite the queue names gpushort and gpulong, GPU nodes do not enforce walltime limits in the current configuration. Your interactive session will continue until:

  • You manually exit the node
  • The node is rebooted by admin
  • Your tmux session is killed

This is why using tmux is critical — your job won't be auto-killed by a timer, but it will stop if your SSH connection drops without tmux.