AI/Torch

This example illustrates how to submit a PyTorch job on an H100 node and use two of its GPUs.

Python script

This Python script demonstrates how to build and train a simple neural network on randomly generated data with PyTorch. It detects and uses GPUs (CUDA) when they are available. The script trains the model for five epochs, computes the loss with a cross-entropy criterion, and saves the trained model to a file.

import torch
import torch.nn as nn
import torch.optim as optim

# GPU Information
if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    device_names = [torch.cuda.get_device_name(i) for i in range(num_gpus)]
    print(f"Running on {num_gpus} GPU(s): {device_names}")
else:
    print("No GPUs available. Running on CPU.")

# Device Selection
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy Dataset
data = torch.randn(100, 10).to(device)  # 100 Samples, 10 Features
labels = torch.randint(0, 2, (100,)).to(device)  # Binary labels

# Simple Neural Network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

# Model, Loss, Optimizer
model = SimpleNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training Loop
for epoch in range(5):  # Train for 5 epochs
    optimizer.zero_grad()
    outputs = model(data)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

# Save Model
torch.save(model.state_dict(), "simple_model.pth")
print("Model saved to simple_model.pth")

Save this in your example directory as pytorch_example.py.
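
Note that the script above only reports how many GPUs are visible; the training itself runs on a single device. To actually spread the forward pass across both requested H100s, one common option is torch.nn.DataParallel (for larger jobs, torch.nn.parallel.DistributedDataParallel is the recommended route). A minimal sketch, not part of pytorch_example.py, reusing the SimpleNet definition and device selection from the script:

import torch
import torch.nn as nn

class SimpleNet(nn.Module):  # same definition as in pytorch_example.py
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNet().to(device)

# Replicate the model on every visible GPU; each forward pass then splits
# the input batch across the replicas and gathers the outputs on device 0.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)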

submission script

#!/bin/bash 

#SBATCH --job-name=torch_gpu_job
#SBATCH --output=torch_gpu_job.out
#SBATCH --error=torch_gpu_job.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:h100:2 # (1)
#SBATCH --partition=GPU # (2)
#SBATCH --time=00:03:00


module load python/3.12.4 # (3)

# Run Python Script
python3 pytorch_example.py  # (4)
  1. Specify the GPU type and how many GPUs to request. Because H100 GPUs are scarce, it may make sense to request l40 instead (see the sinfo query below).
  2. On festus you need to specify the GPU partition for H100 nodes.
  3. Load a Python module that provides a PyTorch installation.
  4. Use python3 inside your submission script: python points to the system's Python installation, not to the one provided by the module.

Save this in your example directory as pytorch_example.submit.
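
To see which GPU types and counts are actually configured, you can list Slurm's generic resources per node. This is a generic sinfo invocation (the GPU partition name is taken from the script above; the exact output depends on the cluster):

[bt123456@festus02 pytorch_example]$ sinfo -p GPU -o "%N %G"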

submission and evaluation
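
Submit the batch script from the directory that contains both files:

[bt123456@festus02 pytorch_example]$ sbatch pytorch_example.submit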

Once the job has run, there should be a file named “torch_gpu_job.out”:

[bt123456@festus02 pytorch_example]$ cat torch_gpu_job.out 
Running on 2 GPU(s): ['NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3']
Epoch 1, Loss: 0.7616337537765503
Epoch 2, Loss: 0.7504904866218567
Epoch 3, Loss: 0.74031662940979
Epoch 4, Loss: 0.7311420440673828
Epoch 5, Loss: 0.7229485511779785
Model saved to simple_model.pth
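
If you later want to reuse the trained weights, for example on a node without GPUs, load the state dict back into a fresh model instance. A minimal sketch, assuming the SimpleNet class from pytorch_example.py (redefined here for completeness):

import torch
import torch.nn as nn

class SimpleNet(nn.Module):  # same definition as in pytorch_example.py
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

model = SimpleNet()
model.load_state_dict(torch.load("simple_model.pth", map_location="cpu"))
model.eval()  # switch to inference mode before making predictions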