AI/Torch
This example illustrates how to submit a PyTorch job to an H100 node and use two of its GPUs.
python script
This Python script demonstrates the use of PyTorch for building and training a simple neural network on randomly generated data. It detects and uses GPUs (CUDA) if they are available. The script trains the model for five epochs, calculates the loss using a cross-entropy criterion, and saves the trained model to a file.
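A minimal sketch of a script matching this description is given here. The layer sizes, the number of samples, the SGD optimizer with a learning rate of 0.01, and the use of nn.DataParallel to spread work over several GPUs are all assumptions made for illustration; only the five epochs, the cross-entropy loss, the random data, and the printed messages follow from the description and the job output shown further down.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class SimpleModel(nn.Module):
    """Small two-layer classifier (layer sizes are illustrative assumptions)."""

    def __init__(self, in_features=10, hidden=50, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)


def train(epochs=5, n_samples=1000, save_path="simple_model.pth"):
    # Use CUDA if at least one GPU is visible to this process, else the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    n_gpus = torch.cuda.device_count()
    if n_gpus > 0:
        names = [torch.cuda.get_device_name(i) for i in range(n_gpus)]
        print(f"Running on {n_gpus} GPU(s): {names}")
    else:
        print("Running on CPU")

    model = SimpleModel()
    if n_gpus > 1:
        # Assumption: DataParallel splits each batch across the visible GPUs.
        model = nn.DataParallel(model)
    model = model.to(device)

    # Randomly generated inputs and binary labels.
    x = torch.randn(n_samples, 10, device=device)
    y = torch.randint(0, 2, (n_samples,), device=device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    losses = []
    for epoch in range(1, epochs + 1):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
        print(f"Epoch {epoch}, Loss: {loss.item()}")

    # Save the unwrapped model's weights so they load without DataParallel.
    to_save = model.module if isinstance(model, nn.DataParallel) else model
    torch.save(to_save.state_dict(), save_path)
    print(f"Model saved to {save_path}")
    return losses


if __name__ == "__main__":
    train()
```

With two classes and random labels, the loss is expected to start near ln 2 (about 0.69), which is in line with the values in the job output.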
Save this in your example directory as pytorch_example.py.
submission script
- Specify the GPU type and how many GPUs you want to use. Because H100 GPUs are scarce, it may make sense to request L40 GPUs instead.
- On festus, you need to specify the GPU partition for H100 nodes.
- Load the module that provides the torch installation.
- Use python3 inside your submission script, because python points only to the system's Python installation, not to the one provided by the module.
Save this in your example directory as pytorch_example.submit.
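The difference between python and python3 noted in the list above can be checked with a short helper that reports which interpreter is running and whether it can find torch. This is a sketch only; the file name check_interpreter.py and the function name describe_interpreter are hypothetical.

```python
import importlib.util
import sys


def describe_interpreter():
    """Report the running interpreter and whether torch is importable from it."""
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        # find_spec looks the package up without importing it.
        "torch_available": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Running it once as `python check_interpreter.py` and once as `python3 check_interpreter.py` after loading the module should show two different executables if the two commands resolve to different installations.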
submission and evaluation
After the job has run, there should be a file named “torch_gpu_job.out”.
[bt123456@festus02 pytorch_example]$ cat torch_gpu_job.out
Running on 2 GPU(s): ['NVIDIA H100 80GB HBM3', 'NVIDIA H100 80GB HBM3']
Epoch 1, Loss: 0.7616337537765503
Epoch 2, Loss: 0.7504904866218567
Epoch 3, Loss: 0.74031662940979
Epoch 4, Loss: 0.7311420440673828
Epoch 5, Loss: 0.7229485511779785
Model saved to simple_model.pth
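To check the result without reading the file by hand, the loss values can be extracted from the output. The following sketch assumes the line format shown above ("Epoch N, Loss: X"); the function names parse_losses and is_decreasing are hypothetical.

```python
import re

# Matches lines such as "Epoch 3, Loss: 0.74031662940979".
LOSS_LINE = re.compile(r"^Epoch (\d+), Loss: ([0-9.eE+-]+)$")


def parse_losses(text):
    """Return the loss values in the order they appear in the job output."""
    losses = []
    for line in text.splitlines():
        match = LOSS_LINE.match(line.strip())
        if match:
            losses.append(float(match.group(2)))
    return losses


def is_decreasing(values):
    """True if every value is strictly smaller than the one before it."""
    return all(b < a for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    with open("torch_gpu_job.out") as fh:
        losses = parse_losses(fh.read())
    print(f"Parsed {len(losses)} epochs; loss decreasing: {is_decreasing(losses)}")
```

Because the training data is random, a falling loss only confirms that the optimization loop ran on the allocated GPUs; it does not indicate that the model has learned anything useful.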