This page is mainly dedicated to examples of different job types. For a more comprehensive explanation on different job types, please refer to [Jobs Overview]({{site.baseurl}}{% link jobs/Overview.md %}).
Now let's walk through `my_script.sbatch` line by line to see what each directive does.
-`#!/bin/bash`: This line needs to be included at the start of **all** your batch scripts. It specifies that the script should be run with the `bash` shell.
Lines 2-7 are your `SBATCH` directives. These lines are where you specify different options for your job, including its name, the path/name of its output and error files, the list of nodes you want to use, resource limits, and more if required. Let's walk through them line by line:
-`#SBATCH --job-name=test_job`: This directive gives your job a name that you can later use to more easily track and manage your job when looking for it in the queue. In this example, we've called it `test_job`. You can read about job management at `/software/env-modules.html`.
-`#SBATCH --output=test_job.out`: Used to specify where your output file is generated, and what it's going to be named. In this example, we have not provided a path, but only provided a name. When you use the `--output` directive without specifying a full path, just providing a filename, Slurm will store the output file in the current working directory from which the `sbatch` command was executed.
...
...
After the last `#SBATCH` directive, commands are run as in any other regular shell script.
This script, as discussed previously, is a non-interactive job. Non-interactive jobs are submitted to the queue with the `sbatch` command. In this case, we submit our job using `sbatch my_script.sbatch`.
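For example, submitting the job and then checking that it appears in the queue might look like this (the output of `squeue` depends on the cluster's current state):

```bash
# Submit the batch script to Slurm; it prints the assigned job ID
sbatch my_script.sbatch

# List your own pending and running jobs
squeue -u $USER
```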
### Jupyter Notebook batch job example
As you know, there is no Graphical User Interface (GUI) available when you connect to the cluster through your shell, so in order to access some application's GUI, port forwarding is necessary [(What is SSH port forwarding?)](https://www.youtube.com/watch?v=x1yQF1789cE&ab_channel=TonyTeachesTech). In this example, we will do port forwarding to access Jupyter Notebook's web portal. You will basically send and receive your data through a specified port on your local machine that is tunneled to the port on the cluster where the Jupyter Notebook server is running. This setup enables you to work with Jupyter Notebooks as if they were running locally on your machine, despite actually being executed on a remote cluster node. After a successful setup, you can access Jupyter's portal in the browser of your choice via a link generated by Jupyter **on your local machine**.
First, create your sbatch script file. I'm going to call mine `jupyterTest.sbatch`. Then add the following to it:
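Here is a minimal sketch of such a script, based on the directives explained below; the job name, the port (9001, matching the link shown later), and the exact wording of the echoed tunnel command are illustrative choices you can adapt:

```bash
#!/bin/bash
#SBATCH --job-name=jupyterTest
#SBATCH --nodelist=cn01
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --output=/home/username/outputs/jupyterTest.out
#SBATCH --error=/home/username/outputs/jupyterTest.err

# Collect the information needed to build the SSH tunnel command
# (illustrative choices; the port can be any free port on the node)
node=$(hostname -s)
user=$(whoami)
port=9001

# Write the tunnel command to the .out file so you can copy it and run it locally
echo "Run this on your local machine: ssh -N -L ${port}:${node}:${port} ${user}@binary"

module load jupyter
jupyter notebook --no-browser --port=${port} --ip=${node}
```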
Replace `/home/username/outputs/` with your actual directory path for storing output and error files.
First, let's take a look at what the new directives and commands in this script do:
Note that most of the directives at the start of this script have previously been discussed at "Basic batch job example", so we are only going to discuss the new ones:
-`--nodelist=cn01`: Using `--nodelist` you can specify the exact name(s) of the node(s) you want your job to run on. In this case, we have specified it to be `cn01`.
-`--ntasks=1`: This directive tells SLURM to allocate resources for one task. A "task" in this context is essentially an instance of your application or script running on the cluster. For many applications, especially those that don't explicitly parallelize their workload across multiple CPUs or nodes, specifying a single task is sufficient. However, if you're running applications that can benefit from parallel execution, you might increase this number. This directive is crucial for optimizing resource usage based on the specific needs of your job. For instance, running multiple independent instances of a data analysis script on different subsets of your data could be a scenario where increasing the number of tasks is beneficial.
-`--cpus-per-task=1`: This sets the number of CPUs allocated to each task specified by `--ntasks`. By default, setting it to 1 assigns one CPU to your task, which is fine for tasks that are not CPU-intensive or designed to run on a single thread. However, for applications that are multi-threaded and can utilize more than one CPU core for processing, you would increase this value to match the application's capability to parallelize its workload.
- The variable initializations such as `node=...`, `user=...` are used to retrieve some information from the node you are running your job on to produce the right command for you to later run **locally**, and set up the SSH tunnel. You don't need to worry about these.
- The `echo` command is going to write the ssh tunneling command to your `.out` file with the help of the variables. We will explain how to use that generated command further below.
-`module load jupyter`: Loads the required modules to add support for the command `jupyter`.
-`jupyter notebook --no-browser --port=${port} --ip=${node}` runs a jupyter notebook and makes it listen on our specified port and address to later be accessible through your local machine's browser.
Then, submit your batch job using `sbatch jupyterTest.sbatch`. Make sure to replace `jupyterTest.sbatch` with whatever file name and extension you chose.
At this stage, if you read the content of `jupyterTest.out`, you will find a generated command that looks similar to the following:
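For example (the port, node name, and username here are illustrative placeholders; copy the exact command from your own `.out` file):

```bash
ssh -N -L 9001:cn01:9001 username@binary
```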
Copy that line and run it in your local machine's command line. Then, enter your login credentials for `binary` and hit Enter. You should not expect anything magical to happen. In fact, if everything is successful, your shell will simply move to a new line without producing any output.
You can now access Jupyter's GUI through a browser of your choice on your local machine, at the address that Jupyter Notebook has generated for you. For some reason, Jupyter writes the address to `stderr`, so you must look for it inside your `jupyterTest.err` file. Inside that file, there should be a line containing a link similar to the following:
```bash
http://127.0.0.1:9001/?token=...(your token is here)...
```
Copy that address and paste it into your browser, and you should successfully access Jupyter's GUI.
### Apptainer TensorFlow batch job example
This example shows how to execute a TensorFlow script, `tfTest.py`, that trains a simple neural network on the MNIST dataset using GPUs.
First, create a Python script called `tfTest.py` with the provided content:
Next, create a SLURM batch job script named `job-test-nv-tf.sbatch`. This script requests GPU resources, loads the necessary modules, and runs your TensorFlow script inside an Apptainer container:
```bash
#!/bin/bash
#SBATCH --job-name=tensorflow_test_job
#SBATCH --output=result.txt
#SBATCH --nodelist=gpu1
#SBATCH --gres=gpu:A100:2
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=1000
module load python3
module load apptainer
echo"run Apptainer TensorFlow GPU"
apptainer run --nv tensorflowGPU.sif python3 tfTest.py
```
This script runs the `tfTest.py` script inside the TensorFlow GPU container (`tensorflowGPU.sif`).
You can now submit your job to Slurm using `sbatch job-test-nv-tf.sbatch`.
After the job completes, you can check the output in `result.txt`. The output should include information about the available physical devices (GPUs), the TensorFlow version, and the output from training the model on the MNIST dataset.
The beginning and end of the file might look something like this:
As previously mentioned, Nvidia GPU Cloud (NGC) containers are basically pre-configured and optimized Docker containers that include GPU-accelerated software for AI, machine learning, and high-performance computing (HPC) applications. These containers provide a ready-to-use environment with NVIDIA drivers, the CUDA toolkit, and popular deep learning frameworks, and they are also scanned for security vulnerabilities and exposures.
If you just pull an NGC container for the HPC software suite you have in mind, you don't need to spend time configuring complex software stacks or worry about compatibility issues between different libraries and frameworks.
You can take a look at the containers available on NGC [here](https://catalog.ngc.nvidia.com/containers?filters=&orderBy=weightPopularDESC&query=&page=&pageSize=).
Now here's how you can pull and use a container from NGC:
#### Create and setup an NGC account
1. Visit [this link](https://ngc.nvidia.com/signin) and enter the email you'd like to sign up with (or sign in if you have an account already and skip to step 3).
**Note:** Don't be misled by the "login" page. If the email you enter does not match an existing account, you will automatically be redirected to the sign-up page.
2. Fill out all the required fields and verify your email afterwards, as prompted.
3. Log in to your account, then follow the steps at [this link](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) to set up an API key with which you will pull your containers.
**Note:** Make sure you save your API key once it is revealed after generation; you will later need it on the login node.
4. Once you have your API key ready, log in to the login node (Binary).
Run the following command:
```bash
ngc config set
```
You will be prompted to enter your API key. Go ahead and paste it, and then hit Enter.
For the rest of the prompts, just accept the default option and hit Enter.
Once the configuration is set successfully, you will be shown the path where it has been saved.
#### Pull and run Pytorch
If Pytorch is not what you need, you can pull any other container found at [this link](https://catalog.ngc.nvidia.com/containers?filters=&orderBy=weightPopularDESC&query=&page=&pageSize=).
Once you find your desired container, click on it, and look for the "**Get Container v**" button at the top right of the screen.
In this example's case, our container's link is `nvcr.io/nvidia/pytorch:23.05-py3`.
Run the following command, which both pulls your container and converts it to an `.sif` file that Apptainer can work with.
**Remember:** Apptainer is very similar to Docker, with the most crucial difference that it runs under user privileges rather than root.
Note the `nvcr.io/nvidia/pytorch:23.05-py3` section of the command. If you are pulling another container, make sure you replace it with the proper link.
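A sketch of that pull command, assuming you want the resulting image saved as `pytorch_23.05.sif` (the file name referenced later in this example):

```bash
# Pull the NGC PyTorch image and convert it into an Apptainer .sif file
apptainer pull pytorch_23.05.sif docker://nvcr.io/nvidia/pytorch:23.05-py3
```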
In Pytorch's case, this command is going to take a while, as the container is quite large.
Wait and let everything download and unpack as necessary. Remember that the time this takes varies from container to container, depending on the container's size.
Once the operation is completed successfully, you will be able to see the `pytorch_23.05.sif` file at the path you ran the command in.
Now be careful! Don't be tempted to execute the container on the login node (you can't, even if you try). The whole point of the cluster is to run workloads on the compute nodes and use the login node only as the entry/access point.
You can now write a script which uses the container we just installed (`pytorch_23.05.sif`) to execute some Python program. Here's how:
First, let's make our sample Pytorch program and save it inside a file called `pytorch_test.py`:
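The program just creates a random tensor, adds 2 to it, and prints CUDA device information (compare with the expected output shown further below). To submit it, you also need a batch script, `run_pytorch.sbatch`; here is a minimal sketch, assuming the node, GPU type, and `/home/mani/dev` paths used in this example (the job name and requested GPU count are illustrative):

```bash
#!/bin/bash
#SBATCH --job-name=pytorch_test              # illustrative name
#SBATCH --nodelist=gpu1
#SBATCH --gres=gpu:A100:8                    # gpu1 has 8x A100 GPUs; request fewer if you prefer
#SBATCH --ntasks=1
#SBATCH --output=/home/mani/dev/pytorch_test.out
#SBATCH --error=/home/mani/dev/pytorch_test.err

module load apptainer

# Run the test program inside the PyTorch container with GPU support (--nv)
apptainer exec --nv /home/mani/dev/pytorch_23.05.sif python /home/mani/dev/pytorch_test.py
```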
This `.sbatch` script simply tells Slurm to run the program on node `gpu1`, which has 8x A100 GPUs, and save the output in a file called `/home/mani/dev/pytorch_test.out`.
**Note:** You need to change the path for both the `.out` and `.err` file to _your_ desired/accessible path.
In the line `apptainer exec --nv /home/mani/dev/pytorch_23.05.sif python /home/mani/dev/pytorch_test.py`, you can see that we have provided _our_ paths to both the container's `.sif` file and the Python program.
Make sure you change these to where _you_ have saved those files.
After everything is ready to go, submit your `run_pytorch.sbatch` to SLURM:
```bash
sbatch run_pytorch.sbatch
```
This job takes roughly 10-15 seconds to complete.
If you run `squeue -u $USER` before it's completed, you will be able to see your job listed in the queue.
Once you don't see your job in the queue any more, it means it has completed. It's now time to check the `.out` file!
You should expect something like:
```text
[mani@binary dev]$ cat pytorch_test.out
starting: Tue Oct 15 17:15:42 EDT 2024
Random Tensor:
tensor([[0.6414, 0.6855, 0.5599],
[0.5254, 0.2902, 0.0842],
[0.0418, 0.1184, 0.9758],
[0.7644, 0.6543, 0.0109],
[0.9723, 0.4741, 0.8250]])
Tensor after adding 2:
tensor([[2.6414, 2.6855, 2.5599],
[2.5254, 2.2902, 2.0842],
[2.0418, 2.1184, 2.9758],
[2.7644, 2.6543, 2.0109],
[2.9723, 2.4741, 2.8250]])
Using CUDA
Current device: 0
Device count: 8
Device name: NVIDIA A100-SXM4-80GB
ended: Tue Oct 15 17:15:53 EDT 2024
```
**Note:** Even if your script executes successfully, an `.err` file will still be created; however, it will be empty.
If your output file is empty, check whether there is anything informative in the `.err` file to help diagnose the issue.
You can find more job examples where we run TensorFlow and PyTorch containers at the [Apptainer]({{site.baseurl}}{% link software/apptainer.md %}) page.
Before we talk about [Apptainer](https://apptainer.org/), we need to know what containerization is.
## What is Containerization?
Containerization is a smart way to package applications with all their dependencies, to make sure they run consistently across different environments. Unlike full virtualization, which creates entire virtual machines, containerization shares the host operating system's kernel. This makes containers lighter and faster to spin up.
### Docker: The Popular Choice
[Docker](https://www.docker.com/) is probably the containerization platform you've heard of most. It's widely used, but it has one potential drawback:
It runs containers as **root**, which can be a security concern in some settings.
### Apptainer: Security for HPC
This is where Apptainer comes in. Formerly known as Singularity, Apptainer is similar to Docker but with a key difference:
It runs containers under **user privileges**.
Apptainer is designed with scientific and application virtualization in mind, which makes it a go-to choice for many researchers and institutions.
## Where to find containers?
Apptainer uses `.sif` (Singularity Image Format) files, which are different from the image files that Docker uses. If you're already familiar with `.sif` files and Apptainer/Singularity, you can skip to the examples section. Otherwise, here are two ways to get containers for Apptainer:
### Converting Existing Containers or Images
If you have existing Docker container images, you can convert them to `.sif` format using Apptainer's command-line tool. Here's an example:
1. First, ensure your local Docker image is available in your Docker Daemon:
```bash
docker images
```
2. Once you've confirmed the image is available locally, you can use Apptainer to build an `.sif` file from the local Docker image:
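A sketch of that build command, assuming the local image is tagged `tensorflow/tensorflow:latest` (substitute your own image name and tag):

```bash
# Build an Apptainer .sif image from an image already present in the local Docker daemon
apptainer build tensorflow.sif docker-daemon://tensorflow/tensorflow:latest
```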
**Note** that the `tensorflow` image is just an example. Make sure you replace it with your own image name.
This command tells Apptainer to use a locally available Docker image, make an `.sif` image file out of it, and save it as `tensorflow.sif`.
### Nvidia GPU Cloud (NGC)
But what if we want ready-to-pull HPC-tailored containers?
Nvidia GPU Cloud (NGC) containers are basically pre-configured and optimized Docker containers that include GPU-accelerated software for AI, machine learning, and high-performance computing (HPC) applications. These containers provide a ready-to-use environment with NVIDIA drivers, the CUDA toolkit, and popular deep learning frameworks, and they are also scanned for security vulnerabilities and exposures.
If you just pull an NGC container for the HPC software suite you have in mind, you don't need to spend time configuring complex software stacks or worry about compatibility issues between different libraries and frameworks.
You can take a look at the containers available on NGC [here](https://catalog.ngc.nvidia.com/containers?filters=&orderBy=weightPopularDESC&query=&page=&pageSize=).
### Pulling from Docker Hub
[Docker Hub](https://hub.docker.com/) is basically a cloud-based registry that allows users to store, share, and manage Docker container images.
Apptainer can also pull containers directly from Docker and automatically convert them to `.sif` format. Here's how you can do it:
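A sketch of that pull, assuming the `tensorflow/tensorflow` image mentioned below (replace it with your desired image):

```bash
# Pull an image from Docker Hub and convert it into an Apptainer .sif file
apptainer pull tensorflow.sif docker://tensorflow/tensorflow:latest
```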
**Note** that `docker://tensorflow/tensorflow` is just an example link that is available on Docker Hub. You need to make sure you replace it with _your_ desired container link.
This command pulls the latest `tensorflow` image from Docker Hub and creates a `tensorflow.sif` file in your **current** directory (where you ran the command), which is ready for use with Apptainer.
**Warning:** Keep in mind that in most cases you will benefit from using NGC containers, as they are tailored for HPC workloads and cover a vast range of software for HPC applications.
However, in some cases, you might benefit from some existing Docker containers that may not be available on NGC.
## Apptainer job examples
We will go through two examples:
1. We will set up an NGC account, use it to pull the Pytorch container, and run a sample job with it.
2. We will use a TensorFlow container and run a sample job.
### NGC's Pytorch container example
#### Create and setup an NGC account
1. Visit [this link](https://ngc.nvidia.com/signin) and enter the email you'd like to sign up with (or sign in if you have an account already and skip to step 3).
**Note:** Don't be misled by the "login" page. If the email you enter does not match an existing account, you will automatically be redirected to the sign-up page.
2. Fill out all the required fields and verify your email afterwards, as prompted.
3. Log in to your account, then follow the steps at [this link](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) to set up an API key with which you will pull your containers.
**Note:** Make sure you save your API key once it is revealed after generation; you will later need it on the login node.
4. Once you have your API key ready, log in to the login node (Binary).
Run the following command:
```bash
ngc config set
```
You will be prompted to enter your API key. Go ahead and paste it, and then hit Enter.
For the rest of the prompts, just accept the default option and hit Enter.
Once the configuration is set successfully, you will be shown the path where it has been saved.
#### Pull and run Pytorch
If Pytorch is not what you need, you can pull any other container found at [this link](https://catalog.ngc.nvidia.com/containers?filters=&orderBy=weightPopularDESC&query=&page=&pageSize=).
Once you find your desired container, click on it, and look for the "**Get Container v**" button at the top right of the screen.
In this example's case, our container's link is `nvcr.io/nvidia/pytorch:23.05-py3`.
Run the following command, which both pulls your container and converts it to an `.sif` file that Apptainer can work with.
**Remember:** Apptainer is very similar to Docker, with the most crucial difference that it runs under user privileges rather than root.
Note the `nvcr.io/nvidia/pytorch:23.05-py3` section of the command. If you are pulling another container, make sure you replace it with the proper link.
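A sketch of that pull command, assuming you want the resulting image saved as `pytorch_23.05.sif` (the file name referenced later in this example):

```bash
# Pull the NGC PyTorch image and convert it into an Apptainer .sif file
apptainer pull pytorch_23.05.sif docker://nvcr.io/nvidia/pytorch:23.05-py3
```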
In Pytorch's case, this command is going to take a while, as the container is quite large.
Wait and let everything download and unpack as necessary. Remember that the time this takes varies from container to container, depending on the container's size.
Once the operation is completed successfully, you will be able to see the `pytorch_23.05.sif` file at the path you ran the command in.
Now be careful! Don't be tempted to execute the container on the login node (you can't, even if you try). The whole point of the cluster is to run workloads on the compute nodes and use the login node only as the entry/access point.
You can now write a script which uses the container we just installed (`pytorch_23.05.sif`) to execute some Python program. Here's how:
First, let's make our sample Pytorch program and save it inside a file called `pytorch_test.py`:
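The program just creates a random tensor, adds 2 to it, and prints CUDA device information (compare with the expected output shown further below). To submit it, you also need a batch script, `run_pytorch.sbatch`; here is a minimal sketch, assuming the node, GPU type, and `/home/mani/dev` paths used in this example (the job name and requested GPU count are illustrative):

```bash
#!/bin/bash
#SBATCH --job-name=pytorch_test              # illustrative name
#SBATCH --nodelist=gpu1
#SBATCH --gres=gpu:A100:8                    # gpu1 has 8x A100 GPUs; request fewer if you prefer
#SBATCH --ntasks=1
#SBATCH --output=/home/mani/dev/pytorch_test.out
#SBATCH --error=/home/mani/dev/pytorch_test.err

module load apptainer

# Run the test program inside the PyTorch container with GPU support (--nv)
apptainer exec --nv /home/mani/dev/pytorch_23.05.sif python /home/mani/dev/pytorch_test.py
```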
This `.sbatch` script simply tells Slurm to run the program on node `gpu1`, which has 8x A100 GPUs, and save the output in a file called `/home/mani/dev/pytorch_test.out`.
**Note:** You need to change the path for both the `.out` and `.err` file to _your_ desired/accessible path.
In the line `apptainer exec --nv /home/mani/dev/pytorch_23.05.sif python /home/mani/dev/pytorch_test.py`, you can see that we have provided _our_ paths to both the container's `.sif` file and the Python program.
Make sure you change these to where _you_ have saved those files.
After everything is ready to go, submit your `run_pytorch.sbatch` to SLURM:
```bash
sbatch run_pytorch.sbatch
```
This job takes roughly 10-15 seconds to complete.
If you run `squeue -u $USER` before it's completed, you will be able to see your job listed in the queue.
Once you don't see your job in the queue any more, it means it has completed. It's now time to check the `.out` file!
You should expect something like:
```text
[mani@binary dev]$ cat pytorch_test.out
starting: Tue Oct 15 17:15:42 EDT 2024
Random Tensor:
tensor([[0.6414, 0.6855, 0.5599],
[0.5254, 0.2902, 0.0842],
[0.0418, 0.1184, 0.9758],
[0.7644, 0.6543, 0.0109],
[0.9723, 0.4741, 0.8250]])
Tensor after adding 2:
tensor([[2.6414, 2.6855, 2.5599],
[2.5254, 2.2902, 2.0842],
[2.0418, 2.1184, 2.9758],
[2.7644, 2.6543, 2.0109],
[2.9723, 2.4741, 2.8250]])
Using CUDA
Current device: 0
Device count: 8
Device name: NVIDIA A100-SXM4-80GB
ended: Tue Oct 15 17:15:53 EDT 2024
```
**Note:** Even if your script executes successfully, an `.err` file will still be created; however, it will be empty.
If your output file is empty, check whether there is anything informative in the `.err` file to help diagnose the issue.
### TensorFlow container job example
In this example, we assume you have a TensorFlow `.sif` file available, obtained through one of the methods explained previously on this page.
Now let's proceed with how we can execute a TensorFlow script (`tfTest.py`), that trains a simple neural network on the MNIST dataset using GPUs.
First, create a Python script called `tfTest.py` with the provided content:
Next, create a SLURM batch job script named `job-test-nv-tf.sbatch`. This script requests GPU resources, loads the necessary modules, and runs your TensorFlow script inside an Apptainer container:
```bash
#!/bin/bash
#SBATCH --job-name=tensorflow_test_job
#SBATCH --output=result.txt
#SBATCH --nodelist=gpu1
#SBATCH --gres=gpu:A100:2
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=1000
module load python3
module load apptainer
echo"run Apptainer TensorFlow GPU"
apptainer run --nv tensorflowGPU.sif python3 tfTest.py
```
**Note** that in the line `apptainer run --nv tensorflowGPU.sif python3 tfTest.py`, you need to replace the name and path of the TensorFlow `.sif` image and the `tfTest.py` file with _your_ own if they are not in the same directory as the `sbatch` script.
This script runs the `tfTest.py` script inside the TensorFlow GPU container (`tensorflowGPU.sif`).
You can now submit your job to Slurm using `sbatch job-test-nv-tf.sbatch`.
After the job completes, you can check the output in `result.txt`. The output should include information about the available physical devices (GPUs), the TensorFlow version, and the output from training the model on the MNIST dataset.
The beginning and end of the file might look something like this:
Jupyter notebooks are interactive web-based environments that allow you to create and share documents containing live code, equations, visualizations, and narrative text. They're particularly useful for data analysis, scientific computing, and machine learning tasks, as they enable you to execute Python code in cells, see the results immediately, and document your workflow all in one place.
## Jupyter Notebook job example
As you may know, there is no Graphical User Interface (GUI) available when you connect to the cluster through your shell, so in order to access some application's GUI, port forwarding is necessary [(What is SSH port forwarding?)](https://www.youtube.com/watch?v=x1yQF1789cE&ab_channel=TonyTeachesTech). In this example, we will do port forwarding to access Jupyter Notebook's web portal. You will basically send and receive your data through a specified port on your local machine that is tunneled to the port on the cluster where the Jupyter Notebook server is running. This setup enables you to work with Jupyter Notebooks as if they were running locally on your machine, despite actually being executed on a remote cluster node. After a successful setup, you can access Jupyter's portal in the browser of your choice via a link generated by Jupyter **on your local machine**.
First, create your `sbatch` script file. I'm going to call mine `jupyterTest.sbatch`. Then add the following to it:
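Here is a minimal sketch of such a script, based on the directives explained below; the job name, the port (9001, matching the link shown later), and the exact wording of the echoed tunnel command are illustrative choices you can adapt:

```bash
#!/bin/bash
#SBATCH --job-name=jupyterTest
#SBATCH --nodelist=cn01
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --output=/home/username/outputs/jupyterTest.out
#SBATCH --error=/home/username/outputs/jupyterTest.err

# Collect the information needed to build the SSH tunnel command
# (illustrative choices; the port can be any free port on the node)
node=$(hostname -s)
user=$(whoami)
port=9001

# Write the tunnel command to the .out file so you can copy it and run it locally
echo "Run this on your local machine: ssh -N -L ${port}:${node}:${port} ${user}@binary"

module load jupyter
jupyter notebook --no-browser --port=${port} --ip=${node}
```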
Replace `/home/username/outputs/` with your actual directory path for storing output and error files.
First, let's take a look at what the new directives and commands in this script do:
Note that most of the directives at the start of this script have previously been discussed at "Basic batch job example", so we are only going to discuss the new ones:
-`--nodelist=cn01`: Using `--nodelist` you can specify the exact name(s) of the node(s) you want your job to run on. In this case, we have specified it to be `cn01`.
-`--ntasks=1`: This directive tells SLURM to allocate resources for one task. A "task" in this context is essentially an instance of your application or script running on the cluster. For many applications, especially those that don't explicitly parallelize their workload across multiple CPUs or nodes, specifying a single task is sufficient. However, if you're running applications that can benefit from parallel execution, you might increase this number. This directive is crucial for optimizing resource usage based on the specific needs of your job. For instance, running multiple independent instances of a data analysis script on different subsets of your data could be a scenario where increasing the number of tasks is beneficial.
-`--cpus-per-task=1`: This sets the number of CPUs allocated to each task specified by `--ntasks`. By default, setting it to 1 assigns one CPU to your task, which is fine for tasks that are not CPU-intensive or designed to run on a single thread. However, for applications that are multi-threaded and can utilize more than one CPU core for processing, you would increase this value to match the application's capability to parallelize its workload.
- The variable initializations such as `node=...`, `user=...` are used to retrieve some information from the node you are running your job on to produce the right command for you to later run **locally**, and set up the SSH tunnel. You don't need to worry about these.
- The `echo` command is going to write the ssh tunneling command to your `.out` file with the help of the variables. We will explain how to use that generated command further below.
-`module load jupyter`: Loads the required modules to add support for the command `jupyter`.
-`jupyter notebook --no-browser --port=${port} --ip=${node}` runs a jupyter notebook and makes it listen on our specified port and address to later be accessible through your local machine's browser.
Then, submit your batch job using `sbatch jupyterTest.sbatch`. Make sure to replace `jupyterTest.sbatch` with whatever file name and extension you chose.
At this stage, if you read the content of `jupyterTest.out`, you will find a generated command that looks similar to the following:
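For example (the port, node name, and username here are illustrative placeholders; copy the exact command from your own `.out` file):

```bash
ssh -N -L 9001:cn01:9001 username@binary
```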
Copy that line and run it in your local machine's command line. Then, enter your login credentials for `binary` and hit Enter. You should not expect anything magical to happen. In fact, if everything is successful, your shell will simply move to a new line without producing any output.
You can now access Jupyter's GUI through a browser of your choice on your local machine, at the address that Jupyter Notebook has generated for you. For some reason, Jupyter writes the address to `stderr`, so you must look for it inside your `jupyterTest.err` file. Inside that file, there should be a line containing a link similar to the following:
```bash
http://127.0.0.1:9001/?token=...(your token is here)...
```
Copy that address and paste it into your browser, and you should successfully access Jupyter's GUI.