Unverified Commit f0d81edd authored by Aishwary Shukla, committed by GitHub

Merge branch 'master' into aish-jsro-guide

parents d160d6ac 5f88f5b6
source "https://rubygems.org"
gem "jekyll-rtd-theme", git: "https://github.com/StarHPC/jekyll-rtd-theme"
#gem "jekyll-rtd-theme", git: "file:///home/Hofstra/jekyll-rtd-theme/.git/"
gem "github-pages", group: :jekyll_plugins
......
......@@ -3,6 +3,11 @@ lang: en
description: Star HPC – at Hofstra University
homeurl: https://starhpc.hofstra.io
#debug: true
#theme: jekyll-rtd-theme
# needed to build via GH actions
remote_theme: StarHPC/jekyll-rtd-theme
readme_index:
......
......@@ -44,8 +44,7 @@ key:
### I need Python package X but the one on Star is too old or I cannot find it
You can choose different Python versions with either the module system
or using Anaconda/Miniconda. See here: `/software/modules` and
`/software/python_r_perl`.
or using Anaconda/Miniconda. See [Environment modules]({{site.baseurl}}{% link software/env-modules.md %}).
In cases where this still doesn't solve your problem or you would like
to install a package yourself, please read the next section below about
......@@ -56,7 +55,7 @@ solution for you, please contact us and we will do our best to help you.
### Can I install Python software as a normal user without sudo rights?
Yes. Please see `/software/python_r_perl`.
Yes. Please see [Virtual environments]({{site.baseurl}}{% link software/virtual-env.md %}).
## Compute and storage quota
......@@ -81,7 +80,7 @@ File limits (inodes) -> These limit the number of files a user can create, regar
To check the quota of the main project storage (parallel file system - /fs1/proj/<project>), you can use this command:
To check the quota of the main project storage (parallel file system - `/fs1/proj/<project>`), you can use this command:
$ mmlsquota -j <fileset_name> <filesystem_name>
......@@ -123,7 +122,7 @@ your local PC.
### How can I access a compute node from the login node?
Please read about Interactive jobs at `/jobs/creating-jobs.md/`.
Please read about Interactive jobs at [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
### My ssh connections are dying / freezing
......@@ -145,18 +144,9 @@ you can take a look at this page explaining
[keepalives](https://the.earth.li/~sgtatham/putty/0.60/htmldoc/Chapter4.html#config-keepalive)
for a similar solution.
## Jobs and queue system
### I am not able to submit jobs longer than two days
Please read about `label_partitions`.
### Where can I find an example of job script?
You can find job script examples at `/jobs/creating-jobs.md/`.
Relevant application specific examples (also for beginning users) for a
few applications can be found in `sw_guides`.
You can find job script examples at [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
### When will my job start?
......@@ -178,6 +168,8 @@ new jobs are submitted that get higher priority.
In the command line, see the job queue by using `squeue`.
For a more comprehensive list of commands to monitor/manage your jobs, please see [Monitoring jobs]({{site.baseurl}}{% link jobs/monitoring-jobs.md %}).
### Why does my job not start or give me error feedback when submitting?
Most often the reason a job is not starting is that Star is full at
......@@ -186,8 +178,7 @@ there is an error in the job script and you are asking for a
configuration that is not possible on Star. In such a case the job
will not start.
To find out how to monitor your jobs and check their status see
`monitoring_jobs`.
To find out how to monitor your jobs and check their status see [Monitoring jobs]({{site.baseurl}}{% link jobs/monitoring-jobs.md %}).
Below are a few cases of why jobs don't start or error messages you
might get:
......@@ -204,7 +195,7 @@ core nodes - with both a total of 32 GB of memory/node. If you ask for
full nodes by specifying both number of nodes and cores/node together
with 2 GB of memory/core, you will ask for 20 cores/node and 40 GB of
memory. This configuration does not exist on Star. If you ask for 16
cores, still with 2GB/core, there is a sort of buffer within SLURM no
cores, still with 2GB/core, there is a sort of buffer within Slurm not
allowing you to consume absolutely all memory available (system needs
some to work). 2000MB/core works fine, but not 2 GB for 16 cores/node.
......@@ -219,8 +210,7 @@ mem-per-cpu 4000MB will cost you twice as much as mem-per-cpu 2000MB.
Please also note that if you want to use the whole memory on a node, do
not ask for 32GB, but for 31GB or 31000MB as the node needs some memory
for the system itself. For an example, see here:
`allocated_entire_memory`
for the system itself.
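
As a rough sketch of the arithmetic behind the cases above: the memory you request per node is cores per node times memory per core. The function name `requested_mb` is made up for illustration, and the script assumes that "2 GB" means 2048 MB and that a 32 GB node has 32 * 1024 MB in total.

```bash
#!/usr/bin/env bash
# Memory implied by a full-node request: cores per node times memory per core (MB).
requested_mb() {
  local cores=$1 mem_per_core_mb=$2
  echo $(( cores * mem_per_core_mb ))
}

# Assumption for illustration: "32 GB" of node memory means 32 * 1024 MB.
node_mb=$(( 32 * 1024 ))

# The three requests discussed above: 20 x 2 GB, 16 x 2 GB and 16 x 2000 MB.
for request in "20 2048" "16 2048" "16 2000"; do
  set -- $request
  total=$(requested_mb "$1" "$2")
  if [ "$total" -lt "$node_mb" ]; then
    verdict="leaves memory for the system"
  else
    verdict="leaves no memory for the system"
  fi
  echo "$1 cores x $2 MB = $total MB: $verdict"
done
```

Only the 16 x 2000 MB request stays below the node total, which matches the advice to ask for 31GB or 31000MB rather than 32GB.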
**Step memory limit**
......@@ -245,7 +235,7 @@ For instance:
QOSMaxWallDurationPerJobLimit means that MaxWallDurationPerJobLimit has
been exceeded. Basically, you have asked for more time than allowed for
the given QOS/Partition. Please have a look at `label_partitions`
the given QOS/Partition.
**Priority vs. Resources**
......@@ -253,14 +243,6 @@ Priority means that resources are in principle available, but someone
else has higher priority in the queue. Resources means that at the moment
the requested resources are not available.
### Why is my job not starting on highmem nodes although the highmem queue is empty?
To prevent the highmem nodes from standing around idle, normal jobs may
use them as well, using only 32 GB of the available memory. Hence, it is
possible that the highmem nodes are busy, although you do not see any
jobs queuing or running on <span class="title-ref">squeue -p
highmem</span>.
### How can I customize emails that I get after a job has completed?
Use the mail command and you can customize it to your liking but make
......@@ -276,7 +258,7 @@ script:
The overhead in the job start and cleanup makes it impractical to run
thousands of short tasks as individual jobs on Star.
The queueing setup on star, or rather, the accounting system generates
The queueing setup on Star, or rather, the accounting system generates
overhead in the start and finish of a job of about 1 second at each end
of the job. This overhead is insignificant when running large parallel
jobs, but creates scaling issues when running a massive amount of
......@@ -286,25 +268,86 @@ unparallelizable part of the job. This is because the queuing system can
only start and account one job at a time. This scaling problem is
described by [Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl's_law).
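
To get a feel for the numbers, here is a rough sketch of the arithmetic. It assumes the roughly 1 second of overhead at each end of a job mentioned above, and that jobs are started and accounted strictly one at a time. The function name `overhead_seconds` and the job counts are made up for illustration.

```bash
#!/usr/bin/env bash
# Serialized scheduler overhead, assuming about 1 second at job start and
# about 1 second at job end, handled one job at a time.
overhead_seconds() {
  local jobs=$1 per_job_overhead=2
  echo $(( jobs * per_job_overhead ))
}

# Hypothetical job counts, only to show how the overhead grows.
for n in 100 10000 100000; do
  secs=$(overhead_seconds "$n")
  echo "$n jobs: ${secs} s of overhead (about $(( secs / 60 )) min)"
done
```

Under these assumptions the overhead grows linearly with the number of jobs, no matter how short each task is.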
If the tasks are extremly short, you can use the example below. If you
want to spawn many jobs without polluting the queueing system, please
have a look at `job_arrays`.
If the tasks are extremely short (e.g. less than 1 second), you can use the example below.
If you want to spawn many jobs without polluting the queueing system, please
have a look at [array jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}#array-jobs).
By using some shell trickery one can spawn and load-balance multiple
independent tasks running in parallel within one node: just background
the tasks and poll to see when some task is finished until you spawn the
next:
<div class="literalinclude" language="bash">
```bash
#!/usr/bin/env bash
# Jobscript example that can run several tasks in parallel.
# All features used here are standard in bash so it should work on
# any sane UNIX/LINUX system.
# Author: roy.dragseth@uit.no
#
# This example will only work within one compute node so let's run
# on one node using all the cpu-cores:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
# We assume we will (in total) be done in 10 minutes:
#SBATCH --time=0-00:10:00
# Let us use all CPUs:
maxpartasks=$SLURM_TASKS_PER_NODE
# Let's assume we have a bunch of tasks we want to perform.
# Each task is done in the form of a shell script with a numerical argument:
# dowork.sh N
# Let's just create some fake arguments with a sequence of numbers
# from 1 to 100, edit this to your liking:
tasks=$(seq 100)
cd $SLURM_SUBMIT_DIR
for t in $tasks; do
# Do the real work, edit this section to your liking.
# remember to background the task or else we will
# run serially
./dowork.sh $t &
# You should leave the rest alone...
# count the number of background tasks we have spawned
# the jobs command prints one line per task running so we only need
# to count the number of lines.
activetasks=$(jobs | wc -l)
# if we have filled all the available cpu-cores with work we poll
# every second to wait for tasks to exit.
while [ $activetasks -ge $maxpartasks ]; do
sleep 1
activetasks=$(jobs | wc -l)
done
done
# Ok, all tasks spawned. Now we need to wait for the last ones to
# be finished before we exit.
echo "Waiting for tasks to complete"
wait
echo "done"
```
files/multiple.sh
And here is the `dowork.sh` script:
</div>
```bash
#!/usr/bin/env bash
And here is the `dowork.sh` script:
# Fake some work, $1 is the task number.
# Change this to whatever you want to have done.
<div class="literalinclude" language="bash">
# sleep between 0 and 10 secs
let sleeptime=10*$RANDOM/32768
files/dowork.sh
echo "Task $1 is sleeping for $sleeptime seconds"
sleep $sleeptime
echo "Task $1 has slept for $sleeptime seconds"
```
</div>
Source: [HPC-UiT FAQ](https://hpc-uit.readthedocs.io/en/latest/help/faq.html)
......@@ -28,7 +28,7 @@ Imagine a user is optimizing a complex algorithm's parameters. By initiating an
Batch jobs are submitted to a queue on the cluster and run without user interaction. This is the most common job type for tasks that don't require real-time feedback.
#### Example Scenario
You've developed a script for processing a large dataset that requires no human interaction to complete its task. By submitting this as a batch job, the cluster undertakes the task, allowing the job to run to completion and output the results to your desired location for you to view.
For a real example on Batch jobs, view `/jobs/creating-jobs.html`.
For a real example on Batch jobs, view [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
### 3. Array jobs
When you're faced with executing the same task multiple times with only slight variations, array jobs offer an efficient solution. This job type simplifies the process of managing numerous similar jobs by treating them as a single entity that varies only in a specified parameter.
......@@ -42,7 +42,7 @@ Imagine a fluid dynamics job that requires complex calculations spread over mill
## Resources
Resources within an HPC environment are finite and include CPUs, GPUs, memory, and storage. <br>
For a list of the resources available at Star HPC, take a look at `/quickstart/about-star.html`.
For a list of the resources available at Star HPC, take a look at [About star]({{site.baseurl}}{% link quickstart/about-star.md %}).
### Common Errors
Strains on the cluster occur when resources are over-requested or misallocated, leading to potential bottlenecks, decreased system performance, and extended wait times for job execution. <br>
......
......@@ -19,9 +19,9 @@ squeue
Sample output:
```bash
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234 batch my_job jsmith R 5:23 1 cn01
1235 batch array_job jdoe R 2:45 1 cn02
1235 batch arr_job jdoe R 2:45 1 cn02
1236 gpu gpu_task asmith PD 0:00 1 (Resources)
```
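
If you only care about jobs that are still waiting, a small filter over this output can help. The following is a minimal sketch: the function name `pending_jobs` is made up, and it assumes the default column layout shown in the sample above, with the state in column 5 and the reason in column 8.

```bash
#!/usr/bin/env bash
# List pending jobs (state PD) together with the reason Slurm reports for them.
# Reads squeue-style output on stdin and skips the header line.
pending_jobs() {
  awk 'NR > 1 && $5 == "PD" { print $1, $8 }'
}

# Sample text modeled on the output above; on the cluster you would pipe
# the real `squeue` output into the function instead.
pending_jobs <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234 batch my_job jsmith R 5:23 1 cn01
1235 batch arr_job jdoe R 2:45 1 cn02
1236 gpu gpu_task asmith PD 0:00 1 (Resources)
EOF
```

On the cluster the equivalent call would be `squeue | pending_jobs`, provided the output uses the same columns.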
......@@ -216,7 +216,25 @@ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
If a job fails, try checking the following:
1. Look at the job's output and error files.
2. Check the job's resource usage with `sacct`
3. Verify that you requested sufficient resources, and your job did not get terminated due to needing more resources than requested.
Remember, if you're having persistent issues, don't hesitate to reach out to the support team.
2. Check the job state and exit code:
```
sacct --brief
```
Sample output:
```
JobID State ExitCode
------------ ---------- --------
1040 TIMEOUT 0:0
1041 FAILED 6:0
1042 TIMEOUT 0:0
1043 FAILED 1:0
1046 COMPLETED 0:0
1047 RUNNING 0:0
```
`FAILED` indicates the process terminated with a non-zero exit code.
The first number in the ExitCode column is the exit code and the number after the colon is the signal that caused the process to terminate if it was terminated by a signal.
3. Check the job's resource usage with `sacct`
4. Verify that you requested sufficient resources, and your job did not get terminated due to needing more resources than requested.
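
The ExitCode format described in step 2 can be split with plain shell parameter expansion. This is only a sketch: the function name `describe_exitcode` is made up, and the value `0:9` is a hypothetical example of a signal termination that does not appear in the sample output.

```bash
#!/usr/bin/env bash
# Split an sacct ExitCode value such as "6:0" into its two parts.
describe_exitcode() {
  local exit_code=${1%%:*}   # number before the colon
  local signal=${1##*:}      # number after the colon
  if [ "$signal" -ne 0 ]; then
    echo "terminated by signal $signal"
  elif [ "$exit_code" -ne 0 ]; then
    echo "exited with code $exit_code"
  else
    echo "exit code 0, no signal"
  fi
}

# 0:0, 6:0 and 1:0 come from the sample output; 0:9 is hypothetical.
for value in 0:0 6:0 1:0 0:9; do
  echo "$value -> $(describe_exitcode "$value")"
done
```

Note that `0:0` alone does not mean success: in the sample output the `TIMEOUT` jobs also report `0:0`, so always read the State column as well.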
If you face persistent issues, please do not hesitate to reach out to us for help.
......@@ -4,7 +4,7 @@ sort: 2
# Submitting Jobs
This page is mainly dedicated to examples of different job types. For a more comprehensive explanation on different job types, please refer to [Jobs Overview]({{site.baseurl}}{% link jobs/Overview.md %})
This page is mainly dedicated to examples of different job types. For a more comprehensive explanation on different job types, please refer to [Jobs Overview]({{site.baseurl}}{% link jobs/Overview.md %}).
## Batch jobs (Non-interactive)
......@@ -69,9 +69,9 @@ python3 /path/to/python_script/my_script.py
Now let's walk through `my_script.sbatch` line by line to see what each directive does.
- `#!/bin/bash`: This line needs to be included at the start of **all** your batch scripts. It basically specifies the script to be run with a shell called `bash`.
- `#!/bin/bash`: This line needs to be included at the start of **all** your batch scripts. It basically specifies the script to be run with the `bash` shell.
Lines 2-7 are your `SBTACH` directives. These lines are where you specify different options for your job including its name, output and error files path/name, list of nodes you want to use, resource limits, and more if required. Let's walk through them line by line:
Lines 2-7 are your `SBATCH` directives. These lines are where you specify different options for your job including its name, output and error files path/name, list of nodes you want to use, resource limits, and more if required. Let's walk through them line by line:
- `#SBATCH --job-name=test_job`: This directive gives your job a name that you can later use to more easily track and manage your job when looking for it in the queue. In this example, we've called it `test_job`. You can read about job management at [Monitoring jobs]({{ site.baseurl }}{% link jobs/monitoring-jobs.md %}).
- `#SBATCH --output=test_job.out`: Used to specify where your output file is generated, and what it's going to be named. In this example, we have not provided a path, but only provided a name. When you use the `--output` directive without specifying a full path, just providing a filename, Slurm will store the output file in the current working directory from which the `sbatch` command was executed.
......@@ -89,160 +89,12 @@ After the last `#SBATCH` directive, commands are ran like any other regular shel
This script as discussed previously, is a non-interactive job. Non-interactive jobs are submitted to the queue with the use of the `sbatch` command. In this case, we submit our job using `sbatch my_script.sbatch`.
### Jupyter Notebook batch job example
As you know, there is no Graphical User Interface (GUI) available when you connect to the cluster through your shell, so port forwarding is necessary to reach an application's GUI [(What is SSH port forwarding?)](https://www.youtube.com/watch?v=x1yQF1789cE&ab_channel=TonyTeachesTech). In this example, we will use port forwarding to access Jupyter Notebook's web portal. You will send and receive your data through a specified port on your local machine that is tunneled to the port on the cluster where the Jupyter Notebook server is running. This setup lets you work with Jupyter Notebooks as if they were running locally on your machine, even though they are actually executed on a remote cluster node. After a successful setup, you can open Jupyter's portal in a browser **on your local machine** using a link generated by Jupyter.
First, create your sbatch script file. I'm going to call mine `jupyterTest.sbatch`. Then add the following to it:
```bash
#!/bin/bash
#SBATCH --nodelist=cn01
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
#SBATCH --job-name=jupyterTest1
#SBATCH --output=/home/mani/outputs/jupyterTest1.out
#SBATCH --error=/home/mani/outputs/jupyterTest1.err
# get tunneling info
XDG_RUNTIME_DIR=""
node=$(hostname -s)
user=$(whoami)
port=9001
# print tunneling instructions to jupyterTest1
echo -e "
Use the following command to set up ssh tunneling:
ssh -p5010 -N -f -L ${port}:${node}:${port} ${user}@binary.star.hofstra.edu"
module load jupyter
jupyter notebook --no-browser --port=${port} --ip=${node}
```
Replace `/home/mani/outputs/` with your actual directory path for storing output and error files.
First, let's take a look at what the new directives and commands in this script do:
Note that most of the directives at the start of this script have previously been discussed at "Basic batch job example", so we are only going to discuss the new ones:
- `--nodelist=cn01`: Using `--nodelist` you can specify the exact name(s) of the node(s) you want your job to run on. In this case, we have specified it to be `cn01`.
- `--ntasks=1`: This directive tells Slurm to allocate resources for one task. A "task" in this context is essentially an instance of your application or script running on the cluster. For many applications, especially those that don't explicitly parallelize their workload across multiple CPUs or nodes, specifying a single task is sufficient. However, if you're running applications that can benefit from parallel execution, you might increase this number. This directive is crucial for optimizing resource usage based on the specific needs of your job. For instance, running multiple independent instances of a data analysis script on different subsets of your data could be a scenario where increasing the number of tasks is beneficial.
- `--cpus-per-task=1`: This sets the number of CPUs allocated to each task specified by `--ntasks`. By default, setting it to 1 assigns one CPU to your task, which is fine for tasks that are not CPU-intensive or designed to run on a single thread. However, for applications that are multi-threaded and can utilize more than one CPU core for processing, you would increase this value to match the application's capability to parallelize its workload.
- The variable initializations such as `node=...`, `user=...` are used to retrieve some information from the node you are running your job on to produce the right command for you to later run **locally**, and set up the SSH tunnel. You don't need to worry about these.
- The `echo` command is going to write the ssh tunneling command to your `.out` file with the help of the variables. We will explain how to use that generated command further below.
- `module load jupyter`: Loads the required modules to add support for the command `jupyter`.
- `jupyter notebook --no-browser --port=${port} --ip=${node}` runs a jupyter notebook and makes it listen on our specified port and address to later be accessible through your local machine's browser.
Then, submit your Batch job using `sbatch jupyterTest.sbatch`. Make sure to replace `jupyterTest.sbatch` with whatever file name and extension you choose.
At this stage, if you go and read the content of `jupyterTest.out`, there is a generated command that must look like the following:
```bash
ssh -p5010 -N -f -L 9001:cn01:9001 <your-username>@binary.star.hofstra.edu
```
Copy that line and run it in your local machine's command line. Then, enter your login credentials for `binary` and hit enter. You should not expect anything magical to happen. In fact, if everything is successful, your shell would go to a new line without generating any output.
You can now access Jupyter's GUI through a browser of your choice on your local machine, at the address that Jupyter Notebook has generated for you. Jupyter writes the address to `stderr`, so look for it inside your `jupyterTest1.err` file. Inside that file, there should be a line containing a link similar to the following:
```bash
http://127.0.0.1:9001/?token=...(your token is here)...
```
Copy that address and paste it into your browser, and you should successfully reach Jupyter's GUI.
### Apptainer TensorFlow batch job example
This example shows how to execute a TensorFlow script, `tfTest.py`, that trains a simple neural network on the MNIST dataset using GPUs.
First, create a Python script called `tfTest.py` with the provided content:
```python
import tensorflow as tf
physical_devices = tf.config.list_physical_devices(device_type=None)
print("Num of Devices:", len(physical_devices))
print("Devices:\n", physical_devices)
print("Tensorflow version information:\n",tf.__version__)
print("begin test...")
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()
model.compile(optimizer='adam',
loss=loss_fn,
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10)
model.evaluate(x_test, y_test, verbose=2)
```
Next, create a Slurm batch job script named `job-test-nv-tf.sh`. This script requests GPU resources, loads necessary modules, and runs your TensorFlow script inside an Apptainer container:
```bash
#!/bin/bash
#SBATCH --job-name=tensorflow_test_job
#SBATCH --output=result.txt
#SBATCH --nodelist=gpu1
#SBATCH --gres=gpu:A100:2
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=1000
module load python3
module load apptainer
echo "run Apptainer TensorFlow GPU"
apptainer run --nv tensorflowGPU.sif python3 tfTest.py
```
This script runs the `tfTest.py` script inside the TensorFlow GPU container (`tensorflowGPU.sif`).
You can now submit your job to Slurm using `sbatch job-test-nv-tf.sh`.
After the job completes, you can check the output in `result.txt`. The output should include information about the available physical devices (GPUs), the TensorFlow version, and the output from training the model on the MNIST dataset.
The beginning and end of the file might look something like this:
```text
run Apptainer TensorFlow GPU
Num of Devices: X
Devices:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), ...]
Tensorflow version information:
X.XX.X
begin test...
...
313/313 - 0s - loss: X.XXXX - accuracy: 0.XXXX
```
You can find more job examples where we run TensorFlow and PyTorch containers at the [Apptainer]({{site.baseurl}}{% link software/apptainer.md %}) page.
## Interactive jobs
Interactive jobs are those where the user needs to provide input to the application through an interactive pseudo-terminal. For example, this includes the shell and Ncurses-based TUI utilities.
### Starting an Interactive job
To start an interactive job, you use the `srun` command with specific parameters that define your job's resource requirements. Here's an example:
......
......@@ -28,7 +28,7 @@ Members of Hofstra University, Nassau Community College, or Adelphi University,
### Requesting an account
To get an account on Star, you need to complete out the registration form at [Star Account Management Web Application](http://localhost:3000). There, you will need to provide us the following information:
To get an account on Star, you need to fill out the [request form](https://access.starhpc.hofstra.io/apply). There, you will need to provide us the following information:
- Your full name, date of birth, and nationality.
- Your position (master student, PhD, PostDoc, staff member,
......@@ -65,7 +65,17 @@ Submit the above information through the online registration form.
## Login node
Access to the cluster is provided through SSH access to the login node. The login node serves as the gateway or entry point to the cluster. Note that most software tools are not available on the login node and it is not for prototyping, building software, or running computationally intensive tasks itself. Instead, the login node is specifically for accessing the cluster and performing only very basic tasks, such as copying and moving files, submitting jobs, and checking the status of existing jobs. For development tasks, you would use one of the development nodes, which are accessed the same way as the large compute nodes. The compute nodes are where all the actual computational work is performed. They are accessed by launching jobs through Slurm with `sbatch` or `srun`.
### About the login node
The login node serves as the gateway or entry point to the cluster. Note that most software tools are not available on the login node and it is not for prototyping, building software, or running computationally intensive tasks itself. Instead, the login node is specifically for accessing the cluster and performing only very basic tasks, such as copying and moving files, submitting jobs, and checking the status of existing jobs. For development tasks, you would use one of the development nodes, which are accessed the same way as the large compute nodes. The compute nodes are where all the actual computational work is performed. They are accessed by launching jobs through Slurm with `sbatch` or `srun`.
### Connection and credentials
Access to the cluster is provided through SSH to the login node. Once your account has been created, you can access the login node using the address provided in your welcome email.
If you have existing Linux lab credentials, use them to log in. Otherwise, login credentials will be provided to you.
Additionally, the login node provides access to your Linux lab files. **But note that** the login node is **not** just another Linux lab machine. It simply shares some features (e.g., credentials) with the Linux lab for convenience.
## Scheduler policies
......
---
sort: 3
---
# Apptainer
## Why use Apptainer?
Apptainer is a tool available on the Star cluster for running containers.
Containers are isolated software environments that run applications packaged in an image format, which bundles the application and its dependencies. Similar to virtual machine images, the applications are already installed and are typically pre-configured. However, containers are lighter than virtual machines as containers run directly on the host operating system, while virtual machines include a full operating system of their own.
The use of containers not only allows for quicker and easier deployment of pre-configured applications, but since it isolates the application from the host system, it also simplifies dependency management, prevents potential version or dependency conflicts, and ensures consistency and reproducibility. This is especially critical with scientific applications, applications that have complex dependencies, and systems where multiple versions of the same software are needed, which is common in an HPC environment. Without the use of containers, ensuring that applications run consistently across different systems can be quite challenging due to varying software dependencies and configurations.
This approach allows you to bring already-built applications and workflows from other Linux environments to the Star cluster, and run them without any reconfiguration or additional installation. You can build a container image on your own local system and then run it on the cluster without any other setup, knowing that the application will be installed and configured the same way on both systems. An extensive ecosystem of container images is also available, so this allows you to run containerized applications without any of the hassle of setting them up or installing their dependencies in the first place.
## Why not use Docker?
Docker is probably the container platform you're most familiar with. It is widely used for development, but it was not built for HPC environments and is not compatible with HPC resource management or the security model of HPC clusters.
This is where Apptainer comes in. Apptainer is a Linux Foundation-supported fork of Singularity, a purpose-built container platform for use in HPC environments. Like Docker, Apptainer/Singularity provides a solution for encapsulating applications and their dependencies within lightweight portable container images. Unlike Docker, Apptainer is designed with the needs of high-performance computing in mind, which makes it the go-to choice for researchers and institutions with data-intensive applications.
Apptainer has some differences from Docker. Don't worry though. It is designed to be fully compatible with Docker and it can pull and run Docker images. So, you can still run Docker locally and then bring over the same images onto Star.
## Where can you get container images?
Apptainer can run containers from any Docker compatible image repository (e.g. DockerHub). Users of the Star cluster can also leverage the large collection of HPC-tailored container images from the NVIDIA GPU Cloud (NGC) repository.
Apptainer runs `.sif` files in the Singularity Image Format, which is different from the image files that Docker uses. If you're already familiar with `.sif` files and Apptainer/Singularity, you can skip to the examples section. Otherwise, here are two ways to get containers for Apptainer:
### Converting Existing Containers or Images
If you have existing Docker container images, you can convert them to `.sif` format using Apptainer's command-line tool. Here's an example:
1. First, ensure your local Docker image is available in your Docker Daemon:
```bash
docker images
```
2. Once you've confirmed the image is available locally, you can use Apptainer to build a `.sif` file from the local Docker image:
```bash
apptainer build tensorflow.sif docker-daemon://tensorflow/tensorflow:latest
```
**Note** that the `tensorflow` image is just an example. Make sure you replace it with your own image name.
This command tells Apptainer to use a locally available Docker image, make a `.sif` image file out of it, and save it as `tensorflow.sif`.
### Nvidia GPU Cloud (NGC)
But what if we want ready-to-pull HPC-tailored containers?
Nvidia GPU Cloud (NGC) containers are basically pre-configured and optimized Docker containers that include GPU-accelerated software for AI, machine learning, and high-performance computing (HPC) applications. These containers provide a ready-to-use environment with NVIDIA drivers, CUDA toolkit, and popular deep learning frameworks, that are also scanned for security vulnerabilities and exposures.
If you just pull an NGC container for the HPC software suite you have in mind, you don't need to spend time configuring complex software stacks or worry about compatibility issues between different libraries and frameworks.
You can take a look at the containers available on NGC [here](https://catalog.ngc.nvidia.com/containers?filters=&orderBy=weightPopularDESC&query=&page=&pageSize=).
### Pulling from Docker Hub
[Docker Hub](https://hub.docker.com/) is basically a cloud-based registry that allows users to store, share, and manage Docker container images.
Apptainer can also pull containers directly from Docker and automatically convert them to `.sif` format. Here's how you can do it:
```bash
apptainer pull --name tensorflow.sif docker://tensorflow/tensorflow
```
**Note** that `docker://tensorflow/tensorflow` is just an example link that is available on Docker Hub. You need to make sure you replace it with _your_ desired container link.
This command pulls the latest `tensorflow` image from Docker Hub and creates a `tensorflow.sif` file in your **current** directory (where you ran the command), which is ready for use with Apptainer.
**Warning:** Keep in mind that in most cases you will benefit from using NGC containers, as they are tailored for HPC workloads and cover a wide range of software used in HPC applications.
However, in some cases, you might benefit from some existing Docker containers that may not be available on NGC.
## Apptainer job examples
We will go through two examples:
1. We will set up an NGC account, use it to pull the PyTorch container, and run a sample job with it.
2. We will use a TensorFlow container and run a sample job.
### NGC's Pytorch container example
#### Create and setup an NGC account
1. Visit [this link](https://ngc.nvidia.com/signin) and enter the email you'd like to sign up with (or sign in if you have an account already and skip to step 3).
**Note:** Don't be misled by the "login" page. If the email you enter does not belong to an existing account, you will automatically be redirected to the sign-up page.
2. Fill out all the required fields and verify your email afterwards, as prompted.
3. Log in to your account, then follow the steps at [this link](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) to set up an API key with which you will pull your containers.
**Note:** Make sure you save your API key as soon as it is revealed after generation. You will later need to save it on the login node.
4. Once you have your API key ready, log in to the login node (Binary).
Run the following command:
```bash
ngc config set
```
You will be prompted to enter your API key. Go ahead and paste it, and then hit Enter.
For the rest of the prompts, just enter the default available option, and hit Enter.
After the configuration is set successfully, you will be shown the path where it was saved.
#### Pull and run Pytorch
If you don't find Pytorch useful, you can pull any other container found at [this link](https://catalog.ngc.nvidia.com/containers?filters=&orderBy=weightPopularDESC&query=&page=&pageSize=).
Once you find your desired container, click on it, and look for the "**Get Container v**" button at the top right of the screen.
In this example's case, our container's link is `nvcr.io/nvidia/pytorch:23.05-py3`.
Run the following command, which both pulls your container, and converts it to an `.sif` file which Apptainer can work with.
```bash
apptainer pull pytorch_23.05.sif docker://nvcr.io/nvidia/pytorch:23.05-py3
```
**Remember:** Apptainer is very similar to Docker, with the most crucial difference that it runs under user privileges rather than root.
Note the `nvcr.io/nvidia/pytorch:23.05-py3` section of the command. If you are pulling another container, make sure you replace it with the proper link.
In PyTorch's case, this command is going to take a while, as the container is quite large.
Wait and let everything download and unpack as necessary. The time this operation takes varies from container to container, depending on the container size.
Once the operation has completed successfully, you will see the `pytorch_23.05.sif` file in the directory where you ran the command.
Now be careful! Don't be tempted to execute the container on the login node (you can't, even if you try). The whole purpose of the cluster is to run work on the compute nodes and to use the login node only as the entry/access point.
You can now write a script which uses the container we just installed (`pytorch_23.05.sif`) to execute some Python program. Here's how:
First, let's make our sample Pytorch program and save it inside a file called `pytorch_test.py`:
```python
import torch


def main():
    # Create a random tensor
    x = torch.rand(5, 3)
    print("Random Tensor:")
    print(x)

    # Perform a simple operation
    y = x + 2
    print("\nTensor after adding 2:")
    print(y)

    # Check if CUDA is available and use it
    if torch.cuda.is_available():
        device = torch.device("cuda")
        x = x.to(device)
        y = y.to(device)
        print("\nUsing CUDA")
        # These queries need a visible GPU, so they stay inside this branch
        print(f"\nCurrent device: {torch.cuda.current_device()}")
        print(f"Device count: {torch.cuda.device_count()}")
        print(f"Device name: {torch.cuda.get_device_name(0)}")
    else:
        print("\nCUDA not available")


if __name__ == "__main__":
    main()
```
Now, make a script called `run_pytorch.sbatch` and save it with the following content:
```bash
#!/bin/bash
#SBATCH --job-name=pytorch_test
#SBATCH --output=/home/mani/dev/pytorch_test.out
#SBATCH --error=/home/mani/dev/pytorch_test.err
#SBATCH --nodelist=gpu1
start_time=$(date)
# Run the PyTorch script
echo "starting: $start_time"; echo ""
apptainer exec --nv /home/mani/dev/pytorch_23.05.sif python /home/mani/dev/pytorch_test.py
end_time=$(date)
echo ""
echo "ended: $end_time"
```
This `.sbatch` script simply tells Slurm to run this script on node `gpu1` which has 8x A100 GPUs, and save the output in a file called `/home/mani/dev/pytorch_test.out`.
**Note:** You need to change the path for both the `.out` and `.err` file to _your_ desired/accessible path.
You can see at line `apptainer exec --nv /home/mani/dev/pytorch_23.05.sif python /home/mani/dev/pytorch_test.py` we have provided _our_ working path to both the container's `.sif` file as well as the python program.
You need to make sure you change these to where _you_ have saved those files.
After everything is ready to go, submit your `run_pytorch.sbatch` to Slurm:
```bash
sbatch run_pytorch.sbatch
```
This job takes about 10 to 15 seconds to complete.
If you run `squeue -u $USER` before it has completed, you will see your job listed in the queue.
Once your job no longer appears in the queue, it has completed. It's now time to check the `.out` file!
You should expect something like:
```text
[mani@binary dev]$ cat pytorch_test.out
starting: Tue Oct 15 17:15:42 EDT 2024
Random Tensor:
tensor([[0.6414, 0.6855, 0.5599],
[0.5254, 0.2902, 0.0842],
[0.0418, 0.1184, 0.9758],
[0.7644, 0.6543, 0.0109],
[0.9723, 0.4741, 0.8250]])
Tensor after adding 2:
tensor([[2.6414, 2.6855, 2.5599],
[2.5254, 2.2902, 2.0842],
[2.0418, 2.1184, 2.9758],
[2.7644, 2.6543, 2.0109],
[2.9723, 2.4741, 2.8250]])
Using CUDA
Current device: 0
Device count: 8
Device name: NVIDIA A100-SXM4-80GB
ended: Tue Oct 15 17:15:53 EDT 2024
```
**Note:** The `.err` file is created even when your script runs successfully; in that case it is simply empty.
If your output file is empty, check the `.err` file for messages that help diagnose the issue.
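The check described above can also be scripted. The following is a minimal sketch, not part of the cluster tooling: the helper name `summarize_job_files` and its messages are made up for illustration, and it only looks at whether the two files exist and whether they are empty.

```python
from pathlib import Path


def summarize_job_files(out_path, err_path):
    """Return a short diagnosis based on which job files exist and are non-empty."""
    out, err = Path(out_path), Path(err_path)
    if not out.exists() and not err.exists():
        return "no files yet: the job has probably not started"
    if err.exists() and err.stat().st_size > 0:
        return "check the .err file: it contains messages"
    if out.exists() and out.stat().st_size > 0:
        return "job produced output and the .err file is empty"
    return "both files are empty: the job may still be running"


if __name__ == "__main__":
    # Paths are relative to the current directory; adjust them to your own.
    print(summarize_job_files("pytorch_test.out", "pytorch_test.err"))
```

A non-empty `.err` file does not always mean the job failed, since some programs write warnings there, so treat the result as a hint about where to look first.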
### TensorFlow container job example
In this example, we assume you have a TensorFlow `.sif` file available, through either of the methods we have explained previously in this page.
Now let's proceed with how we can execute a TensorFlow script (`tfTest.py`), that trains a simple neural network on the MNIST dataset using GPUs.
First, create a Python script called `tfTest.py` with the following content:
```python
import tensorflow as tf
physical_devices = tf.config.list_physical_devices(device_type=None)
print("Num of Devices:", len(physical_devices))
print("Devices:\n", physical_devices)
print("TensorFlow version information:\n",tf.__version__)
print("begin test...")
# Load MNIST and scale pixel values to the range [0, 1]
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

# Sanity check: run one sample through the untrained model and compute its loss
predictions = model(x_train[:1]).numpy()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10)
model.evaluate(x_test, y_test, verbose=2)
```
Next, create a SLURM batch job script named `job-test-nv-tf.sh`. This script requests GPU resources, loads necessary modules, and runs your TensorFlow script inside an Apptainer container:
```bash
#!/bin/bash
#SBATCH --job-name=tensorflow_test_job
#SBATCH --output=result.txt
#SBATCH --nodelist=gpu1
#SBATCH --gres=gpu:A100:2
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=1000
module load python3
module load apptainer
echo "run Apptainer TensorFlow GPU"
apptainer run --nv tensorflowGPU.sif python3 tfTest.py
```
**Note:** If your TensorFlow `.sif` image or your `tfTest.py` file is not in the same directory as the batch script, replace their names with the correct paths in the line `apptainer run --nv tensorflowGPU.sif python3 tfTest.py`.
This script runs `tfTest.py` inside the TensorFlow GPU container (`tensorflowGPU.sif`).
You can now submit your job to Slurm using `sbatch job-test-nv-tf.sh`.
After the job completes, you can check the output in `result.txt`. The output should include information about the available physical devices (GPUs), the TensorFlow version, and the output from training the model on the MNIST dataset.
The beginning and end of the file might look something like this:
```text
run Apptainer TensorFlow GPU
Num of Devices: X
Devices:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), ...]
TensorFlow version information:
X.XX.X
begin test...
...
313/313 - 0s - loss: X.XXXX - accuracy: 0.XXXX
```
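If you want to pull the final test metrics out of `result.txt` programmatically, a small parser is enough. This is a sketch under the assumption that the evaluation line has the `loss: ... - accuracy: ...` shape shown above; other TensorFlow/Keras versions may format it differently. The function name and the sample numbers are made up for illustration.

```python
import re

# Matches the "loss: <number> - accuracy: <number>" part of a Keras progress line
EVAL_LINE = re.compile(r"loss:\s*([0-9.]+)\s*-\s*accuracy:\s*([0-9.]+)")


def last_eval_metrics(text):
    """Return (loss, accuracy) from the last matching line, or None if absent."""
    matches = EVAL_LINE.findall(text)
    if not matches:
        return None
    loss, accuracy = matches[-1]
    return float(loss), float(accuracy)


if __name__ == "__main__":
    # Made-up sample line; in practice, read the contents of result.txt instead.
    sample = "313/313 - 0s - loss: 0.0712 - accuracy: 0.9781\n"
    print(last_eval_metrics(sample))
```

Because `model.evaluate` runs after `model.fit`, the last matching line in the file is the evaluation on the test set.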
---
sort: 1
---
# Environment modules
## Introduction to Environment Modules
......
---
sort: 4
---
# Jupyter Notebook
Jupyter Notebook is an interactive web application that provides an environment where you can create and share documents with live code, equations, visualizations, and narrative text. It is great for data analysis, scientific computing, and machine learning tasks. You can run Python code in cells, see results right away, and document your work all in one place.
## Running Jupyter Notebook
Jupyter Notebook is installed on the cluster and can be started like any other workload, by launching it through Slurm. Jupyter is available as an [environment module]({{site.baseurl}}{% link software/env-modules.md %}), so you load it into your environment with the `module` command. The example script that follows shows how.
Alternatively, you could run Jupyter in a container. That would make it easy to load the environment you need when there is a container image available with your desired toolset pre-installed. Check out [Apptainer]({{site.baseurl}}{% link software/apptainer.md %}) to learn more.
```note
Use Your Storage Effectively
{:.h4.mb-2}
The directory `/fs1/projects/{project-name}/` lives on the parallel file-system storage, where most of your work should reside. While your home directory (`/home/{username}/`) can be used for quick experiments and convenient access to scripts, keep in mind that it has limited capacity and worse performance. The parallel file-system storage is much faster and has far more space for your notebooks and data.
```
### Step 1: Create the Job Script
You would create a job script to launch Jupyter Notebook and most other applications on the cluster. As the compute nodes (where workloads run on the cluster) are not directly reachable from the campus network, you will need to perform SSH port forwarding to access your Jupyter Notebook instance. The following script starts Jupyter Notebook on an available port and provides you the SSH command needed to then reach it. You can copy and paste this example to get started. From the login node, save this as `jupyter.sbatch`:
```bash
#!/bin/bash
#SBATCH --nodelist=<compute-node>
#SBATCH --gpus=2
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
#SBATCH --job-name=jupyter_notebook
#SBATCH --output=/fs1/projects/<project-name>/%x_%j.out
#SBATCH --error=/fs1/projects/<project-name>/%x_%j.err
# Connection variables
LOGIN_NODE="<login-node-address>" # Set this to the login node's address from the welcome email
LOGIN_PORT="<login-port>" # Set this to the port number from the welcome email
XX="<xx>" # Set this to a number from 01-30
module load jupyter
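check_port() {
    # nc -z succeeds when something is listening on the port, so invert its
    # status: this function returns 0 (success) only when the port is free
    nc -z localhost "$1"
    return $(( ! $? ))
}
# Find an available port
port=8888
while ! check_port $port; do
    port=$((port + 1))
done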
compute_node=$(hostname -f)
user=$(whoami)
echo "==================================================================="
echo "To connect to your Jupyter notebook, run this command on your local machine:"
echo ""
echo "ssh -N -L ${port}:${compute_node}:${port} -J ${user}@adams204${XX}.hofstra.edu:${LOGIN_PORT},${user}@${LOGIN_NODE}:${LOGIN_PORT} ${user}@${LOGIN_NODE}"
echo ""
echo "When finished, clean up by running this command on the login node:"
echo "scancel ${SLURM_JOB_ID}"
echo "==================================================================="
# Start Jupyter notebook
jupyter notebook --no-browser --port=${port} --ip=0.0.0.0
```
The script uses these Slurm parameters:
- `--nodelist`: Specifies which compute node to use (e.g., `gpu1` or `cn01`)
- `--gpus=2`: This enables us to use 2 of the GPUs on the specified node. See each node's GPU information [here]({{site.baseurl}}{% link quickstart/about-star.md %}). Without this specification, you cannot see or use the GPUs on the compute node. Feel free to replace this number with another **valid option**.
- `--ntasks=1`: Runs one instance of Jupyter
- `--cpus-per-task=1`: Use one CPU thread. Note hyperthreading may be enabled on the compute nodes.
- `--time=00:30:00`: Sets a 30-minute time limit for the job (The format is `hh:mm:ss`)
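The port search in the job script can be easier to follow in Python. The sketch below mirrors the same idea (start at 8888 and move up until nothing answers on the port) using only the standard `socket` module. It is an illustration of the logic, not a replacement for the script; the function names and the `limit` parameter are assumptions made for this example.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return s.connect_ex((host, port)) == 0


def first_free_port(start=8888, limit=100):
    """Return the first port at or above start that nothing is listening on."""
    for port in range(start, start + limit):
        if not port_in_use(port):
            return port
    raise RuntimeError(f"no free port found in {start}..{start + limit - 1}")


if __name__ == "__main__":
    print(first_free_port())
```

As in the shell version, another process could still claim the port between the check and the moment Jupyter binds to it, so the result is a best effort rather than a guarantee.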
### Step 2: Replace the placeholders
The `<...>` placeholders need to be replaced with what _you_ need:
- `<login-node-address>` needs to be replaced with the address of the login node provided in your welcome email
- `<login-port>` needs to be replaced with the port number from your welcome email
- `<xx>` needs to be replaced with a number between 01-30 (inclusive)
- `<compute-node>` needs to be replaced with an available compute node from the cluster nodes list. You can find the full list of nodes on the [About Star]({{site.baseurl}}{% link quickstart/about-star.md %}) page.
- Change the path for the `--output` and `--error` directives to where _you_ would like these files to be saved.
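If you prefer to fill in the placeholders programmatically rather than by hand, something like the following works. It is a sketch only: the helper `fill_placeholders` is hypothetical, and the port value in the example is invented (use the values from your own welcome email).

```python
import re

# A placeholder is a lowercase name in angle brackets, e.g. <login-port>
PLACEHOLDER = re.compile(r"<[a-z][a-z-]*>")


def fill_placeholders(template, values):
    """Replace each <name> in template with values[name]; fail if one is missing."""
    def substitute(match):
        name = match.group(0)[1:-1]  # strip the angle brackets
        if name not in values:
            raise KeyError(f"no value given for placeholder <{name}>")
        return values[name]
    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    template = '#SBATCH --nodelist=<compute-node>\nLOGIN_PORT="<login-port>"\n'
    # "gpu1" is a node named in this documentation; "5010" is a made-up port.
    values = {"compute-node": "gpu1", "login-port": "5010"}
    print(fill_placeholders(template, values))
```

Raising an error for a missing value is deliberate: a job script submitted with a leftover `<...>` placeholder would fail later in a less obvious way.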
### Step 3: Submit the job
```bash
sbatch jupyter.sbatch
```
After you submit your job, you will see output showing your job's ID. Wherever this documentation uses the `<jobid>` placeholder, replace it with _your_ job ID.
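When scripting around `sbatch`, the job ID can be read from the line it prints, which normally has the form `Submitted batch job <number>`. The sketch below extracts the ID and derives the log file names that follow from the `%x_%j` pattern and the `jupyter_notebook` job name used in the script above. The function names and the ID `12345` are made up for illustration.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from the line printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError("could not find a job ID in the sbatch output")
    return int(match.group(1))


def log_names(job_id, job_name="jupyter_notebook"):
    """Return the (.out, .err) file names produced by the %x_%j pattern."""
    return f"{job_name}_{job_id}.out", f"{job_name}_{job_id}.err"


if __name__ == "__main__":
    job_id = parse_job_id("Submitted batch job 12345")  # made-up job ID
    print(job_id, log_names(job_id))
```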
_**Your job may not start right away!**_
{:.bg-yellow-light.color-orange-9.p-2}
If you run `squeue` immediately after submitting your job, you might see a message such as `Node Unavailable` next to your job. Another job may be actively using those resources, and your job will be held in the queue until your request can be satisfied by the available resources.
In that case, the `.out` and `.err` files will not exist yet, because your job hasn't started running.
Before proceeding to **Step 4**, wait until your job has changed to the `RUNNING` state as reported by the `squeue` command.
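One way to check the state from a script is to ask `squeue` for just the job ID and state, for example with `squeue -h -u $USER -o "%i %T"` (`-h` drops the header, `%i` is the job ID, and `%T` is the state), and then parse the result. The following sketch only does the parsing step; the function names and the sample output are invented for illustration.

```python
def job_states(squeue_output):
    """Map job ID to state from `squeue -h -o "%i %T"` style output."""
    states = {}
    for line in squeue_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def is_running(squeue_output, job_id):
    """Return True if the given job is listed with state RUNNING."""
    return job_states(squeue_output).get(str(job_id)) == "RUNNING"


if __name__ == "__main__":
    sample = "101 PENDING\n102 RUNNING\n"  # made-up squeue output
    print(is_running(sample, 102))
```

A job that is missing from the output has either finished or was never submitted, so `is_running` returns `False` for it as well.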
### Step 4: Check your output file for the SSH command
```bash
cat jupyter_notebook_<jobid>.out # Run this command in the directory the .out file is located.
```
Replace `<jobid>` with the job ID you received after submitting the job.
### Step 5: Run the SSH port-forwarding command
Open a new terminal on your local machine and run the SSH command provided in the output file. If prompted for a password, use your Linux lab password if you haven't set up SSH keys. You might be asked to enter your password multiple times. **Note** that the command will appear to hang after a successful connection; this is the expected behavior. Do not terminate the command (`Ctrl + C`) unless you intend to, as this will disconnect your Jupyter notebook session.
### Step 6: Find and open the link in your browser
Check the error file on the login node for your Jupyter notebook's URL:
```bash
cat jupyter_notebook_<jobid>.err | grep '127.0.0.1' # Run this command in the directory the .err file is located.
```
Replace `<jobid>` with the job ID you received after submitting the job.
_**Be patient!**_
{:.bg-yellow-light.color-orange-9.p-2}
Make sure you wait about 30 seconds after executing the SSH port-forwarding command on your local machine. It takes the `.err` file a little time to be updated and include your link.
You might see two lines being printed. Either link works.
Copy the URL from the error file and paste it into your **local machine's browser**.
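The same extraction can be done in Python, which also removes the duplicate when the link is printed twice. This is a sketch that assumes the log contains URLs starting with `http://127.0.0.1:<port>/`, as the `grep` command above does; the function name and the sample log lines are made up.

```python
import re

# Any URL on the loopback address, up to the next whitespace character
URL = re.compile(r"https?://127\.0\.0\.1:\d+/\S*")


def notebook_urls(log_text):
    """Return the unique 127.0.0.1 URLs in the log, in order of first appearance."""
    seen = []
    for url in URL.findall(log_text):
        if url not in seen:
            seen.append(url)
    return seen


if __name__ == "__main__":
    # Invented log excerpt; in practice, read jupyter_notebook_<jobid>.err instead.
    log = (
        "[I] Jupyter is running at:\n"
        "[I]  or http://127.0.0.1:8889/?token=abc123\n"
        "        http://127.0.0.1:8889/?token=abc123\n"
    )
    print(notebook_urls(log))
```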
### Step 7: Clean up
If you finish before the job reaches its time limit, clean up your session by running this command on the login node:
```bash
scancel <jobid>
```
Replace `<jobid>` with the job ID you received after submitting the job.
Afterwards, press `Ctrl + C` in the terminal session on your local computer where you ran the port-forwarding command. This terminates the SSH connection.
## Working on the Compute Node
Do you need to access the node running Jupyter Notebook? You can use `srun` to launch an interactive shell. Check out [interactive jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}#interactive-jobs) for more information.
---
sort: 2
---
# Virtual Environment Guide
Managing software dependencies and configurations can be challenging in an HPC environment. Users often need different versions of the same software or libraries, leading to conflicts and complex setups. [Environment modules]({{site.baseurl}}{% link software/env-modules.md %}) provide a solution by allowing users to dynamically modify their shell environment using simple commands. This simplifies the setup process, ensures that users have the correct software environment for their applications, and reduces conflicts and errors caused by incompatible software versions. Environment modules work on the same principle as virtual environments, i.e. the manipulation of environment variables. If an environment module is not available for a given version you need, you can instead create a virtual environment using the standard version manager tools provided with many common languages. Virtual environments allow for managing different versions of languages and dependencies independent of the system version or other virtual environments, so they are often used by developers to isolate dependencies for different projects.
......@@ -294,7 +298,7 @@ Remove the renv directory and associated files. This deletes the environment and
### How to create and use a virtual environment in Julia
Julia's built-in package manager, Pkg, provides functionality similar to virtual environments in other languages. The primary method is using project environments, which are defined by Project.toml and Manifest.toml files. These environments allow you to have project-specific package versions and dependencies. To create and manage these environments, you use Julia's REPL in package mode (accessed by pressing ']')
Julia's built-in package manager, Pkg, provides functionality similar to virtual environments in other languages. The primary method is using project environments, which are defined by Project.toml and Manifest.toml files. These environments allow you to have project-specific package versions and dependencies. To create and manage these environments, you use Julia's REPL in package mode (accessed by pressing `]`)
#### Setup environment
Create a new project directory and activate it as a Julia environment.
......