Commit f0d81edd authored by Aishwary Shukla, committed by GitHub

Merge branch 'master' into aish-jsro-guide

parents d160d6ac 5f88f5b6
source "https://rubygems.org"
gem "jekyll-rtd-theme", git: "https://github.com/StarHPC/jekyll-rtd-theme"
#gem "jekyll-rtd-theme", git: "file:///home/Hofstra/jekyll-rtd-theme/.git/"
gem "github-pages", group: :jekyll_plugins
......
......@@ -3,6 +3,11 @@ lang: en
description: Star HPC – at Hofstra University
homeurl: https://starhpc.hofstra.io
#debug: true
#theme: jekyll-rtd-theme
# needed to build via GH actions
remote_theme: StarHPC/jekyll-rtd-theme
readme_index:
......@@ -16,4 +21,4 @@ exclude:
plugins:
- jemoji
- jekyll-avatar
- jekyll-mentions
\ No newline at end of file
- jekyll-mentions
......@@ -44,8 +44,7 @@ key:
### I need Python package X but the one on Star is too old or I cannot find it
You can choose different Python versions with either the module system
or using Anaconda/Miniconda. See here: `/software/modules` and
`/software/python_r_perl`.
or using Anaconda/Miniconda. See [Environment modules]({{site.baseurl}}{% link software/env-modules.md %}).
In cases where this still doesn't solve your problem or you would like
to install a package yourself, please read the next section below about
......@@ -56,7 +55,7 @@ solution for you, please contact us and we will do our best to help you.
### Can I install Python software as a normal user without sudo rights?
Yes. Please see `/software/python_r_perl`.
Yes. Please see [Virtual environments]({{site.baseurl}}{% link software/virtual-env.md %}).
## Compute and storage quota
......@@ -81,7 +80,7 @@ File limits (inodes) -> These limit the number of files a user can create, regar
To check the quota of the main project storage (parallel file system - /fs1/proj/<project>), you can use this command:
To check the quota of the main project storage (parallel file system - `/fs1/proj/<project>`), you can use this command:
$ mmlsquota -j <fileset_name> <filesystem_name>
......@@ -123,7 +122,7 @@ your local PC.
### How can I access a compute node from the login node?
Please read about Interactive jobs at `/jobs/creating-jobs.md/`.
Please read about Interactive jobs at [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
### My ssh connections are dying / freezing
......@@ -145,18 +144,9 @@ you can take a look at this page explaining
[keepalives](https://the.earth.li/~sgtatham/putty/0.60/htmldoc/Chapter4.html#config-keepalive)
for a similar solution.
## Jobs and queue system
### I am not able to submit jobs longer than two days
Please read about `label_partitions`.
### Where can I find an example of job script?
You can find job script examples at `/jobs/creating-jobs.md/`.
Relevant application specific examples (also for beginning users) for a
few applications can be found in `sw_guides`.
You can find job script examples at [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
### When will my job start?
......@@ -178,6 +168,8 @@ new jobs are submitted that get higher priority.
In the command line, see the job queue by using `squeue`.
For a more comprehensive list of commands to monitor/manage your jobs, please see [Monitoring jobs]({{site.baseurl}}{% link jobs/monitoring-jobs.md %}).
### Why does my job not start or give me error feedback when submitting?
Most often the reason a job is not starting is that Star is full at
......@@ -186,8 +178,7 @@ there is an error in the job script and you are asking for a
configuration that is not possible on Star. In such a case the job
will not start.
To find out how to monitor your jobs and check their status see
`monitoring_jobs`.
To find out how to monitor your jobs and check their status see [Monitoring jobs]({{site.baseurl}}{% link jobs/monitoring-jobs.md %}).
Below are a few cases of why jobs don't start or error messages you
might get:
......@@ -204,7 +195,7 @@ core nodes - with both a total of 32 GB of memory/node. If you ask for
full nodes by specifying both number of nodes and cores/node together
with 2 GB of memory/core, you will ask for 20 cores/node and 40 GB of
memory. This configuration does not exist on Star. If you ask for 16
cores, still with 2GB/core, there is a sort of buffer within SLURM no
cores, still with 2GB/core, there is a sort of buffer within Slurm not
allowing you to consume absolutely all memory available (the system needs
some to work). 2000MB/core works fine, but not 2 GB for 16 cores/node.
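
The arithmetic behind this can be checked with a few lines of shell. This is only a minimal sketch: it assumes that a 32 GB node exposes 32768 MB in total and that `2GB` is counted as 2048 MB. The helper names `total_mem_mb` and `headroom_mb` are hypothetical and not part of Slurm.

```bash
#!/usr/bin/env bash
# Assumption: a 32 GB node, counted as 32 * 1024 MB.
node_mem_mb=32768

# Hypothetical helper: total memory (MB) requested on one node,
# given cores per node and memory per core in MB.
total_mem_mb() {
    local cores=$1 mem_per_core_mb=$2
    echo $(( cores * mem_per_core_mb ))
}

# Hypothetical helper: memory (MB) left on the node for the system.
headroom_mb() {
    echo $(( node_mem_mb - $(total_mem_mb "$1" "$2") ))
}

# 16 cores at 2 GB (2048 MB) each ask for the whole node.
headroom_mb 16 2048   # prints 0
# 16 cores at 2000 MB each leave some memory for the system.
headroom_mb 16 2000   # prints 768
```

A request that leaves zero or negative headroom corresponds to the configurations described above that cannot be satisfied.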
......@@ -219,8 +210,7 @@ mem-per-cpu 4000MB will cost you twice as much as mem-per-cpu 2000MB.
Please also note that if you want to use the whole memory on a node, do
not ask for 32GB, but for 31GB or 31000MB as the node needs some memory
for the system itself. For an example, see here:
`allocated_entire_memory`
for the system itself.
**Step memory limit**
......@@ -245,7 +235,7 @@ For instance:
QOSMaxWallDurationPerJobLimit means that MaxWallDurationPerJobLimit has
been exceeded. Basically, you have asked for more time than allowed for
the given QOS/Partition. Please have a look at `label_partitions`
the given QOS/Partition.
**Priority vs. Resources**
......@@ -253,14 +243,6 @@ Priority means that resources are in principle available, but someone
else has higher priority in the queue. Resources means that at the moment
the requested resources are not available.
### Why is my job not starting on highmem nodes although the highmem queue is empty?
To prevent the highmem nodes from standing around idle, normal jobs may
use them as well, using only 32 GB of the available memory. Hence, it is
possible that the highmem nodes are busy, although you do not see any
jobs queuing or running on <span class="title-ref">squeue -p
highmem</span>.
### How can I customize emails that I get after a job has completed?
Use the mail command and you can customize it to your liking but make
......@@ -276,7 +258,7 @@ script:
The overhead in the job start and cleanup makes it impractical to run
thousands of short tasks as individual jobs on Star.
The queueing setup on star, or rather, the accounting system generates
The queueing setup on Star, or rather, the accounting system generates
overhead in the start and finish of a job of about 1 second at each end
of the job. This overhead is insignificant when running large parallel
jobs, but creates scaling issues when running a massive amount of
......@@ -286,25 +268,86 @@ unparallelizable part of the job. This is because the queuing system can
only start and account one job at a time. This scaling problem is
described by [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law).
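
To get a feeling for the size of this overhead, here is a rough back-of-the-envelope sketch. It assumes about 2 seconds of overhead per job (1 second at each end, as stated above) and that this overhead is handled strictly one job at a time. The helper name `serial_overhead_s` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: serial scheduling overhead in seconds for a number
# of individual jobs, assuming the overhead cannot be parallelized.
serial_overhead_s() {
    local njobs=$1
    local overhead_per_job_s=${2:-2}   # about 1 s at start + 1 s at finish
    echo $(( njobs * overhead_per_job_s ))
}

# 10000 one-task jobs: roughly 5.5 hours spent on overhead alone.
serial_overhead_s 10000   # prints 20000
```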
If the tasks are extremly short, you can use the example below. If you
want to spawn many jobs without polluting the queueing system, please
have a look at `job_arrays`.
If the tasks are extremely short (e.g. less than 1 second), you can use the example below.
If you want to spawn many jobs without polluting the queueing system, please
have a look at [array jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}#array-jobs).
By using some shell trickery one can spawn and load-balance multiple
independent tasks running in parallel within one node: just background
the tasks and poll until some task has finished before you spawn the
next:
<div class="literalinclude" language="bash">
```bash
#!/usr/bin/env bash
# Jobscript example that can run several tasks in parallel.
# All features used here are standard in bash so it should work on
# any sane UNIX/LINUX system.
# Author: roy.dragseth@uit.no
#
# This example will only work within one compute node so let's run
# on one node using all the cpu-cores:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
# We assume we will (in total) be done in 10 minutes:
#SBATCH --time=0-00:10:00
# Let us use all CPUs:
maxpartasks=$SLURM_TASKS_PER_NODE
# Let's assume we have a bunch of tasks we want to perform.
# Each task is done in the form of a shell script with a numerical argument:
# dowork.sh N
# Let's just create some fake arguments with a sequence of numbers
# from 1 to 100, edit this to your liking:
tasks=$(seq 100)
cd $SLURM_SUBMIT_DIR
for t in $tasks; do
    # Do the real work, edit this section to your liking.
    # Remember to background the task or else we will
    # run serially.
    ./dowork.sh "$t" &
    # You should leave the rest alone...
    # Count the number of background tasks we have spawned.
    # The jobs command prints one line per running task, so we only need
    # to count the number of lines.
    activetasks=$(jobs | wc -l)
    # If we have filled all the available cpu-cores with work, we poll
    # every second to wait for tasks to exit.
    while [ "$activetasks" -ge "$maxpartasks" ]; do
        sleep 1
        activetasks=$(jobs | wc -l)
    done
done
# Ok, all tasks spawned. Now we need to wait for the last ones to
# be finished before we exit.
echo "Waiting for tasks to complete"
wait
echo "done"
```
files/multiple.sh
</div>
And here is the `dowork.sh` script:
<div class="literalinclude" language="bash">
files/dowork.sh
</div>
And here is the `dowork.sh` script:
```bash
#!/usr/bin/env bash
# Fake some work, $1 is the task number.
# Change this to whatever you want to have done.
# sleep between 0 and 10 secs
let sleeptime=10*$RANDOM/32768
echo "Task $1 is sleeping for $sleeptime seconds"
sleep $sleeptime
echo "Task $1 has slept for $sleeptime seconds"
```
Source: [HPC-UiT FAQ](https://hpc-uit.readthedocs.io/en/latest/help/faq.html)
......@@ -28,7 +28,7 @@ Imagine a user is optimizing a complex algorithm's parameters. By initiating an
Batch jobs are submitted to a queue on the cluster and run without user interaction. This is the most common job type for tasks that don't require real-time feedback.
#### Example Scenario
You've developed a script for processing a large dataset that requires no human interaction to complete its task. By submitting this as a batch job, the cluster undertakes the task, allowing the job to run to completion and output the results to your desired location for you to view.
For a real example on Batch jobs, view `/jobs/creating-jobs.html`.
For a real example on Batch jobs, view [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
### 3. Array jobs
When you're faced with executing the same task multiple times with only slight variations, array jobs offer an efficient solution. This job type simplifies the process of managing numerous similar jobs by treating them as a single entity that varies only in a specified parameter.
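
As an illustration, a job script for an array job might look like the following sketch. `--array` and `SLURM_ARRAY_TASK_ID` are standard Slurm features; the job name, the index range, the input file naming scheme, and the helper `input_for_task` are assumptions made up for this example.

```bash
#!/usr/bin/env bash
#SBATCH --job-name=array_example
#SBATCH --array=1-10
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# Hypothetical helper: map an array index to an input file name.
input_for_task() {
    printf 'input_%03d.dat\n' "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task. Default to 1 so the
# script can also be tried outside of Slurm.
task_id=${SLURM_ARRAY_TASK_ID:-1}
echo "Task ${task_id} would process $(input_for_task "${task_id}")"
```

Each of the ten array tasks runs the same script and differs only in the value of `SLURM_ARRAY_TASK_ID`.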
......@@ -42,7 +42,7 @@ Imagine a fluid dynamics job that requires complex calculations spread over mill
## Resources
Resources within an HPC environment are finite and include CPUs, GPUs, memory, and storage. <br>
For a list of the resources available at Star HPC, take a look at `/quickstart/about-star.html`.
For a list of the resources available at Star HPC, take a look at [About star]({{site.baseurl}}{% link quickstart/about-star.md %}).
### Common Errors
Strains on the cluster occur when resources are over-requested or misallocated, leading to potential bottlenecks, decreased system performance, and extended wait times for job execution. <br>
......
......@@ -19,10 +19,10 @@ squeue
Sample output:
```bash
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234 batch my_job jsmith R 5:23 1 cn01
1235 batch array_job jdoe R 2:45 1 cn02
1236 gpu gpu_task asmith PD 0:00 1 (Resources)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234 batch my_job jsmith R 5:23 1 cn01
1235 batch arr_job jdoe R 2:45 1 cn02
1236 gpu gpu_task asmith PD 0:00 1 (Resources)
```
To see **only** your job:
......@@ -216,7 +216,25 @@ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
If a job fails, try checking the following:
1. Look at the job's output and error files.
2. Check the job's resource usage with `sacct`
3. Verify that you requested sufficient resources, and your job did not get terminated due to needing more resources than requested.
Remember, if you're having persistent issues, don't hesitate to reach out to the support team.
2. Check the job state and exit code:
```
sacct --brief
```
Sample output:
```
JobID State ExitCode
------------ ---------- --------
1040 TIMEOUT 0:0
1041 FAILED 6:0
1042 TIMEOUT 0:0
1043 FAILED 1:0
1046 COMPLETED 0:0
1047 RUNNING 0:0
```
`FAILED` indicates the process terminated with a non-zero exit code.
The first number in the ExitCode column is the exit code and the number after the colon is the signal that caused the process to terminate if it was terminated by a signal.
3. Check the job's resource usage with `sacct`
4. Verify that you requested sufficient resources, and your job did not get terminated due to needing more resources than requested.
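
The ExitCode format described in the steps above can be taken apart with plain shell parameter expansion. This is a minimal sketch. The helper name `describe_exit_code` is hypothetical, and the `sacct` call in the final comment is one possible selection of columns rather than a required one.

```bash
#!/usr/bin/env bash
# Hypothetical helper: interpret an ExitCode value such as "6:0".
describe_exit_code() {
    local exit_code=${1%%:*}   # part before the colon
    local signal=${1##*:}      # part after the colon
    if [ "$signal" -ne 0 ]; then
        echo "terminated by signal ${signal}"
    elif [ "$exit_code" -ne 0 ]; then
        echo "exited with code ${exit_code}"
    else
        echo "exit code 0, no signal"
    fi
}

describe_exit_code "6:0"   # prints: exited with code 6
describe_exit_code "0:9"   # prints: terminated by signal 9

# On the cluster, the same columns can be requested for a single job, e.g.:
#   sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,MaxRSS
```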
If you face persistent issues, please do not hesitate to reach out to us for help.
......@@ -28,7 +28,7 @@ Members of Hofstra University, Nassau Community College, or Adelphi University,
### Requesting an account
To get an account on Star, you need to complete out the registration form at [Star Account Management Web Application](http://localhost:3000). There, you will need to provide us the following information:
To get an account on Star, you need to fill out the [request form](https://access.starhpc.hofstra.io/apply). There, you will need to provide us the following information:
- Your full name, date of birth, and nationality.
- Your position (master student, PhD, PostDoc, staff member,
......@@ -65,7 +65,17 @@ Submit the above information through the online registration form.
## Login node
Access to the cluster is provided through SSH access to the login node. The login node serves as the gateway or entry point to the cluster. Note that most software tools are not available on the login node and it is not for prototyping, building software, or running computationally intensive tasks itself. Instead, the login node is specifically for accessing the cluster and performing only very basic tasks, such as copying and moving files, submitting jobs, and checking the status of existing jobs. For development tasks, you would use one of the development nodes, which are accessed the same way as the large compute nodes. The compute nodes are where all the actual computational work is performed. They are accessed by launching jobs through Slurm with `sbatch` or `srun`.
### About the login node
The login node serves as the gateway or entry point to the cluster. Note that most software tools are not available on the login node and it is not for prototyping, building software, or running computationally intensive tasks itself. Instead, the login node is specifically for accessing the cluster and performing only very basic tasks, such as copying and moving files, submitting jobs, and checking the status of existing jobs. For development tasks, you would use one of the development nodes, which are accessed the same way as the large compute nodes. The compute nodes are where all the actual computational work is performed. They are accessed by launching jobs through Slurm with `sbatch` or `srun`.
### Connection and credentials
Access to the cluster is provided through SSH to the login node. Once your account has been created, you can access the login node using the address provided in your welcome email.
If you have existing Linux lab credentials, use them to log in. Otherwise, login credentials will be provided to you.
Additionally, the login node provides access to your Linux lab files. **Note, however, that** the login node is **not** just another Linux lab machine. It simply shares some features (e.g., credentials) with the Linux lab for convenience.
## Scheduler policies
......
---
sort: 1
---
# Environment modules
## Introduction to Environment Modules
......@@ -109,4 +113,4 @@ Example:
For further details, users are encouraged to refer to the man pages for module and modulefile:
```bash
man module
```
\ No newline at end of file
```
---
sort: 4
---
# Jupyter Notebook
Jupyter Notebook is an interactive web application that provides an environment where you can create and share documents with live code, equations, visualizations, and narrative text. It is great for data analysis, scientific computing, and machine learning tasks. You can run Python code in cells, see results right away, and document your work all in one place.
## Running Jupyter Notebook
Jupyter Notebook is installed on the cluster and can be started like any other workload, by launching it through Slurm. Jupyter is available as an [environment module]({{site.baseurl}}{% link software/env-modules.md %}), so it would be loaded into the environment with the `module` command. The example script that follows shows you this.
Alternatively, you could run Jupyter in a container. That would make it easy to load the environment you need when there is a container image available with your desired toolset pre-installed. Check out [Apptainer]({{site.baseurl}}{% link software/apptainer.md %}) to learn more.
```note
Use Your Storage Effectively
{:.h4.mb-2}
The directory `/fs1/projects/{project-name}/` lives on the parallel file-system storage, where most of your work should reside. While your home directory (`/home/{username}/`) can be used for quick experiments and convenient access to scripts, keep in mind that it has limited capacity and worse performance. The parallel file-system storage is much faster and has much more space for your notebooks and data.
```
### Step 1: Create the Job Script
You would create a job script to launch Jupyter Notebook and most other applications on the cluster. As the compute nodes (where workloads run on the cluster) are not directly reachable from the campus network, you will need to perform SSH port forwarding to access your Jupyter Notebook instance. The following script starts Jupyter Notebook on an available port and provides you the SSH command needed to then reach it. You can copy and paste this example to get started. From the login node, save this as `jupyter.sbatch`:
```bash
#!/bin/bash
#SBATCH --nodelist=<compute-node>
#SBATCH --gpus=2
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
#SBATCH --job-name=jupyter_notebook
#SBATCH --output=/fs1/projects/<project-name>/%x_%j.out
#SBATCH --error=/fs1/projects/<project-name>/%x_%j.err
# Connection variables
LOGIN_NODE="<login-node-address>" # Set this to the login node's address from the welcome email
LOGIN_PORT="<login-port>" # Set this to the port number from the welcome email
XX="<xx>" # Set this to a number from 01-30
module load jupyter
check_port() {
nc -z localhost $1
return $(( ! $? ))
}
# Find an available port
port=8888
while ! check_port $port; do
port=$((port + 1))
done
compute_node=$(hostname -f)
user=$(whoami)
echo "==================================================================="
echo "To connect to your Jupyter notebook, run this command on your local machine:"
echo ""
echo "ssh -N -L ${port}:${compute_node}:${port} -J ${user}@adams204${XX}.hofstra.edu:${LOGIN_PORT},${user}@${LOGIN_NODE}:${LOGIN_PORT} ${user}@${LOGIN_NODE}"
echo ""
echo "When finished, clean up by running this command on the login node:"
echo "scancel ${SLURM_JOB_ID}"
echo "==================================================================="
# Start Jupyter notebook
jupyter notebook --no-browser --port=${port} --ip=0.0.0.0
```
The script uses these Slurm parameters:
- `--nodelist`: Specifies which compute node to use (e.g., `gpu1` or `cn01`)
- `--gpus=2`: This enables us to use 2 of the GPUs on the specified node. See each node's GPU information [here]({{site.baseurl}}{% link quickstart/about-star.md %}). Without this specification, you cannot see or use the GPUs on the compute node. Feel free to replace this number with another **valid option**.
- `--ntasks=1`: Runs one instance of Jupyter
- `--cpus-per-task=1`: Use one CPU thread. Note hyperthreading may be enabled on the compute nodes.
- `--time=00:30:00`: Sets a 30-minute time limit for the job (The format is `hh:mm:ss`)
### Step 2: Replace the placeholders
The `<...>` placeholders need to be replaced with what _you_ need:
- `<login-node-address>` needs to be replaced with the address of the login node provided in your welcome email
- `<login-port>` needs to be replaced with the port number from your welcome email
- `<xx>` needs to be replaced with a number between 01-30 (inclusive)
- `<compute-node>` needs to be replaced with an available compute node from the cluster nodes list. You can find the full list of nodes on the [About Star]({{site.baseurl}}{% link quickstart/about-star.md %}) page.
- Change the path for the `--output` and `--error` directives to where _you_ would like these files to be saved.
### Step 3: Submit the job
```bash
sbatch jupyter.sbatch
```
After you submit your job to the queue, the output shows your job's ID. Replace the `<jobid>` placeholder with _your_ job ID wherever it appears in this documentation.
_**Your job may not start right away!**_
{:.bg-yellow-light.color-orange-9.p-2}
If you run `squeue` immediately after submitting your job, you might see a message such as `Node Unavailable` next to your job. Another job may be actively using those resources, and your job will be held in the queue until your request can be satisfied by the available resources.
In that case, the `.out` and `.err` files will not exist yet, as your job hasn't started running.
Before proceeding to **Step 4**, wait until your job has changed to the `RUNNING` state as reported by the `squeue` command.
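
If you would rather not re-run `squeue` by hand, the waiting can be scripted. The following is a minimal sketch, not an official tool: the helper names `job_state` and `wait_until_running` are hypothetical, and the 10-second default polling interval is an arbitrary choice. It relies on the `squeue` options `-h` (no header), `-j` (job ID), and `-o "%T"` (print the job state).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the state of a job (e.g. PENDING, RUNNING).
job_state() {
    squeue -h -j "$1" -o "%T"
}

# Hypothetical helper: poll until the job is RUNNING, or give up when it
# no longer appears in the queue.
wait_until_running() {
    local jobid=$1
    local interval=${2:-10}
    local state
    while true; do
        state=$(job_state "$jobid")
        case "$state" in
            RUNNING) echo "Job ${jobid} is running"; return 0 ;;
            "")      echo "Job ${jobid} is not in the queue" >&2; return 1 ;;
        esac
        sleep "$interval"
    done
}

# Usage on the login node:
#   wait_until_running <jobid>
```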
### Step 4: Check your output file for the SSH command
```bash
cat jupyter_notebook_<jobid>.out # Run this command in the directory the .out file is located.
```
Replace `<jobid>` with the job ID you received after submitting the job.
### Step 5: Run the SSH port-forwarding command
Open a new terminal on your local machine and run the SSH command provided in the output file. If prompted for a password, use your Linux lab password if you haven't set up SSH keys. You might be requested to enter your password multiple times. **Note** that the command will appear to hang after successful connection - this is the expected behavior. Do not terminate the command (`Ctrl + C`) as this will disconnect your Jupyter notebook session (unless you intend to do so).
### Step 6: Find and open the link in your browser
Check the error file on the login node for your Jupyter notebook's URL:
```bash
cat jupyter_notebook_<jobid>.err | grep '127.0.0.1' # Run this command in the directory the .err file is located.
```
Replace `<jobid>` with the job ID you received after submitting the job.
_**Be patient!**_
{:.bg-yellow-light.color-orange-9.p-2}
Make sure you wait about 30 seconds after executing the SSH port-forwarding command on your local machine. It takes the `.err` file a little time to be updated and include your link.
You might see two lines being printed. Either link works.
Copy the URL from the error file and paste it into your **local machine's browser**.
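
To print only the URL rather than the whole matching line, a small helper can be used. This is a sketch under the assumption that the `.err` file contains a link of the form `http://127.0.0.1:<port>/...`. The function name `jupyter_url` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first 127.0.0.1 URL found in a log file.
jupyter_url() {
    grep -o 'http://127\.0\.0\.1:[0-9]*/[^[:space:]]*' "$1" | head -n 1
}

# Usage, in the directory that holds the .err file:
#   jupyter_url jupyter_notebook_<jobid>.err
```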
### Step 7: Clean up
If you finish before the job reaches its walltime limit, clean up your session by running this command on the login node:
```bash
scancel <jobid>
```
Replace `<jobid>` with the job ID you received after submitting the job.
Afterwards, press `Ctrl + C` in the terminal session on your local computer where you ran the port-forwarding command. This terminates the SSH connection.
## Working on the Compute Node
Do you need to access the node running Jupyter Notebook? You can use `srun` to launch an interactive shell. Check out [interactive jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}#interactive-jobs) for more information.
---
sort: 2
---
# Virtual Environment Guide
Managing software dependencies and configurations can be challenging in an HPC environment. Users often need different versions of the same software or libraries, leading to conflicts and complex setups. [Environment modules]({{site.baseurl}}{% link software/env-modules.md %}) provide a solution by allowing users to dynamically modify their shell environment using simple commands. This simplifies the setup process, ensures that users have the correct software environment for their applications, and reduces conflicts and errors caused by incompatible software versions. Environment modules work on the same principle as virtual environments, i.e. the manipulation of environment variables. If an environment module is not available for a given version you need, you can instead create a virtual environment using the standard version manager tools provided with many common languages. Virtual environments allow for managing different versions of languages and dependencies independent of the system version or other virtual environments, so they are often used by developers to isolate dependencies for different projects.
......@@ -294,7 +298,7 @@ Remove the renv directory and associated files. This deletes the environment and
### How to create and use a virtual environment in Julia
Julia's built-in package manager, Pkg, provides functionality similar to virtual environments in other languages. The primary method is using project environments, which are defined by Project.toml and Manifest.toml files. These environments allow you to have project-specific package versions and dependencies. To create and manage these environments, you use Julia's REPL in package mode (accessed by pressing ']')
Julia's built-in package manager, Pkg, provides functionality similar to virtual environments in other languages. The primary method is using project environments, which are defined by Project.toml and Manifest.toml files. These environments allow you to have project-specific package versions and dependencies. To create and manage these environments, you use Julia's REPL in package mode (accessed by pressing `]`).
#### Setup environment
Create a new project directory and activate it as a Julia environment.
......