Unverified Commit f0d81edd authored by Aishwary Shukla's avatar Aishwary Shukla Committed by GitHub

Merge branch 'master' into aish-jsro-guide

parents d160d6ac 5f88f5b6
source "https://rubygems.org" source "https://rubygems.org"
gem "jekyll-rtd-theme", git: "https://github.com/StarHPC/jekyll-rtd-theme" gem "jekyll-rtd-theme", git: "https://github.com/StarHPC/jekyll-rtd-theme"
#gem "jekyll-rtd-theme", git: "file:///home/Hofstra/jekyll-rtd-theme/.git/"
gem "github-pages", group: :jekyll_plugins gem "github-pages", group: :jekyll_plugins
...
@@ -3,6 +3,11 @@ lang: en
description: Star HPC – at Hofstra University
homeurl: https://starhpc.hofstra.io
#debug: true
#theme: jekyll-rtd-theme
# needed to build via GH actions
remote_theme: StarHPC/jekyll-rtd-theme
readme_index:
@@ -16,4 +21,4 @@ exclude:
plugins:
- jemoji
- jekyll-avatar
- jekyll-mentions
\ No newline at end of file
@@ -44,8 +44,7 @@ key:

### I need Python package X but the one on Star is too old or I cannot find it

You can choose different Python versions with either the module system
or using Anaconda/Miniconda. See [Environment modules]({{site.baseurl}}{% link software/env-modules.md %}).
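As a sketch, switching Python versions with the module system looks like this (the exact module names available on Star are an assumption; check with `module avail`):

```shell
# List the Python-related modules the cluster provides
module avail python

# Load a specific version (the version string here is an example)
module load python/3.11

# Confirm which interpreter is now first in PATH
python --version
```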
In cases where this still doesn't solve your problem or you would like
to install a package yourself, please read the next section below about

@@ -56,7 +55,7 @@ solution for you, please contact us and we will do our best to help you.
### Can I install Python software as a normal user without sudo rights?

Yes. Please see [Virtual environments]({{site.baseurl}}{% link software/virtual-env.md %}).
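A minimal sketch using Python's built-in `venv` module (the path and package name below are examples, not a site convention):

```shell
# Create an isolated environment under your home directory
python3 -m venv "$HOME/envs/myproject"

# Activate it; pip now installs into the environment, no sudo needed
source "$HOME/envs/myproject/bin/activate"
pip install --upgrade pip
pip install requests        # example package, installs without root

# Leave the environment when done
deactivate
```

Anaconda/Miniconda environments work on the same principle; see the virtual environments page for the recommended workflow on Star.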
## Compute and storage quota

@@ -81,7 +80,7 @@ File limits (inodes) -> These limit the number of files a user can create, regar

To check the quota of the main project storage (parallel file system - `/fs1/proj/<project>`), you can use this command:

    $ mmlsquota -j <fileset_name> <filesystem_name>
@@ -123,7 +122,7 @@ your local PC.

### How can I access a compute node from the login node?

Please read about Interactive jobs at [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).

### My SSH connections are dying / freezing

@@ -145,18 +144,9 @@ you can take a look at this page explaining

[keepalives](https://the.earth.li/~sgtatham/putty/0.60/htmldoc/Chapter4.html#config-keepalive)
for a similar solution.
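With OpenSSH clients, the equivalent of PuTTY's keepalive setting is `ServerAliveInterval` in `~/.ssh/config`; a sketch (the host alias is an example, and the address placeholder should be the login node address from your welcome email):

```
Host star
    HostName <login-node-address>
    ServerAliveInterval 60
    ServerAliveCountMax 3
```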
## Jobs and queue system
### I am not able to submit jobs longer than two days
Please read about `label_partitions`.
### Where can I find an example of a job script?

You can find job script examples at [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
Relevant application specific examples (also for beginning users) for a
few applications can be found in `sw_guides`.
### When will my job start?

@@ -178,6 +168,8 @@ new jobs are submitted that get higher priority.

In the command line, see the job queue by using `squeue`.
For a more comprehensive list of commands to monitor/manage your jobs, please see [Monitoring jobs]({{site.baseurl}}{% link jobs/monitoring-jobs.md %}).
### Why does my job not start or give me error feedback when submitting?

Most often the reason a job is not starting is that Star is full at

@@ -186,8 +178,7 @@ there is an error in the job script and you are asking for a

configuration that is not possible on Star. In such a case the job
will not start.

To find out how to monitor your jobs and check their status see [Monitoring jobs]({{site.baseurl}}{% link jobs/monitoring-jobs.md %}).

Below are a few cases of why jobs don't start or error messages you
might get:
@@ -204,7 +195,7 @@ core nodes - with both a total of 32 GB of memory/node. If you ask for

full nodes by specifying both number of nodes and cores/node together
with 2 GB of memory/core, you will ask for 20 cores/node and 40 GB of
memory. This configuration does not exist on Star. If you ask for 16
cores, still with 2 GB/core, there is a sort of buffer within Slurm not
allowing you to consume absolutely all memory available (the system needs
some to work). 2000MB/core works fine, but not 2 GB for 16 cores/node.
@@ -219,8 +210,7 @@ mem-per-cpu 4000MB will cost you twice as much as mem-per-cpu 2000MB.

Please also note that if you want to use the whole memory on a node, do
not ask for 32GB, but for 31GB or 31000MB as the node needs some memory
for the system itself.
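As a sketch, the points above translate into requests along these lines (the counts are illustrative, not a site recommendation):

```bash
# Per-core request that leaves Slurm its buffer:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=2000M   # 2000M/core works; 2G/core would not fit on a 32 GB node

# Or, to use (almost) the whole memory of a 32 GB node:
# #SBATCH --mem=31000M        # not 32G: the node keeps some memory for the system
```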
**Step memory limit**

@@ -245,7 +235,7 @@ For instance:

QOSMaxWallDurationPerJobLimit means that MaxWallDurationPerJobLimit has
been exceeded. Basically, you have asked for more time than allowed for
the given QOS/Partition.
**Priority vs. Resources**

@@ -253,14 +243,6 @@ Priority means that resources are in principle available, but someone

else has higher priority in the queue. Resources means that at the moment
the requested resources are not available.
### Why is my job not starting on highmem nodes although the highmem queue is empty?
To prevent the highmem nodes from standing around idle, normal jobs may
use them as well, using only 32 GB of the available memory. Hence, it is
possible that the highmem nodes are busy, although you do not see any
jobs queuing or running on `squeue -p highmem`.
### How can I customize emails that I get after a job has completed?

Use the mail command and you can customize it to your liking but make

@@ -276,7 +258,7 @@ script:
The overhead in the job start and cleanup makes it impractical to run
thousands of short tasks as individual jobs on Star.

The queueing setup on Star, or rather, the accounting system generates
overhead in the start and finish of a job of about 1 second at each end
of the job. This overhead is insignificant when running large parallel
jobs, but creates scaling issues when running a massive amount of

@@ -286,25 +268,86 @@ unparallelizable part of the job. This is because the queuing system can

only start and account one job at a time. This scaling problem is
described by [Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl's_law).
If the tasks are extremely short (e.g. less than 1 second), you can use the example below.

If you want to spawn many jobs without polluting the queueing system, please
have a look at [array jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}#array-jobs).
By using some shell trickery one can spawn and load-balance multiple
independent tasks running in parallel within one node: just background
the tasks and poll to see when some task is finished until you spawn the
next:

```bash
#!/usr/bin/env bash
# Jobscript example that can run several tasks in parallel.
# All features used here are standard in bash so it should work on
# any sane UNIX/LINUX system.
# Author: roy.dragseth@uit.no
#
# This example will only work within one compute node so let's run
# on one node using all the cpu-cores:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
# We assume we will (in total) be done in 10 minutes:
#SBATCH --time=0-00:10:00
# Let us use all CPUs:
maxpartasks=$SLURM_TASKS_PER_NODE
# Let's assume we have a bunch of tasks we want to perform.
# Each task is done in the form of a shell script with a numerical argument:
# dowork.sh N
# Let's just create some fake arguments with a sequence of numbers
# from 1 to 100, edit this to your liking:
tasks=$(seq 100)
cd $SLURM_SUBMIT_DIR
for t in $tasks; do
# Do the real work, edit this section to your liking.
# remember to background the task or else we will
# run serially
./dowork.sh $t &
# You should leave the rest alone...
# count the number of background tasks we have spawned
# the jobs command print one line per task running so we only need
# to count the number of lines.
activetasks=$(jobs | wc -l)
# if we have filled all the available cpu-cores with work we poll
# every second to wait for tasks to exit.
while [ $activetasks -ge $maxpartasks ]; do
sleep 1
activetasks=$(jobs | wc -l)
done
done
# Ok, all tasks spawned. Now we need to wait for the last ones to
# be finished before we exit.
echo "Waiting for tasks to complete"
wait
echo "done"
```
And here is the `dowork.sh` script:

```bash
#!/usr/bin/env bash
# Fake some work, $1 is the task number.
# Change this to whatever you want to have done.
# sleep between 0 and 10 secs
let sleeptime=10*$RANDOM/32768
echo "Task $1 is sleeping for $sleeptime seconds"
sleep $sleeptime
echo "Task $1 has slept for $sleeptime seconds"
```

Source: [HPC-UiT FAQ](https://hpc-uit.readthedocs.io/en/latest/help/faq.html)
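Assuming the two scripts are saved as `multiple.sh` and `dowork.sh` in the same directory, they would be used like this:

```bash
chmod +x dowork.sh   # the job script invokes ./dowork.sh directly
sbatch multiple.sh
```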
@@ -28,7 +28,7 @@ Imagine a user is optimizing a complex algorithm's parameters. By initiating an

Batch jobs are submitted to a queue on the cluster and run without user interaction. This is the most common job type for tasks that don't require real-time feedback.

#### Example Scenario

You've developed a script for processing a large dataset that requires no human interaction to complete its task. By submitting this as a batch job, the cluster undertakes the task, allowing the job to run to completion and output the results to your desired location for you to view.

For a real example of batch jobs, view [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
### 3. Array jobs

When you're faced with executing the same task multiple times with only slight variations, array jobs offer an efficient solution. This job type simplifies the process of managing numerous similar jobs by treating them as a single entity that varies only in a specified parameter.
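A minimal sketch of such a job script (the helper script name, input naming scheme, and index range are illustrative assumptions):

```bash
#!/usr/bin/env bash
#SBATCH --job-name=array-example
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --array=1-100             # 100 near-identical tasks, one index each

# Slurm sets SLURM_ARRAY_TASK_ID to the index of the current task;
# use it to select this task's input (process_one.sh is hypothetical).
./process_one.sh "input_${SLURM_ARRAY_TASK_ID}.dat"
```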
@@ -42,7 +42,7 @@ Imagine a fluid dynamics job that requires complex calculations spread over mill

## Resources

Resources within an HPC environment are finite and include CPUs, GPUs, memory, and storage. <br>
For a list of the resources available at Star HPC, take a look at [About Star]({{site.baseurl}}{% link quickstart/about-star.md %}).
### Common Errors

Strains on the cluster occur when resources are over-requested or misallocated, leading to potential bottlenecks, decreased system performance, and extended wait times for job execution. <br>

...
@@ -19,10 +19,10 @@ squeue

Sample output:

```bash
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 1234     batch   my_job   jsmith  R  5:23     1 cn01
 1235     batch  arr_job     jdoe  R  2:45     1 cn02
 1236       gpu gpu_task   asmith PD  0:00     1 (Resources)
```

To see **only** your job:
@@ -216,7 +216,25 @@ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

If a job fails, try checking the following:

1. Look at the job's output and error files.

2. Check the job state and exit code:

```
sacct --brief
```
Sample output:
```
JobID             State ExitCode
------------ ---------- --------
1040            TIMEOUT      0:0
1041             FAILED      6:0
1042            TIMEOUT      0:0
1043             FAILED      1:0
1046          COMPLETED      0:0
1047            RUNNING      0:0
```
`FAILED` indicates the process terminated with a non-zero exit code.
The first number in the ExitCode column is the exit code; the number after the colon is the signal that caused the process to terminate, if it was terminated by a signal.

3. Check the job's resource usage with `sacct`.
4. Verify that you requested sufficient resources, and your job did not get terminated due to needing more resources than requested.
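The convention can be verified in any shell: a process killed by signal N exits with status 128+N, which `sacct` records as `0:N` in the ExitCode column, while an ordinary failure with exit code N shows as `N:0`:

```bash
# Plain failure: exit code 6, no signal (sacct would record 6:0)
bash -c 'exit 6'
echo "status: $?"     # prints: status: 6

# Terminated by SIGTERM (signal 15): the shell reports 128+15
bash -c 'kill -TERM $$'
echo "status: $?"     # prints: status: 143
```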
If you face persistent issues, please do not hesitate to reach out to us for help.
@@ -28,7 +28,7 @@ Members of Hofstra University, Nassau Community College, or Adelphi University,

### Requesting an account

To get an account on Star, you need to fill out the [request form](https://access.starhpc.hofstra.io/apply). There, you will need to provide us with the following information:
- Your full name, date of birth, and nationality.
- Your position (master student, PhD, PostDoc, staff member,
@@ -65,7 +65,17 @@ Submit the above information through the online registration form.

## Login node

### About the login node
The login node serves as the gateway or entry point to the cluster. Note that most software tools are not available on the login node and it is not for prototyping, building software, or running computationally intensive tasks itself. Instead, the login node is specifically for accessing the cluster and performing only very basic tasks, such as copying and moving files, submitting jobs, and checking the status of existing jobs. For development tasks, you would use one of the development nodes, which are accessed the same way as the large compute nodes. The compute nodes are where all the actual computational work is performed. They are accessed by launching jobs through Slurm with `sbatch` or `srun`.
### Connection and credentials
Access to the cluster is provided through SSH to the login node. Upon your account's creation, you can access the login node using the address provided in your welcome email.

If you have existing Linux lab credentials, use them to log in. Otherwise, login credentials will be provided to you.

Additionally, the login node provides access to your Linux lab files, **but note that** the login node is **not** just another Linux lab machine. It simply shares some features (e.g., credentials) with the lab machines for convenience.
## Scheduler policies

...
---
sort: 1
---
# Environment modules

## Introduction to Environment Modules

@@ -109,4 +113,4 @@ Example:

For further details, users are encouraged to refer to the man pages for `module` and `modulefile`:

```bash
man module
```
\ No newline at end of file
---
sort: 4
---
# Jupyter Notebook
Jupyter Notebook is an interactive web application that provides an environment where you can create and share documents with live code, equations, visualizations, and narrative text. It is great for data analysis, scientific computing, and machine learning tasks. You can run Python code in cells, see results right away, and document your work all in one place.
## Running Jupyter Notebook
Jupyter Notebook is installed on the cluster and can be started like any other workload, by launching it through Slurm. Jupyter is available as an [environment module]({{site.baseurl}}{% link software/env-modules.md %}), so it can be loaded into the environment with the `module` command, as the example script below shows.
Alternatively, you could run Jupyter in a container. That would make it easy to load the environment you need when there is a container image available with your desired toolset pre-installed. Check out [Apptainer]({{site.baseurl}}{% link software/apptainer.md %}) to learn more.
```note
Use Your Storage Effectively
{:.h4.mb-2}
The directory `/fs1/projects/{project-name}/` lives on the parallel file-system storage, where most of your work should reside. While your home directory (`/home/{username}/`) can be used for quick experiments and convenient access to scripts, keep in mind that it has limited capacity and lower performance. The parallel file-system storage is much faster and has far more space for your notebooks and data.
```
### Step 1: Create the Job Script
You would create a job script to launch Jupyter Notebook and most other applications on the cluster. As the compute nodes (where workloads run on the cluster) are not directly reachable from the campus network, you will need to perform SSH port forwarding to access your Jupyter Notebook instance. The following script starts Jupyter Notebook on an available port and provides you the SSH command needed to then reach it. You can copy and paste this example to get started. From the login node, save this as `jupyter.sbatch`:
```bash
#!/bin/bash
#SBATCH --nodelist=<compute-node>
#SBATCH --gpus=2
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
#SBATCH --job-name=jupyter_notebook
#SBATCH --output=/fs1/projects/<project-name>/%x_%j.out
#SBATCH --error=/fs1/projects/<project-name>/%x_%j.err
# Connection variables
LOGIN_NODE="<login-node-address>" # Set this to the login node's address from the welcome email
LOGIN_PORT="<login-port>" # Set this to the port number from the welcome email
XX="<xx>" # Set this to a number from 01-30
module load jupyter
check_port() {
nc -z localhost $1
return $(( ! $? ))
}
# Find an available port
port=8888
while ! check_port $port; do
port=$((port + 1))
done
compute_node=$(hostname -f)
user=$(whoami)
echo "==================================================================="
echo "To connect to your Jupyter notebook, run this command on your local machine:"
echo ""
echo "ssh -N -L ${port}:${compute_node}:${port} -J ${user}@adams204${XX}.hofstra.edu:${LOGIN_PORT},${user}@${LOGIN_NODE}:${LOGIN_PORT} ${user}@${LOGIN_NODE}"
echo ""
echo "When finished, clean up by running this command on the login node:"
echo "scancel ${SLURM_JOB_ID}"
echo "==================================================================="
# Start Jupyter notebook
jupyter notebook --no-browser --port=${port} --ip=0.0.0.0
```
The script uses these Slurm parameters:
- `--nodelist`: Specifies which compute node to use (e.g., `gpu1` or `cn01`)
- `--gpus=2`: This enables us to use 2 of the GPUs on the specified node. See each node's GPU information [here]({{site.baseurl}}{% link quickstart/about-star.md %}). Without this specification, you cannot see or use the GPUs on the compute node. Feel free to replace this number with another **valid option**.
- `--ntasks=1`: Runs one instance of Jupyter
- `--cpus-per-task=1`: Use one CPU thread. Note hyperthreading may be enabled on the compute nodes.
- `--time=00:30:00`: Sets a 30-minute time limit for the job (The format is `hh:mm:ss`)
### Step 2: Replace the placeholders
The `<...>` placeholders need to be replaced with what _you_ need:
- `<login-node-address>` needs to be replaced with the address of the login node provided in your welcome email
- `<login-port>` needs to be replaced with the port number from your welcome email
- `<xx>` needs to be replaced with a number between 01-30 (inclusive)
- `<compute-node>` needs to be replaced with an available compute node from the cluster nodes list. You can find the full list of nodes on the [About Star]({{site.baseurl}}{% link quickstart/about-star.md %}) page.
- Change the path for the `--output` and `--error` directives to where _you_ would like these files to be saved.
### Step 3: Submit the job
```bash
sbatch jupyter.sbatch
```
Upon your job's submission to the queue, you will see output indicating your job's ID. Use your job ID in place of the `<jobid>` placeholder throughout the rest of this guide.
_**Your job may not start right away!**_
{:.bg-yellow-light.color-orange-9.p-2}
If you run `squeue` immediately after submitting your job, you might see a message such as `Node Unavailable` next to your job. Another job may be actively using those resources, and your job will be held in the queue until your request can be satisfied by the available resources.
In such a case, the `.out` and `.err` files will not have been created yet, as your job hasn't run yet.
Before proceeding to **Step 4**, wait until your job has changed to the `RUNNING` state as reported by the `squeue` command.
### Step 4: Check your output file for the SSH command
```bash
cat jupyter_notebook_<jobid>.out # Run this command in the directory the .out file is located.
```
Replace `<jobid>` with the job ID you received after submitting the job.
### Step 5: Run the SSH port-forwarding command
Open a new terminal on your local machine and run the SSH command provided in the output file. If prompted for a password, use your Linux lab password if you haven't set up SSH keys. You might be requested to enter your password multiple times. **Note** that the command will appear to hang after successful connection - this is the expected behavior. Do not terminate the command (`Ctrl + C`) as this will disconnect your Jupyter notebook session (unless you intend to do so).
### Step 6: Find and open the link in your browser
Check the error file on the login node for your Jupyter notebook's URL:
```bash
cat jupyter_notebook_<jobid>.err | grep '127.0.0.1' # Run this command in the directory the .err file is located.
```
Replace `<jobid>` with the job ID you received after submitting the job.
_**Be patient!**_
{:.bg-yellow-light.color-orange-9.p-2}
Make sure you wait about 30 seconds after executing the SSH port-forwarding command on your local machine. It takes the `.err` file a little time to be updated and include your link.
You might see two lines being printed. Either link works.
Copy the URL from the error file and paste it into your **local machine's browser**.
### Step 7: Clean up
If you're done prior to the job's termination due to the walltime, clean up your session by running this command on the login node:
```bash
scancel <jobid>
```
Replace `<jobid>` with the job ID you received after submitting the job.
Afterwards, press `Ctrl + C` on your local computer's terminal session, where you ran the port forwarding command. This would terminate the SSH connection.
## Working on the Compute Node
Do you need to access the node running Jupyter Notebook? You can use `srun` to launch an interactive shell. Check out [interactive jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}#interactive-jobs) for more information.
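For example, a sketch of opening an interactive shell on a specific node (replace the placeholder with the node your job is running on):

```bash
srun --nodelist=<compute-node> --pty bash -i
```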
---
sort: 2
---
# Virtual Environment Guide

Managing software dependencies and configurations can be challenging in an HPC environment. Users often need different versions of the same software or libraries, leading to conflicts and complex setups. [Environment modules]({{site.baseurl}}{% link software/env-modules.md %}) provide a solution by allowing users to dynamically modify their shell environment using simple commands. This simplifies the setup process, ensures that users have the correct software environment for their applications, and reduces conflicts and errors caused by incompatible software versions. Environment modules work on the same principle as virtual environments, i.e. the manipulation of environment variables.
If an environment module is not available for a given version you need, you can instead create a virtual environment using the standard version manager tools provided with many common languages. Virtual environments allow for managing different versions of languages and dependencies independent of the system version or other virtual environments, so they are often used by developers to isolate dependencies for different projects.
@@ -294,7 +298,7 @@ Remove the renv directory and associated files. This deletes the environment and

### How to create and use a virtual environment in Julia

Julia's built-in package manager, Pkg, provides functionality similar to virtual environments in other languages. The primary method is using project environments, which are defined by Project.toml and Manifest.toml files. These environments allow you to have project-specific package versions and dependencies. To create and manage these environments, you use Julia's REPL in package mode (accessed by pressing `]`).
#### Setup environment

Create a new project directory and activate it as a Julia environment.

...