### I am not able to submit jobs longer than two days
Please read about `label_partitions`.
### Where can I find an example of a job script?
You can find job script examples at [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
Application-specific examples for a few applications (also suitable for beginning users) can be found in `sw_guides`.
### When will my job start?
...
...
new jobs are submitted that get higher priority.
On the command line, you can see the job queue by using `squeue`.
For a more comprehensive list of commands to monitor/manage your jobs, please see [Monitoring jobs]({{site.baseurl}}{% link jobs/monitoring-jobs.md %}).
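As a quick illustration, here are a few standard `squeue` invocations (replace `<jobid>` with your own job's ID):

```bash
squeue -u $USER      # list only your own jobs
squeue -j <jobid>    # show the status of one specific job
squeue --start       # show estimated start times for pending jobs, where Slurm can predict them
```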
### Why does my job not start or give me error feedback when submitting?
Most often the reason a job is not starting is that Star is full at
...
...
there is an error in the job script and you are asking for a
configuration that is not possible on Star. In such a case the job
will not start.
To find out how to monitor your jobs and check their status, see [Monitoring jobs]({{site.baseurl}}{% link jobs/monitoring-jobs.md %}).
Below are a few common reasons why jobs don't start, and some error messages you
might get:
...
...
core nodes - with both a total of 32 GB of memory/node. If you ask for
full nodes by specifying both number of nodes and cores/node together
with 2 GB of memory/core, you will ask for 20 cores/node and 40 GB of
memory. This configuration does not exist on Star. If you ask for 16
cores, still with 2 GB/core, there is a sort of buffer within Slurm that does not
allow you to consume absolutely all of the available memory (the system needs
some to work). 2000 MB/core works fine, but not 2 GB for 16 cores/node.
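As a minimal sketch (values are illustrative, not a recommendation for your own jobs), a request that stays within a 32 GB node could look like this:

```bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=2000M   # 2000 MB/core fits; 2 GB/core for 16 cores would not
```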
...
...
mem-per-cpu 4000MB will cost you twice as much as mem-per-cpu 2000MB.
Please also note that if you want to use the whole memory on a node, do
not ask for 32 GB, but for 31 GB or 31000 MB, as the node needs some memory
for the system itself.
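For example, a hedged sketch of requesting (almost) the whole memory of a 32 GB node:

```bash
#SBATCH --nodes=1
#SBATCH --mem=31000M   # leave some memory for the operating system itself
```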
**Step memory limit**
...
...
For instance:
`QOSMaxWallDurationPerJobLimit` means that MaxWallDurationPerJobLimit has
been exceeded. Basically, you have asked for more time than is allowed for
the given QOS/Partition.
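If you are unsure of the walltime limits, one way to list each partition's maximum time limit with standard Slurm tooling is:

```bash
sinfo -o "%P %l"   # print each partition together with its time limit
```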
**Priority vs. Resources**
...
...
Priority means that resources are in principle available, but someone
else has higher priority in the queue. Resources means that at the moment
the requested resources are not available.
### Why is my job not starting on highmem nodes although the highmem queue is empty?
To prevent the highmem nodes from standing around idle, normal jobs may
use them as well, using only 32 GB of the available memory. Hence, it is
possible that the highmem nodes are busy, although you do not see any
jobs queuing or running when checking with `squeue -p highmem`.
### How can I customize emails that I get after a job has completed?
Use the `mail` command and you can customize it to your liking, but make
...
...
script:
The overhead in the job start and cleanup makes it impractical to run
thousands of short tasks as individual jobs on Star.
The queueing setup on Star, or rather, the accounting system generates
overhead in the start and finish of a job of about 1 second at each end
of the job. This overhead is insignificant when running large parallel
jobs, but creates scaling issues when running a massive amount of
...
...
unparallelizable part of the job. This is because the queuing system can
only start and account for one job at a time. This scaling problem is
described by [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law).
If the tasks are extremely short (e.g. less than 1 second), you can use the example below.
If you want to spawn many jobs without polluting the queueing system, please
have a look at [array jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}#array-jobs).
By using some shell trickery one can spawn and load-balance multiple
independent tasks running in parallel within one node: just background
the tasks and poll to see when a task has finished before you spawn the
next:
```bash
#!/usr/bin/env bash
# Jobscript example that can run several tasks in parallel.
# All features used here are standard in bash so it should work on
# any sane UNIX/LINUX system.
# Author: roy.dragseth@uit.no
#
# This example will only work within one compute node so let's run
# on one node using all the cpu-cores:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
# We assume we will (in total) be done in 10 minutes:
#SBATCH --time=0-00:10:00
# Let us use all CPUs:
maxpartasks=$SLURM_TASKS_PER_NODE
# Let's assume we have a bunch of tasks we want to perform.
# Each task is done in the form of a shell script with a numerical argument:
# dowork.sh N
# Let's just create some fake arguments with a sequence of numbers
# from 1 to 100, edit this to your liking:
tasks=$(seq 100)
cd $SLURM_SUBMIT_DIR

for t in $tasks; do
# Do the real work, edit this section to your liking.
# remember to background the task or else we will
# run serially
./dowork.sh $t &
# You should leave the rest alone...
# count the number of background tasks we have spawned
# the jobs command print one line per task running so we only need
# to count the number of lines.
activetasks=$(jobs | wc -l)
# if we have filled all the available cpu-cores with work we poll
# every second to wait for tasks to exit.
while [ $activetasks -ge $maxpartasks ]; do
sleep 1
activetasks=$(jobs | wc -l)
done
done
# Ok, all tasks spawned. Now we need to wait for the last ones to finish
# before the script exits.
wait
```
Batch jobs are submitted to a queue on the cluster and run without user interaction. This is the most common job type for tasks that don't require real-time feedback.
#### Example Scenario
You've developed a script for processing a large dataset that requires no human interaction to complete its task. By submitting this as a batch job, you hand the task over to the cluster, which runs the job to completion and writes the results to your desired location for you to view.
For a real example of a batch job, view [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
### 3. Array jobs
When you're faced with executing the same task multiple times with only slight variations, array jobs offer an efficient solution. This job type simplifies the process of managing numerous similar jobs by treating them as a single entity that varies only in a specified parameter.
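As a small, hedged example (the script name here is hypothetical), a single submission can fan out into many tasks:

```bash
sbatch --array=1-10 my_array_job.sbatch   # runs the same script as 10 tasks, with IDs 1 through 10
```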
...
...
## Resources
Resources within an HPC environment are finite and include CPUs, GPUs, memory, and storage. <br>
For a list of the resources available at Star HPC, take a look at [About Star]({{site.baseurl}}{% link quickstart/about-star.md %}).
### Common Errors
Strains on the cluster occur when resources are over-requested or misallocated, leading to potential bottlenecks, decreased system performance, and extended wait times for job execution. <br>
Now let's walk through `my_script.sbatch` line by line to see what each directive does.
Lines 2-7 are your `SBATCH` directives. These lines are where you specify different options for your job, including its name, the path and name of its output and error files, the list of nodes you want to use, resource limits, and more if required. Let's walk through them line by line:
- `#SBATCH --job-name=test_job`: This directive gives your job a name that you can later use to more easily track and manage your job when looking for it in the queue. In this example, we've called it `test_job`. You can read about job management at [Monitoring jobs]({{ site.baseurl }}{% link jobs/monitoring-jobs.md %}).
- `#SBATCH --output=test_job.out`: Used to specify where your output file is generated, and what it's going to be named. In this example, we have not provided a path, but only provided a name. When you use the `--output` directive without specifying a full path, just providing a filename, Slurm will store the output file in the current working directory from which the `sbatch` command was executed.
- `#SBATCH --error=test_job.err`: Functions similarly to `--output` except it contains error messages generated during the execution of your job, if any. **The `.err` file is always going to be generated even if your job execution is successful; however, it's going to be empty if there are no errors.**
- `#SBATCH --nodes=1`: Specifies that your job should run on one available node. This directive basically tells the scheduler "Run my job on any available node you find, and I don't care which one". **It's also possible to specify the name of the node(s) you'd like to use, which we will cover in future examples.**
...
...
After the last `#SBATCH` directive, commands are run like any other regular shell script.
- `module load python3`: Loads necessary files and modules in order for the command `python3` to be valid when used. Please refer to [Environment modules]({{ site.baseurl }}{% link software/env-modules.md %}) for more detail on how the command `module` works.
- `python3 my_script.py`: Just like any other `python3` command, this line runs the `my_script.py` file using Python. **Later, the output(s) and/or error(s) of this operation are written to the files we have specified in our directives.** A consolidated sketch of the whole script follows below.
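Putting the directives and commands above together, a minimal sketch of such a script might look like the following (the exact directives in the guide's `my_script.sbatch` may differ slightly):

```bash
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --output=test_job.out
#SBATCH --error=test_job.err
#SBATCH --nodes=1

module load python3
python3 my_script.py
```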
### Batch Job Submission
...
...
Once submitted, you'll be placed in an interactive shell on the allocated node.
## Array jobs
To submit an array job, you use the `--array` option as part of your `sbatch` command. This option specifies a range of indices that Slurm uses to create multiple tasks from a single job submission. Each task in the array is assigned a unique `SLURM_ARRAY_TASK_ID` that can be used within your scripts to differentiate between them.
### Array job example
...
...
This script takes an argument from the command line (expected to be the `SLURM_ARRAY_TASK_ID`), constructs a filename from this ID, reads the corresponding input file, counts its lines, and writes the count to an output file. If an input file is missing, it handles the error by printing a message instead of crashing.
Finally, to run this script as part of an array job on 3 files, adjust the `--array` option in your Slurm script (`process_array.sbatch`) to `1-3`.
```bash
#!/bin/bash
...
...
module load python3
python3 process_data.py $SLURM_ARRAY_TASK_ID
```
In the context of Slurm job submission scripts, `%A` and `%a` are special placeholders used within directives like `--output` and `--error` to dynamically generate filenames based on the job's array ID and the individual task ID within the array. Here's what each placeholder represents:
- `%A`: This placeholder is replaced by the Slurm job array's ID. The job array ID is a unique identifier assigned by Slurm to the entire array job at the time of submission. It helps you group and identify all tasks belonging to the same array job.
- `%a`: This placeholder is substituted with the specific task ID within the job array. Since an array job consists of multiple tasks, each with a unique task ID (determined by the `--array` option when the job is submitted), `%a` allows you to create distinct output or error files for each task, making it easier to troubleshoot and analyze the results of individual tasks.
For example, if you submit an array job with the `--array=1-10` option and use the following in your script:
...
...
```bash
#SBATCH --output=job_output_%A_%a.out
#SBATCH --error=job_error_%A_%a.err
```
Slurm will create separate output and error files for each of the ten tasks in the array. If the array job's ID is 12345, the files for the first task will be named `job_output_12345_1.out` and `job_error_12345_1.err`, the files for the second task will be `job_output_12345_2.out` and `job_error_12345_2.err`, and so on.
Now submit this job using `sbatch process_array.sbatch` and you should see 6 different output files (3 ending in `.out` and 3 in `.err`). The `.out` files each contain the content of the relevant text file they read from, and the `.err` files are expected to be empty if everything has run smoothly.
### Array Job Submission
To submit an array job, use the `sbatch` command with your Slurm script that includes the `--array` option. For example:
```bash
sbatch process_array.sbatch
```
This will submit the entire array job. Slurm will then manage the execution of individual tasks within the array based on available resources.
## Parallel jobs
...
...
This script compiles the MPI program with `mpicc`, then runs it with `mpirun`, specifying the number of processes with `-n`.
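A minimal sketch of such a wrapper script, assuming the source file is called `mpi_hello_world.c` (the variable names are illustrative):

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: compile the MPI program and run it with a chosen number of processes.
OBJ=mpi_hello_world          # name of the compiled binary (assumption)
NUM=${1:-4}                  # number of MPI processes, taken from the first argument, default 4
mpicc -o "$OBJ" mpi_hello_world.c
mpirun -n "$NUM" "./$OBJ"
```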
Next, prepare a Slurm batch job script named `job-test-mpi.sbatch` to submit your MPI job. This script requests cluster resources and runs your MPI program through `mpi_hello_world.sh`:
```bash
#!/bin/bash
...
...
```
This script sets up a job with the name `mpi_job_test` and specifies output and error files.
### Parallel Job Submission
Submit your parallel MPI job to Slurm using the `sbatch` command `sbatch job-test-mpi.sbatch`, specifying the desired number of parallel processes with `-n`. For example, to run with 8 parallel processes:
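As a hedged sketch (8 here is just an example value):

```bash
sbatch -n 8 job-test-mpi.sbatch
```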
### How to create and use a virtual environment in Julia
Julia's built-in package manager, Pkg, provides functionality similar to virtual environments in other languages. The primary method is using project environments, which are defined by `Project.toml` and `Manifest.toml` files. These environments allow you to have project-specific package versions and dependencies. To create and manage these environments, you use Julia's REPL in package mode (accessed by pressing `]`).
#### Setup environment
Create a new project directory and activate it as a Julia environment.
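As a minimal sketch from the shell (the directory name `MyProject` is an assumption; you can achieve the same from within the REPL with `Pkg.activate`):

```bash
mkdir -p MyProject
julia --project=MyProject    # start Julia with MyProject as the active environment
# inside the REPL, press ] and then run, e.g.:  add <PackageName>   (installs into this environment)
```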