### I am not able to submit jobs longer than two days
Jobs are limited by the maximum walltime of the partition they are submitted to. Please read about the available partitions and their walltime limits.
### Where can I find an example of job script?
You can find job script examples at [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
Relevant application-specific examples (also for beginning users) for a
few applications can be found in the software guides.
### When will my job start?
...
...
new jobs are submitted that get higher priority.
On the command line, you can see the job queue by using `squeue`.
For a more comprehensive list of commands to monitor/manage your jobs, please see [Monitoring jobs]({{site.baseurl}}{% link jobs/monitoring-jobs.md %}).
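For a quick look, a minimal sketch using standard Slurm options (the job ID is a placeholder):

```bash
# Show only your own jobs:
squeue -u $USER

# Ask Slurm for the estimated start time of a pending job
# (12345 is a placeholder job ID):
squeue --start -j 12345
```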
### Why does my job not start or give me error feedback when submitting?
Most often the reason a job is not starting is that Star is full at
...
...
there is an error in the job script and you are asking for a
configuration that is not possible on Star. In such a case the job
will not start.
To find out how to monitor your jobs and check their status see [Monitoring jobs]({{site.baseurl}}{% link jobs/monitoring-jobs.md %}).
Below are a few common reasons why jobs don't start, and error messages you
might get:
...
...
core nodes - with both a total of 32 GB of memory/node. If you ask for
full nodes by specifying both number of nodes and cores/node together
with 2 GB of memory/core, you will ask for 20 cores/node and 40 GB of
memory. This configuration does not exist on Star. If you ask for 16
cores, still with 2 GB/core, there is a sort of buffer within Slurm not
allowing you to consume absolutely all memory available (the system needs
some to work). 2000 MB/core works fine, but not 2 GB for 16 cores/node.
...
...
Requesting `--mem-per-cpu=4000MB` will cost you twice as much as `--mem-per-cpu=2000MB`.
Please also note that if you want to use the whole memory on a node, do
not ask for 32GB, but for 31GB or 31000MB as the node needs some memory
for the system itself.
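For instance, a minimal sketch of requesting (almost) the whole memory of a 32 GB node, leaving some for the system:

```bash
#SBATCH --nodes=1
#SBATCH --mem=31G   # not 32G: the node needs some memory for the system
```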
**Step memory limit**
...
...
For instance:
`QOSMaxWallDurationPerJobLimit` means that `MaxWallDurationPerJobLimit` has
been exceeded. Basically, you have asked for more time than allowed for
the given QOS/Partition.
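One way to check the walltime limit of each partition (standard `sinfo` format options; the exact output varies per cluster):

```bash
# Print each partition together with its maximum walltime:
sinfo -o "%P %l"
```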
**Priority vs. Resources**
...
...
Priority means that resources are in principle available, but someone
else has higher priority in the queue. Resources means that at the moment
the requested resources are not available.
### Why is my job not starting on highmem nodes although the highmem queue is empty?
To prevent the highmem nodes from standing around idle, normal jobs may
use them as well, using only 32 GB of the available memory. Hence, it is
possible that the highmem nodes are busy, although you do not see any
jobs queuing or running in the output of `squeue -p highmem`.
### How can I customize emails that I get after a job has completed?
Use the `mail` command and you can customize it to your liking, but make
...
...
script:
The overhead in the job start and cleanup makes it impractical to run
thousands of short tasks as individual jobs on Star.
The queueing setup on Star, or rather, the accounting system generates
overhead in the start and finish of a job of about 1 second at each end
of the job. This overhead is insignificant when running large parallel
jobs, but creates scaling issues when running a massive amount of
...
...
unparallelizable part of the job. This is because the queuing system can
only start and account one job at a time. This scaling problem is
described by [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law).
For example, with about 1 second of overhead per job, running 10,000 tasks
that each take 1 second of real work as individual jobs roughly doubles the
total time, with half of it spent on overhead.
If the tasks are extremely short (e.g. less than 1 second), you can use the example below.
If you want to spawn many jobs without polluting the queueing system, please
have a look at [array jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}#array-jobs).
By using some shell trickery one can spawn and load-balance multiple
independent tasks running in parallel within one node: just background
the tasks and poll to see when a task has finished before spawning the
next:
```bash
#!/usr/bin/env bash
# Jobscript example that can run several tasks in parallel.
# All features used here are standard in bash so it should work on
# any sane UNIX/LINUX system.
# Author: roy.dragseth@uit.no
#
# This example will only work within one compute node so let's run
# on one node using all the cpu-cores:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
# We assume we will (in total) be done in 10 minutes:
#SBATCH --time=0-00:10:00
# Let us use all CPUs:
maxpartasks=$SLURM_TASKS_PER_NODE
# Let's assume we have a bunch of tasks we want to perform.
# Each task is done in the form of a shell script with a numerical argument:
# dowork.sh N
# Let's just create some fake arguments with a sequence of numbers
# from 1 to 100, edit this to your liking:
tasks=$(seq 100)
cd $SLURM_SUBMIT_DIR

for t in $tasks; do
  # Do the real work, edit this section to your liking.
  # Remember to background the task or else we will run serially.
  ./dowork.sh $t &

  # You should leave the rest alone...

  # Count the number of background tasks we have spawned.
  # The jobs command prints one line per running task so we only need
  # to count the number of lines.
  activetasks=$(jobs | wc -l)

  # If we have filled all the available cpu-cores with work we poll
  # every second to wait for tasks to exit.
  while [ $activetasks -ge $maxpartasks ]; do
    sleep 1
    activetasks=$(jobs | wc -l)
  done
done

# Ok, all tasks spawned. Now we need to wait for the last ones to
# finish before the script exits, or the job will end and kill them.
wait
```
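For completeness, a minimal hypothetical `dowork.sh` that the script above could call (entirely a placeholder; replace it with your real work):

```bash
#!/usr/bin/env bash
# Placeholder task: pretend to work on chunk $1 for a couple of seconds.
n=$1
echo "Task $n running on $(hostname)"
sleep 2
```

Remember to make it executable with `chmod +x dowork.sh` before submitting the job.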
## Never send support requests directly to staff members
Please do not contact your cluster administrator directly. Please visit the [Issue Tracker](https://github.com/StarHPC/Issues) first, as there are different forms provided for various queries.
## Please do not treat us as "Let me Google that for you" assistants
...
...
png, tiff, etc) of what you saw on your monitor. From these, we would be
unable to copy and paste commands or error messages, unnecessarily slowing
down our research on your problem.
## New problem == new ticket
Do not send support requests by replying to unrelated issues. Every
issue gets a number and this is the number that you see in the subject
...
...
know the solution but sometimes we don't know the problem.
In short (quoting from <http://xyproblem.info>):
- User wants to do X.
- User doesn't know how to do X, but thinks they can fumble their way
  to a solution if they can just manage to do Y.
- User doesn't know how to do Y either.
- User asks for help with Y.
- Others try to help user with Y, but are confused because Y seems
  like a strange problem to want to solve.
- After much interaction and wasted time, it finally becomes clear
  that the user really wants help with X, and that Y wasn't even a
  suitable solution for X.
To avoid the XY problem, if you struggle with Y but really what you are
after is X, please also tell us about X. Tell us what you really want to
...
...
The better you describe the problem the less we have to guess and ask.
Sometimes, just seeing the actual error message is enough to give a
useful answer. For all but the simplest cases, you will need to make the
problem reproducible, which you should _always_ try anyway. See the
following points.
## Complex cases: Create an example which reproduces the problem
Imagine a user is optimizing a complex algorithm's parameters. By initiating an interactive session, they can adjust the parameters on the fly and immediately observe the results.
### 2. Batch jobs
Batch jobs are submitted to a queue on the cluster and run without user interaction. This is the most common job type for tasks that don't require real-time feedback.
#### Example Scenario
You've developed a script for processing a large dataset that requires no human interaction to complete its task. By submitting this as a batch job, the cluster undertakes the task, allowing the job to run to completion and output the results to your desired location for you to view.
For a real example of a batch job, see [Submitting jobs]({{site.baseurl}}{% link jobs/submitting-jobs.md %}).
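A minimal sketch of such a batch job script (paths, resource values, and `process.py` are placeholders, not Star-specific recommendations):

```bash
#!/bin/bash
#SBATCH --job-name=process-dataset
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# Placeholder command: replace with your actual processing step.
python process.py --input /path/to/dataset --output results/
```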
### 3. Array jobs
When you're faced with executing the same task multiple times with only slight variations, array jobs offer an efficient solution. This job type simplifies the process of managing numerous similar jobs by treating them as a single entity that varies only in a specified parameter.
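A minimal sketch of an array job (the range and `dowork.sh` are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# Each array task receives its own value of SLURM_ARRAY_TASK_ID:
./dowork.sh $SLURM_ARRAY_TASK_ID
```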
...
...
Imagine a fluid dynamics job that requires complex calculations spread over millions of grid points.
## Resources
Resources within an HPC environment are finite and include CPUs, GPUs, memory, and storage. <br>
For a list of the resources available at Star HPC, take a look at [About Star]({{site.baseurl}}{% link quickstart/about-star.md %}).
### Common Errors
Strains on the cluster occur when resources are over-requested or misallocated, leading to potential bottlenecks, decreased system performance, and extended wait times for job execution. <br>
Remove the renv directory and associated files. This deletes the environment and its installed packages.
### How to create and use a virtual environment in Julia
Julia's built-in package manager, Pkg, provides functionality similar to virtual environments in other languages. The primary method is using project environments, which are defined by Project.toml and Manifest.toml files. These environments allow you to have project-specific package versions and dependencies. To create and manage these environments, you use Julia's REPL in package mode (accessed by pressing `]`).
#### Setup environment
Create a new project directory and activate it as a Julia environment.
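A minimal sketch from the shell ("MyProject" and the "Example" package are placeholders):

```bash
# Create a project directory and activate it as a Julia environment;
# adding a package generates Project.toml and Manifest.toml.
mkdir MyProject && cd MyProject
julia -e 'using Pkg; Pkg.activate("."); Pkg.add("Example")'
```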