In the context of High-Performance Computing (HPC), a job is a unit of work that you submit to the cluster for execution. This work can range from a simple command to a complex series of computations that utilize the cluster's computing resources, including CPUs, GPUs, and memory. Jobs are executed on the cluster's computing nodes, rather than on the login or head node, to ensure that intensive computations do not interfere with the system's overall operation and performance.
**Example**: Suppose you're conducting a scientific simulation. On the cluster, you would submit this task as a job. This job would detail the resources needed (e.g., "use 2 of the A100 GPUs on node gpu1"), the software to run (perhaps a simulation tool contained within an Apptainer container), and the input data. The job system then queues your request and executes it on the appropriate compute nodes when resources become available, without you needing to manually log into those nodes and start the computations.
**Example Scenario**: Suppose you're conducting a simulation. On the cluster, you would submit this task as a job. This job would detail the resources needed (e.g., "use 2 of the A100 GPUs on node gpu1"), the software to run (perhaps a simulation tool contained within an Apptainer container), and the input data. The job system then queues your request and executes it on the appropriate compute nodes when resources become available, without you needing to manually log into those nodes and start the computations.
### Why use jobs?
...
...
@@ -16,32 +16,32 @@ In the context of High-Performance Computing (HPC), a job is a unit of work that
***Scalability**: Jobs can be scaled across multiple nodes and processors, based on your preference and resource requirements.
***Control**: Through job specifications, you have control over the resources allocated to your task, such as the number of CPUs/GPUs, memory, and runtime. This ensures that your job runs under optimal conditions tailored to its requirements.
## 2. Job types
## Job types
In HPC environments, navigating through the variety of job types is bounded to choosing the right gear for your workload. Each job type provides specific computational demands, ranging from straightforward tasks to complex, large-scale computations. Here, you can read about different types of jobs and their use cases.
### 1. Interactive jobs
Interactive jobs serve as your direct path to the cluster's computing resources by enabling real-time interaction with your applications. This job type is particularly beneficial for tasks necessitating prompt feedback, such as exploratory data analysis or debugging phases of development.
#### Example
#### Example Scenario
Imagine a user is optimizing a complex algorithm's parameters. By initiating an interactive job, the user can execute adjustments in real-time, directly observing the effects of their tweaks on the algorithm's performance.
### 2. Batch jobs
Batch jobs are submitted to the cluster's job queue and run without user interaction. This is the most common job type for tasks that don't require real-time feedback.
#### Example
#### Example Scenario
You've developed a script for processing a large dataset that requires no human interaction to complete its task. By submitting this as a batch job, the cluster undertakes the task, allowing the job to run to completion, and outputting the results to your desired location for you to view.
For a real example on Batch jobs, view <u>page-name<u>.
### 3. Array jobs
When you're faced with executing the same task multiple times with only slight variations, array jobs offer an efficient solution. This job type simplifies the process of managing numerous similar jobs by treating them as a single entity that varies only in a specified parameter.
#### Example
#### Example Scenario
Suppose a user has to analyze 100 different genomic sequences with the same computational tool. Instead of submitting 100 separate batch jobs, they can submit a single array job that processes each sequence in parallel,which significantly simplifies job management and resource allocation.
### 4. Parallel jobs
Parallel jobs are designed to run computations concurrently across multiple nodes or processors, ideal for highly parallelizable tasks. These jobs can significantly reduce computation time for suitable tasks by taking advantage of the cluster's distributed resources.
#### Example
#### Example Scenario
Imagine a fluid dynamics job that require complex calculations spread over millions of grid points. By submitting a parallel job, the simulation can for instance be distributed across multiple A100 GPUs on both gpu1 and gpu2, and enable the task to complete much faster than it would on a single node.
## 3. Resources
Resources within an HPC environment are finite and include CPUs, GPUs, memory, and storage. Efficient use ensures not only the success of individual jobs but also maintains the overall health and performance of the cluster for all users. <br>
## Resources
Resources within an HPC environment are finite and include CPUs, GPUs, memory, and storage. <br>
For a more comprehensive list of the resources available at Star HPC, take a look at <u>page-name<u>.
### Common Errors
...
...
@@ -52,8 +52,8 @@ Common issues arise from: <br>
***Insufficient Memory**: Jobs that attempt to use more memory than allocated can crash or cause system instability.
### Unhealthy Jobs
#### Example 1
#### Example Scenario 1
Imagine a user submits a job requiring data analysis with minimal computational intensity but requests all 8 A100 GPUs on gpu1, anticipating it will speed up the process. However, the job only actively uses one GPU, leaving the remaining seven idle but unavailable to other users. This is a significant overallocation of resources, leading to inefficiencies and potential delays for other tasks requiring GPU support.
#### Example 2
#### Example Scenario 2
A user requesting more resource than physically available on the cluster. For instance requesting 9 A100 GPUs on gpu1 where the maximum is 8. This would cause an immediate error and would job would not run and terminate instantly.