Commit 78be527e authored by Mani Tofigh's avatar Mani Tofigh

Applied all changes recommended by Alexander Rosenberg in the docs file for…

Applied all changes recommended by Alexander Rosenberg in the docs file for section jobs/Overview.md
parent c1b10aa1
...@@ -8,7 +8,7 @@ sort: 1 ...@@ -8,7 +8,7 @@ sort: 1
In the context of High-Performance Computing (HPC), a job is a unit of work that you submit to the cluster for execution. This work can range from a simple command to a complex series of computations that utilize the cluster's computing resources, including CPUs, GPUs, and memory. Jobs are executed on the cluster's computing nodes, rather than on the login or head node, to ensure that intensive computations do not interfere with the system's overall operation and performance. In the context of High-Performance Computing (HPC), a job is a unit of work that you submit to the cluster for execution. This work can range from a simple command to a complex series of computations that utilize the cluster's computing resources, including CPUs, GPUs, and memory. Jobs are executed on the cluster's computing nodes, rather than on the login or head node, to ensure that intensive computations do not interfere with the system's overall operation and performance.
**Example Scenario**: Suppose you're conducting a simulation. On the cluster, you would submit this task as a job. This job would detail the resources needed (e.g., "use 2 of the A100 GPUs on node gpu1"), the software to run (perhaps a simulation tool contained within an Apptainer container), and the input data. The job system then queues your request and executes it on the appropriate compute nodes when resources become available, without you needing to manually log into those nodes and start the computations. **Example Scenario**: Suppose you're conducting a simulation. On the cluster, you would submit this task as a job. This job would detail the resources needed (e.g., "use 2 of the A100 GPUs on node `gpu1`"), the software to run (perhaps a simulation tool contained within an Apptainer container), and the input data. The job system then queues your request and executes it on the appropriate compute nodes when resources become available, without you needing to manually log into those nodes and start the computations.
### Why are jobs used? ### Why are jobs used?
...@@ -17,17 +17,17 @@ In the context of High-Performance Computing (HPC), a job is a unit of work that ...@@ -17,17 +17,17 @@ In the context of High-Performance Computing (HPC), a job is a unit of work that
* **Control**: Through job specifications, you have control over the resources allocated to your task, such as the number of CPUs/GPUs, memory, and runtime. This ensures that your job runs under optimal conditions tailored to its requirements. * **Control**: Through job specifications, you have control over the resources allocated to your task, such as the number of CPUs/GPUs, memory, and runtime. This ensures that your job runs under optimal conditions tailored to its requirements.
## Job types ## Job types
In HPC environments, navigating through the variety of job types is bounded to choosing the right gear for your workload. Each job type provides specific computational demands, ranging from straightforward tasks to complex, large-scale computations. Here, you can read about different types of jobs and their use cases. In HPC environments, choosing the right job type is like choosing the right gear for your workload. Each job type is suited for different computing needs, ranging from straightforward tasks to complex, large-scale computations.
### 1. Interactive jobs ### 1. Interactive jobs
Interactive jobs serve as your direct path to the cluster's computing resources by enabling real-time interaction with your applications. This job type is particularly beneficial for tasks necessitating prompt feedback, such as exploratory data analysis or debugging phases of development. Interactive jobs serve as your direct path to the cluster's computing resources by enabling real-time interaction with your applications. This job type is typically used for tasks necessitating prompt feedback, such as exploratory data analysis or debugging phases of development.
#### Example Scenario #### Example Scenario
Imagine a user is optimizing a complex algorithm's parameters. By initiating an interactive job, the user can execute adjustments in real-time, directly observing the effects of their tweaks on the algorithm's performance. Imagine a user is optimizing a complex algorithm's parameters. By initiating an interactive job, the user can execute adjustments in real-time, directly observing the effects of their tweaks on the algorithm's performance.
### 2. Batch jobs ### 2. Batch jobs
Batch jobs are submitted to the cluster's job queue and run without user interaction. This is the most common job type for tasks that don't require real-time feedback. Batch jobs are submitted to a queue on the cluster and run without user interaction. This is the most common job type for tasks that don't require real-time feedback.
#### Example Scenario #### Example Scenario
You've developed a script for processing a large dataset that requires no human interaction to complete its task. By submitting this as a batch job, the cluster undertakes the task, allowing the job to run to completion, and outputting the results to your desired location for you to view. You've developed a script for processing a large dataset that requires no human interaction to complete its task. By submitting this as a batch job, the cluster undertakes the task, allowing the job to run to completion and output the results to your desired location for you to view.
For a real example on Batch jobs, view `/jobs/creating-jobs.html`. For a real example on Batch jobs, view `/jobs/creating-jobs.html`.
### 3. Array jobs ### 3. Array jobs
...@@ -36,13 +36,13 @@ When you're faced with executing the same task multiple times with only slight v ...@@ -36,13 +36,13 @@ When you're faced with executing the same task multiple times with only slight v
Suppose a user has to analyze 100 different genomic sequences with the same computational tool. Instead of submitting 100 separate batch jobs, they can submit a single array job that processes each sequence in parallel,which significantly simplifies job management and resource allocation. Suppose a user has to analyze 100 different genomic sequences with the same computational tool. Instead of submitting 100 separate batch jobs, they can submit a single array job that processes each sequence in parallel,which significantly simplifies job management and resource allocation.
### 4. Parallel jobs ### 4. Parallel jobs
Parallel jobs are designed to run computations concurrently across multiple nodes or processors, ideal for highly parallelizable tasks. These jobs can significantly reduce computation time for suitable tasks by taking advantage of the cluster's distributed resources. Parallel jobs are designed to run computations concurrently across multiple nodes or processors, which is ideal for highly parallelizable tasks. These jobs can significantly reduce computation time for suitable tasks by taking advantage of the cluster's distributed resources.
#### Example Scenario #### Example Scenario
Imagine a fluid dynamics job that requires complex calculations spread over millions of grid points. By submitting a parallel job, the simulation can for instance be distributed across multiple A100 GPUs on both gpu1 and gpu2, and enable the task to complete much faster than it would on a single node. Imagine a fluid dynamics job that requires complex calculations spread over millions of grid points. By submitting a parallel job, the simulation can for instance be distributed across multiple A100 GPUs on both `gpu1` and `gpu2`, and enable the task to complete much faster than it would on a single node.
## Resources ## Resources
Resources within an HPC environment are finite and include CPUs, GPUs, memory, and storage. <br> Resources within an HPC environment are finite and include CPUs, GPUs, memory, and storage. <br>
For a more comprehensive list of the resources available at Star HPC, take a look at `/quickstart/about-star.html`. For a list of the resources available at Star HPC, take a look at `/quickstart/about-star.html`.
### Common Errors ### Common Errors
Strains on the cluster occur when resources are over-requested or misallocated, leading to potential bottlenecks, decreased system performance, and extended wait times for job execution. <br> Strains on the cluster occur when resources are over-requested or misallocated, leading to potential bottlenecks, decreased system performance, and extended wait times for job execution. <br>
...@@ -53,7 +53,7 @@ Common issues arise from: <br> ...@@ -53,7 +53,7 @@ Common issues arise from: <br>
### Unhealthy Jobs ### Unhealthy Jobs
#### Example Scenario 1 #### Example Scenario 1
Imagine a user submits a job requiring data analysis with minimal computational intensity but requests all 8 A100 GPUs on gpu1, anticipating it will speed up the process. However, the job only actively uses one GPU, leaving the remaining seven idle but unavailable to other users. This is a significant overallocation of resources, leading to inefficiencies and potential delays for other tasks requiring GPU support. Imagine a user submits a job for data analysis that only needs a minimal set of resources but requests all 8 A100 GPUs on `gpu1`, anticipating that would speed up the process. Counterintuitively, this would make it take longer until the job is actually executed as the job scheduler must wait until all the requested resources are available for the job to run. Furthermore, if the job only actively uses one GPU, the remaining seven would be idle but left unavailable to other users. This is a significant overallocation of resources, leading to inefficiencies and potential delays for other tasks requiring GPU support.
#### Example Scenario 2 #### Example Scenario 2
A user requesting more resource than physically available on the cluster. For instance requesting 9 A100 GPUs on gpu1 where the maximum is 8. This would cause an immediate error and the job would not run and terminate instantly. A user requests more resources than are physically available on the cluster. For instance, a user submits a job that requests 9 A100 GPUs on `gpu1`, where the maximum is 8. This would cause the job to fail immediately and produce an error.
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment