Commit d160d6ac authored by Aishwary Shukla

added info about the walltime and corrected a job submission command

parent e05aca84
@@ -156,11 +156,13 @@ module load python3
python3 quick_task.py
```
Note that `quick_task.py`'s name and location need to be changed to match _your_ file(s); `quick_task.py` is the actual job script that you want to run.
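Assuming the script above is saved as, for example, `quick_task.sbatch` (this file name is just a placeholder), it can be submitted to the scheduler with `sbatch`:
```
sbatch quick_task.sbatch
```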
## **How Can I Submit a Long Job?**
For long jobs, select the **long QoS**, which allows for extended runtimes but may have lower scheduling priority. Note that even when specifying the QoS, you still need to specify the walltime (`--time=`). A simple example illustrates the difference between walltime and CPU time: if a job runs for one hour using two CPU cores, the walltime is one hour while the CPU time is 1 hr x 2 CPUs = 2 hours.
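As a minimal sketch, assuming the long QoS is requested with `--qos=long` on your cluster (the exact QoS name may differ), the two directives would appear together in your submission script like this:
```
#SBATCH --qos=long         # request the long QoS (name may vary by cluster)
#SBATCH --time=2-00:00:00  # walltime of 2 days, in days-hh:mm:ss format
```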
It’s advisable to implement **checkpointing** in your application if possible. Checkpointing allows your job to save progress at intervals, so you can resume from the last checkpoint in case of interruptions, mitigating the risk of resource wastage due to unexpected failures.
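As an illustration only, the command portion of a job script could check for a saved checkpoint and resume from it; the `--resume` flag and `checkpoint.pkl` file name below are hypothetical and depend entirely on how your application implements checkpointing:
```
# Resume from the last checkpoint if one exists (hypothetical flag and file name)
if [ -f checkpoint.pkl ]; then
    python3 my_long_job.py --resume checkpoint.pkl
else
    python3 my_long_job.py
fi
```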
Be aware of **fairshare implications**; consistently running long jobs can reduce your priority over time. Plan your submissions accordingly to balance resource usage.
@@ -175,10 +177,11 @@ Be aware of **fairshare implications**
#SBATCH --nodes=2
#SBATCH --mem=64G
module load python3
python3 my_long_job.py
```
This `.sbatch` script requests sufficient time and resources for an extended computation, using the **long QoS**. `my_long_job.py` is the Python file containing the job that you want to run.
## **What Can I Do to Get My Job Started More Quickly? Any Other PRO Tips?**
...
@@ -77,7 +77,7 @@ Lines 2-7 are your `SBATCH` directives.
- `#SBATCH --output=test_job.out`: Used to specify where your output file is generated, and what it's going to be named. In this example, we have not provided a path, but only provided a name. When you use the `--output` directive without specifying a full path, just providing a filename, Slurm will store the output file in the current working directory from which the `sbatch` command was executed.
- `#SBATCH --error=test_job.err`: Functions similarly to `--output`, except it contains error messages generated during the execution of your job, if any. **The `.err` file is always generated even if your job execution is successful; however, it will be empty if there are no errors.**
- `#SBATCH --nodes=1`: Requests that your job run on one available node. This directive basically tells the scheduler "Run my job on any available node you find, and I don't care which one". **It's also possible to specify the name of the node(s) you'd like to use, which we will cover in future examples.**
- `#SBATCH --time=10:00`: This line specifies how long you want your job to run after it leaves the queue and starts execution, also called the **walltime**. In this case, the job will be **terminated** after 10 minutes. A simple example of the difference between walltime and CPU time: if a job runs for one hour using two CPU cores, the walltime is one hour while the CPU time is 1 hr x 2 CPUs = 2 hours. Acceptable time formats include `mm`, `mm:ss`, `hh:mm:ss`, `days-hh`, `days-hh:mm` and `days-hh:mm:ss`.
- `#SBATCH --mem=1G`: Specifies the maximum main memory required _per_ node. In this case we set the cap to 1 gigabyte. If you don't use a memory unit, Slurm automatically uses megabytes: `#SBATCH --mem=4096` requests 4096 MB of RAM. **If you want to request all the memory on a node, you can use** `--mem=0`.
After the last `#SBATCH` directive, commands are run like any other regular shell script.
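For reference, a minimal sketch that combines the directives discussed above, with a simple `echo` command standing in for a real workload (the actual job script in this guide may differ), could look like this:
```
#!/bin/bash
#SBATCH --output=test_job.out   # standard output goes here
#SBATCH --error=test_job.err    # error messages go here (empty if the job has no errors)
#SBATCH --nodes=1               # run on any one available node
#SBATCH --time=10:00            # walltime of 10 minutes
#SBATCH --mem=1G                # at most 1 gigabyte of memory per node

# Everything after the directives runs like a regular shell script
echo "Hello from $(hostname)"
```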
...