---
sort: 6
---
# Fair Share
The fair-share system is designed to encourage users to balance their use of resources over their allocation period. Fair-share is the largest factor in determining priority, but not the only one. For more details see [Job Prioritisation]({{site.baseurl}}{% link jobs/scheduling-policies.md %}#job-prioritization).
## Fair Share Score
Your Fair Share score is a number between *0* and *1*. Projects with a *larger* Fair Share score receive a *higher priority* in the queue.
A project is given an [allocation of compute units](https://docs.nesi.org.nz/Getting_Started/Accounts-Projects_and_Allocations/What_is_an_allocation/) over a given *period*. An institution also has a percentage *Fair Share entitlement* of each machine's deliverable capacity over that same period.
```note
Although we use the term "Fair Share entitlement" in this article, it bears only a loose relationship to an institution's contractual entitlement to receive allocations from the NeSI HPC Compute & Analytics service. The Fair Share entitlement is managed separately for each cluster, and is adjusted as needed by NeSI staff so that each institution can receive, as nearly as possible, its contractual entitlement to the service as a whole, as well as a mix of cluster hours that corresponds closely to the needs of that institution's various project teams.
```
Expected rates of use are determined as follows:
- *Your project's expected rate of use* = (your institution's Fair Share entitlement × your project's allocation) / (sum of your institution's allocations × period)
- *Your institution's expected rate of use* = your institution's Fair Share entitlement on that machine
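As a hypothetical worked example of the first formula (all of the numbers below are made up purely to illustrate the arithmetic): suppose your institution's Fair Share entitlement on a cluster is 20%, your project holds an allocation of 100,000 compute units, the sum of your institution's current allocations is 500,000 compute units, and the period is one year (8,760 hours):
```
# Hypothetical numbers, for illustration only
entitlement=0.20            # institution's Fair Share entitlement (20% of the machine)
project_alloc=100000        # this project's allocation, in compute units
institution_allocs=500000   # sum of the institution's current allocations, in compute units
period_hours=8760           # allocation period: one year, in hours

# Your project's expected rate of use, per the formula above
echo "$entitlement * $project_alloc / ($institution_allocs * $period_hours)" | bc -l
# prints ~0.0000045: the project's share of the institution's allocations (1/5)
# times the institution's entitlement (0.20), spread evenly over the period
```
Using the cluster faster or slower than this expected rate is what moves your Fair Share score down or up.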
If an entity — an institution or project team — is using the machine more slowly than expected, for Fair Share purposes it is considered a light user. By contrast, one using the machine faster than expected is a heavy user.
- Projects at lightly using institutions get a higher Fair Share score than those at heavily using institutions.
- Within each institution, lightly using projects get a higher Fair Share score than heavily using projects.
- Using *faster* than your *expected rate of usage* will usually cause your Fair Share score to *decrease*. The more extreme the overuse, the more severe the likely drop.
- Using *slower* than your *expected rate of usage* will usually cause your Fair Share score to *increase*. The more extreme the under-use, the greater the Fair Share bonus.
- Using the cluster *unevenly* will cause your Fair Share score to *decrease*.
## What is Fair Share?
Fair Share is a mechanism to set job priorities. It is based on a share of the cluster, that is, a fraction of the cluster's overall computing capacity.
### Fair Share on Star
On the Star nodes (but not on the Star development nodes described below), we set a project's expected rate of use based on that project's percentage share of all then-current allocations awarded to that project's institution on that cluster. This percentage share is in turn derived from the sizes (in compute units or nodes) and durations (in days), and thus the expected rates of use, of those same allocations.
Therefore:
- If the size of your allocation increases, your project's share of the cluster will increase. Conversely, if the size of your allocation decreases, your project's share of the cluster will decrease.
- If the size of another project's allocation increases, your project's share of the cluster will decrease: even though your own allocation has remained the same size, the total size of all allocations has increased, so your allocation's share has decreased. Conversely, if the other project's allocation decreases, your project's share of the cluster will increase.
- If the cluster gets larger (e.g. we purchase and install more computing capacity), your project's share of the cluster will not change, but that share of the cluster will correspond to a higher rate of core hour usage. This situation will only last until more allocations are issued, or existing allocations are made larger, to take advantage of the increased capacity. The opposite will occur if the cluster shrinks, though cluster shrinkage is not expected to occur.
On the Star nodes, Fair Share is not designed to ensure that all project teams get the same share of the cluster.
### Fair Share on the Star development nodes (GPU3 and GPU4)
The development nodes are a small resource: only two nodes of 64 CPU cores each. They are intended for pre- and post-processing work related to computational jobs carried out on the main Star nodes. Therefore, we do not make allocations of CPU core hours on these nodes. Instead, each project team that has a current allocation on the Star nodes is entitled to an equal share of the time on these two Star development nodes.
Because job priority on the Star development nodes is still heavily influenced by Fair Share, project teams that have recently been doing a lot of work on the Star development nodes will find their jobs there deprioritised, so that other project teams can access the resource. However, even heavy users of the Star development nodes can still access resources there if those CPU cores would otherwise be idle.
## How does Fair Share work?
The starting point for a Fair Share calculation is a comparison of the project's actual share of use to the expected share of use. This share of use is based on what all users of the cluster have actually used during the relevant period of time, not what the cluster was capable of delivering during that same period. Currently, each period is five minutes.
Because five minutes is a short time, Fair Share aggregates the ratio of actual share to expected share since records began on that cluster. However, the further a five-minute window recedes into the past, the less influence it has on Fair Share scores. In our current configuration, after two weeks (that is, 4,032 successive five-minute windows) the ratio for a given five-minute slice carries only half the weight it carried initially; after four weeks, a quarter; after six weeks, an eighth; and so on. The effect of this decay curve is that over-use or under-use in the recent past has a greater effect on your project's Fair Share score than the same extent of over-use or under-use long ago.
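As a rough sketch of that decay weighting (the 14-day half-life is the figure quoted above; Slurm applies this decay internally, so the snippet below is purely illustrative):
```
# Relative weight of usage that is AGE_DAYS old, assuming a 14-day half-life
AGE_DAYS=28
echo "scale=4; e( l(0.5) * $AGE_DAYS / 14 )" | bc -l
# prints ~0.25: usage from four weeks ago counts a quarter as much as fresh usage
```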
One important implication of Fair Share is that allocations are implicitly aged: you cannot bank core hours by refraining from submitting work. If, for example, you expect to have a lot of computational work to carry out in September, you can't get a significant priority boost in September by refraining from carrying out computational work in March. In fact, you will get the best advantage from Fair Share by submitting work at close to a constant rate.
If you expect that your project team will need widely varying rates of computer use during your allocation period and you can predict when your busy and quiet periods will be, please [Contact our Support Team]({{site.baseurl}}{% link help/contact.md %}) to enquire about splitting your project's allocation up into parts. Please be aware that we cannot guarantee this option will be available for any given project, and that we are most likely to be able to accommodate such a request for projects that expect to use the cluster heavily on average, can predict when they will need their heaviest use with a high degree of confidence, and give us plenty of notice.
For full details on Slurm's Fair Share mechanism, please see [this page](https://slurm.schedmd.com/priority_multifactor.html#fairshare){:target="_blank"}.
## How do I check my project's Fair Share score?
- The command `nn_corehour_usage <project_code>`, run on a Star login node, will show, along with other information, the current Fair Share score and ranking of the specified project.
- The `sshare` command, on a Star login node, will show the fair share tree. A related command, `nn_sshare_sorted`, will show projects in order from the highest fair share score to the lowest.
In our current configuration, Fair Share scores are attached to projects, not to individual users.
## My project's Fair Share score is too low. How can I improve it?
If you have just carried out an unusually large spike of work, your fair share score will naturally be lowered for a while, and should come back to normal after a few days.
If, on the other hand, you have more work to do than expected, please [Contact our Support Team]({{site.baseurl}}{% link help/contact.md %}) to apply for a larger allocation. Project teams may request a larger allocation on Star, though not on the Star development nodes.
If you believe your project's fair share score has become corrupted, or your ability to get work done is affected by a low Fair Share entitlement for your institution on that cluster, please [Contact our Support Team]({{site.baseurl}}{% link help/contact.md %}).
## Sources
* [https://docs.nesi.org.nz/Scientific_Computing/Running_Jobs_on_Maui_and_Mahuika/Fair_Share/](https://docs.nesi.org.nz/Scientific_Computing/Running_Jobs_on_Maui_and_Mahuika/Fair_Share/)
---
sort: 5
---
# Scheduler Policies (Alex's version)
## Job Priority
The job scheduler (Slurm) on Star uses a priority-based scheduling method. Each submitted job is assigned a priority in order to determine its relative importance and the order in which pending jobs are scheduled. A job's priority is an integer value calculated from a number of factors, as explained below. Jobs start when sufficient resources (CPUs, GPUs, memory, licenses) are available and are not already reserved for a job with higher priority. A job's priority determines its position in the queue relative to other jobs and the order in which the pending jobs will run. The pending job with the highest priority will, in principle, be scheduled first, except when a smaller or shorter job can start without delaying a job with a higher priority, a strategy known as backfilling. Several commands can be used to get insight into why your job is waiting and where it sits in the queue; these commands are described later on this page.
## Priority Factors
Job priority scores are determined by a number of factors:
### Quality of Service (QoS)
The QoS factor is a value given by the Quality of Service associated with the job, specified at job submission with the `--qos` option.
The "debug" Quality of Service can be gained by adding the sbatch command line option `--qos=debug`.
This adds 5000 to the job priority, raising it above all non-debug jobs, but it is limited to one small job per user at a time: no more than 15 minutes and no more than 2 nodes.
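For example, a minimal batch script requesting the debug QoS might look like the sketch below (the job name, resource numbers and program are placeholders):
```
#!/bin/bash
#SBATCH --job-name=debug-test    # placeholder job name
#SBATCH --qos=debug              # adds 5000 to the job priority
#SBATCH --time=00:15:00          # debug QoS limit: at most 15 minutes
#SBATCH --nodes=1                # debug QoS limit: at most 2 nodes
#SBATCH --ntasks=1

srun ./my_program                # placeholder executable
```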
### Fair Share
We use Slurm's concept of "fair-share" to promote balanced resource usage among accounts. The fair-share factor is designed so that the scheduler deprioritizes accounts with excessive resource utilization: accounts that have not used the cluster as much get a higher priority for their jobs, while accounts that have already used the cluster heavily do not continue to dominate it.
The job priority decreases whenever the project uses more core-hours than expected. The [Fair Share]({{site.baseurl}}{% link jobs/fairshare.md %}) policy means that projects that have consumed many CPU core hours in the recent past compared to their expected rate of use (either by submitting and running many jobs, or by submitting and running large jobs) will have a lower priority, and projects with little recent activity compared to their expected rate of use will see their waiting jobs start sooner.
The fair-share factor is a fractional number between 0 and 1 that is assigned to all accounts based on their past usage. Slurm computes this number regularly and it changes based on your usage and on the total number of accounts on the system. The job priority calculation considers this variable in determining the priority of pending jobs. On Star, the Fair Share factor can contribute up to 1000 points to the job priority. To see the recent usage and current fair-share score of a project, you can use the command `nn_corehour_usage`.
The fair-share parameter has a 'forgetting' threshold that causes it to consider only the recent history of the account, not the account's total use throughout its lifetime.
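The relevant Slurm setting is typically the decay half-life; you can check what is currently configured on the cluster with:
```
scontrol show config | grep PriorityDecayHalfLife
```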
### Job Age
Job priority slowly rises as a pending job gets older, at 1 point per hour, for up to 3 weeks.
Note that the job age parameter is bounded so that priority stops increasing when the bound is reached.
### Job Size or "TRES" (Trackable RESources)
The job size factor can be configured to favor either small or large jobs. Currently on Star, this factor is configured to prioritize smaller jobs. In other configurations, it is often used instead to favor jobs that request a larger count of CPUs (or memory or GPUs), as a means of countering their otherwise inherently longer wait times.
### Project Allocation Class
| Project class | Class Priority Score |
|---------------|----------------------|
| Proposal Development | 10 |
| Postgraduate | 20 |
| Collaborator | 30 |
| Merit | 40 |
| Commercial | 40 |
### Nice values
It is possible to give a job a "nice" value which is subtracted from its priority. You can do that with the `--nice` option of `sbatch` or the `scontrol update` command. The command `scontrol top <jobid>` adjusts nice values to increase the priority of one of your jobs at the expense of any others you have in the same partition.
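For example (the job IDs and script name below are placeholders):
```
# Submit a job with lowered priority (a larger nice value means lower priority)
sbatch --nice=100 my_job.sl

# Add a nice value to a job that is already queued
scontrol update JobId=1234567 Nice=100

# Raise one of your own pending jobs above your others in the same partition
scontrol top 1234567
```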
### Holds
Jobs with a priority of 0 are in a "held" state and will never start without further intervention. You can hold jobs with the command `scontrol hold <jobid>` and release them with `scontrol release <jobid>`. Jobs can also end up in this state when they get requeued after a node failure.
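For example (the job ID is a placeholder):
```
scontrol hold 1234567      # sets the job's priority to 0 so it will not start
scontrol release 1234567   # makes the job eligible to be scheduled again
```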
## Other Limits
Cluster and partition-specific limits can sometimes prevent jobs from starting regardless of their priority score. For details see the [partition limits](#) page.
## Priority Calculation
Slurm calculates the priority of each job as a weighted sum of these factors:
```
Job_priority =
(PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (partition_factor) +
(PriorityWeightQOS) * (QOS_factor) +
(possibly some other advanced factors that are not relevant for Star)
```
All the factors in this formula are floating point numbers between 0.0 and 1.0, while the weights are integer values that determine how heavily each factor counts.
The current configuration of the cluster can be found by running the following command:
```
scontrol show config | grep ^Priority
```
As of this writing, we use the following weights in the Slurm configuration on Star:
(TBD)
This means that the priority of a job is mainly determined by its fair-share component, with a smaller contribution from its age.
The age of a job refers to how long it has been waiting in the queue, on a scale from 0 to 100 days: if it has just been queued, the age factor will be 0.0; after 50 days of waiting the age factor will be 0.5; and after 100 days or more the job's age factor will reach and stay at the maximum value of 1.0.
Finally, the most important factor is the fair-share factor: it indicates how much a user has recently been using the system compared to the share of the system allocated to that user. This usage decays over time, so the most recent usage counts most: if a user stopped using the cluster entirely, their recorded usage would decay to half of its original value after the configured half-life period of one week.
## Backfill
Backfill is a scheduling strategy in which lower priority jobs can be scheduled earlier than higher priority jobs to fill idle slots, provided they are finished before the next high priority job is expected to start based on resource availability. In other words, backfilling allows small, short jobs to run immediately if in doing so they will not delay the expected start time of any higher-priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs it is important that all users set reasonably accurate job time limits for backfilling to work well.
While the kinds of jobs that can be backfilled may also receive a low job size score, it is our general experience that the ability to be backfilled is, on the whole, more useful when it comes to getting work done on the cluster.
More information about backfilling can be found at [SchedMD's Scheduling Configuration Guide](https://slurm.schedmd.com/sched_config.html){:target="_blank"}.
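If you are curious how much work the backfill scheduler is actually doing, the Slurm diagnostics command prints backfill statistics, among other things:
```
sdiag
```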
## Useful commands related to priority and fairshare
### sprio: show priority per job
The [sprio](http://slurm.schedmd.com/sprio.html){:target="_blank"} command shows the priority per job, including the individual components for the job age and fairshare (both already multiplied by their corresponding weights). This can be useful for comparing jobs to other waiting jobs and finding out why a job is still waiting.
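For example (the job ID is a placeholder):
```
sprio -l             # long format: one row per pending job, with each weighted factor
sprio -j 1234567     # priority breakdown for a specific job
```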
### sshare: show your fairshare number
For a user, running the [sshare](http://slurm.schedmd.com/sshare.html){:target="_blank"} command will show in more detail the current fairshare number for that user and the two main components that determine it: the (normalized) share of the system assigned to them and their effective usage of the system.
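For example:
```
sshare        # your own share of the system and effective usage
sshare -l     # long format, with additional fair-share columns
```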
### squeue: show all jobs, priorities, estimated start times
The [squeue](http://slurm.schedmd.com/squeue.html){:target="_blank"} command shows all jobs on the system and sorts them, by default, by status and priority: waiting jobs are shown first, in descending order of priority, followed by the running ones. A job's position in the queue gives some indication of when it will start. To list the actual priorities, you can run squeue with some additional flags:
```
squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.18R %p"
```
This will include all the default columns and, additionally, a column with the actual priority of each job.
Another useful option for squeue is `--start`: it shows the estimated start time for (some) waiting jobs, where Slurm can already calculate one. Note that these are very rough estimates, since they depend on several factors.
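For example, to see estimated start times for your own pending jobs:
```
squeue --start -u $USER
```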
The priority given to a job can also be obtained with squeue:
```
squeue -o %Q -j <jobid>
```
## Sources
* [https://wiki.hpc.rug.nl/habrok/advanced_job_management/job_prioritization](https://wiki.hpc.rug.nl/habrok/advanced_job_management/job_prioritization)
* [https://docs.nesi.org.nz/Scientific_Computing/Running_Jobs_on_Maui_and_Mahuika/Job_prioritisation/](https://docs.nesi.org.nz/Scientific_Computing/Running_Jobs_on_Maui_and_Mahuika/Job_prioritisation/)