Commit 8bbd1f17 authored by Mani Tofigh

Merge remote-tracking branch 'origin/master'

parents b675c22c 520c5632
root = true
[*]
# Use lf line endings; change to cr (or crlf) if that's the format you want
end_of_line = lf
@@ -3,4 +3,5 @@ _site
.sass-cache
Gemfile.lock
.bundle
.DS_Store
@@ -17,12 +17,18 @@ code examples are provided under the [MIT](https://opensource.org/licenses/MIT)
### Install build tools and dependencies.
<details>
<summary>Liquid 4.0.3</summary>
> [!WARNING]
> Due to Liquid not being updated to work with Ruby 3.2.x, **make sure you have Ruby 3.1.x or older installed**.
> If you have the latest dependencies installed, the following does not apply anymore.
>
> The original `jekyll-rtd-theme` 2.0.10 required `github-pages` 209, which effectively capped the version of Liquid to 4.0.3.
> Due to Liquid 4.0.3 and older not being updated to work with Ruby 3.2.x, Ruby 3.1.x or older was required for Liquid 4.0.3.
> https://talk.jekyllrb.com/t/liquid-4-0-3-tainted/7946/18
>
> #### With Cygwin
> As of this writing (8/8/2024), Cygwin provides Ruby versions 2.6.4-1 and 3.2.2-2. Make sure to install the former. Additionally, the version of bundler supplied with Ruby 2.6 is too old and the version of RubyGems is too new. *After installing the following dependencies*, you must then install the correct versions of RubyGems and bundler manually:
> As of 8/8/2024, Cygwin provided Ruby versions 2.6.4-1 and 3.2.2-2. You would need to make sure to install the former. As the version of bundler supplied with Ruby 2.6 is too old and the version of RubyGems is too new, the correct versions of RubyGems and bundler would need to be installed manually after installing all the other dependencies:
> ```
> gem update --system 3.2.3
> gem install bundler -v 2.1.4
@@ -31,6 +37,8 @@ code examples are provided under the [MIT](https://opensource.org/licenses/MIT)
> bundler -v
> ```
</details>
To allow building of native extensions, install `ruby-devel`, `gcc`, and `make`.
Install `libxml2`, `libxml2-devel`, `libxslt`, `libxslt-devel`, `libiconv`,
......
File mode changed from 100755 to 100644
@@ -4,10 +4,13 @@ sort: 3
# Contact Us
If you need help, please file a support request via <support@starhpc.hofstra.io>,
and our team will try to assist you as soon as possible.
If you need help, please file a support request via the provided forms at our [Issue Tracker](https://github.com/StarHPC/Issues/issues/new/choose).
If you need to contact us directly, you can reach us at <support@starhpc.hofstra.io>, and our team will try to assist you as soon as possible.
<!-- ![rtfm]({{ site.baseurl }}/help/rtfm.png "rtfm") -->
<!-- ![rtfm]({{ site.baseurl }}/help/rtfm2.png "rtfm2") -->
<!-- ![rtfm]({{ site.baseurl }}/help/rtfm3.png "rtfm3") -->
![rtfm]({{ site.baseurl }}/help/rtfm4.png "rtfm4")
@@ -8,16 +8,15 @@ sort: 1
### I forgot my password - what now?
You can reset it here: [link to be provided]
{% comment %}You can reset it here: [link to be provided]{% endcomment %}
Please contact the [support team]({{site.baseurl}}{% link help/contact.md %}).
### How do I change my password on Star?
The password can be changed on the [password reset page](#). Log in using
your username on Star.
You can run the `passwd` command on the login node to change your password. Please note that `passwd` has no effect when run on the compute nodes.
The `passwd` command known from other Linuxes does not work. The Star
system is using a centralised database for user management. This will
override the password changes done locally on Star.
{% comment %}A web portal is currently under development. Once launched, your password can also be changed from the password reset page, [link to be provided]. Log in using your username on Star.{% endcomment %}
### What is the ssh key fingerprint for star.hofstra.edu?
......
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
@@ -12,13 +12,7 @@ practices.
## Never send support requests directly to staff members
Please do not contact your cluster administrator directly. Always send
requests and inquiries to <support@starhpc.hofstra.io> for the quickest
response. On <support@starhpc.hofstra.io>, requests get tracked and have
higher visibility. Some of our staff members only work part time.
Sending the request to <support@starhpc.hofstra.io> makes sure that
somebody will pick it up. Please note in the request that it is for Star,
as there are other systems that are managed by us.
Please do not contact your cluster administrator directly. Instead, visit the [Issue Tracker](https://github.com/StarHPC/Issues) first, where separate forms are provided for different kinds of queries.
## Please do not treat us as "Let me Google that for you" assistants
@@ -47,7 +41,7 @@ png, tiff, etc) of what you saw on your monitor. From these, we would be
unable to copy and paste commands or error messages, unnecessarily slowing
down our research on your problem.
## New problem == new E-mail
## New problem == new ticket
Do not send support requests by replying to unrelated issues. Every
issue gets a number and this is the number that you see in the subject
@@ -106,7 +100,7 @@ The better you describe the problem the less we have to guess and ask.
Sometimes, just seeing the actual error message is enough to give a
useful answer. For all but the simplest cases, you will need to make the
problem reproducible, which you should *always* try anyway. See the
problem reproducible, which you should _always_ try anyway. See the
following points.
## Complex cases: Create an example which reproduces the problem
......
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
---
sort: 3
---
# Monitoring Jobs
Here you can see how to manage and monitor your jobs on our HPC cluster. Whether you're running batch jobs, interactive sessions, or array jobs, these tools and commands will help you keep track of your work and manage your resources.
## Checking Job Status
### Using `squeue`
The `squeue` command is probably the tool you will use most often to view the status of jobs in the queue. Basic usage:
```bash
squeue
```
Sample output:
```text
JOBID  PARTITION  NAME       USER    ST  TIME  NODES  NODELIST(REASON)
1234   batch      my_job     jsmith  R   5:23  1      cn01
1235   batch      array_job  jdoe    R   2:45  1      cn02
1236   gpu        gpu_task   asmith  PD  0:00  1      (Resources)
```
To see **only** your jobs:
```bash
squeue -u your_username
```
To see jobs on a specific partition:
```bash
squeue -p partition_name
```
### Job States
These are common job states that you might see under the `ST` column of `squeue`'s output:
- R: Running
- PD: Pending
- CG: Completing
- CD: Completed
- F: Failed
- TO: Timeout
- CA: Cancelled
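If you post-process `squeue` output in a script, a small lookup can turn these compact codes into readable names. This is a minimal sketch: `state_name` is a hypothetical helper, and it only knows the states listed above (Slurm defines others, which it reports as unknown).

```bash
#!/usr/bin/env bash
# state_name CODE: translate a compact job state code (as shown in the
# ST column of squeue) into a readable name.
# Hypothetical helper; it only covers the codes listed on this page.
state_name() {
  case "$1" in
    R)  echo "Running" ;;
    PD) echo "Pending" ;;
    CG) echo "Completing" ;;
    CD) echo "Completed" ;;
    F)  echo "Failed" ;;
    TO) echo "Timeout" ;;
    CA) echo "Cancelled" ;;
    *)  echo "Unknown ($1)" ;;
  esac
}
```

For example, `squeue -h -u your_username -o "%i %t" | while read -r id st; do echo "$id $(state_name "$st")"; done` would list each of your job IDs with a readable state; `-h` suppresses the header, and `%i` and `%t` select the job ID and the compact state.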
## Detailed Job Information
### Using `scontrol`
To get detailed information about a specific job:
```bash
scontrol show job job_id
```
Sample output:
```text
JobId=1234 JobName=my_job
   UserId=jsmith(1001) GroupId=users(1001) MCS_label=N/A
   Priority=4294901758 Nice=0 Account=default QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:10:12 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2023-06-01T10:00:00 EligibleTime=2023-06-01T10:00:00
   AccrueTime=2023-06-01T10:00:00
   StartTime=2023-06-01T10:05:00 EndTime=2023-06-01T11:05:00 Deadline=N/A
   PreemptEligibleTime=2023-06-01T10:05:00 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-01T10:05:00
   Partition=batch AllocNode:Sid=login01:12345
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cn01
   BatchHost=cn01
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/jsmith/my_script.sh
   WorkDir=/home/jsmith
   StdErr=/home/jsmith/my_job.err
   StdIn=/dev/null
   StdOut=/home/jsmith/my_job.out
   Power=
```
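The output is a list of `Key=Value` pairs, so a single field can be pulled out with standard text tools. A minimal sketch, assuming the value you want contains no whitespace (true for fields such as `JobState` or `NodeList`, but not guaranteed for paths such as `WorkDir`); `job_field` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# job_field KEY: read `scontrol show job` output on stdin and print the
# value of the first KEY=VALUE pair whose key matches KEY exactly.
# Hypothetical helper; assumes the value itself contains no whitespace.
job_field() {
  local key="$1"
  # Put each whitespace-separated token on its own line, then compare the
  # text before the first "=" with KEY and print everything after it.
  tr -s '[:space:]' '\n' |
    awk -v k="$key" '!found {
      eq = index($0, "=")
      if (eq > 0 && substr($0, 1, eq - 1) == k) {
        print substr($0, eq + 1)
        found = 1
      }
    }'
}
```

For example, `scontrol show job job_id | job_field JobState` would print `RUNNING` for the sample job above.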
## Cancelling Jobs
To cancel a job:
```bash
scancel job_id
```
To cancel all your jobs:
```bash
scancel -u your_username
```
To cancel all your pending jobs:
```bash
scancel -t PENDING -u your_username
```
## Modifying Jobs
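If you submit a job and then realize that one of its attributes needs to change, you don't need to cancel and resubmit the whole job. You can modify certain attributes of a job that is already in the queue using the `scontrol update` command; many of them can only be changed while the job is still pending.
For example, to change the time limit of a job (regular users can typically only lower it; raising a time limit usually requires an administrator):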
```bash
scontrol update JobId=job_id TimeLimit=2:00:00
```
To change the number of CPUs:
```bash
scontrol update JobId=job_id NumCPUs=4
```
## Monitoring Resource Usage
### Using `sstat`
For running jobs, you can use `sstat` to get resource usage statistics:
```bash
sstat -j job_id --format=JobID,AveCPU,AveRSS,AveVMSize
```
Sample output:
```text
JobID    AveCPU       AveRSS       AveVMSize
-------- ------------ ------------ -----------
1234.0   00:05:23     1234K        4567K
```
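`sstat` (and `sacct`, below) report memory fields such as `AveRSS` and `MaxRSS` with a unit suffix. To compare them against what you requested, it can help to convert them to one unit. A minimal sketch, assuming the `K`/`M`/`G`/`T` suffixes are binary (1024-based) multiples and that a bare number is in bytes; `to_mib` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# to_mib VALUE: convert a memory figure such as 1234K, 512M or 1.50G
# to MiB, printed with one decimal place.
# Hypothetical helper; assumes K/M/G/T are 1024-based units and that a
# value without a suffix is in bytes.
to_mib() {
  awk -v v="$1" 'BEGIN {
    unit = substr(v, length(v), 1)
    num = v + 0                      # awk keeps only the leading number
    if (unit == "K")      num /= 1024
    else if (unit == "G") num *= 1024
    else if (unit == "T") num *= 1024 * 1024
    else if (unit != "M") num /= 1024 * 1024
    printf "%.1f\n", num
  }'
}
```

For example, `to_mib 1234K` prints `1.2` and `to_mib 4G` prints `4096.0`.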
### Using `sacct`
For completed jobs, use `sacct` to view accounting data:
```bash
sacct -j job_id --format=JobID,JobName,MaxRSS,Elapsed
```
Sample output:
```text
JobID        JobName    MaxRSS     Elapsed
------------ ---------- ---------- ----------
1234         my_job     4096K      00:15:23
```
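The `Elapsed` column is a time string rather than a number, which is awkward if you want to compare run times in a script. A minimal sketch that handles the `[D-][HH:]MM:SS` form shown by `sacct`; `elapsed_to_seconds` is a hypothetical helper and does not cover every time format Slurm accepts as input. If your Slurm version provides the `ElapsedRaw` field, you can also ask `sacct` for seconds directly.

```bash
#!/usr/bin/env bash
# elapsed_to_seconds VALUE: convert an Elapsed value of the form
# [D-][HH:]MM:SS (e.g. 00:15:23 or 1-02:03:04) into seconds.
# Hypothetical helper; it does not handle every Slurm time format.
elapsed_to_seconds() {
  local t="$1" days=0 h=0 m s
  local -a parts
  # Split off an optional "D-" day prefix.
  if [[ "$t" == *-* ]]; then
    days="${t%%-*}"
    t="${t#*-}"
  fi
  IFS=: read -r -a parts <<< "$t"
  if (( ${#parts[@]} == 3 )); then
    h="${parts[0]}" m="${parts[1]}" s="${parts[2]}"
  else
    m="${parts[0]}" s="${parts[1]}"
  fi
  # 10# forces base 10, so fields such as "08" are not parsed as octal.
  echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
```

For the sample job above, `elapsed_to_seconds 00:15:23` prints `923`.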
## Monitoring Cluster Status
### Using `sinfo`
To see the overall status of the cluster:
```bash
sinfo
```
Sample output:
```text
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1    mix gpu1
defq*        up   infinite      2   idle cn01,gpu2
```
To see more detailed information about a specific node, for example `gpu1`:
```bash
sinfo -n gpu1 -o "%n %c %m %t %f %G %D %P %C %O"
```
Sample output:
```text
HOSTNAMES CPUS MEMORY STATE AVAIL_FEATURES GRES   NODES PARTITION CPUS(A/I/O/T) CPU_LOAD
gpu1      128  1      mix   location=local (null) 1     defq*     5/123/0/128   1.67
```
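The `CPUS(A/I/O/T)` column packs four counts into one field: allocated, idle, other (for example, CPUs on nodes that are down or drained), and total. A minimal sketch that unpacks it and adds the share of CPUs in use; `cpu_summary` is a hypothetical helper, and the percentage is rounded down:

```bash
#!/usr/bin/env bash
# cpu_summary VALUE: expand a CPUS(A/I/O/T) field from sinfo (%C), such as
# 5/123/0/128, and add the percentage of CPUs allocated (rounded down).
# Hypothetical helper for illustration.
cpu_summary() {
  local alloc idle other total
  IFS=/ read -r alloc idle other total <<< "$1"
  if (( total == 0 )); then
    echo "no CPUs reported" >&2
    return 1
  fi
  echo "allocated=$alloc idle=$idle other=$other total=$total used=$(( 100 * alloc / total ))%"
}
```

For the sample output above, `cpu_summary 5/123/0/128` prints `allocated=5 idle=123 other=0 total=128 used=3%`. On the cluster you could feed it directly with `cpu_summary "$(sinfo -h -n gpu1 -o %C)"`.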
## Job Arrays
For job arrays, you can use most of the above commands with some modifications.
To see the status of all tasks in a job array:
```bash
squeue -j array_job_id
```
Sample output:
```text
JOBID   PARTITION  NAME       USER  ST  TIME  NODES  NODELIST(REASON)
1234_1  batch      array_job  jdoe  R   5:23  1      cn01
1234_2  batch      array_job  jdoe  R   5:23  1      cn02
1234_3  batch      array_job  jdoe  PD  0:00  1      (Resources)
```
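Each array task ID has the form `<array_job_id>_<task_id>`. Without the `-r` option, `squeue` may also collapse pending tasks into a single line such as `1234_[4-10]`. A minimal sketch that splits either form with plain parameter expansion; `split_array_id` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# split_array_id ID: split an array task ID as printed by squeue
# (e.g. 1234_3) into the parent array job ID and the task part.
# Hypothetical helper; a collapsed pending entry such as 1234_[4-10]
# yields the bracketed range as its task part.
split_array_id() {
  local id="$1"
  echo "array_job=${id%%_*} task=${id#*_}"
}
```

Both parts are useful with `scancel`: `scancel 1234_3` cancels a single task, while `scancel 1234` cancels the whole array.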
## Troubleshooting
If a job fails, try checking the following:
1. Look at the job's output and error files.
2. Check the job's resource usage with `sacct`.
3. Verify that you requested sufficient resources and that your job was not terminated for exceeding the resources it requested (for example, memory or time).
Remember, if you're having persistent issues, don't hesitate to reach out to the support team.
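When working through the checklist above, the `State` that `sacct` records for a job often points straight at the cause. The sketch below maps a few common end states to a suggested next step. `failure_hint` is a hypothetical helper that covers only a handful of states, and the `sbatch` options it mentions (`--mem`, `--mem-per-cpu`, `--time`) are the standard ones, though your job script may request resources differently.

```bash
#!/usr/bin/env bash
# failure_hint STATE: suggest a next step for a finished job based on the
# State field reported by sacct. Hypothetical helper; covers a few states only.
# Patterns end in * because sacct can append detail, e.g. "CANCELLED by 1001".
failure_hint() {
  case "$1" in
    COMPLETED*)     echo "Job finished normally." ;;
    OUT_OF_MEMORY*) echo "Exceeded memory: request more with --mem or --mem-per-cpu." ;;
    TIMEOUT*)       echo "Hit the time limit: request more time with --time." ;;
    CANCELLED*)     echo "Job was cancelled; the State field shows by whom." ;;
    FAILED*)        echo "Non-zero exit code: check the job's output and error files." ;;
    *)              echo "State '$1': check the job's output and error files." ;;
  esac
}
```

On the cluster you might call it as `failure_hint "$(sacct -j job_id -X -n -P -o State)"`, where `-X` limits the output to the job allocation itself, `-n` drops the header, and `-P` removes the column padding.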
File mode changed from 100755 to 100644
@@ -65,7 +65,7 @@ Submit the above information through the online registration form.
## Login node
Access to the cluster is provided through SSH [(What is SSH?)](https://www.youtube.com/watch?v=qWKK_PNHnnA&ab_channel=Tinkernut) to the login node. The login node serves as the gateway or entry point to the cluster. It is important to understand that the login node is not for running computationally intensive tasks itself. Instead, it is for tasks such as file management, editing, and job submission. The actual computational work is done on the compute nodes, which you access indirectly by submitting jobs through Slurm, the job scheduling system.
Access to the cluster is provided through SSH access to the login node. The login node serves as the gateway or entry point to the cluster. Note that most software tools are not available on the login node and it is not for prototyping, building software, or running computationally intensive tasks itself. Instead, the login node is specifically for accessing the cluster and performing only very basic tasks, such as copying and moving files, submitting jobs, and checking the status of existing jobs. For development tasks, you would use one of the development nodes, which are accessed the same way as the large compute nodes. The compute nodes are where all the actual computational work is performed. They are accessed by launching jobs through Slurm with `sbatch` or `srun`.
## Scheduler policies
......
File mode changed from 100755 to 100644
@@ -820,7 +820,7 @@ In C++, there's no virtual environment to delete. However, you can remove your l
Remember to also remove or comment out the environment variable settings in your `~/.bashrc` or `~/.bash_profile` if you no longer need them.
# Rust
## Rust
### How to simulate a virtual environment with rust
......
File mode changed from 100755 to 100644
@@ -48,6 +48,20 @@ Rsync is a particularly useful tool and is recommended for transferring files to
When transferring very large files or datasets, it is advised to use rsync and to calculate and confirm checksums to ensure data integrity.
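A minimal sketch of the checksum step, assuming GNU coreutils' `sha256sum` is available on both ends (on macOS, `shasum -a 256` can be used the same way). The file names are placeholders, and the example runs both halves locally on a throwaway file so it can be tried anywhere; in a real transfer you would create the checksum file on the source machine and run the check on the other side after copying both files.

```bash
#!/usr/bin/env bash
set -euo pipefail
# Hypothetical example: a throwaway file in a scratch directory stands in
# for the dataset you want to transfer.
workdir="$(mktemp -d)"
cd "$workdir"
printf 'example data\n' > dataset.txt

# On the source machine: record the checksum next to the data.
sha256sum dataset.txt > dataset.txt.sha256

# ...transfer dataset.txt and dataset.txt.sha256 (e.g. with rsync)...

# On the destination: recompute and compare. Prints "dataset.txt: OK",
# and exits non-zero if the file does not match.
sha256sum -c dataset.txt.sha256
```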
## Cyberduck
Cyberduck is a file transfer application with an intuitive graphical interface for transferring files to or from a remote machine. Cyberduck is available for both Windows and Mac. Download it from [cyberduck.io](https://cyberduck.io/).
Click "Open Connection" and a new window like the one below will open. Select "SFTP (SSH File Transfer Protocol)" from the top dropdown menu. Enter the server, port number, your username, and Linux Lab password. Then click "Connect".
![3-connection.png]({{ site.baseurl }}/images/cyberduck_setup_images/3-connection.png "3-connection.png")
If you see a window asking about an "Unknown fingerprint", click "Always" and then "Allow".
![4-fingerprint.png]({{ site.baseurl }}/images/cyberduck_setup_images/4-fingerprint.png "4-fingerprint.png")
You should now be able to see your user's home directory on the cluster. You can transfer files to and from it by dragging and dropping files between this window and your "Finder" windows.
## Network Interfaces and Bandwidth
All file transfer access to the Star HPC Cluster is currently through the login node's 1GbE interface. Users should be aware of potential bandwidth limitations, especially when transferring large amounts of data.
@@ -55,3 +69,4 @@ All file transfer access to the Star HPC Cluster is currently through the login
## User Authentication and Permissions
File transfers are authenticated in the same way as SSH access. SSH keys are the preferred method for secure authentication, although password authentication is currently allowed. Plans for implementing Multi-Factor Authentication (MFA) are being considered for future security enhancements.
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644