Supercomputers built for “high-performance computing” have been around for longer than some of their users have been alive. One of the first supercomputers, the Cray-1, was set up in 1976 and put into operation at Los Alamos National Laboratory. Since then, the design of supercomputers has undergone several revolutions. Today, most universities and a growing share of industry across several domains exploit the computational power of clusters of interconnected servers.
These high-performance computing (HPC) clusters are used for large-scale data processing and data analysis, fine-grained parallel calculations, and simulations of ever-increasing fidelity. This course material introduces learners to the core principles behind using an HPC cluster: how to connect to it, how to dispatch jobs and retrieve their results, and how to program applications that can harness the full power of this machinery.
Please note that this lesson uses Python 3 without claiming that Python is the universal language for HPC. Python is merely a vehicle for conveying concepts that, thanks to the intuitiveness of the language, should be easy to transfer to other languages and domains.
Prerequisites
If you have already written small programs in a language of your choice, know the difference between a “variable” and a “function”, and have a minimal knowledge of the UNIX command line (e.g. if you have completed shell-novice), you are good to go.
This lesson guides you through the basics of using a computer cluster (or batch farm or supercomputer). If you are already comfortable with systems like LSF, Slurm or PBS/Pro and have already run applications on a supercomputer, or even written parallel applications to run on a cluster, you probably won’t learn much from this lesson. But you are welcome to help the others as a technical assistant or to contribute to this course material.
Setup | Download files used in the lesson. | |
00:00 | Logging in | What is a supercomputer? Where is a supercomputer? How do I connect to it? How do I transfer data to and from it? |
00:50 | Batch systems and schedulers 101 | Why is a scheduler necessary? How do I launch a program to run on any one remote host in the cluster? How do I capture the output of a program that is run on a remote host? |
01:35 | Working with the scheduler | How do I know if something went wrong with my jobs? How do I decrease the waiting time of my jobs? How do I cancel a job? |
02:05 | Working with the shared file system | How do I store a file on a node so that another node can see it? How do I store a file on a node so that another node cannot see it? Do I really need scp for all of the above? |
02:35 | Changing the Environment | How do I extend the software installation of an HPC cluster? |
03:05 | Estimation of Pi for Pedestrians | How do I find the portion of a code snippet that consumes the longest time? |
03:55 | Parallel Estimation of Pi for Pedestrians | What are data parallel algorithms? How can I estimate the yield of parallelization without writing one line of code? How do I use multiple cores on a computer in my program? |
04:40 | Higher levels of parallelism | What were the key changes when using the multiprocessing library? How could this be implemented with Dask? How does a conversion using Dask's high-level API compare? |
05:25 | Searching for Pi | How do I analyze a lot of large files efficiently? |
06:10 | Bonus session: Distributing computations among computers | What is the message passing interface (MPI)? How do I exploit parallelism using the message passing interface (MPI)? |
07:05 | Finish |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.