A computational lab course covering the research data processing workflow, from just after the point of acquisition through computation and visualization. The course is taught using Python and POSIX. Students will work with large datasets of their choosing.
Example topics (subject to change):
Topic | Tools |
---|---|
Navigation | command line/shell, ssh, tmux, Kerberos+AFS, FarmShare/Sherlock |
Code & Single-threaded Processing | git, Stanford GitLab, Jupyter/Colab, ssh tunneling, ngrok/serveo/sish |
Data Representation & Access | numpy, pandas, SQLAlchemy |
Data Transfer | rsync, rclone, Stanford Google Drive |
Data File Formats | HDF5, Parquet, Avro, Arrow, Hadoop |
High-Performance Computing | job scheduling/SLURM, workflows, pipelines |
Optimizing & Abstracting Data Access | HDF5 chunking, Python data object model |
OS Virtualization | chroot, Docker, Singularity |
Interactive Web Visualization | dash/plotly, holoviews/bokeh |
Distributed Computing | dask, modin, Apache Spark |
Parallelism | OpenMPI |
The course was first offered in Spring 2021.
The Spring 2022 offering will meet Wed & Fri, 9:45-11:15 AM, in Packard 101.