BI Lab - BIOE 301P

BIOE 301P - Research Data & Computation

A computational lab course that spans the research data processing workflow, from just after the point of acquisition through computation and visualization. The course is taught using Python and POSIX tools. Students work with large datasets of their choosing.

Learning Goals

Catalog data in an archival framework for long-term storage and use in an efficient and programmatic manner
Choose an appropriate data file format and tune its storage parameters for the research to be conducted
Transform raw datasets into smaller derivative datasets for rapid data processing
Develop computational workflows for conducting programmatic, reproducible, accessible, and transparent research
Scale code from interactive, single-threaded workloads to parallel and distributed toolchains
Prototype iterative and rapid visualization frameworks on data to validate workflows and conduct preliminary analyses
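
As a rough illustration of two of the goals above (deriving smaller datasets from raw data, and scaling from single-threaded to parallel code), here is a minimal sketch using only the Python standard library. The function names, the block-averaging reduction, and the toy `recordings` data are assumptions made for this example and are not taken from the course materials.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from statistics import fmean


def block_average(samples, block_size):
    """Reduce a sequence to one mean per block of `block_size` samples.

    If len(samples) is not a multiple of block_size, the final block is
    shorter and is averaged over the samples it does contain.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    return [
        fmean(samples[start:start + block_size])
        for start in range(0, len(samples), block_size)
    ]


def derive_serial(recordings, block_size):
    """Single-threaded baseline: reduce one recording at a time."""
    return [block_average(rec, block_size) for rec in recordings]


def derive_parallel(recordings, block_size, max_workers=4):
    """Same reduction, dispatched through an executor's map().

    Threads are used here so the sketch runs anywhere, including notebooks.
    In CPython, threads mostly help I/O-bound work; for CPU-bound work,
    ProcessPoolExecutor offers the same map() interface.
    """
    reduce_one = partial(block_average, block_size=block_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with recordings.
        return list(pool.map(reduce_one, recordings))


# Hypothetical stand-in for raw data: three recordings of 12 samples each.
recordings = [[float(i + offset) for i in range(12)] for offset in (0, 100, 200)]

serial = derive_serial(recordings, block_size=4)
parallel = derive_parallel(recordings, block_size=4)
```

Each 12-sample recording is reduced to 3 values, and the serial and parallel versions return the same result in the same order. Keeping the per-recording function free of shared state is what lets the same code move between an interactive loop and an executor.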

Topics/Weekly Schedule

Example topics - subject to change.

Topic	Tools
Navigation	command line/shell, ssh, tmux, Kerberos+AFS, FarmShare/Sherlock
Code & Single-threaded Processing	git, Stanford GitLab, Jupyter/Colab, ssh tunneling, ngrok/serveo/sish
Data Representation & Access	numpy, pandas, SQLAlchemy
Data Transfer	rsync, rclone, Stanford Google Drive
Data File Formats	HDF5, Parquet, Avro, Arrow, Hadoop
High-Performance Computing	job scheduling/SLURM, workflows, pipelines
Optimizing & Abstracting Data Access	HDF5 chunking, Python data object model
OS Virtualization	chroot, Docker, Singularity
Interactive Web Visualization	dash/plotly, holoviews/bokeh
Distributed Computing	dask, modin, Apache Spark
Parallelism	OpenMPI

Spring 2022

The course was first offered in Spring 2021.

The Spring 2022 offering will be held on Wednesdays and Fridays from 9:45 to 11:15 AM in Packard 101.