BIOE 301P - Research Data & Computation

A computational lab course that spans the research data processing workflow, from just after the point of acquisition through computation and visualization. The course is taught using Python and POSIX tools. Students will work with large datasets of their choosing.


Learning Goals

  • Catalog data in an archival framework for long-term storage and use in an efficient and programmatic manner
  • Choose an appropriate data file format and set optimized storage file parameters specific to the research to be conducted
  • Transform raw datasets into smaller derivative datasets for rapid data processing
  • Develop computational workflows for conducting programmatic, reproducible, accessible, and transparent research
  • Scale code from interactive, single-threaded workloads to parallel and distributed toolchains
  • Prototype iterative and rapid visualization frameworks on data to validate workflows and conduct preliminary analyses


Topics/Weekly Schedule

Example topics - subject to change.

Topic                                   Tools
Navigation                              command line/shell, ssh, tmux, Kerberos+AFS, FarmShare/Sherlock
Code & Single-threaded Processing       git, Stanford GitLab, Jupyter/Colab, ssh tunneling, ngrok/serveo/sish
Data Representation & Access            numpy, pandas, SQLAlchemy
Data Transfer                           rsync, rclone, Stanford Google Drive
Data File Formats                       HDF5, Parquet, AVRO, Arrow, Hadoop
High-Performance Computing              job scheduling/SLURM, workflows, pipelines
Optimizing & Abstracting Data Access    HDF5 chunking, Python data object model
OS Virtualization                       chroot, docker, & singularity
Interactive Web Visualization           dash/plotly, holoviews/bokeh
Distributed Computing                   dask, modin, Apache Spark
Parallelism                             OpenMPI


Spring 2022

The course was first offered in Spring 2021.

The Spring 2022 offering will be held on Wed & Fri from 9:45 to 11:15 AM in Packard 101.

Last Updated: 2022/11/27 - 00:47

© 2023
