
BIOE 301P - Research Data & Computation

A computational lab course that spans the research data processing workflow, from just after the point of acquisition through computation and visualization. The course is taught using Python and POSIX tools. Students work with a dataset (>10 GB) of their choosing, either their own or one provided by the instructor.

 

Learning Goals

  • Catalog data in an archival framework for long-term storage and use in an efficient and programmatic manner
  • Choose an appropriate data file format and set optimized storage file parameters specific to the research to be conducted
  • Transform raw datasets into smaller derivative datasets for rapid data processing
  • Develop computational workflows for conducting programmatic, reproducible, accessible, and transparent research
  • Scale code from interactive, single-threaded workloads to parallel and distributed toolchains
  • Prototype iterative and rapid visualization frameworks on data to validate workflows and conduct preliminary analyses
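As an illustration of the third goal (deriving a smaller dataset from a raw one), the following is a minimal sketch that streams one column of a CSV file and writes a block-averaged derivative. It is not course material: the function names `block_means` and `derive_csv`, the column name `signal`, and the in-memory synthetic data are all assumptions made for the example, and only the Python standard library is used.

```python
import csv
import io
from itertools import islice
from statistics import fmean


def block_means(values, block_size):
    """Yield the mean of each consecutive block of numeric values."""
    it = iter(values)
    while True:
        block = list(islice(it, block_size))
        if not block:
            break
        yield fmean(block)


def derive_csv(raw_file, out_file, column, block_size):
    """Stream `column` from raw_file and write one averaged row per block.

    Only one block is held in memory at a time, so the raw file can be
    much larger than RAM. Returns the number of derived rows written.
    """
    reader = csv.DictReader(raw_file)
    writer = csv.writer(out_file)
    writer.writerow(["block", f"mean_{column}"])
    values = (float(row[column]) for row in reader)
    count = 0
    for i, mean in enumerate(block_means(values, block_size)):
        writer.writerow([i, mean])
        count += 1
    return count


# Synthetic stand-in for a raw acquisition file (hypothetical data).
raw = io.StringIO("t,signal\n" + "\n".join(f"{i},{i % 4}" for i in range(12)))
derived = io.StringIO()
n_blocks = derive_csv(raw, derived, column="signal", block_size=4)
print(n_blocks)
print(derived.getvalue())
```

With real data, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`; the block size sets the trade-off between the size of the derivative and the detail it retains.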

 

Topics/Weekly Schedule

 

Topic | Tools
Navigation | command line/shell, ssh, tmux, Kerberos+AFS, FarmShare/Sherlock
Code & Single-threaded Processing | git, Stanford GitLab, Jupyter/Colab, ssh tunneling, ngrok/serveo/sish
Data Representation & Access | numpy, pandas, SQLAlchemy
Data Transfer | rsync, rclone, Stanford Google Drive
Data File Formats | HDF5, Parquet, Avro, Arrow, Hadoop
High-Performance Computing | job scheduling/SLURM, workflows, pipelines
Optimizing & Abstracting Data Access | HDF5 chunking, Python data object model
OS Virtualization | chroot, docker, singularity
Interactive Web Visualization | dash/plotly, holoviews/bokeh
Distributed Computing | dask, modin, Apache Spark
Parallelism | OpenMPI
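To give a sense of why the "HDF5 chunking" topic matters, here is a small worked sketch of the arithmetic behind chunked storage. It assumes only the general property that a chunked dataset is stored in fixed-size blocks and that reading any element requires reading its whole chunk. It does not call h5py or any HDF5 library; the helper names, the array shape, and the chunk shapes are hypothetical.

```python
def chunks_touched(start, stop, chunk_len):
    """Chunks along one axis overlapped by the half-open range [start, stop)."""
    if stop <= start:
        return 0
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    return last - first + 1


def chunks_for_selection(selection, chunk_shape):
    """Total chunks read for a rectangular selection.

    `selection` holds one (start, stop) pair per axis.
    """
    total = 1
    for (start, stop), chunk_len in zip(selection, chunk_shape):
        total *= chunks_touched(start, stop, chunk_len)
    return total


shape = (10_000, 10_000)           # hypothetical 2-D dataset
one_row = ((0, 1), (0, shape[1]))  # read a single full row
one_col = ((0, shape[0]), (0, 1))  # read a single full column

row_chunks = (1, 10_000)           # each chunk holds one full row
square_chunks = (100, 100)         # each chunk is a 100 x 100 tile

for name, chunk_shape in [("row-shaped", row_chunks), ("square", square_chunks)]:
    print(
        name,
        chunks_for_selection(one_row, chunk_shape),
        chunks_for_selection(one_col, chunk_shape),
    )
```

Under these assumptions, row-shaped chunks make a row read touch 1 chunk but a column read touch 10,000, while square tiles cost 100 chunks either way. The chunk shape is therefore best chosen to match the access pattern the analysis actually uses.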

 

Spring 2021

The course will be offered for the first time in Spring 2021. Instruction is remote, in a synchronous flipped-classroom format. Meets Wed & Fri, 10:00-11:20 AM.

Last modified: 03/30/2021 - 23:08