Biostatistics Workshop 2021

Title: Best practices in computational reproducibility
Speaker: Jean Baptiste Poline, McGill University; jean-baptiste.poline@mcgill.ca
Date: Sunday, June 6, 2021

Description:

There is increasing concern that computational research is hard to reproduce and reuse: software versions change, code is hard to run in a new computational environment, and data is modified but not versioned. This workshop will describe best practices and tools for improving research reproducibility, with a specific emphasis on computational aspects. We will also show that these tools improve research efficiency.

Attendees who wish to participate in the hands-on exercise are expected to bring their own laptop and follow the [neurohackademy](https://neurohackademy.org/setup/) setup instructions to install git.

(Times below assume an 8:30 start and will shift accordingly for a different start time.)

  1. Introduction to best practices in computational reproducibility (8:30)
  2. Why and how to version and collaborate on code (9:00)
    • In this talk, we will introduce git as a content-tracking system and GitHub as a "social coding" platform. We will present a list of best coding practices and explain how they foster more reproducible science.
  3. Introduction to containers (10:00)
    • We will introduce container technology and compare it to other systems for encapsulating computing environments, such as virtual machines and virtual environments. We will show when this technology is best used and what its potential pitfalls are.
  4. Versioning data (10:30)
    • In this lesson, we will introduce the concepts of data versioning and integrity checking. We will present DataLad, a software tool that tracks data versions through git.
  5. Python ecosystem for statistics/machine learning (11:00)
    • While R is the most widely used language among statisticians, with a plethora of packages, Python has an interesting ecosystem for data processing and has recently developed first-class libraries for machine learning. We will present this ecosystem with its capabilities and limitations.
  6. Exercise: versioning code (12:00)
    • In this hands-on session, we will show how to use git and GitHub to version some code and collaborate on a software project.