From research software engineering to data science: how technology will impact the way we research

Florencia D’Andrea, Ph.D.

Postdoctoral fellow at the Master of Data Science program

University of British Columbia

Slides

https://www.flor14.github.io/wsu-dandrea/slides.html




Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Title slide illustration: The Turing Way Community, & Scriberia. (2020, March 3). Illustrations from the Turing Way book dashes. Zenodo. http://doi.org/10.5281/zenodo.3695300

Hello

Dr. Florencia D’Andrea

lab

field work

Do researchers develop software?

research software

Software that is used to generate, process or analyse results that you intend to appear in a publication

Research software can be anything from a few lines of code written by yourself, to a professionally developed software package

Do researchers develop software?

Yes, they do!

1. Are Researchers Software Developers?

2. Why is it important to write “good software” in research?

3. What kinds of practices and tools do we need to apply to create better research software?

- Examples

4. We don’t have training!
Where can we find resources to learn?

5. Where can I find other researchers interested in software?

6. Career paths

- Survey results

Final comments

Are Researchers Software Developers?

Scientists can create software as part of their research

Software quality can affect research results

What are the potential consequences of developing poor-quality software?

Reproducibility crisis

Unavailable code is one of the reasons why researchers cannot reproduce published articles

Reproducibility

A result is reproducible when the same analysis steps performed on the same dataset consistently produce the same answer

Reproducibility

  • Empirical
  • Statistical
  • Computational

Computational reproducibility

Achieved when detailed information is provided about the code, software, hardware and implementation.

Why is it important to write “good software” in research?

The more detailed the information we provide about the software we create, the more likely it is that others can reproduce research results with it

Example

Situation 1

Julia wants to share R code with her supervisor

Code for the plot

Julia is running this code on her computer ✅

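# Load the packages used for reading, reshaping and plotting the data
library(readr)
library(tidyr)    # pivot_longer() requires tidyr >= 1.0.0
library(ggplot2)

coffee_data <- read_csv("data/coffee_data.csv")

# Reshape to long format: one row per coffee characteristic and value
coffee_data |>
  pivot_longer(cols = aroma:moisture,
               names_to = "caracteristicas_cafe",
               values_to = "value") |>
  # Plot each characteristic against the total score, one panel each
  ggplot(aes(value, total_cup_points)) +
  geom_smooth(method = "gam", span = 0.3, color = "purple") +
  facet_wrap(~caracteristicas_cafe, scales = "free_y")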

But!

Her supervisor gets an error ❌

Error in pivot_longer(., 
cols = aroma:moisture,
names_to = "types_coffe",
: could not find function "pivot_longer"

🤔 What could be going on?

New package version

Code that runs on your computer now may fail on a machine with different package versions, or stop working after you upgrade your packages!

  • tidyr package version 0.8.3 does not include pivot_longer()
  • This function was added in tidyr version 1.0.0
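
One way to catch this kind of mismatch early is to check the installed package version before running the analysis. The sketch below is only an illustration: the helper name `has_min_version` is hypothetical and not part of the original slides, and it relies only on base R functions (`requireNamespace()`, `packageVersion()`, `package_version()`).

```r
# Hypothetical helper: TRUE only if `pkg` is installed at `min_version` or newer.
has_min_version <- function(pkg, min_version) {
  # A package that is not installed cannot satisfy the requirement.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  utils::packageVersion(pkg) >= package_version(min_version)
}

# pivot_longer() was added in tidyr 1.0.0, so warn when an older version is found.
if (!has_min_version("tidyr", "1.0.0")) {
  message("tidyr >= 1.0.0 is needed for pivot_longer(); consider install.packages('tidyr')")
}
```

With a check like this at the top of the script, Julia's supervisor would see a message about the package version instead of a "could not find function" error.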

Computational environment

Characteristics of a 💻 that can affect the behavior of the work done on it, such as:

  • your operating system

  • what software you have installed

  • which versions of the software packages are installed

Computational environments

Important to capture them to avoid:

  • Problems when sharing your code

  • Your code breaking over time

The sleight-of-hand trick that can simplify scientific computing

Computational environments and the tools to manage them can help researchers to deliver code that is reproducible, documented and shareable.

Situation 1-bis

Julia wants to run the code 10 years later

Challenge to scientists: does your ten-year-old code still run?

‘Missing documentation and obsolete environments force participants in the Ten Years Reproducibility Challenge to get creative.’

Computational environments

Important to capture them to avoid:

  • Problems when sharing your code

  • Your code breaking over time

How to avoid getting this error?

Minimum: Documentation

sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] kableExtra_1.3.4 knitr_1.41       forcats_0.5.1    stringr_1.4.1   
 [5] dplyr_1.0.9      purrr_0.3.4      readr_2.1.2      tidyr_1.2.0     
 [9] tibble_3.1.8     ggplot2_3.4.0    tidyverse_1.3.2 

loaded via a namespace (and not attached):
 [1] lattice_0.20-45     svglite_2.1.0       lubridate_1.9.1    
 [4] assertthat_0.2.1    digest_0.6.29       utf8_1.2.2         
 [7] R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
[10] reprex_2.0.1        evaluate_0.16       httr_1.4.3         
[13] pillar_1.8.0        rlang_1.0.6         googlesheets4_1.0.0
[16] readxl_1.4.0        rstudioapi_0.13     Matrix_1.4-1       
[19] rmarkdown_2.16      labeling_0.4.2      splines_4.2.1      
[22] webshot_0.5.3       googledrive_2.0.0   bit_4.0.4          
[25] munsell_0.5.0       broom_1.0.0         compiler_4.2.1     
[28] modelr_0.1.8        xfun_0.35           pkgconfig_2.0.3    
[31] systemfonts_1.0.4   mgcv_1.8-40         htmltools_0.5.4    
[34] tidyselect_1.2.0    fansi_1.0.3         viridisLite_0.4.0  
[37] crayon_1.5.1        tzdb_0.3.0          dbplyr_2.2.1       
[40] withr_2.5.0         grid_4.2.1          nlme_3.1-157       
[43] jsonlite_1.8.0      gtable_0.3.0        lifecycle_1.0.3    
[46] DBI_1.1.3           magrittr_2.0.3      scales_1.2.0       
[49] cli_3.4.1           stringi_1.7.8       vroom_1.5.7        
[52] farver_2.1.1        fs_1.5.2            xml2_1.3.3         
[55] ellipsis_0.3.2      generics_0.1.3      vctrs_0.5.1        
[58] tools_4.2.1         bit64_4.0.5         glue_1.6.2         
[61] hms_1.1.1           parallel_4.2.1      fastmap_1.1.0      
[64] yaml_2.3.5          timechange_0.2.0    colorspace_2.0-3   
[67] gargle_1.2.0        rvest_1.0.2         haven_2.5.0        
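
The output of `sessionInfo()` is most useful when it is saved next to the code it describes. A minimal sketch, assuming a plain text file is enough for the project: the file name `session_info.txt` is hypothetical, and a temporary directory is used here only so that the example has no side effects on a real project.

```r
# Hypothetical location for the record; in a real project this would
# usually live in the project folder and be committed with the code.
info_file <- file.path(tempdir(), "session_info.txt")

# capture.output() turns the printed session information into text lines.
writeLines(utils::capture.output(utils::sessionInfo()), info_file)

# The first line of the file records the R version that was used.
readLines(info_file, n = 1)
```

This only documents the environment; it does not recreate it, which is why it is presented as the minimum.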

Tools for reproducibility

We can use different tools to ensure the reproducibility of a project.

Capturing the computational environment is needed to ensure code reproducibility

Situation 2

Julia is writing code for an article. What is the best way to share the code with other colleagues?

Git

A version control system

GitHub

Repository hosting service

Why are Git + GitHub the best way to share code?

  • It’s possible to assign names to each code version and revisit specific points in its history.

  • Your code is backed up online on GitHub.

  • It is easy to track the authorship of each change made

Why are Git + GitHub the best way to share code?

  • You can share your code as an open source project

  • Foster collaboration

    • pull requests
    • project boards
    • GitHub issues

Many research groups have their GitHub organizations online

Example: https://github.com/dib-lab

Git + GitHub provide much more than just version control. They can also be used to foster collaboration and publish code.

Situation 3

Julia wants to publish her first article. How can she do it in a reproducible way?

Research Compendia

  1. Organize files according to a prevailing convention.

  2. Separate data, methods and results, while expressing the relationship between the three.

  3. Specify the environment
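
The three principles above can be sketched as a small project skeleton. This is only an illustration under stated assumptions: the function name `create_compendium`, the folder names and the file names are all hypothetical, and real compendia often follow an established convention such as an R package layout.

```r
# Hypothetical helper that builds a minimal compendium skeleton under `root`.
create_compendium <- function(root) {
  # 1. and 2. Keep data, methods (analysis code) and results in separate folders.
  folders <- c("data/raw", "data/processed", "analysis", "results/figures")
  for (folder in folders) {
    dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
  }
  # A README that states how the three parts relate to each other.
  writeLines(
    c("# Project title", "", "data/ -> analysis/ -> results/"),
    file.path(root, "README.md")
  )
  # 3. Specify the environment, here as a simple record of the R session.
  writeLines(
    utils::capture.output(utils::sessionInfo()),
    file.path(root, "session_info.txt")
  )
  invisible(root)
}

# Build the skeleton in a temporary directory and list what was created.
root <- create_compendium(file.path(tempdir(), "my-compendium"))
list.files(root, recursive = TRUE, include.dirs = TRUE)
```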

Documentation

Have you ever considered that your next scientific article could have not only readers but also users?


Readers

Users

Project organization and documentation are also relevant to achieving computational reproducibility.

Other practices

  • Testing
  • Continuous integration / continuous delivery (CI/CD)
  • Package development
  • Code review
  • Basic computational practices
  • Automation / use of pipelines

High reproducibility doesn’t mean that things can’t fail

What happens when the technology fails?

Some factors affecting Reproducibility

  • The tools that ensure the reproducibility (Docker, Conda) could change with time
  • The software/hardware we use changes with time

Some factors affecting Reproducibility

How the tools that ensure the reproducibility (Docker, Conda) change with time

Example 1 - Inactive Image Retention Policy (Nov 1, 2020): Docker Hub announced the removal of inactive Docker images from free accounts after 6 months

Some factors affecting Reproducibility

The software/hardware we use changes with time

Example 2 - Mac M1 - Apple silicon chip (hardware) (Oct 2021) Docker issues

Software maintenance

Should our code have maintainers?

What kinds of practices and tools do we need to apply to create better research software?

  • Capturing computational environments.
  • Version controlling and publishing our code.

What kinds of practices and tools do we need to apply to create better research software?

  • Document the code for your users.
  • Software could need maintenance.

Ok, but I need a second career to learn all this!

“As researchers are under immense pressure to maintain expertise in their research domains, they have little time to stay current with the latest software engineering practices (…) The lack of career incentives has occurred partially because the academic environment and culture have developed over hundreds of years, while software has only recently become important, in some fields over the last 60+ years, but in many others, just in the last 20 or fewer years (Foster, 2006).”

There are many communities that can help to improve research software

  • ReproHack
  • The Turing Way
  • rOpenSci / pyOpenSci
  • The Carpentries

1. ReproHack

Reproducibility hackathons

Authors: Submit their papers along with publicly available code and data

Participants: Attempt to reproduce published research of their choice from a list of proposed papers

At the end, the participants give feedback to the authors

2. The Turing Way project

The Turing Way project involves and supports a diverse community of contributors to make data science accessible, comprehensible and effective for everyone.

Book

The Turing Way Handbook (2022)

3. rOpenSci / pyOpenSci

4. Software Carpentry

Lessons and infrastructure to teach them

The Carpentries lessons cover Unix, Python, R and more

We don’t have training! Where to find resources to learn?

There are many communities that can help researchers to learn how to improve their software

Why is there no career for software developers in academia?

  • In 2012, the Software Sustainability Institute (SSI) organized the Collaborations Workshop that addressed this question.

  • This led to the foundation of the UK RSE Association, and later the Society of Research Software Engineering

research software engineer

What do RSEs do?

Fundamentally, RSEs build software to support scientific research. They generally don’t have research questions of their own. They develop the computer tools to help other people to do cool things.

Research Software Engineers include researchers who spend a significant amount of time programming, full-time software engineers writing code to solve research problems, and those somewhere in-between.

US Research Software Sustainability Institute

The URSSI mission is to improve the recognition, development, and use of software for a more sustainable research enterprise.

US Research Software Engineer Association

Community-driven effort that brings together people who write and contribute to research software within the US.

Where can I find experts or other researchers interested in software?

  • The RSE role has been created to recognize the effort of software developers in academia
  • The US-RSE association connects RSEs across the US

RSE International Survey 2022

  • Conducted by the UK Software Sustainability Institute since 2016
  • The survey covers all aspects of the practice of research software engineering.
  • RSE survey 2022, US: 161 participants

What is the highest level of education you have attained? (one choice list)

How did you learn the skills you need to become a Research Software Engineer / Research Software Developer? (free text)

In which discipline is your highest academic qualification? (one choice list)

What training programs are you involved with? (free text)

Do you cite the software you use?

Software in general has not been well-cited (…), in part because the scholarly culture has not treated software as something that should be cited, or in some cases, even mentioned.

In general, when your software contributes to a paper, are you acknowledged in that paper? (one choice)

Publishing different article types

  • Data Papers

  • Methods Papers

  • Micropublishing

  • Software Papers

  • Registered Reports

Journal of Open Source Software

The Journal of Open Source Software is a developer friendly, open access journal for research software packages.

Master of Data Science program - UBC

10-month program with a final capstone project with industry partners

Data Science

  • Statistics
  • Machine learning
  • Programming
  • Software development / reproducibility
  • Data visualization
  • Databases and cloud computing

Career paths

  • RSEs are mostly self-taught
  • Software is often under-cited or overlooked in academic publications

Career paths

  • Some universities have created RSE groups to support research in their institutions
  • What new roles will appear in the future?

Final comments

  • There is an increasing demand for coding skills in academia

  • There are communities and associations that are trying to meet this ongoing demand

  • “New” technical career paths have emerged in response to this need

Questions?