Postdoctoral fellow at the Master of Data Science program
University of British Columbia
https://www.flor14.github.io/wsu-dandrea/slides.html
This work is licensed under a Creative Commons Attribution 4.0 International License.
Title slide illustration: The Turing Way Community, & Scriberia. (2020, March 3). Illustrations from the Turing Way book dashes. Zenodo. http://doi.org/10.5281/zenodo.3695300
lab
field work
research software
Software that is used to generate, process or analyse results that you intend to appear in a publication
Research software can be anything from a few lines of code written by yourself, to a professionally developed software package
Yes, they do!
1. Are Researchers Software Developers?
2. Why is it important to write “good software” in research?
3. What kinds of practices and tools do we need to apply to create better research software?
- Examples
4. We don’t have training!
Where to find resources to learn?
5. Where can I find other researchers interested in software?
6. Career paths
- Survey results
Final comments
Scientists can create software as part of their research
What are the potential consequences of developing poor-quality software?
Unavailable code is one of the reasons why researchers can’t reproduce published articles
A result is reproducible when the same analysis steps performed on the same dataset consistently produce the same answer
Computational reproducibility
When detailed information is provided about code, software, hardware and implementation details.
Victoria Stodden (2014) What scientific idea is ready for retirement? - www.edge.org
The more detailed the information we provide about the software we create, the more likely it is that research results obtained with it can be reproduced
Situation 1
Julia wants to share R code with her supervisor
Julia is running this code on her computer ✅
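# Load the packages used below
library(readr)
library(tidyr)
library(ggplot2)

coffee_data <- read_csv("data/coffee_data.csv")

# Reshape the columns from aroma to moisture into long format,
# then plot each characteristic against the total score
coffee_data |>
  pivot_longer(cols = aroma:moisture,
               names_to = "caracteristicas_cafe",
               values_to = "value") |>
  ggplot(aes(value, total_cup_points)) +
  geom_smooth(method = "gam", span = 0.3, color = "purple") +
  # One panel per characteristic, each with its own y axis
  facet_wrap(~caracteristicas_cafe, scales = "free_y")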
Error in pivot_longer(.,
cols = aroma:moisture,
names_to = "types_coffe",
: could not find function "pivot_longer"
🤔 What could be going on?
Code that runs on your computer today may stop working when the package versions change!
tidyr version 0.8.3 does not include pivot_longer()
tidyr version 1.0.0 includes pivot_longer()
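One way to make this failure easier to diagnose is to check the installed package version before the analysis starts. The sketch below is only an illustration: `has_min_version()` is a hypothetical helper, not part of tidyr or of the original slides, and it relies on `pivot_longer()` being available from tidyr 1.0.0 onward.

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or newer.
has_min_version <- function(pkg, min_version) {
  # requireNamespace() returns FALSE instead of erroring when pkg is missing
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  utils::packageVersion(pkg) >= min_version
}

# pivot_longer() needs tidyr 1.0.0 or newer, so report a clear message if it is missing
if (!has_min_version("tidyr", "1.0.0")) {
  message("This script needs tidyr >= 1.0.0; consider install.packages(\"tidyr\").")
}
```

A check like this does not fix the environment, but it replaces a confusing "could not find function" error with a message that names the cause.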
Computational environment
Characteristics of a 💻 that can affect the behavior of the work done on it, such as:
your operating system
which software you have installed
which versions of the software packages are installed
Important to capture them to avoid:
Problems when sharing your code
Your code breaking over time
Computational environments and the tools to manage them can help researchers to deliver code that is reproducible, documented and shareable.
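As a rough illustration of what "capturing" an environment can mean, the sketch below collects the three characteristics listed above (operating system, R version, and package versions) into a single list. `describe_environment()` is a hypothetical helper written for this example; dedicated tools record far more detail.

```r
# Hypothetical helper: summarise the operating system, R version and
# the versions of the requested packages.
describe_environment <- function(packages = character()) {
  # Keep only the requested packages that are actually installed
  is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  installed <- packages[is_installed]
  list(
    os = Sys.info()[["sysname"]],
    r_version = R.version.string,
    package_versions = vapply(
      installed,
      function(p) as.character(utils::packageVersion(p)),
      character(1)
    )
  )
}

env <- describe_environment(c("stats", "utils"))
str(env)
```

Saving this list alongside the analysis gives a collaborator something concrete to compare against their own setup.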
Julia wants to run the code 10 years later
‘Missing documentation and obsolete environments force participants in the Ten Years Reproducibility Challenge to get creative.’
Important to capture them to avoid:
Problems when sharing your code
Your code breaking over time
Minimum: Documentation
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] kableExtra_1.3.4 knitr_1.41 forcats_0.5.1 stringr_1.4.1
[5] dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[9] tibble_3.1.8 ggplot2_3.4.0 tidyverse_1.3.2
loaded via a namespace (and not attached):
[1] lattice_0.20-45 svglite_2.1.0 lubridate_1.9.1
[4] assertthat_0.2.1 digest_0.6.29 utf8_1.2.2
[7] R6_2.5.1 cellranger_1.1.0 backports_1.4.1
[10] reprex_2.0.1 evaluate_0.16 httr_1.4.3
[13] pillar_1.8.0 rlang_1.0.6 googlesheets4_1.0.0
[16] readxl_1.4.0 rstudioapi_0.13 Matrix_1.4-1
[19] rmarkdown_2.16 labeling_0.4.2 splines_4.2.1
[22] webshot_0.5.3 googledrive_2.0.0 bit_4.0.4
[25] munsell_0.5.0 broom_1.0.0 compiler_4.2.1
[28] modelr_0.1.8 xfun_0.35 pkgconfig_2.0.3
[31] systemfonts_1.0.4 mgcv_1.8-40 htmltools_0.5.4
[34] tidyselect_1.2.0 fansi_1.0.3 viridisLite_0.4.0
[37] crayon_1.5.1 tzdb_0.3.0 dbplyr_2.2.1
[40] withr_2.5.0 grid_4.2.1 nlme_3.1-157
[43] jsonlite_1.8.0 gtable_0.3.0 lifecycle_1.0.3
[46] DBI_1.1.3 magrittr_2.0.3 scales_1.2.0
[49] cli_3.4.1 stringi_1.7.8 vroom_1.5.7
[52] farver_2.1.1 fs_1.5.2 xml2_1.3.3
[55] ellipsis_0.3.2 generics_0.1.3 vctrs_0.5.1
[58] tools_4.2.1 bit64_4.0.5 glue_1.6.2
[61] hms_1.1.1 parallel_4.2.1 fastmap_1.1.0
[64] yaml_2.3.5 timechange_0.2.0 colorspace_2.0-3
[67] gargle_1.2.0 rvest_1.0.2 haven_2.5.0
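The listing above has the shape of the report printed by `sessionInfo()`. A minimal sketch of how such a report could be stored next to the analysis follows; the helper name `save_session_info()` and the file name `session_info.txt` are assumptions made for this example.

```r
# Hypothetical helper: write the sessionInfo() report to a text file.
save_session_info <- function(path = "session_info.txt") {
  # capture.output() turns the printed report into a character vector
  report <- utils::capture.output(utils::sessionInfo())
  writeLines(report, path)
  invisible(path)
}

# Written to a temporary folder here; in a real project the file
# would usually live in the project directory
info_file <- save_session_info(file.path(tempdir(), "session_info.txt"))
```

Committing this file together with the code documents the environment at the time the results were produced.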
We can use different tools to ensure the reproducibility of a project such as:
Julia is writing code for an article. What is the best way to share the code with other colleagues?
Git
A version control system
GitHub
Repository hosting service
It’s possible to assign names to each code version and revisit specific points in its history.
Your code is backed up online on GitHub.
It is easy to track the authorship of each change made
You can share your code as an open source project
Foster collaboration
https://github.com/dib-lab
Julia wants to publish her first article. How can she do it in a reproducible way?
Organize files according to a prevailing convention.
Provide separation between data, methods and results expressing the relationship between the three.
Specify the environment
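The three recommendations above could be sketched as a small project skeleton. The folder names (`data`, `scripts`, `results`), the README wording, and the helper `create_project_skeleton()` are assumptions chosen for illustration; the convention that prevails in your own field may differ.

```r
# Hypothetical helper: create a minimal project skeleton under `root`.
create_project_skeleton <- function(root) {
  dirs <- file.path(root, c("data", "scripts", "results"))
  for (d in dirs) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  # A README is one place to state how data, methods and results relate
  writeLines(
    c(
      "data/    : raw input files (treated as read-only)",
      "scripts/ : code that reads data/ and writes results/",
      "results/ : figures and tables generated by scripts/"
    ),
    file.path(root, "README.txt")
  )
  invisible(dirs)
}

# Created in a temporary folder so the example does not touch real projects
project <- file.path(tempdir(), "coffee-article")
create_project_skeleton(project)
list.files(project)
```

Keeping generated results separate from raw data makes it clear which files can be deleted and rebuilt from the scripts.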
Did you ever consider that your next scientific article can have not only readers but also users?
Readers
Users
Example 1 - Inactive Image Retention Policy (Nov 1, 2020): Docker Hub removes Docker images from free accounts after 6 months of inactivity
Example 2 - Mac M1, Apple silicon chip (hardware) (Oct 2021): Docker issues
Should our code have maintainers?
“As researchers are under immense pressure to maintain expertise in their research domains, they have little time to stay current with the latest software engineering practices (…) The lack of career incentives has occurred partially because the academic environment and culture have developed over hundreds of years, while software has only recently become important, in some fields over the last 60+ years, but in many others, just in the last 20 or fewer years (Foster, 2006).”
rOpenSci / pyOpenSci
Authors: Send their papers with publicly available associated code and data
Participants: Attempt to reproduce published research of their choice from a list of proposed papers
At the end, the participants give feedback to the authors
The TWP involves and supports a diverse community of contributors to make data science accessible, comprehensible and effective for everyone.
Book
Scientific package review process
Lessons and infrastructure to teach them
The Carpentries lessons in Unix, Python, R and more
There are many communities that can help researchers learn how to improve their software
In 2012, the Software Sustainability Institute (SSI) organized the Collaborations Workshop that addressed this question.
This led to the foundation of the UK RSE Association and, later, the Society of Research Software Engineering
research software engineer
What do RSEs do?
Fundamentally, RSEs build software to support scientific research. They generally don’t have research questions of their own. They develop the computer tools to help other people to do cool things.
Image and text from Woolston (2022) Why science needs more research software engineers - Nature career
Research Software Engineers include researchers who spend a significant amount of time programming, full-time software engineers writing code to solve research problems, and those somewhere in between.
Source: Daniel Katz blogpost
The US RSSI mission is to improve the recognition, development, and use of software for a more sustainable research enterprise.
Community-driven effort that brings together people who write and contribute to research software within the US.
First US conference. US-RSE Conference 2023: Software Enabled Discovery and Beyond. October 16-18th, 2023, Chicago, IL
International RSE Conference. RSECon 23 September 5-7th, 2023, UK
Software in general has not been well-cited (…), in part because the scholarly culture has not treated software as something that should be cited, or in some cases, even mentioned.
The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.
Data Papers
Methods Papers
Micropublishing
Software Papers
Registered Reports
The Journal of Open Source Software is a developer friendly, open access journal for research software packages.
10-month program - final capstone project with industry partners
There is an increasing demand for coding skills in academia
There are communities and associations working to meet this growing demand
“New” technical career paths have emerged in response to this need
Questions?