Effectiveness of open science policies at ESEC/FSE 2019

2020-04-24

Introduction

For those who don’t know me, I have been advocating for open science (open access, open data, and open source) practices in software engineering research for many years now. Sometimes my role has been practical and hands-on, helping to develop better tools or platforms for openness. Other times, my passion has been recognized with more decision-making roles, including being open science chair of several international workshops, conferences, and journals of software engineering research, supported by some good friends who were in the right positions and had the right motivation. In my chairing roles, I developed several policies and guidelines (see the links in the previous sentence) for authors towards better transparency, reproducibility, and repeatability of studies. Indeed, developing policies and strategies for openness has become one of my passions, and policy development is something I have been working on with Jon Bell for the past months within the SIGSOFT open science initiative, which we co-chair. A first output, driven by Jon, has been the SIGSOFT Artifact Evaluation Working Group. I am driving a set of common open science policies that we will encourage all software engineering venues to adopt. The latest incarnation of my policies has been at ESEC/FSE 2019, but I am happy to report that derived forms of them were autonomously adopted at MSR 2020, ESEC/FSE 2020, and ICSME 2020, which warms my heart. What these policies have in common is the following desire: artifacts related to a study (which include, but are not limited to, raw and transformed data, extended proofs, appendices, analysis scripts, software, virtual machines and containers, and qualitative codebooks) and the paper itself should, in principle, be made available on the Internet:

provided that there are no ethical, legal, technical, economic, or otherwise sensible barriers preventing their disclosure. We know that it is not always possible to disclose artifacts (from this point on, simply called data), so these policies have so far never demanded data from authors. I wanted to change this at ESEC/FSE 2019.

This report

This report summarizes how the ESEC/FSE 2019 open science policies came to life and how they resulted in openness for the conference. In particular, I report how the authors of 277 submissions provided data at peer review time, and how the authors of 74 published papers made their papers and data openly available. Finally, I discuss the obtained results and provide lessons learned for the development of the SIGSOFT open science policies.

The ESEC/FSE 2019 open science guidelines

The ESEC/FSE 2019 conference has been my largest deployment of open science policies so far, after ESEM 2018. I received almost carte blanche to draft my idea from the co-chairs Marlon Dumas and Dietmar Pfahl and from the PC chairs Sven Apel and Alessandra Russo (whom I deeply thank for their enthusiastic support). For ESEC/FSE 2019, I wanted to achieve something radical: to demand from authors that they disclose their data for peer review and, in case of acceptance, make it archived open data. Those refusing to comply would be desk-rejected before review time or rejected after review. Of course, there would be several exceptions, e.g., in case of NDAs, ethical reasons, and so on. Before the call for papers was sent out, we decided to involve a group of good people to review the guidelines. While there was complete support for open science in general, some of the reviewers expressed concern that the change would be too strong for the community to process and, perhaps, understand. Someone suggested involving the steering committee of ESEC/FSE. At that point, given the little time we had, I decided to propose something in between:

To _demand from authors that they declare_ what they would do with their data at review time and in case of acceptance.

Author declarations would have to be present at submission time, on EasyChair, and in the papers themselves. Authors would not be punished in any way, no matter what information they gave. This version of the guidelines was received positively. PC members were instructed that they were allowed to complain in case data was not provided, but not to reject papers based on the absence of data. Reviewers were also asked not to hunt around for author information in case the authors disclosed data at review time and/or self-archived a preprint that was not blinded. Here is the core of the ESEC/FSE 2019 guidelines / policies for open science, summarized below for your convenience.

# Open Data and Open Source

Fostering open data means that data should be:

- Archived on preserved digital repositories such as zenodo.org, figshare.com, or institutional repositories.
  - Personal or institutional websites, consumer cloud, or services such as Academia.edu and Researchgate.net are not archived and preserved digital repositories.
- Released under a proper open data license such as the CC0 dedication or the CC-BY 4.0 license when publishing the data.
  - Different open licenses, if mandated by institutions or regulations, are also permitted.

Upon submission, we ask authors to provide a supporting statement on the data availability in the submission system and as text in the submitted papers (ideally, in a subsection named Data Availability within the Results section).

Please note that the success of the open science initiative depends on the willingness (and possibilities) of authors to disclose their data and that all submissions will undergo the same review process independent of whether or not they disclose their analysis code or data.

We encourage authors to make data available upon submission (either privately or publicly) and especially upon acceptance (publicly).

In any case, we ask authors who cannot disclose industrial or otherwise non-public data, for instance due to non-disclosure agreements, to please provide an explicit (short) statement in the paper (ideally, in a subsection named Data Availability within the Results section).

Similarly, we encourage authors to make their research software accessible as open source and citable, using the same process.
Any acceptable open source license is encouraged.

# Open Access

We encourage ESEC/FSE 2019 authors to self-archive their pre- and postprints in open, preserved repositories. 
This is legal and allowed by all major publishers including ACM (granted in the copyright transfer agreement), and it lets anybody in the world reach your paper.

If the authors of your paper wish to do this, we recommend:

- Upon acceptance to ESEC/FSE 2019, revise your article according to the peers comments, generate a PDF version of it (postprint), and submit it to arXiv.org, which supports article versioning.

We also provided authors with links on open access and self-archiving and on disclosing data for double blind peer review and preparing it for open data. How did it go? Let’s find out.

Method

I analyzed the EasyChair metadata of the 277 ESEC/FSE 2019 main track submissions that were sent out for peer review. I obtained the EasyChair export from the PC chairs. The export contained only data that was pertinent to my analysis; for example, I could not access the reviews of submissions. In particular, as anticipated in the open science guidelines, the EasyChair submission system had these two mandatory text fields:

I open coded each EasyChair submission to find out:

I then retrieved the ESEC/FSE 2019 main research track papers from the ACM digital library website. I inspected each paper and looked for any mention of disclosed data. I coded the information accordingly to find out:

The description of all codes and how they emerged is available in the Results section. As a methodological note, in this report I am reporting on mere, direct, non-paywalled availability of artifacts and papers. The definitions of open access, open data, and open source foresee a proper use of copyright rules to produce a written statement of intentions and expectations that users of the released material are asked to follow (i.e., a license). For the present work, I instead consider open data, open source, and open access to be any artifact or paper that is released on the Internet barrier-free, that is, with no paywall, registration, or request mechanism for obtaining the material.

Anonymity

In this report and all data I provide, I am protecting:

That is, one dataset, submitted.df, includes all submissions, including those that were accepted; another dataset, published.df, includes only published papers. Each line represents a submission that was sent for peer review in the main track. It is possible to identify a couple of published papers in the published.df dataset, but the dataset contains only publicly available information (all information comes from published papers) and the papers are not linked to their EasyChair submission records.

if (file.exists("ESEC_FSE_2019-open_science.csv")){
  # Private master file (semicolon-separated, hence read.csv2): available only to me.
  all.submissions.df <- read.csv2("ESEC_FSE_2019-open_science.csv", header = T)
  # The last 4 columns pertain to published papers only; drop them for submitted.df.
  submitted.df <- all.submissions.df[1:(length(all.submissions.df)-4)]
  # Keep only complete rows (published papers) and their 4 publication columns.
  published.df <- na.exclude(all.submissions.df)[(length(all.submissions.df)-3):length(all.submissions.df)]
  # Shuffle rows before exporting so records cannot be linked across the two datasets.
  write.csv(x = submitted.df[sample(nrow(submitted.df)),], file = "ESEC_FSE_2019-open_science-submitted.csv", row.names = F)
  write.csv(x = published.df[sample(nrow(published.df)),], file = "ESEC_FSE_2019-open_science-published.csv", row.names = F)
} else { # this is what will be executed by anyone else reproducing this Rmd.
  submitted.df <- read.csv("ESEC_FSE_2019-open_science-submitted.csv", header = T)
  published.df <- read.csv("ESEC_FSE_2019-open_science-published.csv", header = T)
}

Results

In this section I report on the 277 submissions to, and the 74 papers published in, the ESEC/FSE 2019 main track.

Submissions

A total of 277 submissions were sent out for peer review. Authors filled out two EasyChair fields that I analyzed. The dataset submitted.df provides 8 variables, or columns:

It should be noted here that all involved people (reviewers and chairs) were instructed to take the authors’ explanations at face value, so I also assumed all stated reasons to be truthful. Let’s first see the percentages for submissions:

sum(submitted.df$SUBMITTED_DATA_SUMMARY) / NROW(submitted.df)

[1] 0.6750903

sum(submitted.df$SUBMITTED_DATA_PROMISE_PUBLISHED_SUMMARY) / NROW(submitted.df)

[1] 0.866426

sum(submitted.df$SUBMITTED_YES_DATA_SUMMARY_YES_DATA_PROMISE_PUBLISHED) / NROW(submitted.df)

[1] 0.66787

sum(submitted.df$SUBMITTED_NO_DATA_SUMMARY_YES_DATA_PROMISE_PUBLISHED) / NROW(submitted.df)

[1] 0.198556

sum(submitted.df$SUBMITTED_NO_DATA_SUMMARY_NO_DATA_PROMISE_PUBLISHED) / NROW(submitted.df)

[1] 0.1263538

sum(submitted.df$SUBMITTED_YES_DATA_SUMMARY_NO_DATA_PROMISE_PUBLISHED) / NROW(submitted.df)

[1] 0.007220217

At ESEC/FSE 2019, 67% of the 277 submissions came with accompanying data available for peer review. A total of 87% of submissions promised open data in case of acceptance, and 67% of submissions both had accompanying data and promised open data in case of acceptance. I was curious to see where the data submitted for peer review (67% of submissions) was hosted. I further analyzed the submissions with the following codes:

Here is the distribution of the codes:

library(ggplot2) # needed for all plots in this report
yes.review <- sort(table(as.character(submitted.df[submitted.df$SUBMITTED_DATA_SUMMARY == 1,]$SUBMITTED_DATA_DETAILS)), decreasing = T)
ggplot(as.data.frame(yes.review)) + geom_bar(aes(x = Var1, y = Freq), stat = "identity") + xlab('Where data was hosted at peer review time') + ylab('Count') + theme_bw()

[plot of chunk submissionsgraphs: where data was hosted at peer review time]

Version control systems / repositories were used in most cases (72 of the 187 submissions that presented data), followed by cloud drives (50). Only 47 submissions, or 25.1% of those with submitted data, were on properly archived systems for hosting open data. I also wanted to understand why authors did not provide data at peer review time. I dissected the 90 such submissions (33% of the total) with the following codes:

no.review <- sort(table(as.character(submitted.df[submitted.df$SUBMITTED_DATA_SUMMARY == 0,]$SUBMITTED_DATA_DETAILS)), decreasing = T)
ggplot(as.data.frame(no.review)) + geom_bar(aes(x = Var1, y = Freq), stat = "identity") + xlab('Reasons for not disclosing data at peer review time') + ylab('Count') + theme_bw() + theme(axis.text.x=element_text(angle=90,hjust=1))

[plot of chunk submissionsgraphsnodata: reasons for not disclosing data at peer review time]

In almost a tenth (8%) of the submissions that did not provide data for peer review, there was no data to disclose in the first place. Submissions coded with NO_NO_REASON formally violated the open science policies of ESEC/FSE 2019; there was no penalty for this. Submissions that omitted an explanation for not providing the data were the second most represented category (19% of those not providing data at peer review, or 7% of all submissions). On the other hand, 81% of the submissions that did not provide data for peer review did provide an explanation for not doing so.

Published papers

A total of 74 submissions were accepted at ESEC/FSE 2019 main research track. The proceedings were made available in August 2019. Near the end of March 2020, to allow authors some time to self-archive their papers, I inspected each published paper to find out the following:

sum(published.df$PUBLISHED_DATA_SUMMARY, na.rm = T) / (NROW(published.df$PUBLISHED_OPEN_ACCESS_SUMMARY) - sum(is.na(published.df$PUBLISHED_OPEN_ACCESS_SUMMARY)))

[1] 0.7027027

sum(published.df$PUBLISHED_OPEN_ACCESS_SUMMARY, na.rm = T) / (NROW(published.df$PUBLISHED_OPEN_ACCESS_SUMMARY) - sum(is.na(published.df$PUBLISHED_OPEN_ACCESS_SUMMARY)))

[1] 0.8918919

A total of 70% of the published papers made data openly available. Almost all (89%) of the published papers were made freely available, outside of the ACM Digital Library paywall. As with the submissions, I checked where data and papers were hosted. I inspected the papers and coded the columns PUBLISHED_DATA_DETAILS and PUBLISHED_OPEN_ACCESS_DETAILS as follows:

I plot only the YES_ codes here.

yes.published.data <- sort(table(as.character(published.df[published.df$PUBLISHED_DATA_SUMMARY == 1,]$PUBLISHED_DATA_DETAILS)), decreasing = T)
ggplot(as.data.frame(yes.published.data)) + geom_bar(aes(x = Var1, y = Freq), stat = "identity") + xlab('Where data was hosted after acceptance') + ylab('Count') + theme_bw()

[plot of chunk publishedgraphs: where data was hosted after acceptance]

yes.published.paper <- sort(table(as.character(published.df[published.df$PUBLISHED_OPEN_ACCESS_SUMMARY == 1,]$PUBLISHED_OPEN_ACCESS_DETAILS)), decreasing = T)
ggplot(as.data.frame(yes.published.paper)) + geom_bar(aes(x = Var1, y = Freq), stat = "identity") + xlab('Where papers were hosted after acceptance') + ylab('Count') + theme_bw()

[plot of chunk publishedgraphs: where papers were hosted after acceptance]

For data, version control repositories were used in most cases (31 of the 52 papers that presented data), followed by properly archived systems for open data (13). Only 25% of published data, in line with the submissions, was on proper systems for hosting open data. For papers, a plain website (which includes GitHub pages) was the most preferred way to offer an open access version of the paper (41 of the 66 papers that were freely available), followed by properly archived systems for open access (20%) and two papers that were offered with the ACM open access option.

Discussion

This being my website, I will dedicate the first part of the discussion of the results to celebrating the open science guidelines. The main research track of ESEC/FSE 2019 has been my greatest achievement so far in terms of impact and openness of data.

Now, to the slightly less happy news.

Here is why all my open science policies are so adamant about not storing code, data, and papers on mere websites (and, sorry, GitHub and researchgate.net count as websites): web resources disappear, pretty quickly. Wallace Koehler has performed a series of longitudinal studies of generic (that is, not archived) URLs, which found, among other interesting results, that only 34.4% of URLs are still available after 5 years, down to around 0.005% after 20 years. In the worst average case, the half-life availability of web pages has been estimated at 2 years. This issue includes, to a certain extent, source code hosted on GitHub. GitHub is not archived; hence, GitHub is not forever. While I am not foreseeing this happening anytime soon for GitHub, it did happen that a company, Bitbucket, decided to delete repositories in bulk. The Software Heritage project aims to archive research software that is hosted on GitHub, but it likely requires a manual trigger from the authors. Furthermore, the benefit of archived open data is maximized when it is linked from the paper itself. Archiving snapshots of source code from GitHub is very easy with Zenodo and with figshare.
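To make the half-life figure concrete, here is a quick back-of-the-envelope sketch (in Python, purely for illustration; `url_survival` is a hypothetical helper of mine, assuming simple exponential decay with the 2-year half-life cited above, not Koehler's actual model):

```python
# Toy model of link rot: fraction of plain-web URLs expected to still be
# reachable after `years`, assuming a constant half-life (2 years in the
# worst average case cited above). Illustration only.
def url_survival(years: float, half_life: float = 2.0) -> float:
    return 0.5 ** (years / half_life)

for years in (2, 5, 10, 20):
    print(years, round(url_survival(years), 3))
```

Under this toy model, only about 18% of links would survive 5 years and about 3% would survive 10; Koehler's observed 34.4% at 5 years corresponds to a somewhat longer half-life, but the downward trend is the same.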

Lessons learned for SIGSOFT open science guidelines

I started drafting the SIGSOFT open science guidelines, which will borrow from what I learned in this and all previous experiences. In particular, these are some of the lessons learned:

Data availability

The present report was written in R markdown and it is fully reproducible. The report and the underlying datasets are released as open access and open data: Daniel Graziotin. (2020). Effectiveness of Open Science Policies at ESEC/FSE 2019 [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3765247


I do not use a commenting system anymore, but I would be glad to read your comments and feedback. Feel free to contact me.