effectiveness of open science policies at esec/fse 2019
2020-04-24
Introduction
For those who don’t know me, I have been advocating for open science (open access, open data, and open source) practices in software engineering research for many years now. Sometimes I have been able to take on a more practical, hands-on role and help develop better tools or platforms for openness. Other times, my passion has been recognized and has put me into more decision-making roles, including being open science chair of several international workshops, conferences, and journals of software engineering research, supported by some good friends who were in the right positions and had the right motivation.
In my chairing roles, I developed several policies and guidelines (see the links in the previous sentence) for authors towards better transparency, reproducibility, and repeatability of studies. Indeed, developing policies and strategies for openness has become one of my latest passions, and it is something I have been working on with Jon Bell for the last few months within the SIGSOFT open science initiative, which we co-chair. A first output, driven by Jon, has been the SIGSOFT Artifact Evaluation Working Group.
I am driving some common open science policies that we will encourage all software engineering venues to adopt. The latest incarnation of my policies has been at ESEC/FSE 2019, but I am happy to report that derived forms of them were autonomously adopted at MSR 2020, ESEC/FSE 2020, and ICSME 2020, which warms my heart. What these policies have in common is the following desire: Artifacts related to a study (which include, but are not limited to, raw and transformed data, extended proofs, appendices, analysis scripts, software, virtual machines and containers, qualitative codebooks) and the paper itself should, in principle, be made available on the Internet:
- without any barrier (e.g., paywalls, registration forms, request mechanisms),
- under a proper open license that specifies purposes for re-use and re-purposing,
- properly archived and preserved,

provided that there are no ethical, legal, technical, economic, or otherwise sensitive barriers preventing the disclosure. We know that it is not always possible to disclose artifacts (from this point on, simply called data). So, these policies have so far never demanded data from authors. I wanted to change this at ESEC/FSE 2019.
This report
This report summarizes how the ESEC/FSE 2019 open science policies came to life and how much openness they brought to the conference. In particular, I report how the authors of 277 submissions provided data at peer review time, and how the authors of 74 published papers made their papers and data openly available. Finally, I discuss the obtained results and provide lessons learned for the development of the SIGSOFT open science policies.
The ESEC/FSE 2019 open science guidelines
The ESEC/FSE 2019 conference has been my largest deployment of open science policies so far, after ESEM 2018. I received almost carte blanche to draft my idea from the co-chairs Marlon Dumas and Dietmar Pfahl and from the PC chairs Sven Apel and Alessandra Russo (whom I deeply thank for their enthusiastic support).
For ESEC/FSE 2019, I wanted to achieve something radical: to demand that authors disclose their data for peer review and, in case of acceptance, turn it into archived open data. Those refusing to comply would be desk-rejected before review time or rejected after review. Of course, several exceptions would have to be made, e.g., in case of NDAs, ethical reasons, and so on. What happened was that, before the call for papers was sent out, we decided to involve a group of good people to review the guidelines.
While there was complete support for open science in general, some of the reviewers expressed concern that the change would be too strong for the community to process and, perhaps, understand. Someone suggested involving the steering committee of ESEC/FSE. At that point, given the little time we had, I decided to propose something in between:
To _demand from authors that they declare_ what they would do with their data at review time and in case of acceptance.
Author declarations would have to be present at submission time, on EasyChair, and in the papers themselves. Authors would not be punished in any way, no matter what information they provided. This version of the guidelines was received positively. PC members were instructed that they could complain if data was not provided, but they were not to reject papers based on the absence of data. Reviewers were also asked not to hunt around for information in case the authors disclosed data at review time and/or self-archived a preprint that was not blinded.
Let’s look at the core of the ESEC/FSE 2019 guidelines / policies for open science, summarized below for your convenience.
# Open Data and Open Source
Fostering open data means that data should be:
- Archived on preserved digital repositories such as zenodo.org, figshare.com, or institutional repositories.
- Personal or institutional websites, consumer cloud, or services such as Academia.edu and Researchgate.net are not archived and preserved digital repositories.
- Released under a proper open data license such as the CC0 dedication or the CC-BY 4.0 license when publishing the data.
- Different open licenses, if mandated by institutions or regulations, are also permitted.
Upon submission, we ask authors to provide a supporting statement on the data availability in the submission system and as text in the submitted papers (ideally, in a subsection named Data Availability within the Results section).
Please note that the success of the open science initiative depends on the willingness (and possibilities) of authors to disclose their data and that all submissions will undergo the same review process independent of whether or not they disclose their analysis code or data.
We encourage authors to make data available upon submission (either privately or publicly) and especially upon acceptance (publicly).
In any case, we ask authors who cannot disclose industrial or otherwise non-public data, for instance due to non-disclosure agreements, to please provide an explicit (short) statement in the paper (ideally, in a subsection named Data Availability within the Results section).
Similarly, we encourage authors to make their research software accessible as open source and citable, using the same process.
Any acceptable open source license is encouraged.
# Open Access
We encourage ESEC/FSE 2019 authors to self-archive their pre- and postprints in open, preserved repositories.
This is legal and allowed by all major publishers including ACM (granted in the copyright transfer agreement), and it lets anybody in the world reach your paper.
If the authors of a paper wish to do this, we recommend the following:
- Upon acceptance to ESEC/FSE 2019, revise your article according to the peers’ comments, generate a PDF version of it (postprint), and submit it to arXiv.org, which supports article versioning.
We also provided authors with links about open access and self-archiving and about disclosing data for double-blind peer review and preparing it for open data. How did it go? Let’s find out.
Method
I analyzed the EasyChair metadata of the 277 ESEC/FSE 2019 main track submissions that were sent out for peer review. I obtained the EasyChair export from the PC chairs. The export contained only data that was pertinent to my analysis; for example, I could not access the reviews of submissions. In particular, as mentioned in the open science guidelines, the EasyChair submission system had these two mandatory text fields:
- Data Availability. Are you disclosing your data for peer review? If yes, please indicate here a link to the data. If not, please indicate the reason why. Please report the same supporting sentences in your submission, ideally in a subsection named Data Availability within the Results section.
- Data Archiving. Are you planning to turn your data into archived open data in case of acceptance? If not, please tell us why. Please report the same supporting sentences in your submission, ideally in a subsection named Data Availability within the Results section.
I open coded each EasyChair submission to find out:
- if the related data was provided for peer review, and where, and
- if the authors committed to making the related data publicly available as open data after acceptance.
I then retrieved the ESEC/FSE 2019 main research track papers from the ACM digital library website. I inspected each paper and looked for any mention of disclosed data. I coded the information accordingly to find out:
- if the related data was provided openly, and where, and
- if a freely accessible copy of the paper was available, and where.
The description of all codes and how they emerged is available in the Results section.
As a methodological note, in this report I cover mere, direct, non-paywalled availability of artifacts and papers. The definitions of open access, open data, and open source foresee a proper use of copyright to produce a written statement of the intentions and expectations that users of the released material are asked to follow (i.e., a license). For the present work, I instead consider as open data, open source, and open access any artifact or paper that is released on the Internet barrier-free, that is, with no paywall, registration, or request mechanism for obtaining the material.
Anonymity
In this report and all data I provide, I am protecting:
- The identity of authors, by (1) excluding all EasyChair columns, (2) including only data that I produced through my qualitative analysis, and (3) shuffling all dataset rows before exporting them.
- The choices that authors of accepted papers made between submission and camera-ready. This report does not aim to point fingers at anyone. Data about published papers is kept separate from data about submissions, and it contains data related to published papers only. The dataset of published papers also contains only my qualitative codes.
That is, one dataset, submitted.df, includes all submissions, including those that were accepted, and one dataset, published.df, includes only published papers. Each line of submitted.df represents a submission that was sent for peer review in the main track. It is possible to identify a couple of published papers in the published.df dataset, but that dataset contains only publicly available information (all of it comes from the published papers), and the papers are not linked to their submission records in EasyChair.
if (file.exists("ESEC_FSE_2019-open_science.csv")){
  # Private master file (not distributed): split it into the two anonymized datasets.
  all.submissions.df <- read.csv2("ESEC_FSE_2019-open_science.csv", header = T)
  # All submissions: every column except the last four (which concern published papers only).
  submitted.df <- all.submissions.df[1:(length(all.submissions.df)-4)]
  # Published papers: complete rows only, last four columns.
  published.df <- na.exclude(all.submissions.df)[(length(all.submissions.df)-3):length(all.submissions.df)]
  # Shuffle the rows before exporting, to protect author identities.
  write.csv(x = submitted.df[sample(nrow(submitted.df)),], file = "ESEC_FSE_2019-open_science-submitted.csv", row.names = F)
  write.csv(x = published.df[sample(nrow(published.df)),], file = "ESEC_FSE_2019-open_science-published.csv", row.names = F)
} else { # this is what will be executed by anyone else reproducing this Rmd.
  submitted.df <- read.csv("ESEC_FSE_2019-open_science-submitted.csv", header = T)
  published.df <- read.csv("ESEC_FSE_2019-open_science-published.csv", header = T)
}
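As a quick sanity check, a minimal sketch like the following (using only the two data frames loaded above) should show the 277 submissions sent out for review and the 74 published papers:
nrow(submitted.df)  # expected: 277 submissions sent out for peer review
nrow(published.df)  # expected: 74 published papers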
Results
In this section I report on the 277 submissions and the 74 papers published in the ESEC/FSE 2019 main track.
Submissions
A total of 277 submissions were sent out for peer review. Authors filled out two EasyChair fields that I analyzed. The dataset submitted.df provides 8 variables, or columns:
- SUBMITTED_DATA_DETAILS: open codes that begin with YES_, followed by the type of storage for data that was provided for peer review, or open codes that begin with NO_, followed by a reason for not providing data.
- SUBMITTED_DATA_PROMISE_PUBLISHED_DETAILS: open codes that begin with YES_ if the authors committed to make data publicly available as open data after acceptance, or open codes that begin with NO_, followed by a reason otherwise.
- SUBMITTED_DATA_SUMMARY: 1 if data was provided for peer review, 0 otherwise.
- SUBMITTED_DATA_PROMISE_PUBLISHED_SUMMARY: 1 if the authors committed to make data publicly available as open data after acceptance, 0 otherwise.
- SUBMITTED_YES_DATA_SUMMARY_YES_DATA_PROMISE_PUBLISHED: 1 if the data was available for peer review and the authors committed to open data after acceptance.
- SUBMITTED_NO_DATA_SUMMARY_YES_DATA_PROMISE_PUBLISHED: 1 if the data was not available for peer review but the authors committed to open data after acceptance.
- SUBMITTED_NO_DATA_SUMMARY_NO_DATA_PROMISE_PUBLISHED: 1 if the data was not available for peer review and the authors did not commit to open data after acceptance.
- SUBMITTED_YES_DATA_SUMMARY_NO_DATA_PROMISE_PUBLISHED: 1 if the data was available for peer review but the authors did not commit to open data after acceptance.
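The four combination columns are redundant with the two summary columns. As a minimal consistency sketch (assuming only the column names listed above), they can be cross-checked with a simple contingency table:
# 2x2 cross-tabulation of the two 0/1 summary columns; its four cells should
# match the sums of the four combination columns listed above
with(submitted.df, table(data_for_review = SUBMITTED_DATA_SUMMARY,
                         promised_open_data = SUBMITTED_DATA_PROMISE_PUBLISHED_SUMMARY))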
It should be noted here that everyone involved (reviewers and chairs) was instructed to trust the authors’ explanations, so I also assumed all stated reasons to be truthful. Let’s first see the percentages for submissions:
sum(submitted.df$SUBMITTED_DATA_SUMMARY) / NROW(submitted.df)
[1] 0.6750903
sum(submitted.df$SUBMITTED_DATA_PROMISE_PUBLISHED_SUMMARY) / NROW(submitted.df)
[1] 0.866426
sum(submitted.df$SUBMITTED_YES_DATA_SUMMARY_YES_DATA_PROMISE_PUBLISHED) / NROW(submitted.df)
[1] 0.66787
sum(submitted.df$SUBMITTED_NO_DATA_SUMMARY_YES_DATA_PROMISE_PUBLISHED) / NROW(submitted.df)
[1] 0.198556
sum(submitted.df$SUBMITTED_NO_DATA_SUMMARY_NO_DATA_PROMISE_PUBLISHED) / NROW(submitted.df)
[1] 0.1263538
sum(submitted.df$SUBMITTED_YES_DATA_SUMMARY_NO_DATA_PROMISE_PUBLISHED) / NROW(submitted.df)
[1] 0.007220217
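Since the summary columns are 0/1 indicators, the same proportions can also be obtained in one go with column means; here is a minimal sketch, again assuming only the column names listed above:
# Column means of the 0/1 indicator columns yield the same proportions in a single call
colMeans(submitted.df[, c("SUBMITTED_DATA_SUMMARY",
                          "SUBMITTED_DATA_PROMISE_PUBLISHED_SUMMARY",
                          "SUBMITTED_YES_DATA_SUMMARY_YES_DATA_PROMISE_PUBLISHED",
                          "SUBMITTED_NO_DATA_SUMMARY_YES_DATA_PROMISE_PUBLISHED",
                          "SUBMITTED_NO_DATA_SUMMARY_NO_DATA_PROMISE_PUBLISHED",
                          "SUBMITTED_YES_DATA_SUMMARY_NO_DATA_PROMISE_PUBLISHED")])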
ESEC/FSE 2019 had 67% of its 277 submissions accompanied by data available for peer review. A total of 87% of submissions promised open data in case of acceptance, and 67% of submissions both had accompanying data and promised open data in case of acceptance. I was curious to see where the data submitted for peer review (67% of submissions) was hosted, so I further analyzed these submissions with the following codes:
- YES_REPOSITORY: when data was hosted on a version control repository, such as a git repository. GitHub repositories were the vast majority here.
- YES_CLOUD: when data was hosted in the cloud, which in most cases meant a cloud drive such as Dropbox or OneDrive.
- YES_ARCHIVED: when data was hosted on a properly archived system for open data, such as figshare, Zenodo, or an institutional repository.
- YES_WEBSITE: when data was hosted on a website. I verified whether each provided link pointed to a website, and I also verified the links provided in the websites themselves. Links that pointed elsewhere, e.g., to a GitHub repository or a Dropbox link, were not coded as websites.
- YES_FILESHARING: when data was hosted on a filesharing website, such as anonfiles.com or uploaded.net.
Here is the distribution of the codes:
yes.review <- sort(table(as.character(submitted.df[submitted.df$SUBMITTED_DATA_SUMMARY == 1,]$SUBMITTED_DATA_DETAILS)), decreasing = T)
ggplot(as.data.frame(yes.review)) + geom_bar(aes(x = Var1, y = Freq), stat = "identity") + xlab('Where data was hosted at peer review time') + ylab('Count') + theme_bw()
I then looked at the submissions that did not provide data for peer review and analyzed the reasons they gave with the following codes:
- NO_REVIEW_TIME: contained all soft reasons for not being able to provide data at submission time. Reasons ranged from “We need more work to clean the data properly” (note: all quoted text has been slightly altered to protect author identities), “The proposed approach is still under development”, and “Not at this stage, the data is tied up to the method and will not be useful to review”, to a simple “We did not have the time to prepare this”.
- NO_NO_REASON: represented all submissions that did not provide any reason at all for not including data for peer review. Usually, the answer here was “No.”.
- NO_DBR_VIOLATION: covered those submissions which declared that it would be impossible to disclose data for peer review without unveiling author identities. Reasons ranged from the barely articulated “Disclosing the data now would reveal authorship” to “Our models are publicly available in a repository in which authorship is revealed (commit authors and messages, names and emails in the source code, etc.).”.
- NO_NDA: covered submissions done in cooperation with industry, which often does not permit disclosing raw data, e.g., “We do not have permission from our industry partners to share the data”, or implies a non-permission through NDAs: “Due to a non-disclosure agreement, we are not allowed to publish the raw data of our study”.
- NOT_APPPLICABLE: not all software engineering submissions gather data or produce software. Some subfields (e.g., formal methods) build models (“Our paper is a conceptual model based on prior work. There is no data that can be provided”) or provide proofs (“Since the submission is a formal method, I do not have any data to disclose.”). Hence, papers are sometimes truly self-contained. These papers maximize their value when they are open access.
- NO_ANONYMIZATION: covered those submissions claiming that it would be impossible to properly anonymize the data (from the point of view of, e.g., participants) for the peer review process.
- NO_TECHNICAL_ISSUES: represented those submissions for which the authors claimed that technical reasons prevented the disclosure of data. The most stated reason was the size of the dataset.
- NO_PARTICIPANTS_CONSENT: sometimes participants themselves ask for their data not to be disclosed in any form, and it is a moral obligation to follow their request. Other times, the authors did not explicitly tell participants that their data could be made openly available and decided to simply not disclose it in the absence of consent.
- NO_OWN_PROPERTY: the least represented reason, for which authors asserted their intellectual rights over the data and were not willing to disclose it under any condition.
no.review <- sort(table(as.character(submitted.df[submitted.df$SUBMITTED_DATA_SUMMARY == 0,]$SUBMITTED_DATA_DETAILS)), decreasing = T)
ggplot(as.data.frame(no.review)) + geom_bar(aes(x = Var1, y = Freq), stat = "identity") + xlab('Reasons for not disclosing data at peer review time') + ylab('Count') + theme_bw() + theme(axis.text.x=element_text(angle=90,hjust=1))
On the other hand, 81% of the submissions that did not provide data for peer review gave an explanation for not doing so.
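Here is a minimal sketch of how this share can be recomputed, assuming that missing explanations are coded as NO_NO_REASON as described above:
# Share of non-disclosing submissions that gave any explanation at all
no.data <- subset(submitted.df, SUBMITTED_DATA_SUMMARY == 0)
mean(no.data$SUBMITTED_DATA_DETAILS != "NO_NO_REASON")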
Published papers
A total of 74 submissions were accepted to the ESEC/FSE 2019 main research track. The proceedings were made available in August 2019. Near the end of March 2020, to allow authors some time to self-archive their papers, I inspected each published paper to find out the following:
- PUBLISHED_DATA_DETAILS: open codes that begin with YES_, followed by the type of storage for data published after acceptance, or NO otherwise.
- PUBLISHED_OPEN_ACCESS_DETAILS: open codes that begin with YES_, followed by the type of storage for a paper made available after acceptance, or NO otherwise. In case a paper was made freely available in multiple places, I picked the “best” one, where an archived repository would be the best option.
- PUBLISHED_DATA_SUMMARY: 1 if the data was made freely available after acceptance, 0 otherwise.
- PUBLISHED_OPEN_ACCESS_SUMMARY: 1 if the paper was made freely available after acceptance, 0 otherwise.
sum(published.df$PUBLISHED_DATA_SUMMARY, na.rm = T) / (NROW(published.df$PUBLISHED_OPEN_ACCESS_SUMMARY) - sum(is.na(published.df$PUBLISHED_OPEN_ACCESS_SUMMARY)))
[1] 0.7027027
sum(published.df$PUBLISHED_OPEN_ACCESS_SUMMARY, na.rm = T) / (NROW(published.df$PUBLISHED_OPEN_ACCESS_SUMMARY) - sum(is.na(published.df$PUBLISHED_OPEN_ACCESS_SUMMARY)))
[1] 0.8918919
A total of 70% of published papers made data openly available. Almost all (89%) of the published papers were made freely available outside of the ACM Digital Library paywall. Similarly to the submissions, I checked where data and papers were hosted. I inspected the published papers and coded the columns PUBLISHED_DATA_DETAILS and PUBLISHED_OPEN_ACCESS_DETAILS as follows:
- YES_REPOSITORY: when data was hosted on a version control repository, such as a git repository. GitHub repositories were the vast majority here.
- YES_ARCHIVED: when data was hosted on a properly archived system for open data, such as figshare, Zenodo, or an institutional repository.
- YES_WEBSITE: when data was hosted on a website (I verified that each link actually pointed to a website; links that pointed elsewhere, e.g., to a GitHub repository or a Dropbox link, were coded differently).
- YES_CLOUD: when data was hosted in the cloud, which in most cases meant a cloud drive such as Dropbox or OneDrive.
- NO: when no data was offered. I did not have to code the reason for not providing accompanying data, because a reason was never stated (a quick count of these papers follows below).
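As a minimal sketch using the NO code described above, the number of published papers without any accompanying data is:
# Published papers that offered no accompanying data at all
sum(published.df$PUBLISHED_DATA_DETAILS == "NO", na.rm = TRUE)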
I am plotting only the YES_ codes here.
yes.published.data <- sort(table(as.character(published.df[published.df$PUBLISHED_DATA_SUMMARY == 1,]$PUBLISHED_DATA_DETAILS)), decreasing = T)
ggplot(as.data.frame(yes.published.data)) + geom_bar(aes(x = Var1, y = Freq), stat = "identity") + xlab('Where data was hosted after acceptance') + ylab('Count') + theme_bw()
yes.published.paper <- sort(table(as.character(published.df[published.df$PUBLISHED_OPEN_ACCESS_SUMMARY == 1,]$PUBLISHED_OPEN_ACCESS_DETAILS)), decreasing = T)
ggplot(as.data.frame(yes.published.paper)) + geom_bar(aes(x = Var1, y = Freq), stat = "identity") + xlab('Where papers were hosted after acceptance') + ylab('Count') + theme_bw()
Two papers were offered with the ACM open access option.
Discussion
This being my website, I will dedicate the first part of the discussion to celebrating the results of the open science guidelines. The main research track of ESEC/FSE 2019 has been my greatest achievement so far in terms of impact and openness of data.
- About 67% of submissions provided data at peer review time, and 70% of published papers disclosed data and software openly. This is in line with the frequency of accompanying data in submitted papers (67%) and a bit lower than the rate of promises to make data available in case of acceptance (87%).
- Almost 90% of published papers were freely available outside of a paywall. This is a big win in terms of open access.
Now, to the less happy news.
- Three quarters (75%) of accepted submissions presented open data that is in danger of disappearing in the medium run.
- More than half (62%) of the openly available published papers are at risk of disappearing in the medium run.
Here is why all my open science policies insist so strongly on not storing code, data, and papers on mere websites (and, sorry, GitHub and researchgate.net count as websites): web resources disappear, and pretty quickly. Wallace Koehler has performed a series of longitudinal studies of generic (that is, non-archived) URLs, which found, among other interesting results, that only 34.4% of URLs are still available after 5 years, a figure that drops to around 0.005% after 20 years. In the worst average case, the half-life of Web page availability has been estimated at 2 years. This issue includes, to a certain extent, source code hosted on GitHub. GitHub is not archived; hence, GitHub is not forever. While I am not foreseeing this happening anytime soon for GitHub, it did happen that a company, Bitbucket, decided to delete repositories in bulk.
The Software Heritage project aims to archive research software hosted on GitHub, but it likely requires a manual trigger from the authors. Furthermore, the value of archived open data is maximized when it is linked from the paper. Archiving snapshots of source code from GitHub is very easy with Zenodo and with figshare.
Lessons learned for SIGSOFT open science guidelines
I started drafting the SIGSOFT open science guidelines, which will borrow from what I learned in this and all previous experiences. In particular, these are some of the lessons learned:
- Forcing authors to disclose their data for peer review and as open data in case of acceptance seems unnecessary, given how many authors complied, and to what extent, with the “soft version” of the policies.
- However, requiring authors to declare their data availability, without punishing them when they decline, produces positive results.
- NDAs, technical reasons, and ethical reasons for not disclosing data, which are all good reasons for not disclosing data, are far less frequent than expected.
- Authors need clear instructions and proper tools to ease the pain of disclosing data.
- Most authors (I did not count this, but I would estimate more than 90%) did not create a section on data availability as suggested in the policies. It was a very demanding task to look for links, citations, or sentences on data availability, which could appear anywhere in the papers. I think that this information should be present in the call for papers and, hopefully, in ACM templates for submissions. We need a standard place to disclose data availability.
- Authors need to be reminded of, and better informed about, the volatility of the Web. Too many preprints were hosted on personal websites, on GitHub, and on researchgate.net. These are not archived hosting platforms. The same holds for source code and even data hosted on GitHub. Authors need to be informed about, and supported with, digital preservation mechanisms.
- Authors seem to appreciate step-by-step how-tos on achieving what they are expected to do. Examples are my tutorial on how to disclose data for double-blind review and easily turn it into open data upon acceptance, and Arie van Deursen’s tutorial on green open access and self-archiving. One way to enhance the preservation of disclosed data might be to create, or point to, tutorials on how to automate this process when GitHub is used as a platform (see the end of the previous section). These links should be present in the open science guidelines.
- Authors need to be reminded that a promise to make data available upon request, or an online form to access either data or paper, is (1) a barrier, technically putting a resource behind a wall, and (2) a mechanism that destroys digital preservation, as whoever evaluates the requests needs to be alive.
- A further issue that should be addressed in future reports is the proper licensing of open data and open source.
Data availability
The present report was written in R Markdown and is fully reproducible. The report and the underlying datasets are released as open access and open data: Daniel Graziotin. (2020). Effectiveness of Open Science Policies at ESEC/FSE 2019 [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3765247
I do not use a commenting system anymore, but I would be glad to read your feedback. Feel free to contact me.