50 Years of Arabidopsis - Citation Network

In May 2015, Nicholas Provart invited me to contribute to a review of the past fifty years of Arabidopsis research for New Phytologist. My contribution consisted of an interactive data visualization tool that displays 54,033 Arabidopsis publications from 1965 through 2015. A search tool lets users find papers by author or title; clicking on a paper displays links to all the papers that have cited it and all the papers it has cited.

Thomson Reuters’ Biosis database, covering publications from 1800 to March 2015, was searched for papers with Arabidopsis in the title, abstract or keywords. This resulted in a data set of 54,116 papers. Concept codes and bibliographic information were provided as data dumps from Thomson Reuters. A further data set encompassing all 1,830,099 citations of these papers was also provided. The citations came from papers both inside and outside the original 54k data set. In total, 283,110 papers cited the 54k data set. Structured taxonomic data were available for 275,307 of these papers.
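To illustrate how a citation list like this can back the "cited by" and "cites" lookups in the tool, here is a minimal sketch in Python. It is not the project's actual code (the original parsing was done in Perl), and the paper IDs, the `(citing_id, cited_id)` pair format and the function name `build_citation_index` are all assumptions made for the example.

```python
from collections import defaultdict


def build_citation_index(citations):
    """Build forward and reverse lookups from (citing_id, cited_id) pairs."""
    cites = defaultdict(set)     # paper -> papers it cites
    cited_by = defaultdict(set)  # paper -> papers that cite it
    for citing_id, cited_id in citations:
        cites[citing_id].add(cited_id)
        cited_by[cited_id].add(citing_id)
    return cites, cited_by


# Hypothetical toy data: A1..A3 stand in for papers in the Arabidopsis set,
# X1 for a citing paper from outside it.
arabidopsis_ids = {"A1", "A2", "A3"}
citations = [("A2", "A1"), ("A3", "A1"), ("X1", "A1"), ("A3", "A2")]

cites, cited_by = build_citation_index(citations)

# Citing papers that are not themselves part of the Arabidopsis set.
external_citers = {citing for citing, _ in citations} - arabidopsis_ids

print(sorted(cited_by["A1"]))   # papers that cite A1
print(sorted(cites["A3"]))      # papers that A3 cites
print(sorted(external_citers))  # citers from outside the set
```

Keeping both directions in memory makes each click a single dictionary lookup, at the cost of storing every citation twice.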

Taxonomic data were used to flag papers in the original 54k data set as being cited by non-Arabidopsis thaliana papers. Note that some papers in the superset of citing papers were indexed with Arabidopsis thaliana in the Taxonomy field but did not mention Arabidopsis in the title, abstract or keywords, so the 54k set slightly underestimates the total number of Arabidopsis papers.
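The flagging step described above can be sketched roughly as follows. This is an illustrative Python version, not the original Perl. It assumes that a "non-Arabidopsis thaliana paper" is a citing paper whose taxonomy entries do not include Arabidopsis thaliana, and that citing papers without structured taxonomic data are skipped. The function name, IDs and data layout are hypothetical.

```python
def flag_cited_by_other_taxa(cited_by, taxa_by_paper, target="Arabidopsis thaliana"):
    """Return IDs of papers cited by at least one paper not indexed with `target`."""
    flagged = set()
    for paper_id, citers in cited_by.items():
        for citer in citers:
            taxa = taxa_by_paper.get(citer)
            if taxa is None:
                # Assumption: no structured taxonomic data, so we cannot decide.
                continue
            if target not in taxa:
                flagged.add(paper_id)
                break  # one qualifying citer is enough
    return flagged


# Hypothetical toy data. X2 has no taxonomic record at all.
cited_by = {"A1": {"A2", "X1"}, "A2": {"A3"}, "A3": {"X2"}}
taxa_by_paper = {
    "A2": {"Arabidopsis thaliana"},
    "A3": {"Arabidopsis thaliana"},
    "X1": {"Oryza sativa"},
}

flagged = flag_cited_by_other_taxa(cited_by, taxa_by_paper)
print(sorted(flagged))
```

In this toy example only A1 is flagged, because its citer X1 is indexed with a different species; A3 is not flagged because its only citer has no taxonomic data.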

Concept codes for the 54k set of Arabidopsis papers were also provided by Thomson Reuters as a text file. We used the concept codes to flag each paper in the 54k set with the least abundant of its codes, drawn from a set of 46 codes chosen to balance between being too general (75% of the papers were tagged with 03504: Genetics – Plant) and being too specific (just one paper was tagged with 64500: Paleobiology). The chosen concept codes ranged in prevalence from 11.6% for “51512: Plant physiology – Reproduction” to 1.04% for “13014: Metabolism – Nucleic acids, purines and pyrimidines”. Custom Perl scripts were used to parse the data. Graphing was done with JavaScript and d3.js. This view runs on the BAR and also as an app on Araport.
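As a rough illustration of the concept-code flagging, the sketch below assigns each paper the rarest of its codes from a chosen subset. It is written in Python for illustration only (the project itself used Perl). It assumes that "least abundant" means the code with the fewest tagged papers across the whole set, and the tie-breaking by code string is arbitrary. The function name and the toy data are hypothetical.

```python
from collections import Counter


def least_abundant_code(paper_codes, selected_codes):
    """Map each paper to the rarest of its codes that is in `selected_codes`."""
    # Count how many papers carry each selected code.
    counts = Counter(
        code
        for codes in paper_codes.values()
        for code in codes
        if code in selected_codes
    )
    flags = {}
    for paper_id, codes in paper_codes.items():
        candidates = [code for code in codes if code in selected_codes]
        if candidates:
            # Rarest code wins; ties are broken by code string (arbitrary choice).
            flags[paper_id] = min(candidates, key=lambda code: (counts[code], code))
    return flags


# Hypothetical toy data: 03504 is deliberately left out of the selected set
# because it is too general.
selected_codes = {"51512", "13014"}
paper_codes = {
    "P1": {"03504", "51512"},
    "P2": {"03504", "51512", "13014"},
    "P3": {"03504"},
    "P4": {"51512"},
}

flags = least_abundant_code(paper_codes, selected_codes)
print(flags)
```

Here P2 is flagged with 13014 because that code is rarer than 51512, and P3 receives no flag because none of its codes is in the selected set.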

The paper was published in New Phytologist:

Nicholas Provart, Jose Alonso, Sarah Assmann, Dominique Bergmann, Siobhan Brady, Jelena Brkljacic, John Browse, Clint Chapple, Vincent Colot, Sean Cutler, Jeff Dangl, David Ehrhardt, Joanna Friesner, Wolf Frommer, Erich Grotewold, Elliot Meyerowitz, Jennifer Nemhauser, Magnus Nordborg, Craig Pikaard, John Shanklin, Chris Somerville, Shauna Somerville, Mark Stitt, Keiko Torii, Jamie Waese, Doris Wagner, and Peter McCourt. 50 Years of Arabidopsis Research: Highlights and Future Directions. New Phytologist, October 2015. Tansley Review. DOI: 10.1111/nph.13687


Category Research, Data Visualization, UX Design

Date August 2017

My video describing the project was shown at the 2015 International Conference on Arabidopsis Research in Paris: https://www.youtube.com/watch?v=7c6qqJzC3wM