The move to providing the underlying data behind research articles has been a major step towards promoting reproducibility, transparency and data re-use. However, analyses of the quality and annotation of published data sets can sometimes reveal significant and widespread problems.
Excel can convert gene names into dates
A recent article published in Genome Biology has highlighted a continuing problem with using Microsoft’s Excel spreadsheet program to handle data containing gene names. The problem was first identified in 2004 by Zeeberg et al. and some workarounds were suggested. They noticed that gene symbols such as SEPT2 (Septin 2) and MARCH1 (Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase) are converted to ‘2-Sep’ or ‘2006/09/02’ and ‘1-Mar’ by default when typed or copied into Excel. However, it looks as if researchers are still generally unaware that their data could be being transformed by Excel’s default settings. The three authors of the new paper looked at supplementary data sets that were associated with publications in leading genomics journals.
The researchers screened 35,175 supplementary Excel files from 18 journals. They found 7467 gene lists associated with 3597 published papers. When they opened and checked the Excel files they discovered gene name errors in 987 supplementary files from 704 published articles.
For some journals, more than 20% of the supplementary files had errors, for example Nucleic Acids Research, Genome Biology, Nature Genetics, Genome Research, Genes and Development and Nature. The journals with the lowest proportion (<10%) of affected files in the study were Molecular Biology and Evolution, Bioinformatics, DNA Research and Genome Biology and Evolution. The full results of the analysis are presented in the paper. From their analyses, the researchers found that gene name errors in supplementary files have increased at an annual rate of over 15% over the past five years, outpacing the increase in published papers in this field of 3.8% per year.
The authors of the paper suggest a few ways that researchers, reviewers and editorial staff at journals might use to spot these kinds of errors before they make it into publication. They also call for journals to make supplementary data sets open access as some supplementary files can be pay-walled in subscription journals and so could not be checked as part of their analyses. The researchers found that Google Sheets did not generate the same type of conversion errors with gene names that were found in Microsoft Excel, LibreOffice Calc or Apache OpenOffice Calc.
What does this mean for researchers?
It is important to be aware of this potential problem if you are dealing with data that might be affected by Excel’s default formatting. There are some practical suggestions in the 2004 paper and some checking tools suggested in the new Genome Biology paper.
Zeeberg, B. R., Riss, J., Kane, D. W., Bussey, K. J., Uchio, E., Linehan, W. M., Barrett, J. C. and Weinstein, J. N. (2004) Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics 5:80, doi: 10.1186/1471-2105-5-80