Journal Interrobang's Journal: Data Corruption from Excel Autocorrect 1
Someone on TECHWR-L posted a link to this paper (under the paradoxical title "The Cupertino Effect"), which is about how Excel's autocorrect feature can corrupt statistical analysis of genetic data if/when Excel "makes the wrong assumption" about an entry based on how it looks:
When processing microarray data sets, we recently noticed that some gene names were being changed inadvertently to non-gene names. A little detective work traced the problem to default date format conversions and floating-point format conversions in the very useful Excel program package. The date conversions affect at least 30 gene names; the floating-point conversions affect at least 2,000 if Riken identifiers are included. These conversions are irreversible; the original gene names cannot be recovered.
As the author points out, this can cause gene names to come back in analyses as "unknown," because "[a] default date conversion feature in Excel
The authors also note that there is a problem with "RIKEN [4] clone identifiers of the form nnnnnnnEnn" being converted to a floating-point number.
The paper also gives some idea of the devastating scale of the problem and its significance for people doing these sorts of analyses: "A non-expert user might well fail to notice that approximately 3% of the identifiers on a microarray with tens of thousands of genes had been converted to an incorrect form, yet the potential for 2,000 identifiers to be transmogrified without notice is a considerable concern."
As far as I know personally and can glean from the paper, the autocorrect and/or conversion feature is nearly impossible to disable completely, and can only be worked around, possibly unsuccessfully 100% of the time. This suggests that perhaps Excel is not the tool of choice for doing these sorts of analyses. (Does the spreadsheet application in OpenOffice work differently?)
I shall check (Score:1)
I have both Excel and OpenOffice at work, so I'll check tomorrow.
It's a bit silly that they provided a C program to check for corruptions. Providing a VB program (that runs in Excel) would be much more useful for people.
Hugs