Beta
×

Welcome to the Slashdot Beta site -- learn more here. Use the link in the footer or click here to return to the Classic version of Slashdot.

Thank you!

Before you choose to head back to the Classic look of the site, we'd appreciate it if you share your thoughts on the Beta; your feedback is what drives our ongoing development.

Beta is different and we value you taking the time to try it out. Please take a look at the changes we've made in Beta and  learn more about it. Thanks for reading, and for making the site better!

Data Corruption from Excel Autocorrect

Interrobang (245315) writes | more than 5 years ago

Microsoft 1

Someone on TECHWR-L posted a link to this paper (under the paradoxical title "The Cupertino Effect"), which is about how Excel's autocorrect feature can corrupt statistical analysis of genetic data if/when Excel "makes the wrong assumption" about an entry based on how it looks:

Someone on TECHWR-L posted a link to this paper (under the paradoxical title "The Cupertino Effect"), which is about how Excel's autocorrect feature can corrupt statistical analysis of genetic data if/when Excel "makes the wrong assumption" about an entry based on how it looks:

When processing microarray data sets, we recently noticed that some gene names were being changed inadvertently to non-gene names. A little detective work traced the problem to default date format conversions and floating-point format conversions in the very useful Excel program package. The date conversions affect at least 30 gene names; the floating-point conversions affect at least 2,000 if Riken identifiers are included. These conversions are irreversible; the original gene names cannot be recovered.

As the author points out, this can cause gene names to come back in analyses as "unknown," because "[a] default date conversion feature in Excel ... was altering gene names that it considered to look like dates. For example, the tumor suppressor DEC1 ... was being converted to '1-DEC.'"

The authors also note that there is a problem with "RIKEN [4] clone identifiers of the form nnnnnnnEnn" being converted to a floating-point number.

The paper also gives some idea of the devastating scale of the problem and its significance for people doing these sorts of analyses: "A non-expert user might well fail to notice that approximately 3% of the identifiers on a microarray with tens of thousands of genes had been converted to an incorrect form, yet the potential for 2,000 identifiers to be transmogrified without notice is a considerable concern."

As far as I know personally and can glean from the paper, the autocorrect and/or conversion feature is nearly impossible to disable completely, and can only be worked around, possibly unsuccessfully 100% of the time. This suggests that perhaps Excel is not the tool of choice for doing these sorts of analyses. (Does the spreadsheet application in OpenOffice work differently?)

cancel ×

1 comment

Sorry! There are no comments related to the filter you selected.

I shall check (1)

Hemi Rodner (570284) | more than 5 years ago | (#27381235)

I have both Excel and OpenOffice at work, so I'll check tomorrow.

It's a bit silly that they provided a C program to check for corruptions. Providing a VB program (that runs in Excel) would be much more useful for people.

Hugs

Check for New Comments
Slashdot Login

Need an Account?

Forgot your password?

Submission Text Formatting Tips

We support a small subset of HTML, namely these tags:

  • b
  • i
  • p
  • br
  • a
  • ol
  • ul
  • li
  • dl
  • dt
  • dd
  • em
  • strong
  • tt
  • blockquote
  • div
  • quote
  • ecode

"ecode" can be used for code snippets, for example:

<ecode>    while(1) { do_something(); } </ecode>