Finding a Needle in a Haystack of Data

ScuttleMonkey posted more than 8 years ago | from the mathematical-sieve dept.

Math 173

Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as the ones run in a particle collider. But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]

173 comments

Google (4, Interesting)

biocute (936687) | more than 8 years ago | (#14204954)

Does Google have the technology to do this kind of scientific searches yet?

If it does, it sure can save these researchers a lot of time; If it doesn't, I'm sure Google will be keen to get involved, especially on the "isolate useful signals buried in large datasets" part.

Re:Google (1)

paulsgre (890463) | more than 8 years ago | (#14205091)

But can it find potential girlfriends for Slashdotters? Now that's what I would call isolating a useful (and rare) signal buried in a large dataset. When I see THOSE results, I will be impressed.
And if so, I've got a useful signal that could use some burying...

Re:Google (2, Funny)

sapped (208174) | more than 8 years ago | (#14205181)

But can it find potential girlfriends for Slashdotters?

Wow. There really aren't any out there. Check it out on google [google.com] yourselves.

The same results come back in images, groups, news, etc. Man. What a sad bunch.

Re:Google (2, Funny)

garcia (6573) | more than 8 years ago | (#14205103)

Does Google have the technology to do this kind of scientific searches yet?

It's only in Beta thus it's not useful ;-)

Was it just me or was this story broken at first? (1)

Wisgary (799898) | more than 8 years ago | (#14204958)

It just refused to load for me.

Re:Was it just me or was this story broken at firs (3, Funny)

MarkGriz (520778) | more than 8 years ago | (#14205132)

"It just refused to load for me."

Maybe your interest in the story was deemed statistically insignificant.

Re:Was it just me or was this story broken at firs (1)

Wisgary (799898) | more than 8 years ago | (#14205168)

So... I'm just another piece of hay... :(

Indexes (0)

CastrTroy (595695) | more than 8 years ago | (#14204971)

All you have to do is index it properly, and lots of data can be searched really fast.

Re:Indexes (2, Informative)

Husgaard (858362) | more than 8 years ago | (#14205042)

They are trying to efficiently find a signal in random and chaotic data. Random and chaotic data isn't easy to index.

Re:Indexes (1)

CastrTroy (595695) | more than 8 years ago | (#14205083)

But that's the trick. Finding a good way to index the data.

Re:Indexes (0)

Anonymous Coward | more than 8 years ago | (#14205347)

This is dealing with apparently random fluctuating data and finding patterns. Such as examining a radio signal to see if it is just noise or if aliens are saying hello a la SETI. Or on a more prosaic level whether the fluctuation of the temperature of a reactor is significant. This could be a real boon for chemical plant operations, for example. This kind of data can't be indexed.

Indexing Nonsense. (0)

Anonymous Coward | more than 8 years ago | (#14205111)

"They are trying to efficiently find a signal in random and chaotic data. Random and chaotic data isn't easy to index."

And somehow Google manages with Slashdot.

Re:Indexes (1)

Marko DeBeeste (761376) | more than 8 years ago | (#14205090)

If we had ham, we could have ham and eggs. If we had eggs.

Roland Alert (0, Troll)

stinerman (812158) | more than 8 years ago | (#14204979)

... but be advised that the links do not point to his site, but the actual article (are the editors doing this or has he given it up).

Re:Roland Alert (1, Funny)

Anonymous Coward | more than 8 years ago | (#14205189)

Where have you been?
Don't you know the editors are in cahoots with the Beatles Beatles guy now?
Please, try to keep up with the conspiracy theories, mkay? Jeez!

The most obvious application (5, Interesting)

Billosaur (927319) | more than 8 years ago | (#14204983)

I see this as being a boon to SETI [berkeley.edu]. If there was ever a needle in a haystack, it's trying to tease a possible intelligent signal out of the cosmic background noise. If you have an idea what the background is like in general, then it's far easier to detect an abnormality in that background noise. The question will end up being, are we simply detecting more false positives or are these real signals?

Seti (1)

jurt1235 (834677) | more than 8 years ago | (#14205061)

Also the first "useful" application for this kind of technique which popped up in my head. Actually, the process in my head to make this one item pop up is maybe useful too (-: Lot of random data, and this one is being associated with the article.

Re:The most obvious application (0)

Anonymous Coward | more than 8 years ago | (#14205236)

I'll try to set aside my dislike of SETI, due to its religious nature (the search for something that has no shred of evidence), but I would liken SETI more to the search for a needle in a hay barn, and in the end it turns out that there was no needle.

Not to say there isn't ET life out there, but until some evidence points to it, we may as well assume that the universe is empty except for us.

Go Case, I knew that my tuition dollars were going somewhere

Re:The most obvious application (way OT) (1)

blackcoot (124938) | more than 8 years ago | (#14205506)

it's been a while since i last did much perl, but shouldn't the last line of your sig be:

($world = $world) =~ s/bad/good/g;

otherwise you're making your world better but not ever doing anything with it...

Tagging (0, Flamebait)

saskboy (600063) | more than 8 years ago | (#14204986)

It would be a lot easier to find data if we tagged it with things like the Evil Bit, and Broadcast Flag, or like Technorati's blog tags that the user associates with their data so a search can find it.

What does God use to tag a galaxy with though?

Obligatory (1)

drewzhrodague (606182) | more than 8 years ago | (#14205018)

"What does god want with a starship?" -Spock

Re:Obligatory (1)

RetroGeek (206522) | more than 8 years ago | (#14205084)

Kirk said this, not Spock

Re:Obligatory (1)

drewzhrodague (606182) | more than 8 years ago | (#14205207)

Oops! And here I thought I was Spock -- er, spot-on. =_)

Re:Obligatory (1)

RadioD00d (714469) | more than 8 years ago | (#14205398)

Nope - it was McCoy

ITagging (1, Funny)

Anonymous Coward | more than 8 years ago | (#14205363)

"What does God use to tag a galaxy with though?"

Are you telling us there's such a thing as Intelligent Tagging?

Ya' know... (3, Funny)

jacobcaz (91509) | more than 8 years ago | (#14204988)

82.67% of all statistics are made up anyway...

Re:Ya' know... (2, Funny)

saskboy (600063) | more than 8 years ago | (#14205007)

"82.67% of all statistics are made up anyway..."

Well yeah, 50% of all statisticians finished in the bottom half of their class.

Re:Ya' know... (1)

Tony Hoyle (11698) | more than 8 years ago | (#14205308)

Not necessarily... only works if there are an even number of statisticians, and if nobody scored the mean score.

eg. if there are 100 statisticians, the mean score is 37 and 10 statisticians scored that, only 45% of statisticians are technically in the bottom half (and 45% in the top half). 10% are exactly in the middle.

You could say that the 10% are in both the bottom and top half... in which case 55% are in the bottom half and 55% are in the top half!!

Re:Ya' know... (2, Funny)

$RANDOMLUSER (804576) | more than 8 years ago | (#14205362)

Jeez. How anal. You should take some time and count the flowers.

Re:Ya' know... (1)

saskboy (600063) | more than 8 years ago | (#14205374)

"You could say that the 10% are in both the bottom and top half..."

I'm not sure now if a comment like that puts you in the top half, or bottom half. :-P

Re:Ya' know... (1)

Funakoshi (925826) | more than 8 years ago | (#14205287)

Very true. Also interesting is that 95% of men like to use statistics to seem more intelligent...

I hope YOU know that ... (1)

dazey (903451) | more than 8 years ago | (#14205310)

from the moment you posted that comment, the value you gave increased just a little bit more ...

Sounds useful. (1, Funny)

RandoX (828285) | more than 8 years ago | (#14204989)

I can't even find my keys some days.

Re:Sounds useful. (1)

TheComputerMutt.ca (907022) | more than 8 years ago | (#14205130)

And how would this help you with your inability to locate them?

It wouldn't, that's how.

You ever thought [command]-[F] while... (1)

atrocious cowpat (850512) | more than 8 years ago | (#14205232)

.. looking for stuff on your [real world] desktop?

I have, have actually had my arm and fingers twitching for the keyboard...

I think I need a major vacation soon, somewhere with no IT-devices whatsoever.

a.c.

Re:Sounds useful. (1)

tzot (834456) | more than 8 years ago | (#14205417)

I can't even find my keys some days.
Really? [obnoxiousfumes.com]

Simplify your data set (0, Funny)

Anonymous Coward | more than 8 years ago | (#14204990)

A good strategy for finding useful information in oceans of data is to reduce the data set by sloughing off large chunks of irrelevant data. For example, if you want to find useful info in /., you would want to start by excluding all stories submitted by Roland Piquepaille.

Why is technology starting to... (0, Flamebait)

ImaDikWeed (936992) | more than 8 years ago | (#14204998)

scare me?

I see all of these IT breakthroughs and it's just giving me the creeps!

Re:Why is technology starting to... (1)

sonofagunn (659927) | more than 8 years ago | (#14205139)

It means you're getting old!

Re:Why is technology starting to... (1)

ImaDikWeed (936992) | more than 8 years ago | (#14205290)

It means you're getting old!

You're right, of course. I am getting old. I guess what I think of "civil rights" is an antiquated idea after all.

The Real Challenge is Further Off (3, Funny)

AthenianGadfly (798721) | more than 8 years ago | (#14205000)

"But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people."

When asked about more advanced applications for the technology, researchers replied it will probably be "quite a while" before the technology could be used for extremely high noise environments. Said one, "I mean, it's going to be a long time before we're up to finding useful comments on Slashdot or something."

Ah-ha! (0)

paulsgre (890463) | more than 8 years ago | (#14205003)

2. ?? ----- New statistical techniques 3. Profit!

Numb3rs (1, Funny)

vanyel (28049) | more than 8 years ago | (#14205014)

Sounds like they've been watching Numb3rs ;-)

Re:Numb3rs (4, Funny)

Shadow Wrought (586631) | more than 8 years ago | (#14205079)

A favorite quote, "Physicists see equations as a reflection of reality, Engineers see reality as a reflection of equations; Mathematicians have never made the connection."

use this on slashdot (0, Redundant)

vineet000 (936983) | more than 8 years ago | (#14205022)

Could surely use this technology on slashdot to find those very few intelligent &/| funny comments.

Now that's a change... (2, Funny)

Havenwar (867124) | more than 8 years ago | (#14205025)

The Case team discovered a technique that is built on the principle of comparing a set of summary characteristics for any sub region of the observations with the background variation. From these characteristics, attempts are made to find small regions that appear significantly different from the background--a difference that cannot simply be attributed to random chance

So, basically it's the one search engine that can only find the words "horny teen nekkid" if it is NOT on a pr0n-page. I can see uses for that. Not for me, but I'm sure SOMEONE is interested in finding other kinds of pages once in a while.

Re:Now that's a change... (0)

Anonymous Coward | more than 8 years ago | (#14205069)

>Not for me, but I'm sure SOMEONE is interested in finding other kinds of pages once in a w

You'd need this statistical analysis to find that someone. ;-)

Re:Now that's a change... (1)

Havenwar (867124) | more than 8 years ago | (#14205213)

So the first job of this marvelous search engine statistical method... is to find the one person in a bunch of perverts that would appreciate the results.

jack thompson.

see, didn't need a search engine. and why pray tell would he want to find the one page that said "nekkid teen sexy" and wasn't a pr0n page? Oh to press charges of course. Pages like that corrupt our young! Never mind the real pr0n, thats so.. out there. It's the one page that mentions a single naughty word thats in for the trouble!

PDF Warning? (1)

Anonymous Coward | more than 8 years ago | (#14205031)

Why do we need to be warned that it's a PDF? I can understand an "MS Word Warning" but PDF is platform independent. What's wrong with PDF?

Re:PDF Warning? (1)

tzot (834456) | more than 8 years ago | (#14205472)

PDF documents are not handled directly by your browser.

Stay tuned... (1)

Donald Darko (936706) | more than 8 years ago | (#14205033)

Stay tuned for the upcoming Google release of this technology (Beta of course) - To be followed by MSN and Time Warner's combining powers in an attempt to compete.

Sorry (0, Redundant)

plams (744927) | more than 8 years ago | (#14205037)

I'd use it to find some decent pr0n among the oceans of crap out there. But seriously, 'searching' is a very interesting subject and a lot deeper than most people realize. Try to explain how Google works to a non-geek.

Re:Sorry (0)

Anonymous Coward | more than 8 years ago | (#14205159)

Proposed pseudocode:

select *.jpg from *@net.world
where image_descr like '%asian%teen%'
      or image_descr like '%anal%porn stars%'
      or image_descr like '%money shot%';

9...9...9...9... (1)

r3adah3ad (936993) | more than 8 years ago | (#14205055)

"...a difference that cannot simply be attributed to random chance..." If it's random, how do you know?

Re:9...9...9...9... (1)

chill (34294) | more than 8 years ago | (#14205106)

Random has NO pattern whatsoever. Detecting a pattern, however small, implies non-random data. QED

  -Charles

Re:9...9...9...9... (2, Insightful)

Stonehand (71085) | more than 8 years ago | (#14205292)

Not really.

The more you constrain your allegedly random process, such as by insisting that it produce output without "patterns" -- whatever those are -- the less random it actually is.

To put it in more concrete terms, which is more random -- a coin which flips 50-50 heads/tails with no other constraints whatsoever, or a coin which flips 50-50 but will never, say, flip 100 heads in a row, and will never exactly alternate, and will never produce the bit sequence corresponding to the ASCII encoding of the text of Rissanen's first paper on MML, and... ?

What the OP might want to look into is the notion of uncompressability, and perhaps Kolmogorov complexity. Of course, the latter is incomputable, but that's life.
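For anyone who wants to poke at the incompressibility idea: Kolmogorov complexity itself can't be computed, but an off-the-shelf compressor gives a crude upper bound on it. A minimal sketch, assuming zlib as the stand-in compressor (my choice, nothing to do with the paper); the function name is made up for illustration.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more regular)."""
    return len(zlib.compress(data, 9)) / len(data)


# An exactly alternating sequence of "flips" versus bytes from the OS random source.
patterned = b"HT" * 5000
random_bytes = os.urandom(10000)

print(f"patterned: {compression_ratio(patterned):.3f}")
print(f"random:    {compression_ratio(random_bytes):.3f}")
```

The alternating sequence shrinks to a tiny fraction of its size, while the random bytes barely compress at all. That only says the compressor found no regularity it knows how to exploit, not that none exists.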

Re:9...9...9...9... (1)

CoolVibe (11466) | more than 8 years ago | (#14205390)

If you have an infinite amount of random data, every pattern will be in there somewhere. At least, that's what I was led to believe.

Re:9...9...9...9... (1)

chill (34294) | more than 8 years ago | (#14205679)

If you have an infinite amount of random data, every pattern will be in there somewhere. At least, that's what I was led to believe.

Yes, but only if you look at smaller segments, which changes your dataset. For example, if you spot the first 30 digits of Pi in an infinitely random set, the question becomes is your random set Pi? If not, the pattern only applies to those 30 digits and thus your set changes and is no longer the infinite set of random data.

And they aren't dealing with an "infinite" set, but a smaller subset. Thus, the odds of finding the collected works of Shakespeare are significantly smaller.

  -Charles

Re:9...9...9...9... (1)

vux984 (928602) | more than 8 years ago | (#14205698)

What about infinitely long patterns?

Re:9...9...9...9... (4, Insightful)

flynt (248848) | more than 8 years ago | (#14205117)

Whether you "know" or not is always up for debate, but that's usually for epistemology class. In classical hypothesis testing in statistics, you make a distributional assumption about your data, and then calculate a probability from the data you observed (the p-value) given your initial assumption. If this probability is very low (also an interpretation), you assume your initial distributional assumption was incorrect. There are finer points to it of course, but classical hypothesis testing in statistics is pretty much a reductio ad absurdum in logic.
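A minimal sketch of that recipe, assuming the distributional assumption is "the coin is fair" and the observed data are 60 heads in 100 flips (both made up for illustration). The helper name is hypothetical, and 0.05 is just the conventional cutoff.

```python
from math import comb


def binomial_two_sided_p(heads: int, flips: int, p: float = 0.5) -> float:
    """Exact two-sided p-value: total probability of outcomes no more likely than the observed one."""
    def pmf(k: int) -> float:
        return comb(flips, k) * p**k * (1 - p) ** (flips - k)

    observed = pmf(heads)
    # Add up every outcome that is at most as probable as what we saw
    # (small tolerance so symmetric ties are not lost to rounding).
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))


p_value = binomial_two_sided_p(heads=60, flips=100)
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("reject the fair-coin assumption")
else:
    print("no evidence against the fair-coin assumption")
```

With these numbers the p-value lands a little above 0.05, so the fair-coin assumption survives. Where to draw the line for "very low" is the interpretation part.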

Re:9...9...9...9... (0)

Anonymous Coward | more than 8 years ago | (#14205405)

You can't. I've only read the original paper, it's disturbingly 'heavy' to find on Slashdot but here's my take.
It's something between Shannon's original work and Linear prediction. This is exploding the dimensionality into a 'geometric' signal. They use a metric called the 'Likelihood Ratio Test' which compares an extracted feature against signals believed to be in the background set. (standard fare for signals work - but with a higher dimension) The nub of it all can be found in paraphrasing Shannon - you'll find whatever you're looking for eventually. ie all signal communication depends on an a priori understanding of the production model in the mind of the observer, or psychologically - everything that isn't reCOGNISABLE is noise. What makes it recognisable is the assumptions you already have. But even the assumption that platonic geometric components of a signal constitute 'intelligence' or 'intent' or even point to a structured underlying mechanism is not sound. The application is best used to quickly scan large volumes of data to find interesting areas that might be worthy of further investigation, it's not a feature extraction method so much as a novelty-o-meter.
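To make the "novelty-o-meter" idea concrete, here is a rough sketch of scanning for a region that stands out from the background. It assumes one-dimensional data, Gaussian noise, and a plain z-score of each window mean as the test statistic. The paper's actual likelihood ratio statistic and its higher-dimensional geometry are not reproduced here, and all names and numbers are invented for illustration.

```python
import random
import statistics


def most_unusual_window(values, width):
    """Return (start, z) for the window whose mean deviates most from the background."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    best_start, best_z = 0, 0.0
    for start in range(len(values) - width + 1):
        window_mean = statistics.fmean(values[start:start + width])
        # Standard error of a mean of `width` independent background samples.
        z = (window_mean - mu) / (sigma / width ** 0.5)
        if abs(z) > abs(best_z):
            best_start, best_z = start, z
    return best_start, best_z


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
for i in range(1200, 1220):   # bury a bump in 20 of the 2000 samples
    noise[i] += 2.0

start, z = most_unusual_window(noise, width=20)
print(start, round(z, 2))
```

Because the scan reports the largest of many window scores, the bar for calling it significant has to be higher than for a single test.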

Case Western Reserve University (3, Interesting)

tomzyk (158497) | more than 8 years ago | (#14205063)

FYI: Its abbreviation is not "CWRU" anymore. As of about 2 years ago, they changed it to simply "Case" and gave it the silly new logo of 2 paperclips stuck together.

Why? I have no idea. Some "university branding" thing that some people thought was important to the growth of the campus or something. Apparently it ticked a bunch of alumni (from the original Western Reserve University) too.

Knowing is half the battle.

Re:Case Western Reserve University (1)

Manhigh (148034) | more than 8 years ago | (#14205099)

The name of the school is still Case Western Reserve University.

Despite the fact that it's OK to officially call it 'Case' now (it wasn't OK to do so in '97), CWRU is still a valid abbreviation. Plus I paid so much money to that place that I'll call it whatever I damn well please.

- '02

Re:Case Western Reserve University (2, Funny)

Anonymous Coward | more than 8 years ago | (#14205112)

Actually, it's not two paper clips together. It's a fat man holding a surf board. Look for yourself [case.edu]

Re:Case Western Reserve University (0)

Anonymous Coward | more than 8 years ago | (#14205491)

Forget the logo and the namechange, the girls are still as ugly as they always were.

Re:Case Western Reserve University (1)

ThosLives (686517) | more than 8 years ago | (#14205566)

I have to say, I'm glad that my alma mater (Case School of Engineering, 2000) is actually still doing real science. I'm kind of disappointed at all the folks above who posted about "finding useful information in the noise of internet information" though; that type of information gathering is not quite the same as discerning between special-cause and random-cause fluctuations in a signal (mostly because the Internet consists mostly of special-cause variation: i.e., things people have written or created). Distinguishing between two different pieces of non-random data is vastly different than picking up non-random from random.

Incidentally, I don't mind the switch to Case from CWRU (you would not believe how many people asked if Case Western Reserve University was a military school - I guess they forgot to teach people about the Western Reserve Territory in elementary school). The name change is nothing compared to the Peter B. Lewis building...

Speaking of needle in a haystack ... (5, Funny)

airrage (514164) | more than 8 years ago | (#14205081)

Someone asked me to give ten different ways to find a needle in a haystack, these are my thoughts:

1) INDUSTRIAL MAGNENT
2) BLIND LUCK
3) BURN THE HAY, PICK UP THE NEEDLE
4) STATISTICAL ANALYSIS (SINCE NEEDLES IN HAYSTACKS ARE NOT PLACED AT RANDOM, THEY ARE SUBJECT TO REGRESSION ANALYSIS)
5) OFFSHORE TO CHINA WHERE LABOR IS CHEAPER, SEARCH THE HAY WITH 10,000 WORKERS.
6) WAIT YEARS UNTIL THE HAY DECAYS, PICK UP THE NEEDLE
7) SPREAD OUT THE HAY, HIRE BAREFOOT HAY WALKERS
8) TAKE ALL THE HAY, PUT IN A POOL OF WATER - HAY WILL FLOAT, AND NEEDLE WILL SINK
9) LET COWS EAT THE HAY, X-RAY ALL THE COWS!
10) TRIAL AND ERROR - ONE PERSON

Mythbusters (2, Informative)

everphilski (877346) | more than 8 years ago | (#14205153)

Mythbusters actually did an ep where they built two different needle-in-haystack finding machines, one actually did quite well...

-everphilski-

Re:Mythbusters (1)

Tony Hoyle (11698) | more than 8 years ago | (#14205372)

Their solutions were kinda destructive though.

I'd like to see a way of finding a needle in a haystack that left you with a (largely) intact haystack afterwards, not a pile of ash or a wet sludge.

Huge inductive coils would be a good start... probably wouldn't find the bone one though - maybe some kind of MRI?

Re:1) INDUSTRIAL MAGNENT (2, Funny)

Anne_Nonymous (313852) | more than 8 years ago | (#14205211)

1) INDUSTRIAL MAGNET

DBAs everywhere are cringing and covering their data.

Re:Speaking of needle in a haystack ... (1)

yapplejax (931268) | more than 8 years ago | (#14205492)

The needle won't necessarily sink. I recall experiments floating a needle on top of water because it wasn't heavy enough to break the surface tension.

Re:Speaking of needle in a haystack ... (0)

Anonymous Coward | more than 8 years ago | (#14205667)

Introduce the hay on the bottom. No surface tension.

Re:Speaking of needle in a haystack ... (1)

pherthyl (445706) | more than 8 years ago | (#14205574)

I've got another one..

11) LET COWS EAT THE HAY, DISSECT DEAD COW

lameness filter blah

To find a signal in a sea of noise... (1)

San Francisco (936406) | more than 8 years ago | (#14205095)

Perhaps this technology can make Usenet useful once again.

Roland is still around? (0)

Anonymous Coward | more than 8 years ago | (#14205108)

At least he links to the original articles now. However, it still seems like he is trying to better his pagerank (just like that Beatles-Beatles guy). This doesn't irritate me as much as what he used to do, but I still think it's pretty lame. I for one applaud all the people who submit stories without the need to link to their personal or business websites (aka "slashvertisements"). We need more people like them.

Maybe Slashdot can use it to find dupes (1)

kk49 (829669) | more than 8 years ago | (#14205119)

End
Of
Message

Maybe Slashdot can use it to find dupes (1)

Bloggins (783115) | more than 8 years ago | (#14205148)

End of Message

Re:Maybe Slashdot can use it to find dupes (1)

UMEngin (895769) | more than 8 years ago | (#14205179)

That would be like finding the hay in the haystack.

Profit? (0)

Anonymous Coward | more than 8 years ago | (#14205123)

1) Decide what signal you want
2) Generate a large enough random dataset that it is inevitable your desired signal appears
3) Profit!

But will it help me... (-1, Flamebait)

Anonymous Coward | more than 8 years ago | (#14205128)

... find my girlfriend's clit? Don't laugh. If you ever saw how unkempt her mass of pubic hair was, you'd realize this is a canonical "needle in haystack" problem.

Re:But will it help me... (1)

stinerman (812158) | more than 8 years ago | (#14205178)

I know the warranty will be void if you shave off the pubic hair yourself (intentional damage to the product), but you might want to try it anyway. Buy the hairless variety next time and you should be in good shape.

Re:But will it help me... (0)

Anonymous Coward | more than 8 years ago | (#14205214)

Believe me, I have.

Re:But will it help me... (0)

Anonymous Coward | more than 8 years ago | (#14205332)

Mmmmmmmmm bush.... better than stubble...

Finding a Needle in a Haystack is easy.... (0)

Anonymous Coward | more than 8 years ago | (#14205164)

SETI? (1)

ruiner13 (527499) | more than 8 years ago | (#14205199)

Would this be useful to reduce the computations needed for the SETI@Home folks too? Seems they have a bit of data to sort through... Hell, genetic engineering too. Look for useful patterns in hundreds of DNA strands.

Mythbusters did this... (1)

slashname3 (739398) | more than 8 years ago | (#14205304)

Mythbusters did this one already. They built two machines/processes to find needles in haystacks. One used a process to burn away the hay leaving the needles and the other used magnets and gravity to separate the needles from the hay.

Oh, wait. They're talking about data. Never mind.

Huh? Piquepaille? (0)

Anonymous Coward | more than 8 years ago | (#14205311)

This looks like something related to straw... weird, huh?

We are at the horizon of a cultural singularity... (1)

Errandboy of Doom (917941) | more than 8 years ago | (#14205324)

THE SINGULARITY

Throughout history, we championed the content creator. Only a tiny fraction of the population could write or understand math or science. Only a tiny fraction could dedicate themselves to the arts.

Most individuals' time was consumed by being agrarian generalists: they owned a farm, and they were constantly occupied by all the repairs and maintenance of their property. It wasn't a job, it was a way of life. But now, more and more, our economy makes us all incredible specialists. We're confined not only to a literal cubicle, but to a cubicle of tasks, often only seeing one tiny part of our contribution to social welfare. But as a result, we end up with leisure time. (Cf. Judge Skelly Wright's opinion in Javins v. First National Realty Corporation). While those reading /. while at work might quibble, the fact is that we all now have meaningful leisure time in some sense, we're not dedicated 100% to our livelihood.

In addition, current technology is allowing us to collaborate and share information as a global community like it never has before.

What does all this mean? For one, it means that techies can have bands [jonathancoultonblog.com], and even get national coverage [npr.org], without giving up their day jobs. In fact, if MySpace is any evidence [wired.com], anyone can have a band... and a lot of us already do. Also, given that 80,000 blogs are created each day [technorati.com] (though 40,000 are probably also abandoned each day), huge throngs of people have something to say and are able to say it to huge, unrelated throngs of people.

The singularity is similar to the way other areas of economics have evolved. It used to be that 90% of the population made 100% of the food, and now only 10% of the population provides 100% of the food. It's the opposite for art and science (naturally, as we're freed from producing necessities, we can devote more time to producing luxuries, improving general quality of life, and solving more complex problems). Traditionally, 1% of the population made all the cultural content. The singularity? Soon, 99% of the population will be making 100% of the content.

For the first time in history, we are the captains not only of our personal destiny, but of our cultural destiny. However, as cultural creativity becomes so democratized, our contribution will become less and less controlling. Like Warhol said, it's not that we're all going to be famous, it's that we each only get 15 minutes.

THE DOWNSIDE OF A CULTURE OF CREATIVES, AND A SILVER LINING FOR SEARCH

A professor once said to me, "No one cares how much you know anymore, that's why we have the Internet. The important thing is creating new ideas." The formidable aspect of the new society of cultural creatives is that soon, no one will really need you to create ideas anymore either. Your drop in the cultural bucket is less and less meaningful every day. Content is easier and easier to make and share, and everyone wants to play, so as a corollary, it will become harder and harder to find compensation as a cultural creative.

So what's the new valuable thing, in this storm of data/content? Maybe not making worthwhile contributions to the arts, science, knowledge, (which is important, but self sustaining). However, finding the worthwhile signal amidst all cultural noise is becoming more and more valuable. Someone needs to be a sieve for all the content being thrown around right now. Technologies of search and sort are the ways to do it. Google is not prospering because it learned something about advertising. Google is prospering because it precociously encapsulates the spirit of the dawning age, while most of us are still trying to figure out just what the hell I'm talking about.

Significant % of patterns in randomness (2, Informative)

G4from128k (686170) | more than 8 years ago | (#14205350)

Looking for possible patterns in large volumes of data is dangerous because of the high chance that random data will fit some of the myriad patterns tried. If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical significance level of 5% (even if the data is 100% random). "Cancer clusters" are an excellent example of this -- if you slice and dice a population enough different ways you are bound to find some geographic/demographic/ethnographic subgroup with a very high chance of some cancer.


It's better to have an a priori hypothesis and look for one specific, pre-defined pattern in a mound of data than to see if any pattern is in the data. Or, if one insists on looking for many patterns, then the standards for statistical significance must be correspondingly higher.
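The "about 50 out of a thousand" figure is easy to check with a quick simulation. This is a sketch under one assumption: each test of pure noise is modelled as a single uniform draw, because a p-value is uniformly distributed when the null hypothesis is true. The seed and variable names are arbitrary.

```python
import random

rng = random.Random(42)
n_tests = 1000
alpha = 0.05

# Under the null hypothesis a p-value is uniform on [0, 1], so each test of
# pure noise is modelled here as one uniform draw.
p_values = [rng.random() for _ in range(n_tests)]

false_positives = sum(p < alpha for p in p_values)
print(f"expected about {n_tests * alpha:.0f}, observed {false_positives}")
```

Every one of those hits is a "pattern" found in data that contains none.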

Re:Significant % of patterns in randomness (5, Informative)

zex (214881) | more than 8 years ago | (#14205567)

If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical level of 5% (even if the data is 100% random).


If you're not correcting for multiple hypothesis testing, you are correct. If you do have 100% random data that holds to perfect randomness at all scales (which I'm not sure is even possible) and correct for multiple hypothesis testing, then you'll find exactly what you "should" find: no significant pattern.
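One simple way to do that correction is Bonferroni: divide the significance level by the number of tests. A minimal sketch, assuming Bonferroni as the method (the comment above doesn't say which correction is meant) and using made-up p-values and a hypothetical function name.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag each test whose p-value clears the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Made-up p-values for illustration; with 5 tests the threshold is 0.05 / 5 = 0.01.
p_values = [0.00001, 0.004, 0.03, 0.2, 0.7]
print(bonferroni_reject(p_values))
```

Note that 0.03 would pass an uncorrected 5% test but fails here. Bonferroni is conservative, so it trades missed real effects for fewer false alarms.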

You mention "Cancer clusters" as an example of attribution of significance to insignificant findings. However, these clusters are often found (at least in the genetics research realm) by hierarchical clustering, which is self-correcting for multiple hypothesis testing. If you're speaking of demographic surveys which find that (e.g.) "black females in Tahiti who were exposed to .... are more susceptible to brain cancer", then you're probably right. I too see those as examples of restricting the domain of samples until you find a pattern - but the pattern nonetheless exists.

Re:Significant % of patterns in randomness (1)

hackstraw (262471) | more than 8 years ago | (#14205604)

Looking for possible patterns in large volumes of data is dangerous because of the high chance that random data will fit some of the myriad patterns tried.

No, God put the figure of Jesus in the sky, but made it not look too much like Jesus just to test the difference between the believers and non-believers. Trust me, it was not easy to do all that with nobody looking.

SETI? (1)

Nom du Keyboard (633989) | more than 8 years ago | (#14205358)

SETI?

Regarding fraudulent transactions... (2, Interesting)

ahmusch (777177) | more than 8 years ago | (#14205359)

Current fraud detection systems in use in the financial industry are based on two primary knowledge bases:

1. A knowledge of your purchasing pattern as a consumer. To wit, having a statistically significant sample of what are valid transactions as well as knowing your credit score and income.

Do you shop at high-end stores? Do you use your card for primarily travel and entertainment? Do you use your card for everyday purchases? How much of your line-of-credit do you tend to use?

2. A comparison of recent transactions. For example:

A sudden wave of big-ticket purchases very close together in time, such as hitting a Best Buy the same day as buying jewelry.

A single card making multiple high-value transactions (3 or more) within an hour.

A pattern of unattended-auth transaction (think pay-at-the-pump) to big-ticket purchase to unattended-auth and back.
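As a rough illustration of the "3 or more high-value transactions within an hour" rule, here is a minimal sketch. The function name, the 500.00 cutoff for "high-value", and the (timestamp, amount) tuple layout are all made up for the example; real systems score against a per-customer profile rather than a fixed threshold.

```python
from datetime import datetime, timedelta

def flags_velocity(transactions, min_amount=500.0, min_count=3,
                   window=timedelta(hours=1)):
    """Return True if at least min_count transactions of min_amount or
    more fall within any span of length `window`.

    `transactions` is an iterable of (timestamp, amount) pairs; the
    layout and the thresholds are assumptions of this sketch.
    """
    # Keep only the high-value transactions, ordered by time.
    times = sorted(ts for ts, amount in transactions if amount >= min_amount)
    # Slide over each run of min_count consecutive high-value transactions.
    for i in range(len(times) - min_count + 1):
        if times[i + min_count - 1] - times[i] <= window:
            return True
    return False

start = datetime(2005, 12, 7, 14, 0)
burst = [(start, 900.0),
         (start + timedelta(minutes=20), 1200.0),
         (start + timedelta(minutes=45), 650.0)]
print(flags_velocity(burst))
```

A rule like this only covers the "recent transactions" half of the picture; the profile-based scoring would sit alongside it.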

Using geometric statistical analysis could only complement pattern analysis in any case, and I fail to see how it's superior to the existing behavior scoring algorithms which are based on an individual's past history, weighting each new transaction to determine if it's "out of profile", and if so, by what margin. Sometimes the fraud is only revealed by several transactions scoring progressively higher on the fraud-o-meter, and I suspect the geometric statistical analysis would fail to trigger that as an event, as it would be a continuation of the pattern.

My ability to read statistics papers is sadly out of date. Anyone want to give a shot at translating this into non-doctoral English?

Novelty detection algorithms are not novel (0)

Anonymous Coward | more than 8 years ago | (#14205685)

This paper basically describes an improvement of existing, simple and classical methods to handle the case where it is hard to formulate a model for the background noise (or "normal" behaviour in the case of a fraud detection system), or where it is hard to estimate a model of the background noise because data points are expensive to collect. This method might be useful in the case of particle physics problems - where gathering data is costly, where measurements are very noisy, or where it is impossible to have prior knowledge about the background noise / the detected signal.

However, for most of the common applications, the simple, classical method (Likelihood ratio tests) works perfectly!

Likelihood ratio tests (the standard thing) work as follows. Compute:
ô1: the set of parameters of a noise/perturbations model that best fits the data.
ô2: the set of parameters of a noise+signal model that best fits the data.
L(ô1|x) = how well your data is explained by the best fitting noise/perturbations model.
L(ô2|x) = how well your data is explained by the best fitting perturbation + signal model.
If L(ô1|x)/L(ô2|x) < threshold, there is a signal; and some of its properties can be inferred from the set of parameters ô2.
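For the curious, the recipe above can be sketched in a few lines. This is a toy setup chosen for illustration, not the paper's method: the noise model is Gaussian with mean 0 and known sigma (so it has no free parameters), the noise+signal model lets the mean float, and the cut of 3.841 is the usual 95% chi-squared value for one degree of freedom. Working with -2 log of the ratio is equivalent to asking whether L(ô1|x)/L(ô2|x) falls below a threshold.

```python
import math
import random

def log_likelihood(data, mean, sigma=1.0):
    """Gaussian log-likelihood of the data for a given mean and sigma."""
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const - (x - mean) ** 2 / (2 * sigma ** 2) for x in data)

def likelihood_ratio_statistic(data, sigma=1.0):
    """Return -2 * log( L(noise only) / L(noise + signal) ).

    Noise-only model: mean fixed at 0 (assumption of this sketch).
    Noise+signal model: mean fitted to the data (its MLE is the sample mean).
    """
    ll_noise = log_likelihood(data, 0.0, sigma)
    mean_hat = sum(data) / len(data)
    ll_signal = log_likelihood(data, mean_hat, sigma)
    return -2.0 * (ll_noise - ll_signal)

# 95% point of a chi-squared distribution with one degree of freedom.
CHI2_1DOF_95 = 3.841

rng = random.Random(1)
noise_only = [rng.gauss(0.0, 1.0) for _ in range(200)]
with_signal = [rng.gauss(0.5, 1.0) for _ in range(200)]

for name, data in (("noise only", noise_only), ("with signal", with_signal)):
    stat = likelihood_ratio_statistic(data)
    print(f"{name}: statistic = {stat:.2f}, signal claimed = {stat > CHI2_1DOF_95}")
```

A large statistic means the noise-only model explains the data much worse than the noise+signal model, which is the same thing as the ratio in the parent text being small.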

grep (0)

Anonymous Coward | more than 8 years ago | (#14205380)

This is why we have grep.

Hey, wait a minute! (3, Funny)

$RANDOMLUSER (804576) | more than 8 years ago | (#14205434)

An article posted by Roland Piquepaille with no links back to his site???
WTF? Roland? You feeling OK?

plagiarism (1)

Edie O'Teditor (805662) | more than 8 years ago | (#14205465)

Finding useful information in oceans of data is an increasingly complex problem in many scientific areas
It's even harder finding anything useful among your shit postings, Strawcock. Or rather anything useful and original.

Hope there are no Jedis around (1)

LostBurner (916484) | more than 8 years ago | (#14205471)

"This is not the signal you're looking for..." Hope the signals they're looking for don't come accompanied with Jedis. Or maybe all the chaos is because of Jedi obfuscation of real signals?

I wonder if they were inspired by... (1)

sciscitor (798043) | more than 8 years ago | (#14205601)

The paper by Jeremy Stribling, Daniel Aguayo and Maxwell Krohn:

Rooter: A Methodology for the Typical Unification of Access Points and Redundancy

Definitely some interesting parallels between the two. Maybe someone who understands this stuff better could elaborate?

When the kernel is playing dumb.. play along! (1)

Outsomniac (930516) | more than 8 years ago | (#14205636)

Here... it's like this.. are ya ready?: just pick a chunk of data and call it useful! It's no different than say, ummm, naming a name, which are then sold... worshipped.. reported... entire charity fundraisers set up for... and even typed about on fine forums like this one! See the post below about cyber terror that's selling the credibility... they'll show ya how 'the system' works!