JWSmythe's Journal: Finding repeated phrases in MySQL text fields

Being that Slashdot is the biggest audience of computer geeks that I know, this should be the right place to ask a question that stumps me. :)

    Some of you know that I am the owner/publisher/programmer of freeinternetpress.com. I was playing with the "tag cloud" idea, but it doesn't quite satisfy what I want.

    I wrote a script that looks at the 100 most recent news stories, pulls all of the words from the text and subject, splits it on spaces (and other delimiting characters), and gives me a nice list of words by frequency on the page. The rough equivalent from the command line would be:

tr -s '[:space:]' '\n' < story.txt | sort | uniq -c | sort -rn | head -20

    It then shows them in tag cloud format, sized by frequency. Each word is linked to a script that finds the most recent story with that word in it, and sends you directly to that.
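For anyone who wants to poke at this outside the shell, here is a minimal Python sketch of the same counting step plus the sizing. It is not the site's actual PHP/SQL. The function names, the delimiter regex, and the 10 to 32 pixel range are all assumptions made up for illustration.

```python
import re
from collections import Counter

def word_counts(text, top=20):
    """Split on anything that is not a letter, digit or apostrophe, then count."""
    words = [w for w in re.split(r"[^a-z0-9']+", text.lower()) if w]
    return Counter(words).most_common(top)

def font_sizes(counts, min_px=10, max_px=32):
    """Scale each count linearly into a font size (hypothetical pixel range)."""
    if not counts:
        return {}
    freqs = [n for _, n in counts]
    lo, hi = min(freqs), max(freqs)
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {w: round(min_px + (n - lo) * (max_px - min_px) / span)
            for w, n in counts}
```

The most frequent word gets `max_px`, the least frequent of the top list gets `min_px`, and everything else lands in between. A log scale may look better when one word dominates, but linear keeps the sketch short.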

    Mine is all done with SQL queries and a little array magic in PHP, not shell commands, I swear.

    What I can't quite figure out is, how do I do the same thing for phrases? If John Smith made the news, there may be plenty of people with the first name "John" making the news, so John may show up frequently. Smith may also show up with some sort of frequency (in an obscure world where there are only 4 common last names). But, if John Smith goes on a shooting rampage, it would be reasonable to think that "John Smith kills" would show up in multiple news stories. They may say "John Smith kills 14 in mystery rampage" or "John Smith kills coworkers at super spook spy shack". You never know what will come up, but it would be amazingly advantageous to have that phrase.

    While I can't think that we'll cover every breaking news story, I can think that the hundreds of RSS feeds that we're aggregating would. If this was applied to the RSS feeds, we would then have a beautiful resource. Think Google News automated and unfiltered. Yes, Google News filters their news, and does adjust what is shown based on who it thinks are "good" sources, and some big news simply doesn't show up.

    In thinking about this, I thought about the brute force method. Find every word, go back and find the word before it, and check that against the database. Go back and find the word after it, and check that against the database. Continue this for phrases of up to 5 words.

    On just our own 100 most recent stories, there are 19374 words. Of those, there are 6176 unique words. I run this against a "stopwords" table, so common words (like "and", "the", "or", "I", "he", "she", etc.) are dropped. We're using about 1000 stopwords. Even with this, there are 5676 unique words.
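One way to sketch the brute force idea without a database round trip per word is to slide a window of 2 to 5 words over each story and count how many stories each phrase shows up in. The Python below is only an illustration under stated assumptions: the function names are hypothetical, `STOPWORDS` is a tiny stand-in for the real table of about 1000 words, and the rule that a phrase may not start or end with a stopword is a guess at what would be useful, not something the site does.

```python
import re
from collections import Counter

# Tiny stand-in for the real stopwords table (assumption for illustration).
STOPWORDS = {"and", "the", "or", "i", "he", "she", "in", "at", "a", "of"}

def tokenize(text):
    """Lowercase, then split on anything that is not a letter, digit or apostrophe."""
    return [w for w in re.split(r"[^a-z0-9']+", text.lower()) if w]

def phrases(words, min_len=2, max_len=5):
    """Yield every run of min_len..max_len consecutive words.

    Runs that start or end with a stopword are skipped (an assumption),
    so "john smith kills" is kept but "smith kills 14 in" is not.
    """
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram)

def repeated_phrases(stories, min_stories=2):
    """Count how many stories contain each phrase; keep the repeated ones."""
    seen_in = Counter()
    for text in stories:
        # set() so a phrase counts once per story, however often it repeats
        seen_in.update(set(phrases(tokenize(text))))
    return [(p, n) for p, n in seen_in.most_common() if n >= min_stories]
```

Fed the two example headlines above, "John Smith kills 14 in mystery rampage" and "John Smith kills coworkers at super spook spy shack", this keeps "john smith", "smith kills" and "john smith kills", each seen in 2 stories. With 19374 words and four window sizes there are fewer than 80,000 candidate windows in total, so counting them in memory should be cheap compared with querying the database for each one.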

    Does anyone have any suggestions?

  • If you reduce the "key space" of the problem, does that make it more tractable for you?

    Having a standard mapping down to a key set of phrases
    eg massacre, mass shooting, going postal ==> gun rampage
    war, police action, interdiction ==> war

    Is this the type of concept you are after?

    I know such a list is out there somewhere (a reverse thesaurus?). Someone working in semantics/linguistics probably has such a list, because most of the work needs to be done up front.
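As a rough illustration of the mapping idea in the comment above, here is a minimal Python sketch. The table holds only the two example groups from the comment, and the names `CANONICAL` and `canonical_keys` are hypothetical. A real list would have to come from a thesaurus or be built by hand.

```python
import re

# Variant phrase -> canonical key, using only the examples from the comment.
CANONICAL = {
    "massacre": "gun rampage",
    "mass shooting": "gun rampage",
    "going postal": "gun rampage",
    "war": "war",
    "police action": "war",
    "interdiction": "war",
}

def canonical_keys(text):
    """Return the set of canonical keys whose variants appear in the text."""
    lowered = text.lower()
    found = set()
    for variant, key in CANONICAL.items():
        # \b keeps "war" from matching inside words such as "warning"
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.add(key)
    return found
```

Stories could then be grouped by key instead of by raw word, which shrinks the key space but only for topics somebody thought to list ahead of time.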

    • Well, that could help to some degree, but ... well ... probably not a lot.

      Some of our biggest stories (most readers) have been...

      The train crash in Los Angeles a few years ago.

      The Pope dying.

      and a normal human explanation of USC 2257.

      The next real breaking story won't ever be something we'd expect. If it was, I could be sitting there with a bag of cheezie poofs and my video camera waiting for it to happen. :)

      • by tqft ( 619476 )

        Saw this on planet.mozilla.org
        The skinny: Bayesian filtering of RSS feeds

        http://mesquilla.com/2009/03/15/automatically-determining-interesting-rss-feed-posts/ [mesquilla.com]
        "One of the interesting applications of automatic categorization of message items is the categorization of feed postings. Feed aggregations like Planet Mozilla often have many more posts than is convenient for most people to keep up with. How do you decide what to read, and what to skip?

        The bayesian classifier that is part of the Thunderbird and Seamo

        •     That sounds like a great idea! Thanks!

              I don't know if I can exactly use their code, but I'm sure I can find something that'll cooperate nicely. :) I'll post a new journal entry if it works. :)
