Slashdot: News for Nerds

Internet Searching Using Regular Expressions?

Cliff posted more than 13 years ago | from the incomprehensible-search-but--patters-better-results dept.

The Internet 8

/[Aa]non([aiy]mo?u?s)? [KkCc]ow[ae]rd/ asks: "Remarkably few people have a working understanding of regular expressions. But those who do know how useful they can be for searching text. Has anyone out there seen a large search engine (like Google) that will take regular expressions for queries? How about a newsgroup search engine?" Aside from the fact that many regular expressions read like snippets of line noise, they are the best thing I've seen for searches, and they're a lot easier than -adding +alot +of +search -terms.

8 comments

Re:Performance? (1)

dougmc (70836) | more than 13 years ago | (#359920)

However, doing the preliminary match using the regexp would definitely be resource-prohibitive. In the above example, you would have to read the text of each file in to do the regexp. Not to mention the cost of keeping the text around.
You do it by pulling the words out of the regex and searching for those first (the words can be indexed -- the regex can't), then you apply the regex to the results to further narrow them down. This is how glimpse/webglimpse (www.webglimpse.net) works.

However, this all falls apart when the keywords are all very common, because nearly every page everywhere will contain them, and so the actual regular expression search will have to search thousands of pages. But for uncommon words, it works fairly well.
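
A minimal sketch of that two-stage idea, assuming a tiny in-memory word index rather than anything glimpse actually does. The helper names (literal_words, build_index, regex_search) are made up for illustration, and the keyword extraction is a crude heuristic that only suits simple patterns without alternation or optional letters:

```python
import re
from collections import defaultdict

def literal_words(pattern):
    """Heuristic: drop character classes and escapes, keep runs of 3+ letters."""
    stripped = re.sub(r"\[[^\]]*\]|\\.", " ", pattern)
    return [w.lower() for w in re.findall(r"[A-Za-z]{3,}", stripped)]

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def regex_search(pattern, docs, index):
    """Stage 1: index lookup on the literal words. Stage 2: real regex on survivors."""
    words = literal_words(pattern)
    if words:
        candidates = set.intersection(*(index.get(w, set()) for w in words))
    else:
        candidates = set(docs)  # no usable keywords: every document must be scanned
    rx = re.compile(pattern, re.DOTALL)
    return sorted(d for d in candidates if rx.search(docs[d]))

docs = {
    "a": "somewhere over the rainbow, way up there",
    "b": "there is something over somewhere",
    "c": "over there",
}
index = build_index(docs)
hits = regex_search(r"somewhere.*over.*there", docs, index)
print(hits)
```

Here the index narrows three documents to two candidates ("a" and "b"), and the regex then rejects "b" because its words are in the wrong order. With very common keywords the candidate set would barely shrink, which is the failure mode described above.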

glimpse does do this, but it has problems with memory usage and being slow at times -- probably exactly because it does this.

Sorry, but I don't know of any full-Internet search engine that allows this. Your best bet is probably to write something that extracts the keywords from a regex, feeds them to Google, downloads every page that matches, and then runs the regex on your own computer to further narrow down the results. Depending on how common your keywords are, it may work well, or it may try to download half the Internet.

Re:FP RegEx (1)

Doctor Dark (87531) | more than 13 years ago | (#359921)

Try some of these... http://www.leidenuniv.nl/ub/biv/specials.htm

Glimpse / Webglimpse (1)

polymath69 (94161) | more than 13 years ago | (#359922)

Some sites are indexed for quick searching using Webglimpse [webglimpse.net], which supports modified RE searching. There's a short list of such sites here [webglimpse.org].

Unfortunately I don't know of any Web-wide RE-capable database. Here's hoping someone downthread does...

Re:Change the Storage Method [Was: Re:Performance?] (1)

fwc (168330) | more than 13 years ago | (#359923)

I should have been more specific. This method would work well in an example like the one above, but what if you are talking about a search something like:

[JjFf][ae][nb]\s*[0-9]{1,2},{0,1}\s*[0-9]{2,4}

This gets really messy really fast and exactly *HOW* are you going to do a query on a keyed database using this?
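
To make the objection concrete, here is a small sketch (sample strings invented for illustration) showing what that pattern accepts and why a word index has nothing to look up for it:

```python
import re

date_rx = re.compile(r"[JjFf][ae][nb]\s*[0-9]{1,2},{0,1}\s*[0-9]{2,4}")

# Hypothetical sample strings, not taken from any real page.
samples = ["Jan 5, 2001", "feb14 99", "January 5, 2001", "Fan 3 1999", "Mar 5, 2001"]
matched = [s for s in samples if date_rx.search(s)]
print(matched)

# Strip character classes, escapes, and {m,n} counts, then look for plain words.
stripped = re.sub(r"\[[^\]]*\]|\\.|\{[^}]*\}", " ", date_rx.pattern)
keywords = re.findall(r"[A-Za-z]{3,}", stripped)
print(keywords)
```

The pattern is built entirely from character classes, so no fixed word survives the stripping step and the keyword list comes back empty. It also accepts strings such as "Fan 3 1999", since the three classes are matched independently of each other.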

Change the Storage Method [Was: Re:Performance?] (1)

SagSaw (219314) | more than 13 years ago | (#359924)

Here is how I would try to do this:

  1. Get an insanely fast (and probably expensive) database.
  2. In this database, I would store the location of every word in each document (e.g., 'somewhere' is words 17 and 87, 'over' is word 45, 'there' is words 98, 109, and 1).
  3. When somebody searches for 'somewhere*over*there', I would query the database for a list of the locations of 'somewhere', 'over', and 'there', sorted by the document they originated from.
  4. Now it should be easy to find the documents which have the words in the right order. Simply look for documents where the location of 'somewhere' is less than the location of 'over', which is less than the location of 'there'.

    This would probably require a very fast, large database to accomplish, but it could be done.
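
The steps above can be sketched with an in-memory dictionary standing in for the fast database. This is only an illustration under that assumption; the function names and sample documents are invented, and a real system would keep the postings on disk:

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [positions]}, with positions in ascending order."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9']+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def docs_with_words_in_order(index, words):
    """Return ids of documents where the words occur in the given order (gaps allowed)."""
    postings = [index.get(w, {}) for w in words]
    if not postings:
        return []
    # Only documents containing every word can qualify.
    common = set(postings[0]).intersection(*postings[1:])
    hits = []
    for doc_id in common:
        last = -1
        for posting in postings:
            # Greedily take the earliest occurrence after the previous word.
            nxt = next((p for p in posting[doc_id] if p > last), None)
            if nxt is None:
                break
            last = nxt
        else:
            hits.append(doc_id)
    return sorted(hits)

docs = {
    "a": "somewhere over the rainbow, way up there",
    "b": "there is something over somewhere",
    "c": "over there",
    "d": "there, somewhere, over there",
}
index = build_positional_index(docs)
ordered = docs_with_words_in_order(index, ["somewhere", "over", "there"])
print(ordered)
```

Document "b" contains all three words but is rejected on position alone, without its text ever being re-read. Note that this handles whole words in sequence, which covers 'somewhere*over*there' but not arbitrary regular expressions.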

Re:Performance? (1)

Carl Drougge (222479) | more than 13 years ago | (#359925)

I don't know about the performance of it, but postgresql supports regexps, something like this as I recall:

SELECT whatever FROM somewhere WHERE something ~ 'somewhere.*over.*there';
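
For anyone who wants to try this kind of query without a PostgreSQL server, here is a rough stand-in using Python's built-in sqlite3 module with a user-defined REGEXP function. It is SQLite, not PostgreSQL, and the table and column names (pages, url, body) are invented for the example:

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): pattern first, then the value.
    return int(value is not None and re.search(pattern, value) is not None)

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE pages (url TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("a.html", "somewhere over the rainbow, way up there"),
        ("b.html", "there is something over somewhere"),
        ("c.html", "over there"),
    ],
)
rows = conn.execute(
    "SELECT url FROM pages WHERE body REGEXP ? ORDER BY url",
    ("somewhere.*over.*there",),
).fetchall()
print(rows)
conn.close()
```

In this sketch the regex function is called once per row, so the database still reads every body. That is fine for three rows and is exactly the performance question raised in this thread for anything Internet-sized.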

Performance? (2)

fwc (168330) | more than 13 years ago | (#359926)

I'm trying to envision how you could do this in a reasonably fast manner on a very large database. From what I know about regular expressions, they can be VERY CPU-intensive even on a small file. For instance, how would you match:

somewhere.*over.*there

across an entire Internet-sized search engine? I guess you could pre-select documents containing "somewhere" and "over" and "there" and then proceed with screening them through a "standard" regexp search.

However, doing the preliminary match using the regexp would definitely be resource-prohibitive. In the above example, you would have to read the text of each file in to do the regexp. Not to mention the cost of keeping the text around.

That said, I can see how you could implement a regexp-like front end to a search tool if you had some restrictions as to what you could do with the regular expressions. However, I suspect the idea was more to be able to do advanced conditionals and other funky stuff within the regular expressions, and limiting this would probably limit the usefulness of the product.

So, maybe to summarize my rambling, the initial hurdle would be to re-invent the way normal regexps work in order to be efficient in a multi-gigabyte database.

^I love regular expressions, but$ (3)

human bean (222811) | more than 13 years ago | (#359927)

If you read Don Knuth's volume two, and do some elementary math, you can get a feel for what this would take.

Best to just download search results with a spider and hit them with grep, if you've got the time. [sigh].
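
For the "hit them with grep" half of that suggestion, a minimal Python stand-in might look like the following. It assumes the spider has already saved pages under some directory; the function name grep_tree and the demo files are made up, and a temporary directory plays the part of the download folder:

```python
import re
import tempfile
from pathlib import Path

def grep_tree(root, pattern):
    """Yield (path, line_number, line) for each line under root that matches pattern."""
    rx = re.compile(pattern)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if rx.search(line):
                    yield path, lineno, line.rstrip("\n")

# Demo with two stand-in "downloaded" pages.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.html").write_text("somewhere over the rainbow, way up there\n", encoding="utf-8")
    Path(tmp, "b.html").write_text("there is something\nover somewhere\n", encoding="utf-8")
    results = [(p.name, n) for p, n, _ in grep_tree(tmp, r"somewhere.*over.*there")]

print(results)
```

Like grep, this works line by line, so a match that spans a line break is missed. Every file is read in full on every search, which is where the time goes.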
