Regular Expression Recipes

Posted by timothy on Tuesday March 22, 2005 @03:45PM from the prune-talkin' dept.

r3lody writes "If you spend time working writing applications that have to do pattern matches and/or replacements, you know about some of the intricacies of regular expressions. For many people they can be an arcane hodgepodge of odd characters that somehow manage to do wonderful things, but they don't have enough time (or interest) to really understand how to code them. Nathan A. Good has written Regular Expression Recipes: A Problem-Solution Approach for those people. In its relatively slim 289 pages, he offers 100 regular expressions in a cookbook format, tailored to solve problems in one of six broad categories (Words and Text, URLs and Paths, CSV and Tab-Delimited Files, Formatting and Validating, HTML and XML, and Coding and Using Commands)." Read on for the rest of Lodato's review.

Regular Expression Recipes: A Problem-Solution Approach
author	Nathan A. Good
pages	289
publisher	Apress
rating	8/10
reviewer	Raymond Lodato (rlodato AT yahoo DOT com)
ISBN	159059441X
summary	A cookbook of useful regular expressions for Perl, Python and more.

Regular expressions are not restricted to just the Perl or shell environments, so Nathan offers variations for Python, PHP, and VIM as well. In most cases the translation is relatively straightforward, but in a few cases a different environment may have (or lack) additional facilities, prompting a different expression to do the same task.

Before you even read chapter 1, Nathan provides a quick summary course on regular expressions, with detail given to each of the five environments you might use. He has written the syntax overview in a highly readable format, making it easy to understand the gobbledygook of the most bizarre concoctions you might encounter.

The first chapter (Words and Text) starts simply enough. He gives examples of how to find single words, multiple words, and repeated words, along with examples of how to replace various detected strings with others. In each case he gives an example of its use for each platform, followed by a bit-by-bit breakdown of how it works. Not every environment is given on every example, and in many cases the "How It Works" section refers to the first one, as most REs are identical between the platforms.
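As an illustration of the kind of recipe described here, the following is a minimal Python sketch of finding and collapsing repeated words. It is not taken from the book; the pattern and the function names are assumptions made for the example.

```python
import re

# Hypothetical pattern: a word, some whitespace, then the same word again.
# \1 is a backreference to whatever the first group captured.
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def find_repeated_words(text):
    """Return each word that is immediately followed by itself."""
    return [m.group(1) for m in REPEATED_WORD.finditer(text)]

def collapse_repeated_words(text):
    """Replace 'word word' with a single 'word'."""
    return REPEATED_WORD.sub(r"\1", text)

sample = "Paris in the the spring is is lovely"
print(find_repeated_words(sample))      # ['the', 'is']
print(collapse_repeated_words(sample))  # Paris in the spring is lovely
```

A word repeated three times in a row would only be reduced by one copy per pass with this pattern, so treat it as a starting point rather than a finished recipe.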

The next chapter (URLs and Paths) offers various methods of doing commonly needed parsing. Pulling out file names, query strings, and directories, as well as reconstructing them in useful fashions is covered in the 15 offerings given here. Validating, converting, and extracting fields of CSV and tab-delimited files are handled in chapter 3, while chapter 4 is concerned with validating field formats, as well as re-formatting text for the fields. Chapter 5 handles similar tasks for HTML and XML documents. The final chapter covers expressions that facilitate the management of program code, log files, and the output of selected commands.

First, I must admit that there are a number of useful solutions provided, especially for someone who is concerned with application and web development. However, I did feel a little cheated by the fact that several chapters covered essentially the same task, with only minor variations. It almost seemed as though the author was trying to pad out the solution count to the magic number 100. A simple example: three solutions in chapter one cover (a) replacing smart quotes with straight quotes, (b) replacing copyright symbols with the (c) tri-graph, and (c) replacing trademark symbols with the (tm) sequence. In each case, the expression was simply "s/\xhh/ rep /g;". Did we really need three separate chapters for that? I don't think so.
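To show why the three symbol replacements feel like one recipe, here is a hedged Python sketch that handles all of them with a single character class and a lookup table. The mapping below is an assumption for illustration, not the book's own table.

```python
import re

# Illustrative mapping (an assumption, not copied from the book).
REPLACEMENTS = {
    "\u2018": "'",     # left single smart quote
    "\u2019": "'",     # right single smart quote
    "\u201c": '"',     # left double smart quote
    "\u201d": '"',     # right double smart quote
    "\u00a9": "(c)",   # copyright sign
    "\u2122": "(tm)",  # trademark sign
}

# One character class built from the table's keys.
SYMBOLS = re.compile("[" + "".join(REPLACEMENTS) + "]")

def plain_ascii_symbols(text):
    """Replace each listed symbol with its plain-text equivalent."""
    return SYMBOLS.sub(lambda m: REPLACEMENTS[m.group(0)], text)

print(plain_ascii_symbols("\u201cAcme\u2122\u201d \u00a9 2005"))  # "Acme(tm)" (c) 2005
```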

Another quibble revolves around some of the coding of the expressions. Nathan has made liberal use of the non-capturing groups (that is, (?: expr )) to ensure only the items that needed replacement were captured. While a worthy idea, in some cases the expression may have been simplified for understanding. Another issue is a slight error in searching for letters. In a number of expressions, Nathan uses [A-z] to capture all letters. Unfortunately, the special characters [, \, ], ^, _, and ` occur between upper-case Z and lower-case a, making it match too much. Either [[:alpha:]] or [A-Za-z] should have been used.
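The [A-z] problem is easy to check. A minimal Python sketch follows; the sample string is made up for the demonstration.

```python
import re

# Characters whose code points fall strictly between 'Z' (0x5A) and 'a' (0x61).
between = [chr(c) for c in range(ord("Z") + 1, ord("a"))]
print(between)  # ['[', '\\', ']', '^', '_', '`']

too_wide = re.compile(r"[A-z]+")         # also accepts the six characters above
letters_only = re.compile(r"[A-Za-z]+")  # ASCII letters only

sample = "foo_bar[0]"
print(too_wide.findall(sample))      # ['foo_bar[', ']']
print(letters_only.findall(sample))  # ['foo', 'bar']
```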

Despite these quibbles, Regular Expression Recipes does provide a useful compendium of solutions for common problems developers face. Presenting the information in a cookbook fashion, along with ensuring that those using something other than Perl don't have to sweat translating the expressions to their target language, makes this a handy book to have. I wouldn't hesitate to recommend it.

You can purchase Regular Expression Recipes: A Problem-Solution Approach from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

Curious (Score:3, Funny)

by LiquidCoooled ( 634315 ) writes: on Tuesday March 22, 2005 @03:48PM (#12014937) Homepage Journal

I was performing a strange custom regular expression on the book review, and discovered that it outputted the following:

"Regex coders are in league with the devil"

Who woulda thunk it!

- - Re:Curious (Score:5, Funny)
    
    by Saeed al-Sahaf ( 665390 ) writes: on Tuesday March 22, 2005 @04:53PM (#12015718) Homepage
    
    interesting, so what secret regular expression construct matches what is nowhere in the original string?
    It's something called a joke. A joke is something said or done to evoke laughter or amusement, especially an amusing story with a punch line. Jokes employ something called humor. Humor is the quality that makes something laughable or amusing. Many Slashdotters are unable to perceive, enjoy, or express what is amusing, comical, incongruous, or absurd, often referred to as humor impaired.
    
    - - Re:Curious (Score:2)
        
        by ErikZ ( 55491 ) writes:
        
        Sorry, your "Just Kidding!" line doesn't cut it.
        
        Just admit you're a robot and your life should go much smoother from now on.
        
        And not an amusing robot like Bender, or HAL.
        
        Re:Curious (Score:3, Funny)
        
        by ErikZ ( 55491 ) writes:
        
        You're pretty touchy for a mechanical abomination, devoid of all life and only a mere shadow of the men you were built to replace.
        
        You should try tweaking your .conf.
Points (Score:5, Informative)

by 2.7182 ( 819680 ) writes: on Tuesday March 22, 2005 @03:49PM (#12014946)

I really liked this book, but

1. the binding broke
2. the index has a lot of typos.

- Re:Points (Score:2, Funny)
  
  by LiquidCoooled ( 634315 ) writes:
  
  2. the index has a lot of typos.
  
  No problem, the website issued a global regex and a pot of tip-ex for all customers.
- Typical... (Score:2)
  
  by Saeed al-Sahaf ( 665390 ) writes:
  
  This seems to be typical for tech books: Way overpriced (although this one seems more reasonable), incredibly crappy binding, and less than aggressive proof reading.
  - Re:Typical... (Score:2)
    
    by nametaken ( 610866 ) writes:
    
    You may be right, but it's my experience that tech books are surprisingly reasonable in price. My average textbook in any other field costs >$100, and my tech books are usually in the $40 range. Of course, this could just be that there's less demand for "Financial Accounting" than my PHP or Java books. Dunno, just my experience.
  - Re:Typical... (Score:4, Insightful)
    
    by Monkelectric ( 546685 ) writes: <slashdot AT monkelectric DOT com> on Wednesday March 23, 2005 @01:20AM (#12020487)
    
    I had a brief skirmish with the tech book publishing industry (and believe me that's the right word). The real problem is they pay authors BY THE PAGE so their incentive is to write flowery, lengthy language which conveys as little information in as much space as possible. This in turn justifies high book prices and higher author royalties.
    
- Re:Points (Score:3, Funny)
  
  by poot_rootbeer ( 188613 ) writes:
  
  2. the index has a lot of typos.
  
  Yeah, but in a book about regexes, you have to study the index VERY CAREFULLY to determine whether there are any typos or not.
Bran... (Score:3, Funny)

by Anonymous Coward writes: on Tuesday March 22, 2005 @03:49PM (#12014950)

...is the best regular recipe.

Another one? (Score:2, Insightful)

by cmstremi ( 206046 ) writes:

Isn't there already enough coverage for Regex's? With all the existing books and the nearly endless availability of free information and sites (including many using the 'recipe' format) online, who will want this book?
- Re:Another one? (Score:2)
  
  by scrotch ( 605605 ) writes:
  
  I don't know if this book would satisfy it, but personally, I'm tired of finding regex references that don't provide (or don't claim to provide) complete, working expressions. It seems like a pretty common occurrence to want to check that an entered email address could actually be an email address, but every regex tutorial/reference I have wimps out. They all say that their example is 'just for learning' or 'needs to be checked' or some such.
  
  A cookbook approach to Regexs seems great to me. Look up the one
  - Try them out (Score:5, Insightful)
    
    by DavidNWelton ( 142216 ) writes: on Tuesday March 22, 2005 @04:52PM (#12015702) Homepage
    
    Sometimes, with complex regexp's, it's handy to be able to build them incrementally. I know it's just one of many, but I wrote a little tool that's handy for this. It's called regexpviewer, and it's available here:
    
    http://www.dedasys.com/freesoftware/applications.html [dedasys.com]
    
    Perhaps other people can recommend other tools they've found useful for learning/building regular expressions.
    
    - Re:Try them out (Score:4, Interesting)
      
      by c_ollier ( 35683 ) writes: on Tuesday March 22, 2005 @06:48PM (#12017022) Journal
      
      The Regulator [osherove.com] is a nice Open Source tool, but Windows only. It integrates expressions from RegExLib.com [regexlib.com], and has syntax highlighting & brace matching.
      
- Re:Another one? (Score:2)
  
  by northcat ( 827059 ) writes:
  
  If you don't want this, don't buy it.
- Re:Another one? (Score:4, Informative)
  
  by carnivore302 ( 708545 ) writes: on Tuesday March 22, 2005 @04:50PM (#12015675) Journal
  
  I don't think there is a need for another book on regexps, since there is already the excellent Mastering Regular Expressions [amazon.com] by Jeffrey Friedl. What else than the best can you expect from an O'Reilly book?
  
Regular expressions in a cookbook? (Score:5, Informative)

by DeadSea ( 69598 ) * writes: on Tuesday March 22, 2005 @03:51PM (#12014962) Homepage Journal

Sounds like good eating. ;-)
Regular expressions are great, but once you know them and you think you can conquer the world, I find they occasionally let you down. The text editor I was using had a rudimentary regular expression search that did not support non-greedy matching. I found that writing a regular expression that finds C style /* comments */ is quite tricky with only greedy matching [ostermiller.org]. I wrote it up as an article where I build the expression piece by piece, showing common things you might try that won't work.
If you want more of a challenge, try writing a regular expression that finds any <script></script> tags along with anything in between using only greedy matching. You will find that the length of your regular expression goes up exponentially with the length of your ending condition.
--
Calculator for Converting Currency [ostermiller.org]
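For readers who want to try the C comment case, here is a hedged Python sketch of one commonly cited greedy-only form. It is not necessarily the expression from the linked article, and the sample source text is made up. The only non-classic construct is the (?: ) group, which can be swapped for plain parentheses in engines that lack it.

```python
import re

# Naive greedy attempt: runs from the first "/*" to the last "*/" on the line.
NAIVE = re.compile(r"/\*.*\*/")

# Greedy-only form: inside the comment, accept either a non-star character,
# or a run of stars followed by something that is neither '*' nor '/'.
C_COMMENT = re.compile(r"/\*(?:[^*]|\*+[^*/])*\*+/")

line = "/* a */ x /* b */"
print(NAIVE.findall(line))      # ['/* a */ x /* b */']
print(C_COMMENT.findall(line))  # ['/* a */', '/* b */']

source = "int a; /* first */ int b; /** second\n * line **/ int c;"
print(C_COMMENT.findall(source))
```

This sketch ignores comment markers that appear inside string literals, which a real C tokenizer would have to handle.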

- Re:Regular expressions in a cookbook? (Score:5, Interesting)
  
  by interiot ( 50685 ) writes: on Tuesday March 22, 2005 @03:58PM (#12015060) Homepage
  
  Yup, regular expressions are not capable of a full-range of computing... they're pretty close [wikipedia.org] (they're the lowest of four in the Chomsky hierarchy), but still have a few limitations that can't be resolved without wrapping some extra code around them.
  It still boggles my mind that people knew this in 1956 though.
  
  - Re:Regular expressions in a cookbook? (Score:5, Informative)
    
    by merlyn ( 9918 ) writes: on Tuesday March 22, 2005 @04:08PM (#12015194) Homepage Journal
    
    Yup, regular expressions are not capable of a full-range of computing
    
    That's the "classic" regular expressions, not the modern regular expressions accepted by PCRE, and Perl itself. In fact, Perl regular expressions are full Turing machines, with PCRE being a few steps behind that. So PCRE isn't really PCRE... it's P-likeCRE. {grin}
    
    - Re:Regular expressions in a cookbook? (Score:4, Insightful)
      
      by interiot ( 50685 ) writes: on Tuesday March 22, 2005 @04:17PM (#12015291) Homepage
      
      You mean all the sections of the perl regexp manual [cpan.org] that say "WARNING: This extended regular expression feature is considered highly experimental, and may be changed or deleted without notice" and then go on to say things that make my head truly ache?
      I personally treat this like I do Perl5 threads... as something to be afraid of, and hopeful that things will be much improved in Perl 6.
      
- Re:Regular expressions in a cookbook? (Score:3, Interesting)
  
  by pcraven ( 191172 ) writes:
  
  This [regular-expressions.info] is a cool article on catastrophic backtracking. I remember the first time that got me. It would occasionally cause severe issues on a production server we had. I swung and missed with my reg ex on that one.
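A small Python sketch of the general shape of the problem, assuming the textbook nested-quantifier example rather than whatever pattern caused the production issue. Both patterns accept the same strings, but the first can backtrack heavily on a near miss in a backtracking engine such as Python's `re`. The subject length is kept short on purpose.

```python
import re
import time

risky = re.compile(r"^(a+)+$")  # nested quantifiers: many ways to split a run of a's
safe = re.compile(r"^a+$")      # same set of accepted strings, one way to match

def time_match(pattern, text):
    """Return the match result and the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

# A run of a's followed by a character that forces the overall match to fail.
subject = "a" * 18 + "b"
for name, pattern in (("risky", risky), ("safe", safe)):
    result, elapsed = time_match(pattern, subject)
    print(name, result, f"{elapsed:.4f}s")
```

Lengthening the run of a's by a few characters at a time is a quick way to see how the cost of the risky pattern grows on a given engine.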
  - Re:Regular expressions in a cookbook? (Score:3, Interesting)
    
    by prockcore ( 543967 ) writes:
    
    I've been doing regex for a long time (over 10 years), and the best rule I can give newbies to follow is "match less, not more"
    
    Write your regex's so that they generalize as little as possible.
    
    For example, matching an xml tag use /]+>/ instead of //
    
    If you're using ".*?" in a regex, you might want to look at rewriting it.. it's almost never needed and almost always causes problems.
    - Re:Regular expressions in a cookbook? (Score:3, Interesting)
      
      by prockcore ( 543967 ) writes:
      
      (damn, I should really preview sometimes)
      
      The examples I gave are: /<[^>]+>/ instead of /<.*?>/
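One concrete difference between the two patterns, sketched in Python with a made-up snippet: without a dot-matches-newline flag, the lazy dot cannot cross a line break inside a tag, while the negated class can.

```python
import re

html = '<a\nhref="x">link</a>'

negated_class = re.compile(r"<[^>]+>")  # "anything that is not '>'"
lazy_dot = re.compile(r"<.*?>")         # "as little of anything as possible"

print(negated_class.findall(html))  # ['<a\nhref="x">', '</a>']
print(lazy_dot.findall(html))       # ['</a>']
```

Neither pattern copes with a literal '>' inside an attribute value, so both remain approximations of real tag syntax.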
      - Re:Regular expressions in a cookbook? (Score:2)
        
        by syukton ( 256348 ) * writes:
        
        I've been doing regular expressions for half as long and I completely agree. (your suggestion is actually the pattern I use when looking for tagging...)
- ignorance is bliss (Score:3, Informative)
  
  by RelliK ( 4466 ) writes:
  
  If you want more of a challenge, try writing a regular expression that finds any <script></script> tags along with anything in between using only greedy matching.
  duh! Repeat after me: HTML is not a regular language. There is no regular expression that can match it. The problem arises when people try to use regular expressions without understanding what they are. But, as the saying goes, when the only tool you have is a hammer, everything looks like a nail...
  - Re:ignorance is bliss (Score:3, Informative)
    
    by DeadSea ( 69598 ) * writes:
    
    > duh! Repeat after me: HTML is not a regular language. There is no regular expression that can match it.
    Script tags cannot be nested which makes that portion of html able to be matched by a regular expression.
    --
    Currency conversion calculator [ostermiller.org]
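Because script blocks end at the first closing tag, a single pattern can pick them out. The Python sketch below uses a negative lookahead, so it is a convenience version and not a solution to the greedy-only challenge; the pattern name and the sample page are assumptions for the example.

```python
import re

# Inside the block, accept any character that is not '<', or a '<' that
# does not start the closing tag.
SCRIPT_BLOCK = re.compile(
    r"<script\b[^>]*>(?:[^<]|<(?!/script>))*</script>",
    re.IGNORECASE,
)

page = '<p>a</p><script>x<scri</script><b>c</b><script src="a.js"></script>'
print(SCRIPT_BLOCK.findall(page))
# ['<script>x<scri</script>', '<script src="a.js"></script>']
```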
- BTW, thanks Stephen (Score:2)
  
  by boomgopher ( 627124 ) writes:
  
  Your Syntax Highlighting library for Java rocks, thanks a million.
- - Re:Regular expressions in a cookbook? (Score:3, Informative)
    
    by DeadSea ( 69598 ) * writes:
    
    Your expression fails for this case:
    <script><scri</script>
    It will match <scri< with your |</scri[^p] rule and then go on to match beyond the end of your regular expression.
    But I acknowledge that it may be quadratic rather than exponential even with a correct regular expression.
    --
    Exchange Rate Calculator [ostermiller.org]
REGEX (Score:5, Funny)

by null etc. ( 524767 ) writes: on Tuesday March 22, 2005 @03:54PM (#12015003)

Another quibble revolves around some of the coding of the expressions. Nathan has made liberal use of the non-capturing groups (that is, (?: expr )) to ensure only the items that needed replacement were captured. While a worthy idea, in some cases the expression may have been simplified for understanding.
I'm not sure I understand what your quibble is - do you dislike the fact that he uses non-capturing groups, or the fact that he disposes of them at certain points?
Another issue is a slight error in searching for letters. In a number of expressions, Nathan uses [A-z] to capture all letters. Unfortunately, the special characters [, \, ], ^, _, and ` occur between upper-case Z and lower-case a, making it match too much. Either [[:alpha:]] or [A-Za-z] should have been used.
This seems like a relatively novice mistake, and I'm surprised it would show up in a book on regular expressions.
Despite these quibbles, Regular Expression Recipes does provide a useful compendium of solutions for common problems developers face. Presenting the information in a cookbook fashion, along with ensuring that those using something other than Perl don't have to sweat translating the expressions to their target language, makes this a handy book to have. I wouldn't hesitate to recommend it.
It's nice that he covers five environments for regular expressions. I'm sure everyone has heard of Mastering Regular Expressions [oreilly.com], published by O'Reilly. The Perl Cookbook [oreilly.com] also does a good job at solving common problems with Regular expressions.
This is just my opinion, but I think what the world needs is a book on Regular Expression Design Patterns.

Unacceptable mistakes (Score:5, Interesting)

by gniv ( 600835 ) writes: on Tuesday March 22, 2005 @03:55PM (#12015008)

In a number of expressions, Nathan uses [A-z] to capture all letters.

How can this be a good book when it makes such mistakes? If this book is for beginners (as it seems) the editing process should have been much better.

- Re:Unacceptable mistakes (Score:2)
  
  by Kiryat Malachi ( 177258 ) writes:
  
  *Technically*, [A-z] does capture all letters. It does not, however, capture *only* letters.
  
  (Just to be pedantic.)
  - Re:Unacceptable mistakes (Score:3, Informative)
    
    by Speare ( 84249 ) writes:
    
    No, [A-z] does not capture all letters. For example, "Å" and "é" are not usually included in the class [A-z], but they are often part of the class \w.
    - Re:Unacceptable mistakes (Score:2)
      
      by Kiryat Malachi ( 177258 ) writes:
      
      I don't consider those letters, you damn foreign devil.
      
      (I kid, I kid.)
- Re:Unacceptable mistakes (Score:2)
  
  by hackstraw ( 262471 ) * writes:
  
  In a number of expressions, Nathan uses [A-z] to capture all letters.
  
  How can this be a good book when it makes such mistakes? If this book is for beginners (as it seems) the editing process should have been much better.
  
  I guess the author was a novice and used Word to type the book. Word is notorious for automiscorrecting technical documents.
- - Re:Unacceptable mistakes (Score:3, Insightful)
    
    by tehshen ( 794722 ) writes:
    
    [A-z] accepts all characters from A to z, including [ \ ] ^ _ and `. You want [A-Za-z] or \w (latter for 'not punctuation').
    - Re:Unacceptable mistakes (Score:3, Informative)
      
      by hattmoward ( 695554 ) writes:
      
      \w is [A-Za-z0-9_]. The reviewer mentions use of the POSIX character class [[:alpha:]], which is more in line with what you want, and will (is supposed to) match alpha characters in non-ASCII character sets.
      - Re:Unacceptable mistakes (Score:2)
        
        by tehshen ( 794722 ) writes:
        
        I didn't know about [[:alpha:]], thanks. \w varies between each implementation, apparently - this screenshot [regular-expressions.info] shows it matching foreign characters with accents and stuff.
        
        Though I would use [A-Za-z0-9_] just to be on the safe side.
      - Re:Unacceptable mistakes (Score:2)
        
        by lgw ( 121541 ) writes:
        
        As someone who's had the misfortune to work with EBCDIC, I'd point out that [[:alpha:]] is the only cross-platform answer, otherwise you can get special characters even in [A-Z], and you probably want non-ASCII alphabetics in any case.
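How these classes behave depends on the engine. As one data point, here is a Python 3 sketch. Python's `re` module does not support POSIX classes such as [[:alpha:]], so the last line uses a common workaround ("word character that is neither a digit nor an underscore"). The sample text is made up.

```python
import re

text = "\u00c5se caf\u00e9_1"  # "Åse café_1", written with escapes for clarity

print(re.findall(r"[A-Za-z]+", text))      # ['se', 'caf']
print(re.findall(r"\w+", text))            # ['Åse', 'café_1']
print(re.findall(r"\w+", text, re.ASCII))  # ['se', 'caf', '_1']
print(re.findall(r"[^\W\d_]+", text))      # ['Åse', 'café']
```

In Python 3, \w on a str pattern is Unicode-aware unless re.ASCII is given, which is why the second and third lines differ.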
  - Re:Unacceptable mistakes (Score:2)
    
    by BinLadenMyHero ( 688544 ) writes:
    
    There are other chars between 'Z' and 'a'.
    The correct way is '[A-Za-z]'.
  - Re:Unacceptable mistakes (Score:2)
    
    by roman_mir ( 125474 ) writes:
    
    [a-zA-Z] - this is the correct way to do it.
    
    BTW. regular expressions present a complete Turing machine. [A-z] is wrong due to implementation of the expressions engine. They are most likely implemented in a way, that uses character 'A' as x41. Since 'Z' is x5A and 'a' is x61 there is a gap in there that would include a bunch of other characters.
    - Re:Unacceptable mistakes (Score:4, Interesting)
      
      by slim ( 1652 ) writes: <{ten.puntrah} {ta} {nhoj}> on Tuesday March 22, 2005 @05:13PM (#12015937) Homepage
      
      BTW. regular expressions present a complete Turing machine.
      
      Actually no: regular expressions are a great example of a language which is not Turing complete, but is useful nonetheless.
      
      The classic limitation of regexes is that you can't use them to parse arbitrarily nested brackets -- because there is no concept of a stack. A Turing machine would be able to do this.
      
      (Researching this post [yes! researching!] I found a couple of mailing list posts from various people suggesting that Perl regexes are Turing complete. If this is true [which I have not established], it's because Perl extends the concept of REs in various ways)
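The nested-bracket limitation is the usual place where a few lines of ordinary code take over from a classic regex. A minimal Python sketch with an explicit stack follows; the function name and the bracket table are illustrative.

```python
# Map each closing bracket to the opener it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}

def brackets_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)            # remember the opener
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack                    # leftovers mean unclosed openers

print(brackets_balanced("(a[b]{c})"))  # True
print(brackets_balanced("(a[b)]"))     # False
```

The stack is what a finite automaton lacks: it can grow with the nesting depth of the input.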
      
      - Re:Unacceptable mistakes (Score:2)
        
        by roman_mir ( 125474 ) writes:
        
        oops, you are right, I should have said Perl 5.8 regexp are Turing complete, that's why I said it was interesting.
  - Re:Unacceptable mistakes (Score:2)
    
    by khrtt ( 701691 ) writes:
    
    Why is [A-z] wrong
    
    Because there are some characters between the letters Z and a in ASCII.
    
    what's the correct way to do it?
    
    [A-Za-z] - for us-ascii, or
    [:alpha:] - for other charsets, if your system supports it.
  - - - Re:jeez (Score:2)
        
        by Surt ( 22457 ) writes:
        
        That would totally change the nature of slashdot. Think about what would happen to arguments if you could go back and make little corrections to your logic/premises. You'd be able to make your responders look like fools.
        
        Re:jeez (Score:2)
        
        by roman_mir ( 125474 ) writes:
        
        well, I think it should still be possible to edit your comment within say an hour of posting IF no one replied to you yet.
  - - Re:Unacceptable mistakes (Score:2, Informative)
      
      by LordoftheWoods ( 831099 ) writes:
      
      the uppercase letters A-Z are followed by a number of special symbols,
      
      Indeed. If anyone is interested in why ASCII sticks a few characters in there, it's because it allows you to flip a bit to switch between cases.
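The bit in question is 0x20. A short Python sketch of the trick, with a made-up helper name:

```python
CASE_BIT = 0x20  # the single bit that differs between 'A' (0x41) and 'a' (0x61)

def flip_ascii_case(ch):
    """Toggle the case of an ASCII letter; return anything else unchanged."""
    if not (ch.isascii() and ch.isalpha()):
        return ch
    return chr(ord(ch) ^ CASE_BIT)

print(flip_ascii_case("A"), flip_ascii_case("z"), flip_ascii_case("["))  # a Z [
print(ord("a") - ord("A"))  # 32, i.e. 26 letters plus the 6 characters in between
```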
Minor variations (Score:5, Funny)

by pocari ( 32456 ) writes: on Tuesday March 22, 2005 @03:55PM (#12015009)

However, I did feel a little cheated by the fact that several chapters covered essentially the same task, with only minor variations.
I can relate. I have cookbooks for food that have all these recipes that are nothing but flour, butter, eggs, and sugar. Do we need all these recipes for pancakes, cupcakes, cookies, crepes, waffles, popovers, bread, quick bread, bread sticks? Won't people figure out eventually to put a little less sugar in waffles with savory ingredients?
Japanese cookbooks are even worse. Soy sauce, sake, mirin...boooooooring!

- Re:Minor variations (Score:2)
  
  by null etc. ( 524767 ) writes:
  
  I can relate. I have cookbooks for food that have all these recipes that are nothing but flour, butter, eggs, and sugar. Do we need all these recipes for pancakes, cupcakes, cookies, crepes, waffles, popovers, bread, quick bread, bread sticks?
  If you think there's only a minor variation between cookies and bread, let me adopt you. You'll be the easiest kid ever to take care of.
  Yum, peanut butter and jelly cookies'mich.
- - Re:Minor variations (Score:2)
    
    by winkydink ( 650484 ) * writes:
    
    1) commercial yeast technique
    2) sourdough starter technique
    3) poolish technique
    4) pumpernickel
    5) ok, I can only think of 4 offhand :)
    - Re:Minor variations (Score:2)
      
      by Gulthek ( 12570 ) writes:
      
      Mmm...sourdough starter.
I personally... (Score:5, Informative)

by BlueCodeWarrior ( 638065 ) writes: <steevk@gmail.com> on Tuesday March 22, 2005 @03:55PM (#12015012) Homepage

...use Mastering Regular Expressions [oreilly.com]. It's a good book on the topic as well.

- Re:I personally... (Score:3, Informative)
  
  by Bryson ( 112202 ) writes:
  
  > use 'Mastering Regular Expressions . It's a good book on the topic as well.
  
  I'm one of the few people who doesn't like Friedl's /Mastering
  Regular Expressions/. (I have the first edition.)
  
  First, he says that extended regexp engines, such as Perl's, use
  nondeterministic finite automata (NFA). Not true; NFA's can
  accept exactly the same languages as DFA's (deterministic finite
  automata). The extended regexps use search-and-backtrack
  engines.
  
  Friedl gives some examples of (extended) regexps that have
  catastro
add this book to your list (Score:3, Informative)

by yagu ( 721525 ) writes: <yayaguNO@SPAMgmail.com> on Tuesday March 22, 2005 @03:55PM (#12015014) Journal

While I can't vouch for the quality of the reviewed book, if you want something definitive on regular expressions, Mastering Regular Expressions, Second Edition [amazon.com] by Jeffrey E. F. Friedl is an absolute must for your professional library. Jeffrey breaks down and then builds back up what regular expressions are and how they work, and offers an entire matrix breakout of the slightly different implementations among the most common utilities (grep, sed, awk, perl...). Not to shill for amazon, but if you select the reviewed book, the "buy this book too, and you get this great price" deal actually includes the Mastering Regular Expressions, Second Edition. Get 'em both, you won't be sorry.

- - Re:add this book to your list (Score:2)
    
    by yagu ( 721525 ) writes:
    
    I wish... (hope we're far enough to be out of the modding radar....). I actually have had a recent very bad experience with amazon.... so this took a bit of a swallow to recommend this way, but the "Mastering..." is SUCH a great book... I think any professional should have at LEAST "Mastering..." as part of their library (like I said in original post, can't vouch for that book... the general reviews I've seen lead me to think it isn't nearly as good).
two problems (Score:4, Funny)

by EphemeralPhart ( 107572 ) writes: on Tuesday March 22, 2005 @03:56PM (#12015026)

Some people, when confronted with a problem, think ``I know, I'll use regular expressions.'' Now they have two problems.

Jamie Zawinski

- Re:two problems (Score:2)
  
  by GerritHoll ( 70088 ) writes:
  
  The original post can be found here [google.nl]
Linda Richmond says... (Score:5, Funny)

by Anonymous Coward writes: on Tuesday March 22, 2005 @04:05PM (#12015146)

I'm feeling a bit verklempt!

Talk amongst yourselves!

Alright, I'll give you a tawpic:

"Regular Expressions are neither regular nor expressions."

Discuss.

a Cookbook eh? (Score:3, Funny)

by chiapetofborg ( 726868 ) writes: on Tuesday March 22, 2005 @04:08PM (#12015187) Homepage

Anyone have any good recipes for [cookies]+ ?

Regexes are overused (Score:5, Informative)

by ryantate ( 97606 ) writes: <ryantate@ryantate.com> on Tuesday March 22, 2005 @04:14PM (#12015263) Homepage

Anyone who drops in regularly on a Perl discussion forum (like perlmonks.org) knows that programmers tend to over-use regular expressions.

Regexes are actually a pretty poor way to extract information from comma-delimited or tab-delimited files, for example. By the time you're done dealing with escaped commas, escaped tabs, quoting characters (which many CSV and TDT exporters use in addition to commas and tabs), escaped quote characters, escaped newlines, and escaped escape chars, you end up with a super-complicated regex.

HTML is even more complicated. You have HTML comments and nested tags on top of everything else.

To validate a simple email address, Jeffrey Friedl in his Mastering Regular Expressions book for O'Reilly writes an *11-page* regex.

Most of the time the correct answer is not "here is a regex recipe" but rather "here is a simple library to do the job properly with a parser", like Text::CSV or HTML::Parser in perl.
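The same point can be sketched in Python, using the standard library's csv module as a stand-in for Text::CSV. The sample record is made up to include a quoted comma and escaped quotes.

```python
import csv
import io

raw = 'name,quote\r\n"Smith, J.","He said ""hi"""\r\n'

# Naive approach: split the data line on commas.
naive = raw.splitlines()[1].split(",")
print(naive)   # ['"Smith', ' J."', '"He said ""hi"""']

# Parser approach: let the csv module handle quoting and escaping.
parsed = list(csv.reader(io.StringIO(raw, newline="")))
print(parsed)  # [['name', 'quote'], ['Smith, J.', 'He said "hi"']]
```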

- Re:Regexes are overused (Score:2, Informative)
  
  by stratjakt ( 596332 ) writes:
  
  Of course, the compiled regex will likely be faster than any parsing library you write. So it all depends what you're doing.
  
  For some sort of system that processes umpteen billion transactions per second, they can be a godsend. For parsing a .conf file once every six months when the machine is rebooted, it's a waste of time.
  
  It's all about knowing how and when to use the tool. A pneumatic nailgun can save a carpenter hours on a jobsite, but it's a waste of time to set it all up if you only need to knock
  - Re:Regexes are overused (Score:2)
    
    by ryantate ( 97606 ) writes:
    
    Very true. But I doubt someone who knows how to benchmark code and is handling thousands or more transactions per second is grabbing a regex recipe out of a book.
- Re:Regexes are overused (Score:4, Insightful)
  
  by Black Perl ( 12686 ) writes: on Tuesday March 22, 2005 @04:29PM (#12015431)
  
  Yes, exactly. Any good book on Regexes should have a chapter on when NOT to use them.
  
  I see many people trying to use regexes to do parsing, when they should be using a specialized parser.
  
  - Re:Regexes are overused (Score:3, Informative)
    
    by smittyoneeach ( 243267 ) * writes:
    
    Consider the boost libraries http://boost.org/ [boost.org].
    
    You get tokenizer, regex, and a parser library (spirit), sorted by increasing caliber.
    
    It's all about the right tool for the job.
- Re:Regexes are overused (Score:4, Funny)
  
  by Anonymous Coward writes: on Tuesday March 22, 2005 @04:37PM (#12015541)
  
  > *11-page* regex.
  
  I think that's a sure sign of insanity. Or autism at the least.
  
- Re:Regexes are overused (Score:2)
  
  by MikeBabcock ( 65886 ) writes:
  
  I agree -- many parsing jobs are much simpler doing basic character-at-a-time C code, especially validation.
  
  If you're searching for occasions of something or other in a long document, grep is obviously going to be an easy way (with regex's), but if you want to extract the hostname from a URI, just code it.
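In the same spirit, here is a hedged Python sketch that leans on the standard library's URL parser instead of a hand-written regex or character loop. The function name and the sample URI are assumptions for the example.

```python
from urllib.parse import urlsplit

def hostname_of(uri):
    """Return the lower-cased host part of a URI, or None if it has none."""
    return urlsplit(uri).hostname

print(hostname_of("http://user@Example.COM:8080/path?q=1"))  # example.com
print(hostname_of("mailto:someone@example.com"))             # None
```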
- Re:Regexes are overused (Score:3, Insightful)
  
  by Alan Shutko ( 5101 ) writes:
  
  To validate a simple email address, Jeffrey Friedl in his Mastering Regular Expressions book for O'Reilly writes an *11-page* regex.
  
  That's not quite fair. That regex validates any RFC822 address, and the syntax allowed isn't simple. Validating things that are currently used is fairly easy, but there's a lot of historical baggage in RFC822 addressing.
- Re:Regexes are overused (Score:4, Insightful)
  
  by 2short ( 466733 ) writes: on Tuesday March 22, 2005 @11:30PM (#12019672)
  
  "an *11-page* regex."
  
  That's insane. My feelings on Regexes were set early in my career. I discovered them, and like many started using them everywhere. Then in a code review, my boss pointed to one particularly complex one and said "See, there's why you shouldn't try to do such complex things with regular expressions, this one has a bug" "Where?" says me. "Let's leave that as an exercise for the student. Come ask me if you can't figure it out in an hour or so." Well, I certainly wasn't going to admit defeat, even though it took me several hours to find the rather subtle problem. So I went back and demanded to know how he had spotted it so fast. And he said "I didn't. It was a regex 3 lines long. It had to have a bug."
  
Regex Coach helps building Regexp (Score:5, Informative)

by uss_valiant ( 760602 ) writes: on Tuesday March 22, 2005 @04:18PM (#12015294) Homepage

Regex Coach [weitz.de]

This program assists you in building regular expressions. I've never used it (real men code regexp at once and it works). But some friends recommend it.

- - Re:Regex Coach helps building Regexp (Score:2, Informative)
    
    by DigitalDeviation ( 857048 ) writes:
    
    Regex Coach is nice for those long regexes where you may have missed an escape somewhere. I write most regexes myself, but I'm no guru at it. Regex Coach is a nice verification that the regex works (particularly for extracting something from a large string).
ambiguous use of "they/them" (Score:2)

by pomakis ( 323200 ) writes:

The vi-style regular-expression substitution technique might help: :-)

"If you spend time working writing applications that have to do pattern matches and/or replacements, you know about some of the intricacies of $regular expressions$. For $many people$ \1 can be an arcane hodgepodge of odd characters that somehow manage to do wonderful things, but \2 don't have enough time (or interest) to really understand how to code \1."
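The joke above leans on backreferences: a marked group is captured once and reused as `\1`, `\2`. A small Python sketch of the real mechanism follows. Python's `re` module writes groups with plain parentheses rather than vi's escaped ones, and the function names and sample strings are invented for illustration.

```python
import re


def swap_name(text: str) -> str:
    """Turn 'Last, First' into 'First Last' using two captured groups."""
    # \2 and \1 in the replacement refer back to the second and first group.
    return re.sub(r"(\w+), (\w+)", r"\2 \1", text)


def collapse_repeats(text: str) -> str:
    """Collapse a word repeated after a single space into one copy."""
    # \1 inside the pattern must match the same text the first group captured.
    return re.sub(r"\b(\w+)(?: \1\b)+", r"\1", text)


if __name__ == "__main__":
    print(swap_name("Good, Nathan"))
    print(collapse_repeats("the the quick brown fox fox"))
```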
Different flavors? (Score:4, Informative)

by dpbsmith ( 263124 ) writes: on Tuesday March 22, 2005 @04:30PM (#12015448) Homepage

In an average month, I use regular expressions as implemented in Microsoft Visual C++ 6.0, BBEdit Lite, TextWrangler, Apple MPW, and REALBasic. Every single one of them has _significant_ differences in syntax and semantics.

My understanding is that even the UNIX world sports several different flavors of regular expression in grep, egrep, fgrep, etc.

The biggest barrier to _my_ use of regular expressions is that every time I switch from one regular expression context to another, it takes me a good half hour to refresh my memory of what does and doesn't work in each environment.

- Re:Different flavors? (Score:2)
  
  by wk633 ( 442820 ) writes:
  
  My understanding is that even the UNIX world sports several different flavors of regular expression in grep, egrep, fgrep, etc.
  
  Er, well, not exactly. grep, extended (egrep) and fixed (fgrep) allow for different feature/speed tradeoffs, but they are consistent in their use of regular expressions. Where you will find differences is between the regex syntax of vi, perl, sed, grep, etc.
  
  After ten+ years, I still consult a reference for all the escape codes and such. Used to be a book, now it's google.
HTML, XML, CSV, but why? (Score:4, Interesting)

by AGTiny ( 104967 ) writes: on Tuesday March 22, 2005 @04:33PM (#12015499)

Of course everyone should know how to build a regex, but why take time discussing how to parse common formats such as HTML, XML, CSV, and so on? Every language likely has a good standard module/library/package that does it all for you, hopefully in the most efficient way, and gives you an easy API. I write Perl, and have used XML::*, HTML::*, DBD::CSV, Text::CSV, the list goes on. No need to write a single regex there. Another good set of modules is Regexp::Common, giving you correct regexes for parsing semi-hard things like IP addresses, MAC addresses, phone numbers, etc.
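The same argument holds outside Perl. As a hedged sketch, Python's standard `csv` module already handles quoted commas and line breaks inside a quoted field, which a naive split-on-comma regex would get wrong. The sample data and the helper name `flatten_field` are assumptions made up for this example.

```python
import csv
import io

# Invented sample: one field contains a comma, another an embedded CR/LF.
RAW = (
    'id,name,comment\r\n'
    '1,"Good, Nathan","line one\r\nline two"\r\n'
    '2,Lodato,plain\r\n'
)


def read_rows(raw: str) -> list:
    """Parse CSV text into a list of rows using the standard csv reader."""
    # newline="" leaves line endings untouched so the reader sees them as-is.
    return list(csv.reader(io.StringIO(raw, newline="")))


def flatten_field(field: str) -> str:
    """Replace line breaks inside a single field with single spaces."""
    return " ".join(field.splitlines())


if __name__ == "__main__":
    for row in read_rows(RAW):
        print([flatten_field(cell) for cell in row])
```

The quoted comma stays inside the name field, and the embedded line break can be cleaned up per field after parsing rather than with a pattern over the raw file.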

Free Alternative (Score:4, Informative)

by MudButt ( 853616 ) writes: on Tuesday March 22, 2005 @04:48PM (#12015661)

This is free... And interactive...
http://www.regexlib.com/ [regexlib.com]

About 279 pages too long (Score:4, Insightful)

by natoochtoniket ( 763630 ) writes: on Tuesday March 22, 2005 @04:57PM (#12015764)

I have a huge, 1000+ page Betty Crocker cookbook which I hardly ever use. It gives detailed recipes for particular dishes, but nothing that helps me to just throw a dinner together. And nothing that helps me to create anything new.
My very favorite recipe book is a tiny little thing of about 40 pages. For each kind of meat and each kind of vegetable, it lists what spices and sauces go well with it, how long and how hot to cook it, and how to tell when it is done. There is a little section on how to make about a dozen different sauces. That's it.
A programming language has syntax and semantics. For regular expressions, Kleene gave both fully in his original paper on the subject. The added conveniences that some utilities provide are all listed in their respective man pages. The entire subject, if it were collected together, should be about 10 pages. With some explanation of language theory, grammars, and such, the whole might be worth a chapter. Get out an undergraduate compiler-theory book (such as Aho/Sethi/Ullman). They have less than a chapter on regular expressions, and they cover the topic fairly well.
But, I suppose, there is a difference between a cookbook that is made for cooks to use as a reference, and a cookbook that is made for non-cooks to follow by rote. Learn how to cook. You will be surprised how seldom you actually refer to the 1000+ page cookbooks.

- Re:About 279 pages too long (Score:2)
  
  by ErikZ ( 55491 ) writes:
  
  Actually, I'd be very interesting in what your 40 page cookbook is called.
  - Re:About 279 pages too long (Score:2)
    
    by ErikZ ( 55491 ) writes:
    
    (sigh) Interested.
In one ear, out the other (Score:2)

by sahonen ( 680948 ) writes:

Whenever I need to use some regex, I google for a regex reference and try to figure out how to do what I want to do. Then the next time I need to use regex, I have to do it again. I literally cannot hold regex in my head for more than a day or so.
check out regex coach if you want to learn (Score:2, Interesting)

by Anonymous Coward writes:

I found this tool while doing my undergrad. Having this tool and playing with it showed me how to understand and how to successfully write regexes. 5 minutes of playing with it and you'll be enlightened.

http://www.weitz.de/regex-coach/ [weitz.de]
Separating the men from the boys... (Score:2)

by mnemotronic ( 586021 ) writes:

That's a phrase a co-worker once tossed out to differentiate regex wranglers from lowly code cowboys. The implication being that real programmers use REs. At that time in my life I knew a dozen or so programming languages, but had avoided learning REs. That little quip prompted me to start learning, first in AWK, then via Perl. Today, I'm proud to say that I can fumble my way around a regex pretty good. I'm still a little fuzzy when it comes to concepts like "negative look-ahead" and "positive look-beh
Use more than regular expressions (Score:2)

by klui ( 457783 ) writes:

Rather than relying only on regular expressions, it would be beneficial to use regexps along with sed/(g)awk/perl. If the incantation that you write using regexps is obscure to you, how will the next guy who has to support your stuff feel? Break up your uber regexp into a simpler combination of regexp(grep)/sed/awk steps.

With that, I almost always use anchoring via ^ or $.
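A minimal Python sketch of both habits, splitting one big pattern into named pieces and anchoring it, might look like this. The dotted-quad example and the names `OCTET`, `DOTTED_QUAD` and `is_ipv4` are chosen here for illustration and do not come from the comment.

```python
import re

# One small, readable piece: a decimal number from 0 to 255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# The larger pattern is assembled from the small piece.
DOTTED_QUAD = OCTET + r"(?:\." + OCTET + r"){3}"

UNANCHORED = re.compile(DOTTED_QUAD)
ANCHORED = re.compile(r"^" + DOTTED_QUAD + r"$")


def is_ipv4(text: str) -> bool:
    """True if the whole string is a dotted quad (anchored match)."""
    return ANCHORED.match(text) is not None


if __name__ == "__main__":
    print(is_ipv4("192.168.0.1"))
    print(is_ipv4("999.1.1.1"))
    # Without anchors, a match can start in the middle of the input.
    print(UNANCHORED.search("999.1.1.1").group(0))
```

One caveat: in Python, `$` also matches just before a trailing newline, so `re.fullmatch` is the stricter choice when that matters.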
:help pattern (Score:4, Informative)

by digitect ( 217483 ) writes: <digitectNO@SPAMdancingpaper.com> on Tuesday March 22, 2005 @10:23PM (#12019047)

Of course, if you use the one true text editor [vim.org], all you need to know about regular expressions is:

:help pattern

:)

- Re:Email RegEx (Score:2)
  
  by tehshen ( 794722 ) writes:
  
  As you specified all forms of e-mail addresses...
  
  (I would post one here, but the lameness filter hates it, so I'll just link to it [regexlib.com]).
  
  Covers RFC 822, as well as IP addresses.
- Re:Email RegEx (Score:3, Interesting)
  
  by Sir_Real ( 179104 ) writes:
  
  I'm still looking for a good email regex
  
  Well, you asked for it [ex-parrot.com].
  
  Actually, I asked for it last week, in #linux on freenode. Scary huh?
  - Re:Email RegEx (Score:2)
    
    by skids ( 119237 ) writes:
    
    Hrm, that's a very inefficient (in terms of usability) regex someone sold you. If you are working in Perl, look into using the qr// operator to build it up from subexpressions. You can easily reduce that to 1/10th the size/complexity.
    - Re:Email RegEx (Score:2)
      
      by Sir_Real ( 179104 ) writes:
      
      Yes, I see that. To contextualize this page further, I specifically asked for a regular expression that was usable outside of Perl, hence some of the verbosity. To me, it's more of a demonstration of the concept that regex isn't a panacea, and that email address verification is non-trivial (or at least more difficult than I was initially led to believe).
- Re:Email RegEx (Score:2)
  
  by rduke15 ( 721841 ) writes:
  
  I found this little online Email address syntax checker [alma.ch] which is useful in comparing the results of various classical Perl modules. It is very slow (maybe on purpose to avoid abuse?).
- Re:A language in their own right. (Score:2, Informative)
  
  by APDent ( 81994 ) writes:
  
  Regular expressions are not Turing complete.
- Re:A language in their own right. (Score:2, Informative)
  
  by smoany ( 832744 ) writes:
  
  Um, last time I checked, regular expressions are not Turing complete. Take the language 0^n 1^n, which a Turing machine can recognize. If you can match that for me using a regular expression, you deserve a Turing Award. Regular expressions are DFA/NFA complete, not Turing complete... not even close!
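To make the 0^n 1^n point concrete, here is a small Python sketch. A regular expression can check the shape "some zeros, then some ones", but the equal-count requirement needs a counter, which lives outside the pattern. The function name is invented for this example.

```python
import re


def is_zeros_then_ones(s: str) -> bool:
    """Accept exactly the strings 0^n 1^n for n >= 0."""
    # The regular part: zeros followed by ones, with nothing else allowed.
    m = re.fullmatch(r"(0*)(1*)", s)
    if m is None:
        return False
    # The non-regular part: comparing the two run lengths takes counting,
    # which a finite automaton cannot do for unbounded n.
    return len(m.group(1)) == len(m.group(2))


if __name__ == "__main__":
    for sample in ["", "01", "0011", "001", "10", "0101"]:
        print(repr(sample), is_zeros_then_ones(sample))
```

This is about regular expressions in the formal sense. Some engines add extensions such as recursion that go beyond regular languages, but Python's `re` module does not have that feature.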
- Re:A language in their own right. (Score:4, Informative)
  
  by khrtt ( 701691 ) writes: on Tuesday March 22, 2005 @04:07PM (#12015173)
  
  Regular expressions are probably the first Turing-complete language to be encapsulated in another Turing-complete language (C).
  
  Don't you just love to sound like a StarTrek character, with all that fancy terminology?
  
  Go look up your complexity book - if you have one - regexes are not even close to Turing-complete.
  
- Re:cant get used to them (Score:2, Informative)
  
  by Anonymous Coward writes:
  
  regular expressions are nice and all but i still cant get used to them .. a good manual should be kept handy at all times. [ ... ]
  Suggestions are welcome.
  
  I have a suggestion. Write a few regular expressions to get your brain refreshed on them, then go read this excellent article [plover.com] on how regular expressions work. At the very least, it will clear some confusing things up. Most likely you'll find that having a better understanding of the underlying concepts will make it easier for you to work with re
  - Re:cant get used to them (Score:4, Informative)
    
    by B'Trey ( 111263 ) writes: on Tuesday March 22, 2005 @05:01PM (#12015801)
    
    If you really want to understand regexes, get Jeffrey E. F. Friedl's "Mastering Regular Expressions" from O'Reilly. It's much deeper than the casual reader will ever need, but if you get through it you will certainly know how regexes work from both a user perspective and from a regex engine perspective.
    
- Re:cant get used to them (Score:2, Informative)
  
  by Waffle Iron ( 339739 ) writes:
  
  regular expressions are nice and all but i still cant get used to them
  They may be kind of hard to get used to, but not as hard as writing, debugging and maintaining a dozen or more lines of custom string parsing code for each case where you would use one.
- Re:cant get used to them (Score:4, Informative)
  
  by halber_mensch ( 851834 ) writes: on Tuesday March 22, 2005 @04:22PM (#12015350)
  
  A good starting point is to understand finite automata and regular languages first. See http://en.wikipedia.org/wiki/Automata_theory/ [wikipedia.org] for a good first reference on automata. If you can grok automata, regular expressions will click with you.
  
- Re:Regexes How2 (Score:5, Informative)
  
  by softcoder ( 252233 ) writes: on Tuesday March 22, 2005 @05:23PM (#12016053)
  
  In addition to a good book, or even INSTEAD of a good book, download and use THE REGEX COACH
  http://www.weitz.de/regex-coach/
  
  It is a very very nice interactive pgm that lets you debug REGEXES on the fly visually, by feeding them sample text.
  
- - Re:cant get used to them (Score:2)
    
    by cayenne8 ( 626475 ) writes:
    
    If they could come up with some good ways to get CR/LF's out of MS excel files...in a csv, I'd be ecstatic!!
    I get excel files dumped to me for inserts into Oracle databases...some are HELL to clean up. Especially the comma delimited ones..with freeform text fields...that allow the user to put hard returns in them....
    That and one more bitch. When did MS make it so damned hard to change the delimiter in excel?? I remember a few editions ago, when you saved as a CSV, it gave you a wizard type thing to choose
- Re:F*ck this book and all others like it: (Score:2)
  
  by winkydink ( 650484 ) * writes:
  
  slicker than greased pigeon shit
  I somehow think that a lot of /.'ers will find an analogy of .NET to pigeon shit as quite apropos. :)
- - Re:F*ck this book and all others like it: (Score:2, Funny)
    
    by winkydink ( 650484 ) * writes:
    
    Yeah, mod me down as a troll, don't even READ my comment. [...]
    
    You dumb slashbot fucks have no idea what a regex is [...]
    
    Sycophants and asshats, monkeys who crawl around above my office trying to figure out which wire the rats chewed through. Know-nothing idiots [...]
    
    Fuck you and your iPods. All those white earbuds do is help me pick out the clueless wannabes. No true geek would own one.
    
    Let me guess... you didn't finish the Dale Carnegie course, did you?
    - Re:F*ck this book and all others like it: (Score:2)
      
      by east coast ( 590680 ) writes:
      
      you didn't finish the Dale Carnegie course, did you?
      
      I got a laugh out of his/her comments, if that counts for anything.
      
      And I found the iPod comment very insightful...
