Slashdot: News for Nerds

Researchers Expanding Diff, Grep Unix Tools

timothy posted more than 2 years ago | from the now-with-raisins dept.

Google

itwbennett writes "At the Usenix Large Installation System Administration (LISA) conference being held this week in Boston, two Dartmouth computer scientists presented variants of the grep and diff Unix command line utilities that can handle more complex types of data. The new programs, called Context-Free Grep and Hierarchical Diff, will provide the ability to parse blocks of data rather than single lines. The research has been funded in part by Google and the U.S. Energy Department."


276 comments

Strange names (4, Funny)

gnasher719 (869701) | more than 2 years ago | (#38305988)

Space characters in the name of a Unix command line tool is asking for trouble.

Re:Strange names (0)

bobdinkel (530885) | more than 2 years ago | (#38306020)

There's nothing that says the name of the tool and the command you type must be the same. I wouldn't sweat it.

Re:Strange names (4, Insightful)

Longjmp (632577) | more than 2 years ago | (#38306526)

Definitely.
I mean, where would we end up if unix commands actually gave a hint what they're doing ;-)
As a unix novice, if I wanted to search for something, my first choice would of course be grep.
Also, if I wanted help on something, the first word that jumps to my mind would be man.

heh.

Re:Strange names (4, Funny)

toadlife (301863) | more than 2 years ago | (#38306836)

"I have only been able to come up with one algorithm for creating Unix command names: think of a good English word to describe what you want to do, then think of an obscure near- or partial-synonym, throw away all the vowels, arbitrarily shorten what's left, and then, finally, as a sop to the literate programmer, maybe reinsert one of the missing vowels."

Rachel Padman [mindspring.com]

Re:Strange names (1)

serviscope_minor (664417) | more than 2 years ago | (#38308170)

Also if I wanted help on something, the first word that jumps to my mind would be man

If you want help, perhaps you should read the MMMAAANNNual.

hint.

Re:Strange names (4, Insightful)

mfnickster (182520) | more than 2 years ago | (#38307526)

There's nothing that says the name of the tool and the command you type must be the same

Very true. Unix programmers seem to follow these rules:

  1. delete any spaces in the name
  2. delete any vowels in the name
  3. delete any superfluous consonants
  4. chuck the entire thing and just abbreviate it to the first letter of each word in the name

So these tools will likely be run as "ctxtfrgrp" and "hierdiff" or just "cfgrep" and "hdiff"

Re:Strange names (1)

unixisc (2429386) | more than 2 years ago | (#38307864)

Like 'cat' for concatenate, or vi for what exactly?

Re:Strange names (2)

urdak (457938) | more than 2 years ago | (#38308156)

Like 'cat' for concatenate, or vi for what exactly?

"vi" is short for "visual".
First there was "ed", the, you guessed it, "editor". But "ed" was a real pain to use, because you couldn't see what you were actually editing (if you've ever used ed, you know what I mean). So the "visual" editor "vi" was invented.

Re:Strange names (4, Insightful)

realyendor (32515) | more than 2 years ago | (#38306056)

I expect those are just the spoken names and that the commands will still be single words, similar to:
"GNU awk" -> gawk
"enhanced grep" -> egrep

Re:Strange names (1)

dougmc (70836) | more than 2 years ago | (#38306102)

"enhanced grep" -> egrep

Well, except that egrep is already taken :)

But yeah, your point is valid and probably correct.

Re:Strange names (2)

dougmc (70836) | more than 2 years ago | (#38306126)

and I really should spend a few more seconds thinking about what I'm responding to. Obviously gawk and egrep are existing tools, given as examples, not proposed names for these new tools.

Re:Strange names (3, Informative)

EdIII (1114411) | more than 2 years ago | (#38306596)

and I really should spend a few more seconds thinking about what I'm responding to

That's not what Slashdot is about........

Re:Strange names (2)

ivoras (455934) | more than 2 years ago | (#38306348)

But of course, "eegrep" isn't :)

(enhanced enhaced grep)

Re:Strange names (4, Funny)

ripler (19188) | more than 2 years ago | (#38306454)

Next thing you know we'll have CSIgrep. (enhance enhance enhance grep)

Re:Strange names (4, Funny)

Anne Thwacks (531696) | more than 2 years ago | (#38306676)

CSIgrep would take 30 mins to get the result! (With ad breaks)

Re:Strange names (1)

berashith (222128) | more than 2 years ago | (#38306850)

yes, but it is nice to know that all of your expectations for the first 26 minutes are incorrect.

Re:Strange names (1)

Tomato42 (2416694) | more than 2 years ago | (#38306932)

sign me up if it will search a 1TB data set in those 30 min!

Re:Strange names (1)

marcosdumay (620877) | more than 2 years ago | (#38307426)

No, those 30 minutes are per bit.

But you'd be surprised by the amount of information you can gather from a single bit!

Re:Strange names (1)

bryan1945 (301828) | more than 2 years ago | (#38307496)

If you use a TV instead of a monitor, science and computer stuff runs really, really fast.

Re:Strange names (0)

Anonymous Coward | more than 2 years ago | (#38306858)

grep++

Re:Strange names (1)

Noughmad (1044096) | more than 2 years ago | (#38306902)

Just wait until Microsoft sees your post and we'll have eeegrep.

Re:Strange names (1)

Mister Liberty (769145) | more than 2 years ago | (#38307654)

You meant '(enhanced enhanced grep)'.

There, enhanced that for ya.

They should call it... (3, Insightful)

goombah99 (560566) | more than 2 years ago | (#38306792)

perl. Isn't this exactly why perl was invented?

Re:They should call it... (1)

The Askylist (2488908) | more than 2 years ago | (#38306998)

I've always thought perl should be renamed SOS - Self-Obfuscating Scripting. But then again I prefer languages to be human-readable.

Re:They should call it... (0)

Anonymous Coward | more than 2 years ago | (#38307134)

Actually people just wanted a way to execute random line noise as if it was a program.

That's also why programming over a bad terminal connection in perl can have disastrous consequences.

You never quite know if that garbly-gook is line noise or the last 10 minutes of your work.

Off the top of my head..... @#$%&JVJDV)@#MSDC)(FGDSG(DF)GSDFG(SDFG

That's probably valid perl that compiles and computes something.

Re:They should call it... (1)

marcosdumay (620877) | more than 2 years ago | (#38307454)

Yes, also sed, and awk.

They are still ages behind Prolog, which will parse context-dependent texts....

Re:They should call it... (1)

FractalParadox (1347411) | more than 2 years ago | (#38307644)

Yes, next up will be awk so that it can take many sequential operations at once ... we'll call it sqawk.

the perl man page (2)

goombah99 (560566) | more than 2 years ago | (#38307488)

From the header of 1988 perl man page:

Submitted-by: Larry Wall
Posting-number: Volume 13, Issue 1
Archive-name: perl/part01

[ Perl is kind of designed to make awk and sed semi-obsolete. This posting
      will include the first 10 patches after the main source. The following
      description is lifted from Larry's manpage. --r$ ]

      Perl is an interpreted language optimized for scanning arbitrary text
      files, extracting information from those text files, and printing
      reports based on that information. It's also a good language for many
      system management tasks. The language is intended to be practical
      (easy to use, efficient, complete) rather than beautiful (tiny,
      elegant, minimal). It combines (in the author's opinion, anyway) some
      of the best features of C, sed, awk, and sh, so people familiar with
      those languages should have little difficulty with it. (Language
      historians will also note some vestiges of csh, Pascal, and even
      BASIC-PLUS.) Expression syntax corresponds quite closely to C
      expression syntax. If you have a problem that would ordinarily use sed
      or awk or sh, but it exceeds their capabilities or must run a little
      faster, and you don't want to write the silly thing in C, then perl may
      be for you. There are also translators to turn your sed and awk
      scripts into perl scripts.

Subject line is not part of the comment (1)

Tetsujin (103070) | more than 2 years ago | (#38307718)

They should call it... perl. Isn't this exactly why perl was invented?

Perl could do this - with the right libraries. But those libraries are the real value they're adding here: they created tools that operate on files with knowledge of the structure of those files. So for instance a "diff" between two XML files with identical contents but differences in formatting could report that the files are identical... Or if you had some file format that defined a directed-graph structure, a format meant to be edited in place (and which therefore might sometimes have holes in it where data was removed, or which might have data presented in a different order depending on the sequence of operations used to store it) - the "diff" tool would decode the files, examine the data structure they're meant to represent, and show the differences in that.

Obviously it could be done in Perl - but it wouldn't be a one-liner unless you had those libraries which translate the particular file format into the desired level of abstraction.
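A crude shell sketch of that XML case (nothing like the researchers' actual tool; the filenames are invented): strip formatting whitespace before diffing, so reformatting alone reports no difference. A real structural diff needs a proper parser - this hack breaks on significant whitespace inside text content.

```shell
# Two XML files with identical structure but different indentation.
printf '<a><b>x</b></a>\n' > one.xml
printf '<a>\n  <b>x</b>\n</a>\n' > two.xml

# Delete all whitespace, then compare; formatting-only changes vanish.
normalize() { tr -d ' \t\n' < "$1"; }

normalize one.xml > one.norm
normalize two.xml > two.norm
diff one.norm two.norm && echo "structurally identical"
```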

Re:Strange names (2)

rwa2 (4391) | more than 2 years ago | (#38306130)

Yay, a tools thread!

I am liking meld (python-based visual diff)

But I suppose they have a different concept of hierarchical diff than diffing/merging two directory structures.

Re:Strange names (1)

pclminion (145572) | more than 2 years ago | (#38306128)

If the FS supports spaces in filenames, then your code is broken if it can't tolerate them. MS wisely put a space in the "Program Files" name when they added long filenames to Windows. That puts a direct and immediate stop to any delusions about being able to ignore them.

Re:Strange names (0)

Anonymous Coward | more than 2 years ago | (#38306290)

Try reading the post you replied to.

Re:Strange names (3, Interesting)

adonoman (624929) | more than 2 years ago | (#38306322)

But having to use quotes every time you call a command is a sure way to make sure your command is never used.

Would you rather type this:
./"Context-Free Grep" ...
or this:
./cfgrep ..

Re:Strange names (1)

sys_mast (452486) | more than 2 years ago | (#38306500)

I was going to say ./cgrep, but your suggestion is better since it won't be confused with "Context Grep", which would imply it is NOT context-free.

so is the other command ./hdiff ?

Re:Strange names (4, Insightful)

iluvcapra (782887) | more than 2 years ago | (#38306752)

If you don't like a tool's name, export an alias.

It's not about typing commands as much as it's about making these work:

$ find . -name "*.txt" | xargs wc
$ for file in $*; do
mv $file old/$file
done

Versus these:

$ find . -name "*.txt" -print0 | xargs -0 wc
$ for file in $*; do
mv "$file" "old/$file"
done

A lot of scripts you run into are just broken because of braindead assumptions.

Re:Strange names (1)

Anonymous Coward | more than 2 years ago | (#38307166)

Actually, your "correct" code is also broken. It should read:

for file in "$@"; do
    mv "$file" "old/$file"
done

You see, $* expands to a single string, not a list.
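A quick demonstration of the difference, using hypothetical filenames containing spaces: "$@" preserves each argument as one word, while unquoted $* re-splits everything on whitespace.

```shell
# Simulate a script invoked with two filenames that contain spaces.
set -- "my file.txt" "other file.txt"

# Count how many "files" each loop style actually sees.
count_star() { c=0; for f in $*;   do c=$((c+1)); done; echo "$c"; }
count_at()   { c=0; for f in "$@"; do c=$((c+1)); done; echo "$c"; }

count_star "$@"   # word-splits into 4 pieces: prints 4
count_at "$@"     # keeps the 2 filenames intact: prints 2
```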

Re:Strange names (2)

gangien (151940) | more than 2 years ago | (#38306818)

in scripts, i pretty much quote everything. seems to be the way to avoid problems. of course, i'm not a sysadmin by trade, so maybe it's bad for some reason or something.

when at the prompt i hit tab.

We'd probably avoid a lot of problems if people weren't too lazy to type a few extra characters.

Re:Strange names (0)

Anonymous Coward | more than 2 years ago | (#38307170)

add an alias in your .kshrc and then call it what you want....

Re:Strange names (1)

jandrese (485) | more than 2 years ago | (#38306416)

Ironically, many of Microsoft's tools have trouble dealing with the space in the filename, including the blasted Run window.

Just because there is a way to make it work doesn't mean there isn't a problem with it. All unix shells can handle spaces in filenames, but the methods to do so are not always intuitive, and it's easy to mess up things like shell scripts. Even the "proper" solutions have problems.

And I can't stand "Program Files", what a mess that has been.

Re:Strange names (1)

Dog-Cow (21281) | more than 2 years ago | (#38308186)

The Run window has no problems with spaces. The problem is that you expect it to read your mind and figure out which part is the command, which is arguments, and which arguments are really one argument. If you quote, everything works just fine.

Re:Strange names (1)

Anonymous Coward | more than 2 years ago | (#38306572)

MS wisely put a space in the "Program Files" name when they added long filenames to Windows.

You mean the PROGRA~1 directory?

Re:Strange names (0)

Anonymous Coward | more than 2 years ago | (#38307184)

MS wisely put a space in the "Program Files" name when they added long filenames to Windows.

And in Swedish Windows XP that dir was called "Program" (you'd end up having both the Swedish dir and "Program Files", since many programs didn't use any kind of $AppInstallPath variable, but hard-coded "Program Files"). I've seen many an XP boot always opening the "Program" folder. I'm thinking it was some daemon or something-ware that was supposed to open something in "Program Files" but opened "Program" in Windows Explorer when it got to the space...

Re:Strange names (4, Informative)

mytec (686565) | more than 2 years ago | (#38306540)

According to this paper [dartmouth.edu] , they are called bgrep and bdiff.

How's it compare to Meld? (1)

Compaqt (1758360) | more than 2 years ago | (#38306010)

A nice GUI diff for Linux. (Has 3-way).

Click here to install [deb]

Re:How's it compare to Meld? (3, Insightful)

Anonymous Coward | more than 2 years ago | (#38306246)

It is surprising that Slashdot even let you post a deb: url, as the filter usually seems to destroy most non-http(s) links. However, not everyone uses a Debian-based distro, and not everyone tries some random package (even from the repository) before reading a little about it, so posting the home page [sourceforge.net] would have been a bit more useful.

Re:How's it compare to Meld? (2)

Compaqt (1758360) | more than 2 years ago | (#38306474)

Yeah, I usually post a disclaimer ("for Debian/Ubuntu/Mint" -- now "Debian/Mint/Ubuntu").

Second, yes, /. does allow that, and I hope they continue to do so, because deb:// and click-to-install is neat and handy (a lot of old Linux hands don't even know about it).

Finally, (as you mentioned) it's not a link to download software, but rather install software from the repositories, so there's that level of security.

Re:How's it compare to Meld? (1)

garry_g (106621) | more than 2 years ago | (#38306444)

Or ASCII GUI: vimdiff ... works fine, also with 3 files ...

Re:How's it compare to Meld? (1)

pak9rabid (1011935) | more than 2 years ago | (#38307066)

I like kompare [caffeinated.me.uk] .

awk? (2)

realyendor (32515) | more than 2 years ago | (#38306028)

Done! It's called "awk". Just set the RS and FS fields as appropriate. :P
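For the curious, awk's paragraph mode really does give a rough block-aware grep (sample data invented, and nothing like the proposed tools): with RS set to the empty string, records are blank-line-separated blocks, so a match prints the whole block rather than one line.

```shell
# Two blank-line-separated blocks of sample data.
printf 'alpha\nbeta\n\ngamma\ndelta\n' > blocks.txt

# RS="" switches awk to paragraph mode; FS="\n" makes each line a field.
# A match against the record prints the entire matching block.
awk 'BEGIN { RS=""; FS="\n" } /gamma/ { print }' blocks.txt
```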

DOE?????? (0)

Anonymous Coward | more than 2 years ago | (#38306034)

What's the relevance of this work to DOE? Shouldn't DOD be the funding agency? Or does DOE simply have more money than they know what to do with?

Re:DOE?????? (1)

GameboyRMH (1153867) | more than 2 years ago | (#38306080)

Well if we can use our computers more efficiently then we'll save energy. On the other hand I can't imagine what use the DOD would have for this, especially since they seem to run Windows at every opportunity...

Re:DOE?????? (1)

amiga3D (567632) | more than 2 years ago | (#38306160)

I think maybe some of the scientist types at the DOE were behind the funding.

Re:DOE?????? (3, Interesting)

iced_tea (588173) | more than 2 years ago | (#38306298)

They have HUGE amounts of data kicking around from various simulations/experiments.

Check out the wikipedia article for supercomputers [wikipedia.org] , and you'll see DOE mentioned.

Tools like this could help with analysis and finding certain data sets. IIRC, regex are already used in DNA sequencing. There is probably a similar application and use for tools like this with their data.

Follow the money...? (1, Interesting)

dzfoo (772245) | more than 2 years ago | (#38306038)

funded in part by Google and the U.S. Energy Department

I wonder what's the interest of these two in this.

          -dZ.

Re:Follow the money...? (0)

mvar (1386987) | more than 2 years ago | (#38306178)

I was about to post the exact same thing. According to the article

The DOE foresees that this sort of software could play a vital role in smart grids, in which millions of energy consuming end-devices would have connectivity of some sort.

What a load of crap. These "new programs" sound more like a high school or open source project. Since when does a government agency care about a Unix admin's toolbox so much that it decides to fund something that could (and probably already has) been solved with a script? wtf?

Re:Follow the money...? (5, Insightful)

Tanktalus (794810) | more than 2 years ago | (#38306334)

Context-free grep/diff can be used to search for data/changes in arbitrary non-line-record-based files. Such as XML, HTML, JSON, SQLite databases, other databases, Apache configs, and many other pieces of data. Heck, even most programming languages are not line-based, but statement terminated/separated. Imagine being able to grep for a function name, and getting its entire prototype/usage even when it spans multiple lines (very common in standard glibc headers). And, depending on the plugin's capabilities, you could grep for a function name as a function name and not get back any usage of that text as a variable or embedded in a string, or a comment (skip commented-out calls!).

If there's sufficient configurability, you could ask for the entire block that a given text is in, and such a grep would be able to display everything in the corresponding {...}. Makes grep that much more valuable.

So, my question is, why aren't more IT-heavy corporations/government departments involved?
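A toy approximation of that "give me the whole { } block" idea in awk (the sample config and the one-brace-per-line assumption are mine, not from the paper; handling real nesting is exactly why a proper parser is needed):

```shell
# A made-up config file with two brace-delimited blocks.
cat > conf.txt <<'EOF'
server {
  name alpha
}
server {
  name beta
}
EOF

# Track brace depth line by line; buffer each block, and print the
# buffer only if the completed block contains the search text.
awk 'index($0, "{") { depth++ }
     depth > 0 { buf = buf $0 "\n" }
     index($0, "}") { depth--; if (depth == 0) { if (buf ~ /beta/) printf "%s", buf; buf = "" } }' conf.txt
```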

Re:Follow the money...? (1)

hedwards (940851) | more than 2 years ago | (#38306782)

Why does that necessitate screwing around with grep? I can sort of see modifying diff, but with grep if you need that data you'd write a new program to parse it and pipe it.

Re:Follow the money...? (2)

bobaferret (513897) | more than 2 years ago | (#38306940)

So weird. I spent the last 6 months writing some Java libraries that do exactly this. There were some similar things out there, but they weren't licensed appropriately for my uses, or were WAY too expensive. Writing a hierarchical diff engine is the most complex thing I've ever done; hell, writing an efficient pure diff engine is insane by itself. You have to identify blocks/structure, then you have to diff the structures, then you have to diff the content in the structures. Once all of that is said and done, you have to find a way to represent the differences using the recognized structures. And from my point of view, half the reason was to be able to represent ONLY the changes so that I'd have a nice size savings on a constantly changing tree. You also have to choose a format that allows you to roll back to a previous diff given the initial state or final state. There are also a large number of trade-offs that have to be made, including window size etc. You can't do a diff across a massive amount of data w/o a massive amount of processing power and memory, so you effectively have to diff independent streams against each other with similar-sized sliding windows on each stream. /rant Good stuff though, just funny to read about, and difficult to do.

I don't have an answer to your question, but I wrote my software to deal with IT problems, because diff and grep just weren't good enough, and no one seems to do it for free.

Re:Follow the money...? (0)

Anonymous Coward | more than 2 years ago | (#38307400)

Are they opensourced? Are you going to contact the team to offer some sourcecode ideas?

Re:Follow the money...? (2)

bobaferret (513897) | more than 2 years ago | (#38307662)

LOL and that, my friend, is the hard part. It cost me $4000 in legal fees to make sure they are not owned by the company I work for, and 6 weeks of work. I'm leaning towards an AGPL/open-core model; I just see so many people NOT happy with open-core stuff. Also, I didn't get a grant from Google or the D.O.E., and these are just small, yet integral, parts of a larger system that I don't really want to give away yet. Hell, deciding on licensing is harder than coding sometimes. Gotta feed the family, you know, while at the same time paying back the OSS world for all of the great stuff that I use every day for free. How to do both is a hard ethical question. It's easy to say just consult, or write a book. It's much harder to actually _do_ these things. Hell, it's hard enough just to open up your code to the world's criticisms. The only thing I know at this point is that it's not doing me or anyone else any good just sitting on it.

Re:Follow the money...? (-1)

Anonymous Coward | more than 2 years ago | (#38308056)

AGPL is a cancer, if you really care about the freedom of your code go for the Apache license.

Re:Follow the money...? (2)

Doc Ruby (173196) | more than 2 years ago | (#38306434)

Vast amounts of OS SW has been funded by the government. BSD was developed by UC Berkeley, which is largely funded by Pentagon contracts.

And the Internet.

Meanwhile, the vast majority of open source projects never get past the opening statement.

You clearly don't know what it takes to accomplish a project like this one. What have you ever done, that gives you some standing to announce that this Usenix project is a load of crap?

Re:Follow the money...? (1)

RocketRabbit (830691) | more than 2 years ago | (#38307472)

I'm no doctor but I can tell that wound is infected.

Re:Follow the money...? (0)

Anonymous Coward | more than 2 years ago | (#38306206)

Cool. You could be lazy, and I could quote the article for you for sweeet karma. Win-win!

Only... I'd feel... dirty.

So forget it.

RTFA? (4, Informative)

DragonWriter (970822) | more than 2 years ago | (#38306394)

funded in part by Google and the U.S. Energy Department

I wonder what's the interest of these two in this.

FTFA:

Google's interest in this technology springs from the company's efforts in cloud computing, where it must automate operations across a wide range of networking gear, Weaver said. The DOE foresees that this sort of software could play a vital role in smart grids, in which millions of energy consuming end-devices would have connectivity of some sort. The software would help "make sense of all the log files and the configurations of the power control networks," Weaver said.

Re:RTFA? (0)

Anonymous Coward | more than 2 years ago | (#38307082)

funded in part by Google and the U.S. Energy Department

I wonder what's the interest of these two in this.

FTFA:

Google's interest in this technology springs from the company's efforts in cloud computing, where it must automate operations across a wide range of networking gear, Weaver said. The DOE foresees that this sort of software could play a vital role in smart grids, in which millions of energy consuming end-devices would have connectivity of some sort. The software would help "make sense of all the log files and the configurations of the power control networks," Weaver said.

tl;dr

Re:Follow the money...? (1)

neokushan (932374) | more than 2 years ago | (#38306408)

I wonder why people feel the need to "sign" their posts, when their username is quite clearly visible at the top.

                  -nk

Re:Follow the money...? (0)

Anonymous Coward | more than 2 years ago | (#38306546)

I wonder why people feel the need to "sign" their posts, when their username is quite clearly visible at the top.

AND when they have a sig anyway. I guess their cut-and-pasted "witty" comment is worth more than who they are?

-AC

Interesting... (3, Interesting)

DangerOnTheRanger (2373156) | more than 2 years ago | (#38306052)

With these tools, you could make grep and diff work with binary files in a meaningful way - very useful at times. I bet you could even adapt the "Context-Free Grep" into a sort of packet sniffer with enough work. I'd sure like to try these new programs sometime.

No download link? (1)

roguegramma (982660) | more than 2 years ago | (#38306072)

I would have wished for a download link...

bad, wrong and stupid (-1)

Anonymous Coward | more than 2 years ago | (#38306166)

The reason those tools are so useful is that they are not big bloated pieces of shit, you morons. Bad, wrong and stupid. Go play somewhere else. You're making other people dumber with your ideas.

Re:bad, wrong and stupid (2)

interval1066 (668936) | more than 2 years ago | (#38306272)

Do we really need to improve on something that works already? A grep that handles binary formats might be nice, but I think I'd rather see this spun off into some kind of new tool or two, like an "extended" grep and diff, maybe. Maybe they're doing that.

Re:bad, wrong and stupid (2)

gstoddart (321705) | more than 2 years ago | (#38306796)

Do we really need to improve on something that works already?

This would work, but better. No, I'm not being flippant.

If you have structured data (say XML), you could target hierarchies like config-root:server-name:name. That way, if the text inside "name" is only being looked for in that one field, you won't hit a bunch of other strings that happen to be similar but are unrelated.

I'm sure you'd still have your regular grep/diff utilities, but there's definitely places where being able to match these strings in-context would be of value.

Of course, someone is going to need to write a corresponding context-free sed (and maybe awk as well) to go along with the grep. But there's actually a lot of places where this would be a huge improvement in terms of certain kinds of automation.

Use of a context-free grammar also lets this be insensitive to whitespace and newlines, so it would work on "prettified" HTML or stuff that's all formatted haphazardly. This is basically how those things are parsed now ... the grammar rules define the structure, and don't need it to be all perfectly laid out in order to be able to handle it.

Re:bad, wrong and stupid (0)

Anonymous Coward | more than 2 years ago | (#38306354)

It's a new program. They're not replacing grep. They're not going to break into your house and apt-get remove grep. If the data you need to grep is broken into lines, keep using grep. If you'd rather manually sort through data that's not broken neatly into lines, feel free to do that. Personally, this has the potential to be a huge help for me, though it depends a lot on what's required to make the necessary library for a given data type.

Re:bad, wrong and stupid (1)

Anne Thwacks (531696) | more than 2 years ago | (#38306740)

They're not going to break into your house and apt-get remove grep

Are you sure?

They can probably do it remotely on most OSes anyway. Quick - make friends with Theo.

Mod parent up (0)

Lakitu (136170) | more than 2 years ago | (#38307010)

This man has a point -- these government-sponsored dumbification programs have obviously already worked on him. You could be next.

Almost vaporware (1)

gmuslera (3436) | more than 2 years ago | (#38306274)

The grep is "in the design process", the diff is "not released yet". And there should be a lot of alternative tools to those two, some built around the same goal (i.e. mailgrep). I'm all for improving those two venerable tools, but the announcement looks a bit out of time or scale.

sgrep (1)

SgtChaireBourne (457691) | more than 2 years ago | (#38306372)

There used to be a utility, sgrep [helsinki.fi] , for searching SGML/XML.

Object grep (1)

Doc Ruby (173196) | more than 2 years ago | (#38306376)

I'd like a grep tool that could scan XML data for instances of objects (according to some XSD or DTD), and take object state values as arguments to search for.

If it could scan objects in memory I'd like that even better, but XML seems the only likely candidate for a format that a universal tool could parse.

Re:Object grep (1)

atisss (1661313) | more than 2 years ago | (#38306566)

XPath? XSLT?

XPath? (0)

Anonymous Coward | more than 2 years ago | (#38306688)

Many years ago, I abused the capabilities of Flex (the fast lexical analyzer generator) to instrument students' C++ code. I was actually adding reference-counting code to check for leaks (part of that assignment's grading rubric). I just had to parse the code into a nested tree of { } bracing and adjoining text, and then pattern-match on that tree to find class and method definition boundaries, where I inserted code. I think it only broke on one out of about fifty submissions, where I had to intervene and instrument the code by hand instead.

Something like this could be done to handle XML since it is essentially little regular languages embedded in a well-formed tree of angle brackets and quoted strings. But I wouldn't bother with this, since XSLT and XPath exist for your problem...

I haven't read the original paper for this slashdot discussion, but the idea of a grep-like and sed-like tool that could use context-free grammars rather than regular expressions is very interesting to me. The hard part will be making it concise enough to use from the command-line rather than an edit/compile sort of parser-generator experience.

Ooooh! (3, Interesting)

gstoddart (321705) | more than 2 years ago | (#38306464)

As soon as I see "Context-Free Grep", I immediately think of a Context Free Grammar [wikipedia.org] .

That basically implies we can have much more sophisticated rules that match other structural elements the way a language compiler does. Which means that in theory you could do greps that take into account structures a little more complex than just a flat file.

Grep and diff that can be made aware of the larger structure of documents potentially has a lot of uses. Anybody who has had to maintain structured config files across multiple environments has likely wished for this before.

Sounds really cool.

Re:Ooooh! (1)

skids (119237) | more than 2 years ago | (#38306886)

It will be interesting to see what they come up with. From the paper posted above it looks like it will definitely be taking "wisdom" about certain file types, but I hope they also work on some fuzzy guessing modes as well that do not require prior knowledge of the language being parsed.

The main potential for ick factor is whether they can manage to get a set of commandline flags that can be used/learned incrementally so you don't have to memorize a ream of flags just to get something useful done, and can learn a few new ones every time you need to push the envelope of what you already know. (BTW, Judging from the number of times I see a scripting language launched to do things grep can do perfectly well, most people stopped reading the manpage before the -B and -A options.)
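For reference, the -B and -A flags the parent mentions (the log file here is a made-up sample): they print N lines of context before and after each match, a poor man's "give me the surrounding block".

```shell
# A small sample "log" to search.
printf 'one\ntwo\nERROR here\nfour\nfive\n' > log.txt

# -B1: one line of context Before each match; -A1: one line After.
grep -B1 -A1 'ERROR' log.txt
```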

Microsoft Ad (3, Interesting)

lucm (889690) | more than 2 years ago | (#38306628)

I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped into search/filter functions.

Re:Microsoft Ad (0)

Anonymous Coward | more than 2 years ago | (#38306822)

But pipes should ALWAYS be plain text! SOMEONE THINK OF THE CHILDREN!

etc

etc

Re:Microsoft Ad (0)

Anonymous Coward | more than 2 years ago | (#38306856)

So what? Maybe people want a non-proprietary solution that works on more than one OS.

Powershell envy (1)

Tetsujin (103070) | more than 2 years ago | (#38308136)

So what? Maybe people want a non-proprietary solution that works on more than one OS.

If there are such people, and it's not just me, I'd love to oblige them. :) I really need to get crackin'...

Re:Microsoft Ad (1, Funny)

Mars Saxman (1745) | more than 2 years ago | (#38307710)

That's great for all fifty people who use Powershell.

Oh, to suffer the slings and arrows... (1)

Tetsujin (103070) | more than 2 years ago | (#38308104)

I know I'll be modded down

Dude, the only part of your post that I find objectionable is this assumption that you're going to be crucified for posting your thoughts. I know that there are some people on Slashdot who are pretty predictably triggered to shout down certain opinions - just don't assume that everyone here is like that, OK?

I think there's a lot to like about Powershell, and part of me will always be a bit jealous that Windows got a shell with those kinds of capabilities before Linux did. It does indeed seem that what they describe bgrep and bdiff doing could be accomplished in Powershell. I've never been too clear on some of the particulars of how that would be done, though. As I understand it, you can search/filter either XML data streams or a sequence of .NET objects. Would the way to accomplish this, then, be to have a commandlet that opens the source file and passes its contents through as .NET objects? It would be a bit less compact than having the special type handling right in the "find" or "filter" command, but it does lend a certain clarity to things, too...

Re:Microsoft Ad (1)

grcumb (781340) | more than 2 years ago | (#38308116)

I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped in search/filter functions.

Sure, and it's been possible in Perl (for example) forever:

use File::Slurp;

my $multi_line_pattern = join("", @ARGV);
my $text = read_file('filename');       # slurp the whole file into one string
if ($text =~ /$multi_line_pattern/s) {  # /s lets '.' in the pattern span newlines
    # do something useful.
}

The only problem with the above is that it fails in anything other than trivial situations.

The issue isn't passing things through filters, it's doing so in a way that you don't have to write insanely devious and complex filters. This grep tool is still only at the design stage, so I'll not speculate about whether it actually succeeds at this goal or not.

darcs (1)

bcrowell (177657) | more than 2 years ago | (#38306848)

There's a version-control system called darcs (written by the son of a colleague of mine) that incorporates some interesting ideas along these lines. For example, say you have a program with 100,000 lines of code, and there's a function in it called Foo, which is called thousands and thousands of times. You want to change the name of the function to Bar. In a traditional diff-based system, this results in thousands of differences. Darcs is supposed to be able to handle changes like this and recognize that it's only *one* change. It's also supposed to be able to handle the case where programmer A makes this change and checks it in, and then programmer B, who has simultaneously been doing lots of other work on the code, checks in his own changes -- with the old name for the function.

Re:darcs (0)

Anonymous Coward | more than 2 years ago | (#38307874)

Darcs is very smart about patch commutation and as a result it's the tops at cherry-picking patches, but it certainly doesn't operate at all like you describe. Changes to an identifier on a hundred lines are still a hundred changed lines in that one changeset.

Rob Pike called, he wants his idea back (1)

Anonymous Coward | more than 2 years ago | (#38306928)

People have been trying to adapt line-oriented regular expressions to handle other sorts of data since at least the 1980s. Structured regular expressions [cat-v.org] were introduced with the Plan 9 system, but never seem to have caught on elsewhere.

It certainly would be nice to have tools that readily handle multi-line data, rather than forcing everything to fit into a line oriented format. It would be wonderful to be able to fix up indentation in version controlled files without making the history unreadable, for example.

existing tools and suggestion (1)

khipu (2511498) | more than 2 years ago | (#38306970)

PCRE has recursive patterns (available as pcregrep) and .NET has balancing groups, also allowing grep-like operations involving context-free grammars. For XML data, there are various XML query languages that allow wonderfully complex queries over XML structures. There are also refactoring tools that allow syntax-aware searches across source files.

For diff, the situation is a bit more complicated. There are already XML-based diff tools, programming-language-syntax-aware diff tools, and complex edit-distance-based diff tools. It seems difficult to come up with something more generic. Let's say you want to diff source files in a programming language for which no such tool exists. What good is a context-free diff tool going to be? You'd need to specify the entire grammar for the language.

I think the most useful way these people could spend their time and money would be to port PCRE-style recursive patterns and .NET like balancing groups to more UNIX regular expression libraries (foremost, Python).

Terrible idea (4, Insightful)

deblau (68023) | more than 2 years ago | (#38306990)

This violates so many rules of the Unix philosophy [wikipedia.org] that I don't even know where to begin...

FTFA:

Grep has issues with data blocks as well. "With regular expressions, you don't really have the ability to extract things that are nested arbitrarily deep," Weaver said.

If your data structures are so complex that diff/grep won't cut it, they should probably be massaged into XML, in which case you can use XSLT [wikipedia.org] off the shelf. It's already customizable to whatever data format you're working with.

FTFA:

With [operational data in block-like data structures], a tool such as diff "can be too low-level," Weaver said. "Diff doesn't really pay attention to the structure of the language you are trying to tell differences between." He has seen cases where diff reports that 10 changes have been made to a file, when in fact only two changes have been made, and the remaining data has simply been shifted around.

No, 10 changes have been made. The fact that only two substantive changes have been made based on 10 edits is a subjective determination. That is, unless you want to detect that moving a block of code or data from one place to another in a file has no actual effect, in which case good luck because that's a domain-specific hard problem.

Re:Terrible idea (0)

Anonymous Coward | more than 2 years ago | (#38307774)

This violates so many rules of the Unix philosophy [wikipedia.org] that I don't even know where to begin...

So? The Catholic Church arrested Galileo for subverting the Aristotelian philosophy that the sun and stars revolve around the earth. Revered authority can impede progress. Thompson, Ritchie, et al. did great things way back when, but they are not the final word on everything.

If your data structures are so complex that diff/grep won't cut it, they should probably be massaged into XML, in which case you can use XSLT [wikipedia.org] off the shelf. It's already customizable to whatever data format you're working with.

Suppose our "data structures" are source code written in a language such as Java or Python...?

Wait, why? (1)

RandomMonkey (908328) | more than 2 years ago | (#38307158)

Why do we need to write another perl?

Structural Regular Expressions (2)

vAltyR (1783466) | more than 2 years ago | (#38307618)

This reminds me of a paper Rob Pike wrote a while back addressing this problem. His solution was a generalization of regular expressions, which he termed Structural Regular Expressions. [cat-v.org] I'm not sure how these stack up against context-free grammars, but it's an interesting approach that seems at least fairly similar to the Dartmouth work. In any case, I didn't see it as a reference, so I thought I'd mention it.