
How To Build a Web Spider On Linux

kdawson posted more than 7 years ago | from the five-eyes dept.

Programming | 104 comments

IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information (stock data, in this case). Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."


Hmm... (5, Funny)

joe_cot (1011355) | more than 7 years ago | (#16849238)

Yes, but does it run on ... damn.

Re:Hmm... (1)

martin-boundary (547041) | more than 7 years ago | (#16850094)

... the internet?

Re:Hmm... (-1, Troll)

Anonymous Coward | more than 7 years ago | (#16850536)

Yes, it runs on every evironment I know: debian, suse, gentoo, fedora etc.. you just need an installed ruby and python interpreter. what's funny about that comment?

Re:Hmm... (1)

lpcustom (579886) | more than 7 years ago | (#16850704)

Is "evironment" a new buzzword I haven't heard of?

Re:Hmm... (1)

Fordiman (689627) | more than 7 years ago | (#16851212)

So that's what they're called. I've been building them for years, both for personal data collection and for research for professors I work for (I have a couple of acknowledgements to this effect). I've been calling them 'site scrapers' and 'data reapers'.

And I generally write 'em in PHP. Makes 'em nice and lightweight to redistribute (php.exe and php5ts.dll are usually all that's needed. Sometimes php_http.dll as well.)

Re:Hmm... (1)

moro_666 (414422) | more than 7 years ago | (#16853256)

You must have tons of time on your hands for those crawlers ....

A modern crawler has to overcome very annoying problems like nslookup delays and network lags caused by third parties. If you can write it in a threaded environment, good for you; if you can drop the single-connection approach entirely and go for a select-based or, even better, an epoll-based version that can crawl a thousand sites at a time, even better.

For simple tasks even the ithreads of Perl would do. But I'd suggest a language that supports threads natively (for ease of writing, perhaps Python or Ruby?).

Of course, if you are just crawling one site, a simple PHP script could do it as well...
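For readers who want to try the overlapping-fetches idea without writing a select/epoll loop by hand, here is a minimal thread-pool sketch in Python 3. It only illustrates the technique the parent describes, it is not the commenter's code, and the URLs are placeholders.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    # Any one slow DNS lookup or laggy server only stalls its own worker thread.
    try:
        with urlopen(url, timeout=10) as resp:
            return url, resp.status, len(resp.read())
    except OSError as exc:  # DNS failures, timeouts, refused connections
        return url, None, str(exc)

urls = ["http://example.com/", "http://example.org/", "http://example.net/"]
with ThreadPoolExecutor(max_workers=20) as pool:  # 20 fetches in flight at once
    for url, status, info in pool.map(fetch, urls):
        print(url, status, info)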

Re:Hmm... (1)

chromatic (9471) | more than 7 years ago | (#16857204)

What do you mean "natively"? Ruby 1.8, at least, doesn't use OS threads. Perl ithreads map to native threads, where they're available.

Re:Hmm... (1)

try_anything (880404) | more than 7 years ago | (#16858714)

You must have tons of time on your hands for those crawlers ....

You can make any program arbitrarily hard by increasing the generality and performance requirements, but there's nothing inherently difficult about screen-scraping a web site. I've written a few scripts to extract data from web sites, and they're quite simple if your aims are modest. The first crawler I wrote was also my first Perl project, my first time using HTTP, and my first time dealing with HTML. Given a date, it generated a URL, visited that page, extracted some links, followed the links, extracted some text from the pages it found, and printed a report for me. That qualifies as a "crawler" or "spider" given the definition in play here (it "traverses the Internet" by following links), but there was nothing especially hard about it.

Re:Hmm... (3, Interesting)

strstrep (879828) | more than 7 years ago | (#16853646)

PHP lightweight? Ha!

The PHP interpreter is over 5 megabytes in size. And it isn't thread-safe. That's a lot of memory overhead for a program that's going to be blocking on I/O most of the time, seeing how you'll have to fork() a new process for each new "thread" you want.

Also, languages like Perl and Python have binaries that are about 1 megabyte in size. Now, while they'll probably need to load in extra files for most practical applications, these extra files are typically small. Most importantly, Perl and Python are thread-safe.

Perl, for example, includes libraries such as Thread::Queue, which allows you to very easily create a threading model with worker threads, without having to worry too much about condition variables, mutexes, and the like.

Disclaimer: All measurements done on x86 Debian Linux.
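The worker-thread model the parent describes for Perl's Thread::Queue maps onto Python's standard library in roughly the following way. This is a sketch of the same pattern, not the Perl API, and the URLs are placeholders.

import queue
import threading
from urllib.request import urlopen

jobs = queue.Queue()  # the queue handles all the locking for us

def worker():
    while True:
        url = jobs.get()
        if url is None:  # sentinel: time to shut down
            break
        try:
            with urlopen(url, timeout=10) as resp:
                print(url, resp.status)
        except OSError as exc:
            print(url, "failed:", exc)
        finally:
            jobs.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()
for url in ["http://example.com/", "http://example.org/"]:
    jobs.put(url)
jobs.join()            # wait until every queued URL has been processed
for _ in workers:
    jobs.put(None)     # tell each worker to exit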

Crawling efficiently (5, Informative)

BadAnalogyGuy (945258) | more than 7 years ago | (#16849262)

Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they use a lookup to scan the queue to see if the link is already loaded, and discard it if so.

Better to use an associative array to cache the links, since lookup is O(1). The queue's lookup time is O(n), and if n gets large, so does the lookup time; since you check every link, the worst case over the whole crawl is O(n^2). A hash (associative array) performs the same checks in O(n) total. /([\W_\-]@\W+)/gs
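A minimal Python sketch of that suggestion: keep the queue for crawl order and a set purely for the O(1) membership test. The names here are illustrative, not the article's code.

from collections import deque

frontier = deque()  # preserves FIFO crawl order
seen = set()        # average O(1) membership checks

def enqueue(url):
    # Only queue a URL the first time we ever see it.
    if url not in seen:
        seen.add(url)
        frontier.append(url)

enqueue("http://example.com/")
enqueue("http://example.com/")  # duplicate is silently ignored
print(list(frontier))           # ['http://example.com/']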

Re:Crawling efficiently (2, Insightful)

Anonymous Coward | more than 7 years ago | (#16849714)

Python has a built-in set type. I have no idea why they did not use it.

Re:Crawling efficiently (2, Interesting)

Mr2cents (323101) | more than 7 years ago | (#16849894)

Maybe because they don't know the first thing about efficiency? You'd be surprised how much programmers don't know/care about efficiency. Once, incidentally also on a crawler (a student project), I improved a function reading a tree of URLs from 1 hour(!) to 0.1 seconds! The guy tested it on an example with 10 URLs and it worked, but his implementation was O(n^2) and involved copying huge amounts of memory at each step. Don't ask me how he thought this would be scalable.

Re:Crawling efficiently (2, Funny)

Zonk (troll) (1026140) | more than 7 years ago | (#16850346)

Maybe because they don't know the first thing about efficiency? You'd be surprised how much programmers don't know/care about efficiency.


If you're surprised about programmers not knowing/caring about efficiency, do you actually use a computer?

Re:Crawling efficiently (1)

Barnoid (263111) | more than 7 years ago | (#16849812)

True, but irrelevant. The queue's lookup/insertion time should be negligible compared to the time required to connect to the sites, download, and parse the content.

Before your queue gets big enough for lookup/insertion time to become an issue, you'll first have to worry about bigger hard disks and more bandwidth.

Re:Crawling efficiently (0)

Anonymous Coward | more than 7 years ago | (#16850706)

You're missing the point.

Re:Crawling efficiently (1)

Rakshasa Taisab (244699) | more than 7 years ago | (#16850624)

Tell me then: how do they make an O(1) FIFO queue out of the associative array?

No, not really interested in the answer, as I'm just pointing out that the code suddenly becomes (unnecessarily) much more complicated.

Re:Crawling efficiently (1)

BadAnalogyGuy (945258) | more than 7 years ago | (#16852084)

if !exists hash[currentURL]
      add hash, [currentURL]
      append array, [currentURL]


That wasn't so hard, was it?

Re:Crawling efficiently (1)

Fordiman (689627) | more than 7 years ago | (#16851232)

I know. Also, I'm not exactly certain why they used Ruby.

My favorite method is to use PHP as a backend for mshta; you can be guaranteed it'll run on any Windows machine, and you have the benefit that a linux machine will at least be able to run the back-end.

The 90s called (5, Funny)

dave562 (969951) | more than 7 years ago | (#16849264)

They want their technology back.

Re:Obligatory (1)

poormanjoe (889634) | more than 7 years ago | (#16849362)

I for one welcome our out-of-date-eight-legged overlords!

Re:Obligatory (2, Funny)

k33l0r (808028) | more than 7 years ago | (#16850316)

Has there ever been a news story on Slashdot that doesn't have a "I, for one, welcome our new [Insert here] overlords" comment attached to it?

Re:Obligatory (1)

Bloke down the pub (861787) | more than 7 years ago | (#16850558)

I for one welcome our new "I for one welcome our new X overlords" overlords.

Re:Obligatory (1)

manastungare (596862) | more than 7 years ago | (#16851498)

I, for one, welcome our clichéd-overlord-joke-bearing Slashdot comments.

Re:Obligatory (1)

tehcyder (746570) | more than 7 years ago | (#16867182)

Has there ever been a news story on Slashdot that doesn't have a "I, for one, welcome our new [Insert here] overlords" comment attached to it?
No.

Next question?

Re:The 90s called (0)

Anonymous Coward | more than 7 years ago | (#16850194)

the 80s called, they want their joke back....

What's the point? (1)

XorNand (517466) | more than 7 years ago | (#16849274)

Why would anyone have a need to write a simple spider nowadays? In 2006, there has to be a better way than just following links. For example, it would be interesting to see something that crawled the various social bookmarking sites and correlated the various terms. For example, User A on Delicious and User B on Stumble Upon both bookmark a link about Pink Floyd and another one about Led Zep. If I'm searching for something about Floyd, the system could recommend some cool info about Led Zep too. (Email me if you need to know where to send my royalty checks).

Re:What's the point? (1)

Anonymous Coward | more than 7 years ago | (#16849398)

There are many reasons to build a web crawler. For example, imagine that you have a website with quite a few outgoing links to other sites. It is not hard to imagine that those external links may change over time... With your own spider, you could check that all those pages are still there and that they still contain the information you intended to link to. If a site now has a permanent redirect to a new page, you could automatically check whether the content is still the same and update your page with the new link. Links to pages that are unavailable could be temporarily or permanently removed.

I am in fact using a few different home-grown web crawlers myself!
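A Python 3 sketch of the link-checking spider described above, flagging outgoing links that are dead or that now redirect somewhere else. The link list is a placeholder for whatever your own pages actually link to.

import urllib.error
import urllib.request

def check(url):
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.geturl()  # geturl() reveals where redirects landed
    except urllib.error.HTTPError as e:
        return e.code, url
    except urllib.error.URLError:
        return None, url                       # DNS failure, timeout, connection refused

for link in ["http://example.com/", "http://example.org/moved-page"]:
    status, final = check(link)
    if status is None or status >= 400:
        print("dead link:", link)
    elif final != link:
        print("redirected:", link, "->", final)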

Actually... (3, Interesting)

SanityInAnarchy (655584) | more than 7 years ago | (#16849602)

Some websites do not have good search functionality. Sometimes it's an area that Google doesn't crawl (robots.txt and such), and sometimes I'm looking for something very, very specific.

Regardless, I do, in fact, build spiders. For instance, in an MMO I play, all users can have webpages, so it's very useful to have a spider as part of a clan/guild/whatever to crawl the webpages looking for users who have illegal items and such. In a more general way, there is a third-party site which collects vital statistics of everyone who puts those in their user page, so you can get lists of the most powerful people in the game, the richest people, etc.

Re:Actually... (1)

pan_piper (963195) | more than 7 years ago | (#16855734)

So... if you want to discuss your illegal items it would be better to design your site in Flash?

Re:Actually... (1)

SanityInAnarchy (655584) | more than 7 years ago | (#16857320)

Not in this game. The pages are not actually user-created, they are generated daily from the actual game data and hosted by the company who runs the game. The only thing you have control over is whether your inventory/bank/whatever appears on the page, and I can just as easily scan for people who refuse to list them.

So, it's actually much more efficient to scan for a specific string that I know will be there for a particular item -- it's literally impossible for them to try to mask it with, say, leetspeak.

Re:What's the point? (0)

Anonymous Coward | more than 7 years ago | (#16850744)

No self-respecting Led Head would want to read about Pink Floyd.

Re:What's the point? (0)

Anonymous Coward | more than 7 years ago | (#16859026)

No self-respecting Floydian would want to read about Dead Zeppelin.

Re:What's the point? (1)

The_Wilschon (782534) | more than 7 years ago | (#16851086)

One might want to study social networks. What better way to do this than to make a graph (as in the nodes and edges type) of myspace or facebook and study that? How are you going to do this? Well, seems like making a spider would be a quite sensible way.

Re:What's the point? (1)

brunascle (994197) | more than 7 years ago | (#16852756)

Why would anyone have a need to write a simple spider nowadays?
Simple: to add a search engine to your site without having to rely on someone else's code.

In fact, I'm going to have to do this fairly soon. I've already written a search for articles, but now customers are complaining that they can't search for "customer service." Bah!

Unfortunately, IBM's spider example is pretty pathetic.

Re:What's the point? (1)

try_anything (880404) | more than 7 years ago | (#16859278)

Why would anyone have a need to write a simple spider nowadays?

You're right. Web 2.0 changes everything. Some people are just conservative, though. My parents are still using bookshelves even though maglev trains made bookshelves obsolete decades ago.

In 2006, there has to be a better way than just following links. For example, it would be interesting to see something that crawled the various social bookmarking sites and correlated the various terms.

You mean follow links and *gasp* do something with the data you find? It'll never happen. The experts have found that following random links for no particular reason and ignoring the data found there is sufficient for all purposes, and your spunky challenge to this paradigm will not be allowed to stand!

Linux to a crawl (1)

The_Abortionist (930834) | more than 7 years ago | (#16849288)

Can linux withstand that kind of activity? No.

After less than a month it would go into a paging frenzy.

If we are talking about some kind of cluster with regular reboots for individual nodes, I could see that. Otherwise, I would recommend Solaris.

downloads (4, Informative)

Bananatree3 (872975) | more than 7 years ago | (#16849298)

For those of us who don't have them, here are the basics:

Wget: http://www.gnu.org/software/wget/ [gnu.org]

Curl: http://curl.haxx.se/ [curl.haxx.se]

yes, I did RTFA (1)

Bananatree3 (872975) | more than 7 years ago | (#16849324)

The article mostly talks about scripting languages. And yes, I do know wget comes with a lot of Linux distros, but not EVERYONE has it. So there, I DID read TFA.

Re:yes, I did RTFA (4, Funny)

Faylone (880739) | more than 7 years ago | (#16849422)

You RTFA? Are you sure you're in the right place?

Re:downloads (1)

WoLpH (699064) | more than 7 years ago | (#16849510)

And if you really don't want to RTFA and still want to rip a website, try this command (that is, if you have wget installed) "wget -m http://www.server.com/ [server.com] "
A partially better alternative is httrack; it has more features but also tends to be less stable ;)

ATTN: Windows/Linux Refugees! (-1, Offtopic)

Anonymous Coward | more than 7 years ago | (#16849306)

The only thing more pathetic than a PC user is a PC user trying to be a Mac user. We have a name for you people: switcheurs.

There's a good reason for your vexation at the Mac's user interface: You don't speak its language. Remember that the Mac was designed by artists [atspace.com] , for artists [atspace.com] , be they poets [atspace.com] , musicians [atspace.com] , or avant-garde mathematicians [atspace.com] . A shiny new Mac can introduce your frathouse hovel to a modicum of good taste, but it can't make Mac users out of dweebs [atspace.com] and squares [atspace.com] like you.

So don't force what doesn't come naturally. You'll be much happier if you stick with an OS that suits your personality. And you'll be doing the rest of us a favor, too; you leave Macs to Mac users, and we'll leave beige to you.

Just what the internet needs... (-1)

Anonymous Coward | more than 7 years ago | (#16849366)

... more spiders. While it is true that "you can easily develop Web spiders", there are lots of pitfalls for the careless. At minimum, please avoid hammering sites in your crawling, and obey posted spider.txt notices.

Re:Just what the internet needs... (3, Informative)

ComaVN (325750) | more than 7 years ago | (#16849486)

I think that's robots.txt, *not* spider.txt

Re:Just what the internet needs... (1)

scdeimos (632778) | more than 7 years ago | (#16849560)

How does "spider.txt" get an Insightful when it's "robots.txt"? Sheesh, bump the Mods Roster.

Re:Just what the internet needs... (0)

Anonymous Coward | more than 7 years ago | (#16850020)

Also, following relative and site-relative links, and obeying things like "base href", "href='javascript:...'" and the case-sensitivity of URLs seems to be too difficult for many beginning crawler programmers.

Hardly linux-specific (5, Insightful)

h_benderson (928114) | more than 7 years ago | (#16849384)

All my love for Linux aside, this has nothing to do with Linux, the kernel (or even GNU/Linux, the OS). It works just as well on any other Unix derivative or even Windows.

some points (5, Interesting)

cucucu (953756) | more than 7 years ago | (#16849396)

  • Don't forget to check and respect robots.txt [robotstxt.org] . Python [python.org] has a module [python.org] that helps you parse that file (a minimal check is sketched just after this list).
  • Samie [sourceforge.net] and its Python port Pamie [sourceforge.net] are your friends. You can automate IE so your script is treated as a human and not discriminated against as a robot.
  • I use such beasts to do one-click time reporting at work and one-click cartoon collecting from my favorite newspaper.
  • And once I even repeatedly voted on an online poll and changed the course of history.
  • Ah, yes, TFA was about building a spider on Linux. I didn't check whether my one-click IE scripts work on IE/Wine/Linux.
  • If I write a one-click script for online shopping, does it infringe the infamous Amazon patent?
  • When will Firefox's automation capabilities match those of IE?
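The robots.txt check mentioned in the first bullet takes only a few lines with the standard library module (urllib.robotparser in Python 3; plain robotparser in the Python 2 of that era). The URL and user-agent name below are made up.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # fetch and parse the file

if rp.can_fetch("ExampleSpider/0.1", "http://example.com/private/page.html"):
    print("allowed to crawl")
else:
    print("robots.txt says keep out")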

Re:some points (1)

coaxial (28297) | more than 7 years ago | (#16849536)

Don't forget to check and respect robots.txt. Python has a module that helps you parse that file

[sarcasm] Why? Google doesn't. [/sarcasm]

And once I even repeatedly voted on an online poll and changed the course of history.

So did I! Back in 2000 I got the Underwear Gnomes episode of South Park aired.

I think the best use of a spider in an online poll was by whatever Red Sox fan voted a million times for Nomar Garciapara to make the all star team back in 2000.

Re:some points (1)

SanityInAnarchy (655584) | more than 7 years ago | (#16849586)

You don't want to automate IE. Aside from the fact that it's IE, you don't want to use any browser unless you have to. Mechanize is your friend, and you can always change the user agent string if you want to be a jackass.

Firefox's automation capabilities don't need to match those of IE, for pretty much the same reason. The only thing Mechanize can't do is JavaScript, and there are vague plans about that.

Re:some points (1)

VGPowerlord (621254) | more than 7 years ago | (#16849604)

Are you sure that automation still works in IE7?

Re:some points (1)

cucucu (953756) | more than 7 years ago | (#16849682)

no, I didn't update. I have FF for human browsing and IE6 for robot browsing.

Re:some points (1)

Victor Antolini (725710) | more than 7 years ago | (#16850028)

In soviet Windows, IE6 robot's browse you!

Re:some points (0)

Anonymous Coward | more than 7 years ago | (#16849640)

Can't say how automation in IE works, but won't the extensions concept for firefox match that automation?
i.e. an extension like greasemonkey ;)

Re:some points (3, Informative)

killjoe (766577) | more than 7 years ago | (#16849694)

"When will Firefox's automation capabilities match those of IE?"

It's always had it. Look up XUL some day. The entire browser is written in xul.

Re:some points (1)

Gr8Apes (679165) | more than 7 years ago | (#16851412)

When will Firefox's automation capabilities match those of IE?


You have that wrong. It's when will IE's capabilities (automation and otherwise) catch up with FireFox.

Re:some points (1)

IchBinEinPenguin (589252) | more than 7 years ago | (#16860368)

When will Firefox's automation capabilities match those of IE?

Yeah, 'cos I really miss having my machine automatically turned into a Zombie.......

Re:some points (1)

jdigriz (676802) | more than 7 years ago | (#16878400)

>When will Firefox's automation capabilities match those of IE?

Now. http://www.openqa.org/selenium-ide/ [openqa.org]

Web crawler in one line (1)

Znort (634569) | more than 7 years ago | (#16849446)

'Steve? Send the web spiders.' (1)

Channard (693317) | more than 7 years ago | (#16849466)

Dammit, I was hoping this article was about the evolution of Dr Weird's phone spiders, mechanical creatures that could be sent down your cable line to maul anyone sending you phishing emails and spam.

Re:'Steve? Send the web spiders.' (1)

kfg (145172) | more than 7 years ago | (#16849530)

I was hoping this article was about the evolution of Dr Weird's phone spiders

It's a web spider, man; not a killer robot spider, but I'll tell you it's a web spider from South Jersey if it'll make you feel any better.

KFG

MORE CORN!!! (1)

everphilski (877346) | more than 7 years ago | (#16851788)

It's not different at all, steve!

Oh sweet Jesus! (3, Insightful)

msormune (808119) | more than 7 years ago | (#16849474)

Pull the article out. The last thing we need is more indexing bots.

crawling is not so trivial (2, Interesting)

cucucu (953756) | more than 7 years ago | (#16849502)

As the two students who started a little web search company found out, crawling the web is not trivial: http://infolab.stanford.edu/~backrub/google.html [stanford.edu] . An excerpt follows.


Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains a its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.

It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.
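One detail worth pulling out of that excerpt is the per-crawler DNS cache. A cheap way to approximate it in Python is to memoize the resolver. This is only a sketch of the idea, not Google's implementation; a real crawler would also want to honour DNS TTLs and handle lookup failures.

import functools
import socket

@functools.lru_cache(maxsize=100000)
def resolve(host):
    # Each hostname is resolved at most once per crawler process.
    return socket.gethostbyname(host)

print(resolve("example.com"))  # first call performs the DNS lookup
print(resolve("example.com"))  # second call is served from the cache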

That reminds me. (2, Informative)

archeopterix (594938) | more than 7 years ago | (#16849680)

Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix.
Unfortunately, many web developers still ignore the inevitable, leaving their sites vulnerable to the dreaded Googlebot "attack". While most of the spider developer manuals (TFA included) stress the importance of being polite (respect robots.txt & friends), most of the "become teh Web Master in x days" books don't even mention robots.txt. Go figure.

For a good chuckle, see The Spider of Doom [thedailywtf.com] on the Daily WTF.

And please use robots.txt.

And go see Google Webmaster tools [google.com] .

And don't wear socks with sandals. Well, ok, this one is optional.

Quality of article? (2, Insightful)

interp (815933) | more than 7 years ago | (#16849514)

I've never programmed in Ruby, but I think the comment in Listing 1 says it all:
"Iterate through response hash"

Why would somebody want to do that?
A quick net search "reveals": A simple resp["server"] is all you need.
Maybe the article was meant to be posted on thedailywtf.com?

Re:Quality of article? (0)

Anonymous Coward | more than 7 years ago | (#16850368)

It's not WTF quality, but is less than elegant. This would be better:

require 'net/http'

ARGV.each do |host|
    begin
        res = nil
        Net::HTTP.start(host, 80) do |http|
            res = http.head('/')
        end

        case res
        when Net::HTTPSuccess
            puts "The server at #{host} is #{res['server']}"
        else
            puts "Request failed"
        end
    rescue Exception => e
        puts "Request failed: #{e.message}"
    end
end

I hate (0)

Anonymous Coward | more than 7 years ago | (#16849566)

these Eight legged freaks!!!

Re:I hate (1)

kfg (145172) | more than 7 years ago | (#16849748)

Just because you're paranoid doesn't mean people aren't crawling your site.

KFG

Re-inventing a square wheel (5, Insightful)

rduke15 (721841) | more than 7 years ago | (#16849582)

Basically, the article gives you ruby and python examples of how to get web pages, and (badly) parse them for information. The same thing everyone has been doing for at least a decade with Perl and the appropriate modules, or whatever other tools, except that most know how to do it correctly.

The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:

HEAD slashdot.org | grep 'Server: '

But it gets worse. To extract a quote from a page, the second script suggests this:

stroffset = resp.body =~ /class="price">/
subset = resp.body.slice(stroffset+14, 10)
limit = subset.index('<')
print ARGV[0] + " current stock price " + subset[0..limit-1] +
" (from stockmoney.com)\n"

You don't need to know Ruby to see what it does: it looks for the first occurrence of 'class="price">' and just takes the 10 characters that follow. The author obviously never used that sort of thing for more than a couple of days, or he would know how quickly that will break and spit out rubbish.

Finally, there is a Python script. At first glance, it looks slightly better. It uses what appears to be the Python equivalent of HTML::Parse to get links. But a closer look reveals that, to find links, it just gets the first attribute of any a tag and uses that as the link. Never mind if the 1st attribute doesn't happen to be "href".

I suppose the only point of that article was the IBM links at the end:

Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

And that is in a section for Linux developers on the IBM site? Maybe they did copy stuff from SCO after all?...
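For comparison with the quoted slice-and-offset approach, the same extraction can be done against parsed HTML, selecting the element by its class attribute instead of by byte position. This standard-library sketch only illustrates the idea (the class name "price" comes from the quoted snippet; everything else is made up), and as later comments point out, BeautifulSoup or hpricot make it even shorter.

from html.parser import HTMLParser

class PriceFinder(HTMLParser):
    # Collects the text of the first element whose class attribute is "price".
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if self.price is None and dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.price = data.strip()
            self.in_price = False

finder = PriceFinder()
finder.feed('<td class="price">12.34</td>')  # stand-in for the fetched page body
print(finder.price)                           # 12.34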

Re:Re-inventing a square wheel (1)

biffta (961416) | more than 7 years ago | (#16849776)

I'm sensing that you didn't like the script then?

Re:Re-inventing a square wheel (1)

kayditty (641006) | more than 7 years ago | (#16849904)

uhh.. what exactly is HEAD slashdot.org? I'm guessing you mean something like:
perl -e 'print "HEAD / HTTP/1.1\nHost: slashdot.org\n\n"' | nc slashdot.org 80 | grep -i Server

Re:Re-inventing a square wheel (4, Insightful)

rduke15 (721841) | more than 7 years ago | (#16849994)

what exactly is HEAD slashdot.org

It's a (perl) script which comes with libwww-perl [linpro.no] which either is now part of the standard Perl distribution, or is installed by default in any decent Linux distribution.

If you don't have HEAD, you can type a bit more and get the server with LWP::Simple's head() method (then you don't need grep):

$ perl -MLWP::Simple -e '$s=(head "http://slashdot.org/" )[4]; print $s'

Either way is better than those useless 12 lines of ruby (I'm sure ruby can also do the same in a similarly simple way, but that author just doesn't have a clue)

Re:Re-inventing a square wheel (1)

Xenna (37238) | more than 7 years ago | (#16853638)

Hah, I must have written at least a few dozen lib-www scripts, but I didn't know about HEAD.

I always used lynx -source -head http://slashdot.org/ [slashdot.org] which is a lot more typing...

Thanks,
X.

Re:Re-inventing a square wheel (0)

Anonymous Coward | more than 7 years ago | (#16878240)

During the LWP module install, it asks if you want to install HEAD, GET, POST etc.
But the head/HEAD name clash is a problem for some OS X users (the default filesystem is case-insensitive):

http://use.perl.org/~ct/journal/6556 [perl.org]

Okay kids... (4, Informative)

Balinares (316703) | more than 7 years ago | (#16849960)

Just so people who may come across this know, if you're going to do some HTML or XHTML parsing in Python, you'd be insane not to use BeautifulSoup [crummy.com] or a similar tool.

Example to find all links in a document:
from BeautifulSoup import BeautifulSoup
for tag in BeautifulSoup(html_document).findAll("a"):
  print tag["href"]
Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL and so on, look into the urllib2 module that ships natively with Python.
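A small illustration of the urllib2 features just mentioned, in the Python 2 of the era: an opener that keeps cookies between requests and sends a custom User-Agent. The URL and agent string are placeholders.

import cookielib
import urllib2

jar = cookielib.CookieJar()  # remembers cookies across requests
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.addheaders = [("User-Agent", "ExampleSpider/0.1")]

html = opener.open("http://example.com/").read()
print len(html), "bytes,", len(jar), "cookies"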

Re:Okay kids... (1)

arevos (659374) | more than 7 years ago | (#16852258)

I couldn't agree more. The author also neglects to use useful standard functions like urllib.urlopen, instead building his own HTTP downloader function. He'd also do well to use urlparse.urljoin to turn a relative href attribute into an absolute URL, and urlparse.urlparse to check things like the protocol and host.

For example:
from BeautifulSoup import BeautifulSoup
from urllib import urlopen
from urlparse import urljoin, urlparse
 
visited_urls = set()
url_stack = []
 
for tag in BeautifulSoup(urlopen(url)).findAll("a"):
    link_url = urljoin(url, tag["href"])
    if urlparse(link_url)[0] == "http" and link_url not in visited_urls:
        url_stack.append(link_url)

Re:Okay kids...(in Ruby) (2, Interesting)

amran (686191) | more than 7 years ago | (#16858416)

I couldn't resist - in Ruby, using the beautiful (but much underrated) hpricot [whytheluckystiff.net] library:

doc = Hpricot(open(html_document))
(doc/"a").each { |a| puts a.attributes['href'] }

Check it out - I've been using it for a project, and it's really fast and really easy to use (supports both xpath and css for parsing links). For spidering you should check out the ruby mechanize [rubyforge.org] library (which is like perl's www-mechanize, but also uses hpricot, making parsing the returned document much easier).

Re:Re-inventing a square wheel (1)

matvei (568098) | more than 7 years ago | (#16850098)

Finally, there is a Python script. At first glance, it looks slightly better. It uses what appears to be the Python equivalent of HTML::Parse to get links. But a closer look reveals that, to find links, it just gets the first attribute of any a tag and uses that as the link. Never mind if the 1st attribute doesn't happen to be "href".

What bugs me the most about this article is that the author keeps using the most generic libraries he can find instead of something written for this exact task. He should have used WWW::Mechanize for Perl [cpan.org] or mechanize for Python [sourceforge.net] . I'm sure there's something like this for Ruby, too.

Re:Re-inventing a square wheel (1)

nkv (604544) | more than 7 years ago | (#16850312)

I think the more relevant issue is that this is a bit of a toy program. You fetch a page, pseudo-parse it, find links etc. Why is this such a big source of news? The code is bad, the objective is somewhat simple and it's hardly a Linux specific thing (like someone else here mentioned).

Re:Re-inventing a square wheel (1)

Bogtha (906264) | more than 7 years ago | (#16850478)

The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:

HEAD slashdot.org | grep 'Server: '

This code won't catch 404s and other errors. Theirs will. Furthermore, assuming the Ruby library is conformant, their code can deal with multi-line headers, while yours would break.

Things like grep aren't suitable for parsing HTTP responses. You might get results for simple cases, but there are all kinds of corner cases out there that require a proper script. Go ahead and use grep for quick hacks, but you're causing yourself trouble down the line if you expect to use it for anything non-trivial like a spider.

Re:Re-inventing a square wheel (1)

DJDutcher (823189) | more than 7 years ago | (#16851704)

More amusing is HEAD slashdot.org | grep Bender

Re:Re-inventing a square wheel (1)

ChaosDiscord (4913) | more than 7 years ago | (#16854882)

Indeed. For most of my simple spidering needs I've found Perl's WWW::Mechanize to be a dream. I say what I mean: go get this page, find a link labeled "Today's Story" and follow it, on the resulting page find the second form and fill in the username and password fields with $username and $password, click submit, return the resulting page. I've found it useful for scraping sites with regular updates that have unpredictable URLs but constant links. Perl.com's "Screen-scraping with WWW::Mechanize [perl.com] " is a good introduction, then check out the full documentation [cpan.org] .
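The flow described above maps almost one-to-one onto the Python mechanize port mentioned elsewhere in this thread. A hedged sketch only: the URL, link text, form position, and field names are all hypothetical.

import mechanize  # the Python port of WWW::Mechanize

br = mechanize.Browser()
br.set_handle_robots(True)            # stay polite: honour robots.txt
br.open("http://example.com/")
br.follow_link(text="Today's Story")  # follow the link by its visible text
br.select_form(nr=1)                  # the second form on the page
br["username"] = "me"                 # field names are made up
br["password"] = "secret"
response = br.submit()
print(response.read()[:200])          # first bytes of the resulting page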

^sh1t (-1, Troll)

Anonymous Coward | more than 7 years ago | (#16849692)

fate. Let's 8ot be

It's a trap! (2, Funny)

radu.stanca (857153) | more than 7 years ago | (#16849712)

Ah, I can see it clearly now!

1. Post a decoy article to Slashdot (it includes Linux in the subject) with new spam tricks
2. Watch whether spam increases 30% over the next few days
3. Bribe Cowboy Neal with 10G of midget lesbian pr0n and get the IP addresses of the article's readers
4. Load shotgun and make the world a better place!

Been there, done that (1)

xarak (458209) | more than 7 years ago | (#16849886)


I guess most male CS students will have coded something similar at least once to D/L pr0n.

I did one in shell and one in TCL/TK.

Re:Been there, done that (1)

pandrijeczko (588093) | more than 7 years ago | (#16850770)

Ah! So I can tell the missus it was *YOU* who put those images on my PC then!

User-Agent (1, Troll)

Joebert (946227) | more than 7 years ago | (#16849976)

They forgot to set the User-Agent header to IE.

Reinventing the wheel (1, Interesting)

Anonymous Coward | more than 7 years ago | (#16850504)

I know, I know. Flame me. But I found Heritrix http://crawler.archive.org/ [archive.org] to be a very polished package. I used it for my Masters research and found that it is very extensible. Useful if you are doing real crawling, i.e. not concentrating on one site.

Incorrect Title (2)

OneSmartFellow (716217) | more than 7 years ago | (#16850604)


Should be: "How Not ..."

I don't think I am alone in my thinking

The guy can't even code! (0)

Anonymous Coward | more than 7 years ago | (#16850636)

Yep, it's _very_ intelligent to loop through a dictionary for a specific item. (I may be wrong, since I wouldn't even have nightmares about coding Ruby, but it sure as hell looks like it...)

Nostarch press book (1)

praxis22 (681878) | more than 7 years ago | (#16851742)

No Starch Press is releasing a book about this soon; they had a mockup on display at the Frankfurt Book Fair.

What about Archie, Jughead, or Veronica (0)

Anonymous Coward | more than 7 years ago | (#16852252)

Don't forget that the Internet is not just HTTP...

http://en.wikipedia.org/wiki/Archie_search_engine [wikipedia.org]

Walk the dom directly (1)

johnpeb (940443) | more than 7 years ago | (#16854476)

Once I had to collect a lot of info from a website. I used Java, wget, and some Java HTML parser library (possibly JTidy). Anyway, the code was very short and clean. I'd recommend DOM walking over other solutions when the data isn't trivial.

screen-scraper (1)

toddcw (134666) | more than 7 years ago | (#16855688)

screen-scraper (http://www.screen-scraper.com/ [screen-scraper.com] ) runs fabulously on Linux, and integrates well with most modern programming languages. It can save all kinds of time over writing Perl and Python scripts. There's a free (as in beer) version available, and a pro version if more features are wanted.

Re:screen-scraper (0)

Anonymous Coward | more than 7 years ago | (#16859910)

You forgot to mention that you wrote that particular piece of software, and that you're hoping this will drum up some business for you.

I did similar things in college... (1)

GWBasic (900357) | more than 7 years ago | (#16865520)

I did similar things in college with Perl. (shudders*) The programs were OS-neutral; I think I developed mine in Windows under Cygwin.

*Yes, I know Slashdot is written in Perl.

Slashdot: Drivel for idiots. (0)

Anonymous Coward | more than 7 years ago | (#16871300)

That's it.

I'm leaving.

You won't have Anonymous Coward to kick around anymore.

Who needs this much drivel, from the idiots who write the article, to the idiots who submit it, to the idiots who approve it?

Garrrrgh.