
Company Offers Customizable Web Spidering

ScuttleMonkey posted more than 4 years ago | from the my-legs-are-longer-than-yours dept.

The Internet

TechReviewAl writes "A company called 80legs has come up with an interesting new web business model: customized, on-demand web spidering. The company sells access to its spidering system, charging $2 for every million pages crawled, plus a fee of three cents per hour of processing used. The idea is to offer Web startups a way to build their own web indexes without requiring huge server farms. 'Many startups struggle to find the funding needed to build large data centers, but that's not the approach 80legs took to construct its Web crawling infrastructure. The company instead runs its software on a distributed network of personal computers, much like the ones used for projects such as SETI@home. The distributed computing network is put together by Plura Processing, which rents it to 80legs. Plura gets computer users to supply unused processing power in exchange for access to games, donations to charities, and other rewards.'"
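
Back of the envelope, the quoted rates work out as below (a sketch; the job size and processing hours are hypothetical numbers, not 80legs figures):

<ecode>
// Cost of an 80legs job at the rates quoted in the summary.
public class CrawlCost {
    static final double DOLLARS_PER_MILLION_PAGES = 2.00; // $2 per 1M pages crawled
    static final double DOLLARS_PER_CPU_HOUR = 0.03;      // 3 cents per processing hour

    static double cost(long pages, double cpuHours) {
        return (pages / 1000000.0) * DOLLARS_PER_MILLION_PAGES
                + cpuHours * DOLLARS_PER_CPU_HOUR;
    }

    public static void main(String[] args) {
        // Hypothetical job: 50 million pages, 200 hours of processing.
        System.out.printf("$%.2f%n", cost(50000000L, 200)); // prints $106.00
    }
}
</ecode>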

46 comments

Ah, abusing someone else's bandwidth... (3, Insightful)

nweaver (113078) | more than 4 years ago | (#29572693)

Let's assume that spidering a page costs 10 kB of data transfer.

So that's $2 for 1M pages, or 10 GB of data downloaded.

That's at least $1 worth of data transfer (at roughly ten cents per GB) being shifted onto the suckers, err, "volunteers" whose home networks are running this app.

Re:Ah, abusing someone else's bandwidth... (1)

thePowerOfGrayskull (905905) | more than 4 years ago | (#29574127)

It's expensive and inefficient to set up your own crawlers. This offers a viable alternative -- with bandwidth and CPU provided by people who know what they're giving up [presumably], and who are doing it in exchange for some other value received.

All in all, I'd have to say this is a pretty good idea.

It's the bandwidth (1)

kcdoodle (754976) | more than 4 years ago | (#29580441)

It is really easy to make a web crawler in Java (look at java.net.URL and java.net.HttpURLConnection). I made a decent one by myself in about a week. Okay, so my web crawler only does TEXT/HTML: no images, no ActiveX, no video. From experience, an average web page is about 10 kB.

Now, any specific application will probably be looking for keywords; otherwise you are just re-creating Google. A keyword crawl returns a LOT less information, but it still requires a lot of bandwidth and processing power to do the work. So bandwidth and processing time are what they are selling -- and that's where this company's services would be most useful.
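
For anyone curious, a minimal sketch of such a crawler fits in a single class (not the parent's code; the seed URL and user-agent string are placeholders):

<ecode>
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal breadth-first text/HTML crawler: fetch a page, regex out the
// href targets, and enqueue any URL we haven't visited yet.
public class TinyCrawler {
    private static final Pattern HREF =
            Pattern.compile("href=[\"'](http[^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        Queue<String> queue = new ArrayDeque<String>();
        Set<String> seen = new HashSet<String>();
        queue.add("http://example.com/");             // placeholder seed URL
        int fetched = 0;

        while (!queue.isEmpty() && fetched < 100) {   // hard cap for the demo
            String page = queue.poll();
            if (!seen.add(page)) continue;            // skip already-visited URLs
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(page).openConnection();
                conn.setRequestProperty("User-Agent", "TinyCrawler/0.1 (demo)");
                String type = conn.getContentType();
                if (type == null || !type.startsWith("text/html")) continue;

                StringBuilder body = new StringBuilder();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream()));
                for (String line; (line = in.readLine()) != null; )
                    body.append(line).append('\n');
                in.close();
                fetched++;

                Matcher m = HREF.matcher(body);       // extract outgoing links
                while (m.find()) queue.add(m.group(1));

                Thread.sleep(2000);                   // crude politeness delay
            } catch (Exception e) {
                // unreachable host, malformed URL, etc. -- just move on
            }
        }
        System.out.println("Fetched " + fetched + " pages.");
    }
}
</ecode>

A real crawler would also honor robots.txt, resolve relative URLs, and persist its queue -- that's where the week of work goes.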

Re:It's the bandwidth (1)

thePowerOfGrayskull (905905) | more than 4 years ago | (#29581123)

Sure, that's mostly what I was referring to. The code for a crawler is simple; the resources to use it effectively are something else entirely. The intarwebs are just too big for a startup to crawl without laying down tons of cash for hardware or "cloud" hardware.

Re:Ah, abusing someone else's bandwidth... (1)

martin-boundary (547041) | more than 4 years ago | (#29576125)

It's more like 20 kB per page: first the page gets downloaded onto the user's PC (10 kB), and then, unless the company goes and physically picks up the hard disk, it has to be uploaded through the network to their servers eventually (another 10 kB).

Re:Ah, abusing someone else's bandwidth... (0)

Anonymous Coward | more than 4 years ago | (#29581899)

Plura gets computer users to supply unused processing power in exchange for access to games, donations to charities, and other rewards.

I guess you could call them suckers. Or you could look at it from the point of view that they already pay $X a month for their connection, and if they can earn more than $X in rewards then it's actually a win for them. Not only getting a more or less free connection, but other perks to boot.

Of course, it probably violates most of their ISPs' terms of service, but hey, that's their own problem. Still, I personally wouldn't be OK with something like that running on my home connection.

Re:Ah, abusing someone else's bandwidth... (1)

skeeto (1138903) | more than 4 years ago | (#29598163)

Almost everyone running Plura's crap is unaware of it. It's embedded in web pages like advertising. For example, you know the highly popular Desktop Tower Defense?

http://www.handdrawngames.com/DesktopTD/Game.asp [handdrawngames.com]

Look at the page source. There's a Plura bug on it, running the whole time you're playing the game. They've been doing this for a long time already.

Company Offers Customizable Penis Enlargement (-1, Offtopic)

Anonymous Coward | more than 4 years ago | (#29572707)

A company called 80inches has come up with an interesting new business model: customized, on-demand penis enlargement. Rob Malda was seen outside their corporate offices to enlarge his rumored toddler-sized penis into something more adult-sized so his wife wouldn't leave him for a real man.

Re:Company Offers Customizable Penis Enlargement (-1, Offtopic)

Anonymous Coward | more than 4 years ago | (#29575671)

Why would some woman marry a man with a toddler-sized penis to begin with?

Nifty... (2, Interesting)

ZekoMal (1404259) | more than 4 years ago | (#29572729)

But whenever I see something nifty combined with the Internet, I immediately think, "Now how will this be used to spam and/or infect people?"

Hrm... (4, Insightful)

vertinox (846076) | more than 4 years ago | (#29572771)

Sounds like a legitimate front for identity thieves, spammers, or even worse... Marketers.

I suppose it's easier than running your own botnet.

Re:Hrm... (1, Funny)

Anonymous Coward | more than 4 years ago | (#29573751)

or even worse... Marketers

So, which one should it be - insightful or redundant?

ooo free games (1)

wjh31 (1372867) | more than 4 years ago | (#29572783)

Free games for spare cycles/bandwidth? That's more interesting to me than the spidering stuff. How do I sign up?

Re:ooo free games (1)

negRo_slim (636783) | more than 4 years ago | (#29572801)

Since it's not really free, I'd rather have some monetary compensation if I were to participate in the program.

Seems cheap! (3, Insightful)

Pedrito (94783) | more than 4 years ago | (#29572809)

Seems like an awfully cheap way to spider millions of pages of porn. It would be worthwhile if Google didn't do it already for free.

Re:Seems cheap! (1)

AnEducatedNegro (1372687) | more than 4 years ago | (#29573011)

Turn off safesearch in Bing and search videos.

aEN

Re:Seems cheap! (3, Funny)

sakdoctor (1087155) | more than 4 years ago | (#29573217)

Japanese girls puking each other's mouths...Nope
Bestiality...Nope
Brazilian fart porn...A bit

As my first try of Bing, that wasn't very impressive.

Re:Seems cheap! (0)

Anonymous Coward | more than 4 years ago | (#29573519)

Japanese girls puking each other's mouths...Nope

How the hell do they even DO that?
Vomiting each other's mouths? Only in Japan!

Re:Seems cheap! (1)

Maxmin (921568) | more than 4 years ago | (#29582993)

It would be worthwhile if Google didn't do it already for free.

You've missed the point... or you've never tried to use Google programmatically.

Google's search APIs are all bound to JavaScript now. There is no way to connect to them from your Java, Python, or Ruby application -- not, at least, if you don't want to get your IP(s) blocked for running too many queries.

This spidering service provides something similar to what Alexa Web Search once did.

Buried in Digsby (4, Informative)

Anonymous Coward | more than 4 years ago | (#29572821)

This is apparently the service that caused a lot of controversy when people discovered it was somewhat hidden in Digsby [wikipedia.org] .

Re:Buried in Digsby (1)

linguizic (806996) | more than 4 years ago | (#29574779)

From the wikipedia entry you cite:

Digsby developer "chris" has stated that CPU usage is limited to 75% for desktops, and 25% for laptops unless operating on battery power.

Does that sound like an insane amount of CPU usage for a damn IM client to anyone else? Why the hell would they embed Plura into an IM client anyway? This whole thing seems too fishy to me.

Re:Buried in Digsby (1, Insightful)

Anonymous Coward | more than 4 years ago | (#29575953)

Why the hell would they embed Plura into an IM client anyway?

Unfortunately, it's all about money.

This will work (1)

YouDoNotWantToKnow (1516235) | more than 4 years ago | (#29572961)

They are currently recruiting only Flash game developers, but I can imagine this getting as big as advertising is right now. It could even keep newspapers alive: "Do you want to access my free content? Sure, but gimme 10% of your processing power." As long as there is demand for this computing power, we are quite able to harness it.

Free web index for download (2, Informative)

pburt (244477) | more than 4 years ago | (#29572995)

There is a spider crawling the web that claims to be building a free, downloadable web index for similar purposes.
Torrent link for the index and information at http://www.dotnetdotcom.org/ [dotnetdotcom.org] .

Who are the customers? (3, Interesting)

93 Escort Wagon (326346) | more than 4 years ago | (#29573131)

I can see how they might get a fair number of people to donate their spare cycles for this, if the rewards are seen as sufficiently interesting. But are there really a whole bunch of startups (or other companies) champing at the bit to create a new search engine? Other than marketers or malware purveyors, I mean. And do these crawls honor robots.txt exclusions?

BTW I took a quick look at 80legs' website in an attempt to get these answers. I came up empty in that regard - so I will comment on how the CEO's hair makes him look like an in-disguise member of the Conehead family. Seriously, what's with the hair?

Occam's razor. (2, Insightful)

icepick72 (834363) | more than 4 years ago | (#29573345)

The levels of indirection needed to support this system -- distributed clients, incentives for being a distributed client, supply vs. demand for processing power, payment for custom spidering -- make it many things at once and unnecessarily complex, because those things already exist for free and in less complex forms. Simpler mechanisms have always met most of these needs.

Reality... (2, Insightful)

JuSTCHiLLiN (605538) | more than 4 years ago | (#29573375)

Plura gets computer users to supply unused processing power in exchange for access to games, donations to charities, and spyware.

I am surprised... (1)

barocco (1168573) | more than 4 years ago | (#29573585)

I am surprised that a post containing the words "SETI", "80legs", "crawling", "computer", "spider", "farm", and "unused power" does not involve the plot of Jodie Foster listening to a radio telescope and discovering that evil giant mutant cyborg space spiders are trying to invade Earth and capture humans as batteries.

I'm a little confused (2, Interesting)

PCM2 (4486) | more than 4 years ago | (#29573947)

Is there really a big demand out there for outsourced spidering? I had not heard of this market. They seem to be implying that there are all these start-up outfits out there who have invented really amazing, unique UIs that allow people to find exactly what they need on the Web, and all they need to be successful is access to a searchable index. Huh??

I mean, if you're going to be some kind of start-up search engine or "semantic company" (whatever that means), shouldn't Web spidering be your core competency? If you're going to differentiate yourself in the market, how can you buy spidering as a commodity? How do you expect to attract any investment if you're telling potential investors that you rent your spidering capability from another start-up -- let alone one that uses some kind of half-baked P2P technology to do the work?

Seriously, in a world where Google seems willing to partner with just about anybody who needs any kind of searching for reasonable rates, what is this company's proposed customer base? (And no, the Technology Review article includes no quotes from customers at all.)

Re:I'm a little confused (3, Informative)

mgkimsal2 (200677) | more than 4 years ago | (#29574745)

"I mean, if you're going to be some kind of start-up search engine or "semantic company" (whatever that means), shouldn't Web spidering be your core competency? If you're going to differentiate yourself in the market, how can you buy spidering as a commodity?"

Raw spidering is pretty much a commodity already. You're issuing GET requests over HTTP (for the most part). The "semantic" stuff comes into play when analyzing the results and doing interesting things with the raw information you get back. If people can spend more time focused on the interesting bits and less time on scaling up to pull in the raw data to analyze, they'll be better off for it and more likely to be able to focus on creating something new/interesting/distinguishing.

People (generally) don't write their own web servers, nor their own TCP/IP stacks, often don't write their own session handling logic, or security code. All of these things have been commoditized. Perhaps too many people are relying on 'cloud computing' these days, but hosting and storage 'in the cloud' is where all the cool kids are playing right now (I don't necessarily agree with it, and probably wouldn't put all my eggs in that basket myself, but others are doing so). Spidering may be the next frontier to get commoditized.

Perhaps not everyone is comfortable 'partnering' with Google for everything? If someone was going to work on developing the 'next big thing', would you rather invest in something where the people had spent an inordinate amount of time building network capacity up to do drone work, or used a service like 80legs, or built the prototypes on Google's servers? Depending on the project, any of those make sense, but I'd prefer to use a service like 80legs myself. They're small enough and hungry enough they should give top notch customer service at this stage, whereas Google's not going to give you a number to call for direct service (maybe they do if you're spending loads of money, but then you're back to wise use of money).

The P2P aspect of how they're doing the spidering may be clever, but I'd rather see a more direct use of data-center resources around the globe than reliance on a SETI-like participation model.

Re:I'm a little confused (2, Informative)

Jack9 (11421) | more than 4 years ago | (#29574927)

Advertising uses a fair amount of spidering for things like contextual search (where has a user been, and what are their interests?). Amazon was completely apathetic in regard to a company that offered $50 million for sending them crawling business. I was surprised, to say the least. When that company then tried to do it piecemeal, Amazon got very upset. So there's demand, but it's probably not very large (in the number of well-capitalized customers).

Rent our botnet! (2, Interesting)

Animats (122034) | more than 4 years ago | (#29575909)

This looks like an attempt to monetize a botnet. What, exactly, do the people running their "client" get out of this? Do they know they're sucking bandwidth, and possibly being billed for it, on behalf of someone else?

I run a web spider [sitetruth.com] of sorts. And I know the people who run a big search engine. Reading the web sites isn't the bottleneck. Analyzing the results and building the database is. Outsourcing the reading part doesn't buy you much. If this just did a crawl, it would be of very limited value. That's not what it does.

What they're really doing [pbworks.com] is offering a service that lets their customers run the customer's Java code on other people's machines in the botnet. That's worrisome. There are some security limits, which might even work. Supposedly, all the Java apps can do is look at crawled pages and phone results home. Right.

This thing uses the Plura botnet. [pluraprocessing.com] "Plura® is a grid computing system. We contract with affiliates, who are owners of web pages, software, and other services, to distribute our grid computing code. We utilize the excess resources of peripheral computers that are browsing the internet when such browsing leads to a web page of one of our affiliates. That web page has imbedded code that allows the visitor to participate in the grid computing process. We also utilize embedded code in software and other services to allow such participation." Not good.

The main infection vector is apparently the Digsby chat client [lifehacker.com] , which comes bundled with various crapware. The Digsby feature list [digsby.com] does not mention that Plura is in their package.

This thing needs to be treated as hostile code by firewalls and virus scanners.

Re:Rent our botnet! (2, Interesting)

javajedi (81810) | more than 4 years ago | (#29577027)

Outsourcing the reading part doesn't buy you much. If this just did a crawl, it would be of very limited value. That's not what it does.

Wrong. If I want to spider a single web site, many sites have rate-limiters that kick in and will block me after a while. This would allow me to hit it from multiple machines.

There are some security limits, which might even work. Supposedly, all the Java apps can do is look at crawled pages and phone results home. Right.

Why the sarcasm? This seems like a perfect use case for the JVM's security mechanism.
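
For context, the classic JVM sandboxing recipe (the one applets used) is to install a SecurityManager and run the untrusted code under an AccessControlContext holding only the permissions you explicitly grant. A rough sketch, pre-Java-17 machinery; the results host is a made-up placeholder:

<ecode>
import java.net.SocketPermission;
import java.security.AccessControlContext;
import java.security.AccessController;
import java.security.CodeSource;
import java.security.Permissions;
import java.security.PrivilegedAction;
import java.security.ProtectionDomain;
import java.security.cert.Certificate;

public class CrawlSandbox {
    public static void main(String[] args) {
        System.setSecurityManager(new SecurityManager());

        // Grant the untrusted task exactly one permission: phoning results home.
        Permissions perms = new Permissions();
        perms.add(new SocketPermission("results.example.com:443", "connect"));
        AccessControlContext ctx = new AccessControlContext(
                new ProtectionDomain[] { new ProtectionDomain(
                        new CodeSource(null, (Certificate[]) null), perms) });

        // Code run inside this action is checked against the intersection of
        // our own permissions and the ones granted above, so file access,
        // exec, or other sockets throw SecurityException.
        AccessController.doPrivileged(new PrivilegedAction<Void>() {
            public Void run() {
                // untrusted customer code would be invoked here
                return null;
            }
        }, ctx);
    }
}
</ecode>

Whether limits like that hold up in practice is another question; the history of applet sandbox escapes suggests the grandparent's skepticism is healthy.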

Re:Rent our botnet! (1)

Ant P. (974313) | more than 4 years ago | (#29581011)

many sites have rate-limiters that kick in and will block me after a while. This would allow me to hit it from multiple machines.

Many sites have rate limiters to prevent denial-of-service attacks. This would allow easy DDoS attacks.

ftfy

Re:Rent our botnet! (1)

Animats (122034) | more than 4 years ago | (#29581611)

If I want to spider a single web site, many sites have rate-limiters that kick in and will block me after a while. This would allow me to hit it from multiple machines.

The better web spiders run very slowly as seen from each site. At one time, Google only read about one page every few minutes per site. The Internet was slower then. Cuil's crawler is known to be overly aggressive, but that's a design flaw. (Too much distribution, not enough coordination.)

At SiteTruth, we never read more than 20 pages from a single site, rate limit to one new page every 2 seconds, and read no more than 3 pages in parallel. And we obey "robots.txt". We don't hide the identity of our bot; the user-agent string is "SiteTruth.com site rating system", and we list that with "Bots vs. Browsers". That seems to be enough to avoid blocking. We see some sites that are very slow, but they're very slow when seen from an ordinary browser. If sites are blocking your crawler, it must be very aggressive.
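
A simplified sketch of that kind of per-site throttle (illustrative only, not the actual SiteTruth code):

<ecode>
import java.util.concurrent.Semaphore;

// Per-site politeness caps like the ones described above: at most 20 pages
// total, fetch starts spaced at least 2 seconds apart, and no more than
// 3 downloads in flight at once.
public class PoliteLimiter {
    private final Semaphore inFlight = new Semaphore(3);
    private int fetched = 0;
    private long nextAllowed = 0;

    // Block until a fetch may start; returns false once the page cap is hit.
    public boolean beginFetch() throws InterruptedException {
        long wait;
        synchronized (this) {
            if (fetched >= 20) return false;
            fetched++;
            long now = System.currentTimeMillis();
            wait = nextAllowed - now;
            nextAllowed = Math.max(now, nextAllowed) + 2000;
        }
        if (wait > 0) Thread.sleep(wait); // space out fetch starts
        inFlight.acquire();               // cap parallel downloads
        return true;
    }

    public void endFetch() {
        inFlight.release();
    }
}
</ecode>

Each worker would call beginFetch() before downloading a page and endFetch() afterwards.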

We're just looking for the name and address of the business behind the site, though, checking "About", "Help", "Contact", and other places a human would look for that info. We're not trying to suck up the site's entire content.

Might be interested in this (0)

Anonymous Coward | more than 4 years ago | (#29589095)

I'm starting a website about restoring a specific kind of car. There are a few other sites out there that talk about their own cars or clubs you can join. Some even have a few links to other sites.

I'd love to use a service like this to search out tons of links to all sorts of places on the web -- but to do it without just copy/pasting someone else's links page. I'd rather do my own work with my own tool than spend tons of time sorting through thousands of Google hits.

This might work for me.
