The First Open Ranking of the World Wide Web Is Available

Unknown Lamer posted about 9 months ago | from the brought-to-you-by-164 dept.

Programming 53

First time accepted submitter vigna writes "The Laboratory for Web Algorithmics of the Università degli studi di Milano together with the Data and Web Science Group of the University of Mannheim have put together the first entirely open ranking of more than 100 million sites of the Web. The ranking is based on classic and easily explainable centrality measures applied to a host graph, and it is entirely open — all data and all software used is publicly available. Just in case you wonder, the number one site is YouTube, the second Wikipedia, and the third Twitter." They are using the Common Crawl data (first released in November 2011). Pages are ranked using harmonic centrality with raw Indegree centrality, Katz's index, and PageRank provided for comparison. More information about the web graph is available in a pre-print paper that will be presented at the World Wide Web Conference in April.

Fuck Beta. (-1, Offtopic)

lvxferre (2470098) | about 9 months ago | (#46229775)

You know the drill.

Re:Fuck Beta. (-1)

Anonymous Coward | about 9 months ago | (#46229873)

that was so last week

Re:Fuck Beta. (-1)

Anonymous Coward | about 9 months ago | (#46230011)

No, no, it's today too.

See right there, #98,573,319: beta.slashdot.org.

Re:Fuck Beta. (-1)

Anonymous Coward | about 9 months ago | (#46230127)

I know you were just joking, but it is much higher than that:

Site: beta.slashdot.org
Harmonic centrality: 4072812
Indegree centrality: 37928192
Katz's index: 18987108
PageRank: 34236886

Re:Fuck Beta. (1)

SleazyRidr (1563649) | about 9 months ago | (#46232307)

I thought you guys were boycotting this week. At least bring some facts to your discussion http://www.twst.com/update/388... [twst.com]

Google vs Bing (1)

QilessQi (2044624) | about 9 months ago | (#46229805)

It looks like google.com is 4th, and bing.com is...

um...

...ok, does anybody know where bing.com is?

Re:Google vs Bing (1)

QilessQi (2044624) | about 9 months ago | (#46229819)

Ok, found it. 52nd.

Usefulness of results? (0)

Anonymous Coward | about 9 months ago | (#46229817)

Using old data, they rank digg above reddit.

Re:Usefulness of results? (3, Funny)

goombah99 (560566) | about 9 months ago | (#46230395)

Using old data, they rank digg above reddit.

Yeah Digg is too crowded, no one goes there anymore.

If you have not done so please donate to Wikipedia.

CC? (1)

Rich_Lather (925834) | about 9 months ago | (#46229867)

Creative Commons? Why would that be in the top ten?

Re:CC? (0)

Anonymous Coward | about 9 months ago | (#46229969)

gmpg.org
What in the name is Metamemetics ????

Re:CC? (2)

Garble Snarky (715674) | about 9 months ago | (#46231401)

Maybe because the rank is controlled by links, and many pages link to CC even though people seldom follow those links?

Re:CC? (1)

nullchar (446050) | about 9 months ago | (#46232369)

I would like to compare the rank of this link graph to DNS requests for a "popularity" graph.

Missing data (1)

c008644 (3529249) | about 9 months ago | (#46229889)

How can porn account for about 75% of all web traffic if the most common web sites are not listed on this report?

Re:Missing data (1)

amicusNYCL (1538833) | about 9 months ago | (#46229937)

Because videos are big.

Re:Missing data (0)

Anonymous Coward | about 9 months ago | (#46230329)

But I only watch the last 20 seconds.

Re:Missing data (1)

S.O.B. (136083) | about 9 months ago | (#46232053)

But I only last 20 seconds.

FTFY

Re:Missing data (1)

SleazyRidr (1563649) | about 9 months ago | (#46232327)

Part of the beauty of a joke is what you leave unsaid. Let your audience work out why what you just said is funny, instead of just giving them everything.

two different universities (0)

Anonymous Coward | about 9 months ago | (#46229931)

"Laboratory for Web Algorithmics together with the Data and Web Science Group of the University of Mannheim"

Two different universities are involved, check the links.

gmpg.org? (1)

Chris Mattern (191822) | about 9 months ago | (#46229957)

It's #1 in PageRank and Katz's index, #3 in Indegree centrality. What the hell is it? I went there, and I *still* don't know what it is.

Re:gmpg.org? (1)

Ksevio (865461) | about 9 months ago | (#46230065)

It's a metadata profile definition that's linked to by lots of social media sites. It's pretty much just defined in the header (it lets people add different attributes to HTML). w3.org is up there for the same reason: people link to it when setting a doctype.

Re:gmpg.org? (1)

Trepidity (597) | about 9 months ago | (#46230245)

There's an extension to the <link rel> tag that overloads it: when rel="profile", instead of linking to actual related data (as the tag was intended to do), it treats the target of the link as defining a schema / data format. The URL is then essentially a globally unique key for the data format; parsers that recognize the format will see the key and know how to parse some other information on the page. gmpg.org is the host of one of the early ones, XFN [gmpg.org], which is linked in default Wordpress installs [wordpress.org].

Re:gmpg.org? (1)

nullchar (446050) | about 9 months ago | (#46232445)

Should all of the <link rel>'s be excluded from the dataset used to build the giant graph?

Re:gmpg.org? (1)

Trepidity (597) | about 9 months ago | (#46232477)

Probably; I don't think they are really links, in the sense of something that is ever actually rendered as a hyperlink. I would probably also exclude things like loading JS resources.
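
For what it's worth, here is a minimal Python sketch of that kind of filter: it collects outgoing hosts only from <a href> tags, so <link rel="profile"> and <script src> references never become edges. The class name, the function name, and the sample HTML are made up for illustration, and I have no idea how the actual host graph was extracted from Common Crawl, so treat this as one possible approach rather than a description of their pipeline.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorHostCollector(HTMLParser):
    """Collects the hosts of rendered hyperlinks (<a href=...>) only.

    <link>, <script>, <img> and every other tag are ignored, so a
    rel="profile" reference never counts as a link.
    """

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # skip <link>, <script>, and everything else
        href = dict(attrs).get("href")
        if not href:
            return
        host = urlparse(href).hostname  # None for relative URLs
        if host:
            self.hosts.add(host)


def linked_hosts(html):
    """Return the set of external hosts a page links to via <a> tags."""
    collector = AnchorHostCollector()
    collector.feed(html)
    return collector.hosts


# Hypothetical page: one profile link, one script, one real hyperlink.
SAMPLE_PAGE = """
<html><head>
<link rel="profile" href="http://gmpg.org/xfn/11">
<script src="http://cdn.example.net/app.js"></script>
</head><body>
<a href="http://en.wikipedia.org/wiki/Centrality">Centrality</a>
<a href="/about">About</a>
</body></html>
"""
```

Calling linked_hosts(SAMPLE_PAGE) keeps only the Wikipedia host; the gmpg.org profile reference and the script host are dropped, and the relative /about link has no host to record.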

Re:gmpg.org? (1)

L7_ (645377) | about 9 months ago | (#46238939)

Read this article about systems rendering the http links in DTD headers from XHTML specs - http://www.w3.org/blog/systeam... [w3.org]

It's from 2008, still stands.

We're #164! We're #164! (-1, Flamebait)

damn_registrars (1103043) | about 9 months ago | (#46229973)

Being in the top 200 out of 100 million really isn't bad.

Re:We're #164! We're #164! (0)

Anonymous Coward | about 9 months ago | (#46230097)

well

Re:We're #164! We're #164! (2, Interesting)

Anonymous Coward | about 9 months ago | (#46230123)

164. slashdot.org

4072812. beta.slashdot.org

Profit!!!

Re:We're #164! We're #164! (0)

Anonymous Coward | about 9 months ago | (#46230181)

Being in the top 200 out of 100 million really isn't bad.

Yeah well that's before the BETA (fuck the BETA btw).
After BETA /. rank will be 200 000 001

I'd say the results are pretty obvious... (2)

ausekilis (1513635) | about 9 months ago | (#46230141)

Up top you have those web sites that have their fingers in damned near everything, because they are looking at "centralization" of the website. More and more websites are using videos, and who better than YouTube to host? Need to provide a way to search your website? Google has already done it for you. Need to update your 3 billion fans on what you're having for lunch? Facebook and Twitter have you covered. I can't see the list from work, but I'd wager that Facebook is up there too, with their ever-present "like" buttons. What's surprising is Wikipedia. You'll only sometimes see a link to Wikipedia, even in discussions on Slashdot, and they don't go out there and wave their hands saying "everybody link to me" like other sites do.

What about other aspects that would make a website "good"? Such as ease of navigation (find what you want in 5 clicks or less)? Size/amount of useful content? Number of external sites that link to their content?
If we included that sort of data, YouTube could potentially be far up there with Wikipedia. I would think Google and Bing would be ruled out entirely since by their very design they don't hold real data.

Re:I'd say the results are pretty obvious... (0)

Anonymous Coward | about 9 months ago | (#46231507)

Yeah, I think wikipedia is on there because it is just so damn useful.

Re:I'd say the results are pretty obvious... (1)

SleazyRidr (1563649) | about 9 months ago | (#46232353)

Wikipedia aren't out there telling everyone to link to them, they're just sitting quietly in the corner being awesome, and everyone links to them because they're such a good source of information.

Re:I'd say the results are pretty obvious... (1)

rtb61 (674572) | about 9 months ago | (#46235293)

More likely, with sufficient money, simply gaming the system with thousands even tens of thousands of bogus web sites all linking back to the advertising revenue targeted web site.

Want realistic ratings? Get them from accurately identified people, with specific reviews, one set of reviews per person. The rest is just bullshot programming and cake.

Ah, memories! (0)

Anonymous Coward | about 9 months ago | (#46230221)

http://www.nbcnews.com/id/15196982/ns/business-us_business/t/google-buys-youtube-billion/

There was a company looking at long-term growth instead of short-term profits - and they got both!

thanks (1)

Anonymous Coward | about 9 months ago | (#46230305)

How nice of them to rank the problems of the internet for us.

Nothing there... (1)

GnuPooh (696143) | about 9 months ago | (#46230403)

Am I missing something? Does this site require Java or Silverlight or something? The page is very stark and there's no ranking shown. What am I missing? Did it get slashdotted?

Re:Nothing there... (1)

hierofalcon (1233282) | about 9 months ago | (#46231975)

Works in firefox, doesn't work in chrome. YMMV

This ranking isn't based on popularity (1)

galloog1 (3433335) | about 9 months ago | (#46230423)

If you look at the way they developed this list, it is closer to how Google ranks their searches. The metrics are scored on how many other pages link to the sites. For example, reddit and slashdot aren't high on the list because they link to other sites but very few link back. Creative Commons is in the top ten because everyone links there. It also explains why Myspace is so darn high.

The stupidity of heathendom is thereby made manife (0)

Anonymous Coward | about 9 months ago | (#46230441)

...but said heathendom is too stupid to notice. If only you were willing to hang yourself more often, how entropy would drop! Death to faggots, to heathens, to you.
--
Sheshbazzar

Linux is dying (2)

Trepidity (597) | about 9 months ago | (#46230475)

It is now official. The Università degli studi di Milano has confirmed: Linux is dying.

One more crippling bombshell hit the already beleaguered Linux community when UNIMI confirmed that Linux's flagship domain, kernel.org [kernel.org] , fell to a shocking #1797 in the Common Crawl rankings. You don't need to be the Amazing Kreskin to predict Linux's future. Its domain now ranks just behind Excite.com, the now-irrelevant search engine from the 1990s, which edges it out at #1796.

The glaring gap between Linux's ranking and the rankings of those in the vibrant, enterprise-ready world is in itself embarrassing enough: Apple #8, Microsoft #17, even Oracle #248. But what seals the coffin is that Linux has fallen behind even the notoriously moribund FreeBSD operating system in these industry-leading metrics, trailing it by nearly one thousand, five hundred positions.

Some surprising results (1)

XxtraLarGe (551297) | about 9 months ago | (#46230529)

Amazon.com is not in the top 10, and MySpace is still in the top 20. Even more baffling, MySpace isn't even in the top 50 of any of the other rankings, so how did they come up with this score?

Re:Some surprising results (0)

Anonymous Coward | about 9 months ago | (#46230637)

by using data from 3 years ago.

Google bias (1)

Martin S. (98249) | about 9 months ago | (#46230645)

Google are commonly accused of giving their own sites preferential results.

However, this suggests not, with Google PageRank being generally lower than the "web data commons" result for the same sites, e.g. YouTube & Google.

Re:Google bias (1)

TulioSerpio (125657) | about 9 months ago | (#46230753)

gmpg.org isn't a very ranked site, or I'm missing something?

Needs a Caption (1)

ZombieBraintrust (1685608) | about 9 months ago | (#46230805)

They need to put a caption on their results stating when the data for the ranking was last crawled.

Centrality, not popularity (0)

Anonymous Coward | about 9 months ago | (#46230901)

It's important to keep in mind that this measures centrality, a measure of how interconnected a site is, not popularity. There are several extremely large self-contained communities that rarely get linked by other sites. For instance, 4chan threads are deleted after a certain amount of time, meaning there's very little on the site to link to, which means it has low centrality; this is reflected by its position somewhere over 3000 on the linked page, much lower than its popularity in relation to many of the sites above it. I'm sure there are other examples, too; this is just the one that jumped out at me.

Wow! (0)

Anonymous Coward | about 9 months ago | (#46231035)

My own site is in the top 22,000,000! :-) ...and the bravenet version that was supposed to be shut down 10 years ago is still in the top 50,000,000... :-(

Stack rank the internet? (1)

bobbied (2522392) | about 9 months ago | (#46231387)

So... Are we going to toss the bottom 10% or the top?

Doesn't this work for Google?

This is not accurate. (1)

Anonymous Coward | about 9 months ago | (#46231543)

Whatever they are doing here does not reflect anything too useful (from my perspective). Source: I have a number of sites in the top 10,000 - and nothing here makes any sense. It doesn't correlate with any real world metrics I can see. ie: Sites that receive 140,000 visitors a day, and have millions of incoming links are showing up in the 1 million area, and sites of mine with little-to-no power are showing up in the top 100,000. Weird.

I predict (0)

Anonymous Coward | about 9 months ago | (#46231561)

that page 10171778 won't be last for long.

Harmonic Centrality is the wrong measure (1)

Scotland (3022857) | about 9 months ago | (#46232195)

This article and open rankings work is great, but...

The default ranking we show you is by harmonic centrality. If you want, you can find its definition in Wikipedia. But we can explain it easily.

Suppose your site is example.com. Your score by harmonic centrality is, as a start, the number of sites with a link towards example.com. They are called sites at distance one. Say, there are 50 such sites: your score is now 50.

There will be also sites with a link towards sites that have a link towards example.com, but they are not at distance one. They are called sites at distance two. Say, there are 80 such sites: they are not as important as before—we will give them just half a point. So you get 40 more points and your score is now 90.

We can go on: there will be also sites with a link towards sites that have a link towards sites that have a link towards example.com (!), but they are not at distance one or two. They are called sites at distance three. Say, there are 100 such sites: as you can guess, we will give them just one third of a point. So you get 33.333 more points and your score is now 123.333.
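
To check that arithmetic, here is a minimal Python sketch of the scoring rule as described: a breadth-first search backwards from the target site, adding 1/d for every site first reached at distance d. The function names and the toy graph are my own invention for illustration; this is not the software the ranking authors used.

```python
from collections import deque


def harmonic_centrality(links, target):
    """Sum of 1/d over every site that reaches `target` in d >= 1 hops.

    `links` maps each site to the sites it links to.
    """
    # Reverse the graph: incoming[v] holds the sites that link to v.
    incoming = {}
    for src, dsts in links.items():
        for dst in dsts:
            incoming.setdefault(dst, set()).add(src)

    # Breadth-first search backwards from the target, so each site is
    # counted once, at its shortest distance.
    dist = {target: 0}
    queue = deque([target])
    score = 0.0
    while queue:
        node = queue.popleft()
        for pred in incoming.get(node, ()):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                score += 1.0 / dist[pred]
                queue.append(pred)
    return score


def example_graph():
    """Toy graph matching the text: 50, 80 and 100 sites at distance 1, 2, 3."""
    links = {}
    for i in range(50):
        links[f"d1-{i}"] = ["example.com"]
    for i in range(80):
        links[f"d2-{i}"] = [f"d1-{i % 50}"]
    for i in range(100):
        links[f"d3-{i}"] = [f"d2-{i % 80}"]
    return links
```

On that toy graph, harmonic_centrality(example_graph(), "example.com") works out to 50 + 80/2 + 100/3, the same 123.333 as in the walkthrough.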

My intuition:

Incoming links with degree one should be allocated 1 point. *yep*
Incoming links with degree two should be allocated half of 1 point = 0.5 points. *yep*
Incoming links with degree three should be allocated half of 0.5 points = 0.25 points. *NOPE* It actually gets allocated 0.33 points.

This means degree ten links still get 0.1 point? 10 hops away and they're still showing up significantly? That measure is broken. 10 hops away should score vanishingly small... 0.5^(10-1) ≈ 0.002 points is much more reasonable.

The measure shouldn't be 1/n (harmonic centrality), it should be 0.5^(n-1). I would love to see that set of rankings.
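
A quick Python sketch of the two weighting schemes side by side, in case anyone wants to play with the decay. The function names are mine, and the 50/80/100 site counts are just the example numbers from the ranking site's explanation, not real data.

```python
def harmonic_weight(distance):
    """Weight used by harmonic centrality: 1/n for a site n hops away."""
    return 1.0 / distance


def geometric_weight(distance, factor=0.5):
    """Proposed alternative: halve the weight with every extra hop."""
    return factor ** (distance - 1)


def score(counts, weight):
    """Total score given {distance: number of sites at that distance}."""
    return sum(n * weight(d) for d, n in counts.items())


# Example counts from the explanation: 50, 80 and 100 sites at
# distance one, two and three.
EXAMPLE_COUNTS = {1: 50, 2: 80, 3: 100}

if __name__ == "__main__":
    for d in (1, 2, 3, 10):
        print(d, harmonic_weight(d), geometric_weight(d))
    print(score(EXAMPLE_COUNTS, harmonic_weight))
    print(score(EXAMPLE_COUNTS, geometric_weight))
```

With the geometric weight the example site scores 50 + 40 + 25 = 115 instead of 123.333, and a site ten hops away contributes 0.5^9 = 0.001953125 rather than 0.1.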

The intuitive measure is Eigenvector Centrality (1)

Scotland (3022857) | about 9 months ago | (#46232351)

Upon further reading (http://en.wikipedia.org/wiki/Centrality), methods that use an attenuation factor like I described are called Eigenvector Centrality, which Katz and PageRank are specific implementations of.
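
For comparison, a minimal Python sketch of a Katz-style index computed by simple iteration. The function name, the alpha value, and the toy graph in the usage note are my own assumptions. Note one difference from what I described above: Katz counts every path of length k into a site and weights it by alpha^k, rather than counting each site once at its shortest distance, and the weight at distance one is alpha rather than 1. The iteration only settles if alpha is small enough for the graph (always on a graph with no cycles).

```python
def katz_index(links, alpha=0.5, iterations=50):
    """Katz-style score: sum over k of alpha**k * (paths of length k into a site).

    `links` maps each site to the sites it links to. Computed by iterating
    x[v] = sum over sites u linking to v of alpha * (1 + x[u]).
    """
    nodes = set(links)
    for dsts in links.values():
        nodes.update(dsts)

    score = {n: 0.0 for n in nodes}
    for _ in range(iterations):
        new_score = {n: 0.0 for n in nodes}
        for src, dsts in links.items():
            for dst in dsts:
                # The link src -> dst itself, plus every path ending at src,
                # each extended by one hop and attenuated by alpha.
                new_score[dst] += alpha * (1.0 + score[src])
        score = new_score
    return score
```

For a toy graph {"a": ["t"], "b": ["t"], "c": ["a"]}, site t has two paths of length one and one of length two, so with alpha = 0.5 its score is 2 * 0.5 + 1 * 0.25 = 1.25.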

I would love to see that set of rankings.

I guess I have! :-)
