Slashdot: News for Nerds

89 comments

so much porn (-1)

Anonymous Coward | about a year ago | (#43383043)

..

Great plan there smart guys! (-1, Redundant)

Narcocide (102829) | about a year ago | (#43383051)

And then what after that? Print it all out on paper and mail it back to the people in weekly editions for a small subscription fee?

Re: Great plan there smart guys! (0)

Anonymous Coward | about a year ago | (#43383091)

Does your annoyingly snarky post have a point behind it, or are you just enjoying being a dick?

Presumably (3, Insightful)

AliasMarlowe (1042386) | about a year ago | (#43383143)

Perhaps they mean one billion web pages rather than web sites. It seems unlikely that the UK could host a billion web sites (even the American billion of 10^9 rather than the British billion of 10^12).

Re:Presumably (1)

Anonymous Coward | about a year ago | (#43383189)

The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.

Re:Presumably (4, Informative)

Trpajzlix (2747079) | about a year ago | (#43383283)

Ehm, "everyone else". In Czech bilion = 10^12.
The Brits use the same billion=10^9 as everyone else speaking english
FTFY

Re:Presumably (1)

Anonymous Coward | about a year ago | (#43383299)

I confirm... at least in Portugal and France, 1 billion is 10^12, rather than just 10^9.

Re:Presumably (2, Insightful)

Alain Williams (2972) | about a year ago | (#43383313)

Because of the ambiguity I usually say either ''a thousand million'' or use the SI prefix Giga. So: it will be an archive of a Giga web page. Hmmm: doesn't quite trip off the tongue, unfortunately.

Similarly with dates. What does 10/5/13 mean ? 10 May 2013 or 5 October 2013 ? I favour the first (to know why see how I spelled 'favour'), but recognising that it can be misunderstood (by those who spell differently), I would usually write dates as 10 May 2013 - no ambiguity.

Re:Presumably (3, Insightful)

Tastecicles (1153671) | about a year ago | (#43383813)

I use YYYY/MM/DD. By extension, HH:MM:SS. Logical.

Re:Presumably (0)

Anonymous Coward | about a year ago | (#43384043)

I use YYYY/MM/DD. By extension, HH:MM:SS. Logical.

YYYY/MM/DD doesn't work so well within a file name...

Re:Presumably (1)

K. S. Kyosuke (729550) | about a year ago | (#43384195)

Just write 20130407-171547 like everybody else and be done with it.
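
The date formats argued over above can be compared in a few lines. This is only a minimal Python sketch using the timestamp from the post above; the `backup-` file names are made up for illustration. It shows why a year-first, zero-padded stamp is convenient: it contains no slashes and sorts chronologically as a plain string.

```python
from datetime import datetime

# The moment used in the post above: 7 April 2013, 17:15:47.
stamp = datetime(2013, 4, 7, 17, 15, 47)

slashed = stamp.strftime("%Y/%m/%d")       # contains '/', awkward in file names
compact = stamp.strftime("%Y%m%d-%H%M%S")  # no separators that a file system rejects
iso = stamp.isoformat()                    # ISO 8601 form

print(slashed)
print(compact)
print(iso)

# Hypothetical file names: year-first stamps sort chronologically as strings.
dates = [(10, 5), (5, 10), (1, 31)]  # (month, day) pairs in 2013
names = [f"backup-{datetime(2013, m, d).strftime('%Y%m%d')}.tar" for m, d in dates]
ordered = sorted(names)
print(ordered)
```

The same trick does not work for DD/MM/YY or MM/DD/YY, where string order and date order disagree.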

Re:Presumably (0)

Anonymous Coward | about a year ago | (#43384861)

Works great on a Mac. Get into the 21st century, guys.

Re:Presumably (1)

AmiMoJo (196126) | about a year ago | (#43384749)

We had to send some drawings to some guys in the US a while back. Rush job just to check everything would fit. First they complained that we had post-dated it, then that our dimensions were impossibly small, until it was pointed out that "mills" means "millimetres" and not 1/1000th of an inch (which surely should be 1/1200th, unless it was an attempt at metrification).

if it were inches, it would be "thou". (0)

Anonymous Coward | about a year ago | (#43385777)

five-thou is five one-thousandths of an inch.

mill would be metric.

Re:Presumably (1)

Livius (318358) | about a year ago | (#43383667)

Not only that, people not speaking English use a word with a different pronunciation! And spelling! And grammatical rules!

Re:Presumably (1)

CanEHdian (1098955) | about a year ago | (#43384433)

This is because the English didn't use the -illion and -illiard system, just kept -illion

Rest of Europe: million, milliard, billion, billiard, trillion, trilliard
England: million, billion, trillion

This is something that cannot be "fixed" other than adopting the SI system.
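
The two naming systems listed above can be written out as powers of ten. The following is a rough Python sketch, not anything from the article; the dictionary and function names are invented for illustration. It shows that "billion" differs by a factor of a thousand between the two systems, while the long-scale "milliard" matches the short-scale "billion".

```python
# Short scale (modern English usage): each "-illion" name is 1000x the previous.
SHORT_SCALE = {"million": 10**6, "billion": 10**9, "trillion": 10**12}

# Long scale (much of continental Europe): "-illion" names step by 10^6,
# and the "-illiard" names fill the gaps in between.
LONG_SCALE = {
    "million": 10**6,
    "milliard": 10**9,
    "billion": 10**12,
    "billiard": 10**15,
    "trillion": 10**18,
    "trilliard": 10**21,
}

def scale_ratio(word):
    """Return how many times larger the long-scale meaning of a word is."""
    return LONG_SCALE[word] // SHORT_SCALE[word]

print(scale_ratio("million"))   # identical in both systems
print(scale_ratio("billion"))   # a factor of one thousand apart
print(LONG_SCALE["milliard"] == SHORT_SCALE["billion"])
```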

Re:Presumably (0)

Anonymous Coward | about a year ago | (#43385421)

It's a little more complicated than that: British English did until recently use the "million, milliard, billion"; "million, billion, trillion" has been imported from the US within the last 20 years or so. In the early '90s the BBC were often very careful to say "thousand million", because "billion" was by then ambiguous. Since the late nineties we have "standardized" on 10^9=billion, much to the annoyance of the rest of Europe. Sorry about that chaps 8(

Re:Presumably (1)

tehcyder (746570) | about a year ago | (#43391093)

This website is written in English. So, for example you would see 3.1415927 here rather than 3,1415927. It is silly to quibble about how conventions are different in other languages/cultures. I wouldn't go to a Russian language website and start moaning about how the alphabet is all fucked up.

Re:Presumably (2)

Joce640k (829181) | about a year ago | (#43383465)

Spain uses 10^12

Re:Presumably (2)

Carewolf (581105) | about a year ago | (#43383935)

The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.

No, a billion is still 10^12. That has never changed. But because Americans usually get it wrong, the British now use the American billion when speaking about money, but the real billion when speaking about everything else. Of course billions are rarely used for anything other than money.

Re:Presumably (1)

Tim the Gecko (745081) | about a year ago | (#43387025)

The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.

No, a billion is still 10^12. That has never changed. But because Americans usually get it wrong, the British now use the American billion when speaking about money, but the real billion when speaking about everything else. Of course billions are rarely used for anything other than money.

I think you are a little out of date:

The Economist Pocket Style Book recommended 10^9 for "billion" back in 1986.

Re:Presumably (1)

tehcyder (746570) | about a year ago | (#43391123)

The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.

No, a billion is still 10^12. That has never changed. But because Americans usually get it wrong, the British now use the American billion when speaking about money, but the real billion when speaking about everything else. Of course billions are rarely used for anything other than money.

No one in Britain uses billion to mean 10^12 unless they are being deliberately anachronistic, and have no interest in communicating with other people. In the UK, you would say the world population was 7 billion, for instance.

Re:Presumably (1)

tehcyder (746570) | about a year ago | (#43391049)

Yes, it's not like the fucking summary says it's one billion web pages or anything, is it? Oh, wait...

This is FLAG in Florida (-1)

Anonymous Coward | about a year ago | (#43383053)

Home of the highest level Scientology services.

This is the Advanced Organization in Clearwater.

This is Saint Hill Manor, headquarters in the United Kingdom and once Ron's home.

This is the Founding Church, in Washington D.C.

This is the International Ecclesiastic Management Centre.

This is Celebrity Centre International in Hollywood

This is the Scientology College in England

And this is the Freewinds, Scientology's advanced religious retreat

Re:This is FLAG in Florida (1)

Anonymous Coward | about a year ago | (#43388843)

The trolls here are just getting weirder.

Yes. For future generations. (0)

Anonymous Coward | about a year ago | (#43383057)

I'm sure they'll want to look at low-def goatse 20 years from now.

archive.org? (5, Interesting)

denpun (1607487) | about a year ago | (#43383075)

Why not work with the good folks at archive.org and their Internet wayback machine [archive.org] ?

Is it not a similar idea?

The Internet Wayback Machine folks could use the funding and would be achieving the same purpose, albeit not in a format that the library folks might want....but they could come to agreement.

Re:archive.org? (1)

denpun (1607487) | about a year ago | (#43383083)

Was not able to access the article linked btw. (or parent site for that matter). /.ed already?

Re:archive.org? (0)

Anonymous Coward | about a year ago | (#43383167)

The Internet Wayback Machine folks could use the funding

The funding wouldn't help them unless it was more than the cost of actually undertaking the work. In which case the bit that actually helps them is more like a straight donation, so why not do the work themselves and make that donation too (assuming that they wanted to).

Re:archive.org? (5, Insightful)

kaiidth (104315) | about a year ago | (#43383225)

Without wishing to offend it, the BL is a monolithic organisation that doesn't always play well with others. Part of that is because funding doesn't always work that way. You can get money for claiming that you are going to do the very first über-awesome UK archive, but your chances of receiving the funding become rather lower if in the very first breath you point out that somebody else has been doing pretty much this for a decade. Another part of it is: most politicians would likely want the national heritage, such as it is (jubilee celebration tweets - please...) to be held by that nation's own national library.

I would imagine the BL have referenced archive.org work extensively, but differentiate this project with what tits in suits like to call "a compelling USP." To put it in plain English, they'll have a neat explanation that suggests that they are totally aware of previous work in the domain whilst making sure that this project looks a) different, b) excitingly new and c) contextually, better.

Re:archive.org? (1)

Anonymous Coward | about a year ago | (#43383397)

Without wishing to offend it, the BL is a monolithic organisation that doesn't always play well with others.

Where 'others' also includes people who might wish to make use of the library, but are refused admission despite a research case. Whereas all UK undergraduates are automatically granted access.

Re:archive.org? (1)

93 Escort Wagon (326346) | about a year ago | (#43384987)

Without wishing to offend it, the BL is a monolithic organisation that doesn't always play well with others.

And you REALLY don't want to piss off their Rare Book Retrieval Unit!

Re:archive.org? (2)

ibwolf (126465) | about a year ago | (#43385055)

I would imagine the BL have referenced archive.org work extensively

They've actually worked closely with the Internet Archive for many many years. This includes commissioning IA to conduct crawls for them of government sites.

Both the BL and IA are members of the International Internet Preservation Consortium (IIPC; see http://netpreserve.org./ [netpreserve.org.]). Both are very familiar with what the other is doing in this space.

So why not let IA do all the work? There are several reasons. Part of it is that the BL is responsible for web archiving as far as British cultural heritage is concerned. Relying on a foreign entity to handle it is questionable as they would not be able to enforce any/all policies they might need on IA. You can certainly contract the IA to crawl for you, but it will be on their terms.

However, there is also a question of redundancy. If multiple institutions, all over the world, are all engaged in web archiving, the ultimate result will be much better coverage and resilience. From my experience in dealing with the Internet Archive, this is something they support. Ever since I got involved in web archiving, 10 years ago, the Internet Archive has been a strong supporter of national libraries, archives and other interested parties doing their own web archiving.

That is why the IIPC was formed. So we could share knowledge and pool resources where useful while each institution follows its own path in web archiving.

Re:archive.org? (1)

kaiidth (104315) | about a year ago | (#43385377)

See, what you're saying is both sensible and unsurprising, but here's what bothers me: TFA doesn't acknowledge any of what you are saying. Instead, it suggests this is a novel activity, which seems ridiculous but happens for political reasons.

Re:archive.org? (1)

Anonymous Coward | about a year ago | (#43383393)

The British Library will probably use the same techniques as the Internet Archive.

Some reasons:
* the Internet Archive may go bankrupt and the material may be lost. Government libraries may have - in theory at least - more reliable funding to preserve the material.
* it is easier to do targeted crawling (of specific themes) using your own workers than a 3rd party company
* there are some legal matters that may make it more "illegal" for the 3rd party to do the crawling than if the government organization does it (as specified in the law)
* some government organizations may not want to outsource their work to companies for many different reasons (lack of control etc)
* it may be cheaper to do your own generic crawls than pay IA to do it

Re:archive.org? (0)

Anonymous Coward | about a year ago | (#43383957)

Why not work with the good folks at archive.org and their Internet wayback machine [archive.org] ?

The actual reason is legal. The British Library is a specially designated deposit library, and so under Section 44A of the Copyright, Designs and Patents Act 1988 [legislation.gov.uk] it is allowed to make an archival copy of anything from the internet without the copyright holder's permission. It's doubtful whether what archive.org is doing is legal under UK law, not that it cares because archive.org is based in the US.

Re:archive.org? (0)

Anonymous Coward | about a year ago | (#43383997)

"The actual reason is legal."

Nah. That's just the BL's reason for involvement in the work. It doesn't stop the BL collaborating with archive.org (i.e. sharing tools, technology, developers, etc).

Re:archive.org? (0)

Anonymous Coward | about a year ago | (#43385669)

That was my first thought. Isn't the wayback machine already doing this? I found the old original site I had made and run in 2000/2001.

Re:archive.org? (1)

tehcyder (746570) | about a year ago | (#43391653)

Why not work with the good folks at archive.org and their Internet wayback machine [archive.org] ?

Is it not a similar idea?

The Internet Wayback Machine folks could use the funding and would be achieving the same purpose, albeit not in a format that the library folks might want....but they could come to agreement.

This is specifically for UK web sites, and the British Library is a British institution funded by the British taxpayer. Archive.org is US-based and a separate entity.

Gotta love that management "thought" process (4, Funny)

93 Escort Wagon (326346) | about a year ago | (#43383085)

We had a manager, some years ago, who had the bright idea of assigning one staff member the task of printing out our entire website once a month so she (the manager) could look things up easily.

Data Storage (2)

Trpajzlix (2747079) | about a year ago | (#43383089)

How are they going to store the data? Isn't this whole library idea about storing things for future generations if there has been a war or other mass-scale destruction? So when "future generations" uncover this Babylonian/British collection of knowledge hundreds of years later, they can still learn from the remains? What are they going to get from a 200-year-old hard drive, covered in dust?

Re:Data Storage (4, Funny)

93 Escort Wagon (326346) | about a year ago | (#43383105)

How are they going to store the data?

They're planning to save disk space by just referencing the original page content inside of an iframe.

Re:Data Storage (4, Informative)

Anonymous Coward | about a year ago | (#43383159)

BL, and other memory institutions such as archives, apply a concept, called "Digital Preservation", to the stored data. This concept, based on the OAIS model, covers all stages of storage, administration, maintenance and retrieval of these "remains".

The hardest part of web archiving is not storing the data but how to render it in 200 years. They also need to store the browser, but nowadays browsers use so many different "subrenderers" such as Flash, Java, Javascript and CSS engines and whatnot to render a page, so there is also a need to archive all those subrenderers as well.

Best known strategy to date is to create and store emulator containers or VM's with the original software so they can be emulated in the far future.

http://en.wikipedia.org/wiki/Open_Archival_Information_System [wikipedia.org]

Re:Data Storage (0)

Anonymous Coward | about a year ago | (#43383243)

Always impressed by the flowcharts per paragraph ratio in digital preservation. Also the number of uses of the term 'framework'. You're nothing and nobody in this field if you haven't got at least one 'framework' to your name. Digital preservation is very obviously an Augean stable. There is some excellent work in there (including the aforementioned playing with emulators), but the field doesn't half need mucking out.

Re:Data Storage (2)

SternisheFan (2529412) | about a year ago | (#43383235)

How are they going to store the data?

They'll use the "Cloud".

..., Oh, wait...

Re:Data Storage (3, Funny)

N Monkey (313423) | about a year ago | (#43383321)

How are they going to store the data?

They'll use the "Cloud".

..., Oh, wait...

No problems. Plenty of those in the UK.

Re:Data Storage (0)

Anonymous Coward | about a year ago | (#43384025)

On the inter-internet under .co.uk

Inaccurate title! (0)

Anonymous Coward | about a year ago | (#43383113)

It's one billion pages, not one billion websites. Which would have been a lot of websites for a country of 63 million people.

Re:Inaccurate title! (1)

loufoque (1400831) | about a year ago | (#43383137)

Hasn't each person created at least 10 websites in their lives?

You can't just do it once... (4, Interesting)

icebike (68054) | about a year ago | (#43383157)

Unless you do this fairly frequently, say every 6 months at a minimum, the picture left for future generations will be muddled at best.
It's always interesting how the news changes with the passage of time, and events are seen very differently in just a few weeks.

On 9/11 I used Adobe's website-mining software that essentially captures every link on every page of a site and builds a large web replica in PDF form. All the links work within that PDF, and every page on the site is preserved. I pointed it at all the major news web sites, one large PDF for each, burned them to disk, and still have them today. (Yup, I violated a boat load of copyrights).

Two weeks later I did it again. You would be astounded at the difference. Entire pages are missing, not just unlinked, but even when you look for them by URL that appeared in the first capture, you won't find them in the second. Other news sites kept the old stuff on line, but the links often disappeared from their own web pages so that the only way to find these pages was by following links from some other site.

The point is, that a snapshot of the web does very little good, unless it has some collection. Looking at the archives of a newspaper from June 6 1944, wouldn't give you much of an idea of the Normandy invasion, unless you had subsequent editions from days and months forward.
But a web site isn't a newspaper with discrete editions, it is a constantly evolving thing, and archiving it today (or any point in time) is fairly useless, but archiving it daily is largely redundant, (most stories will be the same). You can't tell which stories changed over time based solely on the dates either, so you pretty well have to grab it all.
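
The comparison described above (capture a site, capture it again two weeks later, see what vanished) can be sketched in Python. This is only an illustration under stated assumptions: the two snapshots are hypothetical dictionaries mapping a URL path to the captured page text, and `fingerprint` and `diff_snapshots` are made-up helper names, not part of any archiving tool.

```python
import hashlib

def fingerprint(snapshot):
    """Map each URL to a SHA-256 digest of its captured body."""
    return {
        url: hashlib.sha256(body.encode("utf-8")).hexdigest()
        for url, body in snapshot.items()
    }

def diff_snapshots(old, new):
    """Report which URLs were removed, added, or changed between two captures."""
    old_fp, new_fp = fingerprint(old), fingerprint(new)
    removed = sorted(set(old_fp) - set(new_fp))
    added = sorted(set(new_fp) - set(old_fp))
    changed = sorted(
        url for url in set(old_fp) & set(new_fp) if old_fp[url] != new_fp[url]
    )
    return {"removed": removed, "added": added, "changed": changed}

# Hypothetical captures of the same site, two weeks apart.
first = {"/": "front page v1", "/story-a": "breaking story", "/about": "about us"}
second = {"/": "front page v2", "/about": "about us", "/story-b": "follow-up"}

result = diff_snapshots(first, second)
print(result)
```

Storing only the pages that show up as added or changed is one way to avoid the redundancy of a full daily capture, at the cost of having to fetch every page to find out.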

Why doesn't the Library simply work a deal with the Wayback Machine Internet Archive [archive.org] . They seem to have this problem fairly well thought out. Maybe they plan to do that. I can't tell because the site that wants to archive all of Britain seems slashdotted at the moment.

It seems that libraries are about the only place that can get away with ignoring copyright these days.

Re:You can't just do it once... (2)

El_Muerte_TDS (592157) | about a year ago | (#43383315)

> (Yup, I violated a boat load of copyrights).

So, you distributed the created PDFs? If you didn't, and it's still in your private collection, then how did you violate the right of creating copies?

Re:You can't just do it once... (2)

bumburumbi (1047864) | about a year ago | (#43383351)

The National Library of Iceland has had a similar program for a couple of years. The national TLD is collected three times a year and made available via the Wayback Machine [archive.org] . The English version of the project's page [vefsafn.is] is rather terse, but according to the Icelandic version, selected pages are collected more frequently when warranted, e.g. political debates around election times. Icelandic law requires publishers to deposit copies of their work with the National Library. This includes web pages so the library doesn't have to worry about copyright.
For a small country with few resources, co-operation with other small countries and archive.org is probably best. The task of collecting the British TLD is orders of magnitude bigger. It may well be cheaper for the British Library to pay for a system tailored to their needs rather than figure out how to make archive.org's software do what the library needs.

Re:You can't just do it once... (0)

Anonymous Coward | about a year ago | (#43383377)

Those websites are terrible, then.

The fact that you aren't linked by permalink and those links change is absolutely embarrassing for a website.
Even Facebook isn't that terrible, single comments have permalinks.

Link-breaking is bad enough as it is already during website restructures, but doing it on purpose through terrible design is rage-inducing.

Re:You can't just do it once... (0)

Anonymous Coward | about a year ago | (#43383387)

I hope you did not pay for that tool, considering that you get the same functionality by using the program wget. You can get the Windows-version of the program (I assume your Adobe program is for Windows) here: http://gnuwin32.sourceforge.net/packages/wget.htm

Wget does however not generate any PDF-versions for you. It does allow you to browse the downloaded websites using your regular browser though, as if they were still on the Internet.
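
For anyone who wants to try this, here is a minimal Python sketch that assembles a typical wget mirroring command. The function names, the `example.com` URL and the `mirror` directory are placeholders for illustration; the flags themselves are standard GNU wget options, but check your installed version's manual before relying on them.

```python
import shutil
import subprocess

def wget_mirror_command(url, dest_dir):
    """Build a wget command line that mirrors a site for offline browsing."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so the local copy browses offline
        "--page-requisites",   # also fetch images, CSS and scripts
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # stay below the starting directory
        "--wait=1",            # pause one second between requests
        f"--directory-prefix={dest_dir}",
        url,
    ]

def run_mirror(url, dest_dir):
    """Run the mirror if wget is installed; not called in this sketch."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    subprocess.run(wget_mirror_command(url, dest_dir), check=True)

cmd = wget_mirror_command("http://example.com/", "mirror")
print(" ".join(cmd))
```

Only the command is printed here; calling `run_mirror` would actually start the download.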

Re:You can't just do it once... (1)

Tastecicles (1153671) | about a year ago | (#43383839)

I use Backstreet. OK it's £13 after the 30-day trial, but it's bloody handy to have a full relinking of crawled content so you can pretty much pull a website, import it into a VM, and do what you want to do there. Me? I PDF what I download using Acrobat X batch conversion then run the fulltext indexing engine. Considering it's all running on a VM it ain't half fast, even if it is currently holding an index of 6 million pages.

Oh yeah, and it runs on Linux via Wine. Not that I run it on Linux, I run it native in Win7 64-bit.

Re:You can't just do it once... (0)

Anonymous Coward | about a year ago | (#43383591)

It seems that libraries are about the only place that can get away with ignoring copyright these days.

Seeing as this programme is set up by the law that CREATES copyright, and is instituted under the aegis of the Copyright Act, I think you're missing the point if you consider this to be 'ignoring copyright'.

Re:You can't just do it once... (1)

dkf (304284) | about a year ago | (#43383753)

Why doesn't the Library simply work a deal with the Wayback Machine Internet Archive [archive.org] . They seem to have this problem fairly well thought out. Maybe they plan to do that. I can't tell because the site that wants to archive all of Britain seems slashdotted at the moment.

I imagine that it will eventually happen, and that it will end up enriching the archive.org system when it does. Maybe it won't happen for a year or two, but when we're talking about long term preservation, that's not so important and the global nature of the internet makes it valuable (and logical) to globally coordinate the historical archives of it as well.

It seems that libraries are about the only place that can get away with ignoring copyright these days.

National libraries cannot ignore copyright, but they have a special position with regards to copyright law: they're explicitly empowered to retain copies for future generations whether the publishers like it or not (and whether or not they're Big Media). If you don't want it archived for future generations, don't publish it at all.

The process will take five months (1)

hcs_$reboot (1536101) | about a year ago | (#43383163)

They should definitely reduce the time allotted to that tea break..

Come on morons... (0)

Anonymous Coward | about a year ago | (#43383201)

It's about developing the architecture to take continuous snapshots of the web for intelligence purposes. Nothing more. Or else they would just fund the Internet Archive.

Re:Come on morons... (2)

SternisheFan (2529412) | about a year ago | (#43383291)

One of the comments from the CNN story was, "The UK web archive is actually using archive.org's software. The point is that archive.org has only got so much money, and only archives a percentage of the web. Having the BL support this is a good thing."

Re:Come on morons... (1)

SternisheFan (2529412) | about a year ago | (#43383333)

I mean the BBC article, http://www.bbc.co.uk/news/entertainment-arts-22028738 [bbc.co.uk] ... I noticed that the British people post such polite comments: out of the 88, only 2 were moderated out.

Wandalust1956 5th April 2013 - 10:01

This is just, in essence, the 21st Century equivalent of the Mass Observation project that started in the 1930's and included the diary of a housewife from Cumbria during the 2nd World War...which was turned into a TV play by Victoria Wood. http://www.massobs.org.uk/index.htm [massobs.org.uk] As long as the content is "relevant" to current affairs then it could be a cultural insight to life in the 21st Century.

I'll see your Internet Archive and raise you... (1)

Tastecicles (1153671) | about a year ago | (#43383341)

...typically British utter redundancy.

Re:I'll see your Internet Archive and raise you... (0)

Anonymous Coward | about a year ago | (#43383403)

Redundancy in this sort of things (well, anything that matters to you really) is a good thing. Why rely on a single organisation that loses everything if it runs out of money? Not that the Internet Archive archives everything anyway.

Re:I'll see your Internet Archive and raise you... (1)

Tastecicles (1153671) | about a year ago | (#43383801)

my point is (and I apologise if I didn't make it obvious) that this isn't news. IA has archived the internet, and done a fairly decent job of it. The BL is off on a "Me Too!" campaign and the BBC are all over it like it's a first.

Re:I'll see your Internet Archive and raise you... (1)

tehcyder (746570) | about a year ago | (#43391763)

There is still no harm in a national archiving organisation doing its job for its own country's data.

Re:I'll see your Internet Archive and raise you... (1)

tehcyder (746570) | about a year ago | (#43391749)

...typically British utter redundancy.

Yeah, we're the sort of idiots who make more than one back up of important data. What's the point of that eh?

Hint: redundancy is sometimes a very, very good thing indeed.

NLA (0)

Anonymous Coward | about a year ago | (#43383357)

I believe that the National Library of Australia already does this, but there are issues around copyright for granting access to these archives. Thanks again America for the free trade agreement and all of your shitty copyright rules

Re:NLA (0)

Anonymous Coward | about a year ago | (#43384073)

I believe that the National Library of Australia already does this, but there are issues around copyright for granting access to these archives. Thanks again America for the free trade agreement and all of your shitty copyright rules

You are welcome. Pray that we do not create more evil copyright rules for you to bow down to. (Bwah-hah-hah-ha) - Signed, Evil U.S. (tm)

Preserve Goatse (-1)

Anonymous Coward | about a year ago | (#43383399)

You know it makes sense.

www.Goatse.co.uk www.goatse.org.uk www.goatse.ac.uk www.goatse.ltd.uk www.goatse.plc.uk

Wow (2)

databeam (867515) | about a year ago | (#43383557)

That's going to be a lot of porn!

Illegal Content (1)

wisnoskij (1206448) | about a year ago | (#43383719)

So will they be getting legal permission to host all of this copyrighted material?
Don't all the individual websites own their own content? How does archive.org even get around this?
And what about the illegal porn, cracks, hacks, and viruses?

Re:Illegal Content (1)

PPH (736903) | about a year ago | (#43383993)

And what about the Elgin Marbles?

Re:Illegal Content (0)

Anonymous Coward | about a year ago | (#43383995)

They already have legal permission. It is a long existing legal requirement under UK copyright law that for ANY printed material published in the UK, a copy is provided free of charge to the British Library, and on request to five other "libraries of legal deposit" in Scotland, Wales, Oxford, Cambridge and Dublin (that's a library not even in the UK!).

As I understand it, this project has been waiting for suitable updates to the law to be in place before going ahead with its archiving. If you don't like it then you have the choice of not publishing publicly accessible material on the .uk domain. The law allows the libraries of deposit to make copies for archiving and preservation purposes, and to make these copies available for readers at the library; I don't think they're able to republish pages online as the Internet Wayback machine does. On the other hand archive.org does allow you to opt out via a robots.txt file.
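
As a rough illustration of the robots.txt opt-out, here is a short Python sketch using the standard library's robots.txt parser. The rules and the `example.co.uk` URLs are made up, and the `ia_archiver` user-agent token is an assumption about what the Wayback Machine's crawler has historically honoured; check archive.org's current documentation rather than trusting this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one archiving crawler entirely,
# and keep every other crawler out of /private/ only.
robots_txt = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

archiver_ok = parser.can_fetch("ia_archiver", "http://example.co.uk/index.html")
other_ok = parser.can_fetch("SomeOtherBot", "http://example.co.uk/index.html")
other_private = parser.can_fetch("SomeOtherBot", "http://example.co.uk/private/a.html")

print(archiver_ok)
print(other_ok)
print(other_private)
```

Note that robots.txt is a request, not an enforcement mechanism, and a legal deposit crawl as described above would not necessarily be bound by it.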

Re:Illegal Content (0)

Anonymous Coward | about a year ago | (#43384141)

"The law allows the libraries of deposit to make copies for archiving and preservation purposes, and to make these copies available for readers at the library; I don't think they're able to republish pages online as the Internet Wayback machine does."

Hah. It's another one of the British Library's 'we'll give access to our mates and sod everybody else' things. Bugger open data, let's corner our market. Also, how useful are a billion tweets if you can only access them as a 'reader at the library'? Are researchers going to do social network analysis by hand?

Re:Illegal Content (0)

Anonymous Coward | about a year ago | (#43385725)

Also, how useful are a billion tweets if you can only access them as a 'reader at the library'?

About as useless as they are under any other circumstance.

Re:Illegal Content (0)

Anonymous Coward | about a year ago | (#43385753)

Hah! True that.

Please pay UK taxes, then. (0)

Anonymous Coward | about a year ago | (#43385825)

If you want access to this then pay toward the taxes that will fund it.

Thanking you in advance,

A UK taxpayer.

Re:Please pay UK taxes, then. (0)

Anonymous Coward | about a year ago | (#43385907)

I am a UK taxpayer you insensitive clod.

Average Web Site (2)

wisnoskij (1206448) | about a year ago | (#43383723)

So the average website contains about 1 thousand pages then? That seems like a lot...

Re:Average Web Site (1)

tehcyder (746570) | about a year ago | (#43391783)

So the average website contains about 1 thousand pages then? That seems like a lot...

No, it doesn't. Imagine how many pages something like the BBC website has on any particular day.

Re:Average Web Site (1)

wisnoskij (1206448) | about a year ago | (#43393089)

Yes, but you would be hard pressed, in my opinion, to find more than a few hundred regular websites that contain around or more than 1000 pages. Even adding in every medium or larger sized forum, 1000 really seems like a lot. I think the mode (a type of average) website would have something like 10 pages, with a bunch more in the 50 range, and still quite a few at a few hundred. But I really do not see many websites that have over 1000.

I guess news sites that keep every article they ever published in the last 100 years up would balance the scales with tens of thousands if not more, but I still find it a large number.
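To make the mean versus mode point concrete, here is a rough sketch. The first calculation uses the two figures quoted in this thread (one billion and 4.8 million), assuming the billion refers to pages and the 4.8 million to websites. The `SAMPLE` list is entirely invented to show how a single huge site drags the mean far above the mode; it is not real data.

```python
from statistics import mean, mode

# Figures quoted in this thread; treating "one billion" as pages is an assumption.
TOTAL_PAGES = 1_000_000_000
TOTAL_SITES = 4_800_000


def mean_pages_per_site(pages: int, sites: int) -> float:
    """Arithmetic mean of pages per site (hypothetical helper)."""
    return pages / sites


# Hypothetical page counts for ten sites, invented purely for illustration:
# mostly tiny sites, plus one very large news-style site.
SAMPLE = [10, 10, 10, 10, 50, 50, 200, 300, 1_000, 50_000]

thread_mean = mean_pages_per_site(TOTAL_PAGES, TOTAL_SITES)
sample_mean = mean(SAMPLE)
sample_mode = mode(SAMPLE)

if __name__ == "__main__":
    print(f"Mean pages per site from the thread's figures: {thread_mean:.0f}")
    print(f"Hypothetical sample: mean={sample_mean}, mode={sample_mode}")
```

Under those assumptions the thread's figures give a mean in the low hundreds rather than a thousand, and the made-up sample shows how a typical (modal) site can be tiny even when the mean is large.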

wasted effort (0)

Anonymous Coward | about a year ago | (#43384275)

The library or some government agency probably already has an archive of news programmes, and the library already archives newspapers and magazines... and for everybody else, there are CCTV recordings.

Assumptions and questions (2)

Martin S. (98249) | about a year ago | (#43384567)

There seem to be a few posts making incorrect assumptions and raising questions. I was involved as a technical architect on the long-term preservation store aspect of this project a few years ago.

archive.org: The BL is already cooperating with a number of other organisations doing the same thing, including archive.org, the Smithsonian, and the Scottish, French, Australian, Canadian and quite a few other national libraries. archive.org has been an important technology spike for these but is not the whole solution.

Preservation: The BL has a legal responsibility to preserve its archive, including this content, essentially forever, which is a significant technology challenge.

Legal: archive.org is essentially opt-in; the BL programme is a legal deposit requirement. The site content for any .uk TLD should be collected at least once a year. An important piece of the technology puzzle is to identify these sites and manage this process.

Scale: The last scaling I saw placed the BL archive about two orders of magnitude larger than archive.org and growing faster. The number of new websites in .uk grows faster than the awareness of archive.org. There are a lot of challenges:

- Maintain structure and semantic context.

- Searchable Meta Data

- Searchable Content

- Re-Presentation

Re:Assumptions and questions (0)

Anonymous Coward | about a year ago | (#43385035)

"BL has a legal responsibility to preserve its archive, including this content essentially forever; which is a significant technology challenge."
In truth, the BL has a legal responsibility to preserve its archive only as long as they can afford to stay open. Quite rightly, therefore, it plays the game, resulting in news articles like this one that are transparently based on press releases originating with pressandpolicy.bl.uk in which the egos of MPs and senior managers are massaged and complications like 'archive.org have been archiving lots of UK sites for ages' are glossed over. That's fine, or if it isn't fine it is understandable, but it can be confusing. The original press release (most likely this one [pressandpolicy.bl.uk] ) isn't actually that bad since at least it focuses on legal deposit, but like most press releases it fails to preempt, acknowledge or answer the obvious questions like how exactly does this differ from archive.org? and aren't you just reinventing the wheel?

As for the challenges, meh and pshaw. It's all good stuff but how much of that is unique to this project?

Alert: Misleading/Fraudulent Title (0)

Anonymous Coward | about a year ago | (#43384795)

The title says "One Billion UK Websites" but the first sentence of the post says "4.8 million websites." Clearly, the poster is being misleading or fraudulent. Oh timothy please be consistent with your own post.

BL supports strong copyright (0)

Anonymous Coward | about a year ago | (#43385271)

Yet they plan to copy other people's endeavours without a thought.

French National Library has done it since 2006 (1)

aikawa (776347) | about a year ago | (#43388363)

The BnF (French National Library) started doing this in 2006 for a selection of .fr websites.
In 2011 they had 16.5*10^9 files.
They store content on "Petaboxes" made by the Internet Archive.

See http://www.bnf.fr/en/collections_and_services/book_press_media/a.internet_archives.html [www.bnf.fr]

Clarification for posterity (1)

illtud (115152) | about a year ago | (#43429059)

I'm pretty late to this story, but let me clear up some misunderstandings for posterity's sake:

Disclosure: I've been involved in this effort for at least ten years, I'm head of ICT for one of the UK Copyright Libraries (National Library of Wales), and this story goes way back to the Primary Legislation passed by the UK in 2003, and we've been working on the practicalities of this since before that legislation was passed.

* Yes, Internet Archive and others have been archiving web sites for many years. We're using their software for capturing.

* We've been collecting and archiving web sites by agreement with the web publishers for years via the UK Web archive [webarchive.org.uk] project.

* What's different here is that the secondary legislation [legislation.gov.uk] has been passed (in March) that has given the UK copyright libraries the mechanism (agreed with publishers) to extend legal deposit [wikipedia.org] to digital publications, which includes websites.

* This gives the legal deposit libraries the right to add to the national legal deposit collections (the collection of all published material for the UK) digital publications, including ebooks, ejournals and websites.

* Until the 6th of April 2013, we did not have the right (under normal copyright law) to take a copy of websites without permission. Previously we had to request a written agreement from each website we archived to take a copy - obviously this does not scale very far.

* Under the new legislation, we will be taking periodic copies of the entire .uk domain and other websites in other domains which fall under the regulation [legislation.gov.uk] (territoriality has been difficult to define, as you may imagine).

* The difference between us and the Internet Archive is intended to be that given the status as a national collection, the material that we collect is intended to be available in perpetuity. Our print collections go back centuries, and the intention is that the digital material we collect now will also be available in centuries to come. You can read about the distributed redundant storage here [www.bl.uk] .

TL;DR : this is a legal thing, not a technical thing, and it's about a lot more than websites.
