
Archiving Digital History at the NARA

timothy posted more than 8 years ago | from the sort-and-toss dept.

Data Storage | 202 comments

val1s writes "This article illustrates how difficult archiving is compared to just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration expects to have as much as 347 petabytes to deal with by 2022. Are we destined for a 'digital dark age'?"


347 petabytes? (4, Insightful)

ravenspear (756059) | more than 8 years ago | (#12916109)

Ok, I was tempted to make a pr0n joke about this, but I think the bigger question is what kind of indexing system this will use.

I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone who wants to find a piece of data in that haystack... err, haybarn.

Re:347 petabytes? (0)

Anonymous Coward | more than 8 years ago | (#12916137)

Hey just use Spotlight.

I can just imagine typing the first few letters of 'Clinton' and Spotlight going through its petabyte index and delivering results 'in real time'.

Not.

It would probably kernel panic at that point and wipe out the 347 petabytes of storage in one crash (especially if FireWire drives with an Oxford chipset were used somewhere). Oh well, that was that then.

Re:347 petabytes? (2, Informative)

ravenspear (756059) | more than 8 years ago | (#12916231)

Well considering that Spotlight took about 2 hours to index my 120 GB drive, that would be (347 * 1024^2) * 2 = 72771174 hours = 83,000 years to index that much data.

Now I'm sure the gov would use a faster system than my laptop, but still!

Re:347 petabytes? (1, Interesting)

Anonymous Coward | more than 8 years ago | (#12916396)

Wow. I didn't know one could mess up such simple math so badly... It's a simple rule of three - basic high school math!

120 GB / 2 h = 60 GB/h indexing speed.
347 PB = 347 000 TB = 347 000 000 GB (or use 347 x 1 048 576 - but HD manufacturers never use that; they like to inflate numbers).
347 000 000 GB / (60 GB/h) = 5 783 333 (and 1/3) h.
At 24 h/day, 365 d/yr, we get about 660 years.
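
For anyone who wants to re-run those numbers, a quick Python sketch (decimal units and the 60 GB/h rate from above):

    # Back-of-the-envelope check of the indexing-time estimate.
    total_gb = 347 * 1000 * 1000        # 347 PB in GB (decimal units)
    rate_gb_per_hour = 120 / 2          # 120 GB indexed in 2 hours
    hours = total_gb / rate_gb_per_hour
    years = hours / (24 * 365)
    print(f"{hours:,.0f} h = {years:,.0f} years")   # ~5,783,333 h = ~660 years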

You were just a little over 120 times too high. Let's just hope you don't write code for a living :p

Still bloody too much, but it's not as if the indexing is going to be done by a single processor across a single bus. Anything like that has got to be done by means of distributed computing (duh), so this math is completely irrelevant anyway :)

And it's not like spotlight is much of a reference either, perhaps make comparisons with big commercial indexing solutions, or open source implementations that could be scaled...

Comparing distributed indexing of redundant network storage of some sort with Spotlight indexing a local IDE disk is just laughable. Apples and oranges.

Re:347 petabytes? (1)

ravenspear (756059) | more than 8 years ago | (#12916532)

Wow, thanks for catching that. I had it right up to the point where I stopped, but I forgot the last step. I calculated a time of 2 hours for each GB instead of 2 hours for each block of 120 GB. 83,000 / 120 is indeed roughly 660.

The funny thing is I got an A in Calc III last semester. ;)

Re:347 petabytes? (1)

gipsy boy (796148) | more than 8 years ago | (#12916614)

Let's hope you didn't get one in Algorithmic Complexity :)
Just because an algorithm takes time n for m inputs doesn't mean it takes 2n for twice that amount (especially not when it comes to indexing; index structures usually scale logarithmically rather than linearly).

Re:347 petabytes? (1)

ravenspear (756059) | more than 8 years ago | (#12916775)

Actually I haven't taken any algorithms classes yet, but that's a good thing to remember.

One thing though, wouldn't it still be linear for the entire process? I mean I understand what you are saying as far as the algorithm goes. It's not necessarily going to take twice as long for the algorithm that creates the index to run createIndex(a,b,c,d) compared to createIndex(a,b).

But you still have to scan twice as many files to derive the inputs. How could that part not be linear?

Try to help correct other's math sans sarcasm. (5, Insightful)

jbn-o (555068) | more than 8 years ago | (#12916734)

You were just a little over 120 times too high. Let's just hope you don't write code for a living :p [...]

To you and the countless others on /. who offer their corrections in a similar tone: yes, we get it, the parent poster goofed and you supplied a correction. Given the trivial context here, it's hardly a big deal and doesn't warrant sarcasm. Everyone makes mistakes, and plenty of people make mistakes in their work every day, including people who do work where lives are at stake. That's one reason why it is good to work with other people. In life it's far more important to be forgiving, keep things in perspective, and help other people without the wiseacre commentary, and then move on.

Re:347 petabytes? (3, Informative)

OrangeSpyderMan (589635) | more than 8 years ago | (#12916219)

I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone who wants to find a piece of data in that haystack,

Haven't you? Have you ever worked with real archiving before? IBM has some nice solutions that let us store on disk and a WORM library (Tivoli Storage Manager) and index in a (large) Oracle DB - they work and scale just fine (our experience covers a couple of hundred terabytes). You probably wouldn't want all that data in a single archive anyway, but I'd guess you'd know that if you'd ever archived anything....

Re:347 petabytes? (3, Informative)

CodeBuster (516420) | more than 8 years ago | (#12916247)

The most common structure used to index large amounts of data stored on magnetic or other large-capacity media is the B-Tree and its variants. The article linked here [bluerwhite.org] explains the basic idea of the balanced multiway tree, or B-Tree. The advantage of this type of index is that it can be stored entirely on the collection of tapes, cartridges, disks or whatever else, while only the portion of the tree currently being operated on needs to be read into volatile or main memory. The B-Tree allows efficient access to massive amounts of data while minimizing disk reads and writes. Theoretically, the B-Tree and its variants could be scaled up to address an unlimited amount of data in logarithmic time.
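
To make the idea concrete, here is a minimal Python sketch of B-tree search (not any particular archive's implementation, and insertion/balancing are omitted): each step reads one node, so even an enormous index needs only a logarithmic number of disk reads.

    import bisect

    class BTreeNode:
        # In a real index each node would be one disk/tape block
        # fetched on demand, not an in-memory object.
        def __init__(self, keys, children=None):
            self.keys = keys          # sorted keys in this node
            self.children = children  # None for a leaf

    def search(node, key):
        # Descend from the root, reading one node (one block) per level.
        while node is not None:
            i = bisect.bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return True           # a real index would return a record locator
            if node.children is None:
                return False          # hit a leaf without finding the key
            node = node.children[i]   # follow the i-th child pointer
        return False

    # Tiny example: a root with three leaves.
    root = BTreeNode([10, 20], [BTreeNode([1, 5]),
                                BTreeNode([12, 15]),
                                BTreeNode([25, 30])])
    print(search(root, 15), search(root, 7))  # True False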

Re:347 petabytes? (0)

Anonymous Coward | more than 8 years ago | (#12916364)

Theoretically, the B-Tree and its variants could be scaled up to address an unlimited amount of data in logarithmic time.

That so didn't make sense to me.

But maybe it's just me. :-)

Re:347 petabytes? (0)

Anonymous Coward | more than 8 years ago | (#12916441)

unlimited != infinite

Re:347 petabytes? (0)

Anonymous Coward | more than 8 years ago | (#12916517)

Google.

Re:347 petabytes? (0)

Anonymous Coward | more than 8 years ago | (#12916642)

Probably Reiser9 in the year 2022.

Usually when I archive... (-1, Troll)

Anonymous Coward | more than 8 years ago | (#12916112)

I try to keep it from the RIAA and the MPAA.

Joke:

Q. What's the difference between a businessman and a nigger?
A. The nigger steals a $100 car stereo and lands his ass in prison while the bussinessman steals millions of dollars and is praised by U.S. government.

Re:Usually when I archive... (2, Interesting)

ArchAngel21x (678202) | more than 8 years ago | (#12916232)

I guess you didn't see how Mr. Ebbers or the founder of Aldephia are facing prison time. Quit trying to spread that liberal lie that white collar crime pays off. By the way, it is inappropiate to refer to blacks as niggers. Grow up and learn to be a little more tolerant of diversity.

Re:Usually when I archive... (-1, Flamebait)

Anonymous Coward | more than 8 years ago | (#12916754)

"Liberal Lie" damn, shitAngel, you are stupid.

Compression and moderation? (1)

moz25 (262020) | more than 8 years ago | (#12916125)

Couldn't they get more capacity out of their storage system by compressing old information (more) aggressively? That shouldn't matter too much to the indexing mechanism. Also, it might make sense to tag the importance of different documents so that each document's compression/archiving treatment can depend on that.

Re:Compression and moderation? (1)

LiquidCoooled (634315) | more than 8 years ago | (#12916296)

The best way to compress the data is to apply the national-security black lines to the documents now rather than in 25 years' time.
This way, not only will security be maintained (assuming they remove the data rather than just paint it black in a PDF), but it will take up less space.

Most released documents I have seen have most lines blacked out, so after compressing the remaining text, you could fit the entire Clinton admin documents onto a single floppy disk.

Compression (0)

Anonymous Coward | more than 8 years ago | (#12916126)

Surely data of this sort lends itself well to compression?

Data loss will always be a possibility (4, Insightful)

divide overflow (599608) | more than 8 years ago | (#12916130)

It happened with the Great Library of Alexandria, with pagan libraries throughout the Christian era, and more recently has happened with antiquities in Afghanistan and Iraq. The only thing that can reliably preserve data is large scale, geographically widespread distribution of copies.

Re:Data loss will always be a possibility (0)

Anonymous Coward | more than 8 years ago | (#12916229)

they should use the google file system.

Re:Data loss will always be a possibility (3, Insightful)

tabdelgawad (590061) | more than 8 years ago | (#12916261)

Actually, it's more like 'inevitable'. I'll bet almost everyone has unintentionally lost digital data permanently and will do so again in the future.

The key, I think, is prioritization. We all do it individually (important stuff gets backed up many times and often; unimportant stuff perhaps never), and NARA will have to do it too. I don't think backing up a president's email and backing up some minor White House aide's email should have equal importance. The trick will be to come up with a reasonable prioritization scheme that makes the probability of losing the most important stuff very small.

True but... (1)

BlightThePower (663950) | more than 8 years ago | (#12916595)

I don't think backing up a president's email and backing up some minor White House aide's email should have equal importance.

I agree, really, but I also find that the problem with data is you never know until it's too late. The aide's email could be an international "smoking gun" lost forever vs. an eternally archived presidential request for diet soda on Air Force One.

I feel that if you can't completely automate backups, then the best thing is to give users easy access to backup resources for their own material so they can judge what's most important and what isn't. This happens in some organisations at the moment, but not in all; I used to work in a place where I had to make a special appointment with a tech just to burn a CD of stuff on my HD. Guess how much data we regularly lost as an organisation...

Re:Data loss will always be a possibility (1)

doshell (757915) | more than 8 years ago | (#12916795)

I think it also has to do with the fact that the media in which we store information are increasingly less durable (compare stone engraved millennia ago, and writings on paper from past centuries still readable today, with the relatively short life expectancy of magnetic and optical media).

Now I'm not saying we should all go back to the Stone Age, but it does make you think about the irony of progress...

Re:Data loss will always be a possibility (0)

Anonymous Coward | more than 8 years ago | (#12916278)

and more recently has happened with antiquities in Afghanistan and Iraq

Yeah. Starting with Moslem fuckers blowing up statues of the Buddha because in their skewed eyes they were heathen images. Assholes.

Re:Data loss will always be a possibility (0)

Anonymous Coward | more than 8 years ago | (#12916438)

Did you even know who Buddha was until the report of the destruction of the statues on Fox News? Or are you one of those godless non-Christians who do not believe in Jesus? Or perhaps, even worse, one of those Christians who do not believe in the Ten Commandments?
Exodus 20:4"You shall not make for yourself a carved image--any likeness of anything that is in heaven above, or that is in the earth beneath, or that is in the water under the earth;
5"you shall not bow down to them nor serve them. For I, the LORD your God, am a jealous God, visiting the iniquity of the fathers upon the children to the third and fourth generations of those who hate Me,
6 "but showing mercy to thousands, to those who love Me and keep My commandments.

Answer is Compression? (4, Informative)

reporter (666905) | more than 8 years ago | (#12916140)

The National Archives and Records Administration expects to have as much as 347 petabytes to deal with by 2022. Are we destined for a 'digital dark age'?

Perhaps the answer is compression.

Does anyone know whether there is an upper limit to text compression?

In signal processing, there is a limit called the Shannon Capacity theorem, which gives the maximum amount of information that can be transmitted on a channel. In text compression, is there a similar limit?

Note that the Shannon Capacity theorem does not tell you how to reach that limit. The theorem merely tells you what the limit is. For decades, we knew from the theorem that the maximum rate on a normal telephone twisted pair is about 56,000 bits per second. However, we did not know how to reach it until trellis coding was discovered, according to an electronic-communications colleague at the institute where I work.

If we can calculate a similar limit for text compression, then we can know whether further research to find better text compression algorithms would be potentially fruitful. If we are already at the limit, then we should spend the money on finding denser storage media.

Re:Answer is Compression? (1)

slavemowgli (585321) | more than 8 years ago | (#12916182)

There is no theoretical upper limit on text compression as far as I know (and I'd be rather surprised if there was [1]), but there *is* a (very basic) theorem from Kolmogorov complexity that says that for any compression algorithm you devise there's always data that can't be compressed (for a proof, simply compare the number of strings of length n with the number of strings of length less than n).

1. Well, I'd be surprised as long as you don't make any assumptions about the statistical distribution of bits in the text you want to compress. In other words, if you define certain properties and conjecture that all texts satisfy these properties, you may well be able to prove certain things (I'm not sure about this), but I don't think it'd really be very practical, as I'd say it's relatively likely that among those 347 PB there'll be data which does not have these properties.
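
The counting argument behind that theorem fits in a few lines; a quick Python sketch:

    # Pigeonhole: there are 2**n bit strings of length n, but only
    # 2**n - 1 strings strictly shorter than n, so any lossless scheme
    # must leave at least one length-n input uncompressed.
    n = 16
    length_n = 2 ** n
    shorter = sum(2 ** k for k in range(n))   # lengths 0 .. n-1
    assert shorter == length_n - 1
    print(length_n, shorter)                  # 65536 65535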

Re:Answer is Compression? (1)

Indianwells (661008) | more than 8 years ago | (#12916208)

But the article really focused on archiving as opposed to backup. Compression would work for backup, but archiving is an attempt to make the data searchable as well as permanent. Some type of indexed compression would save space, but is it really that important? If they simply started chaining SANs, there would be no issue with storage. If they used a flash-memory-based SAN, there would be no data loss -- although it would be quite expensive to build. Tie that into a highly indexed and hierarchical database with smart data management, and this problem seems surmountable...

Re:Answer is Compression? (1)

mcrbids (148650) | more than 8 years ago | (#12916545)

There is no theoretical upper limit on text compression as far as I know

Which is obviously some hot gas coming from your posterior. Otherwise: 1 (the Holy Bible, heavily compressed)

The amount of compression possible in a given string of numbers is inversely proportional to the amount of randomness in the input.
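
That inverse relationship is easy to demonstrate with any stock compressor; a quick sketch using Python's zlib:

    import os, zlib

    repetitive = b"the quick brown fox " * 5000   # highly redundant input
    random_bytes = os.urandom(len(repetitive))    # effectively incompressible

    for label, data in [("repetitive", repetitive), ("random", random_bytes)]:
        ratio = len(zlib.compress(data, 9)) / len(data)
        print(f"{label}: {ratio:.1%} of original size")
    # Typically the repetitive text shrinks to well under 1% of its size,
    # while the random bytes stay at (or slightly above) 100%.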

They use TIFF? (0)

Anonymous Coward | more than 8 years ago | (#12916214)

From the article:
and the 2000 census returns were converted into more than 600 million TIFF-format image files, some 40 terabytes of data

Why TIFF!? PNG (or any other lossless format) would reduce that considerably.

Re:They use TIFF? (1)

Fear the Clam (230933) | more than 8 years ago | (#12916236)

Why not just convert it to text? If a picture is worth 1000 words, they can knock the data down to 4 gigs right there.

Re:They use TIFF? (1)

Murphy Murph (833008) | more than 8 years ago | (#12916254)

From the article:
and the 2000 census returns were converted into more than 600 million TIFF-format image files, some 40 terabytes of data

Why TIFF!? PNG (or any other lossless format) would reduce that considerably.


Uhh, maybe because TIFF does support compression.
Both lossy and lossless.

Re:Answer is Compression? (1)

Biogenesis (670772) | more than 8 years ago | (#12916362)

Sounds a bit like 42: it'll tell us the answer, but we need something else to find the question.

Re:Answer is Compression? (3, Interesting)

MasterC (70492) | more than 8 years ago | (#12916369)

The only thing that comes to mind is information entropy [wikipedia.org]. If you're given a text document, you can determine the probability distribution for each letter, letter combinations, for words, or whatever you can think of. Then given the probability distribution, you can determine the information entropy. If, in the sum, you use log with base 2 then H(x) (see formal definitions [wikipedia.org]) gives you the entropy in bits.

For example, if you have a text file with letters of equal probability (all letters have a probability of 1/27) then the bits required to represent a single letter turns out to be ~4.7549 bits. (Indeed, 2^4.7549 = 27)

This is the upper limit of compression. Such methods as the, now 50-years old, Huffman coding [wikipedia.org] do decent work at approaching this limit (used in JPEG, for one).

So the answer to your question is: it's not broadly definable for "text" or "information" in general, but depends on the statistical patterns of the English language or of a specific document.
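
A minimal sketch of that calculation, estimating the zeroth-order entropy of a text from its observed symbol frequencies:

    import math
    from collections import Counter

    def entropy_bits_per_symbol(text):
        # H(X) = -sum(p * log2(p)) over the observed distribution
        counts = Counter(text)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    # 27 equiprobable symbols (26 letters + space): log2(27) = ~4.7549 bits
    print(entropy_bits_per_symbol("abcdefghijklmnopqrstuvwxyz "))

Real English text has a much more skewed distribution, so its per-letter entropy is lower; that slack is exactly what Huffman coding exploits.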

Re:Answer is Compression? (1)

mechsoph (716782) | more than 8 years ago | (#12916506)

For any given set of data, there is a lower limit beyond which it cannot be compressed. This is called the "entropy" of the data. This is essentially how much actual information the data contains. We talked about it in one of my sophomore CS courses. Lempel-Ziv (gzip, I think) compression approaches the entropy of the data as the size of the data approaches infinity.

Re:Answer is Compression? (0)

Anonymous Coward | more than 8 years ago | (#12916695)

Lempel-Ziv (gzip, I think) compression approaches the entropy of the data as the size of the data approaches infinity.

BZZT. That is only true if the data can be described accurately by a Markov model. For instance, LZ compresses the digits of pi roughly as badly as random digits, even though the entropy of that sequence is close to nil.

Re:Answer is Compression? (0)

Anonymous Coward | more than 8 years ago | (#12916546)

In text compression, is there a similar limit?

Yeah, it's called entropy.
The basic model is that in the end you're transmitting bits of pure information encoded in some not-too-efficient method (such as English). And according to the pigeonhole principle, there's no way to compress the data further, on average, than sending those nonpredictable, equiprobable bits directly.

Re:Answer is Compression? (0)

Anonymous Coward | more than 8 years ago | (#12916644)

Shannon's channel theory applies to text too. Doesn't his paper even have examples of coding text?

How is this an answer? It's not about the disk space.

Retain it all. (3, Insightful)

d3m057h3n35 (695460) | more than 8 years ago | (#12916156)

Perhaps it would be best to keep it all, even the stuff that now seems totally useless, like Clinton administration emails from Janet Reno to Madeleine Albright asking what she thinks about Norman Mineta and his "hot Asian vibe." With search technology improving constantly, keeping it would probably be better than throwing away stuff that could potentially be of interest, or spending the time developing AI to make the task less time-consuming. And besides, we can't make future historians' jobs too easy. They've gotta earn their pay, reminding us of the banalities of this age.

age of clutter... (0)

Anonymous Coward | more than 8 years ago | (#12916169)

Doesn't matter. We can't absorb the information available at any moment in real time, so we certainly cannot go back and absorb it later.
The abandonment of the notion that information should be evaluated and only the best archived -- as in traditional libraries -- is indeed likely to lead to a dark age. But it will be the mirror image of the old ones: unable to find the target in the clutter instead of unable to find it in the desert.

Every mail is sacred (3, Insightful)

kfg (145172) | more than 8 years ago | (#12916341)

Every mail is great
If a mail is wasted
The gods get quite irate

Every mail is wanted
Every mail is good
Every mail is needed
In your network neighborhood

Really, equating the inability to record and save every post-it note with those times and places where writing itself was denigrated into virtual nonexistence is a bit silly.

KFG

All that for things like... (1)

suitepotato (863945) | more than 8 years ago | (#12916184)

Dear Monica,
I did what last night? Man, I must have been smashed. You sure? ROTFLMAO...
Yours truly,
Bill

Seriously, we're archiving every little tiny 1 and 0 for what reason? Some things can just go in a zip file and be put on a CD, and that's it. Want them to stick around forever? Have the files put out every so often in the leftover space on AOL CDs. They'll never be gone.

AOL CD's (0)

Anonymous Coward | more than 8 years ago | (#12916598)

AOL CDs are closed. Believe me, I've tried...

Difference between data and trash (4, Insightful)

HermanAB (661181) | more than 8 years ago | (#12916187)

In the age of pen and paper, only important stuff was written down. Nowadays all crap is preserved. This is useless. There is a big difference between data and information.

Re:Difference between data and trash (1, Insightful)

Anonymous Coward | more than 8 years ago | (#12916407)

The problem is that it can be hard to know where the boundary between important and useless is...

Things that previous generations considered unworthy of preservation are greatly treasured today - look at all the old manuscripts of which we have only a few pages (because scribes reused the parchment). Look at the masterpieces that were painted over to save canvas.

As soon as we start to put down hard limits on what to preserve and what to leave alone, we risk losing information that future generations will value.

Besides - in many cases, it could just be easier to save everything. Trying to enforce standards and judging what should and shouldn't be preserved might be more labour-intensive than the alternative. Considering the rate at which information is generated, it might make sense to trade conserving storage against conserving labour... storage is easier/cheaper/more available :)

Re:Difference between data and trash (1)

jacksonj04 (800021) | more than 8 years ago | (#12916533)

The trick is to get your data infrastructure organised to start with. Because I have a predetermined system for organising my class notes (Microsoft OneNote, so shoot me), I can reliably pick out notes from a specific class by date, or by topic based on exam questions, or I can take the Google approach and just say "Find me anything to do with this".

The information I need is preserved in an easily accessible form because I decided to keep all my class notes organised, and as a result I've replaced 8 ring binders of poorly organised content with a tablet PC and searchable, editable content.

Good planned structure at the start helps organisation later. Google has made Gmail easily organised with tags; the world is getting closer to the idea that *everything* needs to be categorised by date, subject, relevance, people involved, etc., but it's a long way off yet.

Why do we need to archive everything? (1)

night_sky_nsci (838533) | more than 8 years ago | (#12916195)

I'm a little skeptical about why we have to archive all that information in the first place. History as we know it is established by searching for bits and pieces of evidence and putting them together; we know quite a bit about what happened 200, 300 years ago, but I am sure we don't have an equivalent of 200 petabytes of, say, parchment from which to study our recent history.

It'd be crazy to suggest the NARA audit every single bit (no pun intended) of archival data to determine whether it's worth archiving -- not only is it impossible, it flies in the face of the whole idea of archiving. However, the estimate of 347 petabytes may be too pessimistic, as surely not every kind of information they have is worth archiving. Just my two cents.

Re:Why do we need to archive everything? (4, Insightful)

felix71 (49849) | more than 8 years ago | (#12916356)

Actually, one of the main complaints Historians have is incomplete information about the past. Not having every little tidbit makes it impossible to figure out how people actually lived. History _should_ be more than just names, dates, and events. If we can properly preserve and index items that seem really mundane to us, future generations have a _much_ better chance of having some real understanding of how we developed as a society.

Re:Why do we need to archive everything? (0)

night_sky_nsci (838533) | more than 8 years ago | (#12916423)

Are you familiar with paleopathology? Anthropologists dig up bones and study traces of disease and injury in human skeletons. They use this to gain insight into the lifestyle Dark Ages people led; for example, a typical Dark Ages skeleton has bones that appear to have been broken more often than in other eras, suggesting to historians that Dark Ages people engaged in physically risky and demanding activities. Furthermore, by studying the regrowth at these broken-bone sites, they can figure out how the injuries would have been treated.

Dark Ages (5, Insightful)

TimeTraveler1884 (832874) | more than 8 years ago | (#12916198)

Are we destined for a 'digital dark age'?
If by "dark age" you mean a time in human history when more information is recorded than ever before, then yes, I suppose we are.

I think, more accurately, we are headed towards an age of super-saturation of information. I have no doubt we can store all the data we are currently generating and will generate. The question is how do we process it into something meaningful? Just because we have the ability to archive everything does not mean it will be useful to the [insert personally welcomed overlord] of the future.

Maybe historians of the future will be fascinated that Clinton's instant-message signoff was "l8ter d00d", but I doubt it. We'll want to save everything now, of course, because we can. But I suspect the majority of the information will just be filtered out when actually searched.

Personally, I take the "you never know" ideology and save everything.

Re:Dark Ages are ahead! All aboard (2, Funny)

screwthemoderators (590476) | more than 8 years ago | (#12916276)

I think it may be worse than that -- that there will be a huge proliferation of false information, sensationalistic 'infotainment,' advertising, propaganda, etc... Why, historians of the future may be depending on /. as their main source of information! Think of what a tragedy that would be!

Not a dark age... was the past so bright? (4, Insightful)

G4from128k (686170) | more than 8 years ago | (#12916206)

Digital technologies mean that archivists now enjoy orders of magnitude more information than they had in the past. Consider all the hallway and phone conversations or jotted notes lost in a paper-based organization, versus having an archive of e-mail, IM, and sticky-note digital files.

Digital technologies also mean that archivists now enjoy orders of magnitude more potential accessibility than in the past. Even if paper has a greater innate archival lifespan, its physical form makes it inaccessible to all but a select monkish class of archivists colocated with their paper archives. Even the select few archivists who are allowed access to paper archives can effectively process at best a dozen documents per minute (and only a dozen per hour if they must wander the files to find randomly dispersed documents).

By contrast, digital technologies radically expand access along two dimensions. First, technology expands the number of people who can access an archive across distance -- a remote researcher can have full access, including access to documents in use by other archivists. A low cost of copying documents means a wealth of information. Second, search tools provide prodigious access to the files -- searching/accessing/reading thousands or millions of documents per second.

To say we face a dark age is to presume that paper documents provided far more enlightenment and comprehensiveness of documentation than paper ever actually did.

Re:Not a dark age... was the past so bright? (1)

Nasarius (593729) | more than 8 years ago | (#12916674)

I think you're missing the point, which is that all that data is now much easier to lose, especially in the short term, if it's not taken care of properly.

Cost-of-copy and modes of failure (2, Interesting)

G4from128k (686170) | more than 8 years ago | (#12916780)

I think you're missing the point, which is that all that data is now much easier to lose, especially in the short term, if it's not taken care of properly.

Perhaps, perhaps not. Sure, digital data can be lost easily, but it can also be copied and backed up more easily. Assuming $0.01/page for a paper copy (a gross underestimate of the cost of paper, toner, and labor) and 10 kB of data per page (an overestimate), against $10/GB for high-end maintained storage, the cost ratio is at least 100:1 in favor of digital (and probably 1000:1). Inaccessible formats are a concern, but an automated batch process at initial archiving time can at least convert the data to some format standard with a longer likely lifespan (e.g., plain ASCII, RTF, PDF, HTML, etc.).

Paper has its own single-point-of-failure concerns, and the huge cost of copying makes those concerns real. Digital does add some new failure modes (e.g., format obsolescence), but I think those are not as burdensome as the physical costs of copies.
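
For what it's worth, the claimed ratio checks out; a quick Python sketch using the post's own assumptions:

    # Paper-vs-digital copy cost, using the numbers above.
    paper_cost_per_page = 0.01           # dollars/page (an underestimate)
    bytes_per_page = 10 * 1024           # 10 kB/page (an overestimate)
    digital_cost_per_gb = 10.0           # dollars/GB, maintained storage

    pages_per_gb = (1024 ** 3) / bytes_per_page             # ~104,858 pages
    paper_cost_per_gb = pages_per_gb * paper_cost_per_page  # ~$1,049
    print(paper_cost_per_gb / digital_cost_per_gb)          # ~105x for digital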

Answer is not compression, it's less data. (2, Insightful)

gus goose (306978) | more than 8 years ago | (#12916217)

People should think outside the box.

The answer to archiving the required volume is producing less volume. Case in point... we recently spent a week or so at work optimising a process that was I/O bound. The bugger took 10 hours to run. Although purchasing faster disks, converting to RAID0, and other techniques did whittle the execution time down to about 5 hours, the final solution was to redefine the process to reduce the actual I/O (we removed a COBOL sorting stage), and the process is now 2 hours.

Bottom line: with the 100 + 38 million dollars (FTFA) assigned to the project, I am sure I could eliminate a number of redundant positions, optimise some communication channels, retire voluminous individuals, replace inefficient protocols/people, and basically reduce the sources of data. Hell, if the US were to actually have peace instead of demand it, there would be much less need for military intelligence, political rhetoric, and other civil responsibilities. The military could be half the size, and what do you know, we could not only reduce the requirement for archiving but actually save money in the process.

Remember, government is a self-supporting process.

Go ahead, mark me a troll.

gus

Re:Answer is not compression, it's less data. (2, Insightful)

MasterC (70492) | more than 8 years ago | (#12916475)

...other techniques did whittle the execution time down to about 5 hours, the final solution ...is now 2 hours.

That's only a 60% reduction. A 60% reduction of 347 PB is still 138.8 PB...still a huge archival task.

Keeping just 1% of the data still leaves you with 3.47 PB. Not impossible, but still a daunting task.

Re:Answer is not compression, it's less data. (1)

rbarreira (836272) | more than 8 years ago | (#12916486)

The answer to archiving the required volume is producing less volume. Case in point... we recently spent a week or so at work optimising a process that was I/O bound. The bugger took 10 hours to run. Although purchasing faster disks, converting to RAID0, and other techniques did whittle the execution time down to about 5 hours, the final solution was to redefine the process to reduce the actual I/O (we removed a COBOL sorting stage), and the process is now 2 hours.

I'm sure I could do that in about 1 hour...

burn, knowledge, burn (2, Interesting)

Leontes (653331) | more than 8 years ago | (#12916223)

The ancient, esteemed Great Library of Alexandria [wikipedia.org] was burned to the ground as knowledge literally turned to smoke, lost to mankind forever. Was it barbarians? Motivated by political revenge? Demanded by religious zealots? Accidental byproduct of an act of war?

Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that led up to the horrible decisions our politicians made in the course of daily business? We want records of this?

Perhaps the easiest way of keeping this knowledge at all interesting or inspiring is to burn it regularly: let people imagine what happened to allow such blunders, or let apologists spin tales of delight explaining how stupid people stumbled upon genius decisions. Conspiracy theorists or intellectual artistry can probably generate far greater truths than the truth will ever reveal.

It would save a great deal of money too, just having a delete key. If we care so little for the decisions of the here and now, why preserve the information to be twisted by people in the future with their own biases and projects? We seem to care so little for truth nowadays; why should that change in the future?

Re:burn, knowledge, burn (0)

Anonymous Coward | more than 8 years ago | (#12916234)

Those who forgot history are destined to repeat it.

Re:burn, knowledge, burn (1)

Leontes (653331) | more than 8 years ago | (#12916241)

those that do not look up quotations before posting are destined to misquote them.

Re:burn, knowledge, burn (0)

Anonymous Coward | more than 8 years ago | (#12916630)

Those WHO do not match pronouns are destined to be grammar flamed.

Re:burn, knowledge, burn (2, Interesting)

mrogers (85392) | more than 8 years ago | (#12916408)

Doesn't it diminish the aura of a great work of art if you know that it can always be restored from a backup?

Re:burn, knowledge, burn (0)

Anonymous Coward | more than 8 years ago | (#12916422)

The ancient, esteemed Great Library of Alexandria was burned to the ground as knowledge literally turned to smoke, lost to mankind forever. Was it barbarians? Motivated by political revenge? Demanded by religious zealots? Accidental byproduct of an act of war?

league of shadows.

Re:burn, knowledge, burn (0)

Anonymous Coward | more than 8 years ago | (#12916481)

Reminds me of the Canadian Broadcasting Corporation (CBC), which threw a lot of old footage in dumpsters, and then lost some more in a fire.

And all this time I had been wondering why they weren't coming out with some of the older shows [from my youth] on DVDs :(

Re:burn, knowledge, burn (4, Insightful)

mcrbids (148650) | more than 8 years ago | (#12916591)

Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that led up to the horrible decisions our politicians made in the course of daily business? We want records of this?

Absolutely, yes!

History is often taught as "Charlemagne took over Constantinople in the year 12xx," as though military feats really mattered to the average Joe. But the truth is, America was colonized by people who thought that, however bad it might be in a virgin land, it was BETTER than their lives in Europe.

One of the key failures of public education today is failing to communicate that history consists mostly of PEOPLE doing ORDINARY things in their time to make life better for themselves and their families. They loved, worked, got bored, and cracked jokes at the expense of their leaders, just like we do today.

History doesn't consist of battles, any more than history consists of artworks. Capturing more detail of the average, everyday lives of people gives a much better understanding of the cultural norms and the ideals to which people aspired.

The pyramids of ancient Egypt provide a clear, artistic monument to their culture, yet we have only a modest understanding of Egyptians' day-to-day lives. Similarly, we have Stonehenge as a clear monument to the grooved-ware people of the English isles, but almost NO understanding of who they were and what they felt was important. How much would a true historian give to understand the day-to-day culture of these mysterious "grooved-ware" people of old?

Those memos and IMs make up that understanding of the people of today.

So? (3, Insightful)

ArchAngel21x (678202) | more than 8 years ago | (#12916238)

By the time the government comes up with a half-assed solution, archive.org will already have it all organized, online, indexed, and backed up.

contract for archiving system (1)

1nv4d3r (642775) | more than 8 years ago | (#12916259)

Anybody know what the government has spec'ed TFA's archiving system to do? It says it will need to read 16,000 file formats and be impervious to terrorist attack (?), but not much else...

I wonder what kind of searches and cross-linking will be done, for instance. What kinds of access control there will be? I'd also just like to see what the 16,000 formats are, out of curiosity. Sounds like a project waaaay larger than the $136 million they've allotted for it so far.

Stupid name.... i'm guessing they were chuckling about 'the ERA of NARA'...

Have a look at the Fedora Project (3, Funny)

pangloss (25315) | more than 8 years ago | (#12916273)

http://www.fedora.info/ [fedora.info]
(Not to be confused with the Linux distribution)

From the website, Fedora is "a general purpose repository service...devoted to...providing open-source repository software that can serve as the foundation for many types of information management systems".

Problem for some is that Fedora can be a little hard to grok. It's not an out-of-the-box repository to install and run, like the repository application mentioned in the article (DSpace). It's an architecture for building repository software. Once you understand the potential for building applications on top of Fedora, you start to see some light at the end of the tunnel for just the sort of issues the article raises.

Relevant, interesting post (4, Funny)

Council (514577) | more than 8 years ago | (#12916290)

Here is a relevant post by Ralph Spoilsport [slashdot.org] on an earlier article, which can be found here [slashdot.org]. I am reproducing it here in full because it is very interesting and highly relevant.

this is actually a BIG question

And one that I have railed about for many years.
I have been in the same position the Author discussed, and I have come to ONLY negative conclusions. In a few words, and I hate to say this, but buddy:

WE'RE FUCKED.

Digital is a loser's proposition. Backing up to analogue media, or even to digital data on analogue substrates (such as DV tape), fails. Simply and purely.

The *only* thing that comes close is some kind of RAID, and those, even with the plummeting price of storage, are still too expensive given the needs.

Also, a RAID assumes a continuity of several things that are not likely to be continuous:

With Video:
Framerate, number of lines, colour depth, aspect ratio, file format, compression format, Operating system compatibility, etc etc etc. All of these things are variables.

With Audio:
sample rate, compression format, bit depth, file format, etc.

Basically all of it points to very bad places.

I am fairly well convinced that our age will simply disappear. They will find our garbage, the few books not pressed on acidic paper, our paintings (fat lot of good the abstract stuff will mean to them) and drawings, and that's about it. The rest will just be shiny little bits of crap in the landfill.

Since we will have used up all the dense energy forms, they will be appalled at the energy requirements just to get the few remaining museum-piece devices to work. Archiving the 21st century will be impossible. To the 25th century, the 21st will be seen as a dark age - not only for the holocaust of the die-off caused by the failure of the petroleum-based economy, but from the simple fact that very little of the information formats we are totally geared into will survive, including this note on /.

His problem of saving personal video is just the tip of the iceberg. His problem is the problem of our very civilisation, writ small.

That's why I am abandoning video and going back to painting. In 500 years, my painting CAN survive. The video simply won't.

RS


And don't give me shit about my karma or whatever. My karma's fine, I don't care about it. I'm copying this because it's interesting and contributes to the discussion.

What do you think about Ralph's thoughts?

Re:Relevant, interesting post (2, Insightful)

fmaxwell (249001) | more than 8 years ago | (#12916691)

I think that he's being absurdly pessimistic. Those future historians will have no more difficulty reading our media than we have playing the sounds from a wax cylinder for an Edison phonograph. Running archived computer software will be no different than any of us using a software emulator to run a game for a long-dead gaming console.

Sure, optical and magnetic media decay, but there's nothing stopping people from "refreshing" the media before it decays too far. If you have a stack of CD-R discs that are starting to show an increase in correctable errors, then you back them up to new CD-R discs or to DVD-R. You don't have to sit idly by and watch them decay. I've got a CP/M computer that's over 20 years old, and it can still boot from its 10 MB (yes, megabyte) hard drive. So it's not as if data just disappears five years after it's recorded.

It's also a problem the industry is addressing. There are companies offering long-life CD-R media designed for archival use [hi-space-pro.com]. Other companies offer storage for archival data, much like the climate-controlled vaults where countless audio master tapes and films have been stored for decades.

In closing, I think that 95%+ of archived data will still be able to be accessed in a century -- provided that it is properly stored and cared for.
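
The "refreshing" step is straightforward to automate; a minimal Python sketch (the archive path is hypothetical) that records checksums at archive time so rising error rates show up as mismatches on a later pass, before the data becomes unrecoverable:

    import hashlib, json, pathlib

    def checksum(path, algo="sha256", chunk=1 << 20):
        h = hashlib.new(algo)
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    # Record checksums now; re-verify on a schedule. Any mismatch flags
    # a file to restore from a redundant copy onto fresh media.
    archive = pathlib.Path("/mnt/archive")    # hypothetical mount point
    manifest = {str(p): checksum(p)
                for p in archive.rglob("*") if p.is_file()}
    pathlib.Path("manifest.json").write_text(json.dumps(manifest, indent=2))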

Re:Relevant, interesting post (0)

Anonymous Coward | more than 8 years ago | (#12916773)

Sure, optical and magnetic media decay, but there's nothing stopping people from "refreshing" the media before it decays too far. If you have a stack of CD-R discs that are starting to show an increase in correctable errors, then you back them up to new CD-R discs or to DVD-R.

Yes, but with current OSes, correctable errors are completely hidden from me. Only when an error is uncorrectable do I get a warning. There is no equivalent of S.M.A.R.T. for CDs.

Oh, give me a break... (0)

Anonymous Coward | more than 8 years ago | (#12916783)

The guy conflates integrity preservation solutions (RAID) with data format issues.

Major formats will be figured out after the apocalypse, don't worry about that. (Sir, we found over 100 million 4 3/4" plastic discs with digital data on them! Should we try to decode them?) Data will be lost, that's true. But some of it will be figured out, just as when we look back at past histories.

In the past, when societies used paper or papyrus instead of parchment, the recoverability of their information went down because those media didn't survive as well. At other points, changes in inks (due to convenience of manufacture or cost) also led to lower data survivability.

So this isn't a new thing at all.

But most importantly, don't get too excited about it. You should no more worry about whether your pr0n collection will survive than the average Greek or Roman did about their inventory/accounting records, or indeed their pr0n collections.

Slightly overdramatic? (2, Insightful)

mrogers (85392) | more than 8 years ago | (#12916298)

Are we currently experiencing a dark age because we don't have access to every letter, memo, bank statement and laundry ticket created in the 20th century? Archiving everything is an attractively simple approach, but if it turns out to be impractical we can always fall back on common sense and restrict ourselves to archiving the maybe 10% of things that have even a remote chance of being interesting in 100 years' time.

Tanks for the Memories (2, Interesting)

Doc Ruby (173196) | more than 8 years ago | (#12916329)

We need to imprint holographic storage in synthetic diamonds. Even if they're slow and expensive, they'll last far longer than the paper records they replace. We'll have to spend a fortune redigitizing all the polymer (CD/DVD, floppy, tape), celluloid (microfilm/fiche) and rotating (disc) media that will age to illegibility within our lifetimes. Until we get holographic gems, we need to archive everything on paper, including those expiring media, in a format easily digitized to a more permanent medium. But of course the government, and barely accountable bosses, want the public record to disappear down the memory hole. If they could accelerate the process, newspapers included, they'd spend everything we've got (and more) to make it happen.

Records (2, Informative)

Big Sean O (317186) | more than 8 years ago | (#12916330)

NARA makes a distinction between a document and a record. Any old piece of paper or email is a document, but a record is something which shows how the US government did business.

For example, the email to my supervisor asking when I can take a week's vacation isn't a record. The leave request form I get him to sign is a record. An email about lunch plans: not a record. An email to a coworker about a grant application probably is.

Besides obvious records (eg: financial and legal records), there are many documents that may or may not be records. For the most part, it's up to each program to decide which documents are records and archive them appropriately.

the more I think about it... (2, Insightful)

1nv4d3r (642775) | more than 8 years ago | (#12916397)

I'm not sure most of this stuff is worth preserving digitally enough to justify the cost. Just print 'em out and put them in a Raiders of the Lost Ark-style warehouse. The few people who want to see all of the Clinton administration's emails can travel to it and search.

I'd much rather see those hundreds of millions of dollars invested in, for instance, making all out-of-print recordings and books available online. It's a smaller problem (it sounds like), but it would benefit the world much more than online copies of every government employee's timecard records.


strip MS HTML from Outlook mails (4, Funny)

rduke15 (721841) | more than 8 years ago | (#12916418)

I don't know about the NASA data sets, but they could certainly save a few petabytes by stripping the stupid HTML part of all Outlook emails...
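
That stripping is nearly a one-liner with Python's standard library; a sketch (the .eml filename is hypothetical) that keeps only the text/plain part of a multipart message:

    from email import policy
    from email.parser import BytesParser

    def plain_text_only(raw_bytes):
        msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
        body = msg.get_body(preferencelist=("plain",))  # skip the HTML part
        return body.get_content() if body is not None else None

    with open("message.eml", "rb") as f:    # hypothetical input file
        print(plain_text_only(f.read()))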

Moore's Law saves the day (2, Interesting)

G4from128k (686170) | more than 8 years ago | (#12916477)

In 1987, a Mac II came with a 40 MB drive. Seventeen years later, a PowerMac G5 came with a 160 GB drive. That is at least a 4000x improvement in capacity per dollar (and 1987's drive was both physically larger and more expensive than 2004's).

Assuming the current rate of advance in storage density and price continues, a future archivist should be able to buy a 0.64 PB drive for under $500 in 2021. A mere quarter of a million dollars will provide enough space for a copy of all that stuff.
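
The extrapolation is simple enough to check; a quick Python sketch:

    # Project the 1987->2004 trend forward another 17 years, as above.
    growth_per_17yr = 160e9 / 40e6               # 40 MB -> 160 GB: 4000x
    drive_2021_bytes = 160e9 * growth_per_17yr   # ~6.4e14 B = 0.64 PB
    drives = 347e15 / drive_2021_bytes
    print(round(drives), round(drives) * 500)    # ~542 drives, ~$271,000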

Re:Moore's Law saves the day (1)

Nasarius (593729) | more than 8 years ago | (#12916749)

First, Moore's Law is about transistor density, which has nothing to do with hard drives. Second, hard drives haven't been getting any more reliable, which means all those drives would have to be replaced every few years. It's a nightmare for long-term storage.

I'm guessing... steady state. (3, Interesting)

dpbsmith (263124) | more than 8 years ago | (#12916484)

The Zapruder film was the beginning. In recent years, I've been dumbfounded by the vast expansion in the recording and documentation of things like crimes in progress, natural disasters, America's Funniest Home Videos, you name it. A plane crashes, and the next day there are ten different home videos from people in the vicinity who had camcorders.

I believe the cost of traditional photography, in constant dollars, dropped enormously between my parents' time and mine. I know we took about ten times as many silver-on-paper and Kodacolor dye-on-paper snapshots as my parents did. Then we got a camcorder. My parents captured about three hours total of 8 mm silent home movies. I have about forty hours of 8mm and Digital8 camcorder tape.

And since my wife and I got digital cameras, we've been taking five to ten times as many pictures as we did when we used film cameras.

Now, YES, I'm on the format treadmill. Got most of the old 8mm movies transferred to VHS. Got most of the VHS transferred to DVD. Got a lot of the old slides scanned. Got most of my digital images burned to CD. In the last five years, I've probably spent a hundred hours, or 0.2% of my life, on nothing but struggling to copy from old formats to new. I've spent a small fortune getting Shutterfly to print pictures because, to tell the truth, I have much more faith in the prints surviving than the CDs.

So, I don't see a digital dark age. I see a bizarre situation in which the quantity of material recorded in digital form continues to increase exponentially for quite some time. _Most_ of it will get lost, and the percentage that survives, say, a hundred years will keep going DOWN exponentially with time.

But I'm guessing the total quantity of 21st century material available to historians of the 23rd century will, in absolute numbers, be just about the same as the total quantity of 20th century material.

It's one of those mind-boggling things, like personal death, that one can never quite come to grips with. The future is unknown, and we can accept that. But the fact that most of the past is unknown is equally true -- and very hard to accept.

Yeah, but that's 17 years away. (1)

MacDork (560499) | more than 8 years ago | (#12916504)

In 2022, we'll probably have terabyte capacity in our mobile phones. Seriously. In the early 90s, 80 GB of drive space ran about $80,000, according to this archived historical document. [wired.com] Nowadays, I can get an 80 GB drive for about $65, according to Froogle, [google.com] and that's without considering inflation. Sure, at a conservative $1/GB we're looking at $347 million today, but in 17 years that'll probably look more like two or three hundred thousand bucks. No biggie for our bloated government.

The Solution: (2, Funny)

DarkEdgeX (212110) | more than 8 years ago | (#12916705)

NARA needs to open up tons and tons of GMail accounts. Where do I send my invites so I can contribute?

Elementary My Dear Dewey (1)

chadpnet (627771) | more than 8 years ago | (#12916774)

Who says you have to archive all data digitally? The system that's been working for years at our local public and university libraries is storing meta-information digitally that references a tangible location.