
Can You Compress Berkeley DB Databases?

Cliff posted more than 13 years ago | from the now-this-would-be-a-neat-hack dept.

Unix

Paul Gear asks: "I want to create a database using Berkeley DB that will include a lot of textual information, and because of the bulk of the data and its obvious compressibility, I was wondering whether it is possible to have the DB libraries automatically compress it at the file level, rather than my compressing each (rather small) data item before putting it in (which would yield much less compression). Section 4.1 of the paper "Challenges in Embedded Database System Administration" talks about automatic compression, but that is the only place in the documentation where it is mentioned. Can anyone point me in the right direction?"


OS (1)

sql*kitten (1359) | more than 13 years ago | (#532124)

Which OS are you using? NT can compress files transparently to applications. Just right-click the file, select "Properties" then check the appropriate box. You won't need to modify any of your code.

If you end up rolling your own solution ... (1)

Bwah (3970) | more than 13 years ago | (#532125)

look at LZO.
http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html

I've used it in an embedded app to decompress/overlay main applications from ROM to RAM and can vouch for its decompression speed.
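
For reference, a minimal LZO round trip with the miniLZO subset of the library looks roughly like the following. This is a sketch only: the buffer sizing and work-memory constants follow the miniLZO example code, the test data is invented, and error handling is abbreviated.

    /* Round-trip a buffer through LZO1X using the miniLZO API
     * (lzo_init, lzo1x_1_compress, lzo1x_decompress). */
    #include <stdio.h>
    #include <string.h>
    #include "minilzo.h"

    int main(void)
    {
        static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS]; /* compressor scratch space */
        unsigned char in[4096], out[4096 + 4096 / 16 + 64 + 3], back[4096];
        lzo_uint in_len = sizeof(in), out_len = sizeof(out), back_len = sizeof(back);

        memset(in, 'x', in_len);                 /* highly compressible test data */

        if (lzo_init() != LZO_E_OK)
            return 1;

        /* compress */
        if (lzo1x_1_compress(in, in_len, out, &out_len, wrkmem) != LZO_E_OK)
            return 1;
        printf("%lu -> %lu bytes\n", (unsigned long)in_len, (unsigned long)out_len);

        /* decompress -- this is the fast path being praised above */
        if (lzo1x_decompress(out, out_len, back, &back_len, NULL) != LZO_E_OK)
            return 1;
        return memcmp(in, back, in_len) != 0;
    }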

Tradeoffs (1)

djweis (4792) | more than 13 years ago | (#532126)

Reading through a compressed random-access file may not be as big of a win as you think, since the db will need to decompress things to determine where your data is. OTOH, if you have a fast processor and slower media (CDROM) and plenty of RAM, the drawbacks will disappear after a few spins through due to caching.

Re: mifluz (1)

ghutchis (7810) | more than 13 years ago | (#532127)

The mifluz project [senga.org] uses a compressed Berkeley DB for word indexing. Depending on your application, you may find much of the code less useful, though the db/ directory contains code for creating compressed Berkeley B-Tree databases.

You should consider contacting Loic Dachary--his address is on the Senga project pages.

from my experience (1)

mi (197448) | more than 13 years ago | (#532128)

If you use zlib's replacements for fread, fseek, etc., things will be VERY slow. I tried this approach with the WordNet [princeton.edu] databases and it sucked. Well, WordNet does a lot of seeks, but still.
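
For context, the approach described here swaps stdio calls for zlib's gz* equivalents, roughly as sketched below. The file name and offsets are invented for illustration; the gz* function names are the real zlib API. Note that gzseek on a compressed stream may have to decompress everything up to the target offset, which is where the slowness comes from.

    /* stdio-style access to a gzip-compressed file via zlib's gz* calls. */
    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        char buf[256];
        gzFile f = gzopen("data.db.gz", "rb");   /* hypothetical compressed data file */
        if (f == NULL)
            return 1;

        gzseek(f, 4096L, SEEK_SET);              /* "random" access: zlib inflates up to here */
        int n = gzread(f, buf, sizeof(buf));     /* reads decompressed bytes */
        printf("read %d bytes\n", n);

        gzclose(f);
        return 0;
    }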

Teaching the Berkeley DB functions to use libz will be extremely painful too. I'd say your best bet is to try the Linux filesystem compression attribute already mentioned by someone else, though I'm not sure how efficient it is at reading/uncompressing/recompressing things that are read and/or mmapped. In any case, you won't be modifying any code -- just the file's attributes.

I'm afraid you'll defeat most of the DB's tricks that rely on knowing the sector size and other filesystem details.

You could also just uncompress the file into memory and then hand it to the database functions, but then you lose the automatic synchronization with the file on the filesystem, which for many is the main reason for using Berkeley DB (or gdbm) in the first place.

Database choice? (1)

wmulvihillDxR (212915) | more than 13 years ago | (#532129)

I know this is probably not an option, but have you considered switching to a different database system, like one based on SQL? Of course, as I say this, I'm doing web applications for my company using Berkeley DB. I just wanted to know how much of a hassle it would be to switch. I'm sure we could all go into the many virtues and pitfalls of all the database systems out there.

fs-level compression Re:OS (2)

StandardDeviant (122674) | more than 13 years ago | (#532130)

FWIW, Linux can do this on ext2fs as well, with chattr +c filename. There are analogs in other Unix operating systems and filesystems as well. :-) (man chattr for more info)



Performance bottleneck. (2)

billcopc (196330) | more than 13 years ago | (#532131)

Although compressing the data might seem like a good idea, in practice you will curse day and night unless you take certain precautions. Using a compressed filesystem will save you plenty of space, but you will pay for it in CPU usage. Decompressing data on the fly is costly, and in the case of database operations it can be downright nasty. A simple workaround might be to store the main data (indexes) on an uncompressed filesystem and keep only the larger blob fields in a separate table stored compressed. That way you could still do lightning-fast searches and selects, only slowing down to grab the memos when absolutely required.
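
A rough sketch of that split, assuming the Berkeley DB 4.x-style C API (db_create/DB->open/DB->put): a small index database lives on a normal filesystem, the bulky blob database lives on a compressed one, and both are keyed by the same record id. The paths, helper names, and "id/summary" fields are invented for illustration.

    #include <string.h>
    #include <db.h>

    static DB *open_db(const char *path)
    {
        DB *dbp;
        if (db_create(&dbp, NULL, 0) != 0)
            return NULL;
        if (dbp->open(dbp, NULL, path, NULL, DB_BTREE, DB_CREATE, 0644) != 0) {
            dbp->close(dbp, 0);
            return NULL;
        }
        return dbp;
    }

    int store_record(DB *index_db, DB *blob_db,
                     const char *id, const char *summary,
                     const void *blob, size_t blob_len)
    {
        DBT key, val;

        memset(&key, 0, sizeof(key));
        key.data = (void *)id;
        key.size = (u_int32_t)strlen(id);

        /* small index entry -> uncompressed filesystem, stays fast to search */
        memset(&val, 0, sizeof(val));
        val.data = (void *)summary;
        val.size = (u_int32_t)strlen(summary);
        if (index_db->put(index_db, NULL, &key, &val, 0) != 0)
            return -1;

        /* bulky text -> database file living on the compressed filesystem */
        memset(&val, 0, sizeof(val));
        val.data = (void *)blob;
        val.size = (u_int32_t)blob_len;
        return blob_db->put(blob_db, NULL, &key, &val, 0);
    }

    /* e.g.:
     *   DB *idx  = open_db("/data/index.db");        -- normal filesystem
     *   DB *blob = open_db("/compressed/blobs.db");  -- chattr +c / NTFS-compressed directory
     */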

ZlibC (3)

Bazman (4849) | more than 13 years ago | (#532132)

From the zlibc web site [linux.lu]:

Zlibc is a read-only compressed file-system emulation. It allows executables to uncompress their data files on the fly. No kernel patch, no re-compilation of the executables and the libraries is needed. Using gzip -9, a compression ratio of 1:3 can easily be achieved! (See examples below). This program has (almost) the same effect as a (read-only) compressed file system.

See the web page for more.

Baz

Re:Tradeoffs (3)

Deven (13090) | more than 13 years ago | (#532133)

Reading through a compressed random-access file may not be as big of a win as you think, since the db will need to decompress things to determine where your data is. OTOH, if you have a fast processor and slower media (CDROM) and plenty of RAM, the drawbacks will disappear after a few spins through due to caching.

The caching won't save you from uncompressing the blocks repeatedly. If you really want to compress the database metadata, you basically need a block-oriented compressed filesystem that allows random access within compressed files. I don't know if such a thing already exists, but it's effectively what you'd be writing to do it...

I'd just use zlib to compress the individual entries and not try to compress the entire database as a whole. I've done this before, and it actually works better than you'd think. Even with data entries as small as 50-100 bytes, you get reasonable compression. Yes, you'd get much better compression across the entire database, but you can't hope to access a fully-compressed database without uncompressing it or doing a lot of work to make random-access possible. (And like I said, at that point you might as well be making a compressed filesystem.)
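
A minimal sketch of that per-entry approach using zlib's compress2/uncompress. The 4-byte original-length prefix and the helper names are invented for illustration; only compress2, uncompress, and compressBound are real zlib calls (compressBound needs zlib 1.2 or later).

    #include <stdlib.h>
    #include <zlib.h>

    /* Compress src into a freshly malloc'd buffer laid out as
     * [4-byte original length][deflated bytes]; returns NULL on failure.
     * The result becomes the DBT data/size handed to db->put(). */
    unsigned char *pack_value(const unsigned char *src, uLong src_len, uLong *out_len)
    {
        uLongf comp_len = compressBound(src_len);
        unsigned char *buf = malloc(4 + comp_len);
        if (buf == NULL)
            return NULL;

        buf[0] = (src_len >> 24) & 0xff;
        buf[1] = (src_len >> 16) & 0xff;
        buf[2] = (src_len >> 8) & 0xff;
        buf[3] = src_len & 0xff;

        if (compress2(buf + 4, &comp_len, src, src_len, Z_BEST_COMPRESSION) != Z_OK) {
            free(buf);
            return NULL;
        }
        *out_len = 4 + comp_len;
        return buf;
    }

    /* Reverse: read the length prefix, inflate into a malloc'd buffer. */
    unsigned char *unpack_value(const unsigned char *src, uLong src_len, uLong *out_len)
    {
        uLongf orig_len = ((uLong)src[0] << 24) | ((uLong)src[1] << 16) |
                          ((uLong)src[2] << 8)  |  (uLong)src[3];
        unsigned char *buf = malloc(orig_len ? orig_len : 1);
        if (buf == NULL)
            return NULL;

        if (uncompress(buf, &orig_len, src + 4, src_len - 4) != Z_OK) {
            free(buf);
            return NULL;
        }
        *out_len = orig_len;
        return buf;
    }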