
Cassandra NoSQL Database 1.2 Released

Soulskill posted about a year and a half ago | from the onward-and-upward dept.


Billly Gates writes "The Apache Foundation released version 1.2 of Cassandra today, which is becoming quite popular for those wanting more performance than a traditional RDBMS. You can grab a copy from this list of mirrors. This release includes virtual nodes for backup and recovery. Another added feature is 'atomic batches,' where patches can be reapplied if one of them fails. They've also added support for integrating into Hadoop. Although Cassandra does not directly support MapReduce, it can more easily integrate with other NoSQL databases that use it with this release."


55 comments

Webscale? (-1, Redundant)

Anonymous Coward | about a year and a half ago | (#42454135)

The most important question: Is it webscale?

first (-1)

Anonymous Coward | about a year and a half ago | (#42454147)

post

Hmm. (4, Interesting)

Anonymous Coward | about a year and a half ago | (#42454225)

Maybe someone can explain this to me. I've been keeping an eye out for situations where it would make more sense to use a NoSQL solution like Mongo, Couch, etc. for a year or so now, and I just haven't found one.

Under what circumstances do people use a data store that doesn't need data relationships?

Re:Hmm. (3, Insightful)

Anonymous Coward | about a year and a half ago | (#42454267)

When the project is run by an idiot who thinks they need to incorporate buzzwords over substance into their work.

Re:Hmm. (1)

QuantumRiff (120817) | about a year and a half ago | (#42454835)

But your old-fashioned DB isn't "Web Scale": http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html [highscalability.com]

Sorry, I love this video.

Re:Hmm. (1)

haruchai (17472) | about a year and a half ago | (#42457343)

That is funny, but the best part is remembering when MySQL used to be mocked for not being ACID-compliant, robust, etc., and the comeback was "well, it's really fast."

Re:Hmm. (1)

vlm (69642) | about a year and a half ago | (#42454319)

Under what circumstances do people use a data store that doesn't need data relationships?

A crude 1980s filesystem, on a system where they don't officially allow direct file storage but do provide a database capable of holding arbitrary binary data.

Re:Hmm. (4, Insightful)

Sarten-X (1102295) | about a year and a half ago | (#42454345)

Assuming you're not trolling...

When one wants to write a ton of data as fast as possible, where the data may not actually be complete or consistent (but still useful). Something on the order of a million rows a minute is a prime candidate for a NoSQL store. Consider, for example, the sum of all posts on Facebook at any given time.
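To put a rough shape on "as fast as possible": with Cassandra specifically, the client can keep firing asynchronous writes without waiting on per-row acknowledgments. A minimal sketch using the new DataStax Java driver, with a made-up keyspace, table, and columns (a real loader would throttle and wait on the returned futures before exiting):

    import java.util.Date;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class FirehoseWriter {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("events"); // hypothetical keyspace

            PreparedStatement insert = session.prepare(
                    "INSERT INTO posts (user_id, posted_at, body) VALUES (?, ?, ?)");

            // Fire-and-forget: executeAsync returns immediately, so one client
            // thread keeps many writes in flight across the whole cluster.
            for (long i = 0; i < 1000000; i++) {
                session.executeAsync(insert.bind(i % 5000, new Date(), "post " + i));
            }
            cluster.shutdown();
        }
    }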

From the other side, applications following the current "Big Data" trend, which monitor every aspect of every action on a website (or in a hospital, or through a retail distribution chain, or the environmental systems of a factory) to glean statistically meaningful information, also make a good use case for NoSQL. At the expense of consistency, the store is designed to be fast and fault-tolerant, so it really doesn't matter whether the data's complete or not. For Big Data applications, which are interested only in statistics, having a few inconsistent records out of billions doesn't matter much to the end result.

Sure, traditional RDBMSs can be tweaked and optimized to make any particular query run as fast as any NoSQL engine... but that's an expensive and time-consuming process that's often not feasible.

Re:Hmm. (0)

Anonymous Coward | about a year and a half ago | (#42454533)

FYI

Michael Stonebraker (of Ingres, Vertica, etc. database fame) has some interesting critiques of both traditional databases and NoSQL.

http://www.youtube.com/watch?v=uhDM4fcI2aI

Re:Hmm. (1)

KingMotley (944240) | about a year and a half ago | (#42454615)

As for your first case, it's less a factor of speed than of the content of what you are writing. If it's mostly free-form crap that doesn't or won't ever have to be analyzed based on the actual content (blogs, posts, etc.), then yes. If you need to be able to query the data at a later point and run statistics on it regularly, then no, especially if accuracy in the statistics is important.

And on the other side, NoSQL typically fails much more than it succeeds, because NoSQL defers most of its logic to the application itself, and that fails very quickly in enterprise situations where you have many needs across many departments. Think of NoSQL as a large filing cabinet full of files where you have no idea what is in them; if that works for you, then NoSQL might be a solution. However, if you want to be able to, say, look through all the files to find something within them, then NoSQL fails miserably. It also fails if your needs change rapidly, like needing to reanalyze your data in a way that wasn't thought of when you put the system live -- which happens every day in enterprise warehouses.

Re:Hmm. (3, Informative)

samkass (174571) | about a year and a half ago | (#42454701)

If it's mostly free-form crap that doesn't or won't ever have to be analyzed based on the actual content (blogs, posts, etc.), then yes.

I'm going to pretend you weren't trolling to address a good point here. NoSQL is very valuable for human-to-human data. I've seen it be hugely successful in cases where you only need a "human" level of precision about ordering, consistency, and detail. It eliminates single points of failure, global locks, offline operation problems, write contention, etc. It introduces problems for indexing and absolute consistency. But without widespread indexing you tend to get brute-force (MapReduce) or narrow-focus (offline indexes on specific fields) searches. And that's okay for most humans.

Re:Hmm. (4, Informative)

Sarten-X (1102295) | about a year and a half ago | (#42454897)

That's almost exactly wrong.

"Free-form crap" like blogs doesn't really care what database it's in. Use a blob in MySQL, and it won't matter. You'll be pulling the whole field as a unit and won't do analysis anyway.

The analysis of atomic data is exactly what NoSQL stores are designed for. MapReduce programs are built to evaluate every record in the table, filter out what's interesting, then run computation on that. The computation is done in stages that can be combined later in a multistage process. Rather than joining tables to build a huge set of possibilities, then trimming that table down to a result set, the query operates directly on a smaller data set, leaving correlation for a later stage. The result is a fast and accurate statistic, though there is a loss of precision due to any inconsistent records. Hence, bigger databases are preferred to minimize the error.
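Here's a minimal sketch of that filter-then-combine shape as a vanilla Hadoop MapReduce job. The tab-separated record layout with the HTTP code in the second field is purely my invention for illustration; a real job would read through the store's own InputFormat:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ErrorCodeStats {

        // Stage 1: evaluate every record, keep only the interesting ones.
        public static class FilterMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable offset, Text record, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = record.toString().split("\t");
                if (fields.length < 2) return; // incomplete record: skip it, don't fail
                String code = fields[1];
                if (code.startsWith("4") || code.startsWith("5")) {
                    ctx.write(new Text(code), ONE); // only the errors are interesting here
                }
            }
        }

        // Stage 2: combine the filtered stream into the statistic.
        public static class CountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text code, Iterable<IntWritable> ones, Context ctx)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable one : ones) total += one.get();
                ctx.write(code, new IntWritable(total));
            }
        }
    }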

I like the analogy of NoSQL being a cabinet full of files, though I'd alter it a little. Rather than having no idea what's in the files, we do know what they're supposed to contain, but they're old and may not be perfectly complete as expected. To find some information about the contents, we have to dive in, flip through all the files, and make an effort. Yes, some files will be useless for our search, and some will be missing important data - but we can still get information of statistical significance. Note that over time, the forms might even change, adding new fields or changing options. We might have to ask a supervisor how to handle such odd cases, which is analogous to pushing some decisions back to the application.

Re:Hmm. (0)

Anonymous Coward | about a year and a half ago | (#42454677)

No, I'm not trolling, I'm just ignorant on the subject.

I think I'd been led in the wrong direction on use cases for NoSQL solutions. The idea of "agility" sounded good, which to my mind meant worrying less about the schema. If I need to add a field to something, I add a field. But the part about no relations always seemed like a show stopper for any case I'm likely to encounter.

I guess the mental block goes something like (and I'm shoehorning an example here): It'd be nice to store user status updates in a way where I don't have to worry too much about types of update, but I can't do that if correlating 'mentions', the user that posted it, and visibility against user groups would be a problem.

Re:Hmm. (5, Informative)

Sarten-X (1102295) | about a year and a half ago | (#42456445)

I think I'd been led in the wrong direction on use cases for NoSQL solutions.

It sounds like you probably have. There's a lot of misinformation out there parroted by folks who don't really understand NoSQL paradigms. They'll say it lacks ACID, has no schema, relations, or joins, and they'd be right, but sometimes those features aren't actually necessary for a particular application. That's why I keep coming back to statistics: Statistical analysis is perfect for minimizing the effect of outliers such as corrupt data.

The idea of "agility" sounded good, which to my mind meant worrying less about the schema.

Ah, but that's only half of it. You don't have to worry about the schema in a rigid form. You do still need to arrange data in a way that makes sense, and you'll need to have a general idea of what you'll want to query by later, just to set up your keys. If you're working with, for instance, Web crawling records, a URL might make a good key.

If I need to add a field to something, I add a field.

Most NoSQL products are column-centric. Adding a column is a trivial matter, and that's exactly how they're meant to be used. Consider the notion of using columns whose names are timestamps. In an RDBMS, that's madness. In HBase, that's almost* ideal. A query by date range can simply ask for rows that have columns matching that particular range. For that web crawler, it'd make perfect sense to have one column for each datum you want to record about a page at a particular time. Perhaps just headers, content, and HTTP code each time, but that's three new columns every time a page is crawled - and assuming a sufficiently-slow crawler, each row could have entirely different sets of columns!
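Here's a sketch of one crawl's write using the 0.9x-era HBase client API. The table, column family, and qualifier layout are my own invention (I prefix the datum name so related columns group together), not anything standard:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CrawlLogger {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable crawls = new HTable(conf, "crawls"); // row key: the URL

            long now = System.currentTimeMillis();
            Put put = new Put(Bytes.toBytes("http://example.com/"));
            // Three new columns per crawl, qualified by the crawl time; rows
            // crawled at different times end up with different column sets.
            put.add(Bytes.toBytes("crawl"), Bytes.toBytes("code:" + now),
                    Bytes.toBytes("200"));
            put.add(Bytes.toBytes("crawl"), Bytes.toBytes("headers:" + now),
                    Bytes.toBytes("Content-Type: text/html"));
            put.add(Bytes.toBytes("crawl"), Bytes.toBytes("content:" + now),
                    Bytes.toBytes("<html>...</html>"));
            crawls.put(put);
            crawls.close();
        }
    }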

But the part about no relations always seemed like a show stopper for any case I'm likely to encounter.

It's not that there aren't relations, but that they aren't enforced. A web site might have had a crawl attempted, but a 404 was returned. It could still be logged by just having a missing content column for that particular timestamp, and only the 404 column filled. On later queries about content, a filter would ignore everything but 200 responses. For statistics about dead links, the HTTP code might be all that's queried. On-the-fly analysis can be done without reconfiguring the data store.
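Continuing the same made-up schema, the dead-link statistic only ever asks for the code columns; a sketch:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeadLinkCount {
        public static void main(String[] args) throws Exception {
            HTable crawls = new HTable(HBaseConfiguration.create(), "crawls");

            // Pull back only the "code:*" columns; headers and content never
            // leave the region servers. Rows that never got a code column
            // simply don't appear in the results.
            Scan scan = new Scan();
            scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("code:")));

            long dead = 0, total = 0;
            ResultScanner scanner = crawls.getScanner(scan);
            for (Result row : scanner) {
                for (byte[] value : row.getFamilyMap(Bytes.toBytes("crawl")).values()) {
                    total++;
                    if (!Bytes.toString(value).equals("200")) dead++;
                }
            }
            scanner.close();
            crawls.close();
            System.out.println(dead + " dead out of " + total + " crawl attempts");
        }
    }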

It'd be nice to store user status updates in a way where I don't have to worry too much about types of update, but I can't do that if correlating 'mentions', the user that posted it, and visibility against user groups would be a problem.

Here's one solution, taking advantage of the multi-value aspect of each row (because that's really the important part [slashdot.org] ):

Store a timestamped column for each event (status update, mention, visibility change). As you guessed, don't worry much about what each event is, but just store the details (much like Facebook's timeline thing). When someone tries to view a status, run a query to pull all events for the user, and run through them to determine the effective visibility privileges, the most recent status, and the number of "this person was mentioned" events. There's your answer.

As you may guess, that'd be pretty slow, but we do have the flexibility to do any kind of analysis without reconfiguring our whole database. We could think ahead a bit, though, and add to our schema for a big speed boost: whenever a visibility change happens, the new settings are stored serialized in the event. Sure, it violates normalization, but we don't really care about that. Now, our query need not replay all of the user's events... just enough to get the last status and visibility, and any "mentioned" events. That'll at least be close to constant time, regardless of how long our users have been around.

Counting all those "mentioned" events might be a needless waste of time, though... and our query (probably a MapReduce job) must still run through our cluster, and that's a fair amount of delay for something like displaying a webpage. Here's where the dirty secret of NoSQL comes in: It's not alone. We can put a traditional RDBMS in the system, too! Once every few minutes, we can run a batch job to process everything that happened recently, and store only the latest results in an RDBMS! In a table of visible status updates, drop in the latest updates (and count of mentions) that each user group could see. Now we have the analytic freedom and write speed from the NoSQL backend, and the strict schema and instant reads of an RDBMS. Sure, there's a couple minutes' delay between the posting and the viewing... but it's a status update. It's not a big deal.
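The cache-refresh step itself is nothing exotic. A bare-bones JDBC sketch with invented table and column names, assuming the MySQL driver is on the classpath and the cluster job has already computed the latest visible status per user:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Map;

    public class StatusCacheLoader {
        // Runs every few minutes: push the cluster job's per-user results into
        // MySQL so the web tier reads one indexed table instead of replaying events.
        public static void refresh(Map<String, String> latestStatusByUser) throws Exception {
            Connection db = DriverManager.getConnection(
                    "jdbc:mysql://localhost/frontend_cache", "app", "secret");
            PreparedStatement upsert = db.prepareStatement(
                    "INSERT INTO visible_status (username, status) VALUES (?, ?) " +
                    "ON DUPLICATE KEY UPDATE status = VALUES(status)");
            for (Map.Entry<String, String> e : latestStatusByUser.entrySet()) {
                upsert.setString(1, e.getKey());
                upsert.setString(2, e.getValue());
                upsert.addBatch();
            }
            upsert.executeBatch();
            db.close();
        }
    }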

NoSQL solutions are just a new tool for the toolbox. They're great for certain parts of certain applications, but should not ever be assumed to be the perfect tool for every job. Similarly, they should not be assumed to be useless for all applications.

* HBase (with which I'm most familiar, of the various NoSQL packages) actually timestamps columns automatically for technical reasons, and one feature is that you can run any query from an "as-of" point in time. For the sake of the examples and generality, we'll assume it doesn't do this and such functionality must be added (which is trivial).

Re:Hmm. (1)

hythlodayr (709935) | about a year and a half ago | (#42457995)

Not AC, but you seem quite knowledgeable on the NoSQL side of things. Even if it's just for HBase or Cassandra, if you happen to have a write-up I would love to read more (I'm sure others would too). Coming from an RDBMS background, being able to tack on columns and use them to glean info is especially interesting to me, since schemas are a sacred cow. Warehouse solutions like Sybase IQ are column-oriented, but I don't think that's the same thing as what you're talking about.

Re:Hmm. (1)

DavidTC (10147) | about a year and a half ago | (#42463391)

A query by date range can simply ask for rows that have columns matching that particular range. For that web crawler, it'd make perfect sense to have one column for each datum you want to record about a page at a particular time. Perhaps just headers, content, and HTTP code each time, but that's three new columns every time a page is crawled - and assuming a sufficiently-slow crawler, each row could have entirely different sets of columns!

Jesus Christ.

Please explain in what possible universe what you just described is better than a normal relational table where each row contains a timestamp, headers, content, and HTTP code. (And presumably a URL, although you left that out.)

And, no, 'a sufficiently-slow crawler' is nonsense. Firstly, the web does not actually work that way (A status is part of the headers, and the headers and the content are returned over the same connection in order. Any webcrawler that has the content has the other two, and if something has all three, why would anything else be doing anything else?), and second, pretending we're in some sort of world where the web does work that way, if you can pass the damn URL around inside this 'sufficiently-slow crawler', you can pass around the rest of the stuff you're going to put in the table, and third, if you really are logging three unrelated things happening at three unrelated times, uh, duh, you should put them in three tables. (Now I'm wondering if there's actually any way to get a response that is _just_ the status codes in your universe. Or do you have to pull in every record and check?)

It's not that there aren't relations, but that they aren't enforced. A web site might have had a crawl attempted, but a 404 was returned. It could still be logged by just having a missing content column for that particular timestamp, and only the 404 column filled. On later queries about content, a filter would ignore everything but 200 responses. For statistics about dead links, the HTTP code might be all that's queried. On-the-fly analysis can be done without reconfiguring the data store.

The lack of knowledge about how RDBMSs work is amazing. Hint: There is nothing stopping fields from being blank in an RDBMS. (BTW, in more of my 'That is not how the web works so your examples are dumb' series, 404 errors do have content. It might make sense not to log it, but they do, indeed, have content 99% of the time.)

This is the most awesome ignorant sentence ever, though, and I think I shall make it my signature: 'On-the-fly analysis can be done without reconfiguring the data store.'

'Cause that's what all those RDBMS people do, run around reconfiguring their databases so they can analyze things 'on the fly'.

Store a timestamped column for each event (status update, mention, visibility change). As you guessed, don't worry much about what each event is, but just store the details (much like Facebook's timeline thing). When someone tries to view a status, run a query to pull all events for the user, and run through them to determine the effective visibility privileges, the most recent status, and the number of "this person was mentioned" events. There's your answer.

As you may guess, that'd be pretty slow, but we do have the flexibility to do any kind of analysis without reconfiguring our whole database. We could think ahead a bit, though, and add to our schema for a big speed boost: whenever a visibility change happens, the new settings are stored serialized in the event. Sure, it violates normalization, but we don't really care about that. Now, our query need not replay all of the user's events... just enough to get the last status and visibility, and any "mentioned" events. That'll at least be close to constant time, regardless of how long our users have been around.

Counting all those "mentioned" events might be a needless waste of time, though... and our query (probably a MapReduce job) must still run through our cluster, and that's a fair amount of delay for something like displaying a webpage. Here's where the dirty secret of NoSQL comes in: It's not alone. We can put a traditional RDBMS in the system, too! Once every few minutes, we can run a batch job to process everything that happened recently, and store only the latest results in an RDBMS! In a table of visible status updates, drop in the latest updates (and count of mentions) that each user group could see. Now we have the analytic freedom and write speed from the NoSQL backend, and the strict schema and instant reads of an RDBMS. Sure, there's a couple minutes' delay between the posting and the viewing... but it's a status update. It's not a big deal.

Bwhahahahahahaha. So you've decided to store everything that's happening in a random unstructured form that you can't actually get data quickly enough from, instead of just, I dunno, storing the information in a relational DB. So you then have to yank it into a relational DB to actually display it.

And the advantage of this instead of just storing the status update in a relational database, and garbage collecting old records? (Or just doing what normal people do, and having something go through each day marking records as 'old', and having that be an indexed column so you can ignore those records trivially in most queries. I like the NoSQL people's delusional idea that RDBMSs can't handle the amount of data they have. Yes. Yes they really can.)

Wait a minute...did you just say you couldn't analyse your data on the fly without reconfiguring the data store?!

Re:Hmm. (1)

Sarten-X (1102295) | about a year and a half ago | (#42479813)

Oh dear... I seem to have offended your RDBMS-is-God sensibilities again [slashdot.org] . I do so love a good argument. I hope I can find one...

Please explain in what possible universe what you just described is better than a normal relational table where each row contains a timestamp, headers, content, and HTTP code. (And presumably a URL, although you left that out.)

One where every row has a monetary (and time) cost, which is conveniently close to the one we live in. On a huge database, pulling a specific set of rows from a date range may or may not actually align well with how the database is sharded. If you've been partitioning the table by the "URL" column, and now you want to query by the "timestamp" column for a single "URL" value, you're likely going to be doing all your work on a single shard, on a single server. Conversely, if you partition the table by timestamp, all searches for the most recent data will be hammering one server.

The URL is the row key, as stated the first time I mentioned the crawler example.

And, no, 'a sufficiently-slow crawler' is nonsense.

I could make one for you if you like. The point is to illustrate that since most NoSQL stores are column-oriented, it's more expensive to actually make new rows, but columns are cheap and easy. Each row can have its own set of column names to suit its needs. There is no enforcement of any particular table design.

A status is part of the headers, and the headers and the content are returned over the same connection in order. Any webcrawler that has the content has the other two, and if something has all three, why would anything else be doing anything else?

Because it might. That's the point of the example, to show what can be done, in such a way that the principles can be applied elsewhere. Incidentally, you have highlighted another interesting aspect of such a design: it provides a passive measurement of the speed of the transmitting webserver, by measuring the time to receive the document as the timestamp difference between the status and content fields.

if you really are logging three unrelated things happening at three unrelated times, uh, duh, you should put them in three tables.

In an RDBMS, yes. This isn't an RDBMS, though. Specifically in HBase, large tables are preferred because they allow for easier load balancing. One big table will perform better than three smaller tables. Each table is partitioned by blocks of rows, so with three small tables it's possible (and more likely than pure random chance) that a query for data from the same row will end up running the query on the same node three times. On one big table, only the columns you ask for are scanned for the rows that match the query, and the row's position in one column is related to its position in other columns, so the whole row is found quickly.

(Now I'm wondering if there's actually any way to get a response that is _just_ the status codes in your universe. Or do you have to pull in every record and check?)

You just ask for only those columns. Since it's column-oriented, this is a straightforward operation, just as in an RDBMS you can ask for a row with a WHERE clause on the primary key.

There is nothing stopping fields from being blank in an RDBMS.

Again, you're missing the point of the example. It's not that the columns are blank - it's that they don't exist. They aren't taking space in the database, they aren't compressed, and they aren't null. They simply aren't.

(BTW, that 404 bit was a mistake from revision. It started out as a socket error, but then I got to thinking about whether it should be logged for retrial later, so I changed it to be a known failure, but didn't change the rest of the example)

This is the most awesome ignorant sentence ever, though, and I think I shall make it my signature: 'On-the-fly analysis can be done without reconfiguring the data store.'

'Cause that's what all those RDBMS people do, run around reconfiguring their databases so they can analyze things 'on the fly'.

Apparently they do. As one example from a prior job working with medical data in a nice big multi-million-dollar Oracle server, I was outright told that a particular query would not run, and the DBAs would not let it run, because I was asking for patient records by their address. The address fields weren't indexed, and to add such an index would be far too intense an operation to do immediately. I could fill out forms and jump through hoops and beg for the index to be added in the next software update, but that wasn't planned for a few years.

Yet again, you seem to be missing the point: There is no pre-built plan. There is no black magic to optimizing NoSQL queries, because there is no strict schema. In an RDBMS, the tables are laid out and optimized according to how they're expected to be used, so that the expected queries will be fast. NoSQL typically acts more like data warehousing, where they absorb information as quickly as possible, and make no assumptions about how it will be accessed in the future. No, you don't get the same blazing-fast read access when you're running a complicated query through a widely-distributed cluster, but nothing is ever slower because of how the server's configured.

Bwhahahahahahaha. So you've decided to store everything that's happening in a random unstructured form that you can't actually get data quickly enough from, instead of just, I dunno, storing the information in a relational DB. So you then have to yank it into a relational DB to actually display it.

Yes. Is this a problem for you? The RDBMS is effectively just a cache for the frontend. It doesn't require particularly beefy hardware, as its queries are all trivial. At the last Big Data project I worked on, we ran five MySQL instances on the five web servers behind our load balancer. It worked beautifully. The MySQL instances would be loaded from our Hadoop cluster, which did all the heavy processing work.

And the advantage of this instead of just storing the status update in a relational database, and garbage collecting old records? (Or just doing what normal people do, and having something go through each day marking records as 'old', and having that be an indexed column so you can ignore those records trivially in most queries...

Besides never needing to run those garbage-collection jobs, never needing to worry about performance whether you want those old records or not, and having the ability for each query to determine its own boundary for what's "old", there's also the lower cost for hardware upgrades. Oh, and if we ever want to change our queries in the future (like adding a timeline feature, for instance), we have all our data sitting in the backend store ready to be used in the next update. A quick update to the caches, a new job run through the cluster, and a deployment of the latest web app, and our new feature's live.

Incidentally, this is where the feature you noticed earlier can be used. If our web crawler recorded a passive observation of each website's speed, we can use that directly, without caring whether it was something we had always intended to gather. The information's in the database, so we can use it. As an example, a search engine could recommend faster sites a little above slower ones. A newer crawler could gather better records, but our old data isn't useless, nor locked in an unindexed column that would take too much effort to use.

I like the NoSQL people's delusional idea that RDBMSs can't handle the amount of data they have. Yes. Yes they really can.

Yes, they can, with enough expensive hardware, expensive optimizations, and expensive DBAs vetting every query. If you have the staff, hardware, and time readily available, go ahead and use whatever tool you want. As I've said before, I don't recommend replacing any RDBMS that works with a NoSQL solution just for the sake of change. Rather, new projects should consider well whether it's something useful for their application, and whether their amount of data will actually be sufficient to make the benefits outweigh the disadvantages.

Wait a minute...did you just say you couldn't analyse your data on the fly without reconfiguring the data store?!

No, I can run any analysis I want, at any time, without ever thinking about whether I'm conforming to the database's preconfigured optimizations. I use NoSQL.

Re:Hmm. (1)

DavidTC (10147) | about a year and a half ago | (#42525467)

One where every row has a monetary (and time) cost, which is conveniently close to the one we live in. On a huge database, pulling a specific set of rows from a date range may or may not actually align well with how the database is sharded. If you've been partitioning the table by the "URL" column, and now you want to query by the "timestamp" column for a single "URL" value, you're likely going to be doing all your work on a single shard, on a single server. Conversely, if you partition the table by timestamp, all searches for the most recent data will be hammering one server.

Likewise, if you are looking at a specific URL in the NoSQL structure you described, all searches for that will be hammering one server. Wow, it's almost like how a database is used is important to know when it's built, and it's always possible to have a 'sideways' use.

And, of course, you have fallen back on the 'huge database' nonsense which the NoSQL people always fall back on. Despite the fact you are standing there and admitting that NoSQL is slower than an RDBMS.

Which, uh, makes it slower for searching huge databases also.

Again, you're missing the point of the example. It's not that the columns are blank - it's that they don't exist. They aren't taking space in the database, they aren't compressed, and they aren't null. They simply aren't.

Ah, the 'feature' of NoSQL, where the size of empty fields suddenly becomes desperately important for people to care about. Because, of course, it's not like people would throw absurd amounts of data in the NoSQL databases, or tout the ability to do that in the same post they've become worried about the size of empty fields. ...oh, wait.

Yet again, you seem to be missing the point: There is no pre-built plan. There is no black magic to optimizing NoSQL queries, because there is no strict schema. In an RDBMS, the tables are laid out and optimized according to how they're expected to be used, so that the expected queries will be fast. NoSQL typically acts more like data warehousing, where they absorb information as quickly as possible, and make no assumptions about how it will be accessed in the future. No, you don't get the same blazing-fast read access when you're running a complicated query through a widely-distributed cluster, but nothing is ever slower because of how the server's configured.

At least you know what NoSQL actually is: A weirdly complicated file store of consistently-formatted files.

Now, the real question is: In what way is NoSQL better than a filesystem of JSON or XML files?

Obviously, there are some advantages, like the fact that NoSQL has built-in 'file' and record locking and consistency checks.

But, other than that. In what way would just _putting the records in files_ and importing them into SQL be worse?

Apparently they do. As one example from a prior job working with medical data in a nice big multi-million-dollar Oracle server, I was outright told that a particular query would not run, and the DBAs would not let it run, because I was asking for patient records by their address. The address fields weren't indexed, and to add such an index would be far too intense an operation to do immediately. I could fill out forms and jump through hoops and beg for the index to be added in the next software update, but that wasn't planned for a few years.

Very often, DBAs are assholes. And often, they are idiots. Sometimes, they are both.

There are plenty of perfectly functional ways to get that data from the database without slowing down the server, although without indexes it would probably be a somewhat slow response. And, as you pointed out, the failure was due to lack of indexing, which, uh, is trivially solvable, even if the DBAs didn't want to do it. (I have no idea why you would need to wait for a 'software update' for that, unless the DBAs just refused to do anything except at that time. Which is entirely possible.)

No, I can run any analysis I want, at any time, without ever thinking about whether I'm conforming to the database's preconfigured optimizations. I use NoSQL.

Yeah, and if you don't care about the speed of other users, you can do the same damn thing on RDBMS.

The problem, of course, is that RDBMSs, often being _actually used databases_ instead of toys, often are _in use_, and hence you can't just go around doing things that are incredibly CPU intensive or you'll fuck up all the other users. (Hence the entire existence of DBAs, whose job it is to make sure you aren't doing that.)

Being NoSQL doesn't magically mean there aren't other users. It just means that because it _can't_ return complicated results quickly, no one is _relying on it_ to return complicated results quickly, so it's okay to slow things down willy-nilly.

Re:Hmm. (1)

lauwersw (727284) | about a year and a half ago | (#42459451)

One thing I'm not seeing in the comments yet: don't forget that most NoSQL solutions are written for commodity hardware, which also makes them very suitable for cloud solutions. To get the same kind of performance out of a relational DB, you need expensive hardware.

Cassandra can also be made aware of the rack or data center its nodes are running in, so it can lay out its data replicas for regional data safety (think EC2 data center failures, all too common) but still offer optimal local data access.

Re:Hmm. (0)

Anonymous Coward | about a year and a half ago | (#42461095)

You're generally correct, but not all NoSQL databases are created equal. Some are designed for fast writes while others are only good at reads and have terrible write performance.

Generalizing NoSQL databases is as bad as generalizing SQL databases. They all have different performance characteristics.

Re:Hmm. (1)

Sarten-X (1102295) | about a year and a half ago | (#42480451)

True. My experience is primarily with HBase, and I try to clearly mark the details that I know don't generalize. Most of what I say should apply to any BigTable-based NoSQL store, but there are certainly others out there.

Re:Hmm. (4, Informative)

Corporate T00l (244210) | about a year and a half ago | (#42454471)

You'll see these kinds of large-scale columnar stores like Cassandra or HBase being used a lot in metrics and log management projects.

For instance, if you want to generate a histogram of login processing time over the last 90 days, you'll need to record the times of all of your individual logins to do that. If you have millions of logins per hour, that single metric alone is going to generate a lot of rows. If you're also measuring many other points throughout your system, the data starts getting unmanageable with B-tree backed databases and not of high enough value to store in RAM.

In the past, you might deal with this by adding more sophisticated logic at the time of collection. Maybe I'll do random sampling and only pick 1 out of every 1000 transactions to store. But then I might have a class of users I care about (e.g. users logging in from Syria, compared to all users logging in around the world) where the sample frequency causes them to drop to zero. So then I have to do more complicated logic that will pick out 1 out of every 1000 transactions, but with separate buckets for each country, as in the sketch below. But then every time your bucketing changes, you have to change the logic at all of the collection points. I can't always predict in advance what buckets I might need in the future.
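That collection-side logic ends up looking something like this sketch, where the 1-in-1000 rate and the country bucket are stand-ins for whatever dimensions matter:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class BucketedSampler {
        private static final long RATE = 1000; // keep 1 of every 1000 per bucket
        private final ConcurrentMap<String, AtomicLong> counters =
                new ConcurrentHashMap<String, AtomicLong>();

        // Sampling per bucket (e.g. per country) keeps low-volume buckets from
        // rounding down to zero samples the way a single global 1/1000 rate would.
        public boolean shouldStore(String bucket) {
            AtomicLong n = counters.get(bucket);
            if (n == null) {
                counters.putIfAbsent(bucket, new AtomicLong());
                n = counters.get(bucket);
            }
            return n.incrementAndGet() % RATE == 1;
        }
    }

And every time the bucketing changes, a class like this has to change at every collection point, which is exactly the maintenance trap.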

With more log-structured data stores and map-reduce, it becomes more feasible to collect everything up front on cheaper infrastructure (e.g. even cheap SATA disks are blazingly fast for sequential access, which B-tree DBs don't take advantage of but log-oriented DBs like Cassandra are specifically architected to use). The data collected can have indexes (really more like inverted indexes, but that is a longer discussion) up front for quick query of data facets that you know you want in advance, but still retains the property of super-fast-insert-on-cheap-hardware, so that you can store all of the raw data and come back for it later when there is something you didn't think of in advance, and map-reduce for the answer.

Re:Hmm. (1)

KingMotley (944240) | about a year and a half ago | (#42478399)

With more log-structured data stores and map-reduce, it becomes more feasible to collect everything up front on cheaper infrastructure (e.g. even cheap SATA disks are blazingly fast for sequential access, which B-tree DBs don't take advantage of but log-oriented DBs like Cassandra are specifically architected to use).

I'll have to do some performance testing later, but you do realize that almost all relational databases support a concept known as clustered indexes, which takes advantage of sequential access, correct? Sounds like you don't understand how current relational databases work.

Relationships. (0)

Anonymous Coward | about a year and a half ago | (#42454855)

With relational databases, you have to express your relationships in your database schema, as well as in your data objects and business logic.

With non-relationship databases, you have to express your relationships in your data objects and business logic.

Think about it.

Re:Hmm. (0)

Anonymous Coward | about a year and a half ago | (#42455219)

Our company considered it when we were having unusually long page loads. Basically we wanted to log when (and where, and with what data) every user on our system requested a page, and then when that request finished being processed. It's a lot of data, and doing it with an RDBMS would have bogged down an already overburdened system. The NoSQL solution was fast, and since the data was only for our internal use we didn't really care if it was wrong or corrupt.

No way we'd ever use NoSQL for anything a customer needed, but it is the right tool for the job if you need to store a massive amount of data with low latencies and you don't care if it's all screwed up and wrong sometimes. Granted we never actually got to deploy this solution (pushed an update to add a feature, problems on pages we didn't touch went away. Unable to reproduce afterwards), but it's still just another tool on the belt with a highly specialized use.

Re:Hmm. (0)

Anonymous Coward | about a year and a half ago | (#42455537)

I use it to store volatile tickets and to distribute sessions instead of using multicast-based clustering.

Fast (1)

jbolden (176878) | about a year and a half ago | (#42456713)

It always pays to use relational over NoSQL when you can. But just like in data warehousing, where it makes sense to denormalize for performance reasons, it can make sense to organize the data around specific computations in ways that damage the ability to use SQL.

You won't find any good reason with normal-sized data sets and a normal number of joins. But for computations that require large tables joined multiple times in complex ways that can't be overcome with tricks like indexing, it can make sense to sacrifice the relational algebra.

Re:Hmm. (1)

Fallingcow (213461) | about a year and a half ago | (#42457155)

It's for people who were letting their programming frameworks do whatever the fuck they want with their database structures and decided to take that one step farther.

Admittedly, I kind of like it for low(er) value things where you're likely to have some variation in the structures being inserted, like logging and tracking the status of long-running tasks (upsert and appending to arrays FTW). That's about the only use I've found for the tech, though, and I admit that even in those cases its use is largely a result of laziness.

Re:Hmm. (1)

Anonymous Coward | about a year and a half ago | (#42457383)

I use it when I need a database that supports relationships, tons of them, and doesn't falter at the same relationship type having completely different fields. It's the same -freaking- relationship, with supporting information from several different systems.

I use Neo4j, which is only technically NoSQL, but it has a few query languages of its own. But I always chuckle at "relational" databases, because they all seem to collapse under too many relationships: "X" is_a this, is_a that, is_a this2, is_a... why does each of those relationships have to be a row in a table? Shouldn't rows contain actual data rather than placeholders for a relationship?

However, use the best tool for the job, not the most popular. For 90% of the stuff I use MySQL or PostgreSQL; the other 10% is where those would need too many joins, sometimes tens of joins, and a short Cypher statement would do.

Re:Hmm. (0)

Anonymous Coward | about a year and a half ago | (#42457425)

Graphing relationships between millions of devices, with differing types and attributes of relationships, oh, and that change over time. On top of that, an insert rate of 4 million inserts per second on time-series data that can be out of order.

Or reddit.

Re:Hmm. (1)

Lord_Naikon (1837226) | about a year and a half ago | (#42459445)

It's not always about the data relationships. Cassandra for example is very easy to scale horizontally (much easier than traditional databases) and can achieve very high throughput. Last time I checked (a year ago) I could get over 50,000 stores/queries per second on a cluster of cheap commodity hardware (4 servers). That result was achieved with full redundancy (n=2). Such a setup is very resilient against failure (provided clients handle failure of individual nodes correctly). Maintaining such a cluster is also a breeze, with the ability to pull servers at will while operations continue to run. You no longer have to deal with brittle master-master/slave setups.

At the time I checked and tested about 10 different "NoSQL" solutions for viability. I had these requirements in mind:
1) Must scale horizontally, no single master dependency and must continue to work when any single node in the cluster fails.
Lots of NoSQL solutions failed this requirement because they had explicit master servers or didn't do redundant data storage.

2) Must perform at least 10,000 reads/writes of tuples per second per node on the bladeservers we had available.
Again, lots of NoSQL solutions failed to perform. Some were incredibly slow, with less than 1,000 queries/sec/node.

3) Must have good management tools.
Most NoSQL databases were crap in this department.

4) Must be well supported by open source (Java) libraries.
Most of them were, but a lot of them failed to cope correctly with unreachable/failed cluster nodes.

In the end Apache Cassandra was the only one which fulfilled all my requirements.
Our use cases were persistent caching (as a cache layer behind memcached), and high volume (simple) data storage.

Re:Hmm. (0)

Anonymous Coward | about a year and a half ago | (#42462279)

Maybe someone can explain this to me. I've been keeping an eye out for situations where it would make more sense to use a NoSQL solution like Mongo, Couch, etc. for a year or so now, and I just haven't found one.

Under what circumstances do people use a data store that doesn't need data relationships?

The first section in this document [christof-strauch.de] explains it pretty well.

Re:Hmm. (1)

Bengie (1121981) | about a year and a half ago | (#42464791)

Sharing a resource, no matter how you spin it, will cause contention. The only way to scale a resource that is both read- and write-heavy is to scale horizontally. This is where NoSQL takes the crown. This is just a prime example, but not the only one.

Re:Hmm. (1)

snemarch (1086057) | about a year and a half ago | (#42466793)

Under what circumstances do people use a data store that doesn't need data relationships?

Think (huge!) web content management systems with tree-structured, component-based pages, where the data varies widely between page types and business requirements are constantly in flux.

While there are definitely data relationships, they're not necessarily very comfortable to map in a traditional RDBMS.

batch != patch (1)

Corporate T00l (244210) | about a year and a half ago | (#42454255)

I'm not sure if it's a typo or a misunderstanding, but the statement in the summary about atomic batching is hilariously incorrect.

Atomic batching has nothing to do with "patches can be reapplied if one of them fails", but rather the more pedantic yet common case where you want a set of data updates to be batched atomically, where all or none of the changes occur, but nothing in between.
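For the curious, CQL3 spells the all-or-nothing semantics out explicitly. A sketch using the new DataStax Java driver with a made-up keyspace and tables (cqlsh accepts the same statement):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class AtomicBatch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo"); // hypothetical keyspace

            // Cassandra 1.2 logs the batch before applying it, so either both
            // inserts eventually apply or neither does (atomic, though not
            // isolated: readers may briefly see one write without the other).
            session.execute(
                "BEGIN BATCH " +
                "INSERT INTO users (id, name) VALUES (42, 'alice'); " +
                "INSERT INTO users_by_name (name, id) VALUES ('alice', 42); " +
                "APPLY BATCH");

            cluster.shutdown();
        }
    }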

Re:batch != patch (1)

vlm (69642) | about a year and a half ago | (#42454333)

sounds like a transaction

Re:batch != patch (1)

FooAtWFU (699187) | about a year and a half ago | (#42454385)

But the atomic batches in v1.2 prevent such inconsistencies, by ensuring that groups of updates are treated as indivisible (atomic) units of work: either all the updates succeed or all of them fail. If they all fail, then the batch is reapplied, and there’s no need to determine which individual updates failed or succeeded.

Looks like there are two parts here. One of them is communicating the changeset to (one or more) nodes, then the other part is actually applying it. If the coordinator fails halfway through, things can still be automatically resumed. There's a whole bunch of interesting semantics around Cassandra's eventual-consistency model (and the implications on how you should be programming to match it, e.g. lots of idempotent inserts and not a lot of updates) which I'm not entirely qualified to expound upon here :D

502 Bad Gateway (0)

Anonymous Coward | about a year and a half ago | (#42454291)

Indeed. I love stumbling across these on the Apache website.

Re:502 Bad Gateway (0)

Anonymous Coward | about a year and a half ago | (#42454335)

But Apache are teh bestest web sarver evar!! Only IIS from Micro$hit has these issues.

Re:502 Bad Gateway (1)

ls671 (1122017) | about a year and a half ago | (#42454499)

The whole http://blogs.apache.org/ [apache.org] domain seems to return this 502 error right now. Maintenance, another problem, or just slashdotted even though it is an Apache domain?

They seem to be using "Apache/2.0.63 Server at blogs.apache.org Port 80" in reverse-proxy mode and my guess is the server behind it is down.

Blog entry from Google cache (1)

Anonymous Coward | about a year and a half ago | (#42454873)

The Apache Software Foundation Blog
Wednesday Jan 02, 2013

The Apache Software Foundation Announces Apache Cassandra v1.2

High-performance, super-robust Big Data distributed database introduces support for dense clusters, simplifies application modeling, and improves data cell storage, design, and representation.

Forest Hill, MD –2 January 2013– The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of nearly 150 Open Source projects and initiatives, today announced Apache Cassandra v1.2, the latest version of the highly-scalable, fault-tolerant, Big Data distributed database.

Successfully handling thousands of requests per second, Apache Cassandra powers massive data sets quickly and reliably without compromising performance –whether running in the Cloud or partially on-premise in a hybrid data store. Apache Cassandra is successfully used by an array of organizations that include Adobe, Appscale, Appssavvy, Backupify, Cisco, Clearspring, Cloudtalk, Constant Contact, DataStax, Digg, Digital River, Disney, eBay, Easou, Formspring, Hailo, Hobsons, IBM, Mahalo.com, Morningstar, Netflix, Openwave, OpenX, Palantir, PBS, Plaxo, Rackspace, Reddit, RockYou, Shazam, SimpleGeo, Spotify, Thomson-Reuters, Twitter, Urban Airship, US Government, Walmart Labs, Williams-Sonoma, Inc., and Yakaz.

"We are pleased to announce Cassandra 1.2," said Jonathan Ellis, Vice President of Apache Cassandra. "By improving support for dense clusters —powering multiple terabytes per node— as well as simplifying application modeling, and improving data cell storage/design/representation, systems are able to effortlessly scale petabytes of data."

Highlights for the second-generation, high-performance NoSQL database include clustering across virtual nodes, inter-node communication, atomic batches, and request tracing. In addition, Cassandra v1.2 also marks the release of CQL3 (version 3 of the Cassandra Query Language), to simplify application modeling, allow for more powerful mapping, and alleviate design limitations through more natural representation.

"We are really excited to begin taking advantage of all the new features Apache Cassandra v1.2 has to offer – particularly virtual nodes and atomic batches. Both of these new features will play a central role in future enhancements to our architecture," said Ed Anuff, VP, Mobile Platform at Apigee.

"It's great to see the core of Apache Cassandra continue to evolve," said independent software developer Kelly Sommers. "In Cassandra v1.2 the introduction of vnodes will simplify managing clusters while improving performance when adding and rebuilding nodes. v1.2 also includes many new features, performance improvements and further heap reduction to eleviate the burden on the JVM garbage collector."

"The much anticipated release of Cassandra 1.2 brings with it features that simplify application development. Atomic batches provide a mechanism for developers to ensure transactional integrity across a business process, instead of relying on idempotent operations and retry mechanisms," said Brian O’Neill, Lead Architect at Health Market Science. "Additionally, native support for collections is attractive and a compelling reason to explore CQL 3."

"Apache Cassandra continues to be a leading option for scalability and high availability without compromising performance and, with the improvements provided in v1.2, reinforces our commitment to growth while preserving backwards compatibility," added Ellis.

Availability and Oversight
As with all Apache products, Apache Cassandra v1.2 is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases. Apache Cassandra source code, documentation, and related resources are available at http://cassandra.apache.org/ [apache.org] .

NoSQL? Then what? (1)

rduke15 (721841) | about a year and a half ago | (#42454907)

There must be something I don't understand. For me the whole point of databases is precisely that they come with SQL to easily do even complex stuff with them.

How can the absence of the only useful feature be a "selling" point? No SQL? No thanks?...

Re:NoSQL? Then what? (0)

Anonymous Coward | about a year and a half ago | (#42455091)

The fact that it can be 30x faster than a traditional RDBMS using SQL is a plus. Google uses it with its search engine; it simply is not feasible to be limited by relational data when a simple index of terms is more than sufficient.

Re:NoSQL? Then what? (1)

Anonymous Coward | about a year and a half ago | (#42455157)

SQL is anything but easy from an app development viewpoint. You have to either mix it in your code, which is ugly in itself and creates tons of potential SQL injection bugs, or use an ORM, and then your database is probably unusable using conventional tools.

NoSQL solves the problem, as native bindings to different languages are the standard interface in this world.

Re:NoSQL? Then what? (0)

Anonymous Coward | about a year and a half ago | (#42455617)

I suggest you have a look at MyBatis, a semi-automatic ORM.

Re:NoSQL? Then what? (0)

Anonymous Coward | about a year and a half ago | (#42456129)

no thanks. ORMs are garbage for all but the most trivial of cases.

Re:NoSQL? Then what? (0)

Anonymous Coward | about a year and a half ago | (#42462001)

If you have SQL sprinkled throughout your codebase, then you're doing it wrong.

Re:NoSQL? Then what? (2)

hythlodayr (709935) | about a year and a half ago | (#42455923)

"NoSQL" is a highly-misleading name; the SQL language is really besides the point.

The important parts of NoSQL really boil down to:
1. Very high performance.
2. Ability to handle extremely large data (on the order of tens or hundreds of terabytes).
3. A natural way of dealing with non-flat, non-BLOB data.
4. Better integration with OO languages.

#1 and #2 both come with trade-offs, which is perfectly fine. Not all problems need ACID compliance.

#3 and #4 really go back to the '90s, though nothing ever stuck (e.g., object-relational databases).

Re:NoSQL? Then what? (1)

Oflameo (2806007) | about a year and a half ago | (#42456751)

Why not LDAP?

Re:NoSQL? Then what? (1)

hythlodayr (709935) | about a year and a half ago | (#42456937)

Why not LDAP?

Not speaking with any authority, but AFAIK LDAP is just an over-the-wire protocol. It says nothing about the underlying database(s) or what the directory services actually represent. That said, see OpenLDAP [openldap.org] and LDAP vs RDBMS [openldap.org].

Re:NoSQL? Then what? (1)

micheas (231635) | about a year and a half ago | (#42455931)

One of the useful features of Solr/Lucene is the MLT keyword (which stands for More Like This).

Another useful feature of many NoSQL databases is faceted search with good performance.

It seems to be a very common practice to store the data in an SQL database and duplicate that database in a NoSQL database to use for searching; then, if the NoSQL database gets corrupted, you rebuild from the original data, and your searches are incomplete while the rebuild goes on (the worst case I've had to deal with was a couple of days for the rebuild).

Many sites use both SQL and NoSQL databases. Eventual consistency is fine for a lot of use cases; in other use cases, eventual consistency renders the application completely useless.

Re:NoSQL? Then what? (1)

LingNoi (1066278) | about a year and a half ago | (#42458509)

NoSQL does have some advantages. If you have 2+ GB of data in a relational database table and you wish to alter the table, doing so can take a long time, during which your services will be down. Since non-relational databases allow for schemaless data, you can simply add the extra column in the code, add code for what to do if the new column doesn't exist (i.e. old data), and deploy it with zero downtime, as in the sketch below.
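A sketch of that read-side fallback, with invented names; the Map stands in for whatever row or document type your client library returns:

    import java.util.Map;

    public class ProfileReader {
        // "display_name" was added in code only; rows written before the change
        // simply lack the column, so old data gets a computed default instead
        // of forcing a long ALTER TABLE and the downtime that goes with it.
        public static String displayName(Map<String, String> row) {
            String name = row.get("display_name");
            return (name != null) ? name : row.get("username");
        }
    }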

These points don't really come into play until you have a huge dataset, however, so for most stuff I still recommend relational databases.

Apples and Oranges folks (1)

Anonymous Coward | about a year and a half ago | (#42456735)

I can't believe these assholes are getting into an argument about SQL vs. NoSQL. Apples and oranges. NoSQL isn't a complete replacement, nor are RDBMSs the solve-all solution when you need to scale. Sounds like a bunch of DB admins feeling threatened that their jobs are going to be in jeopardy.
