Sunday, July 4, 2010

Next Generation Database Winners

Application users expect fast response times. In particular, I am seeing that the new generation of users doesn't tolerate the 3-second response time that is widely accepted by the industry, and companies are having trouble meeting customer expectations with large data sets. NoSQL is the perfect fit for this requirement. Distributed, scalable databases are desperately needed these days. There are lots of players in the NoSQL world, but Cassandra and HBase are the leading competitors.

HBase is the more robust database for a majority of use-cases. Cassandra relies mostly on Key-Value pairs for storage, with a table-like structure added to make more robust data structures possible. It’s a fact that far more people are using HBase than Cassandra at this moment, despite both being similarly recent.

HBase values strong consistency and high availability. Cassandra values availability and partition tolerance. Cassandra may be useful for storage, but not for data processing. HBase is much handier for data processing and data warehousing.

HBase also has a nice web-based UI that you can use to view cluster status, determine which nodes store various data, and do some other basic operations. Cassandra lacks this web UI as well as a shell, making it harder to operate.

Installation:
Cassandra is only a Ruby gem install away. That’s pretty impressive. With HBase, we have to do quite a bit of manual installation and configuration.

Availability:
Cassandra claims that “writes never fail”, whereas in HBase, if a region server is down, writes will be blocked for affected data until the data is redistributed. This rarely happens in practice, of course, but will happen in a large enough cluster. In addition, HBase has a single point-of-failure (the Hadoop NameNode), but that will be less of an issue as Hadoop evolves. HBase does have row locking, however, which Cassandra does not.

Consistency:
Apps usually rely on data being accurate and unchanged from the time of access, so the idea of eventual consistency can be a problem. Cassandra, however, has an internal method of resolving up-to-dateness issues: every column carries a timestamp, and when replicas disagree, basically the latest timestamp wins. The HBase/BigTable model puts the onus of resolving any consistency conflicts on the application, as everything is stored versioned by timestamp.
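The two approaches can be illustrated with a small Python sketch. It is not code from either project: `last_write_wins` and `VersionedCell` are hypothetical names, and the timestamps are plain integers chosen for the example. The first function shows the "latest timestamp wins" rule; the class shows a cell that keeps every timestamped version and leaves the choice to the caller.

```python
def last_write_wins(replica_values):
    """Return the (timestamp, value) pair with the highest timestamp.

    replica_values is a list of (timestamp, value) pairs, one per replica.
    """
    return max(replica_values, key=lambda pair: pair[0])


class VersionedCell:
    """Hypothetical cell that keeps every write, keyed by timestamp."""

    def __init__(self):
        self.versions = {}

    def put(self, timestamp, value):
        self.versions[timestamp] = value

    def get(self, max_versions=1):
        # Newest first; the application decides how many versions it
        # wants and what to do when they disagree.
        newest_first = sorted(self.versions.items(), reverse=True)
        return newest_first[:max_versions]


# Three replicas hold different versions of the same column.
replicas = [(100, "old"), (250, "new"), (180, "middle")]
winner = last_write_wins(replicas)

# The same writes stored as versions of one cell.
cell = VersionedCell()
for ts, value in replicas:
    cell.put(ts, value)
latest_two = cell.get(max_versions=2)
```

In the first case the store discards the losing values itself; in the second, the older versions stay readable and the application has to pick.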

Cassandra only supports one table per install. That means you can’t denormalize and duplicate your data to make it more usable in analytical scenarios. Cassandra is really more of a Key Value store than a Data Warehouse. Furthermore, schema changes require a cluster restart.

Replication:
Cassandra uses a P2P sharing model, whereas HBase (the upcoming version) employs more of a data+logs backup method, aka ‘log shipping’.
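As a rough illustration of the log shipping idea in general (not of how HBase implements it), the Python sketch below has a primary that appends every write to a log and a replica that catches up by replaying the entries it has not yet applied. The class and method names are made up for the example.

```python
class Primary:
    """Hypothetical primary that records every write in an append-only log."""

    def __init__(self):
        self.data = {}
        self.log = []  # entries are (sequence, key, value); sequence starts at 1

    def put(self, key, value):
        seq = len(self.log) + 1
        self.log.append((seq, key, value))
        self.data[key] = value

    def entries_after(self, seq):
        # Sequence n lives at index n - 1, so this slice returns
        # every entry whose sequence number is greater than seq.
        return self.log[seq:]


class Replica:
    """Hypothetical replica that replays shipped log entries in order."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # sequence number of the last entry applied

    def catch_up(self, primary):
        for seq, key, value in primary.entries_after(self.applied):
            self.data[key] = value
            self.applied = seq


primary = Primary()
replica = Replica()
primary.put("user:1", "alice")
primary.put("user:2", "bob")
replica.catch_up(primary)      # ships entries 1 and 2

primary.put("user:1", "alicia")
replica.catch_up(primary)      # ships only entry 3
```

The replica lags behind the primary between catch-ups, which is the usual trade-off of this style of replication.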

Conclusion:
If you need highly available writes with only eventual consistency, then Cassandra is a viable candidate for now. However, many apps are not happy with eventual consistency, and Cassandra is still lacking many features. Furthermore, even if writes do not fail, there is still cluster downtime associated with even minor schema changes. HBase is more focused on reads, but can handle very high read and write throughput. It’s much more Data Warehouse ready, in addition to serving millions of requests per second. The HBase integration with MapReduce makes it valuable and versatile.

3 comments:

  1. Full disclosure: I'm a Cassandra committer.

    > Cassandra only supports one table per install

    Cassandra has supported multiple keyspaces since at least version 0.4. (Current shipping version is 0.6.)

    > there is still cluster downtime associated with even minor schema changes.

    True, but this will change in 0.7. A beta will be released in a few weeks.

    Also, Hadoop integration was introduced in 0.6 and will become even tighter in 0.7.

  2. To clarify one point, there is _node_ downtime associated with schema changes in 0.6, but not _cluster_ downtime. And as Gary said, this limitation is fixed in 0.7.
