Sunday, July 4, 2010

Six reasons to choose NOSQL

NOSQL
Beats performance bottlenecks and delivers quick response times.
Runs on clusters of cheap PC servers.
Open source with no license fee (reduces cost).
Scales horizontally and quickly.
Distributed database.
Easy replication support.
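
To make the "horizontal scaling", "distributed" and "replication" points concrete, here is a minimal sketch of how a key might be spread across a cluster. It is an illustration only, not how Cassandra or HBase actually place data; the function name `replica_nodes`, the node names and the replica count are assumptions made for this example.

```python
import hashlib

def replica_nodes(key, nodes, replicas=2):
    """Pick the nodes that would hold `key` (hypothetical helper for illustration).

    The primary node is chosen by hashing the key; the remaining copies go to
    the nodes that follow it in the list.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    count = min(replicas, len(nodes))  # cannot have more copies than nodes
    return [nodes[(start + i) % len(nodes)] for i in range(count)]

# Assumed three-node cluster of cheap servers.
nodes = ["node-a", "node-b", "node-c"]
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", replica_nodes(key, nodes))
```

Adding a server to `nodes` spreads keys over more machines, which is the basic idea behind scaling out. Note that this simple modulo placement moves many keys whenever the node list changes; real systems typically use more careful schemes such as consistent hashing.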

Next Generation Database Winners

Application users expect fast response times. In particular, I am seeing that the new generation of users doesn't tolerate the 3-second response time that is widely accepted by the industry, and companies are finding it a challenge to meet customer expectations with large data sets. NOSQL is a natural fit for this requirement. Distributed, scalable databases are badly needed these days. There are lots of players in the NOSQL world, but Cassandra and HBase are the leading contenders.

HBase is the more robust database for a majority of use-cases. Cassandra relies mostly on Key-Value pairs for storage, with a table-like structure added to make more robust data structures possible. It’s a fact that far more people are using HBase than Cassandra at this moment, despite both being similarly recent.

HBase prioritizes strong consistency, whereas Cassandra prioritizes availability and partition tolerance. Cassandra may be useful for storage, but not for data processing. HBase is much handier for data processing and data warehousing.

HBase also has a nice web-based UI that you can use to view cluster status, determine which nodes store various data, and do some other basic operations. Cassandra lacks this web UI as well as a shell, making it harder to operate.

Installation:
Cassandra is only a Ruby gem install away. That’s pretty impressive. HBase, by contrast, requires quite a bit of manual installation and configuration.

Availability:
Cassandra claims that “writes never fail”, whereas in HBase, if a region server is down, writes will be blocked for affected data until the data is redistributed. This rarely happens in practice, of course, but will happen in a large enough cluster. In addition, HBase has a single point-of-failure (the Hadoop NameNode), but that will be less of an issue as Hadoop evolves. HBase does have row locking, however, which Cassandra does not.

Consistency:
Apps usually rely on data being accurate and unchanged from the time of access, so the idea of eventual consistency can be a problem. Cassandra, however, resolves up-to-dateness conflicts internally by comparing timestamps: when replicas disagree, the value with the latest timestamp wins. The HBase/BigTable approach puts the onus of resolving any consistency conflicts on the application, as everything is stored versioned by timestamp.
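
The two approaches described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual Cassandra or HBase code: `resolve_latest` and `VersionedCell` are hypothetical names, and timestamps are plain integers.

```python
def resolve_latest(replica_values):
    """Last-write-wins: return the (timestamp, value) pair with the highest timestamp."""
    return max(replica_values, key=lambda pair: pair[0])


class VersionedCell:
    """A cell that keeps every write, keyed by timestamp (hypothetical class)."""

    def __init__(self):
        self.versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        self.versions[timestamp] = value

    def get(self, max_versions=1):
        # Newest first; the application decides how many versions it wants
        # and how to interpret them.
        newest_first = sorted(self.versions.items(), reverse=True)
        return newest_first[:max_versions]


# Three replicas returned different values for the same key.
replicas = [(100, "alice@old.example"), (250, "alice@new.example"), (180, "alice@mid.example")]
print(resolve_latest(replicas))

cell = VersionedCell()
cell.put(100, "alice@old.example")
cell.put(250, "alice@new.example")
print(cell.get(max_versions=2))
```

In the first case the store picks the winner for you; in the second, all versions remain available and the application chooses which one to trust.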

Cassandra only supports one table per install. That means you can’t denormalize and duplicate your data to make it more usable in analytical scenarios. Cassandra is really more of a Key Value store than a Data Warehouse. Furthermore, schema changes require a cluster restart.

Replication:
Cassandra uses a P2P sharing model, whereas HBase (the upcoming version) employs more of a data+logs backup method, aka ‘log shipping’.
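
As a rough picture of what "log shipping" means, the sketch below has a primary record every write in a log and a replica replay the entries it has not yet applied. The class names `Primary` and `Replica` and the in-memory list used as the log are assumptions for illustration; they do not reflect HBase's actual replication code.

```python
class Primary:
    """Accepts writes and records each one in an append-only log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) edits

    def put(self, key, value):
        self.log.append((key, value))  # log first, then apply
        self.data[key] = value


class Replica:
    """Rebuilds its state by replaying log entries shipped from the primary."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already replayed

    def catch_up(self, log):
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)


primary = Primary()
primary.put("row1", "v1")
primary.put("row1", "v2")
primary.put("row2", "x")

replica = Replica()
replica.catch_up(primary.log)  # "ship" the log to the replica
print(replica.data == primary.data)
```

Because the replica only applies entries after its `applied` position, shipping the same log again is harmless, and a replica that falls behind can catch up later.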

Conclusion:
If you need highly available writes with only eventual consistency, then Cassandra is a viable candidate for now. However, many apps are not happy with eventual consistency, and Cassandra still lacks many features. Furthermore, even if writes do not fail, there is still cluster downtime associated with even minor schema changes. HBase is more focused on reads, but can handle very high read and write throughput. It’s much more Data Warehouse ready, in addition to serving millions of requests per second. The HBase integration with MapReduce makes it valuable and versatile.

Database Trend

Databases have been in use since the earliest days of electronic computing. Here is the history and where we are today:

1960s Navigational DBMS

1970s Relational DBMS; end of 1970s SQL DBMS

1980s Object Oriented Databases

1990s Performance enhancement through replication in an object-oriented DBMS.

2000s NoSQL databases (non-relational, next-generation databases)

Typical modern relational databases have shown poor performance on data-intensive applications including indexing a large number of documents, serving pages on high-traffic websites and delivering streaming media.

Next-generation databases mostly address some of these points: being non-relational, distributed, open source and horizontally scalable. The original intention was modern, web-scale databases that are friendly to real use cases.