Sunday, July 4, 2010

Database Trend

Databases have been in use since the earliest days of electronic computing. Here is a brief history and where we stand today:

1960s Navigational DBMS

1970s Relational DBMS & end of 1970s SQL DBMS

1980s Object Oriented Databases

1990s Performance enhancement through replication in an object-oriented DBMS.

2000s NoSQL: non-relational, "next generation" databases

Typical modern relational databases have shown poor performance on data-intensive applications, including indexing large numbers of documents, serving pages on high-traffic websites, and delivering streaming media.

Next-generation databases mostly address some combination of these points: they are non-relational, distributed, open source, and horizontally scalable. The original intention has been modern, web-scale, use-case-friendly databases.

Saturday, January 30, 2010

11g feature: Reference Partitioning

Reference Partitioning

Reference partitioning allows the partitioning of a table based on the partitioning scheme of the table referenced in its referential constraint. The partitioning key is resolved through an existing parent-child relationship, enforced by active primary key or foreign key constraints. This means that tables with a parent-child relationship can be logically equi-partitioned by inheriting the partitioning key from the parent table without duplicating the key columns. The logical dependency also automatically cascades partition maintenance operations, making application development easier and less error-prone.

To create a reference-partitioned table, you specify a PARTITION BY REFERENCE clause in the CREATE TABLE statement. This clause specifies the name of a referential constraint and this constraint becomes the partitioning referential constraint that is used as the basis for reference partitioning in the table.

As with other partitioned tables, you can specify object-level default attributes, and can optionally specify partition descriptors that override the object-level defaults on a per-partition basis.
CREATE TABLE orders
( order_id     NUMBER(12),
  order_date   DATE,
  order_mode   VARCHAR2(8),
  customer_id  NUMBER(6),
  order_status NUMBER(2),
  order_total  NUMBER(8,2),
  sales_rep_id NUMBER(6),
  promotion_id NUMBER(6),
  CONSTRAINT orders_pk PRIMARY KEY (order_id)
)
PARTITION BY RANGE (order_date)
( PARTITION Q1_2005 VALUES LESS THAN (TO_DATE('01-APR-2005','DD-MON-YYYY')),
  PARTITION Q2_2005 VALUES LESS THAN (TO_DATE('01-JUL-2005','DD-MON-YYYY')),
  PARTITION Q3_2005 VALUES LESS THAN (TO_DATE('01-OCT-2005','DD-MON-YYYY')),
  PARTITION Q4_2005 VALUES LESS THAN (TO_DATE('01-JAN-2006','DD-MON-YYYY')));

The example above creates a table called ORDERS that is range-partitioned on order_date. It is created with four partitions: Q1_2005, Q2_2005, Q3_2005, and Q4_2005. This table is referenced in the creation of the reference-partitioned table below.
CREATE TABLE order_items
( order_id     NUMBER(12) NOT NULL,
  line_item_id NUMBER(3)  NOT NULL,
  product_id   NUMBER(6)  NOT NULL,
  unit_price   NUMBER(8,2),
  quantity     NUMBER(8),
  CONSTRAINT order_items_fk
    FOREIGN KEY (order_id) REFERENCES orders(order_id)
)
PARTITION BY REFERENCE (order_items_fk);

The reference-partitioned child table ORDER_ITEMS example above is created with four partitions, Q1_2005, Q2_2005, Q3_2005, and Q4_2005, where each partition contains the order_items rows corresponding to orders in the respective parent partition.


If partition descriptors are provided, then the number of partitions described must be exactly equal to the number of partitions or sub-partitions in the referenced table. If the parent table is a composite partitioned table, then the table will have one partition for each sub-partition of its parent; otherwise the table will have one partition for each partition of its parent.

Partition bounds cannot be specified for the partitions of a reference-partitioned table. The partitions of a reference-partitioned table can be named. If a partition is not explicitly named, then it will inherit its name from the corresponding partition in the parent table, unless this inherited name conflicts with one of the explicit names given. In this case, the partition will have a system-generated name.

Partitions of a reference-partitioned table will co-locate with the corresponding partition of the parent table, if no explicit tablespace is specified for the reference-partitioned table's partition.
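
To make both points concrete, here is a hedged sketch: the table name ORDER_ITEMS_NAMED, the partition names OI_Q1_2005 through OI_Q4_2005, and the tablespaces TS_Q1 through TS_Q4 are illustrative assumptions, not part of the original example. The child table names its partitions and tablespaces explicitly, and a data dictionary query then shows how partition names and tablespaces were resolved for both child tables:

CREATE TABLE order_items_named
( order_id     NUMBER(12) NOT NULL,
  line_item_id NUMBER(3)  NOT NULL,
  product_id   NUMBER(6)  NOT NULL,
  unit_price   NUMBER(8,2),
  quantity     NUMBER(8),
  CONSTRAINT order_items_named_fk
    FOREIGN KEY (order_id) REFERENCES orders(order_id)
)
PARTITION BY REFERENCE (order_items_named_fk)
( PARTITION oi_q1_2005 TABLESPACE ts_q1,
  PARTITION oi_q2_2005 TABLESPACE ts_q2,
  PARTITION oi_q3_2005 TABLESPACE ts_q3,
  PARTITION oi_q4_2005 TABLESPACE ts_q4);

-- Confirm the inherited and explicit partition names and their tablespaces.
SELECT table_name, partition_name, tablespace_name
FROM   user_tab_partitions
WHERE  table_name IN ('ORDER_ITEMS', 'ORDER_ITEMS_NAMED')
ORDER  BY table_name, partition_position;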

Hash Partitioning/How Cardinality Affects Distribution

Hash partitioning provides a method of evenly distributing data across a specified number of partitions. Rows are mapped into partitions based on a hash value of the partitioning key. Hash partitioning is a better choice than range partitioning when:

a. You do not know beforehand how much data will map into a given range

b. The sizes of range partitions would differ quite substantially

c. Partition pruning and partition-wise joins on a partitioning key are important

d. Range partitioning would cause the data to be undesirably clustered

In 11g, hash partitioning can also be used:

a. To enable partial or full parallel partition-wise joins with very likely equi-sized partitions.

b. To distribute data evenly among the nodes of an MPP platform that uses Oracle Real Application Clusters. As a result, you can minimize interconnect traffic when processing internode parallel statements.

c. To use partition pruning and partition-wise joins according to a partitioning key that is mostly constrained by a distinct value or value list.

d. To randomly distribute data to avoid I/O bottlenecks if you do not use a storage management technique that stripes and mirrors across all available devices.

The concepts of splitting, dropping, or merging partitions do not apply to hash partitions. Instead, hash partitions can be added and coalesced. With hash partitioning, pruning is only possible using equality or IN-list predicates. A small sketch follows; SALES_HASH is a hypothetical table used only to illustrate the two maintenance operations.
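
-- Hypothetical hash-partitioned table, used only for illustration.
CREATE TABLE sales_hash
( sale_id NUMBER,
  amount  NUMBER(10,2))
PARTITION BY HASH (sale_id) PARTITIONS 4;

-- Adding a hash partition redistributes rows from the existing partitions.
ALTER TABLE sales_hash ADD PARTITION;

-- Coalescing removes one hash partition and redistributes its rows.
ALTER TABLE sales_hash COALESCE PARTITION;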

The number of partitions should always be a power of two (2, 4, 8, and so on) to obtain the most even data distribution.

Cardinality is important when choosing a key for hash partitioning. Oracle recommends that you do not use hash partitioning on key columns with low cardinality (columns in which the number of distinct values is small compared to the number of rows in the table).

The hash function works best with a large number of values. Rows are mapped into partitions based on the hash value of the partitioning key. A primary key is a very good hash key because it has a wide range of values.

For example, if you had a hash partitioned table with 8 hash partitions, and a key column with only 10 distinct values, you could encounter the following:

CREATE TABLE dept (deptno NUMBER, deptname VARCHAR(32))
PARTITION BY HASH(deptno) PARTITIONS 8;

Partition     DEPTNO Column Values
-----------   --------------------
hash_part1    1, 2, 3
hash_part2    4, 5, 6
hash_part3    7
hash_part4    8
hash_part5    (none)
hash_part6    (none)
hash_part7    9
hash_part8    10

You would have a scenario where some hash partitions contain a high number of distinct values and the hash partitions are of widely different sizes. It is even possible that some hash partitions contain zero distinct values while others contain three or more. One of the reasons to use hash partitioning is to get evenly sized partitions, which is why we recommend that you do not use hash partitioning when low cardinality is a factor.

In general, the hash-partitioning key should be unique or nearly unique. Since a table may contain several near-unique keys, and thus several candidates for hash partitioning, the other consideration for choosing the partitioning key is the access profile. Hash partitioning can improve the performance of single-key lookups.

For example:
SELECT * FROM emp WHERE empno = 123;  -- single-key lookup
SELECT * FROM emp WHERE sal > 5000;   -- range lookup

If you create a local index for this table, the database constructs the index so that it is equipartitioned with the underlying table.

The database also ensures that the index is maintained automatically when maintenance operations are performed on the underlying table. The following is an example of creating a local index on the table dept:
CREATE INDEX loc_dept_ix ON dept(deptno) LOCAL;

You can optionally name the hash partitions and tablespaces into which the local index partitions are to be stored, but if you do not do so, the database uses the name of the corresponding base partition as the index partition name, and stores the index partition in the same tablespace as the table partition.
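
For example, here is a sketch of the explicit form; the partition names IP1 through IP8 and the tablespaces IDX_TS_1 and IDX_TS_2 are assumptions, and this is an alternative to the simple statement above (the same column list cannot be indexed twice):

CREATE INDEX loc_dept_ix ON dept(deptno) LOCAL
( PARTITION ip1 TABLESPACE idx_ts_1,
  PARTITION ip2 TABLESPACE idx_ts_2,
  PARTITION ip3 TABLESPACE idx_ts_1,
  PARTITION ip4 TABLESPACE idx_ts_2,
  PARTITION ip5 TABLESPACE idx_ts_1,
  PARTITION ip6 TABLESPACE idx_ts_2,
  PARTITION ip7 TABLESPACE idx_ts_1,
  PARTITION ip8 TABLESPACE idx_ts_2);

Because dept has eight hash partitions, exactly eight index partitions must be described when this explicit form is used.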

Hash partitioned global indexes can improve the performance of indexes where a small number of leaf blocks in the index have high contention in multiuser OLTP
environments. Queries involving the equality and IN predicates on the index partitioning key can efficiently use hash-partitioned global indexes.

The syntax for creating a hash partitioned global index is similar to that used
for a hash partitioned table. For example, the following statement creates a
hash-partitioned global index:
CREATE INDEX hgidx ON tab (c1,c2,c3) GLOBAL
PARTITION BY HASH (c1,c2)
(PARTITION p1 TABLESPACE tbs_1,
PARTITION p2 TABLESPACE tbs_2,
PARTITION p3 TABLESPACE tbs_3,
PARTITION p4 TABLESPACE tbs_4);

Sunday, November 29, 2009

Content management databases

Content management databases: there are several good content management databases on the market, used widely across companies such as Yahoo, Facebook, and Amazon.

Thrudb is a set of simple services built on top of the Apache Thrift framework (originally developed at Facebook) that provides indexing and document storage services for building and scaling websites. Its purpose is to offer web developers flexible, fast, and easy-to-use services that can enhance or replace traditional data storage and access layers.

Thrudb - High Level Features

Thrudb - Services


StrokeDB
A minimalistic, modular database engine.
 
StrokeDB stores a set of versioned documents identified by UUID.
A document is a hash-like container of slots: a flat set of values tagged with string keys.
Slots store arbitrary serializable values (the most common types are booleans, numbers, strings, arrays, and times).
 
The concept.

1. Every repository is identified by UUID.
2. Every repository writes a log of commits.
3. Commit tuples:
   (timestamp, "store", uuid, version)
   (timestamp, "pull",  repo_uuid, repo_timestamp)
4. When you pull from a repository:
   1. Find out the latest timestamp in your history (index: repo_uuid -> repo_timestamp).
   2. If there is no timestamp yet, pull the whole log.
   3. If there is a timestamp for a repository UUID, pull the tail of the log.
   4. For each "store" record: fetch the version.
   5. For each "pull" record: add it to a deferred list of repositories waiting for update.
   6. When the whole log is fetched, fetch the deferred repositories. There are two options here:
      1. Fetch from the same repository we've been fetching from a few moments ago (say, fetch B's log from A)
      2. Or fetch directly from the desired repository (B's log from B's repository)

 MongoDB

MongoDB is an open source document-oriented database written in the C++ programming language.

MongoDB is designed for problems without heavy transactional requirements that aren't easily solved by traditional RDBMSs, including problems which require the database to span many servers.

CouchDB

  • A document database server, accessible via a RESTful JSON API.
  • Ad-hoc and schema-free with a flat address space.
  • Distributed, featuring robust, incremental replication with bi-directional conflict detection and management.
  • Query-able and index-able, featuring a table-oriented reporting engine that uses JavaScript as a query language.

Sunday, November 8, 2009

Oracle flash drive recommendations



EFD-friendly DB workloads:

Random reads
- B-Tree leaf access
- ROWID lookup into a table
- Access to out-of-line LOBs
- Access to overflowed rows
- Index scan over an unclustered table

Compression: increases I/O intensity (IOPS/GB)

Serial reads, random writes
- Row update by primary key
- Index maintenance
- Reduced checkpoint interval

Lower latency: get in, get out

Not as cost-effective on EFD:

TEMP: sort areas and intermediate tables
- Sequentially read and written, but I/O is done in 1 MB units: not enough to amortize seeks

Redo log files
- Sequentially read and written, and commit latency is already handled by the cache in the storage controller

Undo tablespace
- Sequentially written, randomly read by Flashback, but the reads are for recently written data that is likely to still be in the buffer cache

Large table scans, buffer pools with lots of writes
- The mismatch between the read and write latency characteristics of EFDs can cause unwanted "free buffer waits"; buffer pool tuning is necessary after deploying EFDs

Reference: Oracle OpenWorld 2008 presentation



Redo Logs on EFDs? (or not)

It is a common misconception that Oracle online redo logs will benefit from being moved onto EFDs; the experimental data indicates the opposite. Testing has shown that moving redo logs onto EFDs yields only a small improvement. It is better to leave them on write-cache-backed Fibre Channel drives and use the EFDs for other read-intensive parts of the database, such as indexes or data.
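
For example, here is a hedged sketch of redirecting EFD capacity to read-intensive segments instead; EFD_DATA is a hypothetical tablespace created on EFD-backed storage, and EMP, EMP_NAME_IX, and EMP_PK are placeholder objects:

-- Rebuild a hot index into the EFD-backed tablespace.
ALTER INDEX emp_name_ix REBUILD TABLESPACE efd_data;

-- Move a read-intensive (non-partitioned) table, then rebuild its indexes,
-- because moving a table invalidates them.
ALTER TABLE emp MOVE TABLESPACE efd_data;
ALTER INDEX emp_pk REBUILD;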

Oracle TEMP tablespace on EFDs

Oracle uses this space mainly for data aggregations and sorting. When the database engine cannot fit a sort in memory, it spills to disk to store intermediate results. Oracle typically issues large sequential I/Os against these tablespaces in the context of a single user; when multiple users perform concurrent sorts, the I/O becomes largely random in nature. Even though EFDs do not provide as much benefit for large random I/O as they do for small random operations, they are still far ahead of what regular rotating Fibre Channel drives can deliver. Depending on the availability of space on EFDs, Oracle applications will benefit from moving the temporary tablespaces to them. Temporary tablespace files should only be moved to EFDs after all the I/O-intensive parts have been moved to them.
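
As a minimal sketch, assuming a hypothetical EFD-backed mount point and tablespace name (TEMP_EFD and the file path are not from any particular system), the move could look like this:

CREATE TEMPORARY TABLESPACE temp_efd
  TEMPFILE '/efd01/oradata/temp_efd_01.dbf' SIZE 32G;

-- New sessions then spill their sorts to the EFD-backed temporary space.
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp_efd;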

Database workloads that are the best fit for EFDs

There are no simple, definitive rules that readily identify the applications best suited to EFDs, but we can follow some guidelines. It is very important to understand the load profile of an application before putting it on EFDs, taking into consideration that most databases have different workload profiles at different times of the day. EFDs are suitable for highly read-intensive and extremely latency-sensitive applications, and using these drives against the wrong target may not yield the desired benefit for the investment. It is important to understand the following terminology when deciding whether EFDs are suitable for certain workloads.

Write cache: Most storage systems have a large write-side cache, and write IOPS from a host are generally written to cache and incur no delay due to physical disk access. Storage arrays have write caches sized to match the disk count supported by the controller and support enabling and disabling the write cache at the LUN level, if needed.

Read hit: A read request from a database host can be served by storage system immediately if it already exists in storage cache because of a recent read or write or due to prefetch. A read serviced from the storage cache without causing disk access is called a read hit. If the requested data is not available in storage cache, the array must retrieve it from disk; this is referred to as a read miss.

Short-stroked drives: Some extremely latency sensitive applications use this technique on regular Fibre Channel drives to obtain low latencies. This is a technique where data is laid out on many partially populated disks in order to reduce the spindle head movement to provide high IOPS at a very low latency.

Workloads with high cache read-hit rates are already serviced at memory access speed, and deploying them on flash drive technology may not show a significant benefit. Workloads with low cache read-hit rates that exhibit random I/O patterns, with small I/O requests of up to 16 KB, and that require high transaction throughput will benefit most from the low latency of EFDs.
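
As a rough starting point for that profiling, the sketch below (it assumes access to the V$FILESTAT and V$DATAFILE views and is only an approximation of a real workload study) ranks datafiles by single-block reads, the small random read pattern described above:

SELECT d.name,
       f.phyrds,
       f.singleblkrds,
       ROUND(f.singleblkrds / NULLIF(f.phyrds, 0) * 100, 1) AS single_block_pct
FROM   v$filestat f
       JOIN v$datafile d ON d.file# = f.file#
ORDER  BY f.singleblkrds DESC;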

Database and application managers can easily point to mission-critical applications that directly improve business revenue and productivity when business transaction throughput is increased, along with reduced service latencies. Cognizant of these applications, storage administrators would often resort to “short stroking” more drives to ensure the highest possible I/O service level supported for them. EFDs can provide two very important benefits for such applications.

A single EFD can replace many short-stroked drives by its ability to provide a very high transaction rate (IOPS). This reduces the total number of drives needed for the application, increases power saving by not having to keep many spinning disks, and may reduce floor space in the data center as well.

EFDs provide very low latency, so applications where predictable low response time is critical and not all the data can be kept at the host or storage cache may benefit from using such drives. Because of the absence of rotating media in EFDs, their transfer rate is extremely high and data can be served much faster than the best response time that can be achieved even with a large number of short-stroked hard drives.

Enterprise Flash Drives or Solid State Drives

Enterprise flash drives are designed to dramatically increase the performance of latency sensitive applications. Enterprise flash drives, also known as solid state drives (SSD), contain no moving parts and appear as standard drives to existing storage management tools, allowing administrators to manage Tier 0 without special processes or custom tools or extra training. Tier 0 EFDs are ideally suited for applications with high transaction rates and those requiring the fastest possible retrieval and storage of data, such as currency exchange and electronic trading systems, or real-time data acquisition and processing. They also can prove to be extremely good for highly read-intensive workloads like search engine databases.

The EFDs are designed to deliver millisecond application response times and up to 30 times more I/O operations per second (IOPS) than traditional Fibre Channel hard disk drives. Additionally, EFDs consume significantly less energy per IOPS than traditional hard disk drives, providing the opportunity for a significantly reduced TCO through smaller data center energy and space footprints. Database performance has long been constrained by the I/O capability of hard disk drives (HDD), and the performance of the HDD has been limited by the intrinsic mechanical delays of head seek and rotational latency. EFDs, however, have no moving parts and therefore no seek or rotational latency delays, which dramatically improves their ability to sustain a very high number of IOPS with very low overall response times.

Over the past 25 years, the rotational speeds of HDDs have improved from 3,600 rpm to 15,000 rpm, yielding only a fourfold improvement in IOPS, while other computer technologies, such as CPU speeds, saw double-digit growth. EFD technology represents a significant leap in performance and may sustain up to 30 times the IOPS of traditional HDD technology. Proper use of EFDs can deliver vastly increased performance to the database application compared to traditional Fibre Channel drives, both in transactions per minute and in transaction response time.