Thursday, December 25, 2014

NoSql basics and more

What is NoSQL?

NoSQL encompasses a wide variety of different database technologies that were developed in response to a rise in the volume of data stored about users, objects and products, the frequency in which this data is accessed, and performance and processing needs. Relational databases, on the other hand, were not designed to cope with the scale and agility challenges that face modern applications, nor were they built to take advantage of the cheap storage and processing power available today.

NoSQL Database Types

Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.
Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.

The Benefits of NoSQL

When compared to relational databases, NoSQL databases are more scalable and provide superior performance, and their data model addresses several issues that the relational model is not designed to address:

Large volumes of structured, semi-structured, and unstructured data
Agile sprints, quick iteration, and frequent code pushes
Object-oriented programming that is easy to use and flexible
Efficient, scale-out architecture instead of expensive, monolithic architecture

Dynamic Schemas

Relational databases require that schemas be defined before you can add data. For example, you might want to store data about your customers such as phone numbers, first and last name, address, city and state – a SQL database needs to know what you are storing in advance.

This fits poorly with agile development approaches, because each time you complete new features, the schema of your database often needs to change. So if you decide, a few iterations into development, that you'd like to store customers' favorite items in addition to their addresses and phone numbers, you'll need to add that column to the database, and then migrate the entire database to the new schema.

If the database is large, this is a very slow process that involves significant downtime. If you are frequently changing the data your application stores – because you are iterating rapidly – this downtime may also be frequent. There's also no way, using a relational database, to effectively address data that's completely unstructured or unknown in advance.

NoSQL databases are built to allow the insertion of data without a predefined schema. That makes it easy to make significant application changes in real-time, without worrying about service interruptions – which means development is faster, code integration is more reliable, and less database administrator time is needed.

Auto-sharding

Because of the way they are structured, relational databases usually scale vertically – a single server has to host the entire database to ensure reliability and continuous availability of data. This gets expensive quickly, places limits on scale, and creates a relatively small number of failure points for database infrastructure. The solution is to scale horizontally, by adding servers instead of concentrating more capacity in a single server.

"Sharding" a database across many server instances can be achieved with SQL databases, but usually is accomplished through SANs and other complex arrangements for making hardware act as a single server. Because the database does not provide this ability natively, development teams take on the work of deploying multiple relational databases across a number of machines. Data is stored in each database instance autonomously. Application code is developed to distribute the data, distribute queries, and aggregate the results of data across all of the database instances. Additional code must be developed to handle resource failures, to perform joins across the different databases, for data rebalancing, replication, and other requirements. Furthermore, many benefits of the relational database, such as transactional integrity, are compromised or eliminated when employing manual sharding.

NoSQL databases, on the other hand, usually support auto-sharding, meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool. Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption.

Cloud computing makes this significantly easier, with providers such as Amazon Web Services providing virtually unlimited capacity on demand, and taking care of all the necessary database administration tasks. Developers no longer need to construct complex, expensive platforms to support their applications, and can concentrate on writing application code. Commodity servers can provide the same processing and storage capabilities as a single high-end server for a fraction of the price.

Replication

Most NoSQL databases also support automatic replication, meaning that you get high availability and disaster recovery without involving separate applications to manage these tasks. The storage environment is essentially virtualized from the developer's perspective.

Integrated Caching

A number of products provide a caching tier for SQL database systems. These systems can improve read performance substantially, but they do not improve write performance, and they add complexity to system deployments. If your application is dominated by reads then a distributed cache should probably be considered, but if your application is dominated by writes or if you have a relatively even mix of reads and writes, then a distributed cache may not improve the overall experience of your end users.

Many NoSQL database technologies have excellent integrated caching capabilities, keeping frequently-used data in system memory as much as possible and removing the need for a separate caching layer that must be maintained.

NoSQL vs. SQL Summary

	SQL DATABASES	NOSQL DATABASES
Types	One type (SQL database) with minor variations	Many different types including key-value stores,document databases, wide-column stores, and graph databases
Development History	Developed in 1970s to deal with first wave of data storage applications	Developed in 2000s to deal with limitations of SQL databases, particularly concerning scale, replication and unstructured data storage
Examples	MySQL, Postgres, Oracle Database	MongoDB, Cassandra, HBase, Neo4j
Data Storage Model	Individual records (e.g., "employees") are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., "manager," "date hired," etc.), much like a spreadsheet. Separate data types are stored in separate tables, and then joined together when more complex queries are executed. For example, "offices" might be stored in one table, and "employees" in another. When a user wants to find the work address of an employee, the database engine joins the "employee" and "office" tables together to get all the information necessary.	Varies based on database type. For example, key-value stores function similarly to SQL databases, but have only two columns ("key" and "value"), with more complex information sometimes stored within the "value" columns. Document databases do away with the table-and-row model altogether, storing all relevant data together in single "document" in JSON, XML, or another format, which can nest values hierarchically.
Schemas	Structure and data types are fixed in advance. To store information about a new data item, the entire database must be altered, during which time the database must be taken offline.	Typically dynamic. Records can add new information on the fly, and unlike SQL table rows, dissimilar data can be stored together as necessary. For some databases (e.g., wide-column stores), it is somewhat more challenging to add new fields dynamically.
Scaling	Vertically, meaning a single server must be made increasingly powerful in order to deal with increased demand. It is possible to spread SQL databases over many servers, but significant additional engineering is generally required.	Horizontally, meaning that to add capacity, a database administrator can simply add more commodity servers or cloud instances. The database automatically spreads data across servers as necessary
Development Model	Mix of open-source (e.g., Postgres, MySQL) and closed source (e.g., Oracle Database)	Open-source
Supports Transactions	Yes, updates can be configured to complete entirely or not at all	In certain circumstances and at certain levels (e.g., document level vs. database level)
Data Manipulation	Specific language using Select, Insert, and Update statements, e.g. SELECT fields FROM table WHERE…	Through object-oriented APIs
Consistency	Can be configured for strong consistency	Depends on product. Some provide strong consistency (e.g., MongoDB) whereas others offer eventual consistency (e.g., Cassandra)

Implementing a NoSQL Database in Your Organization

Often, organizations will begin with a small-scale trial of a NoSQL database in their organization, which makes it possible to develop an understanding of the technology in a low-stakes way. Most NoSQL databases are also open-source, meaning that they can be downloaded, implemented and scaled at little cost. Because development cycles are faster, organizations can also innovate more quickly and deliver superior customer experience at a lower cost.

Understanding NoSQL

NoSQL refers to a database that is not based on SQL (Structured Query Language), which is the language most commonly associated with relational databases. Essentially, NoSQL data isn't relational, NoSQL databases usually do not have schema, and they come with looser consistency models than traditional relational databases do.

The term "NoSQL" refers to the fact that traditional relational databases are not adequate for all solutions, particularly ones involving large volumes of data. But the term has been extended to also mean "Not only SQL", indicating support for potential SQL-based interfaces even if the core database isn't relational. Software developers that use NoSQL solutions don't necessarily advocate dismissing relational databases, but instead see value in using the right data store for the job.

Use of NoSQL

NoSQL data stores respond to key data needs that are not met by relational databases.

Caching solutions

Caching results is a common solution to improving application response. For example, a website may be feeding the same answers to hundreds of thousands of users. Instead of taxing a relational database to recompute the same thing, custom caching solutions can be implemented manually. Several NoSQL data stores provide a similar solution, but the developer does not have to maintain a custom cache.

Key-value stores

Some NoSQL databases store key-value pairs to allow quick lookups, for example, in the case of question/answer access. Relational databases are more oriented towards storing complex data and various relationships between different types of data. This methodology is overcomplicated when a developer wants to implement a quick way to store and access Q&A data.

Document stores

Other types of data are more document oriented and have variations. For example, forms of data can have many optional fields. Relational databases with their rigid schemas require all othese fields to be defined for every row of stored data. Document-based NoSQL stores are more flexible and efficient in handling this.

Faster access to bigger sets of data

Relational databases sacrifice performance when searching large volumes of data. Historically, developers have built systems in which writing SQL queries to find a few rows of data involved thinning out data sets in the most efficient way. The bigger the result set, though, the more expensive the queries become. Large volumes of data or queries that involve aggregating large amounts of data are referred to as "data warehousing."

NoSQL data stores are becoming widely adopted and are being tested in many situations. These situations involve large volumes of data as well as large rates of data growth in many consumer systems.

Less rigid consistency requirements

NoSQL is also considered an alternative to traditional relational databases because certain consistency requirements that are inherently part of relational databases are very different in modern enterprises.

Developers are discovering that certain data requirements don't demand the rigid ACID modelof relational databases that usually comes with worse performance. Instead they can meet their needs using eventual consistency that tends to come with better performance. Some NoSQL data stores even allow the developer to pick how loose or rigid the consistency must be.

Limitations of NoSQL

SQL is a powerful, 40-year-old standard that has been possible because all relational databases have the same concept of storing data in tables and relating data through foreign keys. Although switching from one relational database to another isn't 100% transparent, it is much easier than switching between two different NoSQL data stores. Developers that have learned SQL have little challenge switching between vendors.

Because each NoSQL data store has unique aspects in both how its data is stored as well as how different bits of data relate to each other, no single API manages them all. When embracing a new NoSQL data store, the developer must invest time and effort to learn the new query language as well as the consistency semantics.

NoSQL database examples

Dozens of NoSQL data stores are available; the following are among the most popular:

MongoDB. Open-source document database.
CouchDB. Database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API.
GemFire. Distributed data management platform providing dynamic scalability, high performance, and database-like persistence.
Redis. Data structure server wherein keys can contain strings, hashes, lists, sets, and sorted sets.
Cassandra. Database that provides scalability and high availability without compromising performance.
memcached. Open source high-performance, distributed-memory, and object-caching system.
Hazelcast. Open source highly scalable data distribution platform.
HBase. Hadoop database, a distributed and scalable big data store.
Mnesia. Distributed database management system that exhibits soft real-time properties.
Neo4j. Open source high-performance, enterprise-grade graph database.

Monday, December 15, 2014

Teradata 14.0 introduces Teradata Columnar

Teradata 14.0 introduces Teradata Columnar – a new option to organize the data of a user-defined table or join index on disk.

Teradata Columnar offers the ability to partition a table or join index by column. It introduces column-storage as an alternative choice to row-storage for a column partition and autocompression. Column partitioning can be used alone in a single-level partitioning definition or with row partitioning in a multilevel partitioning definition.

Teradata Columnar is a new paradigm for partitioning, storing data, and compression that changes the cost-benefit tradeoffs of the available physical database design choices and their combinations. Teradata Columnar provides a benefit to the user by reducing I/O for certain classes of queries while at the same time decreasing space usage.

A column-partitioned (CP) table or join index has several key characteristics:

It does not have a primary index (a future blog entry will discuss why).
Each of its column partitions can be composed of a single column or multiple columns.
Each column partition usually contains multiple physical rows. Physical rows are the disk-based structures that the Teradata file system uses to store data in order based on the rowid associated with each physical row. The first part of a rowid indicates the partition number so that physical rows are ordered by partition number first and within partition by the rest of the rowid.
A new physical row format COLUMN may be utilized for a column partition; such a physical row is called a container. This is used to implement column-storage, row header compression, and autocompression for a column partition. This provides a compact way to store a series of column partition values.
Alternatively, a column partition may have physical rows with ROW format that are used to implement row-storage; such a physical row is called a subrow. Each column partition value is in its own physical row. Usually a subrow is wide (multicolumn, large character strings, etc.) where the row header overhead for each column partition value is insignificant and having each column partition value in its own physical row provides more direct access to the value.
A CP table is just another type of table that can be accessed by a query. A single query can access multiple kinds of tables.

PPI stands for partitioned primary index which means the table has a primary index and the rows are partitioned on the AMPs (and within a partition, the rows are ordered by a hash of the primary index columns).

A CP table is not a PPI table since a CP table doesn't have a primary index. But a CP table can have RANGE_N and CASE_N row partitioning (the kind of partitioning of rows that is used in PPI) but since there is no primary index, the rows within a row partition are not ordered by a hash of some columns of each row -- they are just in insert order.

A CP table could have a join index on it where the join index does have primary index (but not column partitioning).

A PI or PPI table could have a join index on it where the join index has column partitioning (but not a primary index) plus optionally 1 or more levels of row partitioning.

A NoPI table can't have row partitioning unless it also has column partitioning.

For example:

CREATE TABLE SALES (

    TxnNo     INTEGER,

    TxnDate   DATE,

    ItemNo    INTEGER,

    Quantity  INTEGER )

  PARTITION BY COLUMN,

  UNIQUE INDEX (TxnNo);

This creates a column-partitioned (CP) table that partitions the data of the table vertically. Each column is in its own column partition that is stored using column-storage with row header compression and autocompression. All the data for TxnNo comes first, followed by the data for TxnDate, followed by the data for ItemNo, and then the data for Quantity. Note that a primary index is not specified so this is NoPI table. Moreover, a primary index must not be specified if the table is column partitioned.

The following adds a level of row partitioning (so the table has multilevel partitioning). All the data for TxnNo for the first day comes first, followed by the next day of data forTxnNo, etc. then all the data for TxnDate for the first day, the second day, etc, ending with the last day of data for Quantity.

CREATE TABLE SALES (

    TxnNo     INTEGER,

    TxnDate   DATE,

    ItemNo    INTEGER,

    Quantity  INTEGER )

  PARTITION BY (

      COLUMN,

      RANGE_N(TxnDate BETWEEN

          DATE '2011-01-01' AND DATE '2011-12-31' EACH INTERVAL '1' DAY) ),

  UNIQUE INDEX (TxnNo);

Data Warehouse and ETL Knowledgebase