Big data? Big database technologies

By Guy Harrison*
Friday, 05 August, 2011


As organisations continue to hoard massive volumes of data for analysis - producing so-called ‘big data’ - database management technology must evolve to keep up with the challenge. Guy Harrison, head of research and development at Quest Software, explains the technologies that can help cope with these massive data volumes.

Telephony, and particularly mobile telephony, has always pushed the boundaries of database management technology. The sheer volume of mobile voice and mobile data interactions creates multiple headaches for the traditional database management system (DBMS).

For one thing, the rate of communication often overwhelms the transaction processing capabilities of relational databases. Committing a transaction requires the database to write that transaction to disk, and when the transaction rate becomes extreme these disk writes become a bottleneck.

Long-term storage of mobile transactions presents another challenge. The larger the amount of data retained in the system, the longer it takes to perform the analytic queries needed for decision making and business intelligence. For this reason, historical data beyond a few days is typically stored in aggregate form.

New database technologies have been emerging over the past few years to address these and other issues.

In-memory databases

Telecom applications have historically been one of the major users of in-memory databases (IMDBs), in which transactions are committed not to disk but to random access memory (RAM) across multiple computers. When a transaction is committed, the change is preserved not by writing it to disk but by replicating it across the cluster. Should any member of the cluster fail, the change can be recovered from another member.
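To make the idea concrete, here is a minimal Python sketch of the commit-by-replication approach described above. It is purely illustrative - no particular product works exactly this way - and the ReplicaNode and InMemoryDB classes are hypothetical names invented for the example.

    class ReplicaNode:
        """Stands in for another cluster member holding an in-memory copy."""
        def __init__(self):
            self.store = {}

        def apply(self, key, value):
            self.store[key] = value   # the change is preserved in the peer's RAM

    class InMemoryDB:
        def __init__(self, replicas):
            self.store = {}           # primary copy lives in RAM, not on disk
            self.replicas = replicas

        def commit(self, key, value):
            self.store[key] = value
            # Durability comes from replication rather than a disk write: if this
            # node fails, the change can be recovered from any surviving replica.
            for replica in self.replicas:
                replica.apply(key, value)

    peers = [ReplicaNode(), ReplicaNode()]
    db = InMemoryDB(peers)
    db.commit("call:0412-555-000", {"duration_sec": 95, "cell": "SYD-041"})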

While telecom applications pioneered its use, IMDB is rapidly becoming more mainstream, with both Oracle and IBM integrating recent IMDB acquisitions (TimesTen and solidDB respectively) into their relational database management system (RDBMS) offerings. There is also significant innovation in the IMDB space from vendors such as VoltDB, whose flagship IMDB implements a floor-to-ceiling remodelling of the traditional database architecture to fully realise the benefits of in-memory operation.

Hadoop and MapReduce

While the IMDB has been a common component of mobile telephony application architectures for some years now, new ways of dealing with huge amounts of static historical data are only just emerging.

Many applications generate masses of unstructured data - web logs and the like - containing information that can potentially create great competitive advantage. Predictive analytics, churn forecasting, social network analysis, fraud detection and many other business-critical functions can be tackled by processing this unstructured data. However, until recently there have been few practical options for processing this ‘big data’ other than to load it into a DBMS, often at great expense. This expense includes both the cost of data warehouse hardware and software, and the consulting and project costs involved in the extract, transform and load (ETL) of the data.

Web 2.0 companies, and Google in particular, developed new approaches to managing these massive unstructured datasets. Google’s data - effectively every page on the World Wide Web - has always been far too large to load into a single DBMS. Instead, Google uses large clusters of commodity hardware, each node of which handles a small subset of web pages. MapReduce is Google’s massively parallel algorithm for distributing work across these clusters, and it is used to build the index that lets Google web searches resolve so quickly.
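The following toy Python sketch shows the map, shuffle and reduce steps on a tiny word-count problem - the classic MapReduce illustration. It runs in a single process, and the sample ‘pages’ are invented; real MapReduce distributes exactly these steps across many machines.

    from collections import defaultdict

    pages = {
        "page1": "big data needs big database technologies",
        "page2": "database technologies keep evolving",
    }

    def map_phase(doc_id, text):
        # Map: emit a (key, value) pair for every word in the document.
        for word in text.split():
            yield (word, 1)

    # Shuffle: group all values emitted for the same key together.
    groups = defaultdict(list)
    for doc_id, text in pages.items():
        for word, count in map_phase(doc_id, text):
            groups[word].append(count)

    def reduce_phase(word, counts):
        # Reduce: collapse each key's values into a single result.
        return word, sum(counts)

    word_counts = dict(reduce_phase(w, c) for w, c in groups.items())
    print(word_counts)   # e.g. {'big': 2, 'data': 1, 'database': 2, ...}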

Hadoop, an open source Apache project, implements MapReduce along with other key components of the Google architecture. Using a Hadoop cluster, it’s possible to store and process massive amounts of highly granular raw data without loading it into a DBMS. Not only is the ETL process avoided, but the cost of storage is dramatically lower: keeping data in a traditional database appliance can cost as much as 100 times more per gigabyte than keeping it in a commodity Hadoop cluster.

The Hadoop ecosystem provides tools for ad hoc query and analysis. ‘Pig’ is a scripting language that lets queries and data flows be written without the large amounts of boilerplate Java code that raw MapReduce requires. ‘Hive’ provides access to Hadoop data via HQL, an extended subset of SQL. An increasing number of business intelligence and query tools are adding Hive support, allowing Hadoop data to be fully integrated with an enterprise’s other business data.
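As a rough illustration, the snippet below shows what an HQL query might look like when issued from Python. It assumes the third-party PyHive client library and a reachable Hive server; the raw_call_logs table, its columns and the server hostname are hypothetical, invented for the example.

    from pyhive import hive

    # Connect to a (hypothetical) Hive server on its default Thrift port.
    conn = hive.Connection(host="hive-server.example.com", port=10000)
    cursor = conn.cursor()

    # HQL looks like SQL, but executes as MapReduce jobs over files in Hadoop.
    cursor.execute("""
        SELECT   caller_region,
                 COUNT(*)          AS calls,
                 SUM(duration_sec) AS total_seconds
        FROM     raw_call_logs
        WHERE    call_date >= '2011-07-01'
        GROUP BY caller_region
    """)

    for region, calls, total_seconds in cursor.fetchall():
        print(region, calls, total_seconds)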

The NoSQL movement

The relational database model has dominated database design for a generation - a massive achievement in a world of IT where paradigm shifts are the rule, not the exception. However, as the example of Hadoop illustrates, the relational model is not suitable for all scenarios.

In particular, the RDBMS transactional and consistency model breaks down when databases are widely distributed across data centres. The relational model requires that everybody see an update the instant it is applied; when the database is distributed across a large number of servers, this rapidly becomes impossible. Indeed, in 2000, Eric Brewer outlined the now-famous CAP theorem, which states that consistency and high availability cannot both be maintained when a database is partitioned across a fallible wide area network.

Large-scale Web 2.0 sites - Facebook, Twitter and so on - as well as elastic cloud computing services, found that relational databases simply could not scale economically across large clusters of computers. As a result, a variety of non-relational databases emerged, and eventually the umbrella term ‘NoSQL’ became synonymous with these new technologies.

Within the NoSQL zoo, there are several distinct family trees. Some NoSQL databases are pure key-value stores without an explicit data model; many of these are based on Amazon’s Dynamo key-value store. Some are heavily influenced by Google’s BigTable database, which supports Google products such as Google Maps and Google Reader. Document databases store highly structured, self-describing objects, usually in JavaScript Object Notation (JSON). Finally, graph databases store complex relationships such as those found in social networks.
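The differences between these families are easiest to see with data. The sketch below uses plain Python structures - no particular product’s API - to show how the same hypothetical mobile subscriber might be represented in each of the four models.

    # 1. Key-value store: an opaque value looked up by key.
    kv_store = {"subscriber:61412555000": b"<serialised blob>"}

    # 2. BigTable-style wide-column row: column families holding many columns.
    wide_row = {
        "row_key": "61412555000",
        "profile": {"name": "A. Subscriber", "plan": "prepaid"},
        "usage":   {"2011-07": "412MB", "2011-08": "388MB"},
    }

    # 3. Document store: a self-describing JSON-style document.
    document = {
        "_id": "61412555000",
        "name": "A. Subscriber",
        "services": [{"type": "voice"}, {"type": "data", "cap_mb": 500}],
    }

    # 4. Graph store: nodes and the relationships between them.
    graph_edges = [
        ("61412555000", "CALLS",   "61498000111"),
        ("61412555000", "FRIENDS", "61477000222"),
    ]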

Within these four NoSQL families there are at least a dozen database systems of significance. Riak (key-value), HBase (BigTable-style), MongoDB (document) and Neo4j (graph) are strong representatives of their respective categories, while Cassandra blends ideas from both Dynamo and BigTable.

NoSQL is a fairly imprecise term - it defines what these databases are not rather than what they are, and it singles out SQL when the more relevant objection is to the strict consistency requirements of the relational model. However imprecise the term may be, there’s no doubt that NoSQL databases represent an important direction in database technology.

Hardware innovation

It’s not just database software that is rapidly evolving; the hardware underlying database systems is transforming as well.

For almost the entire history of the modern RDBMS, persistent storage has been provided by spinning magnetic disks. The relative performance of magnetic disk has been so poor compared with the rest of the hardware stack supporting the RDBMS that database performance tuning has focused primarily on minimising disk IO. IMDBs represent an extreme attempt to avoid disk IO altogether.

We are now entering an era in which solid state disks (SSDs) represent a real alternative to traditional magnetic media. SSDs can provide performance hundreds of times better than that of traditional disk. Unfortunately, the economics of SSD storage are far less attractive, with SSDs costing as much as $50 per gigabyte. The best outcome will therefore usually be achieved by implementing a mix of SSDs and traditional disks.

Storage tiering solutions are emerging that allow data to be transparently moved to the most cost-effective medium based on access patterns, providing the best of both worlds.
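A back-of-the-envelope calculation shows why tiering pays off. All of the figures below are illustrative assumptions: the $50 per GB SSD price comes from the article, while the disk price and the share of ‘hot’ data are invented for the example.

    TOTAL_GB         = 10_000   # total database size
    HOT_FRACTION     = 0.05     # share of data that is frequently accessed
    SSD_COST_PER_GB  = 50.0     # the article's circa-2011 SSD price
    DISK_COST_PER_GB = 0.50     # assumed commodity magnetic disk price

    all_ssd  = TOTAL_GB * SSD_COST_PER_GB
    all_disk = TOTAL_GB * DISK_COST_PER_GB
    tiered   = (TOTAL_GB * HOT_FRACTION * SSD_COST_PER_GB +
                TOTAL_GB * (1 - HOT_FRACTION) * DISK_COST_PER_GB)

    print(f"all SSD:  ${all_ssd:,.0f}")    # hot and cold data both on flash
    print(f"all disk: ${all_disk:,.0f}")   # cheap, but slow for hot data
    print(f"tiered:   ${tiered:,.0f}")     # hot data on SSD, the rest on disk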

Last words

Five years ago it seemed that database technology had reached a plateau, with relational databases dominating all sectors and few signs of new paradigms emerging. Today the situation could not be more different - the rate of innovation in database technology has never been greater.

*Guy Harrison is a Director of Research and Development at Quest Software, with more than 20 years’ experience in database design, development, administration and optimisation. Guy is the author of numerous books, articles and presentations on database technology, and currently leads development of Quest’s Hadoop and NoSQL tooling initiatives.
