Featured Resource

Using SQL-MapReduce for Advanced Analytical Queries (excerpt)

By Rick F. van der Lans, R20 Consultancy

 

Summary

This whitepaper describes the advantages of merging SQL with MapReduce to create a data warehouse platform that can support today's complex and data-intensive query workload called advanced analytics. You can download the full whitepaper here.

From the early days of decision support systems to the present world of data warehousing, business intelligence, and analytics, the need for analytical capabilities has increased. In the beginning, users were satisfied with relatively straightforward reports, but today they demand advanced online analytical capabilities, such as forecasting, predictive modeling, and clustering.

In addition, data latency has become a more critical issue. It used to be acceptable for users to access week- and sometimes month-old data, compared to today, when users want to analyze data no older than a few seconds. In the past, we could keep our data warehouse relatively small by storing only aggregated data. However today decision makers want to access everything from top-level aggregations to the most granular, detailed data. Users are also demanding more ad hoc, interactive, and exploratory reporting and analytical capabilities. This will unmistakably increase the frequency of analysis. In short, analytical needs have changed and are still changing, and this always means more complex analytical queries on more data. This new form of workload is called advanced analytics.

More complex queries and more data require more powerful data warehouse platforms. Through the years, various database vendors have researched, experimented with, and developed different solutions. Some have implemented data warehouse appliances built on dedicated and proprietary hardware, others have opted for developing in-memory solutions, some have focused on exploiting the processor cache, and still others have switched from storing data in a record-oriented style to a column-oriented style. As expected, some have even tried to combine a few of those solutions.

SQL-MapReduce (SQL-MR) is a solution specifically designed for advanced analytical queries and it's already been implemented by a few database vendors. It's the result of combining the most popular database language, SQL, with a programming model created by Google, MapReduce. The goal of MapReduce is to distribute as much of the processing over as many processors as possible. This whitepaper describes the SQL-MR implementation offered by nCluster, Aster Data's analytical database server.

On the outside, nCluster looks like any other SQL database server. It supports standard SQL and all the common APIs, such as ODBC and JDBC, so it can be accessed by all the popular analytical and reporting tools. What makes nCluster special is the inside. The product was specifically designed for analytics and reporting. nCluster's unique Applications-Within™ architecture runs analytic application logic inside the database, leveraging nCluster's massively parallel architecture and SQL-MapReduce to fully parallelize processing of advanced analytical queries. The first version of nCluster was released in 2006, and the first production deployment was in 2007. The current version 4.5 was released in the first quarter of 2010.

Some of the characteristics of SQL-MR:

Bullet   MapReduce is implemented as a set of SQL table functions. These functions,   although they might be highly sophisticated internally, look like the table functions   SQL already supports.
Bullet   Report developers don't have to learn a new language or a new set of statements.   Instead they only have to study the parameters of the MapReduce functions.
Bullet   Every available reporting and analytical tool that supports SQL can work with SQL-   MR.
Bullet   SQL-MR, as seen by report developers, is still as declarative and storage   independent as SQL itself. They don't have to concern themselves with technical   details such as indexes, partitions, and buffer management parameters.
Bullet   Beside the built-in MapReduce functions, developers can write their own analytical   functions. They can select their preferred language to write these functions,   including Java, C++, C#, Python, and R.
Bullet   The complexity of MapReduce functions can range from simple functions doing   simple selections to functions containing complete forecasting and optimization   algorithms.

There are five reasons why SQL-MR as implemented in nCluster can considerably improve the performance of complex analytical queries:

Bullet   By using SQL-MR instead of classic SQL, most analytical queries are simpler to   formulate (less code), and therefore easier to optimize for the database server.   This ensures better query performance.
Bullet   Most of the analytical processing of a query is pushed to the MapReduce functions.   This function code is distributed over as many nodes as possible, which means as   much query processing is done in parallel and not by a central module. Even   complex forms of analytics are parallelized.
Bullet   The MapReduce functions are coded in a classic programming language and   therefore execute very fast, especially if compiled languages such as Java and   C++ are used. In addition, the performance is quite predictable.
Bullet   The MapReduce functions don't process the data in a set-oriented way, but by each   data item independently or by a partition of data. This gives the developer full   control over how the code is executed, allowing him to program the most efficient   processing strategy. There is a whole group of analytical queries for which the   item-by-item or partition-bypartition approach are the more efficient processing   styles.
Bullet   SQL-MR is a software-based solution and can exploit various 'standard' cluster-   based platforms. If new versions of the standard hardware components are   released, SQL-MR will be able to benefit from them.

To summarize, extending a SQL database server with MapReduce creates a data warehouse platform that combines the expressive query power and productivity of SQL with the parallelizability of MapReduce. The combination has the potential to improve the performance of complex analytical queries running on large to extremely large datasets. SQL-MR is a powerful foundation for deep analytics, complex analytics, operational analytics, self-service analytics, and even automated analytics. Aster Data's nCluster is a mature and robust implementation of SQL-MR and has proven itself to be suitable for advanced analytics.



Download the entire technical whitepaper "Using SQL-MapReduce for Advanced Analytical Queries."

About the Author:
Rick F. van der Lans is an independent analyst, consultant, author and lecturer specializing in data warehousing, business intelligence, service oriented architectures, and database technology. He is Managing Director of R20/Consultancy based in The Netherlands. Rick has advised many large companies worldwide on defining their data warehouse and business intelligence architectures. He is an internationally acclaimed lecturer and for the last twenty years has been presenting professionally across Europe, South America, the United States, and Australia. Rick is Chairman of the European Data Warehouse Conference.

Rick is the author of many acclaimed books on SQL and database servers including, “Introduction to SQL” which has been on the market for 20 years.