Having been on a break from consulting since the start of the year, I decided to catch up on my reading and started with TDWI’s well-summarized article on Big Data Analytics, written by Philip Russom (see link elsewhere on this blog). As the article explains, Big Data Analytics is the coming together of Big Data and Advanced Analytics. Big Data is not just about large volumes, but also about diversity in data types (structured, semi-structured and unstructured) and variety in data refresh speeds, from real-time to delayed. That range runs from traditional transactional data, which presumably arrives at discrete, varied time intervals, to sensor-type data, which is a continuous stream. Advanced Analytics is a revival of everything that was considered too esoteric for most EDW groups to aspire to, from recognizable techniques such as Predictive Analysis, Data Mining, Complex SQL and Statistical Analysis to less recognizable ones such as Natural Language Processing (used to understand written text) and Artificial Intelligence (I am not yet sure how this is used). There is also content analysis, the analysis of video and audio, which seems even more advanced.
All this is to be supported on a diverse set of fairly new platform choices, ranging from Hadoop-based implementations and DW appliances to columnar, analytic, or in-memory databases and in-database analytic functions.
It feels as if, under the single banner of Big Data Analytics, everything that was exciting, albeit challenging, and certainly not an everyday topic in DW circles is being revived and made mainstream. Now, a distinction can be made between “old-school” EDW-BI solutions, which offer a very structured and fairly predictable model of storage and consumption through well-defined data warehouses, dashboards and cubes, and this new “exploratory” or “discovery-oriented” way of looking at data. So, does the thinking around EDW need to change? And how?
Do we need to throw away our EDWs and move all our data to a brand new platform such as those above? One point made in the article is that the data fed into such technologies needs to be as raw as possible, and that traditional ETL processing should not be applied. So, the data can clearly bypass the EDW entirely and be fed directly into one of the above-mentioned technologies to achieve results. Is the role of the EDW diminished by this, and will it become simply a historical source for this new, all-encompassing world?
Not so soon. First of all, a little data cleansing and staging doesn’t hurt. Picture raw data with misspelled customer names. A little massaging can only add to the quality of the analysis. Thus, the EDW can be used to stage an optimally cleansed data set that serves as input to the analytics. If your EDW architecture includes an Operational Data Store (ODS) layer that already houses cleansed data from the transactional systems, that can be used as a source for the analytics as well.
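To make the misspelled-names point concrete, here is a minimal sketch of the kind of light massaging I have in mind. It uses fuzzy matching from Python's standard `difflib` module against a reference list of known customers. The customer names, the `cleanse_name` helper and the 0.8 similarity cutoff are all my own assumptions for illustration; nothing here comes from the TDWI article, and a real staging layer would likely use a proper data quality tool.

```python
from difflib import get_close_matches

# Hypothetical reference list of known-good customer names (illustration only).
KNOWN_CUSTOMERS = ["Acme Corporation", "Globex Industries", "Initech"]


def cleanse_name(raw_name, known=KNOWN_CUSTOMERS, cutoff=0.8):
    """Return the closest known customer name, or the tidied input if none is close enough."""
    # Collapse stray whitespace before comparing.
    candidate = " ".join(raw_name.split())
    # Compare case-insensitively, but return the canonical spelling.
    lookup = {name.lower(): name for name in known}
    matches = get_close_matches(candidate.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else candidate


if __name__ == "__main__":
    for raw in ["Acme Corporaton", "  initech ", "Zenith Partners"]:
        print(repr(raw), "->", repr(cleanse_name(raw)))
```

Names that do not resemble anything on the reference list are passed through untouched, which keeps the data close to raw while still removing the most obvious noise.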
Beyond that, the traditional database platform that houses the EDW seems a poor choice as the main platform for supporting Big Data Analytics. For one, it does not handle well the storage of the diverse data types that seem to be desired. Nor will it, with its fixed indexing and partitioning schemes, respond well to the exploratory nature of the analysis that seems to be the heart and soul of Big Data.
It looks like EDW implementations will have to coexist with Big Data Analytics systems, acting as a source of structured data for them. Either Big Data Analytics would have to be supported on a separate platform, or both it and the EDW would have to be moved to the new platform (this works if the implementation is a hybrid platform supporting the full range from structured to unstructured data). This makes sense, as traditional EDW platforms have been notoriously ill-suited to the exploratory analysis that some users desire. This coexistence of technologies that support what is well-defined and what is fuzzy seems apt. Also, the traditional EDW approach, with its proven way of handling clearly understood analytics through dashboards and cubes, does not have to be thrown away.
Another aspect to think about is that Big Data Analytics is not just for Big Organizations, or even just for implementations with literally immense volumes of data. The Advanced Analytics aspect of it can be applied to any situation. For example, text analytics can be used to extract nuggets of information from the wealth of textual data collected in any organization. At a recent engagement with a telecom infrastructure services provider, it occurred to me that the descriptions being collected for tower development projects could be mined for the reasons why projects were being killed. Unfortunately, such topics are still considered too far-fetched. One good thing that might come out of the recent upsurge of Big Data is that it brings Advanced Analytics more to the forefront.
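As a rough idea of what mining those project descriptions could look like at its very simplest, the sketch below tallies how many descriptions mention each of a few reason categories. Everything in it is hypothetical: the categories, the trigger phrases, the `tally_kill_reasons` helper and the sample descriptions are made up for illustration and are not data from the engagement. Real text analytics would go well beyond keyword matching, but even this much can surface patterns.

```python
import re
from collections import Counter

# Hypothetical reason categories and trigger phrases (illustration only).
# In practice these would come from domain experts or from the data itself.
REASON_KEYWORDS = {
    "zoning": ["zoning", "permit"],
    "cost": ["budget", "cost"],
    "landlord": ["landlord", "lease"],
}


def tally_kill_reasons(descriptions):
    """Count how many descriptions mention each reason category at least once."""
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for reason, phrases in REASON_KEYWORDS.items():
            # Whole-word match so that "cost" does not fire on "costume".
            if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases):
                counts[reason] += 1
    return counts


# Made-up free-text descriptions, standing in for real project notes.
SAMPLE_DESCRIPTIONS = [
    "Project cancelled: zoning permit denied by the county.",
    "Killed after landlord refused to renew the lease.",
    "Cancelled, cost exceeded the approved budget.",
    "Zoning board rejected the tower height.",
]

if __name__ == "__main__":
    for reason, count in tally_kill_reasons(SAMPLE_DESCRIPTIONS).most_common():
        print(reason, count)
```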
In conclusion, Big Data Analytics seems disruptive to traditional EDW approaches at first, but in the end the two appear to form a symbiotic union.