In the first part of this blog series I discussed the traditional data warehouse and some of the issues that can occur in the period following the initial project. I concluded by outlining the 5 common challenges that I hear from customers: Inflexibility, complex architecture, slow performance, old technology, and lack of governance. This blog will focus on the fourth of these issues, old technology.
Let’s begin by reminding ourselves of the traditional data warehouse we discussed in the earlier post on complex architectures. For this post I want to focus on the technology components that come together to support these architectures and in turn explore some of the difficulties they experience over time. The first thing to agree on is what we mean by technology. In this case I am talking about the hardware and software that exists in each of the 4 layers of the architecture.
Hardware – the computer platforms themselves, typically Intel servers, typically in a customer owned server room or data center. Generally these are commodity items that are a combination of processor(s), memory, storage and networking.
Software – for simplicities sake I will not focus in detail on the operating systems in use but will instead look at the software used to manage the ETL process and the subsequent storage and optimization of the data.
This allows us to look at the architecture above in a slightly different way.
NB: As the focus of this blog series is the data warehouse I have removed the Front End Analytics layer.
As the traditional ETL process was built using either RDBMS SQL code or a bundled tool it is very common to the same standard hardware used to run both the ETL process and the Store and Optimise layer.
So Where Are the Problems?
This combination of hardware and software worked well when the original typical mid-range data warehouses were being built and deployed. However, in the subsequent years the cracks have begun to show.
As previously mentioned, when we talk about hardware we should be thinking of processors, memory, storage and networking. Each of these items has seen varying degrees of innovation that have resulted in more powerful computing platforms being made available for increasingly lower costs.
Processors – Most people involved in the IT industry are aware of Moore’s law so it should not be a surprise that the amount of computer power available has increased exponentially in recent times.
Much of the more recent hardware performance gains are realized through multi-core and parallel multi-processor systems. The challenge here is that a processor is a single component of an integrated server so is unlikely to be able to be changed without upgrading the entire platform.
Memory – as is the case with the processor, the speed of the server’s memory and in particular the amount in the system often constrains the effectiveness of the data warehouse platform. Modern server platforms include up to date processors that are able to address larger volumes of memory at a much faster rate. This can be hugely beneficial during ETL and query cycles.
Storage – if on board memory is crucial in ensuring that the processor can access the data it requires as quickly as possible, then the underlying (most likely) disk based storage tier cannot be ignored. In fact, many single server data warehouse implementations use basic HDD arrays that struggle to keep up with the demands placed on them by increasing data volumes and complexity of user queries.
Networking – although not an area that has seen as much innovation, the basic I/O specifications of many iteration 1 data warehouse implementations mean that their storage and networking layers have long since passed their optimal state. Networking performance is critical at 2 points within the data warehouse cycle.
- Moving data from source systems into the data warehouse
- Moving result sets from the data warehouse to the BI tool
It seems obvious, yet we still see architectures where either older or slower networking standards are being used. Or even worse, the network routing between the data warehouse and the source systems and BI systems are sub-optimal – and as a result introduce unnecessary bottlenecks and delays.
The reality is that you won’t be able to apply these advantages to your current data warehouse unless it is migrated onto a completely new set of hardware.
“Unfortunately, I regularly speak to customers who are struggling with their data warehouse performance as a result of still using the same server they implemented 5 years ago.”
However, hardware alone is not the answer, although it may well be the problem.
New hardware on its own will most certainly deliver an increase in performance but it’s unlikely to address the wider concerns covered in the rest of this blog series. Concerns relating, for instance, to the need to ingest and prepare data of differing variety, volume and velocity.
In other words, you also have to ensure that your software is optimized to address the new requirements but also to take advantage of modern hardware platforms.
Extract Transform & Load (ETL)
- If you use an ETL Tool or Engine, is it supported on the latest OS and hardware platforms?
- Is your ETL process multi-threaded and able to scale up and out in order to take advantage of hardware innovations?
- Are you able to run multiple parallel queries against your source systems in order to obtain the basic data you require, or are you limited to sequential queries?
- Does your ETL method / tool include native change data capture?
- Is your ETL solution able to retrieve data from both structured and unstructured sources?
- Can you consume web services and OData as easily as flat files and relational databases?
- Can your ETL process seamlessly work with both streaming and batch data?
Store and Optimize
- Are you using a traditional RDBMS platform or one that is optimized for analytic workloads?
- Have you had to create multiple aggregate tables that pre-calculate answers to deliver the performance you require?
- Are you constantly having to tune and update indexes in order to match the performance demands of your users?
- Do you find that your data warehouse regularly swaps volumes of data to disk in order to resolve user queries?
- Are you forced to store multiple copies of your data in order to improve performance, for example OLAP cubes?
- Can your data warehouse platform scale out across commodity hardware to improve performance?
- Can you store both structured and unstructured data within the same repository or seamlessly co-exist and leverage other data stores such as Hadoop?
- Have you had to archive data from your data warehouse in order to maintain performance levels or to remain within hardware limitations?
- Are you able to compress the data you hold, so you can hold more, without impacting on performance?
- Does your data warehouse platform include native analytics libraries for delivering advanced capabilities such as; geo-coding and geo-spatial processing, predictive algorithms (R for example) and embedded data quality tools?
- Is your data warehouse platform able to support open standards for querying such as ODBC, JDBC, MDX, OData, etc.?
To be clear, many of the requirements that we have discussed, particularly in the blog on COMPLEX ARCHITECTURE will never be effectively addressed in a cost effective, manageable fashion without significant changes to legacy hardware and software technologies.
So What’s the Answer?
Is it possible to deliver a data warehouse solution that is designed from the ground up to run optimally on modern hardware platforms? A solution that takes advantage of, and even informs Intels chip development efforts in high performance computing.
Is it possible to take the core of your traditional data warehouse solution and re-model it in a cost effective way to work within a modern platform based on a tight synergy of hardware and software
Can you minimize the challenges of data explosion and duplication within your data warehouse without compromising performance?
The answer is, yes (of course).
The Modern Data Platform
The Modern Data Platform delivers on all the requirements for a next generation data warehouse. Enabling organizations to radically simplify their existing legacy or overly complex solutions in order to lower running costs, improve agility and gain breakthrough performance that delivers real business value.
Remember to follow the rest of the posts in this blog series where we will explore in detail the common challenges of traditional data warehousing.