Reading time: 4 min
In the last few years, we have received more and more client inquiries about “real-time data warehouses” in many forms, from casual chit-chat through formal Requests for Proposals to actual projects. These impressions suggest that there are misunderstandings as to what a real-time data warehouse really is, what it can be used for in practice, and what consequences it has.
Real-time or online?
“Real-time” and “online” are often used interchangeably in colloquial discourse even though they are not synonyms, which frequently leads to confusion.
“Online” (in this context) means that the system is available via a computer network, in practice via the Internet, around the clock. “Real-time”, on the other hand, means that the system reacts to its inputs (nearly) immediately. They are independent attributes.
For example, traditional paper-based cheques are neither online nor real-time. Bank transfers initiated from web-based frontends are online; however, they are not always real-time. (In Hungary, until recently, they were executed in hours at best and in days at worst.) Flight recorders are absolutely real-time but are not online in the least.
User interfaces of data warehouses are online most of the time (and may only be unavailable while new data is being loaded). Real-time data warehouses, though, are quite rare at Hungarian enterprises.
How much “real-time”?
In most cases when the expression “real-time” comes up in the context of data warehouses, it is meant in a business sense, and it means approximately that “we need a data update cycle faster than daily, perhaps about hourly, at least for a subset of the data”. This is a far more relaxed “real-time” requirement than what we generally mean in IT, and thus it is much simpler and cheaper to meet.
Traditional daily data updates in data warehouses are “batch”-based: all daily records from all sources are processed in one go. All new transactions are loaded, as well as changes in the dimensions, and then the data marts are recalculated. This process takes hours in most large data warehouses. (As an extreme example, we have seen at least one data warehouse in which the update process took nearly 24 hours, daily. They had two databases: one served the users while the other was being loaded.)
Shorter-than-daily load cycles are most often implemented with “micro-batch” processing. The principle is nearly the same as in the standard batch case, only much smaller batches are processed, much more frequently. Nearly the same, that is, except for a few notable differences that require special attention.
When micro-batch-based data updates are considered, speed is usually not a major concern, as any one micro-batch typically contains only a few records. There are, however, two peculiarities that need to be taken care of.
The first one is that micro-batches usually contain only changes rather than full data sets. These “incremental” or “delta” loads have a tendency to drift from the source: although in theory the deltas add up to the full data set, in practice they nearly never do over longer periods. To make sure that the data warehouse does indeed hold the complete data set, a complementary periodic full load is recommended, which can be part of the standard daily update cycle.
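The interplay between delta loads and the reconciling full load can be sketched as follows; this is a minimal illustration over an in-memory key-value table, and all names (`apply_delta`, `full_reload`, the operation codes) are hypothetical, not taken from any particular tool:

```python
# Sketch: delta loads can drift from the source; a periodic full load reconciles.
# All function and operation names here are illustrative.

def apply_delta(table, delta):
    """Apply a micro-batch of (op, key, row) changes in place."""
    for op, key, row in delta:
        if op == "upsert":
            table[key] = row
        elif op == "delete":
            # If an earlier upsert for this key was missed, this is a silent
            # no-op: one of the ways drift accumulates over time.
            table.pop(key, None)

def full_reload(table, source_snapshot):
    """Periodic full load: replace the (possibly drifted) table with the
    source of truth. Returns the keys that had drifted, for monitoring."""
    drifted = set(table) ^ set(source_snapshot)
    table.clear()
    table.update(source_snapshot)
    return drifted
```

In practice the same pattern appears at table granularity in the ETL tool of choice; the point is that the full load is authoritative and the deltas are merely a faster approximation between two full loads.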
The other special case is that micro-batches sometimes cover only a subset of the entities, owing perhaps to technical reasons or to peculiarities of business processes. For example, the micro-batches contain the transactions but not some of the dimensions. Loading such incomplete sets means that the data in the data warehouse will not be consistent (in the sense that e.g. some transactions may not have their proper dimensions). While this sounds rather scary to any traditional warehouse specialist, most of the time there is no practical alternative. These “temporary deviations” must be ironed out by the daily load cycle; during the day itself, we have to live with them.
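One common way to live with such a deviation is the classic “late-arriving dimension” trick: facts referencing a dimension member that has not been loaded yet get a placeholder member, which the nightly load later replaces with the real row. A minimal sketch, with entirely hypothetical table and column names:

```python
# Sketch of handling late-arriving dimensions in a micro-batch load.
# Facts referencing an unknown dimension key get a placeholder member;
# the daily full load is expected to overwrite these placeholders.
# All names (load_facts, customer_id, etc.) are illustrative.

UNKNOWN_MEMBER = {"name": "<unknown, pending daily load>"}

def load_facts(facts, customer_dim):
    loaded = []
    for fact in facts:
        key = fact["customer_id"]
        if key not in customer_dim:
            # Accept a temporary inconsistency instead of rejecting the fact.
            customer_dim[key] = dict(UNKNOWN_MEMBER)
        loaded.append({**fact, "customer": customer_dim[key]["name"]})
    return loaded
```

The design choice here is deliberate: rejecting or queuing the fact would keep the warehouse internally consistent but would delay exactly the “fast” data the business asked for.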
Can we calculate fast enough?
Beyond these two special cases, the data mart layer poses a different challenge: calculating certain aggregations (e.g. KPIs) and other complex structures requires a lot of resources. To make matters worse, sometimes there are theoretical problems as well, e.g. what does “daily change” mean for a day that is only partially loaded?
Though it is true that each micro-batch results in only a few changes in the normalized layer, in many cases data marts cannot be recalculated from the changes alone. (That is, the calculations need a lot of “old” data points in addition to the “new” ones.) In practice, this effect can limit how “real-time” an application can be, or significantly raise its resource requirements.
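The difference is easy to demonstrate with two toy measures, a sum that folds in each micro-batch cheaply, and a median that must re-read the whole history on every update (the class names are of course illustrative):

```python
# Sketch: some data-mart measures can absorb a micro-batch incrementally,
# others must touch all "old" data points on every update.
import statistics

class RunningSum:
    """A distributive measure: each micro-batch contributes only its delta."""
    def __init__(self):
        self.total = 0
    def add_batch(self, rows):
        self.total += sum(rows)
        return self.total

class RunningMedian:
    """A holistic measure: every recalculation needs the full history."""
    def __init__(self):
        self.history = []
    def add_batch(self, rows):
        self.history.extend(rows)
        return statistics.median(self.history)
```

A data mart full of `RunningSum`-like measures can be refreshed every few minutes; one dominated by `RunningMedian`-like measures (distinct counts, percentiles, period-over-period KPIs) quickly becomes the bottleneck the paragraph above describes.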
More frequent updates bring up yet another issue that is unheard of in traditional data warehouses: data changes “continuously” during the day. This is desirable only in certain use cases; most users in most situations are indifferent to it at best or actively annoyed by it at worst. We have seen cases where it led to regressions in practice: data scientists started inserting screenshots into management reports instead of referring to analyses by their live links, out of fear that the data might change by the time readers inspected them.
First and foremost, it is very important to clearly delineate “real-time” and “standard” areas of the data warehouse as well as their respective business use cases both for the technical staff and for the users.
Only based on this understanding, and considering the specific context, data patterns, and needs, can an organization successfully manage “real-time” and daily data together. A proper architecture of processing, storing, and using “fast” and “slow” data may range from being fully unified to completely compartmentalized.
Adopting “real-time” data is not only an architectural change but also a change in data processing practice: the incremental, “real-time” cycles do not replace but augment the daily cycle. This means that the same data points are likely to get loaded not once but twice. It is therefore vital to harmonize the two cycles: the daily load smooths out the inevitable variations introduced by the more frequent updates, while at the same time serving the use cases that require more stable data. It is also customary for the daily update to perform the more expensive data-mart calculations that are deferred during micro-batch processing.
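This division of labor can be captured as a simple task schedule, where the daily run repeats the micro-batch work and adds the deferred, expensive steps on top; the task names below are purely illustrative:

```python
# Sketch: the micro-batch cycle runs only cheap refreshes; expensive work
# (full reconciliation, KPI recalculation) is deferred to the daily run.
# Task names are hypothetical examples, not from any specific product.

TASKS = [
    {"name": "load_transaction_deltas", "cycle": "micro"},
    {"name": "refresh_simple_sums",     "cycle": "micro"},
    {"name": "full_reload_dimensions",  "cycle": "daily"},
    {"name": "recalculate_kpis",        "cycle": "daily"},
]

def tasks_for(cycle):
    # The daily run repeats the micro-batch steps as well, smoothing out
    # any mid-day drift before the expensive calculations run.
    wanted = {"micro"} if cycle == "micro" else {"micro", "daily"}
    return [t["name"] for t in TASKS if t["cycle"] in wanted]
```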
Processing data with lower latency is ever more important in up-to-date data management practice. This can be done in data warehouses as well, if expectations are properly set and the right solutions are implemented: ones that provide the advantages of faster data updates while carefully avoiding the associated pitfalls.