Data Observability on Steroids


For a production data platform, we need transparent monitoring. In the dbt Cloud interface you can view job statuses and logs, but custom alerts (email, SMS, Slack notifications) are currently either unavailable or less informative than they should be. Ideally, we need a central tool where we can see the events and logs generated by the ingestion tool, dbt, and the other components of our data stack.
Of course, this requires fairly extensive interfaces between the components (APIs, RPCs, post hooks), and currently there isn't a silver bullet for this problem. We feel this part will still evolve a lot, and the community will figure out how to manage collective orchestration in this fragmented ecosystem. What is certain is that this debate is not over yet.

There are plenty of ways to look at data diffs, ranging from whole databases down to smaller pieces of your platform. Among those, unit testing has its limitations. You can use dbt_utils and the portable parts of Great Expectations to test your data after a change, but scaling those tests becomes cumbersome as you add new columns to a table.
For instance, the testing logic has to be updated and expanded to keep coverage at an optimal level. At a large scale, the marginal benefit of writing those tests diminishes and alert fatigue can become a serious issue. An automated data monitoring and diffing tool needs no manual maintenance on the user side and is able to catch anomalies transparently. It should be able to gauge what counts as a breaking change and what falls into the category of normal behavior. That way, we can minimize false positives and keep human interaction to a minimum.
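To make the maintenance burden concrete, here is a minimal, hypothetical schema.yml sketch using dbt_utils (the model and column names are made up): every column you want covered needs its own hand-written test entries, which is exactly what stops scaling.

```yaml
# models/schema.yml -- illustrative example; model/column names are hypothetical
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          # dbt_utils test guarding against out-of-range values
          - dbt_utils.accepted_range:
              min_value: 0
      # every additional column needs its own hand-maintained entries here,
      # so coverage erodes silently as the table grows
```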

I think it's already half the battle if we can get notifications about test failures, schema drift, and data freshness errors, with the logs available in one central location. Fortunately, several companies are now trying to cross this chasm, and the area is being mentioned among the most exciting trends for 2022.

There are more of these tools than we can count on two hands, so we will only highlight a few that stood out to us with some of their qualities and fit the dbt-centric workflow. If you are interested in this field, you can find a more complete(ish) list here!

re_data

re_data (mentioned among the hottest repositories) is one of the few open-source alternatives in this space; it covers the monitoring side and is also targeted at the MDS ecosystem. It combines data tests with a monitoring dashboard, and it's even capable of sending Slack notifications. It can be installed as a dbt package, so it integrates well with the ecosystem and leverages dbt's manifest file to generate model lineage.
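Installation follows the usual dbt package workflow. A minimal sketch is below; the version pin and project name are illustrative, and the monitoring flag reflects our reading of the re_data docs at the time of writing, so check the current docs before copying it:

```yaml
# packages.yml -- install re_data like any other dbt package, then run `dbt deps`
packages:
  - package: re-data/re_data
    version: [">=0.10.0", "<0.11.0"]  # illustrative pin; match your dbt version
```

```yaml
# dbt_project.yml -- opt models into monitoring with a config flag
# (key name per the re_data docs at the time of writing)
models:
  my_project:              # hypothetical project name
    staging:
      +re_data_monitored: true
```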

Imagine a dbt or Great Expectations testing suite built to create time-based metrics and track anomalies on them using ML. In fact, re_data was one of the first to bring monitoring and testing onto the same platform.

However, for the time being you have to host the web app somewhere and manage the scheduling of the notification package yourself. A cloud-hosted version is already on the roadmap to cover enterprise users of all sorts. We would consider self-hosting an advantage rather than a drawback: not many of the other vendors let you host the UI yourself, and this setup keeps your data secure in your own cloud instead of distributing it to third parties.

Datafold

Sometimes a seemingly small change in your codebase can cause a lot of problems in downstream models and dashboards, invisible to the naked eye when submitting a PR.

“Datafold gives you confidence in your data quality through diffs, profiling, and anomaly detection.”

In terms of usability, it is as simple as any other similar tool. You need to supply the connection credentials as well as a temporary schema reserved for Datafold to materialize intermediate results. After that, you should be able to define metrics, browse the database catalog, and run data diffs. Additional configuration is needed if you want to integrate Datafold with your GitHub workflow through the API.

I think the most interesting feature is that it's capable of diffing datasets. Setting this up through GitHub Actions is pretty straightforward: define a workflow YAML that builds the project with dbt-core, then submit the compiled artifacts to Datafold through the API.
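A hedged sketch of such a workflow is below. The datafold-sdk invocation, the secret name, the CI config ID, and the adapter choice are assumptions for illustration; consult Datafold's documentation for the exact arguments your setup needs.

```yaml
# .github/workflows/datafold_ci.yml -- illustrative sketch, not a verified config
name: dbt build + Datafold diff

on: pull_request

jobs:
  build-and-diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install dbt and the Datafold SDK
        run: pip install dbt-snowflake datafold-sdk   # adapter is an example

      - name: Build the dbt project
        run: |
          dbt deps
          dbt build --profiles-dir .

      - name: Submit compiled artifacts to Datafold
        env:
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}  # hypothetical secret name
        # hypothetical CLI call -- check the datafold-sdk docs for exact flags
        run: datafold dbt upload --ci-config-id 1 --run-type pull_request
```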

You can write your own metrics in SQL, and Datafold periodically monitors them. Under the hood, it calculates a prediction interval for each metric and triggers an alert (via email, Slack, or PagerDuty) when the value falls outside the expected range.

The data-diff feature can be very handy when you shift data within a source (web-based) or migrate from on-premises to the cloud (CLI feature) without losing a single event in the process.

Scenario: let's say you used the zero-copy-clone feature to migrate your tables to another schema, or you decide to move to another warehouse vendor. Later on, you redact an order because it was wrongly recorded in the order lines table. Running data-diff, you realize that the three rows associated with that particular order were deleted only in the replicated table.

Usually, it can be very painful and costly to scan through terabytes of data, but Datafold's checksum algorithm keeps the diffing performant at scale. It simply hashes values to numbers, sums them up, and compares the sums across data sources. If there is a difference, the algorithm splits the data into smaller and smaller segments until it finds the rows responsible for the regression.

It allows you to create a per-row summary of missing and mismatched rows, which you can then use to spot-check errors and fix them without having to re-materialize your tables. Read more on checksums here.

Other notable features are column-level lineage, data profiling, and cataloging. The catalog is a great extra for getting a quick glance at the data, with search down to the column level, sampling, and data ownership, even though I think such a feature falls outside the core set of issues Datafold wants to solve for its customers.

As far as BI tools are concerned, Datafold only supports Mode for now, but expects to bring a Looker integration to users soon. This can be decisive for those who want to see end-to-end lineage. What is also promising is that they are planning to add Hightouch support, which would further expand your data visibility beyond classical ELT.

Metaplane

Metaplane (“the Datadog of data”) integrates across your data stack, from source systems through warehouses to BI dashboards, then learns normal behavior and alerts the right people when things go awry.

While Datafold focuses on regressions introduced by codebase changes, Metaplane set out to be a classical monitoring tool that identifies silent data bugs even when nothing is flagged by the pipeline. An example could be a data source issue that significantly inflates the revenue figures displayed on the platform. Such a bug is invisible to tools that only check changes kicked off by development work, because it occurs unexpectedly.

Unlike Datafold, Metaplane offers out-of-the-box tests even for statistical properties, and it also checks for schema changes on the go.
Where Metaplane has a slight advantage over the others is its Slack integration, which lets the end-user interact with an alert and mark it either as a true anomaly or as normal behavior. This feedback adjusts the prediction interval, reducing alert fatigue and false positives.

Elementary

Elementary is an open-source app that enables data teams to detect anomalies in their data before their users do.

It enables you to monitor key data quality metrics, freshness, volume, and schema changes, including anomaly detection, all configured and executed as dbt tests.
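As a minimal sketch, these anomaly-detection tests are declared like any other dbt test in a schema.yml. The model, column, and timestamp names below are hypothetical; check Elementary's docs for the full set of available tests and parameters.

```yaml
# models/schema.yml -- illustrative Elementary test configuration
version: 2

models:
  - name: orders            # hypothetical model
    tests:
      # alert when row volume deviates from the learned baseline
      - elementary.volume_anomalies
      # alert when the table stops receiving fresh data
      - elementary.freshness_anomalies:
          timestamp_column: created_at
    columns:
      - name: amount
        tests:
          # track column-level metrics over time and flag outliers
          - elementary.column_anomalies:
              column_anomalies:
                - null_count
                - zero_count
```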

The included data observability report lets you examine the output of these dbt tests.

In our opinion, the most interesting feature of the dbt-data-reliability package is that it can be used to load the run results into the target database. This basic functionality is currently missing from dbt-core, and we are thankful to the folks at Elementary for making it available to everyone. It excites us and has made many of our ideas feasible, but more on that later.
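Setting this up is, to the best of our reading of the Elementary docs, just the usual dbt package routine; the version pin and schema name below are illustrative:

```yaml
# packages.yml -- install the Elementary dbt package, then run `dbt deps`
packages:
  - package: elementary-data/elementary
    version: 0.4.1   # illustrative pin; use the latest from the dbt package hub
```

```yaml
# dbt_project.yml -- give Elementary its own schema; after `dbt run`,
# the package's on-run-end hooks write run results and test outcomes there
models:
  elementary:
    +schema: "elementary"
```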

Closing Remarks

That being said, there is no ‘best tool’ on the market as of now, but we can see a pattern of similar companies trying to cover as many features as they can, “bundling” (pun intended) data cataloging, observability, and governance with a spice of ML. They are tackling this area so aggressively that they take inspiration from each other (sometimes resulting in conflict). Each of the tools has its own stand-out features, so you have to make sure to choose the one that best fits your typical workflow. Hopefully, we are going to see more and more organizations interested in data durability and visibility, incentivizing the development of utilities we have sorely missed but which were not yet worth implementing in the Modern Data Stack community.

We hope that one of the showcased tools has piqued your interest and that by introducing it to your project you will gain more confidence in your data platform.

Authors:

Zsombor Foldesi 
Son N. Nguyen

You can find our other blog posts here.

 
