Automated Testing with dbt
Meret Ringwald

One of the key responsibilities of a data team is to provide trustworthy data to its consumers. If the data you share can’t be trusted, it won’t be used to make decisions, and your organization will be left to base its decisions on guesswork and intuition.
Raw data is ingested into the warehouse from many different sources. It’s then the data product team’s responsibility to define the transformation logic that brings that source data together into meaningful data products for reporting and analytics. Agile data product teams use automated data quality tests to check the validity of the data they provide to their organization before it is used for analytics and decision making.
In this blog post, we look at how you can probe your data stored in BigQuery with different kinds of tests using dbt. We assume that you have basic knowledge of how dbt is set up and that you have a working project. If not, check out this blog post on how to set up a project with dbt.
In general, data quality assurance is performed with test specifications that are executed as SQL queries on BigQuery. There are different types of test specifications, such as internal and external consistency checks, KPI verification or access verification. We can also probe incoming data on a structural level, e.g. whether the schema of the delivered data matches the expected structure.
dbt comes with two types of tests: singular and generic. Both kinds of tests can be evaluated using the “dbt test” command.
Let’s first talk about singular tests. These are SQL statements that return the records that do not pass a certain condition. They are defined in .sql files stored in your test directory (the dbt default is “tests”). We could, for example, check whether our incoming transaction data only has positive tax percent values:
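A minimal sketch of such a singular test, assuming a staging model named stg_transactions with a tax_percent column (both names are illustrative):

```sql
-- tests/assert_positive_tax_percent.sql (hypothetical file name)
-- Return every record with a negative tax percent;
-- the test fails if any rows come back.
select
    order_id,
    tax_percent
from {{ ref('stg_transactions') }}
where tax_percent < 0
```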
This test would return all orders that have a negative tax percent and would need to be investigated further.
Singular tests are meant for one-off checks, as the name suggests. If you find yourself writing similar singular tests over and over again, it is time to switch to generic tests. In general, generic tests are parameterized queries that accept arguments. They can be added as properties on an existing model in a .yml file.
dbt comes with four “out-of-the-box” generic tests: unique, not_null, accepted_values and relationships. For example, you might want to validate the consistency between a source and a target table (“no record left behind”). Here is an example checking the order_id:
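A sketch of how this could look in a .yml file, using the built-in relationships test; the model names stg_orders and fct_orders are illustrative:

```yaml
# models/staging/stg_orders.yml (hypothetical file)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          # every order_id in the source must also exist in the target table
          - relationships:
              to: ref('fct_orders')
              field: order_id
```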
Apart from the four tests that come with dbt Core, there are packages like dbt_expectations, which expand the testing capabilities of dbt. For example, you could add a test from dbt_expectations that checks whether shipping_date only contains values of the type date.
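A sketch of that check, assuming dbt_expectations has been installed via packages.yml and shipping_date lives on a model called fct_orders (the model name is illustrative):

```yaml
version: 2

models:
  - name: fct_orders
    columns:
      - name: shipping_date
        tests:
          - dbt_expectations.expect_column_values_to_be_of_type:
              column_type: date
```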
The third option is custom-built tests using macros. As an example, we can compare a calculation result to a value known from a different analysis. In the code below, we calculate the total turnover for last year and compare it to the published value from the annual report. This is a database regression test: it makes sure that the integrity of the database stays intact even after modifications.
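A sketch of what such a custom generic test could look like; the column names, the date logic and the tolerance are all illustrative assumptions:

```sql
-- macros/test_matches_published_total.sql (hypothetical macro)
{% test matches_published_total(model, column_name, date_column, year, expected_value, tolerance=0.001) %}

with actual as (

    -- total turnover for the year in question
    select sum({{ column_name }}) as total_turnover
    from {{ model }}
    where extract(year from {{ date_column }}) = {{ year }}

)

-- return a row (= test failure) when the computed total deviates from the
-- published figure by more than the allowed relative tolerance
select total_turnover
from actual
where abs(total_turnover - {{ expected_value }}) > {{ expected_value }} * {{ tolerance }}

{% endtest %}
```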
We then add this test to our other tests:
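For example, in the model’s .yml file (the model name and the expected value are illustrative):

```yaml
version: 2

models:
  - name: fct_orders
    columns:
      - name: turnover
        tests:
          - matches_published_total:
              date_column: order_date
              year: 2022
              expected_value: 1250000   # published figure from the annual report
```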
Another helpful feature is that we can group tests by using tags.
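For generic tests, tags can be set directly in the .yml file; for singular tests, in a config() block at the top of the .sql file. A sketch with an illustrative tag name:

```yaml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique:
              tags: ['test_group1']
```

```sql
-- first line of a singular test's .sql file
{{ config(tags=['test_group1']) }}
```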
Using these tags, the specific tests to run can be selected at runtime. Example: “dbt test --select tag:test_group1”.
Making sure your data complies with your quality standards can be time consuming, as you need to run many, computationally expensive tests. To avoid delaying the delivery of the data to the end user, it is good practice to define so-called “smoke tests”. These tests are quick and run before you make the data accessible, detecting major concerns within the data. More fine-grained testing, which might take some time to finish, can then take place once the data is already loaded. With this approach, you get a good tradeoff between time-to-market and making sure your data quality complies with your standards.
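A sketch of this two-stage setup, assuming the quick checks carry an illustrative smoke_test tag:

```
# quick smoke tests run before the data is made accessible
dbt test --select tag:smoke_test

# the full, more expensive suite runs once the data is loaded
dbt test --exclude tag:smoke_test
```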
With these different possibilities and features in dbt, we have all the tools we need to assure data quality. We can find data quality issues and address them before making the data available for further analysis. This allows us to provide trusted data for data-driven decisions.