Applied Dimensionality

ETL Testing

Posted at — Sep 24, 2010
ETL Testing

I’m a fan of testing in each DWH project, because it allows to:

A list of approaches I use for testing ETL-procedures:

‘Water in a Sieve’

Checking whether we could carry all required data into DWH without ‘spilling’ it out.

Common things to check:

Common mistakes found:

‘Excluded Middle’

Checking dimension mapping for a selected dimension. This extends previous test by including dimension totals for checking.

So if we want to check whether ‘products’ were mapped correctly — we compare totals(sums) by time, store, etc, listing all dimensions except products.

If counts(sums) differ in source\DWH — we know that products were mapped incorrectly.  Moreover we know the specific data subset containing error, which helps a lot as well.

‘Do your maths’

Checking DWH calculations. If we’re doing some data transformation\calculating something in DWH — it should be tested as well.

2 approaches to testing:

  1. overall logic testing. For example, if we’re allocating HQ expenses to get regional reports it logical to expect the overall sum of expenses to stay the same )
  2. testing a specific data subset. We can select a single account and verify logic on it.

These two approaches should be combined.

‘Like in good old days’

Checking some heuristic expectations, based on previously loaded data.

  1. There’s no way daily sales can jump 70% compared to quarter average
  2. We’re usually getting about this number of rows from this source
  3. If we’re reloading 3 months of data daily — we expect modest amount of changes in past days.

If you’re using partitioned loads ‘last_load’ partition can be used for such testing.

It’s best when tests are written not by procedure developer himself. You can always apply cross-checking (you write tests for my procedures, I write for yours).

Pay attention to Chapter 4 of ETL Toolkit – this post is just a simple list of typical tests, the methodology is described there.

comments powered by Disqus