Building a Layered Testing Strategy for a Production ETL Pipeline
While working on a production Databricks ETL pipeline, we reached a point where structured testing became necessary.
The challenge wasn’t whether to add tests, but how to test a platform-heavy pipeline while keeping feedback fast and changes safe.
Starting with Nutter Tests
The first testing framework we introduced was Nutter.
Initially, Nutter was used almost like unit testing (see the sketch below):
- testing individual functions
- validating small pieces of ETL logic
- running directly on Databricks clusters
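A function-level check in this style looked roughly like the following sketch. It assumes a hypothetical helper called normalize_amounts is available in the notebook (for example via %run or a package import); the data and assertions are illustrative only.

```python
# Minimal Nutter fixture sketch: exercises one small piece of ETL logic
# on a Databricks cluster. `normalize_amounts` is a hypothetical helper.
from runtime.nutterfixture import NutterFixture


class NormalizeAmountsFixture(NutterFixture):
    def run_normalize_amounts(self):
        # Build a tiny input DataFrame and call the function under test
        input_df = spark.createDataFrame(
            [("a", "1,50"), ("b", "2,75")], ["id", "amount"]
        )
        self.result_df = normalize_amounts(input_df)

    def assertion_normalize_amounts(self):
        # Verify the transformed values
        values = sorted(row.amount for row in self.result_df.collect())
        assert values == [1.50, 2.75]


result = NormalizeAmountsFixture().execute_tests()
print(result.to_string())
```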
This worked well functionally, but there was a downside.
Because Nutter runs via notebook submit jobs on Databricks clusters, execution was significantly slower than traditional unit test frameworks. As the test suite grew, feedback loops became longer than we were comfortable with.
Introducing Unit Tests for Speed
To address this, we added traditional unit tests for pure Python logic.
- Much faster execution
- Easy to run locally and in CI
- Enabled use of coverage tooling
This became the default way to test (see the example below):
- transformation logic
- helper functions
- edge cases and regressions
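A typical test at this layer is plain pytest over pure Python logic. The module path and parse_amount helper below are hypothetical stand-ins for that kind of transformation code.

```python
# Plain pytest unit tests for pure Python transformation logic.
# `etl.transforms.parse_amount` is a hypothetical helper used for illustration.
import pytest

from etl.transforms import parse_amount


def test_parse_amount_handles_thousand_separators():
    assert parse_amount("1,234.50") == 1234.50


def test_parse_amount_rejects_empty_values():
    with pytest.raises(ValueError):
        parse_amount("")
```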
However, unit tests have limits in a Databricks environment.
Anything involving:
- catalog reads/writes
- Spark sessions
- dbutils
would require heavy mocking, reducing confidence.
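To make that trade-off concrete, here is a sketch of what the mocking looks like for a hypothetical load_customers(spark, table_name) that reads a catalog table and returns a row count. The test passes, but it never touches real Spark or catalog behavior.

```python
# Sketch of the mocking burden: the Spark session and its return values are
# faked, so the test exercises none of the real platform behavior.
from unittest.mock import MagicMock

from etl.load import load_customers  # hypothetical module and function


def test_load_customers_with_mocked_spark():
    fake_spark = MagicMock()
    fake_spark.read.table.return_value.count.return_value = 3

    assert load_customers(fake_spark, "main.sales.customers") == 3
    fake_spark.read.table.assert_called_once_with("main.sales.customers")
```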
Repositioning Nutter as Integration Testing
Instead of removing Nutter, we reframed its purpose.
Nutter was kept specifically for:
- functions that rely on Databricks-native behavior
- reading from or writing to catalogs
- interactions with Spark and dbutils
In this setup:
- unit tests handled speed and coverage
- Nutter acted as integration tests, validating real platform behavior
This separation made both test types more effective.
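An integration-style Nutter fixture in this setup writes to and reads from a real catalog table instead of mocking it. The schema and table names below are placeholders.

```python
# Nutter fixture used as an integration test: real catalog writes and reads,
# no mocking. `test_schema.customers_it` is a placeholder table name.
from runtime.nutterfixture import NutterFixture


class CustomerTableIntegrationFixture(NutterFixture):
    def run_customer_table_roundtrip(self):
        df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
        df.write.mode("overwrite").saveAsTable("test_schema.customers_it")

    def assertion_customer_table_roundtrip(self):
        # Read back through the catalog to validate real platform behavior
        assert spark.table("test_schema.customers_it").count() == 2

    def after_customer_table_roundtrip(self):
        # Clean up the test table
        spark.sql("DROP TABLE IF EXISTS test_schema.customers_it")


result = CustomerTableIntegrationFixture().execute_tests()
print(result.to_string())
```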
Testing ETL Steps Individually
As the pipeline evolved, we introduced another layer of Nutter tests focused on ETL steps, not functions.
- Each ETL step was tested independently
- Different use cases and edge conditions were covered
- Failures were easier to isolate
This acted like “unit testing” for the pipeline structure itself, without running the full flow.
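A step-level test can run just one step’s notebook against a small input and assert only on that step’s output. The notebook path, parameters, and expectation below are hypothetical.

```python
# Step-level Nutter test: run a single ETL step notebook in isolation and
# check its output table. Path, parameters, and table names are placeholders.
from runtime.nutterfixture import NutterFixture


class CleanseStepFixture(NutterFixture):
    def run_cleanse_step(self):
        # Execute only the "cleanse" step against a small test input table
        dbutils.notebook.run(
            "/pipeline/steps/cleanse",
            1200,
            {
                "input_table": "test_schema.raw_orders",
                "output_table": "test_schema.cleansed_orders",
            },
        )

    def assertion_cleanse_step(self):
        out = spark.table("test_schema.cleansed_orders")
        # Example expectation: the step drops rows with null order ids
        assert out.filter("order_id IS NULL").count() == 0


result = CleanseStepFixture().execute_tests()
print(result.to_string())
```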
End-to-End Tests and Value Validation
Finally, we added end-to-end (E2E) tests.
These tests:
- created a fresh ETL pipeline daily
- ran using a masked version of real client data
- executed the full flow from ingestion to final outputs
The outputs were compared against a baseline.
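The comparison itself can be as simple as a per-metric diff between the run’s output and a pinned baseline table. The table names and tolerance below are assumptions.

```python
# Sketch of a baseline comparison after an E2E run. Table names are
# placeholders; the tolerance is an assumed threshold for float metrics.
baseline = spark.table("e2e.baseline_metrics")     # pinned expected values
current = spark.table("e2e.latest_run_metrics")    # today's E2E output

drifted = (
    baseline.alias("b")
    .join(current.alias("c"), on="metric_name", how="outer")
    .selectExpr("metric_name", "b.value AS expected", "c.value AS actual")
    .filter("expected IS NULL OR actual IS NULL OR abs(expected - actual) > 1e-6")
    .collect()
)

assert not drifted, f"{len(drifted)} metrics drifted from the baseline: {drifted[:5]}"
```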
This served two purposes:
- sanity-checking pipeline correctness
- detecting unexpected changes in output values
Making Impact Visible
One unexpected benefit of the daily E2E runs was impact awareness.
Instead of just knowing that values changed, we could see:
- which output metrics were affected
- how broadly a change propagated
This made it easier to:
- communicate changes to downstream users
- set expectations around new features
- avoid surprises in reports already in use
In practice, this shifted testing from “did we break something?” to “who will feel this change?”
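One way to produce that view is to rank output metrics by how far they moved from the baseline. The sketch below reuses the placeholder tables from the E2E comparison above.

```python
# Impact summary sketch: rank output metrics by how far they moved from the
# baseline, then share the list with downstream users before release.
from pyspark.sql import functions as F

impact = (
    spark.table("e2e.latest_run_metrics").alias("c")
    .join(spark.table("e2e.baseline_metrics").alias("b"), "metric_name")
    .select(
        "metric_name",
        F.col("b.value").alias("expected"),
        F.col("c.value").alias("actual"),
        (F.col("c.value") - F.col("b.value")).alias("delta"),
    )
    .filter(F.col("delta") != 0)
    .orderBy(F.abs(F.col("delta")).desc())
)

impact.show(truncate=False)
```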
Takeaways
- Different tests exist for different reasons; no single framework fits all
- Speed matters as much as correctness during development
- Integration tests are essential in platform-heavy environments like Databricks
- End-to-end tests are most valuable when they highlight impact, not just failures
- Good testing improves trust, not just stability