
Building a Layered Testing Strategy for a Production ETL Pipeline

While working on a production Databricks ETL pipeline, we reached a point where structured testing became necessary.

The challenge wasn’t whether to add tests, but how to test a platform-heavy pipeline while keeping feedback fast and changes safe.

Starting with Nutter Tests

The first testing framework we introduced was Nutter.

 

Initially, Nutter was used almost like unit testing:

  • testing individual functions

  • validating small pieces of ETL logic

  • running directly on Databricks clusters
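
As a rough illustration of that early usage, a Nutter fixture testing a single helper looked something like the sketch below. The helper, its column names, and the expected values are invented for this example; spark is the SparkSession that Databricks provides in every notebook.

from runtime.nutterfixture import NutterFixture
from pyspark.sql import functions as F

def add_revenue_column(df):
    # Hypothetical helper standing in for a small piece of ETL logic
    return df.withColumn("revenue", F.col("price") * F.col("quantity"))

class AddRevenueColumnFixture(NutterFixture):
    def run_add_revenue_column(self):
        # Build a tiny DataFrame on the cluster and apply the function under test
        # (spark is the notebook-provided SparkSession)
        df = spark.createDataFrame([(10, 2), (3, 4)], ["price", "quantity"])
        self.result = add_revenue_column(df)

    def assertion_add_revenue_column(self):
        assert {r["revenue"] for r in self.result.collect()} == {20, 12}

result = AddRevenueColumnFixture().execute_tests()
print(result.to_string())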

 

This worked well functionally, but there was a downside.

Because Nutter runs via notebook submit jobs on Databricks clusters, execution was significantly slower than with traditional unit test frameworks. As the test suite grew, feedback loops became longer than we were comfortable with.

Introducing Unit Tests for Speed

To address this, we added traditional unit test cases (UTCs) for pure Python logic.

  • Much faster execution

  • Easy to run locally and in CI

  • Enabled use of coverage tooling

 

This became the default way to test:

  • transformation logic

  • helper functions

  • edge cases and regressions
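
A minimal sketch of what these tests looked like, using pytest purely as an example framework; the helper below is illustrative, not the project's real code:

import pytest

def normalise_column_name(name: str) -> str:
    # Hypothetical pure-Python helper: no Spark, no cluster, no mocking needed
    return name.strip().lower().replace(" ", "_")

@pytest.mark.parametrize(
    "raw, expected",
    [
        ("Order Date", "order_date"),
        ("  CustomerID ", "customerid"),
        ("revenue", "revenue"),
    ],
)
def test_normalise_column_name(raw, expected):
    assert normalise_column_name(raw) == expected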

 

However, unit tests have limits in a Databricks environment.

Anything involving:

  • catalog reads/writes

  • Spark sessions

  • dbutils

would require heavy mocking, which reduced confidence in what the tests actually proved.
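
To make that concrete, here is a sketch of the kind of test we wanted to avoid. The function and table name are invented; the point is that the test only proves we called the mock correctly, not that the catalog read actually works.

from unittest.mock import MagicMock

def load_orders(spark, table_name="main.sales.orders"):
    # Illustrative function that depends on the platform
    return spark.table(table_name)

def test_load_orders_with_mocked_spark():
    spark = MagicMock()
    load_orders(spark)
    # Passes as long as spark.table was called with the right string;
    # it says nothing about schemas, permissions, or real catalog behaviour.
    spark.table.assert_called_once_with("main.sales.orders")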

Repositioning Nutter as Integration Testing

Instead of removing Nutter, we reframed its purpose.

 

Nutter was kept specifically for:

  • functions that rely on Databricks-native behavior

  • reading from or writing to catalogs

  • interactions with Spark and dbutils
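
A sketch of a Nutter fixture used in this integration role, writing to and reading back from a real table on the cluster. The schema and table names are assumptions, and spark is the notebook-provided session.

from runtime.nutterfixture import NutterFixture

class OrdersRoundTripFixture(NutterFixture):
    def before_orders_round_trip(self):
        # Write a small table through the real catalog; no mocks involved
        self.table = "test_schema.nutter_orders_tmp"
        spark.createDataFrame([(1, 20.0), (2, 12.0)], ["order_id", "revenue"]) \
            .write.mode("overwrite").saveAsTable(self.table)

    def run_orders_round_trip(self):
        self.loaded = spark.table(self.table)

    def assertion_orders_round_trip(self):
        assert self.loaded.count() == 2

    def after_orders_round_trip(self):
        spark.sql(f"DROP TABLE IF EXISTS {self.table}")

result = OrdersRoundTripFixture().execute_tests()
print(result.to_string())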

 

In this setup:

  • unit tests handled speed and coverage

  • Nutter acted as integration tests, validating real platform behavior

 

This separation made both test types more effective.

Testing ETL Steps Individually

As the pipeline evolved, we introduced another layer of Nutter tests focused on ETL steps, not functions.

  • Each ETL step was tested independently

  • Different use cases and edge conditions were covered

  • Failures were easier to isolate

 

This acted like “unit testing” for the pipeline structure itself, without running the full flow.
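
One way to picture this layer: each ETL step is a callable from DataFrame to DataFrame, so a fixture can feed one step a small hand-built input and check the output. The step below is invented for illustration.

from runtime.nutterfixture import NutterFixture

def clean_orders_step(df):
    # Illustrative ETL step: drop rows without an order_id and deduplicate
    return df.dropna(subset=["order_id"]).dropDuplicates(["order_id"])

class CleanOrdersStepFixture(NutterFixture):
    def run_clean_orders_drops_bad_rows(self):
        raw = spark.createDataFrame(
            [(1, "a"), (1, "a"), (None, "b")], ["order_id", "payload"]
        )
        self.out = clean_orders_step(raw)

    def assertion_clean_orders_drops_bad_rows(self):
        # Duplicate and null-keyed rows are removed, leaving a single order
        assert self.out.count() == 1

result = CleanOrdersStepFixture().execute_tests()
print(result.to_string())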

End-to-End Tests and Value Validation

Finally, we added end-to-end (E2E) tests.

 

These tests:

  • created a fresh ETL pipeline daily

  • ran using a masked version of real client data

  • executed the full flow from ingestion to final outputs

The outputs were compared against a baseline.

 

This served two purposes:

  • sanity-checking pipeline correctness

  • detecting unexpected changes in output values
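
The comparison itself can be as simple as joining the day's output metrics to the stored baseline and failing the run on any drift. A rough sketch, where the table names, metric layout, and tolerance are all assumptions:

baseline = spark.table("e2e_baseline.daily_metrics").withColumnRenamed("value", "baseline_value")
current = spark.table("e2e_output.daily_metrics").withColumnRenamed("value", "current_value")

TOLERANCE = 1e-6

joined = baseline.join(current, on="metric_name", how="outer")
mismatches = joined.filter(
    "baseline_value IS NULL OR current_value IS NULL "
    f"OR abs(baseline_value - current_value) > {TOLERANCE}"
)

# Fail the daily E2E run if any metric drifted from the baseline
n_mismatches = mismatches.count()
assert n_mismatches == 0, f"{n_mismatches} metrics differ from baseline"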

Making Impact Visible

One unexpected benefit of the daily E2E runs was impact awareness.

 

Instead of just knowing that values changed, we could see:

  • which output metrics were affected

  • how broadly a change propagated
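
A sketch of how such an impact summary can be derived from the same baseline comparison; column and table names are again illustrative.

from pyspark.sql import functions as F

baseline = spark.table("e2e_baseline.daily_metrics").withColumnRenamed("value", "baseline_value")
current = spark.table("e2e_output.daily_metrics").withColumnRenamed("value", "current_value")

impact = (
    baseline.join(current, on="metric_name", how="outer")
    .withColumn("abs_change", F.abs(F.col("current_value") - F.col("baseline_value")))
    .withColumn(
        "pct_change",
        F.when(
            F.col("baseline_value") != 0,
            100.0 * F.col("abs_change") / F.abs(F.col("baseline_value")),
        ),
    )
    .filter(F.col("abs_change") > 0)
    .orderBy(F.col("pct_change").desc_nulls_last())
)

# Answers "which metrics moved, and by how much?" before a change is released
impact.show(truncate=False)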

 

This made it easier to:

  • communicate changes to downstream users

  • set expectations around new features

  • avoid surprises in reports already in use

 

In practice, this shifted testing from “did we break something?” to “who will feel this change?”

Takeaways

  • Different tests exist for different reasons — no single framework fits all

  • Speed matters as much as correctness during development

  • Integration tests are essential in platform-heavy environments like Databricks

  • End-to-end tests are most valuable when they highlight impact, not just failures

  • Good testing improves trust, not just stability
