Ensure your data pipelines are correct and stay correct as requirements evolve
ML pipelines are harder to test than regular software because 'correct' is often probabilistic. But you can still catch the majority of bugs with three categories of tests:
Unit tests. Test each transformation function in isolation with tiny synthetic DataFrames. Verify shapes, column names, and dtypes, and check that known inputs produce known outputs. These tests run in milliseconds and belong in your pre-commit hook.
Data validation tests. Use Great Expectations or Pandera to assert schema contracts: no unexpected nulls, values within expected ranges, no duplicate IDs, and intact referential integrity. Run these checks on every pipeline execution.
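To make the four contracts concrete, here is a minimal sketch written in plain pandas rather than Great Expectations or Pandera, which express the same checks declaratively. The `orders` and `users` frames, their column names, and the `[0, 10000]` amount range are hypothetical and chosen only for illustration.

```python
import pandas as pd


def validate_orders(orders: pd.DataFrame, users: pd.DataFrame) -> list[str]:
    """Return human-readable contract violations (empty list if valid).

    Column names and bounds are illustrative assumptions, not a real schema.
    """
    problems = []

    # Contract 1: no nulls in required columns.
    for col in ("order_id", "user_id", "amount"):
        n_null = int(orders[col].isna().sum())
        if n_null:
            problems.append(f"{col}: {n_null} null value(s)")

    # Contract 2: values within the expected range (nulls are reported above).
    amount = orders["amount"]
    n_out = int((amount.notna() & ~amount.between(0, 10_000)).sum())
    if n_out:
        problems.append(f"amount: {n_out} value(s) outside [0, 10000]")

    # Contract 3: no duplicate IDs.
    n_dupes = int(orders["order_id"].duplicated().sum())
    if n_dupes:
        problems.append(f"order_id: {n_dupes} duplicate value(s)")

    # Contract 4: referential integrity, every user_id must exist in users.
    user_id = orders["user_id"]
    n_orphans = int((user_id.notna() & ~user_id.isin(users["user_id"])).sum())
    if n_orphans:
        problems.append(f"user_id: {n_orphans} value(s) missing from users")

    return problems


if __name__ == "__main__":
    users = pd.DataFrame({"user_id": [10, 11, 12]})
    orders = pd.DataFrame({
        "order_id": [1, 2, 2, 4],
        "user_id": [10, 11, 99, None],
        "amount": [5.0, -1.0, 20.0, 30.0],
    })
    for problem in validate_orders(orders, users):
        print(problem)
```

Collecting every violation before failing, instead of stopping at the first one, tends to make a broken pipeline run easier to diagnose. In a pipeline step you would typically raise an error if the returned list is non-empty.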
Integration tests. Run the full pipeline end-to-end on a small, representative slice of real data. Verify that the output has the expected shape and statistical properties (for example, mean and standard deviation within expected bounds).
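An end-to-end check of this kind might be sketched as follows. Everything named here is an assumption: `run_pipeline` is a hypothetical stand-in for your real entry point, `make_sample` generates synthetic data in place of a real data slice so the example is self-contained, and the numeric bounds are illustrative values you would normally derive from historical runs.

```python
import numpy as np
import pandas as pd


def make_sample(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Synthetic stand-in for a small slice of real data (assumption)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "user_id": np.arange(n_rows),
        "amount": rng.lognormal(mean=4.0, sigma=1.0, size=n_rows),
        "timestamp": pd.date_range("2024-01-01", periods=n_rows, freq="D"),
    })


def run_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline entry point; replace with the real one."""
    out = raw.copy()
    out["amount_log"] = np.log1p(out["amount"])
    return out


def check_output(out: pd.DataFrame, n_input_rows: int) -> None:
    """Assert shape and coarse statistical properties of the output."""
    # Shape: this pipeline should neither drop nor duplicate rows.
    assert len(out) == n_input_rows, "row count changed"
    assert "amount_log" in out.columns, "missing amount_log column"

    # Statistical bounds (illustrative): wide enough to tolerate sampling
    # noise, tight enough to catch a changed unit or a broken join.
    mean = out["amount_log"].mean()
    std = out["amount_log"].std()
    assert 3.0 <= mean <= 5.0, f"amount_log mean out of bounds: {mean:.3f}"
    assert 0.5 <= std <= 1.5, f"amount_log std out of bounds: {std:.3f}"


def test_pipeline_end_to_end():
    raw = make_sample(500)
    out = run_pipeline(raw)
    check_output(out, n_input_rows=len(raw))
```

Fixing the random seed keeps the test deterministic. With a real data slice, pinning the slice (for example, a fixed date range) serves the same purpose.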
import pandas as pd
import pytest

from your_pipeline import transform


@pytest.fixture
def sample_df():
    # Tiny synthetic input, including the zero-amount edge case.
    return pd.DataFrame({
        "user_id": [1, 2, 3],
        "amount": [10.0, 0.0, 500.0],
        "timestamp": ["2024-01-01", "2024-01-02", "2024-01-03"],
    })


def test_transform_adds_expected_columns(sample_df):
    result = transform(sample_df)
    assert "amount_log" in result.columns
    assert "date" in result.columns


def test_transform_no_nulls(sample_df):
    result = transform(sample_df)
    assert result.notna().all().all()


def test_transform_amount_log_non_negative(sample_df):
    result = transform(sample_df)
    assert (result["amount_log"] >= 0).all()


def test_transform_rejects_negative_amounts():
    # Negative amounts violate the input contract, so transform should fail loudly.
    bad_df = pd.DataFrame({"user_id": [1], "amount": [-5.0], "timestamp": ["2024-01-01"]})
    with pytest.raises(AssertionError):
        transform(bad_df)
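The tests above import `transform` from `your_pipeline` without showing it. As a minimal sketch, one implementation that would satisfy them could look like the following. The use of `log1p`, the date parsing, and the assert-based input check are all assumptions made for illustration, not the actual pipeline code. The extra test shows the "known inputs produce known outputs" and dtype checks on a two-row frame.

```python
import numpy as np
import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform consistent with the tests above."""
    # Input contract (assumed): amounts must be non-negative.
    # A failed assert raises AssertionError, which the last test expects.
    assert (df["amount"] >= 0).all(), "amount must be non-negative"

    out = df.copy()  # leave the caller's frame untouched
    out["amount_log"] = np.log1p(out["amount"])  # log(1 + x), defined at 0
    out["date"] = pd.to_datetime(out["timestamp"]).dt.date
    return out


def test_transform_known_output_and_dtypes():
    df = pd.DataFrame({
        "user_id": [1, 2],
        "amount": [0.0, np.e - 1],  # log1p maps these to 0 and 1
        "timestamp": ["2024-01-01", "2024-01-02"],
    })
    result = transform(df)

    assert result.shape == (2, 5)
    assert result["amount_log"].dtype == np.float64
    np.testing.assert_allclose(result["amount_log"], [0.0, 1.0])
    # The input frame should not gain columns as a side effect.
    assert list(df.columns) == ["user_id", "amount", "timestamp"]
```

Note that `assert` statements are skipped when Python runs with `-O`, so production code may prefer raising `ValueError` explicitly; the corresponding test would then use `pytest.raises(ValueError)`.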