Measure model fairness mathematically — and understand why you cannot have all metrics at once
There are dozens of proposed fairness metrics, but four are most commonly used in practice. Understanding what each measures — and what it doesn't — is essential.
Demographic parity: P(Ŷ=1 | A=0) = P(Ŷ=1 | A=1). The positive prediction rate should be equal across the groups defined by a sensitive attribute A (e.g., race, gender). It is easy to measure, but it ignores the true labels, so it does not account for whether underlying base rates differ between groups. It is appropriate when you believe qualification rates should be equal across groups.
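As a minimal sketch of the definition above, the positive prediction rate per group can be computed directly with NumPy. The helper name `selection_rates` and the toy arrays are illustrative assumptions, not part of any library.

```python
import numpy as np

def selection_rates(y_pred, group):
    """Fraction of positive predictions within each group (hypothetical helper)."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    return {g.item(): float(y_pred[group == g].mean()) for g in np.unique(group)}

# Toy data, made up for illustration: four people in each of two groups.
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

rates = selection_rates(y_pred, group)
dp_gap = max(rates.values()) - min(rates.values())
print(rates)   # {0: 0.5, 1: 0.25}
print(dp_gap)  # 0.25
```

A gap of 0 means both groups receive positive predictions at the same rate. Note that the true labels never enter the calculation.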
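Equalized odds: P(Ŷ=1 | Y=1, A=0) = P(Ŷ=1 | Y=1, A=1) and P(Ŷ=1 | Y=0, A=0) = P(Ŷ=1 | Y=0, A=1). Both the true positive rate and the false positive rate should be equal across groups. Unlike demographic parity, this conditions on the true outcome Y, so it constrains error rates rather than raw prediction rates. This is the standard ProPublica applied in its analysis of COMPAS.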
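The two conditions above can be checked by computing the true positive rate and false positive rate separately for each group. The sketch below uses a hypothetical helper and made-up labels. Summarising the result as the larger of the two gaps is one reasonable convention, assumed here for illustration.

```python
import numpy as np

def tpr_fpr_by_group(y_true, y_pred, group):
    """Return {group: (TPR, FPR)} for binary labels and predictions (hypothetical helper)."""
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        in_group = group == g
        tpr = y_pred[in_group & (y_true == 1)].mean()  # P(Ŷ=1 | Y=1, A=g)
        fpr = y_pred[in_group & (y_true == 0)].mean()  # P(Ŷ=1 | Y=0, A=g)
        rates[g.item()] = (float(tpr), float(fpr))
    return rates

# Toy data, made up for illustration: two positives and two negatives per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

rates = tpr_fpr_by_group(y_true, y_pred, group)
tpr_gap = abs(rates[0][0] - rates[1][0])
fpr_gap = abs(rates[0][1] - rates[1][1])
print(rates)                  # {0: (1.0, 0.5), 1: (0.5, 0.0)}
print(max(tpr_gap, fpr_gap))  # 0.5
```

Each group needs at least one positive and one negative example, otherwise the corresponding rate is undefined.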
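Calibration: P(Y=1 | Ŷ=p, A=0) = P(Y=1 | Ŷ=p, A=1) = p, where Ŷ here denotes the predicted score rather than a binary label. A predicted risk of 70% should correspond to a 70% probability of the outcome, regardless of group membership. This is the standard that Northpointe, the developer of COMPAS, was applying.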
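The reason error-rate criteria and calibration-style criteria pull against each other comes down to arithmetic. For a binary classifier, Bayes' rule gives the positive predictive value as PPV = TPR·p / (TPR·p + FPR·(1−p)), where p is the group's base rate P(Y=1 | A). The sketch below uses made-up numbers and treats equal PPV across groups as the binary analogue of calibration, which is a simplifying assumption for illustration.

```python
def ppv(tpr, fpr, base_rate):
    """Positive predictive value P(Y=1 | Ŷ=1), derived from error rates via Bayes' rule."""
    true_pos = tpr * base_rate
    false_pos = fpr * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# Identical error rates for both groups, so the error-rate criterion holds exactly.
tpr, fpr = 0.8, 0.2

# Hypothetical base rates that differ between the two groups.
ppv_a = ppv(tpr, fpr, base_rate=0.5)
ppv_b = ppv(tpr, fpr, base_rate=0.2)
print(round(ppv_a, 3), round(ppv_b, 3))  # 0.8 0.5
```

A positive prediction corresponds to an 80% chance of the outcome in one group and 50% in the other, so the prediction does not mean the same thing across groups. Unless base rates are equal or the classifier is perfect, equal error rates and equal predictive value cannot both hold.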
Individual fairness: Similar individuals should receive similar predictions. This requires a similarity metric over individuals, and the choice of that metric is itself contentious.
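One common way to make "similar individuals, similar predictions" concrete is a Lipschitz-style condition: |f(x_i) − f(x_j)| ≤ L · d(x_i, x_j). The sketch below assumes Euclidean distance as the similarity metric d and L = 1. Both are arbitrary choices made only for illustration, and the function name is hypothetical.

```python
import itertools
import numpy as np

def lipschitz_violations(X, scores, lipschitz=1.0):
    """List index pairs whose score gap exceeds lipschitz * distance (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    scores = np.asarray(scores, dtype=float)
    violations = []
    for i, j in itertools.combinations(range(len(X)), 2):
        distance = np.linalg.norm(X[i] - X[j])  # assumed similarity metric
        if abs(scores[i] - scores[j]) > lipschitz * distance:
            violations.append((i, j))
    return violations

# Toy data, made up for illustration: individuals 0 and 1 are nearly identical
# but receive very different scores.
X = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0]]
scores = [0.2, 0.9, 0.8]
print(lipschitz_violations(X, scores))  # [(0, 1)]
```

Changing the distance function or rescaling a feature changes which pairs count as violations, which is why the choice of metric carries so much weight.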
from fairlearn.metrics import MetricFrame, demographic_parity_difference, equalized_odds_difference
from sklearn.metrics import accuracy_score, precision_score

# Assumes y_test, y_pred, and sensitive_features already exist and are aligned row by row.

# Evaluate model by demographic group
mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "precision": precision_score},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=sensitive_features,  # e.g., gender column
)
print("By group:")
print(mf.by_group)
print()
print(f"Overall accuracy: {mf.overall['accuracy']:.3f}")
print(f"Accuracy difference across groups: {mf.difference()['accuracy']:.3f}")

# Specific fairness metrics (each is a gap between groups; 0 means parity on that metric)
dpd = demographic_parity_difference(y_test, y_pred, sensitive_features=sensitive_features)
eod = equalized_odds_difference(y_test, y_pred, sensitive_features=sensitive_features)
print(f"Demographic parity difference: {dpd:.3f} (0 = equal selection rates)")
print(f"Equalized odds difference: {eod:.3f} (0 = equal TPR and FPR)")