Analyze documented AI ethics failures — and what should have been done differently
These cases are not historical curiosities. The systemic failures they expose — inadequate testing, misaligned incentives, lack of affected-community input, poor governance — are present in most organizations deploying AI today.
Amazon built a machine learning system to automatically review resumes and rate candidates. The system was trained on resumes submitted over 10 years — a period when Amazon's workforce was predominantly male. The model learned to penalize resumes that included the word 'women's' (as in 'women's chess club') and to downgrade graduates of all-women's colleges. Amazon scrapped the tool in 2018 after internal audits discovered the bias. **What went wrong**: The team optimized for 'who we hired in the past' rather than 'who we should hire'. Historical data encoded historical discrimination. No bias evaluation was performed on protected attributes before deployment. **What should have happened**: Evaluate model predictions stratified by gender before deployment. Test with counterfactual inputs ('women's chess club' vs 'chess club'). Involve HR, legal, and DE&I stakeholders in the review process.
Optum's risk-scoring algorithm, used by major US health systems to identify high-risk patients for care management, systematically underestimated the health needs of Black patients. The algorithm used healthcare cost as a proxy for health need. Black patients with the same health conditions had lower historical costs — not because they were healthier, but because they had less access to healthcare. The model interpreted lower costs as lower need, resulting in Black patients being significantly under-referred to care management programs. **The lesson**: Proxy variables that seem neutral can encode existing inequalities. Healthcare cost is not a neutral measure of healthcare need. The choice of what to optimize matters enormously.