Name-based demographic inference and the unequal distribution of misrecognition

Most studies that use tools like Genderize.io report a single accuracy number: the overall error rate. Jeffrey Lockhart, Molly King, and Christin Munsch wanted to know what that average conceals. Their answer, published in Nature Human Behaviour in 2023, was: a lot.

The method

The researchers surveyed 19,924 scholars who had authored articles in sociology, economics, and communication journals indexed in Web of Science between 2015 and 2020. Critically, they asked each scholar to self-identify their gender and race/ethnicity — then compared those self-reports against the outputs of four gender inference tools (including Genderize.io) and four race/ethnicity tools.

This is rare. Most validation studies compare tool outputs against other imputed labels or editorial records. Lockhart and colleagues compared them against what people actually say about themselves.
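
For concreteness, the comparison step looks something like the sketch below. This is not the authors' code: the scholar list and self-reported labels are invented for illustration, and only the public Genderize.io endpoint (https://api.genderize.io) is real.

```python
# A minimal sketch of the validation setup, not the paper's actual pipeline.
# It queries the public Genderize.io API and compares each prediction
# against a (hypothetical) self-reported label.
import requests

# Hypothetical sample: (first name, self-reported gender)
scholars = [
    ("maria", "woman"),
    ("wei", "woman"),
    ("james", "man"),
]

# Map the tool's binary labels onto the self-report vocabulary for comparison.
label_map = {"male": "man", "female": "woman"}

for name, self_report in scholars:
    resp = requests.get("https://api.genderize.io", params={"name": name})
    resp.raise_for_status()
    data = resp.json()  # e.g. {"name": "wei", "gender": "male", "probability": 0.66, ...}
    predicted = label_map.get(data["gender"])  # None when the API returns no guess
    print(f"{name}: tool={data['gender']} (p={data.get('probability')}), "
          f"self-report={self_report}, match={predicted == self_report}")
```

Note that the free tier of the API is rate-limited, so a full study-scale run would need an API key and batching; the sketch only shows the shape of the comparison.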

The headline numbers

Genderize.io's overall gender error rate was 4.6% — a number that looks reassuring in isolation. But the errors were not distributed equally.

The tool misgendered 43% of Chinese women in the sample. Overall, it was wrong 3.5 times as often for women as for men. And it was wrong for every one of the 139 nonbinary scholars in the dataset, a 100% failure rate for a group that the binary classification framework cannot represent by design.

Why it matters

The unequal distribution of error is the paper's central point. An overall accuracy of 95% sounds rigorous. But if the 5% who are misclassified are disproportionately women of color, scholars from East Asia, or gender minorities, then the tool is not just imprecise — it is systematically erasing the people who are already most likely to be undercounted in the research that uses it.
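
A back-of-envelope calculation makes the point. The population shares and error rates below are invented, chosen only to echo the 3.5x disparity reported above:

```python
# Hypothetical illustration: a 3.5x error-rate disparity hiding inside a
# reassuring overall average. Shares and rates are invented for the example.
share_men, share_women = 0.6, 0.4
err_men = 0.02
err_women = 3.5 * err_men  # 7% error rate, 3.5x the rate for men

overall = share_men * err_men + share_women * err_women
print(f"overall error rate: {overall:.1%}")  # 4.0%, which looks fine

# ...yet women bear the large majority of all misclassifications:
print(f"share of errors borne by women: {share_women * err_women / overall:.0%}")  # 70%
```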

The authors argue this creates a compounding problem. Studies that use name-based gender inference to measure gender gaps will undercount the very groups experiencing the largest gaps. The measurement tool and the phenomenon it measures fail in the same direction.

The practical takeaway

Lockhart, King, and Munsch do not argue that name-based inference should never be used. They argue that researchers should report error rates by subgroup, not just overall, and should consider how differential misclassification might bias their findings. For tools like Genderize.io, the paper is both a challenge and a roadmap: the tool works well for Western European names, but its value to the research community depends on closing the gap for the populations it currently serves least.
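
The subgroup reporting the authors call for is straightforward once predictions and self-reports sit side by side. A minimal sketch with pandas, using fabricated rows (the column names and category labels are assumptions, not the paper's dataset):

```python
# Fabricated data illustrating disaggregated error reporting.
import pandas as pd

df = pd.DataFrame({
    "self_report":     ["woman", "woman", "man", "man", "woman", "nonbinary"],
    "tool_prediction": ["woman", "man",   "man", "man", "man",   "man"],
    "region":          ["East Asia", "East Asia", "W. Europe",
                        "W. Europe", "W. Europe", "W. Europe"],
})
df["error"] = df["self_report"] != df["tool_prediction"]

# The single number most studies report:
print(f"overall error rate: {df['error'].mean():.1%}")

# The disaggregated reporting the authors recommend:
print(df.groupby(["region", "self_report"])["error"].mean())
```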

Authors

Jeffrey W. Lockhart, Molly M. King, and Christin Munsch

Year

2023

Categories

Methodology

Original article

https://www.nature.com/articles/s41562-023-01587-9