Ethnicity and race inference from names is a common task in social science, typically handled by algorithms like BISG (Bayesian Improved Surname Geocoding) or dictionary-based classifiers. But how well do large language models perform on this task out of the box? This benchmark evaluates LLMs on name-based ethnicity inference across four datasets with ground-truth labels.
The benchmark uses 100-person samples from each dataset. Models are given only a person's name and asked to predict a categorical label; in the rows marked +geo below, the model is additionally given the person's recorded location, mirroring the geographic signal BISG relies on. All predictions use structured outputs (JSON schema constraints) to ensure valid labels. The metric is macro-averaged accuracy: we compute per-label accuracy within each dataset, average across labels (so class imbalance does not reward majority-class guessing), then average across datasets.
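For concreteness, here is a minimal sketch of the scoring, assuming predictions and gold labels are collected as parallel lists per dataset (all names here are illustrative, not the benchmark's actual code):

```python
from collections import defaultdict

def macro_accuracy(y_true, y_pred):
    """Average of per-label accuracies: accuracy on each gold label is
    computed separately, then averaged with equal weight, so rare classes
    count as much as common ones."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[label] / total[label] for label in total) / len(total)

def benchmark_score(datasets):
    """datasets maps name -> (y_true, y_pred); the headline number is the
    unweighted mean of per-dataset macro accuracies."""
    return sum(macro_accuracy(t, p) for t, p in datasets.values()) / len(datasets)
```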
Datasets
- North Carolina — 100 voters from the NC voter file, labeled by self-reported race (WHITE, BLACK, HISPANIC, ASIAN)
- Florida — 100 voters from the FL voter file, same race categories
- Lebanon — 100 individuals from Lebanese voter rolls, labeled by religious sect (Sunni, Shia, Druze, Maronite, Roman Orthodox, Roman Catholic, Armenian Orthodox)
- India — 100 Indian politicians from reserved constituencies, labeled as Scheduled Caste (SC) or Scheduled Tribe (ST)
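As an illustration of the structured-output setup, a schema-constrained request for the NC/FL categories could look like the sketch below. This is a hedged example built on the OpenAI structured-outputs API; the benchmark's actual prompts and schemas may differ, and the model id is a placeholder:

```python
import json
from openai import OpenAI

client = OpenAI()
RACE_LABELS = ["WHITE", "BLACK", "HISPANIC", "ASIAN"]

def predict_race(name: str, model: str = "gpt-5.4-nano") -> str:
    """Ask the model for one of the four NC/FL labels; the schema's enum
    makes any out-of-vocabulary answer impossible."""
    resp = client.chat.completions.create(
        model=model,  # placeholder id, not necessarily the benchmark's
        messages=[{
            "role": "user",
            "content": f"Predict the most likely self-reported race of a US voter named {name}.",
        }],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "race_prediction",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"label": {"type": "string", "enum": RACE_LABELS}},
                    "required": ["label"],
                    "additionalProperties": False,
                },
            },
        },
    )
    return json.loads(resp.choices[0].message.content)["label"]
```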
Macro-Average Accuracy
| Model | Provider | Geo | Average | NC | White | Black | Hispanic | Asian | FL | White | Black | Hispanic | Asian | Lebanon | Sunni | Shia | Druze | Maronite | R. Orth. | R. Cath. | Armenian | India | SC | ST | $/M In | $/M Out | $/1M Names |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 (low) | OpenAI | | 0.622 | 0.702 | 0.947 | 0.316 | 0.929 | 0.618 | 0.624 | 0.900 | 0.200 | 0.846 | 0.552 | 0.439 | 0.706 | 0.143 | 0.118 | 1.000 | 0.250 | 0.000 | 0.857 | 0.723 | 0.905 | 0.541 | 2.50 | 15.00 | $2,195 |
| GPT-5.4 (low) | OpenAI | +geo | 0.710 | 0.716 | 0.947 | 0.368 | 0.929 | 0.618 | 0.726 | 0.933 | 0.533 | 0.885 | 0.552 | 0.613 | 0.588 | 0.571 | 0.706 | 0.769 | 0.500 | 0.154 | 1.000 | 0.785 | 0.921 | 0.649 | 2.50 | 15.00 | $2,195 |
| GPT-5.4 (high) | OpenAI | | 0.616 | 0.716 | 1.000 | 0.316 | 0.929 | 0.618 | 0.616 | 0.900 | 0.133 | 0.846 | 0.586 | 0.392 | 0.765 | 0.071 | 0.118 | 0.769 | 0.167 | 0.000 | 0.857 | 0.739 | 0.937 | 0.541 | 2.50 | 15.00 | $6,425 |
| GPT-5.4 (high) | OpenAI | +geo | 0.695 | 0.719 | 0.947 | 0.316 | 0.964 | 0.647 | 0.693 | 0.933 | 0.333 | 0.885 | 0.621 | 0.639 | 0.588 | 0.571 | 0.647 | 0.692 | 0.667 | 0.308 | 1.000 | 0.728 | 0.889 | 0.568 | 2.50 | 15.00 | $6,425 |
| GPT-5.4 Mini (low) | OpenAI | | 0.596 | 0.670 | 1.000 | 0.263 | 0.857 | 0.559 | 0.672 | 0.900 | 0.467 | 0.769 | 0.552 | 0.391 | 0.529 | 0.357 | 0.059 | 0.615 | 0.250 | 0.000 | 0.929 | 0.650 | 0.921 | 0.378 | 0.75 | 4.50 | $528 |
| GPT-5.4 Mini (low) | OpenAI | +geo | 0.563 | 0.457 | 0.947 | 0.105 | 0.393 | 0.382 | 0.428 | 0.967 | 0.000 | 0.538 | 0.207 | 0.607 | 0.706 | 0.643 | 0.706 | 0.692 | 0.500 | 0.077 | 0.929 | 0.760 | 0.952 | 0.568 | 0.75 | 4.50 | $528 |
| GPT-5.4 Mini (high) | OpenAI | | 0.582 | 0.581 | 0.947 | 0.211 | 0.607 | 0.559 | 0.652 | 0.967 | 0.467 | 0.692 | 0.483 | 0.424 | 0.588 | 0.357 | 0.000 | 0.769 | 0.250 | 0.000 | 1.000 | 0.671 | 0.937 | 0.405 | 0.75 | 4.50 | $1,406 |
| GPT-5.4 Nano (low) | OpenAI | | 0.636 | 0.755 | 0.947 | 0.526 | 0.929 | 0.618 | 0.648 | 0.767 | 0.467 | 0.808 | 0.552 | 0.399 | 0.588 | 0.214 | 0.059 | 1.000 | 0.000 | 0.000 | 0.929 | 0.743 | 0.810 | 0.676 | 0.20 | 1.25 | $59 |
| GPT-5.4 Nano (low) | OpenAI | +geo | 0.719 | 0.761 | 0.842 | 0.684 | 0.929 | 0.588 | 0.690 | 0.900 | 0.533 | 0.808 | 0.517 | 0.621 | 0.647 | 0.857 | 0.824 | 0.923 | 0.167 | 0.000 | 0.929 | 0.804 | 0.905 | 0.703 | 0.20 | 1.25 | $59 |
| GPT-5.4 Nano (high) | OpenAI | | 0.632 | 0.737 | 0.947 | 0.526 | 0.857 | 0.618 | 0.648 | 0.933 | 0.333 | 0.808 | 0.517 | 0.435 | 0.706 | 0.429 | 0.059 | 0.923 | 0.000 | 0.000 | 0.929 | 0.707 | 0.873 | 0.541 | 0.20 | 1.25 | $83 |
| GPT-5.4 Nano (high) | OpenAI | +geo | 0.682 | 0.671 | 0.842 | 0.368 | 0.857 | 0.618 | 0.699 | 0.967 | 0.400 | 0.808 | 0.621 | 0.603 | 0.588 | 0.786 | 0.824 | 0.846 | 0.250 | 0.000 | 0.929 | 0.753 | 0.857 | 0.649 | 0.20 | 1.25 | $83 |
| Gemini 3 Flash (minimal) | Google | | 0.715 | 0.821 | 0.789 | 0.947 | 0.929 | 0.618 | 0.758 | 0.900 | 0.733 | 0.846 | 0.552 | 0.499 | 0.412 | 0.714 | 0.353 | 0.846 | 0.167 | 0.000 | 1.000 | 0.781 | 0.778 | 0.784 | 0.50 | 3.00 | $74 |
| Gemini 3 Flash (minimal) | Google | +geo | 0.762 | 0.832 | 0.842 | 0.947 | 0.893 | 0.647 | 0.782 | 0.933 | 0.800 | 0.808 | 0.586 | 0.642 | 0.706 | 0.786 | 0.824 | 0.692 | 0.333 | 0.154 | 1.000 | 0.793 | 0.857 | 0.730 | 0.50 | 3.00 | $74 |
| Gemini 3 Flash (high) | Google | | 0.674 | 0.742 | 0.895 | 0.526 | 0.929 | 0.618 | 0.733 | 0.933 | 0.533 | 0.846 | 0.621 | 0.508 | 0.529 | 0.714 | 0.294 | 0.769 | 0.250 | 0.000 | 1.000 | 0.713 | 0.778 | 0.649 | 0.50 | 3.00 | $629 |
| Gemini 3 Flash (high) | Google | +geo | 0.739 | 0.847 | 0.947 | 0.895 | 0.929 | 0.618 | 0.757 | 0.933 | 0.667 | 0.808 | 0.621 | 0.614 | 0.588 | 0.714 | 0.824 | 0.846 | 0.250 | 0.077 | 1.000 | 0.736 | 0.905 | 0.568 | 0.50 | 3.00 | $629 |
| Gemini 3.1 Flash Lite (minimal) | Google | | 0.669 | 0.774 | 0.947 | 0.632 | 0.929 | 0.588 | 0.733 | 0.933 | 0.533 | 0.846 | 0.621 | 0.532 | 0.588 | 0.714 | 0.412 | 1.000 | 0.083 | 0.000 | 0.929 | 0.637 | 0.571 | 0.703 | 0.25 | 1.50 | $43 |
| Gemini 3.1 Flash Lite (minimal) | Google | +geo | 0.757 | 0.774 | 0.947 | 0.632 | 0.929 | 0.588 | 0.757 | 0.967 | 0.667 | 0.808 | 0.586 | 0.662 | 0.706 | 0.714 | 0.941 | 0.692 | 0.500 | 0.154 | 0.929 | 0.836 | 0.889 | 0.784 | 0.25 | 1.50 | $43 |
| Gemini 3.1 Flash Lite (high) | Google | | 0.658 | 0.755 | 0.947 | 0.526 | 0.929 | 0.618 | 0.734 | 0.900 | 0.533 | 0.846 | 0.655 | 0.501 | 0.412 | 0.714 | 0.353 | 0.769 | 0.333 | 0.000 | 0.929 | 0.640 | 0.524 | 0.757 | 0.25 | 1.50 | $440 |
| Gemini 3.1 Flash Lite (high) | Google | +geo | 0.752 | 0.808 | 0.947 | 0.737 | 0.929 | 0.618 | 0.757 | 0.933 | 0.667 | 0.808 | 0.621 | 0.658 | 0.647 | 0.643 | 0.882 | 0.846 | 0.583 | 0.077 | 0.929 | 0.785 | 0.921 | 0.649 | 0.25 | 1.50 | $440 |
| Qwen3.5-35B | Qwen | | 0.625 | 0.721 | 0.947 | 0.421 | 0.929 | 0.588 | 0.707 | 0.933 | 0.467 | 0.808 | 0.621 | 0.398 | 0.588 | 0.357 | 0.059 | 0.769 | 0.083 | 0.000 | 0.929 | 0.672 | 0.857 | 0.486 | 0.16 | 1.30 | — |
| Qwen3.5-397B | Qwen | | 0.609 | 0.742 | 0.947 | 0.474 | 0.929 | 0.618 | 0.666 | 0.867 | 0.333 | 0.808 | 0.655 | 0.397 | 0.588 | 0.286 | 0.059 | 0.769 | 0.167 | 0.000 | 0.929 | 0.631 | 0.667 | 0.595 | 0.39 | 2.34 | — |
| DeepSeek-V3.2 | DeepSeek | | 0.644 | 0.748 | 0.895 | 0.579 | 0.929 | 0.588 | 0.690 | 0.833 | 0.533 | 0.808 | 0.586 | 0.419 | 0.765 | 0.071 | 0.059 | 0.846 | 0.250 | 0.154 | 0.786 | 0.719 | 0.762 | 0.676 | 0.26 | 0.38 | — |
| DeepSeek-V3.2 | DeepSeek | +geo | 0.731 | 0.815 | 0.895 | 0.789 | 0.929 | 0.647 | 0.766 | 0.900 | 0.733 | 0.846 | 0.586 | 0.583 | 0.882 | 0.643 | 0.529 | 0.846 | 0.167 | 0.154 | 0.857 | 0.762 | 0.794 | 0.730 | 0.26 | 0.38 | — |
| GPT-OSS-120B | OpenAI | | 0.494 | 0.543 | 0.895 | 0.158 | 0.679 | 0.441 | 0.512 | 0.700 | 0.133 | 0.731 | 0.483 | 0.321 | 0.647 | 0.071 | 0.118 | 0.538 | 0.083 | 0.077 | 0.714 | 0.600 | 0.794 | 0.405 | 0.04 | 0.19 | — |
| GPT-OSS-120B | OpenAI | +geo | 0.559 | 0.620 | 0.895 | 0.263 | 0.821 | 0.500 | 0.524 | 0.700 | 0.333 | 0.615 | 0.448 | 0.418 | 0.588 | 0.643 | 0.353 | 0.615 | 0.083 | 0.000 | 0.643 | 0.676 | 0.730 | 0.622 | 0.04 | 0.19 | — |
The White/Black/Hispanic/Asian columns following NC and FL are per-label accuracies within that state's sample, as are the sect and caste columns following Lebanon and India. $/M In and $/M Out are prices per million input and output tokens; $/1M Names is the estimated cost to classify one million names. Cells marked — indicate runs still in progress; the table will be updated as runs complete.
Cost vs Accuracy
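A cost-accuracy scatter can be built directly from the table above. Here is a minimal matplotlib sketch, assuming the table has been exported to a CSV; the filename and column names are assumptions, not part of the benchmark:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("results.csv")  # assumed export of the table above
ax = df.plot.scatter(x="usd_per_1m_names", y="avg_macro_accuracy", logx=True)
for _, row in df.iterrows():
    # Label each point with its model name, offset slightly for readability.
    ax.annotate(row["model"], (row["usd_per_1m_names"], row["avg_macro_accuracy"]),
                fontsize=7, xytext=(3, 3), textcoords="offset points")
ax.set_xlabel("$ per 1M names (log scale)")
ax.set_ylabel("Macro-average accuracy")
plt.tight_layout()
plt.savefig("cost_vs_accuracy.png", dpi=200)
```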
Observations
- Gemini 3 Flash leads: the best no-geo average (0.715, with 82% on NC, 76% on FL, and 50% on Lebanon) and, with +geo, the best average overall (0.762), all at minimal thinking and a near-cheapest $74 per million names.
- Higher reasoning often hurts: GPT-5.4 Mini drops from 67% to 58% on NC when switching from low to high reasoning effort. The models seem to "overthink" ambiguous names toward the majority class (WHITE).
- Nano beats Mini: GPT-5.4 Nano (low) scores 75% on NC versus Mini's 67%, and wins on overall average (0.636 vs 0.596). The smaller model appears less prone to defaulting to WHITE.
- BLACK is the hardest US category: without geography, no model other than Gemini 3 Flash (minimal) (0.947 on NC) exceeds 0.73 on BLACK, and most runs fall below 0.55, versus 0.70-1.00 on WHITE. Models systematically under-predict Black identity for names without strong ethnic markers.
- Lebanon exposes limits: without geography, Lebanon macro averages run from 0.32 (GPT-OSS-120B) to 0.53. Armenian Orthodox (distinctive names) and Maronite are predicted well, but Druze, Roman Orthodox, and Roman Catholic are near zero for most models; geography helps Druze dramatically (GPT-5.4 Nano (low) jumps from 0.059 to 0.824).
- India SC/ST is deceptively high: averages of 0.60-0.78, but most models favor SC (the majority class), with SC accuracy (0.52-0.94) generally well above ST (0.38-0.78). A quick check for this kind of bias is sketched below.
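One quick way to verify the majority-class effects flagged above is to compare each model's predicted label distribution against the true one; a minimal sketch, with illustrative names:

```python
from collections import Counter

def label_shift(y_true, y_pred):
    """Print how often each label occurs in the gold data versus the
    predictions; pred >> true signals the model over-predicts that label."""
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    for label in sorted(true_counts | pred_counts):
        print(f"{label:>10}: true={true_counts[label]:3d}  pred={pred_counts[label]:3d}")
```

A model that "overthinks" toward WHITE, for instance, will show pred well above true for WHITE and well below it for BLACK.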
Last updated: March 19, 2026. Results are being actively collected.