Ethnicity and race inference from names is a common task in social science, typically handled by algorithms like BISG (Bayesian Improved Surname Geocoding) or dictionary-based classifiers. But how well do large language models perform on this task out of the box? This benchmark evaluates LLMs on name-based ethnicity inference across four datasets with ground-truth labels.
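For readers unfamiliar with the baseline, BISG combines a surname-conditional race distribution with a geography-conditional term via Bayes' rule. The sketch below is illustrative only: the probability tables are made-up stand-ins for the Census surname list and block-group counts the real method uses.

```python
# BISG-style posterior: P(race | surname, geo) ∝ P(race | surname) * P(geo | race).
# All probabilities below are illustrative placeholders, not Census values.

p_race_given_surname = {
    "garcia": {"white": 0.05, "black": 0.01, "hispanic": 0.92, "asian": 0.02},
}

# Share of each group's national population living in the tract.
p_geo_given_race = {
    "tract_A": {"white": 0.0010, "black": 0.0002, "hispanic": 0.0030, "asian": 0.0005},
}

def bisg(surname: str, tract: str) -> dict:
    """Return the normalized posterior P(race | surname, tract)."""
    prior = p_race_given_surname[surname.lower()]
    geo = p_geo_given_race[tract]
    unnorm = {r: prior[r] * geo[r] for r in prior}
    z = sum(unnorm.values())
    return {r: v / z for r, v in unnorm.items()}

post = bisg("Garcia", "tract_A")
```

Geography reweights the surname prior, which is exactly the extra signal the "+geo" runs in the table below hand to the LLMs.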

The benchmark uses 100-person samples from each dataset. Models are given only a person's name and asked to predict a categorical label; runs marked "+geo" additionally supply the person's geographic location, mirroring the geocoding signal BISG relies on. All predictions use structured outputs (JSON-schema constraints) to guarantee valid labels. The metric is macro-averaged accuracy: we compute per-label accuracy within each dataset, average across labels (correcting for class imbalance), then average across datasets.
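The two-level averaging can be sketched as follows; the `results` data here is a hypothetical toy example, not benchmark output:

```python
from collections import defaultdict

def macro_accuracy(pairs):
    """Per-label accuracy averaged over labels, so rare labels count equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for true, pred in pairs:
        totals[true] += 1
        hits[true] += (pred == true)
    return sum(hits[label] / totals[label] for label in totals) / len(totals)

# Toy (true, predicted) pairs for two datasets.
results = {
    "nc": [("white", "white"), ("white", "white"), ("black", "white"), ("black", "black")],
    "fl": [("hispanic", "hispanic"), ("asian", "hispanic")],
}

# Benchmark score: average of per-dataset macro accuracies.
score = sum(macro_accuracy(pairs) for pairs in results.values()) / len(results)
```

Note that because the averaging is over labels rather than examples, a model that always predicts the majority label scores poorly even on imbalanced datasets.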

Datasets

Macro-Average Accuracy

Effort = reasoning-effort setting; "+geo" = geographic location provided alongside the name. $/M In and $/M Out = price per million input and output tokens; $/1M Names = estimated cost to classify one million names. SC/ST = Scheduled Caste/Scheduled Tribe.

| Model | Provider | Effort | Geo | Avg | NC Avg | NC White | NC Black | NC Hispanic | NC Asian | FL Avg | FL White | FL Black | FL Hispanic | FL Asian | Lebanon Avg | Sunni | Shia | Druze | Maronite | R. Orth. | R. Cath. | Armenian | India Avg | SC | ST | $/M In | $/M Out | $/1M Names |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 (low) | OpenAI | Low | | 0.622 | 0.702 | 0.947 | 0.316 | 0.929 | 0.618 | 0.624 | 0.900 | 0.200 | 0.846 | 0.552 | 0.439 | 0.706 | 0.143 | 0.118 | 1.000 | 0.250 | 0.000 | 0.857 | 0.723 | 0.905 | 0.541 | 2.50 | 15.00 | $2,195 |
| GPT-5.4 (low) | OpenAI | Low | +geo | 0.710 | 0.716 | 0.947 | 0.368 | 0.929 | 0.618 | 0.726 | 0.933 | 0.533 | 0.885 | 0.552 | 0.613 | 0.588 | 0.571 | 0.706 | 0.769 | 0.500 | 0.154 | 1.000 | 0.785 | 0.921 | 0.649 | 2.50 | 15.00 | $2,195 |
| GPT-5.4 (high) | OpenAI | High | | 0.616 | 0.716 | 1.000 | 0.316 | 0.929 | 0.618 | 0.616 | 0.900 | 0.133 | 0.846 | 0.586 | 0.392 | 0.765 | 0.071 | 0.118 | 0.769 | 0.167 | 0.000 | 0.857 | 0.739 | 0.937 | 0.541 | 2.50 | 15.00 | $6,425 |
| GPT-5.4 (high) | OpenAI | High | +geo | 0.695 | 0.719 | 0.947 | 0.316 | 0.964 | 0.647 | 0.693 | 0.933 | 0.333 | 0.885 | 0.621 | 0.639 | 0.588 | 0.571 | 0.647 | 0.692 | 0.667 | 0.308 | 1.000 | 0.728 | 0.889 | 0.568 | 2.50 | 15.00 | $6,425 |
| GPT-5.4 Mini (low) | OpenAI | Low | | 0.596 | 0.670 | 1.000 | 0.263 | 0.857 | 0.559 | 0.672 | 0.900 | 0.467 | 0.769 | 0.552 | 0.391 | 0.529 | 0.357 | 0.059 | 0.615 | 0.250 | 0.000 | 0.929 | 0.650 | 0.921 | 0.378 | 0.75 | 4.50 | $528 |
| GPT-5.4 Mini (low) | OpenAI | Low | +geo | 0.563 | 0.457 | 0.947 | 0.105 | 0.393 | 0.382 | 0.428 | 0.967 | 0.000 | 0.538 | 0.207 | 0.607 | 0.706 | 0.643 | 0.706 | 0.692 | 0.500 | 0.077 | 0.929 | 0.760 | 0.952 | 0.568 | 0.75 | 4.50 | $528 |
| GPT-5.4 Mini (high) | OpenAI | High | | 0.582 | 0.581 | 0.947 | 0.211 | 0.607 | 0.559 | 0.652 | 0.967 | 0.467 | 0.692 | 0.483 | 0.424 | 0.588 | 0.357 | 0.000 | 0.769 | 0.250 | 0.000 | 1.000 | 0.671 | 0.937 | 0.405 | 0.75 | 4.50 | $1,406 |
| GPT-5.4 Nano (low) | OpenAI | Low | | 0.636 | 0.755 | 0.947 | 0.526 | 0.929 | 0.618 | 0.648 | 0.767 | 0.467 | 0.808 | 0.552 | 0.399 | 0.588 | 0.214 | 0.059 | 1.000 | 0.000 | 0.000 | 0.929 | 0.743 | 0.810 | 0.676 | 0.20 | 1.25 | $59 |
| GPT-5.4 Nano (low) | OpenAI | Low | +geo | 0.719 | 0.761 | 0.842 | 0.684 | 0.929 | 0.588 | 0.690 | 0.900 | 0.533 | 0.808 | 0.517 | 0.621 | 0.647 | 0.857 | 0.824 | 0.923 | 0.167 | 0.000 | 0.929 | 0.804 | 0.905 | 0.703 | 0.20 | 1.25 | $59 |
| GPT-5.4 Nano (high) | OpenAI | High | | 0.632 | 0.737 | 0.947 | 0.526 | 0.857 | 0.618 | 0.648 | 0.933 | 0.333 | 0.808 | 0.517 | 0.435 | 0.706 | 0.429 | 0.059 | 0.923 | 0.000 | 0.000 | 0.929 | 0.707 | 0.873 | 0.541 | 0.20 | 1.25 | $83 |
| GPT-5.4 Nano (high) | OpenAI | High | +geo | 0.682 | 0.671 | 0.842 | 0.368 | 0.857 | 0.618 | 0.699 | 0.967 | 0.400 | 0.808 | 0.621 | 0.603 | 0.588 | 0.786 | 0.824 | 0.846 | 0.250 | 0.000 | 0.929 | 0.753 | 0.857 | 0.649 | 0.20 | 1.25 | $83 |
| Gemini 3 Flash (minimal) | Google | Minimal | | 0.715 | 0.821 | 0.789 | 0.947 | 0.929 | 0.618 | 0.758 | 0.900 | 0.733 | 0.846 | 0.552 | 0.499 | 0.412 | 0.714 | 0.353 | 0.846 | 0.167 | 0.000 | 1.000 | 0.781 | 0.778 | 0.784 | 0.50 | 3.00 | $74 |
| Gemini 3 Flash (minimal) | Google | Minimal | +geo | 0.762 | 0.832 | 0.842 | 0.947 | 0.893 | 0.647 | 0.782 | 0.933 | 0.800 | 0.808 | 0.586 | 0.642 | 0.706 | 0.786 | 0.824 | 0.692 | 0.333 | 0.154 | 1.000 | 0.793 | 0.857 | 0.730 | 0.50 | 3.00 | $74 |
| Gemini 3 Flash (high) | Google | High | | 0.674 | 0.742 | 0.895 | 0.526 | 0.929 | 0.618 | 0.733 | 0.933 | 0.533 | 0.846 | 0.621 | 0.508 | 0.529 | 0.714 | 0.294 | 0.769 | 0.250 | 0.000 | 1.000 | 0.713 | 0.778 | 0.649 | 0.50 | 3.00 | $629 |
| Gemini 3 Flash (high) | Google | High | +geo | 0.739 | 0.847 | 0.947 | 0.895 | 0.929 | 0.618 | 0.757 | 0.933 | 0.667 | 0.808 | 0.621 | 0.614 | 0.588 | 0.714 | 0.824 | 0.846 | 0.250 | 0.077 | 1.000 | 0.736 | 0.905 | 0.568 | 0.50 | 3.00 | $629 |
| Gemini 3.1 Flash Lite (minimal) | Google | Minimal | | 0.669 | 0.774 | 0.947 | 0.632 | 0.929 | 0.588 | 0.733 | 0.933 | 0.533 | 0.846 | 0.621 | 0.532 | 0.588 | 0.714 | 0.412 | 1.000 | 0.083 | 0.000 | 0.929 | 0.637 | 0.571 | 0.703 | 0.25 | 1.50 | $43 |
| Gemini 3.1 Flash Lite (minimal) | Google | Minimal | +geo | 0.757 | 0.774 | 0.947 | 0.632 | 0.929 | 0.588 | 0.757 | 0.967 | 0.667 | 0.808 | 0.586 | 0.662 | 0.706 | 0.714 | 0.941 | 0.692 | 0.500 | 0.154 | 0.929 | 0.836 | 0.889 | 0.784 | 0.25 | 1.50 | $43 |
| Gemini 3.1 Flash Lite (high) | Google | High | | 0.658 | 0.755 | 0.947 | 0.526 | 0.929 | 0.618 | 0.734 | 0.900 | 0.533 | 0.846 | 0.655 | 0.501 | 0.412 | 0.714 | 0.353 | 0.769 | 0.333 | 0.000 | 0.929 | 0.640 | 0.524 | 0.757 | 0.25 | 1.50 | $440 |
| Gemini 3.1 Flash Lite (high) | Google | High | +geo | 0.752 | 0.808 | 0.947 | 0.737 | 0.929 | 0.618 | 0.757 | 0.933 | 0.667 | 0.808 | 0.621 | 0.658 | 0.647 | 0.643 | 0.882 | 0.846 | 0.583 | 0.077 | 0.929 | 0.785 | 0.921 | 0.649 | 0.25 | 1.50 | $440 |
| Qwen3.5-35B | Qwen | | | 0.625 | 0.721 | 0.947 | 0.421 | 0.929 | 0.588 | 0.707 | 0.933 | 0.467 | 0.808 | 0.621 | 0.398 | 0.588 | 0.357 | 0.059 | 0.769 | 0.083 | 0.000 | 0.929 | 0.672 | 0.857 | 0.486 | 0.16 | 1.30 | — |
| Qwen3.5-397B | Qwen | | | 0.609 | 0.742 | 0.947 | 0.474 | 0.929 | 0.618 | 0.666 | 0.867 | 0.333 | 0.808 | 0.655 | 0.397 | 0.588 | 0.286 | 0.059 | 0.769 | 0.167 | 0.000 | 0.929 | 0.631 | 0.667 | 0.595 | 0.39 | 2.34 | — |
| DeepSeek-V3.2 | DeepSeek | | | 0.644 | 0.748 | 0.895 | 0.579 | 0.929 | 0.588 | 0.690 | 0.833 | 0.533 | 0.808 | 0.586 | 0.419 | 0.765 | 0.071 | 0.059 | 0.846 | 0.250 | 0.154 | 0.786 | 0.719 | 0.762 | 0.676 | 0.26 | 0.38 | — |
| DeepSeek-V3.2 | DeepSeek | | +geo | 0.731 | 0.815 | 0.895 | 0.789 | 0.929 | 0.647 | 0.766 | 0.900 | 0.733 | 0.846 | 0.586 | 0.583 | 0.882 | 0.643 | 0.529 | 0.846 | 0.167 | 0.154 | 0.857 | 0.762 | 0.794 | 0.730 | 0.26 | 0.38 | — |
| GPT-OSS-120B | OpenAI | | | 0.494 | 0.543 | 0.895 | 0.158 | 0.679 | 0.441 | 0.512 | 0.700 | 0.133 | 0.731 | 0.483 | 0.321 | 0.647 | 0.071 | 0.118 | 0.538 | 0.083 | 0.077 | 0.714 | 0.600 | 0.794 | 0.405 | 0.04 | 0.19 | — |
| GPT-OSS-120B | OpenAI | | +geo | 0.559 | 0.620 | 0.895 | 0.263 | 0.821 | 0.500 | 0.524 | 0.700 | 0.333 | 0.615 | 0.448 | 0.418 | 0.588 | 0.643 | 0.353 | 0.615 | 0.083 | 0.000 | 0.643 | 0.676 | 0.730 | 0.622 | 0.04 | 0.19 | — |

Cells marked "—" indicate runs still in progress; the table will be updated as results complete.

Cost vs Accuracy

Observations

Last updated: March 19, 2026. Results are being actively collected.