Ethnicity and race inference from names is a common task in social science, typically handled by algorithms like BISG (Bayesian Improved Surname Geocoding) or dictionary-based classifiers. But how well do large language models perform on this task out of the box? This benchmark evaluates LLMs on name-based ethnicity inference across four datasets with ground-truth labels.
The benchmark uses 100-person samples from each dataset. Models are given only a person's name and asked to predict a categorical label; in the rows marked +geo below, the model is additionally given the person's recorded location, mirroring the geographic signal BISG relies on. All predictions use structured outputs (JSON schema constraints) to ensure valid labels. The metric is macro-averaged accuracy: we compute per-label accuracy within each dataset, average across labels (so class imbalance does not reward majority-class guessing), then average across datasets.
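For concreteness, here is a minimal sketch of the scoring, assuming predictions and gold labels are collected as parallel lists per dataset (all names here are illustrative, not the benchmark's actual code):

```python
from collections import defaultdict

def macro_accuracy(y_true, y_pred):
    """Average of per-label accuracies: accuracy on each gold label is
    computed separately, then averaged with equal weight, so rare classes
    count as much as common ones."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[label] / total[label] for label in total) / len(total)

def benchmark_score(datasets):
    """datasets maps name -> (y_true, y_pred); the headline number is the
    unweighted mean of per-dataset macro accuracies."""
    return sum(macro_accuracy(t, p) for t, p in datasets.values()) / len(datasets)
```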
Datasets
- North Carolina — 100 voters from the NC voter file, labeled by self-reported race (WHITE, BLACK, HISPANIC, ASIAN)
- Florida — 100 voters from the FL voter file, same race categories
- Lebanon — 100 individuals from Lebanese voter rolls, labeled by religious sect (Sunni, Shia, Druze, Maronite, Roman Orthodox, Roman Catholic, Armenian Orthodox)
- India — 100 Indian politicians from reserved constituencies, labeled as Scheduled Caste (SC) or Scheduled Tribe (ST)
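As an illustration of the structured-output setup, a schema-constrained request for the NC/FL categories could look like the sketch below. This is a hedged example built on the OpenAI structured-outputs API; the benchmark's actual prompts and schemas may differ, and the model id is a placeholder:

```python
import json
from openai import OpenAI

client = OpenAI()
RACE_LABELS = ["WHITE", "BLACK", "HISPANIC", "ASIAN"]

def predict_race(name: str, model: str = "gpt-5.4-nano") -> str:
    """Ask the model for one of the four NC/FL labels; the schema's enum
    makes any out-of-vocabulary answer impossible."""
    resp = client.chat.completions.create(
        model=model,  # placeholder id, not necessarily the benchmark's
        messages=[{
            "role": "user",
            "content": f"Predict the most likely self-reported race of a US voter named {name}.",
        }],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "race_prediction",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"label": {"type": "string", "enum": RACE_LABELS}},
                    "required": ["label"],
                    "additionalProperties": False,
                },
            },
        },
    )
    return json.loads(resp.choices[0].message.content)["label"]
```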
Macro-Average Accuracy
| Model | Provider | Geo | Average | NC | White | Black | Hispanic | Asian | FL | White | Black | Hispanic | Asian | Lebanon | Sunni | Shia | Druze | Maronite | R. Orth. | R. Cath. | Armenian | India | SC | ST | $/M In | $/M Out | $/1M Names |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 (low) | OpenAI | | 0.622 | 0.702 | 0.947 | 0.316 | 0.929 | 0.618 | 0.624 | 0.900 | 0.200 | 0.846 | 0.552 | 0.439 | 0.706 | 0.143 | 0.118 | 1.000 | 0.250 | 0.000 | 0.857 | 0.723 | 0.905 | 0.541 | 2.50 | 15.00 | $2,195 |
| GPT-5.4 (low) | OpenAI | +geo | 0.710 | 0.716 | 0.947 | 0.368 | 0.929 | 0.618 | 0.726 | 0.933 | 0.533 | 0.885 | 0.552 | 0.613 | 0.588 | 0.571 | 0.706 | 0.769 | 0.500 | 0.154 | 1.000 | 0.785 | 0.921 | 0.649 | 2.50 | 15.00 | $2,195 |
| GPT-5.4 (high) | OpenAI | | 0.616 | 0.716 | 1.000 | 0.316 | 0.929 | 0.618 | 0.616 | 0.900 | 0.133 | 0.846 | 0.586 | 0.392 | 0.765 | 0.071 | 0.118 | 0.769 | 0.167 | 0.000 | 0.857 | 0.739 | 0.937 | 0.541 | 2.50 | 15.00 | $6,425 |
| GPT-5.4 (high) | OpenAI | +geo | 0.695 | 0.719 | 0.947 | 0.316 | 0.964 | 0.647 | 0.693 | 0.933 | 0.333 | 0.885 | 0.621 | 0.639 | 0.588 | 0.571 | 0.647 | 0.692 | 0.667 | 0.308 | 1.000 | 0.728 | 0.889 | 0.568 | 2.50 | 15.00 | $6,425 |
| GPT-5.4 Mini (low) | OpenAI | | 0.596 | 0.670 | 1.000 | 0.263 | 0.857 | 0.559 | 0.672 | 0.900 | 0.467 | 0.769 | 0.552 | 0.391 | 0.529 | 0.357 | 0.059 | 0.615 | 0.250 | 0.000 | 0.929 | 0.650 | 0.921 | 0.378 | 0.75 | 4.50 | $528 |
| GPT-5.4 Mini (low) | OpenAI | +geo | 0.563 | 0.457 | 0.947 | 0.105 | 0.393 | 0.382 | 0.428 | 0.967 | 0.000 | 0.538 | 0.207 | 0.607 | 0.706 | 0.643 | 0.706 | 0.692 | 0.500 | 0.077 | 0.929 | 0.760 | 0.952 | 0.568 | 0.75 | 4.50 | $528 |
| GPT-5.4 Mini (high) | OpenAI | | 0.582 | 0.581 | 0.947 | 0.211 | 0.607 | 0.559 | 0.652 | 0.967 | 0.467 | 0.692 | 0.483 | 0.424 | 0.588 | 0.357 | 0.000 | 0.769 | 0.250 | 0.000 | 1.000 | 0.671 | 0.937 | 0.405 | 0.75 | 4.50 | $1,406 |
| GPT-5.4 Nano (low) | OpenAI | | 0.636 | 0.755 | 0.947 | 0.526 | 0.929 | 0.618 | 0.648 | 0.767 | 0.467 | 0.808 | 0.552 | 0.399 | 0.588 | 0.214 | 0.059 | 1.000 | 0.000 | 0.000 | 0.929 | 0.743 | 0.810 | 0.676 | 0.20 | 1.25 | $59 |
| GPT-5.4 Nano (low) | OpenAI | +geo | 0.719 | 0.761 | 0.842 | 0.684 | 0.929 | 0.588 | 0.690 | 0.900 | 0.533 | 0.808 | 0.517 | 0.621 | 0.647 | 0.857 | 0.824 | 0.923 | 0.167 | 0.000 | 0.929 | 0.804 | 0.905 | 0.703 | 0.20 | 1.25 | $59 |
| GPT-5.4 Nano (high) | OpenAI | | 0.632 | 0.737 | 0.947 | 0.526 | 0.857 | 0.618 | 0.648 | 0.933 | 0.333 | 0.808 | 0.517 | 0.435 | 0.706 | 0.429 | 0.059 | 0.923 | 0.000 | 0.000 | 0.929 | 0.707 | 0.873 | 0.541 | 0.20 | 1.25 | $83 |
| GPT-5.4 Nano (high) | OpenAI | +geo | 0.682 | 0.671 | 0.842 | 0.368 | 0.857 | 0.618 | 0.699 | 0.967 | 0.400 | 0.808 | 0.621 | 0.603 | 0.588 | 0.786 | 0.824 | 0.846 | 0.250 | 0.000 | 0.929 | 0.753 | 0.857 | 0.649 | 0.20 | 1.25 | $83 |
| Gemini 3 Flash (minimal) | Google | | 0.715 | 0.821 | 0.789 | 0.947 | 0.929 | 0.618 | 0.758 | 0.900 | 0.733 | 0.846 | 0.552 | 0.499 | 0.412 | 0.714 | 0.353 | 0.846 | 0.167 | 0.000 | 1.000 | 0.781 | 0.778 | 0.784 | 0.50 | 3.00 | $74 |
| Gemini 3 Flash (minimal) | Google | +geo | 0.762 | 0.832 | 0.842 | 0.947 | 0.893 | 0.647 | 0.782 | 0.933 | 0.800 | 0.808 | 0.586 | 0.642 | 0.706 | 0.786 | 0.824 | 0.692 | 0.333 | 0.154 | 1.000 | 0.793 | 0.857 | 0.730 | 0.50 | 3.00 | $74 |
| Gemini 3 Flash (high) | Google | | 0.674 | 0.742 | 0.895 | 0.526 | 0.929 | 0.618 | 0.733 | 0.933 | 0.533 | 0.846 | 0.621 | 0.508 | 0.529 | 0.714 | 0.294 | 0.769 | 0.250 | 0.000 | 1.000 | 0.713 | 0.778 | 0.649 | 0.50 | 3.00 | $629 |
| Gemini 3 Flash (high) | Google | +geo | 0.739 | 0.847 | 0.947 | 0.895 | 0.929 | 0.618 | 0.757 | 0.933 | 0.667 | 0.808 | 0.621 | 0.614 | 0.588 | 0.714 | 0.824 | 0.846 | 0.250 | 0.077 | 1.000 | 0.736 | 0.905 | 0.568 | 0.50 | 3.00 | $629 |
| Gemini 3.1 Flash Lite (minimal) | Google | | 0.669 | 0.774 | 0.947 | 0.632 | 0.929 | 0.588 | 0.733 | 0.933 | 0.533 | 0.846 | 0.621 | 0.532 | 0.588 | 0.714 | 0.412 | 1.000 | 0.083 | 0.000 | 0.929 | 0.637 | 0.571 | 0.703 | 0.25 | 1.50 | $43 |
| Gemini 3.1 Flash Lite (minimal) | Google | +geo | 0.757 | 0.774 | 0.947 | 0.632 | 0.929 | 0.588 | 0.757 | 0.967 | 0.667 | 0.808 | 0.586 | 0.662 | 0.706 | 0.714 | 0.941 | 0.692 | 0.500 | 0.154 | 0.929 | 0.836 | 0.889 | 0.784 | 0.25 | 1.50 | $43 |
| Gemini 3.1 Flash Lite (high) | Google | | 0.658 | 0.755 | 0.947 | 0.526 | 0.929 | 0.618 | 0.734 | 0.900 | 0.533 | 0.846 | 0.655 | 0.501 | 0.412 | 0.714 | 0.353 | 0.769 | 0.333 | 0.000 | 0.929 | 0.640 | 0.524 | 0.757 | 0.25 | 1.50 | $440 |
| Gemini 3.1 Flash Lite (high) | Google | +geo | 0.752 | 0.808 | 0.947 | 0.737 | 0.929 | 0.618 | 0.757 | 0.933 | 0.667 | 0.808 | 0.621 | 0.658 | 0.647 | 0.643 | 0.882 | 0.846 | 0.583 | 0.077 | 0.929 | 0.785 | 0.921 | 0.649 | 0.25 | 1.50 | $440 |
| Qwen3.5-35B | Qwen | | 0.625 | 0.721 | 0.947 | 0.421 | 0.929 | 0.588 | 0.707 | 0.933 | 0.467 | 0.808 | 0.621 | 0.398 | 0.588 | 0.357 | 0.059 | 0.769 | 0.083 | 0.000 | 0.929 | 0.672 | 0.857 | 0.486 | 0.16 | 1.30 | — |
| Qwen3.5-397B | Qwen | | 0.609 | 0.742 | 0.947 | 0.474 | 0.929 | 0.618 | 0.666 | 0.867 | 0.333 | 0.808 | 0.655 | 0.397 | 0.588 | 0.286 | 0.059 | 0.769 | 0.167 | 0.000 | 0.929 | 0.631 | 0.667 | 0.595 | 0.39 | 2.34 | — |
| DeepSeek-V3.2 | DeepSeek | | 0.644 | 0.748 | 0.895 | 0.579 | 0.929 | 0.588 | 0.690 | 0.833 | 0.533 | 0.808 | 0.586 | 0.419 | 0.765 | 0.071 | 0.059 | 0.846 | 0.250 | 0.154 | 0.786 | 0.719 | 0.762 | 0.676 | 0.26 | 0.38 | — |
| DeepSeek-V3.2 | DeepSeek | +geo | 0.731 | 0.815 | 0.895 | 0.789 | 0.929 | 0.647 | 0.766 | 0.900 | 0.733 | 0.846 | 0.586 | 0.583 | 0.882 | 0.643 | 0.529 | 0.846 | 0.167 | 0.154 | 0.857 | 0.762 | 0.794 | 0.730 | 0.26 | 0.38 | — |
| GPT-OSS-120B | OpenAI | | 0.494 | 0.543 | 0.895 | 0.158 | 0.679 | 0.441 | 0.512 | 0.700 | 0.133 | 0.731 | 0.483 | 0.321 | 0.647 | 0.071 | 0.118 | 0.538 | 0.083 | 0.077 | 0.714 | 0.600 | 0.794 | 0.405 | 0.04 | 0.19 | — |
| GPT-OSS-120B | OpenAI | +geo | 0.559 | 0.620 | 0.895 | 0.263 | 0.821 | 0.500 | 0.524 | 0.700 | 0.333 | 0.615 | 0.448 | 0.418 | 0.588 | 0.643 | 0.353 | 0.615 | 0.083 | 0.000 | 0.643 | 0.676 | 0.730 | 0.622 | 0.04 | 0.19 | — |
The White/Black/Hispanic/Asian columns following NC and FL are per-label accuracies within that state's sample, as are the sect and caste columns following Lebanon and India. $/M In and $/M Out are prices per million input and output tokens; $/1M Names is the estimated cost to classify one million names. Cells marked — indicate runs still in progress; the table will be updated as runs complete.
Cost vs Accuracy
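A cost-accuracy scatter can be built directly from the table above. Here is a minimal matplotlib sketch, assuming the table has been exported to a CSV; the filename and column names are assumptions, not part of the benchmark:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("results.csv")  # assumed export of the table above
ax = df.plot.scatter(x="usd_per_1m_names", y="avg_macro_accuracy", logx=True)
for _, row in df.iterrows():
    # Label each point with its model name, offset slightly for readability.
    ax.annotate(row["model"], (row["usd_per_1m_names"], row["avg_macro_accuracy"]),
                fontsize=7, xytext=(3, 3), textcoords="offset points")
ax.set_xlabel("$ per 1M names (log scale)")
ax.set_ylabel("Macro-average accuracy")
plt.tight_layout()
plt.savefig("cost_vs_accuracy.png", dpi=200)
```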
Observations
- Gemini 3 Flash leads: the best no-geo average (0.715, with 82% on NC, 76% on FL, and 50% on Lebanon) and, with +geo, the best average overall (0.762), all at minimal thinking and a near-cheapest $74 per million names.
- Higher reasoning often hurts: GPT-5.4 Mini drops from 67% to 58% on NC when switching from low to high reasoning effort. The models seem to "overthink" ambiguous names toward the majority class (WHITE).
- Nano beats Mini: GPT-5.4 Nano (low) scores 75% on NC versus Mini's 67%, and wins on overall average (0.636 vs 0.596). The smaller model appears less prone to defaulting to WHITE.
- BLACK is the hardest US category: without geography, no model other than Gemini 3 Flash (minimal) (0.947 on NC) exceeds 0.73 on BLACK, and most runs fall below 0.55, versus 0.70-1.00 on WHITE. Models systematically under-predict Black identity for names without strong ethnic markers.
- Lebanon exposes limits: without geography, Lebanon macro averages run from 0.32 (GPT-OSS-120B) to 0.53. Armenian Orthodox (distinctive names) and Maronite are predicted well, but Druze, Roman Orthodox, and Roman Catholic are near zero for most models; geography helps Druze dramatically (GPT-5.4 Nano (low) jumps from 0.059 to 0.824).
- India SC/ST is deceptively high: averages of 0.60-0.78, but most models favor SC (the majority class), with SC accuracy (0.52-0.94) generally well above ST (0.38-0.78). A quick check for this kind of bias is sketched below.
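One quick way to verify the majority-class effects flagged above is to compare each model's predicted label distribution against the true one; a minimal sketch, with illustrative names:

```python
from collections import Counter

def label_shift(y_true, y_pred):
    """Print how often each label occurs in the gold data versus the
    predictions; pred >> true signals the model over-predicts that label."""
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    for label in sorted(true_counts | pred_counts):
        print(f"{label:>10}: true={true_counts[label]:3d}  pred={pred_counts[label]:3d}")
```

A model that "overthinks" toward WHITE, for instance, will show pred well above true for WHITE and well below it for BLACK.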
Last updated: March 19, 2026. Results are being actively collected.