Visually Comparing Name Nationality Classification Services


NamePrism is a nationality classification web service for research in sociology, linguistics, and biomedical applications. It is further described in the following scientific paper.

Nationality Classification Using Name Embeddings

Junting Ye, Shuchu Han, Yifan Hu, Baris Coskun, Meizhu Liu, Hong Qin, Steven Skiena

Stony Brook University, Yahoo! Research, Amazon AI, NEC Labs America

Abstract

Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification.

We exploit the phenomena of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality which is readily applicable to building classifiers and other systems. Through our analysis of 57M contact lists from a major Internet company, we are able to design a fine-grained nationality classifier covering 39 groups representing over 90% of the world population. In an evaluation against other published systems over 13 common classes, our F1 score (0.795) is substantial better than our closest competitor Ethnea (0.580). To the best of our knowledge, this is the most accurate, fine-grained nationality classifier available. As a social media application, we apply our classifiers to the followers of major Twitter celebrities over six different domains. We demonstrate stark differences in the ethnicities of the followers of Trump and Obama, and in the sports and entertainments favored by different groups. Finally, we identify an anomalous political figure whose presumably inflated following appears largely incapable of reading the language he posts in. [full article on arxiv.org]

As a private start-up company, NamSor has offered a free gender classification service for several years; we provide name origin / ethnicity / diaspora classification as a commercial service. We work with experts in onomastics, sociolinguists, sociologists, anthropologists, historians … to continuously improve our software and calibration for each country individually. Our objective is to offer the best accuracy in decoding personal names according to various taxonomies (gender, nationality, origin / ethnicity, diaspora, country / region or even caste and tribal systems), and we aim for global coverage: all countries, regions and alphabets.

Comparing NamePrism and NamSor is not as simple as it may seem: ideally we would have the same taxonomy for both services and a clean, independent sample of names labeled by nationality, so that we could compare precision and recall (or the F1 score as a combined measure of accuracy). However, the concepts and taxonomies used by NamePrism and NamSor are related but not identical.
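For reference, here is a minimal sketch of how per-class precision, recall and F1 could be computed if such a labeled sample were available. The function name and the five sample labels are invented for illustration; they do not come from either service.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Return (precision, recall, F1) for one class, treated as the positive label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)  # true positives
    fp = sum(1 for t, p in pairs if t != target and p == target)  # false positives
    fn = sum(1 for t, p in pairs if t == target and p != target)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold labels and classifier output, for illustration only.
truth = ["French", "French", "German", "Italian", "French"]
predicted = ["French", "German", "German", "Italian", "French"]

print(precision_recall_f1(truth, predicted, "French"))
```

With two different taxonomies, the `target` class would first have to be mapped to a common label on both sides, which is exactly the difficulty discussed here.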

NamSor ‘Origin’ vs NamePrism ‘Nationality’

NamSor ‘Origin’ and NamePrism ‘Nationality’ are both somewhat related to geographic countries of origin of personal names :

  • NamePrism ‘Nationality’ classes are (Level 1) : European, Hispanic, CelticEnglish, Muslim, EastAsian, African, Nordic, Greek, SouthAsian, Jewish. Then, for example, European names are broken down (Level 2) as : Baltics, EastEuropean, French, German, Italian, Russian, SouthSlavs;
  • NamSor ‘Origin’ top classes span two dimensions: 129 countries (from Arab Emirates to Zimbabwe, as ISO2 country codes) and 22 scripts (LATIN but also ARABIC, etc.), which can be grouped according to geographic regions (Level 1: Europe, Africa, Asia … and then Level 2: Northern Europe, Western Europe, etc.). Note that there is no consensus on how to group countries into regions: NamSor’s hierarchy can easily be replaced by your own country code mapping (Israel in Europe or Asia, etc.) [link to the API]
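As an illustration of swapping in your own country code mapping, the sketch below regroups names by a custom region table. It assumes you already hold one ISO2 country code per name (how you obtain it from the API is not shown); the `MY_REGIONS` table and the `regroup` helper are hypothetical.

```python
from collections import Counter

# Hypothetical custom grouping of ISO2 country codes; edit to suit your study.
MY_REGIONS = {
    "FR": "Europe",
    "DE": "Europe",
    "IL": "Asia",  # could equally be mapped to "Europe"
    "MA": "Africa",
    "JP": "Asia",
}


def regroup(country_codes, mapping, default="Unmapped"):
    """Count names per custom region, given one country code per name."""
    return Counter(mapping.get(code, default) for code in country_codes)


# Made-up list of country codes, one per classified name.
codes = ["FR", "IL", "JP", "MA", "DE", "XX"]
print(regroup(codes, MY_REGIONS))
```

Codes missing from the table fall into an explicit "Unmapped" bucket rather than being dropped silently, so gaps in the mapping stay visible in the totals.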

Both NamSor ‘Origin’ and NamePrism ‘Nationality’ treat the United States (US) or Australia (AU), for example, as melting pots rather than as origins for personal names and identities. The United States is therefore not included in the taxonomy, and this is where NamSor ‘Diaspora’ brings a different perspective.

We’ve classified a sample of about 70,000 names from all 129 countries using both NamSor ‘Origin’ and NamePrism ‘Nationality’ [the sample we used is in the companion project on GitHub].

This is the Level 1 output :

| NamePrism Nationality vs. NamSor Origin | Africa | Asia | Europe | Grand Total |
|---|---|---|---|---|
| European | 10763 | 2473 | 13442 | 26678 |
| Hispanic | 1881 | 831 | 7691 | 10403 |
| CelticEnglish | 2509 | 1309 | 4384 | 8202 |
| Muslim | 1890 | 5277 | 97 | 7264 |
| EastAsian | 493 | 5761 | 393 | 6647 |
| African | 3995 | 177 | 84 | 4256 |
| Nordic | 42 | 55 | 2571 | 2668 |
| Greek | 29 | 1558 | 901 | 2488 |
| SouthAsian | 381 | 1549 | 72 | 2002 |
| Jewish | 3 | 74 | 6 | 83 |
| Grand Total | 21986 | 19064 | 29641 | 70691 |
Fig 1. Level 1 comparison between NamePrism ‘Nationality’ and NamSor ‘Origin’
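A Level 1 table of this kind is a plain cross-tabulation of paired labels. The sketch below shows one way to build it with the Python standard library; the `cross_tab` helper and the five sample pairs are made up for illustration and are not the code or data used for this post.

```python
from collections import Counter


def cross_tab(pairs):
    """Count (row_label, column_label) pairs and add 'Grand Total' margins."""
    cells = Counter(pairs)
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    table = {}
    for r in rows:
        table[r] = {c: cells[(r, c)] for c in cols}
        table[r]["Grand Total"] = sum(cells[(r, c)] for c in cols)
    # Column totals, plus the overall total in the bottom-right corner.
    table["Grand Total"] = {c: sum(cells[(r, c)] for r in rows) for c in cols}
    table["Grand Total"]["Grand Total"] = sum(cells.values())
    return table


# Hypothetical (NamePrism class, NamSor region) pairs, one per name.
pairs = [
    ("African", "Africa"),
    ("African", "Africa"),
    ("Nordic", "Europe"),
    ("Muslim", "Asia"),
    ("Muslim", "Africa"),
]

for label, row in cross_tab(pairs).items():
    print(label, row)
```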

As we can see visually, the NamePrism ‘African’ class fits well with NamSor ‘Africa’: most names classified as ‘African’ by NamePrism (3,995 of 4,256) are also classified as ‘Africa’ by NamSor. Likewise, nearly all NamePrism ‘Nordic’ names (2,571 of 2,668) are classified as ‘Europe’ by NamSor.
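To put numbers on this visual impression, each row of the Level 1 table can be converted to shares. The sketch below does so for three rows, using the counts from the table; the `row_shares` helper is only an illustration.

```python
# Counts copied from the Level 1 table (NamePrism class -> NamSor region).
LEVEL1 = {
    "African": {"Africa": 3995, "Asia": 177, "Europe": 84},
    "Nordic": {"Africa": 42, "Asia": 55, "Europe": 2571},
    "Muslim": {"Africa": 1890, "Asia": 5277, "Europe": 97},
}


def row_shares(row):
    """Return each cell as a fraction of its row total."""
    total = sum(row.values())
    return {region: count / total for region, count in row.items()}


for label, row in LEVEL1.items():
    shares = row_shares(row)
    print(label, {region: f"{share:.1%}" for region, share in shares.items()})
```

About 94% of ‘African’ names fall in NamSor ‘Africa’ and about 96% of ‘Nordic’ names fall in ‘Europe’, whereas ‘Muslim’ names are split, with roughly 73% in Asia and most of the remainder in Africa.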

Other classes fit less well: the NamePrism ‘Muslim’ class has no equivalent in NamSor Origin, so names recognized by NamePrism as ‘Muslim’ may be placed by NamSor geographically in, for example, North Africa (Maghreb) or Western Asia (Pakistan) …

Drilling down to Level 2 for European names also shows different levels of granularity, as NamePrism ‘Nationality’ maps dozens of European countries (Moldova, France, Germany, Italy, Russia, Romania, Austria, Switzerland, Belgium, Ukraine, Czech Republic, Hungary, Netherlands, Poland, Latvia, Serbia, Slovenia, Bulgaria, Slovakia, Croatia, Albania, Macedonia, Estonia, Sweden, Denmark, Lithuania, Bosnia and Herzegovina, Spain, Belarus, Norway, United Kingdom of Great Britain and Northern Ireland, Finland, Iceland, Greece, Ireland, Portugal, Montenegro …) into just seven classes.

 

NamePrism Nationality vs. NamSor Origin Moldova France Germany Italy Russia […]
German 132 14 899 13 74
Italian 1538 5 6 707 12
French 62 1022 38 42 14
Russian 454 1 588
EastEuropean 16 6 4
SouthSlavs 16 4 1 19
Baltics 1 3
Fig 2. Comparison between NamePrism European ‘Nationality’ and NamSor European ‘Origin’
Fig 3. Comparison between NamSor European ‘Origin’ and NamePrism European ‘Nationality’

Drilling down to Level 2 for Asian names also shows different levels of granularity, as NamePrism ‘Nationality’ maps dozens of Asian countries (Oman, Turkmenistan, Yemen, Tajikistan, Uzbekistan, Kazakhstan, Nepal, Jordan, Iraq, Cyprus, United Arab Emirates, Mongolia, Iran, Azerbaijan, Armenia, Korea, DPR, Lao, Kyrgyzstan, Bahrain, Lebanon, Bangladesh, Saudi Arabia, Turkey, Palestine, Sri Lanka, Afghanistan, Georgia, Pakistan, Syria, India, Israel, Cambodia, Malaysia, China, Hong Kong, Viet Nam, Myanmar, Thailand, Japan, Taiwan, Indonesia, Korea, China, …) into just five classes.

NamSor Origin vs. NamePrism Nationality Chinese Indochina Japan Malay South Korea
China 888 90 8 36 47
Korea 113 37 16 503
Indonesia 4 4 1 574
Taiwan 561 8 4 6
Japan 6 3 511 32 2
Thailand 7 331 5 22 7
Myanmar 31 266 1 15 2
Viet Nam 7 267 1 3
China, Hong Kong 260 2 1 1
Malaysia 59 1 1 145
Cambodia 43 60 1 19 1
Israel 7 1 82
India 7 13 3 57 1
Syria 74
Pakistan 71
[…]
Fig 4. Comparison between NamSor Asian ‘Origin’ and NamePrism Asian ‘Nationality’

Drilling down to Level 2 for African names shows the same level of granularity and a good fit for East African, South African and West African names. However, as we’ve seen, NamePrism would rather classify North African names as ‘Muslim’. Also, NamePrism doesn’t drill down further, whereas NamSor Origin has country-level granularity.

| NamePrism Nationality vs. NamSor Origin | Eastern Africa | Middle Africa | Northern Africa | Southern Africa | Western Africa |
|---|---|---|---|---|---|
| EastAfrican | 1978 | 69 | 8 | 38 | 47 |
| SouthAfrican | 87 | 10 | 1 | 806 | 14 |
| WestAfrican | 42 | 12 | 6 | 12 | 865 |

NamePrism_NamSor_Origin_Level2_Africa

NamSor ‘Diaspora’ vs NamePrism ‘Ethnicity’

NamSor ‘Diaspora’ and NamePrism ‘Ethnicity’ are both somewhat related to ethnicity (or race in the United States) and cultural identity:

  • NamePrism ‘Ethnicity’ classes are (Level 1) : Black, White, API, AIAN, 2PRACE, Hispanic;
  • NamSor ‘Diaspora’ top classes span three dimensions: 141 countries (from Arab Emirates to Zimbabwe, as ISO2 country codes, including the United States, Australia, etc.), 134 ethnicities (including Hispanic, African American, Jewish, Lithuanian, etc.) and 22 scripts (LATIN but also ARABIC, etc.) [link to the API]

So there are two dimensions of NamSor ‘Diaspora’ to compare with NamePrism ‘Ethnicity’: geography and ethnicity. We will follow this up in a later post. Stay tuned!

How to cite?

The relevant papers are
– NamePrism : Nationality Classification using Name Embeddings <https://arxiv.org/abs/1708.07903>
– NamSor : Onomastics and Big Data Mining <https://arxiv.org/abs/1310.6311>

Thank you for the citation!

About NamSor

NamSor™ Applied Onomastics is a European vendor of sociolinguistics software (NamSor sorts names). NamSor’s mission is to help understand international flows of money, ideas and people.
Reach us at: contact@namsor.com

 

 
