NamePrism is a nationality classification web service for research in sociology, linguistics, and biomedical applications. It is further described in the following scientific paper.
Nationality Classification Using Name Embeddings
Junting Ye, Shuchu Han, Yifan Hu, Baris Coskun, Meizhu Liu, Hong Qin, Steven Skiena
Stony Brook University, Yahoo! Research, Amazon AI, NEC Labs America
Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification.
We exploit the phenomena of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality which is readily applicable to building classifiers and other systems. Through our analysis of 57M contact lists from a major Internet company, we are able to design a fine-grained nationality classifier covering 39 groups representing over 90% of the world population. In an evaluation against other published systems over 13 common classes, our F1 score (0.795) is substantially better than our closest competitor Ethnea (0.580). To the best of our knowledge, this is the most accurate, fine-grained nationality classifier available. As a social media application, we apply our classifiers to the followers of major Twitter celebrities over six different domains. We demonstrate stark differences in the ethnicities of the followers of Trump and Obama, and in the sports and entertainments favored by different groups. Finally, we identify an anomalous political figure whose presumably inflated following appears largely incapable of reading the language he posts in. [full article on arxiv.org]
NamSor, a private start-up company, has offered a free gender classification service for several years, while we provide name origin / ethnicity / diaspora classification as a commercial service. We work with experts in onomastics, sociolinguists, sociologists, anthropologists, historians … to continuously improve our software and its calibration for each country individually. Our objective is to offer the best accuracy in decoding personal names according to various taxonomies (gender, nationality, origin / ethnicity, diaspora, country / region, or even caste and tribal systems), and we aim for global coverage : all countries, regions, alphabets.
Comparing NamePrism and NamSor is not as simple as it may seem : ideally, we would have the same taxonomy for both services and a clean, independent sample of names labeled by nationality, on which to compare precision and recall (or the F1 score as a test of accuracy). However, the concepts and taxonomies used by NamePrism and NamSor are related but not identical.
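To make the metrics concrete, here is a minimal sketch of how per-class precision, recall and F1 could be computed if such a labeled sample existed. The function name `per_class_scores` and the toy labels are hypothetical and only serve as an illustration; they are not results from either service.

```python
from collections import Counter

def per_class_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1          # correct prediction for this class
        else:
            fp[guess] += 1          # predicted class was wrong
            fn[truth] += 1          # true class was missed
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Hypothetical toy sample: true region vs. region returned by a classifier.
truth = ["Africa", "Africa", "Europe", "Europe", "Asia"]
guess = ["Africa", "Europe", "Europe", "Europe", "Asia"]
scores = per_class_scores(truth, guess)
for label in sorted(scores):
    p, r, f = scores[label]
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

This only works when both the reference labels and the predictions use the same class names, which is exactly what the differing taxonomies prevent here.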
NamSor ‘Origin’ vs NamePrism ‘Nationality’
NamSor ‘Origin’ and NamePrism ‘Nationality’ are both somewhat related to geographic countries of origin of personal names :
- NamePrism ‘Nationality’ classes are (Level 1) : European, Hispanic, CelticEnglish, Muslim, EastAsian, African, Nordic, Greek, SouthAsian, Jewish. Then, for example, European names are broken down (Level 2) as : Baltics, EastEuropean, French, German, Italian, Russian, SouthSlavs;
- NamSor ‘Origin’ top classes are across two dimensions : 129 countries (from Arab Emirates to Zimbabwe, as ISO2 country codes) and 22 scripts (LATIN but also ARABIC, etc.), which can be grouped according to geographic regions (Level 1 : Europe, Africa, Asia … and then Level 2 : Northern Europe, Western Europe, etc.). Note that there is no consensus on how to group countries together as regions : NamSor’s hierarchy can easily be replaced by your own country code mapping (Israel in Europe or Asia, etc.) [link to the API ]
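Replacing the regional hierarchy with your own country code mapping can be sketched as below. This is an illustration only : the `CUSTOM_REGIONS` table, the `to_region` helper and the region assignments are hypothetical and are not part of the NamSor API.

```python
# Hypothetical custom hierarchy: ISO2 country code -> (Level 1, Level 2).
# The assignments are illustrative; edit them to match your own convention.
CUSTOM_REGIONS = {
    "FR": ("Europe", "Western Europe"),
    "SE": ("Europe", "Northern Europe"),
    "AE": ("Asia", "Western Asia"),
    "IL": ("Asia", "Western Asia"),   # could equally be mapped to Europe
}

def to_region(country_code, level=1, mapping=CUSTOM_REGIONS):
    """Map an ISO2 country code to a Level 1 or Level 2 region name."""
    regions = mapping.get(country_code.upper())
    if regions is None:
        return "Unknown"              # country not covered by the mapping
    return regions[level - 1]

print(to_region("fr"))           # Level 1 region for France
print(to_region("SE", level=2))  # Level 2 region for Sweden
```

Keeping the mapping in a plain dictionary means a different convention (for example, Israel grouped with Europe) is a one-line change.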
Both NamSor ‘Origin’ and NamePrism ‘Nationality’ consider the United States (US) or Australia (AU), for example, as melting pots rather than as origins for personal names and identities. So the United States is not included in either taxonomy, and this is where NamSor ‘Diaspora’ brings a different perspective.
We’ve classified a sample of 70,000 names from all 129 countries using both NamSor ‘Origin’ and NamePrism ‘Nationality’ [the sample we used is in the companion project on Github].
This is the Level 1 output :
|NamePrism Nationality vs. NamSor Origin||Africa||Asia||Europe||Grand Total|
As we can see visually, the NamePrism ‘African’ class fits well with the NamSor ‘Africa’ class : most names classified as ‘African’ by NamePrism are also classified as ‘Africa’ by NamSor. Also, NamePrism ‘Nordic’ names are all classified as ‘Europe’ by NamSor.
Other classes don’t fit so well : the NamePrism ‘Muslim’ class doesn’t have an equivalent in NamSor Origin, which means that names recognized by NamePrism as ‘Muslim’ might be recognized by NamSor as geographically from (for example) North Africa (Maghreb), Western Asia (Pakistan) …
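A cross-tabulation like the Level 1 one can be sketched as follows : for each name, take the class returned by each service and count the pairs. This is a minimal sketch, not the code used for this post; the helper names and the toy labels are hypothetical, and the counts are not real results.

```python
from collections import Counter, defaultdict

def cross_tabulate(row_labels, column_labels):
    """Count how often each (row class, column class) pair occurs."""
    table = defaultdict(Counter)
    for row, col in zip(row_labels, column_labels):
        table[row][col] += 1
    return table

def format_table(table):
    """Render the counts as tab-separated text with a Grand Total column."""
    columns = sorted({col for counts in table.values() for col in counts})
    lines = ["\t".join(["", *columns, "Grand Total"])]
    for row in sorted(table):
        counts = table[row]
        cells = [str(counts[col]) for col in columns]
        lines.append("\t".join([row, *cells, str(sum(counts.values()))]))
    return "\n".join(lines)

# Hypothetical toy labels, one entry per name, in the same order.
nameprism = ["African", "African", "Muslim", "Muslim", "Nordic"]
namsor = ["Africa", "Africa", "Africa", "Asia", "Europe"]
table = cross_tabulate(nameprism, namsor)
print(format_table(table))
```

A row whose counts concentrate in one column (as ‘African’ does here) indicates a good fit between the two classes; a row spread over several columns (as ‘Muslim’ does here) indicates that the taxonomies cut the names differently.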
Drilling down to Level 2 for European names also shows different levels of granularity, as NamePrism ‘Nationality’ maps dozens of European countries (Moldova, France, Germany, Italy, Russia, Romania, Austria, Switzerland, Belgium, Ukraine, Czech Republic, Hungary, Netherlands, Poland, Latvia, Serbia, Slovenia, Bulgaria, Slovakia, Croatia, Albania, Macedonia, Estonia, Sweden, Denmark, Lithuania, Bosnia and Herzegovina, Spain, Belarus, Norway, United Kingdom of Great Britain and Northern Ireland, Finland, Iceland, Greece, Ireland, Portugal, Montenegro …) into just seven classes.
|NamePrism Nationality vs. NamSor Origin||Moldova||France||Germany||Italy||Russia||[…]|
Drilling down to Level 2 for Asian names also shows different levels of granularity, as NamePrism ‘Nationality’ maps dozens of Asian countries (Oman, Turkmenistan, Yemen, Tajikistan, Uzbekistan, Kazakhstan, Nepal, Jordan, Iraq, Cyprus, United Arab Emirates, Mongolia, Iran, Azerbaijan, Armenia, Korea, DPR, Lao, Kyrgyzstan, Bahrain, Lebanon, Bangladesh, Saudi Arabia, Turkey, Palestine, Sri Lanka, Afghanistan, Georgia, Pakistan, Syria, India, Israel, Cambodia, Malaysia, China, Hong Kong, Viet Nam, Myanmar, Thailand, Japan, Taiwan, Indonesia, Korea, China, …) into just five classes.
|NamSor Origin vs. NamePrism Nationality||Chinese||Indochina||Japan||Malay||South Korea|
|China, Hong Kong||260||2||1||1|
Drilling down to Level 2 for African names shows the same level of granularity and a good fit for East African, South African and West African names. However, as we’ve seen, NamePrism tends to classify North African names as ‘Muslim’. Also, NamePrism doesn’t drill down further, whereas NamSor Origin has country-level granularity.
|NamePrism Nationality vs. NamSor Origin||Eastern Africa||Middle Africa||Northern Africa||Southern Africa||Western Africa|
NamSor ‘Diaspora’ vs NamePrism ‘Ethnicity’
NamSor ‘Diaspora’ and NamePrism ‘Ethnicity’ are both somewhat related to ethnicity (or race in the United States) and cultural identity :
- NamePrism ‘Ethnicity’ classes are (Level 1) : Black, White, API, AIAN, 2PRACE, Hispanic;
- NamSor ‘Diaspora’ top classes are across three dimensions : 141 countries (from Arab Emirates to Zimbabwe, as ISO2 country codes, including the United States, Australia, etc.), 134 ethnicities (including : Hispanic, African American, Jewish, Lithuanian, etc.) and 22 scripts (LATIN but also ARABIC, etc.) [link to the API ]
So there are two dimensions of NamSor ‘Diaspora’ to compare to NamePrism ‘Ethnicity’ : geography and ethnicity. We will follow this up in a later post. Stay tuned !
How to cite ?
The relevant papers are
– NamePrism : Nationality Classification using Name Embeddings <https://arxiv.org/abs/1708.07903>
– NamSor : Onomastics and Big Data Mining <https://arxiv.org/abs/1310.6311>
Thank you for the citation!
NamSor™ Applied Onomastics is a European vendor of sociolinguistics software (NamSor sorts names). NamSor’s mission is to help understand international flows of money, ideas and people.
Reach us at: email@example.com