Onomastic sampling for migration studies

On Friday morning, I had the opportunity to present our breakthrough data mining technology at Regent’s University Turkish Migration Conference (TMC2014, London).

The supporting presentation can be downloaded here (20140530_TMS2014_Pitch_vFf.pdf) or viewed online here.


During the following sessions by researchers from various countries (Turkey, US, UK, Germany, Netherlands, Sweden, Norway, Belgium …), I learned some of the ‘jargon’ of migration studies and also something about the particular research methodologies applied in that field.

My initial vision was that onomastics (the recognition of personal names) could be applied to discover new migration patterns. It was based on several preliminary meetings with international organizations concerned with migration issues. Census data can take up to three years to process. As states struggle to provide timely and accurate data to international organizations (such as the OECD, IOM, United Nations High Commissioner for Refugees UNHCR, …), these organizations can turn to Big Data to identify and monitor new trends. There are challenges in identifying relevant data sources that provide valuable information about less digitally connected migrants. Twitter, LinkedIn, Google, Facebook, D&B, Thomson WoS … combined with applied onomastics can tell us a lot about the changing migration patterns of STEM Workers, innovators and entrepreneurs.

STEM Workers: workers in science, technology, engineering, and mathematics; art is occasionally considered as well (STEAM Workers).

With several TMC2014 sessions focused on the question of Turkish identity, or the particular migration and integration patterns of the Turkish, Kurdish, Alevi or Circassian communities, applied onomastics clearly offers an innovative tool to look at data from different angles (nationality/birth place/ethnicity/gender/…).

However, I found that many research studies start from an initial theoretical hypothesis. Researchers then apply various qualitative or quantitative methods (occasionally both) to assess the hypothesis. Purely quantitative methods such as ‘data mining’ or ‘graph analysis’ are seen as de-humanizing by researchers (anthropologists, sociologists, historians …) who are primarily interested in the human story of migration. Most researchers conduct surveys to gather the data for their study: they find people, talk to them, ask questions. How do researchers identify the group of people to be surveyed (the sample)? During the conference, I learned another piece of jargon: network/snowball sampling.

Network/snowball sampling: Snowball sampling is based on the selection of target people in personal networks. In a first step, important people within the target group are identified (initial sample) who themselves identify further people who can be also addressed for the survey (McKenzie & Mistiaen, 2007, p. 2; Salentin, 1999, p. 124).
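As a rough illustration of that definition, here is a minimal Python sketch of snowball sampling over a referral network. The names, the `contacts` mapping and the `max_waves` parameter are all made up for the example; a real study would collect referrals through interviews rather than read them from a dictionary.

```python
from collections import deque


def snowball_sample(contacts, seeds, max_waves):
    """Return the people reached within max_waves referral waves of the seeds."""
    sampled = list(seeds)              # keeps the order of recruitment
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        person, wave = frontier.popleft()
        if wave == max_waves:
            continue                   # do not ask this person for referrals
        for referred in contacts.get(person, []):
            if referred not in seen:
                seen.add(referred)
                sampled.append(referred)
                frontier.append((referred, wave + 1))
    return sampled


# Hypothetical referral network: each person names further people to survey.
contacts = {
    "Ayse": ["Mehmet", "Zeynep"],
    "Mehmet": ["Ali"],
    "Zeynep": ["Ali", "Fatma"],
    "Ali": ["Hasan"],
}

sample = snowball_sample(contacts, seeds=["Ayse"], max_waves=2)
print(sample)
```

With a single seed and two waves, the sample stops before reaching people who are three referrals away, which shows how strongly the result depends on the choice of the initial sample.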

As is often the case, this new term was the magic keyword to find additional resources and understand how NamSor technology could fit with the current state of migration research methodology:

This document clearly describes the various methodologies to identify the initial population of a study and the various sampling procedures. Onomastic sampling is one of them.

‘In many countries, migrants constitute a substantial part of society. In public opinion research, however, they are often inadequately or not at all considered. This paper gives a systematic overview of the underlying methodological challenges that cause this situation. Those challenges are twofold and concern (1) the definition and distinction of the terms migrant and foreigner to describe the target group and (2) the selection of adequate sampling procedures.’

‘The methodological challenge of selecting adequate sampling procedures

Even after defining the target population, researchers still face difficulties regarding sampling. The problems tackled can be divers, for instance in what way the target population can be contacted (which survey modes are culturally accepted?) and how the individual respondents can be selected (e.g. does last-birthday work?). The paper discusses four central sampling procedures which regularly come up in the literature and which are seemingly appropriate for these kinds of surveys:

1. Sampling procedures on the basis of administrative records,

2. Area sampling, like e.g. random-route-procedures,

3. Network/snowball sampling, and

4. Onomastic sampling procedures based on foreign names from directories.’

How can NamSor software help?

1. Sampling procedures on the basis of administrative records

In this sampling method, administrative records do not reflect the fine-grained identity of the populations: ‘Turkish nationality’ or ‘Born in Turkey’ encompasses many different populations. Applied onomastics can help refine samples to more targeted populations (Turkish, Alevi, Kurdish, Syrian, …).

2. Area sampling, like e.g. random-route-procedures

In this sampling method, it’s critical to understand the geo-demographics of a territory to know where different migrant populations are concentrated. Applied onomastics can help assess the density of migrant populations at various levels (region/city/district or road) from various public data sources.
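To make the idea of density by area concrete, here is a small Python sketch. It assumes each record already carries an origin label inferred from the person's name by some onomastic classifier; the records, the labels and the function name `share_by_district` are hypothetical and only serve the example.

```python
from collections import Counter


def share_by_district(records, target_origin):
    """Return, per district, the share of records labelled target_origin."""
    totals = Counter()
    hits = Counter()
    for district, origin in records:
        totals[district] += 1
        if origin == target_origin:
            hits[district] += 1
    return {district: hits[district] / totals[district] for district in totals}


# Hypothetical (district, inferred origin) pairs from a public data source.
records = [
    ("North", "Turkish"), ("North", "Other"),
    ("North", "Turkish"), ("North", "Other"),
    ("South", "Other"), ("South", "Turkish"),
    ("South", "Other"), ("South", "Other"),
]

density = share_by_district(records, "Turkish")
print(density)
```

A researcher could then weight random-route starting points towards the districts with the highest share, keeping in mind that the labels are estimates and not ground truth.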

3. Network/snowball sampling

In this sampling method, the personal network of the researcher is used as an initial seed to identify further prospects for interviews. Applied onomastics could help analyse personal networks of researchers (from social networks such as Twitter, or academic sources such as bibliographic databases) to identify larger seed networks and generate better sampling. That could help reduce the risk of biases induced by the researcher’s network (reinforcing their own personal or cultural biases).

4. Onomastic sampling procedures based on foreign names

Dictionaries of given names and family names associated with a particular culture have been used for sampling.
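The dictionary-based technique can be sketched in a few lines of Python. The name lists and the directory below are tiny, invented examples, and the matching rule (keep an entry if either name appears in a dictionary) is an assumption chosen for simplicity, not a description of any published procedure.

```python
def onomastic_sample(directory, given_names, family_names):
    """Keep directory entries whose given or family name is in the dictionaries."""
    selected = []
    for given, family in directory:
        # casefold() makes the lookup insensitive to capitalisation
        if given.casefold() in given_names or family.casefold() in family_names:
            selected.append((given, family))
    return selected


# Hypothetical dictionaries of names associated with the target group.
given_names = {"mehmet", "ayse"}
family_names = {"yilmaz", "kaya"}

# Hypothetical telephone directory entries: (given name, family name).
directory = [
    ("Mehmet", "Yilmaz"),
    ("John", "Smith"),
    ("Anna", "Kaya"),
    ("Ayse", "Brown"),
]

sample = onomastic_sample(directory, given_names, family_names)
print(sample)
```

An exact lookup like this misses spelling variants and treats every match as equally certain, which is one reason why dictionary-based samples need careful checking.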

NamSor software goes beyond this technique: it uses sociolinguistics to recognize, in a (firstName, lastName) pair, the likely origin of a person, with high accuracy. NamSor software can help researchers conduct onomastic sampling, not just from telephone directories but also from a wide range of modern data sources: social networks, opt-in commercial databases, … with high precision and fine-grained targeting.


NamSor’s powerful technology raises many data privacy and ethical questions, but we’re glad to say that if science and migration studies can be good for society, NamSor can be too.

About NamSor:
NamSor’s mission is to help understand international flows of money, ideas and people. NamSor launched GendRE API, a free API to conduct analysis of gender equality using open data. http://namesorts.com/api/