IOM Diaspora Mapping Toolkit: big data and onomastics

For many countries, knowing their diaspora (who are they, where are they, what are they doing?) and being able to engage them would make a huge difference to their development. Since 2014, we’ve applied NamSor onomastics to produce reliable migration statistics.

The IOM Diaspora Mapping Toolkit analyses the pros and cons of using big data and onomastics for diaspora mapping. It draws on field experience from several projects conducted by the International Organization for Migration (IOM) using NamSor technology. One case study cited in the report is Skills Mapping Through Big Data: A case study of Armenian diaspora in the United States of America and France:

Armenia has one of the largest and oldest diasporas in the world. Its diaspora has been a driving force for the country’s economic survival and development over the past few decades, primarily by providing humanitarian support and remittances, as well as by implementing philanthropic and development projects.

While a number of studies of the Armenian diaspora – both generally and in specific countries of destination – have been carried out, none have focused specifically on identifying the skills and professional networks existing within it.

The report presents the results of the mapping of the Armenian diaspora worldwide, with a special focus on the United States and France, conducted through innovative methods of using big data, including web traffic analysis and onomastic analysis of public databases. The databases were analysed specifically to create demographic and skills profiles, with identifiers such as education level, sector of employment and field of study. To better understand skilled diaspora communities and how they might be reached for development through knowledge transfer initiatives, interviews with key stakeholders, experts and diaspora members supplemented the quantitative analysis.

The report will serve as a reliable resource for researchers, scientists and policymakers, as well as the general public, who are interested in general or, specifically, Armenian diaspora studies, the migration–development nexus, skills transfers and other similar topics.

Here is an excerpt from the IOM Diaspora Mapping Toolkit on big data and onomastics.

IOM Diaspora Mapping (4): Big data and onomastics

4.1 Intuition

Although it is difficult to find an agreed-upon definition of big data, the term generally refers to data that is generated by individuals and compiled, through an inadvertent process, in the databases of companies and service providers.6 Such data can be generated through phone calls, text messages, internet-based activities, social media interactions, satellites and other related means.7 Big data is characterized by its huge volume, the real-time speed at which it is generated, and its wide variety. Big data can be retrieved from structured data, such as financial transactions and governmental records, and from unstructured data, such as social media communications and activities. In order to extract meaningful information from such huge volumes of data, advanced technologies or programmes are generally needed.8 For example, if your aim is to analyse all comments on a certain Facebook page, it is not feasible to do this manually; instead, you need software such as Facepager, an open-source tool that extracts comments from Facebook. Data extracted from big data sources can provide statistical information on internal and international migration stocks, flows, and trends which are difficult to obtain from traditional sources. In addition, data extracted from social media in particular can potentially reveal public trends and information on migrants in destination countries, including their distribution and behaviour. When it comes to diaspora studies, big data may be narrowed to the diaspora (for example, by focusing on social media accounts that explicitly concentrate on diaspora), but identifying individuals as diaspora members according to the belonging dimension can be difficult.
IOM Diaspora Mapping Toolkit 4.1

4.2 Different sources of big data

Sources of big data are numerous.9 The category includes anonymized data collected from users of mobile phones and internet-based platforms, or via digital sensors or meters, including GPS and satellite imagery.10 Access and permissions to such sources vary for many reasons. While some sources are available to the public, others are restricted or securely safeguarded. The following are some examples of sources of big data that can be useful in the context of a diaspora mapping:

Internet-based data: Internet-based data includes the results of searches through websites and search engines, including searches from cell phones, and actual views of or traffic to specific websites that are associated with a particular country (e.g. national news/media sites). In general, searches carried out online can be broad and largely dependent on keywords. In addition, communication through emails and online financial services can be counted as internet-based activities. Social media is also internet-based, but can be considered a separate, important category of big data, discussed below. Studies based on big data from internet-based sources have only recently emerged, especially in the field of migration and diasporas, and are still scarce, although their number is growing. There have, for example, been efforts to estimate the number of migrants in a given country based on anonymized search query records of specific terms. Estimates produced in this way were close to numbers published by the national statistical agency in the case of a study on Polish, Lithuanian and Romanian nationals in the United Kingdom.12 While there are concerns about the quality of such data, it might also be useful in diaspora mappings in terms of understanding, for example, what topics are of interest to specific migrant communities.

Social media analysis: Social media can be simply defined as means of communication that use a computer or smartphone as a medium, through which users create profiles and interact with each other in different ways to share and exchange information and other types of data. The increasing popularity and use of social media that offer user-friendly virtual environments have led to the availability of massive amounts of data on the servers of their respective platforms.13 Given the numerous types of social media platforms that can be geo-tagged and that offer different ways of interacting, such as texting, video and audio sharing, and calling, data collected through social media platforms can cover many types of information about their users. Such information includes data on age, gender, education, geographical locations, preferences, occupations, movement and mobility, places of stay and residence and visited places, skills, health, behaviour, social networks, origins and backgrounds, ethnicity, political affiliations and views, and many others. As will also be discussed in the following section on organizational mapping, data available on social media can be of great importance in the context of diaspora mapping. That is, searching social media platforms based on geographical location, ethnicity, background, affiliation, origin, place of birth, language, or other criteria can support mapping a certain diaspora in a given location. Moreover, such data can enhance our understanding of diasporas’ activities, organizations, opportunities and challenges, among other aspects. For example, Twitter can be used to track hashtags related to a certain popular event in a country to measure its diaspora abroad by linking users’ reactions to their geographical locations.14 Nevertheless, data available on social media platforms can be biased or deceiving for multiple reasons.
Firstly, it is not always representative of an entire population or target group, as access to the internet or to computers and smartphones can differ by geographical location and country. Secondly, information that users provide on social media platforms can be misleading or missing, as most platforms allow users to provide their personal information without validating its accuracy; in other words, they depend on users’ willingness to provide information, which can be influenced by personal incentives. Thirdly, most social media platforms offer their users the option to use their services discreetly without sharing their information publicly, so such users might not appear in normal searches. Lastly, data available through social media platforms might need cleaning before being considered usable. For example, while searching Facebook for users in a specific area with a specific ethnicity can show plenty of results, many of them may be fake accounts, contain wrong information, be unavailable to the public, or be inactive. In addition, it is important to note that the data is not produced with the intention of being used for research, so ethical issues around the use of big data need specific attention, as discussed below when talking about the strengths and weaknesses of such methods.
IOM Diaspora Mapping Toolkit 4.2
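
To make the cleaning and hashtag-tracking ideas in this section more concrete, here is a minimal Python sketch. It is not taken from the toolkit and does not call any real platform API: the Post record, its fields, and the sample data are all hypothetical. It assumes the records have already been collected and anonymized, drops those that are private, inactive or lack a location, and then counts distinct users per country for one hashtag.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Post:
    """One hypothetical, already-anonymized social media record."""
    user_id: str
    hashtags: Tuple[str, ...]
    country: Optional[str]  # self-reported or geo-tagged country; may be missing
    is_public: bool
    is_active: bool


def usable(post: Post) -> bool:
    """Keep only public, active records that carry a location."""
    return post.is_public and post.is_active and post.country is not None


def users_by_country(posts, hashtag: str) -> Counter:
    """Count distinct users per country who used the given hashtag."""
    tag = hashtag.lower()
    seen = set()
    counts = Counter()
    for post in posts:
        if not usable(post):
            continue
        if tag not in (h.lower() for h in post.hashtags):
            continue
        key = (post.user_id, post.country)
        if key in seen:  # the same user posting twice is counted once
            continue
        seen.add(key)
        counts[post.country] += 1
    return counts


# Hypothetical sample data, for illustration only.
posts = [
    Post("u1", ("#IndependenceDay",), "FR", True, True),
    Post("u1", ("#independenceday",), "FR", True, True),   # same user again
    Post("u2", ("#IndependenceDay",), "US", True, True),
    Post("u3", ("#IndependenceDay",), None, True, True),   # no location
    Post("u4", ("#IndependenceDay",), "US", False, True),  # private account
    Post("u5", ("#Football",), "US", True, True),          # unrelated hashtag
]

print(users_by_country(posts, "#IndependenceDay"))
```

With this sample, only u1 (FR) and u2 (US) are counted, once each. The filter cannot detect fake accounts or wrong self-reported information, so the representativeness caveats above still apply to whatever is counted.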

4.3 Big data mining: Onomastics

Onomastics is defined as a “branch of sociolinguistics examining the morphology of names”. It can be used to analyse datasets in order to classify individuals based on their first and last names in addition to other identifiers such as gender, origin or culture.15 In the context of diaspora mapping, the main importance of using onomastics lies in the possibility of locating individuals whose names belong to a certain country or culture. Nevertheless, with globalization and advanced means of mobility, which facilitate rapid cultural exchange among different ethnicities and nations, it can be challenging to identify a person’s origin based on his or her name. For instance, if a researcher is mapping the Lebanese diaspora in Canada, an onomastic approach can end up returning results for the entire Arab diaspora there.
That is because it is very common to find similar names and surnames in different Arab countries such as the Syrian Arab Republic, Lebanon, Jordan, Iraq, etc. There are, however, also examples of diaspora mappings that have used onomastics to profile a diaspora community. IOM recently published a study profiling the Armenian diaspora in the United States and France based on big data mining techniques using the ORCID and ZoomInfo databases.16 These were analysed to create a profile of diaspora members in terms of demographic characteristics and skills. This provided an overview of the potential among the Armenian diaspora in these two countries and was complemented with interviews with diaspora members, key stakeholders and experts to understand how such skilled individuals can be involved in development efforts in the country of (ancestral) origin, for example through a knowledge transfer programme. A similar effort to create a diaspora database and engage more with diaspora members for development is underway in Georgia and Kazakhstan. Onomastics is relatively resource intensive but can be really useful when the mapping aims to identify specific skill sets among the diaspora and to develop specific programming priorities for diaspora engagement. It is, however, also important to treat the results of onomastics carefully when reaching out to diaspora members on this basis, owing to the sensitivity associated with these methods, discussed in the following section.
IOM Diaspora Mapping Toolkit 4.3
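
As a rough illustration of the intuition behind name-based classification, and of why its results need careful handling, here is a deliberately naive Python sketch. It is not NamSor's method and not the method used in the IOM study: it only relies on the well-known fact that many Armenian surnames end in -ian or -yan, and the suffix list, the AMBIGUOUS set and the function name are assumptions made up for this example.

```python
# Toy illustration only: real onomastic tools rely on far richer models
# than a suffix check. All names and lists below are hypothetical.

# Many Armenian surnames end in "-ian" or "-yan".
ARMENIAN_SUFFIXES = ("ian", "yan")

# Surnames that match the suffix rule but are common in other cultures too.
AMBIGUOUS = {"christian", "julian", "ryan", "brian"}


def classify_surname(surname: str) -> str:
    """Return 'candidate', 'ambiguous' or 'no match' for a surname."""
    name = surname.strip().lower()
    if not name.endswith(ARMENIAN_SUFFIXES):
        return "no match"
    if name in AMBIGUOUS:
        return "ambiguous"
    return "candidate"


for surname in ["Petrosyan", "Sargsian", "Christian", "Dupont"]:
    print(surname, "->", classify_surname(surname))
```

Even in this tiny example, a suffix rule alone would wrongly flag "Christian" or "Ryan" without the hand-made exclusion list, and it would miss Armenian families whose names do not follow the pattern. A "candidate" label is therefore only a starting point for further checks, never a statement about how a person identifies.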

4.4 Strengths and weaknesses

About NamSor

NamSor™ Applied Onomastics is a European vendor of sociolinguistics software (NamSor sorts names). NamSor’s mission is to help understand international flows of money, ideas and people. We proudly support Gender Gap Grader.