[UPDATE Sept 2014 – many of the mis-classifications listed below are now handled properly by NamSor Gender API, combining classic baby name dictionaries and more advanced sociolinguistics]
We’ve used applied onomastics to infer, from personal names, the likely gender of about 5 million people in different lists of The Internet Movie Database (IMDb, 2005). How reliable is the method? The number of actresses that we’ve classified as male (or actors classified as female) offers an immediate answer: misclassifications are negligible compared to the wide gender gap seen in stereotypically ‘male jobs’ such as Cinematographer and ‘female jobs’ such as Costume designer.
Read the original article on Elena’s blog. The following post discusses the methodology in more detail.
For number crunchers who would like to perform their own statistical analysis, we’ve disclosed the full data file (imdb_jobs_gender.zip). Also, NamSor Gender API to infer gender from personal names is open and free to use.
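As a starting point for such an analysis, here is a minimal Python sketch of the reliability check described above: the actors and actresses lists give a known gender, and we count how often a name-based guess contradicts it. The function name, the record layout and the toy sample are our own assumptions for illustration; they do not reflect the layout of the data file.

```python
from collections import Counter

def misclassification_rates(records):
    """Return the share of contradicted guesses per known gender.

    records: iterable of (known_gender, predicted_gender) pairs, where
    known_gender is 'male' (actors list) or 'female' (actresses list) and
    predicted_gender is 'male', 'female' or 'unknown'.
    """
    totals = Counter()
    errors = Counter()
    for known, predicted in records:
        totals[known] += 1
        # An 'unknown' guess is not counted as a misclassification.
        if predicted != "unknown" and predicted != known:
            errors[known] += 1
    return {gender: errors[gender] / totals[gender] for gender in totals}

# Made-up sample, purely illustrative (not IMDb data).
sample = [
    ("male", "male"), ("male", "male"), ("male", "female"), ("male", "unknown"),
    ("female", "female"), ("female", "female"), ("female", "female"), ("female", "male"),
]
rates = misclassification_rates(sample)
print(rates)
```

Keeping 'unknown' out of the error count is a design choice: it separates names the classifier declined to guess from names it guessed wrongly.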
We will now disclose the main reasons for misclassifications and, where relevant, how we plan to address them in future versions of the API.
Gender of English names:
We have misclassified a few English names, like Jamie or Taylor. According to social security card applications, 83,831 male and 264,571 female people have borne the name Jamie in the USA since 1879. In IMDb (or in the USA today), by contrast, more men than women bear that name: 2,020 actors versus 933 actresses. Some names are genderless, and name demographics change over time.
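The Jamie figures above can be turned into female shares with a few lines of Python. The counts are the ones quoted in this post; the helper name is ours.

```python
# Counts quoted in the paragraph above.
ssa_male, ssa_female = 83_831, 264_571   # US social security card applications since 1879
imdb_male, imdb_female = 2_020, 933      # IMDb actors vs actresses named Jamie

def female_share(male, female):
    """Fraction of bearers of a name who are female."""
    return female / (male + female)

ssa_share = female_share(ssa_male, ssa_female)
imdb_share = female_share(imdb_male, imdb_female)

print(f"Social security records: {ssa_share:.1%} female")
print(f"IMDb: {imdb_share:.1%} female")
```

Roughly three Jamies in four are female in the historical social security counts, against fewer than one in three in IMDb, which is why a dictionary built on the former misleads on the latter.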
Gender of Italian names:
We’ve misclassified a few Italian Andreas. Andrea is a male name in Italy and a female name in the US.
In a later version of GendRE API, we will deploy our sociolinguistic algorithm to recognize that Andrea Rossini is most likely an Italian name (and consequently most likely a male name), without requiring any indication of country.
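To illustrate the idea (not the actual GendRE algorithm, which is not described here), a toy Python sketch could first guess a cultural origin from the surname and then look up the given name’s gender for that culture. The surname endings, the lookup table and the default of "us" are all crude assumptions made for this example.

```python
# Hypothetical lookup table: gender of a given name per culture.
GENDER_BY_CULTURE = {
    "andrea": {"italian": "male", "us": "female"},
}

# Assumption: a few surname endings that often signal an Italian surname.
ITALIAN_SURNAME_ENDINGS = ("ini", "etti", "elli", "ucci")

def guess_culture(surname):
    """Very rough culture guess; anything not matched falls back to 'us'."""
    if surname.lower().endswith(ITALIAN_SURNAME_ENDINGS):
        return "italian"
    return "us"

def guess_gender(full_name):
    """Guess gender from 'Given Surname' using the culture-specific table."""
    given, _, surname = full_name.partition(" ")
    by_culture = GENDER_BY_CULTURE.get(given.lower())
    if by_culture is None:
        return "unknown"
    return by_culture.get(guess_culture(surname), "unknown")

print(guess_gender("Andrea Rossini"))  # male
print(guess_gender("Andrea Smith"))    # female
```

A real classifier would need far richer evidence than a handful of suffixes, but the two-step structure (culture first, gender second) is the point of the sketch.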
Gender of French names:
Jean is a male name in France and often a female name in the US.
Conversely, Laurence is a female name in France and a male name in the US.
Gender of Spanish/Hispanic names:
Joan is a male name in Spain and often a female name in the US.
The case of IMDb names with just initials:
A few misclassifications come from having just the initials instead of a full given name. It’s hard to guess the gender of N. Watts-Phillips, but in a later version of GendRE API we’ll recognize that N. Zhuravlyov and N. Zeynalova are most likely Slavic names, male and female respectively. GendRE API already recognizes the gender of Russian names when they are spelled in Cyrillic.
The same goes for Lithuanian names: V. Mainialite and V. Rucyte are most likely female names, whereas V. Nikulajevas and V. Belopetravicius are most likely male names.
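The surname examples above suggest a simple heuristic: some Slavic and Lithuanian surname endings are gendered, so the surname alone can hint at gender when the given name is reduced to an initial. The Python sketch below is our own illustration under that assumption, with a deliberately tiny list of endings; it is not the GendRE implementation and would misfire on many surnames from other cultures.

```python
# Assumed gendered surname endings (transliterated), illustrative only.
FEMALE_ENDINGS = ("ova", "eva", "ite", "yte")
MALE_ENDINGS = ("ov", "ev", "as", "us")

def gender_from_surname(name):
    """Guess gender from the surname ending of a name like 'N. Zeynalova'."""
    parts = name.replace(".", " ").split()
    if not parts:
        return "unknown"
    surname = parts[-1].lower()
    # Female endings are checked first because they are the longer forms.
    if surname.endswith(FEMALE_ENDINGS):
        return "female"
    if surname.endswith(MALE_ENDINGS):
        return "male"
    return "unknown"

for name in ["N. Zhuravlyov", "N. Zeynalova", "V. Mainialite", "V. Rucyte",
             "V. Nikulajevas", "V. Belopetravicius", "N. Watts-Phillips"]:
    print(name, "->", gender_from_surname(name))
```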
Chinese and Korean names:
About half the names from China (or Korea) are misclassified. There aren’t many Chinese names in IMDb, so they have little effect on the overall result.
Guessing the gender of a Chinese name transliterated into Latin characters is no better than flipping a coin. NamSor Gender API works when the Chinese name is written in Chinese characters.
In a later version, we will recognize Chinese names written in the Latin alphabet so that they can be filtered out of name-based gender studies.
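A rough Python sketch of such a filter is shown below. It separates names written in Chinese characters (detected through the CJK Unified Ideographs Unicode block) from names that merely look like romanized Chinese names. The short surname list and the category labels are assumptions for illustration; a real filter would need a much richer model, and some of these surnames also occur in other cultures.

```python
# Small illustrative sample of common Chinese surnames in pinyin (assumption:
# a real filter would use a far larger list or a statistical model).
COMMON_PINYIN_SURNAMES = {"wang", "zhang", "liu", "chen", "yang", "zhao"}

def contains_han(name):
    """True if any character falls in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in name)

def name_category(name):
    """Classify a name as 'han_script', 'romanized_chinese' or 'other'."""
    if contains_han(name):
        return "han_script"
    tokens = name.lower().replace("-", " ").split()
    if any(token in COMMON_PINYIN_SURNAMES for token in tokens):
        return "romanized_chinese"
    return "other"

names = ["王芳", "Zhang Wei", "Jamie Taylor"]
# Keep only names for which a name-based gender guess is meaningful.
usable = [n for n in names if name_category(n) != "romanized_chinese"]
print(usable)
```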
Other special cases:
The above list of causes of misclassification is not exhaustive: names are strongly correlated with gender, but some names are truly genderless (e.g. Kerry). Let’s not forget that gender and sex can also be different concepts; Eurovision 2014 winner Conchita Wurst is a prominent example.
Misclassifications in the NamSor Gender API are scarce and negligible given the huge gender gap seen in the film industry. We’ve identified several opportunities to improve, combining name gender demographics with our unique algorithm for linguistic/cultural classification of names. Even in its current form, NamSor Gender API could be a useful tool for gender researchers and for women who want to perform gender gap analysis on their own, using open data.
We wish Elena – and other women – the best of luck to make their way in the Festival and in the film industry. We hope NamSor Gender API will be useful to monitor gender equality progress in the years to come.
Elian CARSENAT, a computer scientist trained at ENSIIE/INRIA, started his career at JP Morgan in Paris in 1997. He later worked as consultant and managed business & IT projects in London, Paris, Moscow and Shanghai.
Elian founded NamSor™ Applied Onomastics (NamSor sorts names). NamSor’s mission is to help understand international flows of money, ideas and people.