NamSor Blog

RapidMiner to enrich Gender data

[UPDATE – this page is deprecated. Please check RapidMiner tutorials on how to connect to a simple REST API such as NamSor API for Gender Classification of names]

[UPDATE September 2014: watch the 3-minute tutorial video]

[UPDATE July 2014: NamSor Onomastics Extension is now available in RapidMiner MarketPlace]

[UPDATE June 2014: we have built an open-source (AGPL) extension for RapidMiner, get it on GitHub]

With Open Data from the Internet Movie Database (IMDb) and a gender prediction API, it was possible to assess the gender gap in the global film industry in minutes. We found that only ~22% of three hundred thousand movie directors worldwide are women.

We used technical skills and a small program to do this first analysis. Could it be done using a friendlier data mining tool? This article shows how a similar gender study can be conducted with RapidMiner.

Get RapidMiner

Install RapidMiner from SourceForge, then add the Text Mining and Web Mining extensions (Help->Updates and Extensions).

In this example, we will read an Excel file with two columns (firstName, lastName) and enrich it with a new column containing the gender (on a -1..+1 scale). Our test file is a list of members of the exclusive club ‘Le Siècle’ (2010), which periodically gathers the French elite: Club_LeSiecle.xlsx (Source: La Marseillaise/cryptome).

Import Excel Data

Drag and drop the Read Excel operator (Import->Data->Read Excel) and launch the Import Configuration Wizard.

The wizard’s default values should be fine, except that Encoding should be set to UTF-8 (Unicode), which is especially important if you want to infer gender for Chinese, Russian or Arabic names.
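To see why the encoding setting matters, here is a small Python sketch. Python is not part of the RapidMiner workflow and is used here only for illustration. It shows what happens to an accented name from the test file when its UTF-8 bytes are read with the wrong encoding.

```python
# Illustration only: a name stored as UTF-8 but decoded with a legacy encoding.
name = "Jaffré"

utf8_bytes = name.encode("utf-8")        # 'é' becomes two bytes: 0xC3 0xA9
garbled = utf8_bytes.decode("latin-1")   # wrong decoder: each byte becomes one character
restored = utf8_bytes.decode("utf-8")    # right decoder: the name survives intact

print(garbled)   # JaffrÃ©
print(restored)  # Jaffré
```

A garbled name like the first one would be sent to the API as a different string, so the prediction would no longer refer to the intended person.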

Enrich Data by Webservice

Next, you will call the Gender prediction API to infer the likely sex/gender for each row in your Excel file.

Drag and drop the Enrich Data by Webservice operator (Web Mining->Services->Enrich Data by Webservice) and connect it to the Read Excel operator.

You can use our free Gender API or the freemium version on Mashape. For this example, we shall use the free plain-text API, entering a URL of the kind given in the url setting below.

NB: we also provide a REST JSON format, not used in this example.

We need to configure the Enrich Data by Webservice operator to pass the parameters and assign the result to a new variable GenderScale (ranging from -1 for male to +1 for female):

– query type : ‘Regular Region’

– attribute type : ‘Numerical’

– regular region queries : add a single attribute ‘GenderScale’ containing the entire result of the API call (i.e. everything between the beginning of the line ^ and the end of the line $)

– request method : ‘GET’

– url : http://api.namsor.com/onomastics/api/gendre/<%FN%>/<%LN%>/fr (<%FN%> and <%LN%> are replaced at runtime by the first name and last name of each row)

– encoding : UTF-8
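For readers who prefer to see the logic spelled out, the settings above roughly amount to this: substitute the two names into the URL, send a GET request, and read the whole plain-text response as one number. Here is a minimal Python sketch under stated assumptions. The function names (`build_url`, `http_get_text`, `gender_scale`) are hypothetical, percent-encoding the names is this sketch's own choice rather than documented operator behaviour, and the service may no longer answer at this address. For that reason the HTTP call is passed in as a parameter and the demonstration uses a stand-in.

```python
from urllib.parse import quote
from urllib.request import urlopen

# URL template taken from the operator settings; <%FN%> and <%LN%> are placeholders.
URL_TEMPLATE = "http://api.namsor.com/onomastics/api/gendre/<%FN%>/<%LN%>/fr"


def build_url(first_name, last_name, template=URL_TEMPLATE):
    """Substitute percent-encoded (UTF-8) names into the template."""
    # safe="" also escapes "/" so a name can never add a path segment.
    fn = quote(first_name, safe="", encoding="utf-8")
    ln = quote(last_name, safe="", encoding="utf-8")
    return template.replace("<%FN%>", fn).replace("<%LN%>", ln)


def http_get_text(url):
    """Plain GET request; decode the response body as UTF-8 text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def gender_scale(first_name, last_name, fetch=http_get_text):
    """Return the whole response body as a float (assumed to lie in -1..+1)."""
    body = fetch(build_url(first_name, last_name))
    return float(body.strip())


if __name__ == "__main__":
    # Offline demonstration with a stand-in fetcher, so no network is needed.
    print(build_url("André", "Lévy-Lang"))
    print(gender_scale("André", "Lévy-Lang", fetch=lambda url: "-0.96\n"))
```

Injecting `fetch` keeps the parsing logic testable without depending on the availability of the web service.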

Write CSV

Next, you will write the output to a CSV file (Export->Data->Write CSV), setting an output file name and selecting UTF-8 encoding again.

Run the Process

Finally, set the process encoding to UTF-8 and run the process.

The output should look like:

"FN";"LN";"GenderScale"
"Philippe";"Jaffré";-1.0
"Bertrand";"Collomb";-1.0
"André";"Lévy-Lang";-0.96
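For comparison, a file in the same layout (semicolon-separated, text fields quoted, numbers unquoted, UTF-8) can be produced outside RapidMiner. The Python sketch below uses the three sample rows shown above. The function names and the idea of writing the file from Python are assumptions for illustration, not part of the original process.

```python
import csv
import io

# The three sample rows from the output above.
rows = [
    ("Philippe", "Jaffré", -1.0),
    ("Bertrand", "Collomb", -1.0),
    ("André", "Lévy-Lang", -0.96),
]


def to_csv_text(rows):
    """Render rows as semicolon-separated CSV; text quoted, numbers left bare."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=";",
        quoting=csv.QUOTE_NONNUMERIC,  # quote every non-numeric field
        lineterminator="\n",
    )
    writer.writerow(("FN", "LN", "GenderScale"))
    writer.writerows(rows)
    return buffer.getvalue()


def save_csv(path, rows):
    """Write the CSV text to disk, explicitly as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(to_csv_text(rows))


csv_text = to_csv_text(rows)
print(csv_text, end="")
```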

What’s the verdict? Women account for ~17% of the members of the French elite club ‘Le Siècle’ (2010).
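A percentage of this kind can be derived from the GenderScale column by counting the rows whose score leans female. The article does not say exactly how the figure was computed, so the Python sketch below is only one plausible reading. The cut-off at 0 and the function name `share_of_women` are assumptions, and the scores are made up for illustration.

```python
def share_of_women(scores, threshold=0.0):
    """Fraction of scores strictly above the threshold (i.e. leaning female)."""
    if not scores:
        raise ValueError("no scores given")
    women = sum(1 for score in scores if score > threshold)
    return women / len(scores)


# Made-up scores for illustration; these are NOT values from the Club file.
toy_scores = [-1.0, -1.0, -0.96, 0.9, -0.8, 1.0]
print(f"{share_of_women(toy_scores):.0%}")  # 2 of 6 scores are positive -> 33%
```

Scores close to 0 are ambiguous, so a stricter analysis might set them aside rather than assign them to either group.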

Further reading:

Meet us on 29 April 2014 at DataTuesday Paris with Girls in Tech Paris, on the topic ‘Women & Data’.
