[UPDATE September 2014: watch the 3-minute tutorial video]
[UPDATE July 2014: the NamSor Onomastics Extension is now available in the RapidMiner Marketplace]
[UPDATE June 2014: we have built an open-source (AGPL) extension for RapidMiner, get it on GitHub]
With Open Data from the Internet Movie Database (IMDb) and a gender prediction API, it was possible to assess the gender gap in the global film industry in minutes. We found that only ~22% of three hundred thousand movie directors worldwide are women.
That first analysis required programming skills and a small custom program. Could it be done using a friendlier data mining tool? This article shows how a similar gender study can be conducted with RapidMiner.
Install RapidMiner from SourceForge, then add the Text Mining and Web Mining extensions (Help->Updates and Extensions).
In this example, we will read an Excel file with two columns (firstName, lastName) and enrich it with a new column containing the gender (on a -1..+1 scale). Our test file is a list of members of the exclusive club ‘Le Siècle’ (2010), which periodically gathers the French élite: Club_LeSiecle.xlsx (Source: La Marseillaise/cryptome).
Import Excel Data
Drag and drop the Read Excel operator (Import->Data->Read Excel) and launch the Import Configuration Wizard.
The wizard's default values should be fine, except for Encoding, which should be set to UTF-8 (Unicode is required in particular if you want to genderize Chinese, Russian or Arabic names).
Enrich Data by Webservice
Next, you will call the Gender prediction API to infer the likely sex/gender for each row in your Excel file.
Drag and drop the Enrich Data by Webservice operator (Web Mining->Services->Enrich Data by Webservice) and connect it to the Read Excel operator.
You can use our free Gender API or the Freemium plan on Mashape. For this example, we will use the free plain-text API, entering a URL of this kind:
returns -0.99 (i.e. Male)
NB: we also provide a REST JSON format, which is not used in this example.
We need to configure the Enrich Data by Webservice operator to pass the parameters and assign the result to a new variable GenderScale (-1 is Male, +1 is Female):
– query type : ‘Regular Region’
– attribute type : ‘Numerical’
– regular region queries : add a single attribute ‘GenderScale’ containing the entire result from calling the API (i.e. anything between the beginning of the line ^ and the end of the line $)
– request method : ‘GET’
– url : http://api.namsor.com/onomastics/api/gendre/<%FN%>/<%LN%>/fr (FN and LN will be replaced by the firstName and lastName at runtime)
– encoding : UTF-8
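To make this configuration more concrete, here is a minimal Python sketch of what we understand the operator to do for each row: substitute the two name attributes into the URL template, then read the whole plain-text response (everything between ^ and $) as a number. The helper names build_url and parse_gender_scale, the sample names, and the percent-encoding of non-ASCII characters are our own assumptions for illustration; they are not part of RapidMiner or of the API. The sketch does not call the web service.

```python
import re
from urllib.parse import quote

# URL template as entered in the operator's "url" parameter.
URL_TEMPLATE = "http://api.namsor.com/onomastics/api/gendre/<%FN%>/<%LN%>/fr"


def build_url(first_name, last_name, template=URL_TEMPLATE):
    """Replace <%FN%> and <%LN%> with the row's names.

    Assumption: names are percent-encoded as UTF-8 so that accented or
    non-Latin characters survive the HTTP GET request.
    """
    return (
        template
        .replace("<%FN%>", quote(first_name, safe=""))
        .replace("<%LN%>", quote(last_name, safe=""))
    )


def parse_gender_scale(body):
    """Capture the entire single-line response and convert it to a float."""
    match = re.match(r"^(.*)$", body.strip())
    if match is None:
        raise ValueError("expected a single-line plain-text response")
    return float(match.group(1))


# Hypothetical sample row, for illustration only.
print(build_url("André", "Dupont"))
print(parse_gender_scale("-0.99\n"))
```

A negative value such as -0.99 sits at the Male end of the scale, and a positive value at the Female end.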
Next, you will write the output to a CSV file (Export->Data->Write CSV), setting an output file name and selecting UTF-8 encoding again.
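For readers who want to check the export outside RapidMiner, the same step can be sketched in Python with the standard csv module. This is only an illustration: the function names rows_to_csv and write_csv_utf8, the column order and the sample row are our assumptions, not the exact layout produced by the Write CSV operator.

```python
import csv
import io


def rows_to_csv(rows, header=("firstName", "lastName", "GenderScale")):
    """Serialise the enriched rows to CSV text, header line first."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def write_csv_utf8(path, rows):
    """Write the CSV text to disk, explicitly encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(rows_to_csv(rows))


# Hypothetical sample row, for illustration only.
print(rows_to_csv([("André", "Dupont", -0.99)]))
```

Setting the encoding explicitly matters for the same reason as in the import step: without it, accented or non-Latin names may be written in a platform-dependent encoding.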
Run the Process
Last, set the process encoding to UTF-8 and run it.
The output should look like:
What’s the verdict? Women account for ~17% of the French élite club ‘Le Siècle’ (2010).
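A percentage like this can be derived from the GenderScale column with a few lines of code. The sketch below is one possible way to do it, not necessarily how our figure was computed: it assumes that a positive score counts as Female, a negative score as Male, and that a score of exactly 0 is left out as undetermined. The function name share_of_women and the sample scores are made up for illustration.

```python
def share_of_women(scores):
    """Return the fraction of classified names with a positive GenderScale.

    Assumption: score > 0 is counted as Female, score < 0 as Male,
    and score == 0 is ignored as undetermined.
    """
    women = sum(1 for score in scores if score > 0)
    men = sum(1 for score in scores if score < 0)
    classified = women + men
    if classified == 0:
        raise ValueError("no classified names in the input")
    return women / classified


# Made-up scores, for illustration only: one woman, three men, one unknown.
sample_scores = [-0.99, 0.95, -0.8, -0.7, 0.0]
print(f"{share_of_women(sample_scores):.0%}")  # prints 25%
```

A stricter study might only count scores beyond some confidence threshold, which would leave more names unclassified.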
- Assessing the Gender Gap in the Film Industry
- Web Automation : how to use the GendRE Genderizer API on Zapier
- GendRE : an Open Source Android App to genderize your contacts
- (StackOverflow open question) RapidMiner Enrich Data by Webservice with HTTP Header