It's been a while since the Datathon closed, but I hadn't found the time to write about it. This was my first attempt at a Kaggle competition, and I took part because of its nature - the WiDS Datathon seeks to encourage women data scientists to engage in social impact solutions by participating in a predictive analytics contest. They got me at #socialimpact.
Also, in hindsight, I feel this dataset is a good one for those who are starting out in Data Science. Why do I say so?
- Dataset is REAL (The dataset for the challenge will contain demographic and behavioral information from a representative sample of survey respondents from India about their usage of traditional financial and mobile financial services).
- Dataset is huge (>1000 variables).
- Dataset is dirty (contains missing values).
- Typical binary classification problem (Participants will analyze the data and build machine learning and statistical models to predict the gender of each survey respondent).
- Dataset is balanced (53.7% Female to 46.3% Male), hence manageable without knowledge of under/over-sampling.
- Dataset has only categorical variables; this probably reduces the level of difficulty.
- A simple model is able to produce good results.
For me, after understanding the problem to be solved, the first step in data analysis is data cleaning. I removed variables with a large share of missing data (sometimes more than 80% missing). Since we have >1000 variables, dropping some of them from the analysis is fine. I settled on 339 variables in my training dataset.
Next, I looked at whether the variables show a distinct difference in distribution between females and males. I did this through data visualization, plotting each predictor variable against the outcome variable, and through cross-tabulations.
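The missing-data filter can be sketched in a few lines of base R. This is only an illustration: the toy data frame, its column names, and the 50% cutoff are assumptions, not the values used in the competition.

```r
# Toy data frame standing in for the survey data (hypothetical columns).
train <- data.frame(
  is_female = c(1, 0, 1, 0, 1),
  A = c(1, 2, NA, 1, 2),        # 20% missing
  B = c(NA, NA, NA, NA, 1),     # 80% missing
  C = c(NA, NA, NA, NA, NA)     # 100% missing
)

# Share of missing values in each column.
missing_share <- colMeans(is.na(train))

# Keep only the columns below the cutoff (assumed value, tune as needed).
cutoff <- 0.5
train_clean <- train[, missing_share < cutoff, drop = FALSE]

print(missing_share)
print(names(train_clean))
```

Raising or lowering `cutoff` trades off how many variables survive against how much imputation or "missing" re-coding is needed later.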
As we can see, those who answered "1" for variable "MT1A" seemed to be mostly males while those who answered "2" seemed to be mostly females. This is probably a good determinant of the outcome, so we should include it in the model.
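A cross-tabulation and grouped bar chart of this kind might look like the sketch below. The eight rows of data are made up for illustration, and the outcome column name `is_female` is an assumption.

```r
# Made-up responses: MT1A answer and gender (1 = female, 0 = male).
train <- data.frame(
  is_female = c(0, 0, 0, 1, 1, 1, 1, 0),
  MT1A      = c(1, 1, 1, 2, 2, 2, 1, 2)
)

# Counts of each answer by gender.
counts <- table(answer = train$MT1A, is_female = train$is_female)

# Row proportions: within each answer, the share of males (0) and females (1).
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))

# Grouped bar chart: one group per answer, one bar per gender.
barplot(t(counts), beside = TRUE,
        legend.text = c("male", "female"),
        xlab = "MT1A", ylab = "respondents")
```

An answer whose row proportions sit far from the overall gender split is a candidate predictor.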
After selecting possible good determinants of the outcome, we need to re-code the predictor variables. For example, for variable "MT1A", values other than "1" and "2" do not seem to tell us much in differentiating females from males, so we can group them together as "3". "99", which represents "DK", remains "99", and missing values are coded into another category, "100". Then we need to let R know that we're treating the values as categorical rather than continuous data (hence the factor function). Of course, this is not the only way to do the regrouping; there's no right or wrong, but the choice will affect the model performance and definitely the interpretation of the results. As usual, the modelling process is iterative, and we can always come back and change the regrouping.
As there are a total of 338 predictor variables, I plotted the charts in batches of 10, built the first model with seven possibly good determinants I'd identified, and added new variables as I went down the list.
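The re-coding described above can be sketched as a small helper. The function name `recode_mt1a` and the sample vector are hypothetical; the category codes follow the text.

```r
# Re-code MT1A: keep 1, 2 and 99, send missing values to 100,
# and collapse every other value into 3.
recode_mt1a <- function(x) {
  out <- ifelse(is.na(x), 100,
                ifelse(x %in% c(1, 2, 99), x, 3))
  # Tell R these are categories, not numbers.
  factor(out, levels = c(1, 2, 3, 99, 100))
}

raw <- c(1, 2, 5, 7, 99, NA)
recoded <- recode_mt1a(raw)
print(recoded)
print(table(recoded))
```

Fixing `levels` explicitly keeps the categories the same in the training and test sets, even if one of them happens not to contain a given value.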
I tried boosting and bagging thereafter, but the results weren't noticeably better. Perhaps more tuning is required, as these ensemble models should typically work better than a basic model. Nonetheless, in the interest of time, this exercise shows the efficacy of a simple model and highlights how useful a logistic regression model is.
Now let's interpret the model results. Of the 338 predictor variables, I ended up with 9 variables ('MT1A','MT2','GN1','DG6','DL0','DL1','MT6','MT10','FB26_1') and one interaction term (MT2*MT10). The interaction term was added to see if the AUC score improves (it did, by a little). MT2 and MT10 were chosen to interact with each other because it is quite likely that they are related (see what the variables mean below).
Note that in the model output, the value of each variable is appended to its name (for example, FB26_12 is FB26_1 with value "2"), and each such name-value pair is treated as a category in the model.
While it seems that the coefficient of the interaction term is not statistically significant, we can also decide whether to keep the interaction term by checking whether the model's AIC is lower than that of the model without it.
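A model of this shape can be sketched with base R's `glm()`. The data below is simulated and the effect sizes are made up; only the variable names are borrowed from the post, so the output is not the model reported here.

```r
set.seed(1)
n <- 500

# Simulated categorical predictors (stand-ins for the re-coded survey answers).
d <- data.frame(
  MT2  = factor(sample(1:2, n, replace = TRUE)),
  MT10 = factor(sample(1:2, n, replace = TRUE)),
  DG6  = factor(sample(1:3, n, replace = TRUE))
)

# Simulated outcome with arbitrary effects for MT2 = 2 and DG6 = 2.
lin <- -0.5 + 1.2 * (d$MT2 == 2) + 0.8 * (d$DG6 == 2)
d$is_female <- rbinom(n, 1, plogis(lin))

# Logistic regression; MT2 * MT10 expands to both main effects
# plus their interaction.
fit <- glm(is_female ~ MT2 * MT10 + DG6, data = d, family = binomial)
print(summary(fit)$coefficients)
```

The coefficient names in the output (such as `MT22` or `DG63`) show how R glues the factor level onto the variable name.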
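The AIC comparison can be sketched as follows, again on simulated data with made-up effects rather than the competition data.

```r
set.seed(2)
n <- 400

d <- data.frame(
  MT2  = factor(sample(1:2, n, replace = TRUE)),
  MT10 = factor(sample(1:2, n, replace = TRUE))
)
d$is_female <- rbinom(n, 1,
                      plogis(-0.3 + 0.9 * (d$MT2 == 2) + 0.4 * (d$MT10 == 2)))

# Same predictors, without and with the interaction term.
fit_main <- glm(is_female ~ MT2 + MT10, data = d, family = binomial)
fit_int  <- glm(is_female ~ MT2 * MT10, data = d, family = binomial)

# One row per model: number of parameters (df) and AIC; lower AIC is preferred.
aic_table <- AIC(fit_main, fit_int)
print(aic_table)
```

Because AIC penalises each extra parameter, the interaction model only wins if it improves the fit by more than that penalty.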
The model gave an official AUC score of 0.93998, which shows that the model's performance was excellent. (The score for a perfect classifier is 1, and anything above 0.9 is generally regarded as excellent.) AUC stands for area under the ROC (receiver operating characteristic) curve.
So, the variables used in the model were good at predicting the gender of the survey respondent. The variables were mainly centered around income (e.g. whether they are the head or breadwinner of the household, or working) and phone ownership. This reflects the patriarchal nature of the society, and hence perhaps empowering women can begin by enabling them to have a phone to themselves. Another implication for the survey is that it is sufficient to collect responses on these 9 questions instead of the 1000+ questions if one wants to find out whether the person is a female.
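To make the AUC figure more concrete, here is a small base R sketch that computes it from ranks: AUC equals the share of (female, male) pairs in which the female respondent gets the higher predicted score. The helper `auc()` and the six labels and scores are made up for illustration; the competition score was computed by Kaggle, not by this code.

```r
# AUC via the rank-sum (Mann-Whitney) formulation.
auc <- function(labels, scores) {
  r <- rank(scores)                 # ties receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(1, 1, 0, 1, 0, 0)              # 1 = female, 0 = male
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)  # predicted probability of female

print(auc(labels, scores))
```

In this toy example, 8 of the 9 possible (female, male) pairs are ordered correctly, so the AUC is 8/9.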
In terms of technical interpretation, take FB26_1 as an example. With a coefficient of 0.53814 (see FB26_12 in the output table above), the odds of being female for a respondent who answered "No" to this question are exp(0.53814)-1 = 71% higher than for one who answered "Yes", holding the other variables in the model constant. And what does odds mean? Odds, in this case, is the ratio of the probability of being female to the probability of being male. Take DG6 as another example; after re-coding it has three values, namely "1", "2", and "3" (values greater than 3 were recoded as 3). Those who indicated "2" for question DG6 were more likely to be female than those who answered "1": their odds of being female are exp(2.38045)-1 = 981% higher than for those who answered "1", while the odds for those who indicated values other than "1" and "2" are exp(0.28430)-1 = 33% higher.
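The percentages above come from exponentiating the coefficients, which can be checked in R. The coefficient values are the ones quoted in the text; the names `DG62` and `DG63` are assumed from the name-plus-value convention and may not match the output table exactly.

```r
# Coefficients quoted in the post (names for DG6 levels are assumed).
coefs <- c(FB26_12 = 0.53814, DG62 = 2.38045, DG63 = 0.28430)

# exp(coefficient) is the odds ratio relative to the reference category.
odds_ratio <- exp(coefs)

# Percentage change in odds, rounded to whole percent.
pct_change <- round((odds_ratio - 1) * 100)

print(round(odds_ratio, 3))
print(pct_change)
```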
The notebook containing the code can be found here.