Predicting political affiliation

31 Dec 2016

I’ve recently been doing some work related to political affiliation and voter prediction for a friend who is starting a political advocacy organization. As a result, one thing that I have been thinking about a lot lately is predicting party affiliation for voter targeting. For example, can we use Twitter to figure out if folks are Democrats and target them for get-out-the-vote messaging accordingly?

I did a little bit of poking around on the web and found that predicting political affiliation based on Twitter data is a pretty challenging problem that traditionally has been approached in two ways:

Pennacchiotti and Popescu established in 2010 that the latter approach leads to higher accuracy results. Specifically, using only friends (i.e., who individual users follow on Twitter) can generate ~85% accuracy in predicting political affiliation, whereas text parsing coupled with a relatively sophisticated DLDA model does approximately 10 points worse.

What I found (full code and replication data here)

I decided to use a small set of well-known politicians and talking heads to divide users along ideological lines. The final model achieves an accuracy of ~90% in predicting a proxy for political affiliation. Predicting true political affiliation is likely harder, and the 90% accuracy would likely decrease by between 5-15% if we tested the model on a better indicator of political affiliation.

Generating the data

First, I built a small dataset of 250 Democrats and 250 Republicans by searching for users who tweeted “#ImWithHer” and “#MAGA.” Note that this does not generate ideal ground truth test labels, as tweeting these slogans is not necessarily the same as being a Democrat/Republican. We would ideally like some sort of self-reporting tool to generate our test set (i.e., find users who self-identified as Democrats and Republicans), although, unfortunately, the popular tools for self-identification (e.g., www.wefollow.com) have been shut down over the last several years.

Next, I set out a standard list of ‘followees’, or politicians and celebrities we believe should strongly divide individuals along ideological lines. I used the Twitter API to determine, for each (follower, followee) pair, whether there is a relationship or not (binary indicator of 1 or 0). The full set of followees we include here is:

Due to Twitter’s delightful rate limiting policy, this data took almost a full day to gather. :dizzy_face:

Prediction

Predictive power

Random forest results

Finally, I tried out some real machine learning! I compared a couple of different approaches (kNN, logistic regression, random forest), tuned some parameters, and the results above were the best ones I got.

Discussion

Unsurprisingly, kNN performed substantially worse than the random forest and logistic regression models. The latter two models were nearly identical in their performance.

A few notes: