Now that we have a lot of Twitter data for the proposed “bubbling lava” map group, the question becomes how to process it. How can we create an undulating surface that represents Twitter activity across the US? I thought I’d document what I’ve been working on. Many of these ideas we worked out as a group over the past few weeks, so much of it will be familiar, but it’s good to have an organized review.
We want to show how popular Twitter is, but showing raw totals won’t do. More people tweet in Los Angeles than in Wyoming, because nobody lives in Wyoming. But are those few hardy Wyoming pioneers perhaps tweeting more per person than the people in LA? To find out, we need to normalize for population.
Previously, we discussed using county population data to normalize our data: figure the tweets per county, divide by county population, convert the counties to centroids, and then interpolate those tweets-per-person point data. I wanted to take things in a bit of a different direction today. The problem with calculating tweets-per-person and then converting the counties to centroids is that it makes some needless assumptions. We don’t know the population distribution of the counties, and so we assume the population lies at the centroid. However, we do know exactly where the tweets are coming from, so why lose that geography in the very first step? As an example, there’s a prolific Twitter user in a certain Nevada county. That county thus has a high tweets-per-person (hereafter TPP) level. When we simply aggregate the tweets by county and then convert to the centroid, we assume this person is located at the center of the county, when in fact their tweets show their exact location.
After many hours of trial and error, here’s an alternate approach I have devised (Jeremy, you will note that this relates rather strongly to some of your ideas from a few weeks ago):
The helpful folks at SEDAC provided me with a raster of Census 2000-based US population at 30 arc seconds (~1km pixels), which is more than enough for this project. I converted this to a new raster which gives, at each pixel, the number of people within 25km of that location.
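To make the "people within some distance of each pixel" idea concrete, here is a minimal sketch of a circular moving-window sum on a toy grid. The function name `focal_sum`, the grid values, and the radius are all made up for illustration. It also assumes square cells of equal size, which is only roughly true for a 30 arc second raster, since cell width shrinks with latitude. A GIS package would do this far faster on a real raster.

```python
import math

def focal_sum(grid, radius_cells):
    """Return a grid where each cell holds the sum of all input cells
    whose centers lie within radius_cells of it (a circular moving window)."""
    rows, cols = len(grid), len(grid[0])
    r = int(math.floor(radius_cells))
    # Precompute the row/column offsets that fall inside the circle.
    offsets = [(dr, dc)
               for dr in range(-r, r + 1)
               for dc in range(-r, r + 1)
               if dr * dr + dc * dc <= radius_cells * radius_cells]
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for dr, dc in offsets:
                y, x = i + dr, j + dc
                # Cells that fall off the edge of the grid count as empty.
                if 0 <= y < rows and 0 <= x < cols:
                    total += grid[y][x]
            out[i][j] = total
    return out

# Toy 5x5 population grid (made-up numbers): one "city" cell, one "village" cell.
population = [
    [0, 0, 0, 0, 0],
    [0, 900, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 10],
    [0, 0, 0, 0, 0],
]
nearby = focal_sum(population, radius_cells=1)
```

With ~1km pixels, a 25km radius corresponds to roughly 25 cells, so the window covers a couple of thousand cells per pixel. That is why larger radii get expensive quickly.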
Next, we go back to our point shapefile, which has the location of each tweet, and ask it to look at the raster to find out, for each tweet, how many people lived within 25km of the tweet location (I also tried doing 50km, but Arc refused).
Now, divide 1 by that number. This gives us the “value” of any given tweet. For example, a tweet in NYC, which has a population of ~7 million, doesn’t count for much (1 divided by 7 million). But one in the middle of nowhere counts for much more (1 divided by, say, 1000).
Now that each tweet has a value, we calculate a kernel density on those tweets based on their value. Basically, what we do is make a heat map (oh heat maps, will your star ever stop rising?) of the tweet values. NYC tweets don’t count for much each, but there are a lot of them, so they stack up against a small town with just a couple of tweets.
In this way, we’ve normalized for population, and kept a good sense of geography.
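The weighting and density steps above can be sketched as follows. This is only an illustration under stated assumptions: the coordinates, tweet counts, and bandwidth are invented, the function names are hypothetical, and a Gaussian kernel is used for simplicity (a GIS kernel density tool may use a different kernel shape).

```python
import math

def tweet_weight(nearby_population):
    """Value of one tweet: 1 divided by the number of people living near it."""
    return 1.0 / nearby_population

def weighted_density(x, y, tweets, bandwidth):
    """Gaussian kernel density at (x, y), with each tweet scaled by its weight.
    tweets is a list of (x, y, weight) tuples in the same units as bandwidth."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for tx, ty, w in tweets:
        d2 = (x - tx) ** 2 + (y - ty) ** 2
        total += w * norm * math.exp(-d2 / (2.0 * bandwidth ** 2))
    return total

# Made-up data: 7,000 tweets in a city of 7 million, 2 tweets in a town of 1,000,
# placed 100 units apart so the two kernels barely overlap.
city = [(0.0, 0.0, tweet_weight(7_000_000))] * 7000
town = [(100.0, 0.0, tweet_weight(1_000))] * 2
tweets = city + town

city_density = weighted_density(0.0, 0.0, tweets, bandwidth=10.0)
town_density = weighted_density(100.0, 0.0, tweets, bandwidth=10.0)
```

In this toy case the city's weights sum to 0.001 and the town's to 0.002, so the town peak comes out about twice as high as the city peak even though the city has 3,500 times as many tweets.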
Problem: Tiny towns with one prolific Twitter user dominate this data set. This is a problem with the alternate approach (county population centroids) as well. One tweet in a town of 100 counts as much as 70,000 tweets in New York. And if the person in that tiny town has a good day and suddenly decides to tweet three times, they’ve just tripled the entire town’s output of tweets, whereas it’s a lot harder for NYC to triple its output of tweets. Whether it’s raster or vector, any time you normalize by population, you run into the problem of the data set being very sensitive to small population numbers — it doesn’t take much to cause a huge shift in the TPP.
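The arithmetic behind that sensitivity is easy to check. Here is a small sketch using the same round numbers as above (a town of 100 and a city of 7 million); the helper name `tpp` is just for illustration.

```python
def tpp(tweets, population):
    """Tweets per person for one place."""
    return tweets / population

town_pop, city_pop = 100, 7_000_000

# How many city tweets it takes to match the TPP of one tweet in the town.
equivalent_city_tweets = tpp(1, town_pop) * city_pop

# Two extra tweets triple the town's TPP...
town_jump = tpp(3, town_pop) / tpp(1, town_pop)
# ...but barely move a city that already has 70,000 tweets.
city_jump = tpp(70_002, city_pop) / tpp(70_000, city_pop)
```

The same two extra tweets that triple the town's value change the city's by a few thousandths of a percent.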
There are two possible solutions here. 1) Change the area that we’re aggregating over. Instead of looking at the population within 25km of a tweet, we could go to 100km, in which case the population numbers will go up and be less sensitive to small changes. At the extreme end of the scale, though, you could set the range to something like 5000km, which means you’ve got the whole US population included, in which case you’re doing the same thing as simply plotting the raw numbers of tweets.
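2) Stop treating the data linearly. Let’s go logarithmic. A tiny town with a TPP ratio of 0.1 does not get 10x the strength on the map of a larger city with TPP of 0.01. Instead, we apply a logarithm, so that each tenfold difference in TPP becomes one fixed step on the map. For example, if we express TPP per 1,000 people, the town has 100 and the city has 10, and log10 gives 2 versus 1, so the town gets simply 2x the strength on the map.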
Final possible solution, which I have not looked at yet: Value by alpha or something like that. Go bivariate — one variable is TPP, the other is an indicator of data quality.
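For the logarithmic option, a minimal sketch of the idea might look like this. The `floor` parameter is an assumption introduced here, not something we have settled on: it is the TPP that maps to zero strength, which is needed because the logarithm of a ratio below 1 is negative.

```python
import math

def log_strength(tpp, floor):
    """Map strength on a log scale: the number of tenfold steps above floor.
    floor is a hypothetical baseline TPP that maps to zero strength."""
    return max(0.0, math.log10(tpp / floor))

floor = 0.001  # assumed baseline: 1 tweet per 1,000 people

town = log_strength(0.1, floor)   # town TPP is two tenfold steps above the floor
city = log_strength(0.01, floor)  # city TPP is one tenfold step above the floor
```

Under this assumed baseline the town ends up with about twice the city's strength rather than ten times. The catch is that the ratio between the two depends on where the floor is set, so that choice would need to be made deliberately.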
In any case, we are getting what we wanted: data which show that Twitter popularity does not follow the same distribution as the raw number of tweets. But the result may be too susceptible to noise unless we reduce the dominance of the noisy, small-population data.
Now my brain is tired and I am going to stop.