Geography 970

April 29, 2010

from geo-tweets to bubbling lava map frames – the data processing

Filed under: Uncategorized — Jing Gao @ 4:52 pm

Twitter is a rich source of real-time information.  Many of its users choose to reveal their geographic location when tweeting.  Taken together, these geo-tagged messages tell geographic stories that are not seen elsewhere, which makes them very interesting to map.

WHAT is the bubbling lava map?

The bubbling lava map shows the spatial distribution of Twitter users across the continental US.  The height of the bubbling surface corresponds to the standardized user count.  We standardized because the raw user count correlates highly with local population, whose spatial distribution is already relatively well known.  The fraction of the population that uses Twitter is therefore a more interesting measure.  For example, a medium-size college town could be completely flat in the raw count map but appear as a peak in the fraction map.  A similar pattern applies to some tourist destinations, which have a low permanent population but many tweeting tourists.


This animated map shows an average day of Twitter use.  We collected one week of Twitter data in March 2010 and averaged the 7 days into 1.  That is, for every half hour of the day, we summed the Twitter usage from each of the 7 days within that half hour.

HOW to make it at home?

Twitter’s data stream can be collected in real time, for free, through its “gardenhose” API.  The records arrive as text, and each record is one tweet with all of its attributes.  We used our own program, TwitterHitter, to collect the real-time data and filtered the raw stream to keep only the geo-tagged tweets for mapping.
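As a rough illustration of the filtering step (this is not TwitterHitter itself), the sketch below assumes each record arrives as one line of JSON text and that a geo-tagged tweet carries a "geo" object holding a "coordinates" pair in latitude, longitude order.  The field names and the sample records are assumptions made for illustration only.

```python
import json


def keep_geotagged(lines):
    """Yield (lat, lon, created_at) for records that carry coordinates.

    Assumed record layout (hypothetical): one JSON object per line, with an
    optional "geo" field such as {"coordinates": [lat, lon]}.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip malformed lines rather than stop the stream
        if not isinstance(record, dict):
            continue
        geo = record.get("geo")
        if not isinstance(geo, dict):
            continue  # tweet without location information
        coords = geo.get("coordinates")
        if isinstance(coords, list) and len(coords) == 2:
            lat, lon = coords
            yield float(lat), float(lon), record.get("created_at")


# Made-up sample records: one geo-tagged, one without location, one malformed.
sample = [
    '{"text": "hello", "geo": {"coordinates": [43.07, -89.40]}, "created_at": "t1"}',
    '{"text": "no location", "geo": null}',
    'not json',
]
points = list(keep_geotagged(sample))
```

Only the first sample record survives the filter; the other two are dropped silently so that a long-running collection is not interrupted by a bad line.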

We then moved on to ESRI ArcGIS for data processing.  ArcGIS can take the coordinates of locations and generate a map of points accordingly.

From here, we conducted two stages of data processing:

1. Split the tweet points into different files based on the time of day they were published.  We created one file for each half hour (e.g. 00:00 ~ 00:30, 00:30 ~ 01:00, …), for 48 files in total.

2. From each file of tweet points, create one continuous surface showing the standardized user counts.  After design polishing, these surfaces become the frames of the animation.
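Stage 1 can be sketched in plain Python as follows.  This is a minimal illustration rather than the toolbox code: the function names, the tuple layout and the sample timestamps are all hypothetical.  The date is ignored on purpose, so tweets from all 7 days fall into the same 48 half-hour slots.

```python
from collections import defaultdict
from datetime import datetime

SLOTS_PER_DAY = 48  # one slot per half hour


def half_hour_slot(timestamp):
    """Return the slot index 0..47 for a datetime (0 covers 00:00 ~ 00:30)."""
    return timestamp.hour * 2 + (1 if timestamp.minute >= 30 else 0)


def slot_label(slot):
    """Format a slot index as a label such as '00:30 ~ 01:00'."""
    start = slot * 30
    end = start + 30
    return "%02d:%02d ~ %02d:%02d" % (start // 60, start % 60, end // 60, end % 60)


def split_by_slot(tweets):
    """Group (timestamp, lat, lon) tuples by half-hour slot, ignoring the date."""
    groups = defaultdict(list)
    for timestamp, lat, lon in tweets:
        groups[half_hour_slot(timestamp)].append((lat, lon))
    return groups


# Made-up tweets on two different days; the first two share a slot.
tweets = [
    (datetime(2010, 3, 1, 0, 10), 43.07, -89.40),
    (datetime(2010, 3, 2, 0, 25), 40.71, -74.01),
    (datetime(2010, 3, 1, 13, 45), 34.05, -118.24),
]
groups = split_by_slot(tweets)
```

Each entry of `groups` would then be written out as one of the 48 point files.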

For stage 2, we had two proposals:

Proposal A:
– create a heat map using all tweets – this generates a raster layer showing the interpolated total number of tweets for each pixel
– find/create a raster population map whose pixels match with the tweet heat map
– divide the tweet map by the population map – because the two raster maps match, this is a pixel by pixel division
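The last step of Proposal A can be sketched as follows, assuming the two rasters have already been aligned and are held as plain nested lists.  The function name, the grids and the handling of empty cells are illustrative assumptions; in ArcGIS this would be a raster operation rather than hand-written loops.

```python
def divide_rasters(tweet_grid, population_grid, nodata=None):
    """Divide two aligned grids cell by cell.

    Cells with zero population get `nodata` instead of a division error.
    """
    if len(tweet_grid) != len(population_grid):
        raise ValueError("grids have different numbers of rows")
    result = []
    for tweet_row, pop_row in zip(tweet_grid, population_grid):
        if len(tweet_row) != len(pop_row):
            raise ValueError("grids have different numbers of columns")
        result.append([
            t / p if p > 0 else nodata
            for t, p in zip(tweet_row, pop_row)
        ])
    return result


# Made-up 2 x 2 grids: interpolated tweet counts and population per pixel.
tweets = [[4.0, 0.0], [1.0, 2.0]]
population = [[200.0, 50.0], [0.0, 100.0]]
ratio = divide_rasters(tweets, population)
```

The unpopulated pixel is left as no-data instead of being forced to zero, because a tweet count over zero residents has no meaningful ratio.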

The advantage of this method is that it uses the exact location of each tweet.  However, a heat map assumes that the closer a place is to an existing tweet, the more likely a tweet is to occur there.  This assumption does not hold for tweeting: because people and their cell phones are mobile, the exact location of any one tweet is largely random.  Some level of spatial aggregation is therefore necessary, which led us to the second proposal.

Proposal B (final solution):
– aggregate the number of tweets to the county level
– for each county, standardize the raw tweet count as the ratio 100,000 × number of tweets / number of persons, i.e. tweets per 100,000 residents
– using the counties’ geographic centroids, interpolate the standardized tweet count (Inverse Distance Weighted) to a raster surface that covers the entire continental US
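The standardization and interpolation steps can be sketched in plain Python.  This is an illustration under stated assumptions, not the ArcGIS tool: it uses every centroid for every estimate, assumes a distance power of 2, and treats the coordinates as planar.  The county values are made up.

```python
import math


def tweets_per_100k(tweet_count, population):
    """Standardized count: 100000 * number of tweets / number of persons."""
    return 100000.0 * tweet_count / population


def idw(x, y, samples, power=2.0):
    """Inverse Distance Weighted estimate at (x, y).

    samples: list of (sx, sy, value) tuples, here county centroids with rates.
    power: distance exponent; 2 is assumed here as a common default.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # exactly on a centroid: use its value directly
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical counties: (centroid x, centroid y, tweets, persons)
counties = [
    (0.0, 0.0, 50, 100000),
    (2.0, 0.0, 10, 200000),
]
samples = [(cx, cy, tweets_per_100k(t, p)) for cx, cy, t, p in counties]
midpoint = idw(1.0, 0.0, samples)
```

Here the two counties have rates of 50 and 5 tweets per 100,000 persons, and a pixel halfway between the centroids receives their plain average because both weights are equal.  Evaluating `idw` at every pixel center of a grid yields the raster surface.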

Aggregating at the county level removes much of the randomness and unnecessary detail.  Some suggested interpolating from the counties’ population centroids instead of their geographic centroids.  However, both the numerator and the denominator of the standardized ratio are totals for the whole county, so the ratio is an average that we treat as uniform across the county.  Geographic centroids are therefore the more sensible choice.

Both data processing stages were automated with an ArcToolbox, developed in Python, that consists of two tools.  Please feel free to download the toolbox and make it work for you.  (Please note: the download is originally a .zip file, but because WordPress does not accept that file extension, we changed it to .pdf when uploading.  You will need to change the extension back to .zip before using it.)

The resulting images were then carefully polished in design and assembled into the animation.
