Geography 970

March 7, 2010

(mis)Adventures in Geocoding 1,000,000 Tweets

Filed under: Uncategorized — Tim Wallace @ 6:00 pm

Yes, one million tweets.  That’s so many tweets that if you stacked them up they would go from here to the . . . wait, tweets don’t work like that.  Hmm. . . Yes, one million tweets.  That’s so many tweets that if you wanted to geocode them using the Google API, with its limit of 10,000 requests a day, it would take you 100 days.  In a world of instansense (instant+nonsense), 100 days is an awfully long time.  I mean, gosh, what was going on 100 days ago?  Wasn’t that during the Carter Administration?

The problem: 1,000,000 geo-stupid tweets.  These tweets had no x,y location, but we really wanted them to have one.  And despite the very convenient “User location” field offered up by the Twitter feed, for some of these, coordinates are just not a possibility.  Take, for example, “Everywhere!!!!!”, “at the very top”, “Where the love goes”, “Where da cameraz flash”, or “you are kidding, right?”  These aren’t exactly geocode-able locations.  The same issue could apply to some descriptions with odd formatting or typos, such as “02445 Brrooklyne! Baby!”, “Madison Wiskahsin, oh yah”, or (yes, even) “Canada❤”.  But some users are honest, or at least appear honest, and specific.  They fill in their “user location” with strings like “Brisbane, Australia”, “Reading, PA” or even something as specific as “Anspachlaan 110, 1000 Brussels, Belgium 025482400”.  So, how do we geocode these locations?

Potential solutions (and why they weren’t solutions):

1.  ESRI’s Geocoder.  What would ESRI do with “Madison Wiskahsin, oh yah”?  Or, for that matter, what would ESRI do with “Wisconsin, Madison”?  Both have, at the very least, formatting issues.

2.  Google Fusion Tables.  Fusion Tables, which are worth their own blog entry, are a pretty neat tool, but their geocoding option seemed clunky and limited the user to previewing only a certain number of records at a time.

3.  Many geocoding APIs (Yahoo!, for example).  Some of these seem to have trouble with oddly formatted input.  Not to mention that the Yahoo! geocoding service is limited to 5,000 queries per IP address per day.  Youch, that’s 200 days for a million tweets!
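For the record, the day counts in this post are just a ceiling division of the number of tweets by the daily quota.  Here is a minimal Python sketch of that arithmetic; the function name is made up for illustration, and the two quotas are the ones quoted in this post (5,000 a day for Yahoo!, 10,000 a day for Google), not values checked against current documentation.

```python
def days_to_geocode(n_records: int, daily_limit: int) -> int:
    """Whole days needed to send n_records requests at daily_limit per day."""
    if daily_limit <= 0:
        raise ValueError("daily_limit must be positive")
    # Integer ceiling division: a partly used final day still counts as a day.
    return (n_records + daily_limit - 1) // daily_limit


TWEETS = 1_000_000

# Quotas as quoted in this post (assumptions, not verified limits).
print("Yahoo! at 5,000/day:", days_to_geocode(TWEETS, 5_000), "days")
print("Google at 10,000/day:", days_to_geocode(TWEETS, 10_000), "days")
```

Nothing clever here, but it makes the point: at these quotas the bottleneck is the calendar, not the computer.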

Close only counts in horseshoes and hand grenades:

The most reasonable option up front was to use Google’s Geocoding API.  Give it anything, and it will figure out if it’s a location or not.  Really, it will.  Want some examples?  Well, here you go:

(NB: Click around this map and find errors in the data set.  Some highlights are “Minneapolis” in Northern Minnesota,  “Utah” in Michigan and “London” in Mexico.)

The Google API is great for figuring out what “Madison Wiskahsin, oh yah” means.  But it’s kind of not great when it goes ahead and gives you a location for “At the door”.  This was one problem with the Google API: too much information (and too much erroneous information) was not a good thing.  Is it possible that Google’s algorithm is actually too smart?  Well, “smart” might not actually be the right word, but it seems too something.  I give it my name, “Tim Wallace”, and it says, “You want coordinates for ‘Tim Wallace’?  Here you go.  The coordinates for ‘Tim Cavenaugh Ln, Wallace, NC 28466, USA’ are ‘-77.7666174, 34.7272203’.”  (Note the order: longitude first, then latitude.)  Whoa there, Google.  I know where I am… and I’m not in Wallace, NC.  Though now I wonder if that’s where I’m from.

A similar problem occurs when a user offers up vague information, as in the example above of “London”.  The geocoder doesn’t know which London, or even if “London” is a city, town, state, country or business establishment.
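One way to at least notice this kind of ambiguity is to check a place name against a gazetteer and see how many candidates come back.  The sketch below is not what we did for this project; it assumes a tiny hand-made gazetteer (five rows, approximate coordinates, typed in for illustration) and made-up function names, and it only handles exact name matches.

```python
from collections import defaultdict

# Tiny illustrative gazetteer: (name, region, lat, lon).
# Coordinates are approximate; a real gazetteer would have far more rows.
GAZETTEER = [
    ("London", "England, UK", 51.5074, -0.1278),
    ("London", "Ontario, Canada", 42.9849, -81.2453),
    ("Madison", "Wisconsin, USA", 43.0731, -89.4012),
    ("Madison", "Alabama, USA", 34.6993, -86.7483),
    ("Brisbane", "Queensland, Australia", -27.4698, 153.0251),
]


def build_index(rows):
    """Group gazetteer rows by case-folded place name."""
    index = defaultdict(list)
    for name, region, lat, lon in rows:
        index[name.casefold()].append((region, lat, lon))
    return index


def lookup(index, text):
    """Return (status, candidates); status is 'unknown', 'unique' or 'ambiguous'."""
    candidates = index.get(text.strip().casefold(), [])
    if not candidates:
        return "unknown", []
    if len(candidates) == 1:
        return "unique", candidates
    return "ambiguous", candidates


INDEX = build_index(GAZETTEER)

for query in ["London", "Brisbane", "At the door"]:
    status, candidates = lookup(INDEX, query)
    print(f"{query!r}: {status} ({len(candidates)} candidate(s))")
```

An “ambiguous” result doesn’t tell you which London the user meant, but it does tell you not to trust a single confident-looking point on the map.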

In the end, no, I don’t think Google’s algorithm is too anything.  Because when you take a step back and look at what’s going on, it’s only doing what you are asking it to.  We should be smart enough not to ask it where “0” is or where we might find “no way, no how!”.
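Being “smart enough not to ask” could start with a crude pre-filter that throws out the most obviously hopeless strings before they ever reach a geocoder.  The following Python sketch is one guess at such a filter, not something we ran on the data set: the function name, the prefix list and the punctuation rule are all assumptions chosen to match the examples in this post.

```python
# Hypothetical list of openers that usually signal a joke, not a place.
VAGUE_PREFIXES = ("where ", "at the ", "everywhere")


def looks_geocodable(location: str) -> bool:
    """Crude heuristic: is this 'user location' string worth sending to a geocoder?"""
    text = location.strip()
    # Too few letters to be a place name: "", "0", "!!!".
    if sum(ch.isalpha() for ch in text) < 2:
        return False
    # Exclamations and questions are rarely addresses.
    if "!" in text or "?" in text:
        return False
    # Phrases like "Where the love goes" or "At the door".
    if text.casefold().startswith(VAGUE_PREFIXES):
        return False
    return True


samples = [
    "0",
    "no way, no how!",
    "At the door",
    "Where da cameraz flash",
    "Brisbane, Australia",
    "Reading, PA",
    "Madison Wiskahsin, oh yah",
]
for s in samples:
    verdict = "send" if looks_geocodable(s) else "skip"
    print(f"{verdict}: {s!r}")
```

The trade-off is plain to see: this would also skip “02445 Brrooklyne! Baby!”, which a forgiving geocoder might have placed, and it would happily send along “Tim Wallace”.  A filter like this reduces the junk; it doesn’t eliminate it.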

Despite all of this, the Google API was also not the answer for this job.  As stated above, with the limit of 10,000 requests a day, we’d be looking at a 100-day project.  And the real coup de grâce: the fine print in the Google Geocoder API terms specifies:

10. License Restrictions: You must not (nor may you permit anyone else to):

10.12 use or display the Content without a corresponding Google map, unless you are explicitly permitted to do so in the Maps APIs Documentation, the Street View API Documentation, or through written permission from Google (for example, you must not use geocodes obtained through the Service except in conjunction with a Google map, but the Street View API Documentation explicitly permits you to display Street View imagery without a corresponding Google map).

So, what do we do?  How do we deal with this inability to geocode 1,000,000 Tweets in a swift and accurate fashion?  I guess we do what we used to – you know, before insanely great free services.  We deal with it.



2 Comments »

  1. […] Geographic data in the form of lat/long pairs is encoded in the Twitter data stream from third party applications such as ÜberTwitter or by mobile devices such as iPhones. These geographic coordinates provided the platform for exploring the geography of Twitter.  The stream also has optional user-added locations or addresses. Since approximately 90% of the stream was without coordinates, significant time was devoted to an attempt to transform the “user location” field (such as “New York City” or “Galveston, Texas”) into lat/lon pairs.  Ultimately, however, the processing time associated with georeferencing tens of thousands of points proved prohibitive.  Additionally, there were problems with getting accurate coordinates from a highly variable text field — if a user gives their location as “Madison,” for example, do they mean Madison, Wisconsin? Madison, Alabama?  For more on why our project only used 10% of available Tweets, see this post. […]

    Pingback by Animating Twitter Data « Geography 970 — May 6, 2010 @ 3:06 pm

  2. […] system was not possible (or at least not advisable). Feel free to read about it here […]

    Pingback by Google Reverse Geolocation as an Ad Scheme? « Tim Wallace — March 10, 2011 @ 4:18 pm

