Geography 970

April 29, 2010

from geo-tweets to bubbling lava map frames – the data processing

Filed under: Uncategorized — Jing Gao @ 4:52 pm

Twitter is a rich source of instant information.  Many of its users have chosen to reveal their geographical locations when tweeting.  Together, these geo-tagged messages tell geographic stories that are not visible elsewhere, which makes them very interesting to map.

WHAT is the bubbling lava map?

The bubbling lava map shows the spatial distribution of Twitter users across the continental US.  The height of the bubbling surface corresponds to the standardized user count.  We standardized because the raw user count correlates strongly with the local population, whose spatial distribution is already relatively well known.  The fraction of the population that is using Twitter is therefore the more interesting measure.  For example, a medium-sized college town could be completely flat in the raw-count map but appear as a peak in the fraction map.  A similar pattern applies to some tourist destinations, which have a small resident population but a lot of tweeting tourists.


This animated map shows an average day of Twitter use.  We collected one week of Twitter data in March 2010 and averaged the 7 days into 1.  That is, for every half-hour of the day, we summed the Twitter usage from each of the 7 days within that half-hour.

HOW to make it at home?

Twitter’s data stream can be collected in real time, for free, through its “gardenhose” API.  The records arrive as text; each record is one tweet with all of its attributes.  We used our own program TwitterHitter to collect the real-time data, and filtered the raw stream to keep only the geo-tagged tweets for mapping.
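TwitterHitter handles this collection for us, but the filtering step itself is simple.  Here is a minimal Python sketch of the idea, assuming each record in the stream is a JSON object; the field names (“geo”, “coordinates”, “created_at”) are assumptions about the record layout, not the exact gardenhose schema:

```python
import json

def filter_geotagged(raw_lines, out_path):
    """Keep only tweets that carry a coordinate pair.

    raw_lines: an iterable of raw JSON strings from the stream.
    The field names are illustrative, not the exact gardenhose schema.
    """
    kept = 0
    with open(out_path, "w") as out:
        for line in raw_lines:
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip malformed records
            geo = tweet.get("geo")
            if geo and geo.get("coordinates"):
                lat, lon = geo["coordinates"]
                out.write("%f,%f,%s\n" % (lat, lon, tweet.get("created_at", "")))
                kept += 1
    return kept
```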

We then moved on to ESRI ArcGIS for data processing.  ArcGIS can take the coordinates of locations and generate a map of points accordingly.

From here, we conducted two stages of data processing:

1. Parse the tweet points into different files based on the time of day they were published.  We created one file for each half-hour (e.g. 00:00 ~ 00:30, 00:30 ~ 1:00, …), so 48 files in total (a small sketch of this binning step follows below).

2. Based on each file of tweet points, create one continuous surface showing the standardized user counts.  After design polishing, these surfaces become the frames used in the animation.
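For stage 1, the half-hour binning is just arithmetic on the tweet timestamp.  A minimal Python sketch, assuming the tweets have already been reduced to (lat, lon, datetime) tuples — an illustrative intermediate format, not our actual shapefile attribute table:

```python
from datetime import datetime
from collections import defaultdict

def bin_by_half_hour(tweets):
    """Group tweets into 48 bins (00:00-00:30, 00:30-01:00, ...)."""
    bins = defaultdict(list)
    for lat, lon, when in tweets:
        index = when.hour * 2 + (1 if when.minute >= 30 else 0)  # 0..47
        bins[index].append((lat, lon))
    return bins

# Example: a tweet at 13:42 falls into bin 27 (13:30-14:00).
example = [(43.07, -89.40, datetime(2010, 3, 15, 13, 42))]
print(sorted(bin_by_half_hour(example).keys()))  # [27]
```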

For stage 2, we had two proposals:

Proposal A:
– create a heat map using all tweets – this generates a raster layer showing the interpolated total number of tweets for each pixel
– find/create a raster population map whose pixels match with the tweet heat map
– divide the tweet map by the population map – because the two raster maps match, this is a pixel-by-pixel division (see the sketch below)
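Once the two rasters line up cell for cell, that division is a single array operation.  A tiny sketch of the idea outside of ArcGIS (array names are illustrative):

```python
import numpy as np

def tweets_per_person(tweet_heat, population):
    """Pixel-by-pixel division of two aligned rasters.

    tweet_heat and population must share the same grid; cells with
    no residents are masked rather than divided by zero.
    """
    pop = np.where(population > 0, population, np.nan)
    return tweet_heat / pop
```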

The advantage of this method is that the exact location of each tweet is utilized.  However, the heat map rests on the assumption that the closer a place is to an existing tweet, the higher its probability of having a tweet happen.  This is not really the case for tweeting: because people and their cell phones are mobile, the exact location of a tweet is quite random.  So some level of spatial aggregation is necessary.  In this spirit, we developed the second proposal.

Proposal B (final solution):
– aggregate the number of tweets to county level
– for each county, standardize the raw tweet count to a rate per 100,000 residents: 100000 * (number of tweets / number of persons)
– using the counties’ geographical centroids, interpolate (Inverse Distance Weighted) the standardized tweet count to a raster surface that covers the entire continental US

Aggregating at the county level removes a lot of randomness and unnecessary detail.  Some suggested using the counties’ population centroids instead of their geographical centroids for the interpolation.  But because both the numerator and the denominator of the standardized rate are totals for the whole county, the rate is an average measure that applies uniformly across the county.  Therefore, geographical centroids are the more sensible choice.
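Our actual implementation is an ArcGIS toolbox (linked below), but the logic of Proposal B can be sketched outside it.  A minimal NumPy version, assuming we already have per-county tweet counts, populations, and centroid coordinates in projected units; all names and numbers here are illustrative:

```python
import numpy as np

def idw_surface(cx, cy, values, grid_x, grid_y, power=2.0):
    """Inverse-distance-weighted interpolation of county values onto a grid.

    cx, cy: county centroid coordinates (projected, e.g. meters).
    values: standardized tweet rate per county.
    grid_x, grid_y: 2-D arrays of cell-center coordinates.
    """
    surface = np.zeros_like(grid_x, dtype=float)
    for i in range(grid_x.shape[0]):
        for j in range(grid_x.shape[1]):
            d = np.hypot(cx - grid_x[i, j], cy - grid_y[i, j])
            if d.min() < 1e-9:               # cell sits right on a centroid
                surface[i, j] = values[d.argmin()]
                continue
            w = 1.0 / d ** power
            surface[i, j] = np.sum(w * values) / np.sum(w)
    return surface

# Standardization step: tweets per 100,000 residents for each county.
tweets = np.array([120.0, 35.0, 6.0])
population = np.array([500000.0, 90000.0, 4000.0])
rate = 100000.0 * tweets / population
```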

Both data-processing stages were automated: an ArcToolbox consisting of two tools was developed in Python.  Please feel free to download the toolbox and make it work for you.  (PLEASE BE AWARE: this is originally a .zip file, but since WordPress does not allow that file extension, we changed it to .pdf when uploading.  So you will need to change the extension back to .zip before using it.)

The resulting images were put through careful design and made into the animation.


April 19, 2010

Competition

Filed under: Uncategorized — Jeremy White @ 11:43 am

A new graphic, by Shane Snow of Gizmodo, shows the current market showdown involving Microsoft, Google and Apple.  It’s interesting how much overlap there appears to be in market segments that most people would consider outside the core business of each company.  Do these large corporations focus on products and services that are in direct competition with their rivals or are they just responding to the needs, or future needs, of the market?  It’s tough to say definitively, especially when many of these efforts are currently a monetary drain for the companies involved.  Adding Adobe to the graphic would be interesting, especially considering the alliances that are being broken and forged with the companies in the graphic.

April 12, 2010

More Offensive Pushpins

Filed under: Uncategorized — Tim Wallace @ 12:06 pm

It’s not my intention to become the “critical pushpinnery” guy.  But after my post in January on the US drone attacks mapped using Google MyMaps (which appears to no longer have any pushpins at all now), I keep seeing maps that would benefit not just from a “better choice” of point symbol, but from some choice at all (rather than simply using the default).  This led me to the Google Maps Icon collection.  The idea behind it is great – we shouldn’t use the same point symbols for every type of point.  Not only is this graphically inappropriate (not being able to tell different points apart – like highways and POIs in this Betty Lou Cruises map), it can also be emotionally inappropriate (as Daniel Huffman suggests in his blog post ‘a war without humans‘).

But the Google Maps Icon collection creates new issues.  By allowing users to create and upload their own icon symbols to the community, some potentially inappropriate icons are bound to make it through.  Below is a small sample of some icons available through this community.  Some of them could pretty easily be argued as offensive.  Others just seem odd to me.  What do you think?

April 4, 2010

tools for data visualizations

Filed under: Uncategorized — kjmcgrath @ 11:16 pm

This post relates to previous posts I’ve put up on the movement happening right now toward visualizing data through geography – i.e. a map – by people not normally connected to cartography or geography. There are a growing number of tools that help people visualize their own data, especially if they don’t have the ability to do it themselves. This is a transition: from individuals (graphic designers, computer scientists, statisticians, cartographers, and many others with the know-how or the willingness to learn) working on cartographic problems themselves – often creating maps and graphics that blaze across the blogosphere like wildfire, because people love graphics – toward a paradigm where people without specialized knowledge visualize their own data. Tools are being created to help these inexperienced users visualize data without dealing with code or specialized programs, and still produce graphics and maps that impart meaningful information.

Tim touched upon one important tool, Google Fusion Tables, which works to automatically place an uploaded spreadsheet in geographic context when possible. It can also graph and visualize the tables in some nifty ways. Another is GeoCommons Maker, which has been out for a while and was built by people with the know-how (some from this very department) to give users some basic cartographic tools for making maps.

There are surely many others, but the one that spurred this post comes from a post I saw on FlowingData – InstantAtlas. While still focused on the map as the method for communicating information, it provides other useful views of the data in the form of histograms, parallel coordinate plots, and other graphics that are great for digging in. Many good interactive maps have had this capability for some time (e.g. several 575 projects; Zach Johnson’s Freedom Atlas comes to mind first). But those projects were often built with a story in mind rather than as a tool handed to someone to visualize their own data. I have never used InstantAtlas and rarely use Maker or Google Fusion Tables, but I think these web services signal a move from a public that only consumes graphics to one where users are both creators and consumers.

I think products and tools like these will become more and more available and sophisticated (though still approachable) as the public becomes more map- and data-literate. When Google provides a service for a perceived problem, there is likely something there. It is the job of statisticians, designers, and those with domain knowledge (yes, even geographers, cartographers, and GISers) to point to and create these tools, so that the naive user can still draw upon the plethora of academic literature and research that has been done in each of those fields and produce a map or graphic that shows the data and will hold up to scrutiny. These tools will make use of the huge amount of data not only available on the web but also sitting on people’s hard drives. This move, while scary for those with domain knowledge, seems to be the way the world is rolling. In my opinion there is a place for these experts to help and become leaders in how the public consumes and creates graphics, through the design of tools for data visualization.

Twitter 3D Global Map: Preliminary Result

Filed under: Uncategorized — Fei Du @ 9:24 pm

I decided to use Processing instead of Blender to realize the 3D version of the Twitter connection map, since the syntax of Processing is more like Java, with which I’m familiar. Another reason is that I’m really fascinated by many of the visualizations created with Processing.

A nice thing is that Processing can create both animations and interactive applications.

This animation is a preliminary result. It is a little large (41MB) in size, but it does show the basic functions: rendering connections along the earth’s surface, rendering a 3D parabola path, rotating and focusing, and hiding the virtual globe.
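For anyone curious what the “3D parabola path” means geometrically: the arc between two locations is lifted above the globe by a parabolic height profile that peaks at the midpoint. The actual sketch is written in Processing, but here is the same idea in Python; the radius and peak height are placeholder numbers, not what the animation uses:

```python
import math

def parabola_path(lat1, lon1, lat2, lon2, n=50, radius=1.0, peak=0.25):
    """Sample points along an arc between two locations on a unit sphere,
    lifted by a parabolic height profile (highest at the midpoint)."""
    def to_xyz(lat, lon):
        la, lo = math.radians(lat), math.radians(lon)
        return (math.cos(la) * math.cos(lo),
                math.cos(la) * math.sin(lo),
                math.sin(la))

    a, b = to_xyz(lat1, lon1), to_xyz(lat2, lon2)
    points = []
    for i in range(n + 1):
        t = i / float(n)
        # Blend between the endpoints, push back onto the sphere ...
        x, y, z = (a[k] + t * (b[k] - a[k]) for k in range(3))
        norm = math.sqrt(x * x + y * y + z * z)
        # ... then lift by a parabola that is 0 at both ends, peak at t = 0.5.
        lift = radius + peak * 4.0 * t * (1.0 - t)
        points.append((x / norm * lift, y / norm * lift, z / norm * lift))
    return points
```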

My plan for next step is to implement glowing effect for the lines.

(click the picture to download the animation)

Where 2.0 Conference

Filed under: Uncategorized — Fei Du @ 8:47 pm

This is the website for the sixth annual Where 2.0 Conference.  The conference is REALLY relevant to our course.

http://en.oreilly.com/where2010

The keynote speakers include many leaders in the geospatial and related industries. For example, Twitter’s director of geolocation, Othman Laraki, talks about applications of Twitter’s geo component.

Enjoy!

April 3, 2010

A New York State of Mind

Filed under: Uncategorized — markharrower @ 8:48 pm

<shame>Yes, I did just use a Billy Joel quote </shame>

Three New York City maps came to my attention today.

1. Matt B sent me this link today and said he was thinking of 575 conversations while making this:

2. This is a Chinese start-up that came to UW about 3 years ago asking if we’d be interested in doing the campus (we were, but couldn’t afford them). They have an army of digital artists rendering cities in amazing detail. I like the slightly cartoony quality…less formal model-y than Google Earth. Be sure to zoom way in (traffic lights!) and out…for a really busy map it generalizes very nicely.

3. And just ’cause it’s really fun – and I have no idea how they do this (notice the lovely, totally faked lens flare). Be sure to go full screen!


April 1, 2010

What I did to your data

Filed under: Uncategorized — Daniel Huffman @ 10:08 pm

Now that we have a lot of twitter data for the proposed “bubbling lava” map group, the question becomes how to process it. How can we create an undulating surface that represents Twitter across the surface of the US? I thought I’d document what I’ve been working on. Many of these ideas we worked out as a group over the past few weeks, so much of it will be familiar, but it’s good to have an organized review.

We want to show how popular Twitter is, but showing raw totals won’t do. More people Tweet in Los Angeles than in Wyoming, because nobody lives in Wyoming. But, are those few hardy Wyoming pioneers tweeting perhaps more per person than the LA ones? So, we need to normalize for population.

Previously, we discussed using county population data to normalize our data — figure the tweets per county, divide by county population, convert the counties to centroids, and then interpolate those tweets-per-person point data. I wanted to take things in a bit of a different direction today. The problem with calculating tweets-per-person and then converting the counties to centroids is that it makes some needless assumptions. We don’t know the population distribution of the counties, and so we assume the population lies at the centroid. However, we do know exactly where the tweets are coming from — so why lose that geography in the very first step? As an example, there’s a prolific twitter user in a certain Nevada county. That county thus has a high tweets-per-person (hereafter TPP) level. When we simply aggregate the tweets by county and then convert to the centroid, we assume this person is located at the center of the county, when in fact their tweets show their exact location.

After many hours of trial and error, here’s an alternate approach I have devised (Jeremy, you will note that this relates rather strongly to some of your ideas from a few weeks ago):

The helpful folks at SEDAC provided me with a raster of Census 2000-based US population at 30 arc seconds (~1km pixels), which is more than enough for this project. I converted this to a new raster which gives, at each pixel, the number of people within 50km of that location.
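In Arc this is a focal (neighborhood) statistic. The same idea can be sketched outside Arc as a convolution with a circular kernel; a minimal NumPy/SciPy version, assuming the population raster has been resampled to roughly square ~1km pixels (the radius is just a parameter, and this brute-force version is slow on big rasters):

```python
import numpy as np
from scipy import ndimage

def focal_population(pop, radius_px=50):
    """Sum population within a circular neighborhood of each pixel.

    pop: 2-D array of people per pixel (assumed ~1 km pixels, so a
    radius of 50 pixels approximates 50 km).
    """
    y, x = np.ogrid[-radius_px:radius_px + 1, -radius_px:radius_px + 1]
    disk = (x * x + y * y) <= radius_px * radius_px   # circular kernel
    return ndimage.convolve(pop, disk.astype(float), mode="constant", cval=0.0)
```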

Next, we go back to our point shapefile, which has the location of each tweet, and ask it to look at the raster to find out, for each tweet, how many people lived within 25km of the tweet location (I also tried doing 50km, but Arc refused).

Now, divide 1 by that number. This gives us the “value” of any given tweet. For example, a tweet in NYC, which has a population of ~7 million, doesn’t count for much (1 divided by 7 million). But one in the middle of nowhere counts for much more (1 divided by, say, 1000).
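A sketch of that lookup-and-invert step, assuming the tweets’ pixel positions in the local-population raster have already been computed (how you get from lat/lon to row/column depends on the raster’s georeferencing, which I’m glossing over here):

```python
import numpy as np

def tweet_values(tweet_rows, tweet_cols, local_pop):
    """Value of each tweet = 1 / (people living near it).

    tweet_rows, tweet_cols: pixel indices of each tweet in the
    focal-population raster (assumed precomputed).
    local_pop: the 'people nearby' raster.
    """
    pop_at_tweet = local_pop[tweet_rows, tweet_cols]
    pop_at_tweet = np.maximum(pop_at_tweet, 1.0)  # guard against empty pixels
    return 1.0 / pop_at_tweet
```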

Now that each tweet has a value, we calculate a kernel density on those tweets based on their value. Basically, what we do is make a heat map (oh heat maps, will your star ever stop rising?) of the tweet values. NYC tweets don’t count for much each, but there are a lot of them, so they stack up against a small town with just a couple of tweets.
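For the record, here is roughly what a weighted kernel density does, sketched with a plain Gaussian kernel in NumPy; Arc’s kernel shape and bandwidth handling differ, so treat this only as the idea, not the tool:

```python
import numpy as np

def weighted_kernel_density(x, y, values, grid_x, grid_y, bandwidth):
    """Heat map of tweet 'values': each tweet contributes a Gaussian bump
    scaled by its value, and the bumps are summed on the grid.

    x, y: tweet coordinates (projected units); values: 1 / local population;
    grid_x, grid_y: 2-D arrays of grid cell centers; bandwidth: kernel width.
    """
    density = np.zeros_like(grid_x, dtype=float)
    for xi, yi, vi in zip(x, y, values):
        d2 = (grid_x - xi) ** 2 + (grid_y - yi) ** 2
        density += vi * np.exp(-d2 / (2.0 * bandwidth ** 2))
    return density
```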

In this way, we’ve normalized for population, and kept a good sense of geography.

Problem: Tiny towns with one prolific Twitter user dominate this data set. This is a problem with the alternate approach (county population centroids) as well. One tweet in a town of 100 counts as much as 70,000 tweets in New York. And if the person in that tiny town has a good day and suddenly decides to tweet three times, they’ve just tripled the entire town’s output of tweets, whereas it’s a lot harder for NYC to triple its output of tweets. Whether it’s raster or vector, any time you normalize by population, you run into the problem of the data set being very sensitive to small population numbers — it doesn’t take much to cause a huge shift in the TPP.

There are two possible solutions here. 1) Change the area that we’re aggregating over. Instead of looking at the population within 25km of a tweet, we could go to 100km, in which case the population numbers will go up and be less sensitive to small changes. At the extreme end of the scale, though, you could set the range to something like 5000km, which means you’ve got the whole US population included, in which case you’re doing the same thing as simply plotting the raw numbers of tweets.

2) Stop treating the data linearly. Let’s go logarithmic. A tiny town with a TPP ratio of 0.1 does not get 10x the strength on the map of a larger city with TPP of 0.01. Instead, we apply a logarithm and that town gets simply 2x the strength on the map.
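In code this transform is a one-liner on the surface; base 10 and the floor value are arbitrary choices here:

```python
import numpy as np

def log_surface(tpp, floor=1e-6):
    """Compress the dynamic range of a tweets-per-person surface.

    The floor keeps zero-tweet cells from producing -inf; both the
    floor and the base-10 choice are arbitrary.
    """
    return np.log10(np.maximum(tpp, floor))
```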

Final possible solution, which I have not looked at yet: Value by alpha or something like that. Go bivariate — one variable is TPP, the other is an indicator of data quality.

In any case, we are getting what we wanted — data which show that twitter popularity is not the same distribution as raw # of tweets. But, it may be too susceptible to noise unless we reduce the dominance of said noisy data.

Raw kernel density of tweets over 1 wk data set

Kernel density of tweets divided by local population

Kernel density of logarithm of tweets divided by population.

Now my brain is tired and I am going to stop.
