Sunday, October 23, 2011

NMA project update - getting data organised, again

Ok so now that I am on the web, I have to get data organised (again). Here are the results of some of my playing.

For this first sketch I have run a for loop over the collection to build an associative array where the key is the object type and the value is a count of how many items there are of that type. I don't think I fully understand JavaScript associative arrays yet - I have been thinking of them like hash maps in Processing, but really they are just normal objects (there is no separate associative array type), and the keys are not keys so much as object properties. When testing whether my associative array already contains a particular object type I use objectTypeList.hasOwnProperty(object_type). When getting items I can access them with objectTypeList[object_type] (dot notation like objectTypeList.object_type only works if the property name is written out literally), but they don't have an index and it appears I can't get their length - to run through a for loop I use for(key in objectTypeList).
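Roughly, the counting loop looks like this - a minimal sketch in which the items array and its title/object_type fields are stand-ins for the real collection records:

```javascript
// A couple of hypothetical records standing in for the real collection data.
var items = [
  { title: 'Sample bottle', object_type: 'Bottles' },
  { title: 'Another bottle', object_type: 'Bottles' },
  { title: 'Sample coin', object_type: 'Coins' }
];

// Build the associative array (really just a plain object) of counts per object type.
var objectTypeList = {};
for (var i = 0; i < items.length; i++) {
  var object_type = items[i].object_type;
  if (objectTypeList.hasOwnProperty(object_type)) {
    objectTypeList[object_type] += 1;   // seen this type before, bump the count
  } else {
    objectTypeList[object_type] = 1;    // first item of this type
  }
}

// No .length on a plain object, but for...in loops over its properties.
for (var key in objectTypeList) {
  if (objectTypeList.hasOwnProperty(key)) {
    console.log(key + ': ' + objectTypeList[key]);
  }
}
```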

To display, I am adding each item to an unordered HTML list with $.append, which I am also using to format the object type as bold. It would be better to put the object type and the count in different HTML tags with unique classes so that the formatting can be done separately with CSS (see the sketch below). The yellow background is, however, thanks to CSS.

A list of object types and count of items of that type
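The direction I mean would be something like this sketch, assuming an empty <ul id="typeList"> already in the page and the objectTypeList counts from the loop above; the type and the count each get their own span with a class so CSS can handle the styling:

```javascript
// Append one <li> per object type, with the type and count in separate spans
// so each can be styled independently with CSS.
for (var key in objectTypeList) {
  if (objectTypeList.hasOwnProperty(key)) {
    $('#typeList').append(
      '<li><span class="objectType">' + key + '</span> ' +
      '<span class="typeCount">' + objectTypeList[key] + '</span></li>'
    );
  }
}
```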

If I wanted to get all the keys out as an array I could do so with Object.keys(), however this appears to be supported only in the very newest browsers. For now I will run through a for loop and get each one individually. This leaves me wondering whether I am missing a better approach to organising my data.
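For the record, both approaches side by side - a sketch that uses Object.keys() where it exists and falls back to a for...in loop elsewhere:

```javascript
// Get all the object types out as an array of keys.
var keys;
if (Object.keys) {
  keys = Object.keys(objectTypeList);        // newer browsers only (in 2011)
} else {
  keys = [];
  for (var key in objectTypeList) {          // fallback: collect them one by one
    if (objectTypeList.hasOwnProperty(key)) {
      keys.push(key);
    }
  }
}
```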

A list of object types and titles of each item of that type
Breakthrough! Yay! This second sketch is organised. I have built a list for each object type and added the items of that type to it - no need for custom classes, the items appear good to go straight from the JSON as JavaScript objects. Displaying all the titles is proof that I can access the individual items. I am in control! Now the count is simply the length of each list.
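A minimal sketch of the grouping, again assuming an items array of records with title and object_type fields; the JSON objects go straight into the lists and the count of a type is just the length of its list:

```javascript
// Build a list of items for each object type, straight from the JSON records.
var itemsByType = {};
for (var i = 0; i < items.length; i++) {
  var type = items[i].object_type;
  if (!itemsByType.hasOwnProperty(type)) {
    itemsByType[type] = [];             // first item of this type starts a new list
  }
  itemsByType[type].push(items[i]);     // the JSON object itself goes straight in
}

// Proof of access: display every title, and get the count as the list length.
for (var key in itemsByType) {
  for (var j = 0; j < itemsByType[key].length; j++) {
    $('#typeList').append('<li>' + itemsByType[key][j].title + '</li>');
  }
  console.log(key + ': ' + itemsByType[key].length + ' items');
}
```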

Problem - handling undefined keys. Don't know how to skip them. It appears all items have a title and object type recorded, but there is variable use of most other parameters. Something to come back to...

Next step in getting organised: sorting. JavaScript's .sort() worked nicely on this list of keys, which I extracted as described above. The sort function defaults to sorting alphabetically for a list of strings; to sort numerically I had to write a simple comparator function that compared the item list lengths. Once I have a sorted list of keys I can loop through it and, using each key, still access the individual items.

A list of object types sorted alphabetically
A list of object types sorted by count descending of items of that type
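Put together, the two sorts look something like this, working on the itemsByType lists from the sketch above:

```javascript
// Extract the keys (object types) so they can be sorted.
var keys = [];
for (var key in itemsByType) {
  if (itemsByType.hasOwnProperty(key)) {
    keys.push(key);
  }
}

// Alphabetical - the default sort for an array of strings.
keys.sort();

// By count, descending - a comparator that compares the item list lengths.
keys.sort(function (a, b) {
  return itemsByType[b].length - itemsByType[a].length;
});

// A sorted key still gives access to the individual items.
for (var i = 0; i < keys.length; i++) {
  console.log(keys[i] + ': ' + itemsByType[keys[i]].length + ' items');
}
```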
And finally I was able to load images in, simply using the <img> HTML tag and $.append. What a relief! Now I have all the basic ingredients to make some sort of browser for the NMA collection.

A list of object types and images of each item of that type if available
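The image loading came down to something like this - a sketch in which thumbnail_url is a made-up name for whatever field in the NMA data actually holds the image URL:

```javascript
// Append an <img> tag for each item, built as an HTML string and added with $.append.
for (var i = 0; i < items.length; i++) {
  var url = items[i].thumbnail_url;   // hypothetical field name - the real one will differ
  $('#typeList').append('<img src="' + url + '" alt="' + items[i].title + '"/>');
}
```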
Actually it turns out the problem above about identifying undefined properties/keys was very simple to resolve with an if(key == undefined) check.
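So the image loop just needs a check like this, skipping records without an image and keeping a count of the ones that have one:

```javascript
// Skip items whose image property is undefined, and count the available images.
var availableImages = 0;
for (var i = 0; i < items.length; i++) {
  var url = items[i].thumbnail_url;     // hypothetical image URL field, as above
  if (url == undefined) {
    continue;                           // no image recorded for this item
  }
  availableImages++;
  $('#typeList').append('<img src="' + url + '"/>');
}
$('#typeList').append('<li>' + availableImages + ' images available</li>');
```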

A list of object types and images of each item of that type if available with a count of available images
And just to tidy up my weekend of getting data organised, I was able to make each image a link to the corresponding item record in the NMA online catalogue. Quite satisfying!

Item record in the NMA online catalogue linked from my list of object types
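The links are just the images wrapped in <a> tags pointing at each item's record - a sketch in which both the id field and the catalogue URL pattern are placeholders for whatever the real data provides:

```javascript
// Wrap each available image in a link to the item's record in the online catalogue.
for (var i = 0; i < items.length; i++) {
  var item = items[i];
  if (item.thumbnail_url == undefined) {
    continue;
  }
  // Placeholder URL pattern - substitute the real catalogue record address.
  var recordUrl = 'http://example.com/catalogue/' + item.id;
  $('#typeList').append(
    '<a href="' + recordUrl + '"><img src="' + item.thumbnail_url + '"/></a>'
  );
}
```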
A big thanks to Mitchell for getting me started with some of his sketches.

My next steps will be to try working with the full dataset (I have only been using the first 800 items here), and to draw only what is on screen so there are not huge numbers of images flying around (see if I can get ajax to work now?). Then I will need to learn to draw so that I can get more sophisticated formatting and design some analytical visualisations (charts, graphs etc). For these I could try Processing.js, Raphael or D3.js, which supersedes Protovis.

Thursday, October 20, 2011

NMA project update - how do I get on the web?

Big scary hurdle. Don't know where to get started. Moving out of my comfortable Processing world. Have to learn many new things at once...

The first step, I thought, would be to get my data into the browser. I thought I would need to make a database and then an API to call it. My website back end allowed MySQL databases and had phpMyAdmin installed to manage them. Ok, I know it isn't standard to develop and test online, but I didn't/don't want to learn how to set up a local server, at least just yet, on top of everything else that is new. So I looked at phpMyAdmin - I can upload XML, but not JSON, and only files less than 100MB. So the very first thing to do is export a clean, small XML version of the data. Stuck already. Stayed stuck for days.

I tried to write XML with proXML, a library for Processing. But I couldn't figure out how to actually put any data content in the XML elements. I could make elements. I could give them attributes and add children. I could check if elements had data content (text) and get that text. Seems such a basic thing. And the documentation for the library was otherwise good. I tried lots of ways that I made up myself to add data, but could only write elements that were not correctly formed. I couldn't find any help on forums either.

Then I tried to write XML using Java StAX. This library requires you to explicitly code the opening and closing tags, the start of the document and so on. Writing to a stream and remembering to flush was ok. My output used a Java FileWriter, which I thought should work. But it didn't! I kept getting an access denied error. Why? I couldn't figure it out - for ages. It turns out, after investigation prompted by Mitchell, that the Java file path wasn't relative and so it was trying to write to the top level of my C: drive! Problem fixed. Exported clean XML.

However, at Mitchell's suggestion I decided to change approach and, at least initially, try to load JSON directly into the browser and work with it. Hopefully the files won't be too big - usually on the web you would (with an API calling a database) only load the bits you actually needed at any particular time. Mitchell kindly gave me some sketches to hack to get started.

So, working with jQuery, I made my first JavaScript sketches, which selected HTML elements and changed their formatting or added content. Yay, achieved something! Next I tried to load some data - but was badly stuck again. I couldn't get $.ajax({ url: 'dataURL' }).responseText to work, nor $.getJSON. I eventually was able to write some JSON elements into the HTML file and work with those, but I still couldn't get the JSON file to load. In fact I was just about to give up after most of a day of trying different combinations of $.ajax, $.getJSON, JSON.parse(data) and even eval(), which I understood to be a big no-no because it doesn't check for valid JSON and so is a security threat. I had tested online and locally; neither worked.

I searched help forums to find out why the functions were hanging or resulting in variables that were undefined. Then I realised that there were some syntax errors in the JSON data - it was just a bunch of objects floating loose, not separated by commas inside an array. I fixed this, but it still wouldn't work! Searched some more but still couldn't find a solution.
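For the record, the broken data looked roughly like the first form below, and needed to be the second - the same objects separated by commas inside an array - before anything would parse it (the records themselves are made-up examples):

```javascript
// Invalid - bare objects floating next to each other, which is not valid JSON:
//   { "title": "Sample bottle", "object_type": "Bottles" }
//   { "title": "Sample coin", "object_type": "Coins" }

// Valid - the same objects, comma-separated, inside an array:
var validJSON = '[{ "title": "Sample bottle", "object_type": "Bottles" },' +
                ' { "title": "Sample coin", "object_type": "Coins" }]';
var parsed = JSON.parse(validJSON);   // parses happily into an array of objects
```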

Finally - brainwave - try a different browser. Works!!!
Don't know why I didn't think to try this earlier. Browsers are notoriously fussy.

So nothing would work for me in Chrome. Don't know why. In Internet Explorer $.getJSON works, but I still couldn't get $.ajax to work. Don't know why!
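(In hindsight I suspect Chrome was blocking ajax requests to local file:// URLs, which it does by default - but that is a guess.) For reference, the shape of the call that did work was roughly this, where data.json and the #itemList element are stand-ins for my real file and page:

```javascript
// Load the JSON file; the callback runs once the data has arrived and been parsed.
$.getJSON('data.json', function (data) {
  // data is already a JavaScript array of objects - no manual parsing needed.
  $.each(data, function (i, item) {
    $('#itemList').append('<li>' + item.title + ' (' + item.object_type + ')</li>');
  });
});
```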

Anyway here is my very very sweet hello world.

A list of item titles with their object type in brackets

Next, now that the data is in the browser, I will begin to play with it...

End frustration.

Wednesday, October 5, 2011

Getting data organised

My first task with the NMA project was to get started working with the data. Mitchell Whitelaw helpfully set us up with some example code.

Our data came as verbose XML that was too big to keep in memory in Processing, so Mitchell showed us how to split the data in Processing and parse it into JSON format one line at a time, extracting only the data we needed. JSON is a lightweight format based on JavaScript that works well with Java (Processing).

Mitchell also demonstrated loading images from the collection (you can't load them all at once - there are 20,000 in 3 different sizes!) and picking random objects to show, using a class for items. He also showed us HashMaps, which I first used with myTram - looking things up by key is much easier than trying to remember an index position. The HashMap here contains ArrayLists of items organised by object type.

I used the hashmap to select a random object type to show all of the objects of that type in the collection. Clicking through random object types is not a bad way to start browsing. The data was indeed organised!

Showing an object type - motor cars, there are 11 in the NMA collection
Next I wanted to be able to sort the data, so that I could view it other than randomly. It was easy to sort an array alphabetically or numerically using Processing's sort() function, so I converted my ArrayList of object types to an array, and hey presto I had a Ben Ennis Butler-inspired histogram! It was indeed easy to scroll through object types and see how many of each there were.

Object type histogram, alphabetically sorted - advertising cards
Due to memory constraints I only visualised the first 20 object types, but in the future I could have a more sophisticated way of not bothering with what is not on screen.

After this, however, I was stuck. I wanted to sort numerically by the number of items of each object type. I couldn't do this with plain arrays, because even if I extracted an array of all the counts and sorted it, there would be no way to synchronise it with any of the other lists.

The answer - to make another class for objtypes, and then to use comparators, which specify how objects are to be compared. In this case the comparator says that when sorting an ArrayList of object types they should be compared based on the size of their corresponding ArrayList of items.

I visualised this simply as a list for now. I would have to think about what to do visually with the scale difference between the most numerous couple of object types (6000, 3000, 2000) and the quick drop off (to a few hundred) and then a long tail (2, 1). Mitchell suggested something like a treemap that was compact.

List of most numerous object types - there are 6,000 mineral samples in the collection

List of some of the object types for which there is only one item in the collection

I think that now I have the organisation to get started in making mockup visualisations in Processing - I still have to figure out how to translate to an online world. Hopefully I can experiment with the NMA API before building my own MySQL database.

Tuesday, October 4, 2011

Interactive word frequency cloud

Following the data visualisation unit, I was lucky enough to have the opportunity to work over summer as a research assistant for Andrew MacKenzie to develop a tool to explore survey responses from residents, architects and builders who had rebuilt in Duffy after the 2003 Canberra bushfires. The word cloud was built with supervision from Mitchell Whitelaw and is based on code he developed for the A1 Explorer.

Word frequency cloud (architects only, responses to all questions) with substantial control panel for filtering at right
Word frequency cloud with correlations to 'wanted' highlighted and all occurrences of 'wanted' listed on right
The data can be filtered by response to particular questions, by category of respondent (resident who rebuilt, new resident, architect, builder etc) and by individual respondent - so it is possible to see a cloud of everything, of any subgroup of responses, or of an individual response. A list of standard 'stop' words and any words with fewer than 3 characters are removed. Further words can be added to an exclusion list by clicking, which is helpful for looking beyond boring or extremely frequent words that can obscure differentiation between less frequent words.
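The counting side of this is straightforward; as a rough illustration only (the actual tool was built in Processing), a JavaScript sketch of counting word frequencies while dropping stop words and anything shorter than 3 characters might look like this:

```javascript
// Count word frequencies in a block of response text, skipping stop words
// and any word shorter than 3 characters. The stop list here is just a sample.
var stopWords = { 'the': true, 'and': true, 'that': true, 'was': true, 'for': true };

function wordFrequencies(text) {
  var counts = {};
  var words = text.toLowerCase().split(/[^a-z']+/);
  for (var i = 0; i < words.length; i++) {
    var word = words[i];
    if (word.length < 3) continue;        // too short - drop it
    if (stopWords[word]) continue;        // on the stop (or exclusion) list - drop it
    counts[word] = (counts[word] || 0) + 1;
  }
  return counts;
}

// e.g. wordFrequencies('we wanted the house that we wanted') -> { wanted: 2, house: 1 }
```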

All of these filtering options end up in a large control panel, which took a bit of juggling to fit on screen. It may have been neater to hide it in drop-down or pop-up menus. However, I think it was important to highlight the current view position within the entire data set.

Mousing over a word highlights corresponding words that occur in proximity and brings up a scrollable list of all occurrences of the highlighted word in the fragmentary context of the five words before and after it.

An appropriate way to understand and navigate data?

So this is another example of a show everything and zoom in visualisation. However the reason I posted it is primarily to make a brief observation about the appropriateness of visualisation techniques to understand/navigate data. A distinction between understanding and navigation is perhaps important.

In the case of Mitchell Whitelaw's A1 Explorer the word cloud visualises item titles in the National Archives A1 Series. Titles generally are specific and succinct, and considered. The A1 Explorer is a visualisation that reveals some of the topics and relationships in the series, but it is also an interface to the digitised items themselves.

Similarly a word cloud of a carefully crafted speech, such as Obama's inauguration speech, reveals succinctly some of the themes. It is probable that some speeches are written with word cloud analysis in mind. Political rhetoric noticeably employs frequently repeated, memorable, mantras. Of course, as Jodi Dean writes, a word cloud is in many ways a very superficial analysis that ignores sentences, stories and narratives.

A different example, designed specifically for visualisation as a word cloud, was curated by the ABC, who, to mark Julia Gillard's first year as Prime Minister, called for the public to submit 3 words that characterised their perceptions of Gillard and also of opposition leader Tony Abbott. Not surprisingly, the most frequently submitted words aligned closely with the rhetoric that had been most prominent in the media.

Even if visualising words by themselves is appropriate, a critical challenge for word clouds and similar visualisation techniques is being able to locate the small, hidden items, because they are perhaps the most interesting or important. It might be that quantitative data analysis can only ever take us so far, and that curation is necessary to go beyond it. However, when it comes to big data, quantitative analysis might be our only way in - a starting point for exploration.

Andrew MacKenzie has said that the word clouds were very helpful as a research tool and that what they reveal supports his observations during the interviews and his subsequent analysis. My feeling is that there was substantial noise because of the nature of the raw survey data. The responses were not carefully crafted like an Obama speech, or even considered like a title or a 3-word perception of Gillard - they were spontaneous, and people thought as they spoke. The word cloud doesn't distinguish an initial response from a more considered closing summary remark. It doesn't take account of rambles, tangents or the emphasis placed on particular ideas. That said, the quantitative analysis also ignores any bias the researcher might have had in looking for particular ideas.

Ranking G-20 carbon emissions

This is another prototype interactive chart undertaken in the October 2010 data visualisation unit as part of the Master of Digital Design.


Mashed up data sets

In this project I have experimented with mashing up multiple data sets, which visualised together give greater context to the data than if viewed independently.

I have started with a data set from the Wikipedia article on the G-20 major economies, which sets out population and gross domestic product (GDP), total and per capita, both nominal and at purchasing power parity (PPP). This is a rich, interesting and concise data set to explore in itself. It is probably already a mash-up from various sources.

I have added to this set data about carbon emissions for the same countries, extracted from Wikipedia lists of all countries' total emissions and per capita emissions. I then calculated emissions-to-GDP ratios, which is slightly flawed because the respective data came from different years, but very interesting as an indicative, prototype-only exercise. This all took a bit of manual stitching together, but was very rewarding because quickly, visually, it was possible to see greater specific context than is usually available when considering carbon emissions - that is, who was efficient or wasteful in generating money from emissions and who could most afford to reduce them.

There were two visualisation modes in which the data could be explored - ranked lists and a scatter plot. The ranked lists can visualise more than two dimensions simultaneously, while the scatter plot can show clusters of data and outliers. Both are really useful.

Ranked lists - carbon emissions total, per capita, and against GDP nominal and PPP - Australia is highlighted

Scatter plot - carbon emissions per capita on the vertical axis against nominal GDP on the horizontal axis
The visualisations help ground Australia's contribution (current, not historical!) to climate change relative to other major economies during ongoing ferocious political debate about Australia's responsibility to act to reduce emissions. They show China and the United States as significant outliers when it comes to total emissions, and the United States, Australia, Canada and Saudi Arabia as significant outliers when it comes to per capita emissions. When it comes to efficiency, France is way ahead of Italy, Brazil, Germany, the United Kingdom and Japan, who are themselves way ahead of the rest.