Delve Deep: Data Mining Software for Reporters

February 18, 2014 • Digital News • by

This article has been amended. Details below. 

The Associated Press news agency has developed an open-source data mining tool to help journalists with original and investigative reporting.

Overview, a web application, aims to give journalists and scholars an accessible way to analyze massive, disorganized collections of digital documents.

The prototype was completed in 2011 and rewritten as an easy-to-use web application by the summer of 2013.

The project was developed by Jonathan Stray, a computer scientist and journalist who was Interactive Technology editor for the Associated Press and now teaches computational journalism at Columbia University. It was built to help reporters deal with large volumes of documents, such as those released in response to Freedom of Information requests or in government leaks. The algorithms were developed primarily as a way to sift through large reports. The public server currently supports at most 200,000 documents per set, but larger sets are possible if you run your own server. Although many programs can search for keywords or names, none has yet been developed to show the relationships between different topics, people, dates, and places.

Stray’s work has led to an interactive system that analyzes every word in a document to determine that document’s topic. Documents are then filed into folders of similar topics and further grouped into subfolders, and the system visually maps the relationships between them. This helps a user not only find the information being sought but also gain a better understanding of what the documents contain, including important connections that might otherwise go unnoticed.
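The idea of grouping documents by the words they contain can be illustrated with a minimal Python sketch. This is not Overview's actual code: the article does not specify its algorithm, so the sketch assumes a common approach, TF-IDF word weighting with cosine similarity, and a simple greedy grouping step. The function names (`tokenize`, `tfidf_vectors`, `group_by_topic`) and the `threshold` value are hypothetical, and the sketch produces only one level of folders, not subfolders.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep purely alphabetic words only.
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    # Weight each word by how often it appears in a document (TF)
    # and how rare it is across the whole collection (IDF).
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {w: c * math.log(n / doc_freq[w]) for w, c in counts.items()}
        )
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse word-weight vectors.
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_topic(docs, threshold=0.2):
    # Greedy grouping (an assumption, not Overview's method): put each
    # document in the first folder whose first document is similar
    # enough, otherwise start a new folder.
    vectors = tfidf_vectors(docs)
    folders = []  # each folder is a list of document indices
    for i, vec in enumerate(vectors):
        for folder in folders:
            if cosine(vec, vectors[folder[0]]) >= threshold:
                folder.append(i)
                break
        else:
            folders.append([i])
    return folders
```

Calling `group_by_topic` on a list of document strings returns lists of document indices, one list per folder. A reporter would still need to inspect and name each folder, since the grouping reflects shared vocabulary rather than verified meaning.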

A recent example involves the declassified Iraq war logs of private security contractors, for which Stray analyzed 4,500 raw documents. The documents contained information about weapons fired between 2005 and 2007, as recorded by the Department of State.

On deadline, without enough time to read the documents individually, Stray used Overview to mine them for major themes and topics. Overview then provided a tagging interface where topics and subtopics could be named or renamed to make the most sense to the user. The computer-generated topics are just a starting place. The results confirmed much of what had been expected, but there was some surprising information: it appeared that some civilian deaths caused by security contractors had gone unreported. This information was then used in follow-up interviews with the State Department to help answer questions about these potential discrepancies.

The site itself is simple to use, and uploading data is straightforward. A user can upload a whole project, a sample document, or a document set, and can run a basic query on those documents. Once all documents are uploaded, Overview tries to sort them by topic.

It then becomes the role of the reporter (or the scholar) to begin to sort through the folders and subfolders to examine the results (which Overview also displays visually).

Interested in using Overview? Currently, documents in English, Spanish, French, German, Russian, Arabic, Swedish, and Dutch can be analyzed. The challenge is that the user interface is still in English and would need to be translated.

The project is run by The Associated Press and funded by the John S. and James L. Knight Foundation through a Knight News Challenge grant.

This article was modified on 21 February 2014 to reflect the fact that the software was developed in 2011 but was turned into an accessible web application in 2013. It also updates information on Overview’s methodology.

Photo credit: infocux Technologies / Flickr Cc
