February 11, 2014 (updated 21 February 2014, 2:53pm)

Data journalism: What it is, how you can do it and why it matters

Journalists need to be data-savvy… [it’s] going to be about poring over data and equipping yourself with the tools to analyse it and picking out what’s interesting. And keeping it in perspective, helping people out by really seeing where it all fits together, and what’s going on in the country.

Sir Tim Berners-Lee

So what is a data journalist? Using Tim Berners-Lee’s definition, a data journalist is someone who pores over data, equipped with tools to analyse it and pick out what is interesting. Let’s break this definition down so we can understand just how complex it can be.

Data is information, not only numbers.

Data is what all journalists have always worked with: court orders, police records, even press releases. The internet has allowed a steady stream of information to flow between networks of machines. Governments, in a drive for efficiency, have turned from paper to a digital output, making it easier for their data to be made accessible to our machines.

However, none of this means the data made available is always usable.

A big part of a data journalist's job is equipping oneself with tools to clean, structure and analyse the primitive form of data leaking out of crumbling bureaucratic systems.

Data provided by traditional sources is more often than not made for presentation: PDFs or Excel spreadsheets. It is not structured for analysis, visualisation or story-finding. So any journalist wading through the flood of data needs to be equipped with the necessary skills and tools for doing any or all of the above in a timely fashion. This can involve using proprietary software such as Microsoft Excel or ABBYY FineReader, knowledge of freely available tools such as Google Fusion Tables and Open Refine, or building solutions from scratch using code.
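For the last of those routes, here is a minimal sketch of what cleaning a published spreadsheet can look like, using Python's pandas library (the file name and column names are invented for illustration):

    # A minimal data-cleaning sketch using the pandas library.
    # The file name and column names are hypothetical placeholders.
    import pandas as pd

    # Spreadsheets made for presentation often bury the headers a few rows down
    df = pd.read_excel("hospital_deaths.xls", skiprows=3)

    # Tidy the column names left over from the presentation layout
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]

    # Drop the blank rows used as visual padding
    df = df.dropna(how="all")

    # Coerce a numeric column that arrived as text ("1,024", "n/a" and so on)
    df["deaths"] = pd.to_numeric(
        df["deaths"].astype(str).str.replace(",", ""), errors="coerce"
    )

    print(df.head())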

Each data journalist's toolbox will be individual, as will their idea of a data-driven story.

Finding what is interesting in a tide of data depends on which datasets the journalist can access using the tools in his or her box. Because I code, I can access datasets hundreds of millions of rows in size. Off-the-shelf software cannot handle data at that scale, so a journalist who does not code cannot access it either.

Getting stories from data involves understanding what constitutes a good story. For some it may be an exploratory map, others want investigations that generate headlines, and the more community-savvy newsrooms might want to create a narrative around user-generated content.

Looking at all the combinations of data formats, tools and editorial directions, you will not find two data journalists of the same ilk. In fact, looking across newsrooms, across print, web and broadcast, and across broadsheet and tabloid formats, you will not find two data journalists who appear to be of the same species. To understand how this has evolved we need to look to the ancestor of the data journalist: the computer-assisted reporter.

Born in the USA

Data, before it was made sociable or leakable, was the beat of the computer-assisted reporters. The practice, computer-assisted reporting (CAR), dates as far back as 1989 and the setting up of the National Institute for Computer-Assisted Reporting in the United States.

Data-driven journalism and, indeed, CAR have been around since long before social media, Web 2.0 and even the internet. One of the earliest examples of computer-assisted reporting came in 1967, after riots in Detroit, when Philip Meyer used survey research, analysed on a mainframe computer, to show that people who had attended college were just as likely to have rioted as high school dropouts were. This turned the public’s attention to the pervasive racial discrimination in policing and housing in Detroit. Even today, the US is leading the way in digital journalism and data-driven storytelling.

For example, at the end of 2004, The Dallas Morning News analysed school test scores from the Texas Assessment of Knowledge and Skills and uncovered one school’s alleged cheating on standardised tests. This then turned into a story on cheating across the state.

The Seattle Times’s 2008 investigation into logging and landslides revealed how a logging company was repeatedly allowed to clear-cut unstable slopes. Not only did the paper produce an interactive; it also wrote about how the investigation was pieced together using the requested data. Writing up the process is part of the beauty of data journalism, and it is becoming a trend.

Newspapers in the US are clearly beginning to realise that data is a commodity with which you can earn your consumers’ trust.

The need for speed appears to be diminishing as social media gets there first, and viewers turn to the web for richer information.

News, in the sense of something new to you, is being condensed into 140-character alerts, newsletters, status updates and things that go bing on your mobile device.

News companies are starting to think about news online as exploratory information that speaks to the individual (which is Web 2.0). They are no longer the gatekeepers of information but the providers of depth, understanding and meaning in an age where there is too much information.

Open data

The UK government launched its open data website, data.gov.uk, in September 2009. So what is meant exactly by “open data”? There is, in fact, an open definition for open data. It states:

A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike

Open Knowledge Foundation

At face value this appears to be more about online licensing and ownership than about journalism. However, the ideology of reusing online data to build websites and utilities has spurred developers and data enthusiasts to work with data and, in many ways, to provide services akin to those of traditional media organisations.

MySociety was founded as far back as September 2003. Its mission is to “help people become more powerful in the civic and democratic parts of their lives, through digital means”. It is responsible for TheyWorkForYou, a site which helps citizens find the details of their local MP, including how he or she voted on particular topics. It also created WhatDoTheyKnow, an online space for sending and viewing Freedom of Information (FoI) requests.

Both of these are useful to every journalist, not just the data journalist. The Open Knowledge Foundation began in May 2004 in Cambridge. It maintains CKAN, data portal software used by many governments for their open data projects, and OpenSpending, which aims to track every government financial transaction across the world. It even runs School of Data, which provides resources and courses for learning the tools to use data effectively.

In September 2012, the Open Data Institute, founded by Sir Tim Berners-Lee and Professor Nigel Shadbolt, secured £10 million over five years from the UK government. It houses open data start-ups including OpenCorporates, a database of information on 60 million companies worldwide and counting.

The push to make data open, available and ultimately useful is being championed outside the UK newsroom. But perhaps having inventors and developers at the forefront of open data will make data journalism easier to adopt within the newsroom. It also supplies a direct benefit to forward-thinking newsrooms.

The open data movement in the UK has fostered computer programmers’ interest in civic media and engagement. Because the movement is grounded in ideology, its conventional form of outreach is the “hack day”, which pairs programmers with interested parties to create online tools. This has produced a base of self-taught and self-motivated individuals whom newsrooms can hire to form digital teams focused on web input and output.

Now that you know a bit more about the term 'data journalism', let’s look at what it entails.

Computer-assisted reporting

In a newsroom the data journalist will be considered the 'numbers person'.

What is traditionally used for headlines and copy is a ranking: the worst-performing hospitals, the most dangerous roads, the banks giving the biggest bonuses, and so on. This is very simple to do using a filter; however, it may not always be the right thing to do.

Say, for example, we are looking at hospital deaths. An absolute number is not a measure of performance. If we take the top ten hospitals by number of deaths, we will in effect be ranking the hospitals by number of patients, with a bias towards hospitals that have specialist units for cancer, heart disease and other high-mortality conditions. For comparison's sake we have to normalise by the total number of patients and also build an indicator that accounts for the mortality rate of each condition. Composite indicators are directly comparable, and the government is responsible for calculating these.
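To make the point concrete, here is a small sketch with invented figures showing how a raw ranking and a normalised ranking disagree:

    # Invented figures: ranking by raw deaths versus by mortality rate.
    import pandas as pd

    df = pd.DataFrame({
        "hospital": ["Big Specialist", "Mid Town", "Small District"],
        "deaths":   [500, 120, 40],
        "patients": [50000, 8000, 2000],
    })

    # A raw ranking rewards sheer volume: the specialist centre tops the list
    print(df.sort_values("deaths", ascending=False))

    # Normalising by patient numbers gives a comparable rate, and the
    # smallest hospital now has the worst record (0.02 against 0.01)
    print(df.assign(rate=df["deaths"] / df["patients"])
            .sort_values("rate", ascending=False))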

However, you should always read about how these are computed. If you are given absolute numbers, you can then understand where there could be a ranking bias. Ask yourself: “Am I just measuring the total possible volume?” That is, the total number of patients who could die, the total traffic on the road where there could be an accident, the total number of employees who could get a bonus.

Another aspect to consider is the distribution of the data. If the data is normally distributed, then the top ten worst performers should be a certain distance away from the mean (three standard deviations, in fact). So if the average mortality rate is 1.0, the worst performer’s is 1.9 and the best performer’s is 0.1, then what you are seeing in the ranking of the worst performers is performance within the range of expected variation. Someone is always going to perform “badly” when an indicator is calculated from performance within a system with natural variation. Taken to its conclusion, the idea of shutting down hospitals which perform badly would leave us with only one hospital.

If, however, your top-ranking hospital for mortality rates is at 4.0 and the second worst at 1.8, then just producing a ranking is missing the story. That hospital is a significant outlier and questions need to be asked about its performance.
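As an illustration with invented rates, a few lines of Python can flag such an outlier by measuring how far each hospital sits from the mean:

    # Invented mortality-rate indicators for twenty hospitals, one of them extreme.
    import statistics

    rates = [1.0, 0.9, 1.1, 0.95, 1.05, 0.8, 1.2, 1.0, 0.9, 1.1,
             0.85, 1.15, 1.0, 0.95, 1.05, 0.9, 1.1, 1.0, 0.98, 4.0]

    mean = statistics.mean(rates)
    sd = statistics.stdev(rates)

    # Flag anything more than three standard deviations above the mean
    for rate in rates:
        z = (rate - mean) / sd
        if z > 3:
            print(f"Rate {rate} is {z:.1f} standard deviations above the mean")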

A spreadsheet-based piece of software can work out the normalised data, its ranking and its distribution. It cannot tell you when you need to normalise or look at the distribution. Computers assist reporters. It is still up to the data journalist to provide the understanding, pre- and post-analysis.

A numbers person does not just know when to use a median and when to use a mean. The skill of a data journalist lies not in being able to make the calculation but in knowing when one is needed. Computers do not interpret meaning; they do what you tell them, whether it is sensible or not. It is the interpretation that makes a story. Whether you are writing or visualising a story based on data, if you do not have the correct understanding the output will be deeply flawed.
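A trivial example of that judgment, with invented figures: one large payout is enough to make the mean misleading, which is exactly when the median is the better summary.

    # Invented bonus figures: a single big payout drags the mean upwards
    import statistics

    bonuses = [1000, 1200, 1100, 900, 1000, 250000]

    print(statistics.mean(bonuses))    # 42533.33... distorted by the outlier
    print(statistics.median(bonuses))  # 1050.0, the typical payout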

Statistics easier than you think

Understanding basic statistics is a skill vital to any journalist working with data.

Do not be intimidated by statistics. It is easier than you think. The two previous examples will get you a long way, and there are many resources to help you get started. I would recommend reading The Tiger That Isn’t (2007), by Michael Blastland, creator of the BBC Radio 4 programme More or Less, and Andrew Dilnot, and taking Statistics: Making Sense of Data, the online course taught by the University of Toronto.

As a data journalist, your understanding and sense of data is just as valuable as your abilities, if not more. A news organisation can always hire a statistician or a developer. A key role for a data journalist is knowing what is possible, generating ideas and knowing when to dig deeper.

I trained as a broadcast journalist and began my career as a digital producer at CNN International. I took the major (and somewhat crazy) step of leaving CNN to learn to code. I worked as a data journalism advocate for the UK start-up ScraperWiki. The role started in 2011, and it was there that I was first exposed to an HTML tag and the true contents of a webpage. That same year I won a competition, held by the Mozilla Foundation and funded by the Knight Foundation, to put forward-thinking digital communicators into newsrooms throughout the world.

What came to be known as the OpenNews Fellowship awarded me a placement with the Guardian interactive team. After completing my fellowship I was hired as a data journalist at The Times. I am now a developer/journalist, working with programmers, production editors and a statistician to bring the oldest newspaper in the UK into the digital future.

Computer programming most sought-after newsroom skill

I wouldn’t say computer programming is a necessary skill for a data journalist.

It is, however, the most sought-after skill in a newsroom and, as I found, the key to unlocking doors and getting your foot in. I would recommend that anyone interested in becoming a data journalist try their hand at it.

I am a Pythonist: my first coding language is Python. This is mostly due to the fact that the programmers at ScraperWiki are Python programmers. Coding languages are like the languages derived from Latin: they all have structural similarities. No one language is easier than another; the hardest one to learn is your first. Once you are comfortable writing in one computer language, using another is a matter of translating syntax. Your decision as to which language to learn should be based on what you want to do with it.

As with speaking languages, each computer language has its own culture behind it.

Mining the web

First of all, I am not a native web programmer. All of what I have learnt has been garnered from scraping information off websites and building some tools for demonstrations. So I will give you some basic knowledge from a journalistic point of view, and some of the jargon you might hear newsroom developers use.

So which end is which? One thing you will hear is front end and back end, or client side and server side respectively. Very simply, these refer to what runs in the browser and what runs away from it, on a server. Almost all newsroom developers work client side and code in JavaScript. JavaScript, HTML and CSS are all code native to the browser. HTML and CSS are mark-up and styling languages: they tell the browser what is a paragraph, what the font colour should be, and so on. A browser interprets these to show you the webpage.

All the actions you see, such as drop-down menus and reveals, plus interactives, are almost always done in the browser's native programming language, JavaScript. If you are most interested in the visual side of data journalism then your languages of choice are JavaScript and CSS.

If your interests lie more in database generation and analysis, then you would be more of a back end data journalist. Your programming language of choice would be either Python or Ruby, plus the database query language SQL. Know that some news organisations have a preference and ask their programmers to use one or the other, so if you have a target newsroom in mind, check which language it prefers.

The US started newsroom development before the UK, at a time when Python was the popular language, so you will find more Python across the pond. Here you will find mostly Ruby houses. Ruby is more embedded in the web community, having a web development framework called Ruby on Rails. Python is more established in the scientific data community, with lots of add-ons for machine learning and data analysis. Whichever of the two you choose, they are very similar and you can convert from one to the other with relative ease.

So what makes coding so powerful? The key to programming is that a lot of what you want to do has been done before. You are not building things from scratch but reusing code others have packaged for you. These are called libraries.

A good analogy is apps for smartphones. A smartphone has very basic functions but becomes incredibly useful once you start installing apps. The same is true for coding and coding libraries. The basics are very basic but the power lies in using the libraries available. For instance, one of the scraping libraries in Python is called BeautifulSoup (Ruby’s equivalent is called Nokogiri).


Once you understand the syntax for using BeautifulSoup (like getting to know the navigation and functions in an app), you have a new tool in your box that can be applied across the web.
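As a minimal sketch of what that looks like (the URL and the page structure here are hypothetical), scraping a table with BeautifulSoup takes only a few lines:

    # A minimal scraping sketch using the requests and BeautifulSoup libraries.
    # The URL and the table structure are hypothetical placeholders.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com/hospital-statistics")
    soup = BeautifulSoup(response.text, "html.parser")

    # Pull every row out of the first table on the page
    for tr in soup.find("table").find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        print(cells)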

Libraries are built by the community of developers coding in that language and because JavaScript has a huge community there are tens of thousands of libraries. The one that set the data visualisation community buzzing is D3.js, the creator of which, Mike Bostock, went to work for The New York Times.

Newsrooms are not just looking for writers on the web but for builders of the web. They are beginning to realise that we can now build the medium for the message. Being able to write basic coding syntax and use a library will go a long way towards understanding this shift in ideology.

Scraping a story together

If all data were open and all of the web structured correctly, we would never need to scrape or send an FoI request. We do not live in that utopia, and probably never will, but there are certain oases of well-structured and freely available data which represent the best-case scenario for an overworked and underpaid data journalist.

These are APIs. The acronym stands for application programming interface, and an API is a way for you to get direct access to information from a website. It involves making an 'API call', usually in the form of a URL structured in a certain way. Many popular APIs, such as Twitter’s, require a key so the provider can limit the number of calls one computer can make. Each site will have a simple way of getting a key, which is then added to the URL of the API call.

A good example of a well-structured, easy-to-use and journalistically useful API is the TheyWorkForYou API. Once you understand JavaScript Object Notation (a series of one or more 'lists' and 'dictionaries'), you can access all of the UK parliamentary information.
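As an illustrative sketch (TheyWorkForYou issues free API keys; the method name and response fields shown here are simplified from memory, so check the API documentation), fetching a list of MPs in Python might look like this:

    # A sketch of a TheyWorkForYou API call. The 'getMPs' method and the
    # 'name'/'party' fields are as I recall them from the public docs --
    # verify against the current API documentation before relying on them.
    import requests

    url = "https://www.theyworkforyou.com/api/getMPs"
    params = {"key": "your-api-key", "output": "js"}  # 'js' returns JSON

    response = requests.get(url, params=params)

    # The JSON response is a list of dictionaries, one per MP
    for mp in response.json():
        print(mp["name"], mp["party"])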

There is no easy solution to finding stories in data and there is no one way to becoming a data-driven journalist. Journalism is more a craft than an academic endeavour.

The same can arguably be said for software engineering. You need to learn by experimentation and determination. All the hardware you need is a computer. But like all good journalists looking to find something out you need to get away from your desk, find the right people and ask the right questions.

If you would like to know more, stop reading: go online and find your nearest Hacks/Hackers meet-up, find every resource for learning your chosen language and its meet-ups, and follow people on Twitter who are building what you want to build and telling the type of stories you want to tell.

When you get stuck, Google it, look on StackOverflow and tweet out. Even if you are not coding, listen out for new tools in all the same spaces, especially School of Data and Data Driven Journalism. Never stop looking, because you can never stop learning.

Why data-driven journalism matters

The area of data journalism provides journalists with a wonderful opportunity. A data journalist’s value is based on analysis and on exploring new areas of information within the timeframe of a newsroom cycle. It is bringing back the investigative side of journalism and marrying traditional journalistic outputs with the world wide web. This allows the data journalist to explore new avenues for story-gathering and to push the news output towards creative, multi-layered storytelling.

Data journalism is a role deeply involved with editorial and yet just as enmeshed with digital production and output. It is an exceedingly challenging role, but to make a difference one has to expect to do things differently.

Nicola Hughes is a programmer-journalist, specialising in uncovering, structuring and analysing datasets for investigative news content. She worked at CNN International as a digital media producer, leaving her role to learn to code with tech startup ScraperWiki.

She then became a Knight-Mozilla OpenNews fellow in 2012, embedded with the Guardian interactive team. She has learnt to scrape and parse data in Python, manage databases using MySQL and display information online using JavaScript. She is currently working as a data journalist for The Times.

This article is taken from a collection of essays: Data Journalism – Mapping the Future, edited by John Mair and Richard Keeble and published by Abramis.
