View all newsletters
Sign up for our free email newsletters

Fighting for quality news media in the digital age.

  1. Comment
May 20, 2013updated 31 May 2013 8:22pm

Adventures in spreadsheet journalism lesson six: Crunching text-based data

Spreadsheets are about numbers, of course, but they're not only about them. In fact, a mass of text-based data is out there for you just waiting to be crunched, and Excel and kindred applications have forged a toolbox's worth of implements  smooth the crunching.

You may not have allotted much attention to the Text button neatly shelved in Excel's function Library:

but it's there, and the more you know about its contents the better will your investigative intents be served. Let's borrow two of those functions from the Library – pretty simple ones, at least in the first instance – and see just how crunch-worthy they can be.

For data I've laced together two spreadsheets from the Office of National Statistics, the respective surveys of girl and boy baby names bestowed upon newborns in England and Wales in 2011.

The data have already considered by Anna Powell-Smith in the Guardian and her blog, but we're interested in some of the how-tos here; so enter the field name Initial in cell E5, and this expression in E6:

=LEFT(C6,1)

Content from our partners
MHP Group's 30 To Watch awards for young journalists open for entries
How PA Media is helping newspapers make the digital transition
Publishing on the open web is broken, how generative AI could help fix it

The yield: a solitary letter H, recalling the first letter of the name Harry and corroborating the rather self-evident workings of LEFT. All you need do here is identify the cell of interest, along with the number of characters you want to snare from its contents, starting at the far-left (and yes, there's a mirror function called RIGHT).

Copy the above down the E column, label F5 Length, and enter

=LEN(C6)

an even simpler expression, in F6. The result, 6, counts the number of characters populating C6 – but you probably didn't need me to tell you that. Copy the formula down the F column.

And once that business has been concluded you can subject the data to a variety of pivot-table breakouts, for example:

We could ask how names' first letter distributions vary both by initial and gender, remembering that the question really splits into two variants: a) a count of how the names themselves break out, and b) a count of the actual number of babies bearing those names. Thus you can rev up a pivot table and place these fields in these respective areas:

Row Labels: Initial

Column Labels: Gender

Values: Initial (will automatically count these, as the data are text)

Count (should be summed, by default)

Next, you can click consecutively in both Values areas and click PivotTable Tools > Options > Show Values As > % of Column Totals, and you'll wind up with something like this:

Note, for example, that 6.28% of all girl names begin with the letter J, even as 13.13% of all girls actually received a name topped by that letter. A lot of Jessicas out there.

You can also do something similar with name length. Just replace the Count value field above with Length, and click PivotTable Tools > Options > Summarize Values By > Average. What you'll see is average name length by first initial:

We see that girls' names are a bit longer, although the above could be run through a test of statistical significance.

It should be added that  text scrutiny can take you down to some deeper levels of syntactical complexity, particularly once you get into functions such as FIND , MID and SUBSTITUTE; and so you may want to earmark some of your down time (assuming you have any) to explore that forlorn Text button in the Formulas tab.

Email pged@pressgazette.co.uk to point out mistakes, provide story tips or send in a letter for publication on our "Letters Page" blog

Select and enter your email address Weekly insight into the big strategic issues affecting the future of the news industry. Essential reading for media leaders every Thursday. Your morning brew of news about the world of news from Press Gazette and elsewhere in the media. Sent at around 10am UK time. Our weekly does of strategic insight about the future of news media aimed at US readers. A fortnightly update from the front-line of news and advertising. Aimed at marketers and those involved in the advertising industry.
  • Business owner/co-owner
  • CEO
  • COO
  • CFO
  • CTO
  • Chairperson
  • Non-Exec Director
  • Other C-Suite
  • Managing Director
  • President/Partner
  • Senior Executive/SVP or Corporate VP or equivalent
  • Director or equivalent
  • Group or Senior Manager
  • Head of Department/Function
  • Manager
  • Non-manager
  • Retired
  • Other
Visit our privacy Policy for more information about our services, how New Statesman Media Group may use, process and share your personal data, including information on your rights in respect of your personal data and how you can unsubscribe from future marketing communications.
Thank you

Thanks for subscribing.

Websites in our network