Adventures in spreadsheet journalism lesson six: Crunching text-based data

Spreadsheets are about numbers, of course, but they're not only about them. In fact, a mass of text-based data is out there for you just waiting to be crunched, and Excel and kindred applications have forged a toolbox's worth of implements smooth the crunching.

You may not have allotted much attention to the Text button neatly shelved in Excel's function Library:

but it's there, and the more you know about its contents the better will your investigative intents be served. Let's borrow two of those functions from the Library – pretty simple ones, at least in the first instance – and see just how crunch-worthy they can be.

For data I've laced together two spreadsheets from the Office of National Statistics, the respective surveys of girl and boy baby names bestowed upon newborns in England and Wales in 2011.

The data have already considered by Anna Powell-Smith in the Guardian and her blog, but we're interested in some of the how-tos here; so enter the field name Initial in cell E5, and this expression in E6:

=LEFT(C6,1)

The yield: a solitary letter H, recalling the first letter of the name Harry and corroborating the rather self-evident workings of LEFT. All you need do here is identify the cell of interest, along with the number of characters you want to snare from its contents, starting at the far-left (and yes, there's a mirror function called RIGHT).

Copy the above down the E column, label F5 Length, and enter

=LEN(C6)

an even simpler expression, in F6. The result, 6, counts the number of characters populating C6 – but you probably didn't need me to tell you that. Copy the formula down the F column.

And once that business has been concluded you can subject the data to a variety of pivot-table breakouts, for example:

We could ask how names' first letter distributions vary both by initial and gender, remembering that the question really splits into two variants: a) a count of how the names themselves break out, and b) a count of the actual number of babies bearing those names. Thus you can rev up a pivot table and place these fields in these respective areas:

Row Labels: Initial

Column Labels: Gender

Values: Initial (will automatically count these, as the data are text)

Count (should be summed, by default)

Next, you can click consecutively in both Values areas and click PivotTable Tools > Options > Show Values As > % of Column Totals, and you'll wind up with something like this:

Note, for example, that 6.28% of all girl names begin with the letter J, even as 13.13% of all girls actually received a name topped by that letter. A lot of Jessicas out there.

You can also do something similar with name length. Just replace the Count value field above with Length, and click PivotTable Tools > Options > Summarize Values By > Average. What you'll see is average name length by first initial:

We see that girls' names are a bit longer, although the above could be run through a test of statistical significance.

It should be added that text scrutiny can take you down to some deeper levels of syntactical complexity, particularly once you get into functions such as FIND , MID and SUBSTITUTE; and so you may want to earmark some of your down time (assuming you have any) to explore that forlorn Text button in the Formulas tab.

Email pged@pressgazette.co.uk to point out mistakes, provide story tips or send in a letter for publication on our "Letters Page" blog

Adventures in spreadsheet journalism lesson six: Crunching text-based data

Google’s site reputation abuse policy: the wrong solution to a real problem

‘We accept the risks’: Sky News boss calls for journalists to be allowed into Gaza

Walking the mobile notification tightrope: How to engage audiences while avoiding overload