Adventures in spreadsheet journalism lesson six: Crunching text-based data

Spreadsheets are about numbers, of course, but they're not only about them. In fact, a mass of text-based data is out there for you just waiting to be crunched, and Excel and kindred applications have forged a toolbox's worth of implements smooth the crunching.

You may not have allotted much attention to the Text button neatly shelved in Excel's function Library:

but it's there, and the more you know about its contents the better will your investigative intents be served. Let's borrow two of those functions from the Library – pretty simple ones, at least in the first instance – and see just how crunch-worthy they can be.

For data I've laced together two spreadsheets from the Office of National Statistics, the respective surveys of girl and boy baby names bestowed upon newborns in England and Wales in 2011.

The data have already considered by Anna Powell-Smith in the Guardian and her blog, but we're interested in some of the how-tos here; so enter the field name Initial in cell E5, and this expression in E6:

=LEFT(C6,1)

Content from our partners

Free journalism awards for journalists under 30: Deadline today

Press Gazette

MHP Group’s 30 To Watch awards for young journalists open for entries

Press Gazette

How PA Media is helping newspapers make the digital transition

Freddy Mayhew

The yield: a solitary letter H, recalling the first letter of the name Harry and corroborating the rather self-evident workings of LEFT. All you need do here is identify the cell of interest, along with the number of characters you want to snare from its contents, starting at the far-left (and yes, there's a mirror function called RIGHT).

Copy the above down the E column, label F5 Length, and enter

=LEN(C6)

an even simpler expression, in F6. The result, 6, counts the number of characters populating C6 – but you probably didn't need me to tell you that. Copy the formula down the F column.

And once that business has been concluded you can subject the data to a variety of pivot-table breakouts, for example:

We could ask how names' first letter distributions vary both by initial and gender, remembering that the question really splits into two variants: a) a count of how the names themselves break out, and b) a count of the actual number of babies bearing those names. Thus you can rev up a pivot table and place these fields in these respective areas:

Row Labels: Initial

Column Labels: Gender

Values: Initial (will automatically count these, as the data are text)

Count (should be summed, by default)

Next, you can click consecutively in both Values areas and click PivotTable Tools > Options > Show Values As > % of Column Totals, and you'll wind up with something like this:

Note, for example, that 6.28% of all girl names begin with the letter J, even as 13.13% of all girls actually received a name topped by that letter. A lot of Jessicas out there.

You can also do something similar with name length. Just replace the Count value field above with Length, and click PivotTable Tools > Options > Summarize Values By > Average. What you'll see is average name length by first initial:

We see that girls' names are a bit longer, although the above could be run through a test of statistical significance.

It should be added that text scrutiny can take you down to some deeper levels of syntactical complexity, particularly once you get into functions such as FIND , MID and SUBSTITUTE; and so you may want to earmark some of your down time (assuming you have any) to explore that forlorn Text button in the Formulas tab.