2+2=5 or the pompatus of data reporting

Wait! NOT Common Core! No, Really!

<ducks> Take a deep breath. I am NOT talking about Common Core. Put down the righteousness indignation and just walk away. These are not the droids you are looking for.

Some people talk about me baby

Okay, so now that I've lost 95% of you, let's talk about how much of a pain in the neck data reporting can be.

I used to hunt semi-colons for a living tech-wise but over the years I've drifted into more data work and I like it. I just think better in sets than in algorithms. I'm very holistic that way.

But I traded nights of pulling out my hair trying to figure out why in hades that won't compile (ask your parents) to getting beat up by some very smart analytics folks as to why "that report is all wrong!!". I've never actually been told to pull my head out of my you-know-what but let's just say the message was received loud and clear.

I understand the frustration. I really do. But the problem may not be that the report is wrong. It could be that there is something else going on.

Even when you encounter data that seems to defy logic and reason it is not necessarily because it is illogical or unreasonable. Part of the problem is that people think computers are infallible.

They just want "the data". In reality, computers are neither right nor wrong, accurate nor inaccurate, they just do what we tell them to. And getting "the data" is a conversation, not a request, even if your data quality is perfect.

Understanding the full lifecycle of your data from data entry to data reporting to data analysis will help you distinguish between wrong and misunderstood.

Say I'm doing you wrong...

We can illustrate this by describing a situation where 2+2=5 is actually accurate. Not precise, but accurate.

First me explain how we can get that kind of a result.

Say we have some kind of tool that displays individual and aggregate data. Often the underlying data can be too many decimal places to display without making the interface a mess.

For example, say we are looking at two records, A & B, and for the field we care about (let's call it "distance") one record has a value of 2.340 miles and the other has a value of 2.430 miles.

There are a few options for how to display these numbers.

For the sake of the argument, let's assume that we are just displaying whole numbers. Maybe it's one field out of many and we don't have room for any decimals.

So the question is how do we get a whole number integer from the full precision decimal equivalent? One option is truncation: simply drop the decimal portion. So we get 2 miles for both A & B.

Another option is standard rounding, in which case we also get 2 miles displayed for both records. So far so good, nothing controversial.

...doing you wrong

Now let's talk about the aggregation. In this case, let's say the specific aggregation we are looking for the total distance for both record A and B.

Keep in mind that though we are displaying a 2 for both records, the underlying value is 2.340 for one record and 2.430 for the other. When we add those two numbers together, we get 4.770 miles.

Here's the tricky part.

If the method used to display the number is truncation, then 4.770 becomes 4 and our result is the total number of miles for the two records in our set is 4 miles. eg, 2+2=4.

But if the method used is standard rounding, then 4.770 gets rounded up to 5 and the total displayed is 5 miles: 2+2=5.

And, key to this discussion about analytics, if you are using the data from the system and it (not unreasonably?) simply dumps what you see on the screen, you could end up with what looks like a data quality problem.

If this was the result you got in the report I gave you, you would start thinking cranio-rectal inversion. And you would probably start shouting. But go to your happy place, because i would suggest that in this case 2+2=5 is actually the more accurate because 4.7 is closer to 5 than it is to 4.

But don't you worry baby, don't worry

This example is a bit contrived but it's fairly easy to come up with a similar situation due to rounding. And I used rounding due to display as an easier illustration but the rounding could occur at any point in the data processing.

It happens all the time when trying to represent decimal numbers in a computer. See this example about how 0.1 can never really be 0.1 in a computer.

cause I'm right here, right here, right here at home

Bottom line: when you see inconsistent data in your reports, take a closer look, it could be something earlier in the processing stream and not actually poor data quality.

Calm down math nerd

And for the record, the "proof" in the picture above is flawed. Here is an explanation why.

I am old

No, really. Ask your parents (God help me, or your grandparents?!)