
Monday, July 1, 2013

Is there really such a thing as "BAD" data?

I've come across a couple of posts relating to quality and data over the last week or so. I thought the topics were wonderful and have decided to write a few words about the subject myself.

Article #1 - Sometimes Worse Data Quality is Better

Article #2 - Is there really "Bad" Data?

Each deals a bit with quality, but neither deals directly with "data".

The first article explains very clearly why Betamax went by the wayside in favor of VHS and how the landscape of music changed to an alternate format called MP3. It also talks about the shift in cameras to digital and how that helped cause the demise of Kodak.

The second was more a question posed in reference to a book called Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work.




So is there "bad" data?
That all depends on what you are using it for and what kind of data you have.

Take a bunch of sales numbers. A few mis-entered values and everything goes to pot. But it may not be so bad if you are just looking for some kind of trend or theme, or checking whether one product is doing better than another (as long as the values are not off by too much). Generally, though, this is a case where accuracy is paramount to obtaining usable information.
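To make the sales example concrete, here is a minimal sketch in Python. The figures, the product names, and the `better_seller` helper are all made up for illustration. It shows one mis-keyed value leaving a rough comparison intact when the median is used, while the same typo flips the answer when the mean is used.

```python
from statistics import mean, median

def better_seller(sales_a, sales_b, summary=median):
    """Return 'A' or 'B', whichever product has the higher summary value."""
    return "A" if summary(sales_a) > summary(sales_b) else "B"

# Hypothetical weekly unit sales; product A genuinely sells more.
product_a = [120, 125, 118, 130, 127, 122]
product_b = [90, 95, 88, 93, 91, 89]

# The same figures for B with one mis-keyed entry (930 typed instead of 93).
product_b_typo = [90, 95, 88, 930, 91, 89]

# The median shrugs off the single bad value: A still comes out ahead.
print(better_seller(product_a, product_b_typo, summary=median))

# The mean is dragged up by the typo and now points at B instead.
print(better_seller(product_a, product_b_typo, summary=mean))
```

So whether the typo makes the data "bad" depends on the question: a rough "which one sells better" survives it, while anything built on totals or averages does not.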

Now you may have a bunch of data in video format. The quality of that data might not be so important. You might not care whether the resolution is high enough to show the blackhead on the end of the commentator's nose. Then again, you might want or need that level of resolution or clarity.



I deal with data on a daily basis. Phone data, address data, geodata, consumer data, and many things in between.
Sometimes we get files that have just what we need in them, say just the consumer name and address. At times I call this "clean data". Why? Because there is less stuff in the file to screw up in conversion and the like. Less to go wrong.



When you go back to the basics of databasing, you find that many tables can be created and linked to a main form or table. With the fields cut down, you can do faster queries. And since everything is linked by a key code, you can spread the data out over hundreds of tables. This can also help to keep the more sensitive table elements out of the hands of unauthorized people.
It can compartmentalize things a lot more and organize them in various ways.
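As a rough sketch of that linked-table idea, the following uses Python's built-in `sqlite3` module with an in-memory database. The table names, columns, and sample row are invented for illustration. Each table carries only a few fields plus the shared key, and the sensitive payment details sit in a table of their own that an everyday query never has to touch.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Three small tables linked by the same key (customer_id).
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE address  (customer_id INTEGER REFERENCES customer(customer_id), city TEXT);
CREATE TABLE payment  (customer_id INTEGER REFERENCES customer(customer_id), card_last4 TEXT);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO customer VALUES (1, 'Ann Example')")
conn.execute("INSERT INTO address VALUES (1, 'Springfield')")
conn.execute("INSERT INTO payment VALUES (1, '4242')")

# A routine lookup joins only the tables it needs; payment is left alone.
rows = conn.execute(
    "SELECT c.name, a.city "
    "FROM customer c JOIN address a ON a.customer_id = c.customer_id"
).fetchall()
print(rows)
```

In a real system the access restriction would come from database permissions on the payment table; the split into separate tables is what makes such a restriction possible.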

When you start to mention "BAD" data you find a lot of things rely on the perspective and needs of the person that will be using the data.

Maybe a set of phone numbers is good for one person because they are testing whether something is picking up, parsing, or sorting in a certain way. That same set of data may be no good to someone else because it has no associated names, products bought, or other pertinent information. How would you use a list of phone numbers for product follow-up if the numbers are all you have? You can't, because in that case the data is "BAD" or "incomplete".

That is one of the points brought up in the discussions of "BAD" data. There really is no bad data. There is incomplete data.
This could be the missing names associated with a list of phone numbers, or a portion of a video clip that has been cut off for some reason. Because it is partial or incomplete, that data becomes worthless.
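The phone list example can be sketched as a simple completeness check. The records and the `missing_fields` helper below are hypothetical. The same three records pass when only a phone number is required and fail when a name is required too, which is the sense in which "incomplete" depends on the use.

```python
def missing_fields(records, required):
    """Map each record's index to the required fields it lacks or leaves blank."""
    report = {}
    for i, rec in enumerate(records):
        gaps = [f for f in required if not str(rec.get(f) or "").strip()]
        if gaps:
            report[i] = gaps
    return report

# Hypothetical records: every one has a phone number, only one has a name.
records = [
    {"phone": "555-0100", "name": "Ann Example"},
    {"phone": "555-0101", "name": ""},
    {"phone": "555-0102"},
]

# Testing a dialer or a parser only needs the numbers: nothing is missing.
print(missing_fields(records, ["phone"]))

# Product follow-up also needs a name: two of the three records fall short.
print(missing_fields(records, ["phone", "name"]))
```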

The trend in modern society is that everyone is collecting data, and some believe in "the more you have, the better". Yes, you might use some of that data down the road for some heretofore unknown application(s). However, is that information really any good just sitting around doing nothing but taking up space (main copy & backups)?



So, thanks to cheap digital storage, the trend is to save everything you can get your hands on. This undercuts the basic principles of databasing and has to some extent allowed us to become lazy and less prudent in the way we store things and lay them out.

This phenomenon of "BIG" data is a trend that could, in the long haul, allow for some really cool data analysis. It could also become a boondoggle that only helps storage providers.

And the larger question becomes - "Are you worried all this data is being collected?"

This becomes relevant when what is collected is complete or usable data.
If all that is being collected is a bunch of unconnected stuff that in the end is incomplete or "BAD", do you really care?

We have just taken the collecting of data to a whole new level. A government collecting data is nothing new.
The Domesday Book from William the Conqueror's time was one of the first large collections of data gathered for the purposes of taxes and associated elements. Up to that point England did not really know what it had in front of it. After that they had a better idea and could plan accordingly. Thankfully, that data was not used for nefarious purposes.




"BIG" data does bring up some ethical questions.  Though these ethical questions are no worse now then they were say 100 years ago.  We simply did not have ways of storing so much information that could then be analyzed in such a short period of time with such accuracy.  Have you ever been frustrated with a web search that took 1 minute?  In reality Google is so efficient it can go through thousands it millions of pages in just a few seconds to give you relevant pages associated to what you want.

William the Conqueror would never have dreamed of that. Heck, Einstein may never have dreamed of that... or maybe he did, but knew it would be years before anyone could do something about it.



So as we go about our daily lives, we need to keep in mind how we can influence the proper use of data and how we can make sure that it stays clean and usable... for proper and good purposes.

Hopefully with diligence we will not end up repeating the Holocaust or some unnecessary war. For that could ultimately be what happens when we have "BAD" data!



Buaidh - NO - Bas
