January 9, 2012

Yesterday’s fringe data is tomorrow’s well-structured data

Shouldn’t data structures be declared at query time, not at data load time? Or some combination of the two? A number of people believe that the enormous data sets we are now trying to analyze in this Big Data era need to be loaded in a queryable state BEFORE the structure and content of the data sets are completely understood.

I call this data structure discovery, or schema discovery.

Within Hadoop, things may be somewhat disorderly and potentially unpredictable. You’re encouraged to load all your data as a set of key-value pairs. The structure of this data bag may need to be discovered afterward, and alternate interpretations of that structure need to be possible without reloading the database.
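To make the idea concrete, here is a minimal sketch in plain Python (not Hadoop itself) of one raw data bag being read under two different interpretations without any reload. The record layout, the `|` delimiter, and the names `as_user_event`, `as_channel_hit`, and `query` are all assumptions made up for illustration.

```python
# Raw records loaded once as key-value pairs; no schema is fixed at load time.
# The pipe-delimited layout is an assumption for this example.
raw_records = [
    ("evt-1", "2012-01-09|alice|login|web"),
    ("evt-2", "2012-01-09|bob|purchase|mobile"),
    ("evt-3", "2012-01-10|alice|purchase|web"),
]


def as_user_event(value):
    """Interpretation 1: treat each value as a user action."""
    date, user, action, _channel = value.split("|")
    return {"date": date, "user": user, "action": action}


def as_channel_hit(value):
    """Interpretation 2: treat the same value as channel traffic."""
    date, _user, _action, channel = value.split("|")
    return {"date": date, "channel": channel}


def query(records, interpret, predicate):
    """Apply a structure at read time, then filter the resulting rows."""
    rows = (interpret(value) for _key, value in records)
    return [row for row in rows if predicate(row)]


# Two different structures over the same stored data, chosen per query.
purchases = query(raw_records, as_user_event, lambda r: r["action"] == "purchase")
web_hits = query(raw_records, as_channel_hit, lambda r: r["channel"] == "web")
```

The stored pairs never change; only the function handed to `query` does, which is the sense in which the structure is discovered rather than declared up front.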

“Yesterday’s fringe data is tomorrow’s well-structured data” implies that we need exceptional flexibility as we explore new kinds of data sources.

A key differentiator between the RDBMS approach and the MapReduce/Hadoop approach is that MapReduce/Hadoop systems can defer the data structure declaration until query time.

However, an objection from the RDBMS community, of course, is that forcing every MapReduce job to declare the target data structure promotes a kind of chaos, because every business analyst or data scientist can do their own thing. Also, no one truly understands the data structure, potentially leading to wasted time and effort rediscovering it.

But that objection seems to miss the point: a standard data structure declaration can easily be published as a library module, to be picked up by every analyst or scientist performing their analytics and by application developers implementing their transformative business applications.
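As a rough sketch of what such a published declaration might look like, the Python below puts one agreed-upon record structure and its parser in what would be a shared module, then uses it from a small map/reduce-style job. Everything here is hypothetical: the `PageView` fields, the tab-separated layout, and the function names are assumptions for illustration, and the job is plain Python rather than a real Hadoop API.

```python
from typing import NamedTuple


# --- contents of a hypothetical shared module, e.g. company_schemas.py ---
class PageView(NamedTuple):
    """The published structure every analyst is expected to reuse."""
    date: str
    user: str
    url: str


def parse_page_view(value: str) -> PageView:
    """The one agreed-upon interpretation of a raw, tab-separated line."""
    date, user, url = value.split("\t")
    return PageView(date, user, url)


# --- an analyst's job that relies on the shared declaration ---
def map_views_per_user(records):
    """Map step: emit (user, 1) for every raw key-value record."""
    for _key, value in records:
        view = parse_page_view(value)
        yield view.user, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = {}
    for key, count in pairs:
        totals[key] = totals.get(key, 0) + count
    return totals


raw = [
    (0, "2012-01-09\talice\t/home"),
    (1, "2012-01-09\tbob\t/pricing"),
    (2, "2012-01-10\talice\t/pricing"),
]
views_per_user = reduce_counts(map_views_per_user(raw))
```

The structure is still applied at query time, but the declaration lives in one place, so a change to the raw layout means updating `parse_page_view` once instead of in every job.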

Change is sometimes tough, but it always leads to innovation.
