Yesterday’s fringe data is tomorrow’s well-structured data

Shouldn’t data structures be declared at query time, not at data load time? Or some combination? A number of people believe that the enormous data sets we are now trying to analyze in this new Big Data era need to be loaded in a queryable state BEFORE the structure and content of the data sets are completely understood.

I call this data structure discovery, or schema discovery.

Within Hadoop, things may be somewhat disorderly and potentially unpredictable. You’re encouraged to load all your data as a set of key-value pairs. The structure of this data bag may need to be discovered, and alternate interpretations of that structure may need to be possible without reloading the database.
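
To make schema discovery concrete, here is a minimal sketch in plain Python rather than Hadoop itself. It scans loosely structured key-value records and reports which fields appear, how often, and what types their values seem to have. The sample records and the helper names (`guess_type`, `discover_schema`) are hypothetical, invented for illustration only.

```python
from collections import defaultdict

# Hypothetical sample: records loaded once as loose key-value pairs,
# with every value still a raw string.
RAW_RECORDS = [
    {"user": "alice", "ts": "2012-03-01", "amount": "19.99"},
    {"user": "bob", "ts": "2012-03-02", "amount": "5", "coupon": "SPRING"},
    {"user": "carol", "ts": "2012-03-02"},
]


def guess_type(value):
    """Guess the narrowest type a raw string fits: int, then float, else str."""
    for caster in (int, float):
        try:
            caster(value)
            return caster.__name__
        except ValueError:
            pass
    return "str"


def discover_schema(records):
    """Return {field: {"types": guessed type names, "count": occurrences}}."""
    schema = defaultdict(lambda: {"types": set(), "count": 0})
    for record in records:
        for key, value in record.items():
            schema[key]["types"].add(guess_type(value))
            schema[key]["count"] += 1
    return dict(schema)


discovered = discover_schema(RAW_RECORDS)
for field, info in sorted(discovered.items()):
    print(field, sorted(info["types"]), info["count"])
```

Nothing is reloaded here: the raw records stay as they are, and the discovered structure is only a report that an analyst can accept, refine, or ignore.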

“Yesterday’s fringe data is tomorrow’s well-structured data” implies that we need exceptional flexibility as we explore new kinds of data sources.

A key differentiator between the RDBMS approach and the MapReduce/Hadoop approach is the ability to defer the data structure declaration until query time in the MapReduce/Hadoop systems.
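
Deferring the declaration until query time can be sketched as follows, again in plain Python as a stand-in for a MapReduce job. The same raw lines are read twice, each time with a different structure supplied by the query. The data, the schema lists, and the function names are all assumptions made up for this illustration.

```python
from collections import defaultdict

# Hypothetical raw lines, stored once with no declared structure.
RAW_LINES = [
    "alice|2012-03-01|19.99",
    "bob|2012-03-02|5.00",
    "alice|2012-03-02|7.50",
]

# Two structure declarations, each a list of (field name, converter) pairs.
# They interpret the second column differently: as a day, or as a month.
DAY_SCHEMA = [("user", str), ("day", str), ("amount", float)]
MONTH_SCHEMA = [("user", str), ("month", lambda s: s[:7]), ("amount", float)]


def read_with_schema(lines, schema, sep="|"):
    """Apply a schema to raw lines at read time, yielding one dict per line."""
    for line in lines:
        parts = line.split(sep)
        yield {name: cast(raw) for (name, cast), raw in zip(schema, parts)}


def total_by(lines, schema, key, value):
    """Group rows by `key` and sum `value`, using the schema given by the caller."""
    totals = defaultdict(float)
    for row in read_with_schema(lines, schema):
        totals[row[key]] += row[value]
    return dict(totals)


spend_per_user = total_by(RAW_LINES, DAY_SCHEMA, "user", "amount")
spend_per_month = total_by(RAW_LINES, MONTH_SCHEMA, "month", "amount")
print(spend_per_user)
print(spend_per_month)
```

The stored data never changes between the two queries; only the declaration passed in by the caller does.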

However, an objection from the RDBMS community, of course, is that forcing every MapReduce job to declare the target data structure promotes a kind of chaos, because every business analyst or data scientist can do their own thing. Also, no one truly understands the data structure, potentially leading to wasted time and effort rediscovering it.

But that objection seems to miss the point that a standard data structure declaration can easily be published as a library module that can be picked up by every analyst/scientist when they are performing their analytics or by application developers implementing their transformative business applications.
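
One way such a published declaration might look is sketched below. The upper half stands in for a shared module (imagine a hypothetical file named `company_schemas.py`), and the lower half is an analyst's job that reuses it instead of re-deriving the structure. The class `PurchaseEvent`, its fields, and the sample data are assumptions for illustration, not an existing library.

```python
from dataclasses import dataclass
from datetime import date


# --- contents of the hypothetical shared module ---
@dataclass(frozen=True)
class PurchaseEvent:
    """The agreed-upon interpretation of a raw purchase record."""
    user: str
    day: date
    amount: float

    @classmethod
    def from_raw(cls, raw):
        """Build a typed event from a raw key-value record of strings."""
        return cls(
            user=raw["user"],
            day=date.fromisoformat(raw["ts"]),
            # Assumption: a missing amount is treated as zero.
            amount=float(raw.get("amount", "0")),
        )


# --- an analyst's job that imports and reuses the declaration ---
def daily_revenue(raw_records):
    """Sum purchase amounts per day using the shared structure."""
    totals = {}
    for raw in raw_records:
        event = PurchaseEvent.from_raw(raw)
        totals[event.day] = totals.get(event.day, 0.0) + event.amount
    return totals


sample = [
    {"user": "alice", "ts": "2012-03-01", "amount": "20.0"},
    {"user": "bob", "ts": "2012-03-02", "amount": "5.5"},
    {"user": "carol", "ts": "2012-03-02"},
]
revenue = daily_revenue(sample)
print(revenue)
```

Every job that imports the same declaration agrees on field names and types, while anyone exploring a new interpretation remains free to write a different one against the same raw data.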

Change is sometimes tough, but it always leads to innovation.

 

Jim Kaskade

Jim Kaskade is a serial entrepreneur & enterprise software executive with over 35 years of experience. He recently successfully exited a PE-backed SaaS company, Janrain, in the digital identity security space. He started his career engineering massively parallel processing datacenter applications. Prior to identity, he led a digital application business of over 7,000 people ($1B). Prior to that he led a big data & analytics business of over 1,000 ($250M). He was the CEO of a Big Data Cloud company ($50M); was an EIR at PARC (the Bell Labs of Silicon Valley) which resulted in a spinout of an AML AI company; led two separate private cloud software startups; founded one of the most advanced digital video SaaS companies delivering online and wireless solutions to over 10,000 enterprises; and was involved with three semiconductor startups (two of which he founded, one of which he sold). Jim has an Electrical and Computer Science Engineering degree from University of California, Santa Barbara, with an emphasis in semiconductor design and computer science; and an MBA from the University of San Diego with an emphasis in entrepreneurship and finance.
