Batch with Big Data versus Small Data


How do you know whether you are dealing with Big Data or Small Data? I’m constantly asked for my definition of “Big Data”. Well, here it is, at least for batch analytics, the kind of workload now addressed by technologies such as Hadoop.

Batch Analytics

| Batch Analytics | Small Data | Big Data |
|---|---|---|
| Data Volume | Gigabytes | Terabytes to petabytes |
| Data Velocity | Updated periodically at non-real-time intervals | Updated both in real time and through timed bulk intervals |
| Data Variety | 1-6 structured sources | 6+ structured AND 6+ unstructured sources |
| Data Models | Store data without cleaning, transforming, or normalizing. | Store data without cleaning, transforming, or normalizing. Then apply schemas based on application needs. |
| Business Functions | One line of business (e.g. sales) | Several lines of business, up to a 360 view |
| Business Intelligence | Queries are complex, requiring many concurrent data modifications, a rich breadth of operators, and many selectivity constraints. However, they are applied to a simpler data structure. Response times are in minutes to hours, issued by one or maybe two experts. Example: determine how much profit is made on a given line of parts, broken out by supplier, by geography, by year. | Queries are complex, requiring many concurrent data modifications, a rich breadth of operators, and many selectivity constraints. Queries span across business functions. Response times are in minutes to hours, issued by a small group of experts. Example: determine how much profit is made on a given line of parts, broken out by supplier, by geography, by year; then determine which customers purchased the higher-profit parts, by geography, by year; determine the profile of those high-profit customers; and find out which products purchased by high-profit customers were NOT purchased by other similar customers, in order to cross-sell / up-sell. |
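To make the profit example concrete, here is a minimal sketch of the first step both examples share: profit on a given line of parts, broken out by supplier, by geography, by year. It uses Python's built-in sqlite3 module purely for illustration. The `part_sales` table, its columns, the sample rows, and the `profit_breakdown` helper are all assumptions invented for this sketch, not a real schema. At Big Data scale the same GROUP BY would more likely run as a batch job on a Hadoop-style platform.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE part_sales (
           part_line TEXT, supplier TEXT, geography TEXT,
           year INTEGER, revenue REAL, cost REAL)"""
)
conn.executemany(
    "INSERT INTO part_sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("brakes", "Acme", "EMEA", 2012, 100.0, 60.0),
        ("brakes", "Acme", "EMEA", 2012, 50.0, 30.0),
        ("brakes", "Acme", "APAC", 2013, 80.0, 50.0),
        ("brakes", "Bolt", "EMEA", 2012, 70.0, 65.0),
        ("filters", "Acme", "EMEA", 2012, 40.0, 10.0),
    ],
)


def profit_breakdown(conn, part_line):
    """Profit for one line of parts, by supplier, geography, and year."""
    return conn.execute(
        """SELECT supplier, geography, year, SUM(revenue - cost) AS profit
           FROM part_sales
           WHERE part_line = ?
           GROUP BY supplier, geography, year
           ORDER BY supplier, geography, year""",
        (part_line,),
    ).fetchall()


rows = profit_breakdown(conn, "brakes")
for row in rows:
    print(row)
```

The Big Data version of the example would chain further steps onto this result (joining to customers, profiling them, and comparing baskets across similar customers), which is where queries start to span business functions.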

Want to see my view on Ad Hoc and Interactive Analytics? Go here.

Want to see my view on Real-Time Analytics? Go here.

Here are a few other products in this space:

ICS Hadoop

Cloudera

MapR

Hortonworks

Pivotal

Intel

IBM

WANdisco

Jim Kaskade

Jim Kaskade is a serial entrepreneur & enterprise software executive of over 35 years. He recently successfully exited a PE-backed SaaS company, Janrain, in the digital identity security space. He started his career engineering massively parallel processing datacenter applications. Prior to identity, he led a digital application business of over 7,000 people ($1B). Prior to that he led a big data & analytics business of over 1,000 ($250M). He was the CEO of a Big Data Cloud company ($50M); was an EIR at PARC (the Bell Labs of Silicon Valley) which resulted in a spinout of an AML AI company; led two separate private cloud software startups; founded one of the most advanced digital video SaaS companies delivering online and wireless solutions to over 10,000 enterprises; and was involved with three semiconductor startups (two of which he founded, one of which he sold). Jim has an Electrical and Computer Science Engineering degree from University of California, Santa Barbara, with an emphasis in semiconductor design and computer science; and an MBA from the University of San Diego with an emphasis in entrepreneurship and finance.
