

Mid-Market = Semi-Big Data?

Is Big Data destined for only the top 3,000 companies worldwide? What about medium or small companies that are equally data-driven? Is there a place for Big Data in SMB markets?

When I talk to SMB companies about their use of public cloud services, it’s a no-brainer. Pay as you go, lower costs upfront, quick time-to-market. With private and public cloud solutions, both big and small companies benefit.

So what about big data platforms? Is there an equivalent opportunity, or is Big Data only suited for Big Companies with Big Problems?

I think Big Data applies to all, and here’s why.

OLTP & DSS Era

When I first started working on Teradata’s next-generation switch fabric, the BYNET, back in the early ’90s, Teradata was around $250M in revenue (now over $2B). The concept of decision support systems (DSS) evolved into data warehousing and grew to become the largest pool of enterprise data (Big Data) for those enterprises that had a large enough business to create meaningful amounts of data, and that had the money to invest in their data infrastructure.

Operational systems, known as online transaction processing systems (OLTP), were built on top of smaller data infrastructure powering transaction-oriented applications. Oracle owns this space much like Teradata owns the data warehousing space. Unlike the DSS/Data warehouse space, smaller companies benefited from the use of these operational systems, much like their larger competition.

Analytic Appliance Era

Then came the era of analytic appliances. In the mid-1990s I became involved in an effort to lead in-database analytics (powering further innovations in OLAP and Data Mining). With the vision of pushing the analysis of the data closer to the data itself, many followed suit and companies like Netezza, Greenplum, ParAccel, AsterData, Kickfire, Vertica, and others entered the market, addressing the need to provide rapid analysis of data volumes scaling into petabytes. The key words here are “rapid” (or real-time), “analysis” (or analytics), and “petabytes” (big data).

Why didn’t the incumbents like Teradata and Oracle seize the opportunity here? Lots of reasons…politics, the inability to respond quickly to changing market dynamics, etc.

Big Data Era

A number of things led to the creation and adoption of the Hadoop/MapReduce framework, one of them being the need for a frictionless data playground where data scientists could simply investigate…discover.

Born of large-scale early adopters such as Google and Yahoo!, Hadoop/MR is now well-positioned to address the needs of medium and small-sized companies.

I argue that we will see the following evolution of Big Data technologies which originated from the large web-scale companies like Yahoo!, Linkedin, Twitter, and the like:

  • Hardening of the Hadoop ecosystem
  • Broad integration with existing toolsets (e.g. BI)
  • Real-time enablement
  • Further cloud-enablement
  • Clear application use-cases / offerings across verticals

This is an obvious exaggeration to emphasize my point that we may see a shift of dollars to new emerging players. But more importantly, the pie will grow to include the creation of new data infrastructure market share for SMBs due to the innovations in the Big Data space.

Companies like Teradata will benefit from new revenues from acquisitions like AsterData, and integration with Hadoop. Companies like Oracle will also benefit from integration with Hadoop.

However, the potentially larger opportunity will be for new startups who are not tied purely to the needs of the Fortune 3,000 and can quickly tap into the burgeoning market of smaller companies seeking data analytics solutions.

New players who can appreciate the needs of these smaller clients and leverage the product of Silicon Valley’s large web-scale companies (built on commodity hardware and open source software) will be able to capture a large, growing, untapped, data-driven market.

Just take a look at some of the events thus far:

…and then ask yourself this…do you hear the sucking sound? That’s the sound of data coming out of traditional data stores into Hadoop data stores. Yes, there will be some information coming from Hadoop back into Data Warehouses….but where’s the “single source of truth” or “entire view of the customer” going to be in the long run?  Just take a look at the new Big Data Warehouse.

Posted in Cloud Computing, Data.



A New Analytics Architecture

 

Traditional Analytics Approach

The front-end of the above analytics architecture remains relatively unchanged for casual users, who continue to use reports and dashboards running against dependent data marts (either physical or virtual) fed by a data warehouse.

This environment typically meets much of the information needs of the organization, which can be defined up-front through requirements-gathering exercises. Predefined reports and dashboards are designed to answer questions tailored to individual roles within the organization.

Ad hoc needs of casual users can also be serviced by the traditional data warehouse and data mart architecture. However, the interactive reports and dashboards rely on the IT department or “super users”—tech-savvy business colleagues—to create ad hoc reports and views on their behalf.

Search-based exploration tools, which let users type queries in plain English and refine their search using facets or categories, are one of several ways to give more business users access to the data without requiring them to be so sophisticated.

One new addition to the casual user environment is the dashboard powered by streaming/CEP engines (real-time reports). While these operational dashboards are primarily used by operational analysts and workers, many executives and managers are keen to keep their fingers on the pulse of their companies’ core processes by accessing these “twinkling” dashboards directly or, more commonly, receiving alerts from these systems.

New Analytics Approach

The biggest opportunity in the above analytics architecture is how it serves the information needs of power users. It gives power users many new options for consuming corporate data rather than creating countless “spreadmarts”. A power user is a person whose job is to crunch data on a daily basis to generate insights and plans.

Power users include business analysts (e.g., Excel jockeys), analytical modelers (e.g., SAS programmers and statisticians) and data scientists (e.g., application developers with business process and database expertise). Under a new paradigm, power users query an analytic platform (separate from the enterprise data warehouse), Hadoop directly (the new semi-structured data warehouse), or both.

An analytic platform can be implemented via a number of technology approaches:

  • MPP analytic databases (e.g. Greenplum, AsterData)
  • Columnar databases (e.g. ParAccel, Infobright, Sybase IQ, Vertica)
  • Analytic appliances (e.g. Netezza, Exadata)
  • In-memory databases (e.g. HANA, QlikView)
  • Hadoop-based analytics (e.g. Hive, HBase, Mahout, Giraph)

Which approach or combination of approaches are you currently using or going to use?

Do you think that the Hadoop open source ecosystem will evolve to the point where the other analytic platforms become less relevant (e.g. what happens when the community adds real-time, mixed-workload support to Hadoop, and develops a comprehensive suite of Hadoop-enabled / parallelized analytic algorithms)?

In an attempt to be controversial, I’m going to predict that Hadoop will expand to provide support for a sophisticated analytics layer which surpasses the performance of all existing analytic platform alternatives.

All these platforms are integrating with Hadoop because Hadoop acts as a great initial data store and ETL pre-processing engine. However, this integration will ultimately lead to their demise as the Hadoop system’s capabilities begin to overlap with theirs.

Posted in Data.



Big Data Means Leveraging All Customer Channels

Enhancing the multichannel consumer experience should be the focus of all retailers (especially brick and mortar retailers).

Enhancing the multichannel experience for consumers will equate to a powerful driver of sales, customer satisfaction, and loyalty.

Retailers can use big data to integrate promotions and pricing data from shoppers seamlessly, whether those consumers are online, in-store, or perusing a catalog.

Williams-Sonoma, for example, has integrated customer databases with information on some 60 million households, tracking such things as their income, housing values, and number of children. Targeted e-mails based on this information obtain 10 to 18 times the response rate of e-mails that are not targeted (lifting conversion rates by as much as 30%), and the company is able to create different versions of its catalogs attuned to the behavior and preferences of different groups of customers.

“Targeting customers with perfectly customized recommendations at the right moment across the right channel is sales and marketing’s holy grail.”

Posted in Data.



Yesterday’s fringe data is tomorrow’s well-structured data

Shouldn’t data structures be declared at query time, not at data load time? Or some combination? A number of people believe that the enormous data sets we are now trying to analyze in this new Big Data era need to be loaded in a queryable state BEFORE the structure and content of the data sets are completely understood.

I call this data structure discovery, or schema discovery.

Within Hadoop, things may be somewhat disorderly and potentially unpredictable. You’re encouraged to load all your data as a set of key-value pairs. The structure of this data bag may need to be discovered, and alternate interpretations of that structure need to be possible without reloading the data.

“Yesterday’s fringe data is tomorrow’s well-structured data” implies that we need exceptional flexibility as we explore new kinds of data sources.

A key differentiator between the RDBMS approach and the MapReduce/Hadoop approach is the ability to defer the data structure declaration until query time in the MapReduce/Hadoop systems.

However, an objection from the RDBMS community, of course, is that forcing every MapReduce job to declare the target data structure promotes a kind of chaos because every business analyst or data scientist can do their own thing. Also, no one truly understands the data structure, potentially leading to time and effort wasted rediscovering it.

But that objection seems to miss the point that a standard data structure declaration can easily be published as a library module that can be picked up by every analyst/scientist when they are performing their analytics or by application developers implementing their transformative business applications.
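To make the "library module" idea concrete, here is a minimal sketch of declaring structure at read time. Everything in it is an assumption for illustration: the JSON-lines input, the field names, and the `apply_schema` helper are hypothetical and not tied to any Hadoop API.

```python
import json

# Hypothetical shared "schema module": field name -> type conversion.
# Analysts import these declarations instead of re-discovering the structure.
WEB_EVENT_SCHEMA = {"user": str, "ts": int, "revenue": float}
REFERRER_SCHEMA = {"user": str, "referrer": str}


def apply_schema(raw_line, schema):
    """Interpret one raw JSON line at query time; absent fields become None."""
    record = json.loads(raw_line)
    return {
        name: (cast(record[name]) if name in record else None)
        for name, cast in schema.items()
    }


# Raw data is stored once, as-is, with no schema applied at load time.
RAW_LINES = [
    '{"user": "alice", "ts": "1325376000", "revenue": "9.50", "referrer": "search"}',
    '{"user": "bob", "ts": "1325379600"}',
]

# Two different interpretations of the same stored data, with no reload.
revenue_view = [apply_schema(line, WEB_EVENT_SCHEMA) for line in RAW_LINES]
referrer_view = [apply_schema(line, REFERRER_SCHEMA) for line in RAW_LINES]
```

The point of the sketch is only that the declaration lives in one shared place and can be swapped per query, which is the answer to the "chaos" objection above.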

Change is sometimes tough, but it always leads to innovation.

 

Posted in Data.



Big Data & the Future of Selling ‘Stuff’

Source: http://markorodriguez.com/

 

I don’t know if you read about this before the holidays, but I got to thinking about Amazon’s offer to pay shoppers $5 to use its mobile application to compare its prices to those in a store (this was a one-day promotion on Dec. 10 that provided 5% or up to $5 off as many as three items for customers’ efforts to price check).

Amazon is the leader in online retailing at an $81 billion market cap (a little under 2x its annual revenue of $44 billion)…and for a good reason. They understand their customers and their customers’ experiences better than anyone.

This promotion was an interesting use of consumer data entry to power next generation retail price competition. The promotion served as a way for Amazon to increase usage of its bar-code-scanning application, while also collecting intelligence on prices in the stores.

This is one of many interesting trends that are TERRIFYING brick-and-mortar retailers (large and small). While the real-time “Everyday Low Price” information empowers consumers to spend less, it terrifies retailers, who increasingly feel like showrooms: shoppers come in to check out the merchandise but ultimately decide to walk out and buy online instead.

It’s no surprise that Forrester predicted that this year, 2012, will mark the turning point for web-influenced retail sales (the online and mobile web will influence over half of overall retail sales). If we just look at US online purchases alone (excluding in-store), it’s a market exhibiting a 10% CAGR which will be $250 Billion by 2014. That’s not a bad place to be….assuming you have access to technologies similar to what Amazon uses when it comes to understanding customers.

So outside of moving your business online, what can these terrified brick-and-mortar retailers do?

I think it comes down to understanding their customers’ interests and providing them an outstanding experience. You don’t want to compete based on selection, volume, or price. If that’s your model, you’ve already lost the game.

And I’m not talking about the typical “online recommendation system” or “collaborative filtering” approach which can include algorithms of ‘Most Viewed,’ ‘Top Sellers,’ ‘People who bought or viewed this also bought that.’ I’m talking about actions that can be made both online AND offline that occur before, during, and after a desired purchase outcome.

Consumer Interest

Let’s analyze Marko Rodriguez’s “property graph” represented in the graphic above, or as others might describe it, an Interest Graph. Om Malik at GigaOm spoke about this last year in reference to Hunch (which was recently acquired by eBay…go figure!). Prior to that, GigaOm reported on the Interest Graph as it related to Gravity’s web offerings.

Today’s web recommendations are created as part of the process of mapping a user to a particular set of items. These items may be products (e.g. music, movies, games, books, accessories) or they may be people (e.g. find an expert, someone to follow).

Within any data set, there are numerous such user-to-item mappings. Typical recommendation mappings exploit co-interest patterns (e.g. collaborative filtering) or they exploit item feature patterns (e.g. content-based recommendation).

As Marko explains, there is more to recommendation than what is generally understood. For one, there are as many user-to-item mappings as there are paths through the data. “Paths?” When your data is modeled as a graph, the graph traversal pattern can yield a plethora of different mappings, and this collection can be used to serve the needs of your users as their moods and whims change…and, ideally, in real-time (so it’s relevant).

As long as you have access to the data related to your customers’ entire experience, across as many interaction channels as possible (Web, Advertising, Email, Social, Search, POS, etc.), an interest graph dataset (such as the one diagrammed above) can be created, and many types of recommendations can be made. Examples include, but are not limited to:

  • Recommend people to follow based on similar behavior.
  • Recommend people to follow based on similar movie watching behavior.
  • Recommend products based on shared features.
  • Recommend movies to watch based on reading interests.
  • Recommend songs to listen to that are related to these particular followed people.
  • Recommend books or movies that are similar to songs listened to today.
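As a rough sketch of how "paths through the data" turn into recommendations like the ones listed above, the snippet below walks a tiny, made-up property graph. The edge list, the labels (`watched`, `read`), and the helper names are all hypothetical; a real deployment would presumably use a graph database rather than a Python list.

```python
from collections import Counter

# Hypothetical interest graph as (source, edge label, target) triples.
EDGES = [
    ("alice", "watched", "Alien"),
    ("alice", "watched", "Brazil"),
    ("bob", "watched", "Alien"),
    ("bob", "watched", "Brazil"),
    ("carol", "watched", "Alien"),
    ("carol", "read", "Dune"),
    ("bob", "read", "Neuromancer"),
]


def out_neighbors(node, label):
    """Follow outgoing edges with the given label."""
    return [dst for src, lbl, dst in EDGES if src == node and lbl == label]


def in_neighbors(node, label):
    """Follow incoming edges with the given label."""
    return [src for src, lbl, dst in EDGES if dst == node and lbl == label]


def similar_users(user, label):
    """Path: user -label-> item <-label- other. Count paths per other user."""
    paths = Counter()
    for item in out_neighbors(user, label):
        for other in in_neighbors(item, label):
            if other != user:
                paths[other] += 1
    return paths


def recommend_items(user, via_label, target_label):
    """Longer path: similar users via one edge type, then items of another type."""
    already_has = set(out_neighbors(user, target_label))
    scores = Counter()
    for other, weight in similar_users(user, via_label).items():
        for item in out_neighbors(other, target_label):
            if item not in already_has:
                scores[item] += weight
    return scores


people_to_follow = similar_users("alice", "watched")
books_for_alice = recommend_items("alice", "watched", "read")
```

Changing the labels passed in changes the path being traversed, which is why one graph can yield many different user-to-item mappings.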

The question is whether you can seamlessly map your existing data sources (e.g. SQL databases, Apache web logs, etc.), supplement them with additional data (e.g. transcribed call-center logs), then discover the interest graph, and finally update/process it in real-time with “next best actions” which, ultimately, improve your customer’s experience.

Think of a new kind of commerce experience that goes beyond the notion of group shopping, shopping communities, and recommendation engines.

Think of a commerce experience that takes into account not only your social graph, but also your interest graph.

Think of a commerce experience where we not only appreciate what our customers are looking for, but we also appreciate what happens after they get what they’re looking for….understanding your customer’s experiences before, during, and after their desired outcome…

After all, it’s all about the experience and not just buying ‘stuff’, right?

Posted in Data.


Big Data is Thriving. Is RDBMS Dead?

MapReduce vs. RDBMS

People think that MR is this new transformative technology…..new? No. Transformative? Yes.

Although it might seem that MapReduce (MR) and parallel DBMSs are different, it is actually possible to write almost any parallel-processing task as either a set of database queries or a set of MR jobs.

When you look at the semantics of the MR model, you’ll find that its approach to filtering and transforming individual data items (tuples in tables) can be executed by a modern parallel DBMS using SQL. Even though the Map operations are not easily expressed in SQL, many DBMSs support user-defined functions (UDFs), which provide the equivalent functionality of a Map operation. The Reduce step in MR is equivalent to a GROUP BY operation in SQL.
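Here is a small sketch of that equivalence, under loud assumptions: the MapReduce side is a toy in-process simulation (not Hadoop), the SQL side uses Python's built-in `sqlite3` (not a parallel DBMS), and the `visits` data is invented.

```python
import sqlite3
from collections import defaultdict

# Invented sample data: (ip, revenue) pairs.
VISITS = [("10.0.0.1", 2.0), ("10.0.0.2", 5.0), ("10.0.0.1", 3.5)]


def map_fn(record):
    """Map: emit (key, value) pairs for one input record."""
    ip, revenue = record
    yield ip, revenue


def reduce_fn(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # stands in for the shuffle phase
    return dict(reduce_fn(key, values) for key, values in groups.items())


def run_sql(records):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (ip TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT ip, SUM(revenue) FROM visits GROUP BY ip"
    ).fetchall()
    conn.close()
    return dict(rows)


mr_result = run_mapreduce(VISITS)
sql_result = run_sql(VISITS)
```

Both paths produce the same per-IP totals; the grouping by key that the shuffle performs is what GROUP BY expresses declaratively.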

So, if we’ve been able to perform MR-like operations with RDBMSs, what’s the big deal? Well, it has something to do with the fact that new generations of technologists like to do things differently. Things evolve. Just look at the difference between a data scientist (today) and the data analyst (yesterday), and you begin to understand why Hadoop / MapReduce has become transformative. Here are a few thoughts…

Big Data = Big Simplicity

One of the big attractive qualities of the MR programming model (and maybe its key attraction to the new generation of data scientists and application programmers) is its simplicity; an MR program consists of only two functions – Map and Reduce – written to process key/value data pairs. Therefore, the model is easy to use, even for programmers without experience with parallel and distributed systems.

It also hides the details of parallelization, fault-tolerance, locality optimization, and load balancing. Those experienced with RDBMS and SQL will naturally say that writing SQL code is easier than writing MR code. However, we should probably ask the next-gen data scientist and application developer what is easier. Let’s ask the top 7 data scientists.

Unlike a DBMS, MR systems do not require users to define a schema for their data. Thus, MR-style systems easily store and process what is known as “semi-structured” data. Such data can often be made to look like key-value pairs, where the number of attributes present in any given “record” varies. This style of data is typical of Web traffic logs, for example, derived from disparate sources. If you were going to attempt to do this in an RDBMS, you would have to create a very wide table with many attributes to accommodate multiple record types (using NULLs for the values that are not present for a given record). This is where columnar databases come into play. They allow for reading only the relevant attributes for any query and automatically suppress the NULL values. However, if you had your choice, then the choice is NO SCHEMA. Of course, there are always tradeoffs.
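A quick illustration of the wide-table problem described above, using made-up log records. The record types and field names are assumptions chosen only to show how NULLs pile up when varying key-value records are forced into one fixed schema.

```python
# Invented semi-structured records: each "record type" carries different keys.
RECORDS = [
    {"type": "pageview", "url": "/home", "ip": "10.0.0.1"},
    {"type": "purchase", "ip": "10.0.0.2", "sku": "A-17", "amount": 19.99},
    {"type": "search", "ip": "10.0.0.1", "query": "hadoop"},
]


def to_wide_rows(records):
    """Flatten the records into one wide table; absent attributes become None."""
    columns = sorted({key for record in records for key in record})
    rows = [tuple(record.get(col) for col in columns) for record in records]
    return columns, rows


def count_nulls(rows):
    """Count the NULL (None) cells in the wide table."""
    return sum(cell is None for row in rows for cell in row)


columns, rows = to_wide_rows(RECORDS)
total_cells = len(columns) * len(rows)
null_cells = count_nulls(rows)
```

With just three record types, 8 of the 18 cells in the wide table are NULL, and every new record type widens the table further. Stored as key-value pairs, the same data carries no padding at all.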

In addition, an MR implementation provides the best IT user experience. It’s not complicated to install an MR system and get it up and running, whereas a high-end RDBMS (even the IT- and DBA-friendly Teradata) requires installation and configuration effort that exceeds that of MR, although some may argue that a Hadoop cluster also needs tuning to maximize performance. Configuration and tuning aside, once an RDBMS is up and running properly, programmers must still write a schema for their data and then load the data set into the system. With MR, programmers load their data by simply copying it into the MR file system (HDFS).

Where Does MR Shine?

Even though parallel DBMSs are able to execute the same semantic workload as MR, there must be several application use-cases where MR is consistently the better choice.

We routinely hear that Hadoop / MapReduce is being deployed in “data pipeline” use-cases (aka ETL). This makes sense because the canonical use of MR can be characterized in five operations:

  1. Read information from many different sources (structured and unstructured)
  2. Parse and clean the data
  3. Perform complex transformations (such as “sessionization”)
  4. Decide what attribute data to store
  5. Load the information into a data store (file system, RDBMS, NoSQL data store, graph DB, etc)

This is analogous to the extract, transform, and load phases in ETL systems…MR is taking raw data and creating useful information that can be consumed by another storage system.
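The five operations can be sketched as a chain of small functions. This is a single-process toy, not a Hadoop job: the comma-separated log format, the 30-minute session gap, and all function names are assumptions made for the example.

```python
from itertools import groupby

# Invented raw inputs from two "sources": user,timestamp_seconds,url
RAW_SOURCES = [
    ["alice,100,/home", "alice,160,/cart", "bad line"],
    ["alice,4000,/home", "bob,120,/home"],
]
SESSION_GAP = 1800  # assumed: 30 minutes of inactivity ends a session


def read(sources):
    """1. Read information from many different sources."""
    for source in sources:
        yield from source


def parse_and_clean(lines):
    """2. Parse each line and drop the ones that do not fit."""
    for line in lines:
        parts = line.split(",")
        if len(parts) == 3 and parts[1].isdigit():
            yield {"user": parts[0], "ts": int(parts[1]), "url": parts[2]}


def sessionize(events, gap=SESSION_GAP):
    """3. Group each user's events into sessions split by inactivity gaps."""
    ordered = sorted(events, key=lambda e: (e["user"], e["ts"]))
    for user, user_events in groupby(ordered, key=lambda e: e["user"]):
        session, last_ts = [], None
        for event in user_events:
            if last_ts is not None and event["ts"] - last_ts > gap:
                yield user, session
                session = []
            session.append(event)
            last_ts = event["ts"]
        yield user, session


def select_attributes(sessions):
    """4. Decide what attribute data to store."""
    for user, session in sessions:
        yield {"user": user, "start": session[0]["ts"], "pages": len(session)}


def load(rows, store):
    """5. Load into a data store (a plain list stands in for it here)."""
    store.extend(rows)
    return store


def run_pipeline(sources):
    events = parse_and_clean(read(sources))
    return load(select_attributes(sessionize(events)), [])


session_table = run_pipeline(RAW_SOURCES)
```

In a real MR job the sort-and-group step inside `sessionize` would be handled by the framework's shuffle, with the user as the key.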

MR-style systems also excel at complex analytics. This is because in many data mining applications, the program must make multiple iterative passes over the data. Such applications cannot be structured as single SQL aggregate queries; they require a complex dataflow program in which the output of one part of the application is the input of another. MR is a strong fit for such applications. Take a look at Andrew Ng’s work (from over five years ago), Map-Reduce for Machine Learning on Multicore. Future directions include taking the MADlib project and porting it to MR as part of the Mahout project.

See my broader list of use-cases here.

Total Cost Of Ownership

Just as in Cloud computing, you have to take the “total cost of ownership” (TCO) into account before you realize the real benefits. And when I say TCO…I mean the cost in terms of people, time-to-market, as well as hardware and software. When we perform a TCO analysis in cloud, we sum the traditional costs over five years and then reduce the total down to $/user/month. This way you can compare it to the on-demand model.
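The arithmetic is simple enough to sketch. The cost categories and every dollar figure below are hypothetical placeholders, not numbers from any real analysis.

```python
def monthly_cost_per_user(capex, annual_opex, annual_people_cost, users, years=5):
    """Sum all costs over the period, then reduce to $/user/month."""
    total = capex + years * (annual_opex + annual_people_cost)
    return total / (users * years * 12)


# Hypothetical inputs: $600K up-front hardware/software, $120K/year to run it,
# $300K/year in people, serving 500 users over five years.
on_premise = monthly_cost_per_user(
    capex=600_000,
    annual_opex=120_000,
    annual_people_cost=300_000,
    users=500,
)
```

With these made-up inputs the total is $2.7M over five years, or $90 per user per month, which is the figure you would hold up against an on-demand price quoted in the same unit.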

In the BIG DATA space, we can do a similar analysis to get a true TCO as well. To keep things simple, let’s look at the following BI / analytics example for the data scientist…a simple process as follows:

  • Data load
  • Data Selection
  • Aggregation
  • Join
  • UDF Processing & Aggregation

Let’s say we perform an analysis of web-log data on user visits. We join this user data with a table of PageRank values, which involves two subtasks/calculations:

*Subtask 1: Find the IP address (user) that generated the most revenue within a particular date range

*Subtask 2: Calculate the average PageRank of all pages visited during the particular date range interval
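For readers who want to see the two subtasks spelled out, here is a small in-memory sketch. The table layouts, column order, and sample rows are assumptions for illustration only; they are not the benchmark's actual data or code.

```python
from collections import defaultdict

# Invented sample rows: (source_ip, dest_url, visit_date, ad_revenue)
USER_VISITS = [
    ("10.0.0.1", "/a", "2012-01-03", 4.0),
    ("10.0.0.2", "/b", "2012-01-04", 7.0),
    ("10.0.0.1", "/b", "2012-01-05", 5.0),
    ("10.0.0.2", "/c", "2012-02-01", 50.0),  # falls outside the date range
]
# Invented PageRank table: url -> rank
PAGE_RANKS = {"/a": 10, "/b": 30, "/c": 99}


def in_range(date, start, end):
    """ISO-formatted dates compare correctly as plain strings."""
    return start <= date <= end


def top_revenue_ip(visits, start, end):
    """Subtask 1: the IP that generated the most revenue in the date range."""
    revenue = defaultdict(float)
    for ip, _url, date, amount in visits:
        if in_range(date, start, end):
            revenue[ip] += amount
    return max(revenue.items(), key=lambda pair: pair[1])


def average_page_rank(visits, ranks, start, end):
    """Subtask 2: join visits to PageRank and average over the date range."""
    joined = [
        ranks[url] for _ip, url, date, _amount in visits
        if in_range(date, start, end)
    ]
    return sum(joined) / len(joined)


best_ip = top_revenue_ip(USER_VISITS, "2012-01-01", "2012-01-31")
avg_rank = average_page_rank(USER_VISITS, PAGE_RANKS, "2012-01-01", "2012-01-31")
```

The selection (date filter), aggregation (sum per IP), and join (visit to PageRank) steps from the process list all show up here in a few lines each.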

Here’s the result of this use-case:

 

On the surface, you might be inclined to compare just the join operation, for example, between the RDBMS and the equivalent MR function…concluding that MR is slower. But if we look at the entire end-to-end process for the data scientist, the picture might change. The above comparison between a Hadoop framework and both a columnar and a standard RDBMS provides some interesting thoughts regarding the process. I could add database / logical modeling, etc. and continue to fill out the picture. But you get the idea. [Note: I'm not including any specifics of the benchmark, because I'm trying to make a high-level, general point here, and don't want to get caught up in the details.] So, I conclude with two theses.

Thesis #1: if we’re talking about “discovery” environments where you need to perform iterative exploration of the data asking questions like, “what, why, what will, what if?”, you might need to consider the full end-to-end process when you decide which platforms are best for your organization.

Do you agree with this?

Lastly, why did I title this post, “Big Data is Thriving. Is RDBMS Dead?” Well, other than trying to be a bit controversial, I have had this philosophical discussion with my friends in the industry. The answer is obviously, “no”. But there is definitely an undertone that the Big Data Warehouse is storing more data than the traditional EDW, and the ETL and analytic tasks will begin to migrate off of the EDW (and associated data marts) into Hadoop Big Data marts.

Thesis #2: enterprises will begin to build big data warehouses that enable quick ETL operations, ultimately, supporting advanced analytic BI applications (which will no longer be supported by the EDW or associated data marts).

Agree with this?

 

Posted in Data.



Big Data Use-Case: ETL made easy

This Big Data use-case involves a Global Fortune 100 company. The company is interested in rethinking how it manages the many disparate billing systems and data marts which IT supports within its multiple divisions. The data from the systems is provided in multiple formats including flat files, feeds, and SQL extracts.

Question: So what’s the issue? Why not just use IBM’s Datastage and SQL in Teradata?

Answer: Maybe because it’s expensive? And using the Teradata DBMS to perform some of the data manipulation is inefficient. Teradata is meant to be focused on decision support, not ETL.

This is where Hadoop can provide a very cost-effective ETL platform which manages all aspects of data integration while still addressing requirements in scalability and usability.

Think about it.

Posted in Data.



Big Data PaaS?

Cloud-based PaaS is pretty high on the hype curve. I’ve been of the opinion that we’ll begin to see vertical PaaS offerings as the enterprise begins to understand the potential impact of application development acceleration. So, to continue to expand on that idea, how about a Big Data PaaS?

As many will agree, Big Data was originally driven by the need to discover. Yes, there are many practical examples of using the Hadoop framework in “operational” applications. However, we could argue that many of these production applications were born out of discovery sandbox initiatives.

If we’re going to support the many data scientists and their application developer counterparts in an even more experimental, data-driven enterprise, we better pay attention to the roots of the Hadoop framework.

Providing the enterprise a way to build their own internal sandbox applications using Hadoop building blocks will require the talent of a Hadoop-savvy team.

Ideally, the existing staff needs to be educated and provided with simple application development environments which support the use of unstructured data technologies.

With an internal deployment of a Big Data PaaS, the organization is provided with a full, turnkey stack which is not too dissimilar from Hortonworks, Cloudera, etc, but with maybe one compelling difference.

The early Hadoop-centric vendors could be compared to the “IaaS providers” in the world of Cloud (e.g. Eucalyptus, Nimbula, Surgient, OpenStack, Enomaly, Cloud.com, etc.). These vendors focus on infrastructure and less on the application developer….less on the next level in the stack which completes the PaaS layer.

The compelling difference with a PaaS is that it includes a comprehensive suite of services for the app-dev teams and a robust API that can be expanded as services are developed and added to the platform. It can include a number of other application-centric services, such as:

  • Application lifecycle management
  • Application-level monitoring/management
  • Application metering
  • Entitlement management
  • Authentication/Authorization

The end-goal? Abstracting the infrastructure and accelerating time-to-market for new BI applications.

Some questions to ponder:

  1. Is the market ready for a turnkey Big Data platform offering? Is it too early for a Big Data PaaS?
  2. Is Big Data PaaS just a flavor of Private PaaS?
  3. What kind of hybrid Big Data architectures will we see?
  4. Do we need to first see Big Data “killer apps”? The killer use-case in the private Cloud IaaS market was/is “test and dev clouds”…essentially sandboxes.
  5. Production applications are born out of the sandbox cloud environments, and companies are moving to support the development team to facilitate productization.  Are we going to see the same in Big Data?

What are your thoughts?

Posted in Cloud Computing, Data.



The Big Data Warehouse – The New Enterprise

 

If you’re familiar with the Fortune 1000 enterprise data warehouse reference architecture, you’ll appreciate how it’s evolving to include Big Data.

We’re seeing a few things repeat themselves, but now with semi-structured data:

  1. IT needs to address the needs of business users
  2. There are many new data sources
  3. Those who can centralize ALL enterprise data will win
  4. Enterprise groups need sandboxes (aka data marts and now Hadoop data stores)
  5. Discovery needs to be simplified with user-friendly data mining (now including unstructured)
  6. Information access is being made self-serviceable (EDW, Big Data, and BI data marts)
  7. Business users need time-to-market acceleration through app platforms

Note: The new enterprise warehouse architecture shown above depicts a Big Data data store that is much larger than the traditional enterprise data warehouse. This isn’t difficult to comprehend, because RDBMS data stores are, by nature, structured and compact compared to their new counterpart: the unstructured NoSQL data stores.

IT needs to address needs of business users

The gatekeepers of the enterprise’s key resource, data, are making that resource available to others in the organization (willingly or not). Business users are beginning to benefit due to two main trends in the IT industry:

  • Use of private and public cloud services to automate access
  • Use of the Hadoop framework to create new and timely data sandboxes

By making IT infrastructure more transparent, and empowering others to gain access through self-service provisioning of those resources, the organization can be more responsive to market needs. IT is either proactively or reactively breaking down the traditional barriers to data access. Big Data is a great example of being able to quickly spin up proofs of concept and advance thinking without the burden of data schemas, expensive tools and the like.

There are many new data sources

Much of the organization’s data had to be “thrown on the floor” due to the expense of traditional data warehouse infrastructure. Now with web-scale technologies like Hadoop, enterprises can throw that data into unstructured “data crock pots” and “cook up new insights” for the company. With access to Apache web logs, social feeds, M2M, etc., it’s no surprise that IDC predicts that 90% of the 35 exabytes of data generated in the year 2020 will be “unstructured”…lots of new data sources.

Those who centralize data will win

This doesn’t mean that you can’t have data marts. However, I believe that the idea of having a data mart will now begin to evolve to many separate “buckets” of sandbox MapReduce data sets.

Envision a single Hadoop cluster with exabytes of corporate data, responding to thousands of MapReduce jobs every day (being initiated by data scientists as well as knowledge workers and power analysts in the organization). Whether the subsets of data are stored in a single Hadoop file system or transferred between separate Hadoop clusters is not important. What is important is that EVERYONE will have access to any and all data (and without the need for previously setting up queries and schemas to do so). And…there will be a tight connection between the traditional relational data store (EDW) and the Big Data store. Together, they form the new “single version of the truth”.

Groups need sandboxes

Hadoop exists because Yahoo! needed a place where its data scientists and app-dev teams could get access to enterprise data quickly to experiment and discover. Big Data will provide the same value proposition to the new “data-driven” enterprise. It’s time to take the gloves off, roll up your sleeves, get your hands dirty…let the organization have access to the data. Big Data provides that promise. Big Data IS the sandbox.

Discovery needs to be simplified

After starting Teradata’s internal data mining program back in 1996 (almost 15 years ago), I’m still seeing new data mining offerings provide similar value propositions: instant access to organizational data and simple, business-user-centric interfaces made to remove the inherent complexities associated with the data mining process. And let’s not forget that moving the data mining process closer to the data (e.g. leveraging in-database analytics) has been the promise of data mining tools for two decades.

Big Data provides a new platform for knowledge discovery and data mining by simplifying the access to the data. What is still missing is the ability to raise this up to the business user.

My opinion is that discovery, by definition, cannot be distilled into a simple workflow that can be automated. Sure, you can provide things like visual programming to allow managers to build, train, and deploy their own neural nets (really?), but the new data scientists, the new application developers, will still want to get their hands dirty. And, let’s be honest: business users do not want to execute the data mining process, and the data miners do not want to play with the fluffy icon-driven user interfaces.

However, I do believe there IS an opportunity to develop a new Big Data Data Mining process which is facilitated through a suite of knowledge discovery web services which simplify rapid BI application development – a new Big Data Private PaaS. [Note: what do you expect from me? The PaaS guy. But don't take my word for it...you can talk to a hundred Fortune 1000 companies and their internal IT, application development, and business user organizations...maybe you'll hear something different].

Information access is being made self-serviceable

Go figure. I can still remember the “Data Warehouse Readiness Services”, “Data Warehouse Design Services”, and “Data Warehouse Support and Enhancement Services”. But the one that still flashes the most in my mind is the “Data Warehouse Information Discovery Services” where you determine how to link business strategies with information technology….and, of course, this involves building a customized data warehouse or data mart solution.

Well, you may not throw away the architecture, logical data modeling, and the many similar tasks involved in building traditional data infrastructure. But you sure can accelerate the process, providing more immediate access to data. Just ask the many emerging “data scientists” in the organization who are spinning up Big Data Hadoop clusters and playing, discovering, in a matter of days and weeks versus months and sometimes years.

Don’t believe me? Just ask folks at Facebook, LinkedIn, Twitter, Yahoo!, eBay, etc. See how long it takes them to answer a new business question that requires access to existing or new data elements.

Time-to-market acceleration through app platforms

If you could provide a suite of BI application development services to accelerate time-to-market for new discoveries, new BI applications, new executive insights, what would they be?

Do you think it’s more important to provide access to infrastructure, or tools to facilitate the development of new BI applications for the organization?

Sure, the new enterprise should have both…but those who are more advanced in their thinking are already focused on how to enable application development, providing easy-to-use tools for access to infrastructure (and self-service capable infrastructure). I think the true value is with the application developer - always have, and always will.

This could start out as a number of fixed Hadoop clusters. It could consist of Hadoop cluster stacks which can be provisioned via your Private IaaS offering. It could be a private PaaS of Hadoop-enabled web services. It could be a hybrid Hadoop Cloud.

What do you think?

Posted in Cloud Computing, Data.



Big Data Use-Case: Real-time Dispenser Maintenance

“Sensor” applications have great potential in the Big Data space. The fact that machines produce the most data (aka sensor data) presents a natural application for the Hadoop framework: collecting, storing, applying simple analytics, and automating action.

The above picture presents a solution for a network of dispensers of food items. These may be installed in thousands of locations across a large geography. With “smart dispensers” the investment is high. Having any particular dispenser “down” due to maintenance issues can significantly impact ROI. On the flip side, dedicating people to checking and maintaining these dispensers could also significantly impact ROI.

This solution uses the Athena framework and Flume to incorporate data sensor agents for each dispenser, and then to efficiently collect, aggregate, and move all the dispenser log data into a Hadoop cluster. HBase is then used as a column-oriented database store; it is modeled after Google’s Bigtable (described in the paper “Bigtable: A Distributed Storage System for Structured Data”). Lucene provides a full-featured text search engine library, written entirely in Java, that is suitable for nearly any application that requires full-text search.

The ultimate goal is to provide both iPad and iPhone applications that enable remote maintenance personnel to quickly identify and respond to dispenser issues in real time. The logs for each machine can be searched and analyzed seamlessly, much as Splunk allows in the datacenter. The final outcome: a low-cost solution that enables a cost-effective, real-time maintenance workforce.
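As a rough illustration of the “simple analytics” step, here is a minimal Python sketch that scans dispenser log lines and flags the machines that have reported repeated errors. Everything specific in it is an assumption on my part: the log line format (`<timestamp> <dispenser_id> <LEVEL> <message>`), the dispenser IDs, the threshold, and the function names are hypothetical and do not come from the actual solution, which would run this kind of logic against data in Hadoop and HBase, not an in-memory list.

```python
from collections import Counter


def parse_event(line):
    """Parse one log line in the assumed (hypothetical) format:
    "<timestamp> <dispenser_id> <LEVEL> <message>".
    Returns None if the line does not have all four parts.
    """
    parts = line.strip().split(" ", 3)
    if len(parts) < 4:
        return None
    timestamp, dispenser_id, level, message = parts
    return {
        "timestamp": timestamp,
        "dispenser_id": dispenser_id,
        "level": level,
        "message": message,
    }


def dispensers_needing_attention(lines, error_threshold=2):
    """Return the sorted IDs of dispensers with at least
    `error_threshold` ERROR events in the given log lines.
    """
    errors = Counter()
    for line in lines:
        event = parse_event(line)
        if event is not None and event["level"] == "ERROR":
            errors[event["dispenser_id"]] += 1
    return sorted(d for d, count in errors.items() if count >= error_threshold)


if __name__ == "__main__":
    sample_log = [
        "2011-06-01T10:00:00 D-001 ERROR motor stalled",
        "2011-06-01T10:00:05 D-002 INFO dispensed item",
        "2011-06-01T10:01:00 D-001 ERROR motor stalled",
        "2011-06-01T10:02:00 D-002 ERROR hopper empty",
    ]
    for dispenser_id in dispensers_needing_attention(sample_log):
        print("Needs maintenance:", dispenser_id)
```

A mobile application could poll a result like this and alert the nearest maintenance person, so nobody has to check each dispenser on a schedule.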

This solution could also be deployed in Amazon Web Services (Elastic MapReduce), making it easy to support geographical deployments worldwide.

Do you have any Big Data “sensor” applications?

Posted in Cloud Computing, Data.




