Big Data and SAP HANA? Or Sybase IQ?

Like a few other folks, I think there has been some misunderstanding in lumping Big Data and SAP HANA into one bag. We touched on this topic in the recent podcast “Debating the Value of SAP HANA”, but I would like to spend a few more minutes here to explain my thoughts.

SAP HANA has been created with traditional SAP Business Suite and Business Warehouse (BW) customers in mind. How big is the biggest single SAP software installation in the world in terms of single-store data size? I do not know exactly. The times of the proud “Terabyte Club” are in the past. Four years ago there was a lot of noise about a 60TB BW test SAP did. The biggest customer I worked with had a 72TB database of BW data. So, I would assume that the biggest SAP instance is somewhere close to 120 TB. That’s still a lot of data not just to process, but also to manage (think back-ups, system upgrades, copies, disaster recovery etc)… Despite current technical limitations – an 8TB biggest certified hardware configuration and a 2 billion records limit in a single table partition – SAP HANA is on the way to helping SAP ERP and BW customers with those challenges. But those are not what the industry calls “Big Data”.

Here are the main differences as I see them:

  • Data sizes we are discussing with SAP HANA are in the ballpark of a few terabytes, while Big Data currently means something in the single-digit petabytes. E.g. HP Vertica has 7 customers with a petabyte or more of user data each, according to Monash Research.
  • The current focus of SAP HANA is structured data, while Big Data issues are generated mostly by unstructured data: web, scientific, machine-generated. It is fair to mention, though, that SAP is working on Enterprise Search powered by HANA, as Stefan Sigg, VP In-Memory Platform at SAP, told me during this TechEd Live interview.
  • Currently, Big Data processing is almost a synonym for the MapReduce software framework, where huge data sets are processed by a big cluster of rather cheap computers. On the other hand, SAP in-memory technology requires “a small number of more powerful high-end [servers]”, according to Hasso Plattner’s book “In-Memory Data Management: An Inflection Point for Enterprise Applications”.
  • Related to the point above: SAP HANA’s promise is real time, where a fact is available for analysis subseconds after it occurs. In Big Data processing, algorithms are mostly batch-based. My previous blog post became available in Google Search results and in Google Alerts only 4 days after being posted – not quite real time, huh?
  • SAP HANA data analyses are most often paired with SAP BusinessObjects Explorer – modeless visual data search and exploration. Use of MapReduce libraries on top of Big Data requires advanced programming skills (see the small sketch right after this list).
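
To illustrate that last point, here is a minimal, self-contained word-count sketch – the “hello world” of MapReduce. This is not HANA or Hadoop code; the function names and the single-process “shuffle” are purely illustrative. It only shows the map and reduce steps a developer would have to program by hand:

```python
# A minimal, framework-free sketch of the MapReduce programming model
# (the canonical word-count example). In a real cluster, e.g. Hadoop,
# the framework would shuffle the intermediate (key, value) pairs
# across many cheap machines; here everything runs in one process
# purely to show the two phases a developer has to write.

from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) for every word in a document."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(word, partial_counts):
    """Sum all partial counts for a single word."""
    return word, sum(partial_counts)

def run_word_count(documents):
    grouped = defaultdict(list)          # the "shuffle": group values by key
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

if __name__ == "__main__":
    docs = ["big data is not hana", "hana is in-memory", "big cluster big data"]
    print(run_word_count(docs))          # e.g. {'big': 3, 'data': 2, ...}
```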

During the SAPPHIRE’11 USA keynote speech Hasso Plattner mentioned MapReduce as a road-map feature for SAP HANA, but since then I haven’t gotten any specifics on what it means. Instead, the quietly announced Release 15.4 of Sybase IQ has introduced some features focused on analysis of Big Data in its original meaning. Is there a silent revolution going on at SAP on the Sybase side, while all eyes are on the HANA product?


5 Comments

Filed under HANA, SAP

5 responses to “Big Data and SAP HANA? Or Sybase IQ?”

  1. 1. You know my thoughts on this subject. :) BW systems are peanuts compared to the “Big Data” problems many web-based companies have already. I’ll be interested to see what happens with HANA when we start hitting real data volumes (i.e. a BW system with 50-100TB). Given the current technical limitations that you state, I’m very skeptical about the future of the architecture and the assumptions we are making about it.

    2. MapReduce is a framework for processing data over large clusters of machines. In the same way I don’t believe the current HANA technology is capable of multi-tenancy, it makes zero sense technically to introduce it to HANA. Multi-tenancy and infinite scalability of commoditized hardware are two things that simply aren’t features of the architecture. In fact they are quite the opposite. HANA tries to suck up as much CPU as possible and has to run on certified Intel hardware. Just like Cassandra was built to cope with Facebook’s inbox search, the HANA architecture was created to cope with large data volumes of ERP data. SAP should focus on solving those problems first, which are not what most people consider “Big Data” problems; they are just typical SAP ERP data management problems.

    3. Sybase IQ is a great solution for customers with data that can potentially explode and MapReduce is a great function for infinitely scaling processing functions. I’m still waiting on the official response from SAP on how to position BODS w/ Sybase IQ vs HANA vs BW vs BW on HANA, for storing and retrieving data for BI functions.

  2. It’s fair to say that Big Data is an overloaded term, especially in the context of IMDBs. Most people think of Big Data as hundreds of Terabytes to Petabyte+ scale datastores. How can an IMDB possibly play a meaningful role for databases of that size?

    Here’s a simple formula you may find useful: Big Data = Operational Data + Fast Data + Deep Data + Specialty (e.g., graph) Data. This formula suggests that IT will need to have multiple storage engines (operational, real-time, analytic, specialty, etc.) to properly handle the full range of needs.

    Imagine a sensor network that’s generating hundreds of thousands of scans per second. Ultimately, the data generated by those scans will accumulate into an enormous historical store which can be explored for trends, patterns and anomalies. But what if that data must also be analyzed in real time (i.e., within a few milliseconds of capture) to drive an alerting system? Further, what if that alerting system requires the data to be conditioned in some way before it can be analyzed?

    This is a perfect scenario for an IMDB, which can ingest, store and manage Terabytes of “hot” data, allowing you to do specialized operations while that data is in its most volatile form. To be clear, this is NOT a stream processing problem – the data must be statefully managed to derive its full value. And as it ages, the data then rolls out into an analytic store of choice for deeper exploration.

    Am I describing some edge case that nobody should care about? I don’t think so. Within a couple of years, everything worth anything will be sensor tagged and tracked. Real-time fraud and vulnerability detection are hot tech topics. The digital advertising ecosystem is creating a new version of commodities trading, replacing pork bellies and currencies with media images. And there are many other examples of old and newer economies that are driven by high-throughput, low-latency data management needs. This is the Fast Data side of Big Data, and I’m sure it is the target of HANA and other emerging IMDBs.

    A final thought – if the data tier is to evolve in the direction of “polyglot persistence” (multiple datastores doing specialized work), storage engines must be designed from the ground up for interoperability. Popular ETL transformations will have to be provided directly by the engines themselves.

    Disclosure: I work at VoltDB, an in-memory transactional database company.

    • Shashank

      Agreed with the thought of polyglot persistence.
      We need to address different data needs in different ways: a combination of NoSQL stores and RDBMSs.

  3. I would like to add that HP has released a SAP HANA distributed blade processing solution that can support 16 PB of shared persistent storage and 8 TB of RAM within the C7000 chassis. Assuming compression works at the stated 5X in RAM, you can store about 40TB of SAP BW data on this solution (a rough sketch of the capacity math follows at the end of this comment). I recently did a HANA pilot where the compression was closer to 12X, but it does vary based on the makeup of the data. 40TB is still not “Big Data”, but we can always hope that the generation 8 HP BL servers or advances in the LRDIMM chips will quadruple the RAM for each blade next year.

    http://www.sys-con.com/node/2104122

    I would also like to add that companies have a choice to run SAP HANA standalone, where they could replicate or batch-load BW data into one or more HANA instances. Companies could run multiple HANA instances to accommodate more than 40TB of BW data. However, this statement is admittedly a subjective recommendation and assumes they have an unlimited budget.
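
    For what it’s worth, here is a back-of-the-envelope sketch of that capacity arithmetic. The 8 TB of RAM and the 5X / 12X compression ratios are the figures from this comment; the function name and the rest are purely illustrative.

    ```python
    # Back-of-the-envelope capacity estimate for an in-memory appliance.
    # The 8 TB of RAM and the 5X / 12X compression ratios are the figures
    # quoted above; the calculation itself is just illustrative arithmetic.

    def capacity_tb(ram_tb, compression_ratio):
        """Uncompressed source data that fits in RAM at a given ratio."""
        return ram_tb * compression_ratio

    ram_tb = 8
    for ratio in (5, 12):
        print(f"{ratio}X compression -> ~{capacity_tb(ram_tb, ratio)} TB of BW data")
    # 5X  -> ~40 TB  (the figure quoted above)
    # 12X -> ~96 TB  (closer to the pilot mentioned)
    ```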

  4. Pingback: SAP HANA | Acordo Coletivo (Petroleiros, Bancários, Prof de Saúde)
