Tag Archives: RDBMS

Big Data and SAP HANA? Or Sybase IQ?

Like a few other folks, I think there was some kind of misunderstanding in mixing Big Data and SAP HANA into one bag. We touched on this topic in the recent podcast “Debating the Value of SAP HANA”, but I would like to spend a few more minutes here to explain my thoughts.

SAP HANA has been created with traditional SAP Business Suite and Business Warehouse (BW) customers in mind. How big is the biggest single SAP software installation in the world in terms of single-store data size? I do not know exactly. The times of the proud “Terabyte Club” are in the past. Four years ago there was a lot of noise about a 60TB BW test SAP did. The biggest customer I worked with had a 72TB database of BW data. So, I would assume that the biggest SAP instance is somewhere close to 120TB. That’s still a lot of data not just to process, but also to manage (think back-ups, system upgrades, copies, disaster recovery etc). Despite current technical limitations – an 8TB maximum certified hardware configuration and a 2-billion-record limit in a single table partition – SAP HANA is on its way to help SAP ERP and BW customers with those challenges. But those challenges are not what the industry calls “Big Data”.

Here are the main differences as I see them:

  • Data sizes we are discussing with SAP HANA are in the ballpark of a few terabytes, while Big Data currently means something in single-digit petabytes. E.g. HP Vertica has 7 customers with a petabyte or more of user data each, according to Monash Research.
  • The current focus of SAP HANA is structured data, while Big Data issues are generated mostly by unstructured data: web, scientific, machine-generated. It is fair to mention, though, that SAP is working on Enterprise Search powered by HANA, as Stefan Sigg, VP In-Memory Platform at SAP, told me during this TechEd Live interview.
  • Currently Big Data processing is almost synonymous with the MapReduce software framework, where huge data sets are processed by a big cluster of rather cheap computers (see the toy sketch after this list). On the other hand, SAP in-memory technology requires “a small number of more powerful high-end [servers]”, according to Hasso Plattner’s book “In-Memory Data Management: An Inflection Point for Enterprise Applications”.
  • Related to the point above: with SAP HANA the promise is real-time, where a fact is available for analysis subseconds after it occurs. In Big Data algorithms, processing is mostly batch-based. My previous blog post became available in Google Search results and in Google Alerts only 4 days after being posted – not quite real-time, huh?
  • SAP HANA data analyses are most often paired with SAP BusinessObjects Explorer – a modeless visual data search and exploration tool. Use of MapReduce libraries on top of Big Data requires advanced programming skills.
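
To make the MapReduce contrast concrete, here is a toy sketch of the programming model in Python – the canonical word count. This is my own illustration, not code from any framework: in a real system like Hadoop the map and reduce tasks run distributed across that cluster of cheap machines, with the shuffle phase moving intermediate pairs over the network; here all three phases are simulated in a single process.

```python
from collections import defaultdict

# Toy, single-process illustration of the MapReduce model (word count).
# In a real framework the map and reduce tasks run in parallel on a
# cluster of commodity machines.

def map_phase(document):
    """Map: emit (key, value) pairs -- here (word, 1) for every word."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return (key, sum(values))

documents = ["big data is big", "data processing in batch"]
pairs = (pair for doc in documents for pair in map_phase(doc))
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'processing': 1, 'in': 1, 'batch': 1}
```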

During his SAPPHIRE’11 USA keynote speech Hasso Plattner mentioned MapReduce as a road-map feature for SAP HANA, but since then I haven’t gotten any specifics on what that means. Meanwhile, the quietly announced Release 15.4 of Sybase IQ has introduced some features focused on analysis of Big Data in its original meaning. Is there a silent revolution going on at SAP on the Sybase side, while all eyes are on the HANA product?



Filed under HANA, SAP

Is SAP HANA about the “in-memory database”?

Disclaimer: this post is not meant to be easily digestible, so please stay with me through the text and let’s have a discussion after that.

What is SAP HANA?

When in May 2010 I first heard Hasso Plattner, Chairman of the SAP Supervisory Board, talking about the in-memory revolution they were planning with the SAP HANA product, I scratched my head. I had already been working with SAP NetWeaver BW Accelerator (BWA) for 4 years, and it was obvious that HANA was a continuation of the same technology. But what made me curious was why, out of the three major principles underpinning the technology – massively parallel processing (MPP), columnar data store, and in-memory data store – SAP had chosen the last one as the flagship feature of the new product. It was not clear to me at that time. I decided that it must be because there were products already strongly identified with columnar data storage (like Sybase IQ or Vertica) and with MPP analytics processing (like Teradata or HP Neoview), while in-memory databases, like TimesTen, Altibase or solidDB, were not that well known to a broader audience.

For the last couple of years we have seen SAP’s effort to reclaim the “innovative” adjective next to the company name. So using “in-memory” – an existing, but not that widely known, technology – seemed to be a good match for “innovation”. As we saw during the last year, HANA was indeed used successfully by SAP marketing to generate lots of “game-changing”, “revolutionary”, “deliciously disruptive” buzz. This buzz was picked up by many. So it was quite interesting to read the contradictory statement made by the analyst Dennis Gaughan at Gartner Symposium (source):

… Gaughan said none of the four vendors [IBM, Microsoft, Oracle, SAP] are “re-imagining” IT, as per the theme of the Gartner conference.

“You won’t find innovation in their product portfolio,” he said. “You might find it if you try and talk to the research parts of these organisations.”…

Indeed, for those of us with a broader and deeper technical view, the question remained open: “What makes SAP HANA the innovative product among many existing in-memory database management systems?” I do not think this question has been fully answered by SAP so far. Let me share my understanding and thoughts here.

Firstly, in my opinion it is not the technology so much as the ultimate promise that is visionary: running transactional and analytic systems on a single platform with a single store of data. Data warehousing as we know it was born from the need to remove analytic workload from transactional systems. In addition, transactional data structures were transformed into analysis-optimized ones (like star schemas or OLAP cubes) along with data enrichment. Then ETL systems came into place to remove the data-transformation workload from data warehousing systems. Now SAP promises to bring everything back into one system (see graph below) – making separate ETL and EDW systems (and much of the related skills and expertise) obsolete. This will be a huge change, yet from my discussions with SAP customers it was not clear if they had gotten it. Many of them want the SAP HANA database merely for the sake of running ERP faster. Again – that alone is not what is revolutionary about the SAP vision to be delivered thanks to the HANA platform.

OLTP and OLAP systems today require not only separate computing resources, but also different data structures optimized for specific query profiles. SAP’s promise is that once transactional (e.g. ERP) and analytic (e.g. BW) systems are running on a single HANA platform, they will be using a single copy of data. All additional data modifications required, for example, by the analytics part of the system – like data cleansing, transformation, enrichment – will be done on the fly during each execution of a query [VitalBI: I bet there is going to be some kind of result caching, even if some guys in SAP marketing disagree]. In-memory data storage together with in-database calculations, append-only tables, and multi-core processing are all features that are going to help SAP achieve the “single business platform” promise.
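
To show in a toy way what “on the fly instead of ETL” means – this is my own illustrative Python sketch, not SAP code – the analytic query below applies cleansing and aggregation at query time against the one transactional copy, rather than reading from a separately loaded, pre-transformed cube:

```python
from collections import defaultdict

# Illustrative sketch only (mine, not SAP code): one transactional table,
# with the "analytic" transformations applied on the fly at query time
# instead of being pre-computed by a nightly ETL job into a separate cube.

# The single store: raw transactional rows, appended as business happens.
orders = [
    {"customer": " acme ", "amount": 100.0, "currency": "EUR"},
    {"customer": "ACME", "amount": 250.0, "currency": "EUR"},
    {"customer": "Duck Corp", "amount": 80.0, "currency": "EUR"},
]

def revenue_by_customer(rows):
    """Analytic query: cleansing + aggregation computed at query time."""
    totals = defaultdict(float)
    for row in rows:
        customer = row["customer"].strip().upper()   # cleansing, on the fly
        totals[customer] += row["amount"]            # aggregation, on the fly
    return dict(totals)

# No ETL ran, yet the "report" already reflects the latest transaction:
orders.append({"customer": "acme", "amount": 50.0, "currency": "EUR"})
print(revenue_by_customer(orders))  # {'ACME': 400.0, 'DUCK CORP': 80.0}
```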

What is different compared to other in-memory database management systems is SAP’s ambition to bring in-memory technology to the next level: the Enterprise. That means not only specific and limited use cases, but mixed-workload, big-scale, high-volume scenarios.

Secondly, there is not enough information about the innovation in the technology being developed by SAP. You will not find many white papers from SAP describing what is under the hood of the new database. Just storing data in RAM and treating it as faster storage is nothing new. Sybase ASE – the database acquired by SAP last year – has an “in-memory database” option. SAP HANA certainly has to offer something better.

My discussion with Franz Faerber, SAP HANA chief architect, at the SAP Influencer Summit last summer helped me get a somewhat deeper view into the technology, beyond the obvious. In a nutshell, the two major drivers behind SAP HANA technology were:

  1. “RAM is slow” (And you thought “in-memory” was about storing data in RAM??)
  2. “CPU clock frequency reaches its growth barrier”

In SAP HANA everything is about performance, which is a prerequisite for real-time data processing. Even if RAM is faster than ‘spindle’ hard drives, CPUs still waste cycles while waiting for data from RAM. Therefore the optimization goal is to reduce the idle cycles by making sure that there is as much useful data in the CPU caches as possible. The HANA database has to be coded using CPU-cache-aware algorithms that process CPU-cache-optimized data structures. Back in 2006, Jim Gray from Microsoft discussed this principle in his famous presentation “RAM Locality is King”.
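
To make the locality idea tangible – a minimal sketch of my own, not HANA internals – compare scanning values that sit contiguously in memory against chasing scattered per-row objects. In pure Python the measured gap is dominated by interpreter overhead, but the layout contrast is the point:

```python
import time
from array import array

# My own toy sketch of the locality principle, not HANA internals.
# A contiguous column (array of doubles) is scanned sequentially, so each
# cache line fetched from RAM brings in several useful values; a list of
# per-row objects scatters the same values across the heap.
N = 2_000_000
column = array("d", range(N))                             # contiguous doubles
rows = [{"id": i, "amount": float(i)} for i in range(N)]  # scattered objects

t0 = time.perf_counter()
total_col = sum(column)                            # sequential, cache-friendly scan
t1 = time.perf_counter()
total_rows = sum(row["amount"] for row in rows)    # pointer chasing per row
t2 = time.perf_counter()

print(f"columnar scan: {t1 - t0:.3f}s, row scan: {t2 - t1:.3f}s")
# In compiled, vectorized code the gap comes mostly from cache-line reuse
# and prefetching; in Python it is dominated by interpreter overhead.
```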

Most of the data in SAP HANA databases is stored in a columnar and compressed format. This data still has to be converted into records during processing, so it is important that this step happens as late as possible – something called late materialization. Ideally, operations should be able to run directly on the compressed data, without any need to decompress it first.
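
A toy sketch of the idea (my own illustration – real column stores use far more elaborate encodings): a column is dictionary-encoded, the filter predicate is evaluated against the small integer codes without decompressing anything, and full records are materialized only for the few qualifying row positions.

```python
# Toy sketch of dictionary encoding + late materialization (my own
# illustration; real column stores use far more elaborate encodings).

# Column "country", dictionary-encoded: values stored once, rows as codes.
dictionary = ["DE", "PL", "US"]          # distinct values, stored once
codes      = [0, 2, 1, 1, 2, 0, 1]       # one small integer per row
amounts    = [10, 99, 25, 30, 70, 5, 40] # a second column, by row position

# Predicate country = 'PL' is evaluated on compressed codes: one dictionary
# lookup, then integer comparisons -- no decompression of the column.
target = dictionary.index("PL")
matching_positions = [i for i, c in enumerate(codes) if c == target]

# Late materialization: only now, for the few qualifying positions,
# are full records assembled from the columns.
records = [(dictionary[codes[i]], amounts[i]) for i in matching_positions]
print(records)  # [('PL', 25), ('PL', 30), ('PL', 40)]
```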

As mentioned in the previous paragraphs, in HANA everything is about performance, so as clock-speed growth slows down, the search for performance moves to multi-core CPU processing. It is the worst-kept secret on the market that about a dozen developers from Intel spent months in SAP’s offices coding the core of SAP’s in-memory technology to use all possible features of the Intel Xeon architecture: Hyper-Threading, Intel Turbo Boost, Threading Building Blocks. That’s why the SAP HANA database can achieve its top performance only when running bare-metal on Intel Xeon CPUs, and not on other platforms or in a virtualized environment.
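
As a toy illustration of the multi-core direction (my own sketch, not SAP code): partition a column scan into chunks, let a pool of worker processes aggregate them in parallel, and combine the partial results.

```python
from concurrent.futures import ProcessPoolExecutor

# My own sketch of the multi-core idea, not SAP code: partition a column
# scan across worker processes and combine the partial aggregates.

def partial_sum(chunk):
    """One worker's share of the scan: aggregate its slice of the column."""
    return sum(chunk)

def parallel_column_sum(column, workers=4):
    size = max(1, len(column) // workers)
    chunks = [column[i:i + size] for i in range(0, len(column), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    column = list(range(1_000_000))
    print(parallel_column_sum(column))  # same result as sum(column)
```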

Last, but not least: the SAP HANA database is in fact a hybrid database. RAM is used as the primary data store, but SSDs or spindle drives are still used for data persistence, for example in case of power loss. I saw some customers surprised when facing SAP HANA hardware with external storage besides lots of RAM.
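
To picture the hybrid design in miniature – my own sketch, assuming a simple write-ahead log, while HANA’s actual persistence layer (savepoints plus redo logs) is considerably more involved – every write lands in RAM and is also appended to a log on disk, so the in-memory state can be rebuilt after a power loss:

```python
import json
import os

# Miniature sketch (mine) of the hybrid idea: RAM is the primary store,
# a disk log provides durability. HANA's real persistence layer is
# considerably more involved (savepoints plus redo logs).

LOG_PATH = "store.log"  # hypothetical log file name

class HybridStore:
    def __init__(self, log_path=LOG_PATH):
        self.log_path = log_path
        self.data = {}                       # primary store: RAM
        if os.path.exists(log_path):         # after "power loss": replay log
            with open(log_path) as log:
                for line in log:
                    key, value = json.loads(line)
                    self.data[key] = value

    def put(self, key, value):
        # Append to the durable log first, then update the RAM store.
        with open(self.log_path, "a") as log:
            log.write(json.dumps([key, value]) + "\n")
        self.data[key] = value

store = HybridStore()
store.put("order-1", 100.0)
print(store.data)  # survives a restart: a new HybridStore() replays the log
```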


At SAP’s invitation I am going to attend the SAP Influencer Summit on December 13-14, and I am looking forward to it as a chance to get a layer deeper into what makes SAP in-memory technology truly a step forward compared to others, and how they are going to overcome some remaining technology barriers.


Filed under HANA, SAP