Reintroducing Cloudera Enterprise, now with Apache Spark

It’s a new year, and it’s off to a fast start for the Hadoop ecosystem. Today we’re announcing a new, streamlined Cloudera Enterprise lineup and commercial support for Apache Spark (currently incubating at Apache), as part of our vision to deliver an enterprise data hub.

While the concept of an enterprise data hub still feels aspirational to some, it seems to be a trend that’s gaining momentum as more customers understand the value of having a single place to land, explore, process and analyze all their data. In our customer base we have seen a strong latent demand for a simpler, more scalable and more flexible architecture.

The power of an enterprise data hub is even more obvious once it’s deployed and in use, and we think 2014 is going to be an inflection point as enterprises move away from just building a Hadoop cluster.

Here are just two of my favorite recent examples:

  • Epsilon, a subsidiary of Alliance Data, is a world-leading marketing services firm that powers marketing programs for leading brands including Ford, Merck, and JP Morgan Chase. They’ve built a new cloud-based digital messaging platform on top of Cloudera called Agility Harmony, which lets Epsilon clients harness multiple sources of data for marketing campaigns across email, mobile, and social channels, and integrates with existing marketing and database systems, so they can deliver thousands of campaigns per second and billions of customized messages per year. Epsilon considers Agility Harmony to be a “marketing enterprise data hub” for its customers: a single platform for data processing, analytics, and data-driven applications that also lets users interact directly with the data using familiar tools like SAS, Business Objects, and Tableau.
  • AutoScout24, one of Europe’s largest internet properties, serves more than 10 million users across 18 European countries every month. They provide an online marketplace for private customers, car dealers, and other organizations in the automotive space to share information about vehicles, parts, and accessories. AutoScout24 has deployed an enterprise data hub to power the company’s data collection, processing, storage, and analytics, while continuing to feed specialized database systems that power their online web platform.

Our mission is to help organizations leverage the power of all their data, and we’re looking forward to supporting Epsilon, AutoScout24, and the rest of our great customers in this new year.

With that said, we’re always looking for ways to improve. In that spirit, today we are announcing a major facelift to Cloudera Enterprise, based on customer feedback.

First, some background. As we worked with our hundreds of customers, we learned that many found Cloudera Enterprise’s model of a core subscription plus optional add-ons to be, frankly, somewhat confusing and difficult to adopt. Cloudera Enterprise has evolved rapidly over the past few years as we’ve added new capabilities and corresponding subscriptions: RTD (Apache HBase or Apache Accumulo), RTQ (Impala), RTS (Cloudera Search), BDR (backup and disaster recovery), and Cloudera Navigator (for data management). We learned that customers wanted to incrementally adopt these into their own emerging visions of an enterprise data hub, but didn’t want to have to go through a separate procurement cycle each time. Others, either earlier in their big data journey or more self-sufficient, wanted a cost-effective way to leverage our trusted support team and management tools while leaving room for future expansion.

With customer input, and in alignment with our vision for delivering an enterprise data hub, we are pleased to offer a new, simplified product lineup for 2014, with just three straightforward editions within the Cloudera Enterprise family:

  • Data Hub Edition, which – as the name implies – provides everything customers need to build an enterprise data hub, ready to integrate into an existing environment. It includes unlimited supported use of the components in Cloudera Enterprise.
  • Flex Edition, for supporting dedicated mission-critical applications on Hadoop using just one of those components – for example, a real-time ad serving platform built on HBase.
  • Basic Edition, for customers who rely on Cloudera for Hadoop in production environments, yet need only simple batch processing and storage, at an economical price.

Every edition includes CDH (our 100% open source distribution, including Apache Hadoop), Cloudera’s unique proactive and predictive support, and advanced system management. A couple of other key updates:

  • Automated backup and disaster recovery is now included in every edition of Cloudera Enterprise. We just don’t think Hadoop makes sense in the enterprise without it.
  • In addition to a choice of 8×5 or 24×7 support, we now offer an additional premium support tier that delivers 24×7 support plus a guaranteed 15-minute time to first response for critical issues. As Hadoop has grown into mission-critical roles, this is something customers require, and we are pleased to offer it.

We’re also excited to add official support for Apache Spark, an open source, parallel data processing framework that complements Hadoop, making it easy to develop fast, unified applications that combine batch, streaming and interactive analytics. Spark is 10-100x faster than Hadoop MapReduce for data processing, and also enables easy development of stream processing applications for the Hadoop ecosystem. Cloudera is working closely with Databricks – the leading company behind the Spark project – to ensure Spark is deeply integrated with Hadoop, sharing common data, metadata, security and resource management. Stay tuned for more details in an upcoming series of blog posts.
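
For readers wondering what working with Spark actually looks like, here is a minimal, hypothetical PySpark sketch of a batch job over data already sitting in HDFS. The input path and the tab-separated log layout are assumptions invented for this illustration; the point is only to show the style of API Spark brings alongside MapReduce.

    # A minimal PySpark sketch of a batch job on Hadoop (illustrative only).
    # The HDFS paths and the tab-separated record layout are hypothetical.
    from pyspark import SparkContext

    sc = SparkContext(appName="ClickCountSketch")

    # Read raw click logs from HDFS, one record per line.
    clicks = sc.textFile("hdfs:///data/raw/clicks/2014-01-*")

    # Count clicks per page: parse, keep well-formed rows, map to (page, 1), reduce by key.
    counts = (clicks
              .map(lambda line: line.split("\t"))
              .filter(lambda fields: len(fields) > 2)     # drop malformed rows
              .map(lambda fields: (fields[2], 1))         # fields[2] assumed to hold the page URL
              .reduceByKey(lambda a, b: a + b))

    # Inspect the top pages interactively, then persist the full result back to HDFS.
    for page, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(page, n)
    counts.saveAsTextFile("hdfs:///data/derived/clicks_per_page")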

We believe the powerful combination of this new model for Cloudera Enterprise, together with the addition of support for Apache Spark, will help our customers more rapidly realize the value from their data, better manage risk and compliance, and control costs. You can find more information on our website, or contact our Sales or Support teams to discuss how this applies specifically to your organization.

Again, we look forward to working with our customers and community in 2014 as Hadoop continues to mature from a standalone silo into a central data management platform. It should be a great year for everyone.

Matt Brandwein is Director of Product Marketing at Cloudera.


Building a Hadoop Data Warehouse: Hadoop 101 for Enterprise Data Professionals – Dr. Ralph Kimball Answers Your Questions

Thank you for the phenomenal response to our recent webinar with Dr. Ralph Kimball: Building a Hadoop Data Warehouse: Hadoop 101 for Enterprise Data Professionals. Many of you submitted chat questions that we weren’t able to answer live. Happily, Dr. Kimball took the time to address the majority of these questions. Grab something to drink, sit back, and check out his answers below (there are quite a few!).

Next up: Join Dr. Ralph Kimball and our own Eli Collins from Cloudera on 5/29/14 for the next topic in the series: Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals.

If you weren’t able to join us live, check out the replay.

  1. Is DW in the future still going to target “decision makers” or far beyond the “elite” group?
    To be honest, I have never considered decision makers to be part of an elite group. Rather, I think anyone using information is a decision maker. This goes back to the very earliest pre-data warehouse systems, which were called Decision Support Systems.
  2. Outside of transactional systems (OLTP) do you think RDBMS will die out soon?
    I took some pains in the webinar to state my belief that RDBMSs are a huge, permanent tool for organizations. They will always maintain their place for querying, reporting, and analyzing relational data. But I think Hadoop takes a lot of pressure off these systems to operate on non-relational text and number data.
  3. Together, storage and processing are obviously of value. What happens when one wants to use storage only, or processing only? Does the value remain? What is it?
    One early use case for Hadoop and HDFS has been as an inexpensive archive for all forms of data. As I pointed out in the webinar, this story is even more interesting because this archive is “active”. When you say storage only, I assume that when you decide to use the data, you will transfer the data to some other platform. That works for me. Not quite sure what you mean by processing only. Maybe in this case, you temporarily transfer the data into Hadoop for processing either to offload other computational resources, or to take advantage of unique analytic applications that run in the Hadoop ecosystem. Then when you are done, you delete the data from Hadoop. Again, I could see that happening, but in this case, given the storage advantages of Hadoop, seems like you might not want to do the delete…
  4. What would be the implications for the Power Users Experience?
    I assume you mean analysts who have considerable computer experience, either with sophisticated Excel spreadsheets, or high end analytic tools such as SAS, R, or analytic extensions of RDBMSs such as MADlib in Impala or PostgreSQL. I think the short answer is that Hadoop is an ideal framework for supporting all these approaches. SAS, R, and MADlib have all been available in Hadoop for a while.
  5. What is meant by “Query engines can access HDFS files before ETL”? How is that possible?
    To expand that sentence, what I mean is that “Query engines can access HDFS files via their metadata descriptions in HCatalog without requiring the data to be transferred to a relational table as in a conventional DBMS”. Thus the step of physically transferring the data can be avoided, in this exploratory BI scenario. You still have to interpret the HCatalog description of the file in your query engine, which is basically defining a view. You could say that this view declaration is a form of ETL, and I would grant you that. (A minimal sketch of such a table definition appears after this Q&A list.)
  6. How about ELT?
    Also: We are spearheading SQL-push down ELT Data Integration on Hadoop, how do you see that going in future?
    ELT simply means performing data transformations using SQL through the query engine. Thus the SQL UPDATE and INSERT INTO statements are your main tools. This is the point where you sit down with the system reference manual for Hive, or Impala, or your relational query engine of choice, and carefully study how these commands are supported.
  7. ‘Familiar SQL’ Does this mean ANSI SQL?
    Hive and Impala support the bulk of ANSI standard SQL, but not all of the advanced commands. For example, I believe that the full semantics of common table expressions (CTEs) coupled with UNION ALL is not completely supported at this time. But each release of SQL support from these engines chips away at these edge cases with the goal of full ANSI support in the foreseeable future. By the way, not all of the legacy RDBMSs support the full semantics of ANSI SQL.
  8. “Exploratory DW/BI” sounds pretty much data-driven. Does this capture the nature of the big data analytics?
    The distinction between the two threads I described (exploratory DW/BI and high performance DW/BI) is in the performance of the queries, not the semantics of the analysis. In fact, in some cases the original unmodified data may be required for certain NoSQL or SemiSQL tools to analyze.
  9. In brief, what can a traditional DW do that a Hadoop-DW can not do?
    Also: He is saying this feeds EDW’s and DW’s, but not replace so what does the EDW and DW give you that Hadoop cannot easily?
    I have been asked many “can do versus cannot do” questions over the years, whether it is a particular modeling approach, or a row versus column DBMS, or a specific choice of hardware. And the answer, perhaps surprisingly, is that there is nothing that one environment can do that another cannot, in the strong sense. Certainly at this point in time, legacy EDWs are massively tuned and optimized, and have existing ETL pipelines for standard, familiar data. A Hadoop data warehouse cannot compete directly on performance for these legacy applications, although when cost is added to the equation, Hadoop may turn a lot of people’s heads. I did make the point in the webinar that Hadoop right now is doing a lot of things that conventional EDWs have not figured out how to do.
  10. Is it a good idea to store the data warehouse (FACT, DIMENSIONS) in hadoop and create OLAP/Cubes directly with Impala?
    Most OLAP/Cube deployments are (or should be) based on a star schema implementing dimensions and facts anyway. Your question is interesting and will be answered in detail in the second webinar from Kimball and Cloudera on May 29.
  11. Parquet columnar files – is this part of Hadoop? – haven’t seen it before…
    Yes indeed. Try Googling “Parquet Hadoop” for a number of good sources.
  12. How can you describe Hadoop more than just a distributed data warehouse?
    Stepping back from the webinar, where we deliberately narrowed our focus to familiar data warehouse entities, Hadoop is actually a comprehensive applications ecosystem that supports much more than the data warehouse mission. Many of its use cases, such as real-time data streaming and specialty applications, are architected with the data warehouse in mind. You might want to come to one of the big Hadoop conferences to get an idea of Hadoop’s breadth.
  13. Are the parquet columnar files separate physical files or just indexes?
    They are files, and in fact at the lowest level, each column is a file.
  14. Concept of Slowly Changing Dimension will go away in Hadoop DW?
    Also: Does Dimensional modeling still play a role in BigData?
    Also: What is the future of powerful dimensional modeling with Hadoop on board?
    Slowly Changing Dimensions (SCDs) are a fundamental approach to handling time variance in your entities (dimensions). Hadoop affects neither the requirement to track time variance nor the specific approaches, such as Type 1 and Type 2. Please tune in to the next Kimball-Cloudera webinar on May 29, where we will show you how to implement SCDs in Hadoop.
  15. Is this disruptive to the SDLC of a DW/BI system that involves semantic modeling and facts/dimensions?
    The system development lifecycle (SDLC) of a DW/BI system is unaffected at the block diagram and logical levels. Obviously differences show up at the physical and programming levels. Again, please tune in to the next Kimball-Cloudera webinar on May 29 where we will take you through some of these details.
  16. Is performance a key driver for Parquet files?
    Yes, performance is the primary motivator for Parquet. Parquet is also more efficient in terms of disk space, and supports nested structures, which “raw” file formats typically do not (JSON being the exception). Because Parquet is supported throughout the Hadoop ecosystem it may be more portable than the particular raw format that is used.
  17. I know HDFS is Immutable and no updates are supportable but is that any way that we can expect the AID transaction getting applicable to HDFS?” Is that some thing in Impala that will come up with Parquet?
    I don’t understand all of this question. Do you mean ACID transaction requirements which principally apply to OLTP systems and only indirectly to DW/BI systems? As to your second question, Impala works beautifully with Parquet files.
  18. Good morning Doctor Kimball, should we (DWH architect) forget about Star schema designs and jump on the Hadoop infrastructure?
    No! You will hear a powerful argument in the second Kimball-Cloudera webinar as to why star schemas and dimensional modeling are as important as ever in a Hadoop data warehouse.
  19. ETL tools that talk to traditional EDWs understand the Metadata and the Data Model underneath. With Hadoop integration, aren’t we pushing a structure / data model upfront around a layer which can have a data model defined later at data retrieval? In other words, how easy or complex is the ETL from a process perspective in the Hadoop environment?
    ETL tools can move data from point A to point B, or they can invoke complex processes to transform the data, or they can do both at once. The ETL tools would read the HCatalog metadata and use that as what you call the schema. There is a subtle point in all of this, whether it is a Hadoop DW or a conventional DW, and that is that the final query schemas are actually defined in the query layer after the ETL is all done, not the system table/metadata layer.
  20. Should we use ETL to copy data from HDFS to columnar files?
    Yes, this transformation step, which is a basic and integral part of Hadoop, is part of what I call ETL. I’m not sure this is anything more than labeling the step as ETL. (A sketch of this Parquet conversion appears after this Q&A list.)
  21. Can Dr Kimball present a use case of Hadoop to be used in Banking sector or any other sector that don’t see this huge array of data types (audio/video/gps etc)?
    Banking tends to be more data aware than perhaps you might realize. Hadoop invites a bank to store not only every transaction (trillions over time in a large bank) but a lot of “sub-transactional” data involving their customers’ life stages, behaviors, and preferences. Additionally, banks and other financial institutions look broadly at real time data to detect check kiting, broker collusion, fraud, identity violations, and hacking. Although some of these applications may be considered to be “production”, it then takes a data warehouse to make many of the decisions about what to do. Finally, I am sure you are aware that all the banks keep digital images of every check, and what do you suppose they do with the output of 10 or more surveillance cameras in every bank? And what about all the documentation supporting a mortgage? This list goes on…
  22. Data in parquet is duplicate/replication of original HDFS files?
    Parquet files can be created in two different basic ways. Either the Parquet file is a transformation of a separate original HDFS file, or the Parquet file itself is an original HDFS file because the data loader is commanded to create it directly on initial load.
  23. Can Hadoop totally replace usage of ETL tool like Informatica for loading data warehouse?
    No! Hadoop is not an ETL tool, it is an application ecosystem. If you are lucky enough to already have an ETL tool, then you can use your familiarity with it to move and transform data within Hadoop in ways similar to the conventional world.
  24. Are you planning a new edition of the ‘toolkit’ – can’t wait to get it if you are.
    Thanks for the encouragement!
  25. About ETL process: how can “merge statements” be realized (update existing data, insert new data), what are alternative recipes for this topic?
    Hive and Impala do not currently support the MERGE statement introduced in SQL:2003, though I expect they will add it. In the meantime, the same (more verbose) techniques from the pre-SQL:2003 world work. You can also use a programming language other than SQL to implement arbitrarily sophisticated merges. (An INSERT-based workaround is sketched after this Q&A list.)
  26. Where do we do dimensional modeling with Hadoop. If source data is going to come in as it is then are we saying that we are doing away with dimensional modeling?
    Dimensional modeling is as necessary in Hadoop as it is in the conventional RDBMS environment. Please tune into the second Kimball-Cloudera webinar on May 29, where we will address exactly this point.
  27. The loading utilities are going to take care of converting to the desired format and storing in HDFS file systems?
    Yes.
  28. How can HDFS integrate with CDC for incremental load from sources?
    As I teach in my ETL classes, change data capture (CDC) can take place at the original source of the data before it is transferred to HDFS, or it can be a rather typical ETL job entirely within HDFS where yesterday’s data is compared to today’s data and the differences detected. Maybe the short answer to this question is that the logic of using CDC techniques is not materially affected by having HDFS as the destination.
  29. Hi Sir, in DWH analysis end users need stable data for reporting, but in Hadoop data flows in continuously; how do we deal with this?
    Actually, Hadoop is in some ways more stable than other environments because once data is written to an HDFS file, it is immutable. HDFS files are write once. Yes, Hadoop is purpose-built to sustain very large transfer rates into HDFS, but it’s all additive. Nothing gets deleted.
  30. Dr. Kimball, I am still a little bit confused on the ETL perspective. You mention Informatica would be able to read through Sqoop. Would SSIS be able to leverage the same approach? In addition, in the ETL world we typically start by building our dimensions and related data. How is that done in this world?
    The benefit of Cloudera’s open source and open standards approach is that popular ETL tools can integrate over simple ODBC, extending existing ETL investments and skills to Hadoop. Cloudera provides a standard ODBC driver that tools such as SSIS can use. Please tune in to our next Kimball-Cloudera webinar on May 29 where we will describe building dimension tables and fact tables in detail.
  31. Is Sqoop for moving data from a relational DW to HDFS, or the reverse, or both ways?
    If I understand your question, yes, Sqoop is used to transfer data both into and out of HDFS to or from a remote RDBMS.
  32. Do you think from feasibility perspective that  traditional EDW compared to Hadoop is more focused on historical perspective and analysis. In other words are the best practices of implementing these different techniques/methods different in the way you want to handle/view the data?
    Frankly I think that the 1990s view that the data warehouse is only a library of historical data is way behind us. Shortly after the year 2000, operational users came flooding into the data warehouse with urgent, mostly real-time needs for watching processes and making decisions. Having said that, one of the side benefits of Hadoop and HDFS is that they have been built for extreme I/O performance. Rather than changing the whole paradigm of data warehousing, Hadoop will encourage some even more aggressive hybrid data warehouse applications where real-time data is compared to historical data. We do that a fair amount today in conventional data warehouses but maybe not as aggressively as we will be able to do it in Hadoop.
  33. An RDBMS approach typically needs upfront data analysis before proceeding with design or development. How different is it for Hadoop, can you speak to whether data can get registered in HCatalog without analyzing data structure, unique keys/identifiers, foreign keys, etc?
    You do not need to do upfront analysis – this is the idea behind “schema on read”. The best way to understand this is via the tutorial on how to create a table in Impala: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_sql.html?scroll=create_table_unique_1
  34. Dr. Kimball, Do you see Hadoop framework as part of Staging area?
    Yes, possibly. One use case for Hadoop is to treat it entirely as a staging area where the data is made ready for ultimate export to the conventional EDW. And of course, we usually describe the staging area also as an archive which we keep around for various purposes like system recovery. Since Hadoop is an ideal “active archive” it meets this requirement as well.
  35. We have a lot of business rules in our ETL tools today, so data is transformed before moving into a DW. If the data comes into Hadoop raw (write once, read many), where do we do the “transformation work” and store the rules? Will this be the BI tools now?
    Your ETL tools should continue to maintain the business rules. The only non-negotiable difference between HDFS and other file systems is that HDFS files are, as you point out, write once and read many. So you will need to tweak the low level logic within your ETL pipelines to replace table updates with HDFS file appends or replacements.
  36. Can we actually use Hadoop for Exploratory BI and Big Data Analytics and have the data outputs / results pushed into the EDW for traditional reports  / Self service, so that we leverage the capabilities of Hadoop and Traditional EDW frameworks?
    Yes. In the second Kimball-Cloudera webinar, we also hope to show a hybrid report created from combining queries from Hadoop and a conventional EDW.
  37. Does Dr. Kimball foresee a situation where there is a RDBMS between Hadoop and a BI tool?
    I don’t see the architecture that way. Hadoop is a general applications ecosystem, which contains a number of RDBMSs, like Hive, Impala, and others. But maybe you mean that Hadoop is being used as an ETL staging area, and the transformed tables are exported from Hadoop into an external RDBMS. That would fit your description.
  38. If a text file is brought into HDFS and then turned into a Parquet file, and the text file is later updated – say more records are appended to the original CSV file – is the Parquet file also updated (via a pointer)? Or do we need to figure out how to ensure that updates to the text file trigger an update of the corresponding Parquet file?
    To the best of my knowledge there is no such automatic trigger intrinsic to the Parquet file that would detect new data appended to the original text file. You would need to have a separate ETL application for this.
  39. What role do canonical models play in Hadoop?
    A canonical model is a standard model independent of a particular storage technology that allows data to be transferred between otherwise incompatible systems. Dimensional models are a type of canonical model because they can be deployed in quite different ways in specific RDBMSs. But more generally, XML and JSON payloads are widely used as canonical models, and Hadoop has extensive facilities for importing data in these formats.
  40. Don’t you think EDWs are obsolete with the volume, variety and velocity of data, especially if the value of the data is short-lived?
    As I said in the webinar, conventional RDBMSs will be with us forever as they are superbly good at being OLTP engines and query targets for text and number data. Also, there are billions of lines of code in the ETL processes and BI tools that work quite well now. Although the bigness of Big Data is impressive, it is less interesting than the variety. That is where Hadoop really makes a sustainable difference as I argued in the webinar. Not sure what to make of “short lived” data. I have found that if the archiving cost of such data drops toward zero, reasons arise for keeping it.
  41. What are BI recommended tools for Hadoop/Cloudera env?
    Most of your favorite BI tools have signed on to Hadoop a while ago, for the reasons I described in the webinar.
  42. Seems like a complete replacement for the Enterprise DW.  Only thing I saw left for Enterprise DW is perhaps conforming dimensions to push back into Hadoop DW.  Unless you are suggesting federation.
    Also: Eventually why can’t EDH replace EDW?
    Also: If the Cloudera solution utilizing Impala addresses the typical high-performance concerns that a traditional enterprise DW/BI solution would require, why do we say that the Hadoop/Cloudera stack is still complementary to traditional systems?
    I really and truly believe that the Enterprise DW will be with us forever, operating jointly with the Hadoop DW. As I stated in previous answers here, the Enterprise DW has very deep roots, and it is superbly good at being a query target for text and number data, especially for that data that is produced by an associated OLTP application.
  43. Can someone please define “Serious BI” ?
    It is the opposite of “Frivolous BI.”  :)
  44. What’s the point of introducing a DW on top of Hadoop?  Doesn’t that create an extra barrier between the data and the analysis?
    Hadoop is an applications ecosystem. One of the applications is the data warehouse, which serves a certain range of query and analysis functions. There is nothing about the Hadoop DW that gets in the way of doing analysis with other kinds of tools. Actually I made the point in the webinar that the data warehouse and other types of applications can simultaneously access the same data files, which is pretty mind boggling compared to other environments.
  45. When would the EDW 101 for Hadoop Professional be coming out? Also would it be offered online (since I am currently based out of India)?
    May 29, and of course it will be on-line, and will immediately be available on the Cloudera website for viewing at a convenient time.
  46. Is the Kimball group considering offering a new Hadoop 4 or 5 day class (like the classic Data Warehouse Toolkit class)?  Would this be with or through Cloudera?
    Also: Hi, Excellent webinar.  Will there Ralph Kimball Training on how to take advantage of the Hadoop technologies?
    A little early to answer these questions, but thanks for asking!
  47. What are the migration costs and implications?
    I assume you mean migration of the whole data warehouse infrastructure, rather than the much narrower question of just migrating data itself. The answer to the big question depends on whether at one extreme you are bringing up a new data source or at the other extreme you are attempting a plug compatible replacement of an Enterprise DW subject area with a Hadoop DW. You’ll need some consulting help to scope this answer.
  48. How are the goals of the mission/vision of data warehousing at the beginning of the session actually accomplished?
    The best reference (in my humble opinion) for achieving all these goals is the book “The Data Warehouse Lifecycle Toolkit, Second Edition” (Wiley, 2008), Kimball, Ross et al. Please see the book section on www.kimballgroup.com. If your question is the specific implementation of these goals in the Hadoop environment, please contact Cloudera.
  49. Can anyone provide a simple explanation of the difference between MPP and MapReduce?
    MapReduce is a specific programming framework that sits on a distributed set of processing nodes that can be described as a kind of massively parallel processing (MPP) system. But there are MPP relational databases like Teradata where the processors “share nothing” and are not based on the MapReduce framework. As the name suggests, MapReduce applications involve a Map step, where the requested job is divided into equal-sized processing chunks that are worked on in parallel, and then a Reduce phase, where the partial results are combined into the final answer.
  50. Is there value in Data Virtualization and Hadoop to expose data on Hadoop to the business?
    Good comment. In some ways, Exploratory BI as I described it in the webinar is an exercise in data virtualization. The data is exposed at the query level as if it were housed in relational tables, but the data is still in the original possibly raw files.
  51. Can we have integration with an existing RDBMS, running a conventional engine like Oracle?
    Yes, specifically at the BI layer where the BI tool separately fetches back answer sets from the Hadoop DW and the Oracle engine. We intend to show this in action with one of the major legacy RDBMSs in the second webinar on May 29.
  52. Is it possible to conform the data?
    Yes, we will show how this is done in Hadoop in the second Kimball-Cloudera webinar on May 29.
  53. Dimensional modeling typically leads to more joins – since disk is not at a premium, do you think more denormalized structures make sense in Hadoop?
    I interpret your question as suggesting that dimension content be denormalized into the fact tables in order to eliminate the joins. As a general approach this is impractical because all the wide verbose dimension content would need to be replicated endlessly in every fact record. Even in the Parquet columnar file format this would not be effective in my opinion. Also, think of the implications of fully denormalizing all the dimensions into all the target fact tables. Now the same data is distributed everywhere, and administration of such dimensions would break down.
  54. Is HADOOP more compatible with a Kimball style or Inmon type DW?
    I am somewhat partial to the Kimball approach myself.  :) Bill Inmon supports a centralized notion of data warehousing in which all data is brought to a normalized data warehouse that is generally off limits to business users until IT publishes an aggregated “data mart” in response to a user request. Seems like this would be hard to do in a Hadoop environment where HDFS files, metadata, and query engines are mixed and matched at analysis time by the particular analysts. The Hadoop DW must coexist and share files with other non-DW applications outside of either Kimball’s or Inmon’s purview. Just my opinion.
  55. Do you think the need for aggregate fact tables goes away with Hadoop and ability to query detail level data quickly?
    In some sort of perfect world where all computations ran infinitely fast, there would be no reason to build aggregate fact tables. But when an aggregate fact table that is 100 times smaller than the base table can answer a request, then in almost any system, it will be 100 times as fast. With the advent of in-memory processing, and with fact tables that are not too large (say less than a terabyte), it may not be worth the administrative overhead to build an aggregate fact table. Remember that certain changes in a dimension will force the aggregate fact table to be rebuilt.
  56. Please I really need an answer, is it a good idea to store the DW (Facts, Dimensions) In Hadoop and use BI tools with IMPALA?
    Yes, it’s a great idea. Please tune in to the next Kimball-Cloudera webinar where that is exactly what we will talk about.
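
To make a few of the answers above concrete – the table definition over raw HDFS files from item 5, the Parquet conversion from items 16 and 20, and the pre-MERGE workaround from item 25 – here is a hedged Python sketch that submits Hive/Impala-style SQL through the open source impyla client. The host name, HDFS path, and table layouts are assumptions invented for the example, not anything shown in the webinar.

    # Illustrative only: schema on read, Parquet conversion, and an INSERT-based merge.
    from impala.dbapi import connect

    conn = connect(host="impalad.example.com", port=21050)   # hypothetical Impala daemon
    cur = conn.cursor()

    # Item 5: a table definition layered over files already in HDFS -- no data movement.
    cur.execute("""
      CREATE EXTERNAL TABLE raw_orders (
        order_id BIGINT, customer_id BIGINT, amount DOUBLE, order_ts STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
      LOCATION '/data/landing/orders'
    """)

    # Items 16 and 20: an ETL step that materializes the same data as Parquet for performance.
    cur.execute("CREATE TABLE orders_parquet STORED AS PARQUET AS SELECT * FROM raw_orders")

    # Item 25: without MERGE, new rows are applied with a plain INSERT ... SELECT,
    # here appending only the order_ids not already present in the target.
    cur.execute("""
      INSERT INTO orders_parquet
      SELECT r.* FROM raw_orders r
      LEFT JOIN orders_parquet p ON r.order_id = p.order_id
      WHERE p.order_id IS NULL
    """)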

 

Try Hadoop Online with Cloudera Live

Apache Hadoop has quickly evolved beyond simple storage and batch processing to enable customers to interact with all their data in diverse ways – including interactive SQL, enterprise search, advanced analytics, and using a variety of 3rd party partner solutions – on a common open source data platform. This is very powerful – and is changing an industry – but how do you actually use such a platform? What does it look like?

While Cloudera has long provided a complete, self-contained, pre-configured virtual machine, many users simply want to quickly experience Hadoop and its related projects but don’t want to bother with a large download or the system overhead of running a virtual machine on their laptops.

Today we are pleased to announce the initial availability of Cloudera Live (beta), a new way to get started with Apache Hadoop online, in seconds. No downloads, no installations, no waiting.

Check it out at cloudera.com/live.

In our first release you can instantly explore and try demos across the full breadth of CDH 5, Cloudera’s completely open source Hadoop platform. You will have up to three hours at a time to interact with our sample data sets using Hue, the Hadoop user interface developed by Cloudera and now supported in some form by every major Hadoop distribution. With Cloudera Live, you get to use the very latest version of Hue, which includes support for Cloudera Impala, Apache Spark, Apache ZooKeeper, Apache HBase, Apache Sqoop, and Cloudera Search (powered by Apache Solr), along with nice additions like a new editor for Apache Pig. We’ve provided short tutorials to help you quickly get started creating tables, writing real-time SQL queries, or simply searching and navigating through data.

We’re very excited to provide this early access to Cloudera Live and would love to hear your feedback. When you’re ready to take the next steps on your journey to Hadoop, check out our comprehensive online resources or visit Cloudera University to find the path that’s right for you.

Happy exploring, and enjoy!

 

The Next Generation of Analytics

Last year, Mike Olson took the stage at Strata + Hadoop World and announced that “Hadoop will disappear.” Not literally, of course. Hadoop is emerging as the platform powering the next generation of analytics. But to deliver on its promise as an enterprise data hub – one platform for unlimited data, creating value in a variety of ways – it has to enable business applications. Hadoop cannot remain an exclusive, specialized skill. Like the relational database management systems that power most of the online world today, Hadoop must recede into the background. It must evolve.

Over the past decade, the Apache Hadoop community has worked at a furious pace to realize this vision. We have seen tremendous progress, as Hadoop has transformed from a monolithic storage and batch architecture, to a modern, modular data platform. Three years ago, Hadoop became interactive for data discovery through analytic SQL engines like Impala. Two years ago, Cloudera was the first to adopt and support Apache Spark within the Hadoop ecosystem as the next-generation data processing layer for a variety of batch and streaming workloads, delivering ease of use and increased performance for developers.

But there is still more to do.

This week at Strata + Hadoop World, we are pleased to announce three new open source investments to directly address some of the most fundamental challenges our customers face: Improving Spark for the enterprise, making security universal across Hadoop, and developing a fundamentally new approach to Hadoop storage for modern analytic applications.

Better Data Engineering: Spark and the One Platform Initiative

Before we can even begin to discuss analytics, we need to address data engineering, the foundational role of the next generation of analytics. Data engineers are generally responsible for designing and building the data infrastructure, in collaboration with the data science team. Spark’s meteoric rise in popularity owes much to the ease of use, flexibility, and performance that are critical for good data engineering. Of course, in addition to data processing, applications also need ways to ingest, store, and serve data, and enterprise teams need tools for operations, security, and data governance. This is why Spark is such a natural complement to the comprehensive Hadoop ecosystem.

Over the last 18 months, over 150 Cloudera customers have deployed Spark workloads on Hadoop in production, across industries and for multiple use cases. We have seen firsthand where Spark succeeds, and where it still needs work. This enterprise experience, coupled with our deep bench of Spark committers – more than all other Hadoop vendors together – and broad participation in the Hadoop community, uniquely positions Cloudera to drive Spark and Hadoop forward, together.

To formalize this commitment, Cloudera recently launched the One Platform Initiative to accelerate Spark development for the enterprise and to better integrate it with the Hadoop ecosystem. Our focus will be on the areas where we’ve seen the greatest customer need: management, security, scale, and stream processing. And one of the most critical places to start is security.

Comprehensive Security: Unified Access Control Policy Enforcement

The ability to access unlimited data in a variety of ways is one of Hadoop’s defining characteristics. By moving beyond MapReduce, users with more diverse skills can gain value from data. Complex application architectures that required many separate systems for data preparation, staging, discovery, modeling, and operational deployment can be consolidated into a single end-to-end workflow on a common platform. Of course, this flexibility must be balanced with security requirements. To ensure that sensitive data cannot fall into the wrong hands, a comprehensive security approach must ensure that every access path to data respects policy in the same way, down to the most granular level.

However, the reality today is that each access engine handles security differently. For example, Impala and Apache Hive offer row- and column-based access controls, with shared policy definitions through Apache Sentry. On the other hand, Spark and MapReduce support only file- or table-level controls. This fragmentation forces a reliance on the lowest common denominator (coarse-grained permissions), resulting in several bad outcomes: limits on data or access; security silos; or, worse, inconsistent policy due to human error in policy replication. Ultimately, it constrains the types of applications you can build.

To address this need, Cloudera is excited to announce RecordService, the first unified role-based policy enforcement system for the Hadoop ecosystem. Coupled with Apache Sentry, the existing open standard for policy management, RecordService brings database-style row and column level controls to every Hadoop access engine, even non-relational technologies like Spark and MapReduce. It works with multiple storage technologies, from HDFS to Amazon S3, so your security team doesn’t have to worry about differences in data representation. By providing a common API for policy-compliant data access, it helps you integrate third party products into your Hadoop cluster with trust. It even provides dynamic data masking for the first time in Hadoop, everywhere.
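
For context, the policies RecordService enforces are the role-based definitions already managed through Apache Sentry. Below is a hedged sketch of what those definitions look like when issued as SQL against a Sentry-enabled Impala or Hive service, here via the open source impyla client; the role, group, and table names are invented for illustration, and RecordService’s own interfaces are not shown.

    # Illustrative Sentry-style grants (requires a Sentry-enabled Impala/Hive service
    # and admin privileges); all names are hypothetical.
    from impala.dbapi import connect

    cur = connect(host="impalad.example.com", port=21050).cursor()

    cur.execute("CREATE ROLE analyst_role")
    cur.execute("GRANT ROLE analyst_role TO GROUP analysts")
    # Analysts may read the curated table but get no access to the raw landing data.
    cur.execute("GRANT SELECT ON TABLE sales.transactions_curated TO ROLE analyst_role")
    # The promise of RecordService is that this one policy is honored whether the data
    # is read through Impala, Hive, Spark, or MapReduce.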

RecordService is here to help more users gain value from data, securely, using their tools of choice. Next, we need to focus on a more fundamental problem: How we store data for the next generation of analytics.

Fast Analytics on Fast Data

The next generation of applications built on Hadoop is becoming more real-time, collapsing the distance between data collection, insight, and action. In the best case, analytical models are embedded right in the operational application, directly influencing business outcomes as users interact with them. Or consider a simpler case, an operational dashboard, which requires the ability to integrate data and immediately analyze it.

It turns out that this is pretty hard in Hadoop today, and it’s because of storage constraints concerning updates. Users face an early choice: Do I pick HDFS, which offers high-throughput reads — great for analytics — but no ability to update files, or Apache HBase, which offers low-latency updates — great for operational applications — but very poor analytics performance? Often the result is a complex hybrid of the two, with HBase for updates and periodic syncs to HDFS for analytics. This is painful for a few reasons:

  • You need to maintain data pipelines to move data and keep the storage systems synchronized.
  • You are storing the same data multiple times, which increases TCO.
  • There is latency between when data arrives and when you can analyze it.
  • Once data is written to HDFS, if you need to correct it for any reason, you’ll need to rewrite it (remember, no updates).

Over the past three years, Cloudera has been hard at work solving this problem. The result is Kudu, a new mutable columnar storage engine for Hadoop that offers the powerful combination of low-latency random updates and high-throughput analytics. This powerful combination enables real-time analytic applications on a single storage layer, eliminating the need for complex architectures. Designed in collaboration with Intel, Kudu is architected to take advantage of future processor and memory technologies. For the first time, Hadoop can deliver fast analytics on fast data.
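
As a rough illustration of the pattern Kudu enables, the sketch below creates a mutable table, applies an in-place update, and immediately runs an analytic scan over the same data, using Impala SQL submitted through the impyla client. The CREATE TABLE and UPSERT syntax shown follows later Impala releases with Kudu integration, and the host and schema are assumptions; treat it as a sketch of the workflow rather than the exact interface available at announcement time.

    # Illustrative only: one table that serves both low-latency updates and fast scans.
    from impala.dbapi import connect

    cur = connect(host="impalad.example.com", port=21050).cursor()

    # A mutable, columnar table: a single copy of the data for updates and analytics.
    cur.execute("""
      CREATE TABLE metrics (
        sensor_id BIGINT, ts BIGINT, reading DOUBLE,
        PRIMARY KEY (sensor_id, ts))
      PARTITION BY HASH (sensor_id) PARTITIONS 16
      STORED AS KUDU
    """)

    # Late-arriving corrections are applied in place -- no rewriting of HDFS files.
    cur.execute("UPSERT INTO metrics VALUES (42, 1443657600, 21.7)")

    # The same table serves analytic scans immediately.
    cur.execute("SELECT sensor_id, AVG(reading) FROM metrics GROUP BY sensor_id")
    print(cur.fetchall())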

Looking Ahead

Hadoop has come a long way in its first 10 years. As Matt Aslett of 451 Research recently summarized, “Hadoop has evolved from a batch processing engine to encompass a set of replaceable components in a wider distributed data-processing ecosystem that includes in-memory processing and high-performance SQL analytics. Naturally Hadoop’s storage options are also evolving and with Kudu, Cloudera is providing a Hadoop-native in-memory store designed specifically to support real-time use-cases and complement existing HBase and HDFS-based workloads.”

And this is just the beginning. With Spark as the new data processing foundation, a new unified security layer, and a new storage engine for simplified real-time analytic applications, Hadoop is ready for its next phase: Powering the next generation of analytics.

We’re excited to work with our customers and the community to continue advancing what’s possible.

The Future of Data Warehousing: ETL Will Never be the Same

Last Monday, over 2,200 of you tuned in live for our webinar “The Future of Data Warehousing: ETL Will Never be the Same”, in which Dr. Ralph Kimball and Manish Vipani, VP and Chief Architect of Enterprise Architecture at Kaiser Permanente, engaged in a lively discussion of how Apache Hadoop is enabling the evolution of modern ETL and data warehousing practices.

(If you’re new to this series, be sure to first check out our other webinars with Dr. Kimball, in which he offered an introduction to Hadoop concepts for data warehouse professionals, and also an overview of data warehouse best practices for Hadoop developers.)

Picking up where those previous webinars left off, Dr. Kimball reiterated why Hadoop is having such an impact on traditional data warehousing environments: By modernizing the “back room” of data collection and preparation, it can rapidly open the door to more data, more users, and more diverse analytic perspectives than ever before possible. Hadoop’s modular, scalable architecture presents several attractive benefits, including:

  • Performance – Meet SLAs with highly parallel distributed computing frameworks such as Apache Spark.
  • Scalability – Keep unlimited data online with predictable, linear growth.
  • Diversity – Handle any kind of structured or unstructured data without having to predefine a schema, through Hadoop’s signature “schema-on-read” capability.
  • Low Cost – Offload processing workloads or archival data from an existing data warehouse, mainframe, etc.
  • Flexibility – Go beyond SQL, using multiple programming languages and alternative techniques, such as full-text search using Apache Solr.

We put it to the audience: Which of these matters to you? Not surprisingly, the answer turns out to be “all of the above”!

Despite the benefits, over 50% of our audience had not yet begun their journey to Hadoop, but they were eager to learn.

Fortunately, Manish was willing to share how Kaiser Permanente overcame some of the common barriers that can hold organizations back from becoming data-driven. By adopting a pragmatic, incremental approach to adopting Hadoop as their unified “landing zone” — and by using some of the advanced security and governance capabilities of tools such as Cloudera Navigator to meet their privacy and compliance requirements — the team was able to quickly roll out over 10 use cases across diverse data sets and lines of business.

Next Up: Q&A

During the webinar many of you submitted questions — over 250! — that we were unable to answer live. Thankfully, Dr. Ralph Kimball and Manish Vipani have taken the time to address most of them. Stay tuned over the next week as we share their responses! And again, if you missed the live broadcast, you can catch the replay here.

Ralph Kimball and Kaiser Permanente: Q&A Part I – Hadoop and the Data Warehouse

In a recent Cloudera webinar, “The Future of Data Warehousing: ETL Will Never be the Same”, Dr. Ralph Kimball, data warehousing / business intelligence thought leader and evangelist for dimensional modeling, and Manish Vipani, VP and Chief Architect of Enterprise Architecture at Kaiser Permanente, outlined the benefits of Hadoop for modernizing the ETL “back room” of a data warehouse environment, and beyond.

Since then Dr. Kimball, the team from Kaiser Permanente, and a few friends from Cloudera have taken time to answer many of the over 250 questions asked in the live chat.

In this first of two Q&A posts, we’ll explore the future of Hadoop and the data warehouse, and matters of data modeling and data governance. Enjoy!

Key:

  • RK = Dr. Ralph Kimball
  • KP = The Kaiser Permanente team: Manish Vipani, VP and Chief Architect of Enterprise Architecture; Rajiv Synghal, Chief Architect, Big Data Strategy; and Ramanbir Jaj, Data Architect, Big Data Strategy

Hadoop and the Enterprise Data Warehouse

Q: How does the “Landing Zone” differ from the “Data Lake” concept?
RK: The Landing Zone is a specific area of the DWH with subzones intended for specific purposes and clients. The Data Lake is a term thought up by the media because they don’t know how to build a DWH.

Q: Does it make sense to consider Hadoop as an ETL solution even when all your source data comes from OLTP databases?
RK: Yes. To me the point of the Landing Zone and performing ETL workloads in Hadoop is to take advantage of the speed, flexibility, and cost advantages of the Hadoop environment. So taking that point of view, you want to put as much ETL work as you can into Hadoop. Even if your data consists of familiar text and numbers in a relational format, Hadoop is a cost-effective environment for offloading your ETL transformations from the EDW and then uploading the results when you are done. Besides Hadoop SQL clients such as the open source Impala and Hive SQL engines, there are multiple proprietary SQL engines. The major ETL tools also play effectively in Hadoop: Informatica PowerCenter jobs, for example, look essentially identical in the Hadoop environment and have deep hooks into it.

Q: Is the data warehouse also in Hadoop, or is it on an RDBMS?
RK: You can do full traditional data warehousing on Hadoop. Please see our first webinar, “Hadoop 101 for Data Warehouse Professionals.” In general the DW can be either in Hadoop or in the conventional EDW, or both. The Landing Zone doesn’t mandate an answer to your question, as it is still the “back room”, and data in general is exported to traditional clients like the DWH, wherever it is.
KP: Hadoop is a cost-effective way for offloading ETL transformations from the EDW or OLTP, and then uploading the results when done. Our traditional RDBMS (OLTP systems) will still exist for now.

Q: Big data platforms seem to be focused on analytics, but a traditional DW will still be the source for 90% of reporting. Do you agree?
RK: The traditional DW will always be the main source for reporting off of OLTP databases. Hadoop will play several important roles, however, including offloading that reporting purely for the cost advantages, and handling those forms of ETL on data sources that are not just text and numbers.

Q: In the near future, are we going to see real time reports — as soon as the data is fed into the operational data sources — without need for a data warehouse?
RK: There is no such thing as a free lunch. Schema on Read still requires all the logic for building the “views” that implement fact tables and dimension tables, surrogate keys, slowly changing dimensions, and conformed dimensions. Many data cleaning, data matching, and foreign-key to primary-key structures must still rely on physical lookup tables because the detailed logic is too lengthy to embed in a query by itself.

Q: With sufficient scale in the Landing Zone, should we be able to automate a great deal of traditional ETL?
RK: Absolutely. Schema on Read allows the developer to query and process data from original diverse sources without physically rehosting the data. But like any data virtualization, the downside is that Schema on Read may be slow, and that once a virtual ETL pipeline is built and validated, the developer may then re-implement the transformations in the form of physical tables in order to gain the best performance.

Q: Are we at the same stage with Hadoop that we were at with RDBMSs in the 90s?
RK: There are some similarities with the RDBMS marketplace of the 90s but some huge differences. The similarities I mentioned in the webinar include end user departments building stovepipes by themselves without IT involvement. But the differences include much more powerful processing and analytic tools, greatly expanded data types, the ability to handle huge volumes of data, and the ability to stream data into Hadoop at enormous speeds. These last two were really serious limitations in the early days of data warehousing.

Q: What is the future of RDBMS for OLAP?
RK: Relational databases, and hence SQL, will remain the dominant API for business intelligence. The OLAP world is still fragmented by comparison and the original performance advantages of OLAP tools are not so obvious any more. If you mean by OLAP the new non text-and-numbers kinds of data, then I would agree that RDBMSs will not be able to address these new data types.

Q: What forces data export out of HDFS to DM/DW? What capabilities does the Hadoop ecosystem lack today that forces moving data out?
RK: In my opinion, high performance transaction processing has not been a top level goal of the Apache Hadoop project. Where Hadoop complements high performance OLTP is offloading the ETL work from these critical machines.

Data Modeling

Q: What is your range of data sources, e.g. structured vs. unstructured?
KP: Currently it is about 95% structured and 5% unstructured. We are working on bringing in voice and image data, which will change the mix to around 80% structured and 20% unstructured.

Q: How should we do data modeling in the new “back room”?
RK: The whole point of the Landing Zone is that unchanged raw data can co-exist with highly curated data which is intended to be exported or analyzed by very specific clients. Manish did a great job of describing the various parts of the Landing Zone that illustrate the flexibility. There is no single mandated data model for the Landing Zone. The metadata description in HCatalog of a source is open to all clients, who are then free to interpret the contents for their own purposes, including extracting data elements that an RDBMS does not know how to process.

Q: For warehousing in Hadoop, I keep hearing, “don’t model stars, don’t normalize.” So what is the warehousing approach for Hadoop?
RK: Star schema modeling is still the fundamental API for data warehousing in Hadoop. The subtle point is that Hadoop provides two approaches to exposing this API. The first is through Schema on Read which trades off up front ETL data movement for more computation at query time. The second is the familiar building of high performance (e.g., Parquet) physical tables in advance of actual querying. Either way, the DW BI tools see the same schemas. The first approach is immediate but slow, and the second approach is deferred but fast.

Q: Is this a compromise of traditional DWH principles?
RK: The traditional DWH architecture remains perfectly intact: fact tables, dimension tables, surrogate keys, slowly changing dimensions, conformed dimensions. But where we process and build these structures can be changed, or we can just implement the usual ETL physical transformation steps in a more cost effective way. Please see the first two Cloudera webinars I did last year that try to nail these issues.
Cloudera: See “Building a Hadoop Data Warehouse: Hadoop 101 for EDW Professionals” and “Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals”.

Q: But do we see conformed dimension and facts as being dead?
RK: Absolutely not! Conformed dimensions are the only practical way to implement integration across diverse data sources, whether we are in the EDW or in Hadoop. This simply means that the dimensional attributes (or equivalently “descriptors”) attached to separate data sources must be drawn from the same domains and have the same underlying semantics. The data sources don’t need to be relational, but if you want to tie them together, you need the whole story of conformed dimensions.

Q: If the physical transformations are optional and without doing the cleaning and normalization, how do you integrate data from different enterprise systems in the new “back room”?
RK: One of my favorite questions. We dealt with this issue of normalization extensively in the second Cloudera webinar I did with Eli Collins. Please watch this webinar. But there is a subtle point in the Schema on Read story. You still must do the necessary transformation steps to use the data, but in the extreme case, you do it by computing what is essentially a complex view declaration at query time. When you decide that this is too slow, you revert to traditional ETL processing, such as loading Parquet columnar data targets. Finally, one more point: no matter what your data source or your processing environment (Hadoop or EDW), you still have to maintain a correspondence table between individual source natural keys and your data warehouse surrogate keys. You can’t avoid this step if you expect to integrate data sources into the data warehouse.
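
As a hedged illustration of that correspondence table, the sketch below (Hive/Impala SQL via the impyla client) maintains a key map from source natural keys to warehouse surrogate keys and uses it to swap keys during a fact load. Every table and column name is an assumption made up for the example.

    # Illustrative only: a natural-key-to-surrogate-key map used during a fact load.
    from impala.dbapi import connect

    cur = connect(host="impalad.example.com", port=21050).cursor()

    # The correspondence (key map) table is maintained by the ETL process.
    cur.execute("""
      CREATE TABLE IF NOT EXISTS customer_key_map (
        source_system STRING, natural_key STRING, customer_sk BIGINT)
      STORED AS PARQUET
    """)

    # Incoming facts carry natural keys; the load step swaps in the surrogate key.
    cur.execute("""
      INSERT INTO sales_fact
      SELECT m.customer_sk, s.order_ts, s.amount
      FROM staged_sales s
      JOIN customer_key_map m
        ON m.source_system = s.source_system
       AND m.natural_key = s.customer_natural_key
    """)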

Q: At some point in time in the processing of any data one has to know the schema – why be deliberately lazy with schema?
Cloudera: Because it allows data to be loaded at any time, and schemas to be quickly iterated upon.

Q: With schema-on-read methodology, it seems to be up to the data scientist to determine how they wish to consume the data. Yet I’ve been reading a lot of articles where they are now spending 80% or more of their time trying to understand and prepare data, and less than 20% of their time actually doing analytics. What are your thoughts and insights to this?
Cloudera: Strongly typed schemas can be defined by cluster admins or data owners, exposing SQL tables to a broad range of users. This is in fact the most common way data is exposed. In some cases, end users can get involved in defining the schemas, as you suggest.

Q: We’re trying to build star-schemas with Hive, Impala, etc. What will be the challenges?
RK: The challenge for star schemas in a shared-nothing environment like Hadoop or Teradata is processing joins. Look to Impala in particular for techniques that make join-intensive schemas work well.
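
Impala chooses join strategies automatically (for example, broadcasting small dimension tables to every node). As a rough, hypothetical PySpark analogue of the same idea:

    # Shared-nothing star join: broadcast the small dimension tables so the large
    # fact table never has to be shuffled across the network (hypothetical names).
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("star-join").getOrCreate()

    fact_sales  = spark.read.parquet("/refined/fact_sales/")      # large fact table
    dim_date    = spark.read.parquet("/refined/dim_date/")        # small dimension
    dim_product = spark.read.parquet("/refined/dim_product/")     # small dimension

    monthly_revenue = (fact_sales
                       .join(broadcast(dim_date), "date_key")
                       .join(broadcast(dim_product), "product_key")
                       .groupBy("calendar_month", "category")
                       .agg(F.sum("amount").alias("revenue")))
    monthly_revenue.show()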

Q: How do you add primary key identifiers across a distributed environment when you want to move from the open-access data to the controlled, final data?
RK: Assigning unique keys across a profoundly shared-nothing environment is a hard problem. Probably the most straightforward way is to partition the keyspace in advance so that each node has its own key range.
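
A minimal sketch of the partition-the-keyspace idea, in PySpark with hypothetical names (Spark's built-in monotonically_increasing_id uses the same trick, reserving the high bits of each 64-bit key for the partition ID):

    # Assign surrogate keys without cross-node coordination: each partition owns
    # its own disjoint key range (hypothetical paths and column names).
    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("keyspace-partitioning").getOrCreate()
    new_members = spark.read.parquet("/landing/staged/new_customers/")

    KEYS_PER_PARTITION = 1000000000      # each partition owns a one-billion-key range

    tagged = new_members.withColumn("pid", F.spark_partition_id())
    w = Window.partitionBy("pid").orderBy("natural_key")

    keyed = tagged.withColumn(
        "customer_key",
        F.col("pid").cast("long") * KEYS_PER_PARTITION + F.row_number().over(w)
    ).drop("pid")

    keyed.show()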

Q: It seems like the materialized Parquet views may help with speed, but how do we control how measures are aggregated as the user moves up a dimension hierarchy, like we can do with traditional cubes and MDX? This would especially seem to be an issue with non-additive and semi-additive facts.
RK: The issues with building Parquet files at an atomic level or at an aggregated level are exactly the same as building such files in a conventional EDW. Parquet is simply the Hadoop equivalent of a columnar data store.

Q: How do you handle SCD Type II in Hadoop?
RK: There are two fundamental steps for handling time variance in dimensions (mainly SCD2). The first is to capture the change in the dimension member, and the second is to append a new record to the dimension with a newly assigned surrogate key. The brute-force way to do this in Hadoop has been to write a new dimension table each time you perform this append operation. Until now, the fundamental rule in Hadoop has been that an HDFS file cannot be updated. Taken literally, this makes some steps awkward because a file has to be rewritten to perform the update. However, Kudu, recently announced by Cloudera, directly addresses this issue. Please study the Kudu announcement to visualize how this aspect of the Landing Zone will be made easier.
Cloudera: Read more about Kudu here.
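
For concreteness, here is a condensed, hypothetical PySpark sketch of the brute-force approach described above: detect the changed members, expire their current rows, append new rows with freshly assigned surrogate keys, and rewrite the whole dimension file. All paths and columns are made up, and the schemas are assumed to line up.

    # Brute-force SCD Type 2 on HDFS: since files cannot be updated in place,
    # build the new version of the dimension and write it out as a new table.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("scd2-rewrite").getOrCreate()

    dim   = spark.read.parquet("/refined/dim_customer/")               # current dimension
    stage = spark.read.parquet("/landing/staged/customer_changes/")    # changed members

    changed_keys = stage.select("natural_key").distinct()

    # 1. Keep unaffected members and the existing history of changed members.
    untouched = dim.join(changed_keys, "natural_key", "left_anti")
    history   = dim.join(changed_keys, "natural_key").filter(~F.col("is_current"))

    # 2. Expire the currently active rows of changed members.
    expired = (dim.join(changed_keys, "natural_key").filter(F.col("is_current"))
               .withColumn("is_current", F.lit(False))
               .withColumn("row_end_date", F.current_date()))

    # 3. Append new versions with newly assigned surrogate keys.
    max_key = dim.agg(F.max("customer_key")).first()[0] or 0
    new_rows = (stage
                .withColumn("customer_key", F.monotonically_increasing_id() + F.lit(max_key + 1))
                .withColumn("is_current", F.lit(True))
                .withColumn("row_start_date", F.current_date())
                .withColumn("row_end_date", F.lit(None).cast("date")))

    # 4. Rewrite the dimension as a brand-new file set (the step Kudu makes unnecessary).
    new_dim = untouched.unionByName(history).unionByName(expired).unionByName(new_rows)
    new_dim.write.mode("overwrite").parquet("/refined/dim_customer_new/")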

Q: How do we implement surrogate keys in Hadoop?
RK: The brute-force solution in Hadoop until now has been to implement the dimensional surrogate keys in the EDW and export them to Hadoop. Then implement a key correspondence table for the fact keys and use it at query time in Hadoop, or else rewrite the fact tables to add the surrogate foreign key. I am excited by the newly announced capabilities of Kudu, however, which may change this whole game, since Kudu tables can be modified.
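
A small, hypothetical PySpark sketch of that query-time key correspondence pattern (the fact rows keep their source natural keys; a key map translates them to surrogate keys when joining to the dimension):

    # Resolve surrogate keys at query time via a key correspondence table
    # (hypothetical paths and column names).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("key-map-lookup").getOrCreate()

    fact_raw = spark.read.parquet("/landing/raw/claims/")            # carries natural_key only
    key_map  = spark.read.parquet("/refined/customer_key_map/")      # natural_key -> customer_key
    dim      = spark.read.parquet("/refined/dim_customer/")

    resolved = (fact_raw
                .join(key_map, "natural_key")      # translate natural key to surrogate key
                .join(dim, "customer_key"))        # then join to the dimension as usual

    resolved.groupBy("region").agg(F.count(F.lit(1)).alias("claim_count")).show()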

Data Governance

Q: How does this approach address the prevailing problems associated with multiple data silos – e.g. multiple customer files that are similar but not exactly the same?
RK: The fundamental problem of incompatible data sources does not magically disappear with Hadoop. At the root, you still need a well-maintained correspondence table between each source and the corresponding master dimension table describing the entity (say, Employee or Customer), and you need an ETL step (virtual or physical) that updates this master table with programmed business rules. We aren't saying that all users are being invited into the back room (remember the hot liquids and sharp knives kitchen metaphor). We are carefully vetting certain advanced users who aren't being served by the traditional ETL processes and RDBMS targets. The traditional business users may well not see a big difference.

Q: Have there been problems with inconsistencies in final results between, say, a model built by a data scientist, as compared to an executive running a query with simple aggregations, on the same subject?
RK: This is a variation on the theme of incompatible stovepipes. We want the data scientists to explore new approaches and probably get somewhat different looking results. But when the data scientist’s results go mainstream, then the hard work begins where IT and the data scientist need to reconcile the different approaches. This is a serious issue, since senior decision makers lose confidence when these differences can’t be explained.

Q: Do you see any concerns about an exploding multitude of versions of the same data created by different users in the "User Defined Space"?
KP: Yes, but data in the User Defined Space is a scratchpad mostly used for people to play, and it is an ungoverned space. Ultimately, things have to go through the Raw Zone and Refined Zone under controlled processes, enabling us to have a single version of the data.

Q: What are the criteria for deciding whether a personal data feed should be standardized for general consumption? What are the cost/benefit criteria?
RK: This sounds like the decision as to whether to take a data scientist’s experiment or prototype and put it in the mainstream. It is at that point that IT must be involved at least to certify the data scientist’s application.

Q: With all this data coming in from many places, how do you ensure quality and consistency?
RK: Well, of course, there is usually no guarantee of consistency at the original sources. No matter where you do the processing (EDW, conventional ETL in Hadoop, schema on read in Hadoop), you can't avoid the heavy lifting of cleaning and conforming. Having said that, there is a class of sophisticated data consumers (not traditional business users) who insist on access to the original "dirty" data. Yes, I know that is frustrating to us DW folks, but these needs are, in carefully managed cases, legitimate.
Cloudera: Consistency is managed by validation routines and processes run before data is promoted from the Raw Zone to the Refined Zone.

Q: Does Hadoop take care of transforming and cleansing data?
RK: Hadoop is a general purpose applications environment typically coupled with the Hadoop Distributed File System (HDFS). Data is ingested with tools like Sqoop and Flume, and the data is often stored in raw form. Many programming languages and application development systems are available to analyze and transform the data. Thus Hadoop itself does not transform or cleanse data.
Cloudera: Hadoop itself is a platform. Users may write code or scripts to manage ETL, or use a third-party tool to make the process even easier.

Q: Do you collect, manage and share the metadata around all these disparate data sources to drive reliable analysis?
KP: Yes, the users have information about metadata specific to their use cases. Also they may choose to publish that for other use cases. We are working on some data wrangling and profiling tools to further enrich our data.
RK: One of the strengths of the Hadoop environment is that there is a central metadata source describing file structures, namely HCatalog. But I think you are tapping into a bigger and more profound issue, which is the totality of metadata, where it is stored, and whether there are universal formats and standards. This is a huge work in progress in the Hadoop community and at Cloudera. I think I mentioned in the webinar that I will be following these issues and hope to do a webinar in 2016 describing some significant advances, and maybe making some recommendations, too!

Q: How do you handle testing and data validation, for example checking for bad data?
RK: This has always been a classic data import/ETL issue. Ultimately you need to tie back to the source for record counts and metric totals to do sanity checks on whether you got everything. The basic architecture and requirements for testing and validation remain unchanged, keeping in mind the flexibility we are offering Landing Zone users, ranging from data scientists who don't want us touching their data to business users who expect the same standards of trustworthiness that they have always had.
KP: We validate routines and processes, and do refinements, before ingesting data into the Raw Zone.
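
As a simple, hedged example of tying back to the source, the sketch below compares row counts and a metric total between a source extract and what landed in the Raw Zone (all file paths and column names are hypothetical):

    # Sanity-check an ingest by reconciling record counts and metric totals
    # against the source extract (hypothetical paths and columns).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ingest-reconciliation").getOrCreate()

    source_extract = (spark.read.option("header", "true")
                      .option("inferSchema", "true")
                      .csv("/landing/inbound/claims_extract.csv"))
    raw_zone = spark.read.parquet("/landing/raw/claims/")

    src = source_extract.agg(F.count(F.lit(1)).alias("rows"), F.sum("claim_amount").alias("total")).first()
    tgt = raw_zone.agg(F.count(F.lit(1)).alias("rows"), F.sum("claim_amount").alias("total")).first()

    if src["rows"] != tgt["rows"] or src["total"] != tgt["total"]:
        raise ValueError("Reconciliation failed: source=%s, landed=%s" % (src, tgt))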

Q: How do you handle incremental loads/CDC in your system?
KP: We use Sqoop for data ingestion, with a timestamp column for performing incremental loads.
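
Kaiser does this with Sqoop (--incremental lastmodified against a check column). Purely to illustrate the same timestamp-watermark pattern, here is a hedged PySpark JDBC sketch that pulls only rows modified since the previous run; the connection details, table, and column names are all made up:

    # Timestamp-driven incremental pull, analogous in spirit to Sqoop's
    # --incremental lastmodified (illustrative only; everything here is hypothetical).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("incremental-pull").getOrCreate()

    last_value = "2015-11-01 00:00:00"   # high-water mark recorded after the previous run

    incremental = (spark.read.format("jdbc")
                   .option("url", "jdbc:mysql://dbhost:3306/claimsdb")
                   .option("dbtable",
                           "(SELECT * FROM claims WHERE last_updated > '" + last_value + "') t")
                   .option("user", "etl_user")
                   .option("password", "********")
                   .load())

    incremental.write.mode("append").parquet("/landing/raw/claims/")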

Q: What is the volume of data handled by Kaiser?
KP: We have a 100-node cluster with 2.5PB for the Raw Zone alone.

Q: Is the data stored in the Raw Zone kept for long-term storage, or is it transient?
Cloudera: Typically, data is not evicted from the Raw Zone; keeping it supports reprocessing and the inclusion of additional attributes that were not originally imported into the Refined Zone.

Q: It will take some extra cost to keep data in the Raw Zone. Do you suggest any retention policies?
KP: At Kaiser, we have a mandate to keep seven years' worth of data.

Q: Is the landing zone backed up?
KP: We are currently backing up metadata only.

Q: With the new back room's "doors open" approach, what do you do where auditing and compliance are mandatory and require governance?
KP: We use Cloudera Navigator for data governance: It provides security, auditing, lineage, tagging, and discovery. Also we have enriched our internal processes around data governance and auditing.

Stay tuned for part two, where we’ll look at what it takes to sell the business value of a Hadoop application, tips for managing security in a regulated environment, how to build the right data team, the tools for success, and recommendations for getting started.

The post Ralph Kimball and Kaiser Permanente: Q&A Part I – Hadoop and the Data Warehouse appeared first on Cloudera VISION.

Ralph Kimball and Kaiser Permanente: Q&A Part II – Building the Landing Zone


In a recent Cloudera webinar, “The Future of Data Warehousing: ETL Will Never be the Same”, Dr. Ralph Kimball, data warehousing / business intelligence thought leader and evangelist for dimensional modeling, and Manish Vipani, VP and Chief Architect of Enterprise Architecture at Kaiser Permanente, outlined the benefits of Hadoop for modernizing the ETL “back room” of a data warehouse environment, and beyond.

Since then Dr. Kimball, the team from Kaiser Permanente, and a few friends from Cloudera have taken time to answer many of the over 250 questions asked in the live chat.

In this second of two Q&A posts, we’ll look at what it takes to sell the business value of a Hadoop application, tips for managing security in a regulated environment, how to build the right data team, the tools for success, and recommendations for getting started. Enjoy!

Key:

  • RK = Dr. Ralph Kimball
  • KP = The Kaiser Permanente team: Manish Vipani, VP and Chief Architect of Enterprise Architecture; Rajiv Synghal, Chief Architect, Big Data Strategy; and Ramanbir Jaj, Data Architect, Big Data Strategy

 

Business Value of the Landing Zone

Q: For Kaiser, was the move to the Landing Zone led by the business use cases, or was it the amount of data as you mentioned?

KP: It was both.

Q: What is the business benefit of this newer architecture? Why is it faster to do this via a generic platform (like Hadoop) vs. a purpose-built system?

KP: The benefits are lower cost, improved performance, and making data from disparate systems available in one place, which makes it easy to correlate and analyze. Our Landing Zone provides users with a platform for performing quick POCs and starting to leverage the data.
RK: The Landing Zone is a generic platform, with different regions serving different user profiles. It is faster for those qualified clients who are able to immediately ingest the data in the Raw Zone. It also lets us extend SLAs, liberating processing from existing environments at a cost-effective price.
Cloudera: In addition, unlike purpose-built systems, Hadoop’s flexibility enables a diversity of access and analysis on shared data. This has tremendous business value — enabling a broader user community to access and collaborate with data to drive new insights — as well as more IT-centric benefits including lower TCO from an integrated open source platform with common metadata, and reduced risk from managing data security in one place.

Q: Talk to us about cost. Order of magnitude investment levels to transform an organization?

RK: In my experience, these revolutionary investments are mostly made defensively when an organization perceives or fears that their competitors are already doing it.

Q: How would this impact regular data warehouses where data is not in petabytes, or organizations who have many small- to medium-size data sets that, taken together, are big?

RK: The size of the data is not what is interesting about the Landing Zone or Hadoop in general. First it is the variety: data that simply cannot be processed in a relational database. Second it is the velocity and the expectation of immediate access. It has been difficult or impossible to address these requirements in a traditional EDW environment. Cost is a factor as well, depending on whether your legacy EDW environment has existing capacity.

Q: What does self-service mean here? Is the business able to work on their own, without IT involvement?

KP: These are design elements for creating a common data platform for all data. It is a 5-7 year journey, with the goal of enabling business users to self-deploy, working with IT teams as needed to run analytics and reporting on top of it.

Q: How are you socializing the Landing Zone with business and IT stakeholders?

KP: We don’t talk to them about the Landing Zone directly, but talk about their problems and do a quick POC using this platform to show how they can benefit.

Q: Does this warehouse support medical research? If so, do the researchers also access the data through the same mechanisms of Hive or Pig?

KP: Yes, we support medical research data as of now. The researchers use analytical and reporting tools to access data.

Q: Is this in production?

KP: Yes, this is in production.

Q: Do you have any objective way of measuring success?

KP: Yes, our success is based on how much data is coming into the Landing Zone from source systems, and also on how many use cases we are delivering to solve business problems.

Managing Security

Q: How were you able to hide sensitive data such as patient data, and still give unlimited access to your users such as the data scientist? Tools?

KP: Users only have access to data specific to their use case.

Q: Where does personally identifiable information (PII) protection fit in the picture, especially if we open up the warehouse more?

KP: At Kaiser we follow all Health care standards and security practices to keep data protected. We do data masking at the source systems. When it comes to the Landing Zone, the data is protected, authenticated, and authorized to satisfy compliance requirements.
Cloudera: One of the exciting features of RecordService — our new unified role-based policy enforcement system for the Hadoop ecosystem — is dynamic data masking, which should provide additional flexibility in designing secure analytics environments.
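
RecordService applies masking centrally by policy. Just to make the idea of masking concrete (this is not how RecordService itself works), here is a simplified, hypothetical PySpark sketch that pseudonymizes and redacts PII columns before publishing a view for broader analytic access:

    # Simplified masking illustration (not RecordService): hash identifiers and
    # redact direct PII before exposing a view (hypothetical path and columns).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pii-masking").getOrCreate()

    patients = spark.read.parquet("/refined/patients/")

    masked = (patients
              .withColumn("member_id", F.sha2(F.col("member_id").cast("string"), 256))  # pseudonymize
              .withColumn("name", F.lit("REDACTED"))                                    # redact
              .withColumn("birth_year", F.year("birth_date")).drop("birth_date"))       # generalize

    masked.createOrReplaceTempView("patients_masked")   # analysts query this, not the raw table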

Q: How do you manage data access/security in the Raw Data zone?

KP: We have a separate network with private IP addresses, and Kerberos with Identity and Access Management to manage data access and security.

Q: How does Kaiser encrypt data-at-rest? Is it field level or full disk encryption? Why do you encrypt even non-sensitive information?

KP: We are using Cloudera Navigator for full disk encryption and key management.
Cloudera: Beyond HDFS, full disk encryption can be critical for regulated environments; logs and metadata also exist outside of HDFS and can just as easily contain sensitive information. Cloudera Navigator protects data at the OS/filesystem level.

Q: Are you seeing any performance degradation by encrypting data at rest?

KP: We have not experienced any significant performance impact by enabling encryption.

Q: What are you using for user data authorization?

KP: We are leveraging Kerberos and Apache Sentry.

Building the Team

Q: Describe the user base. What is the mix of power users when the organization has about 20,000 users with some reporting requirement?

KP: We have a requirement to serve a huge business community on the reporting and analytics side; it is a journey for us. Some community users are happy with their existing environment, and some want to adopt this new platform and benefit from it. We are leveraging new tools for our users to build new dashboards.
RK: Looking at the overall landscape of general business users, probably 5 to 10 percent [have advanced skills to use the new environments]. But in some information-intensive companies there may be dozens or hundreds of data scientists, sometimes in unexpected departments, like manufacturing operations.

Q: Are your data scientists skilled in programming and coding (besides being good statisticians and mathematicians)?

KP: Some of them are, yes.

Q: How big is the team that supports the solution?

KP: We have a team of about 20 people supporting the Landing Zone environment. We have separate ingest and data refinement teams; an ops team providing a 24/7 onsite and offshore support model; plus additional data scientists specific to use cases.

Q: What is the lag time from user request to Refined Zone implementation?

KP: This is not done by any centralized organization at Kaiser. There are experts who decide on use cases and apply resources to projects. A project may span anywhere from 2 weeks to 18 months.

Q: Are you leveraging any knowledge management as part of the process?

KP: We are maintaining our internal knowledge base for processes and challenges. We use data SMEs to help understand the data and perform transformations, mapping, and cleanup.

Choosing the Right Tools

Q: Can you talk about the tools that you’ve used in building your Landing Zone?

KP: The Landing Zone comprises the entire Cloudera Enterprise platform — including HDFS, a set of compute frameworks (MapReduce, Impala, Spark, etc.), as well as governance and management tools [Cloudera Navigator and Cloudera Manager, respectively]. We are also evaluating some new tools outside of Cloudera, e.g., Waterline Data for data wrangling and tagging.

Q: Is it safe to assume that transformations are done as we move from Raw to Refined Zones? If so, what tool/technology are you using for those transformations?

KP: Our pattern is to bring data into Hadoop ASAP, and then perform transforms in place. For transport, we mostly use Sqoop, Flume, and flat files, with home-grown scripts. For transformation, we mostly use Hive and Impala.

Q: How are you replicating data from Teradata to Hadoop, and how you are keeping that data in sync? How frequently are you doing that?

KP: We use Sqoop for replicating data from Teradata to Hadoop. We have daily and weekly ingests.

Q: When you moved queries from an existing data warehouse (using Teradata) to the Landing Zone, how was the performance comparison on the reports? Don’t you lose the advantage of MPP redundancy?

KP: We have seen a 5-9x improvement from the time of data acquisition, through integration, to decision-making and reporting. And yes, Hadoop does provide redundancy.

Q: How do users access the data in the user defined space? What tool do you suggest for data discovery and exploration?

KP: Users use Hive and Impala.
RK: A whole cottage industry of BI tools works great in the Hadoop environment, particularly accessing the Impala SQL engine, which is meant for rapid-response ad hoc querying. These BI tools include all the established players like Tableau, Qlikview, Cognos, and many others.

Q: Don’t BI tools need a traditional high-speed database? Aren’t we giving up a lot of speed in terms of how fast results are returned, when we move from materialized views/cubes to HDFS?

RK: I disagree! While it is hard to beat a small cube store entirely in memory on a hot processor, Impala in Hadoop is purpose built to provide extremely high query performance, especially when built on top of Parquet columnar data files. The BI experience benefits greatly from every increase in performance. By all means please look at the demonstrations by ZoomData with BI tools sitting on Impala on Hadoop. Mind boggling.

Q: What kind of search technology are you using: Elasticsearch or Solr?

Cloudera: Cloudera Search, which is based on Apache Solr Cloud.

Q: Are you using Kafka or related queuing solutions?

KP: Not as we speak, but the Landing Zone is set up to use it in the near future.

Q: What is the primary language you are using at Kaiser: Python, R, or Java?

KP: Java and Python.

Getting Started

Q: How do we start incorporating this new technology with our current EDW?

RK: One path is to offload ETL and/or query processing from the EDW when the EDW (or the OLTP system itself) is oversubscribed. Another path is to build an ETL application, in Spark for example, on non-standard data that cannot conveniently be loaded or transformed in the conventional environment.
Cloudera: Completely agreed. Many organizations follow a similar maturity model: Start with an operational efficiency use case — saving money on and improving the performance of existing infrastructure by offloading data and/or workloads to Hadoop — while you build your architecture and teams’ skills. You will soon be able to integrate more data, and more kinds of data, with better performance and for lower TCO. Proceed from there to blending new data sets and delivering self-service BI to expand the number of users who can gain value from your data. With data science skills on board, you can begin to use Hadoop’s diverse exploratory and analytic techniques to develop predictive models of customer and channel behavior. Our most mature customers use Hadoop to build data products, often real-time applications that use data to directly impact how the business operates. The key is to start small and prove value quickly, then iterate to success.
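
As a hedged sketch of the "ETL application in Spark on non-standard data" path, the example below parses nested JSON events that would be awkward to load into an RDBMS, flattens them, and lands a conformed Parquet target for downstream BI (all sources and columns are hypothetical):

    # Offload-style Spark ETL over semi-structured data (hypothetical names).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("json-etl-offload").getOrCreate()

    events = spark.read.json("/landing/raw/device_events/")   # nested JSON, schema inferred on read

    flattened = (events
                 .select(F.col("device.id").alias("device_id"),
                         F.col("event_time").cast("timestamp").alias("event_ts"),
                         F.explode("readings").alias("reading"))
                 .select("device_id", "event_ts",
                         F.col("reading.metric").alias("metric"),
                         F.col("reading.value").cast("double").alias("value")))

    # Land a columnar, partitioned target that BI tools can query through Impala or Hive.
    flattened.write.mode("overwrite").partitionBy("metric").parquet("/landing/refined/device_readings/")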

Q: What courses and skills can I acquire for a career in this field?

Cloudera: Depending on your focus and interests, there are a wide variety of self-paced online and classroom training opportunities available at cloudera.com.

The post Ralph Kimball and Kaiser Permanente: Q&A Part II – Building the Landing Zone appeared first on Cloudera VISION.