Archive for January, 2018

EDH/DL vs EDW – Architecture Use Cases

Posted: January 13, 2018 in Hadoop

Task: to compare EDH/DL vs. EDW and present architecture use cases based on the main Apache Hadoop distributions (IMHO, the main ones known at the time of writing): Cloudera (CDH) / Hortonworks (HDP)

EDH (source: Wikipedia)

A data hub is a collection of data from multiple sources organized for distribution, sharing, and often subsetting. Generally this data distribution is in the form of a hub and spoke architecture.

A data hub differs from a data warehouse in that it is generally unintegrated and often at different grains. It differs from an operational data store because a data hub does not need to be limited to operational data.

A data hub differs from a data lake by homogenizing data and possibly serving data in multiple desired formats, rather than simply storing it in one place, and by adding other value to the data such as de-duplication, quality, security, and a standardized set of query services. A Data Lake tends to store data in one place for availability, and allow/require the consumer to process or add value to the data.

DL (source: Wikipedia)

A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).

EDW (source: Wikipedia)

In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place and are used for creating analytical reports for workers throughout the enterprise.

Lambda and Kappa Architectures

Lambda (source: Wikipedia)

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods.

Kappa (source: Milinda Pathigage)

Kappa architecture is a software architecture pattern. Rather than using a relational DB like SQL or a key-value store like Cassandra, the canonical data store in a Kappa Architecture system is an append-only immutable log. From the log, data is streamed through a computational system and fed into auxiliary stores for serving.

Kappa Architecture is a simplification of Lambda Architecture. A Kappa Architecture system is like a Lambda Architecture system with the batch processing system removed. To replace batch processing, data is simply fed through the streaming system quickly.
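The Kappa pattern can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a real streaming system (which would involve something like Kafka plus a stream processor): the canonical store is an append-only log, and serving views are materialized by replaying it. All names and events below are invented.

```python
# Toy sketch of the Kappa pattern: the canonical data store is an
# append-only log; a streaming computation folds the log into an
# auxiliary serving store. "Reprocessing" is just replaying the log
# from offset 0 with new processing logic.

log = []  # append-only immutable log (canonical data store)

def append(event):
    log.append(event)  # events are only ever appended, never updated

def materialize(process, from_offset=0):
    """Replay the log through a processing function into a serving store."""
    store = {}
    for event in log[from_offset:]:
        process(store, event)
    return store

# Example processing logic: count events per user.
def count_by_user(store, event):
    store[event["user"]] = store.get(event["user"], 0) + 1

append({"user": "alice", "action": "click"})
append({"user": "bob", "action": "view"})
append({"user": "alice", "action": "view"})

serving_store = materialize(count_by_user)
# serving_store == {"alice": 2, "bob": 1}
```

Because the log is the source of truth, changing `count_by_user` and calling `materialize` again rebuilds the serving store without any separate batch layer, which is exactly the simplification Kappa makes over Lambda.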


Data Storage Formats / Data Storage Engines

  • Text (Data Storage Format)
  • Sequence File (Data Storage Format)
  • Apache Avro (Data Storage Format)
  • Apache Parquet (Data Storage Format)
  • Apache Optimized Row Columnar – ORC (Data Storage Format)
  • Apache HBase (Data Storage Engine)
  • Apache Kudu (Data Storage Engine)


Text:

  • More specifically, text = CSV, TSV, JSON records…
  • Convenient format to use to exchange with other applications or scripts that produce or read delimited files
  • Human readable and parsable
  • Data storage is bulky and not as efficient to query
  • Schema information is embedded in the data itself (e.g. JSON field names are repeated in every record)
  • Does not support block compression
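A small illustration of why text formats are bulky: in JSON-lines storage every record repeats the field names, whereas a delimited (CSV-style) format moves that schema out of the data into a header row. Field names and values below are made up for the example.

```python
import json

# Sketch: JSON-lines text storage repeats every field name in every
# record -- human readable and parsable, but bulky.
records = [{"id": i, "name": f"user{i}", "score": i * 10} for i in range(3)]

json_lines = "\n".join(json.dumps(r) for r in records)
# Every line carries the strings "id", "name", "score" again:
assert json_lines.count('"name"') == len(records)

# A delimited variant drops the repeated names, but then the schema
# must live outside the data (here: a header row):
csv_lines = "id,name,score\n" + "\n".join(
    f'{r["id"]},{r["name"]},{r["score"]}' for r in records)
assert len(csv_lines) < len(json_lines)
```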

Sequence File:

  • Provides a persistent data structure for binary key-value pairs
  • Row based
  • Commonly used to transfer data between MapReduce jobs
  • Can be used as an archive to pack small files in Hadoop
  • Supports splitting even when the data is compressed
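The real SequenceFile format lives in Hadoop's Java libraries; the following is a hypothetical, much-simplified Python analogue that only illustrates the idea of a row-based stream of length-prefixed binary key-value records (it is not wire-compatible with SequenceFile, and the file names and payloads are invented).

```python
import io
import struct

# Simplified sketch of a SequenceFile-like structure: a stream of
# length-prefixed binary key-value records. Packing many small payloads
# into one stream is how such a container can archive small files.

def write_records(stream, pairs):
    for key, value in pairs:
        for blob in (key, value):
            stream.write(struct.pack(">I", len(blob)))  # 4-byte big-endian length
            stream.write(blob)

def read_records(stream):
    pairs = []
    while True:
        header = stream.read(4)
        if not header:
            break  # end of stream
        key = stream.read(struct.unpack(">I", header)[0])
        vlen = struct.unpack(">I", stream.read(4))[0]
        pairs.append((key, stream.read(vlen)))
    return pairs

buf = io.BytesIO()
write_records(buf, [(b"file1.txt", b"small file contents"),
                    (b"file2.txt", b"another small file")])
buf.seek(0)
assert read_records(buf) == [(b"file1.txt", b"small file contents"),
                             (b"file2.txt", b"another small file")]
```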

Apache Avro:

  • Widely used as a serialization platform
  • Row-based (row major format), offers a compact and fast binary format
  • Schema is encoded on the file, so the data can be untagged
  • Files support block compression and are splittable
  • Supports schema evolution
  • Supports nested data
  • No internal indexes (HDFS directory-based partitioning technique can be applied for fast random data access)
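An illustrative (made-up) Avro schema showing two of the points above: nested records, and schema evolution. Adding the optional `email` field with a default value lets readers using the new schema still decode data written before the field existed; the record and field names here are assumptions for the example.

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "address", "type": {
      "type": "record", "name": "Address",
      "fields": [
        {"name": "city", "type": "string"},
        {"name": "zip",  "type": "string"}
      ]
    }},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

The schema itself is written into the Avro file header, which is why individual records can stay compact and untagged.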

Apache Parquet:

  • Column-oriented binary file format (column major format suitable for efficient data analytics)
  • Uses the record shredding and assembly algorithm described in Google’s Dremel paper
  • Each data file contains the values for a set of rows
  • Efficient in terms of disk I/O when specific columns need to be queried
  • Integrated compression (provides very good compaction ratios) and indexes
  • HDFS directory-based partitioning technique can be applied for fast random data access

Apache ORC – Optimized Row Columnar:

  • Considered the evolution of the RCFile (originally part of Hive)
  • Stores collections of rows and within the collection the row data is stored in columnar format
  • Introduces a lightweight indexing that enables skipping of irrelevant blocks of rows
  • Splittable: allows parallel processing of row collections
  • It comes with basic statistics on columns (min, max, sum, and count)
  • Integrated compression

Apache HBase:

  • Scalable, distributed NoSQL database on HDFS for storing key-value pairs (note: based on Google’s Bigtable), hosting very large tables: billions of rows × millions of columns
  • Keys are indexed which typically provides very quick access to the records
  • Suitable for: random, realtime read/write access to Big Data
  • Schemaless
  • Supports security labels
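HBase’s logical data model can be sketched with plain Python dicts. This mimics only the model (a map of row key to a sparse column map) and not the real HBase client API; the row keys and column names below are invented.

```python
# Sketch of HBase's logical data model: a table is a map of
# row key -> {"family:qualifier": value}. Rows are schemaless, so
# different rows may carry completely different columns.

table = {}  # row key -> sparse column map

def put(row_key, column, value):
    table.setdefault(row_key, {})[column] = value

def get(row_key):
    # Row keys are indexed (here a dict lookup; in HBase, sorted regions),
    # which is what makes random, real-time reads by key fast.
    return table.get(row_key, {})

put(b"user#001", "info:name", "alice")
put(b"user#001", "info:email", "alice@example.com")
put(b"user#002", "metrics:logins", 7)  # different columns per row: schemaless

assert get(b"user#001")["info:name"] == "alice"
assert "info:name" not in get(b"user#002")
```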

Apache Kudu:

  • Scalable and distributed table-based storage
  • Provides indexing and columnar data organization to achieve a good compromise between ingestion speed and analytics performance
  • As with HBase, the Kudu API allows modifying data already stored in the system

Data Storage Formats / Data Storage Engines Benchmarks:

Text (e.g. JSON): do not use it for processing!

Sequence File: relevant mainly for MapReduce jobs; not suitable as a main data storage format!

Apache Avro: a fast, universal encoder for structured data. Due to very efficient serialization and deserialization, this format can guarantee very good performance whenever access to all the attributes of a record is required at the same time – data transportation, staging areas, etc.

Apache Parquet / Apache Kudu: columnar stores deliver very good flexibility between fast data ingestion, fast random data lookup, and scalable data analytics, ensuring at the same time system simplicity – only one technology for storing the data. Kudu excels at faster random lookups, while Parquet excels at faster data scans and ingestion.

Apache ORC: minor differences in comparison to Apache Parquet (note: at the time of writing Impala does not support the ORC file format!)

Apache HBase: delivers very good random data access performance and the greatest flexibility in how data representations can be stored (schema-less tables). The performance of batch processing of HBase data depends heavily on the chosen data model and typically cannot compete in this field with the other technologies. Therefore, analytics over HBase data should be performed rather sparingly.

As an alternative to a single storage technology, a hybrid system could be considered: raw storage for batch processing (like Parquet) combined with an indexing layer (like HBase) for random access. Notably, such an approach comes at the price of data duplication, higher overall architectural complexity, and higher maintenance costs. So, if system simplicity is an important factor, Apache Kudu appears to be a good compromise.

Advantages / Disadvantages of “Row” and “Column” oriented Storages / Data Access Patterns:

  • In “row oriented” storage, the full contents of a record in a database are stored as a sequence of adjacent bytes. Reading a full record in row format is thus an efficient operation. However, reading the third column of each record in a file is not particularly efficient; disks read data in minimum amounts of 1 block (typically 4KB), which means that even if the exact location of the 3rd column of each record is known, lots of irrelevant data will be read and then discarded.
  • In the simplest form of “column oriented” storage, there is a separate file for each column in the table; for a single record each of its columns is written into a different file. Reading a full record in this format therefore requires reading a small amount of data from each file – not efficient. However, reading the third column of each record in a file is very efficient. There are ways of encoding data in “column-oriented” format which do not require file-per-column, but they all (in various ways) store column values from multiple records adjacent to each other.
  • Data access patterns which are oriented around reading whole records are best with “row oriented” formats. A typical example is a “call center” which retrieves a customer record and displays all fields of that record on the screen at once. Such applications often fall into the category “OLTP” (online transaction processing).
  • Queries which search large numbers of records for a small set of “matches” work well with “column oriented” formats. A typical example is “select count(*) from large_data_set where col3>10”. In this case, only col3 from the dataset is ever needed, and the “column oriented” layout minimises the total amount of disk reads needed. Operations which calculate sum/min/max and similar aggregate values over a set of records also work efficiently with column-oriented formats. Such applications often fall into the category “OLAP” (online analytics processing).
  • “Column oriented” storage also allows data to be compressed better than row-oriented formats, because all values in a column are adjacent and all have the same data type. A type-specific compression algorithm can then be used (e.g. one specialized for compressing integers, dates, or strings).
  • “Column oriented” storage does have a number of disadvantages. As noted earlier, reading a whole record is less efficient. Inserting records is also less efficient, as is deleting records. Supporting atomic and transactional behaviour is also more complex.
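The trade-offs above can be demonstrated with a toy example that stores the same (invented) records both row-major and column-major, and then runs the `select count(*) from large_data_set where col3 > 10` query against the column store:

```python
# Toy illustration of the access patterns above: the same records stored
# row-oriented (list of tuples) and column-oriented (one list per column).

records = [(i, f"name{i}", i % 20) for i in range(1000)]  # (col1, col2, col3)

# Row-oriented: whole records adjacent -- reading one full record
# (the "call center" / OLTP pattern) touches one contiguous span.
row_store = list(records)
full_record = row_store[42]

# Column-oriented: one sequence per column -- a query over col3 only
# never touches col1/col2 data at all (the OLAP pattern).
col_store = {
    "col1": [r[0] for r in records],
    "col2": [r[1] for r in records],
    "col3": [r[2] for r in records],
}

# "select count(*) from large_data_set where col3 > 10"
matches = sum(1 for v in col_store["col3"] if v > 10)

# Sanity check against a full row scan: same answer, but the row scan
# had to read every column of every record to get it.
assert matches == sum(1 for r in records if r[2] > 10)
```

The adjacent, same-typed values in `col_store["col3"]` are also exactly what makes type-specific compression (run-length, delta encoding, etc.) effective in columnar formats.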

Infrastructure Overview (source: Cloudera)

Master Node (source: Cloudera)

Runs the Hadoop master daemons: NameNode, Standby NameNode, YARN ResourceManager and History Server, the HBase Master daemon, Sentry server, and the Impala StateStore Server and Catalog Server. Master nodes are also the location where ZooKeeper and JournalNodes are installed. The daemons can often share a single pool of servers. Depending on the cluster size, the roles can instead each be run on a dedicated server. Kudu Master Servers should also be deployed on master nodes.

Worker Node (source: Cloudera)

Runs the HDFS DataNode, YARN NodeManager, HBase RegionServer, Impala impalad, Search worker daemons and Kudu Tablet Servers.

Edge Node (source: Cloudera)

Contains all client-facing configurations and services, including gateway configurations for HDFS, YARN, Impala, Hive, and HBase. The edge node is also a good place for Hue, Oozie, HiveServer2, and Impala HAProxy. HiveServer2 and Impala HAProxy serve as a gateway to external applications such as Business Intelligence (BI) tools.

Utility Node (source: Cloudera)

Runs Cloudera Manager and the Cloudera Management Services. It can also host a MySQL (or another supported) database instance, which is used by Cloudera Manager, Hive, Sentry and other Hadoop-related projects.

Figure: Hortonworks Data Platform (source: Hortonworks)


Security Overview (source: Cloudera)

Apache Atlas: Data Governance and Metadata framework for Hadoop: NOT supported by CDH platform; use Cloudera Navigator instead

Apache Knox: REST API and Application Gateway for the Apache Hadoop Ecosystem: NOT supported by CDH platform; a standard firewall will give you more or less the same functionality with respect to network security. More advanced security (authorization, authentication, encryption) is provided by other components in the stack (Kerberos, Sentry, HDFS encryption, etc.)

Apache Metron: Real-time big data security (cyber-crime): NOT supported by CDH platform

Apache Ranger: Framework to enable, monitor and manage comprehensive data security across the Hadoop platform: NOT supported by CDH platform; use Apache Sentry instead

Apache Sentry: a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster: SUPPORTED by CDH platform; whether you use Sentry or Ranger depends on the Hadoop distribution you are using (Apache Sentry is backed by Cloudera, Apache Ranger by Hortonworks; Ranger does not support Impala)


Figure: Security overview (source: Cloudera)


Figure: Security architecture (source: Cloudera)


Figure: Securing and governing a multi-tenant data lake (source: Dataworks Summit)