Big Data Storage: Retaining the Nutritional Value of Your Data
Data is like a multicourse meal: it has shape and flavor, but you often don’t consume it all at once! In the Enterprise Data Platform, data relevant to various business cases flows through the different layers of processing and preparation on the platform while having flavors (like aggregations) added to it. Not unlike the great meal you are preparing (where you prep and cook, then serve your meal, and finally store the leftovers), there are three stages (or “shapes”) to data – raw data, in-process data, and processed (or summarized) data.
With multiple options available to store these “shapes”, choosing the “best” data storage solution can often be challenging for enterprise analytical applications. Usually, an enterprise has a large number of such applications (several hundred is typical) built to serve specific needs. The storage option has to match usage scenarios, query patterns, and analytical needs of these apps. In addition, a number of business trends, as well as security and compliance needs, are fueling the growing hunger for optimal storage. From a storage systems standpoint, the overriding concern should be the “response time” experienced by a user of the Enterprise Analytics System as a whole or for individual app experiences, with cost and security as additional considerations.
In prior articles, we covered the readiness of the Enterprise Data Platform for digital transformation, data ingestion, and the data processing options and capabilities. This article, co-authored with Mataprasad Agrawal, covers the Data Storage layer (highlighted in red in the diagram below). It addresses the question of how to spec a storage system given the application requirements.
The key drivers for choosing an appropriate storage system in a digitally driven enterprise are:
- Data growth and disruptive storage upgrades.
- Optimal utilization of storage resources to meet required performance SLAs.
- Capex and opex required for storage.
- Operations continuity and data protection requirements.
As the demand for and complexity of enterprise data apps grow, the need to retain data longer (possibly indefinitely) and access it quickly presents challenges not only in storing it but also in managing, archiving, and securing it over its lifecycle with minimal overhead. (Think about all those containers of leftovers in your freezer.)
In some industries (such as Media & Entertainment, Oil & Gas, and Life Sciences), data applications may need to store/transfer hundreds of thousands of files, each hundreds of gigabytes in size. There are two primary challenges with that:
- The bandwidth required to transfer such large files, i.e., sustaining a large number of concurrent transfers without slowing down individual accesses as the file count grows.
- Scalability and data protection.
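A common mitigation for the bandwidth challenge is to split each large file into fixed-size chunks and transfer them in parallel. The sketch below illustrates only the chunking-and-dispatch half of that pattern; `send_chunk` is a hypothetical, caller-supplied transport function, not a specific product's API.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB, a typical big-data block size


def chunk_ranges(file_size, chunk_size=CHUNK_SIZE):
    """Yield (offset, length) pairs that cover the whole file."""
    for offset in range(0, file_size, chunk_size):
        yield offset, min(chunk_size, file_size - offset)


def transfer_parallel(file_size, send_chunk, workers=8):
    """Send all chunks concurrently via the caller-supplied transport.

    Returns the total number of bytes reported sent.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(send_chunk, off, length)
                   for off, length in chunk_ranges(file_size)]
        return sum(f.result() for f in futures)


# Example with a dummy transport that just reports each chunk's length.
sent = transfer_parallel(200 * 1024 * 1024, lambda off, length: length)
print(sent)  # 209715200 (all 200 MiB accounted for)
```

Real systems layer retry, checksumming, and reassembly on top of this; the point here is only that chunking decouples aggregate bandwidth from the size of any single file.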
In a traditional scenario, storage starts with adding one clustered SAN-attached storage array or scale-out Network-Attached Storage (NAS) array, followed by another (and another, and another…). But as data grows in size and accessibility requirements expand, enterprises face the cost and management burden of such huge data volumes.
While it is certainly difficult to recommend a perfect storage system, architecting a successful one should not be a challenge if we understand the desired attributes of such a system. Here are a few of those attributes:
- Scalability: No degradation in performance irrespective of data volume, i.e., throughput and latency must be scalable.
- High availability: Widely distributed, across data centers and/or geographies.
- Inherent support for analytical apps: Without the need to redesign storage.
- Ease of integration: With enterprise ecosystem of public, private, or virtual private cloud, with enough protection.
- Elasticity: Flexibility to add or remove capacity based on need.
- Durability and affordability: In terms of recurring management and storage cost.
- Self-management: The built-in ability to handle failures, e.g., if a disk or server fails, it is not repaired immediately but the operation is redirected to available resources.
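The self-management attribute above can be made concrete with a small sketch: a client that, given a list of replica endpoints, redirects a read to the next available replica when one fails instead of waiting for a repair. All names here (`read_with_failover`, the `fetch` transport, the endpoint labels) are illustrative and not tied to any particular product.

```python
import random


class ReplicaUnavailable(Exception):
    """Raised when a single replica cannot serve a request."""


def read_with_failover(replicas, key, fetch):
    """Try replicas in random order; route around failures.

    `fetch(endpoint, key)` is a caller-supplied transport that raises
    ReplicaUnavailable when an endpoint is down. Randomizing the order
    spreads read load across healthy replicas.
    """
    for endpoint in random.sample(replicas, k=len(replicas)):
        try:
            return fetch(endpoint, key)
        except ReplicaUnavailable:
            continue  # self-management: redirect, don't repair
    raise RuntimeError(f"no replica could serve key {key!r}")


# Example: replica "nas-1" is down; the read is silently redirected.
store = {
    "nas-1": None,  # failed node
    "nas-2": {"user:42": b"payload"},
    "nas-3": {"user:42": b"payload"},
}


def fetch(endpoint, key):
    data = store[endpoint]
    if data is None:
        raise ReplicaUnavailable(endpoint)
    return data[key]


print(read_with_failover(list(store), "user:42", fetch))
```

Distributed stores such as HDFS and Cassandra build this redirection into the storage layer itself, so applications never see the failed disk or server.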
With the dramatic changes in both enterprise data requirements and storage technologies, as outlined above, customers need the ability to leverage the latest big data technology for storage. The following table summarizes the options.
| Desired Storage Characteristics | Storage Options | Data Shape | Technology* |
| --- | --- | --- | --- |
| Scalability, elasticity, self-healing | HDFS, NoSQL stores | Raw, in-process | HDFS-ready storage solutions such as EMC's Isilon scale-out NAS platform, Cleversafe, NetApp's Open Solution for Hadoop with DAS, Hitachi Data Systems (HDS), Lustre support from Intel, GPFS from IBM, and open-source solutions including Ceph and Cassandra |
| Data replication, flexible data structure, rapid ingestion, quick response, high frequency | NoSQL stores (graph, key-value, document-oriented) | Raw, processed | Over 100 well-known NoSQL systems, including Cassandra, MongoDB, CouchDB, Redis, Riak, Membase, Neo4j, and HBase |
| Very fast response times | In-memory stores | Processed | Couchbase, MemSQL, Aerospike, Redis, SAP HANA, IBM BLU for DB2, VoltDB, Oracle TimesTen, and GemFire |
| Structured storage, medium to low capacity (up to a few TB) | RDBMS / MPP systems | Raw, in-process, processed | MySQL Cluster, MS SQL Server 2014, Oracle, Teradata, IBM Netezza, etc. |

*Each category has many technologies; only a few are listed.
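The table's decision logic can be approximated in a short sketch. The rules and thresholds below are illustrative simplifications of the rows above (real selection also weighs cost, security, and response-time SLAs, as discussed earlier), not a recommendation engine.

```python
def pick_storage(shape, needs_subsecond_response=False,
                 structured=False, volume_tb=1):
    """Map rough app requirements onto the four storage categories.

    `shape` is one of "raw", "in-process", or "processed".
    All thresholds are illustrative assumptions.
    """
    if needs_subsecond_response and shape == "processed":
        return "in-memory store"       # e.g. Redis, SAP HANA
    if structured and volume_tb <= 5:
        return "RDBMS / MPP system"    # e.g. MySQL Cluster, Teradata
    if shape in ("raw", "in-process"):
        return "HDFS / scale-out NAS"  # e.g. Isilon, Ceph
    return "NoSQL store"               # e.g. Cassandra, MongoDB


print(pick_storage("processed", needs_subsecond_response=True))  # in-memory store
print(pick_storage("raw", volume_tb=500))                        # HDFS / scale-out NAS
```

In practice an enterprise with hundreds of analytical apps will land in several of these categories at once, which is why the desired attributes listed earlier matter more than any single product choice.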
As petabyte-scale data stores increasingly become key to business advantage, and as data growth continues to outpace traditional approaches, there is a corresponding desire to keep this data forever while ensuring that it is highly available, all at affordable cost. Enterprise data analytical apps are completely reliant on data storage for optimal throughput and latency. Enterprises need to accomplish this objective without growing administrative or data storage staff at the same rate as the data grows. Well-designed storage systems leverage capabilities such as self-healing and elasticity to handle the workloads imposed by modern analytical apps.