Data Ingestion: Solutions for Avoiding Heartburn

Remember when your mother would command you, “Don’t eat so fast!”? Do you put off eating pizza after a certain hour or avoid eating bell peppers altogether? Have trouble knowing when to stop at a buffet? Well, you can throw all that advice and those concerns out when it comes to ingesting data … faster, more, and different is what you’re being served, so knowing how to handle it is how you will avoid any heartburn.

The first article in the series on the role of data in digital transformation (Read: Is Your Enterprise Data Platform Ready for the Dive into Digital Transformation?) covered the architecture of an Enterprise Data Platform (see figure below) for today’s digital data and provided a brief introduction of each of its constituent layers. The first of these layers – Data Ingestion – involves connecting to various data sources, extracting the data, and detecting changed data. In this article we take a closer look at the various challenges and the available tools. My co-author for this post is Sanchet Dighe from the Analytics Practice at Persistent Systems.

Traditional enterprise data platforms (such as those used in BI) primarily deal with structured data, sourced either directly from relational databases or indirectly (via delimited files, Excel dumps, etc.) from proprietary systems (such as CRM, Payroll, MIS, and ERP). In recent times, however, we have been faced with the reality of dealing with large volumes of data available in structured, semi-structured, and unstructured forms and of the businesses needing to arrive at real-time or near real-time insights for decision support and digital transformation. This change in context has given rise to a new paradigm of Data Extraction, Transformation, and Load (ETL).

Data ingestion subsystems need to fetch data from variety of sources (such as RDBMS, web-logs, application-logs, streaming data, social media, etc.), either directly or via REST APIs accessed over web-services. The need to support such diverse and disparate data sources and access mechanisms presents a variety of challenges for an effective and robust data ingestion strategy.

Speed of ingestion: New data sources tend to deliver data at varying frequencies. For example, discussion forum digests amount to considerable volume but appear at a daily/weekly frequency, whereas tweets or Facebook posts are small volumes of data occurring at a high frequency – thousands per second. The ingestion process not only has to account for the late-arriving data, but also make the entire dataset available to the analytical algorithms as soon as possible. This necessitates rapid ingestion using parallel processing or using multiple input data streams.

Volume of data: With a wider range of data sources becoming relevant for the enterprise, the volume of data to be ingested has grown manifold over the years. This data explosion poses one of the challenges for ingesting enterprise data in about the same time-window as before. This requires the ability to handle large datasets and perform fast in-memory processing.

Change detection: Another challenge that applies to incremental data-ingestion processes is the detection and capture of changed data (Changed Data Capture, or CDC). This task is difficult, not only because of the semi-structured or unstructured nature of data, but also due to the low latency needed by certain business scenarios that require this determination. A combination of techniques such as log-aggregation and parsing, CRC generation, and row-field value comparisons can be leveraged to arrive at a CDC solution.

The key to implementing a successful data ingestion solution is to harness the capabilities of the data import/export tools provided by traditional databases, ETL tools, and the modern big-data and NoSQL technologies. A blend of these tools to address the challenges – volume, variety, velocity – constitutes a common framework for data ingestion. Here is a description of five representative use cases and appropriate tools (both open-source and proprietary).

Objective	Desired Solution Characteristics	Sample tools	Details
Handling large data volume	Data extraction with load-balancing using a distributed solution or a cluster of nodes.	Apache Flume,Apache Storm,Apache Spark	Apache Flume is useful in processing log-data. Apache Storm is desirable for operations monitoring and Apache Spark for streaming data, graph processing and machine-learning.
Messaging for distributed ingestion	Messaging system should ensure scalable and reliable communication across nodes involved in data-ingestion.	Apache Kafka	The Gobblin framework from LinkedIn makes use of Apache Kafka to achieve fast communication between the cluster-nodes.
Real-time or near real-time ingestion	Data-ingestion process should be able to handle high-frequency of incoming or streaming data.	Apache Storm, Apache Spark	Apache Storm and Spark provide for real-time data-ingestion.
Batch-mode ingestion	Ability to ingest data in bulk-mode.	Apache Sqoop, Apache Kafka,Apache Chukwa	Apache Sqoop (transferring data between Hadoop and RDBMS), Apache Kafka, Apache Chukwa process data in batch-mode and are useful when data needs to be ingested at an interval of few minutes/hours/days.
Detecting incremental data	Ability to handle structured and unstructured data, low-latency.	Databus,Goldengate,IBM InfoSphere,Attunity,SyncSort	Databus from LinkedIn is a distributed solution that provides a timeline-consistent stream of change capture events for a database. Some of the other solutions available with some of the data-integration software such as GoldenGate knowledge modules for ODI, IBM InfoSphere Data replication Change Data Capture and solutions from vendors such as Attunity, SyncSort which implement these techniques to address the CDC problem.

Given the variety of use cases, one tool is often inadequate to address all enterprise data ingestion requirements. However, integrating the individual tools requires a lot of interface management. This is where ready-made frameworks such as Spring XD and Gobblin prove useful by addressing a combination of challenges. For example, social media data may not always require processing in real-time; Spring XD, an open-source framework, enables streaming in of social media data and workflow orchestration in batch-mode.

Keep the following points in mind when deciding upon a strategy and framework for your data ingestion needs.

Ingestion requirements/characteristics (volume, velocity, load-frequency) and the data types (structured/unstructured) play an important part in deciding which tool/framework to use. Outline the needs of your ingestion process and look for the tool/framework designed for that use case.

Most tools/platforms leverage parallelism using a distributed data ingestion architecture involving several cluster-nodes. Ensure that your infrastructure plan will support such an architecture.

More often than not, CDC tools will enable to you to avoid reinventing the wheel. Research the available options before choosing to implement a custom strategy.

And come to think of it, while you’re at it, feel free to go swimming right after ingesting all this data!

Dr. Siddhartha Chatterjee is Chief Technology Officer at Persistent Systems. Sanchet Dighe is a Senior Technical Specialist in the Analytics Practice at Persistent Systems.