Is Your Enterprise Data Platform Ready for the Dive into Digital Transformation?
Data is the currency of Enterprise Digital Transformation. Indeed, every aspect of the software-driven business is intimately connected with data – to the point that one could just as well talk about the “data-driven” business. This post is the first in a series of pieces on the central role of data in digital transformation. My co-author today is Sunil Agrawal, chief architect in the CTO team at Persistent Systems.
A lot of the current discussion around making enterprises digitally enabled is what we see early on in any new paradigm: people using an in-vogue term to describe whatever they are doing. In the context of digital transformation, we see people referring to rationalization, moving to the cloud, and migrating applications to the latest technologies as “going digital” or “digital transformation.” At Persistent, we clearly differentiate between IT modernization (such as the examples above) and the much more profound move to enterprise digital transformation.
To fully realize the business model transformation benefits promised by enterprise digital transformation, enterprises must be able to make timely business decisions based on the available data (both internal and external to the enterprise). Enterprises must make every actionable decision based on data, and evaluate success or failure using data. However, traditional data platforms at enterprises may not be ideal for this digitally enabled world.
Let us examine the typical structure of a traditional enterprise data platform (above). It has various components for offline storage of data (such as staging and data warehouse) whose schemas are predefined based on the reports and dashboards identified in advance by the business. Moving data from one data store to another is done through defined ETL (Extract, Transform, Load) processes using schema mapping. The inherently offline nature of the processing imposes latency (often measured in hours or days) before reports or dashboards can be viewed.
In a digitally enabled world, such a platform structure poses some key challenges to business:
It is ill-suited to handle unstructured data generated by social media, emails, or various documents shared within the enterprise.
It usually takes several months to incorporate any new business requirement; the cycle time is high due to the schema-based approach and the need to propagate changes across the system.
By the time systems start delivering results, the business may have already missed opportunities and revenue. And as anyone who has heard from an impatient marketing department knows, research and intelligence are so time-sensitive that they need to be delivered as soon as possible.
Finally, there is the possibility of multiple data platforms existing in silos within enterprises to cater to different business requirements. Information silos are rarely good for business!
None of these challenges is insurmountable; all can be overcome by augmenting the existing data platform to accommodate the needs of digital data. What will make or break this shift are the added expectations for the platform: that it be able to support a variety of high-volume and high-velocity data inputs, get data from any layer at any time, and provide near-real-time analytics. While component technologies are now available to make this happen, it is important not only that the data platform be implemented correctly and effectively, but also that it be implemented quickly. Enterprises do not have the luxury of waiting multiple years for a transformed data platform.
At a systems level, a transformed Digital Data Driven Platform for enterprises can be visualized as shown above. The richness of the platform, with multiple layers and features, reflects the capabilities that are needed to handle a variety of data sources within and outside enterprises. And while the above diagram may seem intimidating, it can be built incrementally and iteratively in agile product release cycles.
The various layers in the architecture diagram above have the following functions. These are very high-level descriptions; future posts in the series will dig more deeply into the layers.
The data ingestion layer provides a mechanism to extract data (structured as well as unstructured) from a variety of enterprise and public sources. It therefore needs to support a variety of data formats, such as RDBMS, XML, JSON, CSV, PST file, CRM, and ERP. To start with, build a connector framework and add data sources incrementally as needed. This layer also performs data quality checks and does the necessary cleansing before processing the data. It exposes RESTful APIs to support both pull and push modes of data ingestion.
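To make the ingestion layer concrete, here is a minimal sketch of the connector-framework idea: a common connector interface, one connector per data format, and a cleansing step applied before the data moves on. All names here (`Connector`, `ingest`, the `id` quality rule) are illustrative assumptions, not part of any specific product.

```python
import csv
import io
import json
from abc import ABC, abstractmethod

class Connector(ABC):
    """Base class for the connector framework: each data source
    (RDBMS, CSV, JSON, CRM export, ...) gets its own connector
    that yields records as plain dicts."""

    @abstractmethod
    def extract(self, raw: str) -> list[dict]:
        ...

class CsvConnector(Connector):
    def extract(self, raw: str) -> list[dict]:
        return list(csv.DictReader(io.StringIO(raw)))

class JsonConnector(Connector):
    def extract(self, raw: str) -> list[dict]:
        data = json.loads(raw)
        return data if isinstance(data, list) else [data]

def cleanse(records: list[dict]) -> list[dict]:
    """Minimal quality check: trim whitespace and drop records
    that lack an 'id' field (a hypothetical quality rule)."""
    cleaned = []
    for rec in records:
        rec = {k: v.strip() if isinstance(v, str) else v
               for k, v in rec.items()}
        if rec.get("id"):
            cleaned.append(rec)
    return cleaned

# Connectors are registered by format; new sources are added incrementally.
CONNECTORS = {"csv": CsvConnector(), "json": JsonConnector()}

def ingest(fmt: str, raw: str) -> list[dict]:
    """Entry point that a push-mode RESTful API handler could call."""
    return cleanse(CONNECTORS[fmt].extract(raw))
```

In a real platform each connector would pull from a live source (JDBC, an API, a PST mailbox) rather than a string, but the shape of the framework stays the same.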
The data processing layer provides mechanisms to massage the data and to divert it into either the real-time or the batch processing stream of data analysis. It should also expose its analysis via an API to be consumed directly by apps (especially those that generate alerts and notifications). In this layer, data is passed through various extraction algorithms to match dictionary items, identify location information, extract snippets, and so on. This layer also performs any data aggregation and consolidation required before the data is stored.
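The enrichment-then-routing flow described above can be sketched as follows. The dictionary terms, the `Record` type, and the rule that "outage" events are time-sensitive are all hypothetical; the in-memory lists stand in for a real streaming system and batch store.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    text: str
    tags: list[str] = field(default_factory=list)

# Hypothetical dictionary the extraction step matches against.
DICTIONARY = {"outage", "refund", "upgrade"}

def extract_tags(record: Record) -> Record:
    """Dictionary-matching extraction: tag any known terms in the text."""
    record.tags = sorted(w for w in record.text.lower().split()
                         if w in DICTIONARY)
    return record

realtime_queue: list[Record] = []  # stand-in for a streaming system
batch_queue: list[Record] = []     # stand-in for batch storage

def route(record: Record) -> None:
    """Divert enriched records: time-sensitive ones go to the real-time
    stream (feeding alerts/notifications); the rest go to batch."""
    record = extract_tags(record)
    target = realtime_queue if "outage" in record.tags else batch_queue
    target.append(record)
```

In production the two queues would typically be a message bus topic and a landing zone in the storage layer, respectively.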
The data storage layer is critical, as it can single-handedly determine the overall scalability, latency, and performance of the platform. It has to handle the usage scenarios, querying patterns, and analytical needs of the various kinds of big data apps available in the market. From a storage systems standpoint, the first concern should be the response time experienced by a user of the analytics system. In general, a platform should support multiple storage mechanisms chosen to match business requirements. Most enterprises will already have an RDBMS, so it is a matter of extending into other storage mechanisms: Hadoop is useful for storing large volumes of data for historic analysis and long-running algorithms; NoSQL stores are best suited for network, connection, and graph analysis, where reports have low-latency requirements; and where a user needs the highest performance, in-memory data grids and databases are also an option. The strategy for data partitioning and distribution plays an important role in the design of this layer.
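The "right store for the right workload" choice above can be expressed as a simple routing table. This is an illustrative mapping only; the actual choice for any workload depends on data volume, query shape, and latency budget, and the store names here are placeholders.

```python
from enum import Enum

class AccessPattern(Enum):
    HISTORIC_SCAN = "historic_scan"      # long-running batch analysis
    GRAPH_TRAVERSAL = "graph_traversal"  # network / connection queries
    LOW_LATENCY = "low_latency"          # interactive dashboards
    TRANSACTIONAL = "transactional"      # existing operational data

# Illustrative polyglot-persistence mapping, following the text above.
STORE_FOR = {
    AccessPattern.HISTORIC_SCAN: "hadoop_hdfs",
    AccessPattern.GRAPH_TRAVERSAL: "nosql_graph_store",
    AccessPattern.LOW_LATENCY: "in_memory_data_grid",
    AccessPattern.TRANSACTIONAL: "rdbms",
}

def choose_store(pattern: AccessPattern) -> str:
    """Pick a storage mechanism based on the workload's access pattern."""
    return STORE_FOR[pattern]
```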
The analytics layer provides algorithms and capabilities to perform analytics based on the business use cases. It should support ad-hoc analysis, what-if analysis, predictive analysis, and search (Q&A) analysis. This layer should provide the capability to do run-time data aggregation and consolidation. It may employ advanced techniques such as machine learning.
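As a small illustration of the run-time aggregation and what-if capabilities mentioned above, the sketch below computes a per-group total on demand (rather than from a precomputed schema) and then re-runs it under a hypothetical uplift factor. The field names and the uplift scenario are assumptions for the example.

```python
from collections import defaultdict

def aggregate(records: list[dict], group_key: str, value_key: str) -> dict:
    """Run-time aggregation: sum a metric per group, computed on demand."""
    totals: dict = defaultdict(float)
    for rec in records:
        totals[rec[group_key]] += rec[value_key]
    return dict(totals)

def what_if(records: list[dict], group_key: str, value_key: str,
            uplift: float) -> dict:
    """What-if analysis: the same aggregation under a hypothetical
    uplift factor (e.g. 'what if revenue grew 10%?')."""
    return {group: total * uplift
            for group, total in aggregate(records, group_key, value_key).items()}
```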
The visualization layer supports different layouts, visualization charts, and elements to enable end-users to choose their own visualization.
The data governance layer ensures that data movement between layers follows the workflow and schedule requirements of the business. It also stores logical and physical metadata for each of the other layers, monitors for errors, and alerts system administrators when required. Data governance is shown cutting across the other layers because each control has to be built at the appropriate layer; for example, certain quality checks may be very costly at the source level but cheap at a later stage, once the data has been aggregated.
The data security and protection layer needs to be implemented at every layer and must be considered at design time. At a minimum, it should take care of authentication and authorization of data access. Depending on the kind of data being stored, it may also be necessary to encrypt or mask the data during storage or when reporting it back to applications.
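Two of the controls just mentioned, authorization and masking, can be sketched as below. The role model and the email-masking rule (keep the domain, replace the local part with a stable pseudonym) are illustrative assumptions, not a prescription; real deployments would use the enterprise's identity provider and data-classification policy.

```python
import hashlib

def mask_email(value: str) -> str:
    """Mask PII before reporting it back to applications: keep the domain
    (useful for analytics), replace the local part with a stable pseudonym
    so the same user still aggregates consistently."""
    local, _, domain = value.partition("@")
    pseudonym = hashlib.sha256(local.encode()).hexdigest()[:8]
    return f"{pseudonym}@{domain}"

def authorize(user_roles: set[str], required_role: str) -> bool:
    """Minimal role-based authorization check for data access."""
    return required_role in user_roles
```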
It’s great that you’re excited about digital. But don’t be confused by all the noise, and don’t be intimidated by what may look like an overwhelming challenge of managing huge volumes of data (both structured and unstructured). Understand what the data platform requirements for digital really are – and aren’t – and then build your data-driven platform to take advantage of this new digital world.