Think of the produce aisle at the supermarket, full of items of different flavors, colors, and shapes. You go in search of the fresh ones, the ripe ones, the right ones, and you find none! Some are past their expiry dates, some have unwanted ingredients, and some are not on the shelves at all. Data in enterprise systems is like food: it has to be kept fresh, it has to have the nourishment you need, and you have to be able to find it. Otherwise it goes bad and does not help you make strategic and operational decisions. Just as consuming spoiled food could make you sick, using “spoiled” data may be bad for your organization’s health. There may be plenty of data, but it has to be reliable and consumable in order to be valuable. While most of the focus in enterprises is on how to store and analyze large amounts of data, it is just as important to keep this data fresh and flavorful. How do you do this? By monitoring, auditing, testing, managing, and controlling the data. In other words, it is all about putting strong data governance in place for enterprise data platforms.
In the previous blogs of this series, we have covered the readiness of the Enterprise Data Platform for digital transformation along with mechanisms and tools for ingesting various kinds of enterprise data, the data processing options and capabilities including data lakes, and also how enterprise platforms have different shapes and flavors to store data. In this particular blog post, Deepa Deshpande and I discuss the importance of having quality data and how to monitor and set up data governance practices for enterprise digital data.
Data governance means asserting authority over the data in the organization by protecting, monitoring, and correcting it: in short, treating the data as the critical corporate asset that it is. (Related to this, Gartner has coined the term ‘infonomics’ to describe the theory and practice of treating information as an actual corporate asset.) The diagram below shows the architecture of the modern Enterprise Data Platform, in which the Data Monitoring and Governance function cuts across multiple layers. It does so because the governance layer has to ensure that data movement between layers follows the workflow and schedule requirements of the business. It also has to take proactive measures to meet the quality standards of the organization, check for errors, guard the systems against theft or loss, and alert system administrators when required. Finally, it has to account for user security and privacy, and make sure that the data workflows through the other layers adhere to privacy, security, regulatory, and compliance requirements.
With the proliferation of digital transformation, enterprises need to have an even more comprehensive data governance program. It is important for organizations to devise proper means and mechanisms to collect all the data that is flowing through enterprise systems. Within Enterprise Digital Transformation systems, the 3Vs of “big data” (volume, velocity, variety) manifest themselves at scale, and the scope of data governance expands to include data science and compliance.
Achieving Data Governance
Data governance is achieved by a multi-pronged approach. This section describes mechanisms and methodologies for achieving it. The most important step is to institutionalize a data governance team and a set of processes, and then to choose tools that help implement those processes and deliver results.
Building a data governance team
You cannot tell something is wrong unless you define what is right (Data Quality: The Accuracy Dimension, by Jack E. Olson)! The primary responsibility of a data governance group is to define the attributes of correct data. Using these definitions, one can determine the quality of data at hand. Hence the data governance group needs people in the following kinds of roles:
- Business analysts: Understand the applications that work with the data. Business analysts are the real authority on business rules.
- Data analysts / data scientists: Form the data assurance team. They know data structures, data architecture, data modeling, and analysis techniques.
- IT staff: Maintain and manage applications. They know how the rules actually work and can provide verification.
- Database administrators: Help with structural issues that may come up and can provide physical data.
- Users / staff: Help in understanding the situations in which data is used. They report any unusual patterns that they see in the data or the application.
- Compliance team: The legal team defines how data coming from external sources may be consumed, which data can be made available to which set of people, and so on.
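To make the idea of "defining what is right" concrete, here is a minimal sketch of how a governance group might write down attributes of correct data as named rules and then measure a data set against them. The field names (`customer_id`, `age`, `country`), the rules themselves, and the sample records are all hypothetical and chosen only for illustration; real rules would come from the business analysts and the compliance team.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    """A named definition of what 'correct' means for one attribute."""
    name: str
    check: Callable[[Record], bool]


# Hypothetical business rules, for illustration only.
RULES: List[Rule] = [
    Rule("customer_id is present", lambda r: bool(r.get("customer_id"))),
    Rule(
        "age is between 0 and 120",
        lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    ),
    Rule(
        "country is a 2-letter code",
        lambda r: isinstance(r.get("country"), str) and len(r["country"]) == 2,
    ),
]


def pass_rates(records: List[Record], rules: List[Rule]) -> Dict[str, float]:
    """Return, for each rule, the fraction of records that satisfy it."""
    if not records:
        return {}  # nothing to measure
    return {
        rule.name: sum(1 for r in records if rule.check(r)) / len(records)
        for rule in rules
    }


# Made-up sample records.
sample = [
    {"customer_id": "C1", "age": 34, "country": "IN"},
    {"customer_id": "", "age": 29, "country": "US"},
    {"customer_id": "C3", "age": 150, "country": "DE"},
    {"customer_id": "C4", "age": 41, "country": "FR"},
]

rates = pass_rates(sample, RULES)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} of records pass")
```

Keeping the rules as data rather than burying them in application code is one possible design: it lets the people who own a rule review its definition without reading the rest of the program.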
Implementation mechanisms for data governance and monitoring
Mechanisms for achieving data governance include Data Profiling, Data Quality, Data Cleansing, Compliance, Security, Data Loss Prevention, and Data Monitoring. These actions need to be performed at the appropriate layers of the Enterprise Digital Transformation (EDT) stack. The table below shows the techniques, stakeholders, and tools associated with these mechanisms.
Technique | Description | Users / stakeholders | Tools |
Data Profiling and Lineage | Techniques for identifying the quality of data and tracing its lifecycle through the various phases. It is important to capture metadata at every layer of the stack so that it can be used for verification and profiling. | System users / Data analysts | Mahout, Datameer |
Data Quality | Data is considered to be of good quality if it meets business needs and satisfies its intended use. For some organizations accuracy may be of utmost importance, while for others it may be timeliness. Hence it is important to understand the dimension of greatest interest and to implement methods to achieve it. | System users / Data analysts | Talend, queries |
Data Cleansing | Once the quality requirements are clear, data cleansing is the implementation of solutions that correct incorrect or corrupt data. | System developers | Talend, SSIS, Informatica |
Compliance | Compliance is adherence to the standards defined for the data. A very important step is to define the business rules and the compliance requirements. Controls and audit procedures should be implemented to check ongoing compliance with the rules. | Legal team, Business / Data analysts | System audits, compliance and SOD matrix |
Security | Defining the matrix of users, groups, and roles along with their access definitions, and implementing it with the appropriate tools. Authentication and authorization / access control are part of defining and implementing security. | Business / Data security | Visualization admin tools |
Data Loss Prevention | Policies have to be in place to close the loopholes through which data can be lost. Identifying such data loss needs careful monitoring and quality assessment processes. | Users / Business analysts | Talend, other profiling tools |
Data Monitoring | Continuous monitoring of data, and sampling of huge volumes of data to assess quality, are an important part of the governance mechanisms and have to become a habit in order to achieve good-quality data. Monitoring can happen in the data storage layer. | Data analysts, users, support team members | SQL scripts, Talend |
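As an illustration of the "SQL scripts" entry for Data Monitoring in the table above, the following sketch runs a few quality checks in the data storage layer and flags any check whose violation ratio crosses a threshold. It is only a sketch under stated assumptions: the `orders` table, its columns, the three checks, and the 15% threshold are all hypothetical, and an in-memory SQLite database stands in for whatever store the platform actually uses.

```python
import sqlite3

# Hypothetical checks: each query counts the rows that violate one expectation.
CHECKS = {
    "missing_customer": (
        "SELECT COUNT(*) FROM orders "
        "WHERE customer_id IS NULL OR customer_id = ''"
    ),
    "negative_amount": "SELECT COUNT(*) FROM orders WHERE amount < 0",
    "duplicate_order_id": (
        "SELECT COUNT(*) FROM ("
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1)"
    ),
}


def run_checks(conn, checks, max_violation_ratio=0.15):
    """Run each check and flag those whose violation ratio exceeds the threshold."""
    total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    report = {}
    for name, sql in checks.items():
        violations = conn.execute(sql).fetchone()[0]
        ratio = violations / total if total else 0.0
        report[name] = {
            "violations": violations,
            "ratio": ratio,
            "alert": ratio > max_violation_ratio,
        }
    return report


# Made-up data in an in-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "C1", 120.0), (2, None, 35.5), (3, "C3", -10.0), (3, "C3", 80.0),
        (4, "C4", 15.0), (5, "", 60.0), (6, "C6", 22.0), (7, "C7", 9.5),
        (8, "C8", 300.0), (9, "C9", 45.0),
    ],
)

report = run_checks(conn, CHECKS)
for name, result in report.items():
    status = "ALERT" if result["alert"] else "ok"
    print(f"{name}: {result['violations']} violations ({result['ratio']:.0%}) {status}")
```

For very large tables, the same queries could be pointed at a sampled subset of rows rather than the full table, and the alert flag could feed whatever notification channel the support team already uses.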
Conclusion
Enterprises that are otherwise healthy can be made really sick by bad data. In our experience, many companies have misleading and at times outright wrong data in their enterprise systems. The frightening part is not that the data is wrong but that the staff is unaware that it is wrong. With Enterprise Digital Transformation, the volume of data in organizations is growing, and so is the need for strong data governance and monitoring. As described in this article, the technology and the infrastructure for data governance exist. What is missing? We found that it is the will and the awareness! The executive teams of such companies should take note and incorporate data quality analysis and monitoring into their corporate processes.
Here’s to healthy data consumption!
Image Credits: bruker.com
Dr. Siddhartha Chatterjee is Chief Technology Officer at Persistent Systems. Deepa Deshpande is an Architect at Persistent Systems.