Effective Data Governance Is the Key to Controlling and Trusting Data Quality in the Data Lake

Does data governance bring more control to data producers or provide trusted data to business leaders?

Data governance is often misinterpreted as mere policing. Why does data need to be governed? Why not let it flow freely and be consumed, transformed and explored? Well, if no data governance process or tools are put in place to organize, monitor and track data sets, the data lake can soon turn into a data swamp, because users lose track of what data is there, or won’t trust the data because they don’t know where it comes from. As businesses become more data-driven, data governance becomes increasingly critical. It is important to have effective control and tracking of data.

As mentioned, the data lake is not a pure technology play: we pointed out that data governance must be a top priority for the data lake implementation. Carrying this forward, my colleagues then discussed security, flexible data ingestion and tagging in the data lake. In this blog, I will discuss the “what, why and how” of data governance with a focus on data lineage and data auditing.

While there has been a lot of buzz and many proofs of concept around big data technologies, the main reason big data technologies have not seen acceptance in production environments is the lack of data governance processes and tools. To add to this, there are multiple definitions and interpretations of data governance. To me, data governance is about the processes and tools used to:

Provide traceability: any data transformation or any rule applied to data in the lake can be tracked and visualized.
Provide trust: assure business users that they are accessing data from the right source of information.
Provide auditability: any access to data will be recorded in order to satisfy compliance audits.
Enforce security: assure data producers that data inside the data lake will be accessed by only authorized users. This was already discussed in our security blog.
Enhance discovery: business users will need flexibility to search and explore data sets on the fly, on their own terms. It is only when they discover the right data that they can find insights to grow and optimize the business. This was discussed in our tagging blog.

In short, data governance is the means by which a data custodian can balance control requested by data producers and flexibility requested by consumers in the data lake.

Implementation of data governance in the data lake depends entirely on the culture of the enterprise. Some enterprises may already have very strict policies and control mechanisms in place for accessing data, and for them it is easier to replicate these same mechanisms when implementing the data lake. Enterprises where this is not the case need to start by defining the rules and policies for access control, auditing and tracking data.

For the rest of this blog, let me discuss data lineage and data auditing in more detail, as the security and discovery requirements have already been discussed in previous blogs.

Data Lineage

Data Lineage is a process by which the lifecycle of data is managed to track its journey from origin to destination, and visualized through appropriate tools.

By visualizing the data lineage, business users can trace data sets and the transformations related to them. This will allow business users, for instance, to identify and understand the derivation of aggregated fields in a report. They will also be able to reproduce data points shown in the data lineage path. Ultimately, this helps build trust with data consumers around the transformations and rules applied to data as it goes through a data analytics pipeline. It also helps in debugging the data pipeline step by step.

Data lineage visualization should show users all the hops the data has taken before generating the final output. It should display the queries run, the tables and columns used, and any formulas or rules applied. This representation could be shown as nodes (data hops) and processes (transformations or formulas), thus maintaining and displaying the dependencies between datasets belonging to the same derivation chain. Please note that, as explained in our tagging blog, tags generalize metadata information such as table names, column names, data types, and profiles. Hence, tags should also be part of the derivation chain.
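To make the nodes-and-processes idea concrete, here is a minimal sketch in Python, not tied to any particular lineage tool. The dataset and process names are hypothetical, and a real tool would also carry the queries, columns and tags on each hop.

```python
from collections import defaultdict

# Hypothetical dataset and process names, for illustration only.
# Each edge reads: output dataset <- (process) <- input dataset.
lineage_edges = [
    ("claims_clean", "dedupe_claims", "claims_raw"),
    ("claims_by_hospital", "aggregate_by_hospital", "claims_clean"),
    ("claims_by_hospital", "aggregate_by_hospital", "hospitals_ref"),
    ("quarterly_report", "build_report", "claims_by_hospital"),
]

# Index the edges by output dataset so we can walk upstream.
upstream = defaultdict(list)
for output, process, source in lineage_edges:
    upstream[output].append((process, source))


def trace_lineage(dataset, depth=0, hops=None):
    """Walk the derivation chain from a dataset back to its origins."""
    if hops is None:
        hops = []
    for process, source in upstream.get(dataset, []):
        hops.append((depth, dataset, process, source))
        trace_lineage(source, depth + 1, hops)
    return hops


# Print every hop behind the final output, indented by distance from it.
for depth, dataset, process, source in trace_lineage("quarterly_report"):
    print("  " * depth + f"{dataset} <- [{process}] <- {source}")
```

Storing lineage as edges keeps the dependencies between datasets explicit, which is what a visualization layer would need in order to draw the derivation chain.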

Data lineage can be metadata driven or data driven. Let me explain both in more detail.

In metadata-driven lineage, the derivation chain is composed of metadata such as table names, view names, column names, as well as mappings and transformations between columns in datasets that are adjacent in the derivation chain. This includes tables and/or views in the source database, and tables in a destination database outside the lake.
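As an illustration only, metadata-driven lineage can be sketched as a lookup over column mappings. The table names, column names and expressions below are hypothetical; no actual data values are involved, only metadata.

```python
# Hypothetical column mappings: target column -> (expression, source columns).
column_mappings = {
    "report.total_claim_amount": ("SUM(amount)", ["claims_clean.amount"]),
    "claims_clean.amount": ("CAST(amt AS DECIMAL)", ["claims_raw.amt"]),
    "report.hospital_name": ("name", ["hospitals_ref.name"]),
}


def column_lineage(column):
    """Return (column, expression) pairs from a column back to its sources."""
    chain = []
    frontier = [column]
    while frontier:
        current = frontier.pop()
        if current in column_mappings:
            expression, sources = column_mappings[current]
            chain.append((current, expression))
            frontier.extend(sources)
        else:
            # No mapping recorded: treat the column as an origin.
            chain.append((current, None))
    return chain


for name, expression in column_lineage("report.total_claim_amount"):
    print(name, "<-", expression or "(source column)")
```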

In data-driven lineage, the user identifies the individual data value for which they require lineage, which implies tracing back to the original row level values (raw data) before they were transformed into the selected data value.

For example, let’s suppose a business user at a health insurance company is looking at claim reimbursement reports submitted at the end of the quarter. The user notices a sudden rise in claims from one hospital, even though a similar number of patients was admitted during the previous quarter. The user now wants to take a closer look. For this, the claim amount should be “drillable” so that it can be broken down into administrative fees, tax amounts, and hospitalization fees. From the hospitalization fees amount, the user should be able to drill into the different procedure codes for the medical provider’s consulting fees, medical items used during hospitalization and any labs or tests conducted. The process continues until the user looks up and matches the authorized procedure codes and the limits on charges for them.

Hence data-driven data lineage is very important in trusting the data so as not to draw premature conclusions about the resulting data. At the metadata level things may look fine, but there may be multiple causes of error at the data level that would be spotted faster with a data-driven experience.
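The drill-down in the insurance example can be sketched roughly as follows. The line items, field names and amounts are made up for illustration; the point is only that each subtotal keeps a reference to the raw rows that produced it.

```python
from collections import defaultdict

# Hypothetical raw claim line items; all values are invented.
line_items = [
    {"claim_id": "C1", "category": "hospitalization", "procedure_code": "P100", "amount": 1200.0},
    {"claim_id": "C1", "category": "hospitalization", "procedure_code": "P200", "amount": 300.0},
    {"claim_id": "C1", "category": "administrative", "procedure_code": None, "amount": 50.0},
    {"claim_id": "C1", "category": "tax", "procedure_code": None, "amount": 75.0},
]


def drill_down(rows, key):
    """Break a total into subtotals by `key`, keeping the contributing rows."""
    groups = defaultdict(lambda: {"subtotal": 0.0, "rows": []})
    for row in rows:
        group = groups[row[key]]
        group["subtotal"] += row["amount"]
        group["rows"].append(row)
    return dict(groups)


# Level 0: the claim amount shown in the report.
claim_total = sum(row["amount"] for row in line_items)

# Level 1: administrative fees, tax amounts, hospitalization fees.
by_category = drill_down(line_items, "category")

# Level 2: procedure codes behind the hospitalization fees.
by_procedure = drill_down(by_category["hospitalization"]["rows"], "procedure_code")
```

Because every level retains its underlying rows, the user can keep drilling until they reach the raw values and compare them against authorized procedure codes and charge limits.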

It is challenging at times to capture data lineage if transformations are complex and are being hand coded by developers to meet business needs. In these cases, developers could just name the process or job which is doing the transformation. Another challenge is the mixed set of tools for addressing governance in an open source world. Lineage tools, part of the mix, should integrate with other data governance tools like security and tagging tools or provide REST APIs for system integrators to integrate and build a common seamless user interface. For example, data classifications or tags authored using the tagging tool should be visible in the data lineage tool to see lineage based on tags.

Data Auditing

Data auditing is the process of recording access to and transformation of data in order to address business fraud risk and compliance requirements. Data auditing needs to track changes to key elements of datasets and capture “who / when / how” information about changes to these elements.

A good auditing example is vehicle title information, where governments typically mandate storing the history of vehicle title changes along with information about when, by whom, how and possibly why the title was changed.
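A “who / when / how” record for the vehicle title example might look like the sketch below. The field names and the `record_change` helper are assumptions made for illustration, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEvent:
    dataset: str
    element: str                  # the key element that changed
    old_value: Optional[str]
    new_value: Optional[str]
    changed_by: str               # who
    changed_at: datetime          # when (stored in UTC)
    method: str                   # how, e.g. the form or job that made the change
    reason: Optional[str] = None  # why, if known


audit_log = []


def record_change(dataset, element, old_value, new_value, changed_by, method, reason=None):
    """Append an immutable audit event; earlier events are never modified."""
    event = AuditEvent(dataset, element, old_value, new_value, changed_by,
                       datetime.now(timezone.utc), method, reason)
    audit_log.append(event)
    return event


event = record_change("vehicle_titles", "owner", "A. Smith", "B. Jones",
                      changed_by="clerk_17", method="title_transfer_form",
                      reason="sale")
```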

Why is data auditing a requirement for data lakes? Well, transactional databases don’t generally store the history of changes, let alone extra auditing information. This takes place in traditional data warehouses. However, audit data requires its share of storage, so after 6 months or a year it is a common practice to move it offline. From an auditing perspective, this period of time is small. As data in the lake is held for much longer periods of time, and as data lakes are perfect candidate data sources for a data warehouse, it makes sense that data auditing becomes a requirement for data lakes.

Data auditing also keeps track of access control data in terms of how many times an unauthorized user tried to access data. It is also useful to audit the logs recording denial of service events.
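As a small sketch of this kind of tracking, denied access attempts can be counted per user from an access log. The log entries, field names and threshold here are hypothetical.

```python
from collections import Counter

# Hypothetical access-log entries; the field names are assumptions.
access_log = [
    {"user": "alice", "dataset": "claims_raw", "allowed": True},
    {"user": "mallory", "dataset": "claims_raw", "allowed": False},
    {"user": "mallory", "dataset": "claims_clean", "allowed": False},
    {"user": "bob", "dataset": "claims_raw", "allowed": False},
    {"user": "mallory", "dataset": "claims_raw", "allowed": False},
]


def denied_attempts(entries):
    """Count denied access attempts per user."""
    return Counter(e["user"] for e in entries if not e["allowed"])


def users_over_threshold(entries, threshold):
    """Users whose denied attempts reach the threshold, in sorted order."""
    return sorted(u for u, n in denied_attempts(entries).items() if n >= threshold)


print(denied_attempts(access_log))
print(users_over_threshold(access_log, threshold=3))
```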

While data auditing does require a process and implementation effort, it definitely brings benefits to the enterprise. It saves effort in the event of an audit for regulatory compliance (which would otherwise have to be done manually, a painful process), and it makes the overall auditing process more efficient.

Data auditing may be implemented in two ways: either by copying previous versions of dataset data elements before making changes, as in the traditional data warehouse slowly changing dimensions [1], or by making a separate note of what changes have been made, through DBMS mechanisms such as triggers, specific CDC features [2], or auditing DBMS extensions [3].
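The difference between the two approaches can be sketched in plain Python. This is a simplified illustration using in-memory structures and hypothetical field names, not a DBMS trigger or CDC implementation.

```python
from datetime import date


# Approach 1: keep previous versions, in the style of a Type 2
# slowly changing dimension.
def apply_versioned_update(history, key, new_value, effective):
    """Close the current version of `key` and append a new open version."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = effective
    history.append({"key": key, "value": new_value,
                    "valid_from": effective, "valid_to": None})


# Approach 2: overwrite in place and note the change in a separate log.
def apply_logged_update(current, change_log, key, new_value, effective):
    """Record old and new values in the change log, then overwrite."""
    change_log.append({"key": key, "old": current.get(key),
                       "new": new_value, "changed_on": effective})
    current[key] = new_value


history = [{"key": "VIN123", "value": "A. Smith",
            "valid_from": date(2020, 1, 1), "valid_to": None}]
apply_versioned_update(history, "VIN123", "B. Jones", date(2024, 6, 1))

current = {"VIN123": "A. Smith"}
change_log = []
apply_logged_update(current, change_log, "VIN123", "B. Jones", date(2024, 6, 1))
```

In the first approach the history lives with the data itself; in the second the data holds only the latest value and the history lives in the log.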

To implement data auditing in the data lake, the first step is to scope out auditing, i.e., identify the datasets that are required to be audited. Don’t push for auditing on every dataset: it not only requires additional processing of data, it may also end up hampering the performance of your application. Identify business needs, and then create a list of datasets and the rules associated with them (e.g. who can access a dataset, a legal retention requirement of 1 year) in some kind of repository.
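Such a repository could start out as simply as the following sketch. The entry fields, dataset names and roles are hypothetical; in practice this would live in a metadata store rather than in code.

```python
from dataclasses import dataclass, field


@dataclass
class AuditScopeEntry:
    dataset: str
    audited: bool
    allowed_roles: set = field(default_factory=set)  # who can access it
    retention_days: int = 0                          # legal retention requirement
    business_reason: str = ""


# Hypothetical registry: only datasets with a business need are audited.
registry = {
    entry.dataset: entry
    for entry in [
        AuditScopeEntry("claims_raw", True, {"claims_analyst", "auditor"},
                        365, "regulatory reporting"),
        AuditScopeEntry("sandbox_experiments", False),
    ]
}


def needs_audit(dataset):
    """A dataset is audited only if it is registered and flagged as audited."""
    entry = registry.get(dataset)
    return entry is not None and entry.audited
```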

The next step is to categorize or tag your datasets in terms of importance in the enterprise. While this won’t help in searching or indexing, it does help in scoping the level of audit activity for each type of dataset. This categorization can be driven by:

Whether data sets are raw data, transformational (computed data) or test/experimental data.
Type of data set, i.e., whether it is structured data, or text, images, video, audio, etc.

Define policies and identify data elements (like location of data, condition/status or actual value itself) which need to be collected as part of the data auditing.
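One way to picture how categorization drives the audit policy is a small lookup table. The audit levels, the element names and the rule for unstructured data below are all assumptions chosen for illustration.

```python
from enum import Enum


class Origin(Enum):
    RAW = "raw"
    COMPUTED = "computed"
    EXPERIMENTAL = "experimental"


# Hypothetical policy table: audit level and data elements to collect.
AUDIT_POLICY = {
    Origin.RAW: {"level": "full", "elements": ["location", "status", "value"]},
    Origin.COMPUTED: {"level": "changes_only", "elements": ["location", "status"]},
    Origin.EXPERIMENTAL: {"level": "none", "elements": []},
}


def audit_policy_for(origin, structured):
    """Return a copy of the policy for a dataset category."""
    policy = dict(AUDIT_POLICY[origin])
    policy["elements"] = list(policy["elements"])
    if not structured and "value" in policy["elements"]:
        # Assumption: for text, images, video or audio, the actual value
        # is not collected, only its location and status.
        policy["elements"].remove("value")
    return policy
```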

Once the above steps are done, data auditing tools should generate audit data to be stored in a repository, and provide an interface to browse and assess that audit data. They should alert or notify business users in case of any compliance violation. The data auditing tool should also provide a facility to generate reports and dashboards on findings, based on the rules and policies already set.
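The alerting step could be sketched as a check of audit events against the rules already defined. The rule shape, event fields and the `notify` helper are hypothetical; a real tool would route alerts to email, tickets or a dashboard.

```python
# Hypothetical rules and audit events.
rules = {"claims_raw": {"allowed_roles": {"claims_analyst", "auditor"}}}

events = [
    {"dataset": "claims_raw", "user": "alice", "role": "claims_analyst", "action": "read"},
    {"dataset": "claims_raw", "user": "eve", "role": "marketing", "action": "read"},
    {"dataset": "weblogs", "user": "bob", "role": "marketing", "action": "read"},
]


def find_violations(events, rules):
    """Return events where the user's role is not allowed on the dataset."""
    violations = []
    for event in events:
        rule = rules.get(event["dataset"])
        if rule is None:
            continue  # dataset is outside the audit scope
        if event["role"] not in rule["allowed_roles"]:
            violations.append(event)
    return violations


def notify(violations, send=print):
    """Send one alert per violation; `send` stands in for a real channel."""
    for v in violations:
        send(f"ALERT: {v['user']} ({v['role']}) performed {v['action']} on {v['dataset']}")


notify(find_violations(events, rules))
```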

Overall, the data auditing process should be able to provide a listing of audited datasets, their location, a person or group responsible for managing, tracking and updating datasets and finally a list of reports for auditors to check compliance and business policy data.

It is better to build or identify existing data audit frameworks to collect the information about these data sets and support policies and rules required for auditing. Using the framework will help in automating some of these processes efficiently and risk-free. The framework may require a separate policy and rule engine to be built so that policies applied to data sets can be rationalized/prioritized automatically based on the risks and compliance fulfillment.
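As a very rough sketch of what prioritizing policies by risk and compliance could mean, policies can be ordered with compliance-mandated ones first and higher risk before lower. The `Policy` fields and the 1 to 5 risk scale are assumptions, not part of any specific framework.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    name: str
    risk: int       # hypothetical score from 1 (low) to 5 (high)
    mandated: bool  # True if required for regulatory compliance


def prioritize(policies):
    """Mandated policies first, then descending risk, then name for stable ties."""
    return sorted(policies, key=lambda p: (not p.mandated, -p.risk, p.name))


policies = [
    Policy("log_reads", 2, False),
    Policy("retain_1y", 3, True),
    Policy("mask_pii", 5, False),
    Policy("record_title_changes", 4, True),
]

for policy in prioritize(policies):
    print(policy.name)
```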

Data Governance Tools

Implementing data governance requires deep knowledge of processes and a way to integrate disparate tools available in the open and commercial world.

In the open source Hadoop world, Hortonworks has established a data governance initiative in collaboration with Aetna, Merck, Target and SAS. With support from some other enterprises, they are incubating Apache Atlas, a project to provide a tool to help with data governance of the data lake. For data auditing, Apache Atlas together with Apache Ranger provides an option in the Hadoop world. Other tools, such as Apache Falcon, Cloudera Navigator, IBM InfoSphere Metadata Workbench, and Waterline Data, also provide data governance facilities in the Hadoop world. But I must add that some of these tools are still at a very early stage of their release, or may be in their first version. At these early stages, support from system integrators will be needed to integrate these disparate tools and technologies as seamlessly as possible.

Conclusion

Data governance should be a top priority while planning for the data lake. Enterprises should define comprehensive and effective processes for data governance to make sure the data in the lake can be found and assigned some level of trust, as well as to meet their business needs for risk mitigation and compliance. Now, I would like to hear your comments, especially on the lineage and auditing features of data governance. Or, if you are looking to implement a data governance process in your organization, feel free to reach out to me.

Author: Sunil Agrawal is the chief architect in the Corporate CTO team at Persistent Systems.

Image Credit: www.fondriest.com

References:

[1] http://www.kimballgroup.com/2008/08/slowly-changing-dimensions/
[2] http://www.dwh-club.com/dwh-bi-articles/change-data-capture-methods.html
[3] http://tdan.com/database-auditing-capabilities-for-compliance-and-security/8135