Let Data Flow Securely through the Lake


In the previous blog post, Avadhoot Agasti introduced the data lake and described how it aims to resolve the impedance mismatch between data producers and consumers by introducing the role of a data custodian.

While the adoption of data lakes is increasing, questions arise around their security. How do we secure the Hadoop ecosystem hosting the data lake? How do we secure the data being ingested into and residing in the lake? How can different teams in the enterprise, including business analysts, data scientists, and developers, access the data in a secure manner? How do we enforce existing enterprise security models in this new infrastructure? Are there best practices for securing such an infrastructure?

It is important for enterprises adopting data lakes to protect sensitive information such as proprietary materials or personally identifiable information. There are also compliance requirements such as HIPAA, SOX, and PCI DSS, along with the need to adhere to corporate security policies. The issue gets magnified because the data lake is, in many cases, multi-tenant, hosting the data of multiple customers or business units.


Security should be implemented at all layers of the lake, starting from ingestion, through storage, analytics, and discovery, all the way to consumption. The basic requirement is to restrict access to data to trusted users and services only. Let’s take a look at a few features available for data lake security.

Guarding access to the cluster, known as network perimeter security, is an absolute must for any data lake. Apache Knox is a REST API gateway for interacting with Hadoop services and should be used to protect the lake from direct access from the outside world. Knox integrates with LDAP and Active Directory, and by default it supports end-to-end wire encryption using SSL. Authentication verifies a user’s identity and ensures users really are who they say they are; the Kerberos protocol provides a strong mechanism for authentication.
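As a rough sketch of how these two pieces fit together, the commands below obtain a Kerberos ticket and then reach WebHDFS through the Knox gateway instead of contacting the NameNode directly. The realm, hostnames, credentials, paths, and the `default` topology name are all placeholders, and running this requires a live, Kerberos-enabled cluster with Knox deployed:

```sh
# Obtain a Kerberos ticket for the user from the configured KDC.
kinit analyst@EXAMPLE.COM

# List a directory over WebHDFS, routed through the Knox gateway
# (default port 8443) rather than hitting the NameNode directly.
# Knox authenticates the request against LDAP/Active Directory.
curl -ku analyst:password \
  'https://knox.example.com:8443/gateway/default/webhdfs/v1/data/sales?op=LISTSTATUS'
```

Because all REST traffic funnels through the gateway, the cluster’s internal hosts and ports never need to be exposed outside the network perimeter.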

Access control is the next step in securing data: defining which datasets can be accessed by which users or services. Consider projects such as Apache Sentry or Apache Ranger, which enforce role-based, fine-grained authorization so that users and services can access only the data they have permission for.
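To make this concrete, here is a minimal sketch of creating an HDFS access policy through Ranger’s public REST API. The service name, path, group, admin credentials, and host are illustrative placeholders, and the call assumes a running Ranger admin server:

```sh
# Grant the "analysts" group read access to /data/sales, recursively,
# via Ranger's public v2 policy API. Ranger plugins in HDFS then
# enforce this policy on every filesystem access.
curl -u admin:password -H 'Content-Type: application/json' \
  -X POST 'https://ranger.example.com:6080/service/public/v2/api/policy' \
  -d '{
        "service": "cluster_hadoop",
        "name": "sales-read-only",
        "resources": {
          "path": { "values": ["/data/sales"], "isRecursive": true }
        },
        "policyItems": [{
          "groups": ["analysts"],
          "accesses": [
            { "type": "read",    "isAllowed": true },
            { "type": "execute", "isAllowed": true }
          ]
        }]
      }'
```

Defining policies centrally like this, rather than in per-service configuration files, is what makes fine-grained authorization manageable in a multi-tenant lake.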

Sensitive data in the cluster should be secured at rest as well as in motion. Proper data protection techniques shield data in the cluster from unauthorized visibility; encryption and data masking are required to ensure secure access to sensitive data. Cloudera Navigator provides a framework for transparently encrypting data at rest and also provides secure key management. Another aspect of data security is auditing data access by users, which can detect logon and access attempts as well as administrative changes. Apache Ranger and Cloudera Navigator provide plugin-based, centralized auditing.
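While transparent encryption protects data at rest without touching the records themselves, masking alters sensitive fields before they reach a shared zone of the lake. As a trivial, locally runnable illustration (the record format and SSN field are made up for the example):

```shell
# Static data masking sketch: redact the first five digits of a US SSN
# in a record before it lands where broader audiences can read it.
echo "name=Jane,ssn=123-45-6789" | sed -E 's/[0-9]{3}-[0-9]{2}-/XXX-XX-/'
# name=Jane,ssn=XXX-XX-6789
```

In practice, tools such as Ranger can apply similar column-level masking dynamically at query time, so analysts see redacted values while authorized users see the originals.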

As you can see, various security measures should be used to detect malicious behavior, data leakage, intrusion, and other security incidents. Security deserves special focus while architecting the data lake, to make sure data flows securely through it. Has your organization implemented a data lake with security built in? Do share your opinion about the solutions available in the data lake ecosystem for securing enterprise data.

Image Credits: The Los Angeles Times
