DataLake Implementation – Is it purely a Technology Play?
In my last blog post, I talked about how easy it was to implement a SingleX layer on top of APIs exposed by different underlying ‘data’ systems. The SingleX (or SameX) initiatives in enterprises are making users’ lives really easy: the information a user needs is available in one app (or with the same experience across multiple apps).
But what happens to the ‘experience’ of ‘Data’ guys like me? The reality is that when I set out to solve a complex analytics problem, a lot of my time goes into searching for the data and getting hold of it. Most of the time, it sits in some data mart, in a small 6-node Hadoop cluster owned by a different business unit, or in some similar place. The highly hyped DataLake is supposed to solve that problem, and so was the Data Warehouse a decade back! Neither has fully succeeded yet.
At Persistent, we are helping at least 5 of our customers put together their DataLakes. What we are realizing more and more is that it is not only a technology play. Let me explain.
The root cause of Data Silos is the impedance mismatch between Data Producers and Data Consumers. Data Producers want more control and are always worried about compliance, visibility of usage, privacy and access control. On the other side, Data Consumers (all analytics guys fall in here) want to find the data quickly, need ‘confidence’ in the data, need rapid, real-time access, and typically want a very long history. Basically, data consumers want all sorts of flexibility. This leads to a big impedance mismatch.
The DataLake team tries to solve this problem by becoming an intermediate layer: a Data Custodian. But if the impedance mismatch is not understood and addressed as a top priority, the whole DataLake program can go for a toss. We have seen situations where the infrastructure is in place and the technologies to ingest a variety of data and monitor it are in place, but it is incredibly difficult to get real data onto the DataLake because the producers are not comfortable with it. If you keep your lake ‘dried’ (without data) for long, the ROI questions start popping up.
So, the top 2 priorities for a new DataLake initiative are:
a) Putting Governance in place
b) Socializing the Governance
This gives a lot of confidence to the data producers, and on-boarding their data becomes easy. On the other side, the sheer accessibility of a variety of data on the same platform is a ‘dream come true’ for Analytics guys. If you top up the experience with features like a) validated datasets, b) data discovery, c) different consumption mechanisms (BI tool, SQL, Machine Learning and programming-language access), d) an easy workflow to request access and, finally, e) lots of historical data, then it’s the perfect icing on the cake.
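To make the governance idea a little more concrete, here is a minimal sketch in Python of a catalog that covers three of the features above: validated datasets, data discovery, and an access-request workflow where the producer stays in control. Everything in it (the `Catalog`, `Dataset` and `AccessRequest` names, the tags, the example dataset and user names) is hypothetical and only for illustration. It is not the API of any particular DataLake product.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner: str                      # the data producer, who controls access
    tags: set
    validated: bool = False         # has the custodian team validated it?
    readers: set = field(default_factory=set)


@dataclass
class AccessRequest:
    dataset: str
    user: str
    status: str = "pending"         # becomes "approved" once the owner agrees


class Catalog:
    """Hypothetical in-memory catalog acting as the Data Custodian."""

    def __init__(self):
        self.datasets = {}
        self.audit_log = []         # gives producers visibility of usage

    def register(self, dataset):
        self.datasets[dataset.name] = dataset
        self.audit_log.append(f"{dataset.owner} registered {dataset.name}")

    def discover(self, tag, validated_only=True):
        """Data discovery: find dataset names by tag."""
        return sorted(
            d.name for d in self.datasets.values()
            if tag in d.tags and (d.validated or not validated_only)
        )

    def request_access(self, user, name):
        self.audit_log.append(f"{user} requested access to {name}")
        return AccessRequest(dataset=name, user=user)

    def approve(self, request, approver):
        """Only the producer who owns the data may grant access."""
        dataset = self.datasets[request.dataset]
        if approver != dataset.owner:
            raise PermissionError("only the data owner can approve access")
        dataset.readers.add(request.user)
        request.status = "approved"
        self.audit_log.append(f"{approver} approved {request.user} on {dataset.name}")

    def can_read(self, user, name):
        dataset = self.datasets[name]
        return user == dataset.owner or user in dataset.readers


# Example usage with made-up names.
catalog = Catalog()
catalog.register(Dataset("sales_orders", owner="finance", tags={"sales"}, validated=True))
catalog.register(Dataset("sales_raw", owner="finance", tags={"sales"}))
request = catalog.request_access("analyst", "sales_orders")
catalog.approve(request, approver="finance")
```

The point of the sketch is the shape of the contract, not the code itself: producers keep approval rights and an audit trail, while consumers get one place to search and a predictable way to ask for access.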
Thankfully, BigData technologies are becoming more and more mature. There are tools and solutions for each of these challenges: flexible data ingestion, low-cost multi-structured storage, data integration and lineage, security, access control, data discovery, data exploration, and multi-type workloads. The problem is that all these tools are like individual pieces of a big jigsaw puzzle. The DataLake team needs to put them together! This definitely needs smart and hard work from an experienced team. But it’s obviously worth it, since nothing is more inviting than a beautiful, salient, deep, and well-hydrated lake.