Continuing where we left off last time, this second of three posts on technology trends covers those trends (Adaptive Security and Data Governance, Data Lakes, and Interactive Analytics at Scale) that relate to the lifeblood of today’s software-driven business – data, along with the analytics performed on this data to obtain insights and drive business actions.
Trend 3: Adaptive Security and Data Governance
From data breaches to ransomware to state-sponsored hacking, security threats have taken on a new and more serious dimension in recent years. Additionally, the network perimeter keeps expanding due to digital strategies that involve Bring Your Own Device (BYOD) and the opening up of controlled enterprise access to vendors, partners, and customers. This places security squarely among the most important strategic pillars of a successful digital plan.
Fixed enterprise security policies have come up short against multi-pronged attack vectors such as zero-day exploits and Advanced Persistent Threats (APTs). As digital strategies open the enterprise further and drive up customer data collection, protection against these threats requires that security policies be enforced using a dynamic, risk-based model rather than static rules. Enterprise security tools of 2016 will ingest data and signals from endpoint security agents, application logs, external threat feeds, access control systems, and Data Loss Prevention (DLP) platforms to continually score users and their activity on the network, and to adopt the security posture that suits the complete context rather than isolated events. For example, an authorized user transferring a large amount of data at odd hours may raise a flag, even though the access itself is permitted. These systems will also correlate data across input sources to detect suspicious activity. The emphasis will be on raising fewer but highly accurate alerts that require immediate attention and action. Such systems will leverage machine learning algorithms to adapt the security posture to the networks being monitored.
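To make the idea of context-based scoring concrete, here is a minimal sketch in Python. It is not how any particular security product works: the `AccessEvent` fields, the weights, and the thresholds are all hypothetical values chosen for illustration, and the hand-set weights stand in for what a real system would learn from data.

```python
from dataclasses import dataclass


@dataclass
class AccessEvent:
    user: str
    authorized: bool
    bytes_transferred: int
    hour: int  # local hour of day, 0-23


# Hypothetical thresholds and weights, for illustration only.
LARGE_TRANSFER_BYTES = 1_000_000_000
BUSINESS_HOURS = range(8, 19)  # 08:00 through 18:59
ALERT_THRESHOLD = 0.7


def risk_score(event: AccessEvent) -> float:
    """Combine several weak signals into one score between 0 and 1."""
    score = 0.0
    if not event.authorized:
        score += 0.7
    if event.bytes_transferred >= LARGE_TRANSFER_BYTES:
        score += 0.4
    if event.hour not in BUSINESS_HOURS:
        score += 0.4
    return min(score, 1.0)


def should_alert(event: AccessEvent) -> bool:
    """Alert on the combined context, not on any single isolated signal."""
    return risk_score(event) >= ALERT_THRESHOLD
```

With these assumed weights, a large transfer during business hours or a small transfer at night stays below the alert threshold, while an authorized user moving a large amount of data at 2 a.m. crosses it. That is the "fewer but more accurate alerts" behavior described above.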
Data breaches rocked our world in 2014 and 2015, from the major retailers in the United States to the federal government’s own comprehensive background check database, which holds some of the most sensitive information on its employees. Stringent data privacy laws, led by Europe, together with consumers’ heightened awareness of and concern for their private data, are the second driving force that will make data governance a major issue in 2016. Given the dynamic network boundary, a one-size-fits-all approach does not work. Enterprises will adopt tools that let them prioritize and designate the governance of their data and resources at the most granular level. Compliance tools that regulate cross-border data exchange will be an integral part of the data-driven digital ecosystem. In domains such as healthcare, genomics research, and personalized medicine, the ability to share data while maintaining strict confidentiality standards will be paramount. In the IoT segment, computationally constrained end devices mean that deeper data analysis has to happen in the cloud; given the highly personal nature of IoT data, this will require granular governance and data retention policies.
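As a rough illustration of what "granular" governance could look like, the sketch below attaches a policy to each individual field rather than to a whole dataset. The field names, region codes, and retention periods are hypothetical and are not drawn from any specific regulation or product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FieldPolicy:
    classification: str
    allowed_regions: frozenset
    retention_days: int


# Hypothetical per-field policies, for illustration only.
POLICIES = {
    "email": FieldPolicy("personal", frozenset({"EU"}), 365),
    "heart_rate": FieldPolicy("sensitive", frozenset({"EU"}), 30),
    "page_views": FieldPolicy("public", frozenset({"EU", "US"}), 730),
}


def may_transfer(field: str, destination_region: str) -> bool:
    """Deny by default: fields without a policy never cross a border."""
    policy = POLICIES.get(field)
    return policy is not None and destination_region in policy.allowed_regions


def is_expired(field: str, collected_on: date, today: date) -> bool:
    """True once a value has outlived its field's retention period."""
    policy = POLICIES[field]
    return today - collected_on > timedelta(days=policy.retention_days)
```

The deny-by-default choice in `may_transfer` is a design assumption on our part: when the network boundary is dynamic, it seems safer for an unclassified field to be blocked than to be shared by accident.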
Trend 4: Data Lakes
A data lake, as originally described by James Dixon, is a large storage repository of raw data, in contrast to a data mart, which is a smaller repository that stores only a subset of attributes and aggregated data. Over the years, the term data lake has come to describe a platform that can ingest multi-structured data from various internal and external sources and make it available for applications to consume however they want. Data lakes offer organizations not only a cost-effective way to store a lot of data, but also a way to optimize existing businesses and address new business opportunities, such as cross-channel analysis of customer interactions and behaviors to offer personalized services; storage and analysis of various kinds of patient data (physician’s notes, radiology images, prescriptions, reports, etc.) to predict the likelihood of re-admission; and storage and analysis of Web data (server logs, clickstream data, etc.) to improve advertisement selection and placement.
Most data lake deployments today are based on Apache HDFS (Hadoop Distributed File System), which is a low-cost solution for storing big data. Although HDFS is core to data lakes, organizations need tools for data quality, data governance, metadata discovery, etc. to make this data ready for businesses to consume. Without the tools to govern and manage your data lake, it effectively turns into a data graveyard. While some organizations have successful data lake implementations, the technology is still not mature. There are still limited options available for data ingestion frameworks, data quality, metadata discovery, etc. This is a key reason why the adoption of data lakes in enterprises has been slow.
Organizations like Cloudera, Hortonworks (Apache Hadoop 2.0, Apache Falcon, Apache Atlas), Waterline Data (Automated Data Inventory), StreamSets (Data Collector), LinkedIn (Gobblin – Data Ingestion framework) and many others are working on building and enhancing products to simplify and streamline data lake implementations. In 2016, we will see the emergence of more out-of-the-box technology options for data lakes, along with best practices and guidelines that make it easier for enterprises to implement data lakes, both on-premises and in the cloud.
Here is our previous post on this topic: Data Processing: Drinking from the Data Reservoir.
Trend 5: Interactive Analytics at Scale
“Life is a banquet and most poor suckers are starving to death”, said Auntie Mame in the eponymous musical. That can be applied to a lot of organizations today when it comes to data – too much can be, well, too much to take in. Today, it is common for enterprises to try to analyze terabytes to petabytes of multi-structured data, where text, audio, and visual data are analyzed side-by-side with structured tabular data. At the same time, users are making increasing demands on the data visualization component of data analysis: visualization and analysis at multiple resolutions, a higher bar for the aesthetics and ease of use of visualizations, and intuitive ways of interacting with them. While big data technologies allow analyzing large data sets, they have also brought the additional complexity of multiple technologies and different programming paradigms. Additionally, integration challenges remain at every interface in the data analysis pipeline. These challenges have made the task of data analysis increasingly difficult and time-consuming.
In 2016, emerging data platforms will tackle this challenge by abstracting the complexity of this pipeline for enterprise IT and business analysts. They will offer a “single stack” interface that lets analysts focus on the business problem and define their own analytics in a declarative form, which the system then translates into a processing pipeline that runs from loading to cleanup to analysis and visualization. These platforms will be collaborative by design and not as an afterthought.
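A toy version of this idea, in Python, might look like the sketch below. The spec format (`clean`, `analyze`, `visualize` keys) and every function name are hypothetical; the point is only that the analyst writes what they want, and a small compiler decides how to run it.

```python
from collections import defaultdict


def make_clean(required_fields):
    """Build a cleanup stage that drops rows missing any required field."""
    def clean(rows):
        return [r for r in rows
                if all(r.get(f) is not None for f in required_fields)]
    return clean


def make_aggregate(group_by, sum_field):
    """Build an analysis stage that totals sum_field per group_by value."""
    def aggregate(rows):
        totals = defaultdict(float)
        for r in rows:
            totals[r[group_by]] += r[sum_field]
        return dict(totals)
    return aggregate


def render_text(totals):
    """A stand-in for visualization: one sorted text line per group."""
    return [f"{key}: {value:.1f}" for key, value in sorted(totals.items())]


def compile_spec(spec):
    """Translate a declarative spec into an ordered list of stages."""
    stages = []
    if "clean" in spec:
        stages.append(make_clean(spec["clean"]["drop_missing"]))
    analyze = spec["analyze"]
    stages.append(make_aggregate(analyze["group_by"], analyze["sum"]))
    if spec.get("visualize") == "text":
        stages.append(render_text)
    return stages


def run(spec, rows):
    """Load the rows, then push them through every compiled stage."""
    data = list(rows)
    for stage in compile_spec(spec):
        data = stage(data)
    return data
```

An analyst would then write only the spec, for example `{"clean": {"drop_missing": ["region", "revenue"]}, "analyze": {"group_by": "region", "sum": "revenue"}, "visualize": "text"}`, and never touch the stage functions directly.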
Enterprise business analysts and decision makers will also expect to analyze this data interactively, drilling down into details or exploring alternative hypotheses. Our diminishing ability to anticipate the query load a priori (coupled with schema-on-use) in a big data ecosystem has led to a renewed interest in online aggregation techniques. Online aggregation attempts to handle ad hoc queries quickly without pre-computation, by incrementally refining approximate answers. This approach requires a fundamental change in the end-user mindset: complete accuracy in large-scale analytic queries matters less than the ability to balance accuracy and running time in a flexible way. Techniques such as on-the-fly sampling, pre-computed samples on base tables, and query resolution reduction to match consumer display constraints, as well as a synthesis of these approaches into end-to-end systems where all of them can be leveraged in tandem, will become increasingly mainstream in 2016.
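The sketch below shows the flavor of online aggregation for a single query, an average over one column. It is a simplified illustration under our own assumptions, not a description of any specific system: it reads rows in random order, reports a running estimate with an approximate 95% confidence interval (a normal approximation that ignores the finite-population correction), and stops as soon as the interval is tight enough. The function name and parameters are hypothetical.

```python
import math
import random


def online_mean(values, tolerance, z=1.96, batch_size=100, seed=0):
    """Yield (rows_seen, estimate, half_width) after each batch of rows.

    Stops early once the confidence interval half-width is within
    tolerance; otherwise runs to the end and yields the exact mean.
    """
    rng = random.Random(seed)
    order = list(range(len(values)))
    rng.shuffle(order)  # a random order makes every prefix a random sample

    n, mean, m2 = 0, 0.0, 0.0  # Welford's running mean and squared deviations
    for idx in order:
        x = values[idx]
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)

        if n % batch_size == 0 or n == len(values):
            if n > 1:
                # Standard error of the mean from the sample variance.
                half_width = z * math.sqrt(m2 / (n - 1) / n)
            else:
                half_width = math.inf
            yield n, mean, half_width
            if half_width <= tolerance:
                return
```

A dashboard could consume this generator and redraw after each yielded step, letting the user stop the query when the answer looks good enough. Loosening `tolerance` trades accuracy for running time, which is exactly the balance described above.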
Image Credits: wired.com
Continue on to the next post here.
Dr. Siddhartha Chatterjee is Chief Technology Officer at Persistent Systems. Other contributors to this series include the following members of the CTO office: Chandraprakash Jain (Senior Architect), Chandrakant Shinde (Senior Architect), Dhruva Ray (Principal Architect), and Dr. Pandurang Kamat (Chief Architect).