Tags – First (read: FAST) step to Discover, Explore and Enrich your Data Lake!
Have you faced these or similar situations?
- You decided to build a distributed data warehouse containing terabytes (TBs) of data. Data consumers are now asking for different insights, and the requirements for those insights change very frequently.
- Source schema and data are constantly changing with new data sets or new attributes getting added.
- Data analysts spend time (at times weeks!) browsing the data warehouse and are still unable to pull out all the relevant data, or retrieve only part of it, which may be of little significance to business owners.
My colleagues recently opened this data lake blog series, starting with the data lake concept and highlighting the importance of aspects like Security and Data Ingestion as the data lake grows in volume and its usage becomes widespread within the organization.
Let’s consider some interesting data sets before we delve into the topic of metadata labels, or tagging, which is the first step to discover or explore the data lake:
- The US PubMed data set comprises more than 26 million citations for biomedical literature, life science journals, and online books, with publication abstracts written by experts and annotated with topic terms. It can be downloaded as an XML file. A sample XML file can be found here: https://github.com/ldbib/MEDLINEXMLToJSON/blob/master/test/extensive-test.xml
- US Department of Agriculture food dataset in ASCII format, made available at site – http://www.ars.usda.gov/Services/docs.htm?docid=18879
- Consumer complaint data available
- MovieLens data – http://grouplens.org/datasets/movielens/20m/
- Customer master data tables from the CRM system (e.g. Salesforce, MS Dynamics CRM).
Some common aspects about these data collections include:
- They are either semi-structured (JSON, XML or flat files) or large structured data models (RDBMS)
- They have nested data objects with 1:1 or 1:many relationships – hierarchical data sets
- Many sets of attributes with categorical data – Wide data sets
Now, without knowing any other characteristics about the data, Data Analysts will find it very laborious to search for and retrieve these or similar data sets, or perform any significant analysis in a reasonable amount of time.
As an analogy, think of a data lake as a big retail mall where a lot of items (data) arrive in many trucks (ingestion) during the day. The store attendants (data analysts) have to sort the items based on the labels (tags) attached to the cartons and send them to the appropriate aisles, grouping items with related tags so that customers can find them easily. Customers can then look up an item’s price tag (an information enrichment) and compare it against similar items (analysis) to make a purchasing decision. Without labeling to support the search, customers would have a much harder time finding the items they eventually want to consume.
In a data lake, almost 100% of the time, data analysis queries “arrive late”, which means they will be unknown during load time or any subsequent stage. This is in sharp contrast to data warehousing, where the target data model is shaped to respond to the queries known in advance.
For this reason, after data ingestion, large data collections need an extensive data understanding stage before one can even start data preparation or analysis. Metadata tagging is the way to express this understanding.
Metadata tagging (or simply tagging) is thus a crucial activity, to be done by users and machines to identify, organize and make sense of the raw data ingested in the lake. In increasing order of complexity, tagging encompasses the traditional schema information (table/dataset name, description and attribute metadata), the information about the data values obtained through profiling, the relationships/links between attributes of different datasets, and higher-level, business-specific tagging and synonyms between tags, which allow meanings to converge into a shared understanding. The data lake matures as a result of user interaction and feedback through this tagging process, while enabling dataset search.
Let’s look at these different types of tags in more detail.
- Basic info about datasets (let’s call this Tag-0)
  - What it is all about – Is it survey data or financial time-series data or clinical-trial data?
  - Where did the dataset come from (data source) and when?
  - Name and type of dataset (XML document, table or view of a relational database, etc.).
  - Data size, number of columns, column names, and data types (simple or complex), format (JSON, text/delimited or binary), and whether it is hierarchical or flat
This Tag-0 information should be generated automatically as a by-product of the ingestion process.
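As a rough illustration of how such Tag-0 information might fall out of ingestion, here is a minimal sketch using pandas. The function name `extract_tag0`, its field names and the sample data are assumptions made for this example, not part of any particular data lake framework.

```python
import io
from datetime import datetime, timezone

import pandas as pd


def extract_tag0(df, name, source, fmt):
    """Collect basic (Tag-0) descriptors for one ingested dataset."""
    return {
        "name": name,
        "source": source,
        "format": fmt,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "rows": int(len(df)),
        "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "size_bytes": int(df.memory_usage(deep=True).sum()),
        # A column holding dicts or lists hints at a hierarchical dataset.
        "hierarchical": any(
            df[col].map(lambda v: isinstance(v, (dict, list))).any()
            for col in df.columns
        ),
    }


# Hypothetical sample standing in for a freshly ingested delimited file.
raw = io.StringIO("movie_id,title,rating\n1,Toy Story,4.0\n2,Jumanji,3.5\n")
ratings = pd.read_csv(raw)
tag0 = extract_tag0(ratings, name="movie_ratings", source="movielens",
                    fmt="text/delimited")
```

The resulting dictionary can be stored next to the dataset and extended later with Tag-1 to Tag-3 information.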
- Profiling info: (let’s call this Tag-1, which would be of most interest to a “Data Scientist”)
  - Basic attribute statistics: min/max/average values, percentage of null values, number of unique values, format patterns, peak-to-peak values, frequency/counts
  - “Top N” bins by count in a column, “outliers”, or invalid values
  - Are columns correlated with other columns in the same dataset (e.g., an absolute correlation score between 0 and 1)?
  - For time-series data, are there any missing values, gaps, abnormal values or duplicates, and what is their value distribution?
  - Identification of higher-level “business types” on columns or sets of columns, e.g., people names, addresses, cities, countries, phone numbers, social security numbers, e-mails, date/time, and the like.
- Across datasets: (Tag-2)
  - Which datasets can be potentially “linked or joined” and on which columns?
  - Discovery of foreign key / primary key relationships on attributes of different datasets
  - Datasets having significant overlaps, e.g., duplicate row detection or similarity detection between columns. This can be useful to detect versions of the same data files.
- Business Tags: (Tag-3) – Business-specific tagging, synonyms and hyponyms, allowing for ambiguous terms to eventually converge into a shared understanding.
Tags of type 1, 2 and 3 are generated by the interaction of users (data stewards, data scientists) with the data lake framework. Storage of tags may be decoupled from the underlying data and can well be visualized in the form of a graph of connected tags.
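To give that interaction a starting point, candidate Tag-1 and Tag-2 values can be computed and then reviewed by data stewards. The sketch below assumes pandas; `profile_column`, `join_candidates`, the 0.8 overlap threshold and the sample tables are all hypothetical choices for illustration.

```python
import pandas as pd


def profile_column(series, top_n=3):
    """Tag-1: basic statistics for one column."""
    stats = {
        "null_pct": float(series.isna().mean() * 100),
        "unique": int(series.nunique()),
        "top_values": series.value_counts().head(top_n).to_dict(),
    }
    if pd.api.types.is_numeric_dtype(series):
        stats.update(min=float(series.min()), max=float(series.max()),
                     mean=float(series.mean()))
    return stats


def join_candidates(left, right, min_overlap=0.8):
    """Tag-2: column pairs whose distinct values largely overlap."""
    pairs = []
    for lcol in left.columns:
        lvals = set(left[lcol].dropna())
        for rcol in right.columns:
            rvals = set(right[rcol].dropna())
            if not lvals or not rvals:
                continue
            # Share of the smaller value set that also appears in the other.
            overlap = len(lvals & rvals) / min(len(lvals), len(rvals))
            if overlap >= min_overlap:
                pairs.append((lcol, rcol, round(overlap, 2)))
    return pairs


# Hypothetical sample tables.
orders = pd.DataFrame({"order_id": [10, 11, 12, 13],
                       "cust_id": [1, 2, 2, 3],
                       "amount": [20.0, 35.5, None, 12.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                          "country": ["IN", "US", "IN", "DE"]})

amount_profile = profile_column(orders["amount"])
links = join_candidates(orders, customers)
```

Value overlap is only a hint of a key relationship; on large tables it would normally be computed on samples and confirmed by a user before being stored as a tag.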
Once the data is tagged, users can start searching datasets by entering keywords that refer to tags, for example, ‘Movies’, ‘Orders’, ‘Complaints’. It could also be a mixture of tags and data values (‘Movies in India’ or ‘Complaints logged by Customer Peter’). For better search results, both searchable data values and search queries should use text morphological processing (word segmentation and stemming), so that a search for ‘fishing’ can match data value ‘fish’.
There should also be other traditional usability enhancements such as Boolean expressions (‘Orders AND Complaints’), wildcards (‘Movie*’), and exclusion terms (‘Orders with no Country’).
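The keyword search described above can be sketched with a small inverted index. This is a toy under stated assumptions: `stem` is a crude suffix stripper standing in for a real stemmer, the stop-word list and catalog are made up, and only an implicit AND between terms is supported (no wildcards or exclusions).

```python
import re
from collections import defaultdict

STOP_WORDS = {"in", "by", "and", "with", "the"}  # hypothetical, minimal list


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("s", "ing"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
    return word


def tokens(text):
    """Split text into stemmed terms, dropping stop words."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return {stem(w) for w in words if w.lower() not in STOP_WORDS}


def build_index(catalog):
    """Map each stemmed term to the datasets whose tags or values contain it."""
    index = defaultdict(set)
    for dataset, texts in catalog.items():
        for text in texts:
            for term in tokens(text):
                index[term].add(dataset)
    return index


def search(index, query):
    """Return datasets matching every term in the query (implicit AND)."""
    terms = tokens(query)
    if not terms:
        return set()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits)


# Hypothetical catalog: dataset name -> tags and searchable data values.
catalog = {
    "movielens_20m": ["Movies", "ratings", "India"],
    "consumer_complaints": ["Complaints", "customer Peter"],
    "angling_log": ["fish", "lake"],
}
index = build_index(catalog)
```

Because both the indexed values and the query pass through the same `tokens` function, `search(index, "fishing")` finds the dataset tagged with ‘fish’. A production setup would more likely delegate this to an indexing service.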
Users could formulate search queries where results are dataset columns within a certain scope, constrained by data type or ‘business type’ (e.g., type IN (‘date’, ‘datetime’), or business type IN (‘Phone’)), by presence of values, or by correlation with other columns (e.g., find all “strongly” correlated columns where correlation > 0.7). Finally, search results could also be a set of rows of a type specified on the fly by the search query, such as: find rows with column LIKE ‘%CUST%’ AND data type IN (‘string’) AND CONTAINS (‘IBM Inc.’).
The result could be either a list of datasets that are similar to the terms entered, or a list of columns or a set of rows, depending upon the granularity of the output expected. Typically, users would want to start with search terms that lead them to useful datasets and then search for more specific items within those data sets. Furthermore, each user may begin with a completely different search tag and still end up reaching what they are interested in!
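A column-level search of this kind might look as follows. The tag records, the `find_columns` and `correlated_pairs` helpers and the sample numbers are assumptions for illustration; the 0.7 threshold is the one used in the example query above.

```python
import pandas as pd

# Hypothetical column-level tag records, as a tagging tool might emit them.
column_tags = [
    {"dataset": "customers", "column": "CUST_PHONE",
     "dtype": "string", "business_type": "Phone"},
    {"dataset": "customers", "column": "CUST_SINCE",
     "dtype": "date", "business_type": None},
    {"dataset": "orders", "column": "ORDER_TS",
     "dtype": "datetime", "business_type": None},
    {"dataset": "orders", "column": "AMOUNT",
     "dtype": "float", "business_type": None},
]


def find_columns(tags, dtypes=None, business_type=None, name_contains=None):
    """Return (dataset, column) pairs that satisfy every given constraint."""
    results = []
    for tag in tags:
        if dtypes is not None and tag["dtype"] not in dtypes:
            continue
        if business_type is not None and tag["business_type"] != business_type:
            continue
        if name_contains is not None and name_contains not in tag["column"]:
            continue
        results.append((tag["dataset"], tag["column"]))
    return results


def correlated_pairs(df, threshold=0.7):
    """Return numeric column pairs whose absolute correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = list(corr.columns)
    return [
        (a, b, round(float(corr.loc[a, b]), 2))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > threshold
    ]


date_columns = find_columns(column_tags, dtypes={"date", "datetime"})
phone_columns = find_columns(column_tags, business_type="Phone")

# Hypothetical numeric data: tax is proportional to price, stock is unrelated.
metrics = pd.DataFrame({"price": [1, 2, 3, 4, 5],
                        "tax": [0.1, 0.2, 0.3, 0.4, 0.5],
                        "stock": [3, 1, 4, 1, 5]})
strong = correlated_pairs(metrics)
```

In practice the correlation scores would be computed once during profiling and stored as tags, so that the search only filters precomputed values.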
Note that there is an important connection between tags and access control security: a data lake search service should only deliver those datasets that users are authorized to view. Authorization profiles are defined over entire data sources, all types of tags (starting from tags referring to datasets and their columns), and searchable data values. The computation of a search result must take into account the definition of profiles associated with tags of the underlying tag graph.
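One simple way to picture this is a filter applied to search results before they are returned. The sketch assumes a deliberately simplified model in which an authorization profile is just a set of allowed tags and a dataset is visible only if all of its tags are allowed; the names and sample tags are hypothetical.

```python
def authorized_results(results, dataset_tags, allowed_tags):
    """Keep only the datasets whose every tag the user is cleared to see."""
    visible = []
    for name in results:
        tags = dataset_tags.get(name)
        # Deny by default: untagged or unknown datasets are never returned.
        if tags and tags <= allowed_tags:
            visible.append(name)
    return visible


# Hypothetical tag assignments and analyst profile.
dataset_tags = {
    "movielens_20m": {"Movies", "public"},
    "crm_customers": {"Customers", "pii"},
    "consumer_complaints": {"Complaints", "public"},
}
analyst_profile = {"Movies", "Complaints", "public"}

raw_hits = ["movielens_20m", "crm_customers", "consumer_complaints", "unknown"]
visible_hits = authorized_results(raw_hits, dataset_tags, analyst_profile)
```

A real implementation would also need to cover column-level tags and searchable data values, as noted above.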
These tags should also exhibit a few characteristics: they should be easy to add, remove or edit in place; they can be one or many, flat or hierarchical; they can be added incrementally and enriched over time, at the level of a data collection, an entity or a data cell; and finally, they should be indexable and search-ready.
Now, the most important question: how do we achieve this?
Python, a widely used language for data analysis, offers libraries such as NumPy and pandas, whose data structures (arrays, DataFrame, Series) and sampling functions can be used to detect metadata.
While it may be very time-consuming to find all the tags from big data sets, you can probably think of many ways to start slicing the data to extract useful tags and build on this incrementally, instead of embarking on a mega data profiling project.
Tag identification tools built on top of such libraries can detect and apply tags to the data in an automated or assisted fashion (mixing in user-provided business tags), and the resulting “enriched” output can be fed to a NoSQL or graph database, or to an indexing service like Elasticsearch, for search querying.
Some of the ready-to-use tools that can augment (partially or fully) data and schema discovery are: Apache Drill (on-the-fly schema & data discovery), LinkedIn WhereHows (Data Discovery and Lineage for Big Data Ecosystem) and, on the commercial side, Waterline Data, Attivio Data Source Discovery, Oracle Big Data Discovery and Teradata Loom.
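A minimal sketch of such an assisted tagging step is shown below, assuming pandas. The regular expressions, the 0.9 match threshold and the function names are hypothetical and far cruder than what a real detector would need; the output is a plain JSON document that could be handed to whichever store or index is chosen.

```python
import json
import re

import pandas as pd

# Hypothetical patterns; real detectors would be far more thorough.
PATTERNS = {
    "Email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "Phone": re.compile(r"^\+?[\d\s().-]{7,}$"),
}


def detect_business_type(series, sample_size=100, min_match=0.9):
    """Guess a business type from a random sample of non-null values."""
    values = series.dropna().astype(str)
    if values.empty:
        return None
    sample = values.sample(n=min(sample_size, len(values)), random_state=0)
    for label, pattern in PATTERNS.items():
        if sample.map(lambda v: bool(pattern.match(v))).mean() >= min_match:
            return label
    return None


def tag_document(df, name, business_tags=()):
    """Combine detected and user-provided tags into one indexable document."""
    return {
        "dataset": name,
        "business_tags": sorted(business_tags),
        "columns": [
            {"name": col, "dtype": str(df[col].dtype),
             "business_type": detect_business_type(df[col])}
            for col in df.columns
        ],
    }


# Hypothetical CRM extract.
contacts = pd.DataFrame({
    "name": ["Peter", "Asha", "Lin"],
    "email": ["peter@example.com", "asha@example.org", "lin@example.net"],
    "phone": ["+1 555 0100", "555-0101", "(555) 0102"],
})
doc = tag_document(contacts, "crm_contacts", business_tags={"Customers", "CRM"})
payload = json.dumps(doc)  # serialized form for a document store or index
```

Sampling keeps detection cheap on large columns, at the cost of occasionally missing rare invalid values.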
To summarize, we recommend that the data lake support tags from the four categories described above, and that it support dataset discovery through search queries based on tags. We recommend storing these tags in a suitable key-value based NoSQL database (like Cassandra, Accumulo or HBase), as they provide a flexible storage schema, and using tags for authorization profiles to secure the data delivered by search queries. Furthermore, since identifying useful tags is the most crucial step in tagging, invest in a tool (or possibly build one, for greater flexibility) that automates the identification of meaningful and useful tags as much as possible. The tags themselves provide a wealth of information and could be utilized in many different ways. As we gain a deeper understanding of the existing data, as new datasets land in the data lake, and as new datasets get derived from others, tags can be enhanced further or dropped, enriching datasets and making them ready for analytics.
A data lake becomes useful only when it can solve business problems effectively for enterprises through data democratization, enrichment and data discovery. Tags are rafts that ensure the data lake journey is unwavering!
 Metadata being an overloaded term utilized in many contexts, we prefer to avoid it in this setting. The reader should remember that the notion of tagging includes the traditional notion of metadata.
Image Source: www.teradata.com