In this blog, we will look at Matillion, which is an ETL tool available as a PaaS (Platform as a Service) on the following Cloud platforms and Cloud Data Warehouses. The table below offers a quick look at Matillion as offered by some top Cloud Data Warehouses.
Matillion comes in three forms:
- Matillion for Redshift
- Matillion for Snowflake
- Matillion for BigQuery
The Matillion instance/VM based on the appropriate image provides the respective data transformation capabilities using native Data Warehouse/Cloud platform capabilities.
Matillion Instance details:Why Matillion?
- Browser based UI that uses native cloud platform capabilities.
- Provides an abstract layer for data pipeline design and uses native capability of cloud data warehouse.
- Loads data using platform specific capabilities, for example:
Storage | Matillion for Redshift | Matillion for Snowflake* | Matillion for BigQuery |
---|---|---|---|
AWS S3 | X | X | |
Azure Blob Storage | X | ||
Google Cloud Storage | X |
- Rich in Data Integrations support.
Data Warehouse | Data Integration support |
---|---|
AWS RedShift | https://www.matillion.com/etl-for-redshift/integrations/ |
Snowflake | https://www.matillion.com/etl-for-snowflake/integrations/ |
BigQuery | https://www.matillion.com/etl-for-bigquery/integrations/ |
- There are two broad categories of components in Matillion:
- Orchestration: The Orchestration components includes data ingestion components data (i.e. “EL” of the ELT), and will assist in controlling the execution flow like error handling, control flow by supporting components like “AND”, “OR”, “IF”, “ITERATOR” etc.
- Transformation: The Transformation components (the “T” of “ELT”) will come into picture after the data is loaded. Thus, primarily, they will assist in calculating, aggregating, filtering etc. The other key advantage is driving these components via the Pushdown optimization by using native capabilities of the respective Cloud Data Warehouse.
- Component support for Cloud Platform specific Alerts:
- Log centralization: For now, AWS Cloudwatch is supported. Other platform specific log support (like Azure Monitor, Google Stackdriver) is on the cards.
- Shared Jobs: The entire data pipeline/workflow (only orchestration jobs) can be created as a shared job. This is like a creation of a custom component. This custom component can be used in any other job (Orchestration or Transformation). When triggered, these jobs in-turn call the entire shared job workflow. Shared jobs act as a reusable component and can be used for common pipeline patter like Auditing, alerting, logging or any other workflow pattern. Please find following link for more information: https://redshiftsupport.matillion.com/customer/portal/articles/2942889-shared-jobs
- Incremental data load: Matillion does not support native database capabilities (reading logs) to identify the changed records. For the AWS platform, it does use the AWS Data Migration Service (DMS) to track the changed records since DMS uses its native database capabilities (for the databases that it supports). The incremental data load is supported via the source database columns which capture the timestamp of changed records or an incremental number sequence.
- Matillion ETL Variables: Here is a list of the supported variables,
Variable Type | Purpose |
---|---|
Environment | These variables are Global and can be used across multiple jobs of a project. |
Job | The scope of these variables is a job. Job variables will override any environment variables of the same name within that specific job. These variables can be used across multiple components in the Job |
Grid | Grid Variables are a special type of Job Variable that can be declared in the Matillion ETL. Grid Variables are 2D arrays that hold scalar values. Headers for columns of the grid can be defined within Matillion but are separate from the data held in that grid. For more information, please refer to the following link: https://redshiftsupport.matillion.com/customer/portal/articles/2917841-grid-variables |
Automatic | These variables are internally generated. These include environment variables like: project_group_name project_name For more information, please refer to the following link: https://redshiftsupport.matillion.com/customer/portal/articles/2943424?b_id=8915 |
Note: The reference link leads to the Matillion Redshift supports, but the concept applies to all other Matillion ETL Data Warehouses.
- Components: There are extensive lists of components, and they also have a support for custom coding using the “Python Script” component. The custom component gives one the flexibility to set variables and execute commands for a target data warehouse.
- REST API support: Matilion provides REST API support designed to interact with Matillion ETL programmatically. For more information, please click on the link below: https://redshiftsupport.matillion.com/customer/portal/articles/2531801?b_id=8915
This tool has lot of potential and can be termed as a “Modern ELT/ETL”.