A data model is a communication tool
Data models are core components of a successful data pipeline, whether formally or informally. They influence how Data Engineers develop the desired data transformation logic. Data models support data governance, data accessibility, and data literacy. Data Analysts, Data Scientists, and business users are some additional examples of roles which benefit from data models. This applies in relation to understanding the meaning of data and delving into insights that the data can reveal. Whether we are defining scope, documenting an organization’s data, or designing a physical database structure to store data, in the context of data solution delivery:
A data model is a communication tool
These were key messages I covered with Anna Abramova from the SqlDBM team during a recent webinar & hands-on lab covering data modeling and the Snowflake Data Cloud. The talk was attended by 171 data modeling enthusiasts from the US West Coast to Europe to India.
Data Modeling and Snowflake Data Pipelines
The role of data modeling in the context of Snowflake data pipelines is multi-faceted. A data pipeline is an arrangement of data processing steps. We can use these steps for:
- Movement/Replication of data between a source and a target
- Transformation of that data during the replication/movement process (ETL) or after the movement / replication process is complete (ELT)
- The latter is recommended
Depending on data pipeline design, we can run the steps synchronously, asynchronously, or a combination of both.
A data model can be a source for a data pipeline, a target for data pipeline, and/or part of an interim step during data pipeline processing. Snowflake provides a variety of components as part of data pipeline configuration. Examples include:
- Virtual warehouses – The Snowflake compute power
- Stages – Locations where the data is stored and is accessible to Snowflake
- Tasks – Used to define a recurring schedule for executing SQL statements
- Streams – Record data manipulation language (DML) changes made to tables
- Snowpipe – Loads data from files as soon as they are available in a Snowflake stage
- And of course databases (which contain data models!)
Snowflake Data Modeling specific considerations
There are a wide variety of Snowflake specific data modeling considerations. Some examples include:
- Usage of the variant data type for the management of semi-structured data e.g. JSON
- Data retention in relation to Snowflake Time Travel
- Snowflake Time Travel supports accessing older versions of data from a table
- Cluster key configuration
- When data is loaded into Snowflake, that data is automatically added to micro-partitions. Micro-partitions are contiguous units of compressed columnar storage. This is ideal for analytical data processing patterns in addition to query optimization.
- However, over time & changes to the data, the micro-partitioning may not be as effective as when the data was originally loaded into Snowflake.
- Cluster Keys can be used to manage this risk in an automated fashion
- This is done by co-locating similar rows in the same micro-partitions, therefore ensuring micro-partition effectiveness
Tip: The benefits of Snowflake Cluster Keys typically apply to larger tables (multi-terabyte), so cluster key configuration should include testing to verify those benefits.
Data models, tacit knowledge, and staff onboarding
An often de-prioritized aspect of data modeling is model commenting. This is the metadata added to the data model itself to describe e.g. entities, attributes, tables or columns. “What is the purpose of this table?” and “where can I find this data?” are questions often asked after a data model is deployed to production. It can be the case that within an organization, a small group of people or one person in particular knows the answer to these questions. This can lead to data analysis bottlenecks.
Also that expertise can be lost through a variety of reasons such as staff promotion or a colleague moves on further afield.
Data Model commenting is a great way to help manage the risk of data knowledge loss, help on-boarding of new team members, and broaden organizational data literacy in general.
Data model versioning & DevOps
One of the key components of DevOps is data environment reproducibility. Likewise from a data modeling perspective, how can we identify differences in data models across Snowflake Data Cloud environments, and how can we remediate database drift if required?
Database drift occurs when, while using automated database deployment, unmanaged changes are separately made directly to database environments
Using tools such as SqlDBM, the data model environments deployed in different Snowflake environments can be compared as illustrated. In this screenshot we are comparing example Development and System Integration Test database environments. In the event that differences are identified (and where those instances need to be rectified), data model version comparison can be used to generate the required SQL CREATE/ALTER/DROP commands to align environments. We can then forward engineer (SQL to create database objects such as tables and views is automatically generated) from the data modeling tool. This SQL can be applied directly to an environment – if suitable; it is also prudent to consider using a DataOps pipeline to apply the changes. DataOps pipelines help to ensure consistency and reproducibility in relation to database deployments.
Free Data Model SQL DDL & Pipeline SQL DML
You can of course try out the above for yourself. I have published the Snowflake SQL scripts used during the webinar on Github. These scripts include:
- Overall Snowflake environment setup
- Data Model DDL so you can create your own data model automatically in a Snowflake database
- This can also be reverse engineered into a data modeling tool e.g. SqlDBM
- Data Pipeline SQL which processes data using ELT (Extract Load Transform)
- Database cloning to align your Snowflake data with your data model version
- SQL Queries to check the data model metadata & view the results of the data pipeline
This is a noteworthy data modeling and Snowflake Data Cloud learning opportunity, so if you wish to improve your skills in these areas, do check it out!
Data models help with data pipeline delivery, organizational data documentation, and data literacy. Snowflake data pipelines can be configured using a variety of components. Examples include Virtual Warehouses, Stages, Tasks, Streams, Snowpipe, and of course databases which contain data models!
There are a wide variety of Snowflake specific data modeling considerations. Some examples include using the variant data type for semi-structured data, configuring Snowflake Time Travel for history retention, and Snowflake Cluster Key configuration.
Data models help us to manage the risk associated with tacit knowledge. In the context of automated database deployment, it’s important to keep in mind data model versioning and database drift. In the context of data pipelines, and data solution delivery, data models are ubiquitous. Most of all:
A data model is a communication tool.
Dan Galavan is a leading Data Architecture Consultant who has spent over two decades advising clients and leading the delivery of data solutions in both the private and public sector. He is an Associate Faculty lecturer in data management at the National College of Ireland, a Snowflake Data Superhero, and provides Snowflake Data Cloud training.