Domain Modelling using OpenAI

Domain models form the foundation of data products by establishing a common language. This article describes using OpenAI to formulate and refine a domain model into SQL code.

Tim Burns
5 min read · Apr 16, 2024

A domain model gives the team a common language to describe the properties of data and the relationships between entities. That shared vocabulary is the foundation of a data product.

The data in the domain model will come from many sources, and a consistent model allows us to integrate them into a single, standardized view that serves as the foundation for analytics on the data warehouse.

In this article, I use ChatGPT to build a domain model. Before refining it into SQL, Part 1 lays out the Snowflake and AWS infrastructure that will host it.

Part 1: Provision your Snowflake Pipe on AWS using Terraform

Snowflake provides documentation for manually implementing Snowflake Pipes on AWS [1]. Snowflake Labs also provides a Terraform provider that automates the provisioning of Snowflake entities [2]. Terraform's AWS provider helps bridge the gap between AWS and Snowflake and automates the configuration process [3].

Step 1a: Break your Snowflake deployment into layers

Terraform can be tricky to use because it will create and destroy components based on whether or not the file for the element exists in the Terraform folder. Additionally, automatic dependency detection works well on an isolated feature but can quickly become a spaghetti mess without clear separation.

For those reasons, break your Snowflake deployment into layers (a sketch of the resulting layout follows this list):

  • The Data Lake — The S3 bucket and associated folders
  • The Snowflake Database — The Snowflake database and core schemas
  • The AWS Snowflake Integration — Integrate Snowflake into AWS
  • Additional domain-specific layers — Build domain-specific layers separate from your core infrastructure
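
One way to lay these layers out on disk is to give each one its own Terraform root folder. The folder names below are illustrative (only modules/snowflake-tables appears in the code later in this article); adapt them to your repository:

layers/
  data_lake/            # s3.tf, variables.tf, versions.tf, output.tf
  snowflake_database/   # database.tf, schema.tf
  aws_integration/      # storage integration, IAM role, and Snowpipe module calls
modules/
  snowflake-tables/     # reusable stage + table + pipe template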

Provisioning the Data Lake

Many readers will likely already have a data lake. This section describes how to provision one and links to the author's GitHub project with all the code [4].

The Data Lake Terraform layer contains several files:

  • s3.tf — The Terraform resource to build the data bucket if it doesn't exist (sketched below).
  • variables.tf — A consistent set of variables for the project, defined in the environment and passed in via a Makefile.
  • versions.tf — The Terraform provider directive that specifies the required providers.
  • output.tf — Output definitions that report the created entities.
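
As a rough sketch (the resource labels here are illustrative, not the exact contents of the repository), versions.tf pins the AWS provider and s3.tf declares the bucket from the shared variables:

# versions.tf — pin the provider this layer depends on
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# s3.tf — the data lake bucket, named from the shared prefix and environment
resource "aws_s3_bucket" "datalake" {
  bucket = "${var.prefix}-datalake-${var.environment}"
}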

This style of file inclusion is how Terraform groups a layer; dependencies between the files are specified by reference within the same Terraform project. The Makefile has the steps to build the layer.

Running the Terraform procedure on this layer will create the S3 bucket if it doesn't exist. If the bucket was made through another process, the apply will fail, and that is OK: it simply means this aspect of the provisioning can be skipped. The critical takeaway is that the S3 bucket can be independent and not managed through Terraform, which might be the better choice.
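
If the bucket does live outside Terraform, one option (a sketch, not the repository's code) is to reference it with a data source instead of a resource, so the layer can read its attributes without managing its lifecycle:

# Look up an existing bucket without managing it
data "aws_s3_bucket" "datalake" {
  bucket = "${var.prefix}-datalake-${var.environment}"
}

Alternatively, terraform import can bring the existing bucket under management if you do want Terraform to own it.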

Provisioning the Snowflake Database

The Snowflake Database is another fundamental layer component that may or may not already exist in your environment. If it doesn't, provision it as part of this layer.

The Snowflake layer is similar to the Data Lake layer and has a database.tf file and a schema.tf file that provision the database and a landing zone schema on Terraform apply.

resource "snowflake_database" "catalog" {
name = upper("${var.environment}_CATALOG")
comment = "The Database Catalog for Building Data Products using Value Gap Matrix"
data_retention_time_in_days = 14
}

Note the use of variables and the upper function to enforce naming consistency. The variable definitions should enforce the consistency of core entities to avoid some of the less desirable aspects of Terraform when managing layered architectures. The upper function is recommended for Snowflake entities because unquoted Snowflake identifiers are case-insensitive and stored in uppercase.
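
The landing zone schema in schema.tf follows the same pattern. The sketch below assumes the schema name comes from a landing_zone_schema variable (the same variable the Snowpipe module call uses later); the actual file may differ:

resource "snowflake_schema" "landing_zone" {
  database = snowflake_database.catalog.name
  name     = upper(var.landing_zone_schema)
  comment  = "Landing zone schema for raw data copied in from the data lake"
}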

Provision the AWS to Snowflake Integration

The AWS to Snowflake Integration provisions the pipeline between S3 and the Snowflake tables in the database. It uses dependencies to provision the Snowflake STORAGE_INTEGRATION and PIPE objects that COPY data from S3 into a table called IMPORT_LOG.

The STORAGE_INTEGRATION provides the trust foundation between AWS and Snowflake. Note the use of variables to enforce consistency so that we don't create circular dependencies between objects.

resource "snowflake_storage_integration" "datalake_integration" {
name = upper("${var.prefix}_STORAGE_INTEGRATION_DATA_LAKE_${var.environment}")
comment = "A Storage integration for the datalake"
type = "EXTERNAL_STAGE"

enabled = true

storage_provider = "S3"
storage_aws_role_arn = "arn:aws:iam::${var.aws_account_id}:role/${var.prefix}-snowflake-data-lake-${var.environment}"

storage_allowed_locations = ["s3://${var.prefix}-datalake-${var.environment}/${var.stage_folder}/"]

}

When all is finished, the STORAGE_INTEGRATION is available in the Snowflake console.

describe integration OWLMTN_STORAGE_INTEGRATION_DATA_LAKE_DEV;

/*
+-------------------------+-------------+-------------------------------------------------------------+----------------+
|property |property_type|property_value |property_default|
+-------------------------+-------------+-------------------------------------------------------------+----------------+
|ENABLED |Boolean |true |false |
|STORAGE_PROVIDER |String |S3 | |
|STORAGE_ALLOWED_LOCATIONS|List |s3://owlmtn-datalake-dev/stage/ |[] |
|STORAGE_BLOCKED_LOCATIONS|List | |[] |
|STORAGE_AWS_IAM_USER_ARN |String |arn:aws:iam::098765432101:user/asdf-b-a9b83ss9 | |
|STORAGE_AWS_ROLE_ARN |String |arn:aws:iam::123456789012:role/owlmtn-snowflake-data-lake-dev| |
|STORAGE_AWS_EXTERNAL_ID |String |CBA12345_SFCRole=ASDFGBJJDFKS1234ASDFL499LLABS= | |
|COMMENT |String |A Storage integration for the datalake | |
+-------------------------+-------------+-------------------------------------------------------------+----------------+

*/

The Terraform component storage_integration_role ties together the AWS role and policy, allowing the Snowflake principal for your Snowflake account in the AWS cloud to connect to your S3 bucket. This enables you to load data into Snowflake without egressing over the public Internet and is highly secure.
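
A sketch of that wiring is below. The resource labels are illustrative, and it assumes the Snowflake provider exports the storage_aws_iam_user_arn and storage_aws_external_id attributes on the storage integration: the IAM role trusts that principal and external ID, and a policy grants read access to the stage prefix.

# Sketch: trust the Snowflake-side principal exported by the integration
resource "aws_iam_role" "snowflake_data_lake" {
  name = "${var.prefix}-snowflake-data-lake-${var.environment}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = snowflake_storage_integration.datalake_integration.storage_aws_iam_user_arn }
      Action    = "sts:AssumeRole"
      Condition = {
        StringEquals = {
          "sts:ExternalId" = snowflake_storage_integration.datalake_integration.storage_aws_external_id
        }
      }
    }]
  })
}

# Grant read access to the stage prefix in the data lake bucket
resource "aws_iam_role_policy" "snowflake_data_lake_read" {
  name = "snowflake-data-lake-read"
  role = aws_iam_role.snowflake_data_lake.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:GetObjectVersion"]
        Resource = "arn:aws:s3:::${var.prefix}-datalake-${var.environment}/${var.stage_folder}/*"
      },
      {
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::${var.prefix}-datalake-${var.environment}"
      }
    ]
  })
}

Because the storage integration names the role ARN as a plain string while the role references the integration's computed attributes, the dependency runs in one direction only, which avoids the circular dependency mentioned above.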

A second key integration feature is the Terraform "module" component, which lets us specify all the hooks an individual table needs to load automatically whenever a new file arrives in our S3 bucket.

Use Modules to Implement Snowpipe

Snowpipe requires several key components:

  • An external stage that points at the S3 folder through the storage integration
  • A landing table to receive the data
  • A pipe with auto-ingest enabled that runs the COPY statement
  • An S3 event notification that sends new-object events to the pipe's SQS queue

The module defines a generic template for creating all of these together, and in our layer, we call the module for a specific table. So, for example, I have a simple table that loads a log entry with no mappings.

module "snowflake_datalake_logs" {
source = "../../modules/snowflake-tables"
datalake_storage = "${var.prefix}-datalake-${var.environment}"
stage_folder = "${var.stage_folder}/${var.data_source}/logs"
stage_name = "STAGE_LOG"
table_name = "IMPORT_LOG"
snowflake_database = upper("${var.environment}_CATALOG")
landing_zone_schema = var.landing_zone_schema
environment = var.environment
storage_integration = snowflake_storage_integration.datalake_integration.name
}

The module block above, in the base pipeline layer, completes the integration by subscribing the table's pipe to the SQS events on our S3 bucket. It's a one-time configuration for a single table because the notification channel is consistent across pipes.
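
Inside the module, the template might look roughly like the sketch below: an external stage over the S3 prefix, a landing table, a pipe with auto-ingest, and the bucket notification that feeds it. The column definitions and resource labels are illustrative, and the variable declarations matching the inputs shown above are omitted for brevity.

# Sketch of modules/snowflake-tables — not the exact module code

resource "snowflake_stage" "this" {
  name                = var.stage_name
  database            = var.snowflake_database
  schema              = var.landing_zone_schema
  url                 = "s3://${var.datalake_storage}/${var.stage_folder}/"
  storage_integration = var.storage_integration
}

resource "snowflake_table" "this" {
  name     = var.table_name
  database = var.snowflake_database
  schema   = var.landing_zone_schema

  # Illustrative columns for a simple log-landing table
  column {
    name = "LOG_LINE"
    type = "VARCHAR"
  }
  column {
    name = "LOADED_AT"
    type = "TIMESTAMP_NTZ"
  }
}

resource "snowflake_pipe" "this" {
  name           = "${var.table_name}_PIPE"
  database       = var.snowflake_database
  schema         = var.landing_zone_schema
  auto_ingest    = true
  copy_statement = "COPY INTO ${var.snowflake_database}.${var.landing_zone_schema}.${var.table_name} FROM @${var.snowflake_database}.${var.landing_zone_schema}.${var.stage_name}"
}

# Subscribe the bucket's object-created events to the pipe's SQS channel.
# The channel is shared across pipes, so this is a one-time configuration per bucket.
resource "aws_s3_bucket_notification" "datalake_events" {
  bucket = var.datalake_storage

  queue {
    queue_arn     = snowflake_pipe.this.notification_channel
    events        = ["s3:ObjectCreated:*"]
    filter_prefix = "${var.stage_folder}/"
  }
}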

Conclusion

Automated deployment trims the tool stack we need to provide data for analytics. A well-managed pipeline offers the basis for data governance by ensuring consistency, repeatability, and automation.

Part 2 of this article will show how to extend the core layers here and move to the domain-driven data development that forms the basis of a modern data cloud architecture.

References

  1. Snowflake (2023) Automating Snowpipe for Amazon S3
  2. Snowflake Labs (2023) Snowflake Terraform Provider Docs
  3. Hashicorp (2023) AWS Terraform Provider Docs
  4. T. Burns (2023) Snowflake Automation GitHub Code
  5. Snowflake Labs (2023) Terraform Gitlab Project
