dbt Docs as a Static Website

March 16, 2023

powered by App Engine and GitHub Actions

Dbt Docs, as a static website, provides an intuitive way to inspect our project documentation outside of dbt Cloud or our local machine.

Photo by Henry & Co. on Unsplash

Introduction

dbt Docs allow us to intuitively inspect our DAG and dbt project documentation. Along with many of its limitations, we often find ourselves wanting to share this knowledge outside dbt Cloud or our local machine.

Suppose we want to share this with our stakeholders. Then, they have to have a Cloud viewer login, which is at least $100 a seat as of today (assuming that you are a relatively small organization), and going through 4–5 steps just to get to the documentation, which is no good.

There should also be a solution for those who are only using dbt-core and handling scheduling and CI/CD with Airflow and co. plus any integration service. There must be something better than paying for an EC2 instance just to keep a static web server there, right?

The official dbt documentation gives you some options, but none of them fits to a GCP* stack, and we still lack the A-Z guide for this.

This walkthrough was created with GCP in mind, but the same could be applied to all other cloud service vendors.

Options

Although it is still an open issue, luckily the community delivered again, and we are able to deploy our docs in a very straightforward way on a static website, where the cost of hosting is actually zero.

There are actually three ways to do this:

Fast and easy way: Hosting with App Engine
Slow and more painful, but cleaner way: Hosting a static website from GCS.
Fastest and easiest, but needs access to repo: GitHub Pages

The cost difference between all these? Close to zero. They are all free if you don’t have a lot of incoming network traffic.

For the rest of this article, I’m going to walk you through the process of deploying to App Engine. Here, we don’t have to register our custom domain, configure an HTTPS load balancer or find alternatives for setting up IAP on backend buckets. All in all, there is no operational cost whatsoever on serverless products.

GitHub Pages is also ruled out in many cases, because if we want to make the documentation private, it’s going to only be accessible for those who also have access to the repository, therefore have GitHub accounts.

Just like with lots of other services, Google offers a free quota for App Engine which you are not likely to use up, given that it is a static website and the deployment is going to happen now and then. Knowing this, GCP allocates the smallest virtual instance to do the light work.

High-level deploy plan

Step 1. Setup App Engine!

Google App Engine is a cloud computing platform as a service for developing and hosting web applications in Google-managed data centers.

Requirements:

Google Cloud CLI: https://cloud.google.com/sdk/docs/install
Deploy file app.yml in your dbt project’s root directory

service: default
runtime: python311
handlers:
- url: /
  static_files: public/index.html
  upload: public/index.html
- url: /
  static_dir: public
- url: /.*
  secure: always
  redirect_http_response_code: 301
  script: auto

Translating what’s above,

The deployment uses Python 3.11 in runtime,
Uploads and serves static index.html from the target directory, where you are storing the docs artifacts
Maps the root URL to the docs file.
The /.* handler matches all other URLs and directs them to your app.

Standalone dbt docs

Huge thanks to Data banana!

If you want the static docs to work locally without running a local web server, you can put the content of the JSON files (manifest, catalog) directly in index.html and remove network loading.

Now, you can:

Directly open the index.html without CORS restriction in local
Upload and host the documentation in some cloud storage, like Google Cloud Storage. As of this moment, this is not the case yet.
Easily distribute it to your stakeholders via email and co.

import json
import re
import os
PATH_DBT_PROJECT = ""
search_str = 'o=[i("manifest","manifest.json"+t),i("catalog","catalog.json"+t)]'
with open(os.path.join(PATH_DBT_PROJECT, 'target', 'index.html'), 'r') as f:
    content_index = f.read()
    
with open(os.path.join(PATH_DBT_PROJECT, 'target', 'manifest.json'), 'r') as f:
    json_manifest = json.loads(f.read())
# In the static website there are 2 more projects inside the documentation: dbt and dbt_bigquery 
# This is technical information that we don't want to provide to our final users, so we drop it
# Note: depends on the connector, here we use BigQuery 
IGNORE_PROJECTS = ['dbt', 'dbt_bigquery']
for element_type in ['nodes', 'sources', 'macros', 'parent_map', 'child_map']:  # navigate into manifest
    # We transform to list to not change dict size during iteration, we use default value {} to handle KeyError
    for key in list(json_manifest.get(element_type, {}).keys()):  
        for ignore_project in IGNORE_PROJECTS:
            if re.match(fr'^.*\.{ignore_project}\.', key):  # match with string that start with '*.<ignore_project>.'
                del json_manifest[element_type][key]  # delete element
    
with open(os.path.join(PATH_DBT_PROJECT, 'target', 'catalog.json'), 'r') as f:
    json_catalog = json.loads(f.read())
# create single docs file in public folder
with open(os.path.join(PATH_DBT_PROJECT, 'public', 'index.html'), 'w') as f:
    new_str = "o=[{label: 'manifest', data: "+json.dumps(json_manifest)+"},{label: 'catalog', data: "+json.dumps(json_catalog)+"}]"
    new_content = content_index.replace(search_str, new_str)
    f.write(new_content)

First Deploy

Now that we have everything set up to make our first deploy manually, let’s see if it works!

You have to make sure that index.html keeps updating in the public/ folder with glue-dbt-artifacts.

Authenticate with gcloud

gcloud auth application-default login

Deploy and inspect

gcloud app deploy --project my-project --quiet
gcloud app browse --project my-project

Docs generate > glue artifacts > deploy > browse

You should be prompted to a new browser tab with the dbt Docs which means your docs is live!

You can also set up your custom domain.

Step 2. Manage Access with IAP!

By now, your static website is live, but also open to the public, which is not what we want.

Identity-Aware Proxy intercepts web requests sent to your application, authenticates the user making the request using the Google Identity Service, and only lets the requests through if they come from a user you authorize.

Securing App Engine apps with IAP | BeyondCorp Enterprise | Google CloudThis page walks you through deploying an App Engine standard or flexible environment application and securing it with…cloud.google.com

For this, following Step 1. of the codelabs guide should be enough.

This guides you from having a public static website to full access control of Google IAM resources via OAuth.

It’s handy if your organization is Google Suite-first, because it is possible to add groups as well, meaning you can also restrict the access to your company’s domain. This is where Google Cloud Platform really shines.

You should see something like this with IAM switch on

Step 3. Deploy with GitHub Actions!

So, there you go, now you have a secured website. What’s next is to find out how to periodically deploy dbt Docs to App Engine. There are a handful of solutions that can work (e.g. scheduled custom script hitting the Cloud and AppEngine API), but some are cumbersome and don’t solve the issue for core-first users.

Github Actions has been here for a while, and it’s free up to 2,000 CI/CD minutes/month (unlimited for public repositories) if your (small) organization is on the Free plan, which should be enough.

Step 4. Test Your Workflow!

When it comes to testing GitHub Actions, we can use act to generate test cases (imitate push event). Act spins up a Docker instance of your choice* and runs the actions in a context of what triggers it.

*You have to have Docker installed on your desktop GitHub - nektos/act: Run your GitHub Actions locally 🚀"Think globally, act locally" Run your GitHub Actions locally! Why would you want to do this? Two reasons: Fast…github.com

We can supply GitHub secrets in multiple ways, but I chose to do so with a secret file (defined the same way as a .env file). Note, we have to do this because we are not touching the remote repository, so the secrets you have already set up are not within scope.

Those who are running this on Apple Silicon will likely experience some compatibility issues, so I suggest to spin up an AMD container instance instead.

act — container-architecture linux/amd64 — secret-file ‘my.secret’

You can also run act — dryrun to see the plan

Limitations of dbt Docs as a Static Website

Live metrics on artifact

Deploying on each push to the main branch, we assumed that the documentation only changes at those events, but it is rarely the case. Table/model metadata evolves over time (rows, size, etc.) without changing anything in dbt inducing a push event.

Displaying sensitive images in docs

I still haven’t found a good way to display sensitive images referenced in dbt docs. I have tried hosting those images on GCS and restricting the access to the same Google group, but it seems that App Engine is not able to extrapolate the authentication.

Conclusion

In a nutshell, I just gave you an A-Z guide on how to host and test your dbt Docs on a GCP stack. It barely costs any money, and keeping the documentation up-to-date is seamless with GitHub Actions. Also, since you have a standalone .html file, you can make all other sorts of distributions within your organization. There are some limitations to it, but I think this is currently the best way to bring data democratization further in the dbt community.

Stay tuned with our latest articles about the Modern Data Stack if you want to learn further cutting-edge technical tips and tricks!

Article by NAM NGUYEN SON

dbt

Explore more stories

AI Predictions: Tracking the Prophecies
Humanity has always tried to predict the future. We’re swapping tea leaves for datasets. We explore our collective obsession with AGI timelines, the moving goalposts of superintelligence, and why we ultimately decided to build a scoreboard to keep everyone honest.
Think Big, Start Small: Digital Twinning a Coffee Machine
Digital twinning allows you to minimize downtime when maintaining large-scale machinery. But it’s not exclusive to gigantic pieces of equipment—even something as small as a coffee machine can showcase its benefits.
Building an AI Application with Databricks Apps in 30 Days
Discover how to build a production-ready AI application on Databricks Apps in just under a month. Learn from our journey, challenges, and architectural choices.

Flying high with Hifly

We want to work with you

Hiflylabs is your partner in building your future. Share your ideas and let’s work together.