Build and Release Azure Pipelines with QA for Databricks Notebooks

June 16, 2023

Introduction

Implementing a solution can be a challenging process, requiring a multi-phase, structured, and iterative approach. While this approach can be effective, it can also be error-prone, leading to setbacks and delays. As a developer, you know that even small errors in the development process can have significant consequences, and it’s essential to minimize the risk of mistakes. One way to achieve this is by adopting a seamless automated process, and that’s where Azure DevOps comes in.

In this post we will establish a development process for our Databricks notebooks which will include the following steps:

  • Continuous Integration:
    • Implement:
      • Develop your code on an IDE and run unit tests
      • Commit your code to a branch and push it to your Azure Repo
    • Build:
      • Gather new and updated code from your Azure Repo
      • Run automated tests using the Nutter testing framework for Databricks
      • Build an Artifact ready for Release
    • Release:
      • Generate a Release Artifact
  • Continuous Delivery:
    • Deploy the notebooks to your Databricks Workspace

Advantages

  • By implementing this logic, your notebooks go through quality assurance before an artifact is even built
  • If the automated tests fail, the build pipeline fails too, allowing you to remediate before your code is even merged into the repository
  • You can also set up triggers that automatically run your pipelines when you commit code or merge a pull request into your main branch.

Scenario

In this post we imagine a small team of developers working on a single environment. They want to avoid unintentional commits to the main branch, so they set up branching policies and, by common agreement, decided on a continuous delivery system that triggers a deployment of the latest artifact every time a pull request is merged into the main branch.

Azure DevOps

Azure DevOps[1] is a set of development tools and services offered by Microsoft that enable developers to create and manage software projects efficiently. It provides a complete suite of tools that help developers deliver software faster and more reliably, from source control to continuous integration and deployment. Azure DevOps also includes project management and collaboration features that make it easy for teams to work together and stay on track. With Azure DevOps, developers can automate their development process, collaborate with their team members, and manage their codebase all in one place. Microsoft provides a comprehensive user guide that explains the various features and capabilities of Azure DevOps.

For the scope of this post we will work only with its Repos[2], Pipelines[3] and Test Plans[4] components.

Nutter

Nutter is a testing framework for Databricks notebooks developed by Microsoft. Its purpose is to provide an easy and reliable way to test notebooks and ensure that they produce the expected results. Nutter supports various types of tests, including unit tests, integration tests, and end-to-end tests. It also provides a command-line interface (CLI) and a Python API for running tests and collecting results. Nutter works by executing notebooks and comparing their outputs to expected values, which can be defined using assertions. It can also handle complex scenarios, such as notebooks that require specific parameters or dependencies. Overall, Nutter makes it easier for developers to write high-quality notebooks and ensure that they work as intended.
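
To make this concrete, here is a minimal sketch of invoking the Nutter CLI from a shell; the workspace folder /Shared/tests and the cluster id are placeholders I chose for illustration, and the exact flags are worth double-checking against the Nutter documentation.

# Install the Nutter CLI (for example on the build agent or your local machine)
pip install nutter

# The CLI authenticates against the workspace through these environment variables
export DATABRICKS_HOST=https://adb-<workspaceid>.<randomnumber>.azuredatabricks.net
export DATABRICKS_TOKEN=<personal-access-token>

# Run every test notebook under /Shared/tests (placeholder path) on the given cluster;
# --junit_report emits a JUnit XML report that Azure DevOps can publish as test results
nutter run /Shared/tests/ --cluster_id <cluster-id> --recursive --junit_report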

Prerequisites

My previous post, Develop Azure Databricks Notebooks on Windows with VS Code and dbx, already covers several general steps that you might find helpful for following this post.

Local development process

  • If you haven’t done it already, make sure that you Install VS Code on your computer and Create an Azure Repo
  • At the link Local development process you can find a detailed guide on how to:
    • Pull code from a GitHub repository; in this guide we will use https://github.com/AStefanachi/databricks-cicd-ado.git
    • Deal with unrelated commit histories
    • Push the code to a remote repository

The outcome of this step should be a fully functional instance of VS Code installed on your local machine, with the code from my GitHub repository pushed to your Azure Repos repository.

Create the build pipeline

  1. From your Azure DevOps project click on Pipelines
  2. Click on Pipelines
  3. Click on New Pipeline
  4. Click on Azure Repos Git
  5. Click on your repository name
  6. Click on Existing Azure Pipelines Yaml file
  7. From the path select your YAML file
  8. Click on Continue
  9. Click on the downwards arrow
  10. Click on Save

Now your build pipeline is created, but it is not yet ready to be executed.

Create the dev variable group

If you had a chance to read the code from the GitHub repository, you might have noticed that ado-pipelines/build-notebooks.yaml contains the following code, which defines the stageName, the variable group, and the environment:

# Build Pipeline for Azure DevOps
parameters:
  - name: stageName
    type: string
    default: "dev"

stages:
  - stage: ${{ parameters.stageName }}
    variables:
      - group: ${{ parameters.stageName }}
    jobs:
      - deployment: ${{ parameters.stageName }}
        environment: "${{ parameters.stageName }}"

In order for Azure Pipelines to run our YAML code, we need to complete some setup.

  1. From your Azure DevOps project click on Pipelines
  2. Click on Library
  3. Click on + Variable Group

  1. Insert a Variable group name
  2. Insert a Description
  3. Click on Add to insert the following variables:
    1. databricksHost: Copy and paste your per-workspace URL, i.e. the base URL of your Databricks workspace, in the form https://adb-<workspaceid>.<randomnumber>.azuredatabricks.net/
    2. databricksToken: Copy and paste your personal access token. More information about generating a personal access token can be found in my previous post
    3. databricksCluster (see also the CLI alternative sketched after this list):
      • In your Databricks Workspace, click on Compute in the left menu
      • Click on your cluster name
      • On the configuration page, click the JSON option on the right side of the screen to reveal the JSON configuration
      • The last entry is the cluster_id
      • Copy and paste it here
  4. Click on Change variable type to secret for each variable. By doing so, your variables will be masked when the pipeline runs, for enhanced security
  5. Click on Save
  6. My suggestion would be to use Azure Key Vault as a unified, highly secured container for this sensitive information. Even though it is out of scope for this post, I suggest reading more about it on Microsoft’s portal[5]
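
As an alternative to the UI steps above, and assuming the databricks-cli is already installed and configured on your machine, the cluster id can also be looked up from the command line:

# List the clusters in the workspace; each row shows the cluster_id, cluster name and state
databricks clusters list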

We need to authorize our build pipeline to read our newly created variables.

  1. Click on Pipeline Permissions
  2. Click on + and select your pipeline from the list that will appear
  3. Click on Save

Run the build pipeline

NOTE

The simplified build pipeline in my repository does not cover a step to spin up the cluster in case it is terminated. If your cluster is not running, your build pipeline will fail with a timeout error.

You could check the cluster state via the Databricks Clusters API and, in case it is in the TERMINATED state, start it and wait until it reaches the RUNNING state before proceeding with the rest of the build; a full implementation is out of scope for this post.
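
For completeness, here is a rough sketch of that check as a Bash step using the databricks-cli (the cluster id placeholder and the 30-second polling interval are my own assumptions):

# Read the current state of the cluster (assumes databricks-cli is configured against the workspace)
STATE=$(databricks clusters get --cluster-id "<cluster-id>" | python -c "import sys, json; print(json.load(sys.stdin)['state'])")

if [ "$STATE" = "TERMINATED" ]; then
  # Ask the workspace to start the cluster, then poll until it reports RUNNING
  databricks clusters start --cluster-id "<cluster-id>"
  until [ "$(databricks clusters get --cluster-id "<cluster-id>" | python -c "import sys, json; print(json.load(sys.stdin)['state'])")" = "RUNNING" ]; do
    sleep 30
  done
fi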

For now, start your Databricks cluster manually in case it is in the TERMINATED state:

  1. Click on the Compute icon on the left of your screen
  2. Click on Start
  3. Wait until the State icon becomes green to confirm that your cluster is running

  1. Click on Pipelines
  2. Click on All
  3. Click on your pipeline name

As this is your first run, you should see only Run pipeline in the middle of your screen. Click it to proceed.

Click on Run to launch the build pipeline.

Click on the job name to have a more in-depth view.

As this is the first time we run the pipeline with the newly created variable group, Azure DevOps will automatically create the dev environment. Click on View and allow access.

The build pipeline has completed all its steps, and its test results are now published to the Runs section of Azure Test Plans.

  1. Click on Test Plans
  2. Click on Runs
  3. Click on the latest Run to have a more detailed view of your tests

Create the release pipeline

As with the build, releasing an artifact is a multi-step process. The Azure DevOps agent is a virtual machine: we will need to make sure that the agent has the correct version of Python installed, installs the databricks-cli library, authenticates, and finally imports the notebooks to our Databricks workspace (a consolidated sketch of these inline scripts is shown after the task list below).

  1. Click on Pipelines
  2. Click on Releases
  3. Click on New
  4. Click on New release pipeline

As we will be creating our own custom release pipeline, just click on Empty job to proceed.

  1. Click on Add
  2. For Source type select Build
  3. For Source (build pipeline) select our build pipeline
  4. By default Azure DevOps selects Latest for the Default version
  5. By default Azure DevOps generates a name for your Source alias
  6. Click on Add

  1. Click on the automatically created stage
  2. Provide a stage name
  3. Click on task

  1. Click on the Agent Job
  2. For Agent pool select Azure Pipelines
  3. For Agent Specification select ubuntu-latest
  4. Click on the + symbol to add a new task

  1. Click on the search box and type use python
  2. On the task Use Python version click on Add
  3. Click on the + symbol to add a new task

For our release pipeline we are going to make use of several Bash scripts. To add them as tasks:

  1. Click on the search box and type bash
  2. Click on Add
  3. This will create the Bash script task; to add more tasks, just click on the + symbol

  1. Type a display name for this task
  2. Select Inline
  3. Insert pip install databricks-cli to install the Databricks CLI library
  4. Click on the + symbol to add a new task

  1. Type a display name for this task
  2. Select Inline
  3. Insert (echo $(databricksToken)) > token-file to create a file named token-file containing the personal access token to your Databricks workspace
  4. Click on the + symbol to add a new task

  1. Type a display name for this task
  2. Select Inline
  3. Insert databricks configure --host $(databricksHost) --token-file token-file to configure databricks-cli authentication via personal access token
  4. Click on the + symbol to add a new task

  1. Type a display name for this task
  2. Select Inline
  3. Insert rm token-file to remove the file we created in the previous steps
  4. Click on the + symbol to add a new task

  1. Type a display name for this task
  2. Select Inline
  3. Insert databricks workspace import_dir -o $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/dbr-cicd-ado /$(Release.Artifacts._databricks-cicd-ado.BuildNumber). This command calls the Databricks Workspace API and imports the build artifact from the System.ArtifactsDirectory to a folder in the Databricks workspace named after the build number
  4. Click on the + symbol to add a new task
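
For reference, here is the same sequence collapsed into a single inline Bash script. The $(...) tokens are Azure DevOps variables that the agent expands before the script runs, so this only works inside a pipeline task and not in a local shell; the post keeps the commands in separate tasks mainly for clearer logs.

# Install the Databricks CLI on the agent
pip install databricks-cli

# Write the personal access token from the variable group to a temporary file
(echo $(databricksToken)) > token-file

# Configure databricks-cli authentication against the workspace
databricks configure --host $(databricksHost) --token-file token-file

# Remove the temporary token file
rm token-file

# Import the build artifact into a workspace folder named after the build number
databricks workspace import_dir -o $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/dbr-cicd-ado /$(Release.Artifacts._databricks-cicd-ado.BuildNumber)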

As we used some variables belonging to the dev variable group, we need to link it to our release pipeline.

  1. Click on Variables
  2. Click on Variable groups
  3. Click on Link variable group
  4. Click on your variable group name
  5. If not yet set, click on Release
  6. Click on Link

This is an optional step, but the idea is that you might have multiple release pipelines for different purposes and you want a way to differentiate them.

  1. Click on Options, General
  2. In the Release name format type $(Build.DefinitionName)-$(Date:yyyyMMdd)-$(rev:r). This will format your release name as dbr-cicd-ado-<dateofthebuild>-<progressivenumber>

  1. Click on the pipeline name to rename it
  2. Save

Run the release pipeline

The release pipeline is created and we are ready for our first deployment.

  1. Click on your release pipeline name
  2. Select the stages you want to deploy
  3. Select the latest version of the artifacts you want to deploy
  4. Click on Create

  1. Click on the artifact name

  1. Click on Deploy
  2. Click on Deploy stage

  1. At this point the deployment has started; you can hover over the stage with your mouse and click on Logs

The Agent will go through the steps and deploy your notebooks to the Databricks Workspace.

As the outcome of your deployment, you will find a new folder in your Databricks workspace, named after your artifact build number and containing the latest version of your notebooks.

Branching Policy

As per our scenario, we should have in place a set of rules that prevents a committer from pushing code directly to the main branch of the repository. In order to do that, we resort to the repository’s branch policies.

  1. Click on Project settings
  2. Click on Repositories
  3. Click on your repository name

  1. If not yet on this screen, click on your repository name
  2. Click on Policies
  3. Scroll all the way down and click on the main branch

  1. Switch on Require a minimum number of reviewers, set it to 1, and tick Allow requestors to approve their own changes; otherwise you won’t be able to approve your own pull request

  1. On the same screen as the previous step, scroll all the way down to Build Validation. Switch it on and select your build pipeline
  2. Click on Save on the top part of the screen

Continuous Delivery

Now that everything is set up, we can configure continuous delivery. Every time a pull request is created, the build pipeline will run and create an artifact; when the code is merged into the main branch, the deployment will start.

  1. Click on Pipelines
  2. Click on Releases
  3. Click on your release pipeline name
  4. Click on Edit

  1. Click on the lightning bolt icon on the Artifacts box
  2. Switch on the Pull request trigger and select the main branch as Target Branch for the Target Branch Filters

  1. Click on the lightning bolt icon on the Stages box
  2. Click on Select trigger and make sure that After release is selected
  3. Enable the Pull request deployment

Conclusion

The current setup allows the team to work within a structured workflow.

Whenever a pull request is created, it will first go through the build pipeline, which performs quality assurance checks before publishing the artifact. As soon as the reviewers have finished their review, the code can be merged and an automatic deployment will be triggered.

Bonus

I purposely left an assertion commented out in the file test_main.py in the repository. Feel free to uncomment it, push it to your repository, and see how Azure Pipelines deals with a failed test.

References

  1. What is Azure DevOps?, https://learn.microsoft.com/en-us/azure/devops/user-guide/what-is-azure-devops
  2. What is Azure Repos?, https://learn.microsoft.com/en-us/azure/devops/repos/get-started/what-is-repos
  3. What is Azure Pipelines?, https://learn.microsoft.com/en-us/azure/devops/pipelines/get-started/what-is-azure-pipelines
  4. What is Azure Test Plans?, https://learn.microsoft.com/en-us/azure/devops/test/overview?view=azure-devops
  5. Variable Groups: Link secrets from an Azure key vault, https://learn.microsoft.com/en-us/azure/devops/pipelines/library/variable-groups?view=azure-devops&tabs=classic#link-secrets-from-an-azure-key-vault