Introduction
Implementing a solution can be a challenging process, requiring a multi-phase, structured, and iterative approach. While this approach can be effective, it can also be error-prone, leading to setbacks and delays. As a developer, you know that even small errors in the development process can have significant consequences, and it’s essential to minimize the risk of mistakes. One way to achieve this is by adopting a seamless automated process, and that’s where Azure DevOps comes in.
In this post we will establish a development process for our Databricks notebooks, from source control to automated deployment.
Advantages
Scenario
In this post we imagine a small team of developers working on a single environment. They don't want unintentional commits to the main branch, so they have to set up some branching policies and, as per common agreement, they decided to have a continuous delivery system that triggers a deployment of the latest artifact every time a pull request is merged into the main branch.
Table of contents
Azure DevOps
Azure DevOps[1] is a set of development tools and services offered by Microsoft that enable developers to create and manage software projects efficiently. It provides a complete suite of tools that help developers deliver software faster and more reliably, from source control to continuous integration and deployment. Azure DevOps also includes project management and collaboration features that make it easy for teams to work together and stay on track. With Azure DevOps, developers can automate their development process, collaborate with their team members, and manage their codebase all in one place. Microsoft provides a comprehensive user guide that explains the various features and capabilities of Azure DevOps.
For the scope of this post we will work only with its Repos[2], Pipelines[3] and Test Plans[4] components.
Nutter
Nutter is a testing framework for Databricks notebooks developed by Microsoft. Its purpose is to provide an easy and reliable way to test notebooks and ensure that they produce the expected results. Nutter supports various types of tests, including unit tests, integration tests, and end-to-end tests. It also provides a command-line interface (CLI) and a Python API for running tests and collecting results. Nutter works by executing notebooks and comparing their outputs to expected values, which can be defined using assertions. It can also handle complex scenarios, such as notebooks that require specific parameters or dependencies. Overall, Nutter makes it easier for developers to write high-quality notebooks and ensure that they work as intended.
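To give a concrete feel for how this works, here is a minimal Nutter fixture. It is a generic sketch, not the test_main.py you will find in the repository: methods prefixed with run_ execute the notebook under test and methods prefixed with assertion_ verify its results. The notebook path and the temporary view are placeholder names.

# Minimal Nutter fixture, meant to run inside a Databricks notebook where
# dbutils and spark are available. "./main" and "my_view" are placeholder names.
from runtime.nutterfixture import NutterFixture

class MainNotebookTest(NutterFixture):
    def run_row_count(self):
        # Execute the notebook under test; it is expected to register a temp view.
        dbutils.notebook.run("./main", 600)

    def assertion_row_count(self):
        # Verify the notebook produced at least one row.
        assert spark.sql("SELECT COUNT(*) AS n FROM my_view").first().n > 0

result = MainNotebookTest().execute_tests()
print(result.to_string())
# result.exit(dbutils)  # uncomment when running through the Nutter CLI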
Prerequisites
In my previous post Develop Azure Databricks Notebooks on Windows with VS Code and dbx, several general steps were already covered that you might find helpful for following this post.
Local development process
https://github.com/AStefanachi/databricks-cicd-ado.git
The outcome of this step should be that you have a fully functional instance of VS Code installed on your local machine and the code from my GitHub repository pushed to your Azure Repos repository.
Create the build pipeline
Now your build pipeline is created, but it is not yet ready to be executed.
Create the dev variable group
If you had a chance to read the code from the GitHub repository, you might have noticed that ado-pipelines/build-notebooks.yaml contains the following code, which defines the stageName, the variables group and the environment:
# Build Pipeline for Azure DevOps
parameters:
  - name: stageName
    type: string
    default: "dev"

stages:
  - stage: ${{ parameters.stageName }}
    variables:
      - group: ${{ parameters.stageName }}
    jobs:
      - deployment: ${{ parameters.stageName }}
        environment: "${{ parameters.stageName }}"
For Azure Pipelines to run our YAML code effectively, we need to complete some setup first.
The dev variable group will hold, among others, the URL of your Databricks workspace (http://adb-<workspaceid>.<randomnumber>.azuredatabricks.net/) and the cluster_id of the cluster that will run the notebooks.
We need to authorize our build pipeline to read our newly created variables.
Run the build pipeline
NOTE
The simplified build pipeline you found in my repository does not cover a step to spin up the cluster in case it is terminated. If your cluster is not running, your build pipeline will fail with a timeout error.
You could check the cluster state via the Databricks Clusters API and, in case its status is TERMINATED, start it and wait until its status is RUNNING before proceeding with the rest of the build; however, that is out of scope for this post.
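Even though we will not wire it into the pipeline, a rough idea of what such a check could look like is sketched below; the environment variable names are illustrative and would map to the values in your variable group.

# Sketch: check the cluster state via the Databricks Clusters API 2.0 and start
# it if needed. DATABRICKS_HOST, DATABRICKS_TOKEN and CLUSTER_ID are illustrative
# names, not taken from the repository.
import os
import time
import requests

host = os.environ["DATABRICKS_HOST"].rstrip("/")
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
cluster_id = os.environ["CLUSTER_ID"]

def cluster_state() -> str:
    resp = requests.get(f"{host}/api/2.0/clusters/get",
                        headers=headers, params={"cluster_id": cluster_id})
    resp.raise_for_status()
    return resp.json()["state"]

if cluster_state() == "TERMINATED":
    # Ask Databricks to start the cluster, then poll until it is RUNNING.
    requests.post(f"{host}/api/2.0/clusters/start", headers=headers,
                  json={"cluster_id": cluster_id}).raise_for_status()
    while cluster_state() != "RUNNING":
        time.sleep(30)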
Proceed with starting your Databricks cluster in case it is in the TERMINATED state.
As this is your first run, you should see only Run pipeline in the middle of your screen. Click it to proceed.
Click on Run to launch the build pipeline.
Click on the job name to get a more in-depth view.
As this is the first time we run the pipeline on a newly created variable group, Azure DevOps will automatically create the dev environment. Click on View and allow the access.
The build pipeline has completed all its steps and its tests are now published in the Runs section of the Azure Test Plans component.
Create the release pipeline
As with the build, releasing an artifact is a multi-step process. The Azure DevOps agent is a virtual machine: we will need to make sure that the agent has the correct version of Python and the databricks-cli library installed, authenticates against the workspace, and finally imports the notebooks into our Databricks workspace.
As we will be creating our custom release pipeline just click on Empty job to proceed.
For our release pipeline we are going to make use of several bash scripts. Add the following commands to your task:

- pip install databricks-cli to install the Databricks CLI library
- (echo $(databricksToken)) > token-file to create a file named token-file containing the personal access token to your Databricks workspace
- databricks configure --host $(databricksHost) --token-file token-file to configure databricks-cli authentication via personal access token
- rm token-file to remove the file we created in the preceding step
- databricks workspace import_dir -o $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/dbr-cicd-ado /$(Release.Artifacts._databricks-cicd-ado.BuildNumber). This command will call the Databricks Workspace API and import the build artifact from the System.ArtifactsDirectory to a folder in the Databricks workspace named after the build number (a rough sketch of what this does under the hood follows below).
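To make that last step a bit more tangible, here is a rough Python sketch of what databricks workspace import_dir does under the hood, pushing each notebook through the Workspace API 2.0. The variable names (DATABRICKS_HOST, DATABRICKS_TOKEN, SOURCE_DIR, TARGET_DIR) are illustrative, and it assumes Python source notebooks.

# Sketch of a recursive notebook import via the Workspace API 2.0.
import base64
import os
import requests

host = os.environ["DATABRICKS_HOST"].rstrip("/")
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
source_dir = os.environ["SOURCE_DIR"]   # e.g. the downloaded build artifact
target_dir = os.environ["TARGET_DIR"]   # e.g. "/" + the build number

for root, _, files in os.walk(source_dir):
    for name in files:
        if not name.endswith(".py"):
            continue
        local_path = os.path.join(root, name)
        # Mirror the local folder structure inside the workspace, dropping the extension.
        rel_path = os.path.splitext(os.path.relpath(local_path, source_dir))[0]
        remote_path = f"{target_dir}/{rel_path}".replace(os.sep, "/")
        # Make sure the target folder exists, then upload the notebook source.
        requests.post(f"{host}/api/2.0/workspace/mkdirs", headers=headers,
                      json={"path": os.path.dirname(remote_path)}).raise_for_status()
        with open(local_path, "rb") as fh:
            content = base64.b64encode(fh.read()).decode("utf-8")
        requests.post(f"{host}/api/2.0/workspace/import", headers=headers,
                      json={"path": remote_path, "format": "SOURCE",
                            "language": "PYTHON", "content": content,
                            "overwrite": True}).raise_for_status()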
As we used some variables belonging to the dev variables group we need to link them to our release pipeline.
This is an optional step, but the idea is that you might have multiple release pipelines for different purposes and you want a way to differentiate them.
Set the release name format to $(Build.DefinitionName)-$(Date:yyyyMMdd)-$(rev:r). This will format your release name as dbr-cicd-ado-<dateofthebuild>-<progressivenumber>.
Run the release pipeline
The release pipeline is created and we are ready for our first deployment.
The Agent will go through the steps and deploy your notebooks to the Databricks Workspace.
As an outcome of your deployment, you will find a new folder in your Databricks workspace, named after your artifact's build number and containing the latest version of your notebooks.
Branching Policy
As per our scenario, we should have in place a set of rules that prevents committers from pushing their code directly to the main branch of the repository. In order to do that, we resort to the repository's branch policies.
Continuous Delivery
Now that everything is set up we can configure the continuous delivery. Every time a pull request is created, the build pipeline will run and create an artifact; when the code is merged into the main branch, the deployment will start.
Conclusion
The current setup allows the team to work within a structured workflow.
Whenever a pull request is created it will first go through the build pipeline which will perform quality assurance checks before publishing the artifact. As soon as the reviewers have finished their process, the code can be merged and an automatic deployment will be triggered.
Bonus
I purposely left a commented-out assertion in the repository's test_main.py file. Feel free to uncomment it, push it to your repository, and see how Azure Pipelines deals with a failed test.
References