Introduction
Whether you are a Data Engineer, a Data Analyst or a Data Scientist, you need to deal with notebooks if you want to have anything to do with Databricks. One could simply log into the Databricks Workspace, create notebooks from there and work just fine, but I tend to work locally from my laptop. There are many Integrated Development Environments (IDEs) to choose from, but my personal preference is Visual Studio Code (VS Code) from Microsoft. I find its user interface minimal and elegant, which results in a pleasant developer experience.
With this post I intend to show you one of the many possibilities offered by Microsoft and Databricks to improve the developer experience.
Step by step, you will install VS Code and the other components necessary to set up a structured development process for your notebooks.
Prerequisites
Install VS Code
Installing VS Code is really straightforward, so I will simply redirect you to the official documentation. You can start by downloading the installer and following the steps. Be mindful to select either the 32-bit or the 64-bit version of VS Code, according to your CPU's architecture.
Install Databricks CLI
On Microsoft Learn there is a fabulous guide on how to install the Databricks CLI. Make sure to follow the step Set up authentication using a Databricks personal access token.
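The guide installs the CLI through pip, so on a machine that already has Python and pip available the installation usually boils down to the two commands below (the package name refers to the legacy databricks-cli used throughout this post; double-check it against the guide if you are on a newer setup):
pip install databricks-cli --upgrade
databricks --version
If the second command prints a version number, the CLI is installed correctly.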
Install dbx
My personal recommendation is to use the Windows Command Prompt and simply type pip install dbx --upgrade
and press enter. After installing the dependencies you can run
dbx --version
from your command prompt. If it displays a version number you are good to go. More information about the dbx installation can be found in the Microsoft Learn guide dbx by Databricks Labs.
Create your Azure Repos
After signing up for your Azure DevOps account and completing the setup of your project, you should be able to access Azure Repos from the menu on the left of the screen. In this guide the repository is named myrepo.
The URL in your browser's address bar is the URL of your Azure Repos repository. Save it in a blank text file, as you will need it later.
Generate a personal access token in Databricks
From your Azure account, access your Databricks workspace. In the workspace, open User Settings, go to the Access tokens tab and generate a new personal access token. Copy it somewhere safe, as you will need it in the next step.
Configure Databricks CLI for personal access token authentication
From your Windows menu, open a Command Prompt and run databricks configure --token. When prompted for the host, enter the URL of your workspace, which looks like
https://adb-<workspaceid>.<randomnumber>.azuredatabricks.net/
and then paste the personal access token you generated earlier. After pressing enter, the Databricks CLI will create a .databrickscfg file in your C:\Users\<yourusername>
folder. If you open it you will notice that the process created a profile named [DEFAULT].
As I am using several profiles for Databricks, I have created additional ones. For this guide I will use [DBX-VSCODE-TUTORIAL].
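If you want to reproduce the same named profile, you can pass it directly to the configure command (the profile name below is simply the one used in this guide, any name works):
databricks configure --token --profile DBX-VSCODE-TUTORIAL
The resulting entry in .databrickscfg should look roughly like this, with your own workspace URL and token:
[DBX-VSCODE-TUTORIAL]
host = https://adb-<workspaceid>.<randomnumber>.azuredatabricks.net/
token = <yourpersonalaccesstoken>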
Local Development Process
Finally, all the pieces are ready and we can start our local development process. It is very simple and consists of four steps:
Prepare VS Code for the interaction with Databricks
Clone the GitHub Repository
For the sake of convenience I have already created a simple notebook that will just showcase our development process. You can clone it from this repository https://github.com/AStefanachi/databricks-dbx-vscode-tutorial.
In VS Code, open the Command Palette (Ctrl+Shift+P), select Git: Clone, paste
https://github.com/AStefanachi/databricks-dbx-vscode-tutorial.git
and press enter. By clicking on Open, VS Code will open the folder and you will be ready to code.
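If you prefer to skip the Command Palette, cloning from any terminal works just as well; the second command assumes the code launcher was added to your PATH by the VS Code installer, which is its default behaviour:
git clone https://github.com/AStefanachi/databricks-dbx-vscode-tutorial.git
code databricks-dbx-vscode-tutorial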
Remove GitHub remote
From the Command Palette, run Git: Remove Remote, select the remote you want to remove (origin, pointing to GitHub) and press enter.
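The same can be done from the integrated terminal with a single git command:
git remote remove origin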
Add the Azure Repos to the remote
From the Command Palette, run Git: Add Remote. This will open a text box.
Paste the Azure Repos URL in the text box and click on Add remote from URL.
Add the name of your remote (myrepo in this guide) and press enter.
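Again, the integrated terminal offers an equivalent; replace the placeholder with the Azure Repos URL you saved earlier and keep the remote name myrepo so that the later commands in this guide work unchanged:
git remote add myrepo <your-azure-repos-url>
git remote -v
The second command simply lists the configured remotes so you can verify that myrepo was added.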
Git: Allow unrelated histories
Before we proceed with pushing our new notebook to our Azure Repos repository, we need to understand the problem we are about to face: unrelated commit histories.
By default, Git will prevent you from merging unrelated histories, because it can cause conflicts and make it difficult to track changes. However, there are some situations where you may need to merge unrelated histories, such as when you are trying to combine two repositories that were previously separate like in our case.
For convenience, we delete the local README.md file, as it would otherwise cause a merge conflict.
We open a new terminal and run a few git commands.
git checkout main
and press enter to check out the main branch.
git add .
and press enter to stage the latest changes (the deletion of README.md).
git commit -m "remove README.md"
and press enter to commit them, so that the merge can run on a clean working tree.
git pull myrepo main --allow-unrelated-histories
and press enter to pull the code from the Azure Repos repository and merge its content with your local git history.
git push myrepo main
and press enter to push your code to the Azure Repos.
Now your local git repository and your Azure Repos are connected with no conflicts and you can proceed.
Create the deployment.yaml file
In your project folder, create a new folder named conf and, inside it, create an empty file named deployment.yaml with the following content:
build:
  no_build: true
environments:
  default:
    workflows:
      - name: "databricks-dbx-vscode-tutorial"
        git_source:
          git_url: "https://dev.azure.com/andreastefanachi/stefanachi/_git/myrepo"
          git_provider: "azuredevopsservices"
          git_branch: "main"
        tasks:
          - task_key: "execute-notebook"
            notebook_task:
              notebook_path: "vscode-dbx-tutorial"
            existing_cluster_id: "INSERTYOURCLUSTERID"
        deployment_config:
          no_package: true
          no_build: true
In the deployment.yaml file, replace the existing_cluster_id placeholder with your cluster id, which you can find by opening the cluster's configuration page in your Databricks workspace: the cluster_id is part of the page URL. Save the deployment.yaml file.
Let me give you a brief explanation of the yaml file. These configurations will allow dbx to skip the build step and proceed to extract the necessary information to perform the API calls to create the Databricks Workflow and execute the notebooks from the Azure Repos. More information about it can be found in the dbx documentation on using git_source to specify the remote source.
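If you prefer the command line, the Databricks CLI can also list your clusters together with their ids; reusing the profile created earlier, a call like the following should be enough (the output includes the cluster_id, the cluster name and its state):
databricks --profile=DBX-VSCODE-TUTORIAL clusters list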
Add the Azure Repos to Databricks Workspace
From your Databricks workspace, open the Repos section, click Add Repo, paste the Azure Repos URL you saved earlier and select Azure DevOps Services as the Git provider.
This will create a connection between your Databricks workspace and your Azure Repos.
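If you would rather script this step, the same link can be created with the repos command group of the Databricks CLI; the call below is a sketch based on the standard command, with the path mirroring the one used later in this guide:
databricks --profile=DBX-VSCODE-TUTORIAL repos create --url <your-azure-repos-url> --provider azureDevOpsServices --path /Repos/<yourdatabricksusername>/myrepo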
This covers the steps necessary during development. Once the changes to the notebooks you are working on are committed and pushed to the repository, you are ready to proceed to the next steps.
Pull your code in your Databricks Repos
In order to synchronize the latest changes from your Azure Repos with Databricks, you need to execute the following command in a terminal in VS Code:
databricks --profile=DBX-VSCODE-TUTORIAL repos update --path /Repos/<yourdatabricksusername>/myrepo --branch main
and press enter.
Deploy the workflow on Databricks
In the VS Code terminal type dbx deploy databricks-dbx-vscode-tutorial
and press enter. This will create the workflow and the task in Databricks.
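By default dbx looks for the deployment file in the conf folder; if your version does not pick it up automatically, you can point to it explicitly with the --deployment-file option:
dbx deploy databricks-dbx-vscode-tutorial --deployment-file conf/deployment.yaml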
Launch the Workflow
dbx launch databricks-dbx-vscode-tutorial
and press enter. This will execute the workflow in Databricks.
Conclusion