Develop Azure Databricks Notebooks on Windows with VS Code and dbx

June 11, 2023

Introduction

Whether you are a Data Engineer, Data Analyst or Data Scientist, you need to deal with notebooks if you want to have anything to do with Databricks. You could simply log into the Databricks Workspace, create notebooks there and work just fine, but I prefer to work locally on my laptop. There are many Integrated Development Environments (IDEs) to choose from, but my personal preference is Visual Studio Code (VS Code) from Microsoft. I find its user interface minimal and elegant, which makes for a pleasant developer experience.

With this post I intend to show you one of the many possibilities offered by Microsoft and Databricks for their developer experience.

Step by step, you will install VS Code and the other components you need for a structured notebook development process.

Prerequisites

Install VS Code

Installing VS Code is straightforward, so I will simply point you to the official documentation. Start by downloading the installer and follow the steps. Make sure to select the build that matches your CPU architecture (32-bit or 64-bit).

Install Databricks CLI

On Microsoft Learn there is a fabulous guide on how to install the Databricks CLI. Make sure to follow the step "Set up authentication using a Databricks personal access token".

Install dbx

My personal recommendation is to use the Windows Command Prompt and simply type pip install dbx --upgrade. Once the installation finishes, run dbx --version from the same prompt. If it displays a version number you are good to go. More information about installing dbx can be found in the Microsoft Learn guide dbx by Databricks Labs.
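
As an optional sanity check, the same verification can be done from Python. This is only a sketch: the helper name locate_tools is hypothetical, and it assumes that dbx and the Databricks CLI install executables named dbx and databricks on your PATH.

```python
import shutil


def locate_tools(names):
    """Return a dict mapping each tool name to its full path, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # "databricks" and "dbx" are the executable names assumed in this guide.
    for tool, path in locate_tools(["databricks", "dbx"]).items():
        status = path if path else "not found on PATH"
        print(f"{tool}: {status}")
```

If one of the tools is reported as not found, reopening the Command Prompt after the installation is usually worth trying before reinstalling, since PATH changes only apply to new sessions.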

Create your Azure Repos

After signing up for your Azure DevOps account and completing the setup of your project, you should be able to access Azure Repos from the menu on the left of the screen.

  1. From your project’s overview click on Repos
  2. Click on your default Repo name, usually it’s your project’s name
  3. Click on New Repository

    1. Type in the repository name, for this guide I will use myrepo
    2. Click on Create and Azure DevOps will create your repository.

The URL in your browser's address bar is the URL of your Azure Repos repository. Save it in a blank text file as you will need it later.

Generate a personal access token in Databricks

From your Azure account, open your Databricks workspace.

  1. On the top-right side of your screen click on your account name
  2. Click on User Settings
  3. If the Access tokens pane is not already open, click on Access tokens
  4. Click on Generate new token
  5. Type in a Comment, which will be the name you assign to your token
  6. Click on Generate
  7. Copy your token and, for your convenience, save it in the same text file where you saved your Azure Repos URL
  8. Click on Done

Configure Databricks CLI for personal access token authentication

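From your Windows menu, open a Command Prompt and start the token-based configuration. With the Databricks CLI this is typically done by running databricks configure --token, which prompts for the two values below.

  1. Copy and paste your per-workspace URL, which is the first part of your Databricks workspace URL, in the form https://adb-<workspaceid>.<randomnumber>.azuredatabricks.net/
  2. Copy and paste the personal access token you created before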

After pressing enter, Databricks CLI will create a .databrickscfg file in your C:\Users\<yourusername> folder. If you open it you will notice that the process created a profile named [DEFAULT].

As I am using several profiles for Databricks, I have created other profiles. For this guide I will use [DBX-VSCODE-TUTORIAL].
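
To see which profiles a .databrickscfg file contains without opening it by hand, a few lines of Python are enough. The sketch below is illustrative only: list_profiles is a hypothetical helper, the sample hosts and tokens are made up, and it assumes the file uses the usual INI layout with host and token keys per profile.

```python
import configparser

# Made-up sample in the layout described above; hosts and tokens are placeholders.
SAMPLE_CFG = """\
[DEFAULT]
host = https://adb-1111111111111111.1.azuredatabricks.net/
token = dapi-placeholder-default

[DBX-VSCODE-TUTORIAL]
host = https://adb-2222222222222222.2.azuredatabricks.net/
token = dapi-placeholder-tutorial
"""


def list_profiles(cfg_text):
    """Map each profile name to its host; the token is never returned."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    profiles = {}
    # configparser treats [DEFAULT] specially, so it is not part of sections().
    if parser.defaults():
        profiles["DEFAULT"] = parser.defaults().get("host")
    for name in parser.sections():
        # A profile without its own host inherits the one from [DEFAULT].
        profiles[name] = parser.get(name, "host", fallback=None)
    return profiles


if __name__ == "__main__":
    for name, host in list_profiles(SAMPLE_CFG).items():
        print(f"[{name}] -> {host}")
```

To inspect your real file, pass the text of the .databrickscfg file in your user folder instead of SAMPLE_CFG. Only the host is printed, so the output is safe to share.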

Local Development Process

Finally, all the pieces are ready and we can start our local development process. The process is simple and consists of 5 steps:

  1. Develop your notebooks locally on VS Code
  2. Push your code to Azure Repos
  3. Pull your code in your Databricks Repos
  4. Deploy the workflow on Databricks
  5. Launch the Workflow

Prepare VS Code for the interaction with Databricks

Clone the GitHub Repository

For the sake of convenience I have already created a simple notebook that will just showcase our development process. You can clone it from this repository https://github.com/AStefanachi/databricks-dbx-vscode-tutorial.

  1. Open your VS Code and click on the Source Control Icon on the left menu
  2. Click on Clone Repository
  3. Paste the repository url https://github.com/AStefanachi/databricks-dbx-vscode-tutorial.git and press enter
  4. Click on This PC
  5. Click on Local Disk (C:)
  6. Click on Select as Repository Destination

By clicking on Open, VS Code will open the folder and you will be ready to code.

Remove GitHub remote

  1. Open your VS Code and click on the Source Control Icon on the left menu
  2. Click on the three dots to open the menu
  3. Click on Remote
  4. Click on Remove Remote

Select the remote you want to remove and press enter.

Add the Azure Repos to the remote

  1. Open your VS Code and click on the Source Control Icon on the left menu
  2. Click on the three dots to open the menu
  3. Click on Remote
  4. Click on Add Remote

This will open a text box.

Paste the Azure Repos URL in the text box and click on Add remote from URL.

Type a name for the remote and press Enter. This guide uses myrepo, the name that the git commands in the next section refer to.

Git: Allow unrelated histories

Before we proceed in pushing our new notebook to our Azure Repos repository we need to understand the problem we are about to face: unrelated commit histories.

By default, Git will prevent you from merging unrelated histories, because it can cause conflicts and make it difficult to track changes. However, there are some situations where you may need to merge unrelated histories, such as when you are trying to combine two repositories that were previously separate like in our case.

For convenience, delete the README.md file from the cloned project, as it would otherwise cause merge conflicts.

Then open a new terminal and run the following git commands.

  1. Type git checkout main and press enter to check out the main branch
  2. Type git add . and press enter to stage the latest changes
  3. Type git commit -m "deleted README.md" and press enter to commit the changes
  4. Type git pull myrepo main --allow-unrelated-histories and press enter to pull the code from the Azure Repos repository and merge its content with your local repository
  5. Type git push myrepo main and press enter to push your code to the Azure Repos

Now your local git repository and your Azure Repos are connected with no conflicts and you can proceed.
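
If you expect to repeat this sequence, the commands can be collected in a small script. The following is a minimal sketch, not part of the original project: merge_commands and run_all are hypothetical helpers, and the defaults assume the remote name myrepo and the branch main used in this guide. By default it only prints the commands.

```python
import shlex
import subprocess


def merge_commands(remote="myrepo", branch="main", message="deleted README.md"):
    """Return the git commands from the steps above as argument lists."""
    return [
        ["git", "checkout", branch],
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "pull", remote, branch, "--allow-unrelated-histories"],
        ["git", "push", remote, branch],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(merge_commands())  # dry run: prints the commands without running them
```

Calling run_all(merge_commands(), dry_run=False) from the project folder would execute the commands in order. Stopping at the first failure matters here, because pushing after a failed pull would not give the result described above.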

Create the deployment.yaml file

  1. In your VS Code, click on the File Explorer icon from the left menu
  2. In the main folder create a new folder named conf
  3. In the folder conf create an empty file named deployment.yaml
build:
  no_build: true
environments:
  default:
    workflows:
      - name: "databricks-dbx-vscode-tutorial"
        git_source:
          git_url: "https://dev.azure.com/andreastefanachi/stefanachi/_git/myrepo"
          git_provider: "azuredevopsservices"
          git_branch: "main"
        tasks:
          - task_key: "execute-notebook"
            notebook_task:
              notebook_path: "vscode-dbx-tutorial"
            existing_cluster_id: "INSERTYOURCLUSTERID"
            deployment_config:
              no_package: true
              no_build: true
  4. Paste the code above in the deployment.yaml file
  5. Replace the git_url value with the Azure Repos URL you saved earlier
  6. Replace the existing_cluster_id value with your cluster id, which you can find as follows:
    1. In your Databricks Workspace click on Compute in the left menu
    2. Click on your cluster name
    3. On the configuration page, click on the JSON option on the right side of the screen to reveal the JSON configuration
    4. The last option is the cluster_id
  7. Save the deployment.yaml file

Let me give you a brief explanation of the YAML file. These settings tell dbx to skip the build step and to extract the information it needs for the API calls that create the Databricks Workflow and execute the notebook from the Azure Repos. More information can be found in the dbx documentation on using git_source to specify the remote source.
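
A common slip is to deploy while the INSERTYOURCLUSTERID placeholder is still in the file. The sketch below is a simple text check you could run before deploying. It is an illustration only: find_placeholder_lines is a hypothetical helper, it does not parse or validate the YAML, and it assumes the file lives at conf/deployment.yaml relative to the folder you run it from.

```python
from pathlib import Path

PLACEHOLDER = "INSERTYOURCLUSTERID"


def find_placeholder_lines(text, placeholder=PLACEHOLDER):
    """Return the 1-based numbers of the lines that still contain the placeholder."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if placeholder in line
    ]


if __name__ == "__main__":
    path = Path("conf") / "deployment.yaml"
    if path.exists():
        hits = find_placeholder_lines(path.read_text(encoding="utf-8"))
        if hits:
            print(f"{path}: placeholder still present on line(s) {hits}")
        else:
            print(f"{path}: no placeholder found")
    else:
        print(f"{path} not found; run this from the project folder")
```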

Add the Azure Repos to Databricks Workspace

From your Databricks workspace:

  1. Click on Repos
  2. Click on your user
  3. Click on Add Repo
  4. Paste the Azure Repos url you saved in the blank text file you created in the previous steps
  5. Click on Create Repo

This will create a connection between your Databricks workspace and your Azure Repos.

That covers everything you need during development. Once the changes to your notebooks are committed and pushed to the repository, you are ready to proceed to the next steps.

Pull your code in your Databricks Repos

To synchronize the latest changes from your Azure Repos with Databricks, execute the following command in a terminal in VS Code and press enter:
databricks --profile=DBX-VSCODE-TUTORIAL repos update --path /Repos/<yourdatabricksusername>/myrepo --branch main

Deploy the workflow on Databricks

In the VS Code terminal type dbx deploy databricks-dbx-vscode-tutorial and press enter. This will create the workflow and the task in Databricks.

Launch the Workflow

  1. In the VS Code terminal type dbx launch databricks-dbx-vscode-tutorial and press enter. This will execute the workflow in Databricks
  2. Click on the Run URL to open the job run in the Databricks UI
  3. If the cluster is not yet running, Databricks will start it when you run the above command
  4. In the UI you can see the result of the execution of your job
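
The pull, deploy and launch commands always run in the same order, so they can be generated from a handful of parameters. The sketch below only builds and prints the commands. It is an assumption-laden illustration: release_commands is a hypothetical helper, the defaults mirror the names used in this guide, and the username shown is a placeholder.

```python
import shlex


def release_commands(username, repo="myrepo", branch="main",
                     profile="DBX-VSCODE-TUTORIAL",
                     workflow="databricks-dbx-vscode-tutorial"):
    """Return the pull, deploy and launch commands in the order used above."""
    repo_path = f"/Repos/{username}/{repo}"
    return [
        ["databricks", f"--profile={profile}", "repos", "update",
         "--path", repo_path, "--branch", branch],
        ["dbx", "deploy", workflow],
        ["dbx", "launch", workflow],
    ]


if __name__ == "__main__":
    # "someone@example.com" stands in for your own Databricks username.
    for argv in release_commands("someone@example.com"):
        print(shlex.join(argv))
```

Printing the commands first keeps the script harmless. You can paste the output into the VS Code terminal once the values look right.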

Conclusion

  • Simplified development process: dbx simplifies the development process by providing a command-line interface to interact with Databricks. This allows you to easily manage your Databricks workspace, clusters, and notebooks from your local machine.
  • Better collaboration: By using Azure Repos, you can collaborate with other developers and maintain version control of your code. This ensures that everyone is working on the same codebase and reduces the risk of errors and conflicts.
  • Easy deployment: With the deployment.yaml file, you can easily deploy your code to Databricks with a single command. This simplifies the deployment process and reduces the risk of errors.
