Getting started with Azure Databricks

June 3, 2023 6 mins to read

Introduction

The best way to retain knowledge is to practice what you are learning. You can spend only so much time reading documentation and watching videos; at some point, you need to put that knowledge into practice.

In this post I will guide you through what I would set up for myself if I had to start again from the beginning. It requires only minimal knowledge of Microsoft Azure and its components, so the guide is intended for readers of all levels.

Disclaimer

Azure Databricks is a paid service that is not included in the free tier of Microsoft Azure. If you are no longer interested in using Azure Databricks, read the Cleanup section.

Microsoft Azure Account

To have an Azure Databricks workspace you need an Azure account. Microsoft Azure offers new users a free account. The account includes more than 55 always-free services plus a plethora of premium services that are free for the first 12 months.

Unfortunately Azure Databricks is not one of them, but not all is lost.

Your free account also includes a generous $200 credit that you can spend in the first 30 days.

Here is the link to Microsoft Azure’s website where you can sign up for your free account: https://azure.microsoft.com/en-us/free/search/. The process is painless and takes only a few minutes.

My advice: use the Microsoft Azure free account wisely and learn as much as possible.

Create the resource group

  1. Navigate to the page http://portal.azure.com and log in
  2. On the left side of the page, click on the icon with three horizontal lines to open the left menu
  3. Click on Resource groups
  4. Click on the + symbol to start the resource group creation wizard
  5. Enter the resource group name. I suggest following the naming convention proposed by Microsoft in the Cloud Adoption Framework[1]. I opted for the name rg-dbr-d-we-001, where:
    • rg: resource group
    • dbr: Databricks
    • d: development
    • we: West Europe
    • 001: instance
  6. Since I live in Germany, I opted for (Europe) West Europe as the region. My suggestion is to select the region of the Microsoft Azure datacenter nearest to you.
  7. Press Next: Tags
  8. I added two tags, owner and environment. These help later to identify the purpose of the resources we are creating
  9. Press Next: Review + create for the last step

If everything checks out, press Create. Microsoft Azure will provision a resource group with the settings of your choice.

You can access your resource groups from the left menu. Once you have found the one you created, click on its name to access its components.
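
The naming pattern used for the resource group can be sketched in a few lines of Python. This is only an illustration of how the name rg-dbr-d-we-001 is composed; the class and field names are my own, and the tag values are placeholders, since only the tag keys (owner and environment) are part of this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceName:
    """Components of a name such as rg-dbr-d-we-001 (hypothetical helper)."""

    resource_type: str  # "rg" for resource group
    workload: str       # "dbr" for Databricks
    environment: str    # "d" for development
    region: str         # "we" for West Europe
    instance: int       # 1, rendered as "001"

    def render(self) -> str:
        # Join the parts with hyphens and zero-pad the instance to three digits.
        return "-".join(
            [
                self.resource_type,
                self.workload,
                self.environment,
                self.region,
                f"{self.instance:03d}",
            ]
        )


rg_name = ResourceName("rg", "dbr", "d", "we", 1).render()
print(rg_name)

# Placeholder tag values; replace them with your own.
tags = {"owner": "<your-name>", "environment": "development"}
```

Keeping the components separate makes it easy to stay consistent when you later name further resources in the same environment.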

Azure Databricks

To add an Azure Databricks workspace to the resource group, we first need to hit the Create button to add a new resource.

  1. Type Databricks in the search bar and press Enter
  2. On the Databricks resource, click on Create
  3. Click on Azure Databricks
  4. If not yet selected, select the resource group we created before. Alternatively, you can create a new one by clicking on Create new and following the same procedure
  5. For the workspace name I tried to be consistent and used a logic similar to the one for the resource group, hence lab-dbr-we-001
  6. For the pricing tier, the Trial option (14 days of free DBUs) was available to me, and it is more than enough for our purpose. Later on you can switch to the Standard tier to contain costs
  7. Click on Tags
  8. Add the tags
  9. Click on Review + create

If everything checks out, click on the Create button, which starts the deployment of the resource in your chosen resource group.

Be aware that the deployment of Azure Databricks may take a few minutes. Once it has completed successfully:

  1. Click on Go to the resource
  2. Click on Launch Workspace

When prompted to choose your current data project, take your own pick. For the purpose of this guide I have chosen Exploring Data (Python, R).

Click on Finish and your Azure Databricks workspace is finally ready to be used.

Create a cluster

In Azure Databricks, a cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning[2]. You can run these workloads as a set of commands in a notebook or as an automated job[2].

A cluster can be thought of as a group of virtual machines specifically configured to run Spark applications. When you create a cluster, you can specify the number and type of virtual machines to use, as well as other configuration options such as the version of Spark, the amount of memory and CPU allocated to each node, and the types of storage.

  1. From your Azure Databricks workspace home page, hover over the left menu bar. This will reveal all the options
  2. Click on Compute
  3. Click on Create Compute

As this guide is meant just to get the reader started with Azure Databricks I will not go through all the configurations that are possible for a cluster. More information can be found in the Microsoft Learn documentation[2].
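
For orientation only, here is a minimal sketch of how the options mentioned above (number of nodes, node type, Spark version) could be written down as a Python dictionary. The field names follow the Databricks Clusters API as I understand it; every value is an assumption chosen for illustration and not a setting used in this guide, so check the versions and node types actually offered in your workspace.

```python
import json

# Illustrative values only; none of them come from this guide.
cluster_spec = {
    "cluster_name": "lab-cluster",        # hypothetical name
    "spark_version": "12.2.x-scala2.12",  # runtime version key, varies by workspace
    "node_type_id": "Standard_DS3_v2",    # Azure VM size used for the nodes
    "num_workers": 1,                     # worker nodes, the driver is extra
    "autotermination_minutes": 30,        # stop the idle cluster to save credit
}


def total_nodes(spec: dict) -> int:
    """Return the number of virtual machines: the workers plus one driver."""
    return spec["num_workers"] + 1


print(json.dumps(cluster_spec, indent=2))
print("Virtual machines in the cluster:", total_nodes(cluster_spec))
```

On a free account, a small worker count and a short auto-termination time are the two settings that matter most for keeping the $200 credit from running out.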

Create a notebook and run code: Hello, World!

  1. Hover over the left menu and click on Workspace
  2. On the workspace blade, right-click to reveal the menu, then click on Create
  3. Click on Notebook
  4. It is advisable to rename your notebook to something more meaningful than Untitled Notebook
  5. You can choose one of the languages supported by Azure Databricks (Python, R, SQL, Scala)
  6. For this example I have chosen Python, so you can enter the following print statement: print('Hello, World!')
  7. The notebook cell can be executed by pressing the play button
  8. You can also use the keyboard combination Shift+Enter to execute a cell

If for some reason the cluster is detached or terminated, you will be prompted to attach the notebook to a compute resource when you execute the cell.
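
As a small next step after the print statement, the same cell could be written with a function, so that you can change the greeting and run the cell again. The function name greeting is my own choice for this sketch.

```python
def greeting(name: str = "World") -> str:
    """Build the greeting text for the given name."""
    return f"Hello, {name}!"


# Same output as print('Hello, World!')
print(greeting())

# Re-run the cell with a different name to see the output change.
print(greeting("Azure Databricks"))
```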

Congratulations, you just ran your first line of code on Azure Databricks.

Cleanup

Keeping Microsoft Azure resources, even in a free account, will incur costs at some point. If you are no longer interested in using Azure Databricks, the following steps explain how to delete the workspace.

Open the resource group from the menu in the Azure Portal, then:

1. Select the Azure Databricks Workspace by ticking the checkbox
2. Click on Delete
3. Enter delete in the text box and press Delete

This will initiate the deletion of the resource. Once it is deleted, the workspace will not generate any further costs.

References

  1. Cloud Adoption Framework, naming conventions: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/govern/resource-consistency/naming
  2. Microsoft Learn: Clusters: https://learn.microsoft.com/en-us/azure/databricks/clusters/