Azure Databricks: Read a csv file from an Azure Storage Account and save it as a table

June 7, 2023

Introduction

In this guide we will face one very common problem presented to data professionals: reading data from an Azure Storage Account. The task is not technically complex, but it requires the interaction of several Microsoft Azure services and the use of security best practices, which makes the guide a bit longer to follow but still enjoyable and satisfying.

Assumptions

    • A Microsoft Azure account
    • A basic understanding of Microsoft Azure is helpful but not required, as its user interface is very intuitive and beginner friendly
    • A resource group
    • A Databricks workspace with at least one cluster
    • A copy of the bing_covid-19_data.csv Azure Open Dataset[3]
    • A basic understanding of python

For your convenience, here you can find my previous post, Getting started with Azure Databricks, which deals with all these topics.

Table of contents

  1. Create a storage account
  2. Upload a csv file in the storage account
  3. Create a key vault
  4. Generate a SAS token
  5. Grant the right permissions
  6. Generate the secret in the key vault for the SAS token
  7. Create a secret scope in Databricks
  8. Clone a repository in Databricks
  9. Executing the notebook
  10. Explaining the code
  11. Cleanup
  12. References

Create a storage account

  1. Navigate to your resource group of choice
  2. On the top left, click on Create

  1. On the search bar type storage account and press enter
  2. Click on Create
  3. Click on Storage account

  1. Microsoft Azure should automatically select the resource group that was used to create the resource as the default
  2. Type a storage account name
  3. In order to save cost, I have selected the cheapest option, which is Locally-redundant storage (LRS).

Please be aware that LRS copies your data synchronously three times within a single physical location in the primary region. LRS is the least expensive replication option but, according to Microsoft, isn’t recommended for applications requiring high availability or durability[1].

  1. In order to save cost, my advice is to uncheck all the options highlighted in the picture above
  2. Press Review to proceed

For the purpose of this guide we will not benefit from features such as soft delete or point-in-time recovery.

A good measure, especially for less experienced readers, would be to read Best practices for Azure Storage data protection, backup, and recovery[2] by Microsoft, which clarifies far more than I can cover in this guide.

  1. For consistency, add owner and environment tags, then press Review

If everything checks out you can proceed with clicking the Create button. This will initiate the deployment of the storage account in your resource group.
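If you prefer scripting over the portal, the same deployment can be sketched with the azure-mgmt-storage Python SDK. This is only a minimal sketch: the subscription id, resource group, and credentials below are placeholders, not values from this guide.

# Minimal sketch: create the storage account with the azure-mgmt-storage SDK.
# Subscription id and resource group are placeholders (assumptions).
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "<your-resource-group>"     # placeholder

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Standard_LRS mirrors the Locally-redundant storage option selected in the portal
poller = client.storage_accounts.begin_create(
    resource_group,
    "rgstadwe001",  # storage account names must be globally unique
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="westeurope",
        tags={"owner": "your-name", "environment": "dev"},
    ),
)
print(poller.result().provisioning_state)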

Upload a csv file in the storage account

  1. From the resource group click on the storage account name that you created in the previous step. From the left menu click on Containers
  2. On the top left, click on Container
  3. On the right will appear a New Container blade. Type the name in the text box (e.g. data)
  4. Click on Create
  5. After creation, access the container by clicking on its name

  1. Click on Upload
  2. On the right will appear an Upload Blob blade: drag and drop or browse for the csv file
  3. Click on Upload
  4. If everything worked as expected you will be able to see your csv file as a new blob
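
The same upload can also be scripted with the azure-storage-blob SDK instead of the portal; the connection string and local file path below are placeholders you would replace with your own.

# Minimal sketch: upload the csv with the azure-storage-blob SDK.
# The connection string comes from the storage account's Access keys blade (placeholder here).
from azure.storage.blob import BlobServiceClient

connection_string = "<your-storage-account-connection-string>"  # placeholder
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client("data")

# Create the container if it doesn't exist yet, then upload the csv as a blob
if not container_client.exists():
    container_client.create_container()

with open("bing_covid-19_data.csv", "rb") as data:
    container_client.upload_blob(name="bing_covid-19_data.csv", data=data, overwrite=True)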

Create a key vault

As seen in the Create a storage account step, search for key vault in the marketplace and start the creation.

  1. Microsoft Azure should automatically select the resource group that was used to create the resource as the default
  2. Type a key vault name
  3. For the region I have selected West Europe
  4. In order to save cost I have selected the cheapest option for the pricing tier
  5. Click on Tags to continue

There are several strategies and options to preserve the integrity of your key vault. In this scenario we assume that role-based access control is sufficient; however, it is recommended to read the best practices[4] suggested by Microsoft.

  1. For consistency, add owner and environment tags then click on Review + create

If everything checks out you can proceed with clicking the Create button. This will initiate the deployment of the key vault in your resource group.
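
As with the storage account, the key vault deployment can be sketched with the azure-mgmt-keyvault SDK if you prefer code over the portal. The subscription id, tenant id and resource group below are placeholders, not values from this guide.

# Minimal sketch: create the key vault with the azure-mgmt-keyvault SDK.
from azure.identity import DefaultAzureCredential
from azure.mgmt.keyvault import KeyVaultManagementClient
from azure.mgmt.keyvault.models import VaultCreateOrUpdateParameters, VaultProperties, Sku

subscription_id = "<your-subscription-id>"   # placeholder
tenant_id = "<your-tenant-id>"               # placeholder
resource_group = "<your-resource-group>"     # placeholder

client = KeyVaultManagementClient(DefaultAzureCredential(), subscription_id)

# Standard tier and RBAC authorization, matching the choices made in the portal
poller = client.vaults.begin_create_or_update(
    resource_group,
    "kwdwe001",
    VaultCreateOrUpdateParameters(
        location="westeurope",
        tags={"owner": "your-name", "environment": "dev"},
        properties=VaultProperties(
            tenant_id=tenant_id,
            sku=Sku(family="A", name="standard"),
            enable_rbac_authorization=True,
        ),
    ),
)
print(poller.result().properties.vault_uri)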

Generate a SAS token

From the resource group click on your storage account name to access its properties.

  1. Click on Shared access signature
  2. Check all the options in Allowed resource types
  3. In Allowed permissions keep the check only on Read and List
  4. Uncheck blob versioning permissions
  5. Generate SAS and connection string

In our scenario we assume that we want to grant Databricks limited access in terms of time and permissions, hence the options selected above. When the token expires, Databricks will no longer be able to read the data from the storage account and will throw authentication errors; in that case you will need to generate a new token.

For our purpose we are only interested in the SAS token. It's very important that you keep this tab open until we save the token as a secret in the key vault; otherwise you will need to repeat this step.
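
If you would rather not click through the portal, a comparable account SAS (read and list only, limited lifetime) can be generated with the azure-storage-blob SDK. The account key below is a placeholder you would retrieve from the storage account's Access keys blade.

# Minimal sketch: generate an account SAS comparable to the one created in the portal.
from datetime import datetime, timedelta
from azure.storage.blob import generate_account_sas, ResourceTypes, AccountSasPermissions

ACCOUNT_NAME = "rgstadwe001"
ACCOUNT_KEY = "<your-storage-account-key>"   # placeholder

sas_token = generate_account_sas(
    account_name=ACCOUNT_NAME,
    account_key=ACCOUNT_KEY,
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True),   # read and list only
    expiry=datetime.utcnow() + timedelta(days=30),            # limited lifetime
)
print(sas_token)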

Grant the right permissions

Even though your account probably has Owner privileges, you won't be able to administer the secrets in the key vault, and Databricks also needs to be authorized to read the key vault.

From the resource group click on the key vault name to access the overview.

  1. Click on Access control (IAM)
  2. Click on Add
  3. Click on Add role assignment

  1. In the text box, type key vault administrator and press enter
  2. Click on Key Vault Administrator
  3. Click on Members

  1. Click on Select members
  2. In the text box type your name and press enter. Click on your account to select
  3. Click on Select
  4. Click on Next to proceed

By clicking on Review + assign you will grant your account the Key Vault Administrator role, which will allow you to create and manage secrets in the key vault.

In the same way we added our account to the Key Vault Administrator role, proceed to add the Azure Databricks service principal to the Key Vault Secrets User role.

Generate the secret in the key vault for the SAS token

From your resource group click on the key vault name to access the overview.

  1. On the left menu, click on Secrets
  2. Click on Generate/Import

  1. Type a name for the secret
  2. Paste the SAS Token from the previous step
  3. Check Set activation date
  4. Click on Create

The secret is now created and ready for its use.
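
As a hedged alternative to the portal, the same secret could be created with the azure-keyvault-secrets SDK. This requires the azure-identity package and the Key Vault Administrator role granted earlier; the vault URL and token value below are placeholders.

# Minimal sketch: store the SAS token as a key vault secret with azure-keyvault-secrets.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://kwdwe001.vault.azure.net/"   # your key vault's Vault URI
sas_token = "<paste-the-sas-token-here>"          # placeholder

client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# Requires a role that can manage secrets, e.g. Key Vault Administrator
secret = client.set_secret("sas-token-sta-d-001", sas_token)
print(secret.name, secret.properties.version)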

Create a secret scope in Databricks

According to Microsoft’s documentation, to reference secrets stored in an Azure Key Vault, you can create a secret scope backed by Azure Key Vault. You can then leverage all of the secrets in the corresponding Key Vault instance from that secret scope. Because the Azure Key Vault-backed secret scope is a read-only interface to the Key Vault, the PutSecret and DeleteSecret Secrets API operations are not allowed. To manage secrets in Azure Key Vault, you must use the Azure Set Secret REST API or the Azure portal UI[5].

Before we get started, we will need to obtain the key vault’s Vault URI and Resource ID, which are required to create the secret scope.

From your resource group, click on the key vault’s name and enter its overview. On the right, click on JSON View, which will reveal the following pane.


  1. Here you can read the Vault URI
  2. Here you can read the Resource ID

Keep this tab open, as you will need to copy this information in Databricks to create the secret scope.

Access your Databricks workspace and navigate to this URL (replacing yourdatabricksinstance with your own workspace instance): https://yourdatabricksinstance.azuredatabricks.net/#secrets/createScope then:

  1. Insert a unique scope name
  2. Insert the key vault’s Vault URI
  3. Insert the key vault’s Resource ID
  4. Click on Create

By doing this you have successfully configured Databricks to read the key vault.
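
A quick way to verify the scope from a notebook cell is to list the scopes and resolve the secret with dbutils; the scope and secret names below match the ones used later in this guide, so adjust them to your own.

# Verify that the Azure Key Vault-backed scope is visible to the workspace
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("kwdwe001"))

# The value is redacted in the notebook output, which confirms access without exposing it
print(dbutils.secrets.get(scope="kwdwe001", key="sas-token-sta-d-001"))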

Clone a repository in Databricks

From your Databricks workspace:

  1. Click on Repos
  2. Click on Add Repo
  3. Paste in Git repository URL the link to my Github repository for this guide https://github.com/AStefanachi/databricks-tutorial-csv-storage-account
  4. Click on Create Repo

After creating the repository, it is a good measure to clone the notebook you want to work with into your workspace.

  1. Click on the repository name
  2. Right click on the notebook’s name
  3. Click on Clone

  1. By using the context arrows you can navigate to the main workspace folder
  2. My advice is to rename the notebook to something more familiar
  3. Click on Clone

If everything worked as intended you can find your newly cloned notebook by clicking on the left menu Workspace.
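
For completeness, the same repository can also be attached through the Databricks Repos REST API instead of the UI; the workspace URL, personal access token and target path below are placeholders.

# Sketch: attach the Git repository via the Databricks Repos REST API.
import requests

workspace_url = "https://yourdatabricksinstance.azuredatabricks.net"   # placeholder
token = "<your-databricks-personal-access-token>"                      # placeholder

response = requests.post(
    f"{workspace_url}/api/2.0/repos",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "url": "https://github.com/AStefanachi/databricks-tutorial-csv-storage-account",
        "provider": "gitHub",
        "path": "/Repos/your.user@example.com/databricks-tutorial-csv-storage-account",  # placeholder path
    },
)
response.raise_for_status()
print(response.json())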

Executing the notebook

From your Databricks workspace, access the Workspace menu on the left and click on your notebook’s name to access the notebook.

  1. In command 4, rename the constants according to your resources’ names

  1. On the top right, near the Run all button, click on Connect
  2. Click on your cluster’s name
  3. Click on Run all to execute the entire notebook

Explaining the code

%pip install azure-storage-blob

  • By default, your cluster doesn’t have the azure-storage-blob library installed. To keep things simple, for this guide we will install it when running the notebook. More about the library can be found in Microsoft’s documentation[6].
dbutils.library.restartPython()
  • This dbutils command allows you, after installing the dependency, to restart the Python kernel so the package is available for execution. More about dbutils can be found in Azure Databricks’s documentation[7].

ACCOUNT_NAME = "rgstadwe001" # replace with your storage account name
SECRET_NAME = "sas-token-sta-d-001" # replace with your secret name
SECRET_SCOPE = "kwdwe001" # replace with your secret scope
ACCOUNT = f"https://{ACCOUNT_NAME}.blob.core.windows.net/"
SAS_TOKEN = dbutils.secrets.get(SECRET_SCOPE, SECRET_NAME)
CONTAINER = "data" # replace with your container name
BLOB_NAME = "bing_covid-19_data.csv"

  • The secret is accessed via dbutils, by specifying the secret scope and the secret name
  • In this code block you will have all your constants defined. Rename them when necessary.
from azure.storage.blob import BlobServiceClient
  • Import the BlobServiceClient class from the library
blob_service_client = BlobServiceClient(
        account_url=ACCOUNT,
        credential=SAS_TOKEN)
  • Initialize a blob_service_client object by specifying account_url and the credential
container_client = blob_service_client.get_container_client(CONTAINER)
  • Initialize the container_client by specifying the CONTAINER using the blob_service_client object
blob_client = container_client.get_blob_client(BLOB_NAME)
  • Initialize the blob_client by specifying the BLOB_NAME using the container_client object
import pandas as pd
df = pd.read_csv(blob_client.url, low_memory=False)
  • For simplicity, we use pandas to read the csv file into a dataframe
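
As a hedged alternative to pointing pandas at the blob URL, you could download the blob’s bytes through the already authenticated blob_client and read them from memory:

# Alternative sketch: download the blob content and read it from an in-memory buffer
import io
import pandas as pd

csv_bytes = blob_client.download_blob().readall()
df = pd.read_csv(io.BytesIO(csv_bytes), low_memory=False)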
sdf = spark.createDataFrame(df)
  • Convert the pandas dataframe into a Spark dataframe
spark.sql("CREATE SCHEMA IF NOT EXISTS covid")
  • Create the schema if it doesn’t exist
sdf.write.mode("overwrite").saveAsTable("covid.covid_data")
  • By accessing the saveAsTable method of the Spark dataframe we are able to save it in tabular form. By default, Spark will save it as a managed Delta table.
  • By specifying write.mode("overwrite"), the data will be overwritten every time this notebook is executed
spark.read.table("covid.covid_data").describe().display()
  • By calling the describe method on the table, read back as a dataframe, we can access its descriptive statistics
  • This step was introduced only for the sake of having an output
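
If you want an extra sanity check on the saved table, a short follow-up cell could count the rows and preview a few records without assuming any specific column names:

# Optional check: row count and a small preview of the managed table
print(spark.table("covid.covid_data").count())
spark.table("covid.covid_data").limit(5).display()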

Cleanup

As some of these resources generate costs, once the guide is completed you might want to:

  • delete the storage account
  • delete the notebook in the Databricks workspace
  • delete the key vault
  • terminate the Databricks cluster

References

  1. Azure storage: data redundancy, https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy
  2. Best practices for Azure Storage data protection, backup, and recovery, https://learn.microsoft.com/en-us/troubleshoot/azure/azure-storage/data-protection-backup-recovery
  3. Azure Open Datasets: Bing COVID-19, https://learn.microsoft.com/en-us/azure/open-datasets/dataset-bing-covid-19?tabs=azure-storage
  4. Best practices for using Azure Key Vault, https://learn.microsoft.com/en-us/azure/key-vault/general/best-practices
  5. Secret Scopes – Azure Databricks, https://learn.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes
  6. Azure Storage Blobs client library for Python, https://learn.microsoft.com/en-us/python/api/overview/azure/storage-blob-readme?view=azure-python
  7. Databricks Utilities, https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-utils
