Azure Databricks: Read a csv file from an Azure Storage Account and save it as a table

June 7, 2023

Introduction

In this guide we will face one very common problem presented to data professionals: reading data from an Azure Storage Account. The task is not technically complex, but it requires the interaction of several Microsoft Azure services and the use of security best practices, which makes the guide a bit longer to follow but still enjoyable and satisfying.

Assumptions

    • A Microsoft Azure account
    • A basic understanding of Microsoft Azure is helpful but not required, as its user interface is very intuitive and beginner friendly
    • A resource group
    • A Databricks workspace with at least one cluster
    • A copy of the bing_covid-19_data.csv Azure Open Dataset[3]
    • A basic understanding of python

For your convenience, here you can find my previous post, Getting started with Azure Databricks, which deals with all these topics.

Table of contents

  1. Create a storage account
  2. Upload a csv file in the storage account
  3. Create a key vault
  4. Generate a SAS token
  5. Grant the right permissions
  6. Generate the secret in the key vault for the SAS token
  7. Create a secret scope in Databricks
  8. Clone a repository in Databricks
  9. Executing the notebook
  10. Explaining the code
  11. Cleanup
  12. References

Create a storage account

  1. Navigate to your resource group of choice
  2. On the top left, click on Create

  1. On the search bar type storage account and press enter
  2. Click on Create
  3. Click on Storage account

  1. Microsoft Azure should automatically select the resource group that was used to create the resource as the default
  2. Type a storage account name
  3. In order to save cost, I have selected the cheapest option, which is Locally-redundant storage (LRS).

Please be aware that LRS copies your data synchronously three times within a single physical location in the primary region. LRS is the least expensive replication option but, according to Microsoft, isn’t recommended for applications requiring high availability or durability[1].

  1. In order to save cost, my advice is to uncheck all the options highlighted in the picture above
  2. Press Review to proceed

For the purpose of this guide we will not benefit from features such as soft delete or point-in-time recovery.

A good measure, especially for less experienced readers, would be to read Best practices for Azure Storage data protection, backup, and recovery[2] by Microsoft, which clarifies far more than I can cover in this guide.

  1. For consistency, add owner and environment tags, then press Review

If everything checks out you can proceed with clicking the Create button. This will initiate the deployment of the storage account in your resource group.
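If you prefer scripting over the portal, the same deployment can be sketched with the azure-mgmt-storage Python SDK. This is only a minimal sketch: the subscription id, resource group, and credentials below are placeholders, not values from this guide.

# Minimal sketch: create the storage account with the azure-mgmt-storage SDK.
# Subscription id and resource group are placeholders (assumptions).
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "<your-resource-group>"     # placeholder

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Standard_LRS mirrors the Locally-redundant storage option selected in the portal
poller = client.storage_accounts.begin_create(
    resource_group,
    "rgstadwe001",  # storage account names must be globally unique
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="westeurope",
        tags={"owner": "your-name", "environment": "dev"},
    ),
)
print(poller.result().provisioning_state)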

Upload a csv file in the storage account

  1. From the resource group click on the storage account name that you created in the previous step. From the left menu click on Containers
  2. On the top left, click on Container
  3. On the right will appear a New Container blade. Type the name in the text box (e.g. data)
  4. Click on Create
  5. After creation, access the container by clicking on its name

  1. Click on Upload
  2. On the right will appear an Upload Blob blade: drag and drop or browse for the csv file
  3. Click on Upload
  4. If everything worked as expected you will be able to see your csv file as a new blob
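
The same upload can also be scripted with the azure-storage-blob SDK instead of the portal; the connection string and local file path below are placeholders you would replace with your own.

# Minimal sketch: upload the csv with the azure-storage-blob SDK.
# The connection string comes from the storage account's Access keys blade (placeholder here).
from azure.storage.blob import BlobServiceClient

connection_string = "<your-storage-account-connection-string>"  # placeholder
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client("data")

# Create the container if it doesn't exist yet, then upload the csv as a blob
if not container_client.exists():
    container_client.create_container()

with open("bing_covid-19_data.csv", "rb") as data:
    container_client.upload_blob(name="bing_covid-19_data.csv", data=data, overwrite=True)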

Create a key vault

As seen in the Create a storage account step, search for key vault in the marketplace and start the creation.

  1. Microsoft Azure should automatically select the resource group that was used to create the resource as the default
  2. Type a key vault name
  3. For the region I have selected West Europe
  4. In order to save cost I have selected the cheapest option for the pricing tier
  5. Click on Tags to continue

There are several strategies and options to preserve the integrity of your key vault. In this scenario we assume that role-based access control is sufficient; however, it is recommended to read the best practices[4] suggested by Microsoft.

  1. For consistency, add owner and environment tags then click on Review + create

If everything checks out you can proceed with clicking the Create button. This will initiate the deployment of the key vault in your resource group.
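
As with the storage account, the key vault deployment can be sketched with the azure-mgmt-keyvault SDK if you prefer code over the portal. The subscription id, tenant id and resource group below are placeholders, not values from this guide.

# Minimal sketch: create the key vault with the azure-mgmt-keyvault SDK.
from azure.identity import DefaultAzureCredential
from azure.mgmt.keyvault import KeyVaultManagementClient
from azure.mgmt.keyvault.models import VaultCreateOrUpdateParameters, VaultProperties, Sku

subscription_id = "<your-subscription-id>"   # placeholder
tenant_id = "<your-tenant-id>"               # placeholder
resource_group = "<your-resource-group>"     # placeholder

client = KeyVaultManagementClient(DefaultAzureCredential(), subscription_id)

# Standard tier and RBAC authorization, matching the choices made in the portal
poller = client.vaults.begin_create_or_update(
    resource_group,
    "kwdwe001",
    VaultCreateOrUpdateParameters(
        location="westeurope",
        tags={"owner": "your-name", "environment": "dev"},
        properties=VaultProperties(
            tenant_id=tenant_id,
            sku=Sku(family="A", name="standard"),
            enable_rbac_authorization=True,
        ),
    ),
)
print(poller.result().properties.vault_uri)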

Generate a SAS token

From the resource group click on your storage account name to access its properties.

  1. Click on Shared access signature
  2. Check all the options in Allowed resource types
  3. In Allowed permissions keep the check only on Read and List
  4. Uncheck blob versioning permissions
  5. Generate SAS and connection string

In our scenario we assume that we want to grant Databricks limited access in terms of time and permissions, hence the options selected above. When the token expires, Databricks will no longer be able to read the data from the storage account and will throw authentication errors; in that case you will need to generate a new token.

For our purpose we are only interested in the SAS token. It's very important that you keep this tab open until we save the token as a secret in the key vault; otherwise you will need to repeat this step.
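
If you would rather not click through the portal, a comparable account SAS (read and list only, limited lifetime) can be generated with the azure-storage-blob SDK. The account key below is a placeholder you would retrieve from the storage account's Access keys blade.

# Minimal sketch: generate an account SAS comparable to the one created in the portal.
from datetime import datetime, timedelta
from azure.storage.blob import generate_account_sas, ResourceTypes, AccountSasPermissions

ACCOUNT_NAME = "rgstadwe001"
ACCOUNT_KEY = "<your-storage-account-key>"   # placeholder

sas_token = generate_account_sas(
    account_name=ACCOUNT_NAME,
    account_key=ACCOUNT_KEY,
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True),   # read and list only
    expiry=datetime.utcnow() + timedelta(days=30),            # limited lifetime
)
print(sas_token)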

Grant the right permissions

Even though your account probably has Owner privileges, you won't be able to administer the secrets in the key vault, and Databricks also needs to be authorized to read the key vault.

From the resource group click on the key vault name to access the overview.

  1. Click on Access control (IAM)
  2. Click on Add
  3. Click on Add role assignment

  1. In the text box, type key vault administrator and press enter
  2. Click on Key Vault Administrator
  3. Click on Members

  1. Click on Select members
  2. In the text box type your name and press enter. Click on your account to select
  3. Click on Select
  4. Click on Next to proceed

By clicking on Review + assign you will grant your account the Key Vault Administrator role, which will allow you to create and manage secrets in the key vault.

In the same way we added our account to the Key Vault Administrator role, proceed to add the Azure Databricks service principal to the Key Vault Secrets User role.

Generate the secret in the key vault for the SAS token

From your resource group click on the key vault name to access the overview.

  1. On the left menu, click on Secrets
  2. Click on Generate/Import

  1. Type a name for the secret
  2. Paste the SAS Token from the previous step
  3. Check Set activation date
  4. Click on Create

The secret is now created and ready for its use.
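
As a hedged alternative to the portal, the same secret could be created with the azure-keyvault-secrets SDK. This requires the azure-identity package and the Key Vault Administrator role granted earlier; the vault URL and token value below are placeholders.

# Minimal sketch: store the SAS token as a key vault secret with azure-keyvault-secrets.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://kwdwe001.vault.azure.net/"   # your key vault's Vault URI
sas_token = "<paste-the-sas-token-here>"          # placeholder

client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# Requires a role that can manage secrets, e.g. Key Vault Administrator
secret = client.set_secret("sas-token-sta-d-001", sas_token)
print(secret.name, secret.properties.version)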

Create a secret scope in Databricks

According to Microsoft’s documentation, to reference secrets stored in an Azure Key Vault, you can create a secret scope backed by Azure Key Vault. You can then leverage all of the secrets in the corresponding Key Vault instance from that secret scope. Because the Azure Key Vault-backed secret scope is a read-only interface to the Key Vault, the PutSecret and DeleteSecret Secrets API operations are not allowed. To manage secrets in Azure Key Vault, you must use the Azure Set Secret REST API or the Azure portal UI[5].

Before we get started, we will need to obtain the key vault’s Vault URI and Resource ID, which are required to create the secret scope.

From your resource group, click on the key vault’s name and enter its overview. On the right, click on JSON View, which will reveal the following pane.


  1. Here you can read the Vault URI
  2. Here you can read the Resource ID

Keep this tab open, as you will need to copy this information in Databricks to create the secret scope.

Access your Databricks workspace and navigate to this URL (replacing yourdatabricksinstance with your own workspace instance): https://yourdatabricksinstance.azuredatabricks.net/#secrets/createScope then:

  1. Insert a unique scope name
  2. Insert the key vault’s Vault URI
  3. Insert the key vault’s Resource ID
  4. Click on Create

By doing this you have successfully configured Databricks to read the key vault.
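
A quick way to verify the scope from a notebook cell is to list the scopes and resolve the secret with dbutils; the scope and secret names below match the ones used later in this guide, so adjust them to your own.

# Verify that the Azure Key Vault-backed scope is visible to the workspace
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("kwdwe001"))

# The value is redacted in the notebook output, which confirms access without exposing it
print(dbutils.secrets.get(scope="kwdwe001", key="sas-token-sta-d-001"))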

Clone a repository in Databricks

From your Databricks workspace:

  1. Click on Repos
  2. Click on Add Repo
  3. Paste in Git repository URL the link to my Github repository for this guide https://github.com/AStefanachi/databricks-tutorial-csv-storage-account
  4. Click on Create Repo

After creating the repository, it is a good measure to clone the notebook you want to work with into your workspace.

  1. Click on the repository name
  2. Right click on the notebook’s name
  3. Click on Clone

  1. By using the context arrows you can navigate to the main workspace folder
  2. My advice is to rename the notebook to something more familiar
  3. Click on Clone

If everything worked as intended you can find your newly cloned notebook by clicking on the left menu Workspace.
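
For completeness, the same repository can also be attached through the Databricks Repos REST API instead of the UI; the workspace URL, personal access token and target path below are placeholders.

# Sketch: attach the Git repository via the Databricks Repos REST API.
import requests

workspace_url = "https://yourdatabricksinstance.azuredatabricks.net"   # placeholder
token = "<your-databricks-personal-access-token>"                      # placeholder

response = requests.post(
    f"{workspace_url}/api/2.0/repos",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "url": "https://github.com/AStefanachi/databricks-tutorial-csv-storage-account",
        "provider": "gitHub",
        "path": "/Repos/your.user@example.com/databricks-tutorial-csv-storage-account",  # placeholder path
    },
)
response.raise_for_status()
print(response.json())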

Executing the notebook

From your Databricks workspace, access the Workspace menu on the left and click on your notebook’s name to access the notebook.

  1. In command 4, rename the constants according to your resources’ names

  1. On the top right, near the Run all button, click on Connect
  2. Click on your cluster’s name
  3. Click on Run all to execute the entire notebook

Explaining the code

%pip install azure-storage-blob

  • By default, your cluster doesn’t have the azure-storage-blob library installed. To keep things simple, for this guide we will install it when running the notebook. More about the library can be found in Microsoft’s documentation[6].
dbutils.library.restartPython()
  • This dbutils command allows you, after installing the dependency, to restart the Python kernel so the package is available for execution. More about dbutils can be found in Azure Databricks’s documentation[7].

ACCOUNT_NAME = "rgstadwe001" # replace with your storage account name
SECRET_NAME = "sas-token-sta-d-001" # replace with your secret name
SECRET_SCOPE = "kwdwe001" # replace with your secret scope
ACCOUNT = f"https://{ACCOUNT_NAME}.blob.core.windows.net/"
SAS_TOKEN = dbutils.secrets.get(SECRET_SCOPE, SECRET_NAME)
CONTAINER = "data" # replace with your container name
BLOB_NAME = "bing_covid-19_data.csv"

  • The secret is accessed via dbutils, by specifying the secret scope and the secret name
  • In this code block you will have all your constants defined. Rename them when necessary.
from azure.storage.blob import BlobServiceClient
  • Import the BlobServiceClient class from the library
blob_service_client = BlobServiceClient(
        account_url=ACCOUNT,
        credential=SAS_TOKEN)
  • Initialize a blob_service_client object by specifying account_url and the credential
container_client = blob_service_client.get_container_client(CONTAINER)
  • Initialize the container_client by specifying the CONTAINER using the blob_service_client object
blob_client = container_client.get_blob_client(BLOB_NAME)
  • Initialize the blob_client by specifying the BLOB_NAME using the container_client object
import pandas as pd
df = pd.read_csv(blob_client.url, low_memory=False)
  • For simplicity, we use pandas to read the csv file into a dataframe
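
As a hedged alternative to pointing pandas at the blob URL, you could download the blob’s bytes through the already authenticated blob_client and read them from memory:

# Alternative sketch: download the blob content and read it from an in-memory buffer
import io
import pandas as pd

csv_bytes = blob_client.download_blob().readall()
df = pd.read_csv(io.BytesIO(csv_bytes), low_memory=False)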
sdf = spark.createDataFrame(df)
  • Convert the pandas dataframe into a Spark dataframe
spark.sql("CREATE SCHEMA IF NOT EXISTS covid")
  • Create the schema if it doesn’t exist
sdf.write.mode("overwrite").saveAsTable("covid.covid_data")
  • By accessing the saveAsTable method of the Spark dataframe we are able to save it in tabular form. By default, Spark will save it as a managed Delta table.
  • By specifying write.mode("overwrite"), the data will be overwritten every time this notebook is executed
spark.read.table("covid.covid_data").describe().display()
  • By calling the describe method on the table, read back as a dataframe, we can access its descriptive statistics
  • This step was introduced only for the sake of having an output
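
If you want an extra sanity check on the saved table, a short follow-up cell could count the rows and preview a few records without assuming any specific column names:

# Optional check: row count and a small preview of the managed table
print(spark.table("covid.covid_data").count())
spark.table("covid.covid_data").limit(5).display()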

Cleanup

As some of these resources generate costs, once the guide is completed you might want to:

  • delete the storage account
  • delete the notebook in the Databricks workspace
  • delete the key vault
  • terminate the Databricks cluster

References

  1. Azure storage: data redundancy, https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy
  2. Best practices for Azure Storage data protection, backup, and recovery, https://learn.microsoft.com/en-us/troubleshoot/azure/azure-storage/data-protection-backup-recovery
  3. Azure Open Datasets: Bing COVID-19, https://learn.microsoft.com/en-us/azure/open-datasets/dataset-bing-covid-19?tabs=azure-storage
  4. Best practices for using Azure Key Vault, https://learn.microsoft.com/en-us/azure/key-vault/general/best-practices
  5. Secret Scopes – Azure Databricks, https://learn.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes
  6. Azure Storage Blobs client library for Python, https://learn.microsoft.com/en-us/python/api/overview/azure/storage-blob-readme?view=azure-python
  7. Databricks Utilities, https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-utils
