Azure Blob Storage is a Microsoft Azure cloud service for storing large amounts of structured and unstructured data such as text files, database export files, JSON files, etc. Azure Blob Storage allows you to store data publicly, or you can store application data privately. You can access public Azure blob data without any additional credentials, but to access private data you need an access key. In this article, we will check how to access Azure Blob Storage files from Databricks.
Access Azure Blob Storage Files from Databricks
Similar to the Snowflake cloud data warehouse, Databricks supports cloud platforms such as Microsoft Azure, Amazon AWS and Google GCP. You can create a Databricks cluster on any of these cloud vendors. In this article, we will explore Azure Databricks to access files stored in an Azure blob container.
Azure Databricks is a fully managed, Platform-as-a-Service (PaaS) offering on the Azure cloud. It leverages the Microsoft cloud to scale rapidly and host massive amounts of data effortlessly.
Following is a step-by-step guide to accessing data files stored in Azure blob storage.
- Create an Azure Blob Container and upload files.
- Mount Azure Blob Storage.
- Access Data files using Mount Location.
Now, let us check these steps in detail.
Create an Azure Blob Container and upload files
Similar to a directory in a file system, a container organizes a set of blobs. A storage account can include an unlimited number of containers, and a container can store an unlimited number of blobs.
- To create an Azure Blob container, first create a storage account.
- Go to the storage account and click on “Containers” to create a new container.
- To upload data files to the blob container, click on “Upload”.
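If you prefer to script these steps instead of using the Azure portal, you can also create the container and upload a file with the azure-storage-blob Python SDK. The snippet below is a minimal sketch: the storage account (dbusecase), container (category) and file name (dim_category.txt) follow the example used later in this article, and the access key is a placeholder.
from azure.storage.blob import BlobServiceClient
# Connect to the storage account with its access key (placeholder value)
service = BlobServiceClient(
    account_url="https://dbusecase.blob.core.windows.net",
    credential="<storage-access-key>")
# Create the container and upload a local data file into it
container = service.create_container("category")
with open("dim_category.txt", "rb") as data:
    container.upload_blob(name="dim_category.txt", data=data)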
Now your data files are available in the Azure blob container. The next step is to mount the container you created in Azure Databricks so that you can access the data files as if they were local files.
Mount Azure Blob Storage
You need the storage access key to mount private blob containers. Go to “Access Keys” within the storage account and click on “Show keys” to copy an access key. Refer to the following image.
You need this access key to mount the storage container.
You can use the following Python code to mount a storage container in Databricks. Note that the storage account name in the source URL must match the account name in the extra_configs key.
# Mount the "category" container from the "dbusecase" storage account
dbutils.fs.mount(
  source = "wasbs://category@dbusecase.blob.core.windows.net",
  mount_point = "/mnt/category",
  extra_configs = {"fs.azure.account.key.dbusecase.blob.core.windows.net": "access key"})
Access Data files using Mount Location
Finally, you can access the data files using the mount location that you created in the previous step. Use the following command to check whether the location is available among the Azure Databricks mounts.
# Check the mount locations
dbutils.fs.mounts()
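For example, you can check programmatically whether the container is already mounted before trying to mount it again; a small sketch:
# True if /mnt/category is already mounted
any(m.mountPoint == "/mnt/category" for m in dbutils.fs.mounts())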
Use the following command to list the files in a mount location.
# List the files in a mount location
display(dbutils.fs.ls("/mnt/category"))
And finally, create a Spark DataFrame from a data file available in the mount location.
For example,
df = spark.read.text("/mnt/category/dim_category.txt")
display(df)
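If your data files are CSV rather than plain text, you can read them into a DataFrame in the same way. The file name below (dim_category.csv) is hypothetical, a sketch only.
# Read a CSV file with a header row from the mount location (hypothetical file name)
df_csv = spark.read.option("header", "true").csv("/mnt/category/dim_category.csv")
display(df_csv)
When you no longer need the mount, you can remove it with dbutils.fs.unmount("/mnt/category").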
Related Articles
- How to Connect to Snowflake from Databricks?
- How to Connect to Databricks SQL Endpoint from Azure Data Factory?
- Import CSV file to Pyspark DataFrame – Example
- How to Export Spark-SQL Results to CSV?
- How to Update Spark DataFrame Column Values using Pyspark?
Hope this helps 🙂