Azure Databricks & Spark for Data Engineers (PySpark / SQL)

Introduction

In this personal project, I use PySpark on Databricks on Azure to build a data warehouse with the Medallion architecture.

Git repo:  

General concept

Hive metastore

Reference:   Hive Tables

Spark SQL uses a Hive metastore to manage the metadata of persistent relational entities (e.g. databases, tables, columns, partitions) in a relational database (for fast access).

A Hive metastore warehouse (aka spark-warehouse) is the directory where Spark SQL persists tables whereas a Hive metastore (aka metastore_db) is a relational database to manage the metadata of the persistent relational entities, e.g. databases, tables, columns, partitions.

By default, Spark SQL uses the embedded deployment mode of a Hive metastore with an Apache Derby database.
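Outside Databricks you can see this behaviour with a plain PySpark session; a minimal sketch (the warehouse path and table name are illustrative, not from the project):

from pyspark.sql import SparkSession

# Enable Hive support so table metadata is kept in the metastore (metastore_db),
# while table data is persisted under spark.sql.warehouse.dir (spark-warehouse)
spark = (SparkSession.builder
         .appName("hive-metastore-demo")
         .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
         .enableHiveSupport()
         .getOrCreate())

df = spark.createDataFrame([(1, "bronze"), (2, "silver")], ["id", "layer"])
df.write.mode("overwrite").saveAsTable("demo_layers")

spark.sql("SHOW TABLES").show()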

Hive metastore on Databricks

Databricks manages the metastore:

  • By the Databricks default (workspace-internal) metastore
  • Or by an external metastore (Azure SQL, MySQL, PostgreSQL, MariaDB, etc.); see the config sketch below
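For an external metastore, the connection details are usually supplied as cluster-level Spark config. A rough sketch for an Azure SQL-backed metastore (server, database, credentials and Hive version are placeholders; the exact metastore version/jars settings depend on your Databricks runtime):

spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<server>.database.windows.net:1433;database=<metastore-db>
spark.hadoop.javax.jdo.option.ConnectionUserName <sql-user>
spark.hadoop.javax.jdo.option.ConnectionPassword <sql-password>
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.sql.hive.metastore.version <hive-version>
spark.sql.hive.metastore.jars <builtin-or-maven>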

Project architecture

Tech stack used in this project: Databricks, PySpark, Delta Lake, Azure Data Lake Storage, Azure Data Factory, Azure Key Vault

Azure Databricks analytics architecture reference:

azure_db_analytics_archi.png

1. Set up the environments

Create Databricks Cluster

Databricks / Compute / Create compute, with the configuration below.

For example, to pin the Spark version, choose a policy with the following policy config ("hidden": true hides the field in the UI):

{
  "spark_version": {
    "type": "fixed",
    "value": "auto:latest-lts",
    "hidden": true
  }
}
  • Access mode: No isolation shared
  • Node type: Standard_DS3_v2
  • Auto termination: 18 minutes of inactivity (to save cost)
  • Logging: dbfs:/cluster-logs/0821-080317-lq06a367

Create budgets & Cost alerts

Budget Details

  • Name: az-admin-exceed-30-pounds
  • Reset period
  • Amount (threshold): 10

Alert conditions:

  • Actual - 90 (%)
  • Alert recipients (email): [email protected]

Create Azure Storage Account

Storage account setup:

  • Select hierarchical namespace to enable file and directory semantics, accelerate big data analytics workloads, and enable access control lists (ACLs).
  • Choose the storage account.
  • Download the storage browser (Microsoft Azure Storage Explorer).

Case 1: Session Scoped Authentication (Via notebook)

1. Access Azure Data Lake storage using access key

abfss://<container>@<storageAccountName>.dfs.core.windows.net

Access key: the Azure Blob Filesystem driver (ABFS) is used to access the storage account. Set up the access key as below. Reference: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver

spark.conf.set(
    "fs.azure.account.key.datalake01course.dfs.core.windows.net",
    "**access_key**"
)
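
Once the key is set, the container can be read directly through the abfss path; a minimal sketch (the container name demo and the file name are assumptions):

# List files in the container and read a CSV using the session-scoped key set above
display(dbutils.fs.ls("abfss://demo@datalake01course.dfs.core.windows.net/"))

circuits_df = spark.read.csv(
    "abfss://demo@datalake01course.dfs.core.windows.net/circuits.csv",
    header=True,
    inferSchema=True,
)
display(circuits_df)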

2. Access Azure Data Lake storage using Shared access signature

Otherwise, a SAS token can be generated from Microsoft Azure Storage Explorer.
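
With a SAS token in hand, session-scoped access can be configured roughly as follows (a sketch based on the documented ABFS SAS settings; the storage account name is reused from above and the token value is a placeholder):

storage_account = "datalake01course"
sas_token = "***"  # SAS token generated for the container

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "SAS")
spark.conf.set(
    f"fs.azure.sas.token.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
)
spark.conf.set(f"fs.azure.sas.fixed.token.{storage_account}.dfs.core.windows.net", sas_token)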

3. Access Azure Data Lake using Service Principal

Azure service principal

Steps to follow:

  • Create app: In Azure Active Directory, choose App registrations / New registration (default settings). Copy client_id and tenant_id; the Spark OAuth config built from these values is sketched after this list.
    
    client_id = "***"
    tenant_id = "***"
      
    # To get client secret, choose Certificates & secrets / + New client secret -> copy value
    client_secret = "***" 
    
  • Assign role to app: In the storage account, choose Access Control (IAM) / Add / Add role assignment (Storage Blob Data Contributor).
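
With the client_id, tenant_id and client_secret above, the session-scoped OAuth configuration looks roughly like this (a sketch; datalake01course is the storage account used earlier):

storage_account = "datalake01course"

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)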

Case 2: Cluster Scoped Authentication (Via cluster)

1. Enable credential passthrough for user-level data access

Add role assignment

As with accessing Azure Data Lake using a service principal, we need to grant the cluster creator's account access to the storage account. Even though I'm the owner, I do not have a role that allows reading data from the storage account, so assign it: in the storage account, choose Access Control (IAM) / Add / Add role assignment (Storage Blob Data Contributor).

2. Secure Access to Azure Data Lake

Create Azure Key Vault

// Basics
Resource group: databrickcourse-rg
Key vault name: fomula1-key-vault-course
Region: Southeast Asia
Pricing tier: Standard

// Access configuration
Permission model: Vault access policy

Create the access token (secret) in Key Vault.

Create secret scope in Databricks

This gives you the scope-name used later to read secrets.

Reference: Link

Go to https://<databricks-instance>#secrets/createScope. This URL is case sensitive; scope in createScope must be uppercase.

For DNS Name & Resource ID, go to Key Vault / Properties to get the values:

DNS Name = Vault URI
Resource ID = Resource ID

Secret access token in Key vault
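
Back in a notebook, secrets can now be read from the scope instead of hard-coding keys; a minimal sketch (the scope name formula1-scope and the secret name are assumptions):

# List available scopes and the secrets inside one scope
dbutils.secrets.list_scopes()
dbutils.secrets.list("formula1-scope")

# Retrieve the stored client secret and use it in place of a literal value
client_secret = dbutils.secrets.get(scope="formula1-scope", key="databricks-app-client-secret")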

3. Mount the container onto DBFS (tables built on top of it are then registered in the Hive metastore).
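
A minimal sketch of mounting with the service principal credentials from earlier (the container name raw, the mount point, and the reuse of client_id / client_secret / tenant_id are assumptions):

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Mount the raw container of the data lake under /mnt
dbutils.fs.mount(
    source="abfss://raw@datalake01course.dfs.core.windows.net/",
    mount_point="/mnt/datalake01course/raw",
    extra_configs=configs,
)

display(dbutils.fs.mounts())  # verify the mount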

4. Convert raw files into Parquet tables, Delta tables.
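
For example, a raw CSV can be written out as a Parquet table and as a Delta table roughly like this (paths, database and table names are illustrative):

raw_df = spark.read.csv("/mnt/datalake01course/raw/circuits.csv", header=True, inferSchema=True)

spark.sql("CREATE DATABASE IF NOT EXISTS f1_processed")

# Parquet table
raw_df.write.mode("overwrite").format("parquet").saveAsTable("f1_processed.circuits_parquet")

# Delta table (adds ACID transactions, time travel, MERGE, ...)
raw_df.write.mode("overwrite").format("delta").saveAsTable("f1_processed.circuits")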

5. Create either temporary views or managed/external tables, and query them with Spark SQL for analysis and visualization.
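
A short sketch of the difference (reusing the illustrative raw_df from above): a temp view lives only for the session, a managed table's data location is controlled by the metastore, and an external table points at a path you manage:

# Temporary view: only visible in the current Spark session
raw_df.createOrReplaceTempView("v_circuits")
spark.sql("SELECT COUNT(*) FROM v_circuits").show()

# Managed table: Spark controls both metadata and data location
raw_df.write.mode("overwrite").saveAsTable("f1_processed.circuits_managed")

# External table: metadata in the metastore, data stays at the given path
(raw_df.write.mode("overwrite")
    .option("path", "/mnt/datalake01course/processed/circuits_ext")
    .saveAsTable("f1_processed.circuits_external"))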

6. Azure Data Factory

Learning resource:

data_factories.png

Create CI/CD with git

adf.png

Or after launching the Studio, choose Linked service and set up the git repo adf-git.png

Create pipeline adf-pipeline.png

Set up the linked Databricks Notebook adf-config-task.png

To use Managed Service Identity

  • Change access (IAM) in Azure Databricks Service - workspace
  • Privileged Administrator Roles / Contributor adf-access-databrick.png
  • Add member etl-adls-course (Data Factory) adf-add-member.png
  • After that, choose access via the existing cluster adf-config-task-access.png. If it's not loaded, click another cluster and re-select the existing cluster to refresh the UI.

  • Add the dynamic parameters by using Variables from the pipeline adf-dynamic-parameter-using-var.png (the notebook reads them via widgets; see the sketch after this list)
  • Or using a pipeline parameter (cannot be changed during the run) adf-dynamic-parameter-using-parameter.png
  • Configure the DAGs adf-DAGs.png
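
On the Databricks side, the notebook picks up the value passed from the Data Factory activity through a widget; a minimal sketch (the parameter name p_file_date and its default are assumptions):

# Declare the widget with a default, then read the value passed in by ADF
dbutils.widgets.text("p_file_date", "2021-03-21")
v_file_date = dbutils.widgets.get("p_file_date")

raw_folder = f"/mnt/datalake01course/raw/{v_file_date}"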

Add a Get Metadata activity to check whether the file exists

Set up the storage account dataset and use the pipeline parameter as the directory.

Check whether the data exists adf-if-condition.png

Then, click True (edit) and add the Notebook activity in the window adf-if-true.png

Final: Create master execute pipeline

adf-execute-pipeline.png

Updating…

Learn More

For more about my posts, reach me via [email protected]

This post is licensed under CC BY 4.0 by the author.