
AMLHackADBMLFlow

High Level Objectives:

  1. Preparing the Environment

  2. Creating a Databricks Cluster

  3. Importing Notebooks

  4. Installing Libraries

  5. Databricks Extensibility

  6. Digging into the Notebook

  7. Uploading Data Manually

Good additional resources:

https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/pandas-to-koalas-in-10-minutes.html

https://spark.apache.org/docs/latest/

1. Preparing Environment – Securing Sensitive Information

Why Bother?

For a given workspace, creating an Azure Key Vault-backed secret scope links Azure Key Vault, a secure place to store passwords and other sensitive information, to that workspace and makes its secrets available there. So, really, we need one Key Vault per workspace: put the secrets there, and then leverage them. But…

Create an Azure Key Vault-backed secret scope using the UI.

https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#create-an-azure-key-vault-backed-secret-scope-using-the-ui

Once your Azure Databricks workspace is open in a separate window, append #secrets/createScope to the URL. The URL should have the following format: https://<location>.azuredatabricks.net/?o=<orgID>#secrets/createScope

Enter a scope name, and enter the Azure Key Vault DNS name and Resource ID you saved earlier. Save the scope name in a text editor for use later in this tutorial. Then, select Create.

Creating the Secret Scope

As an FYI, this could also be done programmatically (for example via the Databricks CLI or the Secrets REST API) rather than through the UI.

Often this will be part of an IaC deployment, but consider the use case of third-party API keys.

Steps to Create a Secret Scope:

  1. Go to the URL for your Databricks environment and add #secrets/createScope

  2. Head over to the Azure portal (portal.azure.com) to view your resources: open the Key Vault, go to Properties, and grab the Vault URI and the Resource ID to place into your secret scope.

  3. Note that the scope name needs to be unique. A recommendation would be to use your initials to make it unique to yourself.

  4. In the Databricks secret scope dialog, provide the Vault URI as the DNS Name and the Resource ID as the Resource ID. For Manage Principal, choose who can manage the scope (for example, Creator or All Users).

Keep the name of your secret scope handy
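
Once the scope exists and your Key Vault holds some secrets, any notebook attached to the workspace can read them with `dbutils.secrets.get`. A minimal sketch, assuming it runs in a Databricks notebook (where `dbutils` is predefined) and that the scope and key names below are replaced with your own:

```python
# Placeholders: use your own secret scope name and a key that exists in the Key Vault.
my_secret = dbutils.secrets.get(scope="my-secret-scope", key="my-secret-key")

# The value is usable in code, but Databricks redacts it in notebook output.
print(my_secret)  # prints [REDACTED]
```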

2. Create a Databricks Cluster

Let’s create a Databricks cluster by clicking on the Clusters tab.

Let’s leverage the 7.3 ML runtime:

https://docs.databricks.com/release-notes/runtime/7.3ml.html

Then click on the Create Cluster button

Provide a cluster name – each cluster must have a unique name.

Typically, if I am working in a shared environment, I will use my initials if that works for the team.

Select a Cluster Mode of Standard

| Cluster Mode | Description |
| --- | --- |
| Standard | A standard cluster is meant to be used by a single user |
| High Concurrency | Shared by multiple users |
| Single Node | A cluster consisting of a Spark driver and no Spark workers (in public preview) |

We can go ahead and specify 1-3 worker nodes.

What’s a pool?

Pools speed up cluster creation by reusing existing, already-provisioned Azure VMs.

https://docs.microsoft.com/en-us/azure/databricks/clusters/instance-pools/

| Option | Description |
| --- | --- |
| Enable Autoscaling | Good if you are uncertain how many workers you will need, or there is great variability in your job |
| Auto Shutdown | Makes sense to use |
| Worker Type | Select based on the workload |
| Driver Type | Select based on the workload; if bringing a lot of data back to the driver (head) node, give the driver more resources |

FYI - Advanced Options

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#spark-config

Setting Environment Variables
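
Values set under Advanced Options can be read back from a notebook attached to the cluster. A small sketch, assuming a Databricks notebook where `spark` is predefined; `MY_ENV_VAR` and the Spark config key are only examples:

```python
import os

# Environment variable defined under Advanced Options (Spark tab) > Environment Variables.
# MY_ENV_VAR is a hypothetical name -- use whatever you configured.
print(os.environ.get("MY_ENV_VAR", "not set"))

# Spark configuration value; shuffle partitions has a default, so this works even
# if you set nothing under Advanced Options > Spark Config.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```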

3. Importing Notebooks

When you export a notebook as HTML, IPython notebook, or archive (DBC), and you have not cleared the results, the results of running the notebook are included.

These notebooks are located in a git repo.

Leverage git to clone down this repo:

https://github.com/memasanz/AMLHackADBMLFlow.git

Or Download the Zip File.

We will have one person import the HelperFunctions notebook. This notebook is intended to show how to leverage one notebook inside another. We will spend a little time customizing this notebook to demonstrate leveraging a Python class inside the Databricks notebook experience.
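
To illustrate the pattern, the helper notebook defines a class and a second notebook pulls it in with the `%run` magic. The class below is only a hypothetical sketch of the shape, not the actual contents of HelperFunctions:

```python
# Cell in the HelperFunctions notebook (illustrative only).
class ExperimentHelper:
    """Tiny example of a class shared between notebooks."""

    def __init__(self, experiment_name):
        self.experiment_name = experiment_name

    def describe(self):
        return f"Helper for {self.experiment_name}"
```

In the consuming notebook, a cell containing only `%run ./HelperFunctions` (adjust the path to wherever you imported it) executes the helper notebook, after which `ExperimentHelper` can be instantiated directly.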

Everyone will import the FullExperiment, BaseExperiment and BaseExperimentWithAzureMLIntegration notebooks under this user.

Note that we will have to manually load notebooks one at a time if they have the *.ipynb extension.

Import the following:

Files such as BaseExperiment.ipynb are really just for reference, so you can move them into a folder to keep your workspace clean if you would like.

Go ahead and create a folder called samples and place them in there.

4. Install Libraries:

On the Clusters tab, click on your cluster and select Install New.

Install MLflow

| PyPI Package to Install | Comment |
| --- | --- |
| azureml-mlflow | |
| mlflow | Actually already installed, but it doesn’t hurt to install |
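
As a quick smoke test that the libraries are usable, a throwaway MLflow run can be logged from any notebook attached to the cluster (the parameter and metric names below are arbitrary):

```python
import mlflow

# Log a dummy run to confirm that mlflow imports and tracking works on this cluster.
with mlflow.start_run(run_name="library-smoke-test"):
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.87)
```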

5. Databricks Extensibility (Optional Content):

Databricks REST APIs: https://docs.databricks.com/dev-tools/api/index.html

In order to make an API call, we need to generate a token. You could supply username/password information, but it is safer and cleaner to provide a token. To generate a token, head over to User Settings on the left.

Click on the generate new token button

Sample Curl to get list of clusters (Optional Content):

curl -X GET -H 'Authorization: Bearer dapi0XXXXXXXXXX' https://adb-1797455930767468.0.databricks.azure.us/api/2.0/clusters/list

We can also examine the list of secrets in a secret scope.

Just don’t forget to provide the token
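
A rough Python equivalent for the secrets call, using the `requests` library; the workspace URL, token, and scope name below are placeholders, and the endpoint used is `GET /api/2.0/secrets/list`:

```python
import requests

# Placeholders -- substitute your own workspace URL, personal access token, and scope.
workspace_url = "https://adb-XXXXXXXXXXXXXXXX.X.azuredatabricks.net"
token = "dapiXXXXXXXXXXXXXXXX"
scope_name = "my-secret-scope"

resp = requests.get(
    f"{workspace_url}/api/2.0/secrets/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"scope": scope_name},
)
resp.raise_for_status()

# Only key names and timestamps come back; secret values are never returned by this API.
for secret in resp.json().get("secrets", []):
    print(secret["key"])
```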

6. Let’s dig into the notebook

In the notebook we will cover:

The existing class works with the following parameters in Key Vault; we can update the class to leverage what you currently have in your Key Vault.
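
For orientation, the typical pattern is to pull service principal credentials out of the secret scope, connect to the Azure ML workspace, and point MLflow tracking at it. The sketch below is only illustrative: the actual class in the notebook may differ, and every scope and key name is a placeholder to be matched to your own Key Vault entries.

```python
import mlflow
from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

scope = "my-secret-scope"  # the secret scope created earlier (placeholder name)

# All key names below are hypothetical -- align them with your Key Vault secrets.
sp_auth = ServicePrincipalAuthentication(
    tenant_id=dbutils.secrets.get(scope=scope, key="tenant-id"),
    service_principal_id=dbutils.secrets.get(scope=scope, key="client-id"),
    service_principal_password=dbutils.secrets.get(scope=scope, key="client-secret"),
)

ws = Workspace(
    subscription_id=dbutils.secrets.get(scope=scope, key="subscription-id"),
    resource_group=dbutils.secrets.get(scope=scope, key="resource-group"),
    workspace_name=dbutils.secrets.get(scope=scope, key="aml-workspace-name"),
    auth=sp_auth,
)

# Route MLflow tracking to the Azure ML workspace (this is what azureml-mlflow enables).
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
```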

After running the experiment, we have established a model. Check out the Experiments icon.

We can also view the experiment in full.

You can scroll down and view the artifacts captured for the model.
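
A logged model can also be pulled back programmatically. A hedged sketch, assuming at least one run exists in the notebook’s experiment and that the model was logged under the artifact path `model` (adjust if yours differs):

```python
import mlflow

# Grab the most recent run from the current experiment.
latest_run = mlflow.search_runs(order_by=["start_time DESC"], max_results=1).iloc[0]

# Load the model back as a generic pyfunc model.
loaded_model = mlflow.pyfunc.load_model(f"runs:/{latest_run.run_id}/model")
print(loaded_model.metadata)
```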

7. Uploading Data Manually (covered in notebook)

Uploading Datasets:

When you select File -> Upload Data, click on the hyperlink provided.

Note that they will give you some default PySpark code based on the fact that it is a CSV file.
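
The generated snippet looks roughly like the following. This is a sketch of the shape, not the exact generated code; the file path is a placeholder, and the upload dialog shows the actual /FileStore/tables/... path for your file:

```python
# Placeholder path -- the upload dialog reports the real /FileStore/tables/... location.
file_location = "/FileStore/tables/my_uploaded_file.csv"

df = (
    spark.read.format("csv")
    .option("header", "true")       # first row contains column names
    .option("inferSchema", "true")  # let Spark infer column types
    .load(file_location)
)

display(df)
```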