
AMLHackADBMLFlow

High Level Objectives:

  1. Preparing the Environment

  2. Creating a Databricks Cluster

  3. Importing Notebooks

  4. Installing Libraries

  5. Databricks Extensibility

  6. Digging into the Notebook

  7. Uploading Data Manually

Good additional resources:

https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/pandas-to-koalas-in-10-minutes.html

https://spark.apache.org/docs/latest/

1. Preparing Environment – Securing Sensitive Information

Why Bother?

For a given workspace, creating an Azure Key Vault-backed secret scope links Azure Key Vault, a secure place to store passwords and other sensitive information, to that workspace and makes its secrets available there. So, really, we need one Key Vault per workspace: put the secrets there, and then leverage them. But…

Create an Azure Key Vault-backed secret scope using the UI.

https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#create-an-azure-key-vault-backed-secret-scope-using-the-ui

Once your Azure Databricks workspace is open in a separate window, append #secrets/createScope to the URL. The URL should have the following format: https://<location>.azuredatabricks.net/?o=<orgID>#secrets/createScope

Enter a scope name, and enter the Azure Key Vault DNS name and Resource ID you saved earlier. Save the scope name in a text editor for use later in this tutorial. Then, select Create.

Creating the Secret Scope

As an FYI, this could also be done programmatically (for example via the Databricks CLI or the Secrets REST API) rather than through the UI.

Often this will be part of an IaC deployment, but consider the use case of third-party API keys.

Steps to Create a Secret Scope:

  1. Go to the URL for your Databricks environment and add #secrets/createScope

  2. Head over to the Azure portal (portal.azure.com) to view your resources: open the Key Vault, go to Properties, and grab the Vault URI and the Resource ID to place into your secret scope.

  3. Note that the scope name needs to be unique. A recommendation would be to use your initials to make it unique to yourself.

  4. In the Databricks secret scope dialog, provide the Vault URI as the DNS Name and the Resource ID as the Resource ID. For Manage Principal, choose who can manage the scope (for example, Creator or All Users).

Keep the name of your secret scope handy
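
Once the scope exists and your Key Vault holds some secrets, any notebook attached to the workspace can read them with `dbutils.secrets.get`. A minimal sketch, assuming it runs in a Databricks notebook (where `dbutils` is predefined) and that the scope and key names below are replaced with your own:

```python
# Placeholders: use your own secret scope name and a key that exists in the Key Vault.
my_secret = dbutils.secrets.get(scope="my-secret-scope", key="my-secret-key")

# The value is usable in code, but Databricks redacts it in notebook output.
print(my_secret)  # prints [REDACTED]
```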

2. Create a Databricks Cluster

Let’s create a Databricks cluster by clicking on the Clusters tab.

Let’s leverage the 7.3 ML runtime:

https://docs.databricks.com/release-notes/runtime/7.3ml.html

Then click on the Create Cluster button

Provide a cluster name – each cluster must have a unique name.

Typically, if I am working in a shared environment, I will use my initials if that works for the team.

Select a Cluster Mode of Standard

| Cluster Mode | Description |
| --- | --- |
| Standard | A standard cluster is meant to be used by a single user |
| High Concurrency | Shared by multiple users |
| Single Node | A cluster consisting of a Spark driver and no Spark workers (in public preview) |

We can go ahead and specify 1-3 worker nodes.

What’s a pool?

Pools speed up cluster creation by reusing existing, already-provisioned Azure VMs.

https://docs.microsoft.com/en-us/azure/databricks/clusters/instance-pools/

| Option | Description |
| --- | --- |
| Enable Autoscaling | Good if you are uncertain how many workers you will need, or there is great variability in your job |
| Auto Shutdown | Makes sense to use |
| Worker Type | Select based on the workload |
| Driver Type | Select based on the workload; if bringing a lot of data back to the driver (head) node, give the driver more resources |

FYI - Advanced Options

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#spark-config

Setting Environment Variables
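
Values set under Advanced Options can be read back from a notebook attached to the cluster. A small sketch, assuming a Databricks notebook where `spark` is predefined; `MY_ENV_VAR` and the Spark config key are only examples:

```python
import os

# Environment variable defined under Advanced Options (Spark tab) > Environment Variables.
# MY_ENV_VAR is a hypothetical name -- use whatever you configured.
print(os.environ.get("MY_ENV_VAR", "not set"))

# Spark configuration value; shuffle partitions has a default, so this works even
# if you set nothing under Advanced Options > Spark Config.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```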

3. Importing Notebooks

When you export a notebook as HTML, IPython notebook, or archive (DBC), and you have not cleared the results, the results of running the notebook are included.

These notebooks are located in a git repo.

Leverage git to clone down this repo:

https://github.com/memasanz/AMLHackADBMLFlow.git

Or Download the Zip File.

We will have one person import the HelperFunctions notebook. This notebook is intended to show how to leverage one notebook inside another. We will spend a little time customizing this notebook to demonstrate leveraging a Python class inside the Databricks notebook experience.
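
To illustrate the pattern, the helper notebook defines a class and a second notebook pulls it in with the `%run` magic. The class below is only a hypothetical sketch of the shape, not the actual contents of HelperFunctions:

```python
# Cell in the HelperFunctions notebook (illustrative only).
class ExperimentHelper:
    """Tiny example of a class shared between notebooks."""

    def __init__(self, experiment_name):
        self.experiment_name = experiment_name

    def describe(self):
        return f"Helper for {self.experiment_name}"
```

In the consuming notebook, a cell containing only `%run ./HelperFunctions` (adjust the path to wherever you imported it) executes the helper notebook, after which `ExperimentHelper` can be instantiated directly.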

Everyone will import the FullExperiment, BaseExperiment and BaseExperimentWithAzureMLIntegration notebooks under this user.

Note that we will have to manually load notebooks one at a time if they have the *.ipynb extension.

Import the following:

Files such as BaseExperiment.ipynb are really just for reference, so you can move them into a folder to keep your workspace clean if you would like.

Go ahead and create a folder called samples and place them in there.

4. Install Libraries:

On the Clusters tab, click on your cluster and select Install New.

Install MLflow

| PyPI Package to Install | Comment |
| --- | --- |
| azureml-mlflow | |
| mlflow | Actually already installed, but it doesn’t hurt to install |
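
As a quick smoke test that the libraries are usable, a throwaway MLflow run can be logged from any notebook attached to the cluster (the parameter and metric names below are arbitrary):

```python
import mlflow

# Log a dummy run to confirm that mlflow imports and tracking works on this cluster.
with mlflow.start_run(run_name="library-smoke-test"):
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.87)
```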

5. Databricks Extensibility (Optional Content):

Databricks REST APIs: https://docs.databricks.com/dev-tools/api/index.html

In order to make an API call, we need to generate a token. You could supply username/password information, but it is safer and cleaner to provide a token. To generate a token, head over to User Settings on the left.

Click on the generate new token button

Sample Curl to get list of clusters (Optional Content):

curl -X GET -H 'Authorization: Bearer dapi0XXXXXXXXXX' https://adb-1797455930767468.0.databricks.azure.us/api/2.0/clusters/list

We can also examine the list of secrets in a secret scope.

Just don’t forget to provide the token
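
A rough Python equivalent for the secrets call, using the `requests` library; the workspace URL, token, and scope name below are placeholders, and the endpoint used is `GET /api/2.0/secrets/list`:

```python
import requests

# Placeholders -- substitute your own workspace URL, personal access token, and scope.
workspace_url = "https://adb-XXXXXXXXXXXXXXXX.X.azuredatabricks.net"
token = "dapiXXXXXXXXXXXXXXXX"
scope_name = "my-secret-scope"

resp = requests.get(
    f"{workspace_url}/api/2.0/secrets/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"scope": scope_name},
)
resp.raise_for_status()

# Only key names and timestamps come back; secret values are never returned by this API.
for secret in resp.json().get("secrets", []):
    print(secret["key"])
```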

6. Let’s dig into the notebook

In the notebook we will cover:

The existing class works with the following parameters in Key Vault; we can update the class to leverage what you currently have in your Key Vault.
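
For orientation, the typical pattern is to pull service principal credentials out of the secret scope, connect to the Azure ML workspace, and point MLflow tracking at it. The sketch below is only illustrative: the actual class in the notebook may differ, and every scope and key name is a placeholder to be matched to your own Key Vault entries.

```python
import mlflow
from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

scope = "my-secret-scope"  # the secret scope created earlier (placeholder name)

# All key names below are hypothetical -- align them with your Key Vault secrets.
sp_auth = ServicePrincipalAuthentication(
    tenant_id=dbutils.secrets.get(scope=scope, key="tenant-id"),
    service_principal_id=dbutils.secrets.get(scope=scope, key="client-id"),
    service_principal_password=dbutils.secrets.get(scope=scope, key="client-secret"),
)

ws = Workspace(
    subscription_id=dbutils.secrets.get(scope=scope, key="subscription-id"),
    resource_group=dbutils.secrets.get(scope=scope, key="resource-group"),
    workspace_name=dbutils.secrets.get(scope=scope, key="aml-workspace-name"),
    auth=sp_auth,
)

# Route MLflow tracking to the Azure ML workspace (this is what azureml-mlflow enables).
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
```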

After running the experiment, we have established a model. Check out the Experiments icon.

We can also view the experiment in full.

You can scroll down and view the artifacts captured for the model.
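
A logged model can also be pulled back programmatically. A hedged sketch, assuming at least one run exists in the notebook’s experiment and that the model was logged under the artifact path `model` (adjust if yours differs):

```python
import mlflow

# Grab the most recent run from the current experiment.
latest_run = mlflow.search_runs(order_by=["start_time DESC"], max_results=1).iloc[0]

# Load the model back as a generic pyfunc model.
loaded_model = mlflow.pyfunc.load_model(f"runs:/{latest_run.run_id}/model")
print(loaded_model.metadata)
```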

7. Uploading Data Manually (covered in notebook)

Uploading Datasets:

When you select File -> Upload Data, click on the hyperlink provided.

Note that they will give you some default PySpark code based on the fact that it is a CSV file.
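
The generated snippet looks roughly like the following. This is a sketch of the shape, not the exact generated code; the file path is a placeholder, and the upload dialog shows the actual /FileStore/tables/... path for your file:

```python
# Placeholder path -- the upload dialog reports the real /FileStore/tables/... location.
file_location = "/FileStore/tables/my_uploaded_file.csv"

df = (
    spark.read.format("csv")
    .option("header", "true")       # first row contains column names
    .option("inferSchema", "true")  # let Spark infer column types
    .load(file_location)
)

display(df)
```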