
Using Databricks in DAE

Getting started with Databricks in the Data Access Environment (DAE).

Introduction

Databricks is a collaborative analytics platform that supports SQL, Python and R languages for the analysis of big data in the cloud.

This guide is intended to help you get up and running using Databricks in the Data Access Environment (DAE). It provides guidance on:

  • adding data to DAE using Python and SQL
  • creating and using notebooks and clusters
  • using Spark SQL and Python query tools with examples of an SQL and a Python cell
  • writing efficient code to reduce the time it takes to run the code
  • working collaboratively using shared folders and shared databases
  • creating a snapshot of a data asset for analysis at a later date
  • downloading analysis results, or code, for publication or onward use
  • criteria to follow to ensure a quick review of your export by the Safe Output Service
  • using widgets to allow you to control parameters in your notebooks and dashboards

Please refer to the Databricks documentation website for further information on using Databricks.


Contact us

If you have any questions about guidance or functionality, or are experiencing any operational issues, such as problems with system access, please contact our National Service Desk on 0300 303 5035 or via email at [email protected].

For general enquiries, such as questions about Data Sharing Agreements (DSAs) or other data-related issues, please email our Contact Centre at [email protected].


Logging in

When logging in for the first time, you will be asked to create a two-factor authentication code before you can sign in. Please refer to the Data Access Environment set up guide for help setting up your two-factor authentication code.

Note: You will be asked to enter your username and password and authentication code twice.

To log in to the DAE portal:

  1. Enter your email address and password.
  2. Click Next.

activate login 3

  3. Enter your two-factor authentication code.
  4. Click Log in.

Two-factor authentication screen

You will be asked to log in again.

  5. Enter your email address and password.
  6. Click Next.

activate login 3

  7. Enter your two-factor authentication code.
  8. Click Log in.

Two factor authentication

  9. Select the agreement that permits access to the required analytical tool from the Agreement drop-down menu.
  10. Click on Submit.

Agreements 2c

  11. When you have the correct agreement selected, click on the Databricks analytical tool.

Agreements 4a

  12. Click the Single Sign On button.

Single sign on

After logging in, the Welcome to Databricks screen will be displayed. Please refer to the Navigating Databricks section for details.


Logging out

It is important to log out of Databricks when you have finished.  

To log out:

  1. Click on the small user Account icon in the top right of the Welcome to Databricks screen.

Log Out 1

  2. Select Log Out from the drop-down menu.

Log Out 2

A window will be displayed.

  3. Click Log Out.

Log Out 3

You will be logged out of the Databricks tool.



Copying and pasting into DAE

By default, you can copy and paste within DAE, but to copy and paste into DAE, you must ensure that copying to the clipboard is enabled.

After logging into DAE, a clipboard icon will be displayed next to the bookmark (star) icon in the top right of your browser. If copying to the clipboard is enabled the clipboard icon will be displayed as shown in the image below:

Copy and Paste into DAE

To enable the clipboard icon if it is crossed out:

  1. Click on the clipboard icon.
  2. Select Always allow in the pop-up screen.
  3. Click Done.

You should now be able to copy and paste into DAE.

Due to security restrictions, it is not possible to copy and paste out of DAE.


Adding data to DAE

Sometimes you may need to add small amounts of data to DAE. This can be done by using Python or SQL.

Please note that adding 1,000 lines of data or more is likely to result in slow system performance.

Using Python to add data

You can copy data in CSV format to the clipboard within DAE and then use Python to convert the CSV string into a table of data.

The following query is used:

import pandas as pd
import io

data = """<Paste CSV Here>"""
df = pd.read_csv(io.StringIO(data), header=0, delimiter=',').astype(str)
spark.createDataFrame(df).createGlobalTempView("<write a temp table name here>")

To use Python to add data:

  1. Copy the data from a CSV file into the query replacing <Paste CSV Here>.
  2. Enter a name for the table, replacing <write a temp table name here>.

For example:

Import Code 1

  3. Run the query.

The data from the CSV will now be available as a global temp view.

  4. You can now view hospital_test, which will have columns titled “org_code” and “org_name”, using the following SQL statement:

    SELECT * FROM global_temp.hospital_test

For example:

Import code 2


Using SQL to add data

You can copy data in CSV format to the clipboard within DAE and then use SQL to convert the CSV string into a table of data. 

To use SQL to add data:

  1. Add your data to an Excel spreadsheet (on your local computer).

For example:

Org_code,Description  
abc,test hospital  
zzz,testington general  
z99,testsville royal infirmary  
aaa,st test's trust
 

Add the example data above into Excel as 2 columns (Org_code, Description): a header row and 4 rows of data.

  2. Combine the 2 columns to create a new single column using the following formula:

=(CHAR(40)&CHAR(34)&A1&CHAR(34))&CHAR(44)&CHAR(34)&B1&CHAR(34)&CHAR(41)&CHAR(44) 

You will need to amend the cell references (A1 and B1) accordingly.

Paste the formula in the column after your data.

  3. When you have written your formula, copy (drag) it to all rows.
  4. Include the additional formatting shown below:

("abc","test hospital"),  
("zzz","testington general"),  
("z99","testsville royal infirmary"), 
("aaa","st test's trust")

  5. Copy the contents of the new column from Excel and paste into an SQL cell in Databricks.

For example:

Adding Data to DAE 1

  6. Add the following SQL code to convert the data into a table with headers:

CREATE OR REPLACE GLOBAL TEMP VIEW test_table AS
SELECT * FROM VALUES
<paste your values here>
AS test_table(<column1>, <column2>)

For example:

Adding Data to DAE 2

Please refer to the Apache Spark website for more information on SQL syntax.

  7. You can now view test_table, which will have columns titled “Org_code” and “Description”, using the following SQL statement:

SELECT * FROM global_temp.test_table

For example:

Adding Data to DAE 3


Using notebooks and clusters

Databricks uses notebooks and clusters to analyse and visualise data and run automated jobs.

Notebooks

A notebook allows you to write and execute code, a section at a time, using cells. A notebook can have one or more cells.

Notebook cells

Each cell can contain a different type of code. One could contain Python, the next could contain Structured Query Language (SQL), the one after that could contain Markdown.

Using mixed languages

Cells use the default language of the notebook, but you can set individual cells to a different programming language using magic commands. Magic commands allow you to switch between Python, SQL and Markdown.

The percent sign is used to indicate you are using a magic command.

For example:

Magic Command    Description
%sql             SQL magic - makes the cell a SQL cell
%python          Python magic - makes the cell a Python cell
%md              Markdown magic - makes the cell a markdown cell

In notebooks you can use SQL and Python code cells sequentially. This allows you to use the preferred code for the task.

Data can be passed from cell to cell, and notebook to notebook, to create workflows.

Notebooks passing cells
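
For example, a Python cell can register a temporary view that a later cell, switched to SQL with the %sql magic command, then queries by name. This is a minimal sketch: the view name and sample values are hypothetical, and it assumes the standard Databricks notebook environment where spark is already available.

# Python cell: build a small DataFrame and register it as a temporary view
# so that later cells in the same notebook can query it by name.
df = spark.createDataFrame(
    [("abc", "test hospital"), ("zzz", "testington general")],
    ["org_code", "org_name"],
)
df.createOrReplaceTempView("example_orgs")

# A following cell could then begin with the %sql magic command and contain:
#   SELECT org_code, org_name FROM example_orgs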

For more information on managing notebooks, please refer to the Databricks notebooks management page.

For more information on using notebooks, please refer to the Databricks notebooks page.

Clusters

A cluster is a set of computation resources and configurations on which you run data engineering, data science and data analytics workloads. It is effectively the engine that drives the notebook, allowing you to perform your analysis.

Before you get started using notebooks, Databricks will prompt you to attach a cluster. A cluster must be running to view data in the Data tab.

Working on shared clusters

Within the Data Access Environment (DAE) everyone works on a shared cluster. The shared cluster(s) available are dependent on your DSA. 

Restarting a cluster will clear any information stored in its memory, such as global temp views, dataframes and variables, and will cause running queries to fail for anyone using the cluster.

A restart should only be used as a last resort; if one is necessary, consult your colleagues before restarting.

Creating folders for notebooks

You can create folders to store and organise your notebooks.

  1. From the sidebar, click the Workspace icon.

workspace 4

  2. In the workspace folder, select Create and then Folder.

Create folders for notebooks

See the Working collaboratively section for further information.

Creating and using notebooks

You can create notebooks to write and execute code.

To create a notebook:

  1. From the sidebar, click the Workspace icon.

workspace 4

You will be able to see all the folders for which you have permissions.

working collaboratively 2

  1. Select the folder to which you want to add the notebook.
    You will have access to shared folders and personal folders.
    If you add the notebook to a shared folder, all viewers with access permissions to that folder will be able to access the notebook.
    If you add the notebook to your own personal folder the notebook will only be accessible by yourself.
  2. Right click on the folder name and select Create then Notebook.
  3. Enter a name for the notebook.
    Prefix your notebook with a meaningful identifier such as your initials or project code so you can easily recognise it.
  4. Select your language from the drop-down menu (this determines the default language of the notebook).
  5. Select the cluster you want to attach your notebook to from the Cluster drop-down menu.
    If there are bespoke clusters for the agreement you have selected, these will be available in the drop-down menu. Select the cluster with the name that matches the agreement.
    If there are no bespoke clusters for the agreement you have selected, ensure the cluster ‘dae-databricks-internal’ is selected.
  6. Click Create.

create notebook

The notebook will open with an empty cell at the top.

create notebook assign clust 6

You can now start to write code into the cell in your notebook.

When you close the notebook the code will be automatically saved.

Running code

To run the code in a cell in your notebook:

  1. Click the triangle to the right of the cell.
  2. Select Run Cell from the drop-down menu.
    Alternatively, you can press SHIFT + Enter within the cell.

run cell

The results will be displayed below the cell.

run cell 2

Note that if your code is divided into several cells in your notebook, to run all the cells, click Run All in the notebook toolbar.

Adding cells

A notebook can contain one or more cells. If your code is large, you can divide it into several cells. You can add more cells to your notebook as required.

To add a new cell to your notebook:

  1. In your notebook, hover your mouse over the bottom of the cell beneath which you want to add a new cell.
    A ‘+’ icon will be displayed.

add cell 1

  2. Click on the + icon.

A new empty cell will be displayed in your notebook.

add cell 2

Changing the cluster

Once inside the notebook, you can attach the notebook to a cluster, or change the cluster it is attached to, using the cluster drop-down menu.

create notebook assign clust 4

Cloning notebooks

Sometimes it may be helpful to create a ‘clone’ of a notebook. For instance, you may want to:

  • keep an original of a notebook
  • make changes to a notebook
  • share a copy of a notebook with your colleagues using a shared folder

To clone a notebook:

  1. Clear any query results.

Clear query results

Clearing queries is important so that sensitive information is not shared with a user who may not have the same access permissions.

  2. From the sidebar, click the Workspace icon.

workspace 4

  3. Right click on the notebook to be cloned.
  4. Select Clone from the drop-down menu.

Workspace folder - select "clone"

  5. In the pop-up window, select the folder to which you would like to clone your notebook.
  6. Enter a new name for the notebook.
  7. Click Clone.

Clone notebook

Deleting notebooks

To delete a notebook that you no longer require:

  1. From the sidebar, click the Workspace icon.

workspace 4

  2. Right click on the notebook to be deleted.
  3. Select Move to Trash from the drop-down list.

notebook delete 2

  4. Select Confirm and move to Trash.

The notebook will be moved to the Trash and will be permanently deleted after 30 days.

Master notebooks

You can use a single master notebook to run other notebooks. This can help you develop effective workflows to allow preparatory work to be signed off and assured pieces of analysis to be packaged up and run separately.

Master notebooks
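
As an illustration, a master notebook might call other notebooks with dbutils.notebook.run; the notebook paths, timeout and parameters below are hypothetical.

# Run a preparation notebook, then an analysis notebook, from a single master notebook.
# dbutils is available in Databricks notebooks; each called notebook can return a value
# with dbutils.notebook.exit().
prep_result = dbutils.notebook.run("./prepare_data", 3600, {"year": "2022"})
print(prep_result)

analysis_result = dbutils.notebook.run("./run_analysis", 3600, {"year": "2022"})
print(analysis_result)

# Alternatively, %run ./prepare_data runs another notebook inline so that its
# functions and variables become available in the master notebook.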

Tutorial notebooks

Databricks tutorial notebooks are available in the workspace area.

  1. From the sidebar, click the Workspace icon.

workspace 4

  2. Select User Guidance.

The tutorial notebooks will be shown on the left.

Tutorial notebooks

The tutorial notebooks are read-only by default. However, if you clone a notebook you can make changes to it if required. Please refer to the Cloning notebooks section for information.


Using cells

SQL and Python cells

Whilst SQL is a powerful query tool, there may be times when Python is more appropriate. For example, when you need to do more complex operations, such as:

  • data manipulation
  • data visualisation
  • data modelling

The cells in a Databricks notebook run the language you selected when creating the notebook. However, if you select Python and would then like to use SQL, you can use a SQL magic command %sql to switch. Similarly, if you select SQL and would then like to switch to Python you can use a Python magic command %python.

Both SQL and Python cells support syntax highlighting which means that the code is highlighted according to the category of terms to make it easier to read.

Databricks uses Spark SQL, which has subtle and important differences to T-SQL, which is used by SQL Server. Please refer to the Databricks SQL reference guidance, or save it as a bookmark.

Below is an example of an SQL cell:

SQL Cell

Below is an example of a Python cell (using PySpark):

Python cell

Markdown cell

If you would like to include documentation in a notebook you can create a markdown cell by putting a %md magic command at the start of the cell (called a markdown magic cell). This enables you to render formatted text alongside your code. It also allows you to organise your code as cells under headings which can be expanded and collapsed with the -/+ buttons on the left.

Below is an example of a markdown cell:

Markdown cell


Pulling tables into Python using PySpark

PySpark is the Python Application Programming Interface (API) for Apache Spark. Apache Spark is the distributed framework used to handle big data analysis.

When pulling tables from Databricks into Python, you need to go via Spark using PySpark: spark.table('table_name').

Dataframes provide a domain specific language for structured data manipulation. Please follow the link to see the guidance for Dataframe operations.

The following example shows how to pull a table into Python using PySpark:

Pulling Python Table
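
A minimal sketch of the same approach is shown below; the table name reuses the dss_corporate.org_daily_table asset mentioned later in this guide and the column name is hypothetical.

# Pull a table into a Spark DataFrame, then filter and aggregate it with PySpark.
from pyspark.sql import functions as F

sdf = spark.table('dss_corporate.org_daily_table')
filtered = sdf.where(F.col('org_code').isNotNull())   # hypothetical column name
summary = filtered.groupBy('org_code').count()
display(summary)   # display() renders the results below the cell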


Apache Spark and Spark SQL

Spark is a data processing engine for cluster computing which sits between the data source and the analysis tool. It supports different data sources and different querying languages, including a SQL variant called Spark SQL. Spark SQL allows the use of a mixture of SQL commands to perform complex analytics.

Spark SQL syntax

Spark SQL is different to standardised SQL and uses Spark SQL syntax. For a detailed description of the Spark SQL syntax along with examples of when you would use it, please refer to the Apache Spark SQL syntax pages of the Apache Spark website which provides a list of Data Definition and Data Manipulation Statements as well as Data Retrieval and Auxiliary Statements.
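
As a simple illustration, the same Spark SQL statement can be run in a SQL cell or from a Python cell using spark.sql; the table and column names below are hypothetical.

# Run a Spark SQL aggregation from a Python cell; the statement inside the quotes
# could equally be placed in a %sql cell.
result = spark.sql("""
    SELECT org_code, COUNT(*) AS record_count
    FROM dss_corporate.org_daily_table
    GROUP BY org_code
    ORDER BY record_count DESC
""")
display(result)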


Improving code efficiency

Writing efficient code can reduce the time it takes to run and reduce the burden on the system, which helps optimise performance and avoid memory issues.

Below are some common issues and quick wins that can make your code run more efficiently.

Avoid creating large chains of views

A large chain of views can take a long time to run and potentially reduce service performance for other users and overburden the query optimiser. Views should therefore only be calculated when needed, such as when writing to a table/pandas dataframe, making a graph, or displaying results.

Your query will run faster if you convert key stages of the chain into tables. If the output of a view is going to be used multiple times, or joined to other items, you should consider making it a table.

Example:

CREATE TABLE <database_name>.<table_name> AS
<SELECT statement>;

ALTER TABLE <database_name>.<table_name> OWNER TO <database_name>

Reduce data sample before performing complex operations

Complex operations, such as JOIN and DISTINCT, put a significant computational strain on the system and can reduce performance. We recommend users filter out unnecessary data before performing the complex operation.

When using a WHERE clause to reduce the data sample, the WHERE clause should come before the complex operation, if possible. This will optimise processing by taking up less memory in the cluster allowing the query to run faster.

Example:

Less efficient coding:

SELECT *
FROM db.table_a
INNER JOIN db.table_b
    ON table_a.column1 = table_b.column1
WHERE table_a.column1 IS NOT NULL

In the above example, the data is reduced after performing the JOIN operation, thereby using system resources less efficiently.

Efficient coding (recommended):

SELECT *
FROM
    (SELECT *
    FROM db.table_a
    WHERE column1 IS NOT NULL
    ) table_a
INNER JOIN db.table_b
    ON table_a.column1 = table_b.column1

In the above example, the data is reduced before performing the JOIN operation, thereby using system resources more efficiently.
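
The same principle applies in PySpark: filter the DataFrame before joining it. A sketch using the same hypothetical table and column names:

# Filter table_a before the join so that less data is processed by the join.
sdf_a = spark.table('db.table_a').where('column1 IS NOT NULL')
sdf_b = spark.table('db.table_b')

joined = sdf_a.join(sdf_b, on='column1', how='inner')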

Efficient use of coding for large comparisons

A query such as WHERE column_a NOT IN (SELECT column_a FROM …) will run into memory issues and take a long time to run if the SELECT clause is very large. The same would be true for a query which includes WHERE EXISTS (SELECT 1 FROM … WHERE …).

Query processing is distributed across Databricks’ compute resources, so using a JOIN or ANTI JOIN query will reduce the burden on the system, helping your code to run more efficiently.

You can use NOT IN, IN or EXISTS when the list of records is small (fewer than 1,000 records), for example, WHERE column NOT IN ('code_a', 'code_b').

Example for SQL:

Less efficient coding:

SELECT *
FROM db.table_a
WHERE column1 NOT IN (SELECT DISTINCT column1 FROM db.table_b)

In the above example, the SELECT clause from table_b will need to be run for each piece of data in table_a. If table_b is large, there are likely to be memory issues.

Efficient coding (recommended):

SELECT *
FROM db.table_a
LEFT ANTI JOIN (SELECT DISTINCT column1
                FROM db.table_b) table_b
ON table_a.column1 = table_b.column1

In the above example, table_b is reduced and an ANTI JOIN clause is used to reduce the records in table_a by keeping only those records that don’t have a matching column1 value in table_b.

The same principles also apply for PySpark.

Example for PySpark:

Less efficient coding:

sdf_a = spark.table('db.table_a')
sdf_b = spark.table('db.table_b')
sdf_b_massive_list = (sdf_b.select('column1').distinct()
                           .toPandas()['column1'].to_list())

sdf_filtered = sdf_a.where(~sdf_a['column1'].isin(sdf_b_massive_list))

In the above example, the SELECT clause from table_b will need to be run for each piece of data in table_a. If table_b is large there will be memory issues.

Efficient coding (recommended):

sdf_a = spark.table('db.table_a')
sdf_b = spark.table('db.table_b')
sdf_b_cutdown = sdf_b.select('column1').distinct()

sdf_filtered = sdf_a.join(sdf_b_cutdown, sdf_a.column1 == sdf_b_cutdown.column1, 'left_anti')

In the above example, table_b is reduced and an ANTI JOIN clause is used to reduce the records in table_a by keeping only those records that don’t have a matching column1 value in table_b.


Working collaboratively

You can collaborate with colleagues listed under the same DSA. For security reasons you are not able to collaborate with people outside of your DSA.

There are 2 ways to work collaboratively:

  • in shared folders
  • in shared databases

Shared folders

Collaboration spaces allow you to share your work with colleagues. In a shared folder you can add and amend notebooks. You may want to share a copy of a notebook with your colleagues using a shared folder.

Viewing shared folders

To access shared folders:

  1. From the sidebar, click the Workspace icon.

workspace 4

You will be able to see all the folders but you will only be able to view the folders you have permissions for.

View shared folders

  2. Navigate the folder structure; the files you can access will appear in the right-hand panel.

Navigate share folder

Saving to a shared folder

You should only save notebooks to the shared folder where all viewers in that space have access permissions for the data used.

Creating the notebook in the collaboration space is performed in the same way as working in your own space. Please refer to the Creating and using notebooks section for information.

Shared databases

The data area allows you to browse the databases available which contain the tables to which you have access.

There are 2 types of database:

  • Controlled database
    A controlled database is a core asset, such as HES, GP data and SUS+ which are maintained using an established process. A controlled database is read-only.
  • Project database
    A project database is created and managed by a user. Permissions are set to allow users to write to them, perform new analysis and create new sets of data.

Viewing shared databases and tables

To view a list of available databases and the tables within them:

  1. Click Data in the sidebar.

data

The Databases folder displays the list of available databases and everyone has access to the dss_corporate data.

The Tables folder displays the list of tables in the selected database.

Available Databases and Tables

  2. Select a table to view the metadata.

Metadata is not available for every asset.

navigating the data area 1

The data area shows a preview of the data and will by default show only the first 50 columns.

To view all the columns in the table select display all columns.
 

Changing table permissions in the project shared database

When you create a table in your project shared database it will be visible to other users listed in your DSA and they will be able to query it. However, they will not have permission to drop or edit the newly created table by default. To do this you need to grant the user group ‘OWNER’ permissions. This can be done using the following command:

ALTER TABLE <project_shared_database_name.table_name> OWNER TO <project_shared_database_name>

For example:

ALTER TABLE dars_nic_391419_j3w9t_collab.covid_positive OWNER TO dars_nic_391419_j3w9t_collab

Note: Always allow code to finish running when a table is being created or altered and do not cancel it part way through. If the run is cancelled, an error will occur and you will be prevented from creating a table with the same name.


Reproducing results

Data sets can change over time due to:

  • further data being submitted
  • better quality data being resubmitted

These changes mean that a project worked on earlier in the year may produce different results when running the same code against the updated data asset.

You may replicate and compare your analysis outcomes at a later date by creating a ‘data snapshot’ of the data asset and saving the corresponding notebooks and results.

If you wish to reproduce an analytical process at a later date you should:

  1. Save a data snapshot of the data asset.
  2. Save the notebook (code) for that analytical process.
  3. Save the results.

Creating a data snapshot

You can save a data snapshot by saving a table from an existing data asset for the project you are working on. The table should be saved to the relevant database which could be either your own database or a shared database.

The data snapshot table should also contain the production date (date of snapshot) as an additional column. This will allow the correct subset of data to be identified when reproducing the analytical process. Adding the production date as an identifier column means you can add multiple snapshots to the same table.

The following is an example of a ‘create table’ query to save a snapshot of the data asset to a table in your own database:

Create table query
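
A sketch of such a query, run here from a Python cell with spark.sql, using the project_a.org_daily_table name from the naming example below; adjust the database, table and column details to your own project.

# Save a snapshot of the data asset to a table in your own database,
# adding the production date (date of snapshot) as an additional column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS project_a.org_daily_table AS
    SELECT *, current_date() AS production_date
    FROM dss_corporate.org_daily_table
""")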

Note: Always allow code to finish running when a table is being created or altered and do not cancel it part way through. If the run is cancelled, an error will occur and you will be prevented from creating a table with the same name.

Managing multiple data snapshots

If you need to run the same query periodically, you may wish to create multiple data snapshots so you can reproduce the same output results at a later date. To do this, we recommend appending data snapshots to a single table for that project. The multiple snapshots will be identifiable by the production_date column in the table.

The following is an example of an ‘insert into’ query to add new lines of data containing the column production date:

Insert into query

After running the ‘insert into’ table query, you can check that your table contains the new data for the latest set of outputs.

Check latest set of outputs
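
A sketch of the ‘insert into’ query and the follow-up check, reusing the hypothetical project_a.org_daily_table snapshot table:

# Append a new snapshot, stamped with today's production date, to the snapshot table.
spark.sql("""
    INSERT INTO project_a.org_daily_table
    SELECT *, current_date() AS production_date
    FROM dss_corporate.org_daily_table
""")

# Check that the new snapshot is present by listing the production dates in the table.
display(spark.sql("""
    SELECT production_date, COUNT(*) AS row_count
    FROM project_a.org_daily_table
    GROUP BY production_date
    ORDER BY production_date
"""))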

The diagram below shows the created table project_a.org_daily_table containing multiple data snapshots identified by the production date:

reproducing results 5

Naming a data snapshot

Data snapshots should be given meaningful names to make things easier to manage when reproducing results.

The name of the data snapshot should contain the:

  • project identifier (code or ID, if available)
  • data asset name

In the example in the section above, Managing multiple data snapshots, the project is Project_a and the data asset is dss_corporate.org_daily_table. A suitable name for the data snapshot table would therefore be: project_a.org_daily_table.


Exporting and downloading results

Results can be exported to an S3 bucket on Amazon Web Services (AWS) which is linked to your DAE account. Exported results, including data tables, charts and graphs can then be downloaded and saved to your computer.

Downloads are only available if specified in your Data Sharing Agreement (DSA).

Exporting and saving your results

Depending on the terms agreed in your DSA, some users will be permitted to export and save outputs without checks, whilst other users will be required to have their outputs reviewed by an Output Checker in our Safe Output Service before they can be downloaded. This is to ensure that outputs cannot be linked to an individual patient and that a patient cannot be identified, for example through data that has not been aggregated or through exports containing a small number of records.

The Data Out field on the DAE Agreement screen displays your export permissions.

Data Out

The table below provides details on the different permissions available for Data Out, depending on your DSA:

Data Out   Description

ESCROW     Downloads are available. Your export request will be reviewed by the Safe Output Service; see the Reviewing your results section for further information. ESCROW is the name used by the system for the Safe Output Service.

Yes        Downloads are available. Your export request will not be checked by the Safe Output Service and will be automatically copied to your AWS S3 account.

No         Downloads are not available.

Reviewing your results

If your DSA requires your exported results to be reviewed before download, your request will be processed by the Safe Output Service. You will receive an email once your request has been processed.

  • Your request may be rejected if it does not meet our disclosure control guidance. You will receive an email explaining the reasons as to why your output could not be approved.
  • If your request is approved you will receive an email confirming that your output is available to download from your AWS S3 account.

For information on the criteria to follow to enable a quick review of your export please refer to the Safe Output Service section.

Export file formats

The following export file formats are supported within DAE:

csv, html, png, excel, zip, python

Exporting results to CSV

If your export request needs to be reviewed before download, you must ensure that you use clear descriptors for the column names or your download may be rejected.

To create a CSV file:

  1. Run the code to create your output.
  2. Click on the Download CSV button.

Export CSV 1

Databricks will display the first 1,000 rows of data by default.

  3. If you have more than 1,000 rows of data, click on the down arrow next to the Download CSV button and select Download full results from the drop-down menu.

Exporting CSV 2

If your DSA requires your results to be reviewed before download you will receive a confirmation email when your export request has been processed.

Exporting graphs as PNG

You can export graphs in Python notebooks, including graphs created using Matplotlib, as a PNG file.

If your export request needs to be reviewed before download, you must clearly label the axes and title of your graph or your download may be rejected.
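
As an illustration, a matplotlib figure with a clearly labelled title and axes might be produced along the following lines; the table and column names are hypothetical.

import matplotlib.pyplot as plt

# Pull a small, aggregated result into pandas and plot it with clear labels,
# so the output checker can understand the graph without extra context.
pdf = spark.sql("""
    SELECT org_code, COUNT(*) AS record_count
    FROM dss_corporate.org_daily_table
    GROUP BY org_code
""").toPandas()

fig, ax = plt.subplots()
ax.bar(pdf['org_code'], pdf['record_count'])
ax.set_title('Record count by organisation code')
ax.set_xlabel('Organisation code')
ax.set_ylabel('Record count')
display(fig)   # render the figure in the notebook so it can be exported as a PNG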

To create a PNG file:

  1. Run your SQL code or use "display(...)" on a spark dataframe or matplotlib figure.
  2. Press the Bar Chart button then use the Plot Options to design your plot.
  3. Click on the Download plot as a PNG button.

Exporting PNG 1

The file will be created with a randomised file name which cannot be changed.

To create a PNG file where you can define the file name (matplotlib plots only):

  1. Right click and select Save image as from the drop-down menu.

Exporting PNG 2

  2. Enter a name for the PNG file.
  3. Click Save.

Exporting PNG 3

If your DSA requires your results to be reviewed before download you will receive a confirmation email when your export request has been processed.

Exporting results to HTML

Exporting your results as an HTML file will enable you to include additional contextual information as a comment line in your output. This will provide more information for the output checker and help ensure that your download is not rejected.

Any data table, charts and graphs contained within the HTML file will be accessible after downloading the file. The HTML file will act as a container for both the export results and the contextual information.

To create an HTML file:

  1. Create your results (data table, chart or graph) in a notebook.
  2. Add a comment above the results to provide contextual information (see example below).
  3. Run your results.
  4. Export as an HTML file by selecting File > Export > HTML.

Exporting HTML

If your DSA requires your results to be reviewed before download you will receive a confirmation email when your export request has been processed.

Exporting a Python file

Exporting your results as a Python file will allow you to include comments in your output. This will provide more information for the output checker and help ensure that your download is not rejected.

The exported file will only contain raw text. Graphs will not be included.

To create a Python file:

  1. Add a comment above the results to describe what the code does.
  2. Run your notebook.
  3. Select File > Export > Source File.

Exporting Python

If your DSA requires your results to be reviewed before download you will receive a confirmation email when your export request has been processed.

Downloading your results

If your results require review, they will be copied to your AWS S3 account once they have been approved. They will then be ready for you to download to your computer.

If your results do not require review, they will be automatically copied to your AWS S3 account.

To download your results:

  1. Log in to the AWS S3 bucket.
  2. Enter your DAE Username and Password, and your two-factor authentication code.

The AWS Management Console screen will be displayed.

  3. In the Search field type S3.
  4. Press Enter.

AWS 1

  5. Select S3 Scalable Storage in the Cloud.

AWS 2

  6. On the Buckets screen, select the NHSD-DSPP-DAE-Data-Out item.

AWS 3e

Please note that although Access displays ‘Insufficient permissions’ you will be able to continue with your download.

You will be presented with a list of folders.

  7. Select the folder with your name in the title.
  8. Click Open.

AWS 4

The list of files in your folder will be displayed.

  9. Select your results file.
  10. Click Download.

All CSV files are named export.csv by default and this cannot be changed.

AWS 7

On Windows PCs the file will be stored in your default ‘Downloads’ folder. Depending on your local security, some browsers will allow you to open the file from a dialogue box on the page.

Output considerations

You should only export one query at a time, in accordance with your DSA.

Escape characters are used in Excel outputs to ensure field type integrity. These escape characters only appear when editing a cell and make no difference to the data, other than informing Excel how to display the content.

Data awareness training

The Safe Data Access Professionals Group helps share expertise, best practice and knowledge between organisations engaged in providing secure access to confidential sources of data from the health and social science research sector.

These free data awareness training courses and modules provide you with best practice information on working with statistical data and can be accessed on the Secure Data Group website.


Safe Output Service

The Safe Output Service ensures that data confidentiality is always maintained and data protection best practice is followed in relation to any code or results taken out of the environment. 

To protect against the release of potentially identifiable data and to assist in the output checking process, submissions must comply with the disclosure control rules below. Compliance will ensure the Safe Output Service can process the request quickly and without the need to contact the requestor for further information.

Disclosure control rules

If a request does not meet the following mandatory criteria, it will be automatically rejected. 

Requested results must: 

  • Not contain any data that could lead to the identity of an individual or their associated confidential information being revealed.
  • Be for analytical results and NOT a data file, a data set or a subset of the data: 
    • Unless this data consists only of aggregated, summary results (of no more than 1,000 rows). Output requests for files containing more than 1,000 rows may be approved if appropriate advance notice of the request submission is provided and the contextual information justifies an exception.  
    • Please note that the standard Service Level Agreement (see below) will not apply for files containing more than 1,000 rows. 
  • Contain only data that would be appropriate to make available in the public domain (e.g. in a research paper or report); it must not contain personally sensitive or commercially confidential data. 
  • Suppress any counts under 10 (by rounding to 10) except where they are high-level aggregations that are explained / justified in the contextual information provided and present no risk of deriving low counts. 
  • Round all counts greater than 10 to the nearest multiple of 5 (a sketch applying these rounding rules is shown after these lists). 
  • Be accompanied by appropriate contextual information, including: 
    • A clear description of the work undertaken, including information about the data sets, sample sizes and methodology used. 
    • Labels and descriptions of the data / fields / variables in a human readable format. 
    • A descriptive file name where this is supported by the tool being used. 
    • Where counts have been used in any statistical analysis, the underlying numbers should be provided so that the potential for deriving small numbers can be assessed.  
      Note: these underlying numbers must not be included in the output itself. 

Requested code outputs must: 

  • Not include Personal Identifiable Information (PII) or sensitive data (explicitly or implicitly) within the code or embedded comments.
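
As an illustration only (this is a sketch of the rounding rules above, not an official tool), counts could be suppressed and rounded before export along the following lines; the table and column names are hypothetical.

from pyspark.sql import functions as F

# Apply the disclosure control rounding rules to a column of counts:
# counts under 10 are suppressed by rounding them to 10,
# all other counts are rounded to the nearest multiple of 5.
counts = spark.table('project_a.summary_counts')

rounded = counts.withColumn(
    'record_count',
    F.when(F.col('record_count') < 10, F.lit(10))
     .otherwise((F.round(F.col('record_count') / 5) * 5).cast('int'))
)
display(rounded)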

Providing contextual information 

When reviewing your request, the output checking team can only see the results you have exported.  

The following methods can be used to provide additional contextual information: 

  • An accompanying email to [email protected].  
  • A column labelled “context” within a results table. 
  • An annotation on the figure or plot being requested. 

The following method should not be used to provide contextual information: 

  • Providing the code used to generate the results.  

Service Level Agreements

  • Responses are provided within 2 working days when requests are received before 2pm on a working day.  
  • Responses are provided within 3 working days when requests are received after 2pm on a working day. 
  • The above durations do not apply for requests for files containing more than 1,000 rows. 
  • The above durations are increased by 1 working day for Code output requests.  
  • To allow us to meet the needs of all users, each user should submit no more than 10 output requests per day. Multiple outputs submitted together (e.g. in a zip file) will be treated as separate requests and excess requests will be put on hold until the first 10 have been completed. Any remaining requests will be processed in batches of 10. 

Creating reusable processes

Widgets

Widgets allow you to control parameters in your notebooks and dashboards.

If you have an analytical pipeline that produces a regular set of outputs you can create a widget that allows you to set different parameters, such as year, gender or organisation, before re-executing the notebook. This means you can create a reusable process which can be used multiple times in the future without changing the cells in the notebook.

For more information on using widgets please refer to the Databricks website.

The following code can be used to remove all widgets:

dbutils.widgets.removeAll()

Widgets must be initialised before they can be used. If parameter values are 'passed in' then this step is skipped. When widgets are created, they take a passed-in parameter value in preference to the value specified or the default value.

The following code is an example of how to initialise widgets:

Initialise Widgets

You can pull the values from the widgets into Python and use them, such as in a PySpark query as shown in the example  below:

Widgets Python
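
A sketch of this pattern is shown below; the widget names, default values and table and column names are hypothetical.

# Initialise widgets with a name, a default value and a label.
dbutils.widgets.text('year', '2022', 'Year')
dbutils.widgets.dropdown('gender', 'all', ['all', 'male', 'female'], 'Gender')

# Pull the widget values into Python...
year = dbutils.widgets.get('year')
gender = dbutils.widgets.get('gender')

# ...and use them in a PySpark query.
sdf = spark.table('project_a.summary_counts')
filtered = sdf.where(sdf['report_year'] == year)
if gender != 'all':
    filtered = filtered.where(filtered['gender'] == gender)
display(filtered)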

You can also pass the parameters in SQL to pull in a widget using %<param_name>, as shown in the example below:

Widgets SQL

You can also pass variables in from Python, as shown in the example below:

Widgets variables

Alternatively, you can use format-strings (f-strings). In Python source code, an f-string is a literal string, prefixed with f, which contains expressions inside braces. The expressions are replaced with their values; f-strings provide a way to embed expressions inside string literals, using minimal syntax. It should be noted that an f-string is an expression evaluated at run time, not a constant value.

Please see the example  below:

Widgets f-string
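
A sketch of the f-string approach, substituting a widget value into a Spark SQL statement; the table and column names are hypothetical.

year = dbutils.widgets.get('year')

# The f-string substitutes the widget value into the SQL text before it is run.
query = f"""
    SELECT org_code, COUNT(*) AS record_count
    FROM project_a.summary_counts
    WHERE report_year = '{year}'
    GROUP BY org_code
"""
display(spark.sql(query))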


Best practice

The following best practice should be observed to help you work in Databricks efficiently whilst maintaining consideration for other DAE users.

Consider others when making changes to tables and notebooks

  • Do not ‘drop’, ‘truncate’ or ‘delete’ other users’ tables without consulting with them.
  • Do not delete or edit other users' notebooks without consulting with them.

Report unrecognised folders

  • If you notice a new folder that you and your colleagues do not recognise, please raise a service request on 0300 303 5035 or via email at [email protected].

Use lowercase for table and database names

  • Avoid using uppercase for table and database names as this will result in the query failing.

Use meaningful identifiers

  • Prefix your tables and notebooks with a meaningful identifier so you can easily recognise them, for example, with your initials, or a project identifier code.

Test your code before running

  • Test your code to ensure it works before you run it on Databricks. Use a testing framework such as ‘pytest’, ‘unittest’ or ‘doctest’ so the code can be easily re-tested later.

Allow a code to finish running when creating a table

  • Always allow code to finish running when a table is being created or altered and do not cancel it part way through. If the run is cancelled, an error will occur and you will be prevented from creating a table with the same name.

Store code centrally for re-use

  • If you wish to re-use code, add it to a central function or notebook of its own. Use dbutils.notebook.run or the %run command to re-use the code in other analytical pipelines.

Delete temporary tables after use

  • Delete any temporary tables that were created as intermediate tables when executing a notebook. Deleting tables saves storage, especially if the notebook is scheduled for daily execution.


Last edited: 24 November 2022 3:32 pm