If you have been following data circles or attended the Data + AI Summit 2022, you might have heard that Delta Sharing is about to be released into the production offering for Databricks. This is exciting for users and organisations who want to explore how Databricks can expand their solutions for secure, open data sharing.
Delta Sharing is an open protocol for the secure, real-time exchange of data, designed to streamline sharing between data providers and their recipients. It is built around four primary goals:
- Share real-time data directly, without the requirement to copy it.
- Support a wide range of clients, allowing users to use the data consumption tools of their choice.
- Strong security, auditing and governance. Grant, track and audit access from a single platform.
- Efficient scalability. Delta Sharing can scale to large datasets whilst leveraging the cost and elasticity benefits of cloud storage systems.
Whether you are looking to resolve headaches around data availability in multi-tenant user organisations, want to escape data sharing solutions that are locked into a single vendor’s chosen computing platform, or simply want to get involved with Databricks Marketplace, Databricks is building an exciting solution that is flexible enough to meet multiple use cases.
Data Cleanrooms: The concept of data clean rooms is not new but is quickly gaining traction as businesses are coming to terms with how valuable their data is and how it can be efficiently monetised.
A clean room is a ‘location’ where two or more parties can collaborate on data and utilise the benefits of joint resources in a manner that is still private, secure and has governance measures in place. However, modern solutions still face problems with scalability, the movement and availability of data, and supporting Structured Query Language (SQL) exclusively. With the rollout of Unity Catalog (for building a single point of governance enforcement) and Delta Sharing, on top of the already impressive arsenal the Databricks Lakehouse has on demand, it seems that Databricks aims to resolve the main concerns companies face with other modern data clean room solutions.
Multi-tenant user organisations: It is not uncommon nowadays for larger companies to utilise multi-tenant architecture, especially when considering administration, configuration, and resource concerns. Previously, problems may have arisen when looking to quickly and securely share data between ‘isolated’ business units (changes in configuration, adoption of new computing platforms to match a different BU, etc.). With Delta Sharing, this task can be streamlined without taking away the BU benefits that would normally be associated with a multi-tenant architecture.
Freedom of choice when choosing computing platforms: We’ve just considered multi-tenant architecture, which highlights a multitude of problems that Delta Sharing could solve. However, what happens when you have partner organisations using different computing platforms that you’d like to share with? What if one of your data providers is on a different platform? Do you adopt new technology? Do you spend hours devising a crafty solution to pipeline the data across? Instead, Delta Sharing opens the possibility for you to share the data in real time, with no need to copy it, across a vast range of supported computing platforms.
The Databricks Marketplace aims to provide an open community resource for users to share, collaborate on and create both datasets and data solutions. The platform is powered by Delta Sharing, which is incredibly exciting for data providers and data consumers alike, and will be an amazing resource for the data community as a whole. I highly suggest you check out the announcement from Databricks.
As of the time of writing this post, Delta Sharing is in open preview, so if you have previously signed up and have access, you can get started today! Otherwise, you will have to wait for general access or set up a local stand-alone environment.
If you are interested in getting access to exciting new Databricks features ahead of the crowd to play around with before release, I recommend signing up for the public preview.
Alternatively, if you feel like setting up a stand-alone Delta Sharing Server, you can check out the GitHub Repo (https://github.com/delta-io/delta-sharing).
For this post, I will exclusively outline some basic config and commands to be run on the Databricks platform.
- Ensure that you are an account or metastore admin (if you are not, you will need an admin to complete the “Enabling Delta Share” step on your behalf).
- Have Unity Catalog enabled with at least one metastore
- As of the Open Preview, you cannot share views
- `SELECT` is the only statement permission you can grant to a user
Enabling Delta Share
- Log into the account console (https://accounts.cloud.databricks.com/) and head to the data pane to view your Unity Catalog metastores.
- Select the metastore from which you want to share tables, then select the configuration tab
- Check “Enable Delta Sharing to allow a Databricks user to share outside their organization”
- (Optional) If you would like to control how long a data recipient has access to the shared data, edit the “Delta Sharing recipient token lifetime” to meet your use case.
A `share` object is a collection of tables in a metastore that you want to share as a single group.
Note: Tables added to a share must all be from the same Unity Catalog metastore.
Firstly, boot up a notebook with `SQL` code blocks to get started.
Creating a Share
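A minimal sketch of the create statement, assuming the standard Databricks SQL `CREATE SHARE` syntax. The `IF NOT EXISTS` and `COMMENT` clauses are optional, and the comment text is only a placeholder:

```sql
-- Create an empty share; tables are added to it afterwards with ALTER SHARE.
CREATE SHARE IF NOT EXISTS <share_name> COMMENT '<optional description>';
```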
Show all existing Shares
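Assuming the standard Databricks SQL syntax, listing shares should look something like this:

```sql
-- List the shares defined in the metastore.
SHOW SHARES;
```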
List all tables in the target share
```sql
SHOW ALL IN SHARE <share_name>;
```
Delete a Share
```sql
DROP SHARE <share_name>;
```
Adding a table to a share
```sql
ALTER SHARE <share_name> ADD TABLE <database_name>.<table_name>;
```
Removing a table from a share
```sql
ALTER SHARE <share_name> REMOVE TABLE <database_name>.<table_name>;
```
Partition Data to share? No Problem
```sql
ALTER SHARE <share_name> ADD TABLE <database_name>.<table_name> PARTITION (<expr>), (<expr_n>), (...);

-- Example
ALTER SHARE my_share ADD TABLE db.students PARTITION (GPA >= 3.0);
```
A recipient is a generated set of credentials that represents a user or set of users with whom you wish to share data.
`CREATE RECIPIENT <recipient_name>;`
Note: Review the output of this code block, as it displays the activation link you are to share with data recipients. Caution: The credential file can only be downloaded once.
The recipient can then upload this credential file into their desired computing platform, e.g. Power BI, Tableau or Qlik.
This command will pull all relevant information about a named recipient.
I found this command particularly useful when I did not catch the activation link during the initial recipient creation.
`DESCRIBE RECIPIENT <recipient_name>;`
List all recipients
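Assuming the standard Databricks SQL syntax, listing recipients should look something like this:

```sql
-- List the recipients defined in the metastore.
SHOW RECIPIENTS;
```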
Remove a recipient
`DROP RECIPIENT <recipient_name>;`
`SHOW GRANTS TO RECIPIENT <recipient_name>;`
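The commands below cover the core functionality for managing access and permissions across your share objects: one grants access to a share and the other revokes it.
`GRANT SELECT ON SHARE <share_name> TO RECIPIENT <recipient_name>;`
`REVOKE SELECT ON SHARE <share_name> FROM RECIPIENT <recipient_name>;`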
Note: At this point, SELECT is the only privilege you can grant.
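To see how the pieces fit together, here is a rough end-to-end sketch of the provider-side flow. The names `sales_share`, `sales.orders` and `partner_co` are hypothetical and used purely for illustration, and the optional `IF NOT EXISTS` and `COMMENT` clauses assume the standard Databricks SQL syntax:

```sql
-- Hypothetical names throughout: sales_share, sales.orders, partner_co.

-- 1. Create the share and add a table to it.
CREATE SHARE IF NOT EXISTS sales_share COMMENT 'Orders shared with partner_co';
ALTER SHARE sales_share ADD TABLE sales.orders;

-- 2. Create the recipient; note the activation link in the output.
CREATE RECIPIENT IF NOT EXISTS partner_co;

-- 3. Grant the recipient read access to the share.
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_co;

-- 4. Confirm what the recipient can access.
SHOW GRANTS TO RECIPIENT partner_co;
```

Running the statements in this order keeps the recipient from seeing an empty share, although nothing stops you from granting access first and adding tables later.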
This blog was written from excerpts from the following posts. If you want to read further or see the original sources, please check out the links below: