Microsoft recommends Databricks and Delta Lake for managing data pipelines in its Modern Data Warehouse architecture on Azure. Data Factory manages the workflow and the restart/recovery of failed tasks. Here we will look at a template for Databricks that accepts a list of tables and a high watermark and returns the new watermark, an error flag, and a message. Data Factory can then determine whether any tables must be retried and maintain the new watermark for incremental processing.

The template notebooks can be found here.

Workflow

  1. The driver notebook will accept JSON parameters and convert…
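The result handed back to Data Factory can be sketched as follows. This is illustrative only: the field names and the helper are my assumptions, not necessarily what the template uses.

```python
import json

def build_result(new_watermark: str, error: bool, message: str) -> str:
    """Assemble the JSON result Data Factory inspects after the run.

    In a real Databricks notebook this string would be returned with
    dbutils.notebook.exit(...); the field names here are illustrative.
    """
    return json.dumps({
        "newWatermark": new_watermark,  # high watermark for the next incremental load
        "errorFlag": error,             # lets Data Factory decide whether to retry
        "message": message,
    })

payload = build_result("2021-06-01T00:00:00", False, "3 tables loaded")
```

Data Factory reads this string from the notebook activity's output and can branch on the error flag or store the watermark for the next run.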

Visual Studio protects your data during deployments by making sure you do not delete it by accident. Such mishaps can be caused by shortening a column, dropping a column, and similar schema changes. You can disable this protection, but I would only recommend doing so in rare cases, such as a hotfix when your production database is unusable.

This forces us to handle data transformations deliberately. In the following discussion we will set up pre- and post-deployment scripts with a release variable to manage deployments. These scripts will be run outside of the DACPAC schema compare. …

When streamlining our database development process, we first must get our databases under source control. Often, the code accesses other databases on the server. Whether it is another in-house database or a vendor database, dependencies make version control harder and deploying to multiple QA and UAT environments time-consuming.

Take a little time upfront and you will have the security of source control and the ability to deploy to multiple environments instantly. …

Databases should be under source control and have the ability for CI/CD just like application code. Visual Studio gives you this capability. Here, we will show you how to get your databases into a Visual Studio Database Project.

Download Visual Studio

If you don’t have Visual Studio, download it from here.

During the installation, you will be asked to choose your Workload(s). You need to check Data storage and processing. You can also check others if you want to give them a try or add them later.

Create a New Project

Launch Visual Studio and select Create a new project.

Azure Data Factory has many capabilities, but no tool is the best at everything. Sometimes you have an existing script that needs to be automated, or PowerShell is the best option for the task at hand. Currently, ADF does not have a PowerShell activity, so we will execute the script by using a webhook to call a script stored in an Azure Automation Account.

If you do not have an Automation Account set up, go to Create a resource, click IT & Management Tools, and select Automation.

Complete the few required fields and click Create.
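Under the hood, triggering an Automation runbook through its webhook is a plain HTTP POST to the webhook URL; the request body is surfaced to the runbook as the WebhookData parameter. As a minimal sketch (the URL and parameter names below are placeholders, not real values):

```python
import json
import urllib.request

def build_webhook_request(webhook_url: str, parameters: dict) -> urllib.request.Request:
    # Azure Automation webhooks are invoked with an HTTP POST; the body
    # is passed to the runbook as WebhookData. The caller (here, an ADF
    # Web/Webhook activity) only needs the URL -- the token in the URL
    # is the authentication.
    body = json.dumps(parameters).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

# Placeholder URL -- a real one is generated when you create the webhook.
req = build_webhook_request(
    "https://s1events.azure-automation.net/webhooks?token=PLACEHOLDER",
    {"FileName": "sales.csv"},
)
```

In ADF itself you would put the webhook URL and body into a Webhook activity rather than writing code; the sketch just shows the shape of the call.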

When we go to…

We may have AI and holograms now, but the business world still lives on spreadsheets. So, after we ingest terabytes of data in record time, we will be told our new report will be meaningless without this spreadsheet out on the file system.

Here we will review three steps to standardize the date columns in a flat file and remove invalid dates:

  1. Import libraries.
  2. Set date column list.
  3. Call function.

Import Libraries

This function uses parse from dateutil. The other libraries are standard.

from datetime import timedelta, datetime
from pyspark.sql.functions import *
from pyspark.sql.types import *
from dateutil.parser import parse
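A minimal sketch of such a function (the name and the null-handling are my assumptions, not necessarily the template's): it tries dateutil's parse and returns None for anything unparseable, so invalid dates become nulls. In a notebook it would be wrapped with udf(...) and applied to each column in the date column list.

```python
from dateutil.parser import parse

def standardize_date(value):
    """Parse a raw string to ISO yyyy-mm-dd; return None for invalid dates.

    Illustrative helper -- on Spark it would be registered as a UDF, e.g.
    udf(standardize_date, StringType()), and applied per date column.
    """
    if value is None:
        return None
    try:
        return parse(value).strftime("%Y-%m-%d")
    except (ValueError, OverflowError):
        return None

standardize_date("Jan 5, 2021")   # "2021-01-05"
standardize_date("not a date")    # None
```

Returning None rather than raising keeps a single bad value from failing the whole load; downstream logic can then filter or flag the null dates.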

Set a list of columns to be processed

For this…

As we process more and more data, performance is as critical as ever. The consumers of the data want it as soon as possible. And it seems like Ben Franklin had Cloud Computing in mind with this quote:

Time is Money. — Ben Franklin

Let’s look at five performance tips:

  1. Partition Selection
  2. Delta Cache
  3. Optimize with Z-order
  4. Optimize Write
  5. Merge by Partition
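The five tips map to configuration and commands along these lines. This is a fragment that only runs on a Databricks cluster (`spark` is the notebook's session); the table and column names are placeholders, and exact option names should be checked against the Databricks docs for your runtime.

```python
# 2. Delta cache -- keep hot data on the workers' local SSDs
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# 4. Optimize Write -- coalesce small files at write time
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# 3. Optimize with Z-order -- co-locate data by a frequently filtered column
spark.sql("OPTIMIZE events ZORDER BY (eventType)")

# 1 & 5. Partition selection / merge by partition -- put the partition
# column in the MERGE condition so only the matching partitions are scanned
spark.sql("""
  MERGE INTO events AS tgt
  USING updates AS src
  ON tgt.eventDate = src.eventDate   -- partition column prunes partitions
     AND tgt.eventId = src.eventId
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```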

Assumptions

Adding bigger or more nodes to your cluster increases costs. There are also diminishing returns. You do not need 64 cores if you are only using 10. But you still need a minimum that matches your processing requirements…

Without source control for Azure Data Factory (ADF), your only option is to publish your pipeline, and the pipeline must pass validation before it can be published. With source control, we can save intermediate work, use branches, and publish when we are ready. The next step is CI/CD. Let’s look at how we can use Azure Pipelines to accomplish this.

Topics

  • Setup
  • Build Pipeline
  • Release Pipeline
  • Environment parameters (e.g., for UAT and Production)
  • Production release
  • Specify the release to publish
  • Enable Continuous Integration

Setup

For this demo, we have a simple pipeline that copies files in ADLS from one folder to another. We are also using…

With the high pace of development, data architects and engineers have to keep up with our DevOps counterparts. We also want the many benefits of CI/CD for databases, such as fewer errors, stable builds, better release management, and less manual work.

Let’s walk through using your database project in a GitHub repo to implement CI/CD. The four main areas we will cover are:

  1. Setup
  2. Creating a Build Pipeline
  3. Creating a Release Pipeline
  4. Enabling Continuous Integration

You may find it beneficial to read through and review the screenshots before following along and implementing the steps.

Prerequisites

  • Visual Studio Database Project.
  • GitHub…

What is a Perspective?

A perspective is a view of the model. It allows you to maintain your Shared Dataset (often called a Golden Dataset) while presenting a smaller view to the consumers so they are not overwhelmed with tables, columns, and measures they don’t need.

What a Perspective is not

It is not a security model. You will still need to maintain Row-Level Security in your model/dataset.

Why Perspectives

Shared datasets have many benefits, including a single version of the model to maintain, access from reports across multiple workspaces, and fewer datasets to refresh. However, on the reporting side, the model can become much…

Bob Blackburn

Data Platform Architect at Hitachi Solutions
