Moving to Azure and implementing Databricks and Delta Lake for managing your data pipelines is recommended by Microsoft for the Modern Data Warehouse Architecture. Data Factory is used to manage workflow and restart/recovery of failed tasks. Here we will look at a template for Databricks to accept a list of tables and a high watermark and return the new watermark, an error flag, and a message. Data Factory can then determine if any must be retried and maintain the new watermark for incremental processing.
The template notebooks can be found here.
Visual Studio protects you and your data during deployments by making sure you do not delete data by accident. Such mishaps can be caused by shortening a column, dropping a column, and similar data operations. You can disable this. But I would only recommend it in rare cases for a hotfix because your production database is unusable.
This forces us to handle data transformations deliberately. In the following discussion we will set up pre- and post-deployment scripts with a release variable to manage deployments. These scripts will be run outside of the DACPAC schema compare. …
When streamlining our database development process, we first must get our databases under source control. Often, the code accesses other databases on the server. Whether it is another in-house database or a vendor database, dependencies make version control harder and deploying to multiple QA and UAT environments time-consuming.
Take a little time upfront and you will have the security of source control and the ability to deploy to multiple environments instantly. …
Databases should be under source control and have the ability for CI/CD just like application code. Visual Studio gives you this capability. Here, we will show you how to get your databases into a Visual Studio Database Project.
If you don’t have Visual Studio, download it from here.
During the installation, you will be asked to choose your Workload(s). You need to check Data storage and processing. You can also check others if you want to give them a try or add them later.
Launch Visual Studio and select Create a new project.
Azure Data Factory has many capabilities. But no tool is the best at everything. Sometimes you have an existing script that needs to be automated or PowerShell is the best programming option for the task at hand. Currently, ADF does not have a PowerShell task. So, we will execute it by using a webhook to call a script stored in an Azure Automation Account.
If you do not have an Automation Account set up, go to new resources and click on IT & Management Tools, and select Automation.
Complete the few required fields and click Create.
When we go to…
We may have AI and holograms now, but the business world still lives on spreadsheets. So, after we ingest terabytes of data in record time, we will be told our new report will be meaningless without this spreadsheet out on the file system.
Here we will review three steps to take a flat file and standardize date columns and remove invalid dates.
This function uses parse from dateutil. The other libraries are standard.
from datetime import timedelta, datetime
from pyspark.sql.functions import *
from pyspark.sql.types import *
from dateutil.parser import parse
As we process more and more data, performance is as critical as ever. The consumers of the data want it as soon as possible. And it seems like Ben Franklin had Cloud Computing in mind with this quote:
Time is Money. — Ben Franklin
Let’s look at five performance tips:
Adding bigger or more nodes to your cluster increases costs. There are also diminishing returns. You do not need 64 cores if you are only using 10. But you still need a minimum that matches your processing requirements…
Without source control for Azure Data Factory (ADF), you only have the option to publish your pipeline. And, it has to validate. Now with source control, we can save intermediate work, use branches, and publish when we are ready. The next step is CI/CD. Let’s look at how we can use Azure Pipelines to accomplish this.
For this demo, we have a simple pipeline that copies files in ADLS from one folder to another. We are also using…
With the high pace of development, data architects/engineers have to be able to keep up with our DevOps counterparts. We also want the many benefits of CI/CD for databases such as fewer errors, stable builds, better release management, and less manual work.
Let’s walk through using your database project in a GitHub repo to implement CI/CD. The four main areas we will cover are:
You may find it beneficial to read through and review the screenshots before following along and implementing the steps.
A perspective is a view of the model. It allows you to maintain your Shared Dataset (often called a Golden Dataset from this blog post) while presenting a smaller view to the consumers so they are not overwhelmed with tables, columns, and measures they don’t need.
It is not a security model. You will still need to maintain Row-Level Security in your model/dataset.
Shared datasets have many benefits, including one version of the model to reduce maintenance, accessible by reports across multiple workspaces, and reduced number of datasets to refresh. However, on the reporting side, the model can become much…
Data Platform Architect at Hitachi Solutions