Under the hood with Azure Synapse Analytics

By April 23, 2021 No Comments

On Dec. 4, 2020, Microsoft announced the general availability of several features in Azure Synapse Analytics that had been in preview for some time. Our team dug into the features to learn more about what they offer so we could share our thoughts on why they’re important and provide helpful usage tips – here’s what we found. Note one important highlight: Azure Synapse Analytics is now a platform that is affordable for any budget ($5/TB of processed data).

NEW – Synapse Serverless SQL Pools

An exciting feature that many may be unfamiliar with is Synapse Serverless SQL Pools. Not to be confused with Azure SQL Serverless, Synapse Serverless SQL Pools is a distributed query engine that allows you to execute T-SQL statements against a wide variety of data in place.

Synapse Dedicated SQL Pools is the new name for Azure SQL Data Warehouse, which has been around for several years. It is a cloud data warehouse platform with dedicated compute and its own partitioned storage system. This is a great platform if you have huge volumes of data, high analytic performance needs, and fairly deep pockets. Synapse Serverless SQL is an autoscaling shared compute service that can query data in Parquet, JSON, or CSV in Azure Data Lake, Spark tables, or Azure Cosmos via Azure Synapse Link for Cosmos DB.

Synapse Link for Azure SQL

Word on the street is that a Synapse Link for Azure SQL is coming soon, and others will likely follow. The link is actually a near real-time replication of the data from the source data store to a columnar one in Synapse, so it’s isolated from the actual transactional store, making it fast.

Source: Microsoft

Azure Synapse Studio

The new interface/development environment, Azure Synapse Studio, is user-friendly enough for the data analyst and powerful enough for the data scientist. If you’ve used Azure Data Factory before, the interface will look familiar — the batch data ingestion piece of Azure Synapse Studio is the Azure Data Factory service. Here are a few tips before you get started:

  1. Convert your data files to Parquet. You can create queries in Synapse SQL Serverless that point to CSV and JSON files as well, but Parquet is a compressed, columnar format with metadata that allows more flexibility to only read the columns you need rather than all of them.
  1. Create views using OPENROWSET as a base for other users to utilize in Azure Synapse Studio or Power BI. The views will hide the details of where the data is located, if you are querying a single file or entire directory, and the authentication method. Synapse Studio has some ideal features for generating an OPENROWSET query with just a right click on a linked service (defined file in ADLS). To control the authentication method, limit columns, and possibly aggregate data, views will simplify things for the typical user:
  • Views utilizing OPENROWSET are the simpler, faster, and more functional way of accessing data currently.
  • Microsoft recommends using EXTERNAL TABLE to reference external data, but they take more upfront work and are currently a little less performant. External tables do allow you to implement row-level security on data lake files via database user accounts, and the fact that Microsoft recommends them leads me to believe there is more to come in terms of performance, etc.
  • If you start with views, you can create external tables later that refer to those views or utilize the same OPENROWSET command.
  1. Partition your files by date or some other mechanism that enables parallel processing to improve performance and partition filtering to avoid reading data that is not needed.

Single data platform for ad-hoc analysis, reporting, and advanced analytics

A key evolution that is occurring with Azure Synapse Analytics, and more generally with the utilization of data lakes, is the singular data platform for ad-hoc analysis, reporting, and advanced analytics. In Azure Synapse, the same files in ADLS that analysts query with T-SQL or Power BI are also utilized by data scientists in machine learning models using Synapse Spark via Azure Notebooks or Azure ML. The output of those machine learning models can be Parquet files or Spark tables that the analyst can query in the same T-SQL interface of Synapse SQL Serverless or Power BI.

Inviso recently finished up the first phase of a project for Microsoft to add Azure Synapse SQL Serverless to their analytics platform so that marketing analysts could more easily collaborate with data scientists directly on the inputs and outputs of machine learning models without waiting for the data to be added to a Power BI or Azure Analysis Services model. In their case, the data scientists preferred to work in Azure Databricks because it is a bit more mature than Synapse Spark. But, because the core data is Parquet files residing in ADLS, and because Databricks is based on Spark, the two can work together seamlessly.

More mature lakehouse architectures utilize Delta to provide a transaction log for providing rollback in case of failed loads, avoid dirty reads by isolating transactions, and other ACID functionality that we have come to expect from traditional databases. Synapse SQL Serverless Pools does not currently support Delta, but it is coming soon. The primary issue currently between Databricks and Synapse is the version of Spark and Delta supported: Synapse supports Spark 2.4 while Databricks supports up to 3.1, and there were significant enhancements in the 3.0 release. However, Databricks has shown its commitment to compatibility with Synapse by providing extended support for Spark 2.4 and we expect Synapse to pick up the pace of supporting newer releases of Spark and Delta soon. For now, keep this in mind when planning your architecture and implementation method.

While Synapse SQL Serverless Pools and dedicated pools are considered two compute flavors of the same general product, much like Azure SQL Database Serverless compute tier, the architectures are quite different. With dedicated pools, you must consider each table in a traditional data warehouse star schema model and how to best distribute that data across the 60 nodes. With Serverless Pools, you are more concerned with how to get the best performance out of querying flat files while maintaining security at the required granularity level. Converting to Parquet, consolidating the number of partitions of a file, and managing failures (via Delta, etc.) are key concerns.

These new Azure Synapse Analytics capabilities have a lot to offer. If you need support, Inviso can help and work with you to determine the best solution based on your needs and guide you through implementation and maintenance.

Inviso Team

Author Inviso Team

More posts by Inviso Team