This blog post demonstrates how to effectively use incremental refresh with binary files to drastically improve the refresh performance of your Power BI datasets and dataflows.
Incremental refresh is widely understood and employed for structured, relational data sources like SQL Database and Azure Synapse.
It is not as commonly known that incremental refresh can also be harnessed for other data sources such as binary files.
The key to success is to ensure the binary files are saved in a storage location such as SharePoint or Azure Blob Storage and to leverage the metadata attached to the binary files to refresh only what you need.
At the time of writing Microsoft Fabric is in preview – which includes Dataflows Gen2. I will publish another update on how to adapt the methods described in this blog once the new features stabilise and mature enough to reach general availability.
Metadata
Metadata is information about information. When binary files are uploaded to a storage location (e.g., SharePoint or Azure Blob Storage), information is generated about the file such as the name, size, author, date created etc.
This metadata is stored with the file and includes datetimes that can be used to leverage incremental refresh. Sometimes the inherent datetimes (e.g., created or last modified) are appropriate to use. Otherwise, the file name can be transformed to synthesize a datetime. The reason why is shared a bit later on.
Incremental refresh can only be applied if the metadata is used and not the contents of the binary file. This enables Power BI to choose which files to refresh based on the metadata datetimes that correspond to the incremental refresh policy.
Contoso DB
For this blog a set of CSV files are exported from the sample Contoso database provided by Microsoft.
The contents of each file represent one year of data based on Order date from the Sales table. There are eleven files with ten million rows of data spanning from 2010 to 2020.
These CSV files are stored on SharePoint online. Azure Blob Storage would work too.
Add metadata in Power Query
Now it is time to show how to add the metadata in Power Query. The query consists of three parts:
Connect to the storage location
Load the metadata and construct the datetimes
Extract and transform the contents of the binary files as normal
The metadata datetimes are added before the binary file content is extracted by Power Query – so as early as possible in the list of applied transformation steps. Otherwise, the effectiveness of incremental refresh is lost.
The metadata datetimes also persist all the way through to the last transformation step or else they are not picked up as an available datetime column when configuring the incremental refresh policy.
1. Connect to the storage location
In this example when connecting to a SharePoint folder, Power Query shows you the metadata of the files stored in the folder.
The SharePoint.Contents connector returns a table containing a row for each folder and document found at the specified SharePoint site. Each row contains properties of the folder or file and a link to its content.
After navigating to the required folder, a table is presented with the metadata of each file that is contained within that folder:
There is a [Content] column containing the binary data for each file. This is left untouched at this point to preserve the efficiency of our transformation steps.
2. Load the metadata and construct the datetimes
The goal here is to have a datetime column that allocates each file to a specific partition, and never needs to switch to another partition in subsequent refreshes. The data type must be Date/Time to support incremental refresh.
There are three candidate columns in this example:
Column | Data type | Robustness | When to use |
Name | Text | High |
|
Date created | Date/Time/Zone | Medium |
|
Date modified | Date/Time/Zone | Low |
|
All the CSV files in this example were created within 5 minutes of each other. Therefore the [Date created] is not appropriate to define the partitions because all the files would fall into the same partition – defeating the point of incremental refresh.
Instead, a custom column is added based on the Name of the file i.e., the year. The formula below grabs the year from the [Name] text string, then converts it to a number and inserts this into a datetime construction.
3. Extract and transform the contents of the binary files as normal
Now the binary contents can be extracted and transformed as normal.
In this example, the Csv.Document function returns the binary contents of the CSV document as a table. The headers are also promoted in the same step, as well as defining optional parameters to specify the delimiter, number of columns and quote style.
The only columns needed are the [Date] column created from Step 2, and the CSV data stored as a table. The [Data] column is expanded to extract the CSV data.
Remember to keep the [Date] column created in Step 2 all the way through to the last step as it is needed to define the partitions for incremental refresh.
If you have complex transformations and have a Premium license, then I would recommend reading my blog here to achieve even more performance gains using the enhanced compute engine in the world of dataflows.
Apply incremental refresh
Now incremental refresh can be configured as normal. This blog does not go into detail in this aspect but below are the relevant Microsoft documentation links depending on if you are using Power Query Desktop or Power Query online (the method works for both):
In this example, incremental refresh is configured to store the past 15 years of data and refresh the past 5 years.
Detect data changes allows Power Query to go back and check over all historical partitions, and only refresh those partitions with new data.
In this example it is turned off because there are a set number of static CSV files. If there are regular new files dropping to your storage location or if there are modifications to existing files, then this could be turned on to improve performance even further.
Results
The first run establishes a baseline which takes 94 seconds. It processes only one partition because incremental refresh is turned off. This repeats for every refresh – processing all the data every time.
The refresh duration plummets to 25 seconds for the initial incremental refresh. This is because each file is allocated to one of the yearly partitions and processes in parallel.
This drops further to 16 seconds for subsequent incremental refreshes. This is because only 5 partitions are processed per the incremental refresh policy.
In this example there is one file per partition. This isn’t strictly necessary, and it is often the case that there are multiple files per partition.
Finally, if the metadata is not used with incremental refresh, then there is a significant risk of increasing the refresh duration and/or placing unnecessary load on the service capacity.
Summary
In summary, incremental refresh can lead to significant performance gains for binary files in a storage location if the metadata is used to configure the incremental refresh policy.
Take care to follow the three steps in this blog otherwise the refresh duration could worsen significantly and/or needlessly consume capacity on the Power BI service.
Thank you for reading this blog post – I look forward to hearing how you get on.
Comments