Connecting Microsoft Business Central to an Azure Data Lake - Part 4
- Jesper Theil Hansen
- Apr 6
- 3 min read
After running the data lake solution for some time, it became apparent that both execution times and costs on the lake side were slowly increasing.
Note that this walkthrough is for the Synapse-based data lake flow. Archiving is less important when using Fabric with delta-backed data, since the notebook-based sync is more efficient and doesn't copy and consolidate all existing data every time.
The default sync pipeline and dataflow in BC2ADLS work like this:
BC exports new data to the /delta folder
The pipeline / dataflow copies the delta files to /staging
The pipeline / dataflow copies all existing current data to /staging
The deltas are merged with / added to the existing data in /staging (a conceptual sketch of this merge follows the list)
The resulting new dataset is copied back from /staging to /data
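Conceptually, the merge step behaves like the T-SQL sketch below: take the union of the existing data and the new deltas, and keep only the newest version of each record. This is only an illustration of the idea; the real work happens in the Synapse mapping dataflow, and the staging table name, key columns and ordering column used here are assumptions based on the bc2adls column-naming style.

```sql
-- Conceptual sketch only: the actual merge runs in the Synapse mapping dataflow.
-- #staging is assumed to hold the existing data UNION ALL the new delta files;
-- the column names follow the bc2adls "<field>-<field no.>" style but are assumptions.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY [$Company], [systemId-2000000000]   -- one group per record
               ORDER BY [timestamp-0] DESC                      -- newest version first
           ) AS rn
    FROM #staging
)
SELECT *
FROM ranked
WHERE rn = 1;   -- keep only the latest version of each record (handling of deletes omitted)
```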
This flow should make it obvious why sync times and costs keep increasing: as more and more data gets added, every new sync copies and merges against the complete dataset, so as long as we don't archive anything on the BC side, we will be dealing with more and more data.
Now let's revisit how data is stored in the lake. The final synced data is in the /data folder in the storage account, under the chosen root folder:
/data/GLEntry-17
And when you expose the data as SQL external tables, they are typically mapped to the folder (the default mapping), and many readers will aggregate all parquet files in the folder and its subfolders automatically.
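For example, a Synapse serverless SQL query over the folder with a recursive wildcard picks up every parquet file below it. The storage account and container names below are placeholders for your own environment.

```sql
-- Read every parquet file under the GLEntry-17 folder, including subfolders.
-- Replace <storageaccount> and <container> with your own values.
SELECT COUNT(*) AS [TotalRecords]
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/data/GLEntry-17/**',
        FORMAT = 'PARQUET'
     ) AS gl;
```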
So to archive data we can split the files into two subfolders:
/data/GLEntry-17
    /data
    /archive
The archiving is done with a dataflow called SplitArchiveData and a pipeline called ArchiveData that prepares the data and runs the flow. It also uses an integration dataset called data_dataset_split. The arguments to run it look like this:

The two important parameters that are new compared to the existing pipelines are:
DateSplitColumn: The name of the date field that determines whether a record should be archived.
ArchiveDate: The cutoff that decides which records remain in /data and which are archived. Any record with a date before this date is moved to the archive.
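Before running the pipeline, it can be useful to preview how many records a given cutoff would move. The sketch below assumes the GLEntry posting date is exported as a column named Posting_Date-4 (check the column name in your own lake) and uses placeholder storage account and container names.

```sql
-- Rough preview of how many G/L entries fall before a hypothetical
-- ArchiveDate of 2023-01-01. Column and path names are assumptions.
SELECT COUNT(*) AS [WouldBeArchived]
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/data/GLEntry-17/**',
        FORMAT = 'PARQUET'
     ) AS gl
WHERE gl.[Posting_Date-4] < '2023-01-01';
```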
So I ran this on my test system, which had 3299 records in GLEntry:

After running the archive pipeline and dataflow, the folders in the data lake now look like this:

The split folder is the working folder for the Archive dataset, so it is empty after archiving is done; you could delete it if you prefer. To check the results, I query GLEntry in the lake again:

Still 3299 records when looking at the original GLEntry folder, so anybody reading recursively from here still sees all the data. But now you can also look at just the current or the archived portion of the data:

Just 101 records in /data

And the remaining 3198 are in /archive
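If you prefer to check the split from SQL instead of the portal, two targeted serverless queries do the job. As before, the storage account and container names are placeholders.

```sql
-- Count the live records under /data.
SELECT COUNT(*) AS [LiveRecords]
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/data/GLEntry-17/data/*.parquet',
        FORMAT = 'PARQUET'
     ) AS live;

-- Count the archived records under /archive.
SELECT COUNT(*) AS [ArchivedRecords]
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/data/GLEntry-17/archive/*.parquet',
        FORMAT = 'PARQUET'
     ) AS archived;
```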
There is one final thing we need to fix: the Consolidation dataflow ALSO reads recursively, so if we don't change anything, it will just read both /data and /archive, consolidate everything and write it back to the /GLEntry-17 folder, and we have achieved nothing. The dataflow uses the location information from the manifest file, so I made one small change to that. The rootLocation is written like this:
DataPartitionPattern.Add('rootLocation', Folder + '/' + EntityName);
I changed that to:
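// Only the 'data' folder gets an extra '/data' suffix on rootLocation, so the
// consolidation dataflow reads just the live subfolder and skips /archive.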
if (Folder = 'data') then
    DataPartitionPattern.Add('rootLocation', Folder + '/' + EntityName + '/data')
else
    DataPartitionPattern.Add('rootLocation', Folder + '/' + EntityName);
Now, what should you archive and when? It depends on how you use the system, of course, but the way we do it is to archive all the transaction tables/ledgers after the final close of the fiscal year. That is the point when you shouldn't see any changes in the data anymore. Tables that rarely change don't need to be archived, since the delta check means the consolidation only runs when there are changes.
If you want the files, they are in the Archive branch of my fork of the BC2ADLS project on GitHub: https://github.com/jespertheil/bc2adls/tree/Archive
| File | Folder | Description |
| --- | --- | --- |
| businessCentral\app\src | | The AL code change |
| data_dataset_parquet.json | synapse\dataset | Changed dataset that points to live data for the consolidation pipelines |
| SplitArchiveData.json | synapse\dataflow | New dataflow that does the actual split |
| ArchiveData.json | synapse\pipeline | New pipeline that prepares the data to split and calls the split dataflow |
| data_dataset_split.json | synapse\dataset | New dataset that points to the split folder |