Connecting Microsoft Business Central to an Azure Data Lake - Part 4
- Jesper Theil Hansen
- Apr 6
- 3 min read
After running the data lake solution for some time, it became apparent that both execution times and costs on the lake side were slowly increasing.
Note that this walkthrough is for the Synapse-based data lake flow. Archiving is less important when using Fabric with delta-backed data, since the notebook-based sync is more efficient and doesn't copy and consolidate all existing data every time.
The default sync pipeline and dataflow in BC2ADLS work like this:
BC exports new data to the /delta folder
The pipeline / dataflow copies the delta files to /staging
The pipeline / dataflow copies all existing current data to /staging
The deltas are merged with / added to the existing data in /staging (a conceptual sketch of this merge follows the list)
The resulting new dataset is copied back from /staging to /data
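Conceptually, the merge step behaves like the T-SQL sketch below: take the union of the existing data and the new deltas, and keep only the newest version of each record. This is only an illustration of the idea; the real work happens in the Synapse mapping dataflow, and the staging table name, key columns and ordering column used here are assumptions based on the bc2adls column-naming style.

```sql
-- Conceptual sketch only: the actual merge runs in the Synapse mapping dataflow.
-- #staging is assumed to hold the existing data UNION ALL the new delta files;
-- the column names follow the bc2adls "<field>-<field no.>" style but are assumptions.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY [$Company], [systemId-2000000000]   -- one group per record
               ORDER BY [timestamp-0] DESC                      -- newest version first
           ) AS rn
    FROM #staging
)
SELECT *
FROM ranked
WHERE rn = 1;   -- keep only the latest version of each record (handling of deletes omitted)
```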
This flow should make it obvious why sync times and costs keep increasing: as more and more data gets added, every new sync copies and merges against the complete dataset, so as long as we don't archive anything on the BC side, we will be dealing with more and more data.
Now let's revisit how data is stored in the lake. The final synced data is in the /data folder in the storage account, under the chosen root folder:
/data/GLEntry-17
And when you expose the data as SQL external tables, they are typically mapped to the folder (the default mapping), and many readers will aggregate all parquet files in the folder and its subfolders automatically.
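For example, a Synapse serverless SQL query over the folder with a recursive wildcard picks up every parquet file below it. The storage account and container names below are placeholders for your own environment.

```sql
-- Read every parquet file under the GLEntry-17 folder, including subfolders.
-- Replace <storageaccount> and <container> with your own values.
SELECT COUNT(*) AS [TotalRecords]
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/data/GLEntry-17/**',
        FORMAT = 'PARQUET'
     ) AS gl;
```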
So to archive data we can split the files into two subfolders:
/data/GLEntry-17
    /data
    /archive
The archiving is done with a dataflow called SplitArchiveData and a pipeline called ArchiveData that prepares the data and runs the flow. It also uses an integration dataset called data_dataset_split. The arguments to run it look like this:

The two important parameters that are new compared to the existing pipelines are:
DateSplitColumn: The name of the date field that determines whether a record should be archived.
ArchiveDate: The cutoff that decides which records remain in /data and which are archived. Any record with a date before this date is moved to the archive.
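Before running the pipeline, it can be useful to preview how many records a given cutoff would move. The sketch below assumes the GLEntry posting date is exported as a column named Posting_Date-4 (check the column name in your own lake) and uses placeholder storage account and container names.

```sql
-- Rough preview of how many G/L entries fall before a hypothetical
-- ArchiveDate of 2023-01-01. Column and path names are assumptions.
SELECT COUNT(*) AS [WouldBeArchived]
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/data/GLEntry-17/**',
        FORMAT = 'PARQUET'
     ) AS gl
WHERE gl.[Posting_Date-4] < '2023-01-01';
```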
So I ran this on my test system, which had 3299 records in GLEntry:

After running the archive pipeline and dataflow, the folders in the data lake now look like this:

The split folder is the working folder for the Archive dataset, so it is empty after archiving is done; you could delete it if you prefer. To check the results, I query GLEntry in the lake again:

Still 3299 records when looking at the original GLEntry folder, so anybody reading recursively from here still sees all the data. But now you can also look at just the current or the archived portion of the data:

Just 101 records in /data

And the remaining 3198 are in /archive
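If you prefer to check the split from SQL instead of the portal, two targeted serverless queries do the job. As before, the storage account and container names are placeholders.

```sql
-- Count the live records under /data.
SELECT COUNT(*) AS [LiveRecords]
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/data/GLEntry-17/data/*.parquet',
        FORMAT = 'PARQUET'
     ) AS live;

-- Count the archived records under /archive.
SELECT COUNT(*) AS [ArchivedRecords]
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/data/GLEntry-17/archive/*.parquet',
        FORMAT = 'PARQUET'
     ) AS archived;
```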
There is one final thing we need to fix: the Consolidation dataflow ALSO reads recursively, so if we don't change anything, it will just read both /data and /archive, consolidate everything and write it back to the /GLEntry-17 folder, and we have achieved nothing. The dataflow uses the location information from the manifest file, so I made one small change to that. The rootLocation is written like this:
DataPartitionPattern.Add('rootLocation', Folder + '/' + EntityName);
I changed that to:
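// Only the 'data' folder gets an extra '/data' suffix on rootLocation, so the
// consolidation dataflow reads just the live subfolder and skips /archive.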
if (Folder = 'data') then
    DataPartitionPattern.Add('rootLocation', Folder + '/' + EntityName + '/data')
else
    DataPartitionPattern.Add('rootLocation', Folder + '/' + EntityName);
Now, what should you archive and when? It depends on how you use the system, of course, but the way we do it is to archive all the transaction tables/ledgers after the final close of the fiscal year. That is the point when you shouldn't see any changes in the data anymore. Tables that rarely change don't need to be archived, since the delta check means the consolidation only runs when there are changes.
If you want the files, they are in the Archive branch of my fork of the BC2ADLS project on GitHub: https://github.com/jespertheil/bc2adls/tree/Archive
| File | Folder | Description |
| --- | --- | --- |
| businessCentral\app\src | | The AL code change |
| data_dataset_parquet.json | synapse\dataset | Changed dataset that points to live data for the consolidation pipelines |
| SplitArchiveData.json | synapse\dataflow | New dataflow that does the actual split |
| ArchiveData.json | synapse\pipeline | New pipeline that prepares the data to split and calls the split dataflow |
| data_dataset_split.json | synapse\dataset | New dataset that points to the split folder |