• Jesse Arora

Back fill Data using Azure Data factory V2

One of the common ETL scenario is to back fill historical data into Datawarehouse, If your source is relational database then you can change from and to date parameter for select statement but if you need to load from flat files you will have to use some kind of iterative operation in traditional ETL tools like SSIS.


With Azure Data Factory V2 you can utilize true potential of parallelism by building daily pipeline and then use tumbling window trigger along with pipeline parameters and trigger variables to create separate run for each day you need to load the data.


Below are step by step instructions:


Drag and drop copy Data activity then click parameters to create two new parameters FromDate and ToDate.


Click on the Copy Data Activity and click the source tab, Click + New to create new dataset and chose your source server , database and table, In this example I am connecting to AdventureWorksDW Sample database in Azure SQL DB and using dbo.FactInternetSales Table.


Once you Create source connection put below SQL in the query, trick here is to use pipeline().parameters variable to pass from and to date parameters back to sql or stored procedure. pipeline().parameters.FromDate , pipeline().parameters.ToDate. You can also use other functions and expressions to manipulate parameter value, in this case I want It to go back 1 day so we can load yesterday’s data every day.


SELECT * FROM [dbo].[FactInternetSales]

where OrderDate > = '@{formatDateTime(adddays(pipeline().parameters.FromDate,-1), 'yyyy-MM-dd' )}'

and OrderDate < '@{formatDateTime(pipeline().parameters.ToDate, 'yyyy-MM-dd' )}'


Click sink and then + New to create connection to destination server and database, In the destination db I have table with same structure called dbo.FactInternetSales_Destination.

Enter the below SQL in Pre-copy Script, this SQL deletes the data from destination for the day it is inserting so there are no duplicates. Notice here to use same parameters i.e. pipeline().parameters.FromDate, pipeline().parameters.ToDate


DELETE FROM [dbo].[FactInternetSales_Destination]

where OrderDate

> = '@{formatDateTime(adddays(pipeline().parameters.FromDate,-1), 'yyyy-MM-dd' )}'

and OrderDate < '@{formatDateTime(pipeline().parameters.ToDate, 'yyyy-MM-dd' )}'


I am moving data from one SQL table to another to keep demo simple, but you can add transformations using other activities just make sure to use same parameters in the transformations.


To test if data refresh is working you can click trigger button and say trigger now to run the pipeline manually by entering from and to date.



If everything looks good and you want to create scheduled trigger to run daily or hourly depending on your refresh needs. Schedule trigger currently does not create runs for past dates although it allows you to choose from date.

To create pipeline runs for prior dates you will need to create tumbling window trigger.

Enter start date 1/1/2011 and end date 1/20/2011 and change recurrence to hourly and enter 24 as value.



Hit Next , enter trigger variable @trigger().outputs.windowStartTime in From Date & ToDate you can also use @trigger().outputs.windowEndTime depending on how you have SQL written, Click Finish. Click Publish ALL to publish your changes and then go to Monitor to view the pipeline runs.


You should see pipeline runs created for each day but keep in mind Run Start Date will be today's date not date you are back filling data for. You will have to click on Parameters icon to see what date values are passed to pipeline activity from trigger, as you can see it passed 1/1/11 for first run which is correct. You can now check your destination table should have data loaded from 1/1/2011 to 1/29/2011.



1,370 views0 comments