
Data Factory

Under Review

Improve the Copy task on data pipelines


Bamakus Montreuil on 31 May 2024 09:20:38

The one thing we are still waiting for is improvement of the data pipeline COPY task.


1) Different modes for loading data into the destination

Types & use cases

--------------------

 [Overwrite] - fact and dimension SQL tables (Fabric DWH & LH)

 [Append] - fact tables, e.g. a stock snapshot table

 [Incremental append based on a date] - fact tables

 [Incremental append based on new values in Type 2 columns] - dimension tables

 [Merge based on primary keys, with Type 1 column updates] - dimension and fact tables

==> NATIVE, to AVOID CODING WORKAROUNDS like the sketch below
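
Today the last two modes typically require hand-written SQL around the Copy task. A minimal T-SQL sketch of that workaround, assuming hypothetical staging and destination tables (stg.DimCustomer / dbo.DimCustomer keyed on CustomerID, stg.FactSales / dbo.FactSales) and an engine that supports MERGE:

    -- Merge based on the primary key, with Type 1 (overwrite-in-place)
    -- updates of changed attributes; unknown keys are inserted.
    MERGE dbo.DimCustomer AS tgt
    USING stg.DimCustomer AS src
        ON tgt.CustomerID = src.CustomerID
    WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.City <> src.City) THEN
        UPDATE SET Name = src.Name,
                   City = src.City
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerID, Name, City)
        VALUES (src.CustomerID, src.Name, src.City);

    -- Incremental append based on a date is a similar hand-coded pattern:
    -- filter the source on the destination's watermark before appending.
    INSERT INTO dbo.FactSales (SaleID, SaleDate, Amount)
    SELECT SaleID, SaleDate, Amount
    FROM stg.FactSales
    WHERE SaleDate > (SELECT MAX(SaleDate) FROM dbo.FactSales);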


2) Recovery of an appended load when the last ETL/ELT job failed

 (e.g. a schema management check on the previous run ended "KO")

Can the Copy task automatically DELETE from the destination the rows appended by the failed run, so the re-run does not create duplicates? (Sketch of today's manual workaround below.)

==> NATIVE, to AVOID CODING WORKAROUNDS
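
A sketch of the manual recovery this would replace, under the assumed (hypothetical) convention that every appended row is stamped with the pipeline run id in a LoadID column:

    -- Roll back the rows written by the run whose check ended "KO",
    -- then re-run the Copy task to append the corrected rows.
    DECLARE @FailedLoadID varchar(36) = '00000000-0000-0000-0000-000000000000';  -- hypothetical run id

    DELETE FROM dbo.FactStock
    WHERE LoadID = @FailedLoadID;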


3) Schema Change Management (source vs. destination)

Apply the schema "CONTRACT" or not.


The settings would be:


- Schemaless (auto-update the destination table)

- Implicit schema: the current schema on the destination is used (with a column mapping)

- Managed schema columns "contract" (with a column mapping)

If we fill in a schema, can we set a default value to apply when the value is NULL? (Sketch below.)
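
As an illustration, the hand-coded equivalent today, assuming a hypothetical contract where Country must not remain NULL and falls back to 'Unknown':

    -- The mapped load applies the contract default to NULL source values.
    INSERT INTO dbo.DimCustomer (CustomerID, Name, Country)
    SELECT CustomerID,
           Name,
           ISNULL(Country, 'Unknown')  -- default applied when the source is NULL
    FROM stg.DimCustomer;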


4) In Data Warehouse T-SQL and in COPY task schema change management, offer a new column type "RECORD": a data type specifically designed to store nested structured data (JSON) in raw SQL tables (see the sketch after the list below).


This would provide:

- Consistency: future data is guaranteed to conform to the pre-defined structure, so a query that is valid today will not become invalid tomorrow.

- Performance: you can selectively scan child columns instead of scanning the whole parent RECORD column, which reduces query cost.

- Ease of use: you can query RECORD columns with dot notation (parent_col.child_col), which is quite convenient.
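
To illustrate the gap: today nested structures usually land in a plain string column and every query re-parses the JSON. A sketch assuming a hypothetical dbo.RawOrders table whose Payload column holds JSON text:

    -- Today: extract a nested field by parsing the JSON document.
    SELECT JSON_VALUE(Payload, '$.customer.city') AS city
    FROM dbo.RawOrders;

    -- With a native RECORD type (this idea), child fields would be real
    -- columns addressable with dot notation, e.g.:
    --   SELECT Payload.customer.city FROM dbo.RawOrders;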