
Data Factory

Under Review

Improve the Copy task on data pipelines


Bamakus Montreuil on 31 May 2024 09:20:38

The one thing we are still waiting for is improvement of the data pipeline COPY task.


1) Different modes for loading data into the destination

Types & use cases

--------------------

 [Overwrite] - fact and dimension SQL tables (Fabric DWH & LH)

 [Append] - fact tables, e.g. a stock snapshot table

 [Incremental append based on a date] - fact tables

 [Incremental append based on new values in Type 2 columns] - dimension tables

 [Merge based on primary keys, with Type 1 column updates] - dimension and fact tables

==> NATIVE, to AVOID CODING WORKAROUNDS like the sketch below
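
Today the last two modes typically require hand-written SQL around the Copy task. A minimal T-SQL sketch of that workaround, assuming hypothetical staging and destination tables (stg.DimCustomer / dbo.DimCustomer keyed on CustomerID, stg.FactSales / dbo.FactSales) and an engine that supports MERGE:

    -- Merge based on the primary key, with Type 1 (overwrite-in-place)
    -- updates of changed attributes; unknown keys are inserted.
    MERGE dbo.DimCustomer AS tgt
    USING stg.DimCustomer AS src
        ON tgt.CustomerID = src.CustomerID
    WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.City <> src.City) THEN
        UPDATE SET Name = src.Name,
                   City = src.City
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerID, Name, City)
        VALUES (src.CustomerID, src.Name, src.City);

    -- Incremental append based on a date is a similar hand-coded pattern:
    -- filter the source on the destination's watermark before appending.
    INSERT INTO dbo.FactSales (SaleID, SaleDate, Amount)
    SELECT SaleID, SaleDate, Amount
    FROM stg.FactSales
    WHERE SaleDate > (SELECT MAX(SaleDate) FROM dbo.FactSales);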


2) Recovery of an appended load when the last ETL/ELT job failed

 (e.g. a schema management check on the previous run ended "KO")

Can the Copy task automatically DELETE from the destination the rows appended by the failed run, so the re-run does not create duplicates? (Sketch of today's manual workaround below.)

==> NATIVE, to AVOID CODING WORKAROUNDS
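
A sketch of the manual recovery this would replace, under the assumed (hypothetical) convention that every appended row is stamped with the pipeline run id in a LoadID column:

    -- Roll back the rows written by the run whose check ended "KO",
    -- then re-run the Copy task to append the corrected rows.
    DECLARE @FailedLoadID varchar(36) = '00000000-0000-0000-0000-000000000000';  -- hypothetical run id

    DELETE FROM dbo.FactStock
    WHERE LoadID = @FailedLoadID;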


3) Schema Change Management (source vs. destination)

Apply the schema "CONTRACT" or not.


The settings would be:


- Schemaless (auto-update the destination table)

- Implicit schema: the current schema on the destination is used (with a column mapping)

- Managed schema columns "contract" (with a column mapping)

If we fill in a schema, can we set a default value to apply when the value is NULL? (Sketch below.)
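
As an illustration, the hand-coded equivalent today, assuming a hypothetical contract where Country must not remain NULL and falls back to 'Unknown':

    -- The mapped load applies the contract default to NULL source values.
    INSERT INTO dbo.DimCustomer (CustomerID, Name, Country)
    SELECT CustomerID,
           Name,
           ISNULL(Country, 'Unknown')  -- default applied when the source is NULL
    FROM stg.DimCustomer;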


4) In Data Warehouse T-SQL and in COPY task schema change management, offer a new column type "RECORD": a data type specifically designed to store nested structured data (JSON) in raw SQL tables (see the sketch after the list below).


This would provide:

- Consistency: future data is guaranteed to conform to the pre-defined structure, so a query that is valid today will not become invalid tomorrow.

- Performance: you can selectively scan child columns instead of scanning the whole parent RECORD column, which reduces query cost.

- Ease of use: you can query RECORD columns with dot notation (parent_col.child_col), which is quite convenient.
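
To illustrate the gap: today nested structures usually land in a plain string column and every query re-parses the JSON. A sketch assuming a hypothetical dbo.RawOrders table whose Payload column holds JSON text:

    -- Today: extract a nested field by parsing the JSON document.
    SELECT JSON_VALUE(Payload, '$.customer.city') AS city
    FROM dbo.RawOrders;

    -- With a native RECORD type (this idea), child fields would be real
    -- columns addressable with dot notation, e.g.:
    --   SELECT Payload.customer.city FROM dbo.RawOrders;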