Bamakus Montreuil on 31 May 2024 09:20:38
We are only waiting for the following improvements to the data pipeline COPY task:
1) Different modes to load data into the destination
Types & use cases
--------------------
[Overwrite] - Fact and dimension SQL tables (Fabric DWH & LH)
[Append] - Fact, e.g. a stock table snapshot
[Incremental append based on a date] - Fact
[Incremental append based on new values in Type 2 columns] - Dimension
[Merge based on primary keys with Type 1 column updates] - Dimension and fact tables (a hand-coded upsert is sketched after this list)
==> NATIVE to AVOID CODING WORKAROUNDS
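For context, this is the kind of workaround we currently have to hand-code for the [Merge] mode: a minimal T-SQL upsert sketch, assuming a staging table feeds a Type 1 dimension; every table and column name here (dim_customer, stg_customer, customer_id, email, city) is hypothetical:

    -- Type 1 update: overwrite in place the rows whose primary key already exists
    UPDATE tgt
    SET    tgt.email = src.email,
           tgt.city  = src.city
    FROM   dbo.dim_customer AS tgt
    JOIN   dbo.stg_customer AS src
           ON tgt.customer_id = src.customer_id;

    -- Insert the rows whose primary key is new to the destination
    INSERT INTO dbo.dim_customer (customer_id, email, city)
    SELECT src.customer_id, src.email, src.city
    FROM   dbo.stg_customer AS src
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.dim_customer AS tgt
                       WHERE  tgt.customer_id = src.customer_id);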
2) Recover appended rows when the last ETL/ELT job failed
(e.g. a schema management check ended "KO" on a previous run)
Can COPY automatically DELETE from the destination all the rows newly appended by that failed run?
==> NATIVE to AVOID CODING WORKAROUNDS
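Today this rollback has to be coded by hand. A minimal sketch, assuming every COPY run stamps the rows it appends with a load_id column (load_id, fact_stock and the id value are hypothetical):

    -- Id of the run that ended "KO" (hypothetical value)
    DECLARE @failed_load_id INT = 42;

    -- Delete every row that the failed run appended to the destination
    DELETE FROM dbo.fact_stock
    WHERE  load_id = @failed_load_id;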
3) Schema Change Management (Source VS Destination)
Apply THE "CONTRACT" or NOT
Settings would be ==>
- Schemaless (auto-update the destination table)
- or Implicit schema: the current one on the destination (& a columns mapping)
- or Manage a schema columns "contract" (& a columns mapping)
If we fill in a schema, can we set a default value to use when the source value is NULL?
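A hand-coded version of that "default value if NULL" behaviour, as a minimal T-SQL sketch: the defaults are enforced at load time with COALESCE (stg_orders, raw_orders and the default values are hypothetical):

    -- Replace NULLs from the source with the contract's default values
    INSERT INTO dbo.raw_orders (order_id, country, amount)
    SELECT order_id,
           COALESCE(country, 'XX'),   -- contract default when the source sends NULL
           COALESCE(amount, 0)
    FROM   dbo.stg_orders;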
4) Offer, in the Data Warehouse T-SQL surface and in the COPY task schema change management, a new column type "RECORD": this data type is specifically designed to store nested structured data (JSON) in SQL RAW tables.
It would provide:
- Consistency: future data is guaranteed to conform to the pre-defined structure, so you do not have to worry that a query which is valid today becomes invalid tomorrow.
- Performance: you can selectively scan child columns instead of scanning the whole parent RECORD column, which reduces query cost.
- Ease of use: you can use dot notation (parent_col.child_col) to query RECORD columns, which is quite convenient.
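For comparison, the current T-SQL workaround stores the nested payload as a JSON string and extracts child values with JSON_VALUE, which makes the engine read the whole string for every row; a minimal sketch, with raw_orders and payload as hypothetical names:

    -- Extract one child value from a JSON string column
    SELECT order_id,
           JSON_VALUE(payload, '$.customer.city') AS customer_city
    FROM   dbo.raw_orders;

    -- With a native RECORD column, the same read would be the dot notation
    -- described above (payload.customer.city) and only that child column
    -- would be scanned.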