
Data Factory

Under Review

Improve the COPY task on data pipelines


Bamakus Montreuil on 28 May 2024 07:39:44

MICROSOFT FABRIC and Data Factory INGESTION Pipeline


💡 The "pillar" task is COPY, in my opinion


🚀 "Enable staging" to Azure Storage is a fantastic option to reduce durations

<== OK, present


❓ Ingestion options: we need to be sure they are present

==> NATIVE, to AVOID CODING WORKAROUNDS

Full source rows - [Overwrite]

or Full source rows - [Append]

or Incremental (based on a date: only source values greater than the existing maximum) - [Fact table Append]

or Incremental based on new column values (Type 2) on the source, with several columns treated as (primary) keys - [Dimension Append with version start/end datetime & surrogate key]

or Merge rows on several columns treated as (primary) keys, with new Type 1 column values - [Merge]
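
For instance, the "Incremental based on a date" mode above has to be reproduced today as a coding workaround along these lines. A minimal T-SQL sketch; dbo.orders, stg_orders and modified_at are invented names:

-- 1) Find the highest date already ingested on the destination.
DECLARE @watermark datetime2 =
    (SELECT ISNULL(MAX(modified_at), '1900-01-01') FROM dbo.orders);

-- 2) Append only the source rows strictly newer than the watermark;
--    in a pipeline, @watermark would feed the COPY source query instead.
INSERT INTO dbo.orders (order_id, amount, modified_at)
SELECT order_id, amount, modified_at
FROM stg_orders
WHERE modified_at > @watermark;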


Types & use cases

---------------------

[Overwrite] - Fact and dimension silver/raw tables

[Append] - Facts such as a stock table snapshot

[Incremental Append based on a date] - Facts

[Incremental based on new Type 2 column values] - Dimensions

[Merge] - Dimension and fact tables

<== Not sure all these options are present, any feedback please?
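
As an illustration of the [Merge] row above, here is the kind of T-SQL the workaround requires today (all names are invented; note that MERGE itself is not available on every engine — to my knowledge the Fabric Warehouse did not support it at the time of this post, which is exactly why a native option would help):

MERGE dbo.dim_customer AS tgt
USING stg_customer AS src
    ON tgt.customer_id = src.customer_id          -- "several columns as keys"
WHEN MATCHED AND (tgt.email <> src.email OR tgt.city <> src.city)
    THEN UPDATE SET tgt.email = src.email,        -- Type 1: overwrite in place
                    tgt.city  = src.city
WHEN NOT MATCHED BY TARGET
    THEN INSERT (customer_id, email, city)
         VALUES (src.customer_id, src.email, src.city);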


❓ Around APPEND: how to deal with recovery from a previous ingestion incident (e.g., a broken schema)

Can COPY automatically DELETE on the destination the rows left by the failed run, before appending the new ones?

==> NATIVE, to AVOID CODING WORKAROUNDS

<== Not sure this option is present, any feedback please?
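
One way this is worked around today is to tag every appended row with the pipeline run id, so a retry can first remove whatever the failed run left behind. A minimal T-SQL sketch; fact_sales, stg_sales, load_run_id and the run ids are invented names:

DECLARE @failed_run_id  varchar(50) = 'run-0041';  -- the run that crashed mid-append
DECLARE @current_run_id varchar(50) = 'run-0042';  -- the retry

DELETE FROM dbo.fact_sales
WHERE load_run_id = @failed_run_id;                -- undo the partial append

INSERT INTO dbo.fact_sales (sale_id, amount, load_run_id)
SELECT sale_id, amount, @current_run_id
FROM stg_sales;                                    -- then re-append cleanly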


❓ Schema change management (source vs. destination): THE CONTRACT, or not

- Setups ==>

- Columns mapping

- Schemaless (automatically update the destination table)

or the implicit schema is the current one on the destination

or manage a schema columns "contract"

<== Not sure all these options are present, any feedback please?

If we fill in a schema, can we set a default value when the source value is NULL?

If we fill in a schema, how do we deal with JSON arrays?
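
On the default-if-NULL question, this is the kind of load-time "contract" one writes by hand today (ISNULL and TRY_CAST are standard T-SQL; the tables, columns and default values are invented for the example):

INSERT INTO dbo.customers (customer_id, country, signup_date)
SELECT
    customer_id,
    ISNULL(country, 'UNKNOWN'),                          -- default value if NULL
    ISNULL(TRY_CAST(signup_date AS date), '1900-01-01')  -- enforce the contracted type
FROM stg_customers;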


❓ Dealing with nested fields: "the schema"

For JSON source formats, when arrays are present

==> Is there something in COPY to unnest / flatten a JSON array (in several steps), if we want to perform this? <== Before the schema CHECK, of course

<== Not sure all these options are present, any feedback please?
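
For reference, here is a sketch of the unnest/flatten step that COPY would ideally handle natively, expressed with OPENJSON (the table, payload column and JSON paths are invented). Each element of the "$.orders" array becomes one relational row:

SELECT
    doc.customer_id,
    j.order_id,
    j.amount
FROM raw_customers AS doc
CROSS APPLY OPENJSON(doc.payload, '$.orders')
     WITH (order_id int           '$.id',
           amount   decimal(18,2) '$.amount') AS j;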


❓ Dealing with nested fields: "the RAW/Silver storage"

==> Problem: the Fabric Lakehouse and Warehouse do not accept RECORD/ARRAY SQL column types for nested fields (BigQuery does)

<== JSON arrays still end up as NVARCHAR on the Fabric DWH & LH


<== STRONG EXPECTATIONS: a RECORD SQL type on Fabric, and provide

Unnest / Flatten SQL / T-SQL functions


The RECORD SQL type to be manageable in the COPY mapping & SCHEMA contract
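
To make the expectation concrete — purely illustrative, NOT existing T-SQL — a guess at what a RECORD column plus a flatten function could look like (BigQuery offers the equivalent with STRUCT/ARRAY and UNNEST):

CREATE TABLE dbo.raw_customers (
    customer_id int,
    orders      RECORD               -- hypothetical nested type; today: NVARCHAR(MAX)
);

SELECT c.customer_id, o.order_id, o.amount
FROM dbo.raw_customers AS c
CROSS APPLY FLATTEN(c.orders) AS o;  -- hypothetical flatten function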


Administrator on 02 Aug 2024 01:53:10

Under review

Comments (1)

Nicolas Saar on 16 Oct 2024 15:57:02

RE: Improve the COPY task on data pipelines

It would be great if those improvements were available for both lakehouse & warehouse destinations. Currently, in the copy activity with a lakehouse destination I can at least choose between "append" and "overwrite", whereas with a warehouse destination I do not have this option and need to use a pre-copy script, for example to truncate the destination table.
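
For example, the pre-copy workaround I mean for a Warehouse destination is just the following, run before the copy to emulate "overwrite" (table name invented):

-- Empty the target so the subsequent append behaves like an overwrite.
TRUNCATE TABLE dbo.sales_target;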