Bamakus Montreuil on 28 May 2024 07:39:44
𝐌𝐈𝐂𝐑𝐎𝐒𝐎𝐅𝐓 𝐅𝐀𝐁𝐑𝐈𝐂 𝐚𝐧𝐝 𝐃𝐚𝐭𝐚 𝐅𝐚𝐜𝐭𝐨𝐫𝐲 𝐈𝐍𝐆𝐄𝐒𝐓𝐈𝐎𝐍 𝐏𝐢𝐩𝐞𝐥𝐢𝐧𝐞
💡The "pillar" task is 𝐂𝐎𝐏𝐘, in my opinion
🚀"Enable staging" to Azure Storage is a fantastic option for reducing copy durations
<== 𝐎𝐊 𝐩𝐫𝐞𝐬𝐞𝐧𝐭
❓Ingestion options: we need to be sure they are present
==> 𝐍𝐀𝐓𝐈𝐕𝐄, 𝐭𝐨 𝐀𝐕𝐎𝐈𝐃 𝐂𝐎𝐃𝐈𝐍𝐆 𝐖𝐎𝐑𝐊𝐀𝐑𝐎𝐔𝐍𝐃𝐒
Full source rows - [Overwrite]
or Full source rows - [Append]
or Incremental based on a date (new source values greater than the existing maximum) - [Fact table Append]
or Incremental based on new (Type 2) column values on source, using several columns as (primary) keys - [Dimension Append with version start/end datetime & surrogate key]
or Merge rows using several columns as (primary) keys & new Type 1 column values - [Merge]
Types & use cases
---------------------
[Overwrite] - Fact and dimension silver/raw tables
[Append] - Fact, e.g. a stock snapshot table
[Incremental Append based on a date] - Fact
[Incremental based on new Type 2 column values] - Dimension
[Merge] - Dimension and fact tables
<== 𝐍𝐨𝐭 𝐬𝐮𝐫𝐞 𝐚𝐥𝐥 𝐭𝐡𝐞𝐬𝐞 𝐨𝐩𝐭𝐢𝐨𝐧𝐬 𝐚𝐫𝐞 𝐩𝐫𝐞𝐬𝐞𝐧𝐭, 𝐚𝐧𝐲 𝐟𝐞𝐞𝐝𝐛𝐚𝐜𝐤 𝐩𝐥𝐞𝐚𝐬𝐞? (a sketch of today's coding workaround follows)
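For reference, here is a minimal sketch of the notebook glue these native modes would replace today, written as PySpark/Delta against a Fabric Lakehouse; the table and column names (silver_sales, staging_sales, order_date, silver_customer, customer_id) are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# [Incremental Append based on a date]: keep only source rows newer than
# the destination's current maximum date.
max_dt = (spark.read.table("silver_sales")
          .agg(F.max("order_date").alias("m")).first()["m"])
src = spark.read.table("staging_sales")
if max_dt is not None:  # empty destination -> take all rows
    src = src.where(F.col("order_date") > F.lit(max_dt))
src.write.format("delta").mode("append").saveAsTable("silver_sales")

# [Merge] / Type 1: upsert on the business key, overwriting changed columns.
dim = DeltaTable.forName(spark, "silver_customer")
(dim.alias("t")
 .merge(spark.read.table("staging_customer").alias("s"),
        "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```

Native COPY support for these modes would remove exactly this kind of per-table notebook code.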
❓Around APPEND, and how to deal with recovery from a previous ingestion incident (e.g. schema KO)
Can COPY automatically DELETE, on the destination, the rows appended by the failed run?
==> 𝐍𝐀𝐓𝐈𝐕𝐄 𝐭𝐨 𝐀𝐕𝐎𝐈𝐃 𝐂𝐎𝐃𝐈𝐍𝐆 𝐖𝐎𝐑𝐊𝐀𝐑𝐎𝐔𝐍𝐃𝐒
<== 𝐍𝐨𝐭 𝐬𝐮𝐫𝐞 𝐭𝐡𝐢𝐬 𝐨𝐩𝐭𝐢𝐨𝐧 𝐢𝐬 𝐩𝐫𝐞𝐬𝐞𝐧𝐭, 𝐚𝐧𝐲 𝐟𝐞𝐞𝐝𝐛𝐚𝐜𝐤 𝐩𝐥𝐞𝐚𝐬𝐞?
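Until something native exists, the rollback has to be scripted. A minimal sketch, assuming each COPY run stamps the appended rows with a hypothetical batch_id audit column (e.g. the pipeline RunId):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# batch_id is an assumed audit column written by every COPY run; recovering
# from a failed APPEND then reduces to one DELETE on the Delta table
# before re-running the ingestion.
failed_run_id = "run-0042"  # supplied by the pipeline, hypothetical value
spark.sql(f"DELETE FROM silver_sales WHERE batch_id = '{failed_run_id}'")
```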
❓Schema change management (source vs destination): THE CONTRACT or NOT
- Setups ==>
- Columns mapping
- Schemaless (auto-update the destination table)
or Implicit: the schema is the current one on the destination
or Manage a schema columns "contract"
<== 𝐍𝐨𝐭 𝐬𝐮𝐫𝐞 𝐚𝐥𝐥 𝐭𝐡𝐞𝐬𝐞 𝐨𝐩𝐭𝐢𝐨𝐧𝐬 𝐚𝐫𝐞 𝐩𝐫𝐞𝐬𝐞𝐧𝐭, 𝐚𝐧𝐲 𝐟𝐞𝐞𝐝𝐛𝐚𝐜𝐤 𝐩𝐥𝐞𝐚𝐬𝐞?
If we fill in a schema, can we set a default value when the source value is NULL?
If we fill in a schema, how do we deal with JSON arrays?
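For comparison, here is what a scripted schema "contract" with NULL defaults can look like in PySpark today; the contract columns, the UNKNOWN default, and the landing path are all hypothetical:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical contract: expected columns, types, and per-column NULL defaults.
contract = StructType([
    StructField("customer_id", IntegerType(), nullable=False),
    StructField("country", StringType(), nullable=True),
])
defaults = {"country": "UNKNOWN"}

# Enforce the contract at read time, then substitute defaults for NULLs.
df = spark.read.schema(contract).json("Files/landing/customers/*.json")
for col_name, default in defaults.items():
    df = df.withColumn(col_name, F.coalesce(F.col(col_name), F.lit(default)))
```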
❓Dealing with nested fields: "the schema"
For JSON-format sources where arrays are present
==> Is there something in COPY to unnest/flatten a JSON array (in several steps), if we want to perform this? <== before the schema CHECK, of course
<== 𝐍𝐨𝐭 𝐬𝐮𝐫𝐞 𝐚𝐥𝐥 𝐭𝐡𝐞𝐬𝐞 𝐨𝐩𝐭𝐢𝐨𝐧𝐬 𝐚𝐫𝐞 𝐩𝐫𝐞𝐬𝐞𝐧𝐭, 𝐚𝐧𝐲 𝐟𝐞𝐞𝐝𝐛𝐚𝐜𝐤 𝐩𝐥𝐞𝐚𝐬𝐞?
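Today this flatten has to happen in a notebook after the copy. A minimal PySpark sketch of the multi-step unnest being asked for, with a hypothetical orders document holding a lines array:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source: one order per JSON document, each with an array of lines.
orders = spark.read.json("Files/landing/orders/*.json")

# Step 1: explode the array into one row per element;
# step 2: promote the struct fields to top-level columns (ready for a schema check).
flat = (orders
        .select("order_id", F.explode("lines").alias("line"))
        .select("order_id", "line.product_id", "line.qty"))
```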
❓Dealing with nested fields: "the RAW/Silver storage"
==> Problem: Fabric Lakehouse and Warehouse do not accept a RECORD/ARRAY SQL column type for nested fields (BigQuery does)
<== JSON arrays still land as NVARCHAR on Fabric DWH & LH
<== 𝐒𝐓𝐑𝐎𝐍𝐆 𝐄𝐗𝐏𝐄𝐂𝐓𝐀𝐓𝐈𝐎𝐍𝐒: a RECORD SQL type on Fabric, plus:
- Unnest/Flatten SQL / T-SQL functions
- the RECORD SQL type managed in the COPY mapping & the SCHEMA contract
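Until a RECORD type exists, the Lakehouse-side workaround is to keep the nested field as a string column and re-declare its shape at read time; a sketch under that assumption (silver_orders and lines_json are hypothetical names):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# The nested shape has to be re-declared by hand because the stored column
# is only a JSON string (NVARCHAR on the SQL side), not a RECORD/ARRAY type.
line_schema = ArrayType(StructType([
    StructField("product_id", StringType()),
    StructField("qty", IntegerType()),
]))

raw = spark.read.table("silver_orders")  # 'lines_json' is a plain string column
parsed = raw.withColumn("lines", F.from_json("lines_json", line_schema))
flat = parsed.select("order_id", F.explode("lines").alias("line"))
```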
Administrator on 02 Aug 2024 01:53:10
under reviewing
- Comments (1)
RE: Improve the COPY task on data pipelines
Would be great if those improvements were available for both lakehouse & warehouse destinations... Currently, in the copy activity with a lakehouse destination I can at least choose between "append" and "overwrite", whereas with a warehouse destination I do not have this option and need to use a pre-copy script to truncate the destination table, for example.