Alan Artenstein on 04 Nov 2024 13:56:28
I think anyone who works with large datasets knows by now how frustrating it can be to do a full refresh of the semantic model. In my scenario, I keep 2 years' worth of data on a table with a 3-day lookback for incremental refresh. A full refresh will always time out (hit the 5-hour max refresh limit) with this amount of data. I finally figured out the best way to refresh, minus one problem that ruins it all:
1) Publish the empty dataset to the service with a parameter that limits the dynamic source query to a small number of rows (e.g. 1,000), so only 1k rows load per partition. (The partition source expression in the first sketch after this list shows the shape of the query.)
2) Run the refresh through the Power BI service UI so it performs its initial full refresh, as it always does with a newly uploaded dataset. The incremental refresh table and all other tables get populated. With this small amount of data in the incremental table, the partitions are created and the refresh does not time out.
3) Using the XMLA endpoint, generate new partitions the way incremental refresh would (by quarter/by month, and by day within the most recent month), and delete the old partitions they replace (first sketch after this list). Since the full refresh has already been done in step 2, Power BI should know not to try it all over again after I refresh the partitions.
4) Again via XMLA, refresh each partition separately (second sketch below). This does not time out, as each partition takes well under the 5-hour limit (combined they are well over it).
5) Once all partitions are refreshed, go to the Power BI service and hit the refresh button, which simulates what happens on the refresh schedule. Since I need the refresh schedule to work, this is a necessary step. At this point I query SQL Server to see which partitions are being populated, and sure enough, it's populating only the partitions that match my incremental refresh lookback... yay!
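For step 3, here's a minimal TMSL sketch of the partition creation, run against the XMLA endpoint from SSMS or Tabular Editor. The dataset/table/partition names, server, source table, date column, and boundary dates are all hypothetical stand-ins for my real ones, and RowLimit is the step-1 parameter; the M expression mirrors the filter shape the refresh policy itself would generate for one quarter:

```
{
  "createOrReplace": {
    "object": {
      "database": "MyDataset",
      "table": "FactSales",
      "partition": "FactSales-2024Q3"
    },
    "partition": {
      "name": "FactSales-2024Q3",
      "mode": "import",
      "source": {
        "type": "m",
        "expression": [
          "let",
          "    Source = Sql.Database(\"myserver\", \"mydb\"),",
          "    Fact = Source{[Schema = \"dbo\", Item = \"FactSales\"]}[Data],",
          "    // Same slice the refresh policy would generate for this quarter",
          "    Sliced = Table.SelectRows(Fact, each [OrderDate] >= #datetime(2024, 7, 1, 0, 0, 0) and [OrderDate] < #datetime(2024, 10, 1, 0, 0, 0)),",
          "    // RowLimit is the step-1 parameter: tiny for the initial publish,",
          "    // raised to cover everything before the real refreshes in step 4",
          "    Limited = Table.FirstN(Sliced, RowLimit)",
          "in",
          "    Limited"
        ]
      }
    }
  }
}
```

Deleting a superseded partition is the same pattern with a `delete` command naming the database/table/partition object.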
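And for step 4, refreshing one partition at a time (same hypothetical names). Each command is executed separately, so no single operation comes anywhere near the 5-hour limit:

```
{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "MyDataset",
        "table": "FactSales",
        "partition": "FactSales-2024Q3"
      }
    ]
  }
}
```

You can list several partitions under `objects` to batch them, but one per command makes it obvious which slice is the slow one.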
But boo... as soon as that's done, it requeries the source from 2022 all the way to the beginning of last month (10/1/2024), ostensibly to merge the partitions. This will no doubt time out due to the amount of data I have in that timeframe, which is why I refreshed by partition in the first place. But since I know the data is correct, I don't need it to go back to source. I want it to merge the partitions using the data already in the service (or even skip the merge and leave them as-is). I should be able to pass in a refresh parameter, via the UI, that overrides the trip back to source, telling Power BI to perform the merge with the data it already has. I understand Power BI does this for consistency, but in this situation I am 100% sure the partitioned data is correct, and I should be able to override the behavior so we don't go back to source to get it. And the 5-hour limit should not apply to behind-the-scenes merging in this case.
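For what it's worth, refreshes initiated over XMLA (as opposed to the UI button) do seem to have a knob in this direction: as I read the docs, setting applyRefreshPolicy to false on a TMSL refresh tells the engine to leave the existing partition layout alone instead of rolling partitions up per the policy. A sketch with the same hypothetical names as above; what I'm asking for is essentially this same override exposed to the UI/scheduled refresh:

```
{
  "refresh": {
    "type": "full",
    "applyRefreshPolicy": false,
    "objects": [
      {
        "database": "MyDataset",
        "table": "FactSales"
      }
    ]
  }
}
```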