Mastering MERGE in Microsoft Fabric: Incorporate Deletes into your Incremental Loads with Spark SQL
- Matt Collins
- 8 min read
MERGE operations in SQL are highly useful for synchronising data between tables, with easy-to-read code that handles multiple DML operations in a single transaction.
They're brilliant for performing UPSERT (INSERT and UPDATE) logic based on your merge criteria, quickly identifying which rows have changed and which are new, but DELETEs can become costly in both compute and time if not handled correctly.
MERGE statements replace the need for the separate DELETE and UPSERT statements that were required before MERGE functionality was available.
This article is tailored towards Microsoft Fabric, where we need to work around a Spark SQL limitation, but the concepts apply to other platforms, such as Databricks.
Incremental Loading
Incremental (or delta, to your preference) loading techniques can be preferable to a full load: take a "slice" of the latest rows from the source and load only that subset. You can define "latest rows", and how wide the slice is, to suit your source data, for example (a sketch of one of these filters follows the table):
| Example No. | Condition | Condition Type | Use Case |
| --- | --- | --- | --- |
| 1 | Where Date > 1 day ago | Rolling date window | Useful when you know old records past a certain date are immutable and are not being removed. |
| 2 | Where DateModified > 1 day ago and IsDeleted = True | Rolling date window | When you have a modification date and the source uses "soft deletes". |
| 3 | Where DateModified > last load timestamp | Fixed (but dynamic) date window | Great for exceptionally large volumes of data, where you want to be as precise as possible about what you incrementally load. |
| 4 | Where SourceId > last loaded SourceId | ID high-watermark | For additive tables where the data of a single record never changes. This implies no deletes, and "IsLatestRecord" logic is applied downstream. |
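As a concrete illustration of option 3, here is a minimal PySpark sketch. The etl_load_log watermark table, its columns, and the DateModified column on the source are assumptions for demonstration only - substitute your own watermark storage and column names.
# Sketch of option 3: load only rows modified since the last successful load.
# etl_load_log, LoadTimestamp, TableName and DateModified are illustrative names.
last_load_ts = spark.sql("""
    SELECT MAX(LoadTimestamp) AS ts
    FROM etl_load_log
    WHERE TableName = 'Source'
""").collect()[0]["ts"]

# Restrict the source to the incremental slice and expose it for later SQL steps.
spark.sql(f"""
    SELECT *
    FROM Source
    WHERE DateModified > '{last_load_ts}'
""").createOrReplaceTempView("SourceIncrement")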
Handling DELETE operations
Handling DELETE operations is an essential part of maintaining tables. There are many valid reasons why data is removed from your source table and subsequently needs to be deleted from your customer- or system-facing target table.
A good old-fashioned TRUNCATE TABLE followed by INSERT INTO works well as a full-load operation, but it doesn't scale well. For example, when your source query joins several multi-million-row datasets to prepare the data, re-inserting millions of rows on every run can be costly and slow!
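For reference, the full-load pattern being described looks roughly like this - a sketch using the Source and Target tables from the example below, noting that Delta tables do support TRUNCATE in Spark SQL:
# Sketch of the full-load pattern: wipe the target, then reload everything.
# Simple and correct, but every run rewrites the entire table.
spark.sql("TRUNCATE TABLE Target")
spark.sql("""
    INSERT INTO Target
    SELECT SourceId, Date, Customer, Value
    FROM Source
""")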
The MERGE DELETE Challenge with Incremental loads
Consider the following Source Table:
| SourceId | Date | Customer | Value |
| --- | --- | --- | --- |
| 1 | 2025-09-01 | 1 | 100 |
| 2 | 2025-09-01 | 2 | 200 |
| 3 | 2025-09-02 | 1 | 500 |
| 4 | 2025-09-03 | 5 | 50 |
In this example, assume we've got an empty target table (perhaps we're loading the data for the first time).
A standard MERGE statement might look like this:
MERGE INTO Target AS target
USING Source AS source
ON target.SourceId = source.SourceId
WHEN MATCHED THEN
UPDATE SET
target.Date = source.Date
, target.Customer = source.Customer
, target.Value = source.Value
WHEN NOT MATCHED BY TARGET THEN
INSERT (
SourceId
, Date
, Customer
, Value
)
VALUES (
source.SourceId
, source.Date
, source.Customer
, source.Value
)
WHEN NOT MATCHED BY SOURCE THEN DELETE;
This will work great on the first run - all records are new, so we perform 4 inserts. No records need updating or deleting.
Re-running it straight afterwards will detect that the records in the source table already exist in the target table. We update the values based on the UPDATE branch of the statement; no new records are inserted and nothing is deleted.
The next day, some of the source data has changed. Being the forward-thinking data engineers that we are, we're using incremental loading to get data from the source, and we've set up an incremental query that grabs only the most recent data, based on the Date column. The slice is as follows:
| SourceId | Date | Customer | Value |
| --- | --- | --- | --- |
| 3 | 2025-09-02 | 1 | 2000 |
| 5 | 2025-09-03 | 2 | 1000 |
We want to perform the correct DML logic, which consists of the following operations:
Update the row for SourceId = 3. Value has changed from 500 -> 2000.
Insert the row for SourceId = 5. This doesn't exist in the target, so let's add it in.
SourceId = 4 has disappeared from the source. We need to remove this from our target table to reflect this.
Note: At this point we should ask ourselves a design question: is the delete in the source intended and necessary? Depending on your upstream processing, the system might hard delete the record, or it might soft delete it, preserving the row with an additional column that indicates it is no longer valid. Depending on your system and other factors, it may be preferable to rely on that soft-delete flag; in other cases it simply isn't available.
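If your source does expose a soft-delete flag, one way to honour it is with an extra conditional WHEN MATCHED clause. A minimal sketch, assuming a hypothetical IsDeleted column that isn't part of the example schema above:
# Sketch: react to a soft-delete flag from the source instead of relying
# solely on NOT MATCHED BY SOURCE. IsDeleted is a hypothetical column.
spark.sql("""
    MERGE INTO Target AS target
    USING Source AS source
    ON target.SourceId = source.SourceId
    WHEN MATCHED AND source.IsDeleted = true THEN
      DELETE
    WHEN MATCHED THEN
      UPDATE SET
        target.Date = source.Date
        , target.Customer = source.Customer
        , target.Value = source.Value
    WHEN NOT MATCHED BY TARGET AND source.IsDeleted = false THEN
      INSERT (SourceId, Date, Customer, Value)
      VALUES (source.SourceId, source.Date, source.Customer, source.Value)
""")

For the rest of this article we'll assume the source hard deletes rows, so the challenge below still stands.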
If we use the MERGE statement proposed above, we'll end up deleting every target record that doesn't appear in the incremental slice we're loading - including SourceId 1 and 2, which are still perfectly valid.
As such, we need to extend the DELETE condition.
Spark Struggles
The issue shown below is present in the current version of Spark in Microsoft Fabric, Runtime 1.3 (Spark 3.5, Delta 3.2). The equivalent Databricks runtime, 15.4 LTS, does not hit this problem despite using the same Spark and Delta versions.
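If you want to confirm which versions your own environment is running before relying on this behaviour, a quick sketch like the following can help (how the Delta Python package is surfaced may vary by runtime):
# Sketch: confirm the Spark and Delta versions attached to the notebook session.
print("Spark version:", spark.version)

try:
    from importlib.metadata import version
    print("delta-spark version:", version("delta-spark"))
except Exception:
    print("delta-spark package not found - check the runtime release notes instead")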
Let's look at the DELETE component individually to understand this. In this case, we want to do the following:
WHEN NOT MATCHED BY SOURCE AND target.Date >= '2025-09-03' THEN DELETE;
This works, but the code isn't dynamic - something that is essential for automated processing.
WHEN NOT MATCHED BY SOURCE AND target.Date >= current_date() - 1 THEN DELETE;
This will do the trick, but it still assumes a fixed date range, regardless of the dates actually present in the source slice. If we were to widen the date range used for the incremental load, we'd need to remember to change this value too.
We can run a lightweight query against the source to get the minimum date (or a date range, or more complex logic) to use as the basis for a context-aware delete:
SELECT MIN(Date) FROM Source;
Easy - we've now got a comparison value for our expression:
WHEN NOT MATCHED BY SOURCE AND target.Date >= (SELECT MIN(Date) FROM Source) THEN DELETE;
However, when we put this into practice we get an error, because the condition uses a subquery:
[DELTA_UNSUPPORTED_SUBQUERY] Subqueries are not supported in the DELETE (condition = (spark_catalog.lkh.Target.Date > scalarsubquery())).
The same error occurs when performing a standalone DELETE operation:
DELETE FROM Target WHERE Date >= (SELECT MIN(Date) FROM Source);
For context, T-SQL handles this kind of subquery gracefully, but the operation is not allowed here in Spark.
A few solutions
There are a few logical choices we can investigate at this point:
Inject the condition with PySpark
Create "full" query for source that contains all rows
Filtered Delete: Inject the code with PySpark
A fitting solution is to use a little Python to calculate and store the filter condition. If you're working in Spark notebooks, you can switch languages easily and still primarily work in SQL while wrapping a few statements in Python.
min_date = spark.sql("SELECT CAST(MIN(Date) AS STRING) FROM Source").collect()[0][0]
Using an f-string, we can then use min_date as part of our SQL statement:
merge_df = spark.sql(f"""
MERGE INTO Target AS target
USING Source AS source
ON target.SourceId = source.SourceId
WHEN MATCHED THEN
UPDATE SET
target.Date = source.Date
, target.Customer = source.Customer
, target.Value = source.Value
WHEN NOT MATCHED BY TARGET THEN
INSERT (
SourceId
, Date
, Customer
, Value
)
VALUES (
source.SourceId
, source.Date
, source.Customer
, source.Value
)
WHEN NOT MATCHED BY SOURCE AND target.Date >= '{min_date}' THEN DELETE;
""")By setting the execution as a variable, you also preserve the high-level merge operation statistics for validation/logging as you see fit:
display(merge_df)
You can also create more complex DELETE logic with this approach. If you're updating a fact table, you might rely on surrogate keys for your DELETE logic and need to supply a specific list of values to delete:
delete_values = '100, 101, 102, 104'
merge_df = spark.sql(f"""
...
WHEN NOT MATCHED BY SOURCE AND target.Id IN ({delete_values}) THEN DELETE;
""")
This approach gives you a great deal of flexibility when working in a metadata-driven environment, where you can pass your dataset details to a notebook and populate the merge criteria dynamically at runtime. In pseudocode, this might look as follows:
merge_columns = ['SourceId']
columns = ['SourceId', 'Date', 'Value']  # Values for demonstration - pulled from metadata rather than specified manually.
source = 'Source'
target = 'Target'
join_conditions = ' AND '.join([f'target.{col} = source.{col}' for col in merge_columns])
update_columns = [col for col in columns if col not in merge_columns]
update_clause = ', '.join([f'target.{col} = source.{col}' for col in update_columns])
insert_clause = ', '.join(columns)
insert_values = ', '.join([f'source.{col}' for col in columns])
if incremental_clause != '':  # incremental_clause is also driven by your metadata
    dynamic_delete_logic = '<calculate delete logic as required>'
    delete_clause = f'WHEN NOT MATCHED BY SOURCE AND {dynamic_delete_logic} THEN DELETE'
else:
    delete_clause = ''
merge_df = spark.sql(f"""
MERGE INTO {target} AS target
USING {source} AS source
ON
{join_conditions}
WHEN MATCHED THEN
UPDATE SET
{update_clause}
WHEN NOT MATCHED BY TARGET THEN
INSERT (
{insert_clause}
)
VALUES (
{insert_values}
)
{delete_clause};
""")Full Row Delete: Create a "full" row query as your source
An alternative approach is to build a snapshot of the full dataset at its latest version. Create a view that takes the data from your incremental load and combines it with the rows in the target table that fall outside the incremental load criteria - the date range, in our case.
CREATE OR REPLACE TEMP VIEW vwSource AS
SELECT
SourceId
, Date
, Customer
, Value
FROM Source
UNION
SELECT
SourceId
, Date
, Customer
, Value
FROM Target
WHERE Date NOT IN (SELECT DISTINCT Date FROM Source);
We can now use our original merge criteria, as anything that exists in the target but not in this combined source can safely be removed.
MERGE INTO Target AS target
USING vwSource AS source
ON target.SourceId = source.SourceId
WHEN MATCHED THEN
UPDATE SET
target.Date = source.Date
, target.Customer = source.Customer
, target.Value = source.Value
WHEN NOT MATCHED BY TARGET THEN
INSERT (
SourceId
, Date
, Customer
, Value
)
VALUES (
source.SourceId
, source.Date
, source.Customer
, source.Value
)
WHEN NOT MATCHED BY SOURCE THEN DELETE;
Performance Benchmarking
Comparing the two approaches, I've used the "Filtered Example" and "Full Row Example" sections of code found in my example notebook.
The aim is to demonstrate how each behaves at scale. Under the hood, we used an F16 SKU Fabric capacity with the following Spark pool details:

I recorded the results in a CSV file and have visualised them in Power BI below:

We can see that the Filtered Delete example is the preferred method for encoding incremental MERGE logic. This feels intuitive - we're handling a smaller number of rows, so the operation has a smaller computational footprint.
To dig into this slightly, the Full Row Delete example performs a merge across all SourceId records in the source and target tables. As such, we effectively update every row in the target table, plus the deletions (1 row in our example) and the inserts (1 row in our example). As the table grows, so does the number of updates, consuming more time.
The Filtered Delete example, by contrast, only matches on a subset of SourceIds, so the number of updates performed corresponds to that subset. Our test logic updates a random number of rows behind the scenes, and the rows affected equal those updates plus the single delete and single insert. Keeping the number of affected records down keeps the computational effort of the merge low and results in a quicker operation.
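If you want to verify row counts like these against your own tables, the Delta transaction log records them for every merge. A sketch of pulling them back out - note that the exact operationMetrics key names can vary between Delta versions:
# Sketch: inspect the row counts recorded for the most recent operation on the target.
latest = spark.sql("DESCRIBE HISTORY Target LIMIT 1").collect()[0]
metrics = latest["operationMetrics"]

for key in ("numTargetRowsInserted", "numTargetRowsUpdated", "numTargetRowsDeleted"):
    print(key, metrics.get(key, "n/a"))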
Conclusion
Including DELETE operations in your MERGE statements is a powerful tool, letting you fully update your data within a single operation. We've seen the limitations of the Spark SQL flavour (compared to T-SQL) and how to overcome them.
By trying a couple of approaches, we've demonstrated the performance bottlenecks and given the motivation behind a specific approach that solves both the functional and performance-related challenges.
For those wanting to venture beyond a familiar SQL-based interface, there is also the option of implementing this logic purely in PySpark or Scala.
Have you tried any other methods of handling incremental deletes in Spark SQL? Reach out in the comments to let us know your preferences and what you've tried for this data engineering challenge.

