Postgres update with join vs SQL server update with join - which is better?

602    Asked by DorineHankey in SQL Server , Asked on Oct 3, 2022

 I recently started converting a personal project from Microsoft SQL Server to PostgreSQL and I was surprised at the abysmal performance I encountered doing an UPDATE JOIN between two tables.

Suppose they look something like:

CREATE TABLE foo (
  id INTEGER NOT NULL PRIMARY KEY,
  bar INTEGER NULL
);
CREATE TABLE foo2 (
  id INTEGER NOT NULL PRIMARY KEY,
  bar INTEGER NULL
);

In T-SQL I would do an update using a join using something like this:

UPDATE foo

SET bar = t2.bar
FROM foo t1
JOIN foo2 t2
ON t1.id = t2.id;
But running in Postgres, the query is glacially slow.
If I change it to:

UPDATE foo

SET bar = t2.bar

FROM foo2 t2

WHERE foo.id = t2.id;

It's not a problem.

I get that the syntax is different but I would've expected the query optimizer to work out something in the same ballpark. Instead, things go bananas. Besides the syntactical differences, is there a nuanced difference between the two queries that I fail to see?

Explain plans

Update on foo  (cost=85852.43..6211995294.24 rows=338326628280 width=1027)
  ->  Nested Loop  (cost=85852.43..6211995294.24 rows=338326628280 width=1027)
        ->  Seq Scan on foo  (cost=0.00..145721.10 rows=582410 width=1010)
        ->  Materialise  (cost=85852.43..247935.91 rows=580908 width=17)
              ->  Hash Join  (cost=85852.43..241627.37 rows=580908 width=17)
                    Hash Cond: (t1.id = t2.id)
                    ->  Seq Scan on foo t1  (cost=0.00..145721.10 rows=582410 width=10)
                    ->  Hash  (cost=75754.08..75754.08 rows=580908 width=15)
                          ->  Seq Scan on foo2 t2  (cost=0.00..75754.08 rows=580908 width=15)
Update on foo (cost=87575.47..535974.25 rows=581621 width=1022)
  ->  Hash Join  (cost=87575.47..535974.25 rows=581621 width=1022)
        Hash Cond: (foo.id = t2.id)
        ->  Seq Scan on foo (cost=0.00..151301.17 rows=1140417 width=1011)
        ->  Hash  (cost=75761.21..75761.21 rows=581621 width=36)
              ->  Seq Scan on foo2 t2  (cost=0.00..75761.21 rows=581621 width=36)


zpostgres update with join vs SQL server update with join


UPDATE foo

SET bar = t2.bar

FROM foo t1

JOIN foo2 t2 ON t1.id = t2.id;

There is no join condition between foo and t1, the implicit CROSS JOIN forces a Cartesian product, i.e. O(N²) (!) update operations instead of just O(N). And the result is non-deterministic nonsense. The effect also becomes apparent in the query plan: rows=338326628280 instead of rows=581621 (Also: both plans were produced off slightly different tables, but that seems irrelevant to the question.)

Could be fixed by adding a join condition like:

UPDATE foo

SET bar = t2.bar

FROM foo t1

JOIN foo2 t2 ON t1.id = t2.id

WHERE foo.id = t1.id; -- !

Well, technically, a WHERE condition, but all the same.

But that's just putting lipstick on a pig. While id is the PK column of each table, that's just adding noise. Use the command you already found instead:

UPDATE foo

SET bar = t2.bar

FROM foo2 t2

WHERE foo.id = t2.id;

The manual advises for the FROM clause of UPDATE:

Do not repeat the target table as a from_item unless you intend a self-join (in which case it must appear with an alias in the from_item).

And:

When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the from_item list, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.

Such a self-join makes sense (or is even necessary!) if you need a LEFT [OUTER] JOIN to an additional table(s). Sadly, there is no provision in SQL to say "FROM LEFT" in an UPDATE. Example:

Nullify column in update if subquery returns empty



Your Answer

Interviews

Parent Categories