Explain postgres update select with subquery referencing the main table?

Asked by EllaClarkson in SQL Server, asked on Mar 7, 2023

I'm trying to understand how to filter my subquery in context to the main query. Ultimately I'm trying to get the MAX value from the latest record BEFORE the date of the current record. Here is where I'm at.

UPDATE Opphistory t
SET    MaxStageSortOrder = sub.max_snapshotdate
FROM  (
   SELECT opportunityid, max(snapshotdate) AS max_snapshotdate
   FROM   Opphistory
   WHERE forecastcategory <> 'Omitted' and snapshotdate <= t.snapshotdate
   GROUP  BY 1
   ) sub
WHERE t.opportunityid = sub.opportunityid
It is the snapshotdate <= t.snapshotdate that appears to fail.

Answered by Dipika Agarwal

Seems like you were aiming for a correlated subquery:

UPDATE opphistory t
SET    MaxStageSortOrder = (
   SELECT max(snapshotdate)
   FROM   Opphistory t1
   WHERE  t1.opportunityid = t.opportunityid
   AND    t1.snapshotdate < t.snapshotdate
   AND    t1.forecastcategory <> 'Omitted'
   );

A derived table in the FROM clause of an UPDATE cannot reference columns of the main table. That's possible in a correlated subquery or a LATERAL subquery. But, unfortunately, table expressions in the FROM clause of an UPDATE are (at least up to pg 14) always joined with a CROSS JOIN (effectively) to the main table. You can work around this limitation by repeating the main table in the FROM clause, binding that one-to-one to the main table, and then joining to the "proxy" with any join type. In the case at hand, we don't even need that:

UPDATE Opphistory t
SET    MaxStageSortOrder = sub.max_snapshotdate
FROM  (
   SELECT opphistory_id
        , max(snapshotdate) OVER (PARTITION BY opportunityid ORDER BY snapshotdate
                                  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS max_snapshotdate
   FROM   Opphistory
   WHERE  forecastcategory <> 'Omitted'
   ) sub
WHERE  t.opphistory_id = sub.opphistory_id
AND    t.MaxStageSortOrder IS DISTINCT FROM sub.max_snapshotdate;

opphistory_id is supposed to be the PRIMARY KEY of the table.

I expect the second query to be substantially faster for big tables, as it (probably) only needs a single sequential scan and a sort to compute the running max for all rows. Running a correlated subquery for every row (like in the first query) has its price.
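
For completeness, the workaround mentioned earlier (repeating the main table in the FROM clause, binding that copy one-to-one to the target, and joining a LATERAL subquery to the copy) might look roughly like the sketch below. The aliases p and h are only illustrative, and opphistory_id is again assumed to be the primary key.

```sql
UPDATE opphistory t
SET    maxstagesortorder = sub.max_snapshotdate
FROM   opphistory p                        -- "proxy": a second reference to the target table
LEFT   JOIN LATERAL (
   SELECT max(h.snapshotdate) AS max_snapshotdate
   FROM   opphistory h
   WHERE  h.opportunityid = p.opportunityid
   AND    h.snapshotdate  < p.snapshotdate   -- the subquery may reference p, but not t
   AND    h.forecastcategory <> 'Omitted'
   ) sub ON true
WHERE  t.opphistory_id = p.opphistory_id;  -- bind the proxy one-to-one to the target row
```

Like the correlated subquery, this form evaluates the inner aggregate once per row, so it is shown to illustrate the technique rather than as a faster alternative.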

About ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING:

Reduce results into accumulated groups
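
To see what that frame does, here is a small self-contained sketch on made-up dates (the VALUES list and the column name d are purely illustrative). The frame covers every earlier row of the partition but excludes the current one, so the first row has an empty frame and max() returns NULL for it.

```sql
SELECT d
     , max(d) OVER (ORDER BY d
                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS max_before
FROM  (VALUES (date '2023-01-01'), (date '2023-01-05'), (date '2023-01-09')) v(d)
ORDER  BY d;
-- d          | max_before
-- 2023-01-01 | NULL         (empty frame: there is no preceding row)
-- 2023-01-05 | 2023-01-01
-- 2023-01-09 | 2023-01-05
```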

Note subtle differences:

The first query updates every row, no matter what.

The second query omits ...

... rows with forecastcategory = 'Omitted' and rows with forecastcategory IS NULL

(The first query only excludes those from the max computation.)

... rows with opportunityid IS NULL

... rows that would not change - due to the added last line.
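
The NULL-related differences come from SQL's three-valued logic. A minimal illustration of the two comparison operators involved:

```sql
SELECT NULL::text <> 'Omitted'               AS plain_inequality  -- NULL, so a WHERE clause rejects the row
     , NULL::text IS DISTINCT FROM 'Omitted' AS null_safe;        -- true
```

So WHERE forecastcategory <> 'Omitted' silently drops rows where forecastcategory IS NULL. If such rows should take part, writing the filter as forecastcategory IS DISTINCT FROM 'Omitted' would be one option. Whether that matches the intent is an assumption about your data.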

If (opportunityid, snapshotdate) is not defined UNIQUE NOT NULL, we may have to do more, starting with a definition of how to deal with duplicates and NULL values.
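
A quick way to find out whether that matters for your data is to look for duplicates and NULL values first. This is only a sketch, using the column names from the question:

```sql
-- Any (opportunityid, snapshotdate) pair that occurs more than once
SELECT opportunityid, snapshotdate, count(*) AS copies
FROM   opphistory
GROUP  BY opportunityid, snapshotdate
HAVING count(*) > 1;

-- Rows where either column is NULL
SELECT count(*) AS rows_with_nulls
FROM   opphistory
WHERE  opportunityid IS NULL
   OR  snapshotdate  IS NULL;
```

If the first query returns rows, the order among rows with the same snapshotdate is arbitrary, so a ROWS frame can produce different results from one run to the next unless a tiebreaker is added to ORDER BY.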

It's not clear from your question what exactly you want.

If you really just want the snapshot date from the "previous" row, consider the simpler window function lag():

UPDATE Opphistory t
SET    MaxStageSortOrder = sub.max_snapshotdate
FROM  (
   SELECT opphistory_id
        , lag(snapshotdate) OVER (PARTITION BY opportunityid ORDER BY snapshotdate) AS max_snapshotdate
   FROM   Opphistory
   WHERE  forecastcategory <> 'Omitted'
   ) sub
WHERE  t.opphistory_id = sub.opphistory_id
AND    t.MaxStageSortOrder IS DISTINCT FROM sub.max_snapshotdate;
