What is a scalable way to simulate SQL server HASHBYTES using a SQL CLR scalar function?
As part of our ETL process, we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.
The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use HASHBYTES with the SHA2_256 algorithm and have found that it does not scale on large servers if many concurrent worker threads are all calling HASHBYTES.
Throughput measured in hashes per second does not increase past 16 concurrent threads when testing on a 96 core server. I test by changing the number of concurrent MAXDOP 8 queries from 1 - 12. Testing with MAXDOP 1 showed the same scalability bottleneck.
As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements:
The function must be able to participate in parallel queries
The function must be deterministic
The function must take an input of an NVARCHAR or VARBINARY string (all relevant columns are concatenated together
The typical input size of the string will be 100 - 20000 characters in length. 20000 is not a max
The chance of a hash collision should be roughly equal to or better than the MD5 algorithm. CHECKSUM does not work for us because there are too many collisions.
The function must scale well on large servers (throughput per thread should not significantly decrease as the number of threads increases)
For Application Reasons™, assume that I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns (there are other problems as well that I don't want to get into).
What is a scalable way to simulate HASHBYTES using a SQL CLR function? My goal can be expressed as getting as many hashes per second as I can on a large server, so performance matters as well. I am terrible with CLR so I don't know how to accomplish this. If it motivates anyone to answer, I plan on adding a bounty to this question as soon as I am able. Below is an example query which very roughly illustrates the use case:
DROP TABLE IF EXISTS #CHANGED_IDS; SELECT stg.ID INTO #CHANGED_IDS FROM ( SELECT ID, CAST( HASHBYTES ('SHA2_256', CAST(FK1 AS NVARCHAR(19)) + CAST(FK2 AS NVARCHAR(19)) + CAST(FK3 AS NVARCHAR(19)) + CAST(FK4 AS NVARCHAR(19)) + CAST(FK5 AS NVARCHAR(19)) + CAST(FK6 AS NVARCHAR(19)) + CAST(FK7 AS NVARCHAR(19)) + CAST(FK8 AS NVARCHAR(19)) + CAST(FK9 AS NVARCHAR(19)) + CAST(FK10 AS NVARCHAR(19)) + CAST(FK11 AS NVARCHAR(19)) + CAST(FK12 AS NVARCHAR(19)) + CAST(FK13 AS NVARCHAR(19)) + CAST(FK14 AS NVARCHAR(19)) + CAST(FK15 AS NVARCHAR(19)) + CAST(STR1 AS NVARCHAR(500)) + CAST(STR2 AS NVARCHAR(500)) + CAST(STR3 AS NVARCHAR(500)) + CAST(STR4 AS NVARCHAR(500)) + CAST(STR5 AS NVARCHAR(500)) + CAST(COMP1 AS NVARCHAR(1)) + CAST(COMP2 AS NVARCHAR(1)) + CAST(COMP3 AS NVARCHAR(1)) + CAST(COMP4 AS NVARCHAR(1)) + CAST(COMP5 AS NVARCHAR(1))) AS BINARY(32)) HASH1 FROM HB_TBL WITH (TABLOCK) ) stg INNER JOIN ( SELECT ID, CAST(HASHBYTES ('SHA2_256', CAST(FK1 AS NVARCHAR(19)) + CAST(FK2 AS NVARCHAR(19)) + CAST(FK3 AS NVARCHAR(19)) + CAST(FK4 AS NVARCHAR(19)) + CAST(FK5 AS NVARCHAR(19)) + CAST(FK6 AS NVARCHAR(19)) + CAST(FK7 AS NVARCHAR(19)) + CAST(FK8 AS NVARCHAR(19)) + CAST(FK9 AS NVARCHAR(19)) + CAST(FK10 AS NVARCHAR(19)) + CAST(FK11 AS NVARCHAR(19)) + CAST(FK12 AS NVARCHAR(19)) + CAST(FK13 AS NVARCHAR(19)) + CAST(FK14 AS NVARCHAR(19)) + CAST(FK15 AS NVARCHAR(19)) + CAST(STR1 AS NVARCHAR(500)) + CAST(STR2 AS NVARCHAR(500)) + CAST(STR3 AS NVARCHAR(500)) + CAST(STR4 AS NVARCHAR(500)) + CAST(STR5 AS NVARCHAR(500)) + CAST(COMP1 AS NVARCHAR(1)) + CAST(COMP2 AS NVARCHAR(1)) + CAST(COMP3 AS NVARCHAR(1)) + CAST(COMP4 AS NVARCHAR(1)) + CAST(COMP5 AS NVARCHAR(1)) ) AS BINARY(32)) HASH1 FROM HB_TBL_2 WITH (TABLOCK) ) rpt ON rpt.ID = stg.ID WHERE rpt.HASH1 <> stg.HASH1 OPTION (MAXDOP 8);
To simplify things a bit, I'll probably use something like the following for benchmarking. I'll post results with HASHBYTES on Monday:
CREATE TABLE dbo.HASH_ME ( ID BIGINT NOT NULL, FK1 BIGINT NOT NULL, FK2 BIGINT NOT NULL, FK3 BIGINT NOT NULL, FK4 BIGINT NOT NULL, FK5 BIGINT NOT NULL, FK6 BIGINT NOT NULL, FK7 BIGINT NOT NULL, FK8 BIGINT NOT NULL, FK9 BIGINT NOT NULL, FK10 BIGINT NOT NULL, FK11 BIGINT NOT NULL, FK12 BIGINT NOT NULL, FK13 BIGINT NOT NULL, FK14 BIGINT NOT NULL, FK15 BIGINT NOT NULL, STR1 NVARCHAR(500) NOT NULL, STR2 NVARCHAR(500) NOT NULL, STR3 NVARCHAR(500) NOT NULL, STR4 NVARCHAR(500) NOT NULL, STR5 NVARCHAR(2000) NOT NULL, COMP1 TINYINT NOT NULL, COMP2 TINYINT NOT NULL, COMP3 TINYINT NOT NULL, COMP4 TINYINT NOT NULL, COMP5 TINYINT NOT NULL ); INSERT INTO dbo.HASH_ME WITH (TABLOCK) SELECT RN, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, RN 00000, REPLICATE(CHAR(65 + RN ), 30) ,REPLICATE(CHAR(65 + RN ), 30) ,REPLICATE(CHAR(65 + RN ), 30) ,REPLICATE(CHAR(65 + RN ), 30) ,REPLICATE(CHAR(65 + RN ), 1000), 0,1,0,1,0 FROM ( SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN FROM master..spt_values t1 CROSS JOIN master..spt_values t2 ) q OPTION (MAXDOP 1); SELECT MAX(HASHBYTES('SHA2_256', CAST(N'' AS NVARCHAR(MAX)) + N'|' + CAST(FK1 AS NVARCHAR(19)) + N'|' + CAST(FK2 AS NVARCHAR(19)) + N'|' + CAST(FK3 AS NVARCHAR(19)) + N'|' + CAST(FK4 AS NVARCHAR(19)) + N'|' + CAST(FK5 AS NVARCHAR(19)) + N'|' + CAST(FK6 AS NVARCHAR(19)) + N'|' + CAST(FK7 AS NVARCHAR(19)) + N'|' + CAST(FK8 AS NVARCHAR(19)) + N'|' + CAST(FK9 AS NVARCHAR(19)) + N'|' + CAST(FK10 AS NVARCHAR(19)) + N'|' + CAST(FK11 AS NVARCHAR(19)) + N'|' + CAST(FK12 AS NVARCHAR(19)) + N'|' + CAST(FK13 AS NVARCHAR(19)) + N'|' + CAST(FK14 AS NVARCHAR(19)) + N'|' + CAST(FK15 AS NVARCHAR(19)) + N'|' + CAST(STR1 AS NVARCHAR(500)) + N'|' + CAST(STR2 AS NVARCHAR(500)) + N'|' + CAST(STR3 AS NVARCHAR(500)) + N'|' + CAST(STR4 AS NVARCHAR(500)) + N'|' + CAST(STR5 AS NVARCHAR(2000)) + N'|' + CAST(COMP1 AS NVARCHAR(1)) + N'|' + CAST(COMP2 AS NVARCHAR(1)) + N'|' + CAST(COMP3 AS NVARCHAR(1)) + N'|' + CAST(COMP4 AS NVARCHAR(1)) + N'|' + CAST(COMP5 AS NVARCHAR(1)) ) ) FROM dbo.HASH_ME OPTION (MAXDOP 1);
HashBytes system is used to create Hash, regardless of the algorithm used.
A hash is an essential calculation based on the values of the input, and two inputs that are the same ought to produce the same hash.
Since you're just looking for changes, you don't need a cryptographic hash function.
You could choose from one of the faster non-cryptographic hashes in the open-source Data.HashFunction library by Brandon Dahler, licensed under the permissive and OSI approved MIT license. SpookyHash is a popular choice.
Example implementation
Source Code
using Microsoft.SqlServer.Server; using System.Data.HashFunction.SpookyHash; using System.Data.SqlTypes; public partial class UserDefinedFunctions { [SqlFunction ( DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = true, IsPrecise = true ) ] public static byte[] SpookyHash ( [SqlFacet (MaxSize = 8000)] SqlBinary Input ) { ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create(); return sh.ComputeHash(Input.Value).Hash; } [SqlFunction ( DataAccess = DataAccessKind.None, IsDeterministic = true, IsPrecise = true, SystemDataAccess = SystemDataAccessKind.None ) ] public static byte[] SpookyHashLOB ( [SqlFacet (MaxSize = -1)] SqlBinary Input ) { ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create(); return sh.ComputeHash(Input.Value).Hash; } }
The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version. The non-LOB version should be significantly quicker.
You might be able to wrap a LOB binary in COMPRESS to get it under the 8000 byte limit, if that turns out to be worthwhile for performance. Alternatively, you could break the LOB up into sub-8000 byte segments, or simply reserve the use of HASHBYTES for the LOB case (since longer inputs scale better).
Pre-built code
You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:
https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300
T-SQL functions
CREATE FUNCTION dbo.SpookyHash ( @Input varbinary(8000) ) RETURNS binary(16) WITH RETURNS NULL ON NULL INPUT, EXECUTE AS OWNER AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash; GO CREATE FUNCTION dbo.SpookyHashLOB ( @Input varbinary(max) ) RETURNS binary(16) WITH RETURNS NULL ON NULL INPUT, EXECUTE AS OWNER AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
GO
Usage
An example use given the sample data in the question:
SELECT HT1.ID FROM dbo.HB_TBL AS HT1 JOIN dbo.HB_TBL_2 AS HT2 ON HT2.ID = HT1.ID AND dbo.SpookyHash ( CONVERT(binary(8), HT2.FK1) + 0x7C + CONVERT(binary(8), HT2.FK2) + 0x7C + CONVERT(binary(8), HT2.FK3) + 0x7C + CONVERT(binary(8), HT2.FK4) + 0x7C + CONVERT(binary(8), HT2.FK5) + 0x7C + CONVERT(binary(8), HT2.FK6) + 0x7C + CONVERT(binary(8), HT2.FK7) + 0x7C + CONVERT(binary(8), HT2.FK8) + 0x7C + CONVERT(binary(8), HT2.FK9) + 0x7C + CONVERT(binary(8), HT2.FK10) + 0x7C + CONVERT(binary(8), HT2.FK11) + 0x7C + CONVERT(binary(8), HT2.FK12) + 0x7C + CONVERT(binary(8), HT2.FK13) + 0x7C + CONVERT(binary(8), HT2.FK14) + 0x7C + CONVERT(binary(8), HT2.FK15) + 0x7C + CONVERT(varbinary(1000), HT2.STR1) + 0x7C + CONVERT(varbinary(1000), HT2.STR2) + 0x7C + CONVERT(varbinary(1000), HT2.STR3) + 0x7C + CONVERT(varbinary(1000), HT2.STR4) + 0x7C + CONVERT(varbinary(1000), HT2.STR5) + 0x7C + CONVERT(binary(1), HT2.COMP1) + 0x7C + CONVERT(binary(1), HT2.COMP2) + 0x7C + CONVERT(binary(1), HT2.COMP3) + 0x7C + CONVERT(binary(1), HT2.COMP4) + 0x7C + CONVERT(binary(1), HT2.COMP5) ) <> dbo.SpookyHash ( CONVERT(binary(8), HT1.FK1) + 0x7C + CONVERT(binary(8), HT1.FK2) + 0x7C + CONVERT(binary(8), HT1.FK3) + 0x7C + CONVERT(binary(8), HT1.FK4) + 0x7C + CONVERT(binary(8), HT1.FK5) + 0x7C + CONVERT(binary(8), HT1.FK6) + 0x7C + CONVERT(binary(8), HT1.FK7) + 0x7C + CONVERT(binary(8), HT1.FK8) + 0x7C + CONVERT(binary(8), HT1.FK9) + 0x7C + CONVERT(binary(8), HT1.FK10) + 0x7C + CONVERT(binary(8), HT1.FK11) + 0x7C + CONVERT(binary(8), HT1.FK12) + 0x7C + CONVERT(binary(8), HT1.FK13) + 0x7C + CONVERT(binary(8), HT1.FK14) + 0x7C + CONVERT(binary(8), HT1.FK15) + 0x7C + CONVERT(varbinary(1000), HT1.STR1) + 0x7C + CONVERT(varbinary(1000), HT1.STR2) + 0x7C + CONVERT(varbinary(1000), HT1.STR3) + 0x7C + CONVERT(varbinary(1000), HT1.STR4) + 0x7C + CONVERT(varbinary(1000), HT1.STR5) + 0x7C + CONVERT(binary(1), HT1.COMP1) + 0x7C + CONVERT(binary(1), HT1.COMP2) + 0x7C + CONVERT(binary(1), HT1.COMP3) + 0x7C + CONVERT(binary(1), HT1.COMP4) + 0x7C + CONVERT(binary(1), HT1.COMP5) );
When using the LOB version, the first parameter should be cast or converted to varbinary(max).
Execution plan
Safe Spooky
The Data.HashFunction library uses a number of CLR language features that are considered UNSAFE by SQL Server. It is possible to write a basic Spooky Hash compatible with SAFE status. An example I wrote based on Jon Hanna's SpookilySharp is below:
https://gist.github.com/SQLKiwi/7a5bb26b0bee56f6d28a1d26669ce8f2