Why use BULK INSERT in SQL server to insert data in random order?

521 Asked by AndreaBailey in SQL Server , Asked on Jul 12, 2021

I have a CSV file with 350,000 rows. Would like to insert data to the temporary table in the same order as is in the CSV file. I am trying to BULK INSERT using:

BULK INSERT ##AllRows FROM @FilePath WITH ( FIELDTERMINATOR = 'n' ) Unfortunately BULK INSERT inserts data in random order. My header in every execution is in a different row. I ran it on SQL Server 2016. Is there a possibility that in older versions of SQL Server order manners were different? Using the FIRSTROW option does not recognize the header as the first row of the file. The file does not have any column for which we can order. In the file the header is in the first row. It could be a coincidence but even with FIRSTROW=2, it is possible that my header will be in the table. I checked it. It looks like the more rows the CSV file contains the more chance that inserts to the table will be in random order. Why use BULK INSERT in sql server?

Answered by Al German

BULK INSERT can be a minimally logged operation (depending on various parameters like indexes, constraints on the tables, recovery model of the database, etc). In the case of BULK INSERT, only extent allocations are logged instead of the actual data being inserted. This will provide much better performance than INSERT.

Why use BULK INSERT in sql server?

A Bulk insert is a process or method provided by a database management system to load multiple rows of data into a database table.” If we adjust this explanation in accordance with the BULK INSERT statement, the bulk insert allows importing external data files into SQL Server.

The rows from the file are read in order, and added to the table in the same order. The issue occurs when you read rows from the table. Without an ORDER BY clause on your SELECT, SQL Server is free to return rows from the table in whatever order is convenient.

Details

The question doesn't provide a definition for the table ##AllRows, but it seems certain the table is a heap (a table without a clustered index). SQL Server reads pages from a heap using Index Allocation Map (IAM) structures. This means data tends to be returned in file and page id order within each IAM chain, which will generally not reflect the order in which data was inserted. This is the underlying cause of the behaviour you are seeing.

Solutions

You need a column to indicate the order of rows in the file, then order by that column when you write your query. Unfortunately, SQL Server does not provide a built-in way to add this 'sequence' column during import. There are a couple of common workarounds:

Pre-process the source file outside SQL Server to add a sequence number to each row. This is the most reliable method.
Assign a sequence number during the import.

The second method carries some risk because there is no documented guarantee that this will work reliably in all circumstances. Nevertheless, people have been using this idea successfully for a long time. The general idea is:

Add an IDENTITY column to the import table.
Create a view over the import table, omitting the IDENTITY column.

BULK INSERT into the view. This will not work with a global temporary table, because a view cannot reference that type of table. You would need to use a regular table (perhaps in tempdb) instead.

Example 1

I used the following script to successfully import a csv file containing the Complete Works of Shakespeare:

The first step is to create a table in tempdb with an extra IDENTITY column:

USE tempdb; GO CREATE TABLE dbo.Test ( id integer IDENTITY PRIMARY KEY, line nvarchar(4000) NOT NULL );

Now we create a view over that table, omitting the IDENTITY column:

CREATE VIEW dbo.ImportTest WITH SCHEMABINDING AS SELECT T.line FROM dbo.Test AS T;

Finally, we bulk insert into the view:

BULK INSERT dbo.ImportTest FROM 'C:Tempshakespeare.txt' WITH ( CODEPAGE = '65001', DATAFILETYPE = 'char', ROWTERMINATOR = '

' );

We can now see the first few lines in file order using a SELECT with ORDER BY:

SELECT TOP (20) T.id, T.line FROM dbo.Test AS T ORDERBY T.id ASC;

The result shows the text in the right order:

Example 2

It is also possible to use OPENROWSET with a format file. Using the same example csv file, I was able to import the data using the following format file (saved as shakespeare.xml):

and:

INSERT dbo.Test WITH (TABLOCK) ( line ) SELECT ORO.line FROM OPENROWSET ( BULK 'C:Tempshakespeare.txt', FORMATFILE = 'C:Tempshakespeare.xml', CODEPAGE = '65001' ) AS ORO;

Notice that this method does not require a view, so you could target a global temporary table. The target table still needs the extra IDENTITY column.

Why use BULK INSERT in SQL server to insert data in random order?

Your Answer