SHA1/MD5: UCS-2 vs. UTF-16 encoding
File this under geeky, technical, pain-in-my-ass issue-of-the-day.
I have some data stored in an nvarchar() column of SQL-2005. I create a sha1sum hash of that data using hashbytes('sha1',@nvarchar). Later on, I try to compare a sha1sum created in Linux against the sha1sum created in SQL on the same string and the hashes are different. WTF?
Consider this SQL statement & result:
declare @str nvarchar(max)
set @str = 'Quick brown fox'
select right(master.dbo.fn_varbintohexstr(hashbytes('sha1',@str)),40)
-- RESULT: e0d26c51ff25b6f4bffa6c7d9ec792a07e92a1f3
Versus this Bash shell statement & result:
#!/bin/bash
printf "Quick brown fox" | sha1sum
# RESULT: d695e427f9e4be8013d432eeebce30613c05e0a8
Next ensues a four hour digression into the aggravating arcana of the UCS-2 character encoding used by SQL 2005 versus the UTF-8/16 character encodings used by pretty much everything else. Nevermind that UCS-2 is a subset of UTF-16 [1]:
The UCS-2 encoding scheme is actually a subset of the UTF-16 scheme. Every UCS-2 encoded code point is identical to the encoding of the same code point in UTF-16.
But in consideration of hash sums, this is the critical difference [2]:
UTF-16 … is a variable-length character encoding … UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and … is a fixed-length encoding that always encodes characters into a single 16-bit value.
Computing a SHA1 or MD5 sum? Equivalent encodings that use different bit lengths to encode characters of the same string get hashed differently. Consider these sha1sum’s for the string “Quick brown fox”:
#!/bin/bash
#
# UCS-2 encoding, like SQL nvarchar():
printf "Quick brown fox" | iconv -t UCS-2 | sha1sum
# RESULT: e0d26c51ff25b6f4bffa6c7d9ec792a07e92a1f3
#
# UTF-16 encoding, the superset of UCS-2:
printf "Quick brown fox" | iconv -t UTF-16 | sha1sum
# RESULT: cfc24d97f59d950ed59f4a52bf7a94a077e32774
#
# UTF-8 encoding, like my Bash shell:
printf "Quick brown fox" | iconv -t UTF-8 | sha1sum
# RESULT: d695e427f9e4be8013d432eeebce30613c05e0a8
#
# LATIN1 encoding, string without extended characters, is identical:
printf "Quick brown fox" | iconv -t LATIN1 | sha1sum
# RESULT: d695e427f9e4be8013d432eeebce30613c05e0a8
#
In my case, I wanted to compare a sha1sum created in SQL-2005 to a hash of data migrated to couchdb. The SQL 2005 hashbytes() operates on the column encoding. The datatype varchar is based on your local collation (mine is Latin1), whereas nvarchar is always UCS-2. The data I want to compare in couchdb is UTF-8 encoded. Since SQL 2005 is too dumb to speak anything so common as UTF-8, comparision of hashbytes() on nvarchar data requires that you convert the encoding of the non-SQL string to UCS-2 first.
To quote Fabio on Top Chef last night, “I got so many kick in my ass that sometime when I am in the bathroom I poop shoes out of my ass.”
[1] Qingsong Yao: SQL Server & UTF-16
PS- Interestingly, I found the Google sweet spot for this issue by searching for the md5sum of “The quick brown fox jumps over the lazy dog” which is
9e107d9d372bb6826bd81d3542a419d6. Googlewhackblatt that hapax legomenon!
