SHA1/MD5: UCS-2 vs. UTF-16 encoding

File this under geeky, technical, pain-in-my-ass issue-of-the-day.

I have some data stored in an nvarchar() column of SQL-2005.  I create a sha1sum hash of that data using hashbytes('sha1',@nvarchar).  Later on, I try to compare a sha1sum created in Linux against the sha1sum created in SQL on the same string and the hashes are different.  WTF?

Consider this SQL statement & result:

declare @str nvarchar(max)
set @str = 'Quick brown fox'
select right(master.dbo.fn_varbintohexstr(hashbytes('sha1',@str)),40)
-- RESULT: e0d26c51ff25b6f4bffa6c7d9ec792a07e92a1f3

Versus this Bash shell statement & result:

#!/bin/bash
printf "Quick brown fox" | sha1sum
# RESULT: d695e427f9e4be8013d432eeebce30613c05e0a8

Next ensues a four hour digression into the aggravating arcana of the UCS-2 character encoding used by SQL 2005 versus the UTF-8/16 character encodings used by pretty much everything else. Nevermind that UCS-2 is a subset of UTF-16 [1]:

The UCS-2 encoding scheme is actually a subset of the UTF-16 scheme. Every UCS-2 encoded code point is identical to the encoding of the same code point in UTF-16.

But in consideration of hash sums, this is the critical difference [2]:

UTF-16 … is a variable-length character encoding … UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and … is a fixed-length encoding that always encodes characters into a single 16-bit value.

Computing a SHA1 or MD5 sum? Equivalent encodings that use different bit lengths to encode characters of the same string get hashed differently. Consider these sha1sum’s for the string “Quick brown fox”:

#!/bin/bash
#
# UCS-2 encoding, like SQL nvarchar():
printf "Quick brown fox" | iconv -t UCS-2 | sha1sum
# RESULT: e0d26c51ff25b6f4bffa6c7d9ec792a07e92a1f3
#
# UTF-16 encoding, the superset of UCS-2:
printf "Quick brown fox" | iconv -t UTF-16 | sha1sum
# RESULT: cfc24d97f59d950ed59f4a52bf7a94a077e32774
#
# UTF-8 encoding, like my Bash shell:
printf "Quick brown fox" | iconv -t UTF-8 | sha1sum
# RESULT: d695e427f9e4be8013d432eeebce30613c05e0a8
#
# LATIN1 encoding, string without extended characters, is identical:
printf "Quick brown fox" | iconv -t LATIN1 | sha1sum
# RESULT: d695e427f9e4be8013d432eeebce30613c05e0a8
#

In my case, I wanted to compare a sha1sum created in SQL-2005 to a hash of data migrated to couchdb.  The SQL 2005 hashbytes() operates on the column encoding. The datatype varchar is based on your local collation (mine is Latin1), whereas nvarchar is always UCS-2. The data I want to compare in couchdb is UTF-8 encoded.  Since SQL 2005 is too dumb to speak anything so common as UTF-8, comparision of hashbytes() on nvarchar data requires that you convert the encoding of the non-SQL string to UCS-2 first.

To quote Fabio on Top Chef last night, “I got so many kick in my ass that sometime when I am in the bathroom I poop shoes out of my ass.”

[1] Qingsong Yao: SQL Server & UTF-16

[2] Wikipedia: UTF-16/UCS-2


PS- Interestingly, I found the Google sweet spot for this issue by searching for the md5sum of “The quick brown fox jumps over the lazy dog” which is 9e107d9d372bb6826bd81d3542a419d6. Googlewhackblatt that hapax legomenon!

1 Comment | Tags: Linux, Work

Amazon’s far side

I’ve been cataloging Amazon oddities for a little while now.  Some I found through the oddities forum, some I happened across myself. The comments are what I enjoy most.  Generally, they are parodies of Amazon comments, which if you look to Amazon reviews as much as I do, make them all the more hilarious.

Playmobil Security Check Point — A toy that gets reviewers wound up.  One says “Thank you Playmobil for allowing me to teach my 5-year old the importance of recognizing what a failing bureaucracy in a ever growing fascist state looks like.”  Another says “This toy would be a lot more realistic with about 350 people standing in line for an average of an hour. It still makes a nice set with the interrogation room.”  Still another cautions “the passenger’s shoes cannot be removed.”

JL421 Badonkadonk Land Cruiser/Tank — Another reviewer favorite.  Better than the comical reviews are the critics.  Such as, “It’s a symptom of how depraved our culture has become that this item, like the Hummer, is even available. I guess there’s an emerging sociopath market that manufacturers have finally found a way to tap into.”

Complete USB Humping Dog Set — Also sold individually, this is not a USB flash drive.  It’s a little plastic dog that humps your USB port.  One reviewer gave every single model a bad review.  He (I assume) says “When plugged in, it caused my computer to malfunction. Also, when contacting the vendor, he was rude and refused to give me a refund.” Another says “I don’t think this is the kind of toy you would always have plugged in. It creates decent amount of annoying noise.”

Uranium Ore — Even though 66% of the people who view this product go on to order “Extras - The Complete Series” it just might put you on a terrorist watch list.  It is for real but not for irradiating or nuke making.  Buyer beware: one reviewer claims “Every time I try to use this product, the Libyans show up and steal my Delorian.”

Vaseline Uranium Opalescent Glass Rose Butter Dish —  Awesome translation. Amazon thought I was “looking for ‘uranium’ Products” and found it for me.  The lone reviewer said “Often you don’t know you need something until somebody invents it… Last week the butter was lost in the fridge, the uranium in a glowing cardboard box under the stairs and my bed was awash with rose petals and half-empty tubs of Vaseline - the usual scene of domestic bliss.  Now they are all handily together in one…”

Renova Black Toilet Paper (6 pack) —  Available in colors other than black, reviewers appreciate that “it doesn’t leave the color behind.” Several reviewers deploy colored T.P. when entertaining or for company.  Interesting idea, I just don’t know that I want to be remembered for my T.P.

This Is a Test Product and Nothing Will Be Sent to You — In the words of a reviewer, “I haven’t actually ordered nothing from Amazon as yet, however, I’ve been familiar with this product for years.  I’ve been given nothing by employers, various gentlemen friends, family members and many departments of the United States government.”

1 Comment | Tags: Consumer, Musings

Google: Superbowl may harm computer

I had an interesting experience this morning.  All Google results for any search returned their malware warning, “This site may harm your computer.”  Google linked not to the target site, but an interestatal warning page about the dangers of malware on the target site.  I had to manually copy the site’s URL into my address bar to actually reach it after Google blocked it.

On what search did I encounter this oops?  A search for Superbowl 43 kickoff time!  Ouch…

Google results tagged superbowl pages as malware

Comment | Tags: Consumer, Musings

Workaround DOS wildcard expansion

Have you ever tried to use *nix-shell (ha) wildcard expansion in DOS only to have it fail?  Well maybe that doesn’t happen to you very often, but it happens often enough to me.  Often enough that I need to write the syntax down to workaround the annoyance…

For instance, today I tried to run tidy on all HTML files in a folder:

> \bin\tidy.exe -cmbn -asxml --alt-text "" *.html
Error: Can't open "*.html"

WTF?  Oh yeah… << looks for sticky note >> Here’s how to do it:

> for /f %a IN ('dir /b *.html') do \bin\tidy.exe -cmbn -asxml --alt-text "" %a

Comment | Tags: Work

Tornado Proof Amazon Cart

I have been reading up on the technical foundations for CouchDB’s approach to consistency, availability, and partition tolerance in a distributed system.  The problems involved in balancing these concerns are interesting.  Interesting to me and to many other smart dorks at the moment.

But what’s perhaps interesting to everyone, regarding these concerns, are the  non-dorky physical realities taken into consideration apart from the dorky ones (harddrive failures, network latencies, etc.).  A paper published in ACM by Amazon in 2007 discussing the design and implementation of their Dynamo project introduces the concerns like this:

One of the lessons our organization has learned from operating Amazon’s platform is that the reliability and scalability of a system is dependent on how its application state is managed. Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados. Therefore, the service responsible for managing shopping carts requires that it can always write to and read from its data store, and that its data needs to be available across multiple data centers.

We live in a world where unfathomable disasters happen all the time.  The fortunate of us are sheltered from them by a long list of buffers.  Now include among these buffers your Amazon shopping cart, which is replicated in the Dynamo with your Amazon avatar, distibuted in the freakish virtual tornado-proof cloud like light from your face in a house of mirrors.

Comment | Tags: Musings, Work

  1...2...3...4  Next Page »