Unicode Unicorns

UTF8 seems to be the bee’s knees of text encoding, but when your source comes in all different flavours, how do we deal with this ?

Recently a client asked me to check through 45gb of weblogs for a particular set of results, results which I would further have to parse to convert a database key to a filename. This is a global client using English, French, German, and Russian; meaning the alphabet was very much extended over our usual 26 letters (not counting numbers, punctuation, and other oddities).

Step One : LogParser
Microsoft have an awesome tool for parsing logs [1] of all formats, and in our use case IISW3C. It’s command line and a bit tricky so my first thought was to use Log Parser Studio [2] to act as the GUI for this. you can feed it log files and query them using SQL, then dump the results to CSV. Unfortunately, my data was too large and caused OutOfMemory errors after only parsing 10 days of logs, and I’ve got 3 years to do.

Step Two : Powershell
I knocked together a quick powershell script to grab each of the log files, pass them via LogParser.exe and spit out a CSV for each. Again the results weren’t 100%, the encoding was screwed up. I did manage to sort this using the command line below to spit out Unicode encoded CSVs. I also prefer tab separated so added the flag for this, but it still drew commas (we’ll clean them up later).

LogParser.exe -i:IISW3C -o:CSV -oCodepage:-1 -tabs:ON file:$StrSqlFile?LOGFILEPATH=$StrPathNameLog+CSVFILEPATH=$StrPathNameCsv

For reference the sql file contained :

SELECT cs-uri-stem, cs-uri-query INTO '%CSVFILEPATH%' FROM '%LOGFILEPATH%'

Step Three : CScript & SQL Server Management Studio
We’ve worked with this clients data for years and have a bunch of importers already written as vbs files. “No problem, i’ll just grab one of these” I thought. So we added this into the database and went to SSMS to see if it had run, but SSMS showed “?” in place of characters. A google told me the default font in SSMS didnt like unicode [3] so everything got switched to Courier New. A “SELECT ‘инструменты'” was showing up correctly in the results.

So we again run our script and find sometimes it worked and sometimes it didn’t – LogParser has given me a mix of UTF-8 and UTF-16 files and -uhoh- the VBS doesn’t like the UTF8 files and FileSystemObject is to blame [4] Setting the -1 flag alows us to read the UTF-16’s okay or switching to ADO.stream can let us read UTF-8 ok, so how do we get them all into one format ? !

Step Four : Convert all the CSVs
To our rescue comes UTFcast [5], a utility for mass converting encoding. The express version seems to only allow conversion to UTF8 and only up to files of 800mb, both within our limits. If we set the VBS to use ADO.Stream, then we win !

And we did 🙂

External Links