mysql utf8 latin2 utf8 weirdness
This one is a keeper, have to write it down somewhere in case I forget it by tomorrow morning. :) Today I was doing routine deploy of one of our new websites to production server. Everything went smoothly, except the fact I couldn’t see any of Croatian letters on the site texts (which were stored in mysql)!! We did migration to UTF8 some time ago, and this was highly unexpected. So I started to dig into the dump. :)
Looking at the mysqldump of a database I was a bit surprised to see Croatian letters screwed up beyond recognition. For example letter “š” was represented as “ÄąÄO„” (4 bytes), “č” was “ÄO” (two bytes)… etc. OMG! All our tables are “CHARSET=utf8”, we force “AddDefaultCharset utf-8” in Apache and all of our markup uses “charset=utf-8” for content type encoding. BUT, we unintentionally left one very bad call in our config file which got triggered if site was running on development server – “SET NAMES ‘latin2’ COLLATE ‘latin2_croatian_ci”. The production config had SET NAMES utf8, but development didn’t. Oh boy.
Ok so now I had a real mess – utf8 tables with utf8 data in them stored as re-encoded latin2 but in utf8!? I tried converting the dump from ISO8859-2 to UTF8 but that just made things worse, some of characters now used 6 bytes which isn’t good. The solution was pretty straight forward. I converted the dump from UTF8 to ISO-8859-2 and I got real UTF8 again. I imported the converted dump and changed for good config files to “SET NAMES ‘utf8’ COLLATE ‘utf8_general_ci”.
iconv to the rescue:
[ftf w=”400″ h=”150″]
mysqldump -u root -p db > weirdo-dump.sql
iconv -f UTF-8 -t ISO_8859-2//TRANSLIT weirdo-dump.sql > dump-latin2-aka-realUTF8.sql
mysql -u root -p db < dump-latin2-aka-realUTF8.sql[/ftf]
3 thoughts on “mysql utf8 latin2 utf8 weirdness”
September 16, 2007 at 11:11
This weirdness is definitely a bug which was also filed as S2 severity bug at MySQL but closed ( see http://bugs.mysql.com/bug.php?id=9091 ) and unfortunately we can see the same happening in MySQL 5 too.
It seems to me that MySQL is not paying much attention to the multibyte support.
September 19, 2007 at 01:29
The good thing is – I didn’t loose any high bit data. My data got encoded twice, so Instead of 2 bytes some chars used 4. :)
mysql4 was a nightmare tough. we are waiting for 5.1 and ndb engine.
May 30, 2008 at 07:51
Your post saved my ass. THX ;)