“Imamo Hrvatsku!” – MySQL patch which implements full Croatian ordering in utf8_croatian_ci and ucs2_croatian_ci collations

Author: seven November 30, 2009

Great news, great news indeed. A couple of months ago, I started an open initiative to finally add support to MySQL for proper ordering using the Croatian alphabet. We tried doing it on our own, but we needed to rewrite MySQL’s Unicode Collation Algorithm, and for that we really needed help from the MySQL development team. How did we manage to get it? Using the good old “Balkan way”: schnapps, a.k.a. rakija (correction: black vodka). :)

My mate who worked with me on our initial implementation, Ante ‘Ivoks’ Karamatić (Chief Executive at Init), got drunk in Dallas last week with Kurt von Finck (Chief Community and Communications Officer for Monty Program Ab), who put in a good word with Michael (“Monty”) Widenius (MySQL’s original author and co-founder of MySQL AB) to listen to our cries for help. Monty convinced Alexander Barkov (lead software developer at Sun Microsystems working on MySQL) to give us a little help with the whole Croatian ordering issue. As a result, the utf8_croatian_ci and ucs2_croatian_ci collations were created and added to MySQL 6.

After a pleasant chat with Monty and Bar, they were good enough to help us with a MySQL 5.1 patch which implements full Croatian ordering in utf8_croatian_ci and ucs2_croatian_ci collations. Woohoo! :)

The bad news is that it will take a fair amount of time before MySQL 5.6 (or 6.0, for that matter) goes GA, so one has to wait before it is possible to download a production version of MySQL with “real” Croatian support.

If you really need Croatian support, you can try patching MySQL server as we did.

More details about the patch can be found here:

Since Alexander Barkov was so kind as to provide a patch for MySQL 5.1, Ante created packages for Ubuntu. He also slightly modified that patch so that it works with MySQL 5.0 (this needs further testing). If you need this feature, add this PPA to your sources.list: https://edge.launchpad.net/~ivoks/+archive/mysql-hr/.

After you apply the patch, you can try it out using my test database dump. If everything went OK, running "use croatian; SET NAMES 'utf8' COLLATE 'utf8_croatian_ci'; select rijec from test_croatian order by rijec;" should produce output like this (switch your browser view to UTF-8).

Any feedback from the Croatian MySQL community is very welcome. Please send your comments to <Alexander.Barkov[at]Sun.COM>. Thanks!

Proof of concept:

mysql> select version();
+-----------------+
| version()       |
+-----------------+
| 5.0.51a-hr1-log | 
+-----------------+
mysql> use croatian; SET NAMES 'utf8' COLLATE 'utf8_croatian_ci'; select rijec from test_croatian order by rijec;
+--------------+
| rijec        |
+--------------+
| Aboriđin     |
| Aboriđini    |
| Ante         |
| Branimir     |
| Cipela       |
| Čazma        |
| Ćevapčići    |
| Džak         |
| džak         |
| Džamija      |
| džamija      |
| Đak          |
| đak          |
| Đevđelija    |
| Inat         |
| Init         |
| Inozemstvo   |
| Interes      |
| Injekcija    |
| Ipsilon      |
| Kutina       |
| Livno        |
| Lovor        |
| Ljubav       |
| Ljubljana    |
| Neven        |
| Nivas        |
| Nosorog      |
| Njivice      |
| Onomatopeja  |
| Šišmiš       |
| Zagreb       |
| Žaba         |
+--------------+
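
The ordering in the result above can be approximated outside MySQL with a small sort key. What follows is only a sketch under stated assumptions, not the code from the patch: the names `CROATIAN_ALPHABET`, `DIGRAPHS` and `croatian_key` are made up for illustration, comparison is case-insensitive (as the `_ci` suffix suggests), every "dž", "lj" and "nj" is treated as a single letter, and characters outside the 30-letter alphabet are simply placed after it.

```python
import unicodedata

# The 30 letters of the Croatian alphabet in order; dž, lj and nj each
# count as one letter and sort right after d, l and n respectively.
CROATIAN_ALPHABET = [
    "a", "b", "c", "č", "ć", "d", "dž", "đ", "e", "f",
    "g", "h", "i", "j", "k", "l", "lj", "m", "n", "nj",
    "o", "p", "r", "s", "š", "t", "u", "v", "z", "ž",
]
RANK = {letter: index for index, letter in enumerate(CROATIAN_ALPHABET)}
DIGRAPHS = {"dž", "lj", "nj"}


def croatian_key(word):
    """Return a case-insensitive sort key (hypothetical helper).

    Assumption: every 'nj', 'lj' and 'dž' sequence is a digraph, so
    'Injekcija' is read as i + nj + e + ... here.
    """
    # NFC turns "c" + combining caron into the single character "č".
    text = unicodedata.normalize("NFC", word).lower()
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            key.append((0, RANK[pair]))
            i += 2
        elif text[i] in RANK:
            key.append((0, RANK[text[i]]))
            i += 1
        else:
            # Anything outside the alphabet sorts after it, by code point.
            key.append((1, ord(text[i])))
            i += 1
    return tuple(key)


sample = ["Njivice", "Nosorog", "Ljubav", "Lovor",
          "Injekcija", "Interes", "Čazma", "Cipela"]
print(sorted(sample, key=croatian_key))
```

With this key the sample sorts as Cipela, Čazma, Interes, Injekcija, Lovor, Ljubav, Nosorog, Njivice, matching the relative order in the MySQL result. A plain code-point sort would instead put Injekcija before Interes and Ljubav before Lovor.
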
Author
seven
CEO/CTO at Nivas®
Neven Jacmenović has been passionately involved with computers since the late 80s, the age of Atari and Commodore Amiga. As one of the internet industry pioneers in Croatia, he has been involved since the 90s in the making of many award-winning, innovative and successful online projects. He is an experienced full-stack web developer, analyst and system engineer. In his spare time, Neven channels his retro-futuristic passion into various golang, Adobe Flash and JavaScript/WebGL projects.

    20 thoughts on ““Imamo Hrvatsku!” – MySQL patch which implements full Croatian ordering in utf8_croatian_ci and ucs2_croatian_ci collations”

  • Respect! Quite a story I must say :) Good old rakija wins again :D

  • Restekpa! :)

  • Great news!

  • And they say drinking is bad for you!!! :)
    Hats off to you fellas!

  • Haha something like that, yes :) It wasn’t rakija, it was beer and http://en.wikipedia.org/wiki/Pisco :D And I wouldn’t say drunk, we just had a nice chat during the last evening at the Ubuntu Developer Summit (free alcohol) :)

    As for 5.0 patch… It does work; it passed all tests and there is no logical reason why it wouldn’t work.

  • Sorry, but it wasn’t rakija.

    It was salmiakkikossu, Finnish black vodka.

    But yes, liquor is one of the best lubricants for the software development machine. :)

  • One more note… MariaDB 5.1, which will be released very soon, will support all of this. So, if you need a better MySQL than MySQL, go MariaDB! http://askmonty.org/wiki/index.php/MariaDB

  • Btw… This ‘proof of concept’ doesn’t prove anything regarding Nj and Lj. Add ‘Lovor’ and ‘Nosorog’ as well.

  • Yo Kurt! Welcome to our blog! OMG black vodka sounds absynthish. :) We must organize a schnapps session when you come to our neighbourhood.

    Ante: Added “Lovor” & “Nosorog” to the blog post, test dump and test results. I got carried away with your “Injekcija” yesterday, so I missed a couple of examples.

  • Except that the “nj” in injekcija is actually n and j, not nj… :S

  • Zlatko, it can be both :) One is a mathematical function, the other is a syringe. That is why sorting won’t be correct until we start entering ‘nj’ as a single letter rather than ‘n’ and ‘j’.

    On the other hand, I talked to a linguist and he said it is not that much of a problem, because it affects only about 30 words.

  • except that the language is unlikely to change to drop the digraphs… but there you go. at least it’s something…

  • Nice job :)

    @Zlatko: don’t be unhappy, this is totally OK.

    Mathematicians, folks, this INJEKCIJA thing has gone in the wrong direction. :)

    I am not a linguist, I don’t deal with phonetics, and I am not a teacher of Croatian, so I by no means want to draw their wrath.

    – SCRIPT –

    The Croatian alphabet, Croatian Latin script, or Gajica (after Ljudevit Gaj) is a Latin script intended for writing the Croatian language. It contains 30 letters: 27 are written with a single character (single letters) and 3 with two characters (double letters / digraphs): lj, nj and dž.

    Gaj took Ň from Czech, which became NJ, Ľ from Slovak, which was replaced by LJ, and the Polish letter Ǧ, which was replaced by DŽ.

    – ORTHOGRAPHY –

    Croatian orthography recognizes several sound changes. One of them (the one of interest to us here), iotation (jotovanje or jotacija), is the sound change that merges the non-palatal consonants c, d, g, h, k, l, n, s, t and z with the sound j, producing the palatal consonants č, đ, ž, š, lj, nj and ć.

    Iotation has its rules and exceptions, but as far as I know lj and nj are not subject to any orthographic exception.

    Spoken language goes beyond the bounds of Croatian orthography and, depending on the dialect, has its own variations of iotation (novo, novije, najnovije, ijekavian…).

    I looked in a few Croatian dictionaries and there is only 1 such word in the Croatian language (a loanword): injèkcija (f.) (Lat. injicere: to insert, to put in), and it can mean:
    – injecting
    – the liquid that is injected
    – the device used for injecting
    – a strong thrust
    – a mathematical function

    In bookish pronunciation it is injekcija, while it is commonly pronounced ińekcija. In no language does a letter (grapheme) have to correspond to a sound (phoneme).

    The morphemic breakdown is not in-jekcij-a, but injekcij-a. We have no word ‘jekcija’.

    Therefore, I still maintain that injekcija is written with nj. :)

    If I am wrong, please correct me.

    And finally, I would agree with Ante (if not about injekcija) that the biggest enemy of the Croatian language in the modern age of computers and databases is the fact that we have no keyboards with the letters LJ, NJ and DŽ.

  • This proof-of-concept doesn’t work.
    Kutina appears twice: after Đevđelija and after Ipsilon.

  • It does work, I just messed up while editing the MySQL output. When I run the SQL in the shell I don’t get Croatian letters, so I edited it by hand. This is the raw output; everything works as it should. :)

    +--------------+
    | rijec        |
    +--------------+
    | Aboriđin     |
    | Aboriđini    |
    | Ante         |
    | Branimir     |
    | Cipela       |
    | Čazma        |
    | Ćevapčići    |
    | Džak         |
    | džak         |
    | Džamija      |
    | džamija      |
    | Đak          |
    | đak          |
    | Đevđelija    |
    | Inat         |
    | Init         |
    | Inozemstvo   |
    | Interes      |
    | Injekcija    |
    | Ipsilon      |
    | Kutina       |
    | Livno        |
    | Lovor        |
    | Ljubav       |
    | Ljubljana    |
    | Neven        |
    | Nivas        |
    | Nosorog      |
    | Njivice      |
    | Onomatopeja  |
    | Šišmiš       |
    | Zagreb       |
    | Žaba         |
    +--------------+
    33 rows in set (0.00 sec)

  • yes this is much better :)

  • Hi!
    Is it possible for you to add a link to a Windows-compiled version with this patch?

    thanks!

    -matija kancijan

  • sorry kanc, no windows version of server software in nivas. just good old un*x.

  • This is all very nice; I’m glad someone remembered the Balkan countries and languages too. What is unclear to me is that when someone from Croatia sets the character set to utf8mb4 and picks the collation utf8mb4_croatian_ci, they don’t get ordering by the Croatian alphabet but something “off”…. I’d just like to know which clever head came up with the idea of doing the Croatian collation that way!? Brooo….
