Those who maintain a non-english/bangla blog, forum or site, this type of problems are kinda common. The most common problem is seeing ??????
, and fix: go and change collation to utf8
But my problem started from this simple (yes it looked simple ) problem but ended up a nightmare.
I was running a ipboard instance. When importing a bangla translation, the text was showing as ??????
, so i applied the common fix, changed all table and field collation to utf8_unicode_ci and ipb's encoding to utf8. Now the site became inaccessible, showing a preety "Fatal ERROR : Your settings could not be read by IP.Board". That was fixed by running ipb's upgradeFinish
script but the real problem starts now.
The Problem:
All bangla letters became this :
Looks pretty right ? :v
The reason behind this was all bangla chars were inserted in mysql using latin1
encoding. If I set mysql client's encoding to latin1 and read, i get real bangla. If I set encoding to utf8 and read, i get latin-fied bangla like the picture.
Failed Attempts:
I immediately went on a long google search and tried several solutions, but unfortunately, none of them were able to solve the problem completely. These two links (1 2) were a great help though, which helped to understand the problem thoroughly.
The solution proposed in first link, was a partial success. Change the fields in squence of utf8 > latin1 > blob > utf8. The trick in converting to blob is it fools mysql to convert latin1 to utf8 without running any conversion algorithm. It works when checking a single field, but not for all, the whole database of ipb is kinda complex. Some field were primary field and mysql not allows them to altered to blob. I ran a script to loop through all table,all field and alter them but ended up with a bunch of error and corrupted database.
Then i changed my focus to the mysql dump file. If i can change the latin text inside the sql file to unicode text, problem will be gone. I tried enca(1) but it failed to change anything. Tried recode(1) without any success.
Then i created a script in php to read the whole sql line by line and feed them through utf8_encode. Which made the unreadable latin text to unreadable capital letter latin.
That was a mistake, actually i need to use utf8_decode. So now, the texts were back to bangla, but also many(almost all) chars are missing. Only first letter of a word or one word of a sentence was there, the rest vanished into the wind.
Next, tried iconv(1) which striped away half of the database
The Real Solution:
The idea is to insert the db to mysql as latin1, dump the db as latin1. Edit the dump to change it to utf8 and insert it as utf8.
- First task is to insert the mysqldump in latin1 encoding and in latin1 collation. So i took a bit help from sed.
1 2
sed -i 's/CHARSET=utf8 COLLATE=utf8_unicode_ci/CHARSET=latin1/g' database.sql sed -i 's/COLLATE utf8_unicode_ci/COLLATE latin1_bin/g' database.sql
Now i inserted this database.sql into mysql. Now we need to extract it as latin1, so applying
--default-character-set=latin1
flag.I opened the database_latin.sql and saw the text was written in bangla.1
mysqldump --default-character-set=latin1 -u user -p dbname > database_latin.sql
Now we need to edit the dump file and change it to utf8, as we are gonna insert it as utf8 now. again, sed to rescue.
1 2 3 4
sed -i 's/SET NAMES latin1/SET NAMES utf8/g' database_latin.sql sed -i 's/CHARSET=latin1/CHARSET=utf8 COLLATE=utf8_unicode_ci/g' database_latin.sql sed -i 's/COLLATE latin1_bin/COLLATE utf8_unicode_ci/g' database_latin.sql sed -i 's/CHARACTER SET latin1/CHARACTER SET utf8/g' database_latin.sql
Finally i inserted this database_latin.sql to mysql and voila no error, clear bangla and no chars missing
Ending Thoughts:
If you are making a bangla or non-english site, always start with a proper collation setting. Set all table and field collation to utf8_unicode_ci or utf8_general_ci. Make sure the mysql client (your app) speaks in utf8. If you are running raw mysql query, run set names utf8
at beginning. If using a framework or toolkit, check where to set this utf8 config. These will save you from infinite PITA later.
If you indeed missed the utf8 config and ended up in a mess like me, try to apply a solution that doesn't directy convert the encoding of the texts. You may end up with missing or corrupted chars.
NB: If you are following this post, dont copy-paste the commands, understand and adept the commands to your scenario. Forgive my poor writing and let me know your thoughts in the comment section.