Thứ Sáu, 30 tháng 7, 2010

Convert latin1 to UTF-8 in MySQL

Since MySQL 4.1, UTF-8 is the default charset. If you have an old database, containing data encoded in latin1, and you want upgrade to a newer MySQL server, then you can do the following if you want all your data in UTF-8 instead:

Be careful when switching to UTF-8. Once you have converted your data, any program/webapp that uses the database will have to check that the data they are sending to the database is valid UTF-8. If it isn't then MySQL will silently truncate the data after the invalid part, which can cause all sorts of problems. If your program/webapp doesn't specifically say that it supports unicode then you may want to stick with latin1 instead.

== MySQL dump ==

First of all, we need to dump the old data into a file.

{{Command|mysqldump -h example.org --user=foo -p --default-character-set=latin1 -c \
--insert-ignore --skip-set-charset dbname > dump.sql
}}

Please note: You have to replace the user, the host and the dbname, otherwise it won't connect to your database.

== Convert dump ==

Next thing to do is, converting the characters in the MySQL dump from latin1 to UTF-8

{{Command|iconv -f ISO8859-1 -t UTF-8 dump.sql > dump_utf8.sql}}

If you see after your conversion, that umlauts in your database are converted correctly, but that the sign ß and € are broken, you might get it working by using -f CP1252 instead of -f ISO8859-1 in this command.

{{Command|perl -pi -w -e 's/CHARSET=latin1/CHARSET=utf8/g;' dump_utf8.sql}}

There may be other places in your database, where latin-1 character set is used. For example, there may be lines in your dump similar to:

{{command|ar_title varchar(255) character set latin1 collate latin1_bin NOT NULL default }}

The best is to either grep your dump against work "latin1" or look for those lines in some text editor (don't forget to use case-insensitive search)

If you have another source charset, you need to replace the -f option with your local character set.

== Drop and create ==

Now it's time to drop the old database and create a new one with UTF-8 support.

{{Command|mysql --user=foo -p --execute="DROP DATABASE dbname;
CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
}}

(MySql seems to recommend utf8_unicode_ci over utf8_general_ci for 5.1 +, see http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html)

== Import dump to database ==

Last but not least, we need to import the converted data back to the new database.

{{Command|mysql --user=foo --max_allowed_packet=16M -p --default-character-set=utf8 dbname < dump_utf8.sql}}

The max_allowed_packet option is sometimes important. If your import ends up with a "ERROR 1153 at line 42: Got a packet bigger than 'max_allowed_packet'", you need to increase the packet size. To do this, edit {{Path|/etc/mysql/my.cnf}} and set {{Codeline|max_allowed_packet=16M}} under the ''[mysqld]'' directive. This is because utf-8 coded characters will take more space than latin-1.

Dont forget to restart your mysql server.

== Alternative Method ==

Instead of using the above method of dumping and reloading your data, you can use the ALTER TABLE/DATABASE SQL statements to convert your data.

For each table you want to convert to utf8, use the following statement.

{{Command|ALTER TABLE CONVERT TO CHARACTER SET utf8;}}

This will automatically convert all text columns to utf8.

To set the default character set for a database, use the following statement.

{{Command|ALTER DATABASE CHARACTER SET utf8;}}

This does not affect any existing tables, but any future tables created in this database will use utf8 by default.

Note this warning from: http://dev.mysql.com/doc/refman/5.0/en/alter-table.html

"The CONVERT TO operation converts column values between the character sets. This
is not what you want if you have a column in one character set (like latin1)
but the stored values actually use some other, incompatible character set (like
utf8). In this case, you have to do the following for each such column:

{{Command|ALTER TABLE t1 CHANGE c1 c1 BLOB;}}
{{Command|ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;}}

The reason this works is that there is no conversion when you convert to or from
BLOB columns."

Không có nhận xét nào: