
Fixing Column Encoding Mess in MySQL

December 18, 2007
Author
Andrew Aksyonoff

Just had an interesting issue with an encoding mess on a column containing non-ASCII (Russian) text. The solution was not immediately obvious, so I decided it’s worth sharing.

The column (actually the whole table) was created with DEFAULT CHARSET cp1251. Most of the data was indeed in proper cp1251 national encoding. However, because the web application failed to set the encoding properly, some of the rows were actually in UTF-8. That needed to be fixed.

Simply using CONVERT(column USING xxx) did not work, because MySQL treated the source data as if it were in cp1251. One obvious solution would be to write a throwaway PHP script which would SET NAMES cp1251, pull the offending rows (they’d come out in UTF-8), iconv() them to proper cp1251, and UPDATE them with the new values.

However, it’s possible to fix the issue within MySQL. The trick is to tell it to treat the string coming from the table as binary, and then do the charset conversion:
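A statement along these lines does the trick (table, column, and the id=123 row are placeholders; reconstructed from the statements quoted in the comments below):

```sql
-- Treat the stored bytes as raw binary, re-label them as UTF-8,
-- then convert that to the column's cp1251 charset:
UPDATE table SET column =
  CONVERT(CONVERT(CONVERT(column USING binary) USING utf8) USING cp1251)
WHERE id = 123;
```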

This can be further simplified. After the 2nd conversion the result is in UTF-8, and this time MySQL also knows that it’s UTF-8. So it will perform the conversion to the per-table charset automatically, and the 3rd explicit CONVERT can be omitted.
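With that, the statement shrinks to (same placeholder names):

```sql
-- MySQL converts the resulting UTF-8 value to the column's
-- cp1251 charset automatically on assignment:
UPDATE table SET column = CONVERT(CONVERT(column USING binary) USING utf8) WHERE id = 123;
```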

The same trick could be applied to fix the notorious KOI8-R vs. CP1251 encoding mix-ups in Russian, and, I suppose, other national SBCS vs. UTF-8 issues as well. But, of course, ideally you’d always set the encoding properly in the web application and avoid these issues altogether.
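For instance, the same pattern applied to KOI8-R bytes that ended up in a cp1251 column would look like this (placeholder names again):

```sql
-- Re-label the stored bytes as KOI8-R, then convert to cp1251 for real:
UPDATE table SET column =
  CONVERT(CONVERT(CONVERT(column USING binary) USING koi8r) USING cp1251)
WHERE id = 123;
```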

33 Comments
jrabbit
18 years ago

I was just about to tackle the exact opposite problem when I read this: I have a table that is defined as latin1_swedish_ci, but all the string fields actually contain valid UTF-8. Is there a way to convert the table definition to utf8 without it trying to transcode the strings, which are already in the correct encoding?

Converting the UTF-8 to latin1_swedish_ci first is not an option, as the strings contain characters not representable in the 1-byte character sets; I need to maintain the full UTF-8 character set during the change.

Is there a way to apply this technique in this situation?

Teak
18 years ago

Why not just convert it to binary (nothing is really converted, the column is just marked as binary), and then from binary to the real encoding of the data (again nothing is really converted, all the data are left untouched)? I remember I did something like this many times in the past.

Scott
18 years ago

The manual shows doing the same thing, except altering the column type to binary and back. The CONVERT method is a little smoother, since it’s all one statement.
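The two-step ALTER approach from the manual looks roughly like this (assuming a TEXT column comments in a table t1; pick BLOB or VARBINARY to match your column type):

```sql
-- Drop the charset label without touching the stored bytes:
ALTER TABLE t1 MODIFY comments BLOB;
-- Re-label the same bytes as UTF-8, again without conversion:
ALTER TABLE t1 MODIFY comments TEXT CHARACTER SET utf8;
```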

http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html

Teak
18 years ago

Altering the table type should be much quicker on big tables.

Peter Zaitsev
18 years ago

Teak,

I think you missed Andrew’s point a bit. It is not about the case where ALL the data in the table is in the wrong encoding, but about a single row (note id=123 in the statements), or maybe several rows.

Sven Neuhaus
18 years ago

Thanks for the helpful article and comments.
So if the encoding is wrong (for example latin1 data in a column marked utf8) I would CHANGE COLUMN to binary (no conversion), then CHANGE COLUMN to latin1 (no conversion) and then CHANGE COLUMN to utf8 (conversion). I haven’t tried this yet but it sounds as if it’d work.
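Sketched out, that sequence would be something like the following (hypothetical table t with a VARCHAR(255) column name; untested, per the comment above):

```sql
ALTER TABLE t MODIFY name VARBINARY(255);                    -- no conversion, drop the charset label
ALTER TABLE t MODIFY name VARCHAR(255) CHARACTER SET latin1; -- no conversion, re-label as latin1
ALTER TABLE t MODIFY name VARCHAR(255) CHARACTER SET utf8;   -- actual latin1 -> utf8 conversion
```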

Mohiuddin Khan Inamdar
18 years ago

I need some suggestions from you. I am working on a web application which has partly English and partly Simplified Chinese text.
English is used throughout the system. Chinese is used only to store additional information like addresses and the Chinese names of employees. I had all my tables as default latin1 with collation latin1_swedish_ci. But as I had to add new fields to show Chinese info to employees who cannot handle English as their first language, I have changed all my fields to UTF-8, so that I can simply store Chinese anywhere.
However, I have only about 10 columns in the whole database which use Chinese. So is this the right thing to do?
The options I felt were right (anybody can point out if I am wrong):
1. Keep everything as UTF-8, collation utf8_general_ci.
2. Keep all English columns as latin1 with collation latin1_swedish_ci and Chinese columns as UTF-8, collation utf8_general_ci.
The database is stored on a local server (intranet), not on the Internet, so storage is not an issue.
Any suggestions are welcome.
Thanks in advance.

Mohnkhan
http://www.mohitech.com

SCC
18 years ago

I wish there were a way to change an entire table at once instead of going column by column. I need to convert dozens of columns from latin1_swedish_ci to UTF-8.

Jerk
18 years ago

@Sven Neuhaus: That worked perfectly for latin1-encoded chars in a UTF8 col. THANX! That saved our lives.

Jeroen Sen
17 years ago

I’ve used your conversion a bit for converting a database character set from latin1 to utf8. (Just like SCC above.) But I have a problem with some special characters like ‘ë’ and ‘ä’, for example. When these characters are converted to their binary counterpart they are present in the database. But when I convert the database into utf8, the contents of the fields containing these symbols are dropped starting from the symbol. Does anyone have a suggestion?

Thnx in advance!

andreea
17 years ago

Hi

tamcy
17 years ago

Perhaps a more complicated issue:

The table is created with charset utf8 – correct.
A data is exported with charset utf8 – also correct.
But things went wrong when trying to import the data into MySQL with the “.” syntax. The user forgot to run “SET NAMES utf8”, so MySQL saw all the data as latin1.

Now, the script can use “set names latin1” to get the utf8 data back.
However, “CONVERT(CONVERT(offending_col USING binary) USING utf8)” doesn’t work; the same wrong garbage characters are shown.

tamcy
17 years ago

Sorry – answer to myself, but just realized the way to fix it and would like to share with the others.

As I said, the column expects utf8 data, and the data is correct when “set names latin1” is issued.
This means I can get the correct (utf8) data when MySQL performs a “utf8 -> latin1” conversion.
In order to fix the encoding I need to treat this “latin1” data as binary, then convert it to utf8.
And so here is the statement I need:

UPDATE table SET col = CONVERT(CONVERT(CONVERT(col USING latin1) USING binary) USING utf8);

brigada
17 years ago

phpMyAdmin makes this very easy to change: open your database in phpMyAdmin, select a table on the left, select the field(s) you need to change, click the “With All: CHANGE” button, and change the collation. Depending on your number of tables (I had 40) this takes about 10 min. max, and it’s error-free.

You can test it out on any old database you have if you wish before doing it “live” 🙂

Thales Jacobi
16 years ago

Thanks Shodan. Very very useful..

sbf
16 years ago

Thank you very much, Shodan! Just what I’d been looking for…

Ken
16 years ago

I’ve tried to follow this guide, but have no idea what “123” is. Am I supposed to replace certain words in the following statement with something else?

> UPDATE table SET column=CONVERT(CONVERT(column USING binary) USING utf8) WHERE id=123;

Also, I’m just not able to do this thing on my own. Anyone can help me convert my data for a fee? Please contact me at sanbat[@t]gmail.com

Ken
16 years ago

Btw, the situation is, I have a db where most of the fields are latin1_swedish_ci and filled with Chinese data. The site displays fine on the frontend. I need to convert all the data to utf8.

EON
15 years ago

Well, my data is so munged that I don’t even know what to do with it. The table is InnoDB, utf8, the column is utf8; the original data we imported may have been UTF-8; the tables were MyISAM. It came out of SQL Server into MySQL, and may have been converted to UTF-8 during that first import.

I tried (with a backup, of course) changing the column to binary, then latin1, then back to utf-8, and the data looks exactly the same: ¡§de-mystify¡¨. (That’s a word with what I believe to be MS Word curly quotes around it, but they’re coming out as an upside-down exclamation mark followed by, respectively, the “section” glyph and an umlaut.) In the db, I see ¡§ for the first one after the conversion (addition of a capital A with circumflex). On a web page (utf-8 in the header and content-type meta tag), it is as above.

Searching everywhere, I have found nothing that tells me what character set encoding that is, and I’ve tried many. When it was sensed (by PHP) as ISO-8859-1, it prints as “��” (two black-diamond question marks, whether the web page showing it is in utf-8 or ISO-8859-1), though if I display it converted to utf-8, it’s back to upside-down exclamation marks and “section” glyphs, etc. And some characters look like „« (double-comma, double-left-angle-brackets), or this (in my Navicat view): „X .

I’m sadly guessing that there is no solution for this within MySQL. Does anybody have information to share?

Thales Jacobi
15 years ago

Hi EON, sometimes binary strings look like a strange language like that. Have you tried to retrieve this information using PHP binary functions as well? Just by looking at them it will be impossible to know what it is…

EON
15 years ago

Well, I took a look at the PHP functions, and didn’t find any binary functions that seem likely to help. They have bin2hex… what would I do with hex data? And some comparison functions, and that’s all. What would I compare them against? But thanks for giving me an idea, at least. If you could explain a bit further, I’d be most thankful.

Tried a REPLACE on test data in MySQL: SET field = REPLACE(field, ‘•’, ‘•’);
MySQL said it happily replaced the string in lots of fields (and searching that string returns no rows), but when viewing in a browser, they’re still there.

Daniel
15 years ago

Thanks, this works perfectly and saved my life!

Tadas S
14 years ago

Arggh. I have a similar problem and nothing works. MyISAM tables with latin1 encoding.

Tried:
ALTER TABLE CONVERT TO CHARACTER SET utf8;

It does nothing except change the table/column encodings; there are no actual content changes.

So… after this MySQL thinks my table is utf8, and I try the 2nd way mentioned in this blog post:
UPDATE table SET column=CONVERT(CONVERT(column USING binary) USING utf8) WHERE id=123;

And it… does nothing. The £ sign stays the same; nothing changes.

Server version: 5.1.61-0ubuntu0.11.10.1

Tadas S
14 years ago

Before banging your head on the table after “ALTER TABLE CONVERT TO CHARACTER SET utf8”, try looking at your columns via HEX(column_name). It did convert everything very well. My problem was not setting the connection encoding, i.e. “SET NAMES utf8”.
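A quick sanity check along those lines (hypothetical table and column names):

```sql
SET NAMES utf8;
-- Compare the raw stored bytes against what the client renders:
SELECT name, HEX(name) FROM t WHERE id = 123;
```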

Vulpes M
14 years ago

Excellent! This is the solution I was looking for for the same problem.


Said Bakr
12 years ago

There is a magical question: how can I determine the real encoding of the data? For example, I use phpMyAdmin to view a table’s data; the data should be Arabic text, but it renders strangely. I have no idea about its real encoding, so I cannot perform the CONVERT. I just need it to be utf8_general_ci. However, this data was copied (copied meaning inserted using SQL INSERT) from a table with latin1_swedish_ci, and to be rendered correctly on the application pages the meta tag charset has to be set to windows-1256.

