Perl, DBI and MySQL utf8mb4 Character Set Support

Perl LogoMySQL’s modern UTF-8 encoding is named utf8mb4 (4 bytes), not utf8 (3 bytes.)

For new applications, especially web, you should start with utf8mb4. For existing applications, you need to decide if an upgrade is worthwhile, and test extensively before a production upgrade.

Caveats:

  • Note that changing your database character set for production applications should be treated seriously, like the major project that it is.
  • If all your data is in fact currently US-ASCII, then the database migration will be easy since it is a subset of UTF-8. (However your applications may need to do Unicode normalization of strings before insert, for comparisons to work later.)
  • Be careful with converting binary columns to UTF-8, like blobs. The result may be undefined, so test.
  • Since collation is language-specific, the various Unicode collations are almost never the right ones
  • You do need UTF-8 for people and place names, but there’s no reason to use UTF-8 for columns that will always be US-ASCII, like database id’s, IPv6 addresses, etc. MySQL has to allocate more space for UTF-8, so it is a disadvantage when not needed.

Testing utf8mb4

  1. ensure you’re choosing an up-to-date version of MySQL 5.6 or 5.7 that if necessary supports long keys (>768 bytes) using DYNAMIC/BARRACUDA with SHOW VARIABLES LIKE “innodb_file_format”;
  2. identify some test strings using ie. emoji and write some small test apps in each of the application languages you support
  3. dump and restore your data on a test instance, especially if you have Asian or emoji characters. Run mysql_upgrade.
  4. convert your schemas, tables and columns to utf8mb4. Update my.cnf.
  5. test using the test programs in #1
  6. update your business applications’ client settings and get acceptance testing. Now is a good time to write a central database connection function, and also a central transaction retry function.
  7. test your database tools, including backup and restore.

Production Migration to utf8mb4

  1. upgrade client libraries on application servers and verify
  2. update all applications in advance if possible and verify
  3. schedule downtime
  4. during downtime, deploy new applications settings on all servers and dump and migrate database
  5. do acceptance testing

If you have applications written in Perl, you need to first upgrade DBD::mysql to a version greater than 4.041. (Even CentOS 7 comes with only 4.023.)

Before (CentOS 7):

$ perl -e 'use DBD::mysql; print $DBD::mysql::VERSION'
4.023

# mysql_config
-bash: mysql_config: command not found

After:

# yum install mysql-devel
# cpan DBD::mysql

$ perl -e 'use DBD::mysql; print $DBD::mysql::VERSION'
4.043

blogs.perl.org: DBD::mysql – all your UTF-8 bugs are belong to us!!
SO: Trying to install Perl-Mysql DBD, mysql_config can’t be found
MySQL utf8 vs utf8mb4 – What’s the difference between utf8 and utf8mb4? HN
mysqlserverteam.com: MySQL 8.0: When to use utf8mb3 over utf8mb4?
Using Databases with DBI: What Not To Do

This entry was posted in Linux, MySQL, MySQL Cluster, Open Source, Oracle, Perl, Tech. Bookmark the permalink.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.