Home > Articles > Which encoding to choose in MySQL - utf8 or utf8mb4 (utf8mb4_general_ci, utf8mb4_unicode_ci or utf8mb4_0900_ai_ci). How do they differ, how are they decoded and possible errors

Which encoding to choose in MySQL - utf8 or utf8mb4 (utf8mb4_general_ci, utf8mb4_unicode_ci or utf8mb4_0900_ai_ci). How do they differ, how are they decoded and possible errors

02.07.21 IT / Database

34844

When setting up a connection to a database (DB), it may be difficult to select the database encoding . Usually, a whole list of encodings is offered, or rather collations (comparisons or character sets), and in each version of the DBMS the proposed encoding may differ.

kakuyu-kodirovku-vybrat-v-mysql

For example, previously the default set was utf8_general_ci . The user may not know which encoding is used, since the choice may occur automatically when installing ready-made web applications. The encoding can be used by default when manually creating a database using, for example, phpMyAdmin . The selected encoding applies to all database tables and this affects how the data will be processed during queries. For example, you may find that the data selection is not case-sensitive or some characters from other languages and other objects (smilies, etc.) are not saved.

What encoding should I choose for databases and tables ? For most projects, it is recommended to choose from a subset of utf8-specific encodings. But there are differences in the names of the comparisons. The utf8 collations are 3-byte collations, they do not specify mb3 for simplicity. Plain utf8 has MySQL specific restrictions that do not allow characters higher than 0xFFFD.

For older applications, it might be worth using utf8_general_ci , for newer applications, utf8mb4_general_ci, utf8mb4_unicode_ci or utf8mb4_0900_ai_ci . The preferred option is not general, but unicode. They differ in that utf8mb4_general_ci is slightly faster when performing sorting, but collation problems may occur for some languages, while utf8mb4_unicode_ci does not have this disadvantage.

How are encoding names deciphered ? Let's take an example utf8mb4_0900_ai_ci . Here:

- utf8 denotes encoding ;

- mb4 denotes the version or how many bytes are used in data processing for one character. If not specified, mb3 is usually implied ;

- 0900 denotes the version of the Unicode Collation Algorithm (UCA) on which the collation is based. If not specified, version 4.0.0 is usually assumed ;

- ai denotes diacritical insensitivity (for example, ancient Greek ᾱ, ᾰ). If not specified, ai or as is implied depending on the next part in the comparison name, i.e. ai for ci and as for cs ;

- ci stands for case insensitive means that there will be no differences between uppercase and lowercase characters in database queries. There are also versions of cs that are case sensitive .

Various errors and warnings can be seen in the report of the system being used. They usually suggest switching to using modern encodings from the utf8mb4 package. For example, in Drupal, you can see the line in the status report:

Database 4 byte UTF-8 support - Disabled. 4 byte UTF-8 for mysql is disabled. See the documentation on adding 4 byte UTF-8 support for more information .

This means that the system recommends using the new 4-byte encodings instead of the old ones from the utf8 collection.

When writing program code, an error of the form: Unknown collation: 'utf8mb4_0900_ai_ci' (Unknown collation: 'utf8mb4_0900_ai_ci') may occur. In most cases, this means that the required encoding is not available on the database server. For example, utf8mb4_0900_ai_ci is a new collation that is only available since MySQL 8.0. Also, the error may appear if the encoding intended for MySQL is used in another DBMS, for example, in MariaDB. The encoding sets differ from version to version, as well as for different DBMS.