Collation: A Comprehensive Overview
Collation applies to all database management systems and is used to dictate how characters will be treated. Different languages often use character sets particular to their national language. Many languages have national characters that must be accounted for when sorting and comparing character values.
For more background on collation, it is useful to understand a little about how operating systems and applications such as database management systems deal with national character sets and how they are sorted. One of the first questions you are asked when installing a new operating system is what country you are in and what dialect of the national language you are using. This maps your keyboard layout, including national language characters like currency symbols on your keyboard to the correct bit string that the operating system understands. Operating systems and database systems refer to these sets of characters as code pages, which are usually proprietary, but often use standards such as Unicode. You will see in the examples references to UTF8 and Latin character sets. Once we have established the character set to be used by the underlying operating system and database management systems, we can concern ourselves with sort orders which can vary by character set.
A collation specifies the bit patterns that represent each character in a dataset and determine the rules used to sort and compare data. Collation attributes can have an impact on query results because they govern how sorting and comparison operations function. The clauses of an SQL SELECT statement, such as JOINS and ORDER BY that, compare text data of datatypes including CHAR, VARCHAR, NCHAR and NVARCHAR. Codepages and character sets specified by the collation determine which characters are displayable.
Collation settings set the default for all system databases and user databases created by that instance, including temporary tables.
Examples of Collations Options
When you install a SQL Server database, the default collation is dictated by the operating system’s locale. For example, US English (en-US), the locale is SQL_Latin1_General_CP1_CI_AS. You can override the operating system defaulting in a CREATE DATABASE statement using the COLLATE clause. If the database is created, the ALTER statement can be used to make changes to the database definition.
The existing user databases don’t get migrated to the new setting, so you must use a cumbersome process of exporting all the user databases, setting the master collation, followed by imports of the user databases.
Below is an example of changing an existing column’s setting:
ALTER TABLE dbo.UserTable ALTER COLUMN DepartmentName VARCHAR(80) COLLATE Latin1_General_100_CI_AI_SC_UTF8;
SQL Server supports several options when specifying a collation, including case and sensitivity.
SQL Server lets you specify for the following levels:
- Server-level – Sets a default collation for all databases created on that server
- Database-level – This allows you to set a collation for just that database using CREATE or ALTER commands with the COLLATE clause
- Column-level – You can use the ALTER statement for column level specification
- Expression-level – Lets you set the collation to use in an ORDER BY or SORT BY clause
We have covered SQL Server in some detail. To get a broader perspective, we can look at other databases:
PostgreSQL – Allows specifying the sort order and character classification behavior of data per column or per operation. Unlike SQL Server, collations at the database level cannot be changed after creation. However, you can override the default using the COLLATE clause in the SQL statement as in the example below:
SELECT a < ('Joe' COLLATE "fr_FR") FROM MyTable;
MySQL – Collations are set at database creation time using the following syntax:
CREATE DATABASE db_name CHARACTER SET latin1 COLLATE latin1_swedish_ci;
After creation, the ALTER DATABASE command lets you change the collation for a given database.
Similarly, you can use the COLLATE clause to CREATE or ALTER a table.
Actian X – Supports Unicode and System-defined collation set at database creation time.
Collations are generally only complex if you are catering for multiple national languages and character sets in your database-centric application. The vast majority of applications use the default provided by the operating system’s locale, so system admins and database users are blissfully unaware of collations.
Visit our website to learn more about Actian database products and solutions.