Natural Language Sorting with MongoDB 3.4

London, UK

Friday, December 16th 2016, 11:51 GMT

Arranging English words in order is simple, most of the time: you arrange them alphabetically. Sorting a set of German words, or French words with all of their accents, or Chinese with its different characters, is a lot harder than it looks. Sorting rules are specified through locales, which determine how accents are sorted, in which order the characters appear, and how to do case-insensitive sorting. A good set of those sorting rules is available through CLDR, and there is a neat example to play with all kinds of sorting at ICU's demo site. If you want to know how the algorithms work, have a look at the Unicode Consortium's report on the Unicode Collation Algorithm.

Years ago I wrote about collation and MongoDB. There is an old issue in MongoDB's JIRA tracker, SERVER-1920, to implement collation so that sorting and indexing could work according to the sorting order defined for each language (locale).

Support for these collations has finally landed in MongoDB 3.4, and in this article we are going to have a look at how they work.

How Unicode Collation Works

Many programming languages have their own implementation of the Unicode Collation Algorithm, often provided through ICU. PHP has an ICU-based implementation as part of the intl extension, in the form of the Collator class.

The Collator class encapsulates the Unicode Collation Algorithm to allow you to sort an array of text yourself. It also allows you to visualise the "sort key" to see how the algorithm works.

Take for example the following array of words:

$dictionary = [
    'boffey',
    'bøhm',
    'brown',
];

Which we can turn into sort keys, and sort using the en locale (English):

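$collator = new Collator( 'en' );
$dictionaryWithKey = [];

foreach ( $dictionary as $word )
{
    // The binary sort key is hex-encoded so that it can serve as an array key
    $sortKey = $collator->getSortKey( $word );
    $dictionaryWithKey[ bin2hex( $sortKey ) ] = $word;
}

// Sorting by key orders the words according to the locale's rules
ksort( $dictionaryWithKey );
print_r( $dictionaryWithKey );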

Which outputs:

Array
(
    [2b4533333159010a010a] => boffey
    [2b453741014496060109] => bøhm
    [2b4b45554301090109] => brown
)

If we did this according to the nb (Norwegian) locale, the output would have brown and bøhm reversed:

Array
(
    [2b4533333159010a010a] => boffey
    [2b4b45554301090109] => brown
    [2b5c6703374101080108] => bøhm
)

The sort key for bøhm has now changed, so that its numerical value makes it sort after brown instead of before it. In Norwegian, the ø is a distinct letter that sorts after z.
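As a minimal sketch of that point, the following JavaScript (the language of the MongoDB shell used later in this article) takes the hex sort keys printed above and orders them with a plain string comparison, without any collation logic. The name orderByKey is a hypothetical helper, and the keys are copied from the PHP output, so they may differ with another ICU version.

```javascript
// Hex-encoded sort keys copied from the PHP output above (assumed unchanged).
const englishKeys = {
  '2b4533333159010a010a': 'boffey',
  '2b453741014496060109': 'bøhm',
  '2b4b45554301090109': 'brown',
};
const norwegianKeys = {
  '2b4533333159010a010a': 'boffey',
  '2b4b45554301090109': 'brown',
  '2b5c6703374101080108': 'bøhm',
};

// Order the words by comparing their keys as plain strings. For lower-case
// hex strings this gives the same result as comparing the raw bytes.
// orderByKey is a hypothetical helper, not part of any library.
function orderByKey(keyed) {
  return Object.keys(keyed)
    .sort((a, b) => (a < b ? -1 : a > b ? 1 : 0))
    .map((key) => keyed[key]);
}

const englishOrder = orderByKey(englishKeys);
const norwegianOrder = orderByKey(norwegianKeys);

console.log(englishOrder.join(', '));   // boffey, bøhm, brown
console.log(norwegianOrder.join(', ')); // boffey, brown, bøhm
```

All the language-specific knowledge lives in the key itself: once the keys are generated, a byte-wise comparison is enough to get the right order.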

MongoDB 3.4

Before the release of MongoDB 3.4, it was not possible to do a locale-based search. As case-insensitivity is just another property of a locale, that was not supported either. Many users worked around this by storing a lower-case version of the value in a separate field just to do a case-insensitive search. This has now changed with the implementation of SERVER-1920.
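To illustrate why case-insensitivity belongs to the collation rather than to the stored data, here is a minimal JavaScript sketch using the standard Intl.Collator API. It does not talk to MongoDB at all. The sensitivity option 'accent' is assumed here to behave comparably to a collation that ignores case but not accents, and equalIgnoringCase is a hypothetical helper name.

```javascript
// A collator for which accents matter but letter case does not.
const caseInsensitive = new Intl.Collator('en', { sensitivity: 'accent' });

// Hypothetical helper: true when both strings are equal apart from case.
function equalIgnoringCase(a, b) {
  return caseInsensitive.compare(a, b) === 0;
}

// The old workaround: compare against a separately stored lower-case copy.
const storedLowerCase = 'Brown'.toLowerCase();
const workaroundMatch = storedLowerCase === 'brown';

// With a collation, the original value can be compared directly.
const sameWord = equalIgnoringCase('Brown', 'brown');
const differentAccent = equalIgnoringCase('résumé', 'resume');

console.log(workaroundMatch, sameWord, differentAccent);
```

The comparison rule is a setting of the collator, so the original value needs no second, normalised copy.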

In MongoDB 3.4 you may attach a default locale to a collection:

db.createCollection( 'dictionary', { collation: { locale: 'nb' } } );

The default locale is used for any query that does not specify a different locale itself. Compare the default (nb) locale:

> db.dictionary.find().sort( { word: 1 } );
{ "_id" : ObjectId("5846d65210d52027a50725f0"), "word" : "boffey" }
{ "_id" : ObjectId("5846d65210d52027a50725f1"), "word" : "brown" }
{ "_id" : ObjectId("5846d65210d52027a50725f2"), "word" : "bøhm" }

With the English (en) locale:

> db.dictionary.find().collation( { locale: 'en'} ).sort( { word: 1 } );
{ "_id" : ObjectId("5846d65210d52027a50725f0"), "word" : "boffey" }
{ "_id" : ObjectId("5846d65210d52027a50725f2"), "word" : "bøhm" }
{ "_id" : ObjectId("5846d65210d52027a50725f1"), "word" : "brown" }
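These two orderings can also be cross-checked on the client, without a server. The sketch below uses the standard Intl.Collator API and assumes a JavaScript runtime with full ICU locale data (recent Node.js releases, for instance); sortWords is a hypothetical helper name, and nothing here is a MongoDB API.

```javascript
// Sort a copy of the word list with the collation rules of the given locale.
// sortWords is a hypothetical helper, not part of any MongoDB API.
function sortWords(words, locale) {
  const collator = new Intl.Collator(locale);
  return [...words].sort(collator.compare);
}

const words = ['boffey', 'bøhm', 'brown'];

const norwegian = sortWords(words, 'nb');
const english = sortWords(words, 'en');

console.log('nb:', norwegian.join(', '));
console.log('en:', english.join(', '));
```

If the runtime lacks Norwegian locale data, the nb result may fall back to a different order, so treat this as a sanity check rather than a reference.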

The default locale of a collection is also inherited by an index when you create one:

db.dictionary.createIndex( { word: 1 } );
db.dictionary.getIndexes();
[
    …
    {
        "v" : 2,
        "key" : {
            "word" : 1
        },
        "name" : "word_1",
        "ns" : "demo.dictionary",
        "collation" : {
            "locale" : "nb",
            "caseLevel" : false,
            "caseFirst" : "off",
            "strength" : 3,
            "numericOrdering" : false,
            "alternate" : "non-ignorable",
            "maxVariable" : "punct",
            "normalization" : false,
            "backwards" : false,
            "version" : "57.1"
        }
    }
]

From PHP

All the examples below use the PHP driver for MongoDB (1.2.0) and the accompanying library (1.1.0). These are the minimum versions needed to work with locales.

To use the MongoDB PHP Library, you need to use Composer to install it, and include the Composer-generated autoloader to make the library available to the script. In short, that is:

php composer require mongodb/mongodb=^1.1.0

And at the start of your script:

<?php
require 'vendor/autoload.php';

In this first example, we are going to drop the collection dictionary from the demo database, and create a collection with the default collation en. We also create an index on the word field and insert a couple of words.

First the set-up and assigning of the database handle ( $demo ):

$client = new \MongoDB\Client();
$demo = $client->demo;

Then we drop the dictionary collection:

$demo->dropCollection( 'dictionary' );

We create a new collection dictionary and set the default collation for this collection to the en locale:

$demo->createCollection(
    'dictionary',
    [
        'collation' => [ 'locale' => 'en' ],
    ]
);
$dictionary = $demo->dictionary;

We create the index, and we also give the index the name dictionary_en. MongoDB supports multiple indexes with the same field pattern, as long as they have different names and different collations (e.g. locale, or locale options):

$dictionary->createIndex( [ 'word' => 1 ], [ 'nam
