Sharding & Data Distribution
The shard key determines which shard a document is stored on. There are two strategies: range based and hash based. Taking range based as an example, a [min, max] range of shard key values forms a chunk; a split picks a value in the middle of the range to divide the chunk into two, and a newly created chunk can then be migrated to another shard (migration).
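A minimal sketch of the two strategies in the mongo shell (the namespaces test.users and test.events and the field user_id are placeholders, not from the original notes):

sh.shardCollection("test.users", { user_id: 1 })      // range based: chunks cover [min, max] ranges of user_id
sh.shardCollection("test.events", { _id: "hashed" })  // hash based: documents spread by a hash of _id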
Running
Three kinds of processes are needed:
- shardsvr: the process that actually stores the data
- configsvr: three are generally enough
- mongos: started with --configdb host:port,host1:port1,host3:port3, and best run on the default port 27017. Any number of mongos processes is allowed, but usually more than one; the --configdb argument lists the config server addresses.

Once the processes are created and each replica set is initiated, connect to a mongos to add the shards, then set the shard key. If no shard key is set, data stays on the primary shard by default and is not distributed to the other shards.
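A rough sketch of starting the three process types, following the notes' syntax where config servers are given as a comma-separated list (host names, dbpaths, and the non-default ports are placeholders):

mongod --shardsvr --replSet shardA --port 27018 --dbpath /data/shardA   # stores the actual data
mongod --configsvr --port 27019 --dbpath /data/config                   # one of three config servers
mongos --configdb cfg1:27019,cfg2:27019,cfg3:27019 --port 27017         # query router on the default port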
sh.addShard("host:port")  // adding the primary of each shard's replica set is enough
sh.enableSharding(dbname)
sh.shardCollection("dbname.collection", key, unique)  // unique is an optional boolean
sh.shardCollection("collectionname", { shardkey: 1 }, true)

Shard Key Selection
- sufficient cardinality: enough distinct values to spread data out
- consider compound shard keys (see the sketch after the layout note below)
- Is the key monotonically increasing? If the shard key grows monotonically, the last shard will end up storing too much data. Note: BSON ObjectId, i.e. the _id field, grows monotonically.

Process and Machine Layout
Shards, and the replica sets underneath them, should be spread across different machines. A config server can run on the same machine as any replica set, and mongos can likewise run on the same machine as a replica set, or on the user's client machine. Example: 3 shards, a 3-member replica set per shard, 3 config servers, and 6 mongos processes might use 9 machines.
Note: a replica set is best kept at an odd number of members.
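A minimal compound-shard-key sketch, assuming a hypothetical mydb.events collection with customer_id and ts fields: leading with customer_id keeps inserts spread across chunks, whereas a monotonically increasing key such as ObjectId alone would funnel every insert into the last chunk.

sh.enableSharding("mydb")
sh.shardCollection("mydb.events", { customer_id: 1, ts: 1 })  // cardinality from customer_id, ordering from ts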
Bulk Inserts and Pre-splitting
During a bulk insert, data is first written to a single shard and only then split and migrated. If data is inserted faster than it can be migrated, consider pre-splitting:
sh.splitAt(collection, { key: val })  // add split points ahead of time
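For example, a sketch of pre-splitting before a bulk load (mydb.users and the userId boundaries are hypothetical): creating the split points up front lets the balancer distribute the empty chunks across shards before the inserts arrive.

for (var i = 1; i < 10; i++) {
    sh.splitAt("mydb.users", { userId: i * 1000 })  // split points at userId = 1000, 2000, ..., 9000
}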
Tips and Best Practices
- only shard big collections
- pick the shard key carefully (it cannot be changed once set)
- consider pre-splitting before a bulk load
- be aware of monotonically increasing shard key values on inserts
- adding shards is fairly easy, but it isn't instantaneous
- always connect to mongos, except for some DBA work; put mongos on the default port
- use logical config server names
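To confirm the cluster is wired up, connect to a mongos on the default port (the host name below is a placeholder) and inspect the layout:

// from a client shell: mongo --host mongos-host --port 27017
sh.status()  // prints shards, sharded databases, and chunk distribution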