
Cassandra Tutorial: Introduction

Original-work statement: this is an original article. If you reprint it, please show a link to the original at the very top. For a better reading experience, please read it on the source site; any revisions or corrections will only appear there. Last updated: 2016-08-28. Cassandra tutorial series: introduction.

I need to prepare an ETP course at work, titled "A Data Analysis Platform Built on Cassandra + Spark", so I have to write a tutorial on Cassandra. The odds of this tutorial being abandoned halfway should be low; after all, the pressure from the company is right there.

I am publishing the tutorial here first. If you have any suggestions or feedback, please leave a comment; it will also help me correct the course material.

The course is expected to cover the following topics:

A brief introduction to Cassandra: what it is, its strengths, and which companies use it
Cassandra internals: architecture, consistency and related properties, the gossip protocol
The data model (important): the key difference between NoSQL and an RDBMS; the design philosophies are simply not the same
The Java API: CRUD, multithreading, and the Future interface
Search: the DataStax Enterprise Solr solution will not be covered; a brief introduction to the Lucene plugin
Cassandra + Spark + Zeppelin as an integrated data-analysis solution: there should be a practical case study, details still to be decided

In truth, for any database solution there is much more that deserves coverage, especially maintenance and upgrades, but time is limited and my own experience in those areas is not rich enough yet to teach them well.

If and when my experience in those areas becomes richer and more complete, I will add that material.

I am travelling to Shanghai on business next week, so updates will come gradually.

This is an original article; please credit the source when reprinting.


StackExchange.Redis Helper Class Solution: the RedisRepository Wrapper (String Data Operations)


Copyright of this article is shared by cnblogs (博客园) and the author; reprints and crawlers must credit the original link: http://www.cnblogs.com/tdws/tag/NoSql/

Contents

1. Basic configuration wrapper

2. String data operations wrapper

3. Hash data operations wrapper

4. List data operations wrapper

5. Set data operations wrapper

6. Sorted Set data operations wrapper

7. Master-slave configuration and Sentinel configuration

2. String data operations wrapper

If the following passage does not make sense on a first read, come back to it after you have looked at the code:

It is worth introducing the ConnectionMultiplexer class first. It is the central class that StackExchange.Redis gives us: it hides the details of working with multiple servers because it does a great deal of work on our behalf, and it is designed to be shared and reused between callers. You do not need to create a new ConnectionMultiplexer for every operation; it is completely thread-safe. It exposes ConnectionMultiplexer.Connect and ConnectionMultiplexer.ConnectAsync for connecting to Redis, and the connection argument is either a string or a ConfigurationOptions object. The class implements IDisposable, so you can release it with using or Dispose when you no longer need it, but you should not release it frequently, because the whole point is to reuse it. You need a ConnectionMultiplexer in three situations: connecting to Redis, publish/subscribe, and accessing an individual server for maintenance or monitoring purposes. Beyond the basics, see the GitHub documentation: https://github.com/StackExchange/StackExchange.Redis/blob/master/Docs/Basics.md
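To make the sharing concrete, here is a minimal sketch of the usual pattern: one lazily created, application-wide ConnectionMultiplexer that every caller reuses. The connection string, key, and class names are placeholders and are not part of the wrapper built below.

using System;
using StackExchange.Redis;

public static class RedisConnection
{
    // Shared, lazily created multiplexer; reused by all callers because it is thread-safe.
    private static readonly Lazy<ConnectionMultiplexer> LazyConnection =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect("127.0.0.1:6379,abortConnect=false"));

    public static ConnectionMultiplexer Connection => LazyConnection.Value;
}

// Usage:
// IDatabase db = RedisConnection.Connection.GetDatabase();
// db.StringSet("demo:key", "hello");
// string value = db.StringGet("demo:key");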

We define the following String-type methods in IRedisClient:

If this looks like too much code, press Ctrl+M, O in Visual Studio to collapse it.


#region Assembly RedisRepository, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
// Author: 吴双 2016.8.28  Contact: wscoder@outlook.com
#endregion
using System;
using System.Collections.Generic;
using StackExchange.Redis;

namespace RedisRepository
{
    public interface IRedisClient
    {
        #region Redis String operations
        /// <summary>
        /// Redis String type: add a record
        /// </summary>
        /// <typeparam name="T">generic reference type</typeparam>
        /// <param name="key">unique key of value</param>
        /// <param name="value">value of key of type object</param>
        /// <param name="expiry">time span of expiration</param>
        /// <param name="when">enum: when to perform the set</param>
        /// <param name="commandFlags"></param>
        /// <returns>true or false</returns>
        bool StringSet<T>(string key, object value, TimeSpan? expiry = default(TimeSpan?), When when = When.Always, CommandFlags commandFlags = CommandFlags.None) where T : class;

        /// <summary>
        /// Redis String type: add a record
        /// </summary>
        /// <typeparam name="T">generic reference type</typeparam>
        /// <param name="key">unique key of value</param>
        /// <param name="value">value of key of type T</param>
        /// <param name="expiry">time span of expiration</param>
        /// <param name="when">enum: when to perform the set</param>
        /// <param name="commandFlags"></param>
        /// <returns>true or false</returns>
        bool StringSet<T>(string key, T value, TimeSpan? expiry = default(TimeSpan?), When when = When.Always, CommandFlags commandFlags = CommandFlags.None) where T : class;

        /// <summary>
        /// Use this method when updating; it makes the code more readable.
        /// </summary>
        /// <typeparam name="T"></typeparam>
        /// <param name="key"></param>
        /// <param name="value"></param>
        /// <param name="expiresAt"></param>
        /// <param name="when"></param>
        /// <param name="commandFlags"></param>
        /// <returns></returns>
        bool StringUpdate<T>(string key, T value, TimeSpan expiresAt, When when = When.Always, CommandFlags commandFlags = CommandFlags.None) where T : class;

        /// <summary>
        /// Redis String type: Get
        /// </summary>
        /// <typeparam name="T"></typeparam>
        /// <param name="key"></param>
        /// <param name="commandFlags"></param>
        /// <returns>T</returns>
        T StringGet<T>(string key, CommandFlags commandFlags = CommandFlags.None) where T : class;

        /// <summary>
        /// Redis String type: get the length of the string stored at the given key
        /// </summary>
        /// <param name="key"></param>
        /// <param name="commandFlags"></param>
        /// <returns></returns>
        long StringLength(string key, CommandFlags commandFlags = CommandFlags.None);

        /// <summary>
        /// Redis String type: append a value and return the resulting total length
        /// </summary>
        /// <param name="key"></param>
        /// <param name="appendVal"></param>
        /// <param name="commandFlags"></param>
        /// <returns>total length after the append</returns>
        long StringAppend(string key, string appendVal, CommandFlags commandFlags = CommandFlags.None);

        /// <summary>
        /// Set a new value and return the old one
        /// </summary>
        /// <param name="key"></param>
        /// <param name="newVal"></param>
        /// <param name="commandFlags"></param>
        /// <returns>OldVal</returns>
        string StringGetAndSet(string key, string newVal, CommandFlags commandFlags = CommandFlags.None);

        /// <summary>
        /// Increment a numeric value by val
        /// </summary>
        /// <param name="key"></param>
        /// <param name="val"></param>
        /// <param name="commandFlags"></param>
        /// <returns>the value after the increment</returns>
        double StringIncrement(string key, double val, CommandFlags commandFlags = CommandFlags.None);

        /// <summary>
        /// Redis String type
        /// Works like a wildcard query: key* returns the values of all keys that start with key
        /// </summary>
        /// <typeparam name="T"></typeparam>
        /// <param name="key"></param>
        /// <param name="pageSize"></param>
        /// <param name="commandFlags"></param>
        /// <returns>List<T></returns>
        List<T> StringGetList<T>(string key, int pageSize = 1000, CommandFlags commandFlags = CommandFlags.None) where T : class;
        #endregion


        #region Shared across Redis data types

        /// <summary>
        /// Whether the given key exists in Redis
        /// </summary>
        /// <param name="key"></param>
        /// <param name="commandFlags"></param>
        /// <returns></returns>
        bool KeyExists(string key, CommandFlags commandFlags = CommandFlags.None);

        /// <summary>
        /// Remove a key from Redis
        /// </summary>
        /// <param name="key"></param>
        /// <param name="commandFlags"></param>
        /// <returns></returns>
        bool KeyRemove(string key, CommandFlags commandFlags = CommandFlags.None);

        /// <summary>
        /// Remove multiple keys from Redis
        /// </summary>
        /// <param name="keys"></param>
        /// <param name="commandFlags"></param>
        /// <returns></returns>
        void KeyRemove(RedisKey[] keys, CommandFlags commandFlags = CommandFlags.None);

        /// <summary>
        /// Dispose the DB connection and release related resources
        /// </summary>
        void DbConnectionStop();
        #endregion
    }
}

The implementation in RedisClient.cs is as follows:


using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
using StackExchange.Redis;
namespace RedisRepository
{
public class RedisClient : IRedisClient
{
#region Initialization
private readonly IDatabase _db;
private readonly ConnectionMultiplexer _redis;
/// <summary>
/// Constructor; Redis events are registered here
/// </summary>
public RedisClient()
{
const string configuration = "{0},abortConnect=false,defaultDatabase={1},ssl=false,ConnectTimeout={2},allowAdmin=true,connectRetry={3}";
_redis = ConnectionMultiplexer
.Connect(string.Format(configuration, RedisClientConfigurations.Url,
RedisClientConfigurations.DefaultDatabase, RedisClientConfigurations.ConnectTimeout,
RedisClientConfigurations.ConnectRetry));
_redis.PreserveAsyncOrder = RedisClientConfigurations.PreserveAsyncOrder;
//_redis.ConnectionFailed;
_db = _redis.GetDatabase();
}
#endregion
#region Redis String数据类型操作
/// <summary>
/// Redis String type: add a record
/// </summary>
/// <typeparam name="T">generic reference type</typeparam>
/// <param name="key">unique key of value</param>
/// <param name="value">value of key of type T</param>
/// <param name="expiresAt">time span of expiration</param>
/// <returns>true or false</returns>
public bool StringSet<T>(string key, T value, TimeSpan? expiresAt = default(TimeSpan?), When when = When.Always, CommandFlags commandFlags = CommandFlags.None) where T : class
{
var stringContent = SerializeContent(value);
return _db.StringSet(key, stringContent, expiresAt, when, commandFlags);
}
/// <summary>
/// Redis String type: add a record
/// </summary>
/// <typeparam name="T">generic reference type</typeparam>
/// <param name="key">unique key of value</param>
/// <param name="value">value of key of type object</param>
/// <param name="expiresAt">time span of expiration</param>
/// <returns>true or false</returns>
public bool StringSet<T>(string key, object value, TimeSpan? expiresAt = default(TimeSpan?), When when = When.Always, CommandFlags commandFlags = CommandFlags.None) where T : class
{
var stringContent = SerializeContent(value);
return _db.StringSet(key, stringContent, expiresAt, when, commandFlags);
}
/// <summary>
/// Redis String type: get the length of the string stored at the given key
/// </summary>
/// <param name="key"></param>
/// <returns></returns>
public long StringLength(string key, CommandFlags commandFlags = CommandFlags.None)
{
return _db.StringLength(key, commandFlags);
}
/// <summary>
/// Redis String type: append a value and return the resulting total length
/// </summary>
/// <param name="key"></param>
/// <param name="appendVal"></param>
/// <returns>total length after the append</returns>
public long StringAppend(string key, string appendVal, CommandFlags commandFlags = CommandFlags.None)
{
return _db.StringAppend(key, appendVal, commandFlags);
}
/// <summary>
/// Set a new value and return the old one
/// </summary>
/// <param name="key"></param>
/// <param name="newVal"></param>
/// <param name="commandFlags"></param>
/// <returns>OldVal</returns>
public string StringGetAndSet(string key, string newVal, CommandFlags commandFlags = CommandFlags.None)
{
return _db.StringGetSet(key, newVal, commandFlags);
}
/// <summary>
/// Use this method when updating; it makes the code more readable.
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="key"></param>
/// <param name="value"></param>
/// <param name="expiresAt"></param>
/// <param name="when"></param>
/// <param name="commandFlags"></param>
/// <returns></returns>
public bool StringUpdate<T>(string key, T value, TimeSpan expiresAt, When when = When.Always, CommandFlags commandFlags = CommandFlags.None) where T : class
{
var stringContent = SerializeContent(value);
return _db.StringSet(key, stringContent, expiresAt, when, commandFlags);
}
/// <summary>
/// Increment a numeric value by val
/// </summary>
/// <param name="key"></param>
/// <param name="val">may be negative</param>
/// <param name="commandFlags"></param>
/// <returns>the value after the increment</returns>
public double StringIncrement(string key, double val, CommandFlags commandFlags = CommandFlags.None)
{
return _db.StringIncrement(key, val, commandFlags);
}
/// <summary>
/// Redis String type: Get
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="key"></param>
/// <returns>T</returns>
public T StringGet<T>(string key, CommandFlags commandFlags = CommandFlags.None) where T : class
{
try
{
RedisValue myString = _db.StringGet(key, commandFlags);
if (myString.HasValue && !myString.IsNullOrEmpty)
{
return DeserializeContent<T>(myString);
}
else
{
return null;
}
}
catch (Exception)
{
// Log Exception
return null;
}
}
/// <summary>
/// Redis String type
/// Works like a wildcard query: key* returns the values of all keys that start with key
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="key"></param>
/// <param name="pageSize"></param>
/// <param name="commandFlags"></param>
/// <returns>List<T></returns>
public List<T> StringGetList<T>(string key, int pageSize = 1000, CommandFlags commandFlags = CommandFlags.None) where T : class
{
try
{
var server = _redis.GetServer(host: RedisClientConfigurations.Url,
port: RedisClientConfigurations.Port);
var keys = server.Keys(_db.Database, key, pageSize, commandFlags);
var keyValues = _db.StringGet(keys.ToArray(), commandFlags);
var result = new List<T>();
foreach (var redisValue in keyValues)
{
if (redisValue.HasValue && !redisValue.IsNullOrEmpty)
{
var item = DeserializeContent<T>(redisValue);
result.Add(item);
}
}
return result;
}
catch (Exception)
{
// Log Exception
return null;
}
}
#endregion
#region Redis Hash operations
#endregion
#region Redis List operations
#endregion
#region Redis Set operations
#endregion
#region Redis Sorted Set operations
#endregion
#region Shared across Redis data types
/// <summary>
/// Whether the given key exists in Redis
/// </summary>
/// <param name="key"></param>
/// <returns></returns>
public bool KeyExists(string key, CommandFlags commandFlags = CommandFlags.None)
{
return _db.KeyExists(key, commandFlags);
}
/// <summary>
/// Dispose the DB connection and release related resources
/// </summary>
public void DbConnectionStop()
{
_redis.Dispose();
}
/// <summary>
/// Remove a key from Redis
/// </summary>
/// <param name="key"></param>
/// <returns></returns>
public bool KeyRemove(string key, CommandFlags commandFlags = CommandFlags.None)
{
return _db.KeyDelete(key, commandFlags);
}
/// <summary>
/// Remove multiple keys from Redis
/// </summary>
/// <param name="keys"></param>
public void KeyRemove(RedisKey[] keys, CommandFlags commandFlags = CommandFlags.None)
{
_db.KeyDelete(keys, commandFlags);
}
#endregion
#region Private helpers
// serialize and Deserialize content in separate functions as redis can save value as array of binary.
// so, any time you need to change the way of handling value, do it here.
private string SerializeContent(object value)
{
return JsonConvert.SerializeObject(value);
}
private T DeserializeContent<T>(RedisValue myString)
{
return JsonConvert.DeserializeObject<T>(myString);
}
#endregion
}
}

Below are a few brief notes on the details of the methods in this article.

First, the Redis data-access object _db is initialized in the RedisClient constructor. For more detail on each method, see its comments. If the underlying Redis commands are still unfamiliar, see the earlier series on Redis commands: http://www.cnblogs.com/tdws/tag/NoSql/
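To round the section off, here is a hypothetical usage sketch of the wrapper described above. The key names and the Person type are placeholders, not part of the article's code.

// Person is assumed to be a plain, JSON-serializable class defined elsewhere.
IRedisClient client = new RedisClient();

// Cache an object for 30 minutes, read it back, and clean up.
bool ok = client.StringSet("user:1", new Person { Name = "Tom" }, TimeSpan.FromMinutes(30));
Person cached = client.StringGet<Person>("user:1");
long length = client.StringLength("user:1");
bool exists = client.KeyExists("user:1");
client.KeyRemove("user:1");

// Only on application shutdown; the multiplexer is meant to be reused.
client.DbConnectionStop();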

Getting Started with Redis 3.0, Part 1: Master-Slave Setup


Over the weekend I watched some older public-course videos from 北京尚学堂 and found a video lesson by teacher 白贺翔 on Redis 3.0. It was quite good; the following are my study notes.

I. Standalone setup

First, the download address: http://redis.io/download. Assume the file we downloaded is redis-3.0.0-rc2.tar.gz.

Installation steps:

1. Put the downloaded redis-3.0.0-rc2.tar.gz into /usr/local on the Linux machine

2. Extract it: tar -xzvf redis-3.0.0-rc2.tar.gz -C /usr/local/

3. Enter the redis-3.0.0-rc2 directory and run make

4. Enter src and install with make install; verify by running ll in src and checking that redis-server and redis-cli are present

5. Create two directories to hold the Redis binaries and configuration files

mkdir -p /usr/local/redis/etc
mkdir -p /usr/local/redis/bin

6. Move redis.conf from redis-3.0.0-rc2 to /usr/local/redis/etc

mv redis.conf /usr/local/redis/etc

7. Move mkreleasehdr.sh, redis-benchmark, redis-check-aof, redis-check-dump, redis-cli, and redis-server from redis-3.0.0-rc2/src into bin with:

mv mkreleasehdr.sh redis-benchmark redis-check-aof redis-check-dump redis-cli redis-server /usr/local/redis/bin

8. Start the server, specifying the configuration file:

/usr/local/redis/bin/redis-server /usr/local/redis/etc/redis.conf

9. Exit and switch to background (daemon) start

Exiting needs no explanation. To run Redis in the background instead, edit /usr/local/redis/etc/redis.conf, find

daemonize no

and change it to

daemonize yes

10. Change where the persistence files are stored: change

dir ./

to

dir /usr/local/redis/data/

11. Using the Redis client

/usr/local/redis/bin/redis-cli -h host -p port

12. Set a password

The steps so far will have shown you that Redis has no password by default, which is quite insecure. To set one, edit /usr/local/redis/etc/redis.conf, find the requirepass line, and set

requirepass bridgeli

Then, when connecting through the client, just add the -a option followed by your password.
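For example (the host and port below are the defaults and just placeholders):

/usr/local/redis/bin/redis-cli -h 127.0.0.1 -p 6379 -a bridgeli
# or, from inside an already-connected client:
# 127.0.0.1:6379> auth bridgeli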

II. Master-slave setup

Naturally, start by setting up two standalone instances. Then, to make one of them the slave, edit that machine's configuration file and find

#slaveof <masterip> <masterport>

this line (yes, it is commented out by default, obviously). Following that pattern, fill in the master's information, restart both machines, and you can then check the master/slave roles with the info command.
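As a sketch, the slave's redis.conf would gain a line like the following (the master address is a placeholder), and the role can then be checked from the client:

slaveof 192.168.1.100 6379

/usr/local/redis/bin/redis-cli -h 127.0.0.1 -p 6379 info replication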

III. Redis server monitoring (Sentinel)

Steps (configure sentinel.conf on any one of the servers):

1. Copy sentinel.conf to /usr/local/redis/etc

2. Edit the sentinel.conf file:

sentinel monitor mymaster 127.0.0.1 6379 1 # name, ip, port, number of votes needed to agree on failover
sentinel down-after-milliseconds mymaster 5000 # checked every 1s by default; no reply within 5000ms counts as down
sentinel failover-timeout mymaster 900000
sentinel can-failover mymaster yes
sentinel parallel-syncs mymaster 2

3. Start the Sentinel:

/usr/local/redis/bin/redis-server /usr/local/redis/etc/sentinel.conf --sentinel &

4. View Sentinel information:

/usr/local/redis/bin/redis-cli -h 127.0.0.1 -p 26379 info sentinel

IV. Redis persistence

The default is snapshotting, which I won't cover here; it is mostly used for demos, while in practice the append-only file (AOF) is more common.

Edit the configuration file and change

appendonly no

to

appendonly yes

The related fsync policies are:

#appendfsync always // write to disk as soon as a write command arrives; slowest, but guarantees full durability
#appendfsync everysec // write to disk once per second; a good compromise between performance and durability
#appendfsync no // rely entirely on the OS; best performance, but durability is not guaranteed

Author: Bridge Li, http://www.bridgeli.cn

Original link: http://www.bridgeli.cn/archives/310

Copyright notice: unless stated otherwise, works on this site are original; when reprinting, please credit the author and link to the original. Thank you.

Notes on Using Node for the Server Side of a Project

Opening ramble

It has been a long while since my last technical post. My previous posts were about React (only three of them); this time I'll write about the approach I used recently when Node served as the backend of a project. I don't get many chances to use Node on the server; I used it once for my graduation project and not since. Fair warning: this development approach is entirely my own improvisation. As with the React posts, the main goal is to take stock and to invite more discussion that can make up for my own shortcomings. I have always felt that writing a post is much harder than writing code: you have to revisit the code, sort out your thinking, and then express it clearly, which is no small task. Writing code for yourself is easy; teaching it, or explaining it to someone else, is much harder.

Right, this time I'll cover the following topics:

Directory structure
Database planning
Route planning
Logging
Environment variables
A simple token
Writing simple plugins

Directory structure

├─ README.md
├─ server.js          // entry point
│
├─ app                // main folder
│  ├─ libs            // assorted plugins
│  ├─ mongodb         // mongoose-related files
│  │  ├─ documents
│  │  └─ schemas
│  ├─ route           // routes
│  ├─ static          // static assets
│  │  ├─ fonts
│  │  ├─ images
│  │  ├─ libs
│  │  ├─ scripts
│  │  └─ styles
│  └─ views           // templates
│
├─ config             // configuration files for environment variables
└─ log                // log files

The directory structure is fairly simple. It came from repeatedly adjusting names and layout, and it covers the needs of my current project; there may be more server-side features and configuration that I have not listed here. Since this is the first time I have used Node on the server since my graduation project, I think this layout is more straightforward and better separated than the one I used back then. I'd welcome other people sharing their directory structures so we can compare notes.

Database planning

├─ mongodb            // mongoose-related files
   ├─ documents
   └─ schemas

The database is MongoDB, accessed through the third-party library mongoose. Following mongoose's own definitions there are three parts: schemas > models > documents, so in the directory layout I merged schemas and models into one place. I used to keep them in three separate directories, which does give a degree of separation, but it also made calling code slightly more cumbersome (this is not a large application, so the database operations are not very complex, and the middleware in schemas and models is rarely used). With this layout, a file under schemas looks roughly like this:

'use strict'
const mongoose = require('mongoose');

const AdminSchemas = new mongoose.Schema({
  name: String,
  password: String
});

// compile the schema into a model
const Admin = mongoose.model('Admin', AdminSchemas);

module.exports = Admin;

documents is where the create, read, update, and delete operations live, wrapped up according to the actual needs of the project. Promises are used here; mongoose officially recommends mpromise. In documents we call the models defined earlier to do the work:

'use strict'
const mPromise = require('mpromise');
/* Models */
const Admin = require('../schemas/Admin');

exports.Get = (id) => {
  const promise = new mPromise();
  Admin
    .findById(id)
    .exec((err, admin) => {
      if (err) {
        promise.resolve(err, {flag: false, data: {}, info: err});
      } else {
        promise.resolve(err, {flag: true, data: admin, info: ''});
      }
    });
  return promise;
};

By requiring the model and exporting a function we can extend things later; the simplest needs are create, read, update, and delete. Using a promise means the caller gets a promise object back and can chain xx.then(). Depending on the business requirements, the routine operations can be wrapped first, so that in the route layer the call boils down to simply returning data.

// route.js
'use strict'
const express = require('express');
const router = express.Router();
/* Documents */
const Admin = require('../mongodb/documents/admin');

/* Router */
router.get('/', (req, res) => {
  const id = req.query.id;
  Admin
    .Get(id)
    .then(data => {
      if (data.flag) {
        res.json({flag: true, data: data.data, info: ''});
      } else {
        res.json({flag: false, data: {}, info: 'No matching administrator found'});
      }
    });
});

module.exports = router;

That is how it is called inside a route, which is a fairly simple approach. There are surely better ways, and I hope to keep improving it through discussion.

Route planning

Route planning matters quite a bit: only when the split is fine-grained enough will the project extend well later on.

├─ route              // routes
   ├─ api.js
   └─ api
      └─ admin.js

My habit is to split level by level: if there are further routes under api, I create a folder to hold those child routes, and so on; if a child route of api has children of its own, I create another folder for them.

First, let's create admin.js, the child route used by api.js:

'use strict'
const express = require('express');
const router = express.Router();

router.get('/list', (req, res) => {
  res.json({flag: true});
});

module.exports = router;

Then we write api.js:

'use strict'
const express = require('express');
const router = express.Router();
const bodyParser = require('body-parser');

/* Use */
router.use(bodyParser.json());
router.use(bodyParser.urlencoded({ extended: true }));

/* Router */
const admin = require('./api/admin'); // relative path, so Node does not look in node_modules

/* Router Use */
router.use('/admin', admin);

module.exports = router;

The way it is written should be fairly self-explanatory: api.js mounts its child routes with router.use, and this cascades level by level. Finally, the entry file server.js can use it like this:

'use strict'
const express = require('express');
const app = express();

/* Router */
const api = require('./app/route/api');

app.use('/api', api);

That is roughly the idea behind the route layout, and it should be easy to follow. If you know a better way, or see anything wrong with mine, please point it out.

Logging

For logging I used log4js this time. It happened that 寸志 had just published a complete guide, "Node.js 之 log4js 完全讲解", on 前端外刊, so I took the opportunity to try it; you can follow that link for the details, and it is quite easy to pick up. To make it easier to manage and maintain, I added a logger.js to the libs folder mentioned earlier; the code is roughly as follows:

'use strict'
/* Logger */
const log4js = require('log4js');

log4js.configure({
  appenders: [
    {
      type: 'DateFile',
      filename: './log/api/access/api_access.log', // output path
      pattern: '-yyyy-MM-dd.log',                  // file name pattern
      alwaysIncludePattern: true,
      category: 'api_access'
    },
  ],
  levels: {
    'api_access': log4js.levels.INFO
  }
});

module.exports = {
  apiAccess: log4js.connectLogger(log4js.getLogger('api_access'))
};

It is then used in the api.js created above like this:

'use strict'
const express = require('express');
const router = express.Router();
const bodyParser = require('body-parser');

/* Libs */
const logger = require('../libs/logger'); // import the logger module

/* Logger */
router.use(logger.apiAccess); // mount it as middleware

/* Use */
router.use(bodyParser.json());
router.use(bodyParser.urlencoded({ extended: true }));

/* Router */
const admin = require('./api/admin');

/* Router Use */
router.use('/admin', admin);

module.exports = router;

It feels like a fairly blunt but simple approach: every log type can be configured and managed in logger.js and then used anywhere.

Environment variables

This is really about switching on the value of NODE_ENV. Different environments may need different parameters, for example the common development and production modes, where the endpoints the service calls, the database address, and a whole series of other settings may differ. The third-party package config handles this, which is why the directory structure above includes a config folder for these configuration files; see the config documentation for the more advanced usage. I created two files under config, development.json and production.json; if you have more environments you can of course add more. Both files have the same shape, only some of the values differ, so I'll show just one of them:

{
  "mongodb": {
    "path": "mongodb://127.0.0.1/test",
    "name": "test"
  },
  "server": {
    "port": 3131
  }
}

All of this is defined according to your own needs, so the contents need no explanation; the only thing to watch is that the files must be strict JSON, otherwise they will fail to parse. Next, how to read them:

'use strict'
const config = require('config');

const mongodbConf = config.get('mongodb');

Usage is very simple: require config, use get to fetch the field you want, and from then on treat it like an ordinary object.

A simple token

You can look up how token-based authentication works on your own; there is plenty of material. Here I'll just briefly show what my usage looks like. For tokens I use jwt-simple, which, true to its name, is quite simple to use. Here is the corresponding code:

'use strict'
const jwt = require('jwt-simple');
const moment = require('moment');
/* MongoDB Models */
const Admin = require('../mongodb/schemas/admin');

/* Set */
const jwtSecret = 'abcd1234'; // token secret

exports.SetToken = (value) => {
  const expires = moment().add(7, 'days').valueOf();
  const token = jwt.encode({
    iss: value,   // subject being encoded
    exp: expires  // expiry timestamp
  }, jwtSecret);  // secret key
  return token;
};

exports.GetToken = (req, res, next) => {
  const reqToken = req.headers['token'];
  if (typeof reqToken !== 'undefined') {
    try {
      const decodedToken = jwt.decode(reqToken, jwtSecret);
      if (decodedToken.exp <= Date.now()) {
        res.status(401).json({flag: false, data: {}, info: 'token expired'});
      } else {
        Admin
          .findOne({name: decodedToken.iss})
          .exec((err, admin) => {
            if (!admin) {
              res.status(403).json({flag: false, data: {}, info: 'no permission'});
            } else {
              next();
            }
          });
      }
    } catch (e) {
      res.status(403).json({flag: false, data: {}, info: 'no permission'});
    }
  } else {
    res.status(403).json({flag: false, data: {}, info: 'no permission'});
  }
};

Require jwt-simple, define a secret jwtSecret, and define two helper methods for later use: SetToken to issue a token and GetToken to verify one; they correspond to encode and decode. First encode the payload you want to sign ({iss: value, exp: expires} is of course up to you), then decode the token sent by the client to recover that information. In the code above I check the expiry time first, then query the database by the name that was encoded; if a matching record exists, the request is allowed through to the next route handler.

I put this verification logic in libs as token.js, so it can be called wherever verification is needed:

'use strict'
const express = require('express');
const router = express.Router();

/* Libs */
const token = require('../../libs/token');

// use the token check as middleware (the exported GetToken function)
router.get('/list', token.GetToken, (req, res) => {
  res.json({flag: true});
});

module.exports = router;

Just add it to whichever routes need token verification.

Writing simple plugins

As development goes on you keep extracting small plugins, mainly to avoid writing the same code again later and to handle data processing. The two extracted above, logger.js for logging and token.js for token verification, are examples; in hindsight, token.js really belongs in a middleware folder.

I actually added one more, response.js, to handle each route's response:

'use strict';

exports.successRes = (res, data, info, status) => {
  const _status = status ? status : 200;
  const _info = info ? info : '';
  const resData = {
    flag: true,
    data: data,
    info: _info,
    status: _status
  };
  res.status(_status).json(resData);
};

exports.badRes = (res, info, status, data) => {
  const _status = status ? status : 200;
  const _info = info ? info : '';
  const _data = data ? data : {};
  const resData = {
    flag: false,
    data: _data,
    info: _info,
    status: _status
  };
  res.status(_status).json(resData);
};

One function for the success case and one for the failure case. The benefit is a consistent response format that is easy to manage centrally. Much of this comes from continually abstracting during development and gradually accumulating a set of helpers that fits the actual business.
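As an illustration only, the earlier admin route could be rewritten with these helpers roughly like this; the require paths follow the directory layout above and the error message is made up:

'use strict'
const express = require('express');
const router = express.Router();
/* Libs */
const response = require('../../libs/response');
/* Documents */
const Admin = require('../../mongodb/documents/admin');

router.get('/', (req, res) => {
  Admin
    .Get(req.query.id)
    .then(data => {
      if (data.flag) {
        response.successRes(res, data.data);                      // unified success shape
      } else {
        response.badRes(res, 'No matching administrator found');  // unified failure shape
      }
    });
});

module.exports = router;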

Finally, here is the entry file server.js:

const fs = require('fs');
const path = require('path');
const express = require('express');
const app = express();
const mongoose = require('mongoose');
const config = require('config');

/* Config */
const mongodbConfig = config.get('mongodb');
const serverConfig = config.get('server');

process.on('uncaughtException', (err) => {
  console.log(err);
});

/* Set */
app.set('views', './app/views/'); // view path
app.set('view engine', 'pug');    // template engine

/* Static assets */
app.use(express.static(path.join(__dirname, './app/static'))); // static asset directory

/* connect mongodb */
mongoose.connect(mongodbConfig.path);
const db = mongoose.createConnection('localhost', mongodbConfig.name);
db.on('error', console.error.bind(console, 'connection error:'));

/* Use */
/* Tmp */
const cors = require('cors'); // enable CORS
app.use(cors());

/* Router */
const api = require('./app/route/api');

/* Router use */
app.use('/api', api); // api

app.listen(serverConfig.port);

Conclusion

I see it has been four months since the last post, and quite a lot has happened in that time, but this post isn't the place for it; let's keep the focus on the technology.

The main goal is to take stock of this round of usage and to get more discussion going. If you have any questions or just want to talk, you can reach me by email at wengwangjay@126.com or on Weibo: @爱拍照的小胖纸.

Yiming's Notes: Mongo in My Eyes


The first words of this first little piece were typed in Singapore, at Jumbo Seafood, while watching my kid eat mango pudding. That mango, of course, is not this Mongo!

A quick self-introduction: I did reasonably well at school but was a little short on luck, missing direct admission to graduate study in the computer science department at Jiao Tong University by one place. In a fit of pique I gave up on the graduate entrance exam and spent the next 18 years in foreign companies. I suppose I made the most of IT's most glorious years, though sometimes I honestly feel I picked the wrong industry; given another chance, I would not hesitate to go into finance...

But I digress; back to reality and my impressions of MongoDB.

This first piece is not meant to be full of "hard technical content". I am a technical person, but I don't like flaunting technology to prove how good I am. However impressive the technology, without sensible positioning, an effective business model, and real business value, it is just vapor. Lately WeChat has been full of news of C-round and D-round startups collapsing; it is not that the technology behind them was weak, but rather that they had technology without a sustainable, effective way of running a business.

Is MongoDB "great"?

Obviously, for transactional processing it can hardly rival a traditional relational database, as a cache it does not necessarily beat Redis, and for high-concurrency writes it likely cannot match Cassandra...

And yet I say MongoDB is "great"!

Leaving aside whether those points still hold in the latest versions (with MongoDB's multiple storage engines there are too many variables), MongoDB embodies a "simplify the complex", business-driven development model. It lets development adapt quickly to business change and lets your thinking evolve over many iterations; on that point alone it can hold its own in a great many business scenarios. From a business-model perspective, a modern enterprise's applications must adjust rapidly as the business model changes, which is itself a process of many fast iterations, and that is exactly in line with MongoDB's design philosophy. So business-driven, iterative design is the first of MongoDB's essential strengths.

I remember, soon after joining MongoDB, giving an introduction at an internet company. A young man just back from studying in the UK clearly had misgivings about MongoDB; he pointed at me and demanded loudly: "I don't see what's so good about MongoDB. Can you name one application that can only be built on MongoDB?" Frankly, mainframes aside, I don't believe there is any application on an open platform that can only use one particular database; the more suitable architecture is always decided by the needs of the business. I have to admit that in some scenarios where the architecture "will never change" and the structure is simple, MongoDB is not necessarily the best choice. But in the wider world the dominant theme is change, and I suspect that is the main reason why MongoDB, a latecomer, has so quickly climbed to fourth place in the DB-Engines ranking. In fact, even developers need to look further ahead, put themselves in others' shoes, and not cling blindly to a single technology; that is how you stand higher and see further. After all, it is hard for developers in China to be like those abroad, still happily tending the same code in the same role with white hair and three grandchildren...

MongoDB's second essential strength is, of course, the architecture everyone knows: replica sets plus sharding. Many of the loyal fans who adopted MongoDB in the early years came precisely for this. But a technical lead, unlike a lead in design thinking, is easy to imitate and improve upon, so as I see it the advantage at the level of the overall framework is still there, but there are now many challengers.

If the first two strengths are already well established from an architecture and design point of view, the third has boundless potential: MongoDB's pluggable storage engines. In my eyes this is one of MongoDB's grand blueprints, and also the key means by which the first strength can truly be carried through. A quick search suggests there are already plenty of articles on mongoing covering this, so I will stop here; if I keep writing, the kid will finish off the chilli crab!

Time for the main course!!!



MongoDB Online Webinar Series 14: Fundamentals - Your First MongoDB Application

Fundamentals Series: Your First MongoDB Application

In this MongoDB fundamentals webinar, the presenter will cover:

1. Why NoSQL emerged

2. Types of NoSQL databases

3. MongoDB's main features

4. MongoDB data durability: replica sets

5. MongoDB scalability: sharding

There will be a 10-15 minute Q&A after the talk; everyone is welcome to take part.

Time: 2016-09-14 20:00:00

How to participate

Since the event is held online, all you need is an internet-connected device. The webinar uses the Tencent Classroom platform; after registering, watch for an email with the QQ group number, through which you can join the session. Please be sure to provide a valid email address and phone number when registering.

About the presenter

林涛 (Lin Tao), consulting manager at MongoDB.

About the webinar series

The MongoDB online webinar series has run monthly since May 2015, presented by MongoDB technical staff or engineers and some invited guests; most sessions are in Chinese.

Partner communities: 慕课网, SegmentFault, UCLOUD



HBase and Phoenix on Azure: adventures in abstraction


One of my favourite essays by Joel Spolsky (he of Stack Overflow fame) is "The Law of Leaky Abstractions". In it he describes how the prevalence of layers of abstraction, be they coding languages, libraries or frameworks, has helped us accelerate our productivity. We don't have to talk directly to a database engine because we can let our SQL do that for us; we don't have to implement MapReduce jobs in Java anymore because we can use Hive; we don't have to… well, you get the idea.

But he also points out that even the best frameworks and languages are less than perfect, and when things go awry, these frameworks “leak” the details of their abstraction out to the observing world. We are then confronted with all the nuts-and-bolts of the implementation that had been hitherto kindly hidden from us, and we often have no alternative but to busy ourselves with a depth of detail that we had not expected.

Hadoop is a good case in point: consider my first Hadoop project in 2011, shown below on the left ("on premise"), where we implemented most of the map-reduce jobs in Java with a sprinkling of Pig scripts.


[Figure: the 2011 on-premise Hadoop stack (left) and the current Azure cloud stack (right)]

Compare that to the project we have just completed, running on Microsoft Azure, shown on the right (“cloud”).

Note that we have added two extra layers of abstraction ourselves (HBase and Phoenix), but that the cloud stack has added another two (virtualization and Azure Storage).

This is all fine… until things are not quite so fine and then it is a non-trivial task finding the cause. In fact, on rare occasions you may even end up needing to know almost as much about the underlying implementation as if there had been no abstraction layer in the first place! With this as our background, I’d like to offer the following:

Some reflections on using Phoenix as an SQL layer over HBase
Some comments on HBase compaction settings in the context of Azure

Phoenix and HBase: a friendship with benefits?

One of the aspects of using HBase is that there is normally (unless you are using the secondary index feature) only one rowkey and hence only one access path through the data: you can issue GETs or range SCANs using this key, but if any other access path is needed i.e. you are searching for terms by anything other than rowkey (= get) or rowkey prefix (= scan) then this will result in a full table scan. Since HBase and other components of the Hadoop stack are often used with unstructured data, this poses a challenge. What if we have stored our data using LASTNAME as the rowkey prefix but later realize that we want to search by FIRSTNAME as well?

Apache Phoenix offers capabilities that can be used to escape this cul-de-sac: it is an SQL layer over HBase which offers the following features/advantages relevant to our challenge:

SQL syntax for retrieving HBase data, a nice alternative to using the native HBase API
Secondary indices

Phoenix secondary indices are implemented as co-processors (which, put simplistically, act as triggers, keeping the “index” tables in sync with the “parent” tables) on the underlying HBase tables. A secondary index uses as its rowkey a combination of the column (or columns) that comprise the index plus the original rowkey, thus preserving uniqueness. You don’t have to populate the index yourself, the co-processor will do that for you when you make changes to the parent table. The optimizer can then use this index as an alternative to the table named in the original query, allowing us to add indices to an existing table and letting the optimizer transparently choose the most appropriate access path. However, a number of things should be carefully noted:

Phoenix can only track changes to the index if changes made to the parent data are made through Phoenix: you can read from the HBase table using standard HBase tools, but you should only make changes via Phoenix.
Bulk loading a Phoenix table is theoretically possible using a number of tools (JDBC, Spark, Phoenix map-reduce jobs, Pig) that bridge the Phoenix and HBase worlds, but in practice this is anything but straightforward: JDBC only writes line-by-line (JDBC batched writes are only available when using the Phoenix thin JDBC driver, which uses the Avatica sub-project of the Calcite library: it is unclear which version contains this, but most probably Avatica 1.8 > Phoenix 4.6 or 4.7?), and the Spark and map-reduce jobs run into difficulties if there are too many indices on a table.
The temptation with a tool offering SQL syntax is to assume that declarative set-based operations can be used freely (not an unreasonable assumption, since that is the purpose of SQL!). However, HBase is just not set up for anything other than single gets or discrete range scans, and Phoenix queries, particularly those that cannot use an index, may often result in timeouts.
Phoenix indices can either be Global (each index is a separate HBase table in its own right) or Local. For the purposes of this article, the important thing to note is that Local indices are preferable, since the index data is stored in the same region as the parent data, thus keeping internal RPC calls to a minimum. They are also still officially in technical preview, and are undergoing significant improvements.
Phoenix indices are separate from, but affected by, design decisions within HBase, not all of which have met with unanimous approval.
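To make the index mechanics concrete, here is a minimal Phoenix SQL sketch; the table and column names are hypothetical and not taken from the project described in this article:

CREATE TABLE users (
    lastname  VARCHAR NOT NULL,
    firstname VARCHAR,
    city      VARCHAR
    CONSTRAINT pk PRIMARY KEY (lastname)
);

-- a Local index keeps the index data co-located with the parent region
CREATE LOCAL INDEX idx_firstname ON users (firstname);

-- the optimizer can now serve this query from the index instead of a full table scan
SELECT lastname, firstname FROM users WHERE firstname = 'Alice';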

As an aside, let’s briefly discuss one of the concepts central to HBase: that of the Log-structured Merge Tree (also used in Lucene).

HBase writes first to a write-ahead-log on disk, then to an in-memory store (memstore), and acknowledges the write only when both are successful. When the memstore reaches a certain (configured) size, it is flushed to disk and the WAL for that memstore is removed: this is great for writes, but not so great for reads, as there will be an increasing number of small files to access. HBase gets around this by periodically carrying out compactions, such that the individual store files are combined (again, this can be configured). Compactions come in two flavours:


[Figure: store files accumulating over time, with minor compactions consolidating them]

Minor compactions, whereby the individual store files belonging to a particular store (mapping to a single column family within an HBase table) are consolidated according to certain algorithms, as illustrated above. The blue bars represent new store files that are created when a memstore is flushed to disk, and the green bars represent compacted store files.
Major compactions, whereby HBase attempts to combine all files for a store into a single file. I say "attempts to", because if the resulting file exceeds the specified limit (defined by the property hbase.hregion.max.filesize), then HBase will split the region into two smaller regions. However, this split-region-when-compacted-file-reaches-max-limit rule is also subject to other factors (see next section).

The settings pertinent to compaction behavior are:

hbase.hstore.compaction.min (previously called hbase.hstore.compactionThreshold): the minimum number of files needed for a minor compaction, set to a default of 3 in HBase and Azure/HDInsight.

hbase.hstore.compaction.max: the maximum number of files considered for a compaction, set to a default of 10 in HBase and Azure/HDInsight.

hbase.hstore.compaction.max.size: any file greater than this will be excluded from minor compactions. Set to Long.MAX_VALUE in HBase but limited to 10GB in Azure/HDInsight.

hbase.hregion.majorcompaction: the time between major compactions in milliseconds. This defaults to 7 days in HBase, but is turned off in Azure/HDInsight.

hbase.hregion.max.filesize: when all files in a region reach this limit, the region is split in two. Set to a default of 10GB in HBase but to 3GB in Azure/HDInsight.

hbase.hstore.blockingStoreFiles: a Store can only have this many files; if the number is exceeded, a compaction is forced and further writes are blocked. Set to a default of 10 in HBase but increased to 100 in Azure/HDInsight.

So why, in the diagram shown above, did we not have a minor compaction at point B, where we again had three files available to us in the store? The reason is that the compaction algorithm (ExploringCompactionPolicy in HBase 0.96 and later; RatioBasedCompactionPolicy previously) takes into account the differing sizes of the files, and excludes from a compaction any file whose size exceeds the size of the other files in the compaction-set by a given ratio. This prevents the store from having files with largely divergent sizes. The full list of compaction-related parameters can be found on the official HBase website.
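For reference, these knobs live in hbase-site.xml; the sketch below simply spells out the Azure/HDInsight values quoted above (the file-size value is expressed in bytes):

<property>
  <name>hbase.hstore.compaction.min</name>
  <value>3</value>
</property>
<property>
  <name>hbase.hstore.compaction.max</name>
  <value>10</value>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>3221225472</value> <!-- 3 GB -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>100</value>
</property>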

Phoenix Indices, Co-location and HBase splits

I mentioned earlier that it is beneficial to use Local indexes with Phoenix: this is particularly true for read-heavy environments. When a local index is defined on a table, all index data is stored in a single subsidiary table, named:

_LOCAL_IDX_[parent-table-name]

In the HBase master UI you will then see something like this in the table description:

'SPLIT_POLICY' => 'org.apache.phoenix.hbase.index.IndexRegionSplitPolicy'

This class overrides RegionSplitPolicy in order to prevent automatic splitting:


[Figure: the IndexRegionSplitPolicy class, which overrides the default region split policy]

This may seem strange, but makes sense when one considers that local index data should be co-located with the parent data, meaning that the index table is only split when the parent table is split.

Azure Storage

Enter Azure Storage, another cog in the machine. This layer of abstraction introduces another factor: how to efficiently read from and write to the "blob" objects that make up Azure storage. The Azure documentation states that a blob may consist of up to 50,000 blocks, each with a default size of 4MB, meaning that with default settings we can write single files of up to 200GB. However, this needs to be understood in the context of the default Azure HBase/HDFS settings (fs.azure.read.request.size and fs.azure.write.request.size), which set the block size for read/write activity to 256 KB for performance reasons. The side effect of this is that we can only read and write files to Azure Storage that do not exceed 50,000 x 256 KB = 12 GB.
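As a sketch only, and assuming the values are expressed in bytes, these are the two settings in question as they would appear in the cluster configuration (262144 bytes = 256 KB, the HDInsight value quoted above):

<property>
  <name>fs.azure.read.request.size</name>
  <value>262144</value> <!-- 256 KB x 50,000 blocks ≈ 12 GB per file -->
</property>
<property>
  <name>fs.azure.write.request.size</name>
  <value>262144</value>
</property>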

Attentive readers may by now have a sense of impending doom: the HBase defaults are fine for the majority of cases, though these have been tweaked in HDInsight to avoid performance issues when reading and writing large files (abstraction #1). However, Phoenix, bundled with HBase in HDInsight, derives the most benefit from local indices when splitting is determined by the parent table, not the table holding the index data, and this is implemented (i.e. automatic splits are blocked) when a local index is created (abstraction #2). If we have several indices created on a table, then the index table may grow to be several times the size of the parent table, particularly if covering indices (where columns other than those that form the index are stored along with the index) are used.

In our case, we were observing blocked splits on the index table (because splitting is determined by the parent table to ensure co-location), leading to store files that were too large to be compacted, resulting in exceptions like this:

regionserver.CompactSplitThread: Compaction failed Request = regionName=_LOCAL_IDX_[parent table name]…
java.io.IOException
    at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:643)
    at com.microsoft.azure.storage.blob.BlobOutputStream.close(BlobOutputStream.java:280)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:160)
    at org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.close(NativeAzureFileSystem.java:869)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:248)
    at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.finishClose(HFileWriterV3.java:133)
    at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:366)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:996)
    at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:133)
    at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112)
    at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1212)
    at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1806)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:519)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.microsoft.azure.storage.StorageException: The request body is too large and exceeds the maximum permissible limit.
    at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:89)
    at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:307)
    at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:182)
    at com.microsoft.azure.storage.blob.CloudBlockBlob.commitBlockList(CloudBlockBlob.java:245)
    at com.microsoft.azure.storage.blob.BlobOutputStream.commit(BlobOutputStream.java:313)
    at com.microsoft.azure.storage.blob.BlobOutputStream.close(BlobOutputStream.java:277)
    ... 16 more

Two independent sets of behavior, both abstracted far enough away from view to be difficult to find, were working against each other.

Conclusion

In our situation we decided to keep major compactions turned off, and to execute them only manually, in combination with temporarily increasing the Azure storage settings (fs.azure.read.request.size and fs.azure.write.request.size). In this way we could rely on the minor compaction algorithm to keep the individual store files within the correct limit. However, we also discovered that there is often no substitute for rolling up your sleeves and seeking to understand the many and varied configuration settings, as well as wading through source code, to gain a better understanding of such corner-case behavior.
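A sketch of that manual procedure, with placeholder values: temporarily raise the two fs.azure.*.request.size settings through your cluster configuration (for example back towards the 4 MB block size mentioned above), trigger the compaction from the HBase shell, then revert the settings.

hbase shell
# the table name is a placeholder following the _LOCAL_IDX_ naming convention
major_compact '_LOCAL_IDX_MYTABLE'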

Read on …

So you’re interested in processing heaps of data? Have a look at our website and read about the services we offer to our customers.


The Business Value of Using Big Data Systems Like Hadoop


If you’re a technical person trying to explain a big data platform like Apache or Cloudera Hadoop, or a Hadoop-based cloud database as a service (DBaaS) like Azure’s HDInsight, Amazon’s EMR (Elastic MapReduce), or Google’s Cloud Dataproc to the business, you’ve probably been stymied by a communication barrier. Somehow they just don’t get how awesome it is to have schema-on-read instead of schema-on-write, or why being able to store unstructured data is so awesome, or why faster transform makes such a difference.

Here’s how to explain the benefits of big data systems, which for simplicity we’ll call Hadoop, to people who have only ever known traditional data warehouses.

It all starts with what the business wants.

The business wants to be a truly data driven organization, and to become that, it’s critical that the business is able to use ALL types of data to drive business improvements easily, fully, quickly, and cost effectively.

Relative to traditional data warehouse systems, Hadoop has functionality that brings easy access to more and different kinds of data, quickly and cost effectively in a way that wasn’t possible until these new technologies were invented. Here’s how:

All types of data

Unlike traditional databases, Hadoop-based big data systems are very flexible in terms of accepting large amounts of data of any type, regardless of structure. Until Hadoop it was expensive and difficult to store unstructured data. Now unstructured data like clickstream data, social media data, and audio files like call logs from call centers can be stored and analyzed. Hadoop’s ability to cost-effectively store lots of data means that you can now get access to granular or raw data, not the aggregated data that is most commonly stored in traditional data warehouses.

Easily

Relative to RDBMSs, Hadoop is inexpensive and can handle all kinds of data, so it is becoming that single place where analysis and programming can be done across multiple sources of data. No more silos of data and “cutting and pasting” from different data sources to get a single integrated view.

With Hadoop-based DBaaS, compute and storage can scale independently of each other which means that variations in demand can be easily and quickly met. To the business user this simply means that however they want to use the system at any time, even as that need changes, the system can adapt to their needs much faster than an RDBMS could.

Fully

Hadoop/DBaaS isn’t restricted to queries defined by predefined schemas. The traditional RDBMS schema-on-write model is good at answering the “known unknown” questions ― those that we could model the schema for ahead of time. But there is a very large class of exploratory questions that fall under the category of “unknown unknowns” ― questions that we didn’t know/expect ahead of time. Hadoop’s schema-on-read means that Hadoop is better suited for exploring data, which is really important for advanced analytics, where a lot of the time you “don’t know what you don’t know.”

Quickly

Hadoop’s transform efficiency means even huge amounts of data can be transformed quickly. To the business user this means less waiting for their data, or more timely access to data for more real-time decision making.

A schema-on-read system, or ETL on the fly, means new data can start flowing into the system in any shape or form without having to first set up schemas, and months later you can change your schema parser to immediately expose the new data elements without having to go through an extensive database reload or column recreation. Once again this means less waiting to get access to incoming data.

Hadoop can process real-time streams in memory, which to the business means you can do real-time analytics for extremely timely information.

Cost Effectively

Traditional archival systems are an inexpensive way to store data (especially when compared to RDBMSs), but since you can’t run processing or queries in archival systems, you have to recover the data back into an RDBMS to use it again, making it very expensive. Hadoop/DBaaS allows for cost-effective storage of detailed historical data with the ability to query archived data immediately, at any time (Active Archive). To the business this means they can get access to archived data more quickly.

Unlike RDBMSs that scale vertically, Hadoop scales horizontally using commodity and therefore less expensive hardware. This means more data can be stored for a lower price than using traditional databases.

The bottom line is that for a businessperson who wants to engage with data, a Hadoop-based system is like a dream come true on several fronts, and now you can help them understand why in words they’ll understand.

Discover how Pythian can help plan, implement, and support Hadoop in your organization.


Important changes coming to the Hortonworks Certification Program


On October 1, 2016, the Hortonworks Certification Program is changing its structure:

Our four current exams (HDPCD, HDPCD:Spark, HDPCD:Java and HDPCA) are being retired and replaced with a set of new exams.

We are introducing three levels of certification:

Associate: the new entry level into our certification program
Professional: for experienced data professionals wanting to validate and prove their real-world skills by performing hands-on tasks on a live cluster
Expert: our highest level of certification; Hortonworks Certified Experts have proven their worth and talent by completing a challenging set of tasks and projects on a live cluster

We are also introducing three tracks of certification:

Data Engineer
Data Scientist
Admin

Hortonworks Certified Associate (HCA): The HCA certification provides individuals with an entry point and validates the fundamental skills required to progress to the higher levels of the Hortonworks certification program. The HCA certification is a multiple-choice exam that consists of 40 questions with a passing score of 75%. There is only one HCA exam and it is the same exam for all three certification tracks.

Hortonworks Certified Professional (HCP): The Professional level certifications are based on the track a candidate desires to pursue. The HCP exams are two-hour hands-on, performance-based exams where a candidate is given a cluster and a set of tasks to perform. Initially, we will release two exams at the Professional level:

Hortonworks Certified Professional Data Engineer
Hortonworks Certified Professional Admin

Hortonworks Certified Expert (HCE): The Expert level certifications are designed to recognize individuals who possess a very high level of expertise and ability in Big Data. The HCE exams will consist of complex hands-on, performance-based tasks performed on a live cluster. We are in the process of designing three HCE exams:

Hortonworks Certified Expert Data Engineer
Hortonworks Certified Expert Data Scientist
Hortonworks Certified Expert Admin

Frequently Asked Questions

If you are interested in becoming Hortonworks Certified, or if you already are Hortonworks Certified, be sure to look over the following Frequently Asked Questions for details on the new program and how it affects future and existing candidates. If you have any questions at all about the changes to our Certification Program, please feel free to contact us at certification@hortonworks.com .

When do the new exams become available? The new exams will be available on October 1, 2016.

Will I be able to take the old exams after October 1, 2016? The existing Hortonworks Certification exams are being retired and cannot be attempted after October 1st.

If I am not currently Hortonworks Certified, am I required to take the Associate level exam? If you are not Hortonworks Certified prior to October 1, 2016, then you must pass the HCA exam before being allowed to attempt any of the exams at the Professional level.

If I am already Hortonworks Certified before October 1, am I required to take the Associate level exam? If you are an HDPCD, HDPCD:Spark, HDPCD:Java or HDPCA, then you do not have to take the HCA exam.

If I am already Hortonworks Certified before October 1, do I automatically earn one of the new Professional-level certifications? Do I have to take the Professional level exam before attempting an Expert level exam? You will be grandfathered in to our new program at the Professional level of your respective track, but you will not automatically be an HCP Data Engineer or HCP Admin. Existing HDPCA, HDPCD, HDPCD:Spark and HDPCD:Java professionals will still maintain their titles. However, they will immediately be eligible to take an Expert level exam in their respective track as described here:
HDPCD, HDPCD:Spark and HDPCD:Java Professionals will be immediately eligible to attempt the HCE Data Engineer or HCE Data Scientist exams (once those exams become available).
HDPCA Professionals will be immediately eligible to attempt the HCE Admin exam (once it becomes available).

How much do the new exams cost? The one-hour multiple-choice Hortonworks Certified Associate exam is $100 per attempt. The two-hour Professional level exams are $250 per attempt. We have not determined the price of the Expert level exams yet.

Is there a Professional-level exam for Data Scientists? No, not specifically. The Data Scientist track only has an Expert level exam. However, before you can attempt the Expert Data Scientist exam you must either:
Pass the HCP Data Engineer exam, or
Already be an HDPCD, HDPCD:Java or HDPCD:Spark Professional prior to October 1, 2016.

When will the Expert level exams be available? The Expert level exams will be released intermittently over the next 3-6 months.

What if I attempt one of the existing exams before October 1 but do not pass it? You will not be able to retake that exam after October 1 and the new program rules will apply, so you will have to take the HCA exam first, then attempt one of the new Professional-level exams.

How long is my Hortonworks Certification valid? We are introducing a two-year expiration date for all certifications:
If you pass one of the new exams, your certification will be valid for two years from the date of passing.
If you are already Hortonworks Certified, your current certification will expire on October 1, 2018.

Mongo to Mongo Data Moves with NiFi


There are many reasons to move or synchronize a database such as MongoDB: migrating providers, upgrading versions, duplicating for testing or staging, consolidating, and cleaning. There are even more ways to perform the function of moving said data: mongodump/mongorestore; on Compose there is Import, which is backed by Transporter; one could write custom scripts; and one could use a tool such as NiFi.

Here we will review a couple of scenarios using NiFi. First, we'll look at the simplest approach possible of just queries and inserts, then a brute force approach of polling with de-duplication, and then move on to a more advanced synchronization approach by hooking into MongoDB's oplog. Since all of these are on NiFi, the flexibility of adding extra transforms and viewing what's happening will always be available too.

One Time

For a one time move a tool such as NiFi might be more overhead than is needed unless there are some mitigating circumstances like large size or desired transformations but it can be done.



The simplest way to get started is to pull both a GetMongo Processor and PutMongo Processor onto the canvas and connect them via a success relationship (see here for details on Processors or here for details on NiFi overall). Each pair of these is good for a single Collection.



They are easy to copy so you can configure the first pair with things like connection strings and logon and then copy them. After that just change the relevant details for each Mongo Collection which really isn't too much trouble.



Since NiFi is built for data that is flowing, the typical idiom for a Processor such as GetMongo is to run over and over again which would generate duplicate data. So, configuring the Processor to run only once can be effectively done by setting the Run Schedule to some large interval that would be much longer than the actual one time session of copying data. This will ensure that the GET query will only run once per day when you start the flow which should be sufficient.

Brute Force Sync with Polling and De-duplication

The next solution builds upon the previous. It utilizes the same GetMongo and PutMongo processors at the edges but enhances the one shot nature of the previous example by adding a few fingerprinting and de-duplication steps to allow for a more continuous flow of data and a regular synchronization. The following details the flow for one collection:


[Figure: the polling and de-duplication flow for a single collection]

It starts the same with the GetMongo then steps to HashContent which generates a fingerprint of each FlowFile's content and puts it into a FlowFile attribute. The next step isn't technically needed for this flow since GetMongo doesn't write any attributes but is included to show that the unique key can be generated from both the content and attributes to ensure proper de-duplication.



The ComposeUniqueRocksDB only needs two properties configured: the Directory for the actual RocksDB data files and the name of the FlowFile's attribute which has the key to be checked. Since the fingerprint attribute is a truly unique key built from all of the relevant data, we can rely on ComposeUniqueRocksDB to only pass a FlowFile on to the unseen relationship, and hence to the next step of PutMongo, if it hasn't actually seen the Document contained in the FlowFile. ComposeUniqueRocksDB is part of a custom extension package. There are other solutions for de-duplication that come with NiFi, but they rely on more sophisticated configuration with some ControllerServices.



So, now that we only pass on the new data, we can go ahead and configure the most important parameter of this flow which is how often we schedule GetMongo to run. This will determine how often we query the Collection in the Mongo database. The downside is the amount of query load it places on the source system since it will query the entire collection over and over again according to this Run Schedule that is set above. For some situations this may be fine. If your collection isn't super large and you don't mind some extra load, or if how often you need synchronization can be changed or shifted to less utilized times, then this might be "good enough". This particular flow will basically perform the snapshot plus changes by building up a unique index and comparing the entire result set against the previously seen keys to decide whether a Document is actually new. While this is a lot of processing, it is simple and may work for you. If not, then the next solution might.

Tailing the Oplog

One of the idiosyncratic features of MongoDB's replication implementation is that it can be "hooked into". Mongo's oplog is a capped collection which is ultimately just a buffered changelog. Just like a regular Collection it can be queried and the database state changes can be moved and applied to another database. In essence, this is what Mongo does internally to keep replicas in sync.

A snapshot of the database plus the changes after the snapshot equals the current state. Marrying the streaming nature of the changes to NiFi makes a lot of sense and is the most complete solution if you have access to Mongo's oplog.

The below is an example flow which uses some custom code from the same package as the previous example which is not already in NiFi. It is easily built with mvn package and then easily deployed to NiFi by just copying a file and restarting the server.


Mongo to Mongo Data Moves with NiFi

Somewhat counter to the general notion of a NiFi Processor being simple and doing one thing, these two Processors are a little more sophisticated. With the GetMongo Processor we have to create one instance for each Collection, whereas one ComposeTailingGetMongo suffices for the entire Mongo database.

The ComposeTailingGetMongo runs only once and stays running. It begins by creating FlowFiles for every Document currently in a Mongo Database (a snapshot) and then continues to generate FlowFiles for any relevant operations such as inserts, updates, and deletes for as long as it runs (the changes). And while this is example code, it is useful example code and could easily be used in multiple situations. This Get plus the matching ComposeTailingPutMongo is sufficient to keep entire MongoDB databases in sync. Plus, this use case is a great match for NiFi's managing and running of data flows. And we haven't even mentioned that it is easy to transform this data by inserting extra processing steps, or even to duplicate it to multiple databases by adding another ComposeTailingPutMongo :


Mongo to Mongo Data Moves with NiFi
Continuous, Visible, and Easily Customized Synchronization

The various solutions above are good examples of how useful NiFi can be for moving data. Whether you need to synchronize your test Mongo database with production data, or whether you need to migrate to a new Storage Engine, or whatever your moving database use case may be, by choosing NiFi you get access to all of the benefits of a sophisticated data flow platform with less effort than "rolling your own" solution.

Can Graph Databases Really Advance Our Digital Public Services?


Graph databases most commonly associated with social networks and recently adopted by enterprises are now being looked into by government agencies and the civil services to manage big data. Neo Technology’s Emil Eifrem explains how graph databases have the potential to enhance digital public services

Government agencies are sitting on a wealth of information that could be mined to offer valuable insight to its citizens to better run public services, but the data mountain has proved a challenge for many.

In addition, with the advance of consumerization in technology, citizens have come to expect a 24/7 flow of information. This always-on, anytime, anywhere connection has put huge pressure on government to improve digital services to meet the public's expectations.

The multiplicity of data being collected has the potential to help governments at local, state and national level to achieve some really positive outcomes and help in their decision-making, such as where to budget housing resources, for example. But, to do this they need to put data into more useable formats and invest in the people and the systems to analyze it.

Relational databases are useful tools here. Designed to link data into searchable tables they are easy to set up. But with the march of the internet and social media, a more powerful tool is required, one that can actually uncover relationships: the connections between people, locations and events that make up society. The answer could be in the graph database, fast being adopted by business, which mirrors the linked, non-hierarchical mapping of the Web.

Graph databases are basically navigated and searched by following relationships. They do not require data to be stored in rigidly defined tables like relational databases, so it is possible to map connections wherever they lead.

Civil servants can use graph databases to spot emerging patterns by connecting multiple legal, welfare, health or demographic datasets. They can, for example, find out how many elderly people living in a certain neighborhood require assisted living care.

Changing The Face Of Society

But how would graph databases work in a real-world scenario? Take the US Immigration and Customs Enforcement, for example. With the help of visualized relationship connections in real time, it could collaborate with civil servants on individual cases of potential interest to border control. What's more, all this happens in real time, which is extremely helpful in supporting better decision-making by any border security officer.

In order to get a complete “360 degree view,” it is imperative that either all the data is held in one central place, or all the data is abridged. As the former is impractical, given the way departments within the government are structured, a graph database can help build a registry-style MDM system that stores the most useful metadata, including the location of the actual master data.

An Aid To Government Policy

As well as gleaning insight, graph databases can also help government with policy creation. Teams can communicate with other departments within state administrations, sharing their results. This means graph database technology could soon be enabling innovative, highly responsive informal learning systems in more than one national application.

Graph databases’ strength in social networking can also help government to track and analyze social media to help target terrorists, for example, or crime rings. Using graph databases, government agencies can work with top-level metadata to see hidden patterns and groups of interest in social media networks. Graph databases use fuzzy logic, for example, which is effective at drawing attention to slightly variant name spellings, which may all refer back to the same person.

Graph Databases: A Foundation For Digital Services

From the many conversations we have had with public sector, technology and policy professionals, graph databases could undoubtedly drive more efficient and integrated digital public services.

The US government, like many other governments around the globe, is moving away from top-down bureaucracy to develop digital offerings that instigate interaction between citizens and the state. Coupled with this, citizens want more transparent and timely delivery of services.

Local, state and federal leaders are looking to enterprises for answers to help them in their digital transformation to reinvent the citizen experience in this brave new connected world. Imagine, for example, being able to track the money trail between people and their bank accounts in order to stop tax evaders, white-collar criminals or even terrorists by joining the dots in their web of deception, a pattern that is way too complex for relational approaches to manage.

Graph databases have already shown their power in business to leverage data relationships for real-time, enterprise-level insights. Graph databases should now be in the toolbox of every government agency, helping their digital services drive higher levels of citizen engagement and satisfaction.

Writing Maintainable Integration Tests


In software development, writing integration tests is sometimes an afterthought. Many people think exclusively in terms of unit tests, and perhaps many more don't think about automated tests at all. Thus, the very idea of writing integration tests that are maintainable, manageable, and scalable may seem foreign to most.

I personally had never felt the limitations of a large codebase of integration tests until working on the Voldemort project. This post is an overview of the pain points of the Voldemort integration tests, as well as our stab at architecting better integration tests in our next project, Venice.

Integration test pain points

The two main problems with Voldemort tests are that they're flaky, meaning that they fail intermittently, and that they’re slow to run. These two characteristics inhibit their regular use and erode the trust placed in them.

We do have infrastructure that runs the tests automatically following every commit to master, and we do inspect the root cause of failing tests, but we have grown accustomed to skipping over some of the more flaky ones, which is obviously undesirable.

A test in each port

In the case of Voldemort, the root cause of both the flakiness and slowness of the test suite is the assignment of ports. As part of our integration tests, we spin up individual Voldemort servers or even entire clusters locally, with each such process binding to a certain port.

Some of the Voldemort tests use hard-coded port numbers, which is convenient in the short term, but ultimately is a bad idea, since any given port may be already busy on the local host, or there may be two tests that are hard-coded to the same port but run concurrently. Attempting to assign different hard-coded ports to each test seems like an exercise in futility.

Finding ports dynamically

Some other Voldemort tests try to be a little bit cleverer and dynamically seek available ports, but the rest of the test suite is not architected in such a way as to fully leverage this capability. There are many variations of this in the code base, but at a high-level, many Voldemort tests follow a sequence of events similar to the following:

Invoke ServerUtils.findFreePorts() to get a list of available ports.

Invoke ServerUtils.getLocalCluster(int[] ports) for a cluster, which listens to the ports gotten in Step 1.

Invoke ServerUtils.startVoldemortCluster() to start the various servers in the cluster gotten in Step 2.

The problem with this approach is that the ports are freed between Steps 1 and 3, causing a race scenario where two tests might concurrently determine that the same ports are free, and thus clash with each other when the time comes to actually use those ports. Step 3 does retry starting up the failed Voldemort server several times before giving up, but if the port is occupied by another test for a long period of time, then it may never be able to finish. Furthermore, since these steps operate on a list of ports, there is a possibility of gridlock where two independent tests prevent each other from succeeding, as well as from moving forward (until the maximum amount of retries is reached, at which point the test can finally fail).

Test dystopia

Tweaking the Voldemort test suite so that it sidesteps the issues described above is certainly doable, but because there are so many tests now, the envisioned cost of such an effort is prohibitive. Therefore, the pragmatic solution to the lack of isolation has been to make the Gradle build run the test suite sequentially.

This lack of parallelism, in turn, means the whole test suite is slow to complete. There is also a lost opportunity to leverage today's multi-core computers, which are perfectly suitable for this type of highly parallel work.

More importantly, even though running tests sequentially minimizes the risk of port clashes, it does not fully solve the issue. For example, other tests may be running simultaneously on a shared CI environment.

A brave new world of tests

The idea of finding ports dynamically is a good one, but it seems like the Voldemort implementation does not embrace it all the way. For this reason, when it came time to design the testing for our new project, Venice, we wanted to see if we could create a fresh solution early on that would solve this problem. We began by asking ourselves, “What do integration tests actually need?”

A change of paradigm

What do integration tests actually need? They need certain processes to interact with. Do tests need to decide which specific ports are going to be used by those services? Probably not. So why, then, would we leave it up to tests to determine these details? Why not instead ask the test framework to provide us with an already spun up process, no matter which port it's running on?

This is essentially what we did in our new project, Venice. Instead of putting the onus on the test writer to configure the processes it needs in such a way that it would not clash with those of other tests, we took that responsibility out of the test writer's hands. Tests can ask the framework to give them a wrapper for a Kafka broker, a Zookeeper server, or anything else they need, and then they can interrogate the wrapper to discover on which host and port the wrapped process runs. The difference between this strategy and the Voldemort strategy is admittedly a bit subtle, but it changes everything.

Retries that work

The main advantage of this approach is that retries can be implemented correctly. In the Voldemort tests, whenever a process failed to bind to its intended port, we would wait and retry, spinning up the process on the same busy port and hoping that it would somehow free up later down the line. In Venice's test framework, when a process fails to bind to its port, instead of waiting, we immediately find another free port and try to spin up the process on that one instead.

Thus, even though finding free ports dynamically is unreliable, we can circumvent that limitation if we have a single piece of code which owns all of the configuration and bootstrapping, end-to-end. But this only works if the test code is not allowed to make any specific assumptions about the port configurations of the processes it needs.
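A minimal sketch of that retry idea in Python (start_service is a hypothetical stand-in for whatever process a test needs):

# Sketch of retrying on a fresh port each time, rather than retrying the same busy port.
import socket


def find_free_port():
    # Ask the OS for a currently free port (it may be taken again by the time we use it).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]


def start_on_free_port(start_service, max_attempts=5):
    last_error = None
    for _ in range(max_attempts):
        port = find_free_port()
        try:
            return start_service(port), port   # hand the wrapper back to the test
        except OSError as e:                   # bind failed: someone grabbed the port first
            last_error = e
    raise RuntimeError("could not find a usable port") from last_error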

An abstraction that enables performance/isolation trade-offs

Another interesting benefit of this approach is that it enables us to decide where we want to stand in terms of performance versus isolation. For now, we always run with maximum isolation (i.e., every integration test that needs a process gets a fresh new one), but this is obviously more costly because there is a startup time associated with any process we need to startup. If, at some point later down the line, we decide to strike a different compromise, we will have the ability to do so.

For example, we have many tests that need a Kafka broker to interact with. Each test can ask the framework for a Kafka broker, and then interrogate it to find out its host and port. We don't stop there, however; we also ask the framework what Kafka topic name our test should use. The test does not care what specific topic name is used, it only needs a topic name―any will do. This way, if our test suite grows big enough to warrant it, we could decide to spin up just one Kafka broker for all tests (or one broker per N tests), and then hand over a different topic name to each test that needs one, thus still ensuring that tests don't clash with one another. In other words, our tests ask for the resources they need and dynamically deal with what they receive, no matter what the unimportant details―like host, port, or topic name―happen to be. In this way, we maintain maximum flexibility to provide full resource isolation (at the cost of performance) or to share some resources across tests.

With this approach, we can even make the decision based on the environment where the tests are running. For example, a CI post-commit hook might have access to more computing resources than a developer's computer. Likewise, a developer doing iterative work may want to assess very quickly whether tests pass or not, whereas a post-commit hook runs asynchronously and can afford to take more time to execute.
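A sketch of what such a provider might look like, with KafkaWrapper as a hypothetical object that exposes the host and port of an already-running broker:

# Sketch of the "shared broker, unique topic per test" trade-off.
import uuid


class TestResourceProvider:
    def __init__(self, kafka_wrapper):
        self._kafka = kafka_wrapper          # could be one broker shared by N tests

    def kafka_broker(self):
        return self._kafka                   # tests interrogate it for host/port

    def new_topic_name(self, prefix="test"):
        # any name will do, as long as no two tests receive the same one
        return "{}-{}".format(prefix, uuid.uuid4().hex)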

Going forward

As the number of tests we write increases, our test suite takes longer to run, and we need to consider new ways to run our tests more quickly and efficiently. Having a well-designed API for requesting the resources that our tests need will hopefully give us enough flexibility to continue doing so.

In a future post, I might talk about the various knobs we have been tuning in our Gradle and TestNG configurations in order to manage parallelism and maintain a low total runtime for our test suite even as it grows bigger.

Redis 3.0 集群特性实验过程


The cluster functionality in Redis 3.0 is very powerful; its most notable feature is the cluster capability, and the redis-trib.rb tool makes it easy to build a Redis Cluster. Redis Cluster uses a decentralized design: every node holds data and the full cluster state, and every node is connected to all other nodes. Nodes use the gossip protocol to propagate information and discover new nodes. For the underlying principles, please see the official documentation; this post only records the steps of the experiment, to get a better feel for the cluster features.

# Install dependency packages:

yum install gcc ruby zlib rubygems

wget https://rubygems.org/downloads/redis-3.2.2.gem

gem install redis-3.2.2.gem

# Version: redis-3.2

# Topology

192.168.100.41  master: 6379  slave: 7379

192.168.100.42  master: 6379  slave: 7379

192.168.100.43  master: 6379  slave: 7379

# Common configuration:

more /usr/local/redis-3.2/conf/redis-common.conf

#GENERAL

daemonize yes

tcp-backlog 511

timeout 0

tcp-keepalive 0

loglevel notice

databases 16

dir /usr/local/redis-3.2/{data,data_7379}

slave-serve-stale-data yes

slave-read-only yes

#not use default

repl-disable-tcp-nodelay yes

slave-priority 100

appendonly yes

appendfsync everysec

no-appendfsync-on-rewrite yes

auto-aof-rewrite-min-size 64mb

lua-time-limit 5000

cluster-enabled yes

cluster-node-timeout 15000

cluster-migration-barrier 1

slowlog-log-slower-than 10000

slowlog-max-len 128

notify-keyspace-events ""

hash-max-ziplist-entries 512

hash-max-ziplist-value 64

list-max-ziplist-entries 512

list-max-ziplist-value 64

set-max-intset-entries 512

zset-max-ziplist-entries 128

zset-max-ziplist-value 64

activerehashing yes

client-output-buffer-limit normal 0 0 0

client-output-buffer-limit slave 256mb 64mb 60

client-output-buffer-limit pubsub 32mb 8mb 60

hz 10

aof-rewrite-incremental-fsync yes

# Per-port configuration: {6379,7379}

more /usr/local/redis-3.2/conf/redis-{6379,7379}.conf

include /usr/local/redis-3.2/conf/redis-{common,common_7379}.conf

port {6379,7379}

logfile"/usr/local/redis-3.2/logs/redis-{6379,7379}.log"

maxmemory 100m

# volatile-lru -> remove the key with an expire set using an LRU algorithm

# allkeys-lru -> remove any key according to the LRU algorithm

# volatile-random -> remove a random key with an expire set

# allkeys-random -> remove a random key, any key

# volatile-ttl -> remove the key with the nearest expire time (minor TTL)

# noeviction -> don't expire at all, just return an error on write operations

maxmemory-policy allkeys-lru

appendfilename "appendonly-{6379,7379}.aof"

dbfilename dump-{6379,7379}.rdb

cluster-config-file nodes-{6379,7379}.conf

auto-aof-rewrite-percentage 80-100

bind 192.168.100.{41,42,43}

# Start the processes

/usr/local/redis-3.2/bin/redis-server /usr/local/redis-3.2/conf/redis-{6379,7379}.conf

# Create the cluster

/usr/local/redis-3.2/bin/redis-trib.rb create --replicas 1 192.168.100.41:6379 192.168.100.42:6379 192.168.100.43:6379 192.168.100.41:7379 192.168.100.42:7379 192.168.100.43:7379

>>> Creating cluster

>>> Performing hash slots allocation on 6 nodes...

Using 3 masters:

192.168.100.43:6379

192.168.100.42:6379

192.168.100.41:6379

Adding replica 192.168.100.42:7379 to 192.168.100.43:6379

Adding replica 192.168.100.41:7379 to 192.168.100.42:6379

Adding replica 192.168.100.42:7379 to 192.168.100.41:6379

M: c2b3c9cb4b040e4ce48c7a20b4000a1d02e674bd 192.168.100.41:6379

slots:10923-16383 (5461 slots) master

M: 35fc4a46cfe68e941a18ca33e574df86db7beefb 192.168.100.42:6379

slots:5461-10922 (5462 slots) master

M: 2ef9b515fac6159b37520afce1f75b38ba1e9a87 192.168.100.43:6379

slots:0-5460 (5461 slots) master

S: 6a2d10792f17985d1e30e9e20fe92c890748487f 192.168.100.41:7379

replicates 35fc4a46cfe68e941a18ca33e574df86db7beefb

S: eb921729e82925c6be859185efb58e77b49e7a89 192.168.100.42:7379

replicates 2ef9b515fac6159b37520afce1f75b38ba1e9a87

S: eb921729e82925c6be859185efb58e77b49e7a89 192.168.100.42:7379

replicates c2b3c9cb4b040e4ce48c7a20b4000a1d02e674bd

Can I set the above configuration? (type 'yes' to accept): YES

>>> Nodes configuration updated

>>> Assign a different config epoch to each node

>>> Sending CLUSTER MEET messages to join thecluster

Waiting for the cluster to join.

>>> Performing Cluster Check (using node192.168.100.41:6379)

M: c2b3c9cb4b040e4ce48c7a20b4000a1d02e674bd 192.168.100.41:6379

slots:10923-16383 (5461 slots) master

M: 35fc4a46cfe68e941a18ca33e574df86db7beefb 192.168.100.42:6379

slots:5461-10922 (5462 slots) master

M: 2ef9b515fac6159b37520afce1f75b38ba1e9a87 192.168.100.43:6379

slots:0-5460 (5461 slots) master

M: 6a2d10792f17985d1e30e9e20fe92c890748487f 192.168.100.41:7379

slots: (0 slots) master

replicates 35fc4a46cfe68e941a18ca33e574df86db7beefb

M: eb921729e82925c6be859185efb58e77b49e7a89 192.168.100.42:7379

slots: (0 slots) master

replicates 2ef9b515fac6159b37520afce1f75b38ba1e9a87

M: eb921729e82925c6be859185efb58e77b49e7a89 192.168.100.42:7379

slots: (0 slots) master

replicates c2b3c9cb4b040e4ce48c7a20b4000a1d02e674bd

[OK] All nodes agree about slots configuration.

>>> Check for open slots...

>>> Check slots coverage...

[OK] All 16384 slots covered.

# Add nodes
master: 192.168.100.41:8379   slave: 192.168.100.42:8379   slave: 192.168.100.43:8379
Create the corresponding configuration files and data directories as above; only the port information needs to change.

# Nodes are ready
salt '*' cmd.run 'ps aux | grep redis-server | grep -v grep'

client.wboy.com:

root 134520.0 0.7 366847716 ? Ssl 11:200:13 /usr/local/redis-3.2/bin/redis-server 192.168.100.42:7379 [cluster] root 134620.1 0.9 387329756 ? Ssl 11:200:21 /usr/local/redis-3.2/bin/redis-server 192.168.100.42:6379 [cluster] root 165660.0 0.7 366847560 ? Ssl 16:020:00 /usr/local/redis-3.2/bin/redis-server 192.168.100.42:8379 [cluster]

master.weiboyi.com:

root 39910.1 0.5 407809800 ? Ssl 11:200:22 /usr/local/redis-3.2/bin/redis-server 192.168.100.41:6379[cluster] root 43240.0 0.4 133520 7720 ?Ssl 11:27 0:12 /usr/local/redis-3.2/bin/redis-server192.168.100.41:7379 [cluster] root 163470.0 0.3 366847560 ? Ssl16:02 0:00/usr/local/redis-3.2/bin/redis-server 192.168.100.41:8379 [cluster]

client1.weiboyi.com:

root 100140.1 0.3 366847736 ? Ssl 12:190:31 /usr/local/redis-3.2/bin/redis-server 192.168.100.43:7379 [cluster] root 100270.2 0.4 387329788 ? Ssl 12:200:39 /usr/local/redis-3.2/bin/redis-server 192.168.100.43:6379 [cluster] root 131530.0 0.3 366847564 ? Ssl 17:020:00 /usr/local/redis-3.2/bin/redis-server 192.168.100.43:8379 [cluster]
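With the cluster created and all of the processes running, a quick smoke test from Python might look like the following sketch; it assumes the redis-py-cluster package (rediscluster) is installed, which is an addition of mine and not part of the original experiment.

# Quick smoke test of the new cluster from Python (assumes redis-py-cluster is installed).
from rediscluster import StrictRedisCluster

startup_nodes = [
    {"host": "192.168.100.41", "port": 6379},
    {"host": "192.168.100.42", "port": 6379},
    {"host": "192.168.100.43", "port": 6379},
]

rc = StrictRedisCluster(startup_nodes=startup_nodes, decode_responses=True)
rc.set("foo", "bar")          # the key is routed to whichever master owns its hash slot
print(rc.get("foo"))          # -> bar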

# First, check the node information and status

/usr/local/redis-3.2/bin/redis-trib.rb check 192.168.100.41:7379

>>> Performing Cluster Check (using node192.168.100.41:7379)

S: 589ff9053237d77131f4cc6f6cf0006b3e38ea56 192.168.100.41:7379

slots: (0 slots) slave

replicates c2b3c9cb4b040e4ce48c7a20b4000a1d02e674bd

M: c2b3c9cb4b040e4ce48c7a20b4000a1d02e674bd 192.168.100.41:6379

slots:9223-16383 (7161 slots) master

1 addi

New in Cloudera Enterprise 5.8: Flafka Improvements for Real-Time Data Ingest


Learn about the new Apache Flume and Apache Kafka integration (aka, “Flafka”) available in CDH 5.8 and its support for the new enterprise features in Kafka 0.9.

Over a year ago, we wrote about the integration of Flume and Kafka (Flafka) for data ingest into Apache Hadoop. Since then, Flafka has proven to be quite popular among CDH users, and we believe that popularity is based on the fact that in Kafka deployments, Flume is a logical choice for ingestion “glue” because it provides a simple deployment model for quickly integrating events into HDFS from Kafka.

Kafka 0.9, released in late 2015, introduced a number of important features for the enterprise (particularly focusing on security; more on that later). These features were only implemented in the latest Kafka client API implementations, one of which (the new consumer) was also first introduced in Kafka 0.9.

Because the initial Flafka implementation was based on the "old" Kafka clients―clients that will soon be deprecated―some adjustments to Flume were needed to provide support for these new features. Thus, FLUME-2821, FLUME-2822, FLUME-2823, and FLUME-2852 were contributed upstream by the authors and will be part of the upcoming Flume 1.7 release. These changes were also back-ported into CDH 5.8, which was released in July 2016.

In the remainder of this post, we’ll describe those adjustments and the resulting configuration options.

Problem Statement

The Kafka community is committed to making new Kafka releases backward-compatible with clients. (For example, an 0.8.x client can talk to an 0.9.x broker.) However, due to protocol changes in 0.9, that assurance does not extend to forward compatibility from a client perspective: Thus, an 0.9.x client cannot reliably talk to a 0.8.x broker because there is no way for the client to know the version of Kafka to which it’s talking. (Hopefully, KIP-35 will offer progress in the right direction.)

Why does this matter? There are a few reasons:

Integrations that utilize Kafka 0.9.x clients, if they can talk to Kafka 0.8.x brokers at all (unlikely), may get cryptic error messages when doing so (for example, java.nio.BufferUnderflowException).

Integrations will only be able to support one major version of Kafka at a time without more complex class loading being done.

Older clients (0.8.x) will work when talking to the 0.9.x server, but that doesn't allow these clients to take advantage of the security benefits introduced in 0.9.

This problem puts projects like Flume in a tricky position. To simultaneously support current and previous versions would require some special and sophisticated class-loader mechanics, which would be difficult to build. So, based on lessons learned in projects like Apache Hive, Apache MRUnit, and Apache Spark, the community decided to take a “cut-over” approach for Flume: In short, the latest version of Flume (1.7) will only support brokers in Kafka 0.9 and later.

Changes in Flume 1.7

As a refresher on Flafka internals, Flafka includes a Kafka source, Kafka channel, and Kafka sink.


New in Cloudera Enterprise 5.8: Flafka Improvements for Real-Time Data Ingest

During the rewrite of the Flafka components, the agent-configuration naming scheme was also simplified. In an effort to convey importance, Flafka v1 provided Flume “mirror” properties to Kafka client properties. For example, the sink property requiredAcks was equal to the Kafka producer property request.required.acks . In the latest version of Flafka, these Flume properties are discarded in favor of matching the Kafka client properties. Helpfully, CDH 5.8 includes new logic that allows old configuration parameters to be picked up when applicable; however, users should switch to the new parameter style now. While this step is somewhat painful, in the long run, deployment is simplified.

Next, let’s explore this naming hierarchy. The configuration parameters are organized as follows:

Configuration values related to the component itself (source/sink/channel) are applied generically at the component config level: a1.channel.k1.type =

Configuration values related to Kafka, or to how the component operates, are prefixed with kafka. (analogous to the CommonClient configs and not dissimilar to how the HDFS sink operates): a1.channels.k1.kafka.topic = and a1.channels.k1.kafka.bootstrap.servers =

Properties specific to the producer/consumer are prefixed by kafka.producer or kafka.consumer : a1.sinks.s1.kafka.producer.acks and a1.channels.k1.kafka.consumer.auto.offset.reset

Where possible, the Kafka parameter names are used: bootstrap.servers and acks

The full documentation, with examples, has been updated and will be officially available when Flume 1.7 is released. For now, the following tables mirror the upstream Flume documentation.

Kafka Source
New in Cloudera Enterprise 5.8: Flafka Improvements for Real-Time Data Ingest
Kafka Sink
New in Cloudera Enterprise 5.8: Flafka Improvements for Real-Time Data Ingest

In addition, the sink now respects the key and topic Flume headers and will use the value of the key header when producing messages to Kafka.

Kafka Channel
New in Cloudera Enterprise 5.8: Flafka Improvements for Real-Time Data Ingest
Other Enhancements

Flume 1.7 will address a number of feature requests and improvements to Flafka components. The most interesting is the ability to natively read and write Flume Avro Events. This provides the ability to set and preserve headers in the Event itself. The Flume Avro Event schema is simple:

Map<String,String> headers byte[] body

This functionality has always been embedded in the Channel implementation. By adding the parameter useFlumeEventFormat = true , the source and sink can read and write, respectively, events using the above schema. Thus, any headers inserted by a source, or by any interceptors, can be passed from a Kafka sink to a Kafka source. This design also works for mixing and matching the Kafka channel with either the Kafka sink or Kafka source.

Examples

To follow are a couple example scenarios for implementing Kafka and Flume with security enabled, and using the new Flume release. It’s assumed that you have already followed appropriate documentation here to enable authentication and SSL for Kafka itself.

Scenario 1: Kafka Source -> Kafka Channel -> HDFS Sink

tier1.sources=kafkasource1
tier1.channels=kafkachannel
tier1.sinks=hdfssink
tier1.sources.kafkasource1.type=org.apache.flume.source.kafka.KafkaSource
tier1.sources.kafkasource1.channels=kafkachannel
tier1.sources.kafkasource1.kafka.bootstrap.servers=10.0.0.60:9092,10.0.0.61:9092,10.0.0.62:9092
tier1.sources.kafkasource1.kafka.topics=flume-aggregator-channel
tier1.sources.kafkasource1.kafka.consumer.fetch.min.bytes=200000
tier1.sources.kafkasource1.kafka.consumer.enable.auto.commit=false
tier1.channels.kafkachannel.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.kafkachannel.brokerList = 10.0.0.60:9092,10.0.0.61:9092,10.0.0.62:9092
tier1.channels.kafkachannel.kafka.topic = channeltopic
tier1.sinks.hdfssink.type=hdfs
tier1.sinks.hdfssink.channel=kafkachannel
tier1.sinks.hdfssink.hdfs.path=/user/flume/syslog/%Y/%m/%d
tier1.sinks.hdfssink.hdfs.rollSize=0
tier1.sinks.hdfssink.hdfs.rollCount=0
tier1.sinks.hdfssink.hdfs.useLocalTimeStamp=true
tier1.sinks.hdfssink.hdfs.fileType=DataStream
tier1.sinks.hdfssink.hdfs.batchSize=10000

Scenario 2: Kafka Channel -> HDFS Sink

tier1.channels=kafkachannel
tier1.sinks=hdfssink
tier1.channels.kafkachannel.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.kafkachannel.kafka.bootstrap.servers = 10.0.0.60:9092,10.0.0.61:9092,10.0.0.62:9092
tier1.channels.kafkachannel.kafka.topic = channeltopic
tier1.sinks.hdfssink.type=hdfs
tier1.sinks.hdfssink.channel=kafkachannel
tier1.sinks.hdfssink.hdfs.path=/user/flume/syslog/%Y/%m/%d
tier1.sinks.hdfssink.hdfs.rollSize=0
tier1.sinks.hdfssink.hdfs.rollCount=0
tier1.sinks.hdfssink.hdfs.useLocalTimeStamp=true
tier1.sinks.hdfssink.hdfs.fileType=DataStream
tier1.sinks.hdfssink.hdfs.batchSize=10000

Performance Tuning

Based on customer experiences, we know that it's already possible to achieve very high throughput using Flume and Kafka (over 1 million messages per second end-to-end using Flume 1.6 on a three-node Kafka cluster). Even better, the new producers and consumers in Kafka 0.9 introduce a whole set of new parameters for fine-tuning performance when writing to and reading from Kafka. In general terms, the key to achieving high throughput is to minimize the overhead of transactions by utilizing batches. There is a trade-off here, though: large batches often attract their own overhead (specifically memory), and they don't utilize resources smoothly. The larger the batch, the higher the latency (especially when the system isn't running at 100% of capacity―for example, with a peaky load profile). Batch control is slightly harder in Kafka 0.9, as there are several parameters involved, most of them are specified in bytes, and some depend on the number of partitions configured or the number of partitions being written to at a given time.


New in Cloudera Enterprise 5.8: Flafka Improvements for Real-Time Data Ingest

In a recent internal performance test using syslog messages, poorly versus well-tuned configurations yielded anywhere from 0.2x to 10x the performance of an un-tuned configuration.

Producer

One of the major changes in Flume 1.7 is that there is now no synchronous mode for the producer: essentially, all messages are sent asynchronously (that is, by a background thread that manages its own batches). In Kafka 0.8/Flume 1.6, synchronous mode was the default, although that setting could be changed.

Consumer
New in Cloudera Enterprise 5.8: Flafka Improvements for Real-Time Data Ingest
Security

As mentioned previously, perhaps the most important of the new features in Kafka 0.9 were those directly related to platform security: specifically, support for Kerberos/SASL authentication, wire encryption via SSL, and authorization support. Thus, because Flafka is based on the 0.9 Kafka clients, the Flume bits shipping in CDH 5.8 already support secure implementations of Kafka. (In a future post, we'll cover proper configuration of secure Kafka in CDH.)

Known Issues

Because the new clients have switched to using Kafka for offset storage, manual migration of offsets from Apache ZooKeeper to Kafka is necessary in the short term. Please see the full documentation on this process here.

Conclusion

After reading the above, you should have a good understanding of the significant changes introduced in Kafka 0.9, and their impact on the upcoming Flume 1.7 release. Fortunately, if you are a CDH user, you now have access to much of that functionality via the Flume 1.7 backports in CDH 5.8.

Jeff Holoman is an Account Manager at Cloudera, and a contributor to Apache Kafka and Apache Flume.

Tristan Stevens is a Senior Solutions Architect at Cloudera, and a contributor to Apache Flume.

Grigory Rozhkov is a Software Engineer at CyberVision Inc., and a contributor to Apache Flume.

Getting Started with MongoDB

What is MongoDB

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.[1]

MongoDB (the name comes from the English word "humongous") is an open-source database suitable for enterprises of all sizes, across industries and application types. As a database well suited to agile development, MongoDB's data schema can evolve flexibly as the application evolves. At the same time, it gives developers the capabilities of traditional databases: secondary indexes, a full query system, strict consistency, and so on. MongoDB helps enterprises become more agile and scalable; organizations of any size can use it to build new applications, work more effectively with their customers, shorten time to market, and reduce costs.

MongoDB is designed for scalability, high performance, and high availability. It can scale from a single-server deployment up to large, complex multi-data-center architectures. By exploiting in-memory computing, MongoDB delivers high-performance reads and writes, and its native replication and automatic failover give applications enterprise-grade reliability and operational flexibility.[2]

About MongoDB

MongoDB is a NoSQL database. A MongoDB server can hold several databases, and each database can hold many collections. A collection is a concept similar to a table in a SQL database. Each collection contains many documents, and each document is a JSON-style object made up of 'key': 'value' pairs. For easier understanding, the table below (adapted from [3]) maps the concepts.

SQL Term/Concept | MongoDB Term/Concept | Description
database | database | database
table | collection | table / collection
row | document | record / document
column | field | field / domain
index | index | index
table joins | (none) | table joins are not supported by MongoDB
primary key | primary key | MongoDB automatically uses the _id field as the primary key

Install MongoDB

For Mac OS, MongoDB can be installed using the command[4]:

brew install mongodb

If you want to install MongoDB with TLS/SSL support:

brew install mongodb --with-openssl

Install Pymongo

To use MongoDB from Python, you also need to install pymongo:

pip install pymongo

If you use Python 3, use pip3 to install pymongo.

Use MongoDB from the CLI

After installing MongoDB, you can start the mongo shell with[5]:

./mongo

Because the CLI is a JavaScript shell, you can execute code with it. For example, you can do some simple calculations.

> 1+2
3

List databases

> show dbs
local 0.000GB
test 0.000GB

Use a database

> use test
switched to db test

Show the name of the database in use

> db
test

Create a database

You can enter 'use database_name' to create a new database, if a database with that name does not already exist.

> use milestone
switched to db milestone

Only after you add documents to one of the database's collections will the database name appear in 'show dbs'. You can, however, always use 'db' to show the database currently in use.

List collections

show collections

Create a collection

You can use the createCollection method to create a collection in the database.

> show collections
> db.createCollection("muzixing",{size:100000})
{ "ok" : 1 }
> show collections
muzixing
>

Remove a collection

db.collection_name.drop()

Add a document to a collection

You can insert data with the insert method:

> db.muzixing.insert({"name":"www.muzixing.com"})
WriteResult({ "nInserted" : 1 })
>

Search for documents

Use the find() method to list all documents, or pass a query parameter to select specific documents.

> db.muzixing.find()
{ "_id" : ObjectId("57c6102b4366cfc975563b94"), "name" : "www.muzixing.com" }
{ "_id" : ObjectId("57c611d24366cfc975563b95"), "name" : "chengli" }
{ "_id" : ObjectId("57c611d94366cfc975563b96"), "name" : "milestone" }
>
> db.muzixing.find({'name':'www.muzixing.com'})
{ "_id" : ObjectId("57c6102b4366cfc975563b94"), "name" : "www.muzixing.com" }
>

Also, the findOne method can be used to get a single document.

> db.muzixing.findOne()
{ "_id" : ObjectId("57c6102b4366cfc975563b94"), "name" : "www.muzixing.com" }
> db.muzixing.find()
{ "_id" : ObjectId("57c6102b4366cfc975563b94"), "name" : "www.muzixing.com" }
{ "_id" : ObjectId("57c611d24366cfc975563b95"), "name" : "chengli" }
{ "_id" : ObjectId("57c611d94366cfc975563b96"), "name" : "milestone" }
>

Update a document

The update command's syntax is shown below:

db.collection.update( <query>, <update>, { upsert: <boolean>, multi: <boolean>, writeConcern: <document> } )

For example:

> db.muzixing.update({"name":"licheng"},{$set:{"name":"chengli"}})
WriteResult({ "nMatched" : 0, "nUpserted" : 0, "nModified" : 0 })
> db.muzixing.find()
{ "_id" : ObjectId("57c615564366cfc975563b97"), "name" : "chengli", "face" : "handsome" }
> db.muzixing.update({"name":"chengli"},{$set:{"name":"licheng"}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.muzixing.find()
{ "_id" : ObjectId("57c615564366cfc975563b97"), "name" : "licheng", "face" : "handsome" }
>

Remove a document

The syntax is shown below:

db.collection.remove( <query>, <justOne> )

For example:

> db.muzixing.find()
{ "_id" : ObjectId("57c615564366cfc975563b97"), "name" : "licheng", "face" : "handsome" }
{ "_id" : ObjectId("57c617694366cfc975563b98"), "name" : "girl friend", "face" : "beautiful" }
> db.muzixing.remove({'name':"licheng"})
WriteResult({ "nRemoved" : 1 })
> db.muzixing.find()
{ "_id" : ObjectId("57c617694366cfc975563b98"), "name" : "girl friend", "face" : "beautiful" }
>

For more information on MongoDB in Chinese, see the MongoDB tutorial.

For English speakers, see the MongoDB Docs and tutorialspoint-MongoDB.

Pymongo

In practice, people usually access MongoDB through a language library rather than the CLI. After learning MongoDB and its CLI usage, it is easy to understand how to use pymongo to manipulate MongoDB[7].

First of all, you should start a MongoDB server, for example on localhost; the default MongoDB port is 27017.

An example is shown below:

import pymongo
from pymongo import MongoClient

if __name__ == "__main__":
    # get client
    client = MongoClient('mongodb://localhost:27017/')
    print("client", client)

    # get database; if it already exists it is returned, otherwise it is created
    db = client.test
    print("db:", db)

    # get collection; if it already exists it is returned, otherwise it is created
    collection = db.chengli
    print("collection name", db.collection_names())
    print("collection: ", collection)

    # get a document
    print('find one: ', collection.find_one())

    # insert a data item
    new_man = {'age': 20, 'name': 'oo', 'sex': 'male', 'id': 5}
    collection.insert(new_man)
    print(collection.find().count())

    # get multiple items
    for i in collection.find():
        print(i)

    # find data (find_one returns None when nothing matches)
    data = collection.find_one({'name': "chengli"})
    if data:
        print(data['name'])

    # update and remove data
    collection.update({'name': 'haha'}, {'$set': {'title': 'employee'}})
    collection.remove({'title': 'employee'})

    # create an index and sort by it
    collection.create_index([('age', pymongo.ASCENDING), ])
    for i in collection.find().sort('age', pymongo.ASCENDING):
        print(i)

Note that this is Python 3 code. Some of the operations are non-idempotent (for example inserting and removing data), so different runs will produce different results.
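Since the insert and remove calls above are non-idempotent, here is a minimal sketch (assuming the same local server and collection) of an idempotent alternative using update_one with upsert=True from the pymongo 3.x CRUD API:

# Idempotent variant of the insert above: running it repeatedly
# leaves exactly one matching document in the collection.
from pymongo import MongoClient

collection = MongoClient('mongodb://localhost:27017/').test.chengli
collection.update_one({'id': 5},
                      {'$set': {'age': 20, 'name': 'oo', 'sex': 'male'}},
                      upsert=True)
print(collection.find({'id': 5}).count())   # always 1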

For more info on pymongo, see the Mongo API Doc.

References

[1] https://www.mongodb.com
[2] https://www.mongodb.com/cn
[3] http://www.runoob.com/mongodb/mongodb-databases-documents-collections.html
[4] https://docs.mongodb.com/manual/installation/
[5] http://www.runoob.com/mongodb/mongodb-linux-install.html
[7] http://wiki.jikexueyuan.com/project/start-learning-python/232.html

MongoDB Plugin 1.0.7: Primary Key Operations and Aggregation Support


MongoDB Plugin has the following features:

Kept in sync with MongoDB releases, with full support for the latest 3.2.6 version.

An API that reads more like natural language and is more comfortable to use.

Simplifies operations on the MongoDB Java driver and lowers the learning curve.

Supports MongoDB's authorization mechanism (login with username and password), connections to MongoDB replica sets, read/write splitting, safe writes, SSL connections, and more.

Built-in JFinal and Resty plugins (based on the latest versions of JFinal and Resty).

Main improvements in this release:

Added index-related support

Added an Exist validator; Query gains exist, or and nor methods

Added date validation support

Added aggregation support

Added new features such as join lookups

Fixed bugs and optimized code

Rewrote all test classes

54 commits over half a year, all aimed at building the most usable MongoDB tooling.

Downloads:

Source code(zip)

Source code(tar.gz)

Apache Kudu 0.10.0 Released: a Storage System for Hadoop


Apache Kudu 0.10.0 has been released.

About Apache Kudu

To address the trends identified earlier, there were two different options: keep improving the existing Hadoop tools, or design and build a new component. The goals were:

High performance for both data scans and random access, simplifying users' complex hybrid architectures;

High CPU efficiency, getting the most out of modern processors;

High I/O performance, taking full advantage of modern persistent storage media;

Support for in-place updates of data, avoiding extra data processing and data movement.

To achieve these goals, we first built prototypes on top of existing open source projects, but ultimately concluded that major architectural changes were required. Those changes were significant enough to justify building an entirely new data storage system. Development started three years ago, and today we can finally share the result of those years of effort: Kudu, a new data storage system.


Apache Kudu 0.10.0 Released: a Storage System for Hadoop

The changes are as follows:

Incompatible changes

0.8.0 clients are not fully compatible with servers running Kudu 0.7.1 or lower. In particular, scans that specify column predicates will fail. To work around this issue, upgrade all Kudu servers before upgrading clients.

New features

KUDU-431 A simple Flume sink has been implemented.

Improvements

KUDU-839 Java RowError now uses an enum error code.

Gerrit 2138 The handling of column predicates has been re-implemented in the server and clients.

KUDU-1379 Partition pruning has been implemented for C++ clients (but not yet for the Java client). This feature allows you to avoid reading a tablet if you know it does not serve the row keys you are querying.

Gerrit 2641 Kudu now uses earliest-deadline-first RPC scheduling and rejection. This changes the behavior of the RPC service queue to prevent unfairness when processing a backlog of RPC threads and to increase the likelihood that an RPC will be processed before it can time out.

Fixed Issues

KUDU-1337 Tablets from tables that were deleted might be unnecessarily re-bootstrapped when the leader gets the notification to delete itself after the replicas do.

KUDU-969 If a tablet server shuts down while compacting a rowset and receiving updates for it, it might immediately crash upon restart while bootstrapping that rowset’s tablet.

KUDU-1354 Due to a bug in Kudu’s MVCC implementation where row locks were released before the MVCC commit happened, flushed data would include out-of-order transactions, triggering a crash on the next compaction.

KUDU-1322 The C++ client now retries write operations if the tablet it is trying to reach has already been deleted.

Gerrit 2571 Due to a bug in the Java client, users were unable to close the kudu-spark shell because of lingering non-daemon threads.

Other noteworthy changes

Gerrit 2239 The concept of "feature flags" was introduced in order to manage compatibility between different Kudu versions. One case where this is helpful is if a newer client attempts to use a feature unsupported by the currently-running tablet server. Rather than receiving a cryptic error, the user gets an error message that is easier to interpret. This is an internal change for Kudu system developers and requires no action by users of the clients or API.

Full release notes: http://kudu.apache.org/releases/0.8.0/docs/release_notes.html

Download:

Kudu 0.8.0 source tarball (SHA1, MD5, Signature)

The Data Day: August 30, 2016


What happened in data and analytics this week will astound you

For 451 Research clients: NewSQL databases: a definitive guide http://bit.ly/2c7wJPu

For 451 Research clients: With Atlas, MongoDB jumps into hosted NoSQL DBaaS waters http://bit.ly/2c7wXq4 By Jim Curtis

For 451 Research clients: DataStax rolls graph into DSE 5.0, highlights NoSQL multi-models http://bit.ly/2bOKF2I By Jim Curtis

For 451 Research clients: Looker illuminates its analytics business and platform strategy http://bit.ly/2c7z396 By Krishna Roy

For 451 Research clients: Arcadia looks to simplify security for Hadoop analysis, lands Rackspace as reseller http://bit.ly/2c7wwvZ By Krishna Roy

For 451 Research clients: With $30m in funding, Vena looks to expand its cloud service to an enterprise platform http://bit.ly/2c7x3Oq By Krishna Roy

For 451 Research clients: Stitch emerges from RJMetrics with ETL as a service following cloud BI sale http://bit.ly/2c7wW5m

For 451 Research clients: CMC Markets sees promising self-service results with new ‘Google-like’ BI tool http://bit.ly/2c7xPuK By Jason Stamper

Tableau appointed Adam Selipsky as new CEO http://tabsoft.co/2bONxN1

Splunk reported a net loss of $86.6m on Q2 revenue up 43% to $212.8m http://splk.it/2bON1P5

Salesforce signs agreement to acquire BeyondCore http://bit.ly/2bOOZPK

Magnitude Software acquired Simba Technologies http://mwne.ws/2bONqRI

SAP is reportedly acquiring Altiscale for over $125m http://bit.ly/2bON8dE

Syncsort acquires Cogito to enhance mainframe data access http://prn.to/2c7ynRb

Galactic Exchange closes seed financing round http://bit.ly/2bOMxZw

Teradata makes Aster Analytics available on Hadoop and Teradata Aster Analytics on Amazon Web Services http://prn.to/2c7y5Kh

JSON support is generally available in Azure SQL Database http://bit.ly/2c7xVT9

Red Hat launches Red Hat Virtualization 4 http://red.ht/2bOMJb0

SnapLogic launches Summer 2016 release of its SnapLogic Elastic Integration Platform http://bit.ly/2bOOmpq

AWS releases Amazon Kinesis Analytics http://bit.ly/2bOOwNd

AWS licenses SQLstream technology for Amazon Kinesis Analytics service http://bit.ly/2c7CZaa

Percona delivers open source in-Memory storage engine for Percona Server for MongoDB http://bit.ly/2bOPIjH

Riversand launches MDMCenter v7.8 http://bit.ly/2c7zDUl

WANdisco announces the release of WANdisco Fusion 2.9 http://bit.ly/2bOOapU

And that’s the data day, today.

MongoDB schema design



Notes for 4th week of MongoDB for node.js course

Schema Design

There is an ideal way to design databases in the relational world: 3rd normal form. In MongoDB, it's important to keep data in a way that's conducive to the application using the data. You think about:

application data patterns

what pieces of data are used together

what pieces of data are mostly read-only

what pieces of data are written all the time

In contrast, in relational DBMS - the data is organized in such a way that is agnostic to the application.

MongoDB supports rich documents. We can store an array of items, and the value for a certain key can be an entire other document. This allows us to pre-join/embed data for fast access. That's important because MongoDB doesn't support joins directly inside the kernel; if we need to join, we need to join in the application itself, the reason being that joins are very hard to scale. This forces us to think ahead of time about what data we want to use together with other data, and we might wish to embed that data directly within the document. There are no constraints; in MongoDB this matters less than we might think, because embedding removes much of the need for them. MongoDB also doesn't support transactions, but atomic operations are supported within a single document. The data needs to be designed in such a way that it supports atomic operations.

There’s no declared schema but there’s a good chance that an application is going to have a schema. By having a schema, we mean that every single document in a particular collection is probably going to have a pretty similar structure. There might be small changes to that structure depending on the different versions of your application. Even though it’s not declared ahead of time, it’s important to think about the data structure so that the data schema itself supports all the different features of your application

Relational Normalization

Let’s look at the below denormalized table for a blog posts project. It’s not the 3rd normal form, it’s broken. Let’s say there are multiple posts with same author, we may update a few rows and leave others un-updated. Leaving the table data inconsistent.

Hence this violates normalization because it breaks a common way of describing normalized tables in 3rd normal form: every non-key attribute in the table must provide a fact about the key, the whole key and nothing but the key. That's a play on words on what you say in a US courtroom: telling the truth, the whole truth and nothing but the truth. The key in this case is the Post Id, and there is a non-key attribute, Author Email, which does not follow that rule, because it does in fact tell us something about the author. And so it violates 3rd normal form.

The above table can be represented in MongoDB as:

{ id: 'some id', title: 'some title', body: 'some content here', author: { name: 'author name', email: 'author email id' } }

Refresher - what are the goals of normalization?

Frees the database from modification anomalies - For MongoDB, it looks like embedding data would mostly cause this, and in fact we should try to avoid embedding data in documents in MongoDB where it would create these anomalies. Occasionally, we might need to duplicate data in the documents for performance reasons; however, that's not the default approach. The default is to avoid it.

Should minimize re-design when extending - MongoDB is flexible enough because it allows addition of keys without re-designing all the documents.

Avoid bias toward any particular access pattern - this is something we're not going to worry about when designing a schema in MongoDB. One of the ideas behind MongoDB is to tune the database to the application we're trying to write and the problem we're trying to solve.

Living Without Constraints

One of the great things about relational databases is that they are really good at keeping the data consistent within the database. One of the ways they do that is by using foreign keys. A foreign key constraint says that a column in one table may only contain values that appear in a column of another table. In MongoDB, there's no guarantee that foreign keys will be preserved; it's up to the programmer to make sure that the data is consistent in that manner. This may become possible in future versions of MongoDB, but today there's no such option. The alternative to foreign key constraints is embedding data.

Living Without Transactions

Transactions support ACID properties, but although there are no transactions in MongoDB, we do have atomic operations. Atomic operations mean that when you work on a single document, that work will be completed before anyone else sees the document; they'll see all the changes we made or none of them. Using atomic operations, you can often accomplish the same thing you would have accomplished using transactions in a relational database. The reason is that, in a relational database, we need to make changes across multiple tables, usually tables that need to be joined, and so we want to do that all at once. To do it, since there are multiple tables, we have to begin a transaction, do all those updates and then end the transaction. But with MongoDB, we're going to embed the data, since we're going to pre-join it in documents, and these are rich documents that have hierarchy. We can often accomplish the same thing. For instance, in the blog example, if we wanted to make sure that we updated a blog post atomically, we can do that because we can update the entire blog post at once. Whereas if it were a bunch of relational tables, we'd probably have to open a transaction so that we can update the post collection and the comments collection.

So what are our approaches that we can take in MongoDB to overcome a lack of transactions?

restructure - restructure the code so that we're working within a single document and taking advantage of the atomic operations offered within that document. If we do that, then usually we're all set.

implement in software - we can implement locking in software by creating a critical section. We can build test-and-set using findAndModify, and we can build semaphores if needed. In a way, that is how the larger world works anyway: if one bank needs to transfer money to another bank, they're not living in the same relational system. They each have their own relational databases, often, and they have to be able to coordinate that operation even though a transaction cannot span those database systems, only one system within one bank. So there are certainly ways in software to get around the problem.

tolerate - the final approach, which often works in modern web apps and other applications that take in a tremendous amount of data, is to just tolerate a bit of inconsistency. An example would be a friend feed in Facebook: it doesn't matter if everybody sees your wall update simultaneously. It's okay if one person is a few beats behind for a few seconds and then catches up. It often isn't critical in a lot of system designs that everything be kept perfectly consistent and that everyone have a perfectly consistent and identical view of the database. So we can simply tolerate a little bit of inconsistency that's somewhat temporary.

Update, findAndModify, $addToSet (within an update) and $push (within an update) operations operate atomically within a single document.
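As a rough illustration (not from the course notes; the database and collection names are made up), here is how an atomic comment append on a blog post document might look with pymongo's find_one_and_update, the driver counterpart of the shell's findAndModify:

# Atomically append a comment and bump a counter on a single post document.
from pymongo import MongoClient, ReturnDocument

posts = MongoClient('mongodb://localhost:27017/').blog.posts   # hypothetical db/collection

updated_post = posts.find_one_and_update(
    {'_id': 'some id'},
    {'$push': {'comments': {'author': 'jane', 'text': 'nice post'}},
     '$inc': {'comment_count': 1}},
    return_document=ReturnDocument.AFTER,   # get the document as it looks after the update
)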

One to one relations

1 to 1 relations are relations where each item corresponds to exactly one other item. e.g.:

an employee has a resume and vice versa

a building has a floor plan and vice versa

a patient has a medical history and vice versa

//employee
{ _id : '25', name: 'john doe', resume: 30 }

//resume
{ _id : '30', jobs: [....], education: [...], employee: 25 }

We can model the employee-resume relation by having a collection of employees and a collection of resumes and having the employee point to the resume through linking, where we have an ID that corresponds to an ID in the resume collection. Or, if we prefer, we can link in the other direction, where we have an employee key inside the resume collection that points to the employee itself. Or, if we want, we can embed: we could take the entire resume document and embed it right inside the employee collection, or vice versa.

This embedding depends upon how the data is being accessed by the application and how frequently the data is being accessed. We need to consider:

frequency of access

the size of the items - what is growing all the time and what is not growing. Every time we add something to a document, there is a point beyond which the document needs to be moved in the collection, and a document cannot grow beyond 16 MB (which is mostly unlikely to be reached).

atomicity of data - there are no transactions in MongoDB, only atomic operations on individual documents. So if we knew that we couldn't withstand any inconsistency and that we wanted to be able to update the entire employee plus the resume all the time, we might decide to put them into the same document and embed them one way or the other so that we can update it all at once.

One to Many Relations

In this relationship, many entities map to one entity. e.g.: a city has many people who live in that city. Say NYC has 8 million people.

Let’s assume the below data model:

//city { _id: 1, name: 'NYC', area: 30, people: [{ _id: 1, name: 'name', gender: 'gender' ..... }, .... 8 million people data inside this array .... ] }

This won’t work because that’s going to be REALLY HUGE. Let’s try to flip the head.

//people { _id: 1, name: 'John Doe', gender: gender, city: { _id: 1, name: 'NYC', area: '30' ..... } }

Now the problem with this design is that if there are obviously multiple people living in NYC, so we’ve done a lot of duplication for city data.

Probably, the best way to model this data is to use true linking .

//people { _id: 1, name: 'John Doe', gender: gender, city: 'NYC' } //city { _id: 'NYC', ... }

In this case, the people collection can be linked to the city collection. Knowing we don't have foreign key constraints, we have to be consistent about it. So this is a one to many relation and it requires 2 collections. For a small "one to few" relation (which is also one to many), like blog posts to comments, the comments can simply be embedded inside the post document as an array.

So, if it’s truly one to many, 2 collections works best with linking. But for one to few, one single collection is generally enough.

Many to Many Relations

e.g.:

books to authors

students to teachers

The books to authors relation is a few to few relationship, so we can have either an array of books or an array of authors inside the other's document. The same goes for students to teachers. We could also embed, at the risk of duplication; however, this would require that each student has a teacher in the system before insertion, and vice versa, which the application logic may not always allow. In other words, the parent object must exist for the child object to exist.

Multikeys

Multikey indexes are the feature that makes linking and embedding work so well. Assume we have 2 schemas, for students and teachers.

//students
{ _id: 1, name: 'John Doe', teachers: [1,7,10,23] }
//teachers
{ _id: 10, name: 'Tony Stark' }

Now there are 2 obvious queries.

How do we find all the teachers a particular student has had? That can be answered by looking at the teachers key in the students collection.
How do we find all the students who have had a particular teacher? That is a little more difficult and requires set operators; for it to be efficient, we need multikey indexes.

To create an index on the teachers key of the students collection, use db.students.ensureIndex({ teachers: 1 }).

Now, to find all the students who have had particular teachers, use a query like db.students.find({ teachers: { $all: [0,1] } }). If we append .explain() to that query, it shows the internal workings — which index and keys were used, and so on.
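Putting the pieces together, here is a minimal sketch against the students/teachers documents above (collection and field names as in the example; the specific ids are illustrative):

// multikey index over the teachers array
db.students.ensureIndex({ teachers: 1 })
// 1) all the teachers a particular student has had
db.students.find({ _id: 1 }, { teachers: 1, _id: 0 })
// 2) all the students who have had teacher 10 (served by the multikey index)
db.students.find({ teachers: 10 })
// students who have had BOTH teacher 1 and teacher 7, with the query plan
db.students.find({ teachers: { $all: [1, 7] } }).explain()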

Benefits of Embedding

The main reason for embedding documents in MongoDB is performance, and the main performance benefit comes from improved read performance. Why do we get better reads? The reason is the way computer systems are built: they often have spinning disks, and spinning disks have very high latency — it takes a very long time (up to about 1 ms) to get to the first byte. But once the first byte is accessed, each additional byte comes very quickly, so disks tend to be pretty high bandwidth. The idea is that if we can co-locate the data that is used together in the same document — embed it — then we spin the disk, find the sector where the information lives, start reading, and get everything we need in one go. It also means that if we have 2 pieces of data that would normally live in 2 collections, or in several relational database tables, they are instead in one document, and we avoid round trips to the database.

Trees

One classic problem from the world of schema design is how to represent a tree inside the database. Let's look at the example of representing product categories in an e-commerce site such as Amazon, where we have home, outdoors, winter, snow. The idea is that we have the products, and we also have a category collection, where we can look up category 7 and see the categoryName and some other properties of that category.

//products
{ category: 7, productName: 'ABC' }
//category
{ _id: 7, categoryName: 'outdoors', parent: 6 }

One way to do it is to keep a parent id — this is something we might do in a simple relational database. But it doesn't make it easy to find all the ancestors of a category: we have to query iteratively, finding the parent of this and the parent of that, until we get all the way to the top.

An alternative way to do it in MongoDB is to list ancestors or children directly. For instance, we could decide to list all the children of this category:

//category
{ _id: 7, categoryName: 'outdoors', children: [3,6,7,9] }

That's also fairly limiting if we want to find the entire sub-tree above or below a certain node. What works pretty well instead — again enabled by the ability to put arrays inside MongoDB documents — is to list the ancestors from the top, in order.

//category
{ _id: 7, categoryName: 'outdoors', ancestors: [3,7,5,8,9] }

Again, the ability to structure and express rich data is one of the things that makes MongoDB so interesting; this would be very difficult to do in a relational database. As for how you represent something like a product category hierarchy, it all depends on the access patterns — how we believe we will need to show and access the data for the user — and based on that, we know how to model it.
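For example, with the ancestors array in place, pulling out a whole sub-tree is a single indexed query — a sketch, assuming the collection is called category as in the snippets above:

// multikey index over the ancestors array
db.category.ensureIndex({ ancestors: 1 })
// every category that sits anywhere below category 7
db.category.find({ ancestors: 7 })
// the full ancestor path of a single category is already stored on the document itself
db.category.findOne({ _id: 7 }, { ancestors: 1 })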

When to Denormalize

One of the reasons for normalization in relational databases is to avoid the modification anomalies that come with duplication of data. Looking at how MongoDB is structured, allowing these rich documents, it's easy to assume that what we're doing is denormalizing the data, and to a certain extent that's true. But as long as we don't duplicate data, we don't open ourselves up to modification anomalies.

Generally it's good to embed the data in a one to one relationship. In a one to many relationship, embedding can also work well without duplication of data, as long as we embed from the many into the one. If we go from the one to the many, linking avoids the duplication. Embedding something even when it causes duplication can still make sense for performance reasons, to match the access patterns of the application, especially if the data is rarely changed or updated. But we can often avoid duplication even in this relationship by going from the many to the one. In a many to many relation — which we looked at with students and teachers, and authors and books — if you want to avoid the modification anomalies that come with denormalization, all you need to do is link through arrays of object ids in the documents.

These are all guidelines. For a real-world application we may still need to embed data for performance reasons, to match the application's data access patterns.


Methods for Connecting to a Database

Hello, I'm KitStar.

Note: this article is about developing Windows (WinForms) applications, not Unity3D, and the database used here is SQL Server.

Part 1:

Today's topic is the ADO.NET classes used to connect to a database. ADO.NET is a set of classes that exposes data access services to .NET programmers. It provides a series of methods that support access to data sources such as Microsoft SQL Server and XML. Clients can use ADO.NET to connect to a data source and query, add, delete, and update the data it contains.

ADO.NET supports two models, disconnected and connected:

1. Disconnected: the data is downloaded to the client machine and held in memory on the client, where it can be accessed just like a local relational database (e.g. through a DataSet).

2. The connected model relies on record-by-record access, which requires opening and holding a connection to the data source; once the connection is gone, so is your access to the data.

Part 2:

The ADO.NET object model works rather like a water-pumping system:



(1) The database is the water source, storing large amounts of data.

(2) The Connection object is the intake pipe reaching into the water source. It keeps contact with the water; only while connected can any water be pumped.

(3) The Command object is the pump. It provides the power and the mechanism for pumping: it draws water through the intake and hands it to the delivery pipes.

(4) The DataAdapter and DataReader objects are the delivery pipes. They carry the water and act as the bridge. They differ: the DataAdapter pumps the water into a reservoir (a DataSet) for storage, whereas the DataReader does not go through a reservoir at all — it delivers the water one-way, straight to the consumer, which is faster than making a stop in the reservoir.

(5) The DataSet object is a large reservoir that stores the pumped water in pools arranged by their relationships. Even if the pumping gear is removed or breaks (the connection is closed, we are offline), the "water" is still there. This is the core of ADO.NET.

(6) The DataTable object is an individual pool inside the reservoir, holding one kind of water. A reservoir is made up of one or more such pools.

Part 3:

Now for the actual code.

(Prerequisites: 1. A database (e.g. SQL Server) with the corresponding data source — the "water" described in Part 2 — has already been created. 2. A client project has been created in Microsoft Visual Studio. The setup of these two pieces is not covered here; there are plenty of resources online.)

(Of course, different databases require different namespaces:

1. SQL Server: the System.Data.SqlClient namespace.

2. ODBC: the System.Data.Odbc namespace.

3. OLE DB: the System.Data.OleDb namespace.

4. Oracle: the System.Data.OracleClient namespace.)

1. Building the intake (the Connection object)

First we work with the Connection object — the intake has to be put into the water source before we can draw any water, right? (Create a WinForms application, add a TextBox, a Button, and a Label to the form, bring in the System.Data.SqlClient namespace, and use the SqlConnection class to connect to the database.)

using System;                          // EventArgs
using System.Data;                     // ConnectionState
using System.Data.SqlClient;           // required for SqlConnection
using System.Drawing;                  // Point
using System.Windows.Forms;            // Form, controls, MessageBox

namespace FormApplicationStudent1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            this.AcceptButton = button1;
        }

        private void button1_Click(object sender, EventArgs e)        // when the button is clicked
        {
            if (string.IsNullOrEmpty(textBox1.Text))                   // make sure a database name was entered
            {
                MessageBox.Show("Please enter a valid database name");
            }
            else
            {
                try
                {
                    string strCon = "server=RUWENKEJI\\SQLEXPRESS;database=" + textBox1.Text.Trim() + "; uid=sa;pwd=www090924";
                    // server is the server name, database the database name, uid the user name, pwd the password.

                    SqlConnection conn = new SqlConnection(strCon);     // create the SqlConnection object — our intake pipe
                    conn.Open();                                        // open the tap
                    if (conn.State == ConnectionState.Open)             // check the connection state
                    {
                        Label lab = new Label();
                        lab.Location = new Point(25, 25);
                        this.Controls.Add(lab);
                        lab.Text = "Connected to database: " + textBox1.Text.Trim();
                    }
                }
                catch
                {
                    MessageBox.Show("Failed to connect to the database", "", MessageBoxButtons.OK, MessageBoxIcon.Error);
                }
            }
        }
    }
}
When the client runs and the connection succeeds, the label is added to the form (screenshot omitted). At this point the intake is open. (To close the connection, call Close or Dispose on the SqlConnection object. The main difference: Close merely turns off the tap and the connection can be reopened with Open, while Dispose throws the tap away entirely — to use it again you have to create a new instance and open it afresh.)

2. The pump: the Command object

The intake is ready, so now we use the pump — the Command object — to draw water. This is the most important part of the whole system.

The Command object is a data command object. In the analogy it is what filters the water; in the program its job is to send query, update, delete, and insert SQL statements to the database.

(Again, different databases use different Command classes:

1. SqlCommand: used here for SQL Server; it sends SQL commands to SQL Server and lives in System.Data.SqlClient.

2. OleDbCommand: sends SQL statements to databases exposed through OLE DB; it lives in System.Data.OleDb. For example, Access and MySQL databases can be accessed through OLE DB.

3. OdbcCommand: sends SQL statements to databases exposed through ODBC; it lives in System.Data.Odbc.

4. OracleCommand: sends SQL statements to Oracle databases; it lives in System.Data.OracleClient.)

The Command object has three important properties: Connection, CommandText, and CommandType. Connection assigns the SqlConnection (the intake) that the SqlCommand (the pump) will draw from — a pump has to be told which pipe to use, otherwise it doesn't know where to pump from. CommandText holds the SQL statement or stored procedure to run against the data source — in plain terms, it tells the database what kind of water we want. CommandType describes how CommandText should be interpreted; it is an enumeration with three members: StoredProcedure (the name of a stored procedure), TableDirect (the name of a table), and Text (a SQL text command).

Concretely: create a WinForms application, add a TextBox, a Button, and a Label to the form, bring in the System.Data.SqlClient namespace, and use the SqlConnection class to connect to the database.

using System;                          // EventArgs, Convert
using System.Data;                     // CommandType, ConnectionState
using System.Data.SqlClient;
using System.Drawing;                  // Point, Size
using System.Windows.Forms;

namespace FormApplicationStudent1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            this.AcceptButton = button1;
        }

        SqlConnection conn;

        private void Form1_Load(object sender, EventArgs e)            // when the form loads
        {
            if (MessageBox.Show("Open the form?", "", MessageBoxButtons.YesNo, MessageBoxIcon.Information) != DialogResult.Yes)
            {
                this.Close();
            }
            string strCon = "server=RUWENKEJI\\SQLEXPRESS;database=" + "likia" + "; uid=sa;pwd=www090924";
            conn = new SqlConnection(strCon);                           // create the SqlConnection object — our intake pipe
            conn.Open();                                                // open the tap
        }

        private void button1_Click(object sender, EventArgs e)         // when the button is clicked
        {
            try
            {
                if (conn.State == ConnectionState.Open && !string.IsNullOrEmpty(textBox1.Text))   // connection open and a table name entered
                {
                    SqlCommand comm = new SqlCommand();                 // create the SqlCommand object — the pump
                    comm.Connection = conn;                             // tell it which intake to draw from
                    comm.CommandText = "SELECT Count(*) FROM " + textBox1.Text.Trim();   // SQL that "filters" the water: the table's row count
                    comm.CommandType = CommandType.Text;                // a SQL text command
                    int i = Convert.ToInt32(comm.ExecuteScalar());      // ExecuteScalar returns the first column of the first row — here, the count

                    comm.CommandText = "SELECT * FROM " + textBox1.Text.Trim();   // now fetch the table's rows
                    comm.CommandType = CommandType.Text;
                    SqlDataReader sqlRead = comm.ExecuteReader();
                    // ExecuteReader returns a SqlDataReader holding the rows produced by the SQL statement.

                    Label lab = new Label();
                    lab.Location = new Point(50, 50);
                    this.Controls.Add(lab);
                    lab.Size = new Size(new Point(200, 200));
                    lab.Text = "The table contains " + i.ToString() + " rows. ";
                    while (sqlRead.Read())                              // while there are records to read
                    {
                        lab.Text += sqlRead[1].ToString() + " ";        // column at index 1, i.e. Name
                        lab.Text += sqlRead[2].ToString() + " ";        // column at index 2, i.e. sex
                    }
                    sqlRead.Close();                                    // release the reader
                    button1.Enabled = false;
                }
            }
            catch
            {
                MessageBox.Show("Failed to connect to the database", "", MessageBoxButtons.OK, MessageBoxIcon.Error);
            }
        }
    }
}

After running, the label shows the row count followed by the Name and sex values read from the table (screenshot omitted).

Note the several methods SqlCommand offers for executing SQL statements:

a. ExecuteNonQuery: executes the SQL statement and returns the number of rows affected; used when sending INSERT, DELETE, or UPDATE statements to the database (a short sketch follows this list).

b. ExecuteReader: used in the code above; it returns a SqlDataReader instance containing the data from the table.

c. ExecuteScalar: also used above; it returns the first column of the first row of the result set — for the COUNT(*) query that is exactly the row count.
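A minimal sketch of (a), ExecuteNonQuery — the Student table, its Name/Sex columns, and the reuse of the strCon connection string from the examples above are assumptions for illustration:

using (SqlConnection conn = new SqlConnection(strCon))
{
    conn.Open();
    SqlCommand insert = new SqlCommand(
        "INSERT INTO Student (Name, Sex) VALUES (@name, @sex)", conn);
    insert.Parameters.AddWithValue("@name", "Jane Doe");
    insert.Parameters.AddWithValue("@sex", "F");
    int rows = insert.ExecuteNonQuery();                 // returns the number of rows affected
    MessageBox.Show(rows + " row(s) inserted");
}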

ExecuteReader deserves a closer look. Its return value is a SqlDataReader (a DataReader object). The DataReader is a forward-only data reader: if the application only needs to fetch the latest data from the database quickly, without modifying it, this is the method to use. It can also be used to check whether the queried table contains any rows.

Use the SqlDataReader's HasRows property (a bool) to check whether the table has any rows.

To read the data, use the SqlDataReader's Read method; Read advances the SqlDataReader to the next record.
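A short sketch of HasRows and Read used together — it assumes the open connection conn and the table name typed into textBox1 from the example above:

SqlCommand comm = new SqlCommand("SELECT * FROM " + textBox1.Text.Trim(), conn);
using (SqlDataReader reader = comm.ExecuteReader())
{
    if (!reader.HasRows)                                 // true only if the query returned at least one row
    {
        MessageBox.Show("The table is empty.");
    }
    else
    {
        while (reader.Read())                            // advance to the next record
        {
            Console.WriteLine(reader[1]);                // read one column of the current row (or append to a Label as above)
        }
    }
}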

3. The delivery pipe (reading data)

Section 2 introduced SqlDataReader as one way of reading data. Here we introduce the DataAdapter (data adapter) object. As described at the beginning, it carries the heavy duty of the delivery pipe: it is the bridge between the DataSet and the data source, the indispensable go-between used to move data between the two.

1. The DataAdapter exposes four command properties:

a. SelectCommand: sends a SELECT (query) statement to the database.

b. DeleteCommand: sends a DELETE statement to the database.

c. InsertCommand: sends an INSERT statement to the database.

d. UpdateCommand: sends an UPDATE statement to the database.

2. And a couple of key methods:

a. Fill: fills a DataSet with data.

b. Update: when updating the database, the DataAdapter calls its DeleteCommand, InsertCommand, and UpdateCommand properties so that modified rows are written back to the database promptly. Before calling this method you must create a CommandBuilder, which can derive the DeleteCommand, InsertCommand, and UpdateCommand from the SQL statement in the DataAdapter's SelectCommand (a sketch follows below).
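A minimal sketch of Update with a SqlCommandBuilder — the Student table (which needs a primary key for the builder to generate its commands) and the reuse of conn are assumptions for illustration:

SqlDataAdapter adapter = new SqlDataAdapter("SELECT * FROM Student", conn);
SqlCommandBuilder builder = new SqlCommandBuilder(adapter);   // derives INSERT/UPDATE/DELETE from the SELECT
DataSet ds = new DataSet();
adapter.Fill(ds, "Student");                                  // pull the current rows into the reservoir
ds.Tables["Student"].Rows[0]["Name"] = "New Name";            // change the in-memory copy
adapter.Update(ds, "Student");                                // push the change back to the database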

3. Now let's actually pump some water: with the delivery pipe in place, we use Fill to pour the water into the DataSet (the reservoir).

string strCon = "server=RUWENKEJI\\SQLEXPRESS;database=" + "likia" + "; uid=sa;pwd=www090924";
SqlConnection conn = new SqlConnection(strCon);                // create the SqlConnection object — the intake pipe
conn.Open();                                                   // open the tap
SqlCommand comm = new SqlCommand();                            // create the SqlCommand object — the pump
comm.Connection = conn;                                        // tell it which intake to draw from
comm.CommandText = "SELECT Count(*) FROM " + textBox1.Text.Trim();   // SQL that "filters" the water: the table's row count
comm.CommandType = CommandType.Text;                           // a SQL text command
SqlDataAdapter ad = new SqlDataAdapter();                      // create the delivery pipe (SqlDataAdapter)
ad.SelectCommand = comm;                                       // give the pipe its SqlCommand (the pump), so it can get the water
DataSet dset = new DataSet();                                  // create the reservoir (DataSet) to hold the water
ad.Fill(dset);                                                 // the pipe fills the reservoir
dataGridView1.DataSource = dset.Tables[0];                     // hand the data to the dataGridView control for display

4. The reservoir (the DataSet)

The DataSet object is like a small database held in memory. It can contain data tables, columns, rows, views, constraints, and relations. Its data usually comes from a database or from XML. To get data from a database it needs a DataAdapter (see section 3), whose Fill method is then used to populate the DataSet.

1. The DataSet provides some useful methods:

a. Merge: merges the contents of a DataSet, a DataTable, or an array of DataRows into an existing DataSet.

b. Copy: when you don't want to modify or damage the original data, you can create a copy of the DataSet to work on; Copy returns a new DataSet (a short sketch of both methods follows).
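A short sketch of Copy and Merge, continuing from the dset DataSet filled in the example above:

DataSet backup = dset.Copy();      // a full copy — schema plus data; the original stays untouched
// ... work against backup without any risk to the original ...
dset.Merge(backup);                // merge a DataSet (or DataTable / DataRow array) into an existing DataSet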
