Data Warehouse Practice on the Hadoop Ecosystem: OLAP and Data Visualization (Part 5)

September 1, 2016, 2:24 am

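V. Comparing Hue and Zeppelin

The previous section briefly introduced Hue, a data visualization component of the Hadoop ecosystem. This section discusses a similar product, Zeppelin. It first introduces Zeppelin, then walks through the installation in detail, then shows how to add a MySQL interpreter to Zeppelin, and finally compares Hue and Zeppelin in terms of features, architecture, and usage scenarios.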

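1. Introduction to Zeppelin

Zeppelin is a web-based application for interactive data analysis. It began as an Apache Software Foundation incubator project and became a Top-Level Project (TLP) in May 2016. Zeppelin describes itself as a notebook for data ingestion, data discovery, data analytics, and data visualization, helping developers, data scientists, and related users work with data more effectively without resorting to complex command lines or worrying about cluster implementation details. The architecture of Zeppelin is shown in the figure below.

As the figure shows, Zeppelin has a client/server architecture, where the client is usually a browser. The server receives requests from the client and forwards them over the Thrift protocol to an interpreter group. An interpreter group physically takes the form of a JVM process; it does the actual work of handling the client's request and communicates with the server.

Interpreters follow a plug-in architecture that allows any language or back-end data processing system to be added to Zeppelin as a plug-in. Notably, Zeppelin has a built-in Spark interpreter, so no separate module, plug-in, or library needs to be built. The architecture of the Spark interpreter is shown in the figure below.

Zeppelin already supports many interpreters. Version 0.6.0, for example, ships with 18: alluxio, cassandra, file, hbase, ignite, kylin, md, phoenix, sh, tajo, angular, elasticsearch, flink, hive, jdbc, lens, psql, and spark. The plug-in architecture lets users work in Zeppelin with the language or data processing back end they already know. For example, the %spark interpreter lets you write Scala code in Zeppelin.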
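The plug-in idea can be illustrated with a small sketch. This is not Zeppelin's code: the registry, the `register` decorator, and the two toy handlers are hypothetical names invented here, and the only behavior modeled is routing a paragraph to a handler by the `%name` on its first line.

```python
from typing import Callable, Dict

# Hypothetical registry: maps the name after '%' to a handler function.
INTERPRETERS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that plugs a handler into the registry under `name`."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        INTERPRETERS[name] = fn
        return fn
    return wrap


@register("md")
def run_markdown(body: str) -> str:
    # Toy handler: a real interpreter would render the markdown.
    return f"[markdown] {body}"


@register("sh")
def run_shell(body: str) -> str:
    # Toy handler: a real interpreter would execute the command.
    return f"[shell] {body}"


def dispatch(paragraph: str) -> str:
    """Route a paragraph to the interpreter named on its first line."""
    first, _, body = paragraph.partition("\n")
    if not first.startswith("%"):
        raise ValueError("paragraph must start with %<interpreter>")
    name = first[1:].strip()
    if name not in INTERPRETERS:
        raise KeyError(f"no interpreter registered for %{name}")
    return INTERPRETERS[name](body.strip())


result = dispatch("%md\n# Hello")
```

Adding a new back end in this sketch means writing one more decorated function, which is the property the plug-in architecture is meant to provide.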

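For data visualization, Zeppelin already includes some basic charts such as bar charts, pie charts, line charts, and scatter plots, and the output of any back-end language can be rendered graphically.

Each query a user creates is called a note. A note's URL can be shared among multiple users, and Zeppelin broadcasts changes to the note to all of them in real time. Zeppelin also provides a URL that shows only the query result, with no menus or buttons on the page, which makes it easy to embed the result as a frame in your own website.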

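2. Installing and Configuring Zeppelin

The steps below install and configure Zeppelin in a test environment, using a typical scenario: running SparkSQL from Zeppelin to access Hive tables.

(1) Installation environment

A 12-node Spark cluster deployed in standalone mode. The following table lists the processes running on each node.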

Hostname          Processes
nbidc-agent-03    NameNode, Spark Master
nbidc-agent-04    SecondaryNameNode
nbidc-agent-11    ResourceManager, DataNode, NodeManager, Spark Worker
nbidc-agent-12    DataNode, NodeManager, Spark Worker
nbidc-agent-13    DataNode, NodeManager, Spark Worker
nbidc-agent-14    DataNode, NodeManager, Spark Worker
nbidc-agent-15    DataNode, NodeManager, Spark Worker
nbidc-agent-18    DataNode, NodeManager, Spark Worker
nbidc-agent-19    DataNode, NodeManager, Spark Worker
nbidc-agent-20    DataNode, NodeManager, Spark Worker
nbidc-agent-21    DataNode, NodeManager, Spark Worker
nbidc-agent-22    DataNode, NodeManager, Spark Worker

Operating system: CentOS release 6.4

Hadoop version: 2.7.0

Hive version: 2.0.0

Spark version: 1.6.0

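(2) Install and deploy Zeppelin and its related components on nbidc-agent-04

Prerequisite: nbidc-agent-04 must be able to reach the Internet.

Install Git: run the following commands on nbidc-agent-04.

yum install curl-devel expat-devel gettext-devel openssl-devel zlib-devel
yum install gcc perl-ExtUtils-MakeMaker
yum remove git
cd /home/work/tools/
# save the archive under the name that tar expects below
wget -O git-2.8.1.tar.gz https://github.com/git/git/archive/v2.8.1.tar.gz
tar -zxvf git-2.8.1.tar.gz
# enter the extracted directory, not the archive
cd git-2.8.1
make prefix=/home/work/tools/git all
make prefix=/home/work/tools/git install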

Install Java: on nbidc-agent-03, run the following command to copy the Java installation directory to nbidc-agent-04.

scp -r jdk1.7.0_75 nbidc-agent-04:/home/work/tools/

Install Apache Maven: run the following commands on nbidc-agent-04.

cd /home/work/tools/
wget ftp://mirror.reverse.net/pub/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar -zxvf apache-maven-3.3.9-bin.tar.gz

Install the Hadoop client: on nbidc-agent-03, run the following command to copy the Hadoop installation directory to nbidc-agent-04.

scp -r hadoop nbidc-agent-04:/home/work/tools/

Install the Spark client: on nbidc-agent-03, run the following command to copy the Spark installation directory to nbidc-agent-04.

scp -r spark nbidc-agent-04:/home/work/tools/

Install the Hive client: on nbidc-agent-03, run the following command to copy the Hive installation directory to nbidc-agent-04.

scp -r hive nbidc-agent-04:/home/work/tools/

Install phantomjs: run the following commands on nbidc-agent-04.

cd /home/work/tools/
tar -jxvf phantomjs-2.1.1-linux-x86_64.tar.bz2

Download the latest Zeppelin source code: run the following commands on nbidc-agent-04.

cd /home/work/tools/
git clone https://github.com/apache/incubator-zeppelin.git

Set environment variables: on nbidc-agent-04, edit /home/work/.bashrc as follows.

vi /home/work/.bashrc

# Add the following lines
export PATH=.:$PATH:/home/work/tools/jdk1.7.0_75/bin:/home/work/tools/hadoop/bin:/home/work/tools/spark/bin:/home/work/tools/hive/bin:/home/work/tools/phantomjs-2.1.1-linux-x86_64/bin:/home/work/tools/incubator-zeppelin/bin
export JAVA_HOME=/home/work/tools/jdk1.7.0_75
export HADOOP_HOME=/home/work/tools/hadoop
export SPARK_HOME=/home/work/tools/spark
export HIVE_HOME=/home/work/tools/hive
export ZEPPELIN_HOME=/home/work/tools/incubator-zeppelin

# Save the file, then apply the settings
source /home/work/.bashrc

Build the Zeppelin source code: run the following commands on nbidc-agent-04.

cd /home/work/tools/incubator-zeppelin
mvn clean package -Pspark-1.6 -Dspark.version=1.6.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests

(3) Configure Zeppelin

Configure zeppelin-env.sh: run the following commands on nbidc-agent-04.

cp /home/work/tools/incubator-zeppelin/conf/zeppelin-env.sh.template /home/work/tools/incubator-zeppelin/conf/zeppelin-env.sh
vi /home/work/tools/incubator-zeppelin/conf/zeppelin-env.sh

# Add the following lines
export JAVA_HOME=/home/work/tools/jdk1.7.0_75
export HADOOP_CONF_DIR=/home/work/tools/hadoop/etc/hadoop
export MASTER=spark://nbidc-agent-03:7077

Configure zeppelin-site.xml: run the following commands on nbidc-agent-04.

cp /home/work/tools/incubator-zeppelin/conf/zeppelin-site.xml.template /home/work/tools/incubator-zeppelin/conf/zeppelin-site.xml
vi /home/work/tools/incubator-zeppelin/conf/zeppelin-site.xml

# Change the value in this property to set the Zeppelin port to 9090
<property>
  <name>zeppelin.server.port</name>
  <value>9090</value>
  <description>Server port.</description>
</property>

Copy hive-site.xml into Zeppelin's configuration directory: run the following commands on nbidc-agent-04.

cd /home/work/tools/incubator-zeppelin
cp /home/work/tools/hive/conf/hive-site.xml conf/

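(4) Start Zeppelin

Run the following command on nbidc-agent-04.

zeppelin-daemon.sh start

(5) Test

Open http://nbidc-agent-04:9090/ in a browser, as shown in the figure below.

Click the 'Interpreter' menu, then configure and save the spark interpreter, as shown in the figure below.

Configure and save the hive interpreter, as shown in the figure below.

Click 'NoteBook' -> 'Create new note' to create a new query and run it. The result is shown in the figure below.

Note: this is a dynamic-form SQL query. The SparkSQL statement is:

%sql
select * from wxy.t1 where rate > ${r}

The first line selects the SparkSQL interpreter. The second line uses ${r} to declare a run-time parameter. When the paragraph runs, a text input box appears on the page; type a value and press Enter, and the query runs with that parameter. In the figure, the query returns the records with rate > 100.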
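As a rough illustration of what a dynamic form does with ${r}, the sketch below fills the placeholder with Python's standard `string.Template`, which happens to use the same ${name} syntax. This only models the text substitution; it is an assumption for illustration, not how Zeppelin implements forms, and the helper name `bind_form` is made up.

```python
from string import Template


def bind_form(sql: str, values: dict) -> str:
    """Replace each ${name} placeholder with the value typed into the form.

    Plain text substitution: no quoting or escaping is applied, so this is
    only suitable for trusted input such as a numeric threshold.
    """
    return Template(sql).substitute(values)


query = "select * from wxy.t1 where rate > ${r}"
bound = bind_form(query, {"r": 100})
print(bound)
```

A missing value raises `KeyError`, which corresponds loosely to the form box being left unfilled.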

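3. Adding a MySQL Interpreter to Zeppelin

Data visualization is a common need. If a widely used relational database such as MySQL can also be queried from Zeppelin, with the results shown as charts, then a single visualization solution can cover most everyday queries. Zeppelin itself does not yet ship with a MySQL interpreter, but fortunately a MySQL interpreter plug-in already exists. The steps below install the plug-in and run a simple test.

(1) Build the MySQL Interpreter source code

cd /home/work/tools/
git clone https://github.com/jiekechoo/zeppelin-interpreter-mysql
# build from inside the cloned project
cd zeppelin-interpreter-mysql
mvn clean package

(2) Deploy the binaries

mkdir /home/work/tools/incubator-zeppelin/interpreter/mysql
cp /home/work/tools/zeppelin-interpreter-mysql/target/zeppelin-mysql-0.5.0-incubating.jar /home/work/tools/incubator-zeppelin/interpreter/mysql/

# copy dependencies to mysql directory
cp commons-exec-1.1.jar mysql-connector-java-5.1.6.jar slf4j-log4j12-1.7.10.jar log4j-1.2.17.jar slf4j-api-1.7.10.jar /home/work/tools/incubator-zeppelin/interpreter/mysql/

vi /home/work/tools/incubator-zeppelin/conf/zeppelin-site.xml

Append ",org.apache.zeppelin.mysql.MysqlInterpreter" to the value of zeppelin.interpreters, as shown in the figure below.

(3) Restart Zeppelin

zeppelin-daemon.sh restart

(4) Load the MySQL Interpreter

Open the home page http://nbidc-agent-04:9090/, go to 'Interpreter' -> 'Create', fill in the page along the lines of the figure below, and click 'Save' when done.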

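(5) Test

Create a note named mysql_test, as shown in the figure below.

Enter the following query, which counts the tables created on each date.

%mysql
select date_format(create_time,'%Y-%m-%d') d, count(*) c
from information_schema.tables
group by date_format(create_time,'%Y-%m-%d')
order by d;

The query result as a table is shown in the figure below.

The query result as a bar chart is shown in the figure below.

The query result as a pie chart is shown in the figure below.

The query result as a stacked chart is shown in the figure below.

The query result as a line chart is shown in the figure below.

The query result as a scatter plot is shown in the figure below.

The pie chart in report mode is shown in the figure below.

Click the link shown in the figure below to reference this report on its own.

The standalone page updates in real time when the query changes. For example, change the query to:

select date_format(create_time,'%Y-%m-%d') d, count(*) c
from information_schema.tables
where create_time > '2016-06-07'
group by date_format(create_time,'%Y-%m-%d')
order by d;

With the where clause added, run the query again. The result is shown in the figure below.

The standalone linked page changes automatically as well, as shown in the figure below.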

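4. Hue and Zeppelin Compared

(1) Features

Both Zeppelin and Hue provide data visualization and offer several graphical representations of data. On that point alone, I consider them similar, with minor differences. Hue can plot locations on a map from latitude and longitude, a feature I could not find in Zeppelin 0.6.0. Zeppelin supports more back-end query engines: 18 by default in version 0.6.0, with native Spark support, whereas Hue 3.9.0 by default supports only Hive, Impala, Pig, and database queries. Zeppelin provides data processing only; the data ingestion, discovery, analytics, and visualization mentioned earlier all fall under data processing. Hue is considerably richer: besides similar data processing, it offers many management features such as metadata management, Oozie workflow management, job management, user management, and Sqoop integration. From this angle, Zeppelin is purely a data processing tool, while Hue is more of an all-round management tool.

(2) Architecture

Zeppelin uses plug-in interpreters, so any back-end language or data processing system can be added by developing a plug-in. It is comparatively more independent and open. Hue is tightly coupled to the other components of the Hadoop ecosystem and is usually deployed together with CDH.

(3) Usage scenarios

Zeppelin suits scenarios where the task is data processing alone but the back-end languages are many, and it is especially well suited to Spark. Hue suits scenarios that interact with several components of a Hadoop cluster, such as processing data jointly with Oozie workflows and Sqoop, and it works especially well with Impala.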

Fast Database Recovery with RMAN (DBAs No Longer Need to Worry About Forgetting the Commands)

September 1, 2016, 2:23 am

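In Oracle 10g, backup and recovery with RMAN is generally the DBA's job. It demands solid technical skill and a deep understanding of how Oracle is organized. Because database failures are rare, most DBAs do not memorize the commands and look them up when needed, and the script differs for each kind of lost file. For example:

Recovering a lost control file: restore controlfile from autobackup;

Lost redo log: alter database clear [unarchived] logfile group <n>;

Incomplete recovery: recover database until cancel;

From 11g on, RMAN has a richer command set and repair methods (list, advise, repair), which let ordinary operations staff repair database failures quickly.

See the experiments below.

Case 1: simulate a lost control file by deleting the controlfile.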

SQL> startup
ORACLE instance started.
Total System Global Area 510554112 bytes
Fixed Size 1345968 bytes
Variable Size 171968080 bytes
Database Buffers 331350016 bytes
Redo Buffers 5890048 bytes
ORA-00205: error in identifying control file, check alert log for more info
The database can no longer start. We now try two ways of recovering it.

The traditional method:

RMAN>restore controlfile from autobackup;
Starting restore at 30-AUG-16
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=20 device type=DISK
recovery area destination: /u01/app/oracle/fra
database name (or database unique name) used for search: PROD2
channel ORA_DISK_1: AUTOBACKUP /u01/app/oracle/fra/PROD2/autobackup/2016_08_24/o1_mf_s_920718874_cvt48tkl_.bkp found in the recovery area
AUTOBACKUP search with format "%F" not attempted because DBID was not set
channel ORA_DISK_1: restoring control file from AUTOBACKUP /u01/app/oracle/fra/PROD2/autobackup/2016_08_24/o1_mf_s_920718874_cvt48tkl_.bkp
channel ORA_DISK_1: control file restore from AUTOBACKUP complete
output file name=/u01/app/oracle/oradata/PROD2/control01.ctl
output file name=/u01/app/oracle/fast_recovery_area/PROD2/control02.ctl
Finished restore at 30-AUG-16
The 11g fast recovery method:
RMAN> list failure;
using target database control file instead of recovery catalog
List of Database Failures
=========================
Failure ID Priority Status Time Detected Summary
---------- -------- --------- ------------- -------
712 CRITICAL OPEN 30-AUG-16 Control file /u01/app/oracle/oradata/PROD2/control01.ctl is missing
RMAN> advise failure;
List of Database Failures
=========================
Failure ID Priority Status Time Detected Summary
---------- -------- --------- ------------- -------
712 CRITICAL OPEN 30-AUG-16 Control file /u01/app/oracle/oradata/PROD2/control01.ctl is missing
analyzing automatic repair options; this may take some time
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=20 device type=DISK
analyzing automatic repair options complete
Mandatory Manual Actions
========================
no manual actions available
Optional Manual Actions
=======================
no manual actions available
Automated Repair Options
========================
Option Repair Description
------ ------------------
1 Use a multiplexed copy to restore control file /u01/app/oracle/oradata/PROD2/control01.ctl
Strategy: The repair includes complete media recovery with no data loss
Repair script: /u01/app/oracle/diag/rdbms/prod2/PROD2/hm/reco_1499999453.hm
RMAN> repair failure;
Strategy: The repair includes complete media recovery with no data loss
Repair script: /u01/app/oracle/diag/rdbms/prod2/PROD2/hm/reco_1499999453.hm
contents of repair script:
# restore control file using multiplexed copy
restore controlfile from '/u01/app/oracle/fast_recovery_area/PROD2/control02.ctl';
sql 'alter database mount';
Do you really want to execute the above repair (enter YES or NO)? yes
executing repair script
Starting restore at 30-AUG-16
using channel ORA_DISK_1
channel ORA_DISK_1: copied control file copy
output file name=/u01/app/oracle/oradata/PROD2/control01.ctl
output file name=/u01/app/oracle/fast_recovery_area/PROD2/control02.ctl
Finished restore at 30-AUG-16
sql statement: alter database mount
released channel: ORA_DISK_1
repair failure complete
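The example above does not yet show the benefit of automatic repair, so let's raise the difficulty: delete all of the database's files (excluding the parameter file) and compare.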
SQL> startup
ORACLE instance started.
Total System Global Area 510554112 bytes
Fixed Size 1345968 bytes
Variable Size 171968080 bytes
Database Buffers 331350016 bytes
Redo Buffers 5890048 bytes
ORA-00205: error in identifying control file, check alert log for more info
The traditional approach: the following script brings the database back to an open state, but it takes fairly specialized knowledge.
run{
restore controlfile from autobackup;
alter database mount;
restore database;
recover database until cancel;
alter database open resetlogs;
};
Next is the 11g recovery method: list, advise, repair.
RMAN> list failure;
using target database control file instead of recovery catalog
List of Database Failures
=========================
Failure ID Priority Status Time Detected Summary
---------- -------- --------- ------------- -------
958 CRITICAL OPEN 30-AUG-16 System datafile 1: '/u01/app/oracle/oradata/PROD2/system01.dbf' is missing
915 CRITICAL OPEN 30-AUG-16 Control file /u01/app/oracle/oradata/PROD2/control01.ctl is missing
838 CRITICAL OPEN 30-AUG-16 System datafile 1: '/u01/app/oracle/oradata/PROD2/system01.dbf' needs media recovery
835 CRITICAL OPEN 30-AUG-16 Control file needs media recovery
415 HIGH OPEN 30-AUG-16 One or more non-system datafiles are missing
841 HIGH OPEN 30-AUG-16 One or more non-system datafiles need media recovery
RMAN has already told us which files are missing.
RMAN> advise failure;
List of Database Failures
=========================
Failure ID Priority Status Time Detected Summary
---------- -------- --------- ------------- -------
958 CRITICAL OPEN 30-AUG-16 System datafile 1: '/u01/app/oracle/oradata/PROD2/system01.dbf' is missing
915 CRITICAL OPEN 30-AUG-16 Control file /u01/app/oracle/oradata/PROD2/control01.ctl is missing
838 CRITICAL OPEN 30-AUG-16 System datafile 1: '/u01/app/oracle/oradata/PROD2/system01.dbf' needs media recovery
835 CRITICAL OPEN 30-AUG-16 Control file needs media recovery
415 HIGH OPEN 30-AUG-16 One or more non-system datafiles are missing
841 HIGH OPEN 30-AUG-16 One or more non-system datafiles need media recovery
analyzing automatic repair options; this may take some time
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=20 device type=DISK
analyzing automatic repair options complete
Not all specified failures can currently be repaired.
The following failures must be repaired before advise for others can be given.
Failure ID Priority Status Time Detected Summary
---------- -------- --------- ------------- -------
915 CRITICAL OPEN 30-AUG-16 Control file /u01/app/oracle/oradata/PROD2/control01.ctl is missing
Mandatory Manual Actions
========================
no manual actions available
Optional Manual Actions
=======================
no manual actions available
Automated Repair Options
========================
Option Repair Description
------ ------------------
1 Use a multiplexed copy to restore control file /u01/app/oracle/oradata/PROD2/control01.ctl
Strategy: The repair includes complete media recovery with no data loss
Repair script: /u01/app/oracle/diag/rdbms/prod2/PROD2/hm/reco_3157315699.hm
RMAN has produced a recommendation and the script to run.
RMAN> repair failure;
Strategy: The repair includes complete media recovery with no data loss
Repair script: /u01/app/oracle/diag/rdbms/prod2/PROD2/hm/reco_3157315699.hm
contents of repair script:
# restore control file using multiplexed copy
restore controlfile from '/u01/app/oracle/fast_recovery_area/PROD2/control02.ctl';
sql 'alter database mount';
Do you really want to execute the above repair (enter YES or NO)?yes
executing repair script
Starting restore at 30-AUG-16
using channel ORA_DISK_1
channel ORA_DISK_1: copied control file copy
output file name=/u01/app/oracle/oradata/PROD2/control01.ctl
output file name=/u01/app/oracle/fast_recovery_area/PROD2/control02.ctl
Finished restore at 30-AUG-16
sql statement: alter database mount
released channel: ORA_DISK_1
repair failure complete
RMAN> list failure;
List of Database Failures
=========================
Failure ID Priority Status Time Detected Summary
---------- -------- --------- ------------- -------
1230 CRITICAL OPEN 30-AUG-16 Redo log group 3 is unavailable
1224 CRITICAL OPEN 30-AUG-16 Redo log group 2 is unavailable
1218 CRITICAL OPEN 30-AUG-16 Redo log group 1 is unavailable
958 CRITICAL OPEN 30-AUG-16 System datafile 1: '/u01/app/oracle/oradata/PROD2/system01.dbf' is missing
838 CRITICAL OPEN 30-AUG-16 System datafile 1: '/u01/app/oracle/oradata/PROD2/system01.dbf' needs media recovery
1233 HIGH OPEN 30-AUG-16 Redo log file /u01/app/oracle/oradata/PROD2/redo03.log is missing
1227 HIGH OPEN 30-AUG-16 Redo log file /u01/app/oracle/oradata/PROD2/redo02.log is missing
1221 HIGH OPEN 30-AUG-16 Redo log file /u01/app/oracle/oradata/PROD2/redo01.log is missing
415 HIGH OPEN 30-AUG-16 One or more non-system datafiles are missing
841 HIGH OPEN 30-AUG-16 One or more non-system datafiles need media recovery
RMAN> advise failure;
List of Database Failures
=========================
Failure ID Priority Status Time Detected Summary
---------- -------- --------- ------------- -------
1230 CRITICAL OPEN 30-AUG-16 Redo log group 3 is unavailable
1224 CRITICAL OPEN 30-AUG-16 Redo log group 2 is unavailable
1218 CRITICAL OPEN 30-AUG-16 Redo log group 1 is unavailable
958 CRITICAL OPEN 30-AUG-16 System datafile 1: '/u01/app/oracle/oradata/PROD2/system01.dbf' is missing
838 CRITICAL OPEN 30-AUG-16 System datafile 1: '/u01/app/oracle/oradata/PROD2/system01.dbf' needs media recovery
1233 HIGH OPEN 30-AUG-16 Redo log file /u01/app/oracle/oradata/PROD2/redo03.log is missing
1227 HIGH OPEN 30-AUG-16 Redo log file /u01/app/oracle/oradata/PROD2/redo02.log is missing
1221 HIGH OPEN 30-AUG-16 Redo log file /u01/app/oracle/oradata/PROD2/redo01.log is missing
415 HIGH OPEN 30-AUG-16 One or more non-system datafiles are missing
841 HIGH OPEN 30-AUG-16 One or more non-system datafiles need media recovery
analyzing automatic repair options; this may take some time
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=20 device type=DISK
analyzing automatic repair options complete
Mandatory Manual Actions
========================
no manual actions available
Optional Manual Actions
=======================
1. If file /u01/app/oracle/oradata/PROD2/redo03.log was unintentionally renamed or moved, restore it
2. If file /u01/app/oracle/oradata/PROD2/redo02.log was unintentionally renamed or moved, restore it
3. If file /u01/app/oracle/oradata/PROD2/redo01.log was unintentionally renamed or moved, restore it
Automated Repair Options
========================
Option Repair Description
------ ------------------
1 Perform incomplete database recovery to SCN 1206859
Strategy: The repair includes point-in-time recovery with some data loss
Repair script: /u01/app/oracle/diag/rdbms/prod2/PROD2/hm/reco_3316371170.hm
RMAN> repair failure;
Strategy: The repair includes point-in-time recovery with some data loss
Repair script: /u01/app/oracle/diag/rdbms/prod2/PROD2/hm/reco_3316371170.hm
contents of repair script:
# database point-in-time recovery
reset database to incarnation 5;
restore database until scn 1206859;
recover database until scn 1206859;
alter database open resetlogs;
Do you really want to execute the above repair (enter YES or NO)? YES
executing repair script
database reset to incarnation 5
Starting restore at 30-AUG-16
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00001 to /u01/app/oracle/oradata/PROD2/system01.dbf
channel ORA_DISK_1: restoring datafile 00002 to /u01/app/oracle/oradata/PROD2/sysaux01.dbf
channel ORA_DISK_1: restoring datafile 00003 to /u01/app/oracle/oradata/PROD2/undotbs01.dbf
channel ORA_DISK_1: restoring datafile 00004 to /u01/app/oracle/oradata/PROD2/users01.dbf
channel ORA_DISK_1: reading from backup piece /u01/app/oracle/fra/PROD2/backupset/2016_08_24/o1_mf_nnndf_TAG20160824T111405_cvt47yrv_.bkp
channel ORA_DISK_1: piece handle=/u01/app/oracle/fra/PROD2/backupset/2016_08_24/o1_mf_nnndf_TAG20160824T111405_cvt47yrv_.bkp tag=TAG20160824T111405
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:15
Finished restore at 30-AUG-16
Starting recover at 30-AUG-16
using channel ORA_DISK_1
starting media recovery
archived log for thread 1 with sequence 3 is already on disk as file /u01/app/oracle/fra/PROD2/archivelog/2016_08_24/o1_mf_1_3_cvt48qv1_.arc
archived log for thread 1 with sequence 4 is already on disk as file /u01/app/oracle/fra/PROD2/archivelog/2016_08_24/o1_mf_1_4_cvvbdhx0_.arc
archived log for thread 1 with sequence 5 is already on disk as file /u01/app/oracle/fra/PROD2/archivelog/2016_08_30/o1_mf_1_5_cw9m2no2_.arc
archived log file name=/u01/app/oracle/fra/PROD2/archivelog/2016_08_24/o1_mf_1_3_cvt48qv1_.arc thread=1 sequence=3
archived log file name=/u01/app/oracle/fra/PROD2/archivelog/2016_08_24/o1_mf_1_4_cvvbdhx0_.arc thread=1 sequence=4
archived log file name=/u01/app/oracle/fra/PROD2/archivelog/2016_08_30/o1_mf_1_5_cw9m2no2_.arc thread=1 sequence=5
media recovery complete, elapsed time: 00:00:02
Finished recover at 30-AUG-16
database opened
repair failure complete
Once the repair finishes, RMAN even opens the database for you. This approach works for essentially every kind of lost file.
No more worrying about forgetting the commands.

Distributed Design in MongoDB: Master-Slave Replication and Replica Sets

September 1, 2016, 2:22 am

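1. Preface

To cope with fast-growing Internet businesses and sudden surges in user traffic, high availability, scalability, and fault tolerance have become more and more important. As technology has developed, the industry has produced many solutions.

This article shows how to deploy MongoDB master-slave replication and replica sets on Windows.

2. Master-Slave Replication

As the name suggests, a master-slave setup consists of a master database and slave databases. As load on the persistence layer grows, read/write splitting has come into wide use, and MongoDB's master-slave design follows that approach. The master handles writes and the slaves handle reads; after each write, the master automatically syncs the incremental data to the available slave nodes. The implementation details follow: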

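1) First download MongoDB Enterprise (https://www.mongodb.com/download-center), then add MongoDB's bin directory to the Path environment variable so that MongoDB can be started from the command line.

2) Set up one master and two slaves, configured as follows:

Master (127.0.0.1:1111), three files: 1111.conf, mongo1111StartServer.bat, mongo1111.bat.

1111.conf (master server configuration):

dbpath = E:\mongodb\database\MS\a #data file directory
port = 1111 #listening port
bind_ip = 127.0.0.1 #server address
master = true #run this instance as the master

mongo1111StartServer.bat (loads 1111.conf and starts the server):

mongod --config 1111.conf

mongo1111.bat (starts a client for server 1111):

mongo 127.0.0.1:1111

Slave 1 (127.0.0.1:2222):

2222.conf (slave server configuration):

dbpath = E:\mongodb\database\MS\b #data file directory
port = 2222 #listening port
bind_ip = 127.0.0.1 #server address
slave = true #run this instance as a slave
source = 127.0.0.1:1111 #address of the master to sync from

mongo2222StartServer.bat (loads 2222.conf and starts the server):

mongod --config 2222.conf

mongo2222.bat (starts a client for server 2222):

mongo 127.0.0.1:2222

Slave 2 (127.0.0.1:3333):

3333.conf (slave server configuration):

dbpath = E:\mongodb\database\MS\c #data file directory
port = 3333 #listening port
bind_ip = 127.0.0.1 #server address
slave = true #run this instance as a slave
source = 127.0.0.1:1111 #address of the master to sync from

mongo3333StartServer.bat (loads 3333.conf and starts the server):

mongod --config 3333.conf

mongo3333.bat (starts a client for server 3333):

mongo 127.0.0.1:3333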

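3) Start the three servers and check that they start correctly:

4) Open a separate CMD window for each of the three clients. On the master, switch to the stu database and insert data; the insert succeeds. The same insert on a slave fails.

5) Query the slaves: the data has already been synced from the master.

6) The master-slave setup is now complete.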

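3. Replica Sets

In general, MongoDB master-slave replication largely meets the needs of a distributed application. However, as business grows and user traffic spikes, hardware wear and failures are always a possibility. Once the master fails, it can no longer process requests, the slaves can no longer sync, and users cannot reach the system, so scalability, fault tolerance, and high availability become especially important. MongoDB's replica set technology arose to address this. A replica set, as the name suggests, is a set of replicas with no fixed master. If the node currently acting as the write master fails, the system uses an election algorithm to promote one of the read nodes whose data is fully synced and whose performance is good to be the new write master, which goes on serving user requests and better ensures high availability, scalability, and fault tolerance.

The deployment steps are as follows:

1) First configure four MongoDB servers, specifying for each the replica set name, the data file path, the listening port, and the oplog size in MB (the oplog is the capped log of operations that the other members replicate from). Open CMD and enter the commands below; when the message wait for 127.0.0.1:... connection appears, the server has been created successfully. Each server needs its own port and its own data directory.

a. Create the server on port 9927:

mongod --replSet mayadong --dbpath E:\mongodb\database\a --port 9927 --oplogSize 512

b. Create the server on port 9928:

mongod --replSet mayadong --dbpath E:\mongodb\database\b --port 9928 --oplogSize 512

c. Create the server on port 9929:

mongod --replSet mayadong --dbpath E:\mongodb\database\c --port 9929 --oplogSize 512

d. Create the server on port 9930:

mongod --replSet mayadong --dbpath E:\mongodb\database\d --port 9930 --oplogSize 512

You can also put the server and client start commands into bat (batch) files and run them with a click each time.

2) After the four servers are running, start a client for any one of them and configure the replica set (choosing the primary and, optionally, an arbiter).

Here 9927 is chosen.

config = {
  _id:"mayadong",
  members:
  [
    {_id:0,host:"localhost:9927"},
    {_id:1,host:"localhost:9928"},
    {_id:2,host:"localhost:9929"},
    {_id:3,host:"localhost:9930",arbiterOnly:true}
  ]
}

To add a server to the configuration, use config.members.push({_id:3,host:"localhost:9930"}); each _id must be unique within the set. Because members is a JavaScript array, it supports push and pop: config.members.pop() removes the last node, and config.members.splice(index, 1) removes the node at a given index.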
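The array manipulation described above can be mirrored in Python, where a list of dicts plays the role of config.members. This is only an analog of the mongo shell's JavaScript object, written to make the add and remove steps concrete; the helper names `add_member` and `remove_member` are hypothetical and nothing here talks to a MongoDB server.

```python
def add_member(config: dict, host: str, arbiter_only: bool = False) -> dict:
    """Append a member, giving it the smallest _id not already in use."""
    used = {m["_id"] for m in config["members"]}
    new_id = next(i for i in range(len(used) + 1) if i not in used)
    member = {"_id": new_id, "host": host}
    if arbiter_only:
        member["arbiterOnly"] = True
    config["members"].append(member)
    return member


def remove_member(config: dict, host: str) -> None:
    """Drop the member with the given host, wherever it sits in the list."""
    config["members"] = [m for m in config["members"] if m["host"] != host]


config = {
    "_id": "mayadong",
    "members": [
        {"_id": 0, "host": "localhost:9927"},
        {"_id": 1, "host": "localhost:9928"},
        {"_id": 2, "host": "localhost:9929"},
    ],
}
arbiter = add_member(config, "localhost:9930", arbiter_only=True)
```

Picking the smallest unused _id keeps ids unique even after a member in the middle of the list has been removed.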

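3) Initialize the replica set by running rs.initiate(config).

4) After initialization, run a few commands and the prompt changes to show the replica set name followed by PRIMARY, which means the primary/secondary relationship has been set up. Check the replica set status and whether the current server is the primary:

the secondaries and the arbiter are 9928, 9929, and 9930, and the current server is the primary.

5) With the configuration complete, insert some data.

6) Once the insert succeeds, start a client for any one of the other servers and check whether the data has been synced.

When the client opens and you query the current server's databases, the following error appears. Run rs.slaveOk() to allow reads on this secondary.

7) The check confirms the data has been synced. To reproduce a server failure, forcibly shut down the 9927 server; its client naturally stops working. Then open 9928, 9929, and 9930: the data node 9929 has automatically become the primary, which confirms the expected behavior.

4. Summary

Under normal conditions, with enough servers running well and moderate traffic, a master-slave setup can handle system load through read/write splitting and maintain performance, and it is relatively simple to configure. But new business, server failures, and integration with external systems inevitably place higher demands on scalability, availability, and fault tolerance. Replica sets fill this gap in MongoDB: although they are more complex to configure, they provide that scalability, availability, and fault tolerance.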

Database Performance Optimization

September 1, 2016, 2:21 am

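Factors that affect performance:

1. Database design

2. Queries

3. Hardware (CPU, memory, I/O: a processor that cannot keep up, insufficient memory, or low I/O throughput creates a bottleneck)

4. Transaction management

5. Data distribution

6. Network

7. Operating system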

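Optimization methods:

• Design a sound table structure:

The key to getting the best performance out of a database solution is a good database design. In practice, many database solutions perform poorly precisely because the database was designed badly, so a good design has to take the following points into account.

In general, a logical database design satisfies the first three levels of normalization:

a. The three normal forms: 1. Atomicity (a column cannot be divided further). 2. Every column must depend on the primary key, that is, a table describes only one thing. 3. Every column must depend on the primary key directly, not indirectly through a transitive dependency (e.g. in an orders table: customer name → customer ID → order ID).

A design that follows these rules produces fewer columns and more tables, which reduces data redundancy and the number of pages used to store the data. But the relationships between tables may then require complex joins, which lowers performance. A degree of denormalization can improve performance, and it can be carried out in different ways depending on the performance concern at hand.

b. To avoid joins, a moderate amount of redundancy is sometimes acceptable; controlled redundancy helps database performance.

c. Choose suitable data types: prefer fixed-length types where possible.

d. Do not use types that cannot be indexed, such as text, for key columns.

e. The tables should be easy to use, which mainly shows in whether queries need to join many tables and rely on complicated SQL tricks.

• Create suitable, effective indexes on tables

Indexes are not only for primary keys or unique columns. If there is a column in your table that you search on frequently, create an index on it.

Running a search with "last_name LIKE 'a%'" once with an index and once without showed roughly a 4x difference in performance.

Note: LIKE '%c%' cannot use the index, while LIKE 'abc%' can.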

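• Always give every table an ID

Every table in the database should have an ID as its primary key, preferably of an INT type. Even if your users table has a unique field called "email", do not make it the primary key. Using a VARCHAR column as the primary key degrades performance. In your program, too, you should build your data structures around the table's ID.

Some operations also rely on the primary key, and in those cases its performance and definition matter a great deal, for example clustering and partitioning.

• Avoid NULLs in the database; define columns as NOT NULL wherever possible

Columns such as remarks, descriptions, and comments can be nullable; for the rest, preferably not.

Do not assume that NULL takes no space; it needs extra space. (In Oracle, NULL and the empty string are the same!)

If missing values really must be represented, a default value is one approach; another is to add a bit column that flags with 0 or 1 whether the value is missing.

For example, a char(100) column has its space fixed when the column is created: it occupies 100 characters whether or not a value is inserted (NULL included).

For a variable-length column such as varchar, NULL takes no space.

When a NULL column is at the end of a record, that is, no non-NULL value follows it, it takes no space.

When a NULL column is in the middle of a record, that is, non-NULL values follow it, it does take space (Oracle).

Null is similar to Empty, but they differ: Empty means a variable has not been initialized, that is, it has not been given any value yet, while a variable is Null only after you set it to Null. Null is most often encountered when working with databases: a field that holds no data is Null.

Problem: table T has four columns, a, b, c, and d.

Question 1: suppose column d of a row is null. Then d should take no space in that row. Is that right?

Question 2: suppose column b is varchar2(10) and column c is not null. Then b takes storage whether or not it is null. How much? Is it 10 bytes?

One last small question: suppose column b is varchar2(10) and not null, so the space b actually takes is the length of the actual data (at most 10). If b in some row currently has length 5 and an update makes it length 8, the row's existing space cannot hold it and the whole row has to move. That hurts performance; how can it be avoided?

Defining columns as NOT NULL makes processing faster and requires less storage. It sometimes simplifies queries too, because in some cases the NULL property of a value no longer needs to be checked.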

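• Queries: write concise, efficient SQL (avoid full table scans and returning unnecessary rows and columns)

a. Avoid inappropriate use of "SELECT *"

Unless you really need every column of the table, avoid "SELECT *" when writing SQL, for the sake of query performance. This is a simple best practice that users often overlook.

Problem: what happens when a join query has no join condition?

select sum(project_o.danjia*project_o.mianji) from project_o,project_t where project_o.zhuangtai='no' and project_o.project_id=30

This statement is only part of a larger SQL statement. The other part uses table project_t, which is why project_t appears in the from clause. The part shown above does not use project_t at all, but because no join condition is given, the result is a Cartesian product.

• Cartesian product

Put simply, in a Cartesian product every member of one set is paired with every member of the other set. In a relational database, the number of rows in the Cartesian product of two tables is the product of their row counts.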
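The effect on the sum can be reproduced with a minimal sketch using Python's built-in sqlite3 module. The table and column names are taken from the query above, but the column types and the sample rows are assumptions made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schemas: only the columns used by the query are modelled.
conn.executescript("""
CREATE TABLE project_o (id INTEGER PRIMARY KEY, project_id INTEGER,
                        danjia REAL, mianji REAL, zhuangtai TEXT);
CREATE TABLE project_t (id INTEGER PRIMARY KEY, project_id INTEGER);
""")
conn.executemany(
    "INSERT INTO project_o (project_id, danjia, mianji, zhuangtai) VALUES (?, ?, ?, ?)",
    [(30, 10.0, 2.0, "no"), (30, 5.0, 4.0, "no")],
)
conn.executemany("INSERT INTO project_t (project_id) VALUES (?)", [(1,), (2,), (3,)])

# No join condition: every project_o row is paired with all 3 project_t rows.
cross_sum = conn.execute(
    "SELECT sum(o.danjia * o.mianji) FROM project_o o, project_t t "
    "WHERE o.zhuangtai = 'no' AND o.project_id = 30"
).fetchone()[0]

# The same aggregate over project_o alone.
plain_sum = conn.execute(
    "SELECT sum(danjia * mianji) FROM project_o "
    "WHERE zhuangtai = 'no' AND project_id = 30"
).fetchone()[0]

print(plain_sum, cross_sum)
```

With three rows in project_t, the unjoined query sums every project_o row three times, so the result is inflated as well as slow.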

Solution:

Use LEFT JOIN

select sum(project_o.danjia*project_o.mianji) from project_o LEFT JOIN project_t ON project_o.id=project_t.project_id
where project_o.zhuangtai='no' and project_o.project_id=30

This is one case where a SQL query without a JOIN condition produced a Cartesian product and hurt performance.

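• Use IN and NOT IN in subqueries with care; NOT EXISTS is more efficient than NOT IN

IN performs a hash join between the outer and inner tables, while EXISTS loops over the outer table and queries the inner table on each iteration. The long-standing claim that EXISTS is always more efficient than IN is inaccurate.

If the two tables are about the same size, IN and EXISTS perform about the same.

If one table is small and the other large, use EXISTS when the subquery table is the large one and IN when the subquery table is the small one.

NOT IN versus NOT EXISTS: a query that uses NOT IN scans both the inner and outer tables in full and does not use indexes, while the subquery of NOT EXISTS can still use the table's indexes. So whichever table is larger, NOT EXISTS is faster than NOT IN.

Question: one part of the workload deletes a large amount of data while another part inserts a large amount of data into the database.

At present the two interfere with each other heavily (performance drops by a factor of tens). What is the usual way to optimize this kind of concurrent processing?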

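• Split large DELETE or INSERT statements and commit in batches

If you need to run a large DELETE or INSERT on a live website, be very careful that the operation does not make the whole site stop responding. Both operations lock the table, and once the table is locked no other operation can get in.

Apache runs many child processes or threads, which is why it works quite efficiently, but the server cannot afford too many child processes, threads, and database connections. They consume a great deal of server resources, memory in particular.

If you lock a table for a while, say 30 seconds, then on a high-traffic site the processes and threads, database connections, and open files that pile up during those 30 seconds may not only crash your web service but also bring down the whole server.

So if you have a large operation, be sure to split it up. Using a LIMIT clause (rownum in Oracle, TOP in SQL Server) is a good way to do that. Here is a MySQL example:

while (1) {
    // delete only 1000 rows per pass
    mysql_query("delete from logs where log_date <= '2012-11-01' limit 1000");
    if (mysql_affected_rows() == 0) {
        // nothing left to delete, exit
        break;
    }
    // pause briefly between passes
    usleep(50000);
}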
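The same batching loop can be sketched in Python against SQLite, which makes it easy to run without a MySQL server. This is an illustrative assumption, not the article's setup: stock SQLite builds do not accept LIMIT directly on DELETE, so the sketch picks each batch through a rowid subquery, and the function name and the sample data are invented here.

```python
import sqlite3
import time


def delete_in_batches(conn, cutoff, batch_size=1000, pause=0.0):
    """Delete rows with log_date <= cutoff in small committed batches."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM logs WHERE rowid IN ("
            "SELECT rowid FROM logs WHERE log_date <= ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()          # release locks after every batch
        if cur.rowcount == 0:  # nothing left to delete
            break
        total += cur.rowcount
        time.sleep(pause)      # give other sessions a chance to run
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (log_date TEXT)")
conn.executemany(
    "INSERT INTO logs (log_date) VALUES (?)",
    [("2012-10-01",)] * 2500 + [("2013-01-01",)] * 10,
)
conn.commit()

deleted = delete_in_batches(conn, "2012-11-01")
remaining = conn.execute("SELECT count(*) FROM logs").fetchone()[0]
```

Committing after each batch is the point of the technique: locks are held only for the duration of one small delete rather than for the whole clean-up.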

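• Use temporary tables to speed up queries

Sorting a subset of a table and storing it in a temporary table can sometimes speed up queries. It helps avoid repeated sort operations and also simplifies the optimizer's work in other respects.

The temporary table has fewer rows than the main table, and its physical order is already the required order, which reduces disk I/O, so the query workload can drop substantially.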
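A minimal sketch of the idea with Python's sqlite3 module follows. The orders table, its columns, and the sample rows are hypothetical; the sketch only shows a filtered, pre-sorted subset being materialized once and then reused by later queries, and it makes no claim about how any particular engine lays the rows out on disk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("a", 30.0), ("b", 5.0), ("a", 12.5), ("c", 99.0), ("a", 1.0)],
)

# Materialize the subset once, already filtered and sorted.
conn.execute(
    "CREATE TEMP TABLE a_orders AS "
    "SELECT id, amount FROM orders WHERE customer = 'a' ORDER BY amount"
)

# Later queries touch only the smaller temporary table.
row_count = conn.execute("SELECT count(*) FROM a_orders").fetchone()[0]
total = conn.execute("SELECT sum(amount) FROM a_orders").fetchone()[0]
smallest = conn.execute(
    "SELECT amount FROM a_orders ORDER BY amount LIMIT 2"
).fetchall()
```

The temporary table disappears when the connection closes, so it suits intermediate results that several queries in one session share.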

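• Use joins and avoid excessive subqueries, which add CPU and I/O overhead

• For consecutive values, use BETWEEN instead of IN

• Avoid triggers where possible, especially on large tables

• Use LIMIT when you need only one row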

SQLiteDatabase Operations in Android (with Source Code)

September 1, 2016, 2:20 am

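Like many of the applications and websites we build, most features come down to creating, reading, updating, and deleting data in a database. Android development is the same, although on a mobile client it is not as complex as SQL Server or MySQL. Android applications support a local database, SQLiteDatabase. Put simply, the application we develop creates a database on the phone, and we can then create, read, update, and delete our data on the phone. This is not the only option, though. A while ago we built a small OA system for our team. It had simple features but several users, who all needed to read from one shared database, so we had to consider a database such as MySQL. In the end we used PHP on the back end and returned the data to Android for processing, although on the 2G network we were using, speed suffered considerably. I have little experience with caching and plan to look into the lower-level side of it once I know Android better.

This article shows how to work with SQLiteDatabase.

First we need to understand SQLiteDatabase. It has many advantages:

SQLite features:

1. Lightweight

2. Standalone

3. Isolation

All information in a SQLite database (tables, views, triggers, and so on) is contained in a single file, which makes it easy to manage and maintain.

4. Cross-platform

5. Interfaces for many languages

6. Safety

SQLite implements independent transactions through exclusive and shared locks at the database level. This means several processes can read from the same database at the same time, but only one can write. Before a process or thread writes to the database, it must obtain an exclusive lock. Once an exclusive lock has been granted, no other read or write operations take place.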

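Creating and opening a database:

openOrCreateDatabase() checks whether the database exists, opens it if it does, and creates it otherwise. On success it returns a SQLiteDatabase object; if the database file cannot be opened it throws SQLiteException:

mSQLiteDatabase = this.openOrCreateDatabase("abc.db",MODE_PRIVATE,null);

Creating a table:

execSQL():

String Create_Table = "Create table table1...";
mSQLiteDatabase.execSQL(Create_Table);

Adding data to a table:

The insert method takes its data packed into a ContentValues object. ContentValues is essentially a Map whose keys are column names and whose values are the column values. Use its put method to add the data, then insert it into the table:

ContentValues cv = new ContentValues();
cv.put(TABLE_NUM,1);
mSQLiteDatabase.insert(TABLE_NAME,null,cv);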
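To show what the key/value packing amounts to, here is a small Python analog using the standard sqlite3 module: a dict stands in for ContentValues and is turned into a parameterized INSERT. This is an illustration of the idea only, not Android's implementation; the helper `insert_row`, the table, and the sample values are assumptions, and the table and column names are interpolated into the SQL unescaped, so they must come from trusted code.

```python
import sqlite3


def insert_row(conn, table, values):
    """Insert one row from a column-name -> value mapping; return its rowid."""
    columns = ", ".join(values)                # column names, in dict order
    marks = ", ".join("?" for _ in values)     # one placeholder per value
    cur = conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({marks})",
        tuple(values.values()),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (num INTEGER, name TEXT)")

row_id = insert_row(conn, "table1", {"num": 1, "name": "first"})
stored = conn.execute("SELECT num, name FROM table1").fetchall()
```

The values themselves travel as bound parameters, which is also why a map-based insert avoids hand-built SQL strings for the data.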

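Deleting data from a table:

delete():

// the first argument is the table name, and the where clause is written without the WHERE keyword
mSQLiteDatabase.delete("table1","...",null);

Updating table data:

update():

ContentValues cv = new ContentValues();
cv.put(TABLE_NUM,3);
mSQLiteDatabase.update("table1",cv,"num"+"="+Integer.toString(0),null);

Insert, delete, and update operations can of course also be done with the execSQL(sql) method.

Closing the database:

mSQLiteDatabase.close();

Dropping a table:

mSQLiteDatabase.execSQL("DROP TABLE table1");

Deleting the database:

this.deleteDatabase("abc.db");

Querying records in a table:

Queries go through the Cursor class. Calling SQLiteDatabase.query() returns a Cursor object, which points at one row of the result at a time. It provides many query-related methods: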

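Method            Description
move              Moves the Cursor by a relative offset from the current position; returns true on success
moveToPosition    Moves the Cursor to the given position; returns boolean
moveToNext        Moves the Cursor to the next row; returns boolean
moveToLast        Moves the Cursor to the last row; returns boolean
moveToFirst       Moves the Cursor to the first row; returns boolean
isBeforeFirst     Returns whether the Cursor points before the first row
isAfterLast       Returns whether the Cursor points after the last row
isClosed          Returns whether the Cursor is closed
isFirst           Returns whether the Cursor points at the first row
isLast            Returns whether the Cursor points at the last row
isNull            Returns whether the value at the given column index is null
getCount          Returns the total number of rows
getInt            Returns the value at the given column index of the current row as an int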

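For example:

Cursor cur = mSQLiteDatabase.rawQuery("select * from table1",null);
if(cur != null)
{
    if(cur.moveToFirst())
    {
        // the column index is the same for every row, so look it up once
        int numColumn = cur.getColumnIndex("num");
        do{
            int num = cur.getInt(numColumn);
        }while(cur.moveToNext());
    }
    // release the cursor when finished
    cur.close();
}

Close a SQLiteDatabase promptly after use; otherwise a SQLiteException may be thrown.

The methods above run SQL statements directly, as most introductory texts do. To simplify usage and improve performance, Android provides SQLiteOpenHelper, which wraps the common database operations. With it, creating, reading, updating, and deleting data is straightforward.

First we create a DBHelper class that extends SQLiteOpenHelper and use it to initialize the database: creating the database, creating tables, and so on.

It contains a few callback methods, which the comments below explain in detail.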

package com.example.core;

import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

public class DBHelper extends SQLiteOpenHelper{

    public DBHelper(Context context) {
        // create a database named march_test.db, version 1
        super(context,"march_test.db",null,1);
    }

    /* (non-Javadoc)
     * Called when the database is created for the first time
     * @see android.database.sqlite.SQLiteOpenHelper#onCreate(android.database.sqlite.SQLiteDatabase)
     */
    @Override
    public void onCreate(SQLiteDatabase db) {
        // create the table
        String create_sql = "CREATE TABLE student(id integer primary key autoincrement," +
                "name varchar(20),age integer not null)";
        db.execSQL(create_sql);
    }

    /* (non-Javadoc)
     * Called when the version number changes
     * @see android.database.sqlite.SQLiteOpenHelper#onUpgrade(android.database.sqlite.SQLiteDatabase, int, int)
     */
    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        String alter_sql = "ALTER TABLE student ADD money integer";
        db.execSQL(alter_sql);
    }

}
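The create-or-upgrade decision that the helper makes can be sketched outside Android with Python's sqlite3 module. The sketch assumes the schema version is kept in SQLite's `PRAGMA user_version`; `open_database` and the two callbacks are hypothetical names that loosely mirror onCreate and onUpgrade, and downgrade handling is left out.

```python
import os
import sqlite3
import tempfile


def open_database(path, version, on_create, on_upgrade):
    """Open the database, creating or upgrading its schema as needed."""
    conn = sqlite3.connect(path)
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    if current == 0:
        on_create(conn)                        # brand-new database file
    elif current < version:
        on_upgrade(conn, current, version)     # stored schema is older
    conn.execute(f"PRAGMA user_version = {int(version)}")
    conn.commit()
    return conn


def on_create(conn):
    conn.execute(
        "CREATE TABLE student(id integer primary key autoincrement,"
        "name varchar(20),age integer not null)"
    )


def on_upgrade(conn, old_version, new_version):
    conn.execute("ALTER TABLE student ADD money integer")


path = os.path.join(tempfile.mkdtemp(), "march_test.db")

first = open_database(path, 1, on_create, on_upgrade)
columns_v1 = [row[1] for row in first.execute("PRAGMA table_info(student)")]
first.close()

second = open_database(path, 2, on_create, on_upgrade)
columns_v2 = [row[1] for row in second.execute("PRAGMA table_info(student)")]
stored_version = second.execute("PRAGMA user_version").fetchone()[0]
second.close()
```

Opening the same file a second time with a higher version number is what triggers the upgrade callback, just as raising the version passed to the helper's constructor does.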

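Instantiating this class creates a database named march_test.db containing the table student. You can see it in the file system:

The path is data/data/<package name>/databases/<database name>:

For databases in this db format I recommend SQLite Expert Professional, a very handy tool that is easy to find online. Like MySQL Workbench and other visual database tools, it provides a graphical interface for working with the database. The interface looks like this:

We can export the database we created from the file system and open it in this tool to inspect the tables.

First, declare the class for the data type we will operate on: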

package com.example.sqllite;

public class Student {

    private Integer id;
    private String name;
    private Integer age;

    public Student(Integer id, String name, Integer age) {
        super();
        this.id = id;
        this.name = name;
        this.age = age;
    }

    public Student(String name, Integer age) {
        super();
        this.name = name;
        this.age = age;
    }

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Integer getAge() {
        return age;
    }

    public void setAge(Integer age) {
        this.age = age;
    }

    @Override
    public String toString() {
        return "Student [id=" + id + ", name=" + name + ", age=" + age + "]";
    }

}

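Next we write the class that performs create, read, update, and delete on the database. It does not extend DBHelper; it holds an instance of the DBHelper class created above (which is based on SQLiteOpenHelper) and obtains its database connections from it: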

package com.example.core;

import java.util.ArrayList;
import java.util.List;

import android.content.Context;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

import com.example.sqllite.Student;

/**
 * @author fanchangfa
 * Database access class:
 * insert, delete, update, find,
 * paged queries,
 * and the total row count.
 */
public class DbServer {

    private DBHelper dbhelper;

    public DbServer(Context context) {
        this.dbhelper = new DBHelper(context);
    }

    /**
     * Inserts a record.
     * @param student the student to insert
     */
    public void add(Student student) {
        SQLiteDatabase db = dbhelper.getWritableDatabase();
        db.execSQL("insert into student(name , age) values(?,?)",
                new Object[]{student.getName(), student.getAge()});
    }

    /**
     * Deletes a record.
     * @param id id of the student to delete
     */
    public void delete(Integer id) {
        SQLiteDatabase db = dbhelper.getWritableDatabase();
        db.execSQL("delete from student where id = ?", new Object[]{id});
    }

    /**
     * Updates the student with the given id.
     * @param stu carries the id of the student and the new values
     */
    public void alter(Student stu) {
        SQLiteDatabase db = dbhelper.getWritableDatabase();
        db.execSQL("update student set name=?,age=? where id=?",
                new Object[]{stu.getName(), stu.getAge(), stu.getId()});
    }

    /**
     * Finds a record.
     * @param id id of the student to look up
     * @return the student, or null if no row matches
     */
    public Student find(Integer id) {
        SQLiteDatabase db = dbhelper.getReadableDatabase();
        Cursor cursor = db.rawQuery("select * from student where id = ?", new String[]{id.toString()});
        try {
            // If the result set has data, position the cursor on the first row
            if (cursor.moveToFirst()) {
                int sid = cursor.getInt(cursor.getColumnIndex("id"));
                String name = cursor.getString(cursor.getColumnIndex("name"));
                int age = cursor.getInt(cursor.getColumnIndex("age"));
                return new Student(sid, name, age);
            }
            return null;
        } finally {
            cursor.close(); // the original leaked the cursor when a row was found
        }
    }

    /**
     * Paged query.
     * @param start offset of the first row to return
     * @param count maximum number of rows to return (SQLite reads "limit a,b" as offset a, count b)
     * @return the matching students
     */
    public List<Student> page(int start, int count) {
        SQLiteDatabase db = dbhelper.getReadableDatabase();
        List<Student> page = new ArrayList<Student>();
        Cursor cur = db.rawQuery("select id,name,age from student order by id limit ?,?",
                new String[]{String.valueOf(start), String.valueOf(count)});

        while (cur.moveToNext()) {
            int id = cur.getInt(cur.getColumnIndex("id"));
            String name = cur.getString(cur.getColumnIndex("name"));
            int age = cur.getInt(cur.getColumnIndex("age"));
            page.add(new Student(id, name, age));
        }

        cur.close();

        return page;
    }

    /**
     * Returns one page of data as a Cursor; the caller must close it.
     * @param start offset of the first row to return
     * @param count maximum number of rows to return
     * @return a Cursor over the page
     */
    public Cursor curpage(int start, int count) {
        SQLiteDatabase db = dbhelper.getReadableDatabase();
        Cursor cur = db.rawQuery("select id as _id,name,age from student order by id limit ?,?",
                new String[]{String.valueOf(start), String.valueOf(count)});

        cur.moveToFirst();

        return cur;
    }

    /**
     * Returns the total number of rows in the table.
     * @return the row count
     */
    public long getCount() {
        SQLiteDatabase db = dbhelper.getReadableDatabase();

        Cursor cur = db.rawQuery("select count(*) from student", null);
        cur.moveToFirst();

        long count = cur.getLong(0);

        cur.close();

        return count;
    }

    /**
     * Runs two updates in one transaction.
     */
    public void transaction() {
        SQLiteDatabase db = dbhelper.getWritableDatabase();
        db.beginTransaction();

        try {
            db.execSQL("update student set age = 21 where id =5");
            db.execSQL("update student set age= 22 where id=6");
            // A transaction rolls back by default; it commits only if it is
            // marked successful before endTransaction() is called.
            db.setTransactionSuccessful();
        } finally {
            db.endTransaction();
        }

    }
}

The operations are documented in the code comments, so you can try them out directly.

Next we test the class.

The unit tests are as follows:

package com.example.test;

import java.util.List;

import com.example.core.DbServer;
import com.example.sqllite.Student;

import android.test.AndroidTestCase;
import android.util.Log;

/**
 * @author fanchangfa
 * Unit tests for the database operations in DbServer.
 * AndroidTestCase is JUnit 3 based, so a method is only run
 * as a test if its name starts with "test".
 */
public class DbServerTest extends AndroidTestCase {

    // Tag for log output
    private static final String TAG = "SQLtest";

    /**
     * Insert test
     */
    public void testAdd() {
        DbServer dbserver = new DbServer(this.getContext());
        for (int i = 0; i < 20; i++) {
            Student stu = new Student("fanchangfa" + i, 20);
            dbserver.add(stu);
        }
    }

    public void testDelete() {
        DbServer dbserver = new DbServer(this.getContext());
        dbserver.delete(2);
    }

    public void testAlter() {
        DbServer dbserver = new DbServer(this.getContext());
        Student stu = dbserver.find(3);
        stu.setName("liuzihang");
        stu.setAge(25);
        dbserver.alter(stu);
    }

    /**
     * Lookup test: returns the record with the given id
     */
    public void testFind() {
        DbServer dbserver = new DbServer(this.getContext());
        Student stu = dbserver.find(5);
        Log.i(TAG, stu.toString());
    }

    /**
     * Paged query test
     */
    public void testFindPage() {
        DbServer dbserver = new DbServer(this.getContext());
        List<Student> students = dbserver.page(0, 8);

        for (Student stu : students) {
            Log.i(TAG, stu.toString());
        }

    }

    /**
     * Transaction test
     */
    public void testTransaction() {
        DbServer dbserver = new DbServer(this.getContext());
        dbserver.transaction();
    }

}

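I have verified that this works. Because dealing with the file system and these operations is fiddly, I am making my sample project available here for download. It covers the database operations and the use of transactions in SQLite, as well as the several ways of displaying data in a ListView that I will write about in the next post. The UI is ugly, but it is only a demo, so please bear with it, and get in touch if you have questions. I hope this can be a place where programmers with shared interests exchange ideas and improve together, rather than somewhere articles are posted just to drive traffic.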

Illustrated SQL Subquery Examples

September 1, 2016, 2:19 am

1 Create the sample tables

First, create the sample tables:

-------------------------
-- Create Customers table
-------------------------
CREATE TABLE Customers
(
cust_id char(10) NOT NULL ,
cust_name char(50) NOT NULL ,
cust_address char(50) NULL ,
cust_city char(50) NULL ,
cust_state char(5) NULL ,
cust_zip char(10) NULL ,
cust_country char(50) NULL ,
cust_contact char(50) NULL ,
cust_email char(255) NULL
);
--------------------------
-- Create OrderItems table
--------------------------
CREATE TABLE OrderItems
(
order_num int NOT NULL ,
order_item int NOT NULL ,
prod_id char(10) NOT NULL ,
quantity int NOT NULL ,
item_price decimal(8,2) NOT NULL
);
----------------------
-- Create Orders table
----------------------
CREATE TABLE Orders
(
order_num int NOT NULL ,
order_date datetime NOT NULL ,
cust_id char(10) NOT NULL
);
------------------------
-- Create Products table
------------------------
CREATE TABLE Products
(
prod_id char(10) NOT NULL ,
vend_id char(10) NOT NULL ,
prod_name char(255) NOT NULL ,
prod_price decimal(8,2) NOT NULL ,
prod_desc varchar(1000) NULL
);
-----------------------
-- Create Vendors table
-----------------------
CREATE TABLE Vendors
(
vend_id char(10) NOT NULL ,
vend_name char(50) NOT NULL ,
vend_address char(50) NULL ,
vend_city char(50) NULL ,
vend_state char(5) NULL ,
vend_zip char(10) NULL ,
vend_country char(50) NULL
);
----------------------
-- Define primary keys
----------------------
ALTER TABLE Customers WITH NOCHECK ADD CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (cust_id);
ALTER TABLE OrderItems WITH NOCHECK ADD CONSTRAINT PK_OrderItems PRIMARY KEY CLUSTERED (order_num, order_item);
ALTER TABLE Orders WITH NOCHECK ADD CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (order_num);
ALTER TABLE Products WITH NOCHECK ADD CONSTRAINT PK_Products PRIMARY KEY CLUSTERED (prod_id);
ALTER TABLE Vendors WITH NOCHECK ADD CONSTRAINT PK_Vendors PRIMARY KEY CLUSTERED (vend_id);
----------------------
-- Define foreign keys
----------------------
ALTER TABLE OrderItems ADD
CONSTRAINT FK_OrderItems_Orders FOREIGN KEY (order_num) REFERENCES Orders (order_num),
CONSTRAINT FK_OrderItems_Products FOREIGN KEY (prod_id) REFERENCES Products (prod_id);
ALTER TABLE Orders ADD
CONSTRAINT FK_Orders_Customers FOREIGN KEY (cust_id) REFERENCES Customers (cust_id);
ALTER TABLE Products ADD
CONSTRAINT FK_Products_Vendors FOREIGN KEY (vend_id) REFERENCES Vendors (vend_id);
---------------------------
-- Populate Customers table
---------------------------
INSERT INTO Customers(cust_id, cust_name, cust_address, cust_city, cust_state, cust_zip, cust_country, cust_contact, cust_email)
VALUES('1000000001', 'Village Toys', '200 Maple Lane', 'Detroit', 'MI', '44444', 'USA', 'John Smith', 'sales@villagetoys.com');
INSERT INTO Customers(cust_id, cust_name, cust_address, cust_city, cust_state, cust_zip, cust_country, cust_contact)
VALUES('1000000002', 'Kids Place', '333 South Lake Drive', 'Columbus', 'OH', '43333', 'USA', 'Michelle Green');
INSERT INTO Customers(cust_id, cust_name, cust_address, cust_city, cust_state, cust_zip, cust_country, cust_contact, cust_email)
VALUES('1000000003', 'Fun4All', '1 Sunny Place', 'Muncie', 'IN', '42222', 'USA', 'Jim Jones', 'jjones@fun4all.com');
INSERT INTO Customers(cust_id, cust_name, cust_address, cust_city, cust_state, cust_zip, cust_country, cust_contact, cust_email)
VALUES('1000000004', 'Fun4All', '829 Riverside Drive', 'Phoenix', 'AZ', '88888', 'USA', 'Denise L. Stephens', 'dstephens@fun4all.com');
INSERT INTO Customers(cust_id, cust_name, cust_address, cust_city, cust_state, cust_zip, cust_country, cust_contact)
VALUES('1000000005', 'The Toy Store', '4545 53rd Street', 'Chicago', 'IL', '54545', 'USA', 'Kim Howard');
-------------------------
-- Populate Vendors table
-------------------------
INSERT INTO Vendors(vend_id, vend_name, vend_address, vend_city, vend_state, vend_zip, vend_country)
VALUES('BRS01','Bears R Us','123 Main Street','Bear Town','MI','44444', 'USA');
INSERT INTO Vendors(vend_id, vend_name, vend_address, vend_city, vend_state, vend_zip, vend_country)
VALUES('BRE02','Bear Emporium','500 Park Street','Anytown','OH','44333', 'USA');
INSERT INTO Vendors(vend_id, vend_name, vend_address, vend_city, vend_state, vend_zip, vend_country)
VALUES('DLL01','Doll House Inc.','555 High Street','Dollsville','CA','99999', 'USA');
INSERT INTO Vendors(vend_id, vend_name, vend_address, vend_city, vend_state, vend_zip, vend_country)
VALUES('FRB01','Furball Inc.','1000 5th Avenue','New York','NY','11111', 'USA');
INSERT INTO Vendors(vend_id, vend_name, vend_address, vend_city, vend_state, vend_zip, vend_country)
VALUES('FNG01','Fun and Games','42 Galaxy Road','London', NULL,'N16 6PS', 'England');
INSERT INTO Vendors(vend_id, vend_name, vend_address, vend_city, vend_state, vend_zip, vend_country)
VALUES('JTS01','Jouets et ours','1 Rue Amusement','Paris', NULL,'45678', 'France');
--------------------------
-- Populate Products table
--------------------------
INSERT INTO Products(prod_id, vend_id, prod_name, prod_price, prod_desc)
VALUES('BR01', 'BRS01', '8 inch teddy bear', 5.99, '8 inch teddy bear, comes with cap and jacket');
INSERT INTO Products(prod_id, vend_id, prod_name, prod_price, prod_desc)
VALUES('BR02', 'BRS01', '12 inch teddy bear', 8.99, '12 inch teddy bear, comes with cap and jacket');
INSERT INTO Products(prod_id, vend_id, prod_name, prod_price, prod_desc)
VALUES('BR03', 'BRS01', '18 inch teddy bear', 11.99, '18 inch teddy bear, comes with cap and jacket');
INSERT INTO Products(prod_id, vend_id, prod_name, prod_price, prod_desc)
VALUES('BNBG01', 'DLL01', 'Fish bean bag toy', 3.49, 'Fish bean bag toy, complete with bean bag worms with which to feed it');
INSERT INTO Products(prod_id, vend_id, prod_name, prod_price, prod_desc)
VALUES('BNBG02', 'DLL01', 'Bird bean bag toy', 3.49, 'Bird bean bag toy, eggs are not included');
INSERT INTO Products(prod_id, vend_id, prod_name, prod_price, prod_desc)
VALUES('BNBG03', 'DLL01', 'Rabbit bean bag toy', 3.49, 'Rabbit bean bag toy, comes with bean bag carrots');
INSERT INTO Products(prod_id, vend_id, prod_name, prod_price, prod_desc)
VALUES('RGAN01', 'DLL01', 'Raggedy Ann', 4.99, '18 inch Raggedy Ann doll');
INSERT INTO Products(prod_id, vend_id, prod_name, prod_price, prod_desc)
VALUES('RYL01', 'FNG01', 'King doll', 9.49, '12 inch king doll with royal garments and crown');
INSERT INTO Products(prod_id, vend_id, prod_name, prod_price, prod_desc)
VALUES('RYL02', 'FNG01', 'Queen doll', 9.49, '12 inch queen doll with royal garments and crown');
------------------------
-- Populate Orders table
------------------------
INSERT INTO Orders(order_num, order_date, cust_id)
VALUES(20005, '2012-05-01', '1000000001');
INSERT INTO Orders(order_num, order_date, cust_id)
VALUES(20006, '2012-01-12', '1000000003');
INSERT INTO Orders(order_num, order_date, cust_id)
VALUES(20007, '2012-01-30', '1000000004');
INSERT INTO Orders(order_num, order_date, cust_id)
VALUES(20008, '2012-02-03', '1000000005');
INSERT INTO Orders(order_num, order_date, cust_id)
VALUES(20009, '2012-02-08', '1000000001');
----------------------------
-- Populate OrderItems table
----------------------------
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20005, 1, 'BR01', 100, 5.49);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20005, 2, 'BR03', 100, 10.99);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20006, 1, 'BR01', 20, 5.99);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20006, 2, 'BR02', 10, 8.99);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20006, 3, 'BR03', 10, 11.99);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20007, 1, 'BR03', 50, 11.49);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20007, 2, 'BNBG01', 100, 2.99);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20007, 3, 'BNBG02', 100, 2.99);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20007, 4, 'BNBG03', 100, 2.99);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20007, 5, 'RGAN01', 50, 4.49);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20008, 1, 'RGAN01', 5, 4.99);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20008, 2, 'BR03', 5, 11.99);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20008, 3, 'BNBG01', 10, 3.49);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20008, 4, 'BNBG02', 10, 3.49);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20008, 5, 'BNBG03', 10, 3.49);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20009, 1, 'BNBG01', 250, 2.49);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20009, 2, 'BNBG02', 250, 2.49);
INSERT INTO OrderItems(order_num, order_item, prod_id, quantity, item_price)
VALUES(20009, 3, 'BNBG03', 250, 2.49);

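The scripts that create and populate the sample tables can be downloaded from the bottom of this page:

http://www.forta.com/books/0672336073/

The Customers and Orders tables look like this: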

2 Self-contained single-value subquery (self-contained scalar subquery)

The subquery returns a single value rather than a set of rows.

select *
from customers as C
where C.cust_id=(
select O.cust_id from Orders as O where O.order_num = 20008
);
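
Note: when using a self-contained scalar subquery, make sure the subquery really returns a single value and not a set of rows.
If it returns more than one row, the comparison with = fails with an error (SQL Server, for example, reports that the subquery returned more than 1 value).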

3 Self-contained multi-value subquery

The subquery returns a set of rows.

select *
from customers as C
where C.cust_id IN(
select O.cust_id from Orders as O where O.order_num between 20005 and 20007
);
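
4 Correlated subquery

The basic execution logic of a correlated subquery is that each row of the outer table is substituted into the subquery, one row at a time. This is the key to understanding correlated subqueries.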

select *
from Customers as C
where exists (
select * from Orders as O
where O.cust_id=C.cust_id and (O.order_num between 20005 and 20008)
);

5 Using a subquery as a calculated field

List each customer's name, the customer's state, and the customer's number of orders:

select cust_name,cust_state,
(select COUNT(*) from Orders where Orders.cust_id=Customers.cust_id) as Orders
from Customers
order by cust_name;
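
The third column is the subquery, which counts the orders of each customer.

Run on its own for a single customer, the subquery in the third column looks like this: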

select COUNT(*) from Orders where cust_id='1000000001';

Massive Data Storage: An Introduction to Key-Value Stores

September 1, 2016, 2:18 am

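Introduction to key-value stores

Storing massive amounts of data with high reliability and scalability is a major challenge for internet companies. Traditional databases often struggle to meet this need, and in many systems the vast majority of lookups are by primary key. In that situation a relational database is inefficient, and scaling it becomes a serious problem later on. A key-value store is a good choice in such cases.

Key-value stores are widely used in caching, search engines, and similar areas.

Given the above, what does a good key-value store need to provide?

- Availability

- Scalability

- Failover

- Performance

Put simply: data must not be lost, the service must not be interrupted, failures must be detected and recovered from automatically, and reads and writes must be very fast.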

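File storage

This is a large topic that I will cover in a separate post.

Single file or multiple files

Quite a few NoSQL products store everything in a single file, which is bound to become a performance bottleneck as the data grows. Storing data across multiple files has many advantages, but note that the operating system limits the number of files a process can have open. On Linux the default per-process limit is commonly 1024 and can be raised with ulimit -n.

Only Append

To make writes faster, data files support only append. Most massive-storage designs work this way. An update is therefore equivalent to a write: the first step of the write, which locates the position in the tree, lands on an existing node, and the write can then invalidate or directly supersede the old entry.

This raises the question of what to do with invalidated data, such as records that have since been updated. A good approach is to run a separate thread that cleans up on a schedule or on demand. Be aware that this is a very heavy process that can saturate your CPU and I/O, because it involves a lot of computation and data movement.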
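The append-only idea can be sketched in a few lines of Java. This is a minimal in-memory model, not the design of any particular product: the class name AppendOnlyStore, the list standing in for the data file, and the map standing in for the tree are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of an append-only data file. */
public class AppendOnlyStore {
    /** One record in the log; records are never modified in place. */
    private static final class Record {
        final String key;
        final String value;
        Record(String key, String value) { this.key = key; this.value = value; }
    }

    private List<Record> log = new ArrayList<>();               // stands in for the data file
    private final Map<String, Integer> index = new HashMap<>(); // key -> position of the live record

    /** Insert and update are the same operation: append, then repoint the index. */
    public void put(String key, String value) {
        log.add(new Record(key, value));
        index.put(key, log.size() - 1);
    }

    public String get(String key) {
        Integer pos = index.get(key);
        return pos == null ? null : log.get(pos).value;
    }

    /** Number of superseded records still occupying space in the log. */
    public int staleCount() {
        return log.size() - index.size();
    }

    /** Cleanup: copy only live records into a fresh log and repoint the index. */
    public void compact() {
        List<Record> fresh = new ArrayList<>();
        for (int i = 0; i < log.size(); i++) {
            Record r = log.get(i);
            if (index.get(r.key) == i) {   // still the live version of this key
                index.put(r.key, fresh.size());
                fresh.add(r);
            }
        }
        log = fresh;
    }
}
```

Writing the same key twice leaves one stale record behind; compact() is the expensive step the text warns about, since it has to touch every record in the log.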

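Data structure

The B-tree family is widely used for database indexes, for example the B+tree in MSSQL and the B-tree in Oracle. Anyone familiar with indexes will see that this structure is well suited to a key-value store. The figure below shows a B+tree: it is a multi-way search tree in which data is stored in the leaf nodes, non-leaf nodes act as an index over the leaves to speed up lookups, and the leaves form an ordered linked list. Every search ends at a leaf node, and inserting new data may cause nodes to split.

For this article you need to know that upper-level nodes are called IN (Internal Node) and hold references to other nodes, that the level directly above the leaves is the BIN (Bottom Internal Node), and that the leaf nodes are the ones that store data.

Image source: http://blog.csdn.net/manesking/archive/2007/02/09/1505979.aspx

This part is pure data-structure material, so I will not go further. To dig deeper, see the paper "The Ubiquitous B-Tree".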

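Design considerations
Partition

Because the system must be highly scalable, adding and removing machines is a frequent operation. How do we spread data evenly across the cluster? A common approach is hash modulo the number of machines, but then, the moment a machine is added, keys hashed under the old modulus can no longer be found. Data has to be migrated and the machines warmed up, which is a poor solution.

The generally accepted solution today is consistent hashing.

First, the machines are placed clockwise around a ring according to their hash values. In the figure there are five machines. For a read or write request, the key is hashed and the result is located on the ring. If it does not land exactly on a machine, we walk clockwise, and the first machine found is the target node.

To add a machine, we hash the new machine to get its position and insert it, say machine F, while the access policy stays the same. Under that policy, data that hashes into the C-F range can no longer be found, but the impact is confined to C-F, and a full migration is no longer needed.

Finally, to reduce the impact of adding machines, we can introduce virtual nodes. Several virtual nodes map to one physical node, so servers are distributed more evenly around the ring and the effect of adding a machine is minimized.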
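The ring lookup described above can be sketched in Java with a sorted map. This is a minimal sketch under assumptions: the class name HashRing, the "node#i" naming of virtual nodes, and the use of MD5 as the hash are illustrative choices, not something the article specifies.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical consistent-hash ring with virtual nodes. */
public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // ring position -> physical node
    private final int virtualNodes;

    public HashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Places the node on the ring once per virtual node. */
    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    /** Walks clockwise from the key's position to the first virtual node. */
    public String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring is empty");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue(); // wrap around past the end
    }

    /** First 8 bytes of MD5 as a long; any well-mixed hash would do. */
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

The property that matters is that adding a node only ever moves keys onto the new node; keys never shuffle between the existing machines, which is why the migration stays local.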

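Replication

To achieve high availability and avoid data loss, we replicate data to multiple machines. Replication is usually implemented over TCP/IP connections between the master and its replicas, with data distributed to the replicas according to a consistency policy. That policy has two main parts:

1. How far a replica is allowed to lag behind the master. Updates made within this window may not yet be visible on the replica. For example, if the consistency window is 3s, then at any given moment the data on a replica may be a snapshot of the master from up to 3s earlier.

2. Whether the master must receive acknowledgement from replicas before a transaction commit returns. To minimize the risk of data loss, the master commits only after a certain number of replicas have confirmed that the update succeeded.

There is a trade-off between data reliability and performance: the stronger the data guarantee, the more performance suffers. The application on top has to be analyzed to decide whether losing some data is acceptable.

Another question is whether data should be pushed by the master or pulled periodically by the replicas. Each has pros and cons: the former becomes a bottleneck when there are many replicas, and the latter introduces some delay. Multi-level synchronization helps to some extent: the master syncs to some machines, and those machines sync to the others.

Naturally, the master handles writes and the replicas handle reads. To route reads and writes we can use a monitor node as the entry point, as in the figure below. The Monitor also manages the cluster's state and lifecycle. For example, when the Master fails, the Monitor receives an event and starts an election to pick a new Master. A typical election algorithm looks for the node with the most recent update, since its data is usually the newest. And when a new machine joins the cluster, the Monitor tells it who the master is, so that the replica can connect to the master and sync data.

A Master-replica cluster using a Monitor

Now that ZooKeeper exists, this kind of monitoring and coordination work can be handed to it. Using ZooKeeper to track the health of the nodes in a cluster is very convenient. The architecture above, after all, has a single point of failure, and sooner or later the Monitor will hold the cluster back. So we use ZooKeeper to manage the cluster and react to the different events. Here is a diagram:

Managing the cluster with ZooKeeper

Finally, the handling of transaction commits also deserves attention, and it needs to be specified for both the master and the replicas. In general we have to balance reducing frequent I/O against data safety. The following commit strategies are available:

1. On commit, flush data from memory to disk. This gives the strongest guarantee, but you have to wait for expensive I/O.

2. On commit, write data to the operating system's buffer and let the operating system decide when to flush it to disk. This protects the data if the virtual machine (the JVM process) crashes, but not against hardware failure.

3. On commit, keep the data in memory and persist it to disk when the storage system is relatively idle, or write to disk when memory runs short.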
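The difference between the first two commit strategies can be illustrated with Java NIO. This is a minimal sketch under assumptions: the class CommitLog and its Durability enum are hypothetical names, and a real store would batch writes and handle partial failures.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical commit log showing two durability levels. */
public class CommitLog implements AutoCloseable {
    public enum Durability {
        SYNC,      // strategy 1: force the data to disk before returning
        OS_BUFFER  // strategy 2: hand the data to the OS and return
    }

    private final FileChannel channel;

    public CommitLog(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    public void commit(String record, Durability durability) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) {
            channel.write(buf);      // after this the data is in the OS buffer
        }
        if (durability == Durability.SYNC) {
            channel.force(true);     // block until the OS has written it to the device
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

With OS_BUFFER the record survives a crash of the process but not a power loss; with SYNC every commit pays for a disk flush. The third strategy would simply defer the channel.write call itself.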

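Get and Put operations

These are the two core operations of a key-value store, so their design deserves extra thought.

Here is one way a write can proceed:

1. Search the tree to locate where the data should be inserted.

2. Lock the parent node of that position.

3. Create a new leaf node.

4. Write the leaf node. This write happens in memory and returns a number that determines where the leaf will be written on disk.

5. Update the parent node's reference to point at the leaf. The parent holds both a reference to the in-memory leaf and the number giving its position on disk.

6. Mark the parent node as dirty (meaning the in-memory version is not yet on disk).

7. Unlock the parent node.

Note that this write happens entirely in memory. So when is the data synced to disk? That is the job of the checkpoint discussed later, which periodically writes dirty in-memory data to disk.

Reads are much simpler: search the tree, locate the leaf node, and return the data. There are also some operations that serve the cache, which I will not go into (for example, keeping a counter per node and incrementing it on every access, which is useful when the cache needs to evict).

In-memory writes as described above do not affect reads, because to speed up reads we preload the data files from disk into memory at startup. Since only leaf nodes store data, we have to rebuild the whole B+tree from the loaded leaves. This is certainly time-consuming, but it is worth it.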

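Data model

This is a big topic; here I will focus only on which data format to use for storage and retrieval.

The first problem is storing objects. We clearly need to store and load objects, so how do we convert an object into the underlying storage format? The usual options are serialization, JSON, XML, and so on. Their pros and cons in turn:

1. Serialization. Probably the simplest to implement, but it takes too much space.

2. JSON and XML are much alike. The stored format is more readable and parsing and conversion are convenient, but they are still not recommended for large volumes of data.

3. Strings or byte arrays. We assemble the object into a string according to an agreed convention, or write the object's attributes one after another into a byte array, and parse them back in the same order when reading. A good approach is to define an interface and let the client implement the conversion between object and string or bytes. This is the option I recommend.

There are also serialization tools worth recommending, such as Avro from the Hadoop ecosystem.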

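Checkpoint

Like an ordinary relational database, a key-value store can have its own checkpoint. In general a checkpoint exists to reduce the time needed for data recovery. When a checkpoint occurs, under the design so far it writes every dirty Internal Node to the log. The problem is that in most cases the checkpoint ends up writing the whole tree to the log. The fix is to log incrementally. For example, if a and b are added under some parent node, then at checkpoint time we first write one complete IN and afterwards record only the operations applied to it, add a and add b, in a list. When the list grows too large, we write a complete IN node again.

Cleaning up stale data

This section applies only to sequential, append-only writes. To reduce I/O, invalid data is merely marked as deleted and the corresponding leaf node is removed from the in-memory tree; nothing is physically deleted. Over time a lot of stale data accumulates and must be cleaned up. The usual strategy is to run a cleanup once stale data makes up a certain fraction of a file.

1. First, use the stored bookkeeping information to decide which files need cleaning.

2. Scan each such file, copy the data that is still active to a new file, and delete the old file when the scan finishes.

Note that under write-heavy load this causes contention, so try to run the cleanup when traffic is low.

The cleanup can also be done with multiple threads, and many other strategies can be applied during cleanup. For example, cache the parent node of a leaf, so that while the parent is locked for a migration the other leaves under it can be scanned in the same pass.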

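Need more?
Complex queries

Human desires are endless. Besides lookup by primary key you may also want lookup by some attribute, and ideally the ability to query on several attributes and intersect the results. What then?

We start with a key-value database keyed by primary key: the key holds the primary key and the value holds the object (physically stored as a byte array). To query on an attribute of the object, say name, we can build a second database whose key is the field to be queried and whose value is the id in the primary database.

Primary Database

Key (ID)              Value (Object)
                      byte[]

Secondary Database

Foreign Key (Name)    Value (ID)
dahuang
tugou

A query by name then needs one lookup in the secondary database to fetch the primary key id, followed by one lookup in the primary database. This resembles the concepts of a forward index and an inverted index. For a combined query over several fields, the results only need to be joined once. Before joining, sort the intermediate results by size so that the smallest set drives the join, which saves a good deal of time.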
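The primary and secondary databases can be sketched with plain Java maps. This is a minimal sketch, assuming each id is inserted only once; the class name IndexedStore and the second attribute city are hypothetical, added only so that a two-field query has something to intersect.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical primary store with two secondary indexes. */
public class IndexedStore {
    private final Map<Integer, String[]> primary = new HashMap<>();   // id -> {name, city}
    private final Map<String, Set<Integer>> byName = new HashMap<>(); // name -> ids
    private final Map<String, Set<Integer>> byCity = new HashMap<>(); // city -> ids

    /** Writes the object and maintains both secondary indexes. */
    public void put(int id, String name, String city) {
        primary.put(id, new String[] {name, city});
        byName.computeIfAbsent(name, k -> new HashSet<>()).add(id);
        byCity.computeIfAbsent(city, k -> new HashSet<>()).add(id);
    }

    /** Primary-key lookup. */
    public String[] get(int id) {
        return primary.get(id);
    }

    /** Secondary lookup: name -> ids; a second step against primary fetches the objects. */
    public Set<Integer> findByName(String name) {
        return byName.getOrDefault(name, Collections.emptySet());
    }

    /** Two-field query: iterate over the smaller id set and probe the larger one. */
    public Set<Integer> findByNameAndCity(String name, String city) {
        Set<Integer> a = byName.getOrDefault(name, Collections.emptySet());
        Set<Integer> b = byCity.getOrDefault(city, Collections.emptySet());
        Set<Integer> small = a.size() <= b.size() ? a : b;
        Set<Integer> large = small == a ? b : a;
        Set<Integer> out = new HashSet<>();
        for (Integer id : small) {
            if (large.contains(id)) {
                out.add(id);
            }
        }
        return out;
    }
}
```

Iterating over the smaller set is the in-memory analogue of ordering the join inputs by result size. The extra maps are the cost the article mentions: each index takes space and must be updated on every write.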

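Although a key-value store can support multi-condition queries in the way described above, I do not recommend it: first, building the index (the secondary database) takes extra space, and second, the multiple lookups hurt performance. In any case, strike a reasonable balance.

Finally, it is worth mentioning that because of the B+tree structure, the store supports range queries well (the query does not have to descend to the leaf nodes). This can largely make up for a weakness of inverted indexes in search engines, where a range query requires a full scan, and it is one of the store's application scenarios.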

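Summary

Key-value stores hold an important place in massive data storage. Studying them in depth offers many insights, and their excellent performance on certain specific problems deserves attention. This article has roughly summarized the issues I currently understand. Much was left out (file system design, transactions, caching, and so on). If time permits, I will cover file system design in detail next.

Disclaimer

First, this article only represents my own modest views and not any official position. Second, given the limits of my knowledge there are bound to be mistakes, and corrections are welcome (preferably by private message, to spare me some embarrassment -_-).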

The Little Redis Book (Chinese edition)

September 1, 2016, 2:17 am

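About this book

License

The Little Redis Book is licensed under the Attribution-NonCommercial 3.0 Unported license. You do not have to pay for this book.

You are free to copy, distribute, modify, or display the book. However, you must attribute it to its author, Karl Seguin, and its translator, Lai Liwei (赖立维), and you must not use it for commercial purposes.

The full text of the license is here:

http://creativecommons.org/licenses/by-nc/3.0/legalcode

About the author

Karl Seguin is a developer with many years of experience across several technologies. He is an active contributor to open source projects, a technical writer, and an occasional speaker. He has written several articles about Redis as well as some tools. Redis powers the ranking and statistics features of his free service for casual game developers: mogade.com.

Karl previously wrote The Little MongoDB Book, a free and well-received book about MongoDB.

His blog is at http://openmymind.net, and you can follow him on Twitter at @karlseguin.

About the translator

The translator, Lai Liwei, is an ordinary programmer who grew up in China. He is interested in many technologies, supports open source software, and is a light Emacs user.

Although the translator took this translation seriously, his ability is limited and there are bound to be errors and omissions. If you find something in the translation that needs fixing, you can reach him by email at jasonlai256@gmail.com.

Acknowledgements

Special thanks to Perry Neal for his ongoing guidance. My vision, my instincts, and my passion all come from you. You have given me invaluable help. Thank you.

Latest version

The latest source of this book is available at: http://github.com/karlseguin/the-little-redis-book

The Chinese edition is a fork of the English edition. The latest Chinese version is at: https://github.com/JasonLai256/the-little-redis-book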

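Introduction

Over the last few years, the techniques and tools for persisting and querying data have grown at an astonishing pace. It is safe to say that relational databases are no longer a universal answer. In other words, there is no longer a single solution for everything to do with data.

To me, the most exciting of the many new solutions and tools is Redis. Why? First, because it is remarkably easy to learn: a few hours is enough to get a general feel for it. Second, Redis solves a specific set of problems while remaining quite generic. More precisely, Redis does not try to be everything for all data. As you get to know Redis, it becomes increasingly clear what belongs in Redis and what does not. For a developer, that is a very pleasant experience.

While you can build a complete system using only Redis, I think most people will find that it supplements their more general data solution, whether that is a traditional relational database, a document-oriented system, or something else. It is the kind of solution you use to implement specific features. In that sense it is similar to an indexing engine: you would not build your whole application on Lucene, but when you need good search, why not use it? It is better for both you and your users. Of course, the similarities between Redis and indexing engines end there.

The goal of this book is to build the foundation you need to master Redis. We will focus on learning Redis's five data structures and look at various data modeling approaches. We will also touch on some key administrative details and debugging techniques.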

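Getting started

Everyone learns differently: some prefer hands-on practice, some like watching videos, and some like reading. With Redis, nothing works better than trying it yourself. Redis is easy to install, and it comes with a simple command-line client that can do everything we want. Let's take a few minutes to get Redis running on our machine.

On Windows

Redis does not officially support Windows, but there are options available. You would not run these in production, but I have never hit any limitations during development.

Go to https://github.com/dmajkic/redis/downloads and download the most recent version (it should be at the top of the list).

Extract the zip file and, depending on your architecture, open either the 64bit or the 32bit folder.

On *nix and Mac OS X

For *nix and Mac OS X users, building from source is your best option. The latest version number is available at http://redis.io/download. At the time of writing the latest version is 2.4.6. To install it, run: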

wget http://redis.googlecode.com/files/redis-2.4.6.tar.gz

tar xzf redis-2.4.6.tar.gz

cd redis-2.4.6

make

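(Redis is also available through package managers. For example, Mac OS X users with Homebrew installed can simply type brew install redis.)

If you built from source, the binary executables are placed in the src directory. Run cd src to switch to it.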

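Running and connecting to Redis

If everything worked, the Redis binaries should now be at your fingertips. Redis has only a handful of executables. We will focus on the Redis server and the Redis command-line interface (a DOS-like client). First, let's start the server. On Windows, double-click redis-server. On *nix/Mac OS X, run ./redis-server.

If you read the startup messages you will see a warning that the redis.conf file could not be found. Redis will use its built-in defaults instead, which is fine for what we are about to do.

Next, start the Redis console by double-clicking redis-cli (Windows) or running ./redis-cli (*nix/Mac OS X). It connects to the locally running server on the default port (6379).

You can check that everything is working by entering info in the command-line interface. You should see a long list of key-value pairs that give a great deal of insight into the server's status.

If you run into problems with this setup, I suggest you seek help in the official Redis support group.

Redis drivers

As you will soon learn, Redis's API is best described as an explicit set of functions. Redis is remarkably simple, and so is the way you work with it. This means that whether you use the command-line tool or a driver for your favorite language, things feel very similar. So if you would rather work from a programming language than from the command line, you should have no trouble following along. If you want, head over to the Redis client page and download the appropriate driver.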

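Chapter 1 - The Basics

What makes Redis special? What types of problems does it solve? What should developers watch out for when using it? Before we can answer any of these questions, we need to understand what Redis is.

Redis is often described as an in-memory persistent key-value store. I do not think that description is accurate. Redis does hold all the data in memory (more on this shortly), and it does write it out to disk for persistence, but it is much more than a simple key-value store. It is important to get past this misconception, otherwise your perspective on Redis and the problems it solves will be too narrow.

In reality, Redis exposes five different data structures, only one of which is a typical key-value structure. The key to understanding Redis is understanding these five data structures: how they work, which methods they expose, and what you can model with them. First, let's look at what these data structures actually mean.

If we apply this data structure concept to the relational world, we could say that relational databases expose a single data structure: tables. Tables are both complex and flexible, and there is not much you cannot model, store, or manipulate with them. However, their generic nature is not without drawbacks. Specifically, not everything is as simple or as fast as it ought to be. What if, rather than a one-size-fits-all structure, we used more specialized structures? There might be some things we cannot do (or at least cannot do very well), but surely we would gain in simplicity and speed?

Using specific data structures for specific problems? Isn't that how we program? You do not use a hash table for every piece of data, nor a scalar variable. To me, that defines Redis's approach. If you are dealing with scalars, lists, hashes, or sets, why not store them as scalars, lists, hashes, and sets? Why should checking whether a value exists be any more complex than calling exists(key), or slower than O(1) (a constant-time lookup that does not slow down no matter how many items there are)?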

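Databases

Redis has the same basic concept of a database that you already know from relational databases: a database contains a set of data. The typical use case for a database is to group all of an application's data together and keep it separate from another application's data.

In Redis, databases are simply identified by a number, with the default database being number 0. To switch to a different database, use the select command. In the command-line interface, type select 1. Redis should reply with an OK message and the prompt should change to something like redis 127.0.0.1:6379[1]>. To switch back to the default database, just enter select 0.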

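Commands, Keys and Values

While Redis is more than just a key-value store, at its core every one of Redis's five data structures has at least a key and a value. We need to understand keys and values before moving on to anything else.

Keys are how you identify pieces of data. We will be dealing with keys a lot, but for now it is enough to know that a key might look like users:leto. One could reasonably expect such a key to contain information about a user named leto. The colon has no special meaning as far as Redis is concerned, but using a separator is a common way to organize keys.

Values represent the actual data associated with the key. They can be anything. Sometimes you will store strings, sometimes integers, and sometimes serialized objects (in JSON, XML, or some other format). For the most part, Redis treats values as a byte array and does not care what they are. Note that different drivers handle serialization differently (some leave it up to you), so in this book we will only talk about strings, integers, and JSON.

Let's get our hands a little dirty. Enter the following command: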

set users:leto "{name: leto, planet: dune, likes: [spice]}"

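This is the basic anatomy of a Redis command. First we have the actual command, in this case set. Then we have its parameters. The set command takes two parameters: the key we are setting and the value we are setting it to. Many commands take a key (and when they do, it is often the first parameter). Can you guess how to retrieve this value? Hopefully you said (but don't worry if you weren't sure):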

get users:leto

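Keys and values are the fundamental concepts of Redis, and the get and set commands are the simplest way to use them. Go ahead and create more users, try different types of keys and different values, and experiment with various combinations.

Querying

As we move forward, two things will become clear. As far as Redis is concerned, keys are everything and values are nothing. Put another way, Redis does not allow you to query by value. Going back to the example above, we cannot find the users who live on planet dune.

For many people this causes some concern. We live in a world where data querying is so flexible and powerful that Redis's approach seems primitive and inefficient. Don't let it unsettle you too much. Remember, Redis is not a one-size-fits-all solution. There are things that simply do not belong in Redis (because of its querying limitations). In fact, after thinking these cases through, you will find new ways to model your data.

We will look at more concrete use cases soon. What matters for now is understanding this basic fact about Redis. It helps explain why values can be anything: Redis never needs to read or understand them. It also helps us start thinking about how to model in this new world.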

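Memory and Persistence

We mentioned earlier that Redis is an in-memory persistent store. With respect to persistence, by default Redis snapshots the database to disk based on how many keys have changed. You configure it so that if X keys have changed, the database is saved every Y seconds. By default, Redis saves the database every 60 seconds if 1000 or more keys have changed, ranging down to every 15 minutes if 9 or fewer keys have changed.

As an alternative (or in addition) to snapshotting, Redis can run in append mode. Any time a key changes, an append-only file is updated on disk. In some cases it is acceptable to lose 60 seconds' worth of data in exchange for better performance, should a hardware or software failure occur. In other cases such a loss is not acceptable. Redis gives you the choice. In chapter 5 we will see a third option, which is offloading persistence to a slave.

With respect to memory, Redis keeps all your data in memory. The obvious implication is the cost of running Redis: RAM is still the most expensive part of server hardware.

I do feel that some developers have lost touch with how little space data can take. The Complete Works of William Shakespeare takes roughly 5.5MB of storage. As for scaling, other solutions tend to be IO-bound or CPU-bound. Which limitation (RAM or IO) will force you to scale out to more machines really depends on the type of data and on how you store and query it. Unless you are storing large multimedia files in Redis, keeping everything in memory is probably not an issue. For applications where it is an issue, you would likely just be trading being memory-bound for being IO-bound.

Redis did add support for virtual memory. However, this feature has been deemed a failure (by Redis's own developers) and its use has been deprecated.

(On a side note, that 5.5MB file of Shakespeare's complete works can be compressed down to roughly 2MB. Redis does not compress values automatically, but since it treats all values as bytes, nothing stops you from compressing and decompressing the data yourself, trading processing time for storage space.)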

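Putting It Together

We have touched on several high-level topics. The last thing I want to do before diving further into Redis is bring them together: the query limitations, the data structures, and the way Redis stores data in memory.

When you add these three things together you end up with something wonderful: speed. Some people think, of course Redis is fast, everything is in memory. But that is only part of it. The real reason Redis shines compared with other solutions is its specialized data structures.

How fast? It depends on many things, including which commands you are using, the type of data, and so on. But Redis's performance tends to be measured in tens of thousands or hundreds of thousands of operations per second. You can test it by running redis-benchmark (in the same folder as redis-server and redis-cli).

I once converted some code from a traditional model to Redis. A load test I had written took over 5 minutes to finish under the traditional model. With Redis it took about 150 milliseconds. You will not always see gains that large, but hopefully it gives you a clearer idea of what we are talking about.

It is important to understand this aspect of Redis because it affects how you interact with it. Developers with an SQL background often work at minimizing the number of round trips to the database. That is good advice for any system, including Redis. However, given that we are dealing with simpler data structures, we sometimes need to hit the Redis server several times to achieve our goal. Such data access patterns can feel unnatural at first, but in reality the cost tends to be insignificant compared with the raw performance we gain.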

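In This Chapter

Although we have only barely started playing with Redis, we have covered a wide range of topics. Don't worry if something is not crystal clear (querying, for example). In the next chapter we will dig deeper, and hopefully your questions will be answered.

The important takeaways from this chapter are:

Keys are strings that identify pieces of data (values)

Values are arbitrary byte arrays that Redis does not care about

Redis exposes (and is implemented as) five specialized data structures

Combined, these make Redis fast and easy to use, but not suitable for every scenario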

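Chapter 2 - The Data Structures

It is time to look at Redis's five data structures. We will explain what each one is, which methods are available, and what types of features and data you would use it for.

So far the only Redis constructs we have seen are commands, keys, and values. Nothing about data structures has been concrete. When we used the set command, how did Redis know which data structure to use? It turns out that every command is specific to a data structure. For example, when you use set you are storing the value in a string data structure, and when you use hset you are storing it in a hash. Given how small Redis's command vocabulary is, this is quite manageable.

The Redis website has excellent reference documentation, and there is no point in repeating it. We will only cover the most important commands needed to understand the purpose of each data structure.

Nothing is more important than having fun and trying things out. You can always erase all the values in your database by entering flushdb, so don't be shy: go try something crazy!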

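Strings

Strings are the most basic data structure in Redis. When you think of a key-value pair, you are thinking of strings. Don't be confused by the name: as always, your value can be anything. I prefer to call them "scalars", but maybe that's just me.

We have already seen a common use case for strings: storing instances of objects by key. This is something you will use a lot:

set users:leto "{name: leto, planet: dune, likes: [spice]}"

Beyond this, Redis offers some other common operations. For example, strlen <key> returns the length of a key's value; getrange <key> <start> <end> returns the specified range of the value; append <key> <value> appends the value to the existing value (or creates the pair if the key does not exist yet). Go ahead and try them. This is what I get: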

> strlen users:leto

(integer) 42

> getrange users:leto 27 40

"likes: [spice]"

> append users:leto " OVER 9000!!"

(integer) 54

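Now you might be thinking: that's nice, but it doesn't make much sense. You cannot meaningfully pull a range out of JSON or append a value to it. You are right. The lesson here is that some commands, especially those for the string data structure, only make sense given a specific type of data.

Earlier we learned that Redis does not care about your values. Most of the time that is true. However, a few string commands are designed for particular types or structures of values. As a somewhat vague example, append and getrange could be useful with some custom space-efficient serialization. As a more concrete example, consider the incr, incrby, decr, and decrby commands, which increment or decrement the value of a string: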

> incr stats:page:about

(integer) 1

> incr stats:page:about

(integer) 2

> incrby ratings:video:12333 5

(integer) 5

> incrby ratings:video:12333 3

(integer) 8

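As you can imagine, Redis strings are great for analytics. Try incrementing users:leto (a value that is not an integer) and see what happens (you should get an error).

A more advanced example is the setbit and getbit commands. "How many unique visitors did we have today?" is a common question in web applications, and there is a wonderful blog post about how Spool uses these two commands to answer it efficiently. For 128 million users, a laptop produces the answer in under 50 milliseconds using only 16MB of memory.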
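The bitmap idea can be sketched locally with java.util.BitSet. This is a minimal in-process analogy, not Spool's implementation and not a Redis client: the class DailyVisitors is a hypothetical name, recordVisit plays the role of SETBIT and visited the role of GETBIT.

```java
import java.util.BitSet;

/** Hypothetical one-bit-per-user tracker for a single day. */
public class DailyVisitors {
    private final BitSet seen;

    public DailyVisitors(int maxUsers) {
        seen = new BitSet(maxUsers);
    }

    /** Like "setbit <key> <userId> 1": marks the user as seen today. */
    public void recordVisit(int userId) {
        seen.set(userId);
    }

    /** Like "getbit <key> <userId>". */
    public boolean visited(int userId) {
        return seen.get(userId);
    }

    /** Number of distinct users seen, however often each one visited. */
    public int uniqueVisitors() {
        return seen.cardinality();
    }

    /** One bit per user, rounded up to whole bytes. */
    public static long bytesNeeded(long users) {
        return (users + 7) / 8;
    }
}
```

The memory figure follows from one bit per user: 128,000,000 / 8 = 16,000,000 bytes, which is the 16MB quoted above.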

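The most important thing is not whether you understand how bitmaps work or how Spool uses these commands, but that you realize Redis strings are more powerful than they first seem. Still, the most common use cases are the ones we gave above: storing objects (simple or complex) and counters. Also, since getting a value by key is so fast, strings are often used to cache data.

Hashes

We already know that calling Redis a key-value store is not quite accurate, and hashes are a good example of why. In many ways hashes are like strings. The important difference is that they provide an extra level of indirection: a field. The hash equivalents of set and get are therefore: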

hset users:goku powerlevel 9000

hget users:goku powerlevel

We can also set multiple fields at once, get multiple fields at once, get all fields and values, list all the fields, or delete a specific field:

hmset users:goku race saiyan age 737

hmget users:goku race powerlevel

hgetall users:goku

hkeys users:goku

hdel users:goku age

如你所见，散列数据结构比普通的字符串数据结构具有更多的可操作性。我们可以使用一个散列数据结构去获得更精确的描述，是存储一个用户，而不是一个序列化对象。从而得到的好处是能够提取、更新和删除具体的数据片段，而不必去获取或写入整个值。

对于散列数据结构，可以从一个经过明确定义的对象的角度来考虑，例如一个用户，关键之处在于要理解他们是如何工作的。从性能上的原因来看，这是正确的，更具粒度化的控制可能会相当有用。在下一章我们将会看到，如何用散列数据结构去组织你的数据，使查询变得更为实效。在我看来，这是散列真正耀眼的地方。

列表(Lists)

对于一个给定的关键字，列表数据结构让你可以存储和处理一组值。你可以添加一个值到列表里、获取列表的第一个值或最后一个值以及用给定的索引来处理值。列表数据结构维护了值的顺序，提供了基于索引的高效操作。为了跟踪在网站里注册的最新用户，我们可以维护一个newusers的列表：

lpush newusers goku

ltrim newusers 0 49

(译注：ltrim命令的具体构成是LTRIM Key start stop。要理解ltrim命令，首先要明白Key所存储的值是一个列表，理论上列表可以存放任意个值。对于指定的列表，根据所提供的两个范围参数start和stop，ltrim命令会将指定范围外的值都删除掉，只留下范围内的值。)

首先，我们将一个新用户推入到列表的前端，然后对列表进行调整，使得该列表只包含50个最近被推入的用户。这是一种常见的模式。ltrim是一个具有O(N)时间复杂度的操作，N是被删除的值的数量。从上面的例子来看，我们总是在插入了一个用户后再进行列表调整，实际上，其将具有O(1)的时间复杂度(因为N将永远等于1)的常数性能。
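The push-then-trim pattern can be sketched in plain Python. This is an emulation under stated assumptions, not Redis itself: the `ltrim` helper below is hypothetical and only handles non-negative indexes. It shows that the stop index is inclusive, so trimming to 0..49 keeps exactly 50 entries, while 0..50 would keep 51.

```python
def ltrim(values, start, stop):
    """Emulate LTRIM for non-negative indexes: keep values[start..stop], both ends inclusive."""
    return values[start:stop + 1]


newusers = []
for n in range(120):
    newusers.insert(0, f"user{n}")      # lpush: newest entry goes to the front
    newusers = ltrim(newusers, 0, 49)   # keep indexes 0..49 only

kept_with_stop_50 = len(ltrim(list(range(100)), 0, 50))
```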

这是我们第一次看到一个关键字的对应值索引另一个值。如果我们想要获取最近的10个用户的详细资料，我们可以运行下面的组合操作：

keys = redis.lrange('newusers', 0, 9)

redis.mget(*keys.map {|u| "users:#{u}"})

我们之前谈论过关于多次往返数据的模式，上面的两行Ruby代码为我们进行了很好的演示。

当然，对于存储和索引关键字的功能，并不是只有列表数据结构这种方式。值可以是任意的东西，你可以使用列表数据结构去存储日志，也可以用来跟踪用户浏览网站时的路径。如果你过往曾构建过游戏，你可能会使用列表数据结构去跟踪用户的排队活动。

集合(Sets)

集合数据结构常常被用来存储只能唯一存在的值，并提供了许多的基于集合的操作，例如并集。集合数据结构没有对值进行排序，但是其提供了高效的基于值的操作。使用集合数据结构的典型用例是朋友名单的实现：

sadd friends:leto ghanima paul chani jessica

sadd friends:duncan paul jessica alia

不管一个用户有多少个朋友，我们都能高效地(O(1)时间复杂度)识别出用户X是不是用户Y的朋友：

sismember friends:leto jessica

sismember friends:leto vladimir

而且，我们可以查看两个或更多的人是不是有共同的朋友：

sinter friends:leto friends:duncan

甚至可以在一个新的关键字里存储结果：

sinterstore friends:leto_duncan friends:leto friends:duncan

有时候需要对值的属性进行标记和跟踪处理，但不能通过简单的复制操作完成，集合数据结构是解决此类问题的最好方法之一。当然，对于那些需要运用集合操作的地方(例如交集和并集)，集合数据结构就是最好的选择。

分类集合(Sorted Sets)

最后也是最强大的数据结构是分类集合数据结构。如果说散列数据结构类似于字符串数据结构，主要区分是域(field)的概念;那么分类集合数据结构就类似于集合数据结构，主要区分是标记(score)的概念。标记提供了排序(sorting)和秩划分(ranking)的功能。如果我们想要一个秩分类的朋友名单，可以这样做：

zadd friends:duncan 70 ghanima 95 paul 95 chani 75 jessica 1 vladimir

对于duncan的朋友，要怎样计算出标记(score)为90或更高的人数?

zcount friends:duncan 90 100

如何获取chani在名单里的秩(rank)?

zrevrank friends:duncan chani

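(Translator's note: the syntax is ZRANK key member. By default, the sorted set stored at key orders its members by score in ascending order; the command returns the member's position in that ordering, which is what "rank" means here.)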

我们使用了zrevrank命令而不是zrank命令，这是因为Redis的默认排序是从低到高，但是在这个例子里我们的秩划分是从高到低。对于分类集合数据结构，最常见的应用案例是用来实现排行榜系统。事实上，对于一些基于整数排序，且能以标记(score)来进行有效操作的东西，使用分类集合数据结构来处理应该都是不错的选择。

小结

对于Redis的5种数据结构，我们进行了高层次的概述。一件有趣的事情是，相对于最初构建时的想法，你经常能用Redis创造出一些更具实效的事情。对于字符串数据结构和分类集合数据结构的使用，很有可能存在一些构建方法是还没有人想到的。当你理解了那些常用的应用案例后，你将发现Redis对于许多类型的问题，都是很理想的选择。还有，不要因为Redis展示了5种数据结构和相应的各种方法，就认为你必须要把所有的东西都用上。只使用一些命令去构建一个特性是很常见的。

第3章 - 使用数据结构

在上一章里，我们谈论了Redis的5种数据结构，对于一些可能的用途也给出了用例。现在是时候来看看一些更高级，但依然很常见的主题和设计模式。

大O表示法(Big O Notation)

在本书中，我们之前就已经看到过大O表示法，包括O(1)和O(N)的表示。大O表示法的惯常用途是，描述一些用于处理一定数量元素的行为的综合表现。在Redis里，对于一个要处理一定数量元素的命令，大O表示法让我们能了解该命令的大概运行速度。

在Redis的文档里，每一个命令的时间复杂度都用大O表示法进行了描述，还能知道各命令的具体性能会受什么因素影响。让我们来看看一些用例。

常数时间复杂度O(1)被认为是最快速的，无论我们是在处理5个元素还是5百万个元素，最终都能得到相同的性能。对于sismember命令，其作用是告诉我们一个值是否属于一个集合，时间复杂度为O(1)。sismember命令很强大，很大部分的原因是其高效的性能特征。许多Redis命令都具有O(1)的时间复杂度。

对数时间复杂度O(log(N))被认为是第二快速的，其通过使需扫描的区间不断皱缩来快速完成处理。使用这种“分而治之”的方式，大量的元素能在几个迭代过程里被快速分解完整。zadd命令的时间复杂度就是O(log(N))，其中N是在分类集合中的元素数量。

再下来就是线性时间复杂度O(N)，在一个表格的非索引列里进行查找就需要O(N)次操作。ltrim命令具有O(N)的时间复杂度，但是，在ltrim命令里，N不是列表所拥有的元素数量，而是被删除的元素数量。从一个具有百万元素的列表里用ltrim命令删除1个元素，要比从一个具有一千个元素的列表里用ltrim命令删除10个元素来的快速(实际上，两者很可能会是一样快，因为两个时间都非常的小)。

根据给定的最小和最大的值的标记，zremrangebyscore命令会在一个分类集合里进行删除元素操作，其时间复杂度是O(log(N)+M)。这看起来似乎有点儿杂乱，通过阅读文档可以知道，这里的N指的是在分类集合里的总元素数量，而M则是被删除的元素数量。可以看出，对于性能而言，被删除的元素数量很可能会比分类集合里的总元素数量更为重要。

(Translator's note: the syntax is ZREMRANGEBYSCORE key min max.)

对于sort命令，其时间复杂度为O(N+M*log(M))，我们将会在下一章谈论更多的相关细节。从sort命令的性能特征来看，可以说这是Redis里最复杂的一个命令。

还存在其他的时间复杂度描述，包括O(N^2)和O(C^N)。随着N的增大，其性能将急速下降。在Redis里，没有任何一个命令具有这些类型的时间复杂度。

值得指出的一点是，在Redis里，当我们发现一些操作具有O(N)的时间复杂度时，我们可能可以找到更为好的方法去处理。

(译注：对于Big O Notation，相信大家都非常的熟悉，虽然原文仅仅是对该表示法进行简单的介绍，但限于个人的算法知识和文笔水平实在有限，此小节的翻译让我头痛颇久，最终成果也确实难以让人满意，望见谅。)

仿多关键字查询(Pseudo Multi Key Queries)

时常，你会想通过不同的关键字去查询相同的值。例如，你会想通过电子邮件(当用户开始登录时)去获取用户的具体信息，或者通过用户id(在用户登录后)去获取。有一种很不实效的解决方法，其将用户对象分别放置到两个字符串值里去：

set users:leto@dune.gov "{id: 9001, email: 'leto@dune.gov', ...}"

set users:9001 "{id: 9001, email: 'leto@dune.gov', ...}"

这种方法很糟糕，如此不但会产生两倍数量的内存，而且这将会成为数据管理的恶梦。

如果Redis允许你将一个关键字链接到另一个的话，可能情况会好很多，可惜Redis并没有提供这样的功能(而且很可能永远都不会提供)。Redis发展到现在，其开发的首要目的是要保持代码和API的整洁简单，关键字链接功能的内部实现并不符合这个前提(对于关键字，我们还有很多相关方法没有谈论到)。其实，Redis已经提供了解决的方法：散列。

使用散列数据结构，我们可以摆脱重复的缠绕：

set users:9001 "{id: 9001, email: leto@dune.gov, ...}"

hset users:lookup:email leto@dune.gov 9001

我们所做的是，使用域来作为一个二级索引，然后去引用单个用户对象。要通过id来获取用户信息，我们可以使用一个普通的get命令：

get users:9001

而如果想通过电子邮箱来获取用户信息，我们可以使用hget命令再配合使用get命令(Ruby代码)：

id = redis.hget('users:lookup:email', 'leto@dune.gov')

user = redis.get("users:#{id}")

你很可能将会经常使用这类用法。在我看来，这就是散列真正耀眼的地方。在你了解这类用法之前，这可能不是一个明显的用例。
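The lookup pattern above can be sketched without a Redis server. In this Python sketch two dictionaries stand in for Redis string keys and hash keys; the helper names `add_user`, `user_by_id` and `user_by_email` are assumptions made for illustration. The point is that the user object is stored once, and the hash only maps an email to an id.

```python
import json

strings = {}   # stands in for Redis string keys (set / get)
hashes = {}    # stands in for Redis hash keys (hset / hget)


def add_user(user):
    # set users:<id> "<serialized user>"
    strings[f"users:{user['id']}"] = json.dumps(user)
    # hset users:lookup:email <email> <id>
    hashes.setdefault("users:lookup:email", {})[user["email"]] = user["id"]


def user_by_id(user_id):
    raw = strings.get(f"users:{user_id}")            # get users:<id>
    return json.loads(raw) if raw is not None else None


def user_by_email(email):
    user_id = hashes.get("users:lookup:email", {}).get(email)   # hget ...
    return user_by_id(user_id) if user_id is not None else None


add_user({"id": 9001, "email": "leto@dune.gov"})
```

The cost is one extra round trip for email lookups, and the index entry has to be maintained by hand when a user changes email or is deleted.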

引用和索引(References and Indexes)

我们已经看过几个关于值引用的用例，包括介绍列表数据结构时的用例，以及在上面使用散列数据结构来使查询更灵活一些。进行归纳后会发现，对于那些值与值间的索引和引用，我们都必须手动的去管理。诚实来讲，这确实会让人有点沮丧，尤其是当你想到那些引用相关的操作，如管理、更新和删除等，都必须手动的进行时。在Redis里，这个问题还没有很好的解决方法。

我们已经看到，集合数据结构很常被用来实现这类索引：

sadd friends:leto ghanima paul chani jessica

这个集合里的每一个成员都是一个Redis字符串数据结构的引用，而每一个引用的值则包含着用户对象的具体信息。那么如果chani改变了她的名字，或者删除了她的帐号，应该如何处理?从整个朋友圈的关系结构来看可能会更好理解，我们知道，chani也有她的朋友：

sadd friends_of:chani leto paul

如果你有什么待处理情况像上面那样，那在维护成本之外，还会有对于额外索引值的处理和存储空间的成本。这可能会令你感到有点退缩。在下一小节里，我们将会谈论减少使用额外数据交互的性能成本的一些方法(在第1章我们粗略地讨论了下)。

如果你确实在担忧着这些情况，其实，关系型数据库也有同样的开销。索引需要一定的存储空间，必须通过扫描或查找，然后才能找到相应的记录。其开销也是存在的，当然他们对此做了很多的优化工作，使之变得更为有效。

再次说明，需要在Redis里手动地管理引用确实是颇为棘手。但是，对于你关心的那些问题，包括性能或存储空间等，应该在经过测试后，才会有真正的理解。我想你会发现这不会是一个大问题。

数据交互和流水线(Round Trips and Pipelining)

我们已经提到过，与服务器频繁交互是Redis的一种常见模式。这类情况可能很常出现，为了使我们能获益更多，值得仔细去看看我们能利用哪些特性。

许多命令能接受一个或更多的参数，也有一种关联命令(sister-command)可以接受多个参数。例如早前我们看到过mget命令，接受多个关键字，然后返回值：

keys = redis.lrange('newusers', 0, 9)

redis.mget(*keys.map {|u| "users:#{u}"})

或者是sadd命令，能添加一个或多个成员到集合里：

sadd friends:vladimir piter

sadd friends:paul jessica leto "leto II" chani

Redis还支持流水线功能。通常情况下，当一个客户端发送请求到Redis后，在发送下一个请求之前必须等待Redis的答复。使用流水线功能，你可以发送多个请求，而不需要等待Redis响应。这不但减少了网络开销，还能获得性能上的显著提高。

值得一提的是，Redis会使用存储器去排列命令，因此批量执行命令是一个好主意。至于具体要多大的批量，将取决于你要使用什么命令(更明确来说，该参数有多大)。另一方面来看，如果你要执行的命令需要差不多50个字符的关键字，你大概可以对此进行数千或数万的批量操作。

对于不同的Redis载体，在流水线里运行命令的方式会有所差异。在Ruby里，你传递一个代码块到pipelined方法：

redis.pipelined do
  9001.times do
    redis.incr('powerlevel')
  end
end

正如你可能猜想到的，流水线功能可以实际地加速一连串命令的处理。

事务(Transactions)

每一个Redis命令都具有原子性，包括那些一次处理多项事情的命令。此外，对于使用多个命令，Redis支持事务功能。

你可能不知道，但Redis实际上是单线程运行的，这就是为什么每一个Redis命令都能够保证具有原子性。当一个命令在执行时，没有其他命令会运行(我们会在往后的章节里简略谈论一下Scaling)。在你考虑到一些命令去做多项事情时，这会特别的有用。例如：

incr命令实际上就是一个get命令然后紧随一个set命令。

getset命令设置一个新的值然后返回原始值。

setnx命令首先测试关键字是否存在，只有当关键字不存在时才设置值

虽然这些都很有用，但在实际开发时，往往会需要运行具有原子性的一组命令。若要这样做，首先要执行multi命令，紧随其后的是所有你想要执行的命令(作为事务的一部分)，最后执行exec命令去实际执行命令，或者使用discard命令放弃执行命令。Redis的事务功能保证了什么?

事务中的命令将会按顺序地被执行

事务中的命令将会如单个原子操作般被执行(没有其它的客户端命令会在中途被执行)

事务中的命令要么全部被执行，要么不会执行

你可以(也应该)在命令行界面对事务功能进行一下测试。还有一点要注意到，没有什么理由不能结合流水线功能和事务功能。

multi

hincrby groups:1percent balance -9000000000

hincrby groups:99percent balance 9000000000

exec

最后，Redis能让你指定一个关键字(或多个关键字)，当关键字有改变时，可以查看或者有条件地应用一个事务。这是用于当你需要获取值，且待运行的命令基于那些值时，所有都在一个事务里。对于上面展示的代码，我们不能去实现自己的incr命令，因为一旦exec命令被调用，他们会全部被执行在一块。我们不能这么做：

redis.multi()

current = redis.get('powerlevel')

redis.set('powerlevel', current.to_i + 1)

redis.exec()

(译注：虽然Redis是单线程运行的，但是我们可以同时运行多个Redis客户端进程，常见的并发问题还是会出现。像上面的代码，在get运行之后，set运行之前，powerlevel的值可能会被另一个Redis客户端给改变，从而造成错误。)

这些不是Redis的事务功能的工作。但是，如果我们增加一个watch到powerlevel，我们可以这样做：

redis.watch('powerlevel')

current = redis.get('powerlevel')

redis.multi()

redis.set('powerlevel', current.to_i + 1)

redis.exec()

在我们调用watch后，如果另一个客户端改变了powerlevel的值，我们的事务将会运行失败。如果没有客户端改变powerlevel的值，那么事务会继续工作。我们可以在一个循环里运行这些代码，直到其能正常工作。
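The retry loop mentioned above is a form of optimistic locking. The following Python sketch illustrates the control flow only; it does not use a Redis client. `VersionedStore` is a hypothetical in-memory stand-in for a single watched key, and the `interfere` hook simulates another client writing between the read and the write.

```python
class VersionedStore:
    """Tiny in-memory stand-in for one watched key (hypothetical, for illustration)."""

    def __init__(self):
        self.value = 0
        self.version = 0          # bumped on every successful write

    def read(self):
        # Plays the role of WATCH + GET: remember which version we saw.
        return self.value, self.version

    def write_if_unchanged(self, seen_version, new_value):
        # Plays the role of MULTI / SET / EXEC: fails if the key changed since read().
        if self.version != seen_version:
            return False
        self.value = new_value
        self.version += 1
        return True


def increment(store, interfere=None):
    """Retry until the write goes through; return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        current, seen = store.read()
        if interfere and attempts == 1:
            interfere(store)      # another client sneaks in a write
        if store.write_if_unchanged(seen, current + 1):
            return attempts


store = VersionedStore()
first = increment(store)
second = increment(store, interfere=lambda s: s.write_if_unchanged(s.version, 100))
```

The second call fails once, re-reads the value written by the other client, and succeeds on the retry, so no update is lost.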

关键字反模式(Keys Anti-Pattern)

在下一章中，我们将会谈论那些没有确切关联到数据结构的命令，其中的一些是管理或调试工具。然而有一个命令我想特别地在这里进行谈论：keys命令。这个命令需要一个模式，然后查找所有匹配的关键字。这个命令看起来很适合一些任务，但这不应该用在实际的产品代码里。为什么?因为这个命令通过线性扫描所有的关键字来进行匹配。或者，简单地说，这个命令太慢了。

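How do people end up using this command? Say you are building a bug-tracking service. Each account has an id, and you might store each bug in a string value under a key that looks like bug:account_id:bug_id. If you ever need to find all of an account's bugs (to display them, or to delete them when the user deletes the account), you might be tempted to use the keys command: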

keys bug:1233:*

更好的解决方法应该使用一个散列数据结构，就像我们可以使用散列数据结构来提供一种方法去展示二级索引，因此我们可以使用域来组织数据：

hset bugs:1233 1 "{id:1, account: 1233, subject: '...'}"

hset bugs:1233 2 "{id:2, account: 1233, subject: '...'}"

从一个帐号里获取所有的Bug标识，可以简单地调用hkeys bugs:1233。去删除一个指定的Bug，可以调用hdel bugs:1233 2。如果要删除了一个帐号，可以通过del bugs:1233把关键字删除掉。

小结

结合这一章以及前一章，希望能让你得到一些洞察力，了解如何使用Redis去支持(Power)实际项目。还有其他的模式可以让你去构建各种类型的东西，但真正的关键是要理解基本的数据结构。你将能领悟到，这些数据结构是如何能够实现你最初视角之外的东西。

第4章 - 超越数据结构

5种数据结构组成了Redis的基础，其他没有关联特定数据结构的命令也有很多。我们已经看过一些这样的命令：info, select, flushdb, multi, exec, discard, watch和keys。这一章将看看其他的一些重要命令。

使用期限(Expiration)

Redis允许你标记一个关键字的使用期限。你可以给予一个Unix时间戳形式(自1970年1月1日起)的绝对时间，或者一个基于秒的存活时间。这是一个基于关键字的命令，因此其不在乎关键字表示的是哪种类型的数据结构。

expire pages:about 30

expireat pages:about 1356933600

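The first command deletes the key (and its associated value) after 30 seconds. The second does the same at Unix time 1356933600, that is, 06:00 UTC on December 31, 2012 (midnight in a UTC-6 time zone).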
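Unix timestamps are easy to get wrong by a time zone, so it helps to convert them explicitly. A small Python sketch, using only the standard library, that decodes the timestamp passed to expireat and builds one for a chosen UTC moment (the 2013-01-01 target is just an example):

```python
from datetime import datetime, timezone

expire_at = 1356933600   # the timestamp passed to expireat in the text
when_utc = datetime.fromtimestamp(expire_at, tz=timezone.utc)

# Going the other way: build a timestamp for a chosen UTC moment.
new_year = int(datetime(2013, 1, 1, tzinfo=timezone.utc).timestamp())
```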

This makes Redis an ideal caching engine. The ttl command tells you how long a key still has to live, and the persist command removes a key's expiration.

ttl pages:about

persist pages:about

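Finally, there is a special string command, setex, which sets a string value and specifies its time to live in a single atomic command (more a convenience than anything else).

setex pages:about 30 '<h1>about us</h1>....'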

发布和订阅(Publication and Subscriptions)

Redis的列表数据结构有blpop和brpop命令，能从列表里返回且删除第一个(或最后一个)元素，或者被堵塞，直到有一个元素可供操作。这可以用来实现一个简单的队列。

(译注：对于blpop和brpop命令，如果列表里没有关键字可供操作，连接将被堵塞，直到有另外的Redis客户端使用lpush或rpush命令推入关键字为止。)

此外，Redis对于消息发布和频道订阅有着一流的支持。你可以打开第二个redis-cli窗口，去尝试一下这些功能。在第一个窗口里订阅一个频道(我们会称它为warnings)：

subscribe warnings

其将会答复你订阅的信息。现在，在另一个窗口，发布一条消息到warnings频道：

publish warnings "it's over 9000!"

如果你回到第一个窗口，你应该已经接收到warnings频道发来的消息。

你可以订阅多个频道(subscribe channel1 channel2 ...)，订阅一组基于模式的频道(psubscribe warnings:*)，以及使用unsubscribe和punsubscribe命令停止监听一个或多个频道，或一个频道模式。

最后，可以注意到publish命令的返回值是1，这指出了接收到消息的客户端数量。

监控和延迟日志(Monitor and Slow Log)

monitor命令可以让你查看Redis正在做什么。这是一个优秀的调试工具，能让你了解你的程序如何与Redis进行交互。在两个redis-cli窗口中选一个(如果其中一个还处于订阅状态，你可以使用unsubscribe命令退订，或者直接关掉窗口再重新打开一个新窗口)键入monitor命令。在另一个窗口，执行任何其他类型的命令(例如get或set命令)。在第一个窗口里，你应该可以看到这些命令，包括他们的参数。

在实际生产环境里，你应该谨慎运行monitor命令，这真的仅仅就是一个很有用的调试和开发工具。除此之外，没有更多要说的了。

随同monitor命令一起，Redis拥有一个slowlog命令，这是一个优秀的性能剖析工具。其会记录执行时间超过一定数量微秒的命令。在下一章节，我们会简略地涉及如何配置Redis，现在你可以按下面的输入配置Redis去记录所有的命令：

config set slowlog-log-slower-than 0

然后，执行一些命令。最后，你可以检索到所有日志，或者检索最近的那些日志：

slowlog get

slowlog get 10

通过键入slowlog len，你可以获取延迟日志里的日志数量。

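Each logged entry has four fields:

An auto-incrementing id

A Unix timestamp for when the command started running

The total time the command took to run, in microseconds

The command and its arguments

The slow log is kept in memory, so running it in production, even with a low threshold, should not be a problem. By default it tracks the 1024 most recent entries.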

排序(Sort)

sort命令是Redis最强大的命令之一。它让你可以在一个列表、集合或者分类集合里对值进行排序(分类集合是通过标记来进行排序，而不是集合里的成员)。下面是一个sort命令的简单用例：

rpush users:leto:guesses 5 9 10 2 4 10 19 2

sort users:leto:guesses

这将返回进行升序排序后的值。这里有一个更高级的例子：

sadd friends:ghanima leto paul chani jessica alia duncan

sort friends:ghanima limit 0 3 desc alpha

上面的命令向我们展示了，如何对已排序的记录进行分页(通过limit)，如何返回降序排序的结果(通过desc)，以及如何用字典序排序代替数值序排序(通过alpha)。

sort命令的真正力量是其基于引用对象来进行排序的能力。早先的时候，我们说明了列表、集合和分类集合很常被用于引用其他的Redis对象，sort命令能够解引用这些关系，而且通过潜在值来进行排序。例如，假设我们有一个Bug追踪器能让用户看到各类已存在问题。我们可能使用一个集合数据结构去追踪正在被监视的问题：

sadd watch:leto 12339 1382 338 9338

你可能会有强烈的感觉，想要通过id来排序这些问题(默认的排序就是这样的)，但是，我们更可能是通过问题的严重性来对这些问题进行排序。为此，我们要告诉Redis将使用什么模式来进行排序。首先，为了可以看到一个有意义的结果，让我们添加多一点数据：

set severity:12339 3

set severity:1382 2

set severity:338 5

set severity:9338 4

要通过问题的严重性来降序排序这些Bug，你可以这样做：

sort watch:leto by severity:* desc

Redis将会用存储在列表(集合或分类集合)中的值去替代模式中的*(通过by)。这会创建出关键字名字，Redis将通过查询其实际值来排序。
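The substitution step can be emulated in a few lines of Python. This is a sketch of the idea, not of Redis internals: the `store` dictionary stands in for the severity:* string keys defined above, and `sort_by` is a hypothetical helper that assumes every generated key exists.

```python
store = {
    "severity:12339": 3,
    "severity:1382": 2,
    "severity:338": 5,
    "severity:9338": 4,
}
watched = {12339, 1382, 338, 9338}      # sadd watch:leto 12339 1382 338 9338


def sort_by(members, pattern, desc=False):
    """Replace * in the pattern with each member and sort by the value found there."""
    return sorted(
        members,
        key=lambda m: store[pattern.replace("*", str(m))],
        reverse=desc,
    )


by_severity = sort_by(watched, "severity:*", desc=True)   # [338, 9338, 12339, 1382]
```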

在Redis里，虽然你可以有成千上万个关键字，类似上面展示的关系还是会引起一些混乱。幸好，sort命令也可以工作在散列数据结构及其相关域里。相对于拥有大量的高层次关键字，你可以利用散列：

hset bug:12339 severity 3

hset bug:12339 priority 1

hset bug:12339 details "{id: 12339, ....}"

hset bug:1382 severity 2

hset bug:1382 priority 2

hset bug:1382 details "{id: 1382, ....}"

hset bug:338 severity 5

hset bug:338 priority 3

hset bug:338 details "{id: 338, ....}"

hset bug:9338 severity 4

hset bug:9338 priority 2

hset bug:9338 details "{id: 9338, ....}"

所有的事情不仅变得更为容易管理，而且我们能通过severity或priority来进行排序，还可以告诉sort命令具体要检索出哪一个域的数据：

sort watch:leto by bug:*->priority get bug:*->details

相同的值替代出现了，但Redis还能识别->符号，用它来查看散列中指定的域。里面还包括了get参数，这里也会进行值替代和域查看，从而检索出Bug的细节(details域的数据)。

对于太大的集合，sort命令的执行可能会变得很慢。好消息是，sort命令的输出可以被存储起来：

sort watch:leto by bug:*->priority get bug:*->details store watch_by_priority:leto

使用我们已经看过的expiration命令，再结合sort命令的store能力，这是一个美妙的组合。

小结

这一章主要关注那些非特定数据结构关联的命令。和其他事情一样，它们的使用依情况而定。构建一个程序或特性时，可能不会用到使用期限、发布和订阅或者排序等功能。但知道这些功能的存在是很好的。而且，我们也只接触到了一些命令。还有更多的命令，当你消化理解完这本书后，非常值得去浏览一下完整的命令列表。

第5章 - 管理

在最后一章里，我们将集中谈论Redis运行中的一些管理方面内容。这是一个不完整的Redis管理指南，我们将会回答一些基本的问题，初接触Redis的新用户可能会很感兴趣。

配置(Configuration)

当你第一次运行Redis的服务器，它会向你显示一个警告，指redis.conf文件没有被找到。这个文件可以被用来配置Redis的各个方面。一个充分定义(well-documented)的redis.conf文件对各个版本的Redis都有效。范例文件包含了默认的配置选项，因此，对于想要了解设置在干什么，或默认设置是什么，都会很有用。你可以在https://github.com/antirez/redis/raw/2.4.6/redis.conf找到这个文件。

这个配置文件针对的是Redis 2.4.6，你应该用你的版本号替代上面URL里的"2.4.6"。运行info命令，其显示的第一个值就是Redis的版本号。

因为这个文件已经是充分定义(well-documented)，我们就不去再进行设置了。

除了通过redis.conf文件来配置Redis，config set命令可以用来对个别值进行设置。实际上，在将slowlog-log-slower-than设置为0时，我们就已经使用过这个命令了。

还有一个config get命令能显示一个设置值。这个命令支持模式匹配，因此如果我们想要显示关联于日志(logging)的所有设置，我们可以这样做：

config get *log*

验证(Authentication)

通过设置requirepass(使用config set命令或redis.conf文件)，可以让Redis需要一个密码验证。当requirepass被设置了一个值(就是待用的密码)，客户端将需要执行一个auth password命令。

一旦一个客户端通过了验证，就可以在任意数据库里执行任何一条命令，包括flushall命令，这将会清除掉每一个数据库里的所有关键字。通过配置，你可以重命名一些重要命令为混乱的字符串，从而获得一些安全性。

rename-command CONFIG 5ec4db169f9d4dddacbfb0c26ea7e5ef

rename-command FLUSHALL 1041285018a942a4922cbf76623b741e

或者，你可以将新名字设置为一个空字符串，从而禁用掉一个命令。

大小限制(Size Limitations)

当你开始使用Redis，你可能会想知道，我能使用多少个关键字?还可能想知道，一个散列数据结构能有多少个域(尤其是当你用它来组织数据时)，或者是，一个列表数据结构或集合数据结构能有多少个元素?对于每一个实例，实际限制都能达到亿万级别(hundreds of millions)。

复制(Replication)

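Redis supports replication, which means that as you write to one Redis instance (the master), one or more other instances (the slaves) are kept up to date by the master. To configure a slave, set slaveof in the configuration file or use the slaveof command. Any instance without this setting is, or can be, a master.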

为了更好保护你的数据，复制功能拷贝数据到不同的服务器。复制功能还能用于改善性能，因为读取请求可以被发送到Slave实例。他们可能会返回一些稍微滞后的数据，但对于大多数程序来说，这是一个值得做的折衷。

遗憾的是，Redis的复制功能还没有提供自动故障恢复。如果Master实例崩溃了，一个Slave实例需要手动的进行升级。如果你想使用Redis去达到某种高可用性，对于使用心跳监控(heartbeat monitoring)和脚本自动开关(scripts to automate the switch)的传统高可用性工具来说，现在还是一个棘手的难题。

备份文件(Backups)

备份Redis非常简单，你可以将Redis的快照(snapshot)拷贝到任何地方，包括S3、FTP等。默认情况下，Redis会把快照存储为一个名为dump.rdb的文件。在任何时候，你都可以对这个文件执行scp、ftp或cp等常用命令。

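A common setup disables both snapshotting and the append-only file (aof) on the master and lets a slave take care of backups. This reduces the load on the master. It also lets you set more aggressive saving parameters on the slave without hurting overall system responsiveness.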

缩放和Redis集群(Scaling and Redis Cluster)

复制功能(Replication)是一个成长中的网站可以利用的第一个工具。有一些命令会比另外一些来的昂贵(例如sort命令)，将这些运行载荷转移到一个Slave实例里，可以保持整体系统对于查询的快速响应。

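Beyond that, truly scaling Redis comes down to distributing your keys across multiple Redis instances (which can run on the same box; remember, Redis is single-threaded). Over time, this is something you will need to pay particular attention to (although many Redis drivers provide consistent-hashing algorithms). Horizontal distribution of data is beyond the scope of this book. You probably won't need to worry about it, but whichever solution you use, it is something you should be aware of.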

好消息是，这些工作都可在Redis集群下进行。不仅提供水平缩放(包括均衡)，为了高可用性，还提供了自动故障恢复。

高可用性和缩放是可以达到的，只要你愿意为此付出时间和精力，Redis集群也使事情变得简单多了。

小结

在过去的一段时间里，已经有许多的计划和网站使用了Redis，毫无疑问，Redis已经可以应用于实际生产中了。然而，一些工具还是不够成熟，尤其是一些安全性和可用性相关的工具。对于Redis集群，我们希望很快就能看到其实现，这应该能为一些现有的管理挑战提供处理帮忙。

总结

在许多方面，Redis体现了一种简易的数据处理方式，其剥离掉了大部分的复杂性和抽象，并可有效的在不同系统里运行。不少情况下，选择Redis不是最佳的选择。在另一些情况里，Redis就像是为你的数据提供了特别定制的解决方案。

最终，回到我最开始所说的：Redis很容易学习。现在有许多的新技术，很难弄清楚哪些才真正值得我们花时间去学习。如果你从实际好处来考虑，Redis提供了他的简单性。我坚信，对于你和你的团队，学习Redis是最好的技术投资之一。

↧

SuperMap iDesktop 达梦数据库型的数据源创建

September 1, 2016, 2:16 am

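This article describes how to create a data source backed by a Dameng (DM) database. iDesktop currently supports DM versions 7.1.4 and 7.1.5; the example here uses the 64-bit DM 7.1.5 and the 64-bit iDesktop 7.1.2.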

1. 安装数据库：达梦数据库的安装过程很简单，按安装向导流程化的安装即可。如图一至图三。

注：达梦数据库的服务器和客户端是在同一个安装程序里，如果只需要服务器的话，就不勾选客户端，如果只需要客户端的话，就不勾选服务器和数据库服务即可。

安装完成后就初始化数据库。

2. 配置数据库：达梦数据库的实例化也比较简单，因为它的组织和 Oracle 数据库类似，它的参数配置、创建数据库等过程都和 Oracle 数据库类似，它的实例、表空间、用户、数据文件等概念都和 oracle 数据库的类似。如图四至图十二。

注：安装到这一步，会提示防火墙阻止，勾选上所有网络，然后点击允许访问。

注：这里的字符集设置的是GB18030。如果使用过程中遇到数据表里有产生乱码现象，那基本都是字符集不一致导致的，这时请修改字符集类型。

注：设置用户名密码，这个需要记住。

注：数据库创建完成后。打开DM服务查看器，我们可以看到达梦数据库的实例服务已经在运行了，那么我们接下来就可以创建达梦数据库用户了。

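3. Create a DM database user: this is also simple. Select MAIN as the tablespace, grant the user all role privileges as shown in the figure, then click Next to finish.

4. Create the DM data source: before creating it, add the full path of the bin folder under the DM installation directory to the system Path environment variable.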

注：如果DM数据库安装在本机，那么服务器名称就填“localhost”，如果DM数据库不在本机，那么就填服务器的IP地址。带有DM数据库引擎的iDesktop需要单独申请，或者通过扩展开发写这个功能(组件已提供该引擎接口)。

↧

[YARN] NodeManager因为ContainerMetric导致OOM

September 1, 2016, 2:15 am

现象描述

之前hadoop2.7.1的集群，经常运行一段时间后会触发OOM，导致上面的map需要重跑，能想到的一种方案是调整GC参数，利用GC回收器对内存进行回收，另外一种情况则感觉代码可能处理有问题。

关键日志：

2016-07-05 09:35:40,907 WARN org.apache.hadoop.ipc.Client: Unexpected error reading responses on connection Thread[IPC Client (2069725879) connection to rmhost:8031 from yarn,5,main]
java.lang.OutOfMemoryError: Java heap space
    at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:120)
    at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:68)

后来调整GC参数，追踪到底哪里出了问题，以下是参数参考

YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"

Later analysis of the heap dump confirmed that the problem was indeed in the code: containerMetric objects occupied most of the memory.

后来社区有patch可以修复 HADOOP-13362

参考工具

IBM分析工具

↧

Merkle Trees

September 1, 2016, 2:14 am

30 Aug 2016

A Merkle Tree is a data structure where every non-leaf node contains the hash of the labels of its child nodes, and the leaves have their own values hashed. Because of this characteristic, Merkle Trees are used to verify that two or more parties have the same data without exchanging the entire data collection. The following figure shows an example of a Merkle Tree:

                                           ┌───────────────┐
                                           │     Root      │
                         ┌─────────────────│Hash(AA1 + BB2)│─────────────────┐
                         │                 └───────────────┘                 │
                         │                                                   │
                  ┌─────────────┐                                     ┌─────────────┐
                  │     AA1     │                                     │     BB2     │
            ┌─────│Hash(A1 + B1)│─────┐                         ┌─────│Hash(C1 + D1)│─────┐
            │     └─────────────┘     │                         │     └─────────────┘     │
            │                         │                         │                         │
      ┌───────────┐             ┌───────────┐             ┌───────────┐             ┌───────────┐
      │    A1     │             │    B1     │             │    C1     │             │    D1     │
    ┌─│Hash(A + B)│─┐         ┌─│Hash(C + D)│─┐         ┌─│Hash(E + F)│─┐         ┌─│Hash(G + H)│─┐
    │ └───────────┘ │         │ └───────────┘ │         │ └───────────┘ │         │ └───────────┘ │
    │               │         │               │         │               │         │               │
┌───────┐       ┌───────┐ ┌───────┐       ┌───────┐ ┌───────┐       ┌───────┐ ┌───────┐       ┌───────┐
│   A   │       │   B   │ │   C   │       │   D   │ │   E   │       │   F   │ │   G   │       │   H   │
│Hash(A)│       │Hash(B)│ │Hash(C)│       │Hash(D)│ │Hash(E)│       │Hash(F)│ │Hash(G)│       │Hash(H)│
└───────┘       └───────┘ └───────┘       └───────┘ └───────┘       └───────┘ └───────┘       └───────┘

Trees created by different machines can be compared as a way to guarantee that each machine has the same data as the other. This is exactly what Amazon's Dynamo [1] does in its anti-entropy phase: nodes exchange their Merkle tree roots and compare the root hashes to see if they have the same data; if that's not the case they proceed by exchanging the inner nodes to see which branches are different and then synchronize accordingly.

Another system that uses Merkle trees is Bitcoin [2]. In Bitcoin, each block includes the Merkle root of all transactions in that block. And somewhat similar to what happens in Dynamo, the Merkle tree in Bitcoin can be used to verify that a given transaction is present in the block.

All this reading led me to start building a library for creating and manipulating Merkle Trees in Erlang as a proof of concept. So far, it provides a way to create a binary Merkle Tree from {Key, Value} pairs, find which keys a given tree has, and compare two trees to find which keys are different.
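To illustrate the compare-by-descending idea described above, here is a minimal Python sketch. It is not the Erlang library mentioned in this post and makes its own assumptions: SHA-256 as the hash function, plain byte strings as leaf values, an odd node paired with itself, and both trees built over the same number of leaves. The function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(values):
    """Return every level of the tree, leaf hashes first and the root level last."""
    if not values:
        raise ValueError("need at least one value")
    level = [h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                       # odd count: pair the last node with itself
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(a_levels, b_levels):
    """Walk down from the root, following only the branches whose hashes differ."""
    top = len(a_levels) - 1
    suspects = [0] if a_levels[top][0] != b_levels[top][0] else []
    for depth in range(top - 1, -1, -1):
        nxt = []
        for parent in suspects:
            for child in (2 * parent, 2 * parent + 1):
                if (child < len(a_levels[depth])
                        and a_levels[depth][child] != b_levels[depth][child]):
                    nxt.append(child)
        suspects = nxt
    return suspects   # indexes of the leaves that differ


ours = merkle_levels([b"A", b"B", b"C", b"D", b"E", b"F", b"G", b"H"])
theirs = merkle_levels([b"A", b"B", b"C", b"X", b"E", b"F", b"G", b"H"])
changed = diff(ours, theirs)
```

Equal roots mean the two sides hold the same data and nothing else needs to be exchanged; when they differ, only the hashes along the differing branches are compared.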

References

[1] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s Highly Available Key-value Store, 2007, pp. 1-16.

[2] S. Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System, pp. 1-9, Mar. 2009.

↧

MongoDB schema design: There Is Always A Schema

September 1, 2016, 2:13 am

MongoDB Schema Design

When MongoDB was introduced a few years ago, one of the important features touted was the ability to be “schemaless”. What does this mean for your documents?

MongoDB schema design does not enforce any schema on the documents stored in a collection. MongoDB essentially stores JSON documents. Each document can contain any structure that you want. Consider some examples from our “contacts” collection below. Here is one document that you can store:

{
'name':'user1',
'address':' 1 mountain view',
'phone': '123-324-3308',
'SSN':'123-45-7891'
}

Now the second document stored in the collection can be of this format:

{
'name': ' user2',
'employeeid': 546789
}

It’s pretty cool that you can store both these documents in the same collection. The problem, however, starts when you need to retrieve these documents from the collection. How do you tell if the retrieved document contains format 1 or format 2? You can check if the retrieved document contains the ‘SSN’ field and then make a decision. Another option is to store the type of the document in the document itself:

{
'type': xxx,
'name': ....
...
}

In both these cases what you have achieved is moving the schema enforcement from the database to the application -

There is always a schema, it is just a question of where it is implemented.
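Both options boil down to a small piece of schema logic living in application code. A minimal Python sketch of that logic, using plain dictionaries in place of documents read from MongoDB; the function name `contact_format` and the labels "person" and "employee" are assumptions made for illustration:

```python
def contact_format(doc):
    """Classify a 'contacts' document by the fields it carries (illustrative rules)."""
    if "type" in doc:                 # option 2: explicit type stored in the document
        return doc["type"]
    if "SSN" in doc:                  # option 1: infer the format from a marker field
        return "person"
    if "employeeid" in doc:
        return "employee"
    raise ValueError("document matches no known contact format")


user1 = {"name": "user1", "address": "1 mountain view",
         "phone": "123-324-3308", "SSN": "123-45-7891"}
user2 = {"name": "user2", "employeeid": 546789}

formats = [contact_format(user1), contact_format(user2)]
```

Every code path that reads the collection has to agree on these rules, which is exactly the schema enforcement the database is not doing.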

Having the right indexes alleviates the problem to a certain extent. If a majority of your queries are by ‘employeeid’, you know that the retrieved document is always of the second format. However, the rest of your code that does not use this index will still have the problem mentioned above. Also, if you are using an ODM like mongoose, it already enforces a schema for you on top of MongoDB.

There are several applications that benefit from this flexibility. One scenario that comes to mind is the case of a schema where there are a number of optional fields/columns. In MongoDB, there is no penalty for leaving some fields out: each document needs to contain only the fields that it uses.

Document Validation

Starting with version 3.2, MongoDB supports schema validation using the “validator” construct. This provides several levels of validation, so you can choose the level that works for you. If you don’t use a validator, the default behaviour is the previous schemaless behaviour. Typically you will create the validators at the time of collection creation:

db.createCollection( "contacts",
   { validator: { $or:
      [
         { employeeid: { $exists: true } },
         { SSN: { $exists: true } }
      ]
   }
} )

Existing Collections

Existing collections can be updated using the ‘collMod’ command:

db.runCommand( {
   collMod: "contacts",
   validator: { $or: [ { employeeid: { $exists: true } }, { SSN: { $exists: true } } ] }
} )

Validation Level

MongoDB supports the concept of a ‘validationLevel’. The default validation level is ‘strict’, which means that the rules are applied to all inserts and updates, and a write fails if the document does not meet the validation criteria. If the validation level is ‘moderate’, the rules are applied to inserts and to updates of existing documents that already meet the validation criteria; updates to existing documents that don’t meet the criteria are not validated. While convenient, the ‘moderate’ validation level can get you into trouble down the line, so it needs to be used with care.

Validation Action

By default, the validation action is ‘error’: if your document fails validation, the insert or update fails. However, you can also set the validation action to ‘warn’, which logs the schema violation but does not fail the write.

What schema design examples would help you on your next project? Let us know!

↧

Extending sparklyr to Compute Cost for K-means on YARN Cluster with Spark ML Lib ...

September 1, 2016, 5:16 am

Machine and statistical learning wizards are becoming more eager to perform analysis with the Spark ML library whenever possible. It’s trendy, posh, spicy and gives the feeling of doing state-of-the-art machine learning and being up to date with the newest computational trends. It is even more sexy and powerful when computations can be performed on an extraordinarily enormous computation cluster: let’s say 100 machines on a YARN Hadoop cluster makes you the real data cruncher! In this post I present the sparklyr package (by RStudio), the connector that will transform you from a regular R user into the supa! data scientist who can invoke Scala code to run machine learning algorithms on a YARN cluster straight from RStudio! Moreover, I present how I have extended the interface to the K-means procedure, so that it is now also possible to compute the cost for that model, which might be beneficial in determining the number of clusters in segmentation problems. Thought about learning Scala? Leave it and use sparklyr!

sparklyr basics

dplyr and DBI interface on Spark

Running Spark ML Machine Learning K-means Algorithm from R

If you don’t know much about Spark yet, you can read my April post Answers to FAQ about SparkR for R users where I explained how could we use SparkR package that is distributed with Spark. Many things (code) might have changed since that time, due to the rapid development caused by great popularity of Spark. Now we can use version 2.0.0 of Spark. If you are migrating from previous versions I suggest you should look at Migration Guide Upgrading From SparkR 1.6.x to 2.0 .

sparklyr basics

This package is based on the sparkapi package, which makes it possible to run Spark applications locally or on a YARN cluster straight from R. It translates R code to bash invocation of spark-shell. Its biggest advantage is the dplyr interface for working with Spark Data Frames (which might be Hive tables) and the possibility of invoking algorithms from the Spark ML library.

Installation of sparklyr, then Spark itself and simple application initiation is described by this code

library(devtools)
install_github('rstudio/sparklyr')
library(sparklyr)
spark_install(version = "2.0.0")
sc <- spark_connect(master = "yarn",
  config = list(
    default = list(
      spark.submit.deployMode = "client",
      spark.executor.instances = 20,
      spark.executor.memory = "2G",
      spark.executor.cores = 4,
      spark.driver.memory = "4G")))

You don’t have to specify the config yourself, but if you want to, remember that you can also specify parameters for the Spark application with config.yml files, so that you can benefit from multiple profiles (development, production). In version 2.0.0 it is preferred to name the master yarn instead of yarn-client and to pass the deployMode parameter, which is different from version 1.6.x. All available parameters can be found on the Running Spark on YARN documentation page.

dplyr and DBI interface on Spark

When connecting to YARN, it is most probable that you would like to use data tables that are stored on Hive. Remember that

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

where conf/ is set as HADOOP_CONF_DIR . Read more about using Hive tables from Spark

If everything is set up and the application runs properly, you can use the dplyr interface to get lazy evaluation for data manipulations. Data are stored on Hive, the Spark application runs on the YARN cluster, and the code is invoked from R in the simple language of data transformations (dplyr): all thanks to the sparklyr team, great job! An easy example is below

library(dplyr)
# give the list of tables
src_tbls(sc)
# copies iris from R to Hive
iris_tbl <- copy_to(sc, iris, "iris")
# create a hook for data stored on Hive
data_tbl <- tbl(sc, "table_name")
data_tbl2 <- tbl(sc, sql("SELECT * from table_name"))

You can also perform any operation on datasets used by Spark

iris_tbl %>% select(Petal_Length, Petal_Width) %>% top_n(40, Petal_Width) %>% arrange(Petal_Length)

Note that the original dots in the iris column names have been translated to _ .

This package also provides interface for functions defined in DBI package

library(DBI)
dbListTables(sc)
dbGetQuery(sc, "use database_name")
data_tbl3 <- dbGetQuery(sc, "SELECT * from table_name")
dbListFields(sc, data_tbl3)

Running Spark ML Machine Learning K-means Algorithm from R

The basic example on how sparklyr invokes Scala code from Spark ML will be presented on K-means algorithm.

If you check the code of the sparklyr::ml_kmeans function you will see that, for an input tbl_spark object named x and a character vector containing the features’ names ( features ),

envir <- new.env(parent = emptyenv())
df <- spark_dataframe(x)
sc <- spark_connection(df)
df <- ml_prepare_features(df, features)
tdf <- ml_prepare_dataframe(df, features, ml.options = ml.options, envir = envir)

sparklyr ensures that you have proper connection to spark data frame and prepares features in convenient form and naming convention. At the end it prepares a Spark DataFrame for Spark ML routines.

This is done in a new environment, so that we can store arguments for future ML algorithm and the model itself in its own environment. This is safe and clean solution. You can construct a simple model calling a Spark ML class like this

envir$model <- "org.apache.spark.ml.clustering.KMeans"
kmeans <- invoke_new(sc, envir$model)

which invokes new object of class KMeans on which we can invoke parameters setters to change default parameters like this

model <- kmeans %>%
  invoke("setK", centers) %>%
  invoke("setMaxIter", iter.max) %>%
  invoke("setTol", tolerance) %>%
  invoke("setFeaturesCol", envir$features)
# features were set in ml_prepare_dataframe

For an existing object of KMeans class we can invoke its method called fit that is responsible for starting the K-means clustering algorithm

fit <- model %>% invoke("fit", tdf)

which returns new object on which we can compute, e.g centers of outputted clustering

kmmCenters <- invoke(fit, "clusterCenters")

or the Within Set Sum of Squared Errors (called Cost) (which is my small contribution, #173 )

kmmCost <- invoke(fit, "computeCost", tdf)

This sometimes helps to decide how many clusters to specify for a clustering problem, and it is presented in the print method for the ml_model_kmeans object

iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3, compute.cost = TRUE) %>%
  print()

K-means clustering with 3 clusters

Cluster centers:
  Petal_Width Petal_Length
1    1.359259     4.292593
2    2.047826     5.626087
3    0.246000     1.462000

Within Set Sum of Squared Errors = 31.41289
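For readers unfamiliar with the cost reported above, the Within Set Sum of Squared Errors is the sum, over all points, of the squared distance from each point to its nearest cluster center. The Python sketch below computes it directly on a made-up toy data set; it is an illustration of the definition, not the Spark ML implementation, and the function name `wssse` is an assumption.

```python
def wssse(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Toy data: two obvious groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = wssse(points, [(5.0, 1.0)])                 # 4 * (25 + 1) = 104
two_clusters = wssse(points, [(0.0, 1.0), (10.0, 1.0)])   # 4 * 1 = 4
```

The sharp drop from one to two clusters, followed by little improvement for more clusters, is the kind of pattern one looks for when choosing the number of clusters.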

All of that can be better understood if we have a look at the Spark ML documentation for KMeans (be careful not to confuse it with Spark MLlib, where methods and parameters have different names than those in Spark ML). This enabled me to provide a simple update for ml_kmeans() ( #179 ) so that we can specify the tol (tolerance) parameter in ml_kmeans() to support tolerance of convergence.

↧

缓存工厂之 Redis 缓存

September 1, 2016, 5:15 am

来源：神牛步行3

链接：cnblogs.com/wangrudong003/p/5785116.html#undefined

正式分享今天的文章吧：

搭建Redis服务端，并用客户端连接

封装缓存父类，定义Get，Set等常用方法

定义RedisCache缓存类，执行Redis的Get，Set方法

构造出缓存工厂调用方法

下面一步一个脚印的来分享：

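Setting up the Redis server and connecting with a client

First, download the installation files from https://github.com/dmajkic/redis/downloads. The version I use here is redis-2.4.5-win32-win64, which contains both 32-bit and 64-bit executables; my server is 64-bit. Below are a screenshot and notes on some of the programs used:

Now we can simply double-click redis-server.exe to open the Redis service window (you can also download a Windows service wrapper and run Redis as a Windows service, so that Redis does not become unreachable every time the black console window is closed). Once running, it looks like this:

The information in the red box indicates success. The Redis server listens on port 6379 by default; to change the port or other settings, edit the redis.conf configuration file. The configuration options are described at http://www.shouce.ren/api/view/a/6231

Next, open a client connection to the server. Go back to the directory that contains the 64bit folder, hover over the 64bit folder, hold down the Shift key, right-click, and choose "Open command window here" to jump straight to a cmd prompt in that folder (this is the Windows procedure and differs on other operating systems; there are other ways in that I won't cover, because I find this the quickest). Then type redis-cli.exe -h localhost -p 6379 in the command window and press Enter to reach the server. The result:

Now the server window:

That's right, the client is now connected to the server, and we can run simple set and get commands in the client:

If the client needs to reach a remote Redis server, just replace localhost with a reachable IP address. If you also need a password or other settings, see the link above.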

Wrapping a cache base class with common methods such as Get and Set

First, the base class code:

using System;

public class BaseCache : IDisposable
{
    protected string def_ip = string.Empty;
    protected int def_port = 0;
    protected string def_password = string.Empty;

    public BaseCache()
    {
    }

    public virtual void InitCache(string ip = "", int port = 0, string password = "")
    {
    }

    public virtual bool SetCache<T>(string key, T t, int timeOutMinute = 10) where T : class, new()
    {
        return false;
    }

    public virtual T GetCache<T>(string key) where T : class, new()
    {
        return default(T);
    }

    public virtual bool Remove(string key)
    {
        return false;
    }

    public virtual bool FlushAll()
    {
        return false;
    }

    public virtual bool Any(string key)
    {
        return false;
    }

    // Subclasses override this to release the resources they hold
    public virtual void Dispose(bool isfalse)
    {
        if (isfalse) { }
    }

    // Manual release
    public void Dispose()
    {
        this.Dispose(true);
        // Cleanup is already done, so ask the GC not to run the finalizer
        GC.SuppressFinalize(this);
    }
}

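The methods defined here do not carry many comments; I think the method names speak for themselves. The base class implements IDisposable; its Dispose() is mainly used to release resources, and it also defines a custom public virtual void Dispose(bool isfalse) method.

One line there is GC.SuppressFinalize(this); according to the official documentation, it asks the runtime not to call the object's finalizer, which is appropriate once the resources have been released manually. There is nothing else to it, so let's continue.

Defining the RedisCache class and executing the Redis Get and Set methods

First, we define the classes RedisCache and MemcachedCache (operations on memcache are not implemented yet), both inheriting from BaseCache, and override the Set and Get methods as in the following code: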

/// <summary>
/// Redis cache
/// </summary>
public class RedisCache : BaseCache
{
    public RedisClient redis = null;

    public RedisCache()
    {
        // Default connection settings (hard-coded here; they could be read from a config file)
        def_ip = "127.0.0.1";
        def_port = 6379;
        def_password = "";
    }

    #region Redis cache

    public override void InitCache(string ip = "", int port = 0, string password = "")
    {
        if (redis == null)
        {
            // Fall back to the defaults for any argument that was not supplied
            ip = string.IsNullOrEmpty(ip) ? def_ip : ip;
            port = port == 0 ? def_port : port;
            password = string.IsNullOrEmpty(password) ? def_password : password;

            redis = new RedisClient(ip, port, password);
        }
    }

    public override bool SetCache<T>(string key, T t, int timeOutMinute = 10)
    {
        var isfalse = false;
        try
        {
            if (string.IsNullOrEmpty(key)) { return isfalse; }

            InitCache();
            isfalse = redis.Set<T>(key, t, TimeSpan.FromMinutes(timeOutMinute));
        }
        catch (Exception ex)
        {
            // Swallowed: failure is reported through the return value
        }
        finally { this.Dispose(); }
        return isfalse;
    }

    public override T GetCache<T>(string key)
    {
        var t = default(T);
        try
        {
            if (string.IsNullOrEmpty(key)) { return t; }

            InitCache();
            t = redis.Get<T>(key);
        }
        catch (Exception ex)
        {
            // Swallowed: the default value is returned on failure
        }
        finally { this.Dispose(); }
        return t;
    }

    public override bool Remove(string key)
    {
        var isfalse = false;
        try
        {
            if (string.IsNullOrEmpty(key)) { return isfalse; }

            InitCache();
            isfalse = redis.Remove(key);
        }
        catch (Exception ex)
        {
            // Swallowed: failure is reported through the return value
        }
        finally { this.Dispose(); }
        return isfalse;
    }

    public override void Dispose(bool isfalse)
    {
        if (isfalse && redis != null)
        {
            redis.Dispose();
            redis = null;
        }
    }

    #endregion
}

/// <summary>
/// Memcached cache
/// </summary>
public class MemcachedCache : BaseCache
{
}

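The RedisClient class used here comes from a NuGet package reference; the package is:

Next, look at the overridden InitCache method. It takes ip, port, and password parameters; here the values are written directly in the .cs file rather than read from a configuration file, which you can extend yourself.

These parameters are passed through the RedisClient constructor and give the underlying socket the information it needs. Here are a few of RedisClient's constructors: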

public RedisClient();

public RedisClient(RedisEndpoint config);

public RedisClient(string host);

public RedisClient(Uri uri);

public RedisClient(string host, int port);

public RedisClient(string host, int port, string password = null, long db = 0);

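Both Get and Set ultimately go through the RedisClient object. The thing I think deserves attention is the expiration-time parameter of the Set method; I have not yet tested how it behaves in the following case:

After an expiration time has been set through one of these methods, if the cache key is used shortly before it expires, is the expiration automatically pushed back? I have not tested this yet (I ask because the memcache caching in the .NET Core framework of an earlier project had such a setting, so I assume Redis has something similar).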
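On the open question above: as far as the Redis documentation describes it, reading a key does not extend its time to live; the TTL changes only through an explicit command such as EXPIRE or when the key is overwritten. A sliding expiration would therefore have to be renewed by the caller. The block below is a small in-memory Python model of the two behaviors, not Redis client code; the class `TtlCache` and its `sliding` flag are invented for illustration.

```python
import time


class TtlCache:
    """In-memory model of key expiration; sliding=True renews the TTL on each read."""

    def __init__(self, sliding=False, clock=time.monotonic):
        self._data = {}  # key -> (value, ttl_seconds, deadline)
        self._sliding = sliding
        self._clock = clock

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, ttl_seconds, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, ttl, deadline = entry
        now = self._clock()
        if now >= deadline:
            del self._data[key]  # expired: behave as if the key never existed
            return None
        if self._sliding:
            # Sliding expiration: every successful read pushes the deadline back
            self._data[key] = (value, ttl, now + ttl)
        return value


# A fake clock makes the difference visible without real waiting.
now = [0.0]
absolute = TtlCache(sliding=False, clock=lambda: now[0])
sliding = TtlCache(sliding=True, clock=lambda: now[0])
absolute.set("k", "v", 10)
sliding.set("k", "v", 10)

now[0] = 8.0   # read shortly before the 10-second deadline
print(absolute.get("k"), sliding.get("k"))
now[0] = 12.0  # past the original deadline
print(absolute.get("k"), sliding.get("k"))
```

With a fixed deadline the key is gone at the second read; with the sliding variant the read at 8 seconds moved the deadline to 18 seconds, so the key is still there.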

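We also need to override public override void Dispose(bool isfalse), because the RedisClient must be released after use; we release it manually through Dispose in one place instead of wrapping each call in using().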

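Building the cache factory and calling it

Next we need a cache factory. Having defined both RedisCache and MemcachedCache above, we clearly have several cache implementations to call, so we use the factory pattern to pick the right one.

The factory here does not create objects explicitly with new RedisCache() or new MemcachedCache(); instead it uses reflection to create the matching cache object.

First, define an enum whose member names must match the names of our cache classes: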

public enum CacheType

{

RedisCache,

MemcachedCache

}

Next, define the factory class CacheRepository (the cache factory) with a method Current, as follows:

public static BaseCache Current(CacheType cacheType = CacheType.RedisCache)

{

var nspace = typeof(BaseCache);

var fullName = nspace.FullName;

var nowspace = fullName.Substring(0, fullName.LastIndexOf('.') + 1);

return Assembly.GetExecutingAssembly().CreateInstance(nowspace + cacheType.ToString(), true) as BaseCache;

}

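The enum argument determines the typeName passed to the reflective CreateInstance() method, and therefore which cache object is created. Note the namespace prefix nowspace: the cache classes may not share a namespace with the factory class, but they usually share one with the cache base class, so the method starts by extracting the namespace from the base class's full name (adjust this to your own project).

Assembly.GetExecutingAssembly() returns the assembly that contains the currently executing code, which saves us from calling Assembly.Load() and passing it the assembly's name or path.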
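The name-based lookup behind this factory can be sketched in Python as well. This is an analogy to the C# reflection call rather than a port of it: the enum member's name is matched against the names of the base class's subclasses. The dict-backed `RedisCache` below is a stand-in that never talks to a Redis server, and all method names are hypothetical.

```python
from enum import Enum


class BaseCache:
    def set_cache(self, key, value):
        return False

    def get_cache(self, key):
        return None


class RedisCache(BaseCache):
    """Stand-in backed by a shared dict; a real version would talk to a Redis server."""

    _store = {}

    def set_cache(self, key, value):
        self._store[key] = value
        return True

    def get_cache(self, key):
        return self._store.get(key)


class MemcachedCache(BaseCache):
    """Not implemented, mirroring the empty class in the article."""


class CacheType(Enum):
    # Member names must match the cache class names, as in the C# enum
    RedisCache = 1
    MemcachedCache = 2


def current(cache_type=CacheType.RedisCache):
    """Create the cache whose class name equals the enum member's name."""
    by_name = {cls.__name__: cls for cls in BaseCache.__subclasses__()}
    return by_name[cache_type.name]()


current(CacheType.RedisCache).set_cache("flight:1", {"price": 100})
print(current().get_cache("flight:1"))
```

Adding a new backend then only requires a new subclass and a matching enum member; the factory function itself stays unchanged.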

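Once the requirements above are met, we can call the code from a test page like this:

CacheRepository.Current(CacheType.RedisCache).SetCache<MoFlightSearchResponse>(keyData, value);

It is that simple. Let's use the redis-cli.exe client to look at the cached data:

How did it work for you? Here is the complete code: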

public enum CacheType

{

RedisCache

↧

The cat-and-mouse story of implementing anti-spam for Mail.Ru Group’s email ser ...

September 1, 2016, 5:14 am


Hey guys!

In this article, I’d like to tell you a story of implementing the anti-spam system for Mail.Ru Group’s email service and share our experience of using the Tarantool database within this project: what tasks Tarantool serves, what limitations and integration issues we faced, what pitfalls we fell into and how we finally arrived at a revelation.

Let me start with a short backtrace. We started introducing anti-spam for the email service roughly ten years ago. Our first filtering solution was Kaspersky Anti-Spam together with RBL (Real-time Blackhole List, a real-time list of IP addresses that have something to do with spam mailouts). This allowed us to decrease the flow of spam messages, but due to the system’s inertia, we couldn’t suppress spam mailouts quickly enough (i.e., in real time). The other requirement that wasn’t met was speed: users should have received verified email messages with a minimal delay, but the integrated solution was not fast enough to catch up with the spammers. Spam senders are very fast at changing their behavior model and the look of their spam content when they find out that spam messages are not delivered. So, we couldn’t put up with the system’s inertia and started developing our own spam filter.

Our second system was MRASD ― Mail.Ru Anti-Spam Daemon. In fact, this was a very simple solution. A client’s email message went to an Exim mail server, got through RBL that acted as the primary filter, and then went to MRASD where all the magic happened. The anti-spam daemon parsed the message into pieces: headers and body. Then it normalized each of the pieces using elementary algorithms like normalizing the character case (all in lowercase or uppercase), bringing similar-looking symbols to a specific form (using one symbol for the Russian and English “O”, for example), etc. After normalization, the daemon extracted so-called “entities”, or email signatures. Our spam filters analyzed different parts of the email message and blocked the message if they found any suspicious content. For example, we could define a signature for the word “viagra”, and all messages that contained this word were blocked. An entity could also be a URL, an image, an attachment and so on. Another thing done during the anti-spam check was to calculate a fingerprint for the verified email message. A fingerprint, calculated as a handful of tricky hash functions, was a unique characteristic of the message. Based on the calculated hash values and collected hash statistics, the anti-spam system could filter out the message as spam or let it through. When a hash value or an entity reached a certain frequency threshold, a server started blocking all matching email messages. For this purpose, we maintained statistics (counters) that tracked how often an entity was met, how often the recipients complained about it, and set an entity flag SPAM/HAM (in spam-related terminology, “ham” is the opposite of “spam” and means that the verified email message contains no spam content).
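The normalize-then-count idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not MRASD's actual logic: the look-alike table, the `EntityStats` class, and the substring match are all invented for the example, and the complaint counters and fingerprint hashes are left out.

```python
from collections import Counter

# Hypothetical look-alike table: Cyrillic letters folded onto their Latin twins.
LOOKALIKES = str.maketrans({"\u043e": "o", "\u0430": "a", "\u0435": "e"})


def normalize(text):
    """Lowercase the text and map similar-looking symbols to one form."""
    return text.lower().translate(LOOKALIKES)


class EntityStats:
    """Counts how often each entity was seen and flags it once a threshold is reached."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.seen = Counter()

    def check(self, message, entities):
        """Return the entities found in `message` whose count has reached the threshold."""
        text = normalize(message)
        hits = [e for e in entities if e in text]
        self.seen.update(hits)
        return [e for e in hits if self.seen[e] >= self.threshold]


stats = EntityStats(threshold=2)
print(stats.check("Buy VIAGRA now", ["viagra"]))
# The second message hides the word behind a Cyrillic letter.
print(stats.check("cheap vi\u0430gra", ["viagra"]))
```

The first message only increments the counter; the second one matches after normalization and pushes the entity over the threshold, so it is reported.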

The core part of MRASD was implemented in C++, while a considerable piece of its business logic was implemented in an interpreted language, Lua. As I have already said, spammers are highly dynamic guys who change their behavior very fast. Our aim was to respond just as fast to every change on the spammers’ side, which is why we implemented our business logic in an interpreted language (with Lua, we didn’t have to recompile the system and update it on all servers every time). The other requirement was speed: code in Lua showed good results in performance testing. Finally, it was easy to integrate with code in C++.

The scheme above illustrates a simplified workflow of our anti-spam filter: an email message comes from the sender to our mail server; if the message has successfully passed the primary filter (1), it goes further to MRASD (2). MRASD returns its check results to the mail server (3), and based on these results the message is delivered either to the recipient’s “Spam” folder or to the inbox.

MRASD allowed us to decrease the number of non-filtered spam messages ten times. As time went on, we kept improving the system: added new subsystems and components, introduced new tools. So, the system kept growing and became still more complex, and anti-spam tasks also became still more diverse. These changes couldn’t help affecting our technology stack. That’s what the next part of this story is about.

Evolution of our technology stack

At the dawn of the era of email services, the message flow as well as the message content was notably scarcer than today. But the choice of tools and computing capacities was poorer too. As you can see from the above-described “parental” model of MRASD, it was necessary to store all sorts of statistical data. A considerable part of this data was “hot” (i.e., frequently used), and this posed certain requirements for the data storage system. As a result, we chose MySQL as a storage system for the “cold” data, but still felt undecided about the “hot” statistics. We analyzed all possible solutions (their performance and functionality as applied to “hot” but not mission-critical data) and finally arrived at Memcached; at that moment, this solution was stable enough. But we still had a problem with storing “hot” and critical data. Like any other cache solution, Memcached has its limitations, including no replication and a long, slow warm-up period after the cache goes down (and is flushed). Our further search brought us to Kyoto Cabinet, a non-relational key-value database system.

Time ticked by, and the email workload increased, and so did the anti-spam workload. New services emerged that required storing ever more data (Hadoop, Hypertable). By the way, today’s peak processing workload reaches 550 thousand email messages per minute (if we calculate a daily average, this makes about 350 thousand email messages every minute), and the amount of log files to analyze is over 10 Tbytes a day. But let’s get back into the past: in spite of the increasing workloads, our requirements for data processing (loading, saving) remained the same. And one day we realized that Kyoto couldn’t manage the amount of data we needed. Moreover, we wanted a storage system with broader functionality for our “hot” and critical data. It was high time to look around for better alternatives that would be more flexible and easier to use, with higher performance and failover capabilities. It was the time when a NoSQL database named Tarantool gained popularity within our company. Tarantool was developed inside the company and fully met our “wannas”. By the way, I’ve lately been revising our services, and I felt like an archaeologist when I bumped into one of the earliest Tarantool versions, Tarantool/Silverbox. We decided to give Tarantool a try as its benchmark tests covered the data amounts we needed (I don’t have exact workload figures for that period) and it also satisfied our requirements for memory usage. Another important factor was that the project team was located next door, and we could quickly make feature requests using JIRA. We were among the pioneers who decided to try Tarantool in their project, and I think that our first step towards Tarantool was much encouraged by the positive experience of the other pioneers.

That’s when our “Tarantool era” began. We actively introduced, and keep introducing, Tarantool into our anti-spam architecture. Today we have queues based on Tarantool and high-workload services for storing all sorts of statistics: user reputation, sender IP reputation, user trustworthiness (“karma” statistics), etc. Our current activity is integrating the upgraded data storage system into our entity statistics processor. You may be wondering why we have focused on a single database solution for our anti-spam project and do not consider migrating to other storages. Well, that’s not quite the case. We consider and analyze competing systems as well, but for the time being Tarantool handles all the tasks and workloads required within our project well. Introducing a new (unknown, previously unused) system is always risky and takes much time and resources. Meanwhile, Tarantool is a well-known tool for us (and for many other projects). Our developers and system administrators already know the ins and outs of using and configuring Tarantool and how to make the most of it. Another advantage is that Tarantool’s development team keeps improving its product and provides good support (and these guys are working next door, which is nice:)). When we were implementing yet another Tarantool-based solution, we got all the necessary help and support straightaway (I will tell you about this a bit later).

Further on I’ll give you an overview of several systems in our anti-spam project that use Tarantool and will relate the issues we faced.

Overview of our systems that use Tarantool

Karma

Karma is a numeric value that indicates a user’s trustworthiness. It was originally intended as the basis of a general “carrot and stick” system for users that wouldn’t require complex dependent systems. Karma is an aggregate value based on data received from other user reputation systems. The idea behind the Karma system is simple: every user has their karma, and the higher it is, the more we trust this user; the lower it is, the stricter we are while assessing their email messages during our anti-spam check. For example, if a sender sends an email message with suspicious content and the sender’s karma rating is high, such a message will hit the recipient’s inbox. A low karma rating, in turn, counts against the sender in the anti-spam check. This system makes me think of an attendance book that a teacher consults during school examinations. Students who attended all classes get just a couple of extra questions and leave for vacation, while those who missed many classes have to answer lots of questions to get a high grade.

Tarantool that stores karma-related data works on a single server. The graph below illustrates the number of requests that one such instance performs per minute.

RepIP/RepUser

RepIP and RepUser (reputation IP and reputation user) is a high-workload service for processing statistics related to the activity and actions of a sender (user) with a specific IP, as well as statistics related to how intensively a user worked with the email service over a certain period of time. This system lets us know how many email messages a user has sent, how many of them were read, and how many were marked as spam. An advantage of this system is that it provides a timeline rather than a snapshot of a user’s activity. Why is that important for behavior analysis? Imagine that you have moved to a foreign country without any means of communication, and all your friends remained at home. Then, several years later, you get an Internet cable in your hut. Wow! You browse to the website of your favorite social network and see a photo of your friend. Hm, he has changed a lot… How much information can you get from that photo? I guess, not too much. And now imagine that you watch a video that shows your friend change, get married and so on, a kind of short biographical clip. I bet that in the second case you’ll get a much better idea of your friend’s life. The same goes for data analysis: the more information we have, the better we can assess a user’s behavior. We can notice trends in a sender’s mailing activities and understand a sender’s habits. Based on this kind of statistics, each user and IP address is assigned “trust rating points” and a special flag. This flag is used in the primary filter that filters out up to 70% of spam messages before they even hit our mail server. This percentage illustrates the great importance of the reputation service. This is why this service requires the maximum possible performance and failure tolerance. And this is why we use Tarantool here.

Reputation statistics are stored on two servers with four Tarantool instances per each server. The graph below illustrates the average number of requests to RepIP per minute.

While we implemented the reputation service, we had a number of configuration issues with Tarantool. Unlike the systems we discussed earlier, a data packet for RepIP/RepUser is much larger: the average packet size here is 471.97 bytes (the maximum size is 16 Kbytes). Logically, a packet comprises two parts: a small “basic” part (flags, aggregated statistics) and a large statistical part (detailed per-action statistics). Addressing an entire packet results in intensive network usage, so it takes more time to load and save a record. Many systems need only the basic part of a packet, but how can we strip it out of a tuple (“tuple” is Tarantool’s term for a record)? Here’s where stored procedures come in handy. We added the required function to Tarantool’s init.lua file and called it from the client (starting from Tarantool version 1.6, you can write stored procedures in plain C).
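The benefit of stripping the record on the server can be shown with a toy model. The Python below is a conceptual sketch only: it does not use Tarantool or its Lua API, and the `ReputationStore` class with its `load_basic` method is a hypothetical stand-in for the stored procedure described above.

```python
class ReputationStore:
    """Toy server-side store; each record is (key, basic_part, detailed_part)."""

    def __init__(self):
        self._records = {}

    def save(self, key, basic, detailed):
        self._records[key] = (key, basic, detailed)

    def load(self, key):
        # Full record: what a plain lookup would ship over the network
        return self._records[key]

    def load_basic(self, key):
        # Plays the role of the stored procedure: drop the large part on the server
        key, basic, _detailed = self._records[key]
        return (key, basic)


store = ReputationStore()
store.save(
    "10.0.0.1",
    {"flag": "HAM", "sent": 120},                 # small aggregated part
    [("sent", ts) for ts in range(1000)],         # large per-action part
)
print(store.load_basic("10.0.0.1"))
print(len(store.load("10.0.0.1")[2]), "detailed entries stay on the server")
```

Clients that only need the flags call `load_basic` and never pay for transferring the per-action history.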

Problems with Tarantool versions before 1.5.20

It would be wrong to say that we’ve never had problems with Tarantool. Yes, we had some. For example, after a scheduled restart, Tarantool clients (more than 500) failed to reconnect due to a timeout. We tried introducing progressive timeouts when after a failure the next reconnection attempt is delayed for some increasing amount of time, but this didn’t help. As we found out, the problem was that Tarantool accepted just one connection request within every cycle of its event loop, although there were hundreds of requests awaiting. We had two alternatives: install a new Tarantool version (1.5.20 or higher) or amend Tarantool’s configuration (disabling the io_collect_interval option solved the problem). Tarantool developers fixed this bug very quickly, so you won’t have it with Tarantool 1.6 or 1.7.
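The progressive timeouts mentioned above can be sketched as follows. This is a generic exponential backoff in Python, not the Tarantool client's code: the function names, default values, and the optional jitter are assumptions made for the example. As the text notes, this client-side measure did not cure the failure here, because the bottleneck was on the server side.

```python
import random


def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=6, jitter=0.0, rng=random.random):
    """Yield the wait before each retry: base, base*factor, ... capped at `cap`."""
    delay = base
    for _ in range(attempts):
        # Optional jitter spreads out clients that all failed at the same moment
        yield min(cap, delay) * (1.0 + jitter * rng())
        delay *= factor


def reconnect(connect, delays, sleep):
    """Call `connect` until it succeeds, waiting `sleep(d)` after each failure."""
    for d in delays:
        try:
            return connect()
        except ConnectionError:
            sleep(d)
    raise ConnectionError("gave up after all attempts")


print(list(backoff_delays(attempts=4)))
```

In real use `sleep` would be `time.sleep`; passing it in as a parameter keeps the retry loop easy to exercise without waiting.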

RepEntity ― entity reputation

We are currently integrating a new component for storing entity statistics (URL, image, attachment, etc.) ― RepEntity . The purpose of RepEntity is similar to that of the already discussed RepIP/RepUser: it offers detailed information about entity behavior, which is a decision criterion for our anti-spam filter. Thanks to RepEntity statistics, we can filter out a spam mailout based on the entities of an email message. As an example, a mailout may contain a suspicious URL (e.g. it may contain spam content or lead to a phishing website), and RepEntity helps us notice and block such mailouts much faster. How? We can see the mailing out dynamics of this URL, and we can detect changes in its behavior, which would be impossible with “flat” counters.

Besides a different data packet format, the basic difference between the RepEntity and RepIP systems is that RepEntity produces a tangibly higher workload on our service (the amount of processed and stored data is greater, and so is the number of requests). A single email message may contai

↧

Digging Deep Into Cassandra Thrift Buffer Behavior

September 1, 2016, 5:13 am


Everyone who works in tech has had to debug a problem. Hopefully it is as simple as looking into a log file, but many times it is not. Sometimes the problem goes away and sometimes it only looks like it goes away. Other times it might not look like a problem at all. A lot of factors will go into deciding if you need to investigate, and how deep you need to go. These investigations can take a lot of resources for an organization and there is always the chance of coming up empty handed which should never be seen as a failure.

This post will summarize an investigation into some Cassandra memory behavior that the database team at Knewton conducted recently. It is a good illustration of the kind of multi-pronged approach needed to unravel strange behaviors low in the stack. While we will be specifically talking about Cassandra running on AWS the takeaways from this article are applicable to systems running on different platforms as well.

Uncovering a Problem

Background

Knewton had a Cassandra cluster that was very overprovisioned. The instances were scaled up (each instance is made more powerful) rather than scaled out (more instances are added to the cluster). Cassandra is a horizontally scalable datastore and it benefits from having more machines rather than better ones.

So we added more machines and reduced the power of each one. We now had 4GB of available memory for the Java heap instead of 8GB. This configuration worked well in all of our tests and in other clusters but in this specific overprovisioned cluster we found we had scaled each node down too much, so we moved to machines that could accommodate 8GB of heap, m4.xlarge instances.

Anomalous Behavior on New Instances

After moving all nodes over to m4.xl instances we saw our heap situation stabilize. However we began to notice anomalous CPU and load averages across the cluster. Some nodes would run higher than other nodes. The metrics showed that, out of the four cores on a m4.xl instance, one was completely saturated.

If you saw this load on its own you would not think that it is a problem. Total usage of CPU on the box is at 25% and there are no catastrophically long GC pauses. However, the cluster was not uniform, which called for further investigation.

In these CPU graphs, you can see nodes running on low CPU that encountered something that would rapidly promote them into the high CPU band, and they would almost never drop back down.


We found that this graph was correlated with the graph of average garbage collection time.

When nodes promote themselves into the high CPU band, their GC times spike.

What is Cassandra holding in its heap that is causing GC issues? With a lot of overhead in memory and CPU, crashes are not a risk, but performance is far from where we want to be.

Before tackling the garbage collection question, we have two other questions that we can answer already:

Why is this behavior showing up now?

We had an 8GB heap before and should have seen the same behavior. The reason we only saw this CPU and GC behavior once on m4.xlarge instances is twofold:

Something unique in this cluster caused a 4GB heap to be fatal but an 8GB heap to be adequate. Expanding the heap did not get rid of the problem.

The original cluster that had an 8GB heap was around for years and all nodes were promoted to the high CPU band. The baseline operation of the cluster looked like the high CPU band. It is only because of previous issues with provisioning this cluster that we were watching closely when we moved to these m4.xlarge instances.

This highlights the importance of understanding your baseline resource utilization and not assuming that it means you are in a healthy state.

Why is this behavior a problem?

Even though the high usage band is not a problematic load, the behavior of the cluster was unexpected, and this promotion behavior is problematic. The biggest reason that is problematic is that it meant we had a nondeterministic factor in our latencies. We can expect different performance on a node that is in the high CPU band than one in the low CPU usage band. However we cannot predict when a node will promote itself as we don’t know the cause for this behavior. So we have an unknown factor in our system, which is dangerous for repeatability and reliability.

Investigation

Investigating further is resource intensive, and often your senior engineering staff has to undertake the investigation, as it requires some degree of independence and experience. So make sure you decide that the problem is actually worth the time of your organization before sinking days or weeks of engineering time into it.

Logs

The first thing to do is to look through the relevant logs. Looking through the Cassandra logs we found, unsurprisingly, a lot of garbage collections. We’d had GC logging on and found several “large block” messages in the CMS garbage collection logs at about the time these promotions happened. To get more visibility into what is actually in the heap, we turned on GC class histogram logs, but these said that almost all the memory was being held in byte arrays.

Not helpful. Time to go even deeper and see what is in the heap.

Heap Dump

So we took a heap dump from a problematic node and from a node that was exhibiting good GC behavior as a control. A heap dump file is roughly the size of the used Java heap. Dumping the heap is a “stop the world” operation for the process you are dumping, so when doing this in production be sure that the service can be unresponsive for a minute or more. The file is binary and examining it is labor intensive, so it is best to move the heap dump to a computer that’s not being used in production to investigate.

We used the Eclipse Memory Analyzer Tool (MAT) to investigate the heap dumps and found that almost all of the memory was taken up by the write buffer of the TFramedTransport objects. There were several hundred of these objects and the write buffer size ranged from 1kB to 30MB, with many in the range of 20-30MB. We saw similar objects in the heap dump of the control node, but not nearly as many. The write buffer contains what is being written to the Thrift transport socket and does not correspond to Cassandra reads and writes. In this case Cassandra is writing the output of an incoming read to this buffer to send to the client that has requested the data.
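One way such a heap picture can arise is a per-connection buffer that grows to fit the largest response it has ever carried and is never released afterwards. The Python below is a hypothetical model of that pattern, offered as a guess consistent with the heap dump described above; it is not the Thrift library's actual code, and `GrowOnlyBuffer` is an invented name.

```python
class GrowOnlyBuffer:
    """Hypothetical per-connection write buffer that grows on demand and never shrinks."""

    def __init__(self, initial=1024):
        self.capacity = initial

    def write(self, payload_size):
        # Grow to fit the largest payload seen so far; capacity is never given back
        if payload_size > self.capacity:
            self.capacity = payload_size


def retained_bytes(buffers):
    """Total memory pinned by all connection buffers."""
    return sum(b.capacity for b in buffers)


MB = 1024 * 1024
connections = [GrowOnlyBuffer() for _ in range(200)]
for conn in connections:
    conn.write(25 * MB)   # one large read response per connection
    conn.write(1024)      # later traffic is small again
print(retained_bytes(connections) // MB, "MB still held")
```

In this model a single large response per connection is enough to keep gigabytes of heap occupied for as long as the connections stay open, even though current traffic is tiny.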

It became pretty clear that this was a Thrift protocol issue so we searched for related issues.

Literature Search

Any problem you find has been reported in some way or another by someone else, especially if you are using anything but the latest versions of open-source software. It is very useful to search the web for related problems at every step of the investigation, and as you get more specific you might uncover things that you would not have encountered earlier. In this case the investigation led us to the Thrift protocol, which is not something we would have searched for earlier.

The Thrift library that our version of Cassandra used had some memory issues referenced in CASSANDRA-4896 . These are referenced again in more detail and resolved in CASSANDRA-8685 . So you don’t have to read through the tickets, basically the write buffer ―the thing taking up all of our spa

↧

Visa Interview Experience |Set 11

September 1, 2016, 5:12 am


I recently had the opportunity to interview with Visa Inc.

Round 1:

Online Coding Test:

Platform: HackerRank

Questions: There were 4 coding questions to be solved in 90 minutes.

Question 1: Given ‘n’ jars, each filled with ‘m’ jellybeans, and ‘T’ operations performed on these jars, where each operation gives a range [a-b] and the number of jellybeans to be filled into the jars lying in that range, find the number of jellybeans in each jar after these ‘T’ operations.

Question 2: Given a number ‘N’ and an array a[ ], find the number of pairs with a[i]-a[j]=N such that i>j. (Could be solved in O(n) using a HashMap.)
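Both questions have short linear-time solutions, sketched here in Python. For Question 1 the sketch assumes that ranges are 1-based and inclusive and that each operation adds its beans to every jar in the range; the difference-array technique is my choice, not something stated in the write-up. Question 2 follows the HashMap hint, with a counter of the values seen so far.

```python
from collections import Counter


def jars_after_operations(n, m, operations):
    """operations: (a, b, k) adds k beans to every jar in the 1-based range [a, b]."""
    diff = [0] * (n + 1)
    for a, b, k in operations:
        diff[a - 1] += k   # the increase starts at jar a
        diff[b] -= k       # and stops after jar b
    jars, running = [], 0
    for i in range(n):
        running += diff[i]
        jars.append(m + running)
    return jars


def count_pairs_with_difference(a, target):
    """Count pairs (i, j) with i > j and a[i] - a[j] == target."""
    seen = Counter()  # how many times each value occurred at earlier indices
    pairs = 0
    for value in a:
        pairs += seen[value - target]
        seen[value] += 1
    return pairs


print(jars_after_operations(5, 10, [(1, 3, 2), (2, 5, 1)]))
print(count_pairs_with_difference([1, 3, 5, 3], 2))
```

The first function runs in O(n + T) instead of touching every jar for every operation; the second makes a single pass over the array.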

Question 3: Given the number of nodes and the number of edges connecting these nodes, arrange these edges such that maximum number of nodes are strongly connected. Return the number of nodes that could be strongly connected. A node is said to be strongly connected if that node is connected to every other node in the graph (Derived a formula for it and solved in O(1) worked for 10/14 testcases).

Formula: (number of edges * 2) = (x^2 - x), where x represents the number of strongly connected nodes. Solving the equation for x gives (1+Math.sqrt(1+(8*edges)))/2. The approximation produced an error of ±1.
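The formula can be evaluated with exact integer arithmetic, which may avoid the off-by-one error that floating-point rounding can introduce. The Python sketch below assumes the intended answer is the largest x for which x(x-1)/2 edges are available, capped at the number of nodes; that reading is an assumption, since the original solution passed only 10 of 14 test cases and the exact problem statement is not given.

```python
import math


def max_fully_connected_nodes(nodes, edges):
    """Largest x <= nodes with x*(x-1)/2 <= edges, using exact integer arithmetic."""
    # Positive root of x^2 - x - 2*edges = 0, rounded down
    x = (1 + math.isqrt(1 + 8 * edges)) // 2
    return min(nodes, x)


for e in (1, 3, 5, 6):
    print(e, max_fully_connected_nodes(10, e))
```

`math.isqrt` returns the floor of the square root as an integer, so no rounding step is needed after solving the quadratic.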

Question 4: https://www.hackerrank.com/contests/w1/challenges/volleyball-match (Solved this using Combinatorics in O(1)).

50 students were shortlisted.

Personal Interview:

Interviewer asked me to introduce myself.

He went through my resume and asked me about a few of my projects.

He asked me to go on explaining without minding him. He was noting down a few points from what I had been explaining ( Not sure what though ).

Since my project was Android DBMS based, he asked a few questions on indexing and database archiving.

He asked me the problems I faced while developing the application and how I overcame them.

They also had the complete summary of the coding round on HackerRank.

He asked me what went wrong in the graph problem since 10/14 testcases had only passed. This discussion went for some time.

Around 15 students were shortlisted after this round.

It was around 6PM when this round started and so the Interviewer asked me how the previous rounds were and if I was tired.

She asked me for my Areas of Interest, for which I replied Android app. Dev, DBMS, DataStructures and Algorithmics.

She started with questions on DBMS.

What is a star schema? Where is it used, and what are its advantages?

What is normalization? Explain each normal form with examples.

Given a scenario, design a schema and its E-R diagram.

She noticed a project on my resume that used MongoDB, so she started asking about the differences between MongoDB and a traditional RDBMS.

Differences between traditional SQL and NoSQL, its advantages, and a few basic queries using both SQL and NoSQL.

Asked how data is formatted in MongoDB? (BSON) and a few basic questions on that.

A short discussion about RESTful services and GCM(Google Cloud messaging) since it was used in my projects.

Difference between a Complete and a full binary tree. Given the number of nodes in a full binary tree find the height of the tree in O(1).

Given a BST and a sum, find 2 nodes in the BST that yield the sum .

13 of us were shortlisted for the next round.

This was the final round.

Started with “Tell me about yourself” and the HR started improving the conversation.

He talked with me about my projects on my Resume.

He then asked me to explain in detail about a project that was the most interesting.

He then asked me what I could have done to improve this project.

I was then asked the programming language in which I was the most comfortable. (Java).

The HR seemed to be a nice guy and this round went casually.

I was finally asked if I had any questions for him.

Results were then announced around 9PM.

10 of us were finally selected for a FTE at Visa Inc.

I thank GeeksForGeeks for helping me through this process.


↧

Save the whale: Docker rightfully shuns standardization

September 1, 2016, 4:27 pm


The Open Container Project wants to standardize Docker the old-fashioned way: by committee. The pushback from Docker delivers a compelling counterpoint.

You can be forgiven for thinking Docker cares about standardization.

A little more than a year ago Docker donated "its software container format and its runtime, as well as the associated specifications," to the Open Container Project to be housed under the Linux Foundation. In the FAQ, the Linux Foundation stressed, "Docker has taken the entire contents of the libcontainer project, including nsinit, and all modifications needed to make it run independently of Docker, and donated it to this effort."

It was euphoric, this kumbaya moment. Many, including Google's Kelsey Hightower, thought the container leader was offering full Docker standardization in our time.

Nope.

As Docker founder Solomon Hykes declared last week, "We didn't standardize our tech, just a very narrow piece where it made sense." Importantly, what Hykes is arguing, a "reasonable argument against weaponized standards," may be the best kind of "Docker standardization" possible.

The committee to slow all real work

Standards are what people at lumbering old companies get paid to do. No, I'm not arguing that standards are not useful or good, but in the early days of a technology, standards are inimical to progress -- which, perhaps not ironically, is precisely what legacy vendors tend to want.

While I was at MongoDB, a popular open source NoSQL database, various legacy software companies courted us to establish foundations and "standardize" the technology. One went so far as to gather a gaggle of NoSQL vendors to try to harmonize an overarching NoSQL standard, an effort that baffled each of the different NoSQL organizations, given the diverse technologies (document, wide-column, graph, and so on) assembled under the NoSQL banner.

Not surprisingly, these large companies tended to have established database businesses of their own that were fading in popularity or threatened to do so. A quick look at DB-Engines database rankings may offer some clues as to who would have the most to gain from a sluggish MongoDB standard or from hobbling all the fast-moving NoSQL technologies under a common standard.

Such standards would no doubt help the legacy database vendors by giving them time to regroup, but they would also almost certainly cripple a movement that has redefined the data infrastructure landscape.

Standardizing Docker

This is why Puppet CEO Luke Kanies was right to ask how requests to standardize Docker are remotely reasonable. Telling Hightower that he "wouldn't make Puppet a standard," Kanies goes on to declare, "I know of no great companies that standardized their tech before they had a mature business."

The reason, of course, is that the best "standards" are those that become such by default, not by committee. The minute multiple vendors are involved in defining code, that code becomes cumbersome and often irrelevant. We might wish it were otherwise -- the idea of diverse companies collaborating to define standards evokes feelings of blissful happiness -- but in the real world, companies, not committees, tend to win.


↧

[Most Complete Practical Examples] Caching Handbook (Memcached, Redis, RabbitMQ)

September 1, 2016, 4:26 pm


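Contents of this chapter:

Memcached: introduction, installation, and usage; operating Memcached from Python; native cluster support
Redis: introduction, installation, usage, and examples; operating Redis from Python; String, Hash, List, Set, and Sorted Set operations; pipelines; publish/subscribe
RabbitMQ: introduction, installation, and usage; operating RabbitMQ through the API; keeping messages from being lost; publish/subscribe; keyword-based sending; fuzzy matching

I. Memcached

1. Introduction, installation, and usage

Memcached is a high-performance distributed in-memory object caching system, used by dynamic web applications to reduce database load. It caches data and objects in memory to cut down the number of database reads, which speeds up dynamic, database-driven websites. Memcached is built on a hashmap that stores key/value pairs. Its daemon is written in C, but clients can be written in any language and talk to the daemon over the memcached protocol.

Memcached memory management:

Memcached stores and retrieves data in a preallocated memory region of a specified size; all data is kept in memcached's own memory.

It uses the Slab Allocation mechanism to allocate and manage memory. The allocated memory is split into blocks of predefined lengths, and blocks of the same size are grouped together. These blocks are never freed and can be reused.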
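To make the grouping idea concrete, here is a minimal sketch of how an item could be matched to a block size. The chunk sizes below are illustrative assumptions, not memcached's actual defaults, and `pick_slab_class` and `wasted_bytes` are hypothetical names.

```python
import bisect

# Hypothetical chunk sizes in bytes, one per slab class (illustrative values only).
CHUNK_SIZES = [64, 128, 256, 512, 1024]


def pick_slab_class(item_size, chunk_sizes=CHUNK_SIZES):
    """Return the index of the smallest slab class whose chunk fits the item."""
    index = bisect.bisect_left(chunk_sizes, item_size)
    if index == len(chunk_sizes):
        raise ValueError("item of %d bytes exceeds the largest chunk" % item_size)
    return index


def wasted_bytes(item_size, chunk_sizes=CHUNK_SIZES):
    """Bytes left unused inside the chunk that stores the item."""
    return chunk_sizes[pick_slab_class(item_size, chunk_sizes)] - item_size
```

With these assumed sizes a 100-byte item lands in the 128-byte class and leaves 28 bytes unused, which is the trade-off of fixed-size blocks: no fragmentation from freeing memory, at the cost of some padding per item.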

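When the stored data fills the allocated memory, Memcached uses the LRU (least recently used) algorithm to automatically evict cached data that is no longer in use, that is, it reuses the memory held by stale data. Memcached was designed as a cache, so it makes no provision for data recovery: like the machine's RAM, its contents are lost when the machine restarts. If the data must survive a service restart, use Memcachedb, the persistent cache system developed by Sina, or a common NoSQL service such as Redis.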
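The eviction rule can be sketched in a few lines. This is a simplified illustration of LRU in general, not memcached's implementation: it counts items rather than bytes, and `LRUCache` is a hypothetical class.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache with a fixed item capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used key
```

For example, with a capacity of 2, storing `a` and `b`, reading `a`, and then storing `c` evicts `b`, because `b` is the entry that has gone unused the longest.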

Default listening port: 11211

Installing Memcached

wget http://memcached.org/latest
tar -zxvf memcached-1.x.x.tar.gz
cd memcached-1.x.x
./configure && make && make test && sudo make install
# Note: memcached depends on libevent
yum install libevent-devel      # RHEL/CentOS
apt-get install libevent-dev    # Debian/Ubuntu
# Installing the Memcached service
# 1. Install libevent
mkdir /home/oldsuo/tools/
cd /home/oldsuo/tools/
wget http://down1.chinaunix.net/distfiles/libevent-2.0.21-stable.tar.gz
ls libevent-2.0.21-stable.tar.gz
tar zxf libevent-2.0.21-stable.tar.gz
cd libevent-2.0.21-stable
./configure
make && make install
echo $?
cd ..
# 2. Install Memcached
wget http://memcached.org/files/memcached-1.4.24.tar.gz
tar zxf memcached-1.4.24.tar.gz
cd memcached-1.4.24
./configure
make
make install
echo $?
cd ..
# Note:
# memcached-1.4.24.tar    --> client
# memcached-1.4.24.tar.gz --> server
# 3. Start and stop the service
echo "/usr/local/lib" >> /etc/ld.so.conf
ldconfig
# Show help
/usr/local/bin/memcached -h
# Start the Memcached service
memcached -p 11211 -u root -m 16m -c 10240 -d
# Check that it is running
lsof -i :11211
# Stop the service
pkill memcached
# memcached -p 11212 -u root -m 16m -c 10240 -d -P /var/run/11212.pid
# kill `cat /var/run/11212.pid`
# Note: to start Memcached at boot, add the start command above to /etc/rc.local
Quick deployment notes: installing and starting Memcached from source
# Installing the Memcached PHP client
cd /home/oldsuo/tools/
wget http://pecl.php.net/get/memcache-3.0.7.tgz
tar zxf memcache-3.0.7.tgz
cd memcache-3.0.7
/application/php/bin/phpize
./configure --enable-memcache --with-php-config=/application/php/bin/php-config --with-zlib-dir
make
make install
# When the installation finishes, you should see output similar to this:
Installing shared extensions: /application/php5.3.27/lib/php/extensions/no-debug-zts-20131226/
[root@localhost memcache-3.0.7]# ll /application/php5.3.27/lib/php/extensions/no-debug-zts-20131226/
total 1132
-rwxr-xr-x 1 root root 452913 Nov 17 16:52 memcache.so
-rwxr-xr-x. 1 root root 157862 Oct 9 21:01 mysql.so
-rwxr-xr-x. 1 root root 542460 Oct 9 19:25 opcache.so
# Edit php.ini and add the line extension = memcache.so
vim /application/php/lib/php.ini
extension_dir = "/application/php5.3.27/lib/php/extensions/no-debug-zts-20131226/"
extension = memcache.so
# Restart Apache so that the PHP configuration takes effect
[root@localhost application]# /usr/local/apache/bin/apachectl -t
Syntax OK
[root@localhost application]# /usr/local/apache/bin/apachectl graceful
Installing the Memcached PHP client from source

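Starting Memcached

memcached -d -m 10 -u root -l 218.97.240.118 -p 12000 -c 256 -P /tmp/memcached.pid
Parameters:
-d  run as a daemon
-m  amount of memory allocated to Memcached, in MB
-u  user that Memcached runs as
-l  IP address the server listens on
-p  port Memcached listens on, preferably above 1024
-c  maximum number of concurrent connections; the default is 1024, set it according to your server's load
-P  file in which the Memcached pid is saved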
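When the service is started from a script, the options above can be assembled programmatically. The following is a minimal sketch under that assumption; `memcached_argv` is a hypothetical helper, and it uses only the flags listed above.

```python
def memcached_argv(memory_mb, user, listen_ip, port, max_conns, pid_file, daemon=True):
    """Build the argument list for the memcached options described above."""
    argv = ["memcached"]
    if daemon:
        argv.append("-d")            # run as a daemon
    argv += [
        "-m", str(memory_mb),        # memory in MB
        "-u", user,                  # user to run as
        "-l", listen_ip,             # listening IP address
        "-p", str(port),             # listening port
        "-c", str(max_conns),        # maximum concurrent connections
        "-P", pid_file,              # pid file location
    ]
    return argv
```

The resulting list can be passed to `subprocess.call`, which avoids shell quoting problems. Called with the values from the command above, it reproduces the same command line.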

Memcached commands

Storage commands: set/add/replace/append/prepend/cas
Retrieval commands: get/gets
Other commands: delete/stats..
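To show what these commands look like on the wire, here is a small sketch that encodes `set` and `get` requests in memcached's text protocol: a storage request is a command line of the form `set <key> <flags> <exptime> <bytes>` followed by the data block, each terminated by `\r\n`. The function names are hypothetical, and the sketch only builds the bytes; it does not open a connection.

```python
def build_set(key, value, flags=0, exptime=0):
    """Encode a text-protocol 'set' request: a command line plus a data block."""
    data = value.encode("utf-8")
    header = "set %s %d %d %d\r\n" % (key, flags, exptime, len(data))
    return header.encode("ascii") + data + b"\r\n"


def build_get(key):
    """Encode a text-protocol 'get' request for a single key."""
    return ("get %s\r\n" % key).encode("ascii")
```

The byte count in the command line is the length of the encoded value, not the number of characters, which matters for non-ASCII data.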

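Managing Memcached

# 1. Manage over telnet: telnet ip port
telnet 127.0.0.1 11211
# 2. Issue commands directly with a tool such as nc
[root@localhost application]# printf "stats slabs\r\n"|nc 127.0.0.1 11211
STAT active_slabs 0
STAT total_malloced 0
END
# 3. Memcached management commands
a. stats: report general Memcached statistics.
b. stats reset: reset the counters and start collecting statistics afresh.
c. stats slabs: show slab information. The output gives the chunk size of each slab, which tells you which slab a piece of data is stored in.
d. stats items: show the number of items in each slab.
e. stats settings: show Memcached settings, such as the number of threads.
f. stats slabs: show slab status.
g. stats sizes: show the count and sizes of stored items.
h. stats cachedump: view keys and values.
i. stats reset: clear the statistics.
j. set | get, gets: store or retrieve data.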
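The `STAT name value` lines returned by the stats commands are easy to process in a script. A minimal sketch, assuming the response text has already been read from the socket or from `nc`; `parse_stats` is a hypothetical helper.

```python
def parse_stats(response):
    """Turn 'STAT name value' lines into a dict, stopping at END."""
    stats = {}
    for line in response.splitlines():
        if line == "END":
            break
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]  # values are kept as strings
    return stats
```

Applied to the `stats slabs` output shown above, this gives a dict with the keys `active_slabs` and `total_malloced`, both mapped to the string `"0"`.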
# Managing Memcached with the memadmin PHP tool (memadmin-1.0.12.tar.gz)
1. Install the memadmin PHP tool.
cd /home/oldsuo/tools
wget http://www.junopen.com/memadmin/memadmin-1.0.12.tar.gz
tar zxf memadmin-1.0.12.tar.gz -C /usr/local/apache/htdocs/
ll /usr/local/apache/htdocs/memadmin/
2. Log in to memadmin.
Open it in a browser: http://<IP address>/memadmin/
The default username and password are both admin.
Deployment notes: GUI management of Memcached with the memadmin PHP tool

2. Operating Memcached from Python

1> Installing the API and basic operations
Python operates Memcached through the python-memcached module.
Download and install: https://pypi.python.org/pypi/python-memcached
import memcache
mc = memcache.Client(['192.168.1.5:12000'], debug=True)
mc.set("foo", "bar")
ret = mc.get('foo')
print(ret)
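2> Native cluster support

The python-memcached module supports clusters natively. In essence it keeps a host list in memory. The number next to each host is its weight: a host with weight 3 appears three times in the list, so it is proportionally more likely to be chosen.

mc = memcache.Client([
    ('192.168.1.5:12000', 3),  # the number is the weight
    ('192.168.1.9:12000', 1),
], debug=True)
# The in-memory host list is then:
# host_list = ["192.168.1.5", "192.168.1.5", "192.168.1.5", "192.168.1.9"]
So how is a server chosen for storage in a cluster?

To set a key-value pair (for example k1 = "v1"), the flow is as follows:

1. Convert k1 to a number.
2. Take that number modulo the length of the host list to get a value N (0 <= N < list length).
3. Use the N from step 2 as an index into the host list to get a host, for example host_list[N].
4. Connect to the host obtained in step 3 and store k1 = "v1" in that server's memory.

Retrieving a value works the same way.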
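The four steps above can be sketched end to end. This is an illustration of the described flow, not the module's source: `build_host_list`, `key_to_number`, and `pick_host` are hypothetical names, and the hash is the CRC32-based expression that this article uses for converting a string to a number.

```python
import binascii


def build_host_list(servers):
    """Repeat each host as many times as its weight."""
    host_list = []
    for host, weight in servers:
        host_list.extend([host] * weight)
    return host_list


def key_to_number(key):
    """Step 1: convert the key to a number with a CRC32-based hash."""
    crc = binascii.crc32(key.encode("utf-8")) & 0xffffffff
    return ((crc >> 16) & 0x7fff) or 1


def pick_host(key, host_list):
    """Steps 2 and 3: take the number modulo the list length and index the list."""
    return host_list[key_to_number(key) % len(host_list)]
```

Because the hash depends only on the key, a later `get` for the same key computes the same index and reaches the same server, as long as the host list has not changed.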

#!/usr/bin/env python
#-*- coding:utf-8 -*-
__author__ = 'Nick Suo'
import binascii
str_input = 'suoning'
str_bytes = bytes(str_input, encoding='utf-8')
# CRC32 of the key, reduced to 15 bits; 0 is mapped to 1
num = (((binascii.crc32(str_bytes) & 0xffffffff) >> 16) & 0x7fff) or 1
print(num)
Source: converting a string to a number

3> add

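add stores a key-value pair only if the key does not already exist. Calling add again for an existing key fails: python-memcached reports the failure through the return value, and with debug=True it also logs an error message.

import memcache
mc = memcache.Client(['192.168.1.5:12000'], debug=True)
mc.add('k1', 'v1')
# mc.add('k1', 'v2')  # fails: the key already exists, so the value is not stored
4> replace

replace changes the value of an existing key. If the key does not exist, the call fails.

import memcache
mc = memcache.Client(['192.168.1.5:12000'], debug=True)
# Succeeds if kkkk exists in memcache; otherwise the call fails
mc.replace('kkkk','999')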
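The difference between add and replace comes down to one condition each. The sketch below is an in-memory stand-in written for illustration only; `FakeCache` is hypothetical and does not talk to a Memcached server. It includes an unconditional `set` as a baseline for comparison.

```python
class FakeCache:
    """In-memory stand-in (hypothetical) mimicking the add / replace / set rules."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value       # always stores, creating or overwriting
        return True

    def add(self, key, value):
        if key in self.data:         # refuses to overwrite an existing key
            return False
        self.data[key] = value
        return True

    def replace(self, key, value):
        if key not in self.data:     # refuses to create a missing key
            return False
        self.data[key] = value
        return True
```

In this stand-in, a second `add('k1', 'v2')` returns False and leaves `v1` in place, while `replace('kkkk', '999')` returns False when `kkkk` was never stored.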
5> set and set_multi

set stores one key-value pair, creating the key if it does not exist.

set_multi stores several key-value pairs, creating any key that does not exist.

import memcache
mc = memcache.Client(['192.168.1.5:12000'], debug=True)
mc.set('name', 'nick')
mc.set_multi({'name': 'nick', 'age': '18'})
6> delete and delete_multi

delete removes one specified key-value pair.

delete_multi removes several specified key-value pairs.

import memcache
mc = memcache.Client(['192.168.1.5:12000'], debug=True)
mc.delete('name')                  # delete takes the key only
mc.delete_multi(['name', 'age'])   # delete_multi takes the keys to remove
7> get and get_multi

get retrieves one key-value pair.

get_multi retrieves several key-value pairs.

import memcache
mc = memcache.Client(['192.168.1.5:12000'], debug=True)
val = mc.get('name')
item_dict = mc.get_multi(["name", "age",])
8> append and prep

↧

The Next Generation of Apache Hadoop

September 1, 2016, 5:46 pm


Apache Hadoop turned ten this year. To celebrate, Karthik and I gave a talk at USENIX ATC '16 about open problems to solve in Hadoop's second decade. This was an opportunity to revisit our academic roots and get a new crop of graduate students interested in the real distributed systems problems we're trying to solve in industry.

This is a huge topic and we only had a 25 minute talk slot, so we were pitching problems rather than solutions. However, we did have some ideas in our back pocket, and the hallway track and birds-of-a-feather we hosted afterwards led to a lot of good discussion.

Karthik and I split up the content thematically, which worked really well. I covered scalability, meaning sharded filesystems and federated resource management. Karthik addressed scheduling (unifying batch jobs and long-running services) and utilization (overprovisioning, preemption, isolation).

I'm hoping to give this talk again in longer form, since I'm proud of the content.

Slides: pptx

USENIX site with PDF slides and audio

Talking big ideas like this with Karthik also made me nostalgic for graduate school. Karthik is one of the most impressive people I know; I thought he'd left graduate school for Cloudera like me, but he's actually been working on his PhD nights and weekends! While we were prepping this presentation for ATC, he was also working on a submission for SoCC, and is apparently close to graduating.

↧