
1.现象描述
外界连接 9090 端口均超时,但 telnet 端口总是成功。使用 top 命令观察,发现单个线程的 CPU 最高达 99.99% ,但并不总是 99.9% ,而是在波动。当迁走往该机器的流量后,能够访问成功,但仍然有超时,读超时比写超时多:
#./hbase_stress--hbase=110.13.136.207:9090--test=2--timeout=10
[2016-11-2710:15:21/771][139756154767104/31562][ERROR][hbase_stress.cpp:302]TransportException(thrift://110.13.136.207:9090):EAGAIN(timedout) [2016-11-2710:15:31/775][139756154767104/31562][ERROR][hbase_stress.cpp:302]TransportException(thrift://110.13.136.207:9090):EAGAIN(timedout)PIDUSERPRNIVIRTRESSHRS%CPU%MEMTIME+COMMAND
20727 zhangsan20010.843g9.263g26344R 99.9 26.41448:00java
20729zhangsan20010.843g9.263g26344R99.926.41448:00java
20730zhangsan20010.843g9.263g26344R99.926.41449:10java
20728zhangsan20010.843g9.263g26344R99.826.41448:00java
20693zhangsan20010.843g9.263g26344S0.026.40:00.00java
20727 zhangsan20010.843g9.263g26344R 75.5 26.41448:06java
20728zhangsan20010.843g9.263g26344R75.226.41448:06java
20729zhangsan20010.843g9.263g26344R75.226.41448:06java
20730zhangsan20010.843g9.263g26344R75.226.41449:15java
20716zhangsan20010.843g9.263g26344S24.926.493:48.75java
2.问题定位使用 ps 命令找出 CPU 最多的线程,和 top 显示的一致:
$ps-mp20693-oTHREAD,tid,time|sort-rn
zhangsan18.819----207301-00:11:23
zhangsan18.719----207291-00:10:13
zhangsan18.719----207281-00:10:13
zhangsan18.719----207271-00:10:13
zhangsan16.119-futex_--2073120:44:51
zhangsan5.219-futex_--2073206:46:39
然后借助 jstack ,发现为 GC 进程:
"Gangworker#0(ParallelCMSThreads)"os_prio=0tid=0x00007fb7200d4000nid=0x50f7runnable
"Gangworker#1(ParallelCMSThreads)"os_prio=0tid=0x00007fb7200d5800nid=0x50f8runnable
"Gangworker#2(ParallelCMSThreads)"os_prio=0tid=0x00007fb7200d7800nid=0x50f9runnable
"Gangworker#3(ParallelCMSThreads)"os_prio=0tid=0x00007fb7200d9000nid=0x50farunnable
使用 jstat 工具查看 GC ,情况很不乐观,问题就是有 GC 引起的:
$jstat-gcutil206931000100
S0S1EOMCCSYGCYGCTFGCFGCTGCT
0.0099.67100.00100.0098.0894.4142199369.1322708434869.60135238.733
0.0099.67100.00100.0098.0894.4142199369.1322708434870.44835239.580
0.0099.67100.00100.0098.0894.4142199369.1322708434870.44835239.580
0.0099.67100.00100.0098.0894.4142199369.1322708434870.44835239.580
$jstat-gccapacity20693
NGCMNNGCMXNGCS0CS1CECOGCMNOGCMXOGC OC MCMNMCMXMCCCSMNCCSMXCCSC YGC FGC
191808.01107520.01107520.0110720.0110720.0886080.0383680.08094144.08094144.0 8094144 .00.01077248.031584.00.01048576.03424.0 42199 27156
$jstat-gcold20693
MCMUCCSCCCSUOCOUYGCFGCFGCTGCT
31584.030978.73424.03232.78094144.08094144.0421992717434964.34735333.479
$jstat-gcoldcapacity20693
OGCMNOGCMXOGCOCYGCFGCFGCTGCT
383680.08094144.08094144.08094144.0421992719234982.62335351.755
$jstat-gcnewcapacity20693
NGCMNNGCMXNGCS0CMXS0CS1CMXS1CECMXECYGCFGC
191808.01107520.01107520.0110720.0110720.0110720.0110720.0886080.0886080.04219927202
$jstat-gc20693
S0CS1CS0US1UECEUOCOUMCMUCCSCCCSUYGCYGCTFGCFGCTGCT
110720.0110720.00.0110395.9886080.0886080.08094144.08094144.031584.030978.73424.03232.742199369.1322720634996.53835365.671
$jstat-gcnew20693
S0CS1CS0US1UTTMTTDSSECEUYGCYGCT
110720.0110720.00.0110396.96655360.0886080.0886080.042199369.132
使用 lsof 显示该进程的连接数不多,完全在安全范围内,问题应当是有对象不能被回收。使用 jmap 查看内存详情,先看堆的使用情况:
$jmap-heap20693
AttachingtoprocessID20693,pleasewait...
Debuggerattachedsuccessfully.
Servercompilerdetected.
JVMversionis25.77-b03
usingparallelthreadsinthenewgeneration.
usingthread-localobjectallocation.
ConcurrentMark-SweepGC
HeapConfiguration:
MinHeapFreeRatio=40
MaxHeapFreeRatio=70
MaxHeapSize=9422503936(8986.0MB)
NewSize=196411392(187.3125MB)
MaxNewSize=1134100480(1081.5625MB)
OldSize=392888320(374.6875MB)
NewRatio=2
SurvivorRatio=8
MetaspaceSize=21807104(20.796875MB)
CompressedClassSpaceSize=1073741824(1024.0MB)
MaxMetaspaceSize=17592186044415MB
G1HeapRegionSize=0(0.0MB)
HeapUsage:
NewGeneration(Eden+1SurvivorSpace):
capacity=1020723200(973.4375MB)
used=1020398064(973.1274261474609MB)
free=325136(0.3100738525390625MB)
99.96814650632022%used
EdenSpace:
capacity=907345920(865.3125MB)
used=907345920(865.3125MB)
free=0(0.0MB)
100.0%used
FromSpace:
capacity=113377280(108.125MB)
used=113052144(107.81492614746094MB)
free=325136(0.3100738525390625MB)
99.71322649476156%used
ToSpace:
capacity=113377280(108.125MB)
used=0(0.0MB)
free=113377280(108.125MB)
0.0%used
concurrentmark-sweepgeneration:
capacity=8288403456(7904.4375MB)
used=8288403424(7904.437469482422MB)
free=32(3.0517578125E-5MB)
99.9999996139184%used
10216interned