Off-heaping the Apache HBase Read Path
HBASE-11425
Anoop Sam John
Ramkrishna S Vasudevan
Intel Big Data Team – Bangalore, India
Overview
 An L2 off-heap cache allows a much larger cache size.
 Not constrained by the possible issues of a large Java max heap.
 Backed by 4 MB physical memory buffers.
 Buckets come in different sizes: 5 KB, 9 KB, … 513 KB. Each bucket has at least 4 slots.
 HFile blocks are placed in the appropriately sized bucket.
 One block may span across 2 ByteBuffers.
 The read path assumes data lives in a byte array.
 Cells assume their data parts (row key, family, value, etc.) are in byte arrays.
 A read hitting a block in the cache needs an on-heap copy of that block.
 A temporary 64 KB array is created and copied into: more garbage.
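The bucket layout above can be sketched in a few lines. This is a hypothetical illustration (class and method names are ours, not HBase's actual BucketAllocator): a 4 MB direct buffer is carved into fixed-size slots for one size class, here the 5 KB bucket.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of carving one 4 MB off-heap buffer into
// fixed-size slots for a single bucket size class (5 KB here).
// Names are hypothetical; HBase's real allocator is BucketAllocator.
public class BucketSketch {
    static final int BUFFER_SIZE = 4 * 1024 * 1024; // one 4 MB physical buffer
    static final int SLOT_SIZE = 5 * 1024;          // one bucket size class

    // Slice the backing buffer into as many whole slots as fit.
    static List<ByteBuffer> carveSlots(ByteBuffer backing, int slotSize) {
        List<ByteBuffer> slots = new ArrayList<>();
        for (int off = 0; off + slotSize <= backing.capacity(); off += slotSize) {
            ByteBuffer dup = backing.duplicate();
            dup.position(off).limit(off + slotSize);
            slots.add(dup.slice()); // independent view over [off, off + slotSize)
        }
        return slots;
    }

    public static void main(String[] args) {
        ByteBuffer backing = ByteBuffer.allocateDirect(BUFFER_SIZE);
        List<ByteBuffer> slots = carveSlots(backing, SLOT_SIZE);
        // 4 MB / 5 KB leaves 819 whole slots, well above the 4-slot minimum
        System.out.println(slots.size());
    }
}
```

Each slot is a zero-copy view over the same physical buffer, which is why an HFile block that does not fit one slot can end up spanning two ByteBuffers.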
[Diagram] Read from Bucket Cache (before HBASE-11425): read requests to Region1 and Region2 in the HRegionServer hit the off-heap Bucket Cache; each hit copies the block into an on-heap HFileBlock before the scanner layers produce the read responses.
[Diagram] Off-Heap Read Path from Bucket Cache: the scanner layers in the HRegionServer serve Region1 and Region2 read requests directly against the off-heap Bucket Cache, with no on-heap block copy: end to end off heap, from the bucket cache till the RPC.
Off Heap Data Structure
 Selection of a data structure for off-heap storage:
 During reads, individual Cell components are parsed multiple times.
 Cells are frequently compared to maintain proper ordering.
 The bucket cache uses NIO DirectByteBuffers for the off-heap cache.
 JMH benchmark: NIO vs Netty.
 Tests read int, long and bytes from an NIO ByteBuffer and a Netty ByteBuf.
 Also tested with Unsafe-based reads.
 Conclusion: continue with the existing NIO DirectByteBuffer based buckets in BucketCache.

Benchmark      Mode    Score          Error            Units
nettyOffheap   thrpt   57366360.944   ± 11533933.769   ops/s
nioOffheap     thrpt   60089837.738   ± 14171768.229   ops/s

Benchmark      Mode    Score          Error            Units
nettyOffheap   thrpt   83613659.416   ± 535211.991     ops/s
nioOffheap     thrpt   84514777.734   ± 1199369.976    ops/s
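The inner loop those benchmarks exercise looks roughly like the following. This is a minimal, dependency-free sketch, not the actual JMH harness: it shows absolute int reads against an off-heap (direct) NIO ByteBuffer; the real comparison also covered Netty's ByteBuf and Unsafe-based reads.

```java
import java.nio.ByteBuffer;

// Minimal sketch of the access pattern the JMH benchmark measured:
// absolute primitive reads against an off-heap (direct) NIO ByteBuffer.
public class OffheapReadSketch {
    static long sumInts(ByteBuffer buf, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += buf.getInt(i * Integer.BYTES); // absolute read, position unchanged
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(1024); // off-heap buffer
        for (int i = 0; i < 256; i++) {
            buf.putInt(i * Integer.BYTES, i);
        }
        System.out.println(sumInts(buf, 256)); // 0 + 1 + ... + 255 = 32640
    }
}
```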
Building Blocks for Off Heaping
 Cellify the read path: HBASE-7320, HBASE-11871, HBASE-11805.
 Cells flow through the read path.
 Move away from the KeyValue assumption.
 HFile blocks backed by a ByteBuffer rather than a byte[].
 Remove all byte[] assumptions in seeking, encoding, etc.
 Cell extension:
 Support ByteBuffer-backed getXXX APIs.
 Added the Cell extension ByteBufferedCell, exposed within the server only.
 Off-heap backed ByteBufferedCells are created when reading blocks from the off-heap bucket cache.
 getXXXArray() calls on off-heap buffer backed Cells work via a temporary byte[] copy: more garbage.
 CellUtil APIs for operations like equals and copy check for ByteBufferedCell.
 Coprocessors and custom filters are advised to use these APIs.
Note
 Filter#filterRowKey(byte[] buffer, int offset, int length) is deprecated in favor of filterRowKey(Cell firstRowCell).
 RegionObserver#postScannerFilterRow(ObserverContext<RegionCoprocessorEnvironment>, InternalScanner, byte[], int, short, boolean) is deprecated in favor of postScannerFilterRow(ObserverContext<RegionCoprocessorEnvironment>, InternalScanner, Cell, boolean).
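The motivation for the Cell extension can be shown with a toy version. This is an illustrative sketch, not HBase's actual ByteBufferedCell API: a cell whose row key lives in an off-heap buffer exposes buffer-based accessors so callers can avoid the temporary byte[] copy that an Array-style getter forces.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy interface (names are ours, not HBase's) showing why ByteBuffer-backed
// getters matter: the buffer accessors read off-heap memory in place, while
// the Array-style getter must materialize an on-heap copy (extra garbage).
interface BufferBackedCell {
    ByteBuffer getRowByteBuffer();
    int getRowPosition();
    int getRowLength();

    // Legacy-style accessor: forces an on-heap copy of the row key.
    default byte[] getRowArray() {
        byte[] copy = new byte[getRowLength()];
        ByteBuffer dup = getRowByteBuffer().duplicate();
        dup.position(getRowPosition());
        dup.get(copy);
        return copy;
    }
}

public class CellSketch implements BufferBackedCell {
    private final ByteBuffer buf;
    private final int pos, len;

    CellSketch(ByteBuffer buf, int pos, int len) {
        this.buf = buf; this.pos = pos; this.len = len;
    }
    public ByteBuffer getRowByteBuffer() { return buf; }
    public int getRowPosition() { return pos; }
    public int getRowLength() { return len; }

    public static void main(String[] args) {
        ByteBuffer offheap = ByteBuffer.allocateDirect(16);
        offheap.put("row-0001".getBytes(StandardCharsets.UTF_8));
        CellSketch cell = new CellSketch(offheap, 0, 8);
        System.out.println(new String(cell.getRowArray(), StandardCharsets.UTF_8));
    }
}
```

This is the trade-off the deprecation note above is about: the Cell-taking overloads let filters and coprocessors stay on the buffer path instead of demanding a byte[].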
Building Blocks for Off Heaping
 KVComparator -> CellComparator: HBASE-10800, HBASE-13500.
 JMH benchmark: off-heap buffer compare vs byte[] compare.
 Using the Unsafe way of comparing.
 Each buffer holds 135 bytes.
 Both buffers are equal (so every byte is compared).
 No performance overhead when comparing off-heap backed cells.

Benchmark        Mode    Score          Error          Units
offheapCompare   thrpt   38205893.545   ± 265309.769   ops/s
onheapCompare    thrpt   37166847.740   ± 430242.970   ops/s
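The operation being benchmarked is a lexicographic, unsigned byte-wise compare between an off-heap buffer region and an on-heap byte[]. The sketch below uses plain ByteBuffer reads to stay dependency-free; HBase's actual comparator goes through Unsafe for speed.

```java
import java.nio.ByteBuffer;

// Sketch of the compare the benchmark exercises: unsigned, byte-wise
// lexicographic comparison of an off-heap buffer region against a byte[].
// Plain ByteBuffer.get() is used here; HBase's real code path uses Unsafe.
public class CompareSketch {
    static int compareTo(ByteBuffer buf, int bufPos, byte[] arr, int arrPos, int len) {
        for (int i = 0; i < len; i++) {
            int a = buf.get(bufPos + i) & 0xFF; // mask for unsigned comparison
            int b = arr[arrPos + i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] key = {1, 2, 3, 4};
        ByteBuffer offheap = ByteBuffer.allocateDirect(4);
        offheap.put(key).flip();
        System.out.println(compareTo(offheap, 0, key, 0, 4)); // equal regions -> 0
    }
}
```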
NIO ByteBuffer Wrapper
 An HFile block's data might be split across 2 ByteBuffers.
 To avoid a copy, we need a single data structure that can be backed by N ByteBuffers.
 Java's NIO ByteBuffer is not extendable (its constructors are package-private).
 Wrapper class org.apache.hadoop.hbase.nio.ByteBuff, with two implementations:
 org.apache.hadoop.hbase.nio.SingleByteBuff
 org.apache.hadoop.hbase.nio.MultiByteBuff
 The HFile block's data structure type was changed to ByteBuff.
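A toy version of the MultiByteBuff idea, assuming only absolute single-byte reads (the real org.apache.hadoop.hbase.nio.MultiByteBuff covers the full ByteBuffer-like API): N backing buffers are presented as one logical buffer, so a block split across two bucket buffers can be read without copying.

```java
import java.nio.ByteBuffer;

// Toy MultiByteBuff: presents several ByteBuffers as one logical buffer.
// Only absolute get(int) is sketched; class and field names are illustrative.
public class MultiBuffSketch {
    private final ByteBuffer[] items;
    private final int[] ends; // cumulative end offset of each backing buffer

    MultiBuffSketch(ByteBuffer... items) {
        this.items = items;
        this.ends = new int[items.length];
        int sum = 0;
        for (int i = 0; i < items.length; i++) {
            sum += items[i].remaining();
            ends[i] = sum;
        }
    }

    byte get(int index) {
        int i = 0;
        while (index >= ends[i]) { // locate the backing buffer holding index
            i++;
        }
        int start = (i == 0) ? 0 : ends[i - 1];
        return items[i].get(items[i].position() + index - start);
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.wrap(new byte[]{10, 11});
        ByteBuffer b = ByteBuffer.wrap(new byte[]{12, 13});
        MultiBuffSketch mbb = new MultiBuffSketch(a, b);
        System.out.println(mbb.get(3)); // logical index 3 lands in the second buffer
    }
}
```

SingleByteBuff is the degenerate single-buffer case, kept separate so the common path pays no per-access buffer-lookup cost.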
Bucket Cache Block Eviction
 BucketCache evicts blocks and frees their buckets when out of space.
 Before, any block could be evicted, since readers copied block data to a temporary byte[].
 After HBASE-11425, readers refer to the bucket memory area directly.
 Only unreferenced blocks can be evicted.
 Reference-count based block cache and block eviction:
 Increment the ref count when a reader hits a block in the L2 cache.
 Decrement once the response is created for the RPC.
 Evict if/when the ref count = 0.
The decrement travels down the call chain once the response ships:
Call#setResponse -> RpcCallback#run -> RegionScanner#shipped -> KeyValueHeap#shipped -> StoreScanner#shipped -> KeyValueHeap#shipped -> StoreFileScanner#shipped -> HFileScanner#shipped -> HFile.Reader#returnBlock -> BlockCache#returnBlock (decrements the ref count)
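The protocol above can be sketched as follows (names are illustrative, not HBase's classes): retain on a cache hit, release once the RPC response is built, and allow eviction only at count zero. A real implementation must also guard against a retain racing with the eviction check, which this sketch omits.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of ref-count based eviction. A reader retains the block on an
// L2 cache hit, the shipped()/returnBlock chain releases it after the
// response is built, and only an unreferenced block may be evicted.
public class RefCountedBlock {
    private final AtomicInteger refCount = new AtomicInteger(0);

    void retain() {      // reader hits the block in the L2 cache
        refCount.incrementAndGet();
    }

    void release() {     // response created; BlockCache#returnBlock
        refCount.decrementAndGet();
    }

    boolean tryEvict() { // eviction is allowed only when unreferenced
        return refCount.get() == 0;
    }

    public static void main(String[] args) {
        RefCountedBlock block = new RefCountedBlock();
        block.retain();
        System.out.println(block.tryEvict()); // false: a scanner still refers to it
        block.release();
        System.out.println(block.tryEvict()); // true: safe to free the bucket slot
    }
}
```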
Complete Picture
[Diagram] End to end off heap, from bucket cache till RPC: Region1 and Region2 read requests in the HRegionServer increment the block ref count (Refcount++) on each off-heap Bucket Cache hit; the scanner layers operate on SingleByteBuff/MultiByteBuff views; after the read responses are sent, callbacks decrement the count (Refcount--).
Performance Test Results
 PerformanceEvaluation tool (PE).
 Table with one CF and one cell per row; 100 GB total data; each row with a 1 KB value.
 The entire data set is loaded into the bucket cache.
 Single-node cluster.
 CPU: Intel(R) Xeon(R) with 8 cores. RAM: 150 GB.
 JDK: 1.8.
 HBase configuration:
 HBASE_HEAPSIZE = 9 GB
 HBASE_OFFHEAPSIZE = 105 GB
 hbase.bucketcache.size = 102 GB
 GC: default HBase GC settings (CMS).
 Multi-get with 100 rows per operation.
 Every thread does 100 K operations = 10 million rows fetched per thread.
 Measured: average completion time of each thread, in seconds.
 Converted to throughput: a gain of 102% - 460%.

HBase Random GET, average completion time in seconds (lower is better):

Threads              5        10       20       25       50       75
Before HBASE-11425   89.38    139.81   285.66   361.23   817.91   1372.81
After HBASE-11425    44.04    50.55    70.23    88.6     165.4    244.72
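The quoted gain range follows from the table: with fixed work per thread, throughput is inversely proportional to completion time, so the gain is before/after - 1.

```java
// Recomputing the throughput gain from the completion times in the
// table above: gain = before/after - 1, since work per thread is fixed.
public class GainCalc {
    static double gainPercent(double beforeSecs, double afterSecs) {
        return (beforeSecs / afterSecs - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("5 threads:  %.0f%%%n", gainPercent(89.38, 44.04));
        System.out.printf("75 threads: %.0f%%%n", gainPercent(1372.81, 244.72));
    }
}
```

This yields roughly 103% at 5 threads and 461% at 75 threads, matching the 102% - 460% range quoted on the slide up to rounding.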
Performance Test Results
 PerformanceEvaluation tool (PE).
 Random range scans over a 10 K range.
 With a filterAll filter (no data returned to the client), so only the server side is measured.
 Each thread does the range scan 1000 times.
 The entire data set is loaded into the bucket cache.

Range scan, server side only, average completion time in seconds (lower is better):

Threads              10       20       25       50
Before HBASE-11425   449.1    728.64   908.26   1904.93
After HBASE-11425    319.87   451.58   560.46   1158
Performance Test Results
 PerformanceEvaluation tool (PE).
 Random range scans over a 10 K range.
 Returning 10% of the rows back to the client.
 Each thread does the range scan 1000 times.
 The entire data set is loaded into the bucket cache.

HBase range scan with filter, average completion time in seconds (lower is better):

Threads              10       20       25       50
Before HBASE-11425   449.1    728.64   908.26   1904.93
After HBASE-11425    319.87   451.58   560.46   1158
Performance Test Results
 YCSB test.
 Table with one CF and 10 columns per row; each row with a 1 KB value; 90 GB total data.
 The entire data set is loaded into the bucket cache.
 Single-node cluster.
 CPU: Intel(R) Xeon(R) with 8 cores. RAM: 150 GB.
 JDK: 1.8.
 HBase configuration:
 HBASE_HEAPSIZE = 9 GB
 HBASE_OFFHEAPSIZE = 105 GB
 hbase.bucketcache.size = 102 GB
 Multi-get with 100 rows per operation.
 Every thread does 5 million operations.
 20 - 160% improvement.

YCSB Random GET throughput (higher is better):

Threads              10         25         50         75
Before HBASE-11425   23277.97   25922.18   24558.72   24316.74
After HBASE-11425    28045.53   45767.99   58904.03   63280.86
Performance Test Results
 PE test comparing the on-heap L1 cache vs the off-heap L2 cache, with 20 GB of data.
 Multi-get with 100 rows per operation.
 The entire data set is loaded into the bucket cache.
 Each thread does 10 million operations = 1 billion rows fetched per thread.
 L1 test: max heap 32 GB. L2 test: max heap 12 GB.

HBase Random GET, average completion time in seconds (lower is better):

Threads    10      25      50       75
L1 cache   300.5   559.3   1195.9   1793.1
L2 cache   307.6   523.9   1144.2   1707.6
GC Graphs
[GC graphs] MultiGets: before HBASE-11425 (25 threads) vs after HBASE-11425 (25 threads).
[GC graphs] ScanRange10000: before HBASE-11425 (20 threads) vs after HBASE-11425 (20 threads).
Feature Availability
 The feature will be available in the HBase 2.0 release.
 Make the bucket cache the default in HBase 2.0: refer to HBASE-11323.
 Rocketfuel has started using this for a random-read workload, backported to a 1.x based version.
 More details:
 https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in
Future Work & QA
 Future work:
 Off-heaping the write path: HBASE-11579.
 Off-heap MSLAB pool.
 Read request bytes into an off-heap buffer pool.
 Lazy creation of ByteBuffer pools.
 Fixed-size off-heap ByteBuffers from the pool.
 Protobuf changes to handle off-heap ByteBuffers.
 In-memory flushes/compaction (HBASE-14918), from Yahoo.
Questions?


Editor's Notes

  • #6 https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in
  • #7 postScannerFilterRow
  • #8 Temp array of 64 KB creation and copy. Typically only about 20% of the heap is left beyond the memstore and block cache.
  • #11 Same slide, to show the end-to-end picture after the explanation.