Off-heaping the Apache HBase Read Path
HBASE-11425
Anoop Sam John
Ramkrishna S Vasudevan
Intel Big Data Team – Bangalore, India
Overview
 An L2 off-heap cache allows a much larger cache size.
 Not constrained by the possible issues of a large Java max heap.
 Backed by 4 MB physical memory buffers.
 Buckets come in different sizes: 5 KB, 9 KB, … 513 KB. Each bucket has at least 4 slots.
 HFile blocks are placed in the appropriately sized bucket.
 One block may span across 2 ByteBuffers.
 The read path assumes data lives in a byte array.
 Cells assume their data parts (row key, family, value, etc.) are in byte arrays.
 A read hitting a block in the cache needs an on-heap copy of that block.
 A temporary 64 KB array is created and copied into: more garbage.
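The bucket layout above can be sketched in a few lines. This is a hypothetical illustration (class and method names are ours, not HBase's actual BucketAllocator): a 4 MB direct buffer is carved into fixed-size slots for one size class, here the 5 KB bucket.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of carving one 4 MB off-heap buffer into
// fixed-size slots for a single bucket size class (5 KB here).
// Names are hypothetical; HBase's real allocator is BucketAllocator.
public class BucketSketch {
    static final int BUFFER_SIZE = 4 * 1024 * 1024; // one 4 MB physical buffer
    static final int SLOT_SIZE = 5 * 1024;          // one bucket size class

    // Slice the backing buffer into as many whole slots as fit.
    static List<ByteBuffer> carveSlots(ByteBuffer backing, int slotSize) {
        List<ByteBuffer> slots = new ArrayList<>();
        for (int off = 0; off + slotSize <= backing.capacity(); off += slotSize) {
            ByteBuffer dup = backing.duplicate();
            dup.position(off).limit(off + slotSize);
            slots.add(dup.slice()); // independent view over [off, off + slotSize)
        }
        return slots;
    }

    public static void main(String[] args) {
        ByteBuffer backing = ByteBuffer.allocateDirect(BUFFER_SIZE);
        List<ByteBuffer> slots = carveSlots(backing, SLOT_SIZE);
        // 4 MB / 5 KB leaves 819 whole slots, well above the 4-slot minimum
        System.out.println(slots.size());
    }
}
```

Each slot is a zero-copy view over the same physical buffer, which is why an HFile block that does not fit one slot can end up spanning two ByteBuffers.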
[Diagram] Read from Bucket Cache (before HBASE-11425): read requests to Region1 and Region2 in the HRegionServer hit the off-heap Bucket Cache; each hit copies the block into an on-heap HFileBlock before the scanner layers produce the read responses.
[Diagram] Off-Heap Read Path from Bucket Cache: the scanner layers in the HRegionServer serve Region1 and Region2 read requests directly against the off-heap Bucket Cache, with no on-heap block copy: end to end off heap, from the bucket cache till the RPC.
Off Heap Data Structure
 Selection of a data structure for off-heap storage:
 During reads, individual Cell components are parsed multiple times.
 Cells are frequently compared to maintain proper ordering.
 The bucket cache uses NIO DirectByteBuffers for the off-heap cache.
 JMH benchmark: NIO vs Netty.
 Tests read int, long and bytes from an NIO ByteBuffer and a Netty ByteBuf.
 Also tested with Unsafe-based reads.
 Conclusion: continue with the existing NIO DirectByteBuffer based buckets in BucketCache.

Benchmark      Mode    Score          Error            Units
nettyOffheap   thrpt   57366360.944   ± 11533933.769   ops/s
nioOffheap     thrpt   60089837.738   ± 14171768.229   ops/s

Benchmark      Mode    Score          Error            Units
nettyOffheap   thrpt   83613659.416   ± 535211.991     ops/s
nioOffheap     thrpt   84514777.734   ± 1199369.976    ops/s
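The inner loop those benchmarks exercise looks roughly like the following. This is a minimal, dependency-free sketch, not the actual JMH harness: it shows absolute int reads against an off-heap (direct) NIO ByteBuffer; the real comparison also covered Netty's ByteBuf and Unsafe-based reads.

```java
import java.nio.ByteBuffer;

// Minimal sketch of the access pattern the JMH benchmark measured:
// absolute primitive reads against an off-heap (direct) NIO ByteBuffer.
public class OffheapReadSketch {
    static long sumInts(ByteBuffer buf, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += buf.getInt(i * Integer.BYTES); // absolute read, position unchanged
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(1024); // off-heap buffer
        for (int i = 0; i < 256; i++) {
            buf.putInt(i * Integer.BYTES, i);
        }
        System.out.println(sumInts(buf, 256)); // 0 + 1 + ... + 255 = 32640
    }
}
```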
Building Blocks for Off Heaping
 Cellify the read path: HBASE-7320, HBASE-11871, HBASE-11805.
 Cells flow through the read path.
 Move away from the KeyValue assumption.
 HFile blocks backed by a ByteBuffer rather than a byte[].
 Remove all byte[] assumptions in seeking, encoding, etc.
 Cell extension:
 Support ByteBuffer-backed getXXX APIs.
 Added the Cell extension ByteBufferedCell, exposed within the server only.
 Off-heap backed ByteBufferedCells are created when reading blocks from the off-heap bucket cache.
 getXXXArray() calls on off-heap buffer backed Cells work via a temporary byte[] copy: more garbage.
 CellUtil APIs for operations like equals and copy check for ByteBufferedCell.
 Coprocessors and custom filters are advised to use these APIs.
Note
 Filter#filterRowKey(byte[] buffer, int offset, int length) is deprecated in favor of filterRowKey(Cell firstRowCell).
 RegionObserver#postScannerFilterRow(ObserverContext<RegionCoprocessorEnvironment>, InternalScanner, byte[], int, short, boolean) is deprecated in favor of postScannerFilterRow(ObserverContext<RegionCoprocessorEnvironment>, InternalScanner, Cell, boolean).
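The motivation for the Cell extension can be shown with a toy version. This is an illustrative sketch, not HBase's actual ByteBufferedCell API: a cell whose row key lives in an off-heap buffer exposes buffer-based accessors so callers can avoid the temporary byte[] copy that an Array-style getter forces.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy interface (names are ours, not HBase's) showing why ByteBuffer-backed
// getters matter: the buffer accessors read off-heap memory in place, while
// the Array-style getter must materialize an on-heap copy (extra garbage).
interface BufferBackedCell {
    ByteBuffer getRowByteBuffer();
    int getRowPosition();
    int getRowLength();

    // Legacy-style accessor: forces an on-heap copy of the row key.
    default byte[] getRowArray() {
        byte[] copy = new byte[getRowLength()];
        ByteBuffer dup = getRowByteBuffer().duplicate();
        dup.position(getRowPosition());
        dup.get(copy);
        return copy;
    }
}

public class CellSketch implements BufferBackedCell {
    private final ByteBuffer buf;
    private final int pos, len;

    CellSketch(ByteBuffer buf, int pos, int len) {
        this.buf = buf; this.pos = pos; this.len = len;
    }
    public ByteBuffer getRowByteBuffer() { return buf; }
    public int getRowPosition() { return pos; }
    public int getRowLength() { return len; }

    public static void main(String[] args) {
        ByteBuffer offheap = ByteBuffer.allocateDirect(16);
        offheap.put("row-0001".getBytes(StandardCharsets.UTF_8));
        CellSketch cell = new CellSketch(offheap, 0, 8);
        System.out.println(new String(cell.getRowArray(), StandardCharsets.UTF_8));
    }
}
```

This is the trade-off the deprecation note above is about: the Cell-taking overloads let filters and coprocessors stay on the buffer path instead of demanding a byte[].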
Building Blocks for Off Heaping
 KVComparator -> CellComparator: HBASE-10800, HBASE-13500.
 JMH benchmark: off-heap buffer compare vs byte[] compare.
 Using the Unsafe way of comparing.
 Each buffer holds 135 bytes.
 Both buffers are equal (so every byte is compared).
 No performance overhead when comparing off-heap backed cells.

Benchmark        Mode    Score          Error          Units
offheapCompare   thrpt   38205893.545   ± 265309.769   ops/s
onheapCompare    thrpt   37166847.740   ± 430242.970   ops/s
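The operation being benchmarked is a lexicographic, unsigned byte-wise compare between an off-heap buffer region and an on-heap byte[]. The sketch below uses plain ByteBuffer reads to stay dependency-free; HBase's actual comparator goes through Unsafe for speed.

```java
import java.nio.ByteBuffer;

// Sketch of the compare the benchmark exercises: unsigned, byte-wise
// lexicographic comparison of an off-heap buffer region against a byte[].
// Plain ByteBuffer.get() is used here; HBase's real code path uses Unsafe.
public class CompareSketch {
    static int compareTo(ByteBuffer buf, int bufPos, byte[] arr, int arrPos, int len) {
        for (int i = 0; i < len; i++) {
            int a = buf.get(bufPos + i) & 0xFF; // mask for unsigned comparison
            int b = arr[arrPos + i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] key = {1, 2, 3, 4};
        ByteBuffer offheap = ByteBuffer.allocateDirect(4);
        offheap.put(key).flip();
        System.out.println(compareTo(offheap, 0, key, 0, 4)); // equal regions -> 0
    }
}
```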
NIO ByteBuffer Wrapper
 An HFile block's data might be split across 2 ByteBuffers.
 To avoid a copy, we need a single data structure that can be backed by N ByteBuffers.
 Java's NIO ByteBuffer is not extendable (its constructors are package-private).
 Wrapper class org.apache.hadoop.hbase.nio.ByteBuff, with two implementations:
 org.apache.hadoop.hbase.nio.SingleByteBuff
 org.apache.hadoop.hbase.nio.MultiByteBuff
 The HFile block's data structure type was changed to ByteBuff.
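A toy version of the MultiByteBuff idea, assuming only absolute single-byte reads (the real org.apache.hadoop.hbase.nio.MultiByteBuff covers the full ByteBuffer-like API): N backing buffers are presented as one logical buffer, so a block split across two bucket buffers can be read without copying.

```java
import java.nio.ByteBuffer;

// Toy MultiByteBuff: presents several ByteBuffers as one logical buffer.
// Only absolute get(int) is sketched; class and field names are illustrative.
public class MultiBuffSketch {
    private final ByteBuffer[] items;
    private final int[] ends; // cumulative end offset of each backing buffer

    MultiBuffSketch(ByteBuffer... items) {
        this.items = items;
        this.ends = new int[items.length];
        int sum = 0;
        for (int i = 0; i < items.length; i++) {
            sum += items[i].remaining();
            ends[i] = sum;
        }
    }

    byte get(int index) {
        int i = 0;
        while (index >= ends[i]) { // locate the backing buffer holding index
            i++;
        }
        int start = (i == 0) ? 0 : ends[i - 1];
        return items[i].get(items[i].position() + index - start);
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.wrap(new byte[]{10, 11});
        ByteBuffer b = ByteBuffer.wrap(new byte[]{12, 13});
        MultiBuffSketch mbb = new MultiBuffSketch(a, b);
        System.out.println(mbb.get(3)); // logical index 3 lands in the second buffer
    }
}
```

SingleByteBuff is the degenerate single-buffer case, kept separate so the common path pays no per-access buffer-lookup cost.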
Bucket Cache Block Eviction
 BucketCache evicts blocks and frees their buckets when out of space.
 Before, any block could be evicted, since readers copied block data to a temporary byte[].
 After HBASE-11425, readers refer to the bucket memory area directly.
 Only unreferenced blocks can be evicted.
 Reference-count based block cache and block eviction:
 Increment the ref count when a reader hits a block in the L2 cache.
 Decrement once the response is created for the RPC.
 Evict if/when the ref count = 0.
The decrement travels down the call chain once the response ships:
Call#setResponse -> RpcCallback#run -> RegionScanner#shipped -> KeyValueHeap#shipped -> StoreScanner#shipped -> KeyValueHeap#shipped -> StoreFileScanner#shipped -> HFileScanner#shipped -> HFile.Reader#returnBlock -> BlockCache#returnBlock (decrements the ref count)
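The protocol above can be sketched as follows (names are illustrative, not HBase's classes): retain on a cache hit, release once the RPC response is built, and allow eviction only at count zero. A real implementation must also guard against a retain racing with the eviction check, which this sketch omits.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of ref-count based eviction. A reader retains the block on an
// L2 cache hit, the shipped()/returnBlock chain releases it after the
// response is built, and only an unreferenced block may be evicted.
public class RefCountedBlock {
    private final AtomicInteger refCount = new AtomicInteger(0);

    void retain() {      // reader hits the block in the L2 cache
        refCount.incrementAndGet();
    }

    void release() {     // response created; BlockCache#returnBlock
        refCount.decrementAndGet();
    }

    boolean tryEvict() { // eviction is allowed only when unreferenced
        return refCount.get() == 0;
    }

    public static void main(String[] args) {
        RefCountedBlock block = new RefCountedBlock();
        block.retain();
        System.out.println(block.tryEvict()); // false: a scanner still refers to it
        block.release();
        System.out.println(block.tryEvict()); // true: safe to free the bucket slot
    }
}
```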
Complete Picture
[Diagram] End to end off heap, from bucket cache till RPC: Region1 and Region2 read requests in the HRegionServer increment the block ref count (Refcount++) on each off-heap Bucket Cache hit; the scanner layers operate on SingleByteBuff/MultiByteBuff views; after the read responses are sent, callbacks decrement the count (Refcount--).
Performance Test Results
 PerformanceEvaluation tool (PE).
 Table with one CF and one cell per row; 100 GB total data; each row with a 1 KB value.
 The entire data set is loaded into the bucket cache.
 Single-node cluster.
 CPU: Intel(R) Xeon(R) with 8 cores. RAM: 150 GB.
 JDK: 1.8.
 HBase configuration:
 HBASE_HEAPSIZE = 9 GB
 HBASE_OFFHEAPSIZE = 105 GB
 hbase.bucketcache.size = 102 GB
 GC: default HBase GC settings (CMS).
 Multi-get with 100 rows per operation.
 Every thread does 100 K operations = 10 million rows fetched per thread.
 Measured: average completion time of each thread, in seconds.
 Converted to throughput: a gain of 102% - 460%.

HBase Random GET, average completion time in seconds (lower is better):

Threads              5        10       20       25       50       75
Before HBASE-11425   89.38    139.81   285.66   361.23   817.91   1372.81
After HBASE-11425    44.04    50.55    70.23    88.6     165.4    244.72
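The quoted gain range follows from the table: with fixed work per thread, throughput is inversely proportional to completion time, so the gain is before/after - 1.

```java
// Recomputing the throughput gain from the completion times in the
// table above: gain = before/after - 1, since work per thread is fixed.
public class GainCalc {
    static double gainPercent(double beforeSecs, double afterSecs) {
        return (beforeSecs / afterSecs - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("5 threads:  %.0f%%%n", gainPercent(89.38, 44.04));
        System.out.printf("75 threads: %.0f%%%n", gainPercent(1372.81, 244.72));
    }
}
```

This yields roughly 103% at 5 threads and 461% at 75 threads, matching the 102% - 460% range quoted on the slide up to rounding.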
Performance Test Results
 PerformanceEvaluation tool (PE).
 Random range scans over a 10 K range.
 With a filterAll filter (no data returned to the client), so only the server side is measured.
 Each thread does the range scan 1000 times.
 The entire data set is loaded into the bucket cache.

Range scan, server side only, average completion time in seconds (lower is better):

Threads              10       20       25       50
Before HBASE-11425   449.1    728.64   908.26   1904.93
After HBASE-11425    319.87   451.58   560.46   1158
Performance Test Results
 PerformanceEvaluation tool (PE).
 Random range scans over a 10 K range.
 Returning 10% of the rows back to the client.
 Each thread does the range scan 1000 times.
 The entire data set is loaded into the bucket cache.

HBase range scan with filter, average completion time in seconds (lower is better):

Threads              10       20       25       50
Before HBASE-11425   449.1    728.64   908.26   1904.93
After HBASE-11425    319.87   451.58   560.46   1158
Performance Test Results
 YCSB test.
 Table with one CF and 10 columns per row; each row with a 1 KB value; 90 GB total data.
 The entire data set is loaded into the bucket cache.
 Single-node cluster.
 CPU: Intel(R) Xeon(R) with 8 cores. RAM: 150 GB.
 JDK: 1.8.
 HBase configuration:
 HBASE_HEAPSIZE = 9 GB
 HBASE_OFFHEAPSIZE = 105 GB
 hbase.bucketcache.size = 102 GB
 Multi-get with 100 rows per operation.
 Every thread does 5 million operations.
 20 - 160% improvement.

YCSB Random GET throughput (higher is better):

Threads              10         25         50         75
Before HBASE-11425   23277.97   25922.18   24558.72   24316.74
After HBASE-11425    28045.53   45767.99   58904.03   63280.86
Performance Test Results
 PE test comparing the on-heap L1 cache vs the off-heap L2 cache, with 20 GB of data.
 Multi-get with 100 rows per operation.
 The entire data set is loaded into the bucket cache.
 Each thread does 10 million operations = 1 billion rows fetched per thread.
 L1 test: max heap 32 GB. L2 test: max heap 12 GB.

HBase Random GET, average completion time in seconds (lower is better):

Threads    10      25      50       75
L1 cache   300.5   559.3   1195.9   1793.1
L2 cache   307.6   523.9   1144.2   1707.6
GC Graphs
[GC graphs] MultiGets: before HBASE-11425 (25 threads) vs after HBASE-11425 (25 threads).
[GC graphs] ScanRange10000: before HBASE-11425 (20 threads) vs after HBASE-11425 (20 threads).
Feature Availability
 The feature will be available in the HBase 2.0 release.
 Make the bucket cache the default in HBase 2.0: refer to HBASE-11323.
 Rocketfuel has started using this for a random-read workload, backported to a 1.x based version.
 More details:
 https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in
Future Work & QA
 Future work:
 Off-heaping the write path: HBASE-11579.
 Off-heap MSLAB pool.
 Read request bytes into an off-heap buffer pool.
 Lazy creation of ByteBuffer pools.
 Fixed-size off-heap ByteBuffers from the pool.
 Protobuf changes to handle off-heap ByteBuffers.
 In-memory flushes/compaction (HBASE-14918), from Yahoo.
Questions?


Editor's Notes

  • #6 https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in
  • #7 postScannerFilterRow
  • #8 Temp array of 64 KB creation and copy. Typically only about 20% of the heap is left beyond the memstore and block cache.
  • #11 Same slide, to show the end-to-end picture after the explanation.