HBase Accelerated:
In-Memory Flush and Compaction
Eshcar Hillel, Anastasia Braginsky, Edward Bortnikov | HBaseCon, San Francisco, May 24, 2016
Outline
 Background
 In-Memory Compaction
› Design & Evaluation
 In-Memory Index Reduction
› Design & Evaluation
Motivation: Dynamic Content Processing on Top of HBase
 Real-time content processing pipelines
› Store intermediate results in persistent map
› Notification mechanism is prevalent
 Storage and notifications on the same platform
 Sieve – Yahoo’s real-time content management platform
[Diagram: the Sieve pipeline built on Apache Storm and Apache HBase; stages include the crawl schedule, Crawl, Docproc, and Link Analysis, connected by content and link queues, with results served out]
Notification Mechanism is Like a Sliding Window
 Small working set, but not necessarily a FIFO queue
 Short life-cycle: a message is deleted after it is processed
 High-churn workload: message state can be updated
 Frequent scans to consume messages (see the client-side sketch below)
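A hedged sketch of this access pattern with the standard HBase client API: write or update message state, scan to consume, delete after processing. Table, family, and qualifier names are invented for illustration and are not from the slides.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative notification-queue access pattern on HBase:
// small working set, high churn, short message life-cycle.
public final class NotificationQueueSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table queue = conn.getTable(TableName.valueOf("notifications"))) {

      byte[] cf = Bytes.toBytes("m");

      // Produce / update: message state is written and may be updated
      // several times while the message sits in the queue.
      Put put = new Put(Bytes.toBytes("msg-00042"));
      put.addColumn(cf, Bytes.toBytes("state"), Bytes.toBytes("pending"));
      queue.put(put);

      // Consume: frequent scans pick up outstanding messages.
      try (ResultScanner scanner = queue.getScanner(new Scan().addFamily(cf))) {
        for (Result r : scanner) {
          // process(r);
          // Short life-cycle: delete the message once it has been processed.
          queue.delete(new Delete(r.getRow()));
        }
      }
    }
  }
}
```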
HBase Accelerated: Mission Definition
Goal:
Real-time performance in persistent KV-stores
How:
Use less in-memory space → less I/O
HBase Accelerated: Two Base Ideas
In-Memory Compaction
 Exploit redundancies in the workload to eliminate duplicates in memory
 Gain is proportional to the duplicate ratio
In-Memory Index Reduction
 Reduce the index memory footprint, less overhead per cell
 Gain is proportional to the cell size
Prolong in-memory lifetime, before flushing to disk
 Reduce the number of new files
 Reduce write amplification effect (overall I/O)
 Reduce retrieval latencies
Outline
 Background
 In-Memory Compaction
› Design & Evaluation
 In-Memory Index Reduction
› Design & Evaluation
In-Memory Compaction Design
HBase Writes
 Random writes are absorbed in the active segment (Cm); the full lifecycle is sketched below
 When the active segment is full
› It becomes an immutable segment (snapshot, C'm)
› A new mutable (active) segment serves writes
› The snapshot is flushed to disk (Cd) and the WAL is truncated
 On-disk compaction reads a few files, merge-sorts them, and writes back new files
[Diagram: a write goes to the WAL and the active segment Cm in memory; prepare-for-flush turns Cm into the snapshot C'm; flush writes C'm as Cd on HDFS]
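To make the write path concrete, here is a minimal, hedged Java sketch of the default memstore lifecycle described above. All class and method names are invented for this example; this is not HBase's internal API.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative default memstore lifecycle: writes are absorbed in the active
// segment (Cm); when it fills up it becomes an immutable snapshot (C'm) and
// is flushed as a new file (Cd), after which the WAL can be truncated.
final class DefaultMemStoreSketch {
  private static final long FLUSH_SIZE_BYTES = 128L * 1024 * 1024;

  private ConcurrentSkipListMap<String, byte[]> active = new ConcurrentSkipListMap<>(); // Cm
  private ConcurrentSkipListMap<String, byte[]> snapshot;                               // C'm
  private long activeSize = 0;

  // Write path: append to the WAL (not shown), then absorb the cell in Cm.
  synchronized void put(String key, byte[] value) {
    active.put(key, value);
    activeSize += key.length() + value.length;
    if (activeSize >= FLUSH_SIZE_BYTES) {
      prepareForFlush();
    }
  }

  // prepare-for-flush: Cm becomes the immutable snapshot C'm,
  // and a fresh active segment starts serving writes.
  private void prepareForFlush() {
    snapshot = active;
    active = new ConcurrentSkipListMap<>();
    activeSize = 0;
    flushToDisk(snapshot); // would normally run asynchronously
  }

  // flush: write C'm as a new file (Cd) on HDFS, then truncate the WAL.
  private void flushToDisk(ConcurrentSkipListMap<String, byte[]> segment) {
    // writeHFile(segment); truncateWal();  // placeholders for the real I/O
    snapshot = null;
  }
}
```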
HBase Reads
 Random reads are served from Cm, C'm, or Cd (via the block cache); the lookup order is sketched below
 When data piles up on disk
› Block-cache hit ratio drops
› Retrieval latency goes up
 Disk compaction re-writes small files into fewer, bigger files
› Causes replication-related network and disk I/O
[Diagram: a read consults Cm and C'm in memory and Cd on HDFS through the block cache]
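A hedged sketch of that lookup order: newest data wins, so a point read checks the active segment, then the snapshot, and only then the on-disk store files through the block cache. Names are invented, and versions/tombstones are ignored for brevity.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative read path across the memstore components and disk.
final class ReadPathSketch {
  NavigableMap<String, byte[]> active = new TreeMap<>();   // Cm
  NavigableMap<String, byte[]> snapshot;                    // C'm (may be null)

  byte[] get(String key) {
    byte[] v = active.get(key);
    if (v != null) return v;
    if (snapshot != null) {
      v = snapshot.get(key);
      if (v != null) return v;
    }
    return readFromStoreFiles(key);  // Cd: HFiles, served through the block cache
  }

  private byte[] readFromStoreFiles(String key) {
    return null; // placeholder for the HFile lookups
  }
}
```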
HBase In-Memory Compaction
 New compaction pipeline (sketched below)
› The active segment is flushed into the pipeline (in-memory flush)
› Pipeline segments are compacted in memory
› Data is flushed to disk only when needed
[Diagram: writes go to the WAL and the active segment Cm; in-memory-flush moves segments into the compaction pipeline; flush-to-disk eventually writes Cd to HDFS; the block cache serves reads]
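A minimal sketch of the compaction pipeline idea, assuming a simple key/value model. All names are invented for illustration; the real memstore operates on HBase cells and segments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative compacting memstore: the active segment is "flushed" into an
// in-memory pipeline instead of to disk; pipeline segments are merged in
// memory (collapsing duplicate keys); disk flush happens only when needed.
final class CompactingMemStoreSketch {
  private NavigableMap<String, byte[]> active = new TreeMap<>();            // Cm
  private final List<NavigableMap<String, byte[]>> pipeline = new ArrayList<>();

  void put(String key, byte[] value) {
    active.put(key, value);
  }

  // in-memory-flush: move the active segment into the pipeline.
  void inMemoryFlush() {
    pipeline.add(0, active);          // newest segment first
    active = new TreeMap<>();
    compactPipeline();
  }

  // In-memory compaction: merge-sort the pipeline segments into one segment.
  // Duplicate keys collapse to their newest version, so redundant cells
  // never reach disk.
  private void compactPipeline() {
    NavigableMap<String, byte[]> merged = new TreeMap<>();
    for (int i = pipeline.size() - 1; i >= 0; i--) {
      merged.putAll(pipeline.get(i)); // oldest first, newer entries overwrite
    }
    pipeline.clear();
    pipeline.add(merged);
  }

  // flush-to-disk only when memory pressure requires it.
  void flushToDiskIfNeeded(long memoryLimitBytes, long currentBytes) {
    if (currentBytes >= memoryLimitBytes && !pipeline.isEmpty()) {
      // writeHFile(pipeline.remove(pipeline.size() - 1));  // placeholder
    }
  }
}
```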
New Design: In-Memory Flush and Compaction
 Trade read cache (BlockCache) for write cache (compaction pipeline)
 More CPU cycles for less I/O
[Diagram: the same pipeline, with the compaction pipeline acting as a write cache and the block cache as a read cache in memory; prepare-for-flush and flush-to-disk paths lead to Cd on HDFS, with the WAL alongside]
Outline
 Background
 In-Memory Compaction
› Design & Evaluation
 In-Memory Index Reduction
› Design & Evaluation
Evaluation Settings: In-Memory Working Set
 YCSB: compares compacting vs. default memstore
 Small cluster: 3 HDFS nodes on a single rack, 1 RS
› 1GB heap space, MSLAB enabled (2MB chunks)
› Default: 128MB flush size, 100MB block-cache
› Compacting: 192MB flush size, 36MB block-cache
 High-churn workload, small working set
› 128,000 records, 1KB value field
› 10 threads running 5 million operations, various key distributions
› 50% reads, 50% updates, target 1,000 ops/s
› 1% (long) scans, 99% updates, target 500 ops/s
 Measure average latency over time
› Latencies accumulated over intervals of 10 seconds
Evaluation Results: Read Latency (Zipfian Distribution)
[Chart: read latency over time; annotations mark a flush to disk, a disk compaction, and the point where the data fits into the cache]
Evaluation Results: Read Latency (Uniform Distribution)
[Chart: read latency over time; an annotation marks a region split]
Evaluation Results: Scan Latency (Uniform Distribution)
Evaluation Settings: Handling Tombstones
 YCSB: compares compacting vs. default memstore
 Small cluster: 3 HDFS nodes on a single rack, 1 RS
› 1GB heap space, MSLAB enabled (2MB chunks), 128MB flush size, 64MB block-cache
› Default: Minimum 4 files for compaction
› Compaction: Minimum 2 files for compaction
 High-churn workload, small working set with deletes
› 128,000 records, 1KB value field
› 10 threads running 5 million operations, various key distributions
› 40% reads, 40% updates, 20% deletes (with a 50,000-update head start), target 1,000 ops/s
› 1% (long) scans, 66% updates, 33% deletes (with head start), target 500 ops/s
 Measure average latency over time
› Latencies accumulated over intervals of 10 seconds
Evaluation Results: Read Latency (Zipfian Distribution)
[Charts: read latency over time for both memstores; one run totals 2 flushes and 1 disk compaction, the other 15 flushes and 4 disk compactions]
Evaluation Results: Read Latency (Uniform Distribution)
[Charts: read latency over time for both memstores; one run totals 3 flushes and 2 disk compactions, the other 15 flushes and 4 disk compactions]
Evaluation Results: Scan Latency (Zipfian Distribution)
Outline
 Background
 In-Memory Compaction
› Design & Evaluation
 In-Memory Index Reduction
› Design & Evaluation
In-Memory Index Reduction Design
New Design: Effective in-memory representation
[Diagram: the compaction pipeline in memory, between the active segment Cm and the on-disk files Cd on HDFS; WAL and block cache shown]
Segment for dynamic updates
[Diagram: a mutable segment; cells A through G are allocated in MSLAB chunks and indexed by a skip-list]
Exploit the Immutability of Segment after Compaction
 Current design
› Data stored in flat buffers, index is a skip-list
› All memory allocated on-heap
 New Design: Flat layout for the immutable segment's index (sketched below)
› Less overhead per cell
› Manage (allocate, store, release) data buffers off-heap
 Pros
› Locality in access to index
› Reduce memory fragmentation
› Significantly reduce GC work
› Better utilization of memory and CPU
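A hedged sketch of the flat-index idea: once a segment can no longer change, a sorted array plus binary search can replace the skip-list, cutting per-cell overhead and improving index locality. Names are illustrative, and the real design stores offsets into (possibly off-heap) data chunks rather than object references.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.NavigableMap;

// Illustrative flat index for an immutable (read-only) segment: one array
// slot per cell instead of one skip-list node per cell.
final class FlatSegmentIndexSketch {
  private final String[] keys;    // sorted, one slot per cell
  private final byte[][] values;  // in the real design: offsets into MSLAB chunks

  // Built once, when a skip-list-backed segment becomes immutable.
  FlatSegmentIndexSketch(NavigableMap<String, byte[]> skipListSegment) {
    int n = skipListSegment.size();
    keys = new String[n];
    values = new byte[n][];
    int i = 0;
    for (Map.Entry<String, byte[]> e : skipListSegment.entrySet()) {
      keys[i] = e.getKey();
      values[i] = e.getValue();
      i++;
    }
  }

  byte[] get(String key) {
    int idx = Arrays.binarySearch(keys, key);
    return idx >= 0 ? values[idx] : null;
  }
}
```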
Read-Only Segment
[Diagram: an immutable segment; cells A through G remain in MSLAB chunks but are indexed by a flat cell array instead of a skip-list]
New Design: Effective in-memory representation
 On in-memory flush, ask: are there redundancies to compact? (decision sketched below)
› No: flatten the index, giving less overhead per cell
› Yes: flatten the index and compact, giving fewer cells and less overhead per cell
[Diagram: the compaction pipeline with the flatten-only and flatten-and-compact paths into it; segments are eventually flushed to Cd on HDFS; WAL shown]
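That decision could be sketched roughly as follows. This is hedged pseudocode in Java; the real memstore's heuristics differ and every name here is invented.

```java
// Illustrative in-memory flush policy: always flatten the new immutable
// segment's index; additionally merge (compact) pipeline segments only when
// the workload actually produced redundant versions worth eliminating.
final class InMemoryFlushPolicySketch {
  void onInMemoryFlush(PipelineSketch pipeline, SegmentSketch newSegment) {
    SegmentSketch flat = newSegment.flattenIndex();   // less overhead per cell
    pipeline.add(flat);
    if (pipeline.estimatedDuplicateRatio() > 0.0) {
      pipeline.compactInMemory();                     // fewer cells as well
    }
  }

  interface SegmentSketch { SegmentSketch flattenIndex(); }

  interface PipelineSketch {
    void add(SegmentSketch s);
    double estimatedDuplicateRatio();
    void compactInMemory();
  }
}
```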
Evaluation Results: Read Latency, 1KB Cell (Small Cache)
[Chart: "Uniform (Reads 50% - Writes 50%) - Read Latency"; latency (us) over a timeline of seconds; series: skip-list based compaction vs. cell-array based compaction]
Evaluation Results: Read Latency, 100-Byte Cell (Small Cache)
[Chart: "Zipfian (Reads 50% - Writes 50%) - Read Latency"; latency (us) over a timeline of seconds; series: skip-list based compaction vs. cell-array based compaction]
Evaluation Results: Scan Latency (Uniform Distribution)
[Chart: scan latency (us) over a timeline of seconds; series: skip-list based compaction vs. cell-array based compaction]
Status: Umbrella Jira HBASE-14918
 HBASE-14919, HBASE-15016, HBASE-15359: infrastructure refactoring
› Status: committed
 HBASE-14920: new compacting memstore
› Status: pre-commit
 HBASE-14921: memory optimizations (memory layout, off-heaping)
› Status: under code review
Summary
 Feature intended for HBase 2.0.0 (a configuration sketch follows this slide)
 New design's advantages over the default implementation
› Predictable retrieval latency by serving (mainly) from memory
› Less compaction on disk reduces the write amplification effect
› Less disk I/O and network traffic reduces load on HDFS
› New space-efficient index representation
 We would like to thank the reviewers
› Michael Stack, Anoop Sam John, Ramkrishna S. Vasudevan, Ted Yu
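For context, here is a hedged sketch of how the compacting memstore can be enabled per column family using the API that later shipped with HBase 2.x. The class and policy names below reflect that released API, not the patches under review at the time of this talk, and the table/family names are invented.

```java
import org.apache.hadoop.hbase.MemoryCompactionPolicy;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Hedged sketch: per-column-family in-memory compaction in HBase 2.x.
// A cluster-wide default can reportedly also be set via the
// hbase.hregion.compacting.memstore.type configuration property.
public final class EnableInMemoryCompaction {
  public static void main(String[] args) {
    TableDescriptorBuilder table =
        TableDescriptorBuilder.newBuilder(TableName.valueOf("sieve_messages")); // hypothetical table
    table.setColumnFamily(
        ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("m"))
            // BASIC keeps a compaction pipeline and flattens segment indexes;
            // EAGER additionally eliminates duplicate versions in memory.
            .setInMemoryCompaction(MemoryCompactionPolicy.BASIC)
            .build());
    // admin.createTable(table.build());  // with an Admin obtained from a Connection
  }
}
```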
Evaluation Results: Write Latency