Optimizing HBase for the cloud in
Microsoft Azure HDInsight
Maxim Lukiyanov, Microsoft, Senior Program Manager
Ashit Gosalia, Microsoft, Principal Software Engineering Manager
May 7th 2015, HBaseCon 2015
About Us
Maxim Lukiyanov
Senior Program Manager,
Big Data team
Microsoft
Contact
email: maxluk@microsoft
@maxiluk
Ashit Gosalia
Principal Software Engineering
Manager, Big Data team
Microsoft
Contact
email: ashitg@microsoft
Maxim Lukiyanov, Ashit Gosalia
Outline
Motivation
Use Cases
Performance Tuning
Demo
Context
Lifetime of the service
Preview: June 2014
GA: Aug 2014
Today: May 2015
Usage in Compute Hours: 4x growth since GA
Motivation
HBase can be expensive
Cloud Storage is cheap
=> Lower Cost HBase on Cloud Storage!
HBase in the cloud
[Diagram: four Region Servers (RS) connected over the Network to HBase Storage]
Latency? Bandwidth? Consistency?
HBase in the cloud
[Diagram: four Region Servers (RS) connected over the Network to HBase Storage]
HDD-like latency
50 Tb+ aggregate bandwidth [1]
Strong consistency
[1] Azure Flat Network Architecture
Throughput Optimization = Cost Minimization
Decoupling of compute and storage
Removes capacity constraint
Which allows minimization of cluster size to the exact level of throughput required by workload
[Chart: Price and Capacity, comparing Local VM Storage with Cloud Storage]
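The sizing argument on this slide can be illustrated with a minimal sketch. All per-node figures and the replication factor below are hypothetical placeholders, not numbers from the talk. The assumption is that with local storage the node count is the larger of the capacity-driven and throughput-driven counts, while with cloud storage only throughput constrains it.

```python
import math

def nodes_local_storage(data_tb, required_ops, tb_per_node, ops_per_node, replication=3):
    """Nodes needed when data sits on local disks (hypothetical model).

    Both disk capacity (with replication) and throughput constrain the cluster.
    """
    by_capacity = math.ceil(data_tb * replication / tb_per_node)
    by_throughput = math.ceil(required_ops / ops_per_node)
    return max(by_capacity, by_throughput)

def nodes_cloud_storage(required_ops, ops_per_node):
    """Nodes needed when storage is decoupled: only throughput matters."""
    return math.ceil(required_ops / ops_per_node)

# Hypothetical workload: 10 TB of data, 2000 ops/s, 1 TB and 1000 ops/s per node.
local = nodes_local_storage(data_tb=10, required_ops=2000, tb_per_node=1, ops_per_node=1000)
cloud = nodes_cloud_storage(required_ops=2000, ops_per_node=1000)
print(local, cloud)
```

Under these assumed inputs the local cluster is sized by capacity and the cloud cluster by throughput, which is the gap the slide describes.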
Cost Comparison
                                | Local HDFS                 | Cloud Storage
Price of 6 node cluster / month | 6 hs1.8xlarge VM = $21,000 | 6 Large VM = $1,400
Price of 100TB / month          | -                          | Azure Blob Storage = $2,300
Total Price of Cluster / month  | $21,000                    | $3,700
6x cheaper than local HDFS
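The totals on this slide can be checked with a few lines of arithmetic. The variable names are illustrative; the dollar figures are the ones quoted on the slide.

```python
# Monthly prices quoted on the slide (USD).
local_cluster = 21_000      # 6 hs1.8xlarge VMs, storage included
cloud_vms = 1_400           # 6 Large VMs
blob_storage_100tb = 2_300  # 100 TB in Azure Blob Storage

cloud_total = cloud_vms + blob_storage_100tb
ratio = local_cluster / cloud_total

print(cloud_total)      # 3700
print(round(ratio, 1))  # 5.7, which the slide rounds to 6x
```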
Use Cases
Key value store
Sensor data store
Time series store
Use case #1: key value store
Example
Product recommendation engine
Map-reduce populates HBase with reference data
Recommendation service reads reference data from HBase
10TB of data in 2 node cluster
Cloud optimization
In general throughput requirements vary greatly by workload
In this extreme example:
40 nodes* -> 2 nodes
$9000/month -> $700/month = 12x
* All nodes in use case examples are Azure A3: 4 cores, 7GB RAM, 1TB HDD
Use case #2: sensor data store
Example
Metric store for online advertising platform
Storm cluster computes metrics (link click counts, etc.) over the stream of user activity events
Storm stores aggregates in HBase
8TB of data in 4 node cluster
Cloud optimization
32 nodes -> 4 nodes
$7000/month -> $1100/month = 6x
Use case #3: time series store
Example
Performance metric time series
30TB in 40 node HBase cluster
Cloud optimization – step 1
Row key: metric + timestamp
120 nodes -> 40 nodes
$27,000/month -> $9,700/month = 2.8x (about 3x)
Region updates: [figure not recovered]
Cloud optimization – step 2
Row key: day + metric + timestamp
120 nodes -> 10 nodes
$27,000/month -> $2,800/month = 10x
30TB -> 400TB
Region updates: [figure not recovered]
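The two row key layouts above can be compared with a small sketch. The string key formats, metric names, and helper functions are hypothetical; the slides give only the key composition. One plausible reading of the "Region updates" figures is that a day prefix confines current writes to one contiguous key range, whereas metric-first keys scatter them across the whole key space.

```python
SECONDS_PER_DAY = 86_400

def key_metric_first(metric, ts):
    """Row key: metric + timestamp (illustrative string encoding)."""
    return f"{metric}:{ts:010d}"

def key_day_first(metric, ts):
    """Row key: day + metric + timestamp (illustrative string encoding)."""
    return f"{ts // SECONDS_PER_DAY:05d}:{metric}:{ts:010d}"

def newest_day_positions(make_key, metrics, days, samples_per_day):
    """Positions, in sorted key order, of the rows written on the last day."""
    rows = []
    step = SECONDS_PER_DAY // samples_per_day
    for day in range(days):
        for metric in metrics:
            for s in range(samples_per_day):
                ts = day * SECONDS_PER_DAY + s * step
                rows.append((make_key(metric, ts), day))
    rows.sort()  # HBase stores rows sorted by key
    return [i for i, (_, day) in enumerate(rows) if day == days - 1]

metrics = ["cpu", "mem", "net"]
scattered = newest_day_positions(key_metric_first, metrics, days=4, samples_per_day=2)
contiguous = newest_day_positions(key_day_first, metrics, days=4, samples_per_day=2)
print(scattered)   # [6, 7, 14, 15, 22, 23]
print(contiguous)  # [18, 19, 20, 21, 22, 23]
```

With metric-first keys the newest rows land at the end of every metric's range, so every region holding a metric keeps receiving writes. With day-first keys they form a single block at the end of the table.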
Performance Tuning
[Architecture diagram. Labels: GW1, GW2; ZK1 / Master1, ZK2 / Master2, ZK3 / Master3; Region Server 1 ... Region Server N; Head Node (Yarn, M/R services); Web Front End 1 ... Web Front End N; Web App; HBase; Blob Storage Account (REST); Virtual Network]
Read Latency
File System | WASB Block Transfer Size | Read Latency, 99th percentile, ms
WASB        | 4096 KB                  | 400
WASB        | 256 KB                   | 75
WASB        | 64 KB                    | 50 (+66% over HDFS)
HDFS        | -                        | 30
Results from 2014:
YCSB read test, 32GB of 1K byte rows (non-cached reads),
3 nodes (A3): 4 cores, 7GB memory, 1TB HDD, 1Gb NIC
100 RPC Handlers
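The "+66% over HDFS" figure follows directly from the 99th percentile latencies reported on this slide. A short sketch, with illustrative variable names, that derives the overhead for each WASB block transfer size:

```python
# 99th percentile read latencies from the slide, in milliseconds.
hdfs_ms = 30
wasb_ms = {4096: 400, 256: 75, 64: 50}  # block transfer size in KB -> latency

# Relative overhead of WASB over HDFS, in percent.
overhead = {kb: (ms - hdfs_ms) / hdfs_ms * 100 for kb, ms in wasb_ms.items()}

for kb in sorted(overhead):
    print(f"WASB {kb} KB: {wasb_ms[kb]} ms, {overhead[kb]:+.1f}% vs HDFS")
```

For the 64 KB setting this gives about +66.7%, which the slide reports as +66%.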
Write Throughput
HFiles -> Azure Block Blobs
WAL -> Azure Page Blobs
Optimized for random writes
Coalesces parallel writes into streaming write on the server side
Enabling parallel writes improves throughput
WASB parallel throughput 15% lower than HDFS
Avg. HDFS 15MBps
Avg. Parallel 13MBps
Avg. Serial 9MBps
YCSB write test, 4GB of 100 byte rows, uncompressed,
3 nodes (A3): 4 cores, 7GB memory, 1TB HDD, 1Gb NIC
100 RPC Handlers
100 Sync threads
100 Parallel writers
Announcement
Announcing
HBase on Azure Data Lake
Azure Data Lake
A hyper scale repository for big data
workloads
HDFS for the cloud
Unlimited capacity
High throughput, low latency
Strong consistency
Durable and highly available
Sign-up page for Public Preview
http://azure.microsoft.com/en-us/campaigns/data-lake/
Demo
Summary
Cost
Azure HBase offers a new low-cost deployment option, up to 10x cheaper for some workloads, by direct integration with cloud storage
Performance
Comparable to HDD-based clusters (66% worse storage-backed read latency)
Flexibility
Easy to shrink or recreate cluster without data loss
[Chart: Price and Capacity, comparing Local VM Storage with Cloud Storage]

Editor's Notes

  • #21 TODO: Is it specific to WASB?