HBase Scale and Multi-tenancy @ Y!
PRESENTED BY
Francis Liu | toffer@apache.org
Vandana Ayyalasomayajula | avandana@apache.org
Virag Kothari | virag@apache.org
Outline
▪ HBase @ Y!
▪ Group Favored Nodes
▪ Scaling to 1M Regions and beyond
Y! Grid
▪ Off-Stage Processing
▪ Hosted Service
▪ Multi-tenant
Y! HBase
▪ Hosted Multi-tenant Service
▪ Isolation
› Isolated Deployment
› Region Server Groups
› Namespace
▪ Security
› ACLs
› Audit Logging
▪ Cross-Colo Replication
Isolated Deployment
[Diagram: HBase clients and an MR client run from a Gateway/Launcher; a Compute Cluster (JobTracker, Namenode, TaskTrackers and DataNodes running M/R tasks) is deployed separately from the HBase Cluster (HBase Master, ZooKeeper quorum, Namenode, and RegionServers co-located with DataNodes).]
Region Server Groups - Overview
▪ Member Tables
▪ Resource Isolation
▪ Flexibility with configuration
[Diagram: Group Foo (Region Servers 1…4) hosts Table1 and Table2; Group Bar (Region Servers 5…8) hosts Table3 and Table4; each group carries its own configs.]
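Group membership is managed through an admin API. Below is a minimal sketch using the RSGroupAdminClient API from the later upstreamed version of this feature (HBASE-6721); the Y! implementation described here predates it and differs in detail, and the host, port, group, and table names are illustrative.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.net.Address;
import org.apache.hadoop.hbase.rsgroup.RSGroupAdminClient;

public class GroupSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      RSGroupAdminClient groups = new RSGroupAdminClient(conn);
      groups.addRSGroup("foo");                         // create Group Foo
      groups.moveServers(                               // move RS1 into the group
          Collections.singleton(Address.fromParts("rs1.example.com", 16020)), "foo");
      groups.moveTables(                                // pin Table1 to the group
          Collections.singleton(TableName.valueOf("Table1")), "foo");
    }
  }
}
```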
Region Server Groups - Implementation
[Diagram: the HMaster runs a GroupBasedLoadBalancer (wrapping the stock LoadBalancer and filtering assignments by group, e.g. foo vs. bar) plus GroupAdminEndpoint and GroupMasterObserver coprocessors; a GroupInfoManager persists group membership in a group table and a group znode.]
Namespace
▪ Analogous to Database
▪ Full Table Name: <table namespace>:<table name>
▪ e.g. my_ns:my_table
▪ Reserved namespaces
› default – tables with no explicit namespace
› hbase – system tables (e.g. hbase:meta, hbase:acl)
▪ Table Path: /<hbaseRoot>/data/<namespace>/<tableName>
Namespace
▪ Default Region Server Group
▪ Quota
› Max Tables
› Max Regions
▪ Per Tenant
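Both the namespace and its quotas are set through the standard admin API. A minimal sketch using the namespace quota properties that shipped with HBASE-8410; the namespace name and limits are illustrative.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TenantNamespace {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // One namespace per tenant, capped at 10 tables and 100 regions.
      NamespaceDescriptor ns = NamespaceDescriptor.create("my_ns")
          .addConfiguration("hbase.namespace.quota.maxtables", "10")
          .addConfiguration("hbase.namespace.quota.maxregions", "100")
          .build();
      admin.createNamespace(ns);
      // Tables are then created under fully qualified names, e.g. my_ns:my_table.
    }
  }
}
```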
Replication
▪ Sinks are randomly picked
▪ Sources recover any queue
▪ Shared RPC Quality of Protection config
source: https://hbase.apache.org/replication.html
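Sink selection is deliberately simple: each source samples a fraction of the slave cluster's region servers and spreads edits across them. The sketch below is a simplified illustration of that behavior, not the actual ReplicationSinkManager code; the ratio parameter mirrors the replication.source.ratio setting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: a replication source picking a random subset of the
// slave cluster's region servers to use as sinks.
public class SinkSelection {
  static List<String> chooseSinks(List<String> slaveRegionServers, float ratio) {
    List<String> shuffled = new ArrayList<>(slaveRegionServers);
    Collections.shuffle(shuffled);                        // random pick, no coordination
    int n = Math.max(1, (int) (shuffled.size() * ratio)); // e.g. ratio = 0.1
    return shuffled.subList(0, n);
  }
}
```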
Replication + Group
▪ Region Server Group Aware
▪ Rule-based API
› Source: {namespace}, [Table], [CF]
› Slave: {Peer}
› Effective Time
[Diagram: Table1 and Table2 in Group Foo on the source cluster replicate to Group Foo on the slave cluster; Group Bar is not involved.]
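The rule-based API itself was not upstreamed, so the sketch below is purely hypothetical: a rule object carrying exactly the fields named above, with all identifiers invented for illustration.

```java
// Hypothetical shape of a group-aware replication rule; the actual Y! API
// may differ. Fields follow the slide: Source {namespace},[Table],[CF];
// Slave {Peer}; Effective Time.
public class ReplicationRule {
  String namespace;      // required source namespace
  String table;          // optional source table
  String columnFamily;   // optional source column family
  String peerId;         // slave: the target replication peer
  long effectiveTimeMs;  // when the rule takes effect
}
```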
Replication + Thrift
▪ Encryption via SASL
▪ 0.94 <-> 0.96+ interoperability
Favored Nodes
▪ What are Favored Nodes?
› When writing data, we can pass a set of preferred hosts to the HDFS client for replica placement.
› Preferred hosts => "Favored Nodes"
› Usually 3 hosts: primary, secondary, tertiary.
› Constraint: primary host on one rack; secondary and tertiary hosts on a different rack.
▪ Favored nodes of regions are scattered across various groups.
› No guarantees about data locality within a region server group.
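On the HDFS side, favored nodes are a create-time hint. A sketch assuming the (private-audience) DistributedFileSystem#create overload that accepts a favoredNodes array, the same hook HBase itself invokes; the path, hostnames, and ports are illustrative.

```java
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodesWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();             // assumes fs.defaultFS is HDFS
    DistributedFileSystem fs = (DistributedFileSystem) FileSystem.get(conf);
    // Primary on one rack; secondary and tertiary on a different rack.
    InetSocketAddress[] favoredNodes = {
        new InetSocketAddress("dn1.rack1.example.com", 50010),
        new InetSocketAddress("dn2.rack2.example.com", 50010),
        new InetSocketAddress("dn3.rack2.example.com", 50010)
    };
    try (FSDataOutputStream out = fs.create(new Path("/tmp/favored-demo"),
        FsPermission.getFileDefault(), true /* overwrite */, 4096,
        (short) 3, 128L * 1024 * 1024, null /* progress */, favoredNodes)) {
      out.writeBytes("block replicas placed on the favored nodes above");
    }
  }
}
```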
Example
[Diagram: RS Group A (RS1…RS4 co-located with DN1…DN4) and RS Group B (RS5…RS8 co-located with DN5…DN8); without group awareness, a region's favored nodes can land in either group.]
Example
▪ Locality is lost when region server RS1 dies.
[Diagram: the same two groups after RS1 dies; the replicas on DN1 are no longer under a live region server in Group A.]
Group Aware Favored Nodes
▪ Fix the data locality problem by:
› choosing favored nodes within the region server group
› assigning regions only to their favored nodes
[Diagram: the two groups again, with each region's three favored nodes now all falling inside its own group.]
FavoredGroupLoadBalancer
▪ Region server group aware
▪ Region assignment on favored nodes
▪ Region balancing done using the Stochastic Load Balancer
▪ Favored Node Management
› Generates favored nodes for regions
› Favored nodes are inherited during region split/merge events
› Favored nodes do not change unless required
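A simplified sketch of generating a region's favored nodes under the constraints described earlier (all candidates drawn from within the region server group; primary on one rack, secondary and tertiary on another). This illustrates the placement rule, not the FavoredGroupLoadBalancer source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative only: pick 3 favored nodes from one region server group.
public class FavoredNodePicker {
  static List<String> pickFavoredNodes(Map<String, String> serverToRack) {
    List<String> servers = new ArrayList<>(serverToRack.keySet());
    Collections.shuffle(servers);
    String primary = servers.get(0);
    String primaryRack = serverToRack.get(primary);
    List<String> favored = new ArrayList<>();
    favored.add(primary);
    for (String s : servers) {
      // Secondary and tertiary must sit on a rack other than the primary's.
      if (favored.size() < 3 && !serverToRack.get(s).equals(primaryRack)) {
        favored.add(s);
      }
    }
    return favored; // may be shorter than 3 in a single-rack group
  }
}
```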
Favored Node Management APIs
▪ Redistribute
› Ability to expand region block replicas to newly added nodes
› Changes favored nodes of regions so that replicas spread onto the newly added nodes
[Diagram: RS Group A before and after a new node (RS5/DN5) joins; after redistribute, some regions' favored nodes include the new node.]
Favored Node Management APIs
▪ Complete_Redistribute
› Ability to recreate the entire set of favored nodes in a balanced fashion
› Balances the replica load evenly among all the nodes
[Diagram: RS Group A before and after complete_redistribute; replicas shift toward the host with the least number of replicas until the load is even.]
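Since these management calls were not upstreamed in this form, the sketch below only pins down their contract; the interface and method names are invented for illustration.

```java
// Hypothetical admin interface named after the operations on these slides;
// the actual Y! API may differ.
public interface FavoredNodeAdmin {
  // Minimally reshuffle favored nodes so newly added servers receive replicas.
  void redistribute(String groupName);
  // Regenerate every region's favored nodes from scratch, balancing the
  // replica load evenly across all group members.
  void completeRedistribute(String groupName);
}
```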
Enhancements
▪ Improvements to the Stochastic Load Balancer (HBASE-13376)
▪ Improvements to the Region Placement Maintainer tool
› Ability to view the locality of a region on each of its favored nodes
› Ability to view the primary, secondary, and tertiary node distribution of region servers
▪ Hadoop JIRAs
› HDFS-7300
› HDFS-7795
▪ Configuration changes made on the Hadoop side
› Set “dfs.namenode.replication.considerLoad” to false on small clusters
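The considerLoad key is a real NameNode setting; applying it through Hadoop's Configuration API rather than hdfs-site.xml is just for illustration here.

```java
import org.apache.hadoop.conf.Configuration;

public class NamenodeTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // On small clusters, load-aware replica placement can make the NameNode
    // pass over the favored-node hints; disabling it keeps placement on the
    // favored nodes.
    conf.setBoolean("dfs.namenode.replication.considerLoad", false);
    System.out.println(conf.getBoolean("dfs.namenode.replication.considerLoad", true));
  }
}
```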
Scaling to 1M Regions and beyond (HBASE-11165)
▪ Store Petabytes of data
▪ Support mixed workload (batch and near real-time)
▪ Performance
› Latency, throughput
▪ Operability
› Load balancing, compactions, etc.
Experience at Scale
▪ Web Crawl Cache
› ~2.3PB Table
› 80GB regions -> 20GB regions
› Batch workload
▪ Hot Regions
▪ Large compactions (Write amplification)
▪ Longer failover time
▪ Less parallel / imbalanced MapReduce tasks
▪ Large MapReduce tasks
Scaling Region Count
▪ Master Region Management
› Creation, assignment, balancing, etc.
› Meta table
▪ Metadata
› HDFS scalability
› Zookeeper
› Region Server density
Observations
[Diagram: the Master, the ZooKeeper quorum, and RegionServers hosting regions; assignment communication flows through ZooKeeper while write ops land on the single meta region.]
▪ Assignment
› ZK assignment - complex and requires more storage
› High CPU usage on the master
▪ Single hot meta
› 7GB in size for 1M regions
› Master writing at 400 ops/second
› Longer scanning times
▪ HDFS
› Longer directory creation time
Enhancements - Assignment
▪ Assignment
› ZK-less assignment (HBASE-11059)
› Simpler
› No involvement of ZooKeeper
› Unlock region states (HBASE-11290)
[Diagram: the Master talks to RegionServers directly; one RS hosts the meta region alongside user regions, and region state no longer passes through ZooKeeper.]
Enhancements - Split Meta
▪ Split meta (HBASE-11288)
› Distributed IO load
› Distributed caching
› Shorter scan time
› Distributed compaction
[Diagram: several meta regions spread across multiple RegionServers alongside user regions, instead of a single meta region.]
Enhancements - Hierarchical region dir
● Scaling namenode operations - the table dir holds millions of region files
● Approach - buckets within the table directory
● E.g. 3 hex characters of the bucket name give 4k (16^3 = 4096) buckets

Region dir creation time with 4k buckets:

                   1M regions        5M regions       10M regions
  normal table     20 mins           4 hrs 23 mins    doesn't finish
  humongous table  15 mins 48 secs   1 hr 27 mins     2 hrs 53 mins
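One plausible bucketing scheme consistent with those numbers: a region's encoded name is an MD5 hash rendered as 32 hex characters, so its first 3 characters yield 16^3 = 4096 evenly distributed buckets. A sketch, with the path layout assumed for illustration:

```java
public class RegionBucketing {
  // Derive a bucketed region path from the region's encoded name. The first
  // 3 hex chars of the (MD5-based) encoded name give 16^3 = 4096 buckets.
  static String bucketedRegionPath(String hbaseRoot, String namespace,
      String table, String encodedRegionName) {
    String bucket = encodedRegionName.substring(0, 3);
    return hbaseRoot + "/data/" + namespace + "/" + table
        + "/" + bucket + "/" + encodedRegionName;
  }

  public static void main(String[] args) {
    // e.g. /hbase/data/my_ns/my_table/a9b/a9b3f40c4df2f5d7e5b1f0c2d3e4f5a6
    System.out.println(bucketedRegionPath("/hbase", "my_ns", "my_table",
        "a9b3f40c4df2f5d7e5b1f0c2d3e4f5a6"));
  }
}
```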
Thank You!
(We’re Hiring)
