/events @ Box: Using HBase as a message queue
HBaseCon 2015
David MacKenzie, Box Engineering
@davrmac @BoxEng
Share, manage and access your content from any device, anywhere
What is the /events API?
• Realtime stream of all activity happening within a user’s account
• GET /events?stream_position=1234&stream_type=all
• Persistent and re-playable
[Diagram: a client consuming events 1–5, in order, from its stream]
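As a rough illustration of how a client consumes the stream, here is a minimal Python sketch against the public GET /events endpoint (the token handling and the response fields entries/next_stream_position are assumptions about the payload, not taken from the slides):

```python
import time
import requests

API_URL = "https://api.box.com/2.0/events"   # Box /events endpoint
ACCESS_TOKEN = "..."                          # hypothetical: obtained via OAuth

def poll_events(stream_position="now", stream_type="all"):
    """Repeatedly read a user's event stream from the given position."""
    while True:
        resp = requests.get(
            API_URL,
            params={"stream_position": stream_position, "stream_type": stream_type},
            headers={"Authorization": "Bearer " + ACCESS_TOKEN},
        )
        resp.raise_for_status()
        body = resp.json()
        for event in body.get("entries", []):
            print(event.get("event_type"), event.get("event_id"))
        # The response tells the client where to resume from; re-reading from
        # an older, stored position replays the same events (re-playability).
        stream_position = body.get("next_stream_position", stream_position)
        time.sleep(5)
```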
Why did we build it?
• Main use case was sync → switch from batch to incremental diffs
• Several requirements arose from the sync use case:
‒ Guaranteed delivery
‒ Clients can be offline for days at a time
‒ Arbitrary number of clients consuming each user’s stream
• Together, these requirements translate into persistence and re-playability
How is it implemented?
• Each user assigned a separate section of the HBase key-space
• Messages are stored in order from oldest to newest within a user’s
section of the key-space
• Reads map directly to scans from the provided position to the user’s end
key
• Row key structure: <pseudo-random prefix>_<user_id>_<position>
‒ Pseudo-random prefix: 2 bytes of the user_id’s SHA-1
‒ Position: millisecond timestamp
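A minimal sketch of how such a row key could be assembled, assuming a simple "_"-delimited encoding (the exact byte layout and delimiter are assumptions; only the prefix/user_id/timestamp structure comes from the slide):

```python
import hashlib
import time

def make_row_key(user_id, position_ms=None):
    """Build <pseudo-random prefix>_<user_id>_<position>.

    The prefix is the first 2 bytes of sha1(user_id), which spreads users
    across the key-space; the position is a millisecond timestamp, so rows
    inside one user's section sort from oldest to newest.
    """
    if position_ms is None:
        position_ms = int(time.time() * 1000)
    prefix = hashlib.sha1(str(user_id).encode()).digest()[:2]
    return b"_".join([prefix.hex().encode(),
                      str(user_id).encode(),
                      str(position_ms).encode()])

# A read for one user is then a scan from the row key at the client's stream
# position up to that user's end key (same prefix and user_id, maximum position).
```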
Using a timestamp as a queue position
• Pro: Allows for allocating roughly monotonically increasing positions
with no co-ordination between write requests
• Con: Isn’t sufficient to guarantee append-only semantics in the presence
of parallel writes
[Diagram: write 1 starts before write 2 but commits after it; a reader that has already advanced past position 2 never sees write 1]
Time-bounding and Back-scanning
• Need to ensure that clients don’t advance their stream positions past
writes that will eventually succeed
‒ But clients do need to advance position eventually
‒ How do we know when it’s safe?
• Solution: time-bound writes and back-scan reads
‒ Time-bounding: every write to HBase must complete within a fixed time-bound to be
considered successful
‒ No guaranteed delivery for unsuccessful writes.
‒ Clients should retry failed writes at higher stream positions.
‒ Back-scanning: clients cannot advance their stream positions further than (current
time – back-scan interval)
‒ Back-scan interval >= write time-bound
• Provides guaranteed delivery but at the cost of duplicate events
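A rough sketch of the two rules working together (the HBase put call and the helper names are placeholders; the constants are illustrative, not Box’s real settings):

```python
import time

WRITE_TIME_BOUND_S = 5       # a write must finish within this bound to count
BACK_SCAN_INTERVAL_S = 5     # must be >= WRITE_TIME_BOUND_S

def timed_write(table, row_key, event):
    """Time-bounding: a write that overruns the bound is treated as failed."""
    start = time.time()
    table.put(row_key, event)                    # placeholder for the HBase put
    if time.time() - start > WRITE_TIME_BOUND_S:
        # No guaranteed delivery for this attempt; the caller should retry the
        # event at a higher (newer) stream position.
        raise TimeoutError("write exceeded time bound; retry at a new position")

def safe_stream_position(candidate_position_ms):
    """Back-scanning: never advance past (current time - back-scan interval),
    so any write that started before the read has either landed or timed out."""
    horizon_ms = int((time.time() - BACK_SCAN_INTERVAL_S) * 1000)
    return min(candidate_position_ms, horizon_ms)
```

Because each poll re-covers the back-scan window, clients will see some events more than once and need to de-duplicate them.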
[Diagram: time-bounded writes and back-scanning reads; a reader holds its position behind the back-scan horizon, so a delayed write at position 3 is still delivered, at the cost of re-reading positions it has already seen]
Replication
• Master/slave architecture
‒ One cluster per DC
‒ Master cluster handles all reads and writes
‒ Slave clusters are passive replicas
• On promotion, clients transparently fail over to the new master cluster
• Can’t use native HBase replication directly
‒ Could cause clients to miss events when failing over to a lagging cluster
[Diagram: master cluster with events 1–2 replicating to a lagging slave holding only event 1; after failover to the lagging cluster, a client that had already read event 2 could miss events]
Replication Contd.
• Replication system needs to be aware of master/slave failovers
‒ Stop exactly replicating messages. Start appending messages to the current ends of
the queues.
• Currently, use a client-level replication system piggybacking on MySQL
replication
• Plan to switch to a system that hooks into HBase replication by
configuring itself as a slave HBase cluster
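A purely illustrative sketch of the failover-aware behaviour described above (the queue interface is invented for the example):

```python
def replicate_event(event, target_queue, failed_over):
    """Mirror the master's positions exactly until a failover happens; after a
    failover, append at the target queue's current end so clients that have
    already switched over never see positions jump backwards or get skipped."""
    if failed_over:
        position = target_queue.next_position()   # current end of this queue
    else:
        position = event.position                 # preserve the master's position
    target_queue.append(position, event.payload)
```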
[Diagram: after failover, replicated events are appended at the current end of the new master’s queue (positions 3, 4) instead of at their original positions, so readers never miss them]
Why HBase?
• Closest off-the-rack queuing system is Kafka
‒ Developed at LinkedIn. Open sourced in 2011.
‒ Originally built to power LinkedIn’s analytics pipeline
‒ Very similar model built around “ordered commit logs”
‒ Allow for easy addition of new subscribers
‒ Allow for varying subscriber consumption patterns → slow subscribers don’t back up the
pipeline
Why HBase and not Kafka?
• Better consistency vs. availability tradeoffs
‒ No automatic rack-aware replica placement
‒ No automatic replica re-assignment upon replica failure
‒ On replica failure, no fast failover of new writes to new replicas.
‒ Can’t require minimum replication factor for new writes without significantly impacting
availability on replica failure
• Replication support
‒ Not enough control over Kafka queue positions to implement transparent client
failovers between replica clusters
• Unable to scale to millions of topics
‒ Currently tops out in the tens of thousands of topics.
‒ Design requires very granular topic tracking. Barrier to scale.
In conclusion…
• We were able to leverage HBase to store millions of guaranteed-delivery
message queues, each of which was:
‒ replicated between data centers
‒ independently consumable by an arbitrary number of clients
• Cluster metrics:
‒ ~30 nodes per cluster
‒ 15K writes/sec at peak. Bursts of up to 40K writes/sec.
‒ 50K-60K requests/sec at peak.
Questions?
Twitter: @davrmac, @BoxEng
Engineering Blog: tech.blog.box.com
Platform: developers.box.com
Open Source: opensource.box.com
