Hourglass: a Library for Incremental Processing on
Hadoop
IEEE BigData 2013
October 9th
Matthew Hayes
©2013 LinkedIn Corporation. All Rights Reserved.
Matthew Hayes
Staff Software Engineer
www.linkedin.com/in/matthewterencehayes/
• 3+ Years on Applied Data Team at LinkedIn
• Skills
• Endorsements
• DataFu
• White Elephant
Agenda
 Motivation
 Design
 Experiments
 Q&A
Motivation
Event Collection in an Online System
 Typically online websites have instrumented services that collect events
 Events stored in an offline system (such as Hadoop) for later analysis
 Using events, can build dashboards with metrics such as:
– # of page views over last month
– # of active users over last month
 Metrics derived from events can also be useful in recommendation pipelines
– e.g. impression discounting
Event Storage
 Events can be categorized into topics, for example:
– page view
– user login
– ad impression/click
 Store events by topic and by day:
– /data/page_view/daily/2013/10/08
– /data/page_view/daily/2013/10/09
– ...
– /data/ad_click/daily/2013/10/08
 Now can perform computation over specific time windows
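Given this layout, selecting the input for a window reduces to enumerating daily paths. A minimal sketch (a hypothetical helper, not part of Hourglass) assuming the /data/&lt;topic&gt;/daily/YYYY/MM/DD convention above:

```python
from datetime import date, timedelta

def daily_paths(topic, start, end, root="/data"):
    """Enumerate daily partition paths for a topic over [start, end]."""
    d = start
    while d <= end:
        yield f"{root}/{topic}/daily/{d.year:04d}/{d.month:02d}/{d.day:02d}"
        d += timedelta(days=1)

paths = list(daily_paths("page_view", date(2013, 10, 8), date(2013, 10, 9)))
# paths[0] == "/data/page_view/daily/2013/10/08"
```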
Computation Over Time Windows
 In practice, many of our computations over time windows use either a fixed-start window (start date fixed, end date advances) or a fixed-length window (the last n days):
Recognizing Inefficiencies
 But typically jobs compute these daily
 From one day to the next, the input changes little
 A fixed-start window includes just one new day:
Recognizing Inefficiencies
 A fixed-length window includes one new day and drops the oldest day
Recognizing Inefficiencies
 Repeatedly processing same input data
 This wastes cluster resources
 Better to process new data only
 How can we do better?
Hourglass Design
Design Goals
 Address use cases:
– Fixed-start and fixed-length window computations
– Daily partitioned data
 Reduce resource usage
 Reduce wall clock time
 Run on standard Hadoop
Improving Fixed-Start Computations
 Suppose we must compute page view counts per member
 The job consumes all days of available input, producing one output
 We call this a partition-collapsing job
 But if the job runs tomorrow, it has to reprocess the same data
Improving Fixed-Start Computations
 Solution: merge the new day's data with the previous output
 We can do this because counting is an arithmetic operation: the new day's counts simply add to the previous totals
 Hourglass provides a partition-collapsing job that supports output reuse
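The merge can be sketched with plain counters. Assuming per-member page view counts, merging only the new day into the previous output yields the same result as collapsing every day from scratch:

```python
from collections import Counter

def collapse(days):
    """Baseline partition-collapsing job: recount every day of input."""
    total = Counter()
    for day in days:
        total.update(day)
    return total

def merge_new_day(previous_output, new_day):
    """Incremental fixed-start update: add only the new day's counts
    to the previous output instead of reprocessing all days."""
    return previous_output + Counter(new_day)

day1 = {"alice": 3, "bob": 1}
day2 = {"alice": 2, "carol": 5}
assert merge_new_day(collapse([day1]), day2) == collapse([day1, day2])
```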
Partition-Collapsing Job Architecture (Fixed-Start)
 When applied to a fixed-start window computation:
Improving Fixed-Length Computations
 For a fixed-length job, we can reuse output using a similar trick:
– Add the new day to the previous output
– Subtract the oldest day from the result
 We can subtract the oldest day because the operation is arithmetic
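The add/subtract update can be sketched the same way, again assuming per-member counts:

```python
from collections import Counter

def slide_window(previous_output, new_day, oldest_day):
    """Fixed-length update: add the newest day and subtract the day
    that fell out of the window; valid because counts are arithmetic."""
    result = Counter(previous_output)
    result.update(new_day)        # add the new day
    result.subtract(oldest_day)   # remove the oldest day
    return +result                # drop zero counts

day1, day2, day3 = {"alice": 3}, {"alice": 2, "bob": 1}, {"bob": 4}
prev = Counter(day1) + Counter(day2)   # window [day1, day2]
# Slide to window [day2, day3]: same result as recounting both days
assert slide_window(prev, day3, day1) == Counter(day2) + Counter(day3)
```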
Partition-Collapsing Job Architecture (Fixed-Length)
 When applied to a fixed-length window computation:
Improving Fixed-Length Computations
 But for some operations, we cannot subtract old data
– examples: max(), min()
 Cannot reuse previous output, so how do we reduce computation?
 Solution: the partition-preserving job
 Partitioned input data, partitioned output data
 Essentially: aggregate the data in advance
 Aggregating in advance can be useful even when you can reuse output
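For a non-subtractable aggregate such as max(), per-day pre-aggregation still pays off: the window must be re-collapsed each day, but over tiny per-day aggregates rather than raw events. A toy sketch:

```python
def preserve_partitions(days):
    """Partition-preserving step: aggregate each day separately,
    emitting one max per member per day (partitioned output)."""
    return [{member: max(values) for member, values in day.items()}
            for day in days]

def collapse(partitioned):
    """Collapse the pre-aggregated partitions into a window-level max.
    max() has no inverse, so old days cannot be subtracted, but the
    per-day aggregates are far smaller than the raw events."""
    out = {}
    for day in partitioned:
        for member, value in day.items():
            out[member] = max(out.get(member, value), value)
    return out

days = [{"alice": [3, 7]},               # day 1 raw values
        {"alice": [5], "bob": [2]}]      # day 2 raw values
assert collapse(preserve_partitions(days)) == {"alice": 7, "bob": 2}
```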
Partition-Preserving Job Architecture
MapReduce in Hourglass
 MapReduce is a fairly general programming model
 Hourglass requires:
– reduce() must output a (key, value) pair
– reduce() must produce at most one value per key
– reduce() must be implemented with an accumulator
Building Blocks
 Two types of jobs:
– Partition-preserving: consume partitioned input data, produce partitioned output data
– Partition-collapsing: consume partitioned input data, produce a single output
 Must provide to jobs:
– Input and output paths
– Desired time range
 Must implement:
– map()
– accumulate()
 May implement if necessary:
– merge()
– unmerge()
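The accumulator pattern behind these callbacks can be sketched as follows. This is illustrative Python with hypothetical method names, not the actual Hourglass API (which is a Java library):

```python
class CountAccumulator:
    """Sketch of an accumulator for a count aggregate."""
    def __init__(self):
        self.count = 0

    def accumulate(self, value):
        # Called once per value for a key, replacing a general reduce()
        self.count += value

    def merge(self, previous):
        # Fold in a previously produced output (reuse instead of recompute)
        self.count += previous

    def unmerge(self, oldest):
        # Subtract the contribution of a day that left the window
        self.count -= oldest

    def get_final(self):
        # Produces at most one value per key, as Hourglass requires
        return self.count

acc = CountAccumulator()
for v in [1, 2, 3]:          # today's values for some key
    acc.accumulate(v)
acc.merge(10)                # previous window total for the key
acc.unmerge(4)               # oldest day's contribution
assert acc.get_final() == 12
```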
Experiments
Metrics for Evaluation
 Wall clock time
– Amount of time that elapses until the job completes
 Total task time
– Sum of execution times across all tasks
– Represents usage of cluster resources
 Compare each against a baseline non-incremental job
Experiment: Page Views per Member
 Goal: Count page views per member over last n days
 Chain partition-preserving and partition-collapsing
 Can reuse previous output:
Experiment: Page Views per Member
Member Count Estimation
 Goal: estimate the number of members visiting the site over the past n days
 Use HyperLogLog cardinality estimation (trading accuracy for space)
 Can't reuse previous output, but a partition-preserving job can save per-day state:
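The idea can be sketched with a toy HyperLogLog: store one small sketch per daily partition, then union the sketches register-wise across the window. This is a simplified, illustrative implementation, not the library used in the experiments:

```python
import hashlib

class SimpleHLL:
    """Toy HyperLogLog-style sketch (simplified, illustrative only)."""
    def __init__(self, p=10):
        self.p = p                          # 2^p registers
        self.registers = [0] * (1 << p)

    def add(self, item):
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        idx = h & ((1 << self.p) - 1)       # register index from low bits
        rest = h >> self.p
        rank = 1                            # position of lowest set bit
        while rest & 1 == 0 and rank < 64:
            rank += 1
            rest >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        # Register-wise max = sketch of the union of the two sets.
        # This is why per-day sketches saved by a partition-preserving
        # job can be combined without reprocessing raw events.
        out = SimpleHLL(self.p)
        out.registers = [max(a, b)
                         for a, b in zip(self.registers, other.registers)]
        return out

# One sketch per daily partition...
day1, day2 = SimpleHLL(), SimpleHLL()
for m in ["alice", "bob"]:
    day1.add(m)
for m in ["bob", "carol"]:
    day2.add(m)
window = day1.union(day2)

# ...unions to the same sketch as processing all events at once
combined = SimpleHLL()
for m in ["alice", "bob", "carol"]:
    combined.add(m)
assert window.registers == combined.registers
```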
Member Count Estimation: Results
Conclusion
 Computations over sliding windows are quite common
 Implementations are typically inefficient
 Incrementalizing Hadoop jobs can in some cases yield:
– 95-98% reductions in total task time
– 20-40% reductions in wall clock time
Learning More
datafu.org
