Hourglass: a Library for Incremental Processing on
Hadoop
IEEE BigData 2013
October 9th
Matthew Hayes
©2013 LinkedIn Corporation. All Rights Reserved.
Matthew Hayes
Staff Software Engineer
www.linkedin.com/in/matthewterencehayes/
• 3+ Years on Applied Data Team at LinkedIn
• Skills
• Endorsements
• DataFu
• White Elephant
Agenda
 Motivation
 Design
 Experiments
 Q&A
Motivation
Event Collection in an Online System
 Typically online websites have instrumented services that collect events
 Events stored in an offline system (such as Hadoop) for later analysis
 Using events, can build dashboards with metrics such as:
– # of page views over last month
– # of active users over last month
 Metrics derived from events can also be useful in recommendation pipelines
– e.g. impression discounting
Event Storage
 Events can be categorized into topics, for example:
– page view
– user login
– ad impression/click
 Store events by topic and by day:
– /data/page_view/daily/2013/10/08
– /data/page_view/daily/2013/10/09
– ...
– /data/ad_click/daily/2013/10/08
 Now can perform computation over specific time windows
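Given this layout, selecting the input for a window reduces to enumerating daily paths. A minimal sketch (a hypothetical helper, not part of Hourglass) assuming the /data/&lt;topic&gt;/daily/YYYY/MM/DD convention above:

```python
from datetime import date, timedelta

def daily_paths(topic, start, end, root="/data"):
    """Enumerate daily partition paths for a topic over [start, end]."""
    d = start
    while d <= end:
        yield f"{root}/{topic}/daily/{d.year:04d}/{d.month:02d}/{d.day:02d}"
        d += timedelta(days=1)

paths = list(daily_paths("page_view", date(2013, 10, 8), date(2013, 10, 9)))
# paths[0] == "/data/page_view/daily/2013/10/08"
```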
Computation Over Time Windows
 In practice, many of our computations over time windows use either a fixed-start window (start date fixed, end date advances) or a fixed-length window (the last n days):
Recognizing Inefficiencies
 But typically jobs compute these daily
 From one day to the next, the input changes little
 A fixed-start window includes just one new day:
Recognizing Inefficiencies
 A fixed-length window includes one new day and drops the oldest day
Recognizing Inefficiencies
 Repeatedly processing same input data
 This wastes cluster resources
 Better to process new data only
 How can we do better?
Hourglass Design
Design Goals
 Address use cases:
– Fixed-start and fixed-length window computations
– Daily partitioned data
 Reduce resource usage
 Reduce wall clock time
 Run on standard Hadoop
Improving Fixed-Start Computations
 Suppose we must compute page view counts per member
 The job consumes all days of available input, producing one output
 We call this a partition-collapsing job
 But if the job runs tomorrow, it has to reprocess the same data
Improving Fixed-Start Computations
 Solution: merge the new day's data with the previous output
 We can do this because counting is an arithmetic operation: the new day's counts simply add to the previous totals
 Hourglass provides a partition-collapsing job that supports output reuse
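The merge can be sketched with plain counters. Assuming per-member page view counts, merging only the new day into the previous output yields the same result as collapsing every day from scratch:

```python
from collections import Counter

def collapse(days):
    """Baseline partition-collapsing job: recount every day of input."""
    total = Counter()
    for day in days:
        total.update(day)
    return total

def merge_new_day(previous_output, new_day):
    """Incremental fixed-start update: add only the new day's counts
    to the previous output instead of reprocessing all days."""
    return previous_output + Counter(new_day)

day1 = {"alice": 3, "bob": 1}
day2 = {"alice": 2, "carol": 5}
assert merge_new_day(collapse([day1]), day2) == collapse([day1, day2])
```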
Partition-Collapsing Job Architecture (Fixed-Start)
 When applied to a fixed-start window computation:
Improving Fixed-Length Computations
 For a fixed-length job, we can reuse output using a similar trick:
– Add the new day to the previous output
– Subtract the oldest day from the result
 We can subtract the oldest day because the operation is arithmetic
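The add/subtract update can be sketched the same way, again assuming per-member counts:

```python
from collections import Counter

def slide_window(previous_output, new_day, oldest_day):
    """Fixed-length update: add the newest day and subtract the day
    that fell out of the window; valid because counts are arithmetic."""
    result = Counter(previous_output)
    result.update(new_day)        # add the new day
    result.subtract(oldest_day)   # remove the oldest day
    return +result                # drop zero counts

day1, day2, day3 = {"alice": 3}, {"alice": 2, "bob": 1}, {"bob": 4}
prev = Counter(day1) + Counter(day2)   # window [day1, day2]
# Slide to window [day2, day3]: same result as recounting both days
assert slide_window(prev, day3, day1) == Counter(day2) + Counter(day3)
```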
Partition-Collapsing Job Architecture (Fixed-Length)
 When applied to a fixed-length window computation:
Improving Fixed-Length Computations
 But for some operations, we cannot subtract old data
– examples: max(), min()
 Cannot reuse previous output, so how do we reduce computation?
 Solution: the partition-preserving job
 Partitioned input data, partitioned output data
 Essentially: aggregate the data in advance
 Aggregating in advance can be useful even when you can reuse output
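For a non-subtractable aggregate such as max(), per-day pre-aggregation still pays off: the window must be re-collapsed each day, but over tiny per-day aggregates rather than raw events. A toy sketch:

```python
def preserve_partitions(days):
    """Partition-preserving step: aggregate each day separately,
    emitting one max per member per day (partitioned output)."""
    return [{member: max(values) for member, values in day.items()}
            for day in days]

def collapse(partitioned):
    """Collapse the pre-aggregated partitions into a window-level max.
    max() has no inverse, so old days cannot be subtracted, but the
    per-day aggregates are far smaller than the raw events."""
    out = {}
    for day in partitioned:
        for member, value in day.items():
            out[member] = max(out.get(member, value), value)
    return out

days = [{"alice": [3, 7]},               # day 1 raw values
        {"alice": [5], "bob": [2]}]      # day 2 raw values
assert collapse(preserve_partitions(days)) == {"alice": 7, "bob": 2}
```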
Partition-Preserving Job Architecture
MapReduce in Hourglass
 MapReduce is a fairly general programming model
 Hourglass requires:
– reduce() must output a (key, value) pair
– reduce() must produce at most one value per key
– reduce() must be implemented with an accumulator
Building Blocks
 Two types of jobs:
– Partition-preserving: consume partitioned input data, produce partitioned output data
– Partition-collapsing: consume partitioned input data, produce a single output
 Must provide to jobs:
– Input and output paths
– Desired time range
 Must implement:
– map()
– accumulate()
 May implement if necessary:
– merge()
– unmerge()
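The accumulator pattern behind these callbacks can be sketched as follows. This is illustrative Python with hypothetical method names, not the actual Hourglass API (which is a Java library):

```python
class CountAccumulator:
    """Sketch of an accumulator for a count aggregate."""
    def __init__(self):
        self.count = 0

    def accumulate(self, value):
        # Called once per value for a key, replacing a general reduce()
        self.count += value

    def merge(self, previous):
        # Fold in a previously produced output (reuse instead of recompute)
        self.count += previous

    def unmerge(self, oldest):
        # Subtract the contribution of a day that left the window
        self.count -= oldest

    def get_final(self):
        # Produces at most one value per key, as Hourglass requires
        return self.count

acc = CountAccumulator()
for v in [1, 2, 3]:          # today's values for some key
    acc.accumulate(v)
acc.merge(10)                # previous window total for the key
acc.unmerge(4)               # oldest day's contribution
assert acc.get_final() == 12
```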
Experiments
Metrics for Evaluation
 Wall clock time
– Amount of time that elapses until the job completes
 Total task time
– Sum of execution times across all tasks
– Represents usage of cluster resources
 Compare each against a baseline non-incremental job
Experiment: Page Views per Member
 Goal: Count page views per member over last n days
 Chain partition-preserving and partition-collapsing
 Can reuse previous output:
Experiment: Page Views per Member
Member Count Estimation
 Goal: estimate the number of members visiting the site over the past n days
 Use HyperLogLog cardinality estimation (trading accuracy for space)
 Can't reuse previous output, but a partition-preserving job can save per-day state:
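The idea can be sketched with a toy HyperLogLog: store one small sketch per daily partition, then union the sketches register-wise across the window. This is a simplified, illustrative implementation, not the library used in the experiments:

```python
import hashlib

class SimpleHLL:
    """Toy HyperLogLog-style sketch (simplified, illustrative only)."""
    def __init__(self, p=10):
        self.p = p                          # 2^p registers
        self.registers = [0] * (1 << p)

    def add(self, item):
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        idx = h & ((1 << self.p) - 1)       # register index from low bits
        rest = h >> self.p
        rank = 1                            # position of lowest set bit
        while rest & 1 == 0 and rank < 64:
            rank += 1
            rest >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        # Register-wise max = sketch of the union of the two sets.
        # This is why per-day sketches saved by a partition-preserving
        # job can be combined without reprocessing raw events.
        out = SimpleHLL(self.p)
        out.registers = [max(a, b)
                         for a, b in zip(self.registers, other.registers)]
        return out

# One sketch per daily partition...
day1, day2 = SimpleHLL(), SimpleHLL()
for m in ["alice", "bob"]:
    day1.add(m)
for m in ["bob", "carol"]:
    day2.add(m)
window = day1.union(day2)

# ...unions to the same sketch as processing all events at once
combined = SimpleHLL()
for m in ["alice", "bob", "carol"]:
    combined.add(m)
assert window.registers == combined.registers
```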
Member Count Estimation: Results
Conclusion
 Computations over sliding windows are quite common
 Implementations are typically inefficient
 Incrementalizing Hadoop jobs can in some cases yield:
– 95-98% reductions in total task time
– 20-40% reductions in wall clock time
Learning More
datafu.org
