Facebook Production Engineering Open House 2018

I attended the Facebook Production Engineering Open House at their Menlo Park HQ.

“Production Engineering is Facebook’s secret sauce – it draws on multiple disciplines (Software Engineering, Systems Administration, Distributed Systems and Networking) to plan, build and maintain the massive Facebook infrastructure.”

It’s always interesting to see how large operators solve problems at scale. Even a small shop can usually borrow one or two nuggets.

6:00 – 7:00: Registration, apps and drinks

7:00 – 7:30: “Welcome and PE overview” by Fernanda Weiden (OG Ads Team Member)

Production Engineering (PE) is Facebook’s term for DevOps. PEs must know how to program, but don’t necessarily end up doing so. Newer services may require more programming from PEs, and older services less.

Tech Talks:
7:30 – 7:50: “Scaling Instagram On-call” by Nick Shortway

– took a long time to refine on-call schedule
– 1-day on-calls were too much administrative effort to schedule
– now newbies are Level 1 and experienced people are Level 2
– “loop” is a period of time
– 3-day loops where the #1 priority is being on-call
– Level 1 escalates to Level 2 within 2-3 minutes if no apparent solution
– Level 1 might not get sleep for 3 days
– 3-day loops are a lot of effort to administer, but manageable.
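
The loop idea above can be sketched in a few lines of Python. This is only an illustration of the notes, not Instagram's tooling: the function names, the round-robin pairing and the 3-minute escalation threshold (the talk said 2-3 minutes) are all my assumptions.

```python
from datetime import date, timedelta
from itertools import cycle


def build_loops(start, level1, level2, count, loop_days=3):
    """Return `count` back-to-back loops, each pairing one Level 1 with one Level 2.

    Hypothetical helper: people are assigned round-robin from each list.
    """
    l1, l2 = cycle(level1), cycle(level2)
    schedule = []
    for i in range(count):
        begin = start + timedelta(days=i * loop_days)
        schedule.append({
            "start": begin,
            "end": begin + timedelta(days=loop_days - 1),  # inclusive last day
            "level1": next(l1),
            "level2": next(l2),
        })
    return schedule


def should_escalate(minutes_without_fix, threshold_minutes=3):
    """Level 1 hands off to Level 2 once the (assumed) threshold is reached."""
    return minutes_without_fix >= threshold_minutes
```

For example, build_loops(date(2018, 3, 1), ["ana", "bo"], ["cy"], 3) yields three loops starting on March 1, 4 and 7, with the made-up names rotating through each level.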

7:50 – 8:10: “How we monitor and scale FB” by Patrick Taylor

– lsof, strace, nm, /proc/pid/exe and trace.py
– cubism (folded perf graphs) click through to “Deep Dive” graphs
– region, data center, cluster, rack, server
– examples with retransmit, cluster cache failure
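
To make the region-to-server drill-down concrete, here is a minimal Python sketch of rolling a per-server counter (say, TCP retransmits) up to any level of that hierarchy. The data layout, the names LEVELS and rollup, and the sample values in the usage note are hypothetical; Facebook's actual monitoring stack was not shown in this detail.

```python
from collections import defaultdict

# Hierarchy from the talk, widest scope first.
LEVELS = ("region", "datacenter", "cluster", "rack", "server")


def rollup(samples, level):
    """Sum per-server values up to the given level.

    `samples` is an iterable of (path, value) pairs, where `path` is a
    5-tuple ordered like LEVELS. Returns {path_prefix: total}.
    """
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for path, value in samples:
        totals[path[:depth]] += value
    return dict(totals)
```

Starting at "region" and calling rollup again with "datacenter", "cluster" and so on mimics clicking through from a summary graph to the server that is actually misbehaving.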

Patrick/Facebook has developed a “Maslow Hierarchy of Needs for PE”, similar to this one.

Facebook uses a style of time series visualization called a cubism or horizon graph, first implemented in D3 by Mike Bostock at Square in 2012.



Similar Horizon or Cubism Chart by Splunk. Charts are folded into 25% of original height, with darker colors representing larger values.
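
The folding is easy to express in code. The sketch below is my own illustration in Python, not Cubism.js or Splunk code: it splits a value into four stacked bands (hence 25% of the original height) and reports how full each band is. A renderer would draw the bands on top of each other, using a darker color for each higher band.

```python
def horizon_layers(value, max_value, bands=4):
    """Return the fill fraction (0.0 to 1.0) of each band, lowest band first.

    With 4 bands, each band covers a quarter of the value range, so the
    chart needs only a quarter of the vertical space.
    """
    band_size = max_value / bands
    return [
        min(max((value - i * band_size) / band_size, 0.0), 1.0)
        for i in range(bands)
    ]
```

For a value of 62.5 on a 0 to 100 scale, the two lowest bands are full, the third is half full and the darkest band is empty.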

8:10 – 8:30: “Supporting Global Events in FB Live” by Peter Knowles (10-year employee)

Justin Bieber caused a lot of problems up to 2015 (melt-down of his PostgreSQL shard, cache-busting.)

– 3 methods of load testing:

  1. Remove nodes
  2. Synthetic load
  3. Shadow traffic (duplicated traffic, but don’t show copies in user timeline)
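
Two of these methods can be sketched briefly in Python. Everything here is a simplified assumption on my part (the function names, the request-as-dict shape and the "shadow" flag are hypothetical), meant only to show the idea: removing nodes raises the load on the remaining ones, and shadow traffic is real traffic duplicated and tagged so its results stay hidden from users.

```python
import copy


def per_node_load(total_rps, nodes, removed=0):
    """Requests per second each remaining node sees after `removed` nodes are pulled."""
    remaining = nodes - removed
    if remaining <= 0:
        raise ValueError("at least one node must remain")
    return total_rps / remaining


def shadow_copy(request):
    """Duplicate a request and tag the copy so downstream code can hide its effects."""
    dup = copy.deepcopy(request)
    dup["shadow"] = True  # hypothetical flag name
    return dup


def visible_in_timeline(request):
    """Only original (non-shadow) requests should show up for users."""
    return not request.get("shadow", False)
```

With 1,000 requests per second over 10 nodes, pulling 2 nodes moves each remaining node from 100 to 125 requests per second, which is the kind of headroom check the first method provides.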

8:30 – 9:00 Q&A Speaker Panel and Mingle

Q: If you run out of capacity, do you prioritize keeping ads or the user platform up?
A: Ad systems are lower priority than user platform.

Q: How does Facebook support multiple development languages?
A: Thrift. Longer answer is developers should use “officially supported” languages, but they can use anything they want as long as they write their own Thrift client library.

Q: When is Facebook moving to the Public Cloud?
A: Unnecessary. (Editor: FB is the Cloud.)

Q: What was your longest full outage?
A: 36 minutes or so about 18 months ago. (Speaker said he wrote the memo, so he won’t forget the number of minutes.)

Q: Are you using AI/NLP in monitoring?
A: Not yet, but something we follow.

There were some exhibits, with one item apparently being an OpenCompute server.

Food was chicken sliders, tomato bruschetta, triangle turnovers and smaller appetizers, with an open bar.

Thanks to Facebook for hosting the event.

code.facebook.com: How production engineers support global events on Facebook, PE Blog

Cubism.js Time Series Visualization Slides

r-bloggers.com: Cubism Horizon Charts in R
acm.org: Sizing the Horizon: The Effects of Chart Size and Layering on the Graphical Perception of Time Series Visualizations
tableau.com: Horizon Chart Workarounds in Tableau
