I attended the Facebook Production Engineering Open House at their Menlo Park HQ.
“Production Engineering is Facebook’s secret sauce – it draws on multiple disciplines (Software Engineering, Systems Administration, Distributed Systems and Networking) to plan, build and maintain the massive Facebook infrastructure.”
It’s always interesting to see how large operators solve problems at scale. Even if you’re small, you can usually borrow a nugget or two.
6:00 – 7:00: Registration, apps and drinks
7:00 – 7:30: “Welcome and PE overview” by Fernanda Weiden (OG Ads Team Member) Wikipedia
Production Engineering (PE) is Facebook’s term for DevOps. PEs must know how to program, but don’t necessarily end up doing so. Newer services may require more programming from PEs, and older services less.
7:30 – 7:50: “Scaling Instagram On-call” by Nick Shortway
– took a long time to refine on-call schedule
– 1-day on-calls were too much administrative effort to schedule
– now newbies are Level 1 and experienced people are Level 2
– “loop” is a period of time
– 3-day loops where the #1 priority is being on-call
– Level 1 escalates to Level 2 within 2-3 minutes if no apparent solution
– Level 1 might not get sleep for 3 days
– 3-day loops are a lot of effort to administer, but manageable.
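The two-level escalation rule in the notes above can be sketched in a few lines of Python. This is only my illustration, not Instagram's tooling: the function name `who_is_paged`, the level labels and the 3-minute cutoff (the upper end of the "2-3 minutes" mentioned in the talk) are all assumptions.

```python
# Hypothetical sketch of the Level 1 -> Level 2 escalation described in the talk.
# The cutoff is an assumption: the speaker said "2-3 minutes".
ESCALATION_AFTER_MINUTES = 3


def who_is_paged(minutes_since_alert, resolved):
    """Return the on-call levels that should be engaged for an alert."""
    if resolved:
        return []
    if minutes_since_alert < ESCALATION_AFTER_MINUTES:
        # Level 1 (newer engineers) gets first look at the problem.
        return ["level1"]
    # No apparent solution yet, so bring in Level 2 (experienced engineers).
    return ["level1", "level2"]


if __name__ == "__main__":
    for minute in (0, 2, 3, 10):
        print(minute, who_is_paged(minute, resolved=False))
```

The point of the short cutoff is that Level 1 is not expected to struggle alone; it buys a quick first look and then pulls in experience.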
7:50 – 8:10: “How we monitor and scale FB” by Patrick Taylor
– lsof, strace, nm, /proc/pid/exe and trace.py
– cubism (folded perf graphs) click through to “Deep Dive” graphs
– region, data center, cluster, rack, server
– examples with retransmit, cluster cache failure
Patrick/Facebook has developed a “Maslow Hierarchy of Needs for PE”, similar to this one.
Similar Horizon or Cubism Chart by Splunk. Charts are folded into 25% of original height, with darker colors representing larger values.
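To make the folding concrete, here is a minimal Python sketch of how a series can be split into four stacked bands, each a quarter of the original height. It is my own illustration of the general horizon-chart idea, not code from Facebook, Cubism or Splunk; the function name `fold_horizon` and the sample values are made up.

```python
def fold_horizon(values, max_value, bands=4):
    """Split each value into `bands` layers, each at most max_value / bands tall.

    Layer 0 is the lightest color and the last layer the darkest. Drawing all
    layers on top of each other gives a chart 1/bands of the original height.
    """
    band_height = max_value / bands
    layers = []
    for i in range(bands):
        base = i * band_height
        # Portion of each value that falls inside band i, clipped to the band.
        layers.append([min(max(v - base, 0.0), band_height) for v in values])
    return layers


if __name__ == "__main__":
    series = [10, 30, 60, 100]
    for depth, layer in enumerate(fold_horizon(series, max_value=100)):
        print(f"band {depth} (darker as depth grows): {layer}")
```

Summing the layers at any position gives back the original value, so nothing is lost by folding; the height is simply traded for color depth.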
8:10 – 8:30: “Supporting Global Events in FB Live” by Peter Knowles (10-year employee)
– 3 methods of load testing:
- Remove nodes
- Synthetic load
- Shadow traffic (duplicated traffic, but don’t show copies in user timeline)
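Shadow traffic is the least obvious of the three, so here is a rough Python sketch of the idea as I understood it: serve the real request normally, send marked copies to the system under test, and suppress anything user-visible for the copies. Every name here (`Request`, `Backend`, `serve_with_shadow`, the `multiplier` parameter) is hypothetical; the talk did not show Facebook's actual implementation.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Request:
    user: str
    payload: str
    shadow: bool = False  # True for duplicated (shadow) copies


@dataclass
class Backend:
    name: str
    handled: int = 0
    timeline: list = field(default_factory=list)

    def handle(self, req):
        # Every request costs work, which is what the load test measures.
        self.handled += 1
        # Only real requests produce user-visible results.
        if not req.shadow:
            self.timeline.append((req.user, req.payload))
        return f"{self.name}:{req.payload}"


def serve_with_shadow(req, primary, shadow_target, multiplier=1):
    """Serve req on primary and send `multiplier` shadow copies to shadow_target."""
    response = primary.handle(req)
    for _ in range(multiplier):
        shadow_target.handle(replace(req, shadow=True))
    return response  # the user only ever sees the primary's response


if __name__ == "__main__":
    prod, candidate = Backend("prod"), Backend("candidate")
    print(serve_with_shadow(Request("alice", "go live"), prod, candidate, multiplier=2))
    print(prod.handled, candidate.handled, candidate.timeline)
```

Raising `multiplier` lets the tested system see several times the real load while users still get exactly one response and one timeline entry.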
8:30 – 9:00: Q&A Speaker Panel and Mingle
Q: If you run out of capacity, do you prioritize ads or user platform up?
A: Ad systems are lower priority than user platform.
Q: How does Facebook support multiple development languages?
A: Thrift. Longer answer is developers should use “officially supported” languages, but they can use anything they want as long as they write their own Thrift client library.
Q: When is Facebook moving to the Public Cloud?
A: Unnecessary. (Editor: FB is the Cloud.)
Q: What was your longest full outage?
A: 36 minutes or so about 18 months ago. (Speaker said he wrote the memo, so he won’t forget the number of minutes.)
Q: Are you using AI/NLP in monitoring?
A: Not yet, but something we follow.
There were some exhibits, with one item apparently being an OpenCompute server.
Food was chicken sliders, tomato bruschetta, triangle turnovers and smaller appetizers, with an open bar.
Thanks to Facebook for hosting the event.
code.facebook.com: How production engineers support global events on Facebook, PE Blog
r-bloggers.com: Cubism Horizon Charts in R
acm.org: Sizing the Horizon: The Effects of Chart Size and Layering on the Graphical Perception of Time Series Visualizations
tableau.com: Horizon Chart Workarounds in Tableau