I was at the free Yahoo! Hadoop cluster computing summit conference today at Techmart in Santa Clara.
More than 200 attendees showed up: the room was packed, standing room only. There was a single track, which was filmed. WiFi was poor at times.
I would describe the presenters as being “mathies”.
Chris from Yahoo! on PIG dataflow language
- Talked about goals of a dataflow language, how PIG does it (PIG Latin script, syntax like English meets makefile)
- all their users are engineers
- explained difference between a dataflow language (procedural) vs. declarative like SQL
- showed example of finding users that read high-PageRank sites by joining web logs with site ranks and then applying a filter
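The join-then-filter example reads naturally as a sequence of dataflow steps. Below is a minimal Python sketch of that idea. It is not Pig Latin, and the record layouts, sample data, function name, and the 0.5 rank threshold are all made up for illustration.

```python
# Hypothetical sample data; field names and values are illustrative only.
visits = [
    ("alice", "news.example.com"),
    ("bob", "spam.example.com"),
    ("alice", "blog.example.com"),
    ("carol", "news.example.com"),
]
site_ranks = {
    "news.example.com": 0.9,
    "blog.example.com": 0.3,
    "spam.example.com": 0.1,
}


def users_of_high_rank_sites(visits, site_ranks, threshold=0.5):
    # Step 1: join each visit with the rank of the visited site.
    joined = [
        (user, url, site_ranks[url])
        for user, url in visits
        if url in site_ranks
    ]
    # Step 2: keep only visits to sites ranked above the threshold.
    high = [row for row in joined if row[2] > threshold]
    # Step 3: project the distinct users, sorted for stable output.
    return sorted({user for user, _url, _rank in high})


result = users_of_high_rank_sites(visits, site_ranks)
print(result)
```

Each step names an intermediate result, which is the procedural flavor the talk contrasted with a single declarative SQL statement.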
IBM on jaql query language
- APLish looking syntax overall
- uses data in JSON format because XML is for interchange but not storage
- Alternative to PIG
- talking about merging jaql and PIG at some level
- not Open Sourced yet, has legal approval, needs cleanup
- can drop into “raw” MapReduce programming
- can define functions
Michael Isard, Microsoft Research
- MapReduce is hard to program
- Dryad abstracts that
- LINQ and DryadLINQ
- LINQ+C#, LINQ+.net
- PLINQ, SQL Server, partition tables
- Dryad, DryadLINQ, vector library, machine learning
- more research areas: app-level, system-level, LINQDB
Andy Konwinski, UCB – Hadoop Path Tracing
- student at the RAD Lab (Reliable, Adaptive, Distributed) with Matei Zaharia
- X-Trace framework, 500 LOC change to Hadoop, very low overhead
- 50 GB crawl, 20 MB trace data
- useful for developers and operators
- path based tracing framework (directed acyclic graph)
- RPCs, HTTP requests
- annotate messages with trace metadata (16 bytes) carried along execution path
- instrument protocol APIs and RPC libraries
- Wikipedia corpus, 25 nodes, 2 hrs down to 7 min by 2 reducers per node
- Facebook data from hive jobs
- need to test > 25 nodes
- x-trace.net
- plan to submit code to project after cleanup
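The core idea of carrying a small piece of metadata along the execution path can be sketched in a few lines of Python. This is only an illustration, not the X-Trace API: the split of the 16 bytes into an 8-byte task ID plus an 8-byte operation ID, the function names, and the event labels are my assumptions.

```python
import os


def new_metadata():
    # 16 bytes total: 8-byte task ID + 8-byte operation ID (assumed layout).
    return os.urandom(8) + os.urandom(8)


def next_hop(metadata, label, edges):
    """Record one step of the path and return metadata for outgoing messages."""
    task_id, parent_op = metadata[:8], metadata[8:]
    child_op = os.urandom(8)
    # Each report is an edge parent -> child; together the edges form a DAG.
    edges.append((parent_op, child_op, label))
    # The task ID stays the same so all reports can be stitched together later.
    return task_id + child_op


edges = []
root = new_metadata()
a = next_hop(root, "RPC: submit job", edges)
b = next_hop(a, "HTTP: fetch map output", edges)
c = next_hop(a, "RPC: heartbeat", edges)  # second branch from the same parent

for parent, child, label in edges:
    print(parent.hex(), "->", child.hex(), label)
```

Because only the RPC and HTTP layers need to copy the metadata from incoming to outgoing messages, application code stays mostly untouched, which fits the small change size mentioned in the talk.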
Ben Reed from Yahoo!, Zookeeper API
- locks are easy to write! Hard to debug
- need better coordination, wait-free desirable, start with file api, strip it, ordered and persistent and atomic, hierarchical, watch
- HOD (Hadoop On Demand), sample job tracking
- zookeeper.sf.net, java code, java and c clients, paxos is slower, synced to disk, static config
- no healing, replace with new host same ip, currently single DC, looking at weakening for multi-DC, namespaces.
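To make the "stripped-down file API with watches" idea concrete, here is a toy in-memory sketch in Python. It is not the ZooKeeper client API: the class name `ToyTree`, its methods, and the event strings are hypothetical, and it ignores replication, persistence, and ordering across servers. Watches here fire once and are then removed.

```python
class ToyTree:
    """Hierarchical namespace with one-shot watches (illustration only)."""

    def __init__(self):
        self.nodes = {"/": b""}   # path -> data
        self.watches = {}         # path -> list of callbacks

    @staticmethod
    def _parent(path):
        return path.rsplit("/", 1)[0] or "/"

    def create(self, path, data=b""):
        parent = self._parent(path)
        if parent not in self.nodes:
            raise KeyError("missing parent: " + parent)
        if path in self.nodes:
            raise KeyError("already exists: " + path)
        self.nodes[path] = data
        self._fire(parent, "children-changed")

    def children(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return sorted(
            p for p in self.nodes if p != "/" and self._parent(p) == path
        )

    def _fire(self, path, event):
        # pop() removes the callbacks, so each watch triggers at most once.
        for callback in self.watches.pop(path, []):
            callback(path, event)


events = []
tree = ToyTree()
tree.create("/app")
tree.children("/app", watch=lambda path, event: events.append((path, event)))
tree.create("/app/worker-1")   # fires the watch
tree.create("/app/worker-2")   # watch already consumed, nothing fires
print(events, tree.children("/app"))
```

A client that wants continuous notifications would re-register the watch each time it reads the children again.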
Facebook, HIVE
- SQL layer on top of MR with SELECT, UNION ALL, GROUP BY
- SQL appealing to non-procedural programmers
- 22 TB raw logs per day, compressed down to 300 GB
- 40 engineers use it, 25% of eng staff, busy cluster with thousands of jobs run
- to be Open Sourced in a few months
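A rough way to see how a GROUP BY can sit on top of MapReduce is to write the three phases by hand. The Python sketch below mimics a query like `SELECT page, COUNT(*) FROM logs GROUP BY page`; the table, column names, and sample rows are invented, and this is not a description of how Hive actually compiles queries.

```python
from collections import defaultdict

# Hypothetical log rows; column names are illustrative only.
logs = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/home"},
    {"user": "a", "page": "/about"},
]


def map_phase(rows):
    # Emit (group key, 1) for every input row.
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    # Bring all values for the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # COUNT(*) per group is the sum of the emitted ones.
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(logs)))
print(counts)
```

The appeal for SQL users is that they write only the one-line query and never see these phases.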
Amazon Web Services Evangelist
- EC2 makes Hadoop easy
- 2 scripts to setup, to be Open Sourced soon
- used internally for Alexa's GrepTheWeb and cluster GC tasks
- can search for an email address across 10 million pages in 6 minutes
Ground Model Generator
- Hadoop and MR test to process geophysical data
- both stack C code and hadoop can run in a few hours
- cost of sort, shuffle, reduce not really necessary for this app
- 59 node Hadoop cluster
- overlay output data on Google Earth with roads enabled, fairly good registration
- experimenting with using FUSE to mount HDFS for VMs distributed FS, works so far with 10 VMs
Autodesk Seek
- Autodesk Seek is a vertical search for AEC (architecture, engineering, construction) parts.
- Save architect time, design to reduce energy.
- 11 million parts and growing.
- products in 34 languages.
- Partatom – their extension to Atom feed format
- Taxonomies to map input feed to canonical format.
- Also do rendering and structure modelling with Hadoop when possible.
- EC2 with 25 AMI (Amazon Machine Image) instances typically.
Yahoo! WebMap
- Webmap is directed graph of entire web, maintained with 100 old programs.
- Wanted C/C++ access to a DFS and MapReduce. Old framework 1000 nodes
- Hadoop over 2000 nodes, more resilient to node failures.
- now 70 hour jobs with 100k maps with 300 TB, 10k cores.
- Open source is good for science.
Prof. Lin, Maryland U and Google
- discussed Google contributions to academic cluster computing
- started with backdoor machine time loans, then 40 node cluster, then DC time through NSF
- Prof. projects include machine translation, bio alignment
Panel Discussion
- speakers mainly going over their roadmap slide.
- plans to improve reliability of HDFS and add Kerberos authentication to Hadoop (prevent end-users from deleting 200 TB of somebody else’s data)
- append mode for HDFS
- Jeremy Z. asked for a show of hands on who wanted a monthly user group meeting hosted at Yahoo!. Positive response.
Thanks to Yahoo! for hosting the conference and the various sponsors, including Google and Powerset.

