Hadoop Summit Conference, Santa Clara

I attended Yahoo!'s free Hadoop Summit on cluster computing today at TechMart in Santa Clara.

More than 200 attendees showed up; the room was packed, standing room only. There was a single track, which was filmed. The WiFi was poor at times.

I would describe the presenters as being “mathies”. :)

Chris from Yahoo! on PIG dataflow language

- Talked about goals of a dataflow language, how PIG does it (PIG Latin script, syntax like English meets makefile)
- all their users are engineers
- explained difference between a dataflow language (procedural) vs. declarative like SQL
- showed example of finding users that read high PR sites by joining web logs, site ranks, and a filter
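
The join-and-filter example above can be sketched as a step-by-step dataflow. This is plain Python, not PIG Latin; the function name, the sample data, and the 0.5 threshold are made up for illustration.

```python
from collections import defaultdict

def users_reading_high_rank_sites(visits, ranks, threshold):
    """Join (user, url) visits with (url, rank) pairs, then keep rank > threshold."""
    # Step 1: load the site ranks into a lookup table (the build side of the join).
    rank_by_url = dict(ranks)
    # Step 2: join each visit to its rank, and Step 3: filter on the rank.
    pages_by_user = defaultdict(set)
    for user, url in visits:
        rank = rank_by_url.get(url)
        if rank is not None and rank > threshold:
            pages_by_user[user].add(url)
    # Step 4: emit one sorted list of high-rank pages per user.
    return {user: sorted(urls) for user, urls in pages_by_user.items()}

# Hypothetical sample data, not from the talk.
visits = [("amy", "a.example"), ("bob", "b.example"), ("amy", "c.example")]
ranks = [("a.example", 0.9), ("b.example", 0.2), ("c.example", 0.7)]
result = users_reading_high_rank_sites(visits, ranks, 0.5)
print(result)
```

Each step names what to do next, which is the procedural flavor of a dataflow language; in SQL the same thing would be one declarative query and the planner would pick the order.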

IBM on jaql query language

- APLish looking syntax overall
- uses data in JSON format because XML is for interchange but not storage
- Alternative to PIG
- talking about merging jaql and PIG at some level
- not Open Sourced yet, has legal approval, needs cleanup
- can drop into “raw” MapReduce programming
- can define functions

Michael Isard, Microsoft Research

- MapReduce is hard to program
- Dryad abstracts that
- LINQ and DryadLINQ
- LINQ+C#, LINQ+.net
- PLINQ, SQL Server, partition tables
- Dryad, DryadLINQ, vector library, machine learning
- more research areas: app-level, system-level, LINQDB

Andy Konwinski, UCB – Hadoop Path Tracing

- student of RadLab (Reliable, Adaptive, Distributed) with Matei Zaharia
- X-Trace framework, 500 LOC change to Hadoop, very low overhead
- 50 GB crawl, 20 MB trace data
- useful for developers and operators
- path based tracing framework (directed acyclic graph)
- RPCs, HTTP requests
- annotate messages with trace metadata (16 bytes) carried along execution path
- instrument protocol APIs and RPC libraries
- Wikipedia corpus, 25 nodes, 2 hrs down to 7 min by 2 reducers per node
- Facebook data from hive jobs
- need to test > 25 nodes
- x-trace.net
- plan to submit code to project after cleanup
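
The idea of carrying trace metadata along the execution path can be sketched roughly as below. This is not the X-Trace code or its wire format; the 8-byte task id plus 8-byte operation id split is only my assumption for filling the 16 bytes mentioned in the talk, and the function names are hypothetical.

```python
import os

TASK_BYTES = 8  # assumed split: 8-byte task id ...
OP_BYTES = 8    # ... plus 8-byte operation id = 16 bytes total

def start_trace():
    """Create metadata for the first operation of a new task."""
    return os.urandom(TASK_BYTES) + os.urandom(OP_BYTES)

def next_hop(metadata, label, report):
    """Derive metadata for a downstream operation and record the edge."""
    task, parent_op = metadata[:TASK_BYTES], metadata[TASK_BYTES:]
    op = os.urandom(OP_BYTES)
    # Each report row is one edge (parent_op -> op) of the task's graph.
    report.append((task, parent_op, op, label))
    return task + op

report = []
root = start_trace()
rpc = next_hop(root, "rpc: submit job", report)
map1 = next_hop(rpc, "map task 1", report)
map2 = next_hop(rpc, "map task 2", report)
```

Because every message keeps the same task id and names its parent operation, the collected reports can be stitched back into a directed acyclic graph per job.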

Ben Reed from Yahoo!, Zookeeper API

- locks are easy to write! Hard to debug
- need better coordination, wait-free desirable, start with file api, strip it, ordered and persistent and atomic, hierarchical, watch
- HOD (Hadoop On Demand), sample job tracking
- zookeeper.sf.net, java code, java and c clients, paxos is slower, synced to disk, static config
- no healing, replace with new host same ip, currently single DC, looking at weakening for multi-DC, namespaces.
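
A toy, in-memory sketch of the "stripped-down file API with watches" idea follows. It is not the ZooKeeper client API; the class and method names are invented, and there is no replication, ordering, or persistence here, only the hierarchical namespace and one-shot watches.

```python
class ToyTree:
    """Hypothetical in-memory stand-in for a hierarchical namespace with watches."""

    def __init__(self):
        self.nodes = {"/": b""}
        self.watches = {}

    def create(self, path, data=b""):
        # A node can only be created under an existing parent.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("missing parent: " + parent)
        if path in self.nodes:
            raise KeyError("already exists: " + path)
        self.nodes[path] = data
        self._fire(path, "created")

    def set(self, path, data):
        if path not in self.nodes:
            raise KeyError("no such node: " + path)
        self.nodes[path] = data
        self._fire(path, "changed")

    def watch(self, path, callback):
        self.watches.setdefault(path, []).append(callback)

    def _fire(self, path, event):
        # Watches are one-shot: they are removed as they fire.
        for callback in self.watches.pop(path, []):
            callback(path, event)

events = []
tree = ToyTree()
tree.create("/jobs")
tree.watch("/jobs/job1", lambda path, event: events.append((path, event)))
tree.create("/jobs/job1", b"running")
tree.set("/jobs/job1", b"done")  # no watch left, so nothing fires
print(events)
```

A client that wants to keep following a node has to set the watch again after each notification.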

Facebook, HIVE

- SQL layer on top of MR with SELECT, UNION ALL, GROUP BY
- SQL appealing to non-procedural programmers
- 22 TB raw logs per day, compressed down to 300 GB
- 40 engineers use it, 25% of eng staff, busy cluster with thousands of jobs run
- to be Open Sourced in a few months
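
To show how a SQL GROUP BY can sit on top of MapReduce, here is a minimal single-process sketch of something like SELECT page, COUNT(*) ... GROUP BY page. It is not HIVE code; the function names, the "page" column, and the sample rows are made up.

```python
from collections import defaultdict

def map_phase(rows, key_col):
    """Map: emit (group key, 1) for every input row."""
    for row in rows:
        yield row[key_col], 1

def shuffle(pairs):
    """Shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: one aggregate (here a count) per key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical log rows.
rows = [
    {"user": "amy", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "amy", "page": "/photos"},
]
counts = reduce_phase(shuffle(map_phase(rows, "page")))
print(counts)
```

The GROUP BY column becomes the map output key, and the aggregate function becomes the reducer.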

Amazon Web Services Evangelist

- EC2 makes hadoop easy
- 2 scripts to setup, to be Open Sourced soon
- used internally for Alexa GrepTheWeb and cluster GC tasks
- can search for an email address across 10 million pages in 6 minutes

Ground Model Generator

- Hadoop and MR test to process geophysical data
- both stack C code and hadoop can run in a few hours
- cost of sort, shuffle, reduce not really necessary for this app
- 59 node Hadoop cluster
- overlay output data on Google Earth with roads enabled, fairly good registration
- experimenting with using FUSE to mount HDFS for VMs distributed FS, works so far with 10 VMs

Autodesk Seek

- Autodesk Seek: vertical search for AEC parts.
- Save architect time, design to reduce energy.
- 11 million parts and growing.
- products in 34 languages.
- Partatom – their extension to Atom feed format
- Taxonomies to map input feed to canonical format.
- Also do rendering and structure modelling with Hadoop when possible.
- EC2 with 25 AMI (Amazon Machine Image) instances typically.

Yahoo! WebMap

- WebMap is a directed graph of the entire web, previously maintained with 100 old programs.
- Wanted C/C++ access to a DFS and MapReduce. The old framework ran on 1000 nodes.
- Hadoop runs on over 2000 nodes and is more resilient to node failures.
- now 70 hour jobs with 100k maps, 300 TB, 10k cores.
- Open good for science.

Prof. Lin, University of Maryland, and Google

- discussed Google contributions to academic cluster computing
- started with backdoor machine time loans, then 40 node cluster, then DC time through NSF
- the professor's projects include machine translation, bio alignment

Panel Discussion

- speakers mainly going over their roadmap slide.
- plans to improve reliability of HDFS and add Kerberos authentication to Hadoop (prevent end-users from deleting 200 TB of somebody else’s data)
- append mode for HDFS
- Jeremy Z. asked for a show of hands on who wanted a monthly user group meeting hosted at Yahoo!. Positive response.

Thanks to Yahoo! for hosting the conference and the various sponsors, including Google and Powerset.
