Archive for March, 2008

Hadoop Summit Conference, Santa Clara

Tuesday, March 25th, 2008

I was at the free Yahoo! Hadoop cluster computing summit conference today at Techmart in Santa Clara.

More than 200 attendees showed up – packed and SRO (standing room only). There was a single track, which was filmed. WiFi was poor at times.

I would describe the presenters as being “mathies”. :)

Chris from Yahoo! on PIG dataflow language

- Talked about goals of a dataflow language, how PIG does it (PIG Latin script, syntax like English meets makefile)
- all their users are engineers
- explained difference between a dataflow language (procedural) vs. declarative like SQL
- showed example of finding users that read high-PageRank sites by joining web logs with site ranks and then applying a filter
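
The join-then-filter example from the talk can be sketched in plain Python as a series of explicit dataflow steps. This is not PIG Latin; the sample records, the field names, and the 0.5 rank threshold are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: (user, url) visits and (url, rank) site ranks.
visits = [
    ("amy", "cnn.com"),
    ("amy", "example.org"),
    ("bob", "example.org"),
    ("cat", "cnn.com"),
    ("cat", "nytimes.com"),
]
site_ranks = [
    ("cnn.com", 0.9),
    ("nytimes.com", 0.8),
    ("example.org", 0.2),
]


def users_reading_high_rank_sites(visits, site_ranks, threshold=0.5):
    """Run the dataflow one step at a time: join, group, average, filter."""
    # Step 1: join visits with ranks on url.
    rank_by_url = dict(site_ranks)
    joined = [
        (user, url, rank_by_url[url])
        for user, url in visits
        if url in rank_by_url
    ]

    # Step 2: group the joined records by user.
    ranks_by_user = defaultdict(list)
    for user, _url, rank in joined:
        ranks_by_user[user].append(rank)

    # Step 3: compute each user's average site rank.
    avg_rank = {
        user: sum(ranks) / len(ranks)
        for user, ranks in ranks_by_user.items()
    }

    # Step 4: keep only users whose average rank exceeds the threshold.
    return sorted(user for user, avg in avg_rank.items() if avg > threshold)


result = users_reading_high_rank_sites(visits, site_ranks)
print(result)
```

Each step names an intermediate result, which is the procedural flavor the talk contrasted with a single declarative SQL statement.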

IBM on jaql query language

- APLish looking syntax overall
- uses data in JSON format because XML is for interchange but not storage
- Alternative to PIG
- talking about merging jaql and PIG at some level
- not Open Sourced yet, has legal approval, needs cleanup
- can drop into “raw” MapReduce programming
- can define functions

Michael Isard, Microsoft Research

- MapReduce is hard to program
- Dryad abstracts that
- LINQ and DryadLINQ
- LINQ+C#, LINQ+.net
- PLINQ, SQL Server, partition tables
- Dryad, DryadLINQ, vector library, machine learning
- more research areas: app-level, system-level, LINQDB

Andy Konwinski, UCB – Hadoop Path Tracing

- student of RadLab (Reliable, Adaptive, Distributed) with Matei Zaharia
- X-Trace framework, 500 LOC change to Hadoop, very low overhead
- 50 GB crawl, 20 MB trace data
- useful for developers and operators
- path based tracing framework (directed acyclic graph)
- RPCs, HTTP requests
- annotate messages with trace metadata (16 bytes) carried along execution path
- instrument protocol APIs and RPC libraries
- Wikipedia corpus, 25 nodes, 2 hrs down to 7 min by using 2 reducers per node
- Facebook data from hive jobs
- need to test > 25 nodes
- x-trace.net
- plan to submit code to project after cleanup
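
The path-tracing idea in these notes (small metadata carried with every message, one report per event, reports linked into a graph) can be sketched in Python roughly as follows. This is a toy, not the X-Trace API: the function names, the in-memory `REPORTS` list, and the 8-byte task id plus 8-byte operation id layout are assumptions chosen only to match the 16-byte figure above.

```python
import os
from collections import defaultdict

# Stand-in for a trace collector: (task_id, op_id, parent_op_id, label).
REPORTS = []


def new_trace():
    """Start a trace: 8-byte task id + 8-byte zero op id (assumed layout)."""
    return os.urandom(8) + bytes(8)


def log_event(metadata, label):
    """Record one event and return the metadata for outgoing messages."""
    task_id, parent_op = metadata[:8], metadata[8:]
    op_id = os.urandom(8)
    REPORTS.append((task_id, op_id, parent_op, label))
    return task_id + op_id


def rpc_call(metadata, name, handler):
    """Instrumented RPC: log the send, then pass the metadata to the callee."""
    return handler(log_event(metadata, f"send {name}"))


def datanode_read(metadata):
    log_event(metadata, "datanode read block")
    return "ok"


def map_task(metadata):
    metadata = log_event(metadata, "map task start")
    rpc_call(metadata, "read block A", datanode_read)
    rpc_call(metadata, "read block B", datanode_read)
    return "done"


root = new_trace()
rpc_call(root, "run map", map_task)

# Rebuild the execution graph from the reports: parent label -> child labels.
label_of = {op: label for _task, op, _parent, label in REPORTS}
children = defaultdict(list)
for _task, op, parent, label in REPORTS:
    if parent in label_of:
        children[label_of[parent]].append(label)

for parent, kids in children.items():
    print(parent, "->", kids)
```

Because only the RPC helper and the event logger touch the metadata, the application code changes very little, which fits the small-patch, low-overhead claim in the talk.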

Ben Reed from Yahoo!, Zookeeper API

- locks are easy to write! Hard to debug
- need better coordination, wait-free desirable, start with file api, strip it, ordered and persistent and atomic, hierarchical, watch
- HOD (Hadoop On Demand), sample job tracking
- zookeeper.sf.net, Java code, Java and C clients, Paxos is slower, synced to disk, static config
- no healing, replace with new host same ip, currently single DC, looking at weakening for multi-DC, namespaces.

Facebook, HIVE

- SQL layer on top of MR with SELECT, UNION ALL, GROUP BY
- SQL appealing to non-procedural programmers
- 22 TB raw logs per day, compressed down to 300 GB
- 40 engineers use it, 25% of eng staff, busy cluster with thousands of jobs run
- to be Open Sourced in a few months
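
To give a feel for what a SQL layer over MapReduce has to produce, here is a rough Python sketch of how a query such as `SELECT page, COUNT(*) FROM logs GROUP BY page` might break into map, shuffle, and reduce phases. It is not how HIVE actually plans queries; the table, the field names, and the sample rows are invented for illustration.

```python
from collections import defaultdict

# Hypothetical log rows standing in for a "logs" table.
logs = [
    {"user": "amy", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "amy", "page": "/photos"},
    {"user": "cat", "page": "/home"},
]


def map_phase(rows):
    """Emit (group key, 1) for every row; the key is the GROUP BY column."""
    for row in rows:
        yield row["page"], 1


def shuffle_phase(pairs):
    """Collect all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply the aggregate (COUNT here) to each group."""
    return {key: sum(values) for key, values in sorted(groups.items())}


counts = reduce_phase(shuffle_phase(map_phase(logs)))
print(counts)
```

The person writing the query only states the grouping and the aggregate; the three functions are the procedural part a SQL layer can hide.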

Amazon Web Services Evangelist

- EC2 makes Hadoop easy
- 2 scripts to set up, to be Open Sourced soon
- used internally for Alexa GrepTheWeb and cluster GC tasks
- can search for an email address across 10 million pages in 6 minutes

Ground Model Generator

- Hadoop and MR test to process geophysical data
- both stack C code and hadoop can run in a few hours
- cost of sort, shuffle, reduce not really necessary for this app
- 59 node Hadoop cluster
- overlay output data on Google Earth with roads enabled, fairly good registration
- experimenting with using FUSE to mount HDFS for VMs distributed FS, works so far with 10 VMs

Autodesk Seek

- Autodesk Seek: vertical search for AEC (architecture, engineering, construction) parts.
- Save architect time, design to reduce energy.
- 11 million parts and growing.
- products in 34 languages.
- Partatom – their extension to Atom feed format
- Taxonomies to map input feed to canonical format.
- Also do rendering and structure modelling with Hadoop when possible.
- EC2 with 25 AMI (Amazon Machine Image) instances typically.

Yahoo! WebMap

- WebMap is a directed graph of the entire web, maintained with 100 old programs.
- Wanted C/C++ access to a DFS and MapReduce. Old framework ran on 1000 nodes.
- Hadoop over 2000 nodes, more resilient to node failures.
- now 70 hour jobs with 100k maps with 300 TB, 10k cores.
- Open source is good for science.

Prof. Lin, Maryland U and Google

- discussed Google contributions to academic cluster computing
- started with backdoor machine time loans, then 40 node cluster, then DC time through NSF
- Prof. projects include machine translation, bio alignment

Panel Discussion

- speakers mainly going over their roadmap slide.
- plans to improve reliability of HDFS and add Kerberos authentication to Hadoop (prevent end-users from deleting 200 TB of somebody else’s data)
- append mode for HDFS
- Jeremy Z. asked for a show of hands on who wanted a monthly user group meeting hosted at Yahoo!. Positive response.

Thanks to Yahoo! for hosting the conference and the various sponsors, including Google and Powerset.

IMUG: Vertical Text on the World Wide Web

Thursday, March 20th, 2008

Stephen Zilles, Standards Architect, Adobe Systems, gave a talk tonight at IMUG on “Vertical Text on the World Wide Web” about W3C text formatting standards, such as CSS, SVG and XSL, for various languages.

Some interesting examples are Mongolian, which is written top to bottom, and Japanese, which can be written vertically (top to bottom, with columns running right to left) or horizontally, and which uses tate-chu-yoko (horizontal within vertical), commonly for numbers. Line-breaking may be codified in JIS X 4051. Ogham and Batak are bottom-to-top scripts, a direction that is not specifically supported.

Text formatting can include direction, rotation and transform properties, glyph orientation, line height and width.

Asian printing often uses rotation of English characters to conform to the block progression that started with vertical Chinese or Japanese, for example.

Thanks to Apple for hosting the meeting.

W3C Documents (Membership Required)

Happy St. Patrick’s Day

Monday, March 17th, 2008

St. Patrick’s Day is an Irish holiday celebrated on March 17 in the USA, as well as in Canada and other Commonwealth countries, by wearing green clothes and drinking green beer.

I’ll have to take it easier on the green beer next year. :)

wikipedia: St. Patrick’s Day