I went to ApacheCon 2009 in Oakland. Why Oakland? The ASF was founded here 10 years ago.
Most of the attendees that I talked to were primarily interested in search technologies, or were Apache project committers. The search users were either already using Lucene and Solr, or were using commercial software and evaluating Lucene and Solr.
There was also a lot of interest in Hadoop, ZooKeeper and NoSQL projects.
I added a NoSQL project features table to Wikipedia after the NoSQL BoF.
The conference was very well-organized, with tutorials, BoFs, a BarCamp, and sessions. Meetup.com was used to generate the highest BoF turnout that I’ve ever seen – close to 100 at the Lucene and Hadoop BoFs. (O’Reilly Conferences can learn from that.)
The Oakland Convention Center was a good venue for this conference, though the attached Oakland Marriott hotel is $$$$ and fond of surcharges, like $33/day for parking, $5 draught beer and $3.75 for a bottle of water in-room.
The keynotes and one track per day were recorded and are available for $99 at Linux Pro Magazine Streaming.
StoneCircle Productions was the conference organizer.
Although I live in San Jose, Oakland is far enough away that I'd never been there before. Oakland has a compact downtown full of historic buildings, and Alameda is also nice, but things get less pretty at night.
I went to the Lucene tutorial on Monday.
– awesome views of the Bay Area past the Golden Gate Bridge from the 21st floor
– FAST is a pretty good indexing and search solution, but it was bought by Microsoft recently (will Linux support continue or not?)
– FAST has FQL (users pronounce it fecal) query language
– one company replaced 150 FAST servers with 40 Lucene servers
– the FAST4 to FAST5 upgrade is tough, about as much work as a port to, say, Lucene; upgrades are forced to keep support
– linguistics is 60% of the value of FAST according to Monster, 13 languages supported
– “bad stems” can be a nightmare
– Solr, built on top of Lucene, gives you 90% of what you would otherwise need to program in Java
– Open Source search is not really about price, but about control and flexibility
Monday Afternoon – Lucene Tutorial
– user-assigned document id not mandatory, but great idea for many reasons, including after an index-rebuild
– lucene-assigned id only valid for that snapshot (life of score doc)
– parameter to keep or delete old index directory
– StringBuilder is more efficient than repeated string concatenation
– populating title column is a good idea
– results boosting handy for ecommerce, specials, etc.
– LUKE – handy tool for index statistics, etc.
– Searcher class, snapshot in time, won’t see new merges
– contrib/ has more analyzers
– snowball stemmers
– use 1 tokenizer and 0 or more token filters
– precision-recall curve ??
– n-grams and shingles (“the president”, “United states”)
– in pre-2.9 Lucene, numbers and dates are really strings
– the 2.9 NumericField builds a trie structure, which helps optimize range queries
– SOLR analysis tool apache-solr
– relevance feedback with MoreLikeThis
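The n-grams and shingles bullet above is easier to picture with a small example. This is only a sketch in plain Python, not Lucene's analyzer code: `shingles` and `char_ngrams` are hypothetical helper names, and the whitespace split stands in for a real tokenizer.

```python
def shingles(tokens, size=2):
    """Return word n-grams (shingles) of the given size, in document order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def char_ngrams(text, size=3):
    """Return character n-grams of the given size."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [text[i:i + size] for i in range(len(text) - size + 1)]


# Whitespace splitting is an assumption; a real analyzer chain would tokenize.
tokens = "the president of the united states".split()
print(shingles(tokens))
print(char_ngrams("lucene"))
```

Indexing two-word shingles such as "the president" or "united states" lets a phrase match as a single term, at the cost of a larger index.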
– “ground computing”
– “offline by default”
– now an ubuntu service
– mozilla raindrop to combine chat client msgs
– append-only btree
– rsyncable since append-only, also replication
– checksums everywhere
– windows not first class yet, mozilla improving it
– don’t like sql
– brasstacks test tool storage
– store now, index later
– replicate to handle large indexing load
– testbot ci
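The append-only and "checksums everywhere" bullets above can be sketched with a flat record log. This is an assumption-heavy illustration, not the on-disk format of the store described in these notes (which uses an append-only B-tree): the header layout and the names `append_record` and `read_records` are made up for the example.

```python
import io
import struct
import zlib

# Hypothetical record header: payload length and CRC32 of the payload.
HEADER = struct.Struct(">II")


def append_record(log, payload):
    """Append one record at the end of the log; existing bytes are never rewritten."""
    log.seek(0, io.SEEK_END)
    offset = log.tell()
    log.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    log.write(payload)
    return offset


def read_records(log):
    """Read records from the start, stopping at the first truncated or corrupt one."""
    log.seek(0)
    records = []
    while True:
        header = log.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # clean end of file, or a truncated header
        length, crc = HEADER.unpack(header)
        payload = log.read(length)
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt tail: keep only the records before it
        records.append(payload)
    return records


log = io.BytesIO()
append_record(log, b"first doc")
append_record(log, b"second doc")
print(read_records(log))
```

Because old bytes never change, a copy of the file only ever needs the new tail, which is why an append-only file suits rsync and replication.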
– great for articles, books
– extremely fast, largest is 200 TB of XML on 166 hosts
– database server
– 180 clients, 150 employees
– markmail.org demo contains 42 million email messages, very impressive performance with 5 views in almost realtime. Search is distributed across 160 nodes.
JCR in 15 minutes
– Bertrand Del
– Apache Jackrabbit is a fully conforming implementation of the Content Repository for Java Technology API (JCR). A content repository is a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more.
– the ultimate content store
– content repo, union of database and filesystem, best of both worlds
– full-text search combined with structured search
– information forage
– “resume-driven design”
– available in 1.4
– tune by modifying precisionStep
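To make the precisionStep bullet concrete, here is a simplified model of how a trie-encoded numeric field indexes one value at several precisions. It is a sketch under stated assumptions, not Lucene's actual encoding: `trie_terms` is a hypothetical name, values are treated as non-negative integers, and terms are shown as (shift, prefix) pairs rather than encoded strings.

```python
def trie_terms(value, precision_step=4, bits=32):
    """Return the (shift, prefix) pairs indexed for one non-negative integer.

    Each pair drops `shift` low-order bits, so higher shifts cover wider ranges.
    """
    if precision_step < 1:
        raise ValueError("precision_step must be at least 1")
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the given number of bits")
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]


# A smaller step indexes more terms per value; a larger step indexes fewer.
print(len(trie_terms(1234, precision_step=4)))
print(len(trie_terms(1234, precision_step=8)))
```

The trade-off being tuned: a smaller step means more terms per value in the index, but a range query can be covered with fewer, coarser terms.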
One bewildered attendee wished for a NoSQL product matrix, so I added that to the Wikipedia NoSQL page.
Becoming a Pig Developer, Alan Gates
– Apache Pig is a sub-project of Apache Hadoop.
– this talk was really about how to use Pig as an end user, not how to become a Pig project developer
Apache Hadoop in the Cloud, Tom White
– general comments on using EC2 with Hadoop mostly
Practical HBase, Michael Stack
– Apache HBase is the Apache Hadoop database, similar to BigTable.
– HBase usage
mod_jk / mod_proxy and others, Jean-Frederic Clere and 2 others
– mod_jk, mod_proxy, mod_serf and mod_cluster original topics
– mostly focused on mod_jk, mod_proxy and isapi_redirect
– good talk by 3 long-term project contributors
– jk is kind of Java-centric, with support for Apache JServ Protocol (AJP) only available in Java back-end servers for now, like Tomcat
– isapi_redirect is primary way to do redirects on Windows IIS
– survey of audience showed several mod_proxy users, maybe one intentional mod_jk user
“Apache Lucene and Apache Solr Performance Tuning with Mark Miller” was packed, so moving along to a different room …
Scalable Internet Architectures, Theo Schlossnagle
– amazing and thought-provoking talk, also one of the most popular
– think about performance from network packet level to application level
– carp, vrrp, whackamole
– alterdns, neustar
– anycast (shared IP), geoip (but need actually accurate database)
– activemq, rabbitmq instead of Spread
– “memcached is the worst thing that ever happened to our industry – it solves a problem, just not the original problem”
– many apps today are so poorly designed that network issues never become scalability concerns – e.g. RoR applications
– max out at 500 requests per second across 40 boxes – RoR
– firebug and yslow have been fantastic at making front-end engineers aware of networking performance
– 10 Gb NICs suck
– instead of one big 20 Gbps load balancer, use anycast from the core router to 5x cheaper 4 Gbps load balancers
– spiky load or DDoS – announce a /32 to separate load balancer, use symmetric return path
– JMS, AMQP, Spread
durable message queues
– activemq (java)
– openamq (c) – hard to use
– rabbitmq (erlang) – nice except in durable mode because erlang disk io blows
– most common protocol Stomp is awful and slow (hard to read 100k messages per second) and not binary, but lots of clients exist.
– activemq and stomp is a good start.
– rabbitmq and native connectors are better, but no perl client.
– PCI compliance requires a stateful firewall. Hard to do 1.5 million packets per second traffic for most medium-sized data centers, need to use a CDN to distribute static requests and distribute the packets somewhere else
– leaving trailing / off causes 302, doubles traffic
– read/write ratio is 1 … likely IM or email?
– went over some networking details with Paul L. afterwards
Recent Developments in SSL and Browsers, Rick Andrews, Thawte
– 1.6 billion OCSP requests per day, need good infrastructure to support that
– intermediate CA allows root CA to be offline (chained hierarchy); SSLCertificateChainFile needs intermediate certificates before cross-certificates, since some clients need them in proper order
– EV hierarchy more complex. wanted new EV root, but older browsers don’t know about it.
– browser ubiquity problem with any new feature, hash or crypto algorithm
– logotypes – trademark and copyright issues with using other companies’ logos in a product
– Verisign does not have apache httpd committers, but should
Organizers didn’t show up, so spent 10 minutes talking to a handful of end-users about subversion gripes and moved along to …
– zk is persistent to disk
– can run on one node, but 3 is minimum non-toy
– zk is popular in academia now for some reason
– avoid split-brain partitioning between 2 data centers – bad
– very recent merge to fix -368, not ready for production yet
– people using it for a message queue, perhaps more reliable than many other Open Source ones
– need 1 zk node for testing, but 3 zk nodes for non-trivial implementation
– 4x to 5x compression with LZO, with a similar disk bandwidth improvement
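The zk node-count and split-brain bullets above come down to majority-quorum arithmetic. A minimal sketch, assuming a simple majority quorum over the whole ensemble; the function names are hypothetical and this is not ZooKeeper code.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


def sides_with_quorum(site_a, site_b):
    """After a partition between two sites, list the sites that keep a majority."""
    needed = quorum_size(site_a + site_b)
    return [name for name, nodes in (("a", site_a), ("b", site_b)) if nodes >= needed]


for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
print(sides_with_quorum(2, 2))
print(sides_with_quorum(3, 2))
```

One node tolerates no failures, which is why it is only good for testing, and three is the smallest ensemble that survives losing a node. With nodes spread over two data centers, at most one side can keep a majority after a partition, and an even split leaves neither side with one.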
A local owner of a gelato store handed out 6 free samples from a portable gelato freezer.
Building Intelligent Search Applications with the Lucene Ecosystem, Ted Dunning
– some matrix math
– using his matrix math optimization, a perl program on 1 server was faster than Mahout running on a $250k cluster
– the original LLR in NLP paper: “Accurate Methods for the Statistics of Surprise and Coincidence” (check on CiteSeer)
– Mahout project
tdunning [at] apache.org
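For readers who have not met LLR before, here is a minimal sketch of the log-likelihood ratio (G-squared) statistic for a 2x2 table of counts, the kind of test that paper applies to word co-occurrence. It is not code from the talk or from Mahout; `llr` and the cell names `k11` to `k22` are illustrative choices.

```python
import math


def llr(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 contingency table of counts.

    k11: both events seen together, k12 and k21: one without the other,
    k22: neither. Larger values mean the association is less likely to be chance.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            k = observed[i][j]
            if k > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += k * math.log(k / expected)
    return 2.0 * g2


print(llr(10, 10, 10, 10))   # counts match independence exactly
print(llr(100, 0, 0, 100))   # the two events always occur together
```

A score near zero means the counts look independent; a large score flags a pair worth a closer look.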
Realtime Search, Jason Rutherglen
– many technical issues prevent Lucene from being able to do realtime search
– lots of patches done, lots to do
– audience member thanked author for great work so far
Closing Plenary: Brian Behlendorf on Open Source and Charity
Talked to Alex Karasulu a little after the final presentation. He’s a committer on the Apache Directory project. He suggested adding dbm to the NoSQL product matrix. Wants a MacBook Air with 8 GB RAM to run his Java apps.