The MySQL Conference was held again at the Santa Clara Convention Center. As always, it was a great conference, with so many good talks in parallel tracks that it made choosing difficult. Here’s a link to the slides.
The Santa Clara Convention Center is not near any restaurants, so you have to rely on the conference lunches. Fortunately, they were pretty good this year (less sandwiches than last year.) However, it was often hard to find a coffee urn that wasn’t empty, so that was annoying.
My conference blog posts tend to focus on talks in the tracks that I attended and BOFs. Click on the “Conferences” category on the right-hand side to see my posts on previous conferences.
Monday, April 23
Tutorial: Scaling and High Availability Architectures, Jeremy Cole and Eric Bergen, ProvenScaling.com
A good tutorial on large-scale database architectures popular with the Web 2.0 crowd.
- Hash Partitioning
- Directory Partitioning
- HiveDB Java project started by Cole, contributions from clients
Tutorial: Wikipedia: Site Internals, Configuration and Code Examples, and Management Issues, Domas Mituzas
This was a very funny talk on how Wikipedia handles scaling and performance. Not intentionally funny, but ghetto database and system administration for when your project doesn’t have a lot of money or people.
Jimmy Wales’ SLA: wants “more up than down”
Downtime does not mean being fired, it means more donations,” because an error page soliciting donations is displayed.
One 9 is pretty good for Wikipedia site, most businesses aim for 5.
RAID0 fast, more storage, use replication.
RAID1 does not make sense because 1 hour
rebuild and could be dead.
RAIS
dc failure - get backup from mailing list users
security patches: if you can ping a server, or get a shell, you have bigger problems.
memcached is not faster than mysql if on network.
squid 2.6. mysql 4 with patches, can’t upgrade and
maintain performance. 2.6 can’t rewrite single header line
squid sync broken on Gb ethernet with 0.0 ping time
.net Lucene for search, worked the best for them when tested
read load bigger than write load, replication is great for updaters, put binlog position in cookie
hundreds of writes per second to single master, thousands of reads
no local logging, just wite to network and forget
“people complain about everything”
loves LVS
use fedora, testing ubuntu
Tuesday, April 24
Security and Debugging in PHP, Rasmus Lerdorf
Another lucid and practical talk by Rasmus on performance tuning. This time he picks on Drupal, which does like 50 database calls per page and gets load time down from 1.2 seconds to 50 ms.
mysql native driver
http_load
base/mysqli/mysqli-nd same speed 1.2s
callgrind
- recursive calls can take more than 100%
- if you see compile high (20%), install
op code cache like apc
- xdebug
- kcachegrind drupal_xdebug1.out
- drupal has 3 performance options:
- off 1200 ms
- normal 80 ms
- aggressive 50 ms
- div
- cracker JS control panel: inspect dom for colored links (like banking sites), see if operation succeeds in same browser session (like a banking transaction)
- browser zombie control panel with iframe
- linksys dmz to make your pc outside fw
- filter extension in 5.2 by default
- alert(0), /, /xss, /abc
- make sure not cached
- host header xss-scriptable, apache picks 1st
- javascript:
- IE6 allows JavaScript in img src tag
- #039 is a single quote in On* and style handlers
- hit yahoo mail and google groups in
next/prev which showed subject lines
- update acrobat now!!! or people can run JS against your local drive because there’s c:\…\file.html
- cross-site request forgery
- xsrf
http://talks.php.net/show/mysql07
http://xdebug.org/docs-profiling2.php
http://sla.ckers.org/forum/list.php?3
http://php.net/filter
Building a Vertical Search Engine in a Day, Ken Krugler
Awesome talk, reminiscent of the old OSCON talk “Build Your Own Map Server like Google”.
- vertical search
- less than 1% of content out there is
of interest to my audience
- like specialized processing on source code
(indexing, hiliting)
- specialized optimized search interface
- 25 people, 120 servers
- 2 years
- get it right for user, or get lost in
technology - 10 papers a week
- nutch
- 1 day of crawling, 2.5 GB data
nutch tomcat
- mysql used to support operations
- nutch args in xml
- results in mysql for analysis
- db falls over at 100 million records
- one good record, 5 crap
- injection list
- fetching pages
- parse the content
- update crawl db
- whitelist
- blacklist
- convergence
- index (lucene)
- map reduce - split up big problem into small pieces, combine results
- recrawl
100 servers 1000 threads each
- 20 billion pages out there, lots of junk, 1,000 servers regularly, 2 Gbps
- breadth first would take years and cost millions of dollars
- dmoz has 500 dollars, 95% of content
- depth first crawling to the rescue
OPIC + Term Vectors = Depth First
- page rank is a 20 billion by 20 billion sparse matrix you solve for each page
- pure opic means “fetch well-linked pages first”
- we modify it to “fetch pages about mysql first”
- term vector
mysql - 5.0
sql - 4.5
schema 2.5
- can use wikipedia to train based on mysql category
- nutch and hadoop running on 4 server cluster (2 hours to make lucene index)
see http://lucene.apache.org/nutch for details
- 23 loops, 150k pages fetched, 2M urls found
- common problems
- honey pots
- bogus pages (spam, porn, out of scope)
- dynamic links (http://mydomain.com?param=value)
- getting blocked
- 2U fast sata quad core 8 GB RAM
- searchers for web
- introduction to nutch
- 40 Mbps ok for 4 servers because of database update, 1 day
- they have a 1 Gbps link
- can make a NetNanny list from crap
- mapreduce of text score plus incoming links gives useful score
- also score subdomain name like blogpost
- personalized yahoo and google: no ui, so special processing
- krugle is on nutch .8 or .9, hadoop .9
- papers: do they scale, do they really improve results?
- precision is relevance
- recall is completeness
- focus on code search side of things
- 1.5 people to manually score index
- search is tough on code because no comments
- look in tab search for “haskell quicksort”
- solr
- compass
- omnifind
- htdig
SPASQL
- used for federated queries, semantic web.
Lunch
Pretty good food, at my table were some conference first-timers from out of the country.
Angel, IT Manager, from WillowBrook Bathrooms, UK, was here to learn more about MySQL from an IT perspective.
And Juan Carlos Gonzalez, Project Coordinator and Programmer, Innox, from Guadalajara, Mexico, was here to learn more about MySQL and software development and do some bizdev. Innox does PHP development. Programmer rates in Mexico are about $1000/month and Juan is looking for clients. +52 (33) 3030-7181, jgonzalez@innox.com.mx
MySQL Performance Tuning, Peter Zaitsev, Percona
- max_connect_errors
- bump up to prevent DoS
- created_tmp_tables
- created_tmp_disk_tables
- increase tmp_table_size and max_heap_table_size
- tmpdir=/dev/shm
The Blackhole and Federated Storage Engines: The Coolest Kids on the Block, Kai Voigt, Giuseppe Maxia
Great talk. These 2 guys have the most twisted uses for these table types that you can imagine.
Blackhole Storage Engine
- blackhole is ACI not D
- smallest storage engine codebase (under 6k), so prolly reliable
Good for 3 things:
- statement-based replication relay slave layer (throw away data, keep binlog)
- benchmark SQL layer or network layer
- test dumps
- real time data warehousing - use triggers to automatically fill summary tables, avoid joins later
only causes result with insert, not delete or update because 0 rows affected
Federated Storage Engine
- create local table with connection to remote server
- uses local query cache if enabled
- can federate views or tables
- strange behavior of queries where index not available
- select count(*) causes a table scan - all aggregates are slow, so use views.
- can do subqueries against mysql 3.23 from newer versions!!!
- use federated table on slave to check against master in pure sql
- remote command execution on remote server by using view and function or udf
- takes one connection per table
- should have restricted account for federated table login
PHP Performance and Scalability Best Practices, Laura Thomson
http://omniti.com/~george/talks/
adv. php programming, g. schlossnagle
building scalable web sites, c. henderson
scalable internet architectures, t. schlossnagle
BOF: Data Warehousing
Host: Pentaho, Julian Hyde from Mondrian Project
- Julian works on modrian reporting project, sponsored by Pentaho now
- Julian recommended looking at Kettle as an ETL tool
- he thinks that parallel queries across multiple disks and materialized views are often important for DW
- other companies selling ETL tools include ab initio, ardent and informatica.
BOF: MySQL Performance
Jay Pipes and Flupps, MySQL AB
- fastest way to load external data is drop a csv file in the dbdir and do create table, alter table?
- load data infile stream from memory?
- single master means 2 data centers will have lag
- stored procedures are compiled per connection
- Jay likes PDO 5.1.13 for prepared statements
- 5.0 just had query caching of prepared statements 3 weeks ago, took 2 days of programmer effort and solves a frequest customer support question
- 5.1 will be GA in Q3 or Q4?
- will need to dump and restore if moving data, esp. UTF-8, to 5.1
- for syncing databases, look at mysqltoolkit mytablesync
Wednesday, April 25
Federation at Flickr: Doing Billions of Queries Per Day, Dathan Pattishall
Thousands of Shards, each shard has 2 nodes in master-master replication
Storage on NetApp Filer
InnoDB, InnoBackup nightly
tables as files
multiple data centers, want all hosts to be active for site performance
4 billion queries per day
migrate in bulk
need patch for linux to work with 2,000 threads in mysql
Using MySQL as Active DBMS for Monitoring Applications, Jacob Nikom, Lincoln Labs
- talked about design of database system to do event monitoring in subway attacks
- Oracle site license at lab, but clients don’t want to pay for Oracle, or hire a DBA per database
- likes combination of MySQL, UDF, triggers, Spread - because no Java in MySQL db yet
- audience member noted that MySQL triggers could use more docs, debugging tools, overall design
Storage Replication with DRBD: The Next Big Thing in Cluster Storage Management, Florian Haas, Phillip Reisner
DRBD FAQ
- layer below database, similar relationship of SCSI to MySQL
- now on version 8
- Siemens uses it.
askask.com: DRBD and MySQL
Lucene and MySQL, Farhan Mashraqi
- plugins for office formats, pdf, html
- capable up to 100 qps, need architecture to support more
- ranks
- supports + - modifiers, phrase, wildcard, range, more
- usually not right solution for very small or very big sites
- 20 MB/min on P M 1.5Ghz
- index size 20-30% of size of text indexed
- small RAM needed (1 MB heap is ok)
- Basis Technologies Rosette Linguistics Platform basistech.com/lucene
same as google and yahoo
farsi, etc.
- many ports of lucene to different languages. Java lucene is main.
lucene4c is index compatible with java, ferret and rubylucene is not.
- solr
- 18th - 25th, sept 2007 heidelberg Kai
Lightning Rounds with Top MySQL Community Contributors, Giuseppe Maxia, Ask Bjørn Hansen, Jeremy Cole, Martin Friebe, Bill Karwin, Sheeri Kritzer, Mike Kruckenberg, Paul McCullagh, Baron Schwartz, Yoshiaki Tajika, Beat Vontobel, Peter Zaitsev
QC’s (Quality Contributors)
- several people in the MySQL community from around the world got awards and talked for a few minutes about their contributions and motivations
Yoshiaki Tajika, Customer Support, NEC
- formerly supported Microsoft SQL Server
- likes free bug reporting and access to developers with MySQL
- “The manual was be perfect. My customers might be confused by mistakes.”
MySQL Performance Cookbook, Peter Zaitsev
- slave_transaction_retries= to prevent timeouts when doing an alter table
- innodb_flush_log_at_trx_commit=0 huge performance improvement with replication
- log-slave-updates chain replication
- replicate by db or table
- set global sql_slave_skip_counter=1
- start slave
- log event entry exceeded max_allowed_packet
(prolly master bounced, slave is looking for non-existent binlog entry)
- could not parse relay log event entry
MySQL Quiz Show
- mostly silly questions, some MySQL history trivia, a few technical questions
- lot of popcorn and ice cream.
BOF: Sphinx Search, Andrew Aksyonoff and Peter Zaitsev
- Sphinx Search has more and more users
- Peter is providing Sphinx support to web companies
- Andrew Aksyonoff is thinking of doing 2 months of development on Sphinx and offering support services and embedding in products.
Exhibits
I like to spend 5 minutes with each exhibitor to learn about what they’re doing and see what’s new.
The most notable were:
Alfresco Enterprise Content Mgmt.
Joomla CMS
Sphinx Search was there again.
Absent this year were booths for DealNews and Quest Software. DealNews is sponsoring Phorum.
Thursday, April 26
The innotop Workshop, Baron Schwartz
- showed innotop
- talked about changes to InnoDB for more statistics
- talked about replication monitoring tools. one of most popular projects on sf.net
Managing MySQL the Slack Way: How Google Deploys New MySQL Servers, Michael Still, Grant Allen
- Google uses PXE, slack to build hosts, Open Source
- guessed that CF Engine would be more complicated than necessary
- can’t upgrade rsync on all legacy nodes, so experimented with scp first, then netcat
- can build MySQL host in 10 minutes, faster to rebuild than troubleshoot
- use IP address as the master/slave id since both are 32 bits.
Introduction to the Japanese Charset, Yoshinori Matsunobu
- strange definition of multi-byte - more than 2 bytes
- mysql utf-8 is 4 bytes or not?
mysqlforge: WL#3780: Replace 4-byte UTF8 input
High Performance Data Warehousing with MySQL: Tricks and Tips from the Field, Brian Miezejewski
Great talk. He used to work at American Airlines on DW projects.
- memory engine for smaller tables that are not updated often
- nitro
- infobright 10 to 1 compression including indexes on all columns
–init-file= in init script
- insert on duplicate key
- sort_buffer_size key_buffer_size
- key_cache_division_limit
manual myisam-key-cache
- mysqldumpslow –help
- log_queries_not_using_indexes
- mysqladmin ex -r -i60 | grep -v ‘ 0 ‘
- ran into Seti At Home administrator later, he’s interested in the hi-perf table types, gets 500 GB drive from Arecibo instead of tape now, always something with budgets
Closing Keynote
Yahoo! Pipes, Pasha Sadri
- good demo, showed searching for rentals with characteristics that he likes
- author surprised to see Pipes used to do language translation in Second Life
- Mercury News says this is the first Brickhouse incubator project to get productionized
Cnet Sysadmin conversation
- continuance mcluster/unicluster as middleware/load balancer to multiple MySQL databases
- different database product for DW
sheeri.com: more blogs about MySQL Conference