Archive for April, 2007

MySQL Conference 2007

Thursday, April 26th, 2007

MySQL LogoThe MySQL Conference was held again at the Santa Clara Convention Center. As always, it was a great conference, with so many good talks in parallel tracks that it made choosing difficult. Here’s a link to the slides.

The Santa Clara Convention Center is not near any restaurants, so you have to rely on the conference lunches. Fortunately, they were pretty good this year (less sandwiches than last year.) However, it was often hard to find a coffee urn that wasn’t empty, so that was annoying.

My conference blog posts tend to focus on talks in the tracks that I attended and BOFs. Click on the “Conferences” category on the right-hand side to see my posts on previous conferences.

Monday, April 23

Tutorial: Scaling and High Availability Architectures, Jeremy Cole and Eric Bergen, ProvenScaling.com

A good tutorial on large-scale database architectures popular with the Web 2.0 crowd.

- Hash Partitioning
- Directory Partitioning
- HiveDB Java project started by Cole, contributions from clients

Tutorial: Wikipedia: Site Internals, Configuration and Code Examples, and Management Issues, Domas Mituzas

This was a very funny talk on how Wikipedia handles scaling and performance. Not intentionally funny, but ghetto database and system administration for when your project doesn’t have a lot of money or people.

Jimmy Wales’ SLA: wants “more up than down”

Downtime does not mean being fired, it means more donations,” because an error page soliciting donations is displayed.

One 9 is pretty good for Wikipedia site, most businesses aim for 5.

RAID0 fast, more storage, use replication.
RAID1 does not make sense because 1 hour
rebuild and could be dead.
RAIS

dc failure – get backup from mailing list users

security patches: if you can ping a server, or get a shell, you have bigger problems.

memcached is not faster than mysql if on network.

squid 2.6. mysql 4 with patches, can’t upgrade and
maintain performance. 2.6 can’t rewrite single header line

squid sync broken on Gb ethernet with 0.0 ping time
.net Lucene for search, worked the best for them when tested

read load bigger than write load, replication is great for updaters, put binlog position in cookie

hundreds of writes per second to single master, thousands of reads

no local logging, just wite to network and forget

“people complain about everything”

loves LVS

use fedora, testing ubuntu

Tuesday, April 24

Security and Debugging in PHP, Rasmus Lerdorf

Another lucid and practical talk by Rasmus on performance tuning. This time he picks on Drupal, which does like 50 database calls per page and gets load time down from 1.2 seconds to 50 ms.

mysql native driver

http_load

base/mysqli/mysqli-nd same speed 1.2s

callgrind

- recursive calls can take more than 100%
- if you see compile high (20%), install
op code cache like apc

- xdebug
- kcachegrind drupal_xdebug1.out
- drupal has 3 performance options:
– off 1200 ms
– normal 80 ms
– aggressive 50 ms

- div
- cracker JS control panel: inspect dom for colored links (like banking sites), see if operation succeeds in same browser session (like a banking transaction)
- browser zombie control panel with iframe
- linksys dmz to make your pc outside fw

- filter extension in 5.2 by default

- alert(0), /, /xss, /abc
- make sure not cached

- host header xss-scriptable, apache picks 1st
- javascript:

- IE6 allows JavaScript in img src tag

- #039 is a single quote in On* and style handlers

- hit yahoo mail and google groups in
next/prev which showed subject lines

- update acrobat now!!! or people can run JS against your local drive because there’s c:\…\file.html

- cross-site request forgery

- xsrf

http://talks.php.net/show/mysql07
http://xdebug.org/docs-profiling2.php
http://sla.ckers.org/forum/list.php?3
http://php.net/filter

Building a Vertical Search Engine in a Day, Ken Krugler

Awesome talk, reminiscent of the old OSCON talk “Build Your Own Map Server like Google”.

- vertical search
– less than 1% of content out there is
of interest to my audience

- like specialized processing on source code
(indexing, hiliting)

- specialized optimized search interface

- 25 people, 120 servers
- 2 years
- get it right for user, or get lost in
technology – 10 papers a week
- nutch
- 1 day of crawling, 2.5 GB data
nutch tomcat
- mysql used to support operations
– nutch args in xml
– results in mysql for analysis
- db falls over at 100 million records
- one good record, 5 crap

- injection list
- fetching pages
- parse the content
- update crawl db
- whitelist
- blacklist
- convergence
- index (lucene)
- map reduce – split up big problem into small pieces, combine results
- recrawl

100 servers 1000 threads each

- 20 billion pages out there, lots of junk, 1,000 servers regularly, 2 Gbps

- breadth first would take years and cost millions of dollars

- dmoz has 500 dollars, 95% of content

- depth first crawling to the rescue

OPIC + Term Vectors = Depth First

- page rank is a 20 billion by 20 billion sparse matrix you solve for each page

- pure opic means “fetch well-linked pages first”

- we modify it to “fetch pages about mysql first”

- term vector

mysql – 5.0
sql – 4.5
schema 2.5

- can use wikipedia to train based on mysql category

- nutch and hadoop running on 4 server cluster (2 hours to make lucene index)
see http://lucene.apache.org/nutch for details
- 23 loops, 150k pages fetched, 2M urls found
- common problems
– honey pots
– bogus pages (spam, porn, out of scope)
– dynamic links (http://mydomain.com?param=value)
– getting blocked

- 2U fast sata quad core 8 GB RAM
– searchers for web

- introduction to nutch

- 40 Mbps ok for 4 servers because of database update, 1 day
- they have a 1 Gbps link
- can make a NetNanny list from crap

- mapreduce of text score plus incoming links gives useful score
- also score subdomain name like blogpost

- personalized yahoo and google: no ui, so special processing

- krugle is on nutch .8 or .9, hadoop .9

- papers: do they scale, do they really improve results?

- precision is relevance
- recall is completeness

- focus on code search side of things
- 1.5 people to manually score index
- search is tough on code because no comments
- look in tab search for “haskell quicksort”

- solr
- compass
- omnifind
- htdig

SPASQL

- used for federated queries, semantic web.

Lunch

Pretty good food, at my table were some conference first-timers from out of the country.

Angel, IT Manager, from WillowBrook Bathrooms, UK, was here to learn more about MySQL from an IT perspective.

And Juan Carlos Gonzalez, Project Coordinator and Programmer, Innox, from Guadalajara, Mexico, was here to learn more about MySQL and software development and do some bizdev. Innox does PHP development. Programmer rates in Mexico are about $1000/month and Juan is looking for clients. +52 (33) 3030-7181, jgonzalez@innox.com.mx

MySQL Performance Tuning, Peter Zaitsev, Percona

- max_connect_errors
– bump up to prevent DoS
- created_tmp_tables
- created_tmp_disk_tables
– increase tmp_table_size and max_heap_table_size
- tmpdir=/dev/shm

The Blackhole and Federated Storage Engines: The Coolest Kids on the Block, Kai Voigt, Giuseppe Maxia

Great talk. These 2 guys have the most twisted uses for these table types that you can imagine.

Blackhole Storage Engine

- blackhole is ACI not D
- smallest storage engine codebase (under 6k), so prolly reliable

Good for 3 things:

- statement-based replication relay slave layer (throw away data, keep binlog)
- benchmark SQL layer or network layer
- test dumps
- real time data warehousing – use triggers to automatically fill summary tables, avoid joins later
only causes result with insert, not delete or update because 0 rows affected

Federated Storage Engine

- create local table with connection to remote server
- uses local query cache if enabled
- can federate views or tables
- strange behavior of queries where index not available
- select count(*) causes a table scan – all aggregates are slow, so use views.
- can do subqueries against mysql 3.23 from newer versions!!!
- use federated table on slave to check against master in pure sql
- remote command execution on remote server by using view and function or udf
- takes one connection per table
- should have restricted account for federated table login

PHP Performance and Scalability Best Practices, Laura Thomson

http://omniti.com/~george/talks/

adv. php programming, g. schlossnagle
building scalable web sites, c. henderson
scalable internet architectures, t. schlossnagle

BOF: Data Warehousing

Host: Pentaho, Julian Hyde from Mondrian Project

- Julian works on modrian reporting project, sponsored by Pentaho now
- Julian recommended looking at Kettle as an ETL tool
- he thinks that parallel queries across multiple disks and materialized views are often important for DW
- other companies selling ETL tools include ab initio, ardent and informatica.

BOF: MySQL Performance

Jay Pipes and Flupps, MySQL AB

- fastest way to load external data is drop a csv file in the dbdir and do create table, alter table?
- load data infile stream from memory?
- single master means 2 data centers will have lag
- stored procedures are compiled per connection
- Jay likes PDO 5.1.13 for prepared statements
- 5.0 just had query caching of prepared statements 3 weeks ago, took 2 days of programmer effort and solves a frequest customer support question
- 5.1 will be GA in Q3 or Q4?
- will need to dump and restore if moving data, esp. UTF-8, to 5.1
- for syncing databases, look at mysqltoolkit mytablesync

Wednesday, April 25

Federation at Flickr: Doing Billions of Queries Per Day, Dathan Pattishall

Thousands of Shards, each shard has 2 nodes in master-master replication
Storage on NetApp Filer
InnoDB, InnoBackup nightly
tables as files
multiple data centers, want all hosts to be active for site performance
4 billion queries per day
migrate in bulk
need patch for linux to work with 2,000 threads in mysql

Using MySQL as Active DBMS for Monitoring Applications, Jacob Nikom, Lincoln Labs

- talked about design of database system to do event monitoring in subway attacks
- Oracle site license at lab, but clients don’t want to pay for Oracle, or hire a DBA per database
- likes combination of MySQL, UDF, triggers, Spread – because no Java in MySQL db yet
- audience member noted that MySQL triggers could use more docs, debugging tools, overall design

Storage Replication with DRBD: The Next Big Thing in Cluster Storage Management, Florian Haas, Phillip Reisner

DRBD FAQ

- layer below database, similar relationship of SCSI to MySQL
- now on version 8
- Siemens uses it.

askask.com: DRBD and MySQL

Lucene and MySQL, Farhan Mashraqi

- plugins for office formats, pdf, html
- capable up to 100 qps, need architecture to support more
- ranks
- supports + – modifiers, phrase, wildcard, range, more
- usually not right solution for very small or very big sites
- 20 MB/min on P M 1.5Ghz
- index size 20-30% of size of text indexed
- small RAM needed (1 MB heap is ok)
- Basis Technologies Rosette Linguistics Platform basistech.com/lucene
same as google and yahoo
farsi, etc.
- many ports of lucene to different languages. Java lucene is main.
lucene4c is index compatible with java, ferret and rubylucene is not.
- solr
- 18th – 25th, sept 2007 heidelberg Kai

Lightning Rounds with Top MySQL Community Contributors, Giuseppe Maxia, Ask Bjørn Hansen, Jeremy Cole, Martin Friebe, Bill Karwin, Sheeri Kritzer, Mike Kruckenberg, Paul McCullagh, Baron Schwartz, Yoshiaki Tajika, Beat Vontobel, Peter Zaitsev

QC’s (Quality Contributors)

- several people in the MySQL community from around the world got awards and talked for a few minutes about their contributions and motivations

Yoshiaki Tajika, Customer Support, NEC

- formerly supported Microsoft SQL Server
- likes free bug reporting and access to developers with MySQL
- “The manual was be perfect. My customers might be confused by mistakes.”

MySQL Performance Cookbook, Peter Zaitsev

- slave_transaction_retries= to prevent timeouts when doing an alter table
- innodb_flush_log_at_trx_commit=0 huge performance improvement with replication
- log-slave-updates chain replication
- replicate by db or table
- set global sql_slave_skip_counter=1
- start slave
- log event entry exceeded max_allowed_packet
(prolly master bounced, slave is looking for non-existent binlog entry)
- could not parse relay log event entry

MySQL Quiz Show

- mostly silly questions, some MySQL history trivia, a few technical questions
- lot of popcorn and ice cream.

BOF: Sphinx Search, Andrew Aksyonoff and Peter Zaitsev

- Sphinx Search has more and more users
- Peter is providing Sphinx support to web companies
- Andrew Aksyonoff is thinking of doing 2 months of development on Sphinx and offering support services and embedding in products.

Exhibits

I like to spend 5 minutes with each exhibitor to learn about what they’re doing and see what’s new.

The most notable were:

Alfresco Enterprise Content Mgmt.
Joomla CMS
Sphinx Search was there again.

Absent this year were booths for DealNews and Quest Software. DealNews is sponsoring Phorum.

Thursday, April 26

The innotop Workshop, Baron Schwartz

- showed innotop
- talked about changes to InnoDB for more statistics
- talked about replication monitoring tools. one of most popular projects on sf.net

Managing MySQL the Slack Way: How Google Deploys New MySQL Servers, Michael Still, Grant Allen

- Google uses PXE, slack to build hosts, Open Source
- guessed that CF Engine would be more complicated than necessary
- can’t upgrade rsync on all legacy nodes, so experimented with scp first, then netcat
- can build MySQL host in 10 minutes, faster to rebuild than troubleshoot
- use IP address as the master/slave id since both are 32 bits.

Introduction to the Japanese Charset, Yoshinori Matsunobu

- strange definition of multi-byte – more than 2 bytes
- mysql utf-8 is 4 bytes or not?
mysqlforge: WL#3780: Replace 4-byte UTF8 input

High Performance Data Warehousing with MySQL: Tricks and Tips from the Field, Brian Miezejewski

Great talk. He used to work at American Airlines on DW projects.

- memory engine for smaller tables that are not updated often
- nitro
- infobright 10 to 1 compression including indexes on all columns

–init-file= in init script
- insert on duplicate key
- sort_buffer_size key_buffer_size
- key_cache_division_limit
manual myisam-key-cache
- mysqldumpslow –help
- log_queries_not_using_indexes
- mysqladmin ex -r -i60 | grep -v ‘ 0 ‘
- ran into Seti At Home administrator later, he’s interested in the hi-perf table types, gets 500 GB drive from Arecibo instead of tape now, always something with budgets

Closing Keynote

Yahoo! Pipes, Pasha Sadri

- good demo, showed searching for rentals with characteristics that he likes
- author surprised to see Pipes used to do language translation in Second Life
- Mercury News says this is the first Brickhouse incubator project to get productionized

Cnet Sysadmin conversation

- continuance mcluster/unicluster as middleware/load balancer to multiple MySQL databases
- different database product for DW

sheeri.com: more blogs about MySQL Conference

Eclipse Aviation Receives FAA Production Certificate

Thursday, April 26th, 2007

Congratulations to Eclipse Aviation on receiving an FAA production certificate!

It means that they can build, sell and certify aircraft themself, without having the FAA inspect each aircraft before delivery – a time-consuming process.

I believe they needed to fix some icing problems with pitot tubes to finally get certification.

It’s been a long road from an idea to production for them. Of course, it helps when you first raise a quarter billion dollars from Bill Gates, among others.

Still, what started as a very low-cost jet at $850,000 in 1998 is now priced at $1.5 million. There’s actually a business that only brokers Eclipse Jet purchase positions.

wikipedia: Eclipse 500
Eclipse Aviation Critic

Google Releases MySQL Patches

Thursday, April 26th, 2007

At MySQLCamp, mention was made of a few Google internal patches to MySQL4 to improve replication and IO throughput on InnoDB (Mark Callaghan, ex-Oracle, did the InnoDB work).

The good news is that Google has just released them.

The license is GPL, so we’ll see what MySQL AB can do with them as their code is dual-licensed (GPL and commercial).

Hopefully this will prod Yahoo! into releasing more patches for Open Source software they use. I’ve heard that in 2006 it was tough for employees to get permission to release MySQL patches into the open.

Google releases patches that enhance the manageability and reliability of MySQL
List of MySQL4 Patches

Another Mac Security Patch Day

Sunday, April 22nd, 2007

I have a MacBook Pro for web software testing. Today I fired up the Mac. Am I doing testing today? Nope. Just installing security updates.

Luckily, I’m not relying on Mac OS X to be more secure than Windows, since Apple is fixing about 1 security hole per day on average this year.

Security Update 2007-004 is recommended for all users and improves the security of the following components:


AFP Client
AirPort
CarbonCore
diskdev_cmds
fetchmail
ftpd
gnutar
Help Viewer
HID Family
Installer
Kerberos
Libinfo
Login Window
network_cmds
SMB
System Configuration
URLMount
Video Conference
WebDAV

For detailed information on this update, please visit this website: http://docs.info.apple.com/article.html?artnum=61798

Small-Business USPS Tips

Friday, April 20th, 2007

I happened to be at the Campbell Post Office during a quiet period and had a nice conversation with one of the counter employees.

He had some interesting mailing observations and tips:

  • Most of the people mailing in a tax return on the last day owe money. State tax returns are generally under 1 ounce, federal over 1 ounce.
  • Customers can request their postage be metered. Your receipt will indicate the city and zip code each item is mailed to – providing an informal proof of mailing. Not the same level of proof as return receipt, but no additional charge. He likes to do that for all payments.
  • The cheapest way to mail DVDs and Compact Discs is in a mailer first class – not with media mail.

In 2000 I toured the Willow Glen processing facility on Meridian Avenue. Not sure if they still give tours, but it’s always good for business people to know more about your mailing options. Netflix actually hired a former USPS Postmaster, so they certainly recognize that.

Businesses aren’t stuck on customized stamps

General Small Business USA Tax Tips

Tuesday, April 17th, 2007

I had a morning meeting with my accountant, which meant she had plenty of time to talk with me about small business taxes. (She gets busier in the evening during tax season since people try to do it after work.)

She was very happy that I had prepared an Excel spreadsheet with revenue and expenses. Although it took me about 4 hours to do a detailed spreadsheet beforehand, it meant that she could focus on accounting and tax strategy instead of sorting receipts.

Some of her tax tips:

  • Schedule 179 can be used to depreciate most new assets in one year. I like to do that with items that I replace every year, like notebook computers and cell phones. You need the name of the item, date purchased and price.
  • It’s important for each business to show a profit at least some years, otherwise the IRS may consider it to be a hobby and disallow expenses. One way to ensure a profit is to only make purchases after enough revenue is collected to pay for them. Another is to get somebody else to pay for things like conferences and travel.
  • Items that include food or travel as a component can attract scrutiny because what you eat is not tax-deductible. It can make cruise trips trickier to account.
  • If you own one phone and use it at all personally, only the business long-distance calls are tax deductible. To fully deduct it you need to use it only for business and have an alternate method of making personal calls. Recently I use Skype and YM for personal calls, and my cell phone only for business. (Same restriction applies to computers.)
  • state.gov has maximum rates of per diem allowances for travel in foreign areas, which apply to all U.S. Government employees and contractors.
  • She recommends setting aside some book-keeping time each week and using Quicken.
  • Resident employees get a W2 if you withold taxes for them, 1099 if not, by end of February in following year. Nonresident aliens are to receive a IRS 1042-S form on or before March 15.
  • learn more about taxes. Take classes and read books. Hang around accountants. She uses the 1040 Quickfinder® Handbook Q40 book.
  • when mailing federal and state tax forms, there are different addresses depending on whether you will pay additional taxes or receive a refund.
  • federal tax refunds are mailed only for previous years, while state can still be direct-deposited

unclutterer.com: Paper clutter begone, part 4

WebEx Deal Comments

Sunday, April 15th, 2007

I just got an email from WebEx promoting some kind of upgrade for April 23, so I checked out their site to see what’s new in the past few years.

I guess the biggest changes are that the only free account option is for 14-days now. Otherwise the price is $20/hour per attendee. And they bought internets.com to add more colllaboration tools, like discussion forums.

Financial Perspective

Cisco paid $3.2 billion for WebEx last month. It’s a lot of money, but not insane. WebEx has revenue of $380 million per year, net income of $50 million per year, and is growing at 15% to 20% per year. WebEx has a solid brand, is the top brand in its space, and claims to have 28,000 paying customers.

Cisco, on the other hand, needs acquisitions to grow, and wants a toehold into software subscriptions a la salesforce.com.

Cisco and WebEx are good matches because they both mainly sell to businesses, and Cisco believes it needs to offer everything that has an IP address to its customers.

Technical Perspective

I was never impressed with WebEx because their product was always slow and bloated and didn’t have a reputation for being reliable cross-platform. I always thought that Microsoft should just get their act together, release a reliable version of LiveMeeting in the OS, and put WebEx out of business. But that hasn’t happened yet.

WebEx is up to 2,500 employees now. Hard to write tight code in that kind of bureaucracy.

Something to keep in mind with these Internet collaboration tools is the very high likelihood of numerous security holes. Citrix is the poster child for that. I could go on.

I’ve read a few interviews with the active founder, Subrah Iyar, ex-Quartderdeck, and was always impressed with how down to earth and straight-forward he was. In one Mercury News article the interviewer asked him what WebEx’s biggest problem was.

He replied, “Our biggest problem is corporate visitors seeing our typical one-story SV campus and asking us where the gleaming skyscraper is.” Later they put their logo on a tall building on Mission College Blvd.

WebEx does have a lot of competitors. Hopefully one can release a light-weight, cross-platform application.

But even now you can go far with Skype voice and text chat and a wiki or fleck.

TechCrunch: Cisco Buys WebEx for $3.2 Billion
wikipedia: WebEx

Don Ho Says Final Aloha

Sunday, April 15th, 2007

Don Ho died Saturday of heart problems at 76.

I made a special point of seeing his show in Waikiki several years ago, because he was getting on in age and I wanted to see him perform while he was still able to do so.

I enjoyed myself, although I got the impression that he was a jaded performer doing a parody of his lounge act after so many shows and years. I don’t remember him getting off his stool during the entire show.

Half the show was actually performed by young hula dancing girls (common in Hawaiian shows) and a young male guitar/singer protege. Many in the audience were repeat guests from over the decades.

I didn’t realize that he was a star football player, sociology graduate, and Air Force pilot in WW2.

Apparently he got into singing to bring in patrons to his father’s bar.

cnn.com: Don Ho, Hawaiian ‘Tiny Bubbles’ crooner, dies
cnn.com: Sunset memorial for singer Don Ho draws thousands