MongoDB San Francisco 2013

MongoDB Logo
I attended MongoDB San Francisco 2013 in the beautiful Palace Hotel on Thursday and Friday. #MongoDBDays

Executive Summary:

  1. MongoDB 2.4 is a feature-rich document-oriented database with essential built-in Operational features like HA and sharding that are suitable for evaluation for production. It is single-master, single-record-atomic, but writes can be made to multiple shards.
  2. 10gen, the company behind MongoDB, has built a hard-core technical and training team, and is further addressing the needs of Operations with optimizations in the 2.6 roadmap (late 2013). See their TCO and Operations Best Practices whitepapers for more details.
  3. In particular, MongoDB’s 3-node replica set topology is ideal for HA deployment in the Cloud. (The main limitations are that voting takes a few seconds, and your apps will have to reconnect after a reconfiguration.) See Figure 1.
  4. MongoDB supports sharding with automatic shard balancing for scale-out, but 90% of installations do not need the additional performance, and sharding on top of replica sets for HA is complex to monitor and maintain – 13+ daemons (services) on as many servers – a lot of moving parts. See Figure 2.


Simple Replica Set Architecture for Automated HA
Simple Replica Set Architecture for Automated HA

Sharded Replica Set Architecture
Sharded Replica Set Architecture – Lots of Moving Parts

Thursday Lunch

Nice spread! Chinese-style jumbo shrimp, chicken and beef broccoli with chocolate fortune and pecan cookies.

Thursday Afternoon Workshops

RedHat OpenShift Overview
SteveCP (Steve0) and Grant, RedHat OpenShift Evangelists

- OpenShift is a PaaS (like Heroku, Engineyard)
- everything is Open Source and in git
- uses AWS East large, selinux for access control, Linux cgroups for RAM, linux quota for disk space, to overcommit 4,000x! (maybe LXE in future, like a year)
- why AWS East? to be comparable with Heroku, etc. already in East
- 3 gears free forever (512 MB RAM, 1 GB disk)
- uses git to install apps, known as “cartridges”
- good integration with Eclipse
- automatically scalable (takes 15-20 seconds to rsync your data, depending on data size)
- OpenShift is trying to avoid Google’s “beta” pricing issues (cheap/free for beta, 3x for production)
- with 2×1 GB RAM gears, can run Tomcat in 1 gear and Solr in another!
- no root access, can write in /tmp with PAM namespaces (still ephemeral) and $OPENSHIFT_DATA_DIR (EBS, mirrored plus striped likely)
- hands-on lab: “picking up lines on your resume” (you deployed a python app in the cloud and made code modifications)
- env | grep OPENSHIFT
- mongodb sharding: can’t shard right now, OpenShift is handing off mongo cartridge maintenance to 10gen for sharding and replication

talks.theSteve0.com

MongoDB: Operations Hands On
Asya Kamsky, Senior Solutions Architect, 10gen @asya999

Talk description: “Do you need to grow a replica set? Migrate servers to different hosts? Repair a deployment after hardware failures? If so, then this workshop is for you. Attendees will work through several model operational scenarios, covering both planned and unplanned maintenance tasks, backups and recovery processes, responding to database growth requirements, and more!”

- use mongodb 2.2.x or 2.4.x or higher for replica sets or sharding testing

# build a replica set
MPATH=/var/lib/mongo
for i in `seq 1 3`; do
   mkdir -p $MPATH/rs$i
   mongod --dbpath $MPATH/rs$1 --port 3701$1 -replSet lab &
done

- replica sets are nice for HA, but secondaries usually not that useful as read-slaves
- can use arbiter node, which votes but does not store data
- max 12 nodes in a replica set to reduce mesh traffic
- max 7 nodes can vote to reduce voting traffic
- local database is not replicated
- even the primary node will go read-only if network-partition occurs
- oplog is a circular buffer, idempotent (increment operations => set operations)

db.adminCommand({setParameter:1, logLevel:0});

- use snapshot or file copy to put data on another secondary. It will recognize itself vs. copy and fix itself.
- tags, like dc or purpose: etl
- nearest is shortest ping
- rs.slaveOK() on a connection if you want to query collections on a slave instead of the master
- priority
- slaveDelay
- “maybe you’re in somewhere like Amazon where the disk write is over the network anyway”
- set write concern per connection (like w2 or wmajority), like password update.

Replica Set Exercises

  1. Set up a 3 node replica set
  2. Run the command to step down a primary: rs.stepDown() and ensure that a secondary is elected the new primary. Bring the set back up to 3 members again
  3. Set a priority on a node. Terminate the primary node and practice automated failover, see what happens
  4. Add a new node (so 4 members now) in your RS and kill 2, does a primary get elected? See what happens.

Sharding

3 Rules for Effective Sharding (in the unlikely event you actually need sharding)

  1. pick a good shard key, depending on write and read loads, and next scale jump
  2. don’t wait to shard too long (or you will be overloaded during automatic resharding), but don’t do it at all if you don’t need sharding because of the additional complexity
  3. configdb very important (run 3 config servers in production)

- mongos (mongo shard router)
- replica sets are orthogonal to sharding in mongo
- hash or range partitioning
- automatic resharding using 64 MB chunks
- shard key and values immutable without redesign
- shard key must be indexed, limited to 512 bytes in size, used to route queries

Miscellaneous Production Notes

- MongoDB can handle 20,000 connections, if your driver has connection pooling, even more.
- MongoDB relies on OS (filesystem, filehandles, etc.) and a lot of algorithms are quadratic down there
- for multi-tenant, many users use separate databases because security/permissions is enforced at the db level

MongoDB Docs: Deploy a Sharded Cluster

Friday

An early Caltrain hit Eric Salvatierra, a PayPal VP, near Menlo Park, so my train was delayed 90 minutes and arrived in SF at 11 am.

MongoDB Capacity Planning
Shaun Verch, Software Engineer, 10gen

Based on Asya’s Slides

Requirements

- Availability?
- Throughput?
- Responsiveness? The fuzziest one!

CPU

- speed
- cores
– avoid global write locks db.doceval ?
– 10gen is trying to reduce locks (or lock more local data structure level locks than global)

- non-indexed data is a memory hog
- sorting
- cores affect the amount of commands each connection can execute

Network

- latency
– WriteConcern
– ReadPreference
– Batching
– Documents (and Collections)

- throughput
– update/write patterns
– reads/queries

Memory

- working set
– active data in memory in blocks/pages
– measured over periods

- sorting
- aggregation
- ?

- qps vs. page faults
- qps vs. disk util

Disk

- IOPs
- Size
- data and loading patterns

Storage

- Active
- Archival
- Loading patterns

Monitoring

- SMCN
- Application Metrics

Tools

- MMS
- mongotop, mongostat
- linux tools

- in 2.4
workingSet option
- iostat
- mongoperf

- Load/Users
– Response Time/TTFB

System Performance
- Peak Usage

Basic Sharding in MongoDB
Brandon Black, 10gen (Ruby Driver Maintainer)

- fairly high-level overview, since only 40 minutes

Brandon’s Presentations

Indexing and Query Optimization
Max Schireson, CEO, 10gen

Executive Summary: If you know MySQL indexes well, then MongoDB index performance tuning is conceptually identical and nearly practically identical, which is a good thing.

- “I’m the CEO, but I still do this talk because it’s important. You can make your database 1000x faster.”
- B-trees
- in compound indexes, ascending or descending order matter
- index locality, like a log timestamp, is good for single mongod, likely bad for shards, which needs randomness to avoid hotspots
- can create index in background, but it still has to scan data, affecting caches
- dropdupes, sparse, geospatial options
- 64 indexes per collection, queries usually use only 1 index, except merge
- sort/select/limit => benchmark order
- index covered queries
- anchored regexes are indexed, but not case-insensitive (use lower-cased copy of it)
- nulls are indexed unless you ask for sparse, so doing sparse index is smaller if you expect a lot of nulls

Data Processing and Aggregation Options
Asya Kamsky, Senior Solutions Architect, 10gen

Aggregation Framework

- computed fields
- final result must fit in a single document (16 MB)
- pipeline operator memory limits (10% of total system RAM)

MapReduce

- original way of doing processing

Hash-based Sharding in MongoDB 2.4
Brandon Black, 10gen

- the ObjectID’s leading bytes are generated from the current timestamp, causing a hotspot when used as a shard key
- MongoDB now has hash-based indexing that uses MD5 to evenly distribute keys, but then you can no longer do efficient range lookups
- it’s still better to use a natural key like (customer_id,month) to preserve locality of range queries

Brandon’s Presentations

Advanced Replication Internals
Scott Hernandez, Software Engineer, 10gen

- know the oplog

Advanced Replication Features
Dwight Merriman, Chairman/Co-Founder, 10gen

- overview of replica set topologies and use cases
- DR replica set members generally are hidden or have a priority of 0 so they are not normally chosen as a primary

Best Practices

- odd number of set members
- read from the primary except for:
– geographical distribution
– analytics
- use logical names, not IP addresses in configs
- set WriteConcern appropriately for what you’re doing
- monitor secondaries for lag (alerts in MMS)
- use w:all or w:majority when bulk loading every Nth item to “flow control”

Closing Notes – MongoDB Roadmap
Dwight Merriman, Chairman/Co-Founder, 10gen

Dwight talked about the MongoDB roadmap and increasing 3rd-party community support.

Vendor Exhibits

It’s nice to see a vibrant third-party ecosystem developing around MongoDB.

mongolab (MongoDB-as-a-Service on AWS, Azure, Joyent, Rackspace, no sharding yet)
mongohq
gazzang
O’Reilly
Redhat OpenShift (PaaS)
ObjectRocket
28msec
StrongLoop
Jax
CumuLogic (MongoDB-as-a-Service on HP Cloud)
SoftLayer dedicated and cloud hosting, featuring (MongoDB-as-a-Service) (but not hands-on mgmt, no sharding yet) They have a beta feature to create a library of os images that can be deployed to bother dedicated and cloud hosts.
Microsoft Open Technologies – Brian Benz is a Senior Technical Evangelist msopentech.com 425 706 5793
NetEnrich Inc. (Infrastructure Mgmt.) Narayan D. Raju 408 436 5900 x7103

10gen links: Education, Presentations, Events, Whitepapers
@10gen

[mongodb-user] How does slaveDelay work in replica sets (are queries aggregated)?
Stop worrying about Time To First Byte (TTFB)

Retrofitting PCIE SSD Cards In Older Servers

I talked to three PCIe-card flash vendors (Fusion IO, STEC and Micron) at the Percona Live Conference.

Their listing pricing for Quantity 1 is:

  1. Fusion IO Direct Acceleration Products – $10,000+ for ioDrive2, volume discount for ioScale product (Google, Facebook, Apple are the most likely customers -“MLC flash with capacities ranging from 410GB to 3.2TB with cost/GB at $3.89/GB for the 3.2TB product which sells in 1,090-unit quantities”)
  2. Micron – $7,000+
  3. STEC – $4,000+

All three vendors said that the easiest way to guarantee server compatibility is to order their SSD card at the same time as your new server.

That’s nice, but some of us have existing gear that just needs more IOPs.

All said that their cards should fit, as they follow PCIe form factor specifications.

The technical issues mentioned were:

  • sometimes a BIOS update is needed for higher fan speeds, especially with higher-capacity (hotter) storage
  • particular motherboards may require a BIOS updated for known compatibility issues
  • there are many versions of the PCIe standard
  • ensure you have a spare PCIe slot.

I’ve never heard anybody say that server cosmetic faceplate panels help with airflow, so consider removing those. That also makes server installation and deinstallation easier and quicker.

Some of the soft issues are:

  • convincing vendor sales that you’re serious about adding SSD costing more than the original server
  • psychological – SSD devices offer 100,000+ IOPs, likely more than MySQL needs. Some people don’t want to approve invoices for “wasted IOPs” if only half can be used. Perhaps consider running 2 MySQL instances on the same server.

For MySQL servers:

  • if your master gets SSD, then likely your slaves will need SSD too, or they will lag
  • MySQL-specific features, like InnoDB atomic writes, and other software should be evaluated before deciding on a vendor
  • universal lack of useful documentation at http://vendor.com/mysql/ – vendors, databases are the market – you’re throwing away revenue!

wikipedia: PCI Express
theregister.co.uk: Flash card latency: Time to get some marks on benches

theregister.co.uk: EMC hits flash leader Fusion-io where it hurts: Low-cost server cards

Minimizing System Latency in Amazon AWS

Amazon AWS LogoLatency is very hard to wring out of a system once it is creeps in, which is why Internet engineers are obsessed with measuring and minimizing it.

Unfortunately, the Cloud is like a flashback to the 1970s in terms of the amount and variety of latencies it introduces in network and storage systems.

Some of the methods to decrease latency in AWS are:

  1. “comparison shop” for the best-performing instances and EBSs using benchmarking tools, then drop the slow ones
  2. write data to a queue in memory before writing to disk
  3. use local (“ephemeral”) storage instead of EBS, or try provisioned IOPs if you can afford it
  4. EBSs are a barrel of laughs in terms of performance and reliability. The likelihood of a “stuck” volume increases with striping, so use 4 or fewer striped EBS volumes
  5. use private IP addresses, not public IP addresses, from EC2 instances to other EC2 nodes to avoid extra routing hops
  6. note that EIPs can take a max of 9-13 seconds to update, which also affects RDS failover
  7. allocate the entire server by using 32 GB RAM instances to avoid “noisy neighbours”
  8. use KVM (Northwood-based Intel CPUs) instead of Xen hosts
  9. use placement groups if possible/affordable
  10. SSDs have much lower latency than rotating disk (spinning rust) drives
  11. MySQL InnoDB with SSDs can do atomic (single) writes, instead of using double-write buffer on 4k block devices
  12. reduce data writes with compression and archiving of data, and use a large enough buffer pool to hold data in memory
  13. Amazon Route 53 Adds Latency Based Routing
  14. Cross-region latency between US and non-US regions can be surprisingly high
  15. VPC might help (all new accounts are on VPC)
  16. nodes in different AZs in the same region are typically about 1 – 2 ms apart – that’s huge compared to 0.0 ms on a local switch
  17. configure geo-replication with MySQL replication or log-shipping with mysqlbinlog –read-from-remote-server (possible from EC2 or ClearDB, but not RDS) can decrease failover time in case of another entire region failure
  18. and obviously, don’t “double-virtualize” your instances like Russian Dolls.

To detect latency problems, look at the outliers on 95% and 99% graphs, and also linux top steal time (time given to other DomU instances). CloudPing is informative.

Please leave a comment if you can think of any more methods to monitor or reduce latency!

Resources

Database Latency is the Achilles Heel of Cloud Computing
AWS: the good, the bad and the ugly
Stuart Cheshire: It’s the Latency, Stupid
Dynamo Sure Works Hard
Amazon AWS team: Choosing the Right EC2 Instance Type for Your Application, Multi-Region Latency Based Routing now Available for AWS – “a regional outage wouldn’t have a direct effect on the routing decisions; the absence of measurements doesn’t contribute towards the averages observed … hours and days is where to set expectations” (ie. very high-latency)
EC2Instances.info: Easy Amazon EC2 Instance Comparison

Percona Live MySQL Conference And Expo 2013

MySQL LogoThe Percona Live MySQL Conference was held once again at the Santa Clara Convention Center. Percona did a great job of organizing the show, with a good program, full exhibits area, and a few thousand attendees.

Oracle even sent speakers this year!

The big news announcements at this conference were:

  1. SkySQL is merging with Monty Program, the MariaDB developers. Monty is remaining independent on the MariaDB Foundation
  2. Tokutek has Open Sourced their tokudb storage engine and it is already in MariaDB
  3. Oracle did a great job adding features to MySQL 5.6, but those features are a bug festival that will take a year to clean up. Since the replication binlog format has changed, users of third-party replication products will have to wait for an update to use 5.6.
  4. MySQL 5.6 has Bug #68460 – blocked with FLUSH TABLES WITH READ LOCK + SHOW SLAVE STATUS

Some suggested improvements for next year:

  1. BoFs should be two per nite, starting at 6 pm or 7 pm, not one at 6 pm
  2. every year, I suggest that there be a room for storage and SSD vendors to demonstrate systems
  3. after the closing keynote, possibly have a hacking session area.

Tuesday Keynotes

Birds of a Feather Sessions

BoF: MySQL 5.6 Deep Dive
Luis Soares, Sunny Bains, Dimitri KRAVTCHUK

Some of the topics discussed were:

  • Peter Z. mentioned wanting parallel queries to take advantage of multi-core servers. An Oracle programmer replied that actually Innodb has vestigial framework code for that, but it’s likely bitrotted
  • an audience member asked a vague question about how to get more than 10,000 connections, and there were a bunch of vague responses, including that there are hard-coded limits in the MySQL source code

BoF: MariaDB
Colin Charles, Monty

Some of the topics discussed were:

  • SkySQL is absorbing the MariaDB developers, with Monty remaining independent on the MariaDB Foundation
  • an audience member inquired about NDB support in MySQL, Monty said it would take $5 million to update it
  • roadmap after MariaDB 10 is MariaDB 11 :)
  • database development requires a QA process, always write tests
  • an audience member asked how to get a feature added. Monty replied there’s 3 ways:
    1. write a patch to show how, for rewrite by a committer
    2. write a check
    3. submit description to Google Summer of Code.

    You can also file a bug or worklog and hope for the best.

Wednesday Keynotes

  • Robert Hodges, Continuent – KEYNOTE: HOW MYSQL CAN THRIVE IN THE WORLD OF MASSIVE DATA HYPE
    Robert is an excellent speaker, one of the best I’ve seen at an Open Source conference. He starts with the CAP theorem and launches from there.

  • Panel: IMPACT OF MYSQL 5.6 AND ITS FUTURE IN THE CLOUD

    Moderator: Terry Erisman, Chief Marketing Officer, Percona

    Brian Aker: Fellow, HP Cloud Division (wants online DML and easier mgmt. of clusters of servers)
    Robert Hodges: CEO, Continuent (many users hit a wall at 3 TB and migrate to Oracle, wants more competition amongst 3rd party MySQL developers)
    Simone Brunozzi: Senior Technology Evangelist, Amazon Web Services
    Peter Zaitsev: CEO, Percona (wants parallel queries to make better use of multi-core servers, more community effort)

Community Networking Reception (in Expo Hall) MySQL Community Awards & Lightning Talks

Giuseppe Maxia’s MySQL Sandbox won the Community Application of the Year Award.

There were some strong lightning talks, particularly:

  • LeFred (Percona Europe) “Easy Schema Migration” with his online-migration tool
  • Domas (Facebook) “SSD at Facebook”
  • Sabika Makhdoom (thePlatform), a female DBA, “Time Saving Technique for Changing Replication Masters”

MySQL Paradise – Song Parody

Thursday Keynotes

Percona Closing Remarks and Expo Prize Giveaway

Most of the exhibitors and sponsors provided a moderately expensive gift, usually an iPad, that was drawn against the completed “exhibitor bingo card” submissions.

I won a Nest Learning Thermostat.

Exhibitions Area

The Expo area was very energetic. There were about two dozen commercial booths, plus the Dot Org pavilion.

SphinxSearch, the C-language #2 Open Source full-text search engine, had a booth. Besides having a great search engine that you can download for your website or forum, you can also engage them to install, optimize or customize the behavior.

Looker demoed an interesting BI tool that is a halfway point between TOAD and a fully-custom DW. After a DBA or programmer defines the BI schema, Looker autogenerates SQL to hyperlink BI queries.

There were 4 flash vendors, most with PCIe cards:

  1. Fusion IO Direct Acceleration Products – $10,000+ for ioDrive2, volume discount for ioScale product (Google, Facebook, Apple are the most likely customers)
  2. Micron – $7,000+
  3. STEC – $4,000+
  4. Virident

There were 2 non-MySQL databases:

  1. Clustrix (a MySQL-protocol-compatible appliance starting with 3 flash-based bricks and an almost zero-latency network fabric, great for Black Friday, travel discounts or other spiky ecommerce) $120,000+ or lease for $7k/month
  2. NuoDB (Jim Starkey‘s new database venture)

There were 3 MySQL Proxy vendors:

  1. Scalebase gives you a comprehensive range of features to do traffic directing based on your MySQL query stream, including automatic and rules-based sharding. I spent about 10 minutes looking at the UI with their engineer, and was able to figure out how to add sharding even to legacy custom apps
  2. GreenSQL Security Proxy, Sam Friedman, Director of Sales, 1 347 329 2937, sam -ATTTT- greensql.com. This proxy helps protect your frequently vulnerable web apps from malicious Internet traffic – awesome for companies that deploy PHP community projects and can’t update every time there’s a CERT issued.
  3. ScaleArc

There were 4 MySQL Database-as-a-service companies:

  1. Amazon RDS for MySQL
  2. Rackspace Database as a Service
  3. ClearDB (multi-cloud and multi-region!) – Michael Russo, VP Sales & Marketing 469 828 3439 x 505, michael -ATTTT- cleardb.com
  4. HP Cloud – Bruce Basil Matthews, Cloud Evangelist, bruce.matthews -ATTTT- hp.com, 760 553 3197

There were 3 MySQL support companies:

  1. Percona
  2. Pythian – Robert Groth, Director of Sales 613 565 8696 x422, groth -ATTT- pythian.com
  3. PalaminoDB

Thanks to Percona for organizing the conference again this year, and to Oracle for their serious participation in a third-party event.

Percona Live MySQL Conference 2013: Keynotes Recordings, Slides
#perconalive
YouTube: Percona MySQL Channel
There is no reason at all to use MySQL: MariaDB, MySQL founder Michael Widenius

Simulating Replication Lag on MySQL Development and Test Sandboxes

MySQL LogoMYSQL DBAs usually try to minimize replication slave lag, but there are scenarios where you need non-zero lag.

Production applications often write to a MySQL master and read from one or more slave databases.

Development and Test environments (sandboxes) seldom simulate possible replication lag though.

Galera

With a typical installation of Galera, replication lag is usually less than 0.1 second.

You can make a node lag more than that by either shrinking the transaction log size, using large transactions, or both.

You can measure lag by using or modifying one of the causality test scripts included with Galera.

Continuent

You can configure slave lag with the time-delay filter.

You can see what the lag is in the appliedLatency value.

MySQL Built-in Replication

An easy way to do that in MySQL 5.6 is to install two MySQL databases in your sandbox, then configure delayed replication:

mysql> CHANGE MASTER TO MASTER_DELAY = N;

For MySQL 5.x, if you can install the Percona toolkit, then you can use the pt-slave-delay program.

Otherwise, you can use the following commands manually to do something similar:

mysql> STOP SLAVE SQL_THREAD;
mysql> SELECT SLEEP(10);
mysql> START SLAVE SQL_THREAD;

Or programmatically in 5.x, like this pseudocode:

Loop:
   STOP SLAVE; # on slave
   SHOW MASTER STATUS; # on master, parse out File and Position values
   SELECT SLEEP(3);
   START SLAVE UNTIL MASTER_LOG_FILE='theMasterFile',
      MASTER_LOG_POS=theMasterPosition; # on slave
   SELECT MASTER_POS_WAIT('theMasterFile', theMasterPosition); # on slave, optional
   # detect any slave errors and report if found

A variation on the above pseudocode, if you don’t want to connect to the master, is to parse SHOW SLAVE STATUS for Master_Log_File and Read_Master_Log_Pos:

Loop:
   SHOW SLAVE STATUS;
   # detect any slave errors and report if found and
   # parse out Master_Log_File and Read_Master_Log_Pos values
   STOP SLAVE;
   SELECT SLEEP(3);
   START SLAVE UNTIL MASTER_LOG_FILE='theMasterFile',
      MASTER_LOG_POS=theMasterPosition;
   SELECT MASTER_POS_WAIT('theMasterFile', theMasterPosition); # optional

Note that show slave status and stop slave can block or even deadlock with various versions of MySQL, and when FTWRL (Flush Tables with Read Lock) is used.

MySQL Bug #68460: blocked with FLUSH TABLES WITH READ LOCK + SHOW SLAVE STATUS

A tip from the wikipedia project is to add the replication lag to application cookies. Deciding what to do with that info is left as an exercise for the reader. :)

dev.mysql.com: Delayed Replication,
Pausing Replication on the Slave, Start Slave
Stopping the slave exactly at a specified binlog position
MySQL 5.6 FAQ: Replication
xaprb.com: Introducing MySQL Slave Delay (2007)
Percona Toolkit – pt-slave-delay