SVLUG: Greg Lindahl on Blekko Search Engine

Greg Lindahl from Blekko gave a talk tonight at the Silicon Valley Linux Users Group (SVLUG).

Blekko is positioned as an alternative search engine to the currently entrenched ones.

Some of the alternative features are:

  • slashtags UI feature for end users
  • enhanced SEO reporting, including incoming links to your competitor sites
  • currently no ads.

Some stats about Blekko:

  • positioned as “alternative search engine to Google” with public launch November 1.
  • project started 3 years ago
  • Blekko has 21 employees, 19 of them engineers
  • $20 million in funding. Ron Conway is an angel investor.
  • $5 million of that spent on hardware, including 700 servers and 5 PB storage (1/3 used currently) managed with Greg’s provisioning tools (a version of which was open-sourced in 1999). Nice monitoring.
  • it was hard to think of a good domain/company name; one candidate was a domain name held by a founder, another was rejected as too boring.
  • servers here and in New Jersey; it would be nice to have servers in Europe to reduce latency.
  • They use servers with Intel X25-M 160 GB SSD. Only 1% of available write cycles used after 3 months of operation.
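The SSD stat above implies a back-of-the-envelope lifetime estimate. A minimal sketch of that arithmetic (a hypothetical helper, not Blekko's monitoring code):

```python
def ssd_months_remaining(fraction_used: float, months_elapsed: float) -> float:
    """Linear extrapolation of when write-cycle endurance runs out."""
    rate_per_month = fraction_used / months_elapsed  # fraction of endurance per month
    return (1.0 - fraction_used) / rate_per_month

# 1% of write cycles used after 3 months => roughly 297 more months (~25 years).
print(round(ssd_months_remaining(0.01, 3.0)))
```

At that rate, SSD write endurance is a non-issue for the life of the server.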

They wrote their own “eventually consistent” cluster software called PM, as Hadoop and Cassandra were not awesome enough 3 years ago and they need to know every line of the source code anyway. He would like to Open Source PM, but that’s up to the board.
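PM is not open source, so its internals are unknown; but a common way to implement "eventually consistent" replica convergence is last-write-wins merging, sketched here purely as an illustration of the concept:

```python
# Generic last-write-wins merge for an eventually consistent store.
# This is NOT PM's actual design, just one standard approach.
def merge(replica_a: dict, replica_b: dict) -> dict:
    """Merge two replicas keyed by id, keeping the higher-timestamped value."""
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

a = {"url1": (100, "old title")}
b = {"url1": (200, "new title"), "url2": (150, "other")}
print(merge(a, b))  # both replicas converge on the newest value for each key
```

Because the merge is commutative, replicas can exchange state in any order and still converge.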

Perl/XS and C are used, no Java. They built a sharded architecture with a goal of keeping 3 copies of all data.
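One simple way to place 3 copies of each key across a sharded cluster is repeated hashing; this is an assumed sketch of the idea, not PM's placement scheme:

```python
import hashlib

def replica_buckets(key: str, num_buckets: int, copies: int = 3) -> list[int]:
    """Pick `copies` distinct buckets for a key via repeated salted hashing."""
    buckets = []
    salt = 0
    while len(buckets) < copies:
        h = hashlib.sha1(f"{key}:{salt}".encode()).hexdigest()
        b = int(h, 16) % num_buckets
        if b not in buckets:  # skip collisions so replicas land on distinct servers
            buckets.append(b)
        salt += 1
    return buckets

print(replica_buckets("http://example.com/", 700))  # three distinct bucket ids
```

The mapping is deterministic, so any node can compute where a key's replicas live without a lookup service.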

Currently they crawl the top 3 billion pages, and would need to expand hardware by 5x to crawl and store the entire web. No rush.

To fill in rarer queries, they use Yahoo! Search BOSS (Google Custom Search Engine (CSE) is currently available only to large partners).

They try to focus on core search technology and not get distracted by things which can be outsourced. An example might be to outsource an online calculator UI instead of re-inventing the wheel.

Each page has 100 metrics computed for it, including things like number of unique links per Class C and freshness.
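The "unique links per Class C" metric counts inbound links by distinct /24 netblock, which makes link farms on a single netblock count only once. A minimal illustrative version (not Blekko's actual ranking code):

```python
def unique_class_c(link_ips: list[str]) -> int:
    """Number of distinct /24 (Class C) prefixes among inbound-link source IPs."""
    return len({".".join(ip.split(".")[:3]) for ip in link_ips})

ips = ["10.0.0.1", "10.0.0.2", "10.0.1.9", "192.168.5.7"]
print(unique_class_c(ips))  # 3 distinct /24s: 10.0.0, 10.0.1, 192.168.5
```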

Greg said that provisioning software not only needs to do initial configuration of servers, but also detect and repair changes. One concern is a rogue root process or user that can change anything.
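The "detect and repair" idea can be sketched as a loop that compares deployed files against a desired manifest and rewrites anything that drifted. This is a hypothetical toy, far simpler than Greg's actual tools:

```python
def enforce(manifest: dict[str, bytes]) -> list[str]:
    """Restore any file whose content differs from the desired manifest."""
    repaired = []
    for path, desired in manifest.items():
        try:
            with open(path, "rb") as f:
                current = f.read()
        except FileNotFoundError:
            current = None  # missing files count as drift too
        if current != desired:
            with open(path, "wb") as f:
                f.write(desired)  # repair, e.g. undoing a rogue root edit
            repaired.append(path)
    return repaired
```

Run periodically, this converges every server back to the desired state, though a truly rogue root could of course also tamper with the enforcer itself.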

Mahalo has banned them in robots.txt (Blekko’s crawler identifies itself as ScoutJet):

# ScoutJet has been ill-behaved in the past.
# They'll fix it soon..
User-agent: ScoutJet
Disallow: /
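A polite crawler like ScoutJet checks these rules before fetching. Python's stdlib parser applied to the quoted rules (the URLs below are illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: ScoutJet",
    "Disallow: /",
])
# ScoutJet is blocked from the whole site; other agents are unaffected.
print(rp.can_fetch("ScoutJet", "http://www.mahalo.com/anything"))      # False
print(rp.can_fetch("SomeOtherBot", "http://www.mahalo.com/anything"))  # True
```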

Blekko plans to crawl trending SEO-bait sites for statistical purposes, but not necessarily to link to them in organic search results.

PowerSet’s algorithms/calculations were too slow; they were lucky to get bought out.

Blekko might have used the wikipedia corpus for crawler testing earlier on, but the live Internet is used for everything now.

Next stage is to work on UI and other user features.

Thanks to Symantec for hosting the meeting once again.
[Image: blekko’s ambient cluster health visualization]

ChuckMcM on load shedding at Blekko:

“We did this very successfully at Blekko (search engine) to keep the system from getting overloaded. The frontend engineer Bryn designed a really useful way of monitoring nginx connections to the backend and shedding load when they exceeded a threshold, and Greg designed a ‘geoknob’ that would let us turn off traffic to regions of the world that were unlikely to be our primary customer base.

Also anomalous load shedding is a great indication of a traffic anomaly. Big scrapers sometimes appeared that way first even when their attack was coming from a wide number of IPs.”
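The threshold-based shedding ChuckMcM describes can be sketched as an admission counter; the class below is illustrative, not Bryn's actual nginx configuration:

```python
class LoadShedder:
    """Reject requests once in-flight backend connections hit a cap."""

    def __init__(self, max_backend_conns: int):
        self.max_backend_conns = max_backend_conns
        self.active = 0

    def admit(self) -> bool:
        if self.active >= self.max_backend_conns:
            return False  # shed: fail fast rather than overload the backend
        self.active += 1
        return True

    def release(self) -> None:
        self.active -= 1

shedder = LoadShedder(max_backend_conns=2)
print([shedder.admit() for _ in range(3)])  # [True, True, False]
```

As the quote notes, the shed-rate itself is a useful signal: a sudden spike in rejections often flags a distributed scraper before any single IP stands out.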
