OpenWorld 2017: MySQL Automatic Diagnostics: System, Mechanism, and Usage

In this session, Shangshun Lei and Lixun Peng from Alibaba Cloud discussed CloudDBA, a system they have built to automate many traditional DBA tasks.  The system sounds pretty cool and is what my team should be aspiring to build.  But the session covered a lot of the what for CloudDBA and not much of the how.  My notes from the session are:

  • Why CloudDBA
    • Reduce Costs –
      • 80% of time spent on finding root cause, optimizing performance, scaling hardware and resources
      • 20% on database platform
    • Focus your resources on business
    • Provide best technology
  • Architecture
    • Kafka/JStorm for log collection
    • Offline Data repository
      • Error log
      • slow log
      • audit log
      • cpu/ios/status
    • Offline diagnostics
      • Top SQL Analysis
      • Trx Analysis
      • SQL Review
      • Deadlock Analysis
    • Online diagnostics
      • Knowledge base – rule engine
      • Inference Engine – matches conditions, then either runs an action to resolve the issue or provides advice to users
    • Realtime Event and advice
      • Slave delay
      • Config Tuning
      • Active Session
      • Lock and Transaction
      • Resource
  • Rule Engine
    • Immediate detection of useful changes with low cost
    • Choose correct inference model
      • Database global status is mature and easy to get
      • High frequency monitoring to make sure no useful info is missed
      • Real time state change detection algorithms
      • Importance of database experience
  • Knowledge Base and inference engine
    • Ability to accumulate DBA experts’ experience in short time
    • Accurate issue detection & corresponding advice
  • Offline diagnosis
    • Audit log does matter
    • Record full SQLs for database
    • A feature of AliSQL, no performance impact
  • Transaction analysis
    • uncommitted transactions
    • long transactions
    • long interval between a transaction's statements
    • big transactions
  • SQL review
    • how many types of sql
    • how many types of transactions
    • whether the SQL statements and their sequence in a transaction are expected or not
    • scan rows, return rows, elapsed time and SQL advice
  • Top SQL
    • top SQL before optimization
    • help explain questions such as why my cpu is 100%
    • different statistics dimensions and performance metrics
  • SQL Advisor
    • Not a database optimizer
    • built outside of MySQL kernel
    • query rewriter
    • follows rules to create the indexes that work at the lowest cost
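
The session did not show how the knowledge base and inference engine are implemented, so the following is only a minimal sketch of the general idea from the notes: rules hold a condition plus advice, and an engine matches them against consecutive status samples to detect state changes.  The rule names, metric keys, and thresholds are all hypothetical and are not taken from CloudDBA.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Status = Dict[str, float]  # one sample of (hypothetical) global status counters


@dataclass
class Rule:
    name: str
    # Receives the previous and current samples so it can detect a state change.
    condition: Callable[[Status, Status], bool]
    advice: str


# Hypothetical knowledge base: in a real system this would be curated by DBAs.
KNOWLEDGE_BASE: List[Rule] = [
    Rule(
        name="slave_delay",
        condition=lambda prev, cur: cur["seconds_behind_master"] > 60,
        advice="Replica is lagging; look for long-running statements.",
    ),
    Rule(
        name="active_session_spike",
        condition=lambda prev, cur: cur["threads_running"]
        >= 2 * max(prev["threads_running"], 1),
        advice="Active sessions at least doubled since the last sample; review the top SQL.",
    ),
]


def infer(prev: Status, cur: Status,
          rules: List[Rule] = KNOWLEDGE_BASE) -> List[Tuple[str, str]]:
    """Return (rule name, advice) for every rule whose condition matches."""
    return [(r.name, r.advice) for r in rules if r.condition(prev, cur)]


if __name__ == "__main__":
    previous = {"seconds_behind_master": 0, "threads_running": 4}
    current = {"seconds_behind_master": 5, "threads_running": 12}
    for name, advice in infer(previous, current):
        print(f"{name}: {advice}")
```

Passing both the previous and the current sample to each condition is what lets a rule react to a change rather than to an absolute value, which seems to be the point of the high frequency monitoring bullet.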

Next generation monitoring: moving beyond Nagios – Percona Live 2015

This session, titled “Next generation monitoring: moving beyond Nagios” by Jenni Snyder and Josh Snyder at Yelp, was my first session after lunch.  Always a tough time slot for speakers.  This session was on how they migrated from Nagios to Sensu.  Sensu is great because it is specialized, whereas Nagios does a lot of things well.  The session started with an overview of some of the pains associated with Nagios at Yelp.  Big pain points were related to scaling and a lack of HA.  There were also pains associated with the Nagios GUI.  One issue: if you acknowledge a warning, the alert will not fire again if it then becomes critical.

Sensu uses the Nagios plugin interface, so existing Nagios checks work in Sensu.  Sensu uses RabbitMQ to decouple the clients from the server.  You can use standalone checks (the agent decides when to run them) or server-directed checks.  Yelp uses Nagios and then Sensu to monitor Sensu.  They deploy Sensu using Puppet and deploy each component 3 times.  They use HAProxy to handle failover.  They use Puppet to configure the checks in Sensu.
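
To make the plugin interface concrete: a Nagios-style check prints one status line and reports its result through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).  Below is a minimal sketch of such a check for disk usage.  The function names and thresholds are my own illustration, not one of Yelp's checks.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a plugin exit code using two thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_disk(path, warn, crit):
    """Return (exit code, status line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"UNKNOWN - {path} reports zero capacity"
    percent = 100.0 * usage.used / usage.total
    code = evaluate(percent, warn, crit)
    return code, f"{LABELS[code]} - {path} is {percent:.1f}% full"


if __name__ == "__main__":
    code, line = check_disk("/", warn=80, crit=90)
    print(line)
    # A real plugin would finish with sys.exit(code) so the monitoring
    # agent can read the result from the process exit status.
```

Because the contract is just a line of output plus an exit code, the same script can be scheduled by either Nagios or Sensu without changes.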

Yelp uses Sensu for host-based condition monitoring, not for graphing.  They have A LOT of checks!!!  They like the tight integration between Sensu and Puppet.  Sensu only runs checks, generates events, and then calls handlers.  So it does a lot less than Nagios but it does it really well.  Yelp has open sourced several of their handlers.  You can stash Sensu events or a host to disable checks for maintenance or reboots.  Handlers include PagerDuty, JIRA, IRC announcements, and email.  They also have handlers for Graphite, aws_prune, and OpsGenie.  They use Uchiwa as a GUI for Sensu.  They shared some information on how they build parameterization into checks.  They basically use config files and symlinks to achieve this.
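
I did not capture the details of how Yelp combines config files and symlinks, so the following is only my guess at one common pattern: a single generic check script is symlinked under several names, and each name selects its own parameters from a config file.  The check names, paths, and config layout here are all hypothetical.

```python
import json
import os


def check_name(argv0):
    """Derive the check name from the name the script was invoked under.

    When a script is run through a symlink, sys.argv[0] keeps the
    symlink's path, so one script can serve many differently named checks.
    """
    return os.path.splitext(os.path.basename(argv0))[0]


def load_params(argv0, config_text, defaults=None):
    """Merge default parameters with the config section for this check name."""
    config = json.loads(config_text)
    params = dict(defaults or {})
    params.update(config.get(check_name(argv0), {}))
    return params


# Hypothetical config file contents, keyed by symlink name.
EXAMPLE_CONFIG = """
{
  "check_disk_var":  {"path": "/var",  "warn": 80, "crit": 90},
  "check_disk_data": {"path": "/data", "warn": 70, "crit": 85}
}
"""

if __name__ == "__main__":
    # Simulate two symlinks pointing at the same script (paths are made up).
    for invoked_as in ("/opt/checks/check_disk_var", "/opt/checks/check_disk_data"):
        print(check_name(invoked_as), load_params(invoked_as, EXAMPLE_CONFIG))
```

In a real check the first argument would be `sys.argv[0]`, and the config text would be read from a file that configuration management (Puppet, in Yelp's case) writes out.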

Overall a good session.  We are pretty happy with Zenoss right now, but with the migration to Puppet, Sensu may become a viable replacement.