Performance Birds of a Feather – Percona Live 2015

This session was an open discussion with Rick James (http://mysql.rjweb.org) where we talked through whatever performance issues the audience was interested in. My notes are:

  • Start with the InnoDB buffer pool at 70% of memory, unless you are on a really small VM
  • Partitioning should not be done for performance
  • Stored procedures can provide better performance than views
  • If you are replicating then break up your deletes into small batches
  • We talked about ways to create shards to minimize impact
  • Slow Query log at 2 seconds or less and use pt-query-digest to analyze
  • Batch your inserts so that one INSERT statement adds up to 1,000 rows (see the sketch after this list)
  • For Galera you can make an argument for running innodb_flush_log_at_trx_commit at 0 or 2, because a node can be rebuilt with SST
  • You can use INFORMATION_SCHEMA tables to build a blocking chain
  • Subqueries can perform poorly because the optimizer is not smart enough to handle them
  • Try your best to turn them into a join, unless you have a GROUP BY in the subquery to limit the rows
  • You can also create a temp table in memory and join to it instead of using a subquery (the join rewrite is sketched after this list)
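
Two of those points lend themselves to quick examples. Here is a minimal sketch using hypothetical customers/orders tables of my own (not from the session), showing a batched insert and a subquery rewritten as a join:

    -- hypothetical tables for illustration
    CREATE TABLE customers (
      id     INT UNSIGNED NOT NULL PRIMARY KEY,
      region VARCHAR(16)  NOT NULL
    );
    CREATE TABLE orders (
      id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      customer_id INT UNSIGNED NOT NULL,
      amount      DECIMAL(10,2) NOT NULL
    );

    -- batched insert: one statement carrying many rows (up to ~1,000)
    INSERT INTO orders (customer_id, amount)
    VALUES (1, 10.00), (2, 25.50), (3, 7.25);

    -- subquery form the optimizer may handle poorly
    SELECT * FROM orders
     WHERE customer_id IN (SELECT id FROM customers WHERE region = 'west');

    -- the same result rewritten as a join (customers.id is unique, so no duplicates)
    SELECT o.*
      FROM orders o
      JOIN customers c ON c.id = o.customer_id
     WHERE c.region = 'west';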

I need to grab Rick’s information on Galera for developers.

Scaling MySQL in the cloud with Vitess and Kubernetes – Percona Live 2015

“Scaling MySQL in the cloud with Vitess and Kubernetes” was presented by Anthony Yeh from Google, who works on the YouTube team.  The talk was basically about how they use Vitess together with their container platform (Kubernetes).  Vitess provides

  • Performance
  • Protection
  • Monitoring/Analysis
  • Management
  • Sharding

It sits between the application and database layers.  It leverages vtgate, which abstracts the data layer and decides which shard to fetch the information from.  It can separate live traffic from analytic traffic.  Vitess can cache rows identified by primary key, which is different from the block caching performed by MySQL.  Vitess will kill poorly performing queries automatically to avoid replicating bad queries.  It performs transparent rerouting, and it also performs live resharding.

Vitess runs as a container cluster.  Kubernetes is a container cluster OS built on top of Docker.  For more info see vitess.io/getting-started.  Looks very promising, and I will need to see how this fits with Trove.

Sharding, horizontal and vertical scaling – Percona Live 2015

In “Sharding, horizontal and vertical scaling”, Tom Christ from Tumblr discussed sharding for scale.  He discussed the scale dimensions of read, write, and size.  While challenges are often specific to data sets and/or applications, there are ways to categorize and reason about them.  Read challenges include query complexity, query volume, and buffer pool effectiveness.  To fix read issues you can scale horizontally with replica fanout, or you can add caching.  You can also scale vertically with bigger servers, or break work into smaller concurrent queries.

He discussed the options for write and size scaling, which is where sharding comes in to help.  He discussed the trade-offs between cost, complexity, write amplification, and concurrency when deciding how to scale and whether sharding should be part of your solution.  Tom covered the types of sharding: lookup, range, and mod sharding.  He also went over forms of caching: key/value (memcache) and structured (Redis).
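
To make the three sharding types concrete, here is a minimal sketch in SQL; the shard_map table and the shard counts are my own invention, not from the talk:

    -- mod sharding: the shard is a pure function of the key
    SELECT MOD(123457, 16) AS shard_id;

    -- lookup sharding: a directory table maps each key to its shard
    CREATE TABLE shard_map (
      user_id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
      shard_id INT UNSIGNED    NOT NULL
    );
    SELECT shard_id FROM shard_map WHERE user_id = 123457;

    -- range sharding would assign contiguous key ranges per shard,
    -- e.g. shard 0: user_id < 1000000, shard 1: the next million, and so on

Mod sharding is cheap to compute but resharding moves most rows; lookup sharding costs an extra hop but makes rebalancing individual keys much easier.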

All in all, a good overview of the issues with scale, how sharding and caching can help, and the pitfalls.

Improving Performance with Better Indexes – Percona Live 2015

In the session titled “Improving Performance with Better Indexes”, Ronald Bradford of EffectiveMySQL went over a test case he had with a customer to improve the performance of an already indexed system.  Basically, we learned how to use the “Using index” information in the EXPLAIN plan.  He has a 6 step process:

  • Capture
  • Identify
  • Confirm
  • Analyze
  • Optimize
  • Verify

Performance Schema provides 3 tables (events_statements_…) you can use to identify queries.  Ronald also went over several other ways to log SQL statements.  Then he discussed how to identify the queries to look at.  The slowest query is not the best place to start, and starting there is a common mistake DBAs make on all platforms.  Instead, you need to look at the total duration for all queries with the same signature in a given time window (a digest query doing exactly this is sketched after the command list below).  He then went over the following commands:

  • EXPLAIN – The explain plan
  • SHOW CREATE TABLE – shows the DDL to create a table
  • SHOW INDEXES FROM – shows index statistics (cardinality only, no histograms)
  • INFORMATION_SCHEMA.TABLES – shows size of data on disk
  • SHOW TABLE STATUS LIKE – also shows size of table on disk
  • EXPLAIN EXTENDED – adds more information to explain plan

He talked very briefly about optimizer tracing (https://dev.mysql.com/doc/internals/en/optimizer-tracing.html).

When looking at EXPLAIN output, key: NULL means the query is not using an index, and rows: tells you the number of rows compared.  “Using index” in the Extra: column means we have a covering index and are not having to go to the table to get information from the rows.  key_len tells us the number of bytes being used for comparison, so the bigger the better in query tuning.  Covering indexes can be an issue when large columns are involved: if the index is too big, you will add I/O to use it; if it is small enough, it will fit entirely in memory.
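
Here is a minimal sketch of reading those fields, with a hypothetical table of my own rather than Ronald’s test case:

    -- hypothetical table with a two-column secondary index
    CREATE TABLE orders (
      id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
      customer_id INT UNSIGNED NOT NULL,
      status      VARCHAR(16)  NOT NULL,
      created_at  DATETIME     NOT NULL,
      PRIMARY KEY (id),
      KEY idx_cust_status (customer_id, status)
    ) ENGINE=InnoDB;

    -- both selected columns live in idx_cust_status, so EXPLAIN should show
    -- key: idx_cust_status and Extra: Using index (a covering index);
    -- key_len is 4 here (just customer_id), and adding AND status = 'open'
    -- to the WHERE clause grows it, meaning more of the index is compared
    EXPLAIN SELECT customer_id, status FROM orders WHERE customer_id = 42;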

This session had some great information.  I will probably take it back and combine it with my SQL Server 12 step query tuning session to make a MySQL tuning 12 step presentation to give to our developers.

Pseudo GTID and easy replication management – Percona Live 2015

I attended the session titled “Pseudo GTID and easy replication management” by Shlomi Noach from Booking.com.  In this session Shlomi went over replication, GTID, then Pseudo GTID and some demos.

Pseudo GTID provides:

  • Slave repointing
  • Failover schemas
  • Fewer requirements

Binary logs are pulled by the slave and written to the slave’s relay logs.  Nothing ties these two together; they can be flushed at different times, etc.  With log_slave_updates, the relay logs are also written to the slave’s binary logs.  The problem only grows when you add multiple slaves and relay slaves.  GTID has the following properties:

  • Every transaction has a unique identifier
  • When a slave connects, it looks for the last GTID it already executed (see the sketch after this list)
  • Requires log_slave_updates (fixed in 5.7) on all slaves (extra disk I/O)
  • Migration is all at once (master and slaves all at the same time)
  • Oracle GTID and MariaDB GTID are different implementations, which can cause vendor lock-in
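
For reference (my example, not from the talk), with server GTIDs a slave reconnects using auto-positioning instead of a binlog file and offset; in MySQL 5.6 syntax:

    -- the slave asks the master for every GTID after the ones it has executed
    CHANGE MASTER TO
      MASTER_HOST = 'master1',   -- hypothetical host name
      MASTER_AUTO_POSITION = 1;
    START SLAVE;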

Pseudo GTID is an application-side enhancement.  The application injects a uniquely identifiable statement every few seconds, much like “tracer tokens” in SQL Server replication: it gives you a marker you can locate anywhere in your environment, and therefore the information you need to more easily reconnect a slave to a master.
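
A minimal sketch of the injection idea, with a hypothetical schema of my own (the real tooling’s naming and mechanism differ):

    -- run this every few seconds (from an event scheduler job or cron);
    -- each execution writes a unique marker that shows up in the binary logs
    CREATE DATABASE IF NOT EXISTS meta;
    CREATE TABLE IF NOT EXISTS meta.pseudo_gtid (
      anchor INT UNSIGNED NOT NULL PRIMARY KEY,
      token  VARCHAR(64)  NOT NULL
    );
    -- with statement-based replication, generate the token on the application
    -- side rather than with UUID(), which is non-deterministic
    REPLACE INTO meta.pseudo_gtid (anchor, token) VALUES (1, UUID());

Matching that unique token in one server’s relay or binary logs against another’s is what lets a tool line the two up without real GTIDs.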

Then Shlomi went over a tool called Orchestrator, which is topology-aware and helps you manage replication.  It works from the command line or a web API.  He killed a live server as part of the demo.  That is some confidence in your tool and your presentation.  With Pseudo GTID it can automatically move slaves to a new candidate master based on a number of different factors.  Pretty impressive.

Next generation monitoring: moving beyond Nagios – Percona Live 2015

This session, titled “Next generation monitoring: moving beyond Nagios” by Jenni Snyder and Josh Snyder of Yelp, was my first session after lunch.  Always a tough time slot for speakers.  This session was on how they migrated from Nagios to Sensu.  Sensu is great because it is specialized, whereas Nagios tries to do a lot of things.  The session started with an overview of some of the pains associated with Nagios at Yelp.  The big pain points were related to scaling and a lack of HA, along with pains associated with the Nagios GUI.  One issue: if you acknowledge a warning, it will not fire again if it then becomes critical.

Sensu uses the Nagios plugin interface, so existing Nagios checks work in Sensu.  Sensu uses RabbitMQ to decouple the clients from the server.  You can use standalone checks (the agent decides when to run them) or server-directed checks.  Yelp uses Nagios, and then Sensu, to monitor Sensu.  They deploy Sensu using Puppet and deploy each component 3 times, using HAProxy to handle failover.  They use Puppet to configure the checks in Sensu.

Yelp uses Sensu for host-based condition monitoring, not for graphing.  They have A LOT of checks!!!  They like the tight integration between Sensu and Puppet.  Sensu only runs checks, generates events, and then calls handlers, so it does a lot less than Nagios but does it really well.  Yelp has open sourced several of their handlers.  You can stash Sensu events or a host to disable checks for maintenance or reboots.  Handlers include PagerDuty, JIRA, IRC announcements, and email.  They also have handlers for Graphite, aws_prune, and OpsGenie.  They use Uchiwa as a GUI for Sensu.  They shared some information on how they build parameterization into checks, basically using config files and symlinks.

Overall a good session.  We are pretty happy with Zenoss right now, but with the migration to Puppet, Sensu may become a viable replacement.

An Introduction to Database as a Service with an emphasis on OpenStack using Trove – Percona Live 2015

My first real session of the conference was “An introduction to Database as a Service with an emphasis on OpenStack using Trove” by Amrith Kumar (Tesora) and Matt Griffin (Percona).  I had originally planned on attending “MySQL and Openstack deep dive” by Peter Boros (Percona), but after speaking with him last night in the vendor hall I realized that this session could have more immediate applicability to my team.

The session started with an overview of the cloud landscape, which showed AWS and Microsoft at the top of the cloud market for enterprises.  But there are a lot of other clouds being evaluated, and the market is growing fast and changing.  68% of enterprises have less than 20% of their applications in a public cloud, so there is a lot of room for growth.

Clouds provide a lot of value but this session concentrated on Speed, Low Cost, and Flexibility.  They pointed out that most of the cost savings come in the early phases of a project/application because you don’t have to oversize your infrastructure.  Database as a Service provides simplified database provisioning and the ability to manage at scale.  OpenStack and Trove help companies provide Database as a Service.

Next they provided a good overview of OpenStack and the services it provides.  Then they introduced Sahara (Hadoop as a Service) and Trove (Database as a Service).  Since this was a session on Trove, there was very little information on Sahara.  Trove has to run on top of OpenStack today, but there are discussions of breaking that dependency in the future.  Trove supports multiple databases (MySQL, PostgreSQL, MongoDB, Couchbase, …).  It provides a framework for higher-level operations like backup and recovery, replication, clustering, HA, etc.  They are adding new database platforms with just about every release (twice-a-year release cycle).  The real value of Trove is that the contributed configurations come from industry experts.  Trove does not remove the need for expertise, but it frees organizations to roll out quicker because you can acquire expertise as you go.  This helps with DevOps because you don’t have to have a platform expert on each project.
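
As a rough idea of the provisioning workflow from the Trove CLI (a sketch from memory; the instance name, flavor ID, and sizes here are my placeholders, so check the Trove docs for your release):

    # create a 2 GB MySQL instance on flavor 10, with an initial schema and user
    trove create my-mysql-1 10 --size 2 \
        --databases appdb --users appuser:apppass

    # watch the instance build
    trove list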

Trove supports a number of different database platforms, but they are all at different levels of maturity.  There is some loss of control because of the standardization of instances, but Trove is open source so you can modify it as needed.  Configuration groups in Trove are one way you can modify the default behaviors.  Trove will add Galera clustering in Liberty.  Oracle support in Trove can be purchased.  The biggest advantage provided by Trove and OpenStack is a common API to provision resources across private, public, and hybrid cloud deployments.

All in all this looks like a good project that we should dive into to see how it can increase our effectiveness.

Keynotes Day 1 – Percona Live 2015

The morning session of keynotes started off with some information on the conference.  Some of the interesting information included:

  • Over 1200 attendees, 22 from Facebook and 20 from Oracle
  • Percona Live Amsterdam Sept 21-22nd 2015 call for papers open now
  • 205 Speakers
  • 146 Breakout Sessions

Then Peter Zaitsev, CEO of Percona, came out and gave a great session on the history of MySQL.  With this year being the 20th anniversary of MySQL, it was interesting to look back on how it has changed.  Then there was some information about the role that MySQL is playing today.  MySQL is leveraged by the top 5 websites today, and there is a lot of growth ahead with the move to the cloud and platforms like OpenStack.

After Peter was done, Harrison Fisk from Facebook came up on stage and gave a great overview of how they leverage multiple data platforms to handle different workloads.  Some notes from that presentation:

  • Different data storage technology to handle varying data storage needs
  • MySQL is used for small data, plus messaging and other systems
  • small data is 10’s of PB
  • get <5ms response times
  • flash storage and flash cache (hybrid flash and disk system built by Facebook)
  • 100’s of millions of QPS
  • HBase is used for small data as well (the messages themselves are in HBase for longer term storage)
  • HBase is used on disk and provides a great platform for archived data
  • HBASE 10’s of PB
  • HBASE Millions of QPS
  • HBASE < 50ms response time
  • RocksDB is an embedded storage engine developed by Facebook (an SQLite replacement)
  • Used by a lot of internal systems (9B QPS at peak) and is open source
  • Used by News feed
  • BigData
    • Hive
      • Minutes/Hours Queries
      • >300 PB logical size
      • >10PB Daily Processing
    • Presto
      • developed by Facebook
      • Open source distributed query engine with in-memory streaming
      • presto doesn’t handle failures well
      • Seconds/minutes Queries
      • >10K Queries Per Day (big analytic queries)
      • >30 trillion rows per day
    • Scuba
      • Not open sourced
      • Real time analysis tool used for debugging at Facebook
      • Leaf/aggregator in-memory columnar store
      • 100s of TB
      • Many millions of rows ingested
    • New systems
      • RocksDB and MySQL – MySQL with a RocksDB storage engine (RocksDB is write optimized while InnoDB is read optimized)
      • Presto and MySQL – Sharded MySQL with Presto running on top
        • 100s of Millions of rows at once
        • Updatable, with all the benefits of MySQL

After Harrison was done, they had a panel discussion on the next disruptive technology after the cloud and big data.  The first question asked was “What’s the next big thing?”  Think about a datacenter as a single element of compute and remove the human element from its management.  Application needs will lead the way instead of technology leading the applications: choosing technology based on needs rather than trying to cram what we have into fulfilling a need.  The Internet of Things is inevitable given the human need to connect; technology is just an abstraction created by humans and will follow.  Data security is going to be a big growth area, and IoT will drive this need when combined with the cheap cost of storage.  A singularity of information is coming where the data silos will be broken down.

The question was asked how we would know the next big thing is here.  Thomas Hazel from Deep Information Sciences had the answer I identified with the most: “It will change our perspective, then it will change the business, then it will change our lives”.  Other ideas given were:

  • Proprietary technology will become niche
  • The next big thing will involve data
  • We are in the beginning of a data revolution
  • We can no longer be associated with a platform. Instead we have to be associated with services we provide.
  • The next big opportunity is turning all this data into useful information.
  • The core problem in data is analytics.
  • Data can be used to back up just about anything if we ask the wrong questions.  The big issue is how to ask the correct questions and know that we are doing so.
  • Data is becoming more personal where it used to be corporate.

Last, they had a question and answer session with Steve Wozniak, one of the founders of Apple.  He shared some great ideas about the direction of technology as well as the meaning of life.  He shared some of his concerns surrounding education and what it will take for computers to play a lead role in educating our youth.  Computers today serve as a powerful tool, but we continue to educate our children very similarly to the way we did in the past.  Steve suggested this is because of a number of factors.  He said that the way computers will take a lead role is to become more cognitive and human-like.  People gravitate to humans to learn, and until computers are more human-like they will remain tools.

Steve also shared with us his two formulas for happiness.  His first formula was H=S-F, or Happiness = Smiles - Frowns.  Later on he decided that H=F^3: Happiness equals Food times Fun times Friends.  Then Steve wrapped up by sharing information on his relationship with Steve Jobs.

Then, just when you thought we were done, there was a surprise announcement from Peter that Percona has acquired Tokutek, makers of storage engines for MySQL and MongoDB.

All in all a great morning of information and a great kickoff to the main conference.

Getting Ready for Day 1 – Percona Live 2015

Waiting for Day 1 to kick off at Percona Live and I can’t help but reflect on the great day of training and networking we had yesterday.  This conference is a bit smaller than the ones I am used to, but the information has been top notch so far and the smaller size makes meeting and sharing ideas much easier.  Everyone I have met so far has been super nice and interested in sharing ideas and challenges.

Today we start the session portion of the event and so we’ll be covering more topics in less detail.  I have sessions lined up for the rest of the week across the MySQL space.  I will try to share as much as I can while still trying to keep up with all that is going on.

Percona XtraDB Cluster in Practice – Percona Live 2015

I knew right off this was going to be an awesome session.  When you sit down and the slide on the screen has instructions on how to set up an actual environment in VirtualBox, you know it is going to be good.  The session was presented by Jay Janssen of Percona and was broken up into two parts: first we migrated a master/slave environment to Galera, and then we went over Galera issues and some admin information.

Most of the morning was spent setting up the environment by migrating from traditional replication to Galera.  I am sure there were some in the room who thought this was too basic, but it was great for me.  My team has been working on Galera for several weeks, but this was my first time putting my hands on it.  First we started with a 3 node asynchronous replication environment (1 master and 2 slaves).  We then replaced XtraDB Server with XtraDB Cluster on one of the slaves, changed the configuration on that node to make it a single node cluster, and started it with bootstrap.  At this point we had a one node Galera cluster getting transactions from the master.  Next we did the same on the second slave node and added it to the Galera cluster.  We then had to add a grant in MySQL on the first Galera node to allow backups to run so that the new Galera node could receive SST.  Then we did the same for the last node.  The tutorial is open source and available here: https://github.com/percona/xtradb-cluster-tutorial.
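
For the SST grant step, here is a minimal sketch of what that user typically looks like for an xtrabackup-based SST; the user name and password are placeholders of mine, and they must match wsrep_sst_auth in the config:

    -- on the donor node: the account the SST script connects as
    CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 'secret';
    GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT
      ON *.* TO 'sstuser'@'localhost';

    -- and in my.cnf on each node:
    --   wsrep_sst_auth = sstuser:secret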

After the hands on part we dived into Galera specific issues.  Highlights from that part of the talk were:

  • You need to look up the grant in /usr/bin/clustercheck and run it to allow the script to work
  • wsrep_cluster_address is the list of addresses the node looks at to find and join its cluster, which is identified by wsrep_cluster_name
  • Setting gmcast.segment doesn’t magically convert the traffic to async between the two segments.  You will need to use traditional replication between one of the nodes in the cluster and a standalone MySQL server, or a node in another cluster, instead.
  • The synchronous part means that all nodes have the transaction and acknowledge that it has been received.  The transaction is then certified on each node, and the commit is finalized node by node, so commits are applied asynchronously on the other nodes.  Race conditions are handled by issuing a deadlock error during certification.
  • Certification is deterministic.  If it passes on one node it will pass on all nodes.  Certification happens in order as commits happen on each node.
  • The application needs to test for deadlock errors on commit
  • first commit wins
  • Locks are not replicated among nodes, because replication only happens on commit
  • There is slave lag; this is not two-phase commit.  There is no data loss on node failure.
  • Slow nodes cause an issue because of flow control.
  • Replication stops during DDL to ensure that certification remains deterministic
  • Rolling Schema Upgrade mode allows you to upgrade one node at a time.  It takes the node out of flow control, lets it get behind, and does not replicate the change; the node then catches back up.  This is the same as changing the slave first in a master/slave setup.  You cannot drop a column, rename a column, or insert a column in the middle.  All the issues with ALTER TABLE on replicated slaves exist with rolling schema upgrades.  pt-online-schema-change fixes this, but can cause an issue with foreign keys because of the renaming done by pt-osc, which can cause foreign keys not to fire properly for a short period of time.
  • All Galera settings are node specific
  • galera.cache does not persist past restarts, so having multiple nodes down can cause SST if you don’t have a node that stays up while the others are down.
  • On some Linux versions you can tell a node which donor node to use by running “service mysql start --wsrep_sst_donor=node1”, or you can set the option in the config file, start the node, and then modify the config file afterwards
  • The suspect timeout allows nodes to be evicted; a new quorum vote is then taken based on the previous node count.  If more than 50% remains, the cluster is still primary (P in myq_status); otherwise it is non-primary (n in myq_status).
  • You can connect to a non-primary node, but you cannot perform selects, DML, or DDL with the connection; you can look at variables.  You can bootstrap the node, but then you have two clusters.
  • Writes are frozen during the timeout (10 seconds) when a node fails
  • The arbitrator node gets all the replication traffic.  This allows it to forward the traffic if the link between the nodes in the other data centers is down.  You need to take this into account when selecting the DC for the arbitrator node, as latency and security matter.
  • If all nodes go down you will have to bootstrap the node with the highest GTID (cat /var/lib/mysql/grastate.dat).  The only way to cleanly shut the cluster down is to stop the application from writing before you shut down.
  • If the shutdown was not clean you will not have GTIDs in grastate.dat.  You will need to use “mysqld_safe --wsrep-recover”; the GTID is the part after the colon on the recovered position line (see the sketch after this list).
  • You can avoid SST by using a backup to initialize a node, if your galera.cache is sized large enough to contain the transactions since the backup was taken.  You will need to have used the --galera-info option with your backup, and then use the file created by the backup to create a grastate.dat before you start MySQL.
  • If cat /var/lib/mysql/grastate.dat shows an all-zeros UUID, the node will SST just as if the file were missing.
  • If a node won’t SST and the data dir is corrupt, you need to remove the data dir, run mysql_install_db --user=mysql to create a clean empty data dir, and then start MySQL to trigger SST.  Make sure you remove the .sst directory too.
  • FLUSH TABLES WITH READ LOCK will stall your cluster because of flow control.
  • You can tune flow control using gcs.fc_limit to adjust how deep the queue is before flow control kicks in.
  • The problem node with flow control is the node with the non-zero queue depth.  Kill it or fix it and the cluster will resume.
  • wsrep_sync_wait causes reads to wait for pending applies before the read executes.  This slows down reads but ensures reads on all nodes are consistent.
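
Here is a minimal sketch of the recovery steps from the bullets above; paths assume a stock PXC install, and the exact service command varies by distro, so treat these as assumptions:

    # 1. after a clean shutdown, compare GTIDs across nodes
    cat /var/lib/mysql/grastate.dat        # note the uuid and seqno values

    # 2. after an unclean shutdown seqno is -1, so recover the position
    mysqld_safe --wsrep-recover
    grep "Recovered position" /var/lib/mysql/*.err

    # 3. bootstrap the node with the highest seqno, then start the others normally
    service mysql bootstrap-pxc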

I was able to keep up all morning and for most of the afternoon, but Jay lost me at the cluster check part, so I had to stop doing the hands-on parts.  But we plan on using an F5 and I understand the use case, so I will play with it more when I get back home.  I was able to fix the issue on the break: I had misheard what option to put in the mysqlchk xinetd config file for the parameter type.  It needed to be UNLISTED, not UNCHECKED as I had heard.  This allowed me to get it running on all the nodes.

So my brain is fried from all this.  I will definitely be playing with this more when I get back from the conference, so you might see more posts around this content.  Stay tuned for more from Percona Live and on Galera.