Percona XtraDB Cluster in Practice – Percona Live 2015

I knew right off this was going to be an awesome session.  When you sit down and the slide on the screen has instructions on how to set up an actual environment in Virtual Box you know it is going to be good.  The session was presented by Jay Janssen at Percona.  The session was broken up into two parts.  First we migrated a master slave environment to Galera and then we went over Galera issues and some admin information.

Most of the morning was spent setting up the environment by migrating from traditional replication to Galera.  I am sure there were some in the room who thought this was too basic, but it was great for me.  My team has been working on Galera for several weeks, but this was my first time putting my hands on it.  First we started with a 3 node asynchronous replication environment (1 master and 2 slaves).  We then replaced XtraDB server with XtraDB Cluster on one of the slaves.  Then we changed the configuration on that node to make it a single node cluster and started it with bootstrap.  At this point we had a one node Galera cluster getting transactions from the master.  Next we did the same on the second slave node and added it to the Galera cluster.  We then had to add a grant on the first Galera node in MySQL to allow backups to run so that the new Galera node could receive SST.  Then we did the same for the last node.  The tutorial is open source and available here

After the hands on part we dived into Galera specific issues.  Highlights from that part of the talk were:

  • Need to look up the grant for /usr/bin/clustercheck and run it to allow the script to work
  • wsrep_cluster_address list of addresses that the node looks at to find and join its cluster in wsrep_cluster_name
  • Setting up gmcast.segment doesn’t magically convert the traffic to async between the two segments.  You will need to use traditional replication between one of the nodes on the cluster and a stand alone MySQL or one node on another cluster instead.
  • Synchronous part means that all nodes have the transaction and acknowledge that it has been received.   Then transaction is certified on each nodes.  Commit finalized node by node.  Commits are certified asynchronously on the other nodes.  Race conditions are handled by issuing a deadlock error while it is certifying.
  • Certification is deterministic.  If it passes on one node it will pass on all nodes.  Certification happens in order as commits happen on each node.
  • app needs to test for deadlocks on commit
  • first commit wins
  • locks are not replicated amongst other nodes because replication only happens on commit
  • there is a slave lag  This is not two phase commit.  No data loss on node failure
  • Slow nodes cause an issue because of flow control.
  • Replication stops during DDL to ensure that certification remains deterministic
  • Rolling Schema Upgrade mode allows you to upgrade one node at a time.  Takes node out of flow control and get behind and does not replicate the change. Then catches back up.  Same as master slave and change on slave first.  Cannot drop a column, rename a column, or insert a column in the middle.  All the issues with Alter table on replicated slaves exist with rolling schema upgrades. pt-online-schema-change fixes the issue but can cause and issue with foreign keys because of the renaming done by pt-osc which can cause foreign keys not to fire properly for a short period of time.
  • All Galera settings are node specific
  • Galera.cache does not persist past restarts.  So having multiple nodes down can cause SST if you don’t have a node up will they are down.
  • On some linux versions you can tell a node what donor node to use by running: service mysql start –wsrep-sst-door=node1 or you can set the option in config file start and then modify config file
  • Suspect timeout allows nodes to be evicted and a new quorum vote is taken based on the previous node count.  If greater than 50% then cluster is still primary (P in myq_status) otherwise non-primary (n in myq_status).
  • You can connect to a non-primary node but you cannot perform selects, DML, or DDL with the connection but you can look at variables.  You can bootstrap the node but now you have two clusters.
  • Writes are frozen during the timeout (10 seconds) when a node fails
  • The arbitrator node gets all the replication traffic.  This allows it to forward the traffic if the link between the nodes in the other data centers is down.  Need to take this into account when selecting the DC for the arbitrator node as latency and security of this matters.
  • If all nodes go down you will have to bootstrap the node with the highest GTID (cat /var/lib/mysql/grastate.dat).  Only way to cleanly shut it down is to stop the application from writing before you shutdown.
  • If the shutdown was not clean you will not have GTIDs in grastate.dat.  You will need to use “mysqld_safe –wsrep-recover”  GTID is the stuff after the : on the recovery position line.
  • You can avoid SST by using a backup to initialize a node if your galera.cache is sized log enough to contain the transactions since the backup used.  You will need to have used the –galera-info option with your backup and then use the file created by the backup to create a grastate.dat before you start mysql.
  • If cat /var/lib/mysql/grastate.dat shows 00000000-0000-00000-0000000 then it will SST just as if the file is missing.
  • If node won’t SST and the data dir is corrupt you need to remove the data dir and mysql install_db –user=mysql to create a clean empty data dir and then start mysql to SST.  Make sure you remove the .sst.
  • flush tables with read lock; will stall your cluster because of flow control.
  • You can tune flow control using gcs.fc_limit to adjust how deep the queue is before flow control kicks in.
  • The problem node with flow control is the node with the non zero queue depth.  Kill it or fix it and the cluster will resume.
  • wsrep_sync_wait will cause reads to wait on applies to happen before the read.  Slows down reads but ensures reads on all nodes are consistent.

I was able to keep up all morning and for most of the afternoon, but Jay lost me with the cluster check part and so I had to stop doing the hands on parts.  But we plan on using an F5 and I understand the use so I will plat with it more when I get back home.  I was able to fix the issue on the break.  I had miss heard what option to put in the mysqlchk xinetd config file for parameter type.  It needed to be UNLISTED not UNCHECKED as I had heard.  This allowed me to get it running on all the nodes.

So my brain is fried from all this.  I will definitely be playing with this more when i get back from conference.  So you might see more posts around this content.  Stay tuned for more from Percona Live and on Galera.

One thought on “Percona XtraDB Cluster in Practice – Percona Live 2015

Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s