The mystery of stuck snapshots

26 June 2017

A few days ago, in one of our OpenStack clouds, we noticed weird issues with Ceph-backed Cinder volumes. Hundreds of snapshots and volumes got stuck in the "deleting" state. It seemed like the cinder-volume service was down (and, in fact, Zabbix complained every now and then that it was down, but it quickly recovered).

We decided that the good old "service restart" command might be a good solution to this issue - and, in fact, it was. At least for the moment. We cleaned up the mess, reset the state of the stuck volumes and snapshots, and went home for a well-deserved rest. However, as it turned out, the rest wasn't really so well-deserved.

The next morning, once we got to the office, we noticed that the issue had occurred again. We were a bit worried and decided to investigate in detail, as it's not normal to have hundreds of volumes and snapshots stuck in the "deleting" state... is it?

We started an investigation. After interrogating several suspects, we came to the conclusion that the issue was caused by a script executed every night by one of our developers. The script was very simple - it just deleted snapshots. Hundreds of snapshots. Or maybe even thousands.
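Just for illustration, the script probably amounted to something like this minimal sketch using python-cinderclient (the credentials, the client version, and the "nightly-" naming convention are our assumptions, not the actual script):

# Minimal sketch of a nightly snapshot cleanup; credentials and the
# "nightly-" name prefix are hypothetical.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from cinderclient import client

auth = v3.Password(auth_url='http://keystone.example.com:5000/v3',
                   username='admin', password='secret',
                   project_name='admin',
                   user_domain_name='Default',
                   project_domain_name='Default')
cinder = client.Client('2', session=session.Session(auth=auth))

# Delete every snapshot matching the naming convention - hundreds,
# maybe thousands of delete requests fired one after another.
for snapshot in cinder.volume_snapshots.list():
    if snapshot.name and snapshot.name.startswith('nightly-'):
        cinder.volume_snapshots.delete(snapshot)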

OK, but in our enterprise-level, scalable, reliable, distributed Ceph cluster, deleting even thousands of snapshots should not cause cinder-volume to hang... We were sure that there was definitely something wrong with it, and we felt that we were on the right track to solving our mystery.

Evidence

Our system runs Mirantis OpenStack 9 and was deployed with Fuel. Users working in the Dashboard didn't notice any issues with volume creation and deletion; the problem occurred only when deleting several volumes or snapshots at once. When cinder-volume gets into this unstable state, it fails to respond to keepalive requests sent by the Cinder server, and, as a result, the cinder-volume status in Zabbix fluctuates.

As a quick workaround, our developer introduced a delay between snapshot deletion operations. Unfortunately, this was not a real solution: it slowed the whole script down, and our system was still vulnerable - it was just a matter of time until it hung again.
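In practice the workaround was nothing more than a pause inside the deletion loop, roughly like this (reusing the cinder client from the sketch above; the five-second interval is just an example):

import time

# Workaround only: throttle deletions so cinder-volume is not flooded.
for snapshot in cinder.volume_snapshots.list():
    if snapshot.name and snapshot.name.startswith('nightly-'):
        cinder.volume_snapshots.delete(snapshot)
        time.sleep(5)  # example interval; makes the script painfully slow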

Finding a solution

Fortunately, our Graylog decided to give us a hint. We found out that the issue could be connected with the MySQL database:

Deadlock detected when running 'quota_reserve': Retrying...

We started to investigate this. First, we decreased the MySQL deadlock timeout, but it didn't help much. After that, we asked Uncle Google and found this issue in the Cinder bug tracker, describing a very similar problem.
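For the record, the timeout in question most likely refers to MySQL's innodb_lock_wait_timeout; checking or lowering it takes only a few lines of PyMySQL (host and credentials below are placeholders):

import pymysql

# Placeholder connection details; point this at your own database host.
conn = pymysql.connect(host='localhost', user='root',
                       password='secret', database='cinder')
try:
    with conn.cursor() as cur:
        # Current lock wait timeout (what we loosely call the
        # "deadlock timeout" above).
        cur.execute("SHOW VARIABLES LIKE 'innodb_lock_wait_timeout'")
        print(cur.fetchone())
        # Lowering it globally would look like:
        # cur.execute("SET GLOBAL innodb_lock_wait_timeout = 10")
finally:
    conn.close()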

Final fix

Thanks to Gerhard Muntingh, we found and implemented a solution. Our issues were caused by an incorrect configuration of the MySQL connection URI. Fuel, during the deployment of OpenStack, had set it to:

mysql://cinder:<password>@<host>/cinder

Unfortunately, the driver selected by Cinder based on this configuration didn't handle concurrency correctly, which resulted in the weird behavior. To fix this, a simple change of the driver to mysql+pymysql was enough:

mysql+pymysql://cinder:<password>@<host>/cinder
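The scheme part of the URI is what SQLAlchemy (which Cinder uses through oslo.db) looks at when picking the DBAPI driver: plain mysql:// defaults to the C-based MySQL-Python driver, which blocks eventlet greenthreads during database calls, while mysql+pymysql:// selects the pure-Python PyMySQL driver, which cooperates with them. The difference is easy to see with SQLAlchemy itself (placeholder credentials; both DBAPI packages need to be installed for this to run):

from sqlalchemy import create_engine

# Placeholder credentials; only the URI scheme matters here.
blocking = create_engine('mysql://cinder:secret@localhost/cinder')
print(blocking.driver)      # 'mysqldb'  -> C extension, blocks greenthreads

cooperative = create_engine('mysql+pymysql://cinder:secret@localhost/cinder')
print(cooperative.driver)   # 'pymysql'  -> pure Python, eventlet-friendly

In practice, the fix is a one-line edit of the connection option in the [database] section of cinder.conf, followed by a restart of the Cinder services.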

Thanks, Gerhard, for a great solution!