OpenStack is a conceptually neat piece of software. It's very distributed and fault-tolerant and loosely coupled. That also means that getting back meaningful errors can be a tough process when something goes wrong.
Today, I upgraded my development cluster from Grizzly to Havanah. I spent several hours understanding new and changed config options throughout the landscape and updated my puppet definitions to spread these new changes across the nodes. I finally decided to go home and get some food before attempting to finish debugging some of the errors coming up from various nodes. I've complained to colleagues in the past about error reporting and how non-trivial it is to trace things through OpenStack. Well, here is a prime example.
In Horizon (the Web-GUI for mortals to use OpenStack), I tried creating an instance. It simply said it couldn't do it and left a cryptic error with a uuid listed. I'm used to this by now so I jump to the error logs in the apache log for Horizon. Not much there either. Hmm ... well, I just upgraded *everything* so where do I start? How about a compute node... after all, that's where the instance is supposed to start. Let's see if anything is going on there. Probe around a little and notice that nova-compute isn't actually running. Simple! Let's just start that up. Oh wait, the daemon is still not running. Now I have something in my error logs:
2013-09-13 19:07:08.959 4504 ERROR nova.openstack.common.threadgroup [-] Remote error: UnsupportedRpcVersion Specified RPC version, 1.50, not supported by this endpoint.
[u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data\n rval = self.proxy.dispatch(ctxt, version, method, **args)\n', u' File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch\n raise rpc_common.UnsupportedRpcVersion(version=version)\n', u'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n'].
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup Traceback (most recent call last):
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 117, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup x.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 49, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup return self.thread.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 168, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup return self._exit_event.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup return hubs.get_hub().switch()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup return self.greenlet.switch()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup result = function(*args, **kwargs)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/service.py", line 65, in run_service
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup service.start()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/service.py", line 156, in start
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup self.manager.init_host()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 752, in init_host
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup context, self.host, expected_attrs=['info_cache'])
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/objects/base.py", line 90, in wrapper
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup args, kwargs)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 507, in object_class_action
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup return self.call(context, msg, version='1.50')
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/proxy.py", line 126, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup result = rpc.call(context, real_topic, msg, timeout)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/__init__.py", line 140, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup return _get_impl().call(CONF, context, topic, msg, timeout)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 824, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup rpc_amqp.get_connection_pool(conf, Connection))
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 539, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup rv = list(rv)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 504, in __iter__
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup raise result
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup RemoteError: Remote error: UnsupportedRpcVersion Specified RPC version, 1.50, not supported by this endpoint.
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data\n rval = self.proxy.dispatch(ctxt, version, method, **args)\n', u' File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch\n raise rpc_common.UnsupportedRpcVersion(version=version)\n', u'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n'].
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup
2013-09-13 19:07:08.962 4504 DEBUG amqp [-] Closed channel #1 _do_close /usr/lib/python2.7/dist-packages/amqp/channel.py:88
Yup, clear as mud. I try the usual, "plug this into a popular search engine and see who else is getting the error." Nobody! Wow, a brand new one! Woohoo... Happy Friday. Fine, this is open source and written in python, let's just dive into the code a little and see where the culprit is. Aha! This is an error we get when we don't know what is going on. The last case(in fact the last line) of the file.
So, let's try a little harder. We are dealing with RPC and we are getting a version issue. Something... Somewhere... doesn't like talking to me. Maybe it's the message server itself? *poke around with that* Other things seem fine, no errors to report there. Let's run nova-compute in debug mode in the foreground and see if we can glean anything about where it is failing. Hmm, one of the last things it reports on is talking to nova-conductor. Let's have a look at nova-conductor logs ... WOAH! That same error message! That's right, nova-conductor is *returning* this error for some reason:
2013-09-14 02:27:19.394 ERROR nova.openstack.common.rpc.amqp [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] Exception during message handling
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp Traceback (most recent call last):
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp rval = self.proxy.dispatch(ctxt, version, method, **args)
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp raise rpc_common.UnsupportedRpcVersion(version=version)
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp
2013-09-14 02:27:19.397 ERROR nova.openstack.common.rpc.common [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] Returning exception Specified RPC version, 1.50, not supported by this endpoint. to caller
2013-09-14 02:27:19.399 ERROR nova.openstack.common.rpc.common [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data\n rval = self.proxy.dispatch(ctxt, version, method, **args)\n', ' File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch\n raise rpc_common.UnsupportedRpcVersion(version=version)\n', 'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n']
I guess this should have been clear given all of the blatant references to nova-conductor. In fact, I was able to find the term "conductor" once in the backtrace which is pretty good for this kind of problem. So, this *must* be the source of my issue since this is where the error is being generated. Well, tonight, I got lucky. A package wasn't updated and doing a dist-upgrade to pull in matching packages fixed the issue. However, this is a scary place to be:
Upgrading one component (or not upgrading another) can cause components to completely fail on remote nodes. If I had 100 nova-compute nodes and a single node with an outdated nova-conductor running, presumably the 100 nodes would have crashed nova-compute daemons.
The take-away here is: make sure all components across the board are the same version for your cloud. The CLI commands seem to be able to accept a little bit of version lag but the internal components have issues with different versions running.
The less scary part in all of this is that I'm managed to keep my previously-started instances running because kvm processes were not directly affected. If I had customers that needed to start/stop processes, this would have been not fun. Of course, that's why we all keep a complete development/testing openstack cluster around, right?