Friday, September 13, 2013

OpenStack Error Reporting

OpenStack is a conceptually neat piece of software. It's very distributed and fault-tolerant and loosely coupled. That also means that getting back meaningful errors can be a tough process when something goes wrong.

Today, I upgraded my development cluster from Grizzly to Havanah. I spent several hours understanding new and changed config options throughout the landscape and updated my puppet definitions to spread these new changes across the nodes. I finally decided to go home and get some food before attempting to finish debugging some of the errors coming up from various nodes. I've complained to colleagues in the past about error reporting and how non-trivial it is to trace things through OpenStack. Well, here is a prime example.

In Horizon (the Web-GUI for mortals to use OpenStack), I tried creating an instance. It simply said it couldn't do it and left a cryptic error with a uuid listed. I'm used to this by now so I jump to the error logs in the apache log for Horizon. Not much there either. Hmm ... well, I just upgraded *everything* so where do I start? How about a compute node... after all, that's where the instance is supposed to start. Let's see if anything is going on there. Probe around a little and notice that nova-compute isn't actually running. Simple! Let's just start that up. Oh wait, the daemon is still not running. Now I have something in my error logs:

2013-09-13 19:07:08.959 4504 ERROR nova.openstack.common.threadgroup [-] Remote error: UnsupportedRpcVersion Specified RPC version, 1.50, not supported by this endpoint.
[u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data\n    rval = self.proxy.dispatch(ctxt, version, method, **args)\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch\n    raise rpc_common.UnsupportedRpcVersion(version=version)\n', u'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n'].
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup Traceback (most recent call last):
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 117, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     x.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 49, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self.thread.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 168, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self._exit_event.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return hubs.get_hub().switch()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self.greenlet.switch()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     result = function(*args, **kwargs)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/service.py", line 65, in run_service
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     service.start()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/service.py", line 156, in start
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     self.manager.init_host()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 752, in init_host
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     context, self.host, expected_attrs=['info_cache'])
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/objects/base.py", line 90, in wrapper
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     args, kwargs)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 507, in object_class_action
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self.call(context, msg, version='1.50')
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/proxy.py", line 126, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     result = rpc.call(context, real_topic, msg, timeout)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/__init__.py", line 140, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return _get_impl().call(CONF, context, topic, msg, timeout)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 824, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     rpc_amqp.get_connection_pool(conf, Connection))
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 539, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     rv = list(rv)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 504, in __iter__
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     raise result
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup RemoteError: Remote error: UnsupportedRpcVersion Specified RPC version, 1.50, not supported by this endpoint.
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup [u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data\n    rval = self.proxy.dispatch(ctxt, version, method, **args)\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch\n    raise rpc_common.UnsupportedRpcVersion(version=version)\n', u'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n'].
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup 
2013-09-13 19:07:08.962 4504 DEBUG amqp [-] Closed channel #1 _do_close /usr/lib/python2.7/dist-packages/amqp/channel.py:88

Yup, clear as mud. I try the usual, "plug this into a popular search engine and see who else is getting the error." Nobody! Wow, a brand new one! Woohoo... Happy Friday. Fine, this is open source and written in python, let's just dive into the code a little and see where the culprit is. Aha! This is an error we get when we don't know what is going on. The last case(in fact the last line) of the file.

So, let's try a little harder. We are dealing with RPC and we are getting a version issue. Something... Somewhere... doesn't like talking to me. Maybe it's the message server itself? *poke around with that* Other things seem fine, no errors to report there. Let's run nova-compute in debug mode in the foreground and see if we can glean anything about where it is failing. Hmm, one of the last things it reports on is talking to nova-conductor. Let's have a look at nova-conductor logs ... WOAH! That same error message! That's right, nova-conductor is *returning* this error for some reason:

2013-09-14 02:27:19.394 ERROR nova.openstack.common.rpc.amqp [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] Exception during message handling
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp Traceback (most recent call last):
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp     rval = self.proxy.dispatch(ctxt, version, method, **args)
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp     raise rpc_common.UnsupportedRpcVersion(version=version)
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp 
2013-09-14 02:27:19.397 ERROR nova.openstack.common.rpc.common [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] Returning exception Specified RPC version, 1.50, not supported by this endpoint. to caller
2013-09-14 02:27:19.399 ERROR nova.openstack.common.rpc.common [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data\n    rval = self.proxy.dispatch(ctxt, version, method, **args)\n', '  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch\n    raise rpc_common.UnsupportedRpcVersion(version=version)\n', 'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n']

I guess this should have been clear given all of the blatant references to nova-conductor. In fact, I was able to find the term "conductor" once in the backtrace which is pretty good for this kind of problem. So, this *must* be the source of my issue since this is where the error is being generated. Well, tonight, I got lucky. A package wasn't updated and doing a dist-upgrade to pull in matching packages fixed the issue. However, this is a scary place to be:

Upgrading one component (or not upgrading another) can cause components to completely fail on remote nodes. If I had 100 nova-compute nodes and a single node with an outdated nova-conductor running, presumably the 100 nodes would have crashed nova-compute daemons.

The take-away here is: make sure all components across the board are the same version for your cloud. The CLI commands seem to be able to accept a little bit of version lag but the internal components have issues with different versions running.

The less scary part in all of this is that I'm managed to keep my previously-started instances running because kvm processes were not directly affected. If I had customers that needed to start/stop processes, this would have been not fun. Of course, that's why we all keep a complete development/testing openstack cluster around, right?

3 comments:

  1. Hey!

    I'm having the same error but I can't get a solution. How did u solve it exactly?

    Thanks!

    ReplyDelete
    Replies
    1. The final solution for me was to make sure everything was exactly the same version across the board.

      Delete
  2. I'm having the same error you. It was great to see I'm not the only who had this problem because I also didn't get any joy from web searches for the solution or at least some pointer on resolving the issue. From my logs, it seems my error is related to Ceilometer, but when I try to upgrade Ceilometer it says it is at it's most up to date level! How does one know which version of the various components is compatible or not compatible with others. It would be great to have compatibility checks built into the every Openstack project because the software is evolving very quickly.

    ReplyDelete