Friday, September 13, 2013

OpenStack Error Reporting

OpenStack is a conceptually neat piece of software. It's distributed, fault-tolerant, and loosely coupled. That also means that getting meaningful errors back can be a tough process when something goes wrong.

Today, I upgraded my development cluster from Grizzly to Havana. I spent several hours understanding the new and changed config options throughout the landscape and updating my puppet definitions to spread those changes across the nodes. I finally decided to go home and get some food before attempting to finish debugging the errors coming up from various nodes. I've complained to colleagues in the past about error reporting and how non-trivial it is to trace things through OpenStack. Well, here is a prime example.

In Horizon (the web GUI for mortals to use OpenStack), I tried creating an instance. It simply said it couldn't do it and left a cryptic error with a UUID listed. I'm used to this by now, so I jump to the error logs in the Apache log for Horizon. Not much there either. Hmm ... well, I just upgraded *everything*, so where do I start? How about a compute node ... after all, that's where the instance is supposed to start. Let's see if anything is going on there. Probe around a little and notice that nova-compute isn't actually running. Simple! Let's just start that up. Oh wait, the daemon is still not running. Now I have something in my error logs:

2013-09-13 19:07:08.959 4504 ERROR nova.openstack.common.threadgroup [-] Remote error: UnsupportedRpcVersion Specified RPC version, 1.50, not supported by this endpoint.
[u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data\n    rval = self.proxy.dispatch(ctxt, version, method, **args)\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch\n    raise rpc_common.UnsupportedRpcVersion(version=version)\n', u'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n'].
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup Traceback (most recent call last):
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 117, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     x.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 49, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self.thread.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 168, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self._exit_event.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return hubs.get_hub().switch()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self.greenlet.switch()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     result = function(*args, **kwargs)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/service.py", line 65, in run_service
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     service.start()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/service.py", line 156, in start
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     self.manager.init_host()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 752, in init_host
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     context, self.host, expected_attrs=['info_cache'])
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/objects/base.py", line 90, in wrapper
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     args, kwargs)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 507, in object_class_action
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self.call(context, msg, version='1.50')
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/proxy.py", line 126, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     result = rpc.call(context, real_topic, msg, timeout)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/__init__.py", line 140, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return _get_impl().call(CONF, context, topic, msg, timeout)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 824, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     rpc_amqp.get_connection_pool(conf, Connection))
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 539, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     rv = list(rv)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 504, in __iter__
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     raise result
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup RemoteError: Remote error: UnsupportedRpcVersion Specified RPC version, 1.50, not supported by this endpoint.
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup [u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data\n    rval = self.proxy.dispatch(ctxt, version, method, **args)\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch\n    raise rpc_common.UnsupportedRpcVersion(version=version)\n', u'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n'].
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup 
2013-09-13 19:07:08.962 4504 DEBUG amqp [-] Closed channel #1 _do_close /usr/lib/python2.7/dist-packages/amqp/channel.py:88

Yup, clear as mud. I try the usual, "plug this into a popular search engine and see who else is getting the error." Nobody! Wow, a brand new one! Woohoo ... Happy Friday. Fine, this is open source and written in Python, so let's just dive into the code a little and see where the culprit is. Aha! This is the catch-all error you get when nothing else matched: the last case (in fact the last line) of the file.
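
To save anyone else the trip, the dispatch logic in question looks roughly like this (a simplified sketch from memory, not a verbatim copy of nova/openstack/common/rpc/dispatcher.py): it walks the registered API callbacks looking for one that can satisfy the requested version, and only when nothing matches does it fall through to this catch-all exception.

# Rough sketch of the dispatcher, paraphrased from memory -- not the
# exact upstream code.
def dispatch(self, ctxt, version, method, **kwargs):
    if not version:
        version = '1.0'

    had_compatible = False
    for proxyobj in self.callbacks:
        rpc_api_version = getattr(proxyobj, 'RPC_API_VERSION', '1.0')
        is_compatible = rpc_common.version_is_compatible(rpc_api_version,
                                                         version)
        had_compatible = had_compatible or is_compatible

        if not hasattr(proxyobj, method):
            continue
        if is_compatible:
            return getattr(proxyobj, method)(ctxt, **kwargs)

    if had_compatible:
        raise AttributeError("No such RPC function '%s'" % method)
    # The "last line of the file" -- nothing registered here speaks the
    # requested version.
    raise rpc_common.UnsupportedRpcVersion(version=version)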

So, let's try a little harder. We are dealing with RPC and we are getting a version issue. Something... Somewhere... doesn't like talking to me. Maybe it's the message server itself? *poke around with that* Other things seem fine, no errors to report there. Let's run nova-compute in debug mode in the foreground and see if we can glean anything about where it is failing. Hmm, one of the last things it reports on is talking to nova-conductor. Let's have a look at nova-conductor logs ... WOAH! That same error message! That's right, nova-conductor is *returning* this error for some reason:

2013-09-14 02:27:19.394 ERROR nova.openstack.common.rpc.amqp [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] Exception during message handling
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp Traceback (most recent call last):
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp     rval = self.proxy.dispatch(ctxt, version, method, **args)
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp     raise rpc_common.UnsupportedRpcVersion(version=version)
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp 
2013-09-14 02:27:19.397 ERROR nova.openstack.common.rpc.common [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] Returning exception Specified RPC version, 1.50, not supported by this endpoint. to caller
2013-09-14 02:27:19.399 ERROR nova.openstack.common.rpc.common [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data\n    rval = self.proxy.dispatch(ctxt, version, method, **args)\n', '  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 138, in dispatch\n    raise rpc_common.UnsupportedRpcVersion(version=version)\n', 'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n']
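
The version check the dispatcher relies on is also simple (again, a rough sketch from memory of the shared RPC code, not the verbatim source): major versions have to match exactly, and the requested minor version can't be newer than what the endpoint implements. My freshly upgraded compute node was asking for 1.50; the stale conductor code evidently advertised something older, so the check failed and the exception above came back over the wire.

# Rough sketch of the RPC version compatibility check (paraphrased).
def version_is_compatible(imp_version, version):
    """Can an endpoint at imp_version satisfy a request for version?"""
    imp_major, imp_minor = (int(p) for p in imp_version.split('.'))
    req_major, req_minor = (int(p) for p in version.split('.'))

    if req_major != imp_major:   # major versions must match exactly
        return False
    if req_minor > imp_minor:    # endpoint must be at least as new
        return False
    return True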

I guess this should have been clear given all of the blatant references to nova-conductor. In fact, I was able to find the term "conductor" exactly once in the backtrace, which is pretty good for this kind of problem. So, this *must* be the source of my issue, since this is where the error is being generated. Well, tonight I got lucky: one package hadn't been updated, and doing a dist-upgrade to pull in matching packages fixed the issue. However, this is a scary place to be:

Upgrading one component (or not upgrading another) can cause components on remote nodes to fail completely. If I had 100 nova-compute nodes and a single node running an outdated nova-conductor, presumably all 100 nodes would have ended up with crashed nova-compute daemons.

The takeaway here is: make sure all components across the board are the same version in your cloud. The CLI commands seem to tolerate a little bit of version lag, but the internal components have issues when different versions talk to each other.
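
If I ever find myself here again, something as dumb as the following could have caught the mismatch before the services fell over (a throwaway sketch, assuming Debian/Ubuntu packaging and passwordless SSH to each node; the hostnames and package list are placeholders, not anything OpenStack ships):

#!/usr/bin/env python
# Throwaway consistency check: make sure every node reports the same
# installed version for the nova packages. Hostnames and package names
# below are placeholders -- adjust for your own cloud.
import subprocess

NODES = ['controller1', 'compute1', 'compute2']
PACKAGES = ['nova-common', 'python-nova']

def installed_version(node, package):
    # Single quotes keep the remote shell from expanding ${Version}.
    remote_cmd = "dpkg-query -W -f='${Version}' %s" % package
    return subprocess.check_output(['ssh', node, remote_cmd]).strip()

for package in PACKAGES:
    versions = dict((node, installed_version(node, package)) for node in NODES)
    if len(set(versions.values())) > 1:
        print('MISMATCH for %s: %s' % (package, versions))
    else:
        print('%s consistent at %s' % (package, versions[NODES[0]]))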

The less scary part in all of this is that I managed to keep my previously started instances running because the KVM processes were not directly affected. If I had customers that needed to start/stop instances, this would not have been fun. Of course, that's why we all keep a complete development/testing OpenStack cluster around, right?

Monday, August 26, 2013

OpenStack is ready for prime (test) time -- quantum/neutron rant

I've been spending the last year poking and prodding OpenStack in hopes of having a management layer for the various hosts with instances throughout my data center. OpenStack has so many neat little bells and whistles and ... green developers.

My DC is pretty well established, so I need to be careful about introducing new technology. I'm amused at the toy that is DevStack. It seems to be the only streamlined way to install a cloud. The other way is to use pre-established puppet templates that make assumptions about how your hardware is set up. My use case is covered by neither one, so I get the third option: install everything and write my own puppet templates. This isn't even my real problem with OpenStack. My real problem is that there is no way to accomplish basic tasks you would expect when migrating your infrastructure, and while the documentation is there and pretty, it's not always clear.

Let's dive into my infrastructure for a moment. I run hosts with link aggregation and VLANs. There are no unused Ethernet ports on the back of my hosts, so there is no "management port." The management network is simply another VLAN on that same trunk. Between trying to figure out whether to use Quantum with one of its plugins or nova-network with one of its plugins (and which version of OpenStack to go along with that decision), I was flailing miserably trying to figure out how OpenStack would let me keep managing infrastructure the same way: link aggregation and VLANs. There is a VLAN Manager plugin and there is an Open vSwitch plugin ... both of which seemed promising. After hosing my entire network with the Open vSwitch plugin (it bridged all of my VLANs together by connecting them directly to br-int, something I still don't understand), I knew I had to handle this entire project with kid gloves.

Finally, I found out that Quantum and the LinuxBridge plugin would do what I needed. That was quite a relief once I finally figured out that the terminology used in the docs is not what a system administrator with a networking bent would expect to see. Ok, time to get dirty! I can bring up VMs, I've got my image service (Glance) running on top of my distributed object store (Swift), and it's all authenticated via a central/replicated authentication service (Keystone with a MySQL DB). Wow, I can create a tiny little subnet with 3 IPs for testing. Ok, let's bring up a VM ... and then another. Oops, I need more IPs! No problem, there seems to be a "quantum subnet-update" command! Oh, hmm ... it won't let me update the allocation pool. Alright, let's remove the allocation pool and replace it with a larger one. No dice, IPs are allocated. How about adding another IP allocation pool to the same subnet definition? Nope, can't have the same subnet CIDR declared for two allocation pools.
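
For concreteness, this is roughly what I was asking the API to do, via the Python client instead of the CLI (a minimal sketch assuming python-quantumclient's v2.0 client; the credentials, endpoint, and subnet ID are placeholders):

# Minimal sketch using python-quantumclient's v2.0 client; credentials,
# endpoint, and subnet ID below are placeholders.
from quantumclient.v2_0 import client

quantum = client.Client(username='admin',
                        password='secret',
                        tenant_name='admin',
                        auth_url='http://controller:5000/v2.0/')

subnet_id = 'SUBNET-UUID-HERE'
body = {'subnet': {'allocation_pools': [{'start': '10.0.0.10',
                                         'end': '10.0.0.100'}]}}
# In my testing this gets rejected -- the allocation pool can't simply be
# grown in place once addresses have been handed out of it.
quantum.update_subnet(subnet_id, body)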

This is a *huge* problem if you are migrating hosts from a more-than-half-filled subnet into OpenStack. I guess it's not really a problem if you just have toy VMs on a toy network, but I'd really like to think that something with this much effort put into it could be run in at least a semi-production environment.

A few other *big* problems with OpenStack:

  * logging - errors show up anywhere/everywhere except where you are looking
  * meaningful errors - many errors show up like, "couldn't connect to host" ... which host?!
  * stable interfaces - configuration files, configuration terms, backend daemons, ... all change between releases of OpenStack.
  * decent FAQ with answers - instead, what exists is scattered across Launchpad bugs/discussions and IRC

If, after a year, I can't figure out how to get this thing safely into my infrastructure, I have to seriously doubt my abilities. Well, that is, I did until I realized that I wasn't the only one with serious, fundamental problems fitting the architecture into my current network.

What it seems to come down to is this: either I need to significantly change the way I do things, or I need to not do OpenStack.