Monday, August 26, 2013

OpenStack is ready for prime (test) time -- quantum/neutron rant

I've spent the last year poking and prodding OpenStack in hopes of having a management layer for the various hosts running instances throughout my data center. OpenStack has so many neat little bells and whistles and ... green developers.

My DC is pretty well established, so I need to be careful about introducing new technology. I'm amused at the toy that is DevStack; it seems to be the only streamlined way to install a cloud. The other way is to use pre-established puppet templates that make assumptions about how your hardware is set up. My use case is covered by neither, so I get the third option: install everything by hand and write my own puppet templates. That isn't even my real problem with OpenStack. My real problem is that there is no way to accomplish basic tasks you would expect when migrating your infrastructure, and while the documentation is there and pretty, it's not always clear.

Let's dive into my infrastructure for a moment. I run hosts with link aggregation and VLANs. There are no unused ethernet ports on the back of my hosts, thus there is no "management port"; the management network is simply another VLAN on that same trunk. Between trying to figure out whether to use Quantum with one of its plugins or nova-network with one of its plugins (and which version of OpenStack goes along with that decision), I was flailing miserably trying to work out how OpenStack would let me keep managing infrastructure the same way: link aggregation and VLANs. There is a VLAN Manager plugin and there is an Open vSwitch plugin, both of which seemed promising. After hosing my entire network with the Open vSwitch plugin (it bridged all of my VLANs together by connecting them directly to br-int, something I still don't understand), I knew I had to handle this entire project with kid gloves.
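For concreteness, here is roughly what each host's network looks like, spelled out by hand with sysfs and iproute2 (my actual setup lives in distro config files; the interface names, VLAN tag, and address below are made-up examples):

    # One 802.3ad (LACP) bond carrying tagged VLANs; the "management"
    # network is just one more tag on the same trunk -- no dedicated port.
    modprobe bonding
    modprobe 8021q
    echo +bond0  > /sys/class/net/bonding_masters
    echo 802.3ad > /sys/class/net/bond0/bonding/mode
    echo 100     > /sys/class/net/bond0/bonding/miimon
    ip link set eth0 down && echo +eth0 > /sys/class/net/bond0/bonding/slaves
    ip link set eth1 down && echo +eth1 > /sys/class/net/bond0/bonding/slaves
    ip link set bond0 up

    # the management VLAN (tag 10 here); the host's own IP lives on it
    ip link add link bond0 name bond0.10 type vlan id 10
    ip addr add 192.168.10.5/24 dev bond0.10
    ip link set bond0.10 up

Anything OpenStack manages has to coexist with that trunk, which is where all the trouble starts.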

Finally, I found out that Quantum and the LinuxBridge plugin would do what I needed. That was quite a relief, once I finally figured out that the terminology used in the docs is not what a system administrator with a networking bent would expect to see. Ok, time to get dirty! I can bring up VMs, I've got my image service (glance) running on top of my distributed object store (swift), and it's all authenticated via a central/replicated authentication service (keystone with a MySQL db). Wow, I can create a tiny little subnet with 3 IPs for testing. Ok, let's bring up a VM ... and then another. Oops, I need more IPs! No problem, there seems to be a "quantum subnet-update" command! Oh, hmm ... it won't let me update the allocation pool. Alright, let's remove the allocation pool and replace it with a larger one. No dice, IPs are already allocated from it. How about adding another IP allocation pool to the same subnet definition? Nope, you can't have the same subnet CIDR declared for two allocation pools.
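For the record, the dead end looks roughly like this (command syntax from the Grizzly-era quantum client, as best I recall; the network name, CIDR, and pool ranges are made-up examples):

    # a tiny test subnet with a 3-address allocation pool
    quantum net-create testnet
    quantum subnet-create testnet 10.1.2.0/24 \
        --allocation-pool start=10.1.2.10,end=10.1.2.12

    # the pool fills up after three VMs; try to grow it in place:
    quantum subnet-update <subnet-id> \
        --allocation-pool start=10.1.2.10,end=10.1.2.50
    # rejected -- the allocation pool can't be replaced once addresses
    # have been handed out of it, and creating a second subnet with the
    # same CIDR (to hold another pool) is refused as overlapping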

This is a *huge* problem if you are migrating hosts from a more-than-half-filled subnet into OpenStack. I guess it's not really a problem if you just have toy VMs on a toy network, but I'd really like to think that something with this much effort behind it could run in at least a semi-production environment.

A few other *big* things OpenStack is missing:

  * logging - errors show up anywhere/everywhere except where you are looking
  * meaningful errors - too many read like "couldn't connect to host" ... which host?!
  * stable interfaces - configuration files, configuration terms, backend daemons ... all change between releases of OpenStack
  * a decent FAQ with answers - instead there are scattered Launchpad bugs/discussions and IRC

If, after a year, I can't figure out how to get this thing safely into my infrastructure, I have to seriously doubt my abilities. Well, I did, until I realized that I wasn't the only one with serious, fundamental problems fitting this architecture into an existing network.

What it seems to come down to is this: either I significantly change the way I do things, or I don't do OpenStack.