Friday, December 11, 2015

For the love of bits, stop using gzip!

Every time you download a tar.gz, Zod kills a kitten. Every time you generate a tar.gz, well, let's keep this family-safe.

In 2015, there should be very few reasons to generate .tar.gz files. We might as well just use .zip for all the progress we have made since 1992 when gzip was initially released. xz has been a thing since 2009, yet I still see very little adoption of the format. Today, I downloaded Meteor and noticed that it downloads a .tar.gz. I manually downloaded the file and then recompressed it using xz:

$ du -h meteor-bootstrap-os.osx.x86_64.tar.*
139M meteor-bootstrap-os.osx.x86_64.tar.gz
67M meteor-bootstrap-os.osx.x86_64.tar.xz

Seriously, less than half the size! Maybe it's the amount of time it takes to compress? Let's see:

$ cat meteor-bootstrap-os.osx.x86_64.tar | time xz -9 -c > /dev/null
xz -9 -c > /dev/null  165.21s user 1.14s system 99% cpu 2:47.70 total

$ cat meteor-bootstrap-os.osx.x86_64.tar | time gzip -9 -c > /dev/null
gzip -9 -c > /dev/null 35.03s user 0.23s system 99% cpu 35.583 total

Ok, so compressing takes longer. You only have to do it once, though, and it's still reasonable for something that produces a 600MB tarball in the first place. What about decompressing?

$ time xz -d -c meteor-bootstrap-os.osx.x86_64.tar.xz > /dev/null
4.25s user 0.08s system 99% cpu 4.327 total

$ time gzip -d -c meteor-bootstrap-os.osx.x86_64.tar.gz > /dev/null
1.35s user 0.04s system 99% cpu 1.389 total

... and decompressing takes a little longer. But, wait a second, how long does it take to download the file in the first place? I'm on a decent connection and the file is being hosted on something that delivers the content at an average of (say) 1.5 MB/s. That's roughly 93 seconds for the .tar.gz and 45 seconds for the .tar.xz. Since the content is streamed directly to tar (à la: curl ... | tar -xf - ), we don't actually see a slowdown from xz being slower to decompress; we see an overall speedup, because the slowest operation is getting the bits in the first place!
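
The size gap is easy to reproduce with nothing but a standard library. Here's a quick Python sketch on toy data (the exact byte counts depend on the input, but the ordering shouldn't change):

```python
import gzip
import lzma

# A highly repetitive payload, standing in for a tarball full of similar files.
data = b"all work and no play makes Jack a dull boy\n" * 30_000

gz = gzip.compress(data, compresslevel=9)  # DEFLATE, the algorithm behind .gz
xz = lzma.compress(data, preset=9)         # LZMA2, the algorithm behind .xz

print(len(data), len(gz), len(xz))
```

On compressible input, xz's much larger dictionary beats DEFLATE's 32 KB match window; that tiny window is the fundamental reason gzip tops out where it does.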

What about tooling?

OSX: tar -xf some.tar.xz (WORKS!)
Linux: tar -xf some.tar.xz (WORKS!)
Windows: ? (No idea, I haven't touched the platform in a while... should WORK!)
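
The tooling point extends to libraries too: Python's stdlib tarfile, for one, can read an xz-compressed tar as a pure stream (mode "r|xz", no seeking allowed), which is exactly the property the curl ... | tar -xf - pipeline relies on. A small self-contained round trip:

```python
import io
import tarfile

# Build a tiny tar.xz in memory...
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:xz") as tar:
    payload = b"hello xz\n"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# ...then read it back in streaming mode, as if it were arriving over a pipe.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r|xz") as tar:  # "|" means no seeks
    member = tar.next()
    content = tar.extractfile(member).read()

print(member.name, content)
```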

Why am I picking on Meteor? Well, they place the tagline of "Build apps that are a delight to use, faster than you ever thought possible" right on their homepage. I just ran their install incantation and timed it:

./ 2.32s user 7.67s system 14% cpu 1:10.53 total

70 seconds! Nice job! I must have downloaded it slightly faster than in my initial testing. It also means that the install is severely limited by download speed. So ... I can easily imagine this being twice as fast. All that needs to change is the compression format and I should be able to install this in about 35 seconds!

So, who *does* use xz? The Linux kernel itself optionally supports xz compression of initrd images; vendors just need to pay attention and turn the flags on. Anyone else want to be part of the elite field of people who use xz? Please?

Sunday, July 27, 2014

Runners Manifesto -- draft

It seems like the running community needs something along the lines of what the Hacker Manifesto was meant to be, except for running. There are many articles describing running to runners but perhaps this can be a document to introduce non-runners to runners... or even a document for runners to cherish. 


Hi, I am a runner. I may not run faster than you, but I might run a lot further. Or maybe I can run more consistently every morning or evening. Or maybe I just enjoy lacing up a pair of shoes, or running barefoot.

First, let's be clear. I don’t run away from things, I run towards them. I make goals, have dreams and connect with other runners because we have a fundamental understanding of each other. We like to run.

Why do I run? Well, why do you breathe? Running is kinda like breathing. Yeah, I could stop for a while but I would start getting lightheaded and dizzy… kinda like before a marathon. You really don’t want to bother me then.

I, like so many others, used to not understand runners. In high school, I played soccer and I could run. In college, I biked around quite a bit. It wasn’t until life forced me to discover new things that I finally found my inner runner. Once I did, I realized that running fast was fun. Running far was fun. Trying to do both was dangerous… and also fun. Ultimately, neither was fundamental to what I did on a daily basis, but somehow running became a way of life.

I was lucky when I started; I met a group of runners who had been running for a long time. There were wise old owls to give sage advice and young yet dedicated runners to keep the energy up. Just about everyone could whup my butt, but I found that it didn’t matter. We were all friends and supported each other in our goals. Were we running the same race? Cool! Let’s get together and keep each other motivated to do better. Running similar races? Awesome … let’s get together and help keep each other motivated. Completely different races? Let’s get together and keep each other motivated. Essentially, I had walked into a community that hides in plain sight and supports itself.

What defines a runner? For me, it’s someone who gets out there and pushes themselves to achieve their goals. It may start with seeing a 5k as something to accomplish, and then trying to beat that 5k time. Maybe it takes you twice as long to run a 5k as it takes me; that means you are working twice as long, and I respect that. Maybe you respect that I can do it twice as fast. Either way, you rock for doing it.

If you stop running, are you still a runner? Yeah, I think so. Running is a way of life, not an activity you do. I’ve met people who used to run and I can tell they are runners. They say they used to be runners but I know better. They still run, just not on their feet.

If you are a non-runner, that’s cool too. Please don’t ask why I run and I won’t ask why you don’t. Also, if you want to challenge me to a race, that’s awesome. Come and train with me and we’ll run whatever I’ve signed up for next.

See you out there!

Wednesday, March 19, 2014

Host configuration -- use the hostname to configure the network

As I imagine many people to be, when bringing up hosts, I'm still stuck in the days of:

Step 1: (node) configure networking
Step 2: (node) configure the hostname
Step 3: (service1) configure DNS to match
Step 4: (service2) plug this into some form of configuration management
Step 5: profit

The problem is that you often have a complicated back and forth between configuring the node and configuring the services around it. If you decide to semi-automate this, you might try adding DHCP into the mix:

Step 1: (service1) Allocate IP address for a node
Step 2: (node) Get MAC address from node
Step 3: (service2) Plug MAC+IP into DHCP configuration
Step 4: (service1+service2) Push out changes to dhcp/dns
Step 5: (node) initialize networking using DHCP
Step 6: profit

Adding in IPv6 and SLAAC, things get worse since you have to grab the SLAAC address after networking is brought up and that means two different changes to DNS. Pretty soon, you are spending several minutes just moving basic data between services on your network.

ENTER IPv6 and a slight amount of thought:

I recently had some other push factors (related to our private cloud) to try and minimize the effort spent here. I now use SLAAC and an IPv6 DNS address as a temporary configuration point. Then I query DNS for the information I need to configure the node. This basically just comes down to IPv4 and IPv6 address at the moment:

Step 1: (service1) Allocate IPv4 and IPv6 address
Step 2: (node) bring up node with temporary (SLAAC) config
Step 3: (node) query DNS for addresses and plug the values into a static configuration
Step 4: profit

Now all I need is a script on the new node to make this much faster and voilà! My configuration is down to two steps:

Step 1: (service1) Allocate addresses
Step 2: (node) run: init_host new-hostname
Step 3: profit

Of course, init_host runs dig and plugs values into the key places. Finally, it runs some form of configuration management (puppet in our case) to get the rest of the host configured. Since many hosts are virtual instances under OpenStack, we can simply leave an init_host script in the base image for convenience. Time taken to bring up a new node has gone from minutes of error-prone copy/paste to seconds of error-prone typing. I'm much more likely to be careful over a period of seconds with a few steps than a period of minutes with many steps. Hopefully our infrastructure will benefit ... and hopefully yours will too!
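
For the curious, the heart of such a script is just a DNS lookup plus a bit of templating. Here's a hypothetical Python sketch; the function names and the emitted config format are mine for illustration, not the actual init_host:

```python
import socket

def lookup_addresses(hostname):
    """Return (ipv4_addrs, ipv6_addrs) for a hostname, as DNS reports them."""
    infos = socket.getaddrinfo(hostname, None)
    v4 = sorted({i[4][0] for i in infos if i[0] == socket.AF_INET})
    v6 = sorted({i[4][0] for i in infos if i[0] == socket.AF_INET6})
    return v4, v6

def render_static_config(hostname):
    """Turn the DNS answer into a static config stanza (format is made up)."""
    v4, v6 = lookup_addresses(hostname)
    lines = [f"hostname {hostname}"]
    lines += [f"inet  static {a}" for a in v4]
    lines += [f"inet6 static {a}" for a in v6]
    return "\n".join(lines)

print(render_static_config("localhost"))
```

A real version would also write the rendered stanza to the right file and then kick off the configuration management run.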

Monday, March 3, 2014

Napa Valley Marathon - 2014

I live in a beautiful city that is practically ideal for runners in the winter. There are long flat areas to run as well as hilly and mountainous regions to train in depending on your goals. There is also an abundance of sunshine and ever so slight lack of oxygen at 2500' above sea level. To clarify, it's an ideal place for winter training for an early spring marathon since the summer gets pretty hot.

This training season was non-ideal in many ways: a stress fracture meant one month with no running, and I did very little speed/strength training. I focused entirely on my long runs, which I was just barely able to schedule before the marathon: two 20 mile runs on successive weekends. The last one was on flat-as-a-pancake pavement with two weeks (one weekend) of tapering before the marathon. Given my somewhat ad-hoc training plan, I didn't have a solid goal pace in mind and was planning on winging it somewhat.

On race weekend, my diet varied from very decent hotel breakfasts to dark chocolate peanut butter cups. Suffice to say, I should have been a little more regimented with my food. Pre-race dinner wasn't bad but I probably could have done better by adding a potato somehow for potassium. I had chicken, rice and veggies. I did manage to avoid wine and drank plenty of water throughout the day so that I would be able to spend less time worrying about hydration during the run. In hindsight, I may have had a little too much water and not enough salt/potassium to stay balanced.

Ok, ready, set, go! I woke up on race morning with very little sleep. Two nights before, my neighbors in the hotel had come back at 2 am and had an hour-long conversation. The night before the race wasn't as bad, since they started at midnight instead of 2 am. Of course, I had to wake up at 4:00 so I could leave by 4:30 to walk to the buses, which left the finish area at 5:15 for the race that started at 7. To be fair to the race directors, I think this was just enough time for all logistics to be worked out for the crowd, but I did spend 30 minutes on the bus after we arrived waiting to be kicked off. I think I will try to stay in Calistoga (close to the start) instead of Napa for those precious extra few hours of sleep if I run this again.

Chatter on the bus while driving up was the usual range of topics between runners. "This is my first marathon" ... "doesn't matter if it's your first, your 50th or you are 20 or 60, the feeling at the beginning is always the same" ... "does anyone know what the weather is supposed to be?" Ok, the last one may not be as typical on race day but the weather report had varied wildly in the past week. The forecast went from "no rain" on race day to "it's gonna rain" and everywhere in between. The night before the race I checked and there was a 0% chance. I took that and planned for only slightly worse. Thankfully I did plan for slightly worse.

At the start, I saw some friends in passing and made a few new ones. I saw a guy standing on his own and decided to make simple conversation. Jim would be happy with a sub-4 and hadn't run a marathon in a while. He asked what I was planning on and I let him know that I was hoping for 3:20 - 3:40 but was going to just see what happened. He seemed impressed and we continued chatting for a little bit. I saw another friend a little bit ahead, went to wish her good luck and then made my way back through the pack so I didn't get caught up in the rush of the start. I tend to go out too fast if I am too far in the front, so I compensate by starting a little further back and taking it easy for a mile.

After some poetry, the national anthem and the usual positive heckling of the announcer, we were off! I started further back than usual because people were walking all the way up to the starting line (which was chip timed). I wasn't trying to PR, so I did my best to just go with it and then gently pass people until I reached a slightly faster group. I did my usual trick of identifying people by the race shirts they had on and saying, "go Ragnar!" to people with Ragnar shirts. This worked well for a while until about mile 2, where I was able to hit my target range.

I caught up with some friends who were running the course for the heck of it. They originally signed up for the race but had injuries during training and thus planned on not doing it. The night before, there was wine and "good" decision making, so the whole group decided to start the race and see what happened. I caught up to a few people in the group and passed them without realizing it was them. Once I had gone a little ways ahead they called me on it and caught up. We ran together for a short period and they definitely had their own conversation going. I decided to let the ladies talk while I kept focusing on my run, which was starting to feel increasingly difficult.

I was breathing much harder than I should have been at that pace, which seemed odd given that I was 2000' lower in elevation than usual. Eventually they pulled past and I started throttling my pace back. My body wasn't hurting or feeling tired, but I was starting to feel a lot of mental resistance; far more than should happen around mile 3 or 4. I couldn't really lock onto what was problematic, so I chalked the stress up to negative thoughts and tried pulling myself out of them.

I passed Jim and he surprised me, "Hey Tim, way to go!" I had to look back a little to recognize him and cheered him on too. There was still a lot of positive energy in the air and some surrounding runners chuckled at the exchange. I think he finished with, "I'll see you at the end!"

The spectators were great along the sides of the road. Nice posters and positive (but not over the top) comments for the passing runners. One group was even offering an entire line of high-fives. I think that was the fastest part of the course for me. I am a total sucker for that sort of thing and love to play into it. My GPS log shows a single peak at 2 minutes faster than my target pace. I'm glad only one crowd was that organized with the high-fives.

I kept the pace, or at least I thought I did for a while. The resistance was getting unusually difficult to wade through. I tried everything to get positive thoughts flowing. I looked around at the beautiful grape fields and hills, enjoyed the cool temperatures, appreciated the runners around me, and came back to thinking of wine at the end. Just pushing through wasn't working either. Nothing helped. It's a downhill race with an uphill battle. Eventually, I dropped my pace by two minutes for a half mile to try and reset a little bit. There was no way I could sustain this mentally even though my body was feeling mostly ok. Finally, I decided to walk. That's a tough decision for me, especially before mile 13.

The resistance I felt just would not give up. I started employing tactics such as 1 mile on and then a short walk/jog rest to try and break out of the head space. Once I figured out that I could walk, I started walking more and my overall pace was starting to really suffer. Jim caught up to me and asked what was wrong. I just kinda shook my head and said, "it's all mental." He knew exactly what I was saying and gave me an encouraging pat on the back. After talking for a little bit he continued on, "I'll wait for you at the finish!" At that point, a course motorcycle monitor looked back and asked for a thumbs up/thumbs down. I gave him a thumbs up, but I noticed he took a moment to believe me, and he didn't ask someone ahead of me who was also walking.

Finally, I suspected that this might be food related. I didn't plan on taking anything but salt pills and aid-station water, which had worked fine for long runs in the past. There was an emergency GU in my pocket but I did not want to resort to that if I didn't have to. I started accepting gatorade and eventually bananas, and even went back for seconds a few times. By mile 19, I was starting to feel a little less mental fatigue, but all the walk breaks and cold had told my body, "ok, we are basically done." I then walked for half a mile trying to just focus on gradually building steam. I was hoping I could build up a little momentum and start into a jog that would eventually become a run again. I started an easy pace and decided that I had to hold it for a mile to see what happened.

I walked again for a brief period, started up, and then it happened: my right calf started knotting up whenever I would land/push off. It was no longer just a question in my mind about running vs not; my body was enforcing that I could not run. By this point I had gotten past the "maybe I should drop" thoughts and graduated to "if I have to walk 6 miles then so be it." My mental state was getting much better but my body was getting worse, which reinforced the fuel-related diagnosis.

What was nice about passing mile 20 is that others were also hitting a wall. I wasn't alone anymore! There were people passing and walking ahead. Some people walked around me and someone even passed and said, "go Ragnar!" since I had my oldest Ragnar shirt on. That managed to bring a smile to my face.

I walked for two miles and tried again. Nope! Not happening. Potassium doesn't come from mid-air but rather from aid-station volunteers. Finally, an aid station! Lots of gatorade, water to help dilute it, a salt pill and as many bananas as I could hold with a cup of water. I gave myself 10 minutes to let the salt be absorbed into my system and let the food ease its way into a safe place. "Ok, let's try running again," I thought to myself. Nope. Nope nope nope! One more time? Nope. Fine, little more walking, but damnit, I'm making it!

In the final stretch, once I got over all the people passing me, I found new motivation. If I couldn't run, I was going to walk as fast as I could. Constant forward motion was my goal. I tried running a few times and, after analyzing the knots trying to form in my calf, I finally figured out that if I used my heel a little more and my forefoot a little less, I could jog. Normally, this is not how I optimize my run. I wear minimal footwear and heel strikes are considered dangerous. The shoes I was wearing had several millimeters of padding, however, so I figured I could get away with it.

I thought to myself, "I'm going to make it, no matter what I need to do." The light at the end of the tunnel was starting to twinkle through slowly and the only thing keeping me from shedding a few tears of happiness was the well placed crowds of people cheering on runners. They probably wouldn't have noticed given the light but constant rain that was falling by now. Runners themselves were saying motivating things as they passed. Sometimes it was for themselves, via a mantra to keep going; other times it was definitely aimed at me. Either way, it was inspiring.

Ok, new goal with new information: I'm going to run through the finish. Oh, someone has homemade sorbet! I stopped and asked briefly about flavors. I finally chose the lightly colored one and it was SO GOOD. If the person who made that and stood out in the cold to serve sorbet is reading this, "Thank you! Something about it made me feel instantly better." I jogged out of sight and the cramps came back. So close! Walk for a moment... then start up again. Nope. Ok, let's walk and conserve a little energy. I'm running through this finish no matter what.

One spectator had measured out 0.2 miles beyond the 25 mile post. She was trying to inspire people to keep going since there was exactly one mile left. Again, being a sucker for spectators, I played into it, got a fist bump and started running. Oops, slowly, right ... cramps. Ok, I can do a mile! I can always do a mile!

With the finish area in sight, both of my calves were starting to cramp up even with my modified run. I finally decided I didn't care and just did whatever I had to in order to keep my body moving forward through the finish. I only vaguely saw the finish clock, just enough to recognize a "4" as the first number... which I was well aware of anyway. I crossed all three finish mats and was greeted by someone immediately after I stopped running. She was looking me very intently in the eyes and I was trying really hard to recognize this person. Did I know her? Was she part of the group I came with? She had a lot of stuff on to stay warm/dry, but I finally figured out she was trying to assess my state. I mentioned some cramping and she mentioned some medical tents. Someone placed a medal around my neck while another placed a heat blanket around my shoulders. The medal is cool but, hands down, it was the heat blanket that won in my mind.

This was definitely not the marathon I planned to run but there is no way you could get me to change a thing about it for the world. My time was an hour slower than anticipated, my body still hurts and I'm pretty sure I'll have a few nagging injuries for a while but the memories will last a lifetime.

Tuesday, February 25, 2014

OpenStack and Glance Image Types vs Size

I have looked around for good information on images and support for sparse storage and compression. After experimenting a bit I decided to compile a few key results. There are two attributes I care about here: storage and network transfer.

In order for efficient storage of an image, the image must be as small as possible. If there is a run of no data on an image (free space that has been filled with zeros) then that should not reserve bits on the filesystem. The key term here is "sparse image." Once you transfer a sparse image to another host, the "sparse bits" actually get transmitted as long strings of zeros... thus you transfer 10 GB of data for a 10 GB sparsely allocated image with (say) 6 GB of actual filesystem data.
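
The apparent-size vs. allocated-blocks distinction is easy to see for yourself. A quick Python sketch (assuming a sparse-capable filesystem such as ext4 or tmpfs):

```python
import os
import tempfile

# truncate() extends a file without writing data: the result is one big "hole".
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "r+b") as f:
    f.truncate(10 * 1024 * 1024)  # 10 MB apparent size, ~0 bytes allocated

st = os.stat(path)
print("apparent size:", st.st_size)
print("allocated:", st.st_blocks * 512)  # st_blocks is in 512-byte units
os.unlink(path)
```

This is the same distinction `ls -lh` vs `du -h` shows: `ls` reports the apparent size, `du` reports what is actually allocated.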

In order to be efficient for a network transfer, the image has to actually be small which ends up meaning compression. (eg: gzip/bzip2/...)

I used an LVM backed instance for my source data and in order to fill zeros into the free space of the block storage I ran this inside the instance:

dd if=/dev/zero of=/tmp/zeros bs=4M; rm /tmp/zeros; halt

This basically just wrote zeros into a file under /tmp, removed it and then halted the instance. You may need to use a different location on your host because /tmp is sometimes mounted as a different filesystem (eg: tmpfs.)

Now, my lv is ready to convert to different image types. I used qemu-img in order to read from the block device to create images of different varieties:

qemu-img convert -c -O qcow2 /dev/vg/kvm-woof.snap host-woof.compressed.qcow2
qemu-img convert -O qcow2 /dev/vg/kvm-woof.snap host-woof.qcow2
qemu-img convert -O raw /dev/vg/kvm-woof.snap host-woof.raw
qemu-img convert -O vdi /dev/vg/kvm-woof.snap host-woof.vdi
qemu-img convert -O vmdk /dev/vg/kvm-woof.snap host-woof.vmdk

The differences between sizes somewhat surprised me:

root@os-ph1:/staging# ls -lh host-woof.* 
-rw-r--r-- 1 root root 2.1G Feb 25 17:01 host-woof.compressed.qcow2
-rw-r--r-- 1 root root 6.3G Feb 25 16:52 host-woof.qcow2
-rw-r--r-- 1 root root  10G Feb 25 16:42 host-woof.raw
-rw-r--r-- 1 root root 6.9G Feb 25 16:49 host-woof.vdi
-rw-r--r-- 1 root root 6.3G Feb 25 16:49 host-woof.vmdk

A raw image is pretty much the worst thing you can do. It turns out that only the qcow2 format supports compression within the image itself. There are some downsides to qcow2 with performance but certainly if you are creating/destroying VMs a lot then you can save yourself some network bandwidth by transferring smaller images. Creating the compressed qcow2 image took roughly twice as long (no hard numbers here) as the uncompressed version. YMMV as not all data is created equal (or equally compressible).

It's also interesting to note that a compressed qcow2 image may slowly expand through use. New writes to the image may not be compressed.

-- EDIT --

I use LVM to back instances on hosts in OpenStack. It turns out that regardless of the format chosen, the whole (raw) image still needs to be written out to the logical volume. Unless you have really fast disks, this is by far the slowest part of the process. At least in my environment (with spinning rust) all I gain is some space on my image store and less overall used network bandwidth. Also, images are cached in a raw format on each node. The image also acts something like a sparse image:

root@os-ph12:/var/lib/nova/instances/_base# ls -lh abcca*
-rw-r--r-- 1 nova nova 10G Feb 25 21:05 abcca5bbe40c5b147f8a110bf81dab8bbb65db25
root@os-ph12:/var/lib/nova/instances/_base# du -h abcca*
6.2G    abcca5bbe40c5b147f8a110bf81dab8bbb65db25

Tuesday, January 7, 2014

"Unavailable console type spice" -- Another OpenStack Error

Tonight, I found myself trying to implement the spice proxy in Havana out of curiosity. After banging my head against the wall for a while, I found a nice post that talked about how to do this.

Anyway, this led to getting "console is currently unavailable. Please try again later. Reload" when I tried to load the console of an already running host. I dove into the stack trace I received in the nova-compute logs:

2014-01-07 22:41:51.402 5706 ERROR nova.openstack.common.rpc.amqp [req-4ad10aaf-60c0-4f88-8964-cb3f6dd06814 6c978326923a4fa997a6a83b3fdbd11e 47eedd8414b84466a731289a5d6dee35] Exception during message handling
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp Traceback (most recent call last):
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 461, in _process_data
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     **args)
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 172, in dispatch
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     result = getattr(proxyobj, method)(ctxt, **kwargs)
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 439, in inner
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     return catch_client_exception(exceptions, func, *args, **kwargs)
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 420, in catch_client_exception
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     return func(*args, **kwargs)
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/", line 90, in wrapped
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     payload)
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/", line 73, in wrapped
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     return f(self, context, *args, **kw)
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/", line 271, in decorated_function
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     e, sys.exc_info())
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/", line 258, in decorated_function
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     return function(self, context, *args, **kwargs)
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/compute/", line 3579, in get_spice_console
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     connect_info = self.driver.get_spice_console(instance)
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/", line 2199, in get_spice_console
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     ports = get_spice_ports_for_instance(instance['name'])
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/", line 2197, in get_spice_ports_for_instance
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp     raise exception.ConsoleTypeUnavailable(console_type='spice')
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp ConsoleTypeUnavailable: Unavailable console type spice.
2014-01-07 22:41:51.402 5706 TRACE nova.openstack.common.rpc.amqp 
2014-01-07 22:41:51.404 5706 ERROR nova.openstack.common.rpc.common [req-4ad10aaf-60c0-4f88-8964-cb3f6dd06814 6c978326923a4fa997a6a83b3fdbd11e 47eedd8414b84466a731289a5d6dee35] Returning exception Unavailable console type spice. to caller

and the final hint was a comment in the source:

# head -2197 /usr/lib/python2.7/dist-packages/nova/virt/libvirt/ | tail -3
            # NOTE(rmk): We had Spice consoles enabled but the instance in
            # question is not actually listening for connections.
            raise exception.ConsoleTypeUnavailable(console_type='spice')

Turns out that running VMs have VNC or SPICE nailed into their configuration:

root@os-ph12:/# EDITOR=cat virsh edit instance-000000a1  | egrep 'graphics.*(spice|vnc)' 
root@os-ph12:/# EDITOR=cat virsh edit instance-000000a0  | egrep 'graphics.*(spice|vnc)' 

Thus, the VMs needed to be rebuilt on the host with spice enabled.

Friday, September 13, 2013

OpenStack Error Reporting

OpenStack is a conceptually neat piece of software: distributed, fault-tolerant, and loosely coupled. That also means that getting meaningful errors back can be a tough process when something goes wrong.

Today, I upgraded my development cluster from Grizzly to Havana. I spent several hours understanding new and changed config options throughout the landscape and updated my puppet definitions to spread these new changes across the nodes. I finally decided to go home and get some food before attempting to finish debugging some of the errors coming up from various nodes. I've complained to colleagues in the past about error reporting and how non-trivial it is to trace things through OpenStack. Well, here is a prime example.

In Horizon (the Web-GUI for mortals to use OpenStack), I tried creating an instance. It simply said it couldn't do it and left a cryptic error with a uuid listed. I'm used to this by now so I jump to the error logs in the apache log for Horizon. Not much there either. Hmm ... well, I just upgraded *everything* so where do I start? How about a compute node... after all, that's where the instance is supposed to start. Let's see if anything is going on there. Probe around a little and notice that nova-compute isn't actually running. Simple! Let's just start that up. Oh wait, the daemon is still not running. Now I have something in my error logs:

2013-09-13 19:07:08.959 4504 ERROR nova.openstack.common.threadgroup [-] Remote error: UnsupportedRpcVersion Specified RPC version, 1.50, not supported by this endpoint.
[u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 430, in _process_data\n    rval = self.proxy.dispatch(ctxt, version, method, **args)\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 138, in dispatch\n    raise rpc_common.UnsupportedRpcVersion(version=version)\n', u'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n'].
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup Traceback (most recent call last):
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/", line 117, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     x.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/", line 49, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self.thread.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/", line 168, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self._exit_event.wait()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/", line 116, in wait
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return hubs.get_hub().switch()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/hubs/", line 187, in switch
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return self.greenlet.switch()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/eventlet/", line 194, in main
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     result = function(*args, **kwargs)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/", line 65, in run_service
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     service.start()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/", line 156, in start
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     self.manager.init_host()
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/compute/", line 752, in init_host
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     context,, expected_attrs=['info_cache'])
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/objects/", line 90, in wrapper
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     args, kwargs)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/conductor/", line 507, in object_class_action
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return, msg, version='1.50')
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 126, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     result =, real_topic, msg, timeout)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 140, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     return _get_impl().call(CONF, context, topic, msg, timeout)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 824, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     rpc_amqp.get_connection_pool(conf, Connection))
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 539, in call
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     rv = list(rv)
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 504, in __iter__
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup     raise result
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup RemoteError: Remote error: UnsupportedRpcVersion Specified RPC version, 1.50, not supported by this endpoint.
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup [u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 430, in _process_data\n    rval = self.proxy.dispatch(ctxt, version, method, **args)\n', u'  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 138, in dispatch\n    raise rpc_common.UnsupportedRpcVersion(version=version)\n', u'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n'].
2013-09-13 19:07:08.959 4504 TRACE nova.openstack.common.threadgroup 
2013-09-13 19:07:08.962 4504 DEBUG amqp [-] Closed channel #1 _do_close /usr/lib/python2.7/dist-packages/amqp/

Yup, clear as mud. I try the usual: plug this into a popular search engine and see who else is getting the error. Nobody! Wow, a brand new one! Woohoo... Happy Friday. Fine, this is open source and written in python, so let's dive into the code a little and see where the culprit is. Aha! This is the error we get when we don't know what is going on: the last case (in fact, the last line) of the file.
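The pattern behind this error is worth sketching out. Nova's RPC layer tags every message with a version, and the receiving endpoint refuses messages newer than what it understands. This is a simplified, hypothetical sketch of that check (class and method names are illustrative, not the actual nova source; the real logic lives in nova's RPC dispatcher):

```python
# Hypothetical sketch of versioned RPC dispatch; names and the version
# number 1.48 are illustrative, not taken from the nova source tree.
class UnsupportedRpcVersion(Exception):
    def __init__(self, version):
        super(UnsupportedRpcVersion, self).__init__(
            "Specified RPC version, %s, not supported by this endpoint." % version)

class ConductorEndpoint(object):
    # The newest message revision this (stale, older-release) endpoint knows.
    RPC_API_VERSION = '1.48'

    def compatible(self, requested):
        # Same major version, and a requested minor no newer than ours.
        req_major, req_minor = map(int, requested.split('.'))
        own_major, own_minor = map(int, self.RPC_API_VERSION.split('.'))
        return req_major == own_major and req_minor <= own_minor

    def dispatch(self, version, method, **kwargs):
        if not self.compatible(version):
            # An upgraded caller asking for 1.50 lands here on a stale endpoint.
            raise UnsupportedRpcVersion(version)
        return getattr(self, method)(**kwargs)
```

Under this model, an upgraded nova-compute sending version='1.50' to a conductor that only knows an older revision fails the check, and the exception is shipped back over AMQP to the caller, which is why the error surfaced on the compute node first.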

So, let's try a little harder. We are dealing with RPC and we are getting a version issue. Something... Somewhere... doesn't like talking to me. Maybe it's the message server itself? *poke around with that* Other things seem fine, no errors to report there. Let's run nova-compute in debug mode in the foreground and see if we can glean anything about where it is failing. Hmm, one of the last things it reports on is talking to nova-conductor. Let's have a look at nova-conductor logs ... WOAH! That same error message! That's right, nova-conductor is *returning* this error for some reason:

2013-09-14 02:27:19.394 ERROR nova.openstack.common.rpc.amqp [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] Exception during message handling
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp Traceback (most recent call last):
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 430, in _process_data
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp     rval = self.proxy.dispatch(ctxt, version, method, **args)
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 138, in dispatch
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp     raise rpc_common.UnsupportedRpcVersion(version=version)
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.
2013-09-14 02:27:19.394 17496 TRACE nova.openstack.common.rpc.amqp 
2013-09-14 02:27:19.397 ERROR nova.openstack.common.rpc.common [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] Returning exception Specified RPC version, 1.50, not supported by this endpoint. to caller
2013-09-14 02:27:19.399 ERROR nova.openstack.common.rpc.common [req-cd3c0d15-fa20-486a-8935-2ed5fec76f7d None None] ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 430, in _process_data\n    rval = self.proxy.dispatch(ctxt, version, method, **args)\n', '  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/", line 138, in dispatch\n    raise rpc_common.UnsupportedRpcVersion(version=version)\n', 'UnsupportedRpcVersion: Specified RPC version, 1.50, not supported by this endpoint.\n']

I guess this should have been clear given all of the blatant references to nova-conductor. In fact, I was able to find the term "conductor" once in the backtrace, which is pretty good for this kind of problem. So, this *must* be the source of my issue, since this is where the error is being generated. Well, tonight, I got lucky. A package hadn't been updated, and doing a dist-upgrade to pull in matching packages fixed the issue. However, this is a scary place to be:

Upgrading one component (or not upgrading another) can cause components to completely fail on remote nodes. If I had 100 nova-compute nodes and a single node with an outdated nova-conductor running, presumably the 100 nodes would have crashed nova-compute daemons.

The take-away here is: make sure all components across the board are the same version in your cloud. The CLI commands seem to tolerate a little bit of version lag, but the internal components have issues when different versions are running.

The less scary part in all of this is that I managed to keep my previously-started instances running because the kvm processes were not directly affected. If I had customers that needed to start/stop instances, this would have been no fun. Of course, that's why we all keep a complete development/testing OpenStack cluster around, right?

Monday, August 26, 2013

OpenStack is ready for prime (test) time -- quantum/neutron rant

I've been spending the last year poking and prodding OpenStack in hopes of having a management layer for the various hosts with instances throughout my data center. OpenStack has so many neat little bells and whistles and ... green developers.

My DC is pretty well established, so I need to be careful about introducing new technology. I'm amused at the toy that is DevStack. It seems to be the only streamlined way to install a cloud. The other way is to use pre-established puppet templates that make assumptions about how your hardware is set up. My use-case is covered by neither one, so I get the third option: install everything and write my own puppet templates. This isn't even my real problem with OpenStack. My real problem is that there is no way to accomplish basic tasks that you would expect when migrating your infrastructure, and while the documentation is there and pretty, it's not always clear.

Let's dive into my infrastructure for a moment. I run hosts with link aggregation and vlans. There are no unused ethernet ports on the back of my hosts, thus there is no "management port." The management network is simply another vlan on that same trunk. Between trying to figure out whether to use quantum and some plugin under quantum, or nova-network with one of its plugins (and which version of OpenStack to go along with that decision), I was flailing miserably trying to figure out how OpenStack would let me manage infrastructure the same way: link aggregation and vlans. There is a VLAN Manager plugin and there is an Open vSwitch plugin ... both of which seemed promising. After hosing my entire network with the Open vSwitch plugin (it bridged all of my vlans together by connecting them directly to br-int, something I still don't understand), I knew I had to handle this entire project with kid gloves.

Finally, I found out that Quantum and the LinuxBridge plugin would do what I needed. That was quite a relief, once I finally figured out that the terminology used in the docs is not what a system administrator with a networking bent would expect to see. Ok, time to get dirty! I can bring up VMs, I've got my image service (glance) running on top of my distributed object store (swift), and it's all authenticated via a central/replicated authentication service (keystone with a mysql db.) Wow, I can create a tiny little subnet with 3 IPs for testing. Ok, let's bring up a VM ... and then another. Oops, I need more IPs! No problem, there seems to be a "quantum subnet-update" command! Oh, hmm ... it won't let me update the allocation pool. Alright, let's remove the allocation and replace it with a larger one. No dice, IPs are allocated. How about adding another IP allocation pool to the same subnet definition? Nope, can't have the same subnet CIDR declared for two allocation pools.

This is a *huge* problem if you are migrating hosts from a more than half-filled subnet into OpenStack. I guess it's not really a problem if you just have toy VMs on a toy network but I'd really like to think that something with this much effort put into it could be put into at least a semi-production environment.

A few other *big* problems with OpenStack:

  * logging - errors show up anywhere/everywhere except where you are looking
  * meaningful errors - many errors read like "couldn't connect to host" ... which host?!
  * stable interfaces - configuration files, configuration terms, backend daemons ... all change between releases of OpenStack
  * a decent FAQ with answers - instead, there are many scattered launchpad bugs/discussions and IRC

If, after a year, I can't figure out how to get this thing safely into my infrastructure, I should seriously doubt my abilities. Well, that is, until I realized that I wasn't the only one with serious, fundamental problems fitting the architecture into my current network.

What it seems to come down to is, either I need to significantly change the way I do things or I need to not do OpenStack.

Thursday, December 15, 2011

execve under OSX Snow Leopard


A colleague and I are developing a piece of code (called pbsmake) in python that ends up interpreting something that looks like a Makefile. We will use pbsmake to help distribute jobs to a local scheduler (each makefile target gets its own job), but we found that we may also want to use the pbsmake interpreter as a shell interpreter itself, so we can simply execute certain commands (which are really just makefiles) with a target.

Most of our development has been under GNU/Linux and this works just fine. However, as soon as we do this under OSX Snow Leopard, the top-level makefile starts being executed as if it were a bash script.

How to replicate:

  1. create a simple interpreter that is a shell script itself:
    1. /Users/imoverclocked/
      1. #!/bin/cat
      2. This is my simple interpreter, it simply spits the file onto STDOUT
  2. create a script that uses this interpreter:
    1. /Users/imoverclocked/
      1. #!/Users/imoverclocked/simple/
      2. This is a simple script that is interpreted (really, just spit out by cat)
  3. try and execute the
    1. ./
      1. Badly placed ()'s.
  4. change the interpreter to first invoke /usr/bin/env as a work-around
    1. /Users/imoverclocked/
      1. #!/usr/bin/env /Users/imoverclocked/simple/
      2. This is a simple script that is interpreted (really, just spit out by cat)
  5. executing the script now gives the same output as on other unix-like architectures.
    1. $ ./ 
      This is my simple interpreter, it simply spits the file onto STDOUT
      This is a simple script that is interpreted (really, just spit out by cat)

My guess as to what is happening is that execv* can't handle the case where a script names another script as its interpreter. One work-around is to use /usr/bin/env in the "#!" header, which works since /usr/bin/env is a binary executable that then invokes the script in its own execvp call.
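The replication steps above can be condensed into one self-contained script. This is a sketch with placeholder file names (not the actual paths from the post); on Linux the kernel resolves the nested shebang, so cat ends up printing both files:

```python
# Reproduce the nested-shebang case: a script whose interpreter is itself
# a script. File names here are placeholders, created in a temp directory.
import os
import stat
import subprocess
import tempfile

workdir = tempfile.mkdtemp()

def write_exec(path, text):
    with open(path, 'w') as f:
        f.write(text)
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)

interp = os.path.join(workdir, 'interp.sh')
script = os.path.join(workdir, 'script.sh')

# 1) an "interpreter" that is itself a shell script
write_exec(interp, "#!/bin/cat\nThis is my simple interpreter\n")
# 2) a script that names that shell script as its interpreter
write_exec(script, "#!%s\nThis is a simple script\n" % interp)

# 3) execute it; the kernel hands both paths to /bin/cat, which
#    prints the contents of the interpreter and then the script
out = subprocess.check_output([script])
print(out)
```

On Snow Leopard, step 3 is where execve fell over; on Linux (which allows a few levels of interpreter-script recursion) both files are echoed as expected.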

Apparently this works under Lion but we can't quite make the plunge across our infrastructure yet. Hope this helps someone else out there! Maybe Apple is listening?

Wednesday, September 14, 2011

Getting ethernet bonding to work under Debian

I have a small cluster of different hardware running Debian squeeze/wheezy, and several machines have been really simple to set up with bonding (aka trunking/link aggregation/port trunking/ ... the list goes on) using mode=4, or "IEEE 802.3ad Dynamic link aggregation". One machine, however, refused to bring up a working bond at boot. My final clue to solving the problem was some dmesg output:

[   14.777381] e1000e 0000:06:00.1: eth0: changing MTU from 1500 to 9000
[   15.072476] e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
[   15.128059] e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
[   15.129468] ADDRCONF(NETDEV_UP): eth0: link is not ready
[   15.129473] 8021q: adding VLAN 0 to HW filter on device eth0
[   16.290994] e1000e 0000:06:00.0: eth1: changing MTU from 1500 to 9000
[   16.586584] e1000e 0000:06:00.0: irq 79 for MSI/MSI-X
[   16.640053] e1000e 0000:06:00.0: irq 79 for MSI/MSI-X
[   16.641411] ADDRCONF(NETDEV_UP): eth1: link is not ready
[   20.530343] bonding: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
[   20.530350] bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
[   20.710398] bonding: bond0: Adding slave eth0.
[   20.710415] e1000e 0000:06:00.1: eth0: changing MTU from 9000 to 1500
[   21.006430] e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
[   21.060058] e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
[   21.061374] 8021q: adding VLAN 0 to HW filter on device eth0
[   21.061445] bonding: bond0: Warning: failed to get speed and duplex from eth0, assumed to be 100Mb/sec and Full.
[   21.061462] bonding: bond0: enslaving eth0 as an active interface with an up link.
[   21.242433] bonding: bond0: Adding slave eth1.

If I brought bond0 down/up again later, I would get a working link. I tried adding "sleep" commands into the network init sequence to figure out whether this was just some quiescent state the NIC driver was in during initialization. This didn't help... so I finally read the warning about needing miimon/arp_interval/arp_ip_target for bonding to work. This was odd, because my /etc/network/interfaces file looks like this:

iface bond0 inet manual
        bond-slaves eth0 eth1 eth2 eth3
        bond-mode 4
        bond-miimon 100
        bond-xmit-hash-policy layer3+4
        mtu 9000
        post-up ip link set $IFACE mtu 9000
        post-up sysctl -w net.ipv6.conf.all.autoconf=0
        post-up sysctl -w net.ipv6.conf.default.accept_ra=0

As it turns out, miimon is not yet set at the moment the bonding driver is loaded. To solve this problem I created a new file, /etc/modprobe.d/bonding, with the following content:

alias bond0 bonding
options bonding mode=4 miimon=100

This fixes the issue of bonding not working on boot and should probably be the source of a Debian/Linux bug report.
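A quick way to confirm the fix took is to check /proc/net/bonding/bond0 after boot: it should report the miimon interval and an "up" MII status for the bond. Below is a small sketch of pulling those two fields out; the sample text is illustrative (abbreviated from what the bonding driver actually emits):

```python
# Parse the MII status and polling interval out of bonding proc output.
# The sample string below is an abbreviated, illustrative example.
def bond_status(text):
    """Return the bond's MII status and miimon interval from proc text."""
    status = {}
    for line in text.splitlines():
        key, _, value = line.partition(':')
        key = key.strip()
        if key == 'MII Status' and 'mii_status' not in status:
            # The first "MII Status" line describes the bond itself;
            # later ones describe the individual slaves.
            status['mii_status'] = value.strip()
        elif key == 'MII Polling Interval (ms)':
            status['miimon'] = int(value.strip())
    return status

sample = """\
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
MII Polling Interval (ms): 100
"""
print(bond_status(sample))
```

With the modprobe options in place, miimon should read 100 instead of 0, and the earlier "must be specified" warning disappears from dmesg.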

Thursday, August 11, 2011

Blender 2.5 Compositing from Python

If you are like me, then you hate doing the same thing over and over ... which means you like to automate these kinds of processes when it makes sense to do so. I wrote a Blender plugin that imports a HiRISE DTM directly into Blender 2.5x and lets you easily make fly-throughs of the various Mars locations for which a DTM exists.

My work wants automated fly-throughs with a trivial piece of compositing to place a foreground and a background image into the Blender scene. I came across very few working examples or documentation on how to use the compositing part of Blender from python, so ... here we go! I'll place the final example code here with comments and describe each section a little more in depth below.

import bpy  # run from inside Blender; API calls below are for the 2.5x series

# 1) Use compositing for our render, setup some paths
bpy.context.scene.use_nodes = True
fgImageLoc = "/path/to/foreground.tiff"
bgImageLoc = "/path/to/background.tiff"

# 2) Get references to the scene
Scene = bpy.context.scene
Tree = Scene.node_tree
Tree.links.remove( Tree.links[0] )

# 3) The default env will have an input and an output (Src/Dst)
Src = Tree.nodes["Render Layers"]
Dst = Tree.nodes["Composite"]

# 4) Let's create two groups to encapsulate our work
FG_Node = bpy.data.node_groups.new( "ForegroundImage", type='COMPOSITE' )
BG_Node = bpy.data.node_groups.new( "BackgroundImage", type='COMPOSITE' )

# 5) The foreground group has one input and one output
FG_Node.inputs.new( "Source", 'RGBA' )
FG_Node.outputs.new( "Result", 'RGBA' )

# 6) The foreground node contains an Image and an AlphaOver node
FG_Image = FG_Node.nodes.new( 'IMAGE' )
FG_Image.image = bpy.data.images.load( fgImageLoc )
FG_Alpha = FG_Node.nodes.new( 'ALPHAOVER' )

# 7) The Image and the Group Input are routed to the AlphaOver
#    and the AlphaOver output is routed to the group's output
FG_Node.links.new( FG_Image.outputs["Image"], FG_Alpha.inputs[2] )
FG_Node.links.new( FG_Node.inputs["Source"], FG_Alpha.inputs[1] )
FG_Node.links.new( FG_Node.outputs["Result"], FG_Alpha.outputs["Image"] )

# 8) Add foreground image compositing to the environment
newFGGroup = Tree.nodes.new( "GROUP", group = FG_Node )

# 9) Route the default render output to the input of the FG Group
Tree.links.new( newFGGroup.inputs[0], Src.outputs["Image"] )

# 10) The background group has one input and one output
BG_Node.inputs.new( "Source", 'RGBA' )
BG_Node.outputs.new( "Result", 'RGBA' )

# 11) The background group contains an Image and AlphaOver node
BG_Image = BG_Node.nodes.new( 'IMAGE' )
BG_Image.image = bpy.data.images.load( bgImageLoc )
BG_Alpha = BG_Node.nodes.new( 'ALPHAOVER' )

# 12) Create links to internal nodes
BG_Node.links.new( BG_Image.outputs["Image"], BG_Alpha.inputs[1] )
BG_Node.links.new( BG_Node.inputs["Source"], BG_Alpha.inputs[2] )
BG_Node.links.new( BG_Node.outputs["Result"], BG_Alpha.outputs["Image"] )

# Add background image compositing, similar to 8/9
newBGGroup = Tree.nodes.new( "GROUP", group = BG_Node )
Tree.links.new( newBGGroup.inputs[0], newFGGroup.outputs[0] )
Tree.links.new( newBGGroup.outputs[0], Dst.inputs["Image"] )

When you run this you will end up with a pipeline that looks like this:

The rendered scene outputs to the foreground image group which then outputs to the background image group which then outputs to the specified file path in Blender. Each group is a composite of primitives in blender. When expanded (via selecting the group and pressing Tab) you will see this:

The group input and Image outputs are routed into the AlphaOver node. The AlphaOver output is routed into the group's output. This will overlay the Image node onto the scene. A similar setup is produced for the background image.

Here is a slightly more detailed breakdown of the script:

  1. Tell blender that we have a special compositing setup
    • also, store info about where our foreground/background images are kept
  2. Blender's environment is not empty by default.
    • Get a reference to it
    • Clear the link between the Render Layer and the Composite output
  3. Get a reference to the default nodes for later use
    • Src is the source for rendered content from the scene
    • Dst is the node that takes an output and generates a file (or preview)
  4. To simplify our compositing graph, create two groups to encapsulate each function
  5. The group we just created has one input and one output
  6. Create two nodes
    • Image - acts as an output with a static image
    • AlphaOver - overlays two images using the alpha channel defined in the second image
  7. Create links between nodes.
    • It's easier to see these in the image above.
  8. Instantiate the new group in the compositing environment
    • This is where I was a little lost: the group needs to be instantiated, and a new object is returned. The input/output sockets of the returned object are the external input/output ports of the group. If you use the original group object, you will make bad connections to the internal structures of the group. Don't do it!
  9. Connect the rendered image to the foreground group input.
  10. through 12. are pretty much the same as 5. through 9.
    • Different connections make the image a background image instead of a foreground image

Thanks to Uncle_Entity and Senshi in #blendercoders@freenode for fixing my initially poor usage.

Here are a few links that use the compositing above. Notice that the text hovers above the DTM while the background ... stays in the background:

Saturday, April 30, 2011

IPv6 - how to listen?

In working with IPv6 I've come to realize that there are a lot of funny defaults out there. The most basic of them seems to be in how to bind to a port. Simple? Yeah ... or so I thought.

Under some versions of Linux, if you just listen on, say, [::]:80, then people connecting to your site via IPv4 might be logged with an ::ffff: prefix, which denotes an IPv4-mapped IPv6 address. This means that everyone is allowed to talk IPv4, but your server gets to deal with everything as if it were an IPv6 address... pretty much. DNS resolution gets a little fuzzy here, but we'll sweep that issue under the carpet for the remainder of this post.

Under other versions of Linux, you may not accept IPv4 connections at all by listening on just [::]:80. As it turns out, this is a configurable default, and there are (of course) arguments both ways about which is better. I personally like the one-socket-fits-all approach, but I'm also very pro-IPv6. The magic default under Linux can be found/set via "sysctl net.ipv6.bindv6only".
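A well-behaved server doesn't have to rely on the system-wide sysctl at all: the same knob exists per socket as the IPV6_V6ONLY option. Here's a minimal sketch in Python (the ephemeral port and loopback addresses are just for the demo):

```python
import socket

# A dual-stack listener that doesn't depend on net.ipv6.bindv6only:
# explicitly clear IPV6_V6ONLY so one socket accepts both v4 and v6.
srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(('::', 0))        # port 0: let the kernel pick one for the demo
srv.listen(5)
port = srv.getsockname()[1]

# An IPv4 client can now reach the IPv6 socket; the peer shows up
# as an IPv4-mapped address (the ::ffff: form mentioned above).
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))
conn, peer = srv.accept()
print(peer[0])             # an address of the ::ffff:... form
client.close(); conn.close(); srv.close()
```

Setting the option to 1 instead gives the Solaris-style behavior described below: the socket speaks only IPv6 and an IPv4 listener has to be opened separately.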

Stepping outside of the wonderful world of Linux and into the world of Solaris/OpenSolaris (RIP), you'll find pretty consistent behavior matching what Linux would know as net.ipv6.bindv6only=1. In fact, I never did find a way to change the default under Solaris, and had to revert to using funky configurations that specified two listening sockets: one for IPv4 and another for IPv6. In some cases this was more than a simple annoyance; it was impossible. In the case of Ejabberd, things get ugly. There is no way to specify which behavior you want and, on top of that, connections are managed inside Ejabberd via a hash keyed by port. That means you can't listen twice on the same port! I hacked around this issue in our environment, but I look forward to not needing the hack in the future.

Another place this becomes a problem is in our nginx configurations. On Solaris we have something that looks like:

server {
 listen 80;
 listen [::]:80;
}


When migrating this configuration to Linux, where the default is net.ipv6.bindv6only=0, we simply use:

server {
 listen   [::]:80;
}

This does close to the same thing. Our log files may change a little, since ::ffff: now appears in front of our IPv4 entries, but everything else pretty much stays the same.

Alternatively we can do (for the default server):

server {
 listen   80;
 listen   [::]:80 default ipv6only=on;
}

and then we are back to the kludge of using two different sockets for pretty much the same thing. There are applications where providing a different answer on IPv6 than on IPv4 makes sense but most of the time it doesn't.

What can we do as application developers to do things the right way the first time? That's highly language-dependent. Some high-level languages don't distinguish between IPv4 and IPv6 unless you dig a little and ask specifically. The problem is that they may be compiled with or without IPv6 support (like ruby), and then you may be powerless to use IPv6 at all. In other languages you will need to make small adjustments (eg: C code should use getaddrinfo() instead of gethostbyname()). Google is your friend here, and be sure to check out tools like IPv6 CARE, which can tell you what is wrong as well as dynamically patch a running binary to do the right thing. Pretty slick!
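The getaddrinfo() pattern carries over directly to higher-level languages. A small Python sketch of protocol-agnostic resolution (the name and port are arbitrary examples):

```python
import socket

# Resolve a name without hard-coding the address family: AF_UNSPEC asks
# for every usable family, so the same code works on v4-only, v6-only,
# and dual-stack hosts. A client would simply try each result in order.
for family, socktype, proto, canonname, sockaddr in socket.getaddrinfo(
        'localhost', 80, socket.AF_UNSPEC, socket.SOCK_STREAM):
    print(family, sockaddr)
```

On a dual-stack host this yields both an AF_INET6 and an AF_INET entry, which is exactly the "don't assume a family" behavior we want applications to have.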

Finally, what is the best practice on how to listen to an IPv6 socket? My preference is to listen once and get all traffic on the one socket but there are cases where it is desirable to use two sockets. This means the best practice is to be configurable and capable of doing both. You could make me happy and default to one socket for your application though! It makes IPv6 "just work".

Thanks for reading and Happy Hacking!

Friday, April 29, 2011

NFS over UDP, fast + cheap = not good?

From yesterday's post, I took off and started researching ways that NFS over UDP can go wrong. I am now sufficiently scared away from ever writing/appending files over NFS on UDP. There are references everywhere to potential data corruption, but only a few good sources give anything concrete on the topic. Those references seem a bit outdated, but the cautious sys-admin/engineer side of me is now screaming, "ok, fast + cheap somehow usually means not good."

Most of the cases I came across dealt with writing via UDP and not so much reading via UDP. There were some cache issues mentioned but we have run into those regardless of UDP/TCP so nothing new there. The particular use-case of the previous test does only need to read but in considering our general systems infrastructure we definitely need write functionality so UDP is probably not a good idea anymore.

Now that the NFS path has been travelled, maybe I can find a better way?

Thursday, April 28, 2011

NFS Performance: TCP vs UDP

I have found many places that will state that NFS over TCP is not appreciably slower on Linux unless you measure carefully. In fact, I found so many of them that I took it for granted for a while... until today.

Here is the backstory (circa 2005):

We are an NFS shop scaling to meet the demands of HiRISE image collection/planning/processing and we are having severe problems scaling our NFS servers to handle processing and desktop loads on our network. Turns out the final fix for some issues was to use TCP for our NFS traffic. (BTW: thanks for the pointer from our then-in-testing-vendor, Pillar) Ok, simple fix! Quick tests show that performance is about the same.

Some time after this I was working with our DBA to try and speed up a process that reads a bunch of records from the database and then verifies the first 32KiB of about a half-million files that we have published via the PDS standard. I mention that we have some shiny new SunFire T1000 servers with 24 cores which could speed this effort up via threading. He takes this to heart and threads his code so each thread deals with checking a single file at once. We did get a speedup, but definitely not 24x.

Ok, jump forward to the present day, literally today. I spec'd some hardware and put together a 5-node cluster to virtualize a bunch of operations on. Each host has 32 cores and 128GB RAM and the particular DomU we were testing with has 24 vcpus and plenty of RAM for this task. Our NFS servers are also Sun (Oracle *cough* *cough*) servers with fancy DTrace analytics which can tell you exactly what is going on with your storage. All of this should be very capable and the network in-between is rock solid and has plenty of bandwidth... so why are we only peaking around 35 reads per second? Why is this job still taking half a day to complete?

The network is not heavily loaded, the fileserver is twiddling its thumbs and the host has a load of about 60. I do some quick calculation and figure out that the computation is speeding along and processing a "whopping" 1MB of text every second (sigh.) Ok, let's point a finger at Java. It's certainly not my favorite language and as far as string processing goes, there are much better alternatives IMHO.

I gingerly ask our DBA who wrote the code if I can peruse it to see if I see anything that could be optimized. He obliges and I peruse through the code. Of course, being an abstraction on an abstraction I'm sure there is a lot to be gleaned from digging deeper but nothing pops out at me as needing to take up this entire node to process text at 1MB/s. I mention that it could be the abstractions underneath and our DBA asks why it is faster in a certain case (I haven't verified that but I believe him) and I decide, "ok, let's take Java out of the equation. import python" So here is the script I write to approximate his java class:

#!/usr/bin/env python

import sys
import threading
import Queue

class read32k(threading.Thread):
    def readFileQ(self, fileQ):
        self._fq = fileQ

    def run(self):
        while not self._fq.empty():
            fn = self._fq.get()
            f = open(fn)
            data = f.read(32 * 1024)   # first 32KiB is all we verify
            f.close()

# need to join our threads later on
threads = []

# need a queue of files to look through
fileQ = Queue.Queue()
for arg in sys.argv[1:]:
    fileQ.put(arg)

# initialize/start the threads
for t in range(60):
    readchunk = read32k()
    readchunk.readFileQ( fileQ )
    readchunk.start()
    threads.append( readchunk )

# wait for all threads to return
for t in threads:
    t.join()

# note the number of files processed
print "read chunks of %d files" % (len( sys.argv )-1)

pretty brain-dead simple (and sloppy, sorry...). For the non-python coders who are casually interested: the script starts 60 threads, each reading the first 32KiB chunk from files pulled off a queue populated from the command line. I invoke it as such:

bash# time (find . -name *.IMG | xargs /tmp/

and it takes on the order of the time that the java class takes with NFS over TCP (with much less CPU usage, though...)

Ok. What gives?!? I have seen a Mac OSX host push the fileserver harder from a single-threaded app than this entire host is pushing with a multi-threaded app. I look into NFS settings; jumbo frames are enabled everywhere; nothing pops out at me. Ok, now let's take a step back and look at the problem. No matter how many threads I use, the performance stays the same. What is it that is single-threaded and can't scale to meet these demands? It slowly occurs to me that, while TCP ensures that all packets get from client to server and back, it also ensures in-order delivery of that data. I think a little further and wonder, "What are the chances that the people who posted these comments about NFS over TCP have ever done real parallel tests against a capable NFS server?"

NFS TCP Operations
As it turns out, they probably didn't. Above is a portion of the DTrace output from the fileserver while using TCP for NFS. The python script took 6 minutes to read 17836 files (about 49 files per second.) Changing nothing else, below is a similar screenshot while using UDP instead of TCP.

Yes, that's the whole graph. I had to use my mouse quickly enough to grab the shot before it went off the screen. The same files with a new mount using UDP instead took a total of 16 seconds. We can see that latencies are much lower but we are looking at a speedup of 22.5x. The latency differences alone do not account for this speedup. While I have not dug down deeper, I feel there may be an ability to send multiple out-of-order requests via UDP that can't be currently achieved with TCP.

Ok, so the take-away message is: NFS over TCP can be much slower than NFS over UDP.
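For reference, switching the transport is just a mount option. A sketch of the two mounts being compared (the server name, export path, and mount point here are placeholders, not the actual hosts from the test):

```shell
# Hypothetical mount commands for the comparison above; fileserver,
# /export/data, and /mnt/data are placeholders.
mount -t nfs -o proto=tcp,vers=3 fileserver:/export/data /mnt/data
mount -t nfs -o proto=udp,vers=3 fileserver:/export/data /mnt/data
```

The equivalent options can also go in the fourth field of an /etc/fstab entry, which is what makes "just try UDP on one mount" a cheap experiment.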

Now the real work is in front of me: "Can we re-implement NFS over UDP now that our network infrastructure is solid?" or maybe even, "Does it make sense to deal with other failure modes to gain a 22.5x improvement in speed?"

... only time will tell. Maybe we can come to the answer 22.5x faster by just implementing it across the board ;) ...

Kidding of course! (sorta)