Thursday, December 15, 2011

execve under OSX Snow Leopard

Background:

A colleague and I are developing a piece of code (called pbsmake) in Python that ends up interpreting something that looks like a Makefile. We will use pbsmake to help distribute jobs to a local scheduler (each makefile target gets its own job), but we found that we may also want to use the pbsmake interpreter as a shell interpreter itself, so that we can directly execute certain commands (which are really just makefiles) with a given target.

Most of our development has been under GNU/Linux and this works just fine. However, as soon as we do this under OSX Snow Leopard, the top-level makefile starts being executed as if it were a bash script.

How to replicate:


  1. create a simple interpreter that is a shell script itself:
    1. /Users/imoverclocked/simple-interp.sh:
      1. #!/bin/cat
      2. This is my simple interpreter, it simply spits the file onto STDOUT
  2. create a script that uses this interpreter:
    1. /Users/imoverclocked/simple-script.sh:
      1. #!/Users/imoverclocked/simple-interp.sh
      2. This is a simple script that is interpreted (really, just spit out by cat)
  3. try and execute the simple-script.sh
    1. ./simple-script.sh
      1. Badly placed ()'s.
  4. change the script's interpreter line to first invoke /usr/bin/env as a work-around
    1. /Users/imoverclocked/simple-script.sh:
      1. #!/usr/bin/env /Users/imoverclocked/simple-interp.sh
      2. This is a simple script that is interpreted (really, just spit out by cat)
  5. executing the script now gives the same output as on other unix-like systems.
    1. $ ./simple-script.sh 
      #!/bin/cat
      This is my simple interpreter, it simply spits the file onto STDOUT
      #!/usr/bin/env /Users/imoverclocked/simple-interp.sh
      This is a simple script that is interpreted (really, just spit out by cat)

My guess as to what is happening is that execv* on Snow Leopard can't handle the case where a script's interpreter is itself a script. One work-around is to use /usr/bin/env in the "simple-script.sh" header; that works because /usr/bin/env is a binary executable, which then invokes the script interpreter in its own execvp call.
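To make the work-around concrete: /usr/bin/env is a compiled binary, so the kernel is happy to run it from the shebang line, and env in turn exec()s the interpreter named after it, which turns the nested-script case into a plain one-level shebang. Here is a rough Python sketch of that behavior (purely illustrative; the name env-sketch.py is made up and this is not how env is actually implemented):

#!/usr/bin/env python
# Rough sketch of what /usr/bin/env does in this work-around: exec the
# program named in argv (searching PATH) and pass the remaining arguments
# (the script being interpreted) straight through.
import os
import sys

if len(sys.argv) < 2:
    sys.exit("usage: env-sketch.py interpreter [script ...]")

# execvp() replaces this process; the interpreter's own shebang line is
# then handled by the kernel as a fresh, one-level-deep case.
os.execvp(sys.argv[1], sys.argv[1:])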

Apparently this works under Lion but we can't quite make the plunge across our infrastructure yet. Hope this helps someone else out there! Maybe Apple is listening?

Wednesday, September 14, 2011

Getting ethernet bonding to work under Debian


I have a small cluster of different hardware running Debian squeeze/wheezy, and several machines have been really simple to set up with bonding (aka trunking / link aggregation / port trunking / ... the list goes on) using mode=4, "IEEE 802.3ad Dynamic link aggregation". One machine, however, would not bring its bond up properly at boot. My final clue to solving the problem was some dmesg output:

[   14.777381] e1000e 0000:06:00.1: eth0: changing MTU from 1500 to 9000
[   15.072476] e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
[   15.128059] e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
[   15.129468] ADDRCONF(NETDEV_UP): eth0: link is not ready
[   15.129473] 8021q: adding VLAN 0 to HW filter on device eth0
[   16.290994] e1000e 0000:06:00.0: eth1: changing MTU from 1500 to 9000
[   16.586584] e1000e 0000:06:00.0: irq 79 for MSI/MSI-X
[   16.640053] e1000e 0000:06:00.0: irq 79 for MSI/MSI-X
[   16.641411] ADDRCONF(NETDEV_UP): eth1: link is not ready
...
[   20.530343] bonding: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
[   20.530350] bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
[   20.710398] bonding: bond0: Adding slave eth0.
[   20.710415] e1000e 0000:06:00.1: eth0: changing MTU from 9000 to 1500
[   21.006430] e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
[   21.060058] e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
[   21.061374] 8021q: adding VLAN 0 to HW filter on device eth0
[   21.061445] bonding: bond0: Warning: failed to get speed and duplex from eth0, assumed to be 100Mb/sec and Full.
[   21.061462] bonding: bond0: enslaving eth0 as an active interface with an up link.
[   21.242433] bonding: bond0: Adding slave eth1.



If I brought bond0 down and up again later, I would get a working link. I tried adding "sleep" commands into the network init sequence to figure out whether this was just some quiescent state the NIC driver was in during initialization. That didn't help ... so I finally read the warning about needing miimon/arp_interval/arp_ip_target for bonding to detect link failures. This was odd, because my /etc/network/interfaces file looks like this:

iface bond0 inet manual
        bond-slaves eth0 eth1 eth2 eth3
        bond-mode 4
        bond-miimon 100
        bond-xmit-hash-policy layer3+4
        mtu 9000
        dns-search lpl.arizona.edu
        post-up ip link set $IFACE mtu 9000
        post-up sysctl -w net.ipv6.conf.all.autoconf=0
        post-up sysctl -w net.ipv6.conf.default.accept_ra=0


As it turns out, miimon is not actually set at the time the bonding driver is loaded, despite the bond-miimon line above. To solve this problem I created a new file, /etc/modprobe.d/bonding, with the following content:

alias bond0 bonding
options bonding mode=4 miimon=100

This fixes the issue of bonding not working on boot and should probably be the source of a Debian/Linux bug report.
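For a quick sanity check after a reboot, the bonding driver exposes its live settings through sysfs; a small sketch in Python (assuming the usual bond0 name and the standard bonding sysfs attributes) confirms that the module options actually took:

#!/usr/bin/env python
# Read the bonding driver's live settings from sysfs to confirm that
# mode=4 (802.3ad) and miimon=100 were applied at module load time.
base = "/sys/class/net/bond0/bonding"

for attr in ("mode", "miimon", "slaves"):
    with open("%s/%s" % (base, attr)) as f:
        print "%s: %s" % (attr, f.read().strip())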

Thursday, August 11, 2011

Blender 2.5 Compositing from Python

If you are like me then you hate doing the same thing over and over ... which means you like to automate these kinds of processes when it makes sense to do so. I wrote a Blender plugin that imports a HiRISE DTM directly into Blender 2.5x and allows you to easily make fly-throughs of the various Mars locations that a DTM exists for.

At work we want to make automated fly-throughs with a trivial piece of compositing that places a foreground and a background image into the Blender scene. I came across very few working examples or documentation on how to drive the compositing part of Blender from Python, so ... here we go! I'll simply place the final example code here with comments and describe each section in a little more depth below.



# 1) Use compositing for our render, set up some paths
import bpy

bpy.context.scene.use_nodes = True

fgImageLoc = "/path/to/foreground.tiff"

bgImageLoc = "/path/to/background.tiff"


# 2) Get references to the scene
Scene = bpy.context.scene
Tree = Scene.node_tree
Tree.links.remove( Tree.links[0] )


# 3) The default env will have an input and an output (Src/Dst)
Src = Tree.nodes["Render Layers"]
Dst = Tree.nodes["Composite"]


# 4) Let's create two groups to encapsulate our work
FG_Node = bpy.data.node_groups.new(
    "ForegroundImage", type='COMPOSITE')
BG_Node = bpy.data.node_groups.new(
    "BackgroundImage", type='COMPOSITE')


# 5) The foreground group has one input and one output
FG_Node.inputs.new("Source", 'RGBA')
FG_Node.outputs.new("Result", 'RGBA')


# 6) The foreground node contains an Image and an AlphaOver node
FG_Image = FG_Node.nodes.new('IMAGE')
FG_Image.image = bpy.data.images.load( fgImageLoc )
FG_Alpha = FG_Node.nodes.new('ALPHAOVER')


# 7) The Image and the Group Input are routed to the AlphaOver
#    and the AlphaOver output is routed to the group's output
FG_Node.links.new(FG_Image.outputs["Image"], FG_Alpha.inputs[2])
FG_Node.links.new(FG_Node.inputs["Source"], FG_Alpha.inputs[1])
FG_Node.links.new(FG_Node.outputs["Result"], FG_Alpha.outputs["Image"])


# 8) Add foreground image compositing to the environment
newFGGroup = Tree.nodes.new("GROUP", group = FG_Node)


# 9) Route the default render output to the input of the FG Group
Tree.links.new(newFGGroup.inputs[0], Src.outputs["Image"])


# 10) The background group has one input and one output
BG_Node.inputs.new("Source", 'RGBA')
BG_Node.outputs.new("Result", 'RGBA')


# 11) The background group contains an Image and AlphaOver node
BG_Image = BG_Node.nodes.new('IMAGE')
BG_Image.image = bpy.data.images.load( bgImageLoc )
BG_Alpha = BG_Node.nodes.new('ALPHAOVER')


# 12) Create links to internal nodes
BG_Node.links.new(BG_Image.outputs["Image"], BG_Alpha.inputs[1])
BG_Node.links.new(BG_Node.inputs["Source"], BG_Alpha.inputs[2])
BG_Node.links.new(BG_Node.outputs["Result"], BG_Alpha.outputs["Image"])


# Add background image compositing, similar to 8/9
newBGGroup = Tree.nodes.new("GROUP", group = BG_Node)
Tree.links.new(newBGGroup.inputs[0], newFGGroup.outputs[0])
Tree.links.new(newBGGroup.outputs[0], Dst.inputs["Image"])


When you run this you will end up with a pipeline that looks like this:


The rendered scene feeds into the foreground image group, which feeds into the background image group, which in turn feeds the Composite output (the file path or preview configured in Blender). Each group is built from primitive nodes. When expanded (by selecting the group and pressing Tab) you will see this:


The group input and the Image node's output are routed into the Alpha Over node. The Alpha Over output is routed to the group's output. This overlays the Image node's contents onto the scene. A similar setup is produced for the background image.

Here is a slightly more detailed breakdown of the script:

  1. Tell blender that we have a special compositing setup
    • also, store info about where our foreground/background images are kept
  2. Blender's environment is not empty by default.
    • Get a reference to it
    • Clear the link between the Render Layer and the Composite output
  3. Get a reference to the default nodes for later use
    • Src is the source for rendered content from the scene
    • Dst is the node that takes an output and generates a file (or preview)
  4. To simplify our compositing graph, create two groups to encapsulate each function
  5. The group we just created has one input and one output
  6. Create two nodes
    • Image - acts as an output with a static image
    • AlphaOver - overlays two images using the alpha channel defined in the second image
  7. Create links between nodes.
    • It's easier to see these in the image above.
  8. Instantiate the new group in the compositing environment
    • This is where I was a little lost: the group needs to be instantiated, and a new object is returned. The inputs/outputs of that returned object are the external input/output ports of the group. If you use the previous object you will make bad connections to the internal structures of the group. Don't do it!
  9. Connect the rendered image to the foreground group input.
  10. through 12. are pretty much the same as 5. through 7., with the final unnumbered block mirroring 8. and 9.
    • Different connections make the image a background image instead of a foreground image
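If you want to kick off a render from the same script (say, for batch fly-throughs run via blender -b yourfile.blend -P yourscript.py), something along these lines should work; the output path is only an example and this assumes the 2.5x render settings API:

# Render through the compositing setup we just built.
# The output path below is only an example.
Scene.render.filepath = "/tmp/flythrough_"
bpy.ops.render.render(animation=True)  # or render(write_still=True) for one frame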

Thanks to Uncle_Entity and Senshi in #blendercoders@freenode for fixing my initially poor usage.

Here are a few links that use the compositing above. Notice that the text hovers above the DTM while the background ... stays in the background:

Saturday, April 30, 2011

IPv6 - how to listen?

In working with IPv6 I've come to realize that there are a lot of funny defaults out there. The most basic of them seems to be in how to bind to a port. Simple? Yeah ... or so I thought.

Under some versions of Linux, if you just listen on, say, [::]:80, then people connecting to your site via IPv4 might be logged as ::ffff:192.0.32.10, which is an IPv4-mapped IPv6 address. This means everyone is allowed to talk IPv4, but your server gets to treat everything as if it were an IPv6 address ... pretty much. DNS resolution gets a little fuzzy here, but we'll sweep that issue under the carpet for the remainder of this post.

Under other versions of Linux, you may not accept IPv4 connections at all by listening on just [::]:80. As it turns out, this is a configurable default and there are (of course) arguments both ways about which is better. I personally like the one socket fits all approach but I'm also very pro-IPv6. The magic default under Linux can be found/set via "sysctl net.ipv6.bindv6only".
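Per socket, the same knob is the IPV6_V6ONLY option, which overrides the system-wide default in either direction. A minimal sketch in Python (the port is just for illustration):

import socket

# One IPv6 listening socket that also accepts IPv4 clients (they show up
# as ::ffff:a.b.c.d mapped addresses), regardless of net.ipv6.bindv6only.
s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)  # 1 = IPv6-only
s.bind(("::", 80))
s.listen(5)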

Stepping outside of the wonderful world of Linux and into the world of Solaris/OpenSolaris (RIP), you'll find a pretty consistent behavior matching what Linux would know as net.ipv6.bindv6only=1. In fact, I never did find a way to change the default under Solaris and had to revert to funky configurations that specify two listening sockets: one for IPv4 and another for IPv6. In some cases this was more than a simple annoyance; it was impossible. In the case of Ejabberd, things get ugly. There is no way to specify which behavior you want and, on top of that, connections are managed via a hash keyed by port inside Ejabberd. That means you can't listen twice on the same port! I hacked around this issue in our environment, but I look forward to not needing the hack in the future.

Another place this becomes a problem is in our nginx configurations. On Solaris we have something that looks like:


server {
    listen 80;
    listen [::]:80;
    ...
}

but when migrating this configuration to Linux, where the default is net.ipv6.bindv6only=0, we simply use:


server {
    listen [::]:80;
    ...
}


This does close to the same thing. Our log files may change a little, since ::ffff: now appears in front of our IPv4 entries, but everything else pretty much stays the same.

Alternatively we can do (for the default server):


server {
    listen 80;
    listen [::]:80 default ipv6only=on;
    ...
}


and then we are back to the kludge of using two different sockets for pretty much the same thing. There are applications where providing a different answer on IPv6 than on IPv4 makes sense but most of the time it doesn't.

What can we do as application developers to do things the right way the first time? That's highly language dependent. Some high-level languages don't distinguish between IPv4 and IPv6 unless you dig a little and ask specifically for it. The problem is that they may be compiled with or without IPv6 support (like Ruby), and then you may be powerless to use IPv6 at all. In other languages you will need to make small adjustments (e.g. C code needs to use getaddrinfo() instead of gethostbyname()/gethostbyaddr()). Google is your friend here, and be sure to check out tools like IPv6 CARE, which can tell you what is wrong as well as dynamically patch a running binary to do the right thing. Pretty slick!
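The same idea in Python looks something like this (a sketch; the hostname and port are only examples): let getaddrinfo() hand back whatever address families the system resolves instead of hard-coding AF_INET anywhere.

import socket

# Address-family-agnostic lookup: each result may be IPv4 or IPv6.
for family, socktype, proto, canonname, sockaddr in socket.getaddrinfo(
        "www.example.com", 80, socket.AF_UNSPEC, socket.SOCK_STREAM):
    print "family=%d address=%s" % (family, sockaddr[0])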


Finally, what is the best practice for listening on an IPv6 socket? My preference is to listen once and get all traffic on the one socket, but there are cases where it is desirable to use two sockets. This means the best practice is to be configurable and capable of doing both. You could make me happy and default to one socket for your application, though! It makes IPv6 "just work".

Thanks for reading and Happy Hacking!

Friday, April 29, 2011

NFS over UDP, fast + cheap = not good?

From yesterday's post I took off and started researching ways that NFS over UDP can go wrong. I am now sufficiently scared away from ever writing or appending to files over an NFS mount that uses UDP. There are references everywhere to potential data corruption, but only a few good sources give anything concrete on the topic. Those references seem to be a bit outdated, but the cautious sysadmin/engineer side of me is now screaming, "ok, fast + cheap somehow usually means not good."

Most of the cases I came across dealt with writing via UDP and not so much reading via UDP. There were some cache issues mentioned but we have run into those regardless of UDP/TCP so nothing new there. The particular use-case of the previous test does only need to read but in considering our general systems infrastructure we definitely need write functionality so UDP is probably not a good idea anymore.

Now that the NFS path has been travelled, maybe I can find a better way?

Thursday, April 28, 2011

NFS Performance: TCP vs UDP

I have found many places that will state that NFS over TCP is not appreciably slower on Linux unless you measure carefully. In fact, I found so many of them that I took it for granted for a while... until today.

Here is the backstory (circa 2005):

We are an NFS shop scaling to meet the demands of HiRISE image collection/planning/processing and we are having severe problems scaling our NFS servers to handle processing and desktop loads on our network. Turns out the final fix for some issues was to use TCP for our NFS traffic. (BTW: thanks for the pointer from our then-in-testing-vendor, Pillar) Ok, simple fix! Quick tests show that performance is about the same.

Some time after this I was working with our DBA to try and speed up a process that reads a bunch of records from the database and then verifies the first 32KiB of about a half-million files that we have published via the PDS standard. I mention that we have some shiny new SunFire T1000 servers with 24 cores which could speed this effort up via threading. He takes this to heart and threads his code so each thread deals with checking a single file at once. We did get a speedup, but definitely not 24x.


Ok, jump forward to the present day, literally today. I spec'd some hardware and put together a 5-node cluster to virtualize a bunch of operations on. Each host has 32 cores and 128GB RAM, and the particular DomU we were testing with has 24 vcpus and plenty of RAM for this task. Our NFS servers are also Sun (Oracle *cough* *cough*) servers with fancy DTrace analytics which can tell you exactly what is going on with your storage. All of this should be very capable, and the network in between is rock solid with plenty of bandwidth... so why are we peaking at only about 35 reads per second? Why is this job still taking half a day to complete?

The network is not heavily loaded, the fileserver is twiddling its thumbs and the host has a load of about 60. I do some quick calculation and figure out that the computation is speeding along and processing a "whopping" 1MB of text every second (sigh.) Ok, let's point a finger at Java. It's certainly not my favorite language and as far as string processing goes, there are much better alternatives IMHO.

I gingerly ask our DBA who wrote the code if I can peruse it to see if I see anything that could be optimized. He obliges and I peruse through the code. Of course, being an abstraction on an abstraction I'm sure there is a lot to be gleaned from digging deeper but nothing pops out at me as needing to take up this entire node to process text at 1MB/s. I mention that it could be the abstractions underneath and our DBA asks why it is faster in a certain case (I haven't verified that but I believe him) and I decide, "ok, let's take Java out of the equation. import python" So here is the script I write to approximate his java class:


#!/usr/bin/env python


import sys
import threading
import Queue


class read32k(threading.Thread):
    def readFileQ(self, fileQ):
        self._fq = fileQ

    def run(self):
        # pull filenames off the queue until it is drained; get_nowait()
        # avoids blocking forever if another thread grabs the last entry
        # between an empty() check and a get()
        while True:
            try:
                fn = self._fq.get_nowait()
            except Queue.Empty:
                break
            f = open(fn)
            data = f.read(32768)
            f.close()


# need to join our threads later on
threads = []

# need a queue of files to look through
fileQ = Queue.Queue()
for arg in sys.argv[1:]:
    fileQ.put(arg)

# initialize/start the threads
for t in range(60):
    readchunk = read32k()
    readchunk.readFileQ( fileQ )
    readchunk.start()
    threads.append( readchunk )

# wait for all threads to return
for t in threads:
    t.join()

# note the number of files processed
print "read chunks of %d files" % (len( sys.argv )-1)



Pretty brain-dead simple (and sloppy, sorry...). For the non-Python coders who are casually interested: the script starts 60 threads, each of which reads a 32KiB chunk from the files in a queue built from the command-line arguments. I invoke it as such:

bash# time (find . -name '*.IMG' | xargs /tmp/test.py)

and it takes on the order of the time that the Java class takes with NFS over TCP (with much less CPU usage, though...)

Ok. What gives?!? I have seen a Mac OSX host push the fileserver harder with a single-threaded app than this entire host is pushing with a multi-threaded one. I look into NFS settings; jumbo frames are enabled everywhere; nothing pops out at me. Ok, now let's take a step back and look at the problem. No matter how many threads I use, the performance stays the same. What is it that is single-threaded and can't scale to meet these demands? It slowly occurs to me that, while TCP ensures that all packets get from client to server and back, it also ensures in-order delivery of that data. I think a little further and wonder, "What are the chances that the people who posted these comments about NFS over TCP ever did real parallel tests against a capable NFS server?"

NFS TCP Operations
As it turns out, they probably didn't. Above is a portion of the DTrace output from the fileserver while using TCP for NFS. The Python script took 6 minutes to read from 17836 files (about 49 files per second). Changing nothing else, below is a similar screenshot while using UDP instead of TCP.


Yes, that's the whole graph. I had to move my mouse quickly enough to grab the shot before it went off the screen. The same files, read over a new mount using UDP, instead took a total of 16 seconds. We can see that latencies are much lower, but we are looking at a speedup of 22.5x (about 360 seconds down to 16), and the latency differences alone do not account for that. While I have not dug deeper, I suspect the client can keep more out-of-order requests in flight over UDP than it currently can over a single TCP stream.

Ok, so the take-away message is: NFS over TCP can be much slower than NFS over UDP.
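As an aside, if you want to check which transport your existing mounts are actually using, the proto= option shows up in /proc/mounts; a quick sketch:

# List NFS mounts and their mount options (look for proto=tcp or proto=udp).
with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if fstype.startswith("nfs"):
            print "%s on %s: %s" % (device, mountpoint, options)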

Now the real work is in front of me: "Can we re-implement NFS over UDP now that our network infrastructure is solid?" or maybe even, "Does it make sense to deal with other failure modes to gain a 22.5x improvement in speed?"

... only time will tell. Maybe we can come to the answer 22.5x faster by just implementing it across the board ;) ...

Kidding of course! (sorta)