Saturday, April 30, 2011

IPv6 - how to listen?

In working with IPv6 I've come to realize that there are a lot of funny defaults out there. The most basic of them seems to be in how to bind to a port. Simple? Yeah ... or so I thought.

Under some versions of Linux, if you just listen on, say, [::]:80, then people connecting to your site via IPv4 may be logged as ::ffff:192.0.32.10, which is an IPv4-mapped IPv6 address. This means everyone can still talk IPv4, but your server gets to treat every connection as if it were IPv6... pretty much. DNS resolution gets a little fuzzy here, but we'll sweep that issue under the carpet for the remainder of this post.

Under other versions of Linux, listening on just [::]:80 may not accept IPv4 connections at all. As it turns out, this is a configurable default, and there are (of course) arguments both ways about which is better. I personally like the one-socket-fits-all approach, but I'm also very pro-IPv6. The magic default under Linux can be found/set via "sysctl net.ipv6.bindv6only".
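
If you'd rather not depend on the system-wide default, most platforms let you set the equivalent option per socket. Here is a minimal sketch (Python, assuming an interpreter built with IPv6 support and a platform that exposes IPV6_V6ONLY) of binding a single socket that accepts both protocols:

import socket

# one listening socket for both IPv4 and IPv6 clients
s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# override the net.ipv6.bindv6only default for this socket only:
# 0 = also accept IPv4 as ::ffff: mapped addresses, 1 = IPv6 traffic only
s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)

s.bind(('::', 80))      # port 80 needs privileges; use 8080 or similar for testing
s.listen(128)

conn, addr = s.accept()
print "connection from %s" % addr[0]   # e.g. ::ffff:192.0.32.10 for an IPv4 client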

Stepping outside of the wonderful world of Linux and into the world of Solaris/OpenSolaris (RIP), you'll find pretty consistent behavior matching what Linux would call net.ipv6.bindv6only=1. In fact, I never did find a way to change the default under Solaris and had to resort to funky configurations that specify two listening sockets: one for IPv4 and another for IPv6. In some cases this was more than a simple annoyance; it was impossible. In the case of Ejabberd, things get ugly. There is no way to specify which behavior you want and, on top of that, listeners are managed in a hash keyed by port inside Ejabberd. That means you can't listen twice on the same port! I hacked around this issue in our environment, but I look forward to not needing the hack in the future.

Another place this becomes a problem is in our nginx configurations. On Solaris we have something that looks like:


server {
    listen 80;
    listen [::]:80;
    ...
}

but when migrating this configuration to Linux, where the default is net.ipv6.bindv6only=0, we can simply use:


server {
    listen [::]:80;
    ...
}


This does close to the same thing. Our log files may change a little, since ::ffff: now appears in front of our IPv4 entries, but everything else pretty much stays the same.

Alternatively we can do (for the default server):


server {
    listen 80;
    listen [::]:80 default ipv6only=on;
    ...
}


and then we are back to the kludge of using two different sockets for pretty much the same thing. There are applications where providing a different answer over IPv6 than over IPv4 makes sense, but most of the time it doesn't.

What can we do as application developers to do things the right way the first time? That's highly language dependent. Some high-level languages don't distinguish between IPv4 and IPv6 unless you dig a little and ask for it specifically. The problem is that they may be compiled with or without IPv6 support (like Ruby), and then you may be powerless to use IPv6 at all. In other languages you will need to make small adjustments (e.g., in C, use getaddrinfo() instead of the legacy gethostbyname()/gethostbyaddr() calls). Google is your friend here, and be sure to check out tools like IPv6 CARE, which can tell you what is wrong as well as dynamically patch a running binary to do the right thing. Pretty slick!
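
For example, here is a rough sketch of the getaddrinfo() pattern in Python (assuming an interpreter built with IPv6 support; the host name is just a placeholder): resolve the name without caring about the address family, then try each result until one connects.

import socket

def connect(host, port):
    # ask for any address family; getaddrinfo() orders the results for us
    err = None
    for family, socktype, proto, canonname, sockaddr in \
            socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            s = socket.socket(family, socktype, proto)
            s.connect(sockaddr)
            return s                    # first address that works wins
        except socket.error, e:
            err = e
    raise err or socket.error("no usable addresses for %s" % host)

s = connect('www.example.com', 80)
print "connected to %s" % s.getpeername()[0]
s.close()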


Finally, what is the best practice for listening on an IPv6 socket? My preference is to listen once and get all traffic on the one socket, but there are cases where it is desirable to use two sockets. That means the best practice is to be configurable and capable of doing both. You could make me happy and default to one socket in your application, though! It makes IPv6 "just work".
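
As a rough sketch of what "configurable and capable of both" could look like (Python again; the function name and flag below are made up for illustration, not from any particular application):

import socket

def make_listeners(port, separate_sockets=False):
    # separate_sockets=False -> one dual-stack socket (my preference)
    # separate_sockets=True  -> distinct IPv6 and IPv4 sockets
    socks = []

    s6 = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s6.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s6.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY,
                  1 if separate_sockets else 0)
    s6.bind(('::', port))
    s6.listen(128)
    socks.append(s6)

    if separate_sockets:
        s4 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s4.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s4.bind(('0.0.0.0', port))
        s4.listen(128)
        socks.append(s4)

    return socks

Defaulting to the single dual-stack socket gives the "just works" behavior, while flipping the flag gets you the Solaris-style two-socket setup when you need it.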

Thanks for reading and Happy Hacking!

Friday, April 29, 2011

NFS over UDP, fast + cheap = not good?

From yesterday's post I took off and started researching ways that NFS over UDP can go wrong. I am now sufficiently scared away from ever writing/appending files on NFS mounts that use UDP. There are references everywhere to potential data corruption, but only a few good sources give anything concrete on the topic. Those references seem to be a bit outdated, but the cautious sysadmin/engineer side of me is now screaming, "ok, fast + cheap somehow usually means not good."

Most of the cases I came across dealt with writing via UDP and not so much with reading via UDP. There were some cache issues mentioned, but we have run into those regardless of UDP/TCP, so nothing new there. The particular use case in the previous test only needs to read, but considering our general systems infrastructure we definitely need write functionality, so UDP is probably not a good idea anymore.

Now that the NFS path has been travelled, maybe I can find a better way?

Thursday, April 28, 2011

NFS Performance: TCP vs UDP

I have found many places that will state that NFS over TCP is not appreciably slower on Linux unless you measure carefully. In fact, I found so many of them that I took it for granted for a while... until today.

Here is the backstory (circa 2005):

We are an NFS shop scaling to meet the demands of HiRISE image collection/planning/processing, and we are having severe problems scaling our NFS servers to handle processing and desktop loads on our network. It turns out the final fix for some issues was to use TCP for our NFS traffic. (BTW: thanks for the pointer from our then-in-testing vendor, Pillar.) Ok, simple fix! Quick tests show that performance is about the same.

Some time after this I was working with our DBA to try to speed up a process that reads a bunch of records from the database and then verifies the first 32KiB of about half a million files that we have published via the PDS standard. I mention that we have some shiny new SunFire T1000 servers with 24 cores which could speed this effort up via threading. He takes this to heart and threads his code so that each thread checks a single file at a time. We did get a speedup, but definitely not 24x.


Ok, jump forward to the present day, literally today. I spec'd some hardware and put together a 5-node cluster to virtualize a bunch of operations on. Each host has 32 cores and 128GB RAM, and the particular DomU we were testing with has 24 vcpus and plenty of RAM for this task. Our NFS servers are also Sun (Oracle *cough* *cough*) servers with fancy DTrace analytics which can tell you exactly what is going on with your storage. All of this should be very capable, and the network in between is rock solid with plenty of bandwidth... so why are we only peaking at around 35 reads per second? Why is this job still taking half a day to complete?

The network is not heavily loaded, the fileserver is twiddling its thumbs, and the host has a load of about 60. I do some quick math: 35 reads per second at 32KiB each means the computation is speeding along, processing a "whopping" 1MB of text every second (sigh). Ok, let's point a finger at Java. It's certainly not my favorite language, and as far as string processing goes, there are much better alternatives IMHO.

I gingerly ask our DBA who wrote the code if I can peruse it to see if anything could be optimized. He obliges and I read through the code. Of course, being an abstraction on top of an abstraction, I'm sure there is a lot to be gleaned from digging deeper, but nothing pops out at me as needing to take up this entire node to process text at 1MB/s. I mention that it could be the abstractions underneath; our DBA asks why it is faster in a certain case (I haven't verified that, but I believe him), and I decide, "ok, let's take Java out of the equation. import python". So here is the script I wrote to approximate his Java class:


#!/usr/bin/env python

import sys
import threading
import Queue


class read32k(threading.Thread):
    def readFileQ(self, fileQ):
        # hand this thread the shared queue of filenames
        self._fq = fileQ

    def run(self):
        # pull filenames off the queue until it is empty
        while True:
            try:
                fn = self._fq.get_nowait()
            except Queue.Empty:
                break
            # read (and discard) the first 32KiB of the file
            f = open(fn)
            data = f.read(32768)
            f.close()


# need to join our threads later on
threads = []

# need a queue of files to look through
fileQ = Queue.Queue()
for arg in sys.argv[1:]:
    fileQ.put(arg)

# initialize/start the threads
for t in range(60):
    readchunk = read32k()
    readchunk.readFileQ(fileQ)
    readchunk.start()
    threads.append(readchunk)

# wait for all threads to return
for t in threads:
    t.join()

# note the number of files processed
print "read chunks of %d files" % (len(sys.argv) - 1)



Pretty brain-dead simple (and sloppy, sorry...). For the non-Python coders who are casually interested: the script starts 60 threads, each of which pulls filenames from a queue (populated from the command line) and reads the first 32KiB of each file. I invoke it as such:

bash# time (find . -name '*.IMG' | xargs /tmp/test.py)

and it takes on the order of the time that the Java class takes with NFS over TCP (with much less CPU usage, though...)

Ok. What gives?!? I have seen a Mac OS X host push the fileserver harder from a single-threaded app than this entire host is pushing with a multi-threaded one. I look into the NFS settings: jumbo frames are enabled everywhere, and nothing pops out at me. Ok, now let's take a step back and look at the problem. No matter how many threads I use, the performance stays the same. What is it that is single-threaded and can't scale to meet these demands? It slowly occurs to me that, while TCP ensures that all packets get from client to server and back, it also ensures in-order delivery of that data. I think a little further and wonder, "What are the chances that the people who posted these comments about NFS over TCP have ever done real parallel tests against a capable NFS server?"

NFS TCP Operations
As it turns out, they probably didn't. Above is a portion of the DTrace output from the fileserver while using TCP for NFS. The Python script took 6 minutes to read chunks of 17,836 files (about 49 files per second). Changing nothing else, below is a similar screenshot while using UDP instead of TCP.


Yes, that's the whole graph. I had to use my mouse quickly enough to grab the shot before it went off the screen. The same files, read over a new mount using UDP, instead took a total of 16 seconds. We can see that latencies are much lower, but we are looking at a speedup of 22.5x (360 seconds down to 16), and the latency differences alone do not account for that. While I have not dug down deeper, I suspect the client can keep multiple out-of-order requests in flight over UDP in a way it currently can't over TCP.

Ok, so the take-away message is: NFS over TCP can be much slower than NFS over UDP.

Now the real work is in front of me: "Can we re-implement NFS over UDP now that our network infrastructure is solid?" or maybe even, "Does it make sense to deal with other failure modes to gain a 22.5x improvement in speed?"

... only time will tell. Maybe we can come to the answer 22.5x faster by just implementing it across the board ;) ...

Kidding of course! (sorta)