I have found many places that will state that NFS over TCP is not appreciably slower on Linux unless you measure carefully. In fact, I found so many of them that I took it for granted for a while... until today.
Here is the backstory (circa 2005):
We are an NFS shop scaling to meet the demands of HiRISE image collection/planning/processing and we are having severe problems scaling our NFS servers to handle processing and desktop loads on our network. Turns out the final fix for some issues was to use TCP for our NFS traffic. (BTW: thanks for the pointer from our then-in-testing-vendor, Pillar) Ok, simple fix! Quick tests show that performance is about the same.
Some time after this I was working with our DBA to try and speed up a process that reads a bunch of records from the database and then verifies the first 32KiB of about a half-million files that we have published via the PDS standard. I mention that we have some shiny new SunFire T1000 servers with 24 cores which could speed this effort up via threading. He takes this to heart and threads his code so each thread deals with checking a single file at once. We did get a speedup, but definitely not 24x.
Ok, jump forward to the present day, literally today. I spec'd some hardware and put together a 5-node cluster to virtualize a bunch of operations on. Each host has 32 cores and 128GB RAM and the particular DomU we were testing with has 24 vcpus and plenty of RAM for this task. Our NFS servers are also Sun (Oracle *cough* *cough*) servers with fancy DTrace analytics which can tell you exactly what is going on with your storage. All of this should be very capable and the network in-between is rock solid and has plenty of bandwidth... so why are we only peaking around 35 reads per second? Why is this job still taking half a day to complete?
The network is not heavily loaded, the fileserver is twiddling its thumbs and the host has a load of about 60. I do a quick calculation and figure out that the computation is speeding along and processing a "whopping" 1MB of text every second (sigh). Ok, let's point a finger at Java. It's certainly not my favorite language and as far as string processing goes, there are much better alternatives IMHO.
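For the curious, here is that back-of-the-envelope math as a small sketch. The inputs are the numbers quoted above (35 reads per second at peak, 32KiB per file, roughly a half-million files); the variable names are mine, and the time estimate assumes the peak rate is sustained for the whole run, which it probably is not.

```python
from __future__ import division  # true division on Python 2 as well

KIB = 1024
reads_per_second = 35      # peak read rate quoted above
chunk_bytes = 32 * KIB     # only the first 32KiB of each file is read
total_files = 500000       # "about a half-million files" (approximate)

# Data rate implied by 35 reads of 32KiB each per second
throughput_mib = reads_per_second * chunk_bytes / (KIB * KIB)

# Best-case wall time if the peak rate held for every file
hours = total_files / reads_per_second / 3600

print("throughput: %.2f MiB/s" % throughput_mib)
print("best-case run time: %.1f hours" % hours)
```

That works out to a bit over 1 MiB/s and roughly four hours even in the best case, so a real run that spends much of its time below the peak rate stretching to half a day is not surprising.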
I gingerly ask our DBA, who wrote the code, if I can look through it for anything that could be optimized. He obliges and I read through the code. Of course, being an abstraction on an abstraction, I'm sure there is a lot to be gleaned from digging deeper, but nothing pops out at me as needing to take up this entire node to process text at 1MB/s. I mention that it could be the abstractions underneath and our DBA asks why it is faster in a certain case (I haven't verified that but I believe him) and I decide, "ok, let's take Java out of the equation. import python" So here is the script I write to approximate his Java class:
#!/usr/bin/env python
import sys
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2


class read32k(threading.Thread):
    def readFileQ(self, fileQ):
        self._fq = fileQ

    def run(self):
        while True:
            # get_nowait() avoids the race between empty() and a blocking
            # get() when another thread takes the last file first
            try:
                fn = self._fq.get_nowait()
            except queue.Empty:
                break
            # binary mode: only the raw first 32KiB matters
            f = open(fn, "rb")
            data = f.read(32768)
            f.close()


# need to join our threads later on
threads = []

# need a queue of files to look through
fileQ = queue.Queue()
for arg in sys.argv[1:]:
    fileQ.put(arg)

# initialize/start the threads
for t in range(60):
    readchunk = read32k()
    readchunk.readFileQ(fileQ)
    readchunk.start()
    threads.append(readchunk)

# wait for all threads to return
for t in threads:
    t.join()

# note the number of files processed
print("read chunks of %d files" % (len(sys.argv) - 1))
Pretty brain-dead simple (and sloppy, sorry...). For the casually interested who don't code in Python: the script starts 60 threads, and each thread pulls file names off a queue filled from the command line and reads the first 32KiB of each file. I invoke it as such:
bash# time (find . -name '*.IMG' | xargs /tmp/test.py)
and it takes about as long as the Java class does with NFS over TCP (with much less CPU usage, though...)
Ok. What gives?!? I have seen a Mac OS X host push the fileserver harder from a single-threaded app than this entire host is pushing with a multi-threaded app. I look into NFS settings: jumbo frames are enabled everywhere and nothing is popping out at me. Ok, now let's take a step back and look at the problem. No matter how many threads I use, the performance stays the same. What is it that is single-threaded and can't scale to meet these demands? It slowly occurs to me that, while TCP ensures that all packets get from client->server and back, it also ensures in-order delivery of that data. I think a little further and wonder, "What are the chances that people who posted these comments about NFS over TCP have ever done real parallel tests against a capable NFS server?"
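If you want to check the "more threads doesn't help" observation on your own mounts, a thread-count sweep along these lines is one way to do it. This is a minimal sketch, not the script I actually ran: the names `read_prefix`, `files_per_second` and `sweep` and the thread counts are made up for illustration, and it needs Python 3 for `concurrent.futures`. Keep in mind that the client's page cache will flatter any repeated run over the same files, so use a fresh mount or a different set of files for each measurement.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK = 32 * 1024  # same 32KiB prefix the test script reads


def read_prefix(path):
    """Read the first CHUNK bytes of one file; return the byte count read."""
    with open(path, "rb") as f:
        return len(f.read(CHUNK))


def files_per_second(paths, workers):
    """Read every file's prefix with `workers` threads; return files/second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces all reads to finish before the clock stops
        list(pool.map(read_prefix, paths))
    elapsed = time.perf_counter() - start
    return len(paths) / elapsed if elapsed > 0 else float("inf")


def sweep(paths, worker_counts=(1, 4, 16, 60)):
    """Return {thread count: files per second} for each thread count."""
    return {n: files_per_second(paths, n) for n in worker_counts}
```

Calling `sweep(sys.argv[1:])` from a small wrapper and printing the result gives one rate per thread count. If the rate barely moves between 1 and 60 threads, something below the application is serializing the requests.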
[Figure: NFS TCP Operations]
As it turns out, they probably didn't. Above is a portion of the DTrace output from the fileserver while using TCP for NFS. The Python script took 6 minutes to read through 17,836 files (about 49 files per second). Changing nothing else, below is a similar screenshot while using UDP instead of TCP.
Yes, that's the whole graph. I had to use my mouse quickly enough to grab the shot before it went off the screen. The same files with a new mount using UDP instead took a total of 16 seconds. We can see that latencies are much lower but we are looking at a speedup of 22.5x. The latency differences alone do not account for this speedup. While I have not dug down deeper, I feel there may be an ability to send multiple out-of-order requests via UDP that can't be currently achieved with TCP.
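Here is the arithmetic behind those numbers, plus one extra step that is my own rough estimate rather than anything measured. The file count and run times are the ones reported above. The per-file time uses Little's law (requests in flight = rate x time per request) and assumes all 60 threads were busy for the entire run.

```python
from __future__ import division  # true division on Python 2 as well

files = 17836
tcp_seconds = 6 * 60   # "6 minutes" over TCP
udp_seconds = 16       # same files over UDP
threads = 60           # thread count used by the test script

tcp_rate = files / tcp_seconds        # files per second over TCP
udp_rate = files / udp_seconds        # files per second over UDP
speedup = tcp_seconds / udp_seconds   # 360 / 16 = 22.5

# Little's law: with `threads` requests always in flight, the average time
# each thread spends per file is threads / rate (assumes no idle threads).
tcp_time_per_file = threads / tcp_rate
udp_time_per_file = threads / udp_rate

print("TCP: %.1f files/s, about %.2f s per file per thread" % (tcp_rate, tcp_time_per_file))
print("UDP: %.1f files/s, about %.3f s per file per thread" % (udp_rate, udp_time_per_file))
print("speedup: %.1fx" % speedup)
```

Under that assumption each thread was spending over a second per file with TCP. If the latencies the fileserver reports are far below that, the rest of the time is being lost somewhere between the application and the server, which is where I would look next.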
Ok, so the take-away message is: NFS over TCP can be much slower than NFS over UDP.
Now the real work is in front of me: "Can we re-implement NFS over UDP now that our network infrastructure is solid?" or maybe even, "Does it make sense to deal with other failure modes to gain a 22.5x improvement in speed?"
... only time will tell. Maybe we can come to the answer 22.5x faster by just implementing it across the board ;) ...
Kidding of course! (sorta)