In April of 2001, Jeffrey Rothman and I wrote an article entitled "Which OS is fastest for high-performance network applications". It is due to be published in SysAdmin magazine, in their July 2001 issue.
The article compares the performance of Linux, Windows 2000, FreeBSD and Solaris, and tries to draw conclusions about which OS is fastest when you are running a high-performance network application.
Below is an earlier version of the article, from which the SysAdmin article was derived.
By Jeffrey B. Rothman, Ph.D. and John Buckman
In this article, we compare Linux, Solaris (for Intel), FreeBSD and Windows 2000 to determine which OS runs high-performance network applications the fastest. We describe which software designs to look for from your network software vendor, explaining how each design yields different performance characteristics, and determine which OS platform is best suited for each common network programming design. We present our OS benchmarks with both simulated and real-world tests, then evaluate the results.
We found that the software application's architecture determines speed results much more than the operating system on which it runs. Our benchmarks demonstrate a 12x performance difference between process-based and asynchronous task architectures. Significantly, we found up to a 75% overall performance difference between OSes when using the most efficient asynchronous architecture. We found Linux to be the best performing operating system based on our metrics, performing 35% better than Solaris, which came in 2nd, followed by Windows, and finally, FreeBSD.
At Lyris Technologies, we write high-performance, cross-platform, email-based server applications. Better application performance is a competitive advantage, so we spend a great deal of time tuning all aspects of an application's performance profile (software, hardware and operating system). Our customers frequently ask us which operating system is best for running our software, or if they already have an OS choice, they want to know how to make their operating system (OS) run our applications faster. Additionally, we run a hosting (outsourcing) division and want to reduce our hardware cost while providing the best performance for our hosting customers.
Most Internet applications follow these steps: (1) accept an incoming TCP/IP connection or create a connection to another machine; (2) once connected, exchange various text-based commands via TCP/IP; (3) these commands cause various activities to happen, such as disk-reading (e.g., viewing a web page), disk-writing (e.g., queuing a received email message) or calling external functionality (mail filtering, reverse DNS lookup).
In general, the performance issues for network applications are to: (1) accomplish many concurrent tasks as quickly as possible, (2) efficiently cope with a great deal of waiting (caused by TCP/IP slowness, or for the other end to send the next command), and (3) perform TCP/IP operations efficiently.
The most effective way to maximize network application performance is for the application's software designer to choose an architecture that addresses the 3 criteria above. The two significant variables are the task architecture, and the TCP/IP call architecture.
In the area of task architecture, there are three major techniques:
- One-process-per-task (process-oriented): many copies of the program are run with each copy handling one task at a time. Sometimes, a new process is created each time a new task is created (e.g., inetd, sendmail) or processes are re-used (e.g., Apache). This architecture yields good performance at low loads. Medium loads can also be handled, if the process image is small (e.g., qmail), if application-specific efficiency improvements are implemented, or if the application genre does not create too many simultaneous tasks. Multiple CPUs are efficiently used if process caching is used, and the total number of processes is kept low (i.e., low to medium load). This technique works on all operating systems, although Unix is significantly more efficient than Windows at implementing it (Windows lacks the fork() system call, and is so slow at creating processes that few Windows applications use this technique).
- One-thread-per-task (multithreaded): one copy of the program is run with a separate thread of execution inside the process handling each task. Multithreaded applications perform very well at low to medium loads. Higher loads cause decreasing (but usually still acceptable) performance, though extremely high loads can cause your multithreaded application to death-spiral. Multithreaded applications typically scale to between 500 and 1000 concurrent tasks, which is acceptable in many situations. Each new task uses a new thread, which consumes less memory and less CPU power than a new process would. Few open source projects use multithreading because only the most-popular Unixes are stable under heavy multithreading loads. Performance with multiple CPUs can be worse than on one CPU, because semaphore locks are much more costly on multiprocessor machines. Examples: Netscape web server, Apache on Windows.
- One-thread-many-tasks (asynchronous): one copy of the program is run with a set number of threads (typically, one thread per type of task), and each thread handles a large number of tasks using a technique called asynchronous (or non-blocking) TCP/IP. Because most programs are not required to handle high loads and because asynchronous programming is difficult, few programs support this architecture. Asynchronous programs scale well to multiple CPU machines, because they typically use long running threads operating independently of each other, require few cross-CPU locks, and so each thread can be permanently and effectively assigned to a CPU. Example: the DNS BIND daemon.
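To make the one-thread-per-task design concrete, here is a minimal Python sketch, not taken from any of the applications named above. The names `handle_task` and `serve_thread_per_task` and the "250 ok" reply text are hypothetical, chosen only for illustration. Each accepted connection gets its own thread, and that thread uses ordinary blocking calls.

```python
import socket
import threading


def handle_task(conn):
    """Serve exactly one connection using blocking calls (hypothetical handler)."""
    with conn:
        request = conn.recv(1024)      # blocks until the peer sends something
        if request:
            conn.sendall(b"250 ok " + request)


def serve_thread_per_task(listener, max_tasks):
    """Accept max_tasks connections, starting one new thread per connection."""
    workers = []
    for _ in range(max_tasks):
        conn, _addr = listener.accept()    # blocks until a client connects
        worker = threading.Thread(target=handle_task, args=(conn,))
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
```

The code stays simple because each thread only ever thinks about one task. The cost is one thread, with its stack and scheduling overhead, per simultaneous connection, which is where the scaling limits described above come from.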
The second major performance variable is the TCP/IP call architecture. At the operating system level, there are several ways to accomplish the same network operation. A tradeoff exists between TCP/IP speed and programming effort (faster techniques are more work for the programmer). In addition, some faster techniques are not available on all platforms, so higher performance requirements may limit the platform choice.
A blocking TCP/IP call sits and waits for the requested operation to complete, then acts immediately on the result. With small numbers of tasks, this results in immediate reaction to events as they occur. With large numbers of tasks, the operating system incurs significant context-switching overhead, and overall efficiency is poor. Blocking (synchronous) TCP/IP calls yield very short latencies under low loads, and are ideal for an application such as a low-load web server, where page-response time should be very fast, and the load is never very high. However, if a process-oriented architecture is used and a new process is created for each new connection (e.g., inetd) then the latency improvements from blocking TCP/IP are negated by the significant overhead of running a new process.
A non-blocking (asynchronous) TCP/IP call initiates an operation, then continues on with other activities. When the operation completes or an event occurs, the application is notified and then reacts. There is more programming work involved with this two-step process and sometimes a small amount of time is needed to react to the new event (increasing latency). This non-blocking technique yields much better performance under medium to high loads, and can survive abusively high loads, but latency may be slightly longer than with blocking TCP/IP calls.
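The two-step nature of a non-blocking call can be shown with a small Python sketch. It uses a local socket pair instead of a real network connection, and the helper names `try_recv` and `demo` are hypothetical. A non-blocking read returns at once when no data has arrived, and the program asks the operating system (here via `select`) to tell it when the socket becomes readable.

```python
import select
import socket


def try_recv(sock, nbytes=1024):
    """Non-blocking read: return the data, or None if nothing has arrived yet."""
    try:
        return sock.recv(nbytes)
    except BlockingIOError:
        return None


def demo():
    sender, receiver = socket.socketpair()
    receiver.setblocking(False)

    # Step 1: nothing has been sent, so the call returns immediately.
    first = try_recv(receiver)

    sender.sendall(b"HELO example\r\n")

    # Step 2: wait (up to 5 seconds) to be notified that data is ready.
    readable, _, _ = select.select([receiver], [], [], 5.0)
    second = try_recv(receiver) if readable else None

    sender.close()
    receiver.close()
    return first, second
```

In a real server the thread would service other connections between step 1 and step 2 instead of waiting, which is where the extra programming effort and the small added latency come from.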
Each of the three task-handling architectures matches up with a particular TCP/IP system call model. Process-oriented and multithreaded programs tend to use blocking TCP/IP calls, as this is the simplest way to program, and handles the low loads that are the most common case. However, an application that uses the asynchronous task architecture must use non-blocking TCP/IP operations in order to handle multiple tasks: blocking TCP/IP is not an option. Therefore, if you find a network application which uses the highly-scalable asynchronous task architecture, you also benefit from that application using the most scalable TCP/IP call architecture (non-blocking).
To evaluate the performance of various operating systems and network applications, we created 3 different tests: real world, disk I/O and task architecture comparison. The operating systems we examined were Linux (RedHat 7.0, kernel 2.2.16-22), Solaris 2.8 for Intel, FreeBSD 4.2, and Windows 2000 Server. The operating systems were the latest version available from a commercial distribution and were not recompiled (i.e., everything was tested right out of the box). We installed all OSes on identical 4GB SCSI-3 drives (IBM model DCAS-34330), and ran the tests on the same machine (ASUS P3B motherboard, Intel Pentium III 550 MHz processor, 384 MB SDRAM, Adaptec 2940UW SCSI controller, ATI Rage Pro 3D video card, Intel EtherExpress Pro 10/100 Ethernet card).
As a real-world test, we measured how quickly email could be sent using our MailEngine software. MailEngine is an email delivery server, ships on all the tested platforms (plus Solaris for Sparc), and uses an asynchronous architecture (with non-blocking TCP/IP using the poll() system call). So that email was not actually delivered to our 200,000-member test list, we ran MailEngine in test mode: MailEngine performs all the steps of sending mail, but sends the RSET command instead of the DATA command at the last moment. The SMTP connection is then closed with QUIT, and no email is delivered to the recipient. Our workload consisted of a single message being delivered to 200,000 distinct email addresses spread across 9113 domains. Because the same message was queued in memory for every recipient, disk I/O was not a significant factor. We slowly raised the number of simultaneous connections to see how the increased load altered performance.
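The difference between a real delivery and the test mode described above comes down to one command in the SMTP dialogue. The Python sketch below only builds the list of client commands. It is an illustration under the assumption of a plain HELO/MAIL FROM/RCPT TO exchange, not MailEngine's actual code, and the function name and example addresses are hypothetical.

```python
def smtp_commands(helo_host, sender, recipient, test_mode=True):
    """Return the client-side SMTP commands for one delivery attempt, in order."""
    commands = [
        f"HELO {helo_host}",
        f"MAIL FROM:<{sender}>",
        f"RCPT TO:<{recipient}>",
    ]
    if test_mode:
        # Abandon the transaction at the last moment: nothing is delivered.
        commands.append("RSET")
    else:
        # In a real delivery the message body would follow this command.
        commands.append("DATA")
    commands.append("QUIT")
    return [command + "\r\n" for command in commands]
```

Calling `smtp_commands("mail.example.com", "list@example.com", "user@example.org")` yields a dialogue that ends with RSET and QUIT, so the receiving server does all the connection and address-checking work without accepting a message.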
Figure 1 ("Operating system comparison") shows the test-mode email delivery speed for MailEngine over a range of simultaneous connections for each OS. Linux is the clear speed winner, roughly 35% faster than Solaris, the runner-up. Overall performance increased as connections were added, with only marginal additional speed beyond 1500 connections. FreeBSD performance decreased somewhat beyond 1500 connections. On the Unix-style OSes, it was necessary to tweak the kernel slightly in order to allow the use of so many connections in one process. Despite kernel tweaking, FreeBSD gave us resource-shortage warnings and failed to run when loaded with more than 2500 connections.
Many network applications also require the ability to queue information on disk for later processing (e.g., sendmail's mail queue), or to handle overflow situations. To measure file system efficiency mimicking typical situations, we wrote a C++ program that creates, writes, and reads back 10,000 files in a single directory, one file at a time. To measure file system efficiency for various kinds of files, the file size was increased from 4KB to 128KB.
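The original test program was written in C++ and is not reproduced here. As a rough illustration of the same procedure, the following Python sketch creates, writes, and reads back files one at a time in a single temporary directory and reports the elapsed time. The function name, file naming scheme, and payload contents are assumptions made for this example.

```python
import os
import tempfile
import time


def file_benchmark(num_files=10_000, file_size=4 * 1024):
    """Create, write and read back num_files files in one directory.

    Returns (elapsed_seconds, total_bytes_read).
    """
    payload = b"x" * file_size
    bytes_read = 0
    with tempfile.TemporaryDirectory() as directory:
        start = time.perf_counter()
        for i in range(num_files):
            path = os.path.join(directory, f"f{i:05d}.dat")
            with open(path, "wb") as f:      # create and write
                f.write(payload)
            with open(path, "rb") as f:      # read back
                bytes_read += len(f.read())
        elapsed = time.perf_counter() - start
    return elapsed, bytes_read
```

Running it with `file_size` stepped from 4 KB up to 128 KB mirrors the range described above. Note that reading a file immediately after writing it is likely to be served from the operating system's cache, so a sketch like this mostly measures file creation and metadata overhead.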
Figure 2 ("Time to Create, Write and Read 10,000 Files") displays our file system test results. Linux and Windows speeds were almost identical, significantly faster than the other two: 6x faster than FreeBSD, 10x faster than Solaris. The file system for each OS was: Linux - EXT2, Solaris - UFS, Windows 2000 - NTFS, FreeBSD - UFS. Other file systems would undoubtedly yield different performance results. If your software application depends heavily on disk I/O, we would highly recommend you use Linux or Windows, or else investigate alternative file systems on FreeBSD or Solaris.
Finally, we evaluated how different network application architectures performed on each operating system. We wrote a simple C++ server program that responded to incoming connections with the message "450 too busy", using one of 3 architectures to handle sending the response message. The 3 architectures our program tested were: (1) a process-based architecture, with a new process executed to handle each connection; (2) a multithreaded architecture, with a thread assigned to each connection; and (3) an asynchronous architecture, with all connections answered using non-blocking TCP/IP. A separate C++ program, running on a different machine (on Linux), attempts to connect to our simple server program as quickly as possible, slowly increasing the simultaneous connection load, and counting successfully received response messages. Because the full set of charts (12 test runs) is too much to present in this article, we instead chart the average for each task architecture to show general performance differences.
Figure 3 ("Average Throughput per Network Architecture") shows the performance of each type of task architecture, averaged across all OS platforms. While there was a significant amount of variation in performance between platforms, the variation was not nearly as significant as the architecture choice. The slowest network application architecture is the process-based architecture, which can handle only about 5 percent of the connections of the asynchronous method. The asynchronous method can handle about 35 percent more load than the thread-based method at 1000 simultaneous connections. The trend lines show that the multithreaded vs. asynchronous performance gap widens as load increases.
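For readers who have not seen the asynchronous design in code, the following Python sketch shows one thread answering many connections with "450 too busy" using non-blocking sockets. It is not the C++ test server used for the benchmarks. The class name `AsyncResponder` and its methods are hypothetical, and error handling for reset connections is omitted for brevity.

```python
import selectors
import socket

RESPONSE = b"450 too busy\r\n"


class AsyncResponder:
    """One thread, many connections: every socket is non-blocking."""

    def __init__(self):
        self.sel = selectors.DefaultSelector()
        self.pending = {}      # socket -> bytes still to be sent
        self.answered = 0

    def add_listener(self, listener):
        listener.setblocking(False)
        self.sel.register(listener, selectors.EVENT_READ, self._accept)

    def add_connection(self, conn):
        conn.setblocking(False)
        self.pending[conn] = RESPONSE
        self.sel.register(conn, selectors.EVENT_WRITE, self._send)

    def _accept(self, listener):
        try:
            conn, _addr = listener.accept()
        except BlockingIOError:
            return                      # nothing to accept after all
        self.add_connection(conn)

    def _send(self, conn):
        data = self.pending[conn]
        try:
            sent = conn.send(data)
        except BlockingIOError:
            return                      # try again on the next notification
        if sent < len(data):
            self.pending[conn] = data[sent:]   # partial write, keep the rest
            return
        del self.pending[conn]
        self.sel.unregister(conn)
        conn.close()
        self.answered += 1

    def run_once(self, timeout=1.0):
        """Handle every socket that is ready right now."""
        for key, _events in self.sel.select(timeout):
            key.data(key.fileobj)
```

A server would create a listening socket, pass it to `add_listener`, and call `run_once` in a loop. No call ever waits on a single slow client, so the number of connections one thread can serve is limited mainly by file descriptors and memory.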
In their default configurations, the Unix-style operating systems we tested do not support the large numbers of simultaneous TCP/IP connections that multithreaded and asynchronous applications require. This limitation drastically restricts application performance, and can incorrectly dissuade a system administrator from using these kinds of high performance architectures. Fortunately, these limitations are easily overcome with a few kernel tweaks. On Unix, each TCP/IP connection uses a file descriptor, so you must increase the total number of descriptors available to the operating system, and also increase the maximum number of descriptors each process is allowed to use. All Unix-style operating systems have a "ulimit" shell command (sh and bash) which can allow more open file descriptors to commands started in that shell, once the appropriate kernel tweak has been made. We recommend "ulimit -n 8192". Here are our recommended kernel tweaks:
Linux: "echo 65536 > /proc/sys/fs/file-max" sets the system-wide maximum number of file descriptors.
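A program can also raise its own per-process descriptor limit at startup, which has the same effect as "ulimit -n" for that process. The Python sketch below uses the standard `resource` module (Unix only). The function name is hypothetical and the 8192 default simply mirrors the recommendation above. It cannot exceed the hard limit and does not change any system-wide kernel setting.

```python
import resource


def raise_nofile_limit(target=8192):
    """Raise this process's soft open-file limit toward target.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)      # the soft limit may not exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard               # already high enough
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

If the returned soft limit is still lower than the number of connections you expect, the hard limit or the system-wide setting needs to be raised first by an administrator.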
In summary, our real-world test observed a 75% performance gap between the best and worst performing operating systems, with Linux enjoying a 35% lead over runner-up Solaris. Of more significance, asynchronous applications were on average 12x faster than process-based applications, and 35% faster than multithreaded applications. If disk I/O occupies a significant run-time portion of your application, your disk I/O tasks will run up to 10x faster on Linux and Windows 2000, when compared to Solaris, or 6x faster than FreeBSD.
If you are evaluating a network software application and final performance is important to you, we recommend that the software architecture be a vital evaluation criterion (i.e., you should show a preference for multithreaded or asynchronous architectures).
Sidebar #1: How can a SysAdmin determine what network programming architecture is being used?
Most programs' documentation remains vague about which network programming architecture is used. Since the architecture is the single strongest determinant of performance under load, you need a way to find out. Below are techniques you can use to snoop on an application, even if you do not have access to the source code. First, you have to set up a load-creating situation, with at least 200 simultaneous tasks. Then, examine how the process is running with various tools:
- Run 'top'.
  - One-process-per-task: many processes running, each with a different amount of memory use.
  - One-thread-per-task: many processes running, all with the same amount of memory use, and the number of processes changes over time.
  - One-thread-many-tasks (one thread): a single process running.
  - One-thread-many-tasks (several threads): many processes running, all with the same amount of memory use, and the number of processes is less than 100.
- Run "ps -efL". The "NLWP" column indicates how many "lightweight processes" are running inside each process, and lists each LWP in a separate row.
  - One-process-per-task: "ps -efL" reports multiple copies of your application, each with a different process ID.
  - One-thread-per-task: "ps -efL" reports multiple copies of your application, all with the same process ID, and the number of copies changes over time.
  - One-thread-many-tasks (one thread): a single process reported.
  - One-thread-many-tasks (several threads): "ps -efL" reports multiple copies of your application, all with the same process ID, and the number of copies is less than 100.
- On Windows, use Task Manager to view active tasks and perfmon.exe to view the number of threads in the running task.
  - One-process-per-task: Task Manager shows approximately one process per running task.
  - One-thread-per-task: perfmon.exe shows approximately one thread per running task.
  - One-thread-many-tasks (one thread): one process, one thread.
  - One-thread-many-tasks (several threads): Task Manager shows one process; perfmon.exe shows fewer than 100 threads.
- Use strace (on Linux) or truss (on Solaris and FreeBSD) to display the application's system calls.
  - Grep for calls to "select" or "poll", which usually indicate asynchronous operations. poll() is more efficient than select() with large numbers of connections.
  - Grep for calls that start with "aio" (which stands for Asynchronous I/O) or "lio" (List I/O). These calls are a likely sign of a sophisticated asynchronous architecture. Examples: aio_write(), aio_read(), lio_listio().
  - Grep for calls that start with "pthread" or "thr_" (on Solaris), which indicates some amount of multithreading. If a new task always causes a call to pthread_create(), then the "one-thread-per-task" design is being used.
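The grep steps above can be combined into a small script that tallies the tell-tale call names in a saved trace. This Python sketch assumes each trace line starts with the call name followed by an opening parenthesis, optionally preceded by a process ID, which is an assumption about the trace layout and not a guarantee for every strace or truss version. The function names and category labels are hypothetical.

```python
import re
from collections import Counter

# Call name at the start of a line, optionally preceded by "[pid 1234]" or "1234".
CALL_NAME = re.compile(r"^\s*(?:\[pid\s+\d+\]\s*|\d+\s+)?([A-Za-z_]\w*)\(")


def hint_for(name):
    """Map a call name to the architecture hint it suggests, or None."""
    if name in ("select", "poll"):
        return "asynchronous (select/poll)"
    if name.startswith(("aio", "lio")):
        return "asynchronous (aio/lio)"
    if name.startswith(("pthread", "thr_")):
        return "multithreading"
    return None


def summarize_trace(lines):
    """Count how many traced calls point to each architecture hint."""
    counts = Counter()
    for line in lines:
        match = CALL_NAME.match(line)
        if match:
            hint = hint_for(match.group(1))
            if hint:
                counts[hint] += 1
    return counts
```

Feeding it the lines of a trace file, for example `summarize_trace(open("trace.txt"))` with a hypothetical file name, gives a quick tally. A high count of select or poll calls with few thread-creation calls suggests the one-thread-many-tasks design.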
Sidebar #2: What do the authors use for servers?
We settled on custom assembled PCs, using components that we find to be very reliable, and which enjoy widespread driver support. Our standard system consists of: ASUS P3B or CUV4X Pentium III motherboards, 384MB RAM, Adaptec SCSI3 controller, IBM SCSI3 hard drives in a dual-fan removable hard drive chassis, ATI AGP Rage Pro 3D display card, Intel EtherExpress 10/100, all housed in an Antec 4U rack-mount case. Final cost: about $2000 per PC.
- Low cost PC hardware
- Reliable and high performance multithreading
- Very, very stable.
- Reasonably good overall OS performance
- Fairly good stability
- Reliable and high performance multithreading
- With VNC, fairly easy to remotely administer
- When using MS SQL Server as a data store, reliability is much improved over using the NT file system (a client OS crash no longer corrupts data, and live backups are possible, neither of which is true with the NTFS file system)
- Unix better than Windows for application servers due to stability, easy application crash recovery, and scripting
- When an application is available on all Unixes, we tend to use Solaris, due to stability. For example, very high DNS load crashes the Linux (RedHat 6.1 or 2.2.12-20) kernel, but DNS has been very stable on Solaris.
- Some applications have a better feature set on Linux than on other Unixes, so in those cases we use RedHat Linux.
- Strong Samba support (bi-directional, can mount Windows volumes)
- Software RAID is incredibly fast, easy to install (but not very reliable)
- Considering FreeBSD w/Vinum (http://www.vinumvm.org/) as a more stable, more full-featured software RAID alternative (but installation is much more difficult than on Linux)
- Latest kernel version (2.8) is very stable, even with tens of thousands of connections
- Built-in TCP/IP filtering at the kernel level is very fast (much faster than on Linux)
- Security-conscious OpenBSD developers keep default feature set at a minimum to keep security tight
- Routing and firewall can be handled in one system, helping performance and management simplicity.