LINUX SCHEDULER BENCHMARK RESULTS

Richard Gooch
30-SEP-1998

Introduction

One of the important considerations when choosing a Real Time (RT) operating system is the scheduler (context switch) overhead. For example, if a low priority process releases a lock which a high priority task is waiting for, how long will it take for the high priority task to resume processing? The minimum time this can take is determined by the time taken to perform a context switch. In other words, the time taken to schedule a new process.

Serious RT operating systems can quote context switch overheads of a few tens of microseconds or less. This page shows some results for the Linux scheduler.


Raw Results

This section lists some raw results collected by myself and other people. All results are in microseconds.

First my own results (UP kernels, no extra processes, sched_yield(), RT processes):

CPU		Memory	process switch	thread switch	Kernel version
AMD386DX40	FPM	39.4		24.8		2.1.122+fix
Intel386DX33	FPM	22.5		12.3		2.1.122+fix
486DX266	FPM	8.3		5.2		2.1.122+fix
Pentium 100	FPM	8.8		2.8		2.1.122+fix
PPro 180    	EDO	2.2		1.5		2.1.122+fix
Pentium/MMX 200	SDRAM	2.4		0.6		2.1.122+fix
PPro 180    	EDO	2.2		1.4		2.1.122+REG
Pentium 100	FPM	8.0-8.2		2.5		2.1.123+fix
PPro 180    	FPM	2.0		1.0		2.1.123+fix
Pentium 100	FPM	7.2		2.0		2.1.123+rtqueue
PPro 180    	FPM	1.9		1.0		2.1.123+rtqueue

Now the same tests with 10 low-priority processes on the run queue:

CPU		Memory	process switch	thread switch	Kernel version
AMD386DX40	FPM	118.7		106.0		2.1.122+fix
Intel386DX33	FPM	96.7		85.9		2.1.122+fix
486DX266	FPM	46.5		35.5		2.1.122+fix
Pentium 100	FPM	23.0		11.7		2.1.122+fix
Pentium/MMX 200	SDRAM	7.8		6.0		2.1.122+fix
PPro 180    	EDO	4.3		3.6		2.1.122+fix
PPro 180    	EDO	3.6		3.0		2.1.122+REG
Pentium 100    	FPM	22.1		10.8		2.1.123+fix
PPro 180    	FPM	4.1		3.2		2.1.123+fix
Pentium 100	FPM	7.2		2.0		2.1.123+rtqueue
PPro 180    	FPM	1.9		1.0		2.1.123+rtqueue

Now using pipes and token passing rather than sched_yield() (no extra processes):

CPU		Memory	process switch	thread switch	Kernel version
486DX266	FPM	16.4		10.8		2.1.122+fix
Pentium 100	FPM	13.1		7.0		2.1.122+fix
PPro 180    	EDO	3.0		2.2		2.1.122+fix
Pentium/MMX 200	SDRAM	4.2		1.3		2.1.122+fix
PPro 180    	EDO	3.4		2.6		2.1.122+REG
Pentium 100    	FPM	13.6		7.5		2.1.123+fix
PPro 180    	FPM	3.1		2.4		2.1.123+fix
Pentium 100    	FPM	14.2		6.5		2.1.123+rtqueue
PPro 180    	FPM	3.0		2.1		2.1.123+rtqueue

And now adding FPU pollution in the main timing loop:

CPU		Memory	process switch	thread switch	Kernel version
486DX266	FPM	27.1		16.9		2.1.122+fix
Pentium 100	FPM	15.6		9.0		2.1.122+fix
PPro 180    	EDO	4.8		4.0		2.1.122+fix
Pentium/MMX 200	SDRAM	5.0		2.3		2.1.122+fix
PPro 180    	EDO	5.3		4.6		2.1.122+REG
Pentium 100    	FPM	15.7		9.8		2.1.123+fix
PPro 180    	FPM	5.2		4.2		2.1.123+fix
Pentium 100    	FPM	16.3		9.0		2.1.123+rtqueue
PPro 180    	FPM	4.9		4.1		2.1.123+rtqueue

A comparative test using "lat_ctx" from lmbench shows the context switch time going from about 19 us to 41 us when 10 low-priority processes are added to the run queue. These timings are for the Pentium 100.

Another test for normal processes (UP kernels, no extra processes, sched_yield(), including syscall overheads):

CPU		Memory	process switch	thread switch	Kernel version
Pentium 100	FPM	11.2-11.5	6.0		2.1.123+fix
PPro 180    	FPM	5.3		4.5		2.1.123+fix
Pentium 100	FPM	12.2		6.6		2.1.123+rtqueue
PPro 180    	FPM	5.2		4.6		2.1.123+rtqueue

Note that the "2.1.122+fix" kernel is 2.1.122 plus a bugfix from Linus for lazy FPU state saving/restoring.


Discussion

This section will discuss the results.

Effect of Run Queue Length

Looking at the costs of having more processes on the run queue, we can see from the PPro 180 times with kernel 2.1.122+fix that an extra 10 processes on the run queue increased the thread switch time from 1.5 us to 3.6 us. This is a cost of 2.1 us per 10 processes, or 0.21 us per process. For the Pentium 100 the cost is 0.89 us. With only 3 extra processes on the run queue, we can expect the thread switching time for a Pentium 100 to double! A more pessimistic result may be obtained using lmbench which shows 2.2 us per process cost for a Pentium 100. For an Intel 386DX33 the cost is 7.4 us per process. Here a mere two extra processes on the run queue more than doubles thread switch latency.
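To make the extrapolation explicit, here is a small stand-alone programme (not part of the benchmark) which applies the crude linear model I am assuming above: switch time equals a fixed base cost plus a per-runnable-process scan cost. The constants are the measured Pentium 100 figures (2.1.122+fix) quoted above and are purely illustrative.

#include <stdio.h>

int main (void)
{
    double base = 2.8;          /*  Thread switch time, short run queue (us)   */
    double per_process = 0.89;  /*  Extra cost per low-priority process (us)   */
    int n;

    /*  Print the predicted thread switch time for 0 to 10 extra processes  */
    for (n = 0; n <= 10; ++n)
        printf ("%2d extra processes: ~%4.1f us per thread switch\n",
                n, base + n * per_process);
    return (0);
}

With n = 10 this predicts 11.7 us, which matches the measured value in the table above.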

Note that I look at thread switch times, since I expect an RT application to be implemented with threads rather than full processes. In an RT OS like pSOS+, all tasks run in the same memory space, so this seems like a common paradigm for RT applications.

Given these results, I think there is a case to be made to have a separate run queue for RT processes under Linux. Consider a system which serves a dual purpose: an RT application which is used to control an instrument and an online data reduction and visualisation application. While the "obvious" solution is to run the RT and other application on separate machines, this has some drawbacks. Firstly, there is the added cost of a second machine: not all organisations have the necessary budget. Secondly, it makes sense to perform the data reduction and visualisation on the same machine performing the data capture. Having separate machines would require data to be shipped across a network and would also consume memory bandwidth on the data capture machine.

A separate RT run queue provides an increased measure of isolation between RT processes and normal processes. This would help give a more stable response for RT processes on a system loaded with normal processes.

I have written a kernel patch which maintains a separate run queue for RT processes and also simplifies some of the code paths. The results are shown for the "2.1.123+rtqueue" kernel. The "2.1.123+fix" kernel includes a bugfix for POSIX.4 compliance. From the results we can see that there is a minor benefit for the case where the run queue is short. This benefit is due to the simplified code paths and probably better cache utilisation. In addition, we can see that there is now no penalty for additional low-priority processes on the run queue.
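The sketch below is a toy illustration of the data-structure idea only, not the actual kernel patch: RT tasks live on their own list, so picking the next RT task never requires scanning the (possibly long) list of ordinary tasks. All names here are made up for the example.

#include <stdio.h>

struct toy_task
{
    int rt_priority;              /*  > 0 for RT tasks                   */
    struct toy_task *next_run;    /*  Next task on the same run queue    */
};

static struct toy_task *rt_runqueue = NULL;     /*  RT tasks only           */
static struct toy_task *other_runqueue = NULL;  /*  SCHED_OTHER tasks only  */

/*  Pick the best runnable RT task; fall back to the ordinary queue only when
    no RT task is runnable. The cost of this loop does not depend on how many
    low-priority processes are runnable.  */
static struct toy_task *pick_next (void)
{
    struct toy_task *t, *best = NULL;

    for (t = rt_runqueue; t != NULL; t = t->next_run)
        if ( (best == NULL) || (t->rt_priority > best->rt_priority) ) best = t;
    return (best != NULL) ? best : other_runqueue;
}

int main (void)
{
    struct toy_task rt1 = {10, NULL}, rt2 = {20, NULL}, ordinary = {0, NULL};

    rt1.next_run = &rt2;
    rt_runqueue = &rt1;
    other_runqueue = &ordinary;
    printf ("picked task with rt_priority %d\n", pick_next ()->rt_priority);
    return (0);
}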

Effect of Cache Line Aliasing

It's interesting to note that the thread switch ratio between a Pentium 100 and a Pentium Pro 180 is only about two, and that the Pentium/MMX 200 is more than twice as fast as the PPro 180. On the other hand, the PPro 180 is slightly faster than the Pentium/MMX 200 for process context switches, and both are around 4 times faster than the Pentium 100.

A possible explanation for this is that since the Pentium/MMX has a 4-way L1 cache, cache line aliasing is not occurring for the thread switch case, whereas for the Pentium and PPro cache line aliasing requires more cache line fills and hence slows down thread switching. A full process switch on the other hand requires more memory to be manipulated and hence the faster L2 cache of the PPro dominates.

Looking further at the case with an extra 10 low-priority processes placed on the run queue, we see again that the Pentium/MMX is not performing as well as the PPro. Since the run queue is longer, the Pentium/MMX may now also be exhibiting cache line aliasing problems. The faster L2 cache of the PPro may again be helping.

The Linux task (process) structure is page aligned. Thus there is the potential for cache line thrashing, as accesses to the same member in successive task structures can be aliased to the same cache line. This results in that cache line being flushed/invalidated, and a new cache line must be brought in from main memory. To reduce the impact of this, I've reordered some of the members in the task structure so that the scheduler only needs to access a maximum of 48 contiguous bytes when scanning the run queue. On a CPU with 32 byte cache lines, this means only 2 cache lines per task need to be brought in. The standard version of the kernel requires at least 3 cache lines per thread and 4 cache lines per non-RT process on the run queue.
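The sketch below illustrates the reordering idea only: the members the scheduler reads while scanning the run queue are grouped at the front of the structure, so with 32 byte cache lines a scan touches at most two lines per task. The member names are loosely modelled on the 2.1 task structure, but this is not the real Linux task_struct.

#include <stdio.h>
#include <stddef.h>

struct task_sketch
{
    /*  Scheduler-hot members, kept within the first 48 bytes  */
    long                 state;
    long                 counter;
    long                 priority;
    unsigned long        policy;
    unsigned long        rt_priority;
    struct task_sketch  *next_run;
    /*  Everything else (cold for the run-queue scan) follows  */
    char                 comm[16];
    /*  ...  */
};

int main (void)
{
    printf ("scheduler-hot region ends at byte %lu\n",
            (unsigned long) offsetof (struct task_sketch, comm));
    return (0);
}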

The effect of my reordering can be seen in the results for the "2.1.122+REG" kernel. As well as reducing the basic (2 thread) switching time, the per-process cost has been reduced from 0.21 us to 0.16 us for a PPro 180. However, other tests show that this change makes some operations slower, so clearly more work is required to determine an optimal ordering. In my view it is more fruitful to create a separate RT run queue, as this would have a greater benefit.

FPU Issues

From the results above we can see that the time taken to save and restore the FPU state dominates the (short run queue) context switch time. On a PPro 180, process switch times went from 3.0 us to 4.8 us, a cost of 1.8 us. This is worth noting for a few reasons.

Firstly, RT processes which wish to ensure low context switch times and wakeup latencies should avoid floating point operations if possible. Performing floating point operations will increase the latency.

Secondly, when comparing results from my test code with lmbench results, it is worth noting that lmbench uses floating point arithmetic to compute timings. This explains why most of my benchmarks give lower absolute times than lmbench.

Finally, FPU state saving/restoring also has an effect on the variance of the results; this is discussed below.

Variance

Earlier generations of my test code showed considerable variance (up to 100%). Some of that variance has been explained by unexpected system effects (such as my shell, xterm and X server being placed on the run queue while the benchmark was running). The current code takes steps to avoid these effects.

There still remains some variance (up to 50%) where an occasional measurement is higher than usual. This is in contrast to the lmbench results, which so far have only exhibited a 30% variance (mean 5.25 us, minimum 3.94 us on a PPro 180). However, it is worth considering the FPU save/restore times. The large variance in my benchmarks was without FPU pollution (and hence the lower switch times). If we take the 1.8 us FPU state overhead away from the lmbench numbers, we get a mean of 3.45 us and a minimum of 2.14 us, yielding a variance of over 60%. While this is a crude comparison, it should be a reasonable first approximation. In my tests which included FPU pollution, the variance was under 30%.
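The small programme below reproduces this crude comparison. I am assuming here that "variance" means (mean - minimum) / minimum, since that definition matches the percentages quoted above; the numbers are the lmbench results for the PPro 180 and the 1.8 us FPU save/restore cost.

#include <stdio.h>

int main (void)
{
    double mean = 5.25, minimum = 3.94, fpu_cost = 1.8;

    /*  Variance of the raw lmbench numbers  */
    printf ("raw lmbench variance:       %.0f%%\n",
            100.0 * (mean - minimum) / minimum);
    /*  Variance after subtracting the FPU save/restore overhead  */
    printf ("with FPU overhead removed:  %.0f%%\n",
            100.0 * ( (mean - fpu_cost) - (minimum - fpu_cost) ) /
            (minimum - fpu_cost) );
    return (0);
}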

The assumption I've made is that FPU state save/restore times are less subject to variance than the rest of the context switch time. I think this is a reasonable assumption considering that the FPU state is kept in a single block of memory whereas the rest of the context switch operations require access to scattered parts of memory. Access to scattered memory is more dependent on cache effects.

Finally, it is worth noting that I've seen greatest variance on the 486DX266, a bit less on the Pentium 100, less again on the PPro 180 and least on the Pentium/MMX 200. This leads me to the suspicion that much of the variance is due to cache effects.


Source Code

Below is the source code used in these tests.

/*  time-schedule.c

    Programme to test how long a context switch takes.

    Copyright (C) 1998  Richard Gooch

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

    Richard Gooch may be reached by email at  rgooch@atnf.csiro.au
    The postal address is:
      Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
*/

/*  This programme will determine the context switch (scheduling) overhead on
    a system. It takes into account SMP machines. True context switches are
    measured.

    Written by      Richard Gooch   15-SEP-1998

    Last updated by Richard Gooch   25-SEP-1998
*/

#include <unistd.h>
#ifdef _POSIX_THREADS
#  ifndef _REENTRANT
#    define _REENTRANT
#  endif
#  ifndef _POSIX_THREAD_SAFE_FUNCTIONS
#    define _POSIX_THREAD_SAFE_FUNCTIONS
#  endif
#  include <pthread.h>
#endif
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <string.h>
#include <ctype.h>
#include <signal.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <dirent.h>
#include <errno.h>
#ifndef __KARMA__
#  define mt_num_processors() 1  /*  Set to the number of processors  */
#  define ERRSTRING sys_errlist[errno]
#  define FALSE 0
#  define TRUE 1
#else
#  include <karma.h>
#  include <karma_mt.h>
#endif

#define MAX_ITERATIONS 1000

static unsigned int hog_other_cpus ();
static void run_yielder (int use_threads, int read_fd);
static void *yielder_main (void *arg);
static void s_term_handler ();
static void run_low_priority (unsigned int num, int read_fd);
static unsigned long compute_median (unsigned long values[MAX_ITERATIONS],
                                     unsigned long max_value);
static unsigned int get_run_queue_size ();
static unsigned long get_num_switches ();
static void use_fpu_value (double val);

static volatile unsigned int sched_count = 0;  /*  For yielder  */
static int pipe_read_fd = -1;
static int pipe_write_fd = -1;
static pid_t child = -1;


int main (int argc, char **argv)
{
    int use_threads = FALSE;
    int use_pipe = FALSE;
    int no_test = FALSE;
    int frob_fpu = FALSE;
    int read_fd = -1;
    int write_fd = -1;
    int num_low_priority = -1;
    int i, j;
    int fds[2];
    unsigned int count, num_yields, run_queue_size1, run_queue_size2, num_hogs;
    unsigned long median, switches1, num_switches, num_overhead_switches;
    signed long overhead, total_diffs;
    signed long min_diff = 1000000000;
    signed long max_diff = -1000000000;
    double dcount = 0.0;
    unsigned long diffs[MAX_ITERATIONS];
    struct timeval before, after;
    static char *usage =
        "time-schedule [-h] [-thread] [-notest] [-pipe] [-fpu] [num_running]";

    /*  First create pipe used to synchronise low priority processes  */
    if (pipe (fds) != 0)
    {
        fprintf (stderr, "Error creating pipe\t%s\n", ERRSTRING);
        exit (1);
    }
    read_fd = fds[0];
    pipe_write_fd = fds[1];
    for (count = 1; count < argc; ++count)
    {
        if (strcmp (argv[count], "-thread") == 0)
        {
#ifdef _POSIX_THREADS
            use_threads = TRUE;
#else
            fprintf (stderr, "POSIX threads not available\n");
#endif
        }
        else if (strcmp (argv[count], "-pipe") == 0) use_pipe = TRUE;
        else if (strcmp (argv[count], "-notest") == 0) no_test = TRUE;
        else if (strcmp (argv[count], "-fpu") == 0) frob_fpu = TRUE;
        else if ( isdigit (argv[count][0]) )
            num_low_priority = atoi (argv[count]);
        else
        {
            fprintf (stderr,
                     "Programme to time context switches (schedules)\n");
            fprintf (stderr,
                     "(C) 1998  Richard Gooch <rgooch@atnf.csiro.au>\n");
            fprintf (stderr, "Usage:\t%s\n", usage);
            fprintf (stderr, "\t-thread\t\tswitch threads not processes\n");
            fprintf (stderr, "\t-pipe\t\tuse pipes not sched_yield()\n");
            fprintf (stderr,
                     "\t-fpu\t\tpollute the FPU after each switch in main\n");
            fprintf (stderr, "\tnum_running\tnumber of extra processes\n");
            exit (0);
        }
    }
    if (no_test)
    {
        if (num_low_priority > 0)
            run_low_priority (num_low_priority, read_fd);
        while (TRUE) pause ();
    }
    if (geteuid () == 0)
    {
        struct sched_param sp;

        memset (&sp, 0, sizeof sp);
        sp.sched_priority = 10;
        if (sched_setscheduler (0, SCHED_FIFO, &sp) != 0)
        {
            fprintf (stderr, "Error changing to RT class\t%s\n", ERRSTRING);
            exit (1);
        }
        if (mlockall (MCL_CURRENT | MCL_FUTURE) != 0)
        {
            fprintf (stderr, "Error locking pages\t%s\n", ERRSTRING);
            exit (1);
        }
    }
    else fprintf (stderr, "Not running with RT priority!\n");
    /*  Give shell and login programme time to finish up and get off the run
        queue  */
    usleep (200000);
    if (use_pipe)
    {
        if (pipe (fds) != 0)
        {
            fprintf (stderr, "Error creating pipe\t%s\n", ERRSTRING);
            exit (1);
        }
        pipe_read_fd = fds[0];
        write_fd = fds[1];
    }
    num_hogs = hog_other_cpus ();
    /*  Determine overhead. Do it in a loop of 2. The first iteration should
        warm the cache, the second will compute the overhead  */
    for (j = 0; j < 2; ++j)
    {
        switches1 = get_num_switches ();
        gettimeofday (&before, NULL);
        for (i = 0; i < 20; ++i)
        {
            if (use_pipe)
            {
                char ch = 0;

                write (pipe_write_fd, &ch, 1);
                read (read_fd, &ch, 1);
            }
            else sched_yield ();
            if (frob_fpu) ++dcount;
        }
        gettimeofday (&after, NULL);
        num_overhead_switches = get_num_switches () - switches1;
        overhead = 1000000 * (after.tv_sec - before.tv_sec);
        overhead += after.tv_usec - before.tv_usec;
    }
    use_fpu_value (dcount);
    if (num_low_priority > 0) run_low_priority (num_low_priority, read_fd);
    /*  Set up for the benchmark  */
    run_yielder (use_threads, read_fd);
    memset (diffs, 0, sizeof diffs);
    run_queue_size1 = get_run_queue_size ();
    total_diffs = 0;
    switches1 = get_num_switches ();
    /*  Benchmark!  */
    for (count = 0; count < MAX_ITERATIONS; ++count)
    {
        signed long diff;

        gettimeofday (&before, NULL);
        /*  Generate 20 context switches  */
        for (i = 0; i < 10; ++i)
        {
            if (use_pipe)
            {
                char ch = 0;

                write (write_fd, &ch, 1);
                read (read_fd, &ch, 1);
            }
            else sched_yield ();
            if (frob_fpu) dcount += 1.0;
        }
        gettimeofday (&after, NULL);
        diff = 1000000 * (after.tv_sec - before.tv_sec);
        diff += after.tv_usec - before.tv_usec;
        diffs[count] = diff;
        total_diffs += diff;
        if (diff < min_diff) min_diff = diff;
        if (diff > max_diff) max_diff = diff;
    }
    num_yields = sched_count;
    run_queue_size2 = get_run_queue_size ();
    num_switches = get_num_switches () - switches1;
    if (!use_threads) kill (child, SIGTERM);
    fprintf (stderr, "Started %u hog processes\n", num_hogs);
    fprintf (stderr, "Syscall%s overhead: %.1f us\n", frob_fpu ? "/FPU" : "",
             (double) overhead / 20.0);
    if (switches1 > 0)
        fprintf (stderr, "Num switches during overhead check: %lu\n",
                 num_overhead_switches);
    fprintf (stderr, "Minimum scheduling latency: %.1f (%.1f) us\n",
             (double) min_diff / 20.0, (double) (min_diff - overhead) / 20.0);
    median = compute_median (diffs, max_diff);
    fprintf (stderr, "Median scheduling latency: %.1f (%.1f) us\n",
             (double) median / 20.0, (double) (median - overhead) / 20.0);
    fprintf (stderr, "Average scheduling latency: %.1f (%.1f) us\n",
             (double) total_diffs / (double) MAX_ITERATIONS / 20.0,
             (double) (total_diffs - overhead * MAX_ITERATIONS) /
             (double) MAX_ITERATIONS / 20.0);
    fprintf (stderr, "Maximum scheduling latency: %.1f (%.1f) us\n",
             (double) max_diff / 20.0, (double) (max_diff - overhead) / 20.0);
    fprintf (stderr, "Run queue size: %u, %u\n",
             run_queue_size1, run_queue_size2);
    use_fpu_value (dcount);
    if (use_threads) fprintf (stderr, "Number of yields: %u\n", num_yields);
    if (num_switches > 0) fprintf (stderr, "Num switches: %lu\n", num_switches);
    /*  Finish up  */
    kill (0, SIGTERM);
    return (0);
}   /*  End Function main  */

static unsigned int hog_other_cpus ()
/*  [SUMMARY] Hog other CPUs with a high-priority job.
    [RETURNS] The number of hogged CPUs.
*/
{
    unsigned int count;

    for (count = mt_num_processors (); count > 1; --count)
    {
        switch ( fork () )
        {
          case 0:
            /*  Child  */
            while (TRUE);
            break;
          case -1:
            /*  Error  */
            fprintf (stderr, "Error forking\t%s\n", ERRSTRING);
            kill (0, SIGTERM);
            break;
          default:
            /*  Parent  */
            break;
        }
    }
    return mt_num_processors () - 1;
}   /*  End Function hog_other_cpus  */

static void run_yielder (int use_threads, int read_fd)
/*  [SUMMARY] Run other process which will continuously yield.
    <use_threads> If TRUE, the yielding process is just a thread.
    <read_fd> The pipe to read the synchronisation byte from.
    [RETURNS] Nothing.
*/
{
    char ch;
    struct sigaction new_action;
#ifdef _POSIX_THREADS
    pthread_t thread;

    if (use_threads)
    {
        if (pthread_create (&thread, NULL, yielder_main, NULL) != 0)
        {
            fprintf (stderr, "Error creating thread\t%s\n", ERRSTRING);
            kill (0, SIGTERM);
        }
        read (read_fd, &ch, 1);
        return;
    }
#endif
    switch ( child = fork () )
    {
      case 0:
        /*  Child  */
        break;
      case -1:
        /*  Error  */
        fprintf (stderr, "Error forking\t%s\n", ERRSTRING);
        kill (0, SIGTERM);
        break;
      default:
        /*  Parent  */
        read (read_fd, &ch, 1);
        return;
        /*break;*/
    }
    memset (&new_action, 0, sizeof new_action);
    sigemptyset (&new_action.sa_mask);
    new_action.sa_handler = s_term_handler;
    if (sigaction (SIGTERM, &new_action, NULL) != 0)
    {
        fprintf (stderr, "Error setting SIGTERM handler\t%s\n", ERRSTRING);
        exit (1);
    }
    yielder_main (NULL);
}   /*  End Function run_yielder  */

static void *yielder_main (void *arg)
/*  [SUMMARY] Yielder function.
    <arg> An arbitrary pointer. Ignored.
    [RETURNS] NULL.
*/
{
    char ch = 0;

    sched_count = 0;
    write (pipe_write_fd, &ch, 1);
    while (TRUE)
    {
        if (pipe_read_fd >= 0)
        {
            read (pipe_read_fd, &ch, 1);
            write (pipe_write_fd, &ch, 1);
        }
        else sched_yield ();
        ++sched_count;
    }
    return (NULL);
}   /*  End Function yielder_main  */

static void s_term_handler ()
{
    fprintf (stderr, "Number of yields: %u\n", sched_count);
    exit (0);
}   /*  End Function s_term_handler  */

static void run_low_priority (unsigned int num, int read_fd)
/*  [SUMMARY] Run low priority processes.
    <num> Number of processes.
    <read_fd> The pipe to read the synchronisation byte from.
    [RETURNS] Nothing.
*/
{
    char ch = 0;

    for (; num > 0; --num)
    {
        switch ( fork () )
        {
          case 0:
            /*  Child  */
            if (geteuid () == 0)
            {
                struct sched_param sp;

                memset (&sp, 0, sizeof sp);
                sp.sched_priority = 0;
                if (sched_setscheduler (0, SCHED_OTHER, &sp) != 0)
                {
                    fprintf (stderr,
                             "Error changing to SCHED_OTHER class\t%s\n",
                             ERRSTRING);
                    exit (1);
                }
            }
            if (nice (20) != 0)
            {
                fprintf (stderr, "Error nicing\t%s\n", ERRSTRING);
                kill (0, SIGTERM);
            }
            write (pipe_write_fd, &ch, 1);
            while (TRUE) sched_yield ();
            break;
          case -1:
            /*  Error  */
            fprintf (stderr, "Error forking\t%s\n", ERRSTRING);
            kill (0, SIGTERM);
            break;
          default:
            /*  Parent  */
            read (read_fd, &ch, 1);
            break;
        }
    }
}   /*  End Function run_low_priority  */

static unsigned long compute_median (unsigned long values[MAX_ITERATIONS],
                                     unsigned long max_value)
/*  [SUMMARY] Compute the median from an array of values.
    <values> The array of values.
    <max_value> The maximum value in the array.
    [RETURNS] The median value.
*/
{
    unsigned long count;
    unsigned long median = 0;
    unsigned long peak = 0;
    unsigned long *table;

    /*  Crude but effective  */
    if ( ( table = calloc (max_value + 1, sizeof *table) ) == NULL )
    {
        fprintf (stderr, "Error allocating median table\n");
        exit (1);
    }
    for (count = 0; count < MAX_ITERATIONS; ++count)
    {
        ++table[ values[count] ];
    }
    /*  Now search for peak. Position of peak is median  */
    for (count = 0; count < max_value + 1; ++count)
    {
        if (table[count] < peak) continue;
        peak = table[count];
        median = count;
    }
    return (median);
}   /*  End Function compute_median  */

static unsigned int get_run_queue_size ()
/*  [SUMMARY] Compute the current size of the run queue.
    [RETURNS] The length of the run queue.
*/
{
    int dummy_i;
    unsigned int length = 0;
    FILE *fp;
    DIR *dp;
    struct dirent *de;
    char txt[64], dummy_str[64];

    if ( ( dp = opendir ("/proc") ) == NULL ) return (0);
    while ( ( de = readdir (dp) ) != NULL )
    {
        if ( !isdigit (de->d_name[0]) ) continue;
        sprintf (txt, "/proc/%s/stat", de->d_name);
        if ( ( fp = fopen (txt, "r") ) == NULL ) return (length);
        fscanf (fp, "%d %s %s", &dummy_i, dummy_str, txt);
        if (txt[0] == 'R') ++length;
        fclose (fp);
    }
    closedir (dp);
    return (length);
}   /*  End Function get_run_queue_size  */

static unsigned long get_num_switches ()
/*  [SUMMARY] Get the number of context switches.
    [RETURNS] The number of context switches on success, else 0.
*/
{
    unsigned long val;
    FILE *fp;
    char line[256], name[64];

    if ( ( fp = fopen ("/proc/stat", "r") ) == NULL ) return (0);
    while (fgets (line, sizeof line, fp) != NULL)
    {
        sscanf (line, "%s %lu", name, &val);
        if (strcasecmp (name, "ctxt") != 0) continue;
        fclose (fp);
        return (val);
    }
    fclose (fp);
    return (0);
}   /*  End Function get_num_switches  */

static void use_fpu_value (double val)
/*  [SUMMARY] Dummy function to consume FPU value. Fools compiler.
    <val> The value.
    [RETURNS] Nothing.
*/
{
}   /*  End Function use_fpu_value  */
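
The options listed in the usage string above can be combined to reproduce the different tables. For example (assuming the programme has been compiled to a binary called time-schedule and is run as root, so the RT class and mlockall() are used):

    time-schedule -thread 10    #  thread switches, sched_yield(), 10 extra
                                #  low-priority processes on the run queue
    time-schedule -pipe -fpu    #  process switches using pipe-based token
                                #  passing, with FPU pollution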