From 849369d6c66d3054688672f97d31fceb8e8230fb Mon Sep 17 00:00:00 2001 From: root Date: Fri, 25 Dec 2015 04:40:36 +0000 Subject: initial_commit --- Documentation/timers/00-INDEX | 12 ++ Documentation/timers/Makefile | 8 + Documentation/timers/highres.txt | 249 ++++++++++++++++++++++++++++ Documentation/timers/hpet.txt | 30 ++++ Documentation/timers/hpet_example.c | 294 ++++++++++++++++++++++++++++++++++ Documentation/timers/hrtimers.txt | 178 ++++++++++++++++++++ Documentation/timers/timer_stats.txt | 73 +++++++++ Documentation/timers/timers-howto.txt | 105 ++++++++++++ 8 files changed, 949 insertions(+) create mode 100644 Documentation/timers/00-INDEX create mode 100644 Documentation/timers/Makefile create mode 100644 Documentation/timers/highres.txt create mode 100644 Documentation/timers/hpet.txt create mode 100644 Documentation/timers/hpet_example.c create mode 100644 Documentation/timers/hrtimers.txt create mode 100644 Documentation/timers/timer_stats.txt create mode 100644 Documentation/timers/timers-howto.txt (limited to 'Documentation/timers') diff --git a/Documentation/timers/00-INDEX b/Documentation/timers/00-INDEX new file mode 100644 index 00000000..a9248da5 --- /dev/null +++ b/Documentation/timers/00-INDEX @@ -0,0 +1,12 @@ +00-INDEX + - this file +highres.txt + - High resolution timers and dynamic ticks design notes +hpet.txt + - High Precision Event Timer Driver for Linux +hpet_example.c + - sample hpet timer test program +hrtimers.txt + - subsystem for high-resolution kernel timers +timer_stats.txt + - timer usage statistics diff --git a/Documentation/timers/Makefile b/Documentation/timers/Makefile new file mode 100644 index 00000000..73f75f8a --- /dev/null +++ b/Documentation/timers/Makefile @@ -0,0 +1,8 @@ +# kbuild trick to avoid linker error. Can be omitted if a module is built. +obj- := dummy.o + +# List of programs to build +hostprogs-$(CONFIG_X86) := hpet_example + +# Tell kbuild to always build the programs +always := $(hostprogs-y) diff --git a/Documentation/timers/highres.txt b/Documentation/timers/highres.txt new file mode 100644 index 00000000..21332233 --- /dev/null +++ b/Documentation/timers/highres.txt @@ -0,0 +1,249 @@ +High resolution timers and dynamic ticks design notes +----------------------------------------------------- + +Further information can be found in the paper of the OLS 2006 talk "hrtimers +and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can +be found on the OLS website: +http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf + +The slides to this talk are available from: +http://tglx.de/projects/hrtimers/ols2006-hrtimers.pdf + +The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the +changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the +design of the Linux time(r) system before hrtimers and other building blocks +got merged into mainline. + +Note: the paper and the slides are talking about "clock event source", while we +switched to the name "clock event devices" in meantime. + +The design contains the following basic building blocks: + +- hrtimer base infrastructure +- timeofday and clock source management +- clock event management +- high resolution timer functionality +- dynamic ticks + + +hrtimer base infrastructure +--------------------------- + +The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of +the base implementation are covered in Documentation/hrtimers/hrtimer.txt. See +also figure #2 (OLS slides p. 15) + +The main differences to the timer wheel, which holds the armed timer_list type +timers are: + - time ordered enqueueing into a rb-tree + - independent of ticks (the processing is based on nanoseconds) + + +timeofday and clock source management +------------------------------------- + +John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of +code out of the architecture-specific areas into a generic management +framework, as illustrated in figure #3 (OLS slides p. 18). The architecture +specific portion is reduced to the low level hardware details of the clock +sources, which are registered in the framework and selected on a quality based +decision. The low level code provides hardware setup and readout routines and +initializes data structures, which are used by the generic time keeping code to +convert the clock ticks to nanosecond based time values. All other time keeping +related functionality is moved into the generic code. The GTOD base patch got +merged into the 2.6.18 kernel. + +Further information about the Generic Time Of Day framework is available in the +OLS 2005 Proceedings Volume 1: +http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf + +The paper "We Are Not Getting Any Younger: A New Approach to Time and +Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan. + +Figure #3 (OLS slides p.18) illustrates the transformation. + + +clock event management +---------------------- + +While clock sources provide read access to the monotonically increasing time +value, clock event devices are used to schedule the next event +interrupt(s). The next event is currently defined to be periodic, with its +period defined at compile time. The setup and selection of the event device +for various event driven functionalities is hardwired into the architecture +dependent code. This results in duplicated code across all architectures and +makes it extremely difficult to change the configuration of the system to use +event interrupt devices other than those already built into the +architecture. Another implication of the current design is that it is necessary +to touch all the architecture-specific implementations in order to provide new +functionality like high resolution timers or dynamic ticks. + +The clock events subsystem tries to address this problem by providing a generic +solution to manage clock event devices and their usage for the various clock +event driven kernel functionalities. The goal of the clock event subsystem is +to minimize the clock event related architecture dependent code to the pure +hardware related handling and to allow easy addition and utilization of new +clock event devices. It also minimizes the duplicated code across the +architectures as it provides generic functionality down to the interrupt +service handler, which is almost inherently hardware dependent. + +Clock event devices are registered either by the architecture dependent boot +code or at module insertion time. Each clock event device fills a data +structure with clock-specific property parameters and callback functions. The +clock event management decides, by using the specified property parameters, the +set of system functions a clock event device will be used to support. This +includes the distinction of per-CPU and per-system global event devices. + +System-level global event devices are used for the Linux periodic tick. Per-CPU +event devices are used to provide local CPU functionality such as process +accounting, profiling, and high resolution timers. + +The management layer assigns one or more of the following functions to a clock +event device: + - system global periodic tick (jiffies update) + - cpu local update_process_times + - cpu local profiling + - cpu local next event interrupt (non periodic mode) + +The clock event device delegates the selection of those timer interrupt related +functions completely to the management layer. The clock management layer stores +a function pointer in the device description structure, which has to be called +from the hardware level handler. This removes a lot of duplicated code from the +architecture specific timer interrupt handlers and hands the control over the +clock event devices and the assignment of timer interrupt related functionality +to the core code. + +The clock event layer API is rather small. Aside from the clock event device +registration interface it provides functions to schedule the next event +interrupt, clock event device notification service and support for suspend and +resume. + +The framework adds about 700 lines of code which results in a 2KB increase of +the kernel binary size. The conversion of i386 removes about 100 lines of +code. The binary size decrease is in the range of 400 byte. We believe that the +increase of flexibility and the avoidance of duplicated code across +architectures justifies the slight increase of the binary size. + +The conversion of an architecture has no functional impact, but allows to +utilize the high resolution and dynamic tick functionalities without any change +to the clock event device and timer interrupt code. After the conversion the +enabling of high resolution timers and dynamic ticks is simply provided by +adding the kernel/time/Kconfig file to the architecture specific Kconfig and +adding the dynamic tick specific calls to the idle routine (a total of 3 lines +added to the idle function and the Kconfig file) + +Figure #4 (OLS slides p.20) illustrates the transformation. + + +high resolution timer functionality +----------------------------------- + +During system boot it is not possible to use the high resolution timer +functionality, while making it possible would be difficult and would serve no +useful function. The initialization of the clock event device framework, the +clock source framework (GTOD) and hrtimers itself has to be done and +appropriate clock sources and clock event devices have to be registered before +the high resolution functionality can work. Up to the point where hrtimers are +initialized, the system works in the usual low resolution periodic mode. The +clock source and the clock event device layers provide notification functions +which inform hrtimers about availability of new hardware. hrtimers validates +the usability of the registered clock sources and clock event devices before +switching to high resolution mode. This ensures also that a kernel which is +configured for high resolution timers can run on a system which lacks the +necessary hardware support. + +The high resolution timer code does not support SMP machines which have only +global clock event devices. The support of such hardware would involve IPI +calls when an interrupt happens. The overhead would be much larger than the +benefit. This is the reason why we currently disable high resolution and +dynamic ticks on i386 SMP systems which stop the local APIC in C3 power +state. A workaround is available as an idea, but the problem has not been +tackled yet. + +The time ordered insertion of timers provides all the infrastructure to decide +whether the event device has to be reprogrammed when a timer is added. The +decision is made per timer base and synchronized across per-cpu timer bases in +a support function. The design allows the system to utilize separate per-CPU +clock event devices for the per-CPU timer bases, but currently only one +reprogrammable clock event device per-CPU is utilized. + +When the timer interrupt happens, the next event interrupt handler is called +from the clock event distribution code and moves expired timers from the +red-black tree to a separate double linked list and invokes the softirq +handler. An additional mode field in the hrtimer structure allows the system to +execute callback functions directly from the next event interrupt handler. This +is restricted to code which can safely be executed in the hard interrupt +context. This applies, for example, to the common case of a wakeup function as +used by nanosleep. The advantage of executing the handler in the interrupt +context is the avoidance of up to two context switches - from the interrupted +context to the softirq and to the task which is woken up by the expired +timer. + +Once a system has switched to high resolution mode, the periodic tick is +switched off. This disables the per system global periodic clock event device - +e.g. the PIT on i386 SMP systems. + +The periodic tick functionality is provided by an per-cpu hrtimer. The callback +function is executed in the next event interrupt context and updates jiffies +and calls update_process_times and profiling. The implementation of the hrtimer +based periodic tick is designed to be extended with dynamic tick functionality. +This allows to use a single clock event device to schedule high resolution +timer and periodic events (jiffies tick, profiling, process accounting) on UP +systems. This has been proved to work with the PIT on i386 and the Incrementer +on PPC. + +The softirq for running the hrtimer queues and executing the callbacks has been +separated from the tick bound timer softirq to allow accurate delivery of high +resolution timer signals which are used by itimer and POSIX interval +timers. The execution of this softirq can still be delayed by other softirqs, +but the overall latencies have been significantly improved by this separation. + +Figure #5 (OLS slides p.22) illustrates the transformation. + + +dynamic ticks +------------- + +Dynamic ticks are the logical consequence of the hrtimer based periodic tick +replacement (sched_tick). The functionality of the sched_tick hrtimer is +extended by three functions: + +- hrtimer_stop_sched_tick +- hrtimer_restart_sched_tick +- hrtimer_update_jiffies + +hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code +evaluates the next scheduled timer event (from both hrtimers and the timer +wheel) and in case that the next event is further away than the next tick it +reprograms the sched_tick to this future event, to allow longer idle sleeps +without worthless interruption by the periodic tick. The function is also +called when an interrupt happens during the idle period, which does not cause a +reschedule. The call is necessary as the interrupt handler might have armed a +new timer whose expiry time is before the time which was identified as the +nearest event in the previous call to hrtimer_stop_sched_tick. + +hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before +it calls schedule(). hrtimer_restart_sched_tick() resumes the periodic tick, +which is kept active until the next call to hrtimer_stop_sched_tick(). + +hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens +in the idle period to make sure that jiffies are up to date and the interrupt +handler has not to deal with an eventually stale jiffy value. + +The dynamic tick feature provides statistical values which are exported to +userspace via /proc/stats and can be made available for enhanced power +management control. + +The implementation leaves room for further development like full tickless +systems, where the time slice is controlled by the scheduler, variable +frequency profiling, and a complete removal of jiffies in the future. + + +Aside the current initial submission of i386 support, the patchset has been +extended to x86_64 and ARM already. Initial (work in progress) support is also +available for MIPS and PowerPC. + + Thomas, Ingo + + + diff --git a/Documentation/timers/hpet.txt b/Documentation/timers/hpet.txt new file mode 100644 index 00000000..767392ff --- /dev/null +++ b/Documentation/timers/hpet.txt @@ -0,0 +1,30 @@ + High Precision Event Timer Driver for Linux + +The High Precision Event Timer (HPET) hardware follows a specification +by Intel and Microsoft which can be found at + + http://www.intel.com/hardwaredesign/hpetspec_1.pdf + +Each HPET has one fixed-rate counter (at 10+ MHz, hence "High Precision") +and up to 32 comparators. Normally three or more comparators are provided, +each of which can generate oneshot interrupts and at least one of which has +additional hardware to support periodic interrupts. The comparators are +also called "timers", which can be misleading since usually timers are +independent of each other ... these share a counter, complicating resets. + +HPET devices can support two interrupt routing modes. In one mode, the +comparators are additional interrupt sources with no particular system +role. Many x86 BIOS writers don't route HPET interrupts at all, which +prevents use of that mode. They support the other "legacy replacement" +mode where the first two comparators block interrupts from 8254 timers +and from the RTC. + +The driver supports detection of HPET driver allocation and initialization +of the HPET before the driver module_init routine is called. This enables +platform code which uses timer 0 or 1 as the main timer to intercept HPET +initialization. An example of this initialization can be found in +arch/x86/kernel/hpet.c. + +The driver provides a userspace API which resembles the API found in the +RTC driver framework. An example user space program is provided in +file:Documentation/timers/hpet_example.c diff --git a/Documentation/timers/hpet_example.c b/Documentation/timers/hpet_example.c new file mode 100644 index 00000000..9a3e7012 --- /dev/null +++ b/Documentation/timers/hpet_example.c @@ -0,0 +1,294 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +extern void hpet_open_close(int, const char **); +extern void hpet_info(int, const char **); +extern void hpet_poll(int, const char **); +extern void hpet_fasync(int, const char **); +extern void hpet_read(int, const char **); + +#include +#include + +struct hpet_command { + char *command; + void (*func)(int argc, const char ** argv); +} hpet_command[] = { + { + "open-close", + hpet_open_close + }, + { + "info", + hpet_info + }, + { + "poll", + hpet_poll + }, + { + "fasync", + hpet_fasync + }, +}; + +int +main(int argc, const char ** argv) +{ + int i; + + argc--; + argv++; + + if (!argc) { + fprintf(stderr, "-hpet: requires command\n"); + return -1; + } + + + for (i = 0; i < (sizeof (hpet_command) / sizeof (hpet_command[0])); i++) + if (!strcmp(argv[0], hpet_command[i].command)) { + argc--; + argv++; + fprintf(stderr, "-hpet: executing %s\n", + hpet_command[i].command); + hpet_command[i].func(argc, argv); + return 0; + } + + fprintf(stderr, "do_hpet: command %s not implemented\n", argv[0]); + + return -1; +} + +void +hpet_open_close(int argc, const char **argv) +{ + int fd; + + if (argc != 1) { + fprintf(stderr, "hpet_open_close: device-name\n"); + return; + } + + fd = open(argv[0], O_RDONLY); + if (fd < 0) + fprintf(stderr, "hpet_open_close: open failed\n"); + else + close(fd); + + return; +} + +void +hpet_info(int argc, const char **argv) +{ + struct hpet_info info; + int fd; + + if (argc != 1) { + fprintf(stderr, "hpet_info: device-name\n"); + return; + } + + fd = open(argv[0], O_RDONLY); + if (fd < 0) { + fprintf(stderr, "hpet_info: open of %s failed\n", argv[0]); + return; + } + + if (ioctl(fd, HPET_INFO, &info) < 0) { + fprintf(stderr, "hpet_info: failed to get info\n"); + goto out; + } + + fprintf(stderr, "hpet_info: hi_irqfreq 0x%lx hi_flags 0x%lx ", + info.hi_ireqfreq, info.hi_flags); + fprintf(stderr, "hi_hpet %d hi_timer %d\n", + info.hi_hpet, info.hi_timer); + +out: + close(fd); + return; +} + +void +hpet_poll(int argc, const char **argv) +{ + unsigned long freq; + int iterations, i, fd; + struct pollfd pfd; + struct hpet_info info; + struct timeval stv, etv; + struct timezone tz; + long usec; + + if (argc != 3) { + fprintf(stderr, "hpet_poll: device-name freq iterations\n"); + return; + } + + freq = atoi(argv[1]); + iterations = atoi(argv[2]); + + fd = open(argv[0], O_RDONLY); + + if (fd < 0) { + fprintf(stderr, "hpet_poll: open of %s failed\n", argv[0]); + return; + } + + if (ioctl(fd, HPET_IRQFREQ, freq) < 0) { + fprintf(stderr, "hpet_poll: HPET_IRQFREQ failed\n"); + goto out; + } + + if (ioctl(fd, HPET_INFO, &info) < 0) { + fprintf(stderr, "hpet_poll: failed to get info\n"); + goto out; + } + + fprintf(stderr, "hpet_poll: info.hi_flags 0x%lx\n", info.hi_flags); + + if (info.hi_flags && (ioctl(fd, HPET_EPI, 0) < 0)) { + fprintf(stderr, "hpet_poll: HPET_EPI failed\n"); + goto out; + } + + if (ioctl(fd, HPET_IE_ON, 0) < 0) { + fprintf(stderr, "hpet_poll, HPET_IE_ON failed\n"); + goto out; + } + + pfd.fd = fd; + pfd.events = POLLIN; + + for (i = 0; i < iterations; i++) { + pfd.revents = 0; + gettimeofday(&stv, &tz); + if (poll(&pfd, 1, -1) < 0) + fprintf(stderr, "hpet_poll: poll failed\n"); + else { + long data; + + gettimeofday(&etv, &tz); + usec = stv.tv_sec * 1000000 + stv.tv_usec; + usec = (etv.tv_sec * 1000000 + etv.tv_usec) - usec; + + fprintf(stderr, + "hpet_poll: expired time = 0x%lx\n", usec); + + fprintf(stderr, "hpet_poll: revents = 0x%x\n", + pfd.revents); + + if (read(fd, &data, sizeof(data)) != sizeof(data)) { + fprintf(stderr, "hpet_poll: read failed\n"); + } + else + fprintf(stderr, "hpet_poll: data 0x%lx\n", + data); + } + } + +out: + close(fd); + return; +} + +static int hpet_sigio_count; + +static void +hpet_sigio(int val) +{ + fprintf(stderr, "hpet_sigio: called\n"); + hpet_sigio_count++; +} + +void +hpet_fasync(int argc, const char **argv) +{ + unsigned long freq; + int iterations, i, fd, value; + sig_t oldsig; + struct hpet_info info; + + hpet_sigio_count = 0; + fd = -1; + + if ((oldsig = signal(SIGIO, hpet_sigio)) == SIG_ERR) { + fprintf(stderr, "hpet_fasync: failed to set signal handler\n"); + return; + } + + if (argc != 3) { + fprintf(stderr, "hpet_fasync: device-name freq iterations\n"); + goto out; + } + + fd = open(argv[0], O_RDONLY); + + if (fd < 0) { + fprintf(stderr, "hpet_fasync: failed to open %s\n", argv[0]); + return; + } + + + if ((fcntl(fd, F_SETOWN, getpid()) == 1) || + ((value = fcntl(fd, F_GETFL)) == 1) || + (fcntl(fd, F_SETFL, value | O_ASYNC) == 1)) { + fprintf(stderr, "hpet_fasync: fcntl failed\n"); + goto out; + } + + freq = atoi(argv[1]); + iterations = atoi(argv[2]); + + if (ioctl(fd, HPET_IRQFREQ, freq) < 0) { + fprintf(stderr, "hpet_fasync: HPET_IRQFREQ failed\n"); + goto out; + } + + if (ioctl(fd, HPET_INFO, &info) < 0) { + fprintf(stderr, "hpet_fasync: failed to get info\n"); + goto out; + } + + fprintf(stderr, "hpet_fasync: info.hi_flags 0x%lx\n", info.hi_flags); + + if (info.hi_flags && (ioctl(fd, HPET_EPI, 0) < 0)) { + fprintf(stderr, "hpet_fasync: HPET_EPI failed\n"); + goto out; + } + + if (ioctl(fd, HPET_IE_ON, 0) < 0) { + fprintf(stderr, "hpet_fasync, HPET_IE_ON failed\n"); + goto out; + } + + for (i = 0; i < iterations; i++) { + (void) pause(); + fprintf(stderr, "hpet_fasync: count = %d\n", hpet_sigio_count); + } + +out: + signal(SIGIO, oldsig); + + if (fd >= 0) + close(fd); + + return; +} diff --git a/Documentation/timers/hrtimers.txt b/Documentation/timers/hrtimers.txt new file mode 100644 index 00000000..ce31f65e --- /dev/null +++ b/Documentation/timers/hrtimers.txt @@ -0,0 +1,178 @@ + +hrtimers - subsystem for high-resolution kernel timers +---------------------------------------------------- + +This patch introduces a new subsystem for high-resolution kernel timers. + +One might ask the question: we already have a timer subsystem +(kernel/timers.c), why do we need two timer subsystems? After a lot of +back and forth trying to integrate high-resolution and high-precision +features into the existing timer framework, and after testing various +such high-resolution timer implementations in practice, we came to the +conclusion that the timer wheel code is fundamentally not suitable for +such an approach. We initially didn't believe this ('there must be a way +to solve this'), and spent a considerable effort trying to integrate +things into the timer wheel, but we failed. In hindsight, there are +several reasons why such integration is hard/impossible: + +- the forced handling of low-resolution and high-resolution timers in + the same way leads to a lot of compromises, macro magic and #ifdef + mess. The timers.c code is very "tightly coded" around jiffies and + 32-bitness assumptions, and has been honed and micro-optimized for a + relatively narrow use case (jiffies in a relatively narrow HZ range) + for many years - and thus even small extensions to it easily break + the wheel concept, leading to even worse compromises. The timer wheel + code is very good and tight code, there's zero problems with it in its + current usage - but it is simply not suitable to be extended for + high-res timers. + +- the unpredictable [O(N)] overhead of cascading leads to delays which + necessitate a more complex handling of high resolution timers, which + in turn decreases robustness. Such a design still led to rather large + timing inaccuracies. Cascading is a fundamental property of the timer + wheel concept, it cannot be 'designed out' without unevitably + degrading other portions of the timers.c code in an unacceptable way. + +- the implementation of the current posix-timer subsystem on top of + the timer wheel has already introduced a quite complex handling of + the required readjusting of absolute CLOCK_REALTIME timers at + settimeofday or NTP time - further underlying our experience by + example: that the timer wheel data structure is too rigid for high-res + timers. + +- the timer wheel code is most optimal for use cases which can be + identified as "timeouts". Such timeouts are usually set up to cover + error conditions in various I/O paths, such as networking and block + I/O. The vast majority of those timers never expire and are rarely + recascaded because the expected correct event arrives in time so they + can be removed from the timer wheel before any further processing of + them becomes necessary. Thus the users of these timeouts can accept + the granularity and precision tradeoffs of the timer wheel, and + largely expect the timer subsystem to have near-zero overhead. + Accurate timing for them is not a core purpose - in fact most of the + timeout values used are ad-hoc. For them it is at most a necessary + evil to guarantee the processing of actual timeout completions + (because most of the timeouts are deleted before completion), which + should thus be as cheap and unintrusive as possible. + +The primary users of precision timers are user-space applications that +utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel +users like drivers and subsystems which require precise timed events +(e.g. multimedia) can benefit from the availability of a separate +high-resolution timer subsystem as well. + +While this subsystem does not offer high-resolution clock sources just +yet, the hrtimer subsystem can be easily extended with high-resolution +clock capabilities, and patches for that exist and are maturing quickly. +The increasing demand for realtime and multimedia applications along +with other potential users for precise timers gives another reason to +separate the "timeout" and "precise timer" subsystems. + +Another potential benefit is that such a separation allows even more +special-purpose optimization of the existing timer wheel for the low +resolution and low precision use cases - once the precision-sensitive +APIs are separated from the timer wheel and are migrated over to +hrtimers. E.g. we could decrease the frequency of the timeout subsystem +from 250 Hz to 100 HZ (or even smaller). + +hrtimer subsystem implementation details +---------------------------------------- + +the basic design considerations were: + +- simplicity + +- data structure not bound to jiffies or any other granularity. All the + kernel logic works at 64-bit nanoseconds resolution - no compromises. + +- simplification of existing, timing related kernel code + +another basic requirement was the immediate enqueueing and ordering of +timers at activation time. After looking at several possible solutions +such as radix trees and hashes, we chose the red black tree as the basic +data structure. Rbtrees are available as a library in the kernel and are +used in various performance-critical areas of e.g. memory management and +file systems. The rbtree is solely used for time sorted ordering, while +a separate list is used to give the expiry code fast access to the +queued timers, without having to walk the rbtree. + +(This separate list is also useful for later when we'll introduce +high-resolution clocks, where we need separate pending and expired +queues while keeping the time-order intact.) + +Time-ordered enqueueing is not purely for the purposes of +high-resolution clocks though, it also simplifies the handling of +absolute timers based on a low-resolution CLOCK_REALTIME. The existing +implementation needed to keep an extra list of all armed absolute +CLOCK_REALTIME timers along with complex locking. In case of +settimeofday and NTP, all the timers (!) had to be dequeued, the +time-changing code had to fix them up one by one, and all of them had to +be enqueued again. The time-ordered enqueueing and the storage of the +expiry time in absolute time units removes all this complex and poorly +scaling code from the posix-timer implementation - the clock can simply +be set without having to touch the rbtree. This also makes the handling +of posix-timers simpler in general. + +The locking and per-CPU behavior of hrtimers was mostly taken from the +existing timer wheel code, as it is mature and well suited. Sharing code +was not really a win, due to the different data structures. Also, the +hrtimer functions now have clearer behavior and clearer names - such as +hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly +equivalent to del_timer() and del_timer_sync()] - so there's no direct +1:1 mapping between them on the algorithmical level, and thus no real +potential for code sharing either. + +Basic data types: every time value, absolute or relative, is in a +special nanosecond-resolution type: ktime_t. The kernel-internal +representation of ktime_t values and operations is implemented via +macros and inline functions, and can be switched between a "hybrid +union" type and a plain "scalar" 64bit nanoseconds representation (at +compile time). The hybrid union type optimizes time conversions on 32bit +CPUs. This build-time-selectable ktime_t storage format was implemented +to avoid the performance impact of 64-bit multiplications and divisions +on 32bit CPUs. Such operations are frequently necessary to convert +between the storage formats provided by kernel and userspace interfaces +and the internal time format. (See include/linux/ktime.h for further +details.) + +hrtimers - rounding of timer values +----------------------------------- + +the hrtimer code will round timer events to lower-resolution clocks +because it has to. Otherwise it will do no artificial rounding at all. + +one question is, what resolution value should be returned to the user by +the clock_getres() interface. This will return whatever real resolution +a given clock has - be it low-res, high-res, or artificially-low-res. + +hrtimers - testing and verification +---------------------------------- + +We used the high-resolution clock subsystem ontop of hrtimers to verify +the hrtimer implementation details in praxis, and we also ran the posix +timer tests in order to ensure specification compliance. We also ran +tests on low-resolution clocks. + +The hrtimer patch converts the following kernel functionality to use +hrtimers: + + - nanosleep + - itimers + - posix-timers + +The conversion of nanosleep and posix-timers enabled the unification of +nanosleep and clock_nanosleep. + +The code was successfully compiled for the following platforms: + + i386, x86_64, ARM, PPC, PPC64, IA64 + +The code was run-tested on the following platforms: + + i386(UP/SMP), x86_64(UP/SMP), ARM, PPC + +hrtimers were also integrated into the -rt tree, along with a +hrtimers-based high-resolution clock implementation, so the hrtimers +code got a healthy amount of testing and use in practice. + + Thomas Gleixner, Ingo Molnar diff --git a/Documentation/timers/timer_stats.txt b/Documentation/timers/timer_stats.txt new file mode 100644 index 00000000..8abd40b2 --- /dev/null +++ b/Documentation/timers/timer_stats.txt @@ -0,0 +1,73 @@ +timer_stats - timer usage statistics +------------------------------------ + +timer_stats is a debugging facility to make the timer (ab)usage in a Linux +system visible to kernel and userspace developers. If enabled in the config +but not used it has almost zero runtime overhead, and a relatively small +data structure overhead. Even if collection is enabled runtime all the +locking is per-CPU and lookup is hashed. + +timer_stats should be used by kernel and userspace developers to verify that +their code does not make unduly use of timers. This helps to avoid unnecessary +wakeups, which should be avoided to optimize power consumption. + +It can be enabled by CONFIG_TIMER_STATS in the "Kernel hacking" configuration +section. + +timer_stats collects information about the timer events which are fired in a +Linux system over a sample period: + +- the pid of the task(process) which initialized the timer +- the name of the process which initialized the timer +- the function where the timer was initialized +- the callback function which is associated to the timer +- the number of events (callbacks) + +timer_stats adds an entry to /proc: /proc/timer_stats + +This entry is used to control the statistics functionality and to read out the +sampled information. + +The timer_stats functionality is inactive on bootup. + +To activate a sample period issue: +# echo 1 >/proc/timer_stats + +To stop a sample period issue: +# echo 0 >/proc/timer_stats + +The statistics can be retrieved by: +# cat /proc/timer_stats + +The readout of /proc/timer_stats automatically disables sampling. The sampled +information is kept until a new sample period is started. This allows multiple +readouts. + +Sample output of /proc/timer_stats: + +Timerstats sample period: 3.888770 s + 12, 0 swapper hrtimer_stop_sched_tick (hrtimer_sched_tick) + 15, 1 swapper hcd_submit_urb (rh_timer_func) + 4, 959 kedac schedule_timeout (process_timeout) + 1, 0 swapper page_writeback_init (wb_timer_fn) + 28, 0 swapper hrtimer_stop_sched_tick (hrtimer_sched_tick) + 22, 2948 IRQ 4 tty_flip_buffer_push (delayed_work_timer_fn) + 3, 3100 bash schedule_timeout (process_timeout) + 1, 1 swapper queue_delayed_work_on (delayed_work_timer_fn) + 1, 1 swapper queue_delayed_work_on (delayed_work_timer_fn) + 1, 1 swapper neigh_table_init_no_netlink (neigh_periodic_timer) + 1, 2292 ip __netdev_watchdog_up (dev_watchdog) + 1, 23 events/1 do_cache_clean (delayed_work_timer_fn) +90 total events, 30.0 events/sec + +The first column is the number of events, the second column the pid, the third +column is the name of the process. The forth column shows the function which +initialized the timer and in parenthesis the callback function which was +executed on expiry. + + Thomas, Ingo + +Added flag to indicate 'deferrable timer' in /proc/timer_stats. A deferrable +timer will appear as follows + 10D, 1 swapper queue_delayed_work_on (delayed_work_timer_fn) + diff --git a/Documentation/timers/timers-howto.txt b/Documentation/timers/timers-howto.txt new file mode 100644 index 00000000..038f8c77 --- /dev/null +++ b/Documentation/timers/timers-howto.txt @@ -0,0 +1,105 @@ +delays - Information on the various kernel delay / sleep mechanisms +------------------------------------------------------------------- + +This document seeks to answer the common question: "What is the +RightWay (TM) to insert a delay?" + +This question is most often faced by driver writers who have to +deal with hardware delays and who may not be the most intimately +familiar with the inner workings of the Linux Kernel. + + +Inserting Delays +---------------- + +The first, and most important, question you need to ask is "Is my +code in an atomic context?" This should be followed closely by "Does +it really need to delay in atomic context?" If so... + +ATOMIC CONTEXT: + You must use the *delay family of functions. These + functions use the jiffie estimation of clock speed + and will busy wait for enough loop cycles to achieve + the desired delay: + + ndelay(unsigned long nsecs) + udelay(unsigned long usecs) + mdelay(unsigned long msecs) + + udelay is the generally preferred API; ndelay-level + precision may not actually exist on many non-PC devices. + + mdelay is macro wrapper around udelay, to account for + possible overflow when passing large arguments to udelay. + In general, use of mdelay is discouraged and code should + be refactored to allow for the use of msleep. + +NON-ATOMIC CONTEXT: + You should use the *sleep[_range] family of functions. + There are a few more options here, while any of them may + work correctly, using the "right" sleep function will + help the scheduler, power management, and just make your + driver better :) + + -- Backed by busy-wait loop: + udelay(unsigned long usecs) + -- Backed by hrtimers: + usleep_range(unsigned long min, unsigned long max) + -- Backed by jiffies / legacy_timers + msleep(unsigned long msecs) + msleep_interruptible(unsigned long msecs) + + Unlike the *delay family, the underlying mechanism + driving each of these calls varies, thus there are + quirks you should be aware of. + + + SLEEPING FOR "A FEW" USECS ( < ~10us? ): + * Use udelay + + - Why not usleep? + On slower systems, (embedded, OR perhaps a speed- + stepped PC!) the overhead of setting up the hrtimers + for usleep *may* not be worth it. Such an evaluation + will obviously depend on your specific situation, but + it is something to be aware of. + + SLEEPING FOR ~USECS OR SMALL MSECS ( 10us - 20ms): + * Use usleep_range + + - Why not msleep for (1ms - 20ms)? + Explained originally here: + http://lkml.org/lkml/2007/8/3/250 + msleep(1~20) may not do what the caller intends, and + will often sleep longer (~20 ms actual sleep for any + value given in the 1~20ms range). In many cases this + is not the desired behavior. + + - Why is there no "usleep" / What is a good range? + Since usleep_range is built on top of hrtimers, the + wakeup will be very precise (ish), thus a simple + usleep function would likely introduce a large number + of undesired interrupts. + + With the introduction of a range, the scheduler is + free to coalesce your wakeup with any other wakeup + that may have happened for other reasons, or at the + worst case, fire an interrupt for your upper bound. + + The larger a range you supply, the greater a chance + that you will not trigger an interrupt; this should + be balanced with what is an acceptable upper bound on + delay / performance for your specific code path. Exact + tolerances here are very situation specific, thus it + is left to the caller to determine a reasonable range. + + SLEEPING FOR LARGER MSECS ( 10ms+ ) + * Use msleep or possibly msleep_interruptible + + - What's the difference? + msleep sets the current task to TASK_UNINTERRUPTIBLE + whereas msleep_interruptible sets the current task to + TASK_INTERRUPTIBLE before scheduling the sleep. In + short, the difference is whether the sleep can be ended + early by a signal. In general, just use msleep unless + you know you have a need for the interruptible variant. -- cgit v1.2.3