From 849369d6c66d3054688672f97d31fceb8e8230fb Mon Sep 17 00:00:00 2001 From: root Date: Fri, 25 Dec 2015 04:40:36 +0000 Subject: initial_commit --- Documentation/RCU/00-INDEX | 32 ++ Documentation/RCU/NMI-RCU.txt | 120 +++++ Documentation/RCU/RTFP.txt | 875 ++++++++++++++++++++++++++++++++ Documentation/RCU/UP.txt | 135 +++++ Documentation/RCU/arrayRCU.txt | 141 ++++++ Documentation/RCU/checklist.txt | 399 +++++++++++++++ Documentation/RCU/listRCU.txt | 315 ++++++++++++ Documentation/RCU/lockdep.txt | 91 ++++ Documentation/RCU/rcu.txt | 96 ++++ Documentation/RCU/rcubarrier.txt | 311 ++++++++++++ Documentation/RCU/rculist_nulls.txt | 172 +++++++ Documentation/RCU/rcuref.txt | 66 +++ Documentation/RCU/stallwarn.txt | 127 +++++ Documentation/RCU/torture.txt | 201 ++++++++ Documentation/RCU/trace.txt | 617 ++++++++++++++++++++++ Documentation/RCU/whatisRCU.txt | 985 ++++++++++++++++++++++++++++++++++++ 16 files changed, 4683 insertions(+) create mode 100644 Documentation/RCU/00-INDEX create mode 100644 Documentation/RCU/NMI-RCU.txt create mode 100644 Documentation/RCU/RTFP.txt create mode 100644 Documentation/RCU/UP.txt create mode 100644 Documentation/RCU/arrayRCU.txt create mode 100644 Documentation/RCU/checklist.txt create mode 100644 Documentation/RCU/listRCU.txt create mode 100644 Documentation/RCU/lockdep.txt create mode 100644 Documentation/RCU/rcu.txt create mode 100644 Documentation/RCU/rcubarrier.txt create mode 100644 Documentation/RCU/rculist_nulls.txt create mode 100644 Documentation/RCU/rcuref.txt create mode 100644 Documentation/RCU/stallwarn.txt create mode 100644 Documentation/RCU/torture.txt create mode 100644 Documentation/RCU/trace.txt create mode 100644 Documentation/RCU/whatisRCU.txt (limited to 'Documentation/RCU') diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX new file mode 100644 index 00000000..1d7a8857 --- /dev/null +++ b/Documentation/RCU/00-INDEX @@ -0,0 +1,32 @@ +00-INDEX + - This file +arrayRCU.txt + - Using RCU to 
Protect Read-Mostly Arrays +checklist.txt + - Review Checklist for RCU Patches +listRCU.txt + - Using RCU to Protect Read-Mostly Linked Lists +lockdep.txt + - RCU and lockdep checking +NMI-RCU.txt + - Using RCU to Protect Dynamic NMI Handlers +rcubarrier.txt + - RCU and Unloadable Modules +rculist_nulls.txt + - RCU list primitives for use with SLAB_DESTROY_BY_RCU +rcuref.txt + - Reference-count design for elements of lists/arrays protected by RCU +rcu.txt + - RCU Concepts +RTFP.txt + - List of RCU papers (bibliography) going back to 1980. +stallwarn.txt + - RCU CPU stall warnings (module parameter rcu_cpu_stall_suppress) +torture.txt + - RCU Torture Test Operation (CONFIG_RCU_TORTURE_TEST) +trace.txt + - CONFIG_RCU_TRACE debugfs files and formats +UP.txt + - RCU on Uniprocessor Systems +whatisRCU.txt + - What is RCU? diff --git a/Documentation/RCU/NMI-RCU.txt b/Documentation/RCU/NMI-RCU.txt new file mode 100644 index 00000000..a8536cb8 --- /dev/null +++ b/Documentation/RCU/NMI-RCU.txt @@ -0,0 +1,120 @@ +Using RCU to Protect Dynamic NMI Handlers + + +Although RCU is usually used to protect read-mostly data structures, +it is possible to use RCU to provide dynamic non-maskable interrupt +handlers, as well as dynamic irq handlers. This document describes +how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer +work in "arch/i386/oprofile/nmi_timer_int.c" and in +"arch/i386/kernel/traps.c". + +The relevant pieces of code are listed below, each followed by a +brief explanation. + + static int dummy_nmi_callback(struct pt_regs *regs, int cpu) + { + return 0; + } + +The dummy_nmi_callback() function is a "dummy" NMI handler that does +nothing, but returns zero, thus saying that it did nothing, allowing +the NMI handler to take the default machine-specific action. + + static nmi_callback_t nmi_callback = dummy_nmi_callback; + +This nmi_callback variable is a global function pointer to the current +NMI handler. 
+ + void do_nmi(struct pt_regs * regs, long error_code) + { + int cpu; + + nmi_enter(); + + cpu = smp_processor_id(); + ++nmi_count(cpu); + + if (!rcu_dereference_sched(nmi_callback)(regs, cpu)) + default_do_nmi(regs); + + nmi_exit(); + } + +The do_nmi() function processes each NMI. It first disables preemption +in the same way that a hardware irq would, then increments the per-CPU +count of NMIs. It then invokes the NMI handler stored in the nmi_callback +function pointer. If this handler returns zero, do_nmi() invokes the +default_do_nmi() function to handle a machine-specific NMI. Finally, +preemption is restored. + +In theory, rcu_dereference_sched() is not needed, since this code runs +only on i386, which in theory does not need rcu_dereference_sched() +anyway. However, in practice it is a good documentation aid, particularly +for anyone attempting to do something similar on Alpha or on systems +with aggressively optimizing compilers. + +Quick Quiz: Why might the rcu_dereference_sched() be necessary on Alpha, + given that the code referenced by the pointer is read-only? + + +Back to the discussion of NMI and RCU... + + void set_nmi_callback(nmi_callback_t callback) + { + rcu_assign_pointer(nmi_callback, callback); + } + +The set_nmi_callback() function registers an NMI handler. Note that any +data that is to be used by the callback must be initialized -before- +the call to set_nmi_callback(). On architectures that do not order +writes, the rcu_assign_pointer() ensures that the NMI handler sees the +initialized values. + + void unset_nmi_callback(void) + { + rcu_assign_pointer(nmi_callback, dummy_nmi_callback); + } + +This function unregisters an NMI handler, restoring the original +dummy_nmi_callback(). However, there may well be an NMI handler +currently executing on some other CPU. We therefore cannot free +up any data structures used by the old NMI handler until execution +of it completes on all other CPUs. 
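The registration and dispatch steps described above can be sketched in
user space with C11 atomics. This is only an illustration under stated
assumptions, not the kernel code: callback_t, handler_data,
dummy_callback(), real_callback(), current_callback, set_callback(),
unset_callback(), and dispatch() are all hypothetical names. A release
store stands in for rcu_assign_pointer(), and an acquire load stands in
for rcu_dereference_sched(); acquire is a conservative choice that is
stronger than the dependency ordering the kernel primitive relies on.

```c
#include <stdatomic.h>

/* Hypothetical user-space stand-ins; none of these names are kernel APIs. */
typedef int (*callback_t)(int cpu);

/* Data that the "real" handler depends on; must be set before publication. */
static int handler_data;

/* Does nothing and returns zero, like dummy_nmi_callback() in the text. */
static int dummy_callback(int cpu)
{
	(void)cpu;
	return 0;
}

/* Returns the data that its caller initialized before registering it. */
static int real_callback(int cpu)
{
	(void)cpu;
	return handler_data;
}

/* Global pointer to the current handler, initially the dummy. */
static _Atomic(callback_t) current_callback = dummy_callback;

/*
 * Publish a new handler.  The release store orders all earlier writes
 * (such as the initialization of handler_data) before the pointer
 * becomes visible, which is the role rcu_assign_pointer() plays.
 */
static void set_callback(callback_t cb)
{
	atomic_store_explicit(&current_callback, cb, memory_order_release);
}

/* Restore the dummy handler.  This does NOT wait for running readers. */
static void unset_callback(void)
{
	atomic_store_explicit(&current_callback, dummy_callback,
			      memory_order_release);
}

/*
 * Fetch and invoke the current handler.  The acquire load pairs with
 * the release store above, so a reader that sees the new pointer also
 * sees the data initialized before it was published.
 */
static int dispatch(int cpu)
{
	callback_t cb = atomic_load_explicit(&current_callback,
					     memory_order_acquire);

	return cb(cpu);
}
```

With these definitions, dispatch() returns zero before any handler is
registered. After handler_data is set to 42 and set_callback(real_callback)
is called, dispatch() returns 42, and after unset_callback() it returns
zero again. The sketch covers only publication of the pointer. Deciding
when the old handler's data may safely be freed is a separate problem.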
+ +One way to accomplish this is via synchronize_sched(), perhaps as +follows: + + unset_nmi_callback(); + synchronize_sched(); + kfree(my_nmi_data); + +This works because synchronize_sched() blocks until all CPUs complete +any preemption-disabled segments of code that they were executing. +Since NMI handlers disable preemption, synchronize_sched() is guaranteed +not to return until all ongoing NMI handlers exit. It is therefore safe +to free up the handler's data as soon as synchronize_sched() returns. + +Important note: for this to work, the architecture in question must +invoke nmi_enter() and nmi_exit() on NMI entry and exit, respectively. + + +Answer to Quick Quiz + + Why might the rcu_dereference_sched() be necessary on Alpha, given + that the code referenced by the pointer is read-only? + + Answer: The caller of set_nmi_callback() might well have + initialized some data that is to be used by the new NMI + handler. In this case, the rcu_dereference_sched() would + be needed, because otherwise a CPU that received an NMI + just after the new handler was set might see the pointer + to the new NMI handler, but the old, not-yet-initialized + version of the handler's data. + + This same sad story can happen on other CPUs when using + a compiler with aggressive pointer-value speculation + optimizations. + + More important, the rcu_dereference_sched() makes it + clear to someone reading the code that the pointer is + being protected by RCU-sched. diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt new file mode 100644 index 00000000..c43460da --- /dev/null +++ b/Documentation/RCU/RTFP.txt @@ -0,0 +1,875 @@ +Read the F-ing Papers! + + +This document describes RCU-related publications, and is followed by +the corresponding bibtex entries. A number of the publications may +be found at http://www.rdrop.com/users/paulmck/RCU/. 
+ +The first thing resembling RCU was published in 1980, when Kung and Lehman +[Kung80] recommended use of a garbage collector to defer destruction +of nodes in a parallel binary search tree in order to simplify its +implementation. This works well in environments that have garbage +collectors, but most production garbage collectors incur significant +overhead. + +In 1982, Manber and Ladner [Manber82,Manber84] recommended deferring +destruction until all threads running at that time have terminated, again +for a parallel binary search tree. This approach works well in systems +with short-lived threads, such as the K42 research operating system. +However, Linux has long-lived tasks, so more is needed. + +In 1986, Hennessy, Osisek, and Seigh [Hennessy89] introduced passive +serialization, which is an RCU-like mechanism that relies on the presence +of "quiescent states" in the VM/XA hypervisor that are guaranteed not +to be referencing the data structure. However, this mechanism was not +optimized for modern computer systems, which is not surprising given +that these overheads were not so expensive in the mid-80s. Nonetheless, +passive serialization appears to be the first deferred-destruction +mechanism to be used in production. Furthermore, the relevant patent +has lapsed, so this approach may be used in non-GPL software, if desired. +(In contrast, implementation of RCU is permitted only in software licensed +under either GPL or LGPL. Sorry!!!) + +In 1990, Pugh [Pugh90] noted that explicitly tracking which threads +were reading a given data structure permitted deferred free to operate +in the presence of non-terminating threads. However, this explicit +tracking imposes significant read-side overhead, which is undesirable +in read-mostly situations. 
This algorithm does take pains to avoid +write-side contention and parallelize the other write-side overheads by +providing a fine-grained locking design. However, it would be interesting +to see how much of the performance advantage reported in 1990 remains +in 2004. + +At about this same time, Adams [Adams91] described ``chaotic relaxation'', +where the normal barriers between successive iterations of convergent +numerical algorithms are relaxed, so that iteration $n$ might use +data from iteration $n-1$ or even $n-2$. This introduces error, +which typically slows convergence and thus increases the number of +iterations required. However, this increase is sometimes more than made +up for by a reduction in the number of expensive barrier operations, +which are otherwise required to synchronize the threads at the end +of each iteration. Unfortunately, chaotic relaxation requires highly +structured data, such as the matrices used in scientific programs, and +is thus inapplicable to most data structures in operating-system kernels. + +In 1992, Henry (now Alexia) Massalin [HMassalinPhD] completed a dissertation advising +parallel programmers to defer processing when feasible to simplify +synchronization. RCU makes extremely heavy use of this advice. + +In 1993, Jacobson [Jacobson93] verbally described what is perhaps the +simplest deferred-free technique: simply waiting a fixed amount of time +before freeing blocks awaiting deferred free. Jacobson did not describe +any write-side changes he might have made in this work using SGI's Irix +kernel. Aju John published a similar technique in 1995 [AjuJohn95]. +This works well if there is a well-defined upper bound on the length of +time that reading threads can hold references, as there might well be in +hard real-time systems. However, if this time is exceeded, perhaps due +to preemption, excessive interrupts, or larger-than-anticipated load, +memory corruption can ensue, with no reasonable means of diagnosis. 
+Jacobson's technique is therefore inappropriate for use in production +operating-system kernels, except when such kernels can provide hard +real-time response guarantees for all operations. + +Also in 1995, Pu et al. [Pu95a] applied a technique similar to that of Pugh's +read-side-tracking to permit replugging of algorithms within a commercial +Unix operating system. However, this replugging permitted only a single +reader at a time. The following year, this same group of researchers +extended their technique to allow for multiple readers [Cowan96a]. +Their approach requires memory barriers (and thus pipeline stalls), +but reduces memory latency, contention, and locking overheads. + +1995 also saw the first publication of DYNIX/ptx's RCU mechanism +[Slingwine95], which was optimized for modern CPU architectures, +and was successfully applied to a number of situations within the +DYNIX/ptx kernel. The corresponding conference paper appeared in 1998 +[McKenney98]. + +In 1999, the Tornado and K42 groups described their "generations" +mechanism, which is quite similar to RCU [Gamsa99]. These operating systems +made pervasive use of RCU in place of "existence locks", which greatly +simplifies locking hierarchies. + +2001 saw the first RCU presentation involving Linux [McKenney01a] +at OLS. The resulting abundance of RCU patches was presented the +following year [McKenney02a], and use of RCU in dcache was first +described that same year [Linder02a]. + +Also in 2002, Michael [Michael02b,Michael02a] presented "hazard-pointer" +techniques that defer the destruction of data structures to simplify +non-blocking synchronization (wait-free synchronization, lock-free +synchronization, and obstruction-free synchronization are all examples of +non-blocking synchronization). In particular, this technique eliminates +locking, reduces contention, reduces memory latency for readers, and +parallelizes pipeline stalls and memory latency for writers. 
However, +these techniques still impose significant read-side overhead in the +form of memory barriers. Researchers at Sun worked along similar lines +in the same timeframe [HerlihyLM02]. These techniques can be thought +of as inside-out reference counts, where the count is represented by the +number of hazard pointers referencing a given data structure (rather than +the more conventional counter field within the data structure itself). + +By the same token, RCU can be thought of as a "bulk reference count", +where some form of reference counter covers all references by a given CPU +or thread during a set timeframe. This timeframe is related to, but +not necessarily exactly the same as, an RCU grace period. In classic +RCU, the reference counter is the per-CPU bit in the "bitmask" field, +and each such bit covers all references that might have been made by +the corresponding CPU during the prior grace period. Of course, RCU +can be thought of in other terms as well. + +In 2003, the K42 group described how RCU could be used to create +hot-pluggable implementations of operating-system functions [Appavoo03a]. +Later that year saw a paper describing an RCU implementation of System +V IPC [Arcangeli03], and an introduction to RCU in Linux Journal +[McKenney03a]. + +2004 saw a Linux-Journal article on use of RCU in dcache +[McKenney04a], a performance comparison of locking to RCU on several +different CPUs [McKenney04b], a dissertation describing use of RCU in a +number of operating-system kernels [PaulEdwardMcKenneyPhD], a paper +describing how to make RCU safe for soft-realtime applications [Sarma04c], +and a paper describing SELinux performance with RCU [JamesMorris04b]. + +2005 brought further adaptation of RCU to realtime use, permitting +preemption of RCU realtime critical sections [PaulMcKenney05a, +PaulMcKenney05b]. 
+ +2006 saw the first best-paper award for an RCU paper [ThomasEHart2006a], +as well as further work on efficient implementations of preemptible +RCU [PaulEMcKenney2006b], but priority-boosting of RCU read-side critical +sections proved elusive. An RCU implementation permitting general +blocking in read-side critical sections appeared [PaulEMcKenney2006c], +and Robert Olsson described an RCU-protected trie-hash combination +[RobertOlsson2006a]. + +2007 saw the journal version of the award-winning RCU paper from 2006 +[ThomasEHart2007a], as well as a paper demonstrating use of Promela +and Spin to mechanically verify an optimization to Oleg Nesterov's +QRCU [PaulEMcKenney2007QRCUspin], a design document describing +preemptible RCU [PaulEMcKenney2007PreemptibleRCU], and the three-part +LWN "What is RCU?" series [PaulEMcKenney2007WhatIsRCUFundamentally, +PaulEMcKenney2008WhatIsRCUUsage, and PaulEMcKenney2008WhatIsRCUAPI]. + +2008 saw a journal paper on real-time RCU [DinakarGuniguntala2008IBMSysJ], +a history of how Linux changed RCU more than RCU changed Linux +[PaulEMcKenney2008RCUOSR], and a design overview of hierarchical RCU +[PaulEMcKenney2008HierarchicalRCU]. + +2009 introduced user-level RCU algorithms [PaulEMcKenney2009MaliciousURCU], +which Mathieu Desnoyers is now maintaining [MathieuDesnoyers2009URCU] +[MathieuDesnoyersPhD]. TINY_RCU [PaulEMcKenney2009BloatWatchRCU] made +its appearance, as did expedited RCU [PaulEMcKenney2009expeditedRCU]. +The problem of resizeable RCU-protected hash tables may now be on a path +to a solution [JoshTriplett2009RPHash]. + +Bibtex Entries + +@article{Kung80 +,author="H. T. Kung and Q. Lehman" +,title="Concurrent Maintenance of Binary Search Trees" +,Year="1980" +,Month="September" +,journal="ACM Transactions on Database Systems" +,volume="5" +,number="3" +,pages="354-382" +} + +@techreport{Manber82 +,author="Udi Manber and Richard E. 
Ladner" +,title="Concurrency Control in a Dynamic Search Structure" +,institution="Department of Computer Science, University of Washington" +,address="Seattle, Washington" +,year="1982" +,number="82-01-01" +,month="January" +,pages="28" +} + +@article{Manber84 +,author="Udi Manber and Richard E. Ladner" +,title="Concurrency Control in a Dynamic Search Structure" +,Year="1984" +,Month="September" +,journal="ACM Transactions on Database Systems" +,volume="9" +,number="3" +,pages="439-455" +} + +@techreport{Hennessy89 +,author="James P. Hennessy and Damian L. Osisek and Joseph W. {Seigh II}" +,title="Passive Serialization in a Multitasking Environment" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1989" +,number="US Patent 4,809,168 (lapsed)" +,month="February" +,pages="11" +} + +@techreport{Pugh90 +,author="William Pugh" +,title="Concurrent Maintenance of Skip Lists" +,institution="Institute of Advanced Computer Science Studies, Department of Computer Science, University of Maryland" +,address="College Park, Maryland" +,year="1990" +,number="CS-TR-2222.1" +,month="June" +} + +@Book{Adams91 +,Author="Gregory R. Adams" +,title="Concurrent Programming, Principles, and Practices" +,Publisher="Benjamin Cummins" +,Year="1991" +} + +@phdthesis{HMassalinPhD +,author="H. Massalin" +,title="Synthesis: An Efficient Implementation of Fundamental Operating +System Services" +,school="Columbia University" +,address="New York, NY" +,year="1992" +,annotation=" + Mondo optimizing compiler. + Wait-free stuff. + Good advice: defer work to avoid synchronization. 
+" +} + +@unpublished{Jacobson93 +,author="Van Jacobson" +,title="Avoid Read-Side Locking Via Delayed Free" +,year="1993" +,month="September" +,note="Verbal discussion" +} + +@Conference{AjuJohn95 +,Author="Aju John" +,Title="Dynamic vnodes -- Design and Implementation" +,Booktitle="{USENIX Winter 1995}" +,Publisher="USENIX Association" +,Month="January" +,Year="1995" +,pages="11-23" +,Address="New Orleans, LA" +} + +@conference{Pu95a, +Author = "Calton Pu and Tito Autrey and Andrew Black and Charles Consel and +Crispin Cowan and Jon Inouye and Lakshmi Kethana and Jonathan Walpole and +Ke Zhang", +Title = "Optimistic Incremental Specialization: Streamlining a Commercial +Operating System", +Booktitle = "15\textsuperscript{th} ACM Symposium on +Operating Systems Principles (SOSP'95)", +address = "Copper Mountain, CO", +month="December", +year="1995", +pages="314-321", +annotation=" + Uses a replugger, but with a flag to signal when people are + using the resource at hand. Only one reader at a time. +" +} + +@conference{Cowan96a, +Author = "Crispin Cowan and Tito Autrey and Charles Krasic and +Calton Pu and Jonathan Walpole", +Title = "Fast Concurrent Dynamic Linking for an Adaptive Operating System", +Booktitle = "International Conference on Configurable Distributed Systems +(ICCDS'96)", +address = "Annapolis, MD", +month="May", +year="1996", +pages="108", +isbn="0-8186-7395-8", +annotation=" + Uses a replugger, but with a counter to signal when people are + using the resource at hand. Allows multiple readers. +" +} + +@techreport{Slingwine95 +,author="John D. Slingwine and Paul E. 
McKenney" +,title="Apparatus and Method for Achieving Reduced Overhead Mutual +Exclusion and Maintaining Coherency in a Multiprocessor System +Utilizing Execution History and Thread Monitoring" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1995" +,number="US Patent 5,442,758 (contributed under GPL)" +,month="August" +} + +@techreport{Slingwine97 +,author="John D. Slingwine and Paul E. McKenney" +,title="Method for maintaining data coherency using thread +activity summaries in a multicomputer system" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1997" +,number="US Patent 5,608,893 (contributed under GPL)" +,month="March" +} + +@techreport{Slingwine98 +,author="John D. Slingwine and Paul E. McKenney" +,title="Apparatus and method for achieving reduced overhead +mutual exclusion and maintaining coherency in a multiprocessor +system utilizing execution history and thread monitoring" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="1998" +,number="US Patent 5,727,209 (contributed under GPL)" +,month="March" +} + +@Conference{McKenney98 +,Author="Paul E. McKenney and John D. Slingwine" +,Title="Read-Copy Update: Using Execution History to Solve Concurrency +Problems" +,Booktitle="{Parallel and Distributed Computing and Systems}" +,Month="October" +,Year="1998" +,pages="509-518" +,Address="Las Vegas, NV" +} + +@Conference{Gamsa99 +,Author="Ben Gamsa and Orran Krieger and Jonathan Appavoo and Michael Stumm" +,Title="Tornado: Maximizing Locality and Concurrency in a Shared Memory +Multiprocessor Operating System" +,Booktitle="{Proceedings of the 3\textsuperscript{rd} Symposium on +Operating System Design and Implementation}" +,Month="February" +,Year="1999" +,pages="87-100" +,Address="New Orleans, LA" +} + +@techreport{Slingwine01 +,author="John D. Slingwine and Paul E. 
McKenney" +,title="Apparatus and method for achieving reduced overhead +mutual exclusion and maintaining coherency in a multiprocessor +system utilizing execution history and thread monitoring" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="2001" +,number="US Patent 6,219,690 (contributed under GPL)" +,month="April" +} + +@Conference{McKenney01a +,Author="Paul E. McKenney and Jonathan Appavoo and Andi Kleen and +Orran Krieger and Rusty Russell and Dipankar Sarma and Maneesh Soni" +,Title="Read-Copy Update" +,Booktitle="{Ottawa Linux Symposium}" +,Month="July" +,Year="2001" +,note="Available: +\url{http://www.linuxsymposium.org/2001/abstracts/readcopy.php} +\url{http://www.rdrop.com/users/paulmck/rclock/rclock_OLS.2001.05.01c.pdf} +[Viewed June 23, 2004]" +,annotation=" +Described RCU, and presented some patches implementing and using it in +the Linux kernel. +" +} + +@Conference{Linder02a +,Author="Hanna Linder and Dipankar Sarma and Maneesh Soni" +,Title="Scalability of the Directory Entry Cache" +,Booktitle="{Ottawa Linux Symposium}" +,Month="June" +,Year="2002" +,pages="289-300" +} + +@Conference{McKenney02a +,Author="Paul E. McKenney and Dipankar Sarma and +Andrea Arcangeli and Andi Kleen and Orran Krieger and Rusty Russell" +,Title="Read-Copy Update" +,Booktitle="{Ottawa Linux Symposium}" +,Month="June" +,Year="2002" +,pages="338-367" +,note="Available: +\url{http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz} +[Viewed June 23, 2004]" +} + +@conference{Michael02a +,author="Maged M. Michael" +,title="Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic +Reads and Writes" +,Year="2002" +,Month="August" +,booktitle="{Proceedings of the 21\textsuperscript{st} Annual ACM +Symposium on Principles of Distributed Computing}" +,pages="21-30" +,annotation=" + Each thread keeps an array of pointers to items that it is + currently referencing. 
Sort of an inside-out garbage collection + mechanism, but one that requires the accessing code to explicitly + state its needs. Also requires read-side memory barriers on + most architectures. +" +} + +@conference{Michael02b +,author="Maged M. Michael" +,title="High Performance Dynamic Lock-Free Hash Tables and List-Based Sets" +,Year="2002" +,Month="August" +,booktitle="{Proceedings of the 14\textsuperscript{th} Annual ACM +Symposium on Parallel +Algorithms and Architecture}" +,pages="73-82" +,annotation=" + Like the title says... +" +} + +@InProceedings{HerlihyLM02 +,author={Maurice Herlihy and Victor Luchangco and Mark Moir} +,title="The Repeat Offender Problem: A Mechanism for Supporting Dynamic-Sized, +Lock-Free Data Structures" +,booktitle={Proceedings of 16\textsuperscript{th} International +Symposium on Distributed Computing} +,year=2002 +,month="October" +,pages="339-353" +} + +@article{Appavoo03a +,author="J. Appavoo and K. Hui and C. A. N. Soules and R. W. Wisniewski and +D. M. {Da Silva} and O. Krieger and M. A. Auslander and D. J. Edelsohn and +B. Gamsa and G. R. Ganger and P. McKenney and M. Ostrowski and +B. Rosenburg and M. Stumm and J. Xenidis" +,title="Enabling Autonomic Behavior in Systems Software With Hot Swapping" +,Year="2003" +,Month="January" +,journal="IBM Systems Journal" +,volume="42" +,number="1" +,pages="60-76" +} + +@Conference{Arcangeli03 +,Author="Andrea Arcangeli and Mingming Cao and Paul E. McKenney and +Dipankar Sarma" +,Title="Using Read-Copy Update Techniques for {System V IPC} in the +{Linux} 2.5 Kernel" +,Booktitle="Proceedings of the 2003 USENIX Annual Technical Conference +(FREENIX Track)" +,Publisher="USENIX Association" +,year="2003" +,month="June" +,pages="297-310" +} + +@article{McKenney03a +,author="Paul E. McKenney" +,title="Using {RCU} in the {Linux} 2.5 Kernel" +,Year="2003" +,Month="October" +,journal="Linux Journal" +,volume="1" +,number="114" +,pages="18-26" +} + +@techreport{Friedberg03a +,author="Stuart A. 
Friedberg" +,title="Lock-Free Wild Card Search Data Structure and Method" +,institution="US Patent and Trademark Office" +,address="Washington, DC" +,year="2003" +,number="US Patent 6,662,184 (contributed under GPL)" +,month="December" +,pages="112" +} + +@article{McKenney04a +,author="Paul E. McKenney and Dipankar Sarma and Maneesh Soni" +,title="Scaling dcache with {RCU}" +,Year="2004" +,Month="January" +,journal="Linux Journal" +,volume="1" +,number="118" +,pages="38-46" +} + +@Conference{McKenney04b +,Author="Paul E. McKenney" +,Title="{RCU} vs. Locking Performance on Different {CPUs}" +,Booktitle="{linux.conf.au}" +,Month="January" +,Year="2004" +,Address="Adelaide, Australia" +,note="Available: +\url{http://www.linux.org.au/conf/2004/abstracts.html#90} +\url{http://www.rdrop.com/users/paulmck/rclock/lockperf.2004.01.17a.pdf} +[Viewed June 23, 2004]" +} + +@phdthesis{PaulEdwardMcKenneyPhD +,author="Paul E. McKenney" +,title="Exploiting Deferred Destruction: +An Analysis of Read-Copy-Update Techniques +in Operating System Kernels" +,school="OGI School of Science and Engineering at +Oregon Health and Sciences University" +,year="2004" +,note="Available: +\url{http://www.rdrop.com/users/paulmck/RCU/RCUdissertation.2004.07.14e1.pdf} +[Viewed October 15, 2004]" +} + +@Conference{Sarma04c +,Author="Dipankar Sarma and Paul E. McKenney" +,Title="Making RCU Safe for Deep Sub-Millisecond Response Realtime Applications" +,Booktitle="Proceedings of the 2004 USENIX Annual Technical Conference +(FREENIX Track)" +,Publisher="USENIX Association" +,year="2004" +,month="June" +,pages="182-191" +} + +@unpublished{JamesMorris04b +,Author="James Morris" +,Title="Recent Developments in {SELinux} Kernel Performance" +,month="December" +,year="2004" +,note="Available: +\url{http://www.livejournal.com/users/james_morris/2153.html} +[Viewed December 10, 2004]" +} + +@unpublished{PaulMcKenney05a +,Author="Paul E. 
McKenney" +,Title="{[RFC]} {RCU} and {CONFIG\_PREEMPT\_RT} progress" +,month="May" +,year="2005" +,note="Available: +\url{http://lkml.org/lkml/2005/5/9/185} +[Viewed May 13, 2005]" +,annotation=" + First publication of working lock-based deferred free patches + for the CONFIG_PREEMPT_RT environment. +" +} + +@conference{PaulMcKenney05b +,Author="Paul E. McKenney and Dipankar Sarma" +,Title="Towards Hard Realtime Response from the Linux Kernel on SMP Hardware" +,Booktitle="linux.conf.au 2005" +,month="April" +,year="2005" +,address="Canberra, Australia" +,note="Available: +\url{http://www.rdrop.com/users/paulmck/RCU/realtimeRCU.2005.04.23a.pdf} +[Viewed May 13, 2005]" +,annotation=" + Realtime turns into making RCU yet more realtime friendly. +" +} + +@conference{ThomasEHart2006a +,Author="Thomas E. Hart and Paul E. McKenney and Angela Demke Brown" +,Title="Making Lockless Synchronization Fast: Performance Implications +of Memory Reclamation" +,Booktitle="20\textsuperscript{th} {IEEE} International Parallel and +Distributed Processing Symposium" +,month="April" +,year="2006" +,day="25-29" +,address="Rhodes, Greece" +,annotation={ + Compares QSBR (AKA "classic RCU"), HPBR, EBR, and lock-free + reference counting. +} +} + +@Conference{PaulEMcKenney2006b +,Author="Paul E. McKenney and Dipankar Sarma and Ingo Molnar and +Suparna Bhattacharya" +,Title="Extending RCU for Realtime and Embedded Workloads" +,Booktitle="{Ottawa Linux Symposium}" +,Month="July" +,Year="2006" +,pages="v2 123-138" +,note="Available: +\url{http://www.linuxsymposium.org/2006/index_2006.php} +\url{http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf} +[Viewed January 1, 2007]" +,annotation=" + Described how to improve the -rt implementation of realtime RCU. +" +} + +@unpublished{PaulEMcKenney2006c +,Author="Paul E. 
McKenney" +,Title="Sleepable {RCU}" +,month="October" +,day="9" +,year="2006" +,note="Available: +\url{http://lwn.net/Articles/202847/} +Revised: +\url{http://www.rdrop.com/users/paulmck/RCU/srcu.2007.01.14a.pdf} +[Viewed August 21, 2006]" +,annotation=" + LWN article introducing SRCU. +" +} + +@unpublished{RobertOlsson2006a +,Author="Robert Olsson and Stefan Nilsson" +,Title="{TRASH}: A dynamic {LC}-trie and hash data structure" +,month="August" +,day="18" +,year="2006" +,note="Available: +\url{http://www.nada.kth.se/~snilsson/public/papers/trash/trash.pdf} +[Viewed February 24, 2007]" +,annotation=" + RCU-protected dynamic trie-hash combination. +" +} + +@unpublished{ThomasEHart2007a +,Author="Thomas E. Hart and Paul E. McKenney and Angela Demke Brown and Jonathan Walpole" +,Title="Performance of memory reclamation for lockless synchronization" +,journal="J. Parallel Distrib. Comput." +,year="2007" +,note="To appear in J. Parallel Distrib. Comput. + \url{doi=10.1016/j.jpdc.2007.04.010}" +,annotation={ + Compares QSBR (AKA "classic RCU"), HPBR, EBR, and lock-free + reference counting. Journal version of ThomasEHart2006a. +} +} + +@unpublished{PaulEMcKenney2007QRCUspin +,Author="Paul E. McKenney" +,Title="Using Promela and Spin to verify parallel algorithms" +,month="August" +,day="1" +,year="2007" +,note="Available: +\url{http://lwn.net/Articles/243851/} +[Viewed September 8, 2007]" +,annotation=" + LWN article describing Promela and spin, and also using Oleg + Nesterov's QRCU as an example (with Paul McKenney's fastpath). +" +} + +@unpublished{PaulEMcKenney2007PreemptibleRCU +,Author="Paul E. McKenney" +,Title="The design of preemptible read-copy-update" +,month="October" +,day="8" +,year="2007" +,note="Available: +\url{http://lwn.net/Articles/253651/} +[Viewed October 25, 2007]" +,annotation=" + LWN article describing the design of preemptible RCU. +" +} + +######################################################################## +# +# "What is RCU?" LWN series. 
+# + +@unpublished{PaulEMcKenney2007WhatIsRCUFundamentally +,Author="Paul E. McKenney and Jonathan Walpole" +,Title="What is {RCU}, Fundamentally?" +,month="December" +,day="17" +,year="2007" +,note="Available: +\url{http://lwn.net/Articles/262464/} +[Viewed December 27, 2007]" +,annotation=" + Lays out the three basic components of RCU: (1) publish-subscribe, + (2) wait for pre-existing readers to complete, and (3) maintain + multiple versions. +" +} + +@unpublished{PaulEMcKenney2008WhatIsRCUUsage +,Author="Paul E. McKenney" +,Title="What is {RCU}? Part 2: Usage" +,month="January" +,day="4" +,year="2008" +,note="Available: +\url{http://lwn.net/Articles/263130/} +[Viewed January 4, 2008]" +,annotation=" + Lays out six uses of RCU: + 1. RCU is a Reader-Writer Lock Replacement + 2. RCU is a Restricted Reference-Counting Mechanism + 3. RCU is a Bulk Reference-Counting Mechanism + 4. RCU is a Poor Man's Garbage Collector + 5. RCU is a Way of Providing Existence Guarantees + 6. RCU is a Way of Waiting for Things to Finish +" +} + +@unpublished{PaulEMcKenney2008WhatIsRCUAPI +,Author="Paul E. McKenney" +,Title="{RCU} part 3: the {RCU} {API}" +,month="January" +,day="17" +,year="2008" +,note="Available: +\url{http://lwn.net/Articles/264090/} +[Viewed January 10, 2008]" +,annotation=" + Gives an overview of the Linux-kernel RCU API and a brief annotated RCU + bibliography. +" +} + +# +# "What is RCU?" LWN series. +# +######################################################################## + +@article{DinakarGuniguntala2008IBMSysJ +,author="D. Guniguntala and P. E. McKenney and J. Triplett and J. Walpole" +,title="The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with {Linux}" +,Year="2008" +,Month="April" +,journal="IBM Systems Journal" +,volume="47" +,number="2" +,pages="@@-@@" +,annotation=" + RCU, realtime RCU, sleepable RCU, performance. +" +} + +@article{PaulEMcKenney2008RCUOSR +,author="Paul E.
McKenney and Jonathan Walpole" +,title="Introducing technology into the {Linux} kernel: a case study" +,Year="2008" +,journal="SIGOPS Oper. Syst. Rev." +,volume="42" +,number="5" +,pages="4--17" +,issn="0163-5980" +,doi={http://doi.acm.org/10.1145/1400097.1400099} +,publisher="ACM" +,address="New York, NY, USA" +,annotation={ + Linux changed RCU to a far greater degree than RCU has changed Linux. +} +} + +@unpublished{PaulEMcKenney2008HierarchicalRCU +,Author="Paul E. McKenney" +,Title="Hierarchical {RCU}" +,month="November" +,day="3" +,year="2008" +,note="Available: +\url{http://lwn.net/Articles/305782/} +[Viewed November 6, 2008]" +,annotation=" + RCU with combining-tree-based grace-period detection, + permitting it to handle thousands of CPUs. +" +} + +@conference{PaulEMcKenney2009MaliciousURCU +,Author="Paul E. McKenney" +,Title="Using a Malicious User-Level {RCU} to Torture {RCU}-Based Algorithms" +,Booktitle="linux.conf.au 2009" +,month="January" +,year="2009" +,address="Hobart, Australia" +,note="Available: +\url{http://www.rdrop.com/users/paulmck/RCU/urcutorture.2009.01.22a.pdf} +[Viewed February 2, 2009]" +,annotation=" + Realtime RCU and torture-testing RCU uses. +" +} + +@unpublished{MathieuDesnoyers2009URCU +,Author="Mathieu Desnoyers" +,Title="[{RFC} git tree] Userspace {RCU} (urcu) for {Linux}" +,month="February" +,day="5" +,year="2009" +,note="Available: +\url{http://lkml.org/lkml/2009/2/5/572} +\url{git://lttng.org/userspace-rcu.git} +[Viewed February 20, 2009]" +,annotation=" + Mathieu Desnoyers's user-space RCU implementation. + git://lttng.org/userspace-rcu.git +" +} + +@unpublished{PaulEMcKenney2009BloatWatchRCU +,Author="Paul E. McKenney" +,Title="{RCU}: The {Bloatwatch} Edition" +,month="March" +,day="17" +,year="2009" +,note="Available: +\url{http://lwn.net/Articles/323929/} +[Viewed March 20, 2009]" +,annotation=" + Uniprocessor assumptions allow simplified RCU implementation. 
+" +} + +@unpublished{PaulEMcKenney2009expeditedRCU +,Author="Paul E. McKenney" +,Title="[{PATCH} -tip 0/3] expedited 'big hammer' {RCU} grace periods" +,month="June" +,day="25" +,year="2009" +,note="Available: +\url{http://lkml.org/lkml/2009/6/25/306} +[Viewed August 16, 2009]" +,annotation=" + First posting of expedited RCU to be accepted into -tip. +" +} + +@unpublished{JoshTriplett2009RPHash +,Author="Josh Triplett" +,Title="Scalable concurrent hash tables via relativistic programming" +,month="September" +,year="2009" +,note="Linux Plumbers Conference presentation" +,annotation=" + RP fun with hash tables. +" +} + +@phdthesis{MathieuDesnoyersPhD +, title = "Low-Impact Operating System Tracing" +, author = "Mathieu Desnoyers" +, school = "Ecole Polytechnique de Montr\'{e}al" +, month = "December" +, year = 2009 +,note="Available: +\url{http://www.lttng.org/pub/thesis/desnoyers-dissertation-2009-12.pdf} +[Viewed December 9, 2009]" +} diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt new file mode 100644 index 00000000..90ec5341 --- /dev/null +++ b/Documentation/RCU/UP.txt @@ -0,0 +1,135 @@ +RCU on Uniprocessor Systems + + +A common misconception is that, on UP systems, the call_rcu() primitive +may immediately invoke its function. The basis of this misconception +is that since there is only one CPU, it should not be necessary to +wait for anything else to get done, since there are no other CPUs for +anything else to be happening on. Although this approach will -sort- -of- +work a surprising amount of the time, it is a very bad idea in general. +This document presents three examples that demonstrate exactly how bad +an idea this is. + + +Example 1: softirq Suicide + +Suppose that an RCU-based algorithm scans a linked list containing +elements A, B, and C in process context, and can delete elements from +this same list in softirq context. 
Suppose that the process-context scan +is referencing element B when it is interrupted by softirq processing, +which deletes element B, and then invokes call_rcu() to free element B +after a grace period. + +Now, if call_rcu() were to directly invoke its arguments, then upon return +from softirq, the list scan would find itself referencing a newly freed +element B. This situation can greatly decrease the life expectancy of +your kernel. + +This same problem can occur if call_rcu() is invoked from a hardware +interrupt handler. + + +Example 2: Function-Call Fatality + +Of course, one could avert the suicide described in the preceding example +by having call_rcu() directly invoke its arguments only if it was called +from process context. However, this can fail in a similar manner. + +Suppose that an RCU-based algorithm again scans a linked list containing +elements A, B, and C in process context, but that it invokes a function +on each element as it is scanned. Suppose further that this function +deletes element B from the list, then passes it to call_rcu() for deferred +freeing. This may be a bit unconventional, but it is perfectly legal +RCU usage, since call_rcu() must wait for a grace period to elapse. +Therefore, in this case, allowing call_rcu() to immediately invoke +its arguments would cause it to fail to make the fundamental guarantee +underlying RCU, namely that call_rcu() defers invoking its arguments until +all RCU read-side critical sections currently executing have completed. + +Quick Quiz #1: Why is it -not- legal to invoke synchronize_rcu() in + this case? + + +Example 3: Death by Deadlock + +Suppose that call_rcu() is invoked while holding a lock, and that the +callback function must acquire this same lock. In this case, if +call_rcu() were to directly invoke the callback, the result would +be self-deadlock. + +In some cases, it would be possible to restructure the code so that +the call_rcu() is delayed until after the lock is released.
However, +there are cases where this can be quite ugly: + +1. If a number of items need to be passed to call_rcu() within + the same critical section, then the code would need to create + a list of them, then traverse the list once the lock was + released. + +2. In some cases, the lock will be held across some kernel API, + so that delaying the call_rcu() until the lock is released + requires that the data item be passed up via a common API. + It is far better to guarantee that callbacks are invoked + with no locks held than to have to modify such APIs to allow + arbitrary data items to be passed back up through them. + +If call_rcu() directly invokes the callback, painful locking restrictions +or API changes would be required. + +Quick Quiz #2: What locking restriction must RCU callbacks respect? + + +Summary + +Permitting call_rcu() to immediately invoke its arguments breaks RCU, +even on a UP system. So do not do it! Even on a UP system, the RCU +infrastructure -must- respect grace periods, and -must- invoke callbacks +from a known environment in which no locks are held. + +It -is- safe for synchronize_sched() and synchronize_rcu_bh() to return +immediately on a UP system. It is also safe for synchronize_rcu() +to return immediately on UP systems, except when running preemptable +RCU. + +Quick Quiz #3: Why can't synchronize_rcu() return immediately on + UP systems running preemptable RCU? + + +Answer to Quick Quiz #1: + Why is it -not- legal to invoke synchronize_rcu() in this case? + + Because the calling function is scanning an RCU-protected linked + list, and is therefore within an RCU read-side critical section. + Therefore, the called function has been invoked within an RCU + read-side critical section, and is not permitted to block. + +Answer to Quick Quiz #2: + What locking restriction must RCU callbacks respect? + + Any lock that is acquired within an RCU callback must be + acquired elsewhere using an _irq variant of the spinlock + primitive.
For example, if "mylock" is acquired by an + RCU callback, then a process-context acquisition of this + lock must use something like spin_lock_irqsave() to + acquire the lock. + + If the process-context code were to simply use spin_lock(), + then, since RCU callbacks can be invoked from softirq context, + the callback might be called from a softirq that interrupted + the process-context critical section. This would result in + self-deadlock. + + This restriction might seem gratuitous, since very few RCU + callbacks acquire locks directly. However, a great many RCU + callbacks do acquire locks -indirectly-, for example, via + the kfree() primitive. + +Answer to Quick Quiz #3: + Why can't synchronize_rcu() return immediately on UP systems + running preemptable RCU? + + Because some other task might have been preempted in the middle + of an RCU read-side critical section. If synchronize_rcu() + simply immediately returned, it would prematurely signal the + end of the grace period, which would come as a nasty shock to + that other thread when it started running again. diff --git a/Documentation/RCU/arrayRCU.txt b/Documentation/RCU/arrayRCU.txt new file mode 100644 index 00000000..453ebe69 --- /dev/null +++ b/Documentation/RCU/arrayRCU.txt @@ -0,0 +1,141 @@ +Using RCU to Protect Read-Mostly Arrays + + +Although RCU is more commonly used to protect linked lists, it can +also be used to protect arrays. Three situations are as follows: + +1. Hash Tables + +2. Static Arrays + +3. Resizeable Arrays + +Each of these situations is discussed below. + + +Situation 1: Hash Tables + +Hash tables are often implemented as an array, where each array entry +has a linked-list hash chain. Each hash chain can be protected by RCU +as described in the listRCU.txt document. This approach also applies +to other array-of-list situations, such as radix trees.
+ + +Situation 2: Static Arrays + +Static arrays, where the data (rather than a pointer to the data) is +located in each array element, and where the array is never resized, +have not been used with RCU. Rik van Riel recommends using seqlock in +this situation, which would also have minimal read-side overhead as long +as updates are rare. + +Quick Quiz: Why is it so important that updates be rare when + using seqlock? + + +Situation 3: Resizeable Arrays + +Use of RCU for resizeable arrays is demonstrated by the grow_ary() +function used by the System V IPC code. The array is used to map from +semaphore, message-queue, and shared-memory IDs to the data structure +that represents the corresponding IPC construct. The grow_ary() +function does not acquire any locks; instead its caller must hold the +ids->sem semaphore. + +The grow_ary() function, shown below, does some limit checks, allocates a +new ipc_id_ary, copies the old array into the first part of the new one, +initializes the remainder of the new one, updates the ids->entries pointer +to point to the new array, and invokes ipc_rcu_putref() to free up the old +array. Note that rcu_assign_pointer() is used to update the ids->entries +pointer, which includes any memory barriers required on whatever +architecture you are running on. + + static int grow_ary(struct ipc_ids* ids, int newsize) + { + struct ipc_id_ary* new; + struct ipc_id_ary* old; + int i; + int size = ids->entries->size; + + if(newsize > IPCMNI) + newsize = IPCMNI; + if(newsize <= size) + return newsize; + + new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*newsize + + sizeof(struct ipc_id_ary)); + if(new == NULL) + return size; + new->size = newsize; + memcpy(new->p, ids->entries->p, + sizeof(struct kern_ipc_perm *)*size + + sizeof(struct ipc_id_ary)); + for(i=size;i<newsize;i++) { + new->p[i] = NULL; + } + old = ids->entries; + + /* + * Use rcu_assign_pointer() to make sure the memcpyed + * contents of the new array are visible before the new + * array becomes visible.
+ */ + rcu_assign_pointer(ids->entries, new); + + ipc_rcu_putref(old); + return newsize; + } + +The ipc_rcu_putref() function decrements the array's reference count +and then, if the reference count has dropped to zero, uses call_rcu() +to free the array after a grace period has elapsed. + +The array is traversed by the ipc_lock() function. This function +indexes into the array under the protection of rcu_read_lock(), +using rcu_dereference() to pick up the pointer to the array so +that it may later safely be dereferenced -- memory barriers are +required on the Alpha CPU. Since the size of the array is stored +with the array itself, there can be no array-size mismatches, so +a simple check suffices. The pointer to the structure corresponding +to the desired IPC object is placed in "out", with NULL indicating +a non-existent entry. After acquiring "out->lock", the "out->deleted" +flag indicates whether the IPC object is in the process of being +deleted, and, if not, the pointer is returned. + + struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id) + { + struct kern_ipc_perm* out; + int lid = id % SEQ_MULTIPLIER; + struct ipc_id_ary* entries; + + rcu_read_lock(); + entries = rcu_dereference(ids->entries); + if(lid >= entries->size) { + rcu_read_unlock(); + return NULL; + } + out = entries->p[lid]; + if(out == NULL) { + rcu_read_unlock(); + return NULL; + } + spin_lock(&out->lock); + + /* ipc_rmid() may have already freed the ID while ipc_lock + * was spinning: here verify that the structure is still valid + */ + if (out->deleted) { + spin_unlock(&out->lock); + rcu_read_unlock(); + return NULL; + } + return out; + } + + +Answer to Quick Quiz: + + The reason that it is important that updates be rare when + using seqlock is that frequent updates can livelock readers. + One way to avoid this problem is to assign a seqlock for + each array entry rather than to the entire array. 
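Annotation on the arrayRCU.txt Quick Quiz answer above: the per-entry seqlock idea can be sketched in user-space C roughly as follows. This is only an illustrative model, not the kernel's seqlock_t API. It assumes a single writer per entry (real code would serialize writers with a lock), uses C11 atomics, and the names struct entry, table, entry_write() and entry_read() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define NENTRIES 4

/* Hypothetical array element carrying its own sequence counter. */
struct entry {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	atomic_int a;
	atomic_int b;
};

/* Static storage is zero-initialized, so every seq starts out even. */
static struct entry table[NENTRIES];

/* Writer side. Assumes at most one writer per entry at a time. */
static void entry_write(struct entry *e, int a, int b)
{
	unsigned int s = atomic_load_explicit(&e->seq, memory_order_relaxed);

	assert((s & 1) == 0);	/* no other update may be in flight */
	atomic_store_explicit(&e->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&e->a, a, memory_order_relaxed);
	atomic_store_explicit(&e->b, b, memory_order_relaxed);
	atomic_store_explicit(&e->seq, s + 2, memory_order_release);
}

/* Reader side: retry until a stable, unchanged sequence is observed. */
static void entry_read(struct entry *e, int *a, int *b)
{
	unsigned int s1, s2;

	do {
		s1 = atomic_load_explicit(&e->seq, memory_order_acquire);
		*a = atomic_load_explicit(&e->a, memory_order_relaxed);
		*b = atomic_load_explicit(&e->b, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&e->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);
}
```

Because each entry has its own counter, an update to one entry forces retries only in readers of that entry, which is the point of the Quick Quiz answer. Readers can still be starved if a single entry is updated continuously.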
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt new file mode 100644 index 00000000..0c134f8a --- /dev/null +++ b/Documentation/RCU/checklist.txt @@ -0,0 +1,399 @@ +Review Checklist for RCU Patches + + +This document contains a checklist for producing and reviewing patches +that make use of RCU. Violating any of the rules listed below will +result in the same sorts of problems that leaving out a locking primitive +would cause. This list is based on experiences reviewing such patches +over a rather long period of time, but improvements are always welcome! + +0. Is RCU being applied to a read-mostly situation? If the data + structure is updated more than about 10% of the time, then you + should strongly consider some other approach, unless detailed + performance measurements show that RCU is nonetheless the right + tool for the job. Yes, RCU does reduce read-side overhead by + increasing write-side overhead, which is exactly why normal uses + of RCU will do much more reading than updating. + + Another exception is where performance is not an issue, and RCU + provides a simpler implementation. An example of this situation + is the dynamic NMI code in the Linux 2.6 kernel, at least on + architectures where NMIs are rare. + + Yet another exception is where the low real-time latency of RCU's + read-side primitives is critically important. + +1. Does the update code have proper mutual exclusion? + + RCU does allow -readers- to run (almost) naked, but -writers- must + still use some sort of mutual exclusion, such as: + + a. locking, + b. atomic operations, or + c. restricting updates to a single task. + + If you choose #b, be prepared to describe how you have handled + memory barriers on weakly ordered machines (pretty much all of + them -- even x86 allows later loads to be reordered to precede + earlier stores), and be prepared to explain why this added + complexity is worthwhile. 
If you choose #c, be prepared to + explain how this single task does not become a major bottleneck on + big multiprocessor machines (for example, if the task is updating + information relating to itself that other tasks can read, there + by definition can be no bottleneck). + +2. Do the RCU read-side critical sections make proper use of + rcu_read_lock() and friends? These primitives are needed + to prevent grace periods from ending prematurely, which + could result in data being unceremoniously freed out from + under your read-side code, which can greatly increase the + actuarial risk of your kernel. + + As a rough rule of thumb, any dereference of an RCU-protected + pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(), + rcu_read_lock_sched(), or by the appropriate update-side lock. + Disabling of preemption can serve as rcu_read_lock_sched(), but + is less readable. + +3. Does the update code tolerate concurrent accesses? + + The whole point of RCU is to permit readers to run without + any locks or atomic operations. This means that readers will + be running while updates are in progress. There are a number + of ways to handle this concurrency, depending on the situation: + + a. Use the RCU variants of the list and hlist update + primitives to add, remove, and replace elements on + an RCU-protected list. Alternatively, use the other + RCU-protected data structures that have been added to + the Linux kernel. + + This is almost always the best approach. + + b. Proceed as in (a) above, but also maintain per-element + locks (that are acquired by both readers and writers) + that guard per-element state. Of course, fields that + the readers refrain from accessing can be guarded by + some other lock acquired only by updaters, if desired. + + This works quite well, also. + + c. Make updates appear atomic to readers. For example, + pointer updates to properly aligned fields will + appear atomic, as will individual atomic primitives. 
+ Sequences of operations performed under a lock will -not- + appear to be atomic to RCU readers, nor will sequences + of multiple atomic primitives. + + This can work, but is starting to get a bit tricky. + + d. Carefully order the updates and the reads so that + readers see valid data at all phases of the update. + This is often more difficult than it sounds, especially + given modern CPUs' tendency to reorder memory references. + One must usually liberally sprinkle memory barriers + (smp_wmb(), smp_rmb(), smp_mb()) through the code, + making it difficult to understand and to test. + + It is usually better to group the changing data into + a separate structure, so that the change may be made + to appear atomic by updating a pointer to reference + a new structure containing updated values. + +4. Weakly ordered CPUs pose special challenges. Almost all CPUs + are weakly ordered -- even x86 CPUs allow later loads to be + reordered to precede earlier stores. RCU code must take all of + the following measures to prevent memory-corruption problems: + + a. Readers must maintain proper ordering of their memory + accesses. The rcu_dereference() primitive ensures that + the CPU picks up the pointer before it picks up the data + that the pointer points to. This really is necessary + on Alpha CPUs. If you don't believe me, see: + + http://www.openvms.compaq.com/wizard/wiz_2637.html + + The rcu_dereference() primitive is also an excellent + documentation aid, letting the person reading the code + know exactly which pointers are protected by RCU. + Please note that compilers can also reorder code, and + they are becoming increasingly aggressive about doing + just that. The rcu_dereference() primitive therefore + also prevents destructive compiler optimizations. + + The rcu_dereference() primitive is used by the + various "_rcu()" list-traversal primitives, such + as the list_for_each_entry_rcu().
Note that it is + perfectly legal (if redundant) for update-side code to + use rcu_dereference() and the "_rcu()" list-traversal + primitives. This is particularly useful in code that + is common to readers and updaters. However, lockdep + will complain if you access rcu_dereference() outside + of an RCU read-side critical section. See lockdep.txt + to learn what to do about this. + + Of course, neither rcu_dereference() nor the "_rcu()" + list-traversal primitives can substitute for a good + concurrency design coordinating among multiple updaters. + + b. If the list macros are being used, the list_add_tail_rcu() + and list_add_rcu() primitives must be used in order + to prevent weakly ordered machines from misordering + structure initialization and pointer planting. + Similarly, if the hlist macros are being used, the + hlist_add_head_rcu() primitive is required. + + c. If the list macros are being used, the list_del_rcu() + primitive must be used to keep list_del()'s pointer + poisoning from inflicting toxic effects on concurrent + readers. Similarly, if the hlist macros are being used, + the hlist_del_rcu() primitive is required. + + The list_replace_rcu() and hlist_replace_rcu() primitives + may be used to replace an old structure with a new one + in their respective types of RCU-protected lists. + + d. Rules similar to (4b) and (4c) apply to the "hlist_nulls" + type of RCU-protected linked lists. + + e. Updates must ensure that initialization of a given + structure happens before pointers to that structure are + publicized. Use the rcu_assign_pointer() primitive + when publicizing a pointer to a structure that can + be traversed by an RCU read-side critical section. + +5. If call_rcu(), or a related primitive such as call_rcu_bh() or + call_rcu_sched(), is used, the callback function must be + written to be called from softirq context. In particular, + it cannot block. + +6. Since synchronize_rcu() can block, it cannot be called from + any sort of irq context. 
The same rule applies for + synchronize_rcu_bh(), synchronize_sched(), synchronize_srcu(), + synchronize_rcu_expedited(), synchronize_rcu_bh_expedited(), + synchronize_sched_expedited(), and synchronize_srcu_expedited(). + + The expedited forms of these primitives have the same semantics + as the non-expedited forms, but expediting is both expensive + and unfriendly to real-time workloads. Use of the expedited + primitives should be restricted to rare configuration-change + operations that would not normally be undertaken while a real-time + workload is running. + +7. If the updater uses call_rcu() or synchronize_rcu(), then the + corresponding readers must use rcu_read_lock() and + rcu_read_unlock(). If the updater uses call_rcu_bh() or + synchronize_rcu_bh(), then the corresponding readers must + use rcu_read_lock_bh() and rcu_read_unlock_bh(). If the + updater uses call_rcu_sched() or synchronize_sched(), then + the corresponding readers must disable preemption, possibly + by calling rcu_read_lock_sched() and rcu_read_unlock_sched(). + If the updater uses synchronize_srcu(), then the corresponding + readers must use srcu_read_lock() and srcu_read_unlock(), + and with the same srcu_struct. The rules for the expedited + primitives are the same as for their non-expedited counterparts. + Mixing things up will result in confusion and broken kernels. + + One exception to this rule: rcu_read_lock() and rcu_read_unlock() + may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() + in cases where local bottom halves are already known to be + disabled, for example, in irq or softirq context. Commenting + such cases is a must, of course! And the jury is still out on + whether the increased speed is worth it. + +8. Although synchronize_rcu() is slower than is call_rcu(), it + usually results in simpler code. So, unless update performance + is critically important or the updaters cannot block, + synchronize_rcu() should be used in preference to call_rcu().
+ + An especially important property of the synchronize_rcu() + primitive is that it automatically self-limits: if grace periods + are delayed for whatever reason, then the synchronize_rcu() + primitive will correspondingly delay updates. In contrast, + code using call_rcu() should explicitly limit update rate in + cases where grace periods are delayed, as failing to do so can + result in excessive realtime latencies or even OOM conditions. + + Ways of gaining this self-limiting property when using call_rcu() + include: + + a. Keeping a count of the number of data-structure elements + used by the RCU-protected data structure, including + those waiting for a grace period to elapse. Enforce a + limit on this number, stalling updates as needed to allow + previously deferred frees to complete. Alternatively, + limit only the number awaiting deferred free rather than + the total number of elements. + + One way to stall the updates is to acquire the update-side + mutex. (Don't try this with a spinlock -- other CPUs + spinning on the lock could prevent the grace period + from ever ending.) Another way to stall the updates + is for the updates to use a wrapper function around + the memory allocator, so that this wrapper function + simulates OOM when there is too much memory awaiting an + RCU grace period. There are of course many other + variations on this theme. + + b. Limiting update rate. For example, if updates occur only + once per hour, then no explicit rate limiting is required, + unless your system is already badly broken. The dcache + subsystem takes this approach -- updates are guarded + by a global lock, limiting their rate. + + c. Trusted update -- if updates can only be done manually by + superuser or some other trusted user, then it might not + be necessary to automatically limit them. The theory + here is that superuser already has lots of ways to crash + the machine. + + d. 
Use call_rcu_bh() rather than call_rcu(), in order to take + advantage of call_rcu_bh()'s faster grace periods. + + e. Periodically invoke synchronize_rcu(), permitting a limited + number of updates per grace period. + + The same cautions apply to call_rcu_bh() and call_rcu_sched(). + +9. All RCU list-traversal primitives, which include + rcu_dereference(), list_for_each_entry_rcu(), + list_for_each_continue_rcu(), and list_for_each_safe_rcu(), + must be either within an RCU read-side critical section or + must be protected by appropriate update-side locks. RCU + read-side critical sections are delimited by rcu_read_lock() + and rcu_read_unlock(), or by similar primitives such as + rcu_read_lock_bh() and rcu_read_unlock_bh(), in which case + the matching rcu_dereference() primitive must be used in order + to keep lockdep happy, in this case, rcu_dereference_bh(). + + The reason that it is permissible to use RCU list-traversal + primitives when the update-side lock is held is that doing so + can be quite helpful in reducing code bloat when common code is + shared between readers and updaters. Additional primitives + are provided for this case, as discussed in lockdep.txt. + +10. Conversely, if you are in an RCU read-side critical section, + and you don't hold the appropriate update-side lock, you -must- + use the "_rcu()" variants of the list macros. Failing to do so + will break Alpha, cause aggressive compilers to generate bad code, + and confuse people trying to read your code. + +11. Note that synchronize_rcu() -only- guarantees to wait until + all currently executing rcu_read_lock()-protected RCU read-side + critical sections complete. It does -not- necessarily guarantee + that all currently running interrupts, NMIs, preempt_disable() + code, or idle loops will complete. Therefore, if you do not have + rcu_read_lock()-protected read-side critical sections, do -not- + use synchronize_rcu(). 
+ + Similarly, disabling preemption is not an acceptable substitute + for rcu_read_lock(). Code that attempts to use preemption + disabling where it should be using rcu_read_lock() will break + in real-time kernel builds. + + If you want to wait for interrupt handlers, NMI handlers, and + code under the influence of preempt_disable(), you instead + need to use synchronize_irq() or synchronize_sched(). + +12. Any lock acquired by an RCU callback must be acquired elsewhere + with softirq disabled, e.g., via spin_lock_irqsave(), + spin_lock_bh(), etc. Failing to disable irq on a given + acquisition of that lock will result in deadlock as soon as + the RCU softirq handler happens to run your RCU callback while + interrupting that acquisition's critical section. + +13. RCU callbacks can be and are executed in parallel. In many cases, + the callback code is simply a wrapper around kfree(), so that this + is not an issue (or, more accurately, to the extent that it is + an issue, the memory-allocator locking handles it). However, + if the callbacks do manipulate a shared data structure, they + must use whatever locking or other synchronization is required + to safely access and/or modify that data structure. + + RCU callbacks are -usually- executed on the same CPU that executed + the corresponding call_rcu(), call_rcu_bh(), or call_rcu_sched(), + but are by -no- means guaranteed to be. For example, if a given + CPU goes offline while having an RCU callback pending, then that + RCU callback will execute on some surviving CPU. (If this was + not the case, a self-spawning RCU callback would prevent the + victim CPU from ever going offline.) + +14. SRCU (srcu_read_lock(), srcu_read_unlock(), srcu_dereference(), + synchronize_srcu(), and synchronize_srcu_expedited()) may only + be invoked from process context.
Unlike other forms of RCU, it + -is- permissible to block in an SRCU read-side critical section + (demarked by srcu_read_lock() and srcu_read_unlock()), hence the + "SRCU": "sleepable RCU". Please note that if you don't need + to sleep in read-side critical sections, you should be using + RCU rather than SRCU, because RCU is almost always faster and + easier to use than is SRCU. + + Also unlike other forms of RCU, explicit initialization + and cleanup is required via init_srcu_struct() and + cleanup_srcu_struct(). These are passed a "struct srcu_struct" + that defines the scope of a given SRCU domain. Once initialized, + the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock(), + synchronize_srcu(), and synchronize_srcu_expedited(). A given + synchronize_srcu() waits only for SRCU read-side critical + sections governed by srcu_read_lock() and srcu_read_unlock() + calls that have been passed the same srcu_struct. This property + is what makes sleeping read-side critical sections tolerable -- + a given subsystem delays only its own updates, not those of other + subsystems using SRCU. Therefore, SRCU is less prone to OOM the + system than RCU would be if RCU's read-side critical sections + were permitted to sleep. + + The ability to sleep in read-side critical sections does not + come for free. First, corresponding srcu_read_lock() and + srcu_read_unlock() calls must be passed the same srcu_struct. + Second, grace-period-detection overhead is amortized only + over those updates sharing a given srcu_struct, rather than + being globally amortized as they are for other forms of RCU. + Therefore, SRCU should be used in preference to rw_semaphore + only in extremely read-intensive situations, or in situations + requiring SRCU's read-side deadlock immunity or low read-side + realtime latency. + + Note that rcu_assign_pointer() relates to SRCU just as it does + to other forms of RCU. + +15.
The whole point of call_rcu(), synchronize_rcu(), and friends + is to wait until all pre-existing readers have finished before + carrying out some otherwise-destructive operation. It is + therefore critically important to -first- remove any path + that readers can follow that could be affected by the + destructive operation, and -only- -then- invoke call_rcu(), + synchronize_rcu(), or friends. + + Because these primitives only wait for pre-existing readers, it + is the caller's responsibility to guarantee that any subsequent + readers will execute safely. + +16. The various RCU read-side primitives do -not- necessarily contain + memory barriers. You should therefore plan for the CPU + and the compiler to freely reorder code into and out of RCU + read-side critical sections. It is the responsibility of the + RCU update-side primitives to deal with this. + +17. Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and + the __rcu sparse checks to validate your RCU code. These + can help find problems as follows: + + CONFIG_PROVE_RCU: check that accesses to RCU-protected data + structures are carried out under the proper RCU + read-side critical section, while holding the right + combination of locks, or whatever other conditions + are appropriate. + + CONFIG_DEBUG_OBJECTS_RCU_HEAD: check that you don't pass the + same object to call_rcu() (or friends) before an RCU + grace period has elapsed since the last time that you + passed that same object to call_rcu() (or friends). + + __rcu sparse checks: tag the pointer to the RCU-protected data + structure with __rcu, and sparse will warn you if you + access that pointer without the services of one of the + variants of rcu_dereference(). + + These debugging aids can help you find problems that are + otherwise extremely difficult to spot. 
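Item 15's ordering can be illustrated with a small userspace sketch. This is not kernel code and not how RCU is implemented: toy_read_lock(), toy_read_unlock(), and toy_synchronize() are hypothetical stand-ins for rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu(), built here on a plain reader count, and struct foo and gp are likewise made up for illustration. The only point is the order of the three steps in remove_foo(): unpublish first, wait second, destroy last.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct foo {
	int a;
};

/* Hypothetical shared state: the pointer readers follow, and a toy
 * reader count that stands in for RCU's grace-period machinery. */
static _Atomic(struct foo *) gp;
static atomic_int toy_readers;

static void toy_read_lock(void)
{
	atomic_fetch_add(&toy_readers, 1);
}

static void toy_read_unlock(void)
{
	atomic_fetch_sub(&toy_readers, 1);
}

/* Toy "grace period": spin until no reader is in a critical section. */
static void toy_synchronize(void)
{
	while (atomic_load(&toy_readers) != 0)
		;
}

/* Reader: returns the ->a field, or -1 if nothing is published. */
static int reader(void)
{
	struct foo *p;
	int ret = -1;

	toy_read_lock();
	p = atomic_load(&gp);
	if (p)
		ret = p->a;
	toy_read_unlock();
	return ret;
}

/* Updater: the order of these three steps is the point of item 15. */
static void remove_foo(void)
{
	/* 1. Remove the path: readers starting from now on see NULL. */
	struct foo *old = atomic_exchange(&gp, NULL);

	/* 2. Wait for readers that might still hold the old pointer. */
	toy_synchronize();

	/* 3. Only now carry out the destructive operation. */
	free(old);	/* free(NULL) is a no-op */
}
```

Assuming a struct foo has been published through gp, reader() returns its ->a field; once remove_foo() has returned, reader() returns -1. Swapping the first two steps of remove_foo() would let a reader that starts after the wait pick up a pointer that is about to be freed.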
diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt new file mode 100644 index 00000000..4349c148 --- /dev/null +++ b/Documentation/RCU/listRCU.txt @@ -0,0 +1,315 @@ +Using RCU to Protect Read-Mostly Linked Lists + + +One of the best applications of RCU is to protect read-mostly linked lists +("struct list_head" in list.h). One big advantage of this approach +is that all of the required memory barriers are included for you in +the list macros. This document describes several applications of RCU, +with the best fits first. + + +Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates + +The best applications are cases where, if reader-writer locking were +used, the read-side lock would be dropped before taking any action +based on the results of the search. The most celebrated example is +the routing table. Because the routing table is tracking the state of +equipment outside of the computer, it will at times contain stale data. +Therefore, once the route has been computed, there is no need to hold +the routing table static during transmission of the packet. After all, +you can hold the routing table static all you want, but that won't keep +the external Internet from changing, and it is the state of the external +Internet that really matters. In addition, routing entries are typically +added or deleted, rather than being modified in place. + +A straightforward example of this use of RCU may be found in the +system-call auditing support. For example, a reader-writer locked +implementation of audit_filter_task() might be as follows: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + read_lock(&auditsc_lock); + /* Note: audit_netlink_sem held by caller. 
*/ + list_for_each_entry(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + read_unlock(&auditsc_lock); + return state; + } + } + read_unlock(&auditsc_lock); + return AUDIT_BUILD_CONTEXT; + } + +Here the list is searched under the lock, but the lock is dropped before +the corresponding value is returned. By the time that this value is acted +on, the list may well have been modified. This makes sense, since if +you are turning auditing off, it is OK to audit a few extra system calls. + +This means that RCU can be easily applied to the read side, as follows: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + rcu_read_lock(); + /* Note: audit_netlink_sem held by caller. */ + list_for_each_entry_rcu(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + rcu_read_unlock(); + return state; + } + } + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + +The read_lock() and read_unlock() calls have become rcu_read_lock() +and rcu_read_unlock(), respectively, and the list_for_each_entry() has +become list_for_each_entry_rcu(). The _rcu() list-traversal primitives +insert the read-side memory barriers that are required on DEC Alpha CPUs. + +The changes to the update side are also straightforward. 
A reader-writer +lock might be used as follows for deletion and insertion: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + write_lock(&auditsc_lock); + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + list_del(&e->list); + write_unlock(&auditsc_lock); + return 0; + } + } + write_unlock(&auditsc_lock); + return -EFAULT; /* No matching rule */ + } + + static inline int audit_add_rule(struct audit_entry *entry, + struct list_head *list) + { + write_lock(&auditsc_lock); + if (entry->rule.flags & AUDIT_PREPEND) { + entry->rule.flags &= ~AUDIT_PREPEND; + list_add(&entry->list, list); + } else { + list_add_tail(&entry->list, list); + } + write_unlock(&auditsc_lock); + return 0; + } + +Following are the RCU equivalents for these two functions: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + /* Do not use the _rcu iterator here, since this is the only + * deletion routine. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + list_del_rcu(&e->list); + call_rcu(&e->rcu, audit_free_rule); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + + static inline int audit_add_rule(struct audit_entry *entry, + struct list_head *list) + { + if (entry->rule.flags & AUDIT_PREPEND) { + entry->rule.flags &= ~AUDIT_PREPEND; + list_add_rcu(&entry->list, list); + } else { + list_add_tail_rcu(&entry->list, list); + } + return 0; + } + +Normally, the write_lock() and write_unlock() would be replaced by +a spin_lock() and a spin_unlock(), but in this case, all callers hold +audit_netlink_sem, so no additional locking is required. The auditsc_lock +can therefore be eliminated, since use of RCU eliminates the need for +writers to exclude readers.
+ +The list_del(), list_add(), and list_add_tail() primitives have been +replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu(). +The _rcu() list-manipulation primitives add memory barriers that are +needed on weakly ordered CPUs (most of them!). The list_del_rcu() +primitive omits the pointer poisoning debug-assist code that would +otherwise cause concurrent readers to fail spectacularly. + +So, when readers can tolerate stale data and when entries are either added +or deleted, without in-place modification, it is very easy to use RCU! + + +Example 2: Handling In-Place Updates + +The system-call auditing code does not update auditing rules in place. +However, if it did, reader-writer-locked code to do so might look as +follows (presumably, the field_count is only permitted to decrease; +otherwise, the added fields would need to be filled in): + + static inline int audit_upd_rule(struct audit_rule *rule, + struct list_head *list, + __u32 newaction, + __u32 newfield_count) + { + struct audit_entry *e; + + write_lock(&auditsc_lock); + /* Note: audit_netlink_sem held by caller. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + e->rule.action = newaction; + e->rule.field_count = newfield_count; + write_unlock(&auditsc_lock); + return 0; + } + } + write_unlock(&auditsc_lock); + return -EFAULT; /* No matching rule */ + } + +The RCU version creates a copy, updates the copy, then replaces the old +entry with the newly updated entry. This sequence of actions, allowing +concurrent reads while doing a copy to perform an update, is what gives +RCU ("read-copy update") its name.
The RCU code is as follows: + + static inline int audit_upd_rule(struct audit_rule *rule, + struct list_head *list, + __u32 newaction, + __u32 newfield_count) + { + struct audit_entry *e; + struct audit_entry *ne; + + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + ne = kmalloc(sizeof(*ne), GFP_ATOMIC); + if (ne == NULL) + return -ENOMEM; + audit_copy_rule(&ne->rule, &e->rule); + ne->rule.action = newaction; + ne->rule.field_count = newfield_count; + list_replace_rcu(&e->list, &ne->list); + call_rcu(&e->rcu, audit_free_rule); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + +Again, this assumes that the caller holds audit_netlink_sem. Normally, +the reader-writer lock would become a spinlock in this sort of code. + + +Example 3: Eliminating Stale Data + +The auditing examples above tolerate stale data, as do most algorithms +that are tracking external state. Because there is a delay from the +time the external state changes before Linux becomes aware of the change, +additional RCU-induced staleness is normally not a problem. + +However, there are many examples where stale data cannot be tolerated. +One example in the Linux kernel is the System V IPC (see the ipc_lock() +function in ipc/util.c). This code checks a "deleted" flag under a +per-entry spinlock, and, if the "deleted" flag is set, pretends that the +entry does not exist. For this to be helpful, the search function must +return holding the per-entry spinlock, as ipc_lock() does in fact do. + +Quick Quiz: Why does the search function need to return holding the + per-entry lock for this deleted-flag technique to be helpful?
+ +If the system-call audit module were to ever need to reject stale data, +one way to accomplish this would be to add a "deleted" flag and a "lock" +spinlock to the audit_entry structure, and modify audit_filter_task() +as follows: + + static enum audit_state audit_filter_task(struct task_struct *tsk) + { + struct audit_entry *e; + enum audit_state state; + + rcu_read_lock(); + list_for_each_entry_rcu(e, &audit_tsklist, list) { + if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { + spin_lock(&e->lock); + if (e->deleted) { + spin_unlock(&e->lock); + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + rcu_read_unlock(); + return state; + } + } + rcu_read_unlock(); + return AUDIT_BUILD_CONTEXT; + } + +Note that this example assumes that entries are only added and deleted. +Additional mechanism is required to deal correctly with the +update-in-place performed by audit_upd_rule(). For one thing, +audit_upd_rule() would need additional memory barriers to ensure +that the list_add_rcu() was really executed before the list_del_rcu(). + +The audit_del_rule() function would need to set the "deleted" +flag under the spinlock as follows: + + static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) + { + struct audit_entry *e; + + /* Do not need to use the _rcu iterator here, since this + * is the only deletion routine. */ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + spin_lock(&e->lock); + list_del_rcu(&e->list); + e->deleted = 1; + spin_unlock(&e->lock); + call_rcu(&e->rcu, audit_free_rule); + return 0; + } + } + return -EFAULT; /* No matching rule */ + } + + +Summary + +Read-mostly list-based data structures that can tolerate stale data are +the most amenable to use of RCU. 
The simplest case is where entries are +either added or deleted from the data structure (or atomically modified +in place), but non-atomic in-place modifications can be handled by making +a copy, updating the copy, then replacing the original with the copy. +If stale data cannot be tolerated, then a "deleted" flag may be used +in conjunction with a per-entry spinlock in order to allow the search +function to reject newly deleted data. + + +Answer to Quick Quiz + Why does the search function need to return holding the per-entry + lock for this deleted-flag technique to be helpful? + + If the search function drops the per-entry lock before returning, + then the caller will be processing stale data in any case. If it + is really OK to be processing stale data, then you don't need a + "deleted" flag. If processing stale data really is a problem, + then you need to hold the per-entry lock across all of the code + that uses the value that was returned. diff --git a/Documentation/RCU/lockdep.txt b/Documentation/RCU/lockdep.txt new file mode 100644 index 00000000..d7a49b2f --- /dev/null +++ b/Documentation/RCU/lockdep.txt @@ -0,0 +1,91 @@ +RCU and lockdep checking + +All flavors of RCU have lockdep checking available, so that lockdep is +aware of when each task enters and leaves any flavor of RCU read-side +critical section. Each flavor of RCU is tracked separately (but note +that this is not the case in 2.6.32 and earlier). This allows lockdep's +tracking to include RCU state, which can sometimes help when debugging +deadlocks and the like. + +In addition, RCU provides the following primitives that check lockdep's +state: + + rcu_read_lock_held() for normal RCU. + rcu_read_lock_bh_held() for RCU-bh. + rcu_read_lock_sched_held() for RCU-sched. + srcu_read_lock_held() for SRCU. + +These functions are conservative, and will therefore return 1 if they +aren't certain (for example, if CONFIG_DEBUG_LOCK_ALLOC is not set). 
+This prevents things like WARN_ON(!rcu_read_lock_held()) from giving false +positives when lockdep is disabled. + +In addition, a separate kernel config parameter CONFIG_PROVE_RCU enables +checking of rcu_dereference() primitives: + + rcu_dereference(p): + Check for RCU read-side critical section. + rcu_dereference_bh(p): + Check for RCU-bh read-side critical section. + rcu_dereference_sched(p): + Check for RCU-sched read-side critical section. + srcu_dereference(p, sp): + Check for SRCU read-side critical section. + rcu_dereference_check(p, c): + Use explicit check expression "c". This is useful in + code that is invoked by both readers and updaters. + rcu_dereference_raw(p): + Don't check. (Use sparingly, if at all.) + rcu_dereference_protected(p, c): + Use explicit check expression "c", and omit all barriers + and compiler constraints. This is useful when the data + structure cannot change, for example, in code that is + invoked only by updaters. + rcu_access_pointer(p): + Return the value of the pointer and omit all barriers, + but retain the compiler constraints that prevent duplicating + or coalescing. This is useful when testing the + value of the pointer itself, for example, against NULL. + +The rcu_dereference_check() check expression can be any boolean +expression, but would normally include one of the rcu_read_lock_held() +family of functions and a lockdep expression. However, any boolean +expression can be used. For a moderately ornate example, consider +the following: + + file = rcu_dereference_check(fdt->fd[fd], + rcu_read_lock_held() || + lockdep_is_held(&files->file_lock) || + atomic_read(&files->count) == 1); + +This expression picks up the pointer "fdt->fd[fd]" in an RCU-safe manner, +and, if CONFIG_PROVE_RCU is configured, verifies that this expression +is used in: + +1. An RCU read-side critical section, or +2. with files->file_lock held, or +3. on an unshared files_struct.
+ +In case (1), the pointer is picked up in an RCU-safe manner for vanilla +RCU read-side critical sections, in case (2) the ->file_lock prevents +any change from taking place, and finally, in case (3) the current task +is the only task accessing the files_struct, again preventing any change +from taking place. If the above statement were invoked only from updater +code, it could instead be written as follows: + + file = rcu_dereference_protected(fdt->fd[fd], + lockdep_is_held(&files->file_lock) || + atomic_read(&files->count) == 1); + +This would verify cases #2 and #3 above, and furthermore lockdep would +complain if this were used in an RCU read-side critical section unless one +of these two cases held. Because rcu_dereference_protected() omits all +barriers and compiler constraints, it generates better code than do the +other flavors of rcu_dereference(). On the other hand, it is illegal +to use rcu_dereference_protected() if either the RCU-protected pointer +or the RCU-protected data that it points to can change concurrently. + +There are currently only "universal" versions of the rcu_assign_pointer() +and RCU list-/tree-traversal primitives, which do not (yet) check for +being in an RCU read-side critical section. In the future, separate +versions of these primitives might be created. diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt new file mode 100644 index 00000000..31852705 --- /dev/null +++ b/Documentation/RCU/rcu.txt @@ -0,0 +1,96 @@ +RCU Concepts + + +The basic idea behind RCU (read-copy update) is to split destructive +operations into two parts, one that prevents anyone from seeing the data +item being destroyed, and one that actually carries out the destruction. +A "grace period" must elapse between the two parts, and this grace period +must be long enough that any readers accessing the item being deleted have +since dropped their references.
For example, an RCU-protected deletion +from a linked list would first remove the item from the list, wait for +a grace period to elapse, then free the element. See the listRCU.txt +file for more information on using RCU with linked lists. + + +Frequently Asked Questions + +o Why would anyone want to use RCU? + + The advantage of RCU's two-part approach is that RCU readers need + not acquire any locks, perform any atomic instructions, write to + shared memory, or (on CPUs other than Alpha) execute any memory + barriers. The fact that these operations are quite expensive + on modern CPUs is what gives RCU its performance advantages + in read-mostly situations. The fact that RCU readers need not + acquire locks can also greatly simplify deadlock-avoidance code. + +o How can the updater tell when a grace period has completed + if the RCU readers give no indication when they are done? + + Just as with spinlocks, RCU readers are not permitted to + block, switch to user-mode execution, or enter the idle loop. + Therefore, as soon as a CPU is seen passing through any of these + three states, we know that that CPU has exited any previous RCU + read-side critical sections. So, if we remove an item from a + linked list, and then wait until all CPUs have switched context, + executed in user mode, or executed in the idle loop, we can + safely free up that item. + + Preemptible variants of RCU (CONFIG_TREE_PREEMPT_RCU) get the + same effect, but require that the readers manipulate CPU-local + counters. These counters allow limited types of blocking + within RCU read-side critical sections. SRCU also uses + CPU-local counters, and permits general blocking within + RCU read-side critical sections. These two variants of + RCU detect grace periods by sampling these counters. + +o If I am running on a uniprocessor kernel, which can only do one + thing at a time, why should I wait for a grace period? + + See the UP.txt file in this directory. 
+ +o How can I see where RCU is currently used in the Linux kernel? + + Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu", + "rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh", + "srcu_read_lock", "srcu_read_unlock", "synchronize_rcu", + "synchronize_net", "synchronize_srcu", and the other RCU + primitives. Or grab one of the cscope databases from: + + http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html + +o What guidelines should I follow when writing code that uses RCU? + + See the checklist.txt file in this directory. + +o Why the name "RCU"? + + "RCU" stands for "read-copy update". The file listRCU.txt has + more information on where this name came from, search for + "read-copy update" to find it. + +o I hear that RCU is patented? What is with that? + + Yes, it is. There are several known patents related to RCU, + search for the string "Patent" in RTFP.txt to find them. + Of these, one was allowed to lapse by the assignee, and the + others have been contributed to the Linux kernel under GPL. + There are now also LGPL implementations of user-level RCU + available (http://lttng.org/?q=node/18). + +o I hear that RCU needs work in order to support realtime kernels? + + This work is largely completed. Realtime-friendly RCU can be + enabled via the CONFIG_TREE_PREEMPT_RCU kernel configuration + parameter. However, work is in progress for enabling priority + boosting of preempted RCU read-side critical sections. This is + needed if you have CPU-bound realtime threads. + +o Where can I find more information on RCU? + + See the RTFP.txt file in this directory. + Or point your browser at http://www.rdrop.com/users/paulmck/RCU/. + +o What are all these files in this directory? + + See 00-INDEX for the list. 
diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt new file mode 100644 index 00000000..e439a0ed --- /dev/null +++ b/Documentation/RCU/rcubarrier.txt @@ -0,0 +1,311 @@ +RCU and Unloadable Modules + +[Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/] + +RCU (read-copy update) is a synchronization mechanism that can be thought +of as a replacement for reader-writer locking (among other things), but with +very low-overhead readers that are immune to deadlock, priority inversion, +and unbounded latency. RCU read-side critical sections are delimited +by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPT +kernels, generate no code whatsoever. + +This means that RCU writers are unaware of the presence of concurrent +readers, so that RCU updates to shared data must be undertaken quite +carefully, leaving an old version of the data structure in place until all +pre-existing readers have finished. These old versions are needed because +such readers might hold a reference to them. RCU updates can therefore be +rather expensive, and RCU is thus best suited for read-mostly situations. + +How can an RCU writer possibly determine when all readers are finished, +given that readers might well leave absolutely no trace of their +presence? There is a synchronize_rcu() primitive that blocks until all +pre-existing readers have completed. An updater wishing to delete an +element p from a linked list might do the following, while holding an +appropriate lock, of course: + + list_del_rcu(p); + synchronize_rcu(); + kfree(p); + +But the above code cannot be used in IRQ context -- the call_rcu() +primitive must be used instead. This primitive takes a pointer to an +rcu_head struct placed within the RCU-protected data structure and +another pointer to a function that may be invoked later to free that +structure.
Code to delete an element p from the linked list from IRQ +context might then be as follows: + + list_del_rcu(p); + call_rcu(&p->rcu, p_callback); + +Since call_rcu() never blocks, this code can safely be used from within +IRQ context. The function p_callback() might be defined as follows: + + static void p_callback(struct rcu_head *rp) + { + struct pstruct *p = container_of(rp, struct pstruct, rcu); + + kfree(p); + } + + +Unloading Modules That Use call_rcu() + +But what if p_callback is defined in an unloadable module? + +If we unload the module while some RCU callbacks are pending, +the CPUs executing these callbacks are going to be severely +disappointed when they are later invoked, as fancifully depicted at +http://lwn.net/images/ns/kernel/rcu-drop.jpg. + +We could try placing a synchronize_rcu() in the module-exit code path, +but this is not sufficient. Although synchronize_rcu() does wait for a +grace period to elapse, it does not wait for the callbacks to complete. + +One might be tempted to try several back-to-back synchronize_rcu() +calls, but this is still not guaranteed to work. If there is a very +heavy RCU-callback load, then some of the callbacks might be deferred +in order to allow other processing to proceed. Such deferral is required +in realtime kernels in order to avoid excessive scheduling latencies. + + +rcu_barrier() + +We instead need the rcu_barrier() primitive. This primitive is similar +to synchronize_rcu(), but instead of waiting solely for a grace +period to elapse, it also waits for all outstanding RCU callbacks to +complete. Pseudo-code using rcu_barrier() is as follows: + + 1. Prevent any new RCU callbacks from being posted. + 2. Execute rcu_barrier(). + 3. Allow the module to be unloaded. + +Quick Quiz #1: Why is there no srcu_barrier()? 
+ +The rcutorture module makes use of rcu_barrier in its exit function +as follows: + + 1 static void + 2 rcu_torture_cleanup(void) + 3 { + 4 int i; + 5 + 6 fullstop = 1; + 7 if (shuffler_task != NULL) { + 8 VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task"); + 9 kthread_stop(shuffler_task); +10 } +11 shuffler_task = NULL; +12 +13 if (writer_task != NULL) { +14 VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task"); +15 kthread_stop(writer_task); +16 } +17 writer_task = NULL; +18 +19 if (reader_tasks != NULL) { +20 for (i = 0; i < nrealreaders; i++) { +21 if (reader_tasks[i] != NULL) { +22 VERBOSE_PRINTK_STRING( +23 "Stopping rcu_torture_reader task"); +24 kthread_stop(reader_tasks[i]); +25 } +26 reader_tasks[i] = NULL; +27 } +28 kfree(reader_tasks); +29 reader_tasks = NULL; +30 } +31 rcu_torture_current = NULL; +32 +33 if (fakewriter_tasks != NULL) { +34 for (i = 0; i < nfakewriters; i++) { +35 if (fakewriter_tasks[i] != NULL) { +36 VERBOSE_PRINTK_STRING( +37 "Stopping rcu_torture_fakewriter task"); +38 kthread_stop(fakewriter_tasks[i]); +39 } +40 fakewriter_tasks[i] = NULL; +41 } +42 kfree(fakewriter_tasks); +43 fakewriter_tasks = NULL; +44 } +45 +46 if (stats_task != NULL) { +47 VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task"); +48 kthread_stop(stats_task); +49 } +50 stats_task = NULL; +51 +52 /* Wait for all RCU callbacks to fire. */ +53 rcu_barrier(); +54 +55 rcu_torture_stats_print(); /* -After- the stats thread is stopped! */ +56 +57 if (cur_ops->cleanup != NULL) +58 cur_ops->cleanup(); +59 if (atomic_read(&n_rcu_torture_error)) +60 rcu_torture_print_module_parms("End of test: FAILURE"); +61 else +62 rcu_torture_print_module_parms("End of test: SUCCESS"); +63 } + +Line 6 sets a global variable that prevents any RCU callbacks from +re-posting themselves. This will not be necessary in most cases, since +RCU callbacks rarely include calls to call_rcu(). 
However, the rcutorture +module is an exception to this rule, and therefore needs to set this +global variable. + +Lines 7-50 stop all the kernel tasks associated with the rcutorture +module. Therefore, once execution reaches line 53, no more rcutorture +RCU callbacks will be posted. The rcu_barrier() call on line 53 waits +for any pre-existing callbacks to complete. + +Then lines 55-62 print status and do operation-specific cleanup, and +then return, permitting the module-unload operation to be completed. + +Quick Quiz #2: Is there any other situation where rcu_barrier() might + be required? + +Your module might have additional complications. For example, if your +module invokes call_rcu() from timers, you will need to first cancel all +the timers, and only then invoke rcu_barrier() to wait for any remaining +RCU callbacks to complete. + +Of course, if your module uses call_rcu_bh(), you will need to invoke +rcu_barrier_bh() before unloading. Similarly, if your module uses +call_rcu_sched(), you will need to invoke rcu_barrier_sched() before +unloading. If your module uses call_rcu(), call_rcu_bh(), -and- +call_rcu_sched(), then you will need to invoke each of rcu_barrier(), +rcu_barrier_bh(), and rcu_barrier_sched(). + + +Implementing rcu_barrier() + +Dipankar Sarma's implementation of rcu_barrier() makes use of the fact +that RCU callbacks are never reordered once queued on one of the per-CPU +queues. His implementation queues an RCU callback on each of the per-CPU +callback queues, and then waits until they have all started executing, at +which point, all earlier RCU callbacks are guaranteed to have completed.
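The queue-ordering argument above can be sketched in a single-threaded userspace simulation. Everything in it is an assumption made for illustration: the toy_ names, the fixed-size per-CPU FIFO arrays, and the plain (non-atomic) counter and flag that stand in for an atomic counter and a completion. It is not the kernel's implementation; it only shows why a callback queued last on each queue cannot run before the callbacks queued ahead of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_TOY_CPUS	2
#define TOY_QLEN	8

typedef void (*toy_cb_t)(void);

/* One FIFO callback queue per simulated CPU; entries are never reordered. */
struct toy_queue {
	toy_cb_t cb[TOY_QLEN];
	size_t head;	/* next callback to invoke */
	size_t tail;	/* next free slot */
};

static struct toy_queue toy_queues[NR_TOY_CPUS];
static int toy_barrier_count;	/* barrier callbacks not yet invoked */
static bool toy_barrier_done;	/* stands in for the completion */
static int toy_freed;		/* work done by ordinary callbacks */

/* Queue a callback on the given CPU; returns false if the toy queue is full. */
static bool toy_call_rcu(int cpu, toy_cb_t func)
{
	struct toy_queue *q = &toy_queues[cpu];

	if (q->tail == TOY_QLEN)
		return false;
	q->cb[q->tail++] = func;
	return true;
}

/* Invoke up to n callbacks from one CPU's queue, oldest first. */
static void toy_run_callbacks(int cpu, size_t n)
{
	struct toy_queue *q = &toy_queues[cpu];

	for (; n > 0 && q->head < q->tail; n--)
		q->cb[q->head++]();
}

/* An ordinary callback, e.g. one that would free an element. */
static void toy_free_cb(void)
{
	toy_freed++;
}

/* The barrier callback: the last one to run signals completion. */
static void toy_barrier_cb(void)
{
	if (--toy_barrier_count == 0)
		toy_barrier_done = true;
}

/* Post one barrier callback behind whatever each CPU already has queued.
 * All of them are posted before any callback is allowed to run. */
static bool toy_barrier_post(void)
{
	int cpu;

	toy_barrier_done = false;
	toy_barrier_count = 0;
	for (cpu = 0; cpu < NR_TOY_CPUS; cpu++) {
		if (!toy_call_rcu(cpu, toy_barrier_cb))
			return false;	/* only the bounded toy queue can fail */
		toy_barrier_count++;
	}
	return true;
}
```

If ordinary callbacks are queued on both simulated CPUs and toy_barrier_post() is then called, toy_barrier_done stays false until toy_run_callbacks() has drained every queue up to and including its barrier callback, by which time all of the earlier toy_free_cb() invocations have run.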
+ +The original code for rcu_barrier() was as follows: + + 1 void rcu_barrier(void) + 2 { + 3 BUG_ON(in_interrupt()); + 4 /* Take cpucontrol mutex to protect against CPU hotplug */ + 5 mutex_lock(&rcu_barrier_mutex); + 6 init_completion(&rcu_barrier_completion); + 7 atomic_set(&rcu_barrier_cpu_count, 0); + 8 on_each_cpu(rcu_barrier_func, NULL, 0, 1); + 9 wait_for_completion(&rcu_barrier_completion); +10 mutex_unlock(&rcu_barrier_mutex); +11 } + +Line 3 verifies that the caller is in process context, and lines 5 and 10 +use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the +global completion and counters at a time, which are initialized on lines +6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is +shown below. Note that the final "1" in on_each_cpu()'s argument list +ensures that all the calls to rcu_barrier_func() will have completed +before on_each_cpu() returns. Line 9 then waits for the completion. + +This code was rewritten in 2008 to support rcu_barrier_bh() and +rcu_barrier_sched() in addition to the original rcu_barrier(). + +The rcu_barrier_func() runs on each CPU, where it invokes call_rcu() +to post an RCU callback, as follows: + + 1 static void rcu_barrier_func(void *notused) + 2 { + 3 int cpu = smp_processor_id(); + 4 struct rcu_data *rdp = &per_cpu(rcu_data, cpu); + 5 struct rcu_head *head; + 6 + 7 head = &rdp->barrier; + 8 atomic_inc(&rcu_barrier_cpu_count); + 9 call_rcu(head, rcu_barrier_callback); +10 } + +Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure, +which contains the struct rcu_head that is needed for the later call to +call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line +8 increments a global counter. This counter will later be decremented +by the callback. Line 9 then registers the rcu_barrier_callback() on +the current CPU's queue.
+ +The rcu_barrier_callback() function simply atomically decrements the +rcu_barrier_cpu_count variable and finalizes the completion when it +reaches zero, as follows: + + 1 static void rcu_barrier_callback(struct rcu_head *notused) + 2 { + 3 if (atomic_dec_and_test(&rcu_barrier_cpu_count)) + 4 complete(&rcu_barrier_completion); + 5 } + +Quick Quiz #3: What happens if CPU 0's rcu_barrier_func() executes + immediately (thus incrementing rcu_barrier_cpu_count to the + value one), but the other CPU's rcu_barrier_func() invocations + are delayed for a full grace period? Couldn't this result in + rcu_barrier() returning prematurely? + + +rcu_barrier() Summary + +The rcu_barrier() primitive has seen relatively little use, since most +code using RCU is in the core kernel rather than in modules. However, if +you are using RCU from an unloadable module, you need to use rcu_barrier() +so that your module may be safely unloaded. + + +Answers to Quick Quizzes + +Quick Quiz #1: Why is there no srcu_barrier()? + +Answer: Since there is no call_srcu(), there can be no outstanding SRCU + callbacks. Therefore, there is no need to wait for them. + +Quick Quiz #2: Is there any other situation where rcu_barrier() might + be required? + +Answer: Interestingly enough, rcu_barrier() was not originally + implemented for module unloading. Nikita Danilov was using + RCU in a filesystem, which resulted in a similar situation at + filesystem-unmount time. Dipankar Sarma coded up rcu_barrier() + in response, so that Nikita could invoke it during the + filesystem-unmount process. + + Much later, yours truly hit the RCU module-unload problem when + implementing rcutorture, and found that rcu_barrier() solves + this problem as well. + +Quick Quiz #3: What happens if CPU 0's rcu_barrier_func() executes + immediately (thus incrementing rcu_barrier_cpu_count to the + value one), but the other CPU's rcu_barrier_func() invocations + are delayed for a full grace period? 
Couldn't this result in + rcu_barrier() returning prematurely? + +Answer: This cannot happen. The reason is that on_each_cpu() has its last + argument, the wait flag, set to "1". This flag is passed through + to smp_call_function() and further to smp_call_function_on_cpu(), + causing this latter to spin until the cross-CPU invocation of + rcu_barrier_func() has completed. This by itself would prevent + a grace period from completing on non-CONFIG_PREEMPT kernels, + since each CPU must undergo a context switch (or other quiescent + state) before the grace period can complete. However, this is + of no use in CONFIG_PREEMPT kernels. + + Therefore, on_each_cpu() disables preemption across its call + to smp_call_function() and also across the local call to + rcu_barrier_func(). This prevents the local CPU from context + switching, again preventing grace periods from completing. This + means that all CPUs have executed rcu_barrier_func() before + the first rcu_barrier_callback() can possibly execute, in turn + preventing rcu_barrier_cpu_count from prematurely reaching zero. + + Currently, -rt implementations of RCU keep but a single global + queue for RCU callbacks, and thus do not suffer from this + problem. However, when the -rt RCU eventually does have per-CPU + callback queues, things will have to change. One simple change + is to add an rcu_read_lock() before line 8 of rcu_barrier() + and an rcu_read_unlock() after line 8 of this same function. If + you can think of a better change, please let me know! diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt new file mode 100644 index 00000000..18f9651f --- /dev/null +++ b/Documentation/RCU/rculist_nulls.txt @@ -0,0 +1,172 @@ +Using hlist_nulls to protect read-mostly linked lists and +objects using SLAB_DESTROY_BY_RCU allocations. 
+ +Please read the basics in Documentation/RCU/listRCU.txt + +Using special markers (called 'nulls') is a convenient way +to solve the following problem : + +A typical RCU linked list managing objects which are +allocated with SLAB_DESTROY_BY_RCU kmem_cache can +use the following algorithms : + +1) Lookup algo +-------------- +rcu_read_lock(); +begin: +obj = lockless_lookup(key); +if (obj) { + if (!try_get_ref(obj)) // might fail for free objects + goto begin; + /* + * Because a writer could delete the object, and a writer could + * reuse this object before the RCU grace period, we + * must check key after getting the reference on the object + */ + if (obj->key != key) { // not the object we expected + put_ref(obj); + goto begin; + } +} +rcu_read_unlock(); + +Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu() +but must use a version with an additional memory barrier (smp_rmb()) + +lockless_lookup(key) +{ + struct hlist_node *node, *next; + for (pos = rcu_dereference((head)->first); + pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); + pos = rcu_dereference(next)) + if (obj->key == key) + return obj; + return NULL; +} + +And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb() : + + struct hlist_node *node; + for (pos = rcu_dereference((head)->first); + pos && ({ prefetch(pos->next); 1; }) && + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); + pos = rcu_dereference(pos->next)) + if (obj->key == key) + return obj; + return NULL; + +Quoting Corey Minyard : + +"If the object is moved from one list to another list in-between the + time the hash is calculated and the next field is accessed, and the + object has moved to the end of a new list, the traversal will not + complete properly on the list it should have, since the object will + be on the end of the new list and there's not a way to tell it's on a + new list and restart the list traversal.
I think that this can be + solved by pre-fetching the "next" field (with proper barriers) before + checking the key." + +2) Insert algo : +---------------- + +We need to make sure a reader cannot read the new 'obj->obj_next' value +and the previous value of 'obj->key'. Otherwise, an item could be deleted +from a chain, and inserted into another chain. If the new chain was empty +before the move, the 'next' pointer is NULL, and a lockless reader cannot +detect that it missed the following items in the original chain. + +/* + * Please note that new inserts are done at the head of list, + * not in the middle or end. + */ +obj = kmem_cache_alloc(...); +lock_chain(); // typically a spin_lock() +obj->key = key; +/* + * we need to make sure obj->key is updated before obj->next + * or obj->refcnt + */ +smp_wmb(); +atomic_set(&obj->refcnt, 1); +hlist_add_head_rcu(&obj->obj_node, list); +unlock_chain(); // typically a spin_unlock() + + +3) Remove algo +-------------- +Nothing special here, we can use a standard RCU hlist deletion. +But because of SLAB_DESTROY_BY_RCU, beware that a deleted object can be reused +very quickly (before the end of the RCU grace period). + +if (put_last_reference_on(obj)) { + lock_chain(); // typically a spin_lock() + hlist_del_init_rcu(&obj->obj_node); + unlock_chain(); // typically a spin_unlock() + kmem_cache_free(cachep, obj); +} + + + +-------------------------------------------------------------------------- +With hlist_nulls we can avoid the extra smp_rmb() in lockless_lookup() +and the extra smp_wmb() in the insert function. + +For example, if we choose to store the slot number as the 'nulls' +end-of-list marker for each slot of the hash table, we can detect +a race (some writer did a delete and/or a move of an object +to another chain) by checking the final 'nulls' value if +the lookup reached the end of the chain. If the final 'nulls' value +is not the slot number, then we must restart the lookup at +the beginning.
If the object was moved to the same chain, +then the reader doesn't care: it might eventually +scan the list again without harm. + + +1) lookup algo + + head = &table[slot]; + rcu_read_lock(); +begin: + hlist_nulls_for_each_entry_rcu(obj, node, head, member) { + if (obj->key == key) { + if (!try_get_ref(obj)) // might fail for free objects + goto begin; + if (obj->key != key) { // not the object we expected + put_ref(obj); + goto begin; + } + goto out; + } + } +/* + * if the nulls value we got at the end of this lookup is + * not the expected one, we must restart lookup. + * We probably met an item that was moved to another chain. + */ + if (get_nulls_value(node) != slot) + goto begin; + obj = NULL; + +out: + rcu_read_unlock(); + +2) Insert function : +-------------------- + +/* + * Please note that new inserts are done at the head of list, + * not in the middle or end. + */ +obj = kmem_cache_alloc(cachep); +lock_chain(); // typically a spin_lock() +obj->key = key; +/* + * changes to obj->key must be visible before refcnt one + */ +smp_wmb(); +atomic_set(&obj->refcnt, 1); +/* + * insert obj in RCU way (readers might be traversing chain) + */ +hlist_nulls_add_head_rcu(&obj->obj_node, list); +unlock_chain(); // typically a spin_unlock() diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.txt new file mode 100644 index 00000000..4202ad09 --- /dev/null +++ b/Documentation/RCU/rcuref.txt @@ -0,0 +1,66 @@ +Reference-count design for elements of lists/arrays protected by RCU. + +Reference counting on elements of lists which are protected by traditional +reader/writer spinlocks or semaphores is straightforward: + +1. 2. +add() search_and_reference() +{ { + alloc_object read_lock(&list_lock); + ... search_for_element + atomic_set(&el->rc, 1); atomic_inc(&el->rc); + write_lock(&list_lock); ... + add_element read_unlock(&list_lock); + ... ... + write_unlock(&list_lock); } +} + +3. 4. +release_referenced() delete() +{ { + ...
write_lock(&list_lock); + atomic_dec(&el->rc, relfunc) ... + ... delete_element +} write_unlock(&list_lock); + ... + if (atomic_dec_and_test(&el->rc)) + kfree(el); + ... + } + +If this list/array is made lock free using RCU as in changing the +write_lock() in add() and delete() to spin_lock() and changing read_lock() +in search_and_reference() to rcu_read_lock(), the atomic_inc() in +search_and_reference() could potentially hold reference to an element which +has already been deleted from the list/array. Use atomic_inc_not_zero() +in this scenario as follows: + +1. 2. +add() search_and_reference() +{ { + alloc_object rcu_read_lock(); + ... search_for_element + atomic_set(&el->rc, 1); if (!atomic_inc_not_zero(&el->rc)) { + spin_lock(&list_lock); rcu_read_unlock(); + return FAIL; + add_element } + ... ... + spin_unlock(&list_lock); rcu_read_unlock(); +} } +3. 4. +release_referenced() delete() +{ { + ... spin_lock(&list_lock); + if (atomic_dec_and_test(&el->rc)) ... + call_rcu(&el->head, el_free); delete_element + ... spin_unlock(&list_lock); +} ... + if (atomic_dec_and_test(&el->rc)) + call_rcu(&el->head, el_free); + ... + } + +Sometimes, a reference to the element needs to be obtained in the +update (write) stream. In such cases, atomic_inc_not_zero() might be +overkill, since we hold the update-side spinlock. One might instead +use atomic_inc() in such cases. diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt new file mode 100644 index 00000000..4e959208 --- /dev/null +++ b/Documentation/RCU/stallwarn.txt @@ -0,0 +1,127 @@ +Using RCU's CPU Stall Detector + +The rcu_cpu_stall_suppress module parameter enables RCU's CPU stall +detector, which detects conditions that unduly delay RCU grace periods. +This module parameter enables CPU stall detection by default, but +may be overridden via boot-time parameter or at runtime via sysfs. 
+The stall detector's idea of what constitutes "unduly delayed" is +controlled by a set of kernel configuration variables and cpp macros: + +CONFIG_RCU_CPU_STALL_TIMEOUT + + This kernel configuration parameter defines the period of time + that RCU will wait from the beginning of a grace period until it + issues an RCU CPU stall warning. This time period is normally + ten seconds. + +RCU_SECONDS_TILL_STALL_RECHECK + + This macro defines the period of time that RCU will wait after + issuing a stall warning until it issues another stall warning + for the same stall. This time period is normally set to three + times the check interval plus thirty seconds. + +RCU_STALL_RAT_DELAY + + The CPU stall detector tries to make the offending CPU print its + own warnings, as this often gives better-quality stack traces. + However, if the offending CPU does not detect its own stall in + the number of jiffies specified by RCU_STALL_RAT_DELAY, then + some other CPU will complain. This delay is normally set to + two jiffies. + +When a CPU detects that it is stalling, it will print a message similar +to the following: + +INFO: rcu_sched_state detected stall on CPU 5 (t=2500 jiffies) + +This message indicates that CPU 5 detected that it was causing a stall, +and that the stall was affecting RCU-sched. This message will normally be +followed by a stack dump of the offending CPU. On TREE_RCU kernel builds, +RCU and RCU-sched are implemented by the same underlying mechanism, +while on TREE_PREEMPT_RCU kernel builds, RCU is instead implemented +by rcu_preempt_state. + +On the other hand, if the offending CPU fails to print out a stall-warning +message quickly enough, some other CPU will print a message similar to +the following: + +INFO: rcu_bh_state detected stalls on CPUs/tasks: { 3 5 } (detected by 2, 2502 jiffies) + +This message indicates that CPU 2 detected that CPUs 3 and 5 were both +causing stalls, and that the stall was affecting RCU-bh. 
This message +will normally be followed by stack dumps for each CPU. Please note that +TREE_PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, +and that the tasks will be indicated by PID, for example, "P3421". +It is even possible for an rcu_preempt_state stall to be caused by both +CPUs -and- tasks, in which case the offending CPUs and tasks will all +be called out in the list. + +Finally, if the grace period ends just as the stall warning starts +printing, there will be a spurious stall-warning message: + +INFO: rcu_bh_state detected stalls on CPUs/tasks: { } (detected by 4, 2502 jiffies) + +This is rare, but does happen from time to time in real life. + +So your kernel printed an RCU CPU stall warning. The next question is +"What caused it?" The following problems can result in RCU CPU stall +warnings: + +o A CPU looping in an RCU read-side critical section. + +o A CPU looping with interrupts disabled. This condition can + result in RCU-sched and RCU-bh stalls. + +o A CPU looping with preemption disabled. This condition can + result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh + stalls. + +o A CPU looping with bottom halves disabled. This condition can + result in RCU-sched and RCU-bh stalls. + +o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel + without invoking schedule(). + +o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might + happen to preempt a low-priority task in the middle of an RCU + read-side critical section. This is especially damaging if + that low-priority task is not permitted to run on any other CPU, + in which case the next RCU grace period can never complete, which + will eventually cause the system to run out of memory and hang. + While the system is in the process of running itself out of + memory, you might see stall-warning messages. + +o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that + is running at a higher priority than the RCU softirq threads.
+ This will prevent RCU callbacks from ever being invoked, + and in a CONFIG_TREE_PREEMPT_RCU kernel will further prevent + RCU grace periods from ever completing. Either way, the + system will eventually run out of memory and hang. In the + CONFIG_TREE_PREEMPT_RCU case, you might see stall-warning + messages. + +o A bug in the RCU implementation. + +o A hardware failure. This is quite unlikely, but has occurred + at least once in real life. A CPU failed in a running system, + becoming unresponsive, but not causing an immediate crash. + This resulted in a series of RCU CPU stall warnings, eventually + leading to the realization that the CPU had failed. + +The RCU, RCU-sched, and RCU-bh implementations have CPU stall +warnings. SRCU does not have its own CPU stall warnings, but its +calls to synchronize_sched() will result in RCU-sched detecting +RCU-sched-related CPU stalls. Please note that RCU only detects +CPU stalls when there is a grace period in progress. No grace period, +no CPU stall warnings. + +To diagnose the cause of the stall, inspect the stack traces. +The offending function will usually be near the top of the stack. +If you have a series of stall warnings from a single extended stall, +comparing the stack traces can often help determine where the stall +is occurring, which will usually be in the function nearest the top of +that portion of the stack which remains the same from trace to trace. +If you can reliably trigger the stall, ftrace can be quite helpful. + +RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE. diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt new file mode 100644 index 00000000..5d901679 --- /dev/null +++ b/Documentation/RCU/torture.txt @@ -0,0 +1,201 @@ +RCU Torture Test Operation + + +CONFIG_RCU_TORTURE_TEST + +The CONFIG_RCU_TORTURE_TEST config option is available for all RCU +implementations. It creates an rcutorture kernel module that can +be loaded to run a torture test.
The test periodically outputs +status messages via printk(), which can be examined via the dmesg +command (perhaps grepping for "torture"). The test is started +when the module is loaded, and stops when the module is unloaded. + +CONFIG_RCU_TORTURE_TEST_RUNNABLE + +It is also possible to specify CONFIG_RCU_TORTURE_TEST=y, which will +result in the tests being loaded into the base kernel. In this case, +the CONFIG_RCU_TORTURE_TEST_RUNNABLE config option is used to specify +whether the RCU torture tests are to be started immediately during +boot or whether the /proc/sys/kernel/rcutorture_runnable file is used +to enable them. This /proc file can be used to repeatedly pause and +restart the tests, regardless of the initial state specified by the +CONFIG_RCU_TORTURE_TEST_RUNNABLE config option. + +You will normally -not- want to start the RCU torture tests during boot +(and thus the default is CONFIG_RCU_TORTURE_TEST_RUNNABLE=n), but doing +this can sometimes be useful in finding boot-time bugs. + + +MODULE PARAMETERS + +This module has the following parameters: + +fqs_duration Duration (in microseconds) of artificially induced bursts + of force_quiescent_state() invocations. In RCU + implementations having force_quiescent_state(), these + bursts help force races between forcing a given grace + period and that grace period ending on its own. + +fqs_holdoff Holdoff time (in microseconds) between consecutive calls + to force_quiescent_state() within a burst. + +fqs_stutter Wait time (in seconds) between consecutive bursts + of calls to force_quiescent_state(). + +irqreaders Says to invoke RCU readers from irq level. This is currently + done via timers. Defaults to "1" for variants of RCU that + permit this. (Or, more accurately, variants of RCU that do + -not- permit this know to ignore this variable.) + +nfakewriters This is the number of RCU fake writer threads to run. 
Fake + writer threads repeatedly use the synchronous "wait for + current readers" function of the interface selected by + torture_type, with a delay between calls to allow for various + different numbers of writers running in parallel. + nfakewriters defaults to 4, which provides enough parallelism + to trigger special cases caused by multiple writers, such as + the synchronize_srcu() early return optimization. + +nreaders This is the number of RCU reading threads supported. + The default is twice the number of CPUs. Why twice? + To properly exercise RCU implementations with preemptible + read-side critical sections. + +shuffle_interval + The number of seconds to keep the test threads affinitied + to a particular subset of the CPUs, defaults to 3 seconds. + Used in conjunction with test_no_idle_hz. + +stat_interval The number of seconds between output of torture + statistics (via printk()). Regardless of the interval, + statistics are printed when the module is unloaded. + Setting the interval to zero causes the statistics to + be printed -only- when the module is unloaded, and this + is the default. + +stutter The length of time to run the test before pausing for this + same period of time. Defaults to "stutter=5", so as + to run and pause for (roughly) five-second intervals. + Specifying "stutter=0" causes the test to run continuously + without pausing, which is the old default behavior. + +test_no_idle_hz Whether or not to test the ability of RCU to operate in + a kernel that disables the scheduling-clock interrupt to + idle CPUs. Boolean parameter, "1" to test, "0" otherwise. + Defaults to omitting this test. 
+ +torture_type The type of RCU to test: "rcu" for the rcu_read_lock() API, + "rcu_sync" for rcu_read_lock() with synchronous reclamation, + "rcu_bh" for the rcu_read_lock_bh() API, "rcu_bh_sync" for + rcu_read_lock_bh() with synchronous reclamation, "srcu" for + the "srcu_read_lock()" API, "sched" for the use of + preempt_disable() together with synchronize_sched(), + and "sched_expedited" for the use of preempt_disable() + with synchronize_sched_expedited(). + +verbose Enable debug printk()s. Default is disabled. + + +OUTPUT + +The statistics output is as follows: + + rcu-torture: --- Start of test: nreaders=16 stat_interval=0 verbose=0 + rcu-torture: rtc: 0000000000000000 ver: 1916 tfle: 0 rta: 1916 rtaf: 0 rtf: 1915 + rcu-torture: Reader Pipe: 1466408 9747 0 0 0 0 0 0 0 0 0 + rcu-torture: Reader Batch: 1464477 11678 0 0 0 0 0 0 0 0 + rcu-torture: Free-Block Circulation: 1915 1915 1915 1915 1915 1915 1915 1915 1915 1915 0 + rcu-torture: --- End of test + +The command "dmesg | grep torture:" will extract this information on +most systems. On more esoteric configurations, it may be necessary to +use other commands to access the output of the printk()s used by +the RCU torture test. The printk()s use KERN_ALERT, so they should +be evident. ;-) + +The entries are as follows: + +o "rtc": The hexadecimal address of the structure currently visible + to readers. + +o "ver": The number of times since boot that the rcutw writer task + has changed the structure visible to readers. + +o "tfle": If non-zero, indicates that the "torture freelist" + containing structures to be placed into the "rtc" area is empty. + This condition is important, since it can fool you into thinking + that RCU is working when it is not. :-/ + +o "rta": Number of structures allocated from the torture freelist. + +o "rtaf": Number of allocations from the torture freelist that have + failed due to the list being empty. + +o "rtf": Number of frees into the torture freelist.
+ +o "Reader Pipe": Histogram of "ages" of structures seen by readers. + If any entries past the first two are non-zero, RCU is broken. + And rcutorture prints the error flag string "!!!" to make sure + you notice. The age of a newly allocated structure is zero, + it becomes one when removed from reader visibility, and is + incremented once per grace period subsequently -- and is freed + after passing through (RCU_TORTURE_PIPE_LEN-2) grace periods. + + The output displayed above was taken from a correctly working + RCU. If you want to see what it looks like when broken, break + it yourself. ;-) + +o "Reader Batch": Another histogram of "ages" of structures seen + by readers, but in terms of counter flips (or batches) rather + than in terms of grace periods. The legal number of non-zero + entries is again two. The reason for this separate view is that + it is sometimes easier to get the third entry to show up in the + "Reader Batch" list than in the "Reader Pipe" list. + +o "Free-Block Circulation": Shows the number of torture structures + that have reached a given point in the pipeline. The first element + should closely correspond to the number of structures allocated, + the second to the number that have been removed from reader view, + and all but the last remaining to the corresponding number of + passes through a grace period. The last entry should be zero, + as it is only incremented if a torture structure's counter + somehow gets incremented farther than it should. + +Different implementations of RCU can provide implementation-specific +additional information. 
For example, SRCU provides the following: + + srcu-torture: rtc: f8cf46a8 ver: 355 tfle: 0 rta: 356 rtaf: 0 rtf: 346 rtmbe: 0 + srcu-torture: Reader Pipe: 559738 939 0 0 0 0 0 0 0 0 0 + srcu-torture: Reader Batch: 560434 243 0 0 0 0 0 0 0 0 + srcu-torture: Free-Block Circulation: 355 354 353 352 351 350 349 348 347 346 0 + srcu-torture: per-CPU(idx=1): 0(0,1) 1(0,1) 2(0,0) 3(0,1) + +The first four lines are similar to those for RCU. The last line shows +the per-CPU counter state. The numbers in parentheses are the values +of the "old" and "current" counters for the corresponding CPU. The +"idx" value maps the "old" and "current" values to the underlying array, +and is useful for debugging. + +Similarly, sched_expedited RCU provides the following: + + sched_expedited-torture: rtc: d0000000016c1880 ver: 1090796 tfle: 0 rta: 1090796 rtaf: 0 rtf: 1090787 rtmbe: 0 nt: 27713319 + sched_expedited-torture: Reader Pipe: 12660320201 95875 0 0 0 0 0 0 0 0 0 + sched_expedited-torture: Reader Batch: 12660424885 0 0 0 0 0 0 0 0 0 0 + sched_expedited-torture: Free-Block Circulation: 1090795 1090795 1090794 1090793 1090792 1090791 1090790 1090789 1090788 1090787 0 + + +USAGE + +The following script may be used to torture RCU: + + #!/bin/sh + + modprobe rcutorture + sleep 100 + rmmod rcutorture + dmesg | grep torture: + +The output can be manually inspected for the error flag of "!!!". +One could of course create a more elaborate script that automatically +checked for such errors. The "rmmod" command forces a "SUCCESS" or +"FAILURE" indication to be printk()ed. diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt new file mode 100644 index 00000000..8173cec4 --- /dev/null +++ b/Documentation/RCU/trace.txt @@ -0,0 +1,617 @@ +CONFIG_RCU_TRACE debugfs Files and Formats + + +The rcutree and rcutiny implementations of RCU provide debugfs trace +output that summarizes counters and state. 
This information is useful for +debugging RCU itself, and can sometimes also help to debug abuses of RCU. +The following sections describe the debugfs files and formats, first +for rcutree and next for rcutiny. + + +CONFIG_TREE_RCU and CONFIG_TREE_PREEMPT_RCU debugfs Files and Formats + +These implementations of RCU provide several debugfs files under the +top-level directory "rcu": + +rcu/rcudata: + Displays fields in struct rcu_data. +rcu/rcudata.csv: + Comma-separated values spreadsheet version of rcudata. +rcu/rcugp: + Displays grace-period counters. +rcu/rcuhier: + Displays the struct rcu_node hierarchy. +rcu/rcu_pending: + Displays counts of the reasons rcu_pending() decided that RCU had + work to do. +rcu/rcutorture: + Displays rcutorture test progress. +rcu/rcuboost: + Displays RCU boosting statistics. Only present if + CONFIG_RCU_BOOST=y. + +The output of "cat rcu/rcudata" looks as follows: + +rcu_sched: + 0 c=20972 g=20973 pq=1 pqc=20972 qp=0 dt=545/1/0 df=50 of=0 ri=0 ql=163 qs=NRW. kt=0/W/0 ktl=ebc3 b=10 ci=153737 co=0 ca=0 + 1 c=20972 g=20973 pq=1 pqc=20972 qp=0 dt=967/1/0 df=58 of=0 ri=0 ql=634 qs=NRW. kt=0/W/1 ktl=58c b=10 ci=191037 co=0 ca=0 + 2 c=20972 g=20973 pq=1 pqc=20972 qp=0 dt=1081/1/0 df=175 of=0 ri=0 ql=74 qs=N.W. kt=0/W/2 ktl=da94 b=10 ci=75991 co=0 ca=0 + 3 c=20942 g=20943 pq=1 pqc=20942 qp=1 dt=1846/0/0 df=404 of=0 ri=0 ql=0 qs=.... kt=0/W/3 ktl=d1cd b=10 ci=72261 co=0 ca=0 + 4 c=20972 g=20973 pq=1 pqc=20972 qp=0 dt=369/1/0 df=83 of=0 ri=0 ql=48 qs=N.W. kt=0/W/4 ktl=e0e7 b=10 ci=128365 co=0 ca=0 + 5 c=20972 g=20973 pq=1 pqc=20972 qp=0 dt=381/1/0 df=64 of=0 ri=0 ql=169 qs=NRW. kt=0/W/5 ktl=fb2f b=10 ci=164360 co=0 ca=0 + 6 c=20972 g=20973 pq=1 pqc=20972 qp=0 dt=1037/1/0 df=183 of=0 ri=0 ql=62 qs=N.W. kt=0/W/6 ktl=d2ad b=10 ci=65663 co=0 ca=0 + 7 c=20897 g=20897 pq=1 pqc=20896 qp=0 dt=1572/0/0 df=382 of=0 ri=0 ql=0 qs=....
kt=0/W/7 ktl=cf15 b=10 ci=75006 co=0 ca=0 +rcu_bh: + 0 c=1480 g=1480 pq=1 pqc=1479 qp=0 dt=545/1/0 df=6 of=0 ri=1 ql=0 qs=.... kt=0/W/0 ktl=ebc3 b=10 ci=0 co=0 ca=0 + 1 c=1480 g=1480 pq=1 pqc=1479 qp=0 dt=967/1/0 df=3 of=0 ri=1 ql=0 qs=.... kt=0/W/1 ktl=58c b=10 ci=151 co=0 ca=0 + 2 c=1480 g=1480 pq=1 pqc=1479 qp=0 dt=1081/1/0 df=6 of=0 ri=1 ql=0 qs=.... kt=0/W/2 ktl=da94 b=10 ci=0 co=0 ca=0 + 3 c=1480 g=1480 pq=1 pqc=1479 qp=0 dt=1846/0/0 df=8 of=0 ri=1 ql=0 qs=.... kt=0/W/3 ktl=d1cd b=10 ci=0 co=0 ca=0 + 4 c=1480 g=1480 pq=1 pqc=1479 qp=0 dt=369/1/0 df=6 of=0 ri=1 ql=0 qs=.... kt=0/W/4 ktl=e0e7 b=10 ci=0 co=0 ca=0 + 5 c=1480 g=1480 pq=1 pqc=1479 qp=0 dt=381/1/0 df=4 of=0 ri=1 ql=0 qs=.... kt=0/W/5 ktl=fb2f b=10 ci=0 co=0 ca=0 + 6 c=1480 g=1480 pq=1 pqc=1479 qp=0 dt=1037/1/0 df=6 of=0 ri=1 ql=0 qs=.... kt=0/W/6 ktl=d2ad b=10 ci=0 co=0 ca=0 + 7 c=1474 g=1474 pq=1 pqc=1473 qp=0 dt=1572/0/0 df=8 of=0 ri=1 ql=0 qs=.... kt=0/W/7 ktl=cf15 b=10 ci=0 co=0 ca=0 + +The first section lists the rcu_data structures for rcu_sched, the second +for rcu_bh. Note that CONFIG_TREE_PREEMPT_RCU kernels will have an +additional section for rcu_preempt. Each section has one line per CPU, +or eight for this 8-CPU system. The fields are as follows: + +o The number at the beginning of each line is the CPU number. + CPU numbers followed by an exclamation mark are offline, + but have been online at least once since boot. There will be + no output for CPUs that have never been online, which can be + a good thing in the surprisingly common case where NR_CPUS is + substantially larger than the number of actual CPUs. + +o "c" is the count of grace periods that this CPU believes have + completed. Offlined CPUs and CPUs in dynticks idle mode may + lag quite a ways behind, for example, CPU 6 under "rcu_sched" + above, which has been offline through not quite 40,000 RCU grace + periods. It is not unusual to see CPUs lagging by thousands of + grace periods.
+ +o "g" is the count of grace periods that this CPU believes have + started. Again, offlined CPUs and CPUs in dynticks idle mode + may lag behind. If the "c" and "g" values are equal, this CPU + has already reported a quiescent state for the last RCU grace + period that it is aware of, otherwise, the CPU believes that it + owes RCU a quiescent state. + +o "pq" indicates that this CPU has passed through a quiescent state + for the current grace period. It is possible for "pq" to be + "1" and "c" different than "g", which indicates that although + the CPU has passed through a quiescent state, either (1) this + CPU has not yet reported that fact, (2) some other CPU has not + yet reported for this grace period, or (3) both. + +o "pqc" indicates which grace period the last-observed quiescent + state for this CPU corresponds to. This is important for handling + the race between CPU 0 reporting an extended dynticks-idle + quiescent state for CPU 1 and CPU 1 suddenly waking up and + reporting its own quiescent state. If CPU 1 was the last CPU + for the current grace period, then the CPU that loses this race + will attempt to incorrectly mark CPU 1 as having checked in for + the next grace period! + +o "qp" indicates that RCU still expects a quiescent state from + this CPU. Offlined CPUs and CPUs in dyntick idle mode might + well have qp=1, which is OK: RCU is still ignoring them. + +o "dt" is the current value of the dyntick counter that is incremented + when entering or leaving dynticks idle state, either by the + scheduler or by irq. This number is even if the CPU is in + dyntick idle mode and odd otherwise. The number after the first + "/" is the interrupt nesting depth when in dyntick-idle state, + or one greater than the interrupt-nesting depth otherwise. + The number after the second "/" is the NMI nesting depth. + + This field is displayed only for CONFIG_NO_HZ kernels. 
+ +o "df" is the number of times that some other CPU has forced a + quiescent state on behalf of this CPU due to this CPU being in + dynticks-idle state. + + This field is displayed only for CONFIG_NO_HZ kernels. + +o "of" is the number of times that some other CPU has forced a + quiescent state on behalf of this CPU due to this CPU being + offline. In a perfect world, this might never happen, but it + turns out that offlining and onlining a CPU can take several grace + periods, and so there is likely to be an extended period of time + when RCU believes that the CPU is online when it really is not. + Please note that erring in the other direction (RCU believing a + CPU is offline when it is really alive and kicking) is a fatal + error, so it makes sense to err conservatively. + +o "ri" is the number of times that RCU has seen fit to send a + reschedule IPI to this CPU in order to get it to report a + quiescent state. + +o "ql" is the number of RCU callbacks currently residing on + this CPU. This is the total number of callbacks, regardless + of what state they are in (new, waiting for grace period to + start, waiting for grace period to end, ready to invoke). + +o "qs" gives an indication of the state of the callback queue + with four characters: + + "N" Indicates that there are callbacks queued that are not + ready to be handled by the next grace period, and thus + will be handled by the grace period following the next + one. + + "R" Indicates that there are callbacks queued that are + ready to be handled by the next grace period. + + "W" Indicates that there are callbacks queued that are + waiting on the current grace period. + + "D" Indicates that there are callbacks queued that have + already been handled by a prior grace period, and are + thus waiting to be invoked. Note that callbacks in + the process of being invoked are not counted here. 
+ Callbacks in the process of being invoked are those + that have been removed from the rcu_data structures + queues by rcu_do_batch(), but which have not yet been + invoked. + + If there are no callbacks in a given one of the above states, + the corresponding character is replaced by ".". + +o "kt" is the per-CPU kernel-thread state. The digit preceding + the first slash is zero if there is no work pending and 1 + otherwise. The character between the first pair of slashes is + as follows: + + "S" The kernel thread is stopped, in other words, all + CPUs corresponding to this rcu_node structure are + offline. + + "R" The kernel thread is running. + + "W" The kernel thread is waiting because there is no work + for it to do. + + "O" The kernel thread is waiting because it has been + forced off of its designated CPU or because its + ->cpus_allowed mask permits it to run on other than + its designated CPU. + + "Y" The kernel thread is yielding to avoid hogging CPU. + + "?" Unknown value, indicates a bug. + + The number after the final slash is the CPU that the kthread + is actually running on. + +o "ktl" is the low-order 16 bits (in hexadecimal) of the count of + the number of times that this CPU's per-CPU kthread has gone + through its loop servicing invoke_rcu_cpu_kthread() requests. + +o "b" is the batch limit for this CPU. If more than this number + of RCU callbacks is ready to invoke, then the remainder will + be deferred. + +o "ci" is the number of RCU callbacks that have been invoked for + this CPU. Note that ci+ql is the number of callbacks that have + been registered in absence of CPU-hotplug activity. + +o "co" is the number of RCU callbacks that have been orphaned due to + this CPU going offline. These orphaned callbacks have been moved + to an arbitrarily chosen online CPU. + +o "ca" is the number of RCU callbacks that have been adopted due to + other CPUs going offline. Note that ci+co-ca+ql is the number of + RCU callbacks registered on this CPU. 
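The ci+co-ca+ql relationship above can be checked mechanically. The following is a minimal userspace sketch, not kernel code: it assumes that an rcu/rcudata line is a space-separated list of key=value fields with decimal values for "ql", "ci", "co", and "ca". The helper names rcudata_field() and rcudata_registered() are hypothetical and exist only for this illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: look up "key=<decimal>" in a space-separated
 * line of key=value fields.  Returns 0 and stores the value on
 * success, -1 if the key is absent.
 */
static int rcudata_field(const char *line, const char *key, long *val)
{
	size_t klen = strlen(key);
	const char *p = line;

	while ((p = strstr(p, key)) != NULL) {
		/* Accept only whole-field matches: start of line or after a space. */
		if ((p == line || p[-1] == ' ') && p[klen] == '=') {
			*val = strtol(p + klen + 1, NULL, 10);
			return 0;
		}
		p += klen;
	}
	return -1;
}

/*
 * Hypothetical helper: number of callbacks registered on this CPU,
 * computed as ci + co - ca + ql as described in the text above.
 * Returns -1 if any of the four fields is missing.
 */
static int rcudata_registered(const char *line, long *registered)
{
	long ci, co, ca, ql;

	if (rcudata_field(line, "ci", &ci) || rcudata_field(line, "co", &co) ||
	    rcudata_field(line, "ca", &ca) || rcudata_field(line, "ql", &ql))
		return -1;
	*registered = ci + co - ca + ql;
	return 0;
}
```

For a made-up line containing "ql=3 ... ci=100 co=5 ca=2", rcudata_registered() stores 106. The field values are read at different times by the kernel, so on a busy system the result should be treated as approximate.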
+ +There is also an rcu/rcudata.csv file with the same information in +comma-separated-variable spreadsheet format. + + +The output of "cat rcu/rcugp" looks as follows: + +rcu_sched: completed=33062 gpnum=33063 +rcu_bh: completed=464 gpnum=464 + +Again, this output is for both "rcu_sched" and "rcu_bh". Note that +kernels built with CONFIG_TREE_PREEMPT_RCU will have an additional +"rcu_preempt" line. The fields are taken from the rcu_state structure, +and are as follows: + +o "completed" is the number of grace periods that have completed. + It is comparable to the "c" field from rcu/rcudata in that a + CPU whose "c" field matches the value of "completed" is aware + that the corresponding RCU grace period has completed. + +o "gpnum" is the number of grace periods that have started. It is + comparable to the "g" field from rcu/rcudata in that a CPU + whose "g" field matches the value of "gpnum" is aware that the + corresponding RCU grace period has started. + + If these two fields are equal (as they are for "rcu_bh" above), + then there is no grace period in progress, in other words, RCU + is idle. On the other hand, if the two fields differ (as they + do for "rcu_sched" above), then an RCU grace period is in progress. + + +The output of "cat rcu/rcuhier" looks as follows, with very long lines: + +c=6902 g=6903 s=2 jfq=3 j=72c7 nfqs=13142/nfqsng=0(13142) fqlh=6 +1/1 ..>. 0:127 ^0 +3/3 ..>. 0:35 ^0 0/0 ..>. 36:71 ^1 0/0 ..>. 72:107 ^2 0/0 ..>. 108:127 ^3 +3/3f ..>. 0:5 ^0 2/3 ..>. 6:11 ^1 0/0 ..>. 12:17 ^2 0/0 ..>. 18:23 ^3 0/0 ..>. 24:29 ^4 0/0 ..>. 30:35 ^5 0/0 ..>. 36:41 ^0 0/0 ..>. 42:47 ^1 0/0 ..>. 48:53 ^2 0/0 ..>. 54:59 ^3 0/0 ..>. 60:65 ^4 0/0 ..>. 66:71 ^5 0/0 ..>. 72:77 ^0 0/0 ..>. 78:83 ^1 0/0 ..>. 84:89 ^2 0/0 ..>. 90:95 ^3 0/0 ..>. 96:101 ^4 0/0 ..>. 102:107 ^5 0/0 ..>. 108:113 ^0 0/0 ..>. 114:119 ^1 0/0 ..>. 120:125 ^2 0/0 ..>. 126:127 ^3 +rcu_bh: +c=-226 g=-226 s=1 jfq=-5701 j=72c7 nfqs=88/nfqsng=0(88) fqlh=0 +0/1 ..>. 0:127 ^0 +0/3 ..>. 0:35 ^0 0/0 ..>. 
36:71 ^1 0/0 ..>. 72:107 ^2 0/0 ..>. 108:127 ^3 +0/3f ..>. 0:5 ^0 0/3 ..>. 6:11 ^1 0/0 ..>. 12:17 ^2 0/0 ..>. 18:23 ^3 0/0 ..>. 24:29 ^4 0/0 ..>. 30:35 ^5 0/0 ..>. 36:41 ^0 0/0 ..>. 42:47 ^1 0/0 ..>. 48:53 ^2 0/0 ..>. 54:59 ^3 0/0 ..>. 60:65 ^4 0/0 ..>. 66:71 ^5 0/0 ..>. 72:77 ^0 0/0 ..>. 78:83 ^1 0/0 ..>. 84:89 ^2 0/0 ..>. 90:95 ^3 0/0 ..>. 96:101 ^4 0/0 ..>. 102:107 ^5 0/0 ..>. 108:113 ^0 0/0 ..>. 114:119 ^1 0/0 ..>. 120:125 ^2 0/0 ..>. 126:127 ^3 + +This is once again split into "rcu_sched" and "rcu_bh" portions, +and CONFIG_TREE_PREEMPT_RCU kernels will again have an additional +"rcu_preempt" section. The fields are as follows: + +o "c" is exactly the same as "completed" under rcu/rcugp. + +o "g" is exactly the same as "gpnum" under rcu/rcugp. + +o "s" is the "signaled" state that drives force_quiescent_state()'s + state machine. + +o "jfq" is the number of jiffies remaining for this grace period + before force_quiescent_state() is invoked to help push things + along. Note that CPUs in dyntick-idle mode throughout the grace + period will not report on their own, but rather must be checked by + some other CPU via force_quiescent_state(). + +o "j" is the low-order four hex digits of the jiffies counter. + Yes, Paul did run into a number of problems that turned out to + be due to the jiffies counter no longer counting. Why do you ask? + +o "nfqs" is the number of calls to force_quiescent_state() since + boot. + +o "nfqsng" is the number of useless calls to force_quiescent_state(), + where there wasn't actually a grace period active. This can + happen due to races. The number in parentheses is the difference + between "nfqs" and "nfqsng", or the number of times that + force_quiescent_state() actually did some real work. + +o "fqlh" is the number of calls to force_quiescent_state() that + exited immediately (without even being counted in nfqs above) + due to contention on ->fqslock. + +o Each element of the form "1/1 0:127 ^0" represents one struct + rcu_node.
Each line represents one level of the hierarchy, from + root to leaves. It is best to think of the rcu_data structures + as forming yet another level after the leaves. Note that there + might be either one, two, or three levels of rcu_node structures, + depending on the relationship between CONFIG_RCU_FANOUT and + CONFIG_NR_CPUS. + + o The numbers separated by the "/" are the qsmask followed + by the qsmaskinit. The qsmask will have one bit + set for each entity in the next lower level that + has not yet checked in for the current grace period. + The qsmaskinit will have one bit for each entity that is + currently expected to check in during each grace period. + The value of qsmaskinit is assigned to that of qsmask + at the beginning of each grace period. + + For example, for "rcu_sched", the qsmask of the first + entry of the lowest level is 0x14, meaning that we + are still waiting for CPUs 2 and 4 to check in for the + current grace period. + + o The characters separated by the ">" indicate the state + of the blocked-tasks lists. A "G" preceding the ">" + indicates that at least one task blocked in an RCU + read-side critical section blocks the current grace + period, while a "E" preceding the ">" indicates that + at least one task blocked in an RCU read-side critical + section blocks the current expedited grace period. + A "T" character following the ">" indicates that at + least one task is blocked within an RCU read-side + critical section, regardless of whether any current + grace period (expedited or normal) is inconvenienced. + A "." character appears if the corresponding condition + does not hold, so that "..>." indicates that no tasks + are blocked. In contrast, "GE>T" indicates maximal + inconvenience from blocked tasks. + + o The numbers separated by the ":" are the range of CPUs + served by this struct rcu_node. This can be helpful + in working out how the hierarchy is wired together. 
+ + For example, the first entry at the lowest level shows + "0:5", indicating that it covers CPUs 0 through 5. + + o The number after the "^" indicates the bit in the + next higher level rcu_node structure that this + rcu_node structure corresponds to. + + For example, the first entry at the lowest level shows + "^0", indicating that it corresponds to bit zero in + the first entry at the middle level. + + +The output of "cat rcu/rcu_pending" looks as follows: + +rcu_sched: + 0 np=255892 qsp=53936 rpq=85 cbr=0 cng=14417 gpc=10033 gps=24320 nf=6445 nn=146741 + 1 np=261224 qsp=54638 rpq=33 cbr=0 cng=25723 gpc=16310 gps=2849 nf=5912 nn=155792 + 2 np=237496 qsp=49664 rpq=23 cbr=0 cng=2762 gpc=45478 gps=1762 nf=1201 nn=136629 + 3 np=236249 qsp=48766 rpq=98 cbr=0 cng=286 gpc=48049 gps=1218 nf=207 nn=137723 + 4 np=221310 qsp=46850 rpq=7 cbr=0 cng=26 gpc=43161 gps=4634 nf=3529 nn=123110 + 5 np=237332 qsp=48449 rpq=9 cbr=0 cng=54 gpc=47920 gps=3252 nf=201 nn=137456 + 6 np=219995 qsp=46718 rpq=12 cbr=0 cng=50 gpc=42098 gps=6093 nf=4202 nn=120834 + 7 np=249893 qsp=49390 rpq=42 cbr=0 cng=72 gpc=38400 gps=17102 nf=41 nn=144888 +rcu_bh: + 0 np=146741 qsp=1419 rpq=6 cbr=0 cng=6 gpc=0 gps=0 nf=2 nn=145314 + 1 np=155792 qsp=12597 rpq=3 cbr=0 cng=0 gpc=4 gps=8 nf=3 nn=143180 + 2 np=136629 qsp=18680 rpq=1 cbr=0 cng=0 gpc=7 gps=6 nf=0 nn=117936 + 3 np=137723 qsp=2843 rpq=0 cbr=0 cng=0 gpc=10 gps=7 nf=0 nn=134863 + 4 np=123110 qsp=12433 rpq=0 cbr=0 cng=0 gpc=4 gps=2 nf=0 nn=110671 + 5 np=137456 qsp=4210 rpq=1 cbr=0 cng=0 gpc=6 gps=5 nf=0 nn=133235 + 6 np=120834 qsp=9902 rpq=2 cbr=0 cng=0 gpc=6 gps=3 nf=2 nn=110921 + 7 np=144888 qsp=26336 rpq=0 cbr=0 cng=0 gpc=8 gps=2 nf=0 nn=118542 + +As always, this is once again split into "rcu_sched" and "rcu_bh" +portions, with CONFIG_TREE_PREEMPT_RCU kernels having an additional +"rcu_preempt" section. The fields are as follows: + +o "np" is the number of times that __rcu_pending() has been invoked + for the corresponding flavor of RCU. 
+ +o "qsp" is the number of times that RCU was waiting for a + quiescent state from this CPU. + +o "rpq" is the number of times that the CPU had passed through + a quiescent state, but not yet reported it to RCU. + +o "cbr" is the number of times that this CPU had RCU callbacks + that had passed through a grace period, and were thus ready + to be invoked. + +o "cng" is the number of times that this CPU needed another + grace period while RCU was idle. + +o "gpc" is the number of times that an old grace period had + completed, but this CPU was not yet aware of it. + +o "gps" is the number of times that a new grace period had started, + but this CPU was not yet aware of it. + +o "nf" is the number of times that this CPU suspected that the + current grace period had run for too long, and thus needed to + be forced. + + Please note that "forcing" consists of sending resched IPIs + to holdout CPUs. If a holdout CPU really is still in an old RCU + read-side critical section, then we really do have to wait for it. + The assumption behind "forcing" is that the CPU is not still in + an old RCU read-side critical section, but has not yet responded + for some other reason. + +o "nn" is the number of times that this CPU needed nothing. Alert + readers will note that the rcu_sched "nn" number for a given CPU very + closely matches the rcu_bh "np" number for that same CPU. This + is due to short-circuit evaluation in rcu_pending(). + + +The output of "cat rcu/rcutorture" looks as follows: + +rcutorture test sequence: 0 (test in progress) +rcutorture update version number: 615 + +The first line shows the number of rcutorture tests that have completed +since boot. If a test is currently running, the "(test in progress)" +string will appear as shown above. The second line shows the number of +update cycles that the current test has started, or zero if there is +no test in progress. + + +The output of "cat rcu/rcuboost" looks as follows: + +0:5 tasks=....
kt=W ntb=0 neb=0 nnb=0 j=2f95 bt=300f + balk: nt=0 egt=989 bt=0 nb=0 ny=0 nos=16 +6:7 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=2f95 bt=300f + balk: nt=0 egt=225 bt=0 nb=0 ny=0 nos=6 + +This information is output only for rcu_preempt. Each two-line entry +corresponds to a leaf rcu_node structure. The fields are as follows: + +o "n:m" is the CPU-number range for the corresponding two-line + entry. In the sample output above, the first entry covers + CPUs 0 through 5 and the second entry covers CPUs 6 + and 7. + +o "tasks=TNEB" gives the state of the various segments of the + rnp->blocked_tasks list: + + "T" This indicates that there are some tasks that blocked + while running on one of the corresponding CPUs while + in an RCU read-side critical section. + + "N" This indicates that some of the blocked tasks are preventing + the current normal (non-expedited) grace period from + completing. + + "E" This indicates that some of the blocked tasks are preventing + the current expedited grace period from completing. + + "B" This indicates that some of the blocked tasks are in + need of RCU priority boosting. + + Each character is replaced with "." if the corresponding + condition does not hold. + +o "kt" is the state of the RCU priority-boosting kernel + thread associated with the corresponding rcu_node structure. + The state can be one of the following: + + "S" The kernel thread is stopped, in other words, all + CPUs corresponding to this rcu_node structure are + offline. + + "R" The kernel thread is running. + + "W" The kernel thread is waiting because there is no work + for it to do. + + "Y" The kernel thread is yielding to avoid hogging CPU. + + "?" Unknown value, indicates a bug. + +o "ntb" is the number of tasks boosted. + +o "neb" is the number of tasks boosted in order to complete an + expedited grace period. + +o "nnb" is the number of tasks boosted in order to complete a + normal (non-expedited) grace period.
When boosting a task + that was blocking both an expedited and a normal grace period, + it is counted against the expedited total above. + +o "j" is the low-order 16 bits of the jiffies counter in + hexadecimal. + +o "bt" is the low-order 16 bits of the value that the jiffies + counter will have when we next start boosting, assuming that + the current grace period does not end beforehand. This is + also in hexadecimal. + +o "balk: nt" counts the number of times we didn't boost (in + other words, we balked) even though it was time to boost, because + there were no blocked tasks to boost. This situation occurs + when there is one blocked task on one rcu_node structure and + none on some other rcu_node structure. + +o "egt" counts the number of times we balked because although + there were blocked tasks, none of them were blocking the + current grace period, whether expedited or otherwise. + +o "bt" counts the number of times we balked because boosting + had already been initiated for the current grace period. + +o "nb" counts the number of times we balked because there + was at least one task blocking the current non-expedited grace + period that never had blocked. If it is already running, it + just won't help to boost its priority! + +o "ny" counts the number of times we balked because it was + not yet time to start boosting. + +o "nos" counts the number of times we balked for other + reasons, e.g., the grace period ended first. + + +CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU debugfs Files and Formats + +These implementations of RCU provide a single debugfs file under the +top-level directory rcu, namely rcu/rcudata, which displays fields in +rcu_bh_ctrlblk, rcu_sched_ctrlblk and, for CONFIG_TINY_PREEMPT_RCU, +rcu_preempt_ctrlblk. + +The output of "cat rcu/rcudata" is as follows: + +rcu_preempt: qlen=24 gp=1097669 g197/p197/c197 tasks=... + ttb=.
btg=no ntb=184 neb=0 nnb=183 j=01f7 bt=0274 + normal balk: nt=1097669 gt=0 bt=371 b=0 ny=25073378 nos=0 + exp balk: bt=0 nos=0 +rcu_sched: qlen: 0 +rcu_bh: qlen: 0 + +This is split into rcu_preempt, rcu_sched, and rcu_bh sections, with the +rcu_preempt section appearing only in CONFIG_TINY_PREEMPT_RCU builds. +The last three lines of the rcu_preempt section appear only in +CONFIG_RCU_BOOST kernel builds. The fields are as follows: + +o "qlen" is the number of RCU callbacks currently waiting either + for an RCU grace period or to be invoked. This is the + only field present for rcu_sched and rcu_bh, due to the + short-circuiting of grace periods in those two cases. + +o "gp" is the number of grace periods that have completed. + +o "g197/p197/c197" displays the grace-period state, with the + "g" number being the number of grace periods that have started + (mod 256), the "p" number being the number of grace periods + that the CPU has responded to (also mod 256), and the "c" + number being the number of grace periods that have completed + (once again mod 256). + + Why have both "gp" and "g"? Because the data flowing into + "gp" is only present in a CONFIG_RCU_TRACE kernel. + +o "tasks" is a set of bits. The first bit is "T" if there are + currently tasks that have recently blocked within an RCU + read-side critical section, the second bit is "N" if any of the + aforementioned tasks are blocking the current RCU grace period, + and the third bit is "E" if any of the aforementioned tasks are + blocking the current expedited grace period. Each bit is "." + if the corresponding condition does not hold. + +o "ttb" is a single bit. It is "B" if any of the blocked tasks + need to be priority boosted and "." otherwise.
+ +o "btg" indicates whether boosting has been carried out during + the current grace period, with "exp" indicating that boosting + is in progress for an expedited grace period, "no" indicating + that boosting has not yet started for a normal grace period, + "begun" indicating that boosting has begun for a normal grace + period, and "done" indicating that boosting has completed for + a normal grace period. + +o "ntb" is the total number of tasks subjected to RCU priority + boosting since boot. + +o "neb" is the number of expedited grace periods that have had + to resort to RCU priority boosting since boot. + +o "nnb" is the number of normal grace periods that have had + to resort to RCU priority boosting since boot. + +o "j" is the low-order 16 bits of the jiffies counter in hexadecimal. + +o "bt" is the low-order 16 bits of the value that the jiffies counter + will have at the next time that boosting is scheduled to begin. + +o In the line beginning with "normal balk", the fields are as follows: + + o "nt" is the number of times that the system balked from + boosting because there were no blocked tasks to boost. + Note that the system will balk from boosting even if the + grace period is overdue when the currently running task + is looping within an RCU read-side critical section. + There is no point in boosting in this case, because + boosting a running task won't make it run any faster. + + o "gt" is the number of times that the system balked + from boosting because, although there were blocked tasks, + none of them were preventing the current grace period + from completing. + + o "bt" is the number of times that the system balked + from boosting because boosting was already in progress. + + o "b" is the number of times that the system balked from + boosting because boosting had already completed for + the grace period in question.
+ + o "ny" is the number of times that the system balked from + boosting because it was not yet time to start boosting + the grace period in question. + + o "nos" is the number of times that the system balked from + boosting for inexplicable ("not otherwise specified") + reasons. This can actually happen due to races involving + increments of the jiffies counter. + +o In the line beginning with "exp balk", the fields are as follows: + + o "bt" is the number of times that the system balked from + boosting because there were no blocked tasks to boost. + + o "nos" is the number of times that the system balked from + boosting for inexplicable ("not otherwise specified") + reasons. diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt new file mode 100644 index 00000000..6ef69266 --- /dev/null +++ b/Documentation/RCU/whatisRCU.txt @@ -0,0 +1,985 @@ +Please note that the "What is RCU?" LWN series is an excellent place +to start learning about RCU: + +1. What is RCU, Fundamentally? http://lwn.net/Articles/262464/ +2. What is RCU? Part 2: Usage http://lwn.net/Articles/263130/ +3. RCU part 3: the RCU API http://lwn.net/Articles/264090/ + + +What is RCU? + +RCU is a synchronization mechanism that was added to the Linux kernel +during the 2.5 development effort that is optimized for read-mostly +situations. Although RCU is actually quite simple once you understand it, +getting there can sometimes be a challenge. Part of the problem is that +most of the past descriptions of RCU have been written with the mistaken +assumption that there is "one true way" to describe RCU. Instead, +the experience has been that different people must take different paths +to arrive at an understanding of RCU. This document provides several +different paths, as follows: + +1. RCU OVERVIEW +2. WHAT IS RCU'S CORE API? +3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? +4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? +5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? +6. 
ANALOGY WITH READER-WRITER LOCKING +7. FULL LIST OF RCU APIs +8. ANSWERS TO QUICK QUIZZES + +People who prefer starting with a conceptual overview should focus on +Section 1, though most readers will profit by reading this section at +some point. People who prefer to start with an API that they can then +experiment with should focus on Section 2. People who prefer to start +with example uses should focus on Sections 3 and 4. People who need to +understand the RCU implementation should focus on Section 5, then dive +into the kernel source code. People who reason best by analogy should +focus on Section 6. Section 7 serves as an index to the docbook API +documentation, and Section 8 is the traditional answer key. + +So, start with the section that makes the most sense to you and your +preferred method of learning. If you need to know everything about +everything, feel free to read the whole thing -- but if you are really +that type of person, you have perused the source code and will therefore +never need this document anyway. ;-) + + +1. RCU OVERVIEW + +The basic idea behind RCU is to split updates into "removal" and +"reclamation" phases. The removal phase removes references to data items +within a data structure (possibly by replacing them with references to +new versions of these data items), and can run concurrently with readers. +The reason that it is safe to run the removal phase concurrently with +readers is that the semantics of modern CPUs guarantee that readers will see +either the old or the new version of the data structure rather than a +partially updated reference. The reclamation phase does the work of reclaiming +(e.g., freeing) the data items removed from the data structure during the +removal phase. Because reclaiming data items can disrupt any readers +concurrently referencing those data items, the reclamation phase must +not start until readers no longer hold references to those data items.
+ +Splitting the update into removal and reclamation phases permits the +updater to perform the removal phase immediately, and to defer the +reclamation phase until all readers active during the removal phase have +completed, either by blocking until they finish or by registering a +callback that is invoked after they finish. Only readers that are active +during the removal phase need be considered, because any reader starting +after the removal phase will be unable to gain a reference to the removed +data items, and therefore cannot be disrupted by the reclamation phase. + +So the typical RCU update sequence goes something like the following: + +a. Remove pointers to a data structure, so that subsequent + readers cannot gain a reference to it. + +b. Wait for all previous readers to complete their RCU read-side + critical sections. + +c. At this point, there cannot be any readers who hold references + to the data structure, so it now may safely be reclaimed + (e.g., kfree()d). + +Step (b) above is the key idea underlying RCU's deferred destruction. +The ability to wait until all readers are done allows RCU readers to +use much lighter-weight synchronization, in some cases, absolutely no +synchronization at all. In contrast, in more conventional lock-based +schemes, readers must use heavy-weight synchronization in order to +prevent an updater from deleting the data structure out from under them. +This is because lock-based updaters typically update data items in place, +and must therefore exclude readers. In contrast, RCU-based updaters +typically take advantage of the fact that writes to single aligned +pointers are atomic on modern CPUs, allowing atomic insertion, removal, +and replacement of data items in a linked structure without disrupting +readers. 
Concurrent RCU readers can then continue accessing the old +versions, and can dispense with the atomic operations, memory barriers, +and communications cache misses that are so expensive on present-day +SMP computer systems, even in absence of lock contention. + +In the three-step procedure shown above, the updater is performing both +the removal and the reclamation step, but it is often helpful for an +entirely different thread to do the reclamation, as is in fact the case +in the Linux kernel's directory-entry cache (dcache). Even if the same +thread performs both the update step (step (a) above) and the reclamation +step (step (c) above), it is often helpful to think of them separately. +For example, RCU readers and updaters need not communicate at all, +but RCU provides implicit low-overhead communication between readers +and reclaimers, namely, in step (b) above. + +So how the heck can a reclaimer tell when a reader is done, given +that readers are not doing any sort of synchronization operations??? +Read on to learn about how RCU's API makes this easy. + + +2. WHAT IS RCU'S CORE API? + +The core RCU API is quite small: + +a. rcu_read_lock() +b. rcu_read_unlock() +c. synchronize_rcu() / call_rcu() +d. rcu_assign_pointer() +e. rcu_dereference() + +There are many other members of the RCU API, but the rest can be +expressed in terms of these five, though most implementations instead +express synchronize_rcu() in terms of the call_rcu() callback API. + +The five core RCU APIs are described below, the other 18 will be enumerated +later. See the kernel docbook documentation for more info, or look directly +at the function header comments. + +rcu_read_lock() + + void rcu_read_lock(void); + + Used by a reader to inform the reclaimer that the reader is + entering an RCU read-side critical section. It is illegal + to block while in an RCU read-side critical section, though + kernels built with CONFIG_TREE_PREEMPT_RCU can preempt RCU + read-side critical sections. 
Any RCU-protected data structure + accessed during an RCU read-side critical section is guaranteed to + remain unreclaimed for the full duration of that critical section. + Reference counts may be used in conjunction with RCU to maintain + longer-term references to data structures. + +rcu_read_unlock() + + void rcu_read_unlock(void); + + Used by a reader to inform the reclaimer that the reader is + exiting an RCU read-side critical section. Note that RCU + read-side critical sections may be nested and/or overlapping. + +synchronize_rcu() + + void synchronize_rcu(void); + + Marks the end of updater code and the beginning of reclaimer + code. It does this by blocking until all pre-existing RCU + read-side critical sections on all CPUs have completed. + Note that synchronize_rcu() will -not- necessarily wait for + any subsequent RCU read-side critical sections to complete. + For example, consider the following sequence of events: + + CPU 0 CPU 1 CPU 2 + ----------------- ------------------------- --------------- + 1. rcu_read_lock() + 2. enters synchronize_rcu() + 3. rcu_read_lock() + 4. rcu_read_unlock() + 5. exits synchronize_rcu() + 6. rcu_read_unlock() + + To reiterate, synchronize_rcu() waits only for ongoing RCU + read-side critical sections to complete, not necessarily for + any that begin after synchronize_rcu() is invoked. + + Of course, synchronize_rcu() does not necessarily return + -immediately- after the last pre-existing RCU read-side critical + section completes. For one thing, there might well be scheduling + delays. For another thing, many RCU implementations process + requests in batches in order to improve efficiencies, which can + further delay synchronize_rcu(). + + Since synchronize_rcu() is the API that must figure out when + readers are done, its implementation is key to RCU. For RCU + to be useful in all but the most read-intensive situations, + synchronize_rcu()'s overhead must also be quite small. 
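The rule illustrated by the CPU 0 / CPU 1 / CPU 2 table above can be written down as a small single-threaded model. This is only a sketch of the semantics, not of how any RCU implementation tracks readers; struct reader and earliest_sync_return() are hypothetical names invented for this illustration, and the "steps" are the step numbers from the table.

```c
#include <stddef.h>

/* One RCU read-side critical section, in the step numbers of the table. */
struct reader {
	int lock_step;    /* step at which rcu_read_lock() ran */
	int unlock_step;  /* step at which rcu_read_unlock() ran */
};

/*
 * Earliest step after which a synchronize_rcu() entered at sync_step
 * may return: it must outlast every reader that was already inside its
 * critical section at sync_step, and may ignore all later readers.
 */
static int earliest_sync_return(const struct reader *r, size_t n, int sync_step)
{
	int earliest = sync_step;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Pre-existing: locked before sync_step, still locked at it. */
		int preexisting = r[i].lock_step < sync_step &&
				  r[i].unlock_step > sync_step;

		if (preexisting && r[i].unlock_step > earliest)
			earliest = r[i].unlock_step;
	}
	return earliest;
}
```

With the readers from the table, { 1, 4 } for CPU 0 and { 3, 6 } for CPU 2, and synchronize_rcu() entered at step 2, the model yields step 4: only CPU 0's critical section is waited for, which is consistent with synchronize_rcu() exiting at step 5 while CPU 2 is still reading.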
+ + The call_rcu() API is a callback form of synchronize_rcu(), + and is described in more detail in a later section. Instead of + blocking, it registers a function and argument which are invoked + after all ongoing RCU read-side critical sections have completed. + This callback variant is particularly useful in situations where + it is illegal to block or where update-side performance is + critically important. + + However, the call_rcu() API should not be used lightly, as use + of the synchronize_rcu() API generally results in simpler code. + In addition, the synchronize_rcu() API has the nice property + of automatically limiting update rate should grace periods + be delayed. This property results in system resilience in face + of denial-of-service attacks. Code using call_rcu() should limit + update rate in order to gain this same sort of resilience. See + checklist.txt for some approaches to limiting the update rate. + +rcu_assign_pointer() + + typeof(p) rcu_assign_pointer(p, typeof(p) v); + + Yes, rcu_assign_pointer() -is- implemented as a macro, though it + would be cool to be able to declare a function in this manner. + (Compiler experts will no doubt disagree.) + + The updater uses this function to assign a new value to an + RCU-protected pointer, in order to safely communicate the change + in value from the updater to the reader. This function returns + the new value, and also executes any memory-barrier instructions + required for a given CPU architecture. + + Perhaps just as important, it serves to document (1) which + pointers are protected by RCU and (2) the point at which a + given structure becomes accessible to other CPUs. That said, + rcu_assign_pointer() is most frequently used indirectly, via + the _rcu list-manipulation primitives such as list_add_rcu(). + +rcu_dereference() + + typeof(p) rcu_dereference(p); + + Like rcu_assign_pointer(), rcu_dereference() must be implemented + as a macro. 
+ + The reader uses rcu_dereference() to fetch an RCU-protected + pointer; it returns a value that may then be safely + dereferenced. Note that rcu_dereference() does not actually + dereference the pointer; instead, it protects the pointer for + later dereferencing. It also executes any needed memory-barrier + instructions for a given CPU architecture. Currently, only Alpha + needs memory barriers within rcu_dereference() -- on other CPUs, + it compiles to nothing, not even a compiler directive. + + Common coding practice uses rcu_dereference() to copy an + RCU-protected pointer to a local variable, then dereferences + this local variable, for example as follows: + + p = rcu_dereference(head.next); + return p->data; + + However, in this case, one could just as easily combine these + into one statement: + + return rcu_dereference(head.next)->data; + + If you are going to be fetching multiple fields from the + RCU-protected structure, using the local variable is of + course preferred. Repeated rcu_dereference() calls look + ugly and incur unnecessary overhead on Alpha CPUs. + + Note that the value returned by rcu_dereference() is valid + only within the enclosing RCU read-side critical section. + For example, the following is -not- legal: + + rcu_read_lock(); + p = rcu_dereference(head.next); + rcu_read_unlock(); + x = p->address; + rcu_read_lock(); + y = p->data; + rcu_read_unlock(); + + Holding a reference from one RCU read-side critical section + to another is just as illegal as holding a reference from + one lock-based critical section to another! Similarly, + using a reference outside of the critical section in which + it was acquired is just as illegal as doing so with normal + locking. + + As with rcu_assign_pointer(), an important function of + rcu_dereference() is to document which pointers are protected by + RCU, in particular, flagging a pointer that is subject to changing + at any time, including immediately after the rcu_dereference().
+ And, again like rcu_assign_pointer(), rcu_dereference() is + typically used indirectly, via the _rcu list-manipulation + primitives, such as list_for_each_entry_rcu(). + +The following diagram shows how each API communicates among the +reader, updater, and reclaimer. + + + rcu_assign_pointer() + +--------+ + +---------------------->| reader |---------+ + | +--------+ | + | | | + | | | Protect: + | | | rcu_read_lock() + | | | rcu_read_unlock() + | rcu_dereference() | | + +---------+ | | + | updater |<---------------------+ | + +---------+ V + | +-----------+ + +----------------------------------->| reclaimer | + +-----------+ + Defer: + synchronize_rcu() & call_rcu() + + +The RCU infrastructure observes the time sequence of rcu_read_lock(), +rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in +order to determine when (1) synchronize_rcu() invocations may return +to their callers and (2) call_rcu() callbacks may be invoked. Efficient +implementations of the RCU infrastructure make heavy use of batching in +order to amortize their overhead over many uses of the corresponding APIs. + +There are no fewer than three RCU mechanisms in the Linux kernel; the +diagram above shows the first one, which is by far the most commonly used. +The rcu_dereference() and rcu_assign_pointer() primitives are used for +all three mechanisms, but different defer and protect primitives are +used as follows: + + Defer Protect + +a. synchronize_rcu() rcu_read_lock() / rcu_read_unlock() + call_rcu() rcu_dereference() + +b. call_rcu_bh() rcu_read_lock_bh() / rcu_read_unlock_bh() + rcu_dereference_bh() + +c. synchronize_sched() rcu_read_lock_sched() / rcu_read_unlock_sched() + preempt_disable() / preempt_enable() + local_irq_save() / local_irq_restore() + hardirq enter / hardirq exit + NMI enter / NMI exit + rcu_dereference_sched() + +These three mechanisms are used as follows: + +a. RCU applied to normal data structures. + +b. 
RCU applied to networking data structures that may be subjected + to remote denial-of-service attacks. + +c. RCU applied to scheduler and interrupt/NMI-handler tasks. + +Again, most uses will be of (a). The (b) and (c) cases are important +for specialized uses, but are relatively uncommon. + + +3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? + +This section shows a simple use of the core RCU API to protect a +global pointer to a dynamically allocated structure. More-typical +uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt. + + struct foo { + int a; + char b; + long c; + }; + DEFINE_SPINLOCK(foo_mutex); + + struct foo *gbl_foo; + + /* + * Create a new struct foo that is the same as the one currently + * pointed to by gbl_foo, except that field "a" is replaced + * with "new_a". Points gbl_foo to the new structure, and + * frees up the old structure after a grace period. + * + * Uses rcu_assign_pointer() to ensure that concurrent readers + * see the initialized version of the new structure. + * + * Uses synchronize_rcu() to ensure that any readers that might + * have references to the old structure complete before freeing + * the old structure. + */ + void foo_update_a(int new_a) + { + struct foo *new_fp; + struct foo *old_fp; + + new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); + spin_lock(&foo_mutex); + old_fp = gbl_foo; + *new_fp = *old_fp; + new_fp->a = new_a; + rcu_assign_pointer(gbl_foo, new_fp); + spin_unlock(&foo_mutex); + synchronize_rcu(); + kfree(old_fp); + } + + /* + * Return the value of field "a" of the current gbl_foo + * structure. Use rcu_read_lock() and rcu_read_unlock() + * to ensure that the structure does not get deleted out + * from under us, and use rcu_dereference() to ensure that + * we see the initialized version of the structure (important + * for DEC Alpha and for people reading the code). 
+ */ + int foo_get_a(void) + { + int retval; + + rcu_read_lock(); + retval = rcu_dereference(gbl_foo)->a; + rcu_read_unlock(); + return retval; + } + +So, to sum up: + +o Use rcu_read_lock() and rcu_read_unlock() to guard RCU + read-side critical sections. + +o Within an RCU read-side critical section, use rcu_dereference() + to dereference RCU-protected pointers. + +o Use some solid scheme (such as locks or semaphores) to + keep concurrent updates from interfering with each other. + +o Use rcu_assign_pointer() to update an RCU-protected pointer. + This primitive protects concurrent readers from the updater, + -not- concurrent updates from each other! You therefore still + need to use locking (or something similar) to keep concurrent + rcu_assign_pointer() primitives from interfering with each other. + +o Use synchronize_rcu() -after- removing a data element from an + RCU-protected data structure, but -before- reclaiming/freeing + the data element, in order to wait for the completion of all + RCU read-side critical sections that might be referencing that + data item. + +See checklist.txt for additional rules to follow when using RCU. +And again, more-typical uses of RCU may be found in listRCU.txt, +arrayRCU.txt, and NMI-RCU.txt. + + +4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? + +In the example above, foo_update_a() blocks until a grace period elapses. +This is quite simple, but in some cases one cannot afford to wait so +long -- there might be other high-priority work to be done. + +In such cases, one uses call_rcu() rather than synchronize_rcu(). +The call_rcu() API is as follows: + + void call_rcu(struct rcu_head * head, + void (*func)(struct rcu_head *head)); + +This function invokes func(head) after a grace period has elapsed. +This invocation might happen from either softirq or process context, +so the function is not permitted to block. 
The foo struct needs to +have an rcu_head structure added, perhaps as follows: + + struct foo { + int a; + char b; + long c; + struct rcu_head rcu; + }; + +The foo_update_a() function might then be written as follows: + + /* + * Create a new struct foo that is the same as the one currently + * pointed to by gbl_foo, except that field "a" is replaced + * with "new_a". Points gbl_foo to the new structure, and + * frees up the old structure after a grace period. + * + * Uses rcu_assign_pointer() to ensure that concurrent readers + * see the initialized version of the new structure. + * + * Uses call_rcu() to ensure that any readers that might have + * references to the old structure complete before freeing the + * old structure. + */ + void foo_update_a(int new_a) + { + struct foo *new_fp; + struct foo *old_fp; + + new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); + spin_lock(&foo_mutex); + old_fp = gbl_foo; + *new_fp = *old_fp; + new_fp->a = new_a; + rcu_assign_pointer(gbl_foo, new_fp); + spin_unlock(&foo_mutex); + call_rcu(&old_fp->rcu, foo_reclaim); + } + +The foo_reclaim() function might appear as follows: + + void foo_reclaim(struct rcu_head *rp) + { + struct foo *fp = container_of(rp, struct foo, rcu); + + kfree(fp); + } + +The container_of() primitive is a macro that, given a pointer into a +struct, the type of the struct, and the pointed-to field within the +struct, returns a pointer to the beginning of the struct. + +The use of call_rcu() permits the caller of foo_update_a() to +immediately regain control, without needing to worry further about the +old version of the newly updated element. It also clearly shows the +RCU distinction between updater, namely foo_update_a(), and reclaimer, +namely foo_reclaim(). 
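The way foo_reclaim() recovers the enclosing struct foo from the embedded rcu_head can be tried out in user space. The following is a minimal sketch under stated assumptions: struct rcu_head here is a simplified stand-in, container_of() is written in terms of offsetof(), and fake_call_rcu() and demo_reclaim() are hypothetical helpers. In particular, fake_call_rcu() invokes the callback immediately and waits for no grace period, so it shows only the callback plumbing, not RCU itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel's struct rcu_head. */
struct rcu_head {
	void (*func)(struct rcu_head *head);
};

/* User-space equivalent of container_of(), built on offsetof(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct foo {
	int a;
	char b;
	long c;
	struct rcu_head rcu;
};

/* Records field "a" of the most recently reclaimed struct foo. */
int reclaimed_a = -1;

/* Callback: map the rcu_head back to its struct foo, then free it. */
void foo_reclaim(struct rcu_head *rp)
{
	struct foo *fp = container_of(rp, struct foo, rcu);

	reclaimed_a = fp->a;
	free(fp);
}

/*
 * Hypothetical stand-in for call_rcu(): runs the callback right away.
 * The real call_rcu() defers it until a grace period has elapsed.
 */
void fake_call_rcu(struct rcu_head *head,
		   void (*func)(struct rcu_head *head))
{
	head->func = func;
	head->func(head);
}

/* Allocate a struct foo, hand it to the callback, report what it saw. */
int demo_reclaim(int a)
{
	struct foo *fp = malloc(sizeof(*fp));

	assert(fp != NULL);
	fp->a = a;
	fp->b = 'x';
	fp->c = 0;
	fake_call_rcu(&fp->rcu, foo_reclaim);
	return reclaimed_a;
}
```

For example, demo_reclaim(42) returns 42, showing that the callback received only a pointer to the rcu field yet reached the right struct foo. Because the rcu field is not the first member, the pointer adjustment done by container_of() is actually exercised.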
+ +The summary of advice is the same as for the previous section, except +that we are now using call_rcu() rather than synchronize_rcu(): + +o Use call_rcu() -after- removing a data element from an + RCU-protected data structure in order to register a callback + function that will be invoked after the completion of all RCU + read-side critical sections that might be referencing that + data item. + +Again, see checklist.txt for additional rules governing the use of RCU. + + +5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? + +One of the nice things about RCU is that it has extremely simple "toy" +implementations that are a good first step towards understanding the +production-quality implementations in the Linux kernel. This section +presents two such "toy" implementations of RCU, one that is implemented +in terms of familiar locking primitives, and another that more closely +resembles "classic" RCU. Both are way too simple for real-world use, +lacking both functionality and performance. However, they are useful +in getting a feel for how RCU works. See kernel/rcupdate.c for a +production-quality implementation, and see: + + http://www.rdrop.com/users/paulmck/RCU + +for papers describing the Linux kernel RCU implementation. The OLS'01 +and OLS'02 papers are a good introduction, and the dissertation provides +more details on the current implementation as of early 2004. + + +5A. "TOY" IMPLEMENTATION #1: LOCKING + +This section presents a "toy" RCU implementation that is based on +familiar locking primitives. Its overhead makes it a non-starter for +real-life use, as does its lack of scalability. It is also unsuitable +for realtime use, since it allows scheduling latency to "bleed" from +one read-side critical section to another. + +However, it is probably the easiest implementation to relate to, so is +a good starting point. 
+ +It is extremely simple: + + static DEFINE_RWLOCK(rcu_gp_mutex); + + void rcu_read_lock(void) + { + read_lock(&rcu_gp_mutex); + } + + void rcu_read_unlock(void) + { + read_unlock(&rcu_gp_mutex); + } + + void synchronize_rcu(void) + { + write_lock(&rcu_gp_mutex); + write_unlock(&rcu_gp_mutex); + } + +[You can ignore rcu_assign_pointer() and rcu_dereference() without +missing much. But here they are anyway. And whatever you do, don't +forget about them when submitting patches making use of RCU!] + + #define rcu_assign_pointer(p, v) ({ \ + smp_wmb(); \ + (p) = (v); \ + }) + + #define rcu_dereference(p) ({ \ + typeof(p) _________p1 = p; \ + smp_read_barrier_depends(); \ + (_________p1); \ + }) + + +The rcu_read_lock() and rcu_read_unlock() primitives read-acquire +and release a global reader-writer lock. The synchronize_rcu() +primitive write-acquires this same lock, then immediately releases +it. This means that once synchronize_rcu() exits, all RCU read-side +critical sections that were in progress before synchronize_rcu() was +called are guaranteed to have completed -- there is no way that +synchronize_rcu() would have been able to write-acquire the lock +otherwise. + +It is possible to nest rcu_read_lock(), since reader-writer locks may +be recursively acquired. Note also that rcu_read_lock() is immune +from deadlock (an important property of RCU). The reason for this is +that the only thing that can block rcu_read_lock() is a synchronize_rcu(). +But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex, +so there can be no deadlock cycle. + +Quick Quiz #1: Why is this argument naive? How could a deadlock + occur when using this algorithm in a real-world Linux + kernel? How could this deadlock be avoided? + + +5B. "TOY" EXAMPLE #2: CLASSIC RCU + +This section presents a "toy" RCU implementation that is based on +"classic RCU".
It is also short on performance (but only for updates) and +on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT +kernels. The definitions of rcu_dereference() and rcu_assign_pointer() +are the same as those shown in the preceding section, so they are omitted. + + void rcu_read_lock(void) { } + + void rcu_read_unlock(void) { } + + void synchronize_rcu(void) + { + int cpu; + + for_each_possible_cpu(cpu) + run_on(cpu); + } + +Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing. +This is the great strength of classic RCU in a non-preemptive kernel: +read-side overhead is precisely zero, at least on non-Alpha CPUs. +And there is absolutely no way that rcu_read_lock() can possibly +participate in a deadlock cycle! + +The implementation of synchronize_rcu() simply schedules itself on each +CPU in turn. The run_on() primitive can be implemented straightforwardly +in terms of the sched_setaffinity() primitive. Of course, a somewhat less +"toy" implementation would restore the affinity upon completion rather +than just leaving all tasks running on the last CPU, but when I said +"toy", I meant -toy-! + +So how the heck is this supposed to work??? + +Remember that it is illegal to block while in an RCU read-side critical +section. Therefore, if a given CPU executes a context switch, we know +that it must have completed all preceding RCU read-side critical sections. +Once -all- CPUs have executed a context switch, then -all- preceding +RCU read-side critical sections will have completed. + +So, suppose that we remove a data item from its structure and then invoke +synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed +that there are no RCU read-side critical sections holding a reference +to that data item, so we can safely reclaim it. + +Quick Quiz #2: Give an example where Classic RCU's read-side + overhead is -negative-. 
+ +Quick Quiz #3: If it is illegal to block in an RCU read-side + critical section, what the heck do you do in + PREEMPT_RT, where normal spinlocks can block??? + + +6. ANALOGY WITH READER-WRITER LOCKING + +Although RCU can be used in many different ways, a very common use of +RCU is analogous to reader-writer locking. The following unified +diff shows how closely related RCU and reader-writer locking can be. + + @@ -13,15 +14,15 @@ + struct list_head *lp; + struct el *p; + + - read_lock(); + - list_for_each_entry(p, head, lp) { + + rcu_read_lock(); + + list_for_each_entry_rcu(p, head, lp) { + if (p->key == key) { + *result = p->data; + - read_unlock(); + + rcu_read_unlock(); + return 1; + } + } + - read_unlock(); + + rcu_read_unlock(); + return 0; + } + + @@ -29,15 +30,16 @@ + { + struct el *p; + + - write_lock(&listmutex); + + spin_lock(&listmutex); + list_for_each_entry(p, head, lp) { + if (p->key == key) { + - list_del(&p->list); + - write_unlock(&listmutex); + + list_del_rcu(&p->list); + + spin_unlock(&listmutex); + + synchronize_rcu(); + kfree(p); + return 1; + } + } + - write_unlock(&listmutex); + + spin_unlock(&listmutex); + return 0; + } + +Or, for those who prefer a side-by-side listing: + + 1 struct el { 1 struct el { + 2 struct list_head list; 2 struct list_head list; + 3 long key; 3 long key; + 4 spinlock_t mutex; 4 spinlock_t mutex; + 5 int data; 5 int data; + 6 /* Other data fields */ 6 /* Other data fields */ + 7 }; 7 }; + 8 spinlock_t listmutex; 8 spinlock_t listmutex; + 9 struct el head; 9 struct el head; + + 1 int search(long key, int *result) 1 int search(long key, int *result) + 2 { 2 { + 3 struct list_head *lp; 3 struct list_head *lp; + 4 struct el *p; 4 struct el *p; + 5 5 + 6 read_lock(); 6 rcu_read_lock(); + 7 list_for_each_entry(p, head, lp) { 7 list_for_each_entry_rcu(p, head, lp) { + 8 if (p->key == key) { 8 if (p->key == key) { + 9 *result = p->data; 9 *result = p->data; +10 read_unlock(); 10 rcu_read_unlock(); +11 return 1; 11 return 
1; +12 } 12 } +13 } 13 } +14 read_unlock(); 14 rcu_read_unlock(); +15 return 0; 15 return 0; +16 } 16 } + + 1 int delete(long key) 1 int delete(long key) + 2 { 2 { + 3 struct el *p; 3 struct el *p; + 4 4 + 5 write_lock(&listmutex); 5 spin_lock(&listmutex); + 6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) { + 7 if (p->key == key) { 7 if (p->key == key) { + 8 list_del(&p->list); 8 list_del_rcu(&p->list); + 9 write_unlock(&listmutex); 9 spin_unlock(&listmutex); + 10 synchronize_rcu(); +10 kfree(p); 11 kfree(p); +11 return 1; 12 return 1; +12 } 13 } +13 } 14 } +14 write_unlock(&listmutex); 15 spin_unlock(&listmutex); +15 return 0; 16 return 0; +16 } 17 } + +Either way, the differences are quite small. Read-side locking moves +to rcu_read_lock() and rcu_read_unlock(), update-side locking moves from +a reader-writer lock to a simple spinlock, and a synchronize_rcu() +precedes the kfree(). + +However, there is one potential catch: the read-side and update-side +critical sections can now run concurrently. In many cases, this will +not be a problem, but it is necessary to check carefully regardless. +For example, if multiple independent list updates must be seen as +a single atomic update, converting to RCU will require special care. + +Also, the presence of synchronize_rcu() means that the RCU version of +delete() can now block. If this is a problem, there is a callback-based +mechanism that never blocks, namely call_rcu(), that can be used in +place of synchronize_rcu(). + + +7. FULL LIST OF RCU APIs + +The RCU APIs are documented in docbook-format header comments in the +Linux-kernel source code, but it helps to have a full list of the +APIs, since there does not appear to be a way to categorize them +in docbook. Here is the list, by category.
+ +RCU list traversal: + + list_for_each_entry_rcu + hlist_for_each_entry_rcu + hlist_nulls_for_each_entry_rcu + + list_for_each_continue_rcu (to be deprecated in favor of new + list_for_each_entry_continue_rcu) + +RCU pointer/list update: + + rcu_assign_pointer + list_add_rcu + list_add_tail_rcu + list_del_rcu + list_replace_rcu + hlist_del_rcu + hlist_add_after_rcu + hlist_add_before_rcu + hlist_add_head_rcu + hlist_replace_rcu + list_splice_init_rcu + +RCU: Critical sections Grace period Barrier + + rcu_read_lock synchronize_net rcu_barrier + rcu_read_unlock synchronize_rcu + rcu_dereference synchronize_rcu_expedited + call_rcu + + +bh: Critical sections Grace period Barrier + + rcu_read_lock_bh call_rcu_bh rcu_barrier_bh + rcu_read_unlock_bh synchronize_rcu_bh + rcu_dereference_bh synchronize_rcu_bh_expedited + + +sched: Critical sections Grace period Barrier + + rcu_read_lock_sched synchronize_sched rcu_barrier_sched + rcu_read_unlock_sched call_rcu_sched + [preempt_disable] synchronize_sched_expedited + [and friends] + rcu_dereference_sched + + +SRCU: Critical sections Grace period Barrier + + srcu_read_lock synchronize_srcu N/A + srcu_read_unlock synchronize_srcu_expedited + srcu_dereference + +SRCU: Initialization/cleanup + init_srcu_struct + cleanup_srcu_struct + +All: lockdep-checked RCU-protected pointer access + + rcu_dereference_check + rcu_dereference_protected + rcu_access_pointer + +See the comment headers in the source code (or the docbook generated +from them) for more information. + +However, given that there are no fewer than four families of RCU APIs +in the Linux kernel, how do you choose which one to use? The following +list can be helpful: + +a. Will readers need to block? If so, you need SRCU. + +b. What about the -rt patchset? If readers would need to block + in a non-rt kernel, you need SRCU. If readers would block + in a -rt kernel, but not in a non-rt kernel, SRCU is not + necessary. + +c.
Do you need to treat NMI handlers, hardirq handlers, + and code segments with preemption disabled (whether + via preempt_disable(), local_irq_save(), local_bh_disable(), + or some other mechanism) as if they were explicit RCU readers? + If so, you need RCU-sched. + +d. Do you need RCU grace periods to complete even in the face + of softirq monopolization of one or more of the CPUs? For + example, is your code subject to network-based denial-of-service + attacks? If so, you need RCU-bh. + +e. Is your workload too update-intensive for normal use of + RCU, but inappropriate for other synchronization mechanisms? + If so, consider SLAB_DESTROY_BY_RCU. But please be careful! + +f. Otherwise, use RCU. + +Of course, this all assumes that you have determined that RCU is in fact +the right tool for your job. + + +8. ANSWERS TO QUICK QUIZZES + +Quick Quiz #1: Why is this argument naive? How could a deadlock + occur when using this algorithm in a real-world Linux + kernel? [Referring to the lock-based "toy" RCU + algorithm.] + +Answer: Consider the following sequence of events: + + 1. CPU 0 acquires some unrelated lock, call it + "problematic_lock", disabling irq via + spin_lock_irqsave(). + + 2. CPU 1 enters synchronize_rcu(), write-acquiring + rcu_gp_mutex. + + 3. CPU 0 enters rcu_read_lock(), but must wait + because CPU 1 holds rcu_gp_mutex. + + 4. CPU 1 is interrupted, and the irq handler + attempts to acquire problematic_lock. + + The system is now deadlocked. + + One way to avoid this deadlock is to use an approach like + that of CONFIG_PREEMPT_RT, where all normal spinlocks + become blocking locks, and all irq handlers execute in + the context of special tasks. In this case, in step 4 + above, the irq handler would block, allowing CPU 1 to + release rcu_gp_mutex, avoiding the deadlock. + + Even in the absence of deadlock, this RCU implementation + allows latency to "bleed" from readers to other + readers through synchronize_rcu(). 
To see this, + consider task A in an RCU read-side critical section + (thus read-holding rcu_gp_mutex), task B blocked + attempting to write-acquire rcu_gp_mutex, and + task C blocked in rcu_read_lock() attempting to + read-acquire rcu_gp_mutex. Task A's RCU read-side + latency is holding up task C, albeit indirectly via + task B. + + Realtime RCU implementations therefore use a counter-based + approach where tasks in RCU read-side critical sections + cannot be blocked by tasks executing synchronize_rcu(). + +Quick Quiz #2: Give an example where Classic RCU's read-side + overhead is -negative-. + +Answer: Imagine a single-CPU system with a non-CONFIG_PREEMPT + kernel where a routing table is used by process-context + code, but can be updated by irq-context code (for example, + by an "ICMP REDIRECT" packet). The usual way of handling + this would be to have the process-context code disable + interrupts while searching the routing table. Use of + RCU allows such interrupt-disabling to be dispensed with. + Thus, without RCU, you pay the cost of disabling interrupts, + and with RCU you don't. + + One can argue that the overhead of RCU in this + case is negative with respect to the single-CPU + interrupt-disabling approach. Others might argue that + the overhead of RCU is merely zero, and that replacing + the positive overhead of the interrupt-disabling scheme + with the zero-overhead RCU scheme does not constitute + negative overhead. + + In real life, of course, things are more complex. But + even the theoretical possibility of negative overhead for + a synchronization primitive is a bit unexpected. ;-) + +Quick Quiz #3: If it is illegal to block in an RCU read-side + critical section, what the heck do you do in + PREEMPT_RT, where normal spinlocks can block??? + +Answer: Just as PREEMPT_RT permits preemption of spinlock + critical sections, it permits preemption of RCU + read-side critical sections.
It also permits + spinlocks to block while in RCU read-side critical + sections. + + Why the apparent inconsistency? Because it is + possible to use priority boosting to keep the RCU + grace periods short if need be (for example, if running + short of memory). In contrast, if blocking waiting + for (say) network reception, there is no way to know + what should be boosted. Especially given that the + process we need to boost might well be a human being + who just went out for a pizza or something. And although + a computer-operated cattle prod might arouse serious + interest, it might also provoke serious objections. + Besides, how does the computer know what pizza parlor + the human being went to??? + + +ACKNOWLEDGEMENTS + +My thanks to the people who helped make this human-readable, including +Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern. + + +For more information, see http://www.rdrop.com/users/paulmck/RCU. -- cgit v1.2.3