From 849369d6c66d3054688672f97d31fceb8e8230fb Mon Sep 17 00:00:00 2001
From: root
Date: Fri, 25 Dec 2015 04:40:36 +0000
Subject: initial_commit

---
 Documentation/sysctl/00-INDEX   |  16 +
 Documentation/sysctl/README     |  75 +++++
 Documentation/sysctl/abi.txt    |  54 ++++
 Documentation/sysctl/fs.txt     | 245 ++++++++++++++
 Documentation/sysctl/kernel.txt | 557 +++++++++++++++++++++++++++++++
 Documentation/sysctl/net.txt    | 196 +++++++++++
 Documentation/sysctl/sunrpc.txt |  20 ++
 Documentation/sysctl/vm.txt     | 701 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 1864 insertions(+)
 create mode 100644 Documentation/sysctl/00-INDEX
 create mode 100644 Documentation/sysctl/README
 create mode 100644 Documentation/sysctl/abi.txt
 create mode 100644 Documentation/sysctl/fs.txt
 create mode 100644 Documentation/sysctl/kernel.txt
 create mode 100644 Documentation/sysctl/net.txt
 create mode 100644 Documentation/sysctl/sunrpc.txt
 create mode 100644 Documentation/sysctl/vm.txt

diff --git a/Documentation/sysctl/00-INDEX b/Documentation/sysctl/00-INDEX
new file mode 100644
index 00000000..8cf5d493
--- /dev/null
+++ b/Documentation/sysctl/00-INDEX
@@ -0,0 +1,16 @@
+00-INDEX
+        - this file.
+README
+        - general information about /proc/sys/ sysctl files.
+abi.txt
+        - documentation for /proc/sys/abi/*.
+fs.txt
+        - documentation for /proc/sys/fs/*.
+kernel.txt
+        - documentation for /proc/sys/kernel/*.
+net.txt
+        - documentation for /proc/sys/net/*.
+sunrpc.txt
+        - documentation for /proc/sys/sunrpc/*.
+vm.txt
+        - documentation for /proc/sys/vm/*.
diff --git a/Documentation/sysctl/README b/Documentation/sysctl/README
new file mode 100644
index 00000000..8c3306e0
--- /dev/null
+++ b/Documentation/sysctl/README
@@ -0,0 +1,75 @@
+Documentation for /proc/sys/            kernel version 2.2.10
+        (c) 1998, 1999, Rik van Riel
+
+'Why', I hear you ask, 'would anyone even _want_ documentation
+for them sysctl files? If anybody really needs it, it's all in
+the source...'
+
+Well, this documentation is written because some people either
+don't know they need to tweak something, or because they don't
+have the time or knowledge to read the source code.
+
+Furthermore, the programmers who built sysctl have built it to
+be actually used, not just for the fun of programming it :-)
+
+==============================================================
+
+Legal blurb:
+
+As usual, there are two main things to consider:
+1. you get what you pay for
+2. it's free
+
+The consequences are that I won't guarantee the correctness of
+this document, and if you come to me complaining about how you
+screwed up your system because of wrong documentation, I won't
+feel sorry for you. I might even laugh at you...
+
+But of course, if you _do_ manage to screw up your system using
+only the sysctl options used in this file, I'd like to hear of
+it. Not only to have a great laugh, but also to make sure that
+you're the last RTFMing person to screw up.
+
+In short, e-mail your suggestions, corrections and / or horror
+stories to:
+
+Rik van Riel.
+
+==============================================================
+
+Introduction:
+
+Sysctl is a means of configuring certain aspects of the kernel
+at run-time, and the /proc/sys/ directory is there so that you
+don't even need special tools to do it!
+In fact, there are only four things needed to use these config
+facilities:
+- a running Linux system
+- root access
+- common sense (this is especially hard to come by these days)
+- knowledge of what all those values mean
+
+As a quick 'ls /proc/sys' will show, the directory consists of
+several (arch-dependent?) subdirs. Each subdir is mainly about
+one part of the kernel, so you can do configuration on a piece
+by piece basis, or just some 'thematic frobbing'.
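As a minimal sketch of the "no special tools" point in the introduction above: reading a sysctl value needs nothing beyond ordinary file I/O. The helper names `sysctl_path` and `read_sysctl`, the dotted-name convention borrowed from sysctl(8), and the overridable `root` argument are illustrative assumptions, not part of any kernel interface.

```python
import os


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted name such as 'kernel.ostype' to its file under root.

    Assumption: no path component itself contains a dot.
    """
    return os.path.join(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a stripped string."""
    with open(sysctl_path(name, root)) as f:
        return f.read().strip()


if __name__ == "__main__":
    # Reading is normally unprivileged; writing usually requires root.
    print(read_sysctl("kernel.ostype"))
```

Writing works the same way (open the file for writing and store the new value), which is why root access and common sense are on the list above.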
+
+The subdirs are about:
+abi/            execution domains & personalities
+debug/
+dev/            device specific information (eg dev/cdrom/info)
+fs/             specific filesystems
+                filehandle, inode, dentry and quota tuning
+                binfmt_misc
+kernel/         global kernel info / tuning
+                miscellaneous stuff
+net/            networking stuff, for documentation look in:
+
+proc/
+sunrpc/         SUN Remote Procedure Call (NFS)
+vm/             memory management tuning
+                buffer and cache management
+
+These are the subdirs I have on my system. There might be more
+or other subdirs in another setup. If you see another dir, I'd
+really like to hear about it :-)
diff --git a/Documentation/sysctl/abi.txt b/Documentation/sysctl/abi.txt
new file mode 100644
index 00000000..63f4ebcf
--- /dev/null
+++ b/Documentation/sysctl/abi.txt
@@ -0,0 +1,54 @@
+Documentation for /proc/sys/abi/*       kernel version 2.6.0.test2
+        (c) 2003, Fabian Frederick
+
+For general info : README.
+
+==============================================================
+
+This path is binary emulation relevant aka personality types aka abi.
+When a process is executed, it's linked to an exec_domain whose
+personality is defined using values available from /proc/sys/abi.
+You can find further details about abi in include/linux/personality.h.
+
+Here are the files featuring in 2.6 kernel :
+
+- defhandler_coff
+- defhandler_elf
+- defhandler_lcall7
+- defhandler_libcso
+- fake_utsname
+- trace
+
+===========================================================
+defhandler_coff:
+defined value :
+PER_SCOSVR3
+0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE
+
+===========================================================
+defhandler_elf:
+defined value :
+PER_LINUX
+0
+
+===========================================================
+defhandler_lcall7:
+defined value :
+PER_SVR4
+0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
+
+===========================================================
+defhandler_libcso:
+defined value:
+PER_SVR4
+0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
+
+===========================================================
+fake_utsname:
+Unused
+
+===========================================================
+trace:
+Unused
+
+===========================================================
diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
new file mode 100644
index 00000000..88fd7f5c
--- /dev/null
+++ b/Documentation/sysctl/fs.txt
@@ -0,0 +1,245 @@
+Documentation for /proc/sys/fs/*        kernel version 2.2.10
+        (c) 1998, 1999, Rik van Riel
+        (c) 2009, Shen Feng
+
+For general info and legal blurb, please look in README.
+
+==============================================================
+
+This file contains documentation for the sysctl files in
+/proc/sys/fs/ and is valid for Linux kernel version 2.2.
+
+The files in this directory can be used to tune and monitor
+miscellaneous and general things in the operation of the Linux
+kernel. Since some of the files _can_ be used to screw up your
+system, it is advisable to read both documentation and source
+before actually making adjustments.
+
+1. /proc/sys/fs
+----------------------------------------------------------
+
+Currently, these files are in /proc/sys/fs:
+- aio-max-nr
+- aio-nr
+- dentry-state
+- dquot-max
+- dquot-nr
+- file-max
+- file-nr
+- inode-max
+- inode-nr
+- inode-state
+- nr_open
+- overflowuid
+- overflowgid
+- suid_dumpable
+- super-max
+- super-nr
+
+==============================================================
+
+aio-nr & aio-max-nr:
+
+aio-nr is the running total of the number of events specified on the
+io_setup system call for all currently active aio contexts. If aio-nr
+reaches aio-max-nr then io_setup will fail with EAGAIN. Note that
+raising aio-max-nr does not result in the pre-allocation or re-sizing
+of any kernel data structures.
+
+==============================================================
+
+dentry-state:
+
+From linux/fs/dentry.c:
+--------------------------------------------------------------
+struct {
+        int nr_dentry;
+        int nr_unused;
+        int age_limit;         /* age in seconds */
+        int want_pages;        /* pages requested by system */
+        int dummy[2];
+} dentry_stat = {0, 0, 45, 0,};
+--------------------------------------------------------------
+
+Dentries are dynamically allocated and deallocated, and
+nr_dentry seems to be 0 all the time. Hence it's safe to
+assume that only nr_unused, age_limit and want_pages are
+used. Nr_unused seems to be exactly what its name says.
+Age_limit is the age in seconds after which dcache entries
+can be reclaimed when memory is short and want_pages is
+nonzero when shrink_dcache_pages() has been called and the
+dcache isn't pruned yet.
+
+==============================================================
+
+dquot-max & dquot-nr:
+
+The file dquot-max shows the maximum number of cached disk
+quota entries.
+
+The file dquot-nr shows the number of allocated disk quota
+entries and the number of free disk quota entries.
+
+If the number of free cached disk quotas is very low and
+you have some awesome number of simultaneous system users,
+you might want to raise the limit.
+
+==============================================================
+
+file-max & file-nr:
+
+The value in file-max denotes the maximum number of file-
+handles that the Linux kernel will allocate. When you get lots
+of error messages about running out of file handles, you might
+want to increase this limit.
+
+Historically, the kernel was able to allocate file handles
+dynamically, but not to free them again. The three values in
+file-nr denote the number of allocated file handles, the number
+of allocated but unused file handles, and the maximum number of
+file handles. Linux 2.6 always reports 0 as the number of free
+file handles -- this is not an error, it just means that the
+number of allocated file handles exactly matches the number of
+used file handles.
+
+Attempts to allocate more file descriptors than file-max are
+reported with printk, look for "VFS: file-max limit
+reached".
+==============================================================
+
+nr_open:
+
+This denotes the maximum number of file-handles a process can
+allocate. Default value is 1024*1024 (1048576) which should be
+enough for most machines. Actual limit depends on RLIMIT_NOFILE
+resource limit.
+
+==============================================================
+
+inode-max, inode-nr & inode-state:
+
+As with file handles, the kernel allocates the inode structures
+dynamically, but can't free them yet.
+
+The value in inode-max denotes the maximum number of inode
+handlers. This value should be 3-4 times larger than the value
+in file-max, since stdin, stdout and network sockets also
+need an inode struct to handle them. When you regularly run
+out of inodes, you need to increase this value.
+
+The file inode-nr contains the first two items from
+inode-state, so we'll skip to that file...
+
+Inode-state contains three actual numbers and four dummies.
+The actual numbers are, in order of appearance, nr_inodes,
+nr_free_inodes and preshrink.
+
+Nr_inodes stands for the number of inodes the system has
+allocated, this can be slightly more than inode-max because
+Linux allocates them one pageful at a time.
+
+Nr_free_inodes represents the number of free inodes (?) and
+preshrink is nonzero when the nr_inodes > inode-max and the
+system needs to prune the inode list instead of allocating
+more.
+
+==============================================================
+
+overflowgid & overflowuid:
+
+Some filesystems only support 16-bit UIDs and GIDs, although in Linux
+UIDs and GIDs are 32 bits. When one of these filesystems is mounted
+with writes enabled, any UID or GID that would exceed 65535 is translated
+to a fixed value before being written to disk.
+
+These sysctls allow you to change the value of the fixed UID and GID.
+The default is 65534.
+
+==============================================================
+
+suid_dumpable:
+
+This value can be used to query and set the core dump mode for setuid
+or otherwise protected/tainted binaries. The modes are
+
+0 - (default) - traditional behaviour. Any process which has changed
+        privilege levels or is execute only will not be dumped
+1 - (debug) - all processes dump core when possible. The core dump is
+        owned by the current user and no security is applied. This is
+        intended for system debugging situations only. Ptrace is unchecked.
+2 - (suidsafe) - any binary which normally would not be dumped is dumped
+        readable by root only. This allows the end user to remove
+        such a dump but not access it directly. For security reasons
+        core dumps in this mode will not overwrite one another or
+        other files. This mode is appropriate when administrators are
+        attempting to debug problems in a normal environment.
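The three suid_dumpable modes listed above can be turned into a small lookup, sketched here under the assumption that the file holds a single decimal integer. The names `SUID_DUMPABLE_MODES` and `describe_suid_dumpable` are hypothetical helpers for illustration only.

```python
# Mode names exactly as given in the list above.
SUID_DUMPABLE_MODES = {0: "default", 1: "debug", 2: "suidsafe"}


def describe_suid_dumpable(raw):
    """Translate the text read from suid_dumpable into its mode name."""
    value = int(raw.strip())
    if value not in SUID_DUMPABLE_MODES:
        raise ValueError("unknown suid_dumpable mode: %d" % value)
    return SUID_DUMPABLE_MODES[value]


if __name__ == "__main__":
    with open("/proc/sys/fs/suid_dumpable") as f:
        print(describe_suid_dumpable(f.read()))
```

Rejecting unknown values instead of guessing keeps the sketch honest on kernels that might define modes beyond the three documented here.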
+
+==============================================================
+
+super-max & super-nr:
+
+These numbers control the maximum number of superblocks, and
+thus the maximum number of mounted filesystems the kernel
+can have. You only need to increase super-max if you need to
+mount more filesystems than the current value in super-max
+allows you to.
+
+==============================================================
+
+aio-nr & aio-max-nr:
+
+aio-nr shows the current system-wide number of asynchronous io
+requests. aio-max-nr allows you to change the maximum value
+aio-nr can grow to.
+
+==============================================================
+
+
+2. /proc/sys/fs/binfmt_misc
+----------------------------------------------------------
+
+Documentation for the files in /proc/sys/fs/binfmt_misc is
+in Documentation/binfmt_misc.txt.
+
+
+3. /proc/sys/fs/mqueue - POSIX message queues filesystem
+----------------------------------------------------------
+
+The "mqueue" filesystem provides the necessary kernel features to enable the
+creation of a user space library that implements the POSIX message queues
+API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System
+Interfaces specification.)
+
+The "mqueue" filesystem contains values for determining/setting the amount of
+resources used by the file system.
+
+/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the
+maximum number of message queues allowed on the system.
+
+/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the
+maximum number of messages in a queue value. In fact it is the limiting value
+for another (user) limit which is set in mq_open invocation. This attribute of
+a queue must be less than or equal to msg_max.
+
+/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the
+maximum message size value (it is every message queue's attribute set during
+its creation).
+
+
+4. /proc/sys/fs/epoll - Configuration options for the epoll interface
+--------------------------------------------------------
+
+This directory contains configuration options for the epoll(7) interface.
+
+max_user_watches
+----------------
+
+Every epoll file descriptor can store a number of files to be monitored
+for event readiness. Each one of these monitored files constitutes a "watch".
+This configuration option sets the maximum number of "watches" that are
+allowed for each user.
+Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes
+on a 64bit one.
+The current default value for max_user_watches is 1/32 of the available
+low memory, divided by the "watch" cost in bytes.
+
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
new file mode 100644
index 00000000..5e7cb39a
--- /dev/null
+++ b/Documentation/sysctl/kernel.txt
@@ -0,0 +1,557 @@
+Documentation for /proc/sys/kernel/*    kernel version 2.2.10
+        (c) 1998, 1999, Rik van Riel
+        (c) 2009, Shen Feng
+
+For general info and legal blurb, please look in README.
+
+==============================================================
+
+This file contains documentation for the sysctl files in
+/proc/sys/kernel/ and is valid for Linux kernel version 2.2.
+
+The files in this directory can be used to tune and monitor
+miscellaneous and general things in the operation of the Linux
+kernel. Since some of the files _can_ be used to screw up your
+system, it is advisable to read both documentation and source
+before actually making adjustments.
+
+Currently, these files might (depending on your configuration)
+show up in /proc/sys/kernel:
+- acpi_video_flags
+- acct
+- bootloader_type             [ X86 only ]
+- bootloader_version          [ X86 only ]
+- callhome                    [ S390 only ]
+- auto_msgmni
+- core_pattern
+- core_pipe_limit
+- core_uses_pid
+- ctrl-alt-del
+- dentry-state
+- dmesg_restrict
+- domainname
+- hostname
+- hotplug
+- java-appletviewer           [ binfmt_java, obsolete ]
+- java-interpreter            [ binfmt_java, obsolete ]
+- kptr_restrict
+- kstack_depth_to_print       [ X86 only ]
+- l2cr                        [ PPC only ]
+- modprobe                    ==> Documentation/debugging-modules.txt
+- modules_disabled
+- msgmax
+- msgmnb
+- msgmni
+- nmi_watchdog
+- osrelease
+- ostype
+- overflowgid
+- overflowuid
+- panic
+- pid_max
+- powersave-nap               [ PPC only ]
+- panic_on_unrecovered_nmi
+- printk
+- randomize_va_space
+- real-root-dev               ==> Documentation/initrd.txt
+- reboot-cmd                  [ SPARC only ]
+- rtsig-max
+- rtsig-nr
+- sem
+- sg-big-buff                 [ generic SCSI device (sg) ]
+- shmall
+- shmmax                      [ sysv ipc ]
+- shmmni
+- stop-a                      [ SPARC only ]
+- sysrq                       ==> Documentation/sysrq.txt
+- tainted
+- threads-max
+- unknown_nmi_panic
+- version
+
+==============================================================
+
+acpi_video_flags:
+
+flags
+
+See Doc*/kernel/power/video.txt, it allows mode of video boot to be
+set during run time.
+
+==============================================================
+
+acct:
+
+highwater lowwater frequency
+
+If BSD-style process accounting is enabled these values control
+its behaviour. If free space on filesystem where the log lives
+goes below lowwater% accounting suspends. If free space gets
+above highwater% accounting resumes. frequency determines
+how often we check the amount of free space (value is in
+seconds). Default:
+4 2 30
+That is, suspend accounting if there is <= 2% free space left; resume it
+if we get >= 4%; consider information about amount of free space
+valid for 30 seconds.
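The suspend/resume rule in the acct description is a simple hysteresis, sketched below. This is not the kernel's implementation: `parse_acct` and `next_state` are hypothetical names, and the comparisons follow the worded example above (suspend at <= lowwater, resume at >= highwater).

```python
def parse_acct(raw):
    """Split the acct file contents into (highwater, lowwater, frequency)."""
    highwater, lowwater, frequency = (int(field) for field in raw.split())
    return highwater, lowwater, frequency


def next_state(active, free_percent, highwater, lowwater):
    """Return whether accounting is active after one free-space check."""
    if active and free_percent <= lowwater:
        return False  # too little free space: suspend
    if not active and free_percent >= highwater:
        return True  # enough free space again: resume
    return active  # between the two marks: keep the current state


if __name__ == "__main__":
    high, low, freq = parse_acct("4 2 30")
    state = True
    for free in (5, 2, 3, 4):
        state = next_state(state, free, high, low)
        print(free, state)
```

The gap between the two marks is what prevents accounting from flapping on and off when free space hovers around a single threshold.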
+
+==============================================================
+
+bootloader_type:
+
+x86 bootloader identification
+
+This gives the bootloader type number as indicated by the bootloader,
+shifted left by 4, and OR'd with the low four bits of the bootloader
+version. The reason for this encoding is that this used to match the
+type_of_loader field in the kernel header; the encoding is kept for
+backwards compatibility. That is, if the full bootloader type number
+is 0x15 and the full version number is 0x234, this file will contain
+the value 340 = 0x154.
+
+See the type_of_loader and ext_loader_type fields in
+Documentation/x86/boot.txt for additional information.
+
+==============================================================
+
+bootloader_version:
+
+x86 bootloader version
+
+The complete bootloader version number. In the example above, this
+file will contain the value 564 = 0x234.
+
+See the type_of_loader and ext_loader_ver fields in
+Documentation/x86/boot.txt for additional information.
+
+==============================================================
+
+callhome:
+
+Controls the kernel's callhome behavior in case of a kernel panic.
+
+The s390 hardware allows an operating system to send a notification
+to a service organization (callhome) in case of an operating system panic.
+
+When the value in this file is 0 (which is the default behavior)
+nothing happens in case of a kernel panic. If this value is set to "1"
+the complete kernel oops message is sent to the IBM customer service
+organization in case the mainframe the Linux operating system is running
+on has a service contract with IBM.
+
+==============================================================
+
+core_pattern:
+
+core_pattern is used to specify a core dumpfile pattern name.
+. max length 128 characters; default value is "core"
+. core_pattern is used as a pattern template for the output filename;
+  certain string patterns (beginning with '%') are substituted with
+  their actual values.
+. backward compatibility with core_uses_pid:
+        If core_pattern does not include "%p" (default does not)
+        and core_uses_pid is set, then .PID will be appended to
+        the filename.
+. corename format specifiers:
+        %       '%' is dropped
+        %%      output one '%'
+        %p      pid
+        %u      uid
+        %g      gid
+        %s      signal number
+        %t      UNIX time of dump
+        %h      hostname
+        %e      executable filename (may be shortened)
+        %E      executable path
+        %       both are dropped
+. If the first character of the pattern is a '|', the kernel will treat
+  the rest of the pattern as a command to run. The core dump will be
+  written to the standard input of that program instead of to a file.
+
+==============================================================
+
+core_pipe_limit:
+
+This sysctl is only applicable when core_pattern is configured to pipe core
+files to a user space helper (when the first character of core_pattern is a '|',
+see above). When collecting cores via a pipe to an application, it is
+occasionally useful for the collecting application to gather data about the
+crashing process from its /proc/pid directory. In order to do this safely, the
+kernel must wait for the collecting process to exit, so as not to remove the
+crashing processes proc files prematurely. This in turn creates the possibility
+that a misbehaving userspace collecting process can block the reaping of a
+crashed process simply by never exiting. This sysctl defends against that. It
+defines how many concurrent crashing processes may be piped to user space
+applications in parallel. If this value is exceeded, then those crashing
+processes above that value are noted via the kernel log and their cores are
+skipped. 0 is a special value, indicating that unlimited processes may be
+captured in parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/pid/). This value defaults
+to 0.
+
+==============================================================
+
+core_uses_pid:
+
+The default coredump filename is "core". By setting
+core_uses_pid to 1, the coredump filename becomes core.PID.
+If core_pattern does not include "%p" (default does not)
+and core_uses_pid is set, then .PID will be appended to
+the filename.
+
+==============================================================
+
+ctrl-alt-del:
+
+When the value in this file is 0, ctrl-alt-del is trapped and
+sent to the init(1) program to handle a graceful restart.
+When, however, the value is > 0, Linux's reaction to a Vulcan
+Nerve Pinch (tm) will be an immediate reboot, without even
+syncing its dirty buffers.
+
+Note: when a program (like dosemu) has the keyboard in 'raw'
+mode, the ctrl-alt-del is intercepted by the program before it
+ever reaches the kernel tty layer, and it's up to the program
+to decide what to do with it.
+
+==============================================================
+
+dmesg_restrict:
+
+This toggle indicates whether unprivileged users are prevented from using
+dmesg(8) to view messages from the kernel's log buffer. When
+dmesg_restrict is set to (0) there are no restrictions. When
+dmesg_restrict is set to (1), users must have CAP_SYSLOG to use
+dmesg(8).
+
+The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the default
+value of dmesg_restrict.
+
+==============================================================
+
+domainname & hostname:
+
+These files can be used to set the NIS/YP domainname and the
+hostname of your box in exactly the same way as the commands
+domainname and hostname, i.e.:
+# echo "darkstar" > /proc/sys/kernel/hostname
+# echo "mydomain" > /proc/sys/kernel/domainname
+has the same effect as
+# hostname "darkstar"
+# domainname "mydomain"
+
+Note, however, that the classic darkstar.frop.org has the
+hostname "darkstar" and DNS (Internet Domain Name Server)
+domainname "frop.org", not to be confused with the NIS (Network
+Information Service) or YP (Yellow Pages) domainname. These two
+domain names are in general different. For a detailed discussion
+see the hostname(1) man page.
+
+==============================================================
+
+hotplug:
+
+Path for the hotplug policy agent.
+Default value is "/sbin/hotplug".
+
+==============================================================
+
+l2cr: (PPC only)
+
+This flag controls the L2 cache of G3 processor boards. If
+0, the cache is disabled. Enabled if nonzero.
+
+==============================================================
+
+kptr_restrict:
+
+This toggle indicates whether restrictions are placed on
+exposing kernel addresses via /proc and other interfaces. When
+kptr_restrict is set to (0), there are no restrictions. When
+kptr_restrict is set to (1), the default, kernel pointers
+printed using the %pK format specifier will be replaced with 0's
+unless the user has CAP_SYSLOG. When kptr_restrict is set to
+(2), kernel pointers printed using %pK will be replaced with 0's
+regardless of privileges.
+
+==============================================================
+
+kstack_depth_to_print: (X86 only)
+
+Controls the number of words to print when dumping the raw
+kernel stack.
+
+==============================================================
+
+modules_disabled:
+
+A toggle value indicating if modules are allowed to be loaded
+in an otherwise modular kernel. This toggle defaults to off
+(0), but can be set true (1). Once true, modules can be
+neither loaded nor unloaded, and the toggle cannot be set back
+to false.
+
+==============================================================
+
+osrelease, ostype & version:
+
+# cat osrelease
+2.1.88
+# cat ostype
+Linux
+# cat version
+#5 Wed Feb 25 21:49:24 MET 1998
+
+The files osrelease and ostype should be clear enough. Version
+needs a little more clarification however. The '#5' means that
+this is the fifth kernel built from this source base and the
+date behind it indicates the time the kernel was built.
+The only way to tune these values is to rebuild the kernel :-)
+
+==============================================================
+
+overflowgid & overflowuid:
+
+If your architecture did not always support 32-bit UIDs (i.e. arm, i386,
+m68k, sh, and sparc32), a fixed UID and GID will be returned to
+applications that use the old 16-bit UID/GID system calls, if the actual
+UID or GID would exceed 65535.
+
+These sysctls allow you to change the value of the fixed UID and GID.
+The default is 65534.
+
+==============================================================
+
+panic:
+
+The value in this file represents the number of seconds the
+kernel waits before rebooting on a panic. When you use the
+software watchdog, the recommended setting is 60.
+
+==============================================================
+
+panic_on_oops:
+
+Controls the kernel's behaviour when an oops or BUG is encountered.
+
+0: try to continue operation
+
+1: panic immediately. If the `panic' sysctl is also non-zero then the
+   machine will be rebooted.
+
+==============================================================
+
+pid_max:
+
+PID allocation wrap value. When the kernel's next PID value
+reaches this value, it wraps back to a minimum PID value.
+PIDs of value pid_max or larger are not allocated.
+
+==============================================================
+
+powersave-nap: (PPC only)
+
+If set, Linux-PPC will use the 'nap' mode of powersaving,
+otherwise the 'doze' mode will be used.
+
+==============================================================
+
+printk:
+
+The four values in printk denote: console_loglevel,
+default_message_loglevel, minimum_console_loglevel and
+default_console_loglevel respectively.
+
+These values influence printk() behavior when printing or
+logging error messages. See 'man 2 syslog' for more info on
+the different loglevels.
+
+- console_loglevel: messages with a higher priority than
+  this will be printed to the console
+- default_message_loglevel: messages without an explicit priority
+  will be printed with this priority
+- minimum_console_loglevel: minimum (highest) value to which
+  console_loglevel can be set
+- default_console_loglevel: default value for console_loglevel
+
+==============================================================
+
+printk_ratelimit:
+
+Some warning messages are rate limited. printk_ratelimit specifies
+the minimum length of time between these messages (in jiffies), by
+default we allow one every 5 seconds.
+
+A value of 0 will disable rate limiting.
+
+==============================================================
+
+printk_ratelimit_burst:
+
+While long term we enforce one message per printk_ratelimit
+seconds, we do allow a burst of messages to pass through.
+printk_ratelimit_burst specifies the number of messages we can
+send before ratelimiting kicks in.
+
+==============================================================
+
+printk_delay:
+
+Delay each printk message by printk_delay milliseconds
+
+Value from 0 - 10000 is allowed.
+
+==============================================================
+
+randomize_va_space:
+
+This option can be used to select the type of process address
+space randomization that is used in the system, for architectures
+that support this feature.
+
+0 - Turn the process address space randomization off. This is the
+    default for architectures that do not support this feature anyway,
+    and kernels that are booted with the "norandmaps" parameter.
+
+1 - Make the addresses of mmap base, stack and VDSO page randomized.
+    This, among other things, implies that shared libraries will be
+    loaded to random addresses. Also for PIE-linked binaries, the
+    location of code start is randomized. This is the default if the
+    CONFIG_COMPAT_BRK option is enabled.
+
+2 - Additionally enable heap randomization. This is the default if
+    CONFIG_COMPAT_BRK is disabled.
+
+    There are a few legacy applications out there (such as some ancient
+    versions of libc.so.5 from 1996) that assume that brk area starts
+    just after the end of the code+bss. These applications break when
+    start of the brk area is randomized. There are however no known
+    non-legacy applications that would be broken this way, so for most
+    systems it is safe to choose full randomization.
+
+    Systems with ancient and/or broken binaries should be configured
+    with CONFIG_COMPAT_BRK enabled, which excludes the heap from process
+    address space randomization.
+
+==============================================================
+
+reboot-cmd: (Sparc only)
+
+??? This seems to be a way to give an argument to the Sparc
+ROM/Flash boot loader. Maybe to tell it what to do after
+rebooting. ???
+
+==============================================================
+
+rtsig-max & rtsig-nr:
+
+The file rtsig-max can be used to tune the maximum number
+of POSIX realtime (queued) signals that can be outstanding
+in the system.
+
+rtsig-nr shows the number of RT signals currently queued.
+
+==============================================================
+
+sg-big-buff:
+
+This file shows the size of the generic SCSI (sg) buffer.
+You can't tune it just yet, but you could change it on
+compile time by editing include/scsi/sg.h and changing
+the value of SG_BIG_BUFF.
+
+There shouldn't be any reason to change this value. If
+you can come up with one, you probably know what you
+are doing anyway :)
+
+==============================================================
+
+shmmax:
+
+This value can be used to query and set the run time limit
+on the maximum shared memory segment size that can be created.
+Shared memory segments up to 1Gb are now supported in the
+kernel. This value defaults to SHMMAX.
+ +============================================================== + +softlockup_thresh: + +This value can be used to lower the softlockup tolerance threshold. The +default threshold is 60 seconds. If a cpu is locked up for 60 seconds, +the kernel complains. Valid values are 1-60 seconds. Setting this +tunable to zero will disable the softlockup detection altogether. + +============================================================== + +tainted: + +Non-zero if the kernel has been tainted. Numeric values, which +can be ORed together: + + 1 - A module with a non-GPL license has been loaded, this + includes modules with no license. + Set by modutils >= 2.4.9 and module-init-tools. + 2 - A module was force loaded by insmod -f. + Set by modutils >= 2.4.9 and module-init-tools. + 4 - Unsafe SMP processors: SMP with CPUs not designed for SMP. + 8 - A module was forcibly unloaded from the system by rmmod -f. + 16 - A hardware machine check error occurred on the system. + 32 - A bad page was discovered on the system. + 64 - The user has asked that the system be marked "tainted". This + could be because they are running software that directly modifies + the hardware, or for other reasons. + 128 - The system has died. + 256 - The ACPI DSDT has been overridden with one supplied by the user + instead of using the one provided by the hardware. + 512 - A kernel warning has occurred. +1024 - A module from drivers/staging was loaded. + +============================================================== + +auto_msgmni: + +Enables/Disables automatic recomputing of msgmni upon memory add/remove or +upon ipc namespace creation/removal (see the msgmni description above). +Echoing "1" into this file enables msgmni automatic recomputing. +Echoing "0" turns it off. +auto_msgmni default value is 1. + +============================================================== + +nmi_watchdog: + +Enables/Disables the NMI watchdog on x86 systems. 
When the value is non-zero +the NMI watchdog is enabled and will continuously test all online cpus to +determine whether or not they are still functioning properly. Currently, +passing the "nmi_watchdog=" parameter at boot time is required for this function +to work. + +If the LAPIC NMI watchdog method is in use (nmi_watchdog=2 kernel parameter), the +NMI watchdog shares registers with oprofile. By disabling the NMI watchdog, +oprofile may have more registers to utilize. + +============================================================== + +unknown_nmi_panic: + +The value in this file affects how unknown NMIs are handled. When the value is +non-zero, an unknown NMI is trapped and then a panic occurs. At that time, kernel +debugging information is displayed on the console. + +For example, the NMI switch found on most IA32 servers raises an unknown NMI. +If a system hangs, try pressing the NMI switch. + +============================================================== + +panic_on_unrecovered_nmi: + +The default Linux behaviour on an NMI of either memory or unknown is to continue +operation. For many environments such as scientific computing it is preferable +that the box is taken out and the error dealt with than that an uncorrected +parity/ECC error be propagated. + +A small number of systems do generate NMIs for bizarre random reasons such as +power management, so the default is off. This sysctl works like the existing +panic controls already in that directory. + diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt new file mode 100644 index 00000000..3201a709 --- /dev/null +++ b/Documentation/sysctl/net.txt @@ -0,0 +1,196 @@ +Documentation for /proc/sys/net/* kernel version 2.4.0-test11-pre4 + (c) 1999 Terrehon Bowden + Bodo Bauer + (c) 2000 Jorge Nerin + (c) 2009 Shen Feng + +For general info and legal blurb, please look in README.
+ +============================================================== + +This file contains the documentation for the sysctl files in +/proc/sys/net and is valid for Linux kernel version 2.4.0-test11-pre4. + +The interface to the networking parts of the kernel is located in +/proc/sys/net. The following table shows all possible subdirectories. You may +see only some of them, depending on your kernel's configuration. + +
+Table: Subdirectories in /proc/sys/net
+..............................................................................
+ Directory Content             Directory  Content
+ core      General parameter   appletalk  Appletalk protocol
+ unix      Unix domain sockets netrom     NET/ROM
+ 802       E802 protocol       ax25       AX25
+ ethernet  Ethernet protocol   rose       X.25 PLP layer
+ ipv4      IP version 4        x25        X.25 protocol
+ ipx       IPX                 token-ring IBM token ring
+ bridge    Bridging            decnet     DEC net
+ ipv6      IP version 6
+..............................................................................
+ +1. /proc/sys/net/core - Network core options +------------------------------------------------------- + +bpf_jit_enable +-------------- + +This enables the Berkeley Packet Filter Just in Time compiler. +Currently supported on the x86_64 architecture, bpf_jit provides a framework +to speed up packet filtering, the one used by tcpdump/libpcap for example. +Values : + 0 - disable the JIT (default value) + 1 - enable the JIT + 2 - enable the JIT and ask the compiler to emit traces on kernel log. + +rmem_default +------------ + +The default setting of the socket receive buffer in bytes. + +rmem_max +-------- + +The maximum receive socket buffer size in bytes. + +wmem_default +------------ + +The default setting (in bytes) of the socket send buffer. + +wmem_max +-------- + +The maximum send socket buffer size in bytes. + +message_burst and message_cost +------------------------------ + +These parameters are used to limit the warning messages written to the kernel +log from the networking code.
They enforce a rate limit to make a +denial-of-service attack impossible. A higher message_cost factor, results in +fewer messages that will be written. Message_burst controls when messages will +be dropped. The default settings limit warning messages to one every five +seconds. + +warnings +-------- + +This controls console messages from the networking stack that can occur because +of problems on the network like duplicate address or bad checksums. Normally, +this should be enabled, but if the problem persists the messages can be +disabled. + +netdev_budget +------------- + +Maximum number of packets taken from all interfaces in one polling cycle (NAPI +poll). In one polling cycle interfaces which are registered to polling are +probed in a round-robin manner. The limit of packets in one such probe can be +set per-device via sysfs class/net//weight . + +netdev_max_backlog +------------------ + +Maximum number of packets, queued on the INPUT side, when the interface +receives packets faster than kernel can process them. + +netdev_tstamp_prequeue +---------------------- + +If set to 0, RX packet timestamps can be sampled after RPS processing, when +the target CPU processes packets. It might give some delay on timestamps, but +permit to distribute the load on several cpus. + +If set to 1 (default), timestamps are sampled as soon as possible, before +queueing. + +optmem_max +---------- + +Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence +of struct cmsghdr structures with appended data. + +2. /proc/sys/net/unix - Parameters for Unix domain sockets +------------------------------------------------------- + +There is only one file in this directory. +unix_dgram_qlen limits the max number of datagrams queued in Unix domain +socket's buffer. It will not take effect unless PF_UNIX flag is specified. + + +3. 
/proc/sys/net/ipv4 - IPV4 settings +------------------------------------------------------- +Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for +descriptions of these entries. + + +4. Appletalk +------------------------------------------------------- + +The /proc/sys/net/appletalk directory holds the Appletalk configuration data +when Appletalk is loaded. The configurable parameters are: + +aarp-expiry-time +---------------- + +The amount of time we keep an ARP entry before expiring it. Used to age out +old hosts. + +aarp-resolve-time +----------------- + +The amount of time we will spend trying to resolve an Appletalk address. + +aarp-retransmit-limit +--------------------- + +The number of times we will retransmit a query before giving up. + +aarp-tick-time +-------------- + +Controls the rate at which expirations are checked. + +The directory /proc/net/appletalk holds the list of active Appletalk sockets +on a machine. + +The fields indicate the DDP type, the local address (in network:node format), +the remote address, the size of the transmit pending queue, the size of the +receive queue (bytes waiting for applications to read), the state and the uid +owning the socket. + +/proc/net/atalk_iface lists all the interfaces configured for Appletalk. It +shows the name of the interface, its Appletalk address, the network range on +that address (or network number for phase 1 networks), and the status of the +interface. + +/proc/net/atalk_route lists each known network route. It lists the target +(network) that the route leads to, the router (may be directly connected), the +route flags, and the device the route is using. + + +5. IPX +------------------------------------------------------- + +The IPX protocol has no tunable values in /proc/sys/net. + +The IPX protocol does, however, provide /proc/net/ipx. This lists each IPX +socket giving the local and remote addresses in Novell format (that is +network:node:port).
In accordance with the strange Novell tradition, +everything but the port is in hex. Not_Connected is displayed for sockets that +are not tied to a specific remote address. The Tx and Rx queue sizes indicate +the number of bytes pending for transmission and reception. The state +indicates the state the socket is in and the uid is the owning uid of the +socket. + +The /proc/net/ipx_interface file lists all IPX interfaces. For each interface +it gives the network number, the node number, and indicates if the network is +the primary network. It also indicates which device it is bound to (or +Internal for internal networks) and the Frame Type if appropriate. Linux +supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for +IPX. + +The /proc/net/ipx_route table holds a list of IPX routes. For each route it +gives the destination network, the router node (or Directly) and the network +address of the router (or Connected) for internal networks. diff --git a/Documentation/sysctl/sunrpc.txt b/Documentation/sysctl/sunrpc.txt new file mode 100644 index 00000000..ae1ecac6 --- /dev/null +++ b/Documentation/sysctl/sunrpc.txt @@ -0,0 +1,20 @@ +Documentation for /proc/sys/sunrpc/* kernel version 2.2.10 + (c) 1998, 1999, Rik van Riel + +For general info and legal blurb, please look in README. + +============================================================== + +This file contains the documentation for the sysctl files in +/proc/sys/sunrpc and is valid for Linux kernel version 2.2. + +The files in this directory can be used to (re)set the debug +flags of the SUN Remote Procedure Call (RPC) subsystem in +the Linux kernel. This stuff is used for NFS, KNFSD and +maybe a few other things as well. + +The files in there are used to control the debugging flags: +rpc_debug, nfs_debug, nfsd_debug and nlm_debug. + +These flags are for kernel hackers only. You should read the +source code in net/sunrpc/ for more information. 
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt new file mode 100644 index 00000000..96f0ee82 --- /dev/null +++ b/Documentation/sysctl/vm.txt @@ -0,0 +1,701 @@ +Documentation for /proc/sys/vm/* kernel version 2.6.29 + (c) 1998, 1999, Rik van Riel + (c) 2008 Peter W. Morreale + +For general info and legal blurb, please look in README. + +============================================================== + +This file contains the documentation for the sysctl files in +/proc/sys/vm and is valid for Linux kernel version 2.6.29. + +The files in this directory can be used to tune the operation +of the virtual memory (VM) subsystem of the Linux kernel and +the writeout of dirty data to disk. + +Default values and initialization routines for most of these +files can be found in mm/swap.c. + +Currently, these files are in /proc/sys/vm: + +- block_dump +- compact_memory +- dirty_background_bytes +- dirty_background_ratio +- dirty_bytes +- dirty_expire_centisecs +- dirty_ratio +- dirty_writeback_centisecs +- drop_caches +- extfrag_threshold +- hugepages_treat_as_movable +- hugetlb_shm_group +- laptop_mode +- legacy_va_layout +- lowmem_reserve_ratio +- max_map_count +- memory_failure_early_kill +- memory_failure_recovery +- min_free_kbytes +- min_slab_ratio +- min_unmapped_ratio +- mmap_min_addr +- nr_hugepages +- nr_overcommit_hugepages +- nr_pdflush_threads +- nr_trim_pages (only if CONFIG_MMU=n) +- numa_zonelist_order +- oom_dump_tasks +- oom_kill_allocating_task +- overcommit_memory +- overcommit_ratio +- page-cluster +- panic_on_oom +- percpu_pagelist_fraction +- stat_interval +- swappiness +- vfs_cache_pressure +- zone_reclaim_mode + +============================================================== + +block_dump + +block_dump enables block I/O debugging when set to a nonzero value. More +information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. 
+ +============================================================== + +compact_memory + +Available only when CONFIG_COMPACTION is set. When 1 is written to the file, +all zones are compacted such that free memory is available in contiguous +blocks where possible. This can be important for example in the allocation of +huge pages although processes will also directly compact memory as required. + +============================================================== + +dirty_background_bytes + +Contains the amount of dirty memory at which the pdflush background writeback +daemon will start writeback. + +Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only +one of them may be specified at a time. When one sysctl is written it is +immediately taken into account to evaluate the dirty memory limits and the +other appears as 0 when read. + +============================================================== + +dirty_background_ratio + +Contains, as a percentage of total system memory, the number of pages at which +the pdflush background writeback daemon will start writing out dirty data. + +============================================================== + +dirty_bytes + +Contains the amount of dirty memory at which a process generating disk writes +will itself start writeback. + +Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be +specified at a time. When one sysctl is written it is immediately taken into +account to evaluate the dirty memory limits and the other appears as 0 when +read. + +Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any +value lower than this limit will be ignored and the old configuration will be +retained. + +============================================================== + +dirty_expire_centisecs + +This tunable is used to define when dirty data is old enough to be eligible +for writeout by the pdflush daemons. It is expressed in 100'ths of a second. 
+Data which has been dirty in-memory for longer than this interval will be +written out next time a pdflush daemon wakes up. + +============================================================== + +dirty_ratio + +Contains, as a percentage of total system memory, the number of pages at which +a process which is generating disk writes will itself start writing out dirty +data. + +============================================================== + +dirty_writeback_centisecs + +The pdflush writeback daemons will periodically wake up and write `old' data +out to disk. This tunable expresses the interval between those wakeups, in +100'ths of a second. + +Setting this to zero disables periodic writeback altogether. + +============================================================== + +drop_caches + +Writing to this will cause the kernel to drop clean caches, dentries and +inodes from memory, causing that memory to become free. + +To free pagecache: + echo 1 > /proc/sys/vm/drop_caches +To free dentries and inodes: + echo 2 > /proc/sys/vm/drop_caches +To free pagecache, dentries and inodes: + echo 3 > /proc/sys/vm/drop_caches + +As this is a non-destructive operation and dirty objects are not freeable, the +user should run `sync' first. + +============================================================== + +extfrag_threshold + +This parameter affects whether the kernel will compact memory or direct +reclaim to satisfy a high-order allocation. /proc/extfrag_index shows what +the fragmentation index for each order is in each zone in the system. Values +tending towards 0 imply allocations would fail due to lack of memory, +values towards 1000 imply failures are due to fragmentation and -1 implies +that the allocation will succeed as long as watermarks are met. + +The kernel will not compact memory in a zone if the +fragmentation index is <= extfrag_threshold. The default value is 500. 
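The extfrag_threshold rule above can be sketched in a few lines of Python. This is only an illustration of the comparison as the text states it, not the kernel's code; the function name compaction_allowed is hypothetical, and the index is assumed to be the integer in the range -1..1000 described above.

```python
def compaction_allowed(frag_index, extfrag_threshold=500):
    """Return True if the zone may be compacted under the documented rule.

    frag_index: fragmentation index for the requested order in one zone.
      -1 means the allocation should succeed if watermarks are met,
      values near 0 point at lack of memory, values near 1000 at
      fragmentation.
    extfrag_threshold: the sysctl value (500 is the documented default).
    """
    # "The kernel will not compact memory in a zone if the fragmentation
    # index is <= extfrag_threshold", so only a strictly larger index
    # allows compaction. An index of -1 never qualifies.
    return frag_index > extfrag_threshold
```

With the default threshold, an index of 900 (failure mostly due to fragmentation) allows compaction, while 200 (failure mostly due to lack of memory) leaves the allocation to direct reclaim. Raising the threshold makes compaction less likely.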
+ +============================================================== + +hugepages_treat_as_movable + +This parameter is only useful when kernelcore= is specified at boot time to +create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages +are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero +value written to hugepages_treat_as_movable allows huge pages to be allocated +from ZONE_MOVABLE. + +Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge +pages pool can easily grow or shrink within. Assuming that applications are +not running that mlock() a lot of memory, it is likely the huge pages pool +can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value +into nr_hugepages and triggering page reclaim. + +============================================================== + +hugetlb_shm_group + +hugetlb_shm_group contains group id that is allowed to create SysV +shared memory segment using hugetlb page. + +============================================================== + +laptop_mode + +laptop_mode is a knob that controls "laptop mode". All the things that are +controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. + +============================================================== + +legacy_va_layout + +If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel +will use the legacy (2.4) layout for all processes. + +============================================================== + +lowmem_reserve_ratio + +For some specialised workloads on highmem machines it is dangerous for +the kernel to allow process memory to be allocated from the "lowmem" +zone. This is because that memory could then be pinned via the mlock() +system call, or by unavailability of swapspace. + +And on large highmem machines this lack of reclaimable lowmem memory +can be fatal. 
+ +So the Linux page allocator has a mechanism which prevents allocations +which _could_ use highmem from using too much lowmem. This means that +a certain amount of lowmem is defended from the possibility of being +captured into pinned user memory. + +(The same argument applies to the old 16 megabyte ISA DMA region. This +mechanism will also defend that region from allocations which could use +highmem or lowmem). + +The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is +in defending these lower zones. + +If you have a machine which uses highmem or ISA DMA and your +applications are using mlock(), or if you are running with no swap then +you probably should change the lowmem_reserve_ratio setting. + +The lowmem_reserve_ratio is an array. You can see its elements by reading this file. +- +% cat /proc/sys/vm/lowmem_reserve_ratio +256 256 32 +- +Note: the number of elements is one fewer than the number of zones, because the highest + zone's value is not necessary for the following calculation. + +However, these values are not used directly. The kernel calculates the number of protection +pages for each zone from them. These are shown as an array of protection pages +in /proc/zoneinfo as follows. (This is an example from an x86-64 box.) +Each zone has an array of protection pages like this. + +-
+Node 0, zone      DMA
+  pages free     1355
+        min      3
+        low      3
+        high     4
+        :
+        :
+    numa_other   0
+        protection: (0, 2004, 2004, 2004)
+        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  pagesets
+    cpu: 0 pcp: 0
+        :
+- +These protections are added to the watermark to judge whether this zone should be used +for page allocation or should be reclaimed. + +In this example, if normal pages (index=2) are requested from this DMA zone and +watermark[WMARK_HIGH] is used as the watermark, the kernel judges that this zone should +not be used because pages_free(1355) is smaller than watermark + protection[2] +(4 + 2004 = 2008). If this protection value were 0, this zone would be used for +the normal page request.
If the request is for the DMA zone (index=0), protection[0] +(=0) is used. + +zone[i]'s protection[j] is calculated by the following expression. +
+(i < j):
+  zone[i]->protection[j]
+  = (total sums of present_pages from zone[i+1] to zone[j] on the node)
+    / lowmem_reserve_ratio[i];
+(i = j):
+   (should not be protected) = 0;
+(i > j):
+   (not necessary, but reads as 0)
+ +The default values of lowmem_reserve_ratio[i] are + 256 (if zone[i] means DMA or DMA32 zone) + 32 (others). +As the expression above shows, they are the reciprocal of the ratio. +256 means 1/256. The number of protection pages becomes about "0.39%" of total present +pages of higher zones on the node. + +If you would like to protect more pages, smaller values are effective. +The minimum value is 1 (1/1 -> 100%). + +============================================================== + +max_map_count: + +This file contains the maximum number of memory map areas a process +may have. Memory map areas are used as a side-effect of calling +malloc, directly by mmap and mprotect, and also when loading shared +libraries. + +While most applications need less than a thousand maps, certain +programs, particularly malloc debuggers, may consume lots of them, +e.g., up to one or two maps per allocation. + +The default value is 65536. + +============================================================= + +memory_failure_early_kill: + +Control how to kill processes when an uncorrected memory error (typically +a 2-bit error in a memory module) that cannot be handled by the kernel +is detected in the background by hardware. In some cases (like the page +still having a valid copy on disk) the kernel will handle the failure +transparently without affecting any applications. But if there is +no other up-to-date copy of the data it will kill processes to prevent any data +corruption from propagating. + +1: Kill all processes that have the corrupted and not reloadable page mapped +as soon as the corruption is detected.
Note this is not supported +for a few types of pages, like kernel internally allocated data or +the swap cache, but works for the majority of user pages. + +0: Only unmap the corrupted page from all processes and only kill a process +that tries to access it. + +The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can +handle this if they want to. + +This is only active on architectures/platforms with advanced machine +check handling and depends on the hardware capabilities. + +Applications can override this setting individually with the PR_MCE_KILL prctl. + +============================================================== + +memory_failure_recovery + +Enable memory failure recovery (when supported by the platform). + +1: Attempt recovery. + +0: Always panic on a memory failure. + +============================================================== + +min_free_kbytes: + +This is used to force the Linux VM to keep a minimum number +of kilobytes free. The VM uses this number to compute a +watermark[WMARK_MIN] value for each lowmem zone in the system. +Each lowmem zone gets a number of reserved free pages based +proportionally on its size. + +Some minimal amount of memory is needed to satisfy PF_MEMALLOC +allocations; if you set this to lower than 1024KB, your system will +become subtly broken, and prone to deadlock under high loads. + +Setting this too high will OOM your machine instantly. + +============================================================= + +min_slab_ratio: + +This is available only on NUMA kernels. + +A percentage of the total pages in each zone. On zone reclaim +(fallback from the local zone occurs) slabs will be reclaimed if more +than this percentage of pages in a zone are reclaimable slab pages. +This ensures that the slab growth stays under control even in NUMA +systems that rarely perform global reclaim. + +The default is 5 percent. + +Note that slab reclaim is triggered in a per zone / node fashion.
+The process of reclaiming slab memory is currently not node specific +and may not be fast. + +============================================================= + +min_unmapped_ratio: + +This is available only on NUMA kernels. + +This is a percentage of the total pages in each zone. Zone reclaim will +only occur if more than this percentage of pages are in a state that +zone_reclaim_mode allows to be reclaimed. + +If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared +against all file-backed unmapped pages including swapcache pages and tmpfs +files. Otherwise, only unmapped pages backed by normal files but not tmpfs +files and similar are considered. + +The default is 1 percent. + +============================================================== + +mmap_min_addr + +This file indicates the amount of address space which a user process will +be restricted from mmapping. Since kernel null dereference bugs could +accidentally operate based on the information in the first couple of pages +of memory userspace processes should not be allowed to write to them. By +default this value is set to 0 and no protections will be enforced by the +security module. Setting this value to something like 64k will allow the +vast majority of applications to work correctly and provide defense in depth +against future potential kernel bugs. + +============================================================== + +nr_hugepages + +Change the minimum size of the hugepage pool. + +See Documentation/vm/hugetlbpage.txt + +============================================================== + +nr_overcommit_hugepages + +Change the maximum size of the hugepage pool. The maximum is +nr_hugepages + nr_overcommit_hugepages. + +See Documentation/vm/hugetlbpage.txt + +============================================================== + +nr_pdflush_threads + +The current number of pdflush threads. This value is read-only. +The value changes according to the number of dirty pages in the system. 
+ +When necessary, additional pdflush threads are created, one per second, up to +nr_pdflush_threads_max. + +============================================================== + +nr_trim_pages + +This is available only on NOMMU kernels. + +This value adjusts the excess page trimming behaviour of power-of-2 aligned +NOMMU mmap allocations. + +A value of 0 disables trimming of allocations entirely, while a value of 1 +trims excess pages aggressively. Any value >= 1 acts as the watermark where +trimming of allocations is initiated. + +The default value is 1. + +See Documentation/nommu-mmap.txt for more information. + +============================================================== + +numa_zonelist_order + +This sysctl is only for NUMA. +Where memory is allocated from is controlled by zonelists. +(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simplicity; + you may be able to read ZONE_DMA as ZONE_DMA32...) + +In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows. +ZONE_NORMAL -> ZONE_DMA +This means that a memory allocation request for GFP_KERNEL will +get memory from ZONE_DMA only when ZONE_NORMAL is not available. + +In the NUMA case, there are two possible orders. +Assume a 2-node NUMA system; below is the GFP_KERNEL zonelist of Node(0): + +(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL +(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. + +Type(A) offers the best locality for processes on Node(0), but ZONE_DMA +will be used before ZONE_NORMAL is exhausted. This increases the possibility of +out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small. + +Type(B) cannot offer the best locality but is more robust against OOM of +the DMA zone. + +Type(A) is called "Node" order. Type(B) is called "Zone" order. + +"Node order" orders the zonelists by node, then by zone within each node. +Specify "[Nn]ode" for node order. + +"Zone Order" orders the zonelists by zone type, then by node within each +zone.
Specify "[Zz]one" for zone order. + +Specify "[Dd]efault" to request automatic configuration. Autoconfiguration +will select "node" order in following case. +(1) if the DMA zone does not exist or +(2) if the DMA zone comprises greater than 50% of the available memory or +(3) if any node's DMA zone comprises greater than 60% of its local memory and + the amount of local memory is big enough. + +Otherwise, "zone" order will be selected. Default order is recommended unless +this is causing problems for your system/application. + +============================================================== + +oom_dump_tasks + +Enables a system-wide task dump (excluding kernel threads) to be +produced when the kernel performs an OOM-killing and includes such +information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and +name. This is helpful to determine why the OOM killer was invoked +and to identify the rogue task that caused it. + +If this is set to zero, this information is suppressed. On very +large systems with thousands of tasks it may not be feasible to dump +the memory state information for each one. Such systems should not +be forced to incur a performance penalty in OOM conditions when the +information may not be desired. + +If this is set to non-zero, this information is shown whenever the +OOM killer actually kills a memory-hogging task. + +The default value is 1 (enabled). + +============================================================== + +oom_kill_allocating_task + +This enables or disables killing the OOM-triggering task in +out-of-memory situations. + +If this is set to zero, the OOM killer will scan through the entire +tasklist and select a task based on heuristics to kill. This normally +selects a rogue memory-hogging task that frees up a large amount of +memory when killed. + +If this is set to non-zero, the OOM killer simply kills the task that +triggered the out-of-memory condition. This avoids the expensive +tasklist scan. 
+ +If panic_on_oom is selected, it takes precedence over whatever value +is used in oom_kill_allocating_task. + +The default value is 0. + +============================================================== + +overcommit_memory: + +This value contains a flag that enables memory overcommitment. + +When this flag is 0, the kernel attempts to estimate the amount +of free memory left when userspace requests more memory. + +When this flag is 1, the kernel pretends there is always enough +memory until it actually runs out. + +When this flag is 2, the kernel uses a "never overcommit" +policy that attempts to prevent any overcommit of memory. + +This feature can be very useful because there are a lot of +programs that malloc() huge amounts of memory "just-in-case" +and don't use much of it. + +The default value is 0. + +See Documentation/vm/overcommit-accounting and +security/commoncap.c::cap_vm_enough_memory() for more information. + +============================================================== + +overcommit_ratio: + +When overcommit_memory is set to 2, the committed address +space is not permitted to exceed swap plus this percentage +of physical RAM. See above. + +============================================================== + +page-cluster + +page-cluster controls the number of pages which are written to swap in +a single attempt. The swap I/O size. + +It is a logarithmic value - setting it to zero means "1 page", setting +it to 1 means "2 pages", setting it to 2 means "4 pages", etc. + +The default value is three (eight pages at a time). There may be some +small benefits in tuning this to a different value if your workload is +swap-intensive. + +============================================================= + +panic_on_oom + +This enables or disables panic on out-of-memory feature. + +If this is set to 0, the kernel will kill some rogue process, +called oom_killer. Usually, oom_killer can kill rogue processes and +system will survive. 
+ +If this is set to 1, the kernel panics when out-of-memory happens. +However, if a process restricts the nodes it uses via mempolicy/cpusets, +and those nodes run out of memory, one process +may be killed by the oom-killer. No panic occurs in this case, +because other nodes' memory may still be free, which means the system as a whole +may not yet be in a fatal state. + +If this is set to 2, the kernel always panics, even in the +case described above. Even if the oom happens under a memory cgroup, the whole +system panics. + +The default value is 0. +1 and 2 are for failover of clustering. Please select either +according to your policy of failover. +panic_on_oom=2+kdump gives you a very strong tool to investigate +why the oom happened, since you can get a snapshot. + +============================================================= + +percpu_pagelist_fraction + +This is the fraction of pages at most (high mark pcp->high) in each zone that +are allocated for each per cpu page list. The min value for this is 8. It +means that we don't allow more than 1/8th of pages in each zone to be +allocated in any single per_cpu_pagelist. This entry only changes the value +of hot per cpu pagelists. The user can specify a number like 100 to allocate +1/100th of each zone to each per cpu page list. + +The batch value of each per cpu pagelist is also updated as a result. It is +set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8). + +The initial value is zero. The kernel does not use this value at boot time to set +the high water marks for each per cpu page list. + +============================================================== + +stat_interval + +The time interval between which vm statistics are updated. The default +is 1 second. + +============================================================== + +swappiness + +This control is used to define how aggressively the kernel will swap +memory pages. Higher values will increase aggressiveness, lower values +decrease the amount of swap. + +The default value is 60.
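Returning to percpu_pagelist_fraction, described further above: the arithmetic it implies can be worked through with a short Python sketch. This follows the wording of the text only (high is the zone's pages divided by the fraction, batch is high/4 capped at PAGE_SHIFT * 8). The function name pcp_limits is hypothetical, and the use of integer division and PAGE_SHIFT = 12 (4 KiB pages) are assumptions made for the example.

```python
def pcp_limits(zone_pages, fraction, page_shift=12):
    """Return (high, batch) for one per cpu page list, per the text above.

    zone_pages: number of pages in the zone.
    fraction:   value written to percpu_pagelist_fraction (minimum 8).
    page_shift: assumed PAGE_SHIFT; 12 corresponds to 4 KiB pages.
    """
    if fraction < 8:
        # The documented minimum: no more than 1/8th of a zone per list.
        raise ValueError("percpu_pagelist_fraction must be at least 8")
    high = zone_pages // fraction            # pcp->high
    batch = min(high // 4, page_shift * 8)   # pcp->high/4, capped
    return high, batch
```

For a zone of 1,000,000 pages and a fraction of 100, high comes out at 10,000 pages, and batch is capped at 96 rather than 2,500. For small zones the cap does not apply and batch is simply a quarter of high.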
+ +============================================================== + +vfs_cache_pressure +------------------ + +Controls the tendency of the kernel to reclaim the memory which is used for +caching of directory and inode objects. + +At the default value of vfs_cache_pressure=100 the kernel will attempt to +reclaim dentries and inodes at a "fair" rate with respect to pagecache and +swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer +to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will +never reclaim dentries and inodes due to memory pressure and this can easily +lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 +causes the kernel to prefer to reclaim dentries and inodes. + +============================================================== + +zone_reclaim_mode: + +Zone_reclaim_mode allows someone to set more or less aggressive approaches to +reclaim memory when a zone runs out of memory. If it is set to zero then no +zone reclaim occurs. Allocations will be satisfied from other zones / nodes +in the system. + +This value is ORed together from + +1 = Zone reclaim on +2 = Zone reclaim writes dirty pages out +4 = Zone reclaim swaps pages + +zone_reclaim_mode is set during bootup to 1 if it is determined that pages +from remote zones will cause a measurable performance reduction. The +page allocator will then reclaim easily reusable pages (those page +cache pages that are currently not used) before allocating off node pages. + +It may be beneficial to switch off zone reclaim if the system is +used for a file server and all of memory should be used for caching files +from disk. In that case the caching effect is more important than +data locality. + +Allowing zone reclaim to write out pages stops processes that are +writing large amounts of data from dirtying pages on other nodes. Zone +reclaim will write out dirty pages if a zone fills up and so effectively +throttle the process.
This may decrease the performance of a single process +since it cannot use all of system memory to buffer the outgoing writes +anymore but it preserves the memory on other nodes so that the performance +of other processes running on other nodes will not be affected. + +Allowing regular swap effectively restricts allocations to the local +node unless explicitly overridden by memory policies or cpuset +configurations. + +============ End of Document ================================= -- cgit v1.2.3