Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--  Documentation/filesystems/00-INDEX | 120
-rw-r--r--  Documentation/filesystems/9p.txt | 166
-rw-r--r--  Documentation/filesystems/Locking | 534
-rw-r--r--  Documentation/filesystems/Makefile | 8
-rw-r--r--  Documentation/filesystems/adfs.txt | 75
-rw-r--r--  Documentation/filesystems/affs.txt | 219
-rw-r--r--  Documentation/filesystems/afs.txt | 247
-rw-r--r--  Documentation/filesystems/autofs4-mount-control.txt | 393
-rw-r--r--  Documentation/filesystems/automount-support.txt | 118
-rw-r--r--  Documentation/filesystems/befs.txt | 117
-rw-r--r--  Documentation/filesystems/bfs.txt | 57
-rw-r--r--  Documentation/filesystems/btrfs.txt | 91
-rw-r--r--  Documentation/filesystems/caching/backend-api.txt | 658
-rw-r--r--  Documentation/filesystems/caching/cachefiles.txt | 501
-rw-r--r--  Documentation/filesystems/caching/fscache.txt | 443
-rw-r--r--  Documentation/filesystems/caching/netfs-api.txt | 813
-rw-r--r--  Documentation/filesystems/caching/object.txt | 313
-rw-r--r--  Documentation/filesystems/caching/operations.txt | 213
-rw-r--r--  Documentation/filesystems/ceph.txt | 140
-rw-r--r--  Documentation/filesystems/cifs.txt | 51
-rw-r--r--  Documentation/filesystems/coda.txt | 1673
-rw-r--r--  Documentation/filesystems/configfs/Makefile | 3
-rw-r--r--  Documentation/filesystems/configfs/configfs.txt | 484
-rw-r--r--  Documentation/filesystems/configfs/configfs_example_explicit.c | 483
-rw-r--r--  Documentation/filesystems/configfs/configfs_example_macros.c | 446
-rw-r--r--  Documentation/filesystems/cramfs.txt | 76
-rw-r--r--  Documentation/filesystems/debugfs.txt | 158
-rw-r--r--  Documentation/filesystems/devpts.txt | 132
-rw-r--r--  Documentation/filesystems/directory-locking | 114
-rw-r--r--  Documentation/filesystems/dlmfs.txt | 130
-rw-r--r--  Documentation/filesystems/dnotify.txt | 70
-rw-r--r--  Documentation/filesystems/dnotify_test.c | 34
-rw-r--r--  Documentation/filesystems/ecryptfs.txt | 77
-rw-r--r--  Documentation/filesystems/exofs.txt | 185
-rw-r--r--  Documentation/filesystems/ext2.txt | 383
-rw-r--r--  Documentation/filesystems/ext3.txt | 231
-rw-r--r--  Documentation/filesystems/ext4.txt | 615
-rw-r--r--  Documentation/filesystems/fiemap.txt | 228
-rw-r--r--  Documentation/filesystems/files.txt | 123
-rw-r--r--  Documentation/filesystems/fuse.txt | 423
-rw-r--r--  Documentation/filesystems/gfs2-glocks.txt | 114
-rw-r--r--  Documentation/filesystems/gfs2-uevents.txt | 100
-rw-r--r--  Documentation/filesystems/gfs2.txt | 46
-rw-r--r--  Documentation/filesystems/hfs.txt | 83
-rw-r--r--  Documentation/filesystems/hfsplus.txt | 59
-rw-r--r--  Documentation/filesystems/hpfs.txt | 296
-rw-r--r--  Documentation/filesystems/inotify.txt | 269
-rw-r--r--  Documentation/filesystems/isofs.txt | 48
-rw-r--r--  Documentation/filesystems/jfs.txt | 41
-rw-r--r--  Documentation/filesystems/locks.txt | 67
-rw-r--r--  Documentation/filesystems/logfs.txt | 241
-rw-r--r--  Documentation/filesystems/mandatory-locking.txt | 171
-rw-r--r--  Documentation/filesystems/ncpfs.txt | 12
-rw-r--r--  Documentation/filesystems/nfs/00-INDEX | 20
-rw-r--r--  Documentation/filesystems/nfs/Exporting | 147
-rw-r--r--  Documentation/filesystems/nfs/idmapper.txt | 67
-rw-r--r--  Documentation/filesystems/nfs/knfsd-stats.txt | 159
-rw-r--r--  Documentation/filesystems/nfs/nfs-rdma.txt | 271
-rw-r--r--  Documentation/filesystems/nfs/nfs.txt | 98
-rw-r--r--  Documentation/filesystems/nfs/nfs41-server.txt | 221
-rw-r--r--  Documentation/filesystems/nfs/nfsroot.txt | 294
-rw-r--r--  Documentation/filesystems/nfs/pnfs.txt | 55
-rw-r--r--  Documentation/filesystems/nfs/rpc-cache.txt | 202
-rw-r--r--  Documentation/filesystems/nilfs2.txt | 208
-rw-r--r--  Documentation/filesystems/ntfs.txt | 721
-rw-r--r--  Documentation/filesystems/ocfs2.txt | 102
-rw-r--r--  Documentation/filesystems/omfs.txt | 106
-rw-r--r--  Documentation/filesystems/path-lookup.txt | 382
-rw-r--r--  Documentation/filesystems/pohmelfs/design_notes.txt | 71
-rw-r--r--  Documentation/filesystems/pohmelfs/info.txt | 99
-rw-r--r--  Documentation/filesystems/pohmelfs/network_protocol.txt | 227
-rw-r--r--  Documentation/filesystems/porting | 409
-rw-r--r--  Documentation/filesystems/proc.txt | 1544
-rw-r--r--  Documentation/filesystems/quota.txt | 65
-rw-r--r--  Documentation/filesystems/ramfs-rootfs-initramfs.txt | 355
-rw-r--r--  Documentation/filesystems/relay.txt | 494
-rw-r--r--  Documentation/filesystems/romfs.txt | 186
-rw-r--r--  Documentation/filesystems/seq_file.txt | 292
-rw-r--r--  Documentation/filesystems/sharedsubtree.txt | 939
-rw-r--r--  Documentation/filesystems/spufs.txt | 521
-rw-r--r--  Documentation/filesystems/squashfs.txt | 257
-rw-r--r--  Documentation/filesystems/sysfs-pci.txt | 120
-rw-r--r--  Documentation/filesystems/sysfs-tagging.txt | 42
-rw-r--r--  Documentation/filesystems/sysfs.txt | 372
-rw-r--r--  Documentation/filesystems/sysv-fs.txt | 197
-rw-r--r--  Documentation/filesystems/tmpfs.txt | 148
-rw-r--r--  Documentation/filesystems/ubifs.txt | 147
-rw-r--r--  Documentation/filesystems/udf.txt | 82
-rw-r--r--  Documentation/filesystems/ufs.txt | 60
-rw-r--r--  Documentation/filesystems/vfat.txt | 295
-rw-r--r--  Documentation/filesystems/vfs.txt | 1097
-rw-r--r--  Documentation/filesystems/xfs-delayed-logging-design.txt | 793
-rw-r--r--  Documentation/filesystems/xfs.txt | 252
-rw-r--r--  Documentation/filesystems/xip.txt | 68
94 files changed, 25077 insertions, 0 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
new file mode 100644
index 00000000..8c624a18
--- /dev/null
+++ b/Documentation/filesystems/00-INDEX
@@ -0,0 +1,120 @@
+00-INDEX
+	- this file (info on some of the filesystems supported by Linux).
+Locking
+ - info on locking rules as they pertain to Linux VFS.
+9p.txt
+ - 9p (v9fs) is an implementation of the Plan 9 remote fs protocol.
+adfs.txt
+ - info and mount options for the Acorn Advanced Disc Filing System.
+afs.txt
+ - info and examples for the distributed AFS (Andrew File System) fs.
+affs.txt
+ - info and mount options for the Amiga Fast File System.
+automount-support.txt
+ - information about filesystem automount support.
+befs.txt
+ - information about the BeOS filesystem for Linux.
+bfs.txt
+ - info for the SCO UnixWare Boot Filesystem (BFS).
+ceph.txt
+ - info for the Ceph Distributed File System
+cifs.txt
+ - description of the CIFS filesystem.
+coda.txt
+ - description of the CODA filesystem.
+configfs/
+ - directory containing configfs documentation and example code.
+cramfs.txt
+ - info on the cram filesystem for small storage (ROMs etc).
+dentry-locking.txt
+ - info on the RCU-based dcache locking model.
+directory-locking
+ - info about the locking scheme used for directory operations.
+dlmfs.txt
+ - info on the userspace interface to the OCFS2 DLM.
+dnotify.txt
+ - info about directory notification in Linux.
+dnotify_test.c
+ - example program for dnotify
+ecryptfs.txt
+ - docs on eCryptfs: stacked cryptographic filesystem for Linux.
+exofs.txt
+ - info, usage, mount options, design about EXOFS.
+ext2.txt
+ - info, mount options and specifications for the Ext2 filesystem.
+ext3.txt
+ - info, mount options and specifications for the Ext3 filesystem.
+ext4.txt
+ - info, mount options and specifications for the Ext4 filesystem.
+files.txt
+ - info on file management in the Linux kernel.
+fuse.txt
+ - info on the Filesystem in User SpacE including mount options.
+gfs2.txt
+ - info on the Global File System 2.
+hfs.txt
+ - info on the Macintosh HFS Filesystem for Linux.
+hfsplus.txt
+ - info on the Macintosh HFSPlus Filesystem for Linux.
+hpfs.txt
+ - info and mount options for the OS/2 HPFS.
+inotify.txt
+ - info on the powerful yet simple file change notification system.
+isofs.txt
+ - info and mount options for the ISO 9660 (CDROM) filesystem.
+jfs.txt
+ - info and mount options for the JFS filesystem.
+locks.txt
+ - info on file locking implementations, flock() vs. fcntl(), etc.
+logfs.txt
+ - info on the LogFS flash filesystem.
+mandatory-locking.txt
+ - info on the Linux implementation of Sys V mandatory file locking.
+ncpfs.txt
+ - info on Novell Netware(tm) filesystem using NCP protocol.
+nfs/
+ - nfs-related documentation.
+nilfs2.txt
+ - info and mount options for the NILFS2 filesystem.
+ntfs.txt
+ - info and mount options for the NTFS filesystem (Windows NT).
+ocfs2.txt
+ - info and mount options for the OCFS2 clustered filesystem.
+porting
+ - various information on filesystem porting.
+proc.txt
+ - info on Linux's /proc filesystem.
+ramfs-rootfs-initramfs.txt
+ - info on the 'in memory' filesystems ramfs, rootfs and initramfs.
+reiser4.txt
+ - info on the Reiser4 filesystem based on dancing tree algorithms.
+relay.txt
+ - info on relay, for efficient streaming from kernel to user space.
+romfs.txt
+ - description of the ROMFS filesystem.
+seq_file.txt
+ - how to use the seq_file API
+sharedsubtree.txt
+ - a description of shared subtrees for namespaces.
+spufs.txt
+ - info and mount options for the SPU filesystem used on Cell.
+sysfs-pci.txt
+ - info on accessing PCI device resources through sysfs.
+sysfs.txt
+ - info on sysfs, a ram-based filesystem for exporting kernel objects.
+sysv-fs.txt
+ - info on the SystemV/V7/Xenix/Coherent filesystem.
+tmpfs.txt
+ - info on tmpfs, a filesystem that holds all files in virtual memory.
+udf.txt
+ - info and mount options for the UDF filesystem.
+ufs.txt
+ - info on the ufs filesystem.
+vfat.txt
+ - info on using the VFAT filesystem used in Windows NT and Windows 95
+vfs.txt
+ - overview of the Virtual File System
+xfs.txt
+ - info and mount options for the XFS filesystem.
+xip.txt
+ - info on execute-in-place for file mappings.
diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt
new file mode 100644
index 00000000..13de64c7
--- /dev/null
+++ b/Documentation/filesystems/9p.txt
@@ -0,0 +1,166 @@
+ v9fs: Plan 9 Resource Sharing for Linux
+ =======================================
+
+ABOUT
+=====
+
+v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
+
+This software was originally developed by Ron Minnich <rminnich@sandia.gov>
+and Maya Gokhale. Additional development by Greg Watson
+<gwatson@lanl.gov> and most recently Eric Van Hensbergen
+<ericvh@gmail.com>, Latchesar Ionkov <lucho@ionkov.net> and Russ Cox
+<rsc@swtch.com>.
+
+The best detailed explanation of the Linux implementation and applications of
+the 9p client is available in the form of a USENIX paper:
+ http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html
+
+Other applications are described in the following papers:
+ * XCPU & Clustering
+ http://xcpu.org/papers/xcpu-talk.pdf
+ * KVMFS: control file system for KVM
+ http://xcpu.org/papers/kvmfs.pdf
+ * CellFS: A New Programming Model for the Cell BE
+ http://xcpu.org/papers/cellfs-talk.pdf
+ * PROSE I/O: Using 9p to enable Application Partitions
+ http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
+ * VirtFS: A Virtualization Aware File System pass-through
+ http://goo.gl/3WPDg
+
+USAGE
+=====
+
+For remote file server:
+
+ mount -t 9p 10.10.1.2 /mnt/9
+
+For Plan 9 From User Space applications (http://swtch.com/plan9):
+
+ mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
+
+For server running on QEMU host with virtio transport:
+
+ mount -t 9p -o trans=virtio <mount_tag> /mnt/9
+
+where mount_tag is the tag associated by the server to each of the exported
+mount points. Each 9P export is seen by the client as a virtio device with an
+associated "mount_tag" property. Available mount tags can be
+seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
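+
+For example, assuming a single virtio 9p device, the tag can be read with
+(the "virtio0" instance name is illustrative):
+
+    cat /sys/bus/virtio/drivers/9pnet_virtio/virtio0/mount_tag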
+
+OPTIONS
+=======
+
+ trans=name select an alternative transport. Valid options are
+ currently:
+ unix - specifying a named pipe mount point
+ tcp - specifying a normal TCP/IP connection
+  			fd   - use passed file descriptors for connection
+ (see rfdno and wfdno)
+ virtio - connect to the next virtio channel available
+ (from QEMU with trans_virtio module)
+ rdma - connect to a specified RDMA channel
+
+ uname=name user name to attempt mount as on the remote server. The
+ server may override or ignore this value. Certain user
+ names may require authentication.
+
+ aname=name aname specifies the file tree to access when the server is
+ offering several exported file systems.
+
+ cache=mode specifies a caching policy. By default, no caches are used.
+ loose = no attempts are made at consistency,
+ intended for exclusive, read-only mounts
+ fscache = use FS-Cache for a persistent, read-only
+ cache backend.
+
+ debug=n specifies debug level. The debug level is a bitmask.
+ 0x01 = display verbose error messages
+ 0x02 = developer debug (DEBUG_CURRENT)
+ 0x04 = display 9p trace
+ 0x08 = display VFS trace
+ 0x10 = display Marshalling debug
+ 0x20 = display RPC debug
+ 0x40 = display transport debug
+ 0x80 = display allocation debug
+ 0x100 = display protocol message debug
+ 0x200 = display Fid debug
+ 0x400 = display packet debug
+ 0x800 = display fscache tracing debug
+
+ rfdno=n the file descriptor for reading with trans=fd
+
+ wfdno=n the file descriptor for writing with trans=fd
+
+ maxdata=n the number of bytes to use for 9p packet payload (msize)
+
+ port=n port to connect to on the remote server
+
+ noextend force legacy mode (no 9p2000.u or 9p2000.L semantics)
+
+ version=name Select 9P protocol version. Valid options are:
+ 9p2000 - Legacy mode (same as noextend)
+ 9p2000.u - Use 9P2000.u protocol
+ 9p2000.L - Use 9P2000.L protocol
+
+ dfltuid attempt to mount as a particular uid
+
+ dfltgid attempt to mount with a particular gid
+
+ afid security channel - used by Plan 9 authentication protocols
+
+ nodevmap do not map special files - represent them as normal files.
+ This can be used to share devices/named pipes/sockets between
+ hosts. This functionality will be expanded in later versions.
+
+ access there are four access modes.
+ user = if a user tries to access a file on v9fs
+ filesystem for the first time, v9fs sends an
+ attach command (Tattach) for that user.
+ This is the default mode.
+ <uid> = allows only user with uid=<uid> to access
+ the files on the mounted filesystem
+ any = v9fs does single attach and performs all
+ operations as one user
+ client = ACL based access check on the 9p client
+ side for access validation
+
+ cachetag cache tag to use the specified persistent cache.
+ cache tags for existing cache sessions can be listed at
+ /sys/fs/9p/caches. (applies only to cache=fscache)
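+
+As an illustration, several of the options above can be combined on a
+single mount; the server address and values here are examples only
+(debug=0x5 enables verbose error messages plus the 9p trace):
+
+    mount -t 9p -o trans=tcp,port=564,version=9p2000.L,debug=0x5 10.10.1.2 /mnt/9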
+
+RESOURCES
+=========
+
+Protocol specifications are maintained on github:
+http://ericvh.github.com/9p-rfc/
+
+9p client and server implementations are listed on
+http://9p.cat-v.org/implementations
+
+A 9p2000.L server is being developed by LLNL and can be found
+at http://code.google.com/p/diod/
+
+There are user and developer mailing lists available through the v9fs project
+on sourceforge (http://sourceforge.net/projects/v9fs).
+
+News and other information is maintained on a Wiki.
+(http://sf.net/apps/mediawiki/v9fs/index.php).
+
+Bug reports may be issued through the kernel.org bugzilla
+(http://bugzilla.kernel.org)
+
+For more information on the Plan 9 Operating System check out
+http://plan9.bell-labs.com/plan9
+
+For information on Plan 9 from User Space (Plan 9 applications and libraries
+ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
+
+
+STATUS
+======
+
+The 2.6 kernel support is working on PPC and x86.
+
+PLEASE USE THE KERNEL BUGZILLA TO REPORT PROBLEMS. (http://bugzilla.kernel.org)
+
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
new file mode 100644
index 00000000..57d827d6
--- /dev/null
+++ b/Documentation/filesystems/Locking
@@ -0,0 +1,534 @@
+ The text below describes the locking rules for VFS-related methods.
+It is (believed to be) up-to-date. *Please*, if you change anything in
+prototypes or locking protocols - update this file. And update the relevant
+instances in the tree, don't leave that to maintainers of filesystems/devices/
+etc. At the very least, put the list of dubious cases in the end of this file.
+Don't turn it into log - maintainers of out-of-the-tree code are supposed to
+be able to use diff(1).
+ Thing currently missing here: socket operations. Alexey?
+
+--------------------------- dentry_operations --------------------------
+prototypes:
+ int (*d_revalidate)(struct dentry *, struct nameidata *);
+ int (*d_hash)(const struct dentry *, const struct inode *,
+ struct qstr *);
+ int (*d_compare)(const struct dentry *, const struct inode *,
+ const struct dentry *, const struct inode *,
+ unsigned int, const char *, const struct qstr *);
+ int (*d_delete)(struct dentry *);
+ void (*d_release)(struct dentry *);
+ void (*d_iput)(struct dentry *, struct inode *);
+	char *(*d_dname)(struct dentry *dentry, char *buffer, int buflen);
+ struct vfsmount *(*d_automount)(struct path *path);
+ int (*d_manage)(struct dentry *, bool);
+
+locking rules:
+ rename_lock ->d_lock may block rcu-walk
+d_revalidate: no no yes (ref-walk) maybe
+d_hash no no no maybe
+d_compare: yes no no maybe
+d_delete: no yes no no
+d_release: no no yes no
+d_iput: no no yes no
+d_dname: no no no no
+d_automount: no no yes no
+d_manage: no no yes (ref-walk) maybe
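+
+As a minimal sketch (filesystem name hypothetical), a ->d_revalidate()
+instance honouring these rules returns -ECHILD in rcu-walk mode, where
+blocking is not allowed, so the VFS retries the lookup in ref-walk mode:
+
+	static int examplefs_d_revalidate(struct dentry *dentry,
+					  struct nameidata *nd)
+	{
+		if (nd && (nd->flags & LOOKUP_RCU))
+			return -ECHILD;	/* cannot block in rcu-walk */
+
+		/* ref-walk mode: blocking revalidation is allowed here */
+		return 1;	/* 1 = still valid, 0 = invalidate */
+	}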
+
+--------------------------- inode_operations ---------------------------
+prototypes:
+ int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
+	struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
+ int (*link) (struct dentry *,struct inode *,struct dentry *);
+ int (*unlink) (struct inode *,struct dentry *);
+ int (*symlink) (struct inode *,struct dentry *,const char *);
+ int (*mkdir) (struct inode *,struct dentry *,int);
+ int (*rmdir) (struct inode *,struct dentry *);
+ int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*rename) (struct inode *, struct dentry *,
+ struct inode *, struct dentry *);
+ int (*readlink) (struct dentry *, char __user *,int);
+ void * (*follow_link) (struct dentry *, struct nameidata *);
+ void (*put_link) (struct dentry *, struct nameidata *, void *);
+ void (*truncate) (struct inode *);
+ int (*permission) (struct inode *, int, unsigned int);
+ int (*check_acl)(struct inode *, int, unsigned int);
+ int (*setattr) (struct dentry *, struct iattr *);
+ int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *);
+ int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
+ ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
+ ssize_t (*listxattr) (struct dentry *, char *, size_t);
+ int (*removexattr) (struct dentry *, const char *);
+ void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
+
+locking rules:
+ all may block
+ i_mutex(inode)
+lookup: yes
+create: yes
+link: yes (both)
+mknod: yes
+symlink: yes
+mkdir: yes
+unlink: yes (both)
+rmdir: yes (both) (see below)
+rename: yes (all) (see below)
+readlink: no
+follow_link: no
+put_link: no
+truncate: yes (see below)
+setattr: yes
+permission: no (may not block if called in rcu-walk mode)
+check_acl: no
+getattr: no
+setxattr: yes
+getxattr: no
+listxattr: no
+removexattr: yes
+truncate_range: yes
+fiemap: no
+ Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
+victim.
+ cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
+ ->truncate() is never called directly - it's a callback, not a
+method. It's called by vmtruncate() - deprecated library function used by
+->setattr(). Locking information above applies to that call (i.e. is
+inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
+passed).
+
+See Documentation/filesystems/directory-locking for more detailed discussion
+of the locking scheme for directory operations.
+
+--------------------------- super_operations ---------------------------
+prototypes:
+ struct inode *(*alloc_inode)(struct super_block *sb);
+ void (*destroy_inode)(struct inode *);
+ void (*dirty_inode) (struct inode *, int flags);
+ int (*write_inode) (struct inode *, struct writeback_control *wbc);
+ int (*drop_inode) (struct inode *);
+ void (*evict_inode) (struct inode *);
+ void (*put_super) (struct super_block *);
+ void (*write_super) (struct super_block *);
+ int (*sync_fs)(struct super_block *sb, int wait);
+ int (*freeze_fs) (struct super_block *);
+ int (*unfreeze_fs) (struct super_block *);
+ int (*statfs) (struct dentry *, struct kstatfs *);
+ int (*remount_fs) (struct super_block *, int *, char *);
+ void (*umount_begin) (struct super_block *);
+ int (*show_options)(struct seq_file *, struct vfsmount *);
+ ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
+ ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
+ int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+
+locking rules:
+ All may block [not true, see below]
+ s_umount
+alloc_inode:
+destroy_inode:
+dirty_inode:
+write_inode:
+drop_inode: !!!inode->i_lock!!!
+evict_inode:
+put_super: write
+write_super: read
+sync_fs: read
+freeze_fs: read
+unfreeze_fs: read
+statfs: maybe(read) (see below)
+remount_fs: write
+umount_begin: no
+show_options: no (namespace_sem)
+quota_read: no (see below)
+quota_write: no (see below)
+bdev_try_to_free_page: no (see below)
+
+->statfs() has s_umount (shared) when called by ustat(2) (native or
+compat), but that's an accident of bad API; s_umount is used to pin
+the superblock down when all we have is a dev_t given to us by userland to
+identify the superblock.  Everything else (statfs(), fstatfs(), etc.)
+doesn't hold it when calling ->statfs() - the superblock is pinned down
+by resolving the pathname passed to the syscall.
+->quota_read() and ->quota_write() functions are both guaranteed to
+be the only ones operating on the quota file by the quota code (via
+dqio_sem) (unless an admin really wants to screw up something and
+writes to quota files with quotas on). For other details about locking
+see also dquot_operations section.
+->bdev_try_to_free_page is called from the ->releasepage handler of
+the block device inode. See there for more details.
+
+--------------------------- file_system_type ---------------------------
+prototypes:
+ int (*get_sb) (struct file_system_type *, int,
+ const char *, void *, struct vfsmount *);
+ struct dentry *(*mount) (struct file_system_type *, int,
+ const char *, void *);
+ void (*kill_sb) (struct super_block *);
+locking rules:
+ may block
+mount yes
+kill_sb yes
+
+->mount() returns ERR_PTR or the root dentry; its superblock should be locked
+on return.
+->kill_sb() takes a write-locked superblock, does all shutdown work on it,
+unlocks and drops the reference.
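+
+A minimal ->mount() sketch for a block-device based filesystem, assuming
+a fill_super callback (all names hypothetical):
+
+	static struct dentry *examplefs_mount(struct file_system_type *fs_type,
+			int flags, const char *dev_name, void *data)
+	{
+		/* mount_bdev() opens the device and calls examplefs_fill_super() */
+		return mount_bdev(fs_type, flags, dev_name, data,
+				  examplefs_fill_super);
+	}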
+
+--------------------------- address_space_operations --------------------------
+prototypes:
+ int (*writepage)(struct page *page, struct writeback_control *wbc);
+ int (*readpage)(struct file *, struct page *);
+ int (*sync_page)(struct page *);
+ int (*writepages)(struct address_space *, struct writeback_control *);
+ int (*set_page_dirty)(struct page *page);
+ int (*readpages)(struct file *filp, struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages);
+ int (*write_begin)(struct file *, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata);
+ int (*write_end)(struct file *, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata);
+ sector_t (*bmap)(struct address_space *, sector_t);
+ int (*invalidatepage) (struct page *, unsigned long);
+ int (*releasepage) (struct page *, int);
+ void (*freepage)(struct page *);
+ int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
+ loff_t offset, unsigned long nr_segs);
+ int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
+ unsigned long *);
+ int (*migratepage)(struct address_space *, struct page *, struct page *);
+ int (*launder_page)(struct page *);
+ int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long);
+ int (*error_remove_page)(struct address_space *, struct page *);
+
+locking rules:
+ All except set_page_dirty and freepage may block
+
+ PageLocked(page) i_mutex
+writepage: yes, unlocks (see below)
+readpage: yes, unlocks
+sync_page: maybe
+writepages:
+set_page_dirty no
+readpages:
+write_begin: locks the page yes
+write_end: yes, unlocks yes
+bmap:
+invalidatepage: yes
+releasepage: yes
+freepage: yes
+direct_IO:
+get_xip_mem: maybe
+migratepage: yes (both)
+launder_page: yes
+is_partially_uptodate: yes
+error_remove_page: yes
+
+ ->write_begin(), ->write_end(), ->sync_page() and ->readpage()
+may be called from the request handler (/dev/loop).
+
+ ->readpage() unlocks the page, either synchronously or via I/O
+completion.
+
+ ->readpages() populates the pagecache with the passed pages and starts
+I/O against them. They come unlocked upon I/O completion.
+
+ ->writepage() is used for two purposes: for "memory cleansing" and for
+"sync". These are quite different operations and the behaviour may differ
+depending upon the mode.
+
+If writepage is called for sync (wbc->sync_mode != WB_SYNC_NONE) then
+it *must* start I/O against the page, even if that would involve
+blocking on in-progress I/O.
+
+If writepage is called for memory cleansing (sync_mode ==
+WB_SYNC_NONE) then its role is to get as much writeout underway as
+possible. So writepage should try to avoid blocking against
+currently-in-progress I/O.
+
+If the filesystem is not called for "sync" and it determines that it
+would need to block against in-progress I/O to be able to start new I/O
+against the page the filesystem should redirty the page with
+redirty_page_for_writepage(), then unlock the page and return zero.
+This may also be done to avoid internal deadlocks, but rarely.
+
+If the filesystem is called for sync then it must wait on any
+in-progress I/O and then start new I/O.
+
+The filesystem should unlock the page synchronously, before returning to the
+caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
+value. WRITEPAGE_ACTIVATE means that page cannot really be written out
+currently, and VM should stop calling ->writepage() on this page for some
+time. VM does this by moving page to the head of the active list, hence the
+name.
+
+Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
+and return zero, writepage *must* run set_page_writeback() against the page,
+followed by unlocking it. Once set_page_writeback() has been run against the
+page, write I/O can be submitted and the write I/O completion handler must run
+end_page_writeback() once the I/O is complete. If no I/O is submitted, the
+filesystem must run end_page_writeback() against the page before returning from
+writepage.
+
+That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
+if the filesystem needs the page to be locked during writeout, that is ok, too,
+the page is allowed to be unlocked at any point in time between the calls to
+set_page_writeback() and end_page_writeback().
+
+Note, failure to run either redirty_page_for_writepage() or the combination of
+set_page_writeback()/end_page_writeback() on a page submitted to writepage
+will leave the page itself marked clean but it will be tagged as dirty in the
+radix tree. This incoherency can lead to all sorts of hard-to-debug problems
+in the filesystem like having dirty inodes at umount and losing written data.
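+
+Putting the rules above together, a skeletal ->writepage() might look as
+follows; this is only a sketch, and examplefs_would_block() stands in
+for whatever test the filesystem uses for in-progress I/O:
+
+	static int examplefs_writepage(struct page *page,
+				       struct writeback_control *wbc)
+	{
+		if (wbc->sync_mode == WB_SYNC_NONE &&
+		    examplefs_would_block(page)) {
+			/* memory cleansing: don't block, revisit the page later */
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+			return 0;
+		}
+
+		set_page_writeback(page);
+		unlock_page(page);
+		/* submit write I/O here; the completion handler must call
+		   end_page_writeback() once the I/O is done */
+		return 0;
+	}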
+
+ ->sync_page() locking rules are not well-defined - usually it is called
+with lock on page, but that is not guaranteed. Considering the currently
+existing instances of this method ->sync_page() itself doesn't look
+well-defined...
+
+ ->writepages() is used for periodic writeback and for syscall-initiated
+sync operations. The address_space should start I/O against at least
+*nr_to_write pages. *nr_to_write must be decremented for each page which is
+written. The address_space implementation may write more (or less) pages
+than *nr_to_write asks for, but it should try to be reasonably close. If
+nr_to_write is NULL, all dirty pages must be written.
+
+writepages should _only_ write pages which are present on
+mapping->io_pages.
+
+ ->set_page_dirty() is called from various places in the kernel
+when the target page is marked as needing writeback. It may be called
+under spinlock (it cannot block) and is sometimes called with the page
+not locked.
+
+ ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
+filesystems and by the swapper. The latter will eventually go away. Please,
+keep it that way and don't breed new callers.
+
+ ->invalidatepage() is called when the filesystem must attempt to drop
+some or all of the buffers from the page when it is being truncated. It
+returns zero on success. If ->invalidatepage is zero, the kernel uses
+block_invalidatepage() instead.
+
+ ->releasepage() is called when the kernel is about to try to drop the
+buffers from the page in preparation for freeing it. It returns zero to
+indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
+the kernel assumes that the fs has no private interest in the buffers.
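+
+For a buffer-head based filesystem, ->releasepage() is often a thin
+wrapper; a sketch (examplefs_page_busy() is a hypothetical private
+check, and note that in current headers the second argument is a gfp_t
+mask rather than the int shown in the prototype above):
+
+	static int examplefs_releasepage(struct page *page, gfp_t gfp)
+	{
+		if (examplefs_page_busy(page))
+			return 0;	/* fs still needs the buffers: not freeable */
+		return try_to_free_buffers(page);
+	}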
+
+ ->freepage() is called when the kernel is done dropping the page
+from the page cache.
+
+ ->launder_page() may be called prior to releasing a page if
+it is still found to be dirty. It returns zero if the page was successfully
+cleaned, or an error value if not. Note that in order to prevent the page
+getting mapped back in and redirtied, it needs to be kept locked
+across the entire operation.
+
+----------------------- file_lock_operations ------------------------------
+prototypes:
+ void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
+ void (*fl_release_private)(struct file_lock *);
+
+
+locking rules:
+ file_lock_lock may block
+fl_copy_lock: yes no
+fl_release_private: maybe no
+
+----------------------- lock_manager_operations ---------------------------
+prototypes:
+ int (*fl_compare_owner)(struct file_lock *, struct file_lock *);
+ void (*fl_notify)(struct file_lock *); /* unblock callback */
+ int (*fl_grant)(struct file_lock *, struct file_lock *, int);
+ void (*fl_release_private)(struct file_lock *);
+ void (*fl_break)(struct file_lock *); /* break_lease callback */
+ int (*fl_change)(struct file_lock **, int);
+
+locking rules:
+ file_lock_lock may block
+fl_compare_owner: yes no
+fl_notify: yes no
+fl_grant: no no
+fl_release_private: maybe no
+fl_break: yes no
+fl_change yes no
+
+--------------------------- buffer_head -----------------------------------
+prototypes:
+ void (*b_end_io)(struct buffer_head *bh, int uptodate);
+
+locking rules:
+	called from interrupts. In other words, extreme care is needed here.
+bh is locked, but that is the only guarantee we have here. Currently only RAID1,
+highmem, fs/buffer.c, and fs/ntfs/aops.c provide these. Block devices
+call this method upon I/O completion.
+
+--------------------------- block_device_operations -----------------------
+prototypes:
+ int (*open) (struct block_device *, fmode_t);
+ int (*release) (struct gendisk *, fmode_t);
+ int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
+ int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
+ int (*direct_access) (struct block_device *, sector_t, void **, unsigned long *);
+ int (*media_changed) (struct gendisk *);
+ void (*unlock_native_capacity) (struct gendisk *);
+ int (*revalidate_disk) (struct gendisk *);
+ int (*getgeo)(struct block_device *, struct hd_geometry *);
+ void (*swap_slot_free_notify) (struct block_device *, unsigned long);
+
+locking rules:
+ bd_mutex
+open: yes
+release: yes
+ioctl: no
+compat_ioctl: no
+direct_access: no
+media_changed: no
+unlock_native_capacity: no
+revalidate_disk: no
+getgeo: no
+swap_slot_free_notify: no (see below)
+
+media_changed, unlock_native_capacity and revalidate_disk are called only from
+check_disk_change().
+
+swap_slot_free_notify is called with swap_lock and sometimes the page lock
+held.
+
+
+--------------------------- file_operations -------------------------------
+prototypes:
+ loff_t (*llseek) (struct file *, loff_t, int);
+ ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
+ ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
+ ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ int (*readdir) (struct file *, void *, filldir_t);
+ unsigned int (*poll) (struct file *, struct poll_table_struct *);
+ long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
+ long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ int (*mmap) (struct file *, struct vm_area_struct *);
+ int (*open) (struct inode *, struct file *);
+ int (*flush) (struct file *);
+ int (*release) (struct inode *, struct file *);
+ int (*fsync) (struct file *, int datasync);
+ int (*aio_fsync) (struct kiocb *, int datasync);
+ int (*fasync) (int, struct file *, int);
+ int (*lock) (struct file *, int, struct file_lock *);
+ ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
+ loff_t *);
+ ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
+ loff_t *);
+ ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
+ void __user *);
+ ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
+ loff_t *, int);
+ unsigned long (*get_unmapped_area)(struct file *, unsigned long,
+ unsigned long, unsigned long, unsigned long);
+ int (*check_flags)(int);
+ int (*flock) (struct file *, int, struct file_lock *);
+ ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *,
+ size_t, unsigned int);
+ ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *,
+ size_t, unsigned int);
+ int (*setlease)(struct file *, long, struct file_lock **);
+ long (*fallocate)(struct file *, int, loff_t, loff_t);
+
+locking rules:
+ All may block except for ->setlease.
+ No VFS locks held on entry except for ->fsync and ->setlease.
+
+->fsync() has i_mutex on inode.
+
+->setlease has the file_list_lock held and must not sleep.
+
+->llseek() locking has moved from llseek to the individual llseek
+implementations. If your fs is not using generic_file_llseek, you
+need to acquire and release the appropriate locks in your ->llseek().
+For many filesystems, it is probably safe to acquire the inode
+mutex or just to use i_size_read() instead.
+Note: this does not protect the file->f_pos against concurrent modifications
+since this is something userspace has to take care of.
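+
+A sketch of a ->llseek() that takes i_mutex itself, as suggested above
+(names illustrative; validation and error handling trimmed):
+
+	static loff_t examplefs_llseek(struct file *file, loff_t offset, int origin)
+	{
+		struct inode *inode = file->f_path.dentry->d_inode;
+
+		mutex_lock(&inode->i_mutex);
+		switch (origin) {
+		case SEEK_END:
+			offset += i_size_read(inode); /* i_size stable under i_mutex */
+			break;
+		case SEEK_CUR:
+			offset += file->f_pos;
+			break;
+		}
+		file->f_pos = offset;
+		mutex_unlock(&inode->i_mutex);
+		return offset;
+	}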
+
+->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
+Most instances call fasync_helper(), which does that maintenance, so it's
+not normally something one needs to worry about. Return values > 0 will be
+mapped to zero in the VFS layer.
+
+->readdir() and ->ioctl() on directories must be changed. Ideally we would
+move ->readdir() to inode_operations and use a separate method for directory
+->ioctl() or kill the latter completely. One of the problems is that for
+anything that resembles union-mount we won't have a struct file for all
+components. And there are other reasons why the current interface is a mess...
+
+->read on directories probably must go away - we should just enforce -EISDIR
+in sys_read() and friends.
+
+--------------------------- dquot_operations -------------------------------
+prototypes:
+ int (*write_dquot) (struct dquot *);
+ int (*acquire_dquot) (struct dquot *);
+ int (*release_dquot) (struct dquot *);
+ int (*mark_dirty) (struct dquot *);
+ int (*write_info) (struct super_block *, int);
+
+These operations are intended to be more or less wrapping functions that ensure
+proper locking with respect to the filesystem and call the generic quota operations.
+
+What filesystem should expect from the generic quota functions:
+
+ FS recursion Held locks when called
+write_dquot: yes dqonoff_sem or dqptr_sem
+acquire_dquot: yes dqonoff_sem or dqptr_sem
+release_dquot: yes dqonoff_sem or dqptr_sem
+mark_dirty: no -
+write_info: yes dqonoff_sem
+
+FS recursion means calling ->quota_read() and ->quota_write() from superblock
+operations.
+
+More details about quota locking can be found in fs/dquot.c.
+
+--------------------------- vm_operations_struct -----------------------------
+prototypes:
+ void (*open)(struct vm_area_struct*);
+ void (*close)(struct vm_area_struct*);
+ int (*fault)(struct vm_area_struct*, struct vm_fault *);
+ int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
+ int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
+
+locking rules:
+ mmap_sem PageLocked(page)
+open: yes
+close: yes
+fault: yes can return with page locked
+page_mkwrite: yes can return with page locked
+access: yes
+
+ ->fault() is called when a previously not present pte is about
+to be faulted in. The filesystem must find and return the page associated
+with the passed in "pgoff" in the vm_fault structure. If it is possible that
+the page may be truncated and/or invalidated, then the filesystem must lock
+the page, then ensure it is not already truncated (the page lock will block
+subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
+locked. The VM will unlock the page.
+
+ ->page_mkwrite() is called when a previously read-only pte is
+about to become writeable. The filesystem again must ensure that there are
+no truncate/invalidate races, and then return with the page locked. If
+the page has been truncated, the filesystem should not look up a new page
+like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
+will cause the VM to retry the fault.
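+
+The truncate check both handlers need looks roughly like this sketch,
+where "mapping" is the file's address_space and "page" has already been
+looked up:
+
+	lock_page(page);
+	if (page->mapping != mapping) {
+		/* raced with truncate: the page is gone */
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_NOPAGE;	/* or redo the lookup in ->fault() */
+	}
+	vmf->page = page;
+	return VM_FAULT_LOCKED;	/* the VM will unlock the page */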
+
+ ->access() is called when get_user_pages() fails in
+access_process_vm(), typically used to debug a process through
+/proc/pid/mem or ptrace. This function is needed only for
+VM_IO | VM_PFNMAP VMAs.
+
+================================================================================
+ Dubious stuff
+
+(if you break something or notice that it is broken and do not fix it yourself
+- at least put it here)
diff --git a/Documentation/filesystems/Makefile b/Documentation/filesystems/Makefile
new file mode 100644
index 00000000..a5dd114d
--- /dev/null
+++ b/Documentation/filesystems/Makefile
@@ -0,0 +1,8 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-y := dnotify_test
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt
new file mode 100644
index 00000000..59497663
--- /dev/null
+++ b/Documentation/filesystems/adfs.txt
@@ -0,0 +1,75 @@
+Mount options for ADFS
+----------------------
+
+ uid=nnn All files in the partition will be owned by
+ user id nnn. Default 0 (root).
+ gid=nnn All files in the partition will be in group
+ nnn. Default 0 (root).
+ ownmask=nnn The permission mask for ADFS 'owner' permissions
+ will be nnn. Default 0700.
+ othmask=nnn The permission mask for ADFS 'other' permissions
+ will be nnn. Default 0077.
+ ftsuffix=n When ftsuffix=0, no file type suffix will be applied.
+ When ftsuffix=1, a hexadecimal suffix corresponding to
+ the RISC OS file type will be added. Default 0.
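+
+  For example, to mount an ADFS partition with file ownership given to
+  a particular user and RISC OS type suffixes enabled (device name and
+  ids are illustrative):
+
+    mount -t adfs -o uid=1000,gid=1000,ftsuffix=1 /dev/sdb1 /mnt/riscos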
+
+Mapping of ADFS permissions to Linux permissions
+------------------------------------------------
+
+ ADFS permissions consist of the following:
+
+ Owner read
+ Owner write
+ Other read
+ Other write
+
+ (In older versions, an 'execute' permission did exist, but this
+ does not hold the same meaning as the Linux 'execute' permission
+ and is now obsolete).
+
+ The mapping is performed as follows:
+
+ Owner read -> -r--r--r--
+ Owner write -> --w--w---w
+ Owner read and filetype UnixExec -> ---x--x--x
+ These are then masked by ownmask, eg 700 -> -rwx------
+ Possible owner mode permissions -> -rwx------
+
+ Other read -> -r--r--r--
+ Other write -> --w--w--w-
+ Other read and filetype UnixExec -> ---x--x--x
+ These are then masked by othmask, eg 077 -> ----rwxrwx
+ Possible other mode permissions -> ----rwxrwx
+
+ Hence, with the default masks, if a file is owner read/write, and
+ not a UnixExec filetype, then the permissions will be:
+
+ -rw-------
+
+ However, if the masks were ownmask=0770,othmask=0007, then this would
+ be modified to:
+ -rw-rw----
+
+  There is no restriction on what you can do with these masks. You may,
+  for instance, wish that either read bit grants read access to the file
+  for everyone, while keeping the default write protection
+  (ownmask=0755,othmask=0577):
+
+ -rw-r--r--
+
+  You can therefore tailor the permission translation to whatever you
+  want the permissions to be under Linux.
+
+RISC OS file type suffix
+------------------------
+
+ RISC OS file types are stored in bits 19..8 of the file load address.
+
+ To enable non-RISC OS systems to be used to store files without losing
+ file type information, a file naming convention was devised (initially
+ for use with NFS) such that a hexadecimal suffix of the form ,xyz
+ denoted the file type: e.g. BasicFile,ffb is a BASIC (0xffb) file. This
+ naming convention is now also used by RISC OS emulators such as RPCEmu.
+
+ Mounting an ADFS disc with option ftsuffix=1 will cause appropriate file
+ type suffixes to be appended to file names read from a directory. If the
+ ftsuffix option is zero or omitted, no file type suffixes will be added.
diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.txt
new file mode 100644
index 00000000..81ac488e
--- /dev/null
+++ b/Documentation/filesystems/affs.txt
@@ -0,0 +1,219 @@
+Overview of Amiga Filesystems
+=============================
+
+Not all varieties of the Amiga filesystems are supported for reading and
+writing. The Amiga currently knows six different filesystems:
+
+DOS\0 The old or original filesystem, not really suited for
+ hard disks and normally not used on them, either.
+ Supported read/write.
+
+DOS\1 The original Fast File System. Supported read/write.
+
+DOS\2 The old "international" filesystem. International means that
+ a bug has been fixed so that accented ("international") letters
+ in file names are case-insensitive, as they ought to be.
+ Supported read/write.
+
+DOS\3 The "international" Fast File System. Supported read/write.
+
+DOS\4 The original filesystem with directory cache. The directory
+ cache speeds up directory accesses on floppies considerably,
+ but slows down file creation/deletion. Doesn't make much
+ sense on hard disks. Supported read only.
+
+DOS\5 The Fast File System with directory cache. Supported read only.
+
+All of the above filesystems allow block sizes from 512 to 32K bytes.
+Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
+speed up almost everything at the expense of wasted disk space. The speed
+gain above 4K seems not really worth the price, so you don't lose too
+much here, either.
+
+The muFS (multi user File System) equivalents of the above file systems
+are supported, too.
+
+Mount options for the AFFS
+==========================
+
+protect If this option is set, the protection bits cannot be altered.
+
+setuid[=uid] This sets the owner of all files and directories in the file
+ system to uid or the uid of the current user, respectively.
+
+setgid[=gid] Same as above, but for gid.
+
+mode=mode Sets the mode flags to the given (octal) value, regardless
+ of the original permissions. Directories will get an x
+ permission if the corresponding r bit is set.
+ This is useful since most of the plain AmigaOS files
+ will map to 600.
+
+reserved=num Sets the number of reserved blocks at the start of the
+ partition to num. You should never need this option.
+ Default is 2.
+
+root=block Sets the block number of the root block. This should never
+ be necessary.
+
+bs=blksize Sets the blocksize to blksize. Valid block sizes are 512,
+ 1024, 2048 and 4096. Like the root option, this should
+ never be necessary, as the affs can figure it out itself.
+
+quiet The file system will not return an error for disallowed
+ mode changes.
+
+verbose The volume name, file system type and block size will
+ be written to the syslog when the filesystem is mounted.
+
+mufs		The filesystem is really a muFS, although it doesn't
+		identify itself as one. This option is necessary if
+ the filesystem wasn't formatted as muFS, but is used
+ as one.
+
+prefix=path Path will be prefixed to every absolute path name of
+ symbolic links on an AFFS partition. Default = "/".
+ (See below.)
+
+volume=name When symbolic links with an absolute path are created
+ on an AFFS partition, name will be prepended as the
+ volume name. Default = "" (empty string).
+ (See below.)
+
+Handling of the Users/Groups and protection flags
+=================================================
+
+Amiga -> Linux:
+
+The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
+
+ - R maps to r for user, group and others. On directories, R implies x.
+
+ - If both W and D are allowed, w will be set.
+
+ - E maps to x.
+
+ - H and P are always retained and ignored under Linux.
+
+ - A is always reset when a file is written to.
+
+User id and group id will be used unless set[gu]id are given as mount
+options. Since most of the Amiga file systems are single user systems
+they will be owned by root. The root directory (the mount point) of the
+Amiga filesystem will be owned by the user who actually mounts the
+filesystem (the root directory doesn't have uid/gid fields).
+
+Linux -> Amiga:
+
+The Linux rwxrwxrwx file mode is handled as follows:
+
+ - r permission will set R for user, group and others.
+
+ - w permission will set W and D for user, group and others.
+
+ - x permission of the user will set E for plain files.
+
+ - All other flags (suid, sgid, ...) are ignored and will
+ not be retained.
+
+Newly created files and directories will get the user and group ID
+of the current user and a mode according to the umask.
+
+Symbolic links
+==============
+
+Although the Amiga and Linux file systems resemble each other, there
+are some, not always subtle, differences. One of them becomes apparent
+with symbolic links. While Linux has a file system with exactly one
+root directory, the Amiga has a separate root directory for each
+file system (for example, partition, floppy disk, ...). With the Amiga,
+these entities are called "volumes". They have symbolic names which
+can be used to access them. Thus, symbolic links can point to a
+different volume. AFFS turns the volume name into a directory name
+and prepends the prefix path (see prefix option) to it.
+
+Example:
+You mount all your Amiga partitions under /amiga/<volume> (where
+<volume> is the name of the volume), and you give the option
+"prefix=/amiga/" when mounting all your AFFS partitions. (They
+might be "User", "WB" and "Graphics", the mount points /amiga/User,
+/amiga/WB and /amiga/Graphics). A symbolic link referring to
+"User:sc/include/dos/dos.h" will be followed to
+"/amiga/User/sc/include/dos/dos.h".
+
+Examples
+========
+
+Command line:
+ mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
+ mount /dev/sda3 /Amiga -t affs
+
+/etc/fstab entry:
+ /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0
+
+IMPORTANT NOTE
+==============
+
+If you boot Windows 95 (don't know about 3.x, 98 and NT) while you
+have an Amiga harddisk connected to your PC, it will overwrite
+the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating
+the Rigid Disk Block. Sheer luck has it that this is an unused
+area of the RDB, so only the checksum doesn't match anymore.
+Linux will ignore this garbage and recognize the RDB anyway, but
+before you connect that drive to your Amiga again, you must
+restore or repair your RDB. So please do make a backup copy of it
+before booting Windows!
+
+If the damage is already done, the following should fix the RDB
+(where <disk> is the device name).
+DO AT YOUR OWN RISK:
+
+dd if=/dev/<disk> of=rdb.tmp count=1                 # save block 0
+cp rdb.tmp rdb.fixed                                 # work on a copy
+dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4   # clear bytes 0xdc..0xdf
+dd if=rdb.fixed of=/dev/<disk>                       # write repaired block back
+
+Bugs, Restrictions, Caveats
+===========================
+
+Quite a few things may not work as advertised. Not everything is
+tested, though several hundred MB have been read and written using
+this fs. For the most up-to-date list of bugs please consult
+fs/affs/Changes.
+
+Filenames are truncated to 30 characters without warning (this
+can be changed by setting the compile-time option AFFS_NO_TRUNCATE
+in include/linux/amigaffs.h).
+
+Case is ignored by the affs in filename matching, but Linux shells
+do care about the case. Example (with /wb being an affs mounted fs):
+ rm /wb/WRONGCASE
+will remove /mnt/wrongcase, but
+ rm /wb/WR*
+will not since the names are matched by the shell.
+
+The block allocation is designed for hard disk partitions. If more
+than 1 process writes to a (small) diskette, the blocks are allocated
+in an ugly way (but the real AFFS doesn't do much better). This
+is also true when space gets tight.
+
+You cannot execute programs on an OFS (Old File System), since the
+program files cannot be memory mapped due to the 488 byte blocks.
+For the same reason you cannot mount an image on such a filesystem
+via the loopback device.
+
+The bitmap valid flag in the root block may not be accurate when the
+system crashes while an affs partition is mounted. There's currently
+no way to fix a garbled filesystem without an Amiga (disk validator)
+or manually (who would do this?). Maybe later.
+
+If you mount affs partitions on system startup, you may want to tell
+fsck that the fs should not be checked (place a '0' in the sixth field
+of /etc/fstab).
+
+It's not possible to read floppy disks with a normal PC or workstation
+due to an incompatibility with the Amiga floppy controller.
+
+If you are interested in an Amiga Emulator for Linux, look at
+
+http://web.archive.org/web/*/http://www.freiburg.linux.de/~uae/
diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt
new file mode 100644
index 00000000..ffef91c4
--- /dev/null
+++ b/Documentation/filesystems/afs.txt
@@ -0,0 +1,247 @@
+ ====================
+ kAFS: AFS FILESYSTEM
+ ====================
+
+Contents:
+
+ - Overview.
+ - Compilation.
+ - Usage.
+ - Mountpoints.
+ - Proc filesystem.
+ - The cell database.
+ - Security.
+ - Examples.
+
+
+========
+OVERVIEW
+========
+
+This filesystem provides a fairly simple secure AFS filesystem driver. It is
+under development and does not yet provide the full feature set. The features
+it does support include:
+
+ (*) Security (currently only AFS kaserver and KerberosIV tickets).
+
+ (*) File reading and writing.
+
+ (*) Automounting.
+
+ (*) Local caching (via fscache).
+
+It does not yet support the following AFS features:
+
+ (*) pioctl() system call.
+
+
+===========
+COMPILATION
+===========
+
+The filesystem should be enabled by turning on the kernel configuration
+options:
+
+ CONFIG_AF_RXRPC - The RxRPC protocol transport
+ CONFIG_RXKAD - The RxRPC Kerberos security handler
+ CONFIG_AFS - The AFS filesystem
+
+Additionally, the following can be turned on to aid debugging:
+
+ CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled
+ CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled
+
+They permit the debugging messages to be turned on dynamically by manipulating
+the masks in the following files:
+
+ /sys/module/af_rxrpc/parameters/debug
+ /sys/module/kafs/parameters/debug
+
+
+=====
+USAGE
+=====
+
+When inserting the driver modules the root cell must be specified along with a
+list of volume location server IP addresses:
+
+ modprobe af_rxrpc
+ modprobe rxkad
+ modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
+
+The first module is the AF_RXRPC network protocol driver. This provides the
+RxRPC remote operation protocol and may also be accessed from userspace. See:
+
+ Documentation/networking/rxrpc.txt
+
+The second module is the kerberos RxRPC security driver, and the third module
+is the actual filesystem driver for the AFS filesystem.
+
+Once the module has been loaded, more modules can be added by the following
+procedure:
+
+ echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells
+
+Where the parameters to the "add" command are the name of a cell and a list of
+volume location servers within that cell, with the latter separated by colons.
+
+Filesystems can be mounted anywhere by commands similar to the following:
+
+ mount -t afs "%cambridge.redhat.com:root.afs." /afs
+ mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
+ mount -t afs "#root.afs." /afs
+ mount -t afs "#root.cell." /afs/cambridge
+
+Where the initial character is either a hash or a percent symbol depending on
+whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O
+volume, but are willing to use a R/W volume instead (percent).
+
+The volume name can be suffixed with ".backup" or ".readonly" to
+specify connection only to volumes of those types.
+
+The name of the cell is optional, and if not given during a mount, then the
+named volume will be looked up in the cell specified during modprobe.
+
+Additional cells can be added through /proc (see later section).
+
+
+===========
+MOUNTPOINTS
+===========
+
+AFS has a concept of mountpoints. In AFS terms, these are specially formatted
+symbolic links (of the same form as the "device name" passed to mount). kAFS
+presents these to the user as directories that have a follow-link capability
+(ie: symbolic link semantics). If anyone attempts to access them, they will
+automatically cause the target volume to be mounted (if possible) on that site.
+
+Automatically mounted filesystems will be automatically unmounted approximately
+twenty minutes after they were last used. Alternatively they can be unmounted
+directly with the umount() system call.
+
+Manually unmounting an AFS volume will cause any idle submounts upon it to be
+culled first. If all are culled, then the requested volume will also be
+unmounted, otherwise error EBUSY will be returned.
+
+This can be used by the administrator to attempt to unmount the whole AFS tree
+mounted on /afs in one go by doing:
+
+ umount /afs
+
+
+===============
+PROC FILESYSTEM
+===============
+
+The AFS module creates a "/proc/fs/afs/" directory and populates it with:
+
+ (*) A "cells" file that lists cells currently known to the afs module and
+ their usage counts:
+
+ [root@andromeda ~]# cat /proc/fs/afs/cells
+ USE NAME
+ 3 cambridge.redhat.com
+
+ (*) A directory per cell that contains files that list volume location
+ servers, volumes, and active servers known within that cell.
+
+ [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers
+ USE ADDR STATE
+ 4 172.16.18.91 0
+ [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/vlservers
+ ADDRESS
+ 172.16.18.91
+ [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/volumes
+ USE STT VLID[0] VLID[1] VLID[2] NAME
+ 1 Val 20000000 20000001 20000002 root.afs
+
+
+=================
+THE CELL DATABASE
+=================
+
+The filesystem maintains an internal database of all the cells it knows and the
+IP addresses of the volume location servers for those cells. The cell to which
+the system belongs is added to the database when modprobe is performed by the
+"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on
+the kernel command line.
+
+Further cells can be added by commands similar to the following:
+
+ echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
+ echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells
+
+No other cell database operations are available at this time.
+
+
+========
+SECURITY
+========
+
+Secure operations are initiated by acquiring a key using the klog program. A
+very primitive klog program is available at:
+
+ http://people.redhat.com/~dhowells/rxrpc/klog.c
+
+This should be compiled by:
+
+ make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils"
+
+And then run as:
+
+ ./klog
+
+Assuming it's successful, this adds a key of type RxRPC, named for the service
+and cell, eg: "afs@<cellname>". This can be viewed with the keyctl program or
+by cat'ing /proc/keys:
+
+ [root@andromeda ~]# keyctl show
+ Session Keyring
+ -3 --alswrv 0 0 keyring: _ses.3268
+ 2 --alswrv 0 0 \_ keyring: _uid.0
+ 111416553 --als--v 0 0 \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM
+
+Currently the username, realm, password and proposed ticket lifetime are
+compiled in to the program.
+
+It is not required to acquire a key before using AFS facilities, but if one is
+not acquired then all operations will be governed by the anonymous user parts
+of the ACLs.
+
+If a key is acquired, then all AFS operations, including mounts and automounts,
+made by a possessor of that key will be secured with that key.
+
+If a file is opened with a particular key and then the file descriptor is
+passed to a process that doesn't have that key (perhaps over an AF_UNIX
+socket), then the operations on the file will be made with the key that was
+used to open the file.
+
+
+========
+EXAMPLES
+========
+
+Here's what I use to test this. Some of the names and IP addresses are local
+to my internal DNS. My "root.afs" partition has a mount point within it for
+some public volumes.
+
+insmod /tmp/rxrpc.o
+insmod /tmp/rxkad.o
+insmod /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.91
+
+mount -t afs \%root.afs. /afs
+mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
+
+echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 > /proc/fs/afs/cells
+mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
+mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
+mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
+mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc
+mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project
+mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service
+mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software
+mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user
+
+umount /afs
+rmmod kafs
+rmmod rxkad
+rmmod rxrpc
diff --git a/Documentation/filesystems/autofs4-mount-control.txt b/Documentation/filesystems/autofs4-mount-control.txt
new file mode 100644
index 00000000..4c95935c
--- /dev/null
+++ b/Documentation/filesystems/autofs4-mount-control.txt
@@ -0,0 +1,393 @@
+
+Miscellaneous Device control operations for the autofs4 kernel module
+=====================================================================
+
+The problem
+===========
+
+There is a problem with active restarts in autofs (that is to say
+restarting autofs when there are busy mounts).
+
+During normal operation autofs uses a file descriptor opened on the
+directory that is being managed in order to be able to issue control
+operations. Using a file descriptor gives ioctl operations access to
+autofs specific information stored in the super block. The operations
+are things such as setting an autofs mount catatonic, setting the
+expire timeout and requesting expire checks. As is explained below,
+certain types of autofs triggered mounts can end up covering an autofs
+mount itself which prevents us being able to use open(2) to obtain a
+file descriptor for these operations if we don't already have one open.
+
+Currently autofs uses "umount -l" (lazy umount) to clear active mounts
+at restart. While using lazy umount works for most cases, anything that
+needs to walk back up the mount tree to construct a path, such as
+getcwd(2) and the proc file system /proc/<pid>/cwd, no longer works
+because the point from which the path is constructed has been detached
+from the mount tree.
+
+The actual problem with autofs is that it can't reconnect to existing
+mounts. One immediately thinks that just adding the ability to remount
+autofs file systems would solve it, but alas, that can't work. This is
+because autofs direct mounts and the implementation of "on demand mount
+and expire" of nested mount trees have the file system mounted directly
+on top of the mount trigger directory dentry.
+
+For example, there are two types of automount maps, direct (in the kernel
+module source you will see a third type called an offset, which is just
+a direct mount in disguise) and indirect.
+
+Here is a master map with direct and indirect map entries:
+
+/- /etc/auto.direct
+/test /etc/auto.indirect
+
+and the corresponding map files:
+
+/etc/auto.direct:
+
+/automount/dparse/g6 budgie:/autofs/export1
+/automount/dparse/g1 shark:/autofs/export1
+and so on.
+
+/etc/auto.indirect:
+
+g1 shark:/autofs/export1
+g6 budgie:/autofs/export1
+and so on.
+
+For the above indirect map an autofs file system is mounted on /test and
+mounts are triggered for each sub-directory key by the inode lookup
+operation. So we see a mount of shark:/autofs/export1 on /test/g1, for
+example.
+
+The way that direct mounts are handled is by making an autofs mount on
+each full path, such as /automount/dparse/g1, and using it as a mount
+trigger. So when we walk on the path we mount shark:/autofs/export1 "on
+top of this mount point". Since these are always directories we can
+use the follow_link inode operation to trigger the mount.
+
+But, each entry in direct and indirect maps can have offsets (making
+them multi-mount map entries).
+
+For example, an indirect mount map entry could also be:
+
+g1 \
+ / shark:/autofs/export5/testing/test \
+ /s1 shark:/autofs/export/testing/test/s1 \
+ /s2 shark:/autofs/export5/testing/test/s2 \
+ /s1/ss1 shark:/autofs/export1 \
+ /s2/ss2 shark:/autofs/export2
+
+and similarly a direct mount map entry could also be:
+
+/automount/dparse/g1 \
+ / shark:/autofs/export5/testing/test \
+ /s1 shark:/autofs/export/testing/test/s1 \
+ /s2 shark:/autofs/export5/testing/test/s2 \
+ /s1/ss1 shark:/autofs/export2 \
+ /s2/ss2 shark:/autofs/export2
+
+One of the issues with version 4 of autofs was that, when mounting an
+entry with a large number of offsets, possibly with nesting, we needed
+to mount and umount all of the offsets as a single unit. Not really a
+problem, except for people with a large number of offsets in map entries.
+This mechanism is used for the well known "hosts" map and we have seen
+cases (in 2.4) where the available number of mounts is exhausted or
+where the number of privileged ports available is exhausted.
+
+In version 5 we mount only as we go down the tree of offsets, and
+similarly expire them, which resolves the above problem. There is
+somewhat more detail to the implementation but it isn't needed for the
+sake of the problem explanation. The one important detail is that these
+offsets are implemented using the same mechanism as the direct mounts
+above and so the mount points can be covered by a mount.
+
+The current autofs implementation uses an ioctl file descriptor opened
+on the mount point for control operations. The references held by the
+descriptor are accounted for in checks made to determine if a mount is
+in use, and the descriptor is also used to access autofs file system
+information held in the mount super block. So the use of a file handle
+needs to be retained.
+
+
+The Solution
+============
+
+To be able to restart autofs leaving existing direct, indirect and
+offset mounts in place we need to be able to obtain a file handle
+for these potentially covered autofs mount points. Rather than just
+implement an isolated operation it was decided to re-implement the
+existing ioctl interface and add new operations to provide this
+functionality.
+
+In addition, to be able to reconstruct a mount tree that has busy mounts,
+the uid and gid of the last user that triggered the mount needs to be
+available because these can be used as macro substitution variables in
+autofs maps. They are recorded at mount request time and an operation
+has been added to retrieve them.
+
+Since we're re-implementing the control interface, a couple of other
+problems with the existing interface have been addressed. First, when
+a mount or expire operation completes a status is returned to the
+kernel by either a "send ready" or a "send fail" operation. The
+"send fail" operation of the ioctl interface could only ever send
+ENOENT so the re-implementation allows user space to send an actual
+status. Another expensive operation in user space, for those using
+very large maps, is discovering if a mount is present. Usually this
+involves scanning /proc/mounts and since it needs to be done quite
+often it can introduce significant overhead when there are many entries
+in the mount table. An operation to lookup the mount status of a mount
+point dentry (covered or not) has also been added.
+
+Current kernel development policy recommends avoiding the use of the
+ioctl mechanism in favor of systems such as Netlink. An implementation
+using this system was attempted to evaluate its suitability and it was
+found to be inadequate, in this case. The Generic Netlink system was
+used for this as raw Netlink would lead to a significant increase in
+complexity. There's no question that the Generic Netlink system is an
+elegant solution for common case ioctl functions but it's not a complete
+replacement probably because its primary purpose in life is to be a
+message bus implementation rather than specifically an ioctl replacement.
+While it would be possible to work around this there is one concern
+that led to the decision not to use it. This is that the autofs
+expire in the daemon has become far too complex because umount
+candidates are enumerated, almost for no other reason than to "count"
+the number of times to call the expire ioctl. This involves scanning
+the mount table, which has proved to be a big overhead for users with
+large maps. The best way to improve this is to try to get back to the
+way the expire was done long ago. That is, when an expire request is
+issued for a mount (file handle) we should continually call back to
+the daemon until we can't umount any more mounts, then return the
+appropriate status to the daemon. At the moment we just expire one
+mount at a time. A Generic Netlink implementation would exclude this
+possibility for future development due to the requirements of the
+message bus architecture.
+
+
+autofs4 Miscellaneous Device mount control interface
+====================================================
+
+The control interface is accessed by opening a device node, typically
+/dev/autofs.
+
+All the ioctls use a common structure to pass the needed parameter
+information and return operation results:
+
+struct autofs_dev_ioctl {
+ __u32 ver_major;
+ __u32 ver_minor;
+ __u32 size; /* total size of data passed in
+ * including this struct */
+ __s32 ioctlfd; /* automount command fd */
+
+ __u32 arg1; /* Command parameters */
+ __u32 arg2;
+
+ char path[0];
+};
+
+The ioctlfd field is a mount point file descriptor of an autofs mount
+point. It is returned by the open call and is used by all calls except
+the check for whether a given path is a mount point, where it may
+optionally be used to check a specific mount corresponding to a given
+mount point file descriptor, and when requesting the uid and gid of the
+last successful mount on a directory within the autofs file system.
+
+The fields arg1 and arg2 are used to communicate parameters and results of
+calls made as described below.
+
+The path field is used to pass a path where it is needed, and the size field
+is used to account for the increased structure length when translating the
+structure sent from user space.
+
+This structure can be initialized before setting specific fields by using
+the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *).
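+
+For illustration, a userspace helper along the following lines can be used to
+allocate and fill in one of these structures with a path appended. The helper
+name and error handling here are illustrative, not part of the interface:
+
+	#include <stdlib.h>
+	#include <string.h>
+
+	static struct autofs_dev_ioctl *alloc_dev_ioctl_path(const char *path)
+	{
+		size_t plen = strlen(path) + 1;	/* include the terminator */
+		struct autofs_dev_ioctl *param;
+
+		param = malloc(sizeof(*param) + plen);
+		if (!param)
+			return NULL;
+		init_autofs_dev_ioctl(param);
+		/* total size of data passed in, including this struct */
+		param->size = sizeof(*param) + plen;
+		memcpy(param->path, path, plen);
+		return param;
+	}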
+
+All of the ioctls perform a copy of this structure from user space to
+kernel space and return -EINVAL if the size parameter is smaller than
+the structure size itself, -ENOMEM if the kernel memory allocation fails
+or -EFAULT if the copy itself fails. Other checks include a version check
+of the compiled in user space version against the module version and a
+mismatch results in a -EINVAL return. If the size field is greater than
+the structure size then a path is assumed to be present and is checked to
+ensure it begins with a "/" and is NULL terminated, otherwise -EINVAL is
+returned. Following these checks, for all ioctl commands except
+AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and
+AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is
+not a valid descriptor or doesn't correspond to an autofs mount point
+an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is
+returned.
+
+
+The ioctls
+==========
+
+An example of an implementation which uses this interface can be seen
+in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the
+distribution tar available for download from kernel.org in directory
+/pub/linux/daemons/autofs/v5.
+
+The device node ioctl operations implemented by this interface are:
+
+
+AUTOFS_DEV_IOCTL_VERSION
+------------------------
+
+Get the major and minor version of the autofs4 device ioctl kernel module
+implementation. It requires an initialized struct autofs_dev_ioctl as an
+input parameter and sets the version information in the passed in structure.
+It returns 0 on success or the error -EINVAL if a version mismatch is
+detected.
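+
+As a sketch, and assuming the ioctl request macros declared alongside the
+structure in the kernel headers, a version query from user space might look
+like:
+
+	#include <stdio.h>
+	#include <fcntl.h>
+	#include <sys/ioctl.h>
+
+	struct autofs_dev_ioctl param;
+	int devfd;
+
+	devfd = open("/dev/autofs", O_RDONLY);
+	init_autofs_dev_ioctl(&param);
+	param.size = sizeof(param);	/* no path appended */
+	if (devfd != -1 && ioctl(devfd, AUTOFS_DEV_IOCTL_VERSION, &param) == 0)
+		printf("ioctl interface version %u.%u\n",
+		       param.ver_major, param.ver_minor);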
+
+
+AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD
+------------------------------------------------------------------
+
+Get the major and minor version of the autofs4 protocol version understood
+by loaded module. This call requires an initialized struct autofs_dev_ioctl
+with the ioctlfd field set to a valid autofs mount point descriptor
+and sets the requested version number in structure field arg1. These
+commands return 0 on success or one of the negative error codes if
+validation fails.
+
+
+AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT
+----------------------------------------------------------
+
+Obtain and release a file descriptor for an autofs managed mount point
+path. The open call requires an initialized struct autofs_dev_ioctl with
+the path field set and the size field adjusted appropriately, as well
+as the arg1 field set to the device number of the autofs mount. The
+device number can be obtained from the mount options shown in
+/proc/mounts. The close call requires an initialized struct
+autofs_dev_ioctl with the ioctlfd field set to the descriptor obtained
+from the open call. The release of the file descriptor can also be done
+with close(2) so any open descriptors will also be closed at process exit.
+The close call is included in the implemented operations largely for
+completeness and to provide for a consistent user space implementation.
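+
+As a sketch, reusing the illustrative helper from above and a path from the
+earlier map examples, where autofs_dev stands for the device number read from
+/proc/mounts and the new descriptor is assumed to come back in the ioctlfd
+field:
+
+	struct autofs_dev_ioctl *param;
+
+	param = alloc_dev_ioctl_path("/automount/dparse/g1");
+	param->arg1 = autofs_dev;	/* device number of the autofs mount */
+	if (ioctl(devfd, AUTOFS_DEV_IOCTL_OPENMOUNT, param) == 0) {
+		int mntfd = param->ioctlfd;	/* use in subsequent calls */
+		/* ... control operations using mntfd ... */
+		ioctl(devfd, AUTOFS_DEV_IOCTL_CLOSEMOUNT, param);
+	}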
+
+
+AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD
+--------------------------------------------------------
+
+Return mount and expire result status from user space to the kernel.
+Both of these calls require an initialized struct autofs_dev_ioctl
+with the ioctlfd field set to the descriptor obtained from the open
+call and the arg1 field set to the wait queue token number, received
+by user space in the foregoing mount or expire request. The arg2 field
+is set to the status to be returned. For the ready call this is always
+0 and for the fail call it is set to the errno of the operation.
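+
+A sketch of reporting a mount result, where mntfd is the descriptor obtained
+as above, wqt_token was received in the foregoing request and mount_succeeded
+reflects the attempted mount (the latter two names are illustrative):
+
+	param->ioctlfd = mntfd;
+	param->arg1 = wqt_token;
+	if (mount_succeeded) {
+		param->arg2 = 0;
+		ioctl(devfd, AUTOFS_DEV_IOCTL_READY, param);
+	} else {
+		param->arg2 = ENOENT;	/* the actual errno of the failure */
+		ioctl(devfd, AUTOFS_DEV_IOCTL_FAIL, param);
+	}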
+
+
+AUTOFS_DEV_IOCTL_SETPIPEFD_CMD
+------------------------------
+
+Set the pipe file descriptor used for kernel communication to the daemon.
+Normally this is set at mount time using an option but when reconnecting
+to an existing mount we need to use this to tell the autofs mount about
+the new kernel pipe descriptor. In order to protect mounts against
+incorrectly setting the pipe descriptor we also require that the autofs
+mount be catatonic (see next call).
+
+The call requires an initialized struct autofs_dev_ioctl with the
+ioctlfd field set to the descriptor obtained from the open call and
+the arg1 field set to descriptor of the pipe. On success the call
+also sets the process group id used to identify the controlling process
+(eg. the owning automount(8) daemon) to the process group of the caller.
+
+
+AUTOFS_DEV_IOCTL_CATATONIC_CMD
+------------------------------
+
+Make the autofs mount point catatonic. The autofs mount will no longer
+issue mount requests, the kernel communication pipe descriptor is released
+and any remaining waits in the queue released.
+
+The call requires an initialized struct autofs_dev_ioctl with the
+ioctlfd field set to the descriptor obtained from the open call.
+
+
+AUTOFS_DEV_IOCTL_TIMEOUT_CMD
+----------------------------
+
+Set the expire timeout for mounts within an autofs mount point.
+
+The call requires an initialized struct autofs_dev_ioctl with the
+ioctlfd field set to the descriptor obtained from the open call.
+
+
+AUTOFS_DEV_IOCTL_REQUESTER_CMD
+------------------------------
+
+Return the uid and gid of the last process to successfully trigger the
+mount on the given path dentry.
+
+The call requires an initialized struct autofs_dev_ioctl with the path
+field set to the mount point in question and the size field adjusted
+appropriately as well as the arg1 field set to the device number of the
+containing autofs mount. Upon return the struct field arg1 contains the
+uid and arg2 the gid.
+
+When reconstructing an autofs mount tree with active mounts we need to
+re-connect to mounts that may have used the original process uid and
+gid (or string variations of them) for mount lookups within the map entry.
+This call provides the ability to obtain this uid and gid so they may be
+used by user space for the mount map lookups.
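+
+A sketch, again reusing the illustrative helper and variables from above:
+
+	param = alloc_dev_ioctl_path("/automount/dparse/g1");
+	param->arg1 = autofs_dev;	/* device number of containing mount */
+	if (ioctl(devfd, AUTOFS_DEV_IOCTL_REQUESTER, param) == 0)
+		printf("last requester: uid=%u gid=%u\n",
+		       param->arg1, param->arg2);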
+
+
+AUTOFS_DEV_IOCTL_EXPIRE_CMD
+---------------------------
+
+Issue an expire request to the kernel for an autofs mount. Typically
+this ioctl is called until no further expire candidates are found.
+
+The call requires an initialized struct autofs_dev_ioctl with the
+ioctlfd field set to the descriptor obtained from the open call. In
+addition an immediate expire, independent of the mount timeout, can be
+requested by setting the arg1 field to 1. If no expire candidates can
+be found the ioctl returns -1 with errno set to EAGAIN.
+
+This call causes the kernel module to check the mount corresponding
+to the given ioctlfd for mounts that can be expired, issues an expire
+request back to the daemon and waits for completion.
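+
+A sketch of the typical expire loop, continuing from the examples above:
+
+	param->ioctlfd = mntfd;
+	param->arg1 = 0;	/* 1 would request an immediate expire */
+	while (ioctl(devfd, AUTOFS_DEV_IOCTL_EXPIRE, param) == 0)
+		;		/* one expire request issued per call */
+	/* errno == EAGAIN here means no candidates remained */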
+
+AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD
+------------------------------
+
+Checks if an autofs mount point is in use.
+
+The call requires an initialized struct autofs_dev_ioctl with the
+ioctlfd field set to the descriptor obtained from the open call and
+it returns the result in the arg1 field, 1 for busy and 0 otherwise.
+
+
+AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD
+---------------------------------
+
+Check if the given path is a mountpoint.
+
+The call requires an initialized struct autofs_dev_ioctl. There are two
+possible variations. Both use the path field set to the path of the mount
+point to check and the size field adjusted appropriately. One uses the
+ioctlfd field to identify a specific mount point to check while the other
+variation uses the path and optionally arg1 set to an autofs mount type.
+The call returns 1 if this is a mount point, setting field arg2 to the
+relevant super block magic number (described below), or 0 if it isn't a
+mountpoint. In both cases the device number (as returned by
+new_encode_dev()) is returned in field arg1.
+
+If supplied with a file descriptor we're looking for a specific mount,
+not necessarily at the top of the mounted stack. In this case the path
+the descriptor corresponds to is considered a mountpoint if it is itself
+a mountpoint or contains a mount, such as a multi-mount without a root
+mount. In this case we return 1 if the descriptor corresponds to a mount
+point, and also return the super magic of the covering mount if there
+is one, or 0 if it isn't a mountpoint.
+
+If a path is supplied (and the ioctlfd field is set to -1) then the path
+is looked up and is checked to see if it is the root of a mount. If a
+type is also given we are looking for a particular autofs mount and if
+a match isn't found a fail is returned. If the located path is the
+root of a mount 1 is returned along with the super magic of the mount
+or 0 otherwise.
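+
+A sketch of the path-based variation, using a path from the indirect map
+example earlier (the type value 0, standing for "any autofs mount type", is
+an assumption; see the kernel headers for the actual type constants):
+
+	param = alloc_dev_ioctl_path("/test/g1");
+	param->ioctlfd = -1;	/* select the path-based variation */
+	param->arg1 = 0;
+	if (ioctl(devfd, AUTOFS_DEV_IOCTL_ISMOUNTPOINT, param) == 1)
+		printf("mounted: dev %u, magic 0x%x\n",
+		       param->arg1, param->arg2);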
+
diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.txt
new file mode 100644
index 00000000..7cac200e
--- /dev/null
+++ b/Documentation/filesystems/automount-support.txt
@@ -0,0 +1,118 @@
+Support is available for filesystems that wish to do automounting (such
+as kAFS which can be found in fs/afs/). This facility includes allowing
+in-kernel mounts to be performed and mountpoint degradation to be
+requested. The latter can also be requested by userspace.
+
+
+======================
+IN-KERNEL AUTOMOUNTING
+======================
+
+A filesystem can now mount another filesystem on one of its directories by the
+following procedure:
+
+ (1) Give the directory a follow_link() operation.
+
+ When the directory is accessed, the follow_link op will be called, and
+ it will be provided with the location of the mountpoint in the nameidata
+ structure (vfsmount and dentry).
+
+ (2) Have the follow_link() op do the following steps (see the sketch
+     after this list):
+
+ (a) Call vfs_kern_mount() to call the appropriate filesystem to set up a
+ superblock and gain a vfsmount structure representing it.
+
+ (b) Copy the nameidata provided as an argument and substitute the dentry
+         argument into the copy.
+
+ (c) Call do_add_mount() to install the new vfsmount into the namespace's
+ mountpoint tree, thus making it accessible to userspace. Use the
+ nameidata set up in (b) as the destination.
+
+ If the mountpoint will be automatically expired, then do_add_mount()
+ should also be given the location of an expiration list (see further
+ down).
+
+ (d) Release the path in the nameidata argument and substitute in the new
+ vfsmount and its root dentry. The ref counts on these will need
+ incrementing.
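+
+Sketched in code, with example_* names standing in for filesystem-specific
+pieces (including the expiration list discussed further down), and with the
+caveat that the exact signatures of do_add_mount() and the nameidata
+handling vary between kernel versions, steps (a) to (d) look something like:
+
+	static void *example_follow_link(struct dentry *dentry,
+					 struct nameidata *nd)
+	{
+		struct vfsmount *newmnt;
+		int err;
+
+		/* (a) set up the superblock and get a vfsmount for it */
+		newmnt = example_do_automount(nd->path.dentry);
+		if (IS_ERR(newmnt))
+			return newmnt;
+
+		/* (b)+(c) install the mount, hanging it on the expiry list */
+		mntget(newmnt);
+		err = do_add_mount(newmnt, &nd->path, MNT_SHRINKABLE,
+				   &example_automount_list);
+		if (err < 0) {
+			mntput(newmnt);
+			return ERR_PTR(err);
+		}
+
+		/* (d) release the old path and substitute the new mount */
+		path_put(&nd->path);
+		nd->path.mnt = newmnt;
+		nd->path.dentry = dget(newmnt->mnt_root);
+		return NULL;
+	}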
+
+Then from userspace, you can just do something like:
+
+ [root@andromeda root]# mount -t afs \#root.afs. /afs
+ [root@andromeda root]# ls /afs
+ asd cambridge cambridge.redhat.com grand.central.org
+ [root@andromeda root]# ls /afs/cambridge
+ afsdoc
+ [root@andromeda root]# ls /afs/cambridge/afsdoc/
+ ChangeLog html LICENSE pdf RELNOTES-1.2.2
+
+And then if you look in the mountpoint catalogue, you'll see something like:
+
+ [root@andromeda root]# cat /proc/mounts
+ ...
+ #root.afs. /afs afs rw 0 0
+ #root.cell. /afs/cambridge.redhat.com afs rw 0 0
+ #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
+
+
+===========================
+AUTOMATIC MOUNTPOINT EXPIRY
+===========================
+
+Automatic expiration of mountpoints is easy, provided you've mounted the
+mountpoint to be expired in the automounting procedure outlined above.
+
+To do expiration, you need to follow these steps:
+
+ (3) Create at least one list off which the vfsmounts to be expired can be
+ hung. Access to this list will be governed by the vfsmount_lock.
+
+ (4) In step (2c) above, the call to do_add_mount() should be provided with a
+ pointer to this list. It will hang the vfsmount off of it if it succeeds.
+
+ (5) When you want mountpoints to be expired, call mark_mounts_for_expiry()
+ with a pointer to this list. This will process the list, marking every
+ vfsmount thereon for potential expiry on the next call.
+
+ If a vfsmount was already flagged for expiry, and if its usage count is 1
+ (it's only referenced by its parent vfsmount), then it will be deleted
+ from the namespace and thrown away (effectively unmounted).
+
+     It may prove simplest to call this at regular intervals, using some
+     sort of timed event to drive it, as sketched below.
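+
+As a sketch, assuming the list created in step (3) and using illustrative
+example_* names, the timed event could be a periodic work item:
+
+	static void example_expiry_work(struct work_struct *work);
+
+	static LIST_HEAD(example_automount_list);
+	static DECLARE_DELAYED_WORK(example_expiry_dwork, example_expiry_work);
+
+	static void example_expiry_work(struct work_struct *work)
+	{
+		mark_mounts_for_expiry(&example_automount_list);
+		schedule_delayed_work(&example_expiry_dwork, 20 * 60 * HZ);
+	}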
+
+The expiration flag is cleared by calls to mntput. This means that expiration
+will only happen on the second expiration request after the last time the
+mountpoint was accessed.
+
+If a mountpoint is moved, it gets removed from the expiration list. If a bind
+mount is made on an expirable mount, the new vfsmount will not be on the
+expiration list and will not expire.
+
+If a namespace is copied, all mountpoints contained therein will be copied,
+and the copies of those that are on an expiration list will be added to the
+same expiration list.
+
+
+=======================
+USERSPACE DRIVEN EXPIRY
+=======================
+
+As an alternative, it is possible for userspace to request expiry of any
+mountpoint (though some will be rejected - the current process's idea of the
+rootfs for example). It does this by passing the MNT_EXPIRE flag to
+umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH.
+
+If the mountpoint in question is referenced by something other than
+umount() or its parent mountpoint, an EBUSY error will be returned and the
+mountpoint will not be marked for expiration or unmounted.
+
+If the mountpoint was not already marked for expiry at that time, an EAGAIN
+error will be given and it won't be unmounted.
+
+Otherwise if it was already marked and it wasn't referenced, unmounting will
+take place as usual.
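+
+A sketch of such a request from userspace (the path is illustrative):
+
+	#include <errno.h>
+	#include <sys/mount.h>
+
+	/* The first call marks the mount; a later call unmounts it if it
+	 * has remained unused in the meantime. */
+	if (umount2("/afs/grand.central.org", MNT_EXPIRE) == -1) {
+		if (errno == EAGAIN)
+			; /* now marked for expiry - retry later */
+		else if (errno == EBUSY)
+			; /* mountpoint in use - not marked */
+	}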
+
+Again, the expiration flag is cleared every time anything other than umount()
+looks at a mountpoint.
diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt
new file mode 100644
index 00000000..6e49c363
--- /dev/null
+++ b/Documentation/filesystems/befs.txt
@@ -0,0 +1,117 @@
+BeOS filesystem for Linux
+
+Document last updated: Dec 6, 2001
+
+WARNING
+=======
+Make sure you understand that this is alpha software. This means that the
+implementation is neither complete nor well-tested.
+
+I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
+
+LICENSE
+=======
+This software is covered by the GNU General Public License.
+See the file COPYING for the complete text of the license.
+Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
+
+AUTHOR
+======
+The largest part of the code was written by Will Dyson <will_dyson@pobox.com>.
+He has been working on the code since Aug 13, 2001. See the changelog for
+details.
+
+Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
+His original code can still be found at:
+<http://hp.vector.co.jp/authors/VA008030/bfs/>
+Does anyone know of a more current email address for Makoto? He doesn't
+respond to the address given above...
+
+Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru>
+
+WHAT IS THIS DRIVER?
+====================
+This module implements the native filesystem of BeOS http://www.beincorporated.com/
+for the linux 2.4.1 and later kernels. Currently it is a read-only
+implementation.
+
+Which is it, BFS or BEFS?
+=========================
+Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
+But Unixware Boot Filesystem is called bfs, too. And they are already in
+the kernel. Because of this naming conflict, on Linux the BeOS
+filesystem is called befs.
+
+HOW TO INSTALL
+==============
+step 1. Install the BeFS patch into the source code tree of linux.
+
+Apply the patchfile to your kernel source tree.
+Assuming that your kernel source is in /foo/bar/linux and the patchfile
+is called patch-befs-xxx, you would do the following:
+
+ cd /foo/bar/linux
+ patch -p1 < /path/to/patch-befs-xxx
+
+If the patching step fails (i.e. there are rejected hunks), you can try to
+figure it out yourself (it shouldn't be hard), or mail the maintainer
+(Will Dyson <will_dyson@pobox.com>) for help.
+
+step 2. Configuration & make kernel
+
+The linux kernel has many compile-time options. Most of them are beyond the
+scope of this document. I suggest the Kernel-HOWTO document as a good general
+reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html
+
+However, to use the BeFS module, you must enable it at configure time.
+
+ cd /foo/bar/linux
+ make menuconfig (or xconfig)
+
+The BeFS module is not a standard part of the linux kernel, so you must first
+enable support for experimental code under the "Code maturity level" menu.
+
+Then, under the "Filesystems" menu will be an option called "BeFS
+filesystem (experimental)", or something like that. Enable that option
+(it is fine to make it a module).
+
+Save your kernel configuration and then build your kernel.
+
+step 3. Install
+
+See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
+instructions on this critical step.
+
+USING BFS
+=========
+To use the BeOS filesystem, use filesystem type 'befs'.
+
+ex)
+ mount -t befs /dev/fd0 /beos
+
+MOUNT OPTIONS
+=============
+uid=nnn All files in the partition will be owned by user id nnn.
+gid=nnn All files in the partition will be in group nnn.
+iocharset=xxx Use xxx as the name of the NLS translation table.
+debug The driver will output debugging information to the syslog.
+
+HOW TO GET LATEST VERSION
+=========================
+
+The latest version is currently available at:
+<http://befs-driver.sourceforge.net/>
+
+ANY KNOWN BUGS?
+===============
+As of Jan 20, 2002:
+
+ None
+
+SPECIAL THANKS
+==============
+Dominic Giampaolo ... Writing "Practical File System Design with the Be File System"
+Hiroyuki Yamada ... Testing LinuxPPC.
+
+
+
diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt
new file mode 100644
index 00000000..78043d5a
--- /dev/null
+++ b/Documentation/filesystems/bfs.txt
@@ -0,0 +1,57 @@
+BFS FILESYSTEM FOR LINUX
+========================
+
+The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
+usually contains the kernel image and a few other files required for the
+boot process.
+
+In order to access /stand partition under Linux you obviously need to
+know the partition number and the kernel must support UnixWare disk slices
+(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not
+depend on having UnixWare disklabel support because one can also mount
+BFS filesystem via loopback:
+
+# losetup /dev/loop0 stand.img
+# mount -t bfs /dev/loop0 /mnt/stand
+
+where stand.img is a file containing the image of BFS filesystem.
+When you have finished using it and have unmounted it, you also need to
+deallocate the /dev/loop0 device by:
+
+# losetup -d /dev/loop0
+
+You can simplify mounting by just typing:
+
+# mount -t bfs -o loop stand.img /mnt/stand
+
+This will allocate the first available loopback device (and load the loop.o
+kernel module if necessary) automatically. If the loopback driver is not
+loaded automatically, make sure that you have compiled the module and
+that modprobe is functioning. Beware that umount will not deallocate
+/dev/loopN device if /etc/mtab file on your system is a symbolic link to
+/proc/mounts. You will need to do it manually using "-d" switch of
+losetup(8). Read losetup(8) manpage for more info.
+
+To create the BFS image under UnixWare you need to find out first which
+slice contains it. The command prtvtoc(1M) is your friend:
+
+# prtvtoc /dev/rdsk/c0b0t0d0s0
+
+(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
+look for the slice with tag "STAND", which is usually slice 10. With this
+information you can use dd(1) to create the BFS image:
+
+# umount /stand
+# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
+
+Just in case, you can verify that you have done the right thing by checking
+the magic number:
+
+# od -Ad -tx4 stand.img | more
+
+The first 4 bytes should be 0x1badface.
+
+If you have any patches, questions or suggestions regarding this BFS
+implementation please contact the author:
+
+Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt
new file mode 100644
index 00000000..64087c34
--- /dev/null
+++ b/Documentation/filesystems/btrfs.txt
@@ -0,0 +1,91 @@
+
+ BTRFS
+ =====
+
+Btrfs is a new copy on write filesystem for Linux aimed at
+implementing advanced features while focusing on fault tolerance,
+repair and easy administration. Initially developed by Oracle, Btrfs
+is licensed under the GPL and open for contribution from anyone.
+
+Linux has a wealth of filesystems to choose from, but we are facing a
+number of challenges with scaling to the large storage subsystems that
+are becoming common in today's data centers. Filesystems need to scale
+in their ability to address and manage large storage, and also in
+their ability to detect, repair and tolerate errors in the data stored
+on disk. Btrfs is under heavy development, and is not suitable for
+any uses other than benchmarking and review. The Btrfs disk format is
+not yet finalized.
+
+The main Btrfs features include:
+
+ * Extent based file storage (2^64 max file size)
+ * Space efficient packing of small files
+ * Space efficient indexed directories
+ * Dynamic inode allocation
+ * Writable snapshots
+ * Subvolumes (separate internal filesystem roots)
+ * Object level mirroring and striping
+ * Checksums on data and metadata (multiple algorithms available)
+ * Compression
+ * Integrated multiple device support, with several raid algorithms
+ * Online filesystem check (not yet implemented)
+ * Very fast offline filesystem check
+ * Efficient incremental backup and FS mirroring (not yet implemented)
+ * Online filesystem defragmentation
+
+
+
+ MAILING LIST
+ ============
+
+There is a Btrfs mailing list hosted on vger.kernel.org. You can
+find details on how to subscribe here:
+
+http://vger.kernel.org/vger-lists.html#linux-btrfs
+
+Mailing list archives are available from gmane:
+
+http://dir.gmane.org/gmane.comp.file-systems.btrfs
+
+
+
+ IRC
+ ===
+
+Discussion of Btrfs also occurs on the #btrfs channel of the Freenode
+IRC network.
+
+
+
+ UTILITIES
+ =========
+
+Userspace tools for creating and manipulating Btrfs file systems are
+available from the git repository at the following location:
+
+ http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs-unstable.git
+ git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git
+
+These include the following tools:
+
+mkfs.btrfs: create a filesystem
+
+btrfsctl: control program to create snapshots and subvolumes:
+
+ mount /dev/sda2 /mnt
+ btrfsctl -s new_subvol_name /mnt
+ btrfsctl -s snapshot_of_default /mnt/default
+ btrfsctl -s snapshot_of_new_subvol /mnt/new_subvol_name
+ btrfsctl -s snapshot_of_a_snapshot /mnt/snapshot_of_new_subvol
+ ls /mnt
+ default snapshot_of_a_snapshot snapshot_of_new_subvol
+ new_subvol_name snapshot_of_default
+
+ Snapshots and subvolumes cannot be deleted right now, but you can
+ rm -rf all the files and directories inside them.
+
+btrfsck: do a limited check of the FS extent trees.
+
+btrfs-debug-tree: print all of the FS metadata in text form. Example:
+
+ btrfs-debug-tree /dev/sda2 >& big_output_file
diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt
new file mode 100644
index 00000000..382d52cd
--- /dev/null
+++ b/Documentation/filesystems/caching/backend-api.txt
@@ -0,0 +1,658 @@
+ ==========================
+ FS-CACHE CACHE BACKEND API
+ ==========================
+
+The FS-Cache system provides an API by which actual caches can be supplied to
+FS-Cache for it to then serve out to network filesystems and other interested
+parties.
+
+This API is declared in <linux/fscache-cache.h>.
+
+
+====================================
+INITIALISING AND REGISTERING A CACHE
+====================================
+
+To start off, a cache definition must be initialised and registered for each
+cache the backend wants to make available. For instance, CacheFS does this in
+the fill_super() operation on mounting.
+
+The cache definition (struct fscache_cache) should be initialised by calling:
+
+ void fscache_init_cache(struct fscache_cache *cache,
+ struct fscache_cache_ops *ops,
+ const char *idfmt,
+ ...);
+
+Where:
+
+ (*) "cache" is a pointer to the cache definition;
+
+ (*) "ops" is a pointer to the table of operations that the backend supports on
+ this cache; and
+
+ (*) "idfmt" is a format and printf-style arguments for constructing a label
+ for the cache.
+
+
+The cache should then be registered with FS-Cache by passing a pointer to the
+previously initialised cache definition to:
+
+ int fscache_add_cache(struct fscache_cache *cache,
+ struct fscache_object *fsdef,
+ const char *tagname);
+
+Two extra arguments should also be supplied:
+
+ (*) "fsdef" which should point to the object representation for the FS-Cache
+ master index in this cache. Netfs primary index entries will be created
+ here. FS-Cache keeps the caller's reference to the index object if
+ successful and will release it upon withdrawal of the cache.
+
+ (*) "tagname" which, if given, should be a text string naming this cache. If
+ this is NULL, the identifier will be used instead. For CacheFS, the
+ identifier is set to name the underlying block device and the tag can be
+ supplied by mount.
+
+This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag
+is already in use. 0 will be returned on success.
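+
+For illustration, a backend might wire the two calls together as follows.
+The mycache_* names are placeholders for the backend's own structures and
+its operation table (described further down):
+
+	struct mycache {
+		struct fscache_cache cache;
+		struct fscache_object fsdef;
+	};
+
+	static int mycache_register(struct mycache *mc, const char *tag)
+	{
+		fscache_init_cache(&mc->cache, &mycache_cache_ops, "mycache");
+		fscache_object_init(&mc->fsdef);	/* see the utilities below */
+		return fscache_add_cache(&mc->cache, &mc->fsdef, tag);
+	}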
+
+
+=====================
+UNREGISTERING A CACHE
+=====================
+
+A cache can be withdrawn from the system by calling this function with a
+pointer to the cache definition:
+
+ void fscache_withdraw_cache(struct fscache_cache *cache);
+
+In CacheFS's case, this is called by put_super().
+
+
+========
+SECURITY
+========
+
+The cache methods are executed in one of two contexts:
+
+ (1) that of the userspace process that issued the netfs operation that caused
+ the cache method to be invoked, or
+
+ (2) that of one of the processes in the FS-Cache thread pool.
+
+In either case, this may not be an appropriate context in which to access the
+cache.
+
+The calling process's fsuid, fsgid and SELinux security identities may need to
+be masqueraded for the duration of the cache driver's access to the cache.
+This is left to the cache to handle; FS-Cache makes no effort in this regard.
+
+
+===================================
+CONTROL AND STATISTICS PRESENTATION
+===================================
+
+The cache may present data to the outside world through FS-Cache's interfaces
+in sysfs and procfs - the former for control and the latter for statistics.
+
+A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS
+is enabled. This is accessible through the kobject struct fscache_cache::kobj
+and is for use by the cache as it sees fit.
+
+
+========================
+RELEVANT DATA STRUCTURES
+========================
+
+ (*) Index/Data file FS-Cache representation cookie:
+
+ struct fscache_cookie {
+ struct fscache_object_def *def;
+ struct fscache_netfs *netfs;
+ void *netfs_data;
+ ...
+ };
+
+ The fields that might be of use to the backend describe the object
+ definition, the netfs definition and the netfs's data for this cookie.
+     The object definition contains functions supplied by the netfs for loading
+ and matching index entries; these are required to provide some of the
+ cache operations.
+
+
+ (*) In-cache object representation:
+
+ struct fscache_object {
+ int debug_id;
+ enum {
+ FSCACHE_OBJECT_RECYCLING,
+ ...
+ } state;
+		spinlock_t lock;
+ struct fscache_cache *cache;
+ struct fscache_cookie *cookie;
+ ...
+ };
+
+ Structures of this type should be allocated by the cache backend and
+ passed to FS-Cache when requested by the appropriate cache operation. In
+ the case of CacheFS, they're embedded in CacheFS's internal object
+ structures.
+
+ The debug_id is a simple integer that can be used in debugging messages
+ that refer to a particular object. In such a case it should be printed
+ using "OBJ%x" to be consistent with FS-Cache.
+
+ Each object contains a pointer to the cookie that represents the object it
+     is backing. An object should be retired when put_object() is called if it
+     is in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be
+ initialised by calling fscache_object_init(object).
+
+
+ (*) FS-Cache operation record:
+
+ struct fscache_operation {
+ atomic_t usage;
+ struct fscache_object *object;
+ unsigned long flags;
+ #define FSCACHE_OP_EXCLUSIVE
+ void (*processor)(struct fscache_operation *op);
+ void (*release)(struct fscache_operation *op);
+ ...
+ };
+
+ FS-Cache has a pool of threads that it uses to give CPU time to the
+ various asynchronous operations that need to be done as part of driving
+ the cache. These are represented by the above structure. The processor
+ method is called to give the op CPU time, and the release method to get
+ rid of it when its usage count reaches 0.
+
+ An operation can be made exclusive upon an object by setting the
+ appropriate flag before enqueuing it with fscache_enqueue_operation(). If
+ an operation needs more processing time, it should be enqueued again.
+
+
+ (*) FS-Cache retrieval operation record:
+
+ struct fscache_retrieval {
+ struct fscache_operation op;
+ struct address_space *mapping;
+ struct list_head *to_do;
+ ...
+ };
+
+ A structure of this type is allocated by FS-Cache to record retrieval and
+ allocation requests made by the netfs. This struct is then passed to the
+ backend to do the operation. The backend may get extra refs to it by
+ calling fscache_get_retrieval() and refs may be discarded by calling
+ fscache_put_retrieval().
+
+ A retrieval operation can be used by the backend to do retrieval work. To
+ do this, the retrieval->op.processor method pointer should be set
+ appropriately by the backend and fscache_enqueue_retrieval() called to
+ submit it to the thread pool. CacheFiles, for example, uses this to queue
+ page examination when it detects PG_lock being cleared.
+
+ The to_do field is an empty list available for the cache backend to use as
+ it sees fit.
+
+
+ (*) FS-Cache storage operation record:
+
+ struct fscache_storage {
+ struct fscache_operation op;
+ pgoff_t store_limit;
+ ...
+ };
+
+ A structure of this type is allocated by FS-Cache to record outstanding
+ writes to be made. FS-Cache itself enqueues this operation and invokes
+ the write_page() method on the object at appropriate times to effect
+ storage.
+
+
+================
+CACHE OPERATIONS
+================
+
+The cache backend provides FS-Cache with a table of operations that can be
+performed on the denizens of the cache. These are held in a structure of type:
+
+ struct fscache_cache_ops
+
+ (*) Name of cache provider [mandatory]:
+
+ const char *name
+
+ This isn't strictly an operation, but should be pointed at a string naming
+ the backend.
+
+
+ (*) Allocate a new object [mandatory]:
+
+ struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
+ struct fscache_cookie *cookie)
+
+ This method is used to allocate a cache object representation to back a
+ cookie in a particular cache. fscache_object_init() should be called on
+ the object to initialise it prior to returning.
+
+ This function may also be used to parse the index key to be used for
+ multiple lookup calls to turn it into a more convenient form. FS-Cache
+ will call the lookup_complete() method to allow the cache to release the
+ form once lookup is complete or aborted.
+
+
+ (*) Look up and create object [mandatory]:
+
+ void (*lookup_object)(struct fscache_object *object)
+
+ This method is used to look up an object, given that the object is already
+ allocated and attached to the cookie. This should instantiate that object
+ in the cache if it can.
+
+ The method should call fscache_object_lookup_negative() as soon as
+ possible if it determines the object doesn't exist in the cache. If the
+ object is found to exist and the netfs indicates that it is valid then
+ fscache_obtained_object() should be called once the object is in a
+ position to have data stored in it. Similarly, fscache_obtained_object()
+ should also be called once a non-present object has been created.
+
+ If a lookup error occurs, fscache_object_lookup_error() should be called
+ to abort the lookup of that object.
+
+
+ (*) Release lookup data [mandatory]:
+
+ void (*lookup_complete)(struct fscache_object *object)
+
+ This method is called to ask the cache to release any resources it was
+ using to perform a lookup.
+
+
+ (*) Increment object refcount [mandatory]:
+
+ struct fscache_object *(*grab_object)(struct fscache_object *object)
+
+ This method is called to increment the reference count on an object. It
+ may fail (for instance if the cache is being withdrawn) by returning NULL.
+ It should return the object pointer if successful.
+
+
+ (*) Lock/Unlock object [mandatory]:
+
+ void (*lock_object)(struct fscache_object *object)
+ void (*unlock_object)(struct fscache_object *object)
+
+ These methods are used to exclusively lock an object. It must be possible
+ to schedule with the lock held, so a spinlock isn't sufficient.
+
+
+ (*) Pin/Unpin object [optional]:
+
+ int (*pin_object)(struct fscache_object *object)
+ void (*unpin_object)(struct fscache_object *object)
+
+ These methods are used to pin an object into the cache. Once pinned an
+ object cannot be reclaimed to make space. Return -ENOSPC if there's not
+ enough space in the cache to permit this.
+
+
+ (*) Update object [mandatory]:
+
+ int (*update_object)(struct fscache_object *object)
+
+ This is called to update the index entry for the specified object. The
+ new information should be in object->cookie->netfs_data. This can be
+ obtained by calling object->cookie->def->get_aux()/get_attr().
+
+
+ (*) Discard object [mandatory]:
+
+ void (*drop_object)(struct fscache_object *object)
+
+ This method is called to indicate that an object has been unbound from its
+ cookie, and that the cache should release the object's resources and
+ retire it if it's in state FSCACHE_OBJECT_RECYCLING.
+
+ This method should not attempt to release any references held by the
+ caller. The caller will invoke the put_object() method as appropriate.
+
+
+ (*) Release object reference [mandatory]:
+
+ void (*put_object)(struct fscache_object *object)
+
+ This method is used to discard a reference to an object. The object may
+ be freed when all the references to it are released.
+
+
+ (*) Synchronise a cache [mandatory]:
+
+ void (*sync)(struct fscache_cache *cache)
+
+ This is called to ask the backend to synchronise a cache with its backing
+ device.
+
+
+ (*) Dissociate a cache [mandatory]:
+
+ void (*dissociate_pages)(struct fscache_cache *cache)
+
+ This is called to ask a cache to perform any page dissociations as part of
+ cache withdrawal.
+
+
+ (*) Notification that the attributes on a netfs file changed [mandatory]:
+
+ int (*attr_changed)(struct fscache_object *object);
+
+ This is called to indicate to the cache that certain attributes on a netfs
+ file have changed (for example the maximum size a file may reach). The
+ cache can read these from the netfs by calling the cookie's get_attr()
+ method.
+
+ The cache may use the file size information to reserve space on the cache.
+ It should also call fscache_set_store_limit() to indicate to FS-Cache the
+ highest byte it's willing to store for an object.
+
+ This method may return -ve if an error occurred or the cache object cannot
+ be expanded. In such a case, the object will be withdrawn from service.
+
+ This operation is run asynchronously from FS-Cache's thread pool, and
+ storage and retrieval operations from the netfs are excluded during the
+ execution of this operation.
+
+
+ (*) Reserve cache space for an object's data [optional]:
+
+ int (*reserve_space)(struct fscache_object *object, loff_t size);
+
+ This is called to request that cache space be reserved to hold the data
+ for an object and the metadata used to track it. Zero size should be
+ taken as request to cancel a reservation.
+
+ This should return 0 if successful, -ENOSPC if there isn't enough space
+ available, or -ENOMEM or -EIO on other errors.
+
+ The reservation may exceed the current size of the object, thus permitting
+ future expansion. If the amount of space consumed by an object would
+ exceed the reservation, it's permitted to refuse requests to allocate
+ pages, but not required. An object may be pruned down to its reservation
+ size if larger than that already.
+
+
+ (*) Request page be read from cache [mandatory]:
+
+ int (*read_or_alloc_page)(struct fscache_retrieval *op,
+ struct page *page,
+ gfp_t gfp)
+
+ This is called to attempt to read a netfs page from the cache, or to
+ reserve a backing block if not. FS-Cache will have done as much checking
+ as it can before calling, but most of the work belongs to the backend.
+
+ If there's no page in the cache, then -ENODATA should be returned if the
+ backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it
+ didn't.
+
+ If there is suitable data in the cache, then a read operation should be
+ queued and 0 returned. When the read finishes, fscache_end_io() should be
+ called.
+
+ The fscache_mark_pages_cached() should be called for the page if any cache
+ metadata is retained. This will indicate to the netfs that the page needs
+ explicit uncaching. This operation takes a pagevec, thus allowing several
+ pages to be marked at once.
+
+ The retrieval record pointed to by op should be retained for each page
+ queued and released when I/O on the page has been formally ended.
+ fscache_get/put_retrieval() are available for this purpose.
+
+ The retrieval record may be used to get CPU time via the FS-Cache thread
+ pool. If this is desired, the op->op.processor should be set to point to
+ the appropriate processing routine, and fscache_enqueue_retrieval() should
+ be called at an appropriate point to request CPU time. For instance, the
+ retrieval routine could be enqueued upon the completion of a disk read.
+ The to_do field in the retrieval record is provided to aid in this.
+
+ If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS
+     returned if possible, or fscache_end_io() called with a suitable error
+     code.
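+
+     The expected decision flow can be sketched as follows, where
+     lookup_block(), reserve_block() and queue_read() stand in for
+     backend-specific code:
+
+	int ret;
+
+	ret = lookup_block(op->op.object, page, gfp);
+	if (ret == -ENOENT) {
+		if (reserve_block(op->op.object, page, gfp) < 0)
+			return -ENOBUFS;	/* couldn't reserve a block */
+		return -ENODATA;		/* block reserved, no data yet */
+	}
+	if (ret < 0)
+		return -ENOBUFS;
+	fscache_get_retrieval(op);	/* held until I/O formally ends */
+	queue_read(op, page);		/* completes with fscache_end_io() */
+	return 0;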
+
+
+ (*) Request pages be read from cache [mandatory]:
+
+ int (*read_or_alloc_pages)(struct fscache_retrieval *op,
+ struct list_head *pages,
+ unsigned *nr_pages,
+ gfp_t gfp)
+
+ This is like the read_or_alloc_page() method, except it is handed a list
+ of pages instead of one page. Any pages on which a read operation is
+ started must be added to the page cache for the specified mapping and also
+ to the LRU. Such pages must also be removed from the pages list and
+ *nr_pages decremented per page.
+
+ If there was an error such as -ENOMEM, then that should be returned; else
+ if one or more pages couldn't be read or allocated, then -ENOBUFS should
+ be returned; else if one or more pages couldn't be read, then -ENODATA
+ should be returned. If all the pages are dispatched then 0 should be
+ returned.
+
+
+ (*) Request page be allocated in the cache [mandatory]:
+
+ int (*allocate_page)(struct fscache_retrieval *op,
+ struct page *page,
+ gfp_t gfp)
+
+ This is like the read_or_alloc_page() method, except that it shouldn't
+ read from the cache, even if there's data there that could be retrieved.
+ It should, however, set up any internal metadata required such that
+ the write_page() method can write to the cache.
+
+ If there's no backing block available, then -ENOBUFS should be returned
+ (or -ENOMEM if there were other problems). If a block is successfully
+ allocated, then the netfs page should be marked and 0 returned.
+
+
+ (*) Request pages be allocated in the cache [mandatory]:
+
+ int (*allocate_pages)(struct fscache_retrieval *op,
+ struct list_head *pages,
+ unsigned *nr_pages,
+ gfp_t gfp)
+
+     This is a multiple page version of the allocate_page() method. pages and
+ nr_pages should be treated as for the read_or_alloc_pages() method.
+
+
+ (*) Request page be written to cache [mandatory]:
+
+ int (*write_page)(struct fscache_storage *op,
+ struct page *page);
+
+ This is called to write from a page on which there was a previously
+ successful read_or_alloc_page() call or similar. FS-Cache filters out
+ pages that don't have mappings.
+
+ This method is called asynchronously from the FS-Cache thread pool. It is
+ not required to actually store anything, provided -ENODATA is then
+ returned to the next read of this page.
+
+ If an error occurred, then a negative error code should be returned,
+ otherwise zero should be returned. FS-Cache will take appropriate action
+ in response to an error, such as withdrawing this object.
+
+ If this method returns success then FS-Cache will inform the netfs
+ appropriately.
+
+
+ (*) Discard retained per-page metadata [mandatory]:
+
+ void (*uncache_page)(struct fscache_object *object, struct page *page)
+
+ This is called when a netfs page is being evicted from the pagecache. The
+ cache backend should tear down any internal representation or tracking it
+ maintains for this page.
+
+
+==================
+FS-CACHE UTILITIES
+==================
+
+FS-Cache provides some utilities that a cache backend may make use of:
+
+ (*) Note occurrence of an I/O error in a cache:
+
+ void fscache_io_error(struct fscache_cache *cache)
+
+ This tells FS-Cache that an I/O error occurred in the cache. After this
+ has been called, only resource dissociation operations (object and page
+ release) will be passed from the netfs to the cache backend for the
+ specified cache.
+
+ This does not actually withdraw the cache. That must be done separately.
+
+
+ (*) Invoke the retrieval I/O completion function:
+
+ void fscache_end_io(struct fscache_retrieval *op, struct page *page,
+ int error);
+
+ This is called to note the end of an attempt to retrieve a page. The
+ error value should be 0 if successful and an error otherwise.
+
+
+ (*) Set highest store limit:
+
+ void fscache_set_store_limit(struct fscache_object *object,
+ loff_t i_size);
+
+ This sets the limit FS-Cache imposes on the highest byte it's willing to
+ try and store for a netfs. Any page over this limit is automatically
+ rejected by fscache_read_alloc_page() and co with -ENOBUFS.
+
+
+ (*) Mark pages as being cached:
+
+ void fscache_mark_pages_cached(struct fscache_retrieval *op,
+ struct pagevec *pagevec);
+
+ This marks a set of pages as being cached. After this has been called,
+ the netfs must call fscache_uncache_page() to unmark the pages.
+
+
+ (*) Perform coherency check on an object:
+
+ enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
+ const void *data,
+ uint16_t datalen);
+
+ This asks the netfs to perform a coherency check on an object that has
+ just been looked up. The cookie attached to the object will determine the
+ netfs to use. data and datalen should specify where the auxiliary data
+ retrieved from the cache can be found.
+
+ One of three values will be returned:
+
+ (*) FSCACHE_CHECKAUX_OKAY
+
+ The coherency data indicates the object is valid as is.
+
+ (*) FSCACHE_CHECKAUX_NEEDS_UPDATE
+
+ The coherency data needs updating, but otherwise the object is
+ valid.
+
+ (*) FSCACHE_CHECKAUX_OBSOLETE
+
+ The coherency data indicates that the object is obsolete and should
+ be discarded.
+
+
+ (*) Initialise a freshly allocated object:
+
+ void fscache_object_init(struct fscache_object *object);
+
+ This initialises all the fields in an object representation.
+
+
+ (*) Indicate the destruction of an object:
+
+ void fscache_object_destroyed(struct fscache_cache *cache);
+
+ This must be called to inform FS-Cache that an object that belonged to a
+ cache has been destroyed and deallocated. This will allow continuation
+ of the cache withdrawal process when it is stopped pending destruction of
+ all the objects.
+
+
+ (*) Indicate negative lookup on an object:
+
+ void fscache_object_lookup_negative(struct fscache_object *object);
+
+ This is called to indicate to FS-Cache that a lookup process for an object
+ found a negative result.
+
+ This changes the state of an object to permit reads pending on lookup
+ completion to go off and start fetching data from the netfs server as it's
+ known at this point that there can't be any data in the cache.
+
+ This may be called multiple times on an object. Only the first call is
+ significant - all subsequent calls are ignored.
+
+
+ (*) Indicate an object has been obtained:
+
+ void fscache_obtained_object(struct fscache_object *object);
+
+ This is called to indicate to FS-Cache that a lookup process for an object
+ produced a positive result, or that an object was created. This should
+ only be called once for any particular object.
+
+ This changes the state of an object to indicate:
+
+ (1) if no call to fscache_object_lookup_negative() has been made on
+ this object, that there may be data available, and that reads can
+ now go and look for it; and
+
+ (2) that writes may now proceed against this object.
+
+
+ (*) Indicate that object lookup failed:
+
+ void fscache_object_lookup_error(struct fscache_object *object);
+
+ This marks an object as having encountered a fatal error (usually EIO)
+ and causes it to move into a state whereby it will be withdrawn as soon
+ as possible.
+
+
+ (*) Get and release references on a retrieval record:
+
+ void fscache_get_retrieval(struct fscache_retrieval *op);
+ void fscache_put_retrieval(struct fscache_retrieval *op);
+
+ These two functions are used to retain a retrieval record whilst doing
+ asynchronous data retrieval and block allocation.
+
+
+ (*) Enqueue a retrieval record for processing.
+
+ void fscache_enqueue_retrieval(struct fscache_retrieval *op);
+
+ This enqueues a retrieval record for processing by the FS-Cache thread
+ pool. One of the threads in the pool will invoke the retrieval record's
+ op->op.processor callback function. This function may be called from
+ within the callback function.
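+
+     For example, a backend might pin the record over an asynchronous disk
+     read and defer the rest of the work to the thread pool (a sketch; the
+     mybackend_*() helpers are hypothetical):
+
+	static void mybackend_io_done(struct fscache_retrieval *op)
+	{
+		/* hand the record to the thread pool, which will invoke
+		 * op->op.processor, then drop the I/O's reference */
+		fscache_enqueue_retrieval(op);
+		fscache_put_retrieval(op);
+	}
+
+	static int mybackend_begin_read(struct fscache_retrieval *op)
+	{
+		fscache_get_retrieval(op);	/* held across the async I/O */
+		return mybackend_submit_read(op, mybackend_io_done);
+	}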
+
+
+ (*) List of object state names:
+
+ const char *fscache_object_states[];
+
+ For debugging purposes, this may be used to turn the state that an object
+ is in into a text string for display purposes.
diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt
new file mode 100644
index 00000000..748a1ae4
--- /dev/null
+++ b/Documentation/filesystems/caching/cachefiles.txt
@@ -0,0 +1,501 @@
+ ===============================================
+ CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
+ ===============================================
+
+Contents:
+
+ (*) Overview.
+
+ (*) Requirements.
+
+ (*) Configuration.
+
+ (*) Starting the cache.
+
+ (*) Things to avoid.
+
+ (*) Cache culling.
+
+ (*) Cache structure.
+
+ (*) Security model and SELinux.
+
+ (*) A note on security.
+
+ (*) Statistical information.
+
+ (*) Debugging.
+
+
+========
+OVERVIEW
+========
+
+CacheFiles is a caching backend that uses a directory on an already mounted
+filesystem of a local type (such as Ext3) as its cache.
+
+CacheFiles uses a userspace daemon to do some of the cache management - such as
+reaping stale nodes and culling. This is called cachefilesd and lives in
+/sbin.
+
+The filesystem and data integrity of the cache are only as good as those of the
+filesystem providing the backing services. Note that CacheFiles does not
+attempt to journal anything since the journalling interfaces of the various
+filesystems are very specific in nature.
+
+CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
+to communicate with the daemon.  Only one thing may have this open at once,
+and whilst it is open, a cache is at least partially in existence. The daemon
+opens this and sends commands down it to control the cache.
+
+CacheFiles is currently limited to a single cache.
+
+CacheFiles attempts to maintain at least a certain percentage of free space on
+the filesystem, shrinking the cache by culling the objects it contains to make
+space if necessary - see the "Cache Culling" section. This means it can be
+placed on the same medium as a live set of data, and will expand to make use of
+spare space and automatically contract when the set of data requires more
+space.
+
+
+============
+REQUIREMENTS
+============
+
+The use of CacheFiles and its daemon requires the following features to be
+available in the system and in the cache filesystem:
+
+ - dnotify.
+
+ - extended attributes (xattrs).
+
+ - openat() and friends.
+
+ - bmap() support on files in the filesystem (FIBMAP ioctl).
+
+ - The use of bmap() to detect a partial page at the end of the file.
+
+It is strongly recommended that the "dir_index" option is enabled on Ext3
+filesystems being used as a cache.
+
+
+=============
+CONFIGURATION
+=============
+
+The cache is configured by a script in /etc/cachefilesd.conf. These commands
+set the cache up ready for use.  The following script commands are available:
+
+ (*) brun <N>%
+ (*) bcull <N>%
+ (*) bstop <N>%
+ (*) frun <N>%
+ (*) fcull <N>%
+ (*) fstop <N>%
+
+     Configure the culling limits.  Optional.  See the section on culling
+     below.  The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
+
+ The commands beginning with a 'b' are file space (block) limits, those
+ beginning with an 'f' are file count limits.
+
+ (*) dir <path>
+
+ Specify the directory containing the root of the cache. Mandatory.
+
+ (*) tag <name>
+
+ Specify a tag to FS-Cache to use in distinguishing multiple caches.
+ Optional. The default is "CacheFiles".
+
+ (*) debug <mask>
+
+ Specify a numeric bitmask to control debugging in the kernel module.
+ Optional. The default is zero (all off). The following values can be
+ OR'd into the mask to collect various information:
+
+ 1 Turn on trace of function entry (_enter() macros)
+ 2 Turn on trace of function exit (_leave() macros)
+ 4 Turn on trace of internal debug points (_debug())
+
+ This mask can also be set through sysfs, eg:
+
+	echo 5 >/sys/module/cachefiles/parameters/debug
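+
+Putting these commands together, a minimal /etc/cachefilesd.conf might look
+like this (the values shown are purely illustrative, not recommendations):
+
+	dir /var/fscache
+	tag mycache
+	brun 10%
+	bcull 7%
+	bstop 3%
+	frun 10%
+	fcull 7%
+	fstop 3%
+	debug 0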
+
+
+==================
+STARTING THE CACHE
+==================
+
+The cache is started by running the daemon. The daemon opens the cache device,
+configures the cache and tells it to begin caching. At that point the cache
+binds to fscache and the cache becomes live.
+
+The daemon is run as follows:
+
+ /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
+
+The flags are:
+
+ (*) -d
+
+ Increase the debugging level. This can be specified multiple times and
+ is cumulative with itself.
+
+ (*) -s
+
+ Send messages to stderr instead of syslog.
+
+ (*) -n
+
+     Don't daemonise and go into the background.
+
+ (*) -f <configfile>
+
+ Use an alternative configuration file rather than the default one.
+
+
+===============
+THINGS TO AVOID
+===============
+
+Do not mount other things within the cache as this will cause problems. The
+kernel module contains its own very cut-down path walking facility that ignores
+mountpoints, but the daemon can't avoid them.
+
+Do not create, rename or unlink files and directories in the cache whilst the
+cache is active, as this may cause the state to become uncertain.
+
+Renaming files in the cache might make objects appear to be other objects (the
+filename is part of the lookup key).
+
+Do not change or remove the extended attributes attached to cache files by the
+cache as this will cause the cache state management to get confused.
+
+Do not create files or directories in the cache, lest the cache get confused or
+serve incorrect data.
+
+Do not chmod files in the cache. The module creates things with minimal
+permissions to prevent random users being able to access them directly.
+
+
+=============
+CACHE CULLING
+=============
+
+The cache may need culling occasionally to make space. This involves
+discarding objects from the cache that have been used less recently than
+anything else. Culling is based on the access time of data objects. Empty
+directories are culled if not in use.
+
+Cache culling is done on the basis of the percentage of blocks and the
+percentage of files available in the underlying filesystem. There are six
+"limits":
+
+ (*) brun
+ (*) frun
+
+ If the amount of free space and the number of available files in the cache
+ rises above both these limits, then culling is turned off.
+
+ (*) bcull
+ (*) fcull
+
+ If the amount of available space or the number of available files in the
+ cache falls below either of these limits, then culling is started.
+
+ (*) bstop
+ (*) fstop
+
+ If the amount of available space or the number of available files in the
+ cache falls below either of these limits, then no further allocation of
+ disk space or files is permitted until culling has raised things above
+ these limits again.
+
+These must be configured thusly:
+
+ 0 <= bstop < bcull < brun < 100
+ 0 <= fstop < fcull < frun < 100
+
+Note that these are percentages of available space and available files, and do
+_not_ appear as 100 minus the percentage displayed by the "df" program.
+
+The userspace daemon scans the cache to build up a table of cullable objects.
+These are then culled in least recently used order. A new scan of the cache is
+started as soon as space is made in the table. Objects will be skipped if
+their atimes have changed or if the kernel module says it is still using them.
+
+
+===============
+CACHE STRUCTURE
+===============
+
+The CacheFiles module will create two directories in the directory it was
+given:
+
+ (*) cache/
+
+ (*) graveyard/
+
+The active cache objects all reside in the first directory. The CacheFiles
+kernel module moves any retired or culled objects that it can't simply unlink
+to the graveyard from which the daemon will actually delete them.
+
+The daemon uses dnotify to monitor the graveyard directory, and will delete
+anything that appears therein.
+
+
+The module represents index objects as directories with the filename "I..." or
+"J...". Note that the "cache/" directory is itself a special index.
+
+Data objects are represented as files if they have no children, or directories
+if they do. Their filenames all begin "D..." or "E...". If represented as a
+directory, data objects will have a file in the directory called "data" that
+actually holds the data.
+
+Special objects are similar to data objects, except their filenames begin
+"S..." or "T...".
+
+
+If an object has children, then it will be represented as a directory.
+Immediately in the representative directory are a collection of directories
+named for hash values of the child object keys with an '@' prepended. Into
+this directory, if possible, will be placed the representations of the child
+objects:
+
+ INDEX INDEX INDEX DATA FILES
+ ========= ========== ================================= ================
+ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
+ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
+ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
+ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
+
+
+If the key is so long that it exceeds NAME_MAX with the decorations added on to
+it, then it will be cut into pieces, the first few of which will be used to
+make a nest of directories, and the last one of which will name the object
+inside the last directory.  The names of the intermediate directories will have
+'+' prepended:
+
+ J1223/@23/+xy...z/+kl...m/Epqr
+
+
+Note that keys are raw data, and not only may they exceed NAME_MAX in size,
+they may also contain things like '/' and NUL characters, and so they may not
+be suitable for turning directly into a filename.
+
+To handle this, CacheFiles will use a suitably printable filename directly and
+"base-64" encode ones that aren't directly suitable. The two versions of
+object filenames indicate the encoding:
+
+ OBJECT TYPE PRINTABLE ENCODED
+ =============== =============== ===============
+ Index "I..." "J..."
+ Data "D..." "E..."
+ Special "S..." "T..."
+
+Intermediate directories are always "@" or "+" as appropriate.
+
+
+Each object in the cache has an extended attribute label that holds the object
+type ID (required to distinguish special objects) and the auxiliary data from
+the netfs. The latter is used to detect stale objects in the cache and update
+or retire them.
+
+
+Note that CacheFiles will erase from the cache any file it doesn't recognise or
+any file of an incorrect type (such as a FIFO file or a device file).
+
+
+==========================
+SECURITY MODEL AND SELINUX
+==========================
+
+CacheFiles is implemented to deal properly with the LSM security features of
+the Linux kernel and the SELinux facility.
+
+One of the problems that CacheFiles faces is that it is generally acting on
+behalf of a process, and running in that process's context, and that includes a
+security context that is not appropriate for accessing the cache - either
+because the files in the cache are inaccessible to that process, or because if
+the process creates a file in the cache, that file may be inaccessible to other
+processes.
+
+The way CacheFiles works is to temporarily change the security context (fsuid,
+fsgid and actor security label) that the process acts as - without changing the
+security context of the process when it is the target of an operation performed
+by some other process (so signalling and suchlike still work correctly).
+
+
+When the CacheFiles module is asked to bind to its cache, it:
+
+ (1) Finds the security label attached to the root cache directory and uses
+ that as the security label with which it will create files. By default,
+ this is:
+
+ cachefiles_var_t
+
+ (2) Finds the security label of the process which issued the bind request
+ (presumed to be the cachefilesd daemon), which by default will be:
+
+ cachefilesd_t
+
+ and asks LSM to supply a security ID as which it should act given the
+ daemon's label. By default, this will be:
+
+ cachefiles_kernel_t
+
+ SELinux transitions the daemon's security ID to the module's security ID
+ based on a rule of this form in the policy.
+
+ type_transition <daemon's-ID> kernel_t : process <module's-ID>;
+
+ For instance:
+
+ type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
+
+
+The module's security ID gives it permission to create, move and remove files
+and directories in the cache, to find and access directories and files in the
+cache, to set and access extended attributes on cache objects, and to read and
+write files in the cache.
+
+The daemon's security ID gives it only a very restricted set of permissions: it
+may scan directories, stat files and erase files and directories. It may
+not read or write files in the cache, and so it is precluded from accessing the
+data cached therein; nor is it permitted to create new files in the cache.
+
+
+There are policy source files available in:
+
+ http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
+
+and later versions. In that tarball, see the files:
+
+ cachefilesd.te
+ cachefilesd.fc
+ cachefilesd.if
+
+They are built and installed directly by the RPM.
+
+If a non-RPM based system is being used, then copy the above files to their own
+directory and run:
+
+ make -f /usr/share/selinux/devel/Makefile
+ semodule -i cachefilesd.pp
+
+You will need checkpolicy and selinux-policy-devel installed prior to the
+build.
+
+
+By default, the cache is located in /var/fscache, but if it is desirable that
+it should be elsewhere, then either the above policy files must be altered, or
+an auxiliary policy must be installed to label the alternate location of the
+cache.
+
+For instructions on how to add an auxiliary policy to enable the cache to be
+located elsewhere when SELinux is in enforcing mode, please see:
+
+ /usr/share/doc/cachefilesd-*/move-cache.txt
+
+when the cachefilesd rpm is installed; alternatively, the document can be found
+in the sources.
+
+
+==================
+A NOTE ON SECURITY
+==================
+
+CacheFiles makes use of the split security in the task_struct. It allocates
+its own task_security structure, and redirects current->cred to point to it
+when it acts on behalf of another process, in that process's context.
+
+The reason it does this is that it calls vfs_mkdir() and suchlike rather than
+bypassing security and calling inode ops directly. Therefore the VFS and LSM
+may deny CacheFiles access to the cache data because under some
+circumstances the caching code is running in the security context of whatever
+process issued the original syscall on the netfs.
+
+Furthermore, should CacheFiles create a file or directory, the security
+parameters with which that object is created (UID, GID, security label) would
+be derived from the process that issued the system call, thus potentially
+preventing other processes from accessing the cache - including CacheFiles's
+cache management daemon (cachefilesd).
+
+What is required is to temporarily override the security of the process that
+issued the system call. We can't, however, just do an in-place change of the
+security data as that affects the process as an object, not just as a subject.
+This means it may lose signals or ptrace events for example, and affects what
+the process looks like in /proc.
+
+So CacheFiles makes use of a logical split in the security between the
+objective security (task->real_cred) and the subjective security (task->cred).
+The objective security holds the intrinsic security properties of a process and
+is never overridden. This is what appears in /proc, and is what is used when a
+process is the target of an operation by some other process (SIGKILL for
+example).
+
+The subjective security holds the active security properties of a process, and
+may be overridden.  This is not seen externally, and is used when a process
+acts upon another object, for example SIGKILLing another process or opening a
+file.
+
+LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
+for CacheFiles to run in a context of a specific security label, or to create
+files and directories with another security label.
+
+
+=======================
+STATISTICAL INFORMATION
+=======================
+
+If FS-Cache is compiled with the following option enabled:
+
+ CONFIG_CACHEFILES_HISTOGRAM=y
+
+then it will gather certain statistics and display them through a proc file.
+
+ (*) /proc/fs/cachefiles/histogram
+
+ cat /proc/fs/cachefiles/histogram
+ JIFS SECS LOOKUPS MKDIRS CREATES
+ ===== ===== ========= ========= =========
+
+     This shows a breakdown of how long a variety of tasks took to run,
+     covering times between 0 jiffies and HZ-1 jiffies.  The columns are as
+     follows:
+
+ COLUMN TIME MEASUREMENT
+ ======= =======================================================
+ LOOKUPS Length of time to perform a lookup on the backing fs
+ MKDIRS Length of time to perform a mkdir on the backing fs
+ CREATES Length of time to perform a create on the backing fs
+
+ Each row shows the number of events that took a particular range of times.
+ Each step is 1 jiffy in size. The JIFS column indicates the particular
+ jiffy range covered, and the SECS field the equivalent number of seconds.
+
+
+=========
+DEBUGGING
+=========
+
+If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime
+debugging enabled by adjusting the value in:
+
+ /sys/module/cachefiles/parameters/debug
+
+This is a bitmask of debugging streams to enable:
+
+ BIT VALUE STREAM POINT
+ ======= ======= =============================== =======================
+ 0 1 General Function entry trace
+ 1 2 Function exit trace
+ 2 4 General
+
+The appropriate set of values should be OR'd together and the result written to
+the control file. For example:
+
+	echo $((1|4)) >/sys/module/cachefiles/parameters/debug
+
+will turn on function entry and general debugging.
diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
new file mode 100644
index 00000000..770267af
--- /dev/null
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -0,0 +1,443 @@
+ ==========================
+ General Filesystem Caching
+ ==========================
+
+========
+OVERVIEW
+========
+
+This facility is a general purpose cache for network filesystems, though it
+could be used for caching other things such as ISO9660 filesystems too.
+
+FS-Cache mediates between cache backends (such as CacheFS) and network
+filesystems:
+
+ +---------+
+ | | +--------------+
+ | NFS |--+ | |
+ | | | +-->| CacheFS |
+ +---------+ | +----------+ | | /dev/hda5 |
+ | | | | +--------------+
+ +---------+ +-->| | |
+ | | | |--+
+ | AFS |----->| FS-Cache |
+ | | | |--+
+ +---------+ +-->| | |
+ | | | | +--------------+
+ +---------+ | +----------+ | | |
+ | | | +-->| CacheFiles |
+ | ISOFS |--+ | /var/cache |
+ | | +--------------+
+ +---------+
+
+Or to look at it another way, FS-Cache is a module that provides a caching
+facility to a network filesystem such that the cache is transparent to the
+user:
+
+ +---------+
+ | |
+ | Server |
+ | |
+ +---------+
+ | NETWORK
+ ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ |
+ | +----------+
+ V | |
+ +---------+ | |
+ | | | |
+ | NFS |----->| FS-Cache |
+ | | | |--+
+ +---------+ | | | +--------------+ +--------------+
+ | | | | | | | |
+ V +----------+ +-->| CacheFiles |-->| Ext3 |
+ +---------+ | /var/cache | | /dev/sda6 |
+ | | +--------------+ +--------------+
+ | VFS | ^ ^
+ | | | |
+ +---------+ +--------------+ |
+ | KERNEL SPACE | |
+ ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
+ | USER SPACE | |
+ V | |
+ +---------+ +--------------+
+ | | | |
+ | Process | | cachefilesd |
+ | | | |
+ +---------+ +--------------+
+
+
+FS-Cache does not follow the idea of loading every opened netfs file in its
+entirety into a cache before permitting it to be accessed and then serving the
+pages out of that cache rather than the netfs inode because:
+
+ (1) It must be practical to operate without a cache.
+
+ (2) The size of any accessible file must not be limited to the size of the
+ cache.
+
+ (3) The combined size of all opened files (this includes mapped libraries)
+ must not be limited to the size of the cache.
+
+ (4) The user should not be forced to download an entire file just to do a
+ one-off access of a small portion of it (such as might be done with the
+ "file" program).
+
+It instead serves the cache out in PAGE_SIZE chunks as and when requested by
+the netfs('s) using it.
+
+
+FS-Cache provides the following facilities:
+
+ (1) More than one cache can be used at once. Caches can be selected
+ explicitly by use of tags.
+
+ (2) Caches can be added / removed at any time.
+
+ (3) The netfs is provided with an interface that allows either party to
+ withdraw caching facilities from a file (required for (2)).
+
+ (4) The interface to the netfs returns as few errors as possible, preferring
+ rather to let the netfs remain oblivious.
+
+ (5) Cookies are used to represent indices, files and other objects to the
+ netfs. The simplest cookie is just a NULL pointer - indicating nothing
+ cached there.
+
+ (6) The netfs is allowed to propose - dynamically - any index hierarchy it
+ desires, though it must be aware that the index search function is
+ recursive, stack space is limited, and indices can only be children of
+ indices.
+
+ (7) Data I/O is done direct to and from the netfs's pages. The netfs
+ indicates that page A is at index B of the data-file represented by cookie
+ C, and that it should be read or written. The cache backend may or may
+ not start I/O on that page, but if it does, a netfs callback will be
+ invoked to indicate completion. The I/O may be either synchronous or
+ asynchronous.
+
+ (8) Cookies can be "retired" upon release. At this point FS-Cache will mark
+ them as obsolete and the index hierarchy rooted at that point will get
+ recycled.
+
+ (9) The netfs provides a "match" function for index searches. In addition to
+ saying whether a match was made or not, this can also specify that an
+ entry should be updated or deleted.
+
+(10) As much as possible is done asynchronously.
+
+
+FS-Cache maintains a virtual indexing tree in which all indices, files, objects
+and pages are kept. Bits of this tree may actually reside in one or more
+caches.
+
+ FSDEF
+ |
+ +------------------------------------+
+ | |
+ NFS AFS
+ | |
+ +--------------------------+ +-----------+
+ | | | |
+ homedir mirror afs.org redhat.com
+ | | |
+ +------------+ +---------------+ +----------+
+ | | | | | |
+ 00001 00002 00007 00125 vol00001 vol00002
+ | | | | |
+ +---+---+ +-----+ +---+ +------+------+ +-----+----+
+ | | | | | | | | | | | | |
+PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
+ | |
+ PG0 +-------+
+ | |
+ 00001 00003
+ |
+ +---+---+
+ | | |
+ PG0 PG1 PG2
+
+In the example above, you can see two netfs's being backed: NFS and AFS. These
+have different index hierarchies:
+
+ (*) The NFS primary index contains per-server indices. Each server index is
+ indexed by NFS file handles to get data file objects. Each data file
+     object can have an array of pages, but may also have further child
+ objects, such as extended attributes and directory entries. Extended
+ attribute objects themselves have page-array contents.
+
+ (*) The AFS primary index contains per-cell indices. Each cell index contains
+     per-logical-volume indices.  Each volume index contains up to three
+ indices for the read-write, read-only and backup mirrors of those volumes.
+ Each of these contains vnode data file objects, each of which contains an
+ array of pages.
+
+The very top index is the FS-Cache master index in which individual netfs's
+have entries.
+
+Any index object may reside in more than one cache, provided it only has index
+children. Any index with non-index object children will be assumed to only
+reside in one cache.
+
+
+The netfs API to FS-Cache can be found in:
+
+ Documentation/filesystems/caching/netfs-api.txt
+
+The cache backend API to FS-Cache can be found in:
+
+ Documentation/filesystems/caching/backend-api.txt
+
+A description of the internal representations and object state machine can be
+found in:
+
+ Documentation/filesystems/caching/object.txt
+
+
+=======================
+STATISTICAL INFORMATION
+=======================
+
+If FS-Cache is compiled with the following options enabled:
+
+ CONFIG_FSCACHE_STATS=y
+ CONFIG_FSCACHE_HISTOGRAM=y
+
+then it will gather certain statistics and display them through a number of
+proc files.
+
+ (*) /proc/fs/fscache/stats
+
+ This shows counts of a number of events that can happen in FS-Cache:
+
+ CLASS EVENT MEANING
+ ======= ======= =======================================================
+ Cookies idx=N Number of index cookies allocated
+ dat=N Number of data storage cookies allocated
+ spc=N Number of special cookies allocated
+ Objects alc=N Number of objects allocated
+ nal=N Number of object allocation failures
+ avl=N Number of objects that reached the available state
+ ded=N Number of objects that reached the dead state
+ ChkAux non=N Number of objects that didn't have a coherency check
+ ok=N Number of objects that passed a coherency check
+ upd=N Number of objects that needed a coherency data update
+ obs=N Number of objects that were declared obsolete
+ Pages mrk=N Number of pages marked as being cached
+ unc=N Number of uncache page requests seen
+ Acquire n=N Number of acquire cookie requests seen
+ nul=N Number of acq reqs given a NULL parent
+ noc=N Number of acq reqs rejected due to no cache available
+ ok=N Number of acq reqs succeeded
+ nbf=N Number of acq reqs rejected due to error
+ oom=N Number of acq reqs failed on ENOMEM
+ Lookups n=N Number of lookup calls made on cache backends
+ neg=N Number of negative lookups made
+ pos=N Number of positive lookups made
+ crt=N Number of objects created by lookup
+ tmo=N Number of lookups timed out and requeued
+ Updates n=N Number of update cookie requests seen
+ nul=N Number of upd reqs given a NULL parent
+ run=N Number of upd reqs granted CPU time
+ Relinqs n=N Number of relinquish cookie requests seen
+ nul=N Number of rlq reqs given a NULL parent
+ wcr=N Number of rlq reqs waited on completion of creation
+ AttrChg n=N Number of attribute changed requests seen
+ ok=N Number of attr changed requests queued
+ nbf=N Number of attr changed rejected -ENOBUFS
+ oom=N Number of attr changed failed -ENOMEM
+ run=N Number of attr changed ops given CPU time
+ Allocs n=N Number of allocation requests seen
+ ok=N Number of successful alloc reqs
+ wt=N Number of alloc reqs that waited on lookup completion
+ nbf=N Number of alloc reqs rejected -ENOBUFS
+ int=N Number of alloc reqs aborted -ERESTARTSYS
+ ops=N Number of alloc reqs submitted
+ owt=N Number of alloc reqs waited for CPU time
+ abt=N Number of alloc reqs aborted due to object death
+ Retrvls n=N Number of retrieval (read) requests seen
+ ok=N Number of successful retr reqs
+ wt=N Number of retr reqs that waited on lookup completion
+ nod=N Number of retr reqs returned -ENODATA
+ nbf=N Number of retr reqs rejected -ENOBUFS
+ int=N Number of retr reqs aborted -ERESTARTSYS
+ oom=N Number of retr reqs failed -ENOMEM
+ ops=N Number of retr reqs submitted
+ owt=N Number of retr reqs waited for CPU time
+ abt=N Number of retr reqs aborted due to object death
+ Stores n=N Number of storage (write) requests seen
+ ok=N Number of successful store reqs
+ agn=N Number of store reqs on a page already pending storage
+ nbf=N Number of store reqs rejected -ENOBUFS
+ oom=N Number of store reqs failed -ENOMEM
+ ops=N Number of store reqs submitted
+ run=N Number of store reqs granted CPU time
+ pgs=N Number of pages given store req processing time
+ rxd=N Number of store reqs deleted from tracking tree
+ olm=N Number of store reqs over store limit
+ VmScan nos=N Number of release reqs against pages with no pending store
+ gon=N Number of release reqs against pages stored by time lock granted
+ bsy=N Number of release reqs ignored due to in-progress store
+ can=N Number of page stores cancelled due to release req
+ Ops pend=N Number of times async ops added to pending queues
+ run=N Number of times async ops given CPU time
+ enq=N Number of times async ops queued for processing
+ can=N Number of async ops cancelled
+ rej=N Number of async ops rejected due to object lookup/create failure
+ dfr=N Number of async ops queued for deferred release
+ rel=N Number of async ops released
+ gc=N Number of deferred-release async ops garbage collected
+ CacheOp alo=N Number of in-progress alloc_object() cache ops
+ luo=N Number of in-progress lookup_object() cache ops
+ luc=N Number of in-progress lookup_complete() cache ops
+ gro=N Number of in-progress grab_object() cache ops
+ upo=N Number of in-progress update_object() cache ops
+ dro=N Number of in-progress drop_object() cache ops
+ pto=N Number of in-progress put_object() cache ops
+ syn=N Number of in-progress sync_cache() cache ops
+ atc=N Number of in-progress attr_changed() cache ops
+ rap=N Number of in-progress read_or_alloc_page() cache ops
+ ras=N Number of in-progress read_or_alloc_pages() cache ops
+ alp=N Number of in-progress allocate_page() cache ops
+ als=N Number of in-progress allocate_pages() cache ops
+ wrp=N Number of in-progress write_page() cache ops
+ ucp=N Number of in-progress uncache_page() cache ops
+ dsp=N Number of in-progress dissociate_pages() cache ops
+
+
+ (*) /proc/fs/fscache/histogram
+
+ cat /proc/fs/fscache/histogram
+ JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
+ ===== ===== ========= ========= ========= ========= =========
+
+     This shows a breakdown of how long a variety of tasks took to run,
+     covering times between 0 jiffies and HZ-1 jiffies.  The columns are as
+     follows:
+
+ COLUMN TIME MEASUREMENT
+ ======= =======================================================
+ OBJ INST Length of time to instantiate an object
+ OP RUNS Length of time a call to process an operation took
+ OBJ RUNS Length of time a call to process an object event took
+	RETRV DLY	Time between requesting a read and lookup completing
+ RETRIEVLS Time between beginning and end of a retrieval
+
+ Each row shows the number of events that took a particular range of times.
+ Each step is 1 jiffy in size. The JIFS column indicates the particular
+ jiffy range covered, and the SECS field the equivalent number of seconds.
+
+
+===========
+OBJECT LIST
+===========
+
+If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
+list of all the objects currently allocated and allow them to be viewed
+through:
+
+ /proc/fs/fscache/objects
+
+This will look something like:
+
+ [root@andromeda ~]# head /proc/fs/fscache/objects
+ OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
+ ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
+ 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
+ 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
+
+where the first set of columns before the '|' describe the object:
+
+ COLUMN DESCRIPTION
+ ======= ===============================================================
+ OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
+ PARENT Debugging ID of parent object
+ STAT Object state
+ CHLDN Number of child objects of this object
+ OPS Number of outstanding operations on this object
+ OOP Number of outstanding child object management operations
+ IPR
+ EX Number of outstanding exclusive operations
+ READS Number of outstanding read operations
+ EM Object's event mask
+ EV Events raised on this object
+ F Object flags
+ S Object work item busy state mask (1:pending 2:running)
+
+and the second set of columns describe the object's cookie, if present:
+
+ COLUMN DESCRIPTION
+ =============== =======================================================
+ NETFS_COOKIE_DEF Name of netfs cookie definition
+ TY Cookie type (IX - index, DT - data, hex - special)
+ FL Cookie flags
+ NETFS_DATA Netfs private data stored in the cookie
+ OBJECT_KEY Object key } 1 column, with separating comma
+ AUX_DATA Object aux data } presence may be configured
+
+The data shown may be filtered by attaching a key to an appropriate keyring
+before viewing the file. Something like:
+
+ keyctl add user fscache:objlist <restrictions> @s
+
+where <restrictions> are a selection of the following letters:
+
+ K Show hexdump of object key (don't show if not given)
+ A Show hexdump of object aux data (don't show if not given)
+
+and the following paired letters:
+
+ C Show objects that have a cookie
+ c Show objects that don't have a cookie
+ B Show objects that are busy
+ b Show objects that aren't busy
+ W Show objects that have pending writes
+ w Show objects that don't have pending writes
+ R Show objects that have outstanding reads
+ r Show objects that don't have outstanding reads
+ S Show objects that have work queued
+ s Show objects that don't have work queued
+
+If neither side of a letter pair is given, then both are implied. For example:
+
+ keyctl add user fscache:objlist KB @s
+
+shows objects that are busy, and lists their object keys, but does not dump
+their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
+not implied.
+
+By default all objects and all fields will be shown.
+
+
+=========
+DEBUGGING
+=========
+
+If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
+debugging enabled by adjusting the value in:
+
+ /sys/module/fscache/parameters/debug
+
+This is a bitmask of debugging streams to enable:
+
+ BIT VALUE STREAM POINT
+ ======= ======= =============================== =======================
+ 0 1 Cache management Function entry trace
+ 1 2 Function exit trace
+ 2 4 General
+ 3 8 Cookie management Function entry trace
+ 4 16 Function exit trace
+ 5 32 General
+ 6 64 Page handling Function entry trace
+ 7 128 Function exit trace
+ 8 256 General
+ 9 512 Operation management Function entry trace
+ 10 1024 Function exit trace
+ 11 2048 General
+
+The appropriate set of values should be OR'd together and the result written to
+the control file. For example:
+
+ echo $((1|8|64)) >/sys/module/fscache/parameters/debug
+
+will turn on all function entry debugging.
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
new file mode 100644
index 00000000..7cc6bf28
--- /dev/null
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -0,0 +1,813 @@
+ ===============================
+ FS-CACHE NETWORK FILESYSTEM API
+ ===============================
+
+There's an API by which a network filesystem can make use of the FS-Cache
+facilities. This is based around a number of principles:
+
+ (1) Caches can store a number of different object types. There are two main
+ object types: indices and files. The first is a special type used by
+ FS-Cache to make finding objects faster and to make retiring of groups of
+ objects easier.
+
+ (2) Every index, file or other object is represented by a cookie. This cookie
+ may or may not have anything associated with it, but the netfs doesn't
+ need to care.
+
+ (3) Barring the top-level index (one entry per cached netfs), the index
+     hierarchy for each netfs is structured according to the whim of the netfs.
+
+This API is declared in <linux/fscache.h>.
+
+This document contains the following sections:
+
+ (1) Network filesystem definition
+ (2) Index definition
+ (3) Object definition
+ (4) Network filesystem (un)registration
+ (5) Cache tag lookup
+ (6) Index registration
+ (7) Data file registration
+ (8) Miscellaneous object registration
+ (9) Setting the data file size
+ (10) Page alloc/read/write
+ (11) Page uncaching
+ (12) Index and data file update
+ (13) Miscellaneous cookie operations
+ (14) Cookie unregistration
+ (15) Index and data file invalidation
+ (16) FS-Cache specific page flags.
+
+
+=============================
+NETWORK FILESYSTEM DEFINITION
+=============================
+
+FS-Cache needs a description of the network filesystem. This is specified
+using a record of the following structure:
+
+ struct fscache_netfs {
+ uint32_t version;
+ const char *name;
+ struct fscache_cookie *primary_index;
+ ...
+ };
+
+The first two fields should be filled in before registration, and the third
+will be filled in by the registration function; any other fields should just be
+ignored and are for internal use only.
+
+The fields are:
+
+ (1) The name of the netfs (used as the key in the toplevel index).
+
+ (2) The version of the netfs (if the name matches but the version doesn't, the
+ entire in-cache hierarchy for this netfs will be scrapped and begun
+ afresh).
+
+ (3) The cookie representing the primary index will be allocated according to
+ another parameter passed into the registration function.
+
+For example, kAFS (linux/fs/afs/) uses the following definitions to describe
+itself:
+
+ struct fscache_netfs afs_cache_netfs = {
+ .version = 0,
+ .name = "afs",
+ };
+
+
+================
+INDEX DEFINITION
+================
+
+Indices are used for two purposes:
+
+ (1) To aid the finding of a file based on a series of keys (such as AFS's
+ "cell", "volume ID", "vnode ID").
+
+ (2) To make it easier to discard a subset of all the files cached based around
+ a particular key - for instance to mirror the removal of an AFS volume.
+
+However, since it's unlikely that any two netfs's are going to want to define
+their index hierarchies in quite the same way, FS-Cache tries to impose as few
+restraints as possible on how an index is structured and where it is placed in
+the tree. The netfs can even mix indices and data files at the same level, but
+it's not recommended.
+
+Each index entry consists of a key of indeterminate length plus some auxiliary
+data, also of indeterminate length.
+
+There are some limits on indices:
+
+ (1) Any index containing non-index objects should be restricted to a single
+ cache. Any such objects created within an index will be created in the
+ first cache only. The cache in which an index is created can be
+ controlled by cache tags (see below).
+
+ (2) The entry data must be atomically journallable, so it is limited to about
+ 400 bytes at present. At least 400 bytes will be available.
+
+ (3) The depth of the index tree should be judged with care as the search
+ function is recursive. Too many layers will run the kernel out of stack.
+
+
+=================
+OBJECT DEFINITION
+=================
+
+To define an object, a structure of the following type should be filled out:
+
+ struct fscache_cookie_def
+ {
+ uint8_t name[16];
+ uint8_t type;
+
+ struct fscache_cache_tag *(*select_cache)(
+ const void *parent_netfs_data,
+ const void *cookie_netfs_data);
+
+ uint16_t (*get_key)(const void *cookie_netfs_data,
+ void *buffer,
+ uint16_t bufmax);
+
+ void (*get_attr)(const void *cookie_netfs_data,
+ uint64_t *size);
+
+ uint16_t (*get_aux)(const void *cookie_netfs_data,
+ void *buffer,
+ uint16_t bufmax);
+
+ enum fscache_checkaux (*check_aux)(void *cookie_netfs_data,
+ const void *data,
+ uint16_t datalen);
+
+ void (*get_context)(void *cookie_netfs_data, void *context);
+
+ void (*put_context)(void *cookie_netfs_data, void *context);
+
+ void (*mark_pages_cached)(void *cookie_netfs_data,
+ struct address_space *mapping,
+ struct pagevec *cached_pvec);
+
+ void (*now_uncached)(void *cookie_netfs_data);
+ };
+
+This has the following fields:
+
+ (1) The type of the object [mandatory].
+
+ This is one of the following values:
+
+ (*) FSCACHE_COOKIE_TYPE_INDEX
+
+ This defines an index, which is a special FS-Cache type.
+
+ (*) FSCACHE_COOKIE_TYPE_DATAFILE
+
+ This defines an ordinary data file.
+
+ (*) Any other value between 2 and 255
+
+ This defines an extraordinary object such as an XATTR.
+
+ (2) The name of the object type (NUL terminated unless all 16 chars are used)
+ [optional].
+
+ (3) A function to select the cache in which to store an index [optional].
+
+ This function is invoked when an index needs to be instantiated in a cache
+ during the instantiation of a non-index object. Only the immediate index
+ parent for the non-index object will be queried. Any indices above that
+ in the hierarchy may be stored in multiple caches. This function does not
+ need to be supplied for any non-index object or any index that will only
+ have index children.
+
+ If this function is not supplied or if it returns NULL then the first
+ cache in the parent's list will be chosen, or failing that, the first
+ cache in the master list.
+
+ (4) A function to retrieve an object's key from the netfs [mandatory].
+
+ This function will be called with the netfs data that was passed to the
+ cookie acquisition function and the maximum length of key data that it may
+ provide. It should write the required key data into the given buffer and
+ return the quantity it wrote.
+
+ (5) A function to retrieve attribute data from the netfs [optional].
+
+ This function will be called with the netfs data that was passed to the
+ cookie acquisition function. It should return the size of the file if
+     this is a data file.  The size may be used to govern how much space must
+     be reserved for this file in the cache.
+
+ If the function is absent, a file size of 0 is assumed.
+
+ (6) A function to retrieve auxiliary data from the netfs [optional].
+
+ This function will be called with the netfs data that was passed to the
+ cookie acquisition function and the maximum length of auxiliary data that
+ it may provide. It should write the auxiliary data into the given buffer
+ and return the quantity it wrote.
+
+ If this function is absent, the auxiliary data length will be set to 0.
+
+ The length of the auxiliary data buffer may be dependent on the key
+ length. A netfs mustn't rely on being able to provide more than 400 bytes
+ for both.
+
+ (7) A function to check the auxiliary data [optional].
+
+ This function will be called to check that a match found in the cache for
+ this object is valid. For instance with AFS it could check the auxiliary
+ data against the data version number returned by the server to determine
+ whether the index entry in a cache is still valid.
+
+ If this function is absent, it will be assumed that matching objects in a
+ cache are always valid.
+
+ If present, the function should return one of the following values:
+
+ (*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is
+ (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update
+ (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted
+
+ This function can also be used to extract data from the auxiliary data in
+ the cache and copy it into the netfs's structures.
+
+ (8) A pair of functions to manage contexts for the completion callback
+ [optional].
+
+ The cache read/write functions are passed a context which is then passed
+ to the I/O completion callback function. To ensure this context remains
+ valid until after the I/O completion is called, two functions may be
+ provided: one to get an extra reference on the context, and one to drop a
+ reference to it.
+
+ If the context is not used or is a type of object that won't go out of
+ scope, then these functions are not required. These functions are not
+ required for indices as indices may not contain data. These functions may
+ be called in interrupt context and so may not sleep.
+
+ (9) A function to mark a page as retaining cache metadata [optional].
+
+ This is called by the cache to indicate that it is retaining in-memory
+ information for this page and that the netfs should uncache the page when
+ it has finished. This does not indicate whether there's data on the disk
+ or not. Note that several pages at once may be presented for marking.
+
+ The PG_fscache bit is set on the pages before this function would be
+ called, so the function need not be provided if this is sufficient.
+
+ This function is not required for indices as they're not permitted data.
+
+(10) A function to unmark all the pages retaining cache metadata [mandatory].
+
+ This is called by FS-Cache to indicate that a backing store is being
+ unbound from a cookie and that all the marks on the pages should be
+ cleared to prevent confusion. Note that the cache will have torn down all
+ its tracking information so that the pages don't need to be explicitly
+ uncached.
+
+ This function is not required for indices as they're not permitted data.
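+
+To illustrate, a hypothetical netfs might define its data file objects like
+this (a sketch only; the myfs_* names and the myfs_inode structure are
+invented):
+
+	static uint16_t myfs_get_key(const void *cookie_netfs_data,
+				     void *buffer, uint16_t bufmax)
+	{
+		const struct myfs_inode *mi = cookie_netfs_data;
+
+		if (bufmax < sizeof(mi->fid))
+			return 0;
+		memcpy(buffer, &mi->fid, sizeof(mi->fid));
+		return sizeof(mi->fid);
+	}
+
+	static void myfs_get_attr(const void *cookie_netfs_data,
+				  uint64_t *size)
+	{
+		const struct myfs_inode *mi = cookie_netfs_data;
+
+		*size = mi->remote_size;
+	}
+
+	static const struct fscache_cookie_def myfs_file_cookie_def = {
+		.name		= "myfs.file",
+		.type		= FSCACHE_COOKIE_TYPE_DATAFILE,
+		.get_key	= myfs_get_key,
+		.get_attr	= myfs_get_attr,
+	};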
+
+
+===================================
+NETWORK FILESYSTEM (UN)REGISTRATION
+===================================
+
+The first step is to declare the network filesystem to the cache. This also
+involves specifying the layout of the primary index (for AFS, this would be the
+"cell" level).
+
+The registration function is:
+
+ int fscache_register_netfs(struct fscache_netfs *netfs);
+
+It just takes a pointer to the netfs definition. It returns 0 or an error as
+appropriate.
+
+For kAFS, registration is done as follows:
+
+ ret = fscache_register_netfs(&afs_cache_netfs);
+
+The last step is, of course, unregistration:
+
+ void fscache_unregister_netfs(struct fscache_netfs *netfs);
+
+
+================
+CACHE TAG LOOKUP
+================
+
+FS-Cache permits the use of more than one cache. To permit particular index
+subtrees to be bound to particular caches, the second step is to look up cache
+representation tags. This step is optional; it can be left entirely up to
+FS-Cache as to which cache should be used. The problem with doing that is that
+FS-Cache will always pick the first cache that was registered.
+
+To get the representation for a named tag:
+
+ struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
+
+This takes a text string as the name and returns a representation of a tag. It
+will never return an error. It may return a dummy tag, however, if it runs out
+of memory; this will inhibit caching with this tag.
+
+Any representation so obtained must be released by passing it to this function:
+
+ void fscache_release_cache_tag(struct fscache_cache_tag *tag);
+
+The tag will be retrieved by FS-Cache when it calls the object definition
+operation select_cache().
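+
+As an example, a netfs could look its tag up once at initialisation time and
+hand it back from its select_cache() operation (a sketch; the myfs_* names
+are hypothetical):
+
+	static struct fscache_cache_tag *myfs_cache_tag;
+
+	static int myfs_init_caching(void)
+	{
+		/* never fails, but may return a dummy tag on OOM, which
+		 * inhibits caching with this tag */
+		myfs_cache_tag = fscache_lookup_cache_tag("myfs_cache");
+		return 0;
+	}
+
+	static struct fscache_cache_tag *myfs_select_cache(
+		const void *parent_netfs_data,
+		const void *cookie_netfs_data)
+	{
+		return myfs_cache_tag;
+	}
+
+	static void myfs_exit_caching(void)
+	{
+		fscache_release_cache_tag(myfs_cache_tag);
+	}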
+
+
+==================
+INDEX REGISTRATION
+==================
+
+The third step is to inform FS-Cache about part of an index hierarchy that can
+be used to locate files. This is done by requesting a cookie for each index in
+the path to the file:
+
+ struct fscache_cookie *
+ fscache_acquire_cookie(struct fscache_cookie *parent,
+			       const struct fscache_cookie_def *def,
+ void *netfs_data);
+
+This function creates an index entry in the index represented by parent,
+filling in the index entry by calling the operations pointed to by def.
+
+Note that this function never returns an error - all errors are handled
+internally. It may, however, return NULL to indicate no cookie. It is quite
+acceptable to pass this token back to this function as the parent to another
+acquisition (or even to the relinquish cookie, read page and write page
+functions - see below).
+
+Note also that no indices are actually created in a cache until a non-index
+object needs to be created somewhere down the hierarchy. Furthermore, an index
+may be created in several different caches independently at different times.
+This is all handled transparently, and the netfs doesn't see any of it.
+
+For example, with AFS, a cell would be added to the primary index. This index
+entry would have a dependent inode containing a volume location index for the
+volume mappings within this cell:
+
+ cell->cache =
+ fscache_acquire_cookie(afs_cache_netfs.primary_index,
+ &afs_cell_cache_index_def,
+ cell);
+
+Then when a volume location was accessed, it would be entered into the cell's
+index and an inode would be allocated that acts as a volume type and hash chain
+combination:
+
+ vlocation->cache =
+ fscache_acquire_cookie(cell->cache,
+ &afs_vlocation_cache_index_def,
+ vlocation);
+
+And then a particular flavour of volume (R/O for example) could be added to
+that index, creating another index for vnodes (AFS inode equivalents):
+
+ volume->cache =
+ fscache_acquire_cookie(vlocation->cache,
+ &afs_volume_cache_index_def,
+ volume);
+
+
+======================
+DATA FILE REGISTRATION
+======================
+
+The fourth step is to request a data file be created in the cache. This is
+identical to index cookie acquisition. The only difference is that the type in
+the object definition should be something other than index type.
+
+ vnode->cache =
+ fscache_acquire_cookie(volume->cache,
+ &afs_vnode_cache_object_def,
+ vnode);
+
+
+=================================
+MISCELLANEOUS OBJECT REGISTRATION
+=================================
+
+An optional step is to request an object of miscellaneous type be created in
+the cache. This is almost identical to index cookie acquisition. The only
+difference is that the type in the object definition should be something other
+than index type. Whilst the parent object could be an index, it's more likely
+it would be some other type of object such as a data file.
+
+ xattr->cache =
+ fscache_acquire_cookie(vnode->cache,
+ &afs_xattr_cache_object_def,
+ xattr);
+
+Miscellaneous objects might be used to store extended attributes or directory
+entries for example.
+
+
+==========================
+SETTING THE DATA FILE SIZE
+==========================
+
+The fifth step is to set the physical attributes of the file, such as its size.
+This doesn't automatically reserve any space in the cache, but permits the
+cache to adjust its metadata for data tracking appropriately:
+
+ int fscache_attr_changed(struct fscache_cookie *cookie);
+
+The cache will return -ENOBUFS if there is no backing cache or if there is no
+space to allocate any extra metadata required in the cache. The attributes
+will be accessed with the get_attr() cookie definition operation.
+
+Note that attempts to read or write data pages in the cache over this size may
+be rebuffed with -ENOBUFS.
+
+This operation schedules an attribute adjustment to happen asynchronously at
+some point in the future, and as such, it may happen after the function returns
+to the caller. The attribute adjustment excludes read and write operations.
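+
+For example, a netfs might call this when a server reply reveals a change in
+file size (a sketch; the myfs_* names are hypothetical):
+
+	static void myfs_note_new_size(struct myfs_inode *mi, uint64_t size)
+	{
+		mi->remote_size = size;	/* read back via get_attr() */
+
+		/* -ENOBUFS means no backing cache or no room for metadata */
+		if (fscache_attr_changed(mi->cookie) == -ENOBUFS)
+			myfs_disable_caching(mi);	/* hypothetical */
+	}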
+
+
+=====================
+PAGE READ/ALLOC/WRITE
+=====================
+
+And the sixth step is to store and retrieve pages in the cache. There are
+three functions that are used to do this.
+
+Note:
+
+ (1) A page should not be re-read or re-allocated without uncaching it first.
+
+ (2) A read or allocated page must be uncached when the netfs page is released
+ from the pagecache.
+
+ (3) A page should only be written to the cache if previously read or allocated.
+
+This permits the cache to maintain its page tracking in proper order.
+
+
+PAGE READ
+---------
+
+Firstly, the netfs should ask FS-Cache to examine the caches and read the
+contents cached for a particular page of a particular file if present, or else
+allocate space to store the contents if not:
+
+ typedef
+ void (*fscache_rw_complete_t)(struct page *page,
+ void *context,
+ int error);
+
+ int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+ struct page *page,
+ fscache_rw_complete_t end_io_func,
+ void *context,
+ gfp_t gfp);
+
+The cookie argument must specify a cookie for an object that isn't an index,
+the page specified will have the data loaded into it (and is also used to
+specify the page number), and the gfp argument is used to control how any
+memory allocations made are satisfied.
+
+If the cookie indicates the inode is not cached:
+
+ (1) The function will return -ENOBUFS.
+
+Else if there's a copy of the page resident in the cache:
+
+ (1) The mark_pages_cached() cookie operation will be called on that page.
+
+ (2) The function will submit a request to read the data from the cache's
+ backing device directly into the page specified.
+
+ (3) The function will return 0.
+
+ (4) When the read is complete, end_io_func() will be invoked with:
+
+ (*) The netfs data supplied when the cookie was created.
+
+ (*) The page descriptor.
+
+ (*) The context argument passed to the above function. This will be
+ maintained with the get_context/put_context functions mentioned above.
+
+ (*) An argument that's 0 on success or negative for an error code.
+
+ If an error occurs, it should be assumed that the page contains no usable
+ data.
+
+     end_io_func() will be called in process context if the read results in
+ an error, but it might be called in interrupt context if the read is
+ successful.
+
+Otherwise, if there's not a copy available in cache, but the cache may be able
+to store the page:
+
+ (1) The mark_pages_cached() cookie operation will be called on that page.
+
+ (2) A block may be reserved in the cache and attached to the object at the
+ appropriate place.
+
+ (3) The function will return -ENODATA.
+
+This function may also return -ENOMEM or -EINTR, in which case it won't have
+read any data from the cache.
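+
+By way of illustration, a netfs readpage() implementation might drive this as
+follows (a minimal sketch; the myfs_* helpers are hypothetical and error
+handling is abbreviated):
+
+	static void myfs_read_done(struct page *page, void *context, int error)
+	{
+		/* beware: may be called in interrupt context on success */
+		if (!error)
+			SetPageUptodate(page);
+		unlock_page(page);
+	}
+
+	static int myfs_readpage(struct file *file, struct page *page)
+	{
+		struct myfs_inode *mi = myfs_i(page->mapping->host);
+		int ret;
+
+		ret = fscache_read_or_alloc_page(mi->cookie, page,
+						 myfs_read_done, NULL,
+						 GFP_KERNEL);
+		if (ret == 0)
+			return 0;	/* read from the cache is in flight */
+
+		/* -ENOBUFS, -ENODATA, -ENOMEM or -EINTR: read from the
+		 * server instead; the fresh data may then be copied into
+		 * the cache with fscache_write_page() */
+		return myfs_readpage_from_server(file, page);
+	}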
+
+
+PAGE ALLOCATE
+-------------
+
+Alternatively, if there's not expected to be any data in the cache for a page
+because the file has been extended, a block can simply be allocated instead:
+
+ int fscache_alloc_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp);
+
+This is similar to the fscache_read_or_alloc_page() function, except that it
+never reads from the cache. It will return 0 if a block has been allocated,
+rather than -ENODATA as the other would. One or the other must be performed
+before writing to the cache.
+
+The mark_pages_cached() cookie operation will be called on the page if
+successful.
+
+
+PAGE WRITE
+----------
+
+Secondly, if the netfs changes the contents of the page (either due to an
+initial download or if a user performs a write), then the page should be
+written back to the cache:
+
+ int fscache_write_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp);
+
+The cookie argument must specify a data file cookie, the page specified should
+contain the data to be written (and is also used to specify the page number),
+and the gfp argument is used to control how any memory allocations made are
+satisfied.
+
+The page must have first been read or allocated successfully and must not have
+been uncached before writing is performed.
+
+If the cookie indicates the inode is not cached then:
+
+ (1) The function will return -ENOBUFS.
+
+Else if space can be allocated in the cache to hold this page:
+
+ (1) PG_fscache_write will be set on the page.
+
+ (2) The function will submit a request to write the data to cache's backing
+ device directly from the page specified.
+
+ (3) The function will return 0.
+
+ (4) When the write is complete PG_fscache_write is cleared on the page and
+ anyone waiting for that bit will be woken up.
+
+Else if there's no space available in the cache, -ENOBUFS will be returned. It
+is also possible for the PG_fscache_write bit to be cleared when no write took
+place if unforeseen circumstances arose (such as a disk error).
+
+Writing takes place asynchronously.
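+
+For instance, after downloading a page from the server, a netfs might copy it
+into the cache like this (a sketch; the myfs_* names are hypothetical).  The
+page must previously have been read or allocated, and uncaching on failure
+keeps the cache's page tracking consistent:
+
+	static void myfs_write_to_cache(struct myfs_inode *mi,
+					struct page *page)
+	{
+		if (fscache_write_page(mi->cookie, page, GFP_KERNEL) != 0)
+			fscache_uncache_page(mi->cookie, page);
+	}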
+
+
+MULTIPLE PAGE READ
+------------------
+
+A facility is provided to read several pages at once, as requested by the
+readpages() address space operation:
+
+ int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+ struct address_space *mapping,
+ struct list_head *pages,
+ int *nr_pages,
+ fscache_rw_complete_t end_io_func,
+ void *context,
+ gfp_t gfp);
+
+This works in a similar way to fscache_read_or_alloc_page(), except:
+
+ (1) Any page it can retrieve data for is removed from pages and nr_pages and
+ dispatched for reading to the disk. Reads of adjacent pages on disk may
+ be merged for greater efficiency.
+
+ (2) The mark_pages_cached() cookie operation will be called on several pages
+ at once if they're being read or allocated.
+
+ (3) If there was a general error, then that error will be returned.
+
+ Else if some pages couldn't be allocated or read, then -ENOBUFS will be
+ returned.
+
+ Else if some pages couldn't be read but were allocated, then -ENODATA will
+ be returned.
+
+ Otherwise, if all pages had reads dispatched, then 0 will be returned, the
+ list will be empty and *nr_pages will be 0.
+
+ (4) end_io_func will be called once for each page being read as the reads
+ complete. It will be called in process context if error != 0, but it may
+ be called in interrupt context if there is no error.
+
+Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude
+some of the pages being read and some being allocated. Those pages will have
+been marked appropriately and will need uncaching.
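+
+Putting this together, a readpages() implementation might look roughly like
+the sketch below.  myfs_i(), myfs_readpages_from_server() and the completion
+handler are hypothetical; only the fscache_read_or_alloc_pages() call itself
+is part of the API:
+
+	static void myfs_read_complete(struct page *page, void *context,
+				       int error)
+	{
+		/* beware: may be called in interrupt context on success */
+		if (!error)
+			SetPageUptodate(page);
+		unlock_page(page);
+	}
+
+	static int myfs_readpages(struct file *file,
+				  struct address_space *mapping,
+				  struct list_head *pages, unsigned nr_pages)
+	{
+		struct fscache_cookie *cookie = myfs_i(mapping->host)->fscache;
+		int npages = nr_pages;
+		int ret;
+
+		ret = fscache_read_or_alloc_pages(cookie, mapping, pages,
+						  &npages, myfs_read_complete,
+						  NULL, GFP_KERNEL);
+		if (ret == 0 && npages == 0)
+			return 0;	/* everything dispatched to the cache */
+
+		/* pages left on the list must be fetched from the server */
+		return myfs_readpages_from_server(file, mapping, pages, npages);
+	}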
+
+
+==============
+PAGE UNCACHING
+==============
+
+To uncache a page, this function should be called:
+
+ void fscache_uncache_page(struct fscache_cookie *cookie,
+ struct page *page);
+
+This function permits the cache to release any in-memory representation it
+might be holding for this netfs page. This function must be called once for
+each page on which the read or write page functions above have been called to
+make sure the cache's in-memory tracking information gets torn down.
+
+Note that pages can't be explicitly deleted from a data file.  The whole
+data file must be retired (see the relinquish cookie function below).
+
+Furthermore, note that this does not cancel the asynchronous read or write
+operation started by the read/alloc and write functions, so the page
+invalidation functions must use:
+
+ bool fscache_check_page_write(struct fscache_cookie *cookie,
+ struct page *page);
+
+to see if a page is being written to the cache, and:
+
+ void fscache_wait_on_page_write(struct fscache_cookie *cookie,
+ struct page *page);
+
+to wait for it to finish if it is.
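+
+For instance, an invalidatepage() implementation might be sketched as follows
+(myfs_i() is hypothetical; PG_fscache/PageFsCache is described at the end of
+this document):
+
+	static void myfs_invalidatepage(struct page *page, unsigned long offset)
+	{
+		struct fscache_cookie *cookie =
+			myfs_i(page->mapping->host)->fscache;
+
+		if (offset == 0 && PageFsCache(page)) {
+			/* let any in-flight store finish, then drop the
+			 * cache's interest in the page */
+			fscache_wait_on_page_write(cookie, page);
+			fscache_uncache_page(cookie, page);
+		}
+	}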
+
+
+When releasepage() is being implemented, a special FS-Cache function exists to
+manage the heuristics of coping with vmscan trying to eject pages, which may
+conflict with the cache trying to write pages to the cache (which may itself
+need to allocate memory):
+
+ bool fscache_maybe_release_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp);
+
+This takes the netfs cookie, and the page and gfp arguments as supplied to
+releasepage(). It will return false if the page cannot be released yet for
+some reason and if it returns true, the page has been uncached and can now be
+released.
+
+To make a page available for release, this function may wait for an outstanding
+storage request to complete, or it may attempt to cancel the storage request -
+in which case the page will not be stored in the cache this time.
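+
+A releasepage() implementation might then reduce to something like this
+sketch (again with a hypothetical myfs_i() accessor):
+
+	static int myfs_releasepage(struct page *page, gfp_t gfp)
+	{
+		struct fscache_cookie *cookie =
+			myfs_i(page->mapping->host)->fscache;
+
+		if (PageFsCache(page) &&
+		    !fscache_maybe_release_page(cookie, page, gfp))
+			return 0;	/* the cache isn't done with it yet */
+
+		return 1;	/* uncached; the page can now be released */
+	}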
+
+
+BULK INODE PAGE UNCACHE
+-----------------------
+
+A convenience routine is provided to perform an uncache on all the pages
+attached to an inode. This assumes that the pages on the inode correspond on a
+1:1 basis with the pages in the cache.
+
+ void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie,
+ struct inode *inode);
+
+This takes the netfs cookie that the pages were cached with and the inode that
+the pages are attached to. This function will wait for pages to finish being
+written to the cache and for the cache to finish with the page generally. No
+error is returned.
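+
+A typical caller would be the netfs's inode-disposal path, sketched here (the
+hook name and the surrounding teardown are illustrative only):
+
+	static void myfs_dispose_inode(struct inode *inode)
+	{
+		/* drop all cache interest before the inode goes away */
+		fscache_uncache_all_inode_pages(myfs_i(inode)->fscache, inode);
+		truncate_inode_pages(&inode->i_data, 0);
+	}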
+
+
+==========================
+INDEX AND DATA FILE UPDATE
+==========================
+
+To request an update of the index data for an index or other object, the
+following function should be called:
+
+ void fscache_update_cookie(struct fscache_cookie *cookie);
+
+This function will refer back to the netfs_data pointer stored in the cookie by
+the acquisition function to obtain the data to write into each revised index
+entry. The update method in the parent index definition will be called to
+transfer the data.
+
+Note that partial updates may happen automatically at other times, such as when
+data blocks are added to a data file object.
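+
+As an example, a netfs might request an update after changing the attributes
+that feed its coherency data.  In this sketch myfs_apply_setattr() and
+myfs_i() are hypothetical stand-ins for the netfs's own code:
+
+	static int myfs_setattr(struct dentry *dentry, struct iattr *attr)
+	{
+		struct inode *inode = dentry->d_inode;
+		int ret;
+
+		ret = myfs_apply_setattr(inode, attr);
+		if (ret == 0)
+			/* mtime/size changed: refresh the auxiliary data
+			 * held in the cache */
+			fscache_update_cookie(myfs_i(inode)->fscache);
+		return ret;
+	}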
+
+
+===============================
+MISCELLANEOUS COOKIE OPERATIONS
+===============================
+
+There are a number of operations that can be used to control cookies:
+
+ (*) Cookie pinning:
+
+ int fscache_pin_cookie(struct fscache_cookie *cookie);
+ void fscache_unpin_cookie(struct fscache_cookie *cookie);
+
+ These operations permit data cookies to be pinned into the cache and to
+ have the pinning removed. They are not permitted on index cookies.
+
+ The pinning function will return 0 if successful, -ENOBUFS if the cookie
+ isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning,
+ -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
+ -EIO if there's any other problem.
+
+ (*) Data space reservation:
+
+ int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
+
+ This permits a netfs to request cache space be reserved to store up to the
+ given amount of a file. It is permitted to ask for more than the current
+ size of the file to allow for future file expansion.
+
+ If size is given as zero then the reservation will be cancelled.
+
+ The function will return 0 if successful, -ENOBUFS if the cookie isn't
+ backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations,
+ -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
+ -EIO if there's any other problem.
+
+ Note that this doesn't pin an object in a cache; it can still be culled to
+ make space if it's not in use.
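+
+A hedged sketch of how a netfs might use the reservation call, treating lack
+of support as non-fatal (the policy shown is illustrative, not prescribed):
+
+	static int myfs_reserve_cache_space(struct fscache_cookie *cookie,
+					    loff_t expected_size)
+	{
+		int ret = fscache_reserve_space(cookie, expected_size);
+
+		if (ret == -EOPNOTSUPP || ret == -ENOBUFS)
+			return 0;	/* no cache or no reservation support:
+					 * just carry on without one */
+		return ret;	/* 0 on success, or -ENOSPC/-ENOMEM/-EIO */
+	}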
+
+
+=====================
+COOKIE UNREGISTRATION
+=====================
+
+To get rid of a cookie, this function should be called.
+
+ void fscache_relinquish_cookie(struct fscache_cookie *cookie,
+ int retire);
+
+If retire is non-zero, then the object will be marked for recycling, and all
+copies of it will be removed from all active caches in which it is present.
+Not only that but all child objects will also be retired.
+
+If retire is zero, then the object may be available again when next the
+acquisition function is called. Retirement here will overrule the pinning on a
+cookie.
+
+One very important note - relinquish must NOT be called for a cookie unless all
+the cookies for "child" indices, objects and pages have been relinquished
+first.
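+
+For example (a sketch; the cookie field is hypothetical), a netfs might drop
+a data file cookie when the corresponding inode is finally disposed of:
+
+	/* all child cookies and cached pages must already be gone */
+	fscache_relinquish_cookie(myfs_i(inode)->fscache, 0);
+	myfs_i(inode)->fscache = NULL;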
+
+
+================================
+INDEX AND DATA FILE INVALIDATION
+================================
+
+There is no direct way to invalidate an index subtree or a data file. To do
+this, the caller should relinquish and retire the cookie they have, and then
+acquire a new one.
+
+
+===========================
+FS-CACHE SPECIFIC PAGE FLAG
+===========================
+
+FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is
+given the alternative name PG_fscache.
+
+PG_fscache is used to indicate that the page is known by the cache, and that
+the cache must be informed if the page is going to go away. It's an indication
+to the netfs that the cache has an interest in this page, where an interest may
+be a pointer to it, resources allocated or reserved for it, or I/O in progress
+upon it.
+
+The netfs can use this information in methods such as releasepage() to
+determine whether it needs to uncache a page or update it.
+
+Furthermore, if this bit is set, releasepage() and invalidatepage() operations
+will be called on a page to get rid of it, even if PG_private is not set. This
+allows caching to be attempted on a page before read_cache_pages() is called
+after fscache_read_or_alloc_pages(), as the former will try to release pages
+it was given under certain circumstances.
+
+This bit does not overlap with flags such as PG_private.  This means that FS-Cache
+can be used with a filesystem that uses the block buffering code.
+
+There are a number of operations defined on this flag:
+
+ int PageFsCache(struct page *page);
+ void SetPageFsCache(struct page *page)
+ void ClearPageFsCache(struct page *page)
+ int TestSetPageFsCache(struct page *page)
+ int TestClearPageFsCache(struct page *page)
+
+These functions are bit test, bit set, bit clear, bit test and set and bit
+test and clear operations on PG_fscache.
diff --git a/Documentation/filesystems/caching/object.txt b/Documentation/filesystems/caching/object.txt
new file mode 100644
index 00000000..e8b0a35d
--- /dev/null
+++ b/Documentation/filesystems/caching/object.txt
@@ -0,0 +1,313 @@
+ ====================================================
+ IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
+ ====================================================
+
+By: David Howells <dhowells@redhat.com>
+
+Contents:
+
+ (*) Representation
+
+ (*) Object management state machine.
+
+ - Provision of cpu time.
+ - Locking simplification.
+
+ (*) The set of states.
+
+ (*) The set of events.
+
+
+==============
+REPRESENTATION
+==============
+
+FS-Cache maintains an in-kernel representation of each object that a netfs is
+currently interested in. Such objects are represented by the fscache_cookie
+struct and are referred to as cookies.
+
+FS-Cache also maintains a separate in-kernel representation of the objects that
+a cache backend is currently actively caching. Such objects are represented by
+the fscache_object struct. The cache backends allocate these upon request, and
+are expected to embed them in their own representations. These are referred to
+as objects.
+
+There is a 1:N relationship between cookies and objects. A cookie may be
+represented by multiple objects - an index may exist in more than one cache -
+or even by no objects (it may not be cached).
+
+Furthermore, both cookies and objects are hierarchical. The two hierarchies
+correspond, but the cookies tree is a superset of the union of the object trees
+of multiple caches:
+
+ NETFS INDEX TREE : CACHE 1 : CACHE 2
+ : :
+ : +-----------+ :
+ +----------->| IObject | :
+ +-----------+ | : +-----------+ :
+ | ICookie |-------+ : | :
+ +-----------+ | : | : +-----------+
+ | +------------------------------>| IObject |
+ | : | : +-----------+
+ | : V : |
+ | : +-----------+ : |
+ V +----------->| IObject | : |
+ +-----------+ | : +-----------+ : |
+ | ICookie |-------+ : | : V
+ +-----------+ | : | : +-----------+
+ | +------------------------------>| IObject |
+ +-----+-----+ : | : +-----------+
+ | | : | : |
+ V | : V : |
+ +-----------+ | : +-----------+ : |
+ | ICookie |------------------------->| IObject | : |
+ +-----------+ | : +-----------+ : |
+ | V : | : V
+ | +-----------+ : | : +-----------+
+ | | ICookie |-------------------------------->| IObject |
+ | +-----------+ : | : +-----------+
+ V | : V : |
+ +-----------+ | : +-----------+ : |
+ | DCookie |------------------------->| DObject | : |
+ +-----------+ | : +-----------+ : |
+ | : : |
+ +-------+-------+ : : |
+ | | : : |
+ V V : : V
+ +-----------+ +-----------+ : : +-----------+
+ | DCookie | | DCookie |------------------------>| DObject |
+ +-----------+ +-----------+ : : +-----------+
+ : :
+
+In the above illustration, ICookie and IObject represent indices and DCookie
+and DObject represent data storage objects. Indices may have representation in
+multiple caches, but currently, non-index objects may not. Objects of any type
+may also be entirely unrepresented.
+
+As far as the netfs API goes, the netfs is only actually permitted to see
+pointers to the cookies. The cookies themselves and any objects attached to
+those cookies are hidden from it.
+
+
+===============================
+OBJECT MANAGEMENT STATE MACHINE
+===============================
+
+Within FS-Cache, each active object is managed by its own individual state
+machine. The state for an object is kept in the fscache_object struct, in
+object->state. A cookie may point to a set of objects that are in different
+states.
+
+Each state has an action associated with it that is invoked when the machine
+wakes up in that state. There are four logical sets of states:
+
+ (1) Preparation: states that wait for the parent objects to become ready. The
+ representations are hierarchical, and it is expected that an object must
+ be created or accessed with respect to its parent object.
+
+ (2) Initialisation: states that perform lookups in the cache and validate
+ what's found and that create on disk any missing metadata.
+
+ (3) Normal running: states that allow netfs operations on objects to proceed
+ and that update the state of objects.
+
+ (4) Termination: states that detach objects from their netfs cookies, that
+ delete objects from disk, that handle disk and system errors and that free
+ up in-memory resources.
+
+
+In most cases, transitioning between states is in response to signalled events.
+When a state has finished processing, it will usually set the mask of events in
+which it is interested (object->event_mask) and relinquish the worker thread.
+Then when an event is raised (by calling fscache_raise_event()), if the event
+is not masked, the object will be queued for processing (by calling
+fscache_enqueue_object()).
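+
+For instance, a cache backend that hits a disk error might kick the state
+machine with something like (a minimal sketch):
+
+	/* report an I/O error against the object; the object will be
+	 * queued for processing if EV_ERROR isn't currently masked out */
+	fscache_raise_event(object, FSCACHE_OBJECT_EV_ERROR);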
+
+
+PROVISION OF CPU TIME
+---------------------
+
+The work to be done by the various states is given CPU time by the threads of
+the slow work facility (see Documentation/slow-work.txt). This is used in
+preference to the workqueue facility because:
+
+ (1) Threads may be completely occupied for very long periods of time by a
+ particular work item. These state actions may be doing sequences of
+ synchronous, journalled disk accesses (lookup, mkdir, create, setxattr,
+ getxattr, truncate, unlink, rmdir, rename).
+
+ (2) Threads may do little actual work, but may rather spend a lot of time
+ sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded
+ workqueues don't necessarily have the right numbers of threads.
+
+
+LOCKING SIMPLIFICATION
+----------------------
+
+Because only one worker thread may be operating on any particular object's
+state machine at once, this simplifies the locking, particularly with respect
+to disconnecting the netfs's representation of a cache object (fscache_cookie)
+from the cache backend's representation (fscache_object) - which may be
+requested from either end.
+
+
+=================
+THE SET OF STATES
+=================
+
+The object state machine has a set of states that it can be in. There are
+preparation states in which the object sets itself up and waits for its parent
+object to transit to a state that allows access to its children:
+
+ (1) State FSCACHE_OBJECT_INIT.
+
+ Initialise the object and wait for the parent object to become active. In
+ the cache, it is expected that it will not be possible to look an object
+ up from the parent object, until that parent object itself has been looked
+ up.
+
+There are initialisation states in which the object sets itself up and accesses
+disk for the object metadata:
+
+ (2) State FSCACHE_OBJECT_LOOKING_UP.
+
+ Look up the object on disk, using the parent as a starting point.
+ FS-Cache expects the cache backend to probe the cache to see whether this
+ object is represented there, and if it is, to see if it's valid (coherency
+ management).
+
+ The cache should call fscache_object_lookup_negative() to indicate lookup
+ failure for whatever reason, and should call fscache_obtained_object() to
+ indicate success.
+
+ At the completion of lookup, FS-Cache will let the netfs go ahead with
+ read operations, no matter whether the file is yet cached. If not yet
+ cached, read operations will be immediately rejected with ENODATA until
+ the first known page is uncached - as to that point there can be no data
+ to be read out of the cache for that file that isn't currently also held
+ in the pagecache.
+
+ (3) State FSCACHE_OBJECT_CREATING.
+
+ Create an object on disk, using the parent as a starting point. This
+ happens if the lookup failed to find the object, or if the object's
+ coherency data indicated what's on disk is out of date. In this state,
+ FS-Cache expects the cache to create the object.
+
+ The cache should call fscache_obtained_object() if creation completes
+ successfully, fscache_object_lookup_negative() otherwise.
+
+ At the completion of creation, FS-Cache will start processing write
+ operations the netfs has queued for an object. If creation failed, the
+ write ops will be transparently discarded, and nothing recorded in the
+ cache.
+
+There are some normal running states in which the object spends its time
+servicing netfs requests:
+
+ (4) State FSCACHE_OBJECT_AVAILABLE.
+
+ A transient state in which pending operations are started, child objects
+ are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary
+ lookup data is freed.
+
+ (5) State FSCACHE_OBJECT_ACTIVE.
+
+ The normal running state. In this state, requests the netfs makes will be
+ passed on to the cache.
+
+ (6) State FSCACHE_OBJECT_UPDATING.
+
+ The state machine comes here to update the object in the cache from the
+ netfs's records. This involves updating the auxiliary data that is used
+ to maintain coherency.
+
+And there are terminal states in which an object cleans itself up, deallocates
+memory and potentially deletes stuff from disk:
+
+ (7) State FSCACHE_OBJECT_LC_DYING.
+
+ The object comes here if it is dying because of a lookup or creation
+ error. This would be due to a disk error or system error of some sort.
+ Temporary data is cleaned up, and the parent is released.
+
+ (8) State FSCACHE_OBJECT_DYING.
+
+ The object comes here if it is dying due to an error, because its parent
+ cookie has been relinquished by the netfs or because the cache is being
+ withdrawn.
+
+ Any child objects waiting on this one are given CPU time so that they too
+ can destroy themselves. This object waits for all its children to go away
+ before advancing to the next state.
+
+ (9) State FSCACHE_OBJECT_ABORT_INIT.
+
+ The object comes to this state if it was waiting on its parent in
+ FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
+ so that the parent may proceed from the FSCACHE_OBJECT_DYING state.
+
+(10) State FSCACHE_OBJECT_RELEASING.
+(11) State FSCACHE_OBJECT_RECYCLING.
+
+ The object comes to one of these two states when dying once it is rid of
+ all its children, if it is dying because the netfs relinquished its
+ cookie. In the first state, the cached data is expected to persist, and
+ in the second it will be deleted.
+
+(12) State FSCACHE_OBJECT_WITHDRAWING.
+
+ The object transits to this state if the cache decides it wants to
+ withdraw the object from service, perhaps to make space, but also due to
+ error or just because the whole cache is being withdrawn.
+
+(13) State FSCACHE_OBJECT_DEAD.
+
+ The object transits to this state when the in-memory object record is
+ ready to be deleted. The object processor shouldn't ever see an object in
+ this state.
+
+
+THE SET OF EVENTS
+-----------------
+
+There are a number of events that can be raised to an object state machine:
+
+ (*) FSCACHE_OBJECT_EV_UPDATE
+
+ The netfs requested that an object be updated. The state machine will ask
+ the cache backend to update the object, and the cache backend will ask the
+ netfs for details of the change through its cookie definition ops.
+
+ (*) FSCACHE_OBJECT_EV_CLEARED
+
+ This is signalled in two circumstances:
+
+ (a) when an object's last child object is dropped and
+
+ (b) when the last operation outstanding on an object is completed.
+
+ This is used to proceed from the dying state.
+
+ (*) FSCACHE_OBJECT_EV_ERROR
+
+ This is signalled when an I/O error occurs during the processing of some
+ object.
+
+ (*) FSCACHE_OBJECT_EV_RELEASE
+ (*) FSCACHE_OBJECT_EV_RETIRE
+
+ These are signalled when the netfs relinquishes a cookie it was using.
+ The event selected depends on whether the netfs asks for the backing
+ object to be retired (deleted) or retained.
+
+ (*) FSCACHE_OBJECT_EV_WITHDRAW
+
+ This is signalled when the cache backend wants to withdraw an object.
+ This means that the object will have to be detached from the netfs's
+ cookie.
+
+Because the withdrawing, releasing and retiring events are all handled by the
+object state machine, it doesn't matter if there's a collision with both ends
+trying to sever the connection at the same time.  The state machine can just
+pick which one it wants to honour, and the outcome satisfies the other end as
+well.
diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.txt
new file mode 100644
index 00000000..b6b070c5
--- /dev/null
+++ b/Documentation/filesystems/caching/operations.txt
@@ -0,0 +1,213 @@
+ ================================
+ ASYNCHRONOUS OPERATIONS HANDLING
+ ================================
+
+By: David Howells <dhowells@redhat.com>
+
+Contents:
+
+ (*) Overview.
+
+ (*) Operation record initialisation.
+
+ (*) Parameters.
+
+ (*) Procedure.
+
+ (*) Asynchronous callback.
+
+
+========
+OVERVIEW
+========
+
+FS-Cache has an asynchronous operations handling facility that it uses for its
+data storage and retrieval routines. Its operations are represented by
+fscache_operation structs, though these are usually embedded into some other
+structure.
+
+This facility is available to and expected to be used by the cache backends,
+and FS-Cache will create operations and pass them off to the appropriate cache
+backend for completion.
+
+To make use of this facility, <linux/fscache-cache.h> should be #included.
+
+
+===============================
+OPERATION RECORD INITIALISATION
+===============================
+
+An operation is recorded in an fscache_operation struct:
+
+ struct fscache_operation {
+ union {
+ struct work_struct fast_work;
+ struct slow_work slow_work;
+ };
+ unsigned long flags;
+ fscache_operation_processor_t processor;
+ ...
+ };
+
+Someone wanting to issue an operation should allocate something with this
+struct embedded in it. They should initialise it by calling:
+
+ void fscache_operation_init(struct fscache_operation *op,
+ fscache_operation_release_t release);
+
+with the operation to be initialised and the release function to use.
+
+The op->flags parameter should be set to indicate the CPU time provision and
+the exclusivity (see the Parameters section).
+
+The op->fast_work, op->slow_work and op->processor fields should be set as
+appropriate for the CPU time provision (see the Parameters section).
+
+FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
+operation and waited for afterwards.
+
+
+==========
+PARAMETERS
+==========
+
+There are a number of parameters that can be set in the operation record's
+flags member.  There are three options for the provision of CPU time in these
+operations:
+
+ (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread
+ may decide it wants to handle an operation itself without deferring it to
+ another thread.
+
+ This is, for example, used in read operations for calling readpages() on
+ the backing filesystem in CacheFiles. Although readpages() does an
+ asynchronous data fetch, the determination of whether pages exist is done
+ synchronously - and the netfs does not proceed until this has been
+ determined.
+
+ If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
+ before submitting the operation, and the operating thread must wait for it
+ to be cleared before proceeding:
+
+ wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+
+
+ (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it
+ will be given to keventd to process. Such an operation is not permitted
+ to sleep on I/O.
+
+ This is, for example, used by CacheFiles to copy data from a backing fs
+ page to a netfs page after the backing fs has read the page in.
+
+ If this option is used, op->fast_work and op->processor must be
+ initialised before submitting the operation:
+
+ INIT_WORK(&op->fast_work, do_some_work);
+
+
+ (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it
+ will be given to the slow work facility to process. Such an operation is
+ permitted to sleep on I/O.
+
+ This is, for example, used by FS-Cache to handle background writes of
+ pages that have just been fetched from a remote server.
+
+ If this option is used, op->slow_work and op->processor must be
+ initialised before submitting the operation:
+
+ fscache_operation_init_slow(op, processor)
+
+
+Furthermore, operations may be one of two types:
+
+ (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in
+ conjunction with any other operation on the object being operated upon.
+
+ An example of this is the attribute change operation, in which the file
+ being written to may need truncation.
+
+ (2) Shareable. Operations of this type may be running simultaneously. It's
+ up to the operation implementation to prevent interference between other
+ operations running at the same time.
+
+
+=========
+PROCEDURE
+=========
+
+Operations are used through the following procedure:
+
+ (1) The submitting thread must allocate the operation and initialise it
+ itself. Normally this would be part of a more specific structure with the
+ generic op embedded within.
+
+ (2) The submitting thread must then submit the operation for processing using
+ one of the following two functions:
+
+ int fscache_submit_op(struct fscache_object *object,
+ struct fscache_operation *op);
+
+ int fscache_submit_exclusive_op(struct fscache_object *object,
+ struct fscache_operation *op);
+
+ The first function should be used to submit non-exclusive ops and the
+ second to submit exclusive ones. The caller must still set the
+ FSCACHE_OP_EXCLUSIVE flag.
+
+ If successful, both functions will assign the operation to the specified
+ object and return 0. -ENOBUFS will be returned if the object specified is
+ permanently unavailable.
+
+ The operation manager will defer operations on an object that is still
+ undergoing lookup or creation. The operation will also be deferred if an
+ operation of conflicting exclusivity is in progress on the object.
+
+ If the operation is asynchronous, the manager will retain a reference to
+ it, so the caller should put their reference to it by passing it to:
+
+ void fscache_put_operation(struct fscache_operation *op);
+
+ (3) If the submitting thread wants to do the work itself, and has marked the
+ operation with FSCACHE_OP_MYTHREAD, then it should monitor
+ FSCACHE_OP_WAITING as described above and check the state of the object if
+ necessary (the object might have died whilst the thread was waiting).
+
+ When it has finished doing its processing, it should call
+ fscache_put_operation() on it.
+
+ (4) The operation holds an effective lock upon the object, preventing other
+ exclusive ops conflicting until it is released. The operation can be
+ enqueued for further immediate asynchronous processing by adjusting the
+ CPU time provisioning option if necessary, eg:
+
+ op->flags &= ~FSCACHE_OP_TYPE;
+	op->flags |= FSCACHE_OP_FAST;
+
+ and calling:
+
+ void fscache_enqueue_operation(struct fscache_operation *op)
+
+ This can be used to allow other things to have use of the worker thread
+ pools.
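+
+Condensing the above into one hedged sketch of the synchronous
+(FSCACHE_OP_MYTHREAD) procedure - struct myfs_op, myfs_op_release() and the
+actual work are hypothetical:
+
+	struct myfs_op {
+		struct fscache_operation op;	/* must be embedded */
+		struct page *page;		/* example payload */
+	};
+
+	static void myfs_op_release(struct fscache_operation *_op)
+	{
+		kfree(container_of(_op, struct myfs_op, op));
+	}
+
+	static int myfs_do_op(struct fscache_object *object, struct page *page)
+	{
+		struct myfs_op *mop;
+		int ret;
+
+		mop = kzalloc(sizeof(*mop), GFP_KERNEL);
+		if (!mop)
+			return -ENOMEM;
+
+		fscache_operation_init(&mop->op, myfs_op_release);
+		/* MYTHREAD is part of the type mask; WAITING is a bit number */
+		mop->op.flags = FSCACHE_OP_MYTHREAD | (1 << FSCACHE_OP_WAITING);
+		mop->page = page;
+
+		ret = fscache_submit_op(object, &mop->op);
+		if (ret < 0) {
+			fscache_put_operation(&mop->op);
+			return ret;	/* e.g. -ENOBUFS: object unavailable */
+		}
+
+		/* wait until the operation manager lets this thread run */
+		wait_on_bit(&mop->op.flags, FSCACHE_OP_WAITING,
+			    fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+
+		/* ... do the synchronous work on mop->page here ... */
+
+		fscache_put_operation(&mop->op);
+		return 0;
+	}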
+
+
+=====================
+ASYNCHRONOUS CALLBACK
+=====================
+
+When used in asynchronous mode, the worker thread pool will invoke the
+processor method with a pointer to the operation. This should then get at the
+container struct by using container_of():
+
+ static void fscache_write_op(struct fscache_operation *_op)
+ {
+ struct fscache_storage *op =
+ container_of(_op, struct fscache_storage, op);
+ ...
+ }
+
+The caller holds a reference on the operation, and will invoke
+fscache_put_operation() when the processor function returns. The processor
+function is at liberty to call fscache_enqueue_operation() or to take extra
+references.
diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt
new file mode 100644
index 00000000..763d8ebb
--- /dev/null
+++ b/Documentation/filesystems/ceph.txt
@@ -0,0 +1,140 @@
+Ceph Distributed File System
+============================
+
+Ceph is a distributed network file system designed to provide good
+performance, reliability, and scalability.
+
+Basic features include:
+
+ * POSIX semantics
+ * Seamless scaling from 1 to many thousands of nodes
+ * High availability and reliability. No single point of failure.
+ * N-way replication of data across storage nodes
+ * Fast recovery from node failures
+ * Automatic rebalancing of data on node addition/removal
+ * Easy deployment: most FS components are userspace daemons
+
+Also,
+ * Flexible snapshots (on any directory)
+ * Recursive accounting (nested files, directories, bytes)
+
+In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
+on symmetric access by all clients to shared block devices, Ceph
+separates data and metadata management into independent server
+clusters, similar to Lustre. Unlike Lustre, however, metadata and
+storage nodes run entirely as user space daemons. Storage nodes
+utilize btrfs to store data objects, leveraging its advanced features
+(checksumming, metadata replication, etc.). File data is striped
+across storage nodes in large chunks to distribute workload and
+facilitate high throughputs. When storage nodes fail, data is
+re-replicated in a distributed fashion by the storage nodes themselves
+(with some minimal coordination from a cluster monitor), making the
+system extremely efficient and scalable.
+
+Metadata servers effectively form a large, consistent, distributed
+in-memory cache above the file namespace that is extremely scalable,
+dynamically redistributes metadata in response to workload changes,
+and can tolerate arbitrary (well, non-Byzantine) node failures. The
+metadata server takes a somewhat unconventional approach to metadata
+storage to significantly improve performance for common workloads. In
+particular, inodes with only a single link are embedded in
+directories, allowing entire directories of dentries and inodes to be
+loaded into its cache with a single I/O operation. The contents of
+extremely large directories can be fragmented and managed by
+independent metadata servers, allowing scalable concurrent access.
+
+The system offers automatic data rebalancing/migration when scaling
+from a small cluster of just a few nodes to many hundreds, without
+requiring an administrator to carve the data set into static volumes or
+go through the tedious process of migrating data between servers.
+When the file system approaches full, new nodes can be easily added
+and things will "just work."
+
+Ceph includes a flexible snapshot mechanism that allows a user to create
+a snapshot on any subdirectory (and its nested contents) in the
+system. Snapshot creation and deletion are as simple as 'mkdir
+.snap/foo' and 'rmdir .snap/foo'.
+
+Ceph also provides some recursive accounting on directories for nested
+files and bytes. That is, a 'getfattr -d foo' on any directory in the
+system will reveal the total number of nested regular files and
+subdirectories, and a summation of all nested file sizes. This makes
+the identification of large disk space consumers relatively quick, as
+no 'du' or similar recursive scan of the file system is required.
+
+
+Mount Syntax
+============
+
+The basic mount syntax is:
+
+ # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt
+
+You only need to specify a single monitor, as the client will get the
+full list when it connects. (However, if the monitor you specify
+happens to be down, the mount won't succeed.) The port can be left
+off if the monitor is using the default. So if the monitor is at
+1.2.3.4,
+
+ # mount -t ceph 1.2.3.4:/ /mnt/ceph
+
+is sufficient. If /sbin/mount.ceph is installed, a hostname can be
+used instead of an IP address.
+
+
+
+Mount Options
+=============
+
+ ip=A.B.C.D[:N]
+ Specify the IP and/or port the client should bind to locally.
+ There is normally not much reason to do this. If the IP is not
+ specified, the client's IP address is determined by looking at the
+ address its connection to the monitor originates from.
+
+ wsize=X
+ Specify the maximum write size in bytes. By default there is no
+ maximum. Ceph will normally size writes based on the file stripe
+ size.
+
+ rsize=X
+ Specify the maximum readahead.
+
+ mount_timeout=X
+ Specify the timeout value for mount (in seconds), in the case
+ of a non-responsive Ceph file system. The default is 30
+ seconds.
+
+ rbytes
+ When stat() is called on a directory, set st_size to 'rbytes',
+ the summation of file sizes over all files nested beneath that
+ directory. This is the default.
+
+ norbytes
+ When stat() is called on a directory, set st_size to the
+ number of entries in that directory.
+
+ nocrc
+ Disable CRC32C calculation for data writes. If set, the storage node
+ must rely on TCP's error correction to detect data corruption
+ in the data payload.
+
+ noasyncreaddir
+ Disable the client's use of its local cache to satisfy readdir
+ requests. (This does not change correctness; the client uses
+ cached metadata only when a lease or capability ensures it is
+ valid.)
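+
+As an illustration, several of these options might be combined on a single
+mount invocation (the values here are purely illustrative):
+
+ # mount -t ceph 1.2.3.4:/ /mnt/ceph -o rsize=131072,norbytes,nocrc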
+
+
+More Information
+================
+
+For more information on Ceph, see the home page at
+ http://ceph.newdream.net/
+
+The Linux kernel client source tree is available at
+ git://ceph.newdream.net/git/ceph-client.git
+ git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
+
+and the source for the full system is at
+ git://ceph.newdream.net/git/ceph.git
diff --git a/Documentation/filesystems/cifs.txt b/Documentation/filesystems/cifs.txt
new file mode 100644
index 00000000..49cc923a
--- /dev/null
+++ b/Documentation/filesystems/cifs.txt
@@ -0,0 +1,51 @@
+ This is the client VFS module for the Common Internet File System
+ (CIFS) protocol which is the successor to the Server Message Block
+ (SMB) protocol, the native file sharing mechanism for most early
+ PC operating systems. CIFS is fully supported by current network
+ file servers such as Windows 2000, Windows 2003 (including
+ Windows XP) as well as by Samba (which provides excellent CIFS
+ server support for Linux and many other operating systems), so
+ this network filesystem client can mount to a wide variety of
+ servers. The smbfs module should be used instead of this cifs module
+ for mounting to older SMB servers such as OS/2. The smbfs and cifs
+ modules can coexist and do not conflict. The CIFS VFS filesystem
+ module is designed to work well with servers that implement the
+ newer versions (dialects) of the SMB/CIFS protocol such as Samba,
+ the program written by Andrew Tridgell that turns any Unix host
+ into a SMB/CIFS file server.
+
+ The intent of this module is to provide the most advanced network
+ file system function for CIFS compliant servers, including better
+ POSIX compliance, secure per-user session establishment, high
+ performance safe distributed caching (oplock), optional packet
+ signing, large files, Unicode support and other internationalization
+ improvements. Since both Samba server and this filesystem client support
+ the CIFS Unix extensions, the combination can provide a reasonable
+ alternative to NFSv4 for fileserving in some Linux to Linux environments,
+ not just in Linux to Windows environments.
+
+ This filesystem has an optional mount utility (mount.cifs) that can
+ be obtained from the project page and installed in the path in the same
+ directory with the other mount helpers (such as mount.smbfs).
+ Mounting using the cifs filesystem without installing the mount helper
+ requires specifying the server's IP address.
+
+ For Linux 2.4:
+ mount //anything/here /mnt_target -o
+ user=username,pass=password,unc=//ip_address_of_server/sharename
+
+ For Linux 2.5:
+ mount //ip_address_of_server/sharename /mnt_target -o user=username,pass=password
+
+
+ For more information on the module see the project page at
+
+ http://us1.samba.org/samba/Linux_CIFS_client.html
+
+ For more information on CIFS see:
+
+ http://www.snia.org/tech_activities/CIFS
+
+ or the Samba site:
+
+ http://www.samba.org
diff --git a/Documentation/filesystems/coda.txt b/Documentation/filesystems/coda.txt
new file mode 100644
index 00000000..61311356
--- /dev/null
+++ b/Documentation/filesystems/coda.txt
@@ -0,0 +1,1673 @@
+NOTE:
+This is one of the technical documents describing a component of
+Coda -- this document describes the client kernel-Venus interface.
+
+For more information:
+ http://www.coda.cs.cmu.edu
+For user level software needed to run Coda:
+ ftp://ftp.coda.cs.cmu.edu
+
+To run Coda you need to get a user level cache manager for the client,
+named Venus, as well as tools to manipulate ACLs, to log in, etc. The
+client needs to have the Coda filesystem selected in the kernel
+configuration.
+
+The server needs a user level server and at present does not depend on
+kernel support.
+
+
+ The Venus kernel interface
+ Peter J. Braam
+ v1.0, Nov 9, 1997
+
+ This document describes the communication between Venus and kernel
+ level filesystem code needed for the operation of the Coda file
+ system. This document version is meant to describe the current interface
+ (version 1.0) as well as improvements we envisage.
+ ______________________________________________________________________
+
+ Table of Contents
+
+ 1. Introduction
+
+ 2. Servicing Coda filesystem calls
+
+ 3. The message layer
+
+ 3.1 Implementation details
+
+ 4. The interface at the call level
+
+ 4.1 Data structures shared by the kernel and Venus
+ 4.2 The pioctl interface
+ 4.3 root
+ 4.4 lookup
+ 4.5 getattr
+ 4.6 setattr
+ 4.7 access
+ 4.8 create
+ 4.9 mkdir
+ 4.10 link
+ 4.11 symlink
+ 4.12 remove
+ 4.13 rmdir
+ 4.14 readlink
+ 4.15 open
+ 4.16 close
+ 4.17 ioctl
+ 4.18 rename
+ 4.19 readdir
+ 4.20 vget
+ 4.21 fsync
+ 4.22 inactive
+ 4.23 rdwr
+ 4.24 odymount
+ 4.25 ody_lookup
+ 4.26 ody_expand
+ 4.27 prefetch
+ 4.28 signal
+
+ 5. The minicache and downcalls
+
+ 5.1 INVALIDATE
+ 5.2 FLUSH
+ 5.3 PURGEUSER
+ 5.4 ZAPFILE
+ 5.5 ZAPDIR
+ 5.6 ZAPVNODE
+ 5.7 PURGEFID
+ 5.8 REPLACE
+
+ 6. Initialization and cleanup
+
+ 6.1 Requirements
+
+
+ ______________________________________________________________________
+
+ 1. Introduction
+
+
+
+ A key component in the Coda Distributed File System is the cache
+ manager, Venus.
+
+
+ When processes on a Coda enabled system access files in the Coda
+ filesystem, requests are directed at the filesystem layer in the
+ operating system. The operating system will communicate with Venus to
+ service the request for the process. Venus manages a persistent
+ client cache and makes remote procedure calls to Coda file servers and
+ related servers (such as authentication servers) to service these
+ requests it receives from the operating system. When Venus has
+ serviced a request it replies to the operating system with appropriate
+ return codes, and other data related to the request. Optionally the
+ kernel support for Coda may maintain a minicache of recently processed
+ requests to limit the number of interactions with Venus. Venus
+ possesses the facility to inform the kernel when elements from its
+ minicache are no longer valid.
+
+ This document describes precisely this communication between the
+ kernel and Venus. The definitions of so called upcalls and downcalls
+ will be given with the format of the data they handle. We shall also
+ describe the semantic invariants resulting from the calls.
+
+ Historically Coda was implemented in a BSD file system in Mach 2.6.
+ The interface between the kernel and Venus is very similar to the BSD
+ VFS interface. Similar functionality is provided, and the format of
+ the parameters and returned data is very similar to the BSD VFS. This
+ leads to an almost natural environment for implementing a kernel-level
+ filesystem driver for Coda in a BSD system. However, other operating
+ systems such as Linux and Windows 95 and NT have virtual filesystem
+ with different interfaces.
+
+ To implement Coda on these systems some reverse engineering of the
+ Venus/Kernel protocol is necessary. Also it came to light that other
+ systems could profit significantly from certain small optimizations
+ and modifications to the protocol. To facilitate this work as well as
+ to make future ports easier, communication between Venus and the
+ kernel should be documented in great detail. This is the aim of this
+ document.
+
+
+ 2. Servicing Coda filesystem calls
+
+ The service of a request for a Coda file system service originates in
+ a process P which is accessing a Coda file. It makes a system call which
+ traps to the OS kernel. Examples of such calls trapping to the kernel
+ are read, write, open, close, create, mkdir, rmdir and chmod in a Unix
+ context. Similar calls exist in the Win32 environment, and are named
+ CreateFile, etc.
+
+ Generally the operating system handles the request in a virtual
+ filesystem (VFS) layer, which is named I/O Manager in NT and IFS
+ manager in Windows 95. The VFS is responsible for partial processing
+ of the request and for locating the specific filesystem(s) which will
+ service parts of the request. Usually the information in the path
+ assists in locating the correct FS drivers. Sometimes after extensive
+ pre-processing, the VFS starts invoking exported routines in the FS
+ driver. This is the point where the FS specific processing of the
+ request starts, and here the Coda specific kernel code comes into
+ play.
+
+ The FS layer for Coda must expose and implement several interfaces.
+ First and foremost the VFS must be able to make all necessary calls to
+ the Coda FS layer, so the Coda FS driver must expose the VFS interface
+ as applicable in the operating system. These differ very significantly
+ among operating systems, but share features such as facilities to
+ read/write and create and remove objects. The Coda FS layer services
+ such VFS requests by invoking one or more well defined services
+ offered by the cache manager Venus. When the replies from Venus have
+ come back to the FS driver, servicing of the VFS call continues and
+ finishes with a reply to the kernel's VFS. Finally the VFS layer
+ returns to the process.
+
+ As a result of this design a basic interface exposed by the FS driver
+ must allow Venus to manage message traffic. In particular Venus must
+ be able to retrieve and place messages and to be notified of the
+ arrival of a new message. The notification must be through a mechanism
+ which does not block Venus since Venus must attend to other tasks even
+ when no messages are waiting or being processed.
+
+ [Figure: Interfaces of the Coda FS Driver]
+
+ Furthermore the FS layer provides for a special path of communication
+ between a user process and Venus, called the pioctl interface. The
+ pioctl interface is used for Coda specific services, such as
+ requesting detailed information about the persistent cache managed by
+ Venus. Here the involvement of the kernel is minimal. It identifies
+ the calling process and passes the information on to Venus. When
+ Venus replies the response is passed back to the caller in unmodified
+ form.
+
+ Finally Venus allows the kernel FS driver to cache the results from
+ certain services. This is done to avoid excessive context switches
+ and results in an efficient system. However, Venus may acquire
+ information, for example from the network which implies that cached
+ information must be flushed or replaced. Venus then makes a downcall
+ to the Coda FS layer to request flushes or updates in the cache. The
+ kernel FS driver handles such requests synchronously.
+
+ Among these interfaces the VFS interface and the facility to place,
+ receive and be notified of messages are platform specific. We will
+ not go into the calls exported to the VFS layer but we will state the
+ requirements of the message exchange mechanism.
+
+
+ 3. The message layer
+
+
+
+ At the lowest level the communication between Venus and the FS driver
+ proceeds through messages. The synchronization between processes
+ requesting Coda file service and Venus relies on blocking and waking
+ up processes. The Coda FS driver processes VFS- and pioctl-requests
+ on behalf of a process P, creates messages for Venus, awaits replies
+ and finally returns to the caller. The implementation of the exchange
+ of messages is platform specific, but the semantics have (so far)
+ appeared to be generally applicable. Data buffers are created by the
+ FS Driver in kernel memory on behalf of P and copied to user memory in
+ Venus.
+
+ The FS Driver while servicing P makes upcalls to Venus. Such an
+ upcall is dispatched to Venus by creating a message structure. The
+ structure contains the identification of P, the message sequence
+ number, the size of the request and a pointer to the data in kernel
+ memory for the request. Since the data buffer is re-used to hold the
+ reply from Venus, there is a field for the size of the reply. A flags
+ field is used in the message to precisely record the status of the
+ message. Additional platform dependent structures involve pointers to
+ determine the position of the message on queues and pointers to
+ synchronization objects. In the upcall routine the message structure
+ is filled in, flags are set to 0, and it is placed on the pending
+ queue. The routine calling upcall is responsible for allocating the
+ data buffer; its structure will be described in the next section.
+
+ A facility must exist to notify Venus that the message has been
+ created, and is implemented using available synchronization objects in
+ the OS. This notification is done in the upcall context of the process
+ P. When the message is on the pending queue, process P cannot proceed
+ in upcall. The (kernel mode) processing of P in the filesystem
+ request routine must be suspended until Venus has replied. Therefore
+ the calling thread in P is blocked in upcall. A pointer in the
+ message structure will locate the synchronization object on which P is
+ sleeping.
+
+ Venus detects the notification that a message has arrived, and the FS
+ driver allows Venus to retrieve the message with a getmsg_from_kernel
+ call. This action finishes in the kernel by putting the message on the
+ queue of processing messages and setting flags to READ. Venus is
+ passed the contents of the data buffer. The getmsg_from_kernel call
+ now returns and Venus processes the request.
+
+ At some later point the FS driver receives a message from Venus,
+ namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS
+ driver looks at the contents of the message and decides if:
+
+
+ +o the message is a reply for a suspended thread P. If so it removes
+ the message from the processing queue and marks the message as
+ WRITTEN. Finally, the FS driver unblocks P (still in the kernel
+ mode context of Venus) and the sendmsg_to_kernel call returns to
+ Venus. The process P will be scheduled at some point and continues
+ processing its upcall with the data buffer replaced with the reply
+ from Venus.
+
+ +o The message is a downcall. A downcall is a request from Venus to
+ the FS Driver. The FS driver processes the request immediately
+ (usually a cache eviction or replacement) and when it finishes
+ sendmsg_to_kernel returns.
+
+ Now P awakes and continues processing upcall. There are some
+ subtleties to take account of. First P will determine if it was woken
+ up in upcall by a signal from some other source (for example an
+ attempt to terminate P) or as is normally the case by Venus in its
+ sendmsg_to_kernel call. In the normal case, the upcall routine will
+ deallocate the message structure and return. The FS routine can proceed
+ with its processing.
+
+
+ [Figure: Sleeping and IPC arrangements]
+
+ In case P is woken up by a signal and not by Venus, it will first look
+ at the flags field. If the message is not yet READ, the process P can
+ handle its signal without notifying Venus. If Venus has READ, and
+ the request should not be processed, P can send Venus a signal message
+ to indicate that it should disregard the previous message. Such
+ signals are put in the queue at the head, and read first by Venus. If
+ the message is already marked as WRITTEN it is too late to stop the
+ processing. The VFS routine will now continue. (-- If a VFS request
+ involves more than one upcall, this can lead to complicated state, an
+ extra field "handle_signals" could be added in the message structure
+ to indicate points of no return have been passed.--)
+
+
+
+ 3.1. Implementation details
+
+ The Unix implementation of this mechanism has been through the
+ implementation of a character device associated with Coda. Venus
+ retrieves messages by doing a read on the device, replies are sent
+ with a write and notification is through the select system call on the
+ file descriptor for the device. The process P is kept waiting on an
+ interruptible wait queue object.
+
+ In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl
+ call is used. The DeviceIoControl call is designed to copy buffers
+ from user memory to kernel memory with OPCODES. The sendmsg_to_kernel
+ is issued as a synchronous call, while the getmsg_from_kernel call is
+ asynchronous. Windows EventObjects are used for notification of
+ message arrival. The process P is kept waiting on a KernelEvent
+ object in NT and a semaphore in Windows 95.
+
+
+ 4. The interface at the call level
+
+
+ This section describes the upcalls a Coda FS driver can make to Venus.
+ Each of these upcalls make use of two structures: inputArgs and
+ outputArgs. In pseudo BNF form the structures take the following
+ form:
+
+
+ struct inputArgs {
+ u_long opcode;
+ u_long unique; /* Keep multiple outstanding msgs distinct */
+ u_short pid; /* Common to all */
+ u_short pgid; /* Common to all */
+ struct CodaCred cred; /* Common to all */
+
+ <union "in" of call dependent parts of inputArgs>
+ };
+
+ struct outputArgs {
+ u_long opcode;
+ u_long unique; /* Keep multiple outstanding msgs distinct */
+ u_long result;
+
+ <union "out" of call dependent parts of inputArgs>
+ };
+
+
+
+ Before going on let us elucidate the role of the various fields. The
+ inputArgs start with the opcode which defines the type of service
+ requested from Venus. There are approximately 30 upcalls at present
+ which we will discuss. The unique field labels the inputArg with a
+ unique number which will identify the message uniquely. A process and
+ process group id are passed. Finally the credentials of the caller
+ are included.
+
+ Before delving into the specific calls we need to discuss a variety of
+ data structures shared by the kernel and Venus.
+
+
+
+
+ 4.1. Data structures shared by the kernel and Venus
+
+
+ The CodaCred structure defines a variety of user and group ids as
+ they are set for the calling process. The vuid_t and vgid_t are 32 bit
+ unsigned integers. It also defines group membership in an array. On
+ Unix the CodaCred has proven sufficient to implement good security
+ semantics for Coda but the structure may have to undergo modification
+ for the Windows environment when these mature.
+
+ struct CodaCred {
+ vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid*/
+ vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */
+ vgid_t cr_groups[NGROUPS]; /* Group membership for caller */
+ };
+
+
+
+ NOTE It is questionable if we need CodaCreds in Venus. Finally Venus
+ doesn't know about groups, although it does create files with the
+ default uid/gid. Perhaps the list of group membership is superfluous.
+
+
+ The next item is the fundamental identifier used to identify Coda
+ files, the ViceFid. A fid of a file uniquely defines a file or
+ directory in the Coda filesystem within a cell. (-- A cell is a
+ group of Coda servers acting under the aegis of a single system
+ control machine or SCM. See the Coda Administration manual for a
+ detailed description of the role of the SCM.--)
+
+
+ typedef struct ViceFid {
+ VolumeId Volume;
+ VnodeId Vnode;
+ Unique_t Unique;
+ } ViceFid;
+
+
+
+ Each of the constituent fields: VolumeId, VnodeId and Unique_t are
+ unsigned 32 bit integers. We envisage that a further field will need
+ to be prefixed to identify the Coda cell; this will probably take the
+ form of a Ipv6 size IP address naming the Coda cell through DNS.
+
+ The next important structure shared between Venus and the kernel is
+ the attributes of the file. The following structure is used to
+ exchange information. It has room for future extensions such as
+ support for device files (currently not present in Coda).
+
+
+
+ struct coda_vattr {
+ enum coda_vtype va_type; /* vnode type (for create) */
+ u_short va_mode; /* files access mode and type */
+ short va_nlink; /* number of references to file */
+ vuid_t va_uid; /* owner user id */
+ vgid_t va_gid; /* owner group id */
+ long va_fsid; /* file system id (dev for now) */
+ long va_fileid; /* file id */
+ u_quad_t va_size; /* file size in bytes */
+ long va_blocksize; /* blocksize preferred for i/o */
+ struct timespec va_atime; /* time of last access */
+ struct timespec va_mtime; /* time of last modification */
+ struct timespec va_ctime; /* time file changed */
+ u_long va_gen; /* generation number of file */
+ u_long va_flags; /* flags defined for file */
+ dev_t va_rdev; /* device special file represents */
+ u_quad_t va_bytes; /* bytes of disk space held by file */
+ u_quad_t va_filerev; /* file modification number */
+ u_int va_vaflags; /* operations flags, see below */
+ long va_spare; /* remain quad aligned */
+ };
+
+
+
+
+ 4.2. The pioctl interface
+
+
+ Coda specific requests can be made by applications through the pioctl
+ interface. The pioctl is implemented as an ordinary ioctl on a
+ fictitious file /coda/.CONTROL. The pioctl call opens this file, gets
+ a file handle and makes the ioctl call. Finally it closes the file.
+
+ The kernel involvement in this is limited to providing the facility to
+ open and close and pass the ioctl message and to verify that a path in
+ the pioctl data buffers is a file in a Coda filesystem.
+
+ The kernel is handed a data packet of the form:
+
+ struct {
+ const char *path;
+ struct ViceIoctl vidata;
+ int follow;
+ } data;
+
+
+
+ where
+
+
+ struct ViceIoctl {
+ caddr_t in, out; /* Data to be transferred in, or out */
+ short in_size; /* Size of input buffer <= 2K */
+ short out_size; /* Maximum size of output buffer, <= 2K */
+ };
+
+
+
+ The path must be a Coda file, otherwise the ioctl upcall will not be
+ made.
+
+ NOTE The data structures and code are a mess. We need to clean this
+ up.
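+
+ To illustrate the shape of the user-level side, here is a hypothetical
+ sketch. The opcode passed to ioctl() is an illustrative placeholder, and
+ the struct layout simply mirrors the definition above:
+
+      #include <fcntl.h>
+      #include <sys/ioctl.h>
+      #include <unistd.h>
+
+      struct ViceIoctl {              /* as defined above */
+              void *in, *out;         /* caddr_t in the original */
+              short in_size;          /* size of input buffer, <= 2K */
+              short out_size;         /* max size of output buffer, <= 2K */
+      };
+
+      static int coda_pioctl_sketch(void *in, short in_size,
+                                    void *out, short out_size)
+      {
+              struct ViceIoctl vi;
+              int fd, ret;
+
+              vi.in = in;
+              vi.in_size = in_size;
+              vi.out = out;
+              vi.out_size = out_size;
+
+              fd = open("/coda/.CONTROL", O_RDONLY);
+              if (fd < 0)
+                      return -1;
+              ret = ioctl(fd, 0 /* illustrative opcode */, &vi);
+              close(fd);
+              return ret;
+      }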
+
+ We now proceed to document the individual calls:
+
+
+ 4.3. root
+
+
+ Arguments
+
+ in: empty
+
+ out:
+
+ struct cfs_root_out {
+ ViceFid VFid;
+ } cfs_root;
+
+
+
+ Description This call is made to Venus during the initialization of
+ the Coda filesystem. If the result is zero, the cfs_root structure
+ contains the ViceFid of the root of the Coda filesystem. If a non-zero
+ result is generated, its value is a platform dependent error code
+ indicating the difficulty Venus encountered in locating the root of
+ the Coda filesystem.
+
+
+ 4.4. lookup
+
+
+ Summary Find the ViceFid and type of an object in a directory if it
+ exists.
+
+ Arguments
+
+ in:
+
+ struct cfs_lookup_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_lookup;
+
+
+
+ out:
+
+ struct cfs_lookup_out {
+ ViceFid VFid;
+ int vtype;
+ } cfs_lookup;
+
+
+
+ Description This call is made to determine the ViceFid and filetype of
+ a directory entry. The directory entry requested carries name name
+ and Venus will search the directory identified by cfs_lookup_in.VFid.
+ The result may indicate that the name does not exist, or that
+ difficulty was encountered in finding it (e.g. due to disconnection).
+ If the result is zero, the field cfs_lookup_out.VFid contains the
+ targets ViceFid and cfs_lookup_out.vtype the coda_vtype giving the
+ type of object the name designates.
+
+ The name of the object is an 8 bit character string of maximum length
+ CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator.)
+
+ It is extremely important to realize that Venus bitwise ors the field
+ cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should
+ not be put in the kernel name cache.
+
+ NOTE The type of the vtype is currently wrong. It should be
+ coda_vtype. Linux does not take note of CFS_NOCACHE. It should.
+
+
+ 4.5. getattr
+
+
+ Summary Get the attributes of a file.
+
+ Arguments
+
+ in:
+
+ struct cfs_getattr_in {
+ ViceFid VFid;
+ struct coda_vattr attr; /* XXXXX */
+ } cfs_getattr;
+
+
+
+ out:
+
+ struct cfs_getattr_out {
+ struct coda_vattr attr;
+ } cfs_getattr;
+
+
+
+ Description This call returns the attributes of the file identified by
+ fid.
+
+ Errors Errors can occur if the object with fid does not exist, is
+ inaccessible or if the caller does not have permission to fetch
+ attributes.
+
+ Note Many kernel FS drivers (Linux, NT and Windows 95) need to acquire
+ the attributes as well as the Fid for the instantiation of an internal
+ "inode" or "FileHandle". A significant improvement in performance on
+ such systems could be made by combining the lookup and getattr calls
+ both at the Venus/kernel interaction level and at the RPC level.
+
+ The vattr structure included in the input arguments is superfluous and
+ should be removed.
+
+
+ 4.6. setattr
+
+
+ Summary Set the attributes of a file.
+
+ Arguments
+
+ in
+
+ struct cfs_setattr_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_setattr;
+
+
+
+
+ out
+ empty
+
+ Description The structure attr is filled with attributes to be changed
+ in BSD style. Attributes not to be changed are set to -1, apart from
+ vtype which is set to VNON. Others are set to the value to be
+ assigned. The only attributes which the FS driver may request to
+ change are the mode, owner, groupid, atime, mtime and ctime. The
+ return value indicates success or failure.
+
+ Errors A variety of errors can occur. The object may not exist, may
+ be inaccessible, or permission may not be granted by Venus.
+
+
+ 4.7. access
+
+
+ Summary
+
+ Arguments
+
+ in
+
+ struct cfs_access_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_access;
+
+
+
+ out
+ empty
+
+ Description Verify if access to the object identified by VFid for
+ operations described by flags is permitted. The result indicates if
+ access will be granted. It is important to remember that Coda uses
+ ACLs to enforce protection and that ultimately the servers, not the
+ clients, enforce the security of the system. The result of this call
+ will depend on whether a token is held by the user.
+
+ Errors The object may not exist, or the ACL describing the protection
+ may not be accessible.
+
+
+ 4.8. create
+
+
+ Summary Invoked to create a file.
+
+ Arguments
+
+ in
+
+ struct cfs_create_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ int excl;
+ int mode;
+ char *name; /* Place holder for data. */
+ } cfs_create;
+
+
+
+
+ out
+
+ struct cfs_create_out {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_create;
+
+
+
+ Description This upcall is invoked to request creation of a file.
+ The file will be created in the directory identified by VFid, its name
+ will be name, and the mode will be mode. If excl is set an error will
+ be returned if the file already exists. If the size field in attr is
+ set to zero the file will be truncated. The uid and gid of the file
+ are set by converting the CodaCred to a uid using a macro CRTOUID
+ (this macro is platform dependent). Upon success the VFid and
+ attributes of the file are returned. The Coda FS Driver will normally
+ instantiate a vnode, inode or file handle at kernel level for the new
+ object.
+
+
+ Errors A variety of errors can occur. Permissions may be insufficient.
+ If the object exists and is not a file the error EISDIR is returned
+ under Unix.
+
+ NOTE The packing of parameters is very inefficient and appears to
+ indicate confusion between the system call creat and the VFS operation
+ create. The VFS operation create is only called to create new objects.
+ This create call differs from the Unix one in that it is not invoked
+ to return a file descriptor. The truncate and exclusive options,
+ together with the mode, could simply be part of the mode as it is
+ under Unix. There should be no flags argument; this is used in
+ open(2) to return a file descriptor for READ or WRITE mode.
+
+ The attributes of the directory should be returned too, since the size
+ and mtime changed.
+
+
+ 4.9. mkdir
+
+
+ Summary Create a new directory.
+
+ Arguments
+
+ in
+
+ struct cfs_mkdir_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ char *name; /* Place holder for data. */
+ } cfs_mkdir;
+
+
+
+ out
+
+ struct cfs_mkdir_out {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_mkdir;
+
+
+
+
+ Description This call is similar to create but creates a directory.
+ Only the mode field in the input parameters is used for creation.
+ Upon successful creation, the attr returned contains the attributes of
+ the new directory.
+
+ Errors As for create.
+
+ NOTE The input parameter should be changed to mode instead of
+ attributes.
+
+ The attributes of the parent should be returned since the size and
+ mtime change.
+
+
+ 4.10. link
+
+
+ Summary Create a link to an existing file.
+
+ Arguments
+
+ in
+
+ struct cfs_link_in {
+ ViceFid sourceFid; /* cnode to link *to* */
+ ViceFid destFid; /* Directory in which to place link */
+ char *tname; /* Place holder for data. */
+ } cfs_link;
+
+
+
+ out
+ empty
+
+ Description This call creates a link to the sourceFid in the directory
+ identified by destFid with name tname. The source must reside in the
+ target's parent, i.e. the source must have parent destFid; Coda does
+ not support cross-directory hard links. Only the return value is
+ relevant. It indicates success or the type of failure.
+
+ Errors The usual errors can occur.
+
+ 4.11. symlink
+
+
+ Summary Create a symbolic link.
+
+ Arguments
+
+ in
+
+ struct cfs_symlink_in {
+ ViceFid VFid; /* Directory to put symlink in */
+ char *srcname;
+ struct coda_vattr attr;
+ char *tname;
+ } cfs_symlink;
+
+
+
+ out
+ none
+
+ Description Create a symbolic link. The link is to be placed in the
+ directory identified by VFid and named tname. It should point to the
+ pathname srcname. The attributes of the newly created object are to
+ be set to attr.
+
+ Errors
+
+ NOTE The attributes of the target directory should be returned since
+ its size changed.
+
+
+ 4.12. remove
+
+
+ Summary Remove a file.
+
+ Arguments
+
+ in
+
+ struct cfs_remove_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_remove;
+
+
+
+ out
+ none
+
+ Description Remove the file named cfs_remove_in.name in the directory
+ identified by VFid.
+
+ Errors
+
+ NOTE The attributes of the directory should be returned since its
+ mtime and size may change.
+
+
+ 4.13. rmdir
+
+
+ Summary Remove a directory.
+
+ Arguments
+
+ in
+
+ struct cfs_rmdir_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_rmdir;
+
+
+
+ out
+ none
+
+ Description Remove the directory with name name from the directory
+ identified by VFid.
+
+ Errors
+
+ NOTE The attributes of the parent directory should be returned since
+ its mtime and size may change.
+
+
+ 4.14. readlink
+
+
+ Summary Read the value of a symbolic link.
+
+ Arguments
+
+ in
+
+ struct cfs_readlink_in {
+ ViceFid VFid;
+ } cfs_readlink;
+
+
+
+ out
+
+ struct cfs_readlink_out {
+ int count;
+ caddr_t data; /* Place holder for data. */
+ } cfs_readlink;
+
+
+
+ Description This routine reads the contents of the symbolic link
+ identified by VFid into the buffer data. The buffer data must be able
+ to hold any name up to CFS_MAXNAMLEN (PATH or NAM??).
+
+ Errors No unusual errors.
+
+
+ 4.15. open
+
+
+ Summary Open a file.
+
+ Arguments
+
+ in
+
+ struct cfs_open_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_open;
+
+
+
+ out
+
+ struct cfs_open_out {
+ dev_t dev;
+ ino_t inode;
+ } cfs_open;
+
+
+
+ Description This request asks Venus to place the file identified by
+ VFid in its cache and to note that the calling process wishes to open
+ it with flags as in open(2). The return value to the kernel differs
+ for Unix and Windows systems. For Unix systems the Coda FS Driver is
+ informed of the device and inode number of the container file in the
+ fields dev and inode. For Windows the path of the container file is
+ returned to the kernel.
+
+ Errors
+
+ NOTE Currently the cfs_open_out structure is not properly adapted to
+ deal with the Windows case. It might be best to implement two
+ upcalls, one to open aiming at a container file name, the other at a
+ container file inode.
+
+
+ 4.16. close
+
+
+ Summary Close a file, update it on the servers.
+
+ Arguments
+
+ in
+
+ struct cfs_close_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_close;
+
+
+
+ out
+ none
+
+ Description Close the file identified by VFid.
+
+ Errors
+
+ NOTE The flags argument is bogus and not used. However, Venus' code
+ has room to deal with an execp input field; probably this field should
+ be used to inform Venus that the file was closed but is still memory
+ mapped for execution. There are comments about fetching versus not
+ fetching the data in Venus vproc_vfscalls. This seems silly. If a
+ file is being closed, the data in the container file is to be the new
+ data. Here again the execp flag might be in play to create confusion:
+ currently Venus might think a file can be flushed from the cache when
+ it is still memory mapped. This needs to be understood.
+
+
+ 4.17. ioctl
+
+
+ Summary Do an ioctl on a file. This includes the pioctl interface.
+
+ Arguments
+
+ in
+
+ struct cfs_ioctl_in {
+ ViceFid VFid;
+ int cmd;
+ int len;
+ int rwflag;
+ char *data; /* Place holder for data. */
+ } cfs_ioctl;
+
+
+
+ out
+
+
+ struct cfs_ioctl_out {
+ int len;
+ caddr_t data; /* Place holder for data. */
+ } cfs_ioctl;
+
+
+
+ Description Do an ioctl operation on a file. The command, len and
+ data arguments are filled as usual. flags is not used by Venus.
+
+ Errors
+
+ NOTE Another bogus parameter. flags is not used. What is the
+ business about PREFETCHING in the Venus code?
+
+
+
+ 4.18. rename
+
+
+ Summary Rename a fid.
+
+ Arguments
+
+ in
+
+ struct cfs_rename_in {
+ ViceFid sourceFid;
+ char *srcname;
+ ViceFid destFid;
+ char *destname;
+ } cfs_rename;
+
+
+
+ out
+ none
+
+ Description Rename the object with name srcname in directory
+ sourceFid to destname in destFid. It is important that the names
+ srcname and destname are null-terminated strings, since strings in
+ Unix kernels are not always null-terminated.
+
+ Errors
+
+
+ 4.19. readdir
+
+
+ Summary Read directory entries.
+
+ Arguments
+
+ in
+
+ struct cfs_readdir_in {
+ ViceFid VFid;
+ int count;
+ int offset;
+ } cfs_readdir;
+
+
+
+
+ out
+
+ struct cfs_readdir_out {
+ int size;
+ caddr_t data; /* Place holder for data. */
+ } cfs_readdir;
+
+
+
+ Description Read directory entries from VFid starting at offset and
+ read at most count bytes. The data is returned in data and its size
+ in size.
+
+ Errors
+
+ NOTE This call is not used. Readdir operations exploit container
+ files. We will re-evaluate this during the directory revamp which is
+ about to take place.
+
+
+ 4.20. vget
+
+
+ Summary Instructs Venus to do an FSDB->Get.
+
+ Arguments
+
+ in
+
+ struct cfs_vget_in {
+ ViceFid VFid;
+ } cfs_vget;
+
+
+
+ out
+
+ struct cfs_vget_out {
+ ViceFid VFid;
+ int vtype;
+ } cfs_vget;
+
+
+
+ Description This upcall asks Venus to do a get operation on an fsobj
+ labelled by VFid.
+
+ Errors
+
+ NOTE This operation is not used. However, it is extremely useful
+ since it can be used to deal with read/write memory mapped files.
+ These can be "pinned" in the Venus cache using vget and released with
+ inactive.
+
+
+ 4.21. fsync
+
+
+ Summary Tell Venus to update the RVM attributes of a file.
+
+ Arguments
+
+ in
+
+ struct cfs_fsync_in {
+ ViceFid VFid;
+ } cfs_fsync;
+
+
+
+ out
+ none
+
+ Description Ask Venus to update RVM attributes of object VFid. This
+ should be called as part of kernel level fsync type calls. The
+ result indicates if the syncing was successful.
+
+ Errors
+
+ NOTE Linux does not implement this call. It should.
+
+
+ 4.22. inactive
+
+
+ Summary Tell Venus a vnode is no longer in use.
+
+ Arguments
+
+ in
+
+ struct cfs_inactive_in {
+ ViceFid VFid;
+ } cfs_inactive;
+
+
+
+ out
+ none
+
+ Description This operation returns EOPNOTSUPP.
+
+ Errors
+
+ NOTE This should perhaps be removed.
+
+
+ 4.23. rdwr
+
+
+ Summary Read or write from a file.
+
+ Arguments
+
+ in
+
+ struct cfs_rdwr_in {
+ ViceFid VFid;
+ int rwflag;
+ int count;
+ int offset;
+ int ioflag;
+ caddr_t data; /* Place holder for data. */
+ } cfs_rdwr;
+
+
+
+
+ out
+
+ struct cfs_rdwr_out {
+ int rwflag;
+ int count;
+ caddr_t data; /* Place holder for data. */
+ } cfs_rdwr;
+
+
+
+ Description This upcall asks Venus to read or write from a file.
+
+ Errors
+
+ NOTE It should be removed, since it goes against the Coda philosophy
+ that read/write operations never reach Venus. I have been told the
+ operation does not work. It is not currently used.
+
+
+
+ 4.24. odymount
+
+
+ Summary Allows mounting multiple Coda "filesystems" on one Unix mount
+ point.
+
+ Arguments
+
+ in
+
+ struct ody_mount_in {
+ char *name; /* Place holder for data. */
+ } ody_mount;
+
+
+
+ out
+
+ struct ody_mount_out {
+ ViceFid VFid;
+ } ody_mount;
+
+
+
+ Description Asks Venus to return the rootfid of a Coda system named
+ name. The fid is returned in VFid.
+
+ Errors
+
+ NOTE This call was used by David for dynamic sets. It should be
+ removed since it causes a jungle of pointers in the VFS mounting area.
+ It is not used by Coda proper, and the call is not implemented by Venus.
+
+
+ 4.25. ody_lookup
+
+
+ Summary Looks up something.
+
+ Arguments
+
+ in irrelevant
+
+
+ out
+ irrelevant
+
+ Description
+
+ Errors
+
+ NOTE Gut it. Call is not implemented by Venus.
+
+
+ 4.26. ody_expand
+
+
+ Summary Expands something in a dynamic set.
+
+ Arguments
+
+ in irrelevant
+
+ out
+ irrelevant
+
+ Description
+
+ Errors
+
+ NOTE Gut it. Call is not implemented by Venus.
+
+
+ 4.27. prefetch
+
+
+ Summary Prefetch a dynamic set.
+
+ Arguments
+
+ in Not documented.
+
+ out
+ Not documented.
+
+ Description Venus worker.cc has support for this call, although it is
+ noted that it doesn't work. Not surprising, since the kernel does not
+ have support for it. (ODY_PREFETCH is not a defined operation).
+
+ Errors
+
+ NOTE Gut it. It isn't working and isn't used by Coda.
+
+
+
+ 4.28. signal
+
+
+ Summary Send Venus a signal about an upcall.
+
+ Arguments
+
+ in none
+
+ out
+ not applicable.
+
+ Description This is an out-of-band upcall to Venus to inform Venus
+ that the calling process received a signal after Venus read the
+ message from the input queue. Venus is supposed to clean up the
+ operation.
+
+ Errors No reply is given.
+
+ NOTE We need to better understand what Venus needs to clean up and if
+ it is doing this correctly. We also need to correctly handle
+ situations where a single system call generates multiple upcalls. It
+ would be important to know which state changes in Venus take place
+ after an upcall for which the kernel is responsible for notifying
+ Venus to clean up (e.g. open definitely is such a state change, but
+ many others may not be).
+
+
+ 5. The minicache and downcalls
+
+
+ The Coda FS Driver can cache results of lookup and access upcalls, to
+ limit the frequency of upcalls. Upcalls carry a price since a process
+ context switch needs to take place. The counterpart of caching the
+ information is that Venus will notify the FS Driver that cached
+ entries must be flushed or renamed.
+
+ The kernel code generally has to maintain a structure which links the
+ internal file handles (called vnodes in BSD, inodes in Linux and
+ FileHandles in Windows) with the ViceFid's which Venus maintains. The
+ reason is that frequent translations back and forth are needed in
+ order to make upcalls and use the results of upcalls. Such linking
+ objects are called cnodes.
+
+ The current minicache implementations have cache entries which record
+ the following:
+
+ 1. the name of the file
+
+ 2. the cnode of the directory containing the object
+
+ 3. a list of CodaCred's for which the lookup is permitted.
+
+ 4. the cnode of the object
+
+ The lookup call in the Coda FS Driver may request the cnode of the
+ desired object from the cache, by passing its name, directory and the
+ CodaCred's of the caller. The cache will return the cnode or indicate
+ that it cannot be found. The Coda FS Driver must be careful to
+ invalidate cache entries when it modifies or removes objects.
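+
+ As a hedged sketch, a cache entry and the lookup described above
+ might look as follows. The names cfsnc_entry and cfsnc_lookup are
+ illustrative, not the actual minicache API, and a real entry keeps a
+ list of CodaCred's rather than the single cred shown here.
+
+         struct cfsnc_entry {
+                 char            name[CFS_MAXNAMLEN]; /* name of the file */
+                 struct cnode   *dcp;  /* cnode of containing directory */
+                 struct CodaCred cred; /* cred for which lookup is valid */
+                 struct cnode   *cp;   /* cnode of the object itself */
+         };
+
+         /* Return the cached cnode, or NULL when no entry matches. */
+         struct cnode *cfsnc_lookup(struct cnode *dcp, const char *name,
+                                    struct CodaCred *cred);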
+
+ When Venus obtains information that indicates that cache entries are
+ no longer valid, it will make a downcall to the kernel. Downcalls are
+ intercepted by the Coda FS Driver and lead to cache invalidations of
+ the kind described below. The Coda FS Driver does not return an error
+ unless the downcall data could not be read into kernel memory.
+
+
+ 5.1. INVALIDATE
+
+
+ No information is available on this call.
+
+
+ 5.2. FLUSH
+
+
+
+ Arguments None
+
+ Summary Flush the name cache entirely.
+
+ Description Venus issues this call upon startup and when it dies. This
+ is to prevent stale cache information being held. Some operating
+ systems allow the kernel name cache to be switched off dynamically.
+ When this is done, this downcall is made.
+
+
+ 5.3. PURGEUSER
+
+
+ Arguments
+
+ struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */
+ struct CodaCred cred;
+ } cfs_purgeuser;
+
+
+
+ Description Remove all entries in the cache carrying the Cred. This
+ call is issued when tokens for a user expire or are flushed.
+
+
+ 5.4. ZAPFILE
+
+
+ Arguments
+
+ struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_zapfile;
+
+
+
+ Description Remove all entries which have the (dir vnode, name) pair.
+ This is issued as a result of an invalidation of cached attributes of
+ a vnode.
+
+ NOTE Call is not named correctly in NetBSD and Mach. The minicache
+ zapfile routine takes different arguments. Linux does not implement
+ the invalidation of attributes correctly.
+
+
+
+ 5.5. ZAPDIR
+
+
+ Arguments
+
+ struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_zapdir;
+
+
+
+ Description Remove all entries in the cache lying in a directory
+ CodaFid, and all children of this directory. This call is issued when
+ Venus receives a callback on the directory.
+
+
+ 5.6. ZAPVNODE
+
+
+
+ Arguments
+
+ struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */
+ struct CodaCred cred;
+ ViceFid VFid;
+ } cfs_zapvnode;
+
+
+
+ Description Remove all entries in the cache carrying the cred and VFid
+ as in the arguments. This downcall is probably never issued.
+
+
+ 5.7. PURGEFID
+
+
+ Summary
+
+ Arguments
+
+ struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_purgefid;
+
+
+
+ Description Flush the attributes for the file. If it is a dir (odd
+ vnode), purge its children from the namecache and remove the file from the
+ namecache.
+
+
+
+ 5.8. REPLACE
+
+
+ Summary Replace the Fid's for a collection of names.
+
+ Arguments
+
+ struct cfs_replace_out { /* cfs_replace is a venus->kernel call */
+ ViceFid NewFid;
+ ViceFid OldFid;
+ } cfs_replace;
+
+
+
+ Description This routine replaces a ViceFid in the name cache with
+ another. It allows Venus, during reintegration, to replace temporary
+ fids that were allocated locally while disconnected with global fids,
+ even when the reference counts on those fids are not zero.
+
+
+ 6. Initialization and cleanup
+
+
+ This section gives brief hints as to desirable features for the Coda
+ FS Driver at startup and upon shutdown or Venus failures. Before
+ entering the discussion it is useful to repeat that the Coda FS Driver
+ maintains the following data:
+
+
+ 1. message queues
+
+ 2. cnodes
+
+ 3. name cache entries
+
+ The name cache entries are entirely private to the driver, so they
+ can easily be manipulated. The message queues will generally have
+ clear points of initialization and destruction. The cnodes are
+ much more delicate. User processes hold reference counts in Coda
+ filesystems and it can be difficult to clean up the cnodes.
+
+ The driver can expect requests through:
+
+ 1. the message subsystem
+
+ 2. the VFS layer
+
+ 3. pioctl interface
+
+ Currently the pioctl passes through the VFS for Coda so we can
+ treat these similarly.
+
+
+ 6.1. Requirements
+
+
+ The following requirements should be accommodated:
+
+ 1. The message queues should have open and close routines. On Unix
+ the opening of the character device is such a routine.
+
+ o Before opening, no messages can be placed.
+
+ o Opening will remove any old messages still pending.
+
+ o Close will notify any sleeping processes that their upcall cannot
+ be completed.
+
+ o Close will free all memory allocated by the message queues.
+
+
+ 2. At open the namecache shall be initialized to empty state.
+
+ 3. Before the message queues are open, all VFS operations will fail.
+ Fortunately this can be achieved by making sure that mounting the
+ Coda filesystem cannot succeed before opening.
+
+ 4. After closing of the queues, no VFS operations can succeed. Here
+ one needs to be careful, since a few operations (lookup,
+ read/write, readdir) can proceed without upcalls. These must be
+ explicitly blocked.
+
+ 5. Upon closing the namecache shall be flushed and disabled.
+
+ 6. All memory held by cnodes can be freed without relying on upcalls.
+
+ 7. Unmounting the file system can be done without relying on upcalls.
+
+ 8. Mounting the Coda filesystem should fail gracefully if Venus cannot
+ get the rootfid or the attributes of the rootfid. The latter is
+ best implemented by Venus fetching these objects before attempting
+ to mount.
+
+ NOTE NetBSD in particular, but also Linux, have not implemented the
+ above requirements fully. For smooth operation this needs to be
+ corrected.
+
+
+
diff --git a/Documentation/filesystems/configfs/Makefile b/Documentation/filesystems/configfs/Makefile
new file mode 100644
index 00000000..be7ec5e6
--- /dev/null
+++ b/Documentation/filesystems/configfs/Makefile
@@ -0,0 +1,3 @@
+ifneq ($(CONFIG_CONFIGFS_FS),)
+obj-m += configfs_example_explicit.o configfs_example_macros.o
+endif
diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt
new file mode 100644
index 00000000..dd57bb6b
--- /dev/null
+++ b/Documentation/filesystems/configfs/configfs.txt
@@ -0,0 +1,484 @@
+
+configfs - Userspace-driven kernel object configuration.
+
+Joel Becker <joel.becker@oracle.com>
+
+Updated: 31 March 2005
+
+Copyright (c) 2005 Oracle Corporation,
+ Joel Becker <joel.becker@oracle.com>
+
+
+[What is configfs?]
+
+configfs is a ram-based filesystem that provides the converse of
+sysfs's functionality. Where sysfs is a filesystem-based view of
+kernel objects, configfs is a filesystem-based manager of kernel
+objects, or config_items.
+
+With sysfs, an object is created in kernel (for example, when a device
+is discovered) and it is registered with sysfs. Its attributes then
+appear in sysfs, allowing userspace to read the attributes via
+readdir(3)/read(2). It may allow some attributes to be modified via
+write(2). The important point is that the object is created and
+destroyed in kernel, the kernel controls the lifecycle of the sysfs
+representation, and sysfs is merely a window on all this.
+
+A configfs config_item is created via an explicit userspace operation:
+mkdir(2). It is destroyed via rmdir(2). The attributes appear at
+mkdir(2) time, and can be read or modified via read(2) and write(2).
+As with sysfs, readdir(3) queries the list of items and/or attributes.
+symlink(2) can be used to group items together. Unlike sysfs, the
+lifetime of the representation is completely driven by userspace. The
+kernel modules backing the items must respond to this.
+
+Both sysfs and configfs can and should exist together on the same
+system. One is not a replacement for the other.
+
+[Using configfs]
+
+configfs can be compiled as a module or into the kernel. You can access
+it by doing
+
+ mount -t configfs none /config
+
+The configfs tree will be empty unless client modules are also loaded.
+These are modules that register their item types with configfs as
+subsystems. Once a client subsystem is loaded, it will appear as a
+subdirectory (or more than one) under /config. Like sysfs, the
+configfs tree is always there, whether mounted on /config or not.
+
+An item is created via mkdir(2). The item's attributes will also
+appear at this time. readdir(3) can determine what the attributes are,
+read(2) can query their default values, and write(2) can store new
+values. Like sysfs, attributes should be ASCII text files, preferably
+with only one value per file. The same efficiency caveats from sysfs
+apply. Don't mix more than one attribute in one attribute file.
+
+Like sysfs, configfs expects write(2) to store the entire buffer at
+once. When writing to configfs attributes, userspace processes should
+first read the entire file, modify the portions they wish to change, and
+then write the entire buffer back. Attribute files have a maximum size
+of one page (PAGE_SIZE, 4096 on i386).
+
+When an item needs to be destroyed, remove it with rmdir(2). An
+item cannot be destroyed if any other item has a link to it (via
+symlink(2)). Links can be removed via unlink(2).
+
+[Configuring FakeNBD: an Example]
+
+Imagine there's a Network Block Device (NBD) driver that allows you to
+access remote block devices. Call it FakeNBD. FakeNBD uses configfs
+for its configuration. Obviously, there will be a nice program that
+sysadmins use to configure FakeNBD, but somehow that program has to tell
+the driver about it. Here's where configfs comes in.
+
+When the FakeNBD driver is loaded, it registers itself with configfs.
+readdir(3) sees this just fine:
+
+ # ls /config
+ fakenbd
+
+A fakenbd connection can be created with mkdir(2). The name is
+arbitrary, but likely the tool will make some use of the name. Perhaps
+it is a uuid or a disk name:
+
+ # mkdir /config/fakenbd/disk1
+ # ls /config/fakenbd/disk1
+ target device rw
+
+The target attribute contains the IP address of the server FakeNBD will
+connect to. The device attribute is the device on the server.
+Predictably, the rw attribute determines whether the connection is
+read-only or read-write.
+
+ # echo 10.0.0.1 > /config/fakenbd/disk1/target
+ # echo /dev/sda1 > /config/fakenbd/disk1/device
+ # echo 1 > /config/fakenbd/disk1/rw
+
+That's it. That's all there is. Now the device is configured, via the
+shell no less.
+
+[Coding With configfs]
+
+Every object in configfs is a config_item. A config_item reflects an
+object in the subsystem. It has attributes that match values on that
+object. configfs handles the filesystem representation of that object
+and its attributes, allowing the subsystem to ignore all but the
+basic show/store interaction.
+
+Items are created and destroyed inside a config_group. A group is a
+collection of items that share the same attributes and operations.
+Items are created by mkdir(2) and removed by rmdir(2), but configfs
+handles that. The group has a set of operations to perform these tasks.
+
+A subsystem is the top level of a client module. During initialization,
+the client module registers the subsystem with configfs, and the
+subsystem appears as a directory at the top of the configfs filesystem. A
+subsystem is also a config_group, and can do everything a config_group
+can.
+
+[struct config_item]
+
+ struct config_item {
+ char *ci_name;
+ char ci_namebuf[UOBJ_NAME_LEN];
+ struct kref ci_kref;
+ struct list_head ci_entry;
+ struct config_item *ci_parent;
+ struct config_group *ci_group;
+ struct config_item_type *ci_type;
+ struct dentry *ci_dentry;
+ };
+
+ void config_item_init(struct config_item *);
+ void config_item_init_type_name(struct config_item *,
+ const char *name,
+ struct config_item_type *type);
+ struct config_item *config_item_get(struct config_item *);
+ void config_item_put(struct config_item *);
+
+Generally, struct config_item is embedded in a container structure, a
+structure that actually represents what the subsystem is doing. The
+config_item portion of that structure is how the object interacts with
+configfs.
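+
+For example (a sketch only; my_widget and to_my_widget() are made-up
+names for illustration):
+
+        struct my_widget {
+                struct config_item item;
+                int threshold;  /* the value the attributes will expose */
+        };
+
+        static inline struct my_widget *to_my_widget(struct config_item *item)
+        {
+                /* Recover the container from the embedded config_item. */
+                return item ? container_of(item, struct my_widget, item) : NULL;
+        }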
+
+Whether statically defined in a source file or created by a parent
+config_group, a config_item must have one of the _init() functions
+called on it. This initializes the reference count and sets up the
+appropriate fields.
+
+All users of a config_item should have a reference on it via
+config_item_get(), and drop the reference when they are done via
+config_item_put().
+
+By itself, a config_item cannot do much more than appear in configfs.
+Usually a subsystem wants the item to display and/or store attributes,
+among other things. For that, it needs a type.
+
+[struct config_item_type]
+
+ struct configfs_item_operations {
+ void (*release)(struct config_item *);
+ ssize_t (*show_attribute)(struct config_item *,
+ struct configfs_attribute *,
+ char *);
+ ssize_t (*store_attribute)(struct config_item *,
+ struct configfs_attribute *,
+ const char *, size_t);
+ int (*allow_link)(struct config_item *src,
+ struct config_item *target);
+ int (*drop_link)(struct config_item *src,
+ struct config_item *target);
+ };
+
+ struct config_item_type {
+ struct module *ct_owner;
+ struct configfs_item_operations *ct_item_ops;
+ struct configfs_group_operations *ct_group_ops;
+ struct configfs_attribute **ct_attrs;
+ };
+
+The most basic function of a config_item_type is to define what
+operations can be performed on a config_item. All items that have been
+allocated dynamically will need to provide the ct_item_ops->release()
+method. This method is called when the config_item's reference count
+reaches zero. Items that wish to display an attribute need to provide
+the ct_item_ops->show_attribute() method. Similarly, storing a new
+attribute value uses the store_attribute() method.
+
+[struct configfs_attribute]
+
+ struct configfs_attribute {
+ char *ca_name;
+ struct module *ca_owner;
+ mode_t ca_mode;
+ };
+
+When a config_item wants an attribute to appear as a file in the item's
+configfs directory, it must define a configfs_attribute describing it.
+It then adds the attribute to the NULL-terminated array
+config_item_type->ct_attrs. When the item appears in configfs, the
+attribute file will appear with the configfs_attribute->ca_name
+filename. configfs_attribute->ca_mode specifies the file permissions.
+
+If an attribute is readable and the config_item provides a
+ct_item_ops->show_attribute() method, that method will be called
+whenever userspace asks for a read(2) on the attribute. The converse
+will happen for write(2).
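+
+Continuing the hypothetical my_widget example from above, a sketch of
+an attribute definition and the ct_attrs array might look like this:
+
+        static struct configfs_attribute my_widget_attr_threshold = {
+                .ca_owner = THIS_MODULE,
+                .ca_name  = "threshold",
+                .ca_mode  = S_IRUGO | S_IWUSR,  /* world-readable, owner-writable */
+        };
+
+        static struct configfs_attribute *my_widget_attrs[] = {
+                &my_widget_attr_threshold,
+                NULL,   /* ct_attrs must be NULL-terminated */
+        };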
+
+[struct config_group]
+
+A config_item cannot live in a vacuum. The only way one can be created
+is via mkdir(2) on a config_group. This will trigger creation of a
+child item.
+
+ struct config_group {
+ struct config_item cg_item;
+ struct list_head cg_children;
+ struct configfs_subsystem *cg_subsys;
+ struct config_group **default_groups;
+ };
+
+ void config_group_init(struct config_group *group);
+ void config_group_init_type_name(struct config_group *group,
+ const char *name,
+ struct config_item_type *type);
+
+
+The config_group structure contains a config_item. Properly configuring
+that item means that a group can behave as an item in its own right.
+However, it can do more: it can create child items or groups. This is
+accomplished via the group operations specified on the group's
+config_item_type.
+
+ struct configfs_group_operations {
+ struct config_item *(*make_item)(struct config_group *group,
+ const char *name);
+ struct config_group *(*make_group)(struct config_group *group,
+ const char *name);
+ int (*commit_item)(struct config_item *item);
+ void (*disconnect_notify)(struct config_group *group,
+ struct config_item *item);
+ void (*drop_item)(struct config_group *group,
+ struct config_item *item);
+ };
+
+A group creates child items by providing the
+ct_group_ops->make_item() method. If provided, this method is called
+from mkdir(2) in the group's directory. The subsystem allocates a new
+config_item (or more likely, its container structure), initializes it,
+and returns it to configfs. Configfs will then populate the filesystem
+tree to reflect the new item.
+
+If the subsystem wants the child to be a group itself, the subsystem
+provides ct_group_ops->make_group(). Everything else behaves the same,
+using the group _init() functions on the group.
+
+Finally, when userspace calls rmdir(2) on the item or group,
+ct_group_ops->drop_item() is called. As a config_group is also a
+config_item, a separate drop_group() method is not necessary.
+The subsystem must config_item_put() the reference that was initialized
+upon item allocation. If a subsystem has no work to do, it may omit
+the ct_group_ops->drop_item() method, and configfs will call
+config_item_put() on the item on behalf of the subsystem.
+
+IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2)
+is called, configfs WILL remove the item from the filesystem tree
+(assuming that it has no children to keep it busy). The subsystem is
+responsible for responding to this. If the subsystem has references to
+the item in other threads, the memory is safe. It may take some time
+for the item to actually disappear from the subsystem's usage. But it
+is gone from configfs.
+
+When drop_item() is called, the item's linkage has already been torn
+down. It no longer has a reference on its parent and has no place in
+the item hierarchy. If a client needs to do some cleanup before this
+teardown happens, the subsystem can implement the
+ct_group_ops->disconnect_notify() method. The method is called after
+configfs has removed the item from the filesystem view but before the
+item is removed from its parent group. Like drop_item(),
+disconnect_notify() is void and cannot fail. Client subsystems should
+not drop any references here, as they still must do it in drop_item().
+
+A config_group cannot be removed while it still has child items. This
+is implemented in the configfs rmdir(2) code. ->drop_item() will not be
+called, as the item has not been dropped. rmdir(2) will fail, as the
+directory is not empty.
+
+[struct configfs_subsystem]
+
+A subsystem must register itself, usually at module_init time. This
+tells configfs to make the subsystem appear in the file tree.
+
+ struct configfs_subsystem {
+ struct config_group su_group;
+ struct mutex su_mutex;
+ };
+
+ int configfs_register_subsystem(struct configfs_subsystem *subsys);
+ void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
+
+ A subsystem consists of a toplevel config_group and a mutex.
+The group is where child config_items are created. For a subsystem,
+this group is usually defined statically. Before calling
+configfs_register_subsystem(), the subsystem must have initialized the
+group via the usual group _init() functions, and it must also have
+initialized the mutex.
+ When the register call returns, the subsystem is live, and it
+will be visible via configfs. At that point, mkdir(2) can be called and
+the subsystem must be ready for it.
+
+[An Example]
+
+The best example of these basic concepts is the simple_children
+subsystem/group and the simple_child item in configfs_example_explicit.c
+and configfs_example_macros.c. It shows a trivial object displaying and
+storing an attribute, and a simple group creating and destroying these
+children.
+
+The only difference between configfs_example_explicit.c and
+configfs_example_macros.c is how the attributes of the childless item
+are defined. The childless item has extended attributes, each with
+their own show()/store() operation. This follows a convention commonly
+used in sysfs. configfs_example_explicit.c creates these attributes
+by explicitly defining the structures involved. Conversely
+configfs_example_macros.c uses some convenience macros from configfs.h
+to define the attributes. These macros are similar to their sysfs
+counterparts.
+
+[Hierarchy Navigation and the Subsystem Mutex]
+
+There is an extra bonus that configfs provides. The config_groups and
+config_items are arranged in a hierarchy due to the fact that they
+appear in a filesystem. A subsystem is NEVER to touch the filesystem
+parts, but the subsystem might be interested in this hierarchy. For
+this reason, the hierarchy is mirrored via the config_group->cg_children
+and config_item->ci_parent structure members.
+
+A subsystem can navigate the cg_children list and the ci_parent pointer
+to see the tree created by the subsystem. This can race with configfs'
+management of the hierarchy, so configfs uses the subsystem mutex to
+protect modifications. Whenever a subsystem wants to navigate the
+hierarchy, it must do so under the protection of the subsystem
+mutex.
+
+A subsystem will be prevented from acquiring the mutex while a newly
+allocated item has not been linked into this hierarchy. Similarly, it
+will not be able to acquire the mutex while a dropping item has not
+yet been unlinked. This means that an item's ci_parent pointer will
+never be NULL while the item is in configfs, and that an item will only
+be in its parent's cg_children list for the same duration. This allows
+a subsystem to trust ci_parent and cg_children while they hold the
+mutex.
+
+[Item Aggregation Via symlink(2)]
+
+configfs provides a simple group via the group->item parent/child
+relationship. Often, however, a larger environment requires aggregation
+outside of the parent/child connection. This is implemented via
+symlink(2).
+
+A config_item may provide the ct_item_ops->allow_link() and
+ct_item_ops->drop_link() methods. If the ->allow_link() method exists,
+symlink(2) may be called with the config_item as the source of the link.
+These links are only allowed between configfs config_items. Any
+symlink(2) attempt outside the configfs filesystem will be denied.
+
+When symlink(2) is called, the source config_item's ->allow_link()
+method is called with itself and a target item. If the source item
+allows linking to target item, it returns 0. A source item may wish to
+reject a link if it only wants links to a certain type of object (say,
+in its own subsystem).
+
+When unlink(2) is called on the symbolic link, the source item is
+notified via the ->drop_link() method. Like the ->drop_item() method,
+this is a void function and cannot return failure. The subsystem is
+responsible for responding to the change.
+
+A config_item cannot be removed while it links to any other item, nor
+can it be removed while an item links to it. Dangling symlinks are not
+allowed in configfs.
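+
+As a sketch, an item that only accepts links to items of its own type
+might provide the following (my_widget_type being a made-up
+config_item_type for the my_widget items used in earlier sketches):
+
+        static int my_widget_allow_link(struct config_item *src,
+                                        struct config_item *target)
+        {
+                /* Refuse links to anything that is not a my_widget. */
+                if (target->ci_type != &my_widget_type)
+                        return -EPERM;
+                return 0;
+        }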
+
+[Automatically Created Subgroups]
+
+A new config_group may want to have two types of child config_items.
+While this could be codified by magic names in ->make_item(), it is much
+more explicit to have a method whereby userspace sees this divergence.
+
+Rather than have a group where some items behave differently than
+others, configfs provides a method whereby one or many subgroups are
+automatically created inside the parent at its creation. Thus,
+mkdir("parent") results in "parent", "parent/subgroup1", up through
+"parent/subgroupN". Items of type 1 can now be created in
+"parent/subgroup1", and items of type N can be created in
+"parent/subgroupN".
+
+These automatic subgroups, or default groups, do not preclude other
+children of the parent group. If ct_group_ops->make_group() exists,
+other child groups can be created on the parent group directly.
+
+A configfs subsystem specifies default groups by filling in the
+NULL-terminated array default_groups on the config_group structure.
+Each group in that array is populated in the configfs tree at the same
+time as the parent group. Similarly, they are removed at the same time
+as the parent. No extra notification is provided. When a ->drop_item()
+method call notifies the subsystem that the parent group is going away,
+it implies that every default group child of that parent group is going
+away as well.
+
+As a consequence of this, default_groups cannot be removed directly via
+rmdir(2). They also are not considered when rmdir(2) on the parent
+group is checking for children.
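+
+A sketch of the setup (names are illustrative; the subgroup objects
+must be set up with the group _init() functions like any other group
+before the parent is registered):
+
+        static struct config_group subgroup1, subgroupN; /* _init()ed elsewhere */
+
+        static struct config_group *my_default_groups[] = {
+                &subgroup1,     /* appears as "parent/subgroup1" */
+                &subgroupN,     /* appears as "parent/subgroupN" */
+                NULL,           /* the array must be NULL-terminated */
+        };
+
+        /* in the parent group's setup code: */
+        parent_group.default_groups = my_default_groups;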
+
+[Dependent Subsystems]
+
+Sometimes other drivers depend on particular configfs items. For
+example, ocfs2 mounts depend on a heartbeat region item. If that
+region item is removed with rmdir(2), the ocfs2 mount must BUG or go
+readonly. Not happy.
+
+configfs provides two additional API calls: configfs_depend_item() and
+configfs_undepend_item(). A client driver can call
+configfs_depend_item() on an existing item to tell configfs that it is
+depended on. configfs will then return -EBUSY from rmdir(2) for that
+item. When the item is no longer depended on, the client driver calls
+configfs_undepend_item() on it.
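+
+A sketch of the pattern, assuming the two-argument form of these calls
+(the my_subsys and item variables are illustrative):
+
+        /* Provider hands out the resource only if the pin succeeds. */
+        ret = configfs_depend_item(my_subsys, item);
+        if (ret)
+                return ret;     /* the item was already being torn down */
+
+        /* ... the item now cannot be removed via rmdir(2) ... */
+
+        /* When the dependent driver lets go: */
+        configfs_undepend_item(my_subsys, item);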
+
+These APIs cannot be called underneath any configfs callbacks, as
+they will conflict. They can block and allocate. A client driver
+probably shouldn't call them on its own initiative; rather, it should
+provide an API that external subsystems call.
+
+How does this work? Imagine the ocfs2 mount process. When it mounts,
+it asks for a heartbeat region item. This is done via a call into the
+heartbeat code. Inside the heartbeat code, the region item is looked
+up. Here, the heartbeat code calls configfs_depend_item(). If it
+succeeds, then heartbeat knows the region is safe to give to ocfs2.
+If it fails, it was being torn down anyway, and heartbeat can gracefully
+pass up an error.
+
+[Committable Items]
+
+NOTE: Committable items are currently unimplemented.
+
+Some config_items cannot have a valid initial state. That is, no
+default values can be specified for the item's attributes such that the
+item can do its work. Userspace must configure one or more attributes,
+after which the subsystem can start whatever entity this item
+represents.
+
+Consider the FakeNBD device from above. Without a target address *and*
+a target device, the subsystem has no idea what block device to import.
+The simple example assumes that the subsystem merely waits until all the
+appropriate attributes are configured, and then connects. This will,
+indeed, work, but now every attribute store must check if the attributes
+are initialized. Every attribute store must fire off the connection if
+that condition is met.
+
+Far better would be an explicit action notifying the subsystem that the
+config_item is ready to go. More importantly, an explicit action allows
+the subsystem to provide feedback as to whether the attributes are
+initialized in a way that makes sense. configfs provides this as
+committable items.
+
+configfs still uses only normal filesystem operations. An item is
+committed via rename(2). The item is moved from a directory where it
+can be modified to a directory where it cannot.
+
+Any group that provides the ct_group_ops->commit_item() method has
+committable items. When this group appears in configfs, mkdir(2) will
+not work directly in the group. Instead, the group will have two
+subdirectories: "live" and "pending". The "live" directory does not
+support mkdir(2) or rmdir(2) either. It only allows rename(2). The
+"pending" directory does allow mkdir(2) and rmdir(2). An item is
+created in the "pending" directory. Its attributes can be modified at
+will. Userspace commits the item by renaming it into the "live"
+directory. At this point, the subsystem receives the ->commit_item()
+callback. If all required attributes are filled to satisfaction, the
+method returns zero and the item is moved to the "live" directory.
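+
+Were this implemented, the userspace flow would look roughly like the
+following (the "widgets" subsystem name is hypothetical, and mv(1) is
+simply a convenient way to issue rename(2)):
+
+        # mkdir /config/widgets/pending/foo
+        # echo value > /config/widgets/pending/foo/attr
+        # mv /config/widgets/pending/foo /config/widgets/live/foo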
+
+As rmdir(2) does not work in the "live" directory, an item must be
+shutdown, or "uncommitted". Again, this is done via rename(2), this
+time from the "live" directory back to the "pending" one. The subsystem
+is notified by the ct_group_ops->uncommit_object() method.
+
+
diff --git a/Documentation/filesystems/configfs/configfs_example_explicit.c b/Documentation/filesystems/configfs/configfs_example_explicit.c
new file mode 100644
index 00000000..1420233d
--- /dev/null
+++ b/Documentation/filesystems/configfs/configfs_example_explicit.c
@@ -0,0 +1,483 @@
+/*
+ * vim: noexpandtab ts=8 sts=0 sw=8:
+ *
+ * configfs_example_explicit.c - This file is a demonstration module
+ * containing a number of configfs subsystems. It explicitly defines
+ * each structure without using the helper macros defined in
+ * configfs.h.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Based on sysfs:
+ * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle. All rights reserved.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+#include <linux/configfs.h>
+
+
+
+/*
+ * 01-childless
+ *
+ * This first example is a childless subsystem. It cannot create
+ * any config_items. It just has attributes.
+ *
+ * Note that we are enclosing the configfs_subsystem inside a container.
+ * This is not necessary if a subsystem has no attributes directly
+ * on the subsystem. See the next example, 02-simple-children, for
+ * such a subsystem.
+ */
+
+struct childless {
+ struct configfs_subsystem subsys;
+ int showme;
+ int storeme;
+};
+
+struct childless_attribute {
+ struct configfs_attribute attr;
+ ssize_t (*show)(struct childless *, char *);
+ ssize_t (*store)(struct childless *, const char *, size_t);
+};
+
+static inline struct childless *to_childless(struct config_item *item)
+{
+ return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL;
+}
+
+static ssize_t childless_showme_read(struct childless *childless,
+ char *page)
+{
+ ssize_t pos;
+
+ pos = sprintf(page, "%d\n", childless->showme);
+ childless->showme++;
+
+ return pos;
+}
+
+static ssize_t childless_storeme_read(struct childless *childless,
+ char *page)
+{
+ return sprintf(page, "%d\n", childless->storeme);
+}
+
+static ssize_t childless_storeme_write(struct childless *childless,
+ const char *page,
+ size_t count)
+{
+ unsigned long tmp;
+ char *p = (char *) page;
+
+ tmp = simple_strtoul(p, &p, 10);
+ if ((*p != '\0') && (*p != '\n'))
+ return -EINVAL;
+
+ if (tmp > INT_MAX)
+ return -ERANGE;
+
+ childless->storeme = tmp;
+
+ return count;
+}
+
+static ssize_t childless_description_read(struct childless *childless,
+ char *page)
+{
+ return sprintf(page,
+"[01-childless]\n"
+"\n"
+"The childless subsystem is the simplest possible subsystem in\n"
+"configfs. It does not support the creation of child config_items.\n"
+"It only has a few attributes. In fact, it isn't much different\n"
+"than a directory in /proc.\n");
+}
+
+static struct childless_attribute childless_attr_showme = {
+ .attr = { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO },
+ .show = childless_showme_read,
+};
+static struct childless_attribute childless_attr_storeme = {
+ .attr = { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR },
+ .show = childless_storeme_read,
+ .store = childless_storeme_write,
+};
+static struct childless_attribute childless_attr_description = {
+ .attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO },
+ .show = childless_description_read,
+};
+
+static struct configfs_attribute *childless_attrs[] = {
+ &childless_attr_showme.attr,
+ &childless_attr_storeme.attr,
+ &childless_attr_description.attr,
+ NULL,
+};
+
+static ssize_t childless_attr_show(struct config_item *item,
+ struct configfs_attribute *attr,
+ char *page)
+{
+ struct childless *childless = to_childless(item);
+ struct childless_attribute *childless_attr =
+ container_of(attr, struct childless_attribute, attr);
+ ssize_t ret = 0;
+
+ if (childless_attr->show)
+ ret = childless_attr->show(childless, page);
+ return ret;
+}
+
+static ssize_t childless_attr_store(struct config_item *item,
+ struct configfs_attribute *attr,
+ const char *page, size_t count)
+{
+ struct childless *childless = to_childless(item);
+ struct childless_attribute *childless_attr =
+ container_of(attr, struct childless_attribute, attr);
+ ssize_t ret = -EINVAL;
+
+ if (childless_attr->store)
+ ret = childless_attr->store(childless, page, count);
+ return ret;
+}
+
+static struct configfs_item_operations childless_item_ops = {
+ .show_attribute = childless_attr_show,
+ .store_attribute = childless_attr_store,
+};
+
+static struct config_item_type childless_type = {
+ .ct_item_ops = &childless_item_ops,
+ .ct_attrs = childless_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct childless childless_subsys = {
+ .subsys = {
+ .su_group = {
+ .cg_item = {
+ .ci_namebuf = "01-childless",
+ .ci_type = &childless_type,
+ },
+ },
+ },
+};
+
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * 02-simple-children
+ *
+ * This example merely has a simple one-attribute child. Note that
+ * there is no extra attribute structure, as the child's attribute is
+ * known from the get-go. Also, there is no container for the
+ * subsystem, as it has no attributes of its own.
+ */
+
+struct simple_child {
+ struct config_item item;
+ int storeme;
+};
+
+static inline struct simple_child *to_simple_child(struct config_item *item)
+{
+ return item ? container_of(item, struct simple_child, item) : NULL;
+}
+
+static struct configfs_attribute simple_child_attr_storeme = {
+ .ca_owner = THIS_MODULE,
+ .ca_name = "storeme",
+ .ca_mode = S_IRUGO | S_IWUSR,
+};
+
+static struct configfs_attribute *simple_child_attrs[] = {
+ &simple_child_attr_storeme,
+ NULL,
+};
+
+static ssize_t simple_child_attr_show(struct config_item *item,
+ struct configfs_attribute *attr,
+ char *page)
+{
+ ssize_t count;
+ struct simple_child *simple_child = to_simple_child(item);
+
+ count = sprintf(page, "%d\n", simple_child->storeme);
+
+ return count;
+}
+
+static ssize_t simple_child_attr_store(struct config_item *item,
+ struct configfs_attribute *attr,
+ const char *page, size_t count)
+{
+ struct simple_child *simple_child = to_simple_child(item);
+ unsigned long tmp;
+ char *p = (char *) page;
+
+ tmp = simple_strtoul(p, &p, 10);
+ if (!p || (*p && (*p != '\n')))
+ return -EINVAL;
+
+ if (tmp > INT_MAX)
+ return -ERANGE;
+
+ simple_child->storeme = tmp;
+
+ return count;
+}
+
+static void simple_child_release(struct config_item *item)
+{
+ kfree(to_simple_child(item));
+}
+
+static struct configfs_item_operations simple_child_item_ops = {
+ .release = simple_child_release,
+ .show_attribute = simple_child_attr_show,
+ .store_attribute = simple_child_attr_store,
+};
+
+static struct config_item_type simple_child_type = {
+ .ct_item_ops = &simple_child_item_ops,
+ .ct_attrs = simple_child_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+
+struct simple_children {
+ struct config_group group;
+};
+
+static inline struct simple_children *to_simple_children(struct config_item *item)
+{
+ return item ? container_of(to_config_group(item), struct simple_children, group) : NULL;
+}
+
+static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
+{
+ struct simple_child *simple_child;
+
+ simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL);
+ if (!simple_child)
+ return ERR_PTR(-ENOMEM);
+
+ config_item_init_type_name(&simple_child->item, name,
+ &simple_child_type);
+
+ simple_child->storeme = 0;
+
+ return &simple_child->item;
+}
+
+static struct configfs_attribute simple_children_attr_description = {
+ .ca_owner = THIS_MODULE,
+ .ca_name = "description",
+ .ca_mode = S_IRUGO,
+};
+
+static struct configfs_attribute *simple_children_attrs[] = {
+ &simple_children_attr_description,
+ NULL,
+};
+
+static ssize_t simple_children_attr_show(struct config_item *item,
+ struct configfs_attribute *attr,
+ char *page)
+{
+ return sprintf(page,
+"[02-simple-children]\n"
+"\n"
+"This subsystem allows the creation of child config_items. These\n"
+"items have only one attribute that is readable and writeable.\n");
+}
+
+static void simple_children_release(struct config_item *item)
+{
+ kfree(to_simple_children(item));
+}
+
+static struct configfs_item_operations simple_children_item_ops = {
+ .release = simple_children_release,
+ .show_attribute = simple_children_attr_show,
+};
+
+/*
+ * Note that, since no extra work is required on ->drop_item(),
+ * no ->drop_item() is provided.
+ */
+static struct configfs_group_operations simple_children_group_ops = {
+ .make_item = simple_children_make_item,
+};
+
+static struct config_item_type simple_children_type = {
+ .ct_item_ops = &simple_children_item_ops,
+ .ct_group_ops = &simple_children_group_ops,
+ .ct_attrs = simple_children_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct configfs_subsystem simple_children_subsys = {
+ .su_group = {
+ .cg_item = {
+ .ci_namebuf = "02-simple-children",
+ .ci_type = &simple_children_type,
+ },
+ },
+};
+
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * 03-group-children
+ *
+ * This example reuses the simple_children group from above. However,
+ * the simple_children group is not the subsystem itself, it is a
+ * child of the subsystem. Creation of a group in the subsystem creates
+ * a new simple_children group. That group can then have simple_child
+ * children of its own.
+ */
+
+static struct config_group *group_children_make_group(struct config_group *group, const char *name)
+{
+ struct simple_children *simple_children;
+
+ simple_children = kzalloc(sizeof(struct simple_children),
+ GFP_KERNEL);
+ if (!simple_children)
+ return ERR_PTR(-ENOMEM);
+
+ config_group_init_type_name(&simple_children->group, name,
+ &simple_children_type);
+
+ return &simple_children->group;
+}
+
+static struct configfs_attribute group_children_attr_description = {
+ .ca_owner = THIS_MODULE,
+ .ca_name = "description",
+ .ca_mode = S_IRUGO,
+};
+
+static struct configfs_attribute *group_children_attrs[] = {
+ &group_children_attr_description,
+ NULL,
+};
+
+static ssize_t group_children_attr_show(struct config_item *item,
+ struct configfs_attribute *attr,
+ char *page)
+{
+ return sprintf(page,
+"[03-group-children]\n"
+"\n"
+"This subsystem allows the creation of child config_groups. These\n"
+"groups are like the subsystem simple-children.\n");
+}
+
+static struct configfs_item_operations group_children_item_ops = {
+ .show_attribute = group_children_attr_show,
+};
+
+/*
+ * Note that, since no extra work is required on ->drop_item(),
+ * no ->drop_item() is provided.
+ */
+static struct configfs_group_operations group_children_group_ops = {
+ .make_group = group_children_make_group,
+};
+
+static struct config_item_type group_children_type = {
+ .ct_item_ops = &group_children_item_ops,
+ .ct_group_ops = &group_children_group_ops,
+ .ct_attrs = group_children_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct configfs_subsystem group_children_subsys = {
+ .su_group = {
+ .cg_item = {
+ .ci_namebuf = "03-group-children",
+ .ci_type = &group_children_type,
+ },
+ },
+};
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * We're now done with our subsystem definitions.
+ * For convenience in this module, here's a list of them all. It
+ * allows the init function to easily register them. Most modules
+ * will only have one subsystem, and will only call register_subsystem
+ * on it directly.
+ */
+static struct configfs_subsystem *example_subsys[] = {
+ &childless_subsys.subsys,
+ &simple_children_subsys,
+ &group_children_subsys,
+ NULL,
+};
+
+static int __init configfs_example_init(void)
+{
+ int ret;
+ int i;
+ struct configfs_subsystem *subsys;
+
+ for (i = 0; example_subsys[i]; i++) {
+ subsys = example_subsys[i];
+
+ config_group_init(&subsys->su_group);
+ mutex_init(&subsys->su_mutex);
+ ret = configfs_register_subsystem(subsys);
+ if (ret) {
+ printk(KERN_ERR "Error %d while registering subsystem %s\n",
+ ret,
+ subsys->su_group.cg_item.ci_namebuf);
+ goto out_unregister;
+ }
+ }
+
+ return 0;
+
+out_unregister:
+ for (i--; i >= 0; i--)
+ configfs_unregister_subsystem(example_subsys[i]);
+
+ return ret;
+}
+
+static void __exit configfs_example_exit(void)
+{
+ int i;
+
+ for (i = 0; example_subsys[i]; i++)
+ configfs_unregister_subsystem(example_subsys[i]);
+}
+
+module_init(configfs_example_init);
+module_exit(configfs_example_exit);
+MODULE_LICENSE("GPL");
diff --git a/Documentation/filesystems/configfs/configfs_example_macros.c b/Documentation/filesystems/configfs/configfs_example_macros.c
new file mode 100644
index 00000000..327dfbc6
--- /dev/null
+++ b/Documentation/filesystems/configfs/configfs_example_macros.c
@@ -0,0 +1,446 @@
+/*
+ * vim: noexpandtab ts=8 sts=0 sw=8:
+ *
+ * configfs_example_macros.c - This file is a demonstration module
+ * containing a number of configfs subsystems. It uses the helper
+ * macros defined by configfs.h
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Based on sysfs:
+ * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle. All rights reserved.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+#include <linux/configfs.h>
+
+
+
+/*
+ * 01-childless
+ *
+ * This first example is a childless subsystem. It cannot create
+ * any config_items. It just has attributes.
+ *
+ * Note that we are enclosing the configfs_subsystem inside a container.
+ * This is not necessary if a subsystem has no attributes directly
+ * on the subsystem. See the next example, 02-simple-children, for
+ * such a subsystem.
+ */
+
+struct childless {
+ struct configfs_subsystem subsys;
+ int showme;
+ int storeme;
+};
+
+static inline struct childless *to_childless(struct config_item *item)
+{
+ return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL;
+}
+
+CONFIGFS_ATTR_STRUCT(childless);
+#define CHILDLESS_ATTR(_name, _mode, _show, _store) \
+struct childless_attribute childless_attr_##_name = __CONFIGFS_ATTR(_name, _mode, _show, _store)
+#define CHILDLESS_ATTR_RO(_name, _show) \
+struct childless_attribute childless_attr_##_name = __CONFIGFS_ATTR_RO(_name, _show)
+
+static ssize_t childless_showme_read(struct childless *childless,
+ char *page)
+{
+ ssize_t pos;
+
+ pos = sprintf(page, "%d\n", childless->showme);
+ childless->showme++;
+
+ return pos;
+}
+
+static ssize_t childless_storeme_read(struct childless *childless,
+ char *page)
+{
+ return sprintf(page, "%d\n", childless->storeme);
+}
+
+static ssize_t childless_storeme_write(struct childless *childless,
+ const char *page,
+ size_t count)
+{
+ unsigned long tmp;
+ char *p = (char *) page;
+
+ tmp = simple_strtoul(p, &p, 10);
+ if (!p || (*p && (*p != '\n')))
+ return -EINVAL;
+
+ if (tmp > INT_MAX)
+ return -ERANGE;
+
+ childless->storeme = tmp;
+
+ return count;
+}
+
+static ssize_t childless_description_read(struct childless *childless,
+ char *page)
+{
+ return sprintf(page,
+"[01-childless]\n"
+"\n"
+"The childless subsystem is the simplest possible subsystem in\n"
+"configfs. It does not support the creation of child config_items.\n"
+"It only has a few attributes. In fact, it isn't much different\n"
+"than a directory in /proc.\n");
+}
+
+CHILDLESS_ATTR_RO(showme, childless_showme_read);
+CHILDLESS_ATTR(storeme, S_IRUGO | S_IWUSR, childless_storeme_read,
+ childless_storeme_write);
+CHILDLESS_ATTR_RO(description, childless_description_read);
+
+static struct configfs_attribute *childless_attrs[] = {
+ &childless_attr_showme.attr,
+ &childless_attr_storeme.attr,
+ &childless_attr_description.attr,
+ NULL,
+};
+
+CONFIGFS_ATTR_OPS(childless);
+static struct configfs_item_operations childless_item_ops = {
+ .show_attribute = childless_attr_show,
+ .store_attribute = childless_attr_store,
+};
+
+static struct config_item_type childless_type = {
+ .ct_item_ops = &childless_item_ops,
+ .ct_attrs = childless_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct childless childless_subsys = {
+ .subsys = {
+ .su_group = {
+ .cg_item = {
+ .ci_namebuf = "01-childless",
+ .ci_type = &childless_type,
+ },
+ },
+ },
+};
+
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * 02-simple-children
+ *
+ * This example merely has a simple one-attribute child. Note that
+ * there is no extra attribute structure, as the child's attribute is
+ * known from the get-go. Also, there is no container for the
+ * subsystem, as it has no attributes of its own.
+ */
+
+struct simple_child {
+ struct config_item item;
+ int storeme;
+};
+
+static inline struct simple_child *to_simple_child(struct config_item *item)
+{
+ return item ? container_of(item, struct simple_child, item) : NULL;
+}
+
+static struct configfs_attribute simple_child_attr_storeme = {
+ .ca_owner = THIS_MODULE,
+ .ca_name = "storeme",
+ .ca_mode = S_IRUGO | S_IWUSR,
+};
+
+static struct configfs_attribute *simple_child_attrs[] = {
+ &simple_child_attr_storeme,
+ NULL,
+};
+
+static ssize_t simple_child_attr_show(struct config_item *item,
+ struct configfs_attribute *attr,
+ char *page)
+{
+ ssize_t count;
+ struct simple_child *simple_child = to_simple_child(item);
+
+ count = sprintf(page, "%d\n", simple_child->storeme);
+
+ return count;
+}
+
+static ssize_t simple_child_attr_store(struct config_item *item,
+ struct configfs_attribute *attr,
+ const char *page, size_t count)
+{
+ struct simple_child *simple_child = to_simple_child(item);
+ unsigned long tmp;
+ char *p = (char *) page;
+
+ tmp = simple_strtoul(p, &p, 10);
+ if (!p || (*p && (*p != '\n')))
+ return -EINVAL;
+
+ if (tmp > INT_MAX)
+ return -ERANGE;
+
+ simple_child->storeme = tmp;
+
+ return count;
+}
+
+static void simple_child_release(struct config_item *item)
+{
+ kfree(to_simple_child(item));
+}
+
+static struct configfs_item_operations simple_child_item_ops = {
+ .release = simple_child_release,
+ .show_attribute = simple_child_attr_show,
+ .store_attribute = simple_child_attr_store,
+};
+
+static struct config_item_type simple_child_type = {
+ .ct_item_ops = &simple_child_item_ops,
+ .ct_attrs = simple_child_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+
+struct simple_children {
+ struct config_group group;
+};
+
+static inline struct simple_children *to_simple_children(struct config_item *item)
+{
+ return item ? container_of(to_config_group(item), struct simple_children, group) : NULL;
+}
+
+static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
+{
+ struct simple_child *simple_child;
+
+ simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL);
+ if (!simple_child)
+ return ERR_PTR(-ENOMEM);
+
+ config_item_init_type_name(&simple_child->item, name,
+ &simple_child_type);
+
+ simple_child->storeme = 0;
+
+ return &simple_child->item;
+}
+
+static struct configfs_attribute simple_children_attr_description = {
+ .ca_owner = THIS_MODULE,
+ .ca_name = "description",
+ .ca_mode = S_IRUGO,
+};
+
+static struct configfs_attribute *simple_children_attrs[] = {
+ &simple_children_attr_description,
+ NULL,
+};
+
+static ssize_t simple_children_attr_show(struct config_item *item,
+ struct configfs_attribute *attr,
+ char *page)
+{
+ return sprintf(page,
+"[02-simple-children]\n"
+"\n"
+"This subsystem allows the creation of child config_items. These\n"
+"items have only one attribute that is readable and writeable.\n");
+}
+
+static void simple_children_release(struct config_item *item)
+{
+ kfree(to_simple_children(item));
+}
+
+static struct configfs_item_operations simple_children_item_ops = {
+ .release = simple_children_release,
+ .show_attribute = simple_children_attr_show,
+};
+
+/*
+ * Note that, since no extra work is required on ->drop_item(),
+ * no ->drop_item() is provided.
+ */
+static struct configfs_group_operations simple_children_group_ops = {
+ .make_item = simple_children_make_item,
+};
+
+static struct config_item_type simple_children_type = {
+ .ct_item_ops = &simple_children_item_ops,
+ .ct_group_ops = &simple_children_group_ops,
+ .ct_attrs = simple_children_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct configfs_subsystem simple_children_subsys = {
+ .su_group = {
+ .cg_item = {
+ .ci_namebuf = "02-simple-children",
+ .ci_type = &simple_children_type,
+ },
+ },
+};
+
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * 03-group-children
+ *
+ * This example reuses the simple_children group from above. However,
+ * the simple_children group is not the subsystem itself, it is a
+ * child of the subsystem. Creation of a group in the subsystem creates
+ * a new simple_children group. That group can then have simple_child
+ * children of its own.
+ */
+
+static struct config_group *group_children_make_group(struct config_group *group, const char *name)
+{
+ struct simple_children *simple_children;
+
+ simple_children = kzalloc(sizeof(struct simple_children),
+ GFP_KERNEL);
+ if (!simple_children)
+ return ERR_PTR(-ENOMEM);
+
+ config_group_init_type_name(&simple_children->group, name,
+ &simple_children_type);
+
+ return &simple_children->group;
+}
+
+static struct configfs_attribute group_children_attr_description = {
+ .ca_owner = THIS_MODULE,
+ .ca_name = "description",
+ .ca_mode = S_IRUGO,
+};
+
+static struct configfs_attribute *group_children_attrs[] = {
+ &group_children_attr_description,
+ NULL,
+};
+
+static ssize_t group_children_attr_show(struct config_item *item,
+ struct configfs_attribute *attr,
+ char *page)
+{
+ return sprintf(page,
+"[03-group-children]\n"
+"\n"
+"This subsystem allows the creation of child config_groups. These\n"
+"groups are like the subsystem simple-children.\n");
+}
+
+static struct configfs_item_operations group_children_item_ops = {
+ .show_attribute = group_children_attr_show,
+};
+
+/*
+ * Note that, since no extra work is required on ->drop_item(),
+ * no ->drop_item() is provided.
+ */
+static struct configfs_group_operations group_children_group_ops = {
+ .make_group = group_children_make_group,
+};
+
+static struct config_item_type group_children_type = {
+ .ct_item_ops = &group_children_item_ops,
+ .ct_group_ops = &group_children_group_ops,
+ .ct_attrs = group_children_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct configfs_subsystem group_children_subsys = {
+ .su_group = {
+ .cg_item = {
+ .ci_namebuf = "03-group-children",
+ .ci_type = &group_children_type,
+ },
+ },
+};
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * We're now done with our subsystem definitions.
+ * For convenience in this module, here's a list of them all. It
+ * allows the init function to easily register them. Most modules
+ * will only have one subsystem, and will only call register_subsystem
+ * on it directly.
+ */
+static struct configfs_subsystem *example_subsys[] = {
+ &childless_subsys.subsys,
+ &simple_children_subsys,
+ &group_children_subsys,
+ NULL,
+};
+
+static int __init configfs_example_init(void)
+{
+ int ret;
+ int i;
+ struct configfs_subsystem *subsys;
+
+ for (i = 0; example_subsys[i]; i++) {
+ subsys = example_subsys[i];
+
+ config_group_init(&subsys->su_group);
+ mutex_init(&subsys->su_mutex);
+ ret = configfs_register_subsystem(subsys);
+ if (ret) {
+ printk(KERN_ERR "Error %d while registering subsystem %s\n",
+ ret,
+ subsys->su_group.cg_item.ci_namebuf);
+ goto out_unregister;
+ }
+ }
+
+ return 0;
+
+out_unregister:
+ for (i--; i >= 0; i--)
+ configfs_unregister_subsystem(example_subsys[i]);
+
+ return ret;
+}
+
+static void __exit configfs_example_exit(void)
+{
+ int i;
+
+ for (i = 0; example_subsys[i]; i++)
+ configfs_unregister_subsystem(example_subsys[i]);
+}
+
+module_init(configfs_example_init);
+module_exit(configfs_example_exit);
+MODULE_LICENSE("GPL");
diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.txt
new file mode 100644
index 00000000..31f53f0a
--- /dev/null
+++ b/Documentation/filesystems/cramfs.txt
@@ -0,0 +1,76 @@
+
+ Cramfs - cram a filesystem onto a small ROM
+
+cramfs is designed to be simple and small, and to compress things well.
+
+It uses the zlib routines to compress a file one page at a time, and
+allows random page access. The meta-data is not compressed, but is
+expressed in a very terse representation to make it use much less
+diskspace than traditional filesystems.
+
+You can't write to a cramfs filesystem (making it compressible and
+compact also makes it _very_ hard to update on-the-fly), so you have to
+create the disk image with the "mkcramfs" utility.
+
+
+Usage Notes
+-----------
+
+File sizes are limited to less than 16MB.
+
+Maximum filesystem size is a little over 256MB. (The last file on the
+filesystem is allowed to extend past 256MB.)
+
+Only the low 8 bits of gid are stored. The current version of
+mkcramfs simply truncates to 8 bits, which is a potential security
+issue.
+
+Hard links are supported, but hard linked files
+will still have a link count of 1 in the cramfs image.
+
+Cramfs directories have no `.' or `..' entries. Directories (like
+every other file on cramfs) always have a link count of 1. (There's
+no need to use -noleaf in `find', btw.)
+
+No timestamps are stored in a cramfs, so these default to the epoch
+(1970 GMT). Recently-accessed files may have updated timestamps, but
+the update lasts only as long as the inode is cached in memory, after
+which the timestamp reverts to 1970, i.e. moves backwards in time.
+
+Currently, cramfs must be written and read with architectures of the
+same endianness, and can be read only by kernels with PAGE_CACHE_SIZE
+== 4096. At least the latter of these is a bug, but it hasn't been
+decided what the best fix is. For the moment if you have larger pages
+you can just change the #define in mkcramfs.c, so long as you don't
+mind the filesystem becoming unreadable to future kernels.
+
+
+For /usr/share/magic
+--------------------
+
+0 ulelong 0x28cd3d45 Linux cramfs offset 0
+>4 ulelong x size %d
+>8 ulelong x flags 0x%x
+>12 ulelong x future 0x%x
+>16 string >\0 signature "%.16s"
+>32 ulelong x fsid.crc 0x%x
+>36 ulelong x fsid.edition %d
+>40 ulelong x fsid.blocks %d
+>44 ulelong x fsid.files %d
+>48 string >\0 name "%.16s"
+512 ulelong 0x28cd3d45 Linux cramfs offset 512
+>516 ulelong x size %d
+>520 ulelong x flags 0x%x
+>524 ulelong x future 0x%x
+>528 string >\0 signature "%.16s"
+>544 ulelong x fsid.crc 0x%x
+>548 ulelong x fsid.edition %d
+>552 ulelong x fsid.blocks %d
+>556 ulelong x fsid.files %d
+>560 string >\0 name "%.16s"
+
+
+Hacker Notes
+------------
+
+See fs/cramfs/README for filesystem layout and implementation notes.
diff --git a/Documentation/filesystems/debugfs.txt b/Documentation/filesystems/debugfs.txt
new file mode 100644
index 00000000..ed52af60
--- /dev/null
+++ b/Documentation/filesystems/debugfs.txt
@@ -0,0 +1,158 @@
+Copyright 2009 Jonathan Corbet <corbet@lwn.net>
+
+Debugfs exists as a simple way for kernel developers to make information
+available to user space. Unlike /proc, which is only meant for information
+about a process, or sysfs, which has strict one-value-per-file rules,
+debugfs has no rules at all. Developers can put any information they want
+there. The debugfs filesystem is also intended to not serve as a stable
+ABI to user space; in theory, there are no stability constraints placed on
+files exported there. The real world is not always so simple, though [1];
+even debugfs interfaces are best designed with the idea that they will need
+to be maintained forever.
+
+Debugfs is typically mounted with a command like:
+
+ mount -t debugfs none /sys/kernel/debug
+
+(Or an equivalent /etc/fstab line).
+
+Note that the debugfs API is exported GPL-only to modules.
+
+Code using debugfs should include <linux/debugfs.h>. Then, the first order
+of business will be to create at least one directory to hold a set of
+debugfs files:
+
+ struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
+
+This call, if successful, will make a directory called name underneath the
+indicated parent directory. If parent is NULL, the directory will be
+created in the debugfs root. On success, the return value is a struct
+dentry pointer which can be used to create files in the directory (and to
+clean it up at the end). A NULL return value indicates that something went
+wrong. If ERR_PTR(-ENODEV) is returned, that is an indication that the
+kernel has been built without debugfs support and none of the functions
+described below will work.
+
+The most general way to create a file within a debugfs directory is with:
+
+ struct dentry *debugfs_create_file(const char *name, mode_t mode,
+ struct dentry *parent, void *data,
+ const struct file_operations *fops);
+
+Here, name is the name of the file to create, mode describes the access
+permissions the file should have, parent indicates the directory which
+should hold the file, data will be stored in the i_private field of the
+resulting inode structure, and fops is a set of file operations which
+implement the file's behavior. At a minimum, the read() and/or write()
+operations should be provided; others can be included as needed. Again,
+the return value will be a dentry pointer to the created file, NULL for
+error, or ERR_PTR(-ENODEV) if debugfs support is missing.
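+
+As a sketch of how these calls fit together (the "mydrv" names below are
+invented for illustration and are not part of the debugfs API), a module
+might create a directory, one fully custom file, and one file using a
+single-value helper described next; cleanup is covered later in this
+document:
+
+	#include <linux/debugfs.h>
+	#include <linux/err.h>
+	#include <linux/fs.h>
+	#include <linux/kernel.h>
+	#include <linux/module.h>
+
+	static struct dentry *mydrv_dir;
+	static u32 mydrv_counter;
+
+	/* format the counter as text for the custom file */
+	static ssize_t mydrv_read(struct file *file, char __user *buf,
+				  size_t count, loff_t *ppos)
+	{
+		char tmp[16];
+		int len = scnprintf(tmp, sizeof(tmp), "%u\n", mydrv_counter);
+
+		return simple_read_from_buffer(buf, count, ppos, tmp, len);
+	}
+
+	static const struct file_operations mydrv_fops = {
+		.owner = THIS_MODULE,
+		.read  = mydrv_read,
+	};
+
+	static int __init mydrv_init(void)
+	{
+		mydrv_dir = debugfs_create_dir("mydrv", NULL);
+		if (!mydrv_dir || IS_ERR(mydrv_dir))
+			return -ENODEV;
+
+		/* fully custom file */
+		debugfs_create_file("counter_text", 0444, mydrv_dir, NULL,
+				    &mydrv_fops);
+		/* the same value exported via the u32 helper */
+		debugfs_create_u32("counter", 0644, mydrv_dir, &mydrv_counter);
+		return 0;
+	}
+
+	static void __exit mydrv_exit(void)
+	{
+		/* removes the directory and everything beneath it */
+		debugfs_remove_recursive(mydrv_dir);
+	}
+
+	module_init(mydrv_init);
+	module_exit(mydrv_exit);
+	MODULE_LICENSE("GPL");
+
+With debugfs mounted as shown above, the files would then appear under
+/sys/kernel/debug/mydrv/.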
+
+In a number of cases, the creation of a set of file operations is not
+actually necessary; the debugfs code provides a number of helper functions
+for simple situations. Files containing a single integer value can be
+created with any of:
+
+ struct dentry *debugfs_create_u8(const char *name, mode_t mode,
+ struct dentry *parent, u8 *value);
+ struct dentry *debugfs_create_u16(const char *name, mode_t mode,
+ struct dentry *parent, u16 *value);
+ struct dentry *debugfs_create_u32(const char *name, mode_t mode,
+ struct dentry *parent, u32 *value);
+ struct dentry *debugfs_create_u64(const char *name, mode_t mode,
+ struct dentry *parent, u64 *value);
+
+These files support both reading and writing the given value; if a specific
+file should not be written to, simply set the mode bits accordingly. The
+values in these files are in decimal; if hexadecimal is more appropriate,
+the following functions can be used instead:
+
+ struct dentry *debugfs_create_x8(const char *name, mode_t mode,
+ struct dentry *parent, u8 *value);
+ struct dentry *debugfs_create_x16(const char *name, mode_t mode,
+ struct dentry *parent, u16 *value);
+ struct dentry *debugfs_create_x32(const char *name, mode_t mode,
+ struct dentry *parent, u32 *value);
+
+Note that there is no debugfs_create_x64().
+
+These functions are useful as long as the developer knows the size of the
+value to be exported. Some types can have different widths on different
+architectures, though, complicating the situation somewhat. There is a
+function meant to help out in one special case:
+
+ struct dentry *debugfs_create_size_t(const char *name, mode_t mode,
+ struct dentry *parent,
+ size_t *value);
+
+As might be expected, this function will create a debugfs file to represent
+a variable of type size_t.
+
+Boolean values can be placed in debugfs with:
+
+ struct dentry *debugfs_create_bool(const char *name, mode_t mode,
+ struct dentry *parent, u32 *value);
+
+A read on the resulting file will yield either Y (for non-zero values) or
+N, followed by a newline. If written to, it will accept either upper- or
+lower-case values, or 1 or 0. Any other input will be silently ignored.
+
+Finally, a block of arbitrary binary data can be exported with:
+
+ struct debugfs_blob_wrapper {
+ void *data;
+ unsigned long size;
+ };
+
+ struct dentry *debugfs_create_blob(const char *name, mode_t mode,
+ struct dentry *parent,
+ struct debugfs_blob_wrapper *blob);
+
+A read of this file will return the data pointed to by the
+debugfs_blob_wrapper structure. Some drivers use "blobs" as a simple way
+to return several lines of (static) formatted text output. This function
+can be used to export binary information, but there does not appear to be
+any code which does so in the mainline. Note that all files created with
+debugfs_create_blob() are read-only.
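+
+Continuing the hypothetical "mydrv" module from the earlier sketch
+(again, the names are invented), a static text blob could be exported
+like this:
+
+	static char mydrv_banner[] = "mydrv 1.0\ndebug build\n";
+
+	static struct debugfs_blob_wrapper mydrv_blob = {
+		.data = mydrv_banner,
+		.size = sizeof(mydrv_banner) - 1,	/* omit trailing NUL */
+	};
+
+	/* in mydrv_init(): */
+	debugfs_create_blob("banner", 0444, mydrv_dir, &mydrv_blob);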
+
+There are a couple of other directory-oriented helper functions:
+
+ struct dentry *debugfs_rename(struct dentry *old_dir,
+ struct dentry *old_dentry,
+ struct dentry *new_dir,
+ const char *new_name);
+
+ struct dentry *debugfs_create_symlink(const char *name,
+ struct dentry *parent,
+ const char *target);
+
+A call to debugfs_rename() will give a new name to an existing debugfs
+file, possibly in a different directory. The new_name must not exist prior
+to the call; the return value is old_dentry with updated information.
+Symbolic links can be created with debugfs_create_symlink().
+
+There is one important thing that all debugfs users must take into account:
+there is no automatic cleanup of any directories created in debugfs. If a
+module is unloaded without explicitly removing debugfs entries, the result
+will be a lot of stale pointers and no end of highly antisocial behavior.
+So all debugfs users - at least those which can be built as modules - must
+be prepared to remove all files and directories they create there. A file
+can be removed with:
+
+ void debugfs_remove(struct dentry *dentry);
+
+The dentry value can be NULL, in which case nothing will be removed.
+
+Once upon a time, debugfs users were required to remember the dentry
+pointer for every debugfs file they created so that all files could be
+cleaned up. We live in more civilized times now, though, and debugfs users
+can call:
+
+ void debugfs_remove_recursive(struct dentry *dentry);
+
+If this function is passed a pointer for the dentry corresponding to the
+top-level directory, the entire hierarchy below that directory will be
+removed.
+
+Notes:
+ [1] http://lwn.net/Articles/309298/
diff --git a/Documentation/filesystems/devpts.txt b/Documentation/filesystems/devpts.txt
new file mode 100644
index 00000000..68dffd87
--- /dev/null
+++ b/Documentation/filesystems/devpts.txt
@@ -0,0 +1,132 @@
+
+To support containers, we now allow multiple instances of the devpts filesystem,
+such that indices of ptys allocated in one instance are independent of indices
+allocated in other instances of devpts.
+
+To preserve backward compatibility, this support for multiple instances is
+enabled only if:
+
+ - CONFIG_DEVPTS_MULTIPLE_INSTANCES=y, and
+ - '-o newinstance' mount option is specified while mounting devpts
+
+IOW, devpts now supports both single-instance and multi-instance semantics.
+
+If CONFIG_DEVPTS_MULTIPLE_INSTANCES=n, there is no change in behavior and
+this is referred to as the "legacy" mode. In this mode, the new mount options
+(-o newinstance and -o ptmxmode) will be ignored with a 'bogus option' message
+on console.
+
+If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and devpts is mounted without the
+'newinstance' option (as in current start-up scripts) the new mount binds
+to the initial kernel mount of devpts. This mode is referred to as the
+'single-instance' mode and the current, single-instance semantics are
+preserved, i.e PTYs are common across the system.
+
+The only difference between this single-instance mode and the legacy mode
+is the presence of a new '/dev/pts/ptmx' node with permissions 0000, which
+can safely be ignored.
+
+If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and 'newinstance' option is specified,
+the mount is considered to be in the multi-instance mode and a new instance
+of the devpts fs is created. Any ptys created in this instance are independent
+of ptys in other instances of devpts. Like in the single-instance mode, the
+/dev/pts/ptmx node is present. To effectively use the multi-instance mode,
+open of /dev/ptmx must be redirected to '/dev/pts/ptmx' using a symlink or
+bind-mount.
+
+Eg: A container startup script could do the following:
+
+ $ chmod 0666 /dev/pts/ptmx
+ $ rm /dev/ptmx
+ $ ln -s pts/ptmx /dev/ptmx
+ $ ns_exec -cm /bin/bash
+
+ # We are now in new container
+
+ $ umount /dev/pts
+ $ mount -t devpts -o newinstance lxcpts /dev/pts
+ $ sshd -p 1234
+
+where 'ns_exec -cm /bin/bash' calls clone() with CLONE_NEWNS flag and execs
+/bin/bash in the child process. A pty created by the sshd is not visible in
+the original mount of /dev/pts.
+
+User-space changes
+------------------
+
+In multi-instance mode (i.e. the '-o newinstance' mount option is specified at
+least once), the following user-space issues should be noted.
+
+1. If -o newinstance mount option is never used, /dev/pts/ptmx can be ignored
+ and no change is needed to system-startup scripts.
+
+2. To effectively use multi-instance mode (i.e. -o newinstance is specified)
+ administrators or startup scripts should "redirect" open of /dev/ptmx to
+ /dev/pts/ptmx using either a bind mount or symlink.
+
+ $ mount -t devpts -o newinstance devpts /dev/pts
+
+ followed by either
+
+ $ rm /dev/ptmx
+ $ ln -s pts/ptmx /dev/ptmx
+ $ chmod 666 /dev/pts/ptmx
+ or
+ $ mount -o bind /dev/pts/ptmx /dev/ptmx
+
+3. The '/dev/ptmx -> pts/ptmx' symlink is the preferred method since it
+ enables better error-reporting and treats both single-instance and
+ multi-instance mounts similarly.
+
+ But this method requires that system-startup scripts set the mode of
+ /dev/pts/ptmx correctly (default mode is 0000). The scripts can set the
+ mode by either
+
+ - adding ptmxmode mount option to devpts entry in /etc/fstab, or
+ - using 'chmod 0666 /dev/pts/ptmx'
+
+4. If a multi-instance mount is needed for containers, but the system
+ startup scripts have not yet been updated, container-startup scripts
+ should bind mount /dev/ptmx to /dev/pts/ptmx to avoid breaking single-
+ instance mounts.
+
+ Or, in general, container-startup scripts should use:
+
+ mount -t devpts -o newinstance -o ptmxmode=0666 devpts /dev/pts
+ if [ ! -L /dev/ptmx ]; then
+ mount -o bind /dev/pts/ptmx /dev/ptmx
+ fi
+
+ When all devpts mounts are multi-instance, /dev/ptmx can permanently be
+ a symlink to pts/ptmx and the bind mount can be ignored.
+
+5. A multi-instance mount that is not accompanied by the /dev/ptmx to
+ /dev/pts/ptmx redirection would result in an unusable/unreachable pty.
+
+ mount -t devpts -o newinstance lxcpts /dev/pts
+
+ immediately followed by:
+
+ open("/dev/ptmx")
+
+ would create a pty, say /dev/pts/7, in the initial kernel mount.
+ But /dev/pts/7 would be invisible in the new mount.
+
+6. The permissions for the /dev/pts/ptmx node should be specified when mounting
+ /dev/pts, using the '-o ptmxmode=%o' mount option (default is 0000).
+
+ mount -t devpts -o newinstance -o ptmxmode=0644 devpts /dev/pts
+
+ The permissions can later be changed as usual with 'chmod'.
+
+ chmod 666 /dev/pts/ptmx
+
+7. A mount of devpts without the 'newinstance' option results in binding to
+ the initial kernel mount. This behavior, while preserving legacy semantics,
+ does not provide strict isolation in a container environment, i.e. by
+ mounting devpts without the 'newinstance' option, a container could gain
+ visibility into the 'host' or root container's devpts.
+
+ To work around this and have strict isolation, all mounts of devpts,
+ including the mount in the root container, should use the newinstance
+ option.
diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking
new file mode 100644
index 00000000..ff7b611a
--- /dev/null
+++ b/Documentation/filesystems/directory-locking
@@ -0,0 +1,114 @@
+ Locking scheme used for directory operations is based on two
+kinds of locks - per-inode (->i_mutex) and per-filesystem
+(->s_vfs_rename_mutex).
+
+ For our purposes all operations fall into six classes:
+
+1) read access. Locking rules: caller locks directory we are accessing.
+
+2) object creation. Locking rules: same as above.
+
+3) object removal. Locking rules: caller locks parent, finds victim,
+locks victim and calls the method.
+
+4) rename() that is _not_ cross-directory. Locking rules: caller locks
+the parent, finds source and target, if target already exists - locks it
+and then calls the method.
+
+5) link creation. Locking rules:
+ * lock parent
+ * check that source is not a directory
+ * lock source
+ * call the method.
+
+6) cross-directory rename. The trickiest in the whole bunch. Locking
+rules:
+ * lock the filesystem
+ * lock parents in "ancestors first" order.
+ * find source and target.
+ * if old parent is equal to or is a descendent of target
+ fail with -ENOTEMPTY
+ * if new parent is equal to or is a descendent of source
+ fail with -ELOOP
+ * if target exists - lock it.
+ * call the method.
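+
+As an illustration of rule (6), here is a simplified sketch of the
+"ancestors first" parent-locking step, loosely modeled on the kernel's
+lock_rename() helper (lockdep annotations and other details omitted;
+this is not the actual implementation):
+
+	struct dentry *lock_two_parents(struct dentry *p1, struct dentry *p2)
+	{
+		struct dentry *p;
+
+		if (p1 == p2) {			/* same-directory rename */
+			mutex_lock(&p1->d_inode->i_mutex);
+			return NULL;
+		}
+
+		/* the per-fs lock serializes cross-directory renames */
+		mutex_lock(&p1->d_inode->i_sb->s_vfs_rename_mutex);
+
+		p = d_ancestor(p2, p1);		/* is p2 an ancestor of p1? */
+		if (p) {
+			mutex_lock(&p2->d_inode->i_mutex);	/* ancestor first */
+			mutex_lock(&p1->d_inode->i_mutex);
+			return p;
+		}
+
+		p = d_ancestor(p1, p2);		/* is p1 an ancestor of p2? */
+		if (p) {
+			mutex_lock(&p1->d_inode->i_mutex);	/* ancestor first */
+			mutex_lock(&p2->d_inode->i_mutex);
+			return p;
+		}
+
+		/* unrelated directories: any fixed order is safe, since
+		 * the ordering cannot change while we hold the fs lock */
+		mutex_lock(&p1->d_inode->i_mutex);
+		mutex_lock(&p2->d_inode->i_mutex);
+		return NULL;
+	}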
+
+
+The rules above obviously guarantee that all directories that are going to be
+read, modified or removed by the method will be locked by the caller.
+
+
+If no directory is its own ancestor, the scheme above is deadlock-free.
+Proof:
+
+ First of all, at any moment we have a partial ordering of the
+objects - A < B iff A is an ancestor of B.
+
+ That ordering can change. However, the following is true:
+
+(1) if object removal or non-cross-directory rename holds lock on A and
+ attempts to acquire lock on B, A will remain the parent of B until we
+ acquire the lock on B. (Proof: only cross-directory rename can change
+ the parent of object and it would have to lock the parent).
+
+(2) if cross-directory rename holds the lock on filesystem, order will not
+ change until rename acquires all locks. (Proof: other cross-directory
+ renames will be blocked on filesystem lock and we don't start changing
+ the order until we had acquired all locks).
+
+(3) any operation holds at most one lock on non-directory object and
+ that lock is acquired after all other locks. (Proof: see descriptions
+ of operations).
+
+ Now consider the minimal deadlock. Each process is blocked on
+attempt to acquire some lock and already holds at least one lock. Let's
+consider the set of contended locks. First of all, filesystem lock is
+not contended, since any process blocked on it is not holding any locks.
+Thus all processes are blocked on ->i_mutex.
+
+ Non-directory objects are not contended due to (3). Thus link
+creation can't be a part of deadlock - it can't be blocked on source
+and it means that it doesn't hold any locks.
+
+ Any contended object is either held by cross-directory rename or
+has a child that is also contended. Indeed, suppose that it is held by an
+operation other than cross-directory rename. Then the lock this operation
+is blocked on belongs to a child of that object due to (1).
+
+ It means that one of the operations is cross-directory rename.
+Otherwise the set of contended objects would be infinite - each of them
+would have a contended child and we had assumed that no object is its
+own descendent. Moreover, there is exactly one cross-directory rename
+(see above).
+
+ Consider the object blocking the cross-directory rename. One
+of its descendents is locked by cross-directory rename (otherwise we
+would again have an infinite set of contended objects). But that
+means that cross-directory rename is taking locks out of order. Due
+to (2) the order hadn't changed since we had acquired filesystem lock.
+But locking rules for cross-directory rename guarantee that we do not
+try to acquire lock on descendent before the lock on ancestor.
+Contradiction. I.e. deadlock is impossible. Q.E.D.
+
+
+ These operations are guaranteed to avoid loop creation. Indeed,
+the only operation that could introduce loops is cross-directory rename.
+Since the only new (parent, child) pair added by rename() is (new parent,
+source), such loop would have to contain these objects and the rest of it
+would have to exist before rename(). I.e. at the moment of loop creation the
+rename() responsible for it would be holding the filesystem lock, and the new
+parent would have to be equal to or a descendent of the source. But that means that
+new parent had been equal to or a descendent of source since the moment when
+we had acquired filesystem lock and rename() would fail with -ELOOP in that
+case.
+
+ While this locking scheme works for arbitrary DAGs, it relies on
+ability to check that directory is a descendent of another object. Current
+implementation assumes that directory graph is a tree. This assumption is
+also preserved by all operations (cross-directory rename on a tree that would
+not introduce a cycle will leave it a tree and link() fails for directories).
+
+ Notice that "directory" in the above == "anything that might have
+children", so if we are going to introduce hybrid objects we will need
+either to make sure that link(2) doesn't work for them or to make changes
+in is_subdir() that would make it work even in presence of such beasts.
diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.txt
new file mode 100644
index 00000000..1b528b2a
--- /dev/null
+++ b/Documentation/filesystems/dlmfs.txt
@@ -0,0 +1,130 @@
+dlmfs
+==================
+A minimal DLM userspace interface implemented via a virtual file
+system.
+
+dlmfs is built with OCFS2 as it requires most of its infrastructure.
+
+Project web page: http://oss.oracle.com/projects/ocfs2
+Tools web page: http://oss.oracle.com/projects/ocfs2-tools
+OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+
+All code copyright 2005 Oracle except when otherwise noted.
+
+CREDITS
+=======
+
+Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds
+and Transmeta Corp.
+
+Mark Fasheh <mark.fasheh@oracle.com>
+
+Caveats
+=======
+- Right now it only works with the OCFS2 DLM, though support for other
+ DLM implementations should not be a major issue.
+
+Mount options
+=============
+None
+
+Usage
+=====
+
+If you're just interested in OCFS2, then please see ocfs2.txt. The
+rest of this document will be geared towards those who want to use
+dlmfs for easy-to-set-up and easy-to-use clustered locking in
+userspace.
+
+Setup
+=====
+
+dlmfs requires that the OCFS2 cluster infrastructure be in
+place. Please download ocfs2-tools from the above url and configure a
+cluster.
+
+You'll want to start heartbeating on a volume which all the nodes in
+your lockspace can access. The easiest way to do this is via
+ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires
+that an OCFS2 file system be in place so that it can automatically
+find its heartbeat area, though it will eventually support heartbeat
+against raw disks.
+
+Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed
+with ocfs2-tools.
+
+Once you're heartbeating, DLM lock 'domains' can be easily created /
+destroyed and locks within them accessed.
+
+Locking
+=======
+
+Users may access dlmfs via standard file system calls, or they can use
+'libo2dlm' (distributed with ocfs2-tools) which abstracts the file
+system calls and presents a more traditional locking api.
+
+dlmfs handles lock caching automatically for the user, so a lock
+request for an already acquired lock will not generate another DLM
+call. Userspace programs are assumed to handle their own local
+locking.
+
+Two levels of locks are supported - Shared Read, and Exclusive.
+Also supported is a Trylock operation.
+
+For information on the libo2dlm interface, please see o2dlm.h,
+distributed with ocfs2-tools.
+
+Lock value blocks can be read and written to a resource via read(2)
+and write(2) against the fd obtained via your open(2) call. The
+maximum currently supported LVB length is 64 bytes (though that is an
+OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share
+small amounts of data amongst their nodes.
+
+mkdir(2) signals dlmfs to join a domain (which will have the same name
+as the resulting directory).
+
+rmdir(2) signals dlmfs to leave the domain.
+
+Locks for a given domain are represented by regular inodes inside the
+domain directory. Locking against them is done via the open(2) system
+call.
+
+The open(2) call will not return until your lock has been granted or
+an error has occurred, unless it has been instructed to do a trylock
+operation. If the lock succeeds, you'll get an fd.
+
+Use open(2) with O_CREAT to ensure the resource inode is created - dlmfs does
+not automatically create inodes for existing lock resources.
+
+Open Flag Lock Request Type
+--------- -----------------
+O_RDONLY Shared Read
+O_RDWR Exclusive
+
+Open Flag Resulting Locking Behavior
+--------- --------------------------
+O_NONBLOCK Trylock operation
+
+You must provide exactly one of O_RDONLY or O_RDWR.
+
+If O_NONBLOCK is also provided and the trylock operation was valid but
+could not lock the resource then open(2) will return ETXTBSY.
+
+close(2) drops the lock associated with your fd.
+
+Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is
+supported locally as well. This means you can use them to restrict
+access to the resources via dlmfs on your local node only.
+
+The resource LVB may be read from the fd in either Shared Read or
+Exclusive modes via the read(2) system call. It can be written via
+write(2) only when open in Exclusive mode.
+
+Once written, an LVB will be visible to other nodes who obtain Read
+Only or higher level locks on the resource.
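+
+As a sketch of the full sequence from userspace (this assumes dlmfs is
+mounted at /dlm; the domain and lock names are invented):
+
+	#include <fcntl.h>
+	#include <stdio.h>
+	#include <string.h>
+	#include <sys/stat.h>
+	#include <unistd.h>
+
+	int main(void)
+	{
+		char lvb[64];
+		int fd;
+
+		/* join the domain - the directory name is the domain name */
+		mkdir("/dlm/mydomain", 0755);
+
+		/* O_RDWR requests an Exclusive lock; adding O_NONBLOCK
+		 * would make this a trylock failing with ETXTBSY */
+		fd = open("/dlm/mydomain/mylock", O_RDWR | O_CREAT, 0644);
+		if (fd < 0) {
+			perror("open");
+			return 1;
+		}
+
+		/* update the lock value block while holding the lock */
+		memset(lvb, 0, sizeof(lvb));
+		strcpy(lvb, "held by this node");
+		if (write(fd, lvb, sizeof(lvb)) != sizeof(lvb))
+			perror("write");
+
+		close(fd);	/* drops the lock */
+		return 0;
+	}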
+
+See Also
+========
+http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
+
+For more information on the VMS distributed locking API.
diff --git a/Documentation/filesystems/dnotify.txt b/Documentation/filesystems/dnotify.txt
new file mode 100644
index 00000000..6baf88f4
--- /dev/null
+++ b/Documentation/filesystems/dnotify.txt
@@ -0,0 +1,70 @@
+ Linux Directory Notification
+ ============================
+
+ Stephen Rothwell <sfr@canb.auug.org.au>
+
+The intention of directory notification is to allow user applications
+to be notified when a directory, or any of the files in it, are changed.
+The basic mechanism involves the application registering for notification
+on a directory using a fcntl(2) call and the notifications themselves
+being delivered using signals.
+
+The application decides which "events" it wants to be notified about.
+The currently defined events are:
+
+ DN_ACCESS A file in the directory was accessed (read)
+ DN_MODIFY A file in the directory was modified (write,truncate)
+ DN_CREATE A file was created in the directory
+ DN_DELETE A file was unlinked from directory
+ DN_RENAME A file in the directory was renamed
+ DN_ATTRIB A file in the directory had its attributes
+ changed (chmod,chown)
+
+Usually, the application must reregister after each notification, but
+if DN_MULTISHOT is or'ed with the event mask, then the registration will
+remain until explicitly removed (by registering for no events).
+
+By default, SIGIO will be delivered to the process, with no other useful
+information. However, if the F_SETSIG fcntl(2) call is used to let the
+kernel know which signal to deliver, a siginfo structure will be passed to
+the signal handler and the si_fd member of that structure will contain the
+file descriptor associated with the directory in which the event occurred.
+
+Preferably the application will choose one of the real time signals
+(SIGRTMIN + <n>) so that the notifications may be queued. This is
+especially important if DN_MULTISHOT is specified. Note that SIGRTMIN
+is often blocked, so it is better to use (at least) SIGRTMIN + 1.
+
+Implementation expectations (features and bugs :-))
+---------------------------
+
+The notification should work for any local access to files even if the
+actual file system is on a remote server. This implies that remote
+access to files served by local user mode servers should be notified.
+Also, remote accesses to files served by a local kernel NFS server should
+be notified.
+
+In order to make the impact on the file system code as small as possible,
+the problem of hard links to files has been ignored. So if a file (x)
+exists in two directories (a and b) then a change to the file using the
+name "a/x" should be notified to a program expecting notifications on
+directory "a", but will not be notified to one expecting notifications on
+directory "b".
+
+Also, files that are unlinked will still cause notifications in the
+last directory that they were linked to.
+
+Configuration
+-------------
+
+Dnotify is controlled via the CONFIG_DNOTIFY configuration option. When
+disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL.
+
+Example
+-------
+See Documentation/filesystems/dnotify_test.c for an example.
+
+NOTE
+----
+Beginning with Linux 2.6.13, dnotify has been replaced by inotify.
+See Documentation/filesystems/inotify.txt for more information on it.
diff --git a/Documentation/filesystems/dnotify_test.c b/Documentation/filesystems/dnotify_test.c
new file mode 100644
index 00000000..8b37b4a1
--- /dev/null
+++ b/Documentation/filesystems/dnotify_test.c
@@ -0,0 +1,34 @@
+#define _GNU_SOURCE /* needed to get the defines */
+#include <fcntl.h> /* in glibc 2.2 this has the needed
+ values defined */
+#include <signal.h>
+#include <stdio.h>
+#include <unistd.h>
+
+static volatile int event_fd;
+
+static void handler(int sig, siginfo_t *si, void *data)
+{
+ event_fd = si->si_fd;
+}
+
+int main(void)
+{
+ struct sigaction act;
+ int fd;
+
+ act.sa_sigaction = handler;
+ sigemptyset(&act.sa_mask);
+ act.sa_flags = SA_SIGINFO;
+ sigaction(SIGRTMIN + 1, &act, NULL);
+
+ fd = open(".", O_RDONLY);
+ fcntl(fd, F_SETSIG, SIGRTMIN + 1);
+ fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT);
+ /* we will now be notified if any of the files
+ in "." is modified or new files are created */
+ while (1) {
+ pause();
+ printf("Got event on fd=%d\n", event_fd);
+ }
+}
diff --git a/Documentation/filesystems/ecryptfs.txt b/Documentation/filesystems/ecryptfs.txt
new file mode 100644
index 00000000..01d8a083
--- /dev/null
+++ b/Documentation/filesystems/ecryptfs.txt
@@ -0,0 +1,77 @@
+eCryptfs: A stacked cryptographic filesystem for Linux
+
+eCryptfs is free software. Please see the file COPYING for details.
+For documentation, please see the files in the doc/ subdirectory. For
+building and installation instructions please see the INSTALL file.
+
+Maintainer: Phillip Hellewell
+Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com>
+Developers: Michael C. Thompson
+ Kent Yoder
+Web Site: http://ecryptfs.sf.net
+
+This software is currently undergoing development. Make sure to
+maintain a backup copy of any data you write into eCryptfs.
+
+eCryptfs requires the userspace tools downloadable from the
+SourceForge site:
+
+http://sourceforge.net/projects/ecryptfs/
+
+Userspace requirements include:
+ - David Howells' userspace keyring headers and libraries (version
+ 1.0 or higher), obtainable from
+ http://people.redhat.com/~dhowells/keyutils/
+ - Libgcrypt
+
+
+NOTES
+
+In the beta/experimental releases of eCryptfs, when you upgrade
+eCryptfs, you should copy the files to an unencrypted location and
+then copy the files back into the new eCryptfs mount to migrate the
+files.
+
+
+MOUNT-WIDE PASSPHRASE
+
+Create a new directory into which eCryptfs will write its encrypted
+files (e.g., /root/crypt). Then, create the mount point directory
+(e.g., /mnt/crypt). Now it's time to mount eCryptfs:
+
+mount -t ecryptfs /root/crypt /mnt/crypt
+
+You should be prompted for a passphrase and a salt (the salt may be
+blank).
+
+Try writing a new file:
+
+echo "Hello, World" > /mnt/crypt/hello.txt
+
+The operation will complete. Notice that there is a new file in
+/root/crypt that is at least 12288 bytes in size (depending on your
+host page size). This is the encrypted underlying file for what you
+just wrote. To test reading from start to finish, you need to clear
+the user session keyring:
+
+keyctl clear @u
+
+Then umount /mnt/crypt and mount again per the instructions given
+above.
+
+cat /mnt/crypt/hello.txt
+
+
+NOTES
+
+eCryptfs version 0.1 should only be mounted on (1) empty directories
+or (2) directories containing files only created by eCryptfs. If you
+mount a directory that has pre-existing files not created by eCryptfs,
+then behavior is undefined. Do not run eCryptfs in higher verbosity
+levels unless you are doing so for the sole purpose of debugging or
+development, since secret values will be written out to the system log
+in that case.
+
+
+Mike Halcrow
+mhalcrow@us.ibm.com
diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt
new file mode 100644
index 00000000..23583a13
--- /dev/null
+++ b/Documentation/filesystems/exofs.txt
@@ -0,0 +1,185 @@
+===============================================================================
+WHAT IS EXOFS?
+===============================================================================
+
+exofs is a file system that uses an OSD and exports the API of a normal Linux
+file system. Users access exofs like any other local file system, and exofs
+will in turn issue commands to the local OSD initiator.
+
+OSD is a new T10 command set that views storage devices not as a large/flat
+array of sectors but as a container of objects, each having a length, quota,
+time attributes and more. Each object is addressed by a 64bit ID, and is
+contained in a 64bit ID partition. Each object has associated attributes
+attached to it, which are an integral part of the object and provide metadata about
+the object. The standard defines some common obligatory attributes, but user
+attributes can be added as needed.
+
+===============================================================================
+ENVIRONMENT
+===============================================================================
+
+To use this file system, you need to have an object store to run it on. You
+may download a target from:
+http://open-osd.org
+
+See Documentation/scsi/osd.txt for how to setup a working osd environment.
+
+===============================================================================
+USAGE
+===============================================================================
+
+1. Download and compile exofs and open-osd initiator:
+ You need an external Kernel source tree or kernel headers from your
+ distribution (anything based on 2.6.26 or later).
+
+ a. download open-osd including exofs source using:
+ [parent-directory]$ git clone git://git.open-osd.org/open-osd.git
+
+ b. Build the library module like this:
+ [parent-directory]$ make -C KSRC=$(KER_DIR) open-osd
+
+ This will build both the open-osd initiator as well as the exofs kernel
+ module. Use whatever parameters you compiled your Kernel with and
+ $(KER_DIR) above pointing to the Kernel you compile against. See the file
+ open-osd/top-level-Makefile for an example.
+
+2. Get the OSD initiator and target set up properly, and login to the target.
+ See Documentation/scsi/osd.txt for further instructions. Also see ./do-osd
+ for an example script that does all these steps.
+
+3. Insmod the exofs.ko module:
+ [exofs]$ insmod exofs.ko
+
+4. Make sure the directory where you want to mount exists. If not, create it.
+ (For example, mkdir /mnt/exofs)
+
+5. On the first run you will need to invoke the mkfs.exofs application
+
+ As an example, this will create the file system on:
+ /dev/osd0 partition ID 65536
+
+ mkfs.exofs --pid=65536 --format /dev/osd0
+
+ The --format is optional. If not specified, no OSD_FORMAT will be
+ performed and a clean file system will be created in the specified pid,
+ in the available space of the target. (Use --format=size_in_meg to limit
+ the total LUN space available)
+
+ If pid already exists, it will be deleted and a new one will be created in
+ its place. Be careful.
+
+ An exofs lives inside a single OSD partition. You can create multiple exofs
+ filesystems on the same device using multiple pids.
+
+ (run mkfs.exofs without any parameters for usage help message)
+
+6. Mount the file system.
+
+ For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs:
+
+ mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/
+
+7. For reference (See do-exofs example script):
+ do-exofs start - an example of how to perform the above steps.
+ do-exofs stop - an example of how to unmount the file system.
+ do-exofs format - an example of how to format and mkfs a new exofs.
+
+8. Extra compilation flags (uncomment in fs/exofs/Kbuild):
+ CONFIG_EXOFS_DEBUG - for debug messages and extra checks.
+
+===============================================================================
+exofs mount options
+===============================================================================
+Similar to any mount command:
+ mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory
+
+Where:
+ -t exofs: specifies the exofs file system
+
+ /dev/osdX: X is a decimal number. /dev/osdX was created after a successful
+ login into an OSD target.
+
+ mount_exofs_directory: The directory to mount the file system on
+
+ exofs specific options: Options are separated by commas (,)
+ pid=<integer> - The partition number to mount/create as
+ container of the filesystem.
+ This option is mandatory. The integer can be
+ hex by prepending 0x to the number.
+ osdname=<id> - Mount by a device's osdname.
+ osdname is usually a 36 character uuid of the
+ form "d2683732-c906-4ee1-9dbd-c10c27bb40df".
+ It is one of the device's uuid specified in the
+ mkfs.exofs format command.
+ If this option is specified then the /dev/osdX
+ above can be empty and is ignored.
+ to=<integer> - Timeout in ticks for a single command.
+ default is (60 * HZ) [for debugging only]
+
+===============================================================================
+DESIGN
+===============================================================================
+
+* The file system control block (AKA on-disk superblock) resides in an object
+ with a special ID (defined in common.h).
+ Information included in the file system control block is used to fill the
+ in-memory superblock structure at mount time. This object is created before
+ the file system is used by mkexofs.c. It contains information such as:
+ - The file system's magic number
+ - The next inode number to be allocated
+
+* Each file resides in its own object and contains the data (and it will be
+ possible to extend the file over multiple objects, though this has not been
+ implemented yet).
+
+* A directory is treated as a file, and essentially contains a list of <file
+ name, inode #> pairs for files that are found in that directory. The object
+ IDs correspond to the files' inode numbers and will be allocated according to
+ a bitmap (stored in a separate object). Now they are allocated using a
+ counter.
+
+* Each file's control block (AKA on-disk inode) is stored in its object's
+ attributes. This applies to both regular files and other types (directories,
+ device files, symlinks, etc.).
+
+* Credentials are generated per object (inode and superblock) when they are
+ created in memory (read from disk or created). The credential works for all
+ operations and is used as long as the object remains in memory.
+
+* Async OSD operations are used whenever possible, but the target may execute
+ them out of order. The operations that concern us are create, delete,
+ readpage, writepage, update_inode, and truncate. The following pairs of
+ operations should execute in the order written, and we need to prevent them
+ from executing in reverse order:
+ - The following are handled with the OBJ_CREATED and OBJ_2BCREATED
+ flags. OBJ_CREATED is set when we know the object exists on the OSD -
+ in create's callback function, and when we successfully do a
+ read_inode.
+ OBJ_2BCREATED is set in the beginning of the create function, so we
+ know that we should wait.
+ - create/delete: delete should wait until the object is created
+ on the OSD.
+ - create/readpage: readpage should be able to return a page
+ full of zeroes in this case. If there was a write already
+ en-route (i.e. create, writepage, readpage) then the page
+ would be locked, and so it would really be the same as
+ create/writepage.
+ - create/writepage: if writepage is called for a sync write, it
+ should wait until the object is created on the OSD.
+ Otherwise, it should just return.
+ - create/truncate: truncate should wait until the object is
+ created on the OSD.
+ - create/update_inode: update_inode should wait until the
+ object is created on the OSD.
+ - Handled by VFS locks:
+ - readpage/delete: shouldn't happen because of page lock.
+ - writepage/delete: shouldn't happen because of page lock.
+ - readpage/writepage: shouldn't happen because of page lock.
+
+===============================================================================
+LICENSE/COPYRIGHT
+===============================================================================
+The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel
+version 2.6.10). All files include the original copyrights, and the license
+is GPL version 2 (only version 2, as is true for the Linux kernel). The
+Linux kernel can be downloaded from www.kernel.org.
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
new file mode 100644
index 00000000..67639f90
--- /dev/null
+++ b/Documentation/filesystems/ext2.txt
@@ -0,0 +1,383 @@
+
+The Second Extended Filesystem
+==============================
+
+ext2 was originally released in January 1993. Written by Rémy Card,
+Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
+Extended Filesystem. It is currently still (April 2001) the predominant
+filesystem in use by Linux. There are also implementations available
+for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.
+
+Options
+=======
+
+Most defaults are determined by the filesystem superblock, and can be
+set using tune2fs(8). Kernel-determined defaults are indicated by (*).
+
+bsddf (*) Makes `df' act like BSD.
+minixdf Makes `df' act like Minix.
+
+check=none, nocheck (*) Don't do extra checking of bitmaps on mount
+ (check=normal and check=strict options removed)
+
+debug Extra debugging information is sent to the
+ kernel syslog. Useful for developers.
+
+errors=continue Keep going on a filesystem error.
+errors=remount-ro Remount the filesystem read-only on an error.
+errors=panic Panic and halt the machine if an error occurs.
+
+grpid, bsdgroups Give objects the same group ID as their parent.
+nogrpid, sysvgroups New objects have the group ID of their creator.
+
+nouid32 Use 16-bit UIDs and GIDs.
+
+oldalloc Enable the old block allocator. Orlov should
+ have better performance; we'd like to hear
+ feedback if that is not the case for you.
+orlov (*) Use the Orlov block allocator.
+ (See http://lwn.net/Articles/14633/ and
+ http://lwn.net/Articles/14446/.)
+
+resuid=n The user ID which may use the reserved blocks.
+resgid=n The group ID which may use the reserved blocks.
+
+sb=n Use alternate superblock at this location.
+
+user_xattr Enable "user." POSIX Extended Attributes
+ (requires CONFIG_EXT2_FS_XATTR).
+ See also http://acl.bestbits.at
+nouser_xattr Don't support "user." extended attributes.
+
+acl Enable POSIX Access Control Lists support
+ (requires CONFIG_EXT2_FS_POSIX_ACL).
+ See also http://acl.bestbits.at
+noacl Don't support POSIX ACLs.
+
+nobh Do not attach buffer_heads to file pagecache.
+
+xip Use execute in place (no caching) if possible
+
+grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2.
+
+
+Specification
+=============
+
+ext2 shares many properties with traditional Unix filesystems. It has
+the concepts of blocks, inodes and directories. It has space in the
+specification for Access Control Lists (ACLs), fragments, undeletion and
+compression though these are not yet implemented (some are available as
+separate patches). There is also a versioning mechanism to allow new
+features (such as journalling) to be added in a maximally compatible
+manner.
+
+Blocks
+------
+
+The space in the device or file is split up into blocks. These are
+a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
+which is decided when the filesystem is created. Smaller blocks mean
+less wasted space per file, but require slightly more accounting overhead,
+and also impose other limits on the size of files and the filesystem.
+
+Block Groups
+------------
+
+Blocks are clustered into block groups in order to reduce fragmentation
+and minimise the amount of head seeking when reading a large amount
+of consecutive data. Information about each block group is kept in a
+descriptor table stored in the block(s) immediately after the superblock.
+Two blocks near the start of each group are reserved for the block usage
+bitmap and the inode usage bitmap which show which blocks and inodes
+are in use. Since each bitmap is limited to a single block, this means
+that the maximum size of a block group is 8 times the size of a block.
+
+The block(s) following the bitmaps in each block group are designated
+as the inode table for that block group and the remainder are the data
+blocks. The block allocation algorithm attempts to allocate data blocks
+in the same block group as the inode which contains them.
+
+The Superblock
+--------------
+
+The superblock contains all the information about the configuration of
+the filing system. The primary copy of the superblock is stored at an
+offset of 1024 bytes from the start of the device, and it is essential
+to mounting the filesystem. Since it is so important, backup copies of
+the superblock are stored in block groups throughout the filesystem.
+The first version of ext2 (revision 0) stores a copy at the start of
+every block group, along with backups of the group descriptor block(s).
+Because this can consume a considerable amount of space for large
+filesystems, later revisions can optionally reduce the number of backup
+copies by only putting backups in specific groups (this is the sparse
+superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
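+
+As a concrete illustration, the placement rule can be sketched in a few
+lines of C (this mirrors the rule above and is similar in spirit to the
+kernel's ext2_group_sparse() helper; it is not the kernel's actual code):
+
+static int is_power_of(unsigned int n, unsigned int base)
+{
+        while (n > 1) {
+                if (n % base)
+                        return 0;
+                n /= base;
+        }
+        return 1;
+}
+
+/* Does block group 'group' carry a superblock backup when the
+ * sparse superblock feature is in effect? */
+static int group_has_super_backup(unsigned int group)
+{
+        if (group <= 1)
+                return 1;
+        return is_power_of(group, 3) || is_power_of(group, 5) ||
+               is_power_of(group, 7);
+}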
+
+The information in the superblock contains fields such as the total
+number of inodes and blocks in the filesystem and how many are free,
+how many inodes and blocks are in each block group, when the filesystem
+was mounted (and if it was cleanly unmounted), when it was modified,
+what version of the filesystem it is (see the Revisions section below)
+and which OS created it.
+
+If the filesystem is revision 1 or higher, then there are extra fields,
+such as a volume name, a unique identification number, the inode size,
+and space for optional filesystem features to store configuration info.
+
+All fields in the superblock (as in all other ext2 structures) are stored
+on the disc in little endian format, so a filesystem is portable between
+machines without having to know what machine it was created on.
+
+Inodes
+------
+
+The inode (index node) is a fundamental concept in the ext2 filesystem.
+Each object in the filesystem is represented by an inode. The inode
+structure contains pointers to the filesystem blocks which contain the
+data held in the object and all of the metadata about an object except
+its name. The metadata about an object includes the permissions, owner,
+group, flags, size, number of blocks used, access time, change time,
+modification time, deletion time, number of links, fragments, version
+(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
+
+There are some reserved fields which are currently unused in the inode
+structure and several which are overloaded. One field is reserved for the
+directory ACL if the inode is a directory and alternately for the top 32
+bits of the file size if the inode is a regular file (allowing file sizes
+larger than 2GB). The translator field is unused under Linux, but is used
+by the HURD to reference the inode of a program which will be used to
+interpret this object. Most of the remaining reserved fields have been
+used up for both Linux and the HURD for larger owner and group fields.
+The HURD also has a larger mode field, so it uses another of the
+remaining fields to store the extra mode bits.
+
+There are pointers to the first 12 blocks which contain the file's data
+in the inode. There is a pointer to an indirect block (which contains
+pointers to the next set of blocks), a pointer to a doubly-indirect
+block (which contains pointers to indirect blocks) and a pointer to a
+trebly-indirect block (which contains pointers to doubly-indirect blocks).
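+
+The reach of this scheme is easy to work out. For example, with 4096-byte
+blocks and 4-byte block numbers, each indirect block holds 1024 pointers,
+so the pointer scheme alone can address 12 + 1024 + 1024^2 + 1024^3 blocks
+(a little over 4TB). In practice other inode fields (such as the 32-bit
+count of 512-byte sectors) cap file sizes below this, which is why the
+Limitations section lists smaller limits. A quick check of the
+arithmetic (illustrative only):
+
+#include <stdio.h>
+
+int main(void)
+{
+        unsigned long long bs = 4096;        /* block size in bytes */
+        unsigned long long ppb = bs / 4;     /* pointers per block  */
+        unsigned long long nblocks =
+                12 + ppb + ppb * ppb + ppb * ppb * ppb;
+
+        printf("addressable: %llu blocks, %llu GB\n",
+               nblocks, nblocks * bs >> 30);
+        return 0;
+}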
+
+The flags field contains some ext2-specific flags which aren't catered
+for by the standard chmod flags. These flags can be listed with lsattr
+and changed with the chattr command, and allow specific filesystem
+behaviour on a per-file basis. There are flags for secure deletion,
+undeletable, compression, synchronous updates, immutability, append-only,
+dumpable, no-atime, indexed directories, and data-journaling. Not all
+of these are supported yet.
+
+Directories
+-----------
+
+A directory is a filesystem object and has an inode just like a file.
+It is a specially formatted file containing records which associate
+each name with an inode number. Later revisions of the filesystem also
+encode the type of the object (file, directory, symlink, device, fifo,
+socket) to avoid the need to check the inode itself for this information
+(support for taking advantage of this feature does not yet exist in
+Glibc 2.2).
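+
+For illustration, a directory record has roughly this on-disk shape (a
+sketch based on the description above; the kernel's revision 1 form is
+struct ext2_dir_entry_2):
+
+struct ext2_dir_record {
+        __u32 inode;        /* inode number, 0 = unused entry */
+        __u16 rec_len;      /* distance to the next record */
+        __u8  name_len;     /* length of the name in bytes */
+        __u8  file_type;    /* revision >= 1: file/dir/symlink/... */
+        char  name[];       /* the name itself, not NUL-terminated */
+};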
+
+The inode allocation code tries to assign inodes which are in the same
+block group as the directory in which they are first created.
+
+The current implementation of ext2 uses a singly-linked list to store
+the filenames in the directory; a pending enhancement uses hashing of the
+filenames to allow lookup without the need to scan the entire directory.
+
+The current implementation never removes empty directory blocks once they
+have been allocated to hold more files.
+
+Special files
+-------------
+
+Symbolic links are also filesystem objects with inodes. They deserve
+special mention because the data for them is stored within the inode
+itself if the symlink is less than 60 bytes long. It uses the fields
+which would normally be used to store the pointers to data blocks.
+This is a worthwhile optimisation as it avoids allocating a full
+block for the symlink, and most symlinks are less than 60 characters long.
+
+Character and block special devices never have data blocks assigned to
+them. Instead, their device number is stored in the inode, again reusing
+the fields which would be used to point to the data blocks.
+
+Reserved Space
+--------------
+
+In ext2, there is a mechanism for reserving a certain number of blocks
+for a particular user (normally the super-user). This is intended to
+allow for the system to continue functioning even if non-privileged users
+fill up all the space available to them (this is independent of filesystem
+quotas). It also keeps the filesystem from filling up entirely which
+helps combat fragmentation.
+
+Filesystem check
+----------------
+
+At boot time, most systems run a consistency check (e2fsck) on their
+filesystems. The superblock of the ext2 filesystem contains several
+fields which indicate whether fsck should actually run (since checking
+the filesystem at boot can take a long time if it is large). fsck will
+run if the filesystem was not cleanly unmounted, if the maximum mount
+count has been exceeded or if the maximum time between checks has been
+exceeded.
+
+Feature Compatibility
+---------------------
+
+The compatibility feature mechanism used in ext2 is sophisticated.
+It safely allows features to be added to the filesystem, without
+unnecessarily sacrificing compatibility with older versions of the
+filesystem code. The feature compatibility mechanism is not supported by
+the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
+revision 1. There are three 32-bit fields, one for compatible features
+(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
+incompatible (INCOMPAT) features.
+
+These feature flags have specific meanings for the kernel as follows:
+
+A COMPAT flag indicates that a feature is present in the filesystem,
+but the on-disk format is 100% compatible with older on-disk formats, so
+a kernel which didn't know anything about this feature could read/write
+the filesystem without any chance of corrupting the filesystem (or even
+making it inconsistent). This is essentially just a flag which says
+"this filesystem has a (hidden) feature" that the kernel or e2fsck may
+want to be aware of (more on e2fsck and feature flags later). The ext3
+HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
+a regular file with data blocks in it so the kernel does not need to
+take any special notice of it if it doesn't understand ext3 journaling.
+
+An RO_COMPAT flag indicates that the on-disk format is 100% compatible
+with older on-disk formats for reading (i.e. the feature does not change
+the visible on-disk format). However, an old kernel writing to such a
+filesystem would/could corrupt the filesystem, so this is prevented. The
+most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
+sparse groups allow file data blocks where superblock/group descriptor
+backups used to live, and ext2_free_blocks() refuses to free these blocks,
+which would lead to inconsistent bitmaps. An old kernel would also
+get an error if it tried to free a series of blocks which crossed a group
+boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.
+
+An INCOMPAT flag indicates the on-disk format has changed in some
+way that makes it unreadable by older kernels, or would otherwise
+cause a problem if an old kernel tried to mount it. FILETYPE is an
+INCOMPAT flag because older kernels would think a filename was longer
+than 256 characters, which would lead to corrupt directory listings.
+The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
+doesn't understand compression, you would just get garbage back from
+read() instead of it automatically decompressing your data. The ext3
+RECOVER flag is needed to prevent a kernel which does not understand the
+ext3 journal from mounting the filesystem without replaying the journal.
+
+e2fsck needs to be more strict in its handling of these flags than
+the kernel. If it doesn't understand ANY of the COMPAT, RO_COMPAT,
+or INCOMPAT flags it will refuse to check the filesystem, because it
+has no way of verifying whether a given feature is valid or not.
+Allowing e2fsck to succeed on a filesystem with an unknown feature
+would give the user a false sense of security. Refusing to check a
+filesystem with unknown features is a good incentive for the user to
+update to the latest e2fsck. This also means that anyone adding feature
+flags to ext2 also needs to update e2fsck to verify these features.
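+
+The resulting mount-time policy can be summarised in a small C sketch
+(the KNOWN_* masks below are illustrative stand-ins for whatever feature
+bits a given kernel actually understands):
+
+#define KNOWN_RO_COMPAT 0x0001 /* illustrative values only */
+#define KNOWN_INCOMPAT  0x0002
+
+int may_mount(unsigned int ro_compat, unsigned int incompat, int rdonly)
+{
+        if (incompat & ~KNOWN_INCOMPAT)
+                return 0;   /* refuse the mount entirely */
+        if (!rdonly && (ro_compat & ~KNOWN_RO_COMPAT))
+                return 0;   /* allow read-only mounts only */
+        return 1;           /* unknown COMPAT bits never block a mount */
+}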
+
+Metadata
+--------
+
+It is frequently claimed that the ext2 implementation of writing
+asynchronous metadata is faster than the ffs synchronous metadata
+scheme but less reliable. Both methods are equally resolvable by their
+respective fsck programs.
+
+If you're exceptionally paranoid, there are 3 ways of making metadata
+writes synchronous on ext2:
+
+per-file if you have the program source: use the O_SYNC flag to open()
+per-file if you don't have the source: use "chattr +S" on the file
+per-filesystem: add the "sync" option to mount (or in /etc/fstab)
+
+The first and last are not ext2 specific, but do force the metadata to
+be written synchronously. See also Journaling below.
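+
+For example, the per-file O_SYNC approach looks like this:
+
+#include <fcntl.h>
+
+/* Open a file so that each write(2) of data and its metadata has
+ * reached the disk before the call returns. */
+int open_sync(const char *path)
+{
+        return open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
+}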
+
+Limitations
+-----------
+
+There are various limits imposed by the on-disk layout of ext2. Other
+limits are imposed by the current implementation of the kernel code.
+Many of the limits are determined at the time the filesystem is first
+created, and depend upon the block size chosen. The ratio of inodes to
+data blocks is fixed at filesystem creation time, so the only way to
+increase the number of inodes is to increase the size of the filesystem.
+No tools currently exist which can change the ratio of inodes to blocks.
+
+Most of these limits could be overcome with slight changes in the on-disk
+format and using a compatibility flag to signal the format change (at
+the expense of some compatibility).
+
+Filesystem block size: 1kB 2kB 4kB 8kB
+
+File size limit: 16GB 256GB 2048GB 2048GB
+Filesystem size limit: 2047GB 8192GB 16384GB 32768GB
+
+There is a 2.4 kernel limit of 2048GB for a single block device, so no
+filesystem larger than that can be created at this time. There is also
+an upper limit on the block size imposed by the page size of the kernel,
+so 8kB blocks are only allowed on Alpha systems (and other architectures
+which support larger pages).
+
+There is an upper limit of 32000 subdirectories in a single directory.
+
+There is a "soft" upper limit of about 10-15k files in a single directory
+with the current linear linked-list directory implementation. This limit
+stems from performance problems when creating and deleting (and also
+finding) files in such large directories. Using a hashed directory index
+(under development) allows 100k-1M+ files in a single directory without
+performance problems (although RAM size becomes an issue at this point).
+
+The (meaningless) absolute upper limit of files in a single directory
+(imposed by the file size, the realistic limit is obviously much less)
+is over 130 trillion files. It would be higher except there are not
+enough 4-character names to make up unique directory entries, so they
+have to be 8-character filenames; even then we are fairly close to
+running out of unique filenames.
+
+Journaling
+----------
+
+A journaling extension to the ext2 code has been developed by Stephen
+Tweedie. It avoids the risks of metadata corruption and the need to
+wait for e2fsck to complete after a crash, without requiring a change
+to the on-disk ext2 layout. In a nutshell, the journal is a regular
+file which stores whole metadata (and optionally data) blocks that have
+been modified, prior to writing them into the filesystem. This means
+it is possible to add a journal to an existing ext2 filesystem without
+the need for data conversion.
+
+When changes are made to the filesystem (e.g. a file is renamed), they are stored in
+a transaction in the journal and can either be complete or incomplete at
+the time of a crash. If a transaction is complete at the time of a crash
+(or in the normal case where the system does not crash), then any blocks
+in that transaction are guaranteed to represent a valid filesystem state,
+and are copied into the filesystem. If a transaction is incomplete at
+the time of the crash, then there is no guarantee of consistency for
+the blocks in that transaction so they are discarded (which means any
+filesystem changes they represent are also lost).
+Check Documentation/filesystems/ext3.txt if you want to read more about
+ext3 and journaling.
+
+References
+==========
+
+The kernel source file:/usr/src/linux/fs/ext2/
+e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/
+Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
+Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
+Filesystem Resizing http://ext2resize.sourceforge.net/
+Compression (*) http://e2compr.sourceforge.net/
+
+Implementations for:
+Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs
+Windows 95 (*) http://www.yipton.net/content.html#FSDEXT2
+DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
+OS/2 (+) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
+RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/
+
+(*) no longer actively developed/supported (as of Apr 2001)
+(+) no longer actively developed/supported (as of Mar 2009)
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
new file mode 100644
index 00000000..272f80d5
--- /dev/null
+++ b/Documentation/filesystems/ext3.txt
@@ -0,0 +1,231 @@
+
+Ext3 Filesystem
+===============
+
+Ext3 was originally released in September 1999. It was written by Stephen
+Tweedie for the 2.2 branch, and ported to 2.4 kernels by Peter Braam,
+Andreas Dilger, Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
+
+Ext3 is the ext2 filesystem enhanced with journalling capabilities.
+
+Options
+=======
+
+When mounting an ext3 filesystem, the following options are accepted:
+(*) == default
+
+ro Mount filesystem read only. Note that ext3 will replay
+ the journal (and thus write to the partition) even when
+ mounted "read only". Mount options "ro,noload" can be
+ used to prevent writes to the filesystem.
+
+journal=update Update the ext3 file system's journal to the current
+ format.
+
+journal=inum When a journal already exists, this option is ignored.
+ Otherwise, it specifies the number of the inode which
+ will represent the ext3 file system's journal file.
+
+journal_dev=devnum When the external journal device's major/minor numbers
+ have changed, this option allows the user to specify
+ the new journal location. The journal device is
+ identified through its new major/minor numbers encoded
+ in devnum.
+
+norecovery Don't load the journal on mounting. Note that this
+noload forces the mount of a possibly inconsistent
+ filesystem, which can lead to various problems.
+
+data=journal All data are committed into the journal prior to being
+ written into the main file system.
+
+data=ordered (*) All data are forced directly out to the main file
+ system prior to its metadata being committed to the
+ journal.
+
+data=writeback Data ordering is not preserved, data may be written
+ into the main file system after its metadata has been
+ committed to the journal.
+
+commit=nrsec (*) Ext3 can be told to sync all its data and metadata
+ every 'nrsec' seconds. The default value is 5 seconds.
+ This means that if you lose your power, you will lose
+ as much as the latest 5 seconds of work (your
+ filesystem will not be damaged though, thanks to the
+ journaling). This default value (or any low value)
+ will hurt performance, but it's good for data-safety.
+ Setting it to 0 will have the same effect as leaving
+ it at the default (5 seconds).
+ Setting it to very large values will improve
+ performance.
+
+barrier=<0(*)|1> This enables/disables the use of write barriers in
+barrier the jbd code. barrier=0 disables, barrier=1 enables.
+nobarrier (*) This also requires an IO stack which can support
+ barriers, and if jbd gets an error on a barrier
+ write, it will disable again with a warning.
+ Write barriers enforce proper on-disk ordering
+ of journal commits, making volatile disk write caches
+ safe to use, at some performance penalty. If
+ your disks are battery-backed in one way or another,
+ disabling barriers may safely improve performance.
+ The mount options "barrier" and "nobarrier" can
+ also be used to enable or disable barriers, for
+ consistency with other ext3 mount options.
+
+orlov (*) This enables the new Orlov block allocator. It is
+ enabled by default.
+
+oldalloc This disables the Orlov block allocator and enables
+ the old block allocator. Orlov should have better
+ performance - we'd like to get some feedback if it's
+ the contrary for you.
+
+user_xattr Enables Extended User Attributes. Additionally, you
+ need to have extended attribute support enabled in the
+ kernel configuration (CONFIG_EXT3_FS_XATTR). See the
+ attr(5) manual page and http://acl.bestbits.at/ to
+ learn more about extended attributes.
+
+nouser_xattr Disables Extended User Attributes.
+
+acl Enables POSIX Access Control Lists support.
+ Additionally, you need to have ACL support enabled in
+ the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL).
+ See the acl(5) manual page and http://acl.bestbits.at/
+ for more information.
+
+noacl This option disables POSIX Access Control List
+ support.
+
+reservation Use the block-reservation mechanism: a window of
+ free blocks is reserved for each growing file so
+ that its future allocations stay contiguous.
+
+noreservation Do not use the block-reservation mechanism.
+
+bsddf (*) Make 'df' act like BSD.
+minixdf Make 'df' act like Minix.
+
+check=none Don't do extra checking of bitmaps on mount.
+nocheck
+
+debug Extra debugging information is sent to syslog.
+
+errors=remount-ro Remount the filesystem read-only on an error.
+errors=continue Keep going on a filesystem error.
+errors=panic Panic and halt the machine if an error occurs.
+ (These mount options override the errors behavior
+ specified in the superblock, which can be
+ configured using tune2fs.)
+
+data_err=ignore(*) Just print an error message if an error occurs
+ in a file data buffer in ordered mode.
+data_err=abort Abort the journal if an error occurs in a file
+ data buffer in ordered mode.
+
+grpid Give objects the same group ID as their parent.
+bsdgroups
+
+nogrpid (*) New objects have the group ID of their creator.
+sysvgroups
+
+resgid=n The group ID which may use the reserved blocks.
+
+resuid=n The user ID which may use the reserved blocks.
+
+sb=n Use alternate superblock at this location.
+
+quota These options are ignored by the filesystem. They
+noquota are used only by quota tools to recognize volumes
+grpquota where quota should be turned on. See documentation
+usrquota in the quota-tools package for more details
+ (http://sourceforge.net/projects/linuxquota).
+
+jqfmt=<quota type> These options tell filesystem details about quota
+usrjquota=<file> so that quota information can be properly updated
+grpjquota=<file> during journal replay. They replace the above
+ quota options. See documentation in the quota-tools
+ package for more details
+ (http://sourceforge.net/projects/linuxquota).
+
+bh (*) ext3 associates buffer heads to data pages to
+nobh (a) cache disk block mapping information
+ (b) link pages into transaction to provide
+ ordering guarantees.
+ "bh" option forces use of buffer heads.
+ "nobh" option tries to avoid associating buffer
+ heads (supported only for "writeback" mode).
+
+
+Specification
+=============
+Ext3 shares the same on-disk format as the ext2 filesystem, and adds
+transaction capabilities to ext2. Journaling is done by the Journaling Block
+Device layer.
+
+Journaling Block Device layer
+-----------------------------
+The Journaling Block Device layer (JBD) isn't ext3 specific. It was designed
+to add journaling capabilities to a block device. The ext3 filesystem code
+will inform the JBD of modifications it is performing (called a transaction).
+The journal supports transaction start and stop operations, and in case of a crash,
+the journal can replay the transactions to quickly put the partition back into
+a consistent state.
+
+A handle represents a single atomic update to the filesystem. JBD can also
+manage an external journal on a separate block device.
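+
+In outline, ext3 code brackets a metadata update with a handle along
+these lines (a sketch with error paths trimmed; the ext3_journal_*
+helpers are thin wrappers around the JBD calls):
+
+static int myfs_update(struct inode *inode, struct buffer_head *bh)
+{
+        handle_t *handle = ext3_journal_start(inode, 1);
+
+        if (IS_ERR(handle))
+                return PTR_ERR(handle);
+        ext3_journal_get_write_access(handle, bh); /* join the buffer */
+        /* ... modify the metadata held in 'bh' ... */
+        ext3_journal_dirty_metadata(handle, bh);   /* log the change  */
+        return ext3_journal_stop(handle);          /* end the update  */
+}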
+
+Data Mode
+---------
+There are 3 different data modes:
+
+* writeback mode
+In data=writeback mode, ext3 does not journal data at all. This mode provides
+a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
+mode - metadata journaling. A crash+recovery can cause incorrect data to
+appear in files which were written shortly before the crash. This mode will
+typically provide the best ext3 performance.
+
+* ordered mode
+In data=ordered mode, ext3 only officially journals metadata, but it logically
+groups metadata and data blocks into a single unit called a transaction. When
+it's time to write the new metadata out to disk, the associated data blocks
+are written first. In general, this mode performs slightly slower than
+writeback but significantly faster than journal mode.
+
+* journal mode
+data=journal mode provides full data and metadata journaling. All new data is
+written to the journal first, and then to its final location.
+In the event of a crash, the journal can be replayed, bringing both data and
+metadata into a consistent state. This mode is the slowest except when data
+needs to be read from and written to disk at the same time where it
+outperforms all other modes.
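+
+To select one of these modes from a program, the option string can be
+passed to mount(2) just as on the command line (the device and mount
+point below are examples only):
+
+#include <stdio.h>
+#include <sys/mount.h>
+
+int main(void)
+{
+        if (mount("/dev/sdb1", "/mnt/data", "ext3", 0, "data=journal")) {
+                perror("mount");
+                return 1;
+        }
+        return 0;
+}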
+
+Compatibility
+-------------
+
+Ext2 partitions can be easily converted to ext3, with `tune2fs -j <dev>`.
+Ext3 is fully compatible with Ext2. Ext3 partitions can easily be mounted as
+Ext2.
+
+
+External Tools
+==============
+See manual pages to learn more.
+
+tune2fs: create an ext3 journal on an ext2 partition with the -j flag.
+mke2fs: create an ext3 partition with the -j flag.
+debugfs: ext2 and ext3 file system debugger.
+ext2online: online (mounted) ext2 and ext3 filesystem resizer.
+
+
+References
+==========
+
+kernel source: <file:fs/ext3/>
+ <file:fs/jbd/>
+
+programs: http://e2fsprogs.sourceforge.net/
+ http://ext2resize.sourceforge.net
+
+useful links: http://www.ibm.com/developerworks/library/l-fs7.html
+ http://www.ibm.com/developerworks/library/l-fs8.html
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
new file mode 100644
index 00000000..3ae9bc94
--- /dev/null
+++ b/Documentation/filesystems/ext4.txt
@@ -0,0 +1,615 @@
+
+Ext4 Filesystem
+===============
+
+Ext4 is an advanced level of the ext3 filesystem which incorporates
+scalability and reliability enhancements for supporting large filesystems
+(64 bit) in keeping with increasing disk capacities and state-of-the-art
+feature requirements.
+
+Mailing list: linux-ext4@vger.kernel.org
+Web site: http://ext4.wiki.kernel.org
+
+
+1. Quick usage instructions:
+===========================
+
+Note: More extensive information for getting started with ext4 can be
+ found at the ext4 wiki site at the URL:
+ http://ext4.wiki.kernel.org/index.php/Ext4_Howto
+
+ - Compile and install the latest version of e2fsprogs (as of this
+ writing version 1.41.3) from:
+
+ http://sourceforge.net/project/showfiles.php?group_id=2406
+
+ or
+
+ ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
+
+ or grab the latest git repository from:
+
+ git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
+
+ - Note that it is highly important to install the mke2fs.conf file
+ that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If
+ you have edited the /etc/mke2fs.conf file installed on your system,
+ you will need to merge your changes with the version from e2fsprogs
+ 1.41.x.
+
+ - Create a new filesystem using the ext4 filesystem type:
+
+ # mke2fs -t ext4 /dev/hda1
+
+ Or to configure an existing ext3 filesystem to support extents:
+
+ # tune2fs -O extents /dev/hda1
+
+ If the filesystem was created with 128 byte inodes, it can be
+ converted to use 256 byte inodes for greater efficiency via:
+
+ # tune2fs -I 256 /dev/hda1
+
+ (Note: we currently do not have tools to convert an ext4
+ filesystem back to ext3; so please do not try this on production
+ filesystems.)
+
+ - Mounting:
+
+ # mount -t ext4 /dev/hda1 /wherever
+
+ - When comparing performance with other filesystems, it's always
+ important to try multiple workloads; very often a subtle change in a
+ workload parameter can completely change the ranking of which
+ filesystems do well compared to others. When comparing versus ext3,
+ note that ext4 enables write barriers by default, while ext3 does
+ not enable write barriers by default. So it is useful to
+ explicitly specify whether barriers are enabled via the
+ '-o barrier=[0|1]' mount option for both ext3 and ext4 filesystems
+ for a fair comparison. When tuning ext3 for best benchmark numbers,
+ it is often worthwhile to try changing the data journaling mode; '-o
+ data=writeback,nobh' can be faster for some workloads. (Note
+ however that running mounted with data=writeback can potentially
+ leave stale data exposed in recently written files in case of an
+ unclean shutdown, which could be a security exposure in some
+ situations.) Configuring the filesystem with a large journal can
+ also be helpful for metadata-intensive workloads.
+
+2. Features
+===========
+
+2.1 Currently available
+
+* ability to use filesystems > 16TB (e2fsprogs support not available yet)
+* extent format reduces metadata overhead (RAM, IO for access, transactions)
+* extent format more robust in face of on-disk corruption due to magics,
+ internal redundancy in tree
+* improved file allocation (multi-block alloc)
+* lift 32000 subdirectory limit imposed by i_links_count[1]
+* nsec timestamps for mtime, atime, ctime, create time
+* inode version field on disk (NFSv4, Lustre)
+* reduced e2fsck time via uninit_bg feature
+* journal checksumming for robustness, performance
+* persistent file preallocation (e.g for streaming media, databases)
+* ability to pack bitmaps and inode tables into larger virtual groups via the
+ flex_bg feature
+* large file support
+* Inode allocation using large virtual block groups via flex_bg
+* delayed allocation
+* large block (up to pagesize) support
+* efficient new ordered mode in JBD2 and ext4 (avoid using buffer heads
+ to force the ordering)
+
+[1] Filesystems with a block size of 1k may see a limit imposed by the
+directory hash tree having a maximum depth of two.
+
+2.2 Candidate features for future inclusion
+
+* Online defrag (patches available but not well tested)
+* reduced mke2fs time via lazy itable initialization in conjunction with
+ the uninit_bg feature (capability to do this is available in e2fsprogs
+ but a kernel thread to do lazy zeroing of unused inode table blocks
+ after filesystem is first mounted is required for safety)
+
+There are several others under discussion; whether they all make it in is
+partly a function of how much time everyone has to work on them. Features like
+metadata checksumming have been discussed and planned for a bit but no patches
+exist yet so I'm not sure they're in the near-term roadmap.
+
+The big performance win will come with mballoc, delalloc and flex_bg
+grouping of bitmaps and inode tables. Some test results available here:
+
+ - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
+ - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html
+
+3. Options
+==========
+
+When mounting an ext4 filesystem, the following options are accepted:
+(*) == default
+
+ro Mount filesystem read only. Note that ext4 will
+ replay the journal (and thus write to the
+ partition) even when mounted "read only". The
+ mount options "ro,noload" can be used to prevent
+ writes to the filesystem.
+
+journal_checksum Enable checksumming of the journal transactions.
+ This will allow the recovery code in e2fsck and the
+ kernel to detect corruption in the journal. It is a
+ compatible change and will be ignored by older kernels.
+
+journal_async_commit Commit block can be written to disk without waiting
+ for descriptor blocks. If enabled, older kernels cannot
+ mount the device. This will enable 'journal_checksum'
+ internally.
+
+journal=update Update the ext4 file system's journal to the current
+ format.
+
+journal_dev=devnum When the external journal device's major/minor numbers
+ have changed, this option allows the user to specify
+ the new journal location. The journal device is
+ identified through its new major/minor numbers encoded
+ in devnum.
+
+norecovery Don't load the journal on mounting. Note that
+noload if the filesystem was not unmounted cleanly,
+ skipping the journal replay will lead to the
+ filesystem containing inconsistencies that can
+ lead to any number of problems.
+
+data=journal All data are committed into the journal prior to being
+ written into the main file system.
+
+data=ordered (*) All data are forced directly out to the main file
+ system prior to its metadata being committed to the
+ journal.
+
+data=writeback Data ordering is not preserved, data may be written
+ into the main file system after its metadata has been
+ committed to the journal.
+
+commit=nrsec (*) Ext4 can be told to sync all its data and metadata
+ every 'nrsec' seconds. The default value is 5 seconds.
+ This means that if you lose your power, you will lose
+ as much as the latest 5 seconds of work (your
+ filesystem will not be damaged though, thanks to the
+ journaling). This default value (or any low value)
+ will hurt performance, but it's good for data-safety.
+ Setting it to 0 will have the same effect as leaving
+ it at the default (5 seconds).
+ Setting it to very large values will improve
+ performance.
+
+barrier=<0|1(*)> This enables/disables the use of write barriers in
+barrier(*) the jbd code. barrier=0 disables, barrier=1 enables.
+nobarrier This also requires an IO stack which can support
+ barriers, and if jbd gets an error on a barrier
+ write, it will disable again with a warning.
+ Write barriers enforce proper on-disk ordering
+ of journal commits, making volatile disk write caches
+ safe to use, at some performance penalty. If
+ your disks are battery-backed in one way or another,
+ disabling barriers may safely improve performance.
+ The mount options "barrier" and "nobarrier" can
+ also be used to enable or disable barriers, for
+ consistency with other ext4 mount options.
+
+inode_readahead_blks=n This tuning parameter controls the maximum
+ number of inode table blocks that ext4's inode
+ table readahead algorithm will pre-read into
+ the buffer cache. The default value is 32 blocks.
+
+orlov (*) This enables the new Orlov block allocator. It is
+ enabled by default.
+
+oldalloc This disables the Orlov block allocator and enables
+ the old block allocator. Orlov should have better
+ performance - we'd like to get some feedback if it's
+ the contrary for you.
+
+user_xattr Enables Extended User Attributes. Additionally, you
+ need to have extended attribute support enabled in the
+ kernel configuration (CONFIG_EXT4_FS_XATTR). See the
+ attr(5) manual page and http://acl.bestbits.at/ to
+ learn more about extended attributes.
+
+nouser_xattr Disables Extended User Attributes.
+
+acl Enables POSIX Access Control Lists support.
+ Additionally, you need to have ACL support enabled in
+ the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL).
+ See the acl(5) manual page and http://acl.bestbits.at/
+ for more information.
+
+noacl This option disables POSIX Access Control List
+ support.
+
+bsddf (*) Make 'df' act like BSD.
+minixdf Make 'df' act like Minix.
+
+debug Extra debugging information is sent to syslog.
+
+abort Simulate the effects of calling ext4_abort() for
+ debugging purposes. This is normally used while
+ remounting a filesystem which is already mounted.
+
+errors=remount-ro Remount the filesystem read-only on an error.
+errors=continue Keep going on a filesystem error.
+errors=panic Panic and halt the machine if an error occurs.
+ (These mount options override the errors behavior
+ specified in the superblock, which can be configured
+ using tune2fs)
+
+data_err=ignore(*) Just print an error message if an error occurs
+ in a file data buffer in ordered mode.
+data_err=abort Abort the journal if an error occurs in a file
+ data buffer in ordered mode.
+
+grpid Give objects the same group ID as their parent.
+bsdgroups
+
+nogrpid (*) New objects have the group ID of their creator.
+sysvgroups
+
+resgid=n The group ID which may use the reserved blocks.
+
+resuid=n The user ID which may use the reserved blocks.
+
+sb=n Use alternate superblock at this location.
+
+quota These options are ignored by the filesystem. They
+noquota are used only by quota tools to recognize volumes
+grpquota where quota should be turned on. See documentation
+usrquota in the quota-tools package for more details
+ (http://sourceforge.net/projects/linuxquota).
+
+jqfmt=<quota type> These options tell filesystem details about quota
+usrjquota=<file> so that quota information can be properly updated
+grpjquota=<file> during journal replay. They replace the above
+ quota options. See documentation in the quota-tools
+ package for more details
+ (http://sourceforge.net/projects/linuxquota).
+
+bh (*) ext4 associates buffer heads to data pages to
+nobh (a) cache disk block mapping information
+ (b) link pages into transaction to provide
+ ordering guarantees.
+ "bh" option forces use of buffer heads.
+ "nobh" option tries to avoid associating buffer
+ heads (supported only for "writeback" mode).
+
+stripe=n Number of filesystem blocks that mballoc will try
+ to use for allocation size and alignment. For RAID5/6
+ systems this should be the number of data
+ disks * RAID chunk size in file system blocks.
+
+delalloc (*) Defer block allocation until just before ext4
+ writes out the block(s) in question. This
+ allows ext4 to make better allocation decisions
+ more efficiently.
+nodelalloc Disable delayed allocation. Blocks are allocated
+ when the data is copied from userspace to the
+ page cache, either via the write(2) system call
+ or when an mmap'ed page which was previously
+ unallocated is written for the first time.
+
+max_batch_time=usec Maximum amount of time ext4 should wait for
+ additional filesystem operations to be batched
+ together with a synchronous write operation.
+ Since a synchronous write operation is going to
+ force a commit and then wait for the I/O to
+ complete, it doesn't cost much, and can be a
+ huge throughput win, to wait for a small amount
+ of time to see if any other transactions can
+ piggyback on the synchronous write. The
+ algorithm used is designed to automatically tune
+ for the speed of the disk, by measuring the
+ amount of time (on average) that it takes to
+ finish committing a transaction. Call this time
+ the "commit time". If the time that the
+ transaction has been running is less than the
+ commit time, ext4 will try sleeping for the
+ commit time to see if other operations will join
+ the transaction. The commit time is capped by
+ the max_batch_time, which defaults to 15000us
+ (15ms). This optimization can be turned off
+ entirely by setting max_batch_time to 0.
+
+min_batch_time=usec This parameter sets the commit time (as
+ described above) to be at least min_batch_time.
+ It defaults to zero microseconds. Increasing
+ this parameter may improve the throughput of
+ multi-threaded, synchronous workloads on very
+ fast disks, at the cost of increasing latency.
+
+journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the
+ highest priority) which should be used for I/O
+ operations submitted by kjournald2 during a
+ commit operation. This defaults to 3, which is
+ a slightly higher priority than the default I/O
+ priority.
+
+auto_da_alloc(*) Many broken applications don't use fsync() when
+noauto_da_alloc replacing existing files via patterns such as
+ fd = open("foo.new")/write(fd,..)/close(fd)/
+ rename("foo.new", "foo"), or worse yet,
+ fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
+ If auto_da_alloc is enabled, ext4 will detect
+ the replace-via-rename and replace-via-truncate
+ patterns and force that any delayed allocation
+ blocks are allocated such that at the next
+ journal commit, in the default data=ordered
+ mode, the data blocks of the new file are forced
+ to disk before the rename() operation is
+ committed. This provides roughly the same level
+ of guarantees as ext3, and avoids the
+ "zero-length" problem that can happen when a
+ system crashes before the delayed allocation
+ blocks are forced to disk.
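+ (A sketch of this pattern follows the option list.)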
+
+noinit_itable Do not initialize any uninitialized inode table
+ blocks in the background. This feature may be
+ used by installation CDs so that the install
+ process can complete as quickly as possible; the
+ inode table initialization process would then be
+ deferred until the next time the file system
+ is unmounted.
+
+init_itable=n The lazy itable init code will wait n times the
+ number of milliseconds it took to zero out the
+ previous block group's inode table. This
+ minimizes the impact on system performance
+ while the file system's inode table is being initialized.
+
+discard Controls whether ext4 should issue discard/TRIM
+nodiscard(*) commands to the underlying block device when
+ blocks are freed. This is useful for SSD devices
+ and sparse/thinly-provisioned LUNs, but it is off
+ by default until sufficient testing has been done.
+
+nouid32 Disables 32-bit UIDs and GIDs. This is for
+ interoperability with older kernels which only
+ store and expect 16-bit values.
+
+resize Allows resizing the filesystem to the end of the last
+ existing block group; further resizing has to be done
+ with resize2fs, either online or offline. It can be
+ used only in conjunction with remount.
+
+block_validity This option enables/disables the in-kernel
+noblock_validity facility for tracking filesystem metadata blocks
+ within internal data structures. This allows the
+ multi-block allocator and other routines to quickly
+ locate extents which might overlap with filesystem
+ metadata blocks. This option is intended for
+ debugging purposes and, since it negatively affects
+ performance, it is off by default.
+
+dioread_lock Controls whether or not ext4 should use the DIO read
+dioread_nolock locking. If the dioread_nolock option is specified,
+ ext4 will allocate an uninitialized extent before the
+ buffer write and convert the extent to initialized
+ after the IO completes. This approach allows ext4
+ code to avoid using the inode mutex, which improves
+ scalability on high speed storage. However this does
+ not work with the nobh option and the mount will
+ fail. Nor does it work with data journaling; the
+ dioread_nolock option will be ignored with a kernel
+ warning. Note that the dioread_nolock code path is
+ only used for extent-based files. Because of these
+ restrictions, this option is off by default
+ (i.e. dioread_lock).
+
+i_version Enable 64-bit inode version support. This option is
+ off by default.
+
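+The replace-via-rename pattern referred to under auto_da_alloc above
+looks like this in C (a sketch; a robust application would still call
+fsync() itself before close()):
+
+#include <stdio.h>
+#include <unistd.h>
+#include <fcntl.h>
+
+/* Replace "foo" atomically via rename(). Without an explicit fsync(),
+ * it is the auto_da_alloc heuristic that forces foo.new's delayed
+ * allocation blocks to disk before the rename commits. */
+int replace_foo(const char *buf, size_t len)
+{
+        int fd = open("foo.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
+
+        if (fd < 0)
+                return -1;
+        if (write(fd, buf, len) != (ssize_t)len) {
+                close(fd);
+                return -1;
+        }
+        /* fsync(fd);  <- what a careful application would add */
+        close(fd);
+        return rename("foo.new", "foo");
+}
+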
+Data Mode
+=========
+There are 3 different data modes:
+
+* writeback mode
+In data=writeback mode, ext4 does not journal data at all. This mode provides
+a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
+mode - metadata journaling. A crash+recovery can cause incorrect data to
+appear in files which were written shortly before the crash. This mode will
+typically provide the best ext4 performance.
+
+* ordered mode
+In data=ordered mode, ext4 only officially journals metadata, but it logically
+groups metadata information related to data changes with the data blocks into a
+single unit called a transaction. When it's time to write the new metadata
+out to disk, the associated data blocks are written first. In general,
+this mode performs slightly slower than writeback but significantly faster
+than journal mode.
+
+* journal mode
+data=journal mode provides full data and metadata journaling. All new data is
+written to the journal first, and then to its final location.
+In the event of a crash, the journal can be replayed, bringing both data and
+metadata into a consistent state. This mode is the slowest except when data
+needs to be read from and written to disk at the same time where it
+outperforms all other modes. Currently ext4 does not have delayed
+allocation support if this data journaling mode is selected.
+
+/proc entries
+=============
+
+Information about mounted ext4 file systems can be found in
+/proc/fs/ext4. Each mounted filesystem will have a directory in
+/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
+/proc/fs/ext4/dm-0). The files in each per-device directory are shown
+in the table below.
+
+Files in /proc/fs/ext4/<devname>
+..............................................................................
+ File Content
+ mb_groups details of multiblock allocator buddy cache of free blocks
+..............................................................................
+
+/sys entries
+============
+
+Information about mounted ext4 file systems can be found in
+/sys/fs/ext4. Each mounted filesystem will have a directory in
+/sys/fs/ext4 based on its device name (e.g., /sys/fs/ext4/hdc or
+/sys/fs/ext4/dm-0). The files in each per-device directory are shown
+in the table below.
+
+Files in /sys/fs/ext4/<devname>
+(see also Documentation/ABI/testing/sysfs-fs-ext4)
+..............................................................................
+ File Content
+
+ delayed_allocation_blocks This file is read-only and shows the number of
+ blocks that are dirty in the page cache, but
+ which do not have their location in the
+ filesystem allocated yet.
+
+ inode_goal Tuning parameter which (if non-zero) controls
+ the goal inode used by the inode allocator in
+ preference to all other allocation heuristics.
+ This is intended for debugging use only, and
+ should be 0 on production systems.
+
+ inode_readahead_blks Tuning parameter which controls the maximum
+ number of inode table blocks that ext4's inode
+ table readahead algorithm will pre-read into
+ the buffer cache
+
+ lifetime_write_kbytes This file is read-only and shows the number of
+ kilobytes of data that have been written to this
+ filesystem since it was created.
+
+ max_writeback_mb_bump The maximum number of megabytes the writeback
+ code will try to write out before moving on to
+ another inode.
+
+ mb_group_prealloc The multiblock allocator will round up allocation
+ requests to a multiple of this tuning parameter if
+ the stripe size is not set in the ext4 superblock
+
+ mb_max_to_scan The maximum number of extents the multiblock
+ allocator will search to find the best extent
+
+ mb_min_to_scan The minimum number of extents the multiblock
+ allocator will search to find the best extent
+
+ mb_order2_req Tuning parameter which controls the minimum size
+ for requests (as a power of 2) where the buddy
+ cache is used
+
+ mb_stats Controls whether the multiblock allocator should
+ collect statistics, which are shown during the
+ unmount. 1 means to collect statistics, 0 means
+ not to collect statistics
+
+ mb_stream_req Files which have fewer blocks than this tunable
+ parameter will have their blocks allocated out
+ of a block group specific preallocation pool, so
+ that small files are packed closely together.
+ Each large file will have its blocks allocated
+ out of its own unique preallocation pool.
+
+ session_write_kbytes This file is read-only and shows the number of
+ kilobytes of data that have been written to this
+ filesystem since it was mounted.
+..............................................................................
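+
+These files are plain text and can be read like any other sysfs
+attribute, for example (the per-device directory name sda1 below is
+only an example):
+
+#include <stdio.h>
+
+int main(void)
+{
+        char buf[64];
+        FILE *f = fopen("/sys/fs/ext4/sda1/lifetime_write_kbytes", "r");
+
+        if (!f)
+                return 1;
+        if (fgets(buf, sizeof(buf), f))
+                printf("lifetime writes (kB): %s", buf);
+        fclose(f);
+        return 0;
+}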
+
+Ioctls
+======
+
+There is some Ext4 specific functionality which can be accessed by applications
+through the system call interfaces. The list of all Ext4 specific ioctls is
+shown in the table below.
+
+Table of Ext4 specific ioctls
+..............................................................................
+ Ioctl Description
+ EXT4_IOC_GETFLAGS Get additional attributes associated with inode.
+ The ioctl argument is an integer bitfield, with
+ bit values described in ext4.h. This ioctl is an
+ alias for FS_IOC_GETFLAGS.
+
+ EXT4_IOC_SETFLAGS Set additional attributes associated with inode.
+ The ioctl argument is an integer bitfield, with
+ bit values described in ext4.h. This ioctl is an
+ alias for FS_IOC_SETFLAGS.
+
+ EXT4_IOC_GETVERSION
+ EXT4_IOC_GETVERSION_OLD
+ Get the inode i_generation number stored for
+ each inode. The i_generation number is normally
+ changed only when a new inode is created and it is
+ particularly useful for network filesystems. The
+ '_OLD' version of this ioctl is an alias for
+ FS_IOC_GETVERSION.
+
+ EXT4_IOC_SETVERSION
+ EXT4_IOC_SETVERSION_OLD
+ Set the inode i_generation number stored for
+ each inode. The '_OLD' version of this ioctl
+ is an alias for FS_IOC_SETVERSION.
+
+ EXT4_IOC_GROUP_EXTEND This ioctl has the same purpose as the resize
+ mount option. It allows resizing the filesystem
+ to the end of the last existing block group;
+ further resizing has to be done with resize2fs,
+ either online or offline. The argument points
+ to the unsigned long number representing the
+ filesystem's new block count.
+
+ EXT4_IOC_MOVE_EXT Move the block extents from orig_fd (the one
+ this ioctl is pointing to) to the donor_fd (the
+ one specified in move_extent structure passed
+ as an argument to this ioctl). Then, exchange
+ inode metadata between orig_fd and donor_fd.
+ This is especially useful for online
+ defragmentation, because the allocator has the
+ opportunity to allocate moved blocks better,
+ ideally into one contiguous extent.
+
+ EXT4_IOC_GROUP_ADD Add a new group descriptor to an existing or
+ new group descriptor block. The new group
+ descriptor is described by ext4_new_group_input
+ structure, which is passed as an argument to
+ this ioctl. This is especially useful in
+ conjunction with EXT4_IOC_GROUP_EXTEND,
+ which allows online resize of the filesystem
+ to the end of the last existing block group.
+ Those two ioctls combined are used in the
+ userspace online resize tool (e.g. resize2fs).
+
+ EXT4_IOC_MIGRATE This ioctl operates on the filesystem itself.
+ It converts (migrates) ext3 indirect block mapped
+ inode to ext4 extent mapped inode by walking
+ through indirect block mapping of the original
+ inode and converting contiguous block ranges
+ into ext4 extents of the temporary inode. Then,
+ inodes are swapped. This ioctl might help when
+ migrating from an ext3 to an ext4 filesystem;
+ however, the suggestion is to create a fresh
+ ext4 filesystem and copy the data from a backup.
+ Note that the filesystem has to support extents
+ for this ioctl to work.
+
+ EXT4_IOC_ALLOC_DA_BLKS Force all of the delayed-allocation blocks to
+ be allocated to preserve application-expected
+ ext3 behaviour. Note that this will also start
+ triggering a write of the data blocks, but this
+ behaviour may change in the future as it is
+ not necessary and has been done this way only
+ for the sake of simplicity.
+..............................................................................
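+
+Since EXT4_IOC_GETFLAGS is an alias for FS_IOC_GETFLAGS, a userspace
+caller can use the generic name directly, for example:
+
+#include <stdio.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+
+int main(int argc, char **argv)
+{
+        long flags;
+        int fd;
+
+        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
+                return 1;
+        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
+                perror("FS_IOC_GETFLAGS");
+                return 1;
+        }
+        printf("inode flags: 0x%lx\n", flags);
+        return 0;
+}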
+
+References
+==========
+
+kernel source: <file:fs/ext4/>
+ <file:fs/jbd2/>
+
+programs: http://e2fsprogs.sourceforge.net/
+
+useful links: http://fedoraproject.org/wiki/ext3-devel
+ http://www.bullopensource.org/ext4/
+ http://ext4.wiki.kernel.org/index.php/Main_Page
+ http://fedoraproject.org/wiki/Features/Ext4
diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt
new file mode 100644
index 00000000..1b805a0e
--- /dev/null
+++ b/Documentation/filesystems/fiemap.txt
@@ -0,0 +1,228 @@
+============
+Fiemap Ioctl
+============
+
+The fiemap ioctl is an efficient method for userspace to get file
+extent mappings. Instead of block-by-block mapping (such as bmap), fiemap
+returns a list of extents.
+
+
+Request Basics
+--------------
+
+A fiemap request is encoded within struct fiemap:
+
+struct fiemap {
+ __u64 fm_start; /* logical offset (inclusive) at
+ * which to start mapping (in) */
+ __u64 fm_length; /* logical length of mapping which
+ * userspace cares about (in) */
+ __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
+ __u32 fm_mapped_extents; /* number of extents that were
+ * mapped (out) */
+ __u32 fm_extent_count; /* size of fm_extents array (in) */
+ __u32 fm_reserved;
+ struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
+};
+
+
+fm_start and fm_length specify the logical range within the file
+which the process would like mappings for. Extents returned mirror
+those on disk - that is, the logical offset of the 1st returned extent
+may start before fm_start, and the range covered by the last returned
+extent may end after fm_length. All offsets and lengths are in bytes.
+
+Certain flags to modify the way in which mappings are looked up can be
+set in fm_flags. If the kernel doesn't understand some particular
+flags, it will return EBADR and the contents of fm_flags will contain
+the set of flags which caused the error. If the kernel is compatible
+with all flags passed, the contents of fm_flags will be unmodified.
+It is up to userspace to determine whether rejection of a particular
+flag is fatal to its operation. This scheme is intended to allow the
+fiemap interface to grow in the future but without losing
+compatibility with old software.
+
+fm_extent_count specifies the number of elements in the fm_extents[] array
+that can be used to return extents. If fm_extent_count is zero, then the
+fm_extents[] array is ignored (no extents will be returned), and the
+fm_mapped_extents count will hold the number of extents needed in
+fm_extents[] to hold the file's current mapping. Note that there is
+nothing to prevent the file from changing between calls to FIEMAP.
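+
+The usual two-call pattern, probing with fm_extent_count set to zero
+and then allocating the extent array, looks roughly like this from
+userspace (a sketch with minimal error handling; as noted above, the
+mapping may change between the two calls):
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#include <linux/fiemap.h>
+
+int main(int argc, char **argv)
+{
+        struct fiemap probe, *fm;
+        unsigned int i, n;
+        int fd;
+
+        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
+                return 1;
+
+        memset(&probe, 0, sizeof(probe));
+        probe.fm_length = ~0ULL;        /* map the whole file */
+        if (ioctl(fd, FS_IOC_FIEMAP, &probe) < 0)
+                return 1;
+        n = probe.fm_mapped_extents;    /* extents currently needed */
+
+        fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
+        if (!fm)
+                return 1;
+        fm->fm_length = ~0ULL;
+        fm->fm_extent_count = n;
+        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
+                return 1;
+
+        for (i = 0; i < fm->fm_mapped_extents; i++)
+                printf("%llu -> %llu (%llu bytes, flags 0x%x)\n",
+                       (unsigned long long)fm->fm_extents[i].fe_logical,
+                       (unsigned long long)fm->fm_extents[i].fe_physical,
+                       (unsigned long long)fm->fm_extents[i].fe_length,
+                       fm->fm_extents[i].fe_flags);
+        return 0;
+}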
+
+The following flags can be set in fm_flags:
+
+* FIEMAP_FLAG_SYNC
+If this flag is set, the kernel will sync the file before mapping extents.
+
+* FIEMAP_FLAG_XATTR
+If this flag is set, the extents returned will describe the inode's
+extended attribute lookup tree, instead of its data tree.
+
+
+Extent Mapping
+--------------
+
+Extent information is returned within the embedded fm_extents array
+which userspace must allocate along with the fiemap structure. The
+number of elements in the fiemap_extents[] array should be passed via
+fm_extent_count. The number of extents mapped by kernel will be
+returned via fm_mapped_extents. If the number of fiemap_extents
+allocated is less than would be required to map the requested range,
+the maximum number of extents that can be mapped in the fm_extents[]
+array will be returned and fm_mapped_extents will be equal to
+fm_extent_count. In that case, the last extent in the array will not
+complete the requested range and will not have the FIEMAP_EXTENT_LAST
+flag set (see the next section on extent flags).
+
+Each extent is described by a single fiemap_extent structure as
+returned in fm_extents.
+
+struct fiemap_extent {
+ __u64 fe_logical; /* logical offset in bytes for the start of
+ * the extent */
+ __u64 fe_physical; /* physical offset in bytes for the start
+ * of the extent */
+ __u64 fe_length; /* length in bytes for the extent */
+ __u64 fe_reserved64[2];
+ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
+ __u32 fe_reserved[3];
+};
+
+All offsets and lengths are in bytes and mirror those on disk. It is valid
+for an extent's logical offset to start before the request or its logical
+length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is
+returned, fe_logical, fe_physical, and fe_length will be aligned to the
+block size of the file system. With the exception of extents flagged as
+FIEMAP_EXTENT_MERGED, adjacent extents will not be merged.
+
+The fe_flags field contains flags which describe the extent returned.
+A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in
+the file so that the process making fiemap calls can determine when no
+more extents are available, without having to call the ioctl again.
+
+Some flags are intentionally vague and will always be set in the
+presence of other more specific flags. This way a program looking for
+a general property does not have to know all existing and future flags
+which imply that property.
+
+For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL
+are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking
+for inline or tail-packed data can key on the specific flag. Software
+which simply wants to avoid operating on non-aligned extents,
+however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to
+worry about all present and future flags which might imply unaligned
+data. Note that the opposite is not true - it would be valid for
+FIEMAP_EXTENT_NOT_ALIGNED to appear alone.
+
+* FIEMAP_EXTENT_LAST
+This is the last extent in the file. A mapping attempt past this
+extent will return nothing.
+
+* FIEMAP_EXTENT_UNKNOWN
+The location of this extent is currently unknown. This may indicate
+the data is stored on an inaccessible volume or that no storage has
+been allocated for the file yet.
+
+* FIEMAP_EXTENT_DELALLOC
+ - This will also set FIEMAP_EXTENT_UNKNOWN.
+Delayed allocation - while there is data for this extent, its
+physical location has not been allocated yet.
+
+* FIEMAP_EXTENT_ENCODED
+This extent does not consist of plain filesystem blocks but is
+encoded (e.g. encrypted or compressed). Reading the data in this
+extent via I/O to the block device will have undefined results.
+
+Note that it is *always* undefined to try to update the data
+in-place by writing to the indicated location without the
+assistance of the filesystem, or to access the data using the
+information returned by the FIEMAP interface while the filesystem
+is mounted. In other words, user applications may only read the
+extent data via I/O to the block device while the filesystem is
+unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is
+clear; user applications must not try reading or writing to the
+filesystem via the block device under any other circumstances.
+
+* FIEMAP_EXTENT_DATA_ENCRYPTED
+ - This will also set FIEMAP_EXTENT_ENCODED.
+The data in this extent has been encrypted by the file system.
+
+* FIEMAP_EXTENT_NOT_ALIGNED
+Extent offsets and length are not guaranteed to be block aligned.
+
+* FIEMAP_EXTENT_DATA_INLINE
+ - This will also set FIEMAP_EXTENT_NOT_ALIGNED.
+Data is located within a metadata block.
+
+* FIEMAP_EXTENT_DATA_TAIL
+ - This will also set FIEMAP_EXTENT_NOT_ALIGNED.
+Data is packed into a block with data from other files.
+
+* FIEMAP_EXTENT_UNWRITTEN
+Unwritten extent - the extent is allocated but its data has not been
+initialized. This indicates the extent's data will be all zero if read
+through the filesystem but the contents are undefined if read directly from
+the device.
+
+* FIEMAP_EXTENT_MERGED
+This will be set when a file does not support extents, i.e., it uses a block
+based addressing scheme. Since returning an extent for each block back to
+userspace would be highly inefficient, the kernel will try to merge most
+adjacent blocks into 'extents'.
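+
+To illustrate the request and reply structures described above, here is
+a minimal userspace caller (a sketch only; error handling is abbreviated
+and the 32-extent limit is arbitrary):
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#include <linux/fiemap.h>
+
+int main(int argc, char *argv[])
+{
+	struct fiemap *fm;
+	unsigned int i;
+	int fd;
+
+	if (argc != 2)
+		return 1;
+	fd = open(argv[1], O_RDONLY);
+	if (fd < 0)
+		return 1;
+
+	/* header plus room for up to 32 extents */
+	fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
+	fm->fm_start = 0;
+	fm->fm_length = ~0ULL;		/* map the whole file */
+	fm->fm_flags = FIEMAP_FLAG_SYNC;
+	fm->fm_extent_count = 32;
+
+	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
+		return 1;
+
+	for (i = 0; i < fm->fm_mapped_extents; i++)
+		printf("logical %llu phys %llu len %llu flags 0x%x\n",
+		       (unsigned long long)fm->fm_extents[i].fe_logical,
+		       (unsigned long long)fm->fm_extents[i].fe_physical,
+		       (unsigned long long)fm->fm_extents[i].fe_length,
+		       fm->fm_extents[i].fe_flags);
+
+	close(fd);
+	return 0;
+}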
+
+
+VFS -> File System Implementation
+---------------------------------
+
+File systems wishing to support fiemap must implement a ->fiemap callback on
+their inode_operations structure. The fs ->fiemap call is responsible for
+defining its set of supported fiemap flags, and calling a helper function on
+each discovered extent:
+
+struct inode_operations {
+ ...
+
+	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
+		      u64 len);
+};
+
+->fiemap is passed struct fiemap_extent_info which describes the
+fiemap request:
+
+struct fiemap_extent_info {
+ unsigned int fi_flags; /* Flags as passed from user */
+ unsigned int fi_extents_mapped; /* Number of mapped extents */
+ unsigned int fi_extents_max; /* Size of fiemap_extent array */
+ struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
+};
+
+It is intended that the file system should not need to access any of this
+structure directly.
+
+
+Flag checking should be done at the beginning of the ->fiemap callback via the
+fiemap_check_flags() helper:
+
+int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
+
+The struct fieinfo should be passed in as received from ioctl_fiemap(). The
+set of fiemap flags which the fs understands should be passed via fs_flags. If
+fiemap_check_flags finds invalid user flags, it will place the bad values in
+fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR from
+fiemap_check_flags(), it should immediately exit, returning that error back to
+ioctl_fiemap().
+
+
+For each extent in the request range, the file system should call
+the helper function, fiemap_fill_next_extent():
+
+int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
+			    u64 phys, u64 len, u32 flags);
+
+fiemap_fill_next_extent() will use the passed values to populate the
+next free extent in the fm_extents array. 'General' extent flags will
+automatically be set from specific flags on behalf of the calling file
+system so that the userspace API is not broken.
+
+fiemap_fill_next_extent() returns 0 on success, and 1 when the
+user-supplied fm_extents array is full. If an error is encountered
+while copying the extent to user memory, -EFAULT will be returned.
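+
+As an illustration only (myfs_next_extent() is a hypothetical helper
+standing in for the filesystem's own block-mapping logic), a minimal
+->fiemap implementation might look like:
+
+static int myfs_fiemap(struct inode *inode,
+		       struct fiemap_extent_info *fieinfo,
+		       u64 start, u64 len)
+{
+	u64 logical, phys, elen;
+	u32 flags;
+	int ret;
+
+	ret = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC);
+	if (ret)
+		return ret;	/* -EBADR: unsupported flags requested */
+
+	/* walk the extents covering [start, start + len); the helper is
+	 * expected to set FIEMAP_EXTENT_LAST on the final extent */
+	while (myfs_next_extent(inode, &start, len, &logical, &phys,
+				&elen, &flags) == 0) {
+		ret = fiemap_fill_next_extent(fieinfo, logical, phys,
+					      elen, flags);
+		if (ret)	/* 1 == extent array full, < 0 == error */
+			break;
+	}
+	return ret < 0 ? ret : 0;
+}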
diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt
new file mode 100644
index 00000000..ac2facc5
--- /dev/null
+++ b/Documentation/filesystems/files.txt
@@ -0,0 +1,123 @@
+File management in the Linux kernel
+-----------------------------------
+
+This document describes how locking for files (struct file)
+and the file descriptor table (struct files_struct) works.
+
+Up until 2.6.12, the file descriptor table was protected
+by a lock (files->file_lock) and a reference count (files->count).
+->file_lock protected accesses to all the file related fields
+of the table. ->count was used for sharing the file descriptor
+table between tasks cloned with CLONE_FILES flag. Typically
+this would be the case for posix threads. As with the common
+refcounting model in the kernel, the last task doing
+a put_files_struct() frees the file descriptor (fd) table.
+The files (struct file) themselves are protected using
+reference count (->f_count).
+
+In the new lock-free model of file descriptor management,
+the reference counting is similar, but the locking is
+based on RCU. The file descriptor table contains multiple
+elements - the fd sets (open_fds and close_on_exec), the
+array of file pointers, the sizes of the sets and the
+array, etc. In order for the updates to appear atomic to
+a lock-free reader, all the elements of the file descriptor
+table are in a separate structure - struct fdtable.
+files_struct contains a pointer to struct fdtable through
+which the actual fd table is accessed. Initially the
+fdtable is embedded in files_struct itself. On a subsequent
+expansion of fdtable, a new fdtable structure is allocated
+and files->fdt points to the new structure. The fdtable
+structure is freed with RCU and lock-free readers either
+see the old fdtable or the new fdtable making the update
+appear atomic. Here are the locking rules for
+the fdtable structure -
+
+1. All references to the fdtable must be done through
+ the files_fdtable() macro :
+
+ struct fdtable *fdt;
+
+ rcu_read_lock();
+
+ fdt = files_fdtable(files);
+ ....
+ if (n <= fdt->max_fds)
+ ....
+ ...
+ rcu_read_unlock();
+
+   files_fdtable() uses the rcu_dereference() macro which takes care of
+ the memory barrier requirements for lock-free dereference.
+ The fdtable pointer must be read within the read-side
+ critical section.
+
+2. Reading of the fdtable as described above must be protected
+ by rcu_read_lock()/rcu_read_unlock().
+
+3. For any update to the fd table, files->file_lock must
+ be held.
+
+4. To look up the file structure given an fd, a reader
+ must use either fcheck() or fcheck_files() APIs. These
+ take care of barrier requirements due to lock-free lookup.
+ An example :
+
+ struct file *file;
+
+ rcu_read_lock();
+ file = fcheck(fd);
+ if (file) {
+ ...
+ }
+ ....
+ rcu_read_unlock();
+
+5. Handling of the file structures is special. Since the look-up
+   of the fd (fget()/fget_light()) is lock-free, it is possible
+   that the look-up may race with the last put() operation on the
+ file structure. This is avoided using atomic_long_inc_not_zero()
+ on ->f_count :
+
+ rcu_read_lock();
+ file = fcheck_files(files, fd);
+ if (file) {
+ if (atomic_long_inc_not_zero(&file->f_count))
+ *fput_needed = 1;
+ else
+ /* Didn't get the reference, someone's freed */
+ file = NULL;
+ }
+ rcu_read_unlock();
+ ....
+ return file;
+
+   atomic_long_inc_not_zero() detects if the refcount is already zero
+   or goes to zero during the increment. If it does, we fail
+ fget()/fget_light().
+
+6. Since both fdtable and file structures can be looked up
+ lock-free, they must be installed using rcu_assign_pointer()
+ API. If they are looked up lock-free, rcu_dereference()
+ must be used. However it is advisable to use files_fdtable()
+ and fcheck()/fcheck_files() which take care of these issues.
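+
+   As an illustration (a sketch only - copy_fdtable() and
+   free_fdtable_rcu() are stand-ins for the real expansion code in
+   fs/file.c), publishing a newly allocated fdtable looks like:
+
+	spin_lock(&files->file_lock);
+	/* copy the old contents into the larger table */
+	copy_fdtable(new_fdt, old_fdt);
+	/* publish: lock-free readers now see either old or new table */
+	rcu_assign_pointer(files->fdt, new_fdt);
+	spin_unlock(&files->file_lock);
+	/* free the old table only after a grace period */
+	call_rcu(&old_fdt->rcu, free_fdtable_rcu);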
+
+7. While updating, the fdtable pointer must be looked up while
+ holding files->file_lock. If ->file_lock is dropped, then
+   another thread may expand the fd table, thereby creating a new
+   fdtable and making the earlier fdtable pointer stale.
+ For example :
+
+ spin_lock(&files->file_lock);
+ fd = locate_fd(files, file, start);
+ if (fd >= 0) {
+ /* locate_fd() may have expanded fdtable, load the ptr */
+ fdt = files_fdtable(files);
+ FD_SET(fd, fdt->open_fds);
+ FD_CLR(fd, fdt->close_on_exec);
+ spin_unlock(&files->file_lock);
+ .....
+
+ Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
+ the fdtable pointer (fdt) must be loaded after locate_fd().
+
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
new file mode 100644
index 00000000..13af4a49
--- /dev/null
+++ b/Documentation/filesystems/fuse.txt
@@ -0,0 +1,423 @@
+Definitions
+~~~~~~~~~~~
+
+Userspace filesystem:
+
+ A filesystem in which data and metadata are provided by an ordinary
+ userspace process. The filesystem can be accessed normally through
+ the kernel interface.
+
+Filesystem daemon:
+
+ The process(es) providing the data and metadata of the filesystem.
+
+Non-privileged mount (or user mount):
+
+ A userspace filesystem mounted by a non-privileged (non-root) user.
+ The filesystem daemon is running with the privileges of the mounting
+ user. NOTE: this is not the same as mounts allowed with the "user"
+ option in /etc/fstab, which is not discussed here.
+
+Filesystem connection:
+
+ A connection between the filesystem daemon and the kernel. The
+ connection exists until either the daemon dies, or the filesystem is
+ umounted. Note that detaching (or lazy umounting) the filesystem
+  does _not_ break the connection; in this case it will exist until
+ the last reference to the filesystem is released.
+
+Mount owner:
+
+ The user who does the mounting.
+
+User:
+
+ The user who is performing filesystem operations.
+
+What is FUSE?
+~~~~~~~~~~~~~
+
+FUSE is a userspace filesystem framework. It consists of a kernel
+module (fuse.ko), a userspace library (libfuse.*) and a mount utility
+(fusermount).
+
+One of the most important features of FUSE is allowing secure,
+non-privileged mounts. This opens up new possibilities for the use of
+filesystems. A good example is sshfs: a secure network filesystem
+using the sftp protocol.
+
+The userspace library and utilities are available from the FUSE
+homepage:
+
+ http://fuse.sourceforge.net/
+
+Filesystem type
+~~~~~~~~~~~~~~~
+
+The filesystem type given to mount(2) can be one of the following:
+
+'fuse'
+
+ This is the usual way to mount a FUSE filesystem. The first
+ argument of the mount system call may contain an arbitrary string,
+ which is not interpreted by the kernel.
+
+'fuseblk'
+
+ The filesystem is block device based. The first argument of the
+ mount system call is interpreted as the name of the device.
+
+Mount options
+~~~~~~~~~~~~~
+
+'fd=N'
+
+ The file descriptor to use for communication between the userspace
+ filesystem and the kernel. The file descriptor must have been
+ obtained by opening the FUSE device ('/dev/fuse').
+
+'rootmode=M'
+
+ The file mode of the filesystem's root in octal representation.
+
+'user_id=N'
+
+ The numeric user id of the mount owner.
+
+'group_id=N'
+
+ The numeric group id of the mount owner.
+
+'default_permissions'
+
+  By default FUSE doesn't check file access permissions; the
+ filesystem is free to implement its access policy or leave it to
+ the underlying file access mechanism (e.g. in case of network
+ filesystems). This option enables permission checking, restricting
+ access based on file mode. It is usually useful together with the
+ 'allow_other' mount option.
+
+'allow_other'
+
+ This option overrides the security measure restricting file access
+ to the user mounting the filesystem. This option is by default only
+ allowed to root, but this restriction can be removed with a
+ (userspace) configuration option.
+
+'max_read=N'
+
+ With this option the maximum size of read operations can be set.
+ The default is infinite. Note that the size of read requests is
+ limited anyway to 32 pages (which is 128kbyte on i386).
+
+'blksize=N'
+
+ Set the block size for the filesystem. The default is 512. This
+ option is only valid for 'fuseblk' type mounts.
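+
+As an illustration, the option string for a typical non-privileged
+mount (assembled by fusermount; the values here are examples only)
+might look like:
+
+  fd=4,rootmode=40000,user_id=1000,group_id=1000,default_permissions,allow_other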
+
+Control filesystem
+~~~~~~~~~~~~~~~~~~
+
+There's a control filesystem for FUSE, which can be mounted by:
+
+ mount -t fusectl none /sys/fs/fuse/connections
+
+Mounting it under the '/sys/fs/fuse/connections' directory makes it
+backwards compatible with earlier versions.
+
+Under the fuse control filesystem each connection has a directory
+named by a unique number.
+
+For each connection the following files exist within this directory:
+
+ 'waiting'
+
+ The number of requests which are waiting to be transferred to
+ userspace or being processed by the filesystem daemon. If there is
+ no filesystem activity and 'waiting' is non-zero, then the
+ filesystem is hung or deadlocked.
+
+ 'abort'
+
+ Writing anything into this file will abort the filesystem
+  connection. This means that all waiting requests will be aborted and
+  an error returned for all aborted and new requests.
+
+Only the owner of the mount may read or write these files.
+
+Interrupting filesystem operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a process issuing a FUSE filesystem request is interrupted, the
+following will happen:
+
+ 1) If the request is not yet sent to userspace AND the signal is
+ fatal (SIGKILL or unhandled fatal signal), then the request is
+ dequeued and returns immediately.
+
+ 2) If the request is not yet sent to userspace AND the signal is not
+ fatal, then an 'interrupted' flag is set for the request. When
+ the request has been successfully transferred to userspace and
+ this flag is set, an INTERRUPT request is queued.
+
+ 3) If the request is already sent to userspace, then an INTERRUPT
+ request is queued.
+
+INTERRUPT requests take precedence over other requests, so the
+userspace filesystem will receive queued INTERRUPTs before any others.
+
+The userspace filesystem may ignore the INTERRUPT requests entirely,
+or may honor them by sending a reply to the _original_ request, with
+the error set to EINTR.
+
+It is also possible that there's a race between processing the
+original request and its INTERRUPT request. There are two possibilities:
+
+ 1) The INTERRUPT request is processed before the original request is
+ processed
+
+ 2) The INTERRUPT request is processed after the original request has
+ been answered
+
+If the filesystem cannot find the original request, it should wait for
+some timeout and/or a number of new requests to arrive, after which it
+should reply to the INTERRUPT request with an EAGAIN error. In case
+1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT
+reply will be ignored.
+
+Aborting a filesystem connection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is possible to get into certain situations where the filesystem is
+not responding. Reasons for this may be:
+
+ a) Broken userspace filesystem implementation
+
+ b) Network connection down
+
+ c) Accidental deadlock
+
+ d) Malicious deadlock
+
+(For more on c) and d) see later sections)
+
+In any of these cases it may be useful to abort the connection to
+the filesystem. There are several ways to do this:
+
+ - Kill the filesystem daemon. Works in case of a) and b)
+
+ - Kill the filesystem daemon and all users of the filesystem. Works
+ in all cases except some malicious deadlocks
+
+ - Use forced umount (umount -f). Works in all cases but only if
+ filesystem is still attached (it hasn't been lazy unmounted)
+
+ - Abort filesystem through the FUSE control filesystem. Most
+ powerful method, always works.
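+
+For example, to abort the connection whose control directory is
+/sys/fs/fuse/connections/20 (the number is specific to each mount):
+
+  echo 1 > /sys/fs/fuse/connections/20/abort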
+
+How do non-privileged mounts work?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the mount() system call is a privileged operation, a helper
+program (fusermount) is needed, which is installed setuid root.
+
+The implication of providing non-privileged mounts is that the mount
+owner must not be able to use this capability to compromise the
+system. Obvious requirements arising from this are:
+
+ A) mount owner should not be able to get elevated privileges with the
+ help of the mounted filesystem
+
+ B) mount owner should not get illegitimate access to information from
+ other users' and the super user's processes
+
+ C) mount owner should not be able to induce undesired behavior in
+ other users' or the super user's processes
+
+How are requirements fulfilled?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ A) The mount owner could gain elevated privileges by either:
+
+ 1) creating a filesystem containing a device file, then opening
+ this device
+
+ 2) creating a filesystem containing a suid or sgid application,
+ then executing this application
+
+     The solution is to disallow opening device files and to ignore
+     the setuid and setgid bits when executing programs. To ensure
+     this, fusermount always adds "nosuid" and "nodev" to the mount
+     options for non-privileged mounts.
+
+ B) If another user is accessing files or directories in the
+ filesystem, the filesystem daemon serving requests can record the
+ exact sequence and timing of operations performed. This
+ information is otherwise inaccessible to the mount owner, so this
+ counts as an information leak.
+
+ The solution to this problem will be presented in point 2) of C).
+
+ C) There are several ways in which the mount owner can induce
+ undesired behavior in other users' processes, such as:
+
+ 1) mounting a filesystem over a file or directory which the mount
+       owner would otherwise not be able to modify (or could only
+ make limited modifications).
+
+ This is solved in fusermount, by checking the access
+ permissions on the mountpoint and only allowing the mount if
+ the mount owner can do unlimited modification (has write
+ access to the mountpoint, and mountpoint is not a "sticky"
+ directory)
+
+ 2) Even if 1) is solved the mount owner can change the behavior
+ of other users' processes.
+
+       i) It can slow down or indefinitely delay the execution of a
+          filesystem operation, creating a DoS against the user or the
+          whole system. For example, a suid application locking a
+          system file and then accessing a file on the mount owner's
+          filesystem could be stopped indefinitely, thus causing the
+          system file to be locked forever.
+
+ ii) It can present files or directories of unlimited length, or
+ directory structures of unlimited depth, possibly causing a
+ system process to eat up diskspace, memory or other
+ resources, again causing DoS.
+
+        The solution to this, as well as to B), is to deny access to
+        the filesystem to processes which the mount owner could not
+        otherwise monitor or manipulate. Since the mount owner can
+        already do all of the above to any process it can ptrace,
+        without using a FUSE mount, the same criteria as used in
+        ptrace can be used to check if a process is allowed to access
+        the filesystem or not.
+
+        Note that the ptrace check is not strictly necessary to
+        prevent C/2/i; it is enough to check if the mount owner has
+        enough privilege to send a signal to the process accessing the
+        filesystem, since SIGSTOP can be used to get a similar effect.
+
+I think these limitations are unacceptable?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a sysadmin trusts the users enough, or can ensure through other
+measures that system processes will never enter non-privileged
+mounts, the last limitation can be relaxed with the "user_allow_other"
+config option. If this config option is set, the mounting user can
+add the "allow_other" mount option which disables the check for other
+users' processes.
+
+Kernel - userspace interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following diagram shows how a filesystem operation (in this
+example unlink) is performed in FUSE.
+
+NOTE: everything in this description is greatly simplified
+
+ | "rm /mnt/fuse/file" | FUSE filesystem daemon
+ | |
+ | | >sys_read()
+ | | >fuse_dev_read()
+ | | >request_wait()
+ | | [sleep on fc->waitq]
+ | |
+ | >sys_unlink() |
+ | >fuse_unlink() |
+ | [get request from |
+ | fc->unused_list] |
+ | >request_send() |
+ | [queue req on fc->pending] |
+ | [wake up fc->waitq] | [woken up]
+ | >request_wait_answer() |
+ | [sleep on req->waitq] |
+ | | <request_wait()
+ | | [remove req from fc->pending]
+ | | [copy req to read buffer]
+ | | [add req to fc->processing]
+ | | <fuse_dev_read()
+ | | <sys_read()
+ | |
+ | | [perform unlink]
+ | |
+ | | >sys_write()
+ | | >fuse_dev_write()
+ | | [look up req in fc->processing]
+ | | [remove from fc->processing]
+ | | [copy write buffer to req]
+ | [woken up] | [wake up req->waitq]
+ | | <fuse_dev_write()
+ | | <sys_write()
+ | <request_wait_answer() |
+ | <request_send() |
+ | [add request to |
+ | fc->unused_list] |
+ | <fuse_unlink() |
+ | <sys_unlink() |
+
+There are a couple of ways in which to deadlock a FUSE filesystem.
+Since we are talking about unprivileged userspace programs,
+something must be done about these.
+
+Scenario 1 - Simple deadlock
+-----------------------------
+
+ | "rm /mnt/fuse/file" | FUSE filesystem daemon
+ | |
+ | >sys_unlink("/mnt/fuse/file") |
+ | [acquire inode semaphore |
+ | for "file"] |
+ | >fuse_unlink() |
+ | [sleep on req->waitq] |
+ | | <sys_read()
+ | | >sys_unlink("/mnt/fuse/file")
+ | | [acquire inode semaphore
+ | | for "file"]
+ | | *DEADLOCK*
+
+The solution for this is to allow the filesystem to be aborted.
+
+Scenario 2 - Tricky deadlock
+----------------------------
+
+This one needs a carefully crafted filesystem. It's a variation on
+the above, only the call back to the filesystem is not explicit,
+but is caused by a pagefault.
+
+ | Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2
+ | |
+ | [fd = open("/mnt/fuse/file")] | [request served normally]
+ | [mmap fd to 'addr'] |
+ | [close fd] | [FLUSH triggers 'magic' flag]
+ | [read a byte from addr] |
+ | >do_page_fault() |
+ | [find or create page] |
+ | [lock page] |
+ | >fuse_readpage() |
+ | [queue READ request] |
+ | [sleep on req->waitq] |
+ | | [read request to buffer]
+ | | [create reply header before addr]
+ | | >sys_write(addr - headerlength)
+ | | >fuse_dev_write()
+ | | [look up req in fc->processing]
+ | | [remove from fc->processing]
+ | | [copy write buffer to req]
+ | | >do_page_fault()
+ | | [find or create page]
+ | | [lock page]
+ | | * DEADLOCK *
+
+Solution is basically the same as above.
+
+An additional problem is that while the write buffer is being copied
+to the request, the request must not be interrupted/aborted. This is
+because the destination address of the copy may not be valid after the
+request has returned.
+
+This is solved by doing the copy atomically, and allowing abort
+while the page(s) belonging to the write buffer are faulted with
+get_user_pages(). The 'req->locked' flag indicates when the copy is
+taking place, and abort is delayed until this flag is unset.
diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt
new file mode 100644
index 00000000..0494f78d
--- /dev/null
+++ b/Documentation/filesystems/gfs2-glocks.txt
@@ -0,0 +1,114 @@
+ Glock internal locking rules
+ ------------------------------
+
+This documents the basic principles of the glock state machine
+internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
+has two main (internal) locks:
+
+ 1. A spinlock (gl_spin) which protects the internal state such
+ as gl_state, gl_target and the list of holders (gl_holders)
+ 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
+ threads from making calls to the DLM, etc. at the same time. If a
+ thread takes this lock, it must then call run_queue (usually via the
+ workqueue) when it releases it in order to ensure any pending tasks
+ are completed.
+
+The gl_holders list contains all the queued lock requests (not
+just the holders) associated with the glock. If there are any
+held locks, then they will be contiguous entries at the head
+of the list. Locks are granted in strictly the order that they
+are queued, except for those marked LM_FLAG_PRIORITY which are
+used only during recovery, and even then only for journal locks.
+
+There are three lock states that users of the glock layer can request,
+namely shared (SH), deferred (DF) and exclusive (EX). Those translate
+to the following DLM lock modes:
+
+Glock mode | DLM lock mode
+------------------------------
+ UN | IV/NL Unlocked (no DLM lock associated with glock) or NL
+ SH | PR (Protected read)
+ DF | CW (Concurrent write)
+ EX | EX (Exclusive)
+
+Thus DF is basically a shared mode which is incompatible with the "normal"
+shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
+operations. The glocks are basically a lock plus some routines which deal
+with cache management. The following rules apply for the cache:
+
+Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
+--------------------------------------------------------------------------
+ UN | No | No | No | No
+ SH | Yes | Yes | No | No
+ DF | No | Yes | No | No
+ EX | Yes | Yes | Yes | Yes
+
+These rules are implemented using the various glock operations which
+are defined for each type of glock. Not all types of glocks use
+all the modes. Only inode glocks use the DF mode for example.
+
+Table of glock operations and per type constants:
+
+Field | Purpose
+----------------------------------------------------------------------------
+go_xmote_th | Called before remote state change (e.g. to sync dirty data)
+go_xmote_bh | Called after remote state change (e.g. to refill cache)
+go_inval | Called if remote state change requires invalidating the cache
+go_demote_ok | Returns a boolean value of whether it's ok to demote a glock
+ | (e.g. checks timeout, and that there is no cached data)
+go_lock | Called for the first local holder of a lock
+go_unlock | Called on the final local unlock of a lock
+go_dump | Called to print content of object for debugfs file, or on
+ | error to dump glock to the log.
+go_type | The type of the glock, LM_TYPE_.....
+go_min_hold_time | The minimum hold time
+
+The minimum hold time for each lock is the time after a remote lock
+grant for which we ignore remote demote requests. This is in order to
+prevent a situation where locks are being bounced around the cluster
+from node to node with none of the nodes making any progress. This
+tends to show up most with shared mmaped files which are being written
+to by multiple nodes. By delaying the demotion in response to a
+remote callback, that gives the userspace program time to make
+some progress before the pages are unmapped.
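+
+For illustration, the inode glock's operations table (roughly as
+defined in fs/gfs2/glops.c; the exact contents vary between kernel
+versions) uses the fields from the table above:
+
+	const struct gfs2_glock_operations gfs2_inode_glops = {
+		.go_xmote_th = inode_go_sync,
+		.go_inval = inode_go_inval,
+		.go_demote_ok = inode_go_demote_ok,
+		.go_lock = inode_go_lock,
+		.go_dump = inode_go_dump,
+		.go_type = LM_TYPE_INODE,
+		.go_min_hold_time = HZ / 5,
+	};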
+
+There is a plan to try and remove the go_lock and go_unlock callbacks
+if possible, in order to try and speed up the fast path through the locking.
+Also, eventually we hope to make the glock "EX" mode locally shared
+such that any local locking will be done with the i_mutex as required
+rather than via the glock.
+
+Locking rules for glock operations:
+
+Operation | GLF_LOCK bit lock held | gl_spin spinlock held
+-----------------------------------------------------------------
+go_xmote_th | Yes | No
+go_xmote_bh | Yes | No
+go_inval | Yes | No
+go_demote_ok | Sometimes | Yes
+go_lock | Yes | No
+go_unlock | Yes | No
+go_dump | Sometimes | Yes
+
+N.B. Operations must not drop either the bit lock or the spinlock
+if it is held on entry. go_dump and go_demote_ok must never block.
+Note that go_dump will only be called if the glock's state
+indicates that it is caching uptodate data.
+
+Glock locking order within GFS2:
+
+ 1. i_mutex (if required)
+ 2. Rename glock (for rename only)
+ 3. Inode glock(s)
+ (Parents before children, inodes at "same level" with same parent in
+ lock number order)
+ 4. Rgrp glock(s) (for (de)allocation operations)
+ 5. Transaction glock (via gfs2_trans_begin) for non-read operations
+ 6. Page lock (always last, very important!)
+
+There are two glocks per inode. One deals with access to the inode
+itself (locking order as above), and the other, known as the iopen
+glock, is used in conjunction with the i_nlink field in the inode to
+determine the lifetime of the inode in question. Locking of inodes
+is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
+
diff --git a/Documentation/filesystems/gfs2-uevents.txt b/Documentation/filesystems/gfs2-uevents.txt
new file mode 100644
index 00000000..d8188966
--- /dev/null
+++ b/Documentation/filesystems/gfs2-uevents.txt
@@ -0,0 +1,100 @@
+ uevents and GFS2
+ ==================
+
+During the lifetime of a GFS2 mount, a number of uevents are generated.
+This document explains what the events are and what they are used
+for (by gfs_controld in gfs2-utils).
+
+A list of GFS2 uevents
+-----------------------
+
+1. ADD
+
+The ADD event occurs at mount time. It will always be the first
+uevent generated by the newly created filesystem. If the mount
+is successful, an ONLINE uevent will follow. If it is not successful
+then a REMOVE uevent will follow.
+
+The ADD uevent has two environment variables: SPECTATOR=[0|1]
+and RDONLY=[0|1] that specify the spectator status (a read-only mount
+with no journal assigned), and read-only (with journal assigned) status
+of the filesystem respectively.
+
+2. ONLINE
+
+The ONLINE uevent is generated after a successful mount or remount. It
+has the same environment variables as the ADD uevent. The ONLINE
+uevent, along with the two environment variables for spectator and
+RDONLY are a relatively recent addition (2.6.32-rc+) and will not
+be generated by older kernels.
+
+3. CHANGE
+
+The CHANGE uevent is used in two places. One is when reporting the
+successful mount of the filesystem by the first node (FIRSTMOUNT=Done).
+This is used as a signal by gfs_controld that it is then ok for other
+nodes in the cluster to mount the filesystem.
+
+The other CHANGE uevent is used to inform of the completion
+of journal recovery for one of the filesystem's journals. It has
+two environment variables, JID= which specifies the journal id which
+has just been recovered, and RECOVERY=[Done|Failed] to indicate the
+success (or otherwise) of the operation. These uevents are generated
+for every journal recovered, whether it is during the initial mount
+process or as the result of gfs_controld requesting a specific journal
+recovery via the /sys/fs/gfs2/<fsname>/lock_module/recovery file.
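+
+As an illustration, the environment of a recovery CHANGE uevent might
+contain (the values here are examples only):
+
+	ACTION=change
+	LOCKTABLE=mycluster:myfs
+	LOCKPROTO=lock_dlm
+	JOURNALID=0
+	JID=0
+	RECOVERY=Done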
+
+Because the CHANGE uevent was used (in early versions of gfs_controld)
+without checking the environment variables to discover the state, we
+cannot add any more functions to it without running the risk of
+someone using an older version of the user tools and breaking their
+cluster. For this reason the ONLINE uevent was used when adding a new
+uevent for a successful mount or remount.
+
+4. OFFLINE
+
+The OFFLINE uevent is only generated due to filesystem errors and is used
+as part of the "withdraw" mechanism. Currently this doesn't give any
+information about what the error is, which is something that needs to
+be fixed.
+
+5. REMOVE
+
+The REMOVE uevent is generated at the end of an unsuccessful mount
+or at the end of a umount of the filesystem. All REMOVE uevents will
+have been preceded by at least an ADD uevent for the same filesystem,
+and, unlike the other uevents, it is generated automatically by the
+kernel's kobject subsystem.
+
+
+Information common to all GFS2 uevents (uevent environment variables)
+----------------------------------------------------------------------
+
+1. LOCKTABLE=
+
+The LOCKTABLE is a string, as supplied on the mount command
+line (locktable=) or via fstab. It is used as a filesystem label
+as well as providing the information for a lock_dlm mount to be
+able to join the cluster.
+
+2. LOCKPROTO=
+
+The LOCKPROTO is a string, and its value depends on what is set
+on the mount command line, or via fstab. It will be either
+lock_nolock or lock_dlm. In the future other lock managers
+may be supported.
+
+3. JOURNALID=
+
+If a journal is in use by the filesystem (journals are not
+assigned for spectator mounts) then this will give the
+numeric journal id in all GFS2 uevents.
+
+4. UUID=
+
+With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID
+into the filesystem superblock. If it exists, this will
+be included in every uevent relating to the filesystem.
+
+
+
diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt
new file mode 100644
index 00000000..4cda9266
--- /dev/null
+++ b/Documentation/filesystems/gfs2.txt
@@ -0,0 +1,46 @@
+Global File System
+------------------
+
+http://sources.redhat.com/cluster/wiki/
+
+GFS is a cluster file system. It allows a cluster of computers to
+simultaneously use a block device that is shared between them (with FC,
+iSCSI, NBD, etc). GFS reads and writes to the block device like a local
+file system, but also uses a lock module to allow the computers to coordinate
+their I/O so file system consistency is maintained. One of the nifty
+features of GFS is perfect consistency -- changes made to the file system
+on one machine show up immediately on all other machines in the cluster.
+
+GFS uses interchangeable inter-node locking mechanisms; the currently
+supported mechanisms are:
+
+ lock_nolock -- allows gfs to be used as a local file system
+
+ lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking
+ The dlm is found at linux/fs/dlm/
+
+Lock_dlm depends on user space cluster management systems found
+at the URL above.
+
+To use gfs as a local file system, no external clustering systems are
+needed, simply:
+
+ $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
+ $ mount -t gfs2 /dev/block_device /dir
+
+If you are using Fedora, you need to install the gfs2-utils package
+and, for lock_dlm, you will also need to install the cman package
+and write a cluster.conf as per the documentation.
+
+GFS2 is not on-disk compatible with previous versions of GFS, but it
+is pretty close.
+
+The following man pages can be found at the URL above:
+ fsck.gfs2 to repair a filesystem
+ gfs2_grow to expand a filesystem online
+ gfs2_jadd to add journals to a filesystem online
+ gfs2_tool to manipulate, examine and tune a filesystem
+ gfs2_quota to examine and change quota values in a filesystem
+ gfs2_convert to convert a gfs filesystem to gfs2 in-place
+ mount.gfs2 to help mount(8) mount a filesystem
+ mkfs.gfs2 to make a filesystem
diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.txt
new file mode 100644
index 00000000..bd0fa770
--- /dev/null
+++ b/Documentation/filesystems/hfs.txt
@@ -0,0 +1,83 @@
+
+Macintosh HFS Filesystem for Linux
+==================================
+
+HFS stands for ``Hierarchical File System'' and is the filesystem used
+by the Mac Plus and all later Macintosh models. Earlier Macintosh
+models used MFS (``Macintosh File System''), which is not supported.
+MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
+HFS but is extended in various areas. Use the hfsplus filesystem driver
+to access such filesystems from Linux.
+
+
+Mount options
+=============
+
+When mounting an HFS filesystem, the following options are accepted:
+
+ creator=cccc, type=cccc
+ Specifies the creator/type values as shown by the MacOS finder
+ used for creating new files. Default values: '????'.
+
+ uid=n, gid=n
+	Specifies the user/group that owns all files on the filesystem.
+ Default: user/group id of the mounting process.
+
+ dir_umask=n, file_umask=n, umask=n
+	Specifies the umask used for all files, all directories or all
+ files and directories. Defaults to the umask of the mounting process.
+
+ session=n
+ Select the CDROM session to mount as HFS filesystem. Defaults to
+ leaving that decision to the CDROM driver. This option will fail
+	with anything but a CDROM as the underlying device.
+
+ part=n
+	Select partition number n from the device. This only makes
+	sense for CDROMs because they can't be partitioned under Linux.
+ For disk devices the generic partition parsing code does this
+ for us. Defaults to not parsing the partition table at all.
+
+ quiet
+ Ignore invalid mount options instead of complaining.
+
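+For example (the device path here is illustrative):
+
+	mount -t hfs -o uid=1000,gid=100,umask=002 /dev/sdb1 /mnt/mac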
+
+Writing to HFS Filesystems
+==========================
+
+HFS is not a UNIX filesystem, thus it does not have the usual features you'd
+expect:
+
+ o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
+ and gid of files.
+ o You can't create hard- or symlinks, device files, sockets or FIFOs.
+
+HFS does, on the other hand, have the concept of multiple forks per file.
+These non-standard forks are represented as hidden additional files in the
+normal filesystem namespace, which is kind of a kludge and makes the
+semantics a little strange:
+
+ o You can't create, delete or rename resource forks of files or the
+ Finder's metadata.
+ o They are however created (with default values), deleted and renamed
+ along with the corresponding data fork or directory.
+ o Copying files to a different filesystem will lose those attributes
+ that are essential for MacOS to work.
+
+
+Creating HFS filesystems
+========================
+
+The hfsutils package from Robert Leslie contains a program called
+hformat that can be used to create HFS filesystems. See
+<http://www.mars.org/home/rob/proj/hfs/> for details.
+
+
+Credits
+=======
+
+The HFS driver was written by Paul H. Hargrove (hargrove@sccm.Stanford.EDU)
+and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis
+Technologies.
+Roman rewrote large parts of the code and brought in btree routines derived
+from Brad Boyer's hfsplus driver (also maintained by Roman now).
diff --git a/Documentation/filesystems/hfsplus.txt b/Documentation/filesystems/hfsplus.txt
new file mode 100644
index 00000000..af1628a1
--- /dev/null
+++ b/Documentation/filesystems/hfsplus.txt
@@ -0,0 +1,59 @@
+
+Macintosh HFSPlus Filesystem for Linux
+======================================
+
+HFSPlus is a filesystem first introduced in MacOS 8.1.
+HFSPlus has several extensions to HFS, including 32-bit allocation
+blocks, 255-character unicode filenames, and file sizes of 2^63 bytes.
+
+
+Mount options
+=============
+
+When mounting an HFSPlus filesystem, the following options are accepted:
+
+ creator=cccc, type=cccc
+ Specifies the creator/type values as shown by the MacOS finder
+ used for creating new files. Default values: '????'.
+
+ uid=n, gid=n
+ Specifies the user/group that owns all files on the filesystem
+ that have uninitialized permissions structures.
+ Default: user/group id of the mounting process.
+
+ umask=n
+ Specifies the umask (in octal) used for files and directories
+ that have uninitialized permissions structures.
+ Default: umask of the mounting process.
+
+ session=n
+ Select the CDROM session to mount as HFSPlus filesystem. Defaults to
+ leaving that decision to the CDROM driver. This option will fail
+	with anything but a CDROM as the underlying device.
+
+ part=n
+ Select partition number n from the devices. This option only makes
+ sense for CDROMs because they can't be partitioned under Linux.
+ For disk devices the generic partition parsing code does this
+ for us. Defaults to not parsing the partition table at all.
+
+ decompose
+ Decompose file name characters.
+
+ nodecompose
+ Do not decompose file name characters.
+
+ force
+ Used to force write access to volumes that are marked as journalled
+ or locked. Use at your own risk.
+
+ nls=cccc
+ Encoding to use when presenting file names.
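+
+For example, to mount a volume with UTF-8 file names (the device path
+here is illustrative):
+
+	mount -t hfsplus -o nls=utf8 /dev/sdb2 /mnt/machd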
+
+
+References
+==========
+
+kernel source: <file:fs/hfsplus>
+
+Apple Technote 1150 http://developer.apple.com/technotes/tn/tn1150.html
diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt
new file mode 100644
index 00000000..74630bd5
--- /dev/null
+++ b/Documentation/filesystems/hpfs.txt
@@ -0,0 +1,296 @@
+Read/Write HPFS 2.09
+1998-2004, Mikulas Patocka
+
+email: mikulas@artax.karlin.mff.cuni.cz
+homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
+
+CREDITS:
+Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
+ is taken from it
+Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
+Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
+
+Mount options
+
+uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask)
+ Set owner/group/mode for files that do not have it specified in extended
+ attributes. Mode is inverted umask - for example umask 027 gives owner
+ all permission, group read permission and anybody else no access. Note
+	that for files the mode is ANDed with 0666. If you want files to have 'x'
+ rights, you must use extended attributes.
+case=lower,asis (default asis)
+ File name lowercasing in readdir.
+conv=binary,text,auto (default binary)
+	CR/LF -> LF conversion; if auto, the decision is made according to the
+	extension - there is a list of text extensions (I think it's better not
+	to convert a text file than to damage a binary file). If you want to
+	change that list, change it in the source. The original read-only HPFS
+	contained some strange heuristic algorithm that I removed. I think it's
+	dangerous to let the computer decide whether a file is text or binary.
+	For example, DJGPP binaries contain a small text message at the
+	beginning and they could be misidentified and damaged under some
+	circumstances.
+check=none,normal,strict (default normal)
+	Check level. Selecting none will cause only a little speedup and big
+	danger. I tried to write it so that it won't crash with check=normal on
+ corrupted filesystems. check=strict means many superfluous checks -
+ used for debugging (for example it checks if file is allocated in
+ bitmaps when accessing it).
+errors=continue,remount-ro,panic (default remount-ro)
+	Behaviour when filesystem errors are found.
+chkdsk=no,errors,always (default errors)
+ When to mark filesystem dirty so that OS/2 checks it.
+eas=no,ro,rw (default rw)
+ What to do with extended attributes. 'no' - ignore them and use always
+ values specified in uid/gid/mode options. 'ro' - read extended
+ attributes but do not create them. 'rw' - create extended attributes
+ when you use chmod/chown/chgrp/mknod/ln -s on the filesystem.
+timeshift=(-)nnn (default 0)
+	Shifts the time by nnn seconds. For example, if you see one hour more
+	under Linux than under OS/2, use timeshift=-3600.
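+
+For example (the device name here is illustrative):
+
+	mount -t hpfs -o uid=1000,gid=100,umask=027,case=lower /dev/hda5 /mnt/os2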
+
+
+File names
+
+As in OS/2, filenames are case insensitive. However, the shell thinks that
+names are case sensitive, so for example when you create a file FOO, you can
+use 'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note that
+you also won't be able to compile the Linux kernel (and maybe other things)
+on HPFS because the kernel creates different files with names like bootsect.S
+and bootsect.s. When searching for a file whose name has characters >= 128,
+codepages are used - see below.
+OS/2 ignores dots and spaces at the end of file name, so this driver does as
+well. If you create 'a. ...', the file 'a' will be created, but you can still
+access it under names 'a.', 'a..', 'a . . . ' etc.
+
+
+Extended attributes
+
+On HPFS partitions, OS/2 can associate with each file special information called
+extended attributes. Extended attributes are pairs of (key,value) where key is
+an ascii string identifying that attribute and value is any string of bytes of
+variable length. OS/2 stores window and icon positions and file types there. So
+why not use it for unix-specific info like file owner or access rights? This
+driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended
+attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only
+those extended attributes whose value differs from the defaults specified in mount
+options are created. Once created, the extended attributes are never deleted,
+they're just changed. It means that when your default uid=0 and you type
+something like 'chown luser file; chown root file' the file will contain
+extended attribute UID=0. And when you umount the fs and mount it again with
+uid=luser_uid, the file will still be owned by root! If you chmod a file to 444,
+extended attribute "MODE" will not be set, this special case is done by setting
+read-only flag. When you mknod a block or char device, besides "MODE", the
+special 4-byte extended attribute "DEV" will be created containing the device
+number. Currently this driver cannot resize extended attributes - it means
+that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV"
+attributes with different sizes, they won't be rewritten and changing these
+values doesn't work.
+
+
+Symlinks
+
+You can make symlinks on an HPFS partition; symlinks are achieved by setting
+an extended attribute named "SYMLINK" with the symlink value. Like on ext2,
+you can chown and chgrp symlinks but I don't know what it is good for.
+chmoding a symlink results in chmoding the file the symlink points to. These
+symlinks are just for Linux use and are incompatible with OS/2. OS/2 PmShell
+symlinks are not supported because they are stored in a very crazy way. They
+tried to do it so that the link changes when the file is moved ... sometimes
+it works. But the link is partly stored in directory
+extended attributes and partly in OS2SYS.INI. I don't want (and don't know how)
+to analyze or change OS2SYS.INI.
+
+
+Codepages
+
+HPFS can contain several uppercasing tables for several codepages and each
+file has a pointer to the codepage its name is in. However OS/2 was created in
+America where people don't care much about codepages and so multiple codepages
+support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
+Once I booted English OS/2 working in cp 850 and I created a file on my 852
+partition. It marked file name codepage as 850 - good. But when I again booted
+Czech OS/2, the file was completely inaccessible under any name. It seems that
+OS/2 uppercases the search pattern with its system code page (852), but the
+file name it's comparing against with that file's code page (850). These could
+never match. Is it really what IBM developers wanted? But the problems
+continued. When I then created another file in that directory under Czech
+OS/2, that file was inaccessible too. OS/2
+probably uses different uppercasing method when searching where to place a file
+(note, that files in HPFS directory must be sorted) and when searching for
+a file. Finally when I opened this directory in PmShell, PmShell crashed (the
+funny thing was that, when rebooted, PmShell tried to reopen this directory
+again :-). chkdsk happily ignores these errors and only low-level disk
+modification saved me. Never mix different language versions of OS/2 on one
+system although HPFS was designed to allow that.
+OK, I could implement complex codepage support to this driver but I think it
+would cause more problems than benefit with such buggy implementation in OS/2.
+So this driver simply uses the first codepage it finds for uppercasing and
+lowercasing no matter what the file's codepage index is. Usually all file
+names are in this codepage - if you don't try to do what I described above :-)
+
+
+Known bugs
+
+HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
+should work. If you have OS/2 server, use only read-only mode. I don't know how
+to handle some HPFS386 structures like the access control list or the
+extended perm list, I don't know how to delete them when a file is deleted,
+and how not to overwrite them with extended attributes. Send me some info on
+these structures and I'll implement it. However, this driver should detect the
+presence of HPFS386 structures, remount read-only and not destroy them (I hope).
+
+When there's not enough space for extended attributes, they will be truncated
+and no error is returned.
+
+OS/2 can't access files if the path is longer than about 256 chars but this
+driver allows you to do it. chkdsk ignores such errors.
+
+Sometimes you won't be able to delete some files on a very full filesystem
+(returning the error ENOSPC). That's because a file in a non-leaf node of the
+directory tree (one directory, if it's large, has its dirents in a tree on
+HPFS) must be replaced with another node when deleted. And that new file might
+have a longer name than the old one, so the new name doesn't fit in the
+directory node (dnode). That would result in a directory tree split, which
+takes disk space. The workaround is to delete other files that are leaves
+(the probability that a file is non-leaf is about 1/50) or to truncate the
+file first to make some space.
+You encounter this problem only if you have many directories, so that the
+preallocated directory band is full, i.e.
+ number_of_directories / size_of_filesystem_in_mb > 4.
+
+You can't delete open directories.
+
+You can't rename over directories (what is it good for?).
+
+Renaming files so that only case changes doesn't work. This driver supports it
+but vfs doesn't. Something like 'mv file FILE' won't work.
+
+Atimes and directory mtimes are not updated. That's because of performance
+reasons. If you really wish to update them, let me know; I'll write it (but
+it will be slow).
+
+When the system is out of memory and swap, it may slightly corrupt the
+filesystem (lost files, unbalanced directories). (I guess all filesystems may
+do that.)
+
+When compiled, you get the warning: "function declaration isn't a prototype".
+Does anybody know what it means?
+
+
+What does "unbalanced tree" message mean?
+
+Old versions of this driver sometimes created unbalanced dnode trees. OS/2
+chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
+unbalanced trees too :-) but both HPFS and HPFS386 contain a bug that makes
+them rarely crash when the tree is not balanced. This driver handles
+unbalanced trees correctly and writes a warning if it finds them. If you see
+this message, it is probably because of directories created with an old
+version of this driver. The workaround is to move all files from that
+directory to another and then back again. Do it in Linux, not OS/2! If you
+see this message in a directory that was created entirely by this driver, it
+is a BUG - let me know about it.
+
+
+Bugs in OS/2
+
+When you have two (or more) lost directories pointing to each other, chkdsk
+locks up when repairing the filesystem.
+
+Sometimes (I think it's random) when you create a file with one-char name under
+OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs
+error corrected".
+
+File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
+marks them as short (and writes "minor fs error corrected"). This bug is not in
+HPFS386.
+
+Codepage bugs described above.
+
+If you don't install fixpacks, there are many, many more...
+
+
+History
+
+0.90 First public release
+0.91 Fixed bug that caused shooting to memory when write_inode was called on
+ open inode (rarely happened)
+0.92 Fixed a little memory leak in freeing directory inodes
+0.93 Fixed bug that locked up the machine when there were too many filenames
+ with first 15 characters same
+ Fixed write_file to zero file when writing behind file end
+0.94 Fixed a little memory leak when trying to delete busy file or directory
+0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
+1.90 First version for 2.1.1xx kernels
+1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
+ Fixed a race-condition when write_inode is called while deleting file
+ Fixed a bug that could possibly happen (with very low probability) when
+ using 0xff in filenames
+ Rewritten locking to avoid race-conditions
+ Mount option 'eas' now works
+ Fsync no longer returns error
+ Files beginning with '.' are marked hidden
+ Remount support added
+ Alloc is not so slow when filesystem becomes full
+ Atimes are no more updated because it slows down operation
+ Code cleanup (removed all commented debug prints)
+1.92 Corrected a bug when sync was called just before closing file
+1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
+ works with previous versions
+ Fixed a possible problem with disks > 64G (but I don't have one, so I can't
+ test it)
+ Fixed a file overflow at 2G
+ Added new option 'timeshift'
+ Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
+ read-only mode
+ Fixed a bug that slowed down alloc and prevented allocating 100% space
+ (this bug was not destructive)
+1.94 Added workaround for one bug in Linux
+ Fixed one buffer leak
+ Fixed some incompatibilities with large extended attributes (but it's still
+ not 100% ok, I have no info on it and OS/2 doesn't want to create them)
+ Rewritten allocation
+ Fixed a bug with i_blocks (du sometimes didn't display correct values)
+ Directories have no longer archive attribute set (some programs don't like
+ it)
+ Fixed a bug that it set badly one flag in large anode tree (it was not
+ destructive)
+1.95 Fixed one buffer leak, that could happen on corrupted filesystem
+ Fixed one bug in allocation in 1.94
+1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
+ error sometimes when opening directories in PMSHELL)
+ Fixed a possible bitmap race
+ Fixed possible problem on large disks
+ You can now delete open files
+ Fixed a nondestructive race in rename
+1.97 Support for HPFS v3 (on large partitions)
+ Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
+1.97.1 Changed names of global symbols
+ Fixed a bug when chmoding or chowning root directory
+1.98 Fixed a deadlock when using old_readdir
+ Better directory handling; workaround for "unbalanced tree" bug in OS/2
+1.99 Corrected a possible problem when there's not enough space while deleting
+ file
+ Now it tries to truncate the file if there's not enough space when deleting
+ Removed a lot of redundant code
+2.00 Fixed a bug in rename (it was there since 1.96)
+ Better anti-fragmentation strategy
+2.01 Fixed problem with directory listing over NFS
+ Directory lseek now checks for proper parameters
+ Fixed race-condition in buffer code - it is in all filesystems in Linux;
+ when reading device (cat /dev/hda) while creating files on it, files
+ could be damaged
+2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
+ end of partition
+2.03 Char, block devices and pipes are correctly created
+ Fixed non-crashing race in unlink (Alexander Viro)
+ Now it works with Japanese version of OS/2
+2.04 Fixed error when ftruncate used to extend file
+2.05 Fixed crash when got mount parameters without =
+ Fixed crash when allocation of anode failed due to full disk
+ Fixed some crashes when block io or inode allocation failed
+2.06 Fixed some crash on corrupted disk structures
+ Better allocation strategy
+ Reschedule points added so that it doesn't lock CPU long time
+ It should work in read-only mode on Warp Server
+2.07 More fixes for Warp Server. Now it really works
+2.08 Creating new files is not so slow on large disks
+ An attempt to sync deleted file does not generate filesystem error
+2.09 Fixed error on extremely fragmented files
+
+
+ vim: set textwidth=80:
diff --git a/Documentation/filesystems/inotify.txt b/Documentation/filesystems/inotify.txt
new file mode 100644
index 00000000..59a919f1
--- /dev/null
+++ b/Documentation/filesystems/inotify.txt
@@ -0,0 +1,269 @@
+ inotify
+ a powerful yet simple file change notification system
+
+
+
+Document started 15 Mar 2005 by Robert Love <rml@novell.com>
+
+
+(i) User Interface
+
+Inotify is controlled by a set of three system calls and normal file I/O on a
+returned file descriptor.
+
+The first step in using inotify is to initialise an inotify instance:
+
+ int fd = inotify_init ();
+
+Each instance is associated with a unique, ordered queue.
+
+Change events are managed by "watches". A watch is an (object,mask) pair where
+the object is a file or directory and the mask is a bit mask of one or more
+inotify events that the application wishes to receive. See <linux/inotify.h>
+for valid events. A watch is referenced by a watch descriptor, or wd.
+
+Watches are added via a path to the file.
+
+Watches on a directory will return events on any files inside of the directory.
+
+Adding a watch is simple:
+
+ int wd = inotify_add_watch (fd, path, mask);
+
+Where "fd" is the return value from inotify_init(), path is the path to the
+object to watch, and mask is the watch mask (see <linux/inotify.h>).
+
+You can update an existing watch in the same manner, by passing in a new mask.
+
+An existing watch is removed via
+
+ int ret = inotify_rm_watch (fd, wd);
+
+Events are provided in the form of an inotify_event structure that is read(2)
+from a given inotify instance. The filename is of dynamic length and follows
+the struct. It is of size len. The filename is padded with null bytes to
+ensure proper alignment. This padding is reflected in len.
+
+You can slurp multiple events by passing a large buffer, for example
+
+ size_t len = read (fd, buf, BUF_LEN);
+
+Where "buf" is a pointer to an array of "inotify_event" structures at least
+BUF_LEN bytes in size. The above example will return as many events as are
+available and fit in BUF_LEN.
+
+Each inotify instance fd is also select()- and poll()-able.
+
+You can find the size of the current event queue via the standard FIONREAD
+ioctl on the fd returned by inotify_init().
+
+All watches are destroyed and cleaned up on close.
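+
+Putting the pieces above together, a minimal example program (a sketch
+only, assuming the glibc wrappers declared in <sys/inotify.h>) could
+look like:
+
+	#include <stdio.h>
+	#include <unistd.h>
+	#include <sys/inotify.h>
+
+	int main(int argc, char *argv[])
+	{
+		/* keep the buffer aligned for struct inotify_event */
+		char buf[4096]
+		    __attribute__((aligned(__alignof__(struct inotify_event))));
+		ssize_t len, i;
+		int fd, wd;
+
+		if (argc != 2) {
+			fprintf(stderr, "usage: %s <path>\n", argv[0]);
+			return 1;
+		}
+
+		fd = inotify_init();
+		if (fd < 0)
+			return 1;
+
+		wd = inotify_add_watch(fd, argv[1],
+				       IN_CREATE | IN_DELETE | IN_MODIFY);
+		if (wd < 0)
+			return 1;
+
+		/* slurp as many events as fit in the buffer */
+		len = read(fd, buf, sizeof(buf));
+		i = 0;
+		while (i + (ssize_t)sizeof(struct inotify_event) <= len) {
+			struct inotify_event *ev;
+
+			ev = (struct inotify_event *)&buf[i];
+			printf("wd=%d mask=0x%x name=%s\n", ev->wd,
+			       (unsigned int)ev->mask,
+			       ev->len ? ev->name : "");
+			i += sizeof(struct inotify_event) + ev->len;
+		}
+
+		close(fd);
+		return 0;
+	}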
+
+
+(ii) Prototypes
+
+ int inotify_init (void);
+ int inotify_add_watch (int fd, const char *path, __u32 mask);
+	int inotify_rm_watch (int fd, __u32 wd);
+
+
+(iii) Kernel Interface
+
+Inotify's kernel API consists of a set of functions for managing watches
+and an event callback.
+
+To use the kernel API, you must first initialize an inotify instance with a set
+of inotify_operations. You are given an opaque inotify_handle, which you use
+for any further calls to inotify.
+
+	struct inotify_handle *ih = inotify_init(&my_inotify_ops);
+
+You must provide a function for processing events and a function for destroying
+the inotify watch.
+
+ void handle_event(struct inotify_watch *watch, u32 wd, u32 mask,
+ u32 cookie, const char *name, struct inode *inode)
+
+ watch - the pointer to the inotify_watch that triggered this call
+ wd - the watch descriptor
+ mask - describes the event that occurred
+ cookie - an identifier for synchronizing events
+ name - the dentry name for affected files in a directory-based event
+ inode - the affected inode in a directory-based event
+
+ void destroy_watch(struct inotify_watch *watch)
+
+You may add watches by providing a pre-allocated and initialized inotify_watch
+structure and specifying the inode to watch along with an inotify event mask.
+You must pin the inode during the call. You will likely wish to embed the
+inotify_watch structure in a structure of your own which contains other
+information about the watch. Once you add an inotify watch, it is immediately
+subject to removal depending on filesystem events. You must grab a reference if
+you depend on the watch hanging around after the call.
+
+ inotify_init_watch(&my_watch->iwatch);
+ inotify_get_watch(&my_watch->iwatch); // optional
+ s32 wd = inotify_add_watch(ih, &my_watch->iwatch, inode, mask);
+ inotify_put_watch(&my_watch->iwatch); // optional
+
+You may use the watch descriptor (wd) or the address of the inotify_watch for
+other inotify operations. You must not directly read or manipulate data in the
+inotify_watch. Additionally, you must not call inotify_add_watch() more than
+once for a given inotify_watch structure, unless you have first called either
+inotify_rm_watch() or inotify_rm_wd().
+
+To determine if you have already registered a watch for a given inode, you may
+call inotify_find_watch(), which gives you both the wd and the watch pointer for
+the inotify_watch, or an error if the watch does not exist.
+
+ wd = inotify_find_watch(ih, inode, &watchp);
+
+You may use container_of() on the watch pointer to access your own data
+associated with a given watch. When an existing watch is found,
+inotify_find_watch() bumps the refcount before releasing its locks. You must
+put that reference with:
+
+ put_inotify_watch(watchp);
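+
+For example, given a hypothetical wrapper structure, container_of()
+recovers your private data inside the handle_event callback described
+above (a sketch, not verbatim kernel code):
+
+	struct my_watch {
+		struct inotify_watch iwatch;
+		int my_data;			/* your private state */
+	};
+
+	static void handle_event(struct inotify_watch *watch, u32 wd,
+				 u32 mask, u32 cookie, const char *name,
+				 struct inode *inode)
+	{
+		struct my_watch *w = container_of(watch, struct my_watch,
+						  iwatch);
+		/* ... react to the event using w->my_data ... */
+	}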
+
+Call inotify_find_update_watch() to update the event mask for an existing watch.
+inotify_find_update_watch() returns the wd of the updated watch, or an error if
+the watch does not exist.
+
+ wd = inotify_find_update_watch(ih, inode, mask);
+
+An existing watch may be removed by calling either inotify_rm_watch() or
+inotify_rm_wd().
+
+ int ret = inotify_rm_watch(ih, &my_watch->iwatch);
+ int ret = inotify_rm_wd(ih, wd);
+
+A watch may be removed while executing your event handler with the following:
+
+ inotify_remove_watch_locked(ih, iwatch);
+
+Call inotify_destroy() to remove all watches from your inotify instance and
+release it. If there are no outstanding references, inotify_destroy() will call
+your destroy_watch op for each watch.
+
+ inotify_destroy(ih);
+
+When inotify removes a watch, it sends an IN_IGNORED event to your callback.
+You may use this event as an indication to free the watch memory. Note that
+inotify may remove a watch due to filesystem events, as well as by your request.
+If you use IN_ONESHOT, inotify will remove the watch after the first event, at
+which point you may call the final inotify_put_watch.
+
+(iv) Kernel Interface Prototypes
+
+ struct inotify_handle *inotify_init(struct inotify_operations *ops);
+
+ inotify_init_watch(struct inotify_watch *watch);
+
+ s32 inotify_add_watch(struct inotify_handle *ih,
+ struct inotify_watch *watch,
+ struct inode *inode, u32 mask);
+
+ s32 inotify_find_watch(struct inotify_handle *ih, struct inode *inode,
+ struct inotify_watch **watchp);
+
+ s32 inotify_find_update_watch(struct inotify_handle *ih,
+ struct inode *inode, u32 mask);
+
+ int inotify_rm_wd(struct inotify_handle *ih, u32 wd);
+
+ int inotify_rm_watch(struct inotify_handle *ih,
+ struct inotify_watch *watch);
+
+ void inotify_remove_watch_locked(struct inotify_handle *ih,
+ struct inotify_watch *watch);
+
+ void inotify_destroy(struct inotify_handle *ih);
+
+ void get_inotify_watch(struct inotify_watch *watch);
+ void put_inotify_watch(struct inotify_watch *watch);
+
+
+(v) Internal Kernel Implementation
+
+Each inotify instance is represented by an inotify_handle structure.
+Inotify's userspace consumers also have an inotify_device which is
+associated with the inotify_handle, and on which events are queued.
+
+Each watch is associated with an inotify_watch structure. Watches are chained
+off of each associated inotify_handle and each associated inode.
+
+See fs/inotify.c and fs/inotify_user.c for the locking and lifetime rules.
+
+
+(vi) Rationale
+
+Q: What is the design decision behind not tying the watch to the open fd of
+ the watched object?
+
+A: Watches are associated with an open inotify device, not an open file.
+ This solves the primary problem with dnotify: keeping the file open pins
+ the file and thus, worse, pins the mount. Dnotify is therefore infeasible
+ for use on a desktop system with removable media as the media cannot be
+ unmounted. Watching a file should not require that it be open.
+
+Q: What is the design decision behind using an fd-per-instance as opposed to
+ an fd-per-watch?
+
+A: An fd-per-watch quickly consumes more file descriptors than are allowed,
+ more fd's than are feasible to manage, and more fd's than are optimally
+ select()-able. Yes, root can bump the per-process fd limit and yes, users
+ can use epoll, but requiring both is a silly and extraneous requirement.
+ A watch consumes less memory than an open file, separating the number
+ spaces is thus sensible. The current design is what user-space developers
+ want: Users initialize inotify, once, and add n watches, requiring but one
+ fd and no twiddling with fd limits. Initializing an inotify instance two
+ thousand times is silly. If we can implement user-space's preferences
+ cleanly--and we can, the idr layer makes stuff like this trivial--then we
+ should.
+
+ There are other good arguments. With a single fd, there is a single
+ item to block on, which is mapped to a single queue of events. The single
+ fd returns all watch events and also any potential out-of-band data. If
+ every fd was a separate watch,
+
+ - There would be no way to get event ordering. Events on file foo and
+ file bar would pop poll() on both fd's, but there would be no way to tell
+ which happened first. A single queue trivially gives you ordering. Such
+ ordering is crucial to existing applications such as Beagle. Imagine
+ "mv a b ; mv b a" events without ordering.
+
+ - We'd have to maintain n fd's and n internal queues with state,
+ versus just one. It is a lot messier in the kernel. A single, linear
+ queue is the data structure that makes sense.
+
+ - User-space developers prefer the current API. The Beagle guys, for
+ example, love it. Trust me, I asked. It is not a surprise: Who'd want
+ to manage and block on 1000 fd's via select?
+
+ - No way to get out of band data.
+
+ - 1024 is still too low. ;-)
+
+ When you talk about designing a file change notification system that
+ scales to 1000s of directories, juggling 1000s of fd's just does not seem
+ the right interface. It is too heavy.
+
+   Additionally, it _is_ possible to create more than one instance and
+ juggle more than one queue and thus more than one associated fd. There
+ need not be a one-fd-per-process mapping; it is one-fd-per-queue and a
+ process can easily want more than one queue.
+
+Q: Why the system call approach?
+
+A: The poor user-space interface is the second biggest problem with dnotify.
+ Signals are a terrible, terrible interface for file notification. Or for
+ anything, for that matter. The ideal solution, from all perspectives, is a
+ file descriptor-based one that allows basic file I/O and poll/select.
+ Obtaining the fd and managing the watches could have been done either via a
+ device file or a family of new system calls. We decided to implement a
+ family of system calls because that is the preferred approach for new kernel
+ interfaces. The only real difference was whether we wanted to use open(2)
+ and ioctl(2) or a couple of new system calls. System calls beat ioctls.
+
diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt
new file mode 100644
index 00000000..ba0a9338
--- /dev/null
+++ b/Documentation/filesystems/isofs.txt
@@ -0,0 +1,48 @@
+Mount options that are the same as for msdos and vfat partitions.
+
+ gid=nnn All files in the partition will be in group nnn.
+ uid=nnn All files in the partition will be owned by user id nnn.
+ umask=nnn The permission mask (see umask(1)) for the partition.
+
+Mount options that are the same as for vfat partitions. These are only useful
+when using discs encoded using Microsoft's Joliet extensions.
+ iocharset=name Character set to use for converting from Unicode to
+ ASCII. Joliet filenames are stored in Unicode format, but
+ Unix for the most part doesn't know how to deal with Unicode.
+ There is also an option of doing UTF-8 translations with the
+ utf8 option.
+ utf8 Encode Unicode names in UTF-8 format. Default is no.
+
+Mount options unique to the isofs filesystem.
+ block=512 Set the block size for the disk to 512 bytes
+ block=1024 Set the block size for the disk to 1024 bytes
+ block=2048 Set the block size for the disk to 2048 bytes
+ check=relaxed Matches filenames with different cases
+ check=strict Matches only filenames with the exact same case
+ cruft Try to handle badly formatted CDs.
+ map=off Do not map non-Rock Ridge filenames to lower case
+ map=normal Map non-Rock Ridge filenames to lower case
+ map=acorn As map=normal but also apply Acorn extensions if present
+ mode=xxx Sets the permissions on files to xxx unless Rock Ridge
+ extensions set the permissions otherwise
+ dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge
+ extensions set the permissions otherwise
+ overriderockperm Set permissions on files and directories according to
+ 'mode' and 'dmode' even though Rock Ridge extensions are
+ present.
+ nojoliet Ignore Joliet extensions if they are present.
+ norock Ignore Rock Ridge extensions if they are present.
+ hide Completely strip hidden files from the file system.
+ showassoc Show files marked with the 'associated' bit
+ unhide Deprecated; showing hidden files is now default;
+ If given, it is a synonym for 'showassoc' which will
+ recreate previous unhide behavior
+ session=x Select number of session on multisession CD
+ sbsector=xxx Session begins from sector xxx
+
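+As an illustration, the options above combine into an ordinary
+comma-separated option string. A hedged sketch using the mount(2)
+system call; the device and mount point are hypothetical:
+
+	#include <stdio.h>
+	#include <sys/mount.h>
+
+	int main(void)
+	{
+		/* Read-only CD mount with UTF-8 names, relaxed name
+		 * matching and explicit file/directory permissions. */
+		if (mount("/dev/cdrom", "/mnt/cdrom", "iso9660", MS_RDONLY,
+			  "utf8,check=relaxed,mode=0444,dmode=0555") < 0)
+			perror("mount");
+		return 0;
+	}
+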
+Recommended documents about ISO 9660 standard are located at:
+http://www.y-adagio.com/
+ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
+Quoting from the PDF: "This 2nd Edition of Standard ECMA-119 is technically
+identical with ISO 9660." It is therefore a valid and free substitute for
+the official ISO specification.
diff --git a/Documentation/filesystems/jfs.txt b/Documentation/filesystems/jfs.txt
new file mode 100644
index 00000000..26ebde77
--- /dev/null
+++ b/Documentation/filesystems/jfs.txt
@@ -0,0 +1,41 @@
+IBM's Journaled File System (JFS) for Linux
+
+JFS Homepage: http://jfs.sourceforge.net/
+
+The following mount options are supported:
+
+iocharset=name Character set to use for converting from Unicode to
+ ASCII. The default is to do no conversion. Use
+ iocharset=utf8 for UTF-8 translations. This requires
+ CONFIG_NLS_UTF8 to be set in the kernel .config file.
+ iocharset=none specifies the default behavior explicitly.
+
+resize=value Resize the volume to <value> blocks. JFS only supports
+ growing a volume, not shrinking it. This option is only
+ valid during a remount, when the volume is mounted
+ read-write. The resize keyword with no value will grow
+ the volume to the full size of the partition.
+
+nointegrity Do not write to the journal. The primary use of this option
+ is to allow for higher performance when restoring a volume
+ from backup media. The integrity of the volume is not
+		guaranteed if the system terminates abnormally.
+
+integrity Default. Commit metadata changes to the journal. Use this
+ option to remount a volume where the nointegrity option was
+ previously specified in order to restore normal behavior.
+
+errors=continue Keep going on a filesystem error.
+errors=remount-ro Default. Remount the filesystem read-only on an error.
+errors=panic Panic and halt the machine if an error occurs.
+
+uid=value Override on-disk uid with specified value
+gid=value Override on-disk gid with specified value
+umask=value Override on-disk umask with specified octal value. For
+ directories, the execute bit will be set if the corresponding
+ read bit is set.
+
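+For instance, a remount that grows the volume to the partition size and
+restores journaling might look like this sketch (device and mount point
+are hypothetical):
+
+	#include <stdio.h>
+	#include <sys/mount.h>
+
+	int main(void)
+	{
+		/* 'resize' with no value grows to the full partition;
+		 * 'integrity' restores normal journaling. */
+		if (mount("/dev/sdb1", "/mnt/jfs", "jfs", MS_REMOUNT,
+			  "resize,integrity") < 0)
+			perror("remount");
+		return 0;
+	}
+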
+Please send bugs, comments, cards and letters to shaggy@linux.vnet.ibm.com.
+
+The JFS mailing list can be subscribed to by using the link labeled
+"Mail list Subscribe" at our web page http://jfs.sourceforge.net/
diff --git a/Documentation/filesystems/locks.txt b/Documentation/filesystems/locks.txt
new file mode 100644
index 00000000..fab857ac
--- /dev/null
+++ b/Documentation/filesystems/locks.txt
@@ -0,0 +1,67 @@
+ File Locking Release Notes
+
+ Andy Walker <andy@lysaker.kvaerner.no>
+
+ 12 May 1997
+
+
+1. What's New?
+--------------
+
+1.1 Broken Flock Emulation
+--------------------------
+
+The old flock(2) emulation in the kernel was swapped for proper BSD
+compatible flock(2) support in the 1.3.x series of kernels. With the
+release of the 2.1.x kernel series, support for the old emulation has
+been totally removed, so that we don't need to carry this baggage
+forever.
+
+This should not cause problems for anybody, since everybody using a
+2.1.x kernel should have updated their C library to a suitable version
+anyway (see the file "Documentation/Changes".)
+
+1.2 Allow Mixed Locks Again
+---------------------------
+
+1.2.1 Typical Problems - Sendmail
+---------------------------------
+Because sendmail was unable to use the old flock() emulation, many sendmail
+installations use fcntl() instead of flock(). This is true of Slackware 3.0
+for example. This gave rise to some other subtle problems if sendmail was
+configured to rebuild the alias file. Sendmail tried to lock the aliases.dir
+file with fcntl() at the same time as the GDBM routines tried to lock this
+file with flock(). With pre 1.3.96 kernels this could result in deadlocks that,
+over time, or under a very heavy mail load, would eventually cause the kernel
+to lock solid with deadlocked processes.
+
+
+1.2.2 The Solution
+------------------
+The solution I have chosen, after much experimentation and discussion,
+is to make flock() and fcntl() locks oblivious to each other. Both can
+exist, and neither will have any effect on the other.
+
+I wanted the two lock styles to be cooperative, but there were so many
+race and deadlock conditions that the current solution was the only
+practical one. It puts us in the same position as, for example, SunOS
+4.1.x and several other commercial Unices. The only OS's that support
+cooperative flock()/fcntl() are those that emulate flock() using
+fcntl(), with all the problems that implies.
+
+
+1.3 Mandatory Locking As A Mount Option
+---------------------------------------
+
+Mandatory locking, as described in
+'Documentation/filesystems/mandatory-locking.txt', was prior to this release
+a general configuration option that was valid for
+all mounted filesystems. This had a number of inherent dangers, not the
+least of which was the ability to freeze an NFS server by asking it to read
+a file for which a mandatory lock existed.
+
+From this release of the kernel, mandatory locking can be turned on and off
+on a per-filesystem basis, using the mount options 'mand' and 'nomand'.
+The default is to disallow mandatory locking. The intention is that
+mandatory locking only be enabled on a local filesystem as the specific need
+arises.
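+
+In C, the 'mand' option corresponds to the MS_MANDLOCK mount flag. A
+hedged sketch; the device, filesystem type and mount point are
+hypothetical:
+
+	#include <stdio.h>
+	#include <sys/mount.h>
+
+	int main(void)
+	{
+		/* Enable mandatory locking on one local filesystem only. */
+		if (mount("/dev/sda2", "/mnt/data", "ext3", MS_MANDLOCK,
+			  NULL) < 0)
+			perror("mount");
+		return 0;
+	}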
+
diff --git a/Documentation/filesystems/logfs.txt b/Documentation/filesystems/logfs.txt
new file mode 100644
index 00000000..bca42c22
--- /dev/null
+++ b/Documentation/filesystems/logfs.txt
@@ -0,0 +1,241 @@
+
+The LogFS Flash Filesystem
+==========================
+
+Specification
+=============
+
+Superblocks
+-----------
+
+Two superblocks exist at the beginning and end of the filesystem.
+Each superblock is 256 Bytes large, with another 3840 Bytes reserved
+for future purposes, making a total of 4096 Bytes.
+
+Superblock locations may differ for MTD and block devices. On MTD the
+first non-bad block contains a superblock in the first 4096 Bytes and
+the last non-bad block contains a superblock in the last 4096 Bytes.
+On block devices, the first 4096 Bytes of the device contain the first
+superblock and the last aligned 4096 Byte-block contains the second
+superblock.
+
+For the most part, the superblocks can be considered read-only. They
+are written only to correct errors detected within the superblocks,
+move the journal and change the filesystem parameters through tunefs.
+As a result, the superblock does not contain any fields that require
+constant updates, like the amount of free space, etc.
+
+Segments
+--------
+
+The space in the device is split up into equal-sized segments.
+Segments are the primary write unit of LogFS. Within each segment,
+writes happen from front (low addresses) to back (high addresses). If
+only a partial segment has been written, the segment number, the
+current position within it and optionally a write buffer are stored in
+the journal.
+
+Segments are erased as a whole. Therefore Garbage Collection may be
+required to completely free a segment before doing so.
+
+Journal
+--------
+
+The journal contains all global information about the filesystem that
+is subject to frequent change. At mount time, it has to be scanned
+for the most recent commit entry, which contains a list of pointers to
+all currently valid entries.
+
+Object Store
+------------
+
+All space except for the superblocks and journal is part of the object
+store. Each segment contains a segment header and a number of
+objects, each consisting of the object header and the payload.
+Objects are either inodes, directory entries (dentries), file data
+blocks or indirect blocks.
+
+Levels
+------
+
+Garbage collection (GC) may fail if all data is written
+indiscriminately. One requirement of GC is that data is separated
+roughly according to the distance between the tree root and the data.
+Effectively that means all file data is on level 0, indirect blocks
+are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
+respectively. Inode file data is on level 6 for the inodes and 7-11
+for indirect blocks.
+
+Each segment contains objects of a single level only. As a result,
+each level requires its own separate segment to be open for writing.
+
+Inode File
+----------
+
+All inodes are stored in a special file, the inode file. The single
+exception is the inode file's inode (the master inode), which for
+obvious reasons is stored in the journal instead. Instead of data
+blocks, the leaf nodes of the inode file are inodes.
+
+Aliases
+-------
+
+Writes in LogFS are done by means of a wandering tree. A naïve
+implementation would require that for each write of a block, all
+parent blocks are written as well, since the block pointers have
+changed. Such an implementation would not be very efficient.
+
+In LogFS, the block pointer changes are cached in the journal by means
+of alias entries. Each alias consists of its logical address - inode
+number, block index, level and child number (index into block) - and
+the changed data. Any 8-byte word can be changed in this manner.
+
+Currently aliases are used for block pointers, file size, file used
+bytes and the height of an inode's indirect tree.
+
+Segment Aliases
+---------------
+
+Related to regular aliases, these are used to handle bad blocks.
+Initially, bad blocks are handled by moving the affected segment
+content to a spare segment and noting this move in the journal with a
+segment alias, a simple (to, from) tuple. GC will later empty this
+segment and the alias can be removed again. This is used on MTD only.
+
+Vim
+---
+
+By cleverly predicting the lifetime of data, it is possible to
+separate long-living data from short-living data and thereby reduce
+the GC overhead later. Each type of distinct life expectancy (vim) can
+have a separate segment open for writing. Each (level, vim) tuple can
+be open just once. If an open segment with unknown vim is encountered
+at mount time, it is closed and ignored henceforth.
+
+Indirect Tree
+-------------
+
+Inodes in LogFS are similar to FFS-style filesystems with direct and
+indirect block pointers. One difference is that LogFS uses a single
+indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
+A height field in the inode defines the height of the indirect tree
+and thereby the indirection of the pointer.
+
+Another difference is the addressing of indirect blocks. In LogFS,
+the first 16 pointers in the first indirect block are left empty,
+corresponding to the 16 direct pointers in the inode. In ext2 (maybe
+others as well) the first pointer in the first indirect block
+corresponds to logical block 12, skipping the 12 direct pointers.
+So where ext2 is using arithmetic to better utilize space, LogFS keeps
+arithmetic simple and uses compression to save space.
+
+Compression
+-----------
+
+Both file data and metadata can be compressed. Compression for file
+data can be enabled with chattr +c and disabled with chattr -c. Doing
+so has no effect on existing data, but new data will be stored
+accordingly. New inodes will inherit the compression flag of the
+parent directory.
+
+Metadata is always compressed. However, the space accounting ignores
+this and charges for the uncompressed size. Failing to do so could
+result in GC failures when, after moving some data, indirect blocks
+compress worse than previously. Even on a 100% full medium, GC may
+not consume any extra space, so the compression gains are lost space
+to the user.
+
+However, they are not lost space to the filesystem internals. By
+cheating the user out of those bytes, the filesystem gains some slack
+space, and GC will run less often and faster.
+
+Garbage Collection and Wear Leveling
+------------------------------------
+
+Garbage collection is invoked whenever the number of free segments
+falls below a threshold. The best (known) candidate is picked based
+on the least amount of valid data contained in the segment. All
+remaining valid data is copied elsewhere, thereby invalidating it.
+
+The GC code also checks for aliases and writes them back if their
+number gets too large.
+
+Wear leveling is done by occasionally picking a suboptimal segment for
+garbage collection. If a stale segment's erase count is significantly
+lower than the active segments' erase counts, it will be picked. Wear
+leveling is rate limited, so it will never monopolize the device for
+more than one segment worth at a time.
+
+Values for "occasionally" and "significantly lower" are compile-time
+constants.
+
+Hashed directories
+------------------
+
+To satisfy efficient lookup(), directory entries are hashed and
+located based on the hash. In order to both support large directories
+and not be overly inefficient for small directories, several hash
+tables of increasing size are used. For each table, the hash value
+modulo the table size gives the table index.
+
+Table sizes are chosen to limit the number of indirect blocks with a
+fully populated table to 0, 1, 2 or 3 respectively. So the first
+table contains 16 entries, the second 512-16, etc.
+
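+A sketch of the index computation described above, assuming the tables
+are laid out back-to-back; the names, and any sizes beyond the first
+two, are illustrative only:
+
+	#include <stdint.h>
+
+	/* First two sizes follow the text; later ones are elided. */
+	static const uint64_t table_size[] = { 16, 512 - 16 /* , ... */ };
+
+	static uint64_t dir_slot(uint32_t hash, int table)
+	{
+		uint64_t base = 0;
+		int i;
+
+		for (i = 0; i < table; i++)
+			base += table_size[i];	/* skip earlier tables */
+		return base + hash % table_size[table];
+	}
+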
+The last table is special in several ways. First its size depends on
+the effective 32bit limit on telldir/seekdir cookies. Since logfs
+uses the upper half of the address space for indirect blocks, the size
+is limited to 2^31. Secondly the table contains hash buckets with 16
+entries each.
+
+Using single-entry buckets would result in birthday "attacks". At
+just 2^16 used entries, hash collisions would be likely (P >= 0.5).
+My math skills are insufficient to do the combinatorics for the 17x
+collisions necessary to overflow a bucket, but testing showed that in
+10,000 runs the lowest directory fill before a bucket overflow was
+188,057,130 entries with an average of 315,149,915 entries. So for
+directory sizes of up to a million, bucket overflows should be
+virtually impossible under normal circumstances.
+
+With carefully chosen filenames, it is obviously possible to cause an
+overflow with just 21 entries (4 higher tables + 16 entries + 1). So
+there may be a security concern if a malicious user has write access
+to a directory.
+
+Open For Discussion
+===================
+
+Device Address Space
+--------------------
+
+A device address space is used for caching. Both block devices and
+MTD provide functions to either read a single page or write a segment.
+Partial segments may be written for data integrity, but where possible
+complete segments are written for performance on simple block device
+flash media.
+
+Meta Inodes
+-----------
+
+Inodes are stored in the inode file, which is just a regular file for
+most purposes. At umount time, however, the inode file needs to
+remain open until all dirty inodes are written. So
+generic_shutdown_super() may not close this inode, but shouldn't
+complain about remaining inodes due to the inode file either. The
+same goes for the mapping inode of the device address space.
+
+Currently logfs uses a hack that essentially copies part of fs/inode.c
+code over. A general solution would be preferred.
+
+Indirect block mapping
+----------------------
+
+With compression, the block device (or mapping inode) cannot be used
+to cache indirect blocks. Some other place is required. Currently
+logfs uses the top half of each inode's address space. The low 8TB
+(on 32bit) are filled with file data, the high 8TB are used for
+indirect blocks.
+
+One problem is that 16TB files created on 64bit systems actually have
+data in the top 8TB. But files >16TB would cause problems anyway, so
+only the limit has changed.
diff --git a/Documentation/filesystems/mandatory-locking.txt b/Documentation/filesystems/mandatory-locking.txt
new file mode 100644
index 00000000..0979d1d2
--- /dev/null
+++ b/Documentation/filesystems/mandatory-locking.txt
@@ -0,0 +1,171 @@
+ Mandatory File Locking For The Linux Operating System
+
+ Andy Walker <andy@lysaker.kvaerner.no>
+
+ 15 April 1996
+ (Updated September 2007)
+
+0. Why you should avoid mandatory locking
+-----------------------------------------
+
+The Linux implementation is prey to a number of difficult-to-fix race
+conditions which in practice make it not dependable:
+
+ - The write system call checks for a mandatory lock only once
+ at its start. It is therefore possible for a lock request to
+ be granted after this check but before the data is modified.
+ A process may then see file data change even while a mandatory
+ lock was held.
+ - Similarly, an exclusive lock may be granted on a file after
+ the kernel has decided to proceed with a read, but before the
+ read has actually completed, and the reading process may see
+ the file data in a state which should not have been visible
+ to it.
+ - Similar races make the claimed mutual exclusion between lock
+ and mmap similarly unreliable.
+
+1. What is mandatory locking?
+------------------------------
+
+Mandatory locking is kernel enforced file locking, as opposed to the more usual
+cooperative file locking used to guarantee sequential access to files among
+processes. File locks are applied using the flock() and fcntl() system calls
+(and the lockf() library routine which is a wrapper around fcntl().) It is
+normally a process' responsibility to check for locks on a file it wishes to
+update, before applying its own lock, updating the file and unlocking it again.
+The most commonly used example of this (and in the case of sendmail, the most
+troublesome) is access to a user's mailbox. The mail user agent and the mail
+transfer agent must guard against updating the mailbox at the same time, and
+prevent reading the mailbox while it is being updated.
+
+In a perfect world all processes would use and honour a cooperative, or
+"advisory" locking scheme. However, the world isn't perfect, and there's
+a lot of poorly written code out there.
+
+In trying to address this problem, the designers of System V UNIX came up
+with a "mandatory" locking scheme, whereby the operating system kernel would
+block attempts by a process to write to a file that another process holds a
+"read" -or- "shared" lock on, and block attempts to both read and write to a
+file that a process holds a "write" -or- "exclusive" lock on.
+
+The System V mandatory locking scheme was intended to have as little impact as
+possible on existing user code. The scheme is based on marking individual files
+as candidates for mandatory locking, and using the existing fcntl()/lockf()
+interface for applying locks just as if they were normal, advisory locks.
+
+Note 1: In saying "file" in the paragraphs above I am actually not telling
+the whole truth. System V locking is based on fcntl(). The granularity of
+fcntl() is such that it allows the locking of byte ranges in files, in addition
+to entire files, so the mandatory locking rules also have byte level
+granularity.
+
+Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite
+borrowing the fcntl() locking scheme from System V. The mandatory locking
+scheme is defined by the System V Interface Definition (SVID) Version 3.
+
+2. Marking a file for mandatory locking
+---------------------------------------
+
+A file is marked as a candidate for mandatory locking by setting the group-id
+bit in its file mode but removing the group-execute bit. This is an otherwise
+meaningless combination, and was chosen by the System V implementors so as not
+to break existing user programs.
+
+Note that the group-id bit is usually automatically cleared by the kernel when
+a setgid file is written to. This is a security measure. The kernel has been
+modified to recognize the special case of a mandatory lock candidate and to
+refrain from clearing this bit. Similarly the kernel has been modified not
+to run mandatory lock candidates with setgid privileges.
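+
+For example, the marking described above can be done with chmod(2); a
+sketch, with a hypothetical path (the shell equivalent is
+"chmod g+s,g-x file"):
+
+	#include <sys/stat.h>
+
+	int main(void)
+	{
+		struct stat st;
+		mode_t mode;
+
+		/* setgid on, group-execute off marks the file as a
+		 * mandatory-lock candidate. */
+		if (stat("/var/lock/mandatory", &st) != 0)
+			return 1;
+		mode = ((st.st_mode & 07777) | S_ISGID) & ~S_IXGRP;
+		return chmod("/var/lock/mandatory", mode) != 0;
+	}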
+
+3. Available implementations
+----------------------------
+
+I have considered the implementations of mandatory locking available with
+SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.
+
+Generally I have tried to make the most sense out of the behaviour exhibited
+by these three reference systems. There are many anomalies.
+
+All the reference systems reject all calls to open() for a file on which
+another process has outstanding mandatory locks. This is in direct
+contravention of SVID 3, which states that only calls to open() with the
+O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
+definition, which is the "Right Thing", since only calls with O_TRUNC can
+modify the contents of the file.
+
+HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
+just mandatory locks. That would appear to contravene POSIX.1.
+
+mmap() is another interesting case. All the operating systems mentioned
+prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
+also disallows advisory locks for such a file. SVID actually specifies the
+paranoid HP-UX behaviour.
+
+In my opinion only MAP_SHARED mappings should be immune from locking, and then
+only from mandatory locks - that is what is currently implemented.
+
+SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
+mandatory locks, so reads and writes to locked files always block when they
+should return EAGAIN.
+
+I'm afraid that this is such an esoteric area that the semantics described
+below are just as valid as any others, so long as the main points seem to
+agree.
+
+4. Semantics
+------------
+
+1. Mandatory locks can only be applied via the fcntl()/lockf() locking
+ interface - in other words the System V/POSIX interface. BSD style
+   locks using flock() never result in a mandatory lock (a sketch of the
+   fcntl() interface follows this list).
+
+2. If a process has locked a region of a file with a mandatory read lock, then
+ other processes are permitted to read from that region. If any of these
+ processes attempts to write to the region it will block until the lock is
+ released, unless the process has opened the file with the O_NONBLOCK
+ flag in which case the system call will return immediately with the error
+ status EAGAIN.
+
+3. If a process has locked a region of a file with a mandatory write lock, all
+ attempts to read or write to that region block until the lock is released,
+ unless a process has opened the file with the O_NONBLOCK flag in which case
+ the system call will return immediately with the error status EAGAIN.
+
+4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has
+ any mandatory locks owned by other processes will be rejected with the
+ error status EAGAIN.
+
+5. Attempts to apply a mandatory lock to a file that is memory mapped and
+ shared (via mmap() with MAP_SHARED) will be rejected with the error status
+ EAGAIN.
+
+6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
+ that has any mandatory locks in effect will be rejected with the error status
+ EAGAIN.
+
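+As promised in rule 1, here is a sketch of applying a lock with
+fcntl(); whether it is mandatory depends only on the file's mode bits
+(the path is hypothetical):
+
+	#include <fcntl.h>
+	#include <unistd.h>
+
+	int main(void)
+	{
+		struct flock fl = {
+			.l_type   = F_WRLCK,	/* exclusive (write) lock */
+			.l_whence = SEEK_SET,
+			.l_start  = 0,
+			.l_len    = 0,		/* 0 = lock the whole file */
+		};
+		int fd = open("/var/lock/mandatory", O_RDWR);
+
+		if (fd < 0 || fcntl(fd, F_SETLKW, &fl) < 0)
+			return 1;
+		/* ... the locked region is now protected ... */
+		close(fd);
+		return 0;
+	}
+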
+5. Which system calls are affected?
+-----------------------------------
+
+Those which modify a file's contents, not just the inode. That gives read(),
+write(), readv(), writev(), open(), creat(), mmap(), truncate() and
+ftruncate(). truncate() and ftruncate() are considered to be "write" actions
+for the purposes of mandatory locking.
+
+The affected region is usually defined as stretching from the current position
+for the total number of bytes read or written. For the truncate calls it is
+defined as the bytes of a file removed or added (we must also consider bytes
+added, as a lock can specify just "the whole file", rather than a specific
+range of bytes.)
+
+Note 3: I may have overlooked some system calls that need mandatory lock
+checking in my eagerness to get this code out the door. Please let me know, or
+better still fix the system calls yourself and submit a patch to me or Linus.
+
+6. Warning!
+-----------
+
+Not even root can override a mandatory lock, so runaway processes can wreak
+havoc if they lock crucial files. The way around it is to change the file
+permissions (remove the setgid bit) before trying to read or write to it.
+Of course, that might be a bit tricky if the system is hung :-(
+
diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt
new file mode 100644
index 00000000..5af164f4
--- /dev/null
+++ b/Documentation/filesystems/ncpfs.txt
@@ -0,0 +1,12 @@
+The ncpfs filesystem understands the NCP protocol, designed by the
+Novell Corporation for their NetWare(tm) product. NCP is functionally
+similar to the NFS used in the TCP/IP community.
+To mount a NetWare filesystem, you need a special mount program, which
+can be found in the ncpfs package. The home site for ncpfs is
+ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors
+will have it as well.
+
+Related products are linware and mars_nwe, which will give Linux partial
+NetWare server functionality.
+
+mars_nwe can be found on ftp.gwdg.de/pub/linux/misc/ncpfs.
diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX
new file mode 100644
index 00000000..a57e1241
--- /dev/null
+++ b/Documentation/filesystems/nfs/00-INDEX
@@ -0,0 +1,20 @@
+00-INDEX
+ - this file (nfs-related documentation).
+Exporting
+ - explanation of how to make filesystems exportable.
+knfsd-stats.txt
+ - statistics which the NFS server makes available to user space.
+nfs.txt
+ - nfs client, and DNS resolution for fs_locations.
+nfs41-server.txt
+ - info on the Linux server implementation of NFSv4 minor version 1.
+nfs-rdma.txt
+ - how to install and setup the Linux NFS/RDMA client and server software
+nfsroot.txt
+ - short guide on setting up a diskless box with NFS root filesystem.
+pnfs.txt
+ - short explanation of some of the internals of the pnfs client code
+rpc-cache.txt
+ - introduction to the caching mechanisms in the sunrpc layer.
+idmapper.txt
+ - information for configuring request-keys to be used by idmapper
diff --git a/Documentation/filesystems/nfs/Exporting b/Documentation/filesystems/nfs/Exporting
new file mode 100644
index 00000000..87019d2b
--- /dev/null
+++ b/Documentation/filesystems/nfs/Exporting
@@ -0,0 +1,147 @@
+
+Making Filesystems Exportable
+=============================
+
+Overview
+--------
+
+All filesystem operations require a dentry (or two) as a starting
+point. Local applications have a reference-counted hold on suitable
+dentries via open file descriptors or cwd/root. However remote
+applications that access a filesystem via a remote filesystem protocol
+such as NFS may not be able to hold such a reference, and so need a
+different way to refer to a particular dentry. As the alternative
+form of reference needs to be stable across renames, truncates, and
+server reboots (among other things, though these tend to be the most
+problematic), there is no simple answer like 'filename'.
+
+The mechanism discussed here allows each filesystem implementation to
+specify how to generate an opaque (outside of the filesystem) byte
+string for any dentry, and how to find an appropriate dentry for any
+given opaque byte string.
+This byte string will be called a "filehandle fragment" as it
+corresponds to part of an NFS filehandle.
+
+A filesystem which supports the mapping between filehandle fragments
+and dentries will be termed "exportable".
+
+
+
+Dcache Issues
+-------------
+
+The dcache normally contains a proper prefix of any given filesystem
+tree. This means that if any filesystem object is in the dcache, then
+all of the ancestors of that filesystem object are also in the dcache.
+As normal access is by filename this prefix is created naturally and
+maintained easily (by each object maintaining a reference count on
+its parent).
+
+However when objects are included into the dcache by interpreting a
+filehandle fragment, there is no automatic creation of a path prefix
+for the object. This leads to two related but distinct features of
+the dcache that are not needed for normal filesystem access.
+
+1/ The dcache must sometimes contain objects that are not part of the
+   proper prefix, i.e. that are not connected to the root.
+2/ The dcache must be prepared for a newly found (via ->lookup) directory
+ to already have a (non-connected) dentry, and must be able to move
+ that dentry into place (based on the parent and name in the
+ ->lookup). This is particularly needed for directories as
+ it is a dcache invariant that directories only have one dentry.
+
+To implement these features, the dcache has:
+
+a/ A dentry flag DCACHE_DISCONNECTED which is set on
+ any dentry that might not be part of the proper prefix.
+ This is set when anonymous dentries are created, and cleared when a
+ dentry is noticed to be a child of a dentry which is in the proper
+ prefix.
+
+b/ A per-superblock list "s_anon" of dentries which are the roots of
+ subtrees that are not in the proper prefix. These dentries, as
+ well as the proper prefix, need to be released at unmount time. As
+ these dentries will not be hashed, they are linked together on the
+ d_hash list_head.
+
+c/ Helper routines to allocate anonymous dentries, and to help attach
+ loose directory dentries at lookup time. They are:
+ d_alloc_anon(inode) will return a dentry for the given inode.
+ If the inode already has a dentry, one of those is returned.
+ If it doesn't, a new anonymous (IS_ROOT and
+ DCACHE_DISCONNECTED) dentry is allocated and attached.
+ In the case of a directory, care is taken that only one dentry
+ can ever be attached.
+ d_splice_alias(inode, dentry) will make sure that there is a
+ dentry with the same name and parent as the given dentry, and
+ which refers to the given inode.
+ If the inode is a directory and already has a dentry, then that
+ dentry is d_moved over the given dentry.
+ If the passed dentry gets attached, care is taken that this is
+ mutually exclusive to a d_alloc_anon operation.
+ If the passed dentry is used, NULL is returned, else the used
+ dentry is returned. This corresponds to the calling pattern of
+ ->lookup.
+
+
+Filesystem Issues
+-----------------
+
+For a filesystem to be exportable it must:
+
+ 1/ provide the filehandle fragment routines described below.
+ 2/ make sure that d_splice_alias is used rather than d_add
+ when ->lookup finds an inode for a given parent and name.
+ Typically the ->lookup routine will end with a:
+
+ return d_splice_alias(inode, dentry);
+ }
+
+
+
+ A file system implementation declares that instances of the filesystem
+are exportable by setting the s_export_op field in the struct
+super_block. This field must point to a "struct export_operations"
+struct which has the following members:
+
+ encode_fh (optional)
+ Takes a dentry and creates a filehandle fragment which can later be used
+ to find or create a dentry for the same object. The default
+ implementation creates a filehandle fragment that encodes a 32bit inode
+ and generation number for the inode encoded, and if necessary the
+ same information for the parent.
+
+ fh_to_dentry (mandatory)
+ Given a filehandle fragment, this should find the implied object and
+ create a dentry for it (possibly with d_alloc_anon).
+
+ fh_to_parent (optional but strongly recommended)
+ Given a filehandle fragment, this should find the parent of the
+ implied object and create a dentry for it (possibly with d_alloc_anon).
+ May fail if the filehandle fragment is too small.
+
+ get_parent (optional but strongly recommended)
+ When given a dentry for a directory, this should return a dentry for
+ the parent. Quite possibly the parent dentry will have been allocated
+ by d_alloc_anon. The default get_parent function just returns an error
+ so any filehandle lookup that requires finding a parent will fail.
+ ->lookup("..") is *not* used as a default as it can leave ".." entries
+ in the dcache which are too messy to work with.
+
+ get_name (optional)
+ When given a parent dentry and a child dentry, this should find a name
+ in the directory identified by the parent dentry, which leads to the
+ object identified by the child dentry. If no get_name function is
+ supplied, a default implementation is provided which uses vfs_readdir
+ to find potential names, and matches inode numbers to find the correct
+ match.
+
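+Wiring this up might look like the following hedged sketch; "myfs" and
+its handler functions are hypothetical:
+
+	static struct export_operations myfs_export_ops = {
+		.encode_fh    = myfs_encode_fh,
+		.fh_to_dentry = myfs_fh_to_dentry,
+		.fh_to_parent = myfs_fh_to_parent,
+		.get_parent   = myfs_get_parent,
+		/* get_name omitted: the vfs_readdir based default is used */
+	};
+
+	/* in myfs_fill_super(): */
+	sb->s_export_op = &myfs_export_ops;
+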
+
+A filehandle fragment consists of an array of one or more 4-byte words,
+together with a one-byte "type".
+The decode_fh routine should not depend on the stated size that is
+passed to it. This size may be larger than the original filehandle
+generated by encode_fh, in which case it will have been padded with
+nuls. Rather, the encode_fh routine should choose a "type" which
+tells decode_fh how much of the filehandle is valid, and how it
+should be interpreted.
diff --git a/Documentation/filesystems/nfs/idmapper.txt b/Documentation/filesystems/nfs/idmapper.txt
new file mode 100644
index 00000000..9c8fd614
--- /dev/null
+++ b/Documentation/filesystems/nfs/idmapper.txt
@@ -0,0 +1,67 @@
+
+=========
+ID Mapper
+=========
+Id mapper is used by NFS to translate user and group ids into names, and to
+translate user and group names into ids. Part of this translation involves
+performing an upcall to userspace to request the information. Id mapper will
+use request-key to perform this upcall and cache the result. The program
+/usr/sbin/nfs.idmap should be called by request-key, and will perform the
+translation and initialize a key with the resulting information.
+
+ NFS_USE_NEW_IDMAPPER must be selected when configuring the kernel to use this
+ feature.
+
+===========
+Configuring
+===========
+The file /etc/request-key.conf will need to be modified so /sbin/request-key can
+direct the upcall. The following line should be added:
+
+#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
+#====== ======= =============== =============== ===============================
+create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
+
+This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap.
+The last parameter, 600, defines how many seconds into the future the key will
+expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout
+is not specified, nfs.idmap will default to 600 seconds.
+
+Id mapper uses four key descriptions:
+ uid: Find the UID for the given user
+ gid: Find the GID for the given group
+ user: Find the user name for the given UID
+ group: Find the group name for the given GID
+
+You can handle any of these individually, rather than using the generic upcall
+program. If you would like to use your own program for a uid lookup then you
+would edit your request-key.conf so it looks similar to this:
+
+#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
+#====== ======= =============== =============== ===============================
+create id_resolver uid:* * /some/other/program %k %d 600
+create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
+
+Notice that the new line was added above the line for the generic program.
+request-key will find the first matching line and corresponding program. In
+this case, /some/other/program will handle all uid lookups and
+/usr/sbin/nfs.idmap will handle gid, user, and group lookups.
+
+See <file:Documentation/security/keys-request-key.txt> for more information
+about the request-key function.
+
+
+=========
+nfs.idmap
+=========
+nfs.idmap is designed to be called by request-key, and should not be run "by
+hand". This program takes two arguments, a serialized key and a key
+description. The serialized key is first converted into a key_serial_t, and
+then passed as an argument to keyctl_instantiate (both are part of keyutils.h).
+
+The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap
+determines the correct function to call by looking at the first part of the
+description string. For example, a uid lookup description will appear as
+"uid:user@domain".
+
+nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise.
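+
+A hedged sketch of the instantiation step only; the nfsidmap lookup
+itself is omitted and the "0" payload is a stand-in for a real result.
+Link against keyutils:
+
+	#include <stdlib.h>
+	#include <string.h>
+	#include <keyutils.h>
+
+	int main(int argc, char **argv)
+	{
+		key_serial_t key;
+		const char *answer = "0";	/* hypothetical lookup result */
+
+		if (argc < 3)	/* argv[1] = key, argv[2] = description */
+			return 1;
+		key = strtol(argv[1], NULL, 10);
+		/* hand the answer back to the kernel for caching */
+		return keyctl_instantiate(key, answer, strlen(answer) + 1,
+					  0) < 0;
+	}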
diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.txt
new file mode 100644
index 00000000..64ced514
--- /dev/null
+++ b/Documentation/filesystems/nfs/knfsd-stats.txt
@@ -0,0 +1,159 @@
+
+Kernel NFS Server Statistics
+============================
+
+This document describes the format and semantics of the statistics
+which the kernel NFS server makes available to userspace. These
+statistics are available in several text form pseudo files, each of
+which is described separately below.
+
+In most cases you don't need to know these formats, as the nfsstat(8)
+program from the nfs-utils distribution provides a helpful command-line
+interface for extracting and printing them.
+
+All the files described here are formatted as a sequence of text lines,
+separated by newline '\n' characters. Lines beginning with a hash
+'#' character are comments intended for humans and should be ignored
+by parsing routines. All other lines contain a sequence of fields
+separated by whitespace.
+
+/proc/fs/nfsd/pool_stats
+------------------------
+
+This file is available in kernels from 2.6.30 onwards, if the
+/proc/fs/nfsd filesystem is mounted (it almost always should be).
+
+The first line is a comment which describes the fields present in
+all the other lines. The other lines present the following data as
+a sequence of unsigned decimal numeric fields. One line is shown
+for each NFS thread pool.
+
+All counters are 64 bits wide and wrap naturally. There is no way
+to zero these counters, instead applications should do their own
+rate conversion.
+
+pool
+ The id number of the NFS thread pool to which this line applies.
+ This number does not change.
+
+ Thread pool ids are a contiguous set of small integers starting
+ at zero. The maximum value depends on the thread pool mode, but
+ currently cannot be larger than the number of CPUs in the system.
+ Note that in the default case there will be a single thread pool
+ which contains all the nfsd threads and all the CPUs in the system,
+ and thus this file will have a single line with a pool id of "0".
+
+packets-arrived
+ Counts how many NFS packets have arrived. More precisely, this
+ is the number of times that the network stack has notified the
+ sunrpc server layer that new data may be available on a transport
+    (e.g. a TCP or UDP socket or an NFS/RDMA endpoint).
+
+ Depending on the NFS workload patterns and various network stack
+ effects (such as Large Receive Offload) which can combine packets
+ on the wire, this may be either more or less than the number
+    of NFS calls received (a statistic which is available elsewhere).
+ However this is a more accurate and less workload-dependent measure
+ of how much CPU load is being placed on the sunrpc server layer
+ due to NFS network traffic.
+
+sockets-enqueued
+ Counts how many times an NFS transport is enqueued to wait for
+ an nfsd thread to service it, i.e. no nfsd thread was considered
+ available.
+
+ The circumstance this statistic tracks indicates that there was NFS
+ network-facing work to be done but it couldn't be done immediately,
+ thus introducing a small delay in servicing NFS calls. The ideal
+ rate of change for this counter is zero; significantly non-zero
+ values may indicate a performance limitation.
+
+ This can happen either because there are too few nfsd threads in the
+ thread pool for the NFS workload (the workload is thread-limited),
+ or because the NFS workload needs more CPU time than is available in
+ the thread pool (the workload is CPU-limited). In the former case,
+ configuring more nfsd threads will probably improve the performance
+ of the NFS workload. In the latter case, the sunrpc server layer is
+ already choosing not to wake idle nfsd threads because there are too
+ many nfsd threads which want to run but cannot, so configuring more
+ nfsd threads will make no difference whatsoever. The overloads-avoided
+ statistic (see below) can be used to distinguish these cases.
+
+threads-woken
+ Counts how many times an idle nfsd thread is woken to try to
+ receive some data from an NFS transport.
+
+ This statistic tracks the circumstance where incoming
+ network-facing NFS work is being handled quickly, which is a good
+ thing. The ideal rate of change for this counter will be close
+ to but less than the rate of change of the packets-arrived counter.
+
+overloads-avoided
+ Counts how many times the sunrpc server layer chose not to wake an
+ nfsd thread, despite the presence of idle nfsd threads, because
+ too many nfsd threads had been recently woken but could not get
+ enough CPU time to actually run.
+
+ This statistic counts a circumstance where the sunrpc layer
+ heuristically avoids overloading the CPU scheduler with too many
+ runnable nfsd threads. The ideal rate of change for this counter
+ is zero. Significant non-zero values indicate that the workload
+ is CPU limited. Usually this is associated with heavy CPU usage
+ on all the CPUs in the nfsd thread pool.
+
+ If a sustained large overloads-avoided rate is detected on a pool,
+ the top(1) utility should be used to check for the following
+ pattern of CPU usage on all the CPUs associated with the given
+ nfsd thread pool.
+
+ - %us ~= 0 (as you're *NOT* running applications on your NFS server)
+
+ - %wa ~= 0
+
+ - %id ~= 0
+
+ - %sy + %hi + %si ~= 100
+
+ If this pattern is seen, configuring more nfsd threads will *not*
+    improve the performance of the workload. If this pattern is not
+ seen, then something more subtle is wrong.
+
+threads-timedout
+ Counts how many times an nfsd thread triggered an idle timeout,
+ i.e. was not woken to handle any incoming network packets for
+ some time.
+
+ This statistic counts a circumstance where there are more nfsd
+ threads configured than can be used by the NFS workload. This is
+ a clue that the number of nfsd threads can be reduced without
+ affecting performance. Unfortunately, it's only a clue and not
+ a strong indication, for a couple of reasons:
+
+ - Currently the rate at which the counter is incremented is quite
+ slow; the idle timeout is 60 minutes. Unless the NFS workload
+ remains constant for hours at a time, this counter is unlikely
+ to be providing information that is still useful.
+
+ - It is usually a wise policy to provide some slack,
+ i.e. configure a few more nfsds than are currently needed,
+ to allow for future spikes in load.
+
+
+Note that incoming packets on NFS transports will be dealt with in
+one of three ways. An nfsd thread can be woken (threads-woken counts
+this case), or the transport can be enqueued for later attention
+(sockets-enqueued counts this case), or the packet can be temporarily
+deferred because the transport is currently being used by an nfsd
+thread. This last case is not very interesting and is not explicitly
+counted, but can be inferred from the other counters thus:
+
+packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
+
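+For example, a userspace sketch that derives packets-deferred for each
+pool; the field order is assumed to match the header comment line, and
+this is no replacement for nfsstat(8):
+
+	#include <stdio.h>
+	#include <inttypes.h>
+
+	int main(void)
+	{
+		FILE *f = fopen("/proc/fs/nfsd/pool_stats", "r");
+		char line[256];
+		uint64_t pool, arrived, enqueued, woken, deferred;
+
+		if (!f)
+			return 1;
+		while (fgets(line, sizeof(line), f)) {
+			if (line[0] == '#')	/* skip the header comment */
+				continue;
+			if (sscanf(line, "%" SCNu64 " %" SCNu64 " %" SCNu64
+				   " %" SCNu64, &pool, &arrived,
+				   &enqueued, &woken) != 4)
+				continue;
+			deferred = arrived - (enqueued + woken);
+			printf("pool %" PRIu64 ": %" PRIu64 " deferred\n",
+			       pool, deferred);
+		}
+		fclose(f);
+		return 0;
+	}
+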
+
+More
+----
+Descriptions of the other statistics file should go here.
+
+
+Greg Banks <gnb@sgi.com>
+26 Mar 2009
diff --git a/Documentation/filesystems/nfs/nfs-rdma.txt b/Documentation/filesystems/nfs/nfs-rdma.txt
new file mode 100644
index 00000000..e386f7e4
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfs-rdma.txt
@@ -0,0 +1,271 @@
+################################################################################
+# #
+# NFS/RDMA README #
+# #
+################################################################################
+
+ Author: NetApp and Open Grid Computing
+ Date: May 29, 2008
+
+Table of Contents
+~~~~~~~~~~~~~~~~~
+ - Overview
+ - Getting Help
+ - Installation
+ - Check RDMA and NFS Setup
+ - NFS/RDMA Setup
+
+Overview
+~~~~~~~~
+
+ This document describes how to install and setup the Linux NFS/RDMA client
+ and server software.
+
+ The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
+ was first included in the following release, Linux 2.6.25.
+
+ In our testing, we have obtained excellent performance results (full 10Gbit
+ wire bandwidth at minimal client CPU) under many workloads. The code passes
+ the full Connectathon test suite and operates over both Infiniband and iWARP
+ RDMA adapters.
+
+Getting Help
+~~~~~~~~~~~~
+
+ If you get stuck, you can ask questions on the
+
+ nfs-rdma-devel@lists.sourceforge.net
+
+ mailing list.
+
+Installation
+~~~~~~~~~~~~
+
+ These instructions are a step by step guide to building a machine for
+ use with NFS/RDMA.
+
+ - Install an RDMA device
+
+ Any device supported by the drivers in drivers/infiniband/hw is acceptable.
+
+ Testing has been performed using several Mellanox-based IB cards, the
+ Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter.
+
+ - Install a Linux distribution and tools
+
+ The first kernel release to contain both the NFS/RDMA client and server was
+  Linux 2.6.25. Therefore, a distribution compatible with this and subsequent
+  Linux kernel releases should be installed.
+
+ The procedures described in this document have been tested with
+ distributions from Red Hat's Fedora Project (http://fedora.redhat.com/).
+
+ - Install nfs-utils-1.1.2 or greater on the client
+
+ An NFS/RDMA mount point can be obtained by using the mount.nfs command in
+ nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils
+ version with support for NFS/RDMA mounts, but for various reasons we
+ recommend using nfs-utils-1.1.2 or greater). To see which version of
+ mount.nfs you are using, type:
+
+ $ /sbin/mount.nfs -V
+
+ If the version is less than 1.1.2 or the command does not exist,
+ you should install the latest version of nfs-utils.
+
+ Download the latest package from:
+
+ http://www.kernel.org/pub/linux/utils/nfs
+
+ Uncompress the package and follow the installation instructions.
+
+ If you will not need the idmapper and gssd executables (you do not need
+ these to create an NFS/RDMA enabled mount command), the installation
+ process can be simplified by disabling these features when running
+ configure:
+
+ $ ./configure --disable-gss --disable-nfsv4
+
+ To build nfs-utils you will need the tcp_wrappers package installed. For
+ more information on this see the package's README and INSTALL files.
+
+ After building the nfs-utils package, there will be a mount.nfs binary in
+ the utils/mount directory. This binary can be used to initiate NFS v2, v3,
+ or v4 mounts. To initiate a v4 mount, the binary must be called
+ mount.nfs4. The standard technique is to create a symlink called
+ mount.nfs4 to mount.nfs.
+
+ This mount.nfs binary should be installed at /sbin/mount.nfs as follows:
+
+ $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs
+
+ In this location, mount.nfs will be invoked automatically for NFS mounts
+ by the system mount command.
+
+ NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed
+ on the NFS client machine. You do not need this specific version of
+ nfs-utils on the server. Furthermore, only the mount.nfs command from
+ nfs-utils-1.1.2 is needed on the client.
+
+ - Install a Linux kernel with NFS/RDMA
+
+ The NFS/RDMA client and server are both included in the mainline Linux
+ kernel version 2.6.25 and later. This and other versions of the 2.6 Linux
+ kernel can be found at:
+
+ ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
+
+ Download the sources and place them in an appropriate location.
+
+ - Configure the RDMA stack
+
+ Make sure your kernel configuration has RDMA support enabled. Under
+ Device Drivers -> InfiniBand support, update the kernel configuration
+ to enable InfiniBand support [NOTE: the option name is misleading. Enabling
+ InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)].
+
+ Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or
+ iWARP adapter support (amso, cxgb3, etc.).
+
+ If you are using InfiniBand, be sure to enable IP-over-InfiniBand support.
+
+ - Configure the NFS client and server
+
+ Your kernel configuration must also have NFS file system support and/or
+ NFS server support enabled. These and other NFS related configuration
+ options can be found under File Systems -> Network File Systems.
+
+ - Build, install, reboot
+
+ The NFS/RDMA code will be enabled automatically if NFS and RDMA
+ are turned on. The NFS/RDMA client and server are configured via the hidden
+ SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The
+ value of SUNRPC_XPRT_RDMA will be:
+
+ - N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client
+ and server will not be built
+ - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M,
+ in this case the NFS/RDMA client and server will be built as modules
+ - Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client
+ and server will be built into the kernel
+
+  Therefore, if you have followed the steps above and turned on NFS and RDMA,
+ the NFS/RDMA client and server will be built.
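+
+  For example, a .config fragment for a fully modular build might contain
+  (an illustrative sketch; option sets vary between kernel versions):
+
+    CONFIG_SUNRPC=m
+    CONFIG_INFINIBAND=m
+    CONFIG_SUNRPC_XPRT_RDMA=m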
+
+ Build a new kernel, install it, boot it.
+
+Check RDMA and NFS Setup
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+ Before configuring the NFS/RDMA software, it is a good idea to test
+ your new kernel to ensure that the kernel is working correctly.
+ In particular, it is a good idea to verify that the RDMA stack
+ is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
+ is working properly.
+
+ - Check RDMA Setup
+
+ If you built the RDMA components as modules, load them at
+ this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
+ card:
+
+ $ modprobe ib_mthca
+ $ modprobe ib_ipoib
+
+ If you are using InfiniBand, make sure there is a Subnet Manager (SM)
+ running on the network. If your IB switch has an embedded SM, you can
+ use it. Otherwise, you will need to run an SM, such as OpenSM, on one
+ of your end nodes.
+
+ If an SM is running on your network, you should see the following:
+
+ $ cat /sys/class/infiniband/driverX/ports/1/state
+ 4: ACTIVE
+
+ where driverX is mthca0, ipath5, ehca3, etc.
+
+ To further test the InfiniBand software stack, use IPoIB (this
+ assumes you have two IB hosts named host1 and host2):
+
+ host1$ ifconfig ib0 a.b.c.x
+ host2$ ifconfig ib0 a.b.c.y
+ host1$ ping a.b.c.y
+ host2$ ping a.b.c.x
+
+ For other device types, follow the appropriate procedures.
+
+ - Check NFS Setup
+
+ For the NFS components enabled above (client and/or server),
+ test their functionality over standard Ethernet using TCP/IP or UDP/IP.
+
+NFS/RDMA Setup
+~~~~~~~~~~~~~~
+
+ We recommend that you use two machines, one to act as the client and
+ one to act as the server.
+
+ One time configuration:
+
+ - On the server system, configure the /etc/exports file and
+ start the NFS/RDMA server.
+
+ Exports entries with the following formats have been tested:
+
+ /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
+ /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
+
+ The IP address(es) is(are) the client's IPoIB address for an InfiniBand
+  HCA or the client's iWARP address(es) for an RNIC.
+
+ NOTE: The "insecure" option must be used because the NFS/RDMA client does
+ not use a reserved port.
+
+ Each time a machine boots:
+
+ - Load and configure the RDMA drivers
+
+ For InfiniBand using a Mellanox adapter:
+
+ $ modprobe ib_mthca
+ $ modprobe ib_ipoib
+ $ ifconfig ib0 a.b.c.d
+
+ NOTE: use unique addresses for the client and server
+
+ - Start the NFS server
+
+ If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
+ kernel config), load the RDMA transport module:
+
+ $ modprobe svcrdma
+
+ Regardless of how the server was built (module or built-in), start the
+ server:
+
+ $ /etc/init.d/nfs start
+
+ or
+
+ $ service nfs start
+
+ Instruct the server to listen on the RDMA transport:
+
+ $ echo rdma 20049 > /proc/fs/nfsd/portlist
+
+ - On the client system
+
+ If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
+ kernel config), load the RDMA client module:
+
+    $ modprobe xprtrdma
+
+ Regardless of how the client was built (module or built-in), use this
+ command to mount the NFS/RDMA server:
+
+ $ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt
+
+ To verify that the mount is using RDMA, run "cat /proc/mounts" and check
+ the "proto" field for the given mount.
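+
+  For example (an illustrative check; the exact set of fields varies with
+  kernel version and mount options):
+
+    $ grep /mnt /proc/mounts
+    192.168.0.47:/vol0 /mnt nfs rw,vers=3,proto=rdma,port=20049 0 0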
+
+ Congratulations! You're using NFS/RDMA!
diff --git a/Documentation/filesystems/nfs/nfs.txt b/Documentation/filesystems/nfs/nfs.txt
new file mode 100644
index 00000000..f50f26ce
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfs.txt
@@ -0,0 +1,98 @@
+
+The NFS client
+==============
+
+The NFS version 2 protocol was first documented in RFC1094 (March 1989).
+Since then two more major releases of NFS have been published, with NFSv3
+being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April
+2003).
+
+The Linux NFS client currently supports all the above published versions,
+and work is in progress on adding support for minor version 1 of the NFSv4
+protocol.
+
+The purpose of this document is to provide information on some of the
+upcall interfaces that are used in order to provide the NFS client with
+some of the information that it requires in order to fully comply with
+the NFS spec.
+
+The DNS resolver
+================
+
+NFSv4 allows for one server to refer the NFS client to data that has been
+migrated onto another server by means of the special "fs_locations"
+attribute. See
+ http://tools.ietf.org/html/rfc3530#section-6
+and
+ http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00
+
+The fs_locations information can take the form of either an ip address and
+a path, or a DNS hostname and a path. The latter requires the NFS client to
+do a DNS lookup in order to mount the new volume, and hence the need for an
+upcall to allow userland to provide this service.
+
+Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual
+/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps:
+
+ (1) The process checks the dns_resolve cache to see if it contains a
+ valid entry. If so, it returns that entry and exits.
+
+ (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent'
+ (may be changed using the 'nfs.cache_getent' kernel boot parameter)
+ is run, with two arguments:
+ - the cache name, "dns_resolve"
+ - the hostname to resolve
+
+ (3) After looking up the corresponding ip address, the helper script
+ writes the result into the rpc_pipefs pseudo-file
+ '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel'
+ in the following (text) format:
+
+ "<ip address> <hostname> <ttl>\n"
+
+	Where <ip address> is in the usual IPv4 (e.g. 192.168.78.90) or IPv6
+ (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format.
+ <hostname> is identical to the second argument of the helper
+ script, and <ttl> is the 'time to live' of this cache entry (in
+ units of seconds).
+
+ Note: If <ip address> is invalid, say the string "0", then a negative
+ entry is created, which will cause the kernel to treat the hostname
+ as having no valid DNS translation.
+
+
+
+
+A basic sample /sbin/nfs_cache_getent
+=====================================
+
+#!/bin/bash
+#
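+# Upcall helper for the kernel's NFS DNS resolver cache: given a cache
+# name and an entry name, resolve the entry with getent and write
+# "<ip address> <hostname> <ttl>" back into the rpc_pipefs cache
+# channel. An unresolvable name is written back as "0", creating a
+# negative entry (see above).
+#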
+ttl=600
+#
+cut=/usr/bin/cut
+getent=/usr/bin/getent
+rpc_pipefs=/var/lib/nfs/rpc_pipefs
+#
+die()
+{
+ echo "Usage: $0 cache_name entry_name"
+ exit 1
+}
+
+[ $# -lt 2 ] && die
+cachename="$1"
+cache_path=${rpc_pipefs}/cache/${cachename}/channel
+
+case "${cachename}" in
+ dns_resolve)
+ name="$2"
+ result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )"
+ [ -z "${result}" ] && result="0"
+ ;;
+ *)
+ die
+ ;;
+esac
+echo "${result} ${name} ${ttl}" >${cache_path}
+
diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt
new file mode 100644
index 00000000..04884914
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfs41-server.txt
@@ -0,0 +1,221 @@
+NFSv4.1 Server Implementation
+
+Server support for minorversion 1 can be controlled using the
+/proc/fs/nfsd/versions control file. The string output returned
+by reading this file will contain "+4.1" if server support for
+minorversion 1 is enabled, or "-4.1" if it is disabled.
+
+Currently, server support for minorversion 1 is disabled by default.
+It can be enabled at run time by writing the string "+4.1" to
+the /proc/fs/nfsd/versions control file. Note that to write this
+control file, the nfsd service must be taken down. Use your user-mode
+nfs-utils to set this up; see rpc.nfsd(8).
+
+(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
+"-4", respectively. Therefore, code meant to work on both new and old
+kernels must turn 4.1 on or off *before* turning support for version 4
+on or off; rpc.nfsd does this correctly.)
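+
+For example, a minimal sequence to check and then enable 4.1 support
+might look like the following (an illustrative sketch; as noted above,
+the nfsd service must be down while the control file is written, and
+the exact contents of the versions file vary):
+
+  # cat /proc/fs/nfsd/versions
+  +2 +3 +4 -4.1
+  # service nfs stop
+  # echo "+4.1" > /proc/fs/nfsd/versions
+  # service nfs start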
+
+The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
+on RFC 5661.
+
+Of the many new features in NFSv4.1, the current implementation
+focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
+"exactly once" semantics and better control and throttling of the
+resources allocated for each client.
+
+Other NFSv4.1 features, Parallel NFS operations in particular,
+are still under development out of tree.
+See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design
+for more information.
+
+The current implementation is intended for developers only: while it
+does support ordinary file operations on clients we have tested against
+(including the linux client), it is incomplete in ways which may limit
+features unexpectedly, cause known bugs in rare cases, or cause
+interoperability problems with future clients. Known issues:
+
+ - gss support is questionable: currently mounts with kerberos
+ from a linux client are possible, but we aren't really
+ conformant with the spec (for example, we don't use kerberos
+ on the backchannel correctly).
+ - no trunking support: no clients currently take advantage of
+ trunking, but this is a mandatory feature, and its use is
+ recommended to clients in a number of places. (E.g. to ensure
+ timely renewal in case an existing connection's retry timeouts
+ have gotten too long; see section 8.3 of the RFC.)
+ Therefore, lack of this feature may cause future clients to
+ fail.
+ - Incomplete backchannel support: incomplete backchannel gss
+ support and no support for BACKCHANNEL_CTL mean that
+ callbacks (hence delegations and layouts) may not be
+ available and clients confused by the incomplete
+ implementation may fail.
+ - Server reboot recovery is unsupported; if the server reboots,
+ clients may fail.
+ - We do not support SSV, which provides security for shared
+ client-server state (thus preventing unauthorized tampering
+ with locks and opens, for example). It is mandatory for
+ servers to support this, though no clients use it yet.
+ - Mandatory operations which we do not support, such as
+ DESTROY_CLIENTID, FREE_STATEID, SECINFO_NO_NAME, and
+ TEST_STATEID, are not currently used by clients, but will be
+	  (and the spec recommends their use in common cases), and
+ clients should not be expected to know how to recover from the
+ case where they are not supported. This will eventually cause
+ interoperability failures.
+
+In addition, some limitations are inherited from the current NFSv4
+implementation:
+
+ - Incomplete delegation enforcement: if a file is renamed or
+ unlinked, a client holding a delegation may continue to
+ indefinitely allow opens of the file under the old name.
+
+The table below, taken from the NFSv4.1 document, lists
+the operations that are mandatory to implement (REQ), optional
+(OPT), and NFSv4.0 operations that are required not to implement (MNI)
+in minor version 1. The first column indicates the operations that
+are not supported yet by the linux server implementation.
+
+The OPTIONAL features identified and their abbreviations are as follows:
+ pNFS Parallel NFS
+ FDELG File Delegations
+ DDELG Directory Delegations
+
+The following abbreviations indicate the linux server implementation status.
+ I Implemented NFSv4.1 operations.
+ NS Not Supported.
+  NS*	Unimplemented optional feature.
+ P pNFS features implemented out of tree.
+ PNS pNFS features that are not supported yet (out of tree).
+
+Operations
+
+ +----------------------+------------+--------------+----------------+
+ | Operation | REQ, REC, | Feature | Definition |
+ | | OPT, or | (REQ, REC, | |
+ | | MNI | or OPT) | |
+ +----------------------+------------+--------------+----------------+
+ | ACCESS | REQ | | Section 18.1 |
+NS | BACKCHANNEL_CTL | REQ | | Section 18.33 |
+NS | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
+ | CLOSE | REQ | | Section 18.2 |
+ | COMMIT | REQ | | Section 18.3 |
+ | CREATE | REQ | | Section 18.4 |
+I | CREATE_SESSION | REQ | | Section 18.36 |
+NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
+ | DELEGRETURN | OPT | FDELG, | Section 18.6 |
+ | | | DDELG, pNFS | |
+ | | | (REQ) | |
+NS | DESTROY_CLIENTID | REQ | | Section 18.50 |
+I | DESTROY_SESSION | REQ | | Section 18.37 |
+I | EXCHANGE_ID | REQ | | Section 18.35 |
+NS | FREE_STATEID | REQ | | Section 18.38 |
+ | GETATTR | REQ | | Section 18.7 |
+P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
+P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
+ | GETFH | REQ | | Section 18.8 |
+NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
+P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
+P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
+P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
+ | LINK | OPT | | Section 18.9 |
+ | LOCK | REQ | | Section 18.10 |
+ | LOCKT | REQ | | Section 18.11 |
+ | LOCKU | REQ | | Section 18.12 |
+ | LOOKUP | REQ | | Section 18.13 |
+ | LOOKUPP | REQ | | Section 18.14 |
+ | NVERIFY | REQ | | Section 18.15 |
+ | OPEN | REQ | | Section 18.16 |
+NS*| OPENATTR | OPT | | Section 18.17 |
+ | OPEN_CONFIRM | MNI | | N/A |
+ | OPEN_DOWNGRADE | REQ | | Section 18.18 |
+ | PUTFH | REQ | | Section 18.19 |
+ | PUTPUBFH | REQ | | Section 18.20 |
+ | PUTROOTFH | REQ | | Section 18.21 |
+ | READ | REQ | | Section 18.22 |
+ | READDIR | REQ | | Section 18.23 |
+ | READLINK | OPT | | Section 18.24 |
+ | RECLAIM_COMPLETE | REQ | | Section 18.51 |
+ | RELEASE_LOCKOWNER | MNI | | N/A |
+ | REMOVE | REQ | | Section 18.25 |
+ | RENAME | REQ | | Section 18.26 |
+ | RENEW | MNI | | N/A |
+ | RESTOREFH | REQ | | Section 18.27 |
+ | SAVEFH | REQ | | Section 18.28 |
+ | SECINFO | REQ | | Section 18.29 |
+NS | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, |
+ | | | layout (REQ) | Section 13.12 |
+I | SEQUENCE | REQ | | Section 18.46 |
+ | SETATTR | REQ | | Section 18.30 |
+ | SETCLIENTID | MNI | | N/A |
+ | SETCLIENTID_CONFIRM | MNI | | N/A |
+NS | SET_SSV | REQ | | Section 18.47 |
+NS | TEST_STATEID | REQ | | Section 18.48 |
+ | VERIFY | REQ | | Section 18.31 |
+NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 |
+ | WRITE | REQ | | Section 18.32 |
+
+Callback Operations
+
+ +-------------------------+-----------+-------------+---------------+
+ | Operation | REQ, REC, | Feature | Definition |
+ | | OPT, or | (REQ, REC, | |
+ | | MNI | or OPT) | |
+ +-------------------------+-----------+-------------+---------------+
+ | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
+P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
+NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
+P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
+NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 |
+NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
+ | CB_RECALL | OPT | FDELG, | Section 20.2 |
+ | | | DDELG, pNFS | |
+ | | | (REQ) | |
+NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
+ | | | DDELG, pNFS | |
+ | | | (REQ) | |
+NS | CB_RECALL_SLOT | REQ | | Section 20.8 |
+NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 |
+ | | | (REQ) | |
+I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 |
+ | | | DDELG, pNFS | |
+ | | | (REQ) | |
+NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |
+ | | | DDELG, pNFS | |
+ | | | (REQ) | |
+ +-------------------------+-----------+-------------+---------------+
+
+Implementation notes:
+
+DELEGPURGE:
+* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
+ CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
+ persist across client reboots). Thus we need not implement this for
+ now.
+
+EXCHANGE_ID:
+* only SP4_NONE state protection supported
+* implementation ids are ignored
+
+CREATE_SESSION:
+* backchannel attributes are ignored
+* backchannel security parameters are ignored
+
+SEQUENCE:
+* no support for dynamic slot table renegotiation (optional)
+
+nfsv4.1 COMPOUND rules:
+The following cases aren't supported yet:
+* Enforcing of NFS4ERR_NOT_ONLY_OP for: BIND_CONN_TO_SESSION, CREATE_SESSION,
+ DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID.
+* DESTROY_SESSION MUST be the final operation in the COMPOUND request.
+
+Nonstandard compound limitations:
+* No support for a sessions fore channel RPC compound that requires both a
+ ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
+ fail to live up to the promise we made in CREATE_SESSION fore channel
+ negotiation.
+* No more than one IO operation (read, write, readdir) allowed per
+ compound.
diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt
new file mode 100644
index 00000000..90c71c6f
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfsroot.txt
@@ -0,0 +1,294 @@
+Mounting the root filesystem via NFS (nfsroot)
+===============================================
+
+Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
+Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
+Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
+Updated 2006 by Horms <horms@verge.net.au>
+
+
+
+In order to use a diskless system, such as an X-terminal or printer server
+for example, it is necessary for the root filesystem to be present on a
+non-disk device. This may be an initramfs (see Documentation/filesystems/
+ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a
+filesystem mounted via NFS. The following text describes how to use NFS
+for the root filesystem. For the rest of this text 'client' means the
+diskless system, and 'server' means the NFS server.
+
+
+
+
+1.) Enabling nfsroot capabilities
+ -----------------------------
+
+In order to use nfsroot, NFS client support needs to be selected as
+built-in during configuration. Once this has been selected, the nfsroot
+option will become available, which should also be selected.
+
+In the networking options, kernel level autoconfiguration can be selected,
+along with the types of autoconfiguration to support. Selecting all of
+DHCP, BOOTP and RARP is safe.
+
+
+
+
+2.) Kernel command line
+ -------------------
+
+When the kernel has been loaded by a boot loader (see below) it needs to be
+told what root fs device to use. And in the case of nfsroot, where to find
+both the server and the name of the directory on the server to mount as root.
+This can be established using the following kernel command line parameters:
+
+
+root=/dev/nfs
+
+ This is necessary to enable the pseudo-NFS-device. Note that it's not a
+ real device but just a synonym to tell the kernel to use NFS instead of
+ a real device.
+
+
+nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
+
+ If the `nfsroot' parameter is NOT given on the command line,
+ the default "/tftpboot/%s" will be used.
+
+ <server-ip> Specifies the IP address of the NFS server.
+ The default address is determined by the `ip' parameter
+ (see below). This parameter allows the use of different
+ servers for IP autoconfiguration and NFS.
+
+ <root-dir> Name of the directory on the server to mount as root.
+ If there is a "%s" token in the string, it will be
+ replaced by the ASCII-representation of the client's
+ IP address.
+
+ <nfs-options> Standard NFS options. All options are separated by commas.
+ The following defaults are used:
+ port = as given by server portmap daemon
+ rsize = 4096
+ wsize = 4096
+ timeo = 7
+ retrans = 3
+ acregmin = 3
+ acregmax = 60
+ acdirmin = 30
+ acdirmax = 60
+ flags = hard, nointr, noposix, cto, ac
+
+
+ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>
+
+ This parameter tells the kernel how to configure IP addresses of devices
+ and also how to set up the IP routing table. It was originally called
+ `nfsaddrs', but now the boot-time IP configuration works independently of
+ NFS, so it was renamed to `ip' and the old name remained as an alias for
+ compatibility reasons.
+
+ If this parameter is missing from the kernel command line, all fields are
+ assumed to be empty, and the defaults mentioned below apply. In general
+ this means that the kernel tries to configure everything using
+ autoconfiguration.
+
+ The <autoconf> parameter can appear alone as the value to the `ip'
+ parameter (without all the ':' characters before). If the value is
+ "ip=off" or "ip=none", no autoconfiguration will take place, otherwise
+ autoconfiguration will take place. The most common way to use this
+ is "ip=dhcp".
+
+ <client-ip> IP address of the client.
+
+ Default: Determined using autoconfiguration.
+
+ <server-ip> IP address of the NFS server. If RARP is used to determine
+ the client address and this parameter is NOT empty only
+ replies from the specified server are accepted.
+
+		Only required for NFS root. That is, autoconfiguration
+ will not be triggered if it is missing and NFS root is not
+ in operation.
+
+ Default: Determined using autoconfiguration.
+ The address of the autoconfiguration server is used.
+
+ <gw-ip> IP address of a gateway if the server is on a different subnet.
+
+ Default: Determined using autoconfiguration.
+
+ <netmask> Netmask for local network interface. If unspecified
+ the netmask is derived from the client IP address assuming
+ classful addressing.
+
+ Default: Determined using autoconfiguration.
+
+ <hostname> Name of the client. May be supplied by autoconfiguration,
+ but its absence will not trigger autoconfiguration.
+ If specified and DHCP is used, the user provided hostname will
+		be carried in the DHCP request to hopefully update the DNS record.
+
+ Default: Client IP address is used in ASCII notation.
+
+ <device> Name of network device to use.
+
+ Default: If the host only has one device, it is used.
+ Otherwise the device is determined using
+ autoconfiguration. This is done by sending
+ autoconfiguration requests out of all devices,
+ and using the device that received the first reply.
+
+ <autoconf> Method to use for autoconfiguration. In the case of options
+ which specify multiple autoconfiguration protocols,
+ requests are sent using all protocols, and the first one
+ to reply is used.
+
+ Only autoconfiguration protocols that have been compiled
+ into the kernel will be used, regardless of the value of
+ this option.
+
+ off or none: don't use autoconfiguration
+ (do static IP assignment instead)
+ on or any: use any protocol available in the kernel
+ (default)
+ dhcp: use DHCP
+ bootp: use BOOTP
+ rarp: use RARP
+ both: use both BOOTP and RARP but not DHCP
+ (old option kept for backwards compatibility)
+
+ Default: any
+
+
+nfsrootdebug
+
+ This parameter enables debugging messages to appear in the kernel
+ log at boot time so that administrators can verify that the correct
+ NFS mount options, server address, and root path are passed to the
+ NFS client.
+
+
+rdinit=<executable file>
+
+ To specify which file contains the program that starts system
+ initialization, administrators can use this command line parameter.
+ The default value of this parameter is "/init". If the specified
+ file exists and the kernel can execute it, root filesystem related
+ kernel command line parameters, including `nfsroot=', are ignored.
+
+ A description of the process of mounting the root file system can be
+ found in:
+
+ Documentation/early-userspace/README
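+
+
+As an illustrative example, a complete kernel command line combining the
+parameters above (the server address and export path are hypothetical)
+might be:
+
+   root=/dev/nfs nfsroot=192.168.1.1:/srv/nfsroot ip=dhcp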
+
+
+
+
+3.) Boot Loader
+ ----------
+
+To get the kernel into memory, different approaches can be used.
+They depend on various facilities being available:
+
+
+3.1) Booting from a floppy using syslinux
+
+ When building kernels, an easy way to create a boot floppy that uses
+ syslinux is to use the zdisk or bzdisk make targets which use zimage
+ and bzimage images respectively. Both targets accept the
+ FDARGS parameter which can be used to set the kernel command line.
+
+ e.g.
+ make bzdisk FDARGS="root=/dev/nfs"
+
+ Note that the user running this command will need to have
+ access to the floppy drive device, /dev/fd0
+
+ For more information on syslinux, including how to create bootdisks
+ for prebuilt kernels, see http://syslinux.zytor.com/
+
+ N.B: Previously it was possible to write a kernel directly to
+ a floppy using dd, configure the boot device using rdev, and
+ boot using the resulting floppy. Linux no longer supports this
+ method of booting.
+
+3.2) Booting from a cdrom using isolinux
+
+ When building kernels, an easy way to create a bootable cdrom that
+ uses isolinux is to use the isoimage target which uses a bzimage
+ image. Like zdisk and bzdisk, this target accepts the FDARGS
+ parameter which can be used to set the kernel command line.
+
+ e.g.
+ make isoimage FDARGS="root=/dev/nfs"
+
+ The resulting iso image will be arch/<ARCH>/boot/image.iso
+ This can be written to a cdrom using a variety of tools including
+ cdrecord.
+
+ e.g.
+ cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso
+
+ For more information on isolinux, including how to create bootdisks
+ for prebuilt kernels, see http://syslinux.zytor.com/
+
+3.3) Using LILO
+ When using LILO all the necessary command line parameters may be
+ specified using the 'append=' directive in the LILO configuration
+ file.
+
+ However, to use the 'root=' directive you also need to create
+ a dummy root device, which may be removed after LILO is run.
+
+ mknod /dev/boot255 c 0 255
+
+ For information on configuring LILO, please refer to its documentation.
+
+3.4) Using GRUB
+	When using GRUB, kernel parameters are simply appended after the kernel
+ specification: kernel <kernel> <parameters>
+
+3.5) Using loadlin
+ loadlin may be used to boot Linux from a DOS command prompt without
+ requiring a local hard disk to mount as root. This has not been
+ thoroughly tested by the authors of this document, but in general
+ it should be possible configure the kernel command line similarly
+	it should be possible to configure the kernel command line similarly
+
+ Please refer to the loadlin documentation for further information.
+
+3.6) Using a boot ROM
+ This is probably the most elegant way of booting a diskless client.
+ With a boot ROM the kernel is loaded using the TFTP protocol. The
+	authors of this document are not aware of any commercial boot
+ ROMs that support booting Linux over the network. However, there
+ are two free implementations of a boot ROM, netboot-nfs and
+ etherboot, both of which are available on sunsite.unc.edu, and both
+ of which contain everything you need to boot a diskless Linux client.
+
+3.7) Using pxelinux
+ Pxelinux may be used to boot linux using the PXE boot loader
+ which is present on many modern network cards.
+
+ When using pxelinux, the kernel image is specified using
+ "kernel <relative-path-below /tftpboot>". The nfsroot parameters
+ are passed to the kernel by adding them to the "append" line.
+	It is common to use a serial console in conjunction with pxelinux,
+ see Documentation/serial-console.txt for more information.
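+
+	A minimal pxelinux.cfg entry might look like this (the kernel path,
+	server address, and export path are hypothetical):
+
+	label linux-nfs
+	  kernel vmlinuz
+	  append root=/dev/nfs nfsroot=192.168.1.1:/srv/nfsroot ip=dhcp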
+
+	For more information on pxelinux, including how to create bootdisks
+ for prebuilt kernels, see http://syslinux.zytor.com/
+
+
+
+
+4.) Credits
+ -------
+
+ The nfsroot code in the kernel and the RARP support have been written
+ by Gero Kuhlmann <gero@gkminix.han.de>.
+
+ The rest of the IP layer autoconfiguration code has been written
+ by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.
+
+ In order to write the initial version of nfsroot I would like to thank
+ Jens-Uwe Mager <jum@anubis.han.de> for his help.
diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt
new file mode 100644
index 00000000..983e14ab
--- /dev/null
+++ b/Documentation/filesystems/nfs/pnfs.txt
@@ -0,0 +1,55 @@
+Reference counting in pnfs:
+===========================
+
+There are several inter-related caches. We have layouts which can
+reference multiple devices, each of which can reference multiple data servers.
+Each data server can be referenced by multiple devices. Each device
+can be referenced by multiple layouts. To keep all of this straight,
+we need to reference count.
+
+
+struct pnfs_layout_hdr
+----------------------
+The on-the-wire command LAYOUTGET corresponds to struct
+pnfs_layout_segment, usually referred to by the variable name lseg.
+Each nfs_inode may hold a pointer to a cache of these layout
+segments in nfsi->layout, of type struct pnfs_layout_hdr.
+
+We reference the header for the inode pointing to it, across each
+outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN,
+LAYOUTCOMMIT), and for each lseg held within.
+
+Each header is also (when non-empty) put on a list associated with
+struct nfs_client (cl_layouts). Being put on this list does not bump
+the reference count, as the layout is kept around by the lseg that
+keeps it in the list.
+
+deviceid_cache
+--------------
+lsegs reference device ids, which are resolved per nfs_client and
+layout driver type. The device ids are held in an RCU cache (struct
+nfs4_deviceid_cache). The cache itself is referenced across each
+mount. The entries (struct nfs4_deviceid) themselves are held across
+the lifetime of each lseg referencing them.
+
+RCU is used because the deviceid is basically a write once, read many
+data structure. The hlist size of 32 buckets needs better
+justification, but seems reasonable given that we can have multiple
+deviceids per filesystem, and multiple filesystems per nfs_client.
+
+The hash code is copied from the nfsd code base. A discussion of
+hashing and variations of this algorithm can be found at:
+http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809
+
+data server cache
+-----------------
+file driver devices refer to data servers, which are kept in a module
+level cache. Its reference is held over the lifetime of the deviceid
+pointing to it.
+
+lseg
+----
+lseg maintains an extra reference corresponding to the NFS_LSEG_VALID
+bit which holds it in the pnfs_layout_hdr's list. When the final lseg
+is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED
+bit is set, preventing any new lsegs from being added.
diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt
new file mode 100644
index 00000000..ebcaaee2
--- /dev/null
+++ b/Documentation/filesystems/nfs/rpc-cache.txt
@@ -0,0 +1,202 @@
+ This document gives a brief introduction to the caching
+mechanisms in the sunrpc layer that are used, in particular,
+for NFS authentication.
+
+CACHES
+======
+The caching replaces the old exports table and allows for
+a wide variety of values to be cached.
+
+There are a number of caches that are similar in structure though
+quite possibly very different in content and use. There is a corpus
+of common code for managing these caches.
+
+Examples of caches that are likely to be needed are:
+ - mapping from IP address to client name
+ - mapping from client name and filesystem to export options
+ - mapping from UID to list of GIDs, to work around NFS's limitation
+ of 16 gids.
+ - mappings between local UID/GID and remote UID/GID for sites that
+ do not have uniform uid assignment
+ - mapping from network identity to public key for crypto authentication.
+
+The common code handles such things as:
+ - general cache lookup with correct locking
+ - supporting 'NEGATIVE' as well as positive entries
+ - allowing an EXPIRED time on cache items, and removing
+   items after they expire and are no longer in use.
+ - making requests to user-space to fill in cache entries
+ - allowing user-space to directly set entries in the cache
+ - delaying RPC requests that depend on as-yet incomplete
+ cache entries, and replaying those requests when the cache entry
+ is complete.
+ - cleaning out old entries as they expire.
+
+Creating a Cache
+----------------
+
+1/ A cache needs a datum to store. This is in the form of a
+ structure definition that must contain a
+ struct cache_head
+ as an element, usually the first.
+ It will also contain a key and some content.
+ Each cache element is reference counted and contains
+ expiry and update times for use in cache management.
+2/ A cache needs a "cache_detail" structure that
+ describes the cache. This stores the hash table, some
+ parameters for cache management, and some operations detailing how
+ to work with particular cache items.
+   The operations required are:
+ struct cache_head *alloc(void)
+ This simply allocates appropriate memory and returns
+		a pointer to the cache_head embedded within the
+ structure
+ void cache_put(struct kref *)
+ This is called when the last reference to an item is
+ dropped. The pointer passed is to the 'ref' field
+ in the cache_head. cache_put should release any
+		references created by 'cache_init' and, if CACHE_VALID
+ is set, any references created by cache_update.
+ It should then release the memory allocated by
+ 'alloc'.
+ int match(struct cache_head *orig, struct cache_head *new)
+ test if the keys in the two structures match. Return
+ 1 if they do, 0 if they don't.
+ void init(struct cache_head *orig, struct cache_head *new)
+ Set the 'key' fields in 'new' from 'orig'. This may
+ include taking references to shared objects.
+ void update(struct cache_head *orig, struct cache_head *new)
+		Set the 'content' fields in 'new' from 'orig'.
+ int cache_show(struct seq_file *m, struct cache_detail *cd,
+ struct cache_head *h)
+ Optional. Used to provide a /proc file that lists the
+ contents of a cache. This should show one item,
+ usually on just one line.
+ int cache_request(struct cache_detail *cd, struct cache_head *h,
+ char **bpp, int *blen)
+		Format a request to be sent to user-space for an item
+ to be instantiated. *bpp is a buffer of size *blen.
+ bpp should be moved forward over the encoded message,
+ and *blen should be reduced to show how much free
+ space remains. Return 0 on success or <0 if not
+ enough room or other problem.
+ int cache_parse(struct cache_detail *cd, char *buf, int len)
+ A message from user space has arrived to fill out a
+ cache entry. It is in 'buf' of length 'len'.
+ cache_parse should parse this, find the item in the
+ cache with sunrpc_cache_lookup, and update the item
+ with sunrpc_cache_update.
+
+
+3/ A cache needs to be registered using cache_register(). This
+ includes it on a list of caches that will be regularly
+ cleaned to discard old data.
+
+Using a cache
+-------------
+
+To find a value in a cache, call sunrpc_cache_lookup passing a pointer
+to the cache_head in a sample item with the 'key' fields filled in.
+This will be passed to ->match to identify the target entry. If no
+entry is found, a new entry will be created, added to the cache, and
+marked as not containing valid data.
+
+The item returned is typically passed to cache_check which will check
+if the data is valid, and may initiate an up-call to get fresh data.
+cache_check will return -ENOENT if the entry is negative or if an up
+call is needed but not possible, -EAGAIN if an upcall is pending,
+or 0 if the data is valid.
+
+cache_check can be passed a "struct cache_req *". This structure is
+typically embedded in the actual request and can be used to create a
+deferred copy of the request (struct cache_deferred_req). This is
+done when the found cache item is not uptodate, but there is reason to
+believe that userspace might provide information soon. When the cache
+item does become valid, the deferred copy of the request will be
+revisited (->revisit). It is expected that this method will
+reschedule the request for processing.
+
+The value returned by sunrpc_cache_lookup can also be passed to
+sunrpc_cache_update to set the content for the item. A second item is
+passed which should hold the content. If the item found by _lookup
+has valid data, then it is discarded and a new item is created. This
+saves any user of an item from worrying about content changing while
+it is being inspected. If the item found by _lookup does not contain
+valid data, then the content is copied across and CACHE_VALID is set.
+
+Populating a cache
+------------------
+
+Each cache has a name, and when the cache is registered, a directory
+with that name is created in /proc/net/rpc.
+
+This directory contains a file called 'channel' which is a channel
+for communicating between kernel and user for populating the cache.
+This directory may later contain other files for interacting
+with the cache.
+
+The 'channel' works a bit like a datagram socket. Each 'write' is
+passed as a whole to the cache for parsing and interpretation.
+Each cache can treat the write requests differently, but it is
+expected that a message written will contain:
+ - a key
+ - an expiry time
+ - some content.
+with the intention that an item in the cache with the given key
+should be created or updated to have the given content, and the
+expiry time should be set on that item.
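+
+For example (a hypothetical cache name and fields; each cache defines
+its own request format, and the expiry time is given in seconds since
+the epoch):
+
+   # echo 'some-key 2147483647 some-content' > /proc/net/rpc/mycache/channel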
+
+Reading from a channel is a bit more interesting. When a cache
+lookup fails, or when it succeeds but finds an entry that may soon
+expire, a request is lodged for that cache item to be updated by
+user-space. These requests appear in the channel file.
+
+Successive reads will return successive requests.
+If there are no more requests to return, read will return EOF, but a
+select or poll for read will block waiting for another request to be
+added.
+
+Thus a user-space helper is likely to:
+ open the channel.
+ select for readable
+ read a request
+ write a response
+ loop.
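+
+A minimal sketch of such a helper in shell (the cache name and the
+response fields are hypothetical; a real helper parses the request
+fields defined by its particular cache and, as described above, polls
+the channel for readability instead of exiting at end-of-file):
+
+   #!/bin/bash
+   channel=/proc/net/rpc/mycache/channel
+   # Drain all currently pending requests, answering each with a fixed
+   # content field and a far-future expiry time.
+   while read -r request; do
+           echo "$request 2147483647 some-content" > "$channel"
+   done < "$channel"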
+
+If it dies and needs to be restarted, any requests that have not been
+answered will still appear in the file and will be read by the new
+instance of the helper.
+
+Each cache should define a "cache_parse" method which takes a message
+written from user-space and processes it. It should return an error
+(which propagates back to the write syscall) or 0.
+
+Each cache should also define a "cache_request" method which
+takes a cache item and encodes a request into the buffer
+provided.
+
+Note: If a cache has no active readers on the channel, and has had no
+active readers for more than 60 seconds, further requests will not be
+added to the channel but instead all lookups that do not find a valid
+entry will fail. This is partly for backward compatibility: The
+previous nfs exports table was deemed to be authoritative and a
+failed lookup meant a definite 'no'.
+
+request/response format
+-----------------------
+
+While each cache is free to use its own format for requests
+and responses over the channel, the following is recommended as
+appropriate and support routines are available to help:
+Each request or response record should be printable ASCII
+with precisely one newline character which should be at the end.
+Fields within the record should be separated by spaces, normally one.
+If spaces, newlines, or nul characters are needed in a field they
+must be quoted. Two mechanisms are available:
+1/ If a field begins '\x' then it must contain an even number of
+   hex digits, and pairs of these digits provide the bytes in the
+   field.
+2/ otherwise a \ in the field must be followed by 3 octal digits
+   which give the code for a byte. Other characters are treated
+   as themselves. At the very least, space, newline, nul, and
+   '\' must be quoted in this way.
+For example, the field "my host" could be encoded either as
+'my\040host' (octal mechanism) or as '\x6d7920686f7374' (hex
+mechanism).
diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt
new file mode 100644
index 00000000..873a2ab2
--- /dev/null
+++ b/Documentation/filesystems/nilfs2.txt
@@ -0,0 +1,208 @@
+NILFS2
+------
+
+NILFS2 is a log-structured file system (LFS) supporting continuous
+snapshotting. In addition to versioning capability of the entire file
+system, users can even restore files mistakenly overwritten or
+destroyed just a few seconds ago. Since NILFS2 can keep consistency
+like conventional LFS, it achieves quick recovery after system
+crashes.
+
+NILFS2 creates a number of checkpoints every few seconds or on a
+per-synchronous-write basis (unless there is no change). Users can select
+significant versions among continuously created checkpoints, and can
+change them into snapshots which will be preserved until they are
+changed back to checkpoints.
+
+There is no limit on the number of snapshots until the volume gets
+full. Each snapshot is mountable as a read-only file system
+concurrently with its writable mount, and this feature is convenient
+for online backup.
+
+The userland tools are included in nilfs-utils package, which is
+available from the following download page. At least "mkfs.nilfs2",
+"mount.nilfs2", "umount.nilfs2", and "nilfs_cleanerd" (the so-called
+cleaner or garbage collector) are required. Details on the tools are
+described in the man pages included in the package.
+
+Project web page: http://www.nilfs.org/en/
+Download page: http://www.nilfs.org/en/download.html
+Git tree web page: http://www.nilfs.org/git/
+List info: http://vger.kernel.org/vger-lists.html#linux-nilfs
+
+Caveats
+=======
+
+Features which NILFS2 does not support yet:
+
+ - atime
+ - extended attributes
+ - POSIX ACLs
+ - quotas
+ - fsck
+ - defragmentation
+
+Mount options
+=============
+
+NILFS2 supports the following mount options:
+(*) == default
+
+barrier(*) This enables/disables the use of write barriers. This
+nobarrier requires an IO stack which can support barriers, and
+ if nilfs gets an error on a barrier write, it will
+			disable the use of barriers with a warning.
+errors=continue Keep going on a filesystem error.
+errors=remount-ro(*) Remount the filesystem read-only on an error.
+errors=panic Panic and halt the machine if an error occurs.
+cp=n Specify the checkpoint-number of the snapshot to be
+ mounted. Checkpoints and snapshots are listed by lscp
+ user command. Only the checkpoints marked as snapshot
+ are mountable with this option. Snapshot is read-only,
+			so a read-only mount option must also be specified.
+order=relaxed(*) Apply relaxed order semantics that allows modified data
+ blocks to be written to disk without making a
+			checkpoint if no metadata update is in progress. This
+			mode is equivalent to the ordered data mode of the ext3
+			filesystem except that updates to data blocks still
+			preserve atomicity. This will improve synchronous
+ write performance for overwriting.
+order=strict Apply strict in-order semantics that preserves sequence
+ of all file operations including overwriting of data
+ blocks. That means, it is guaranteed that no
+ overtaking of events occurs in the recovered file
+ system after a crash.
+norecovery Disable recovery of the filesystem on mount.
+ This disables every write access on the device for
+ read-only mounts or snapshots. This option will fail
+ for r/w mounts on an unclean volume.
+discard This enables/disables the use of discard/TRIM commands.
+nodiscard(*) The discard/TRIM commands are sent to the underlying
+ block device when blocks are freed. This is useful
+ for SSD devices and sparse/thinly-provisioned LUNs.
+
+NILFS2 usage
+============
+
+To use nilfs2 as a local file system, simply:
+
+ # mkfs -t nilfs2 /dev/block_device
+ # mount -t nilfs2 /dev/block_device /dir
+
+This will also invoke the cleaner through the mount helper program
+(mount.nilfs2).
+
+Checkpoints and snapshots are managed by the following commands.
+Their manpages are included in the nilfs-utils package above.
+
+ lscp list checkpoints or snapshots.
+ mkcp make a checkpoint or a snapshot.
+ chcp change an existing checkpoint to a snapshot or vice versa.
+ rmcp invalidate specified checkpoint(s).
+
+To mount a snapshot,
+
+ # mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir
+
+where <cno> is the checkpoint number of the snapshot.
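+
+For example, to turn checkpoint 5 into a snapshot and then mount it
+(the checkpoint number and paths are illustrative):
+
+  # chcp ss /dev/block_device 5
+  # mount -t nilfs2 -r -o cp=5 /dev/block_device /snap_dir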
+
+To unmount the NILFS2 mount point or snapshot, simply:
+
+ # umount /dir
+
+Then, the cleaner daemon is automatically shut down by the umount
+helper program (umount.nilfs2).
+
+Disk format
+===========
+
+A nilfs2 volume is equally divided into a number of segments except
+for the super block (SB) and segment #0. A segment is the container
+of logs. Each log is composed of summary information blocks, payload
+blocks, and an optional super root block (SR):
+
+ ______________________________________________________
+ | |SB| | Segment | Segment | Segment | ... | Segment | |
+ |_|__|_|____0____|____1____|____2____|_____|____N____|_|
+ 0 +1K +4K +8M +16M +24M +(8MB x N)
+ . . (Typical offsets for 4KB-block)
+ . .
+ .______________________.
+ | log | log |... | log |
+ |__1__|__2__|____|__m__|
+ . .
+ . .
+ . .
+ .______________________________.
+ | Summary | Payload blocks |SR|
+ |_blocks__|_________________|__|
+
+The payload blocks are organized per file, and each file consists of
+data blocks and B-tree node blocks:
+
+ |<--- File-A --->|<--- File-B --->|
+ _______________________________________________________________
+ | Data blocks | B-tree blocks | Data blocks | B-tree blocks | ...
+ _|_____________|_______________|_____________|_______________|_
+
+
+Since only the modified blocks are written in the log, a log may
+contain files without data blocks or B-tree node blocks.
+
+The organization of the blocks is recorded in the summary information
+blocks, which contain a header structure (nilfs_segment_summary), per
+file structures (nilfs_finfo), and per block structures (nilfs_binfo):
+
+ _________________________________________________________________________
+ | Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |...
+ |_blocks__|___A___|_(A,1)_|_____|(A,Na)_|___B___|_(B,1)_|_____|(B,Nb)_|___
+
+
+The logs include regular files, directory files, symbolic link files
+and several meta data files. The meta data files are the files used
+to maintain file system meta data. The current version of NILFS2 uses
+the following meta data files:
+
+ 1) Inode file (ifile) -- Stores on-disk inodes
+ 2) Checkpoint file (cpfile) -- Stores checkpoints
+ 3) Segment usage file (sufile) -- Stores allocation state of segments
+ 4) Data address translation file -- Maps virtual block numbers to usual
+ (DAT) block numbers. This file serves to
+ make on-disk blocks relocatable.
+
+The following figure shows a typical organization of the logs:
+
+ _________________________________________________________________________
+ | Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR|
+ |_blocks__|_or_directory_|_______|_____|_______|________|________|_____|__|
+
+
+To stride over segment boundaries, this sequence of files may be split
+into multiple logs. The sequence of logs that should be treated as
+logically one log is delimited with flags marked in the segment
+summary. The recovery code of nilfs2 uses this boundary information
+to ensure atomicity of updates.
+
+The super root block is inserted for every checkpoint. It includes
+three special inodes: the inodes for the DAT, cpfile, and sufile. Inodes
+of regular files, directories, symlinks and other special files are
+included in the ifile. The inode of the ifile itself is included in the
+corresponding checkpoint entry in the cpfile. Thus, the hierarchy
+among NILFS2 files can be depicted as follows:
+
+ Super block (SB)
+ |
+ v
+ Super root block (the latest cno=xx)
+ |-- DAT
+ |-- sufile
+ `-- cpfile
+ |-- ifile (cno=c1)
+ |-- ifile (cno=c2) ---- file (ino=i1)
+ : : |-- file (ino=i2)
+ `-- ifile (cno=xx) |-- file (ino=i3)
+ : :
+ `-- file (ino=yy)
+ ( regular file, directory, or symlink )
+
+For details on the format of each file, please see include/linux/nilfs2_fs.h.
diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt
new file mode 100644
index 00000000..791af8da
--- /dev/null
+++ b/Documentation/filesystems/ntfs.txt
@@ -0,0 +1,721 @@
+The Linux NTFS filesystem driver
+================================
+
+
+Table of contents
+=================
+
+- Overview
+- Web site
+- Features
+- Supported mount options
+- Known bugs and (mis-)features
+- Using NTFS volume and stripe sets
+ - The Device-Mapper driver
+ - The Software RAID / MD driver
+ - Limitations when using the MD driver
+- ChangeLog
+
+
+Overview
+========
+
+Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
+These include mkntfs, a full-featured ntfs filesystem format utility,
+ntfsundelete, used for recovering files that were unintentionally deleted
+from an NTFS volume, and ntfsresize, which is used to resize an NTFS
+partition. See the web site for more information.
+
+To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
+system type 'ntfs'. The driver currently supports read-only mode (with no
+fault-tolerance, encryption or journalling) and very limited, but safe, write
+support.
+
+For fault tolerance and raid support (i.e. volume and stripe sets), you can
+use the kernel's Software RAID / MD driver. See section "Using NTFS volume
+and stripe sets" for details.
+
+
+Web site
+========
+
+There is plenty of additional information on the linux-ntfs web site
+at http://www.linux-ntfs.org/
+
+The web site has a lot of additional information, such as a comprehensive
+FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
+userspace utilities, etc.
+
+
+Features
+========
+
+- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and
+  earlier kernels. This new driver implements NTFS read support, is
+  functionally equivalent to the old ntfs driver, and also implements
+  limited write support. The biggest limitation at present is that
+  files/directories cannot be created or deleted. See below for the list of
+  write features that
+ are so far supported. Another limitation is that writing to compressed files
+ is not implemented at all. Also, neither read nor write access to encrypted
+ files is so far implemented.
+- The new driver has full support for sparse files on NTFS 3.x volumes which
+ the old driver isn't happy with.
+- The new driver supports execution of binaries due to mmap() now being
+ supported.
+- The new driver supports loopback mounting of files on NTFS. Some Linux
+  distributions use this to let the user run Linux from an NTFS partition:
+  a large file is created while in Windows, then loopback mounted while in
+  Linux, and a Linux filesystem is created on it onto which Linux is
+  installed.
+- A comparison of the two drivers using:
+ time find . -type f -exec md5sum "{}" \;
+ run three times in sequence with each driver (after a reboot) on a 1.4GiB
+ NTFS partition, showed the new driver to be 20% faster in total time elapsed
+ (from 9:43 minutes on average down to 7:53). The time spent in user space
+ was unchanged but the time spent in the kernel was decreased by a factor of
+ 2.5 (from 85 CPU seconds down to 33).
+- The driver does not support short file names in general. For backwards
+ compatibility, we implement access to files using their short file names if
+ they exist. The driver will not create short file names however, and a
+ rename will discard any existing short file name.
+- The new driver supports exporting of mounted NTFS volumes via NFS.
+- The new driver supports async io (aio).
+- The new driver supports fsync(2), fdatasync(2), and msync(2).
+- The new driver supports readv(2) and writev(2).
+- The new driver supports access time updates (including mtime and ctime).
+- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present
+ only very limited support for highly fragmented files, i.e. ones which have
+ their data attribute split across multiple extents, is included. Another
+ limitation is that at present truncate(2) will never create sparse files,
+ since to mark a file sparse we need to modify the directory entry for the
+ file and we do not implement directory modifications yet.
+- The new driver supports write(2) which can both overwrite existing data and
+ extend the file size so that you can write beyond the existing data. Also,
+ writing into sparse regions is supported and the holes are filled in with
+ clusters. But at present only limited support for highly fragmented files,
+ i.e. ones which have their data attribute split across multiple extents, is
+ included. Another limitation is that write(2) will never create sparse
+ files, since to mark a file sparse we need to modify the directory entry for
+ the file and we do not implement directory modifications yet.
+
+Supported mount options
+=======================
+
+In addition to the generic mount options described by the manual page for the
+mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
+following mount options:
+
+iocharset=name Deprecated option. Still supported but please use
+ nls=name in the future. See description for nls=name.
+
+nls=name Character set to use when returning file names.
+ Unlike VFAT, NTFS suppresses names that contain
+ unconvertible characters. Note that most character
+ sets contain insufficient characters to represent all
+ possible Unicode characters that can exist on NTFS.
+ To be sure you are not missing any files, you are
+ advised to use nls=utf8 which is capable of
+ representing all Unicode characters.
+
+utf8=<bool> Option no longer supported. Currently mapped to
+ nls=utf8 but please use nls=utf8 in the future and
+ make sure utf8 is compiled either as module or into
+ the kernel. See description for nls=name.
+
+uid=
+gid=
+umask= Provide default owner, group, and access mode mask.
+ These options work as documented in mount(8). By
+ default, the files/directories are owned by root and
+ he/she has read and write permissions, as well as
+ browse permission for directories. No one else has any
+ access permissions. I.e. the mode on all files is by
+ default rw------- and for directories rwx------, a
+ consequence of the default fmask=0177 and dmask=0077.
+ Using a umask of zero will grant all permissions to
+ everyone, i.e. all files and directories will have mode
+ rwxrwxrwx.
+
+fmask=
+dmask= Instead of specifying umask which applies both to
+ files and directories, fmask applies only to files and
+ dmask only to directories.
+
+sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
+ Otherwise the default behaviour is to abort mount if
+ any unknown options are found.
+
+show_sys_files=<BOOL> If show_sys_files is specified, show the system files
+ in directory listings. Otherwise the default behaviour
+ is to hide the system files.
+ Note that even when show_sys_files is specified, "$MFT"
+ will not be visible due to bugs/mis-features in glibc.
+ Further, note that irrespective of show_sys_files, all
+ files are accessible by name, i.e. you can always do
+ "ls -l \$UpCase" for example to specifically show the
+ system file containing the Unicode upcase table.
+
+case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
+ case sensitive and create file names in the POSIX
+ namespace. Otherwise the default behaviour is to treat
+ file names as case insensitive and to create file names
+ in the WIN32/LONG name space. Note, the Linux NTFS
+ driver will never create short file names and will
+ remove them on rename/delete of the corresponding long
+ file name.
+ Note that files remain accessible via their short file
+ name, if it exists. If case_sensitive, you will need
+ to provide the correct case of the short file name.
+
+disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse
+ regions, i.e. holes, inside files is disabled for the
+ volume (for the duration of this mount only). By
+ default, creation of sparse regions is enabled, which
+ is consistent with the behaviour of traditional Unix
+ filesystems.
+
+errors=opt What to do when critical filesystem errors are found.
+ The following values can be used for "opt":
+ continue: DEFAULT, try to clean-up as much as
+ possible, e.g. marking a corrupt inode as
+ bad so it is no longer accessed, and then
+ continue.
+ recover: At present the only supported recovery is
+ that of the boot sector from the backup copy.
+ If mounted read-only, the recovery is done in
+ memory only and not written to disk.
+ Note that the options are additive, i.e. specifying:
+ errors=continue,errors=recover
+ means the driver will attempt to recover and if that
+ fails it will clean-up as much as possible and
+ continue.
+
+mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
+ setting is not persistent across mounts and can be
+ changed from mount to mount but cannot be changed on
+ remount). Values of 1 to 4 are allowed, 1 being the
+ default. The MFT zone multiplier determines how much
+ space is reserved for the MFT on the volume. If all
+ other space is used up, then the MFT zone will be
+ shrunk dynamically, so this has no impact on the
+ amount of free space. However, it can have an impact
+ on performance by affecting fragmentation of the MFT.
+ In general use the default. If you have a lot of small
+ files then use a higher value. The values have the
+ following meaning:
+ Value MFT zone size (% of volume size)
+ 1 12.5%
+ 2 25%
+ 3 37.5%
+ 4 50%
+ Note this option is irrelevant for read-only mounts.
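+
+For example, to mount a volume with UTF-8 name translation, accessible to one
+regular user (the device name, uid/gid values, and mount point below are
+placeholders for your own setup):
+
+$ mount -t ntfs -o nls=utf8,uid=1000,gid=1000,fmask=0177,dmask=0077 \
+        /dev/sda1 /mnt/windows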
+
+
+Known bugs and (mis-)features
+=============================
+
+- The link count on each directory inode entry is set to 1, due to Linux not
+ supporting directory hard links. This may well confuse some user space
+ applications, since the directory names will have the same inode numbers.
+ This also speeds up ntfs_read_inode() immensely. And we haven't found any
+ problems with this approach so far. If you find a problem with this, please
+ let us know.
+
+
+Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
+list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
+
+
+Using NTFS volume and stripe sets
+=================================
+
+For support of volume and stripe sets, you can either use the kernel's
+Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
+the recommended one to use for linear raid. But the latter is required for
+raid level 5. For striping and mirroring, either driver should work fine.
+
+
+The Device-Mapper driver
+------------------------
+
+You will need to create a table of the components of the volume/stripe set and
+how they fit together and load this into the kernel using the dmsetup utility
+(see man 8 dmsetup).
+
+Linear volume sets, i.e. linear raid, have been tested and work fine. Even
+though untested, there is no reason why stripe sets, i.e. raid level 0, and
+mirrors, i.e. raid level 1, should not work, too. Stripes with parity, i.e.
+raid level 5, unfortunately cannot work yet because the current version of the
+Device-Mapper driver does not support raid level 5. You may be able to use the
+Software RAID / MD driver for raid level 5; see the next section for details.
+
+To create the table describing your volume you will need to know each of its
+components and their sizes in sectors, i.e. multiples of 512-byte blocks.
+
+For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
+example if one of your partitions is /dev/hda2 you would do:
+
+$ fdisk -ul /dev/hda
+
+Disk /dev/hda: 81.9 GB, 81964302336 bytes
+255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
+Units = sectors of 1 * 512 = 512 bytes
+
+ Device Boot Start End Blocks Id System
+ /dev/hda1 * 63 4209029 2104483+ 83 Linux
+ /dev/hda2 4209030 37768814 16779892+ 86 NTFS
+ /dev/hda3 37768815 46170809 4200997+ 83 Linux
+
+And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
+33559785 sectors.
+
+For Win2k and later dynamic disks, you can for example use the ldminfo utility
+which is part of the Linux LDM tools (the latest version at the time of
+writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
+ http://www.linux-ntfs.org/
+Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
+into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
+will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
+able to compile this yourself easily so use the binary version!
+
+Then you would use ldminfo in dump mode to obtain the necessary information:
+
+$ ./ldminfo --dump /dev/hda
+
+This would dump the LDM database found on /dev/hda which describes all of your
+dynamic disks and all the volumes on them. At the bottom you will see the
+VOLUME DEFINITIONS section which is all you really need. You may need to look
+further above to determine which of the disks in the volume definitions is
+which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
+look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
+section). You can then find these Disk Ids in the VBLK DATABASE section in the
+<Disk> components where you will get the LDM Name for the disk that is found in
+the VOLUME DEFINITIONS section.
+
+Note you will also need to enable the LDM driver in the Linux kernel. If your
+distribution did not enable it, you will need to recompile the kernel with it
+enabled. This will create the LDM partitions on each device at boot time. You
+would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
+in the Device-Mapper table.
+
+You can also bypass using the LDM driver by using the main device (e.g.
+/dev/hda) and then using the offsets of the LDM partitions into this device as
+the "Start sector of device" when creating the table. Once again ldminfo would
+give you the correct information to do this.
+
+Assuming you know all your devices and their sizes, things are easy.
+
+For a linear raid the table would look like this (note all values are in
+512-byte sectors):
+
+--- cut here ---
+# Offset into Size of this Raid type Device Start sector
+# volume device of device
+0 1028161 linear /dev/hda1 0
+1028161 3903762 linear /dev/hdb2 0
+4931923 2103211 linear /dev/hdc1 0
+--- cut here ---
+
+For a striped volume, i.e. raid level 0, you will need to know the chunk size
+you used when creating the volume. Windows uses 64kiB as the default, so it
+will probably be this unless you changed the defaults when creating the array.
+
+For a raid level 0 the table would look like this (note all values are in
+512-byte sectors):
+
+--- cut here ---
+# Offset Size Raid Number Chunk 1st Start 2nd Start
+# into of the type of size Device in Device in
+# volume volume stripes device device
+0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
+--- cut here ---
+
+If there are more than two devices, just add each of them to the end of the
+line.
+
+Finally, for a mirrored volume, i.e. raid level 1, the table would look like
+this (note all values are in 512-byte sectors):
+
+--- cut here ---
+# Ofs Size Raid Log Number Region Should Number Source Start Target Start
+# in of the type type of log size sync? of Device in Device in
+# vol volume params mirrors Device Device
+0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
+--- cut here ---
+
+If you are mirroring to multiple devices you can specify further targets at the
+end of the line.
+
+Note the "Should sync?" parameter "nosync" means that the two mirrors are
+already in sync which will be the case on a clean shutdown of Windows. If the
+mirrors are not clean, you can specify the "sync" option instead of "nosync"
+and the Device-Mapper driver will then copy the entirety of the "Source Device"
+to the "Target Device" or if you specified multiple target devices to all of
+them.
+
+Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
+and hand it over to dmsetup to work with, like so:
+
+$ dmsetup create myvolume1 /etc/ntfsvolume1
+
+You can obviously replace "myvolume1" with whatever name you like.
+
+If it all worked, you will now have the device /dev/mapper/myvolume1 which
+you can then just use as an argument to the mount command as usual to mount
+the ntfs volume. For example:
+
+$ mount -t ntfs -o ro /dev/mapper/myvolume1 /mnt/myvol1
+
+(You need to create the directory /mnt/myvol1 first and of course you can use
+anything you like instead of /mnt/myvol1 as long as it is an existing
+directory.)
+
+It is advisable to do the mount read-only to see if the volume has been setup
+correctly to avoid the possibility of causing damage to the data on the ntfs
+volume.
+
+
+The Software RAID / MD driver
+-----------------------------
+
+An alternative to using the Device-Mapper driver is to use the kernel's
+Software RAID / MD driver, for which you need to set up your /etc/raidtab
+appropriately (see man 5 raidtab).
+
+Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
+0, have been tested and work fine (though see section "Limitations when using
+the MD driver with NTFS volumes" especially if you want to use linear raid).
+Even though untested, there is no reason why mirrors, i.e. raid level 1, and
+stripes with parity, i.e. raid level 5, should not work, too.
+
+You have to use the "persistent-superblock 0" option for each raid-disk in the
+NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
+superblock used by the MD driver would damage the NTFS volume.
+
+Windows by default uses a stripe chunk size of 64k, so you probably want the
+"chunk-size 64k" option for each raid-disk, too.
+
+For example, if you have a stripe set consisting of two partitions /dev/hda5
+and /dev/hdb1 your /etc/raidtab would look like this:
+
+raiddev /dev/md0
+ raid-level 0
+ nr-raid-disks 2
+ nr-spare-disks 0
+ persistent-superblock 0
+ chunk-size 64k
+ device /dev/hda5
+ raid-disk 0
+ device /dev/hdb1
+ raid-disk 1
+
+For linear raid, just change the raid-level above to "raid-level linear", for
+mirrors, change it to "raid-level 1", and for stripe sets with parity, change
+it to "raid-level 5".
+
+Note for stripe sets with parity you will also need to tell the MD driver
+which parity algorithm to use by specifying the option "parity-algorithm
+which", where you need to replace "which" with the name of the algorithm to
+use (see man 5 raidtab for available algorithms) and you will have to try the
+different available algorithms until you find one that works. Make sure you
+are working read-only when playing with this as you may damage your data
+otherwise. If you find which algorithm works please let us know (email the
+linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
+IRC in channel #ntfs on the irc.freenode.net network) so we can update this
+documentation.
+
+Once the raidtab is set up, run for example raid0run -a to start all devices or
+raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
+
+Then just use the mount command as usual to mount the ntfs volume using for
+example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
+
+It is advisable to do the mount read-only to see if the md volume has been
+setup correctly to avoid the possibility of causing damage to the data on the
+ntfs volume.
+
+
+Limitations when using the Software RAID / MD driver
+-----------------------------------------------------
+
+Using the md driver will not work properly if any of your NTFS partitions have
+an odd number of sectors. This is especially important for linear raid: all
+data after the first partition with an odd number of sectors will be offset by
+one or more sectors, so if you mount such a partition with write support you
+will cause massive damage to the data on the volume, which will only become
+apparent when you try to use the volume again under Windows.
+
+So when using linear raid, make sure that all your partitions have an even
+number of sectors BEFORE attempting to use it. You have been warned!
+
+Even better is to simply use the Device-Mapper for linear raid and then you do
+not have this problem with odd numbers of sectors.
+
+
+ChangeLog
+=========
+
+Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
+
+2.1.30:
+ - Fix writev() (it kept writing the first segment over and over again
+ instead of moving onto subsequent segments).
+ - Fix crash in ntfs_mft_record_alloc() when mapping the new extent mft
+ record failed.
+2.1.29:
+ - Fix a deadlock when mounting read-write.
+2.1.28:
+ - Fix a deadlock.
+2.1.27:
+ - Implement page migration support so the kernel can move memory used
+ by NTFS files and directories around for management purposes.
+ - Add support for writing to sparse files created with Windows XP SP2.
+ - Many minor improvements and bug fixes.
+2.1.26:
+ - Implement support for sector sizes above 512 bytes (up to the maximum
+ supported by NTFS which is 4096 bytes).
+ - Enhance support for NTFS volumes which were supported by Windows but
+ not by Linux due to invalid attribute list attribute flags.
+ - A few minor updates and bug fixes.
+2.1.25:
+ - Write support is now extended with write(2) being able to both
+ overwrite existing file data and to extend files. Also, if a write
+ to a sparse region occurs, write(2) will fill in the hole. Note,
+ mmap(2) based writes still do not support writing into holes or
+ writing beyond the initialized size.
+ - Write support has a new feature and that is that truncate(2) and
+ open(2) with O_TRUNC are now implemented thus files can be both made
+ smaller and larger.
+ - Note: Both write(2) and truncate(2)/open(2) with O_TRUNC still have
+ limitations in that they
+ - only provide limited support for highly fragmented files.
+ - only work on regular, i.e. uncompressed and unencrypted files.
+ - never create sparse files although this will change once directory
+ operations are implemented.
+ - Lots of bug fixes and enhancements across the board.
+2.1.24:
+ - Support journals ($LogFile) which have been modified by chkdsk. This
+ means users can boot into Windows after we marked the volume dirty.
+ The Windows boot will run chkdsk and then reboot. The user can then
+ immediately boot into Linux rather than having to do a full Windows
+ boot first before rebooting into Linux and we will recognize such a
+ journal and empty it as it is clean by definition.
+ - Support journals ($LogFile) with only one restart page as well as
+ journals with two different restart pages. We sanity check both and
+ either use the only sane one or the more recent one of the two in the
+ case that both are valid.
+ - Lots of bug fixes and enhancements across the board.
+2.1.23:
+ - Stamp the user space journal, aka transaction log, aka $UsnJrnl, if
+ it is present and active thus telling Windows and applications using
+ the transaction log that changes can have happened on the volume
+ which are not recorded in $UsnJrnl.
+ - Detect the case when Windows has been hibernated (suspended to disk)
+ and if this is the case do not allow (re)mounting read-write to
+ prevent data corruption when you boot back into the suspended
+ Windows session.
+ - Implement extension of resident files using the normal file write
+ code paths, i.e. most very small files can be extended to be a little
+ bit bigger but not by much.
+ - Add new mount option "disable_sparse". (See list of mount options
+ above for details.)
+ - Improve handling of ntfs volumes with errors and strange boot sectors
+ in particular.
+ - Fix various bugs including a nasty deadlock that appeared in recent
+ kernels (around 2.6.11-2.6.12 timeframe).
+2.1.22:
+ - Improve handling of ntfs volumes with errors.
+ - Fix various bugs and race conditions.
+2.1.21:
+ - Fix several race conditions and various other bugs.
+ - Many internal cleanups, code reorganization, optimizations, and mft
+ and index record writing code rewritten to fit in with the changes.
+ - Update Documentation/filesystems/ntfs.txt with instructions on how to
+ use the Device-Mapper driver with NTFS ftdisk/LDM raid.
+2.1.20:
+ - Fix two stupid bugs introduced in 2.1.18 release.
+2.1.19:
+ - Minor bugfix in handling of the default upcase table.
+ - Many internal cleanups and improvements. Many thanks to Linus
+ Torvalds and Al Viro for the help and advice with the sparse
+ annotations and cleanups.
+2.1.18:
+ - Fix scheduling latencies at mount time. (Ingo Molnar)
+ - Fix endianness bug in a little traversed portion of the attribute
+ lookup code.
+2.1.17:
+ - Fix bugs in mount time error code paths.
+2.1.16:
+ - Implement access time updates (including mtime and ctime).
+ - Implement fsync(2), fdatasync(2), and msync(2) system calls.
+ - Enable the readv(2) and writev(2) system calls.
+ - Enable access via the asynchronous io (aio) API by adding support for
+ the aio_read(3) and aio_write(3) functions.
+2.1.15:
+ - Invalidate quotas when (re)mounting read-write.
+ NOTE: This now only leaves user space journalling on the side. (See
+ note for version 2.1.13, below.)
+2.1.14:
+ - Fix an NFSd caused deadlock reported by several users.
+2.1.13:
+ - Implement writing of inodes (access time updates are not implemented
+ yet so mounting with -o noatime,nodiratime is enforced).
+ - Enable writing out of resident files so you can now overwrite any
+ uncompressed, unencrypted, nonsparse file as long as you do not
+ change the file size.
+ - Add housekeeping of ntfs system files so that ntfsfix no longer needs
+ to be run after writing to an NTFS volume.
+ NOTE: This still leaves quota tracking and user space journalling on
+ the side but they should not cause data corruption. In the worst
+ case the charged quotas will be out of date ($Quota) and some
+ userspace applications might get confused due to the out of date
+ userspace journal ($UsnJrnl).
+2.1.12:
+ - Fix the second fix to the decompression engine from the 2.1.9 release
+ and some further internals cleanups.
+2.1.11:
+ - Driver internal cleanups.
+2.1.10:
+ - Force read-only (re)mounting of volumes with unsupported volume
+ flags and various cleanups.
+2.1.9:
+ - Fix two bugs in handling of corner cases in the decompression engine.
+2.1.8:
+ - Read the $MFT mirror and compare it to the $MFT and if the two do not
+ match, force a read-only mount and do not allow read-write remounts.
+ - Read and parse the $LogFile journal and if it indicates that the
+ volume was not shutdown cleanly, force a read-only mount and do not
+ allow read-write remounts. If the $LogFile indicates a clean
+ shutdown and a read-write (re)mount is requested, empty $LogFile to
+ ensure that Windows cannot cause data corruption by replaying a stale
+ journal after Linux has written to the volume.
+ - Improve time handling so that the NTFS time is fully preserved when
+ converted to kernel time and only up to 99 nano-seconds are lost when
+ kernel time is converted to NTFS time.
+2.1.7:
+ - Enable NFS exporting of mounted NTFS volumes.
+2.1.6:
+ - Fix minor bug in handling of compressed directories that fixes the
+ erroneous "du" and "stat" output people reported.
+2.1.5:
+ - Minor bug fix in attribute list attribute handling that fixes the
+ I/O errors on "ls" of certain fragmented files found by at least two
+ people running Windows XP.
+2.1.4:
+ - Minor update allowing compilation with all gcc versions (well, the
+ ones the kernel can be compiled with anyway).
+2.1.3:
+ - Major bug fixes for reading files and volumes in corner cases which
+ were being hit by Windows 2k/XP users.
+2.1.2:
+ - Major bug fixes alleviating the hangs in statfs experienced by some
+ users.
+2.1.1:
+ - Update handling of compressed files so people no longer get the
+ frequently reported warning messages about initialized_size !=
+ data_size.
+2.1.0:
+ - Add configuration option for developmental write support.
+ - Initial implementation of file overwriting. (Writes to resident files
+ are not written out to disk yet, so avoid writing to files smaller
+ than about 1kiB.)
+ - Intercept/abort changes in file size as they are not implemented yet.
+2.0.25:
+ - Minor bugfixes in error code paths and small cleanups.
+2.0.24:
+ - Small internal cleanups.
+ - Support for sendfile system call. (Christoph Hellwig)
+2.0.23:
+ - Massive internal locking changes to mft record locking. Fixes
+ various race conditions and deadlocks.
+ - Fix ntfs over loopback for compressed files by adding an
+ optimization barrier. (gcc was screwing up otherwise ?)
+ Thanks go to Christoph Hellwig for pointing these two out:
+ - Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs().
+ - Fix ntfs_free() for ia64 and parisc.
+2.0.22:
+ - Small internal cleanups.
+2.0.21:
+ These only affect 32-bit architectures:
+ - Check for, and refuse to mount too large volumes (maximum is 2TiB).
+ - Check for, and refuse to open too large files and directories
+ (maximum is 16TiB).
+2.0.20:
+ - Support non-resident directory index bitmaps. This means we now cope
+ with huge directories without problems.
+ - Fix a page leak that manifested itself in some cases when reading
+ directory contents.
+ - Internal cleanups.
+2.0.19:
+ - Fix race condition and improvements in block i/o interface.
+ - Optimization when reading compressed files.
+2.0.18:
+ - Fix race condition in reading of compressed files.
+2.0.17:
+ - Cleanups and optimizations.
+2.0.16:
+ - Fix stupid bug introduced in 2.0.15 in new attribute inode API.
+ - Big internal cleanup replacing the mftbmp access hacks by using the
+ new attribute inode API instead.
+2.0.15:
+ - Bug fix in parsing of remount options.
+ - Internal changes implementing attribute (fake) inodes allowing all
+ attribute i/o to go via the page cache and to use all the normal
+ vfs/mm functionality.
+2.0.14:
+ - Internal changes improving run list merging code and minor locking
+ change to not rely on BKL in ntfs_statfs().
+2.0.13:
+ - Internal changes towards using iget5_locked() in preparation for
+ fake inodes and small cleanups to ntfs_volume structure.
+2.0.12:
+ - Internal cleanups in address space operations made possible by the
+ changes introduced in the previous release.
+2.0.11:
+ - Internal updates and cleanups introducing the first step towards
+ fake inode based attribute i/o.
+2.0.10:
+ - Microsoft says that the maximum number of inodes is 2^32 - 1. Update
+ the driver accordingly to only use 32-bits to store inode numbers on
+ 32-bit architectures. This improves the speed of the driver a little.
+2.0.9:
+ - Change decompression engine to use a single buffer. This should not
+ affect performance except perhaps on the most heavy i/o on SMP
+ systems when accessing multiple compressed files from multiple
+ devices simultaneously.
+ - Minor updates and cleanups.
+2.0.8:
+ - Remove now obsolete show_inodes and posix mount option(s).
+ - Restore show_sys_files mount option.
+ - Add new mount option case_sensitive, to determine if the driver
+ treats file names as case sensitive or not.
+ - Mostly drop support for short file names (for backwards compatibility
+ we only support accessing files via their short file name if one
+ exists).
+ - Fix dcache aliasing issues wrt short/long file names.
+ - Cleanups and minor fixes.
+2.0.7:
+ - Just cleanups.
+2.0.6:
+ - Major bugfix to make compatible with other kernel changes. This fixes
+ the hangs/oopses on umount.
+ - Locking cleanup in directory operations (remove BKL usage).
+2.0.5:
+ - Major buffer overflow bug fix.
+ - Minor cleanups and updates for kernel 2.5.12.
+2.0.4:
+ - Cleanups and updates for kernel 2.5.11.
+2.0.3:
+ - Small bug fixes, cleanups, and performance improvements.
+2.0.2:
+ - Use default fmask of 0177 so that files are not executable by default.
+ If you want owner executable files, just use fmask=0077.
+ - Update for kernel 2.5.9 but preserve backwards compatibility with
+ kernel 2.5.7.
+ - Minor bug fixes, cleanups, and updates.
+2.0.1:
+ - Minor updates, primarily set the executable bit by default on files
+ so they can be executed.
+2.0.0:
+ - Started ChangeLog.
+
diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt
new file mode 100644
index 00000000..7618a287
--- /dev/null
+++ b/Documentation/filesystems/ocfs2.txt
@@ -0,0 +1,102 @@
+OCFS2 filesystem
+==================
+OCFS2 is a general purpose extent based shared disk cluster file
+system with many similarities to ext3. It supports 64 bit inode
+numbers, and has automatically extending metadata groups which may
+also make it attractive for non-clustered use.
+
+You'll want to install the ocfs2-tools package in order to at least
+get "mount.ocfs2" and "ocfs2_hb_ctl".
+
+Project web page: http://oss.oracle.com/projects/ocfs2
+Tools web page: http://oss.oracle.com/projects/ocfs2-tools
+OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+
+All code copyright 2005 Oracle except when otherwise noted.
+
+CREDITS:
+Lots of code taken from ext3 and other projects.
+
+Authors in alphabetical order:
+Joel Becker <joel.becker@oracle.com>
+Zach Brown <zach.brown@oracle.com>
+Mark Fasheh <mfasheh@suse.com>
+Kurt Hackel <kurt.hackel@oracle.com>
+Tao Ma <tao.ma@oracle.com>
+Sunil Mushran <sunil.mushran@oracle.com>
+Manish Singh <manish.singh@oracle.com>
+Tiger Yang <tiger.yang@oracle.com>
+
+Caveats
+=======
+Features which OCFS2 does not support yet:
+ - Directory change notification (F_NOTIFY)
+ - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)
+
+Mount options
+=============
+
+OCFS2 supports the following mount options:
+(*) == default
+
+barrier=1 This enables/disables write barriers: barrier=0
+ disables them, barrier=1 enables them.
+errors=remount-ro(*) Remount the filesystem read-only on an error.
+errors=panic Panic and halt the machine if an error occurs.
+intr (*) Allow signals to interrupt cluster operations.
+nointr Do not allow signals to interrupt cluster
+ operations.
+noatime Do not update access time.
+relatime(*) Update atime if the previous atime is older than
+ mtime or ctime
+strictatime Always update atime, but the minimum update interval
+ is specified by atime_quantum.
+atime_quantum=60(*) OCFS2 will not update atime unless this number
+ of seconds has passed since the last update.
+ Set to zero to always update atime. This option only
+ has an effect when used together with strictatime.
+data=ordered (*) All data is forced directly out to the main file
+ system prior to its metadata being committed to the
+ journal.
+data=writeback Data ordering is not preserved, data may be written
+ into the main file system after its metadata has been
+ committed to the journal.
+preferred_slot=0(*) During mount, try to use this filesystem slot first. If
+ it is in use by another node, the first empty one found
+ will be chosen. Invalid values will be ignored.
+commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata
+ every 'nrsec' seconds. The default value is 5 seconds.
+ This means that if you lose your power, you will lose
+ as much as the latest 5 seconds of work (your
+ filesystem will not be damaged though, thanks to the
+ journaling). This default value (or any low value)
+ will hurt performance, but it's good for data-safety.
+ Setting it to 0 will have the same effect as leaving
+ it at the default (5 seconds).
+ Setting it to very large values will improve
+ performance.
+localalloc=8(*) Allows custom localalloc size in MB. If the value is too
+ large, the fs will silently revert it to the default.
+localflocks This disables cluster aware flock.
+inode64 Indicates that Ocfs2 is allowed to create inodes at
+ any location in the filesystem, including those which
+ will result in inode numbers occupying more than 32
+ bits of significance.
+user_xattr (*) Enables Extended User Attributes.
+nouser_xattr Disables Extended User Attributes.
+acl Enables POSIX Access Control Lists support.
+noacl (*) Disables POSIX Access Control Lists support.
+resv_level=2 (*) Set how aggressive allocation reservations will be.
+ Valid values are between 0 (reservations off) to 8
+ (maximum space for reservations).
+dir_resv_level= (*) By default, directory reservations will scale with file
+ reservations - users should rarely need to change this
+ value. If allocation reservations are turned off, this
+ option will have no effect.
+coherency=full (*) Disallow concurrent O_DIRECT writes; the cluster inode
+ lock will be taken to force other nodes to drop their
+ caches, so full cluster coherency is guaranteed even
+ for O_DIRECT writes.
+coherency=buffered Allow concurrent O_DIRECT writes without an EX lock
+ among nodes, which gains high performance at the risk
+ of getting stale data on other nodes.
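+
+For example, to mount with relaxed data ordering and a longer commit interval
+(the device and mount point here are placeholders for your own setup):
+
+# mount -t ocfs2 -o noatime,data=writeback,commit=15 /dev/sdb1 /cluster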
diff --git a/Documentation/filesystems/omfs.txt b/Documentation/filesystems/omfs.txt
new file mode 100644
index 00000000..1d0d41ff
--- /dev/null
+++ b/Documentation/filesystems/omfs.txt
@@ -0,0 +1,106 @@
+Optimized MPEG Filesystem (OMFS)
+
+Overview
+========
+
+OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR
+and Rio Karma MP3 player. The filesystem is extent-based, utilizing
+block sizes from 2k to 8k, with hash-based directories. This
+filesystem driver may be used to read and write disks from these
+devices.
+
+Note, it is not recommended that this FS be used in place of a general
+filesystem for your own streaming media device. Native Linux filesystems
+will likely perform better.
+
+More information is available at:
+
+ http://linux-karma.sf.net/
+
+Various utilities, including mkomfs and omfsck, are included with
+omfsprogs, available at:
+
+ http://bobcopeland.com/karma/
+
+Instructions are included in its README.
+
+Options
+=======
+
+OMFS supports the following mount-time options:
+
+ uid=n - make all files owned by specified user
+ gid=n - make all files owned by specified group
+ umask=xxx - set permission umask to xxx
+ fmask=xxx - set umask to xxx for files
+ dmask=xxx - set umask to xxx for directories
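+
+For example (the device and mount point are placeholders for your own setup):
+
+ # mount -t omfs -o uid=1000,fmask=0133,dmask=0022 /dev/sdb1 /mnt/karma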
+
+Disk format
+===========
+
+OMFS discriminates between "sysblocks" and normal data blocks. The sysblock
+group consists of super block information, file metadata, directory structures,
+and extents. Each sysblock has a header containing CRCs of the entire
+sysblock, and may be mirrored in successive blocks on the disk. A sysblock may
+have a smaller size than a data block, but since they are both addressed by the
+same 64-bit block number, any remaining space in the smaller sysblock is
+unused.
+
+Sysblock header information:
+
+struct omfs_header {
+ __be64 h_self; /* FS block where this is located */
+ __be32 h_body_size; /* size of useful data after header */
+ __be16 h_crc; /* crc-ccitt of body_size bytes */
+ char h_fill1[2];
+ u8 h_version; /* version, always 1 */
+ char h_type; /* OMFS_INODE_X */
+ u8 h_magic; /* OMFS_IMAGIC */
+ u8 h_check_xor; /* XOR of header bytes before this */
+ __be32 h_fill2;
+};
+
+Files and directories are both represented by omfs_inode:
+
+struct omfs_inode {
+ struct omfs_header i_head; /* header */
+ __be64 i_parent; /* parent containing this inode */
+ __be64 i_sibling; /* next inode in hash bucket */
+ __be64 i_ctime; /* ctime, in milliseconds */
+ char i_fill1[35];
+ char i_type; /* OMFS_[DIR,FILE] */
+ __be32 i_fill2;
+ char i_fill3[64];
+ char i_name[OMFS_NAMELEN]; /* filename */
+ __be64 i_size; /* size of file, in bytes */
+};
+
+Directories in OMFS are implemented as a large hash table. Filenames are
+hashed then prepended into the bucket list beginning at OMFS_DIR_START.
+Lookup requires hashing the filename, then seeking across i_sibling pointers
+until a match is found on i_name. Empty buckets are represented by block
+pointers with all-1s (~0).
+
+A file is an omfs_inode structure followed by an extent table beginning at
+OMFS_EXTENT_START:
+
+struct omfs_extent_entry {
+ __be64 e_cluster; /* start location of a set of blocks */
+ __be64 e_blocks; /* number of blocks after e_cluster */
+};
+
+struct omfs_extent {
+ __be64 e_next; /* next extent table location */
+ __be32 e_extent_count; /* total # extents in this table */
+ __be32 e_fill;
+ struct omfs_extent_entry e_entry; /* start of extent entries */
+};
+
+Each extent entry holds the block offset followed by the number of blocks
+allocated to the extent. The final entry in each table is a terminator with
+e_cluster being ~0 and e_blocks being the one's complement of the total number
+of blocks in the table.
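+
+As a rough illustration of this convention, here is a userspace sketch (not
+the driver's code; __be64/__be32 are stood in by uint64_t/uint32_t read via
+be64toh(), and the structures are the ones shown above):
+
+#include <endian.h>
+#include <stdint.h>
+
+/* Count the data blocks described by one extent table, checking the
+ * terminator. Returns 0 if the table is corrupt. */
+static uint64_t count_extent_blocks(const struct omfs_extent *ext)
+{
+	const struct omfs_extent_entry *e = &ext->e_entry;
+	uint64_t total = 0;
+
+	while (be64toh(e->e_cluster) != ~0ULL) {	/* ~0 marks the terminator */
+		total += be64toh(e->e_blocks);
+		e++;
+	}
+	/* The terminator's e_blocks is the one's complement of the total. */
+	if (~be64toh(e->e_blocks) != total)
+		return 0;
+	return total;
+}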
+
+If this table overflows, a continuation inode is written and pointed to by
+e_next. These have a header but lack the rest of the inode structure.
+
diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt
new file mode 100644
index 00000000..3571667c
--- /dev/null
+++ b/Documentation/filesystems/path-lookup.txt
@@ -0,0 +1,382 @@
+Path walking and name lookup locking
+====================================
+
+Path resolution is the act of finding the dentry corresponding to a path name
+string, by performing a path walk. Typically, for every open(), stat() etc.,
+the path name
+will be resolved. Paths are resolved by walking the namespace tree, starting
+with the first component of the pathname (eg. root or cwd) with a known dentry,
+then finding the child of that dentry, which is named the next component in the
+path string. Then repeating the lookup from the child dentry and finding its
+child with the next element, and so on.
+
+Since it is a frequent operation for workloads like multiuser environments and
+web servers, it is important to optimize this code.
+
+Path walking synchronisation history:
+Prior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and
+thus in every component during path look-up. Since 2.5.10 onwards, fast-walk
+algorithm changed this by holding the dcache_lock at the beginning and walking
+as many cached path component dentries as possible. This significantly
+decreases the number of acquisitions of dcache_lock. However it also increases
+the lock hold time significantly and affects performance in large SMP machines.
+Since 2.5.62 kernel, dcache has been using a new locking model that uses RCU to
+make dcache look-up lock-free.
+
+All the above algorithms required taking a lock and reference count on the
+dentry that was looked up, so that may be used as the basis for walking the
+next path element. This is inefficient and unscalable. It is inefficient
+because the locks and atomic operations required for every dentry element
+slow things down. It is not scalable because many parallel applications that
+are path-walk intensive tend to do path lookups starting from a common dentry
+(usually, the root "/" or current working directory). So contention on these
+common path elements causes lock and cacheline queueing.
+
+Since 2.6.38, RCU is used to make a significant part of the entire path walk
+(including dcache look-up) completely "store-free" (so, no locks, atomics, or
+even stores into cachelines of common dentries). This is known as "rcu-walk"
+path walking.
+
+Path walking overview
+=====================
+
+A name string specifies a start (root directory, cwd, fd-relative) and a
+sequence of elements (directory entry names), which together refer to a path in
+the namespace. A path is represented as a (dentry, vfsmount) tuple. The name
+elements are sub-strings, separated by '/'.
+
+Name lookups will want to find a particular path that a name string refers to
+(usually the final element, or parent of final element). This is done by taking
+the path given by the name's starting point (which we know in advance -- eg.
+current->fs->cwd or current->fs->root) as the first parent of the lookup. Then
+iteratively for each subsequent name element, look up the child of the current
+parent with the given name and if it is not the desired entry, make it the
+parent for the next lookup.
+
+A parent, of course, must be a directory, and we must have appropriate
+permissions on the parent inode to be able to walk into it.
+
+Turning the child into a parent for the next lookup requires more checks and
+procedures. Symlinks essentially substitute the symlink name for the target
+name in the name string, and require some recursive path walking. Mount points
+must be followed into (thus changing the vfsmount that subsequent path elements
+refer to), switching from the mount point path to the root of the particular
+mounted vfsmount. These behaviours are variously modified depending on the
+exact path walking flags.
+
+Path walking then must, broadly, do several particular things:
+- find the start point of the walk;
+- perform permissions and validity checks on inodes;
+- perform dcache hash name lookups on (parent, name element) tuples;
+- traverse mount points;
+- traverse symlinks;
+- lookup and create missing parts of the path on demand.
+
+Safe store-free look-up of dcache hash table
+============================================
+
+Dcache name lookup
+------------------
+In order to look up a dcache (parent, name) tuple, we take a hash on the tuple
+and use that to select a bucket in the dcache-hash table. The list of entries
+in that bucket is then walked, and we do a full comparison of each entry
+against our (parent, name) tuple.
+
+The hash lists are RCU protected, so list walking is not serialised with
+concurrent updates (insertion, deletion from the hash). This is a standard RCU
+list application with the exception of renames, which will be covered below.
+
+Parent and name members of a dentry, as well as its membership in the dcache
+hash, and its inode are protected by the per-dentry d_lock spinlock. A
+reference is taken on the dentry (while the fields are verified under d_lock),
+and this stabilises its d_inode pointer and actual inode. This gives a stable
+point to perform the next step of our path walk against.
+
+These members are also protected by d_seq seqlock, although this offers
+read-only protection and no durability of results, so care must be taken when
+using d_seq for synchronisation (see seqcount based lookups, below).
+
+Renames
+-------
+Back to the rename case. In usual RCU protected lists, the only operations
+that will happen to an object are insertion and, eventually, removal from the
+list. The object will not be reused until an RCU grace period is complete.
+This ensures the RCU list traversal primitives can run over the object without
+problems (see RCU documentation for how this works).
+
+However when a dentry is renamed, its hash value can change, requiring it to be
+moved to a new hash list. Allocating and inserting a new alias would be
+expensive and also problematic for directory dentries. Latency would be far
+too high to wait for a grace period after removing the dentry and before inserting
+it in the new hash bucket. So what is done is to insert the dentry into the
+new list immediately.
+
+However, when the dentry's list pointers are updated to point to objects in the
+new list before waiting for a grace period, this can result in a concurrent RCU
+lookup of the old list veering off into the new (incorrect) list and missing
+the remaining dentries on the list.
+
+There is no fundamental problem with walking down the wrong list, because the
+dentry comparisons will never match. However it is fatal to miss a matching
+dentry. So a seqlock is used to detect when a rename has occurred, and so the
+lookup can be retried.
+
+ 1 2 3
+ +---+ +---+ +---+
+hlist-->| N-+->| N-+->| N-+->
+head <--+-P |<-+-P |<-+-P |
+ +---+ +---+ +---+
+
+Rename of dentry 2 may require it deleted from the above list, and inserted
+into a new list. Deleting 2 gives the following list.
+
+ 1 3
+ +---+ +---+ (don't worry, the longer pointers do not
+hlist-->| N-+-------->| N-+-> impose a measurable performance overhead
+head <--+-P |<--------+-P | on modern CPUs)
+ +---+ +---+
+ ^ 2 ^
+ | +---+ |
+ | | N-+----+
+ +----+-P |
+ +---+
+
+This is a standard RCU-list deletion, which leaves the deleted object's
+pointers intact, so a concurrent list walker that is currently looking at
+object 2 will correctly continue to object 3 when it is time to traverse the
+next object.
+
+However, when inserting object 2 onto a new list, we end up with this:
+
+ 1 3
+ +---+ +---+
+hlist-->| N-+-------->| N-+->
+head <--+-P |<--------+-P |
+ +---+ +---+
+ 2
+ +---+
+ | N-+---->
+ <----+-P |
+ +---+
+
+Because we didn't wait for a grace period, there may be a concurrent lookup
+still at 2. Now when it follows 2's 'next' pointer, it will walk off into
+another list without ever having checked object 3.
+
+A related, but distinctly different, issue is that of rename atomicity versus
+lookup operations. If a file is renamed from 'A' to 'B', a lookup must only
+find either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup
+of 'B' must succeed (note the reverse is not true).
+
+Between deleting the dentry from the old hash list, and inserting it on the new
+hash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same
+rename seqlock is also used to cover this race in much the same way, by
+retrying a negative lookup result if a rename was in progress.
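+
+In code, this is the classic seqlock read-retry pattern. The following is a
+simplified form of d_lookup() from fs/dcache.c, where rename_lock is the
+global rename seqlock:
+
+	struct dentry *d_lookup(struct dentry *parent, struct qstr *name)
+	{
+		struct dentry *dentry = NULL;
+		unsigned seq;
+
+		do {
+			seq = read_seqbegin(&rename_lock);
+			dentry = __d_lookup(parent, name); /* RCU hash list walk */
+			if (dentry)
+				break;
+		} while (read_seqretry(&rename_lock, seq));
+		return dentry;
+	}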
+
+Seqcount based lookups
+----------------------
+In refcount based dcache lookups, d_lock is used to serialise access to
+the dentry, stabilising it while comparing its name and parent and then
+taking a reference count (the reference count then gives a stable place to
+start the next part of the path walk from).
+
+As explained above, we would like to do path walking without taking locks or
+reference counts on intermediate dentries along the path. To do this, a per
+dentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry
+looks like (its name, parent, and inode). That snapshot is then used to start
+the next part of the path walk. When loading the coherent snapshot under d_seq,
+care must be taken to load the members up-front, and use those pointers rather
+than reloading from the dentry later on (otherwise we'd have interesting things
+like d_inode going NULL underneath us, if the name was unlinked).
+
+Also important is to avoid performing any destructive operations (pretty much:
+no non-atomic stores to shared data), and to recheck the seqcount when we are
+"done" with the operation. Retry or abort if the seqcount does not match.
+Avoiding destructive or changing operations means we can easily unwind from
+failure.
+
+What this means is that a caller, provided they are holding RCU lock to
+protect the dentry object from disappearing, can perform a seqcount based
+lookup which does not increment the refcount on the dentry or write to
+it in any way. This returned dentry can be used for subsequent operations,
+provided that d_seq is rechecked after that operation is complete.
+
+Inodes are also rcu freed, so the seqcount lookup dentry's inode may also be
+queried for permissions.
+
+With these two parts of the puzzle, we can do path lookups without taking
+locks or refcounts on dentry elements.
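+
+As an illustrative sketch (the real logic lives in fs/namei.c and handles
+more cases), one step of the walk samples the child under its d_seq and then
+re-verifies the parent's sequence before advancing; 'pseq' is the sequence
+that was recorded when the parent itself was sampled:
+
+	static int walk_one_level(struct dentry *parent, unsigned pseq,
+				  struct dentry *child, unsigned *seqp,
+				  struct inode **inodep)
+	{
+		unsigned seq = read_seqcount_begin(&child->d_seq);
+
+		/* Load members up-front; never re-read from the dentry
+		 * later, or d_inode may go NULL underneath us. */
+		*inodep = child->d_inode;
+		if (!*inodep)
+			return -ECHILD;
+
+		/* Interlock: the child snapshot only counts if the parent
+		 * did not change while we were finding the child. */
+		if (read_seqcount_retry(&parent->d_seq, pseq))
+			return -ECHILD;
+
+		*seqp = seq;	/* the child becomes the new parent */
+		return 0;
+	}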
+
+RCU-walk path walking design
+============================
+
+Path walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk
+is the traditional[*] way of performing dcache lookups using d_lock to
+serialise concurrent modifications to the dentry and take a reference count on
+it. ref-walk is simple and obvious, and may sleep, take locks, etc while path
+walking is operating on each dentry. rcu-walk uses seqcount based dentry
+lookups, and can perform lookup of intermediate elements without any stores to
+shared data in the dentry or inode. rcu-walk can not be applied to all cases,
+eg. if the filesystem must sleep or perform non trivial operations, rcu-walk
+must be switched to ref-walk mode.
+
+[*] RCU is still used for the dentry hash lookup in ref-walk, but not the full
+ path walk.
+
+Where ref-walk uses a stable, refcounted ``parent'' to walk the remaining
+path string, rcu-walk uses a d_seq protected snapshot. When looking up a
+child of this parent snapshot, we open d_seq critical section on the child
+before closing d_seq critical section on the parent. This gives an interlocking
+ladder of snapshots to walk down.
+
+
+ proc 101
+ /----------------\
+ / comm: "vi" \
+ / fs.root: dentry0 \
+ \ fs.cwd: dentry2 /
+ \ /
+ \----------------/
+
+So when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will
+start from current->fs->root, which is a pinned dentry. Alternatively,
+"./test.c" would start from cwd; both names refer to the same path in
+the context of proc101.
+
+ dentry 0
+ +---------------------+ rcu-walk begins here, we note d_seq, check the
+ | name: "/" | inode's permission, and then look up the next
+ | inode: 10 | path element which is "home"...
+ | children:"home", ...|
+ +---------------------+
+ |
+ dentry 1 V
+ +---------------------+ ... which brings us here. We find dentry1 via
+ | name: "home" | hash lookup, then note d_seq and compare name
+ | inode: 678 | string and parent pointer. When we have a match,
+ | children:"npiggin" | we now recheck the d_seq of dentry0. Then we
+ +---------------------+ check inode and look up the next element.
+ |
+ dentry2 V
+ +---------------------+ Note: if dentry0 is now modified, lookup is
+ | name: "npiggin" | not necessarily invalid, so we need only keep a
+ | inode: 543 | parent for d_seq verification, and grandparents
+ | children:"a.c", ... | can be forgotten.
+ +---------------------+
+ |
+ dentry3 V
+ +---------------------+ At this point we have our destination dentry.
+ | name: "a.c" | We now take its d_lock, verify d_seq of this
+ | inode: 14221 | dentry. If that checks out, we can increment
+ | children:NULL | its refcount because we're holding d_lock.
+ +---------------------+
+
+Taking a refcount on a dentry from rcu-walk mode, by taking its d_lock,
+re-checking its d_seq, and then incrementing its refcount is called
+"dropping rcu" or dropping from rcu-walk into ref-walk mode.
+
+It is, in some sense, a bit of a house of cards. If the seqcount check of the
+parent snapshot fails, the house comes down, because we had closed the d_seq
+section on the grandparent, so we have nothing left to stand on. In that case,
+the path walk must be fully restarted (which we do in ref-walk mode, to avoid
+live locks). It is costly to have a full restart, but fortunately they are
+quite rare.
+
+When we reach a point where sleeping is required, or a filesystem callout
+requires ref-walk, then instead of restarting the walk, we attempt to drop rcu
+at the last known good dentry we have. Avoiding a full restart in ref-walk in
+these cases is fundamental for performance and scalability because blocking
+operations such as creates and unlinks are not uncommon.
+
+The detailed design for rcu-walk is like this:
+* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
+* Take the RCU lock for the entire path walk, starting with the acquisition
+ of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
+ not required for dentry persistence.
+* synchronize_rcu is called when unregistering a filesystem, so we can
+ access d_ops and i_ops during rcu-walk.
+* Similarly take the vfsmount lock for the entire path walk. So now mnt
+ refcounts are not required for persistence. Also we are free to perform mount
+ lookups, and to assume dentry mount points and mount roots are stable up and
+ down the path.
+* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
+ so we can load this tuple atomically, and also check whether any of its
+ members have changed.
+* Dentry lookups (based on parent, candidate string tuple) recheck the parent
+ sequence after the child is found in case anything changed in the parent
+ during the path walk.
+* inode is also RCU protected so we can load d_inode and use the inode for
+ limited things.
+* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
+* i_op can be loaded.
+* When the destination dentry is reached, drop rcu there (ie. take d_lock,
+ verify d_seq, increment refcount).
+* If seqlock verification fails anywhere along the path, do a full restart
+ of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of
+ a better errno) to signal an rcu-walk failure.
+
+The cases where rcu-walk cannot continue are:
+* NULL dentry (ie. any uncached path element)
+* Following links
+
+It may be possible eventually to make following links rcu-walk aware.
+
+Uncached path elements will always require dropping to ref-walk mode, at the
+very least because i_mutex needs to be grabbed, and objects allocated.
+
+Final note:
+"store-free" path walking is not strictly store free. We take vfsmount lock
+and refcounts (both of which can be made per-cpu), and we also store to the
+stack (which is essentially CPU-local), and we also have to take locks and
+a refcount on the final dentry.
+
+The point is that shared data, where practically possible, is not locked
+or stored into. The result is massive improvements in performance and
+scalability of path resolution.
+
+
+Interesting statistics
+======================
+
+The following table gives rcu lookup statistics for a few simple workloads
+(2s12c24t Westmere, debian non-graphical system). "Ungraceful" are attempts to
+drop rcu that fail due to d_seq failure and require the entire path lookup to
+be done again. Other cases are successful rcu-drops that are required before
+the final element: nodentry for a missing dentry, revalidate for a filesystem
+revalidate routine requiring an rcu drop, permission for a permission check
+requiring a drop, and link for symlink traversal requiring a drop.
+
+ rcu-lookups restart nodentry link revalidate permission
+bootup 47121 0 4624 1010 10283 7852
+dbench 25386793 0 6778659(26.7%) 55 549 1156
+kbuild 2696672 10 64442(2.3%) 108764(4.0%) 1 1590
+git diff 39605 0 28 2 0 106
+vfstest 24185492 4945 708725(2.9%) 1076136(4.4%) 0 2651
+
+What this shows is that failed rcu-walk lookups, ie. ones that are restarted
+entirely with ref-walk, are quite rare. Even the "vfstest" case, which
+specifically has concurrent renames/mkdir/rmdir/creat/unlink/etc to exercise
+such races, is not showing a huge amount of restarts.
+
+Dropping from rcu-walk to ref-walk means that we have encountered a dentry where
+the reference count needs to be taken for some reason. This is either because
+we have reached the target of the path walk, or because we have encountered a
+condition that can't be resolved in rcu-walk mode. Ideally, we drop rcu-walk
+only when we have reached the target dentry, so the other statistics show where
+this does not happen.
+
+Note that a graceful drop from rcu-walk mode due to something such as the
+dentry not existing (which can be common) is not necessarily a failure of
+rcu-walk scheme, because some elements of the path may have been walked in
+rcu-walk mode. The further we get from common path elements (such as cwd or
+root), the less contended the dentry is likely to be. The closer we are to
+common path elements, the more likely they will exist in dentry cache.
+
+
+Papers and other documentation on dcache locking
+================================================
+
+1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
+
+2. http://lse.sourceforge.net/locking/dcache/dcache.html
+
+
diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt
new file mode 100644
index 00000000..dcf83358
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/design_notes.txt
@@ -0,0 +1,71 @@
+POHMELFS: Parallel Optimized Host Message Exchange Layered File System.
+
+ Evgeniy Polyakov <zbr@ioremap.net>
+
+Homepage: http://www.ioremap.net/projects/pohmelfs
+
+POHMELFS first began as a network filesystem with coherent local data and
+metadata caches but is now evolving into a parallel distributed filesystem.
+
+Main features of this FS include:
+ * Locally coherent cache for data and metadata with (potentially) byte-range locks.
+ Since all Linux filesystems lock the whole inode during writing, the
+ algorithm is very simple and does not use byte-ranges, although they are
+ sent in locking messages.
+ * Completely async processing of all events except creation of hard and symbolic
+ links, and rename events.
+ Object creation and data reading and writing are processed asynchronously.
+ * Flexible object architecture optimized for network processing.
+ Ability to create long paths to objects and remove arbitrarily huge
+ directories with a single network command (for example, removing the whole
+ kernel tree via a single network command).
+ * Very high performance.
+ * Fast and scalable multithreaded userspace server. Being in userspace it works
+ with any underlying filesystem and still is much faster than the async
+ in-kernel NFS server.
+ * Client is able to switch between different servers (if one goes down, the
+ client automatically reconnects to the second, and so on).
+ * Transactions support. Full failover for all operations.
+ Resending transactions to different servers on timeout or error.
+ * Read request (data read, directory listing, lookup requests) balancing between multiple servers.
+ * Write requests are replicated to multiple servers and completed only when all of them are acked.
+ * Ability to add and/or remove servers from the working set at run-time.
+ * Strong authentication and possible data encryption in network channel.
+ * Extended attributes support.
+
+POHMELFS is based on transactions, which are potentially long-standing objects that live
+in the client's memory. Each transaction contains all the information needed to process a given
+command (or set of commands, which is frequently used during data writing: single transactions
+can contain creation and data writing commands). Transactions are committed by all the servers
+to which they are sent and, in case of failures, are eventually resent or dropped with an error.
+For example, reading will return an error if no servers are available.
+
+POHMELFS uses an asynchronous approach to data processing. Courtesy of transactions, it is
+possible to detach replies from requests and, if the command requires data to be received, the
+caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different
+servers and async threads will pick up replies in parallel, find appropriate transactions in the
+system and put the data where it belongs (like the page or inode cache).
+
+The main feature of POHMELFS is its writeback data and metadata cache.
+Only a few non-performance critical operations use the write-through cache and
+are synchronous: hard and symbolic link creation, and object rename. Creation,
+removal of objects and data writing are asynchronous and are sent to
+the server during system writeback. Only one writer at a time is allowed for any
+given inode, which is guarded by an appropriate locking protocol.
+Because of this feature, POHMELFS is extremely fast at metadata intensive
+workloads and can fully utilize the bandwidth to the servers when doing bulk
+data transfers.
+
+POHMELFS clients operate with a working set of servers and are capable of balancing read-only
+operations (like lookups or directory listings) between them according to IO priorities.
+Administrators can add or remove servers from the set at run-time via special commands (described
+in Documentation/pohmelfs/info.txt file). Writes are replicated to all servers, which are connected
+with write permission turned on. IO priority and permissions can be changed in run-time.
+
+POHMELFS is capable of full data channel encryption and/or strong crypto hashing.
+One can select any kernel-supported cipher, encryption mode, hash type and operation mode
+(hmac or digest). It is also possible to use both or neither (the default). Crypto configuration
+is checked at mount time and, if the server does not support it, the appropriate capabilities
+will be disabled or mount will fail (if the 'crypto_fail_unsupported' mount option is specified).
+Crypto performance heavily depends on the number of crypto threads, which asynchronously perform
+crypto operations and send the resulting data to the server or submit it up the stack. This number
+can be controlled via a mount option.
diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt
new file mode 100644
index 00000000..db2e4139
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/info.txt
@@ -0,0 +1,99 @@
+POHMELFS usage information.
+
+Mount options.
+All options except the index, the number of crypto threads and the maximum IO size can be changed via remount.
+
+idx=%u
+ Each mountpoint is associated with a special index via this option.
+ The administrator can add or remove servers from the given index, so that all
+ mounts attached to it are updated.
+ Default is 0.
+
+trans_scan_timeout=%u
+ This timeout, expressed in milliseconds, specifies how frequently the transaction
+ trees are scanned for stale requests, which are then resent or, if the number of
+ retries exceeds the specified limit, dropped with an error.
+ Default is 5 seconds.
+
+drop_scan_timeout=%u
+ Internal timeout, expressed in milliseconds, which specifies how frequently
+ inodes marked to be dropped are freed. It also specifies how frequently
+ the system checks whether servers have to be added to or removed from the current working set.
+ Default is 1 second.
+
+wait_on_page_timeout=%u
+ Number of milliseconds to wait for a reply from the remote server for a data reading command.
+ If this timeout is exceeded, reading returns an error.
+ Default is 5 seconds.
+
+trans_retries=%u
+ This is the number of times that a transaction will be resent to a server that did
+ not answer for the last @trans_scan_timeout milliseconds.
+ When the number of resends exceeds this limit, the transaction is completed with an error.
+ Default is 5 resends.
+
+crypto_thread_num=%u
+ Number of crypto processing threads. Threads are used both for RX and TX traffic.
+ Default is 2, or no threads if crypto operations are not supported.
+
+trans_max_pages=%u
+ Maximum number of pages in a single transaction. This parameter also controls
+ the number of pages allocated for crypto processing (each crypto thread has a
+ pool of pages, the number of which is equal to 'trans_max_pages').
+ Default is 100 pages.
+
+crypto_fail_unsupported
+ If specified, mount will fail if the server does not support the requested crypto operations.
+ By default, mount will disable non-matching crypto operations.
+
+mcache_timeout=%u
+ Maximum number of milliseconds to wait for mcache objects to be processed.
+ Mcache includes locks (a given lock should be granted by the server) and attributes
+ (they should be fully received in the given timeframe).
+ Default is 5 seconds.
+
+Usage examples.
+
+Add server server1.net:1025 into the working set with index $idx,
+with the appropriate hash key file and cipher key file:
+$cfg A add -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key
+
+Mount the filesystem with the given index $idx on the /mnt mountpoint.
+The client will connect to all servers specified in the working set via the previous command:
+mount -t pohmel -o idx=$idx q /mnt
+
+Change permissions to read-only ('-I 1' option; '-I 2' means write-only, '-I 3' read-write):
+$cfg A modify -a server1.net -p 1025 -i $idx -I 1
+
+Change IO priority to 123 (the node with the highest priority gets read requests):
+$cfg A modify -a server1.net -p 1025 -i $idx -P 123
+
+One can check the current status of all connections in the mountstats file:
+# cat /proc/$PID/mountstats
+...
+device none mounted on /mnt with fstype pohmel
+idx addr(:port) socket_type protocol active priority permissions
+0 server1.net:1026 1 6 1 250 1
+0 server2.net:1025 1 6 1 123 3
+
+Server installation.
+
+Create a server which listens on port 1025 and address 0.0.0.0.
+The working root directory (note that the server chroots there, so you have to have appropriate
+permissions) is set to /mnt. The server will negotiate the hash/cipher with the client, if the
+client requested it, using the given key files.
+The number of working threads is set to 10.
+
+# ./fserver -a 0.0.0.0 -p 1025 -r /mnt -w 10 -K hash_key -k cipher_key
+
+ -A 6 - listen on ipv6 address. Default: Disabled.
+ -r root - path to root directory. Default: /tmp.
+ -a addr - listen address. Default: 0.0.0.0.
+ -p port - listen port. Default: 1025.
+ -w workers - number of workers per connected client. Default: 1.
+ -K file - hash key file. Default: none.
+ -k file - cipher key file. Default: none.
+ -h - this help.
+
+The number of worker threads specifies how many workers will be created for each client.
+Bulk single-client transfers are usually better handled with a smaller number (like 1-3).
diff --git a/Documentation/filesystems/pohmelfs/network_protocol.txt b/Documentation/filesystems/pohmelfs/network_protocol.txt
new file mode 100644
index 00000000..65e03dd4
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/network_protocol.txt
@@ -0,0 +1,227 @@
+POHMELFS network protocol.
+
+The basic structure used in network communication is the following command:
+
+struct netfs_cmd
+{
+ __u16 cmd; /* Command number */
+ __u16 csize; /* Attached crypto information size */
+ __u16 cpad; /* Attached padding size */
+ __u16 ext; /* External flags */
+ __u32 size; /* Size of the attached data */
+ __u32 trans; /* Transaction id */
+ __u64 id; /* Object ID to operate on. Used for feedback.*/
+ __u64 start; /* Start of the object. */
+ __u64 iv; /* IV sequence */
+ __u8 data[0];
+};
+
+Commands can be embedded into a transaction command (which in turn has its own command number),
+so one can extend the protocol as needed without breaking backward compatibility, as long
+as old commands are supported. All string lengths include the trailing 0 byte.
+
+All commands are transferred over the network in big-endian. CPU endianness is used on the end peers.
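+
+As a minimal illustration (a hypothetical helper, not taken from the POHMELFS
+sources), converting a command header to network byte order before transmission
+could look like this:
+
+static void netfs_cmd_to_network(struct netfs_cmd *cmd)
+{
+        /* Convert every header field from CPU to big-endian byte order. */
+        cmd->cmd = cpu_to_be16(cmd->cmd);
+        cmd->csize = cpu_to_be16(cmd->csize);
+        cmd->cpad = cpu_to_be16(cmd->cpad);
+        cmd->ext = cpu_to_be16(cmd->ext);
+        cmd->size = cpu_to_be32(cmd->size);
+        cmd->trans = cpu_to_be32(cmd->trans);
+        cmd->id = cpu_to_be64(cmd->id);
+        cmd->start = cpu_to_be64(cmd->start);
+        cmd->iv = cpu_to_be64(cmd->iv);
+}
+
+The receiving side would apply the matching be*_to_cpu() conversions before
+interpreting the header.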
+
+@cmd - command number, which specifies the command to be processed. The following
+ commands are currently used:
+
+ NETFS_READDIR = 1, /* Read directory for given inode number */
+ NETFS_READ_PAGE, /* Read data page from the server */
+ NETFS_WRITE_PAGE, /* Write data page to the server */
+ NETFS_CREATE, /* Create directory entry */
+ NETFS_REMOVE, /* Remove directory entry */
+ NETFS_LOOKUP, /* Lookup single object */
+ NETFS_LINK, /* Create a link */
+ NETFS_TRANS, /* Transaction */
+ NETFS_OPEN, /* Open intent */
+ NETFS_INODE_INFO, /* Metadata cache coherency synchronization message */
+ NETFS_PAGE_CACHE, /* Page cache invalidation message */
+ NETFS_READ_PAGES, /* Read multiple contiguous pages in one go */
+ NETFS_RENAME, /* Rename object */
+ NETFS_CAPABILITIES, /* Capabilities of the client, for example supported crypto */
+ NETFS_LOCK, /* Distributed lock message */
+ NETFS_XATTR_SET, /* Set extended attribute */
+ NETFS_XATTR_GET, /* Get extended attribute */
+
+@ext - external flags. Used by different commands to specify some extra arguments
+ like partial size of the embedded objects or creation flags.
+
+@size - size of the attached data. For NETFS_READ_PAGE and NETFS_READ_PAGES no data is attached,
+ but the size of the requested data is incorporated here. It does not include the size of the
+ command header (struct netfs_cmd) itself.
+
+@id - id of the object this command operates on. Each command can use it for its own purposes.
+
+@start - start of the object this command operates on. Each command can use it for its own purposes.
+
+@csize, @cpad - size and padding size of the (attached if needed) crypto information.
+
+Command specifications.
+
+@NETFS_READDIR
+This command is used to sync the content of the remote directory to the client.
+
+@ext - length of the path to object.
+@size - the same.
+@id - local inode number of the directory to read.
+@start - zero.
+
+
+@NETFS_READ_PAGE
+This command is used to read data from the remote server.
+The data size does not exceed the local page cache size.
+
+@id - inode number.
+@start - first byte offset.
+@size - number of bytes to read plus length of the path to object.
+@ext - object path length.
+
+
+@NETFS_CREATE
+Used to create an object.
+It does not require that all directories on top of the object have
+already been created; it will create them automatically. Each object has
+an associated @netfs_path_entry data structure, which contains the creation
+mode (permissions and type) and the length of the name as well as the name itself.
+
+@start - 0
+@size - size of all the data structures needed to create the path
+@id - local inode number
+@ext - 0
+
+
+@NETFS_REMOVE
+Used to remove an object.
+
+@ext - length of the path to object.
+@size - the same.
+@id - local inode number.
+@start - zero.
+
+
+@NETFS_LOOKUP
+Look up information about an object on the server.
+
+@ext - length of the path to object.
+@size - the same.
+@id - local inode number of the directory to look the object up in.
+@start - local inode number of the object to look at.
+
+
+@NETFS_LINK
+Create a hard link or a symlink.
+Command is sent as "object_path|target_path".
+
+@size - size of the above string.
+@id - parent local inode number.
+@start - 1 for symlink, 0 for hardlink.
+@ext - size of the "object_path" above.
+
+
+@NETFS_TRANS
+Transaction header.
+
+@size - incorporates all embedded command sizes including their header sizes.
+@start - transaction generation number - unique id used to find transaction.
+@ext - transaction flags. Unused at the moment.
+@id - 0.
+
+
+@NETFS_OPEN
+Open intent for a given transaction.
+
+@id - local inode number.
+@start - 0.
+@size - path length to the object.
+@ext - open flags (O_RDWR and so on).
+
+
+@NETFS_INODE_INFO
+Metadata update command.
+It is sent to servers when attributes of the object are changed and received
+when data or metadata were updated. It operates with the following structure:
+
+struct netfs_inode_info
+{
+ unsigned int mode;
+ unsigned int nlink;
+ unsigned int uid;
+ unsigned int gid;
+ unsigned int blocksize;
+ unsigned int padding;
+ __u64 ino;
+ __u64 blocks;
+ __u64 rdev;
+ __u64 size;
+ __u64 version;
+};
+
+It effectively mirrors the data returned by stat(2).
+
+
+@ext - path length to the object.
+@size - the same plus size of the netfs_inode_info structure.
+@id - local inode number.
+@start - 0.
+
+
+@NETFS_PAGE_CACHE
+This command is only received by clients. It contains information about
+the page to be marked as not up-to-date.
+
+@id - client's inode number.
+@start - last byte of the page to be invalidated. If it is not equal to
+ the current inode size, the inode will be vmtruncated().
+@size - 0
+@ext - 0
+
+
+@NETFS_READ_PAGES
+Used to read multiple contiguous pages in one go.
+
+@start - first byte of the contiguous region to read.
+@size - consists of two fields: the lower 8 bits represent the page cache shift
+ used by the client, the upper 3 bytes hold the number of pages.
+@id - local inode number.
+@ext - path length to the object.
+
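+As an illustrative sketch (the helper names are made up, not from the POHMELFS
+sources), packing and unpacking this @size encoding could look like:
+
+static inline __u32 netfs_pack_read_pages(unsigned int page_shift,
+                                           unsigned int nr_pages)
+{
+        /* Lower 8 bits: page cache shift; upper 24 bits: number of pages. */
+        return (nr_pages << 8) | (page_shift & 0xff);
+}
+
+static inline void netfs_unpack_read_pages(__u32 size,
+                                            unsigned int *page_shift,
+                                            unsigned int *nr_pages)
+{
+        *page_shift = size & 0xff;
+        *nr_pages = size >> 8;
+}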
+
+@NETFS_RENAME
+Used to rename an object.
+Attached data is formed into the following string: "old_path|new_path".
+
+@id - local inode number.
+@start - parent inode number.
+@size - length of the above string.
+@ext - length of the old path part.
+
+
+@NETFS_CAPABILITIES
+Used to exchange crypto capabilities with the server.
+If crypto capabilities are not supported by the server, then the client will disable them
+or fail (if the 'crypto_fail_unsupported' mount option was specified).
+
+@id - superblock index. Used to specify crypto information for a group of servers.
+@size - size of the attached capabilities structure.
+@start - 0.
+@ext - 0.
+@csize - 0.
+
+@NETFS_LOCK
+Used to send lock request/release messages. Although it sends a byte-range request
+and is capable of flushing pages based on that, this is not used, since all Linux
+filesystems lock the whole inode.
+
+@id - lock generation number.
+@start - start of the locked range.
+@size - size of the locked range.
+@ext - lock type: read/write. Not actually used. The 15th bit is used to determine
+ whether it is a lock request (1) or a release (0).
+
+@NETFS_XATTR_SET
+@NETFS_XATTR_GET
+Used to set/get extended attributes for a given inode.
+@id - attribute generation number or xattr setting type.
+@start - size of the attribute (request or attached).
+@size - name length, path length and data size for the given attribute.
+@ext - path length for the given object.
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
new file mode 100644
index 00000000..6e299548
--- /dev/null
+++ b/Documentation/filesystems/porting
@@ -0,0 +1,409 @@
+Changes since 2.5.0:
+
+---
+[recommended]
+
+New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
+ sb_set_blocksize() and sb_min_blocksize().
+
+Use them.
+
+(sb_find_get_block() replaces 2.4's get_hash_table())
+
+---
+[recommended]
+
+New methods: ->alloc_inode() and ->destroy_inode().
+
+Remove inode->u.foo_inode_i
+Declare
+ struct foo_inode_info {
+ /* fs-private stuff */
+ struct inode vfs_inode;
+ };
+ static inline struct foo_inode_info *FOO_I(struct inode *inode)
+ {
+ return list_entry(inode, struct foo_inode_info, vfs_inode);
+ }
+
+Use FOO_I(inode) instead of &inode->u.foo_inode_i;
+
+Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
+foo_inode_info and return the address of ->vfs_inode, the latter should free
+FOO_I(inode) (see in-tree filesystems for examples).
+
+Make them ->alloc_inode and ->destroy_inode in your super_operations.
+
+Keep in mind that now you need explicit initialization of private data
+typically between calling iget_locked() and unlocking the inode.
+
+At some point that will become mandatory.
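+
+A minimal sketch (assuming a foo_inode_cachep slab cache set up at module init,
+with a constructor that calls inode_init_once() on the embedded vfs_inode):
+
+static struct kmem_cache *foo_inode_cachep;
+
+static struct inode *foo_alloc_inode(struct super_block *sb)
+{
+        struct foo_inode_info *fi;
+
+        fi = kmem_cache_alloc(foo_inode_cachep, GFP_KERNEL);
+        if (!fi)
+                return NULL;
+        /* Explicitly initialize fs-private fields of fi here. */
+        return &fi->vfs_inode;
+}
+
+static void foo_destroy_inode(struct inode *inode)
+{
+        kmem_cache_free(foo_inode_cachep, FOO_I(inode));
+}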
+
+---
+[mandatory]
+
+Change of file_system_type method (->read_super to ->get_sb)
+
+->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.
+
+Turn your foo_read_super() into a function that would return 0 in case of
+success and a negative number in case of error (-EINVAL unless you have a more
+informative error value to report). Call it foo_fill_super(). Now declare
+
+int foo_get_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+ return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
+ mnt);
+}
+
+(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
+filesystem).
+
+Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
+foo_get_sb.
+
+---
+[mandatory]
+
+Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
+Most likely there is no need to change anything, but if you relied on
+global exclusion between renames for some internal purpose - you need to
+change your internal locking. Otherwise exclusion guarantees remain the
+same (i.e. parents and victim are locked, etc.).
+
+---
+[informational]
+
+Now we have the exclusion between ->lookup() and directory removal (by
+->rmdir() and ->rename()). If you used to need that exclusion and do
+it by internal locking (most of filesystems couldn't care less) - you
+can relax your locking.
+
+---
+[mandatory]
+
+->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
+->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
+and ->readdir() are called without BKL now. Grab it on entry, drop upon return
+- that will guarantee the same locking you used to have. If your method or its
+parts do not need BKL - better yet, now you can shift lock_kernel() and
+unlock_kernel() so that they would protect exactly what needs to be
+protected.
+
+---
+[mandatory]
+
+BKL is also moved from around sb operations. ->write_super() is now called
+without BKL held. BKL should have been shifted into individual fs sb_op
+functions. If you don't need it, remove it.
+
+---
+[informational]
+
+check for ->link() target not being a directory is done by callers. Feel
+free to drop it...
+
+---
+[informational]
+
+->link() callers hold ->i_mutex on the object we are linking to. Some of your
+problems might be over...
+
+---
+[mandatory]
+
+new file_system_type method - kill_sb(superblock). If you are converting
+an existing filesystem, set it according to ->fs_flags:
+ FS_REQUIRES_DEV - kill_block_super
+ FS_LITTER - kill_litter_super
+ neither - kill_anon_super
+FS_LITTER is gone - just remove it from fs_flags.
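+
+For a block-device based filesystem the explicit initializer could look like
+this (a sketch, reusing the foo_get_sb() from the ->get_sb entry above):
+
+static struct file_system_type foo_fs_type = {
+        .owner          = THIS_MODULE,
+        .name           = "foo",
+        .get_sb         = foo_get_sb,
+        .kill_sb        = kill_block_super,
+        .fs_flags       = FS_REQUIRES_DEV,
+};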
+
+---
+[mandatory]
+
+ FS_SINGLE is gone (actually, that had happened back when ->get_sb()
+went in - and hadn't been documented ;-/). Just remove it from fs_flags
+(and see ->get_sb() entry for other actions).
+
+---
+[mandatory]
+
+->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so
+watch for ->i_mutex-grabbing code that might be used by your ->setattr().
+Callers of notify_change() need ->i_mutex now.
+
+---
+[recommended]
+
+New super_block field "struct export_operations *s_export_op" for
+explicit support for exporting, e.g. via NFS. The structure is fully
+documented at its declaration in include/linux/fs.h, and in
+Documentation/filesystems/nfs/Exporting.
+
+Briefly it allows for the definition of decode_fh and encode_fh operations
+to encode and decode filehandles, and allows the filesystem to use
+a standard helper function for decode_fh, and provide file-system specific
+support for this helper, particularly get_parent.
+
+It is planned that this will be required for exporting once the code
+settles down a bit.
+
+[mandatory]
+
+s_export_op is now required for exporting a filesystem.
+isofs, ext2, ext3, reiserfs, fat
+can be used as examples of very different filesystems.
+
+---
+[mandatory]
+
+iget4() and the read_inode2 callback have been superseded by iget5_locked()
+which has the following prototype,
+
+ struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
+ int (*test)(struct inode *, void *),
+ int (*set)(struct inode *, void *),
+ void *data);
+
+'test' is an additional function that can be used when the inode
+number is not sufficient to identify the actual file object. 'set'
+should be a non-blocking function that initializes those parts of a
+newly created inode to allow the test function to succeed. 'data' is
+passed as an opaque value to both test and set functions.
+
+When the inode has been created by iget5_locked(), it will be returned with the
+I_NEW flag set and will still be locked. The filesystem then needs to finalize
+the initialization. Once the inode is initialized it must be unlocked by
+calling unlock_new_inode().
+
+The filesystem is responsible for setting (and possibly testing) i_ino
+when appropriate. There is also a simpler iget_locked function that
+just takes the superblock and inode number as arguments and does the
+test and set for you.
+
+e.g.
+ inode = iget_locked(sb, ino);
+ if (inode->i_state & I_NEW) {
+ err = read_inode_from_disk(inode);
+ if (err < 0) {
+ iget_failed(inode);
+ return err;
+ }
+ unlock_new_inode(inode);
+ }
+
+Note that if the process of setting up a new inode fails, then iget_failed()
+should be called on the inode to render it dead, and an appropriate error
+should be passed back to the caller.
+
+---
+[recommended]
+
+->getattr() finally getting used. See instances in nfs, minix, etc.
+
+---
+[mandatory]
+
+->revalidate() is gone. If your filesystem had it - provide ->getattr()
+and let it call whatever you had as ->revalidate() + (for symlinks that
+had ->revalidate()) add calls in ->follow_link()/->readlink().
+
+---
+[mandatory]
+
+->d_parent changes are not protected by BKL anymore. Read access is safe
+if at least one of the following is true:
+ * filesystem has no cross-directory rename()
+ * we know that parent had been locked (e.g. we are looking at
+->d_parent of ->lookup() argument).
+ * we are called from ->rename().
+ * the child's ->d_lock is held
+Audit your code and add locking if needed. Notice that any place that is
+not protected by the conditions above is risky even in the old tree - you
+had been relying on BKL and that's prone to screwups. Old tree had quite
+a few holes of that kind - unprotected access to ->d_parent leading to
+anything from oops to silent memory corruption.
+
+---
+[mandatory]
+
+ FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
+(see rootfs for one kind of solution and bdev/socket/pipe for another).
+
+---
+[recommended]
+
+ Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
+is still alive, but only because of the mess in drivers/s390/block/dasd.c.
+As soon as it gets fixed is_read_only() will die.
+
+---
+[mandatory]
+
+->permission() is called without BKL now. Grab it on entry, drop upon
+return - that will guarantee the same locking you used to have. If
+your method or its parts do not need BKL - better yet, now you can
+shift lock_kernel() and unlock_kernel() so that they would protect
+exactly what needs to be protected.
+
+---
+[mandatory]
+
+->statfs() is now called without BKL held. BKL should have been
+shifted into individual fs sb_op functions where it's not clear that
+it's safe to remove it. If you don't need it, remove it.
+
+---
+[mandatory]
+
+ is_read_only() is gone; use bdev_read_only() instead.
+
+---
+[mandatory]
+
+ destroy_buffers() is gone; use invalidate_bdev().
+
+---
+[mandatory]
+
+ fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
+deliberate; as soon as struct block_device * is propagated in a reasonable
+way by that code fixing will become trivial; until then nothing can be
+done.
+
+[mandatory]
+
+ block truncation on error exit from ->write_begin and ->direct_IO
+moved from generic methods (block_write_begin, cont_write_begin,
+nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at
+ext2_write_failed and callers for an example.
+
+[mandatory]
+
+ ->truncate is going away. The whole truncate sequence needs to be
+implemented in ->setattr, which is now mandatory for filesystems
+implementing on-disk size changes. Start with a copy of the old inode_setattr
+and vmtruncate, and then reorder the vmtruncate + foofs_vmtruncate sequence so
+that it first zeroes blocks (using block_truncate_page or similar helpers),
+then updates the size, and finally does the on-disk truncation, which should
+not fail. inode_change_ok now includes the size checks for ATTR_SIZE and must
+be called at the beginning of ->setattr unconditionally.
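+
+A minimal sketch of such a ->setattr for the shrinking case, assuming a
+hypothetical foofs with its own foofs_get_block() and a foofs_truncate_blocks()
+helper for the on-disk part:
+
+static int foofs_setattr(struct dentry *dentry, struct iattr *attr)
+{
+        struct inode *inode = dentry->d_inode;
+        int error;
+
+        /* Must be called first; now includes the ATTR_SIZE checks. */
+        error = inode_change_ok(inode, attr);
+        if (error)
+                return error;
+
+        if ((attr->ia_valid & ATTR_SIZE) && attr->ia_size != inode->i_size) {
+                /* Zero the partial block past the new size... */
+                error = block_truncate_page(inode->i_mapping, attr->ia_size,
+                                            foofs_get_block);
+                if (error)
+                        return error;
+                /* ...update i_size and truncate the pagecache... */
+                truncate_setsize(inode, attr->ia_size);
+                /* ...and finally truncate on disk; this must not fail. */
+                foofs_truncate_blocks(inode, attr->ia_size);
+        }
+
+        setattr_copy(inode, attr);
+        mark_inode_dirty(inode);
+        return 0;
+}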
+
+[mandatory]
+
+ ->clear_inode() and ->delete_inode() are gone; ->evict_inode() should
+be used instead. It gets called whenever the inode is evicted, whether it has
+remaining links or not. The caller does *not* evict the pagecache or inode-associated
+metadata buffers; getting rid of those is the responsibility of the method, as it
+had been for ->delete_inode().
+
+ ->drop_inode() returns int now; it's called on final iput() with
+inode->i_lock held and it returns true if the filesystem wants the inode to be
+dropped. As before, generic_drop_inode() is still the default and it's been
+updated appropriately. generic_delete_inode() is also alive and it consists
+simply of return 1. Note that all actual eviction work is done by caller after
+->drop_inode() returns.
+
+ clear_inode() is gone; use end_writeback() instead. As before, it must
+be called exactly once on each call of ->evict_inode() (as it used to be for
+each call of ->delete_inode()). Unlike before, if you are using inode-associated
+metadata buffers (i.e. mark_buffer_dirty_inode()), it's your responsibility to
+call invalidate_inode_buffers() before end_writeback().
+ No async writeback (and thus no calls of ->write_inode()) will happen
+after end_writeback() returns, so actions that should not overlap with ->write_inode()
+(e.g. freeing on-disk inode if i_nlink is 0) ought to be done after that call.
+
+ NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out
+if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput()
+may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
+free the on-disk inode, you may end up doing that while ->write_inode() is writing
+to it.
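+
+A minimal ->evict_inode() sketch for a hypothetical foofs that uses
+inode-associated metadata buffers (foofs_free_on_disk_inode() is made up for
+illustration):
+
+static void foofs_evict_inode(struct inode *inode)
+{
+        int want_delete = inode->i_nlink == 0 && !is_bad_inode(inode);
+
+        /* The caller no longer evicts the pagecache for us. */
+        truncate_inode_pages(&inode->i_data, 0);
+
+        /* Required when mark_buffer_dirty_inode() is used... */
+        invalidate_inode_buffers(inode);
+        /* ...and must come before end_writeback(). */
+        end_writeback(inode);
+
+        /* No ->write_inode() can run past this point, so this is safe. */
+        if (want_delete)
+                foofs_free_on_disk_inode(inode);
+}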
+
+---
+[mandatory]
+
+ .d_delete() now only advises the dcache as to whether or not to cache
+unreferenced dentries, and is now only called when the dentry refcount goes to
+0. Even on the 0 refcount transition, it must be able to tolerate being called
+0, 1, or more times (i.e. it must be constant and idempotent).
+
+---
+[mandatory]
+
+ .d_compare() calling convention and locking rules are significantly
+changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
+look at examples of other filesystems) for guidance.
+
+---
+[mandatory]
+
+ .d_hash() calling convention and locking rules are significantly
+changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
+look at examples of other filesystems) for guidance.
+
+---
+[mandatory]
+ dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c
+for details of what locks to replace dcache_lock with in order to protect
+particular things. Most of the time, a filesystem only needs ->d_lock, which
+protects *all* the dcache state of a given dentry.
+
+--
+[mandatory]
+
+ Filesystems must RCU-free their inodes, if they can have been accessed
+via rcu-walk path walk (basically, if the file can have had a path name in the
+vfs namespace).
+
+ i_dentry and i_rcu share storage in a union, and the vfs expects
+i_dentry to be reinitialized before it is freed, so an:
+
+ INIT_LIST_HEAD(&inode->i_dentry);
+
+must be done in the RCU callback.
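+
+For a hypothetical foofs with its own inode slab cache (reusing the FOO_I()
+helper from earlier in this file), the pattern could look like:
+
+static void foofs_i_callback(struct rcu_head *head)
+{
+        struct inode *inode = container_of(head, struct inode, i_rcu);
+
+        /* i_dentry shares storage with i_rcu; reinitialize before freeing. */
+        INIT_LIST_HEAD(&inode->i_dentry);
+        kmem_cache_free(foofs_inode_cachep, FOO_I(inode));
+}
+
+static void foofs_destroy_inode(struct inode *inode)
+{
+        call_rcu(&inode->i_rcu, foofs_i_callback);
+}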
+
+--
+[recommended]
+ vfs now tries to do path walking in "rcu-walk mode", which avoids
+atomic operations and scalability hazards on dentries and inodes (see
+Documentation/filesystems/path-lookup.txt). d_hash and d_compare changes
+(above) are examples of the changes required to support this. For more complex
+filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so
+no changes are required to the filesystem. However, this is costly and loses
+the benefits of rcu-walk mode. We will begin to add filesystem callbacks that
+are rcu-walk aware, shown below. Filesystems should take advantage of this
+where possible.
+
+--
+[mandatory]
+ d_revalidate is a callback that is made on every path element (if
+the filesystem provides it), which requires dropping out of rcu-walk mode. This
+may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be
+returned if the filesystem cannot handle rcu-walk. See
+Documentation/filesystems/vfs.txt for more details.
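+
+A common minimal pattern for a filesystem that cannot revalidate in rcu-walk
+mode (a sketch; the foofs names are made up):
+
+static int foofs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+        /* Bail out of rcu-walk; the VFS will retry in ref-walk mode. */
+        if (nd->flags & LOOKUP_RCU)
+                return -ECHILD;
+
+        /* Normal, possibly blocking revalidation goes here. */
+        return foofs_check_dentry(dentry);
+}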
+
+ permission and check_acl are inode permission checks that are called
+on many or all directory inodes on the way down a path walk (to check for
+exec permission). These must now be rcu-walk aware (flags & IPERM_FLAG_RCU).
+See Documentation/filesystems/vfs.txt for more details.
+
+--
+[mandatory]
+ In ->fallocate() you must check the mode option passed in. If your
+filesystem does not support hole punching (deallocating space in the middle of a
+file) you must return -EOPNOTSUPP if FALLOC_FL_PUNCH_HOLE is set in mode.
+Currently you can only have FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE set,
+so the i_size should not change when hole punching, even when punching the end of
+a file off.
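+
+A sketch of the mode check for a filesystem without hole-punch support
+(foofs_do_fallocate() is hypothetical):
+
+static long foofs_fallocate(struct file *file, int mode,
+                            loff_t offset, loff_t len)
+{
+        /* No hole punching support: reject FALLOC_FL_PUNCH_HOLE. */
+        if (mode & FALLOC_FL_PUNCH_HOLE)
+                return -EOPNOTSUPP;
+
+        return foofs_do_fallocate(file->f_path.dentry->d_inode, mode,
+                                  offset, len);
+}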
+
+--
+[mandatory]
+ ->get_sb() is gone. Switch to use of ->mount(). Typically it's just
+a matter of switching from calling get_sb_... to mount_... and changing the
+function type. If you were doing it manually, just switch from setting ->mnt_root
+to some pointer to returning that pointer. On errors return ERR_PTR(...).
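+
+Continuing the earlier foo_get_sb() example, the converted method might look
+like this (a sketch):
+
+static struct dentry *foo_mount(struct file_system_type *fs_type,
+        int flags, const char *dev_name, void *data)
+{
+        /* mount_bdev() returns the root dentry or ERR_PTR() on failure. */
+        return mount_bdev(fs_type, flags, dev_name, data, foo_fill_super);
+}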
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
new file mode 100644
index 00000000..db3b1aba
--- /dev/null
+++ b/Documentation/filesystems/proc.txt
@@ -0,0 +1,1544 @@
+------------------------------------------------------------------------------
+ T H E /proc F I L E S Y S T E M
+------------------------------------------------------------------------------
+/proc/sys Terrehon Bowden <terrehon@pacbell.net> October 7 1999
+ Bodo Bauer <bb@ricochet.net>
+
+2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000
+move /proc/sys Shen Feng <shen@cn.fujitsu.com> April 1 2009
+------------------------------------------------------------------------------
+Version 1.3 Kernel version 2.2.12
+ Kernel version 2.4.0-test11-pre4
+------------------------------------------------------------------------------
+fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
+
+Table of Contents
+-----------------
+
+ 0 Preface
+ 0.1 Introduction/Credits
+ 0.2 Legal Stuff
+
+ 1 Collecting System Information
+ 1.1 Process-Specific Subdirectories
+ 1.2 Kernel data
+ 1.3 IDE devices in /proc/ide
+ 1.4 Networking info in /proc/net
+ 1.5 SCSI info
+ 1.6 Parallel port info in /proc/parport
+ 1.7 TTY info in /proc/tty
+ 1.8 Miscellaneous kernel statistics in /proc/stat
+ 1.9 Ext4 file system parameters
+
+ 2 Modifying System Parameters
+
+ 3 Per-Process Parameters
+ 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+ score
+ 3.2 /proc/<pid>/oom_score - Display current oom-killer score
+ 3.3 /proc/<pid>/io - Display the IO accounting fields
+ 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
+ 3.5 /proc/<pid>/mountinfo - Information about mounts
+ 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
+
+
+------------------------------------------------------------------------------
+Preface
+------------------------------------------------------------------------------
+
+0.1 Introduction/Credits
+------------------------
+
+This documentation is part of a soon (or so we hope) to be released book on
+the SuSE Linux distribution. As there is no complete documentation for the
+/proc file system and we've used many freely available sources to write these
+chapters, it seems only fair to give the work back to the Linux community.
+This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm
+afraid it's still far from complete, but we hope it will be useful. As far as
+we know, it is the first 'all-in-one' document about the /proc file system. It
+is focused on the Intel x86 hardware, so if you are looking for PPC, ARM,
+SPARC, AXP, etc., features, you probably won't find what you are looking for.
+It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But
+additions and patches are welcome and will be added to this document if you
+mail them to Bodo.
+
+We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of
+other people for help compiling this documentation. We'd also like to extend a
+special thank you to Andi Kleen for documentation, which we relied on heavily
+to create this document, as well as the additional information he provided.
+Thanks to everybody else who contributed source or docs to the Linux kernel
+and helped create a great piece of software... :)
+
+If you have any comments, corrections or additions, please don't hesitate to
+contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this
+document.
+
+The latest version of this document is available online at
+http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html
+
+If the above direction does not work for you, you could try the kernel
+mailing list at linux-kernel@vger.kernel.org and/or try to reach me at
+comandante@zaralinux.com.
+
+0.2 Legal Stuff
+---------------
+
+We don't guarantee the correctness of this document, and if you come to us
+complaining about how you screwed up your system because of incorrect
+documentation, we won't feel responsible...
+
+------------------------------------------------------------------------------
+CHAPTER 1: COLLECTING SYSTEM INFORMATION
+------------------------------------------------------------------------------
+
+------------------------------------------------------------------------------
+In This Chapter
+------------------------------------------------------------------------------
+* Investigating the properties of the pseudo file system /proc and its
+ ability to provide information on the running Linux system
+* Examining /proc's structure
+* Uncovering various information about the kernel and the processes running
+ on the system
+------------------------------------------------------------------------------
+
+
+The proc file system acts as an interface to internal data structures in the
+kernel. It can be used to obtain information about the system and to change
+certain kernel parameters at runtime (sysctl).
+
+First, we'll take a look at the read-only parts of /proc. In Chapter 2, we
+show you how you can use /proc/sys to change settings.
+
+1.1 Process-Specific Subdirectories
+-----------------------------------
+
+The directory /proc contains (among other things) one subdirectory for each
+process running on the system, which is named after the process ID (PID).
+
+The link 'self' points to the process reading the file system. Each process
+subdirectory has the entries listed in Table 1-1.
+
+
+Table 1-1: Process specific entries in /proc
+..............................................................................
+ File Content
+ clear_refs Clears page referenced bits shown in smaps output
+ cmdline Command line arguments
+ cpu Current and last cpu in which it was executed (2.4)(smp)
+ cwd Link to the current working directory
+ environ Values of environment variables
+ exe Link to the executable of this process
+ fd Directory, which contains all file descriptors
+ maps Memory maps to executables and library files (2.4)
+ mem Memory held by this process
+ root Link to the root directory of this process
+ stat Process status
+ statm Process memory status information
+ status Process status in human readable form
+ wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan
+ pagemap Page table
+ stack Report full stack trace, enable via CONFIG_STACKTRACE
+ smaps an extension based on maps, showing the memory consumption of
+ each mapping
+..............................................................................
+
+For example, to get the status information of a process, all you have to do is
+read the file /proc/PID/status:
+
+ >cat /proc/self/status
+ Name: cat
+ State: R (running)
+ Tgid: 5452
+ Pid: 5452
+ PPid: 743
+ TracerPid: 0 (2.4)
+ Uid: 501 501 501 501
+ Gid: 100 100 100 100
+ FDSize: 256
+ Groups: 100 14 16
+ VmPeak: 5004 kB
+ VmSize: 5004 kB
+ VmLck: 0 kB
+ VmHWM: 476 kB
+ VmRSS: 476 kB
+ VmData: 156 kB
+ VmStk: 88 kB
+ VmExe: 68 kB
+ VmLib: 1412 kB
+ VmPTE: 20 kb
+ VmSwap: 0 kB
+ Threads: 1
+ SigQ: 0/28578
+ SigPnd: 0000000000000000
+ ShdPnd: 0000000000000000
+ SigBlk: 0000000000000000
+ SigIgn: 0000000000000000
+ SigCgt: 0000000000000000
+ CapInh: 00000000fffffeff
+ CapPrm: 0000000000000000
+ CapEff: 0000000000000000
+ CapBnd: ffffffffffffffff
+ voluntary_ctxt_switches: 0
+ nonvoluntary_ctxt_switches: 1
+
+This shows you nearly the same information you would get if you viewed it with
+the ps command. In fact, ps uses the proc file system to obtain its
+information. But you get a more detailed view of the process by reading the
+file /proc/PID/status. Its fields are described in Table 1-2.
+
+The statm file contains more detailed information about the process
+memory usage. Its seven fields are explained in Table 1-3. The stat file
+contains detailed information about the process itself. Its fields are
+explained in Table 1-4.
+
+(for SMP CONFIG users)
+To make accounting scalable, RSS-related information is handled in an
+asynchronous manner and the value may not be very precise. To see a precise
+snapshot of a moment, you can look at the /proc/<pid>/smaps file and scan the page table.
+It's slow but very precise.
+
+Table 1-2: Contents of the status files (as of 2.6.30-rc7)
+..............................................................................
+ Field Content
+ Name filename of the executable
+ State state (R is running, S is sleeping, D is sleeping
+ in an uninterruptible wait, Z is zombie,
+ T is traced or stopped)
+ Tgid thread group ID
+ Pid process id
+ PPid process id of the parent process
+ TracerPid PID of process tracing this process (0 if not)
+ Uid Real, effective, saved set, and file system UIDs
+ Gid Real, effective, saved set, and file system GIDs
+ FDSize number of file descriptor slots currently allocated
+ Groups supplementary group list
+ VmPeak peak virtual memory size
+ VmSize total program size
+ VmLck locked memory size
+ VmHWM peak resident set size ("high water mark")
+ VmRSS size of memory portions
+ VmData size of data segment
+ VmStk size of stack segment
+ VmExe size of text segment
+ VmLib size of shared library code
+ VmPTE size of page table entries
+ VmSwap size of swap usage (the number of referred swapents)
+ Threads number of threads
+ SigQ number of signals queued/max. number for queue
+ SigPnd bitmap of pending signals for the thread
+ ShdPnd bitmap of shared pending signals for the process
+ SigBlk bitmap of blocked signals
+ SigIgn bitmap of ignored signals
+ SigCgt bitmap of caught signals
+ CapInh bitmap of inheritable capabilities
+ CapPrm bitmap of permitted capabilities
+ CapEff bitmap of effective capabilities
+ CapBnd bitmap of capabilities bounding set
+ Cpus_allowed mask of CPUs on which this process may run
+ Cpus_allowed_list Same as previous, but in "list format"
+ Mems_allowed mask of memory nodes allowed to this process
+ Mems_allowed_list Same as previous, but in "list format"
+ voluntary_ctxt_switches number of voluntary context switches
+ nonvoluntary_ctxt_switches number of non voluntary context switches
+..............................................................................
+
+Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
+..............................................................................
+ Field Content
+ size total program size (pages) (same as VmSize in status)
+ resident size of memory portions (pages) (same as VmRSS in status)
+ shared number of pages that are shared (i.e. backed by a file)
+ trs number of pages that are 'code' (not including libs; broken,
+ includes data segment)
+ lrs number of pages of library (always 0 on 2.6)
+ drs number of pages of data/stack (including libs; broken,
+ includes library text)
+ dt number of dirty pages (always 0 on 2.6)
+..............................................................................
+
+
+Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
+..............................................................................
+ Field Content
+ pid process id
+ tcomm filename of the executable
+ state state (R is running, S is sleeping, D is sleeping in an
+ uninterruptible wait, Z is zombie, T is traced or stopped)
+ ppid process id of the parent process
+ pgrp pgrp of the process
+ sid session id
+ tty_nr tty the process uses
+ tty_pgrp pgrp of the tty
+ flags task flags
+ min_flt number of minor faults
+ cmin_flt number of minor faults, including those of children
+ maj_flt number of major faults
+ cmaj_flt number of major faults, including those of children
+ utime user mode jiffies
+ stime kernel mode jiffies
+ cutime user mode jiffies, including those of children
+ cstime kernel mode jiffies, including those of children
+ priority priority level
+ nice nice level
+ num_threads number of threads
+ it_real_value (obsolete, always 0)
+ start_time time the process started after system boot
+ vsize virtual memory size
+ rss resident set memory size
+ rsslim current limit in bytes on the rss
+ start_code address above which program text can run
+ end_code address below which program text can run
+ start_stack address of the start of the stack
+ esp current value of ESP
+ eip current value of EIP
+ pending bitmap of pending signals
+ blocked bitmap of blocked signals
+ sigign bitmap of ignored signals
+ sigcatch bitmap of caught signals
+ wchan address where process went to sleep
+ 0 (place holder)
+ 0 (place holder)
+ exit_signal signal to send to parent thread on exit
+ task_cpu which CPU the task is scheduled on
+ rt_priority realtime priority
+ policy scheduling policy (man sched_setscheduler)
+ blkio_ticks time spent waiting for block IO
+ gtime guest time of the task in jiffies
+ cgtime guest time of the task children in jiffies
+..............................................................................
+
+The /proc/PID/maps file contains the currently mapped memory regions and
+their access permissions.
+
+The format is:
+
+address perms offset dev inode pathname
+
+08048000-08049000 r-xp 00000000 03:00 8312 /opt/test
+08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test
+0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
+a7cb1000-a7cb2000 ---p 00000000 00:00 0
+a7cb2000-a7eb2000 rw-p 00000000 00:00 0
+a7eb2000-a7eb3000 ---p 00000000 00:00 0
+a7eb3000-a7ed5000 rw-p 00000000 00:00 0
+a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6
+a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6
+a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6
+a800b000-a800e000 rw-p 00000000 00:00 0
+a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
+a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
+a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
+a8024000-a8027000 rw-p 00000000 00:00 0
+a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
+a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
+a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
+aff35000-aff4a000 rw-p 00000000 00:00 0 [stack]
+ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]
+
+where "address" is the address space in the process that it occupies, "perms"
+is a set of permissions:
+
+ r = read
+ w = write
+ x = execute
+ s = shared
+ p = private (copy on write)
+
+"offset" is the offset into the mapping, "dev" is the device (major:minor), and
+"inode" is the inode on that device. 0 indicates that no inode is associated
+with the memory region, as the case would be with BSS (uninitialized data).
+The "pathname" shows the name associated file for this mapping. If the mapping
+is not associated with a file:
+
+ [heap] = the heap of the program
+ [stack] = the stack of the main process
+ [vdso] = the "virtual dynamic shared object",
+ the kernel system call handler
+
+ or if empty, the mapping is anonymous.
+
+
+The /proc/PID/smaps file is an extension based on maps, showing the memory
+consumption for each of the process's mappings. For each mapping there
+is a series of lines such as the following:
+
+08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash
+Size: 1084 kB
+Rss: 892 kB
+Pss: 374 kB
+Shared_Clean: 892 kB
+Shared_Dirty: 0 kB
+Private_Clean: 0 kB
+Private_Dirty: 0 kB
+Referenced: 892 kB
+Anonymous: 0 kB
+Swap: 0 kB
+KernelPageSize: 4 kB
+MMUPageSize: 4 kB
+Locked: 374 kB
+
+The first of these lines shows the same information as is displayed for the
+mapping in /proc/PID/maps. The remaining lines show the size of the mapping
+(size), the amount of the mapping that is currently resident in RAM (RSS), the
+process' proportional share of this mapping (PSS), the number of clean and
+dirty private pages in the mapping. Note that even a page which is part of a
+MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used
+by only one process, is accounted as private and not as shared. "Referenced"
+indicates the amount of memory currently marked as referenced or accessed.
+"Anonymous" shows the amount of memory that does not belong to any file. Even
+a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
+and a page is modified, the file page is replaced by a private anonymous copy.
+"Swap" shows how much would-be-anonymous memory is also used, but out on
+swap.
+
+This file is only present if the CONFIG_MMU kernel configuration option is
+enabled.
+
+The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
+bits on both physical and virtual pages associated with a process.
+To clear the bits for all the pages associated with the process
+ > echo 1 > /proc/PID/clear_refs
+
+To clear the bits for the anonymous pages associated with the process
+ > echo 2 > /proc/PID/clear_refs
+
+To clear the bits for the file mapped pages associated with the process
+ > echo 3 > /proc/PID/clear_refs
+Any other value written to /proc/PID/clear_refs will have no effect.
+
+The /proc/pid/pagemap file gives the PFN, which can be used to find the pageflags
+using /proc/kpageflags and the number of times a page is mapped using
+/proc/kpagecount. For a detailed explanation, see Documentation/vm/pagemap.txt.
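+
+As a sketch (see Documentation/vm/pagemap.txt for the authoritative format;
+print_pfn() is made up for illustration), a userspace reader that maps a
+virtual address of a process to its PFN could look like:
+
+#include <fcntl.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+/* Print the PFN backing virtual address 'vaddr' of process 'pid'. */
+int print_pfn(pid_t pid, uintptr_t vaddr)
+{
+        char path[64];
+        uint64_t entry;
+        int fd;
+
+        snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
+        fd = open(path, O_RDONLY);
+        if (fd < 0)
+                return -1;
+        /* One 64-bit entry per virtual page. */
+        if (pread(fd, &entry, sizeof(entry),
+                  (vaddr / sysconf(_SC_PAGESIZE)) * 8) != sizeof(entry)) {
+                close(fd);
+                return -1;
+        }
+        close(fd);
+        if (entry & (1ULL << 63))       /* bit 63: page present */
+                printf("pfn: %llu\n",
+                       (unsigned long long)(entry & ((1ULL << 55) - 1)));
+        else
+                printf("page not present\n");
+        return 0;
+}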
+
+1.2 Kernel data
+---------------
+
+Similar to the process entries, the kernel data files give information about
+the running kernel. The files used to obtain this information are contained in
+/proc and are listed in Table 1-5. Not all of these will be present in your
+system. Which files exist depends on your kernel configuration and on the
+loaded modules.
+
+Table 1-5: Kernel info in /proc
+..............................................................................
+ File Content
+ apm Advanced power management info
+ buddyinfo Kernel memory allocator information (see text) (2.5)
+ bus Directory containing bus specific information
+ cmdline Kernel command line
+ cpuinfo Info about the CPU
+ devices Available devices (block and character)
+ dma Used DMA channels
+ filesystems Supported filesystems
+ driver Various drivers grouped here, currently rtc (2.4)
+ execdomains Execdomains, related to security (2.4)
+ fb Frame Buffer devices (2.4)
+ fs File system parameters, currently nfs/exports (2.4)
+ ide Directory containing info about the IDE subsystem
+ interrupts Interrupt usage
+ iomem Memory map (2.4)
+ ioports I/O port usage
+ irq Masks for irq to cpu affinity (2.4)(smp?)
+ isapnp ISA PnP (Plug&Play) Info (2.4)
+ kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4))
+ kmsg Kernel messages
+ ksyms Kernel symbol table
+ loadavg Load average of last 1, 5 & 15 minutes
+ locks Kernel locks
+ meminfo Memory info
+ misc Miscellaneous
+ modules List of loaded modules
+ mounts Mounted filesystems
+ net Networking info (see text)
+ pagetypeinfo Additional page allocator information (see text) (2.5)
+ partitions Table of partitions known to the system
+ pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
+ decoded by lspci) (2.4)
+ rtc Real time clock
+ scsi SCSI info (see text)
+ slabinfo Slab pool info
+ softirqs softirq usage
+ stat Overall statistics
+ swaps Swap space utilization
+ sys See chapter 2
+ sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4)
+ tty Info of tty drivers
+ uptime System uptime
+ version Kernel version
+ video bttv info of video resources (2.4)
+ vmallocinfo Show vmalloced areas
+..............................................................................
+
+You can, for example, check which interrupts are currently in use and what
+they are used for by looking in the file /proc/interrupts:
+
+ > cat /proc/interrupts
+ CPU0
+ 0: 8728810 XT-PIC timer
+ 1: 895 XT-PIC keyboard
+ 2: 0 XT-PIC cascade
+ 3: 531695 XT-PIC aha152x
+ 4: 2014133 XT-PIC serial
+ 5: 44401 XT-PIC pcnet_cs
+ 8: 2 XT-PIC rtc
+ 11: 8 XT-PIC i82365
+ 12: 182918 XT-PIC PS/2 Mouse
+ 13: 1 XT-PIC fpu
+ 14: 1232265 XT-PIC ide0
+ 15: 7 XT-PIC ide1
+ NMI: 0
+
+In 2.4.* a couple of lines were added to this file, LOC & ERR (this time this is
+the output of an SMP machine):
+
+ > cat /proc/interrupts
+
+ CPU0 CPU1
+ 0: 1243498 1214548 IO-APIC-edge timer
+ 1: 8949 8958 IO-APIC-edge keyboard
+ 2: 0 0 XT-PIC cascade
+ 5: 11286 10161 IO-APIC-edge soundblaster
+ 8: 1 0 IO-APIC-edge rtc
+ 9: 27422 27407 IO-APIC-edge 3c503
+ 12: 113645 113873 IO-APIC-edge PS/2 Mouse
+ 13: 0 0 XT-PIC fpu
+ 14: 22491 24012 IO-APIC-edge ide0
+ 15: 2183 2415 IO-APIC-edge ide1
+ 17: 30564 30414 IO-APIC-level eth0
+ 18: 177 164 IO-APIC-level bttv
+ NMI: 2457961 2457959
+ LOC: 2457882 2457881
+ ERR: 2155
+
+NMI is incremented in this case because every timer interrupt generates a NMI
+(Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups.
+
+LOC is the local interrupt counter of the internal APIC of every CPU.
+
+ERR is incremented in the case of errors on the IO-APIC bus (the bus that
+connects the CPUs in an SMP system). When an error is detected, the IO-APIC
+automatically retries the transmission, so it should not be a big problem, but
+you should read the SMP-FAQ.
+
+In 2.6.2* /proc/interrupts was expanded again. This time the goal was for
+/proc/interrupts to display every IRQ vector in use by the system, not
+just those considered 'most important'. The new vectors are:
+
+ THR -- interrupt raised when a machine check threshold counter
+ (typically counting ECC corrected errors of memory or cache) exceeds
+ a configurable threshold. Only available on some systems.
+
+ TRM -- a thermal event interrupt occurs when a temperature threshold
+ has been exceeded for the CPU. This interrupt may also be generated
+ when the temperature drops back to normal.
+
+ SPU -- a spurious interrupt is some interrupt that was raised then lowered
+ by some IO device before it could be fully processed by the APIC. Hence
+ the APIC sees the interrupt but does not know what device it came from.
+ For this case the APIC will generate the interrupt with an IRQ vector
+ of 0xff. This might also be generated by chipset bugs.
+
+ RES, CAL, TLB -- rescheduling, call and TLB flush interrupts are
+ sent from one CPU to another per the needs of the OS. Typically,
+ their statistics are used by kernel developers and interested users to
+ determine the occurrence of interrupts of the given type.
+
+The above IRQ vectors are displayed only when relevant. For example,
+the threshold vector does not exist on x86_64 platforms. Others are
+suppressed when the system is a uniprocessor. As of this writing, only
+i386 and x86_64 platforms support the new IRQ vector displays.
+
+Of some interest is the introduction of the /proc/irq directory in 2.4.
+It can be used to set IRQ-to-CPU affinity; this means that you can "hook" an
+IRQ to only one CPU, or exclude a CPU from handling IRQs. The contents of the
+irq subdir are one subdir for each IRQ, and two files: default_smp_affinity and
+prof_cpu_mask.
+
+For example
+ > ls /proc/irq/
+ 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask
+ 1 11 13 15 17 19 3 5 7 9 default_smp_affinity
+ > ls /proc/irq/0/
+ smp_affinity
+
+smp_affinity is a bitmask in which you can specify which CPUs can handle the
+IRQ. You can set it by doing:
+
+ > echo 1 > /proc/irq/10/smp_affinity
+
+This means that only the first CPU will handle the IRQ, but you can also echo
+5, which means that only the first and third CPUs (bitmask 101) can handle the IRQ.
+
+The contents of each smp_affinity file are the same by default:
+
+ > cat /proc/irq/0/smp_affinity
+ ffffffff
+
+There is an alternate interface, smp_affinity_list which allows specifying
+a cpu range instead of a bitmask:
+
+ > cat /proc/irq/0/smp_affinity_list
+ 1024-1031
+
+The default_smp_affinity mask applies to all non-active IRQs, which are the
+IRQs which have not yet been allocated/activated, and hence which lack a
+/proc/irq/[0-9]* directory.
+
+The node file on an SMP system shows the node to which the device using the IRQ
+reports itself as being attached. This hardware locality information does not
+include information about any possible driver locality preference.
+
+prof_cpu_mask specifies which CPUs are to be profiled by the system wide
+profiler. Default value is ffffffff (all cpus if there are only 32 of them).
+
+The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
+between all the CPUs which are allowed to handle it. As usual the kernel has
+more info than you and does a better job than you, so the defaults are the
+best choice for almost everyone. [Note this applies only to those IO-APICs
+that support "Round Robin" interrupt distribution.]
+
+There are three more important subdirectories in /proc: net, scsi, and sys.
+The general rule is that the contents, or even the existence of these
+directories, depend on your kernel configuration. If SCSI is not enabled, the
+directory scsi may not exist. The same is true with the net, which is there
+only when networking support is present in the running kernel.
+
+The slabinfo file gives information about memory usage at the slab level.
+Linux uses slab pools for memory management above page level in version 2.2.
+Commonly used objects have their own slab pool (such as network buffers,
+directory cache, and so on).
+
+..............................................................................
+
+> cat /proc/buddyinfo
+
+Node 0, zone DMA 0 4 5 4 4 3 ...
+Node 0, zone Normal 1 0 0 1 101 8 ...
+Node 0, zone HighMem 2 0 0 1 1 0 ...
+
+External fragmentation is a problem under some workloads, and buddyinfo is a
+useful tool for helping diagnose these problems. Buddyinfo will give you a
+clue as to how big an area you can safely allocate, or why a previous
+allocation failed.
+
+Each column represents the number of pages of a certain order which are
+available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
+ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
+available in ZONE_NORMAL, etc...
+
+More information relevant to external fragmentation can be found in
+pagetypeinfo.
+
+> cat /proc/pagetypeinfo
+Page block order: 9
+Pages per block: 512
+
+Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
+Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0
+Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
+Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2
+Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0
+Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9
+Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0
+Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452
+Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
+Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+
+Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
+Node 0, zone DMA 2 0 5 1 0
+Node 0, zone DMA32 41 6 967 2 0
+
+Fragmentation avoidance in the kernel works by grouping pages of different
+migrate types into the same contiguous regions of memory called page blocks.
+A page block is typically the size of the default hugepage size e.g. 2MB on
+X86-64. By keeping pages grouped based on their ability to move, the kernel
+can reclaim pages within a page block to satisfy a high-order allocation.
+
+The pagetypeinfo begins with information on the size of a page block. It
+then gives the same type of information as buddyinfo except broken down
+by migrate-type and finishes with details on how many page blocks of each
+type exist.
+
+If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
+from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
+make an estimate of the likely number of huge pages that can be allocated
+at a given point in time. All the "Movable" blocks should be allocatable
+unless memory has been mlock()'d. Some of the Reclaimable blocks should
+also be allocatable although a lot of filesystem metadata may have to be
+reclaimed to achieve this.
+
+..............................................................................
+
+meminfo:
+
+Provides information about distribution and utilization of memory. This
+varies by architecture and compile options. The following is from a
+16GB PIII, which has highmem enabled. You may not have all of these fields.
+
+> cat /proc/meminfo
+
+The "Locked" indicates whether the mapping is locked in memory or not.
+
+
+MemTotal: 16344972 kB
+MemFree: 13634064 kB
+Buffers: 3656 kB
+Cached: 1195708 kB
+SwapCached: 0 kB
+Active: 891636 kB
+Inactive: 1077224 kB
+HighTotal: 15597528 kB
+HighFree: 13629632 kB
+LowTotal: 747444 kB
+LowFree: 4432 kB
+SwapTotal: 0 kB
+SwapFree: 0 kB
+Dirty: 968 kB
+Writeback: 0 kB
+AnonPages: 861800 kB
+Mapped: 280372 kB
+Slab: 284364 kB
+SReclaimable: 159856 kB
+SUnreclaim: 124508 kB
+PageTables: 24448 kB
+NFS_Unstable: 0 kB
+Bounce: 0 kB
+WritebackTmp: 0 kB
+CommitLimit: 7669796 kB
+Committed_AS: 100056 kB
+VmallocTotal: 112216 kB
+VmallocUsed: 428 kB
+VmallocChunk: 111088 kB
+
+ MemTotal: Total usable ram (i.e. physical ram minus a few reserved
+ bits and the kernel binary code)
+ MemFree: The sum of LowFree+HighFree
+ Buffers: Relatively temporary storage for raw disk blocks
+ shouldn't get tremendously large (20MB or so)
+ Cached: in-memory cache for files read from the disk (the
+ pagecache). Doesn't include SwapCached
+ SwapCached: Memory that once was swapped out, is swapped back in but
+ still also is in the swapfile (if memory is needed it
+ doesn't need to be swapped out AGAIN because it is already
+ in the swapfile. This saves I/O)
+ Active: Memory that has been used more recently and usually not
+ reclaimed unless absolutely necessary.
+ Inactive: Memory which has been less recently used. It is more
+ eligible to be reclaimed for other purposes
+ HighTotal:
+ HighFree: Highmem is all memory above ~860MB of physical memory
+ Highmem areas are for use by userspace programs, or
+ for the pagecache. The kernel must use tricks to access
+ this memory, making it slower to access than lowmem.
+ LowTotal:
+ LowFree: Lowmem is memory which can be used for everything that
+ highmem can be used for, but it is also available for the
+ kernel's use for its own data structures. Among many
+ other things, it is where everything from the Slab is
+ allocated. Bad things happen when you're out of lowmem.
+ SwapTotal: total amount of swap space available
+ SwapFree: Memory which has been evicted from RAM, and is temporarily
+ on the disk
+ Dirty: Memory which is waiting to get written back to the disk
+ Writeback: Memory which is actively being written back to the disk
+ AnonPages: Non-file backed pages mapped into userspace page tables
+ Mapped: files which have been mmapped, such as libraries
+ Slab: in-kernel data structures cache
+SReclaimable: Part of Slab that might be reclaimed, such as caches
+ SUnreclaim: Part of Slab that cannot be reclaimed under memory pressure
+ PageTables: amount of memory dedicated to the lowest level of page
+ tables.
+NFS_Unstable: NFS pages sent to the server, but not yet committed to stable
+ storage
+ Bounce: Memory used for block device "bounce buffers"
+WritebackTmp: Memory used by FUSE for temporary writeback buffers
+ CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'),
+ this is the total amount of memory currently available to
+ be allocated on the system. This limit is only adhered to
+ if strict overcommit accounting is enabled (mode 2 in
+ 'vm.overcommit_memory').
+ The CommitLimit is calculated with the following formula:
+ CommitLimit = ('vm.overcommit_ratio' / 100) * Physical RAM + Swap
+ For example, on a system with 1G of physical RAM and 7G
+ of swap with a 'vm.overcommit_ratio' of 30 it would
+ yield a CommitLimit of 0.3 * 1G + 7G = 7.3G.
+ For more details, see the memory overcommit documentation
+ in vm/overcommit-accounting.
+Committed_AS: The amount of memory presently allocated on the system.
+ The committed memory is a sum of all of the memory which
+ has been allocated by processes, even if it has not been
+ "used" by them as of yet. A process which malloc()'s 1G
+ of memory, but only touches 300M of it will only show up
+ as using 300M of memory even if it has the address space
+ allocated for the entire 1G. This 1G is memory which has
+ been "committed" to by the VM and can be used at any time
+ by the allocating application. With strict overcommit
+ enabled on the system (mode 2 in 'vm.overcommit_memory'),
+ allocations which would exceed the CommitLimit (detailed
+ above) will not be permitted. This is useful if one needs
+ to guarantee that processes will not fail due to lack of
+ memory once that memory has been successfully allocated.
+VmallocTotal: total size of vmalloc memory area
+ VmallocUsed: amount of vmalloc area which is used
+VmallocChunk: largest contiguous block of vmalloc area which is free
+
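+As an illustration of how these fields are typically consumed, here is a
+minimal user space sketch that extracts two of the values listed above. It
+assumes only that the MemTotal and MemFree lines are present; which fields
+your kernel exports depends on architecture and compile options:
+
+  #include <stdio.h>
+
+  int main(void)
+  {
+    FILE *f = fopen("/proc/meminfo", "r");
+    char line[256];
+    unsigned long memtotal = 0, memfree = 0;
+
+    if (!f)
+      return 1;
+    while (fgets(line, sizeof(line), f)) {
+      /* each line looks like "MemTotal:     16344972 kB" */
+      sscanf(line, "MemTotal: %lu kB", &memtotal);
+      sscanf(line, "MemFree: %lu kB", &memfree);
+    }
+    fclose(f);
+    printf("%lu of %lu kB free\n", memfree, memtotal);
+    return 0;
+  }
+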
+..............................................................................
+
+vmallocinfo:
+
+Provides information about vmalloced/vmaped areas. One line per area,
+containing the virtual address range of the area, size in bytes,
+caller information of the creator, and optional information depending
+on the kind of area:
+
+ pages=nr number of pages
+ phys=addr if a physical address was specified
+ ioremap I/O mapping (ioremap() and friends)
+ vmalloc vmalloc() area
+ vmap vmap()ed pages
+ user VM_USERMAP area
+ vpages buffer for pages pointers was vmalloced (huge area)
+ N<node>=nr (Only on NUMA kernels)
+ Number of pages allocated on memory node <node>
+
+> cat /proc/vmallocinfo
+0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ...
+ /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
+0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ...
+ /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
+0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f...
+ phys=7fee8000 ioremap
+0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f...
+ phys=7fee7000 ioremap
+0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210
+0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ...
+ /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
+0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ...
+ pages=2 vmalloc N1=2
+0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ...
+ /0x130 [x_tables] pages=4 vmalloc N0=4
+0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ...
+ pages=14 vmalloc N2=14
+0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ...
+ pages=4 vmalloc N1=4
+0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ...
+ pages=2 vmalloc N1=2
+0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ...
+ pages=10 vmalloc N0=10
+
+..............................................................................
+
+softirqs:
+
+Provides counts of softirq handlers serviced since boot time, for each cpu.
+
+> cat /proc/softirqs
+ CPU0 CPU1 CPU2 CPU3
+ HI: 0 0 0 0
+ TIMER: 27166 27120 27097 27034
+ NET_TX: 0 0 0 17
+ NET_RX: 42 0 0 39
+ BLOCK: 0 0 107 1121
+ TASKLET: 0 0 0 290
+ SCHED: 27035 26983 26971 26746
+ HRTIMER: 0 0 0 0
+ RCU: 1678 1769 2178 2250
+
+
+1.3 IDE devices in /proc/ide
+----------------------------
+
+The subdirectory /proc/ide contains information about all IDE devices of which
+the kernel is aware. There is one subdirectory for each IDE controller, the
+file drivers and a link for each IDE device, pointing to the device directory
+in the controller specific subtree.
+
+The file drivers contains general information about the drivers used for the
+IDE devices:
+
+ > cat /proc/ide/drivers
+ ide-cdrom version 4.53
+ ide-disk version 1.08
+
+More detailed information can be found in the controller specific
+subdirectories. These are named ide0, ide1 and so on. Each of these
+directories contains the files shown in table 1-6.
+
+
+Table 1-6: IDE controller info in /proc/ide/ide?
+..............................................................................
+ File Content
+ channel IDE channel (0 or 1)
+ config Configuration (only for PCI/IDE bridge)
+ mate Mate name
+ model Type/Chipset of IDE controller
+..............................................................................
+
+Each device connected to a controller has a separate subdirectory in the
+controllers directory. The files listed in table 1-7 are contained in these
+directories.
+
+
+Table 1-7: IDE device information
+..............................................................................
+ File Content
+ cache The cache
+ capacity Capacity of the medium (in 512Byte blocks)
+ driver driver and version
+ geometry physical and logical geometry
+ identify device identify block
+ media media type
+ model device identifier
+ settings device setup
+ smart_thresholds IDE disk management thresholds
+ smart_values IDE disk management values
+..............................................................................
+
+The most interesting file is settings. This file contains a nice overview of
+the drive parameters:
+
+ # cat /proc/ide/ide0/hda/settings
+ name value min max mode
+ ---- ----- --- --- ----
+ bios_cyl 526 0 65535 rw
+ bios_head 255 0 255 rw
+ bios_sect 63 0 63 rw
+ breada_readahead 4 0 127 rw
+ bswap 0 0 1 r
+ file_readahead 72 0 2097151 rw
+ io_32bit 0 0 3 rw
+ keepsettings 0 0 1 rw
+ max_kb_per_request 122 1 127 rw
+ multcount 0 0 8 rw
+ nice1 1 0 1 rw
+ nowerr 0 0 1 rw
+ pio_mode write-only 0 255 w
+ slow 0 0 1 rw
+ unmaskirq 0 0 1 rw
+ using_dma 0 0 1 rw
+
+
+1.4 Networking info in /proc/net
+--------------------------------
+
+The subdirectory /proc/net follows the usual pattern. Table 1-8 shows the
+additional values you get for IP version 6 if you configure the kernel to
+support this. Table 1-9 lists the files and their meaning.
+
+
+Table 1-8: IPv6 info in /proc/net
+..............................................................................
+ File Content
+ udp6 UDP sockets (IPv6)
+ tcp6 TCP sockets (IPv6)
+ raw6 Raw device statistics (IPv6)
+ igmp6 IP multicast addresses, which this host joined (IPv6)
+ if_inet6 List of IPv6 interface addresses
+ ipv6_route Kernel routing table for IPv6
+ rt6_stats Global IPv6 routing tables statistics
+ sockstat6 Socket statistics (IPv6)
+ snmp6 Snmp data (IPv6)
+..............................................................................
+
+
+Table 1-9: Network info in /proc/net
+..............................................................................
+ File Content
+ arp Kernel ARP table
+ dev network devices with statistics
+ dev_mcast the Layer2 multicast groups a device is listening to
+ (interface index, label, number of references, number of bound
+ addresses).
+ dev_stat network device status
+ ip_fwchains Firewall chain linkage
+ ip_fwnames Firewall chain names
+ ip_masq Directory containing the masquerading tables
+ ip_masquerade Major masquerading table
+ netstat Network statistics
+ raw raw device statistics
+ route Kernel routing table
+ rpc Directory containing rpc info
+ rt_cache Routing cache
+ snmp SNMP data
+ sockstat Socket statistics
+ tcp TCP sockets
+ tr_rif Token ring RIF routing table
+ udp UDP sockets
+ unix UNIX domain sockets
+ wireless Wireless interface data (Wavelan etc)
+ igmp IP multicast addresses, which this host joined
+ psched Global packet scheduler parameters.
+ netlink List of PF_NETLINK sockets
+ ip_mr_vifs List of multicast virtual interfaces
+ ip_mr_cache List of multicast routing cache
+..............................................................................
+
+You can use this information to see which network devices are available in
+your system and how much traffic was routed over those devices:
+
+ > cat /proc/net/dev
+ Inter-|Receive |[...
+ face |bytes packets errs drop fifo frame compressed multicast|[...
+ lo: 908188 5596 0 0 0 0 0 0 [...
+ ppp0:15475140 20721 410 0 0 410 0 0 [...
+ eth0: 614530 7085 0 0 0 0 0 1 [...
+
+ ...] Transmit
+ ...] bytes packets errs drop fifo colls carrier compressed
+ ...] 908188 5596 0 0 0 0 0 0
+ ...] 1375103 17405 0 0 0 0 0 0
+ ...] 1703981 5535 0 0 0 3 0 0
+
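+The layout above is regular enough to parse mechanically. A minimal sketch
+(the two header lines are skipped, and only the first two receive columns
+are decoded here):
+
+  #include <stdio.h>
+
+  int main(void)
+  {
+    FILE *f = fopen("/proc/net/dev", "r");
+    char line[512], name[32];
+    unsigned long long rx_bytes, rx_packets;
+
+    if (!f)
+      return 1;
+    /* skip the two header lines */
+    fgets(line, sizeof(line), f);
+    fgets(line, sizeof(line), f);
+    while (fgets(line, sizeof(line), f)) {
+      /* "  eth0: 614530 7085 ..." - the name ends at the colon */
+      if (sscanf(line, " %31[^:]: %llu %llu",
+                 name, &rx_bytes, &rx_packets) == 3)
+        printf("%s: %llu bytes, %llu packets received\n",
+               name, rx_bytes, rx_packets);
+    }
+    fclose(f);
+    return 0;
+  }
+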
+In addition, each Channel Bond interface has its own directory. For
+example, the bond0 device will have a directory called /proc/net/bond0/.
+It will contain information that is specific to that bond, such as the
+current slaves of the bond, the link status of the slaves, and how
+many times the slaves' links have failed.
+
+1.5 SCSI info
+-------------
+
+If you have a SCSI host adapter in your system, you'll find a subdirectory
+named after the driver for this adapter in /proc/scsi. You'll also see a list
+of all recognized SCSI devices in /proc/scsi:
+
+ >cat /proc/scsi/scsi
+ Attached devices:
+ Host: scsi0 Channel: 00 Id: 00 Lun: 00
+ Vendor: IBM Model: DGHS09U Rev: 03E0
+ Type: Direct-Access ANSI SCSI revision: 03
+ Host: scsi0 Channel: 00 Id: 06 Lun: 00
+ Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04
+ Type: CD-ROM ANSI SCSI revision: 02
+
+
+The directory named after the driver has one file for each adapter found in
+the system. These files contain information about the controller, including
+the used IRQ and the IO address range. The amount of information shown is
+dependent on the adapter you use. The example shows the output for an Adaptec
+AHA-2940 SCSI adapter:
+
+ > cat /proc/scsi/aic7xxx/0
+
+ Adaptec AIC7xxx driver version: 5.1.19/3.2.4
+ Compile Options:
+ TCQ Enabled By Default : Disabled
+ AIC7XXX_PROC_STATS : Disabled
+ AIC7XXX_RESET_DELAY : 5
+ Adapter Configuration:
+ SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter
+ Ultra Wide Controller
+ PCI MMAPed I/O Base: 0xeb001000
+ Adapter SEEPROM Config: SEEPROM found and used.
+ Adaptec SCSI BIOS: Enabled
+ IRQ: 10
+ SCBs: Active 0, Max Active 2,
+ Allocated 15, HW 16, Page 255
+ Interrupts: 160328
+ BIOS Control Word: 0x18b6
+ Adapter Control Word: 0x005b
+ Extended Translation: Enabled
+ Disconnect Enable Flags: 0xffff
+ Ultra Enable Flags: 0x0001
+ Tag Queue Enable Flags: 0x0000
+ Ordered Queue Tag Flags: 0x0000
+ Default Tag Queue Depth: 8
+ Tagged Queue By Device array for aic7xxx host instance 0:
+ {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255}
+ Actual queue depth per device for aic7xxx host instance 0:
+ {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
+ Statistics:
+ (scsi0:0:0:0)
+ Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8
+ Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0)
+ Total transfers 160151 (74577 reads and 85574 writes)
+ (scsi0:0:6:0)
+ Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15
+ Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0)
+ Total transfers 0 (0 reads and 0 writes)
+
+
+1.6 Parallel port info in /proc/parport
+---------------------------------------
+
+The directory /proc/parport contains information about the parallel ports of
+your system. It has one subdirectory for each port, named after the port
+number (0,1,2,...).
+
+These directories contain the four files shown in Table 1-10.
+
+
+Table 1-10: Files in /proc/parport
+..............................................................................
+ File Content
+ autoprobe Any IEEE-1284 device ID information that has been acquired.
+ devices list of the device drivers using that port. A + will appear by the
+ name of the device currently using the port (it might not appear
+ against any).
+ hardware Parallel port's base address, IRQ line and DMA channel.
+ irq IRQ that parport is using for that port. This is in a separate
+ file to allow you to alter it by writing a new value in (IRQ
+ number or none).
+..............................................................................
+
+1.7 TTY info in /proc/tty
+-------------------------
+
+Information about the available and actually used tty's can be found in the
+directory /proc/tty. You'll find entries for drivers and line disciplines in
+this directory, as shown in Table 1-11.
+
+
+Table 1-11: Files in /proc/tty
+..............................................................................
+ File Content
+ drivers list of drivers and their usage
+ ldiscs registered line disciplines
+ driver/serial usage statistic and status of single tty lines
+..............................................................................
+
+To see which tty's are currently in use, you can simply look into the file
+/proc/tty/drivers:
+
+ > cat /proc/tty/drivers
+ pty_slave /dev/pts 136 0-255 pty:slave
+ pty_master /dev/ptm 128 0-255 pty:master
+ pty_slave /dev/ttyp 3 0-255 pty:slave
+ pty_master /dev/pty 2 0-255 pty:master
+ serial /dev/cua 5 64-67 serial:callout
+ serial /dev/ttyS 4 64-67 serial
+ /dev/tty0 /dev/tty0 4 0 system:vtmaster
+ /dev/ptmx /dev/ptmx 5 2 system
+ /dev/console /dev/console 5 1 system:console
+ /dev/tty /dev/tty 5 0 system:/dev/tty
+ unknown /dev/tty 4 1-63 console
+
+
+1.8 Miscellaneous kernel statistics in /proc/stat
+-------------------------------------------------
+
+Various pieces of information about kernel activity are available in the
+/proc/stat file. All of the numbers reported in this file are aggregates
+since the system first booted. For a quick look, simply cat the file:
+
+ > cat /proc/stat
+ cpu 2255 34 2290 22625563 6290 127 456 0 0
+ cpu0 1132 34 1441 11311718 3675 127 438 0 0
+ cpu1 1123 0 849 11313845 2614 0 18 0 0
+ intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
+ ctxt 1990473
+ btime 1062191376
+ processes 2915
+ procs_running 1
+ procs_blocked 0
+ softirq 183433 0 21755 12 39 1137 231 21459 2263
+
+The very first "cpu" line aggregates the numbers in all of the other "cpuN"
+lines. These numbers identify the amount of time the CPU has spent performing
+different kinds of work. Time units are in USER_HZ (typically hundredths of a
+second). The meanings of the columns are as follows, from left to right:
+
+- user: normal processes executing in user mode
+- nice: niced processes executing in user mode
+- system: processes executing in kernel mode
+- idle: twiddling thumbs
+- iowait: waiting for I/O to complete
+- irq: servicing interrupts
+- softirq: servicing softirqs
+- steal: involuntary wait
+- guest: running a normal guest
+- guest_nice: running a niced guest
+
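+Because all of these counters increase monotonically, utilization tools
+normally sample the file twice and work with the deltas. A minimal sketch of
+that technique (the one second interval and the derived percentage are
+illustrative choices, not part of the file format):
+
+  #include <stdio.h>
+  #include <unistd.h>
+
+  /* read the aggregate "cpu" line; returns total and idle time in USER_HZ */
+  static int read_cpu(unsigned long long *total, unsigned long long *idle)
+  {
+    unsigned long long v[10] = { 0 };
+    FILE *f = fopen("/proc/stat", "r");
+    int i, n;
+
+    if (!f)
+      return -1;
+    n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
+               &v[0], &v[1], &v[2], &v[3], &v[4],
+               &v[5], &v[6], &v[7], &v[8], &v[9]);
+    fclose(f);
+    if (n < 4)
+      return -1;
+    *total = 0;
+    for (i = 0; i < 10; i++)
+      *total += v[i];
+    *idle = v[3];  /* the fourth column is idle */
+    return 0;
+  }
+
+  int main(void)
+  {
+    unsigned long long t1, i1, t2, i2;
+
+    if (read_cpu(&t1, &i1))
+      return 1;
+    sleep(1);
+    if (read_cpu(&t2, &i2))
+      return 1;
+    printf("cpu busy: %.1f%%\n",
+           100.0 * ((t2 - t1) - (i2 - i1)) / (t2 - t1));
+    return 0;
+  }
+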
+The "intr" line gives counts of interrupts serviced since boot time, for each
+of the possible system interrupts. The first column is the total of all
+interrupts serviced; each subsequent column is the total for that particular
+interrupt.
+
+The "ctxt" line gives the total number of context switches across all CPUs.
+
+The "btime" line gives the time at which the system booted, in seconds since
+the Unix epoch.
+
+The "processes" line gives the number of processes and threads created, which
+includes (but is not limited to) those created by calls to the fork() and
+clone() system calls.
+
+The "procs_running" line gives the total number of threads that are
+running or ready to run (i.e., the total number of runnable threads).
+
+The "procs_blocked" line gives the number of processes currently blocked,
+waiting for I/O to complete.
+
+The "softirq" line gives counts of softirqs serviced since boot time, for each
+of the possible system softirqs. The first column is the total of all
+softirqs serviced; each subsequent column is the total for that particular
+softirq.
+
+
+1.9 Ext4 file system parameters
+------------------------------
+
+Information about mounted ext4 file systems can be found in
+/proc/fs/ext4. Each mounted filesystem will have a directory in
+/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or
+/proc/fs/ext4/dm-0). The files in each per-device directory are shown
+in Table 1-12, below.
+
+Table 1-12: Files in /proc/fs/ext4/<devname>
+..............................................................................
+ File Content
+ mb_groups details of multiblock allocator buddy cache of free blocks
+..............................................................................
+
+2.0 /proc/consoles
+------------------
+Shows registered system console lines.
+
+To see which character device lines are currently used for the system console
+/dev/console, you may simply look into the file /proc/consoles:
+
+ > cat /proc/consoles
+ tty0 -WU (ECp) 4:7
+ ttyS0 -W- (Ep) 4:64
+
+The columns are:
+
+ device name of the device
+ operations R = can do read operations
+ W = can do write operations
+ U = can do unblank
+ flags E = it is enabled
+ C = it is preferred console
+ B = it is primary boot console
+ p = it is used for printk buffer
+ b = it is not a TTY but a Braille device
+ a = it is safe to use when cpu is offline
+ major:minor major and minor number of the device separated by a colon
+
+------------------------------------------------------------------------------
+Summary
+------------------------------------------------------------------------------
+The /proc file system serves information about the running system. It not only
+allows access to process data but also allows you to request the kernel status
+by reading files in the hierarchy.
+
+The directory structure of /proc reflects the types of information and makes
+it easy, if not obvious, where to look for specific data.
+------------------------------------------------------------------------------
+
+------------------------------------------------------------------------------
+CHAPTER 2: MODIFYING SYSTEM PARAMETERS
+------------------------------------------------------------------------------
+
+------------------------------------------------------------------------------
+In This Chapter
+------------------------------------------------------------------------------
+* Modifying kernel parameters by writing into files found in /proc/sys
+* Exploring the files which modify certain parameters
+* Review of the /proc/sys file tree
+------------------------------------------------------------------------------
+
+
+A very interesting part of /proc is the directory /proc/sys. This is not only
+a source of information, it also allows you to change parameters within the
+kernel. Be very careful when attempting this. You can optimize your system,
+but you can also cause it to crash. Never alter kernel parameters on a
+production system. Set up a development machine and test to make sure that
+everything works the way you want it to. You may have no alternative but to
+reboot the machine once an error has been made.
+
+To change a value, simply echo the new value into the file. An example is
+given below in the section on the file system data. You need to be root to do
+this. You can create your own boot script to perform this every time your
+system boots.
+
+The files in /proc/sys can be used to fine tune and monitor miscellaneous and
+general things in the operation of the Linux kernel. Since some of the files
+can inadvertently disrupt your system, it is advisable to read both
+documentation and source before actually making adjustments. In any case, be
+very careful when writing to any of these files. The entries in /proc may
+change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt
+review the kernel documentation in the directory /usr/src/linux/Documentation.
+This chapter is heavily based on the documentation included in the pre 2.2
+kernels, and became part of it in version 2.2.1 of the Linux kernel.
+
+Please see the Documentation/sysctl/ directory for descriptions of these
+entries.
+
+------------------------------------------------------------------------------
+Summary
+------------------------------------------------------------------------------
+Certain aspects of kernel behavior can be modified at runtime, without the
+need to recompile the kernel, or even to reboot the system. The files in the
+/proc/sys tree can not only be read, but also modified. You can use the echo
+command to write a value into these files, thereby changing the default settings
+of the kernel.
+------------------------------------------------------------------------------
+
+------------------------------------------------------------------------------
+CHAPTER 3: PER-PROCESS PARAMETERS
+------------------------------------------------------------------------------
+
+3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer score
+--------------------------------------------------------------------------------
+
+These files can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory conditions.
+
+The badness heuristic assigns a value to each candidate task ranging from 0
+(never kill) to 1000 (always kill) to determine which process is targeted. The
+units are roughly the proportion of the task's allowed memory that it is
+estimated to be using, based on its current memory and swap use.
+For example, if a task is using all allowed memory, its badness score will be
+1000. If it is using half of its allowed memory, its score will be 500.
+
+There is an additional factor included in the badness score: root
+processes are given 3% extra memory over other tasks.
+
+The amount of "allowed" memory depends on the context in which the oom killer
+was called. If it is due to the memory assigned to the allocating task's cpuset
+being exhausted, the allowed memory represents the set of mems assigned to that
+cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed
+memory represents the set of mempolicy nodes. If it is due to a memory
+limit (or swap limit) being reached, the allowed memory is that configured
+limit. Finally, if it is due to the entire system being out of memory, the
+allowed memory represents all allocatable resources.
+
+The value of /proc/<pid>/oom_score_adj is added to the badness score before it
+is used to determine which task to kill. Acceptable values range from -1000
+(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to
+polarize the preference for oom killing either by always preferring a certain
+task or completely disabling it. The lowest possible value, -1000, is
+equivalent to disabling oom killing entirely for that task since it will always
+report a badness score of 0.
+
+Consequently, it is very simple for userspace to define the amount of memory to
+consider for each task. Setting a /proc/<pid>/oom_score_adj value of +500, for
+example, is roughly equivalent to allowing the remainder of tasks sharing the
+same system, cpuset, mempolicy, or memory controller resources to use at least
+50% more memory. A value of -500, on the other hand, would be roughly
+equivalent to discounting 50% of the task's allowed memory from being considered
+as scoring against the task.
+
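+As a sketch of the mechanics (the value 500 is just the illustrative figure
+used above), a process can volunteer itself as a preferred oom victim by
+raising its own score adjustment:
+
+  #include <stdio.h>
+
+  int main(void)
+  {
+    FILE *f = fopen("/proc/self/oom_score_adj", "w");
+
+    if (!f)
+      return 1;
+    /* make this process considerably more attractive to the oom killer */
+    fprintf(f, "500\n");
+    fclose(f);
+    /* ... continue with normal work ... */
+    return 0;
+  }
+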
+For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
+be used to tune the badness score. Its acceptable values range from -16
+(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
+(OOM_DISABLE) to disable oom killing entirely for that task. Its value is
+scaled linearly with /proc/<pid>/oom_score_adj.
+
+Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
+other with its scaled value.
+
+The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
+value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
+requires CAP_SYS_RESOURCE.
+
+NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
+Documentation/feature-removal-schedule.txt.
+
+Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with separate address spaces instead, if possible. This
+prevents servers and important system daemons from being killed and loses the
+minimal amount of work.
+
+
+3.2 /proc/<pid>/oom_score - Display current oom-killer score
+-------------------------------------------------------------
+
+This file can be used to check the current score used by the oom-killer for
+any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
+process should be killed in an out-of-memory situation.
+
+
+3.3 /proc/<pid>/io - Display the IO accounting fields
+-------------------------------------------------------
+
+This file contains IO statistics for each running process.
+
+Example
+-------
+
+test:/tmp # dd if=/dev/zero of=/tmp/test.dat &
+[1] 3828
+
+test:/tmp # cat /proc/3828/io
+rchar: 323934931
+wchar: 323929600
+syscr: 632687
+syscw: 632675
+read_bytes: 0
+write_bytes: 323932160
+cancelled_write_bytes: 0
+
+
+Description
+-----------
+
+rchar
+-----
+
+I/O counter: chars read
+The number of bytes which this task has caused to be read from storage. This
+is simply the sum of bytes which this process passed to read() and pread().
+It includes things like tty IO and it is unaffected by whether or not actual
+physical disk IO was required (the read might have been satisfied from
+pagecache)
+
+
+wchar
+-----
+
+I/O counter: chars written
+The number of bytes which this task has caused, or shall cause to be written
+to disk. Similar caveats apply here as with rchar.
+
+
+syscr
+-----
+
+I/O counter: read syscalls
+Attempt to count the number of read I/O operations, i.e. syscalls like read()
+and pread().
+
+
+syscw
+-----
+
+I/O counter: write syscalls
+Attempt to count the number of write I/O operations, i.e. syscalls like
+write() and pwrite().
+
+
+read_bytes
+----------
+
+I/O counter: bytes read
+Attempt to count the number of bytes which this process really did cause to
+be fetched from the storage layer. Done at the submit_bio() level, so it is
+accurate for block-backed filesystems. <please add status regarding NFS and
+CIFS at a later time>
+
+
+write_bytes
+-----------
+
+I/O counter: bytes written
+Attempt to count the number of bytes which this process caused to be sent to
+the storage layer. This is done at page-dirtying time.
+
+
+cancelled_write_bytes
+---------------------
+
+The big inaccuracy here is truncate. If a process writes 1MB to a file and
+then deletes the file, it will in fact perform no writeout. But it will have
+been accounted as having caused 1MB of write.
+In other words: The number of bytes which this process caused to not happen,
+by truncating pagecache. A task can cause "negative" IO too. If this task
+truncates some dirty pagecache, some IO which another task has been accounted
+for (in its write_bytes) will not be happening. We _could_ just subtract that
+from the truncating task's write_bytes, but there is information loss in doing
+that.
+
+
+Note
+----
+
+At its current implementation state, this is a bit racy on 32-bit machines: if
+process A reads process B's /proc/pid/io while process B is updating one of
+those 64-bit counters, process A could see an intermediate result.
+
+
+More information about this can be found within the taskstats documentation in
+Documentation/accounting.
+
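+As a small illustration, the counters can also be read programmatically; this
+hypothetical helper simply decodes the "name: value" lines shown in the
+example above:
+
+  #include <stdio.h>
+
+  int main(int argc, char *argv[])
+  {
+    char path[64], key[32];
+    unsigned long long val;
+    FILE *f;
+
+    if (argc != 2)
+      return 1;
+    snprintf(path, sizeof(path), "/proc/%s/io", argv[1]);
+    f = fopen(path, "r");
+    if (!f)
+      return 1;
+    /* each line looks like "rchar: 323934931" */
+    while (fscanf(f, "%31[^:]: %llu\n", key, &val) == 2)
+      printf("%s = %llu\n", key, val);
+    fclose(f);
+    return 0;
+  }
+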
+3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
+---------------------------------------------------------------
+When a process is dumped, all anonymous memory is written to a core file as
+long as the size of the core file isn't limited. But sometimes we don't want
+to dump some memory segments, for example, huge shared memory. Conversely,
+sometimes we want to save file-backed memory segments into a core file, not
+only the individual files.
+
+/proc/<pid>/coredump_filter allows you to customize which memory segments
+will be dumped when the <pid> process is dumped. coredump_filter is a bitmask
+of memory types. If a bit of the bitmask is set, memory segments of the
+corresponding memory type are dumped, otherwise they are not dumped.
+
+The following 7 memory types are supported:
+ - (bit 0) anonymous private memory
+ - (bit 1) anonymous shared memory
+ - (bit 2) file-backed private memory
+ - (bit 3) file-backed shared memory
+ - (bit 4) ELF header pages in file-backed private memory areas (it is
+ effective only if the bit 2 is cleared)
+ - (bit 5) hugetlb private memory
+ - (bit 6) hugetlb shared memory
+
+ Note that MMIO pages such as frame buffer are never dumped and vDSO pages
+ are always dumped regardless of the bitmask status.
+
+ Note that bits 0-4 don't affect hugetlb memory; hugetlb memory is only
+ affected by bits 5-6.
+
+Default value of coredump_filter is 0x23; this means all anonymous memory
+segments and hugetlb private memory are dumped.
+
+If you don't want to dump all shared memory segments attached to pid 1234,
+write 0x21 to the process's proc file.
+
+ $ echo 0x21 > /proc/1234/coredump_filter
+
+When a new process is created, the process inherits the bitmask status from its
+parent. It is useful to set up coredump_filter before the program runs.
+For example:
+
+ $ echo 0x7 > /proc/self/coredump_filter
+ $ ./some_program
+
+3.5 /proc/<pid>/mountinfo - Information about mounts
+--------------------------------------------------------
+
+This file contains lines of the form:
+
+36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
+(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11)
+
+(1) mount ID: unique identifier of the mount (may be reused after umount)
+(2) parent ID: ID of parent (or of self for the top of the mount tree)
+(3) major:minor: value of st_dev for files on filesystem
+(4) root: root of the mount within the filesystem
+(5) mount point: mount point relative to the process's root
+(6) mount options: per mount options
+(7) optional fields: zero or more fields of the form "tag[:value]"
+(8) separator: marks the end of the optional fields
+(9) filesystem type: name of filesystem of the form "type[.subtype]"
+(10) mount source: filesystem specific information or "none"
+(11) super options: per super block options
+
+Parsers should ignore all unrecognised optional fields. Currently the
+possible optional fields are:
+
+shared:X mount is shared in peer group X
+master:X mount is slave to peer group X
+propagate_from:X mount is slave and receives propagation from peer group X (*)
+unbindable mount is unbindable
+
+(*) X is the closest dominant peer group under the process's root. If
+X is the immediate master of the mount, or if there's no dominant peer
+group under the same root, then only the "master:X" field is present
+and not the "propagate_from:X" field.
+
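+A minimal sketch of such a parser, which reads the fixed leading fields and
+then skips the variable-length optional fields by scanning for the " - "
+separator:
+
+  #include <stdio.h>
+  #include <string.h>
+
+  int main(void)
+  {
+    FILE *f = fopen("/proc/self/mountinfo", "r");
+    char line[1024];
+
+    if (!f)
+      return 1;
+    while (fgets(line, sizeof(line), f)) {
+      int id, parent, major, minor;
+      char root[256], mnt[256], fstype[64];
+      char *sep;
+
+      /* fields (1)-(5) are fixed */
+      if (sscanf(line, "%d %d %d:%d %255s %255s",
+                 &id, &parent, &major, &minor, root, mnt) != 6)
+        continue;
+      /* the optional fields are variable; skip to the " - " separator */
+      sep = strstr(line, " - ");
+      if (!sep || sscanf(sep + 3, "%63s", fstype) != 1)
+        continue;
+      printf("mount %d: %s on %s, type %s\n", id, root, mnt, fstype);
+    }
+    fclose(f);
+    return 0;
+  }
+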
+For more information on mount propagation see:
+
+ Documentation/filesystems/sharedsubtree.txt
+
+
+3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
+--------------------------------------------------------
+These files provide a method to access a task's comm value. They also allow
+a task to set its own or one of its thread siblings' comm value. The comm value
+is limited in size compared to the cmdline value, so writing anything longer
+than the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
+comm value.
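+
+As a hypothetical illustration, a process can rename itself by writing to
+its own comm file:
+
+  #include <stdio.h>
+
+  int main(void)
+  {
+    FILE *f = fopen("/proc/self/comm", "w");
+
+    if (!f)
+      return 1;
+    /* names longer than TASK_COMM_LEN are silently truncated */
+    fprintf(f, "my-worker");
+    fclose(f);
+    return 0;
+  }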
diff --git a/Documentation/filesystems/quota.txt b/Documentation/filesystems/quota.txt
new file mode 100644
index 00000000..5e8de25b
--- /dev/null
+++ b/Documentation/filesystems/quota.txt
@@ -0,0 +1,65 @@
+
+Quota subsystem
+===============
+
+The quota subsystem allows a system administrator to set limits on used space
+and the number of used inodes (an inode is a filesystem structure which is
+associated with each file or directory) for users and/or groups. For both used
+space and number of used inodes there are actually two limits. The first one
+is called a softlimit and the second one a hardlimit. A user can never exceed
+a hardlimit for any resource (unless he has the CAP_SYS_RESOURCE capability).
+A user is allowed to exceed a softlimit, but only for a limited period of
+time. This period is called the "grace period" or "grace time". When the grace
+time is over, the user is not able to allocate more space/inodes until he
+frees enough of them to get below the softlimit.
+
+Quota limits (and amount of grace time) are set independently for each
+filesystem.
+
+For more details about quota design, see the documentation in quota-tools package
+(http://sourceforge.net/projects/linuxquota).
+
+Quota netlink interface
+=======================
+When a user exceeds a softlimit, runs out of grace time, or reaches a
+hardlimit, the quota subsystem traditionally printed a message to the
+controlling terminal of the process which caused the excess. This method has
+the disadvantage that a user working in a graphical desktop usually cannot
+see the message. Thus the quota netlink interface has been designed to pass
+information about the above events to userspace. There they can be captured
+by an application and processed accordingly.
+
+The interface uses the generic netlink framework (see
+http://lwn.net/Articles/208755/ and http://people.suug.ch/~tgr/libnl/ for more
+details about this layer). The name of the quota generic netlink interface
+is "VFS_DQUOT". Definitions of the constants below are in <linux/quota.h>.
+Currently, the interface supports only one message type, QUOTA_NL_C_WARNING.
+This command is used to send a notification about any of the above mentioned
+events. Each message has six attributes. These are (type of the argument is
+in parentheses):
+ QUOTA_NL_A_QTYPE (u32)
+ - type of quota being exceeded (one of USRQUOTA, GRPQUOTA)
+ QUOTA_NL_A_EXCESS_ID (u64)
+ - UID/GID (depends on quota type) of user / group whose limit
+ is being exceeded.
+ QUOTA_NL_A_CAUSED_ID (u64)
+ - UID of a user who caused the event
+ QUOTA_NL_A_WARNING (u32)
+ - what kind of limit is exceeded:
+ QUOTA_NL_IHARDWARN - inode hardlimit
+ QUOTA_NL_ISOFTLONGWARN - inode softlimit is exceeded longer
+ than given grace period
+ QUOTA_NL_ISOFTWARN - inode softlimit
+ QUOTA_NL_BHARDWARN - space (block) hardlimit
+ QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded
+ longer than given grace period.
+ QUOTA_NL_BSOFTWARN - space (block) softlimit
+ - four warnings are also defined for the event when user stops
+ exceeding some limit:
+ QUOTA_NL_IHARDBELOW - inode hardlimit
+ QUOTA_NL_ISOFTBELOW - inode softlimit
+ QUOTA_NL_BHARDBELOW - space (block) hardlimit
+ QUOTA_NL_BSOFTBELOW - space (block) softlimit
+ QUOTA_NL_A_DEV_MAJOR (u32)
+ - major number of a device with the affected filesystem
+ QUOTA_NL_A_DEV_MINOR (u32)
+ - minor number of a device with the affected filesystem
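+
+As a sketch of how these events might be consumed, the following listener
+uses a libnl-3 style API; the library calls and the "events" multicast group
+name are assumptions of this example, not part of the interface described
+above:
+
+  #include <stdio.h>
+  #include <linux/quota.h>
+  #include <netlink/netlink.h>
+  #include <netlink/genl/genl.h>
+  #include <netlink/genl/ctrl.h>
+
+  /* called for every message received from the "VFS_DQUOT" family */
+  static int on_warning(struct nl_msg *msg, void *arg)
+  {
+    struct nlattr *attrs[QUOTA_NL_A_MAX + 1];
+
+    if (genlmsg_parse(nlmsg_hdr(msg), 0, attrs, QUOTA_NL_A_MAX, NULL))
+      return NL_SKIP;
+    if (attrs[QUOTA_NL_A_WARNING] && attrs[QUOTA_NL_A_EXCESS_ID])
+      printf("warning %u for id %llu\n",
+             nla_get_u32(attrs[QUOTA_NL_A_WARNING]),
+             (unsigned long long)nla_get_u64(attrs[QUOTA_NL_A_EXCESS_ID]));
+    return NL_OK;
+  }
+
+  int main(void)
+  {
+    struct nl_sock *sk = nl_socket_alloc();
+    int grp;
+
+    if (!sk || genl_connect(sk))
+      return 1;
+    /* group name is an assumption; check your kernel's quota netlink code */
+    grp = genl_ctrl_resolve_grp(sk, "VFS_DQUOT", "events");
+    if (grp < 0 || nl_socket_add_membership(sk, grp))
+      return 1;
+    nl_socket_disable_seq_check(sk);
+    nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, on_warning, NULL);
+    for (;;)
+      nl_recvmsgs_default(sk);
+  }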
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
new file mode 100644
index 00000000..a8273d5f
--- /dev/null
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -0,0 +1,355 @@
+ramfs, rootfs and initramfs
+October 17, 2005
+Rob Landley <rob@landley.net>
+=============================
+
+What is ramfs?
+--------------
+
+Ramfs is a very simple filesystem that exports Linux's disk caching
+mechanisms (the page cache and dentry cache) as a dynamically resizable
+RAM-based filesystem.
+
+Normally all files are cached in memory by Linux. Pages of data read from
+backing store (usually the block device the filesystem is mounted on) are kept
+around in case it's needed again, but marked as clean (freeable) in case the
+Virtual Memory system needs the memory for something else. Similarly, data
+written to files is marked clean as soon as it has been written to backing
+store, but kept around for caching purposes until the VM reallocates the
+memory. A similar mechanism (the dentry cache) greatly speeds up access to
+directories.
+
+With ramfs, there is no backing store. Files written into ramfs allocate
+dentries and page cache as usual, but there's nowhere to write them to.
+This means the pages are never marked clean, so they can't be freed by the
+VM when it's looking to recycle memory.
+
+The amount of code required to implement ramfs is tiny, because all the
+work is done by the existing Linux caching infrastructure. Basically,
+you're mounting the disk cache as a filesystem. Because of this, ramfs is not
+an optional component removable via menuconfig, since there would be negligible
+space savings.
+
+ramfs and ramdisk:
+------------------
+
+The older "ram disk" mechanism created a synthetic block device out of
+an area of RAM and used it as backing store for a filesystem. This block
+device was of fixed size, so the filesystem mounted on it was of fixed
+size. Using a ram disk also required unnecessarily copying memory from the
+fake block device into the page cache (and copying changes back out), as well
+as creating and destroying dentries. Plus it needed a filesystem driver
+(such as ext2) to format and interpret this data.
+
+Compared to ramfs, this wastes memory (and memory bus bandwidth), creates
+unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks
+to avoid this copying by playing with the page tables, but they're unpleasantly
+complicated and turn out to be about as expensive as the copying anyway.)
+More to the point, all the work ramfs is doing has to happen _anyway_,
+since all file access goes through the page and dentry caches. The RAM
+disk is simply unnecessary; ramfs is internally much simpler.
+
+Another reason ramdisks are semi-obsolete is that the introduction of
+loopback devices offered a more flexible and convenient way to create
+synthetic block devices, now from files instead of from chunks of memory.
+See losetup (8) for details.
+
+ramfs and tmpfs:
+----------------
+
+One downside of ramfs is you can keep writing data into it until you fill
+up all memory, and the VM can't free it because the VM thinks that files
+should get written to backing store (rather than swap space), but ramfs hasn't
+got any backing store. Because of this, only root (or a trusted user) should
+be allowed write access to a ramfs mount.
+
+A ramfs derivative called tmpfs was created to add size limits, and the ability
+to write the data to swap space. Normal users can be allowed write access to
+tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
+
+What is rootfs?
+---------------
+
+Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
+always present in 2.6 systems. You can't unmount rootfs for approximately the
+same reason you can't kill the init process; rather than having special code
+to check for and handle an empty list, it's smaller and simpler for the kernel
+to just make sure certain lists can't become empty.
+
+Most systems just mount another filesystem over rootfs and ignore it. The
+amount of space an empty instance of ramfs takes up is tiny.
+
+What is initramfs?
+------------------
+
+All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is
+extracted into rootfs when the kernel boots up. After extracting, the kernel
+checks to see if rootfs contains a file "init", and if so it executes it as PID
+1. If found, this init process is responsible for bringing the system the
+rest of the way up, including locating and mounting the real root device (if
+any). If rootfs does not contain an init program after the embedded cpio
+archive is extracted into it, the kernel will fall through to the older code
+to locate and mount a root partition, then exec some variant of /sbin/init
+out of that.
+
+All this differs from the old initrd in several ways:
+
+ - The old initrd was always a separate file, while the initramfs archive is
+ linked into the linux kernel image. (The directory linux-*/usr is devoted
+ to generating this archive during the build.)
+
+ - The old initrd file was a gzipped filesystem image (in some file format,
+ such as ext2, that needed a driver built into the kernel), while the new
+ initramfs archive is a gzipped cpio archive (like tar only simpler,
+ see cpio(1) and Documentation/early-userspace/buffer-format.txt). The
+ kernel's cpio extraction code is not only extremely small, it's also
+ __init text and data that can be discarded during the boot process.
+
+ - The program run by the old initrd (which was called /initrd, not /init) did
+ some setup and then returned to the kernel, while the init program from
+ initramfs is not expected to return to the kernel. (If /init needs to hand
+ off control it can overmount / with a new root device and exec another init
+ program. See the switch_root utility, below.)
+
+ - When switching to another root device, initrd would pivot_root and then
+ umount the ramdisk. But initramfs is rootfs: you can neither pivot_root
+ rootfs, nor unmount it. Instead delete everything out of rootfs to
+ free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs
+ with the new root (cd /newmount; mount --move . /; chroot .), attach
+ stdin/stdout/stderr to the new /dev/console, and exec the new init.
+
+ Since this is a remarkably persnickety process (and involves deleting
+ commands before you can run them), the klibc package introduced a helper
+ program (utils/run_init.c) to do all this for you. Most other packages
+ (such as busybox) have named this command "switch_root".
+
+Populating initramfs:
+---------------------
+
+The 2.6 kernel build process always creates a gzipped cpio format initramfs
+archive and links it into the resulting kernel binary. By default, this
+archive is empty (consuming 134 bytes on x86).
+
+The config option CONFIG_INITRAMFS_SOURCE (in General Setup in menuconfig,
+and living in usr/Kconfig) can be used to specify a source for the
+initramfs archive, which will automatically be incorporated into the
+resulting binary. This option can point to an existing gzipped cpio
+archive, a directory containing files to be archived, or a text file
+specification such as the following example:
+
+ dir /dev 755 0 0
+ nod /dev/console 644 0 0 c 5 1
+ nod /dev/loop0 644 0 0 b 7 0
+ dir /bin 755 1000 1000
+ slink /bin/sh busybox 777 0 0
+ file /bin/busybox initramfs/busybox 755 0 0
+ dir /proc 755 0 0
+ dir /sys 755 0 0
+ dir /mnt 755 0 0
+ file /init initramfs/init.sh 755 0 0
+
+Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
+documenting the above file format.
+
+One advantage of the configuration file is that root access is not required to
+set permissions or create device nodes in the new archive. (Note that those
+two example "file" entries expect to find files named "init.sh" and "busybox" in
+a directory called "initramfs", under the linux-2.6.* directory. See
+Documentation/early-userspace/README for more details.)
+
+The kernel does not depend on external cpio tools. If you specify a
+directory instead of a configuration file, the kernel's build infrastructure
+creates a configuration file from that directory (usr/Makefile calls
+scripts/gen_initramfs_list.sh), and proceeds to package up that directory
+using the config file (by feeding it to usr/gen_init_cpio, which is created
+from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
+entirely self-contained, and the kernel's boot-time extractor is also
+(obviously) self-contained.
+
+The one thing you might need external cpio utilities installed for is creating
+or extracting your own preprepared cpio files to feed to the kernel build
+(instead of a config file or directory).
+
+The following command line can extract a cpio image (either by the above script
+or by the kernel build) back into its component files:
+
+ cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
+
+The following shell script can create a prebuilt cpio archive you can
+use in place of the above config file:
+
+ #!/bin/sh
+
+ # Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
+ # Licensed under GPL version 2
+
+ if [ $# -ne 2 ]
+ then
+ echo "usage: mkinitramfs directory imagename.cpio.gz"
+ exit 1
+ fi
+
+ if [ -d "$1" ]
+ then
+ echo "creating $2 from $1"
+ (cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
+ else
+ echo "First argument must be a directory"
+ exit 1
+ fi
+
+Note: The cpio man page contains some bad advice that will break your initramfs
+archive if you follow it. It says "A typical way to generate the list
+of filenames is with the find command; you should give find the -depth option
+to minimize problems with permissions on directories that are unwritable or not
+searchable." Don't do this when creating initramfs.cpio.gz images, it won't
+work. The Linux kernel cpio extractor won't create files in a directory that
+doesn't exist, so the directory entries must go before the files that go in
+those directories. The above script gets them in the right order.
+
+External initramfs images:
+--------------------------
+
+If the kernel has initrd support enabled, an external cpio.gz archive can also
+be passed into a 2.6 kernel in place of an initrd. In this case, the kernel
+will autodetect the type (initramfs, not initrd) and extract the external cpio
+archive into rootfs before trying to run /init.
+
+This has the memory efficiency advantages of initramfs (no ramdisk block
+device) but the separate packaging of initrd (which is nice if you have
+non-GPL code you'd like to run from initramfs, without conflating it with
+the GPL licensed Linux kernel binary).
+
+It can also be used to supplement the kernel's built-in initramfs image. The
+files in the external archive will overwrite any conflicting files in
+the built-in initramfs archive. Some distributors also prefer to customize
+a single kernel image with task-specific initramfs images, without recompiling.
+
+Contents of initramfs:
+----------------------
+
+An initramfs archive is a complete self-contained root filesystem for Linux.
+If you don't already understand what shared libraries, devices, and paths
+you need to get a minimal root filesystem up and running, here are some
+references:
+http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
+http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
+http://www.linuxfromscratch.org/lfs/view/stable/
+
+The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
+designed to be a tiny C library to statically link early userspace
+code against, along with some related utilities. It is BSD licensed.
+
+I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
+myself. These are LGPL and GPL, respectively. (A self-contained initramfs
+package is planned for the busybox 1.3 release.)
+
+In theory you could use glibc, but that's not well suited for small embedded
+uses like this. (A "hello world" program statically linked against glibc is
+over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do
+name lookups, even when otherwise statically linked.)
+
+A good first step is to get initramfs to run a statically linked "hello world"
+program as init, and test it under an emulator like qemu (www.qemu.org) or
+User Mode Linux, like so:
+
+ cat > hello.c << EOF
+ #include <stdio.h>
+ #include <unistd.h>
+
+ int main(int argc, char *argv[])
+ {
+ printf("Hello world!\n");
+ sleep(999999999);
+ }
+ EOF
+ gcc -static hello.c -o init
+ echo init | cpio -o -H newc | gzip > test.cpio.gz
+ # Testing external initramfs using the initrd loading mechanism.
+ qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero
+
+When debugging a normal root filesystem, it's nice to be able to boot with
+"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's
+just as useful.
+
+Why cpio rather than tar?
+-------------------------
+
+This decision was made back in December, 2001. The discussion started here:
+
+ http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html
+
+And spawned a second thread (specifically on tar vs cpio), starting here:
+
+ http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html
+
+The quick and dirty summary version (which is no substitute for reading
+the above threads) is:
+
+1) cpio is a standard. It's decades old (from the AT&T days), and already
+ widely used on Linux (inside RPM, Red Hat's device driver disks). Here's
+ a Linux Journal article about it from 1996:
+
+ http://www.linuxjournal.com/article/1213
+
+ It's not as popular as tar because the traditional cpio command line tools
+ require _truly_hideous_ command line arguments. But that says nothing
+ either way about the archive format, and there are alternative tools,
+ such as:
+
+ http://freshmeat.net/projects/afio/
+
+2) The cpio archive format chosen by the kernel is simpler and cleaner (and
+ thus easier to create and parse) than any of the (literally dozens of)
+ various tar archive formats. The complete initramfs archive format is
+ explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
+ extracted in init/initramfs.c. All three together come to less than 26k
+ total of human-readable text.
+
+3) The GNU project standardizing on tar is approximately as relevant as
+ Windows standardizing on zip. Linux is not part of either, and is free
+ to make its own technical decisions.
+
+4) Since this is a kernel internal format, it could easily have been
+ something brand new. The kernel provides its own tools to create and
+ extract this format anyway. Using an existing standard was preferable,
+ but not essential.
+
+5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be
+ supported on the kernel side"):
+
+ http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html
+
+ explained his reasoning:
+
+ http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
+ http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
+
+ and, most importantly, designed and implemented the initramfs code.
+
+Future directions:
+------------------
+
+Today (2.6.16), initramfs is always compiled in, but not always used. The
+kernel falls back to legacy boot code that is reached only if initramfs does
+not contain an /init program. The fallback is legacy code, there to ensure a
+smooth transition and to allow early boot functionality to gradually move to
+"early userspace" (i.e. initramfs).
+
+The move to early userspace is necessary because finding and mounting the real
+root device is complex. Root partitions can span multiple devices (raid or
+separate journal). They can be out on the network (requiring dhcp, setting a
+specific MAC address, logging into a server, etc). They can live on removable
+media, with dynamically allocated major/minor numbers and persistent naming
+issues requiring a full udev implementation to sort out. They can be
+compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned,
+and so on.
+
+This kind of complexity (which inevitably includes policy) is rightly handled
+in userspace. Both klibc and busybox/uClibc are working on simple initramfs
+packages to drop into a kernel build.
+
+The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree.
+The kernel's current early boot code (partition detection, etc) will probably
+be migrated into a default initramfs, automatically created and used by the
+kernel build.
diff --git a/Documentation/filesystems/relay.txt b/Documentation/filesystems/relay.txt
new file mode 100644
index 00000000..510b7226
--- /dev/null
+++ b/Documentation/filesystems/relay.txt
@@ -0,0 +1,494 @@
+relay interface (formerly relayfs)
+==================================
+
+The relay interface provides a means for kernel applications to
+efficiently log and transfer large quantities of data from the kernel
+to userspace via user-defined 'relay channels'.
+
+A 'relay channel' is a kernel->user data relay mechanism implemented
+as a set of per-cpu kernel buffers ('channel buffers'), each
+represented as a regular file ('relay file') in user space. Kernel
+clients write into the channel buffers using efficient write
+functions; these automatically log into the current cpu's channel
+buffer. User space applications mmap() or read() from the relay files
+and retrieve the data as it becomes available. The relay files
+themselves are files created in a host filesystem, e.g. debugfs, and
+are associated with the channel buffers using the API described below.
+
+The format of the data logged into the channel buffers is completely
+up to the kernel client; the relay interface does however provide
+hooks which allow kernel clients to impose some structure on the
+buffer data. The relay interface doesn't implement any form of data
+filtering - this also is left to the kernel client. The purpose is to
+keep things as simple as possible.
+
+This document provides an overview of the relay interface API. The
+details of the function parameters are documented along with the
+functions in the relay interface code - please see that for details.
+
+Semantics
+=========
+
+Each relay channel has one buffer per CPU, each buffer has one or more
+sub-buffers. Messages are written to the first sub-buffer until it is
+too full to contain a new message, in which case it is written to
+the next (if available). Messages are never split across sub-buffers.
+At this point, userspace can be notified so it empties the first
+sub-buffer, while the kernel continues writing to the next.
+
+When notified that a sub-buffer is full, the kernel knows how many
+bytes of it are padding, i.e. unused space occurring because a complete
+message couldn't fit into a sub-buffer. Userspace can use this
+knowledge to copy only valid data.
+
+After copying it, userspace can notify the kernel that a sub-buffer
+has been consumed.
+
+A relay channel can operate in a mode where it will overwrite data not
+yet collected by userspace, and not wait for it to be consumed.
+
+The relay channel itself does not provide for communication of such
+data between userspace and kernel, allowing the kernel side to remain
+simple and not impose a single interface on userspace. It does
+provide a set of examples and a separate helper though, described
+below.
+
+The read() interface both removes padding and internally consumes the
+read sub-buffers; thus in cases where read(2) is being used to drain
+the channel buffers, special-purpose communication between kernel and
+user isn't necessary for basic operation.
+
+One of the major goals of the relay interface is to provide a low
+overhead mechanism for conveying kernel data to userspace. While the
+read() interface is easy to use, it's not as efficient as the mmap()
+approach; the example code attempts to make the tradeoff between the
+two approaches as small as possible.
+
+klog and relay-apps example code
+================================
+
+The relay interface itself is ready to use, but to make things easier,
+a couple of simple utility functions and a set of examples are provided.
+
+The relay-apps example tarball, available on the relay sourceforge
+site, contains a set of self-contained examples, each consisting of a
+pair of .c files containing boilerplate code for each of the user and
+kernel sides of a relay application. When combined these two sets of
+boilerplate code provide glue to easily stream data to disk, without
+having to bother with mundane housekeeping chores.
+
+The 'klog debugging functions' patch (klog.patch in the relay-apps
+tarball) provides a couple of high-level logging functions to the
+kernel which allow writing formatted text or raw data to a channel,
+regardless of whether a channel to write into exists or not, or even
+whether the relay interface is compiled into the kernel or not. These
+functions allow you to put unconditional 'trace' statements anywhere
+in the kernel or kernel modules; only when there is a 'klog handler'
+registered will data actually be logged (see the klog and kleak
+examples for details).
+
+It is of course possible to use the relay interface from scratch,
+i.e. without using any of the relay-apps example code or klog, but
+you'll have to implement communication between userspace and kernel,
+allowing both to convey the state of buffers (full, empty, amount of
+padding). As noted earlier, the read() interface removes padding and
+internally consumes the read sub-buffers, so when read(2) is being used
+to drain the channel buffers, special-purpose communication between
+kernel and user isn't necessary for basic operation. Things such as
+buffer-full conditions would still need to be communicated via some
+channel though.
+
+klog and the relay-apps examples can be found in the relay-apps
+tarball on http://relayfs.sourceforge.net
+
+The relay interface user space API
+==================================
+
+The relay interface implements basic file operations for user space
+access to relay channel buffer data. Here are the file operations
+that are available and some comments regarding their behavior:
+
+open() enables user to open an _existing_ channel buffer.
+
+mmap() results in channel buffer being mapped into the caller's
+ memory space. Note that you can't do a partial mmap - you
+ must map the entire file, which is NRBUF * SUBBUFSIZE.
+
+read() read the contents of a channel buffer. The bytes read are
+ 'consumed' by the reader, i.e. they won't be available
+ again to subsequent reads. If the channel is being used
+ in no-overwrite mode (the default), it can be read at any
+ time even if there's an active kernel writer. If the
+ channel is being used in overwrite mode and there are
+ active channel writers, results may be unpredictable -
+ users should make sure that all logging to the channel has
+ ended before using read() with overwrite mode. Sub-buffer
+ padding is automatically removed and will not be seen by
+ the reader.
+
+sendfile() transfer data from a channel buffer to an output file
+ descriptor. Sub-buffer padding is automatically removed
+ and will not be seen by the reader.
+
+poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
+ notified when sub-buffer boundaries are crossed.
+
+close() decrements the channel buffer's refcount. When the refcount
+ reaches 0, i.e. when no process or kernel client has the
+ buffer open, the channel buffer is freed.
+
+In order for a user application to make use of relay files, the
+host filesystem must be mounted. For example,
+
+ mount -t debugfs debugfs /sys/kernel/debug
+
+NOTE: the host filesystem doesn't need to be mounted for kernel
+ clients to create or use channels - it only needs to be
+ mounted when user space applications need access to the buffer
+ data.
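+
+As an illustration, here's a minimal sketch of a userspace reader that
+drains a channel buffer with read(); the file name 'cpu0' is an
+assumption for the example - substitute whatever base filename and cpu
+number your kernel client used:
+
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+int main(void)
+{
+	char buf[4096];
+	ssize_t n;
+	int fd = open("/sys/kernel/debug/cpu0", O_RDONLY);
+
+	if (fd < 0)
+		return 1;
+
+	/* read() consumes the data and strips sub-buffer padding;
+	   it returns 0 once the available data has been drained */
+	while ((n = read(fd, buf, sizeof(buf))) > 0)
+		fwrite(buf, 1, n, stdout);
+
+	close(fd);
+	return 0;
+}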
+
+
+The relay interface kernel API
+==============================
+
+Here's a summary of the API the relay interface provides to in-kernel clients:
+
+ channel management functions:
+
+ relay_open(base_filename, parent, subbuf_size, n_subbufs,
+ callbacks, private_data)
+ relay_close(chan)
+ relay_flush(chan)
+ relay_reset(chan)
+
+  channel management typically called at the instigation of userspace:
+
+ relay_subbufs_consumed(chan, cpu, subbufs_consumed)
+
+ write functions:
+
+ relay_write(chan, data, length)
+ __relay_write(chan, data, length)
+ relay_reserve(chan, length)
+
+ callbacks:
+
+ subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
+ buf_mapped(buf, filp)
+ buf_unmapped(buf, filp)
+ create_buf_file(filename, parent, mode, buf, is_global)
+ remove_buf_file(dentry)
+
+ helper functions:
+
+ relay_buf_full(buf)
+ subbuf_start_reserve(buf, length)
+
+
+Creating a channel
+------------------
+
+relay_open() is used to create a channel, along with its per-cpu
+channel buffers. Each channel buffer will have an associated file
+created for it in the host filesystem, which can be mmap()ed or
+read from in user space. The files are named basename0...basenameN-1
+where N is the number of online cpus, and by default will be created
+in the root of the filesystem (if the parent param is NULL). If you
+want a directory structure to contain your relay files, you should
+create it using the host filesystem's directory creation function,
+e.g. debugfs_create_dir(), and pass the parent directory to
+relay_open(). Users are responsible for cleaning up any directory
+structure they create, when the channel is closed - again the host
+filesystem's directory removal functions should be used for that,
+e.g. debugfs_remove().
+
+In order for a channel to be created and the host filesystem's files
+associated with its channel buffers, the user must provide definitions
+for two callback functions, create_buf_file() and remove_buf_file().
+create_buf_file() is called once for each per-cpu buffer from
+relay_open() and allows the user to create the file which will be used
+to represent the corresponding channel buffer. The callback should
+return the dentry of the file created to represent the channel buffer.
+remove_buf_file() must also be defined; it's responsible for deleting
+the file(s) created in create_buf_file() and is called during
+relay_close().
+
+Here are some typical definitions for these callbacks, in this case
+using debugfs:
+
+/*
+ * create_buf_file() callback. Creates relay file in debugfs.
+ */
+static struct dentry *create_buf_file_handler(const char *filename,
+ struct dentry *parent,
+ int mode,
+ struct rchan_buf *buf,
+ int *is_global)
+{
+ return debugfs_create_file(filename, mode, parent, buf,
+ &relay_file_operations);
+}
+
+/*
+ * remove_buf_file() callback. Removes relay file from debugfs.
+ */
+static int remove_buf_file_handler(struct dentry *dentry)
+{
+ debugfs_remove(dentry);
+
+ return 0;
+}
+
+/*
+ * relay interface callbacks
+ */
+static struct rchan_callbacks relay_callbacks =
+{
+ .create_buf_file = create_buf_file_handler,
+ .remove_buf_file = remove_buf_file_handler,
+};
+
+And an example relay_open() invocation using them:
+
+ chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL);
+
+If the create_buf_file() callback fails, or isn't defined, channel
+creation and thus relay_open() will fail.
+
+The total size of each per-cpu buffer is calculated by multiplying the
+number of sub-buffers by the sub-buffer size passed into relay_open().
+The idea behind sub-buffers is that they're basically an extension of
+double-buffering to N buffers, and they also allow applications to
+easily implement random-access-on-buffer-boundary schemes, which can
+be important for some high-volume applications. The number and size
+of sub-buffers is completely dependent on the application and even for
+the same application, different conditions will warrant different
+values for these parameters at different times. Typically, the right
+values to use are best decided after some experimentation; in general,
+though, it's safe to assume that having only 1 sub-buffer is a bad
+idea - you're guaranteed to either overwrite data or lose events
+depending on the channel mode being used.
+
+The create_buf_file() implementation can also be defined in such a way
+as to allow the creation of a single 'global' buffer instead of the
+default per-cpu set. This can be useful for applications interested
+mainly in seeing the relative ordering of system-wide events without
+the need to bother with saving explicit timestamps for the purpose of
+merging/sorting per-cpu files in a postprocessing step.
+
+To have relay_open() create a global buffer, the create_buf_file()
+implementation should set the value of the is_global outparam to a
+non-zero value in addition to creating the file that will be used to
+represent the single buffer. In the case of a global buffer,
+create_buf_file() and remove_buf_file() will be called only once. The
+normal channel-writing functions, e.g. relay_write(), can still be
+used - writes from any cpu will transparently end up in the global
+buffer - but since it is a global buffer, callers should make sure
+they use the proper locking for such a buffer, either by wrapping
+writes in a spinlock, or by copying a write function from relay.h and
+creating a local version that internally does the proper locking.
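+
+For example, a create_buf_file() implementation along the following
+lines - a sketch, again assuming debugfs as the host filesystem - would
+create a single global buffer:
+
+static struct dentry *create_buf_file_handler(const char *filename,
+					      struct dentry *parent,
+					      int mode,
+					      struct rchan_buf *buf,
+					      int *is_global)
+{
+	/* tell relay_open() to create one global buffer for all cpus */
+	*is_global = 1;
+
+	return debugfs_create_file(filename, mode, parent, buf,
+				   &relay_file_operations);
+}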
+
+The private_data passed into relay_open() allows clients to associate
+user-defined data with a channel, and is immediately available
+(including in create_buf_file()) via chan->private_data or
+buf->chan->private_data.
+
+Buffer-only channels
+--------------------
+
+These channels have no files associated and can be created with
+relay_open(NULL, NULL, ...). Such channels are useful in scenarios such
+as when doing early tracing in the kernel, before the VFS is up. In these
+cases, one may open a buffer-only channel and then call
+relay_late_setup_files() when the kernel is ready to handle files,
+to expose the buffered data to userspace.
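+
+A sketch of that sequence, reusing the SUBBUF_SIZE, N_SUBBUFS and
+relay_callbacks names from the earlier example (they, and the parent
+dentry, are assumptions, not part of the API):
+
+/* early in boot, before the host filesystem is available */
+chan = relay_open(NULL, NULL, SUBBUF_SIZE, N_SUBBUFS,
+		  &relay_callbacks, NULL);
+
+/* ... data is logged into the buffer-only channel ... */
+
+/* later, once files can be created (e.g. after debugfs is ready) */
+err = relay_late_setup_files(chan, "cpu", parent);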
+
+Channel 'modes'
+---------------
+
+relay channels can be used in either of two modes - 'overwrite' or
+'no-overwrite'. The mode is entirely determined by the implementation
+of the subbuf_start() callback, as described below. The default if no
+subbuf_start() callback is defined is 'no-overwrite' mode. If the
+default mode suits your needs, and you plan to use the read()
+interface to retrieve channel data, you can ignore the details of this
+section, as it pertains mainly to mmap() implementations.
+
+In 'overwrite' mode, also known as 'flight recorder' mode, writes
+continuously cycle around the buffer and will never fail, but will
+unconditionally overwrite old data regardless of whether it's actually
+been consumed. In no-overwrite mode, writes will fail, i.e. data will
+be lost, if the number of unconsumed sub-buffers equals the total
+number of sub-buffers in the channel. It should be clear that if
+there is no consumer or if the consumer can't consume sub-buffers fast
+enough, data will be lost in either case; the only difference is
+whether data is lost from the beginning or the end of a buffer.
+
+As explained above, a relay channel is made of up one or more
+per-cpu channel buffers, each implemented as a circular buffer
+subdivided into one or more sub-buffers. Messages are written into
+the current sub-buffer of the channel's current per-cpu buffer via the
+write functions described below. Whenever a message can't fit into
+the current sub-buffer, because there's no room left for it, the
+client is notified via the subbuf_start() callback that a switch to a
+new sub-buffer is about to occur. The client uses this callback to 1)
+initialize the next sub-buffer if appropriate 2) finalize the previous
+sub-buffer if appropriate and 3) return a boolean value indicating
+whether or not to actually move on to the next sub-buffer.
+
+To implement 'no-overwrite' mode, the userspace client would provide
+an implementation of the subbuf_start() callback something like the
+following:
+
+static int subbuf_start(struct rchan_buf *buf,
+ void *subbuf,
+ void *prev_subbuf,
+ unsigned int prev_padding)
+{
+ if (prev_subbuf)
+ *((unsigned *)prev_subbuf) = prev_padding;
+
+ if (relay_buf_full(buf))
+ return 0;
+
+ subbuf_start_reserve(buf, sizeof(unsigned int));
+
+ return 1;
+}
+
+If the current buffer is full, i.e. all sub-buffers remain unconsumed,
+the callback returns 0 to indicate that the buffer switch should not
+occur yet, i.e. until the consumer has had a chance to read the
+current set of ready sub-buffers. For the relay_buf_full() function
+to make sense, the consumer is responsible for notifying the relay
+interface when sub-buffers have been consumed via
+relay_subbufs_consumed(). Any subsequent attempts to write into the
+buffer will again invoke the subbuf_start() callback with the same
+parameters; only when the consumer has consumed one or more of the
+ready sub-buffers will relay_buf_full() return 0, in which case the
+buffer switch can continue.
+
+The implementation of the subbuf_start() callback for 'overwrite' mode
+would be very similar:
+
+static int subbuf_start(struct rchan_buf *buf,
+ void *subbuf,
+ void *prev_subbuf,
+ unsigned int prev_padding)
+{
+ if (prev_subbuf)
+ *((unsigned *)prev_subbuf) = prev_padding;
+
+ subbuf_start_reserve(buf, sizeof(unsigned int));
+
+ return 1;
+}
+
+In this case, the relay_buf_full() check is meaningless and the
+callback always returns 1, causing the buffer switch to occur
+unconditionally. It's also meaningless for the client to use the
+relay_subbufs_consumed() function in this mode, as it's never
+consulted.
+
+The default subbuf_start() implementation, used if the client doesn't
+define any callbacks, or doesn't define the subbuf_start() callback,
+implements the simplest possible 'no-overwrite' mode: it does nothing
+except return 0 when relay_buf_full() indicates the buffer is full,
+and 1 otherwise.
+
+Header information can be reserved at the beginning of each sub-buffer
+by calling the subbuf_start_reserve() helper function from within the
+subbuf_start() callback. This reserved area can be used to store
+whatever information the client wants. In the example above, room is
+reserved in each sub-buffer to store the padding count for that
+sub-buffer. This is filled in for the previous sub-buffer in the
+subbuf_start() implementation; the padding value for the previous
+sub-buffer is passed into the subbuf_start() callback along with a
+pointer to the previous sub-buffer, since the padding value isn't
+known until a sub-buffer is filled. The subbuf_start() callback is
+also called for the first sub-buffer when the channel is opened, to
+give the client a chance to reserve space in it. In this case the
+previous sub-buffer pointer passed into the callback will be NULL, so
+the client should check the value of the prev_subbuf pointer before
+writing into the previous sub-buffer.
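+
+With this scheme, a userspace consumer that mmap()s a channel buffer
+could recover the valid data in each sub-buffer along the following
+lines; this is only a sketch - subbuf_size and n_subbufs must match the
+values given to relay_open(), and process() is a hypothetical consumer
+function:
+
+static void process_subbufs(char *base, size_t subbuf_size,
+			    size_t n_subbufs)
+{
+	size_t i;
+
+	for (i = 0; i < n_subbufs; i++) {
+		char *subbuf = base + i * subbuf_size;
+		/* padding count stored by the subbuf_start() example */
+		unsigned int padding = *(unsigned int *)subbuf;
+		char *data = subbuf + sizeof(unsigned int);
+		size_t len = subbuf_size - sizeof(unsigned int) - padding;
+
+		process(data, len);
+	}
+}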
+
+Writing to a channel
+--------------------
+
+Kernel clients write data into the current cpu's channel buffer using
+relay_write() or __relay_write(). relay_write() is the main logging
+function - it uses local_irq_save() to protect the buffer and should be
+used if you might be logging from interrupt context. If you know
+you'll never be logging from interrupt context, you can use
+__relay_write(), which only disables preemption. These functions
+don't return a value, so you can't determine whether or not they
+failed - the assumption is that you wouldn't want to check a return
+value in the fast logging path anyway, and that they'll always succeed
+unless the buffer is full and no-overwrite mode is being used, in
+which case you can detect a failed write in the subbuf_start()
+callback by calling the relay_buf_full() helper function.
+
+relay_reserve() is used to reserve a slot in a channel buffer which
+can be written to later. This would typically be used in applications
+that need to write directly into a channel buffer without having to
+stage data in a temporary buffer beforehand. Because the actual write
+may not happen immediately after the slot is reserved, applications
+using relay_reserve() can keep a count of the number of bytes actually
+written, either in space reserved in the sub-buffers themselves or as
+a separate array. See the 'reserve' example in the relay-apps tarball
+at http://relayfs.sourceforge.net for an example of how this can be
+done. Because the write is under control of the client and is
+separated from the reserve, relay_reserve() doesn't protect the buffer
+at all - it's up to the client to provide the appropriate
+synchronization when using relay_reserve().
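+
+For illustration, here's a sketch of both styles; the event structure
+and function names are assumptions made up for the example:
+
+struct event {
+	unsigned int id;
+	unsigned int value;
+};
+
+/* log a fully formed event; safe from interrupt context */
+static void log_event(struct rchan *chan, unsigned int id,
+		      unsigned int value)
+{
+	struct event ev = { .id = id, .value = value };
+
+	relay_write(chan, &ev, sizeof(ev));
+}
+
+/* reserve, then fill in place; the caller provides its own locking */
+static void log_event_reserved(struct rchan *chan, unsigned int id,
+			       unsigned int value)
+{
+	struct event *ev = relay_reserve(chan, sizeof(*ev));
+
+	if (!ev)
+		return;	/* buffer full in no-overwrite mode */
+	ev->id = id;
+	ev->value = value;
+}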
+
+Closing a channel
+-----------------
+
+The client calls relay_close() when it's finished using the channel.
+The channel and its associated buffers are destroyed when there are no
+longer any references to any of the channel buffers. relay_flush()
+forces a sub-buffer switch on all the channel buffers, and can be used
+to finalize and process the last sub-buffers before the channel is
+closed.
+
+Misc
+----
+
+Some applications may want to keep a channel around and re-use it
+rather than open and close a new channel for each use. relay_reset()
+can be used for this purpose - it resets a channel to its initial
+state without reallocating channel buffer memory or destroying
+existing mappings. It should however only be called when it's safe to
+do so, i.e. when the channel isn't currently being written to.
+
+Finally, there are a couple of utility callbacks that can be used for
+different purposes. buf_mapped() is called whenever a channel buffer
+is mmapped from user space and buf_unmapped() is called when it's
+unmapped. The client can use this notification to trigger actions
+within the kernel application, such as enabling/disabling logging to
+the channel.
+
+
+Resources
+=========
+
+For news, example code, mailing list, etc. see the relay interface homepage:
+
+ http://relayfs.sourceforge.net
+
+
+Credits
+=======
+
+The ideas and specs for the relay interface came about as a result of
+discussions on tracing involving the following:
+
+Michel Dagenais <michel.dagenais@polymtl.ca>
+Richard Moore <richardj_moore@uk.ibm.com>
+Bob Wisniewski <bob@watson.ibm.com>
+Karim Yaghmour <karim@opersys.com>
+Tom Zanussi <zanussi@us.ibm.com>
+
+Also thanks to Hubertus Franke for a lot of useful suggestions and bug
+reports.
diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt
new file mode 100644
index 00000000..e2b07cc9
--- /dev/null
+++ b/Documentation/filesystems/romfs.txt
@@ -0,0 +1,186 @@
+ROMFS - ROM FILE SYSTEM
+
+This is a quite dumb, read-only filesystem, mainly for initial RAM
+disks of installation disks. It grew out of the need of having
+modules linked at boot time. Using this filesystem, you get a very
+similar feature, and even the possibility of a small kernel, with a
+file system which doesn't take up useful memory from the router
+functions in the basement of your office.
+
+For comparison, both the older minix and xiafs (the latter is now
+defunct) filesystems, compiled as modules, need more than 20000 bytes,
+while romfs is less than a page, about 4000 bytes (assuming i586
+code). Under the same conditions, the msdos filesystem would need
+about 30K (and does not support device nodes or symlinks), while the
+nfs module with nfsroot is about 57K. Furthermore, as a somewhat unfair
+comparison, an actual rescue disk used up 3202 blocks with ext2, while
+with romfs, it needed 3079 blocks.
+
+To create such a file system, you'll need a user program named
+genromfs. It is available on http://romfs.sourceforge.net/
+
+As the name suggests, romfs could also be used (space-efficiently) on
+various read-only media, like (E)EPROM disks, if someone has the
+motivation... :)
+
+However, the main purpose of romfs is to have a very small kernel,
+which has only this filesystem linked in, and then can load any module
+later, with the current module utilities. It can also be used to run
+some program to decide if you need SCSI devices, and even IDE or
+floppy drives can be loaded later if you use the "initrd"--initial
+RAM disk--feature of the kernel. This is not really a news flash,
+but with romfs, you can even do without your ext2 or minix or maybe
+even affs filesystem until you really know that you need it.
+
+For example, a distribution boot disk can contain only the cd disk
+drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem
+module. The kernel can be small enough, since it doesn't have other
+filesystems, like the quite large ext2fs module, which can then be
+loaded off the CD at a later stage of the installation. Another use
+would be for a recovery disk, when you are reinstalling a workstation
+from the network, and you will have all the tools/modules available
+from a nearby server, so you don't want to carry two disks for this
+purpose, just because it won't fit into ext2.
+
+romfs operates on block devices, as you would expect, and the underlying
+structure is very simple. Every accessible structure begins on 16
+byte boundaries for fast access. The minimum space a file will take
+is 32 bytes (this is an empty file, with a less than 16 character
+name). The maximum overhead for any non-empty file is the header, and
+the 16 byte padding for the name and the contents, also 16+14+15 = 45
+bytes. This is quite rare however, since most file names are longer
+than 3 bytes, and shorter than 15 bytes.
+
+The layout of the filesystem is the following:
+
+offset content
+
+ +---+---+---+---+
+ 0 | - | r | o | m | \
+ +---+---+---+---+ The ASCII representation of those bytes
+ 4 | 1 | f | s | - | / (i.e. "-rom1fs-")
+ +---+---+---+---+
+ 8 | full size | The number of accessible bytes in this fs.
+ +---+---+---+---+
+ 12 | checksum | The checksum of the FIRST 512 BYTES.
+ +---+---+---+---+
+ 16 | volume name | The zero terminated name of the volume,
+ : : padded to 16 byte boundary.
+ +---+---+---+---+
+ xx | file |
+ : headers :
+
+Every multi byte value (32 bit words; I'll use the term longword from
+now on) must be in big endian order.
+
+The first eight bytes identify the filesystem, even for the casual
+inspector. After that, in the 3rd longword, it contains the number of
+bytes accessible from the start of this filesystem. The 4th longword
+is the checksum of the first 512 bytes (or the number of bytes
+accessible, whichever is smaller). The applied algorithm is the same
+as in the AFFS filesystem, namely a simple sum of the longwords
+(assuming bigendian quantities again). For details, please consult
+the source. This algorithm was chosen because although it's not quite
+reliable, it does not require any tables, and it is very simple.
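+
+For reference, here is the algorithm restated as a small C sketch (this
+is not the kernel's code; a valid image sets the checksum field so that
+the sum over the first 512 bytes comes out to zero):
+
+#include <stdint.h>
+#include <stddef.h>
+#include <arpa/inet.h>	/* ntohl() */
+
+static uint32_t romfs_sum(const void *data, size_t len)
+{
+	const uint32_t *p = data;
+	uint32_t sum = 0;
+
+	if (len > 512)
+		len = 512;
+	/* simple sum of the big endian longwords */
+	for (; len >= 4; len -= 4)
+		sum += ntohl(*p++);
+	return sum;
+}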
+
+The following bytes are now part of the file system; each file header
+must begin on a 16 byte boundary.
+
+offset content
+
+ +---+---+---+---+
+ 0 | next filehdr|X| The offset of the next file header
+ +---+---+---+---+ (zero if no more files)
+ 4 | spec.info | Info for directories/hard links/devices
+ +---+---+---+---+
+ 8 | size | The size of this file in bytes
+ +---+---+---+---+
+ 12 | checksum | Covering the meta data, including the file
+ +---+---+---+---+ name, and padding
+ 16 | file name | The zero terminated name of the file,
+ : : padded to 16 byte boundary
+ +---+---+---+---+
+ xx | file data |
+ : :
+
+Since the file headers always begin at a 16 byte boundary, the lowest
+4 bits are always zero in the next filehdr pointer. These four
+bits are used for the mode information. Bits 0..2 specify the type of
+the file, while bit 3 shows whether the file is executable or not. The
+permissions are assumed to be world readable if this bit is not set,
+and world executable if it is; except for character and block devices,
+which are never accessible to anyone but the owner. The owner of every
+file is user and group 0; this should never be a problem for the
+intended use. The mapping of the 8 possible values to file types is
+the following:
+
+ mapping spec.info means
+ 0 hard link link destination [file header]
+ 1 directory first file's header
+ 2 regular file unused, must be zero [MBZ]
+ 3 symbolic link unused, MBZ (file data is the link content)
+ 4 block device 16/16 bits major/minor number
+ 5 char device - " -
+ 6 socket unused, MBZ
+ 7 fifo unused, MBZ
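+
+To make the layout concrete, here's a sketch of the header in C; the
+struct and comments are illustrative, not copied from the kernel
+headers (the kernel's own definitions live in include/linux/romfs_fs.h):
+
+#include <stdint.h>
+
+struct romfs_file_header {	/* all fields big endian */
+	uint32_t next;		/* next header offset, low 4 bits = mode */
+	uint32_t spec_info;	/* type-dependent, see table above */
+	uint32_t size;		/* file data size in bytes */
+	uint32_t checksum;	/* covers metadata, name and padding */
+	char	 name[];	/* zero terminated, padded to 16 bytes */
+};
+
+/* given h->next converted to host byte order:	*/
+/*   next header offset:	next & ~15	*/
+/*   file type (bits 0..2):	next & 7	*/
+/*   executable (bit 3):	next & 8	*/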
+
+Note that hard links are specifically marked in this filesystem, but
+they will behave as you would expect (i.e. share the inode number).
+Note also that it is your responsibility not to create hard link
+loops, and to create all the . and .. links for directories. This is
+normally done correctly by the genromfs program. Please refrain from
+using the executable bits for special purposes on the socket and fifo
+special files, as they may have other uses in the future. Additionally,
+please remember that only regular files and symlinks are supposed to
+have a nonzero size field; it contains the number of bytes available
+directly after the (padded) file name.
+
+Another thing to note is that romfs works on file headers and data
+aligned to 16 byte boundaries, but most hardware devices and the block
+device drivers are unable to cope with smaller than block-sized data.
+To overcome this limitation, the whole size of the file system must be
+padded to a 1024 byte boundary.
+
+If you have any problems or suggestions concerning this file system,
+please contact me. However, think twice before wanting me to add
+features and code, because the primary and most important advantage of
+this file system is the small code. On the other hand, don't be
+alarmed, I'm not getting that much romfs related mail. Now I can
+understand why Avery wrote poems in the ARCnet docs to get some more
+feedback. :)
+
+romfs also has a mailing list, and to date, it hasn't received any
+traffic, so you are welcome to join it to discuss your ideas. :)
+
+It's run by ezmlm, so you can subscribe to it by sending a message
+to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
+
+Pending issues:
+
+- Permissions and owner information are pretty essential features of a
+Un*x like system, but romfs does not provide the full possibilities.
+I have never found this limiting, but others might.
+
+- The file system is read only, so it can be very small, but in case
+one would want to write _anything_ to a file system, he still needs
+a writable file system, thus negating the size advantages. Possible
+solutions: implement write access as a compile-time option, or a new,
+similarly small writable filesystem for RAM disks.
+
+- Since the files are only required to have alignment on a 16 byte
+boundary, it may currently be suboptimal to read or execute files
+from the filesystem. This might be resolved by reordering file data to
+have most of it (i.e. except the start and the end) lying at "natural"
+boundaries, so that it would be possible to directly map a big portion
+of the file contents to the mm subsystem.
+
+- Compression might be a useful feature, but memory is quite a
+limiting factor in my eyes.
+
+- Where is it used?
+
+- Does it work on architectures other than Intel and Motorola?
+
+
+Have fun,
+Janos Farkas <chexum@shadow.banki.hu>
diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt
new file mode 100644
index 00000000..a1e2e0dd
--- /dev/null
+++ b/Documentation/filesystems/seq_file.txt
@@ -0,0 +1,292 @@
+The seq_file interface
+
+ Copyright 2003 Jonathan Corbet <corbet@lwn.net>
+ This file is originally from the LWN.net Driver Porting series at
+ http://lwn.net/Articles/driver-porting/
+
+
+There are numerous ways for a device driver (or other kernel component) to
+provide information to the user or system administrator. One useful
+technique is the creation of virtual files, in debugfs, /proc or elsewhere.
+Virtual files can provide human-readable output that is easy to get at
+without any special utility programs; they can also make life easier for
+script writers. It is not surprising that the use of virtual files has
+grown over the years.
+
+Creating those files correctly has always been a bit of a challenge,
+however. It is not that hard to make a virtual file which returns a
+string. But life gets trickier if the output is long - anything greater
+than what an application is likely to read in a single operation. Handling
+multiple reads (and seeks) requires careful attention to the reader's
+position within the virtual file - that position is, likely as not, in the
+middle of a line of output. The kernel has traditionally had a number of
+implementations that got this wrong.
+
+The 2.6 kernel contains a set of functions (implemented by Alexander Viro)
+which are designed to make it easy for virtual file creators to get it
+right.
+
+The seq_file interface is available via <linux/seq_file.h>. There are
+three aspects to seq_file:
+
+ * An iterator interface which lets a virtual file implementation
+ step through the objects it is presenting.
+
+ * Some utility functions for formatting objects for output without
+ needing to worry about things like output buffers.
+
+ * A set of canned file_operations which implement most operations on
+ the virtual file.
+
+We'll look at the seq_file interface via an extremely simple example: a
+loadable module which creates a file called /proc/sequence. The file, when
+read, simply produces a set of increasing integer values, one per line. The
+sequence will continue until the user loses patience and finds something
+better to do. The file is seekable, in that one can do something like the
+following:
+
+ dd if=/proc/sequence of=out1 count=1
+ dd if=/proc/sequence skip=1 of=out2 count=1
+
+Then concatenate the output files out1 and out2 and get the right
+result. Yes, it is a thoroughly useless module, but the point is to show
+how the mechanism works without getting lost in other details. (Those
+wanting to see the full source for this module can find it at
+http://lwn.net/Articles/22359/).
+
+
+The iterator interface
+
+Modules implementing a virtual file with seq_file must implement a simple
+iterator object that allows stepping through the data of interest.
+Iterators must be able to move to a specific position - like the file they
+implement - but the interpretation of that position is up to the iterator
+itself. A seq_file implementation that is formatting firewall rules, for
+example, could interpret position N as the Nth rule in the chain.
+Positioning can thus be done in whatever way makes the most sense for the
+generator of the data, which need not be aware of how a position translates
+to an offset in the virtual file. The one obvious exception is that a
+position of zero should indicate the beginning of the file.
+
+The /proc/sequence iterator just uses the count of the next number it
+will output as its position.
+
+Four functions must be implemented to make the iterator work. The first,
+called start() takes a position as an argument and returns an iterator
+which will start reading at that position. For our simple sequence example,
+the start() function looks like:
+
+ static void *ct_seq_start(struct seq_file *s, loff_t *pos)
+ {
+ loff_t *spos = kmalloc(sizeof(loff_t), GFP_KERNEL);
+ if (! spos)
+ return NULL;
+ *spos = *pos;
+ return spos;
+ }
+
+The entire data structure for this iterator is a single loff_t value
+holding the current position. There is no upper bound for the sequence
+iterator, but that will not be the case for most other seq_file
+implementations; in most cases the start() function should check for a
+"past end of file" condition and return NULL if need be.
+
+For more complicated applications, the private field of the seq_file
+structure can be used. There is also a special value which can be returned
+by the start() function called SEQ_START_TOKEN; it can be used if you wish
+to instruct your show() function (described below) to print a header at the
+top of the output. SEQ_START_TOKEN should only be used if the offset is
+zero, however.
+
+The next function to implement is called, amazingly, next(); its job is to
+move the iterator forward to the next position in the sequence. The
+example module can simply increment the position by one; more useful
+modules will do what is needed to step through some data structure. The
+next() function returns a new iterator, or NULL if the sequence is
+complete. Here's the example version:
+
+ static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos)
+ {
+ loff_t *spos = v;
+ *pos = ++*spos;
+ return spos;
+ }
+
+The stop() function is called when iteration is complete; its job, of
+course, is to clean up. If dynamic memory is allocated for the iterator,
+stop() is the place to free it.
+
+ static void ct_seq_stop(struct seq_file *s, void *v)
+ {
+ kfree(v);
+ }
+
+Finally, the show() function should format the object currently pointed to
+by the iterator for output. The example module's show() function is:
+
+ static int ct_seq_show(struct seq_file *s, void *v)
+ {
+ loff_t *spos = v;
+ seq_printf(s, "%lld\n", (long long)*spos);
+ return 0;
+ }
+
+If all is well, the show() function should return zero. A negative error
+code in the usual manner indicates that something went wrong; it will be
+passed back to user space. This function can also return SEQ_SKIP, which
+causes the current item to be skipped; if the show() function has already
+generated output before returning SEQ_SKIP, that output will be dropped.
+
+We will look at seq_printf() in a moment. But first, the definition of the
+seq_file iterator is finished by creating a seq_operations structure with
+the four functions we have just defined:
+
+ static const struct seq_operations ct_seq_ops = {
+ .start = ct_seq_start,
+ .next = ct_seq_next,
+ .stop = ct_seq_stop,
+ .show = ct_seq_show
+ };
+
+This structure will be needed to tie our iterator to the /proc file in
+a little bit.
+
+It's worth noting that the iterator value returned by start() and
+manipulated by the other functions is considered to be completely opaque by
+the seq_file code. It can thus be anything that is useful in stepping
+through the data to be output. Counters can be useful, but it could also be
+a direct pointer into an array or linked list. Anything goes, as long as
+the programmer is aware that things can happen between calls to the
+iterator function. However, the seq_file code (by design) will not sleep
+between the calls to start() and stop(), so holding a lock during that time
+is a reasonable thing to do. The seq_file code will also avoid taking any
+other locks while the iterator is active.
+
+
+Formatted output
+
+The seq_file code manages positioning within the output created by the
+iterator and getting it into the user's buffer. But, for that to work, that
+output must be passed to the seq_file code. Some utility functions have
+been defined which make this task easy.
+
+Most code will simply use seq_printf(), which works pretty much like
+printk(), but which requires the seq_file pointer as an argument. It is
+common to ignore the return value from seq_printf(), but a function
+producing complicated output may want to check that value and quit if
+something non-zero is returned; an error return means that the seq_file
+buffer has been filled and further output will be discarded.
+
+For straight character output, the following functions may be used:
+
+ int seq_putc(struct seq_file *m, char c);
+ int seq_puts(struct seq_file *m, const char *s);
+ int seq_escape(struct seq_file *m, const char *s, const char *esc);
+
+The first two output a single character and a string, just like one would
+expect. seq_escape() is like seq_puts(), except that any character in s
+which is in the string esc will be represented in octal form in the output.
+
+There is also a pair of functions for printing filenames:
+
+ int seq_path(struct seq_file *m, struct path *path, char *esc);
+ int seq_path_root(struct seq_file *m, struct path *path,
+ struct path *root, char *esc)
+
+Here, path indicates the file of interest, and esc is a set of characters
+which should be escaped in the output. A call to seq_path() will output
+the path relative to the current process's filesystem root. If a different
+root is desired, it can be used with seq_path_root(). Note that, if it
+turns out that path cannot be reached from root, the value of root will be
+changed in seq_path_root() to a root which *does* work.
+
+
+Making it all work
+
+So far, we have a nice set of functions which can produce output within the
+seq_file system, but we have not yet turned them into a file that a user
+can see. Creating a file within the kernel requires, of course, the
+creation of a set of file_operations which implement the operations on that
+file. The seq_file interface provides a set of canned operations which do
+most of the work. The virtual file author still must implement the open()
+method, however, to hook everything up. The open function is often a single
+line, as in the example module:
+
+ static int ct_open(struct inode *inode, struct file *file)
+ {
+ return seq_open(file, &ct_seq_ops);
+ }
+
+Here, the call to seq_open() takes the seq_operations structure we created
+before, and gets set up to iterate through the virtual file.
+
+On a successful open, seq_open() stores the struct seq_file pointer in
+file->private_data. If you have an application where the same iterator can
+be used for more than one file, you can store an arbitrary pointer in the
+private field of the seq_file structure; that value can then be retrieved
+by the iterator functions.
+
+The other operations of interest - read(), llseek(), and release() - are
+all implemented by the seq_file code itself. So a virtual file's
+file_operations structure will look like:
+
+ static const struct file_operations ct_file_ops = {
+ .owner = THIS_MODULE,
+ .open = ct_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release
+ };
+
+There is also a seq_release_private() which passes the contents of the
+seq_file private field to kfree() before releasing the structure.
+
+The final step is the creation of the /proc file itself. In the example
+code, that is done in the initialization code in the usual way:
+
+	static int ct_init(void)
+	{
+		proc_create("sequence", 0, NULL, &ct_file_ops);
+		return 0;
+	}
+
+	module_init(ct_init);
+
+And that is pretty much it.
+
+
+seq_list
+
+If your file will be iterating through a linked list, you may find these
+routines useful:
+
+ struct list_head *seq_list_start(struct list_head *head,
+ loff_t pos);
+ struct list_head *seq_list_start_head(struct list_head *head,
+ loff_t pos);
+ struct list_head *seq_list_next(void *v, struct list_head *head,
+ loff_t *ppos);
+
+These helpers will interpret pos as a position within the list and iterate
+accordingly. Your start() and next() functions need only invoke the
+seq_list_* helpers with a pointer to the appropriate list_head structure.
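+
+As an example, an iterator over a list protected by a mutex might look
+like the following sketch; my_list, my_lock and struct my_entry are
+assumptions invented for the example:
+
+	static void *my_seq_start(struct seq_file *s, loff_t *pos)
+	{
+		mutex_lock(&my_lock);	/* held from start() to stop() */
+		return seq_list_start(&my_list, *pos);
+	}
+
+	static void *my_seq_next(struct seq_file *s, void *v, loff_t *pos)
+	{
+		return seq_list_next(v, &my_list, pos);
+	}
+
+	static void my_seq_stop(struct seq_file *s, void *v)
+	{
+		mutex_unlock(&my_lock);
+	}
+
+	static int my_seq_show(struct seq_file *s, void *v)
+	{
+		struct my_entry *e = list_entry(v, struct my_entry, list);
+
+		seq_printf(s, "%d\n", e->value);
+		return 0;
+	}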
+
+
+The extra-simple version
+
+For extremely simple virtual files, there is an even easier interface. A
+module can define only the show() function, which should create all the
+output that the virtual file will contain. The file's open() method then
+calls:
+
+ int single_open(struct file *file,
+ int (*show)(struct seq_file *m, void *p),
+ void *data);
+
+When output time comes, the show() function will be called once. The data
+value given to single_open() can be found in the private field of the
+seq_file structure. When using single_open(), the programmer should use
+single_release() instead of seq_release() in the file_operations structure
+to avoid a memory leak.
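+
+Putting it together, here is a sketch of the single-show pattern (all
+names are illustrative):
+
+	static int my_show(struct seq_file *m, void *p)
+	{
+		/* the data pointer passed to single_open() is available
+		   as m->private */
+		seq_printf(m, "value: %d\n", 42);
+		return 0;
+	}
+
+	static int my_open(struct inode *inode, struct file *file)
+	{
+		return single_open(file, my_show, NULL);
+	}
+
+	static const struct file_operations my_fops = {
+		.owner	 = THIS_MODULE,
+		.open	 = my_open,
+		.read	 = seq_read,
+		.llseek	 = seq_lseek,
+		.release = single_release,
+	};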
diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt
new file mode 100644
index 00000000..4ede421c
--- /dev/null
+++ b/Documentation/filesystems/sharedsubtree.txt
@@ -0,0 +1,939 @@
+Shared Subtrees
+---------------
+
+Contents:
+ 1) Overview
+ 2) Features
+ 3) Setting mount states
+ 4) Use-case
+ 5) Detailed semantics
+ 6) Quiz
+ 7) FAQ
+ 8) Implementation
+
+
+1) Overview
+-----------
+
+Consider the following situation:
+
+A process wants to clone its own namespace, but still wants to access the CD
+that got mounted recently. Shared subtree semantics provide the necessary
+mechanism to accomplish the above.
+
+It provides the necessary building blocks for features like per-user
+namespaces and versioned filesystems.
+
+2) Features
+-----------
+
+Shared subtrees provide four different flavors of mounts (struct
+vfsmount, to be precise):
+
+ a. shared mount
+ b. slave mount
+ c. private mount
+ d. unbindable mount
+
+
+2a) A shared mount can be replicated to as many mountpoints as desired,
+and all the replicas continue to be exactly the same.
+
+ Here is an example:
+
+ Let's say /mnt has a mount that is shared.
+ mount --make-shared /mnt
+
+ Note: mount(8) command now supports the --make-shared flag,
+ so the sample 'smount' program is no longer needed and has been
+ removed.
+
+ # mount --bind /mnt /tmp
+ The above command replicates the mount at /mnt to the mountpoint /tmp
+ and the contents of both the mounts remain identical.
+
+ #ls /mnt
+ a b c
+
+ #ls /tmp
+ a b c
+
+ Now let's say we mount a device at /tmp/a
+ # mount /dev/sd0 /tmp/a
+
+ #ls /tmp/a
+ t1 t2 t3
+
+ #ls /mnt/a
+ t1 t2 t3
+
+ Note that the mount has propagated to the mount at /mnt as well.
+
+ And the same is true even when /dev/sd0 is mounted on /mnt/a. The
+ contents will be visible under /tmp/a too.
+
+
+2b) A slave mount is like a shared mount except that mount and umount events
+ only propagate towards it.
+
+	All slave mounts have a master mount which is shared.
+
+ Here is an example:
+
+ Let's say /mnt has a mount which is shared.
+ # mount --make-shared /mnt
+
+ Let's bind mount /mnt to /tmp
+ # mount --bind /mnt /tmp
+
+ the new mount at /tmp becomes a shared mount and it is a replica of
+ the mount at /mnt.
+
+	Now let's make the mount at /tmp a slave of the mount at /mnt
+ # mount --make-slave /tmp
+
+ let's mount /dev/sd0 on /mnt/a
+ # mount /dev/sd0 /mnt/a
+
+ #ls /mnt/a
+ t1 t2 t3
+
+ #ls /tmp/a
+ t1 t2 t3
+
+ Note the mount event has propagated to the mount at /tmp
+
+ However let's see what happens if we mount something on the mount at /tmp
+
+ # mount /dev/sd1 /tmp/b
+
+ #ls /tmp/b
+ s1 s2 s3
+
+ #ls /mnt/b
+
+ Note how the mount event has not propagated to the mount at
+ /mnt
+
+
+2c) A private mount does not forward or receive propagation.
+
+	This is the mount we are familiar with. It's the default type.
+
+
+2d) An unbindable mount is an unbindable private mount
+
+	let's say we have a mount at /mnt and we make it unbindable
+
+ # mount --make-unbindable /mnt
+
+ Let's try to bind mount this mount somewhere else.
+ # mount --bind /mnt /tmp
+ mount: wrong fs type, bad option, bad superblock on /mnt,
+ or too many mounted file systems
+
+	Binding an unbindable mount is an invalid operation.
+
+
+3) Setting mount states
+
+ The mount command (util-linux package) can be used to set mount
+ states:
+
+ mount --make-shared mountpoint
+ mount --make-slave mountpoint
+ mount --make-private mountpoint
+ mount --make-unbindable mountpoint
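+
+	The same transitions can also be requested directly from a
+	program with the mount(2) system call; a minimal sketch follows
+	(error checking omitted; the mountpoints are only examples):
+
+	#include <sys/mount.h>
+
+	int main(void)
+	{
+		/* equivalent to 'mount --make-shared /mnt', etc.; the
+		   source, filesystemtype and data arguments are ignored
+		   for propagation changes */
+		mount(NULL, "/mnt", NULL, MS_SHARED, NULL);
+		mount(NULL, "/mnt", NULL, MS_SLAVE, NULL);
+		mount(NULL, "/mnt", NULL, MS_PRIVATE, NULL);
+		mount(NULL, "/mnt", NULL, MS_UNBINDABLE, NULL);
+
+		/* recursive variants (--make-rshared etc.) add MS_REC */
+		mount(NULL, "/", NULL, MS_REC | MS_SHARED, NULL);
+		return 0;
+	}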
+
+
+4) Use cases
+------------
+
+ A) A process wants to clone its own namespace, but still wants to
+ access the CD that got mounted recently.
+
+ Solution:
+
+ The system administrator can make the mount at /cdrom shared
+ mount --bind /cdrom /cdrom
+ mount --make-shared /cdrom
+
+ Now any process that clones off a new namespace will have a
+ mount at /cdrom which is a replica of the same mount in the
+ parent namespace.
+
+ So when a CD is inserted and mounted at /cdrom that mount gets
+ propagated to the other mount at /cdrom in all the other clone
+ namespaces.
+
+ B) A process wants its mounts invisible to any other process, but
+ still be able to see the other system mounts.
+
+ Solution:
+
+ To begin with, the administrator can mark the entire mount tree
+ as shareable.
+
+ mount --make-rshared /
+
+ A new process can clone off a new namespace. And mark some part
+ of its namespace as slave
+
+ mount --make-rslave /myprivatetree
+
+	Henceforth any mounts within the /myprivatetree done by the
+	process will not show up in any other namespace. However mounts
+	done in the parent namespace under /myprivatetree still show
+	up in the process's namespace.
+
+
+ Apart from the above semantics this feature provides the
+ building blocks to solve the following problems:
+
+ C) Per-user namespace
+
+	The above semantics allow a way to share mounts across
+	namespaces. But namespaces are associated with processes. If
+	namespaces are made first class objects with a user API to
+	associate/disassociate a namespace with a userid, then each
+	user could have his/her own namespace and tailor it to his/her
+	requirements. Of course, this needs support from PAM.
+
+ D) Versioned files
+
+ If the entire mount tree is visible at multiple locations, then
+	an underlying versioning file system can return different
+	versions of the file depending on the path used to access that
+ file.
+
+ An example is:
+
+ mount --make-shared /
+ mount --rbind / /view/v1
+ mount --rbind / /view/v2
+ mount --rbind / /view/v3
+ mount --rbind / /view/v4
+
+ and if /usr has a versioning filesystem mounted, then that
+ mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
+ /view/v4/usr too
+
+ A user can request v3 version of the file /usr/fs/namespace.c
+ by accessing /view/v3/usr/fs/namespace.c . The underlying
+ versioning filesystem can then decipher that v3 version of the
+ filesystem is being requested and return the corresponding
+ inode.
+
+5) Detailed semantics:
+-------------------
+ The section below explains the detailed semantics of
+ bind, rbind, move, mount, umount and clone-namespace operations.
+
+ Note: the word 'vfsmount' and the noun 'mount' have been used
+ to mean the same thing, throughout this document.
+
+5a) Mount states
+
+ A given mount can be in one of the following states
+ 1) shared
+ 2) slave
+ 3) shared and slave
+ 4) private
+ 5) unbindable
+
+ A 'propagation event' is defined as event generated on a vfsmount
+ that leads to mount or unmount actions in other vfsmounts.
+
+ A 'peer group' is defined as a group of vfsmounts that propagate
+ events to each other.
+
+ (1) Shared mounts
+
+ A 'shared mount' is defined as a vfsmount that belongs to a
+ 'peer group'.
+
+ For example:
+ mount --make-shared /mnt
+ mount --bind /mnt /tmp
+
+ The mount at /mnt and that at /tmp are both shared and belong
+ to the same peer group. Anything mounted or unmounted under
+ /mnt or /tmp reflect in all the other mounts of its peer
+ group.
+
+
+ (2) Slave mounts
+
+ A 'slave mount' is defined as a vfsmount that receives
+ propagation events and does not forward propagation events.
+
+ A slave mount as the name implies has a master mount from which
+ mount/unmount events are received. Events do not propagate from
+ the slave mount to the master. Only a shared mount can be made
+ a slave by executing the following command
+
+ mount --make-slave mount
+
+	A shared mount that is made a slave is no longer shared unless
+	it is modified to become shared again.
+
+ (3) Shared and Slave
+
+ A vfsmount can be both shared as well as slave. This state
+ indicates that the mount is a slave of some vfsmount, and
+ has its own peer group too. This vfsmount receives propagation
+ events from its master vfsmount, and also forwards propagation
+ events to its 'peer group' and to its slave vfsmounts.
+
+ Strictly speaking, the vfsmount is shared having its own
+ peer group, and this peer-group is a slave of some other
+ peer group.
+
+	Only a slave vfsmount can be made 'shared and slave' by
+ either executing the following command
+ mount --make-shared mount
+ or by moving the slave vfsmount under a shared vfsmount.
+
+ (4) Private mount
+
+	A 'private mount' is defined as a vfsmount that does not
+ receive or forward any propagation events.
+
+ (5) Unbindable mount
+
+	An 'unbindable mount' is defined as a vfsmount that does not
+ receive or forward any propagation events and cannot
+ be bind mounted.
+
+
+ State diagram:
+ The state diagram below explains the state transition of a mount,
+ in response to various commands.
+ ------------------------------------------------------------------------
+ | |make-shared | make-slave | make-private |make-unbindab|
+ --------------|------------|--------------|--------------|-------------|
+ |shared |shared |*slave/private| private | unbindable |
+ | | | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |slave |shared | **slave | private | unbindable |
+ | |and slave | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |shared |shared | slave | private | unbindable |
+ |and slave |and slave | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |private |shared | **private | private | unbindable |
+ |-------------|------------|--------------|--------------|-------------|
+ |unbindable |shared |**unbindable | private | unbindable |
+ ------------------------------------------------------------------------
+
+	* if the shared mount is the only mount in its peer group, making it
+	a slave makes it private automatically, since there is no master
+	to which it can be slaved.
+
+ ** slaving a non-shared mount has no effect on the mount.
+
+	Apart from the commands listed above, the 'move' operation also
+	changes the state of a mount depending on the type of the
+	destination mount. This is explained in section 5d.
+
+5b) Bind semantics
+
+ Consider the following command
+
+ mount --bind A/a B/b
+
+ where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
+ is the destination mount and 'b' is the dentry in the destination mount.
+
+ The outcome depends on the type of mount of 'A' and 'B'. The table
+ below contains quick reference.
+ ---------------------------------------------------------------------------
+ | BIND MOUNT OPERATION |
+ |**************************************************************************
+ |source(A)->| shared | private | slave | unbindable |
+ | dest(B) | | | | |
+ | | | | | | |
+ | v | | | | |
+ |**************************************************************************
+ | shared | shared | shared | shared & slave | invalid |
+ | | | | | |
+ |non-shared| shared | private | slave | invalid |
+ ***************************************************************************
+
+ Details:
+
+ 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
+	which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
+	mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
+ are created and mounted at the dentry 'b' on all mounts where 'B'
+ propagates to. A new propagation tree containing 'C1',..,'Cn' is
+ created. This propagation tree is identical to the propagation tree of
+ 'B'. And finally the peer-group of 'C' is merged with the peer group
+ of 'A'.
+
+ 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
+	which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
+	mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
+ are created and mounted at the dentry 'b' on all mounts where 'B'
+ propagates to. A new propagation tree is set containing all new mounts
+ 'C', 'C1', .., 'Cn' with exactly the same configuration as the
+ propagation tree for 'B'.
+
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
+	mount 'C' which is a clone of 'A', is created. Its root dentry is 'a'.
+ 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
+ 'C3' ... are created and mounted at the dentry 'b' on all mounts where
+ 'B' propagates to. A new propagation tree containing the new mounts
+ 'C','C1',.. 'Cn' is created. This propagation tree is identical to the
+ propagation tree for 'B'. And finally the mount 'C' and its peer group
+ is made the slave of mount 'Z'. In other words, mount 'C' is in the
+ state 'slave and shared'.
+
+	4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
+	invalid operation.
+
+ 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
+	unbindable) mount. A new mount 'C' which is a clone of 'A', is created.
+ Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
+
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
+ which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
+ peer-group of 'A'.
+
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
+ new mount 'C' which is a clone of 'A' is created. Its root dentry is
+ 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
+ slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
+ 'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But
+ mount/unmount on 'A' do not propagate anywhere else. Similarly
+ mount/unmount on 'C' do not propagate anywhere else.
+
+	8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is an
+	invalid operation. An unbindable mount cannot be bind mounted.
+
+5c) Rbind semantics
+
+	rbind is the same as bind. Bind replicates the specified mount. Rbind
+ replicates all the mounts in the tree belonging to the specified mount.
+ Rbind mount is bind mount applied to all the mounts in the tree.
+
+	If the source tree being rbound has some unbindable mounts,
+ then the subtree under the unbindable mount is pruned in the new
+ location.
+
+ eg: let's say we have the following mount tree.
+
+ A
+ / \
+ B C
+ / \ / \
+ D E F G
+
+	Let's say all the mounts in the tree except mount C are
+	of a type other than unbindable.
+
+	If this tree is rbound to, say, Z
+
+ We will have the following tree at the new location.
+
+ Z
+ |
+ A'
+ /
+ B' Note how the tree under C is pruned
+ / \ in the new location.
+ D' E'
+
+
+
+5d) Move semantics
+
+ Consider the following command
+
+ mount --move A B/b
+
+ where 'A' is the source mount, 'B' is the destination mount and 'b' is
+ the dentry in the destination mount.
+
+ The outcome depends on the type of the mount of 'A' and 'B'. The table
+ below is a quick reference.
+ ---------------------------------------------------------------------------
+ | MOVE MOUNT OPERATION |
+ |**************************************************************************
+ | source(A)->| shared | private | slave | unbindable |
+ | dest(B) | | | | |
+ | | | | | | |
+ | v | | | | |
+ |**************************************************************************
+ | shared | shared | shared |shared and slave| invalid |
+ | | | | | |
+ |non-shared| shared | private | slave | unbindable |
+ ***************************************************************************
+ NOTE: moving a mount residing under a shared mount is invalid.
+
+ Details follow:
+
+ 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
+ are created and mounted at dentry 'b' on all mounts that receive
+ propagation from mount 'B'. A new propagation tree is created in the
+ exact same configuration as that of 'B'. This new propagation tree
+ contains all the new mounts 'A1', 'A2'... 'An'. And this new
+ propagation tree is appended to the already existing propagation tree
+ of 'A'.
+
+ 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
+	mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'... 'An'
+ are created and mounted at dentry 'b' on all mounts that receive
+ propagation from mount 'B'. The mount 'A' becomes a shared mount and a
+ propagation tree is created which is identical to that of
+ 'B'. This new propagation tree contains all the new mounts 'A1',
+ 'A2'... 'An'.
+
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
+ mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
+ 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
+ receive propagation from mount 'B'. A new propagation tree is created
+ in the exact same configuration as that of 'B'. This new propagation
+ tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
+ propagation tree is appended to the already existing propagation tree of
+ 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
+ becomes 'shared'.
+
+	4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
+	is invalid, because mounting anything on the shared mount 'B' can
+	create new mounts that get mounted on the mounts that receive
+	propagation from 'B', and since the mount 'A' is unbindable, cloning
+	it to mount at other mountpoints is not possible.
+
+ 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
+ unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
+
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
+ is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
+ shared mount.
+
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
+ The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
+ continues to be a slave mount of mount 'Z'.
+
+	8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
+	'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
+	unbindable mount.
+
+5e) Mount semantics
+
+ Consider the following command
+
+ mount device B/b
+
+ 'B' is the destination mount and 'b' is the dentry in the destination
+ mount.
+
+	The above operation is the same as the bind operation, with the
+	exception that the source mount is always a private mount.
+
+
+5f) Unmount semantics
+
+ Consider the following command
+
+ umount A
+
+ where 'A' is a mount mounted on mount 'B' at dentry 'b'.
+
+	If mount 'B' is shared, then all most-recently-mounted mounts at dentry
+	'b' on mounts that receive propagation from mount 'B' and do not have
+	sub-mounts within them are unmounted.
+
+ Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
+ each other.
+
+	Let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mounts
+	'B1', 'B2' and 'B3' respectively.
+
+	Let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
+	mounts 'B1', 'B2' and 'B3' respectively.
+
+	If 'C1' is unmounted, all the mounts that are most-recently-mounted on
+	'B1' and on the mounts that 'B1' propagates-to are unmounted.
+
+ 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
+ on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
+
+ So all 'C1', 'C2' and 'C3' should be unmounted.
+
+	If either 'C2' or 'C3' has child mounts, that mount is not
+	unmounted, but all the other mounts are. However, if 'C1' itself
+	has sub-mounts, the whole umount operation fails.
+
+5g) Clone Namespace
+
+	A cloned namespace contains the same mounts as its parent
+	namespace.
+
+ Let's say 'A' and 'B' are the corresponding mounts in the parent and the
+ child namespace.
+
+ If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
+ each other.
+
+ If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
+ 'Z'.
+
+ If 'A' is a private mount, then 'B' is a private mount too.
+
+	If 'A' is an unbindable mount, then 'B' is an unbindable mount too.
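+
+	From userspace, a cloned mount namespace is obtained by passing
+	CLONE_NEWNS to clone(2) or unshare(2). A minimal sketch:
+
+	#define _GNU_SOURCE
+	#include <sched.h>
+	#include <stdio.h>
+
+	int main(void)
+	{
+		if (unshare(CLONE_NEWNS) == -1) { /* own copy of the mount tree */
+			perror("unshare");
+			return 1;
+		}
+		/* mounts made from here on obey the propagation rules above */
+		return 0;
+	}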
+
+
+6) Quiz
+
+ A. What is the result of the following command sequence?
+
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt /tmp
+ mount --move /tmp /mnt/1
+
+	What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
+	Should they all be identical, or should only /mnt and /mnt/1
+	be identical?
+
+
+ B. What is the result of the following command sequence?
+
+ mount --make-rshared /
+ mkdir -p /v/1
+ mount --rbind / /v/1
+
+	What should the content of /v/1/v/1 be?
+
+
+ C. What is the result of the following command sequence?
+
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mkdir -p /mnt/1/2/3 /mnt/1/test
+ mount --bind /mnt/1 /tmp
+ mount --make-slave /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt/1/2 /tmp1
+ mount --make-slave /mnt
+
+	At this point we have the first mount at /tmp and
+	its root dentry is 1. Let's call this mount 'A'.
+	Then we have a second mount at /tmp1 with root
+	dentry 2. Let's call this mount 'B'.
+	Next we have a third mount at /mnt with root dentry
+	mnt. Let's call this mount 'C'.
+
+	'B' is the slave of 'A' and 'C' is a slave of 'B':
+	A -> B -> C
+
+	At this point, suppose we execute the following command:
+
+	mount --bind /bin /tmp/test
+
+	The mount is attempted on 'A'.
+
+	Will the mount propagate to 'B' and 'C'?
+
+	What would the contents of /mnt/1/test be?
+
+7) FAQ
+
+ Q1. Why is bind mount needed? How is it different from symbolic links?
+	Symbolic links can become stale if the destination mount is
+	unmounted or moved; bind mounts continue to exist even if the
+	other mount is unmounted or moved.
+
+ Q2. Why can't the shared subtree be implemented using exportfs?
+
+	exportfs is a heavyweight way of accomplishing part of what
+	shared subtrees can do, and I cannot imagine a way to implement
+	the semantics of slave mounts using exportfs.
+
+  Q3. Why are unbindable mounts needed?
+
+ Let's say we want to replicate the mount tree at multiple
+ locations within the same subtree.
+
+	If one rbind-mounts a tree within the same subtree 'n' times,
+	the number of mounts created grows factorially with 'n' (the
+	count is worked out below). Unbindable mounts can help prune
+	the unneeded bind mounts. Here is an example.
+
+ step 1:
+ let's say the root tree has just two directories with
+ one vfsmount.
+ root
+ / \
+ tmp usr
+
+ And we want to replicate the tree at multiple
+ mountpoints under /root/tmp
+
+	step 2:
+ mount --make-shared /root
+
+ mkdir -p /tmp/m1
+
+ mount --rbind /root /tmp/m1
+
+ the new tree now looks like this:
+
+ root
+ / \
+ tmp usr
+ /
+ m1
+ / \
+ tmp usr
+ /
+ m1
+
+ it has two vfsmounts
+
+	step 3:
+ mkdir -p /tmp/m2
+ mount --rbind /root /tmp/m2
+
+ the new tree now looks like this:
+
+ root
+ / \
+ tmp usr
+ / \
+ m1 m2
+ / \ / \
+ tmp usr tmp usr
+ / \ /
+ m1 m2 m1
+ / \ / \
+ tmp usr tmp usr
+ / / \
+ m1 m1 m2
+ / \
+ tmp usr
+ / \
+ m1 m2
+
+ it has 6 vfsmounts
+
+ step 4:
+ mkdir -p /tmp/m3
+ mount --rbind /root /tmp/m3
+
+	I won't draw the tree, but it has 24 vfsmounts
+
+
+	At step i the number of vfsmounts is V[i] = i*V[i-1], that is,
+	V[i] = i! (factorial growth, even faster than exponential). This
+	tree has far more mounts than we really needed in the first
+	place.
+
+	One could use a series of umount calls at each step to prune
+	out the unneeded mounts. But there is a better solution:
+	unbindable mounts come in handy here.
+
+ step 1:
+ let's say the root tree has just two directories with
+ one vfsmount.
+ root
+ / \
+ tmp usr
+
+	How do we set up the same tree at multiple locations under
+	/root/tmp?
+
+	step 2:
+ mount --bind /root/tmp /root/tmp
+
+ mount --make-rshared /root
+ mount --make-unbindable /root/tmp
+
+ mkdir -p /tmp/m1
+
+ mount --rbind /root /tmp/m1
+
+ the new tree now looks like this:
+
+ root
+ / \
+ tmp usr
+ /
+ m1
+ / \
+ tmp usr
+
+	step 3:
+ mkdir -p /tmp/m2
+ mount --rbind /root /tmp/m2
+
+ the new tree now looks like this:
+
+ root
+ / \
+ tmp usr
+ / \
+ m1 m2
+ / \ / \
+ tmp usr tmp usr
+
+	step 4:
+
+ mkdir -p /tmp/m3
+ mount --rbind /root /tmp/m3
+
+ the new tree now looks like this:
+
+ root
+ / \
+ tmp usr
+ / \ \
+ m1 m2 m3
+ / \ / \ / \
+ tmp usr tmp usr tmp usr
+
+8) Implementation
+
+8A) Datastructure
+
+	Four new fields are introduced to struct vfsmount:
+ ->mnt_share
+ ->mnt_slave_list
+ ->mnt_slave
+ ->mnt_master
+
+	->mnt_share links together all the mounts to/from which this
+	vfsmount sends/receives propagation events.
+
+	->mnt_slave_list links all the mounts to which this vfsmount
+	propagates.
+
+ ->mnt_slave links together all the slaves that its master vfsmount
+ propagates to.
+
+ ->mnt_master points to the master vfsmount from which this vfsmount
+ receives propagation.
+
+	->mnt_flags takes two more flags to indicate the propagation status
+	of the vfsmount: MNT_SHARED indicates that the vfsmount is a shared
+	vfsmount, and MNT_UNBINDABLE indicates that the vfsmount cannot be
+	replicated (the flag names as defined in include/linux/mount.h).
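+
+	As a sketch, the corresponding declarations look roughly like this
+	(abridged; surrounding fields omitted, exact layout may differ
+	between kernel versions):
+
+	struct vfsmount {
+		...
+		struct list_head mnt_share;      /* circular list of shared mounts */
+		struct list_head mnt_slave_list; /* list of slave mounts */
+		struct list_head mnt_slave;      /* slave list entry */
+		struct vfsmount *mnt_master;     /* slave is on master->mnt_slave_list */
+		...
+	};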
+
+ All the shared vfsmounts in a peer group form a cyclic list through
+ ->mnt_share.
+
+	All vfsmounts with the same ->mnt_master form a cyclic list anchored
+	in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
+
+ ->mnt_master can point to arbitrary (and possibly different) members
+ of master peer group. To find all immediate slaves of a peer group
+ you need to go through _all_ ->mnt_slave_list of its members.
+ Conceptually it's just a single set - distribution among the
+ individual lists does not affect propagation or the way propagation
+ tree is modified by operations.
+
+	All vfsmounts in a peer group have the same ->mnt_master. If it is
+	non-NULL, they form a contiguous (ordered) segment of the slave list.
+
+	An example propagation tree is shown in the figure below.
+ [ NOTE: Though it looks like a forest, if we consider all the shared
+ mounts as a conceptual entity called 'pnode', it becomes a tree]
+
+
+ A <--> B <--> C <---> D
+ /|\ /| |\
+ / F G J K H I
+ /
+ E<-->K
+ /|\
+ M L N
+
+	In the above figure A, B, C and D are all shared and propagate to
+	each other. 'A' has three slave mounts 'E', 'F' and 'G'; 'C' has two
+	slave mounts 'J' and 'K'; and 'D' has two slave mounts 'H' and 'I'.
+	'E' is also shared with 'K' and they propagate to each other, and
+	'K' has three slaves 'M', 'L' and 'N'.
+
+	A's ->mnt_share links with the ->mnt_share of 'B', 'C' and 'D'.
+
+	A's ->mnt_slave_list links with the ->mnt_slave of 'E', 'K', 'F' and 'G'.
+
+	E's ->mnt_share links with the ->mnt_share of 'K'.
+	'E', 'K', 'F' and 'G' have their ->mnt_master point to the struct
+	vfsmount of 'A'.
+	'M', 'L' and 'N' have their ->mnt_master point to the struct
+	vfsmount of 'K'.
+	K's ->mnt_slave_list links with the ->mnt_slave of 'M', 'L' and 'N'.
+
+	C's ->mnt_slave_list links with the ->mnt_slave of 'J' and 'K'.
+	J's and K's ->mnt_master point to the struct vfsmount of 'C'.
+	Finally, D's ->mnt_slave_list links with the ->mnt_slave of 'H' and
+	'I'; 'H' and 'I' have their ->mnt_master pointing to the struct
+	vfsmount of 'D'.
+
+
+ NOTE: The propagation tree is orthogonal to the mount tree.
+
+8B) Locking
+
+ ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
+ by namespace_sem (exclusive for modifications, shared for reading).
+
+ Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
+ There are two exceptions: do_add_mount() and clone_mnt().
+ The former modifies a vfsmount that has not been visible in any shared
+ data structures yet.
+ The latter holds namespace_sem and the only references to vfsmount
+ are in lists that can't be traversed without namespace_sem.
+
+8C) Algorithm
+
+	The crux of the implementation resides in the rbind/move operations.
+
+	The overall algorithm breaks the operation into 3 phases (see
+	attach_recursive_mnt() and propagate_mnt()):
+
+	1. prepare phase.
+	2. commit phase.
+	3. abort phase.
+
+ Prepare phase:
+
+ for each mount in the source tree:
+ a) Create the necessary number of mount trees to
+ be attached to each of the mounts that receive
+ propagation from the destination mount.
+	   b) Do not attach any of the trees to its destination yet;
+	      however, note down its ->mnt_parent and ->mnt_mountpoint.
+ c) Link all the new mounts to form a propagation tree that
+ is identical to the propagation tree of the destination
+ mount.
+
+	If this phase is successful, there should be 'n' new
+	propagation trees, where 'n' is the number of mounts in the
+	source tree, and 'm' new mount trees, where 'm' is the number
+	of mounts to which the destination mount propagates.
+
+	If any memory allocation fails, go to the abort phase;
+	otherwise, go to the commit phase.
+
+	Commit phase
+		Attach each of the mount trees to its corresponding
+		destination mount.
+
+	Abort phase
+		Delete all the newly created trees.
+
+ NOTE: all the propagation related functionality resides in the file
+ pnode.c
+
+
+------------------------------------------------------------------------
+
+version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com)
+version 0.2 (Incorporated comments from Al Viro)
diff --git a/Documentation/filesystems/spufs.txt b/Documentation/filesystems/spufs.txt
new file mode 100644
index 00000000..1343d118
--- /dev/null
+++ b/Documentation/filesystems/spufs.txt
@@ -0,0 +1,521 @@
+SPUFS(2) Linux Programmer's Manual SPUFS(2)
+
+
+
+NAME
+ spufs - the SPU file system
+
+
+DESCRIPTION
+ The SPU file system is used on PowerPC machines that implement the Cell
+ Broadband Engine Architecture in order to access Synergistic Processor
+ Units (SPUs).
+
+       The file system provides a name space similar to POSIX shared memory or
+ message queues. Users that have write permissions on the file system
+ can use spu_create(2) to establish SPU contexts in the spufs root.
+
+ Every SPU context is represented by a directory containing a predefined
+ set of files. These files can be used for manipulating the state of the
+ logical SPU. Users can change permissions on those files, but not actu-
+ ally add or remove files.
+
+
+MOUNT OPTIONS
+ uid=<uid>
+ set the user owning the mount point, the default is 0 (root).
+
+ gid=<gid>
+ set the group owning the mount point, the default is 0 (root).
+
+
+FILES
+ The files in spufs mostly follow the standard behavior for regular sys-
+ tem calls like read(2) or write(2), but often support only a subset of
+ the operations supported on regular file systems. This list details the
+ supported operations and the deviations from the behaviour in the
+ respective man pages.
+
+ All files that support the read(2) operation also support readv(2) and
+ all files that support the write(2) operation also support writev(2).
+ All files support the access(2) and stat(2) family of operations, but
+ only the st_mode, st_nlink, st_uid and st_gid fields of struct stat
+ contain reliable information.
+
+ All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera-
+ tions, but will not be able to grant permissions that contradict the
+ possible operations, e.g. read access on the wbox file.
+
+ The current set of files is:
+
+
+ /mem
+ the contents of the local storage memory of the SPU. This can be
+ accessed like a regular shared memory file and contains both code and
+ data in the address space of the SPU. The possible operations on an
+ open mem file are:
+
+ read(2), pread(2), write(2), pwrite(2), lseek(2)
+              These operate as documented, with the exception that lseek(2),
+ write(2) and pwrite(2) are not supported beyond the end of the
+ file. The file size is the size of the local storage of the SPU,
+ which normally is 256 kilobytes.
+
+ mmap(2)
+ Mapping mem into the process address space gives access to the
+ SPU local storage within the process address space. Only
+ MAP_SHARED mappings are allowed.
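+
+              As an illustration, a minimal sketch of mapping the local
+              store (the context directory /spu/myctx is a hypothetical
+              example):
+
+              #include <fcntl.h>
+              #include <sys/mman.h>
+
+              static void *map_local_store(void)
+              {
+                      int fd = open("/spu/myctx/mem", O_RDWR);
+
+                      /* only MAP_SHARED mappings are allowed */
+                      return mmap(NULL, 256 * 1024, PROT_READ | PROT_WRITE,
+                                  MAP_SHARED, fd, 0);
+              }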
+
+
+ /mbox
+ The first SPU to CPU communication mailbox. This file is read-only and
+       can be read in units of 32 bits. The file can only be used in non-
+       blocking mode; even poll() will not block on it. The possible
+ operations on an open mbox file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. If there is no data available in the mail
+ box, the return value is set to -1 and errno becomes EAGAIN.
+ When data has been read successfully, four bytes are placed in
+ the data buffer and the value four is returned.
+
+
+ /ibox
+ The second SPU to CPU communication mailbox. This file is similar to
+ the first mailbox file, but can be read in blocking I/O mode, and the
+ poll family of system calls can be used to wait for it. The possible
+ operations on an open ibox file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. If there is no data available in the mail
+ box and the file descriptor has been opened with O_NONBLOCK, the
+ return value is set to -1 and errno becomes EAGAIN.
+
+ If there is no data available in the mail box and the file
+ descriptor has been opened without O_NONBLOCK, the call will
+ block until the SPU writes to its interrupt mailbox channel.
+ When data has been read successfully, four bytes are placed in
+ the data buffer and the value four is returned.
+
+ poll(2)
+ Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever
+ data is available for reading.
+
+
+ /wbox
+       The CPU to SPU communication mailbox. It is write-only and can be
+       written in units of 32 bits. If the mailbox is full, write() will
+       block, and poll can be used to wait for it becoming empty again. The
+       possible operations on an open wbox file are:
+
+       write(2)
+              If a count smaller than four is requested, write returns -1 and
+              sets errno to EINVAL. If there is no space available in the
+              mail box and the file descriptor has been opened with
+              O_NONBLOCK, the return value is set to -1 and errno becomes
+              EAGAIN.
+
+              If there is no space available in the mail box and the file
+              descriptor has been opened without O_NONBLOCK, the call will
+              block until the SPU reads from its PPE mailbox channel. When
+              data has been written successfully, four bytes are copied from
+              the data buffer and the value four is returned.
+
+ poll(2)
+              Poll on the wbox file returns (POLLOUT | POLLWRNORM) whenever
+ space is available for writing.
+
+
+ /mbox_stat
+ /ibox_stat
+ /wbox_stat
+ Read-only files that contain the length of the current queue, i.e. how
+ many words can be read from mbox or ibox or how many words can be
+ written to wbox without blocking. The files can be read only in 4-byte
+ units and return a big-endian binary integer number. The possible
+ operations on an open *box_stat file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is placed in
+ the data buffer, containing the number of elements that can be
+ read from (for mbox_stat and ibox_stat) or written to (for
+ wbox_stat) the respective mail box without blocking or resulting
+ in EAGAIN.
+
+
+ /npc
+ /decr
+ /decr_status
+ /spu_tag_mask
+ /event_mask
+ /srr0
+       Internal registers of the SPU. The representation is an ASCII string
+       with the numeric value of the respective register. These registers
+       can be used in read/write mode for debugging, but normal operation of
+       programs should not rely on them because access to any of them except
+       npc requires an SPU context save and is therefore very inefficient.
+
+ The contents of these files are:
+
+ npc Next Program Counter
+
+ decr SPU Decrementer
+
+ decr_status Decrementer Status
+
+ spu_tag_mask MFC tag mask for SPU DMA
+
+ event_mask Event mask for SPU interrupts
+
+ srr0 Interrupt Return address register
+
+
+ The possible operations on an open npc, decr, decr_status,
+ spu_tag_mask, event_mask or srr0 file are:
+
+ read(2)
+ When the count supplied to the read call is shorter than the
+              required length for the register value plus a newline character,
+ subsequent reads from the same file descriptor will result in
+ completing the string, regardless of changes to the register by
+ a running SPU task. When a complete string has been read, all
+ subsequent read operations will return zero bytes and a new file
+ descriptor needs to be opened to read the value again.
+
+ write(2)
+ A write operation on the file results in setting the register to
+ the value given in the string. The string is parsed from the
+ beginning to the first non-numeric character or the end of the
+ buffer. Subsequent writes to the same file descriptor overwrite
+ the previous setting.
+
+
+ /fpcr
+ This file gives access to the Floating Point Status and Control Regis-
+ ter as a four byte long file. The operations on the fpcr file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is placed in
+ the data buffer, containing the current value of the fpcr regis-
+ ter.
+
+ write(2)
+ If a count smaller than four is requested, write returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is copied
+ from the data buffer, updating the value of the fpcr register.
+
+
+ /signal1
+ /signal2
+ The two signal notification channels of an SPU. These are read-write
+ files that operate on a 32 bit word. Writing to one of these files
+ triggers an interrupt on the SPU. The value written to the signal
+ files can be read from the SPU through a channel read or from host user
+ space through the file. After the value has been read by the SPU, it
+ is reset to zero. The possible operations on an open signal1 or sig-
+ nal2 file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is placed in
+ the data buffer, containing the current value of the specified
+ signal notification register.
+
+ write(2)
+ If a count smaller than four is requested, write returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is copied
+ from the data buffer, updating the value of the specified signal
+ notification register. The signal notification register will
+              either be replaced with the input data or will be updated to
+              the bitwise OR of the old value and the input data, depending
+              on the contents of the signal1_type or signal2_type file,
+              respectively.
+
+
+ /signal1_type
+ /signal2_type
+ These two files change the behavior of the signal1 and signal2 notifi-
+       cation files. They contain a numerical ASCII string which is read as
+       either "1" or "0". In mode 0 (overwrite), the hardware replaces the
+       contents of the signal channel with the data that is written to it. In
+ mode 1 (logical OR), the hardware accumulates the bits that are subse-
+ quently written to it. The possible operations on an open signal1_type
+ or signal2_type file are:
+
+ read(2)
+ When the count supplied to the read call is shorter than the
+ required length for the digit plus a newline character, subse-
+ quent reads from the same file descriptor will result in com-
+ pleting the string. When a complete string has been read, all
+ subsequent read operations will return zero bytes and a new file
+ descriptor needs to be opened to read the value again.
+
+ write(2)
+ A write operation on the file results in setting the register to
+ the value given in the string. The string is parsed from the
+ beginning to the first non-numeric character or the end of the
+ buffer. Subsequent writes to the same file descriptor overwrite
+ the previous setting.
+
+
+EXAMPLES
+ /etc/fstab entry
+ none /spu spufs gid=spu 0 0
+
+
+AUTHORS
+ Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
+ Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
+
+SEE ALSO
+ capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
+
+
+
+Linux 2005-09-28 SPUFS(2)
+
+------------------------------------------------------------------------------
+
+SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
+
+
+
+NAME
+ spu_run - execute an spu context
+
+
+SYNOPSIS
+ #include <sys/spu.h>
+
+ int spu_run(int fd, unsigned int *npc, unsigned int *event);
+
+DESCRIPTION
+ The spu_run system call is used on PowerPC machines that implement the
+ Cell Broadband Engine Architecture in order to access Synergistic Pro-
+ cessor Units (SPUs). It uses the fd that was returned from spu_cre-
+ ate(2) to address a specific SPU context. When the context gets sched-
+ uled to a physical SPU, it starts execution at the instruction pointer
+ passed in npc.
+
+ Execution of SPU code happens synchronously, meaning that spu_run does
+ not return while the SPU is still running. If there is a need to exe-
+ cute SPU code in parallel with other code on either the main CPU or
+ other SPUs, you need to create a new thread of execution first, e.g.
+ using the pthread_create(3) call.
+
+ When spu_run returns, the current value of the SPU instruction pointer
+ is written back to npc, so you can call spu_run again without updating
+ the pointers.
+
+ event can be a NULL pointer or point to an extended status code that
+ gets filled when spu_run returns. It can be one of the following con-
+ stants:
+
+ SPE_EVENT_DMA_ALIGNMENT
+ A DMA alignment error
+
+ SPE_EVENT_SPE_DATA_SEGMENT
+ A DMA segmentation error
+
+ SPE_EVENT_SPE_DATA_STORAGE
+ A DMA storage error
+
+ If NULL is passed as the event argument, these errors will result in a
+ signal delivered to the calling process.
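+
+       As an illustration, a minimal sketch of running a context (ctx_fd
+       is assumed to come from spu_create(2), the loaded program to start
+       at local store address 0, and handle_stop() is a hypothetical
+       handler for the 14-bit stop-and-signal code):
+
+       unsigned int npc = 0;
+       int status = spu_run(ctx_fd, &npc, NULL); /* blocks until SPU stops */
+
+       if (status & 0x02) /* stopped by stop-and-signal */
+              handle_stop((status & 0x3fff0000) >> 16);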
+
+RETURN VALUE
+ spu_run returns the value of the spu_status register or -1 to indicate
+ an error and set errno to one of the error codes listed below. The
+ spu_status register value contains a bit mask of status codes and
+ optionally a 14 bit code returned from the stop-and-signal instruction
+ on the SPU. The bit masks for the status codes are:
+
+ 0x02 SPU was stopped by stop-and-signal.
+
+ 0x04 SPU was stopped by halt.
+
+ 0x08 SPU is waiting for a channel.
+
+ 0x10 SPU is in single-step mode.
+
+ 0x20 SPU has tried to execute an invalid instruction.
+
+ 0x40 SPU has tried to access an invalid channel.
+
+ 0x3fff0000
+ The bits masked with this value contain the code returned from
+ stop-and-signal.
+
+       Either one or more of the lower eight bits are set, or an error
+       code is returned from spu_run.
+
+ERRORS
+ EAGAIN or EWOULDBLOCK
+ fd is in non-blocking mode and spu_run would block.
+
+ EBADF fd is not a valid file descriptor.
+
+       EFAULT npc is not a valid pointer, or event is neither NULL nor a
+              valid pointer.
+
+ EINTR A signal occurred while spu_run was in progress. The npc value
+ has been updated to the new program counter value if necessary.
+
+ EINVAL fd is not a file descriptor returned from spu_create(2).
+
+ ENOMEM Insufficient memory was available to handle a page fault result-
+ ing from an MFC direct memory access.
+
+ ENOSYS the functionality is not provided by the current system, because
+ either the hardware does not provide SPUs or the spufs module is
+ not loaded.
+
+
+NOTES
+ spu_run is meant to be used from libraries that implement a more
+ abstract interface to SPUs, not to be used from regular applications.
+ See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
+ ommended libraries.
+
+
+CONFORMING TO
+ This call is Linux specific and only implemented by the ppc64 architec-
+ ture. Programs using this system call are not portable.
+
+
+BUGS
+       The code does not yet fully implement all features outlined here.
+
+
+AUTHOR
+ Arnd Bergmann <arndb@de.ibm.com>
+
+SEE ALSO
+ capabilities(7), close(2), spu_create(2), spufs(7)
+
+
+
+Linux 2005-09-28 SPU_RUN(2)
+
+------------------------------------------------------------------------------
+
+SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
+
+
+
+NAME
+ spu_create - create a new spu context
+
+
+SYNOPSIS
+ #include <sys/types.h>
+ #include <sys/spu.h>
+
+ int spu_create(const char *pathname, int flags, mode_t mode);
+
+DESCRIPTION
+ The spu_create system call is used on PowerPC machines that implement
+ the Cell Broadband Engine Architecture in order to access Synergistic
+ Processor Units (SPUs). It creates a new logical context for an SPU in
+       pathname and returns a handle associated with it. pathname must
+ point to a non-existing directory in the mount point of the SPU file
+ system (spufs). When spu_create is successful, a directory gets cre-
+ ated on pathname and it is populated with files.
+
+ The returned file handle can only be passed to spu_run(2) or closed,
+ other operations are not defined on it. When it is closed, all associ-
+ ated directory entries in spufs are removed. When the last file handle
+ pointing either inside of the context directory or to this file
+ descriptor is closed, the logical SPU context is destroyed.
+
+ The parameter flags can be zero or any bitwise or'd combination of the
+ following constants:
+
+ SPU_RAWIO
+ Allow mapping of some of the hardware registers of the SPU into
+ user space. This flag requires the CAP_SYS_RAWIO capability, see
+ capabilities(7).
+
+ The mode parameter specifies the permissions used for creating the new
+ directory in spufs. mode is modified with the user's umask(2) value
+ and then used for both the directory and the files contained in it. The
+ file permissions mask out some more bits of mode because they typically
+ support only read or write access. See stat(2) for a full list of the
+ possible mode values.
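+
+       As an illustration, a minimal sketch (the path /spu/myctx under
+       the conventional mount point is a hypothetical example):
+
+       int ctx_fd = spu_create("/spu/myctx", 0, 0755);
+
+       if (ctx_fd == -1)
+              perror("spu_create");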
+
+
+RETURN VALUE
+ spu_create returns a new file descriptor. It may return -1 to indicate
+ an error condition and set errno to one of the error codes listed
+ below.
+
+
+ERRORS
+       EACCES The current user does not have write access on the spufs mount
+              point.
+
+ EEXIST An SPU context already exists at the given path name.
+
+ EFAULT pathname is not a valid string pointer in the current address
+ space.
+
+ EINVAL pathname is not a directory in the spufs mount point.
+
+ ELOOP Too many symlinks were found while resolving pathname.
+
+ EMFILE The process has reached its maximum open file limit.
+
+ ENAMETOOLONG
+ pathname was too long.
+
+ ENFILE The system has reached the global open file limit.
+
+ ENOENT Part of pathname could not be resolved.
+
+ ENOMEM The kernel could not allocate all resources required.
+
+ ENOSPC There are not enough SPU resources available to create a new
+ context or the user specific limit for the number of SPU con-
+ texts has been reached.
+
+ ENOSYS the functionality is not provided by the current system, because
+ either the hardware does not provide SPUs or the spufs module is
+ not loaded.
+
+ ENOTDIR
+ A part of pathname is not a directory.
+
+
+
+NOTES
+ spu_create is meant to be used from libraries that implement a more
+ abstract interface to SPUs, not to be used from regular applications.
+ See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
+ ommended libraries.
+
+
+FILES
+ pathname must point to a location beneath the mount point of spufs. By
+ convention, it gets mounted in /spu.
+
+
+CONFORMING TO
+ This call is Linux specific and only implemented by the ppc64 architec-
+ ture. Programs using this system call are not portable.
+
+
+BUGS
+       The code does not yet fully implement all features outlined here.
+
+
+AUTHOR
+ Arnd Bergmann <arndb@de.ibm.com>
+
+SEE ALSO
+ capabilities(7), close(2), spu_run(2), spufs(7)
+
+
+
+Linux 2005-09-28 SPU_CREATE(2)
diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt
new file mode 100644
index 00000000..d4d41465
--- /dev/null
+++ b/Documentation/filesystems/squashfs.txt
@@ -0,0 +1,257 @@
+SQUASHFS 4.0 FILESYSTEM
+=======================
+
+Squashfs is a compressed read-only filesystem for Linux.
+It uses zlib/lzo compression to compress files, inodes and directories.
+Inodes in the system are very small and all blocks are packed to minimise
+data overhead. Block sizes greater than 4K are supported up to a maximum
+of 1 Mbyte (default block size 128K).
+
+Squashfs is intended for general read-only filesystem use, for archival
+use (i.e. in cases where a .tar.gz file may be used), and in constrained
+block device/memory systems (e.g. embedded systems) where low overhead is
+needed.
+
+Mailing list: squashfs-devel@lists.sourceforge.net
+Web site: www.squashfs.org
+
+1. FILESYSTEM FEATURES
+----------------------
+
+Squashfs filesystem features versus Cramfs:
+
+ Squashfs Cramfs
+
+Max filesystem size: 2^64 256 MiB
+Max file size: ~ 2 TiB 16 MiB
+Max files: unlimited unlimited
+Max directories: unlimited unlimited
+Max entries per directory: unlimited unlimited
+Max block size: 1 MiB 4 KiB
+Metadata compression: yes no
+Directory indexes: yes no
+Sparse file support: yes no
+Tail-end packing (fragments): yes no
+Exportable (NFS etc.): yes no
+Hard link support: yes no
+"." and ".." in readdir: yes no
+Real inode numbers: yes no
+32-bit uids/gids: yes no
+File creation time: yes no
+Xattr support: yes no
+ACL support: no no
+
+Squashfs compresses data, inodes and directories. In addition, inode and
+directory data are highly compacted, and packed on byte boundaries. Each
+compressed inode is on average 8 bytes in length (the exact length varies
+with the file type, i.e. regular file, directory, symbolic link, and
+block/char device inodes have different sizes).
+
+2. USING SQUASHFS
+-----------------
+
+As squashfs is a read-only filesystem, the mksquashfs program must be used to
+create populated squashfs filesystems. This and other squashfs utilities
+can be obtained from http://www.squashfs.org. Usage instructions can be
+obtained from this site also.
+
+
+3. SQUASHFS FILESYSTEM DESIGN
+-----------------------------
+
+A squashfs filesystem consists of a maximum of nine parts, packed together on a
+byte alignment:
+
+ ---------------
+ | superblock |
+ |---------------|
+ | compression |
+ | options |
+ |---------------|
+ | datablocks |
+ | & fragments |
+ |---------------|
+ | inode table |
+ |---------------|
+ | directory |
+ | table |
+ |---------------|
+ | fragment |
+ | table |
+ |---------------|
+ | export |
+ | table |
+ |---------------|
+ | uid/gid |
+ | lookup table |
+ |---------------|
+ | xattr |
+ | table |
+ ---------------
+
+Compressed data blocks are written to the filesystem as files are read from
+the source directory, and checked for duplicates. Once all file data has been
+written the completed inode, directory, fragment, export and uid/gid lookup
+tables are written.
+
+3.1 Compression options
+-----------------------
+
+Compressors can optionally support compression specific options (e.g.
+dictionary size). If non-default compression options have been used, then
+these are stored here.
+
+3.2 Inodes
+----------
+
+Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each
+compressed block is prefixed by a two byte length, the top bit is set if the
+block is uncompressed. A block will be uncompressed if the -noI option is set,
+or if the compressed block was larger than the uncompressed block.
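+
+As a sketch in C, the two byte header can be decoded along these lines (the
+on-disk value is little-endian in Squashfs 4.0; the kernel's actual macros
+live in fs/squashfs/squashfs_fs.h):
+
+	u16 word = le16_to_cpu(header);       /* 2-byte metadata block prefix */
+	int uncompressed = word & (1 << 15);  /* top bit: block stored raw    */
+	int length = word & 0x7fff;           /* size of the block data       */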
+
+Inodes are packed into the metadata blocks, and are not aligned to block
+boundaries, therefore inodes overlap compressed blocks. Inodes are identified
+by a 48-bit number which encodes the location of the compressed metadata block
+containing the inode, and the byte offset into that block where the inode is
+placed (<block, offset>).
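+
+As a sketch, decoding such a reference is a shift and a mask (the kernel's
+macros in fs/squashfs/squashfs_fs.h are along these lines):
+
+	#define INODE_BLK(ref)    ((unsigned int)((ref) >> 16))    /* block  */
+	#define INODE_OFFSET(ref) ((unsigned int)((ref) & 0xffff)) /* offset */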
+
+To maximise compression there are different inodes for each file type
+(regular file, directory, device, etc.), the inode contents and length
+varying with the type.
+
+To further maximise compression, two types of regular file inode and
+directory inode are defined: inodes optimised for frequently occurring
+regular files and directories, and extended types where extra
+information has to be stored.
+
+3.3 Directories
+---------------
+
+Like inodes, directories are packed into compressed metadata blocks, stored
+in a directory table. Directories are accessed using the start address of
+the metablock containing the directory and the offset into the
+decompressed block (<block, offset>).
+
+Directories are organised in a slightly complex way, and are not simply
+a list of file names. The organisation takes advantage of the
+fact that (in most cases) the inodes of the files will be in the same
+compressed metadata block, and therefore, can share the start block.
+Directories are therefore organised as a two level list: a directory
+header containing the shared start block value, followed by a sequence of
+directory entries, each of which shares that start block. A new directory
+header is written whenever the inode start block changes. The directory
+header/directory entry list is repeated as many times as necessary.
+
+Directories are sorted, and can contain a directory index to speed up
+file lookup. Directory indexes store one entry per metablock, each entry
+storing the index/filename mapping to the first directory header
+in each metadata block. Directories are sorted in alphabetical order,
+and at lookup the index is scanned linearly looking for the first filename
+alphabetically larger than the filename being looked up. At this point the
+location of the metadata block the filename is in has been found.
+The general idea of the index is to ensure that only one metadata block needs
+to be decompressed to do a lookup, irrespective of the length of the directory.
+This scheme has the advantage that it doesn't require extra memory overhead
+and doesn't require much extra storage on disk.
+
+3.4 File data
+-------------
+
+Regular files consist of a sequence of contiguous compressed blocks, and/or a
+compressed fragment block (tail-end packed block). The compressed size
+of each datablock is stored in a block list contained within the
+file inode.
+
+To speed up access to datablocks when reading 'large' files (256 Mbytes or
+larger), the code implements an index cache that caches the mapping from
+block index to datablock location on disk.
+
+The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
+retaining a simple and space-efficient block list on disk. The cache
+is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
+Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
+The index cache is designed to be memory efficient, and by default uses
+16 KiB.
+
+3.5 Fragment lookup table
+-------------------------
+
+Regular files can contain a fragment index which is mapped to a fragment
+location on disk and compressed size using a fragment lookup table. This
+fragment lookup table is itself stored compressed into metadata blocks.
+A second index table is used to locate these. For speed of access (and
+because it is small), this second index table is read at mount time and
+cached in memory.
+
+3.6 Uid/gid lookup table
+------------------------
+
+For space efficiency regular files store uid and gid indexes, which are
+converted to 32-bit uids/gids using an id look up table. This table is
+stored compressed into metadata blocks. A second index table is used to
+locate these. For speed of access (and because it is small), this second
+index table is read at mount time and cached in memory.
+
+3.7 Export table
+----------------
+
+To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
+can optionally (disabled with the -no-exports Mksquashfs option) contain
+an inode number to inode disk location lookup table. This is required to
+enable Squashfs to map inode numbers passed in filehandles to the inode
+location on disk, which is necessary when the export code reinstantiates
+expired/flushed inodes.
+
+This table is stored compressed into metadata blocks. A second index table is
+used to locate these. For speed of access (and because it is small), this
+second index table is read at mount time and cached in memory.
+
+3.8 Xattr table
+---------------
+
+The xattr table contains extended attributes for each inode. The xattrs
+for each inode are stored in a list, each list entry containing a type,
+name and value field. The type field encodes the xattr prefix
+("user.", "trusted." etc) and it also encodes how the name/value fields
+should be interpreted. Currently the type indicates whether the value
+is stored inline (in which case the value field contains the xattr value),
+or if it is stored out of line (in which case the value field stores a
+reference to where the actual value is stored). This allows large values
+to be stored out of line, improving scanning and lookup performance; it
+also allows values to be de-duplicated, the value being stored once and
+all other occurrences holding an out-of-line reference to that value.
+
+The xattr lists are packed into compressed 8K metadata blocks.
+To reduce overhead in inodes, rather than storing the on-disk
+location of the xattr list inside each inode, a 32-bit xattr id
+is stored. This xattr id is mapped into the location of the xattr
+list using a second xattr id lookup table.
+
+4. TODOS AND OUTSTANDING ISSUES
+-------------------------------
+
+4.1 Todo list
+-------------
+
+Implement ACL support.
+
+4.2 Squashfs internal cache
+---------------------------
+
+Blocks in Squashfs are compressed. To avoid repeatedly decompressing
+recently accessed data Squashfs uses two small metadata and fragment caches.
+
+The cache is not used for file datablocks; these are decompressed and cached in
+the page-cache in the normal way. The cache is used to temporarily cache
+fragment and metadata blocks which have been read as a result of a metadata
+(i.e. inode or directory) or fragment access. Because metadata and fragments
+are packed together into blocks (to gain greater compression), the read of a
+particular piece of metadata or fragment will retrieve other metadata/fragments
+which have been packed with it and which, because of locality of reference, may
+be read in the near future. Temporarily caching them ensures they are available
+for near-future access without requiring an additional read and decompress.
+
+In the future this internal cache may be replaced with an implementation which
+uses the kernel page cache. Because the page cache operates on page sized
+units this may introduce additional complexity in terms of locking and
+associated race conditions.
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt
new file mode 100644
index 00000000..74eaac26
--- /dev/null
+++ b/Documentation/filesystems/sysfs-pci.txt
@@ -0,0 +1,120 @@
+Accessing PCI device resources through sysfs
+--------------------------------------------
+
+sysfs, usually mounted at /sys, provides access to PCI resources on platforms
+that support it. For example, a given bus might look like this:
+
+ /sys/devices/pci0000:17
+ |-- 0000:17:00.0
+ | |-- class
+ | |-- config
+ | |-- device
+ | |-- enable
+ | |-- irq
+ | |-- local_cpus
+ | |-- remove
+ | |-- resource
+ | |-- resource0
+ | |-- resource1
+ | |-- resource2
+ | |-- rom
+ | |-- subsystem_device
+ | |-- subsystem_vendor
+ | `-- vendor
+ `-- ...
+
+The topmost element describes the PCI domain and bus number. In this case,
+the domain number is 0000 and the bus number is 17 (both values are in hex).
+This bus contains a single function device in slot 0. The domain and bus
+numbers are reproduced for convenience. Under the device directory are several
+files, each with their own function.
+
+ file function
+ ---- --------
+ class PCI class (ascii, ro)
+ config PCI config space (binary, rw)
+ device PCI device (ascii, ro)
+ enable Whether the device is enabled (ascii, rw)
+ irq IRQ number (ascii, ro)
+ local_cpus nearby CPU mask (cpumask, ro)
+ remove remove device from kernel's list (ascii, wo)
+ resource PCI resource host addresses (ascii, ro)
+ resource0..N PCI resource N, if present (binary, mmap, rw[1])
+ resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap)
+ rom PCI ROM resource, if present (binary, ro)
+ subsystem_device PCI subsystem device (ascii, ro)
+ subsystem_vendor PCI subsystem vendor (ascii, ro)
+ vendor PCI vendor (ascii, ro)
+
+ ro - read only file
+ rw - file is readable and writable
+ wo - write only file
+ mmap - file is mmapable
+ ascii - file contains ascii text
+ binary - file contains binary data
+ cpumask - file contains a cpumask type
+
+[1] rw for RESOURCE_IO (I/O port) regions only
+
+The read-only files are informational; writes to them will be ignored, with
+the exception of the 'rom' file. Writable files can be used to perform
+actions on the device (e.g. changing config space, detaching a device).
+mmapable files are available via an mmap of the file at offset 0 and can be
+used to do actual device programming from userspace. Note that some platforms
+don't support mmapping of certain resources, so be sure to check the return
+value from any attempted mmap. The most notable of these are I/O port
+resources, which also provide read/write access.
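+
+As an illustration, a minimal sketch of mapping BAR 0 of the example device
+above (the 4096 byte length is an assumption; take the real size from the
+'resource' file):
+
+	#include <fcntl.h>
+	#include <sys/mman.h>
+
+	int fd = open("/sys/devices/pci0000:17/0000:17:00.0/resource0",
+		      O_RDWR);
+	void *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
+			 MAP_SHARED, fd, 0);
+	/* check bar != MAP_FAILED; not every platform supports this mmap */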
+
+The 'enable' file provides a counter that indicates how many times the device
+has been enabled. If the 'enable' file currently returns '4', and a '1' is
+echoed into it, it will then return '5'. Echoing a '0' into it will decrease
+the count. Even when it returns to 0, though, some of the initialisation
+may not be reversed.
+
+The 'rom' file is special in that it provides read-only access to the device's
+ROM file, if available. It's disabled by default, however, so applications
+should write the string "1" to the file to enable it before attempting a read
+call, and disable it following the access by writing "0" to the file. Note
+that the device must be enabled for a rom read to return data successfully.
+In the event a driver is not bound to the device, it can be enabled using the
+'enable' file, documented above.
+
+The 'remove' file is used to remove the PCI device, by writing a non-zero
+integer to the file. This does not involve any kind of hot-plug functionality,
+e.g. powering off the device. The device is removed from the kernel's list of
+PCI devices, the sysfs directory for it is removed, and the device will be
+removed from any drivers attached to it. Removal of PCI root buses is
+disallowed.
+
+Accessing legacy resources through sysfs
+----------------------------------------
+
+Legacy I/O port and ISA memory resources are also provided in sysfs if the
+underlying platform supports them. They're located in the PCI class hierarchy,
+e.g.
+
+ /sys/class/pci_bus/0000:17/
+ |-- bridge -> ../../../devices/pci0000:17
+ |-- cpuaffinity
+ |-- legacy_io
+ `-- legacy_mem
+
+The legacy_io file is a read/write file that can be used by applications to
+do legacy port I/O. The application should open the file, seek to the desired
+port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
+file should be mmapped with an offset corresponding to the memory offset
+desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
+simply dereference the returned pointer (after checking for errors of course)
+to access legacy memory space.
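+
+As an illustration, a minimal sketch of a one byte port read through
+legacy_io (the bus path follows the example above):
+
+	#include <fcntl.h>
+	#include <unistd.h>
+
+	int fd = open("/sys/class/pci_bus/0000:17/legacy_io", O_RDWR);
+	unsigned char val;
+
+	lseek(fd, 0x3e8, SEEK_SET);	/* seek to the desired port      */
+	read(fd, &val, 1);		/* reads of 1, 2 or 4 bytes only */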
+
+Supporting PCI access on new platforms
+--------------------------------------
+
+In order to support PCI resource mapping as described above, Linux platform
+code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
+Platforms are free to only support subsets of the mmap functionality, but
+useful return codes should be provided.
+
+Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
+wishing to support legacy functionality should define it and provide
+pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt
new file mode 100644
index 00000000..caaaf126
--- /dev/null
+++ b/Documentation/filesystems/sysfs-tagging.txt
@@ -0,0 +1,42 @@
+Sysfs tagging
+-------------
+
+(Taken almost verbatim from Eric Biederman's netns tagging patch
+commit msg)
+
+The problem. Network devices show up in sysfs and with the network
+namespace active multiple devices with the same name can show up in
+the same directory, ouch!
+
+To avoid that problem and allow existing applications in network
+namespaces to see the same interface that is currently presented in
+sysfs, sysfs now has tagging directory support.
+
+By using the network namespace pointers as tags to separate out
+the sysfs directory entries, we ensure that we don't have conflicts
+in the directories, and applications only see a limited set of
+the network devices.
+
+Each sysfs directory entry may be tagged with zero or one
+namespaces. A sysfs_dirent is augmented with a void *s_ns. If a
+directory entry is tagged, then sysfs_dirent->s_flags will have a
+flag between KOBJ_NS_TYPE_NONE and KOBJ_NS_TYPES, and s_ns will
+point to the namespace to which it belongs.
+
+Each sysfs superblock's sysfs_super_info contains an array void
+*ns[KOBJ_NS_TYPES]. When a task in a tagging namespace of type
+kobj_nstype first mounts sysfs, a new superblock is created. It
+will be differentiated from other sysfs mounts by having its
+s_fs_info->ns[kobj_nstype] set to the new namespace. Note that
+through bind mounting and mounts propagation, a task can easily view
+the contents of other namespaces' sysfs mounts. Therefore, when a
+namespace exits, it will call kobj_ns_exit() to invalidate any
+sysfs_dirent->s_ns pointers pointing to it.
+
+Users of this interface:
+- define a type in the kobj_ns_type enumeration.
+- call kobj_ns_type_register() with its kobj_ns_type_operations (sketched
+  below), which has
+  - current_ns() which returns current's namespace
+  - netlink_ns() which returns a socket's namespace
+  - initial_ns() which returns the initial namespace
+- call kobj_ns_exit() when an individual tag is no longer valid
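+
+For reference, the operations table looks roughly like this (abridged
+from include/linux/kobject_ns.h; details may vary between versions):
+
+struct kobj_ns_type_operations {
+	enum kobj_ns_type type;
+	const void *(*current_ns)(void);
+	const void *(*netlink_ns)(struct sock *sk);
+	const void *(*initial_ns)(void);
+};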
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
new file mode 100644
index 00000000..597f728e
--- /dev/null
+++ b/Documentation/filesystems/sysfs.txt
@@ -0,0 +1,372 @@
+
+sysfs - _The_ filesystem for exporting kernel objects.
+
+Patrick Mochel <mochel@osdl.org>
+Mike Murphy <mamurph@cs.clemson.edu>
+
+Revised: 15 July 2010
+Original: 10 January 2003
+
+
+What it is:
+~~~~~~~~~~~
+
+sysfs is a ram-based filesystem initially based on ramfs. It provides
+a means to export kernel data structures, their attributes, and the
+linkages between them to userspace.
+
+sysfs is tied inherently to the kobject infrastructure. Please read
+Documentation/kobject.txt for more information concerning the kobject
+interface.
+
+
+Using sysfs
+~~~~~~~~~~~
+
+sysfs is always compiled in if CONFIG_SYSFS is defined. You can access
+it by doing:
+
+ mount -t sysfs sysfs /sys
+
+
+Directory Creation
+~~~~~~~~~~~~~~~~~~
+
+For every kobject that is registered with the system, a directory is
+created for it in sysfs. That directory is created as a subdirectory
+of the kobject's parent, expressing internal object hierarchies to
+userspace. Top-level directories in sysfs represent the common
+ancestors of object hierarchies; i.e. the subsystems the objects
+belong to.
+
+Sysfs internally stores a pointer to the kobject that implements a
+directory in the sysfs_dirent object associated with the directory. In
+the past this kobject pointer has been used by sysfs to do reference
+counting directly on the kobject whenever the file is opened or closed.
+With the current sysfs implementation the kobject reference count is
+only modified directly by the function sysfs_schedule_callback().
+
+
+Attributes
+~~~~~~~~~~
+
+Attributes can be exported for kobjects in the form of regular files in
+the filesystem. Sysfs forwards file I/O operations to methods defined
+for the attributes, providing a means to read and write kernel
+attributes.
+
+Attributes should be ASCII text files, preferably with only one value
+per file. It is noted that it may not be efficient to contain only one
+value per file, so it is socially acceptable to express an array of
+values of the same type.
+
+Mixing types, expressing multiple lines of data, and doing fancy
+formatting of data is heavily frowned upon. Doing these things may get
+you publicly humiliated and your code rewritten without notice.
+
+
+An attribute definition is simply:
+
+struct attribute {
+ char * name;
+ struct module *owner;
+ mode_t mode;
+};
+
+
+int sysfs_create_file(struct kobject * kobj, const struct attribute * attr);
+void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr);
+
+
+A bare attribute contains no means to read or write the value of the
+attribute. Subsystems are encouraged to define their own attribute
+structure and wrapper functions for adding and removing attributes for
+a specific object type.
+
+For example, the driver model defines struct device_attribute like:
+
+struct device_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct device *dev, struct device_attribute *attr,
+ char *buf);
+ ssize_t (*store)(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count);
+};
+
+int device_create_file(struct device *, const struct device_attribute *);
+void device_remove_file(struct device *, const struct device_attribute *);
+
+It also defines this helper for defining device attributes:
+
+#define DEVICE_ATTR(_name, _mode, _show, _store) \
+struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
+
+For example, declaring
+
+static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
+
+is equivalent to doing:
+
+static struct device_attribute dev_attr_foo = {
+       .attr = {
+              .name = "foo",
+              .mode = S_IWUSR | S_IRUGO,
+       },
+       .show = show_foo,
+       .store = store_foo,
+};
+
+
+Subsystem-Specific Callbacks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a subsystem defines a new attribute type, it must implement a
+set of sysfs operations for forwarding read and write calls to the
+show and store methods of the attribute owners.
+
+struct sysfs_ops {
+ ssize_t (*show)(struct kobject *, struct attribute *, char *);
+ ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
+};
+
+[ Subsystems should have already defined a struct kobj_type as a
+descriptor for this type, which is where the sysfs_ops pointer is
+stored. See the kobject documentation for more information. ]
+
+When a file is read or written, sysfs calls the appropriate method
+for the type. The method then translates the generic struct kobject
+and struct attribute pointers to the appropriate pointer types, and
+calls the associated methods.
+
+
+To illustrate:
+
+#define to_dev(obj) container_of(obj, struct device, kobj)
+#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
+
+static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ struct device_attribute *dev_attr = to_dev_attr(attr);
+ struct device *dev = to_dev(kobj);
+ ssize_t ret = -EIO;
+
+ if (dev_attr->show)
+ ret = dev_attr->show(dev, dev_attr, buf);
+ if (ret >= (ssize_t)PAGE_SIZE) {
+ print_symbol("dev_attr_show: %s returned bad count\n",
+ (unsigned long)dev_attr->show);
+ }
+ return ret;
+}
+
+
+
+Reading/Writing Attribute Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To read or write attributes, show() or store() methods must be
+specified when declaring the attribute. The method types should be as
+simple as those defined for device attributes:
+
+ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf);
+ssize_t (*store)(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count);
+
+IOW, they should take only an object, an attribute, and a buffer as parameters.
+
+
+sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
+method. Sysfs will call the method exactly once for each read or
+write. This forces the following behavior on the method
+implementations:
+
+- On read(2), the show() method should fill the entire buffer.
+ Recall that an attribute should only be exporting one value, or an
+ array of similar values, so this shouldn't be that expensive.
+
+ This allows userspace to do partial reads and forward seeks
+ arbitrarily over the entire file at will. If userspace seeks back to
+ zero or does a pread(2) with an offset of '0' the show() method will
+ be called again, rearmed, to fill the buffer.
+
+- On write(2), sysfs expects the entire buffer to be passed during the
+ first write. Sysfs then passes the entire buffer to the store()
+ method.
+
+ When writing sysfs files, userspace processes should first read the
+ entire file, modify the values it wishes to change, then write the
+ entire buffer back.
+
+ Attribute method implementations should operate on an identical
+ buffer when reading and writing values.
+
+Other notes:
+
+- Writing causes the show() method to be rearmed regardless of current
+ file position.
+
+- The buffer will always be PAGE_SIZE bytes in length. On i386, this
+ is 4096.
+
+- show() methods should return the number of bytes printed into the
+ buffer. This is the return value of scnprintf().
+
+- show() should always use scnprintf().
+
+- store() should return the number of bytes used from the buffer. If the
+ entire buffer has been used, just return the count argument.
+
+- show() or store() can always return errors. If a bad value comes
+ through, be sure to return an error.
+
+- The object passed to the methods will be pinned in memory via sysfs
+  reference counting of its embedded object. However, the physical
+ entity (e.g. device) the object represents may not be present. Be
+ sure to have a way to check this, if necessary.
+
+
+A very simple (and naive) implementation of a device attribute is:
+
+static ssize_t show_name(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
+}
+
+static ssize_t store_name(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ snprintf(dev->name, sizeof(dev->name), "%.*s",
+ (int)min(count, sizeof(dev->name) - 1), buf);
+ return count;
+}
+
+static DEVICE_ATTR(name, S_IRUGO | S_IWUSR, show_name, store_name);
+
+
+(Note that the real implementation doesn't allow userspace to set the
+name for a device.)
+
+
+Top Level Directory Layout
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The sysfs directory arrangement exposes the relationship of kernel
+data structures.
+
+The top level sysfs directory looks like:
+
+block/
+bus/
+class/
+dev/
+devices/
+firmware/
+net/
+fs/
+
+devices/ contains a filesystem representation of the device tree. It maps
+directly to the internal kernel device tree, which is a hierarchy of
+struct device.
+
+bus/ contains flat directory layout of the various bus types in the
+kernel. Each bus's directory contains two subdirectories:
+
+ devices/
+ drivers/
+
+devices/ contains symlinks for each device discovered in the system
+that point to the device's directory under root/.
+
+drivers/ contains a directory for each device driver that is loaded
+for devices on that particular bus (this assumes that drivers do not
+span multiple bus types).
+
+fs/ contains a directory for some filesystems. Currently each
+filesystem wanting to export attributes must create its own hierarchy
+below fs/ (see ./fuse.txt for an example).
+
+dev/ contains two directories char/ and block/. Inside these two
+directories there are symlinks named <major>:<minor>. These symlinks
+point to the sysfs directory for the given device. /sys/dev provides a
+quick way to lookup the sysfs interface for a device from the result of
+a stat(2) operation.
+
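+For instance, a userspace program can map a stat(2) result to the
+matching sysfs directory like this (a minimal sketch; error handling
+is abbreviated):
+
+#include <stdio.h>
+#include <sys/stat.h>
+#include <sys/sysmacros.h>
+
+int main(int argc, char **argv)
+{
+	struct stat st;
+
+	if (argc < 2 || stat(argv[1], &st) < 0)
+		return 1;
+	/* st_dev identifies the block device holding the file */
+	printf("/sys/dev/block/%u:%u\n", major(st.st_dev), minor(st.st_dev));
+	return 0;
+}
+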
+More information on driver-model specific features can be found in
+Documentation/driver-model/.
+
+
+TODO: Finish this section.
+
+
+Current Interfaces
+~~~~~~~~~~~~~~~~~~
+
+The following interface layers currently exist in sysfs:
+
+
+- devices (include/linux/device.h)
+----------------------------------
+Structure:
+
+struct device_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct device *dev, struct device_attribute *attr,
+ char *buf);
+ ssize_t (*store)(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count);
+};
+
+Declaring:
+
+DEVICE_ATTR(_name, _mode, _show, _store);
+
+Creation/Removal:
+
+int device_create_file(struct device *dev, const struct device_attribute * attr);
+void device_remove_file(struct device *dev, const struct device_attribute * attr);
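+
+For example, a hedged sketch (the attribute "foo", its value, and the
+probe()/remove() placement are illustrative, not from a real driver):
+
+static ssize_t show_foo(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%d\n", 42);
+}
+static DEVICE_ATTR(foo, S_IRUGO, show_foo, NULL);
+
+/* in the driver's probe() routine: */
+	ret = device_create_file(dev, &dev_attr_foo);
+/* and in its remove() routine: */
+	device_remove_file(dev, &dev_attr_foo);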
+
+
+- bus drivers (include/linux/device.h)
+--------------------------------------
+Structure:
+
+struct bus_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct bus_type *, char * buf);
+ ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
+};
+
+Declaring:
+
+BUS_ATTR(_name, _mode, _show, _store)
+
+Creation/Removal:
+
+int bus_create_file(struct bus_type *, struct bus_attribute *);
+void bus_remove_file(struct bus_type *, struct bus_attribute *);
+
+
+- device drivers (include/linux/device.h)
+-----------------------------------------
+
+Structure:
+
+struct driver_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct device_driver *, char * buf);
+ ssize_t (*store)(struct device_driver *, const char * buf,
+ size_t count);
+};
+
+Declaring:
+
+DRIVER_ATTR(_name, _mode, _show, _store)
+
+Creation/Removal:
+
+int driver_create_file(struct device_driver *, const struct driver_attribute *);
+void driver_remove_file(struct device_driver *, const struct driver_attribute *);
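+
+The pattern mirrors the device attribute example above; a hedged
+sketch (the driver object "exampledrv_driver" is hypothetical):
+
+static ssize_t show_version(struct device_driver *drv, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "1.0\n");
+}
+static DRIVER_ATTR(version, S_IRUGO, show_version, NULL);
+
+/* after driver_register(&exampledrv_driver) succeeds: */
+	ret = driver_create_file(&exampledrv_driver, &driver_attr_version);
+/* and before driver_unregister(&exampledrv_driver): */
+	driver_remove_file(&exampledrv_driver, &driver_attr_version);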
+
+
diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.txt
new file mode 100644
index 00000000..253b50d1
--- /dev/null
+++ b/Documentation/filesystems/sysv-fs.txt
@@ -0,0 +1,197 @@
+It implements all of
+ - Xenix FS,
+ - SystemV/386 FS,
+ - Coherent FS.
+
+To install:
+* Answer the 'System V and Coherent filesystem support' question with 'y'
+ when configuring the kernel.
+* To mount a disk or a partition, use
+ mount [-r] -t sysv device mountpoint
+ The file system type names
+ -t sysv
+ -t xenix
+ -t coherent
+ may be used interchangeably, but the last two will eventually disappear.
+
+Bugs in the present implementation:
+- Coherent FS:
+ - The "free list interleave" n:m is currently ignored.
+ - Only file systems with no filesystem name and no pack name are recognized.
+ (See Coherent "man mkfs" for a description of these features.)
+- SystemV Release 2 FS:
+ The superblock is only searched in blocks 9, 15 and 18, which
+ correspond to the beginning of track 1 on floppy disks. There is no
+ support for this FS on hard disk yet.
+
+
+These filesystems are rather similar. Here is a comparison with Minix FS:
+
+* Linux fdisk reports on partitions
+ - Minix FS 0x81 Linux/Minix
+ - Xenix FS ??
+ - SystemV FS ??
+ - Coherent FS 0x08 AIX bootable
+
+* Size of a block or zone (data allocation unit on disk)
+ - Minix FS 1024
+ - Xenix FS 1024 (also 512 ??)
+ - SystemV FS 1024 (also 512 and 2048)
+ - Coherent FS 512
+
+* General layout: all have one boot block, one super block and
+ separate areas for inodes and for directories/data.
+ On SystemV Release 2 FS (e.g. Microport) the first track is reserved and
+ all the block numbers (including the super block) are offset by one track.
+
+* Byte ordering of "short" (16 bit entities) on disk:
+ - Minix FS little endian 0 1
+ - Xenix FS little endian 0 1
+ - SystemV FS little endian 0 1
+ - Coherent FS little endian 0 1
+ Of course, this affects only the file system, not the data of files on it!
+
+* Byte ordering of "long" (32 bit entities) on disk:
+ - Minix FS little endian 0 1 2 3
+ - Xenix FS little endian 0 1 2 3
+ - SystemV FS little endian 0 1 2 3
+ - Coherent FS PDP-11 2 3 0 1
+ Of course, this affects only the file system, not the data of files on it!
+
+* Inode on disk: "short", 0 means non-existent, the root dir ino is:
+ - Minix FS 1
+ - Xenix FS, SystemV FS, Coherent FS 2
+
+* Maximum number of hard links to a file:
+ - Minix FS 250
+ - Xenix FS ??
+ - SystemV FS ??
+ - Coherent FS >=10000
+
+* Free inode management:
+ - Minix FS a bitmap
+ - Xenix FS, SystemV FS, Coherent FS
+ There is a cache of a certain number of free inodes in the super-block.
+ When it is exhausted, new free inodes are found using a linear search.
+
+* Free block management:
+ - Minix FS a bitmap
+ - Xenix FS, SystemV FS, Coherent FS
+ Free blocks are organized in a "free list" - perhaps a misleading
+ term, since it is not true that every free block contains a pointer
+ to the next free block. Rather, the free blocks are organized in
+ chunks of limited size, and every now and then a free block contains
+ pointers to the free blocks pertaining to the next chunk; the first
+ of these again contains pointers, and so on. The list terminates with
+ a "block number" 0 on Xenix FS and SystemV FS, and with a zeroed-out
+ block on Coherent FS. (A code sketch of this scheme follows the
+ comparison list below.)
+
+* Super-block location:
+ - Minix FS block 1 = bytes 1024..2047
+ - Xenix FS block 1 = bytes 1024..2047
+ - SystemV FS bytes 512..1023
+ - Coherent FS block 1 = bytes 512..1023
+
+* Super-block layout:
+ - Minix FS
+ unsigned short s_ninodes;
+ unsigned short s_nzones;
+ unsigned short s_imap_blocks;
+ unsigned short s_zmap_blocks;
+ unsigned short s_firstdatazone;
+ unsigned short s_log_zone_size;
+ unsigned long s_max_size;
+ unsigned short s_magic;
+ - Xenix FS, SystemV FS, Coherent FS
+ unsigned short s_firstdatazone;
+ unsigned long s_nzones;
+ unsigned short s_fzone_count;
+ unsigned long s_fzones[NICFREE];
+ unsigned short s_finode_count;
+ unsigned short s_finodes[NICINOD];
+ char s_flock;
+ char s_ilock;
+ char s_modified;
+ char s_rdonly;
+ unsigned long s_time;
+ short s_dinfo[4]; -- SystemV FS only
+ unsigned long s_free_zones;
+ unsigned short s_free_inodes;
+ short s_dinfo[4]; -- Xenix FS only
+ unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
+ char s_fname[6];
+ char s_fpack[6];
+ then they differ considerably:
+ Xenix FS
+ char s_clean;
+ char s_fill[371];
+ long s_magic;
+ long s_type;
+ SystemV FS
+ long s_fill[12 or 14];
+ long s_state;
+ long s_magic;
+ long s_type;
+ Coherent FS
+ unsigned long s_unique;
+ Note that Coherent FS has no magic.
+
+* Inode layout:
+ - Minix FS
+ unsigned short i_mode;
+ unsigned short i_uid;
+ unsigned long i_size;
+ unsigned long i_time;
+ unsigned char i_gid;
+ unsigned char i_nlinks;
+ unsigned short i_zone[7+1+1];
+ - Xenix FS, SystemV FS, Coherent FS
+ unsigned short i_mode;
+ unsigned short i_nlink;
+ unsigned short i_uid;
+ unsigned short i_gid;
+ unsigned long i_size;
+ unsigned char i_zone[3*(10+1+1+1)];
+ unsigned long i_atime;
+ unsigned long i_mtime;
+ unsigned long i_ctime;
+
+* Regular file data blocks are organized as
+ - Minix FS
+ 7 direct blocks
+ 1 indirect block (pointers to blocks)
+ 1 double-indirect block (pointer to pointers to blocks)
+ - Xenix FS, SystemV FS, Coherent FS
+ 10 direct blocks
+ 1 indirect block (pointers to blocks)
+ 1 double-indirect block (pointer to pointers to blocks)
+ 1 triple-indirect block (pointer to pointers to pointers to blocks)
+
+* Inode size, inodes per block
+ - Minix FS 32 32
+ - Xenix FS 64 16
+ - SystemV FS 64 16
+ - Coherent FS 64 8
+
+* Directory entry on disk
+ - Minix FS
+ unsigned short inode;
+ char name[14/30];
+ - Xenix FS, SystemV FS, Coherent FS
+ unsigned short inode;
+ char name[14];
+
+* Dir entry size, dir entries per block
+ - Minix FS 16/32 64/32
+ - Xenix FS 16 64
+ - SystemV FS 16 64
+ - Coherent FS 16 32
+
+* How to implement symbolic links such that the host fsck doesn't scream:
+ - Minix FS normal
+ - Xenix FS kludge: as regular files with chmod 1000
+ - SystemV FS ??
+ - Coherent FS kludge: as regular files with chmod 1000
+
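+A sketch of the "free list" scheme described under "Free block
+management" above (hedged: the struct and helper names below are made
+up for illustration and are not the kernel's own):
+
+	#define NICFREE 50	/* cache size; varies by FS flavour */
+
+	struct sv_sb {		/* in-core free-list cache */
+		unsigned short s_fzone_count;
+		unsigned long  s_fzones[NICFREE];
+	};
+
+	struct fblk {		/* one on-disk chunk of the free list */
+		unsigned short df_nfree;
+		unsigned long  df_free[NICFREE];
+	};
+
+	unsigned long take_free_block(struct sv_sb *sb)
+	{
+		unsigned long bno;
+
+		if (sb->s_fzone_count == 0)
+			return 0;	/* cache empty: no space */
+		bno = sb->s_fzones[--sb->s_fzone_count];
+		if (bno == 0)
+			return 0;	/* "block number" 0 ends the list */
+		if (sb->s_fzone_count == 0) {
+			/* the block just taken holds the next chunk:
+			   a count plus up to NICFREE block numbers */
+			struct fblk *f = read_block(bno); /* hypothetical I/O */
+			sb->s_fzone_count = f->df_nfree;
+			memcpy(sb->s_fzones, f->df_free,
+			       sizeof(sb->s_fzones));
+		}
+		return bno;	/* bno itself is now allocated */
+	}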
+
+Notation: We often speak of a "block" but mean a zone (the allocation unit)
+and not the disk driver's notion of "block".
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
new file mode 100644
index 00000000..98ef5512
--- /dev/null
+++ b/Documentation/filesystems/tmpfs.txt
@@ -0,0 +1,148 @@
+Tmpfs is a file system which keeps all files in virtual memory.
+
+
+Everything in tmpfs is temporary in the sense that no files will be
+created on your hard drive. If you unmount a tmpfs instance,
+everything stored therein is lost.
+
+tmpfs puts everything into the kernel internal caches and grows and
+shrinks to accommodate the files it contains and is able to swap
+unneeded pages out to swap space. It has maximum size limits which can
+be adjusted on the fly via 'mount -o remount ...'
+
+If you compare it to ramfs (which was the template to create tmpfs)
+you gain swapping and limit checking. Another similar thing is the RAM
+disk (/dev/ram*), which simulates a fixed size hard disk in physical
+RAM, on top of which you have to create an ordinary filesystem. Ram
+disks cannot swap, and you cannot resize them.
+
+Since tmpfs lives completely in the page cache and on swap, all tmpfs
+pages currently in memory will show up as cached. They will not show up
+as shared memory or anything like that. You can check the actual
+RAM+swap use of a tmpfs instance with df(1) and du(1).
+
+
+tmpfs has the following uses:
+
+1) There is always a kernel internal mount which you will not see at
+ all. This is used for shared anonymous mappings and SYSV shared
+ memory.
+
+ This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
+ set, the user visible part of tmpfs is not built, but the internal
+ mechanisms are always present.
+
+2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
+ POSIX shared memory (shm_open, shm_unlink; a small usage sketch
+ follows this list). Adding the following line to /etc/fstab should
+ take care of this:
+
+ tmpfs /dev/shm tmpfs defaults 0 0
+
+ Remember to create the directory that you intend to mount tmpfs on
+ if necessary.
+
+ This mount is _not_ needed for SYSV shared memory. The internal
+ mount is used for that. (In the 2.3 kernel versions it was
+ necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
+ shared memory)
+
+3) Some people (including me) find it very convenient to mount it
+ e.g. on /tmp and /var/tmp and have a big swap partition. And now
+ loop mounts of tmpfs files do work, so mkinitrd shipped by most
+ distributions should succeed with a tmpfs /tmp.
+
+4) And probably a lot more I do not know about :-)
+
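+For use 2) above, a minimal userspace sketch (the object name
+"/example" is arbitrary; on older glibc, link with -lrt):
+
+#include <fcntl.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+int main(void)
+{
+	char *p;
+	/* creates a file on the tmpfs mounted at /dev/shm */
+	int fd = shm_open("/example", O_CREAT | O_RDWR, 0600);
+
+	if (fd < 0 || ftruncate(fd, 4096) < 0)
+		return 1;
+	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (p == MAP_FAILED)
+		return 1;
+	strcpy(p, "hello");	/* visible to other mappers */
+	munmap(p, 4096);
+	close(fd);
+	shm_unlink("/example");	/* removes the tmpfs file */
+	return 0;
+}
+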
+
+tmpfs has three mount options for sizing:
+
+size: The limit of allocated bytes for this tmpfs instance. The
+ default is half of your physical RAM without swap. If you
+ oversize your tmpfs instances the machine will deadlock
+ since the OOM handler will not be able to free that memory.
+nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
+nr_inodes: The maximum number of inodes for this instance. The default
+ is half of the number of your physical RAM pages, or (on a
+ machine with highmem) the number of lowmem RAM pages,
+ whichever is the lower.
+
+These parameters accept a suffix k, m or g for kilo, mega and giga and
+can be changed on remount. The size parameter also accepts a suffix %
+to limit this tmpfs instance to that percentage of your physical RAM:
+the default, when neither size nor nr_blocks is specified, is size=50%
+
+If nr_blocks=0 (or size=0), blocks will not be limited in that instance;
+if nr_inodes=0, inodes will not be limited. It is generally unwise to
+mount with such options, since it allows any user with write access to
+use up all the memory on the machine; but it enhances the scalability of
+that instance in a system with many cpus making intensive use of it.
+
+
+tmpfs has a mount option to set the NUMA memory allocation policy for
+all files in that instance (if CONFIG_NUMA is enabled) - which can be
+adjusted on the fly via 'mount -o remount ...'
+
+mpol=default use the process allocation policy
+ (see set_mempolicy(2))
+mpol=prefer:Node prefers to allocate memory from the given Node
+mpol=bind:NodeList allocates memory only from nodes in NodeList
+mpol=interleave prefers to allocate from each node in turn
+mpol=interleave:NodeList allocates from each node of NodeList in turn
+mpol=local prefers to allocate memory from the local node
+
+NodeList format is a comma-separated list of decimal numbers and ranges,
+a range being two hyphen-separated decimal numbers, the smallest and
+largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15
+
+A memory policy with a valid NodeList will be saved, as specified, for
+use at file creation time. When a task allocates a file in the file
+system, the mount option memory policy will be applied with a NodeList,
+if any, modified by the calling task's cpuset constraints
+[See Documentation/cgroups/cpusets.txt] and any optional flags, listed
+below. If the resulting NodeList is the empty set, the effective memory
+policy for the file will revert to "default" policy.
+
+NUMA memory allocation policies have optional flags that can be used in
+conjunction with their modes. These optional flags can be specified
+when tmpfs is mounted by appending them to the mode before the NodeList.
+See Documentation/vm/numa_memory_policy.txt for a list of all available
+memory allocation policy mode flags and their effect on memory policy.
+
+ =static is equivalent to MPOL_F_STATIC_NODES
+ =relative is equivalent to MPOL_F_RELATIVE_NODES
+
+For example, mpol=bind=static:NodeList is the equivalent of an
+allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES.
+
+Note that trying to mount a tmpfs with an mpol option will fail if the
+running kernel does not support NUMA; and will fail if its nodelist
+specifies a node which is not online. If your system relies on that
+tmpfs being mounted, but from time to time runs a kernel built without
+NUMA capability (perhaps a safe recovery kernel), or with fewer nodes
+online, then it is advisable to omit the mpol option from automatic
+mount options. It can be added later, when the tmpfs is already mounted
+on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.
+
+
+To specify the initial root directory you can use the following mount
+options:
+
+mode: The permissions as an octal number
+uid: The user id
+gid: The group id
+
+These options do not have any effect on remount. You can change these
+parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
+
+
+So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
+will give you a tmpfs instance on /mytmpfs which can allocate 10GB of
+RAM/swap in 10240 inodes, and it is only accessible by root.
+
+
+Author:
+ Christoph Rohland <cr@sap.com>, 1.12.01
+Updated:
+ Hugh Dickins, 4 June 2007
+Updated:
+ KOSAKI Motohiro, 16 Mar 2010
diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt
new file mode 100644
index 00000000..8e4fab63
--- /dev/null
+++ b/Documentation/filesystems/ubifs.txt
@@ -0,0 +1,147 @@
+Introduction
+=============
+
+UBIFS file-system stands for UBI File System. UBI stands for "Unsorted
+Block Images". UBIFS is a flash file system, which means it is designed
+to work with flash devices. It is important to understand that UBIFS
+is completely different from any traditional file-system in Linux, like
+Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems
+which work with MTD devices, not block devices. The other Linux
+file-system of this class is JFFS2.
+
+To make this clearer, here is a small comparison of MTD devices and
+block devices.
+
+1 MTD devices represent flash devices and they consist of eraseblocks of
+ rather large size, typically about 128KiB. Block devices consist of
+ small blocks, typically 512 bytes.
+2 MTD devices support 3 main operations - read from some offset within an
+ eraseblock, write to some offset within an eraseblock, and erase a whole
+ eraseblock. Block devices support 2 main operations - read a whole
+ block and write a whole block.
+3 The whole eraseblock has to be erased before it becomes possible to
+ re-write its contents. Blocks may be just re-written.
+4 Eraseblocks become worn out after some number of erase cycles -
+ typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC
+ NAND flashes. Blocks do not have the wear-out property.
+5 Eraseblocks may become bad (only on NAND flashes) and software should
+ deal with this. Blocks on hard drives typically do not become bad,
+ because hardware has mechanisms to substitute bad blocks, at least in
+ modern LBA disks.
+
+It should be quite obvious why UBIFS is very different from traditional
+file-systems.
+
+UBIFS works on top of UBI. UBI is a separate software layer which may be
+found in drivers/mtd/ubi. UBI is basically a volume management and
+wear-leveling layer. It provides so-called UBI volumes, which are a
+higher-level abstraction than an MTD device. The programming model of
+UBI devices is very similar to that of MTD devices - they still consist
+of large eraseblocks, they have read/write/erase operations, but UBI
+devices are devoid of limitations like wear and bad blocks (items 4 and
+5 in the above list).
+
+In a sense, UBIFS is the next generation of the JFFS2 file-system, but
+it is very different from, and incompatible with, JFFS2. The following
+are the main differences.
+
+* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on
+ top of UBI volumes.
+* JFFS2 does not have on-media index and has to build it while mounting,
+ which requires full media scan. UBIFS maintains the FS indexing
+ information on the flash media and does not require full media scan,
+ so it mounts many times faster than JFFS2.
+* JFFS2 is a write-through file-system, while UBIFS supports write-back,
+ which makes UBIFS much faster on writes.
+
+Similarly to JFFS2, UBIFS supports on-the-fly compression, which makes
+it possible to fit quite a lot of data onto the flash.
+
+Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts.
+It does not need stuff like fsck.ext2. UBIFS automatically replays its
+journal and recovers from crashes, ensuring that the on-flash data
+structures are consistent.
+
+UBIFS scales logarithmically (most of the data structures it uses are
+trees), so the mount time and memory consumption do not depend linearly
+on the flash size, as they do in the case of JFFS2. This is because
+UBIFS maintains the FS index on the flash media. However, UBIFS depends
+on UBI, which scales linearly, so the overall UBI/UBIFS stack scales
+linearly. Nevertheless, UBI/UBIFS scales considerably better than JFFS2.
+
+The authors of UBIFS believe that it is possible to develop UBI2, which
+would scale logarithmically as well. UBI2 would support the same API as
+UBI, but would be binary incompatible with it. So UBIFS would not need
+to be changed to use UBI2.
+
+
+Mount options
+=============
+
+(*) == default.
+
+bulk_read read more in one go to take advantage of flash
+ media that read faster sequentially
+no_bulk_read (*) do not bulk-read
+no_chk_data_crc (*) skip checking of CRCs on data nodes in order to
+ improve read performance. Use this option only
+ if the flash media is highly reliable. The effect
+ of this option is that corruption of the contents
+ of a file can go unnoticed.
+chk_data_crc do not skip checking CRCs on data nodes
+compr=none override default compressor and set it to "none"
+compr=lzo override default compressor and set it to "lzo"
+compr=zlib override default compressor and set it to "zlib"
+
+
+Quick usage instructions
+========================
+
+The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax,
+where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is
+UBI volume name.
+
+Mount volume 0 on UBI device 0 to /mnt/ubifs:
+$ mount -t ubifs ubi0_0 /mnt/ubifs
+
+Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume
+name):
+$ mount -t ubifs ubi0:rootfs /mnt/ubifs
+
+The following is an example of the kernel boot arguments to attach mtd0
+to UBI and mount volume "rootfs":
+ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs
+
+
+Module Parameters for Debugging
+===============================
+
+When UBIFS has been compiled with debugging enabled, there are 2 module
+parameters that are available to control aspects of testing and debugging.
+
+debug_chks Selects extra checks that UBIFS can do while running:
+
+ Check Flag value
+
+ General checks 1
+ Check Tree Node Cache (TNC) 2
+ Check indexing tree size 4
+ Check orphan area 8
+ Check old indexing tree 16
+ Check LEB properties (lprops) 32
+ Check leaf nodes and inodes 64
+
+debug_tsts Selects a mode of testing, as follows:
+
+ Test mode Flag value
+
+ Failure mode for recovery testing 4
+
+For example, set debug_chks to 3 to enable general and TNC checks.
+
+
+References
+==========
+
+UBIFS documentation and FAQ/HOWTO at the MTD web site:
+http://www.linux-mtd.infradead.org/doc/ubifs.html
+http://www.linux-mtd.infradead.org/faq/ubifs.html
diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt
new file mode 100644
index 00000000..902b95d0
--- /dev/null
+++ b/Documentation/filesystems/udf.txt
@@ -0,0 +1,82 @@
+*
+* Documentation/filesystems/udf.txt
+*
+UDF Filesystem version 0.9.8.1
+
+If you encounter problems with reading UDF discs using this driver,
+please report them to linux_udf@hpesjro.fc.hp.com, which is the
+developer's list.
+
+Write support requires a block driver which supports writing. Currently
+dvd+rw drives and media support true random sector writes, and so a udf
+filesystem on such devices can be directly mounted read/write. CD-RW
+media, however, does not support this. Instead the media can be formatted
+for packet mode using the utility cdrwtool, then the pktcdvd driver can
+be bound to the underlying cd device to provide the required buffering
+and read-modify-write cycles to allow the filesystem random sector writes
+while providing the hardware with only full packet writes. While not
+required for dvd+rw media, use of the pktcdvd driver often enhances
+performance due to very poor read-modify-write support supplied internally
+by drive firmware.
+
+-------------------------------------------------------------------------------
+The following mount options are supported:
+
+ gid= Set the default group.
+ umask= Set the default umask.
+ mode= Set the default file permissions.
+ dmode= Set the default directory permissions.
+ uid= Set the default user.
+ bs= Set the block size.
+ unhide Show otherwise hidden files.
+ undelete Show deleted files in lists.
+ adinicb Embed data in the inode (default)
+ noadinicb Don't embed data in the inode
+ shortad Use short ad's
+ longad Use long ad's (default)
+ nostrict Unset strict conformance
+ iocharset= Set the NLS character set
+
+The uid= and gid= options need a bit more explaining. They will accept a
+decimal numeric value which will be used as the default ID for that mount.
+They will also accept the strings "ignore" and "forget". For files on the
+disk that are owned by nobody (-1), they will instead look as if they are
+owned by the default ID. The ignore option causes the default ID to
+override all IDs on the disk, not just -1. The forget option causes all
+IDs to be written to disk as -1, so when the media is later remounted,
+they will appear to be owned by whatever default ID it is mounted with at
+that time.
+
+For typical desktop use of removable media, you should set the ID to
+that of the interactively logged-on user, and also specify both the
+forget and ignore options. This way the interactive user will always
+see the files on the disk as belonging to him; see the example below.
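+
+A hedged example, with uid 500 standing in for the real user's ID:
+
+  mount -t udf -o uid=500,uid=ignore,uid=forget /dev/cdrom /mnt/cdrom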
+
+The remaining options are for debugging and disaster recovery:
+
+ novrs Skip volume sequence recognition
+
+The following expect an offset from 0.
+
+ session= Set the CDROM session (default= last session)
+ anchor= Override standard anchor location. (default= 256)
+ volume= Override the VolumeDesc location. (unused)
+ partition= Override the PartitionDesc location. (unused)
+ lastblock= Set the last block of the filesystem.
+
+The following expect an offset from the partition root.
+
+ fileset= Override the fileset block location. (unused)
+ rootdir= Override the root directory location. (unused)
+ WARNING: overriding the rootdir to a non-directory may
+ yield highly unpredictable results.
+-------------------------------------------------------------------------------
+
+
+For the latest version and toolset see:
+ http://linux-udf.sourceforge.net/
+
+Documentation on UDF and ECMA 167 is available FREE from:
+ http://www.osta.org/
+ http://www.ecma-international.org/
+
+Ben Fennema <bfennema@falcon.csc.calpoly.edu>
diff --git a/Documentation/filesystems/ufs.txt b/Documentation/filesystems/ufs.txt
new file mode 100644
index 00000000..7a602ade
--- /dev/null
+++ b/Documentation/filesystems/ufs.txt
@@ -0,0 +1,60 @@
+USING UFS
+=========
+
+mount -t ufs -o ufstype=type_of_ufs device dir
+
+
+UFS OPTIONS
+===========
+
+ufstype=type_of_ufs
+ UFS is a file system widely used in different operating systems.
+ The problem is the differences among implementations. Features of
+ some implementations are undocumented, so it's hard to recognize
+ the type of ufs automatically. That's why the user must specify
+ the type of ufs manually, using the ufstype mount option. Possible
+ values are:
+
+ old old format of ufs
+ default value, supported as read-only
+
+ 44bsd used in FreeBSD, NetBSD, OpenBSD
+ supported as read-write
+
+ ufs2 used in FreeBSD 5.x
+ supported as read-write
+
+ 5xbsd synonym for ufs2
+
+ sun used in SunOS (Solaris)
+ supported as read-write
+
+ sunx86 used in SunOS for Intel (Solarisx86)
+ supported as read-write
+
+ hp used in HP-UX
+ supported as read-only
+
+ nextstep
+ used in NextStep
+ supported as read-only
+
+ nextstep-cd
+ used for NextStep CDROMs (block_size == 2048)
+ supported as read-only
+
+ openstep
+ used in OpenStep
+ supported as read-only
+
+
+POSSIBLE PROBLEMS
+=================
+
+See the next section if you have any.
+
+
+BUG REPORTS
+===========
+
+You can send any ufs bug report to daniel.pirkl@email.cz or
+to dushistov@mail.ru (but do not send partition table bug reports).
diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt
new file mode 100644
index 00000000..ead764b2
--- /dev/null
+++ b/Documentation/filesystems/vfat.txt
@@ -0,0 +1,295 @@
+USING VFAT
+----------------------------------------------------------------------
+To use the vfat filesystem, use the filesystem type 'vfat', e.g.
+ mount -t vfat /dev/fd0 /mnt
+
+No special partition formatter is required. mkdosfs will work fine
+if you want to format from within Linux.
+
+VFAT MOUNT OPTIONS
+----------------------------------------------------------------------
+uid=### -- Set the owner of all files on this filesystem.
+ The default is the uid of current process.
+
+gid=### -- Set the group of all files on this filesystem.
+ The default is the gid of current process.
+
+umask=### -- The permission mask (for files and directories, see umask(1)).
+ The default is the umask of current process.
+
+dmask=### -- The permission mask for the directory.
+ The default is the umask of current process.
+
+fmask=### -- The permission mask for files.
+ The default is the umask of current process.
+
+allow_utime=### -- This option controls the permission check of mtime/atime.
+
+ 20 - If current process is in group of file's group ID,
+ you can change timestamp.
+ 2 - Other users can change timestamp.
+
+ The default is set from `dmask' option. (If the directory is
+ writable, utime(2) is also allowed. I.e. ~dmask & 022)
+
+ Normally utime(2) checks that the current process is the
+ owner of the file, or that it has the CAP_FOWNER
+ capability. But the FAT filesystem doesn't store uid/gid
+ on disk, so the normal check is too inflexible. With
+ this option you can relax it.
+
+codepage=### -- Sets the codepage number for converting to shortname
+ characters on FAT filesystem.
+ By default, FAT_DEFAULT_CODEPAGE setting is used.
+
+iocharset=<name> -- Character set to use for converting between the
+ encoding used for user-visible filenames and the 16 bit
+ Unicode characters. Long filenames are stored on disk
+ in Unicode format, but Unix for the most part doesn't
+ know how to deal with Unicode.
+ By default, FAT_DEFAULT_IOCHARSET setting is used.
+
+ There is also an option of doing UTF-8 translations
+ with the utf8 option.
+
+ NOTE: "iocharset=utf8" is not recommended. If unsure,
+ you should consider the following option instead.
+
+utf8=<bool> -- UTF-8 is the filesystem safe version of Unicode that
+ is used by the console. It can be enabled for the
+ filesystem with this option. If 'uni_xlate' gets set,
+ UTF-8 gets disabled.
+
+uni_xlate=<bool> -- Translate unhandled Unicode characters to special
+ escaped sequences. This would let you backup and
+ restore filenames that are created with any Unicode
+ characters. Until Linux supports Unicode for real,
+ this gives you an alternative. Without this option,
+ a '?' is used when no translation is possible. The
+ escape character is ':' because it is otherwise
+ illegal on the vfat filesystem. The escape sequence
+ that gets used is ':' and the four digits of hexadecimal
+ unicode.
+
+nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
+ end in '~1' or tilde followed by some number. If this
+ option is set, then if the filename is
+ "longfilename.txt" and "longfile.txt" does not
+ currently exist in the directory, 'longfile.txt' will
+ be the short alias instead of 'longfi~1.txt'.
+
+usefree -- Use the "free clusters" value stored on FSINFO. It will
+ be used to determine the number of free clusters without
+ scanning the disk. It is not used by default, because
+ recent versions of Windows don't update it correctly in
+ some cases. If you are sure the "free clusters" value on
+ FSINFO is correct, this option lets you avoid scanning
+ the disk.
+
+quiet -- Stops printing certain warning messages.
+
+check=s|r|n -- Case sensitivity checking setting.
+ s: strict, case sensitive
+ r: relaxed, case insensitive
+ n: normal, default setting, currently case insensitive
+
+nocase -- This was deprecated for vfat. Use shortname=win95 instead.
+
+shortname=lower|win95|winnt|mixed
+ -- Shortname display/create setting.
+ lower: convert to lowercase for display,
+ emulate the Windows 95 rule for create.
+ win95: emulate the Windows 95 rule for display/create.
+ winnt: emulate the Windows NT rule for display/create.
+ mixed: emulate the Windows NT rule for display,
+ emulate the Windows 95 rule for create.
+ Default setting is `mixed'.
+
+tz=UTC -- Interpret timestamps as UTC rather than local time.
+ This option disables the conversion of timestamps
+ between local time (as used by Windows on FAT) and UTC
+ (which Linux uses internally). This is particularly
+ useful when mounting devices (like digital cameras)
+ that are set to UTC in order to avoid the pitfalls of
+ local time.
+
+showexec -- If set, the execute permission bits of the file will be
+ allowed only if the extension part of the name is .EXE,
+ .COM, or .BAT. Not set by default.
+
+debug -- Can be set, but unused by the current implementation.
+
+sys_immutable -- If set, ATTR_SYS attribute on FAT is handled as
+ IMMUTABLE flag on Linux. Not set by default.
+
+flush -- If set, the filesystem will try to flush to disk earlier
+ than normal. Not set by default.
+
+rodir -- FAT has the ATTR_RO (read-only) attribute. On Windows,
+ the ATTR_RO of the directory will just be ignored,
+ and is used only by applications as a flag (e.g. it's set
+ for the customized folder).
+
+ If you want to use ATTR_RO as read-only flag even for
+ the directory, set this option.
+
+errors=panic|continue|remount-ro
+ -- specify FAT behavior on critical errors: panic, continue
+ without doing anything or remount the partition in
+ read-only mode (default behavior).
+
+<bool>: 0,1,yes,no,true,false
+
+TODO
+----------------------------------------------------------------------
+* Need to get rid of the raw scanning stuff. Instead, always use
+ a get next directory entry approach. The only thing left that uses
+ raw scanning is the directory renaming code.
+
+
+POSSIBLE PROBLEMS
+----------------------------------------------------------------------
+* vfat_valid_longname does not properly check reserved names.
+* When a volume name is the same as a directory name in the root
+ directory of the filesystem, the directory name sometimes shows
+ up as an empty file.
+* autoconv option does not work correctly.
+
+BUG REPORTS
+----------------------------------------------------------------------
+If you have trouble with the VFAT filesystem, mail bug reports to
+chaffee@bmrc.cs.berkeley.edu. Please specify the filename
+and the operation that gave you trouble.
+
+TEST SUITE
+----------------------------------------------------------------------
+If you plan to make any modifications to the vfat filesystem, please
+get the test suite that comes with the vfat distribution at
+
+ http://web.archive.org/web/*/http://bmrc.berkeley.edu/
+ people/chaffee/vfat.html
+
+This tests quite a few parts of the vfat filesystem and additional
+tests for new features or untested features would be appreciated.
+
+NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
+----------------------------------------------------------------------
+(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
+ and lightly annotated by Gordon Chaffee).
+
+This document presents a very rough, technical overview of my
+knowledge of the extended FAT file system used in Windows NT 3.5 and
+Windows 95. I don't guarantee that any of the following is correct,
+but it appears to be so.
+
+The extended FAT file system is almost identical to the FAT
+file system used in DOS versions up to and including 6.223410239847
+:-). The significant change has been the addition of long file names.
+These names support up to 255 characters including spaces and lower
+case characters as opposed to the traditional 8.3 short names.
+
+Here is the description of the traditional FAT entry in the current
+Windows 95 filesystem:
+
+ struct directory { // Short 8.3 names
+ unsigned char name[8]; // file name
+ unsigned char ext[3]; // file extension
+ unsigned char attr; // attribute byte
+ unsigned char lcase; // Case for base and extension
+ unsigned char ctime_ms; // Creation time, milliseconds
+ unsigned char ctime[2]; // Creation time
+ unsigned char cdate[2]; // Creation date
+ unsigned char adate[2]; // Last access date
+ unsigned char reserved[2]; // reserved values (ignored)
+ unsigned char time[2]; // time stamp
+ unsigned char date[2]; // date stamp
+ unsigned char start[2]; // starting cluster number
+ unsigned char size[4]; // size of the file
+ };
+
+The lcase field specifies if the base and/or the extension of an 8.3
+name should be capitalized. This field does not seem to be used by
+Windows 95 but it is used by Windows NT. The case of filenames is
+preserved from Windows 95 to Windows NT, but not completely in the
+reverse direction: filenames that fit in the 8.3 namespace and are
+written on Windows NT to be lowercase will show up as uppercase on
+Windows 95.
+
+Note that the "start" and "size" values are actually little
+endian integer values. The descriptions of the fields in this
+structure are public knowledge and can be found elsewhere.
+
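+For illustration, the multi-byte fields can be assembled like this (a
+hedged sketch; these helpers are not part of the driver):
+
+	unsigned int start_cluster(const struct directory *d)
+	{
+		return d->start[0] | (d->start[1] << 8);
+	}
+
+	unsigned int file_size(const struct directory *d)
+	{
+		return d->size[0] | (d->size[1] << 8) |
+		       (d->size[2] << 16) |
+		       ((unsigned int)d->size[3] << 24);
+	}
+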
+With the extended FAT system, Microsoft has inserted extra
+directory entries for any files with extended names. (Any name which
+legally fits within the old 8.3 encoding scheme does not have extra
+entries.) I call these extra entries slots. Basically, a slot is a
+specially formatted directory entry which holds up to 13 characters of
+a file's extended name. Think of slots as additional labeling for the
+directory entry of the file to which they correspond. Microsoft
+prefers to refer to the 8.3 entry for a file as its alias and the
+extended slot directory entries as the file name.
+
+The C structure for a slot directory entry follows:
+
+ struct slot { // Up to 13 characters of a long name
+ unsigned char id; // sequence number for slot
+ unsigned char name0_4[10]; // first 5 characters in name
+ unsigned char attr; // attribute byte
+ unsigned char reserved; // always 0
+ unsigned char alias_checksum; // checksum for 8.3 alias
+ unsigned char name5_10[12]; // 6 more characters in name
+ unsigned char start[2]; // starting cluster number
+ unsigned char name11_12[4]; // last 2 characters in name
+ };
+
+If the layout of the slots looks a little odd, it's only
+because of Microsoft's efforts to maintain compatibility with old
+software. The slots must be disguised to prevent old software from
+panicking. To this end, a number of measures are taken:
+
+ 1) The attribute byte for a slot directory entry is always set
+ to 0x0f. This corresponds to an old directory entry with
+ attributes of "hidden", "system", "read-only", and "volume
+ label". Most old software will ignore any directory
+ entries with the "volume label" bit set. Real volume label
+ entries don't have the other three bits set.
+
+ 2) The starting cluster is always set to 0, an impossible
+ value for a DOS file.
+
+Because the extended FAT system is backward compatible, it is
+possible for old software to modify directory entries. Measures must
+be taken to ensure the validity of slots. An extended FAT system can
+verify that a slot does in fact belong to an 8.3 directory entry by
+the following:
+
+ 1) Positioning. Slots for a file always immediately precede
+ their corresponding 8.3 directory entry. In addition, each
+ slot has an id which marks its order in the extended file
+ name. Here is a very abbreviated view of an 8.3 directory
+ entry and its corresponding long name slots for the file
+ "My Big File.Extension which is long":
+
+ <preceding files...>
+ <slot #3, id = 0x43, characters = "h is long">
+ <slot #2, id = 0x02, characters = "xtension whic">
+ <slot #1, id = 0x01, characters = "My Big File.E">
+ <directory entry, name = "MYBIGFIL.EXT">
+
+ Note that the slots are stored from last to first. Slots
+ are numbered from 1 to N. The Nth slot is or'ed with 0x40
+ to mark it as the last one.
+
+ 2) Checksum. Each slot has an "alias_checksum" value. The
+ checksum is calculated from the 8.3 name using the
+ following algorithm:
+
+	for (sum = i = 0; i < 11; i++) {
+		/* rotate sum right one bit, then add the next name byte */
+		sum = (((sum & 1) << 7) | ((sum & 0xfe) >> 1)) + name[i];
+	}
+
+ 3) If there is free space in the final slot, a Unicode NULL (0x0000)
+ is stored after the final character. After that, all unused
+ characters in the final slot are set to Unicode 0xFFFF.
+
+Finally, note that the extended name is stored in Unicode. Each Unicode
+character takes two bytes.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
new file mode 100644
index 00000000..88b9f551
--- /dev/null
+++ b/Documentation/filesystems/vfs.txt
@@ -0,0 +1,1097 @@
+
+ Overview of the Linux Virtual File System
+
+ Original author: Richard Gooch <rgooch@atnf.csiro.au>
+
+ Last updated on June 24, 2007.
+
+ Copyright (C) 1999 Richard Gooch
+ Copyright (C) 2005 Pekka Enberg
+
+ This file is released under the GPLv2.
+
+
+Introduction
+============
+
+The Virtual File System (also known as the Virtual Filesystem Switch)
+is the software layer in the kernel that provides the filesystem
+interface to userspace programs. It also provides an abstraction
+within the kernel which allows different filesystem implementations to
+coexist.
+
+VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
+on are called from a process context. Filesystem locking is described
+in the document Documentation/filesystems/Locking.
+
+
+Directory Entry Cache (dcache)
+------------------------------
+
+The VFS implements the open(2), stat(2), chmod(2), and similar system
+calls. The pathname argument that is passed to them is used by the VFS
+to search through the directory entry cache (also known as the dentry
+cache or dcache). This provides a very fast look-up mechanism to
+translate a pathname (filename) into a specific dentry. Dentries live
+in RAM and are never saved to disc: they exist only for performance.
+
+The dentry cache is meant to be a view into your entire filespace. As
+most computers cannot fit all dentries in the RAM at the same time,
+some bits of the cache are missing. In order to resolve your pathname
+into a dentry, the VFS may have to resort to creating dentries along
+the way, and then loading the inode. This is done by looking up the
+inode.
+
+
+The Inode Object
+----------------
+
+An individual dentry usually has a pointer to an inode. Inodes are
+filesystem objects such as regular files, directories, FIFOs and other
+beasts. They live either on the disc (for block device filesystems)
+or in the memory (for pseudo filesystems). Inodes that live on the
+disc are copied into the memory when required and changes to the inode
+are written back to disc. A single inode can be pointed to by multiple
+dentries (hard links, for example, do this).
+
+To look up an inode requires that the VFS calls the lookup() method of
+the parent directory inode. This method is installed by the specific
+filesystem implementation that the inode lives in. Once the VFS has
+the required dentry (and hence the inode), we can do all those boring
+things like open(2) the file, or stat(2) it to peek at the inode
+data. The stat(2) operation is fairly simple: once the VFS has the
+dentry, it peeks at the inode data and passes some of it back to
+userspace.
+
+
+The File Object
+---------------
+
+Opening a file requires another operation: allocation of a file
+structure (this is the kernel-side implementation of file
+descriptors). The freshly allocated file structure is initialized with
+a pointer to the dentry and a set of file operation member functions.
+These are taken from the inode data. The open() file method is then
+called so the specific filesystem implementation can do its work. You
+can see that this is another switch performed by the VFS. The file
+structure is placed into the file descriptor table for the process.
+
+Reading, writing and closing files (and other assorted VFS operations)
+is done by using the userspace file descriptor to grab the appropriate
+file structure, and then calling the required file structure method to
+do whatever is required. For as long as the file is open, it keeps the
+dentry in use, which in turn means that the VFS inode is still in use.
+
+
+Registering and Mounting a Filesystem
+=====================================
+
+To register and unregister a filesystem, use the following API
+functions:
+
+ #include <linux/fs.h>
+
+ extern int register_filesystem(struct file_system_type *);
+ extern int unregister_filesystem(struct file_system_type *);
+
+The passed struct file_system_type describes your filesystem. When a
+request is made to mount a filesystem onto a directory in your namespace,
+the VFS will call the appropriate mount() method for the specific
+filesystem. A new vfsmount referring to the tree returned by ->mount()
+will be attached to the mountpoint, so that when pathname resolution
+reaches the mountpoint it will jump into the root of that vfsmount.
+
+You can see all filesystems that are registered to the kernel in the
+file /proc/filesystems.
+
+
+struct file_system_type
+-----------------------
+
+This describes the filesystem. As of kernel 2.6.39, the following
+members are defined:
+
+struct file_system_type {
+ const char *name;
+ int fs_flags;
+ struct dentry *(*mount) (struct file_system_type *, int,
+ const char *, void *);
+ void (*kill_sb) (struct super_block *);
+ struct module *owner;
+ struct file_system_type * next;
+ struct list_head fs_supers;
+ struct lock_class_key s_lock_key;
+ struct lock_class_key s_umount_key;
+};
+
+ name: the name of the filesystem type, such as "ext2", "iso9660",
+ "msdos" and so on
+
+ fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
+
+ mount: the method to call when a new instance of this
+ filesystem should be mounted
+
+ kill_sb: the method to call when an instance of this filesystem
+ should be shut down
+
+ owner: for internal VFS use: you should initialize this to THIS_MODULE in
+ most cases.
+
+ next: for internal VFS use: you should initialize this to NULL
+
+ s_lock_key, s_umount_key: lockdep-specific
+
+The mount() method has the following arguments:
+
+ struct file_system_type *fs_type: describes the filesystem, partly initialized
+ by the specific filesystem code
+
+ int flags: mount flags
+
+ const char *dev_name: the device name we are mounting.
+
+ void *data: arbitrary mount options, usually comes as an ASCII
+ string (see "Mount Options" section)
+
+The mount() method must return the root dentry of the tree requested by
+caller. An active reference to its superblock must be grabbed and the
+superblock must be locked. On failure it should return ERR_PTR(error).
+
+The arguments match those of mount(2) and their interpretation
+depends on filesystem type. E.g. for block filesystems, dev_name is
+interpreted as block device name, that device is opened and if it
+contains a suitable filesystem image the method creates and initializes
+struct super_block accordingly, returning its root dentry to caller.
+
+->mount() may choose to return a subtree of an existing filesystem - it
+doesn't have to create a new one. The main result from the caller's
+point of view is a reference to dentry at the root of (sub)tree to
+be attached; creation of new superblock is a common side effect.
+
+The most interesting member of the superblock structure that the
+mount() method fills in is the "s_op" field. This is a pointer to
+a "struct super_operations" which describes the next level of the
+filesystem implementation.
+
+Usually, a filesystem uses one of the generic mount() implementations
+and provides a fill_super() callback instead. The generic variants are:
+
+ mount_bdev: mount a filesystem residing on a block device
+
+ mount_nodev: mount a filesystem that is not backed by a device
+
+ mount_single: mount a filesystem which shares the instance between
+ all mounts
+
+A fill_super() callback implementation has the following arguments:
+
+ struct super_block *sb: the superblock structure. The callback
+ must initialize this properly.
+
+ void *data: arbitrary mount options, usually comes as an ASCII
+ string (see "Mount Options" section)
+
+ int silent: whether or not to be silent on error
+
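+Putting these pieces together, here is a hedged sketch of a trivial
+in-memory filesystem using the 2.6.39-era helpers this document
+describes (everything named "examplefs" is made up; mount_nodev(),
+kill_anon_super() and the libfs simple_* helpers are real):
+
+#include <linux/fs.h>
+#include <linux/module.h>
+
+static const struct super_operations examplefs_s_ops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+};
+
+static int examplefs_fill_super(struct super_block *sb, void *data,
+				int silent)
+{
+	struct inode *root;
+
+	sb->s_magic = 0x20110101;	/* made-up magic number */
+	sb->s_op = &examplefs_s_ops;
+
+	root = new_inode(sb);
+	if (!root)
+		return -ENOMEM;
+	root->i_ino = 1;
+	root->i_mode = S_IFDIR | 0755;
+	root->i_op = &simple_dir_inode_operations;
+	root->i_fop = &simple_dir_operations;
+
+	sb->s_root = d_alloc_root(root);	/* root dentry of the tree */
+	if (!sb->s_root) {
+		iput(root);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+static struct dentry *examplefs_mount(struct file_system_type *fs_type,
+				      int flags, const char *dev_name,
+				      void *data)
+{
+	return mount_nodev(fs_type, flags, data, examplefs_fill_super);
+}
+
+static struct file_system_type examplefs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "examplefs",
+	.mount		= examplefs_mount,
+	.kill_sb	= kill_anon_super,
+};
+
+static int __init examplefs_init(void)
+{
+	return register_filesystem(&examplefs_type);
+}
+module_init(examplefs_init);
+
+static void __exit examplefs_exit(void)
+{
+	unregister_filesystem(&examplefs_type);
+}
+module_exit(examplefs_exit);
+MODULE_LICENSE("GPL");
+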
+
+The Superblock Object
+=====================
+
+A superblock object represents a mounted filesystem.
+
+
+struct super_operations
+-----------------------
+
+This describes how the VFS can manipulate the superblock of your
+filesystem. As of kernel 2.6.22, the following members are defined:
+
+struct super_operations {
+ struct inode *(*alloc_inode)(struct super_block *sb);
+ void (*destroy_inode)(struct inode *);
+
+ void (*dirty_inode) (struct inode *, int flags);
+ int (*write_inode) (struct inode *, int);
+ void (*drop_inode) (struct inode *);
+ void (*delete_inode) (struct inode *);
+ void (*put_super) (struct super_block *);
+ void (*write_super) (struct super_block *);
+ int (*sync_fs)(struct super_block *sb, int wait);
+ int (*freeze_fs) (struct super_block *);
+ int (*unfreeze_fs) (struct super_block *);
+ int (*statfs) (struct dentry *, struct kstatfs *);
+ int (*remount_fs) (struct super_block *, int *, char *);
+ void (*clear_inode) (struct inode *);
+ void (*umount_begin) (struct super_block *);
+
+ int (*show_options)(struct seq_file *, struct vfsmount *);
+
+ ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
+ ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
+};
+
+All methods are called without any locks being held, unless otherwise
+noted. This means that most methods can block safely. All methods are
+only called from a process context (i.e. not from an interrupt handler
+or bottom half).
+
+ alloc_inode: this method is called by the VFS to allocate memory
+ for struct inode and initialize it. If this function is not
+ defined, a simple 'struct inode' is allocated. Normally
+ alloc_inode will be used to allocate a larger structure which
+ contains a 'struct inode' embedded within it.
+
+ destroy_inode: this method is called by destroy_inode() to release
+ resources allocated for struct inode. It is only required if
+ ->alloc_inode was defined and simply undoes anything done by
+ ->alloc_inode.
+
+ dirty_inode: this method is called by the VFS to mark an inode dirty.
+
+ write_inode: this method is called when the VFS needs to write an
+ inode to disc. The second parameter indicates whether the write
+ should be synchronous or not; not all filesystems check this flag.
+
+ drop_inode: called when the last access to the inode is dropped,
+ with the inode->i_lock spinlock held.
+
+ This method should be either NULL (normal UNIX filesystem
+ semantics) or "generic_delete_inode" (for filesystems that do not
+ want to cache inodes - causing "delete_inode" to always be
+ called regardless of the value of i_nlink)
+
+ The "generic_delete_inode()" behavior is equivalent to the
+ old practice of using "force_delete" in the put_inode() case,
+ but does not have the races that the "force_delete()" approach
+ had.
+
+ delete_inode: called when the VFS wants to delete an inode
+
+ put_super: called when the VFS wishes to free the superblock
+ (i.e. unmount). This is called with the superblock lock held
+
+ write_super: called when the VFS superblock needs to be written to
+ disc. This method is optional
+
+ sync_fs: called when VFS is writing out all dirty data associated with
+ a superblock. The second parameter indicates whether the method
+ should wait until the write out has been completed. Optional.
+
+ freeze_fs: called when VFS is locking a filesystem and
+ forcing it into a consistent state. This method is currently
+ used by the Logical Volume Manager (LVM).
+
+ unfreeze_fs: called when VFS is unlocking a filesystem and making it writable
+ again.
+
+ statfs: called when the VFS needs to get filesystem statistics.
+
+ remount_fs: called when the filesystem is remounted. This is called
+ with the kernel lock held
+
+ clear_inode: called when the VFS clears the inode. Optional
+
+ umount_begin: called when the VFS is unmounting a filesystem.
+
+ show_options: called by the VFS to show mount options for
+ /proc/<pid>/mounts. (see "Mount Options" section)
+
+ quota_read: called by the VFS to read from filesystem quota file.
+
+ quota_write: called by the VFS to write to filesystem quota file.
+
+Whoever sets up the inode is responsible for filling in the "i_op" field. This
+is a pointer to a "struct inode_operations" which describes the methods that
+can be performed on individual inodes.
+
+
+The Inode Object
+================
+
+An inode object represents an object within the filesystem.
+
+
+struct inode_operations
+-----------------------
+
+This describes how the VFS can manipulate an inode in your
+filesystem. As of kernel 2.6.22, the following members are defined:
+
+struct inode_operations {
+ int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
+ struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
+ int (*link) (struct dentry *,struct inode *,struct dentry *);
+ int (*unlink) (struct inode *,struct dentry *);
+ int (*symlink) (struct inode *,struct dentry *,const char *);
+ int (*mkdir) (struct inode *,struct dentry *,int);
+ int (*rmdir) (struct inode *,struct dentry *);
+ int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*rename) (struct inode *, struct dentry *,
+ struct inode *, struct dentry *);
+ int (*readlink) (struct dentry *, char __user *,int);
+ void * (*follow_link) (struct dentry *, struct nameidata *);
+ void (*put_link) (struct dentry *, struct nameidata *, void *);
+ void (*truncate) (struct inode *);
+ int (*permission) (struct inode *, int, unsigned int);
+ int (*check_acl)(struct inode *, int, unsigned int);
+ int (*setattr) (struct dentry *, struct iattr *);
+ int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
+ int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
+ ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
+ ssize_t (*listxattr) (struct dentry *, char *, size_t);
+ int (*removexattr) (struct dentry *, const char *);
+ void (*truncate_range)(struct inode *, loff_t, loff_t);
+};
+
+Again, all methods are called without any locks being held, unless
+otherwise noted.
+
+ create: called by the open(2) and creat(2) system calls. Only
+ required if you want to support regular files. The dentry you
+ get should not have an inode (i.e. it should be a negative
+ dentry). Here you will probably call d_instantiate() with the
+ dentry and the newly created inode
+
+ lookup: called when the VFS needs to look up an inode in a parent
+ directory. The name to look for is found in the dentry. This
+ method must call d_add() to insert the found inode into the
+ dentry. The "i_count" field in the inode structure should be
+ incremented. If the named inode does not exist a NULL inode
+ should be inserted into the dentry (this is called a negative
+ dentry). Returning an error code from this routine must only
+ be done on a real error, otherwise creating inodes with system
+ calls like create(2), mknod(2), mkdir(2) and so on will fail.
+ If you wish to overload the dentry methods then you should
+ initialise the "d_op" field in the dentry; this is a pointer
+ to a struct "dentry_operations".
+ This method is called with the directory inode semaphore held.
+ (A minimal example follows this list of methods.)
+
+ link: called by the link(2) system call. Only required if you want
+ to support hard links. You will probably need to call
+ d_instantiate() just as you would in the create() method
+
+ unlink: called by the unlink(2) system call. Only required if you
+ want to support deleting inodes
+
+ symlink: called by the symlink(2) system call. Only required if you
+ want to support symlinks. You will probably need to call
+ d_instantiate() just as you would in the create() method
+
+ mkdir: called by the mkdir(2) system call. Only required if you want
+ to support creating subdirectories. You will probably need to
+ call d_instantiate() just as you would in the create() method
+
+ rmdir: called by the rmdir(2) system call. Only required if you want
+ to support deleting subdirectories
+
+ mknod: called by the mknod(2) system call to create a device (char,
+ block) inode or a named pipe (FIFO) or socket. Only required
+ if you want to support creating these types of inodes. You
+ will probably need to call d_instantiate() just as you would
+ in the create() method
+
+ rename: called by the rename(2) system call to rename the object to
+ have the parent and name given by the second inode and dentry.
+
+ readlink: called by the readlink(2) system call. Only required if
+ you want to support reading symbolic links
+
+ follow_link: called by the VFS to follow a symbolic link to the
+ inode it points to. Only required if you want to support
+ symbolic links. This method returns a void pointer cookie
+ that is passed to put_link().
+
+ put_link: called by the VFS to release resources allocated by
+ follow_link(). The cookie returned by follow_link() is passed
+ to this method as the last parameter. It is used by
+ filesystems such as NFS where page cache is not stable
+ (i.e. page that was installed when the symbolic link walk
+ started might not be in the page cache at the end of the
+ walk).
+
+ truncate: Deprecated. This will not be called if ->setsize is defined.
+ Called by the VFS to change the size of a file. The
+ i_size field of the inode is set to the desired size by the
+ VFS before this method is called. This method is called by
+ the truncate(2) system call and related functionality.
+
+ Note: ->truncate and vmtruncate are deprecated. Do not add new
+ instances/calls of these. Filesystems should be converted to do their
+ truncate sequence via ->setattr().
+
+ permission: called by the VFS to check for access rights on a POSIX-like
+ filesystem.
+
+ May be called in rcu-walk mode (flags & IPERM_FLAG_RCU). If in rcu-walk
+ mode, the filesystem must check the permission without blocking or
+ storing to the inode.
+
+ If a situation is encountered that rcu-walk cannot handle, return
+ -ECHILD and it will be called again in ref-walk mode.
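+
+ The rcu-walk pattern might look like the following sketch
+ (myfs_needs_io() is hypothetical, and the exact permission() and
+ generic_permission() prototypes have varied between kernel
+ versions):
+
+ static int myfs_permission(struct inode *inode, int mask, unsigned int flags)
+ {
+         /* hypothetical: would we have to block to decide? */
+         if ((flags & IPERM_FLAG_RCU) && myfs_needs_io(inode))
+                 return -ECHILD; /* retry in ref-walk mode */
+
+         return generic_permission(inode, mask, flags, NULL);
+ }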
+
+ setattr: called by the VFS to set attributes for a file. This method
+ is called by chmod(2) and related system calls.
+
+ getattr: called by the VFS to get attributes of a file. This method
+ is called by stat(2) and related system calls.
+
+ setxattr: called by the VFS to set an extended attribute for a file.
+ Extended attribute is a name:value pair associated with an
+ inode. This method is called by setxattr(2) system call.
+
+ getxattr: called by the VFS to retrieve the value of an extended
+ attribute name. This method is called by getxattr(2) function
+ call.
+
+ listxattr: called by the VFS to list all extended attributes for a
+ given file. This method is called by listxattr(2) system call.
+
+ removexattr: called by the VFS to remove an extended attribute from
+ a file. This method is called by removexattr(2) system call.
+
+ truncate_range: a method provided by the underlying filesystem to truncate
+ a range of blocks, i.e. punch a hole somewhere in a file.
+
+
+The Address Space Object
+========================
+
+The address space object is used to group and manage pages in the page
+cache. It can be used to keep track of the pages in a file (or
+anything else) and also track the mapping of sections of the file into
+process address spaces.
+
+There are a number of distinct yet related services that an
+address-space can provide. These include communicating memory
+pressure, page lookup by address, and keeping track of pages tagged as
+Dirty or Writeback.
+
+The first can be used independently of the others. The VM can try to
+either write dirty pages in order to clean them, or release clean
+pages in order to reuse them. To do this it can call the ->writepage
+method on dirty pages, and ->releasepage on clean pages with
+PagePrivate set. Clean pages without PagePrivate and with no external
+references will be released without notice being given to the
+address_space.
+
+To achieve this functionality, pages need to be placed on an LRU with
+lru_cache_add and mark_page_accessed needs to be called whenever the
+page is used.
+
+Pages are normally kept in a radix tree indexed by ->index. This tree
+maintains information about the PG_Dirty and PG_Writeback status of
+each page, so that pages with either of these flags can be found
+quickly.
+
+The Dirty tag is primarily used by mpage_writepages - the default
+->writepages method. It uses the tag to find dirty pages to call
+->writepage on. If mpage_writepages is not used (i.e. the address_space
+provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is
+almost unused. write_inode_now and sync_inode do use it (through
+__sync_single_inode) to check if ->writepages has been successful in
+writing out the whole address_space.
+
+The Writeback tag is used by filemap*wait* and sync_page* functions,
+via filemap_fdatawait_range, to wait for all writeback to
+complete. While waiting ->sync_page (if defined) will be called on
+each page that is found to require writeback.
+
+An address_space handler may attach extra information to a page,
+typically using the 'private' field in the 'struct page'. If such
+information is attached, the PG_Private flag should be set. This will
+cause various VM routines to make extra calls into the address_space
+handler to deal with that data.
+
+An address space acts as an intermediary between storage and
+application. Data is read into the address space a whole page at a
+time, and provided to the application either by copying the page, or
+by memory-mapping the page.
+Data is written into the address space by the application, and then
+written back to storage, typically in whole pages; the address_space,
+however, has finer control of write sizes.
+
+The read process essentially only requires 'readpage'. The write
+process is more complicated and uses write_begin/write_end or
+set_page_dirty to write data into the address_space, and writepage,
+sync_page, and writepages to writeback data to storage.
+
+Adding and removing pages to/from an address_space is protected by the
+inode's i_mutex.
+
+When data is written to a page, the PG_Dirty flag should be set. It
+typically remains set until writepage asks for it to be written. This
+should clear PG_Dirty and set PG_Writeback. It can be actually
+written at any point after PG_Dirty is clear. Once it is known to be
+safe, PG_Writeback is cleared.
+
+Writeback makes use of a writeback_control structure...
+
+struct address_space_operations
+-------------------------------
+
+This describes how the VFS can manipulate the mapping of a file to the page
+cache in your filesystem. As of kernel 2.6.22, the following members are
+defined:
+
+struct address_space_operations {
+ int (*writepage)(struct page *page, struct writeback_control *wbc);
+ int (*readpage)(struct file *, struct page *);
+ int (*sync_page)(struct page *);
+ int (*writepages)(struct address_space *, struct writeback_control *);
+ int (*set_page_dirty)(struct page *page);
+ int (*readpages)(struct file *filp, struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages);
+ int (*write_begin)(struct file *, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata);
+ int (*write_end)(struct file *, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata);
+ sector_t (*bmap)(struct address_space *, sector_t);
+ int (*invalidatepage) (struct page *, unsigned long);
+ int (*releasepage) (struct page *, int);
+ void (*freepage)(struct page *);
+ ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
+ loff_t offset, unsigned long nr_segs);
+ struct page* (*get_xip_page)(struct address_space *, sector_t,
+ int);
+ /* migrate the contents of a page to the specified target */
+ int (*migratepage) (struct page *, struct page *);
+ int (*launder_page) (struct page *);
+ int (*error_remove_page) (struct address_space *mapping, struct page *page);
+};
+
+ writepage: called by the VM to write a dirty page to backing store.
+ This may happen for data integrity reasons (i.e. 'sync'), or
+ to free up memory (flush). The difference can be seen in
+ wbc->sync_mode.
+ The PG_Dirty flag has been cleared and PageLocked is true.
+ writepage should start writeout, should set PG_Writeback,
+ and should make sure the page is unlocked, either synchronously
+ or asynchronously when the write operation completes.
+
+ If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
+ try too hard if there are problems, and may choose to write out
+ other pages from the mapping if that is easier (e.g. due to
+ internal dependencies). If it chooses not to start writeout, it
+ should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
+ calling ->writepage on that page.
+
+ See the file "Locking" for more details.
+
+ readpage: called by the VM to read a page from backing store.
+ The page will be Locked when readpage is called, and should be
+ unlocked and marked uptodate once the read completes.
+ If ->readpage discovers that it needs to unlock the page for
+ some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
+ In this case, the page will be relocated, relocked and if
+ that all succeeds, ->readpage will be called again.
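+
+ For block-device based filesystems, ->readpage is commonly a thin
+ wrapper around a generic helper, as in this sketch (myfs_get_block
+ is the filesystem's hypothetical get_block_t block-mapping
+ callback):
+
+ static int myfs_readpage(struct file *file, struct page *page)
+ {
+         /* maps file blocks to disk blocks and issues the read; the
+          * page is unlocked and marked uptodate on I/O completion */
+         return mpage_readpage(page, myfs_get_block);
+ }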
+
+ sync_page: called by the VM to notify the backing store to perform all
+ queued I/O operations for a page. I/O operations for other pages
+ associated with this address_space object may also be performed.
+
+ This function is optional and is called only for pages with
+ PG_Writeback set while waiting for the writeback to complete.
+
+ writepages: called by the VM to write out pages associated with the
+ address_space object. If wbc->sync_mode is WB_SYNC_ALL, then
+ the writeback_control will specify a range of pages that must be
+ written out. If it is WB_SYNC_NONE, then a nr_to_write is given
+ and that many pages should be written if possible.
+ If no ->writepages is given, then mpage_writepages is used
+ instead. This will choose pages from the address space that are
+ tagged as DIRTY and will pass them to ->writepage.
+
+ set_page_dirty: called by the VM to set a page dirty.
+ This is particularly needed if an address space attaches
+ private data to a page, and that data needs to be updated when
+ a page is dirtied. This is called, for example, when a memory
+ mapped page gets modified.
+ If defined, it should set the PageDirty flag, and the
+ PAGECACHE_TAG_DIRTY tag in the radix tree.
+
+ readpages: called by the VM to read pages associated with the address_space
+ object. This is essentially just a vector version of
+ readpage. Instead of just one page, several pages are
+ requested.
+ readpages is only used for read-ahead, so read errors are
+ ignored. If anything goes wrong, feel free to give up.
+
+ write_begin:
+ Called by the generic buffered write code to ask the filesystem to
+ prepare to write len bytes at the given offset in the file. The
+ address_space should check that the write will be able to complete,
+ by allocating space if necessary and doing any other internal
+ housekeeping. If the write will update parts of any basic-blocks on
+ storage, then those blocks should be pre-read (if they haven't been
+ read already) so that the updated blocks can be written out properly.
+
+ The filesystem must return the locked pagecache page for the specified
+ offset, in *pagep, for the caller to write into.
+
+ It must be able to cope with short writes (where the length passed to
+ write_begin is greater than the number of bytes copied into the page).
+
+ flags is a field for AOP_FLAG_xxx flags, described in
+ include/linux/fs.h.
+
+ A void * may be returned in fsdata, which then gets passed into
+ write_end.
+
+ Returns 0 on success; < 0 on failure (which is the error code), in
+ which case write_end is not called.
+
+ write_end: After a successful write_begin, and data copy, write_end must
+ be called. len is the original len passed to write_begin, and copied
+ is the amount that was able to be copied (copied == len is always true
+ if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
+
+ The filesystem must take care of unlocking the page, releasing its
+ refcount, and updating i_size.
+
+ Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
+ that were able to be copied into pagecache.
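+
+ A minimal sketch of the pair for a simple memory-backed filesystem
+ follows (error paths, short-copy handling and uptodate handling are
+ omitted; block-based filesystems would normally use helpers such as
+ block_write_begin() and generic_write_end() instead):
+
+ static int myfs_write_begin(struct file *file, struct address_space *mapping,
+                             loff_t pos, unsigned len, unsigned flags,
+                             struct page **pagep, void **fsdata)
+ {
+         pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+         /* find or create the locked pagecache page to copy into */
+         *pagep = grab_cache_page_write_begin(mapping, index, flags);
+         if (!*pagep)
+                 return -ENOMEM;
+         /* a real filesystem would allocate blocks here and pre-read
+          * any partially overwritten blocks that are not uptodate */
+         return 0;
+ }
+
+ static int myfs_write_end(struct file *file, struct address_space *mapping,
+                           loff_t pos, unsigned len, unsigned copied,
+                           struct page *page, void *fsdata)
+ {
+         struct inode *inode = mapping->host;
+
+         set_page_dirty(page);
+         if (pos + copied > inode->i_size)
+                 i_size_write(inode, pos + copied);
+         unlock_page(page);
+         page_cache_release(page);       /* drop write_begin's reference */
+         return copied;
+ }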
+
+ bmap: called by the VFS to map a logical block offset within object to
+ physical block number. This method is used by the FIBMAP
+ ioctl and for working with swap-files. To be able to swap to
+ a file, the file must have a stable mapping to a block
+ device. The swap system does not go through the filesystem
+ but instead uses bmap to find out where the blocks in the file
+ are and uses those addresses directly.
+
+
+ invalidatepage: If a page has PagePrivate set, then invalidatepage
+ will be called when part or all of the page is to be removed
+ from the address space. This generally corresponds to either a
+ truncation or a complete invalidation of the address space
+ (in the latter case 'offset' will always be 0).
+ Any private data associated with the page should be updated
+ to reflect this truncation. If offset is 0, then
+ the private data should be released, because the page
+ must be able to be completely discarded. This may be done by
+ calling the ->releasepage function, but in this case the
+ release MUST succeed.
+
+ releasepage: releasepage is called on PagePrivate pages to indicate
+ that the page should be freed if possible. ->releasepage
+ should remove any private data from the page and clear the
+ PagePrivate flag. If releasepage() fails for some reason, it must
+ indicate failure with a 0 return value.
+ releasepage() is used in two distinct though related cases. The
+ first is when the VM finds a clean page with no active users and
+ wants to make it a free page. If ->releasepage succeeds, the
+ page will be removed from the address_space and become free.
+
+ The second case is when a request has been made to invalidate
+ some or all pages in an address_space. This can happen
+ through the fadvise(2) system call (POSIX_FADV_DONTNEED) or by the
+ filesystem explicitly requesting it, as NFS and 9P do (when
+ they believe the cache may be out of date with storage), by
+ calling invalidate_inode_pages2().
+ If the filesystem makes such a call, and needs to be certain
+ that all pages are invalidated, then its releasepage will
+ need to ensure this. Possibly it can clear the PageUptodate
+ bit if it cannot free private data yet.
+
+ freepage: freepage is called once the page is no longer visible in
+ the page cache in order to allow the cleanup of any private
+ data. Since it may be called by the memory reclaimer, it
+ should not assume that the original address_space mapping still
+ exists, and it should not block.
+
+ direct_IO: called by the generic read/write routines to perform
+ direct_IO - that is IO requests which bypass the page cache
+ and transfer data directly between the storage and the
+ application's address space.
+
+ get_xip_page: called by the VM to translate a block number to a page.
+ The page is valid until the corresponding filesystem is unmounted.
+ Filesystems that want to use execute-in-place (XIP) need to implement
+ it. An example implementation can be found in fs/ext2/xip.c.
+
+ migratepage: This is used to compact the physical memory usage.
+ If the VM wants to relocate a page (maybe off a memory card
+ that is signalling imminent failure) it will pass a new page
+ and an old page to this function. migrate_page should
+ transfer any private data across and update any references
+ that it has to the page.
+
+ launder_page: Called before freeing a page - it writes back the dirty page. To
+ prevent redirtying the page, it is kept locked during the whole
+ operation.
+
+ error_remove_page: normally set to generic_error_remove_page if truncation
+ is ok for this address space. Used for memory failure handling.
+ Setting this implies you deal with pages going away under you,
+ unless you have them locked or reference counts increased.
+
+
+The File Object
+===============
+
+A file object represents a file opened by a process.
+
+
+struct file_operations
+----------------------
+
+This describes how the VFS can manipulate an open file. As of kernel
+2.6.22, the following members are defined:
+
+struct file_operations {
+ struct module *owner;
+ loff_t (*llseek) (struct file *, loff_t, int);
+ ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
+ ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
+ ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ int (*readdir) (struct file *, void *, filldir_t);
+ unsigned int (*poll) (struct file *, struct poll_table_struct *);
+ long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
+ long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ int (*mmap) (struct file *, struct vm_area_struct *);
+ int (*open) (struct inode *, struct file *);
+ int (*flush) (struct file *, fl_owner_t id);
+ int (*release) (struct inode *, struct file *);
+ int (*fsync) (struct file *, int datasync);
+ int (*aio_fsync) (struct kiocb *, int datasync);
+ int (*fasync) (int, struct file *, int);
+ int (*lock) (struct file *, int, struct file_lock *);
+ ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
+ ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
+ ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
+ ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
+ unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
+ int (*check_flags)(int);
+ int (*flock) (struct file *, int, struct file_lock *);
+ ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int);
+ ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int);
+};
+
+Again, all methods are called without any locks being held, unless
+otherwise noted.
+
+ llseek: called when the VFS needs to move the file position index
+
+ read: called by read(2) and related system calls
+
+ aio_read: called by io_submit(2) and other asynchronous I/O operations
+
+ write: called by write(2) and related system calls
+
+ aio_write: called by io_submit(2) and other asynchronous I/O operations
+
+ readdir: called when the VFS needs to read the directory contents
+
+ poll: called by the VFS when a process wants to check if there is
+ activity on this file and (optionally) go to sleep until there
+ is activity. Called by the select(2) and poll(2) system calls
+
+ unlocked_ioctl: called by the ioctl(2) system call.
+
+ compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
+ are used on 64 bit kernels.
+
+ mmap: called by the mmap(2) system call
+
+ open: called by the VFS when an inode should be opened. When the VFS
+ opens a file, it creates a new "struct file". It then calls the
+ open method for the newly allocated file structure. You might
+ think that the open method really belongs in
+ "struct inode_operations", and you may be right. I think it's
+ done the way it is because it makes filesystems simpler to
+ implement. The open() method is a good place to initialize the
+ "private_data" member in the file structure if you want to point
+ to a device structure
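+
+ For example, a driver's open method might look like this sketch
+ (mydev_from_inode() is a hypothetical helper, e.g. a container_of()
+ on the inode's cdev):
+
+ static int mydev_open(struct inode *inode, struct file *file)
+ {
+         struct mydev *dev = mydev_from_inode(inode);
+
+         file->private_data = dev;       /* used by read/write/ioctl */
+         return 0;
+ }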
+
+ flush: called by the close(2) system call to flush a file
+
+ release: called when the last reference to an open file is closed
+
+ fsync: called by the fsync(2) system call
+
+ fasync: called by the fcntl(2) system call when asynchronous
+ (non-blocking) mode is enabled for a file
+
+ lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
+ commands
+
+ readv: called by the readv(2) system call
+
+ writev: called by the writev(2) system call
+
+ sendfile: called by the sendfile(2) system call
+
+ get_unmapped_area: called by the mmap(2) system call
+
+ check_flags: called by the fcntl(2) system call for F_SETFL command
+
+ flock: called by the flock(2) system call
+
+ splice_write: called by the VFS to splice data from a pipe to a file. This
+ method is used by the splice(2) system call
+
+ splice_read: called by the VFS to splice data from file to a pipe. This
+ method is used by the splice(2) system call
+
+Note that the file operations are implemented by the specific
+filesystem in which the inode resides. When opening a device node
+(character or block special) most filesystems will call special
+support routines in the VFS which will locate the required device
+driver information. These support routines replace the filesystem file
+operations with those for the device driver, and then proceed to call
+the new open() method for the file. This is how opening a device file
+in the filesystem eventually ends up calling the device driver open()
+method.
+
+
+Directory Entry Cache (dcache)
+==============================
+
+
+struct dentry_operations
+------------------------
+
+This describes how a filesystem can overload the standard dentry
+operations. Dentries and the dcache are the domain of the VFS and the
+individual filesystem implementations. Device drivers have no business
+here. These methods may be set to NULL, as they are either optional or
+the VFS uses a default. As of kernel 2.6.22, the following members are
+defined:
+
+struct dentry_operations {
+ int (*d_revalidate)(struct dentry *, struct nameidata *);
+ int (*d_hash)(const struct dentry *, const struct inode *,
+ struct qstr *);
+ int (*d_compare)(const struct dentry *, const struct inode *,
+ const struct dentry *, const struct inode *,
+ unsigned int, const char *, const struct qstr *);
+ int (*d_delete)(const struct dentry *);
+ void (*d_release)(struct dentry *);
+ void (*d_iput)(struct dentry *, struct inode *);
+ char *(*d_dname)(struct dentry *, char *, int);
+ struct vfsmount *(*d_automount)(struct path *);
+ int (*d_manage)(struct dentry *, bool);
+};
+
+ d_revalidate: called when the VFS needs to revalidate a dentry. This
+ is called whenever a name look-up finds a dentry in the
+ dcache. Most filesystems leave this as NULL, because all their
+ dentries in the dcache are valid
+
+ d_revalidate may be called in rcu-walk mode (nd->flags & LOOKUP_RCU).
+ If in rcu-walk mode, the filesystem must revalidate the dentry without
+ blocking or storing to the dentry. d_parent and d_inode should not be
+ used without care (because they can go NULL); instead, nd->inode should
+ be used.
+
+ If a situation is encountered that rcu-walk cannot handle, return
+ -ECHILD and it will be called again in ref-walk mode.
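+
+ A sketch of this pattern (myfs_name_still_valid() is a hypothetical
+ check against the backing store):
+
+ static int myfs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+ {
+         if (nd->flags & LOOKUP_RCU)
+                 return -ECHILD;         /* may block: retry in ref-walk */
+
+         if (!myfs_name_still_valid(dentry))
+                 return 0;               /* invalid: VFS drops the dentry */
+         return 1;                       /* still valid */
+ }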
+
+ d_hash: called when the VFS adds a dentry to the hash table. The first
+ dentry passed to d_hash is the parent directory that the name is
+ to be hashed into. The inode is the dentry's inode.
+
+ Same locking and synchronisation rules as d_compare regarding
+ what is safe to dereference etc.
+
+ d_compare: called to compare a dentry name with a given name. The first
+ dentry is the parent of the dentry to be compared, the second is
+ the parent's inode, then the dentry and inode (may be NULL) of the
+ child dentry. len and name string are properties of the dentry to be
+ compared. qstr is the name to compare it with.
+
+ Must be constant and idempotent, should not take locks if
+ possible, and should not store into the dentry or inodes.
+ Should not dereference pointers outside the dentry or inodes without
+ lots of care (eg. d_parent, d_inode, d_name should not be used).
+
+ However, our vfsmount is pinned and RCU is held, so the dentries and
+ inodes won't disappear, neither will our sb or filesystem module.
+ ->i_sb and ->d_sb may be used.
+
+ It is a tricky calling convention because it needs to be called under
+ "rcu-walk", ie. without any locks or references on things.
+
+ d_delete: called when the last reference to a dentry is dropped and the
+ dcache is deciding whether or not to cache it. Return 1 to delete
+ immediately, or 0 to cache the dentry. Default is NULL which means to
+ always cache a reachable dentry. d_delete must be constant and
+ idempotent.
+
+ d_release: called when a dentry is really deallocated
+
+ d_iput: called when a dentry loses its inode (just prior to its
+ being deallocated). The default when this is NULL is that the
+ VFS calls iput(). If you define this method, you must call
+ iput() yourself
+
+ d_dname: called when the pathname of a dentry should be generated.
+ Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay
+ pathname generation. (Instead of doing it when the dentry is created,
+ it's done only when the path is needed.) Real filesystems probably
+ don't want to use it, because their dentries are present in the global
+ dcache hash, so their hash should be an invariant. As no lock is
+ held, d_dname() should not try to modify the dentry itself, unless
+ appropriate SMP safety is used. CAUTION: d_path() logic is quite
+ tricky. The correct way to return, for example, "Hello" is to put it
+ at the end of the buffer and return a pointer to the first char.
+ The dynamic_dname() helper function is provided to take care of this.
+
+ d_automount: called when an automount dentry is to be traversed (optional).
+ This should create a new VFS mount record and return the record to the
+ caller. The caller is supplied with a path parameter giving the
+ automount directory to describe the automount target and the parent
+ VFS mount record to provide inheritable mount parameters. NULL should
+ be returned if someone else managed to make the automount first. If
+ the vfsmount creation failed, then an error code should be returned.
+ If -EISDIR is returned, then the directory will be treated as an
+ ordinary directory and returned to pathwalk to continue walking.
+
+ If a vfsmount is returned, the caller will attempt to mount it on the
+ mountpoint and will remove the vfsmount from its expiration list in
+ the case of failure. The vfsmount should be returned with 2 refs on
+ it to prevent automatic expiration - the caller will clean up the
+ additional ref.
+
+ This function is only used if DCACHE_NEED_AUTOMOUNT is set on the
+ dentry. This is set by __d_instantiate() if S_AUTOMOUNT is set on the
+ inode being added.
+
+ d_manage: called to allow the filesystem to manage the transition from a
+ dentry (optional). This allows autofs, for example, to hold up clients
+ waiting to explore behind a 'mountpoint' whilst letting the daemon go
+ past and construct the subtree there. 0 should be returned to let the
+ calling process continue. -EISDIR can be returned to tell pathwalk to
+ use this directory as an ordinary directory and to ignore anything
+ mounted on it and not to check the automount flag. Any other error
+ code will abort pathwalk completely.
+
+ If the 'rcu_walk' parameter is true, then the caller is doing a
+ pathwalk in RCU-walk mode. Sleeping is not permitted in this mode,
+ and the caller can be asked to leave it and call again by returning
+ -ECHILD.
+
+ This function is only used if DCACHE_MANAGE_TRANSIT is set on the
+ dentry being transited from.
+
+Example :
+
+static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+ return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
+ dentry->d_inode->i_ino);
+}
+
+Each dentry has a pointer to its parent dentry, as well as a hash list
+of child dentries. Child dentries are basically like files in a
+directory.
+
+
+Directory Entry Cache API
+--------------------------
+
+There are a number of functions defined which permit a filesystem to
+manipulate dentries:
+
+ dget: open a new handle for an existing dentry (this just increments
+ the usage count)
+
+ dput: close a handle for a dentry (decrements the usage count). If
+ the usage count drops to 0, and the dentry is still in its
+ parent's hash, the "d_delete" method is called to check whether
+ it should be cached. If it should not be cached, or if the dentry
+ is not hashed, it is deleted. Otherwise cached dentries are put
+ into an LRU list to be reclaimed on memory shortage.
+
+ d_drop: this unhashes a dentry from its parent's hash list. A
+ subsequent call to dput() will deallocate the dentry if its
+ usage count drops to 0
+
+ d_delete: delete a dentry. If there are no other open references to
+ the dentry then the dentry is turned into a negative dentry
+ (the d_iput() method is called). If there are other
+ references, then d_drop() is called instead
+
+ d_add: add a dentry to its parent's hash list and then call
+ d_instantiate()
+
+ d_instantiate: add a dentry to the alias hash list for the inode and
+ updates the "d_inode" member. The "i_count" member in the
+ inode structure should be set/incremented. If the inode
+ pointer is NULL, the dentry is called a "negative
+ dentry". This function is commonly called when an inode is
+ created for an existing negative dentry
+
+ d_lookup: look up a dentry given its parent and path name component.
+ It looks up the child of that given name from the dcache
+ hash table. If it is found, the reference count is incremented
+ and the dentry is returned. The caller must use dput()
+ to free the dentry when it finishes using it.
+
+For further information on dentry locking, please refer to the document
+Documentation/filesystems/dentry-locking.txt.
+
+Mount Options
+=============
+
+Parsing options
+---------------
+
+On mount and remount the filesystem is passed a string containing a
+comma separated list of mount options. The options can have either of
+these forms:
+
+ option
+ option=value
+
+The <linux/parser.h> header defines an API that helps parse these
+options. There are plenty of examples on how to use it in existing
+filesystems.
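+
+A minimal sketch of such a parser (the option names and struct
+myfs_mount_opts are hypothetical; match_table_t, match_token() and
+match_octal() come from <linux/parser.h>):
+
+enum { Opt_mode, Opt_ro, Opt_err };
+
+static const match_table_t tokens = {
+        { Opt_mode, "mode=%o" },
+        { Opt_ro,   "ro" },
+        { Opt_err,  NULL }
+};
+
+static int myfs_parse_options(char *options, struct myfs_mount_opts *opts)
+{
+        substring_t args[MAX_OPT_ARGS];
+        char *p;
+        int option;
+
+        while ((p = strsep(&options, ",")) != NULL) {
+                if (!*p)
+                        continue;       /* skip empty options */
+                switch (match_token(p, tokens, args)) {
+                case Opt_mode:
+                        if (match_octal(&args[0], &option))
+                                return -EINVAL;
+                        opts->mode = option;
+                        break;
+                case Opt_ro:
+                        opts->readonly = 1;
+                        break;
+                default:
+                        return -EINVAL; /* unknown option */
+                }
+        }
+        return 0;
+}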
+
+Showing options
+---------------
+
+If a filesystem accepts mount options, it must define show_options()
+to show all the currently active options. The rules are:
+
+ - options which are not default, or whose values differ from the
+ default, MUST be shown
+
+ - options MAY be shown which are enabled by default or have their
+ default value
+
+Options used only internally between a mount helper and the kernel
+(such as file descriptors), or which only have an effect during the
+mounting (such as ones controlling the creation of a journal) are exempt
+from the above rules.
+
+The underlying reason for the above rules is to make sure that a
+mount can be accurately replicated (e.g. umounting and mounting again)
+based on the information found in /proc/mounts.
+
+A simple method of saving options at mount/remount time and showing
+them is provided with the save_mount_options() and
+generic_show_options() helper functions. Please note that using
+these may have drawbacks. For more info see header comments for these
+functions in fs/namespace.c.
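+
+A hand-rolled show_options() following the rules above might look
+like this sketch (struct myfs_sb_info and MYFS_DEFAULT_MODE are
+hypothetical, and the method's prototype has changed across kernel
+versions):
+
+static int myfs_show_options(struct seq_file *seq, struct vfsmount *mnt)
+{
+        struct myfs_sb_info *sbi = mnt->mnt_sb->s_fs_info;
+
+        /* a non-default option MUST be shown */
+        if (sbi->mode != MYFS_DEFAULT_MODE)
+                seq_printf(seq, ",mode=%o", sbi->mode);
+        return 0;
+}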
+
+Resources
+=========
+
+(Note some of these resources are not up-to-date with the latest kernel
+ version.)
+
+Creating Linux virtual filesystems. 2002
+ <http://lwn.net/Articles/13325/>
+
+The Linux Virtual File-system Layer by Neil Brown. 1999
+ <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
+
+A tour of the Linux VFS by Michael K. Johnson. 1996
+ <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
+
+A small trail through the Linux kernel by Andries Brouwer. 2001
+ <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs-delayed-logging-design.txt
new file mode 100644
index 00000000..2ce36439
--- /dev/null
+++ b/Documentation/filesystems/xfs-delayed-logging-design.txt
@@ -0,0 +1,793 @@
+XFS Delayed Logging Design
+--------------------------
+
+Introduction to Re-logging in XFS
+---------------------------------
+
+XFS logging is a combination of logical and physical logging. Some objects,
+such as inodes and dquots, are logged in logical format where the details
+logged are made up of the changes to in-core structures rather than on-disk
+structures. Other objects - typically buffers - have their physical changes
+logged. The reason for these differences is to reduce the amount of log space
+required for objects that are frequently logged. Some parts of inodes are more
+frequently logged than others, and inodes are typically more frequently logged
+than any other object (except maybe the superblock buffer) so keeping the
+amount of metadata logged low is of prime importance.
+
+The reason that this is such a concern is that XFS allows multiple separate
+modifications to a single object to be carried in the log at any given time.
+This allows the log to avoid needing to flush each change to disk before
+recording a new change to the object. XFS does this via a method called
+"re-logging". Conceptually, this is quite simple - all it requires is that any
+new change to the object is recorded with a *new copy* of all the existing
+changes in the new transaction that is written to the log.
+
+That is, if we have a sequence of changes A through to F, and the object was
+written to disk after change D, we would see in the log the following series
+of transactions, their contents and the log sequence number (LSN) of the
+transaction:
+
+ Transaction        Contents       LSN
+     A                 A            X
+     B                A+B          X+n
+     C               A+B+C        X+n+m
+     D              A+B+C+D      X+n+m+o
+      <object written to disk>
+     E                 E            Y (> X+n+m+o)
+     F                E+F          Y+p
+
+In other words, each time an object is relogged, the new transaction contains
+the aggregation of all the previous changes currently held only in the log.
+
+This relogging technique also allows objects to be moved forward in the log so
+that an object being relogged does not prevent the tail of the log from ever
+moving forward. This can be seen in the table above by the changing
+(increasing) LSN of each subsequent transaction - the LSN is effectively a
+direct encoding of the location in the log of the transaction.
+
+This relogging is also used to implement long-running, multiple-commit
+transactions. These transactions are known as rolling transactions, and require
+a special log reservation known as a permanent transaction reservation. A
+typical example of a rolling transaction is the removal of extents from an
+inode which can only be done at a rate of two extents per transaction because
+of reservation size limitations. Hence a rolling extent removal transaction
+keeps relogging the inode and btree buffers as they get modified in each
+removal operation. This keeps them moving forward in the log as the operation
+progresses, ensuring that the current operation never gets blocked by itself if the
+log wraps around.
+
+Hence it can be seen that the relogging operation is fundamental to the correct
+working of the XFS journalling subsystem. From the above description, most
+people should be able to see why XFS metadata operations write so much to
+the log - repeated operations to the same objects write the same changes to
+the log over and over again. Worse is the fact that objects tend to get
+dirtier as they get relogged, so each subsequent transaction is writing more
+metadata into the log.
+
+Another feature of the XFS transaction subsystem is that most transactions are
+asynchronous. That is, they don't commit to disk until either a log buffer is
+filled (a log buffer can hold multiple transactions) or a synchronous operation
+forces the log buffers holding the transactions to disk. This means that XFS is
+doing aggregation of transactions in memory - batching them, if you like - to
+minimise the impact of the log IO on transaction throughput.
+
+The limitation on asynchronous transaction throughput is the number and size of
+log buffers made available by the log manager. By default there are 8 log
+buffers available and the size of each is 32kB - the size can be increased up
+to 256kB by use of a mount option.
+
+Effectively, this gives us the maximum bound of outstanding metadata changes
+that can be made to the filesystem at any point in time - if all the log
+buffers are full and under IO, then no more transactions can be committed until
+the current batch completes. It is now common for a single CPU core to
+be able to issue enough transactions to keep the log buffers full and under
+IO permanently. Hence the XFS journalling subsystem can be considered to be IO
+bound.
+
+Delayed Logging: Concepts
+-------------------------
+
+The key thing to note about the asynchronous logging combined with the
+relogging technique XFS uses is that we can be relogging changed objects
+multiple times before they are committed to disk in the log buffers. If we
+return to the previous relogging example, it is entirely possible that
+transactions A through D are committed to disk in the same log buffer.
+
+That is, a single log buffer may contain multiple copies of the same object,
+but only one of those copies needs to be there - the last one "D", as it
+contains all of the previous changes. In other words, we have one
+necessary copy in the log buffer, and three stale copies that are simply
+wasting space. When we are doing repeated operations on the same set of
+objects, these "stale objects" can be over 90% of the space used in the log
+buffers. It is clear that reducing the number of stale objects written to the
+log would greatly reduce the amount of metadata we write to the log, and this
+is the fundamental goal of delayed logging.
+
+From a conceptual point of view, XFS is already doing relogging in memory (where
+memory == log buffer), only it is doing it extremely inefficiently. It is using
+logical to physical formatting to do the relogging because there is no
+infrastructure to keep track of logical changes in memory prior to physically
+formatting the changes in a transaction to the log buffer. Hence we cannot avoid
+accumulating stale objects in the log buffers.
+
+Delayed logging is the name we've given to keeping and tracking transactional
+changes to objects in memory outside the log buffer infrastructure. Because of
+the relogging concept fundamental to the XFS journalling subsystem, this is
+actually relatively easy to do - all the changes to logged items are already
+tracked in the current infrastructure. The big problem is how to accumulate
+them and get them to the log in a consistent, recoverable manner.
+Describing the problems and how they have been solved is the focus of this
+document.
+
+One of the key changes that delayed logging makes to the operation of the
+journalling subsystem is that it disassociates the amount of outstanding
+metadata changes from the size and number of log buffers available. In other
+words, instead of there only being a maximum of 2MB of transaction changes not
+written to the log at any point in time, there may be a much greater amount
+being accumulated in memory. Hence the potential for loss of metadata on a
+crash is much greater than for the existing logging mechanism.
+
+It should be noted that this does not change the guarantee that log recovery
+will result in a consistent filesystem. What it does mean is that as far as the
+recovered filesystem is concerned, there may be many thousands of transactions
+that simply did not occur as a result of the crash. This makes it even more
+important that applications that care about their data use fsync() where they
+need to ensure application level data integrity is maintained.
+
+It should be noted that delayed logging is not an innovative new concept that
+warrants rigorous proofs to determine whether it is correct or not. The method
+of accumulating changes in memory for some period before writing them to the
+log is used effectively in many filesystems including ext3 and ext4. Hence
+no time is spent in this document trying to convince the reader that the
+concept is sound. Instead it is simply considered a "solved problem" and as
+such implementing it in XFS is purely an exercise in software engineering.
+
+The fundamental requirements for delayed logging in XFS are simple:
+
+ 1. Reduce the amount of metadata written to the log by at least
+ an order of magnitude.
+ 2. Supply sufficient statistics to validate Requirement #1.
+ 3. Supply sufficient new tracing infrastructure to be able to debug
+ problems with the new code.
+ 4. No on-disk format change (metadata or log format).
+ 5. Enable and disable with a mount option.
+ 6. No performance regressions for synchronous transaction workloads.
+
+Delayed Logging: Design
+-----------------------
+
+Storing Changes
+
+The problem with accumulating changes at a logical level (i.e. just using the
+existing log item dirty region tracking) is that when it comes to writing the
+changes to the log buffers, we need to ensure that the object we are formatting
+is not changing while we do this. This requires locking the object to prevent
+concurrent modification. Hence flushing the logical changes to the log would
+require us to lock every object, format them, and then unlock them again.
+
+This introduces lots of scope for deadlocks with transactions that are already
+running. For example, a transaction has object A locked and modified, but needs
+the delayed logging tracking lock to commit the transaction. However, the
+flushing thread has the delayed logging tracking lock already held, and is
+trying to get the lock on object A to flush it to the log buffer. This appears
+to be an unsolvable deadlock condition, and it was solving this problem that
+was the barrier to implementing delayed logging for so long.
+
+The solution is relatively simple - it just took a long time to recognise it.
+Put simply, the current logging code formats the changes to each item into a
+vector array that points to the changed regions in the item. The log write code
+simply copies the memory these vectors point to into the log buffer during
+transaction commit while the item is locked in the transaction. Instead of
+using the log buffer as the destination of the formatting code, we can use an
+allocated memory buffer big enough to fit the formatted vector.
+
+If we then copy the vector into the memory buffer and rewrite the vector to
+point to the memory buffer rather than the object itself, we now have a copy of
+the changes in a format that is compatible with the log buffer writing code
+and that does not require us to lock the item for access. This formatting and
+rewriting can all be done while the object is locked during transaction commit,
+resulting in a vector that is transactionally consistent and can be accessed
+without needing to lock the owning item.
+
+Hence we avoid the need to lock items when we need to flush outstanding
+asynchronous transactions to the log. The differences between the existing
+formatting method and the delayed logging formatting can be seen in the
+diagram below.
+
+Current format log vector:
+
+Object +---------------------------------------------+
+Vector 1 +----+
+Vector 2 +----+
+Vector 3 +----------+
+
+After formatting:
+
+Log Buffer +-V1-+-V2-+----V3----+
+
+Delayed logging vector:
+
+Object +---------------------------------------------+
+Vector 1 +----+
+Vector 2 +----+
+Vector 3 +----------+
+
+After formatting:
+
+Memory Buffer +-V1-+-V2-+----V3----+
+Vector 1 +----+
+Vector 2 +----+
+Vector 3 +----------+
+
+The memory buffer and associated vector need to be passed as a single object,
+but still need to be associated with the parent object so if the object is
+relogged we can replace the current memory buffer with a new memory buffer that
+contains the latest changes.
+
+The reason for keeping the vector around after we've formatted the memory
+buffer is to support splitting vectors across log buffer boundaries correctly.
+If we don't keep the vector around, we do not know where the region boundaries
+are in the item, so we'd need a new encapsulation method for regions in the log
+buffer writing (i.e. double encapsulation). This would be an on-disk format
+change and as such is not desirable. It also means we'd have to write the log
+region headers in the formatting stage, which is problematic as there is per
+region state that needs to be placed into the headers during the log write.
+
+Hence we need to keep the vector, but by attaching the memory buffer to it and
+rewriting the vector addresses to point at the memory buffer we end up with a
+self-describing object that can be passed to the log buffer write code to be
+handled in exactly the same manner as the existing log vectors are handled.
+Hence we avoid needing a new on-disk format to handle items that have been
+relogged in memory.
+
+
+Tracking Changes
+
+Now that we can record transactional changes in memory in a form that allows
+them to be used without limitations, we need to be able to track and accumulate
+them so that they can be written to the log at some later point in time. The
+log item is the natural place to store this vector and buffer; it also makes
+sense for the log item to be the object used to track committed objects, as it
+will always exist once the object has been included in a transaction.
+
+The log item is already used to track the log items that have been written to
+the log but not yet written to disk. Such log items are considered "active"
+and as such are stored in the Active Item List (AIL) which is a LSN-ordered
+double linked list. Items are inserted into this list during log buffer IO
+completion, after which they are unpinned and can be written to disk. An object
+that is in the AIL can be relogged, which causes the object to be pinned again
+and then moved forward in the AIL when the log buffer IO completes for that
+transaction.
+
+Essentially, this shows that an item that is in the AIL can still be modified
+and relogged, so any tracking must be separate to the AIL infrastructure. As
+such, we cannot reuse the AIL list pointers for tracking committed items, nor
+can we store state in any field that is protected by the AIL lock. Hence the
+committed item tracking needs its own locks, lists and state fields in the log
+item.
+
+Similar to the AIL, tracking of committed items is done through a new list
+called the Committed Item List (CIL). The list tracks log items that have been
+committed and have formatted memory buffers attached to them. It tracks objects
+in transaction commit order, so when an object is relogged it is removed from
+its place in the list and re-inserted at the tail. This is entirely arbitrary
+and done to make it easy for debugging - the last items in the list are the
+ones that are most recently modified. Ordering of the CIL is not necessary for
+transactional integrity (as discussed in the next section) so the ordering is
+done for convenience/sanity of the developers.
+
+
+Delayed Logging: Checkpoints
+
+When we have a log synchronisation event, commonly known as a "log force",
+all the items in the CIL must be written into the log via the log buffers.
+We need to write these items in the order that they exist in the CIL, and they
+need to be written as an atomic transaction. The need for all the objects to be
+written as an atomic transaction comes from the requirements of relogging and
+log replay - all the changes in all the objects in a given transaction must
+either be completely replayed during log recovery, or not replayed at all. If
+a transaction is not replayed because it is not complete in the log, then
+no later transactions should be replayed, either.
+
+To fulfill this requirement, we need to write the entire CIL in a single log
+transaction. Fortunately, the XFS log code has no fixed limit on the size of a
+transaction, nor does the log replay code. The only fundamental limit is that
+the transaction cannot be larger than just under half the size of the log. The
+reason for this limit is that to find the head and tail of the log, there must
+be at least one complete transaction in the log at any given time. If a
+transaction is larger than half the log, then there is the possibility that a
+crash during the write of such a transaction could partially overwrite the
+only complete previous transaction in the log. This will result in a recovery
+failure and an inconsistent filesystem and hence we must enforce the maximum
+size of a checkpoint to be slightly less than half the log.
+
+Apart from this size requirement, a checkpoint transaction looks no different
+to any other transaction - it contains a transaction header, a series of
+formatted log items and a commit record at the tail. From a recovery
+perspective, the checkpoint transaction is also no different - just a lot
+bigger with a lot more items in it. The worst case effect of this is that we
+might need to tune the recovery transaction object hash size.
+
+Because the checkpoint is just another transaction and all the changes to log
+items are stored as log vectors, we can use the existing log buffer writing
+code to write the changes into the log. To do this efficiently, we need to
+minimise the time we hold the CIL locked while writing the checkpoint
+transaction. The current log write code enables us to do this easily with the
+way it separates the writing of the transaction contents (the log vectors) from
+the transaction commit record, but tracking this requires us to have a
+per-checkpoint context that travels through the log write process through to
+checkpoint completion.
+
+Hence a checkpoint has a context that tracks the state of the current
+checkpoint from initiation to checkpoint completion. A new context is initiated
+at the same time a checkpoint transaction is started. That is, when we remove
+all the current items from the CIL during a checkpoint operation, we move all
+those changes into the current checkpoint context. We then initialise a new
+context and attach that to the CIL for aggregation of new transactions.
+
+This allows us to unlock the CIL immediately after transfer of all the
+committed items and effectively allow new transactions to be issued while we
+are formatting the checkpoint into the log. It also allows concurrent
+checkpoints to be written into the log buffers in the case of log force heavy
+workloads, just like the existing transaction commit code does. This, however,
+requires that we strictly order the commit records in the log so that
+checkpoint sequence order is maintained during log replay.
+
+To ensure that we can be writing an item into a checkpoint transaction at
+the same time another transaction modifies the item and inserts the log item
+into the new CIL, then the checkpoint transaction commit code cannot use log items
+to store the list of log vectors that need to be written into the transaction.
+Hence log vectors need to be able to be chained together to allow them to be
+detached from the log items. That is, when the CIL is flushed the memory
+buffer and log vector attached to each log item needs to be attached to the
+checkpoint context so that the log item can be released. In diagrammatic form,
+the CIL would look like this before the flush:
+
+ CIL Head
+ |
+ V
+ Log Item <-> log vector 1 -> memory buffer
+ | -> vector array
+ V
+ Log Item <-> log vector 2 -> memory buffer
+ | -> vector array
+ V
+ ......
+ |
+ V
+ Log Item <-> log vector N-1 -> memory buffer
+ | -> vector array
+ V
+ Log Item <-> log vector N -> memory buffer
+ -> vector array
+
+And after the flush the CIL head is empty, and the checkpoint context log
+vector list would look like:
+
+ Checkpoint Context
+ |
+ V
+ log vector 1 -> memory buffer
+ | -> vector array
+ | -> Log Item
+ V
+ log vector 2 -> memory buffer
+ | -> vector array
+ | -> Log Item
+ V
+ ......
+ |
+ V
+ log vector N-1 -> memory buffer
+ | -> vector array
+ | -> Log Item
+ V
+ log vector N -> memory buffer
+ -> vector array
+ -> Log Item
+
+Once this transfer is done, the CIL can be unlocked and new transactions can
+start, while the checkpoint flush code works over the log vector chain to
+commit the checkpoint.
+
+Once the checkpoint is written into the log buffers, the checkpoint context is
+attached to the log buffer that the commit record was written to along with a
+completion callback. Log IO completion will call that callback, which can then
+run transaction committed processing for the log items (i.e. insert into AIL
+and unpin) in the log vector chain and then free the log vector chain and
+checkpoint context.
+
+Discussion Point: I am uncertain as to whether the log item is the most
+efficient way to track vectors, even though it seems like the natural way to do
+it. The fact that we walk the log items (in the CIL) just to chain the log
+vectors and break the link between the log item and the log vector means that
+we take a cache line hit for the log item list modification, then another for
+the log vector chaining. If we track by the log vectors, then we only need to
+break the link between the log item and the log vector, which means we should
+dirty only the log item cachelines. Normally I wouldn't be concerned about one
+vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
+vectors in one checkpoint transaction. I'd guess this is a "measure and
+compare" situation that can be done after a working and reviewed implementation
+is in the dev tree....
+
+Delayed Logging: Checkpoint Sequencing
+
+One of the key aspects of the XFS transaction subsystem is that it tags
+committed transactions with the log sequence number of the transaction commit.
+This allows transactions to be issued asynchronously even though there may be
+future operations that cannot be completed until that transaction is fully
+committed to the log. In the rare case that a dependent operation occurs (e.g.
+re-using a freed metadata extent for a data extent), a special, optimised log
+force can be issued to force the dependent transaction to disk immediately.
+
+To do this, transactions need to record the LSN of the commit record of the
+transaction. This LSN comes directly from the log buffer the transaction is
+written into. While this works just fine for the existing transaction
+mechanism, it does not work for delayed logging because transactions are not
+written directly into the log buffers. Hence some other method of sequencing
+transactions is required.
+
+As discussed in the checkpoint section, delayed logging uses per-checkpoint
+contexts, and as such it is simple to assign a sequence number to each
+checkpoint. Because the switching of checkpoint contexts must be done
+atomically, it is simple to ensure that each new context has a monotonically
+increasing sequence number assigned to it without the need for an external
+atomic counter - we can just take the current context sequence number and add
+one to it for the new context.
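+
+Schematically (illustrative field names, not the actual XFS code):
+
+        /* switch to a new context; done under the CIL lock, so the
+         * increment is atomic w.r.t. other context switches */
+        spin_lock(&cil->lock);
+        new_ctx->sequence = cil->ctx->sequence + 1;
+        cil->ctx = new_ctx;
+        spin_unlock(&cil->lock);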
+
+Then, instead of assigning a log buffer LSN to the transaction commit LSN
+during the commit, we can assign the current checkpoint sequence. This allows
+operations that track transactions that have not yet completed to know what
+checkpoint sequence needs to be committed before they can continue. As a
+result, the code that forces the log to a specific LSN now needs to ensure that
+the log forces to a specific checkpoint.
+
+To ensure that we can do this, we need to track all the checkpoint contexts
+that are currently committing to the log. When we flush a checkpoint, the
+context gets added to a "committing" list which can be searched. When a
+checkpoint commit completes, it is removed from the committing list. Because
+the checkpoint context records the LSN of the commit record for the checkpoint,
+we can also wait on the log buffer that contains the commit record, thereby
+using the existing log force mechanisms to execute synchronous forces.
+
+It should be noted that the synchronous forces may need to be extended with
+mitigation algorithms similar to the current log buffer code to allow
+aggregation of multiple synchronous transactions if there are already
+synchronous transactions being flushed. Investigation of the performance of the
+current design is needed before making any decisions here.
+
+The main concern with log forces is to ensure that all the previous checkpoints
+are also committed to disk before the one we need to wait for. Therefore we
+need to check that all the prior contexts in the committing list are also
+complete before waiting on the one we need to complete. We do this
+synchronisation in the log force code so that we don't need to wait anywhere
+else for such serialisation - it only matters when we do a log force.
+
+The only remaining complexity is that a log force now also has to handle the
+case where the forcing sequence number is the same as the current context. That
+is, we need to flush the CIL and potentially wait for it to complete. This is a
+simple addition to the existing log forcing code to check the sequence numbers
+and push if required. Indeed, placing the current sequence checkpoint flush in
+the log force code enables the current mechanism for issuing synchronous
+transactions to remain untouched (i.e. commit an asynchronous transaction, then
+force the log at the LSN of that transaction) and so the higher level code
+behaves the same regardless of whether delayed logging is being used or not.
+
+Delayed Logging: Checkpoint Log Space Accounting
+
+The big issue for a checkpoint transaction is the log space reservation for the
+transaction. We don't know how big a checkpoint transaction is going to be
+ahead of time, nor how many log buffers it will take to write out, nor the
+number of split log vector regions that are going to be used. We can track the
+amount of log space required as we add items to the commit item list, but we
+still need to reserve the space in the log for the checkpoint.
+
+A typical transaction reserves enough space in the log for the worst case space
+usage of the transaction. The reservation accounts for log record headers,
+transaction and region headers, headers for split regions, buffer tail padding,
+etc. as well as the actual space for all the changed metadata in the
+transaction. While some of this is fixed overhead, much of it is dependent on
+the size of the transaction and the number of regions being logged (the number
+of log vectors in the transaction).
+
+An example of the differences would be logging directory changes versus logging
+inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
+there are lots of transactions that only contain an inode core and an inode log
+format structure. That is, two vectors totaling roughly 150 bytes. If we modify
+10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
+vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
+comparison, if we are logging full directory buffers, they are typically 4KB
+each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a
+buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
+space. From this, it should be obvious that a static log space reservation is
+not particularly flexible and that it is difficult to select an "optimal
+value" for all workloads.
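+
+To make the arithmetic above concrete, a small standalone calculation (the
+sizes are the approximations used in the text):
+
+	#include <stdio.h>
+
+	int main(void)
+	{
+		/* 10,000 inode changes: 2 vectors of ~150 bytes total each. */
+		double inode_data = 10000.0 * 150;	/* ~1.5MB of metadata */
+		double inode_hdrs = 20000.0 * 12;	/* 20,000 vector headers */
+
+		/* ~1.5MB of 4KB directory buffers: ~400 buffers, 2 vectors each. */
+		double dir_data = 1.5e6;
+		double dir_hdrs = 800.0 * 12;		/* ~800 vector headers */
+
+		/* Prints roughly 1.74MB and 1.51MB, as described above. */
+		printf("inodes: ~%.2fMB, dirs: ~%.2fMB\n",
+		       (inode_data + inode_hdrs) / 1e6,
+		       (dir_data + dir_hdrs) / 1e6);
+		return 0;
+	}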
+
+Further, if we are going to use a static reservation, which bit of the entire
+reservation does it cover? We account for space used by the transaction
+reservation by tracking the space currently used by the object in the CIL and
+then calculating the increase or decrease in space used as the object is
+relogged. This allows a checkpoint reservation to only have to account for
+the log buffer metadata used, such as log header records.
+
+However, even using a static reservation for just the log metadata is
+problematic. Typically log record headers use at least 16KB of log space per
+1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
+large enough to handle arbitrary sized checkpoint transactions. This
+reservation needs to be made before the checkpoint is started, and we need to
+be able to reserve the space without sleeping. For an 8MB checkpoint, we need a
+reservation of around 150KB, which is a non-trivial amount of space.
+
+A static reservation needs to manipulate the log grant counters - we can take a
+permanent reservation on the space, but we still need to make sure we refresh
+the write reservation (the actual space available to the transaction) after
+every checkpoint transaction completion. Unfortunately, if this space is not
+available when required, then the regrant code will sleep waiting for it.
+
+The problem with this is that it can lead to deadlocks as we may need to commit
+checkpoints to be able to free up log space (refer back to the description of
+rolling transactions for an example of this). Hence we *must* always have
+space available in the log if we are to use static reservations, and that is
+very difficult and complex to arrange. It is possible to do, but there is a
+simpler way.
+
+The simpler way of doing this is tracking the entire log space used by the
+items in the CIL and using this to dynamically calculate the amount of log
+space required by the log metadata. If this log metadata space changes as a
+result of a transaction commit inserting a new memory buffer into the CIL, then
+the difference in space required is removed from the transaction that causes
+the change. Transactions at this level will *always* have enough space
+available in their reservation for this as they have already reserved the
+maximal amount of log metadata space they require, and such a delta reservation
+will always be less than or equal to the maximal amount in the reservation.
+
+Hence we can grow the checkpoint transaction reservation dynamically as items
+are added to the CIL and avoid the need for reserving and regranting log space
+up front. This avoids deadlocks and removes a blocking point from the
+checkpoint flush code.
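+
+A sketch of this accounting, reusing the illustrative types from the force
+sketch above and assuming a per-item cil_size field and a per-context
+space_used counter (all names hypothetical):
+
+	/* Illustrative only: account a (re)insertion of an item in the CIL. */
+	void ex_cil_insert(struct ex_cil *cil, struct ex_log_item *lip,
+			   struct ex_trans *tp, int new_size)
+	{
+		int diff = new_size - lip->cil_size;	/* formatted size delta */
+
+		spin_lock(&cil->lock);
+		lip->cil_size = new_size;
+		cil->ctx->space_used += diff;
+		if (list_empty(&lip->cil_list))		/* first insertion */
+			list_add_tail(&lip->cil_list, &cil->ctx->items);
+		spin_unlock(&cil->lock);
+
+		/*
+		 * Take the delta out of the committing transaction's own
+		 * reservation - it always has room, because it reserved the
+		 * maximal formatted size of the item up front.
+		 */
+		tp->reservation -= diff;
+	}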
+
+As mentioned earlier, transactions can't grow to more than half the size of the
+log. Hence as part of the reservation growing, we need to also check the size
+of the reservation against the maximum allowed transaction size. If we reach
+the maximum threshold, we need to push the CIL to the log. This is effectively
+a "background flush" and is done on demand. This is identical to
+a CIL push triggered by a log force, only that there is no waiting for the
+checkpoint commit to complete. This background push is checked and executed by
+transaction commit code.
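+
+As a sketch, such a check might look like the following, with an assumed
+threshold well below the half-the-log limit (the actual threshold is an
+implementation choice, and all names are again hypothetical):
+
+	#define EX_CIL_SPACE_LIMIT(log)	((log)->size >> 3)	/* assumed */
+
+	/* Illustrative only: called from the transaction commit path. */
+	void ex_cil_push_background(struct ex_log *log)
+	{
+		struct ex_cil *cil = log->cil;
+
+		/* Push the checkpoint, but do not wait for it to commit. */
+		if (cil->ctx->space_used > EX_CIL_SPACE_LIMIT(log))
+			ex_cil_push_async(cil);
+	}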
+
+If the transaction subsystem goes idle while we still have items in the CIL,
+they will be flushed by the periodic log force issued by the xfssyncd. This log
+force will push the CIL to disk, and if the transaction subsystem stays idle,
+allow the idle log to be covered (effectively marked clean) in exactly the same
+manner that is done for the existing logging method. A discussion point is
+whether this log force needs to be done more frequently than the current
+rate, which is once every 30s.
+
+
+Delayed Logging: Log Item Pinning
+
+Currently log items are pinned during transaction commit while the items are
+still locked. This happens just after the items are formatted, though it could
+be done any time before the items are unlocked. The result of this mechanism is
+that items get pinned once for every transaction that is committed to the log
+buffers. Hence items that are relogged in the log buffers will have a pin count
+for every outstanding transaction they were dirtied in. When each of these
+transactions is completed, they will unpin the item once. As a result, the item
+only becomes unpinned when all the transactions complete and there are no
+pending transactions. Thus the pinning and unpinning of a log item is
+symmetric, as there is a 1:1 relationship between transaction commit and log
+item completion.
+
+For delayed logging, however, we have an asymmetric transaction commit to
+completion relationship. Every time an object is relogged in the CIL it goes
+through the commit process without a corresponding completion being registered.
+That is, we now have a many-to-one relationship between transaction commit and
+log item completion. The result of this is that pinning and unpinning of the
+log items becomes unbalanced if we retain the "pin on transaction commit, unpin
+on transaction completion" model.
+
+To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
+insertion into the CIL, unpin on checkpoint completion". In other words, the
+pinning and unpinning becomes symmetric around a checkpoint context. We have to
+pin the object the first time it is inserted into the CIL - if it is already in
+the CIL during a transaction commit, then we do not pin it again. Because there
+can be multiple outstanding checkpoint contexts, we can still see elevated pin
+counts, but as each checkpoint completes the pin count will retain the correct
+value according to its context.
+
+Just to make matters slightly more complex, this checkpoint level context
+for the pin count means that the pinning of an item must take place under the
+CIL commit/flush lock. If we pin the object outside this lock, we cannot
+guarantee which context the pin count is associated with. This is because
+whether we pin the item depends on whether the item is already present in the
+current CIL. If we don't lock the CIL before we check and pin the object, we
+have a race with the CIL being flushed between the check and the pin (or the
+decision not to pin, as the case may be). Hence we must hold the CIL flush/commit
+lock to guarantee that we pin the items correctly.
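+
+A sketch of the resulting pin logic, with the CIL commit/flush lock assumed
+held by the caller (illustrative names again):
+
+	/*
+	 * Illustrative only: pin an item on first insertion into the CIL.
+	 * The caller holds the CIL commit/flush lock, so the CIL cannot be
+	 * flushed between the membership check and the pin.
+	 */
+	void ex_cil_pin_item(struct ex_log_item *lip)
+	{
+		if (list_empty(&lip->cil_list))
+			ex_item_pin(lip);	/* one pin per checkpoint context */
+	}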
+
+Delayed Logging: Concurrent Scalability
+
+A fundamental requirement for the CIL is that accesses through transaction
+commits must scale to many concurrent commits. The current transaction commit
+code does not break down even when there are transactions coming from 2048
+processors at once. The current transaction code does not go any faster than if
+there was only one CPU using it, but it does not slow down either.
+
+As a result, the delayed logging transaction commit code needs to be designed
+for concurrency from the ground up. It is obvious that there are serialisation
+points in the design - the three important ones are:
+
+ 1. Locking out new transaction commits while flushing the CIL
+ 2. Adding items to the CIL and updating item space accounting
+ 3. Checkpoint commit ordering
+
+Looking at the transaction commit and CIL flushing interactions, it is clear
+that we have a many-to-one interaction here. That is, the only restriction on
+the number of concurrent transactions that can be trying to commit at once is
+the amount of space available in the log for their reservations. The practical
+limit here is on the order of several hundred concurrent transactions for a
+128MB log, which works out to roughly one per CPU in a machine.
+
+A transaction commit needs to hold out a flush for a relatively long time -
+the pinning of log items needs to be done while we are holding out a CIL
+flush, so at the moment that means the lock is held across the formatting of
+the objects into memory buffers (i.e. while memcpy()s are in progress).
+Ultimately a two pass algorithm, where the formatting is done separately from
+the pinning of objects, could be used to reduce the hold time on the
+transaction commit side.
+
+Because of the number of potential transaction commit side holders, the lock
+really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
+want every other CPU in the machine spinning on the CIL lock. Given that
+flushing the CIL could involve walking a list of tens of thousands of log
+items, it will get held for a significant time and so spin contention is a
+significant concern. Preventing lots of CPUs spinning doing nothing is the
+main reason for choosing a sleeping lock even though nothing in either the
+transaction commit or CIL flush side sleeps with the lock held.
+
+It should also be noted that CIL flushing is a relatively rare operation
+compared to transaction commit for asynchronous transaction workloads - only
+time will tell whether using a read-write semaphore for exclusion will limit
+transaction commit concurrency due to cache line bouncing of the lock on the
+read side.
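+
+For illustration, the exclusion could be expressed with a kernel rw_semaphore
+- transaction commits as readers, the CIL flush as the rare writer. A sketch,
+assuming a ctx_lock field and hypothetical helper names:
+
+	/* Illustrative only: many commits (readers) vs. one flush (writer). */
+	void ex_trans_commit(struct ex_cil *cil, struct ex_trans *tp)
+	{
+		down_read(&cil->ctx_lock);	/* holds out CIL flushes */
+		ex_insert_items(cil, tp);	/* format, pin and insert */
+		up_read(&cil->ctx_lock);
+	}
+
+	void ex_cil_flush(struct ex_cil *cil)
+	{
+		down_write(&cil->ctx_lock);	/* drains all commits */
+		ex_swap_in_new_context(cil);	/* commits can continue */
+		up_write(&cil->ctx_lock);
+
+		ex_write_checkpoint(cil);	/* old context, lock not needed */
+	}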
+
+The second serialisation point is on the transaction commit side where items
+are inserted into the CIL. Because transactions can enter this code
+concurrently, the CIL needs to be protected separately from the above
+commit/flush exclusion. It also needs to be an exclusive lock but it is only
+held for a very short time and so a spin lock is appropriate here. It is
+possible that this lock will become a contention point, but given the short
+hold time once per transaction I think that contention is unlikely.
+
+The final serialisation point is the checkpoint commit record ordering code
+that is run as part of the checkpoint commit and log force sequencing. The code
+path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
+an ordering loop after writing all the log vectors into the log buffers but
+before writing the commit record. This loop walks the list of committing
+checkpoints and needs to block waiting for checkpoints to complete their commit
+record write. As a result it needs a lock and a wait variable. Log force
+sequencing also requires the same lock, list walk, and blocking mechanism to
+ensure completion of checkpoints.
+
+These two sequencing operations can use the same mechanism even though the
+events they are waiting for are different. The checkpoint commit record
+sequencing needs to wait until checkpoint contexts contain a commit LSN
+(obtained through completion of a commit record write) while log force
+sequencing needs to wait until previous checkpoint contexts are removed from
+the committing list (i.e. they've completed). A simple wait variable and
+broadcast wakeups (thundering herds) have been used to implement these two
+serialisation queues. They use the same lock as the CIL, too. If we see too
+much contention on the CIL lock, or too many context switches as a result of
+the broadcast wakeups, these operations can be put under a new spinlock and
+given separate wait lists to reduce lock contention and the number of processes
+woken by the wrong event.
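+
+The wait/wakeup pair assumed by the earlier force sketch could be built from
+the standard waitqueue primitives, for example (illustrative only):
+
+	/* Sleep until a commit record write is signalled. */
+	static void ex_wait_on_commit(struct ex_cil *cil)
+	{
+		DEFINE_WAIT(wait);
+
+		prepare_to_wait(&cil->commit_wait, &wait, TASK_UNINTERRUPTIBLE);
+		spin_unlock(&cil->lock);
+		schedule();
+		finish_wait(&cil->commit_wait, &wait);
+		spin_lock(&cil->lock);
+	}
+
+	/* Called after a checkpoint writes its commit record. */
+	static void ex_commit_record_written(struct ex_cil *cil,
+					     struct ex_cil_ctx *ctx,
+					     xfs_lsn_t lsn)
+	{
+		spin_lock(&cil->lock);
+		ctx->commit_lsn = lsn;
+		wake_up_all(&cil->commit_wait);	/* broadcast wakeup */
+		spin_unlock(&cil->lock);
+	}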
+
+
+Lifecycle Changes
+
+The existing log item life cycle is as follows:
+
+ 1. Transaction allocate
+ 2. Transaction reserve
+ 3. Lock item
+ 4. Join item to transaction
+ If not already attached,
+ Allocate log item
+ Attach log item to owner item
+ Attach log item to transaction
+ 5. Modify item
+ Record modifications in log item
+ 6. Transaction commit
+ Pin item in memory
+ Format item into log buffer
+ Write commit LSN into transaction
+ Unlock item
+ Attach transaction to log buffer
+
+ <log buffer IO dispatched>
+ <log buffer IO completes>
+
+ 7. Transaction completion
+ Mark log item committed
+ Insert log item into AIL
+ Write commit LSN into log item
+ Unpin log item
+ 8. AIL traversal
+ Lock item
+ Mark log item clean
+ Flush item to disk
+
+ <item IO completion>
+
+ 9. Log item removed from AIL
+ Moves log tail
+ Item unlocked
+
+Essentially, steps 1-6 operate independently from step 7, which is also
+independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
+at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
+at the same time. If the log item is in the AIL or between steps 6 and 7
+and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
+are entered and completed is the object considered clean.
+
+With delayed logging, there are new steps inserted into the life cycle:
+
+ 1. Transaction allocate
+ 2. Transaction reserve
+ 3. Lock item
+ 4. Join item to transaction
+ If not already attached,
+ Allocate log item
+ Attach log item to owner item
+ Attach log item to transaction
+ 5. Modify item
+ Record modifications in log item
+ 6. Transaction commit
+ Pin item in memory if not pinned in CIL
+ Format item into log vector + buffer
+ Attach log vector and buffer to log item
+ Insert log item into CIL
+ Write CIL context sequence into transaction
+ Unlock item
+
+ <next log force>
+
+ 7. CIL push
+ lock CIL flush
+ Chain log vectors and buffers together
+ Remove items from CIL
+ unlock CIL flush
+ write log vectors into log
+ sequence commit records
+ attach checkpoint context to log buffer
+
+ <log buffer IO dispatched>
+ <log buffer IO completes>
+
+ 8. Checkpoint completion
+ Mark log item committed
+ Insert item into AIL
+ Write commit LSN into log item
+ Unpin log item
+ 9. AIL traversal
+ Lock item
+ Mark log item clean
+ Flush item to disk
+ <item IO completion>
+ 10. Log item removed from AIL
+ Moves log tail
+ Item unlocked
+
+From this, it can be seen that the only life cycle differences between the two
+logging methods are in the middle of the life cycle - they still have the same
+beginning and end and execution constraints. The only differences are in the
+committing of the log items to the log itself and the completion processing.
+Hence delayed logging should not introduce any constraints on log item
+behaviour, allocation or freeing that don't already exist.
+
+As a result of this zero-impact "insertion" of delayed logging infrastructure
+and the design of the internal structures to avoid on-disk format changes, we
+can basically switch between delayed logging and the existing mechanism with a
+mount option. Fundamentally, there is no reason why the log manager would not
+be able to swap methods automatically and transparently depending on load
+characteristics, but this should not be necessary if delayed logging works as
+designed.
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
new file mode 100644
index 00000000..3fc0c31a
--- /dev/null
+++ b/Documentation/filesystems/xfs.txt
@@ -0,0 +1,252 @@
+
+The SGI XFS Filesystem
+======================
+
+XFS is a high performance journaling filesystem which originated
+on the SGI IRIX platform. It is completely multi-threaded, can
+support large files and large filesystems, extended attributes,
+variable block sizes, is extent based, and makes extensive use of
+Btrees (directories, extents, free space) to aid both performance
+and scalability.
+
+Refer to the documentation at http://oss.sgi.com/projects/xfs/
+for further details. This implementation is on-disk compatible
+with the IRIX version of XFS.
+
+
+Mount Options
+=============
+
+When mounting an XFS filesystem, the following options are accepted.
+
+ allocsize=size
+ Sets the buffered I/O end-of-file preallocation size when
+ doing delayed allocation writeout (default size is 64KiB).
+ Valid values for this option are page size (typically 4KiB)
+ through to 1GiB, inclusive, in power-of-2 increments.
+
+ attr2/noattr2
+ The options enable/disable (default is disabled for backward
+ compatibility on-disk) an "opportunistic" improvement to be
+ made in the way inline extended attributes are stored on-disk.
+ When the new form is used for the first time (by setting or
+ removing extended attributes) the on-disk superblock feature
+ bit field will be updated to reflect this format being in use.
+
+ barrier
+ Enables the use of block layer write barriers for writes into
+ the journal and unwritten extent conversion. This allows for
+ drive level write caching to be enabled, for devices that
+ support write barriers.
+
+ discard
+ Issue command to let the block device reclaim space freed by the
+ filesystem. This is useful for SSD devices, thinly provisioned
+ LUNs and virtual machine images, but may have a performance
+ impact. This option is incompatible with the nodelaylog option.
+
+ dmapi
+ Enable the DMAPI (Data Management API) event callouts.
+ Use with the "mtpt" option.
+
+ grpid/bsdgroups and nogrpid/sysvgroups
+ These options define what group ID a newly created file gets.
+ When grpid is set, it takes the group ID of the directory in
+ which it is created; otherwise (the default) it takes the fsgid
+ of the current process, unless the directory has the setgid bit
+ set, in which case it takes the gid from the parent directory,
+ and also gets the setgid bit set if it is a directory itself.
+
+ ihashsize=value
+ In memory inode hashes have been removed, so this option has
+ no function as of August 2007. Option is deprecated.
+
+ ikeep/noikeep
+ When ikeep is specified, XFS does not delete empty inode clusters
+ and keeps them around on disk. ikeep is the traditional XFS
+ behaviour. When noikeep is specified, empty inode clusters
+ are returned to the free space pool. The default is noikeep for
+ non-DMAPI mounts, while ikeep is the default when DMAPI is in use.
+
+ inode64
+ Indicates that XFS is allowed to create inodes at any location
+ in the filesystem, including those which will result in inode
+ numbers occupying more than 32 bits of significance. This is
+ provided for backwards compatibility, but causes problems for
+ backup applications that cannot handle large inode numbers.
+
+ largeio/nolargeio
+ If "nolargeio" is specified, the optimal I/O reported in
+ st_blksize by stat(2) will be as small as possible to allow user
+ applications to avoid inefficient read/modify/write I/O.
+	If "largeio" is specified, a filesystem that has a "swidth" specified
+ will return the "swidth" value (in bytes) in st_blksize. If the
+ filesystem does not have a "swidth" specified but does specify
+ an "allocsize" then "allocsize" (in bytes) will be returned
+ instead.
+	If neither of these two options is specified, then the filesystem
+	will behave as if "nolargeio" was specified.
+
+ logbufs=value
+ Set the number of in-memory log buffers. Valid numbers range
+ from 2-8 inclusive.
+ The default value is 8 buffers for filesystems with a
+ blocksize of 64KiB, 4 buffers for filesystems with a blocksize
+ of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB
+ and 2 buffers for all other configurations. Increasing the
+ number of buffers may increase performance on some workloads
+ at the cost of the memory used for the additional log buffers
+ and their associated control structures.
+
+ logbsize=value
+ Set the size of each in-memory log buffer.
+ Size may be specified in bytes, or in kilobytes with a "k" suffix.
+ Valid sizes for version 1 and version 2 logs are 16384 (16k) and
+ 32768 (32k). Valid sizes for version 2 logs also include
+ 65536 (64k), 131072 (128k) and 262144 (256k).
+ The default value for machines with more than 32MiB of memory
+ is 32768, machines with less memory use 16384 by default.
+
+ logdev=device and rtdev=device
+ Use an external log (metadata journal) and/or real-time device.
+ An XFS filesystem has up to three parts: a data section, a log
+ section, and a real-time section. The real-time section is
+ optional, and the log section can be separate from the data
+ section or contained within it.
+
+ mtpt=mountpoint
+ Use with the "dmapi" option. The value specified here will be
+ included in the DMAPI mount event, and should be the path of
+ the actual mountpoint that is used.
+
+ noalign
+ Data allocations will not be aligned at stripe unit boundaries.
+
+ noatime
+ Access timestamps are not updated when a file is read.
+
+ norecovery
+ The filesystem will be mounted without running log recovery.
+ If the filesystem was not cleanly unmounted, it is likely to
+ be inconsistent when mounted in "norecovery" mode.
+ Some files or directories may not be accessible because of this.
+ Filesystems mounted "norecovery" must be mounted read-only or
+ the mount will fail.
+
+ nouuid
+ Don't check for double mounted file systems using the file system uuid.
+ This is useful to mount LVM snapshot volumes.
+
+ uquota/usrquota/uqnoenforce/quota
+ User disk quota accounting enabled, and limits (optionally)
+ enforced. Refer to xfs_quota(8) for further details.
+
+ gquota/grpquota/gqnoenforce
+ Group disk quota accounting enabled and limits (optionally)
+ enforced. Refer to xfs_quota(8) for further details.
+
+ pquota/prjquota/pqnoenforce
+ Project disk quota accounting enabled and limits (optionally)
+ enforced. Refer to xfs_quota(8) for further details.
+
+ sunit=value and swidth=value
+ Used to specify the stripe unit and width for a RAID device or
+ a stripe volume. "value" must be specified in 512-byte block
+ units.
+ If this option is not specified and the filesystem was made on
+ a stripe volume or the stripe width or unit were specified for
+ the RAID device at mkfs time, then the mount system call will
+ restore the value from the superblock. For filesystems that
+ are made directly on RAID devices, these options can be used
+ to override the information in the superblock if the underlying
+ disk layout changes after the filesystem has been created.
+ The "swidth" option is required if the "sunit" option has been
+ specified, and must be a multiple of the "sunit" value.
+
+ swalloc
+ Data allocations will be rounded up to stripe width boundaries
+ when the current end of file is being extended and the file
+ size is larger than the stripe width size.
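+
+As an illustration of passing several of the options above programmatically,
+a minimal mount(2) sketch (the device and mountpoint paths, and the option
+values, are examples only):
+
+	#include <stdio.h>
+	#include <sys/mount.h>
+
+	int main(void)
+	{
+		/* logbsize in bytes; swidth is a multiple of sunit. */
+		if (mount("/dev/sdb1", "/mnt/scratch", "xfs", MS_NOATIME,
+			  "logbufs=8,logbsize=262144,sunit=64,swidth=256")) {
+			perror("mount");
+			return 1;
+		}
+		return 0;
+	}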
+
+
+sysctls
+=======
+
+The following sysctls are available for the XFS filesystem:
+
+ fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)
+ Setting this to "1" clears accumulated XFS statistics
+ in /proc/fs/xfs/stat. It then immediately resets to "0".
+
+ fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
+ The interval at which the xfssyncd thread flushes metadata
+ out to disk. This thread will flush log activity out, and
+ do some processing on unlinked inodes.
+
+ fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000)
+ The interval at which xfsbufd scans the dirty metadata buffers list.
+
+ fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000)
+ The age at which xfsbufd flushes dirty metadata buffers to disk.
+
+ fs.xfs.error_level (Min: 0 Default: 3 Max: 11)
+ A volume knob for error reporting when internal errors occur.
+ This will generate detailed messages & backtraces for filesystem
+ shutdowns, for example. Current threshold values are:
+
+ XFS_ERRLEVEL_OFF: 0
+ XFS_ERRLEVEL_LOW: 1
+ XFS_ERRLEVEL_HIGH: 5
+
+ fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127)
+	Causes certain error conditions to call BUG(). Value is a bitmask;
+	OR together the tags which represent errors which should cause panics:
+
+ XFS_NO_PTAG 0
+ XFS_PTAG_IFLUSH 0x00000001
+ XFS_PTAG_LOGRES 0x00000002
+ XFS_PTAG_AILDELETE 0x00000004
+ XFS_PTAG_ERROR_REPORT 0x00000008
+ XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010
+ XFS_PTAG_SHUTDOWN_IOERROR 0x00000020
+ XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040
+
+ This option is intended for debugging only.
+
+ fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)
+ Controls whether symlinks are created with mode 0777 (default)
+ or whether their mode is affected by the umask (irix mode).
+
+ fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)
+ Controls files created in SGID directories.
+ If the group ID of the new file does not match the effective group
+ ID or one of the supplementary group IDs of the parent dir, the
+ ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
+ is set.
+
+ fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "sync" flag set
+ by the xfs_io(8) chattr command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "nodump" flag set
+ by the xfs_io(8) chattr command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "noatime" flag set
+ by the xfs_io(8) chattr command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "nosymlinks" flag set
+ by the xfs_io(8) chattr command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256)
+ In "inode32" allocation mode, this option determines how many
+ files the allocator attempts to allocate in the same allocation
+ group before moving to the next allocation group. The intent
+ is to control the rate at which the allocator moves between
+ allocation groups when allocating extents for new files.
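+
+These sysctls appear under /proc/sys/fs/xfs/. A minimal sketch of setting one
+from a C program (the sysctl name and value here are examples only):
+
+	#include <stdio.h>
+
+	/* Write an integer to /proc/sys/fs/xfs/<name>. */
+	static int set_xfs_sysctl(const char *name, int value)
+	{
+		char path[256];
+		FILE *f;
+
+		snprintf(path, sizeof(path), "/proc/sys/fs/xfs/%s", name);
+		f = fopen(path, "w");
+		if (!f)
+			return -1;
+		fprintf(f, "%d\n", value);
+		return fclose(f);
+	}
+
+	int main(void)
+	{
+		return set_xfs_sysctl("error_level", 1) ? 1 : 0;
+	}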
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
new file mode 100644
index 00000000..0466ee56
--- /dev/null
+++ b/Documentation/filesystems/xip.txt
@@ -0,0 +1,68 @@
+Execute-in-place for file mappings
+----------------------------------
+
+Motivation
+----------
+File mappings are performed by mapping page cache pages to userspace. In
+addition, read&write type file operations also transfer data from/to the page
+cache.
+
+For memory backed storage devices that use the block device interface, the page
+cache pages are in fact copies of the original storage. Various approaches
+exist to work around the need for an extra copy. The ramdisk driver, for
+example, reads the data into the page cache, keeps a reference, and later
+discards the original data.
+
+Execute-in-place solves this issue the other way around: instead of keeping
+data in the page cache, the need to have a page cache copy is eliminated
+completely. With execute-in-place, read&write type operations are performed
+directly from/to the memory backed storage device. For file mappings, the
+storage device itself is mapped directly into userspace.
+
+This implementation was initially written for shared memory segments between
+different virtual machines on s390 hardware to allow multiple machines to
+share the same binaries and libraries.
+
+Implementation
+--------------
+Execute-in-place is implemented in three steps: block device operation,
+address space operation, and file operations.
+
+A block device operation named direct_access is used to retrieve a
+reference (pointer) to a block on-disk. The reference is supposed to be a
+cpu-addressable physical address that remains valid until the release
+operation is performed. A struct block_device reference is used to address
+the device, and a sector_t argument is used to identify the individual block.
+alternative, memory technology devices can be used for this.
+
+The block device operation is optional; these block devices support it as of
+today:
+- dcssblk: s390 dcss block device driver
+
+An address space operation named get_xip_mem is used to retrieve references
+to a page frame number and a kernel address. To obtain these values a reference
+to an address_space is provided. This function assigns values to the kmem and
+pfn parameters. The third argument indicates whether the function should allocate
+blocks if needed.
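+
+Based on the descriptions above, the two operations have roughly the
+following shapes (a sketch only, with example_ names standing in for the real
+implementations; consult the kernel headers for the authoritative prototypes):
+
+	#include <linux/blkdev.h>
+	#include <linux/fs.h>
+
+	/* Block device side: resolve @sector to a kernel address and pfn. */
+	int example_direct_access(struct block_device *bdev, sector_t sector,
+				  void **kaddr, unsigned long *pfn);
+
+	/* Address space side: resolve page @pgoff, allocating if @create. */
+	int example_get_xip_mem(struct address_space *mapping, pgoff_t pgoff,
+				int create, void **kmem, unsigned long *pfn);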
+
+This address space operation is mutually exclusive with readpage&writepage that
+do page cache read/write operations.
+The following filesystems support it as of today:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+A set of file operations that do utilize get_xip_mem can be found in
+mm/filemap_xip.c. The following file operation implementations are provided:
+- aio_read/aio_write
+- readv/writev
+- sendfile
+
+The generic file operations do_sync_read/do_sync_write can be used to implement
+classic synchronous IO calls.
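+
+For example, a filesystem might wire its xip file operations up roughly like
+this (a sketch modelled on ext2's xip support of this era; treat the helper
+names as assumptions):
+
+	/* Illustrative only: file operations backed by mm/filemap_xip.c. */
+	static const struct file_operations example_xip_file_operations = {
+		.llseek	= generic_file_llseek,
+		.read	= xip_file_read,
+		.write	= xip_file_write,
+		.mmap	= xip_file_mmap,
+	};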
+
+Shortcomings
+------------
+This implementation is limited to storage devices that are cpu addressable at
+all times (no highmem or such). It works well on rom/ram, but enhancements are
+needed to make it work with flash in read+write mode.
+Putting the Linux kernel and/or its modules on an xip filesystem does not mean
+they are not copied.