From 849369d6c66d3054688672f97d31fceb8e8230fb Mon Sep 17 00:00:00 2001 From: root Date: Fri, 25 Dec 2015 04:40:36 +0000 Subject: initial_commit --- Documentation/filesystems/porting | 409 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 409 insertions(+) create mode 100644 Documentation/filesystems/porting (limited to 'Documentation/filesystems/porting') diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting new file mode 100644 index 00000000..6e299548 --- /dev/null +++ b/Documentation/filesystems/porting @@ -0,0 +1,409 @@ +Changes since 2.5.0: + +--- +[recommended] + +New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(), + sb_set_blocksize() and sb_min_blocksize(). + +Use them. + +(sb_find_get_block() replaces 2.4's get_hash_table()) + +--- +[recommended] + +New methods: ->alloc_inode() and ->destroy_inode(). + +Remove inode->u.foo_inode_i +Declare + struct foo_inode_info { + /* fs-private stuff */ + struct inode vfs_inode; + }; + static inline struct foo_inode_info *FOO_I(struct inode *inode) + { + return list_entry(inode, struct foo_inode_info, vfs_inode); + } + +Use FOO_I(inode) instead of &inode->u.foo_inode_i; + +Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate +foo_inode_info and return the address of ->vfs_inode, the latter should free +FOO_I(inode) (see in-tree filesystems for examples). + +Make them ->alloc_inode and ->destroy_inode in your super_operations. + +Keep in mind that now you need explicit initialization of private data +typically between calling iget_locked() and unlocking the inode. + +At some point that will become mandatory. + +--- +[mandatory] + +Change of file_system_type method (->read_super to ->get_sb) + +->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV. + +Turn your foo_read_super() into a function that would return 0 in case of +success and negative number in case of error (-EINVAL unless you have more +informative error value to report). Call it foo_fill_super(). Now declare + +int foo_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data, struct vfsmount *mnt) +{ + return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super, + mnt); +} + +(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of +filesystem). + +Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as +foo_get_sb. + +--- +[mandatory] + +Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames. +Most likely there is no need to change anything, but if you relied on +global exclusion between renames for some internal purpose - you need to +change your internal locking. Otherwise exclusion warranties remain the +same (i.e. parents and victim are locked, etc.). + +--- +[informational] + +Now we have the exclusion between ->lookup() and directory removal (by +->rmdir() and ->rename()). If you used to need that exclusion and do +it by internal locking (most of filesystems couldn't care less) - you +can relax your locking. + +--- +[mandatory] + +->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(), +->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename() +and ->readdir() are called without BKL now. Grab it on entry, drop upon return +- that will guarantee the same locking you used to have. If your method or its +parts do not need BKL - better yet, now you can shift lock_kernel() and +unlock_kernel() so that they would protect exactly what needs to be +protected. + +--- +[mandatory] + +BKL is also moved from around sb operations. ->write_super() Is now called +without BKL held. BKL should have been shifted into individual fs sb_op +functions. If you don't need it, remove it. + +--- +[informational] + +check for ->link() target not being a directory is done by callers. Feel +free to drop it... + +--- +[informational] + +->link() callers hold ->i_mutex on the object we are linking to. Some of your +problems might be over... + +--- +[mandatory] + +new file_system_type method - kill_sb(superblock). If you are converting +an existing filesystem, set it according to ->fs_flags: + FS_REQUIRES_DEV - kill_block_super + FS_LITTER - kill_litter_super + neither - kill_anon_super +FS_LITTER is gone - just remove it from fs_flags. + +--- +[mandatory] + + FS_SINGLE is gone (actually, that had happened back when ->get_sb() +went in - and hadn't been documented ;-/). Just remove it from fs_flags +(and see ->get_sb() entry for other actions). + +--- +[mandatory] + +->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so +watch for ->i_mutex-grabbing code that might be used by your ->setattr(). +Callers of notify_change() need ->i_mutex now. + +--- +[recommended] + +New super_block field "struct export_operations *s_export_op" for +explicit support for exporting, e.g. via NFS. The structure is fully +documented at its declaration in include/linux/fs.h, and in +Documentation/filesystems/nfs/Exporting. + +Briefly it allows for the definition of decode_fh and encode_fh operations +to encode and decode filehandles, and allows the filesystem to use +a standard helper function for decode_fh, and provide file-system specific +support for this helper, particularly get_parent. + +It is planned that this will be required for exporting once the code +settles down a bit. + +[mandatory] + +s_export_op is now required for exporting a filesystem. +isofs, ext2, ext3, resierfs, fat +can be used as examples of very different filesystems. + +--- +[mandatory] + +iget4() and the read_inode2 callback have been superseded by iget5_locked() +which has the following prototype, + + struct inode *iget5_locked(struct super_block *sb, unsigned long ino, + int (*test)(struct inode *, void *), + int (*set)(struct inode *, void *), + void *data); + +'test' is an additional function that can be used when the inode +number is not sufficient to identify the actual file object. 'set' +should be a non-blocking function that initializes those parts of a +newly created inode to allow the test function to succeed. 'data' is +passed as an opaque value to both test and set functions. + +When the inode has been created by iget5_locked(), it will be returned with the +I_NEW flag set and will still be locked. The filesystem then needs to finalize +the initialization. Once the inode is initialized it must be unlocked by +calling unlock_new_inode(). + +The filesystem is responsible for setting (and possibly testing) i_ino +when appropriate. There is also a simpler iget_locked function that +just takes the superblock and inode number as arguments and does the +test and set for you. + +e.g. + inode = iget_locked(sb, ino); + if (inode->i_state & I_NEW) { + err = read_inode_from_disk(inode); + if (err < 0) { + iget_failed(inode); + return err; + } + unlock_new_inode(inode); + } + +Note that if the process of setting up a new inode fails, then iget_failed() +should be called on the inode to render it dead, and an appropriate error +should be passed back to the caller. + +--- +[recommended] + +->getattr() finally getting used. See instances in nfs, minix, etc. + +--- +[mandatory] + +->revalidate() is gone. If your filesystem had it - provide ->getattr() +and let it call whatever you had as ->revlidate() + (for symlinks that +had ->revalidate()) add calls in ->follow_link()/->readlink(). + +--- +[mandatory] + +->d_parent changes are not protected by BKL anymore. Read access is safe +if at least one of the following is true: + * filesystem has no cross-directory rename() + * we know that parent had been locked (e.g. we are looking at +->d_parent of ->lookup() argument). + * we are called from ->rename(). + * the child's ->d_lock is held +Audit your code and add locking if needed. Notice that any place that is +not protected by the conditions above is risky even in the old tree - you +had been relying on BKL and that's prone to screwups. Old tree had quite +a few holes of that kind - unprotected access to ->d_parent leading to +anything from oops to silent memory corruption. + +--- +[mandatory] + + FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags +(see rootfs for one kind of solution and bdev/socket/pipe for another). + +--- +[recommended] + + Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter +is still alive, but only because of the mess in drivers/s390/block/dasd.c. +As soon as it gets fixed is_read_only() will die. + +--- +[mandatory] + +->permission() is called without BKL now. Grab it on entry, drop upon +return - that will guarantee the same locking you used to have. If +your method or its parts do not need BKL - better yet, now you can +shift lock_kernel() and unlock_kernel() so that they would protect +exactly what needs to be protected. + +--- +[mandatory] + +->statfs() is now called without BKL held. BKL should have been +shifted into individual fs sb_op functions where it's not clear that +it's safe to remove it. If you don't need it, remove it. + +--- +[mandatory] + + is_read_only() is gone; use bdev_read_only() instead. + +--- +[mandatory] + + destroy_buffers() is gone; use invalidate_bdev(). + +--- +[mandatory] + + fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is +deliberate; as soon as struct block_device * is propagated in a reasonable +way by that code fixing will become trivial; until then nothing can be +done. + +[mandatory] + + block truncatation on error exit from ->write_begin, and ->direct_IO +moved from generic methods (block_write_begin, cont_write_begin, +nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at +ext2_write_failed and callers for an example. + +[mandatory] + + ->truncate is going away. The whole truncate sequence needs to be +implemented in ->setattr, which is now mandatory for filesystems +implementing on-disk size changes. Start with a copy of the old inode_setattr +and vmtruncate, and the reorder the vmtruncate + foofs_vmtruncate sequence to +be in order of zeroing blocks using block_truncate_page or similar helpers, +size update and on finally on-disk truncation which should not fail. +inode_change_ok now includes the size checks for ATTR_SIZE and must be called +in the beginning of ->setattr unconditionally. + +[mandatory] + + ->clear_inode() and ->delete_inode() are gone; ->evict_inode() should +be used instead. It gets called whenever the inode is evicted, whether it has +remaining links or not. Caller does *not* evict the pagecache or inode-associated +metadata buffers; getting rid of those is responsibility of method, as it had +been for ->delete_inode(). + + ->drop_inode() returns int now; it's called on final iput() with +inode->i_lock held and it returns true if filesystems wants the inode to be +dropped. As before, generic_drop_inode() is still the default and it's been +updated appropriately. generic_delete_inode() is also alive and it consists +simply of return 1. Note that all actual eviction work is done by caller after +->drop_inode() returns. + + clear_inode() is gone; use end_writeback() instead. As before, it must +be called exactly once on each call of ->evict_inode() (as it used to be for +each call of ->delete_inode()). Unlike before, if you are using inode-associated +metadata buffers (i.e. mark_buffer_dirty_inode()), it's your responsibility to +call invalidate_inode_buffers() before end_writeback(). + No async writeback (and thus no calls of ->write_inode()) will happen +after end_writeback() returns, so actions that should not overlap with ->write_inode() +(e.g. freeing on-disk inode if i_nlink is 0) ought to be done after that call. + + NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out +if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput() +may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly +free the on-disk inode, you may end up doing that while ->write_inode() is writing +to it. + +--- +[mandatory] + + .d_delete() now only advises the dcache as to whether or not to cache +unreferenced dentries, and is now only called when the dentry refcount goes to +0. Even on 0 refcount transition, it must be able to tolerate being called 0, +1, or more times (eg. constant, idempotent). + +--- +[mandatory] + + .d_compare() calling convention and locking rules are significantly +changed. Read updated documentation in Documentation/filesystems/vfs.txt (and +look at examples of other filesystems) for guidance. + +--- +[mandatory] + + .d_hash() calling convention and locking rules are significantly +changed. Read updated documentation in Documentation/filesystems/vfs.txt (and +look at examples of other filesystems) for guidance. + +--- +[mandatory] + dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c +for details of what locks to replace dcache_lock with in order to protect +particular things. Most of the time, a filesystem only needs ->d_lock, which +protects *all* the dcache state of a given dentry. + +-- +[mandatory] + + Filesystems must RCU-free their inodes, if they can have been accessed +via rcu-walk path walk (basically, if the file can have had a path name in the +vfs namespace). + + i_dentry and i_rcu share storage in a union, and the vfs expects +i_dentry to be reinitialized before it is freed, so an: + + INIT_LIST_HEAD(&inode->i_dentry); + +must be done in the RCU callback. + +-- +[recommended] + vfs now tries to do path walking in "rcu-walk mode", which avoids +atomic operations and scalability hazards on dentries and inodes (see +Documentation/filesystems/path-lookup.txt). d_hash and d_compare changes +(above) are examples of the changes required to support this. For more complex +filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so +no changes are required to the filesystem. However, this is costly and loses +the benefits of rcu-walk mode. We will begin to add filesystem callbacks that +are rcu-walk aware, shown below. Filesystems should take advantage of this +where possible. + +-- +[mandatory] + d_revalidate is a callback that is made on every path element (if +the filesystem provides it), which requires dropping out of rcu-walk mode. This +may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be +returned if the filesystem cannot handle rcu-walk. See +Documentation/filesystems/vfs.txt for more details. + + permission and check_acl are inode permission checks that are called +on many or all directory inodes on the way down a path walk (to check for +exec permission). These must now be rcu-walk aware (flags & IPERM_FLAG_RCU). +See Documentation/filesystems/vfs.txt for more details. + +-- +[mandatory] + In ->fallocate() you must check the mode option passed in. If your +filesystem does not support hole punching (deallocating space in the middle of a +file) you must return -EOPNOTSUPP if FALLOC_FL_PUNCH_HOLE is set in mode. +Currently you can only have FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE set, +so the i_size should not change when hole punching, even when puching the end of +a file off. + +-- +[mandatory] + +-- +[mandatory] + ->get_sb() is gone. Switch to use of ->mount(). Typically it's just +a matter of switching from calling get_sb_... to mount_... and changing the +function type. If you were doing it manually, just switch from setting ->mnt_root +to some pointer to returning that pointer. On errors return ERR_PTR(...). -- cgit v1.2.3