diff --git a/en_US.ISO8859-1/articles/geom-class/Makefile b/en_US.ISO8859-1/articles/geom-class/Makefile
new file mode 100644
index 0000000000..58e659e2a9
--- /dev/null
+++ b/en_US.ISO8859-1/articles/geom-class/Makefile
@@ -0,0 +1,19 @@
+#
+# $FreeBSD$
+#
+# Article: Writing a GEOM Class
+
+DOC?= article
+
+FORMATS?= html
+WITH_ARTICLE_TOC?= YES
+
+INSTALL_COMPRESSED?= gz
+INSTALL_ONLY_COMPRESSED?=
+
+SRCS= article.sgml
+
+URL_RELPREFIX?= ../../../..
+DOC_PREFIX?= ${.CURDIR}/../../..
+
+.include "${DOC_PREFIX}/share/mk/doc.project.mk"
diff --git a/en_US.ISO8859-1/articles/geom-class/article.sgml b/en_US.ISO8859-1/articles/geom-class/article.sgml
new file mode 100644
index 0000000000..f39eb2dd80
--- /dev/null
+++ b/en_US.ISO8859-1/articles/geom-class/article.sgml
@@ -0,0 +1,696 @@

%articles.ent;
]>
Writing a GEOM Class

Ivan Voras
ivoras@yahoo.com +
+
+
+
$FreeBSD$

&tm-attrib.freebsd;
&tm-attrib.cvsup;
&tm-attrib.intel;
&tm-attrib.xfree86;
&tm-attrib.general;

This text documents the way I created the gjournal facility, starting with learning how to do kernel programming. It's assumed the reader is familiar with C userland programming.
Introduction

Documentation

Documentation on kernel programming is scarce - it's one of the few areas where there's nearly nothing in the way of friendly tutorials, and the phrase "use the source!" really holds true. However, there are some bits and pieces (some of them seriously outdated) floating around that should be studied before beginning to code:

- The FreeBSD Developer's Handbook - part of the documentation project; it doesn't contain anything specific to kernel-land programming, but rather some general information.

- The FreeBSD Architecture Handbook - also from the documentation project; contains descriptions of several low-level facilities and procedures. The most important chapter is 13, Writing FreeBSD device drivers.

- The Blueprints section of the FreeBSD Diary web site - contains several interesting articles on kernel facilities.

- The man pages in section 9 - the most important kernel-land calls are documented here.

- The &man.geom.4; man page and PHK's GEOM slides - for a general introduction to the GEOM subsystem.

- The &man.style.9; man page, if the code should go into the FreeBSD CVS tree.

Preliminaries

The best way to do kernel development is to have (at least) two separate computers. One of these would contain the development environment and sources, and the other would be used to test the newly written code by network-booting and network-mounting filesystems from the first one. This way, if the new code contains bugs and crashes the machine, it won't mess up the sources (and other live data). The second system doesn't even need a proper display - it could be connected to the first one with a serial cable or KVM.

But, since not everybody has two or more computers handy, there are a few things that can be done to prepare an otherwise "live" system for developing kernel code.
Converting a system for development

For any kernel programming, a kernel with INVARIANTS enabled is a must-have. So enter these in your kernel configuration file:

options INVARIANT_SUPPORT
options INVARIANTS

For debugging crash dumps, a kernel with debug symbols is needed:

makeoptions DEBUG=-g

With the usual way of installing the kernel (make installkernel) the debug kernel will not be automatically installed. It's called kernel.debug and located in /usr/obj/usr/src/sys/KERNELNAME/. For convenience it should be copied to /boot/kernel/.

Another convenience is enabling the kernel debugger so you can examine a kernel panic when it happens. For this, enter the following lines in your kernel configuration file:

options KDB
options DDB
options KDB_TRACE

For this to work you might need to set a sysctl (if it's not on by default):

debug.debugger_on_panic=1

Kernel panics will happen, so care should be taken with the filesystem cache. In particular, having softupdates might mean the latest version of a file could be lost if a panic occurs before it's committed to storage. Disabling softupdates incurs a large performance hit (and it still doesn't guarantee data consistency - mounting the filesystem with the "sync" option is needed for that), so as a compromise, the cache delays can be shortened. There are three sysctls that are useful for this (best set in /etc/sysctl.conf):

kern.filedelay=5
kern.dirdelay=4
kern.metadelay=3

The numbers represent seconds.

For debugging kernel panics, kernel core dumps are required. Since a kernel panic might make filesystems unusable, the crash dump is first written to a raw partition. Usually, this is the swap partition (it must be at least as large as the physical RAM in the machine). On the next boot (after filesystems are checked and mounted, and before swap is enabled), the dump is copied to a regular file.
This is controlled with two /etc/rc.conf variables:

dumpdev="/dev/ad0s4b"
dumpdir="/usr/core"

The dumpdev variable specifies the swap partition and dumpdir tells the system where in the filesystem to relocate the core dump on reboot.

Writing kernel core dumps takes a long time, so if you have lots of memory (>256M) and frequent panics it can be frustrating to sit and wait while it's done (twice - first to write it to swap, then to relocate it to the filesystem). It's convenient, then, to limit the amount of RAM the system will use via a /boot/loader.conf tunable:

hw.physmem="256M"

If the panics are frequent and the filesystems large (or you simply don't trust softupdates+background fsck), it's advisable to turn background fsck off via an /etc/rc.conf variable:

background_fsck="NO"

This way, the filesystems will always get checked when needed (with background fsck, a new panic could happen while it's checking the disks). Again, the safest way is not to have many local filesystems, by using another computer as an NFS server.

Starting the project

For the purpose of making gjournal, a new empty subdirectory was created under an arbitrary user-accessible directory. You don't have to create the module directory under /usr/src.

The Makefile

It's good practice to create Makefiles for every nontrivial coding project, which of course includes kernel modules.

Creating the Makefile is simple thanks to the extensive set of helper routines provided by the system. In short, here's how it looks:

SRCS=g_journal.c
KMOD=geom_journal

.include <bsd.kmod.mk>

This Makefile (with changed filenames) will do for any kernel module. If more than one source file is required, list the filenames in the SRCS variable, separated by whitespace.

On FreeBSD kernel programming

Memory allocation

See &man.malloc.9;.
Basic memory allocation is only slightly different from its userland equivalent. Most notably, malloc() and free() accept additional parameters, as described in the man page.

A "malloc type" must be declared in the declaration section of a source file, like this:

static MALLOC_DEFINE(M_GJOURNAL, "gjournal data", "GEOM_JOURNAL Data");

To use the macro, the sys/param.h, sys/kernel.h and sys/malloc.h headers must be included.

There's another mechanism for allocating memory, the UMA (Universal Memory Allocator). See &man.uma.9; for details; it's a special type of allocator mainly used for speedy allocation of lists comprised of same-sized items (for example, dynamic arrays of structs).

Lists and queues

See &man.queue.3;. There are a LOT of cases where a list of things needs to be maintained. Fortunately, this data structure is implemented (in several ways) by C macros included in the system. The most used list type is TAILQ, because it's the most flexible. It's also the one with the largest memory requirements (its elements are doubly-linked) and theoretically the slowest (though the difference is on the order of a few CPU instructions, so it shouldn't be taken too seriously).

If data retrieval speed is very important, see &man.tree.3;.

BIOs

The bio structure is used for any and all input/output operations concerning GEOM. It basically contains information about which device ("provider") should satisfy the request, the request type, offset, length, a pointer to a buffer, and a bunch of "user-specific" flags and fields that can help implement various hacks.

The important thing here is that bios are dealt with asynchronously. That means that, in most parts of the code, there's no analogue to userland's &man.read.2; and &man.write.2; calls, which don't return until a request is done.
Rather, a developer-supplied function is called as a notification when the request completes (or results in an error).

Unfortunately, the asynchronous programming model (also called "event-driven") imposed this way is somewhat harder than the much more common imperative one (at least it takes a while to get used to it). In some cases the helper routines g_write_data() and g_read_data() can be used (but NOT ALWAYS!).

On GEOM programming

Ggate

If maximum performance is not needed, a much simpler way of making a data transformation is to implement it in userland via the ggate (GEOM gate) facility. Unfortunately, there's no easy way to convert between, or even share code between, the two approaches.

GEOM class

A GEOM class has several "class methods" that get called when there's no geom instance available (or they're simply not bound to a single instance):

- .init is called when GEOM becomes aware of a GEOM class (e.g. when the kernel module gets loaded).

- .fini gets called when GEOM abandons the class (e.g. when the module gets unloaded).

- .taste is called next, once for each provider the system has available. If applicable, this function will usually create and start a geom instance.

- .destroy_geom is called when the geom should be disbanded.

- .ctlconf is called when the user requests reconfiguration of an existing geom.

Also defined are the GEOM event functions, which get copied to the geom instance.

The .geom field in the g_class structure is a LIST of geoms instantiated from the class.

These functions are called from the g_event kernel thread.

Softc

The name "softc" is a legacy term for driver-private data. The name most probably comes from the archaic term "software control block". In GEOM, it's a structure (more precisely: a pointer to a structure) that can be attached to a geom instance to hold whatever data is private to the geom instance.
In gjournal (and most of the other GEOM classes), some of its members are:

- struct g_provider *provider : The provider this geom instantiates

- uint16_t n_disks : Number of consumers this geom consumes

- struct g_consumer **disks : Array of struct g_consumer*. (It's not possible to use just single indirection because struct g_consumer* instances are created on our behalf by GEOM.)

The softc structure contains all the state of a geom instance. Every geom instance has its own softc.

Metadata

The format of the metadata is more or less class-dependent, but it MUST start with:

- a 16-byte buffer for a null-terminated signature (usually the class name)

- a uint32 version ID

It's assumed that geom classes know how to handle metadata with version IDs lower than theirs.

Metadata is located in the last sector of the provider (and thus must fit in it).

(All this is implementation-dependent, but all existing code works like that, and it's supported by libraries.)

Labeling/creating a geom

The sequence of events is:

- the user calls the &man.geom.8; utility (or one of its hardlinked friends)

- the utility figures out which geom class it's supposed to handle and searches for the geom_CLASSNAME.so library (usually in /lib/geom)

- it &man.dlopen.3;-s the library and extracts the definitions of command-line parameters and helper functions

In the case of creating/labeling a new geom, this is what happens:

- &man.geom.8; looks in the command-line definition for the command (usually "label") and calls a helper function

- the helper function checks parameters and gathers metadata, which it proceeds to write to all concerned providers

- this "spoils" existing geoms (if any) and initiates a new round of "tasting" of the providers. The intended geom class recognizes the metadata and brings the geom up.
(The above sequence of events is implementation-dependent, but all existing code works like that, and it's supported by libraries.)

Geom command structure

The helper geom_CLASSNAME.so library exports the class_commands structure, which is an array of struct g_command elements. Commands are of uniform format and look like:

verb [-options] geomname [other]

Common verbs are:

- label - to write metadata to devices so they can be recognized at tasting and brought up in geoms

- destroy - to destroy metadata, so the geoms get destroyed

Common options are:

- -v : be verbose

- -f : force

Many actions, such as labeling and destroying metadata, can be performed in userland. For this, struct g_command provides the field gc_func that can be set to a function (in the same .so) that will be called to process a verb. If gc_func is NULL, the command will be passed to the kernel module, to the .ctlreq function of the geom class.

Geoms

Geoms are instances of geom classes. They have internal data (a softc structure) and some functions with which they respond to external events.

The event functions are:

- .access : calculates permissions (read/write/exclusive)

- .dumpconf : returns XML-formatted information about the geom

- .orphan : called when some underlying provider gets disconnected

- .spoiled : called when some underlying provider gets written to

- .start : handles IO

These functions are called from the g_down kernel thread, and there can be no sleeping in this context (no blocking on a mutex or any other kind of lock), which limits what can be done quite a bit but forces the handling to be fast.

Of these, the most important function for doing actual useful work is .start(), which is called when a BIO request arrives for a provider managed by an instance of the geom class.
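The class methods and per-geom event functions fit together roughly as in the following sketch. This is an illustrative example, not gjournal's actual code: all g_example_* names are hypothetical, and a real .taste() would also verify metadata and attach a consumer before returning.

```c
/* A hypothetical GEOM class skeleton; all g_example_* names are invented. */
static struct g_geom *g_example_taste(struct g_class *, struct g_provider *, int);
static int  g_example_destroy_geom(struct gctl_req *, struct g_class *, struct g_geom *);
static void g_example_config(struct gctl_req *, struct g_class *, const char *);
static void g_example_start(struct bio *);
static int  g_example_access(struct g_provider *, int, int, int);
static void g_example_orphan(struct g_consumer *);

static struct g_class g_example_class = {
	.name         = "EXAMPLE",
	.version      = G_VERSION,
	.taste        = g_example_taste,        /* may create a geom instance */
	.destroy_geom = g_example_destroy_geom,
	.ctlreq       = g_example_config,       /* verbs not handled in userland */
};

static struct g_geom *
g_example_taste(struct g_class *mp, struct g_provider *pp, int flags)
{
	struct g_geom *gp;

	/* ... read the last sector of pp and verify the metadata ... */

	gp = g_new_geomf(mp, "%s.example", pp->name);
	/* The event functions are set on the new instance: */
	gp->start = g_example_start;    /* handles IO */
	gp->access = g_example_access;
	gp->orphan = g_example_orphan;
	gp->softc = NULL;               /* would point to private state */
	/* ... attach a consumer to pp, create and start a provider ... */
	return (gp);
}

/* Registers the class with GEOM when the module is loaded. */
DECLARE_GEOM_CLASS(g_example_class, g_example);
```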
Geom threads

There are three kernel threads created and run by the GEOM framework:

- g_down : Handles requests coming from high-level entities (such as a userland request) on the way to physical devices

- g_up : Handles responses from device drivers to requests made by higher-level entities

- g_event : Handles all other cases: creation of geom instances, access counting, "spoil" events, etc.

When a user process issues a "read data X at offset Y of a file" request, this is what happens:

- The filesystem converts the request into a struct bio instance and passes it to the GEOM subsystem. It knows which geom instance should handle it, because filesystems are hosted directly on a geom instance.

- The request ends up as a call to the .start() function made on the g_down thread and reaches the top-level geom instance.

- This top-level geom instance (for example, the partition slicer) determines that the request should be routed to a lower-level instance (for example, the disk driver). It makes a copy of the bio request (bio requests ALWAYS need to be copied between instances, with g_clone_bio()!), modifies the data offset and target provider fields, and dispatches the copy with g_io_request().

- The disk driver also gets the bio request as a call to .start() on the g_down thread. It talks to the hardware, gets the data back, and calls g_io_deliver() on the bio.

- Now, the notification of bio completion "bubbles up" in the g_up thread. First the partition slicer gets .done() called in the g_up thread; it uses the information stored in the bio to free the cloned bio structure (with g_destroy_bio()) and calls g_io_deliver() on the original request.

- The filesystem gets the data and transfers it to userland.
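The cloning step in the walkthrough above can be sketched as part of a transformation's .start() routine. This is a hypothetical example: the softc layout and the g_example_* names are invented for illustration.

```c
/* Sketch of a .start() routine in an intermediate transformation;
 * names and softc fields are illustrative. */
static void
g_example_start(struct bio *bp)
{
	struct g_example_softc *sc;
	struct bio *cbp;

	sc = bp->bio_to->geom->softc;     /* bio_to is our provider */

	/* Never pass bp itself down - always clone it. */
	cbp = g_clone_bio(bp);
	if (cbp == NULL) {
		g_io_deliver(bp, ENOMEM); /* fail the original request */
		return;
	}
	cbp->bio_offset += sc->sc_offset; /* the actual transformation */
	cbp->bio_done = g_std_done;       /* completion bubbles up via g_up */
	g_io_request(cbp, sc->sc_consumer);
}
```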
See the &man.g.bio.9; man page for information on how the data is passed back and forth in the bio structure (note in particular the bio_parent and bio_children fields and how they are handled).

One important feature is: THERE CAN BE NO SLEEPING IN THE G_UP AND G_DOWN THREADS. This means that none of the following things can be done in those threads (the list is of course not complete, but only informative):

- Calls to msleep() and tsleep(), obviously.

- Calls to g_write_data() and g_read_data(), because these sleep between passing the data to consumers and returning.

- Calls to &man.malloc.9; and uma_zalloc() with the M_WAITOK flag set.

- sx locks.

This restriction is here to stop GEOM code from clogging the IO request path; because sleeping code is usually not time-bound, there can be no guarantees on how long it will take (there are also some other, more technical reasons). It also means that there's not much that can be done in those threads; for example, almost anything complex requires memory allocation. Fortunately, there is a way out: creating additional kernel threads.

Kernel threads for use in geom code

Kernel threads are created with the &man.kthread.create.9; function, and they are somewhat similar to userland threads in behaviour, only they can't return to the caller to signify termination but must call &man.kthread.exit.9; instead.

In geom code, the usual use of threads is to offload the processing of requests from the g_down thread (the .start() function). These threads look like "event handlers": they have a linked list of events associated with them (on which events can be posted by various functions in various threads, so it must be protected by a mutex), they take the events from the list one by one, and they process them in a big switch() statement.

The main benefit of using a thread to handle IO requests is that it can sleep when needed.
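The event-handler pattern described above might be sketched like this. It is a simplified, hypothetical example: the queue layout, event types, malloc type and g_example_* names are invented, and teardown is omitted.

```c
/* A hypothetical per-class worker thread that processes queued events,
 * posted e.g. from .start(); sleeping is allowed here. */
struct g_example_event {
	struct bio			*ev_bio;
	int				 ev_type;
	TAILQ_ENTRY(g_example_event)	 ev_next;
};

static TAILQ_HEAD(, g_example_event) g_example_queue =
    TAILQ_HEAD_INITIALIZER(g_example_queue);
static struct mtx g_example_queue_mtx;

static void
g_example_worker(void *arg)
{
	struct g_example_event *ev;

	for (;;) {
		mtx_lock(&g_example_queue_mtx);
		while ((ev = TAILQ_FIRST(&g_example_queue)) == NULL)
			msleep(&g_example_queue, &g_example_queue_mtx,
			    PRIBIO, "evwait", 0);
		TAILQ_REMOVE(&g_example_queue, ev, ev_next);
		mtx_unlock(&g_example_queue_mtx);

		switch (ev->ev_type) {
		/* ... process the event, sleeping if necessary ... */
		}
		free(ev, M_EXAMPLE);   /* M_EXAMPLE: a hypothetical malloc type */
	}
	/* A real thread would eventually call kthread_exit(). */
}
```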
Now, this sounds good, but it should be carefully thought out. Sleeping is very convenient, but it can also very effectively destroy the performance of the geom transformation. Extremely performance-sensitive classes should probably do all the work in the .start() function call, taking great care to handle out-of-memory and similar errors.

The other benefit of having an event-handler thread like that is that it serializes all the requests and responses coming from different geom threads into one thread. This is also very convenient, but can be slow. In most cases, handling of .done() requests can be left to the g_up thread.

Mutexes in the FreeBSD kernel (see the &man.mutex.9; man page) have one distinction from their more common userland cousins - the code can't sleep while holding a mutex. If the code needs to sleep a lot, &man.sx.9; locks may be more appropriate. (On the other hand, if you do almost everything in a single thread, you may get away with no mutexes at all.)