diff --git a/en_US.ISO8859-1/articles/diskless-x/article.sgml b/en_US.ISO8859-1/articles/diskless-x/article.sgml index 235a346099..ec16577767 100644 --- a/en_US.ISO8859-1/articles/diskless-x/article.sgml +++ b/en_US.ISO8859-1/articles/diskless-x/article.sgml @@ -1,349 +1,349 @@ %man; ]>
Diskless X Server: a how to guide Jerry Kendall
jerry@kcis.com
28-December-1996 1996 Jerry Kendall With the help of some friends on the FreeBSD-hackers list, I have been able to create a diskless X terminal. The creation of the X terminal required first creating a diskless system with minimal utilities mounted via NFS. These same steps were used to create 2 separate diskless systems. The first is altair.example.com, a diskless X terminal that I run on my old 386DX-40. It has a 340Meg hard disk, but I did not want to change it, so it boots from antares.example.com across an Ethernet. The second system is a 486DX2-66. I set up a complete diskless FreeBSD system that uses no local disk. The server in that case is a Sun 670MP running SunOS 4.1.3. The same setup configuration was needed for both. I am sure that there is stuff that needs to be added to this. Please send me any comments.
Creating the boot floppy (On the diskless system) Since the network boot loaders will not work with some of the TSR's and such that MS-DOS uses, it is best to create a dedicated boot floppy or, if you can, create an MS-DOS menu that will (via the config.sys/autoexec.bat files) ask what configuration to load when the system starts. The latter is the method that I use and it works great. My MS-DOS (6.x) menu is below. <filename>config.sys</filename> [menu] menuitem=normal, normal menuitem=unix, unix [normal] .... normal config.sys stuff ... [unix] <filename>autoexec.bat</filename> @ECHO OFF goto %config% :normal ... normal autoexec.bat stuff ... goto end :unix cd \netboot nb8390.com :end Getting the network boot programs (On the server) - Compile the 'net-boot' programs that are located in + Compile the net-boot programs that are located in /usr/src/sys/i386/boot/netboot. You should read the comments at the top of the Makefile. Adjust as required. Make a backup of the original in case it gets foobar'd. When the build is done, there should be 2 MS-DOS executables, nb8390.com and nb3c509.com. One of these two programs will be what you need to run on the diskless system. It will load the kernel from the boot server. At this point, put both programs on the MS-DOS boot floppy created earlier. Determine which program to run (On the diskless system) If you know the chipset that your Ethernet adapter uses, this is easy. If you have the NS8390 chipset, or an NS8390-based chipset, use nb8390.com. If you have a 3Com 509 based chipset, use the nb3c509.com boot program. If you are not sure which you have, try using one; if it says No adapter found, try the other. Beyond that, you are pretty much on your own. Booting across the network Boot the diskless system without any config.sys/autoexec.bat files. Try running the boot program for your Ethernet adapter. My Ethernet adapter is running in WD8013 16bit mode so I run nb8390.com C:> cd \netboot C:> nb8390 Boot from Network (Y/N) ? 
Y BOOTP/TFTP/NFS bootstrap loader ESC for menu Searching for adapter.. WD8013EBT base 0x0300, memory 0x000D8000, addr 00:40:01:43:26:66 Searching for server... At this point, my diskless system is trying to find a machine to act as a boot server. Make note of the addr line above, you will need this number later. Reset the diskless system and modify your config.sys and autoexec.bat files to do these steps automatically for you. Perhaps in a menu. If you had to run nb3c509.com instead of nb8390.com the output is the same as above. If you got No adapter found at the Searching for adapter... message, verify that you did indeed set the compile time defines in the Makefile correctly. Allowing systems to boot across the network (On the server) Make sure the /etc/inetd.conf file has entries for tftp and bootps. Mine are listed below: tftp dgram udp wait nobody /usr/libexec/tftpd tftpd /tftpboot # # Additions by who ever you are bootps dgram udp wait root /usr/libexec/bootpd bootpd /etc/bootptab If you have to change the /etc/inetd.conf file, send a HUP signal to inetd. To do this, get the process ID of inetd with ps -ax | grep inetd | grep -v grep. Once you have it, send it a HUP signal. Do this by kill -HUP <pid>. This will force inetd to re-read its config file. Did you remember to note the addr line from the output of the boot loader on the diskless system? Guess what, here is where you need it. Add an entry to /etc/bootptab (maybe creating the file). It should be laid out identical to this: altair:\ :ht=ether:\ :ha=004001432666:\ :sm=255.255.255.0:\ :hn:\ :ds=199.246.76.1:\ :ip=199.246.76.2:\ :gw=199.246.76.1:\ :vm=rfc1048: The lines are as follows: altair the diskless systems name without the domain name. ht=ether - the hardware type of 'ethernet'. + the hardware type of ethernet. ha=004001432666 the hardware address (the number noted above). sm=255.255.255.0 the subnet mask. hn tells server to send client's hostname to the client. 
ds=199.246.76.1 tells the client who the domain server is. ip=199.246.76.2 tells the client what its IP address is. gw=199.246.76.1 tells the client what the default gateway is. vm=... just leave it there. Be sure to setup the IP addresses correctly, the addresses above are my own. Create the directory /tftpboot on the server it will contain the configuration files for the diskless systems that the server will serve. These files will be named cfg.ip where ip is the IP address of the diskless system. The config file for altair is /tftpboot/cfg.199.246.76.2. The contents is: rootfs 199.246.76.1:/DiskLess/rootfs/altair hostname altair.example.com The line hostname altair.example.com simply tells the diskless system what its fully qualified domain name is. The line rootfs 199.246.76.1:/DiskLess/rootfs/altair tells the diskless system where its NFS mountable root filesystem is located. The NFS mounted root filesystem will be mounted read only. The hierarchy for the diskless system can be re-mounted allowing read-write operations if required. I use my spare 386DX-40 as a dedicated X terminal. The hierarchy for altair is: / /bin /etc /tmp /sbin /dev /dev/fd /usr /var /var/run The actual list of files is: -r-xr-xr-x 1 root wheel 779984 Dec 11 23:44 ./kernel -r-xr-xr-x 1 root bin 299008 Dec 12 00:22 ./bin/sh -rw-r--r-- 1 root wheel 499 Dec 15 15:54 ./etc/rc -rw-r--r-- 1 root wheel 1411 Dec 11 23:19 ./etc/ttys -rw-r--r-- 1 root wheel 157 Dec 15 15:42 ./etc/hosts -rw-r--r-- 1 root bin 1569 Dec 15 15:26 ./etc/XF86Config.altair -r-x------ 1 bin bin 151552 Jun 10 1995 ./sbin/init -r-xr-xr-x 1 bin bin 176128 Jun 10 1995 ./sbin/ifconfig -r-xr-xr-x 1 bin bin 110592 Jun 10 1995 ./sbin/mount_nfs -r-xr-xr-x 1 bin bin 135168 Jun 10 1995 ./sbin/reboot -r-xr-xr-x 1 root bin 73728 Dec 13 22:38 ./sbin/mount -r-xr-xr-x 1 root wheel 1992 Jun 10 1995 ./dev/MAKEDEV.local -r-xr-xr-x 1 root wheel 24419 Jun 10 1995 ./dev/MAKEDEV Do not forget to run MAKEDEV all in the dev directory. 
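The bootptab entry and the cfg.ip file name described above are both derived mechanically from the boot loader's addr line and the client's IP address. The following is a minimal Python sketch of that derivation, not part of the original setup: the function names are hypothetical, and the indentation of the continuation lines is a cosmetic assumption.

```python
def bootptab_entry(host, mac, ip, netmask, dns, gateway):
    """Format one /etc/bootptab entry in the layout shown above."""
    # The boot loader prints the address as 00:40:01:43:26:66;
    # bootptab's ha= tag takes the same digits without the colons.
    ha = mac.replace(":", "").lower()
    fields = [
        "ht=ether",
        f"ha={ha}",
        f"sm={netmask}",
        "hn",
        f"ds={dns}",
        f"ip={ip}",
        f"gw={gateway}",
        "vm=rfc1048",
    ]
    # Every line except the last ends in a backslash continuation.
    lines = [f"{host}:\\"]
    lines += [f"    :{field}:\\" for field in fields[:-1]]
    lines.append(f"    :{fields[-1]}:")
    return "\n".join(lines)


def tftp_config_name(ip):
    """Name of the per-client config file the boot loader fetches via TFTP."""
    return f"/tftpboot/cfg.{ip}"


entry = bootptab_entry(
    "altair", "00:40:01:43:26:66",
    ip="199.246.76.2", netmask="255.255.255.0",
    dns="199.246.76.1", gateway="199.246.76.1",
)
print(entry)
print(tftp_config_name("199.246.76.2"))
```

The values passed in are the ones used for altair in this article; substitute your own addresses.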
My /etc/rc for altair is: #!/bin/sh # PATH=/bin:/ export PATH # # configure the localhost /sbin/ifconfig lo0 127.0.0.1 # # configure the ethernet card /sbin/ifconfig ed0 199.246.76.2 netmask 0xffffff00 # # mount the root filesystem via NFS /sbin/mount antares:/DiskLess/rootfs/altair / # # mount the /usr filesystem via NFS /sbin/mount antares:/DiskLess/usr /usr # /usr/X11R6/bin/XF86_SVGA -query antares -xf86config /etc/XF86Config.altair > /dev/null 2>&1 # # Reboot after X exits /sbin/reboot # # We blew up.... exit 1 Any comments and all questions welcome.
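A side note on the rc script above: ifconfig is given the netmask in hexadecimal (0xffffff00), while the bootptab entry uses dotted-quad notation (sm=255.255.255.0). The two must describe the same mask. A small Python sketch, with hypothetical helper names, shows one way to check that they agree:

```python
import ipaddress


def hex_netmask_to_dotted(hex_mask):
    """Convert a mask such as '0xffffff00' to dotted-quad form."""
    return str(ipaddress.IPv4Address(int(hex_mask, 16)))


def prefix_length(dotted_mask):
    """Return the prefix length (e.g. 24) for a dotted-quad netmask."""
    return ipaddress.IPv4Network(f"0.0.0.0/{dotted_mask}").prefixlen


dotted = hex_netmask_to_dotted("0xffffff00")
print(dotted)                 # 255.255.255.0
print(prefix_length(dotted))  # 24
```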
diff --git a/en_US.ISO8859-1/articles/filtering-bridges/article.sgml b/en_US.ISO8859-1/articles/filtering-bridges/article.sgml index de0ffd3172..af4cf51e9a 100644 --- a/en_US.ISO8859-1/articles/filtering-bridges/article.sgml +++ b/en_US.ISO8859-1/articles/filtering-bridges/article.sgml @@ -1,356 +1,356 @@ %man; ]>
Filtering Bridges Nick Sayer
nsayer@FreeBSD.org
$FreeBSD$ For those of you who do not know, DSL differs from more traditional - connectivity methods in that the "connectivity spigot" that comes + connectivity methods in that the connectivity spigot that comes out of the wall has no possibility for packet filtering. If you get a T1 line or some such it will come with a router that can generally include a packet filter. If you get ISDN or a dialup link, you also either have a software routing component (a PPP daemon, specifically) that can do some filtering or can be combined with a filter on the machine running the link. But with DSL you only get a little white box with some Blinkenlights on it and an Ethernet port that takes your traffic back and forth from the Internet and nothing else (to some extent the same can be said of other mass-market high speed connectivity methods, like cable modems or high speed wireless links as well. The same technique I plan to describe works just as well for them, or for any other technology that provides an Ethernet port with no filtering).
Why use a filtering bridge? Bridging is not the only conceivable option. It is possible to set up a machine with two Ethernet interfaces as a router instead of a bridge. Where it is possible to do so, it is actually a better idea. Bridges run their interfaces in promiscuous mode, meaning they must process every packet presented to them. The problem is that routers can only route traffic between different subnets. Also, subnets can only be made by cutting an existing space in half or defining a new space that is typically unroutable (see RFC 1918). This wastes half of the useful addresses (or at least puts - them on the "wrong" side of the router -- the thing that is + them on the wrong side of the router—the thing that is doing the packet filtering that makes the inside network safe). Using a bridge costs some CPU cycles, but makes all of the problems of adding a 2nd router go away. Configuring a Kernel After configuring and installing a kernel as shown here, you should carry out the other final preparation tasks before booting into your new kernel. Adding bridging to a FreeBSD machine is not hard to do. It means having 2 (or more, but we will just use 2 here) Ethernet cards and adding a couple of lines to the kernel configuration. Since May of 2000, RELENG_4 and -current have had bridging support for all Ethernet interfaces. This does not mean that any Ethernet interface will work. For them to work, they have to support a working promiscuous mode for - both reception and transmission -- that is, they have to be able to + both reception and transmission—that is, they have to be able to transmit Ethernet packets with any source address, not just their own. In order to get good throughput, the cards should also be PCI bus mastering cards. The best choices still are the Intel EtherExpress Pro 100 cards, with 3Com 3c9xx cards being second.
So you will want to add the following to your kernel configuration file: device fxp (or whatever is appropriate for the cards you are using) options BRIDGE options IPFIREWALL options IPFIREWALL_VERBOSE Note that recent versions of FreeBSD support dynamically loading the IP Firewall code into the kernel. You cannot do this, however, with bridging, as the bridge code itself needs to interact with IPFIREWALL in a special way. It is also a good idea at this point to see if Luigi has updated versions of the bridge code available that are more recent than what is in the distribution. As an example, 3.3-RELEASE comes with 981214, but as of this writing, the most up-to-date bridge code is 990810. You can fetch the latest version from http://www.iet.unipi.it/~luigi/. You will want to fetch bridge.c and bridge.h and drop them into sys/net/. For instructions on how to build and install a new kernel, refer to the Building and Installing a Custom Kernel section of the handbook. Final Preparation Before you boot the new kernel, you must make some preparations in rc.boot and rc.firewall. The default rule for the firewall is to drop all packets on the floor. You - will want to override this by setting up the 'open' firewall in + will want to override this by setting up the open firewall in /etc/rc.conf. Put these lines in /etc/rc.conf to achieve this: firewall_enable="YES" firewall_type="open" There is one more thing that is necessary. When running IP over Ethernet, there are actually two Ethernet protocols in use. One is IP, the other is ARP. ARP is used when a machine must figure out what Ethernet address corresponds to a given IP address. ARP is not a part of the IP layer, since it only applies to IP when run over Ethernet. The standard ipfirewall rule for the open firewall is pass ip from any to any but what about ARP? If ARP is not passed, no IP traffic can flow at all. But IPFIREWALL has no provisions for dealing with non-IP protocols, and that includes ARP.
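On the wire, the two protocols are told apart by the 16-bit EtherType field of the Ethernet header: 0x0800 for IP and 0x0806 for ARP. The Python sketch below is only an illustration, assuming an untagged Ethernet II frame; the sample frame is made up. It reads that field and prints it in decimal, which is the form needed whenever an Ethernet protocol has to be named by number.

```python
import struct

ETHERTYPE_IP = 0x0800   # IPv4
ETHERTYPE_ARP = 0x0806  # ARP


def ethertype(frame):
    """Return the EtherType of an untagged Ethernet II frame."""
    # Bytes 0-5: destination MAC, 6-11: source MAC,
    # 12-13: EtherType in network (big-endian) byte order.
    (value,) = struct.unpack("!H", frame[12:14])
    return value


# Hypothetical broadcast ARP frame: header followed by a zeroed 28-byte body.
frame = bytes.fromhex("ffffffffffff" "004001432666" "0806") + bytes(28)

print(ethertype(frame))  # 2054
print(ETHERTYPE_IP)      # 2048
```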
Fortunately, a hackish extension was made to the ipfirewall code to assist filtering bridges. If you set up a special rule for UDP packets from IP address 0.0.0.0, the UDP port number will be used to match the Ethernet protocol number for bridged packets. In this way your bridge can be configured to pass or reject non-IP protocols. So add this line just below the two lines near the top of /etc/rc.firewall that deal with lo0 (the ones that say that you should almost never change those two rules). ${fwcmd} add allow udp from 0.0.0.0 2054 to 0.0.0.0 The port number 2054 is the decimal form of 0x0806, the Ethernet protocol number for ARP. This rule makes almost no sense at all from a normal perspective on IPFIREWALL, but the bridge code will use it to pass ARP packets without restriction (which you almost certainly want to do). Now you should be able to reboot your machine and have it act no differently than it did before. There will be some new boot messages about bridging, but the bridging will not be enabled. If there are any problems, you should try and sort them out at this point before proceeding. Enabling The Bridge Next, you should do this: &prompt.root; sysctl -w net.link.ether.bridge_ipfw=1 &prompt.root; sysctl -w net.link.ether.bridge=1 At this point, the bridge should be enabled, and because of the previous changes to /etc/rc.conf, the firewall should be wide open. You should now be able to insert the machine between two sets of hosts and go back and forth without difficulty. If so, the next step is either to add those two sysctl lines to /etc/rc.local or to add the net.link.[blah blah]=1 portions of the lines to /etc/sysctl.conf (which path you take depends on what version of FreeBSD you have). Now before we started all of this, you should have had a machine with two Ethernet interfaces, but with only one of them configured. That is, there should only be one ifconfig line in /etc/rc.conf. With the bridge in place, that is still true. But there is a detail that deserves some thought. The bridge is not in place by default.
That means that until the sysctls are run that turn the bridge on, rather late in the startup, it is still an ordinary machine with two interfaces, only one of which is configured by /etc/rc.conf. This becomes important for those portions of the startup that require network access, say for DNS resolution. Some care must be made in picking which interface is going to be the configured one. In most cases, you are best to pick the - "outside" one (that is, the interface connected to the Internet). Let's + outside one (that is, the interface connected to the Internet). Let's presume for the sake of the examples to come, that - fxp0 is the "outside" interface, and - fxp1 is the "inside" one. That means that fxp0 + fxp0 is the outside interface, and + fxp1 is the inside one. That means that fxp0 should be mentioned in /etc/rc.conf's ifconfig sections, but fxp1 should not be. The sysctl that turns the bridge on will make fxp1 start working automagically. Configuring The Firewall Now it is time to start adding ipfirewall rules to secure the inside network. There are some complications in doing this because not all of the ipfirewall functionality is available on bridged packets. Also, there is a difference between packets that are in the process of being bridged and packets that are being received by the local machine. In general, packets being bridged are only run through ipfirewall once, not twice as is usually the case. Bridged packets are filtered while they - are being received, so rules that use 'out' or 'xmit' will never match. - I usually use 'in via' which is an older syntax, but one that makes + are being received, so rules that use out or xmit will never match. + I usually use in via which is an older syntax, but one that makes sense as you read it. Another limitation is that you are restricted - only to 'pass' or 'drop' for filtering bridged packets. Sophisticated - things like 'divert' or 'forward' or 'reject' are not available. 
Such + only to pass or drop for filtering bridged packets. Sophisticated + things like divert or forward or reject are not available. Such options can still be used, but only on traffic to or from the bridge machine itself. New in FreeBSD 4.0 is the concept of stateful filtering. This is a big boost for UDP traffic, which typically is a request going out, followed shortly thereafter by a response with the exact same set of IP addresses and port numbers (but with source and dest reversed, of course). For firewalls that have no statekeeping, there is almost no way to deal with this sort of traffic short of setting up proxies. But - a firewall that can "remember" an outgoing UDP packet and for the next + a firewall that can remember an outgoing UDP packet and for the next few minutes allow a response, handling UDP services is trivial. The example to follow shows how to do this. The truly paranoid can also set up rules like this to handle TCP. This allows you to avoid some sorts of denial of service attacks or other nasty tricks, but it also typically makes your state table mushroom in size. Let's look at an example setup. Note first that at the top of /etc/rc.firewall we should already have taken care of the loopback interface and the special hack for ARP should still be in place. So we will not worry about them any further. us_ip=192.168.1.1 oif=fxp0 iif=fxp1 # Things that we've kept state on before get to go through in a hurry. 
${ipfw} add check-state

# Throw away RFC 1918 networks
${ipfw} add deny log ip from 10.0.0.0/8 to any in via ${oif}
${ipfw} add deny log ip from 172.16.0.0/12 to any in via ${oif}
${ipfw} add deny log ip from 192.168.0.0/16 to any in via ${oif}

# Allow the bridge machine to say anything it wants (keep state if UDP)
${ipfw} add pass udp from ${us_ip} to any keep-state
${ipfw} add pass ip from ${us_ip} to any

# Allow the inside net to say anything it wants (keep state if UDP)
${ipfw} add pass udp from any to any in via ${iif} keep-state
${ipfw} add pass ip from any to any in via ${iif}

# Allow all manner of ICMP
${ipfw} add pass icmp from any to any

# TCP section
# established TCP sessions are ok everywhere.
${ipfw} add pass tcp from any to any established
# Pass the "quarantine" range.
${ipfw} add pass tcp from any to any 49152-65535 in via ${oif}
# Pass ident probes. It's better than waiting for them to timeout
${ipfw} add pass tcp from any to any 113 in via ${oif}
# Pass SSH.
${ipfw} add pass tcp from any to any 22 in via ${oif}
# Pass DNS. Only if you have name servers inside.
#${ipfw} add pass tcp from any to any 53 in via ${oif}
# Pass SMTP to the mail server only
${ipfw} add pass tcp from any to mailhost 25 in via ${oif}

# UDP section
# Pass the "quarantine" range.
${ipfw} add pass udp from any to any 49152-65535 in via ${oif}
# Pass DNS. Only if you have name servers inside.
#${ipfw} add pass udp from any to any 53 in via ${oif}

# Everything else is suspect
${ipfw} add deny log ip from any to any

Those of you who have set up firewalls before may notice some things missing. In particular, there are no anti-spoofing rules. That is, we did not add:

${ipfw} add deny ip from ${us_ip}/24 to any in via ${oif}

That is, drop packets claiming to be from our network that are coming in from the outside.
This is something that you would commonly do to make sure that someone does not try and evade the packet filter by generating nefarious packets that look like they are from the inside. The problem with that is that there is at least one host on the outside - interface that you do not want to ignore -- your router. In my + interface that you do not want to ignore—your router. In my particular case, I have some machines on the outside and some on the inside, but I do not necessarily want the outside machines to have routine access to the inside. At the same time, I do not want to throw their traffic away. In my own case, my ISP anti-spoofs at their router, so I do not need to bother. And in general, the fewer rules the better, since it will take time and CPU to process each one. Note also that the last rule is almost an exact duplicate of the default rule 65535. There are two major differences when it comes to bridging, however. Our rule logs what it drops, of course, but our rule will only apply to IP traffic. Apart from the UDP 0.0.0.0 trick there is no way to deal with non-IP traffic, so the default rule at 65535 will drop ALL traffic, not merely all IP traffic. So the net effect is that unmatched IP traffic will be logged, but not non-IP traffic. If you want, you can add options IPFIREWALL_DEFAULT_TO_ACCEPT to your kernel configuration and non-IP traffic will be passed instead of dropped. But in the case of a filtering bridge between you and the Internet, it is unlikely that you would want to do this (if you are sufficiently paranoid). There is a rule for passing SMTP to a mailhost if you have one. Obviously the whole ruleset above should be flavored to taste, and that is an example of a specific service exemption. Note that - in order for 'mailhost' to work, name service lookups must work + in order for mailhost to work, name service lookups must work BEFORE the bridge is enabled. This is an example of making sure that you enable the correct interface.
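The address tests behind the RFC 1918 deny rules and the omitted anti-spoofing rule can be tried out away from the firewall. The Python sketch below is only an illustration using the standard ipaddress module. It does not model ipfw itself, and the function names are hypothetical.

```python
import ipaddress

# The three private ranges reserved by RFC 1918.
RFC1918 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(addr):
    """True if addr falls inside one of the RFC 1918 ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)


def looks_spoofed(src, inside_net):
    """True if a packet seen on the outside interface claims an inside source.

    inside_net may be written like ${us_ip}/24, i.e. a host address plus a
    prefix length; strict=False masks off the host bits.
    """
    network = ipaddress.ip_network(inside_net, strict=False)
    return ipaddress.ip_address(src) in network


print(is_rfc1918("192.168.1.1"))    # True
print(is_rfc1918("192.68.1.1"))     # False: 192.68.0.0/16 is ordinary public space
print(looks_spoofed("192.168.1.77", "192.168.1.1/24"))  # True
```

The second call shows why a single dropped digit in the third deny rule matters: 192.68.0.0/16 is not a private range, so a rule written that way would discard legitimate traffic and let 192.168.0.0/16 through.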
Another item to note is that the DNS rules are set up only to allow DNS servers to work. This means that if you do not set up a DNS server, you do not need them. Folks used to setting up IP firewalls also probably are used to - either having a 'reset' or a 'forward' rule for ident packets + either having a reset or a forward rule for ident packets (TCP port 113). Unfortunately, this is not an option with the bridging code, so the path of least resistance is to simply pass them to their destination. As long as that destination machine is not running an ident daemon, this is relatively harmless. The alternative is dropping port 113 connections, which makes firing up things like IRC take forever (the ident probe must time out). The only other thing that is a little weird that you may have noticed - is that there is a rule to let ${us_ip} speak and a separate rule to + is that there is a rule to let ${us_ip} speak and a separate rule to allow the inside network to speak. Remember that this is because the two sets of traffic will be taking different paths through the kernel and into the packet filter. The inside net will be going through the bridge code. The local machine, however, will be using the normal IP stack to speak. Thus the two rules to handle the different cases. The in via ${oif} rules work for both paths. In general if you use in via rules throughout the filter, you will need to make an exception for - locally generated packets, because they did not "come in" via + locally generated packets, because they did not come in via anything. Contributors To some extent the material for this discussion is a combination of the items that were discussed by Luigi Rizzo in his Dummynet lecture at FreeBSDcon '99 and by Mark Murray during his Network Security lecture. In addition, for quite some time now I have been putting together filtering bridges for friends and colleagues who were getting DSL connections for their home.
diff --git a/en_US.ISO8859-1/articles/freebsd-questions/article.sgml b/en_US.ISO8859-1/articles/freebsd-questions/article.sgml index a6eb23c0c0..f36c3c3fe6 100644 --- a/en_US.ISO8859-1/articles/freebsd-questions/article.sgml +++ b/en_US.ISO8859-1/articles/freebsd-questions/article.sgml @@ -1,564 +1,564 @@ %man; ]>
How to get best results from the FreeBSD-questions mailing list Greg Lehey
grog@FreeBSD.org
$FreeBSD$ This document provides useful information for people looking to prepare an e-mail to the FreeBSD-questions mailing list. Advice and hints are given that will maximise the chance that the reader will receive useful replies. This document is regularly posted to the FreeBSD-questions mailing list.
Introduction FreeBSD-questions is a mailing list maintained by the FreeBSD project to help people who have questions about the normal use of FreeBSD. Another group, FreeBSD-hackers, discusses more advanced questions such as future development work. The term hacker has nothing to do with breaking into other people's computers. The correct term for the latter activity is cracker, but the popular press has not found out yet. The FreeBSD hackers disapprove strongly of cracking security, and have nothing to do with it. For a longer description of hackers, see Eric Raymond's How To Become A Hacker This is a regular posting aimed to help both those seeking advice from FreeBSD-questions (the newcomers), and also those who answer the questions (the hackers). Inevitably there is some friction, which stems from the different viewpoints of the two groups. The newcomers accuse the hackers of being arrogant, stuck-up, and unhelpful, while the hackers accuse the newcomers of being stupid, unable to read plain English, and expecting everything to be handed to them on a silver platter. Of course, there is an element of truth in both these claims, but for the most part these viewpoints come from a sense of frustration. In this document, I would like to do something to relieve this frustration and help everybody get better results from FreeBSD-questions. In the following section, I recommend how to submit a question; after that, we will look at how to answer one. How to subscribe to FreeBSD-questions FreeBSD-questions is a mailing list, so you need mail access. Send a mail message to majordomo@FreeBSD.org with the single line: subscribe FreeBSD-questions majordomo is an automatic program which maintains the mailing list, so you do not need a subject line. If your mailer complains, however, you can put anything you like in the subject line. When you get the reply from majordomo telling you the details of the list, please save it. 
If you ever should want to leave the list, you will need the information there. See the next section for more details. How to unsubscribe from FreeBSD-questions When you subscribed to FreeBSD-questions, you got a welcome message from Majordomo@FreeBSD.ORG. In this message, amongst other things, it told you how to unsubscribe. Here is a typical message: Welcome to the freebsd-questions mailing list! If you ever want to remove yourself from this mailing list, you can send mail to "Majordomo@FreeBSD.ORG" with the following command in the body of your email message: unsubscribe freebsd-questions Greg Lehey <grog@lemis.de> Here's the general information for the list you've subscribed to, in case you don't already have it: FREEBSD-QUESTIONS User questions This is the mailing list for questions about FreeBSD. You should not send "how to" questions to the technical lists unless you consider the question to be pretty technical. Normally, unsubscribing is even simpler than the message suggests: you do not need to specify your mail ID unless it is different from the one which you specified when you subscribed. If Majordomo replies and tells you (incorrectly) that you are not on the list, this may mean one of two things: You have changed your mail ID since you subscribed. That is where keeping the original message from majordomo comes in handy. For example, the sample message above shows my mail ID as grog@lemis.de. Since then, I have changed it to grog@lemis.com. If I were to try to remove grog@lemis.com from the list, it would fail: I would have to specify the name with which I joined. You are subscribed to a mailing list which is subscribed to FreeBSD-questions. If that is the case, you will have to figure out which one it is and get your name taken off that one. If you are not sure which one it might be, check the headers of the messages you receive from freebsd-questions: maybe there is a clue there. 
If you have done all this, and you still can not figure out what is going on, send a message to Postmaster@FreeBSD.org, and he will sort things out for you. Do not send a message to FreeBSD-questions: they can not help you. Should I ask <literal>-questions</literal> or <literal>-hackers</literal>? Two mailing lists handle general questions about FreeBSD, FreeBSD-questions and FreeBSD-hackers. In some cases, it is not really clear which group you should ask. The following criteria should help for 99% of all questions, however: If the question is of a general nature, ask FreeBSD-questions. Examples might be questions about installing FreeBSD or the use of a particular UNIX utility. If you think the question relates to a bug, but you are not sure, or you do not know how to look for it, send the message to FreeBSD-questions. If the question relates to a bug, and you are sure that it is a bug (for example, you can pinpoint the place in the code where it happens, and you maybe have a fix), then send the message to FreeBSD-hackers. If the question relates to enhancements to FreeBSD, and you can make suggestions about how to implement them, then send the message to FreeBSD-hackers. There are also a number of other specialized mailing lists, for example FreeBSD-isp, which caters to the interests of ISPs (Internet Service Providers) who run FreeBSD. If you happen to be an ISP, this does not mean you should automatically send your questions to FreeBSD-isp. The criteria above still apply, and it is in your interest to stick to them, since you are more likely to get good results that way. How to submit a question When submitting a question to FreeBSD-questions, consider the following points: Remember that nobody gets paid for answering a FreeBSD question. They do it of their own free will. You can influence this free will positively by submitting a well-formulated question supplying as much relevant information as possible. 
You can influence this free will negatively by submitting an incomplete, illegible, or rude question. It is perfectly possible to send a message to FreeBSD-questions and not get an answer even if you follow these rules. It is much more possible to not get an answer if you do not. In the rest of this document, we will look at how to get the most out of your question to FreeBSD-questions. Not everybody who answers FreeBSD questions reads every message: they look at the subject line and decide whether it interests them. - Clearly, it is in your interest to specify a subject. ``FreeBSD - problem'' or ``Help'' are not enough. If you provide no subject at + Clearly, it is in your interest to specify a subject. FreeBSD + problem or Help are not enough. If you provide no subject at all, many people will not bother reading it. If your subject is not specific enough, the people who can answer it may not read it. Format your message so that it is legible, and PLEASE DO NOT SHOUT!!!!!. We appreciate that a lot of people do not speak English as their first language, and we try to make allowances for that, but it is really painful to try to read a message written full of typos or without any line breaks. Do not underestimate the effect that a poorly formatted mail message has, not just on the FreeBSD-questions mailing list. Your mail message is all people see of you, and if it is poorly formatted, one line per paragraph, badly spelt, or full of errors, it will give people a poor impression of you. A lot of badly formatted messages come from bad mailers or badly configured mailers. The following mailers are known to send out badly formatted messages without you finding out about them: cc:Mail Eudora exmh Microsoft Exchange Microsoft Internet Mail Microsoft Outlook Netscape As you can see, the mailers in the Microsoft world are frequent offenders. If at all possible, use a UNIX mailer. If you must use a mailer under Microsoft environments, make sure it is set up correctly. 
Try not to use MIME: a lot of people use mailers which do not get on very well with MIME. Make sure your time and time zone are set correctly. This may seem a little silly, since your message still gets there, but many of the people you are trying to reach get several hundred messages a day. They frequently sort the incoming messages by subject and by date, and if your message does not come before the first answer, they may assume they missed it and not bother to look. Do not include unrelated questions in the same message. Firstly, a long message tends to scare people off, and secondly, it is more difficult to get all the people who can answer all the questions to read the message. Specify as much information as possible. This is a difficult area, and we need to expand on what information you need to submit, but here is a start: In nearly every case, it is important to know the version of FreeBSD you are running. This is particularly the case for FreeBSD-CURRENT, where you should also specify the date of the sources, though of course you should not be sending questions about -CURRENT to FreeBSD-questions. With any problem which could be hardware related, tell us about your hardware. In case of doubt, assume it is possible that it is hardware. What kind of CPU are you using? How fast? What motherboard? How much memory? What peripherals? There is a judgement call here, of course, but the output of the &man.dmesg.8; command can frequently be very useful, since it tells not just what hardware you are running, but what version of FreeBSD as well. If you get error messages, do not say I get error messages, say (for example) I get the error message 'No route to host'. If your system panics, do not say My system panicked, say (for example) my system panicked with the message 'free vnode isn't'. If you have difficulty installing FreeBSD, please tell us what hardware you have. 
In particular, it is important to know the IRQs and I/O addresses of the boards installed in your machine. If you have difficulty getting PPP to run, describe the configuration. Which version of PPP do you use? What kind of authentication do you have? Do you have a static or dynamic IP address? What kind of messages do you get in the log file? A lot of the information you need to supply is the output of programs, such as &man.dmesg.8;, or console messages, which usually appear in /var/log/messages. Do not try to copy this information by typing it in again; it is a real pain, and you are bound to make a mistake. To send log file contents, either make a copy of the file and use an editor to trim the information to what is relevant, or cut and paste into your message. For the output of programs like &man.dmesg.8;, redirect the output to a file and include that. For example, &prompt.user; dmesg > /tmp/dmesg.out This redirects the information to the file /tmp/dmesg.out. If you do all this, and you still do not get an answer, there could be other reasons. For example, the problem is so complicated that nobody knows the answer, or the person who does know the answer was offline. If you do not get an answer after, say, a week, it might help to re-send the message. If you do not get an answer to your second message, though, you are probably not going to get one from this forum. Resending the same message again and again will only make you unpopular. To summarize, let's assume you know the answer to the following question (yes, it is the same one in each case). You choose which of these two questions you would be more prepared to answer: Message 1 Subject: HELP!!?!?? I just can't get hits damn silly FereBSD system to workd, and Im really good at this tsuff, but I have never seen anythign sho difficult to install, it jst wont work whatever I try so why don't y9ou guys tell me what I doing wrong. 
Message 2 Subject: Problems installing FreeBSD I've just got the FreeBSD 2.1.5 CDROM from Walnut Creek, and I'm having a lot of difficulty installing it. I have a 66 MHz 486 with 16 MB of memory and an Adaptec 1540A SCSI board, a 1.2GB Quantum Fireball disk and a Toshiba 3501XA CDROM drive. The installation works just fine, but when I try to reboot the system, I get the message -``Missing Operating System''. +Missing Operating System. How to follow up to a question Often you will want to send in additional information to a question you have already sent. The best way to do this is to reply to your original message. This has three advantages: You include the original message text, so people will know what you are talking about. Do not forget to trim unnecessary text out, though. The text in the subject line stays the same (you did remember to put one in, did you not?). Many mailers will sort messages by subject. This helps group messages together. The message reference numbers in the header will refer to the previous message. Some mailers, such as mutt, can thread messages, showing the exact relationships between the messages. How to answer a question Before you answer a question to FreeBSD-questions, consider: A lot of the points on submitting questions also apply to answering questions. Read them. Has somebody already answered the question? The easiest way to check this is to sort your incoming mail by subject: then (hopefully) you will see the question followed by any answers, all together. If somebody has already answered it, it does not automatically mean that you should not send another answer. But it makes sense to read all the other answers first. Do you have something to contribute beyond what has already been said? 
In general, Yeah, me too answers do not help much, although there are exceptions, like when somebody is describing a problem he is having, and he does not know whether it is his fault or whether there is something wrong with the hardware or software. If you do send a me too answer, you should also include any further relevant information. Are you sure you understand the question? Very frequently, the person who asks the question is confused or does not express himself very well. Even with the best understanding of the system, it is easy to send a reply which does not answer the question. This does not help: you will leave the person who submitted the question more frustrated or confused than ever. If nobody else answers, and you are not too sure either, you can always ask for more information. Are you sure your answer is correct? If not, wait a day or so. If nobody else comes up with a better answer, you can still reply and say, for example, I do not know if this is correct, but since nobody else has replied, why don't you try replacing your ATAPI CDROM with a frog?. Unless there is a good reason to do otherwise, reply to the sender and to FreeBSD-questions. Many people on the FreeBSD-questions are lurkers: they learn by reading messages sent and replied to by others. If you take a message which is of general interest off the list, you are depriving these people of their information. Be careful with group replies; lots of people send messages with hundreds of CCs. If this is the case, be sure to trim the Cc: lines appropriately. Include relevant text from the original message. Trim it to the minimum, but do not overdo it. It should still be possible for somebody who did not read the original message to understand what you are talking about. Use some technique to identify which text came from the original message, and which text you add. I personally find that prepending > to the original message works best. 
Leaving white space after the > and leaving empty lines between your text and the original text both make the result more readable. Put your response in the correct place (after the text to which it replies). It is very difficult to read a thread of responses where each reply comes before the text to which it replies. Most mailers change the subject line on a reply by prepending text such as Re: . If your mailer does not do it automatically, you should do it manually. If the submitter did not abide by format conventions (lines too long, inappropriate subject line), please fix it. In the case of an incorrect subject line (such as HELP!!??), change the subject line to (say) Re: Difficulties with sync PPP (was: HELP!!??). That way other people trying to follow the thread will have less difficulty following it. In such cases, it is appropriate to say what you did and why you did it, but try not to be rude. If you find you can not answer without being rude, do not answer. If you just want to reply to a message because of its bad format, just reply to the submitter, not to the list. You can just send him this message in reply, if you like.
diff --git a/en_US.ISO8859-1/articles/laptop/article.sgml b/en_US.ISO8859-1/articles/laptop/article.sgml index 87f4ebe6fb..bac561f779 100644 --- a/en_US.ISO8859-1/articles/laptop/article.sgml +++ b/en_US.ISO8859-1/articles/laptop/article.sgml @@ -1,179 +1,179 @@ %man; %freebsd; %authors; %mailing-lists; ]>
FreeBSD on Laptops $FreeBSD$ FreeBSD works fine on most laptops, with a few caveats. Some issues specific to running FreeBSD on laptops, relating to different hardware requirements from desktops, are discussed below. FreeBSD is often thought of as a server operating system, but it works just fine on the desktop, and if you want to use it on your laptop you can enjoy all the usual benefits: systematic layout, easy administration and upgrading, the ports/packages system for adding software, and so on. (Its other benefits, such as stability, network performance, and performance under a heavy load, may not be obvious on a laptop, of course.) However, installing it on laptops often involves problems which are not encountered on desktop machines and are not commonly discussed (laptops, even more than desktops, are fine-tuned for Microsoft Windows). This article aims to discuss some of these issues. XFree86 Recent versions of XFree86 work with most display adapters available on laptops these days. Acceleration may not be supported, but a generic SVGA configuration should work. Check your laptop documentation for which card you have, and check in the XFree86 documentation (or setup program) to see whether it is specifically supported. If it is not, use a generic device (do not go for a name which just looks similar). In XFree86 version 4, you can try your luck with the command XFree86 -configure which auto-detects a lot of configurations. The problem often is configuring the monitor. Common resources for XFree86 focus on CRT monitors; getting a suitable modeline for an LCD display may be tricky. You may be lucky and not need to specify a modeline, or just need to specify suitable HorizSync and VertRefresh ranges. If that does not work, the best option is to check web resources devoted to configuring X on laptops (these are often linux-oriented sites but it does not matter because both systems use XFree86) and copy a modeline posted by someone for similar hardware. 
Most laptops come with two buttons on their pointing devices, which is rather problematic in X (since the middle button is commonly used to paste text); you can map a simultaneous left-right click in your X configuration to a middle button click with the line Option "Emulate3Buttons" - in the XF86Config file in the "InputDevice" section (for XFree86 - version 4; for version 3, put just the line "Emulate3Buttons", - without the quotes, in the "Pointer" section.) + in the XF86Config file in the InputDevice section (for XFree86 + version 4; for version 3, put just the line Emulate3Buttons, + without the quotes, in the Pointer section.) Modems Laptops usually come with internal (on-board) modems. - Unfortunately, this almost always means they are "winmodems" whose + Unfortunately, this almost always means they are winmodems whose functionality is implemented in software, for which only windows drivers are normally available (though a few drivers are beginning to show up for other operating systems). Otherwise, you need to buy an external modem: the most compact option is probably a PC-Card (PCMCIA) modem, discussed below, but serial or USB modems may be cheaper. Generally, regular modems (non-winmodems) should work fine. PCMCIA (PC-card) devices Most laptops come with PCMCIA (also called PC-card) slots; these are supported fine under FreeBSD. Look through your boot-up messages (using dmesg) and see whether these were detected correctly (they should appear as pccard0, pccard1 etc on devices like pcic0). FreeBSD currently supports 16-bit PCMCIA cards, but not - 32-bit ("CardBus") cards. A database of supported cards is in + 32-bit (CardBus) cards. A database of supported cards is in the file /etc/defaults/pccard.conf. Look through it, and preferably buy cards listed there. 
Cards not - listed may also work as "generic" devices: in particular most + listed may also work as generic devices: in particular most modems (16-bit) should work fine, provided they are not winmodems (these do exist even as PC-cards, so watch out). If your card is recognised as a generic modem, note that the default pccard.conf file specifies a delay time of 10 seconds (to avoid freezes on certain modems); this may well be over-cautious for your modem, so you may want to play with it, reducing it or removing it totally. - Some parts of pccard.conf may need editing. Check the irq + Some parts of pccard.conf may need editing. Check the irq line, and be sure to remove any number already being used: in particular, if you have an on board sound card, remove irq 5 (otherwise you may experience hangs when you insert a card). Check also the available memory slots; if your card is not being detected, try changing it to one of the other allowed values (listed in the man page &man.pccardc.8;). If it is not running already, start the pccardd daemon. (To enable it at boot time, add pccard_enable="YES" to /etc/rc.conf). Now your cards should be detected when you insert and remove them, and you should get log messages about new devices being enabled. There have been major changes to the pccard code (including ISA routing of interrupts, for machines whose PCIBIOS FreeBSD can not seem to use) before the FreeBSD 4.4 release. If you have problems, try upgrading your system. Power management Unfortunately, this is not very reliably supported under FreeBSD. If you are lucky, some functions may work reliably; or they may not work at all. To enable this, you may need to compile a kernel with power management support (device apm0) or add the option enable apm0 to /boot/loader.conf, and also enable the apm daemon at boot time (line apm_enable="YES" in /etc/rc.conf). The apm commands are listed in the &man.apm.8; manpage. 
For instance, apm -b gives you battery status (or 255 if not supported), apm -Z puts the laptop on standby, apm -z (or zzz) suspends it. To - shutdown and power off the machine, use "shutdown -p". + shutdown and power off the machine, use shutdown -p. Again, some or all of these functions may not work very well or at all. You may find that laptop suspension/standby works in console mode but not under X (that is, the screen does not come on again); in that case, switch to a virtual console (using Ctrl-Alt-F1 or another function key) and then execute the apm command. The X Window System (XFree86) also includes display power management (look at the &man.xset.1; man page, and search for dpms there). You may want to investigate this. However, this, too, works inconsistently on laptops: it often turns off the display but does not turn off the backlight.
diff --git a/en_US.ISO8859-1/articles/multi-os/article.sgml b/en_US.ISO8859-1/articles/multi-os/article.sgml index 8b28f5a6b2..a40b7c76b0 100644 --- a/en_US.ISO8859-1/articles/multi-os/article.sgml +++ b/en_US.ISO8859-1/articles/multi-os/article.sgml @@ -1,741 +1,741 @@
Installing and Using FreeBSD With Other Operating Systems Jay Richmond
jayrich@sysc.com
6 August 1996 This document discusses how to make FreeBSD coexist nicely with other popular operating systems such as Linux, MS-DOS, OS/2, and Windows 95. Special thanks to: Annelise Anderson andrsn@stanford.edu, Randall Hopper rhh@ct.picker.com, and Jordan K. Hubbard jkh@time.cdrom.com
Overview Most people can not fit these operating systems together comfortably without having a larger hard disk, so special information on large EIDE drives is included. Because there are so many combinations of possible operating systems and hard disk configurations, the section may be of the most use to you. It contains descriptions of specific working computer setups that use multiple operating systems. This document assumes that you have already made room on your hard disk for an additional operating system. Any time you repartition your hard drive, you run the risk of destroying the data on the original partitions. However, if your hard drive is completely occupied by DOS, you might find the FIPS utility (included on the FreeBSD CDROM in the \TOOLS directory or via ftp) useful. It lets you repartition your hard disk without destroying the data already on it. There is also a commercial program available called Partition Magic, which lets you size and delete partitions without consequence. Overview of Boot Managers These are just brief descriptions of some of the different boot managers you may encounter. Depending on your computer setup, you may find it useful to use more than one of them on the same system. Boot Easy This is the default boot manager used with FreeBSD. It has the ability to boot most anything, including BSD, OS/2 (HPFS), Windows 95 (FAT and FAT32), and Linux. Partitions are selected with the function keys. OS/2 Boot Manager This will boot FAT, HPFS, FFS (FreeBSD), and EXT2 (Linux). It will also boot FAT32 partitions. Partitions are selected using arrow keys. The OS/2 Boot Manager is the only one to use its own separate partition, unlike the others which use the master boot record (MBR). Therefore, it must be installed below the 1024th cylinder to avoid booting problems. It can boot Linux using LILO when it is part of the boot sector, not the MBR. 
Go to Linux HOWTOs on the World Wide Web for more information on booting Linux with OS/2's boot manager. OS-BS This is an alternative to Boot Easy. It gives you more control over the booting process, with the ability to set the default partition to boot and the booting timeout. The beta version of this program allows you to boot by selecting the OS with your arrow keys. It is included on the FreeBSD CD in the \TOOLS directory, and via ftp. LILO, or LInux LOader This is a limited boot manager. It will boot FreeBSD, though some customization work is required in the LILO configuration file. About FAT32 FAT32 is the replacement for the FAT filesystem included in Microsoft's OEM SR2 Beta release, which started replacing FAT on computers pre-loaded with Windows 95 towards the end of 1996. It converts the normal FAT file system and allows you to use smaller cluster sizes for larger hard drives. FAT32 also modifies the traditional FAT boot sector and allocation table, making it incompatible with some boot managers. A Typical Installation Let's say I have two large EIDE hard drives, and I want to install FreeBSD, Linux, and Windows 95 on them. Here is how I might do it using these hard disks: /dev/wd0 (first physical hard disk) /dev/wd1 (second hard disk) Both disks have 1416 cylinders. I boot from an MS-DOS or Windows 95 boot disk that contains the FDISK.EXE utility and make a small 50 meg primary partition (35-40 for Windows 95, plus a little breathing room) on the first disk. I also create a larger partition on the second hard disk for my Windows applications and data. I reboot and install Windows 95 (easier said than done) on the C: partition. The next thing I do is install Linux. I am not sure about all the distributions of Linux, but Slackware includes LILO (see ). When I am partitioning out my hard disk with Linux fdisk, I would put all of Linux on the first drive (maybe 300 megs for a nice root partition and some swap space).
After I install Linux and am prompted about installing LILO, I make SURE that I install it on the boot sector of my root Linux partition, not in the MBR (master boot record). The remaining hard disk space can go to FreeBSD. I also make sure that my FreeBSD root slice does not go beyond the 1024th cylinder. (The 1024th cylinder is 528 megs into the disk with our hypothetical 720MB disks). I will use the rest of the hard drive (about 270 megs) for the /usr and / slices if I wish. The rest of the second hard disk (its size depends on the size of the Windows application/data partition that I created in step 1) can go to the /usr/src slice and swap space. When viewed with the Windows 95 fdisk utility, my hard drives should now look something like this: --------------------------------------------------------------------- Display Partition Information Current fixed disk drive: 1 Partition Status Type Volume_Label Mbytes System Usage C: 1 A PRI DOS 50 FAT** 7% 2 A Non-DOS (Linux) 300 43% Total disk space is 696 Mbytes (1 Mbyte = 1048576 bytes) Press Esc to continue --------------------------------------------------------------------- Display Partition Information Current fixed disk drive: 2 Partition Status Type Volume_Label Mbytes System Usage D: 1 A PRI DOS 420 FAT** 60% Total disk space is 696 Mbytes (1 Mbyte = 1048576 bytes) Press Esc to continue --------------------------------------------------------------------- ** May say FAT16 or FAT32 if you are using the OEM SR2 update. See ). Install FreeBSD. I make sure to boot with my first hard disk set at NORMAL in the BIOS. If it is not, I will have to enter my true disk geometry at boot time (to get this, boot Windows 95 and consult Microsoft Diagnostics (MSD.EXE), or check your BIOS) with the parameter hd0=1416,16,63 where 1416 is the number of cylinders on my hard disk, 16 is the number of heads, and 63 is the number of sectors per track on the drive.
When partitioning out the hard disk, I make sure to install Boot Easy on the first disk. I do not worry about the second disk, nothing is booting off of it. When I reboot, Boot Easy should recognize my three bootable partitions as DOS (Windows 95), Linux, and BSD (FreeBSD). Special Considerations Most operating systems are very picky about where and how they are placed on the hard disk. Windows 95 and DOS need to be on the first primary partition on the first hard disk. OS/2 is the exception. It can be installed on the first or second disk in a primary or extended partition. If you are not sure, keep the beginning of the bootable partitions below the 1024th cylinder. If you install Windows 95 on an existing BSD system, it will destroy the MBR, and you will have to reinstall your previous boot manager. Boot Easy can be reinstalled by using the BOOTINST.EXE utility included in the \TOOLS directory on the CDROM, and via ftp. You can also re-start the installation process and go to the partition editor. From there, mark the FreeBSD partition as bootable, select Boot Manager, and then type W to (W)rite out the information to the MBR. You can now reboot, and Boot Easy should then recognize Windows 95 as DOS. Please keep in mind that OS/2 can read FAT and HPFS partitions, but not FFS (FreeBSD) or EXT2 (Linux) partitions. Likewise, Windows 95 can only read and write to FAT and FAT32 (see ) partitions. FreeBSD can read most file systems, but currently cannot read HPFS partitions. Linux can read HPFS partitions, but can not write to them. Recent versions of the Linux kernel (2.x) can read and write to Windows 95 VFAT partitions (VFAT is what gives Windows 95 long file names - it is pretty much the same as FAT). Linux can read and write to most file systems. Got that? I hope so. Examples (section needs work, please send your example to jayrich@sysc.com). FreeBSD+Win95: If you installed FreeBSD after Windows 95, you should see DOS on the Boot Easy menu. This is Windows 95. 
If you installed Windows 95 after FreeBSD, read above. As long as your hard disk does not have more than 1024 cylinders you should not have a problem booting. If one of your partitions goes beyond the 1024th cylinder however, and you get messages like invalid system disk under DOS (Windows 95) and FreeBSD will not boot, try looking for a setting in your BIOS called > 1024 cylinder support or NORMAL/LBA mode. DOS may need LBA (Logical Block Addressing) in order to boot correctly. If the idea of switching BIOS settings every time you boot up does not appeal to you, you can boot FreeBSD through DOS via the FBSDBOOT.EXE utility on the CD (it should find your FreeBSD partition and boot it). FreeBSD+OS/2+Win95: Nothing new here. OS/2's boot manager can boot all of these operating systems, so that should not be a problem. FreeBSD+Linux: You can also use Boot Easy to boot both operating systems. FreeBSD+Linux+Win95: (see ) Other Sources of Help There are many Linux HOW-TOs that deal with multiple operating systems on the same hard disk. The Linux+DOS+Win95+OS2 mini-HOWTO offers help on configuring the OS/2 boot manager, and the Linux+FreeBSD mini-HOWTO might be interesting as well. The Linux-HOWTO is also helpful. The NT Loader Hacking Guide provides good information on multibooting Windows NT, '95, and DOS with other operating systems. ]]> - And Hale Landis's "How It Works" document pack contains some + And Hale Landis's How It Works document pack contains some good info on all sorts of disk geometry and booting related topics. You can find it at ftp://fission.dt.wdc.com/pub/otherdocs/pc_systems/how_it_works/allhiw.zip. Finally, do not overlook FreeBSD's kernel documentation on the booting procedure, available in the kernel source distribution (it unpacks to file:/usr/src/sys/i386/boot/biosboot/README.386BSD).
Technical Details (Contributed by Randall Hopper, rhh@ct.picker.com) This section attempts to give you enough basic information about your hard disks and the disk booting process so that you can troubleshoot most problems you might encounter when getting set up to boot several operating systems. It starts in pretty basic terms, so you may want to skim down in this section until it begins to look unfamiliar and then start reading. Disk Primer Three fundamental terms are used to describe the location of data on your hard disk: Cylinders, Heads, and Sectors. It is not particularly important to know what these terms relate to except to know that, together, they identify where data is physically on your disk. Your disk has a particular number of cylinders, number of heads, and number of sectors per cylinder-head (a cylinder-head is also known as a track). Collectively this - information defines the "physical disk geometry" for your hard + information defines the physical disk geometry for your hard disk. There are typically 512 bytes per sector, and 63 sectors per track, with the number of cylinders and heads varying widely from disk to disk. Thus you can figure the number of bytes of data that will fit on your own disk by calculating: (# of cylinders) × (# heads) × (63 sectors/track) × (512 bytes/sect) For example, on my 1.6 Gig Western Digital AC31600 EIDE hard disk, that is: (3148 cyl) × (16 heads) × (63 sectors/track) × (512 bytes/sect) which is 1,624,670,208 bytes, or around 1.6 Gig. You can find out the physical disk geometry (number of cylinders, heads, and sectors/track counts) for your hard disks using ATAID or other programs off the net. Your hard disk probably came with this information as well. Be careful though: if you are using BIOS LBA (see ), you can not use just any program to get the physical geometry. This is because many programs (e.g.
MSD.EXE or FreeBSD fdisk) do not identify the physical disk geometry; they instead report the translated geometry (virtual numbers from using LBA). Stay tuned for what that means. One other useful thing about these terms. Given 3 numbers—a cylinder number, a head number, and a sector-within-track number—you identify a specific absolute sector (a 512 byte block of data) on your disk. Cylinders and Heads are numbered up from 0, and Sectors are numbered up from 1. For those that are interested in more technical details, information on disk geometry, boot sectors, BIOSes, etc. can be found all over the net. Query Lycos, Yahoo, etc. for boot sector or master boot record. Among the useful info you will find are Hale Landis's How It Works document pack. See the section for a few pointers to this pack. Ok, enough terminology. We are talking about booting here. The Booting Process On the first sector of your disk (Cyl 0, Head 0, Sector 1) lives the Master Boot Record (MBR). It contains a map of your disk. It identifies up to 4 partitions, each of which is a contiguous chunk of that disk. FreeBSD calls partitions slices to avoid confusion with its own partitions, but we will not do that here. Each partition can contain its own operating system. Each partition entry in the MBR has a Partition ID, a Start Cylinder/Head/Sector, and an End Cylinder/Head/Sector. The Partition ID tells what type of partition it is (what OS) and the Start/End tells where it is. lists a smattering of some common Partition IDs. Partition IDs ID (hex) Description 01 Primary DOS12 (12-bit FAT) 04 Primary DOS16 (16-bit FAT) 05 Extended DOS 06 Primary big DOS (> 32MB) 0A OS/2 83 Linux (EXT2FS) A5 FreeBSD, NetBSD, 386BSD (UFS)
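To make the partition table a little more concrete, here is a minimal Python sketch that reads the four partition entries out of a 512-byte MBR sector and looks up each Partition ID in the table above. It is an illustration under stated assumptions, not the code any particular boot manager uses: the offsets (table at byte 446, 16 bytes per entry, 55 AA signature in the last two bytes) and the packed 3-byte Cylinder/Head/Sector encoding are the conventional PC MBR layout, which this article does not spell out, and the names parse_mbr, decode_chs and make_example_mbr are made up for the example.

```python
import struct

# Partition IDs copied from the table above; any other ID is reported as unknown.
PARTITION_IDS = {
    0x01: "Primary DOS12 (12-bit FAT)",
    0x04: "Primary DOS16 (16-bit FAT)",
    0x05: "Extended DOS",
    0x06: "Primary big DOS (> 32MB)",
    0x0A: "OS/2",
    0x83: "Linux (EXT2FS)",
    0xA5: "FreeBSD, NetBSD, 386BSD (UFS)",
}

# Assumed conventional PC MBR layout (not stated in the article itself).
TABLE_OFFSET = 446        # partition table starts here in the sector
ENTRY_SIZE = 16           # each of the four entries is 16 bytes
SIGNATURE = b"\x55\xaa"   # boot signature in the last two bytes


def decode_chs(raw):
    """Unpack a packed 3-byte C/H/S field into (cylinder, head, sector)."""
    head, sec_cyl, cyl_low = raw
    sector = sec_cyl & 0x3F                        # low 6 bits, counted from 1
    cylinder = ((sec_cyl & 0xC0) << 2) | cyl_low   # 2 high bits + 8 low bits
    return cylinder, head, sector


def parse_mbr(sector0):
    """Return the used partition entries of a 512-byte MBR sector."""
    if len(sector0) != 512 or sector0[510:512] != SIGNATURE:
        raise ValueError("not a 512-byte sector ending in the 55 AA signature")
    entries = []
    for index in range(4):
        offset = TABLE_OFFSET + index * ENTRY_SIZE
        flag, start, part_id, end, _lba, _count = struct.unpack_from(
            "<B3sB3sII", sector0, offset)
        if part_id == 0:   # unused slot
            continue
        entries.append({
            "slot": index + 1,
            "active": flag == 0x80,
            "id": part_id,
            "description": PARTITION_IDS.get(part_id, "unknown"),
            "start_chs": decode_chs(start),
            "end_chs": decode_chs(end),
        })
    return entries


def make_example_mbr():
    """Build a synthetic MBR holding one active FreeBSD entry (illustration only)."""
    sector = bytearray(512)
    entry = struct.pack("<B3sB3sII", 0x80, bytes([1, 1, 0]), 0xA5,
                        bytes([15, 0xFF, 0xFF]), 63, 1032129)
    sector[TABLE_OFFSET:TABLE_OFFSET + ENTRY_SIZE] = entry
    sector[510:512] = SIGNATURE
    return bytes(sector)


if __name__ == "__main__":
    for entry in parse_mbr(make_example_mbr()):
        print(entry)
```

The synthetic sector describes a FreeBSD partition that starts at Cylinder 0, Head 1, Sector 1 and ends at Cylinder 1023, Head 15, Sector 63. To look at a real disk you would pass in its first 512 bytes instead, read however your system allows.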
Note that not all partitions are bootable (e.g. Extended DOS). Some are; some are not. What makes a partition bootable is the configuration of the Partition Boot Sector that exists at the beginning of each partition. When you configure your favorite boot manager, it looks up the entries in the MBR partition tables of all your hard disks and lets you name the entries in that list. Then when you boot, the boot manager is invoked by special code in the Master Boot Sector of the first probed hard disk on your system. It looks at the MBR partition table entry corresponding to the partition choice you made, uses the Start Cylinder/Head/Sector information for that partition, loads up the Partition Boot Sector for that partition, and gives it control. That Boot Sector for the partition itself contains enough information to start loading the operating system on that partition. One thing we just brushed past is important to know: all of your hard disks have MBRs. However, the one that is important is the one on the disk that is first probed by the BIOS. If you have only IDE hard disks, it is the first IDE disk (e.g. primary disk on first controller). Similarly for SCSI only systems. If you have both IDE and SCSI hard disks though, the IDE disk is typically probed first by the BIOS, so the first IDE disk is the first probed disk. The boot manager you will install will be hooked into the MBR on this first probed hard disk that we have just described.
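If you want to check the Disk Primer arithmetic for your own disks, a few lines of Python will do it. This is only a sketch: the function names are made up for the example, and it assumes the 512 bytes per sector used throughout this section. The second function applies the usual conversion from a Cylinder/Head/Sector address to an absolute sector number, counting cylinders and heads from 0 and sectors from 1 as described above.

```python
BYTES_PER_SECTOR = 512  # assumption: the typical sector size used in this section


def capacity_bytes(cylinders, heads, sectors_per_track):
    """Bytes on a disk with the given geometry."""
    return cylinders * heads * sectors_per_track * BYTES_PER_SECTOR


def chs_to_absolute(cylinder, head, sector, heads, sectors_per_track):
    """Zero-based absolute sector number for one C/H/S address.

    Cylinders and heads count from 0, sectors count from 1.
    """
    if not 0 <= head < heads:
        raise ValueError("head out of range for this geometry")
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector out of range for this geometry")
    return (cylinder * heads + head) * sectors_per_track + (sector - 1)


if __name__ == "__main__":
    # The 1.6 Gig Western Digital from the Disk Primer.
    print(capacity_bytes(3148, 16, 63))
    # The MBR at Cylinder 0, Head 0, Sector 1 is absolute sector 0.
    print(chs_to_absolute(0, 0, 1, 16, 63))
    # The first sector of the next track is Cylinder 0, Head 1, Sector 1.
    print(chs_to_absolute(0, 1, 1, 16, 63))
```

With the 3148/16/63 geometry the first call returns the same 1,624,670,208 bytes worked out by hand in the Disk Primer.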
Booting Limitations and Warnings Now the interesting stuff that you need to watch out for. The dreaded 1024 cylinder limit and how BIOS LBA helps The first part of the booting process is all done through the BIOS (if that is a new term to you, the BIOS is a software chip on your system motherboard which provides startup code for your computer). As such, this first part of the process is subject to the limitations of the BIOS interface. The BIOS interface used to read the hard disk during this period (INT 13H, Subfunction 2) allocates 10 bits to the Cylinder Number, 8 bits to the Head Number, and 6 bits to the Sector Number. This restricts users of this interface (i.e. boot managers hooked into your disk's MBR as well as OS loaders hooked into the Boot Sectors) to the following limits: 1024 cylinders, max; 256 heads, max; 64 sectors/track, max (actually 63, since sector 0 is not available). Now big hard disks have lots of cylinders but not a lot of heads, so invariably with big hard disks the number of cylinders is greater than 1024. Given this and the BIOS interface as is, you can not boot off just anywhere on your hard disk. The boot code (the boot manager and the OS loader hooked into all bootable partitions' Boot Sectors) has to reside below cylinder 1024. In fact, if your hard disk is typical and has 16 heads, this equates to: 1024 cyl/disk × 16 heads/disk × 63 sect/(cyl-head) × 512 bytes/sector = 528,482,304 bytes, which is around the often-mentioned 528MB limit. This is where BIOS LBA (Logical Block Addressing) comes in. BIOS LBA gives the user of the BIOS API calls access to physical cylinders above 1024 through the BIOS interfaces by redefining a cylinder. That is, it remaps your cylinders and heads, making it appear through the BIOS as though the disk has fewer cylinders and more heads than it actually does.
In other words, it takes advantage of the fact that hard disks have relatively few heads and lots of cylinders by shifting the balance between number of cylinders and number of heads so that both numbers lie below the above-mentioned limits (1024 cylinders, 256 heads). With BIOS LBA, the hard disk size limitation is virtually removed (well, pushed up to 8 Gigabytes anyway). If you have an LBA BIOS, you can put FreeBSD or any OS anywhere you want and not hit the 1024 cylinder limit. To use my 1.6 Gig Western Digital as an example again, its physical geometry is: (3148 cyl, 16 heads, 63 sectors/track, 512 bytes/sector) However, my BIOS LBA remaps this to: (787 cyl, 64 heads, 63 sectors/track, 512 bytes/sector) giving the same effective size disk, but with cylinder and head counts within the BIOS API's range. (Incidentally, I have both Linux and FreeBSD existing on one of my hard disks above the 1024th physical cylinder, and both operating systems boot fine, thanks to BIOS LBA.) Boot Managers and Disk Allocation Another gotcha to watch out for when installing boot managers is allocating space for your boot manager. It is best to be aware of this issue up front to save yourself from having to reinstall one or more of your OSs. If you followed the discussion about the Master Boot Sector (where the MBR is), Partition Boot Sectors, and the booting process, you may have been wondering just exactly where on your hard disk that nifty boot manager is going to live. Well, some boot managers are small enough to fit entirely within the Master Boot Sector (Cylinder 0, Head 0, Sector 1) along with the partition table. Others need a bit more room and actually extend a few sectors past the Master Boot Sector in the Cylinder 0 Head 0 track, since that is typically free…typically. That is the catch. Some operating systems (FreeBSD included) let you start their partitions right after the Master Boot Sector at Cylinder 0, Head 0, Sector 2 if you want. 
In fact, if you give FreeBSD's sysinstall a disk with an empty chunk up front or the whole disk empty, that is where it will start the FreeBSD partition by default (at least it did when I fell into this trap). Then when you go to install your boot manager, if it is one that occupies a few extra sectors after the MBR, it will overwrite the front of the first partition's data. In the case of FreeBSD, this overwrites the disk label, and renders your FreeBSD partition unbootable. The easy way to avoid this problem (and leave yourself the flexibility to try different boot managers later) is just to always leave the first full track on your disk unallocated when you partition your disk. That is, leave the space from Cylinder 0, Head 0, Sector 2 through Cylinder 0, Head 0, Sector 63 unallocated, and start your first partition at Cylinder 0, Head 1, Sector 1. For what it is worth, when you create a DOS partition at the front of your disk, DOS leaves this space open by default (this is why some boot managers assume it is free). So creating a DOS partition up at the front of your disk avoids this problem altogether. I like to do this myself, creating 1 Meg DOS partition up front, because it also avoids my primary DOS drive letters shifting later when I repartition. For reference, the following boot managers use the Master Boot Sector to store their code and data: OS-BS 1.35 Boot Easy LILO These boot managers use a few additional sectors after the Master Boot Sector: OS-BS 2.0 Beta 8 (sectors 2-5) OS/2's boot manager What if your machine will not boot? At some point when installing boot managers, you might leave the MBR in a state such that your machine will not boot. This is unlikely, but possible when re-FDISKing underneath an already-installed boot manager. If you have a bootable DOS partition on your disk, you can boot off a DOS floppy, and run: A:\> FDISK /MBR to put the original, simple DOS boot code back into the system. 
You can then boot DOS (and DOS only) off the hard drive. Alternatively, just re-run your boot manager installation program off a bootable floppy.
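The geometry arithmetic from the discussion of the 1024 cylinder limit can be checked with a few lines of Python. This is only an illustrative sketch: the helper name chs_capacity is made up for this example, and the numbers are the ones quoted in the text (16 heads, 63 sectors/track, 512 bytes/sector, and the 1.6 Gig example drive).

```python
# Illustrative sketch only; chs_capacity is a hypothetical helper,
# not part of FreeBSD or of any boot manager.

SECTOR_BYTES = 512          # bytes per sector, as assumed throughout the text
BIOS_MAX_CYLINDERS = 1024   # 10 bits for the cylinder number
BIOS_MAX_HEADS = 256        # 8 bits for the head number
BIOS_MAX_SECTORS = 63       # 6 bits, but sector 0 is not available


def chs_capacity(cylinders, heads, sectors, sector_bytes=SECTOR_BYTES):
    """Return the capacity in bytes of a disk with the given geometry."""
    return cylinders * heads * sectors * sector_bytes


# Typical 16-head disk without LBA: the "528MB limit".
limit_16_heads = chs_capacity(BIOS_MAX_CYLINDERS, 16, BIOS_MAX_SECTORS)

# Largest geometry the INT 13H interface can express: roughly 8 Gigabytes.
limit_bios = chs_capacity(BIOS_MAX_CYLINDERS, BIOS_MAX_HEADS, BIOS_MAX_SECTORS)

# The 1.6 Gig example drive: physical geometry versus the LBA-translated one.
physical = chs_capacity(3148, 16, 63)
translated = chs_capacity(787, 64, 63)

print(limit_16_heads)          # 528482304
print(limit_bios)              # 8455716864
print(physical == translated)  # True
```

The last comparison shows why the remapping is harmless: both geometries describe exactly the same number of sectors, only the translated one keeps the cylinder count below 1024.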
diff --git a/en_US.ISO8859-1/articles/pxe/article.sgml b/en_US.ISO8859-1/articles/pxe/article.sgml index 3800349d55..cb769136b9 100644 --- a/en_US.ISO8859-1/articles/pxe/article.sgml +++ b/en_US.ISO8859-1/articles/pxe/article.sgml @@ -1,280 +1,279 @@ %man %authors; ]>
FreeBSD Jumpstart Guide Alfred Perlstein
alfred@FreeBSD.org
$FreeBSD$ This article details the method used to allow machines to install FreeBSD using the Intel PXE method of booting a machine over a network.
Introduction - This procedure will make the 'Server' both insecure and dangerous, - it is best to just keep the 'Server' on its own hub and not in any way - accessible by any machines other than the 'Clients'. + This procedure will make the Server both insecure and dangerous, + it is best to just keep the Server on its own hub and not in any way + accessible by any machines other than the Clients. Terminology: Server The machine offering netboot and install options. Client The machine that will have FreeBSD installed on it. Requires: Clients supporting the Intel PXE netboot option, an Ethernet connection. Please let me know if you come across anything you have problems with or suggestions for additional documentation. If you would like someone to train/implement a specific netinstall system for you, please send email so that we can discuss terms. I would also like to thank &a.ps; and &a.jhb; for doing most of the programming work on pxeboot, the interface to Intel's PXE (netboot) system. Server Configuration Install DHCP: Install isc-dhcp-2.0 you can use this config file dhcpd.conf, stick it in /usr/local/etc/ Enable tftp: Make a directory /usr/tftpboot Add this line to your /etc/inetd.conf: tftp dgram udp wait nobody /usr/libexec/tftpd tftpd /usr/tftpboot Enable NFS: Add this to /etc/rc.conf: nfs_server_enable="YES" Add this to /etc/exports: /usr -alldirs -ro Reboot to enable the new services or start them manually. Bootstrap Setup Download bootfiles: Download the kern.flp and mfsroot.flp floppy images. 
Set up tftp/pxe-boot directory: Put pxeboot in the boot directory: &prompt.root; rm -rf /usr/obj/* &prompt.root; cd /usr/src/sys/boot &prompt.root; make &prompt.root; cp /usr/src/sys/boot/i386/pxeldr/pxeboot /usr/tftpboot Using the vndevice, mount the kern.flp file and copy its contents to /usr/tftpboot: &prompt.root; vnconfig vn0 kern.flp # associate a vndevice with the file &prompt.root; mount /dev/vn0 /mnt # mount it &prompt.root; cp -R /mnt /usr/tftpboot # copy the contents to /usr/tftpboot &prompt.root; umount /mnt # unmount it &prompt.root; vnconfig -u vn0 # disassociate the vndevice from the file Compile a custom kernel for the clients (particularly to avoid the device config screen at boot) and stick it in /usr/tftpboot. Make a special loader.rc and install it in /usr/tftpboot/boot/loader.rc so that it does not prompt for the second disk; here is mine. Extract the installer and helper utilities from the mfsroot disk and uncompress them, then put them in /usr/tftpboot as well: &prompt.root; vnconfig vn0 mfsroot.flp # associate a vndevice with the file &prompt.root; mount /dev/vn0 /mnt # mount it &prompt.root; cp /mnt/mfsroot.gz /usr/tftpboot # copy the contents to /usr/tftpboot &prompt.root; umount /mnt # unmount it &prompt.root; vnconfig -u vn0 # disassociate the vndevice from the file &prompt.root; cd /usr/tftpboot # get into the pxeboot directory &prompt.root; gunzip mfsroot.gz # uncompress the mfsroot Make your sysinstall script install.cfg; you can use mine as a template, but you must edit it. Copy the sysinstall script into the extracted and uncompressed mfsroot image: &prompt.root; cd /usr/tftpboot &prompt.root; vnconfig vn0 mfsroot &prompt.root; mount /dev/vn0 /mnt &prompt.root; cp install.cfg /mnt &prompt.root; umount /mnt &prompt.root; vnconfig -u vn0 Install Setup Put the install files in an NFS accessible location on the Server. 
Make a directory corresponding the 'nfs' directive in the install.cfg file and mirror the FreeBSD install files there, you will want it to look somewhat like this: ABOUT.TXT TROUBLE.TXT compat20 floppies ports ERRATA.TXT UPGRADE.TXT compat21 games proflibs HARDWARE.TXT XF86336 compat22 info src INSTALL.TXT bin compat3x kern.flp LAYOUT.TXT catpages crypto manpages README.TXT cdrom.inf dict mfsroot.flp RELNOTES.TXT compat1x doc packages Copy the compressed packages into the packages/All directory under nfs. Make sure you have an INDEX file prepared in the packages directory. You can make your own INDEX entries like so: alfred-1.0||/|Alfred install bootstrap||alfred@FreeBSD.org|||| Then you can install custom packages, particularly your own custom post-install package. Custom Post-Install Package You can use the script pkgmaker.sh to create a custom package for post install, the idea is to have it install and configure any special things you may need done. pkgmaker is run in the directory above the package you wish to create with the single argument of the package (ie mypkg) which will then create a mypkg.tgz for you to include in your sysinstall package. Inside your custom package dir you will want a file called PLIST which contains all the files that you wish to install and be incorporated into your package. - You will also want files called 'pre' and - 'post' in the directory, these are shell scripts + You will also want files called pre and + post in the directory, these are shell scripts that you want to execute before and after your package is installed. Since this package is in your install.cfg file it should be run and do the final configuration for you.
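The INDEX entry shown earlier is a single line of fields separated by `|`. As a rough sketch, such a line could be generated as follows. The helper name make_index_entry and the field names are assumptions made for this example, inferred only from the ten-field layout of the sample entry; check them against the INDEX file shipped with your release before relying on them.

```python
# Hypothetical helper, not part of sysinstall or the ports tools.
# The field names are an assumption; only the '|' separator and the
# number of fields are taken from the sample entry in the text.
FIELDS = (
    "name", "path", "prefix", "comment", "descr",
    "maintainer", "categories", "build_deps", "run_deps", "www",
)


def make_index_entry(**values):
    """Build one '|'-separated INDEX line; unset fields stay empty."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError("unknown INDEX field(s): " + ", ".join(sorted(unknown)))
    return "|".join(values.get(field, "") for field in FIELDS)


entry = make_index_entry(
    name="alfred-1.0",
    prefix="/",
    comment="Alfred install bootstrap",
    maintainer="alfred@FreeBSD.org",
)
print(entry)
# alfred-1.0||/|Alfred install bootstrap||alfred@FreeBSD.org||||
```

Leaving unused fields empty rather than dropping them keeps the separators in place, which is what the sample entry in the text does.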
- diff --git a/en_US.ISO8859-1/articles/storage-devices/article.sgml b/en_US.ISO8859-1/articles/storage-devices/article.sgml index 218d4c0cfe..9179f68707 100644 --- a/en_US.ISO8859-1/articles/storage-devices/article.sgml +++ b/en_US.ISO8859-1/articles/storage-devices/article.sgml @@ -1,2627 +1,2627 @@ %man; %authors; ]>
Storage Devices Wilko Bulte
wilko@FreeBSD.org
$FreeBSD$ This article talks about storage devices with FreeBSD.
Using ESDI hard disks Copyright © 1995, &a.wilko;. 24 September 1995. ESDI is an acronym that means Enhanced Small Device Interface. It is loosely based on the good old ST506/412 interface originally devised by Seagate Technology, the makers of the first affordable 5.25" winchester disk. The acronym says Enhanced, and rightly so. In the first place the speed of the interface is higher, 10 or 15 Mbits/second instead of the 5 Mbits/second of ST412 interfaced drives. Secondly some higher level commands are added, making - the ESDI interface somewhat 'smarter' to the operating system + the ESDI interface somewhat smarter to the operating system driver writers. It is by no means as smart as SCSI by the way. ESDI is standardized by ANSI. Capacities of the drives are boosted by putting more sectors on each track. Typical is 35 sectors per track; the high capacity drives I have seen had up to 54 sectors/track. Although ESDI has been largely obsoleted by IDE and SCSI interfaces, the availability of free or cheap surplus drives makes them ideal for low (or no) budget systems. Concepts of ESDI Physical connections The ESDI interface uses two cables connected to each drive. One cable is a 34 pin flat cable edge connector that carries the command and status signals from the controller to the drive and vice-versa. The command cable is daisy chained between all the drives. So, it forms a bus onto which all drives are connected. The second cable is a 20 pin flat cable edge connector that carries the data to and from the drive. This cable is radially connected, so each drive has its own direct connection to the controller. To the best of my knowledge PC ESDI controllers are limited to using a maximum of 2 drives per controller. This is a compatibility feature(?) left over from the WD1003 standard that reserves only a single bit for device addressing. Device addressing On each command cable a maximum of 7 devices and 1 controller can be present. 
To enable the controller to uniquely identify which drive it addresses, each ESDI device is equipped with jumpers or switches to select the device's address. On PC type controllers the first drive is set to address 0, the second disk to address 1. Always make sure you set each disk to a unique address! So, on a PC with its two drives/controller maximum the first drive is drive 0, the second is drive 1. Termination The daisy chained command cable (the 34 pin cable remember?) needs to be terminated at the last drive on the chain. For this purpose ESDI drives come with a termination resistor network that can be removed or disabled by a jumper when it is not used. So, one and only one drive, the one at the farthest end of the command cable, has its terminator installed/enabled. The controller automatically terminates the other end of the cable. Please note that this implies that the controller must be at one end of the cable and not in the middle. Using ESDI disks with FreeBSD Why is ESDI such a pain to get working in the first place? People who tried ESDI disks with FreeBSD are known to have developed a profound sense of frustration. A combination of factors works against you to produce effects that are hard to understand when you have never seen them before. This has also led to the popular legend that ESDI and FreeBSD is a plain NO-GO. The following sections try to list all the pitfalls and solutions. ESDI speed variants As briefly mentioned before, ESDI comes in two speed flavors. The older drives and controllers use a 10 Mbits/second data transfer rate. Newer stuff uses 15 Mbits/second. It is not hard to imagine that 15 Mbits/second drives cause problems on controllers laid out for 10 Mbits/second. As always, consult your controller and drive documentation to see if things match. Stay on track Mainstream ESDI drives use 34 to 36 sectors per track. Most (older) controllers cannot handle more than this number of sectors. 
Newer, higher capacity drives use higher numbers of sectors per track. For instance, I own a 670 MB drive that has 54 sectors per track. In my case, the controller could not handle this number of sectors. It proved to work well except that it only used 35 sectors on each track. This meant losing a lot of disk space. Once again, check the documentation of your hardware for more info. Going out-of-spec like in the example might or might not work. Give it a try or get another more capable controller. Hard or soft sectoring Most ESDI drives allow hard or soft sectoring to be selected using a jumper. Hard sectoring means that the drive will produce a sector pulse on the start of each new sector. The controller uses this pulse to tell when it should start to write or read. Hard sectoring allows a selection of sector size (normally 256, 512 or 1024 bytes per formatted sector). FreeBSD uses 512 byte sectors. The number of sectors per track also varies while still using the same number of bytes per formatted sector. The number of unformatted bytes per sector varies; depending on your controller, it needs more or less overhead bytes to work correctly. Pushing more sectors on a track of course gives you more usable space, but might give problems if your controller needs more bytes than the drive offers. In case of soft sectoring, the controller itself determines where to start/stop reading or writing. For ESDI hard sectoring is the default (at least on everything I came across). I never felt the urge to try soft sectoring. In general, experiment with sector settings before you install FreeBSD because you need to re-run the low-level format after each change. Low level formatting ESDI drives need to be low level formatted before they are usable. A reformat is needed whenever you fiddle with the number of sectors/track jumpers or the physical orientation of the drive (horizontal, vertical). So, first think, then format. 
The format time must not be underestimated, for big disks it can take hours. After a low level format, a surface scan is done to find and flag bad sectors. Most disks have a manufacturer bad block list listed on a piece of paper or adhesive sticker. In addition, on most disks the list is also written onto the disk. Please use the manufacturer's list. It is much easier to remap a defect now than after FreeBSD is installed. Stay away from low-level formatters that mark all sectors of a track as bad as soon as they find one bad sector. Not only does this waste space, it also and more importantly causes you grief with bad144 (see the section on bad144). Translations Translations, although not exclusively a ESDI-only problem, might give you real trouble. Translations come in multiple flavors. Most of them have in common that they attempt to work around the limitations posed upon disk geometries by the original IBM PC/AT design (thanks IBM!). First of all there is the (in)famous 1024 cylinder limit. For a system to be able to boot, the stuff (whatever operating system) must be in the first 1024 cylinders of a disk. Only 10 bits are available to encode the cylinder number. For the number of sectors the limit is 64 (0-63). When you combine the 1024 cylinder limit with the 16 head limit (also a design feature) you max out at fairly limited disk sizes. To work around this problem, the manufacturers of ESDI PC controllers added a BIOS prom extension on their boards. This BIOS extension handles disk I/O for booting (and for some operating systems all disk I/O) by using translation. For instance, a big drive might be presented to the system as having 32 heads and 64 sectors/track. The result is that the number of cylinders is reduced to something below 1024 and is therefore usable by the system without problems. It is noteworthy to know that FreeBSD does not use the BIOS after its kernel has started. More on this later. 
A second reason for translations is the fact that most older system BIOSes could only handle drives with 17 sectors per track (the old ST412 standard). Newer system BIOSes usually have a user-defined drive type (in most cases this is drive type 47). Whatever you do to translations after reading this document, keep in mind that if you have multiple operating systems on the same disk, all must use the same translation. While on the subject of translations, I have seen one controller type (but there are probably more like this) offer the option to logically split a drive in multiple partitions as a BIOS option. I had to select 1 drive == 1 partition because this controller wrote this info onto the disk. On power-up it read the info and presented itself to the system based on the info from the disk. Spare sectoring Most ESDI controllers offer the possibility to remap bad sectors. During/after the low-level format of the disk bad sectors are marked as such, and a replacement sector is put in place (logically of course) of the bad one. In most cases the remapping is done by using N-1 sectors on each track for actual data storage, and sector N itself is the spare sector. N is the total number of sectors physically available on the track. The idea behind this is that the - operating system sees a 'perfect' disk without bad sectors. In + operating system sees a perfect disk without bad sectors. In the case of FreeBSD this concept is not usable. The problem is that the translation from bad to good is performed by the BIOS of the ESDI controller. FreeBSD, being a true 32 bit operating system, does not use the BIOS after it has been booted. Instead, it has device drivers that talk directly to the hardware. So: do not use spare sectoring, bad block remapping or whatever it may be called by the controller manufacturer when you want to use the disk for FreeBSD. Bad block handling The preceding section leaves us with a problem. 
The controller's bad block handling is not usable and still FreeBSD's filesystems assume perfect media without any flaws. To solve this problem, FreeBSD uses the bad144 tool. Bad144 (named after a Digital Equipment standard for bad block handling) scans a FreeBSD slice for bad blocks. Having found these bad blocks, it writes a table with the offending block numbers to the end of the FreeBSD slice. When the disk is in operation, the disk accesses are checked against the table read from the disk. Whenever a block number is requested that is in the bad144 list, a replacement block (also from the end of the FreeBSD slice) is used. In this way, the bad144 replacement - scheme presents 'perfect' media to the FreeBSD filesystems. + scheme presents perfect media to the FreeBSD filesystems. There are a number of potential pitfalls associated with the use of bad144. First of all, the slice cannot have more than 126 bad sectors. If your drive has a high number of bad sectors, you might need to divide it into multiple FreeBSD slices each containing less than 126 bad sectors. Stay away from low-level format programs that mark every sector of a track as bad when they find a flaw on the track. As you can imagine, the 126 limit is quickly reached when the low-level format is done this way. Second, if the slice contains the root filesystem, the slice should be within the 1024 cylinder BIOS limit. During the boot process the bad144 list is read using the BIOS and this only succeeds when the list is within the 1024 cylinder limit. The restriction is not that only the root filesystem must be within the 1024 cylinder limit, but rather the entire slice that contains the root filesystem. Kernel configuration ESDI disks are handled by the same wd driver as IDE and ST412 MFM disks. The wd driver should work for all WD1003 compatible interfaces. Most hardware is jumperable for one of two different I/O address ranges and IRQ lines. This allows you to have two wd type controllers in one system. 
When your hardware allows non-standard strappings, you can use these with FreeBSD as long as you enter the correct info into the kernel config file. An example from the kernel config file (they live in /sys/i386/conf BTW). # First WD compatible controller controller wdc0 at isa? port "IO_WD1" bio irq 14 vector wdintr disk wd0 at wdc0 drive 0 disk wd1 at wdc0 drive 1 # Second WD compatible controller controller wdc1 at isa? port "IO_WD2" bio irq 15 vector wdintr disk wd2 at wdc1 drive 0 disk wd3 at wdc1 drive 1 Particulars on ESDI hardware Adaptec 2320 controllers I successfully installed FreeBSD onto an ESDI disk controlled by an ACB-2320. No other operating system was present on the disk. To do so I low level formatted the disk using NEFMT.EXE (ftpable from www.adaptec.com) and answered NO to the question whether the disk should be formatted with a spare sector on each track. The BIOS on the ACB-2320 was disabled. I used the free configurable option in the system BIOS to allow the BIOS to boot it. Before using NEFMT.EXE I tried to format the disk using the ACB-2320 BIOS built-in formatter. This proved to be a show stopper, because it did not give me an option to disable spare sectoring. With spare sectoring enabled the FreeBSD installation process broke down on the bad144 run. Please check carefully which ACB-232xy variant you have. The x is either 0 or 2, indicating a controller without or with a floppy controller on board. The y is more interesting. It can either be a blank, an A-8 or a D. A blank indicates a plain 10 Mbits/second controller. An A-8 indicates a 15 Mbits/second controller capable of handling 52 sectors/track. A D means a 15 Mbits/second controller that can also handle drives with > 36 sectors/track (also 52?). All variations should be capable of using 1:1 interleaving. Use 1:1, FreeBSD is fast enough to handle it. Western Digital WD1007 controllers I successfully installed FreeBSD onto an ESDI disk controlled by a WD1007 controller. 
To be precise, it was a WD1007-WA2. Other variations of the WD1007 do exist. To get it to work, I had to disable the sector translation and the WD1007's onboard BIOS. This implied I could not use the low-level formatter built into this BIOS. Instead, I grabbed WDFMT.EXE from www.wdc.com Running this formatted my drive just fine. Ultrastor U14F controllers According to multiple reports from the net, Ultrastor ESDI boards work OK with FreeBSD. I lack any further info on particular settings. Further reading If you intend to do some serious ESDI hacking, you might want to have the official standard at hand: The latest ANSI X3T10 committee document is: Enhanced Small Device Interface (ESDI) [X3.170-1990/X3.170a-1991] [X3T10/792D Rev 11] On Usenet the newsgroup comp.periphs is a noteworthy place to look for more info. The World Wide Web (WWW) also proves to be a very handy info source: For info on Adaptec ESDI controllers see http://www.adaptec.com/. For info on Western Digital controllers see http://www.wdc.com/. Thanks to... Andrew Gordon for sending me an Adaptec 2320 controller and ESDI disk for testing. What is SCSI? Copyright © 1995, &a.wilko;. July 6, 1996. SCSI is an acronym for Small Computer Systems Interface. It is an ANSI standard that has become one of the leading I/O buses in the computer industry. The foundation of the SCSI standard was laid by Shugart Associates (the same guys that gave the world the first mini floppy disks) when they introduced the SASI bus (Shugart Associates Standard Interface). After some time an industry effort was started to come to a more strict standard allowing devices from different vendors to work together. This effort was recognized in the ANSI SCSI-1 standard. The SCSI-1 standard (approximately 1985) is rapidly becoming obsolete. The current standard is SCSI-2 (see Further reading), with SCSI-3 on the drawing boards. 
In addition to a physical interconnection standard, SCSI defines a logical (command set) standard to which disk devices must adhere. This standard is called the Common Command Set (CCS) and was developed more or less in parallel with ANSI SCSI-1. SCSI-2 includes the (revised) CCS as part of the standard itself. The commands are dependent on the type of device at hand. It does not make much sense of course to define a Write command for a scanner. The SCSI bus is a parallel bus, which comes in a number of variants. The oldest and most used is an 8 bit wide bus, with single-ended signals, carried on 50 wires. (If you do not know what single-ended means, do not worry, that is what this document is all about.) Modern designs also use 16 bit wide buses, with differential signals. This allows transfer speeds of 20Mbytes/second, on cables lengths of up to 25 meters. SCSI-2 allows a maximum bus width of 32 bits, using an additional cable. Quickly emerging are Ultra SCSI (also called Fast-20) and Ultra2 (also called Fast-40). Fast-20 is 20 million transfers per second (20 Mbytes/sec on a 8 bit bus), Fast-40 is 40 million transfers per second (40 Mbytes/sec on a 8 bit bus). Most hard drives sold today are single-ended Ultra SCSI (8 or 16 bits). Of course the SCSI bus not only has data lines, but also a number of control signals. A very elaborate protocol is part of the standard to allow multiple devices to share the bus in an efficient manner. In SCSI-2, the data is always checked using a separate parity line. In pre-SCSI-2 designs parity was optional. In SCSI-3 even faster bus types are introduced, along with a serial SCSI busses that reduces the cabling overhead and allows a higher maximum bus length. You might see names like SSA and fibre channel in this context. None of the serial buses are currently in widespread use (especially not in the typical FreeBSD environment). For this reason the serial bus types are not discussed any further. 
As you could have guessed from the description above, SCSI devices are intelligent. They have to be to adhere to the SCSI standard (which is over 2 inches thick BTW). So, for a hard disk drive for instance you do not specify a head/cylinder/sector to address a particular block, but simply the number of the block you want. Elaborate caching schemes, automatic bad block replacement etc are all - made possible by this 'intelligent device' approach. + made possible by this intelligent device approach. On a SCSI bus, each possible pair of devices can communicate. Whether their function allows this is another matter, but the standard does not restrict it. To avoid signal contention, the 2 devices have to arbitrate for the bus before using it. The philosophy of SCSI is to have a standard that allows older-standard devices to work with newer-standard ones. So, an old SCSI-1 device should normally work on a SCSI-2 bus. I say Normally, because it is not absolutely sure that the implementation of an old device follows the (old) standard closely enough to be acceptable on a new bus. Modern devices are usually more well-behaved, because the standardization has become more strict and is better adhered to by the device manufacturers. Generally speaking, the chances of getting a working set of devices on a single bus is better when all the devices are SCSI-2 or newer. This implies that you do not have to dump all your old stuff when you get that shiny 2GB disk: I own a system on which a pre-SCSI-1 disk, a SCSI-2 QIC tape unit, a SCSI-1 helical scan tape unit and 2 SCSI-1 disks work together quite happily. From a performance standpoint you might want to separate your older and newer (=faster) devices however. Components of SCSI As said before, SCSI devices are smart. The idea is to put the knowledge about intimate hardware details onto the SCSI device itself. 
In this way, the host system does not have to worry about things like how many heads a hard disk has, or how many tracks there are on a specific tape device. If you are curious, the standard specifies commands with which you can query your devices on their hardware particulars. FreeBSD uses this capability during boot to check out what devices are connected and whether they need any special treatment. The advantage of intelligent devices is obvious: the device drivers on the host can be made in a much more generic fashion, there is no longer a need to change (and qualify!) drivers for every odd new device that is introduced. For cabling and connectors there is a golden rule: get good stuff. With bus speeds going up all the time you will save yourself a lot of grief by using good material. So, gold plated connectors, shielded cabling, sturdy connector hoods with strain reliefs etc are the way to go. Second golden rule: do not use cables longer than necessary. I once spent 3 days hunting down a problem with a flaky machine only to discover that shortening the SCSI bus by 1 meter solved the problem. And the original bus length was well within the SCSI specification. SCSI bus types From an electrical point of view, there are two incompatible bus types: single-ended and differential. This means that there are two different main groups of SCSI devices and controllers, which cannot be mixed on the same bus. It is possible however to use special converter hardware to transform a single-ended bus into a differential one (and vice versa). The differences between the bus types are explained in the next sections. In lots of SCSI related documentation there is a sort of jargon in use to abbreviate the different bus types. A small list: FWD: Fast Wide Differential FND: Fast Narrow Differential SE: Single Ended FN: Fast Narrow etc. With a minor amount of imagination one can usually imagine what is meant. Wide is a bit ambiguous, it can indicate 16 or 32 bit buses. 
As far as I know, the 32 bit variant is not (yet) in use, so wide normally means 16 bit. Fast means that the timing on the bus is somewhat different, so that on a narrow (8 bit) bus 10 Mbytes/sec are possible instead of 5 - Mbytes/sec for 'slow' SCSI. As discussed before, bus speeds of 20 + Mbytes/sec for slow SCSI. As discussed before, bus speeds of 20 and 40 million transfers/second are also emerging (Fast-20 == Ultra SCSI and Fast-40 == Ultra2 SCSI). The data lines > 8 are only used for data transfers and device addressing. The transfers of commands and status messages etc are only performed on the lowest 8 data lines. The standard allows narrow devices to operate on a wide bus. The usable bus width is negotiated between the devices. You have to watch your device addressing closely when mixing wide and narrow. Single ended buses A single-ended SCSI bus uses signals that are either 5 Volts or 0 Volts (indeed, TTL levels) and are relative to a COMMON ground reference. A singled ended 8 bit SCSI bus has - approximately 25 ground lines, who are all tied to a single `rail' + approximately 25 ground lines, who are all tied to a single rail on all devices. A standard single ended bus has a maximum length of 6 meters. If the same bus is used with fast-SCSI devices, the maximum length allowed drops to 3 meters. Fast-SCSI means that instead of 5Mbytes/sec the bus allows 10Mbytes/sec transfers. Fast-20 (Ultra SCSI) and Fast-40 allow for 20 and 40 million transfers/second respectively. So, F20 is 20 Mbytes/second on a 8 bit bus, 40 Mbytes/second on a 16 bit bus etc. For F20 the max bus length is 1.5 meters, for F40 it becomes 0.75 meters. Be aware that F20 is pushing the limits quite a bit, so you will quickly find out if your SCSI bus is electrically sound. - If some devices on your bus use 'fast' to communicate your + If some devices on your bus use fast to communicate your bus must adhere to the length restrictions for fast buses! 
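The relation between timing variant, bus width, throughput and single-ended bus length described above can be summarized in a short Python sketch. The table values are the ones quoted in the text; the names TRANSFERS_PER_SEC_MILLIONS, MAX_SINGLE_ENDED_LENGTH_M, throughput_mbytes and max_bus_length are invented for this example and do not correspond to any FreeBSD interface.

```python
# Illustrative sketch only; all names here are hypothetical.

# Millions of transfers per second for each timing variant named in the text.
TRANSFERS_PER_SEC_MILLIONS = {"slow": 5, "fast": 10, "fast-20": 20, "fast-40": 40}

# Maximum single-ended bus length in meters, as quoted in the text.
MAX_SINGLE_ENDED_LENGTH_M = {"slow": 6.0, "fast": 3.0, "fast-20": 1.5, "fast-40": 0.75}


def throughput_mbytes(variant, bus_width_bits):
    """Peak Mbytes/second: one transfer moves bus_width_bits / 8 bytes."""
    return TRANSFERS_PER_SEC_MILLIONS[variant] * bus_width_bits // 8


def max_bus_length(variants_on_bus):
    """The fastest timing used by any device dictates the length limit."""
    return min(MAX_SINGLE_ENDED_LENGTH_M[v] for v in variants_on_bus)


print(throughput_mbytes("fast", 8))      # 10
print(throughput_mbytes("fast-20", 16))  # 40
print(max_bus_length(["slow", "fast"]))  # 3.0
```

The last call illustrates the warning above: a single fast device on an otherwise slow single-ended bus already reduces the allowed length from 6 to 3 meters.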
It is obvious that with the newer fast-SCSI devices the bus length can become a real bottleneck. This is why the differential SCSI bus was introduced in the SCSI-2 standard. For connector pinning and connector types please refer to the SCSI-2 standard (see Further reading) itself, connectors etc are listed there in painstaking detail. Beware of devices using non-standard cabling. For instance Apple uses a 25pin D-type connecter (like the one on serial ports and parallel printers). Considering that the official SCSI bus needs 50 pins you can imagine the use of this connector needs some - 'creative cabling'. The reduction of the number of ground wires + creative cabling. The reduction of the number of ground wires they used is a bad idea, you better stick to 50 pins cabling in accordance with the SCSI standard. For Fast-20 and 40 do not even think about buses like this. Differential buses A differential SCSI bus has a maximum length of 25 meters. Quite a difference from the 3 meters for a single-ended fast-SCSI bus. The idea behind differential signals is that each bus signal has its own return wire. So, each signal is carried on a (preferably twisted) pair of wires. The voltage difference between these two wires determines whether the signal is asserted or de-asserted. To a certain extent the voltage difference between ground and the signal wire pair is not relevant (do not try 10 kVolts though). It is beyond the scope of this document to explain why this differential idea is so much better. Just accept that electrically seen the use of differential signals gives a much better noise margin. You will normally find differential buses in use for inter-cabinet connections. Because of the lower cost single ended is mostly used for shorter buses like inside cabinets. There is nothing that stops you from using differential stuff with FreeBSD, as long as you use a controller that has device driver support in FreeBSD. 
As an example, Adaptec marketed the AHA1740 as a single ended board, whereas the AHA1744 was differential. The software interface to the host is identical for both. Terminators Terminators in SCSI terminology are resistor networks that are used to get a correct impedance matching. Impedance matching is important to get clean signals on the bus, without reflections or ringing. If you once made a long distance telephone call on a bad line you probably know what reflections are. With 20Mbytes/sec traveling over your SCSI bus, you do not want signals echoing back. Terminators come in various incarnations, with more or less sophisticated designs. Of course, there are internal and external variants. Many SCSI devices come with a number of sockets in which a number of resistor networks can (must be!) installed. If you remove terminators from a device, carefully store them. You will need them when you ever decide to reconfigure your SCSI bus. There is enough variation in even these simple tiny things to make finding the exact replacement a frustrating business. There are also SCSI devices that have a single jumper to enable or disable a built-in terminator. There are special terminators you can stick onto a flat cable bus. Others look like external connectors, or a connector hood without a cable. So, lots of choice as you can see. There is much debate going on if and when you should switch from simple resistor (passive) terminators to active terminators. Active terminators contain slightly more elaborate circuit to give cleaner bus signals. The general consensus seems to be that the usefulness of active termination increases when you have long buses and/or fast devices. If you ever have problems with your SCSI buses you might consider trying an active terminator. Try to borrow one first, they reputedly are quite expensive. Please keep in mind that terminators for differential and single-ended buses are not identical. You should not mix the two variants. 
OK, and now where should you install your terminators? This is by far the most misunderstood part of SCSI. And it is by far the simplest. The rule is: every single line on the SCSI bus has 2 (two) terminators, one at each end of the bus. So, two and not one or three or whatever. Do yourself a favor and stick to this rule. It will save you endless grief, because wrong termination has the potential to introduce highly mysterious bugs. (Note the potential here; the nastiest part is that it may or may not work.) A common pitfall is to have an internal (flat) cable in a machine and also an external cable attached to the controller. It seems almost everybody forgets to remove the terminators from the controller. The terminator must now be on the last external device, and not on the controller! In general, every reconfiguration of a SCSI bus must pay attention to this. Termination is to be done on a per-line basis. This means if you have both narrow and wide buses connected to the same host adapter, you need to enable termination on the higher 8 bits of the bus on the adapter (as well as the last devices on each bus, of course). What I did myself is remove all terminators from my SCSI devices and controllers. I own a couple of external terminators, for both the Centronics-type external cabling and for the internal flat cable connectors. This makes reconfiguration much easier. On modern devices, sometimes integrated terminators are used. These things are special purpose integrated circuits that can be enabled or disabled with a control pin. It is not necessary to physically remove them from a device. You may find them on newer host adapters, sometimes they are software configurable, using some sort of setup tool. Some will even auto-detect the cables attached to the connectors and automatically set up the termination as necessary. At any rate, consult your documentation! Terminator power The terminators discussed in the previous chapter need power to operate properly. 
On the SCSI bus, a line is dedicated to this purpose. So, simple huh? Not so. Each device can provide its own terminator power to the terminator sockets it has on-device. But if you have external terminators, or when the device supplying the terminator power to the SCSI bus line is switched off you are in trouble. The idea is that initiators (these are devices that initiate actions on the bus, a discussion follows) must supply terminator power. All SCSI devices are allowed (but not required) to supply terminator power. To allow for un-powered devices on a bus, the terminator power must be supplied to the bus via a diode. This prevents the backflow of current to un-powered devices. To prevent all kinds of nastiness, the terminator power is usually fused. As you can imagine, fuses might blow. This can, but does not have to, lead to a non functional bus. If multiple devices supply terminator power, a single blown fuse will not put you out of business. A single supplier with a blown fuse certainly will. Clever external terminators sometimes have a LED indication that shows whether terminator power is present. - In newer designs auto-restoring fuses that 'reset' themselves + In newer designs auto-restoring fuses that reset themselves after some time are sometimes used. Device addressing Because the SCSI bus is, ehh, a bus there must be a way to distinguish or address the different devices connected to it. This is done by means of the SCSI or target ID. Each device has a unique target ID. You can select the ID to which a device must respond using a set of jumpers, or a dip switch, or something similar. Some SCSI host adapters let you change the target ID from the boot menu. (Yet some others will not let you change the ID from 7.) Consult the documentation of your device for more information. Beware of multiple devices configured to use the same ID. Chaos normally reigns in this case. 
A pitfall is that one of the devices sharing the same ID sometimes even manages to answer I/O requests! For an 8 bit bus, a maximum of 8 targets is possible. The maximum is 8 because the selection is done bitwise using the 8 data lines on the bus. For wide buses this increases to the number of data lines (usually 16). A narrow SCSI device cannot communicate with a SCSI device with a target ID larger than 7. This means it is generally not a good idea to move your SCSI host adapter's target ID to something higher than 7 (or your CDROM will stop working). The higher the SCSI target ID, the higher the priority the device has. When it comes to arbitration between devices that want to use the bus at the same time, the device that has the highest SCSI ID will win. This also means that the SCSI host adapter usually uses target ID 7. Note however that the lower 8 IDs have higher priorities than the higher 8 IDs on a wide-SCSI bus. Thus, the priority order of target IDs, from highest to lowest, is: [7 6 .. 1 0 15 14 .. 9 8] on a wide-SCSI system. (If you are wondering why the lower 8 have higher priority, read the previous paragraph for a hint.) For a further subdivision, the standard allows for Logical Units or LUNs for short. A single target ID may have multiple LUNs. For example, a tape device including a tape changer may have LUN 0 for the tape device itself, and LUN 1 for the tape changer. In this way, the host system can address each of the functional units of the tape changer as desired. Bus layout SCSI buses are linear. So, not shaped like Y-junctions, star topologies, rings, cobwebs or whatever else people might want to invent. One of the most common mistakes is for people with wide-SCSI host adapters to connect devices on all three connectors (external connector, internal wide connector, internal narrow connector). Do not do that. 
It may appear to work if you are really lucky, but I can almost guarantee that your system will stop functioning at the most unfortunate moment (this is also known as Murphy's law). You might notice that the terminator issue discussed earlier becomes rather hairy if your bus is not linear. Also, if you have more connectors than devices on your internal SCSI cable, make sure you attach devices on connectors on both ends instead of using the connectors in the middle and letting one or both ends dangle. A dangling end will screw up the termination of the bus. The electrical characteristics of the bus, its noise margins and ultimately its reliability are tightly related to the linear bus rule. Stick to the linear bus rule! Using SCSI with FreeBSD About translations, BIOSes and magic... As stated before, you should first make sure that you have an electrically sound bus. When you want to use a SCSI disk on your PC as boot disk, you must be aware of some quirks related to PC BIOSes. The PC BIOS in its first incarnation used a low level physical interface to the hard disk. So, you had to tell the BIOS (using a setup tool or a BIOS built-in setup) what your disk physically looked like. This involved stating number of heads, number of cylinders, number of sectors per track, obscure things like precompensation and reduced write current cylinder etc. One might be inclined to think that since SCSI disks are smart you can forget about this. Alas, the arcane setup issue is still present today. The system BIOS needs to know how to access your SCSI disk with the head/cyl/sector method in order to load the FreeBSD kernel during boot. The SCSI host adapter or SCSI controller you have put in your AT/EISA/PCI/whatever bus to connect your disk therefore has its own on-board BIOS. During system startup, the SCSI BIOS takes over the hard disk interface routines from the system BIOS. To fool the system BIOS, the system setup is normally set to No hard disk present. Obvious, is it not? 
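To make the head/cyl/sector method a little more concrete, here is a small sketch of how such an address maps onto a linear block number and how a geometry determines capacity. The function names are invented for this example, and the geometry used (64 heads, 32 sectors per track, 512 byte sectors) is only an illustrative choice, not something every adapter uses.

```python
SECTOR_BYTES = 512  # usual disk block size


def chs_to_lba(cyl, head, sector, heads, sectors_per_track):
    """Linear block number of a (cylinder, head, sector) triple.

    Sectors are numbered from 1, cylinders and heads from 0.
    """
    return (cyl * heads + head) * sectors_per_track + (sector - 1)


def capacity_bytes(cylinders, heads, sectors_per_track):
    """Total capacity addressable with the given geometry."""
    return cylinders * heads * sectors_per_track * SECTOR_BYTES


print(chs_to_lba(0, 0, 1, heads=64, sectors_per_track=32))   # 0
print(chs_to_lba(1, 0, 1, heads=64, sectors_per_track=32))   # 2048
print(capacity_bytes(1024, 64, 32) // 2**20)                 # 1024
```

With this particular geometry each cylinder is 2048 blocks, so 1024 cylinders come to 1024 megabytes.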
The SCSI BIOS itself presents to the system a so called translated drive. This means that a fake drive table is constructed that allows the PC to boot the drive. This translation is often (but not always) done using a pseudo drive with 64 heads and 32 sectors per track. By varying the number of cylinders, the SCSI BIOS adapts to the actual drive size. It is useful to note that 32 * 64 / 2 = 1024 Kbytes, so each cylinder of the translated drive holds 1 megabyte and the number of cylinders equals the size of your drive in megabytes. The division by 2 is to get from disk blocks that are normally 512 bytes in size to Kbytes. Right. All is well now?! No, it is not. The system BIOS has another quirk you might run into. The number of cylinders of a bootable hard disk cannot be greater than 1024. Using the translation above, this is a show-stopper for disks greater than 1 GB. With disk capacities going up all the time this is causing problems. Fortunately, the solution is simple: just use another translation, e.g. with 128 heads instead of 64. In most cases new SCSI BIOS versions are available to upgrade older SCSI host adapters. Some newer adapters have an option, in the form of a jumper or software setup selection, to switch the translation the SCSI BIOS uses. It is very important that all operating systems on the disk use the same translation to get the right idea about where to find the relevant partitions. So, when installing FreeBSD you must answer any questions about heads/cylinders etc using the translated values your host adapter uses. Failing to observe the translation issue might lead to un-bootable systems or operating systems overwriting each other's partitions. Using fdisk you should be able to see all partitions. You might have heard some talk of lying devices? Older FreeBSD kernels used to report the geometry of SCSI disks when booting. An example from one of my systems: aha0 targ 0 lun 0: <MICROP 1588-15MB1057404HSP4> sd0: 636MB (1303250 total sec), 1632 cyl, 15 head, 53 sec, bytes/sec 512 Newer kernels usually do not report this information. For example: 
(bt0:0:0): "SEAGATE ST41651 7574" type 0 fixed SCSI 2 sd0(bt0:0:0): Direct-Access 1350MB (2766300 512 byte sectors) Why has this changed? This info is retrieved from the SCSI disk itself. Newer disks often use a technique called zone bit recording. The idea is that on the outer cylinders of the drive there is more space so more sectors per track can be put on them. This results in disks that have more sectors on the outer tracks than on the inner tracks and, last but not least, have more capacity. You can imagine that the value reported by the drive when inquiring about the geometry now becomes suspect at best, and nearly always misleading. When asked for a geometry, it is nearly always better to supply the geometry used by the BIOS or, if the BIOS is never going to know about this disk (e.g. it is not a boot disk), to supply a fictitious geometry that is convenient. SCSI subsystem design FreeBSD uses a layered SCSI subsystem. For each different controller card a device driver is written. This driver knows all the intimate details about the hardware it controls. The driver has an interface to the upper layers of the SCSI subsystem through which it receives its commands and reports back any status. On top of the card drivers there are a number of more generic drivers for a class of devices. More specifically: a driver for tape devices (abbreviation: st), magnetic disks (sd), CDROMs (cd) etc. In case you are wondering where you can find this stuff, it all lives in /sys/scsi. See the man pages in section 4 for more details. The multi level design allows a decoupling of low-level bit banging and more high level stuff. As a result, adding support for another piece of hardware becomes a much more manageable problem. Kernel configuration Dependent on your hardware, the kernel configuration file must contain one or more lines describing your host adapter(s). This includes I/O addresses, interrupts etc. Consult the man page for your adapter driver to get more info. 
Apart from that, check out /sys/i386/conf/LINT for an overview of a kernel config file. LINT contains every possible option you can dream of. It does not imply LINT will actually get you to a working kernel at all. Although it is probably stating the obvious: the kernel config file should reflect your actual hardware setup. So, interrupts, I/O addresses etc must match the kernel config file. During system boot messages will be displayed to indicate whether the configured hardware was actually found. Note that most of the EISA/PCI drivers (namely ahb, ahc, ncr and amd will automatically obtain the correct parameters from the host adapters themselves at boot time; thus, you just need to write, for instance, controller ahc0. An example loosely based on the FreeBSD 2.2.5-Release kernel config file LINT with some added comments (between []): # SCSI host adapters: `aha', `ahb', `aic', `bt', `nca' # # aha: Adaptec 154x # ahb: Adaptec 174x # ahc: Adaptec 274x/284x/294x # aic: Adaptec 152x and sound cards using the Adaptec AIC-6360 (slow!) # amd: AMD 53c974 based SCSI cards (e.g., Tekram DC-390 and 390T) # bt: Most Buslogic controllers # nca: ProAudioSpectrum cards using the NCR 5380 or Trantor T130 # ncr: NCR/Symbios 53c810/815/825/875 etc based SCSI cards # uha: UltraStore 14F and 34F # sea: Seagate ST01/02 8 bit controller (slow!) # wds: Western Digital WD7000 controller (no scatter/gather!). # [For an Adaptec AHA274x/284x/294x/394x etc controller] controller ahc0 [For an NCR/Symbios 53c875 based controller] controller ncr0 [For an Ultrastor adapter] controller uha0 at isa? port "IO_UHA0" bio irq ? 
drq 5 vector uhaintr # Map SCSI buses to specific SCSI adapters controller scbus0 at ahc0 controller scbus2 at ncr0 controller scbus1 at uha0 # The actual SCSI devices disk sd0 at scbus0 target 0 unit 0 [SCSI disk 0 is at scbus 0, LUN 0] disk sd1 at scbus0 target 1 [implicit LUN 0 if omitted] disk sd2 at scbus1 target 3 [SCSI disk on the uha0] disk sd3 at scbus2 target 4 [SCSI disk on the ncr0] tape st1 at scbus0 target 6 [SCSI tape at target 6] device cd0 at scbus? [the first ever CDROM found, no wiring] The example above tells the kernel to look for a ahc (Adaptec 274x) controller, then for an NCR/Symbios board, and so on. The lines following the controller specifications tell the kernel to configure specific devices but only attach them when they match the target ID and LUN specified on the corresponding bus. Wired down devices get first shot at the unit numbers so the first non wired down device, is allocated the unit number one greater than the highest wired down unit number for that kind of device. So, if you had a SCSI tape at target ID 2 it would be configured as st2, as the tape at target ID 6 is wired down to unit number 1. Wired down devices need not be found to get their unit number. The unit number for a wired down device is reserved for that device, even if it is turned off at boot time. This allows the device to be turned on and brought on-line at a later time, without rebooting. Notice that a device's unit number has no relationship with its target ID on the SCSI bus. Below is another example of a kernel config file as used by FreeBSD version < 2.0.5. The difference with the first example is that devices are not wired down. Wired down means that you specify which SCSI target belongs to which device. A kernel built to the config file below will attach the first SCSI disk it finds to sd0, the second disk to sd1 etc. If you ever removed or added a disk, all other devices of the same type (disk - in this case) would 'move around'. 
This implies you have to + in this case) would move around. This implies you have to change /etc/fstab each time. Although the old style still works, you are strongly recommended to use this new feature. It will save you a lot of grief whenever you shift your hardware around on the SCSI buses. So, when you re-use your old trusty config file after upgrading from a pre-FreeBSD2.0.5.R system check this out. [driver for Adaptec 174x] controller ahb0 at isa? bio irq 11 vector ahbintr [for Adaptec 154x] controller aha0 at isa? port "IO_AHA0" bio irq 11 drq 5 vector ahaintr [for Seagate ST01/02] controller sea0 at isa? bio irq 5 iomem 0xc8000 iosiz 0x2000 vector seaintr controller scbus0 device sd0 [support for 4 SCSI harddisks, sd0 up sd3] device st0 [support for 2 SCSI tapes] [for the CDROM] device cd0 #Only need one of these, the code dynamically grows Both examples support SCSI disks. If during boot more devices of a specific type (e.g. sd disks) are found than are configured in the booting kernel, the system will simply allocate more devices, incrementing the unit number starting at the last number wired down. If there are no wired down devices then counting starts at unit 0. Use man 4 scsi to check for the latest info on the SCSI subsystem. For more detailed info on host adapter drivers use e.g., man 4 ahc for info on the Adaptec 294x driver. Tuning your SCSI kernel setup Experience has shown that some devices are slow to respond to INQUIRY commands after a SCSI bus reset (which happens at boot time). An INQUIRY command is sent by the kernel on boot to see what kind of device (disk, tape, CDROM etc.) is connected to a specific target ID. This process is called device probing by the way. - To work around the 'slow response' problem, FreeBSD allows a + To work around the slow response problem, FreeBSD allows a tunable delay time before the SCSI devices are probed following a SCSI bus reset. 
You can set this delay time in your kernel configuration file using a line like: options SCSI_DELAY=15 #Be pessimistic about Joe SCSI device This line sets the delay time to 15 seconds. On my own system I had to use 3 seconds minimum to get my trusty old CDROM drive to be recognized. Start with a high value (say 30 seconds or so) when you have problems with device recognition. If this helps, tune it back until it just stays working. Rogue SCSI devices Although the SCSI standard tries to be complete and concise, it is a complex standard and implementing things correctly is no easy task. Some vendors do a better job than others. This is exactly where the rogue devices come into view. Rogues are devices that are recognized by the FreeBSD kernel as behaving slightly (...) non-standard. Rogue devices are reported by the kernel when booting. An example for two of my cartridge tape units: Feb 25 21:03:34 yedi /kernel: ahb0 targ 5 lun 0: <TANDBERG TDC 3600 -06:> Feb 25 21:03:34 yedi /kernel: st0: Tandberg tdc3600 is a known rogue Mar 29 21:16:37 yedi /kernel: aha0 targ 5 lun 0: <ARCHIVE VIPER 150 21247-005> Mar 29 21:16:37 yedi /kernel: st1: Archive Viper 150 is a known rogue For instance, there are devices that respond to all LUNs on a certain target ID, even if they are actually only one device. It is easy to see that the kernel might be fooled into believing that there are 8 LUNs at that particular target ID. The confusion this causes is left as an exercise to the reader. The SCSI subsystem of FreeBSD recognizes devices with bad habits by looking at the INQUIRY response they send when probed. Because the INQUIRY response also includes the version number of the device firmware, it is even possible that for different firmware versions different workarounds are used. See e.g. /sys/scsi/st.c and /sys/scsi/scsiconf.c for more info on how this is done. This scheme works fine, but keep in mind that it of course only works for devices that are known to be weird. 
If you are the first to connect your bogus Mumbletech SCSI CDROM you might be the one that has to define which workaround is needed. After you get your Mumbletech working, please send the required workaround to the FreeBSD development team for inclusion in the next release of FreeBSD. Other Mumbletech owners will be grateful to you. Multiple LUN devices In some cases you come across devices that use multiple logical units (LUNs) on a single SCSI ID. An example is the so called bridge board that connects 2 non-SCSI hard disks to a SCSI bus (e.g. an Emulex MD21 found in old Sun systems). In most cases FreeBSD only probes devices for LUN 0. This means that any devices with LUNs != 0 are not normally found during device probe on system boot. To work around this problem you must add an appropriate entry in /sys/scsi/scsiconf.c and rebuild your kernel. Look for a struct that is initialized like below: { T_DIRECT, T_FIXED, "MAXTOR", "XT-4170S", "B5A", "mx1", SC_ONE_LU } For your Mumbletech BRIDGE2000 that has more than one LUN, acts as a SCSI disk and has firmware revision 123 you would add something like: { T_DIRECT, T_FIXED, "MUMBLETECH", "BRIDGE2000", "123", "sd", SC_MORE_LUS } The kernel on boot scans the inquiry data it receives against the table and acts accordingly. See the source for more info. Tagged command queuing Modern SCSI devices, particularly magnetic disks, support what is called tagged command queuing (TCQ). In a nutshell, TCQ allows the device to have multiple I/O requests outstanding at the same time. Because the device is intelligent, it can optimize its operations (like head positioning) based on its own request queue. On SCSI devices like RAID (Redundant Array of Independent Disks) arrays the TCQ function is indispensable to take advantage of the device's inherent parallelism. 
Each I/O request is uniquely identified by a tag (hence the name tagged command queuing) and this tag is used by FreeBSD to see which I/O in the device driver's queue is reported as complete by the device. It should be noted however that TCQ requires device driver support and that some devices did not implement it quite right in their firmware. This problem bit me once, and it leads to highly mysterious problems. In such cases, try to disable TCQ. Busmaster host adapters Most, but not all, SCSI host adapters are bus mastering controllers. This means that they can do I/O on their own without putting load onto the host CPU for data movement. This is of course an advantage for a multitasking operating system like FreeBSD. It must be noted however that there might be some rough edges. For instance an Adaptec 1542 controller can be set to use different transfer speeds on the host bus (ISA or AT in this case). The controller is settable to different rates because not all motherboards can handle the higher speeds. Problems like hang-ups, bad data etc might be the result of using a higher data transfer rate than your motherboard can stomach. The solution is of course obvious: switch to a lower data transfer rate and see if that works better. In the case of an Adaptec 1542, there is an option that can be put into the kernel config file to allow dynamic determination of the right, read: fastest feasible, transfer rate. This option is disabled by default: options "TUNE_1542" #dynamic tune of bus DMA speed Check the man pages for the host adapter that you use. Or better still, use the ultimate documentation (read: driver source). Tracking down problems The following list is an attempt to give a guideline for the most common SCSI problems and their solutions. It is by no means complete. Check for loose connectors and cables. Check and double check the location and number of your terminators. 
Check if your bus has at least one supplier of terminator power (especially with external terminators). Check if no double target IDs are used. Check if all devices to be used are powered up. Make a minimal bus config with as few devices as possible. If possible, configure your host adapter to use slow bus speeds. Disable tagged command queuing to make things as simple as possible (for an NCR host adapter based system see man ncrcontrol). If you can compile a kernel, make one with the SCSIDEBUG option, and try accessing the device with debugging turned on for that device. If your device does not even probe at startup, you may have to define the address of the device that is failing, and the desired debug level in /sys/scsi/scsidebug.h. If it probes but just does not work, you can use the &man.scsi.8; command to dynamically set a debug level for it in a running kernel (if SCSIDEBUG is defined). This will give you copious debugging output with which to confuse the gurus. See man 4 scsi for more exact information. Also look at man 8 scsi. Further reading If you intend to do some serious SCSI hacking, you might want to have the official standard at hand: Approved American National Standards can be purchased from ANSI at
13th Floor 11 West 42nd Street New York NY 10036 Sales Dept: (212) 642-4900
You can also buy many ANSI standards and most committee draft documents from Global Engineering Documents,
15 Inverness Way East Englewood CO, 80112-5704 Phone: (800) 854-7179 Outside USA and Canada: (303) 792-2181 Fax: (303) 792-2192
Many X3T10 draft documents are available electronically on the SCSI BBS (719-574-0424) and on the ncrinfo.ncr.com anonymous FTP site. Latest X3T10 committee documents are: AT Attachment (ATA or IDE) [X3.221-1994] (Approved) ATA Extensions (ATA-2) [X3T10/948D Rev 2i] Enhanced Small Device Interface (ESDI) [X3.170-1990/X3.170a-1991] (Approved) Small Computer System Interface — 2 (SCSI-2) [X3.131-1994] (Approved) SCSI-2 Common Access Method Transport and SCSI Interface Module (CAM) [X3T10/792D Rev 11] Other publications that might provide you with additional information are: SCSI: Understanding the Small Computer System Interface, written by NCR Corporation. Available from: Prentice Hall, Englewood Cliffs, NJ, 07632 Phone: (201) 767-5937 ISBN 0-13-796855-8 Basics of SCSI, a SCSI tutorial written by Ancot Corporation Contact Ancot for availability information at: Phone: (415) 322-5322 Fax: (415) 322-0455 SCSI Interconnection Guide Book, an AMP publication (dated 4/93, Catalog 65237) that lists the various SCSI connectors and suggests cabling schemes. Available from AMP at (800) 522-6752 or (717) 564-0100 Fast Track to SCSI, A Product Guide written by Fujitsu. Available from: Prentice Hall, Englewood Cliffs, NJ, 07632 Phone: (201) 767-5937 ISBN 0-13-307000-X The SCSI Bench Reference, The SCSI Encyclopedia, and the SCSI Tutor, ENDL Publications, 14426 Black Walnut Court, Saratoga CA, 95070 Phone: (408) 867-6642 Zadian SCSI Navigator (quick ref. book) and Discover the Power of SCSI (First book along with a one-hour video and tutorial book), Zadian Software, Suite 214, 1210 S. Bascom Ave., San Jose, CA 92128, (408) 293-0800 On Usenet the newsgroups comp.periphs.scsi and comp.periphs are noteworthy places to look for more info. You can also find the SCSI-Faq there, which is posted periodically. Most major SCSI device and host adapter suppliers operate FTP sites and/or BBS systems. They may be valuable sources of information about the devices you own.
* Disk/tape controllers * SCSI * IDE * Floppy Hard drives SCSI hard drives Contributed by &a.asami;. 17 February 1998. As mentioned in the SCSI section, virtually all SCSI hard drives sold today are SCSI-2 compliant and thus will work fine as long as you connect them to a supported SCSI host adapter. Most problems people encounter are either due to badly designed cabling (cable too long, star topology, etc.), insufficient termination, or defective parts. Please refer to the SCSI section first if your SCSI hard drive is not working. However, there are a couple of things you may want to take into account before you purchase SCSI hard drives for your system. Rotational speed Rotational speeds of SCSI drives sold today range from around 4,500RPM to 10,000RPM. Most of them are either 5,400RPM or 7,200RPM. Even though the 7,200RPM drives can generally transfer data faster, they run considerably hotter than their 5,400RPM counterparts. A large fraction of today's disk drive malfunctions are heat-related. If you do not have very good cooling in your PC case, you may want to stick with 5,400RPM or slower drives. Note that newer drives, with higher areal recording densities, can deliver much more bits per rotation than older ones. Today's top-of-line 5,400RPM drives can sustain a throughput comparable to 7,200RPM drives of one or two model generations ago. The number to find on the spec sheet for bandwidth is internal data (or transfer) rate. It is usually in megabits/sec so divide it by 8 and you will get the rough approximation of how much megabytes/sec you can get out of the drive. (If you are a speed maniac and want a 10,000RPM drive for your cute little PC, be my guest; however, those drives become extremely hot. Do not even think about it if you do not have a fan blowing air directly at the drive or a properly ventilated disk enclosure.) 
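The divide-by-8 rule of thumb above is easy to sketch. The function name and the 120 megabits/sec figure are made up for the example and do not come from any particular spec sheet.

```python
def approx_mbytes_per_sec(internal_rate_mbits):
    """Rough sustained throughput from the spec sheet's internal data rate.

    The spec sheet value is in megabits/sec; 8 bits make one byte.
    """
    return internal_rate_mbits / 8


# Hypothetical drive with a 120 megabits/sec internal data rate.
print(approx_mbytes_per_sec(120))   # 15.0
```

Treat the result as an upper bound for comparison shopping; what you actually see depends on the rest of the system.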
Obviously, the latest 10,000RPM drives and 7,200RPM drives can deliver more data than the latest 5,400RPM drives, so if absolute bandwidth is the necessity for your applications, you have little choice but to get the faster drives. Also, if you need low latency, faster drives are better; not only do they usually have lower average seek times, but also the rotational delay is one place where slow-spinning drives can never beat a faster one. (The average rotational latency is half the time it takes to rotate the drive once; thus, it is 3 milliseconds for 10,000RPM drives, 4.2ms for 7,200RPM drives and 5.6ms for 5,400RPM drives.) Latency is seek time plus rotational delay. Make sure you understand whether you need low latency or more accesses per second, though; in the latter case (e.g., news servers), it may not be optimal to purchase one big fast drive. You can achieve similar or even better results by using the ccd (concatenated disk) driver to create a striped disk array out of multiple slower drives for comparable overall cost. Make sure you have adequate air flow around the drive, especially if you are going to use a fast-spinning drive. You - generally need at least 1/2" (1.25cm) of spacing above and below a + generally need at least 1/2” (1.25cm) of spacing above and below a drive. Understand how the air flows through your PC case. Most cases have the power supply suck the air out of the back. See where the air flows in, and put the drive where it will have the largest volume of cool air flowing around it. You may need to seal some unwanted holes or add a new fan for effective cooling. Another consideration is noise. Many 7,200 or faster drives generate a high-pitched whine which is quite unpleasant to most people. That, plus the extra fans often required for cooling, may make 7,200 or faster drives unsuitable for some office and home environments. Form factor - Most SCSI drives sold today are of 3.5" form factor. 
They - come in two different heights; 1.6" (half-height) or - 1" (low-profile). The half-height drive is the same + Most SCSI drives sold today are of 3.5” form factor. They + come in two different heights; 1.6” (half-height) or + 1” (low-profile). The half-height drive is the same height as a CDROM drive. However, do not forget the spacing rule mentioned in the previous section. If you have three standard - 3.5" drive bays, you will not be able to put three half-height + 3.5” drive bays, you will not be able to put three half-height drives in there (without frying them, that is). Interface The majority of SCSI hard drives sold today are Ultra or Ultra-wide SCSI. The maximum bandwidth of Ultra SCSI is 20MB/sec, and Ultra-wide SCSI is 40MB/sec. There is no difference in max cable length between Ultra and Ultra-wide; however, the more devices you have on the same bus, the sooner you will start having bus integrity problems. Unless you have a well-designed disk enclosure, it is not easy to make more than 5 or 6 Ultra SCSI drives work on a single bus. On the other hand, if you need to connect many drives, going for Fast-wide SCSI may not be a bad idea. That will have the same max bandwidth as Ultra (narrow) SCSI, while electronically it is much easier to get it right. My advice would be: if you want to connect many disks, get wide SCSI drives; they usually cost a little more but it may save you down the road. (Besides, if you cannot afford the cost difference, you should not be building a disk array.) There are two variants of wide SCSI drives: 68-pin and 80-pin SCA (Single Connector Attach). The SCA drives do not have a separate 4-pin power connector, and also read the SCSI ID settings through the 80-pin connector. If you are really serious about building a large storage system, get SCA drives and a good SCA enclosure (dual power supply with at least one extra fan). 
They are more electronically sound than 68-pin counterparts because there is no stub of the SCSI bus inside the disk canister as in arrays built from 68-pin drives. They are easier to install too (you just need to screw the drive into the canister, instead of trying to squeeze your fingers into a tight place to hook up all the little cables, like the SCSI ID and disk activity LED lines). * IDE hard drives Tape drives Contributed by &a.jmb;. 2 July 1996. General tape access commands &man.mt.1; provides generic access to the tape drives. Some of the more common commands are rewind, erase, and status. See the &man.mt.1; manual page for a detailed description. Controller Interfaces There are several different interfaces that support tape drives. The interfaces are SCSI, IDE, Floppy and Parallel Port. A wide variety of tape drives are available for these interfaces. Controllers are discussed in Disk/tape controllers. SCSI drives The &man.st.4; driver provides support for 8mm (Exabyte), 4mm (DAT: Digital Audio Tape), QIC (Quarter-Inch Cartridge), DLT (Digital Linear Tape), QIC Mini cartridge and 9-track (remember the big reels that you see spinning in Hollywood computer rooms) tape drives. See the &man.st.4; manual page for a detailed description. The drives listed below are currently being used by members of the FreeBSD community. They are not the only drives that will work with FreeBSD. They just happen to be the ones that we use. 
4mm (DAT: Digital Audio Tape) Archive Python 28454 Archive Python 04687 HP C1533A HP C1534A HP 35450A HP 35470A HP 35480A SDT-5000 Wangtek 6200 8mm (Exabyte) EXB-8200 EXB-8500 EXB-8505 QIC (Quarter-Inch Cartridge) Archive Anaconda 2750 Archive Viper 60 Archive Viper 150 Archive Viper 2525 Tandberg TDC 3600 Tandberg TDC 3620 Tandberg TDC 3800 Tandberg TDC 4222 Wangtek 5525ES DLT (Digital Linear Tape) Digital TZ87 Mini-Cartridge Conner CTMS 3200 Exabyte 2501 Autoloaders/Changers Hewlett-Packard HP C1553A Autoloading DDS2 * IDE drives Floppy drives Conner 420R * Parallel port drives Detailed Information Archive Anaconda 2750 The boot message identifier for this drive is ARCHIVE ANCDA 2750 28077 -003 type 1 removable SCSI 2 This is a QIC tape drive. Native capacity is 1.35GB when using QIC-1350 tapes. This drive will read and write QIC-150 (DC6150), QIC-250 (DC6250), and QIC-525 (DC6525) tapes as well. Data transfer rate is 350kB/s using &man.dump.8;. Rates of 530kB/s have been reported when using Amanda. Production of this drive has been discontinued. The SCSI bus connector on this tape drive is reversed from that on most other SCSI devices. Make sure that you have enough SCSI cable to twist the cable one-half turn before and after the Archive Anaconda tape drive, or turn your other SCSI devices upside-down. Two kernel code changes are required to use this drive. This drive will not work as delivered. If you have a SCSI-2 controller, short jumper 6. Otherwise, the drive behaves as a SCSI-1 device. When operating as a SCSI-1 device, this drive locks the SCSI bus during some tape operations, including: fsf, rewind, and rewoffl. If you are using the NCR SCSI controllers, patch the file /usr/src/sys/pci/ncr.c (as shown below). Build and install a new kernel. *** 4831,4835 **** }; ! if (np->latetime>4) { /* ** Although we tried to wake it up, --- 4831,4836 ---- }; ! 
if (np->latetime>1200) { /* ** Although we tried to wake it up, Reported by: &a.jmb; Archive Python 28454 The boot message identifier for this drive is ARCHIVE Python 28454-XXX4ASB type 1 removable SCSI 2 density code 0x8c, 512-byte blocks This is a DDS-1 tape drive. Native capacity is 2.5GB on 90m tapes. Data transfer rate is XXX. This drive was repackaged by Sun Microsystems as model 595-3067. Reported by: Bob Bishop rb@gid.co.uk Throughput is in the 1.5 MByte/sec range; however, this will drop if the disks and tape drive are on the same SCSI controller. Reported by: Robert E. Seastrom rs@seastrom.com Archive Python 04687 The boot message identifier for this drive is ARCHIVE Python 04687-XXX 6580 Removable Sequential Access SCSI-2 device This is a DAT-DDS-2 drive. Native capacity is 4GB when using 120m tapes. This drive supports hardware data compression. Switch 4 controls MRS (Media Recognition System). MRS tapes have stripes on the transparent leader. Switch 4 off enables MRS, on disables MRS. Parity is controlled by switch 5. Switch 5 on to enable parity control. Compression is enabled with Switch 6 off. It is possible to override compression with the SCSI MODE SELECT command (see &man.mt.1;). Data transfer rate is 800kB/s. Archive Viper 60 The boot message identifier for this drive is ARCHIVE VIPER 60 21116 -007 type 1 removable SCSI 1 This is a QIC tape drive. Native capacity is 60MB. Data transfer rate is XXX. Production of this drive has been discontinued. Reported by: Philippe Regnauld regnauld@hsc.fr Archive Viper 150 The boot message identifier for this drive is ARCHIVE VIPER 150 21531 -004 Archive Viper 150 is a known rogue type 1 removable SCSI 1. A multitude of firmware revisions exist for this drive. Your drive may report different numbers (e.g., 21247 -005). This is a QIC tape drive. Native capacity is 150/250MB. Both 150MB (DC6150) and 250MB (DC6250) tapes have the same recording format. The 250MB tapes are approximately 67% longer than the 150MB tapes. 
This drive can read 120MB tapes as well. It cannot write 120MB tapes. Data transfer rate is 100kB/s. This drive reads and writes DC6150 (150MB) and DC6250 (250MB) tapes. This drive's quirks are known and pre-compiled into the SCSI tape device driver (&man.st.4;). Under FreeBSD 2.2-CURRENT, use mt blocksize 512 to set the blocksize. (The particular drive had firmware revision 21247 -005. Other firmware revisions may behave differently.) Previous versions of FreeBSD did not have this problem. Production of this drive has been discontinued. Reported by: Pedro A M Vazquez vazquez@IQM.Unicamp.BR &a.msmith; Archive Viper 2525 The boot message identifier for this drive is ARCHIVE VIPER 2525 25462 -011 type 1 removable SCSI 1 This is a QIC tape drive. Native capacity is 525MB. Data transfer rate is 180kB/s at 90 inches/sec. The drive reads QIC-525, QIC-150, QIC-120 and QIC-24 tapes. It writes QIC-525, QIC-150, and QIC-120. Firmware revisions prior to 25462 -011 are bug-ridden and will not function properly. Production of this drive has been discontinued. Conner 420R The boot message identifier for this drive is Conner tape. This is a floppy controller, mini cartridge tape drive. Native capacity is XXXX. Data transfer rate is XXX. The drive uses QIC-80 tape cartridges. Reported by: Mark Hannon mark@seeware.DIALix.oz.au Conner CTMS 3200 The boot message identifier for this drive is CONNER CTMS 3200 7.00 type 1 removable SCSI 2. This is a mini cartridge tape drive. Native capacity is XXXX. Data transfer rate is XXX. The drive uses QIC-3080 tape cartridges. Reported by: Thomas S. Traylor tst@titan.cs.mci.com <ulink url="http://www.digital.com/info/Customer-Update/931206004.txt.html">DEC TZ87</ulink> The boot message identifier for this drive is DEC TZ87 (C) DEC 9206 type 1 removable SCSI 2 density code 0x19 This is a DLT tape drive. Native capacity is 10GB. This drive supports hardware data compression. Data transfer rate is 1.2MB/s. This drive is identical to the Quantum DLT2000. 
The drive firmware can be set to emulate several well-known drives, including an Exabyte 8mm drive. Reported by: &a.wilko; <ulink url="http://www.Exabyte.COM:80/Products/Minicartridge/2501/Rfeatures.html">Exabyte EXB-2501</ulink> The boot message identifier for this drive is EXABYTE EXB-2501 This is a mini-cartridge tape drive. Native capacity is 1GB when using MC3000XL mini cartridges. Data transfer rate is XXX. This drive can read and write DC2300 (550MB), DC2750 (750MB), MC3000 (750MB), and MC3000XL (1GB) mini cartridges. WARNING: This drive does not meet the SCSI-2 specifications. The drive locks up completely in response to a SCSI MODE_SELECT command unless there is a formatted tape in the drive. Before using this drive, set the tape blocksize with &prompt.root; mt -f /dev/st0ctl.0 blocksize 1024 Before using a mini cartridge for the first time, the mini cartridge must be formatted. FreeBSD 2.1.0-RELEASE and earlier: &prompt.root; /sbin/scsi -f /dev/rst0.ctl -s 600 -c "4 0 0 0 0 0" (Alternatively, fetch a copy of the scsiformat shell script from FreeBSD 2.1.5/2.2.) FreeBSD 2.1.5 and later: &prompt.root; /sbin/scsiformat -q -w /dev/rst0.ctl Right now, this drive cannot really be recommended for FreeBSD. Reported by: Bob Beaulieu ez@eztravel.com Exabyte EXB-8200 The boot message identifier for this drive is EXABYTE EXB-8200 252X type 1 removable SCSI 1 This is an 8mm tape drive. Native capacity is 2.3GB. Data transfer rate is 270kB/s. This drive is fairly slow in responding to the SCSI bus during boot. A custom kernel may be required (set SCSI_DELAY to 10 seconds). There are a large number of firmware configurations for this drive; some have been customized to a particular vendor's hardware. The firmware can be changed via EPROM replacement. Production of this drive has been discontinued. Reported by: &a.msmith; Exabyte EXB-8500 The boot message identifier for this drive is EXABYTE EXB-8500-85Qanx0 0415 type 1 removable SCSI 2 This is an 8mm tape drive. 
Native capacity is 5GB. Data transfer rate is 300kB/s. Reported by: Greg Lehey grog@lemis.de <ulink url="http://www.Exabyte.COM:80/Products/8mm/8505XL/Rfeatures.html">Exabyte EXB-8505</ulink> The boot message identifier for this drive is EXABYTE EXB-85058SQANXR1 05B0 type 1 removable SCSI 2 This is an 8mm tape drive which supports compression, and is upward compatible with the EXB-5200 and EXB-8500. Native capacity is 5GB. The drive supports hardware data compression. Data transfer rate is 300kB/s. Reported by: Glen Foster gfoster@gfoster.com Hewlett-Packard HP C1533A The boot message identifier for this drive is HP C1533A 9503 type 1 removable SCSI 2. This is a DDS-2 tape drive. DDS-2 means hardware data compression and narrower tracks for increased data capacity. Native capacity is 4GB when using 120m tapes. This drive supports hardware data compression. Data transfer rate is 510kB/s. This drive is used in Hewlett-Packard's SureStore 6000eU and 6000i tape drives and C1533A DDS-2 DAT drive. The drive has a block of 8 dip switches. The proper settings for FreeBSD are: 1 ON; 2 ON; 3 OFF; 4 ON; 5 ON; 6 ON; 7 ON; 8 ON. switch 1 switch 2 Result On On Compression enabled at power-on, with host control On Off Compression enabled at power-on, no host control Off On Compression disabled at power-on, with host control Off Off Compression disabled at power-on, no host control Switch 3 controls MRS (Media Recognition System). MRS tapes have stripes on the transparent leader. These identify the tape as DDS (Digital Data Storage) grade media. Tapes that do not have the stripes will be treated as write-protected. Switch 3 OFF enables MRS. Switch 3 ON disables MRS. See HP SureStore Tape Products and Hewlett-Packard Disk and Tape Technical Information for more information on configuring this drive. Warning: Quality control on these drives varies greatly. One FreeBSD core-team member has returned 2 of these drives. Neither lasted more than 5 months. 
Reported by: &a.se; Hewlett-Packard HP C1534A The boot message identifier for this drive is HP HP35470A T503 type 1 removable SCSI 2 Sequential-Access density code 0x13, variable blocks. This is a DDS-1 tape drive. DDS-1 is the original DAT tape format. Native capacity is 2GB when using 90m tapes. Data transfer rate is 183kB/s. The same mechanism is used in Hewlett-Packard's SureStore 2000i tape drive, C35470A DDS format DAT drive, C1534A DDS format DAT drive and HP C1536A DDS format DAT drive. The HP C1534A DDS format DAT drive has two indicator lights, one green and one amber. The green one indicates tape action: slow flash during load, steady when loaded, fast flash during read/write operations. The amber one indicates warnings: slow flash when cleaning is required or tape is nearing the end of its useful life, steady indicates a hard fault (factory service required?). Reported by Gary Crutcher gcrutchr@nightflight.com Hewlett-Packard HP C1553A Autoloading DDS2 The boot message identifier for this drive is "". This is a DDS-2 tape drive with a tape changer. DDS-2 means hardware data compression and narrower tracks for increased data capacity. Native capacity is 24GB when using 120m tapes. This drive supports hardware data compression. Data transfer rate is 510kB/s (native). This drive is used in Hewlett-Packard's SureStore 12000e tape drive. The drive has two selectors on the rear panel. The selector closer to the fan is SCSI id. The other selector should be set to 7. There are four internal switches. These should be set: 1 ON; 2 ON; 3 ON; 4 OFF. At present the kernel drivers do not automatically change tapes at the end of a volume. 
This shell script can be used to change tapes:

#!/bin/sh
PATH="/sbin:/usr/sbin:/bin:/usr/bin"; export PATH

# Print a usage summary and exit with an error status.
usage()
{
    echo "Usage: dds_changer [123456ne] raw-device-name"
    echo "1..6 = Select cartridge"
    echo "n    = next cartridge"
    echo "e    = eject magazine"
    exit 2
}

if [ $# -ne 2 ] ; then
    usage
fi

# Bytes 3 to 5 of the SCSI command descriptor block.
cdb3=0
cdb4=0
cdb5=0

case $1 in
    [123456])
        cdb3=$1
        cdb4=1
        ;;
    n)
        ;;
    e)
        cdb5=0x80
        ;;
    *)
        # Anything else, including multi-character arguments, is an error.
        usage
        ;;
esac

scsi -f "$2" -s 100 -c "1b 0 0 $cdb3 $cdb4 $cdb5"

Hewlett-Packard HP 35450A The boot message identifier for this drive is HP HP35450A -A C620 type 1 removable SCSI 2 Sequential-Access density code 0x13 This is a DDS-1 tape drive. DDS-1 is the original DAT tape format. Native capacity is 1.2GB. Data transfer rate is 160kB/s. Reported by: Mark Thompson mark.a.thompson@pobox.com Hewlett-Packard HP 35470A The boot message identifier for this drive is HP HP35470A 9 09 type 1 removable SCSI 2 This is a DDS-1 tape drive. DDS-1 is the original DAT tape format. Native capacity is 2GB when using 90m tapes. Data transfer rate is 183kB/s. The same mechanism is used in Hewlett-Packard's SureStore 2000i tape drive, C35470A DDS format DAT drive, C1534A DDS format DAT drive, and HP C1536A DDS format DAT drive. Warning: Quality control on these drives varies greatly. One FreeBSD core-team member has returned 5 of these drives. None lasted more than 9 months. Reported by: David Dawes dawes@rf900.physics.usyd.edu.au (9 09) Hewlett-Packard HP 35480A The boot message identifier for this drive is HP HP35480A 1009 type 1 removable SCSI 2 Sequential-Access density code 0x13. This is a DDS-DC tape drive. DDS-DC is DDS-1 with hardware data compression. DDS-1 is the original DAT tape format. Native capacity is 2GB when using 90m tapes. It cannot handle 120m tapes. This drive supports hardware data compression. Please refer to the section on HP C1533A for the proper switch settings. Data transfer rate is 183kB/s. 
This drive is used in Hewlett-Packard's SureStore 5000eU and 5000i tape drives and C35480A DDS format DAT drive. This drive will occasionally hang during a tape eject operation (mt offline). Pressing the front panel button will eject the tape and bring the tape drive back to life. WARNING: HP 35480-03110 only. On at least two occasions this tape drive, when used with FreeBSD 2.1.0, an IBM Server 320, and a 2940W SCSI controller, resulted in all SCSI disk partitions being lost. The problem has not been analyzed or resolved at this time. <ulink url="http://www.sel.sony.com/SEL/ccpg/storage/tape/t5000.html">Sony SDT-5000</ulink> There are at least two significantly different models: one is a DDS-1 and the other DDS-2. The DDS-1 version is SDT-5000 3.02. The DDS-2 version is SONY SDT-5000 327M. The DDS-2 version has a 1MB cache. This cache is able to keep the tape streaming under almost any circumstances. The boot message identifier for this drive is SONY SDT-5000 3.02 type 1 removable SCSI 2 Sequential-Access density code 0x13 Native capacity is 4GB when using 120m tapes. This drive supports hardware data compression. Data transfer rate depends upon the model of the drive. The rate is 630kB/s for the SONY SDT-5000 327M while compressing the data. For the SONY SDT-5000 3.02, the data transfer rate is 225kB/s. In order to get this drive to stream, set the blocksize to 512 bytes (mt blocksize 512) reported by Kenneth Merry ken@ulc199.residence.gatech.edu. SONY SDT-5000 327M information reported by Charles Henrich henrich@msu.edu. Reported by: &a.jmz; Tandberg TDC 3600 The boot message identifier for this drive is TANDBERG TDC 3600 =08: type 1 removable SCSI 2 This is a QIC tape drive. Native capacity is 150/250MB. This drive has quirks which are known, and workaround code is present in the SCSI tape device driver (&man.st.4;). Upgrading the firmware to XXX version will fix the quirks and provide SCSI 2 capabilities. Data transfer rate is 80kB/s. 
IBM and Emerald units will not work. Replacing the firmware EPROM of these units will solve the problem. Reported by: &a.msmith; Tandberg TDC 3620 This is very similar to the Tandberg TDC 3600 drive. Reported by: &a.joerg; Tandberg TDC 3800 The boot message identifier for this drive is TANDBERG TDC 3800 =04Y Removable Sequential Access SCSI-2 device This is a QIC tape drive. Native capacity is 525MB. Reported by: &a.jhs; Tandberg TDC 4222 The boot message identifier for this drive is TANDBERG TDC 4222 =07 type 1 removable SCSI 2 This is a QIC tape drive. Native capacity is 2.5GB. The drive will read all cartridges from the 60 MB (DC600A) upwards, and write 150 MB (DC6150) upwards. Hardware compression is optionally supported for the 2.5 GB cartridges. This drive's quirks are known and pre-compiled into the SCSI tape device driver (&man.st.4;) beginning with FreeBSD 2.2-CURRENT. For previous versions of FreeBSD, use mt to read one block from the tape, rewind the tape, and then execute the backup program (mt fsr 1; mt rewind; dump ...) Data transfer rate is 600kB/s (vendor claim with compression), 350 KB/s can even be reached in start/stop mode. The rate decreases for smaller cartridges. Reported by: &a.joerg; Wangtek 5525ES The boot message identifier for this drive is WANGTEK 5525ES SCSI REV7 3R1 type 1 removable SCSI 1 density code 0x11, 1024-byte blocks This is a QIC tape drive. Native capacity is 525MB. Data transfer rate is 180kB/s. The drive reads 60, 120, 150, and 525MB tapes. The drive will not write 60MB (DC600 cartridge) tapes. In order to overwrite 120 and 150 tapes reliably, first erase (mt erase) the tape. 120 and 150 tapes used a wider track (fewer tracks per tape) than 525MB tapes. The extra width of the previous tracks is not overwritten; as a result, the new data lies in a band surrounded on both sides by the previous data unless the tape has been erased. This drive's quirks are known and pre-compiled into the SCSI tape device driver (&man.st.4;). 
Other firmware revisions that are known to work are: M75D Reported by: Marc van Kempen marc@bowtie.nl REV73R1 Andrew Gordon Andrew.Gordon@net-tel.co.uk M75D Wangtek 6200 The boot message identifier for this drive is WANGTEK 6200-HS 4B18 type 1 removable SCSI 2 Sequential-Access density code 0x13 This is a DDS-1 tape drive. Native capacity is 2GB using 90m tapes. Data transfer rate is 150kB/s. Reported by: Tony Kimball alk@Think.COM * Problem drives CDROM drives Contributed by &a.obrien;. 23 November 1997. Generally speaking, those in The FreeBSD Project prefer SCSI CDROM drives over IDE CDROM drives. However, not all SCSI CDROM drives are equal. Some feel the quality of some SCSI CDROM drives has been deteriorating to that of IDE CDROM drives. Toshiba used to be the favored stand-by, but many on the SCSI mailing list have found displeasure with the 12x speed XM-5701TA as its volume (when playing audio CDROMs) is not controllable by the various audio player software. Another area where SCSI CDROM manufacturers are cutting corners is adherence to the SCSI specification. Many SCSI CDROMs will respond to multiple LUNs for their target address. Known violators include the 6x Teac CD-56S 1.0D.
diff --git a/en_US.ISO8859-1/articles/vinum/article.sgml b/en_US.ISO8859-1/articles/vinum/article.sgml index 32b3f435a7..9eed234383 100644 --- a/en_US.ISO8859-1/articles/vinum/article.sgml +++ b/en_US.ISO8859-1/articles/vinum/article.sgml @@ -1,2542 +1,2542 @@ Vinum"> %man; ]>
Bootstrapping Vinum: A Foundation for Reliable Servers Robert A. Van Valzah 2001 Robert A. Van Valzah - $Date: 2001-10-31 23:12:55 $ GMT - $Id: article.sgml,v 1.4 2001-10-31 23:12:55 chern Exp $ + $Date: 2002-02-14 23:57:13 $ GMT + $Id: article.sgml,v 1.5 2002-02-14 23:57:13 keramida Exp $ In the most abstract sense, these instructions show how to build a pair of disk drives where either one is adequate to keep your server running if the other fails. Life is better if they are both working, but your server will never die unless both disk drives die at once. If you choose ATAPI drives and use a fairly generic kernel, you can be confident that either of these drives can be plugged into most any main board to produce a working server in a pinch. The drives need not be identical. These techniques work equally well with SCSI drives as they do with ATAPI, but I will focus on ATAPI here because main boards with this interface are ubiquitous. After building the foundation of a reliable server as shown here, you can expand to as many disk drives as necessary to build the failure-resilient server of your dreams.
Introduction Any machine that is going to provide reliable service needs to have either redundant components on-line or a pool of off-line spares that can be promptly swapped in. Commodity PC hardware makes it affordable for even small organizations to have some spare parts available that could be pressed into service following the failure of production equipment. In many organizations, a failed power supply, NIC, memory, or main board could easily be swapped with a standby in a matter of minutes and be ready to return to production work. If a disk drive fails, however, it often has to be restored from a tape backup. This may take many hours. With disk drive capacities rising faster than tape drive capacities, the time needed to restore a failed disk drive seems to increase as technology progresses. &vinum.ap; is a volume manager for FreeBSD that provides a standard block I/O layer interface to the file system code just as any hardware device driver would. It works by managing partitions of type vinum and allows you to subdivide and group the space in such partitions into logical devices called volumes that can be used in the same way as disk partitions. Volumes can be configured for resilience, performance, or both. Experienced system administrators will immediately recognize the benefits of being able to configure each file system to match the way it is most often used. In some ways, Vinum is similar to &man.ccd.4;, but it is far more flexible and robust in the face of failures. It is only slightly more difficult to set up than &man.ccd.4;. &man.ccd.4; may meet your needs if you are only interested in concatenation.
Terminology Discussion of storage management can get very tricky simply because of the terminology involved. As we will see below, the terms disk, slice, partition, subdisk, and volume each refer to different things that present the same interface to a kernel function like swapping. The potential for confusion is compounded because the objects that these terms represent can be nested inside each other. I will refer to a physical disk drive as a spindle. A partition here means a BSD partition as maintained by disklabel. It does not refer to slices or BIOS partitions as maintained by fdisk.
Vinum Objects Vinum defines a hierarchy of four objects that it uses to manage storage (see the figure Vinum Objects and Architecture). Different combinations of these objects are used to achieve failure resilience, performance, and/or extra capacity. I will give a whirlwind tour of the objects here--see the Vinum web site for a more thorough description.
Vinum Objects and Architecture

+-----+------+------+
| UFS | swap | Etc. |
+---+-+------+----+ +
|      volume     | |
+ V +-------------+ +
| i     plex      | |
+ n +-------------+ +
| u    subdisk    | |
+ m +-------------+ +
|      drive      | |
+-----------------+ +
| Block I/O devices |
+-------------------+
The top object, a vinum volume, implements a virtual disk that provides a standard block I/O layer interface to other parts of the kernel. The bottom object, a vinum drive, uses this same interface to request I/O from physical devices below it. In between these two (from top to bottom) we have objects called a vinum plex and a vinum subdisk. As you can probably guess from the name, a vinum subdisk is a contiguous subset of the space available on a vinum drive. It lets you subdivide a vinum drive in much the same way that a BSD disk partition lets you subdivide a BIOS slice. A plex allows subdisks to be grouped together making the space of all subdisks available as a single object. A plex can be organized with its constituent subdisks concatenated or striped. Both organizations are useful for spreading I/O requests across spindles when their subdisks reside on distinct spindles. A striped plex will switch spindles each time a multiple of the stripe size is reached. A concatenated plex will switch spindles only when the end of a subdisk is reached. An important characteristic of a Vinum volume is that it can be made up of more than one plex. In this case, writes go to all plexes and a read may be satisfied by any plex. Configuring two or more plexes on distinct spindles yields a volume that is resilient to failure. Vinum maintains a configuration that defines instances of the above objects and the way they are related to each other. This configuration is automatically written to all spindles under Vinum management whenever it changes.
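The difference between the two plex organizations described above can be sketched as a mapping from an offset within the plex to a subdisk and an offset within that subdisk. This is a simplified model for illustration only, not Vinum's actual code: the function names are hypothetical, offsets are counted in arbitrary blocks, and the striped case assumes equal-sized subdisks and does no bounds checking.

```python
def locate_concatenated(offset, subdisk_sizes):
    """Map a plex offset to (subdisk index, offset within subdisk)
    for a concatenated plex: subdisks are used one after another."""
    for index, size in enumerate(subdisk_sizes):
        if offset < size:
            return index, offset
        offset -= size
    raise ValueError("offset beyond end of plex")


def locate_striped(offset, subdisk_count, stripe_size):
    """Map a plex offset to (subdisk index, offset within subdisk)
    for a striped plex: the subdisk changes at every multiple of
    the stripe size, cycling round-robin through the subdisks."""
    stripe_number, within_stripe = divmod(offset, stripe_size)
    row, index = divmod(stripe_number, subdisk_count)
    return index, row * stripe_size + within_stripe


# Two subdisks of 1000 blocks each; stripe size of 100 blocks.
for offset in (50, 150, 250, 1250):
    print(offset,
          locate_concatenated(offset, [1000, 1000]),
          locate_striped(offset, 2, 100))
```

In this model a run of consecutive offsets stays on the first subdisk for the first 1000 blocks of a concatenated plex, while the striped plex alternates subdisks every 100 blocks, which is why striping spreads even a single sequential stream across spindles.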
Vinum Volume/Plex Organization Although Vinum can manage any number of spindles, I will only cover scenarios with two spindles here for simplicity. The following table shows how two spindles organized with Vinum compare to two spindles without Vinum.

Characteristics of Two Spindles Organized with Vinum

Organization              | Total Capacity                           | Failure Resilient | Peak Read Performance | Peak Write Performance
Concatenated Plexes       | Unchanged, but appears as a single drive | No                | Unchanged             | Unchanged
Striped Plexes (RAID-0)   | Unchanged, but appears as a single drive | No                | 2x                    | 2x
Mirrored Volumes (RAID-1) | 1/2, appearing as a single drive         | Yes               | 2x                    | Unchanged
The preceding table shows that striping yields the same capacity and lack of failure resilience as concatenation, but it has better peak read and write performance. Hence we will not be using concatenation in any of the examples here. Mirrored volumes provide the benefits of improved peak read performance and failure resilience--but this comes at a loss in capacity. Both concatenation and striping bring their benefits over a single spindle at the cost of increased likelihood of failure since more than one spindle is now involved. When three or more spindles are present, Vinum also supports rotated, block-interleaved parity (also called RAID-5) that provides better capacity than mirroring (but not quite as good as striping), better read performance than both mirroring and striping, and good failure resilience. There is, however, a substantial decrease in write performance with RAID-5. Most of the benefits become more pronounced with five or more spindles. The organizations described above may be combined to provide benefits that no single organization can match. For example, mirroring and striping can be combined to provide failure-resilience with very fast read performance.
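The capacity trade-offs discussed above can be put into numbers with a rough model. This is a sketch under stated assumptions rather than anything taken from Vinum itself: all spindles are the same size, a mirrored volume has exactly two plexes, and RAID-5 gives up one spindle's worth of space to parity (the usual RAID-5 property, not a figure from this article). The function name and the 40GB spindle size are hypothetical.

```python
def usable_capacity(organization, spindles, spindle_size):
    """Return the usable capacity of equal-sized spindles under a
    given organization (simplified model, see assumptions above)."""
    total = spindles * spindle_size
    if organization in ("concatenated", "striped"):
        return total                    # all space holds data
    if organization == "mirrored":
        return total / 2                # assumes two plexes: every block stored twice
    if organization == "raid5":
        if spindles < 3:
            raise ValueError("RAID-5 needs three or more spindles")
        return total - spindle_size     # one spindle's worth of parity
    raise ValueError("unknown organization: " + organization)


# Hypothetical 40GB spindles, chosen only for illustration.
for name, count in (("striped", 2), ("mirrored", 2), ("raid5", 3), ("raid5", 5)):
    print(name, count, usable_capacity(name, count, 40))
```

Under these assumptions the share of raw space lost to RAID-5 parity shrinks as spindles are added (one third with three spindles, one fifth with five), which is one way to read the remark that the benefits become more pronounced with five or more spindles.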
Vinum History Vinum is a standard part of even a "minimum" FreeBSD distribution and it has been standard since 3.0-RELEASE. The official pronunciation of the name is VEE-noom. &vinum.ap; was inspired by the Veritas Volume Manager, but was not derived from it. The name is a play on that history and the Latin adage In Vino Veritas (Vino is the ablative form of Vinum). - Literally translated, that is "Truth lies in wine" hinting that + Literally translated, that is Truth lies in wine hinting that drunkards have a hard time lying. I have been using it in production on six different servers for over two years with no data loss. Like the rest of FreeBSD, Vinum - provides "rock-stable performance." + provides rock-stable performance. (On a personal note, I have seen Vinum panic when I misconfigured something, but I have never had any trouble in normal operation.) Greg Lehey wrote Vinum for FreeBSD, but he is seeking help in porting it to NetBSD and OpenBSD. Just like the rest of FreeBSD, Vinum is undergoing continuous development. Several subtle, but significant bugs have been fixed in recent releases. It is always best to use the most recent code base that meets your stability requirements.
Vinum Deployment Strategy Vinum, coupled with prudent partition management, lets you - keep "warm-spare" spindles on-line so that failures + keep warm-spare spindles on-line so that failures are transparent to users. Failed spindles can be replaced during regular maintenance periods or whenever it is convenient. When all spindles are working, the server benefits from increased performance and capacity. Having redundant copies of your home directory does not help you if the spindle holding root, /usr, or swap fails on your server. Hence I focus here on building a simple foundation for a failure-resilient server covering the root, /usr, /home, and swap partitions. Vinum mirroring does not remove the need for making backups! Mirroring cannot help you recover from site disasters or the dreaded rm -r -f / command.
Why Bootstrap Vinum? It is possible to add Vinum to a server configuration after it is already in production use, but this is much harder than designing for it from the start. Ironically, Vinum is not supported by /stand/sysinstall and hence you cannot install /usr right onto a Vinum volume. Vinum currently does not support the root file system (this feature is in development). Hence it is a bit tricky to get started using Vinum, but these instructions take you through the process of planning for Vinum, installing FreeBSD without it, and then beginning to use it. - I have come to call this whole process "bootstrapping Vinum." + I have come to call this whole process bootstrapping Vinum. That is, the process of getting Vinum initially installed and operating to the point where you have met your resilience or performance goals. My purpose here is to document a Vinum bootstrapping method that I have found works well for me.
Vinum Benefits The server foundation scenario I have chosen here allows me to show you examples of configuring for resilience on /usr and /home. Yet Vinum provides benefits other than resilience--namely performance, capacity, and manageability. It can significantly improve disk performance (especially under multi-user loads). Vinum can easily concatenate many smaller disks to produce the illusion of a single larger disk (but my server foundation scenario does not allow me to illustrate these benefits here). For servers with many spindles, Vinum provides substantial benefits in volume management, particularly when coupled with hot-pluggable hardware. Data can be moved from spindle to spindle while the system is running without loss of production time. Again, details of this will not be given here, but once you get your feet wet with Vinum, other documentation will help you do things like this. See "The Vinum Volume Manager" for a technical introduction to Vinum, &man.vinum.8; for a description of the vinum command, and &man.vinum.4; for a description of the vinum device driver and the way Vinum objects are named. Breaking up your disk space into smaller and smaller partitions - has the benefit of allowing you to "tune" for the most common - type of access and tends to keep disk hogs "within their pens." + has the benefit of allowing you to tune for the most common + type of access and tends to keep disk hogs within their pens. However it also causes some loss in total available disk space due to fragmentation.
Server Operation in Degraded Mode Some disk failures in this two-spindle scenario will result in Vinum automatically routing all disk I/O to the remaining good spindle. Others will require brief manual intervention on the console to configure the server for degraded mode operation and a quick reboot. Other than actual hardware repairs, most recovery work can be done while the server is running in multi-user degraded mode so there is as little production impact from failures as possible. I give the instructions needed to configure the server for degraded mode operation in those cases where Vinum cannot do it automatically. I also give the instructions needed to return to normal operation once the failed hardware is repaired. You might call these instructions Vinum failure recovery techniques. I recommend practicing using these instructions by recovering from simulated failures. For each failure scenario, I also give tips below for simulating a failure even when your hardware is working well. Even a minimum Vinum system as described below can be a good place to experiment with recovery techniques without impacting production equipment.
Hardware RAID vs. Vinum (Software RAID) Manual intervention is sometimes required to configure a server for degraded mode because Vinum is implemented in software that runs after the FreeBSD kernel is loaded. One disadvantage of such software RAID solutions is that there is nothing that can be done to hide spindle failures from the BIOS or the FreeBSD boot sequence. Hence the manual reconfiguration of the server for degraded operation mentioned above just informs the BIOS and boot sequence of failed spindles. Hardware RAID solutions generally have an advantage in that they require no such reconfiguration since spindle failures are hidden from the BIOS and boot sequence. Hardware RAID, however, may have some disadvantages that can be significant in some cases: The hardware RAID controller itself may become a single point of failure for the system. The data is usually kept in a proprietary format so that a disk drive cannot be simply plugged into another main board and booted. You often cannot mix and match drives with different sizes and interfaces. You are often limited to the number of drives supported by the hardware RAID controller (often only four or eight). In other words, &vinum.ap; may offer advantages in that there is no single point of failure, the drives can boot on most any main board, and you are free to mix and match as many drives using whatever interface you choose. Keep your kernel fairly generic (or at least keep /kernel.GENERIC around). This will improve the chances that you can come back up on - "foreign" hardware more quickly. + foreign hardware more quickly. The pros and cons discussed above suggest that the root file system and swap partition are good candidates for hardware RAID if available. This is especially true for servers where it is difficult for administrators to get console access (recall that this is sometimes required to configure a server for degraded mode operation). 
A server with only software RAID is well suited to office and home environments where an administrator can be close at hand. A common myth is that hardware RAID is always faster than software RAID. Since it runs on the host CPU, Vinum often has more CPU power and memory available than a dedicated RAID controller would have. If performance is a prime concern, it is best to benchmark your application running on your CPU with your spindles using both hardware and software RAID systems before making a decision.
Hardware for Vinum These instructions may be timely since commodity PC hardware can now easily host several hundred gigabytes of reasonably high-performance disk space at a low price. Many disk drive manufacturers now sell 7,200 RPM disk drives with quite low seek times and high transfer rates through ATA-100 interfaces, all at very attractive prices. Four such drives, attached to a suitable main board and configured with Vinum and prudent partitioning, yield a failure-resilient, high-performance disk server at a very reasonable cost. However, you can indeed get started with Vinum very simply. A minimum system can be as simple as an old CPU (even a 486 is fine) and a pair of drives that are 500 MB or more. They need not be the same size or even use the same interface (i.e., it is fine to mix ATAPI and SCSI). So get busy and give this a try today! You will have the foundation of a failure-resilient server running in an hour or so!
Bootstrapping Phases Greg Lehey suggested this bootstrapping method. It uses knowledge of how Vinum internally allocates disk space to avoid copying data. Instead, Vinum objects are configured so that they occupy the same disk space where /stand/sysinstall built file systems. The file systems are thus embedded within Vinum objects without copying. There are several distinct phases to the Vinum bootstrapping procedure. Each of these phases is presented in a separate section below. The section starts with a general overview of the phase and its goals. It then gives example steps for the two-spindle scenario presented here and advice on how to adapt them for your server. (If you are reading for a general understanding of Vinum bootstrapping, the example sections for each phase can safely be skipped.) The remainder of this section gives an overview of the entire bootstrapping process. Phase 1 involves planning and preparation. We will balance requirements for the server against available resources and make design tradeoffs. We will plan the transition from no Vinum to Vinum on just one spindle, to Vinum on two spindles. In phase 2, we will install a minimum FreeBSD system on a single spindle using partitions of type 4.2BSD (regular UFS file systems). Phase 3 will embed the non-root file systems from phase 2 in Vinum objects. Note that Vinum will be up and running at this point, but it cannot yet provide any resilience since it only has one spindle on which to store data. Finally in phase 4, we configure Vinum on a second spindle and make a backup copy of the root file system. This will give us resilience on all file systems.
Bootstrapping Phase 1: Planning and Preparation Our goal in this phase is to define the different partitions we will need and examine their requirements. We will also look at available disk drives and controllers and allocate partitions to them. Finally, we will determine the size of each partition and its use during the bootstrapping process. After this planning is complete, we can optionally prepare to use some tools that will make bootstrapping Vinum easier. Several key questions must be answered in this planning phase: What file system and partitions will be needed? How will they be used? How will we name each spindle? How will the partitions be ordered for each spindle? How will partitions be assigned to the spindles? How will partitions be configured? Resilience or performance? What technique will be used to achieve resilience? What spindles will be used? How will they be configured on the available controllers? How much space is required for each partition?
Phase 1 Example In this example, I will assume a scenario where we are building a minimal foundation for a failure-resilient server. Hence we will need at least root, /usr, /home, and swap partitions. The root, /usr, and /home file systems all need resilience since the server will not be much good without them. The swap partition needs performance first and generally does not need resilience since nothing it holds needs to be retained across a reboot.
Spindle Naming The kernel would refer to the master spindle on the primary and secondary ATA controllers as /dev/ad0 and /dev/ad2 respectively. This assumes that you have not removed the line options ATA_STATIC_ID from your kernel configuration. But Vinum also needs to have a name for each spindle that will stay the same name regardless of how it is attached to the CPU (i.e., if the drive moves, the Vinum name moves with the drive). Some recovery techniques documented below suggest moving a spindle from the secondary ATA controller to the primary ATA controller. (Indeed, the flexibility of making such moves is a key benefit of Vinum especially if you are managing a large number of spindles.) After such a drive/controller swap, the kernel will see what used to be /dev/ad2 as /dev/ad0 but Vinum will still call it by whatever name it had when it was attached to /dev/ad2 - (i.e., when it was "created" or first made known to + (i.e., when it was created or first made known to Vinum). Since connections can change, it is best to give each spindle a unique, abstract name that gives no hint of how it is attached. Avoid names that suggest a manufacturer, model number, physical location, or membership in a sequence (e.g. avoid names like upper, lower, etc., alpha, beta, etc., SCSI1, SCSI2, etc., or Seagate1, Seagate2 etc.). Such names are likely to lose their uniqueness or get out of sequence someday even if they seem like great names today. Once you have picked names for your spindles, label them with a permanent marker. If you have hot-swappable hardware, write the names on the sleds in which the spindles are mounted. This will significantly reduce the likelihood of error when you are moving spindles around later as part of failure recovery or routine system management procedures. In the instructions that follow, Vinum will name the root spindle YouCrazy and the rootback spindle UpWindow. 
I will only use /dev/ad0 when I want to refer to whichever of the two spindles is currently attached as /dev/ad0.
Partition Ordering Modern disk drives operate with fairly uniform areal density across the surface of the disk. That implies that more data is available under the heads without seeking on the outer cylinders than on the inner cylinders. We will allocate partitions most critical to system performance from these outer cylinders as /stand/sysinstall generally does. The root file system is traditionally the outermost, even though it generally is not as critical to system performance as others. (However root can have a larger impact on performance if it contains /tmp and /var as it does in this example.) The FreeBSD boot loaders assume that the root file system lives in the a partition. There is no requirement that the a partition start on the outermost cylinders, but this convention makes it easier to manage disk labels. Swap performance is critical so it comes next on our way toward the center. I/O operations here tend to be large and contiguous. Having as much data under the heads as possible avoids seeking while swapping. With all the smaller partitions out of the way, we finish up the disk with /home and /usr. Access patterns here tend not to be as intense as for other file systems (especially if there is an abundant supply of RAM and read cache hit rates are high). If the pair of spindles you have are large enough to allow for more than /home and /usr, it is fine to plan for additional file systems here.
Assigning Partitions to Spindles We will want to assign partitions to these spindles so that either can fail without loss of data on file systems configured for resilience. Reliability on /usr and /home is best achieved using Vinum mirroring. Resilience will have to come differently, however, for the root file system since Vinum is not a part of the FreeBSD boot sequence. Here we will have to settle for two identical partitions with a periodic copy from the primary to the backup secondary. The kernel already has support for interleaved swap across all available partitions so there is no need for help from Vinum here. /stand/sysinstall will automatically configure /etc/fstab for all swap partitions given. The &vinum.ap; bootstrapping method given below requires a pair of spindles that I will call the root spindle and the rootback spindle. The rootback spindle must be the same size or larger than the root spindle. These instructions first allocate all space on the root spindle and then allocate exactly that amount of space on a rootback spindle. (After &vinum.ap; is bootstrapped, there is nothing special about either of these spindles--they are interchangeable.) You can later use the remaining space on the rootback spindle for other file systems. If you have more than two spindles, the bootvinum Perl script and the procedure below will help you initialize them for use with &vinum.ap;. However you will have to figure out how to assign partitions to them on your own.
Assigning Space to Partitions For this example, I will use two spindles: one with 4,124,673 blocks (about 2 GB) on /dev/ad0 and one with 8,420,769 blocks (about 4 GB) on /dev/ad2. It is best to configure your two spindles on separate controllers so that both can operate in parallel and so that you will have failure resilience in case a controller dies. Note that mirrored volume write performance will be halved in cases where both spindles share a controller that requires they operate serially (as is often the case with ATA controllers). One spindle will be the master on the primary ATA controller and the other will be the master on the secondary ATA controller. Recall that we will be allocating space on the smaller spindle first and the larger spindle second.
Assigning Partitions on the Root Spindle We will allocate 200,000 blocks (about 93 MB) for a root file system on each spindle (/dev/ad0s1a and /dev/ad2s1a). We will initially allocate 200,265 blocks for a swap partition on each spindle, giving a total of about 186 MB of swap space (/dev/ad0s1b and /dev/ad2s1b). We will lose 265 blocks from each swap partition as part of the bootstrapping process. This is the size of the space used by Vinum to store configuration information. The space will be taken from swap and given to a vinum partition but will be unavailable for Vinum subdisks. I have done the partition allocation in nice round numbers of blocks just to emphasize where the 265 blocks go. There is nothing wrong with allocating space in MB if that is more convenient for you. This leaves 4,124,673 - 200,000 - 200,265 = 3,724,408 blocks (about 1,818 MB) on the root spindle for Vinum partitions (/dev/ad0s1e and /dev/ad2s1f). From this, allocate 1,000,000 blocks (about 488 MB) for /home and the remaining 2,724,408 blocks (about 1,330 MB) for /usr. The 265 blocks for Vinum configuration information come out of the swap partition, as noted above. See the figure below to see this graphically. The left-hand side of the figure shows what spindle ad0 will look like at the end of phase 2. The right-hand side shows what it will look like at the end of phase 3.
Spindle ad0 Before and After Vinum

   ad0 Before Vinum      Offset (blocks)      ad0 After Vinum
+----------------------+ <--      0--> +----------------------+
|         root         |               |         root         |
|     /dev/ad0s1a      |               |     /dev/ad0s1a      |
+----------------------+ <-- 200000--> +----------------------+
|         swap         |               |         swap         |
|     /dev/ad0s1b      |               |     /dev/ad0s1b      |
|                      |     400000--> +----------------------+
|                      |               | Vinum drive YouCrazy |
|                      |               |     /dev/ad0s1h      |
+----------------------+ <-- 400265--> +-----------------+    |
|        /home         |               |    Vinum sd     |    |
|     /dev/ad0s1e      |               |   home.p0.s0    |    |
+----------------------+ <--1400265--> +-----------------+    |
|         /usr         |               |    Vinum sd     |    |
|     /dev/ad0s1f      |               |    usr.p0.s0    |    |
+----------------------+ <--4124673--> +-----------------+----+
                          Not to scale
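The offsets in the figure follow directly from the block counts chosen above, and it can be reassuring to recompute them before typing numbers into a disk label. The following is only a sketch in POSIX shell: root_layout is a hypothetical helper name, not part of Vinum or FreeBSD, and swap is passed as 200,000 blocks, that is, its size after the 265 configuration blocks have gone to Vinum.

```shell
#!/bin/sh
# Sketch: recompute the root spindle offsets from the block counts in the text.
# root_layout is a hypothetical helper name, not a Vinum or FreeBSD command.
root_layout() {
    total=$1    # blocks on the whole spindle
    root=$2     # blocks for the root file system
    swap=$3     # blocks for swap once Vinum has taken its share
    cfg=$4      # blocks Vinum keeps for configuration information
    home=$5     # blocks for /home

    swap_start=$root
    vinum_start=$((root + swap))
    home_start=$((vinum_start + cfg))
    usr_start=$((home_start + home))
    usr_len=$((total - usr_start))    # /usr takes whatever is left

    echo "swap_start=$swap_start"
    echo "vinum_start=$vinum_start"
    echo "home_start=$home_start"
    echo "usr_start=$usr_start usr_len=$usr_len"
}

# Block counts for the 2 GB example spindle.
root_layout 4124673 200000 200000 265 1000000
```

Substituting the size of your own spindle for the first argument gives the offsets to expect in your own layout.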
Assigning Partitions on the Rootback Spindle The /rootback and swap partition sizes on the rootback spindle must match the root and swap partition sizes on the root spindle. That leaves 8,420,769 - 200,000 - 200,265 = 8,020,504 blocks for the Vinum partition. Mirrors of /home and /usr receive the same allocation as on the root spindle. That will leave an extra 2 GB or so that we can deal with later. See the figure below to see this graphically. The left-hand side of the figure shows what spindle ad2 will look like at the beginning of phase 4. The right-hand side shows what it will look like at the end.
Spindle ad2 Before and After Vinum

   ad2 Before Vinum      Offset (blocks)      ad2 After Vinum
+----------------------+ <--      0--> +----------------------+
|      /rootback       |               |      /rootback       |
|     /dev/ad2s1e      |               |     /dev/ad2s1a      |
+----------------------+ <-- 200000--> +----------------------+
|         swap         |               |         swap         |
|     /dev/ad2s1b      |               |     /dev/ad2s1b      |
|                      |     400000--> +----------------------+
|                      |               | Vinum drive UpWindow |
|                      |               |     /dev/ad2s1h      |
+----------------------+ <-- 400265--> +-----------------+    |
|      /NOFUTURE       |               |    Vinum sd     |    |
|     /dev/ad2s1f      |               |   home.p1.s0    |    |
|                      |    1400265--> +-----------------+    |
|                      |               |    Vinum sd     |    |
|                      |               |    usr.p1.s0    |    |
|                      |    4124673--> +-----------------+    |
|                      |               |    Vinum sd     |    |
|                      |               |   hope.p0.s0    |    |
+----------------------+ <--8420769--> +-----------------+----+
                          Not to scale
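The "extra 2 GB or so" on the rootback spindle can be worked out the same way. This is a rough sketch rather than anything Vinum itself provides: spare_blocks is a hypothetical helper name, and the megabyte figure is plain integer division by 2,048 blocks per MB, so it may differ by a megabyte or two from what Vinum reports.

```shell
#!/bin/sh
# Sketch: blocks left on the rootback spindle once it mirrors the root spindle.
# spare_blocks is a hypothetical helper name used only for this illustration.
spare_blocks() {
    total=$1       # blocks on the rootback spindle
    reserved=$2    # /rootback + swap + Vinum configuration, in blocks
    mirrored=$3    # blocks taken by the mirrors of /home and /usr
    echo $((total - reserved - mirrored))
}

# 200,000 (/rootback) + 200,000 (swap) + 265 (Vinum configuration)
reserved=$((200000 + 200000 + 265))
# 1,000,000 (/home) + 2,724,408 (/usr)
mirrored=$((1000000 + 2724408))

spare=$(spare_blocks 8420769 "$reserved" "$mirrored")
echo "spare: $spare blocks, roughly $((spare / 2048)) MB"
```

This spare space is what the hope.p0.s0 subdisk in the right-hand side of the figure occupies.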
Preparation of Tools The bootvinum Perl script given below will make the Vinum bootstrapping process much easier if you can run it on the machine being bootstrapped. It is over 200 lines and you would not want to type it in. At this point, I recommend that you copy it to a floppy or arrange some alternative method of making it readily available later when needed. For example: &prompt.root; fdformat -f 1440 /dev/fd0 &prompt.root; newfs_msdos -f 1440 /dev/fd0 &prompt.root; mount /dev/fd0 /mnt &prompt.root; cp /usr/share/examples/vinum/bootvinum /mnt XXX Someday, I would like this script to live in /usr/share/examples/vinum. Till then, please use this link to get a copy.
Bootstrapping Phase 2: Minimal OS Installation Our goal in this phase is to complete the smallest possible FreeBSD installation in such a way that we can later install Vinum. We will use only partitions of type 4.2BSD (i.e., regular UFS file systems) since that is the only type supported by /stand/sysinstall.
Phase 2 Example Start up the FreeBSD installation process by running /stand/sysinstall from installation media as you normally would. Fdisk partition all spindles as needed. Make sure to select BootMgr for all spindles. Partition the root spindle with appropriate block allocations as described above. For this example on a 2 GB spindle, I will use 200,000 blocks for root, 200,265 blocks for swap, 1,000,000 blocks for /home, and the rest of the spindle (2,724,408 blocks) for /usr. (/stand/sysinstall should automatically assign these to /dev/ad0s1a, /dev/ad0s1b, /dev/ad0s1e, and /dev/ad0s1f by default.) If you prefer soft updates as I do and you are using 4.4-RELEASE or better, this is a good time to enable them. Partition the rootback spindle with the appropriate block allocations as described above. For this example on a 4 GB spindle, I will use 200,000 blocks for /rootback, 200,265 blocks for swap, and the rest of the spindle (8,020,504 blocks) for /NOFUTURE. (/stand/sysinstall should automatically assign these to /dev/ad2s1e, /dev/ad2s1b, and /dev/ad2s1f by default.) We do not really want to have a /NOFUTURE UFS file system (we want a vinum partition instead), but that is the best choice we have for the space given the limitations of /stand/sysinstall. Mount point names beginning with NOFUTURE and rootback serve as sentinels to the bootstrapping script presented below. Partition any other spindles with swap if desired and a single /NOFUTURExx file system. Select a minimum system install for now even if you want to end up with more distributions loaded later. Do not worry about system configuration options at this point--get Vinum set up and get the partitions in the right places first. Exit /stand/sysinstall and reboot. Do a quick test to verify that the minimum installation was successful. The left-hand sides of the two spindle figures above show how the disks will look at this point.
Bootstrapping Phase 3: Root Spindle Setup Our goal in this phase is to get Vinum set up and running on the root spindle. We will embed the existing /usr and /home file systems in a Vinum partition. Note that the Vinum volumes created will not yet be failure-resilient since we have only one underlying Vinum drive to hold them. The resulting system will automatically start Vinum as it boots to multi-user mode.
Phase 3 Example Login as root. We will need a directory in the root file system in which to keep a few files that will be used in the Vinum bootstrapping process. &prompt.root; mkdir /bootvinum &prompt.root; cd /bootvinum Several files need to be prepared for use in bootstrapping. I have written a Perl script that makes all the required files for you. Copy this script to /bootvinum by floppy disk, tape, network, or any convenient means and then run it. (If you cannot get this script copied onto the machine being bootstrapped, then see below for a manual alternative.) &prompt.root; cp /mnt/bootvinum . &prompt.root; ./bootvinum bootvinum produces no output when run successfully. If you get any errors, something may have gone wrong when you were creating partitions with /stand/sysinstall above. Running bootvinum will: Create /etc/fstab.vinum based on what it finds in your existing /etc/fstab Create new disk labels for each spindle mentioned in /etc/fstab and keep copies of the current disk labels Create files needed as input to vinum for building Vinum objects on each spindle Create many alternates to /etc/fstab.vinum that might come in handy should a spindle fail You may want to take a look at these files to learn more about the disk partitioning required for Vinum or to learn more about the commands needed to create Vinum objects. We now need to install new spindle partitioning for /dev/ad0. This requires that /dev/ad0s1b not be in use for swapping so we have to reboot in single-user mode. First, reboot the system. &prompt.root; reboot Next, enter single-user mode. Hit [Enter] to boot immediately, or any other key for command prompt. Booting [kernel] in 8 seconds... Type '?' for a list of commands, 'help' for more detailed help. ok boot -s In single-user mode, install the new partitioning created above. 
&prompt.root; cd /bootvinum &prompt.root; disklabel -R ad0s1 disklabel.ad0s1 &prompt.root; disklabel -R ad2s1 disklabel.ad2s1 If you have additional spindles, repeat the above commands as appropriate for them. We are about to start Vinum for the first time. It is going to want to create several device nodes under /dev/vinum so we will need to mount the root file system for read/write access. &prompt.root; fsck -p / &prompt.root; mount / Now it is time to create the Vinum objects that will embed the existing non-root file systems on the root spindle in a Vinum partition. This will load the Vinum kernel module and start Vinum as a side effect. &prompt.root; vinum create create.YouCrazy You should see a list of Vinum objects created that looks like the following: 1 drives: D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%) 2 volumes: V home State: up Plexes: 1 Size: 488 MB V usr State: up Plexes: 1 Size: 1330 MB 2 plexes: P home.p0 C State: up Subdisks: 1 Size: 488 MB P usr.p0 C State: up Subdisks: 1 Size: 1330 MB 2 subdisks: S home.p0.s0 State: up PO: 0 B Size: 488 MB S usr.p0.s0 State: up PO: 0 B Size: 1330 MB You should also see several kernel messages which state that the Vinum objects you have created are now up. Our non-root file systems should now be embedded in a Vinum partition and hence available through Vinum volumes. It is important to test that this embedding worked. &prompt.root; fsck -n /dev/vinum/home &prompt.root; fsck -n /dev/vinum/usr This should produce no errors. If it does produce errors do not fix them. Instead, go back and examine the root spindle partition tables before and after Vinum to see if you can spot the error. You can back out the partition table changes by using disklabel -R with the disklabel.*.b4vinum files. While we have the root file system mounted read/write, this is a good time to install /etc/fstab. 
&prompt.root; mv /etc/fstab /etc/fstab.b4vinum &prompt.root; cp /etc/fstab.vinum /etc/fstab We are now done with tasks requiring single-user mode, so it is safe to go multi-user from here on. &prompt.root; ^D Login as root. Edit /etc/rc.conf and add this line: start_vinum="YES"
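If you would rather not open an editor, the rc.conf change can also be made from the shell. The sketch below is one possible way to do it, not the method used by the bootvinum script; enable_vinum is a hypothetical helper name, and the demonstration runs against a scratch file rather than the live /etc/rc.conf.

```shell
#!/bin/sh
# Sketch: add start_vinum="YES" to an rc.conf-style file only if the
# variable is not already set there. enable_vinum is a hypothetical helper.
enable_vinum() {
    conf=$1
    if ! grep -q '^start_vinum=' "$conf" 2>/dev/null; then
        echo 'start_vinum="YES"' >> "$conf"
    fi
}

# Demonstrate on a scratch file instead of the live /etc/rc.conf.
scratch=$(mktemp /tmp/rc.conf.XXXXXX)
enable_vinum "$scratch"
enable_vinum "$scratch"    # the second call changes nothing
cat "$scratch"
rm -f "$scratch"
```

Note that the helper leaves an existing start_vinum line alone, whatever its value, so a line reading start_vinum="NO" would still need to be changed by hand.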
Bootstrapping Phase 4: Rootback Spindle Setup Our goal in this phase is to get redundant copies of all data from the root spindle to the rootback spindle. We will first create the necessary Vinum objects on the rootback spindle. Then we will ask Vinum to copy the data from the root spindle to the rootback spindle. Finally, we use dump and restore to copy the root file system.
Phase 4 Example Now that Vinum is running on the root spindle, we can bring it up on the rootback spindle so that our Vinum volumes can become failure-resilient. &prompt.root; cd /bootvinum &prompt.root; vinum create create.UpWindow You should see a list of Vinum objects created that looks like the following: 2 drives: D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%) D UpWindow State: up Device /dev/ad2s1h Avail: 2096/3915 MB (53%) 2 volumes: V home State: up Plexes: 2 Size: 488 MB V usr State: up Plexes: 2 Size: 1330 MB 4 plexes: P home.p0 C State: up Subdisks: 1 Size: 488 MB P usr.p0 C State: up Subdisks: 1 Size: 1330 MB P home.p1 C State: faulty Subdisks: 1 Size: 488 MB P usr.p1 C State: faulty Subdisks: 1 Size: 1330 MB 4 subdisks: S home.p0.s0 State: up PO: 0 B Size: 488 MB S usr.p0.s0 State: up PO: 0 B Size: 1330 MB S home.p1.s0 State: stale PO: 0 B Size: 488 MB S usr.p1.s0 State: stale PO: 0 B Size: 1330 MB You should also see several kernel messages which state that some of the Vinum objects you have created are now up while others are faulty or stale. Now we ask Vinum to copy each of the subdisks on drive YouCrazy to drive UpWindow. This will change the state of the newly created Vinum subdisks from stale to up. It will also change the state of the newly created Vinum plexes from faulty to up. First, we do the new subdisk we added to /home. &prompt.root; vinum start -w home.p1.s0 reviving home.p1.s0 (time passes . . . ) home.p1.s0 is up by force home.p1 is up home.p1.s0 is up My 5,400 RPM EIDE spindles copied at about 3.5 MBytes/sec. Your mileage may vary. Next we do the new subdisk we added to /usr. &prompt.root; vinum start -w usr.p1.s0 reviving usr.p1.s0 (time passes . . . ) usr.p1.s0 is up by force usr.p1 is up usr.p1.s0 is up All Vinum objects should be in state up at this point. 
The output of vinum list should look like the following: 2 drives: D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%) D UpWindow State: up Device /dev/ad2s1h Avail: 2096/3915 MB (53%) 2 volumes: V home State: up Plexes: 2 Size: 488 MB V usr State: up Plexes: 2 Size: 1330 MB 4 plexes: P home.p0 C State: up Subdisks: 1 Size: 488 MB P usr.p0 C State: up Subdisks: 1 Size: 1330 MB P home.p1 C State: up Subdisks: 1 Size: 488 MB P usr.p1 C State: up Subdisks: 1 Size: 1330 MB 4 subdisks: S home.p0.s0 State: up PO: 0 B Size: 488 MB S usr.p0.s0 State: up PO: 0 B Size: 1330 MB S home.p1.s0 State: up PO: 0 B Size: 488 MB S usr.p1.s0 State: up PO: 0 B Size: 1330 MB Copy the root file system so that you will have a backup. &prompt.root; cd /rootback &prompt.root; dump 0f - / | restore rf - &prompt.root; rm restoresymtable &prompt.root; cd / You may see errors like this: ./tmp/rstdir1001216411: (inode 558) not found on tape cannot find directory inode 265 abort? [yn] n expected next file 492, got 491 They seem to cause no harm. I suspect they are a consequence of dumping the file system containing /tmp and/or the pipe connecting dump and restore. Make a directory on which we can mount a damaged root file system during the recovery process. &prompt.root; mkdir /rootbad Remove sentinel mount points that are now unused. &prompt.root; rmdir /NOFUTURE* Create empty &vinum.ap; drives on remaining spindles. &prompt.root; vinum create create.ThruBank &prompt.root; ... At this point, the reliable server foundation is complete. The right-hand sides of the two spindle figures above show how the disks will look. You may want to do a quick reboot to multi-user and give it a quick test drive. This is also a good point to complete installation of other distributions beyond the minimal install. Add packages, ports, and users as required. Configure /etc/rc.conf as required. 
After you have completed your server configuration, remember to do one more copy of root to /rootback as shown above before placing the server into production. Make a schedule to refresh /rootback periodically. It may be a good idea to mount /rootback read-only for normal operation of the server. This does, however, complicate the periodic refresh a bit. Do not forget to watch /var/log/messages carefully for errors. Vinum may automatically avoid failed hardware in a way that users do not notice. You must watch for such failures and get them repaired before a second failure results in data loss. You may see Vinum noting damaged objects at server boot time.
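Watching for degraded objects is easier with a small filter than by eye. The following sketch prints every line of a vinum list style listing whose state is anything other than up. It is an illustration only: not_up is a hypothetical helper name, and it assumes nothing about the output beyond the "State: word" layout visible in the sample listings in this article.

```shell
#!/bin/sh
# Sketch: print each object line whose State is not "up".
# not_up is a hypothetical helper; it relies only on the "State: <word>"
# layout shown in the sample listings in this article.
not_up() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "State:" && $(i + 1) != "up") { print; next }
    }'
}

# On a live system one would run:  vinum list | not_up
# Here it is fed two lines taken from the phase 4 listing.
not_up <<'EOF'
P home.p0 C State: up Subdisks: 1 Size: 488 MB
P home.p1 C State: faulty Subdisks: 1 Size: 488 MB
EOF
```

An empty result means every listed object is up. Anything printed deserves a look before a second failure can turn it into data loss.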
Where to Go from Here? Now that you have established the foundation of a reliable server, there are several things you might want to try next.
Make a Vinum Volume with Remaining Space The following steps create another Vinum volume from the space remaining on the rootback spindle. This volume will not be resilient to spindle failure since it has only one plex on a single spindle. Create a file with the following contents: volume hope plex name hope.p0 org concat volume hope sd name hope.p0.s0 drive UpWindow plex hope.p0 len 0 Specifying a length of 0 for the hope.p0.s0 subdisk asks Vinum to use whatever space is left available on the underlying drive. Feed these commands into vinum create. &prompt.root; vinum create filename Now we newfs the volume and mount it. &prompt.root; newfs -v /dev/vinum/hope &prompt.root; mkdir /hope &prompt.root; mount /dev/vinum/hope /hope Edit /etc/fstab if you want /hope mounted at boot time.
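As a minimal sketch of the /etc/fstab entry mentioned above, the following prints a line in the usual fstab(5) column order (device, mount point, file system type, options, dump frequency, fsck pass). The helper name fstab_line is hypothetical, and the rw option and the dump and pass values of 2 are assumptions, not values taken from this article; adjust them to match the other non-root entries in your own /etc/fstab.

```shell
#!/bin/sh
# Hypothetical helper: print an fstab(5) line for a Vinum volume whose
# mount point has the same name as the volume.
# Assumed values (not from the article): options "rw", dump 2, pass 2.
fstab_line() {
    vol=$1
    printf '/dev/vinum/%s\t/%s\tufs\trw\t2\t2\n' "$vol" "$vol"
}

# Print the line for the hope volume; review it before adding it to /etc/fstab.
fstab_line hope
```

The function only prints the line. Appending it to /etc/fstab is left as a manual step so that the entry can be checked first.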
Try Out More Vinum Commands You might already be familiar with vinum list to get a list of all Vinum objects. Try adding -v to see more detail. If you have more spindles and you want to bring them up as concatenated, mirrored, or striped volumes, then give vinum concat drivelist, vinum mirror drivelist, or vinum stripe drivelist a try. See &man.vinum.8; for sample configurations and important performance considerations before settling on a final organization for your additional spindles. The failure recovery instructions below will also give you some experience using more Vinum commands.
Failure Scenarios This section contains descriptions of various failure scenarios. For each scenario, there is a subsection on how to configure your server for degraded mode operation, how to recover from the failure, how to exit degraded mode, and how to simulate the failure. Make a hard copy of these instructions and leave them inside the CPU case, being careful not to interfere with ventilation.
Root file system on ad0 unusable, rest of drive ok We assume here that the boot blocks and disk label on /dev/ad0 are ok. If your BIOS can boot from a drive other than C:, you may be able to get around this limitation.
Configure Server for Degraded Mode Use BootMgr to load kernel from /dev/ad2s1a. Hit F5 in BootMgr to select Drive 1. Hit F1 to select FreeBSD. After the kernel is loaded, hit any key but enter to interrupt the boot sequence. Boot into single-user mode and allow explicit entry of a root file system. Hit [Enter] to boot immediately, or any other key for command prompt. Booting [kernel] in 8 seconds... Type '?' for a list of commands, 'help' for more detailed help. ok boot -as Select /rootback as your root file system. Manual root file system specification: <fstype>:<device> Mount <device> using filesystem <fstype> e.g. ufs:/dev/da0s1a ? List valid disk boot devices <empty line> Abort manual input mountroot> ufs:/dev/ad2s1a Now that you are in single-user mode, change /etc/fstab to avoid the bad root file system. If you used the bootvinum Perl script from below, then these commands should configure your server for degraded mode. &prompt.root; fsck -p / &prompt.root; mount / &prompt.root; cd /etc &prompt.root; mv fstab fstab.bak &prompt.root; cp fstab_ad0s1_root_bad fstab &prompt.root; cd / &prompt.root; mount -o ro / &prompt.root; vinum start &prompt.root; fsck -p &prompt.root; ^D
Recovery Restore /dev/ad0s1a from backups or copy /rootback to it with these commands: &prompt.root; umount /rootbad &prompt.root; newfs /dev/ad0s1a &prompt.root; tunefs -n enable /dev/ad0s1a &prompt.root; mount /rootbad &prompt.root; cd /rootbad &prompt.root; dump 0f - / | restore rf - &prompt.root; rm restoresymtable
Exiting Degraded Mode Enter single-user mode. &prompt.root; shutdown now Put /etc/fstab back to normal and reboot. &prompt.root; cd /rootbad/etc &prompt.root; rm fstab &prompt.root; mv fstab.bak fstab &prompt.root; reboot Reboot and hit F1 to boot from /dev/ad0 when prompted by BootMgr.
Simulation This kind of failure can be simulated by shutting down to single-user mode and then booting as shown above.
Drive ad2 Fails This section deals with the total failure of /dev/ad2.
Configure Server for Degraded Mode After the kernel is loaded, hit any key but Enter to interrupt the boot sequence. Boot into single-user mode. Hit [Enter] to boot immediately, or any other key for command prompt. Booting [kernel] in 8 seconds... Type '?' for a list of commands, 'help' for more detailed help. ok boot -s Change /etc/fstab to avoid the bad drive. If you used the bootvinum Perl script from below, then these commands should configure your server for degraded mode. &prompt.root; fsck -p / &prompt.root; mount / &prompt.root; cd /etc &prompt.root; mv fstab fstab.bak &prompt.root; cp fstab_only_have_ad0s1 fstab &prompt.root; cd / &prompt.root; mount -o ro / &prompt.root; vinum start &prompt.root; fsck -p &prompt.root; ^D If you do not have modified versions of /etc/fstab that are ready for use, then you can use ed to make one. Alternatively, you can fsck and mount /usr and then use your favorite editor.
Recovery We assume here that your server is up and running multi-user in degraded mode on just /dev/ad0 and that you have a new spindle now on /dev/ad2 ready to go. You will need a new spindle with enough room to hold root and swap partitions plus a Vinum partition large enough to hold /home and /usr. Create a BIOS partition (slice) on the new spindle. &prompt.root; /stand/sysinstall Select Custom. Select Partition. Select ad2. Create a FreeBSD (type 165) slice large enough to hold everything mentioned above. Write changes. Yes, you are absolutely sure. Select BootMgr. Quit Partitioning. Exit /stand/sysinstall. Create disk label partitioning based on current /dev/ad0 partitioning. &prompt.root; disklabel ad0 > /tmp/ad0 &prompt.root; disklabel -e ad2 This will drop you into your favorite editor. Copy the lines for the a and b partitions from /tmp/ad0 to the ad2 disklabel. Add the size of the a and b partitions to find the proper offset for the h partition. Subtract this offset from the size of the c partition to find the proper size for the h partition. Define an h partition with the size and offset calculated above. Set the fstype column to vinum. Save the file and quit your editor. Tell Vinum about the new drive. Ask Vinum to start an editor with a copy of the current configuration. &prompt.root; vinum create Uncomment the drive line referring to drive UpWindow and set device to /dev/ad2s1h. Save the file and quit your editor. Now that Vinum has two spindles again, revive the mirrors. &prompt.root; vinum start -w usr.p1.s0 &prompt.root; vinum start -w home.p1.s0 Now we need to restore /rootback to a current copy of the root file system. These commands will accomplish this. &prompt.root; newfs /dev/ad2s1a &prompt.root; tunefs -n enable /dev/ad2s1a &prompt.root; mount /dev/ad2s1a /mnt &prompt.root; cd /mnt &prompt.root; dump 0f - / | restore rf - &prompt.root; rm restoresymtable &prompt.root; cd / &prompt.root; umount /mnt
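The offset and size arithmetic for the h partition described above can be sketched as follows. The three sector counts are hypothetical; read the real ones from /tmp/ad0 and from the c line of the ad2 disklabel. The sketch also assumes, as the text implies, that the a partition starts at offset 0 of the slice and that b follows it immediately.

```shell
#!/bin/sh
# Hypothetical sizes in 512-byte sectors (not taken from the article).
a_size=262144      # a (root) partition size, copied from /tmp/ad0
b_size=524288      # b (swap) partition size, copied from /tmp/ad0
c_size=8018640     # c (whole slice) size on the new spindle

# The h partition starts right after a and b ...
h_offset=$((a_size + b_size))
# ... and takes everything that remains in the slice.
h_size=$((c_size - h_offset))

# disklabel lists size before offset, followed by the fstype column.
printf '  h: %d %d vinum\n' "$h_size" "$h_offset"
# prints:   h: 7232208 786432 vinum
```

With these example numbers the h partition begins at sector 786432 and is 7232208 sectors long, so its end coincides with the end of the c partition.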
Exiting Degraded Mode Enter single-user mode. &prompt.root; shutdown now Return /etc/fstab to its normal state and reboot. &prompt.root; cd /etc &prompt.root; rm fstab &prompt.root; mv fstab.bak fstab &prompt.root; reboot
Simulation You can simulate this kind of failure by unplugging /dev/ad2, write-protecting it, or following this procedure: Shut down to single-user mode. Unmount all non-root file systems. Clobber any existing Vinum configuration and partitioning on /dev/ad2. &prompt.root; vinum stop &prompt.root; dd if=/dev/zero of=/dev/ad2s1h count=512 &prompt.root; dd if=/dev/zero of=/dev/ad2 count=512
Drive ad0 Fails Some BIOSes can boot from drive 1 or drive 2 (often called C: or D:), while others can boot only from drive 1. If your BIOS can boot from either, the fastest road to recovery might be to boot directly from /dev/ad2 in single-user mode and install /etc/fstab_only_have_ad2s1 as /etc/fstab. You would then have to adapt the /dev/ad2 failure recovery instructions from above. If your BIOS can only boot from drive one, then you will have to unplug drive UpWindow from the controller for /dev/ad2 and plug it into the controller for /dev/ad0. Then continue with the instructions for /dev/ad2 failure recovery above.
bootvinum Perl Script The bootvinum Perl script below reads /etc/fstab and current drive partitioning. It then writes several files in the current directory and several variants of /etc/fstab in /etc. These files significantly simplify the installation of Vinum and recovery from spindle failures. #!/usr/bin/perl -w use strict; use FileHandle; -my $config_tag1 = '$Id: article.sgml,v 1.4 2001-10-31 23:12:55 chern Exp $'; +my $config_tag1 = '$Id: article.sgml,v 1.5 2002-02-14 23:57:13 keramida Exp $'; # Copyright (C) 2001 Robert A. Van Valzah # # Bootstrap Vinum # # Read /etc/fstab and current partitioning for all spindles mentioned there. # Generate files needed to mirror all file systems on root spindle. # A new partition table for each spindle # Input for the vinum create command to create Vinum objects on each spindle # A copy of fstab mounting Vinum volumes instead of BSD partitions # Copies of fstab altered for server's degraded modes of operation # See handbook for instructions on how to use the files generated. # N.B. This bootstrapping method shrinks size of swap partition by the size # of Vinum's on-disk configuration (265 sectors). It embeds existing file # systems on the root spindle in Vinum objects without having to copy them. # Thanks to Greg Lehey for suggesting this bootstrapping method. 
# Expectations: # The root spindle must contain at least root, swap, and /usr partitions # The rootback spindle must have matching /rootback and swap partitions # Other spindles should only have a /NOFUTURE* file system and maybe swap # File systems named /NOFUTURE* will be replaced with Vinum drives # Change configuration variables below to suit your taste my $vip = 'h'; # VInum Partition my @drv = ('YouCrazy', 'UpWindow', 'ThruBank', # Vinum DRiVe names 'OutSnakes', 'MeWild', 'InMovie', 'HomeJames', 'DownPrices', 'WhileBlind'); # No configuration variables beyond this point my %vols; # One entry per Vinum volume to be created my @spndl; # One entry per SPiNDLe my $rsp; # Root SPindle (as in /dev/$rsp) my $rbsp; # RootBack SPindle (as in /dev/$rbsp) my $cfgsiz = 265; # Size of Vinum on-disk configuration info in sectors my $nxtpas = 2; # Next fsck pass number for non-root file systems # Parse fstab, generating the version we'll need for Vinum and noting # spindles in use. my $fsin = "/etc/fstab"; #my $fsin = "simu/fstab"; open(FSIN, "$fsin") || die("Couldn't open $fsin: $!\n"); my $fsout = "/etc/fstab.vinum"; open(FSOUT, ">$fsout") || die("Couldn't open $fsout for writing: $!\n"); while (<FSIN>) { my ($dev, $mnt, $fstyp, $opt, $dump, $pass) = split; next if $dev =~ /^#/; if ($mnt eq '/' || $mnt eq '/rootback' || $mnt =~ /^\/NOFUTURE/) { my $dn = substr($dev, 5, length($dev)-6); # Device Name without /dev/ push(@spndl, $dn) unless grep($_ eq $dn, @spndl); $rsp = $dn if $mnt eq '/'; next if $mnt =~ /^\/NOFUTURE/; } # Move /rootback from partition e to a if ($mnt =~ /^\/rootback/) { $dev =~ s/e$/a/; $pass = 1; $rbsp = substr($dev, 5, length($dev)-6); print FSOUT "$dev\t\t$mnt\t$fstyp\t$opt\t\t$dump\t$pass\n"; next; } # Move non-root file systems on smallest spindle into Vinum if (defined($rsp) && $dev =~ /^\/dev\/$rsp/ && $dev =~ /[d-h]$/) { $pass = $nxtpas++; print FSOUT "/dev/vinum$mnt\t\t$mnt\t\t$fstyp\t$opt\t\t$dump\t$pass\n"; $vols{$dev}->{mnt} = substr($mnt, 
1); next; } print FSOUT $_; } close(FSOUT); die("Found more spindles than we have abstract names\n") if $#spndl > $#drv; die("Didn't find a root partition!\n") if !defined($rsp); die("Didn't find a /rootback partition!\n") if !defined($rbsp); # Table of server's Degraded Modes # One row per mode with hash keys # fn FileName # xpr eXPRession needed to convert fstab lines for this mode # cm1 CoMment 1 describing this mode # cm2 CoMment 2 describing this mode # FH FileHandle (dynamically initialized below) my @DM = ( { cm1 => "When we only have $rsp, comment out lines using $rbsp", fn => "/etc/fstab_only_have_$rsp", xpr => "s:^/dev/$rbsp:#\$&:", }, { cm1 => "When we only have $rbsp, comment out lines using $rsp and", cm2 => "rootback becomes root", fn => "/etc/fstab_only_have_$rbsp", xpr => "s:^/dev/$rsp:#\$&: || s:/rootback:/\t:", }, { cm1 => "When only $rsp root is bad, /rootback becomes root and", cm2 => "root becomes /rootbad", fn => "/etc/fstab_${rsp}_root_bad", xpr => "s:\t/\t:\t/rootbad: || s:/rootback:/\t:", }, ); # Initialize output FileHandles and write comments foreach my $dm (@DM) { my $fh = new FileHandle; $fh->open(">$dm->{fn}") || die("Can't write $dm->{fn}: $!\n"); print $fh "# $dm->{cm1}\n" if $dm->{cm1}; print $fh "# $dm->{cm2}\n" if $dm->{cm2}; $dm->{FH} = $fh; } # Parse the Vinum version of fstab written above and write versions needed # for server's degraded modes. 
open(FSOUT, "$fsout") || die("Couldn't open $fsout: $!\n"); while (<FSOUT>) { my $line = $_; foreach my $dm (@DM) { $_ = $line; eval $dm->{xpr}; print {$dm->{FH}} $_; } } # Parse partition table for each spindle and write versions needed for Vinum my $rootsiz; # ROOT partition SIZe my $swapsiz; # SWAP partition SIZe my $rspminoff; # Root SPindle MINimum OFFset of non-root, non-swap, non-c parts my $rspsiz; # Root SPindle SIZe my $rbspsiz; # RootBack SPindle SIZe foreach my $i (0..$#spndl) { my $dlin = "disklabel $spndl[$i] |"; # my $dlin = "simu/disklabel.$spndl[$i]"; open(DLIN, "$dlin") || die("Couldn't open $dlin: $!\n"); my $dlout = "disklabel.$spndl[$i]"; open(DLOUT, ">$dlout") || die("Couldn't open $dlout for writing: $!\n"); my $dlb4 = "$dlout.b4vinum"; open(DLB4, ">$dlb4") || die("Couldn't open $dlb4 for writing: $!\n"); my $minoff; # MINimum OFFset of non-root, non-swap, non-c partitions my $totsiz = 0; # TOTal SIZe of all non-root, non-swap, non-c partitions my $swapspndl = 0; # True if SWAP partition on this SPiNDLe while (<DLIN>) { print DLB4 $_; my ($part, $siz, $off, $fstyp, $fsiz, $bsiz, $bps) = split; if ($part && $part eq 'a:' && $spndl[$i] eq $rsp) { $rootsiz = $siz; } if ($part && $part eq 'e:' && $spndl[$i] eq $rbsp) { if ($rootsiz != $siz) { die("Rootback size ($siz) != root size ($rootsiz)\n"); } } if ($part && $part eq 'c:') { $rspsiz = $siz if $spndl[$i] eq $rsp; $rbspsiz = $siz if $spndl[$i] eq $rbsp; } # Make swap partition $cfgsiz sectors smaller if ($part && $part eq 'b:') { if ($spndl[$i] eq $rsp) { $swapsiz = $siz; } else { if ($swapsiz != $siz) { die("Swap partition sizes unequal across spindles\n"); } } printf DLOUT "%4s%9d%9d%10s\n", $part, $siz-$cfgsiz, $off, $fstyp; $swapspndl = 1; next; } # Move rootback spindle e partitions to a if ($part && $part eq 'e:' && $spndl[$i] eq $rbsp) { printf DLOUT "%4s%9d%9d%10s%9d%6d%6d\n", 'a:', $siz, $off, $fstyp, $fsiz, $bsiz, $bps; next; } # Delete non-root, non-swap, non-c partitions but note 
their minimum # offset and total size that're needed below. if ($part && $part =~ /^[d-h]:$/) { $minoff = $off unless $minoff; $minoff = $off if $off < $minoff; $totsiz += $siz; if ($spndl[$i] eq $rsp) { # If doing spindle containing root my $dev = "/dev/$spndl[$i]" . substr($part, 0, 1); $vols{$dev}->{siz} = $siz; $vols{$dev}->{off} = $off; $rspminoff = $minoff; } next; } print DLOUT $_; } if ($swapspndl) { # If there was a swap partition on this spindle # Make a Vinum partition the size of all non-root, non-swap, # non-c partitions + the size of Vinum's on-disk configuration. # Set its offset so that the start of the first subdisk it contains # coincides with the first file system we're embedding in Vinum. printf DLOUT "%4s%9d%9d%10s\n", "$vip:", $totsiz+$cfgsiz, $minoff-$cfgsiz, 'vinum'; } else { # No need to mess with size and offset if there was no swap printf DLOUT "%4s%9d%9d%10s\n", "$vip:", $totsiz, $minoff, 'vinum'; } } die("Swap partition not found\n") unless $swapsiz; die("Swap partition not larger than $cfgsiz blocks\n") unless $swapsiz>$cfgsiz; die("Rootback spindle size not >= root spindle size\n") unless $rbspsiz>=$rspsiz; # Generate input to vinum create command needed for each spindle. foreach my $i (0..$#spndl) { my $cfn = "create.$drv[$i]"; # Create File Name open(CF, ">$cfn") || die("Can't open $cfn for writing: $!\n"); print CF "drive $drv[$i] device /dev/$spndl[$i]$vip\n"; next unless $spndl[$i] eq $rsp || $spndl[$i] eq $rbsp; foreach my $dev (keys(%vols)) { my $mnt = $vols{$dev}->{mnt}; my $siz = $vols{$dev}->{siz}; my $off = $vols{$dev}->{off}-$rspminoff+$cfgsiz; print CF "volume $mnt\n" if $spndl[$i] eq $rsp; print CF <<EOF; plex name $mnt.p$i org concat volume $mnt sd name $mnt.p$i.s0 drive $drv[$i] plex $mnt.p$i len ${siz}s driveoffset ${off}s EOF } } Manual Vinum Bootstrapping The bootvinum Perl script above makes life easier, but it may be necessary to manually perform some or all of the steps that it automates. 
This appendix describes how you would manually mimic the script. Make a copy of /etc/fstab to be customized. &prompt.root; cp /etc/fstab /etc/fstab.vinum Edit /etc/fstab.vinum. Change the device column of non-root partitions on the root spindle to /dev/vinum/mnt. Change the pass column of non-root partitions on the root spindle to 2, 3, etc. Delete any lines with mountpoint matching /NOFUTURE*. Change the device column of /rootback from e to a. Change the pass column of /rootback to 1. Prepare disklabels for editing: &prompt.root; cd /bootvinum &prompt.root; disklabel ad0s1 > disklabel.ad0s1 &prompt.root; cp disklabel.ad0s1 disklabel.ad0s1.b4vinum &prompt.root; disklabel ad2s1 > disklabel.ad2s1 &prompt.root; cp disklabel.ad2s1 disklabel.ad2s1.b4vinum Edit /etc/disklabel.ad?s1. On the root spindle: Decrease the size of the b partition by 265 blocks. Note the size and offset of the a and b partitions. Note the smallest offset for partitions d-h. Note the size and offset for all non-root, non-swap partitions (/home was probably on e and /usr was probably on f). Delete partitions d-h. Create a new h partition with offset 265 blocks less than the smallest offset for partitions d-h noted above. Set its size to the size of the c partition less the smallest offset for partitions d-h noted above + 265 blocks. Vinum can use any partition other than c. It is not strictly necessary to use h for all your Vinum partitions, but it is good practice to be consistent across all spindles. Set the fstype of this new partition to vinum. On the rootback spindle: Move the e partition to a. Verify that the size of the a and b partitions matches the root spindle. Note the smallest offset for partitions d-h. Delete partitions d-h. Create a new h partition with offset 265 blocks less than the smallest offset noted above for partitions d-h. Set its size to the size of the c partition less the smallest offset for partitions d-h noted above + 265 blocks. 
Set the fstype of this new partition to vinum. Create a file named create.YouCrazy that contains: drive YouCrazy device /dev/ad0s1h volume home plex name home.p0 org concat volume home sd name home.p0.s0 drive YouCrazy plex home.p0 len $hl driveoffset $ho volume usr plex name usr.p0 org concat volume usr sd name usr.p0.s0 drive YouCrazy plex usr.p0 len $ul driveoffset $uo Where: $hl is the length noted above for /home. $ho is the offset noted above for /home less the smallest offset noted above + 265 blocks. $ul is the length noted above for /usr. $uo is the offset noted above for /usr less the smallest offset noted above + 265 blocks. Create a file named create.UpWindow containing: drive UpWindow device /dev/ad2s1h plex name home.p1 org concat volume home sd name home.p1.s0 drive UpWindow plex home.p1 len $hl driveoffset $ho plex name usr.p1 org concat volume usr sd name usr.p1.s0 drive UpWindow plex usr.p1 len $ul driveoffset $uo Where $hl, $ho, $ul, and $uo are set as above. Acknowledgements I would like to thank Greg Lehey for writing &vinum.ap; and for providing very helpful comments on early drafts. Several others made helpful suggestions after reviewing later drafts including Dag-Erling Smørgrav, Michael Splendoria, Chern Lee, Stefan Aeschbacher, Fleming Froekjaer, Bernd Walter, Aleksey Baranov, and Doug Swarin.
diff --git a/en_US.ISO8859-1/articles/vm-design/article.sgml b/en_US.ISO8859-1/articles/vm-design/article.sgml index 3f400b365b..9e992c6039 100644 --- a/en_US.ISO8859-1/articles/vm-design/article.sgml +++ b/en_US.ISO8859-1/articles/vm-design/article.sgml @@ -1,838 +1,838 @@ %man; ]>
Design elements of the FreeBSD VM system Matthew Dillon
dillon@apollo.backplane.com
The title is really just a fancy way of saying that I am going to attempt to describe the whole VM enchilada, hopefully in a way that everyone can follow. For the last year I have concentrated on a number of major kernel subsystems within FreeBSD, with the VM and Swap subsystems being the most interesting and NFS being a necessary chore. I rewrote only small portions of the code. In the VM arena the only major rewrite I have done is to the swap subsystem. Most of my work was cleanup and maintenance, with only moderate code rewriting and no major algorithmic adjustments within the VM subsystem. The bulk of the VM subsystem's theoretical base remains unchanged and a lot of the credit for the modernization effort in the last few years belongs to John Dyson and David Greenman. Not being a historian like Kirk, I will not attempt to tag all the various features with people's names, since I will invariably get it wrong. This article was originally published in the January 2000 issue of DaemonNews. This version of the article may include updates from Matt and other authors to reflect changes in FreeBSD's VM implementation.
Introduction Before moving along to the actual design let's spend a little time on the necessity of maintaining and modernizing any long-living codebase. In the programming world, algorithms tend to be more important than code and it is precisely due to BSD's academic roots that a great deal of attention was paid to algorithm design from the beginning. More attention paid to the design generally leads to a clean and flexible codebase that can be fairly easily modified, extended, or replaced over time. While BSD is considered an old operating system by some people, those of us who work on it tend to view it more as a mature codebase which has various components modified, extended, or replaced with modern code. It has evolved, and FreeBSD is at the bleeding edge no matter how old some of the code might be. This is an important distinction to make and one that is unfortunately lost to many people. The biggest error a programmer can make is to not learn from history, and this is precisely the error that many other modern operating systems have made. NT is the best example of this, and the consequences have been dire. Linux also makes this mistake to some degree—enough that we BSD folk can make small jokes about it every once in a while, anyway. Linux's problem is simply one of a lack of experience and history to compare ideas against, a problem that is easily and rapidly being addressed by the Linux community in the same way it has been addressed in the BSD community—by continuous code development. The NT folk, on the other hand, repeatedly make the same mistakes solved by Unix decades ago and then spend years fixing them. Over and over again. They have a severe case of not designed here and we are always right because our marketing department says so. I have little tolerance for anyone who cannot learn from history. 
Much of the apparent complexity of the FreeBSD design, especially in the VM/Swap subsystem, is a direct result of having to solve serious performance issues that occur under various conditions. These issues are not due to bad algorithmic design but instead arise from environmental factors. In any direct comparison between platforms, these issues become most apparent when system resources begin to get stressed. As I describe FreeBSD's VM/Swap subsystem the reader should always keep two points in mind. First, the most important aspect of performance design is what is known as Optimizing the Critical Path. It is often the case that performance optimizations add a little bloat to the code in order to make the critical path perform better. Second, a solid, generalized design outperforms a heavily-optimized design over the long run. While a generalized design may end up being slower than a heavily-optimized design when they are first implemented, the generalized design tends to be easier to adapt to changing conditions and the heavily-optimized design winds up having to be thrown away. Any codebase that will survive and be maintainable for years must therefore be designed properly from the beginning even if it costs some performance. Twenty years ago people were still arguing that programming in assembly was better than programming in a high-level language because it produced code that was ten times as fast. Today, the fallibility of that argument is obvious, as are the parallels to algorithmic design and code generalization. VM Objects The best way to begin describing the FreeBSD VM system is to look at it from the perspective of a user-level process. Each user process sees a single, private, contiguous VM address space containing several types of memory objects. These objects have various characteristics. 
Program code and program data are effectively a single memory-mapped file (the binary file being run), but program code is read-only while program data is copy-on-write. Program BSS is just memory allocated and filled with zeros on demand, called demand zero page fill. Arbitrary files can be memory-mapped into the address space as well, which is how the shared library mechanism works. Such mappings can require modifications to remain private to the process making them. The fork system call adds an entirely new dimension to the VM management problem on top of the complexity already given. A program binary data page (which is a basic copy-on-write page) illustrates the complexity. A program binary contains a preinitialized data section which is initially mapped directly from the program file. When a program is loaded into a process's VM space, this area is initially memory-mapped and backed by the program binary itself, allowing the VM system to free/reuse the page and later load it back in from the binary. The moment a process modifies this data, however, the VM system must make a private copy of the page for that process. Since the private copy has been modified, the VM system may no longer free it, because there is no longer any way to restore it later on. You will notice immediately that what was originally a simple file mapping has become much more complex. Data may be modified on a page-by-page basis whereas the file mapping encompasses many pages at once. The complexity further increases when a process forks. When a process forks, the result is two processes—each with their own private address spaces, including any modifications made by the original process prior to the call to fork(). It would be silly for the VM system to make a complete copy of the data at the time of the fork() because it is quite possible that at least one of the two processes will only need to read from that page from then on, allowing the original page to continue to be used. 
What was a private page is made copy-on-write again, since each process (parent and child) expects its own personal post-fork modifications to remain private to itself and not affect the other. FreeBSD manages all of this with a layered VM Object model. The original binary program file winds up being the lowest VM Object layer. A copy-on-write layer is pushed on top of that to hold those pages which had to be copied from the original file. If the program modifies a data page belonging to the original file the VM system takes a fault and makes a copy of the page in the higher layer. When a process forks, additional VM Object layers are pushed on. This might make a little more sense with a fairly basic example. A fork() is a common operation for any *BSD system, so this example will consider a program that starts up and forks. When the process starts, the VM system creates an object layer; let's call this A: +---------------+ | A | +---------------+ A represents the file; pages may be paged in and out of the file's physical media as necessary. Paging in from the disk is reasonable for a program, but we really do not want to page back out and overwrite the executable. The VM system therefore creates a second layer, B, that will be physically backed by swap space: +---------------+ | B | +---------------+ | A | +---------------+ On the first write to a page after this, a new page is created in B, and its contents are initialized from A. All pages in B can be paged in or out to a swap device. When the program forks, the VM system creates two new object layers, C1 for the parent and C2 for the child, that rest on top of B: +-------+-------+ | C1 | C2 | +-------+-------+ | B | +---------------+ | A | +---------------+ In this case, let's say a page in B is modified by the original parent process. The process will take a copy-on-write fault and duplicate the page in C1, leaving the original page in B untouched. 
Now, let's say the same page in B is modified by the child process. The process will take a copy-on-write fault and duplicate the page in C2. The original page in B is now completely hidden since both C1 and C2 have a copy and B could theoretically be destroyed if it does not - represent a 'real' file). However, this sort of optimization is not + represent a real file). However, this sort of optimization is not trivial to make because it is so fine-grained. FreeBSD does not make this optimization. Now, suppose (as is often the case) that the child process does an exec(). Its current address space is usually replaced by a new address space representing a new file. In this case, the C2 layer is destroyed: +-------+ | C1 | +-------+-------+ | B | +---------------+ | A | +---------------+ In this case, the number of children of B drops to one, and all accesses to B now go through C1. This means that B and C1 can be collapsed together. Any pages in B that also exist in C1 are deleted from B during the collapse. Thus, even though the optimization in the previous step could not be made, we can recover the dead pages when either of the processes exit or exec(). This model creates a number of potential problems. The first is that you can wind up with a relatively deep stack of layered VM Objects which can cost scanning time and memory when you take a fault. Deep layering can occur when processes fork and then fork again (either parent or child). The second problem is that you can wind up with dead, inaccessible pages deep in the stack of VM Objects. In our last example if both the parent and child processes modify the same page, they both get their own private copies of the page and the original page in B is no longer accessible by anyone. That page in B can be freed. FreeBSD solves the deep layering problem with a special optimization called the All Shadowed Case. This case occurs if either C1 or C2 take sufficient COW faults to completely shadow all pages in B. 
Let's say that C1 achieves this. C1 can now bypass B entirely, so rather than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But look what also happened: now B has only one reference (C2), so we can collapse B and C2 together. The end result is that B is deleted entirely and we have C1->A and C2->A. It is often the case that B will contain a large number of pages and neither C1 nor C2 will be able to completely overshadow it. If we fork again and create a set of D layers, however, it is much more likely that one of the D layers will eventually be able to completely overshadow the much smaller dataset represented by C1 or C2. The same optimization will work at any point in the graph and the grand result of this is that even on a heavily forked machine VM Object stacks tend to not get much deeper than 4. This is true of both the parent and the children and true whether the parent is doing the forking or whether the children cascade forks. The dead page problem still exists in the case where C1 or C2 do not completely overshadow B. Due to our other optimizations this case does not represent much of a problem and we simply allow the pages to be dead. If the system runs low on memory it will swap them out, eating a little swap, but that is it. The advantage to the VM Object model is that fork() is extremely fast, since no real data copying need take place. The disadvantage is that you can build a relatively complex VM Object layering that slows page fault handling down a little, and you spend memory managing the VM Object structures. The optimizations FreeBSD makes reduce the problems enough that they can be ignored, leaving no real disadvantage. SWAP Layers Private data pages are initially either copy-on-write or zero-fill pages. When a change, and therefore a copy, is made, the original backing object (usually a file) can no longer be used to save a copy of the page when the VM system needs to reuse it for other purposes. 
This is where SWAP comes in. SWAP is allocated to create backing store for memory that does not otherwise have it. FreeBSD allocates the swap management structure for a VM Object only when it is actually needed. However, the swap management structure has had problems historically. Under FreeBSD 3.x the swap management structure preallocates an array that encompasses the entire object requiring swap backing store, even if only a few pages of that object are swap-backed. This creates a kernel memory fragmentation problem when large objects are mapped, or processes with large runsizes (RSS) fork. Also, in order to keep track of swap space, a list of holes is kept in kernel memory, and this tends to get severely fragmented as well. Since - the 'list of holes' is a linear list, the swap allocation and freeing + the list of holes is a linear list, the swap allocation and freeing performance is a non-optimal O(n)-per-page. It also requires kernel memory allocations to take place during the swap freeing process, and that creates low memory deadlock problems. The problem is further exacerbated by holes created due to the interleaving algorithm. Also, the swap block map can become fragmented fairly easily resulting in non-contiguous allocations. Kernel memory must also be allocated on the fly for additional swap management structures when a swapout occurs. It is evident that there was plenty of room for improvement. For FreeBSD 4.x, I completely rewrote the swap subsystem. With this rewrite, swap management structures are allocated through a hash table rather than a linear array giving them a fixed allocation size and much finer granularity. Rather than using a linearly linked list to keep track of swap space reservations, it now uses a bitmap of swap blocks arranged in a radix tree structure with free-space hinting in the radix node structures. This effectively makes swap allocation and freeing an O(1) operation. 
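The idea of a preallocated block bitmap with free-space hints can be sketched in a few lines of Python. This is a toy model, not the FreeBSD 4.x code: the class name SwapBitmap, the 8-block leaf size, and the single level of hints are all assumptions made for readability. A real radix tree would stack several such levels so that the descent is bounded by the tree depth.

```python
class SwapBitmap:
    """Toy block bitmap with one level of per-leaf free-space hints."""

    LEAF_BITS = 8  # blocks tracked per leaf (hypothetical value)

    def __init__(self, nblocks):
        assert nblocks % self.LEAF_BITS == 0
        nleaves = nblocks // self.LEAF_BITS
        # Everything is preallocated here, so alloc() and free() never
        # need to allocate memory while the system is short of it.
        self.leaves = [0] * nleaves                   # bit set = block in use
        self.free_hint = [self.LEAF_BITS] * nleaves   # free blocks per leaf

    def alloc(self):
        """Return a free block number, or None if every block is in use."""
        for leaf, nfree in enumerate(self.free_hint):
            if nfree == 0:
                continue  # the hint lets us skip a full leaf without scanning it
            bits = self.leaves[leaf]
            for bit in range(self.LEAF_BITS):
                if not bits & (1 << bit):
                    self.leaves[leaf] = bits | (1 << bit)
                    self.free_hint[leaf] -= 1
                    return leaf * self.LEAF_BITS + bit
        return None

    def free(self, block):
        """Mark a previously allocated block as free again."""
        leaf, bit = divmod(block, self.LEAF_BITS)
        assert self.leaves[leaf] & (1 << bit), "block was not allocated"
        self.leaves[leaf] &= ~(1 << bit)
        self.free_hint[leaf] += 1
```

Freeing only flips a bit and bumps a counter, which is why no list node ever has to be created during a swap operation in this model.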
The entire radix tree bitmap is also preallocated in order to avoid having to allocate kernel memory during critical low memory swapping operations. After all, the system tends to swap when it is low on memory so we should avoid allocating kernel memory at such times in order to avoid potential deadlocks. Finally, to reduce fragmentation the radix tree is capable of allocating large contiguous chunks at once, skipping over smaller fragmented chunks. I did not take - the final step of having an 'allocating hint pointer' that would trundle + the final step of having an allocating hint pointer that would trundle through a portion of swap as allocations were made in order to further guarantee contiguous allocations or at least locality of reference, but I ensured that such an addition could be made. When to free a page Since the VM system uses all available memory for disk caching, there are usually very few truly-free pages. The VM system depends on being able to properly choose pages which are not in use to reuse for new allocations. Selecting the optimal pages to free is possibly the single-most important function any VM system can perform because if it makes a poor selection, the VM system may be forced to unnecessarily retrieve pages from disk, seriously degrading system performance. How much overhead are we willing to suffer in the critical path to avoid freeing the wrong page? Each wrong choice we make will cost us hundreds of thousands of CPU cycles and a noticeable stall of the affected processes, so we are willing to endure a significant amount of overhead in order to be sure that the right page is chosen. This is why FreeBSD tends to outperform other systems when memory resources become stressed. The free page determination algorithm is built upon a history of the use of memory pages. To acquire this history, the system takes advantage of a page-used bit feature that most hardware page tables have. 
In any case, the page-used bit is cleared and at some later point the VM system comes across the page again and sees that the page-used bit has been set. This indicates that the page is still being actively used. If the bit is still clear it is an indication that the page is not being actively used. By testing this bit periodically, a use history (in the form of a counter) for the physical page is developed. When the VM system later needs to free up some pages, checking this history becomes the cornerstone of determining the best candidate page to reuse. What if the hardware has no page-used bit? For those platforms that do not have this feature, the system actually emulates a page-used bit. It unmaps or protects a page, forcing a page fault if the page is accessed again. When the page fault is taken, the system simply marks the page as having been used and unprotects the page so that it may be used. While taking such page faults just to determine if a page is being used appears to be an expensive proposition, it is much less expensive than reusing the page for some other purpose only to find that a process needs it back and then have to go to disk. FreeBSD makes use of several page queues to further refine the selection of pages to reuse as well as to determine when dirty pages must be flushed to their backing store. Since page tables are dynamic entities under FreeBSD, it costs virtually nothing to unmap a page from the address space of any processes using it. When a page candidate has been chosen based on the page-use counter, this is precisely what is done. The system must make a distinction between clean pages which can theoretically be freed up at any time, and dirty pages which must first be written to their backing store before being reusable. When a page candidate has been found it is moved to the inactive queue if it is dirty, or the cache queue if it is clean. 
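The use-history counter and the queue placement rule described above can be modelled roughly as follows. This is a sketch under stated assumptions, not the kernel's implementation: the Page class, the field names, and the ACT_MAX cap are hypothetical, and the real system decides when to scan and which counter value makes a page a candidate by its own policy.

```python
from dataclasses import dataclass

ACT_MAX = 64  # hypothetical cap on the use-history counter


@dataclass
class Page:
    used_bit: bool = False  # stands in for the hardware page-used bit
    dirty: bool = False
    act_count: int = 0      # use history built up over successive scans
    queue: str = "active"


def scan(page):
    """One periodic pass: sample the used bit, update the history, clear the bit."""
    if page.used_bit:
        page.act_count = min(page.act_count + 1, ACT_MAX)
    else:
        page.act_count = max(page.act_count - 1, 0)
    page.used_bit = False  # cleared so the next pass sees only new accesses


def deactivate(page):
    """Place a reuse candidate: dirty pages go to inactive, clean pages to cache."""
    page.queue = "inactive" if page.dirty else "cache"
```

A page that keeps getting touched between scans accumulates a high counter and stays in the active queue, while an idle page decays toward zero and becomes a candidate for deactivate().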
A separate algorithm based on the dirty-to-clean page ratio determines when dirty pages in the inactive queue must be flushed to disk. Once this is accomplished, the flushed pages are moved from the inactive queue to the cache queue. At this point, pages in the cache queue can still be reactivated by a VM fault at relatively low cost. However, pages in the cache queue are considered to be immediately freeable and will be reused in an LRU (least-recently used) fashion when the system needs to allocate new memory. It is important to note that the FreeBSD VM system attempts to separate clean and dirty pages for the express reason of avoiding unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does it move pages between the various page queues gratuitously when the memory subsystem is not being stressed. This is why you will see some systems with very low cache queue counts and high active queue counts when doing a systat -vm command. As the VM system becomes more stressed, it makes a greater effort to maintain the various page queues at the levels determined to be the most effective. An urban myth has circulated for years that Linux did a better job avoiding swapouts than FreeBSD, but this in fact is not true. What was actually occurring was that FreeBSD was proactively paging out unused pages in order to make room for more disk cache while Linux was keeping unused pages in core and leaving less memory available for cache and process pages. I do not know whether this is still true today. Pre-Faulting and Zeroing Optimizations Taking a VM fault is not expensive if the underlying page is already in core and can simply be mapped into the process, but it can become expensive if you take a whole lot of them on a regular basis. A good example of this is running a program such as &man.ls.1; or &man.ps.1; over and over again. 
If the program binary is mapped into memory but not mapped into the page table, then all the pages that will be accessed by the program will have to be faulted in every time the program is run. This is unnecessary when the pages in question are already in the VM Cache, so FreeBSD will attempt to pre-populate a process's page tables with those pages that are already in the VM Cache. One thing that FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For example, if you run the &man.ls.1; program while running vmstat 1 you will notice that it always takes a certain number of page faults, even when you run it over and over again. These are zero-fill faults, not program code faults (which were pre-faulted in already). Pre-copying pages on exec or fork is an area that could use more study. A large percentage of page faults that occur are zero-fill faults. You can usually see this by observing the vmstat -s output. These occur when a process accesses pages in its BSS area. The BSS area is expected to be initially zero but the VM system does not bother to allocate any memory at all until the process actually accesses it. When a fault occurs the VM system must not only allocate a new page, it must zero it as well. To optimize the zeroing operation the VM system has the ability to pre-zero pages and mark them as such, and to request pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs whenever the CPU is idle but the number of pages the system pre-zeros is limited in order to avoid blowing away the memory caches. This is an excellent example of adding complexity to the VM system in order to optimize the critical path. Page Table Optimizations The page table optimizations make up the most contentious part of the FreeBSD VM design and they have shown some strain with the advent of serious use of mmap(). I think this is actually a feature of most BSDs though I am not sure when it was first introduced. There are two major optimizations. 
The first is that hardware page tables do not contain persistent state but instead can be thrown away at any time with only a minor amount of management overhead. The second is that every active page table entry in the system has a governing pv_entry structure which is tied into the vm_page structure. FreeBSD can simply iterate through those mappings that are known to exist while Linux must check all page tables that might contain a specific mapping to see if it does, which can achieve O(n^2) overhead in certain situations. It is because of this that FreeBSD tends to make better choices on which pages to reuse or swap when memory is stressed, giving it better performance under load. However, FreeBSD requires kernel tuning to accommodate large-shared-address-space situations such as those that can occur in a news system because it may run out of pv_entry structures. Both Linux and FreeBSD need work in this area. FreeBSD is trying to maximize the advantage of a potentially sparse active-mapping model (not all processes need to map all pages of a shared library, for example), whereas Linux is trying to simplify its algorithms. FreeBSD generally has the performance advantage here at the cost of wasting a little extra memory, but FreeBSD breaks down in the case where a large file is massively shared across hundreds of processes. Linux, on the other hand, breaks down in the case where many processes are sparsely-mapping the same shared library and also runs non-optimally when trying to determine whether a page can be reused or not. Page Coloring We will end with the page coloring optimizations. Page coloring is a performance optimization designed to ensure that accesses to contiguous pages in virtual memory make the best use of the processor cache. In ancient times (i.e. 10+ years ago) processor caches tended to map virtual memory rather than physical memory. 
This led to a huge number of problems including having to clear the cache on every context switch in some cases, and problems with data aliasing in the cache. Modern processor caches map physical memory precisely to solve those problems. This means that two side-by-side pages in a process's address space may not correspond to two side-by-side pages in the cache. In fact, if you are not careful side-by-side pages in virtual memory could wind up using the same page in the processor cache, leading to cacheable data being thrown away prematurely and reducing CPU performance. This is true even with multi-way set-associative caches (though the effect is mitigated somewhat). FreeBSD's memory allocation code implements page coloring optimizations, which means that the memory allocation code will attempt to locate free pages that are contiguous from the point of view of the cache. For example, if page 16 of physical memory is assigned to page 0 of a process's virtual memory and the cache can hold 4 pages, the page coloring code will not assign page 20 of physical memory to page 1 of a process's virtual memory. It would, instead, assign page 21 of physical memory. The page coloring code attempts to avoid assigning page 20 because this maps over the same cache memory as page 16 and would result in non-optimal caching. This code adds a significant amount of complexity to the VM memory allocation subsystem as you can well imagine, but the result is well worth the effort. Page Coloring makes VM memory as deterministic as physical memory with regard to cache performance. Conclusion Virtual memory in modern operating systems must address a number of different issues efficiently and for many different usage patterns. The modular and algorithmic approach that BSD has historically taken allows us to study and understand the current implementation as well as relatively cleanly replace large sections of the code. 
There have been a number of improvements to the FreeBSD VM system in the last several years, and work is ongoing. Bonus QA session by Allen Briggs <email>briggs@ninthwonder.com</email> What is the interleaving algorithm that you refer to in your listing of the ills of the FreeBSD 3.x swap arrangements? FreeBSD uses a fixed swap interleave which defaults to 4. This means that FreeBSD reserves space for four swap areas even if you only have one, two, or three. Since swap is interleaved the linear address space representing the four swap areas will be fragmented if you do not actually have four swap areas. For example, if you have two swap areas A and B FreeBSD's address space representation for that swap area will be interleaved in blocks of 16 pages: A B C D A B C D A B C D A B C D FreeBSD 3.x uses a sequential list of free regions approach to accounting for the free swap areas. The idea is that large blocks of free linear space can be represented with a single list node (kern/subr_rlist.c). But due to the fragmentation the sequential list winds up being insanely fragmented. In the above example, completely unused swap will have A and B shown as free and C and D shown as all allocated. Each A-B sequence requires a list node to account for because C and D are holes, so the list node cannot be combined with the next A-B sequence. Why do we interleave our swap space instead of just tack swap areas onto the end and do something fancier? Because it is a whole lot easier to allocate linear swaths of an address space and have the result automatically be interleaved across multiple disks than it is to try to put that sophistication elsewhere. The fragmentation causes other problems. Being a linear list under 3.x, and having such a huge amount of inherent fragmentation, allocating and freeing swap winds up being an O(N) algorithm instead of an O(1) algorithm. 
Combine that with other factors (heavy swapping) and you start getting into O(N^2) and O(N^3) levels of overhead, which is bad. The 3.x system may also need to allocate KVM during a swap operation to create a new list node which can lead to a deadlock if the system is trying to page out pages in a low-memory situation. Under 4.x we do not use a sequential list. Instead we use a radix tree and bitmaps of swap blocks rather than ranged list nodes. We take the hit of preallocating all the bitmaps required for the entire swap area up front but it winds up wasting less memory due to the use of a bitmap (one bit per block) instead of a linked list of nodes. The use of a radix tree instead of a sequential list gives us nearly O(1) performance no matter how fragmented the tree becomes. I do not get the following:
It is important to note that the FreeBSD VM system attempts to separate clean and dirty pages for the express reason of avoiding unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does it move pages between the various page queues gratuitously when the memory subsystem is not being stressed. This is why you will see some systems with very low cache queue counts and high active queue counts when doing a systat -vm command.
How is the separation of clean and dirty (inactive) pages related to the situation where you see low cache queue counts and high active queue counts in systat -vm? Do the systat stats roll the active and dirty pages together for the active queue count?
Yes, that is confusing. The relationship is goal versus reality. Our goal is to separate the pages but the reality is that if we are not in a memory crunch, we do not really have to. What this means is that FreeBSD will not try very hard to separate out dirty pages (inactive queue) from clean pages (cache queue) when the system is not being stressed, nor will it try to deactivate pages (active queue -> inactive queue) when the system is not being stressed, even if they are not being used.
In the &man.ls.1; / vmstat 1 example, would not some of the page faults be data page faults (COW from executable file to private page)? I.e., I would expect the page faults to be some zero-fill and some program data. Or are you implying that FreeBSD does do pre-COW for the program data? A COW fault can be either zero-fill or program-data. The mechanism is the same either way because the backing program-data is almost certainly already in the cache. I am indeed lumping the two together. FreeBSD does not pre-COW program data or zero-fill, but it does pre-map pages that exist in its cache. In your section on page table optimizations, can you give a little more detail about pv_entry and vm_page (or should vm_page be vm_pmap, as in 4.4, cf. pp. 180-181 of McKusick, Bostic, Karels, Quarterman)? Specifically, what kind of operation/reaction would require scanning the mappings? How does Linux do in the case where FreeBSD breaks down (sharing a large file mapping over many processes)? A vm_page represents an (object,index#) tuple. A pv_entry represents a hardware page table entry (pte). If you have five processes sharing the same physical page, and three of those processes' page tables actually map the page, that page will be represented by a single vm_page structure and three pv_entry structures. pv_entry structures only represent pages mapped by the MMU (one pv_entry represents one pte). This means that when we need to remove all hardware references to a vm_page (in order to reuse the page for something else, page it out, clear it, dirty it, and so forth) we can simply scan the linked list of pv_entry's associated with that vm_page to remove or modify the pte's from their page tables. Under Linux there is no such linked list. In order to remove all the hardware page table mappings for a vm_page Linux must index into every VM object that might have mapped the page. 
For example, if you have 50 processes all mapping the same shared library and want to get rid of page X in that library, you need to index into the page table for each of those 50 processes even if only 10 of them have actually mapped the page. So Linux is trading off the simplicity of its design against performance. Many VM algorithms which are O(1) or (small N) under FreeBSD wind up being O(N), O(N^2), or worse under Linux. Since the pte's representing a particular page in an object tend to be at the same offset in all the page tables they are mapped in, reducing the number of accesses into the page tables at the same pte offset will often avoid blowing away the L1 cache line for that offset, which can lead to better performance. FreeBSD has added complexity (the pv_entry scheme) in order to increase performance (to limit page table accesses to only those pte's that need to be modified). But FreeBSD has a scaling problem that Linux does not in that there are a limited number of pv_entry structures and this causes problems when you have massive sharing of data. In this case you may run out of pv_entry structures even though there is plenty of free memory available. This can be fixed easily enough by bumping up the number of pv_entry structures in the kernel config, but we really need to find a better way to do it. In regards to the memory overhead of a page table versus the pv_entry scheme: Linux uses permanent page tables that are not throw-away, but does not need a pv_entry for each potentially mapped pte. FreeBSD uses throw-away page tables but adds in a pv_entry structure for each actually-mapped pte. I think memory utilization winds up being about the same, giving FreeBSD an algorithmic advantage with its ability to throw away page tables at will with very low overhead. Finally, in the page coloring section, it might help to have a little more description of what you mean here. I did not quite follow it. Do you know how an L1 hardware memory cache works? 
I will explain: Consider a machine with 16MB of main memory but only 128K of L1 cache. Generally the way this cache works is that each 128K block of main memory uses the same 128K of cache. If you access offset 0 in main memory and then offset 128K in main memory you can wind up throwing away the cached data you read from offset 0! Now, I am simplifying things greatly. What I just described is what is called a direct mapped hardware memory cache. Most modern caches are what are called 2-way-set-associative or 4-way-set-associative caches. The set associativity allows you to access up to N different memory regions that overlap the same cache memory without destroying the previously cached data. But only N. So if I have a 4-way set associative cache I can access offset 0, offset 128K, offset 256K and offset 384K and still be able to access offset 0 again and have it come from the L1 cache. If I then access offset 512K, however, one of the four previously cached data objects will be thrown away by the cache. It is extremely important… extremely important for most of a processor's memory accesses to be able to come from the L1 cache, because the L1 cache operates at the processor frequency. The moment you have an L1 cache miss and have to go to the L2 cache or to main memory, the processor will stall and potentially sit twiddling its fingers for hundreds of instructions' worth of time waiting for a read from main memory to complete. Main memory (the dynamic RAM you stuff into a computer) is slow when compared to the speed of a modern processor core. Ok, so now onto page coloring: All modern memory caches are what are known as physical caches. They cache physical memory addresses, not virtual memory addresses. This allows the cache to be left alone across a process context switch, which is very important. But in the Unix world you are dealing with virtual address spaces, not physical address spaces. Any program you write will see the virtual address space given to it. 
The actual physical pages underlying that virtual address space are not necessarily physically contiguous! In fact, you might have two pages that are side by side in a process's address space which wind up being at offset 0 and offset 128K in physical memory. A program normally assumes that two side-by-side pages will be optimally cached. That is, that you can access data objects in both pages without having them blow away each other's cache entry. But this is only true if the physical pages underlying the virtual address space are contiguous (insofar as the cache is concerned). This is what page coloring does. Instead of assigning random physical pages to virtual addresses, which may result in non-optimal cache performance, page coloring assigns reasonably-contiguous physical pages to virtual addresses. Thus programs can be written under the assumption that the characteristics of the underlying hardware cache are the same for their virtual address space as they would be if the program had been run directly in a physical address space. Note that I say reasonably contiguous rather than simply contiguous. From the point of view of a 128K direct mapped cache, the physical address 0 is the same as the physical address 128K. So two side-by-side pages in your virtual address space may wind up being offset 128K and offset 132K in physical memory, but could also easily be offset 128K and offset 4K in physical memory and still retain the same cache performance characteristics. So page coloring does not have to assign truly contiguous pages of physical memory to contiguous pages of virtual memory; it just needs to make sure it assigns contiguous pages from the point of view of cache performance and operation.
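The color arithmetic in this answer can be sketched in a few lines of Python. The sketch assumes the 128K direct mapped cache from the example and a 4K page size (implied by the 128K and 132K offsets above, but an assumption all the same). The function names color and pick_page are hypothetical and do not correspond to FreeBSD's allocator interface.

```python
PAGE_SIZE = 4 * 1024                # assumed page size
CACHE_SIZE = 128 * 1024             # direct mapped cache from the example
NCOLORS = CACHE_SIZE // PAGE_SIZE   # number of page-sized slots in the cache


def color(phys_addr):
    """Cache slot (color) that a physical address falls into."""
    return (phys_addr // PAGE_SIZE) % NCOLORS


def pick_page(free_pages, wanted_color):
    """Return a free physical page number of the wanted color.

    Falls back to the first free page when no page of that color is
    available, and returns None when the free list is empty.
    """
    for pfn in free_pages:
        if pfn % NCOLORS == wanted_color:
            return pfn
    return free_pages[0] if free_pages else None
```

With these numbers, physical offsets 0 and 128K share a color and would evict each other, while 132K and 4K both have the color that follows 128K, which is why either one is an equally good neighbor for the page at 128K.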