diff --git a/en_US.ISO8859-1/articles/diskless-x/article.sgml b/en_US.ISO8859-1/articles/diskless-x/article.sgml
index 235a346099..ec16577767 100644
--- a/en_US.ISO8859-1/articles/diskless-x/article.sgml
+++ b/en_US.ISO8859-1/articles/diskless-x/article.sgml
@@ -1,349 +1,349 @@
%man;
]>
Diskless X Server: a how to guideJerryKendalljerry@kcis.com28-December-19961996Jerry KendallWith the help of some friends on the FreeBSD-hackers list, I have
been able to create a diskless X terminal. The creation of the X
terminal required first creating a diskless system with minimal
utilities mounted via NFS. These same steps were used to create two
separate diskless systems. The first is altair.example.com, a diskless X terminal that I
run on my old 386DX-40. It has a 340 MB hard disk, but I did not want
to change it, so it boots from antares.example.com across an Ethernet. The second
system is a 486DX2-66. I set up a complete diskless FreeBSD system that
uses no local disk. The server in that case is a Sun 670MP running
SunOS 4.1.3. The same setup configuration was needed for both.I am sure that there is stuff that needs to be added
to this. Please send me any comments.Creating the boot floppy (On the diskless system)Since the network boot loaders will not work with some of the TSR's
and such that MS-DOS uses, it is best to create a dedicated boot floppy
or, if you can, create an MS-DOS menu that will (via the
config.sys/autoexec.bat files)
ask what configuration to load when the system starts. The latter is the
method that I use and it works great. My MS-DOS (6.x) menu is
below.config.sys[menu]
menuitem=normal, normal
menuitem=unix, unix
[normal]
....
normal config.sys stuff
...
[unix]autoexec.bat@ECHO OFF
goto %config%
:normal
...
normal autoexec.bat stuff
...
goto end
:unix
cd \netboot
nb8390.com
:endGetting the network boot programs (On the server)
- Compile the 'net-boot' programs that are located in
+ Compile the net-boot programs that are located in
/usr/src/sys/i386/boot/netboot. You should read
the comments at the top of the Makefile. Adjust as
required. Make a backup of the original in case it gets foobar'd. When
the build is done, there should be 2 MS-DOS executables,
nb8390.com and nb3c509.com.
One of these two programs will be what you need to run on the diskless
system. It will load the kernel from the boot server. At this point,
put both programs on the MS-DOS boot floppy created earlier.Determine which program to run (On the diskless system)If you know the chipset that your Ethernet adapter uses, this is
easy. If you have the NS8390 chipset, or a NS8390 based chipset, use
nb8390.com. If you have a 3Com 509 based chipset,
use the nb3c509.com boot program. If you are not
sure which you have, try using one; if it says No adapter
found, try the other. Beyond that, you are pretty much on
your own.Booting across the networkBoot the diskless system without any config.sys/autoexec.bat
files. Try running the boot program for your Ethernet adapter.My Ethernet adapter is running in WD8013 16-bit mode so I run
nb8390.comC:>cd \netbootC:>nb8390Boot from Network (Y/N) ?Y
BOOTP/TFTP/NFS bootstrap loader ESC for menu
Searching for adapter..
WD8013EBT base 0x0300, memory 0x000D8000, addr 00:40:01:43:26:66
Searching for server...At this point, my diskless system is trying to find a machine to act
as a boot server. Make note of the addr line above,
you will need this number later. Reset the diskless system and modify
your config.sys and
autoexec.bat files to do these steps automatically
for you, perhaps from a menu. If you had to run
nb3c509.com instead of nb8390.com,
the output is the same as above. If you got No adapter
found at the Searching for adapter...
message, verify that you did indeed set the compile time defines in the
Makefile correctly.Allowing systems to boot across the network (On the server)Make sure the /etc/inetd.conf file has entries
for tftp and bootps. Mine are listed below:tftp dgram udp wait nobody /usr/libexec/tftpd tftpd /tftpboot
#
# Additions by who ever you are
bootps dgram udp wait root /usr/libexec/bootpd bootpd /etc/bootptabIf you have to change the /etc/inetd.conf file,
send a HUP signal to inetd. To do this, get the
process ID of inetd with ps -ax | grep inetd | grep -v
grep. Once you have it, send it a HUP signal with
kill -HUP <pid>. This will force inetd to
re-read its config file.Did you remember to note the addr line from the
output of the boot loader on the diskless system? Guess what, here is
where you need it.Add an entry to /etc/bootptab (maybe creating the
file). It should be laid out like this:altair:\
:ht=ether:\
:ha=004001432666:\
:sm=255.255.255.0:\
:hn:\
:ds=199.246.76.1:\
:ip=199.246.76.2:\
:gw=199.246.76.1:\
:vm=rfc1048:The lines are as follows:altairthe diskless system's name without the domain name.ht=ether
- the hardware type of 'ethernet'.
+ the hardware type of ethernet.ha=004001432666the hardware address (the number noted above).sm=255.255.255.0the subnet mask.hntells server to send client's hostname to the
client.ds=199.246.76.1tells the client who the domain server is.ip=199.246.76.2tells the client what its IP address is.gw=199.246.76.1tells the client what the default gateway is.vm=...just leave it there.Be sure to set up the IP addresses correctly; the addresses above
are my own.Create the directory /tftpboot on the server; it will contain the
configuration files for the diskless systems that the server will serve.
These files will be named cfg.ip where ip is the IP
address of the diskless system. The config file for altair is
/tftpboot/cfg.199.246.76.2. The contents are:rootfs 199.246.76.1:/DiskLess/rootfs/altair
hostname altair.example.comThe line hostname altair.example.com simply tells
the diskless system what its fully qualified domain name is.The line rootfs
199.246.76.1:/DiskLess/rootfs/altair tells the diskless
system where its NFS mountable root filesystem is located.The NFS mounted root filesystem will be mounted read
only.The hierarchy for the diskless system can be re-mounted allowing
read-write operations if required.I use my spare 386DX-40 as a dedicated X terminal.The hierarchy for altair is:/
/bin
/etc
/tmp
/sbin
/dev
/dev/fd
/usr
/var
/var/runThe actual list of files is:-r-xr-xr-x 1 root wheel 779984 Dec 11 23:44 ./kernel
-r-xr-xr-x 1 root bin 299008 Dec 12 00:22 ./bin/sh
-rw-r--r-- 1 root wheel 499 Dec 15 15:54 ./etc/rc
-rw-r--r-- 1 root wheel 1411 Dec 11 23:19 ./etc/ttys
-rw-r--r-- 1 root wheel 157 Dec 15 15:42 ./etc/hosts
-rw-r--r-- 1 root bin 1569 Dec 15 15:26 ./etc/XF86Config.altair
-r-x------ 1 bin bin 151552 Jun 10 1995 ./sbin/init
-r-xr-xr-x 1 bin bin 176128 Jun 10 1995 ./sbin/ifconfig
-r-xr-xr-x 1 bin bin 110592 Jun 10 1995 ./sbin/mount_nfs
-r-xr-xr-x 1 bin bin 135168 Jun 10 1995 ./sbin/reboot
-r-xr-xr-x 1 root bin 73728 Dec 13 22:38 ./sbin/mount
-r-xr-xr-x 1 root wheel 1992 Jun 10 1995 ./dev/MAKEDEV.local
-r-xr-xr-x 1 root wheel 24419 Jun 10 1995 ./dev/MAKEDEVDo not forget to run MAKEDEV all in the
dev directory.My /etc/rc for altair
is:#!/bin/sh
#
PATH=/bin:/
export PATH
#
# configure the localhost
/sbin/ifconfig lo0 127.0.0.1
#
# configure the ethernet card
/sbin/ifconfig ed0 199.246.76.2 netmask 0xffffff00
#
# mount the root filesystem via NFS
/sbin/mount antares:/DiskLess/rootfs/altair /
#
# mount the /usr filesystem via NFS
/sbin/mount antares:/DiskLess/usr /usr
#
/usr/X11R6/bin/XF86_SVGA -query antares -xf86config /etc/XF86Config.altair > /dev/null 2>&1
#
# Reboot after X exits
/sbin/reboot
#
# We blew up....
exit 1Any comments and all questions welcome.
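The server-side bookkeeping described above can be sketched in a few lines of shell. This is only an illustration and not part of the netboot or bootpd tools: the helper names mac_to_ha and cfg_path are made up here, and the addresses are the ones used for altair in this article. It shows how the addr value printed by the boot loader maps to the ha= field in /etc/bootptab, and how the name of the configuration file under /tftpboot is derived from the client's IP address.

```shell
#!/bin/sh
# Sketch only: mac_to_ha and cfg_path are hypothetical helper names,
# not commands shipped with FreeBSD, netboot, or bootpd.

# Strip the colons from an Ethernet address as printed by the boot
# loader, giving the form used in the ha= field of /etc/bootptab.
mac_to_ha() {
    printf '%s\n' "$1" | tr -d ':'
}

# Build the per-client configuration file name under /tftpboot
# from the client's IP address.
cfg_path() {
    printf '/tftpboot/cfg.%s\n' "$1"
}

mac_to_ha 00:40:01:43:26:66    # prints 004001432666
cfg_path 199.246.76.2          # prints /tftpboot/cfg.199.246.76.2
```

Both functions only print text, so they can be tried on any machine before touching the real /etc/bootptab or /tftpboot.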
diff --git a/en_US.ISO8859-1/articles/filtering-bridges/article.sgml b/en_US.ISO8859-1/articles/filtering-bridges/article.sgml
index de0ffd3172..af4cf51e9a 100644
--- a/en_US.ISO8859-1/articles/filtering-bridges/article.sgml
+++ b/en_US.ISO8859-1/articles/filtering-bridges/article.sgml
@@ -1,356 +1,356 @@
%man;
]>
Filtering BridgesNickSayernsayer@FreeBSD.org$FreeBSD$For those of you who do not know, DSL differs from more traditional
- connectivity methods in that the "connectivity spigot" that comes
+ connectivity methods in that the connectivity spigot that comes
out of the wall has no possibility for packet filtering. If you get
a T1 line or some such it will come with a router that can generally
include a packet filter. If you get ISDN or a dialup link, you also
either have a software routing component (a PPP daemon, specifically)
that can do some filtering or can be combined with a filter on the
machine running the link. But with DSL you only get a little white
box with some Blinkenlights on it and an Ethernet port that takes
your traffic back and forth from the Internet and nothing else (to
some extent the same can be said of other mass-market high speed
connectivity methods, like cable modems or high speed wireless links
as well. The same technique I plan to describe works just as well
for them, or for any other technology that provides an Ethernet
port with no filtering).Why use a filtering bridge?Bridging is not the only conceivable option. It is possible to
set up a two Ethernet machine as a router instead of a bridge.
Where it is possible to do so, it is actually a better idea.
Bridges run their interfaces in promiscuous mode, meaning they
must process every packet presented to them. The problem is
that routers can only route traffic between different subnets.
Also, subnets can only be made by cutting an existing space in
half or defining a new space that is typically unroutable (see
RFC 1918).
This wastes half of the useful addresses (or at least puts
- them on the "wrong" side of the router -- the thing that is
+ them on the wrong side of the router—the thing that is
doing the packet filtering that makes the inside network safe).
Using a bridge costs some CPU cycles, but makes all of the
problems of adding a 2nd router go away.Configuring a KernelAfter configuring and installing a kernel as shown here, you
should carry out the other
final preparation
tasks before booting into your new kernel.Adding bridging to a FreeBSD machine is not hard to do. It means
having 2 (or more, but we will just use 2 here) Ethernet cards and adding
a couple of lines to the kernel configuration. Since May of 2000,
RELENG_4 and -current have had bridging support for all Ethernet
interfaces. This does not mean that any Ethernet interface will work.
For them to work, they have to support a working promiscuous mode for
- both reception and transmission -- that is, they have to be able to
+ both reception and transmission—that is, they have to be able to
transmit Ethernet packets with any source address, not just their own.
In order to get good throughput, the cards should also be PCI bus
mastering cards. The best choices still are the Intel EtherExpress Pro
100 cards, with 3Com 3c9xx cards being second.So you will want to add the following to your kernel configuration
file:device fxp (or whatever is appropriate for the cards you are using)
options BRIDGE
options IPFIREWALL
options IPFIREWALL_VERBOSENote that recent versions of FreeBSD support dynamically loading the
IP Firewall code into the kernel. You can not do this, however, with
bridging, as the bridge code itself needs to interact with IPFIREWALL
in a special way.It is also a good idea at this point to see if Luigi has updated
versions of the bridge code available that are more recent than what is
in the distribution. As an example, 3.3-RELEASE comes with 981214, but
as of this writing, the most up-to-date bridge code is 990810. You can
fetch the latest version from
http://www.iet.unipi.it/~luigi/. You will want to fetch bridge.c and bridge.h and drop them into sys/net/.For instructions on how to build and install a new kernel, refer to
the
Building and Installing a Custom Kernel section of the Handbook.Final PreparationBefore you boot the new kernel, you must make some preparations in
rc.boot and rc.firewall. The
default rule for the firewall is to drop all packets on the floor. You
- will want to override this by setting up the 'open' firewall in
+ will want to override this by setting up the open firewall in
/etc/rc.conf. Put these lines in
/etc/rc.conf to achieve this:firewall_enable="YES"
firewall_type="open"There is one more thing that is necessary. When running IP over
Ethernet, there are actually two Ethernet protocols in use. One
is IP, the other is ARP. ARP is used when a machine must figure out
what Ethernet address corresponds to a given IP address. ARP is not
a part of the IP layer, since it only applies to IP when run over
Ethernet. The standard ipfirewall rule for the open firewall ispass ip from any to anybut what about ARP? If ARP is not passed, no IP traffic can flow at
all. But IPFIREWALL has no provisions for dealing with non-IP
protocols, and that includes ARP. Fortunately, a hackish extension was
made to the ipfirewall code to assist filtering bridges. If you set up
a special rule for UDP packets from IP address
0.0.0.0, the UDP port number will be used
to match the Ethernet protocol number for bridged packets. In this way
your bridge can be configured to pass or reject non-IP protocols. So add
this line just below the two lines near the top of
/etc/rc.firewall that deal with
lo0 (the ones that say that you should almost
never change those two rules).${fwcmd} add allow udp from 0.0.0.0 2054 to 0.0.0.0This rule makes almost no sense at all from a normal perspective on
IPFIREWALL, but the bridge code will use it to pass ARP packets without
restriction (which you almost certainly want to do).Now you should be able to reboot your machine and have it act no
differently than it did before. There will be some new boot messages
about bridging, but the bridging will not be enabled. If there are any
problems, you should try and sort them out at this point before
proceeding.Enabling The BridgeNext, you should do this:&prompt.root; sysctl -w net.link.ether.bridge_ipfw=1
&prompt.root; sysctl -w net.link.ether.bridge=1At this point, the bridge should be enabled, and because of the
previous changes to /etc/rc.conf, the firewall
should be wide open. At this point, you should be able to insert the
machine between two sets of hosts and go back and forth without
difficulty. If so, the next step is to add those two sysctl lines to
/etc/rc.local, or to add the net.link.ether.bridge_ipfw=1 and
net.link.ether.bridge=1 portions of the lines to /etc/sysctl.conf
(which path you take depends on what version of FreeBSD you
have).Now before we started all of this, you should have had a machine
with two Ethernet interfaces, but with only one of them configured. That
is, there should only be one ifconfig line in
/etc/rc.conf. With the bridge in place, that is
still true. But there is a detail that deserves some thought. The
bridge is not in place by default. That means that until the sysctls
are run that turn the bridge on, rather late in the startup, it is still
an ordinary machine with two interfaces, only one of which is configured
by /etc/rc.conf. This becomes important for those
portions of the startup that require network access, say for DNS
resolution. Some care must be taken in picking which interface is going
to be the configured one. In most cases, it is best to pick the
- "outside" one (that is, the interface connected to the Internet). Let's
+ outside one (that is, the interface connected to the Internet). Let's
presume for the sake of the examples to come, that
- fxp0 is the "outside" interface, and
- fxp1 is the "inside" one. That means that fxp0
+ fxp0 is the outside interface, and
+ fxp1 is the inside one. That means that fxp0
should be mentioned in /etc/rc.conf's ifconfig
sections, but fxp1 should not be. The sysctl
that turns the bridge on will make fxp1 start
working automagically.Configuring The FirewallNow it is time to start adding ipfirewall rules to secure the inside
network. There are some complications in doing this because not all of
the ipfirewall functionality is available on bridged packets. Also,
there is a difference between packets that are in the process of being
bridged and packets that are being received by the local machine. In
general, packets being bridged are only run through ipfirewall once, not
twice as is usually the case. Bridged packets are filtered while they
- are being received, so rules that use 'out' or 'xmit' will never match.
- I usually use 'in via' which is an older syntax, but one that makes
+ are being received, so rules that use out or xmit will never match.
+ I usually use in via which is an older syntax, but one that makes
sense as you read it. Another limitation is that you are restricted
- only to 'pass' or 'drop' for filtering bridged packets. Sophisticated
- things like 'divert' or 'forward' or 'reject' are not available. Such
+ only to pass or drop for filtering bridged packets. Sophisticated
+ things like divert or forward or reject are not available. Such
options can still be used, but only on traffic to or from the bridge
machine itself.New in FreeBSD 4.0 is the concept of stateful filtering. This is a
big boost for UDP traffic, which typically is a request going out,
followed shortly thereafter by a response with the exact same set of IP
addresses and port numbers (but with source and dest reversed, of
course). For firewalls that have no statekeeping, there is almost no
way to deal with this sort of traffic short of setting up proxies. But
- a firewall that can "remember" an outgoing UDP packet and for the next
+ a firewall that can remember an outgoing UDP packet and for the next
few minutes allow a response, handling UDP services is trivial. The
example to follow shows how to do this. The truly paranoid can also set
up rules like this to handle TCP. This allows you to avoid some sorts
of denial of service attacks or other nasty tricks, but it also
typically makes your state table mushroom in size.Let's look at an example setup. Note first that at the top of
/etc/rc.firewall we should already have taken care
of the loopback interface and the special hack for ARP should still be
in place. So we will not worry about them any further.us_ip=192.168.1.1
oif=fxp0
iif=fxp1
# Things that we've kept state on before get to go through in a hurry.
${ipfw} add check-state
# Throw away RFC 1918 networks
${ipfw} add deny log ip from 10.0.0.0/8 to any in via ${oif}
${ipfw} add deny log ip from 172.16.0.0/12 to any in via ${oif}
${ipfw} add deny log ip from 192.168.0.0/16 to any in via ${oif}
# Allow the bridge machine to say anything it wants (keep state if UDP)
${ipfw} add pass udp from ${us_ip} to any keep-state
${ipfw} add pass ip from ${us_ip} to any
# Allow the inside net to say anything it wants (keep state if UDP)
${ipfw} add pass udp from any to any in via ${iif} keep-state
${ipfw} add pass ip from any to any in via ${iif}
# Allow all manner of ICMP
${ipfw} add pass icmp from any to any
# TCP section
# established TCP sessions are ok everywhere.
${ipfw} add pass tcp from any to any established
# Pass the "quarantine" range.
${ipfw} add pass tcp from any to any 49152-65535 in via ${oif}
# Pass ident probes. It's better than waiting for them to timeout
${ipfw} add pass tcp from any to any 113 in via ${oif}
# Pass SSH.
${ipfw} add pass tcp from any to any 22 in via ${oif}
# Pass DNS. Only if you have name servers inside.
#${ipfw} add pass tcp from any to any 53 in via ${oif}
# Pass SMTP to the mail server only
${ipfw} add pass tcp from any to mailhost 25 in via ${oif}
# UDP section
# Pass the "quarantine" range.
${ipfw} add pass udp from any to any 49152-65535 in via ${oif}
# Pass DNS. Only if you have name servers inside.
#${ipfw} add pass udp from any to any 53 in via ${oif}
# Everything else is suspect
${ipfw} add deny log ip from any to anyThose of you who have set up firewalls before may notice some things
missing. In particular, there are no anti-spoofing rules. That is,
we did not add:${ipfw} add deny ip from ${us_ip}/24 to any in via ${oif}That is, drop packets claiming to be from our network that are
coming in from the outside. This is something that you would commonly
do to make sure that someone does not try and evade the packet filter by
generating nefarious packets that look like they are from the inside.
The problem with that is that there is at least one host on the outside
- interface that you do not want to ignore -- your router. In my
+ interface that you do not want to ignore—your router. In my
particular case, I have some machines on the outside and some on the
inside, but I do not necessarily want the outside machines to have
routine access to the inside. At the same time, I do not want to throw
their traffic away. In my own case, my ISP anti-spoofs at their router,
so I do not need to bother. And in general, the fewer rules the better,
since it will take time and CPU to process each one.Note also that the last rule is almost an exact duplicate of the
default rule 65535. There are two major differences when it comes to
bridging, however. Our rule logs what it drops, of course, but our rule
will only apply to IP traffic. Apart from the UDP
0.0.0.0 trick there is no way to deal
with non-IP traffic, so the default rule at 65535 will drop ALL traffic,
not merely all non-IP traffic. So the net effect is that unmatched IP
traffic will be logged, but not non-IP traffic. If you want, you can
add options IPFIREWALL_DEFAULT_TO_ACCEPT to your
kernel configuration and
non-IP traffic will be passed instead of dropped. But in the case of a
filtering bridge between you and the Internet, it is unlikely that you
would want to do this (if you are sufficiently paranoid).There is a rule for passing SMTP to a mailhost if you have one.
Obviously the whole ruleset above should be flavored to taste, and
that is an example of a specific service exemption. Note that
- in order for 'mailhost' to work, name service lookups must work
+ in order for mailhost to work, name service lookups must work
BEFORE the bridge is enabled. This is an example of making sure
that you enable the correct interface.Another item to note is that the DNS rules are set up only to
allow DNS servers to work. This means that if you do not set up a
DNS server, you do not need them.Folks used to setting up IP firewalls also probably are used to
- either having a 'reset' or a 'forward' rule for ident packets
+ either having a reset or a forward rule for ident packets
(TCP port 113). Unfortunately, this is not an option with the
bridging code, so the path of least resistance is to simply pass
them to their destination. As long as that destination machine
is not running an ident daemon, this is relatively harmless.
The alternative is dropping port 113 connections, which makes
firing up things like IRC take forever (the ident probe must
time out).The only other thing that is a little weird that you may have noticed
- is that there is a rule to let ${us_ip} speak and a separate rule to
+ is that there is a rule to let ${us_ip} speak and a separate rule to
allow the inside network to speak. Remember that this is because the
two sets of traffic will be taking different paths through the kernel
and into the packet filter. The inside net will be going through the
bridge code. The local machine, however, will be using the normal IP
stack to speak. Thus the two rules to handle the different cases. The
in via ${oif} rules work for both paths. In general if you use in via
rules throughout the filter, you will need to make an exception for
- locally generated packets, because they did not "come in" via
+ locally generated packets, because they did not come in via
anything.ContributorsTo some extent the material for this discussion is a combination of
the items that were discussed by Luigi Rizzo in his Dummynet lecture at
FreeBSDcon '99 and by Mark Murray during his Network Security lecture.
In addition, for quite some time now I have been putting together
filtering bridges for friends and colleagues who were getting DSL
connections for their home.
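As a small aid for the Enabling The Bridge step, the following sketch prints the two settings in the name=value form that /etc/sysctl.conf expects. It assumes the sysctl names shown in this article. The function name bridge_sysctl_lines is hypothetical, and the script only prints the lines rather than changing any system file.

```shell
#!/bin/sh
# Sketch only: bridge_sysctl_lines is a made-up helper name.
# It prints name=value lines for the two sysctls used in this article,
# in the same order as the sysctl -w commands (ipfw hook first, then
# the bridge itself).
bridge_sysctl_lines() {
    for name in net.link.ether.bridge_ipfw net.link.ether.bridge; do
        printf '%s=1\n' "$name"
    done
}

# Review the output before adding it to /etc/sysctl.conf by hand.
bridge_sysctl_lines
```

Printing instead of writing keeps the example harmless on machines that are not meant to act as a bridge.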
diff --git a/en_US.ISO8859-1/articles/freebsd-questions/article.sgml b/en_US.ISO8859-1/articles/freebsd-questions/article.sgml
index a6eb23c0c0..f36c3c3fe6 100644
--- a/en_US.ISO8859-1/articles/freebsd-questions/article.sgml
+++ b/en_US.ISO8859-1/articles/freebsd-questions/article.sgml
@@ -1,564 +1,564 @@
%man;
]>
How to get best results from the FreeBSD-questions mailing
listGregLeheygrog@FreeBSD.org$FreeBSD$This document provides useful information for people looking to
prepare an e-mail to the FreeBSD-questions mailing list. Advice and
hints are given that will maximise the chance that the reader will
receive useful replies.This document is regularly posted to the FreeBSD-questions mailing
list.IntroductionFreeBSD-questions is a mailing list maintained by
the FreeBSD project to help people who have questions about the normal
use of FreeBSD. Another group, FreeBSD-hackers,
discusses more advanced questions such as future development
work.The term hacker has nothing to do with breaking
into other people's computers. The correct term for the latter
activity is cracker, but the popular press has not found
out yet. The FreeBSD hackers disapprove strongly of cracking
security, and have nothing to do with it. For a longer description of
hackers, see Eric Raymond's How To Become
A HackerThis is a regular posting aimed to help both those seeking advice
from FreeBSD-questions (the newcomers), and also those
who answer the questions (the hackers).Inevitably there is some friction, which stems from the different
viewpoints of the two groups. The newcomers accuse the hackers of being
arrogant, stuck-up, and unhelpful, while the hackers accuse the
newcomers of being stupid, unable to read plain English, and expecting
everything to be handed to them on a silver platter. Of course, there is
an element of truth in both these claims, but for the most part these
viewpoints come from a sense of frustration.In this document, I would like to do something to relieve this
frustration and help everybody get better results from
FreeBSD-questions. In the following section, I recommend how to submit
a question; after that, we will look at how to answer one.How to subscribe to FreeBSD-questionsFreeBSD-questions is a mailing list, so you need mail access. Send
a mail message to majordomo@FreeBSD.org with the single
line:subscribe FreeBSD-questionsmajordomo is an automatic program which
maintains the mailing list, so you do not need a subject line. If your
mailer complains, however, you can put anything you like in the subject
line.When you get the reply from majordomo
telling you the details of the list, please save
it. If you ever should want to leave the list, you will need
the information there. See the next section for more details.How to unsubscribe from FreeBSD-questionsWhen you subscribed to FreeBSD-questions, you got a welcome message
from Majordomo@FreeBSD.ORG. In this message, amongst
other things, it told you how to unsubscribe. Here is a typical
message:Welcome to the freebsd-questions mailing list!
If you ever want to remove yourself from this mailing list, you can send
mail to "Majordomo@FreeBSD.ORG" with the following command in the body
of your email message:
unsubscribe freebsd-questions Greg Lehey <grog@lemis.de>
Here's the general information for the list you've subscribed to,
in case you don't already have it:
FREEBSD-QUESTIONS User questions
This is the mailing list for questions about FreeBSD.
You should not send "how to" questions to the technical lists unless
you consider the question to be pretty technical.Normally, unsubscribing is even simpler than the message suggests:
you do not need to specify your mail ID unless it is different from the
one which you specified when you subscribed.If Majordomo replies and tells you (incorrectly) that you are not on
the list, this may mean one of two things:You have changed your mail ID since you subscribed. That is
where keeping the original message from majordomo
comes in handy. For example, the sample message above shows my mail
ID as grog@lemis.de. Since then, I have changed
it to grog@lemis.com. If I were to try to remove
grog@lemis.com from the list, it would fail: I
would have to specify the name with which I joined.You are subscribed to a mailing list which is subscribed to
FreeBSD-questions. If that is the case, you will
have to figure out which one it is and get your name taken off that
one. If you are not sure which one it might be, check the headers of
the messages you receive from freebsd-questions: maybe there is a
clue there.If you have done all this, and you still can not figure out what is going
on, send a message to Postmaster@FreeBSD.org, and he will
sort things out for you. Do not send a message to
FreeBSD-questions: they can not help you.Should I ask -questions or
-hackers?Two mailing lists handle general questions about FreeBSD,
FreeBSD-questions and
FreeBSD-hackers. In some cases, it is not really
clear which group you should ask. The following criteria should help
for 99% of all questions, however:If the question is of a general nature, ask
FreeBSD-questions. Examples might be questions
about installing FreeBSD or the use of a particular UNIX
utility.If you think the question relates to a bug, but you are not sure,
or you do not know how to look for it, send the message to
FreeBSD-questions.If the question relates to a bug, and you are
sure that it is a bug (for example, you can
pinpoint the place in the code where it happens, and you maybe have
a fix), then send the message to
FreeBSD-hackers.If the question relates to enhancements to FreeBSD, and you
can make suggestions about how to implement them, then send the
message to FreeBSD-hackers.There are also a number of other specialized mailing lists, for
example FreeBSD-isp, which caters to the interests of
ISPs (Internet Service Providers) who run FreeBSD. If you happen to be
an ISP, this does not mean you should automatically send your questions
to FreeBSD-isp. The criteria above still apply, and
it is in your interest to stick to them, since you are more likely to get
good results that way.How to submit a questionWhen submitting a question to FreeBSD-questions, consider the
following points:Remember that nobody gets paid for answering a FreeBSD
question. They do it of their own free will. You can influence this
free will positively by submitting a well-formulated question
supplying as much relevant information as possible. You can
influence this free will negatively by submitting an incomplete,
illegible, or rude question. It is perfectly possible to send a
message to FreeBSD-questions and not get an answer even if you
follow these rules. It is much more likely that you will not get an answer if
you do not. In the rest of this document, we will look at how to get
the most out of your question to FreeBSD-questions.Not everybody who answers FreeBSD questions reads every message:
they look at the subject line and decide whether it interests them.
- Clearly, it is in your interest to specify a subject. ``FreeBSD
- problem'' or ``Help'' are not enough. If you provide no subject at
+ Clearly, it is in your interest to specify a subject. FreeBSD
+ problem or Help are not enough. If you provide no subject at
all, many people will not bother reading it. If your subject is not
specific enough, the people who can answer it may not read
it.Format your message so that it is legible, and
PLEASE DO NOT SHOUT!!!!!. We appreciate that a lot of people do not
speak English as their first language, and we try to make
allowances for that, but it is really painful to try to read a
message written full of typos or without any line breaks.Do not underestimate the effect that a poorly formatted mail
message has, not just on the FreeBSD-questions mailing list.
Your mail message is all people see of you, and if it is poorly
formatted, one line per paragraph, badly spelt, or full of
errors, it will give people a poor impression of you.A lot of badly formatted messages come from
bad mailers or badly
configured mailers. The following mailers are known to
send out badly formatted messages without you finding out about
them:
cc:Mail
Eudora
exmh
Microsoft Exchange
Microsoft Internet Mail
Microsoft Outlook
Netscape
As you can see, the mailers in the Microsoft world are frequent
offenders. If at all possible, use a UNIX mailer. If you must use a
mailer under Microsoft environments, make sure it is set up
correctly. Try not to use MIME: a lot of people
use mailers which do not get on very well with
MIME.Make sure your time and time zone are set correctly. This may
seem a little silly, since your message still gets there, but many
of the people you are trying to reach get several hundred messages a
day. They frequently sort the incoming messages by subject and by
date, and if your message does not come before the first answer, they
may assume they missed it and not bother to look.Do not include unrelated questions in the same message. Firstly,
a long message tends to scare people off, and secondly, it is more
difficult to get all the people who can answer all the questions to
read the message.Specify as much information as possible. This is a difficult
area, and we need to expand on what information you need to submit,
but here is a start:In nearly every case, it is important to know the version of
FreeBSD you are running. This is particularly the case for
FreeBSD-CURRENT, where you should also specify the date of the
sources, though of course you should not be sending questions
about -CURRENT to FreeBSD-questions.With any problem which could be
hardware related, tell us about your hardware. In case of
doubt, assume it is possible that it is hardware. What kind of
CPU are you using? How fast? What motherboard? How much
memory? What peripherals?There is a judgement call here, of course, but the output of
the &man.dmesg.8; command can frequently be very useful, since it
tells not just what hardware you are running, but what version of
FreeBSD as well.If you get error messages, do not say I get error
messages, say (for example) I get the error
message 'No route to host'.If your system panics, do not say My system
panicked, say (for example) my system panicked
with the message 'free vnode isn't'.If you have difficulty installing FreeBSD, please tell us
what hardware you have. In particular, it is important to know
the IRQs and I/O addresses of the boards installed in your
machine.If you have difficulty getting PPP to run, describe the
configuration. Which version of PPP do you use? What kind of
authentication do you have? Do you have a static or dynamic IP
address? What kind of messages do you get in the log
file?A lot of the information you need to supply is the output of
programs, such as &man.dmesg.8;, or console messages, which usually
appear in /var/log/messages. Do not try to copy
this information by typing it in again; it is a real pain, and you are
bound to make a mistake. To send log file contents, either make a
copy of the file and use an editor to trim the information to what
is relevant, or cut and paste into your message. For the output of
programs like &man.dmesg.8;, redirect the output to a file and
include that. For example,&prompt.user; dmesg > /tmp/dmesg.outThis redirects the information to the file
/tmp/dmesg.out.If you do all this, and you still do not get an answer, there
could be other reasons. For example, the problem is so complicated
that nobody knows the answer, or the person who does know the answer
was offline. If you do not get an answer after, say, a week, it
might help to re-send the message. If you do not get an answer to
your second message, though, you are probably not going to get one
from this forum. Resending the same message again and again will
only make you unpopular.To summarize, let's assume you know the answer to the following
question (yes, it is the same one in each case).
You choose which of these two questions you would be more prepared to
answer:
Message 1
Subject: HELP!!?!??
I just can't get hits damn silly FereBSD system to
workd, and Im really good at this tsuff, but I have never seen
anythign sho difficult to install, it jst wont work whatever I try
so why don't y9ou guys tell me what I doing wrong.
Message 2
Subject: Problems installing FreeBSD
I've just got the FreeBSD 2.1.5 CDROM from Walnut Creek, and I'm having a lot
of difficulty installing it. I have a 66 MHz 486 with 16 MB of
memory and an Adaptec 1540A SCSI board, a 1.2GB Quantum Fireball
disk and a Toshiba 3501XA CDROM drive. The installation works just
fine, but when I try to reboot the system, I get the message
-``Missing Operating System''.
+Missing Operating System.
How to follow up to a questionOften you will want to send in additional information to a question
you have already sent. The best way to do this is to reply to your
original message. This has three advantages:You include the original message text, so people will know what
you are talking about. Do not forget to trim unnecessary text out,
though.The text in the subject line stays the same (you did remember to
put one in, did you not?). Many mailers will sort messages by
subject. This helps group messages together.The message reference numbers in the header will refer to the
previous message. Some mailers, such as
mutt, can
thread messages, showing the exact
relationships between the messages.How to answer a questionBefore you answer a question to FreeBSD-questions, consider:A lot of the points on submitting questions also apply to
answering questions. Read them.Has somebody already answered the question? The easiest way to
check this is to sort your incoming mail by subject: then
(hopefully) you will see the question followed by any answers, all
together.If somebody has already answered it, it does not automatically
mean that you should not send another answer. But it makes sense to
read all the other answers first.Do you have something to contribute beyond what has already been
said? In general, Yeah, me too answers do not help
much, although there are exceptions, like when somebody is
describing a problem he is having, and he does not know whether it is
his fault or whether there is something wrong with the hardware or
software. If you do send a me too answer, you should
also include any further relevant information.Are you sure you understand the question? Very frequently, the
person who asks the question is confused or does not express himself
very well. Even with the best understanding of the system, it is
easy to send a reply which does not answer the question. This
does not help: you will leave the person who submitted the question
more frustrated or confused than ever. If nobody else answers, and
you are not too sure either, you can always ask for more
information.Are you sure your answer is correct?
If not, wait a day or so. If nobody else comes up with a
better answer, you can still reply and say, for example, I
do not know if this is correct, but since nobody else has
replied, why don't you try replacing your ATAPI CDROM with
a frog?.Unless there is a good reason to do otherwise, reply to the
sender and to FreeBSD-questions. Many people on
FreeBSD-questions are lurkers: they learn by reading
messages sent and replied to by others. If you take a message which
is of general interest off the list, you are depriving these people
of their information. Be careful with group replies; lots of people
send messages with hundreds of CCs. If this is the case, be sure to
trim the Cc: lines appropriately.Include relevant text from the original message. Trim it to the
minimum, but do not overdo it. It should still be possible for
somebody who did not read the original message to understand what
you are talking about.Use some technique to identify which text came from the original
message, and which text you add. I personally find that prepending
> to the original message
works best. Leaving white space after the
> and leaving empty lines
between your text and the original text both make the result more
readable.Put your response in the correct place (after the text to which
it replies). It is very difficult to read a thread of responses
where each reply comes before the text to which it replies.Most mailers change the subject line on a reply by prepending a
text such as Re: . If your mailer does not do it
automatically, you should do it manually.If the submitter did not abide by format conventions (lines too
long, inappropriate subject line), please fix
it. In the case of an incorrect subject line (such as
HELP!!??), change the subject line to (say)
Re: Difficulties with sync PPP (was: HELP!!??). That
way other people trying to follow the thread will have less
difficulty following it.In such cases, it is appropriate to say what you did and why you
did it, but try not to be rude. If you find you can not answer
without being rude, do not answer.If you just want to reply to a message because of its bad
format, just reply to the submitter, not to the list. You can just
send him this message in reply, if you like.
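The reply conventions above (mark the original text with >, put each answer after the text it replies to, and repair a bad subject line) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example and are not part of any mailer.

```python
def quote_original(text):
    """Prefix each line of the original message with '> '.

    Empty lines become a bare '>' so the quoted block stays in one piece.
    """
    return "\n".join(("> " + line) if line else ">" for line in text.splitlines())


def reply_subject(original, corrected=None):
    """Build the subject line for a reply.

    Adds 'Re: ' if it is missing. If a corrected subject is given, the old
    one is kept in a '(was: ...)' note so the thread can still be followed.
    """
    old = original.strip()
    if corrected:
        bare = old[4:] if old.lower().startswith("re: ") else old
        return "Re: {} (was: {})".format(corrected, bare)
    return old if old.lower().startswith("re: ") else "Re: " + old


def compose_reply(parts):
    """Interleave quoted excerpts with answers.

    parts is a list of (excerpt, answer) pairs; each answer is placed
    after the excerpt it replies to, separated by an empty line.
    """
    blocks = []
    for excerpt, answer in parts:
        blocks.append(quote_original(excerpt))
        blocks.append(answer)
    return "\n\n".join(blocks) + "\n"
```

For instance, reply_subject("HELP!!??", "Difficulties with sync PPP") gives the corrected subject line used as the example in the text.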
diff --git a/en_US.ISO8859-1/articles/laptop/article.sgml b/en_US.ISO8859-1/articles/laptop/article.sgml
index 87f4ebe6fb..bac561f779 100644
--- a/en_US.ISO8859-1/articles/laptop/article.sgml
+++ b/en_US.ISO8859-1/articles/laptop/article.sgml
@@ -1,179 +1,179 @@
%man;
%freebsd;
%authors;
%mailing-lists;
]>
FreeBSD on Laptops$FreeBSD$FreeBSD works fine on most laptops, with a few caveats.
Some issues specific to running FreeBSD on laptops, relating
to different hardware requirements from desktops, are
discussed below.FreeBSD is often thought of as a server operating system, but
it works just fine on the desktop, and if you want to use it on
your laptop you can enjoy all the usual benefits: systematic
layout, easy administration and upgrading, the ports/packages
system for adding software, and so on. (Its other benefits,
such as stability, network performance, and performance under
a heavy load, may not be obvious on a laptop, of course.)
However, installing it on laptops often involves problems which
are not encountered on desktop machines and are not commonly
discussed (laptops, even more than desktops, are fine-tuned for
Microsoft Windows). This article aims to discuss some of these
issues.XFree86Recent versions of XFree86 work with most display adapters
available on laptops these days. Acceleration may not be
supported, but a generic SVGA configuration should work.Check your laptop documentation for which card you have,
and check in the XFree86 documentation (or setup program)
to see whether it is specifically supported. If it is not, use
a generic device (do not go for a name which just looks
similar). In XFree86 version 4, you can try your luck
with the command XFree86 -configure
which auto-detects a lot of configurations.The problem often is configuring the monitor. Common
resources for XFree86 focus on CRT monitors; getting a
suitable modeline for an LCD display may be tricky. You may
be lucky and not need to specify a modeline, or just need to
specify suitable HorizSync and VertRefresh ranges. If that
does not work, the best option is to check web resources
devoted to configuring X on laptops (these are often
Linux-oriented sites, but it does not matter because both systems
use XFree86) and copy a modeline posted by someone for similar
hardware.Most laptops come with two buttons on their pointing
devices, which is rather problematic in X (since the middle
button is commonly used to paste text); you can map a
simultaneous left-right click in your X configuration to
a middle button click with the line
Option "Emulate3Buttons"
- in the XF86Config file in the "InputDevice" section (for XFree86
- version 4; for version 3, put just the line "Emulate3Buttons",
- without the quotes, in the "Pointer" section.)
+ in the XF86Config file in the InputDevice section (for XFree86
+ version 4; for version 3, put just the line Emulate3Buttons,
+ without the quotes, in the Pointer section.)
Modems
Laptops usually come with internal (on-board) modems.
- Unfortunately, this almost always means they are "winmodems" whose
+ Unfortunately, this almost always means they are winmodems whose
 functionality is implemented in software, for which only Windows
drivers are normally available (though a few drivers are beginning
to show up for other operating systems). Otherwise, you
need to buy an external modem: the most compact option is
probably a PC-Card (PCMCIA) modem, discussed below, but
serial or USB modems may be cheaper. Generally, regular
modems (non-winmodems) should work fine.
PCMCIA (PC-card) devices Most laptops come with PCMCIA (also called PC-card)
slots; these are supported fine under FreeBSD. Look through
your boot-up messages (using dmesg) and see whether these were
detected correctly (they should appear as
pccard0,
pccard1 etc on devices like
pcic0).FreeBSD currently supports 16-bit PCMCIA cards, but not
- 32-bit ("CardBus") cards. A database of supported cards is in
+ 32-bit (CardBus) cards. A database of supported cards is in
the file /etc/defaults/pccard.conf. Look
through it, and preferably buy cards listed there. Cards not
- listed may also work as "generic" devices: in particular most
+ listed may also work as generic devices: in particular most
modems (16-bit) should work fine, provided they are not
winmodems (these do exist even as PC-cards, so watch out). If
your card is recognised as a generic modem, note that the
default pccard.conf file specifies a delay time of 10 seconds
(to avoid freezes on certain modems); this may well be
over-cautious for your modem, so you may want to play with it,
reducing it or removing it totally.
- Some parts of pccard.conf may need editing. Check the irq
+ Some parts of pccard.conf may need editing. Check the irq
line, and be sure to remove any number already being used: in
particular, if you have an on-board sound card, remove irq 5
(otherwise you may experience hangs when you insert a card).
Check also the available memory slots; if your card is not
being detected, try changing it to one of the other allowed
values (listed in the man page &man.pccardc.8;).
If it is not running already, start the pccardd daemon.
(To enable it at boot time, add
pccard_enable="YES" to
/etc/rc.conf). Now your cards should be
detected when you insert and remove them, and you should get
log messages about new devices being enabled.There have been major changes to the pccard code
(including ISA routing of interrupts, for machines whose
PCIBIOS FreeBSD can not seem to use) before the FreeBSD 4.4
release. If you have problems, try upgrading your system.
Power managementUnfortunately, this is not very reliably supported under
FreeBSD. If you are lucky, some functions may work reliably;
or they may not work at all.To enable this, you may need to compile a kernel with
power management support (device apm0) or
add the option enable apm0 to /boot/loader.conf, and
also enable the apm daemon at boot time (line
apm_enable="YES" in
/etc/rc.conf). The apm commands are
listed in the &man.apm.8; manpage. For instance,
apm -b gives you battery status (or 255 if
not supported), apm -Z puts the laptop on
standby, apm -z (or zzz) suspends it. To
- shutdown and power off the machine, use "shutdown -p".
+ shutdown and power off the machine, use shutdown -p.
Again, some or all of these functions may not work very well
or at all. You may find that laptop suspension/standby works
in console mode but not under X (that is, the screen does not
 come on again); in that case, switch to a virtual console
(using Ctrl-Alt-F1 or another function key) and then execute
the apm command.
The X window system (XFree86) also includes display power
management (look at the &man.xset.1; man page, and search for
dpms there). You may want to investigate this. However, this,
too, works inconsistently on laptops: it
often turns off the display but does not turn off the
backlight.
diff --git a/en_US.ISO8859-1/articles/multi-os/article.sgml b/en_US.ISO8859-1/articles/multi-os/article.sgml
index 8b28f5a6b2..a40b7c76b0 100644
--- a/en_US.ISO8859-1/articles/multi-os/article.sgml
+++ b/en_US.ISO8859-1/articles/multi-os/article.sgml
@@ -1,741 +1,741 @@
Installing and Using FreeBSD With Other Operating SystemsJayRichmondjayrich@sysc.com6 August 1996This document discusses how to make FreeBSD coexist nicely
with other popular operating systems such as Linux, MS-DOS,
OS/2, and Windows 95. Special thanks to: Annelise Anderson
andrsn@stanford.edu, Randall Hopper
rhh@ct.picker.com, and Jordan K. Hubbard
jkh@time.cdrom.comOverviewMost people can not fit these operating systems together
comfortably without having a larger hard disk, so special
information on large EIDE drives is included. Because there are
so many combinations of possible operating systems and hard disk
configurations, the section may be of the
most use to you. It contains descriptions of specific working
computer setups that use multiple operating systems.This document assumes that you have already made room on
your hard disk for an additional operating system. Any time you
repartition your hard drive, you run the risk of destroying the
data on the original partitions. However, if your hard drive is
completely occupied by DOS, you might find the FIPS utility
(included on the FreeBSD CDROM in the
\TOOLS directory or via ftp)
useful. It lets you repartition your hard disk without
destroying the data already on it. There is also a commercial
program available called Partition Magic, which lets you size
and delete partitions without consequence.Overview of Boot ManagersThese are just brief descriptions of some of the different
boot managers you may encounter. Depending on your computer
setup, you may find it useful to use more than one of them on
the same system.Boot EasyThis is the default boot manager used with FreeBSD.
It has the ability to boot most anything, including BSD,
OS/2 (HPFS), Windows 95 (FAT and FAT32), and Linux.
Partitions are selected with the function keys.OS/2 Boot ManagerThis will boot FAT, HPFS, FFS (FreeBSD), and EXT2
(Linux). It will also boot FAT32 partitions. Partitions
are selected using arrow keys. The OS/2 Boot Manager is
the only one to use its own separate partition, unlike the
others which use the master boot record (MBR). Therefore,
it must be installed below the 1024th cylinder to avoid
booting problems. It can boot Linux using LILO when it is
part of the boot sector, not the MBR. Go to Linux
HOWTOs on the World Wide Web for more
information on booting Linux with OS/2's boot
manager.OS-BSThis is an alternative to Boot Easy. It gives you more
control over the booting process, with the ability to set
the default partition to boot and the booting timeout.
The beta version of this program allows you to boot by
selecting the OS with your arrow keys. It is included on
the FreeBSD CD in the \TOOLS
directory, and via ftp.LILO, or LInux LOaderThis is a limited boot manager. It will boot FreeBSD,
though some customization work is required in the LILO
configuration file.About FAT32FAT32 is the replacement for the FAT filesystem included in
Microsoft's OEM SR2 Beta release, which started replacing FAT
on computers pre-loaded with Windows 95 towards the
end of 1996. It converts the normal FAT file system and
allows you to use smaller cluster sizes for larger hard
drives. FAT32 also modifies the traditional FAT boot sector
and allocation table, making it incompatible with some boot
managers.A Typical InstallationLet's say I have two large EIDE hard drives, and I want to
install FreeBSD, Linux, and Windows 95 on them.
Here is how I might do it using these hard disks:
/dev/wd0 (first physical hard disk)
/dev/wd1 (second hard disk)
Both disks have 1416 cylinders.
I boot from an MS-DOS or Windows 95 boot disk that
contains the FDISK.EXE utility and make a small
50 meg primary partition (35-40 for Windows 95, plus a
little breathing room) on the first disk. I also create a
larger partition on the second hard disk for my Windows
applications and data.I reboot and install Windows 95 (easier said than done)
on the C: partition.The next thing I do is install Linux. I am not sure
about all the distributions of Linux, but Slackware includes
LILO (see ). When I am partitioning out
my hard disk with Linux fdisk, I would
put all of Linux on the first drive (maybe 300 megs for a
nice root partition and some swap space).After I install Linux and am prompted about installing
LILO, I make SURE that I install it on the boot sector of my
root Linux partition, not in the MBR (master boot
record).The remaining hard disk space can go to FreeBSD. I also
make sure that my FreeBSD root slice does not go beyond the
1024th cylinder. (The 1024th cylinder is 528 megs into the
disk with our hypothetical 720MB disks). I will use the
rest of the hard drive (about 270 megs) for the
/usr and / slices if I wish. The
rest of the second hard disk (its size depends on the size of the
Windows application/data partition that I created in step
1) can go to the /usr/src slice and swap
space.When viewed with the Windows 95 fdisk
utility, my hard drives should now look something like this:
---------------------------------------------------------------------
Display Partition Information
Current fixed disk drive: 1
Partition   Status   Type              Volume_Label   Mbytes   System   Usage
C: 1        A        PRI DOS                          50       FAT**    7%
   2        A        Non-DOS (Linux)                  300               43%
Total disk space is 696 Mbytes (1 Mbyte = 1048576 bytes)
Press Esc to continue
---------------------------------------------------------------------
Display Partition Information
Current fixed disk drive: 2
Partition   Status   Type              Volume_Label   Mbytes   System   Usage
D: 1        A        PRI DOS                          420      FAT**    60%
Total disk space is 696 Mbytes (1 Mbyte = 1048576 bytes)
Press Esc to continue
---------------------------------------------------------------------
** May say FAT16 or FAT32 if you are using the OEM SR2
update. See ).Install FreeBSD. I make sure to boot with my first hard
disk set at NORMAL in the BIOS. If it is not,
I will have to enter my true disk geometry at boot time (to
get this, boot Windows 95 and consult Microsoft Diagnostics
(MSD.EXE), or check your BIOS) with the
parameter hd0=1416,16,63 where
1416 is the number of cylinders on my hard
disk, 16 is the number of heads per track,
and 63 is the number of sectors per track on
the drive.When partitioning out the hard disk, I make sure to
install Boot Easy on the first disk. I do not worry about
the second disk; nothing boots from it.When I reboot, Boot Easy should recognize my three
bootable partitions as DOS (Windows 95), Linux, and BSD
(FreeBSD).Special ConsiderationsMost operating systems are very picky about where and how
they are placed on the hard disk. Windows 95 and DOS need to be
on the first primary partition on the first hard disk. OS/2 is
the exception. It can be installed on the first or second disk
in a primary or extended partition. If you are not sure, keep
the beginning of the bootable partitions below the 1024th
cylinder.If you install Windows 95 on an existing BSD system, it will
destroy the MBR, and you will have to reinstall your
previous boot manager. Boot Easy can be reinstalled by using
the BOOTINST.EXE utility included in the \TOOLS directory on the
CDROM, and via ftp.
You can also re-start the installation process and go to the
partition editor. From there, mark the FreeBSD partition as
bootable, select Boot Manager, and then type W to (W)rite out
the information to the MBR. You can now reboot, and Boot Easy
should then recognize Windows 95 as DOS.Please keep in mind that OS/2 can read FAT and HPFS
partitions, but not FFS (FreeBSD) or EXT2 (Linux) partitions.
Likewise, Windows 95 can only read and write to FAT and FAT32
(see ) partitions. FreeBSD can read most
file systems, but currently cannot read HPFS partitions. Linux
can read HPFS partitions, but can not write to them. Recent
versions of the Linux kernel (2.x) can read and write to Windows
95 VFAT partitions (VFAT is what gives Windows 95 long file
names - it is pretty much the same as FAT). Linux can read and
write to most file systems. Got that? I hope so.Examples(section needs work, please send your example to
jayrich@sysc.com).FreeBSD+Win95: If you installed FreeBSD after Windows 95,
you should see DOS on the Boot Easy menu. This is
Windows 95. If you installed Windows 95 after FreeBSD, read
above. As long as your hard disk does not
have more than 1024 cylinders you should not have a problem booting. If
one of your partitions goes beyond the 1024th cylinder however,
and you get messages like invalid system disk
under DOS (Windows 95) and FreeBSD will not boot, try looking
for a setting in your BIOS called > 1024 cylinder
support or NORMAL/LBA mode. DOS may need LBA
(Logical Block Addressing) in order to boot correctly. If the
idea of switching BIOS settings every time you boot up does not
appeal to you, you can boot FreeBSD through DOS via the
FBSDBOOT.EXE utility on the CD (It should find your
FreeBSD partition and boot it.)FreeBSD+OS/2+Win95: Nothing new here. OS/2's boot manager
can boot all of these operating systems, so that should not be a
problem.FreeBSD+Linux: You can also use Boot Easy to boot both
operating systems.FreeBSD+Linux+Win95: (see )Other Sources of HelpThere are many Linux
HOW-TOs that deal with multiple operating systems on
the same hard disk.The Linux+DOS+Win95+OS2
mini-HOWTO offers help on configuring the OS/2 boot
manager, and the Linux+FreeBSD
mini-HOWTO might be interesting as well. The Linux-HOWTO
is also helpful.The NT
Loader Hacking Guide provides good information on
multibooting Windows NT, '95, and DOS with other operating
systems.
]]>
- And Hale Landis's "How It Works" document pack contains some
+ And Hale Landis's How It Works document pack contains some
good info on all sorts of disk geometry and booting related
topics. You can find it at
ftp://fission.dt.wdc.com/pub/otherdocs/pc_systems/how_it_works/allhiw.zip.Finally, do not overlook FreeBSD's kernel documentation on
the booting procedure, available in the kernel source
distribution (it unpacks to file:/usr/src/sys/i386/boot/biosboot/README.386BSD).Technical Details(Contributed by Randall Hopper,
rhh@ct.picker.com)This section attempts to give you enough basic information
about your hard disks and the disk booting process so that you
can troubleshoot most problems you might encounter when getting
set up to boot several operating systems. It starts in pretty
basic terms, so you may want to skim down in this section until
it begins to look unfamiliar and then start reading.Disk PrimerThree fundamental terms are used to describe the location
of data on your hard disk: Cylinders, Heads, and Sectors.
It is not particularly important to know what these terms
relate to except to know that, together, they identify where
data is physically on your disk.Your disk has a particular number of cylinders, number of
heads, and number of sectors per cylinder-head (a
cylinder-head is also known as a track). Collectively this
- information defines the "physical disk geometry" for your hard
+ information defines the physical disk geometry for your hard
disk. There are typically 512 bytes per sector, and 63
sectors per track, with the number of cylinders and heads
varying widely from disk to disk. Thus you can figure the
number of bytes of data that will fit on your own disk by
calculating:(# of cylinders) × (# heads) × (63
sectors/track) × (512 bytes/sect)For example, on my 1.6 Gig Western Digital AC31600 EIDE hard
disk, that is:(3148 cyl) × (16 heads) × (63
sectors/track) × (512 bytes/sect)which is 1,624,670,208 bytes, or around 1.6 Gig.You can find out the physical disk geometry (number of
cylinders, heads, and sectors/track counts) for your hard
disks using ATAID or other programs off the net. Your hard
disk probably came with this information as well. Be careful
though: if you are using BIOS LBA (see ), you can not use just any program to get
the physical geometry. This is because many programs (e.g.
MSD.EXE or FreeBSD fdisk) do not identify the
physical disk geometry; they instead report the
translated geometry (virtual numbers from using
LBA). Stay tuned for what that means.One other useful thing about these terms. Given 3
numbers—a cylinder number, a head number, and a
sector-within-track number—you identify a specific
absolute sector (a 512 byte block of data) on your disk.
Cylinders and Heads are numbered up from 0, and Sectors are
numbered up from 1.For those that are interested in more technical details,
information on disk geometry, boot sectors, BIOSes, etc. can
be found all over the net. Query Lycos, Yahoo, etc. for
boot sector or master boot record.
Among the useful info you will find are Hale Landis's
How It Works document pack. See the section for a few pointers to this
pack.Ok, enough terminology. We are talking about booting
here.The Booting ProcessOn the first sector of your disk (Cyl 0, Head 0, Sector 1)
lives the Master Boot Record (MBR). It contains a map of your
disk. It identifies up to 4 partitions, each of
which is a contiguous chunk of that disk. FreeBSD calls
partitions slices to avoid confusion with its
own partitions, but we will not do that here. Each partition can
contain its own operating system.Each partition entry in the MBR has a Partition
ID, a Start Cylinder/Head/Sector, and an
End Cylinder/Head/Sector. The Partition ID
tells what type of partition it is (what OS) and the Start/End
tells where it is. lists a
smattering of some common Partition IDs.
Partition IDs
ID (hex)   Description
01         Primary DOS12 (12-bit FAT)
04         Primary DOS16 (16-bit FAT)
05         Extended DOS
06         Primary big DOS (> 32MB)
0A         OS/2
83         Linux (EXT2FS)
A5         FreeBSD, NetBSD, 386BSD (UFS)
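The geometry arithmetic and the partition entries described above can be tried out with a short Python sketch. It is a minimal illustration, not a replacement for fdisk: it assumes the conventional PC MBR layout (partition table at byte offset 446, four 16-byte entries, 0x55AA signature at the end of the sector, cylinder/head/sector packed into three bytes), which this article does not spell out, and all function names are made up for the example.

```python
import struct

# Partition IDs from the table above (hex ID -> description).
PARTITION_IDS = {
    0x01: "Primary DOS12 (12-bit FAT)",
    0x04: "Primary DOS16 (16-bit FAT)",
    0x05: "Extended DOS",
    0x06: "Primary big DOS (> 32MB)",
    0x0A: "OS/2",
    0x83: "Linux (EXT2FS)",
    0xA5: "FreeBSD, NetBSD, 386BSD (UFS)",
}

SECTOR_SIZE = 512    # bytes per sector
TABLE_OFFSET = 446   # assumed: start of the partition table in the MBR
ENTRY_SIZE = 16      # assumed: size of one partition entry


def capacity_bytes(cylinders, heads, sectors_per_track=63):
    """Total bytes on a disk with the given geometry."""
    return cylinders * heads * sectors_per_track * SECTOR_SIZE


def chs_to_absolute(cyl, head, sector, heads, sectors_per_track=63):
    """Absolute sector number (counted from 0) for a C/H/S triple.

    Cylinders and heads are numbered from 0, sectors from 1.
    """
    return (cyl * heads + head) * sectors_per_track + (sector - 1)


def unpack_chs(raw):
    """Decode the 3-byte packed C/H/S field of a partition entry."""
    head, sec_cyl, cyl_low = raw
    sector = sec_cyl & 0x3F                        # low 6 bits
    cylinder = ((sec_cyl & 0xC0) << 2) | cyl_low   # 10 bits, so at most 1023
    return cylinder, head, sector


def parse_mbr(mbr):
    """Return the non-empty partition entries of a 512-byte MBR sector."""
    if len(mbr) != SECTOR_SIZE or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    entries = []
    for i in range(4):
        begin = TABLE_OFFSET + i * ENTRY_SIZE
        raw = mbr[begin:begin + ENTRY_SIZE]
        # status, start C/H/S, partition ID, end C/H/S, two 32-bit fields
        status, start, ptype, end, _lba, _count = struct.unpack("<B3sB3sII", raw)
        if ptype == 0:          # unused slot
            continue
        entries.append({
            "active": status == 0x80,
            "id": ptype,
            "description": PARTITION_IDS.get(ptype, "unknown"),
            "start": unpack_chs(start),
            "end": unpack_chs(end),
        })
    return entries
```

With these definitions, capacity_bytes(3148, 16) reproduces the 1,624,670,208 bytes worked out for the AC31600, and capacity_bytes(1024, 16) gives 528,482,304 bytes, the "528 megs" mark of the 1024th cylinder. The 10-bit cylinder field in unpack_chs is also why a partition entry cannot describe a start beyond cylinder 1023.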
Characteristics of Two Spindles Organized with Vinum
Organization               Total Capacity                            Failure Resilient  Peak Read Performance  Peak Write Performance
Concatenated Plexes        Unchanged, but appears as a single drive  No                 Unchanged              Unchanged
Striped Plexes (RAID-0)    Unchanged, but appears as a single drive  No                 2x                     2x
Mirrored Volumes (RAID-1)  1/2, appearing as a single drive          Yes                2x                     Unchanged
shows that striping yields
the same capacity and lack of failure resilience
as concatenation, but it has better peak read and write performance.
Hence we will not be using concatenation in any of the examples here.
Mirrored volumes provide the benefits of improved peak read performance
and failure resilience--but this comes at a loss in capacity.Both concatenation and striping bring their benefits over a
single spindle at the cost of increased likelihood of failure since
more than one spindle is now involved.When three or more spindles are present,
Vinum also supports rotated,
block-interleaved parity (also called RAID-5)
that provides better
capacity than mirroring (but not quite as good as striping), better
read performance than both mirroring and striping,
and good failure resilience.
There is, however,
a substantial decrease in write performance with RAID-5.
Most of the benefits become more pronounced with five or more
spindles.The organizations described above may be combined to provide
benefits that no single organization can match.
For example, mirroring and striping can be combined to provide
failure-resilience with very fast read performance.Vinum HistoryVinum
is a standard part of even a "minimum" FreeBSD distribution and
it has been standard since 3.0-RELEASE.
The official pronunciation of the name is
VEE-noom.&vinum.ap; was inspired by the Veritas Volume Manager, but
was not derived from it.
The name is a play on that history and the Latin adage
In Vino Veritas
(Vino is the ablative form of
Vinum).
- Literally translated, that is "Truth lies in wine" hinting that
+ Literally translated, that is Truth lies in wine hinting that
drunkards have a hard time lying.
I have been using it in production on six different servers for
over two years with no data loss.
Like the rest of FreeBSD, Vinum
- provides "rock-stable performance."
+ provides rock-stable performance.
(On a personal note, I have seen Vinum
panic when I misconfigured something, but I have
never had any trouble in normal operation.)
Greg Lehey wrote
Vinum for FreeBSD,
but he is seeking
help in porting it to NetBSD and OpenBSD.

Just like the rest of FreeBSD, Vinum
is undergoing continuous
development.
Several subtle but significant bugs have been fixed in recent
releases.
It is always best to use the most recent code base that meets your
stability requirements.

Vinum Deployment Strategy

Vinum,
coupled with prudent partition management, lets you
- keep "warm-spare" spindles on-line so that failures
+ keep warm-spare spindles on-line so that failures
are transparent to users. Failed spindles can be replaced
during regular maintenance periods or whenever it is convenient.
When all spindles are working, the server benefits from increased
performance and capacity.

Having redundant copies of your home directory does not
help you if the spindle holding root,
/usr, or swap fails on your server.
Hence I focus here on building a simple
foundation for a failure-resilient server covering the root,
/usr,
/home, and swap partitions.

Vinum
mirroring does not remove the need for making backups!
Mirroring cannot help you recover from site disasters
or the dreaded
rm -r -f / command.

Why Bootstrap Vinum?

It is possible to add Vinum
to a server configuration after
it is already in production use, but this is much harder than
designing for it from the start. Ironically,
Vinum is not supported by
/stand/sysinstall
and hence you cannot install
/usr right onto a
Vinum volume.

Vinum currently does not
support the root file system (this feature
is in development).

Hence it is a bit
tricky to get started using
Vinum, but these instructions
take you through the process of planning for
Vinum, installing FreeBSD
without it, and then beginning to use it.
- I have come to call this whole process "bootstrapping Vinum."
+ I have come to call this whole process bootstrapping Vinum.
That is, the process of getting Vinum
initially installed
and operating to the point where you have met your resilience
or performance goals. My purpose here is to document a
Vinum
bootstrapping method that I have found works well for me.

Vinum Benefits

The server foundation scenario I have chosen here allows me
to show you examples of configuring for resilience on
/usr and
/home.
Yet Vinum
provides benefits other than resilience--namely
performance, capacity, and manageability.
It can significantly improve disk performance (especially
under multi-user loads).
Vinum
can easily concatenate many smaller disks to produce the
illusion of a single larger disk (but my server foundation
scenario does not allow me to illustrate these benefits here).

For servers with many spindles, Vinum
provides substantial
benefits in volume management, particularly when coupled with
hot-pluggable hardware. Data can be moved from spindle to
spindle while the system is running without loss of production
time. Again, details of this will not be given here, but once
you get your feet wet with Vinum,
other documentation will help you do things like this.
See
"The Vinum
Volume Manager" for a technical introduction to
Vinum,
&man.vinum.8; for a description of the vinum
command, and
&man.vinum.4;
for a description of the vinum device
driver and the way Vinum
objects are named.

Breaking up your disk space into smaller and smaller partitions
- has the benefit of allowing you to "tune" for the most common
- type of access and tends to keep disk hogs "within their pens."
+ has the benefit of allowing you to tune for the most common
+ type of access and tends to keep disk hogs within their pens.
However it also causes some loss in total available disk space
due to fragmentation.

Server Operation in Degraded Mode

Some disk failures in this two-spindle scenario will result in
Vinum
automatically routing
all disk I/O to the remaining good spindle.
Others will require brief manual intervention on the console
to configure the server for degraded mode operation and a quick reboot.
Other than actual hardware repairs, most recovery work
can be done while the server is running in multi-user degraded
mode so there is as little production impact
from failures as possible.

I give the instructions needed to
configure the server for degraded mode operation
in those cases where Vinum
cannot do it automatically.
I also give the instructions needed to
return to normal operation once the failed hardware is repaired.
You might call these instructions Vinum
failure recovery techniques.

I recommend practicing these instructions
by recovering from simulated failures.
For each failure scenario, I also give tips below for simulating
a failure even when your hardware is working well.
Even a minimum Vinum
system as described
below can be a good place to experiment with
recovery techniques without impacting production equipment.

Hardware RAID vs. Vinum (Software RAID)

Manual intervention is sometimes required to configure a server for
degraded mode because
Vinum
is implemented in software that runs after the FreeBSD
kernel is loaded. One disadvantage of such
software RAID
solutions is that there is nothing that can be done to hide spindle
failures from the BIOS or the FreeBSD boot sequence. Hence
the manual reconfiguration of the server
for degraded operation mentioned
above just informs the BIOS and boot sequence of failed
spindles.
Hardware RAID solutions generally have an
advantage in that they require no such reconfiguration since
spindle failures are hidden from the BIOS and boot sequence.

Hardware RAID, however, may have some disadvantages that can
be significant in some cases:
The hardware RAID controller itself may become a single
point of failure for the system.
The data is usually kept in a proprietary
format so that a disk drive cannot be simply plugged
into another main board and booted.
You often cannot mix and
match drives with different sizes and interfaces.
You are often limited to the number of drives supported by the
hardware RAID controller (often only four or eight).
In other words, &vinum.ap; may offer advantages in that
there is no single point of failure,
the drives can boot on most any main board, and
you are free to mix and match as many drives using
whatever interface you choose.

Keep your kernel fairly generic (or at least keep
/kernel.GENERIC around).
This will improve the chances that you can come back up on
- "foreign" hardware more quickly.
+ foreign hardware more quickly.
The pros and cons discussed above suggest
that the root file system and swap partition are good
candidates for hardware RAID if available.
This is especially true for servers where it is difficult for
administrators to get console access (recall that this is sometimes
required to configure a server for degraded mode operation).
A server with only software RAID is well suited to office and home
environments where an administrator can be close at hand.

A common myth is that hardware RAID is always faster
than software RAID.
Since it runs on the host CPU, Vinum
often has more CPU power and memory available than a
dedicated RAID controller would have.
If performance is a prime concern, it is best to benchmark
your application running on your CPU with your spindles using
both hardware and software RAID systems before making
a decision.

Hardware for Vinum

These instructions may be timely since commodity PC hardware
can now easily host several hundred gigabytes of reasonably
high-performance disk space at a low price. Many disk
drive manufacturers now sell 7,200 RPM disk drives with quite
low seek times and high transfer rates through ATA-100
interfaces, all at very attractive prices. Four such drives,
attached to a suitable main board and configured with
Vinum
and prudent partitioning, yield a failure-resilient, high
performance disk server at a very reasonable cost.

However, you can indeed get started with
Vinum very simply.
A minimum system can be as simple as
an old CPU (even a 486 is fine) and a pair of drives
that are 500 MB or more. They need not be the same size or
even use the same interface (i.e., it is fine to mix ATAPI and
SCSI). So get busy and give this a try today! You will have
the foundation of a failure-resilient server running in an
hour or so!

Bootstrapping Phases

Greg Lehey suggested this bootstrapping method.
It uses knowledge of how Vinum
internally allocates disk space to avoid copying data.
Instead, Vinum
objects are configured so that they occupy the
same disk space where /stand/sysinstall built
file systems.
The file systems are thus embedded within
Vinum objects without copying.

There are several distinct phases to the
Vinum bootstrapping
procedure. Each of these phases is presented in a separate section below.
The section starts with a general overview of the phase and its goals.
It then gives example steps for the two-spindle scenario
presented here and advice on how to adapt them for your server.
(If you are reading for a general understanding
of Vinum
bootstrapping, the example sections for each phase
can safely be skipped.)
The remainder of this section gives
an overview of the entire bootstrapping process.

Phase 1 involves planning and preparation.
We will balance requirements
for the server against available resources and make design
tradeoffs.
We will plan the transition from no
Vinum to
Vinum
on just one spindle, to Vinum
on two spindles.

In phase 2, we will install a minimum FreeBSD system on a
single spindle using partitions of type
4.2BSD (regular UFS file systems).

Phase 3 will embed the non-root file systems from phase 2 in
Vinum objects.
Note that Vinum will be up and
running at this point,
but it cannot yet provide any resilience since it only has
one spindle on which to store data.

Finally in phase 4, we configure Vinum
on a second spindle and make a backup copy of the root file system.
This will give us resilience on all file systems.

Bootstrapping Phase 1: Planning and Preparation

Our goal in this phase is to define the different partitions
we will need and examine their requirements.
We will also look at available disk drives and controllers and allocate
partitions to them.
Finally, we will determine the size of
each partition and its use during the bootstrapping process.
After this planning is complete, we can optionally prepare to use some
tools that will make bootstrapping Vinum
easier.

Several key questions must be answered in this
planning phase:
What file systems and partitions will be needed?
How will they be used?
How will we name each spindle?
How will the partitions be ordered for each spindle?
How will partitions be assigned to the spindles?
How will partitions be configured? Resilience or performance?
What technique will be used to achieve resilience?
What spindles will be used?
How will they be configured on the available controllers?
How much space is required for each partition?

Phase 1 Example

In this example, I will assume a scenario
where we are building
a minimal foundation for a failure-resilient server.
Hence we will need at least root,
/usr,
/home,
and swap partitions.
The root,
/usr, and
/home file systems all need resilience since the
server will not be much good without them.
The swap partition needs performance first and
generally does
not need resilience since nothing it holds needs to be retained
across a reboot.

Spindle Naming

The kernel would refer to the master spindle on
the primary and secondary ATA controllers as
/dev/ad0 and
/dev/ad2 respectively.
This assumes that you have not removed the line
options ATA_STATIC_ID
from your kernel configuration.
But Vinum
also needs to have a name for each spindle
that will stay the same regardless
of how it is attached to the CPU (i.e., if the drive moves, the
Vinum name moves with the drive).

Some recovery techniques documented below suggest
moving a spindle from
the secondary ATA controller to the primary ATA controller.
(Indeed, the flexibility of making such moves is a key benefit
of Vinum
especially if you are managing a large number of spindles.)
After such a drive/controller swap,
the kernel will see what used to be
/dev/ad2 as
/dev/ad0
but Vinum
will still call
it by whatever name it had when it was attached to
/dev/ad2
- (i.e., when it was "created" or first made known to
+ (i.e., when it was created or first made known to
Vinum).

Since connections can change, it is best to give
each spindle a unique, abstract
name that gives no hint of how it is attached.
Avoid names that suggest a manufacturer, model number,
physical location, or membership in a sequence
(e.g. avoid names like
upper, lower, etc.,
alpha, beta, etc.,
SCSI1, SCSI2, etc., or
Seagate1, Seagate2 etc.).
Such names are likely to lose their uniqueness or
get out of sequence
someday even if they seem like great names today.

Once you have picked names for your spindles,
label them with a permanent marker.
If you have hot-swappable hardware, write the names on the sleds
in which the spindles are mounted.
This will significantly reduce the likelihood of
error when you are moving spindles around later as
part of failure recovery or routine system management
procedures.

In the instructions that follow,
Vinum
will name the root spindle YouCrazy
and the rootback spindle UpWindow.
I will only use /dev/ad0
when I want to refer to whichever
of the two spindles is currently attached as
/dev/ad0.

Partition Ordering

Modern disk drives operate with fairly uniform areal
density across the surface of the disk.
That implies that more data is available under the heads without
seeking on the outer cylinders than on the inner cylinders.
We will allocate partitions most critical to system performance
from these outer cylinders as
/stand/sysinstall generally does.

The root file system is traditionally the outermost, even though
it generally is not as critical to system performance as others.
(However root can have a larger impact on performance if it contains
/tmp and /var as it
does in this example.)
The FreeBSD boot loaders assume that the
root file system lives in the a partition.
There is no requirement that the a
partition start on the outermost cylinders, but this
convention makes it easier to manage disk labels.

Swap performance is critical so it comes next on our way toward
the center.
I/O operations here tend to be large and contiguous.
Having as much data under the heads as possible avoids seeking
while swapping.

With all the smaller partitions out of the way, we finish
up the disk with
/home and
/usr.
Access patterns here tend not to be as intense as for other
file systems (especially if there is an abundant supply of RAM
and read cache hit rates are high).

If the pair of spindles you have is large enough to allow
for more than
/home and
/usr,
it is fine to plan for additional file systems here.

Assigning Partitions to Spindles

We will want to assign
partitions to these spindles so that either can fail
without loss of data on file systems configured for
resilience.

Reliability on
/usr and
/home
is best achieved using Vinum
mirroring.
Resilience will have to come differently, however, for the root
file system since Vinum
is not a part of the FreeBSD boot sequence.
Here we will have to settle for two identical
partitions with a periodic copy from the primary to the
backup secondary.

The kernel already has support for interleaved swap across
all available partitions so there is no need for help from
Vinum here.
/stand/sysinstall
will automatically configure /etc/fstab
for all swap partitions given.

The &vinum.ap; bootstrapping method given below
requires a pair of spindles that I will call the
root spindle and the
rootback spindle.

The rootback spindle must be the same size or
larger than the root spindle.

These instructions first allocate all space on the root
spindle and then allocate exactly that amount of space on
a rootback spindle.
(After &vinum.ap; is bootstrapped, there is nothing special
about either of these spindles--they are interchangeable.)
You can later use the remaining space on the rootback spindle for
other file systems.

If you have more than two spindles, the
bootvinum Perl script and the procedure
below will help you initialize them for use with &vinum.ap;.
However you will have to figure out how to assign partitions
to them on your own.

Assigning Space to Partitions

For this example, I will use two spindles: one with
4,124,673 blocks (about 2 GB) on /dev/ad0
and one with 8,420,769 blocks (about 4 GB) on
/dev/ad2.

It is best to configure your two spindles on separate
controllers so that both can operate in parallel and
so that you will have failure resilience in case a
controller dies.
Note that mirrored volume write performance will be halved
in cases where both spindles share a controller that requires
they operate serially (as is often the case with ATA controllers).
One spindle will be the master on the primary ATA
controller and the other will be the master on the
secondary ATA controller.

Recall that we will be allocating space on the smaller
spindle first and the larger spindle second.

Assigning Partitions on the Root Spindle

We will allocate 200,000 blocks (about 98 MB)
for a root file system on each spindle
(/dev/ad0s1a and
/dev/ad2s1a).
We will initially allocate 200,265 blocks for a swap partition
on each spindle,
giving a total of about 196 MB of
swap space (/dev/ad0s1b and
/dev/ad2s1b).

We will lose 265 blocks from each swap partition
as part of the bootstrapping process.
This is the size of the space used by
Vinum to store configuration
information.
The space will be taken from swap and given to a vinum
partition but will be unavailable for
Vinum subdisks.

I have done the partition allocation in nice round
numbers of blocks just to emphasize where the 265 blocks go.
There is nothing wrong with allocating space in MB if that is
more convenient for you.

This leaves 4,124,673 - 200,000 - 200,265 = 3,724,408 blocks
(about 1,818 MB) on the root spindle for
Vinum
partitions (/dev/ad0s1e and
/dev/ad2s1f).
From this, allocate the 265 blocks for
Vinum configuration information,
1,000,000 blocks (about 488 MB)
for /home, and the remaining
2,724,408 blocks (about 1,330 MB) for
/usr.
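The arithmetic above can be checked with a few lines of shell. This is a sketch only: the variable names are mine, and I assume 512-byte blocks (2,048 blocks per MB), which is what the MB figures in the text imply.

```sh
#!/bin/sh
# Block counts from the example; 512-byte blocks assumed (2048 per MB).
spindle=4124673      # total blocks on the root spindle
root=200000          # root file system
swap=200265          # swap as first allocated, including 265 spare blocks
vinum_cfg=265        # blocks Vinum keeps for its configuration
home=1000000         # /home

rest=$(( spindle - root - swap ))   # space left for /home and /usr
usr=$(( rest - home ))              # /usr takes whatever /home leaves

echo "left after root and swap: $rest blocks ($(( rest / 2048 )) MB)"
echo "/home: $home blocks ($(( home / 2048 )) MB)"
echo "/usr:  $usr blocks ($(( usr / 2048 )) MB)"
# After phase 3, swap shrinks by the size of the configuration area.
echo "swap after Vinum: $(( swap - vinum_cfg )) blocks"
```

Changing spindle and home to match your own hardware gives the numbers you will need to type into /stand/sysinstall.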
See below to see this graphically.

The left-hand side of the figure
below shows what spindle ad0 will
look like at the end of phase 2.
The right-hand side shows what it will look like at the
end of phase 3.

Spindle ad0 Before and After Vinum

   ad0 Before Vinum      Offset (blocks)      ad0 After Vinum
+----------------------+ <--      0--> +----------------------+
| root                 |               | root                 |
| /dev/ad0s1a          |               | /dev/ad0s1a          |
+----------------------+ <-- 200000--> +----------------------+
| swap                 |               | swap                 |
| /dev/ad0s1b          |               | /dev/ad0s1b          |
|                      |     400000--> +----------------------+
|                      |               | Vinum drive YouCrazy |
|                      |               | /dev/ad0s1h          |
+----------------------+ <-- 400265--> +-----------------+    |
| /home                |               | Vinum sd        |    |
| /dev/ad0s1e          |               | home.p0.s0      |    |
+----------------------+ <--1400265--> +-----------------+    |
| /usr                 |               | Vinum sd        |    |
| /dev/ad0s1f          |               | usr.p0.s0       |    |
+----------------------+ <--4124673--> +-----------------+----+
                         Not to scale

Assigning Partitions on the Rootback Spindle

The /rootback and swap partition sizes
on the rootback spindle must
match the root and swap partition sizes on the root spindle.
That leaves 8,420,769 - 200,000 - 200,265 = 8,020,504
blocks for the Vinum partition.
Mirrors of /home and
/usr receive the same allocation as on
the root spindle.
That will leave an extra 2 GB or so that we can deal
with later.
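A quick sketch of the same arithmetic for the rootback spindle follows. The variable names are mine, and the MB figure again assumes 512-byte blocks. Because the rootback spindle repeats the root spindle's layout exactly, the space left over is simply the difference between the two spindle sizes.

```sh
#!/bin/sh
# Block counts from the example; 512-byte blocks assumed (2048 per MB).
rootback_spindle=8420769   # total blocks on the larger spindle
root_spindle=4124673       # total blocks on the smaller spindle
root=200000                # /rootback matches the size of root
swap=200265                # swap matches the root spindle's swap

vinum_part=$(( rootback_spindle - root - swap ))
spare=$(( rootback_spindle - root_spindle ))

echo "Vinum partition: $vinum_part blocks"
echo "left over after mirroring: $spare blocks ($(( spare / 2048 )) MB)"
```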
See below to see this graphically.

The left-hand side of the figure
below shows what spindle ad2 will
look like at the beginning of phase 4.
The right-hand side shows what it will look like at the end.

Spindle ad2 Before and After Vinum

   ad2 Before Vinum      Offset (blocks)      ad2 After Vinum
+----------------------+ <--      0--> +----------------------+
| /rootback            |               | /rootback            |
| /dev/ad2s1e          |               | /dev/ad2s1a          |
+----------------------+ <-- 200000--> +----------------------+
| swap                 |               | swap                 |
| /dev/ad2s1b          |               | /dev/ad2s1b          |
|                      |     400000--> +----------------------+
|                      |               | Vinum drive UpWindow |
|                      |               | /dev/ad2s1h          |
+----------------------+ <-- 400265--> +-----------------+    |
| /NOFUTURE            |               | Vinum sd        |    |
| /dev/ad2s1f          |               | home.p1.s0      |    |
|                      |    1400265--> +-----------------+    |
|                      |               | Vinum sd        |    |
|                      |               | usr.p1.s0       |    |
|                      |    4124673--> +-----------------+    |
|                      |               | Vinum sd        |    |
|                      |               | hope.p0.s0      |    |
+----------------------+ <--8420769--> +-----------------+----+
                         Not to scale

Preparation of Tools

The bootvinum Perl script given below
will make the
Vinum bootstrapping process much
easier if you can run it on the machine being bootstrapped.
It is over 200 lines and you would not want to type it in.
At this point, I recommend that you
copy it to a floppy or arrange some
alternative method of making it readily available
so that it can be available later when needed.
For example:

&prompt.root; fdformat -f 1440 /dev/fd0
&prompt.root; newfs_msdos -f 1440 /dev/fd0
&prompt.root; mount -t msdos /dev/fd0 /mnt
&prompt.root; cp /usr/share/examples/vinum/bootvinum /mnt

XXX Someday, I would like this script to live in
/usr/share/examples/vinum.
Till then, please use this
link
to get a copy.

Bootstrapping Phase 2: Minimal OS Installation

Our goal in this phase is to complete the smallest possible
FreeBSD installation in such a way that we can later install
Vinum.
We will use only
partitions of type 4.2BSD (i.e., regular UFS file
systems) since that is the only type supported by
/stand/sysinstall.

Phase 2 Example

Start up the FreeBSD installation process by running
/stand/sysinstall from
installation media as you normally would.

Fdisk partition all spindles as needed.

Make sure to select BootMgr for all spindles.

Partition the root spindle with appropriate block
allocations as described above.
For this example on a 2 GB spindle, I will use
200,000 blocks for root, 200,265 blocks for swap,
1,000,000 blocks for /home, and
the rest of the spindle (2,724,408 blocks) for
/usr.
(/stand/sysinstall
should automatically assign these to
/dev/ad0s1a,
/dev/ad0s1b,
/dev/ad0s1e, and
/dev/ad0s1f
by default.)

If you prefer soft updates as I do and you are
using 4.4-RELEASE or better, this is a good time to enable
them.

Partition the rootback spindle with the appropriate block
allocations as described above.
For this example on a 4 GB spindle, I will use
200,000 blocks for /rootback,
200,265 blocks for swap, and
the rest of the spindle (8,020,504 blocks) for
/NOFUTURE.
(/stand/sysinstall
should automatically assign these to
/dev/ad2s1e,
/dev/ad2s1b, and
/dev/ad2s1f by default.)

We do not really want to have a
/NOFUTURE UFS file system (we
want a vinum partition instead), but that is the
best choice we have for the space given the limitations of
/stand/sysinstall.
Mount point names beginning with NOFUTURE
and rootback
serve as sentinels to the bootstrapping
script presented below.

Partition any other spindles with swap if desired and a
single /NOFUTURExx file system.

Select a minimum system install for now even if you
want to end up with more distributions loaded later.

Do not worry about system configuration options at this
point--get Vinum
set up and get the partitions in
the right places first.

Exit /stand/sysinstall and reboot.
Do a quick test to verify that the minimum
installation was successful.

The left-hand sides of the two figures above
show how the disks will look at this point.

Bootstrapping Phase 3: Root Spindle Setup

Our goal in this phase is to get Vinum
set up and running on the
root spindle.
We will embed the existing
/usr and
/home file systems in a
Vinum partition.
Note that the Vinum
volumes created will not yet be
failure-resilient since we have
only one underlying Vinum
drive to hold them.
The resulting system will automatically start
Vinum as it boots to multi-user mode.

Phase 3 Example

Log in as root.

We will need a directory in the root file system in
which to keep a few files that will be used in the
Vinum
bootstrapping process.

&prompt.root; mkdir /bootvinum
&prompt.root; cd /bootvinum

Several files need to be prepared for use in bootstrapping.
I have written a Perl script that makes all the required
files for you.
Copy this script to /bootvinum by
floppy disk, tape, network, or any convenient means and
then run it.
(If you cannot get this script copied onto the machine being
bootstrapped, then see
below for a manual alternative.)

&prompt.root; cp /mnt/bootvinum .
&prompt.root; ./bootvinum

bootvinum produces no output
when run successfully.
If you get any errors,
something may have gone wrong when you were creating
partitions with
/stand/sysinstall above.

Running bootvinum will:
Create /etc/fstab.vinum
based on what it finds
in your existing /etc/fstab
Create new disk labels for each spindle mentioned
in /etc/fstab and keep copies of the
current disk labels
Create files needed as input to vinum
for building
Vinum objects on each spindle
Create many alternates to /etc/fstab.vinum
that might come in handy should a spindle fail
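To give a feel for the kind of input vinum expects, here is a hypothetical sketch that writes a configuration for the root spindle from the block counts used in this example. It is not the actual output of bootvinum: the output file name and the driveoffset values are my assumptions, derived from the offsets in the figure above, with the 265-block configuration area at the start of the Vinum drive.

```sh
#!/bin/sh
# Hypothetical sketch: write a Vinum configuration for the root spindle.
# The file name and driveoffset values are assumptions, not the real
# output of bootvinum.
cfg=265          # blocks reserved at the start of the Vinum drive
home=1000000     # blocks in /home
usr=2724408      # blocks in /usr
out=${TMPDIR:-/tmp}/create.YouCrazy.example

cat > "$out" <<EOF
drive YouCrazy device /dev/ad0s1h
volume home
plex name home.p0 org concat volume home
sd name home.p0.s0 drive YouCrazy plex home.p0 len ${home}s driveoffset ${cfg}s
volume usr
plex name usr.p0 org concat volume usr
sd name usr.p0.s0 drive YouCrazy plex usr.p0 len ${usr}s driveoffset $(( cfg + home ))s
EOF

cat "$out"
```

Each subdisk starts exactly where the corresponding UFS partition started, which is what lets the existing file systems be embedded without copying.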
You may want to take a look at these files to learn more
about the disk partitioning required for
Vinum or to learn more about the
commands needed to create
Vinum objects.

We now need to install new spindle partitioning for
/dev/ad0.
This requires that
/dev/ad0s1b not be in use for
swapping so we have to reboot in single-user mode.

First, reboot the system.

&prompt.root; reboot

Next, enter single-user mode.

Hit [Enter] to boot immediately, or any other key for command prompt.
Booting [kernel] in 8 seconds...
Type '?' for a list of commands, 'help' for more detailed help.
ok boot -s

In single-user mode, install the new partitioning
created above.

&prompt.root; cd /bootvinum
&prompt.root; disklabel -R ad0s1 disklabel.ad0s1
&prompt.root; disklabel -R ad2s1 disklabel.ad2s1

If you have additional spindles, repeat the
above commands as appropriate for them.

We are about to start Vinum
for the first time.
It is going to want to create several device nodes under
/dev/vinum so we will need to mount the
root file system for read/write access.

&prompt.root; fsck -p /
&prompt.root; mount /

Now it is time to create the Vinum
objects that
will embed the existing non-root file systems on
the root spindle in a
Vinum partition.
This will load the Vinum
kernel module and start Vinum
as a side effect.

&prompt.root; vinum create create.YouCrazy

You should see a list of Vinum
objects created that looks like the following:

1 drives:
D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%)
2 volumes:
V home State: up Plexes: 1 Size: 488 MB
V usr State: up Plexes: 1 Size: 1330 MB
2 plexes:
P home.p0 C State: up Subdisks: 1 Size: 488 MB
P usr.p0 C State: up Subdisks: 1 Size: 1330 MB
2 subdisks:
S home.p0.s0 State: up PO: 0 B Size: 488 MB
S usr.p0.s0 State: up PO: 0 B Size: 1330 MB
You should also see several kernel messages
which state that the Vinum
objects you have created are now up.

Our non-root file systems should now be embedded in a
Vinum partition and
hence available through Vinum
volumes.
It is important to test that this embedding worked.

&prompt.root; fsck -n /dev/vinum/home
&prompt.root; fsck -n /dev/vinum/usr

This should produce no errors.
If it does produce errors, do not fix them.
Instead, go back and examine the root spindle partition tables
before and after Vinum
to see if you can spot the error.
You can back out the partition table changes by using
disklabel -R with the
disklabel.*.b4vinum files.

While we have the root file system mounted read/write, this is
a good time to install /etc/fstab.

&prompt.root; mv /etc/fstab /etc/fstab.b4vinum
&prompt.root; cp /etc/fstab.vinum /etc/fstab

We are now done with tasks requiring single-user
mode, so it is safe to go multi-user from here on.

&prompt.root; ^D

Log in as root.

Edit /etc/rc.conf and add this line:

start_vinum="YES"

Bootstrapping Phase 4: Rootback Spindle Setup

Our goal in this phase is to get redundant copies of all data
from the root spindle to the rootback spindle.
We will first create the necessary Vinum
objects on the rootback spindle.
Then we will ask Vinum
to copy the data from the root spindle to the
rootback spindle.
Finally, we use dump and restore
to copy the root file system.

Phase 4 Example

Now that Vinum
is running on the root spindle, we can bring
it up on the rootback spindle so that our
Vinum volumes can become
failure-resilient.

&prompt.root; cd /bootvinum
&prompt.root; vinum create create.UpWindow

You should see a list of Vinum
objects created that
looks like the following:

2 drives:
D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%)
D UpWindow State: up Device /dev/ad2s1h Avail: 2096/3915 MB (53%)
2 volumes:
V home State: up Plexes: 2 Size: 488 MB
V usr State: up Plexes: 2 Size: 1330 MB
4 plexes:
P home.p0 C State: up Subdisks: 1 Size: 488 MB
P usr.p0 C State: up Subdisks: 1 Size: 1330 MB
P home.p1 C State: faulty Subdisks: 1 Size: 488 MB
P usr.p1 C State: faulty Subdisks: 1 Size: 1330 MB
4 subdisks:
S home.p0.s0 State: up PO: 0 B Size: 488 MB
S usr.p0.s0 State: up PO: 0 B Size: 1330 MB
S home.p1.s0 State: stale PO: 0 B Size: 488 MB
S usr.p1.s0 State: stale PO: 0 B Size: 1330 MB

You should also see several kernel messages
which state that some of the Vinum
objects you have created are now up
while others are faulty or
stale.

Now we ask Vinum
to copy each of the subdisks on drive
YouCrazy to drive UpWindow.
This will change the state of the newly created
Vinum subdisks
from stale to up.
It will also change the state of the newly created
Vinum plexes
from faulty to up.

First, we do the new subdisk we
added to /home.

&prompt.root; vinum start -w home.p1.s0
reviving home.p1.s0
(time passes . . . )
home.p1.s0 is up by force
home.p1 is up
home.p1.s0 is up
My 5,400 RPM EIDE spindles copied at about 3.5 MBytes/sec.
Your mileage may vary.
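If you want a rough idea of how long a revive will take, divide the subdisk size by your copy rate. The sketch below does this in integer shell arithmetic; the function name revive_seconds is mine, and the 3.5 MBytes/sec rate is just the figure I observed, so substitute your own.

```sh
#!/bin/sh
# Estimate revive time from subdisk size and observed copy rate.
# The rate is kept in tenths of a MB per second to stay in integer math.
rate_tenths=35    # 3.5 MB/s, the rate quoted above; yours will differ

revive_seconds() {
    # $1 = subdisk size in MB
    echo $(( $1 * 10 / rate_tenths ))
}

echo "home.p1.s0: about $(revive_seconds 488) seconds"
echo "usr.p1.s0:  about $(revive_seconds 1330) seconds"
```

For the subdisks in this example that works out to a little over two minutes for /home and a little over six minutes for /usr.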
Next we do the new subdisk we
added to /usr.

&prompt.root; vinum start -w usr.p1.s0
reviving usr.p1.s0
(time passes . . . )
usr.p1.s0 is up by force
usr.p1 is up
usr.p1.s0 is up

All Vinum
objects should be in state up at this point.
The output of
vinum list should look
like the following:

2 drives:
D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%)
D UpWindow State: up Device /dev/ad2s1h Avail: 2096/3915 MB (53%)
2 volumes:
V home State: up Plexes: 2 Size: 488 MB
V usr State: up Plexes: 2 Size: 1330 MB
4 plexes:
P home.p0 C State: up Subdisks: 1 Size: 488 MB
P usr.p0 C State: up Subdisks: 1 Size: 1330 MB
P home.p1 C State: up Subdisks: 1 Size: 488 MB
P usr.p1 C State: up Subdisks: 1 Size: 1330 MB
4 subdisks:
S home.p0.s0 State: up PO: 0 B Size: 488 MB
S usr.p0.s0 State: up PO: 0 B Size: 1330 MB
S home.p1.s0 State: up PO: 0 B Size: 488 MB
S usr.p1.s0 State: up PO: 0 B Size: 1330 MB

Copy the root file system so that you will have a backup.

&prompt.root; cd /rootback
&prompt.root; dump 0f - / | restore rf -
&prompt.root; rm restoresymtable
&prompt.root; cd /

You may see errors like this:

./tmp/rstdir1001216411: (inode 558) not found on tape
cannot find directory inode 265
abort? [yn] n
expected next file 492, got 491

They seem to cause no harm.
I suspect they are a consequence of dumping the file system
containing /tmp and/or the pipe
connecting dump and
restore.

Make a directory on which we can mount a damaged root
file system during the recovery process.

&prompt.root; mkdir /rootbad

Remove sentinel mount points that are now unused.

&prompt.root; rmdir /NOFUTURE*

Create empty &vinum.ap; drives on remaining spindles.

&prompt.root; vinum create create.ThruBank
&prompt.root; ...

At this point, the reliable server foundation is complete.
The right-hand sides of the two figures above
show how the disks will look.

You may want to do a quick reboot to multi-user and give it
a quick test drive.
This is also a good point to complete installation
of other distributions beyond the minimal install.
Add packages, ports, and users as required.
Configure /etc/rc.conf as required.

After you have completed your server configuration,
remember to do one more copy of root to
/rootback as shown above before placing
the server into production.

Make a schedule to refresh
/rootback periodically.

It may be a good idea to mount
/rootback read-only for normal operation
of the server.
This does, however, complicate the periodic refresh a bit.Do not forget to watch
/var/log/messages carefully for errors.
Vinum
may automatically avoid failed hardware in a way that users
do not notice.
You must watch for such failures and get them repaired before a
second failure results in data loss.
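One way to keep an eye on the log is a small filter that pulls out the lines of interest. The sketch below is only an illustration: it assumes that the relevant messages contain the word vinum, which this article does not confirm, and the sample lines are invented placeholders rather than real Vinum output.

```python
import re

# Assumption: messages of interest mention "vinum" somewhere in the line.
VINUM_PATTERN = re.compile(r"vinum", re.IGNORECASE)


def vinum_messages(lines):
    """Return the log lines that mention Vinum, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if VINUM_PATTERN.search(line)]


# Invented placeholder lines, not real log output.
sample = [
    "host kernel: vinum: placeholder message about an object\n",
    "host sshd: placeholder message unrelated to storage\n",
]
matches = vinum_messages(sample)
```

To scan the real log, pass an open file instead of the sample list, for example `vinum_messages(open("/var/log/messages", errors="replace"))`.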
You may see
Vinum noting damaged objects
at server boot time.
Where to Go from Here?
Now that you have established the foundation of a reliable server,
there are several things you might want to try next.
Make a Vinum Volume with Remaining Space
Following are the steps to create another
Vinum volume with space remaining
on the rootback spindle.
This volume will not be resilient to spindle failure
since it has only one plex on a single spindle.
Create a file with the following contents:
volume hope
plex name hope.p0 org concat volume hope
sd name hope.p0.s0 drive UpWindow plex hope.p0 len 0
Specifying a length of 0 for
the hope.p0.s0 subdisk
asks Vinum
to use whatever space is left available on the underlying
drive.
Feed these commands into vinum create.
&prompt.root; vinum create filename
Now we newfs the volume and
mount it.
&prompt.root; newfs -v /dev/vinum/hope
&prompt.root; mkdir /hope
&prompt.root; mount /dev/vinum/hope /hope
Edit /etc/fstab if you want
/hope mounted at boot time.
Try Out More Vinum Commands
You might already be familiar with
vinum list to get a list of
all Vinum objects.
Try -v following it to see more detail.
If you have more spindles and you want to bring them up as
concatenated, mirrored, or striped volumes, then give
vinum concat drivelist,
vinum mirror drivelist, or
vinum stripe drivelist a try.
See &man.vinum.8; for sample configurations and important
performance considerations before settling on a final organization
for your additional spindles.
The failure recovery instructions below will also give you
some experience using more Vinum
commands.
Failure Scenarios
This section contains descriptions of various failure scenarios.
For each scenario, there is a subsection on how to configure your
server for degraded mode operation, how to recover from the failure,
how to exit degraded mode, and how to simulate the failure.
Make a hard copy of these instructions and leave them inside the CPU
case, being careful not to interfere with ventilation.
Root file system on ad0 unusable, rest of drive ok
We assume here that the boot blocks and disk label on
/dev/ad0 are ok.
If your BIOS can boot from a drive other than
C:, you may be able to get around this
limitation.
Configure Server for Degraded Mode
Use BootMgr to load kernel from
/dev/ad2s1a.
Hit F5 in BootMgr to select
Drive 1.
Hit F1 to select
FreeBSD.
After the kernel is loaded, hit any key but Enter to interrupt
the boot sequence.
Boot into single-user mode and allow explicit entry of
a root file system.
Hit [Enter] to boot immediately, or any other key for command prompt.
Booting [kernel] in 8 seconds...
Type '?' for a list of commands, 'help' for more detailed help.
ok boot -as
Select /rootback
as your root file system.
Manual root file system specification:
<fstype>:<device> Mount <device> using filesystem <fstype>
e.g. ufs:/dev/da0s1a
? List valid disk boot devices
<empty line> Abort manual input
mountroot> ufs:/dev/ad2s1a
Now that you are in single-user mode, change
/etc/fstab to avoid the
bad root file system.
If you used the bootvinum Perl script from
below, then these commands should configure your server for
degraded mode.
&prompt.root; fsck -p /
&prompt.root; mount /
&prompt.root; cd /etc
&prompt.root; mv fstab fstab.bak
&prompt.root; cp fstab_ad0s1_root_bad fstab
&prompt.root; cd /
&prompt.root; mount -o ro /
&prompt.root; vinum start
&prompt.root; fsck -p
&prompt.root; ^D
Recovery
Restore /dev/ad0s1a from
backups or copy
/rootback to it with these commands:
&prompt.root; umount /rootbad
&prompt.root; newfs /dev/ad0s1a
&prompt.root; tunefs -n enable /dev/ad0s1a
&prompt.root; mount /rootbad
&prompt.root; cd /rootbad
&prompt.root; dump 0f - / | restore rf -
&prompt.root; rm restoresymtable
Exiting Degraded Mode
Enter single-user mode.
&prompt.root; shutdown now
Put /etc/fstab back to
normal and reboot.
&prompt.root; cd /rootbad/etc
&prompt.root; rm fstab
&prompt.root; mv fstab.bak fstab
&prompt.root; reboot
Reboot and hit F1 to boot from
/dev/ad0 when
prompted by BootMgr.
Simulation
This kind of failure can be simulated by shutting down to
single-user mode and then booting as shown above.
Drive ad2 Fails
This section deals with the total failure of
/dev/ad2.
Configure Server for Degraded Mode
After the kernel is loaded, hit any key but
Enter to interrupt the boot sequence.
Boot into single-user mode.
Hit [Enter] to boot immediately, or any other key for command prompt.
Booting [kernel] in 8 seconds...
Type '?' for a list of commands, 'help' for more detailed help.
ok boot -s
Change
/etc/fstab to avoid the bad drive.
If you used the bootvinum Perl script from
below, then
these commands should configure your server for
degraded mode.
&prompt.root; fsck -p /
&prompt.root; mount /
&prompt.root; cd /etc
&prompt.root; mv fstab fstab.bak
&prompt.root; cp fstab_only_have_ad0s1 fstab
&prompt.root; cd /
&prompt.root; mount -o ro /
&prompt.root; vinum start
&prompt.root; fsck -p
&prompt.root; ^D
If you do not have modified versions of
/etc/fstab that are ready for use,
then you can use ed to make one.
Alternatively, you can fsck and
mount /usr and then use your
favorite editor.
Recovery
We assume here that your server is up and running multi-user in
degraded mode on just
/dev/ad0 and that you have
a new spindle now on
/dev/ad2 ready to go.
You will need a new spindle with enough room to hold root and swap
partitions plus a Vinum
partition large enough to hold
/home and /usr.
Create a BIOS partition (slice) on the new spindle.
&prompt.root; /stand/sysinstall
Select Custom.
Select Partition.
Select ad2.
Create a FreeBSD (type 165) slice
large enough to hold everything mentioned above.
Write changes.
Yes, you are absolutely sure.
Select BootMgr.
Quit Partitioning.
Exit /stand/sysinstall.
Create disk label partitioning based on current
/dev/ad0 partitioning.
&prompt.root; disklabel ad0 > /tmp/ad0
&prompt.root; disklabel -e ad2
This will drop you into your favorite editor.
Copy the lines for the a and
b partitions from
/tmp/ad0 to the
ad2 disklabel.
Add the size of the
a and
b partitions to find the proper
offset for the
h partition.
Subtract this offset from the
size of the c
partition to find the proper size for the h
partition.
Define an h partition with the
size and
offset calculated above.
Set the fstype column to
vinum.
Save the file and quit your editor.
Tell Vinum
about the new drive.
Ask Vinum to start an
editor with a copy of the current configuration.
&prompt.root; vinum create
Uncomment the drive line referring to drive
UpWindow and set
device to
/dev/ad2s1h.
Save the file and quit your editor.
Now that Vinum
has two spindles again, revive the mirrors.
&prompt.root; vinum start -w usr.p1.s0
&prompt.root; vinum start -w home.p1.s0
Now we need to restore
/rootback to a current copy of the
root file system.
These commands will accomplish this.
&prompt.root; newfs /dev/ad2s1a
&prompt.root; tunefs -n enable /dev/ad2s1a
&prompt.root; mount /dev/ad2s1a /mnt
&prompt.root; cd /mnt
&prompt.root; dump 0f - / | restore rf -
&prompt.root; rm restoresymtable
&prompt.root; cd /
&prompt.root; umount /mnt
Exiting Degraded Mode
Enter single-user mode.
&prompt.root; shutdown now
Return /etc/fstab to
its normal state and reboot.
&prompt.root; cd /etc
&prompt.root; rm fstab
&prompt.root; mv fstab.bak fstab
&prompt.root; reboot
Simulation
You can simulate this kind of failure by unplugging
/dev/ad2, write-protecting it,
or by this procedure:
Shut down to single-user mode.
Unmount all non-root file systems.
Clobber any existing Vinum
configuration and partitioning on
/dev/ad2.
&prompt.root; vinum stop
&prompt.root; dd if=/dev/zero of=/dev/ad2s1h count=512
&prompt.root; dd if=/dev/zero of=/dev/ad2 count=512
Drive ad0 Fails
Some BIOSes can boot from drive 1 or drive 2 (often called
C: or D:),
while others can boot only from drive 1.
If your BIOS can boot from either, the fastest road to recovery
might be to boot directly from /dev/ad2
in single-user mode and
install /etc/fstab_only_have_ad2s1 as
/etc/fstab.
You would then have to adapt the /dev/ad2
failure recovery instructions from above.
If your BIOS can only boot from drive one, then you will have to
unplug drive UpWindow from the controller for
/dev/ad2 and plug it
into the controller for /dev/ad0.
Then continue with the instructions for
/dev/ad2 failure recovery
above.
bootvinum Perl Script
The bootvinum Perl script below reads /etc/fstab
and current drive partitioning.
It then writes several files in the current directory and several
variants of /etc/fstab in /etc.
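To make the idea of fstab variants concrete, the sketch below re-expresses in Python two of the substitutions that the script's degraded-mode table applies to each fstab line. The function names, the spindle name ad2s1, and the sample lines are illustrative assumptions; the Perl script itself remains the reference.

```python
import re


def only_have_root_spindle(lines, rbsp):
    """Comment out every line whose device is on the rootback spindle."""
    pattern = re.compile(r"^/dev/" + re.escape(rbsp))
    return [pattern.sub(lambda m: "#" + m.group(0), line) for line in lines]


def root_bad(lines):
    """Root becomes /rootbad; otherwise /rootback becomes root."""
    out = []
    for line in lines:
        # Try the root substitution first, as the script's expression does.
        new, count = re.subn(r"\t/\t", "\t/rootbad", line, count=1)
        if count == 0:
            new = line.replace("/rootback", "/\t", 1)
        out.append(new)
    return out


# Hypothetical tab-separated fstab lines.
sample = [
    "/dev/ad0s1a\t\t/\t\tufs\trw\t\t1\t1\n",
    "/dev/ad2s1a\t\t/rootback\tufs\trw\t\t2\t1\n",
    "/dev/vinum/usr\t\t/usr\t\tufs\trw\t\t2\t2\n",
]
without_rootback = only_have_root_spindle(sample, "ad2s1")
with_bad_root = root_bad(sample)
```

Lines that match neither substitution, such as the Vinum volume in the sample, pass through unchanged.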
These files significantly simplify the installation of
Vinum and recovery from
spindle failures.
#!/usr/bin/perl -w
use strict;
use FileHandle;
-my $config_tag1 = '$Id: article.sgml,v 1.4 2001-10-31 23:12:55 chern Exp $';
+my $config_tag1 = '$Id: article.sgml,v 1.5 2002-02-14 23:57:13 keramida Exp $';
# Copyright (C) 2001 Robert A. Van Valzah
#
# Bootstrap Vinum
#
# Read /etc/fstab and current partitioning for all spindles mentioned there.
# Generate files needed to mirror all file systems on root spindle.
# A new partition table for each spindle
# Input for the vinum create command to create Vinum objects on each spindle
# A copy of fstab mounting Vinum volumes instead of BSD partitions
# Copies of fstab altered for server's degraded modes of operation
# See handbook for instructions on how to use the files generated.
# N.B. This bootstrapping method shrinks size of swap partition by the size
# of Vinum's on-disk configuration (265 sectors). It embeds existing file
# systems on the root spindle in Vinum objects without having to copy them.
# Thanks to Greg Lehey for suggesting this bootstrapping method.
# Expectations:
# The root spindle must contain at least root, swap, and /usr partitions
# The rootback spindle must have matching /rootback and swap partitions
# Other spindles should only have a /NOFUTURE* file system and maybe swap
# File systems named /NOFUTURE* will be replaced with Vinum drives
# Change configuration variables below to suit your taste
my $vip = 'h'; # VInum Partition
my @drv = ('YouCrazy', 'UpWindow', 'ThruBank', # Vinum DRiVe names
'OutSnakes', 'MeWild', 'InMovie', 'HomeJames', 'DownPrices', 'WhileBlind');
# No configuration variables beyond this point
my %vols; # One entry per Vinum volume to be created
my @spndl; # One entry per SPiNDLe
my $rsp; # Root SPindle (as in /dev/$rsp)
my $rbsp; # RootBack SPindle (as in /dev/$rbsp)
my $cfgsiz = 265; # Size of Vinum on-disk configuration info in sectors
my $nxtpas = 2; # Next fsck pass number for non-root file systems
# Parse fstab, generating the version we'll need for Vinum and noting
# spindles in use.
my $fsin = "/etc/fstab";
#my $fsin = "simu/fstab";
open(FSIN, "$fsin") || die("Couldn't open $fsin: $!\n");
my $fsout = "/etc/fstab.vinum";
open(FSOUT, ">$fsout") || die("Couldn't open $fsout for writing: $!\n");
while (<FSIN>) {
my ($dev, $mnt, $fstyp, $opt, $dump, $pass) = split;
next if $dev =~ /^#/;
if ($mnt eq '/' || $mnt eq '/rootback' || $mnt =~ /^\/NOFUTURE/) {
my $dn = substr($dev, 5, length($dev)-6); # Device Name without /dev/
push(@spndl, $dn) unless grep($_ eq $dn, @spndl);
$rsp = $dn if $mnt eq '/';
next if $mnt =~ /^\/NOFUTURE/;
}
# Move /rootback from partition e to a
if ($mnt =~ /^\/rootback/) {
$dev =~ s/e$/a/;
$pass = 1;
$rbsp = substr($dev, 5, length($dev)-6);
print FSOUT "$dev\t\t$mnt\t$fstyp\t$opt\t\t$dump\t$pass\n";
next;
}
# Move non-root file systems on smallest spindle into Vinum
if (defined($rsp) && $dev =~ /^\/dev\/$rsp/ && $dev =~ /[d-h]$/) {
$pass = $nxtpas++;
print FSOUT "/dev/vinum$mnt\t\t$mnt\t\t$fstyp\t$opt\t\t$dump\t$pass\n";
$vols{$dev}->{mnt} = substr($mnt, 1);
next;
}
print FSOUT $_;
}
close(FSOUT);
die("Found more spindles than we have abstract names\n") if $#spndl > $#drv;
die("Didn't find a root partition!\n") if !defined($rsp);
die("Didn't find a /rootback partition!\n") if !defined($rbsp);
# Table of server's Degraded Modes
# One row per mode with hash keys
# fn FileName
# xpr eXPRession needed to convert fstab lines for this mode
# cm1 CoMment 1 describing this mode
# cm2 CoMment 2 describing this mode
# FH FileHandle (dynamically initialized below)
my @DM = (
{ cm1 => "When we only have $rsp, comment out lines using $rbsp",
fn => "/etc/fstab_only_have_$rsp",
xpr => "s:^/dev/$rbsp:#\$&:",
},
{ cm1 => "When we only have $rbsp, comment out lines using $rsp and",
cm2 => "rootback becomes root",
fn => "/etc/fstab_only_have_$rbsp",
xpr => "s:^/dev/$rsp:#\$&: || s:/rootback:/\t:",
},
{ cm1 => "When only $rsp root is bad, /rootback becomes root and",
cm2 => "root becomes /rootbad",
fn => "/etc/fstab_${rsp}_root_bad",
xpr => "s:\t/\t:\t/rootbad: || s:/rootback:/\t:",
},
);
# Initialize output FileHandles and write comments
foreach my $dm (@DM) {
my $fh = new FileHandle;
$fh->open(">$dm->{fn}") || die("Can't write $dm->{fn}: $!\n");
print $fh "# $dm->{cm1}\n" if $dm->{cm1};
print $fh "# $dm->{cm2}\n" if $dm->{cm2};
$dm->{FH} = $fh;
}
# Parse the Vinum version of fstab written above and write versions needed
# for server's degraded modes.
open(FSOUT, "$fsout") || die("Couldn't open $fsout: $!\n");
while (<FSOUT>) {
my $line = $_;
foreach my $dm (@DM) {
$_ = $line;
eval $dm->{xpr};
print {$dm->{FH}} $_;
}
}
# Parse partition table for each spindle and write versions needed for Vinum
my $rootsiz; # ROOT partition SIZe
my $swapsiz; # SWAP partition SIZe
my $rspminoff; # Root SPindle MINimum OFFset of non-root, non-swap, non-c parts
my $rspsiz; # Root SPindle SIZe
my $rbspsiz; # RootBack SPindle SIZe
foreach my $i (0..$#spndl) {
my $dlin = "disklabel $spndl[$i] |";
# my $dlin = "simu/disklabel.$spndl[$i]";
open(DLIN, "$dlin") || die("Couldn't open $dlin: $!\n");
my $dlout = "disklabel.$spndl[$i]";
open(DLOUT, ">$dlout") || die("Couldn't open $dlout for writing: $!\n");
my $dlb4 = "$dlout.b4vinum";
open(DLB4, ">$dlb4") || die("Couldn't open $dlb4 for writing: $!\n");
my $minoff; # MINimum OFFset of non-root, non-swap, non-c partitions
my $totsiz = 0; # TOTal SIZe of all non-root, non-swap, non-c partitions
my $swapspndl = 0; # True if SWAP partition on this SPiNDLe
while (<DLIN>) {
print DLB4 $_;
my ($part, $siz, $off, $fstyp, $fsiz, $bsiz, $bps) = split;
if ($part && $part eq 'a:' && $spndl[$i] eq $rsp) {
$rootsiz = $siz;
}
if ($part && $part eq 'e:' && $spndl[$i] eq $rbsp) {
if ($rootsiz != $siz) {
die("Rootback size ($siz) != root size ($rootsiz)\n");
}
}
if ($part && $part eq 'c:') {
$rspsiz = $siz if $spndl[$i] eq $rsp;
$rbspsiz = $siz if $spndl[$i] eq $rbsp;
}
# Make swap partition $cfgsiz sectors smaller
if ($part && $part eq 'b:') {
if ($spndl[$i] eq $rsp) {
$swapsiz = $siz;
} else {
if ($swapsiz != $siz) {
die("Swap partition sizes unequal across spindles\n");
}
}
printf DLOUT "%4s%9d%9d%10s\n", $part, $siz-$cfgsiz, $off, $fstyp;
$swapspndl = 1;
next;
}
# Move rootback spindle e partitions to a
if ($part && $part eq 'e:' && $spndl[$i] eq $rbsp) {
printf DLOUT "%4s%9d%9d%10s%9d%6d%6d\n", 'a:', $siz, $off, $fstyp,
$fsiz, $bsiz, $bps;
next;
}
# Delete non-root, non-swap, non-c partitions but note their minimum
# offset and total size that're needed below.
if ($part && $part =~ /^[d-h]:$/) {
$minoff = $off unless $minoff;
$minoff = $off if $off < $minoff;
$totsiz += $siz;
if ($spndl[$i] eq $rsp) { # If doing spindle containing root
my $dev = "/dev/$spndl[$i]" . substr($part, 0, 1);
$vols{$dev}->{siz} = $siz;
$vols{$dev}->{off} = $off;
$rspminoff = $minoff;
}
next;
}
print DLOUT $_;
}
if ($swapspndl) { # If there was a swap partition on this spindle
# Make a Vinum partition the size of all non-root, non-swap,
# non-c partitions + the size of Vinum's on-disk configuration.
# Set its offset so that the start of the first subdisk it contains
# coincides with the first file system we're embedding in Vinum.
printf DLOUT "%4s%9d%9d%10s\n", "$vip:", $totsiz+$cfgsiz, $minoff-$cfgsiz,
'vinum';
} else {
# No need to mess with size and offset if there was no swap
printf DLOUT "%4s%9d%9d%10s\n", "$vip:", $totsiz, $minoff,
'vinum';
}
}
die("Swap partition not found\n") unless $swapsiz;
die("Swap partition not larger than $cfgsiz blocks\n") unless $swapsiz>$cfgsiz;
die("Rootback spindle size not >= root spindle size\n") unless $rbspsiz>=$rspsiz;
# Generate input to vinum create command needed for each spindle.
foreach my $i (0..$#spndl) {
my $cfn = "create.$drv[$i]"; # Create File Name
open(CF, ">$cfn") || die("Can't open $cfn for writing: $!\n");
print CF "drive $drv[$i] device /dev/$spndl[$i]$vip\n";
next unless $spndl[$i] eq $rsp || $spndl[$i] eq $rbsp;
foreach my $dev (keys(%vols)) {
my $mnt = $vols{$dev}->{mnt};
my $siz = $vols{$dev}->{siz};
my $off = $vols{$dev}->{off}-$rspminoff+$cfgsiz;
print CF "volume $mnt\n" if $spndl[$i] eq $rsp;
print CF <<EOF;
plex name $mnt.p$i org concat volume $mnt
sd name $mnt.p$i.s0 drive $drv[$i] plex $mnt.p$i len ${siz}s driveoffset ${off}s
EOF
}
}
Manual Vinum Bootstrapping
The bootvinum Perl script makes life easier, but
it may be necessary to manually perform some or all of the steps that
it automates.
This appendix describes how you would manually mimic the script.
Make a copy of /etc/fstab
to be customized.
&prompt.root; cp /etc/fstab /etc/fstab.vinum
Edit /etc/fstab.vinum.
Change the device column of
non-root partitions on the root spindle to
/dev/vinum/mnt.
Change the pass column of
non-root partitions on the root spindle to 2,
3, etc.
Delete any lines with mountpoint
matching /NOFUTURE*.
Change the device column of
/rootback
from e to
a.
Change the pass column of
/rootback to
1.
Prepare disklabels for editing:
&prompt.root; cd /bootvinum
&prompt.root; disklabel ad0s1 > disklabel.ad0s1
&prompt.root; cp disklabel.ad0s1 disklabel.ad0s1.b4vinum
&prompt.root; disklabel ad2s1 > disklabel.ad2s1
&prompt.root; cp disklabel.ad2s1 disklabel.ad2s1.b4vinum
Edit /etc/disklabel.ad?s1.
On the root spindle:
Decrease the size of the
b partition by 265 blocks.
Note the size and
offset of the a and
b partitions.
Note the smallest offset for partitions
d-h.
Note the size and
offset for all non-root, non-swap
partitions (/home was probably on
e and /usr was
probably on f).
Delete partitions
d-h.
Create a new h partition with
offset 265 blocks less than the
smallest offset
for partitions d-h
noted above.
Set its size to the size
of the c partition less the
smallest offset
for partitions d-h
noted above + 265 blocks.
Vinum
can use any partition other than c.
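As a worked illustration of the offset and size rules for the new h partition given above, the following sketch reduces them to two subtractions. The function name and the sector counts are made up for the example; only the 265-block figure comes from this appendix.

```python
VINUM_CONFIG_BLOCKS = 265  # size of Vinum's on-disk configuration


def vinum_partition(c_size, smallest_offset, config=VINUM_CONFIG_BLOCKS):
    """Return (offset, size) of the h partition per the rules above."""
    offset = smallest_offset - config
    size = c_size - smallest_offset + config
    return offset, size


# Hypothetical numbers: a c partition of 8000000 blocks whose first
# d-h partition used to start at block 1000000.
offset, size = vinum_partition(8000000, 1000000)
```

With these rules the offset and size always add up to the size of the c partition, so the Vinum partition runs to the end of the slice.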
It is not strictly necessary to use h
for all your Vinum
partitions, but it is good practice to
be consistent across all spindles.
Set the fstype of this new
partition to vinum.
On the rootback spindle:
Move the e partition to
a.
Verify that the size of the
a and
b partitions matches the
root spindle.
Note the smallest offset for partitions
d-h.
Delete partitions
d-h.
Create a new h partition with
offset 265 blocks less than the
smallest offset
noted above for partitions
d-h.
Set its size to the size
of the c partition less the
smallest offset
for partitions d-h
noted above + 265 blocks.
Set the fstype of this new
partition to vinum.
Create a file named
create.YouCrazy that contains:
drive YouCrazy device /dev/ad0s1h
volume home
plex name home.p0 org concat volume home
sd name home.p0.s0 drive YouCrazy plex home.p0 len $hl driveoffset $ho
volume usr
plex name usr.p0 org concat volume usr
sd name usr.p0.s0 drive YouCrazy plex usr.p0 len $ul driveoffset $uo
Where:
$hl is the length noted above for
/home.
$ho is the offset noted above for
/home less the smallest offset
noted above + 265 blocks.
$ul is the length noted above for
/usr.
$uo is the offset noted above for
/usr less the smallest offset
noted above + 265 blocks.
Create a file named
create.UpWindow containing:
drive UpWindow device /dev/ad2s1h
plex name home.p1 org concat volume home
sd name home.p1.s0 drive UpWindow plex home.p1 len $hl driveoffset $ho
plex name usr.p1 org concat volume usr
sd name usr.p1.s0 drive UpWindow plex usr.p1 len $ul driveoffset $uo
Where $hl, $ho, $ul, and $uo are set as above.
Acknowledgements
I would like to thank Greg Lehey for writing &vinum.ap; and for
providing very helpful comments on early drafts.
Several others made helpful suggestions after reviewing later drafts
including
Dag-Erling Smørgrav,
Michael Splendoria,
Chern Lee,
Stefan Aeschbacher,
Fleming Froekjaer,
Bernd Walter,
Aleksey Baranov, and
Doug Swarin.
diff --git a/en_US.ISO8859-1/articles/vm-design/article.sgml b/en_US.ISO8859-1/articles/vm-design/article.sgml
index 3f400b365b..9e992c6039 100644
--- a/en_US.ISO8859-1/articles/vm-design/article.sgml
+++ b/en_US.ISO8859-1/articles/vm-design/article.sgml
@@ -1,838 +1,838 @@
%man;
]>
Design elements of the FreeBSD VM system
Matthew Dillon
dillon@apollo.backplane.com
The title is really just a fancy way of saying that I am going to
attempt to describe the whole VM enchilada, hopefully in a way that
everyone can follow. For the last year I have concentrated on a number
of major kernel subsystems within FreeBSD, with the VM and Swap
subsystems being the most interesting and NFS being a necessary
chore. I rewrote only small portions of the code. In the VM
arena the only major rewrite I have done is to the swap subsystem.
Most of my work was cleanup and maintenance, with only moderate code
rewriting and no major algorithmic adjustments within the VM
subsystem. The bulk of the VM subsystem's theoretical base remains
unchanged and a lot of the credit for the modernization effort in the
last few years belongs to John Dyson and David Greenman. Not being a
historian like Kirk I will not attempt to tag all the various features
with people's names, since I will invariably get it wrong.
This article was originally published in the January 2000 issue of
DaemonNews. This
version of the article may include updates from Matt and other authors
to reflect changes in FreeBSD's VM implementation.
Introduction
Before moving along to the actual design let's spend a little time
on the necessity of maintaining and modernizing any long-living
codebase. In the programming world, algorithms tend to be more
important than code and it is precisely due to BSD's academic roots that
a great deal of attention was paid to algorithm design from the
beginning. More attention paid to the design generally leads to a clean
and flexible codebase that can be fairly easily modified, extended, or
replaced over time. While BSD is considered an old
operating system by some people, those of us who work on it tend to view
it more as a mature codebase which has various components
modified, extended, or replaced with modern code. It has evolved, and
FreeBSD is at the bleeding edge no matter how old some of the code might
be. This is an important distinction to make and one that is
unfortunately lost to many people. The biggest error a programmer can
make is to not learn from history, and this is precisely the error that
many other modern operating systems have made. NT is the best example
of this, and the consequences have been dire. Linux also makes this
mistake to some degree—enough that we BSD folk can make small
jokes about it every once in a while, anyway. Linux's problem is simply
one of a lack of experience and history to compare ideas against, a
problem that is easily and rapidly being addressed by the Linux
community in the same way it has been addressed in the BSD
community—by continuous code development. The NT folk, on the
other hand, repeatedly make the same mistakes solved by Unix decades ago
and then spend years fixing them. Over and over again. They have a
severe case of "not designed here" and "we are always
right because our marketing department says so". I have little
tolerance for anyone who cannot learn from history.
Much of the apparent complexity of the FreeBSD design, especially in
the VM/Swap subsystem, is a direct result of having to solve serious
performance issues that occur under various conditions. These issues
are not due to bad algorithmic design but instead rise from
environmental factors. In any direct comparison between platforms,
these issues become most apparent when system resources begin to get
stressed. As I describe FreeBSD's VM/Swap subsystem the reader should
always keep two points in mind. First, the most important aspect of
performance design is what is known as Optimizing the Critical
Path. It is often the case that performance optimizations add a
little bloat to the code in order to make the critical path perform
better. Second, a solid, generalized design outperforms a
heavily-optimized design over the long run. While a generalized design
may end up being slower than a heavily-optimized design when they are
first implemented, the generalized design tends to be easier to adapt to
changing conditions and the heavily-optimized design winds up having to
be thrown away. Any codebase that will survive and be maintainable for
years must therefore be designed properly from the beginning even if it
costs some performance. Twenty years ago people were still arguing that
programming in assembly was better than programming in a high-level
language because it produced code that was ten times as fast. Today,
the fallibility of that argument is obvious—as are the parallels
to algorithmic design and code generalization.
VM Objects
The best way to begin describing the FreeBSD VM system is to look at
it from the perspective of a user-level process. Each user process sees
a single, private, contiguous VM address space containing several types
of memory objects. These objects have various characteristics. Program
code and program data are effectively a single memory-mapped file (the
binary file being run), but program code is read-only while program data
is copy-on-write. Program BSS is just memory allocated and filled with
zeros on demand, called demand zero page fill. Arbitrary files can be
memory-mapped into the address space as well, which is how the shared
library mechanism works. Such mappings can require modifications to
remain private to the process making them. The fork system call adds an
entirely new dimension to the VM management problem on top of the
complexity already given.
A program binary data page (which is a basic copy-on-write page)
illustrates the complexity. A program binary contains a preinitialized
data section which is initially mapped directly from the program file.
When a program is loaded into a process's VM space, this area is
initially memory-mapped and backed by the program binary itself,
allowing the VM system to free/reuse the page and later load it back in
from the binary. The moment a process modifies this data, however, the
VM system must make a private copy of the page for that process. Since
the private copy has been modified, the VM system may no longer free it,
because there is no longer any way to restore it later on.
You will notice immediately that what was originally a simple file
mapping has become much more complex. Data may be modified on a
page-by-page basis whereas the file mapping encompasses many pages at
once. The complexity further increases when a process forks. When a
process forks, the result is two processes—each with their own
private address spaces, including any modifications made by the original
process prior to the call to fork(). It would be
silly for the VM system to make a complete copy of the data at the time
of the fork() because it is quite possible that at
least one of the two processes will only need to read from that page
from then on, allowing the original page to continue to be used. What
was a private page is made copy-on-write again, since each process
(parent and child) expects their own personal post-fork modifications to
remain private to themselves and not affect the other.
FreeBSD manages all of this with a layered VM Object model. The
original binary program file winds up being the lowest VM Object layer.
A copy-on-write layer is pushed on top of that to hold those pages which
had to be copied from the original file. If the program modifies a data
page belonging to the original file the VM system takes a fault and
makes a copy of the page in the higher layer. When a process forks,
additional VM Object layers are pushed on. This might make a little
more sense with a fairly basic example. A fork()
is a common operation for any *BSD system, so this example will consider
a program that starts up, and forks. When the process starts, the VM
system creates an object layer, let's call this A:
+---------------+
|       A       |
+---------------+
A represents the file; pages may be paged in and out of the
file's physical media as necessary. Paging in from the disk is
reasonable for a program, but we really do not want to page back out and
overwrite the executable. The VM system therefore creates a second
layer, B, that will be physically backed by swap space:
+---------------+
|       B       |
+---------------+
|       A       |
+---------------+
On the first write to a page after this, a new page is created in B,
and its contents are initialized from A. All pages in B can be paged in
or out to a swap device. When the program forks, the VM system creates
two new object layers (C1 for the parent, and C2 for the
child) that rest on top of B:
+-------+-------+
|   C1  |   C2  |
+-------+-------+
|       B       |
+---------------+
|       A       |
+---------------+
In this case, let's say a page in B is modified by the original
parent process. The process will take a copy-on-write fault and
duplicate the page in C1, leaving the original page in B untouched.
Now, let's say the same page in B is modified by the child process. The
process will take a copy-on-write fault and duplicate the page in C2.
The original page in B is now completely hidden since both C1 and C2
have a copy and B could theoretically be destroyed (if it does not
- represent a 'real' file). However, this sort of optimization is not
+ represent a real file). However, this sort of optimization is not
trivial to make because it is so fine-grained. FreeBSD does not make
this optimization. Now, suppose (as is often the case) that the child
process does an exec(). Its current address space
is usually replaced by a new address space representing a new file. In
this case, the C2 layer is destroyed:
+-------+
|   C1  |
+-------+-------+
|       B       |
+---------------+
|       A       |
+---------------+
In this case, the number of children of B drops to one, and all
accesses to B now go through C1. This means that B and C1 can be
collapsed together. Any pages in B that also exist in C1 are deleted
from B during the collapse. Thus, even though the optimization in the
previous step could not be made, we can recover the dead pages when
either of the processes exits or calls exec().
This model creates a number of potential problems. The first is that
you can wind up with a relatively deep stack of layered VM Objects which
can cost scanning time and memory when you take a fault. Deep
layering can occur when processes fork and then fork again (either
parent or child). The second problem is that you can wind up with dead,
inaccessible pages deep in the stack of VM Objects. In our last example
if both the parent and child processes modify the same page, they both
get their own private copies of the page and the original page in B is
no longer accessible by anyone. That page in B can be freed.
FreeBSD solves the deep layering problem with a special optimization
called the All Shadowed Case. This case occurs if either
C1 or C2 take sufficient COW faults to completely shadow all pages in B.
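The condition in the previous sentence can be written as a simple subset test. The following sketch is a toy model, not kernel code: the VMObject class and the helper names are invented, and a layer is reduced to the set of page indices it holds.

```python
class VMObject:
    """Toy layer: a name, a backing layer, and the page indices it holds."""

    def __init__(self, name, backing=None, pages=()):
        self.name = name
        self.backing = backing
        self.pages = set(pages)


def all_shadowed(top):
    """True if every page held by top's backing layer is also held by top."""
    below = top.backing
    return below is not None and below.pages <= top.pages


def bypass(top):
    """Relink top past a fully shadowed backing layer; report whether it did."""
    if not all_shadowed(top):
        return False
    top.backing = top.backing.backing
    return True


a = VMObject("A", pages=range(8))
b = VMObject("B", backing=a, pages={0, 1})
c1 = VMObject("C1", backing=b, pages={0, 1})  # shadows every page in B
c2 = VMObject("C2", backing=b, pages={1})     # shadows only part of B

c1_bypassed = bypass(c1)
c2_bypassed = bypass(c2)
```

In this run C1 ends up backed directly by A, while C2 still goes through B because page 0 of B is not yet shadowed by C2.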
Let's say that C1 achieves this. C1 can now bypass B entirely, so rather
than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But
look what also happened—now B has only one reference (C2), so we
can collapse B and C2 together. The end result is that B is deleted
entirely and we have C1->A and C2->A. It is often the case that B will
contain a large number of pages and neither C1 nor C2 will be able to
completely overshadow it. If we fork again and create a set of D
layers, however, it is much more likely that one of the D layers will
eventually be able to completely overshadow the much smaller dataset
represented by C1 or C2. The same optimization will work at any point in
the graph and the grand result of this is that even on a heavily forked
machine VM Object stacks tend to not get much deeper than 4. This is
true of both the parent and the children and true whether the parent is
doing the forking or whether the children cascade forks.

The dead page problem still exists in the case where C1 or C2 do not
completely overshadow B. Due to our other optimizations this case does
not represent much of a problem and we simply allow the pages to be
dead. If the system runs low on memory it will swap them out, eating a
little swap, but that is it.

The advantage to the VM Object model is that
fork() is extremely fast, since no real data
copying need take place. The disadvantage is that you can build a
relatively complex VM Object layering that slows page fault handling
down a little, and you spend memory managing the VM Object structures.
The optimizations FreeBSD makes prove to reduce the problems enough
that they can be ignored, leaving no real disadvantage.

SWAP Layers

Private data pages are initially either copy-on-write or zero-fill
pages. When a change, and therefore a copy, is made, the original
backing object (usually a file) can no longer be used to save a copy of
the page when the VM system needs to reuse it for other purposes. This
is where SWAP comes in. SWAP is allocated to create backing store for
memory that does not otherwise have it. FreeBSD allocates the swap
management structure for a VM Object only when it is actually needed.
However, the swap management structure has had problems
historically.

Under FreeBSD 3.x the swap management structure preallocates an
array that encompasses the entire object requiring swap backing
store—even if only a few pages of that object are swap-backed.
This creates a kernel memory fragmentation problem when large objects
are mapped, or processes with large runsizes (RSS) fork. Also, in order
to keep track of swap space, a list of holes is kept in
kernel memory, and this tends to get severely fragmented as well. Since
- the 'list of holes' is a linear list, the swap allocation and freeing
+ the list of holes is a linear list, the swap allocation and freeing
performance is a non-optimal O(n)-per-page. It also requires kernel
memory allocations to take place during the swap freeing process, and
that creates low memory deadlock problems. The problem is further
exacerbated by holes created due to the interleaving algorithm. Also,
the swap block map can become fragmented fairly easily resulting in
non-contiguous allocations. Kernel memory must also be allocated on the
fly for additional swap management structures when a swapout occurs. It
is evident that there was plenty of room for improvement.

For FreeBSD 4.x, I completely rewrote the swap subsystem. With this
rewrite, swap management structures are allocated through a hash table
rather than a linear array, giving them a fixed allocation size and much
finer granularity. Rather than using a linearly linked list to keep
track of swap space reservations, it now uses a bitmap of swap blocks
arranged in a radix tree structure with free-space hinting in the radix
node structures. This effectively makes swap allocation and freeing an
O(1) operation. The entire radix tree bitmap is also preallocated in
order to avoid having to allocate kernel memory during critical low
memory swapping operations. After all, the system tends to swap when it
is low on memory so we should avoid allocating kernel memory at such
times in order to avoid potential deadlocks. Finally, to reduce
fragmentation the radix tree is capable of allocating large contiguous
chunks at once, skipping over smaller fragmented chunks. I did not take
- the final step of having an 'allocating hint pointer' that would trundle
+ the final step of having an allocating hint pointer that would trundle
through a portion of swap as allocations were made in order to further
guarantee contiguous allocations or at least locality of reference, but
I ensured that such an addition could be made.

When to free a page

Since the VM system uses all available memory for disk caching,
there are usually very few truly-free pages. The VM system depends on
being able to properly choose pages which are not in use to reuse for
new allocations. Selecting the optimal pages to free is possibly the
single most important function any VM system can perform because if it
makes a poor selection, the VM system may be forced to unnecessarily
retrieve pages from disk, seriously degrading system performance.

How much overhead are we willing to suffer in the critical path to
avoid freeing the wrong page? Each wrong choice we make will cost us
hundreds of thousands of CPU cycles and a noticeable stall of the
affected processes, so we are willing to endure a significant amount of
overhead in order to be sure that the right page is chosen. This is why
FreeBSD tends to outperform other systems when memory resources become
stressed.

The free page determination algorithm is built upon a history of the
use of memory pages. To acquire this history, the system takes advantage
of a page-used bit feature that most hardware page tables have.

In any case, the page-used bit is cleared and at some later point
the VM system comes across the page again and sees that the page-used
bit has been set. This indicates that the page is still being actively
used. If the bit is still clear it is an indication that the page is not
being actively used. By testing this bit periodically, a use history (in
the form of a counter) for the physical page is developed. When the VM
system later needs to free up some pages, checking this history becomes
the cornerstone of determining the best candidate page to reuse.

What if the hardware has no page-used bit?

For those platforms that do not have this feature, the system
actually emulates a page-used bit. It unmaps or protects a page,
forcing a page fault if the page is accessed again. When the page
fault is taken, the system simply marks the page as having been used
and unprotects the page so that it may be used. While taking such page
faults just to determine if a page is being used appears to be an
expensive proposition, it is much less expensive than reusing the page
for some other purpose only to find that a process needs it back and
then have to go to disk.

FreeBSD makes use of several page queues to further refine the
selection of pages to reuse as well as to determine when dirty pages
must be flushed to their backing store. Since page tables are dynamic
entities under FreeBSD, it costs virtually nothing to unmap a page from
the address space of any processes using it. When a page candidate has
been chosen based on the page-use counter, this is precisely what is
done. The system must make a distinction between clean pages which can
theoretically be freed up at any time, and dirty pages which must first
be written to their backing store before being reusable. When a page
candidate has been found it is moved to the inactive queue if it is
dirty, or the cache queue if it is clean. A separate algorithm based on
the dirty-to-clean page ratio determines when dirty pages in the
inactive queue must be flushed to disk. Once this is accomplished, the
flushed pages are moved from the inactive queue to the cache queue. At
this point, pages in the cache queue can still be reactivated by a VM
fault at relatively low cost. However, pages in the cache queue are
considered to be immediately freeable and will be reused
in an LRU (least-recently used) fashion when the system needs to
allocate new memory.

It is important to note that the FreeBSD VM system attempts to
separate clean and dirty pages for the express reason of avoiding
unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does
it move pages between the various page queues gratuitously when the
memory subsystem is not being stressed. This is why you will see some
systems with very low cache queue counts and high active queue counts
when doing a systat -vm command. As the VM system
becomes more stressed, it makes a greater effort to maintain the various
page queues at the levels determined to be the most effective. An urban
myth has circulated for years that Linux did a better job avoiding
swapouts than FreeBSD, but this in fact is not true. What was actually
occurring was that FreeBSD was proactively paging out unused pages in
order to make room for more disk cache while Linux was keeping unused
pages in core and leaving less memory available for cache and process
pages. I do not know whether this is still true today.

Pre-Faulting and Zeroing Optimizations

Taking a VM fault is not expensive if the underlying page is already
in core and can simply be mapped into the process, but it can become
expensive if you take a whole lot of them on a regular basis. A good
example of this is running a program such as &man.ls.1; or &man.ps.1;
over and over again. If the program binary is mapped into memory but
not mapped into the page table, then all the pages that will be accessed
by the program will have to be faulted in every time the program is run.
This is unnecessary when the pages in question are already in the VM
Cache, so FreeBSD will attempt to pre-populate a process's page tables
with those pages that are already in the VM Cache. One thing that
FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For
example, if you run the &man.ls.1; program while running vmstat
1 you will notice that it always takes a certain number of
page faults, even when you run it over and over again. These are
zero-fill faults, not program code faults (which were pre-faulted in
already). Pre-copying pages on exec or fork is an area that could use
more study.

A large percentage of page faults that occur are zero-fill faults.
You can usually see this by observing the vmstat -s
output. These occur when a process accesses pages in its BSS area. The
BSS area is expected to be initially zero but the VM system does not
bother to allocate any memory at all until the process actually accesses
it. When a fault occurs the VM system must not only allocate a new page,
it must zero it as well. To optimize the zeroing operation the VM system
has the ability to pre-zero pages and mark them as such, and to request
pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs
whenever the CPU is idle but the number of pages the system pre-zeros is
limited in order to avoid blowing away the memory caches. This is an
excellent example of adding complexity to the VM system in order to
optimize the critical path.

Page Table Optimizations

The page table optimizations make up the most contentious part of
the FreeBSD VM design and they have shown some strain with the advent of
serious use of mmap(). I think this is actually a
feature of most BSDs though I am not sure when it was first introduced.
There are two major optimizations. The first is that hardware page
tables do not contain persistent state but instead can be thrown away at
any time with only a minor amount of management overhead. The second is
that every active page table entry in the system has a governing
pv_entry structure which is tied into the
vm_page structure. FreeBSD can simply iterate
through those mappings that are known to exist while Linux must check
all page tables that might contain a specific
mapping to see if it does, which can achieve O(n^2) overhead in certain
situations. It is because of this that FreeBSD tends to make better
choices on which pages to reuse or swap when memory is stressed, giving
it better performance under load. However, FreeBSD requires kernel
tuning to accommodate large-shared-address-space situations such as
those that can occur in a news system because it may run out of
pv_entry structures.

Both Linux and FreeBSD need work in this area. FreeBSD is trying to
maximize the advantage of a potentially sparse active-mapping model (not
all processes need to map all pages of a shared library, for example),
whereas Linux is trying to simplify its algorithms. FreeBSD generally
has the performance advantage here at the cost of wasting a little extra
memory, but FreeBSD breaks down in the case where a large file is
massively shared across hundreds of processes. Linux, on the other hand,
breaks down in the case where many processes are sparsely-mapping the
same shared library and also runs non-optimally when trying to determine
whether a page can be reused or not.

Page Coloring

We will end with the page coloring optimizations. Page coloring is a
performance optimization designed to ensure that accesses to contiguous
pages in virtual memory make the best use of the processor cache. In
ancient times (i.e. 10+ years ago) processor caches tended to map
virtual memory rather than physical memory. This led to a huge number of
problems including having to clear the cache on every context switch in
some cases, and problems with data aliasing in the cache. Modern
processor caches map physical memory precisely to solve those problems.
This means that two side-by-side pages in a process's address space may
not correspond to two side-by-side pages in the cache. In fact, if you
are not careful side-by-side pages in virtual memory could wind up using
the same page in the processor cache—leading to cacheable data
being thrown away prematurely and reducing CPU performance. This is true
even with multi-way set-associative caches (though the effect is
mitigated somewhat).

FreeBSD's memory allocation code implements page coloring
optimizations, which means that the memory allocation code will attempt
to locate free pages that are contiguous from the point of view of the
cache. For example, if page 16 of physical memory is assigned to page 0
of a process's virtual memory and the cache can hold 4 pages, the page
coloring code will not assign page 20 of physical memory to page 1 of a
process's virtual memory. It would, instead, assign page 21 of physical
memory. The page coloring code attempts to avoid assigning page 20
because this maps over the same cache memory as page 16 and would result
in non-optimal caching. This code adds a significant amount of
complexity to the VM memory allocation subsystem as you can well
imagine, but the result is well worth the effort. Page Coloring makes VM
memory as deterministic as physical memory with regard to cache
performance.

Conclusion

Virtual memory in modern operating systems must address a number of
different issues efficiently and for many different usage patterns. The
modular and algorithmic approach that BSD has historically taken allows
us to study and understand the current implementation as well as
relatively cleanly replace large sections of the code. There have been a
number of improvements to the FreeBSD VM system in the last several
years, and work is ongoing.

Bonus QA session by Allen Briggs
briggs@ninthwonder.com

What is the interleaving algorithm that you
refer to in your listing of the ills of the FreeBSD 3.x swap
arrangements?

FreeBSD uses a fixed swap interleave which defaults to 4. This
means that FreeBSD reserves space for four swap areas even if you
only have one, two, or three. Since swap is interleaved, the linear
address space representing the four swap areas will be
fragmented if you do not actually have four swap areas. For
example, if you have two swap areas A and B, FreeBSD's address
space representation for that swap area will be interleaved in
blocks of 16 pages:

A B C D A B C D A B C D A B C D

FreeBSD 3.x uses a "sequential list of free
regions" approach to accounting for the free swap areas.
The idea is that large blocks of free linear space can be
represented with a single list node
(kern/subr_rlist.c). But due to the
fragmentation the sequential list winds up being insanely
fragmented. In the above example, completely unused swap will
have A and B shown as free and C and D shown as
all allocated. Each A-B sequence requires a list
node to account for because C and D are holes, so the list node
cannot be combined with the next A-B sequence.

Why do we interleave our swap space instead of just tacking swap
areas onto the end and doing something fancier? Because it is a whole
lot easier to allocate linear swaths of an address space and have
the result automatically be interleaved across multiple disks than
it is to try to put that sophistication elsewhere.

The fragmentation causes other problems. Being a linear list
under 3.x, and having such a huge amount of inherent
fragmentation, allocating and freeing swap winds up being an O(N)
algorithm instead of an O(1) algorithm. Combine that with other
factors (heavy swapping) and you start getting into O(N^2) and
O(N^3) levels of overhead, which is bad. The 3.x system may also
need to allocate KVM during a swap operation to create a new list
node which can lead to a deadlock if the system is trying to
page out pages in a low-memory situation.

Under 4.x we do not use a sequential list. Instead we use a
radix tree and bitmaps of swap blocks rather than ranged list
nodes. We take the hit of preallocating all the bitmaps required
for the entire swap area up front but it winds up wasting less
memory due to the use of a bitmap (one bit per block) instead of a
linked list of nodes. The use of a radix tree instead of a
sequential list gives us nearly O(1) performance no matter how
fragmented the tree becomes.

I do not get the following:

It is important to note that the FreeBSD VM system attempts
to separate clean and dirty pages for the express reason of
avoiding unnecessary flushes of dirty pages (which eats I/O
bandwidth), nor does it move pages between the various page
queues gratuitously when the memory subsystem is not being
stressed. This is why you will see some systems with very low
cache queue counts and high active queue counts when doing a
systat -vm command.

How is the separation of clean and dirty (inactive) pages
related to the situation where you see low cache queue counts and
high active queue counts in systat -vm? Do the
systat stats roll the active and dirty pages together for the
active queue count?

Yes, that is confusing. The relationship is
goal versus reality. Our goal is to
separate the pages but the reality is that if we are not in a
memory crunch, we do not really have to.

What this means is that FreeBSD will not try very hard to
separate out dirty pages (inactive queue) from clean pages (cache
queue) when the system is not being stressed, nor will it try to
deactivate pages (active queue -> inactive queue) when the system
is not being stressed, even if they are not being used.

In the &man.ls.1; / vmstat 1 example,
would not some of the page faults be data page faults (COW from
executable file to private page)? I.e., I would expect the page
faults to be some zero-fill and some program data. Or are you
implying that FreeBSD does do pre-COW for the program data?

A COW fault can be either zero-fill or program-data. The
mechanism is the same either way because the backing program-data
is almost certainly already in the cache. I am indeed lumping the
two together. FreeBSD does not pre-COW program data or zero-fill,
but it does pre-map pages that exist in its
cache.

In your section on page table optimizations, can you give a
little more detail about pv_entry and
vm_page (or should vm_page be
vm_pmap—as in 4.4, cf. pp. 180-181 of
McKusick, Bostic, Karels, Quarterman)? Specifically, what kind of
operation/reaction would require scanning the mappings?

How does Linux do in the case where FreeBSD breaks down
(sharing a large file mapping over many processes)?

A vm_page represents an (object,index#)
tuple. A pv_entry represents a hardware page
table entry (pte). If you have five processes sharing the same
physical page, and three of those processes' page tables actually
map the page, that page will be represented by a single
vm_page structure and three
pv_entry structures.

pv_entry structures only represent pages
mapped by the MMU (one pv_entry represents one
pte). This means that when we need to remove all hardware
references to a vm_page (in order to reuse the
page for something else, page it out, clear it, dirty it, and so
forth) we can simply scan the linked list of
pv_entry's associated with that
vm_page to remove or modify the pte's from
their page tables.

Under Linux there is no such linked list. In order to remove
all the hardware page table mappings for a
vm_page Linux must index into every VM object
that might have mapped the page. For
example, if you have 50 processes all mapping the same shared
library and want to get rid of page X in that library, you need to
index into the page table for each of those 50 processes even if
only 10 of them have actually mapped the page. So Linux is
trading off the simplicity of its design against performance.
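
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustrative model, not kernel code: the VmPage and PvEntry classes, the dictionary-based page tables, and the helper names are assumptions made for the example, and the scanning function models the behavior this answer attributes to Linux rather than any actual Linux implementation.

```python
class PvEntry:
    """Hypothetical stand-in for a pv_entry: one per mapped pte."""
    def __init__(self, page_table, vaddr):
        self.page_table = page_table
        self.vaddr = vaddr


class VmPage:
    """Hypothetical stand-in for a vm_page: an (object, index) tuple."""
    def __init__(self, obj, index):
        self.obj = obj
        self.index = index
        self.pv_list = []  # only the mappings that actually exist


def map_page(page, page_table, vaddr):
    """Enter a mapping and record it on the page's pv list."""
    page_table[vaddr] = page
    page.pv_list.append(PvEntry(page_table, vaddr))


def remove_all_with_pv_list(page):
    """Walk only the page tables that really map the page."""
    visited = 0
    for pv in page.pv_list:
        del pv.page_table[pv.vaddr]
        visited += 1
    page.pv_list.clear()
    return visited


def remove_all_by_scanning(page, candidate_tables, vaddr):
    """Check every page table that might map the page."""
    visited = 0
    for table in candidate_tables:
        visited += 1
        if table.get(vaddr) is page:
            del table[vaddr]
    return visited


# 50 processes share a library; only 10 have mapped page X.
VADDR = 0x1000
tables = [dict() for _ in range(50)]
page_x = VmPage("shared library", 7)
for table in tables[:10]:
    map_page(page_x, table, VADDR)
print(remove_all_with_pv_list(page_x))   # 10 page tables touched

page_y = VmPage("shared library", 8)
for table in tables[:10]:
    table[VADDR] = page_y                # no pv list kept in this model
print(remove_all_by_scanning(page_y, tables, VADDR))   # 50 page tables touched
```

The pv list costs one small record per mapping, which is the memory overhead discussed in this answer; in exchange, removal work is proportional to the number of real mappings rather than to the number of processes that could have mapped the page.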
Many VM algorithms which are O(1) or (small N) under FreeBSD wind
up being O(N), O(N^2), or worse under Linux. Since the pte's
representing a particular page in an object tend to be at the same
offset in all the page tables they are mapped in, reducing the
number of accesses into the page tables at the same pte offset
will often avoid blowing away the L1 cache line for that offset,
which can lead to better performance.

FreeBSD has added complexity (the pv_entry
scheme) in order to increase performance (to limit page table
accesses to only those pte's that need to be
modified).

But FreeBSD has a scaling problem that Linux does not in that
there are a limited number of pv_entry
structures and this causes problems when you have massive sharing
of data. In this case you may run out of
pv_entry structures even though there is plenty
of free memory available. This can be fixed easily enough by
bumping up the number of pv_entry structures in
the kernel config, but we really need to find a better way to do
it.

With regard to the memory overhead of a page table versus the
pv_entry scheme: Linux uses
permanent page tables that are not throw away, but
does not need a pv_entry for each potentially
mapped pte. FreeBSD uses throw away page tables but
adds in a pv_entry structure for each
actually-mapped pte. I think memory utilization winds up being
about the same, giving FreeBSD an algorithmic advantage with its
ability to throw away page tables at will with very low
overhead.

Finally, in the page coloring section, it might help to have a
little more description of what you mean here. I did not quite
follow it.

Do you know how an L1 hardware memory cache works? I will
explain: Consider a machine with 16MB of main memory but only 128K
of L1 cache. Generally the way this cache works is that each 128K
block of main memory uses the same 128K of
cache. If you access offset 0 in main memory and then
offset 128K in main memory you can wind up throwing away the
cached data you read from offset 0!

Now, I am simplifying things greatly. What I just described
is what is called a direct mapped hardware memory
cache. Most modern caches are what are called
2-way-set-associative or 4-way-set-associative caches. The
set associativity allows you to access up to N different memory
regions that overlap the same cache memory without destroying the
previously cached data. But only N.

So if I have a 4-way set associative cache I can access offset
0, offset 128K, 256K and offset 384K and still be able to access
offset 0 again and have it come from the L1 cache. If I then
access offset 512K, however, one of the four previously cached
data objects will be thrown away by the cache.

It is extremely important…
extremely important for most of a processor's
memory accesses to be able to come from the L1 cache, because the
L1 cache operates at the processor frequency. The moment you have
an L1 cache miss and have to go to the L2 cache or to main memory,
the processor will stall and potentially sit twiddling its fingers
for hundreds of instructions worth of time
waiting for a read from main memory to complete. Main memory (the
dynamic ram you stuff into a computer) is
slow, when compared to the speed of a modern
processor core.

Ok, so now onto page coloring: All modern memory caches are
what are known as physical caches. They
cache physical memory addresses, not virtual memory addresses.
This allows the cache to be left alone across a process context
switch, which is very important.

But in the Unix world you are dealing with virtual address
spaces, not physical address spaces. Any program you write will
see the virtual address space given to it. The actual
physical pages underlying that virtual
address space are not necessarily physically contiguous! In fact,
you might have two pages that are side by side in a process's
address space which wind up being at offset 0 and offset 128K in
physical memory.

A program normally assumes that two side-by-side pages will be
optimally cached. That is, that you can access data objects in
both pages without having them blow away each other's cache entry.
But this is only true if the physical pages underlying the virtual
address space are contiguous (insofar as the cache is
concerned).

This is what Page coloring does. Instead of assigning
random physical pages to virtual addresses,
which may result in non-optimal cache performance, Page coloring
assigns reasonably-contiguous physical pages
to virtual addresses. Thus programs can be written under the
assumption that the characteristics of the underlying hardware
cache are the same for their virtual address space as they would
be if the program had been run directly in a physical address
space.

Note that I say reasonably contiguous rather
than simply contiguous. From the point of view of a
128K direct mapped cache, the physical address 0 is the same as
the physical address 128K. So two side-by-side pages in your
virtual address space may wind up being offset 128K and offset
132K in physical memory, but could also easily be offset 128K and
offset 4K in physical memory and still retain the same cache
performance characteristics. So page-coloring does
not have to assign truly contiguous pages of
physical memory to contiguous pages of virtual memory, it just
needs to make sure it assigns contiguous pages from the point of
view of cache performance and operation.
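
The arithmetic behind this can be shown with a minimal Python sketch. It is not FreeBSD's allocator: the function names are hypothetical, the 4K page size is an assumption inferred from the 128K/132K offsets above, and the "color" of a page is modeled simply as its page number modulo the number of pages the cache holds.

```python
PAGE_SIZE = 4 * 1024  # assumed page size (4K)


def color(physical_page, cache_pages):
    """Return the cache slot ("color") that a physical page lands in."""
    return physical_page % cache_pages


def pick_page(virtual_page, free_pages, cache_pages):
    """Pick a free physical page whose color matches the virtual page's.

    Falls back to the first free page when no matching color is free,
    and returns None when there are no free pages at all.
    """
    wanted = virtual_page % cache_pages
    for page in free_pages:
        if color(page, cache_pages) == wanted:
            return page
    return free_pages[0] if free_pages else None


# Small example from the article: the cache holds 4 pages and virtual
# page 0 is already backed by physical page 16 (color 0).
print(pick_page(1, [20, 21, 22], cache_pages=4))  # 21; page 20 shares color 0 with 16

# The 128K direct-mapped cache from this answer: 32 colors with 4K pages.
colors = 128 * 1024 // PAGE_SIZE
print(color(128 * 1024 // PAGE_SIZE, colors),  # offset 128K -> color 0
      color(132 * 1024 // PAGE_SIZE, colors),  # offset 132K -> color 1
      color(4 * 1024 // PAGE_SIZE, colors))    # offset 4K   -> color 1
```

Offsets 132K and 4K get the same color, which is why either one is an equally good neighbor for the page at offset 128K. The fallback branch reflects that coloring is a preference: when no free page of the wanted color exists, some page still has to be handed out.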