diff --git a/en/projects/bigdisk/Makefile b/en/projects/bigdisk/Makefile new file mode 100644 index 0000000000..2a29ffa58a --- /dev/null +++ b/en/projects/bigdisk/Makefile @@ -0,0 +1,17 @@ +# Summary of work needed to support large disks and arrays. +# +# $FreeBSD$ + +MAINTAINER= scottl + +.if exists(../Makefile.conf) +.include "../Makefile.conf" +.endif +.if exists(../Makefile.inc) +.include "../Makefile.inc" +.endif + +DOCS= index.sgml +DATA= style.css + +.include "${WEB_PREFIX}/share/mk/web.site.mk" diff --git a/en/projects/bigdisk/index.sgml b/en/projects/bigdisk/index.sgml new file mode 100644 index 0000000000..0b0182f0e7 --- /dev/null +++ b/en/projects/bigdisk/index.sgml @@ -0,0 +1,216 @@ + + + + %includes; + + +N/A"> +Done"> +In progress"> +Needs testing"> +Not done"> +Unknown"> + + + + %developers; + +]> + + + &header; + +
When the UFS filesystem was introduced to BSD in 1982, its use of 32 + bit offsets and counters to address the storage was considered to be + ahead of its time. Since most fixed-disk storage devices use 512 byte + sectors, 32 bits allowed for 2 Terabytes of storage. That was an almost + unimaginable quantity for the time. But now that 250 and 400 Gigabyte + disks are available at consumer prices, it's trivial to build a hardware + or software based storage array that can exceed 2TB for a few thousand + dollars.
+ +The UFS2 filesystem was introduced in 2003 as a replacement for the + original UFS and provides 64 bit counters and offsets. This allows for + files and filesystems to grow to 2^73 bytes (2^64 * 512) in size and + hopefully be sufficient for quite a long time. UFS2 largely solved + the storage size limits imposed by the filesystem. Unfortunately, many + tools and storage mechanisms still use or assume 32 bit values, often + keeping FreeBSD limited to 2TB.
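The size limits quoted above follow directly from the counter width and the 512 byte sector size. The sketch below is illustrative only; the names `SECTOR_SHIFT`, `capacity_log2`, and `capacity32_bytes` are hypothetical and do not come from the FreeBSD tree. It works with power-of-two exponents because 2^73 does not fit in any standard C integer type.

```c
#include <stdint.h>

/* Sector size assumed throughout the text: 512 bytes = 2^9. */
#define SECTOR_SHIFT 9

/*
 * Largest addressable capacity, returned as a power-of-two exponent,
 * when sector numbers are counter_bits wide.  Using exponents avoids
 * overflow for the 64-bit case (2^73 bytes).
 */
static unsigned
capacity_log2(unsigned counter_bits)
{
	return counter_bits + SECTOR_SHIFT;
}

/* Capacity in bytes for 32-bit sector numbers: 2^41 bytes, i.e. 2TB. */
static uint64_t
capacity32_bytes(void)
{
	return (UINT64_C(1) << 32) << SECTOR_SHIFT;
}
```

With these helpers, `capacity_log2(32)` gives 41 (the 2TB limit of the original UFS) and `capacity_log2(64)` gives 73 (the UFS2 limit).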
+ +We need to ensure that FreeBSD supports large storage sizes and that + the benefits of UFS2 can actually be realized so that FreeBSD can remain + relevant in the enterprise world. This page describes known issues and + limits and provides a focus for further auditing, validation, and + fixing.
+ +The first limit that is encountered is in disk partitioning. For x86 + and amd64 PCs, the FDISK MBR table is used by the BIOS to partition the + disk into logical extents and identify which partition ('slice' in FreeBSD + terms) to boot from. The MBR is defined to use 32 bit disk offsets, + and since it's an industry standard and interoperability is required, + there is nothing that can be done to change this. As long as booting a + PC requires the MBR, the boot slice in FreeBSD is going to be limited to + 2TB.
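To make the MBR limit concrete: each of the four entries in the MBR partition table is 16 bytes long and stores the starting sector and the sector count as 32-bit little-endian values at byte offsets 8 and 12. The following is a minimal sketch of decoding one entry, assuming 512 byte sectors; `read_le32`, `struct mbr_slice`, and `mbr_decode_entry` are hypothetical names chosen for illustration, not FreeBSD APIs.

```c
#include <stdint.h>

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t
read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decoded extent of one slice, in bytes (hypothetical helper type). */
struct mbr_slice {
	uint64_t start_bytes;
	uint64_t size_bytes;
};

/*
 * Decode the start and size fields of a raw 16-byte MBR table entry.
 * Both on-disk fields are only 32 bits wide, so neither can describe
 * more than 2^32 - 1 sectors, however large the disk really is.
 */
static struct mbr_slice
mbr_decode_entry(const uint8_t entry[16])
{
	struct mbr_slice s;

	s.start_bytes = (uint64_t)read_le32(entry + 8) * 512;
	s.size_bytes = (uint64_t)read_le32(entry + 12) * 512;
	return s;
}
```

An entry whose size field is all ones decodes to 2^41 - 512 bytes, just under 2TB, which is the largest slice the format can express.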
+ +The GPT partitioning scheme was introduced with the ia64 architecture + as an MBR replacement. It provides 64 bit offsets and allows for an + arbitrary number of partitions. It also provides a compatibility mode + with MBR where it can generate an MBR-compatible structure on the disk + for use with systems that don't understand GPT. However, to get the + full benefits for boot storage, the BIOS and the FreeBSD loader must + understand it. For secondary storage, GPT can be used by any + architecture regardless of BIOS or boot support.
+ +Many systems don't require an MBR or GPT, and even PCs don't require it + if booting and inter-operating with other OSes is not required. The next + limit that comes in, though, is with the BSD disklabel. This label + defines up to 8 partitions on a disk, MBR slice, or other storage extent + for filesystems and swap space. Unfortunately, the on-disk format of the + disk label again uses 32 bit quantities, so it is also limited to 2TB. + Fixing this would require creating a new format that is incompatible + with the old and would require an update to the FreeBSD boot loader. + This would complicate interoperability and the upgrade path. Also, if a + new format is going to be created, it should also address the 8 partition + limit that exists now. Given these requirements, it's tempting to just + adopt the GPT format instead for secondary storage partitioning.
+ + +Even though large drives are cheap, it still isn't always feasible or + economical to test on real hardware. Swap-backed memory disks, via the + md(4) driver, can provide a good substitute for some of the testing. + Backing with swap means that only the pages that are dirtied by data + are actually allocated, so a multi-terabyte storage can be simulated + with a minimum of physical RAM+swap. Note that this is less true with + UFS1 since it will initialize all of the inode blocks during newfs, + which will dirty quite a bit of data. But for UFS2, swap-backed md + has the potential for working well. Unfortunately, the kernel md driver + has a number of 32-bit size limits of its own that need to be fixed. + Details are provided below.
+ +It is still possible to avoid disklabels and MBRs for testing by + using newfs directly on the raw disk or md disk. Sysinstall can be + tested from a running system by selecting Expert mode and performing + only the MBR and disklabel steps. Beware that sysinstall might + have other bugs that will wipe out your existing system, so care must + be taken here!
+ + +The following userland tools need auditing and testing for 64-bit + cleanliness:
+ +| Task | +Responsible | +Last updated | +Status | +Details | +
|---|---|---|---|---|
| newfs_ffs | ++ | + | &status.new; | +A quick audit of newfs shows that the '-s' option uses atoi() + instead of strtoull() or equivalent. A more thorough audit is needed + to see if other integer limits exist. | +
| df | ++ | + | &status.new; | +An audit is needed to make sure that all reported fields are + 64-bit clean. There are reports with certain fields being incorrect + or negative with NFS volumes, which could either be an NFS or df + problem. | +
| du | ++ | + | &status.new; | +An audit is needed to make sure that all reported fields are + 64-bit clean. | +
| growfs | +&a.scottl; | +12 Sept 2004 | +&status.wip; | +Growfs has problems with expanding to new cylinder groups. It also + initializes UFS2 inode blocks instead of leaving them for lazy + initialization. It also needs a 64-bit audit. | +
| sysinstall | ++ | + | &status.new; | +A full audit is needed. Reports exist of problems with >1TB + partitions. | +
| fsck_ffs | ++ | + | &status.new; | +A full audit is needed. | +
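The newfs entry in the table above notes that the '-s' option is parsed with atoi() instead of strtoull(). atoi() returns an int and its behaviour on out-of-range input is undefined, so a sector count beyond the range of int cannot be passed reliably. The following is a minimal sketch of a 64-bit clean replacement; `parse_sector_count` is a hypothetical helper written for illustration and is not the code in newfs.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a decimal sector count into a 64-bit value.
 * Returns 0 on success and -1 on malformed or out-of-range input.
 */
static int
parse_sector_count(const char *arg, uint64_t *out)
{
	char *end;
	unsigned long long v;

	/*
	 * Require a leading digit: strtoull() would otherwise skip
	 * whitespace and silently accept (and negate) a '-' sign.
	 */
	if (arg == NULL || *arg < '0' || *arg > '9')
		return -1;

	errno = 0;
	v = strtoull(arg, &end, 10);
	if (errno != 0 || *end != '\0')
		return -1;	/* overflow or trailing garbage */

	*out = (uint64_t)v;
	return 0;
}
```

A count such as "8589934592" (2^33 sectors, or 4TB with 512 byte sectors) is accepted, while "-1", "12abc", and values too large for 64 bits are rejected rather than silently truncated.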
Many storage peripherals simply are not designed to handle >2TB + capacities. For those that are, an audit should be done to verify + that their drivers handle the sizes correctly and pass those sizes + correctly to the rest of the kernel.
+| Task | +Responsible | +Last updated | +Status | +Details | +
|---|---|---|---|---|
| md | +&a.scottl; | +12 Sept 2004 | +&status.wip; | +A number of sizes and offsets are tracked using the 'unsigned' + data type, so it appears that it cannot comprehend sizes greater + than 2TB with a 512 byte sector size. For the swap-backed module, + the page counter is also stored as a 32-bit quantity which also + might be a limiting factor. | +
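The kind of truncation described for md in the table above can be illustrated in isolation. The sketch below is not taken from the md driver; `size_seen_narrow` and `size_seen_wide` are hypothetical functions, and `uint32_t` stands in for 'unsigned' on the assumption that 'unsigned' is 32 bits wide, as it is on x86 and amd64.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u

/*
 * Size, in bytes, that code tracking the sector count in a 32-bit
 * variable would believe the media has.  Counts of 2^32 sectors or
 * more wrap around silently.
 */
static uint64_t
size_seen_narrow(uint64_t media_bytes)
{
	uint32_t nsect = (uint32_t)(media_bytes / SECTOR_SIZE);

	return (uint64_t)nsect * SECTOR_SIZE;
}

/* The same calculation with a 64-bit sector count: nothing is lost. */
static uint64_t
size_seen_wide(uint64_t media_bytes)
{
	uint64_t nsect = media_bytes / SECTOR_SIZE;

	return nsect * SECTOR_SIZE;
}
```

For a 3TB device (3 * 2^40 bytes), `size_seen_narrow` reports only 1TB because the sector count wraps modulo 2^32, while `size_seen_wide` reports the full 3TB.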