Linux (and me)

Do not expect general info here, at least not now. I will cover some very special details I had the "pleasure" to meet.

Extreme vfat file system fragmentation on multiprocessor systems

See the archives of the Linux Kernel Mailing List for the complete discussion.

2008-09-11 my posting

I would like to share the following observation, made with several kernels of the 2.6.x series on a T7100 Core2Duo CPU (effectively 2 processors). I did not find a similar post while searching.

Two applications compress data at the same time and do their best to avoid fragmenting the file system by writing blocks of 50 MByte to a VFAT (FAT32) partition on a SATA hard disk, cluster size 8 KByte. The resulting file size is 200 to 250 MByte, so 4 to 5 fragments per file would be acceptable. But at random, for approximately every 4th file, there are from a few hundred up to more than 4500 fragments (most often about 1500) for each of the two files written in parallel.
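The write pattern of one such application can be sketched roughly as follows. This is only an illustration, not the original program: the function name write_in_chunks and the zero-filled filler data are assumptions, and the real applications wrote compressed data.

```python
import os


def write_in_chunks(path, total_bytes, chunk_bytes):
    """Write total_bytes of filler to path, handing chunk_bytes to each write().

    Returns the number of write() calls that were needed.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    filler = bytes(chunk_bytes)  # zero bytes as a stand-in for compressed output
    calls = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        remaining = total_bytes
        while remaining > 0:
            view = memoryview(filler)[:min(chunk_bytes, remaining)]
            remaining -= len(view)
            while len(view) > 0:
                written = os.write(fd, view)  # may be short; retry the rest
                view = view[written:]
                calls += 1
    finally:
        os.close(fd)
    return calls
```

To come close to the situation described, two processes would each call this at the same time on the same vfat partition, with total_bytes around 200 to 250 MByte and chunk_bytes = 50 * 1024 * 1024.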

My best guess: in this case both CPU cores were in the cluster allocation function of the fat file system at (nearly) the same time, each allocating only a few clusters (my guess: 8) for its file before the other core got the next ones. The compression task is CPU bound. The hard disk could probably cater for 4 cores. This reverses for decompression.
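The guess above can be made concrete with a toy model. It is not the kernel's allocator: it assumes one shared "next free cluster" pointer and writers that strictly take turns, each grabbing a fixed batch of clusters. The names simulate and count_fragments are made up for this sketch.

```python
def count_fragments(clusters):
    """Count runs of consecutive cluster numbers in an allocation list."""
    if not clusters:
        return 0
    fragments = 1
    for prev, cur in zip(clusters, clusters[1:]):
        if cur != prev + 1:
            fragments += 1
    return fragments


def simulate(clusters_per_file, batch, writers=2):
    """Writers take turns taking `batch` clusters from one shared free pointer.

    Returns the fragment count of each writer's file.
    """
    next_free = 0
    files = [[] for _ in range(writers)]
    while any(len(f) < clusters_per_file for f in files):
        for f in files:
            want = min(batch, clusters_per_file - len(f))
            f.extend(range(next_free, next_free + want))
            next_free += want
    return [count_fragments(f) for f in files]
```

A 200 MByte file with 8 KByte clusters has 25600 clusters. In this model, simulate(25600, 6400) (one batch per 50 MByte block) gives 4 fragments per file, while simulate(25600, 8) gives 3200 per file. The observed numbers are lower than that worst case, which would fit the two cores colliding only part of the time.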

The files are ok, no corruption, just heavy fragmentation. I know vfat is not liked very much. Nevertheless, I hope someone with more Linux kernel coding experience than I have will fix this in the future.

vfat still seems to be the reliable way to exchange data across platforms (does anyone know an ext2 driver for Windows up to Vista that does not trash the file system every few days, or a reliable NTFS driver for Linux?). Anyway, this is a general design issue on SMP systems that one should not forget.

I tried the same with an ext2 file system. It showed very little fragmentation; most files were in 1 piece. Well done!

2008-09-12 my posting

Why should the C library fwrite() split anything? There is no good reason (unless it runs in the x86 64 KByte segmented model and tries to emulate a flat model, as in old DOS). The 50 MByte go to write() in a single piece here.

There are only very few "single operation"s in the kernel nowadays. In most cases this is a good thing.

It looks like fat_alloc_clusters() in fat/fatent.c is limited to allocating only 4 clusters under a single lock.

Found another assumption I do not like: cluster size >= 512. There are old FAT systems on SRAM cards having 128 byte/sector and cluster. But I don't want to have long filenames on them. Hope the 512 does not sit elsewhere, too.

In fat/inode.c, __fat_get_block() carries the comment /* TODO: multiple cluster allocation would be desirable */. YES, OF COURSE. Only a single cluster is allocated at a time, and there is no lock here. I can be happy that I still got 8 clusters per fragment; it might have been only 1.

2008-09-13 my posting: Compared with MS Windows

Both Windows Vista and XP SP3 produce very little or no fragmentation on the same FAT partition with the same workload (2 tasks writing in 50 MByte pieces, up to some hundred MByte file size). But XP SP3 has a stupid behavior, too: it always allocates new space behind previous allocations, even if old files have really been deleted (not just moved to the trash), at least within one session (I have not tried what happens after a reboot). This can result in multi-GByte holes of free space.

Kernel routes IP although OFF

I will have to dig out this ugly issue again. From memory, it was in 2.6.x and was even present in 2.4.x. The setup: 2 eth interfaces on different subnets, IP routing off in the kernel. On one Ethernet, packets were addressed directly to the Linux machine by Ethernet address, but the IP destination was the address of the other eth interface. (Usually this does not happen on the wire, but an attacker can easily construct it.) The expectation was that these packets would be dropped, because IP routing in the kernel was OFF.

But the kernel did route (?): it took the packets from the first eth and responded under the IP of the second eth, on the first eth. This worked very well for ping, after manual setup of ARP on the test partner machine. It worked for telnet, too, as far as I remember.

DANGER!
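When repeating such a test, it helps to confirm first that forwarding is really off. A minimal sketch, assuming the usual Linux sysctl file /proc/sys/net/ipv4/ip_forward; the function name ip_forwarding_enabled is made up here.

```python
def ip_forwarding_enabled(path="/proc/sys/net/ipv4/ip_forward"):
    """Return True if the sysctl file at `path` holds anything other than 0."""
    with open(path) as f:
        return f.read().strip() != "0"
```

A result of False only says that forwarding between interfaces is disabled. Whether that also covers the case described above, where the destination is one of the machine's own addresses, is exactly the open question.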

--- Author: Harun Scheutzow ------ Last change: 2011-07-12 ---