Cromfs: Compressed ROM filesystem for Linux (user-space)

0. Contents

This is the documentation of cromfs-1.5.4.
   1. Purpose
   2. News
   3. Overview
   4. Limitations
   5. Development status
   6. Comparing to other filesystems
      6.1. Compression tests
      6.2. Speed tests
   7. Getting started
   8. Tips
         8.0.1. To improve compression
         8.0.2. To improve mkcromfs speed
         8.0.3. To control the memory usage
         8.0.4. To control the filesystem speed
         8.0.5. Using cromfs with automount
   9. Understanding the concepts
         9.0.1. Inode
         9.0.2. Block
         9.0.3. Fblock
         9.0.4. Block number and block table
         9.0.5. Data locator
         9.0.6. Block indexing (mkcromfs only)
         9.0.7. Random compress period (mkcromfs only)
         9.0.8. Where are the inodes stored then?
   10. Using cromfs in bootdisks and tiny Linux distributions
   11. Other applications of cromfs
   12. Copying and contributing
      12.1. Contribution wishes
   13. Requirements
   14. Links
   15. Downloading

1. Purpose

Cromfs is a compressed read-only filesystem for Linux. It uses the LZMA compression algorithm from 7-zip and a powerful block merging mechanism that is especially efficient with gigabytes of large files having lots of redundancy.

The primary design goal of cromfs is compression power. It is much slower than its peers, and uses more RAM. If all you care about is "powerful compression" and "random file access", then you will be happy with cromfs.

The creation of cromfs was inspired by Squashfs and Cramfs.

The downloading section is at the bottom of this page.

2. News

See the ChangeLog.

3. Overview

[Figure: cromfs size demo]

See the documentation of the cromfs format for technical details (also included in the source package as doc/FORMAT).

4. Limitations

5. Development status

Development status: Stable. (Really: progressive.)
(Fully functional release exists, but is updated from time to time.)

Cromfs has been in beta stage for over a year, during which time very few bugs have been reported, and no known bugs remain at this time.

It does not make sense to keep it as "beta" indefinitely, but since there is never going to be a "final" version — new versions may always be released — it is now labeled as "progressive".

In practice, the author trusts it works as advertised, but as per GPL policy, there is NO WARRANTY whatsoever. The entire risk to the quality and performance of the program suite is with you.

#include "GNU gdb/show warranty"

6. Comparing to other filesystems

This is all probably very biased, hypothetical, and by no means a scientific study, but here goes:

Compression unit
  Cromfs:         adjustable arbitrarily (2 MB default)
  Cramfs:         adjustable, must be power of 2 (4 kB default)
  Squashfs (3.0): adjustable, must be power of 2 (64 kB max)
  Cloop:          adjustable in 512-byte units (1 MB max)

Files are compressed (up to block size limit)
  Cromfs:         Together
  Cramfs:         Individually
  Squashfs (3.0): Individually, except for fragments
  Cloop:          Together

Maximum file size
  Cromfs:         16 EB (2^64 bytes) (theoretical; actual limit depends on settings)
  Cramfs:         16 MB (2^24 bytes)
  Squashfs (3.0): 16 EB (2^64 bytes) (4 GB before v3.0)
  Cloop:          Depends on slave filesystem

Maximum filesystem size
  Cromfs:         16 EB (2^64 bytes)
  Cramfs:         272 MB
  Squashfs (3.0): 16 EB (2^64 bytes) (4 GB before v3.0)
  Cloop:          16 EB (2^64 bytes)

Duplicate whole file detection
  Cromfs:         Yes
  Cramfs:         No
  Squashfs (3.0): Yes
  Cloop:          No

Hardlinks detected and saved
  Cromfs:         Yes
  Cramfs:         Yes
  Squashfs (3.0): Yes, since v3.0
  Cloop:          Depends on slave filesystem

Near-identical file detection
  Cromfs:         Yes (identical blocks)
  Cramfs:         No
  Squashfs (3.0): No
  Cloop:          No

Compression method
  Cromfs:         LZMA
  Cramfs:         gzip (patches exist to use LZMA)
  Squashfs (3.0): gzip (patches exist to use LZMA)
  Cloop:          gzip or LZMA

Ownerships
  Cromfs:         uid,gid (since version 1.1.2)
  Cramfs:         uid,gid (but gid truncated to 8 bits)
  Squashfs (3.0): uid,gid
  Cloop:          Depends on slave filesystem

Timestamps
  Cromfs:         mtime only
  Cramfs:         None
  Squashfs (3.0): mtime only
  Cloop:          Depends on slave filesystem

Endianness safety
  Cromfs:         Theoretically safe (untested on big-endian)
  Cramfs:         Safe, but not exchangeable
  Squashfs (3.0): Safe, but not exchangeable
  Cloop:          Depends on slave filesystem

Linux kernel driver
  Cromfs:         No
  Cramfs:         Yes
  Squashfs (3.0): Yes
  Cloop:          Yes

Userspace driver
  Cromfs:         Yes (fuse)
  Cramfs:         No
  Squashfs (3.0): An extraction tool (unsquashfs)
  Cloop:          An extraction tool (extract_compressed_fs), but cannot be used to extract a single file

Windows driver
  Cromfs:         No
  Cramfs:         No
  Squashfs (3.0): No
  Cloop:          No

Appending to a previously created filesystem
  Cromfs:         No
  Cramfs:         No
  Squashfs (3.0): Yes
  Cloop:          No (the slave filesystem can be decompressed, modified, and compressed again, but in a sense, so can every other of these.)

Mounting as read-write
  Cromfs:         No
  Cramfs:         No
  Squashfs (3.0): No
  Cloop:          No

Supported inode types
  Cromfs:         all
  Cramfs:         all
  Squashfs (3.0): all
  Cloop:          Depends on slave filesystem

Fragmentation (good for compression, bad for access speed)
  Cromfs:         Depends on compression settings
  Cramfs:         None
  Squashfs (3.0): File tails only
  Cloop:          Depends on slave filesystem

Holes (aka. sparse files); storage optimization of blocks which consist entirely of nul bytes
  Cromfs:         Any two identical blocks are merged and stored only once.
  Cramfs:         Supported
  Squashfs (3.0): Not supported
  Cloop:          Depends on slave filesystem

Padding (partially filled sectors, wastes space)
  Cromfs:         No
  Cramfs:         Unknown
  Squashfs (3.0): Mostly not
  Cloop:          Depends on slave filesystem, usually yes

Extended attributes
  Cromfs:         No
  Cramfs:         Unknown
  Squashfs (3.0): Unknown
  Cloop:          Unknown, may depend on slave filesystem

Note: If you notice that this table contains wrong information, please contact me telling what it is and I will change it.

Note: cromfs now saves the uid and gid in the filesystem. However, when the uid is 0 (root), the cromfs-driver returns the uid of the user who mounted the filesystem, instead of root. Similarly for gid. This is both for backward compatibility and for security.
If you mount as root, this behavior has no effect.
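
The ownership rule in the note above can be written down as a tiny function. This is only a sketch of the described behavior, not code from cromfs-driver; the function name and its parameters are made up for illustration.

```python
def presented_owner(stored_uid, stored_gid, mounter_uid, mounter_gid):
    """Return the (uid, gid) a file appears to have after mounting.

    Hypothetical helper: a stored id of 0 (root) is replaced by the id
    of the user who mounted the filesystem; other ids pass through.
    """
    uid = mounter_uid if stored_uid == 0 else stored_uid
    gid = mounter_gid if stored_gid == 0 else stored_gid
    return uid, gid


print(presented_owner(0, 0, 1000, 100))    # (1000, 100)
print(presented_owner(33, 0, 1000, 100))   # (33, 100)
print(presented_owner(0, 0, 0, 0))         # (0, 0): mounting as root changes nothing
```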

6.1. Compression tests

Note: I use the -e and -r options in all of these mkcromfs tests to avoid unnecessary decompression+recompression steps, in order to speed up the filesystem generation. This has no effect on the compression ratio.

In this table, k equals 1024 bytes (2^10) and M equals 1048576 bytes (2^20).

Note: Again, these tests have not been peer-verified so it is not a real scientific study. But I attest that these are the results I got.
Test data sets:
  NES ROMs: 10783 NES ROMs (2523 MB)
  Firefox:  Firefox 2.0.0.5 source code (233 MB) (MD5sum 5a6ca3e4ac3ebc335d473cd3f682a916)
  DSL:      Damn small Linux liveCD (113 MB) (size taken from "du -c" output in the uncompressed filesystem)

Cromfs
  NES ROMs: mkcromfs -s65536 -c16 -a… -b… -f…
    With 16M fblocks, 2k blocks: 198,553,574 bytes (v1.4.1)
    With 16M fblocks, 1k blocks: 194,813,427 bytes (v1.4.1)
    With 16M fblocks, ¼k blocks: 187,575,926 bytes (v1.5.0)
  Firefox: mkcromfs
    With default options: 33,866,164 bytes (v1.5.2)
    (Peak memory use (RSS): 97 MB, mostly comprising memory-mapped files)
  DSL: mkcromfs -f1048576
    With 64k blocks (-b65536): 39,778,030 bytes (v1.2.0)
    With 16k blocks (-b16384): 39,718,882 bytes (v1.2.0)
    With 1k blocks (-b1024): 40,141,729 bytes (v1.2.0)

Cramfs v1.1
  NES ROMs: mkcramfs -b65536
    dies prematurely, "filesystem too big"
  Firefox: mkcramfs
    with 2M blocks (-b2097152): 65,011,712 bytes
    with 64k blocks (-b65536): 64,618,496 bytes
    with 4k blocks (-b4096): 77,340,672 bytes
  DSL: mkcramfs -b65536
    51,445,760 bytes

Squashfs v3.2
  NES ROMs: mksquashfs -b65536
    (using an optimized sort file) 1,185,546,240 bytes
  Firefox: mksquashfs
    49,139,712 bytes
  DSL: mksquashfs -b65536
    50,028,544 bytes

Cloop v2.05~20060829
  NES ROMs: create_compressed_fs
    (using an iso9660 image created with mkisofs -R)
    using 7zip, 1M blocks (-B1048576 -t2 -L-1): 1,136,789,006 bytes
  Firefox: create_compressed_fs
    (using an iso9660 image created with mkisofs -RJ)
    using 7zip, 1M blocks (-B1048576 -L-1): 46,726,041 bytes
    (1 MB is the maximum block size in cloop)
  DSL: create_compressed_fs
    (using an iso9660 image)
    using 7zip, 1M blocks (-B1048576 -L-1): 48,328,580 bytes
    using zlib, 64k blocks (-B65536 -L9): 50,641,093 bytes

7-zip (p7zip) v4.30 (an archive, not a filesystem)
  NES ROMs: 7za -mx9 -ma=2 a
    with 32M blocks (-md=32m): 235,037,017 bytes
    with 128M blocks (-md=128m): 222,523,590 bytes
    with 256M blocks (-md=256m): 212,533,778 bytes
  Firefox: 7za -mx9 -ma=2 -md=256m a
    29,079,247 bytes
    (Peak memory use: 2545 MiB)
  DSL: 7za -mx9 -ma2 a
    37,205,238 bytes

An explanation why mkcromfs beats 7-zip in the NES ROM packing test:

7-zip packs all the files together as one stream. The maximum dictionary size in 32-bit mode is 256 MB. (Note: The default for "maximum compression" is 32 MB.) When 256 MB of data has been packed and more data comes in, similarities between the first megabytes of data and the latest data are not utilized. For example, Mega Man and Rockman are two almost identical versions of the same image, but because there are more than 400 MB of files between them when they are processed in alphabetical order, 7-zip does not see that they are similar, and compresses each one separately.
7-zip's chances could be improved by sorting the files so that it processes similar images sequentially. It already attempts this by sorting the files by filename extension and filename, but that is not always the optimal order, as shown here.

mkcromfs, however, keeps track of all blocks it has encoded, and will remember similarities no matter how long ago they were added to the archive. (See section 9, "Understanding the concepts", to read how it does that.) This is why it outperforms 7-zip in this case, even when it only used 16 MB fblocks.

In the liveCD compressing test, mkcromfs does not beat 7-zip because this advantage is too minor to overcome the overhead needed to provide random access to the filesystem. It still beats cloop, squashfs and cramfs though.

6.2. Speed tests

Speed testing hasn't been done yet. It is difficult to test the speed, because it depends on factors such as cache (with compressed filesystems, decompression consumes CPU power but usually only needs to be done once) and block size (bigger blocks need more time to decompress).

However, in the general case, it is quite safe to assume that mkcromfs is the slowest of all. The same goes for resource testing (RAM).

cromfs-driver requires an amount of RAM proportional to a few factors. It can be approximated with this formula:

Max_RAM_usage = FBLOCK_CACHE_MAX_SIZE × fblock_size + READDIR_CACHE_MAX_SIZE × 60k + 8 × num_blocks

Where

For example, for a 500 MB archive with 16 kB blocks and 1 MB fblocks, the memory usage would be around 10.2 MB.
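
As a rough aid, the formula above can be evaluated with a few lines of Python. This is only a sketch: the function name is invented here, and the cache sizes used in the example call (10 cached fblocks, 3 cached directory listings) are hypothetical inputs rather than confirmed driver constants, so the printed figure need not match the 10.2 MB quoted above.

```python
def estimate_driver_ram(fblock_cache_max_size, fblock_size,
                        readdir_cache_max_size, num_blocks,
                        dir_size_estimate=60 * 1024):
    """Approximate peak RAM use of cromfs-driver, in bytes.

    Mirrors the formula in the text: cached fblocks, plus cached
    directory listings (about 60k each), plus 8 bytes per block.
    """
    return (fblock_cache_max_size * fblock_size
            + readdir_cache_max_size * dir_size_estimate
            + 8 * num_blocks)


MB = 1024 * 1024
# 500 MB of file data split into 16 kB blocks.
num_blocks = (500 * MB) // (16 * 1024)
# Hypothetical cache sizes; substitute the values your build uses.
total = estimate_driver_ram(10, 1 * MB, 3, num_blocks)
print(f"{num_blocks} blocks, about {total / MB:.1f} MB")
```

The dominant term is the fblock cache, so the fblock size chosen at creation time matters far more for driver memory than the block count does.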

7. Getting started

  1. Install the development requirements: make, gcc-c++ and fuse
    • Remember that for fuse to work, the kernel must also include fuse support. Run "modprobe fuse", then check that "/dev/fuse" exists and that it works.
      • If "/dev/fuse" does not exist after loading the "fuse" module, create it manually (as root):
        # cd /dev
        # mknod fuse c 10 229
      • If an attempt to read from "/dev/fuse" (as root) gives "no such device", it does not work. If it gives "operation not permitted", it might work.
  2. Configure the source code:
    $ ./configure
    It will automatically determine your software environment (mainly, the features supported by your compiler).
  3. Build the programs:
    $ make

    This builds the programs "cromfs-driver", "cromfs-driver-static", "util/mkcromfs", "util/cvcromfs" and "util/unmkcromfs".

  4. Create a sample filesystem:
    $ util/mkcromfs . sample.cromfs
  5. Mount the sample filesystem:
    $ mkdir sample
    $ ./cromfs-driver sample.cromfs sample
  6. Observe the sample filesystem:
    $ cd sample
    $ du
    $ ls -al
  7. Unmount the filesystem:
    $ cd ..
    $ fusermount -u sample

8. Tips

8.0.1. To improve compression

To improve the compression, try these tips:

8.0.2. To improve mkcromfs speed

To improve the filesystem generation speed, try these tips:

8.0.3. To control the memory usage

To control the memory usage, use these tips:

8.0.4. To control the filesystem speed

To control the filesystem speed, use these tips:

8.0.5. Using cromfs with automount

Since version 1.3.0, you can use cromfs in conjunction with the automount (autofs) feature present in the Linux kernel. This allows you to mount cromfs volumes automatically on demand, and unmount them when they are not used, conserving free memory.

This line in your autofs file (such as auto.misc) will do the trick (assuming the path you want is "books", and your volume is located at "/home/myself/books.cromfs"):

books -fstype=fuse,ro,allow_other    :/usr/local/bin/cromfs-driver\#/home/myself/books.cromfs

9. Understanding the concepts

Skip over this section if you don't consider yourself technically inclined.

9.0.1. Inode

Every object in a filesystem (from user's side) is an "inode". This includes at least symlinks, directories, files, fifos and device entries. The inode contains the file attributes and its contents, but not its name. (The name is contained in a directory listing, along with the reference to the inode.) This is the traditional way in *nix systems.

When a file is "hardlinked" into multiple locations in the filesystem, the inode is not copied. The inode number is simply listed in multiple directories.
A symlink however, is an entirely new inode unrelated to the file it points to.

The file attributes and the file contents are stored separately. In cromfs, the inode contains an array of block numbers, which are necessary in finding the actual contents of the file.

9.0.2. Block

The contents of every file (denoted by the inode) are divided into "blocks". The size of these blocks is controlled by the --bsize commandline parameter. For example, if your file is 10000 bytes in size, and your bsize is 4000, the file contains three blocks: 4000 + 4000 + 2000 bytes. The inode thus contains three block numbers, which refer to entries in the block table.
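
The splitting in this example can be reproduced with a minimal Python sketch. The function name is made up for illustration, and this is not mkcromfs code; it only shows the arithmetic of cutting contents into bsize-sized pieces.

```python
def split_into_blocks(data: bytes, bsize: int) -> list[bytes]:
    """Cut file contents into consecutive blocks of at most bsize bytes."""
    if bsize <= 0:
        raise ValueError("bsize must be positive")
    return [data[i:i + bsize] for i in range(0, len(data), bsize)]


contents = bytes(10000)                 # a 10000-byte file of nul bytes
blocks = split_into_blocks(contents, 4000)
print([len(b) for b in blocks])         # [4000, 4000, 2000]
```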

Only regular files, symlinks and directories have "contents" that need storing. Device entries for example, do not have associated contents.
The contents of a directory is a list of file names and inode numbers.

Every time mkcromfs stores a new block, a new block number is generated to denote that particular block (this number is stored in the inode), and a new data locator is stored to describe where the block is found (the locator is stored in the block table).

If mkcromfs reused a previously generated data locator, only the block number needs to be stored.
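
A toy model may make the relationship between block numbers and data locators more concrete. This is a sketch under simplifying assumptions, not the mkcromfs implementation: there is a single fblock that only grows, blocks are recognized by a SHA-256 digest (the hash mkcromfs really uses is not stated here), and the class and attribute names are invented.

```python
import hashlib


class BlockStore:
    """Toy model: one ever-growing fblock plus a block table."""

    def __init__(self):
        self.fblock = bytearray()   # stand-in for fblock storage
        self.blktab = []            # block number -> (fblock_no, offset)
        self._known = {}            # block digest -> block number

    def store(self, block: bytes) -> int:
        """Return the block number to record in the inode."""
        key = hashlib.sha256(block).digest()
        if key in self._known:
            # Identical block seen before: reuse its locator, add nothing.
            return self._known[key]
        offset = len(self.fblock)
        self.fblock += block
        self.blktab.append((0, offset))     # new data locator
        self._known[key] = len(self.blktab) - 1
        return self._known[key]


store = BlockStore()
inode_a = [store.store(b) for b in (b"AAAA", b"BBBB", b"AAAA")]
inode_b = [store.store(b) for b in (b"BBBB", b"CCCC")]
print(inode_a, inode_b, len(store.blktab))  # [0, 1, 0] [1, 2] 3
```

Five blocks were stored across two files, but the block table holds only three locators, because two of the blocks were repeats.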

9.0.3. Fblock

Fblock is a storage unit in a cromfs filesystem. It is the physical container of block data for multiple files.
When mkcromfs creates a new filesystem, it splits each file into blocks (see above), and for each of those blocks, it determines which fblock they go to. The maximum fblock size is mandated by the --fsize commandline parameter.

Each fblock is compressed separately, so a few big fblocks compress better than many small fblocks. Cromfs automatically creates as many fblocks as are needed to store the contents of the entire filesystem being created.

An fblock is merely a storage container. Regardless of the sizes of the blocks and fblocks, the fblock may contain any number of blocks, from 1 upwards (no upper limit). It is beneficial for blocks to overlap, and this is an important source of the power of cromfs.

The working principle behind fblocks is: What is the shortest string that can contain all these substrings?
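
To illustrate the idea of overlapping blocks, here is a greedy sketch in Python. It is a simplification and not the search mkcromfs performs: it handles a single fblock, ignores the --fsize limit, and only considers a block that already appears in the fblock or that overlaps its tail. The function name is invented.

```python
def place_block(fblock: bytearray, block: bytes) -> int:
    """Store block in fblock, reusing existing bytes where possible.

    Returns the offset at which the block can be read back.
    """
    found = fblock.find(block)
    if found != -1:
        return found                    # already fully present
    # Longest suffix of fblock that is also a prefix of block.
    for overlap in range(min(len(fblock), len(block)), 0, -1):
        if fblock.endswith(block[:overlap]):
            break
    else:
        overlap = 0                     # nothing in common with the tail
    offset = len(fblock) - overlap
    fblock += block[overlap:]           # append only the missing part
    return offset


fblock = bytearray()
offsets = [place_block(fblock, b) for b in (b"abcdef", b"defghi", b"cdefg")]
print(offsets, bytes(fblock))           # [0, 3, 2] b'abcdefghi'
```

Three blocks totalling 17 bytes fit into 9 bytes of fblock data, because the second block overlaps the tail of the first and the third is already contained in the result.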

9.0.4. Block number and block table

The filesystem contains a structure called "blktab" (block table), which is a list of data locators. This list is indexed by a block number.
Each locator describes where to find the particular block denoted by this block number.

At the end of the filesystem creation process, the blktab is compressed and becomes "blkdata" before being written into the filesystem.
(These names are only useful when referencing the filesystem format documentation; they are not found in the filesystem itself.)

9.0.5. Data locator

A data locator tells cromfs where to find the contents of this particular block. It is composed of an fblock number and an offset into that fblock. These locators are stored in the global block table, as explained above.

Multiple files may share the same data locators, and multiple data locators may point to the same, partially overlapping data.
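
Reading a file back through this chain (inode, block table, data locator, fblock) could look roughly like the following. It is a sketch with invented names that works on already decompressed fblocks held in memory; the on-disk encoding of locators is described in the format documentation, not here.

```python
from typing import NamedTuple


class Locator(NamedTuple):
    fblock_no: int   # which fblock holds the data
    offset: int      # where in that fblock the block starts


def read_file(block_numbers, blktab, fblocks, bsize, file_size):
    """Reassemble file contents from an inode's list of block numbers."""
    out = bytearray()
    for block_no in block_numbers:
        loc = blktab[block_no]
        want = min(bsize, file_size - len(out))   # last block may be short
        out += fblocks[loc.fblock_no][loc.offset:loc.offset + want]
    return bytes(out)


fblocks = [b"abcdefghi"]                          # decompressed fblock 0
blktab = [Locator(0, 0), Locator(0, 3), Locator(0, 2)]
print(read_file([0, 1], blktab, fblocks, 6, 12))  # b'abcdefdefghi'
print(read_file([2], blktab, fblocks, 6, 5))      # b'cdefg'
```

All three locators point into the same nine bytes, and their ranges partially overlap.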

9.0.6. Block indexing (mkcromfs only)

When mkcromfs stores blocks, it remembers where it stored them, so that if it later finds an identical block in another file (or the same file), it won't need to search fblocks again to find a best placement.
The index is a map of block hashes to data locators and block numbers.

The --autoindexperiod (-A) setting extends this mechanism: in addition to the blocks it has already encoded, mkcromfs memorizes more locations in those fblocks. It creates "just in case" data locators for future use, but does not save them in the block table unless they are utilized later. This helps compression when the number of fblocks searched (--bruteforcelimit) is low compared to the number of fblocks generated, at the cost of memory consumed by mkcromfs. It can make mkcromfs faster, but it can also make it slower.
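
The "just in case" locators can be pictured with a short sketch. Note the assumptions: the text does not say how the period is measured, so this example treats it as a byte stride between indexed positions, uses SHA-256 as the block hash, and keeps everything in one dictionary. None of this is taken from the mkcromfs source, and the function name is invented.

```python
import hashlib


def auto_index(fblock: bytes, fblock_no: int, bsize: int, period: int):
    """Map digests of bsize-long windows to (fblock_no, offset).

    Assumption: one window is indexed every `period` bytes.
    """
    index = {}
    for offset in range(0, len(fblock) - bsize + 1, period):
        digest = hashlib.sha256(fblock[offset:offset + bsize]).digest()
        index.setdefault(digest, (fblock_no, offset))  # keep first hit
    return index


fblock = b"abcdefghijklmnop"
index = auto_index(fblock, 0, bsize=4, period=2)
print(index.get(hashlib.sha256(b"efgh").digest()))  # (0, 4)
print(index.get(hashlib.sha256(b"fghi").digest()))  # None: offset 5 not indexed
```

A smaller period indexes more positions, which finds more reusable data without searching the fblock but costs more memory.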

9.0.7. Random compress period (mkcromfs only)

When mkcromfs runs, it generates a temporary file for each fblock of the resulting filesystem. If your resulting filesystem is large, those uncompressed temporary fblocks will take even more space.
To save disk space, mkcromfs compresses those fblocks when they are not accessed. However, if it needs to access them again (to search the contents for a match), it will need to decompress them first.

This compressing+decompressing may consume lots of time. It does not help the size of the resulting filesystem; it only saves some temporary disk space.

If you are not concerned about temporary disk space, you should give the --randomcompressperiod option a large number (such as 10000) to prevent it from needlessly decompressing+compressing the fblocks over and over again. This will improve the speed of mkcromfs.

The --decompresslookups option is related. If you use the --randomcompressperiod option, you should also enable --decompresslookups.

By the way, the temporary files are written into wherever your TEMP environment variable points to. TMP is also recognized.

9.0.8. Where are the inodes stored then?

All the inodes of the filesystem are also stored together in a file. That file is packed like any other file: split into blocks and scattered into fblocks. The data locator list of that file is stored in a special inode called "inotab", which is not seen in any directory. The "inotab" has its own place in the cromfs file.

10. Using cromfs in bootdisks and tiny Linux distributions

Cromfs can be used in bootdisks and tiny Linux distributions only by starting the cromfs-driver from a ramdisk (initrd), and then pivot_rooting into the mounted filesystem (but not before the filesystem has been initialized; there is a delay of a few seconds).

Theoretical requirements to use cromfs in the root filesystem:

Do not use cromfs in machines that are low on RAM!

11. Other applications of cromfs

The compression algorithm in cromfs can be used to determine how similar some files are to each other.

This is an example output of the following command:

$ unmkcromfs --simgraph fs.cromfs '*.qh' > result.xml
from a sample filesystem:
<?xml version="1.0" encoding="UTF-8"?>
<simgraph>
 <volume>
  <total_size>64016101</total_size>
  <num_inodes>7</num_inodes>
  <num_files>307</num_files>
 </volume>
 <inodes>
  <inode id="5595"><file>45/qb5/ir/basewc.qh</file></inode>
  <inode id="5775"><file>45/qb5/ir/edit.qh</file></inode>
  <inode id="5990"><file>45/qb5/ir/help.qh</file></inode>
  <inode id="6220"><file>45/qb5/ir/oemwc.qh</file></inode>
  <inode id="6426"><file>45/qb5/ir/qbasic.qh</file></inode>
  <inode id="18833"><file>c6ers/newcmds/toolib/doc/contents.qh</file></inode>
  <inode id="19457"><file>c6ers/newcmds/toolib/doc/index.qh</file></inode>
 </inodes>
 <matches>
  <match inode1="5595" inode2="5990"><bytes>396082</bytes><ratio>0.5565442944</ratio></match>
  <match inode1="5595" inode2="6220"><bytes>456491</bytes><ratio>0.6414264256</ratio></match>
  <match inode1="5990" inode2="6220"><bytes>480031</bytes><ratio>0.6732618693</ratio></match>
 </matches>
</simgraph>
It reads a cromfs volume generated earlier, and outputs statistics of it. Such statistics can be useful in refining further compression, or just finding useful information regarding the redundancy of the data set.

It follows this DTD:

 <!ENTITY % INTEGER "#PCDATA">
 <!ENTITY % REAL "#PCDATA">
 <!ENTITY % int "CDATA">
 <!ELEMENT simgraph (volume, inodes, matches)>
 <!ELEMENT volume (total_size, num_inodes, num_files)>
 <!ELEMENT total_size (%INTEGER;)>
 <!ELEMENT num_inodes (%INTEGER;)>
 <!ELEMENT num_files (%INTEGER;)>
 <!ELEMENT inodes (inode*)>
 <!ELEMENT inode (file+)>
 <!ATTLIST inode id %int; #REQUIRED>
 <!ELEMENT file (#PCDATA)>
 <!ELEMENT matches (match*)>
 <!ELEMENT match (bytes, ratio)>
 <!ATTLIST match inode1 %int; #REQUIRED>
 <!ATTLIST match inode2 %int; #REQUIRED>
 <!ELEMENT bytes (%INTEGER;)>
 <!ELEMENT ratio (%REAL;)>
Once you have generated the filesystem, running the --simgraph query is a relatively cheap operation (but still O(n^2) in the number of files); it involves analyzing the structures created by mkcromfs, and does not require any search on the actual file contents. However, the similarity information it reports can only be as fine-grained as the options used in the generation of the filesystem (level of compression) allow.
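
The XML output is easy to post-process. As one possible sketch, the following Python snippet uses the standard xml.etree.ElementTree module to list the matches from most to least similar. The embedded sample is a trimmed copy of the output shown above (XML declaration and unmatched inodes dropped); the function name is invented for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<simgraph>
 <volume>
  <total_size>64016101</total_size>
  <num_inodes>7</num_inodes>
  <num_files>307</num_files>
 </volume>
 <inodes>
  <inode id="5595"><file>45/qb5/ir/basewc.qh</file></inode>
  <inode id="5990"><file>45/qb5/ir/help.qh</file></inode>
  <inode id="6220"><file>45/qb5/ir/oemwc.qh</file></inode>
 </inodes>
 <matches>
  <match inode1="5595" inode2="5990"><bytes>396082</bytes><ratio>0.5565442944</ratio></match>
  <match inode1="5595" inode2="6220"><bytes>456491</bytes><ratio>0.6414264256</ratio></match>
  <match inode1="5990" inode2="6220"><bytes>480031</bytes><ratio>0.6732618693</ratio></match>
 </matches>
</simgraph>"""


def ranked_matches(xml_text):
    """Return (ratio, bytes, file1, file2) tuples, most similar first."""
    root = ET.fromstring(xml_text)
    # An inode may list several files (hardlinks); use the first name.
    names = {inode.get("id"): inode.findtext("file")
             for inode in root.iter("inode")}
    rows = [(float(m.findtext("ratio")), int(m.findtext("bytes")),
             names.get(m.get("inode1")), names.get(m.get("inode2")))
            for m in root.iter("match")]
    return sorted(rows, key=lambda row: row[0], reverse=True)


for ratio, nbytes, first, second in ranked_matches(SAMPLE):
    print(f"{ratio:.3f}  {nbytes:>8}  {first}  <->  {second}")
```

To run it on real output, read result.xml into a string and pass that to ranked_matches instead of SAMPLE.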

12. Copying and contributing

cromfs has been written by Joel Yliluoma, a.k.a. Bisqwit,
and is distributed under the terms of the General Public License version 3 (GPL3).
The LZMA code from LZMA SDK embedded within is licensed under LGPL.
The BWT code from libGRzip embedded within is licensed under GPL 2 or higher.

Patches and other related material can be submitted to the author, Joel Yliluoma, by e-mail.

The author also wishes to hear if you use cromfs, and for what you use it and what you think of it.

12.1. Contribution wishes

The author wishes for the following things to be done to this package.

13. Requirements

15. Downloading

The official home page of cromfs is at http://iki.fi/bisqwit/source/cromfs.html.
Check there for new versions.

Generated from progdesc.php (last updated: Sun, 26 Aug 2007 13:52:46 +0300)
with docmaker.php (last updated: Sun, 12 Jun 2005 06:08:02 +0300)
at Mon, 15 Oct 2007 01:12:07 +0300