[LINUX] The story of the "hole" in the file

$ dd if=/dev/zero of=testfile1 bs=1 seek=104857599 count=1 ; ls -ls testfile1
1+0 records in
1+0 records out
1 bytes transferred in 0.000114 secs (8775 bytes/sec)
8 -rw-r--r--  1 user  staff  104857600  5 25 13:24 testfile1
$

This creates a file that is 100 MB in size but consumes only 8 blocks on disk. You can do something similar with the Linux (coreutils) command truncate or the Qemu administration command qemu-img.

$ truncate -s 100M testfile2
$ qemu-img create -f raw testfile3 100M
$

dd uses two system calls, lseek(2) and write(2), while truncate and qemu-img use a single system call, ftruncate(2), but the results are almost the same[^1].

[^1]: Since dd writes 1 byte at the end, a disk block is allocated for that part. With truncate and qemu-img, on the other hand, nothing may be allocated at all. How much is actually allocated depends on the file system implementation.
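
The two approaches can be sketched in Python, whose os module exposes the same system calls. This is only an illustration, not what dd or truncate do internally line for line; the function names are hypothetical, and the number of blocks actually allocated depends on the file system.

```python
import os

SIZE = 100 * 1024 * 1024  # 100 MiB, the same size as in the dd example


def make_sparse_by_seek(path, size=SIZE):
    """lseek(2) to the last byte and write(2) one zero byte (the dd approach)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.lseek(fd, size - 1, os.SEEK_SET)
        os.write(fd, b"\0")
    finally:
        os.close(fd)


def make_sparse_by_truncate(path, size=SIZE):
    """ftruncate(2) the file out to the requested size (the truncate approach)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(fd, size)
    finally:
        os.close(fd)


def usage(path):
    """Return (apparent size, allocated bytes).

    Assumption: st_blocks counts 512-byte units, as it does on Linux.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

Calling `usage()` on a file made by either function should report the full apparent size, while the allocated size is typically far smaller on a file system that supports holes.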

When such a file is read, zero-filled data is returned; when it is written, disk blocks are allocated at that point and the written data is stored. A region with no disk blocks allocated is called a "hole", and a file containing such regions is called a "sparse file". Files that are literally sparse (large in size compared to the amount of information they hold), such as DB files, have long been stored this way, and more recently the virtual disk images of Qemu / KVM virtual machines are sometimes sparse files as well.

Detecting and punching holes

File holes are a matter of how disk space is used inside the file system, which traditionally was not exposed through the API. Besides the regular POSIX API there was also DMAPI, but it does not seem to have been widely adopted.

Linux has long been able to detect holes by opening the file and issuing an ioctl(2). The FIBMAP ioctl returns the number of the disk block where a given block of file data is stored. For a hole it returns 0 as the disk block number, so holes can be detected. The drawbacks are that this ioctl requires root privileges and that one ioctl(2) call is needed for each block checked.

Linux also has another ioctl(2) called FS_IOC_FIEMAP. It is an enhanced version of FIBMAP that returns information about a specified range of a file in one call and does not require root privileges. FIEMAP was introduced in October 2008, during the development of 2.6.28.

Both FIBMAP and FS_IOC_FIEMAP are ways of finding out where on disk the file data is stored; finding holes is a side effect. As a dedicated API for detecting holes, the lseek(2) system call has the SEEK_HOLE and SEEK_DATA options. They were first implemented on Solaris and later on FreeBSD and Linux, so although they are not standard, they can be considered reasonably portable.

SEEK_HOLE moves to the first hole at or after the specified offset. SEEK_DATA moves to the first non-hole (data) region at or after the specified offset. By combining the two, you can enumerate all the holes in a file.
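
Enumerating a file's data and hole regions with lseek(2) can be sketched as follows. This assumes Python 3 on a platform where os.SEEK_DATA and os.SEEK_HOLE are available (such as Linux); the function names are hypothetical. On a file system that does not track holes, the whole file is simply reported as one data region.

```python
import errno
import os


def data_segments(path):
    """Return a list of (start, end) byte ranges that contain data."""
    segments = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:
                    break  # no data after offset: the rest is a hole
                raise
            # There is always an implicit hole at end of file,
            # so SEEK_HOLE succeeds for any offset inside the file.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            segments.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return segments


def holes(path):
    """Return the (start, end) ranges not covered by data, i.e. the holes."""
    size = os.stat(path).st_size
    result = []
    pos = 0
    for start, end in data_segments(path):
        if start > pos:
            result.append((pos, start))
        pos = end
    if pos < size:
        result.append((pos, size))
    return result
```

The reported boundaries follow the file system's allocation granularity, so a segment can be somewhat wider than the bytes that were actually written.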

What about making a hole? The lseek(2) + write(2) and ftruncate(2) methods mentioned at the beginning can only add a hole at the end of a file. In other words, they cannot make a hole in a region where data blocks have already been allocated.

Here too Solaris led the way: a command called F_FREESP was added to fcntl(2), which makes a hole in a specified region of a file. This API was not adopted by other operating systems[^2]; on Linux, the fallocate(2) system call gained a flag called FALLOC_FL_PUNCH_HOLE (linux.git commit 79124f18b335172e1916075c633745e12dae1dac), which likewise makes a hole in a specified region. In both implementations, the data stored in the specified region is lost (reading it back returns zeros).

[^2]: However, XFS has long had an ioctl called XFS_IOC_FREESP, which can do the same thing as fcntl(F_FREESP).
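
Python's os module does not expose fallocate(2) with a mode argument, so the sketch below calls the C library through ctypes. It assumes a 64-bit Linux system whose libc exports fallocate with a 64-bit off_t; the flag values are copied from <linux/falloc.h>, and punch_hole is a hypothetical helper name. File systems that do not support hole punching return an error such as EOPNOTSUPP.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

# Assumption: 64-bit Linux, where off_t is a 64-bit integer.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length) of fd without changing its size."""
    # PUNCH_HOLE must always be combined with KEEP_SIZE.
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if _libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

After a successful call, the file size is unchanged and reads from the punched range return zeros.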

Preserving and copying holes

Even before APIs that can detect holes (without root privileges) were added, the coreutils cp command, GNU tar, and others could detect sparse files and copy or archive them efficiently. How did they detect them?

According to the coreutils history, hole detection using FS_IOC_FIEMAP was implemented in May 2010, in [this commit](https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit;h=dff2b95e4fb22f3e6f3360da0774c784d2f50ff9). Let's look at the code before that.

Roughly tracing it, an ordinary file copy is done by [copy_reg()](https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/copy.c;h=9a014ad5aa672c78b7e3cd69de1e684c73563e1f;hb=e1aaf8903db97f3240b1551fd6936ccdc652dfc8#l458), and the copy itself is performed in the while loop at line 707.

The point to note is line 746: it counts the zeros and seeks by that amount. In other words, where the source data is zero, it does not write zeros to the destination but simply seeks, so that region may become a hole.

Even with the latest coreutils, holes are processed this way when FIEMAP isn't available.
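
The idea of seeking instead of writing zeros can be sketched as follows. This is a simplified illustration of the technique, not the coreutils code; the function name and buffer size are hypothetical.

```python
import os

BLOCK = 64 * 1024  # hypothetical buffer size


def sparse_copy(src, dst, block=BLOCK):
    """Copy src to dst, seeking over all-zero blocks instead of writing them."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(block)
            if not buf:
                break
            if buf.count(0) == len(buf):
                # All zeros: skip ahead; the unwritten gap may become a hole.
                fout.seek(len(buf), os.SEEK_CUR)
            else:
                fout.write(buf)
        # Seeking past the end does not extend the file by itself,
        # so set the final size explicitly.
        fout.truncate(fout.tell())
```

Note that this approach turns any sufficiently long run of zeros into a potential hole, whether or not the source region was a hole to begin with.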

File holes and TRIM (UNMAP) commands

On a different note, the term "TRIM command" became a hot topic a few years ago, as SSDs were becoming widespread. TRIM addresses the problem that writes are relatively slow because NAND flash memory, the storage medium of SSDs, must be erased before it can be rewritten. When the OS (file system) notifies the SSD of unused or freed blocks, the SSD can erase them in advance. It also helps wear leveling, which evens out the write count across NAND flash blocks, each of which can only be written a limited number of times.

TRIM is a SATA command, and SCSI has a similar command called UNMAP. SCSI-attached SSDs exist, but UNMAP is also useful for thin provisioning on a SAN. Because allocated space can be returned to the storage when a file is deleted, used space no longer grows monotonically as it did in the past.

Speaking of thin provisioning, virtual disks for virtual machines are sometimes thin provisioned. For example, Qemu's Qcow2 is itself a file format capable of thin provisioning, and a raw image created as a sparse file is also thin provisioned. And so we are back to holes.

Qemu and TRIM / UNMAP commands

Qemu's emulated SATA and SCSI disks understand the TRIM (UNMAP) command and (optionally) manipulate the backend disk image accordingly.

First, SATA began to interpret TRIM with a commit in May 2011. The SCSI disk followed in August 2012. The VirtIO block device is much more recent, February 2019. In terms of Qemu versions, these correspond to the release cycles of 0.15, 1.2.0, and 4.0, respectively.

Among the drivers for backend virtual image files, Qcow2 began releasing TRIMmed (UNMAPped) regions in January 2011. The file size is not reduced, however; instead, the freed regions can be punched with fallocate. For the raw format, releasing TRIMmed (UNMAPped) regions with fallocate() was added in January 2013. On XFS, it seems regions were already being released via ioctl() before that.

Now, when you actually create a virtual machine in an Ubuntu 20.04 LTS (focal, Linux 5.4 series, Qemu 4.2 series) environment, the virtual disk (VirtIO SCSI in all the examples here) appears to the guest OS as a disk capable of thin provisioning. For example, when you boot a Linux guest, the files thin_provisioning and provisioning_mode appear in the disk's sysfs entry with the following contents.

$ cat /sys/bus/scsi/devices/0\:0\:0\:0/scsi_disk/0\:0\:0\:0/thin_provisioning
1
$ cat /sys/bus/scsi/devices/0\:0\:0\:0/scsi_disk/0\:0\:0\:0/provisioning_mode
unmap
$

For Windows guests, if you look at the "Defragment and Optimize Drives" tool (Start -> Windows Administrative Tools), the "Media type" column shows "Thin provisioned drive", and selecting "Optimize" runs TRIM (Z: in the figure is an iSCSI disk).

On the host, which has an ordinary disk attached, looking at the same files as in the Linux guest gives the following:

$ cat /sys/bus/scsi/devices/0\:0\:0\:0/scsi_disk/0\:0\:0\:0/thin_provisioning
0
$ cat /sys/bus/scsi/devices/0\:0\:0\:0/scsi_disk/0\:0\:0\:0/provisioning_mode
full
$

This shows that the host disk is not thin provisioned (it is thick provisioned).
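
To check every SCSI disk at once rather than one sysfs path at a time, a small sketch like the following can be used. It assumes the sysfs layout seen in the cat commands above; the function name and the root parameter are hypothetical (the parameter exists only so the function can be pointed at another directory).

```python
import glob
import os


def provisioning_modes(root="/sys/bus/scsi/devices"):
    """Map each SCSI disk address (e.g. "0:0:0:0") to its provisioning_mode."""
    modes = {}
    pattern = os.path.join(root, "*", "scsi_disk", "*", "provisioning_mode")
    for path in glob.glob(pattern):
        addr = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            modes[addr] = f.read().strip()
    return modes
```

On the guest shown above this would be expected to report unmap for the disk, and full on the host.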

In practice, when a guest issues an UNMAP command (UNMAP because the disk is SCSI) and the block device has the discard option set, the space is released with fallocate(2). On Linux 4.9 or later, as in Ubuntu 20.04, [fallocate(2) works](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=25f4c41415e513f0e9fb1f3fce2ce98fcba8d263) not only when the virtual disk is a file but also when it is a host device (such as an LVM volume), so TRIM (UNMAP) is passed through.

I tried this out with a raw image.

host# qemu-img create -f raw vol.img 20G
Formatting 'vol.img', fmt=raw size=21474836480
host# qemu-img info vol.img ; ls -lsh vol.img
image: vol.img
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: 4 KiB
4.0K -rw-r--r-- 1 libvirt-qemu kvm 20G May 30 13:42 vol.img
host#

Attach this disk to the guest as sdb. To add the discard option with libvirt, change the line

      <driver name='qemu' type='raw'/>

so that it looks like

      <driver name='qemu' type='raw' discard='unmap'/>

After booting the guest and identifying the target disk, format it. I chose XFS here.

guest:~# mkfs.xfs /dev/sdb
meta-data=/dev/sdb               isize=512    agcount=4, agsize=1310720 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=5242880, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
guest:~#

If you examine the image on the host side, you can see that it has grown by the amount of metadata.

host# qemu-img info vol.img ; ls -lsh vol.img
image: vol.img
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: 10.3 MiB
11M -rw-r--r-- 1 libvirt-qemu kvm 20G May 30 13:51 vol.img
host#

Mount this and create a 512MB file.

guest:~# mount /dev/sdb /mnt
guest:~# dd if=/dev/urandom of=/mnt/randomfile bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.95154 s, 182 MB/s
guest:~#

Examining the image on the host side reveals that it is about 512MB larger.

host# qemu-img info vol.img ; ls -lsh vol.img
image: vol.img
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: 522 MiB
523M -rw-r--r-- 1 libvirt-qemu kvm 20G May 30 13:52 vol.img
host#

Next, delete this file. TRIM is issued either when discard is added to the mount options or when fstrim(8) is run, so here I run fstrim (-v is verbose).

guest:~# rm /mnt/randomfile 
guest:~# fstrim -v /mnt
/mnt: 20 GiB (21464170496 bytes) trimmed
guest:~#

Examining the image, you can see that it has returned to about the same size as immediately after mkfs.

host# qemu-img info vol.img ; ls -lsh vol.img
image: vol.img
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: 10.1 MiB
11M -rw-r--r-- 1 libvirt-qemu kvm 20G May 30 13:53 vol.img
host#

Recent Debian and Ubuntu have a command called zerofree, which apparently zero-fills the free space on ext3 and ext4. Combining this with Qemu's detect-zeros=unmap option may make it possible to reduce the allocated space of a disk image.