Linux 文件系统的“暗黑技巧”:这些你真的会用到


The Dark Arts of Linux Filesystems: Practical Tricks You’ll Actually Use

“A filesystem is not just where data lives—it’s where performance, resilience, and cleverness come to negotiate.”
“文件系统不仅存放数据,更是性能、韧性与技巧博弈的舞台。”


Preface: Why Filesystem Tricks Matter

前言:为什么要在文件系统上动脑筋

Linux filesystems—ext4, XFS, Btrfs, F2FS—are mature ecosystems that reward those who understand their levers. With a few system calls and a bit of mechanical sympathy, you can save terabytes, dodge latency spikes, and extend SSD lifespan. This piece walks through practical, safe, reproducible techniques—the kind you can graft into production systems without fear—focusing on portable APIs and ext4/XFS specifics where necessary.

Linux 的文件系统——ext4、XFS、Btrfs、F2FS——是成熟而深厚的生态。只要懂得几个系统调用,再加上一点对底层的敬畏与理解,你就能节省 TB 级空间、躲开延迟尖峰、延长 SSD 寿命。本文梳理一套实用、安全、可复现的小技巧,尽量基于通用接口,在必要处点到 ext4/XFS 的细节,便于落地到生产。


1) Hole Punching: Making Big Files Light

1)打洞:让大文件“轻”起来

  • What it is: Using fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) to deallocate physical blocks within a file, turning spans into sparse regions without changing file size.

  • Why use it: Rolling logs, ever-growing capture files, databases with stale segments. Keep the tail, discard the belly.

  • 核心概念:通过 fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) 在文件内部“打洞”,逻辑大小不变,物理块被释放。

  • 适用场景:滚动日志、持续增长的抓包文件、包含陈旧段的数据库文件。保留尾部,清掉前半截。

Key constraints (ext4):

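  • Offset and length do not have to be block-aligned for hole punching: whole filesystem blocks inside the range are freed, while partial blocks at either end are only zeroed. Align both to the block size if you want every byte of the range reclaimed. (The EINVAL-on-misalignment rule applies to FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE.)
  • On ext4 bigalloc, actual reclaim granularity is the cluster size (multiple blocks). Partial clusters may be zeroed but not freed.
  • Check block size via statfs/statvfs; check bigalloc and cluster size via tune2fs -l.

关键约束(ext4):

  • 打洞的偏移和长度不要求按块对齐:区间内完整的文件系统块会被释放,两端不足一块的部分只会清零。若想让整个区间都被回收,请把两者按块大小对齐。(未对齐即返回 EINVAL 的规则适用于 FALLOC_FL_COLLAPSE_RANGE 和 FALLOC_FL_INSERT_RANGE。)
  • 启用 bigalloc 时,真正回收的粒度是簇大小;未覆盖完整簇可能仅清零不释放。
  • 可用 statfs/statvfs 查逻辑块大小,用 tune2fs -l 查看是否 bigalloc 以及簇大小。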

Pragmatic pattern:

  • Keep the last N bytes of a growing file: periodically punch [0, size - N).
  • Or rotate into a new file: reflink/clone tail into new, then rename atomically.

实战套路:

  • 保留尾部 N 字节:周期性地对区间 [0, size - N) 打洞。
  • 或“新瓶装尾”:把最后 N 字节克隆到新文件,原子重命名替换。

Safety notes:

  • fallocate needs a descriptor opened for writing. Punching a range that is mmap'd is safe: dirty pages in the range are discarded, and later reads return zeros.
  • Some backup tools mishandle sparse files, so verify your pipeline.

安全提示:

  • fallocate 需要以可写方式打开的文件描述符。对已 mmap 的区间打洞是安全的:区间内的脏页会被丢弃,之后读取返回零。
  • 稀疏文件可能被某些备份工具错误处理,上线前验证链路。
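
The punch call itself can be sketched in a few lines. This is a minimal illustration, not a hardened implementation: it assumes 64-bit Linux, reaches fallocate through ctypes because Python's os module does not expose the punch flag, and the helper name punch_hole is made up for this example.

```python
import ctypes
import errno
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

# Assumption: 64-bit Linux, where off_t is a 64-bit integer.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd: int, offset: int, length: int) -> bool:
    """Deallocate [offset, offset + length) without changing the file size.

    Returns False when the filesystem does not support hole punching.
    """
    if length <= 0:
        return True
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if _libc.fallocate(fd, mode, offset, length) == 0:
        return True
    err = ctypes.get_errno()
    if err in (errno.EOPNOTSUPP, errno.ENOSYS):
        return False
    raise OSError(err, os.strerror(err))
```

Returning False instead of raising on EOPNOTSUPP lets a caller fall back to rotation on filesystems that cannot punch.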

2) Preallocation That Actually Helps

2)好用的预分配方式

  • fallocate(FALLOC_FL_KEEP_SIZE): Reserve disk space without changing file size—prevents runtime fragmentation.

  • fallocate(0): Extends file size and allocates blocks; ideal for append-heavy writers.

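  • posix_fallocate: Portable and guarantees the space is allocated, but when the filesystem lacks fallocate support glibc falls back to writing into every block, which can be very slow.

  • fallocate(FALLOC_FL_KEEP_SIZE):预留空间但不改动大小——减少运行时碎片。

  • fallocate(0):直接扩展文件并分配块,适合高并发追加写。

  • posix_fallocate:可移植且保证空间已分配;但当文件系统不支持 fallocate 时,glibc 会退化为逐块写入,可能非常慢。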

Tuning tips:

  • Preallocate in powers of two up to a cap (e.g., 1–64 MiB) to match allocator behavior.
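  • For SSDs, fewer and larger extents mean less filesystem metadata and larger I/Os; whether that lowers write amplification inside the FTL depends on the device, so measure it.

调优要点:

  • 以 2 的幂逐步预分配(如 1–64 MiB),贴合分配器习惯。
  • 对 SSD,更少更大的 extent 意味着更少的元数据和更大的 I/O;至于能否降低 FTL 内部的写放大则取决于设备,需要实测。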
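
As a rough sketch of the doubling-with-a-cap idea, the helpers below grow a reservation in power-of-two steps and allocate with os.posix_fallocate. The function names and the 1 MiB / 64 MiB bounds are illustrative assumptions. Note that posix_fallocate extends the file size, so a writer using it has to track its own logical end of data.

```python
import os

MIB = 1 << 20


def next_prealloc_size(current: int, floor: int = 1 * MIB, cap: int = 64 * MIB) -> int:
    """Double the reservation each time, clamped to [floor, cap]."""
    return max(floor, min(cap, current * 2))


def ensure_allocated(fd: int, needed_end: int, step: int) -> int:
    """Allocate blocks up to at least needed_end; return the allocated end.

    Uses os.posix_fallocate, which also grows st_size (mode 0 semantics).
    """
    size = os.fstat(fd).st_size
    if needed_end <= size:
        return size
    # Round the extension up to a whole number of steps.
    grow = -(-(needed_end - size) // step) * step
    os.posix_fallocate(fd, size, grow)
    return size + grow
```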

3) Reflinks and Copy-on-Write: Zero-Copy Clones

3)Reflink 与写时复制:零拷贝克隆

  • On filesystems that support shared extents (Btrfs, XFS formatted with reflink=1; ext4 does not support reflink), use the FICLONE or FICLONERANGE ioctl to duplicate a whole file or a byte range without copying data.

  • Use for “keep last N bytes” rotations, snapshotting immutable blobs, or deduplicating large common bases.

  • 在支持共享 extent 的文件系统(Btrfs、以 reflink=1 格式化的 XFS;ext4 不支持 reflink)上,用 FICLONE 或 FICLONERANGE ioctl 无需拷贝数据即可克隆整个文件或某个区间。

  • 适合“尾部保留”的轮换、不可变大对象的快照、共享大基线的数据去重。

Caveat: CoW can fragment under heavy overwrites. Consider periodic defrag (Btrfs: btrfs filesystem defragment).

注意:频繁覆盖会让 CoW 产生碎片,必要时周期性整理(如 Btrfs 的 defragment)。
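
A whole-file clone with a fallback might look like the sketch below. It is an illustration under assumptions: the FICLONE request number is the Linux value of _IOW(0x94, 9, int), the helper name clone_or_copy is invented here, and the set of errno values treated as "not clonable" is a reasonable guess rather than an exhaustive list.

```python
import errno
import fcntl
import shutil

# _IOW(0x94, 9, int) from <linux/fs.h>.
FICLONE = 0x40049409

# Errors that mean "cannot share blocks here", not "something is broken".
_NOT_CLONABLE = (errno.EOPNOTSUPP, errno.ENOTTY, errno.EXDEV, errno.EINVAL, errno.ENOSYS)


def clone_or_copy(src: str, dst: str) -> bool:
    """Try to reflink src to dst; fall back to a plain byte copy.

    Returns True when the clone ioctl succeeded (blocks are shared).
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError as e:
            if e.errno not in _NOT_CLONABLE:
                raise
        shutil.copyfileobj(s, d)
        return False
```

The boolean result is worth logging per mount point: it shows quickly whether a deployment is actually getting shared extents or silently paying for full copies.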


4) Sparse Files: Power With Responsibility

4)稀疏文件:强大但要自律

  • Seeking past the end of the data (or calling ftruncate) and then writing leaves holes, so a file can be logically huge with a tiny physical footprint. lseek(SEEK_HOLE/SEEK_DATA) does not create holes; it finds them.

  • Great for VM images, scientific arrays, or checkpoints.

  • Be sure your tooling understands sparse semantics (rsync -S, tar --sparse).

  • 把偏移 seek 到数据末尾之后(或调用 ftruncate)再写入会留下空洞,从而得到“巨大逻辑文件”但占用很小。lseek(SEEK_HOLE/SEEK_DATA) 不创建空洞,只用于查找空洞。

  • 适合 VM 镜像、科学矩阵、检查点。

  • 工具链需支持稀疏(如 rsync -S、tar --sparse),否则复制时会膨胀。
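
To make the two halves concrete, here is a small sketch that creates a sparse file with truncate plus a write, and then walks its data ranges with SEEK_DATA and SEEK_HOLE. The function names are illustrative. On a filesystem without hole support the walk simply reports the whole file as one data range.

```python
import errno
import os


def make_sparse(path: str, logical_size: int, payload: bytes, at: int) -> None:
    """Create a file of logical_size bytes with data only at offset `at`."""
    with open(path, "wb") as f:
        f.truncate(logical_size)  # sets the size without writing any blocks
        f.seek(at)
        f.write(payload)


def data_segments(fd: int):
    """Yield (start, end) byte ranges that the filesystem reports as data."""
    size = os.fstat(fd).st_size
    pos = 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # nothing but a hole until EOF
                return
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)  # EOF counts as a hole
        if end <= start:  # defensive: never loop without progress
            end = size
        yield start, end
        pos = end
```

A sparse-aware copier would read only the yielded ranges and seek over everything else.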


5) Direct I/O, DAX, and When to Bypass the Page Cache

5)直接 I/O、DAX:什么时候该绕开页缓存

  • O_DIRECT bypasses the page cache; good for databases with their own caching. Beware alignment and IO size constraints.

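  • DAX (for persistent memory) maps storage directly into the process address space, bypassing the page cache and the block I/O path.

  • For streaming writes, posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) trims cache pressure without O_DIRECT headaches. It only drops clean pages, so flush first (fdatasync or sync_file_range) for it to take effect on freshly written data.

  • O_DIRECT 绕开页缓存,适合自带缓存的数据库。注意对齐和 IO 大小限制。

  • DAX(持久内存)把存储直接映射到进程地址空间,绕开页缓存与块 I/O 路径。

  • 流式写入可用 posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) 降低缓存压力,避免 O_DIRECT 的复杂约束。它只丢弃干净页,所以要先刷盘(fdatasync 或 sync_file_range),对刚写入的数据才有效。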
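
For the streaming-write case, one possible shape is to flush in batches and then drop the just-flushed window from the cache. The sketch below assumes fd refers to a freshly created file positioned at offset 0; the function name and the 16 MiB batch size are placeholders to tune.

```python
import os

FLUSH_EVERY = 16 << 20  # hypothetical batch size


def stream_write(fd: int, chunks, flush_every: int = FLUSH_EVERY) -> int:
    """Write chunks to fd, periodically flushing and dropping cached pages.

    Assumes fd starts at offset 0, so byte counts double as file offsets.
    """
    written = 0
    flushed = 0
    for chunk in chunks:
        view = memoryview(chunk)
        while view:  # os.write may accept fewer bytes than offered
            n = os.write(fd, view)
            view = view[n:]
            written += n
        if written - flushed >= flush_every:
            os.fdatasync(fd)  # pages must be clean before they can be dropped
            os.posix_fadvise(fd, flushed, written - flushed, os.POSIX_FADV_DONTNEED)
            flushed = written
    os.fdatasync(fd)
    if written > flushed:
        os.posix_fadvise(fd, flushed, written - flushed, os.POSIX_FADV_DONTNEED)
    return written
```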


6) fallocate Flags You Should Actually Know

6)这些 fallocate 标志值得记住

  • FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE: Deallocate space within range, keep size.

  • FALLOC_FL_ZERO_RANGE: Zero out efficiently, optionally allocating.

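  • FALLOC_FL_COLLAPSE_RANGE: Remove a range and shift subsequent data left, like cutting a section out of the middle. Offset and length must be multiples of the filesystem block size, and the range may not reach end of file (EINVAL otherwise).

  • FALLOC_FL_INSERT_RANGE: Insert a hole and shift data right; useful for journal-like structures. The same block-alignment rule applies, and the offset must lie inside the file.

  • FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:内部释放空间,保持大小。

  • FALLOC_FL_ZERO_RANGE:高效清零,可选择分配。

  • FALLOC_FL_COLLAPSE_RANGE:删除一段并左移后续数据,相当于从中间剪掉一截。偏移和长度必须是文件系统块大小的整数倍,且区间不能到达文件末尾(否则 EINVAL)。

  • FALLOC_FL_INSERT_RANGE:插入空洞并右移数据,适合类日志结构。同样要求按块对齐,且偏移必须位于文件内部。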

Filesystem caveats: Not all flags are supported everywhere; XFS and ext4 are ahead, others may return EOPNOTSUPP.

文件系统注意:并非所有文件系统都支持全部标志;XFS 与 ext4 支持更完备,其他可能返回 EOPNOTSUPP。


7) Advice to the Kernel: fadvise and friends

7)给内核的暗示:fadvise 等

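  • posix_fadvise: SEQUENTIAL, RANDOM, WILLNEED, DONTNEED steer page-cache readahead and eviction (they do not talk to the I/O scheduler).

  • madvise for mmap: MADV_SEQUENTIAL asks for aggressive readahead; MADV_DONTNEED drops the pages from the calling process's mapping after reading. For file-backed mappings the page cache may still hold them until the kernel reclaims them.

  • posix_fadvise:SEQUENTIAL、RANDOM、WILLNEED、DONTNEED 影响页缓存的预读与回收(与 I/O 调度器无关)。

  • mmap 配合 madvise:MADV_SEQUENTIAL 请求更激进的预读;MADV_DONTNEED 在读后把页从本进程的映射中丢弃。对文件映射而言,页缓存可能仍保留这些页,直到内核回收。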

Real-world pattern: After scanning a large file once, call DONTNEED on the scanned window to prevent cache eviction of hot data elsewhere.

实战:顺序扫描大文件后,对已扫描窗口调用 DONTNEED,避免把其他热数据挤出缓存。
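
The scan-then-drop pattern can be sketched as follows. The window size and the function name scan_once are assumptions for illustration, and DONTNEED remains advice: the kernel may keep pages that are dirty or in use by another process.

```python
import os

WINDOW = 8 << 20  # hypothetical window size; tune for the workload


def scan_once(path: str, consume, window: int = WINDOW) -> int:
    """Stream a file through `consume`, dropping cached pages behind the reader.

    Returns the number of bytes read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            chunk = os.pread(fd, window, offset)
            if not chunk:
                return offset
            consume(chunk)
            # Release the window we just finished with.
            os.posix_fadvise(fd, offset, len(chunk), os.POSIX_FADV_DONTNEED)
            offset += len(chunk)
    finally:
        os.close(fd)
```

For example, scan_once(path, hasher.update) checksums a large file without leaving it resident in the page cache.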


8) Atomic File Updates Without Data Loss

8)原子更新:不丢数据的方式

  • Write to a temp file in the same directory, fsync(fd), rename over the old file, then fsync(dirfd) so the rename itself is durable. The rename is atomic on POSIX.

  • For metadata safety, also call fdatasync/fsync on the final fd if you append after rename.

  • For directories with many writes, batch and sync once to amortize latency.

  • 先写同目录临时文件,fsync(fd),rename 覆盖旧文件,再 fsync(dirfd) 让 rename 本身落盘。rename 在 POSIX 上是原子的。

  • 若 rename 之后还有追加,最终文件也要 fsync/fdatasync。

  • 目录写入频繁场景,合并多次操作再同步以平摊延迟。
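
A compact sketch of the sequence: temp file in the same directory, fsync the file, rename, fsync the directory. The function name atomic_write and the ".tmp-" prefix are illustrative choices; note also that mkstemp creates the file with mode 0600, so adjust permissions if the target needs different ones.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # file contents reach stable storage
        os.replace(tmp, path)     # atomic rename within one filesystem
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    dirfd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dirfd)           # the new directory entry reaches stable storage
    finally:
        os.close(dirfd)
```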


9) Aligning for Performance: It’s Not Superstition

9)对齐不是玄学:它真有效

  • Align I/O to filesystem block size and device logical/physical sector sizes (e.g., 512/4096 on 512e drives, 4096/4096 on 4Kn).

  • For NVMe, 128 KiB or 256 KiB writes often suit the controller, but treat that as a starting point and benchmark on the actual device.

  • Use stat -f -c %S path to get filesystem block size; use lsblk -t and the files under /sys/block/<dev>/queue/ to inspect device topology.

  • I/O 对齐到文件系统块和设备逻辑/物理扇区(如 512e 盘的 512/4096、4Kn 盘的 4096/4096)。

  • 对 NVMe,128 KiB 或 256 KiB 写通常比较合适,但这只是起点,应在实际设备上做基准测试。

  • 用 stat -f -c %S path 看文件系统块,用 lsblk -t 与 /sys/block/<dev>/queue/ 下的文件查看设备拓扑。
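
In code, the alignment arithmetic is short. The sketch below reads the fundamental block size through os.statvfs (f_frsize, the same value stat -f -c %S prints) and rounds offsets down or up; the helper names are illustrative.

```python
import os


def fs_block_size(path: str) -> int:
    """Fundamental block size of the filesystem containing `path`."""
    return os.statvfs(path).f_frsize


def align_down(value: int, alignment: int) -> int:
    """Largest multiple of `alignment` that is <= value."""
    return value - (value % alignment)


def align_up(value: int, alignment: int) -> int:
    """Smallest multiple of `alignment` that is >= value."""
    return align_down(value + alignment - 1, alignment)
```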


10) Managing Millions of Files Without Melting Down

10)面对海量小文件别崩溃

  • Use directory hashing features (ext4 dir_index is default).

  • Avoid single hot directories; shard by hash prefixes (e.g., ab/cd/…).

  • Consider tar-like container files, SQLite/LMDB, or object stores to reduce inode churn.

  • For “cold archives,” mount with noatime (it already implies nodiratime) to cut metadata writes.

  • 利用目录哈希(ext4 默认 dir_index)。

  • 避免单一热点目录;按哈希前缀分片(如 ab/cd/…)。

  • 用容器文件(tar 风格)、SQLite/LMDB 或对象存储减少 inode 抖动。

  • 冷归档场景可用 noatime 挂载(它已隐含 nodiratime)减少元数据写。
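
The hash-prefix sharding idea, sketched with SHA-256 and two levels of two hex characters each. The choice of hash, the depth, and the name shard_path are all illustrative assumptions.

```python
import hashlib
import os


def shard_path(root: str, key: str, levels: int = 2, width: int = 2) -> str:
    """Map a key to root/ab/cd/<digest>, spreading entries over many directories."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, digest)
```

With the defaults this gives 256 × 256 leaf directories, so a hundred million objects average roughly 1,500 entries per directory.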


11) Quotas, Project IDs, and Per-Tree Limits

11)配额与项目 ID:按目录树限额

  • XFS project quotas and ext4 project quotas let you cap space per directory subtree—perfect for multi-tenant caches.

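  • chattr -p <id> sets the project ID, and chattr +P makes new files inherit it; xfs_quota (XFS) or the standard quota tools (ext4), together with quotactl, manage limits.

  • Combine with lazytime to defer atime flushes.

  • XFS 与 ext4 的项目配额可对目录树限额,非常适合多租户缓存。

  • chattr -p <id> 设置项目 ID,chattr +P 让新建文件继承该 ID;用 xfs_quota(XFS)或标准 quota 工具(ext4)以及 quotactl 管理额度。

  • 搭配 lazytime 延迟 atime 刷新,降低写放大。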


12) Telling the Kernel You’re Done: fallocate + DISCARD/TRIM

12)告诉内核“我用完了”:fallocate 与 DISCARD/TRIM

  • On SSDs/thin LVM, freeing space benefits from discard.

  • Mount with discard for online TRIM (Btrfs also offers discard=async, which batches discards in the background), or run fstrim periodically.

  • Punching holes on a filesystem with discard enabled often triggers TRIM for the freed extents.

  • 在 SSD/精简 LVM 上,释放空间配合 discard 更有效。

  • 可用 discard 挂载选项做在线 TRIM(Btrfs 还提供 discard=async,在后台批量下发),或定期运行 fstrim。

  • 在启用 discard 的文件系统上打洞,通常会对释放的 extent 发出 TRIM。


13) Checksums and Scrubbing: Silent Corruption Is Real

13)校验与巡检:沉默数据腐蚀真的存在

  • Btrfs and ZFS checksum data and metadata by default; ext4/XFS don’t checksum data.

  • If you must stay on ext4/XFS, add application-level checksums and periodic verify passes.

  • For large stores, plan scrubs: read-and-verify cycles to surface latent errors.

  • Btrfs/ZFS 默认对数据与元数据校验;ext4/XFS 不校验数据。

  • 若必须用 ext4/XFS,请在应用层加校验并定期校验扫描。

  • 大规模存储要安排巡检:周期读取校验以暴露潜在错误。
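
One simple way to add application-level checksums is a sidecar file next to each blob, checked during a periodic verify pass. The ".sha256" suffix and the function names below are illustrative conventions, not a standard.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_sidecar(path: str) -> str:
    """Record the current digest in `path`.sha256 and return it."""
    digest = file_sha256(path)
    with open(path + ".sha256", "w") as f:
        f.write(digest + "\n")
    return digest


def verify(path: str) -> bool:
    """True when the file still matches its recorded digest."""
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return file_sha256(path) == expected
```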


14) Journal and Barrier Realities

14)日志与写屏障的现实

  • data=ordered (ext4 default) writes data before the metadata commit, a good balance of safety and performance.

  • nobarrier is dangerous on devices with a volatile write cache, because it skips the cache flushes that make journal commits durable; modern kernels issue flush/FUA by default.

  • For latency-sensitive workloads, consider commit=seconds tuning, but never at the expense of integrity guarantees you need.

  • ext4 默认 data=ordered:先写数据再提交元数据,兼顾安全与性能。

  • nobarrier 对带易失性写缓存的设备很危险,因为它跳过了让日志提交落盘的缓存刷新;现代内核默认下发 flush/FUA。

  • 对极致延迟敏感的场景可调 commit=秒,但别牺牲必要的数据一致性。


15) Snapshot-Friendly App Design

15)让应用更“快照友好”

  • Quiesce writes before LVM/ZFS/Btrfs snapshots: fsync relevant files and freeze userspace mutations briefly.

  • Store write-intent logs or sequence numbers to recover to a consistent point post-restore.

  • 在 LVM/ZFS/Btrfs 打快照前短暂静默写入:fsync 相关文件,冻结上层变更。

  • 记录写意图或序列号,恢复后回放到一致点。


16) Reading Around the Hot Spots

16)绕开热点,读得更快

  • Use readahead tuning (blockdev --setra) for streaming workloads.

  • For mixed IO, keep hot indices on faster media (NVMe) and cold payloads on slower disks; bind-mount to unify paths.

  • 流式工作负载可调大 readahead(blockdev --setra)。

  • 混合 I/O 场景,把热点索引放在 NVMe,冷数据放慢盘,用 bind-mount 统一路径。


17) Understanding extents: Why fragmentation hurts less than you think

17)理解 extent:碎片没你想的那么可怕

  • Modern filesystems use extents (start+length) instead of per-block lists.

  • Moderate fragmentation is tolerable; seek costs still matter on HDD, less so on SSD.

  • Preallocation and sequential appends keep extents long.

  • 现代文件系统用 extent(起点+长度)替代逐块列表。

  • 适度碎片是可接受的;HDD 仍受寻道影响,SSD 影响较小。

  • 预分配与顺序追加能保持长 extent。


18) Detecting True Space Usage

18)看清“真实占用”

  • du --apparent-size vs du: the former shows logical size, the latter shows disk usage (sparse awareness).

  • stat reports allocated blocks (st_blocks).

  • For reflinks and dedupe, blocks that are “used once” might be referenced by many files, so per-file numbers overcount; monitor at filesystem level (btrfs fi df, or plain df on XFS).

  • du --apparent-size 与 du:前者是逻辑大小,后者是实际占用(感知稀疏)。

  • stat 的 st_blocks 反映已分配块。

  • 对于 reflink 与去重,“一次写入”可能被多文件引用——应在文件系统层观察(如 btrfs fi df)。
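
The same comparison du makes can be read straight from stat. A minimal sketch (the function name usage is illustrative):

```python
import os


def usage(path: str):
    """Return (apparent_bytes, allocated_bytes) for one file."""
    st = os.stat(path)
    # st_blocks counts 512-byte units, independent of the filesystem block size.
    return st.st_size, st.st_blocks * 512
```

A large gap between the two values means the file is sparse; for reflinked files the allocated figure still counts shared blocks once per file.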


19) Case Study: The Rolling Capture File

19)案例:滚动抓包文件

Goal: A single file grows forever; keep last N bytes physically, free older regions without rotating file names.

目标:单文件持续增长;仅保留最后 N 字节的物理空间,不改文件名不轮转。

Pattern:

  • Periodically: size = lseek(fd, 0, SEEK_END). If size > N, punch [0, align_down(size - N)).
  • Align to filesystem block size; on ext4 bigalloc, align to cluster for actual reclaim.
  • Optionally, call fstrim if you rely on periodic TRIM.

套路:

  • 周期执行:size = lseek(fd, 0, SEEK_END)。若 size > N,对 [0, align_down(size - N)) 打洞。
  • 按文件系统块对齐;如 ext4 bigalloc,再按簇对齐才真正回收。
  • 若依赖定期 TRIM,可随后 fstrim。

Outcome: Logical file size grows, disk usage stays roughly bounded around N (+ metadata and partial-aligned tails).

结果:逻辑大小增长,磁盘占用约束在 N 附近(另加元数据与对齐尾差)。
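
The range calculation for this case study can be isolated into a pure function, which makes it easy to test separately from the fallocate call. In the sketch below, granule stands for the reclaim unit (block size, or cluster size on bigalloc) and punched_upto is a value the caller remembers from the previous round; both names are illustrative.

```python
def next_punch_range(size: int, keep: int, granule: int, punched_upto: int = 0):
    """Return (offset, length) to punch so only the last `keep` bytes stay allocated.

    Returns None when there is nothing new to punch.
    """
    if size <= keep:
        return None
    end = (size - keep) // granule * granule  # align_down(size - keep)
    if end <= punched_upto:
        return None
    return punched_upto, end - punched_upto
```

Remembering punched_upto avoids re-punching the same prefix on every pass; punching from 0 each time also works, it just repeats work the filesystem has already done.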


20) Operational Hygiene

20)运维卫生学

  • Monitor ENOSPC not just at filesystem level but also thin pool/array quotas.

  • Alert on inode exhaustion; space free but no inodes is a very real outage.

  • Backpressure writers before the last-GB cliff; keep reclaim daemons (e.g., punching) ahead of producers.

  • 不仅监控文件系统剩余空间,还要监控精简池/阵列配额。

  • 关注 inode 耗尽;“有空间没 inode”会导致真实故障。

  • 在临界前给写入方背压;让回收进程(如打洞)领先生产速率。


21) Tools You’ll Actually Use

21)常用工具清单

  • stat, statx: inspect block sizes, timestamps.

  • filefrag -v: visualize extents and fragmentation.

  • fiemap ioctl: programmatic extent map.

  • xfs_io: swiss army knife for fallocate/punch/zero on many filesystems.

  • fstrim: trigger TRIM on free space.

  • tune2fs/xfs_info/btrfs: query features, sectors, cluster sizes.

  • stat、statx:查看块大小、时间戳。

  • filefrag -v:显示 extent 与碎片情况。

  • fiemap ioctl:程序化获取 extent 映射。

  • xfs_io:多文件系统通吃的 fallocate/打洞/清零工具。

  • fstrim:对空闲区触发 TRIM。

  • tune2fs/xfs_info/btrfs:查询特性、扇区、簇大小。


22) Pitfalls to Respect

22)必须尊重的坑

  • Not all kernels/filesystems support SEEK_HOLE/DATA consistently. Test on your target.

  • Punching holes in ranges shared via reflink only drops this file's reference; no space is freed until every file sharing those extents lets go of them.

  • Copying sparse files naïvely can explode storage usage.

  • Heavy CoW on spinning disks can degrade badly without defrag.

  • 不同内核/文件系统对 SEEK_HOLE/DATA 的一致性不同,需实测。

  • 在被 reflink 共享的范围打洞只会丢掉本文件的引用;只有所有共享这些 extent 的文件都放手后,空间才会真正释放。

  • 朴素复制稀疏文件会“膨胀”。

  • 机械盘上重度 CoW 若不整理会显著退化。


23) Minimal, Portable Code Patterns

23)最小可用、可移植代码套路

  • Query alignment: use fstatfs/statvfs to get filesystem block size; fall back sensibly.

  • Guard flags: probe with an operation on a temp file; handle EOPNOTSUPP gracefully.

  • Feature gates: expose configs (e.g., “prefer_reflink”, “align_to_cluster”) for ops to tune per mount.

  • 获取对齐:fstatfs/statvfs 取文件系统块大小,做好退化。

  • 探测支持:在临时文件上试探操作,优雅处理 EOPNOTSUPP。

  • 特性开关:把“优先 reflink”“按簇对齐”等做成配置,便于按挂载点调优。


24) Performance Mindset: Measure, Don’t Guess

24)性能心法:测量,而非猜测

  • Use perf, iostat, blktrace/bpftrace to see real queues and latencies.

  • Benchmark realistic mixes (reads, writes, fsyncs), not microbench fairy tales.

  • Record extent counts before/after changes; smaller count often correlates with smoother latency.

  • 用 perf、iostat、blktrace/bpftrace 看真实队列与延迟。

  • 基准应贴合真实读写与 fsync 混合,而非只看微基准。

  • 记录变更前后的 extent 数量;更少的 extent 往往意味着更平滑的时延。


25) Where to Bend, Where Not to

25)哪些可以灵活,哪些绝不妥协

Bend: preallocation sizes, readahead, periodic hole punching intervals, CoW usage on cold vs hot data.
Don’t bend: fsync discipline for durability guarantees, integrity of snapshots/backups, assuming discard behavior without verification.

可灵活:预分配块大小、预读窗口、打洞频率、冷热数据的 CoW 策略。
不可妥协:满足耐久性的 fsync 规范、快照/备份的一致性、未验证就假设 discard 的行为。


Epilogue: The Engineer’s File Cabinet

尾声:工程师的“文件柜”

Filesystems reward clarity. If you tell them what you will do—sequential scan, random lookup, keep just the tail—they’ll meet you halfway with smarter prefetch, better allocation, and lower write amplification. The “dark arts” are not hacks; they’re documented powers waiting for precise use. Start with measured hypotheses, validate under your workload, and graduate each trick into a reliable habit.

文件系统偏爱明确的意图。你若清楚表达——我要顺序扫描、随机查找、只保留尾部——它们会用更聪明的预读、更好的分配、更低的写放大来回应。“暗黑技巧”并非投机取巧,而是等待被正确使用的正规能力。从带着假设的测量开始,在你的工作负载下验证,把每个技巧都打磨成可复制的习惯。


Join the Conversation

一起来聊

  • Which trick saved you the most space or latency?

  • What’s your experience with hole punching on ext4 vs XFS or Btrfs?

  • Any war stories about sparse files gone wrong?

  • 哪个技巧最帮你省空间或救时延?

  • 你在 ext4、XFS、Btrfs 上打洞的体验如何?

  • 有哪些稀疏文件“翻车”故事?

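Share your practices and questions in the comments, so more production-grade experience can come together.

在评论区分享你的实践与问题,让更多生产级经验汇聚起来。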