<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
>
	<channel>
		<title></title>
		<description>aligned words
</description>
		<sy:updatePeriod>weekly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
		<link>https://kdave.github.io</link>
		<atom:link href="https://kdave.github.io/rss-pko.xml" rel="self" type="application/rss+xml" />
		
		<lastBuildDate>Mon, 24 May 2021 00:00:00 +0200</lastBuildDate>
		
			<item>
				<title>qemu 2.12 does not support if=scsi</title>
				<dc:creator>kdave</dc:creator>
				<description>&lt;p&gt;After updating qemu to version 2.12, my testing VMs no longer just warned
about &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if=scsi&lt;/code&gt;; they printed a slightly more cryptic message and refused to
start.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qemu-system-x86_64: &lt;span class=&quot;nt&quot;&gt;-drive&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;root,if&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;scsi,media&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;disk,cache&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;none,index&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;0,format&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;raw: machine &lt;span class=&quot;nb&quot;&gt;type &lt;/span&gt;does not support &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;scsi,bus&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;0,unit&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if=scsi&lt;/code&gt; shortcut was handy. The maze of qemu command line options may
take some time to comprehend, but then you can do wonders.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://wiki.qemu.org/ChangeLog/2.12#Deprecated_options_and_features&quot;&gt;release notes of version 2.12&lt;/a&gt; do
mention that the support is gone and name the options that should be used
instead, but they do not show how exactly. As this should be a
simple syntax transformation from one option to another, an example would save
a lot of time.&lt;/p&gt;

&lt;p&gt;I was not able to find any ready-to-use example in a few minutes, so I had to
experiment a bit (which saved me documentation reading time).&lt;/p&gt;

&lt;p&gt;The setup I use is a file-backed root image for a virtual machine, nothing
really fancy. The file name is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;root&lt;/code&gt;; it is a sparse file in raw format (i.e. not qcow2), used with no
caching.&lt;/p&gt;

&lt;p&gt;Here you are:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-device virtio-scsi-pci,id=scsi
-drive file=root,id=root-img,if=none,format=raw,cache=none
-device scsi-hd,drive=root-img
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The first line defines the bus; without it the third line will complain that
there’s no SCSI bus. The second line defines the file-backed drive and the third
one puts the two together.&lt;/p&gt;

&lt;p&gt;Using SCSI might not be the best idea for a qemu VM as the emulated driver is
buggy and crashes, so I’d recommend using virtio, but for an almost read-only
root image it’s fine. Also the device names are predictable.&lt;/p&gt;
</description>
				<pubDate>Fri, 08 Jun 2018 00:00:00 +0200</pubDate>
				<link>https://kdave.github.io/qemu-2.11-if=scsi/</link>
				<guid isPermaLink="true">https://kdave.github.io/qemu-2.11-if=scsi/</guid>
			</item>
		
			<item>
				<title>Linux crypto: testing blake2s</title>
				<dc:creator>kdave</dc:creator>
				<description>&lt;p&gt;The BLAKE2 algorithm has been out for some time, yet there’s no port to the linux crypto
API. The recent push of WireGuard would add it to the linux kernel, but using a
different API (zinc) than the existing one. I haven’t explored zinc; I assume
there will be a way to use it from inside the kernel, but this would need another
layer to switch the APIs according to the algorithm.&lt;/p&gt;

&lt;p&gt;As btrfs is going to use more hashing algos, we are in the process of
selection.  One of the contenders is BLAKE2, though everybody would have a
different suggestion.  In order to test it, a port is needed, which is basically
glue code between the linux crypto API and the BLAKE2 implementation.&lt;/p&gt;

&lt;p&gt;I’m not going to reimplement crypto or anything, so the default
implementation is going to be the reference one found in blake2 git.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: due to the maximum space of 32 bytes available in the btrfs metadata
blocks, the version is BLAKE2s.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;the-blake2s-sources&quot;&gt;The BLAKE2s sources&lt;/h2&gt;

&lt;p&gt;From the repository &lt;a href=&quot;https://github.com/BLAKE2/BLAKE2&quot;&gt;https://github.com/BLAKE2/BLAKE2&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;ref/blake2s-ref.c&lt;/li&gt;
  &lt;li&gt;ref/blake2.h&lt;/li&gt;
  &lt;li&gt;ref/blake2-impl.h&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Briefly skimming over the sources, there’s nothing that’ll cause trouble.
Standard C code, some definitions. Adapting them for the linux kernel means
replacing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stdint.h&lt;/code&gt; types (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uint8_t&lt;/code&gt; etc) with the uXX ones (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;u8&lt;/code&gt; etc) and changing
the include paths. For simplicity, let’s remove the ifdefs for C++ and MSVC too.&lt;/p&gt;

&lt;h2 id=&quot;add-the-new-algorithm-definition-cra&quot;&gt;Add the new algorithm definition (CRA)&lt;/h2&gt;

&lt;p&gt;Though it’s possible to prepare a module for an out-of-tree build (see below),
let’s do it inside the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;linux.git/crypto/&lt;/code&gt; path for now. There’s also plenty of
sources to copy &amp;amp; paste. I’ve used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;crc32c_generic.c&lt;/code&gt;, and it turned out to be
useful.&lt;/p&gt;

&lt;p&gt;The crypto hash description is defined in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct shash_alg&lt;/code&gt;. It contains the
technical description like type of the algorithm, length of the context and
callbacks for various phases of the hash calculation (init, update, final), and
identification of the algorithm and the implementation. The default
implementations are written in C and use the string “-generic” in the name.&lt;/p&gt;

&lt;p&gt;The crc32c module came with a few stub callbacks (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;checksum_init&lt;/code&gt; etc), that
will only call into the blake2 functions and the definition can stay. Simple
search and replace from crc32c to blake2s will do most of the conversion.&lt;/p&gt;

&lt;h2 id=&quot;add-the-glue-code-for-crypto-api&quot;&gt;Add the glue code for crypto API&lt;/h2&gt;

&lt;p&gt;Now we have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blake2s.c&lt;/code&gt; with the reference implementation and the crypto algorithm
definition. The glue code connects the API with the implementation. We need 2
helper structures that hold the context once the user starts digest calculation.
The private blake2s context is embedded into one of them.  The intermediate
results will be stored there.&lt;/p&gt;

&lt;p&gt;And the rest is quite easy, each callback will call into the respective blake2s
function, passing the context, data and lengths. One thing that I copied from
the examples is the key initialization that’s in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blake2s_cra_init&lt;/code&gt;, that
appears to be global and copied to the context each time a new one is
initialized.&lt;/p&gt;

&lt;p&gt;Here the choice of using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;crc32c_generic.c&lt;/code&gt; helped as there were stub callbacks with
the right parameters, calling the blake2s functions that can retain their
original signature. This makes later updates easier.  All the functions are
static so the compiler will most probably optimize the code so that there is
no unnecessary overhead.&lt;/p&gt;

&lt;p&gt;Well and that’s it. Let’s try to compile it and insert the module:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;linux.git$ make crypto/blake2s.o
...
  CC [M]  crypto/blake2s.o
linux.git$ make crypto/blake2s.ko
...
  CC      crypto/blake2s.mod.o
  LD [M]  crypto/blake2s.ko
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Insert the module&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;linux.git/crypto$ sudo insmod ./blake2s.ko
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and check that it’s been properly loaded and registered&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;linux.git$ cat /proc/crypto
name         : blake2s
driver       : blake2s-generic
module       : blake2s
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 1
digestsize   : 32
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The selftest says it passed, but there’s no such test so far. There are test
values provided in blake2 git, so a selftest would be nice to have (tm). But
otherwise it looks good.&lt;/p&gt;

&lt;p&gt;To do an actual test, we’d need something inside the kernel to utilize the new hash.
One option is to implement a module that will do that, another is to use the userspace
library &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libkcapi&lt;/code&gt; that can forward the requests from userspace to the
available kernel implementations.&lt;/p&gt;
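
&lt;p&gt;As a rough sketch of what forwarding a request from userspace looks like:
the kernel exposes its hash implementations through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AF_ALG&lt;/code&gt; sockets, and the
Python standard &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;socket&lt;/code&gt; module can talk to them directly on Linux. The helper
name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kernel_digest&lt;/code&gt; is made up for illustration, and this is not how the tests
in this post were run.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import socket


def kernel_digest(name, data, max_len=64):
    # Hypothetical helper: ask the kernel crypto API for a digest via AF_ALG.
    with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as alg:
        alg.bind(('hash', name))      # algorithm name as listed in /proc/crypto
        op, _ = alg.accept()          # operation socket for a single request
        with op:
            op.sendall(data)          # feed the message to the kernel
            return op.recv(max_len)   # the digest is read back from the socket
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Assuming the running kernel has sha256 registered, calling
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kernel_digest('sha256', b'aligned words')&lt;/code&gt; should return the same bytes as the
userspace &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashlib.sha256&lt;/code&gt;. A newly inserted module would be addressed by the name
it registers.&lt;/p&gt;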

&lt;h2 id=&quot;test-it-with-libkcapi&quot;&gt;Test it with libkcapi&lt;/h2&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libkcapi&lt;/code&gt; project at
&lt;a href=&quot;http://www.chronox.de/libkcapi.html&quot;&gt;http://www.chronox.de/libkcapi.html&lt;/a&gt;
provides an API that uses the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AF_ALG&lt;/code&gt; socket type to exchange data with
kernel. The library provides a command line tool that we can use right away and
don’t need to code anything.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ kcapi-dgst -c blake2s --hex &amp;lt; /dev/null
48a8997da407876b3d79c0d92325ad3b89cbb754d86ab71aee047ad345fd2c49
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The test vectors provided by blake2 confirm that this is the hash of an empty string
with the default key (0x000102..1f).&lt;/p&gt;
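
&lt;p&gt;The same value can be cross-checked in userspace without the kernel. A
minimal sketch, assuming Python’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashlib&lt;/code&gt; (which ships its own BLAKE2
implementation) and the key stated above, i.e. the 32 bytes from 0x00 to 0x1f:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import hashlib

# Key used by the BLAKE2 test vectors: bytes 0x00, 0x01, ... 0x1f
key = bytes(range(32))

# Keyed BLAKE2s of the empty string, 32 byte digest
digest = hashlib.blake2s(b'', key=key).hexdigest()
print(digest)

# Value printed by kcapi-dgst above
expected = '48a8997da407876b3d79c0d92325ad3b89cbb754d86ab71aee047ad345fd2c49'
print(digest == expected)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;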

&lt;h2 id=&quot;out-of-tree-build&quot;&gt;Out-of-tree build&lt;/h2&gt;

&lt;p&gt;Sources in linux.git require one additional line in the Makefile that builds the code
unconditionally as a module. A proper submission to the linux kernel would need a
Kconfig option.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;obj-m += blake2s.o
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The standalone build needs a Makefile with a few targets that use the existing
kernel build. Note that the build tree has to match the running kernel.
This is usually provided by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kernel-*-devel&lt;/code&gt; packages. Otherwise,
if you build kernels from git, you know what to do, right?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;KDIR ?= /lib/modules/`uname -r`/build
obj-m := blake2s.o

default:
        $(MAKE) -C $(KDIR) M=$$PWD

clean:
        $(MAKE) -C $(KDIR) M=$$PWD clean

modules_install:
        $(MAKE) -C $(KDIR) M=$$PWD modules_install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make&lt;/code&gt;, the kernel module is ready for use.&lt;/p&gt;

&lt;h2 id=&quot;what-next&quot;&gt;What next?&lt;/h2&gt;

&lt;p&gt;Send it upstream. Well, after some work of course.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;update the coding style of the blake2 sources&lt;/li&gt;
  &lt;li&gt;add Kconfig&lt;/li&gt;
  &lt;li&gt;write self-tests&lt;/li&gt;
  &lt;li&gt;optionally add the optimized implementations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All the files can be found here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;Makefile&quot;&gt;Makefile&lt;/a&gt;: out-of-tree build&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;blake2.h&quot;&gt;blake2.h&lt;/a&gt;: copied and updated for linux&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;blake2.h&quot;&gt;blake2-impl.h&lt;/a&gt;: copied and updated for linux&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;blake2.h&quot;&gt;blake2s.c&lt;/a&gt;: copied and updated for linux&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://blake2.net&quot;&gt;https://blake2.net&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/BLAKE2/BLAKE2&quot;&gt;https://github.com/BLAKE2/BLAKE2&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.chronox.de/libkcapi.html&quot;&gt;http://www.chronox.de/libkcapi.html&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.kernel.org/doc/html/latest/crypto/userspace-if.html&quot;&gt;https://www.kernel.org/doc/html/latest/crypto/userspace-if.html&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.kernel.org/doc/html/latest/crypto/intro.html&quot;&gt;https://www.kernel.org/doc/html/latest/crypto/intro.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Sat, 27 Apr 2019 00:00:00 +0200</pubDate>
				<link>https://kdave.github.io/linux-crypto-blake2s/</link>
				<guid isPermaLink="true">https://kdave.github.io/linux-crypto-blake2s/</guid>
			</item>
		
			<item>
				<title>Btrfs highlights in 5.2</title>
				<dc:creator>kdave</dc:creator>
				<description>&lt;p&gt;A bit more detailed overview of a btrfs update that I find interesting, see the
&lt;a href=&quot;https://git.kernel.org/linus/9f2e3a53f7ec9ef55e9d01bc29a6285d291c151e&quot;&gt;pull request&lt;/a&gt;
for the rest.&lt;/p&gt;

&lt;h2 id=&quot;read-time-write-time-corruption-detection&quot;&gt;Read-time, write-time corruption detection&lt;/h2&gt;

&lt;p&gt;The tree-checker is a recent addition to the sanity checks, a functionality
that verifies a subset of metadata structures. Its capabilities and strength are
limited by the context and bits of information available at some point, but
there’s still enough to add value.&lt;/p&gt;

&lt;p&gt;The context here means a b-tree leaf that packs several items into a block (i.e.
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nodesize&lt;/code&gt;, 16KiB by default). The individual items’ members’ values can be
verified against allowed values, the order of the keys of all items listed in
the leaf header can be checked etc. This is just to give a rough idea of what happens.&lt;/p&gt;
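
&lt;p&gt;To make the key ordering check a bit more concrete, here is a minimal
sketch. It assumes keys are (objectid, type, offset) triples compared in that
order; the function name and the sample values are made up and this is not the
kernel code.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def keys_strictly_ordered(keys):
    # Keys in a leaf must be unique and ascending, compared as
    # (objectid, type, offset) triples; Python tuples compare the same way.
    return list(keys) == sorted(set(keys))


good = [(256, 1, 0), (256, 12, 256), (257, 1, 0)]
bad = [(256, 1, 0), (257, 1, 0), (256, 12, 256)]   # out of order
dup = [(256, 1, 0), (256, 1, 0)]                   # duplicate key

print(keys_strictly_ordered(good))
print(keys_strictly_ordered(bad))
print(keys_strictly_ordered(dup))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Only the first list passes; a leaf that fails such a check would be
reported instead of being used.&lt;/p&gt;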

&lt;p&gt;With that in the works, there are two occasions that can utilize the extended
checking:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;read-time – the first time a block is read fresh from disk&lt;/li&gt;
  &lt;li&gt;write-time – when the final contents of a block are in memory and about to be written to disk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s clear that the read-time is merely a detector of problems that already
happened, so there’s not much to do besides warning and returning an error
(EUCLEAN). Turning the filesystem to read-only to prevent further problems is
another option and inevitable on some occasions.&lt;/p&gt;

&lt;p&gt;The write side check aims to catch silent errors that could make it to the
permanent storage. The reasons why this happens are twofold, but the main idea
is to catch memory bit flips. You’d be surprised how often this happens, but that
would be for a separate article entirely.&lt;/p&gt;

&lt;p&gt;The new checks in 5.2 improve:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;device item &lt;em&gt;(item that describes the physical device of the filesystem)&lt;/em&gt; –
check that items have the right key type, device id is within allowed range
and that usage and total sizes are within limits of the physical device&lt;/li&gt;
  &lt;li&gt;inode item &lt;em&gt;(item for inodes, ie. files, directories or special files)&lt;/em&gt; – the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;objectid&lt;/code&gt; (that would be the inode number) is in the range, generation is
consistent with the global one and the basic sanity of mode, flags/attributes
and link count&lt;/li&gt;
  &lt;li&gt;block group profiles – check that only a single type is set in the bit mask
that represents block group profile type of a chunk (ie.
single/dup/raid1/…)&lt;/li&gt;
&lt;/ul&gt;
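
&lt;p&gt;The block group profile check from the last item can be sketched as
follows. The bit values and helper names are illustrative assumptions, not
copied from the kernel headers; the point is only that at most one profile bit
may be set, with no bit at all meaning the single profile.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Illustrative profile bits; 'single' means that no profile bit is set
PROFILE_BITS = {'raid0': 0x8, 'raid1': 0x10, 'dup': 0x20, 'raid10': 0x40}


def profiles_set(flags):
    # Integer division followed by modulo 2 tests one bit of the mask
    return [name for name, bit in PROFILE_BITS.items()
            if (flags // bit) % 2 == 1]


def profile_is_valid(flags):
    # No bit set is the single profile, one bit is a regular profile,
    # more than one is a corruption that the checker should reject.
    return len(profiles_set(flags)) in (0, 1)


DATA = 0x1  # block group type bit, not a profile

print(profile_is_valid(DATA))                 # single
print(profile_is_valid(DATA + 0x10))          # raid1
print(profile_is_valid(DATA + 0x10 + 0x20))   # raid1 and dup at once
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;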

&lt;p&gt;As mentioned before, the bits to check are inside a single buffer that
represents the tree block and the check is really local to that. As an inode
item can represent more than just files and directories, doing structural
checks like where and how the inode item is linked to is not easy in this
context. This is basically what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;btrfs check&lt;/code&gt; would do.&lt;/p&gt;

&lt;p&gt;The checks need to be fast because they happen on each metadata block, so no
additional IO is allowed. This still brings some overhead but is (considered)
negligible compared to all other updates to the block. A measurement can be
done by adding a tracepoint, but that’s left as an exercise.&lt;/p&gt;
</description>
				<pubDate>Thu, 25 Jul 2019 00:00:00 +0200</pubDate>
				<link>https://kdave.github.io/btrfs-hilights-5.2/</link>
				<guid isPermaLink="true">https://kdave.github.io/btrfs-hilights-5.2/</guid>
			</item>
		
			<item>
				<title>Selecting the next checksum for btrfs</title>
				<dc:creator>kdave</dc:creator>
				<description>&lt;p&gt;The only checksum algorithm currently implemented in
btrfs is crc32c, with a 32bit digest. It has served well for years but has
some weaknesses, so we’re looking for a replacement and also for enhancing the
use cases based on possibly better hash algorithms.&lt;/p&gt;

&lt;p&gt;The advantages of crc32c are its simplicity of implementation, the existence of
various optimized versions and hardware support at the CPU instruction level. The error detection
strength is not great though, and collisions are easy to generate. Note that
crc32c has not been used in cryptographically sensitive contexts (eg.
deduplication).&lt;/p&gt;
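
&lt;p&gt;Why collisions are easy to come by can be hinted at with a small
experiment. CRCs are linear over XOR up to a constant, so for equal-length
inputs the checksum of a combination can be predicted from the individual
checksums. The sketch below uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zlib.crc32&lt;/code&gt; from the Python standard library,
which is plain CRC-32 and not crc32c (a different polynomial), as a stand-in
that shares the property.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import zlib


def xor_bytes(*blocks):
    # XOR equal-length byte strings together
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


a = b'first block.....'
b = b'second block....'
c = b'third block.....'

# The CRC of the XOR of three equal-length inputs is the XOR of their CRCs
combined = zlib.crc32(xor_bytes(a, b, c))
predicted = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
print(combined == predicted)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A hash meant to resist deliberate collisions must not have this kind of
predictable structure.&lt;/p&gt;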

&lt;p&gt;Side note: the collision generation weakness is used in the filesystem image
dump tool to preserve hashes of directory entries while obscuring the real
names.&lt;/p&gt;

&lt;h2 id=&quot;future-use-cases&quot;&gt;Future use cases&lt;/h2&gt;

&lt;p&gt;The natural use case is still to provide checksumming for data and metadata
blocks. With strong hashes, the same checksums can be used to aid deduplication
or verification (HMAC instead of plain hash). Due to different requirements,
one hash algorithm cannot possibly satisfy all requirements, namely speed vs.
strength. Or can it?&lt;/p&gt;

&lt;h2 id=&quot;the-criteria&quot;&gt;The criteria&lt;/h2&gt;

&lt;p&gt;The frequency of checksumming is high, every data block needs that, every
metadata block needs that.&lt;/p&gt;

&lt;p&gt;During the discussions in the community, there were several candidate hash
algorithms proposed and as it goes users want different things but we
developers want to keep the number of features sane or at least maintainable. I
think and hope the solution will address that.&lt;/p&gt;

&lt;p&gt;The first hash type is to replace crc32c with focus on &lt;strong&gt;speed&lt;/strong&gt; and not
necessarily strength (ie. collision resistance).&lt;/p&gt;

&lt;p&gt;The second type focuses on &lt;strong&gt;strength&lt;/strong&gt;, in the cryptographic sense.&lt;/p&gt;

&lt;p&gt;In addition to the technical aspects of the hashes, there are some requirements
that would allow free distribution and use of the implementations:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;implementation available under a GPL2 compatible license&lt;/li&gt;
  &lt;li&gt;available in the linux kernel, either as a library function or as a module&lt;/li&gt;
  &lt;li&gt;license that allows use in external tools like bootloaders (namely grub2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Constraints posed by btrfs implementation:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;maximum digest width is 32 bytes&lt;/li&gt;
  &lt;li&gt;blocks of size from 4KiB up to 64KiB&lt;/li&gt;
&lt;/ul&gt;
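
&lt;p&gt;The digest width constraint is easy to check against userspace
implementations of some well-known hashes. A small sketch using Python’s
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashlib&lt;/code&gt;; the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MAX_DIGEST&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;candidates&lt;/code&gt; are only for illustration:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import hashlib

MAX_DIGEST = 32  # bytes available for a checksum, per the constraint above

candidates = {
    'sha256': hashlib.sha256(),
    'sha512': hashlib.sha512(),
    'sha3-256': hashlib.sha3_256(),
    'blake2s': hashlib.blake2s(),
    'blake2b (default)': hashlib.blake2b(),
}

for name, h in candidates.items():
    fits = h.digest_size in range(1, MAX_DIGEST + 1)
    print(name, h.digest_size, 'fits' if fits else 'too wide')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;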

&lt;p&gt;For the enhanced use case of data verification (using HMAC), there’s a
requirement that might not interest everybody but still covers a lot of
deployments. And this is standard compliance and certification:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;standardized by FIPS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And maybe last but not least, use something that is in wide use already, proven by time.&lt;/p&gt;

&lt;h3 id=&quot;speed&quot;&gt;Speed&lt;/h3&gt;

&lt;p&gt;Implementation of all algorithms should be performant on common hardware, ie.
64bit architectures and hopefully not terrible on 32bit architectures or older
and weaker hardware.  By hardware I mean the CPU, not specialized hardware
cards.&lt;/p&gt;

&lt;p&gt;The crypto API provided by linux kernel can automatically select the best
implementation of a given algorithm, eg. optimized implementation in assembly
and on multiple architectures.&lt;/p&gt;

&lt;h3 id=&quot;strength&quot;&gt;Strength&lt;/h3&gt;

&lt;p&gt;For the fast hash, collisions could still be generated but hopefully not as
easily as for crc32c. For the strong hash it’s obvious that finding a collision
would be a jackpot.&lt;/p&gt;

&lt;p&gt;In case of the fast hash the quality can be evaluated using the SMHasher suite.&lt;/p&gt;

&lt;h2 id=&quot;the-contenders&quot;&gt;The contenders&lt;/h2&gt;

&lt;p&gt;The following list of hashes has been mentioned and considered or evaluated:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;xxhash&lt;/li&gt;
  &lt;li&gt;XXH3&lt;/li&gt;
  &lt;li&gt;SipHash&lt;/li&gt;
  &lt;li&gt;CRC64&lt;/li&gt;
  &lt;li&gt;SHA256&lt;/li&gt;
  &lt;li&gt;SHA3&lt;/li&gt;
  &lt;li&gt;BLAKE2&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;xxhash&quot;&gt;xxhash&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;criteria&lt;/em&gt;: license OK, implementation OK, digest size OK, not standardized but
in wide use&lt;/p&gt;

&lt;p&gt;The hash is quite fast as it tries to exploit the CPU features that allow
instruction parallelism. The SMHasher score is 10, that’s great. The linux kernel
implementation landed in 5.3.&lt;/p&gt;

&lt;p&gt;Candidate for &lt;em&gt;fast hash&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;xxh3&quot;&gt;XXH3&lt;/h3&gt;

&lt;p&gt;Unfortunately the hash is not yet finalized and cannot be in the final round,
but for evaluation of speed it was considered. The hash comes from the same
author as xxhash.&lt;/p&gt;

&lt;h3 id=&quot;siphash&quot;&gt;SipHash&lt;/h3&gt;

&lt;p&gt;This hash is made for hash tables and hashing short strings but we want 4KiB or
larger blocks. Not a candidate.&lt;/p&gt;

&lt;h3 id=&quot;crc64&quot;&gt;CRC64&lt;/h3&gt;

&lt;p&gt;Similar to crc32 and among the contenders only because it was easy to evaluate,
but otherwise it is not in the final round. It proved to be slow in the
microbenchmark.&lt;/p&gt;

&lt;h3 id=&quot;sha256&quot;&gt;SHA256&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;criteria&lt;/em&gt;: license OK, implementation OK, digest size OK, standardized in FIPS&lt;/p&gt;

&lt;p&gt;The SHA family of hashes is well known, has decent support in CPU and is
standardized. Specifically, SHA256 is the strongest that still fits into the
available 32 bytes.&lt;/p&gt;

&lt;p&gt;Candidate for &lt;em&gt;strong hash&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;sha3&quot;&gt;SHA3&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;criteria&lt;/em&gt;: license OK, implementation OK, digest size OK, standardized in FIPS&lt;/p&gt;

&lt;p&gt;Winner of the 2012 hash contest, we can’t leave it out. From the practical
perspective of checksum, the speed is bad even for the strong hash type. One of
the criteria stated above is performance without special hardware, unlike what
was preferred during the SHA3 contest.&lt;/p&gt;

&lt;p&gt;Candidate for &lt;em&gt;strong hash&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;blake2&quot;&gt;BLAKE2&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;criteria&lt;/em&gt;: license OK, implementation OK, digest size OK, not standardized&lt;/p&gt;

&lt;p&gt;From the BLAKE family that participated in the 2012 SHA contest, the ‘2’
provides a trade-off between speed and strength. More and more projects adopt it.&lt;/p&gt;

&lt;p&gt;Candidate for &lt;em&gt;strong hash&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;final-round&quot;&gt;Final round&lt;/h2&gt;

&lt;p&gt;I don’t blame you if you skipped all the previous paragraphs. The (re)search
for the next hash was quite informative and fun so it would be a shame not to
share it, also to document the selection process for some transparency. This is
a committee driven process though.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;fast hash: &lt;strong&gt;xxhash&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;strong hash: &lt;strong&gt;BLAKE2&lt;/strong&gt; and &lt;strong&gt;SHA256&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two hashes selected for the strong type is a compromise to get a
fast-but-strong hash yet also something that’s standardized.&lt;/p&gt;

&lt;p&gt;The specific version of BLAKE2 is going to be the ‘BLAKE2b-256’ variant, ie.
optimized for 64bit (2b) but with 256bit digest.&lt;/p&gt;
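
&lt;p&gt;For illustration, Python’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashlib&lt;/code&gt; exposes the same parametrization
through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;digest_size&lt;/code&gt; argument of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blake2b&lt;/code&gt;. The sketch below also shows that the
256bit variant is not simply a truncated 512bit digest, because the output
length is part of the hash parameters. The sample input is arbitrary.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import hashlib

block = bytes(4096)  # one zero-filled 4KiB block as sample input

full = hashlib.blake2b(block).digest()                   # 64 bytes by default
short = hashlib.blake2b(block, digest_size=32).digest()  # BLAKE2b-256

print(len(full), len(short))
print(short.hex())
# The digest length is mixed into the initial state, so this prints False
print(short == full[:32])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;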

&lt;h3 id=&quot;microbenchmark&quot;&gt;Microbenchmark&lt;/h3&gt;

&lt;p&gt;A microbenchmark gives more details about performance of the hashes:&lt;/p&gt;

&lt;p&gt;Block: 4KiB (4096 bytes), 
Iterations: 100000&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Hash&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Total cycles&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Cycles/iteration&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;NULL-NOP&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;56888626&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;568&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;NULL-MEMCPY&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;60644484&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;606&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;CRC64&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3240483902&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;32404&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;CRC32C&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;174338871&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1743&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;CRC32C-SW&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;174388920&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1743&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;XXHASH&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;251802871&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2518&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;XXH3&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;193287384&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1932&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;BLAKE2b-256-arch&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1798517566&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;17985&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;BLAKE2b-256&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2358400785&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;23584&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;BLAKE2s-arch&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2593112451&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;25931&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;BLAKE2s&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3451879891&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;34518&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;SHA256&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;10674261873&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;106742&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;SHA3-256&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;29152193318&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;291521&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Machine: Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz (AVX2)&lt;/p&gt;

&lt;p&gt;Hash implementations are the reference ones in C:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;NULL-NOP – stub to measure overhead of the framework&lt;/li&gt;
  &lt;li&gt;NULL-MEMCPY – simple memcpy of the input buffer&lt;/li&gt;
  &lt;li&gt;CRC64 – linux kernel lib/crc64.c&lt;/li&gt;
  &lt;li&gt;CRC32C – hw assisted crc32c (linux)&lt;/li&gt;
  &lt;li&gt;CRC32C-SW – software implementation, table-based (linux)&lt;/li&gt;
  &lt;li&gt;XXHASH – reference implementation&lt;/li&gt;
  &lt;li&gt;XXH3 – reference implementation&lt;/li&gt;
  &lt;li&gt;BLAKE2b-256-arch – 64bit optimized reference version (for SSE2/SSSE3/SSE41/AVX/AVX2)&lt;/li&gt;
  &lt;li&gt;BLAKE2b-256 – 64bit reference implementation&lt;/li&gt;
  &lt;li&gt;BLAKE2s-arch – 32bit optimized reference version&lt;/li&gt;
  &lt;li&gt;BLAKE2s – 32bit reference implementation&lt;/li&gt;
  &lt;li&gt;SHA256 – RFC 6234 reference implementation&lt;/li&gt;
  &lt;li&gt;SHA3-256 – C, based on canonical implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There aren’t optimized versions for all hashes so for fair comparison the
unoptimized reference implementation should be used. As BLAKE2 is my personal
favourite I added the other variants and optimized versions to observe the
relative improvements.&lt;/p&gt;

&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;/h3&gt;

&lt;p&gt;CRC64 was added out of mere curiosity about how it compares to the rest. Well,
curiosity satisfied.&lt;/p&gt;

&lt;p&gt;SHA3 is indeed slow on a CPU.&lt;/p&gt;
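
&lt;p&gt;The totals in the table can be turned into relative figures with a few
lines of arithmetic. The sketch below copies some of the totals from the table
and derives cycles per block, cycles per byte and the slowdown relative to
crc32c; the variable names are only for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ITERATIONS = 100000
BLOCK_SIZE = 4096

# Total cycles copied from the table above
total_cycles = {
    'CRC32C': 174338871,
    'XXHASH': 251802871,
    'BLAKE2b-256': 2358400785,
    'SHA256': 10674261873,
    'SHA3-256': 29152193318,
}

baseline = total_cycles['CRC32C']
for name, total in total_cycles.items():
    per_iteration = total // ITERATIONS
    per_byte = total / ITERATIONS / BLOCK_SIZE
    slowdown = total / baseline
    print(f'{name:12} {per_iteration:7} cycles/block '
          f'{per_byte:6.2f} cycles/byte {slowdown:6.1f}x crc32c')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With these numbers the reference SHA256 comes out at roughly 61 times the
cost of crc32c per block, and xxhash at about 1.4 times.&lt;/p&gt;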

&lt;h3 id=&quot;what-isnt-here&quot;&gt;What isn’t here&lt;/h3&gt;

&lt;p&gt;There are non-cryptographic hashes like CityHash, FarmHash, Murmur3 and more,
that were found unsuitable or not meeting some of the basic criteria. Others
like FNV or Fletcher used in ZFS are comparable in strength to crc32c, so they
would not be an improvement.&lt;/p&gt;

&lt;h2 id=&quot;merging&quot;&gt;Merging&lt;/h2&gt;

&lt;p&gt;All the preparatory work in btrfs landed in version 5.3. Hardcoded assumptions
of crc32c were abstracted, linux crypto API wired in, with additional cleanups
or refactoring. With that in place, adding a new hash is a matter of a few lines
of code adding the specifier string for crypto API.&lt;/p&gt;

&lt;p&gt;The work on btrfs-progs is following the same path.&lt;/p&gt;

&lt;p&gt;Right now, the version 5.4 is in development but new features can’t be added,
so the target for the new hashes is &lt;strong&gt;5.5&lt;/strong&gt;. The BLAKE2 algorithm family still
needs to be submitted and merged, hopefully they’ll make it to 5.5 as well.&lt;/p&gt;

&lt;p&gt;One of my merge process requirements was to do a call for public testing, so
we’re going to do that once all the kernel and progs code is ready for testing.
Stay tuned.&lt;/p&gt;
</description>
				<pubDate>Tue, 08 Oct 2019 00:00:00 +0200</pubDate>
				<link>https://kdave.github.io/selecting-hash-for-btrfs/</link>
				<guid isPermaLink="true">https://kdave.github.io/selecting-hash-for-btrfs/</guid>
			</item>
		
			<item>
				<title>Btrfs hilights in 5.3</title>
				<dc:creator>kdave</dc:creator>
				<description>&lt;p&gt;A bit more detailed overview of a btrfs update that I find interesting, see the
&lt;a href=&quot;https://git.kernel.org/linus/a18f8775419d3df282dd83efdb51c5a64d092f31&quot;&gt;pull request&lt;/a&gt;
for the rest.&lt;/p&gt;

&lt;h2 id=&quot;crc32c-uses-accelerated-versions-on-other-architectures&quot;&gt;CRC32C uses accelerated versions on other architectures&lt;/h2&gt;

&lt;p&gt;… than just Intel-based ones. There was a hard coded check for the intel SSE
feature providing the accelerated instruction, but this totally skipped other
architectures. A brief check in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;linux.git/arch/*/crypto&lt;/code&gt; for files
implementing the accelerated versions revealed that there’s a bunch of them:
ARM, ARM64, MIPS, PowerPC, s390 and SPARC. I don’t have enough hardware to show
the improvements, though.&lt;/p&gt;

&lt;h2 id=&quot;automatically-remove-incompat-bit-for-raid56&quot;&gt;Automatically remove incompat bit for RAID5/6&lt;/h2&gt;

&lt;p&gt;While this is not the fix everybody is waiting on, it demonstrates how the
user-developer-user cycle can improve things. A filesystem created
with RAID5 or -6 profiles sets an incompatibility bit. That’s supposed to be
there as long as there’s any chunk using the profiles. User expectation is that
the bit should be removed once the chunks are eg. balanced away. This is what
got implemented and serves as a precedent for any future feature that
might get removed on a given filesystem. (Note from the future: I did that for the
RAID1C34 as well).&lt;/p&gt;

&lt;p&gt;Side note: for the chunk profile it’s easy because the runtime check of
presence is quick and requires only going through a list of chunk profile
types. Example of the worst case would be dropping the incompat bit for LZO
after there are no compressed files using that algorithm. You can easily see
that this would require either enumerating all extent metadata (one time check)
or keeping track of count since mkfs time. This is IMHO not a typical request
and can be eventually done using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;btrfstune&lt;/code&gt; tool on an unmounted
filesystem.&lt;/p&gt;
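
&lt;p&gt;The easy case from the side note can be sketched in a few lines of Python. This is only an illustration and not the kernel code: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update_incompat&lt;/code&gt; is made up, and the feature flags and chunk profiles are modelled as plain sets of names instead of bits in the super block.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;RAID56_PROFILES = {'RAID5', 'RAID6'}

def update_incompat(incompat, profiles_in_use):
    '''Drop the raid56 feature once no chunk uses RAID5 or RAID6.

    Hypothetical model: both arguments are sets of names, the real
    flags are bits in the super block.
    '''
    still_used = RAID56_PROFILES.intersection(profiles_in_use)
    if 'raid56' in incompat and not still_used:
        return incompat - {'raid56'}
    return incompat

before = {'raid56', 'skinny_metadata'}
# RAID5 chunks still exist, the bit has to stay
print(sorted(update_incompat(before, {'RAID1', 'RAID5'})))
# everything balanced away to RAID1, the bit can go
print(sorted(update_incompat(before, {'RAID1'})))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The check only needs the short list of profiles in use, which is why it’s cheap enough to run at runtime. The LZO case has no such list to consult.&lt;/p&gt;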
</description>
				<pubDate>Wed, 11 Dec 2019 00:00:00 +0100</pubDate>
				<link>https://kdave.github.io/btrfs-hilights-5.3/</link>
				<guid isPermaLink="true">https://kdave.github.io/btrfs-hilights-5.3/</guid>
			</item>
		
			<item>
				<title>BLAKE3 vs BLAKE2 for BTRFS</title>
				<dc:creator>kdave</dc:creator>
				<description>&lt;p&gt;Irony isn’t it. The paint of BLAKE2 as BTRFS checksum algorithm hasn’t dried
yet, 1-2 weeks to go but there’s a successor to it. Faster, yet still supposed to
be strong. For a second or two I considered ripping out all the work and … no
not really but I do admit the excitement.&lt;/p&gt;

&lt;p&gt;Speed and strength are competing goals for a hash algorithm. The speed can be
evaluated by anyone, not so much for the strength. I am no cryptographer and
for that area rely on expertise and opinion of others. That BLAKE was a SHA3
finalist is a good indication, where BLAKE2 is its successor, weakened but not
weak. BLAKE3 is yet another step trading off strength and speed.&lt;/p&gt;

&lt;p&gt;Regarding BTRFS, BLAKE2 is going to be the faster of strong hashes for now (the
other one is SHA256). The argument I have for it now is the test of time. It’s
been deployed in many projects (even crypto currencies!), there are optimized
implementations, various language ports.&lt;/p&gt;

&lt;p&gt;The look ahead regarding more checksums is to revisit them in about 5 years.
Hopefully by that time there will be deployments, real workload performance
evaluations and overall user experience that will back future decisions.&lt;/p&gt;

&lt;p&gt;Maybe there are going to be new strong yet fast hashes developed. During my
research I learned about KangarooTwelve that’s a reduced version of SHA3 (Keccak).
The hash is constructed in a different way, perhaps there might be a Kangaroo 2π
one day on par with BLAKE3. Or something else. Why not EDON-R, it’s #1 in many
of the cr.yp.to/hash benchmarks? Another thing I learned during the research is
that hash algorithms are a dime a dozen, IOW too many to choose from. That
KangarooTwelve is internally of a different construction might be a point for
selecting it to have a wider range of “building block types”.&lt;/p&gt;

&lt;h2 id=&quot;quick-evaluation&quot;&gt;Quick evaluation&lt;/h2&gt;

&lt;p&gt;For BTRFS I have a micro benchmark, repeatedly hashing a 4 KiB block and using
cycles per block as a metric.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Block size: 4KiB&lt;/li&gt;
  &lt;li&gt;Iterations: 10000000&lt;/li&gt;
  &lt;li&gt;Digest size: 256 bits (32 bytes)&lt;/li&gt;
&lt;/ul&gt;
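
&lt;p&gt;A rough way to repeat this kind of measurement without the C framework is Python’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashlib&lt;/code&gt;. The sketch is not the benchmark behind the numbers in this post: it reports wall-clock nanoseconds instead of cycles, includes interpreter overhead, runs far fewer iterations and uses whatever implementations the local Python build ships. BLAKE3 is not in the standard library, so it’s left out. The helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ns_per_block&lt;/code&gt; is made up.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import hashlib
import time

BLOCK = bytes(4096)      # one 4 KiB block of zeros
ITERATIONS = 10000       # far fewer than the 10000000 used in the post

def ns_per_block(make_hash):
    '''Average wall-clock nanoseconds to hash one block.'''
    start = time.perf_counter_ns()
    for _ in range(ITERATIONS):
        make_hash(BLOCK).digest()
    return (time.perf_counter_ns() - start) / ITERATIONS

# All candidates produce a 256 bit (32 byte) digest
candidates = {
    'BLAKE2b-256': lambda data: hashlib.blake2b(data, digest_size=32),
    'SHA256': hashlib.sha256,
    'SHA3-256': hashlib.sha3_256,
}

results = {name: ns_per_block(fn) for name, fn in candidates.items()}
for name, ns in sorted(results.items(), key=lambda item: item[1]):
    print(f'{name:12} {ns:10.0f} ns/block')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The absolute numbers will differ from machine to machine, only the relative order is of interest.&lt;/p&gt;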

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Hash&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Total cycles&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Cycles/iteration&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Perf vs BLAKE3&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Perf vs BLAKE2b&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;BLAKE3  (AVX2)&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;111260245256&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;11126&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.876 (-13%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;BLAKE2b (AVX2)&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;127009487092&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;12700&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.141 (+14%)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;BLAKE2b (AVX)&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;166426785907&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;16642&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.496 (+50%)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.310 (+31%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;BLAKE2b (ref)&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;225053579540&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;22505&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.022 (+102%)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.772 (+77%)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Right now there’s only the reference Rust implementation and a derived C
implementation of BLAKE3, claimed not to be optimized but from my other
experience the compiler can do a good job optimizing programmers’ ideas away.
There’s only one BLAKE3 entry with the AVX2 implementation, the best hardware
support my testing box provides. As I had the other results of BLAKE2 at hand,
they’re in the table for comparison, but the most interesting pair are the AVX2
versions anyway.&lt;/p&gt;

&lt;p&gt;The improvement is 13-14%. Not much ain’t it, way less than the announced 4+x
faster than BLAKE2b. Well, it’s always important to interpret results of a
benchmark with respect to the environment of measurement and the tested
parameters.&lt;/p&gt;
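
&lt;p&gt;For reference, the two relative columns of the table are plain ratios of the cycles per iteration. A small sketch that recomputes them from the numbers above (the helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;relative&lt;/code&gt; is made up):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Cycles per iteration, copied from the table above
cycles = {
    'BLAKE3 (AVX2)': 11126,
    'BLAKE2b (AVX2)': 12700,
    'BLAKE2b (AVX)': 16642,
    'BLAKE2b (ref)': 22505,
}

def relative(name, baseline):
    '''Cost of one hash as a multiple of the cost of the baseline.'''
    return cycles[name] / cycles[baseline]

for name in cycles:
    vs_blake3 = relative(name, 'BLAKE3 (AVX2)')
    vs_blake2b = relative(name, 'BLAKE2b (AVX2)')
    print(f'{name:15} {vs_blake3:.3f} {vs_blake2b:.3f}')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;BLAKE2b (AVX2) needs about 1.14 times the cycles of BLAKE3, or the other way around BLAKE3 needs about 0.88 of the cycles of BLAKE2b. That’s where the 13-14% comes from.&lt;/p&gt;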

&lt;p&gt;For BTRFS filesystem the block size is always going to be in kilobytes. I can’t
find what input size the official benchmark results used, the bench.rs script
iterates over various sizes, so I assume it’s an average. Short input buffers
can skew the results as the setup/output overhead can be significant, while for
long buffers the compression phase is significant. I don’t have an explanation for
the difference and won’t draw conclusions about BLAKE3 in general.&lt;/p&gt;

&lt;p&gt;One thing that I dare to claim is that I can sleep well because based on the above
evaluation, BLAKE3 won’t bring a notable improvement if used as a checksum
hash.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://kdave.github.io/selecting-hash-for-btrfs/&quot;&gt;new hash for BTRFS selection&lt;/a&gt;, same testing box for the measurements&lt;/li&gt;
  &lt;li&gt;https://github.com/BLAKE3-team/BLAKE3 – top commit 02250a7b7c80ded, 2020-01-13, upstream version 0.1.1&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://bench.cr.yp.to/primitives-hash.html&quot;&gt;DJB’s hash menu&lt;/a&gt; and &lt;a href=&quot;https://bench.cr.yp.to/results-hash.html&quot;&gt;per-machine results&lt;/a&gt; with numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;personal-addendum&quot;&gt;Personal addendum&lt;/h2&gt;

&lt;p&gt;During the evaluations now and in the past, I’ve found it convenient if there’s
an offer of implementations in various languages. That eg. the Keccak project pages
do not point me directly to a C implementation slightly annoyed me, but as the
reference implementation in C++ was worse than BLAKE2, I did not take the next
step to compare the C version, wherever I would find it.&lt;/p&gt;

&lt;p&gt;BLAKE3 is fresh and Rust seems to be the only thing that has been improved
since the initial release. A plain C implementation without any
warning-not-optimized labels would be good. I think that C versions will appear
eventually, besides that Rust is now the new language hotness, there are
projects not yet &lt;em&gt;“let’s rewrite it in Rust”&lt;/em&gt;. Please bear with us.&lt;/p&gt;
</description>
				<pubDate>Tue, 21 Jan 2020 00:00:00 +0100</pubDate>
				<link>https://kdave.github.io/blake3-vs-blake2-in-btrfs/</link>
				<guid isPermaLink="true">https://kdave.github.io/blake3-vs-blake2-in-btrfs/</guid>
			</item>
		
			<item>
				<title>Btrfs hilights in 5.5: 3-copy and 4-copy block groups</title>
				<dc:creator>kdave</dc:creator>
				<description>&lt;p&gt;A bit more detailed overview of a btrfs update that I find interesting, see the
&lt;a href=&quot;https://git.kernel.org/linus/97d0bf96a0d0986f466c3ff59f2ace801e33dc69&quot;&gt;pull
request&lt;/a&gt;
for the rest.&lt;/p&gt;

&lt;h2 id=&quot;new-block-group-profiles-raid1c3-and-raid1c4&quot;&gt;New block group profiles RAID1C3 and RAID1C4&lt;/h2&gt;

&lt;p&gt;There are two new block group profiles enhancing capabilities of the RAID1
types with more copies than 2. Brief overview of the profiles is in the table
below, for table with all profiles see manual page of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkfs.btrfs&lt;/code&gt;, also
available &lt;a href=&quot;https://btrfs.wiki.kernel.org/index.php/Manpage/mkfs.btrfs#PROFILES&quot;&gt;on
wiki&lt;/a&gt;.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Profile&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Copies&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Utilization&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Min devices&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;RAID1&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;50%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;RAID1C3&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;33%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;RAID1C4&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;4&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;25%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The way all the RAID1 types work is that there are 2 / 3 / 4 exact copies over
all available devices. The terminology is different from linux MD RAID, that
can do any number of copies. We decided not to do that in btrfs to keep the
implementation simple. Another point for simplicity is from the users’
perspective. That RAID1C3 provides 3 copies is clear from the type. Even after
adding a new device and not doing balance, the guarantees about redundancy
still hold. Newly written data will use the new device together with 2 devices
from the original set.&lt;/p&gt;
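
&lt;p&gt;The numbers in the table follow directly from the number of copies. A minimal sketch, assuming every copy lives on a different device and all devices have the same size (the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;describe&lt;/code&gt; and the dictionary layout are made up for the example):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Profile name and its fixed number of copies
PROFILES = {'RAID1': 2, 'RAID1C3': 3, 'RAID1C4': 4}

def describe(profile):
    copies = PROFILES[profile]
    return {
        'copies': copies,
        'utilization': 1 / copies,            # share of raw space holding unique data
        'min_devices': copies,                # one device per copy
        'survives_lost_devices': copies - 1,  # one intact copy must remain
    }

for name in PROFILES:
    info = describe(name)
    percent = round(100 * info['utilization'])
    print(name, info['copies'], f'{percent}%', info['min_devices'])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because the number of copies is part of the profile name, none of these values change when devices are added.&lt;/p&gt;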

&lt;p&gt;Compare that with a hypothetical RAID1CN, on a filesystem with M devices (N &amp;lt;=
M). When the filesystem starts with 2 devices, equivalent to RAID1, adding a
new one will have mixed redundancy guarantees after writing more data. Old data
with RAID1, new with RAID1C3 – but all accounted under RAID1CN profile. A full
re-balance would be required to make it a reliable 3-copy RAID1. Add another
device, going to RAID1C4, same problem with more data to shuffle around.&lt;/p&gt;

&lt;p&gt;The allocation policy would depend on number of devices, making it hard for the
user to know the redundancy level. This is already the case for
RAID0/RAID5/RAID6. For the striped profile RAID0 it’s not much of a problem,
there’s no redundancy. For the parity profiles it’s been a known problem and
a new balance filter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stripes&lt;/code&gt; has been added to support fine grained selection of
block groups.&lt;/p&gt;

&lt;p&gt;Speaking about RAID6, there’s the elephant in the room, trying to cover the write
hole. The lack of resiliency against damage of 2 devices has been bothering all of us
because of the known write hole problem in the RAID6 implementation. How this
is going to be addressed is for another post, but for now, the newly added
RAID1C3 profile is a reasonable substitute for RAID6.&lt;/p&gt;

&lt;h3 id=&quot;how-to-use-it&quot;&gt;How to use it&lt;/h3&gt;

&lt;p&gt;On a freshly created filesystem it’s simple:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# mkfs.btrfs -d raid1c3 -m raid1c4 /dev/sd[abcd]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The command combines both new profiles for the sake of demonstration, you should
always consider the expected use and required guarantees and choose the
appropriate profiles.&lt;/p&gt;

&lt;p&gt;Changing the profile later on an existing filesystem works as usual, you can
use:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# btrfs balance start -mconvert=raid1c3 /mnt/path
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Provided there are enough devices and enough space to do the conversion, this
will go through all metadata block groups and after it finishes, all of them
will be of the desired type.&lt;/p&gt;

&lt;h3 id=&quot;backward-compatibility&quot;&gt;Backward compatibility&lt;/h3&gt;

&lt;p&gt;The new block groups are not understood by old kernels and can’t be mounted,
not even in the read-only mode. To prevent that, a new incompatibility bit is
introduced, called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;raid1c34&lt;/code&gt;. Its presence on a device can be checked by
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;btrfs inspect-internal dump-super&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;incompat_flags&lt;/code&gt;. On a running
system the incompat features are exported in sysfs,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/sys/fs/btrfs/UUID/features/raid1c34&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;outlook&quot;&gt;Outlook&lt;/h3&gt;

&lt;p&gt;There is no demand for RAID1C5 at the moment (I asked more than once). The
space utilization is low already, the RAID1C4 survives 3 dead devices so IMHO
this is enough for most users. Extending resilience to more devices should
perhaps take a different route.&lt;/p&gt;

&lt;p&gt;With more copies there’s potential for parallelization of reads from multiple
devices. Up to now this is not optimal, there’s a decision logic that’s
semi-random based on process ID of the btrfs worker threads or process
submitting the IO. Better load balancing policy is a work in progress and could
appear in 5.7 at the earliest (because 5.6 development is now in fixes-only
mode).&lt;/p&gt;

&lt;h3 id=&quot;look-back&quot;&gt;Look back&lt;/h3&gt;

&lt;p&gt;The history of the patchset is a bit bumpy. There was enough motivation and
requests for the functionality, so I started the analysis of what needs to be
done. Several cleanups were necessary to unify code and to make it easily
extendable for more copies while using the same mirroring code. In the end,
change a few constants and be done.&lt;/p&gt;

&lt;p&gt;Following with testing, I tried simple mkfs and conversions, that worked well.
Then scrub, overwrite some blocks and let the auto-repair do the work. No
hiccups. The remaining and important part was the device replace, as the
expected use case was to substitute RAID6, replacing a missing or damaged disk.
I wrote the test script, replace 1 missing, replace 2 missing. And it did not
work. While the filesystem was mounted, everything seemed OK. Unmount, check
again and the devices were still missing. Not cool, right.&lt;/p&gt;

&lt;p&gt;Due to lack of time before the upcoming merge window (a code freeze before next
development cycle), I had to declare it not ready and put it aside. This was in
late 2018. For a highly requested feature this was not an easy decision. Should
it be something less important, the development cycle between rc1 and final
release provides enough time to fix things up. But due to the maintainer role
with its demands I was not confident that I could find enough time to debug and
fix the remaining problem. Also nobody offered help to continue the work, but
that’s how it goes.&lt;/p&gt;

&lt;p&gt;In late 2019 I had some spare time and looked at the pending work again.
Enhanced the test script with more debugging messages and more checks. The code
worked well, the test script was subtly broken. Oh well, what a blunder. That
cost a year, but on the other hand releasing a highly requested feature that
lacks an important part was not an appealing option.&lt;/p&gt;

&lt;p&gt;The patchset was added to 5.5 development queue at about the last time before
freeze, final 5.5 release happened a week ago.&lt;/p&gt;
</description>
				<pubDate>Sun, 02 Feb 2020 00:00:00 +0100</pubDate>
				<link>https://kdave.github.io/btrfs-hilights-5.5-raid1c34/</link>
				<guid isPermaLink="true">https://kdave.github.io/btrfs-hilights-5.5-raid1c34/</guid>
			</item>
		
			<item>
				<title>Btrfs hilights in 5.4: tree checker updates</title>
				<dc:creator>kdave</dc:creator>
				<description>&lt;p&gt;A bit more detailed overview of a btrfs update that I find interesting, see the
&lt;a href=&quot;https://git.kernel.org/linus/7d14df2d280fb7411eba2eb96682da0683ad97f6&quot;&gt;pull request&lt;/a&gt;
for the rest.&lt;/p&gt;

&lt;p&gt;There’s not much to show in this release. Some users find that good too, a boring release. But still there are some changes of interest. The 5.4 is a long-term support stable tree, stability and core improvements are perhaps more appropriate than features that need a release or two to stabilize.&lt;/p&gt;

&lt;p&gt;Which release becomes the long-term stable one is not known in advance, so the safe approach is not to push half-baked features that could end up in a stable tree and possibly require more intrusive fixups.&lt;/p&gt;

&lt;p&gt;The development cycle happened over summer and this slowed down the pace of patch reviews and update turnarounds.&lt;/p&gt;

&lt;h2 id=&quot;tree-checker-updates&quot;&gt;Tree-checker updates&lt;/h2&gt;

&lt;p&gt;The tree-checker is a sanity checker of metadata that are read from/written to devices. Over time it’s being enhanced by more checks, let’s have a look at two of them.&lt;/p&gt;

&lt;h3 id=&quot;root_item-checks&quot;&gt;ROOT_ITEM checks&lt;/h3&gt;

&lt;p&gt;The item represents root of a b-tree, of the internal or the subvolume trees.&lt;/p&gt;

&lt;p&gt;Let’s take an example from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;btrfs inspect dump-tree&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;       item 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
                generation 5 root_dirid 0 bytenr 30523392 level 0 refs 1
                lastsnap 0 byte_limit 0 bytes_used 16384 flags 0x0(none)
                uuid 00000000-0000-0000-0000-000000000000
                drop key (0 UNKNOWN.0 0) level 0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Some of the metadata inside the item allow only simple checks, following &lt;a href=&quot;https://git.kernel.org/linus/259ee7754b6793af8bdd77f9ca818bc41cfe9541&quot;&gt;commit 259ee7754b6793&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;key.objectid&lt;/code&gt; must match the tree that’s being read, though the code verifies only if the type is not 0&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;key.offset&lt;/code&gt; must be 0&lt;/li&gt;
  &lt;li&gt;block offset &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bytenr&lt;/code&gt; must be aligned to sector size (4KiB in this case)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;itemsize&lt;/code&gt; depends on the item type, but for the root item it’s fixed value&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;drop_level&lt;/code&gt; are 0 to 7, but it’s not possible to cross check if the tree is really of that level&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generation&lt;/code&gt; must be lower than the super block generation, same for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lastsnap&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flags&lt;/code&gt; can be simply compared to the bit mask of allowed flags, right now there are two, one represents a read-only subvolume and another a subvolume that has been marked as deleted but its blocks not yet cleaned&lt;/li&gt;
&lt;/ul&gt;
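
&lt;p&gt;To make the list more concrete, here’s a sketch of the checks in Python, fed with the values from the dump above. It’s an illustration and not the kernel code: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;root_item_problems&lt;/code&gt; and the dictionary layout are made up, the two flag bit positions are my assumption, the super block generation 5 is an example value, and the sketch accepts a generation equal to the super block one, the exact bound in the kernel may differ.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SECTORSIZE = 4096
ROOT_ITEM_SIZE = 439              # itemsize from the dump-tree example
MAX_LEVEL = 8                     # valid levels are 0 to 7
ALLOWED_FLAGS = 2**0 | 2**48      # assumed bits: read-only and dead subvolume

def root_item_problems(key, item, super_generation):
    '''Return a list of failed checks, empty if the item looks sane.'''
    objectid, offset = key
    problems = []
    if objectid == 0:
        problems.append('key.objectid is 0')
    if offset != 0:
        problems.append('key.offset is not 0')
    if item['bytenr'] % SECTORSIZE != 0:
        problems.append('bytenr not aligned to sector size')
    if item['itemsize'] != ROOT_ITEM_SIZE:
        problems.append('unexpected itemsize')
    for field in ('level', 'drop_level'):
        if item[field] not in range(MAX_LEVEL):
            problems.append(field + ' out of range')
    for field in ('generation', 'lastsnap'):
        if item[field] not in range(super_generation + 1):
            problems.append(field + ' newer than super block')
    # Any bit outside of the allowed mask changes the result of the OR
    if (item['flags'] | ALLOWED_FLAGS) != ALLOWED_FLAGS:
        problems.append('unknown flags')
    return problems

EXTENT_TREE = 2                   # objectid of the extent tree
example = {
    'bytenr': 30523392, 'itemsize': 439, 'level': 0, 'drop_level': 0,
    'generation': 5, 'lastsnap': 0, 'flags': 0,
}
print(root_item_problems((EXTENT_TREE, 0), example, super_generation=5))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the item from the example the list of problems is empty. All the checks are local to the item, nothing else has to be read from the device.&lt;/p&gt;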

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;refs&lt;/code&gt; is a reference counter and sanity check would require reading all the expected reference holders, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bytes_used&lt;/code&gt; would need to look up the block that it accounts, etc. The subvolume trees have more data like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ctime&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;otime&lt;/code&gt; and real &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uuid&lt;/code&gt;s. If you wonder what’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte_limit&lt;/code&gt;, this used to be a mechanism to emulate quotas by setting the limit value, but it has been deprecated and unused for a long time. One day we might find another purpose for the bytes.&lt;/p&gt;

&lt;p&gt;Many of the tree-checker enhancements are follow ups to fuzz testing and reports, as it was in this case. The &lt;a href=&quot;https://bugzilla.kernel.org/show_bug.cgi?id=203261&quot;&gt;bug report&lt;/a&gt; shows that some of the incorrect data were detected and even triggered auto-repair (as this was on filesystem with DUP metadata) but there was too much damage and it crashed at some point. The crash was not random but a BUG_ON of an unexpected condition, that’s a sanity check of last resort. Catching inconsistent data early with a graceful error handling is of course desired and ongoing work.&lt;/p&gt;

&lt;h3 id=&quot;extent-metadata-item-checks&quot;&gt;Extent metadata item checks&lt;/h3&gt;

&lt;p&gt;There are two item types that represent extents and information about sharing. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXTENT_ITEM&lt;/code&gt; is older and bigger while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;METADATA_ITEM&lt;/code&gt; is the building block of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;skinny-metadata&lt;/code&gt; feature, smaller and more compact. Both items contain type of reference(s) and the owner (a tree id). Besides the generic checks that also the root item does (alignment, value ranges, generation), there’s a number of allowed combinations of the reference types and extent types. The &lt;a href=&quot;https://git.kernel.org/linus/f82d1c7ca8ae1bf89e8d78c5ecb56b6b228c1a75&quot;&gt;commit f82d1c7ca8ae1bf&lt;/a&gt; implements that, however further explanation is out of scope of the overview as the sharing and references are the fundamental design of btrfs.&lt;/p&gt;

&lt;p&gt;Example of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;METADATA_ITEM&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;        item 170 key (88145920 METADATA_ITEM 0) itemoff 10640 itemsize 33
                refs 1 gen 27 flags TREE_BLOCK
                tree block skinny level 0
                tree block backref root FS_TREE
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXTENT_ITEM&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;        item 27 key (20967424 EXTENT_ITEM 4096) itemoff 14895 itemsize 53
                refs 1 gen 499706 flags DATA
                extent data backref root FS_TREE objectid 8626071 offset 0 count 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is for a simple case with one reference, tree (for metadata) and ordinary data, so comparing the sizes shows 20 bytes saved. On my 20GiB root partition with about 70 snapshots there are XXX EXTENT and YYY METADATA items.&lt;/p&gt;

&lt;p&gt;Otherwise there can be more references inside one item (eg. many snapshots of a file that is randomly updated over time) so the overhead of the item itself is smaller.&lt;/p&gt;
</description>
				<pubDate>Sat, 15 Feb 2020 00:00:00 +0100</pubDate>
				<link>https://kdave.github.io/btrfs-hilights-5.4/</link>
				<guid isPermaLink="true">https://kdave.github.io/btrfs-hilights-5.4/</guid>
			</item>
		
			<item>
				<title>Authenticated hashes for btrfs (part 1)</title>
				<dc:creator>kdave</dc:creator>
				<description>&lt;p&gt;There was a request to provide authenticated hashes in btrfs, natively as one of the btrfs checksum algorithms. Sounds fun but there’s always more to it, even if this sounds easy to implement.&lt;/p&gt;

&lt;p&gt;Johaness T. at that time in SUSE sent the patchset adding the support for SHA256 &lt;a href=&quot;https://lore.kernel.org/linux-fsdevel/20200428105859.4719-1-jth@kernel.org/&quot;&gt;[1]&lt;/a&gt; with a Labs conference paper, summarizing existing solutions and giving details about the proposed implementation and use cases.&lt;/p&gt;

&lt;p&gt;The first version of the patchset posted got some feedback, issues were found and some ideas suggested. Things have stalled a bit, but the feature is still very interesting and really not hard to implement. The support for additional checksums has provided enough support code to just plug in the new algorithm and enhance the existing interfaces to provide the key bytes. So until now I’ve assumed you know what an authenticated hash means, but for clarity and in simple terms: a checksum that depends on a key. The main point is that it’s impossible to generate the same checksum for given data without knowing the key, where &lt;em&gt;impossible&lt;/em&gt; is used in the cryptographic-strength sense, there’s an almost zero probability doing that by chance and brute force attack is not practical.&lt;/p&gt;
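
&lt;p&gt;In code, the difference to a plain checksum is just the additional key. A minimal userspace sketch, assuming HMAC with SHA256 as the keyed construction; this is neither the kernel crypto API nor the patchset code, and the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auth_checksum&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;verify&lt;/code&gt; as well as the key are made up:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import hashlib
import hmac

def auth_checksum(key, block):
    '''Keyed checksum of one block: HMAC with SHA256, 32 bytes.'''
    return hmac.new(key, block, hashlib.sha256).digest()

def verify(key, block, stored):
    # compare_digest does a constant-time comparison
    return hmac.compare_digest(auth_checksum(key, block), stored)

key = b'hypothetical filesystem key'
block = bytes(4096)
stored = auth_checksum(key, block)

print(verify(key, block, stored))                 # True
print(verify(b'some other key', block, stored))   # False, wrong key
tampered = b'x' + block[1:]
print(verify(key, tampered, stored))              # False, data changed
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Without the key, an attacker who changes the block has no practical way to compute a matching value for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stored&lt;/code&gt;.&lt;/p&gt;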

&lt;h2 id=&quot;auth-hash-fsverity&quot;&gt;Auth hash, fsverity&lt;/h2&gt;

&lt;p&gt;A notable existing solution for that is &lt;em&gt;fsverity&lt;/em&gt; that works in read-only fashion, where the key is securely hidden and used only to verify that data that are read from media haven’t been tampered with. A typical use case is an OS image in your phone. But that’s not all. Images of OS appear in all sorts of boxed devices, IoT. Nowadays, with the explosion of edge computing, assuring integrity of the end devices is a fundamental requirement.&lt;/p&gt;

&lt;p&gt;Where btrfs can add some value is the read AND write support, with an authenticated hash. This brings questions around key handling, and not everybody is OK with a device that could potentially store malicious/invalid data with a proper authenticated checksum. So yeah, use something else, this is not your use case, or maybe there’s another way how to make sure the key won’t be compromised easily. This is beyond the scope of what filesystem can do, though.&lt;/p&gt;

&lt;p&gt;As an example use case of writable filesystem with authenticated hash: detect outside tampering with on-disk data, eg. when the filesystem was unmounted. Filesystem metadata formats are public, interesting data can be located by patterns on the device, so changing a few bytes and updating the checksum(s) is not hard.&lt;/p&gt;

&lt;p&gt;There’s one issue that was brought up and I think it’s not hard to observe anyway: there’s a total dependency on the key to verify a basic integrity of the data. Ie. without the key it’s not possible to say if the data are valid as if a basic checksum was used. This might be still useful for a read-only access to the filesystem, but absence of key makes this impossible.&lt;/p&gt;

&lt;h2 id=&quot;existing-implementations&quot;&gt;Existing implementations&lt;/h2&gt;

&lt;p&gt;As was noted in the LWN discussion &lt;a href=&quot;https://lwn.net/Articles/819143/&quot;&gt;[2]&lt;/a&gt;, ZFS uses two checksums: one is authenticated and one is not. I point you to the comment stating that, as I was not able to navigate far enough in the ZFS code to verify the claim, but the idea is clear. It’s said that the authenticated hash is eg. SHA512 and the plain hash is SHA256, split half/half in the bytes available for the checksum. The hashes are stored by truncating each checksum to its first 16 bytes and storing the two halves consecutively. As both hashes are cryptographically strong, the first 16 bytes (128 bits) &lt;em&gt;should&lt;/em&gt; provide enough strength despite the truncation.&lt;/p&gt;
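
&lt;p&gt;The half/half layout described in that comment can be sketched as follows. This only illustrates the description above and is not verified against the ZFS code. HMAC-SHA512 as the keyed construction, the order of the two halves and all names are assumptions.&lt;/p&gt;

```python
import hashlib
import hmac

HALF = 16  # bytes kept from each digest, ie. 128 bits


def split_checksum(key: bytes, data: bytes) -> bytes:
    """Return 32 bytes: truncated keyed SHA512 followed by truncated plain SHA256."""
    # Assumption: HMAC is the keyed construction, the source only says "SHA512".
    authenticated = hmac.new(key, data, hashlib.sha512).digest()[:HALF]
    plain = hashlib.sha256(data).digest()[:HALF]
    return authenticated + plain


# Hypothetical inputs, for illustration only.
data = b"contents of one block"
with_key_a = split_checksum(b"key A", data)
with_key_b = split_checksum(b"key B", data)

# The plain half does not depend on the key, the authenticated half does.
print(with_key_a[HALF:] == with_key_b[HALF:])
print(with_key_a[:HALF] == with_key_b[:HALF])
```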

&lt;p&gt;When I was thinking about that, I had a different idea of how to do it. Not that copying the scheme would not work for btrfs: anything that the Linux kernel crypto API provides is usable, so the same is achievable. I’m not judging the decisions about which hashes to use or how to do the split, it works and I don’t see a problem in the strength. Where I see potential for an improvement is performance, without sacrificing strength &lt;em&gt;too much&lt;/em&gt;. Trade-offs.&lt;/p&gt;

&lt;p&gt;A software implementation of SHA256 running on the CPU is comparatively slower than checksums with hardware aids (like CRC32C instructions) or hashes designed to perform well on CPUs. That was the topic of the previous round of new hashes, so we now compete against BLAKE2b and XXHASH. There are CPUs with native instructions to calculate SHA256 and the performance improvement is noticeable, orders of magnitude better. But the support is not as widespread as eg. for CRC32C. Anyway, there’s always choice and hardware improves over time. The number of hashes may seem to explode but as long as it’s manageable inside the filesystem, we take it. And a coffee please.&lt;/p&gt;

&lt;h2 id=&quot;secondary-hash&quot;&gt;Secondary hash&lt;/h2&gt;

&lt;p&gt;The checksum scheme proposed is to use a cryptographic hash and a non-cryptographic one. Given the current support for SHA256 and BLAKE2b, the cryptographic hash is given. There are two of them and that’s fine. I’m not drawing an exact parallel with ZFS, the common point for the cryptographic hash is that there are limited options and the calculation is expensive by design. This is where the non-cryptographic hash can be debated. Also I want to call it &lt;em&gt;secondary&lt;/em&gt; hash, with obvious meaning that it’s not too important by default and comes second when the authenticated hash is available.&lt;/p&gt;

&lt;p&gt;We have CRC32C and XXHASH to choose from. Note that there are already two hashes from the start so supporting both secondary hashes would double the number of final combinations. We’ve added XXHASH to enhance the checksum collision space from 32 bits to 64 bits. What I propose is to use just XXHASH as the secondary hash, resulting in two new hashes for the authenticated and secondary hash. I haven’t found a good reason to also include CRC32C.&lt;/p&gt;

&lt;p&gt;Another design point was where to do the split and truncation. As the XXHASH has fixed length, this could be defined as 192 bits for the cryptographic hash and 64 bits for full XXHASH.&lt;/p&gt;
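
&lt;p&gt;A rough sketch of the proposed 192/64 split within a 32-byte checksum follows. Python’s standard library has no XXHASH, so an unkeyed 8-byte BLAKE2b digest stands in for the 64-bit secondary hash. That stand-in, HMAC-SHA256 as the authenticated hash, the order of the two parts and all names are assumptions made for this example.&lt;/p&gt;

```python
import hashlib
import hmac

AUTH_BYTES = 24      # 192 bits of the authenticated hash
SECONDARY_BYTES = 8  # 64 bits, the full length of XXHASH


def auth_hash(key: bytes, data: bytes) -> bytes:
    """Keyed hash truncated to 192 bits (HMAC-SHA256 assumed)."""
    return hmac.new(key, data, hashlib.sha256).digest()[:AUTH_BYTES]


def secondary_hash(data: bytes) -> bytes:
    """Placeholder for 64-bit XXHASH: an unkeyed 8-byte BLAKE2b digest."""
    return hashlib.blake2b(data, digest_size=SECONDARY_BYTES).digest()


def combined_checksum(key: bytes, data: bytes) -> bytes:
    """32 bytes in total: the authenticated part first, then the secondary part."""
    return auth_hash(key, data) + secondary_hash(data)


def split(checksum: bytes):
    """Return the (authenticated, secondary) parts of a stored checksum."""
    return checksum[:AUTH_BYTES], checksum[AUTH_BYTES:]


# Hypothetical key and data, for illustration only.
checksum = combined_checksum(b"hypothetical key", b"contents of one block")
authenticated, secondary = split(checksum)
print(len(checksum), len(authenticated), len(secondary))
```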

&lt;p&gt;Here we are, we could have authenticated SHA256 accompanied by XXHASH, or the same with BLAKE2b. The checksum split also splits the decision tree of what to do when the checksum partially matches. For a single checksum it’s a simple &lt;em&gt;yes/no&lt;/em&gt; decision. The partial match is the interesting case:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;primary (key available) hash matches, secondary does not – as the authenticated hash is hard to forge, it’s trusted (even if it’s not full length of the digest)&lt;/li&gt;
  &lt;li&gt;primary (key available) does not match, secondary does not – checksum mismatch for the same reason as above&lt;/li&gt;
  &lt;li&gt;primary (key not available) does not match, secondary does – this is the prime time for the secondary hash, the floor is yours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to 4 outcomes of the checksum verification, compared to 2. A boolean type can simply represent the yes/no outcome but for two hashes it’s not that easy. It depends on the context, though I think it should still be straightforward to decide what to do in the code. Nevertheless, this has to be updated in all calls to checksum verification and has to reflect the key availability, eg. in the case where the data are auto-repaired during scrub or when there’s a copy.&lt;/p&gt;
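
&lt;p&gt;One possible reading of the four outcomes, sketched as an enum instead of a boolean: when the key is available only the authenticated part decides, and without the key the secondary hash is all there is. The names, the hash choices (HMAC-SHA256 truncated to 24 bytes, and an 8-byte BLAKE2b digest standing in for XXHASH) and the layout are assumptions for this example, not the prototype’s code.&lt;/p&gt;

```python
import enum
import hashlib
import hmac
from typing import Optional

AUTH_BYTES = 24  # assumed layout: 192-bit authenticated part, then 64-bit secondary


class Verdict(enum.Enum):
    AUTHENTICATED = "key available, primary matches"
    MISMATCH = "key available, primary does not match"
    UNAUTHENTICATED_MATCH = "no key, secondary matches"
    UNAUTHENTICATED_MISMATCH = "no key, secondary does not match"


def auth_hash(key: bytes, data: bytes) -> bytes:
    """Keyed primary hash (HMAC-SHA256 assumed), truncated to 192 bits."""
    return hmac.new(key, data, hashlib.sha256).digest()[:AUTH_BYTES]


def secondary_hash(data: bytes) -> bytes:
    """Placeholder for 64-bit XXHASH: an unkeyed 8-byte BLAKE2b digest."""
    return hashlib.blake2b(data, digest_size=8).digest()


def checksum(key: bytes, data: bytes) -> bytes:
    return auth_hash(key, data) + secondary_hash(data)


def verify(stored: bytes, data: bytes, key: Optional[bytes]) -> Verdict:
    stored_auth, stored_secondary = stored[:AUTH_BYTES], stored[AUTH_BYTES:]
    if key is not None:
        # With the key, the primary hash is trusted and the secondary is ignored.
        if hmac.compare_digest(stored_auth, auth_hash(key, data)):
            return Verdict.AUTHENTICATED
        return Verdict.MISMATCH
    # Without the key, only the secondary hash can be checked.
    if hmac.compare_digest(stored_secondary, secondary_hash(data)):
        return Verdict.UNAUTHENTICATED_MATCH
    return Verdict.UNAUTHENTICATED_MISMATCH


# Hypothetical key and data, for illustration only.
key = b"hypothetical key"
data = b"contents of one block"
stored = checksum(key, data)
for candidate_key in (key, None):
    for candidate_data in (data, b"modified block"):
        print(verify(stored, candidate_data, candidate_key).name)
```

&lt;p&gt;A caller such as scrub would then have to decide what each verdict means in its context, for example whether an unauthenticated match is good enough to use a copy for repair.&lt;/p&gt;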

&lt;h2 id=&quot;performance-considerations&quot;&gt;Performance considerations&lt;/h2&gt;

&lt;p&gt;The performance comparison should now be clear: we have the potentially slow SHA256 but fast XXHASH, for each metadata and data block, vs slow SHA512 and slow SHA256. I reckon it’s possible to also select a SHA256/SHA256 split in ZFS, but that can’t beat SHA256/XXHASH.&lt;/p&gt;

&lt;p&gt;The key availability seems to be the key point in all that, puns notwithstanding. The initial implementation, for simplicity, passed the raw key bytes to the kernel and to the userspace utilities. This is maybe OK for a prototype but under no circumstances can it survive until a final release. There’s key management wired deep into the Linux kernel, there’s a library for the whole API and there are command line tools. We ought to use that. Pass the key by name, not the raw bytes.&lt;/p&gt;

&lt;p&gt;Key management has its own pitfalls and surprises (key owned vs possessed), but let’s assume that there’s a standardized way to obtain the key bytes from the key name. In the kernel it’s “READ_USER_KEY_BYTES”, in userspace it’s either &lt;em&gt;keyctl_read&lt;/em&gt; from &lt;em&gt;libkeyutils&lt;/em&gt; or a raw syscall to &lt;em&gt;keyctl&lt;/em&gt;. Problem solved, on the low level. But, well, don’t try that over &lt;em&gt;ssh&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Accessing a btrfs image for various reasons (check, image, restore) now needs the key to verify data or even the key itself to perform modifications (check + repair). The command line interface has to be extended for all commands that interact with the filesystem offline, ie. the image and not the mounted filesystem.&lt;/p&gt;

&lt;p&gt;This results in a global option, like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;btrfs --auth-key 1234 inspect-internal dump-tree&lt;/code&gt;, compared to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;btrfs inspect-internal dump-tree --auth-key 1234&lt;/code&gt;. This is not finalized, but a global option is now the preferred choice.&lt;/p&gt;

&lt;h2 id=&quot;final-words&quot;&gt;Final words&lt;/h2&gt;

&lt;p&gt;I have a prototype that does not work in all cases but at least passes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkfs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mount&lt;/code&gt;. The number of checksum verification cases got above what I was able to fix by the time of writing this. I think this has enough substance on its own so I’m pushing it out as part 1. There are open questions regarding the command line interface and also some kind of proof or discussion regarding attacks. Stay tuned.&lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;[1] &lt;a href=&quot;https://lore.kernel.org/linux-fsdevel/20200428105859.4719-1-jth@kernel.org/&quot;&gt;https://lore.kernel.org/linux-fsdevel/20200428105859.4719-1-jth@kernel.org/&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;[2] &lt;a href=&quot;https://lwn.net/Articles/819143/&quot;&gt;https://lwn.net/Articles/819143/&lt;/a&gt; LWN discussion under &lt;a href=&quot;https://lwn.net/Articles/818842/&quot;&gt;Authenticated Btrfs (2020)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Mon, 24 May 2021 00:00:00 +0200</pubDate>
				<link>https://kdave.github.io/authenticated-hashes-for-btrfs-part1/</link>
				<guid isPermaLink="true">https://kdave.github.io/authenticated-hashes-for-btrfs-part1/</guid>
			</item>
		
	</channel>
</rss>
