
Metadata DB silent corruption protection/detection #271

@joonas-fi

Description


Possible mitigations

  1. Store the DB on an integrity-protected filesystem.
  • Ext4 doesn't support checksumming data (recent advancements added checksumming for metadata, but that's not enough), though below Ext4 it would be possible to use something like dm-integrity.
  • This is harder to guarantee because we can't control the circumstances in which users will want to run Varasto.
  2. When migrating to SQLite (see Migrate to SQLite #206), use something like The Checksum VFS Shim.
  • This however complicates deployment of SQLite, since it's a plugin we must conditionally compile in or load at runtime. Does it work with SQLite's Go port etc.?
  3. Build integrity verification one layer above SQLite. Since we're targeting a move to an EventSourcing-based architecture, we can rely on its properties to project the event log into two different instances of the SQLite database: 1) operational 2) verification copy. If the projection is deterministic (it should be), I suspect the SQLite DB should be byte-identical (or at least semantically equivalent) on disk, so we can compare two DB exports (based on the same event cursor). If they differ, we know either instance is bad, and we can shut down and investigate before more errors start accumulating. Of course this assumes the event log has its own integrity verification (this just shifts the problem there), but it needs that anyway. It looks like this:
flowchart TB
    eventlog[Event log]
    operationaldb[Operational DB]
    comparisondb[Comparison DB]

    subgraph "DB integrity verification"
        compare[Comparison]
    end

    eventlog -- projected to --> operationaldb --> compare
    eventlog -- projected to --> comparisondb --> compare

Comparison of approaches:

| Approach | Works for all users | Does not complicate SQLite deployment |
| -------- | ------------------- | ------------------------------------- |
| 1        | ❌                  | ✅                                    |
| 2        | ✅                  | ❌                                    |
| 3        | ✅                  | ✅                                    |

Option 3 seems the cleanest.

Concrete incident from my own instance

Integrity verification job identified three different disks having the same blob missing:


This is highly suspicious (was this a test of mine from years ago where I purposefully removed same blob from all three replicas?).

The blob ref from database record is:

6510b426e09cef0843dd8bdedd946067bcb016a9d0990794aa0fb938f4856dd8

The source for this blob ref is bucket scan:

func (r *SimpleRepository) EachFrom(from []byte, fn func(record any) error, tx *bbolt.Tx) error {
	bucket := tx.Bucket(r.bucketName)
	if bucket == nil {
		return ErrBucketNotFound
	}

	all := bucket.Cursor()

	// scan records in key order starting from the given key,
	// decoding each msgpack payload into a freshly allocated record
	for key, value := all.Seek(from); key != nil; key, value = all.Next() {
		record := r.alloc()
		if err := msgpack.Codec.Unmarshal(value, record); err != nil {
			return err
		}

		if err := fn(record); err != nil {
			return err
		}
	}

	return nil
}

When patched with some debug code:

		if idExpected := r.idExtractor(record); !bytes.Equal(key, idExpected) {
			return fmt.Errorf("repo[%s] record[%x]: DISCREPANCY: id expected %x", r.bucketName, key, idExpected)
		}

When doing a DB export (which visits each DB record), it first failed with this:

repo[blobs] record[6510b426e09cef0843dd8bdedd946067bcb016a9d0990794aa2fb938f4856dd8]: DISCREPANCY: id expected 6510b426e09cef0843dd8bdedd946067bcb016a9d0990794aa0fb938f4856dd8

(the "expected" value comes from the actual record; the "record" value comes from the bucket's key, which should always be the same as the ID in the record)

So it looks like the ID in the record "payload" has bitrotted while the key is the original correct one. That explains why the integrity verifier could read the blob metadata (it uses the bucket scan, which yields the incorrect ID from the payload), but querying the blob metadata from the REST API didn't work (it uses OpenByPrimaryKey(), which expects the correct key and not the bitrotted ID from the record payload).

Most likely this is bitrot on the SSD and not a bit flip in RAM when the record was last modified, simply based on the relative probabilities of those two events.

Comparing the two (above the correct one, below the bitrotted one):

6510b426e09cef0843dd8bdedd946067bcb016a9d0990794aa2fb938f4856dd8
6510b426e09cef0843dd8bdedd946067bcb016a9d0990794aa0fb938f4856dd8
                                                  ^ difference here

The differing hex digit in binary:

00000010
00000000
      ^ bit flipped 1 -> 0

The fix will be to export the DB to JSON (a feature that already exists), fix the ID and then re-import.

Sidetrack

On my server the REST API blob metadata query:

  • didn't work with the bitrotted ID
  • did work with the correct ID

When I exported the database (to JSON format) and imported it into my dev setup, the same REST API query did work. This was baffling at first, but it worked because of these differences resulting from the semantic export -> import (as opposed to a raw DB copy):

|              | On server | On dev setup |
| ------------ | --------- | ------------ |
| bucket key   | correct   | incorrect    |
| id in record | incorrect | incorrect    |

Labels: enhancement (New feature or request)