# Commit Database

The Commit Database is the persistence layer: an immutable mutation
DAG, transactional, with history and deterministic reduction across
concurrent streams.

```{important}
Before reading the API, identify your **mode of use**. The
[Modes of Use](commit_modes.md) diagnostic decides whether the
sections below are reference material or load-bearing for you —
single-stream readers can ignore reduction-related sections
entirely.
```

## Structure

A `CommitDatabase` holds **two content-addressed spaces**:

- The **DAG of commits** — the versioned history of mutations. Each
  commit is identified by its content hash (`CommitId`). This is
  what the rest of this page documents.
- The **pool of blobs** — immutable binary payloads (textures,
  meshes, raw buffers, …) identified by their hash and referenced
  from inside commits. See
  [Binary Data (Blobs)](../dsviper/blobs.md) for the blob API.

Both spaces are append-only and content-addressed independently.
They are replicated together by
[Database synchronisation](commit_synchronization.md).

## Opening a CommitDatabase

```pycon
>>> db = CommitDatabase.open("model.cdb")
```

To create a new database with embedded definitions, use:

```bash
python3 tools/dsm_util.py create_commit_database model.dsm model.cdb
```

---

## Reading State

A freshly created database has no commits — `first_commit_id()` and
`last_commit_id()` return `None`:

```{doctest}
>>> db.first_commit_id() is None
True
>>> db.last_commit_id() is None
True
>>> db.head_commit_ids()
set()
```

The `initial_state()` method always works and returns the empty
state. The example uses constants like `TUTO_A_USER_LOGIN` exposed
by the database's embedded definitions — see
[Embedded Definitions](#embedded-definitions) below; a one-line
`db.definitions().inject()` makes them available in the calling
namespace.

```{doctest}
>>> initial = db.initial_state()
>>> len(initial.attachment_getting().keys(TUTO_A_USER_LOGIN))
0
```

### AttachmentGetting Interface

Read attachments via `attachment_getting()`:

```pycon
>>> getting = state.attachment_getting()

>>> doc = getting.get(attachment, key)
>>> doc
Optional({...})

>>> keys = getting.keys(attachment)
```

---

## Mutations

Create a mutable state and apply changes:

```pycon
>>> mutable_state = CommitMutableState(db.state(db.last_commit_id()))
>>> mutating = mutable_state.attachment_mutating()

>>> mutating.set(attachment, key, document)

>>> mutating.update(attachment, key, path, new_value)
```

### Committing

`commit_mutations()` returns the new commit id — capture it explicitly to
chain further mutations or read the resulting state:

```pycon
>>> commit_id = db.commit_mutations("Commit message", mutable_state)
```

---

## Complete Example

Add an Alice document and read it back:

```{doctest}
>>> key = TUTO_A_USER_LOGIN.create_key()
>>> login = TUTO_A_USER_LOGIN.create_document()
>>> login.nickname = "alice"
>>> login.password = "secret"

>>> mutable = CommitMutableState(db.initial_state())
>>> mutable.attachment_mutating().set(TUTO_A_USER_LOGIN, key, login)
>>> commit_id = db.commit_mutations("Add Alice", mutable)

>>> state = db.state(commit_id)
>>> state.attachment_getting().get(TUTO_A_USER_LOGIN, key)
Optional({nickname='alice', password='secret'})
```

---

## Path-Based Mutators

Instead of replacing entire documents with `set()`, path-based mutators use
**Paths** to target specific locations. This enables path-based merging when multiple
users edit concurrently.

| Mutator            | Target | Operation             |
|--------------------|--------|-----------------------|
| `update`           | Field  | Replace value at path |
| `union_in_set`     | Set    | Add elements          |
| `subtract_in_set`  | Set    | Remove elements       |
| `union_in_map`     | Map    | Add key-value pairs   |
| `subtract_in_map`  | Map    | Remove keys           |
| `update_in_map`    | Map    | Update existing key   |
| `insert_in_xarray` | XArray | Insert at position    |
| `update_in_xarray` | XArray | Update at position    |
| `remove_in_xarray` | XArray | Remove at position    |

### Field Update

```pycon
>>> mutating.update(TUTO_A_USER_LOGIN, key, TUTO_P_LOGIN_NICKNAME, "alice_updated")
```

### Why Paths Matter

When two users edit different fields simultaneously:

```
User A: update(attachment, key, path_to_name, "Alice")
User B: update(attachment, key, path_to_email, "bob@example.com")

After convergence: Both updates apply (disjoint paths)
```

With `set()`, one user's changes would overwrite the other's.

Paths matter here because *name* and *email* are **owned by distinct writers**:
each means exactly the field they touch, so the union — Alice's name beside
Bob's email — is the collective intent, owned end to end. The same verbs
*invent* instead when a path is a **fragment** of a whole-value intent that
`diff` happened to split — see
[Re-entering the graph](commit_contract.md#re-entering-the-graph).

See [Cooperative Discipline](commit_cooperation.md) for the principle (scope
ownership) and its limits.

---

## Commit History

Inspect commit metadata:

```{doctest}
>>> header = db.commit_header(commit_id)
>>> header.label()
'Add Alice'
>>> header.parent_commit_id() == ValueCommitId()
True
```

The first commit's parent is the zero `ValueCommitId` (no ancestor).

Navigate history by passing the explicit ids you captured:

```pycon
>>> state1 = db.state(first_commit_id)
>>> state2 = db.state(latest_commit_id)
```

---

## Embedded Definitions

CommitDatabase stores its definitions:

```{doctest}
>>> defs = db.definitions()
>>> sorted(str(t) for t in defs.types())
['Tuto::Account', 'Tuto::Identity', 'Tuto::Login', 'Tuto::Status', 'Tuto::Texture', 'Tuto::Thumbnail', 'Tuto::User']
```

Calling `defs.inject()` makes `TUTO_A_USER_LOGIN`, `TUTO_S_LOGIN`, etc.
available as constants in the calling namespace.

---

## How Reduction Picks a Winner

When concurrent streams are reduced, the engine has to choose a single
outcome for every overlapping path. The choice is deterministic given
a fixed merge sequence — same inputs, same merges, same result on
every client — but its mechanics are *structural*, not author- or
time-meaningful.

**The merge primitive.** `commitMerge(parent, target)` creates a
merge commit. When the resulting state is reconstructed, `target`'s
mutations are applied *after* `parent`'s — so on every overlapping
path, the value from `target` survives. This is the only such rule
the engine itself fixes, and it makes the operation non-commutative:
`commitMerge(A, B) ≠ commitMerge(B, A)`.

**Reducing multiple heads is a strategy, not a guarantee.** The
built-in `reduceHeads` seeds the running result with the most recent
head (`lastCommitId()`, by authoring timestamp) and folds the remaining
heads into it in ascending `CommitId` order, calling `commitMerge` once
per head with the running result as parent and the head as target. Applications are free to use a
different order — or to skip `reduceHeads` entirely and issue their
own `commitMerge` sequence. The final state depends on *who calls
commitMerge in what order*, not on a property of the engine.

Within this default, the outcome on an overlapping path is still fully
determined — a function of the `CommitId` hashes, reproducible on every
client — but it is not predictable without computing them. One
consequence is easy to miss: because the seed is folded in first and
each later `commitMerge` lets `target` overwrite it, the *most recent*
head is not the one preserved on an overlapping path; the
highest-`CommitId` head, applied last, is.

Because the outcome is set by the merge sequence and not by the engine, **every
client that reduces heads on a shared database must use the same strategy.** Two
processes folding the same heads in different orders — one in ascending
`CommitId`, another in, say, hash-table iteration order — produce different
states on contested paths, and the shared history stops converging. Fix one
reduction order for all writers of a shared store; the built-in `reduceHeads()`
is the obvious choice, and it is transactional, where a hand-rolled fold can
leave a half-merged DAG if it is interrupted mid-merge.

**On an overlapping path, the surviving value is structural, not
intentional.** Whichever strategy is used, the value that survives is
a function of how merges were sequenced — not of authorship, recency,
or semantic priority. Two authors editing the same field have no way
to predict which value will survive reduction, even within a fixed
strategy.

The implication for the application is treated in the
[Dual-Layer Contract](commit_contract.md#reading-the-state-is-an-import-not-a-load):
do not rely on a specific arbitration outcome; re-validate at read time.

---

## Performance characteristics

Reconstructing a document's state from its commit history is
**per-document**, not per-database. Cost depends on the opcodes that
touched it and on **Ops** — the number of path-targeted operations
on this document since its last `set`.

| Opcode             | Reconstruction | Notes                             |
|--------------------|----------------|-----------------------------------|
| `set`              | O(1)           | replaces the whole document       |
| `update`           | O(Ops)         | replays field-level updates       |
| `update_in_map`    | O(Ops)         |                                   |
| `update_in_xarray` | O(Ops)         |                                   |
| `remove_in_xarray` | O(Ops)         |                                   |
| `union_in_map`     | O(Ops · log M) | M = map size                      |
| `subtract_in_map`  | O(Ops · log M) |                                   |
| `union_in_set`     | O(Ops · S)     | S = set size                      |
| `subtract_in_set`  | O(Ops · S)     |                                   |
| `insert_in_xarray` | O(Ops²)        | UUID positioning, quadratic worst |

### Design pitfall: O(N²) on accumulated Sets

A document whose state grows by repeated `union_in_set` across many
commits incurs O(Ops²) reconstruction — every read replays every
prior union. For long-running topologies that grow by accumulation,
prefer either a tree structure with `set()` replacing the children
list each commit, or a periodic flatten (see
[Storage growth](#storage-growth)).

### Validated scale

Commit has been benchmarked at:

- ~6 600 documents per database — about 3 MB of structural data,
  alongside ~3 GB of associated blobs on CAD workloads (structure is
  typically < 0.15 % of total disk footprint);
- up to 8 concurrent processes sharing a single SQLite database with
  applicative jitter between commits.

State reconstruction is linear in document count: a full warm-up via
`CommitState.cache_preload()` runs at ~1.5–2 µs per document across
this range — 0.4 ms at 230 documents, 14 ms at 6 600.

Behaviour beyond those envelopes is not characterised.

---

## Storage growth

The mutation DAG is **append-only** — once written, every commit is
immutable. The database grows monotonically; the runtime carries no
incremental garbage collection, no partial purge that trims old
commits while keeping recent history, and no archival mechanism.

The only sanctioned way to shrink a database is **Flatten** — a
user-space pattern, not a runtime operation: read the current head
state from the source database, write it as the initial commit of a
fresh target database, then switch readers and writers to the new
database. The source database is untouched; the target is a new
append-only history that happens to start where the old one ended.

The `dsviper` binding ships this pattern as a ready-made converter,
`CommitDatabaseFlattener`: it flattens a chosen commit into a fresh
single-commit target, keeping only the blobs that commit still references
and dropping superseded history blobs. The source is left untouched — the
converter automates the pattern, it does not trim history in place. See
{doc}`Database Transfer <../dsviper/api/transfer>` for the full transfer
toolkit.

```{warning}
`CommitDatabase.delete_commit()` and `CommitDatabase.reset_commits()`
(plus the CLI wrapper `commit_admin reset`) are **not features** —
they are tricks for live-demo scenarios.

- `delete_commit(commit_id)` only makes sense on a head — deleting
  any other commit would orphan its descendants. Live-demo use:
  rewinding the DAG by one step.
- `reset_commits()` / `commit_admin reset` removes every commit
  except the initial one. Live-demo use: replaying a scenario from
  a known baseline between runs.

Both operations break the append-only invariant that every other
reader relies on. Never use them as storage-management tools.
```

Sustained-growth scenarios that need to keep recent history while
trimming older commits are not addressed by the current runtime.

---

## Safe Usage

A checklist for the operational gotchas. None of this is enforced by the
engine — it's on the application.

- **Identify your mode first.** The [Modes of Use](commit_modes.md) carry
  different burdens. Only *multi-stream with strong invariants* makes
  the [Dual-Layer Contract](commit_contract.md) load-bearing; the
  other modes are safe to use without it.
- **Capture `commit_id` explicitly.** `commit_mutations()` returns the new
  id; there is no implicit current commit to auto-advance. Chain further
  mutations and reads from the captured value. **That id is itself state**: if
  you keep it outside the database — in a scene file, a config, any side-channel
  the store never sees — it can drift out of sync (one is copied, restored, or
  reopened without the other). A stale or absent baseline makes the next read
  either raise or, worse, diff against the wrong commit and emit authored
  mutations no one made. Store the cursor so it cannot outlive or desync from
  the database it points into.
- **Prefer path-based mutators over `set()`** for fields edited
  concurrently. `set()` replaces the whole document, so disjoint edits
  collide. `update`, `union_in_set`, `update_in_map`, etc. converge
  cleanly on disjoint paths — see [Why Paths Matter](#why-paths-matter).
  But match the path to the semantic unit: letting `diff` split a bound
  value into sub-paths is its own failure mode — see
  [Re-entering the graph](commit_contract.md#re-entering-the-graph).
- **Re-validate the state when you read it back, not when you build the
  mutations.** Under best-effort reduction, mutations may have been
  silently dropped and combined states may violate cross-field
  invariants. See
  [The Dual-Layer Contract](commit_contract.md#reading-the-state-is-an-import-not-a-load)
  for the discipline and where it becomes load-bearing.