
Chaos Testing Strategy


  1. Philosophy
  2. Test Environment
  3. Chaos Toolchain
  4. Test Topology
  5. Failure Scenarios — Transport Layer
  6. Failure Scenarios — Encryption & Handshake
  7. Failure Scenarios — Sync Protocol
  8. Failure Scenarios — Content Layer
  9. Failure Scenarios — Adversarial
  10. Cross-Platform Chaos
  11. Chaos Test Automation
  12. Integration with CI/CD
  13. Metrics & Observability
  14. Runbook
  15. Development Phasing

1. Philosophy

0k-Sync operates in the worst possible environment for correctness: the real world. Devices go offline mid-sync. Networks drop packets. Users close laptop lids during handshakes. Mobile connections switch from WiFi to cellular. Relays restart for updates.

A sync protocol that only works on clean networks is a sync protocol that doesn’t work.

Chaos testing verifies one thing: when the world misbehaves, does 0k-Sync lose data? Everything else — performance degradation, retry delays, user-facing errors — is secondary. The invariants are:

Invariant 1: No data loss. Every blob pushed by a client must eventually be retrievable by all paired clients after chaos heals. No silent drops.

Invariant 2: No silent corruption. Every blob received must pass content-hash verification. A single bit flip is a P0.

Invariant 3: No leaked plaintext. The relay must never see, log, or store plaintext content. This is the “0k” guarantee.

Invariant 4: State convergence. After chaos heals, all paired clients must converge to identical version vectors without requiring a full re-scan. Blob presence alone is insufficient — the state markers must match.

Invariant 5: No metadata leakage. No sensitive metadata (filenames, folder structures, vault sizes, or client identifiers beyond session tokens) shall appear in relay logs at any log level, including TRACE-level crash output.

Implementation note: the relay’s tracing subscriber must wrap sensitive fields (VaultID, BlobHash, client addresses) in a redaction layer that blinds them before they reach stdout/stderr. This is not optional filtering — it must be structural, so that a developer adding a new tracing::debug!() call cannot accidentally leak metadata without explicitly opting out of redaction. Use a Redacted<T> wrapper type that implements Display and Debug as [REDACTED] by default; this ensures even a generic #[derive(Debug)] on a parent struct won’t leak inner data through derived trait impls.
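A minimal sketch of that wrapper pattern (illustrative types and field names, not the actual 0k-Sync implementation):

```rust
use std::fmt;

/// Wraps sensitive values so they render as "[REDACTED]" through both
/// Display and Debug -- including when a parent struct derives Debug.
pub struct Redacted<T>(pub T);

impl<T> fmt::Display for Redacted<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("[REDACTED]")
    }
}

impl<T> fmt::Debug for Redacted<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("[REDACTED]")
    }
}

impl<T> Redacted<T> {
    /// Explicit opt-out: the only way to reach the inner value.
    pub fn expose(&self) -> &T {
        &self.0
    }
}

// Hypothetical parent struct: a derived Debug cannot leak the inner field.
#[derive(Debug)]
struct Session {
    vault_id: Redacted<String>,
    attempt: u32,
}

fn main() {
    let s = Session { vault_id: Redacted("vault-1234".into()), attempt: 3 };
    let line = format!("{:?}", s);
    assert!(!line.contains("vault-1234")); // sensitive value never rendered
    assert!(line.contains("[REDACTED]"));
    assert_eq!(s.vault_id.expose(), "vault-1234"); // explicit access still works
}
```

The design choice is that leaking requires an explicit `expose()` call, which is easy to grep for in review.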

Every chaos scenario follows the same pattern:

  1. Spec the failure — Define the exact condition (e.g., “relay dies 50ms into a blob push”)
  2. Write the assertion — What must be true after recovery? (e.g., “blob eventually arrives intact on all paired devices”)
  3. Automate the chaos — Script the failure injection
  4. Run until proven — Not once. Hundreds of times. Flaky passes are failures.

Chaos tests are not exploratory. They are automated, repeatable, and part of the test suite. They live in tests/chaos/ and run on the dedicated test server.

This document covers chaos testing of the 0k-Sync protocol and its components. It does not cover:

  • Application-level testing (CashTable, VardKista Suite) — those projects own their own chaos strategies
  • Tauri framework testing — covered by separate verification
  • Load/performance testing — separate document (future)
  • Penetration testing — separate engagement (future)

2. Test Environment

All chaos testing runs on a dedicated high-memory server (96GB RAM, multi-core) with sufficient resources to simulate a full adversarial network locally.

Why not CI? Chaos tests are:

  • Resource-intensive — dozens of containers, network namespaces, and virtual interfaces running simultaneously
  • Time-intensive — meaningful chaos requires sustained operation, not 5-minute CI jobs
  • Nondeterministic by design — need many iterations to catch timing-dependent bugs

CI runs the deterministic unit and integration tests. The dedicated server runs the chaos.

Target allocation for a full chaos run:

| Resource | Allocation | Purpose |
|---|---|---|
| RAM | 32GB (of 96GB available) | Containers, VMs, test data |
| CPU | 8 cores | Parallel test topologies |
| Disk | 50GB scratch (NVMe) | Blob storage, logs, captures |
| Network | Isolated Docker networks | No interference with production |

This leaves 64GB RAM and remaining cores free for other services during chaos runs. If a full-matrix run needs more, schedule it during off-hours.

I/O note: The 50GB scratch space must be on NVMe, not spinning disk. When running Swarm topologies (20+ clients), I/O wait on slow disk will mask the network chaos latency being injected, producing misleading results. The bottleneck for large topologies is I/O, not RAM.

Chaos tests must never affect real infrastructure:

  • All test containers run in dedicated Docker networks (0ksync-chaos-*)
  • No port bindings to the host network (container-to-container only)
  • Test data uses generated keys, never production credentials
  • Cleanup script runs after every session: scripts/chaos-cleanup.sh

3. Chaos Toolchain

| Tool | Role | Why This One |
|---|---|---|
| Docker Compose | Topology definition | Declarative, reproducible, already in stack |
| tc (traffic control) | Network degradation (latency, jitter, loss, reorder) | Kernel-level, precise, no overhead |
| Toxiproxy | Application-level fault injection (timeouts, slow close, bandwidth limits) | Sits between nodes as a proxy, programmable API |
| Pumba | Container-level chaos (kill, pause, stop, remove) | Docker-native, scriptable |
| cargo-nextest | Test runner | Parallel execution, retry support, JUnit output |
| tracing + OpenTelemetry | Observability during chaos | Already in 0k-Sync architecture |

Chaos Mesh and Litmus are Kubernetes-native. 0k-Sync chaos testing runs on bare Docker on a single machine. The toolchain above is simpler, has no k8s dependency, and gives more precise control over network conditions. If 0k-Sync ever needs multi-host chaos testing (e.g., geo-distributed relays), revisit Chaos Mesh at that point.

Toxiproxy is the key enabler. Every connection between clients and relays routes through a Toxiproxy instance, giving programmatic control over the connection mid-test:

Toxiproxy routing: Client A → Toxiproxy → Relay ← Toxiproxy ← Client B

Toxiproxy toxics available:

| Toxic | Effect | Use Case |
|---|---|---|
| latency | Add fixed + jitter delay | Simulate mobile/satellite links |
| bandwidth | Limit throughput | Simulate congested networks |
| slow_close | Delay connection close | Simulate half-open connections |
| timeout | Stop data after delay | Simulate network partitions |
| slicer | Fragment data into small chunks | Stress framing/reassembly |
| limit_data | Close after N bytes | Simulate mid-transfer disconnection |

These can be added, modified, and removed via HTTP API during test execution — no container restarts needed.


4. Test Topology

The minimum chaos topology simulates a real-world sync scenario:

Chaos test topology: Two clients connecting through Toxiproxy instances to a relay, with chaos controller orchestrating

All components are Docker containers on an isolated network. The chaos controller is the test harness — a Rust binary or script that orchestrates the scenario.

docker-compose.chaos.yml

```yaml
services:
  relay:
    build:
      context: .
      dockerfile: Dockerfile.relay
    networks:
      - chaos-net
  toxiproxy:
    image: ghcr.io/shopify/toxiproxy:latest
    networks:
      - chaos-net
  client-a:
    build:
      context: .
      dockerfile: Dockerfile.cli
    depends_on: [toxiproxy, relay]
    networks:
      - chaos-net
  client-b:
    build:
      context: .
      dockerfile: Dockerfile.cli
    depends_on: [toxiproxy, relay]
    networks:
      - chaos-net

networks:
  chaos-net:
    driver: bridge
    internal: true # No external access
```
| Topology | Clients | Relays | Purpose |
|---|---|---|---|
| Pair | 2 | 1 | Basic sync correctness |
| Multi-device | 5 | 1 | Fan-out sync, conflict resolution |
| Multi-relay | 4 | 2-3 | Relay failover and fan-out (Phase 6.5) |
| Swarm | 20 | 1 | Connection limits, resource exhaustion |

Start with Pair for alpha. Scale to Multi-device at beta. Swarm for RC/GA stress validation.

4.4 Multi-Relay Chaos Scenarios (Phase 6.5)


These scenarios test client-side fan-out and failover across multiple independent relays. All fan-out logic is in the client — relays have no awareness of each other.

| ID | Scenario | Setup | Expected Behaviour |
|---|---|---|---|
| MR-1 | Primary relay killed during active push | 2 relays, kill primary after HELLO | Client connect fails over to secondary relay, push completes there |
| MR-2 | Secondary relay killed | 2 relays, kill secondary during fan-out | Client reports primary ack success, logs warning about secondary failure |
| MR-3 | All relays killed | 2 relays, kill both | Client returns AllRelaysFailed, sync is offline until relays recover |
| MR-4 | Relay flapping (up/down/up) | 1 relay, toggle availability | Client reconnects on next operation, per-relay cursor resumes correctly |
| MR-5 | Primary relay high latency | 2 relays, inject 5s latency on primary | Secondary fan-out push completes fast, primary push completes eventually |

Pre-requisites for MR scenarios:

  • Client configured with 2+ relay addresses
  • Each relay running independently with own SQLite
  • Per-relay cursor tracking enabled in client

MR-1 detail: The client attempts HELLO/Welcome on the primary relay. If the connection fails (relay killed), the client tries the next relay in preference order. Once connected, the push proceeds normally. The test verifies:

  1. Connect failover fires within timeout
  2. Push completes on secondary
  3. Per-relay cursor is tracked for the secondary relay (not the dead primary)

MR-3 detail: When all relays are unreachable, the client must fail gracefully — not hang, not panic. ClientError::AllRelaysFailed is returned. The caller can retry later with exponential backoff.

MR-4 detail: Relay flapping tests cursor resilience. After reconnect, the client resumes pulling from the last known cursor for that relay. No duplicate data, no missed data (within TTL window).
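The connect-time contract behind MR-1 and MR-3 can be sketched as a few lines of failover logic. This is a toy model: `connect_with_failover` and `backoff_ms` are hypothetical names (only `ClientError::AllRelaysFailed` comes from the document), and real connection attempts are async with timeouts.

```rust
#[derive(Debug, PartialEq)]
enum ClientError {
    AllRelaysFailed,
}

/// Try each relay in preference order with the given connect function;
/// return the index of the first relay that accepts, or AllRelaysFailed.
fn connect_with_failover<F>(relays: &[&str], mut connect: F) -> Result<usize, ClientError>
where
    F: FnMut(&str) -> bool,
{
    for (i, &addr) in relays.iter().enumerate() {
        if connect(addr) {
            return Ok(i); // push proceeds on this relay (MR-1 path)
        }
    }
    Err(ClientError::AllRelaysFailed) // MR-3 path: fail cleanly, no hang
}

/// Capped exponential backoff for the retry-later path
/// (illustrative constants: base 500ms, cap 60s).
fn backoff_ms(attempt: u32) -> u64 {
    (500u64.saturating_mul(1u64 << attempt.min(10))).min(60_000)
}

fn main() {
    let relays = ["relay-a:4433", "relay-b:4433"];

    // MR-1: primary dead, secondary alive -> fail over to index 1.
    assert_eq!(
        connect_with_failover(&relays, |addr| addr == "relay-b:4433"),
        Ok(1)
    );

    // MR-3: all relays dead -> clean error, no panic.
    assert_eq!(
        connect_with_failover(&relays, |_| false),
        Err(ClientError::AllRelaysFailed)
    );

    // Backoff grows but stays capped.
    assert_eq!(backoff_ms(0), 500);
    assert_eq!(backoff_ms(20), 60_000);
}
```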


5. Failure Scenarios — Transport Layer

These test iroh’s transport behaviour under degraded conditions.

Latency:

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| T-LAT-01 | Fixed 200ms latency | tc qdisc add dev eth0 root netem delay 200ms | Sync completes. Blob hashes match. |
| T-LAT-02 | High jitter (200ms +/- 150ms) | tc ... delay 200ms 150ms distribution normal | Sync completes. No reordering corruption. |
| T-LAT-03 | Asymmetric latency (fast up, slow down) | Toxiproxy: 10ms upstream, 500ms downstream | Sync completes in both directions. |
| T-LAT-04 | Satellite simulation (600ms + 50ms jitter) | tc ... delay 600ms 50ms | Handshake completes. Blobs transfer. Timeouts appropriate. |

Packet loss:

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| T-LOSS-01 | 5% random packet loss | tc ... loss 5% | Sync completes (retries handle it). |
| T-LOSS-02 | 20% packet loss | tc ... loss 20% | Sync completes or fails gracefully with retryable error. No corruption. |
| T-LOSS-03 | Burst loss (10% with 25% correlation) | tc ... loss 10% 25% | No data corruption. Recovery after burst. |
| T-LOSS-04 | 100% loss (partition) then recovery | Toxiproxy: timeout toxic on, wait 30s, remove | Client reconnects. Sync resumes from last checkpoint. No duplicate data. |

Connection failures:

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| T-CONN-01 | Relay crash mid-sync | Pumba: kill relay container during blob push | Client detects disconnection. Retries on reconnect. Blob arrives intact. |
| T-CONN-02 | Client crash mid-push | Pumba: kill client-a during push | Relay cleans up partial state. Client-b unaffected. Client-a resumes on restart. |
| T-CONN-03 | Network partition (both clients online, relay unreachable) | Toxiproxy: timeout on both proxy paths | Both clients detect partition. No split-brain. Sync resumes when partition heals. |
| T-CONN-04 | Rapid reconnect cycle (10 connect/disconnect in 5s) | Script: connect, push 1 blob, disconnect, repeat | No connection leak. No state corruption. Relay handles gracefully. |
| T-CONN-05 | Half-open connection (client thinks connected, relay doesn’t) | Toxiproxy: slow_close + kill client TCP keepalive | Relay times out stale session. Client detects on next operation. Clean reconnect. |

Bandwidth constraints:

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| T-BW-01 | 56kbps (edge network) | Toxiproxy: bandwidth limit 7KB/s | Small blobs sync (slowly). Large blobs time out gracefully or succeed with patience. |
| T-BW-02 | Bandwidth drop mid-transfer | Toxiproxy: start at 1MB/s, drop to 10KB/s at 50% | Transfer completes or retries. No corruption of partial data. |
| T-BW-03 | Asymmetric bandwidth (fast client A, slow client B) | Different Toxiproxy bandwidth per client | Both eventually sync. Relay doesn’t block fast client waiting for slow one. |

6. Failure Scenarios — Encryption & Handshake


These test the hybrid Noise handshake (clatter: ML-KEM-768 + X25519) and session encryption (XChaCha20-Poly1305) under adversarial conditions.

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| E-HS-01 | Disconnect after handshake message 1 (-> e) | Toxiproxy: limit_data after first message | Handshake times out. Clean retry. No partial key material leaked. |
| E-HS-02 | Disconnect after handshake message 2 (<- e, ee, s, es) | Toxiproxy: limit_data after second message | Handshake fails cleanly. No session established. Retry succeeds. |
| E-HS-03 | Disconnect after handshake message 3 (-> s, se) | Toxiproxy: limit_data after third message | One side thinks established, other doesn’t. Detect mismatch. Renegotiate. |
| E-HS-04 | Extreme latency during handshake (5s per message) | Toxiproxy: latency 5000ms | Handshake completes if timeout is sufficient. If not, clean timeout error. |
| E-HS-05 | Handshake message reorder | Toxiproxy: slicer + latency to reorder | Noise Protocol rejects out-of-order. No state corruption. |
| E-HS-06 | Concurrent handshake from same client (race) | Two simultaneous connection attempts | Exactly one succeeds. No resource leak from the failed attempt. |

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| E-ENC-01 | Message corruption (bit flip in ciphertext) | Toxiproxy custom toxic or network tap | XChaCha20-Poly1305 AEAD rejects. No plaintext exposed. Connection reset or message retry. |
| E-ENC-02 | Message truncation | Toxiproxy: limit_data mid-encrypted-message | Decryption fails (tag mismatch). Clean error. No partial plaintext. |
| E-ENC-03 | Message duplication (replay) | Capture and replay a valid encrypted message | Nonce tracking rejects the replay. No state change from replayed message. |
| E-ENC-04 | High-volume encryption (1000 messages/sec) | Load generator + latency | No nonce reuse. No encryption errors under load. Memory stable. |
| E-ENC-05 | Key renegotiation under load | Trigger rekey during active blob transfer | Transfer survives renegotiation. No plaintext gap between old and new keys. |
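The replay rejection asserted in E-ENC-03 can be modelled as a per-session monotonic counter guard. This is a sketch under the assumption of strictly increasing nonce counters; the actual 0k-Sync nonce-tracking scheme may differ.

```rust
/// Accept a message only if its nonce counter is strictly greater than
/// the highest counter seen for this session; otherwise it is a replay
/// (or stale duplicate) and must cause no state change.
struct ReplayGuard {
    highest_seen: Option<u64>,
}

impl ReplayGuard {
    fn new() -> Self {
        Self { highest_seen: None }
    }

    /// Returns true if the message is fresh; false if it is rejected.
    fn check_and_record(&mut self, nonce_counter: u64) -> bool {
        match self.highest_seen {
            Some(h) if nonce_counter <= h => false, // replayed or stale
            _ => {
                self.highest_seen = Some(nonce_counter);
                true
            }
        }
    }
}

fn main() {
    let mut guard = ReplayGuard::new();
    assert!(guard.check_and_record(1));
    assert!(guard.check_and_record(2));
    assert!(!guard.check_and_record(2)); // exact replay rejected
    assert!(!guard.check_and_record(1)); // old message rejected
    assert!(guard.check_and_record(3));  // fresh message accepted
}
```

A chaos harness can drive this by capturing a valid message and re-injecting it, then asserting the session state is unchanged.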
| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| E-PQ-01 | ML-KEM encapsulation with corrupted ciphertext | Inject bit error into KEM ciphertext | Decapsulation fails. Handshake aborts cleanly. Fallback to retry (not to non-PQ). |
| E-PQ-02 | Large handshake messages (ML-KEM-768 ~1.5KB per direction) | Combine with T-BW-01 (56kbps) | Handshake completes even on slow links. Timeout appropriate for PQ message sizes. |
| E-PQ-03 | ML-KEM + X25519 hybrid — one component fails | Mock: force X25519 to fail during hybrid combine | Entire handshake fails. Does NOT fall back to ML-KEM-only or X25519-only. Hybrid is all-or-nothing. |
| E-PQ-04 | Hybrid binding verification | Extract both KEM and ECDH shared secrets independently; verify combined session key cannot be derived from either alone | Session key requires both components. Compromising X25519 alone or ML-KEM alone yields nothing usable. |
| E-PQ-05 | Clock skew between client and relay | faketime or container clock offset: client 5 minutes ahead/behind relay | If Noise sessions or tokens use timestamps/TTLs, handshake still succeeds within skew tolerance. If skew exceeds tolerance, clean rejection with actionable error (not a cryptic timeout). |

E-PQ-03 detail: The hybrid combine must be a cryptographic binding — the concatenated shared secrets fed into a single KDF — not a logical AND: neither component secret can be recoverable if the other is compromised. Downgrade check: if the KDF is HKDF-SHA256, verify the relay cannot negotiate or force the client into a single-component key derivation path. The test must confirm that the session key is always derived from HKDF(SS_kem ‖ SS_x25519), never from either secret alone.
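The binding property E-PQ-04 verifies can be illustrated structurally. The `toy_kdf` below is a non-cryptographic stand-in for HKDF-SHA256 (FNV-style mixing); only the shape — both secrets concatenated into one derivation, with no single-component path — is the point.

```rust
/// Stand-in KDF: NOT cryptographic, purely illustrative.
fn toy_kdf(input: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in input {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// Hybrid combine: concatenate both shared secrets, then one KDF pass.
/// There is deliberately no function that derives a key from one secret.
fn derive_session_key(ss_kem: &[u8], ss_x25519: &[u8]) -> u64 {
    let mut ikm = Vec::with_capacity(ss_kem.len() + ss_x25519.len());
    ikm.extend_from_slice(ss_kem);
    ikm.extend_from_slice(ss_x25519);
    toy_kdf(&ikm)
}

fn main() {
    let (kem, ecdh) = (b"kem-shared-secret".as_slice(), b"x25519-shared-secret".as_slice());
    let session = derive_session_key(kem, ecdh);

    // Neither component alone yields the session key (E-PQ-04).
    assert_ne!(session, toy_kdf(kem));
    assert_ne!(session, toy_kdf(ecdh));

    // Tampering with either component changes the key (all-or-nothing hybrid).
    assert_ne!(session, derive_session_key(b"tampered", ecdh));
    assert_ne!(session, derive_session_key(kem, b"tampered"));
}
```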

7. Failure Scenarios — Sync Protocol

These test the 0k-Sync protocol logic — state machine, blob exchange, and eventual consistency.

Architectural note: 0k-Sync uses content-addressed immutable blobs. The relay is a dumb pipe — it stores encrypted blobs and has no knowledge of their contents. There is no conflict resolution at the protocol level because there are no conflicts: every blob is unique (identified by hash) and immutable. Merge semantics (LWW, CRDT, or otherwise) are the responsibility of the application layer (CashTable, VardKista Journal, etc.). What 0k-Sync guarantees is that all blobs reach all paired clients and that version vectors converge.

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| S-SM-01 | Disconnect during PUSH state | Kill connection while client is pushing | Client resumes push on reconnect. No duplicate blobs on relay. |
| S-SM-02 | Disconnect during PULL state | Kill connection while client is pulling | Client resumes pull. Partial blob discarded (hash won’t match). Full blob re-pulled. |
| S-SM-03 | Disconnect during state reconciliation | Kill connection during version vector exchange | No state corruption. Reconciliation restarts cleanly. |
| S-SM-04 | Rapid state transitions (push -> pull -> push) | Automated client rapidly alternating | State machine handles transitions. No stuck states. |

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| S-CONC-01 | Simultaneous push from 2 clients (same vault) | Both clients push different blobs at same time | Both blobs eventually present on both clients. No lost writes. Version vectors identical after sync settles. |
| S-CONC-02 | Push from A while B is pulling | Interleave push and pull timing | Both operations complete. B gets A’s new data on next sync cycle. |
| S-CONC-03 | 5 clients syncing simultaneously | Scale topology to 5 clients, all active | All clients converge to same state. No client left behind. |
| S-CONC-04 | Client syncs with stale state (offline for 1000 versions) | Client A pushes 1000 times while B is offline. B reconnects. | B catches up fully. No truncation. Transfer is efficient (only missing data). |

State convergence goes beyond blob presence. After chaos heals, all clients must agree on the complete state — version vectors, blob manifests, and collection metadata — without requiring a full re-scan.

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| S-CONV-01 | Convergence after partition heal | T-LOSS-04 partition scenario, then verify state | Version vectors on Client A and Client B are byte-identical after sync settles. No full re-scan triggered. |
| S-CONV-02 | Convergence after relay restart | Pumba: restart relay, both clients reconnect | Clients re-establish state from relay. Version vectors match pre-restart state. No regression. |
| S-CONV-03 | Convergence after asymmetric chaos | Client A has 200ms latency, Client B has 20% loss, both active for 5 minutes | After chaos removed and sync settles, version vectors identical. All blobs present on both. |
| S-CONV-04 | Convergence verification method | No chaos — clean sync of 100 blobs | Verify assert_state_converged() helper works: compares version vectors, blob manifests, and collection completeness. This validates the test tooling itself. |
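A minimal model of the version-vector comparison behind the assert_state_converged() helper (the real helper also compares blob manifests and collection completeness; the types here are illustrative):

```rust
use std::collections::HashMap;

/// A version vector maps a client identifier to its highest known version.
/// Illustrative type; the real 0k-Sync representation may differ.
type VersionVector = HashMap<String, u64>;

/// Converged means byte-identical logical content: same clients, same
/// counters. HashMap equality compares entries irrespective of order.
fn state_converged(a: &VersionVector, b: &VersionVector) -> bool {
    a == b
}

fn main() {
    let mut a = VersionVector::new();
    a.insert("client-a".into(), 42);
    a.insert("client-b".into(), 17);

    let mut b = a.clone();
    assert!(state_converged(&a, &b)); // identical vectors converge

    // A counter mismatch means one side missed an update: not converged.
    b.insert("client-a".into(), 41);
    assert!(!state_converged(&a, &b));

    // A missing entry is also divergence, even if shared entries match.
    b.remove("client-a");
    assert!(!state_converged(&a, &b));
}
```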
| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| S-BLOB-01 | Large blob (100MB) under chaos | Combine 200ms latency + 5% loss + blob push | Blob arrives intact. Hash verification passes. |
| S-BLOB-02 | Many small blobs (10,000 x 1KB) rapid fire | Push all blobs in tight loop | All 10,000 arrive. No deduplication errors. No ordering issues. |
| S-BLOB-03 | Identical blob from two clients | Both clients push same content simultaneously | Single blob stored (content-addressed dedup). Both clients see it. |
| S-BLOB-04 | Empty blob edge case | Push a zero-byte blob | Handled correctly. Not rejected, not confused with “no data.” |
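The dedup behaviour asserted in S-BLOB-03 (and the zero-byte edge case in S-BLOB-04) falls out of content addressing, sketched here with std’s DefaultHasher standing in for the real content hash (it is not cryptographic; iroh-blobs uses a proper content hash):

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy content-addressed store: blobs are keyed by their content hash,
/// so identical content pushed by two clients is stored exactly once.
struct BlobStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl BlobStore {
    fn new() -> Self {
        Self { blobs: HashMap::new() }
    }

    /// Returns the content address; re-pushing identical bytes is a no-op.
    fn push(&mut self, content: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        content.hash(&mut h);
        let addr = h.finish();
        self.blobs.entry(addr).or_insert_with(|| content.to_vec());
        addr
    }

    fn len(&self) -> usize {
        self.blobs.len()
    }
}

fn main() {
    let mut store = BlobStore::new();
    let a = store.push(b"same bytes from client A");
    let b = store.push(b"same bytes from client A"); // client B, same content
    assert_eq!(a, b);           // same content -> same address (S-BLOB-03)
    assert_eq!(store.len(), 1); // stored once (dedup)

    store.push(b""); // S-BLOB-04: a zero-byte blob is valid content,
    assert_eq!(store.len(), 2); // distinct from "no data"
}
```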

8. Failure Scenarios — Content Layer

These test iroh-blobs content-addressed storage under failure.

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| C-STOR-01 | Relay disk full during blob write | Mount tmpfs with size limit, fill it | Relay returns clear error. Client retries later. No partial/corrupt blob on disk. |
| C-STOR-02 | Client disk full during blob pull | Same technique on client container | Client reports error. Can retry after space freed. Relay unaffected. |
| C-STOR-03 | Corrupt blob on relay disk (bit rot) | Modify stored blob file after write | Hash verification fails on read. Client rejects. Relay should self-heal or flag. |
| C-STOR-04 | Relay restart with cold cache | Pumba: restart relay, client immediately requests | Relay recovers from disk. Blob served correctly (iroh-blobs handles this). |

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| C-COLL-01 | Partial collection sync (some blobs missing) | Kill connection after 3 of 10 blobs in collection | Client knows collection is incomplete. Resumes from blob 4 on reconnect. |
| C-COLL-02 | Blob deletion during active sync | Delete blob from relay while client is pulling | Client gets clean error for missing blob. No crash. |
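The resume logic C-COLL-01 expects reduces to a set difference between the collection manifest and what is already present locally. A sketch with hypothetical names (`missing_blobs` is not a real 0k-Sync API):

```rust
use std::collections::HashSet;

/// Given a collection manifest (ordered content addresses) and the set of
/// addresses already stored locally, return what still needs pulling.
fn missing_blobs(manifest: &[u64], local: &HashSet<u64>) -> Vec<u64> {
    manifest.iter().copied().filter(|h| !local.contains(h)).collect()
}

fn main() {
    // Collection of 10 blobs; the connection died after 3 arrived.
    let manifest: Vec<u64> = (1..=10).collect();
    let local: HashSet<u64> = [1, 2, 3].into_iter().collect();

    // On reconnect the client pulls only blobs 4..=10 -- resume, not restart.
    let todo = missing_blobs(&manifest, &local);
    assert_eq!(todo, vec![4, 5, 6, 7, 8, 9, 10]);
}
```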

9. Failure Scenarios — Adversarial

These simulate active attackers, not just unreliable networks.

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| A-PROTO-01 | Modified relay (tampers with encrypted blobs) | Custom relay binary that flips bits in stored blobs | Clients detect tampering via hash verification. Reject corrupted data. Alert/log. |
| A-PROTO-02 | Replay of old encrypted messages | Capture and re-inject previous session messages | Nonce/counter tracking rejects. No state rollback. |
| A-PROTO-03 | Client impersonation (stolen identity key) | Second client using same static key | Noise Protocol detects (KK pattern requires mutual authentication). Session rejected or flagged. |
| A-PROTO-04 | Relay tries to read content | Instrument relay to log all blob plaintexts | All logged content is ciphertext. Zero plaintext exposure. (This is the “0k” guarantee.) |
| A-PROTO-05 | Message injection (attacker sends fabricated messages) | Raw socket sends crafted bytes to relay port | Relay rejects: invalid framing, failed decryption, or unknown session. No crash. |

| ID | Scenario | Injection | Assertion |
|---|---|---|---|
| A-RES-01 | Connection flood (1000 simultaneous handshakes) | Script opening connections without completing handshake | Relay applies backpressure or connection limits. Existing sessions unaffected. |
| A-RES-02 | Memory exhaustion (huge messages) | Send oversized messages exceeding protocol limits | Relay enforces max message size at the framing layer — rejects on the size header, not after reading the entire payload into memory. No OOM. |
| A-RES-03 | Slowloris (open connections, send data very slowly) | Toxiproxy: extreme bandwidth limit on attacker connection | Relay times out slow connections. Doesn’t block connection pool for legitimate clients. |
| A-RES-04 | Storage flood (push endless blobs) | Client pushes blobs in infinite loop | Relay enforces per-vault or per-connection storage quota. Rejects when limit hit. |
| A-RES-05 | Entropy exhaustion under load | High-concurrency handshakes (50 simultaneous) while stressing /dev/urandom via background dd if=/dev/urandom reads | Nonce generation never blocks or returns predictable values. Handshakes complete (possibly slower). No crypto operation falls back to weak randomness. |

A-RES-01 detail: Confirm the relay correctly triggers per-IP connection rate limiting (or exposes an integration point for external tools like Fail2Ban). Log output must include the offending IP and rejection reason without leaking session metadata.

A-RES-03 detail (QUIC): With iroh’s QUIC transport, traditional TCP Slowloris tools may behave differently. The test must also target QUIC stream limits specifically — verify the relay enforces per-connection stream caps and does not hang on stalled QUIC streams.
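The A-RES-02 defence — reject on the size header before buffering — can be sketched as a framing-layer check. The u32 big-endian length prefix and the 1 MiB limit below are illustrative, not the actual 0k-Sync wire format:

```rust
/// Illustrative frame-size cap; the real protocol limit may differ.
const MAX_FRAME_LEN: u32 = 1024 * 1024;

#[derive(Debug, PartialEq)]
enum FrameError {
    TooShort,
    TooLarge(u32),
}

/// Inspect only the 4-byte length prefix. For an oversized frame the
/// payload is never read or allocated, so a claimed-huge message
/// cannot drive the relay out of memory.
fn check_frame_header(header: &[u8]) -> Result<u32, FrameError> {
    let bytes: [u8; 4] = header
        .get(..4)
        .and_then(|s| s.try_into().ok())
        .ok_or(FrameError::TooShort)?;
    let len = u32::from_be_bytes(bytes);
    if len > MAX_FRAME_LEN {
        return Err(FrameError::TooLarge(len)); // reject before buffering
    }
    Ok(len)
}

fn main() {
    // A claimed 512 MiB payload is rejected from the header alone: no OOM.
    let huge = (512u32 * 1024 * 1024).to_be_bytes();
    assert_eq!(
        check_frame_header(&huge),
        Err(FrameError::TooLarge(512 * 1024 * 1024))
    );

    // A 64 KiB frame passes the size gate.
    let ok = (64u32 * 1024).to_be_bytes();
    assert_eq!(check_frame_header(&ok), Ok(64 * 1024));
}
```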

For adversarial scenarios, standard container logs may not reveal the failure. Use OS-level instrumentation:

| Tool | Purpose | Scenarios |
|---|---|---|
| strace | Trace system calls on relay process | T-CONN-04 (file descriptor leaks), A-RES-02 (buffer allocation patterns) |
| eBPF (bpftrace) | Kernel-level network and I/O tracing | A-RES-01 (connection tracking), T-CONN-05 (half-open detection) |
| ss -tnp | Socket state snapshots | Any connection-related scenario — verify no CLOSE_WAIT accumulation |

10. Cross-Platform Chaos

0k-Sync targets Linux, macOS, Windows, iOS, and Android via Tauri. The protocol must behave identically regardless of the client’s OS. Subtle differences in TCP stack behaviour, file system semantics, and timing can cause platform-specific bugs that only appear under stress.

Run on the dedicated server using lightweight VMs (not full containers — need real OS networking stacks):

| VM | OS | Purpose |
|---|---|---|
| vm-linux | Ubuntu 24.04 | Reference platform (matches CI) |
| vm-macos | macOS (Virtualization.framework if available, else skip) | Apple-specific networking |
| vm-windows | Windows 11 | Windows TCP/Winsock behaviour |

| ID | Scenario | Setup | Assertion |
|---|---|---|---|
| X-PLAT-01 | Linux client <-> Windows client, 500ms latency, 20% loss | vm-linux + vm-windows + tc | Both clients sync. Encryption renegotiation holds. |
| X-PLAT-02 | macOS client with aggressive sleep (lid close simulation) | vm-macos + Pumba pause/unpause | Client recovers on wake. Sync resumes. No stale session. |
| X-PLAT-03 | Mixed OS clients, relay restart | 3 VMs + relay restart | All three clients reconnect and converge. |
| X-PLAT-04 | Windows file locking during blob write | vm-windows + concurrent file access | Client handles OS-level lock contention gracefully. No data corruption from “file in use” errors. Windows-specific: sync-client must open blob files with FILE_SHARE_READ sharing so concurrent readers are not rejected. |

Cross-platform VM testing is expensive and slow. Prioritise:

  1. Always: Linux-to-Linux chaos (containerised, fast, covers protocol logic)
  2. Beta: Add Windows VM testing — highest platform-specific risk. Windows file locking semantics (mandatory locking, “file in use” errors) differ fundamentally from POSIX advisory locks. X-PLAT-04 frequently reveals bugs that Linux and macOS never encounter. Prioritise this in beta when real users on Windows will be running CashTable.
  3. RC: Full matrix including macOS if VM support is viable
  4. Mobile: Defer to device-farm testing or manual verification (Tauri mobile lifecycle is a separate concern)

11. Chaos Test Automation

The chaos test harness is a Rust binary that orchestrates Docker, Toxiproxy, and Pumba:

```
tests/chaos/
├── src/
│   ├── main.rs            # Test runner entry point
│   ├── topology.rs        # Docker Compose management
│   ├── toxiproxy.rs       # Toxiproxy HTTP API client
│   ├── pumba.rs           # Pumba command wrapper
│   ├── assertions.rs      # Sync state verification helpers
│   └── scenarios/
│       ├── transport.rs   # Section 5 scenarios
│       ├── encryption.rs  # Section 6 scenarios
│       ├── sync.rs        # Section 7 scenarios
│       ├── content.rs     # Section 8 scenarios
│       └── adversarial.rs # Section 9 scenarios
├── docker-compose.chaos.yml
├── Dockerfile.relay
├── Dockerfile.cli
└── Cargo.toml
```

Each scenario follows a consistent structure:

```rust
/// T-LOSS-04: 100% packet loss (partition) then recovery
#[chaos_test]
async fn partition_then_recovery() -> ChaosResult {
    // ARRANGE: Start topology, push initial data, verify sync
    let topo = Topology::pair().start().await?;
    topo.client_a.push_blob(test_blob()).await?;
    topo.wait_for_sync().await?;

    // ACT: Inject chaos
    topo.toxiproxy_a.add_toxic("timeout", json!({"timeout": 0})).await?;
    topo.toxiproxy_b.add_toxic("timeout", json!({"timeout": 0})).await?;

    // Client A pushes during partition
    topo.client_a.push_blob(partition_blob()).await?;

    // Wait, then heal partition
    sleep(Duration::from_secs(30)).await;
    topo.toxiproxy_a.remove_toxic("timeout").await?;
    topo.toxiproxy_b.remove_toxic("timeout").await?;

    // ASSERT: Sync recovers
    topo.wait_for_sync_timeout(Duration::from_secs(60)).await?;
    assert_blob_present(&topo.client_b, partition_blob().hash()).await?;
    assert_no_data_loss(&topo).await?;
    Ok(())
}
```

Chaos tests are probabilistic. A single pass means nothing. Each scenario runs multiple iterations:

| Scenario Type | Iterations | Rationale |
|---|---|---|
| Transport (deterministic chaos) | 10 | tc-based, fairly reproducible |
| Encryption (timing-dependent) | 50 | Handshake races need many attempts to catch |
| Sync protocol (state-dependent) | 25 | State machine paths need coverage |
| Adversarial | 10 | More about correctness than timing |
| Cross-platform | 5 | Expensive, but each run is meaningful |

A scenario passes only if all iterations pass. One failure in 50 is a bug, not noise.
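The iteration counts follow from simple probability: for a bug that triggers with per-iteration probability p, the chance of observing it at least once in n runs is 1 − (1 − p)^n. A quick check of the handshake-race numbers (p = 5% is an illustrative guess, not a measured rate):

```rust
/// Probability of catching a timing bug at least once in n iterations,
/// given a per-iteration trigger probability p.
fn detection_probability(p: f64, n: i32) -> f64 {
    1.0 - (1.0 - p).powi(n)
}

fn main() {
    let p = 0.05; // a handshake race that fires 5% of the time
    assert!(detection_probability(p, 10) < 0.5); // 10 runs: ~40%, likely missed
    assert!(detection_probability(p, 50) > 0.9); // 50 runs: ~92%, likely caught
}
```

This is why one pass of a timing-dependent scenario proves nothing, and why the encryption scenarios get 50 iterations.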

Each chaos run produces:

  • JUnit XML — Parsed by CI and workstation dashboards
  • Chaos log — Full timeline of injections and observations
  • Failure captures — On assertion failure: container logs, Toxiproxy state, network captures (tcpdump) saved to chaos-results/{run-id}/

12. Integration with CI/CD

| Environment | What Runs | Trigger |
|---|---|---|
| Dedicated server | Full chaos suite (all scenarios, all iterations) | Nightly cron or manual |
| CI (GitHub Actions) | Smoke chaos (3 key scenarios, 1 iteration each) | PR merge to main |
| Dev workstation | Analysis of chaos results, trend tracking | Post-run |

Three representative scenarios run in CI as a sanity check (not a substitute for full chaos):

  1. T-LOSS-02 — 20% packet loss, sync completes
  2. E-HS-01 — Handshake disruption, clean retry
  3. S-CONC-01 — Concurrent push, no lost writes

These run in a lightweight Docker topology within GitHub Actions (2 clients + 1 relay + 1 Toxiproxy). Target: <10 minutes.

From the release strategy quality gates:

| Milestone | Chaos Requirement |
|---|---|
| Alpha | CI smoke chaos passes |
| Beta | Full transport + encryption chaos passes on server |
| RC | Full chaos suite passes (all sections, all iterations) |
| GA | Full chaos + cross-platform chaos passes |
```sh
# cron: 0 2 * * * (2 AM daily)
cd /opt/0k-sync && ./scripts/chaos-run.sh \
  --all --iterations default \
  --output /data/chaos-results/$(date +%Y%m%d)
```

Results reviewed next morning. Any failure blocks the day’s development until investigated.


13. Metrics & Observability

| Metric | Collection | Purpose |
|---|---|---|
| Sync completion time | Client logs | Detect performance regression under chaos |
| Handshake success rate | Client + relay logs | Encryption resilience |
| Blob integrity (hash match rate) | Client assertions | The primary invariant |
| Connection retry count | Client logs | Detect retry storms |
| Memory usage (relay) | docker stats | Detect leaks under sustained chaos |
| Open file descriptors (relay) | /proc/{pid}/fd count | Detect connection leaks |
| Nonce counter progression | Encryption layer instrumentation | Detect nonce reuse risk |

Before running chaos, establish baselines on a clean network:

  • Sync time for 1MB blob (2 clients, clean LAN)
  • Handshake completion time
  • Memory steady-state after 1000 sync cycles

Chaos results are compared against baselines. Acceptable degradation thresholds:

| Metric | Clean Baseline | Acceptable Under Chaos | Failure |
|---|---|---|---|
| Sync completion | X ms | <= 10X ms | > 10X or timeout |
| Handshake time | Y ms | <= 5Y ms | > 5Y or failure |
| Memory growth | 0 (steady) | <= 10% growth over 1hr | > 10% (leak) |
| Blob integrity | 100% | 100% | < 100% (CRITICAL) |

Note: blob integrity has no acceptable degradation. 100% or it’s a P0 bug.

Dev workstation tracks chaos trends over time. Key views:

  • Pass rate per scenario — trend over nightly runs (catch regressions early)
  • Sync time under chaos — box plot per scenario (detect performance drift)
  • Resource consumption — relay memory/CPU during chaos (catch leaks)

14. Runbook

```sh
# On the dedicated server
cd /opt/0k-sync

# Pull latest
git pull origin main

# Build chaos test images
docker compose -f tests/chaos/docker-compose.chaos.yml build

# Run all scenarios
cargo nextest run -p chaos-tests --no-capture

# Or run a specific section
cargo nextest run -p chaos-tests -E 'test(/^transport/)'
cargo nextest run -p chaos-tests -E 'test(/^encryption/)'
cargo nextest run -p chaos-tests -E 'test(/^sync/)'
cargo nextest run -p chaos-tests -E 'test(/^adversarial/)'
```

```sh
# Run T-LOSS-04 specifically, 50 iterations
cargo nextest run -p chaos-tests \
  -E 'test(partition_then_recovery)' -- --iterations 50
```

If a chaos run fails or is interrupted, containers may be left running:

scripts/chaos-cleanup.sh

```sh
docker compose -f tests/chaos/docker-compose.chaos.yml down -v
docker network prune -f --filter "label=project=0ksync-chaos"
docker volume prune -f --filter "label=project=0ksync-chaos"
```

When a chaos test fails:

  1. Check chaos-results/{run-id}/ for captured logs and network traces
  2. Look at Toxiproxy state at time of failure — what toxics were active?
  3. Check relay container logs for panics or unexpected errors
  4. Check client container logs for the specific assertion that failed
  5. Reproduce with --iterations 1 and RUST_LOG=trace for detailed output
  6. If timing-dependent, add the scenario to the “flaky watch” list and increase iterations
  1. Identify the failure mode (what goes wrong in the real world?)
  2. Write the assertion first (what must be true after recovery?)
  3. Implement the chaos injection (Toxiproxy toxic, Pumba action, or custom)
  4. Add to the appropriate section in tests/chaos/src/scenarios/
  5. Run 50 iterations on the dedicated server to validate reliability
  6. Add to the scenario table in this document
  7. If it should be in CI smoke, add to the CI workflow
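
As a sketch of step 2 — the assertion written before the chaos injection exists — an Invariant 1 helper can be unit-tested in isolation. All names here are illustrative, not the harness's real API; `BlobHash` stands in for the real content-hash type:

```rust
// Hypothetical assertion helper for Invariant 1 (no data loss).
#[derive(Debug, Clone, PartialEq)]
struct BlobHash(String);

// Every blob pushed before or during the chaos window must be retrievable
// on the paired client after recovery.
fn assert_no_data_loss(pushed: &[BlobHash], retrievable: &[BlobHash]) -> Result<(), String> {
    for h in pushed {
        if !retrievable.contains(h) {
            return Err(format!("blob {h:?} missing after recovery"));
        }
    }
    Ok(())
}

fn main() {
    // Step 3 (the chaos injection) would run between "push" and this check.
    let pushed = vec![BlobHash("abc123".into()), BlobHash("def456".into())];
    let retrievable = vec![BlobHash("def456".into()), BlobHash("abc123".into())];
    assert!(assert_no_data_loss(&pushed, &retrievable).is_ok());
}
```

Writing the helper first also gives step 5's 50-iteration validation something stable to assert against while the injection logic is still being tuned.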

15. Development Phasing: When to Write Chaos Tests

Cross-reference: The implementation plan includes chaos deliverables in each phase’s validation gate and checkpoint, matching the phase-by-phase mapping below.

15.1 Principle: Assertions First, Infrastructure Second, Topology Last

Chaos tests follow the same TDD discipline as everything else in 0k-Sync: write the assertion before you write the thing it tests. A chaos scenario’s assertion (“after 200ms latency heals, all blobs must be present on both clients”) is a resilience requirement. Writing it early forces you to design for recovery from the start.
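
For example, the convergence assertion behind Invariant 4 reduces to comparing version vectors entry by entry. A minimal sketch, assuming a map-based vector representation (the harness's real comparator may differ):

```rust
use std::collections::BTreeMap;

// A version vector maps a client ID to its last-seen counter.
type VersionVector = BTreeMap<String, u64>;

// Returns the entries on which two paired clients disagree; empty means converged.
fn divergence(a: &VersionVector, b: &VersionVector) -> Vec<(String, Option<u64>, Option<u64>)> {
    let mut out = Vec::new();
    for key in a.keys().chain(b.keys()) {
        let (va, vb) = (a.get(key).copied(), b.get(key).copied());
        if va != vb {
            out.push((key.clone(), va, vb));
        }
    }
    // Keys present in both maps are visited twice; collapse duplicates.
    out.sort();
    out.dedup();
    out
}

fn main() {
    let mut a = VersionVector::new();
    a.insert("client-a".into(), 7);
    let mut b = a.clone();
    assert!(divergence(&a, &b).is_empty()); // converged

    b.insert("client-b".into(), 1);
    assert_eq!(divergence(&a, &b).len(), 1); // one entry missing on client A's side
}
```

Reporting which entries diverge (rather than a bare boolean) is what makes a failed nightly run debuggable from logs alone.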

What you do NOT do is dump all 68 scenarios as a single sprint after the code is “done.” By then it’s too late — the architecture has already baked in assumptions that chaos testing would have caught.

| Impl Phase | Chaos Work | What's Runnable |
| --- | --- | --- |
| Phase 1-2 (sync-types, sync-core) | Build chaos harness skeleton: Docker Compose templates, Toxiproxy Rust wrapper, chaos controller scaffold, assertion helpers (blob integrity checker, version vector convergence comparator). No scenarios yet — there's no network code to break. | Harness compiles; helper functions have their own unit tests. |
| Phase 3 (sync-client + clatter) | Write E-HS-*, E-ENC-*, E-PQ-* scenario assertions against a mock transport. Inject chaos at the crypto layer (corrupt handshake bytes, truncate messages, reorder). These test the encryption logic, not the network. Also stub out T-* scenario signatures with `#[ignore]` — they need a relay to actually run. | Encryption chaos runs in-process (no Docker). `cargo nextest run -p chaos-tests -E 'test(/^encryption/)'` executes against mocks. Transport stubs compile but are skipped. |
| Phase 3.5 (sync-content) | Write S-BLOB-* assertions (hash verification after transfer) and C-STOR-* stubs (disk full, bit rot). These test the content pipeline in isolation using mock storage backends. | Content chaos runs in-process. |
| Phase 4 (sync-cli) | The CLI becomes the chaos test client driver. Write S-SM-* and S-CONC-* scenario logic using the CLI as the programmable client. Still can't run the full topology without a relay — but the scenario logic and assertions are complete. | Scenario logic compiles. Integration needs Phase 6. |
| Phase 6 (sync-relay) | This is where the full suite lights up. Wire all existing stubs and mocks to the real Docker topology. Transport scenarios (T-*) go live. Adversarial scenarios (A-PROTO-*, A-RES-*) get implemented. Cross-platform (X-PLAT-*) follows when VM infrastructure is ready. Run every scenario for 50 iterations on the dedicated server. | Full chaos suite operational. Nightly runs begin. |

15.3 What Gets Written When — Scenario Mapping

Chaos test development phasing: Phase 1-2 (harness) → Phase 3 (mock encryption) → Phase 3.5 (mock storage) → Phase 4 (CLI-driven) → Phase 6 (full Docker topology, all 68 scenarios)

The dividing line is simple: if the scenario tests logic, mock the transport. If the scenario tests the network, use Docker.

Encryption chaos (E-*) can run entirely in-process because you’re testing “what happens when bytes are corrupted between handshake messages” — you don’t need TCP to corrupt bytes. A mock transport that injects bit flips is faster, more deterministic, and more debuggable than Toxiproxy doing the same thing through Docker.
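
A minimal sketch of such a mock — the struct and method names are assumptions, not the real sync-client transport trait — flips one chosen bit in one chosen message, fully deterministically:

```rust
// Deterministic in-process fault injection: no Docker, no Toxiproxy.
struct BitFlipTransport {
    sent_msgs: Vec<Vec<u8>>, // messages "delivered" so far (stand-in for a socket)
    flip_at_msg: usize,      // index of the message to corrupt
    flip_bit: usize,         // which bit within that message to flip
    sent: usize,             // how many messages have passed through
}

impl BitFlipTransport {
    fn send(&mut self, mut msg: Vec<u8>) {
        if self.sent == self.flip_at_msg {
            let byte = self.flip_bit / 8;
            if byte < msg.len() {
                msg[byte] ^= 1 << (self.flip_bit % 8); // the single-bit corruption
            }
        }
        self.sent += 1;
        self.sent_msgs.push(msg);
    }
}

fn main() {
    let mut t = BitFlipTransport { sent_msgs: vec![], flip_at_msg: 1, flip_bit: 3, sent: 0 };
    t.send(vec![0x00, 0x00]); // message 0: untouched
    t.send(vec![0x00, 0x00]); // message 1: bit 3 of byte 0 flipped
    assert_eq!(t.sent_msgs[0], vec![0x00, 0x00]);
    assert_eq!(t.sent_msgs[1], vec![0x08, 0x00]);
}
```

Because the corruption site is a parameter, an E-HS scenario can sweep it across every bit of every handshake message and assert the handshake fails cleanly each time.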

Transport chaos (T-*) must use real Docker topology because the whole point is testing the actual TCP/QUIC stack’s behavior under network degradation. You cannot meaningfully mock “200ms latency with 150ms jitter” — you need tc netem or Toxiproxy acting on real socket connections.

Adversarial chaos (A-*) must use real topology because you’re testing the relay’s behavior as a running process under attack conditions.

When Phase 6 lands and the full topology is available, encryption scenarios that previously ran against mocks should ALSO run against the real topology. This catches integration-level issues that mocks miss (buffer sizes, timeout interactions, flow control under real QUIC).

The mock versions stay — they run in CI (fast, no Docker needed). The real topology versions run on the dedicated server (slow, full fidelity). Both must pass.
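
One way to keep the two variants in lockstep is to write each scenario body once, generic over a transport trait, and instantiate it differently in CI and on the server. A sketch under assumed names — `DockerTransport` here is only a placeholder for the real Toxiproxy-backed path:

```rust
// Hypothetical abstraction: one scenario body, two fidelities.
trait ChaosTransport {
    fn deliver(&mut self, msg: Vec<u8>) -> Vec<u8>;
}

struct MockTransport;   // in-process, deterministic — runs in CI
struct DockerTransport; // placeholder for the real-socket, Toxiproxy-backed path

impl ChaosTransport for MockTransport {
    fn deliver(&mut self, msg: Vec<u8>) -> Vec<u8> {
        msg
    }
}
impl ChaosTransport for DockerTransport {
    fn deliver(&mut self, msg: Vec<u8>) -> Vec<u8> {
        msg // would actually cross the Docker topology
    }
}

// The scenario and its assertion are written exactly once.
fn run_roundtrip_scenario<T: ChaosTransport>(t: &mut T) -> bool {
    let sent = vec![1u8, 2, 3];
    t.deliver(sent.clone()) == sent
}

fn main() {
    assert!(run_roundtrip_scenario(&mut MockTransport));
    assert!(run_roundtrip_scenario(&mut DockerTransport));
}
```

With this shape, a scenario cannot drift: any assertion change applies to the CI mock run and the nightly topology run simultaneously.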

  • CI (every PR): mock-based encryption + sync chaos → <5 min
  • Server (nightly): full Docker topology, all 68 scenarios → ~2 hrs

| Prefix | Section | Count |
| --- | --- | --- |
| T-LAT | Transport: Latency & Jitter | 4 |
| T-LOSS | Transport: Packet Loss | 4 |
| T-CONN | Transport: Connection Events | 5 |
| T-BW | Transport: Bandwidth | 3 |
| E-HS | Encryption: Handshake | 6 |
| E-ENC | Encryption: Session | 5 |
| E-PQ | Encryption: Post-Quantum | 5 |
| S-SM | Sync: State Machine | 4 |
| S-CONC | Sync: Concurrent Operations | 4 |
| S-CONV | Sync: State Convergence | 4 |
| S-BLOB | Sync: Blob Integrity | 4 |
| C-STOR | Content: Storage | 4 |
| C-COLL | Content: Collections | 2 |
| A-PROTO | Adversarial: Protocol | 5 |
| A-RES | Adversarial: Resource Exhaustion | 5 |
| X-PLAT | Cross-Platform | 4 |
| Total | | 68 scenarios |

Version: 1.5.0 | Date: 2026-02-03