
Distributed Testing Guide


  1. Overview
  2. Architecture
  3. Permanent Relay Setup
  4. Running Tests
  5. Relay Observability
  6. Test Categories
  7. Adding New Tests
  8. Troubleshooting
  9. File Reference

1. Overview

Distributed testing validates 0k-sync across real machines, real networks, and real latency — not simulated chaos on a Docker bridge. The test infrastructure spans three machines on a WireGuard mesh:

  • Server A (Mac Mini, macOS) — Test orchestrator, runs cargo test
  • Server B (Linux server, 91GB RAM) — Hosts 3 relay instances in Docker
  • Server C (Raspberry Pi, ARM Linux) — Edge device under test

There are two tiers of chaos tests:

| Tier | Tests | Where They Run | What They Test |
|---|---|---|---|
| Single-host Docker | 28 scenarios | Server B only | Toxiproxy-mediated chaos (latency, loss, partition) |
| Distributed | 37 scenarios | Server A -> Server B + Server C | Real multi-machine sync, relay failover, edge device behavior |

Both tiers coexist. Single-host tests use docker-compose.chaos.yml. Distributed tests use docker-compose.distributed.yml.


2. Architecture

Distributed test architecture: Server A (test orchestrator) ↔ Server B (3 relay instances) ↔ Server C (edge device), connected via WireGuard mesh

Network: WireGuard mesh. All nodes directly routable. Traffic goes over real internet (WireGuard tunnels). iroh endpoints publish via Pkarr/DNS discovery.

Key design decisions:

  1. Permanent relays. The 3 relays on Server B start once and stay running. Tests connect to them rather than starting them. This enables rapid iteration, load testing, and cost analysis.

  2. Per-test isolation via passphrase. Each test generates a unique passphrase, creating a unique sync group. Multiple tests can run against the same relays without data collision.

  3. SSH orchestration. All remote commands run via tokio::process::Command shelling out to ssh. No SSH crate needed — the mesh VPN handles authentication.

  4. ARM cross-compilation on Server B. Server C’s binary is cross-compiled on Server B (Linux -> ARM Linux via cross), then SCP’d to Server C. Server A is macOS and can’t cross-compile for ARM Linux easily.
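The passphrase isolation in decision 2 can be sketched from the shell. This is illustrative only — the real harness generates passphrases in Rust, and `openssl` availability is an assumption:

```shell
# Generate a unique passphrase per test run (hex avoids quoting issues).
# NOTE: illustrative only -- the real harness creates passphrases in Rust.
new_test_passphrase() {
  openssl rand -hex 16
}

p1="$(new_test_passphrase)"
p2="$(new_test_passphrase)"

# Two runs -> two distinct sync groups, so no data collision on shared relays.
[ "$p1" != "$p2" ] && echo "unique"
```

Because the sync group is derived from the passphrase, any number of tests can hammer the same three relays concurrently without seeing each other's data.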


3. Permanent Relay Setup

From Server A or any machine with SSH access to Server B:

ssh [email protected] "cd ~/0k-sync && git pull && \
docker compose -f tests/chaos/docker-compose.distributed.yml \
-p dist-chaos up -d --build --wait"

This builds 3 relay images and starts them with:

  • RUST_LOG=sync_relay=debug,iroh=warn — full debug logging
  • NET_ADMIN capability — for tc netem chaos injection
  • Health checks on each relay (5s interval, 3s timeout, 5 retries)

Verify all 3 are healthy:

ssh [email protected] "curl -s http://localhost:8090/health && echo && \
curl -s http://localhost:8091/health && echo && \
curl -s http://localhost:8092/health"
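When scripting against the relays, a small retry loop avoids racing their startup. A sketch — the `wait_healthy` helper and its parameters are hypothetical, not part of the harness:

```shell
# Poll a command until it succeeds or the retries run out.
# Usage: wait_healthy <retries> <sleep_seconds> <command...>
wait_healthy() {
  retries="$1"; pause="$2"; shift 2
  i=0
  while [ "$i" -lt "$retries" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "unhealthy"
  return 1
}

# e.g. wait_healthy 5 3 curl -sf http://localhost:8090/health
```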

Each relay logs its iroh Endpoint ID at startup:

ssh [email protected] "docker compose -p dist-chaos logs relay-1 2>&1 | grep 'Endpoint ID'"
ssh [email protected] "docker compose -p dist-chaos logs relay-2 2>&1 | grep 'Endpoint ID'"
ssh [email protected] "docker compose -p dist-chaos logs relay-3 2>&1 | grep 'Endpoint ID'"

The test harness discovers these automatically — you only need to check manually when debugging.
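If you do need to script the discovery, the ID can be grepped out of the logs. A sketch — the log-line shape in `sample` is an assumption; verify it against real relay output:

```shell
# Pull the token after "Endpoint ID:" out of a log stream.
# The log-line format below is an assumption -- check real relay logs.
extract_endpoint_id() {
  sed -n 's/.*Endpoint ID: \([a-z0-9]*\).*/\1/p' | head -n 1
}

sample='relay-1  | INFO sync_relay: Endpoint ID: k51abc123def'
echo "$sample" | extract_endpoint_id   # -> k51abc123def
```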

# Tear down the relays
ssh [email protected] "cd ~/0k-sync && \
docker compose -f tests/chaos/docker-compose.distributed.yml \
-p dist-chaos down -v --remove-orphans"

# Pull latest and restart
ssh [email protected] "cd ~/0k-sync && git pull && \
docker compose -f tests/chaos/docker-compose.distributed.yml \
-p dist-chaos up -d --build --wait"

The --build flag rebuilds the Docker images. The --wait flag blocks until all health checks pass.


4. Running Tests

All distributed tests are annotated #[ignore = "requires distributed"] and run from Server A. Prerequisites:

  1. 3 relays running on Server B (see Section 3.1)
  2. SSH keys configured: passwordless SSH to Server B and Server C must work
  3. sync-cli built locally: cargo build -p zerok-sync-cli --release
  4. Server C binary available: The harness handles this automatically (cross-compiles on Server B, SCPs to Server C)
cargo test -p chaos-tests distributed -- --ignored

This runs all 37 distributed tests: 5 SSH primitives, 16 infrastructure tests, 16 scenario tests.

# SSH primitives only
cargo test -p chaos-tests distributed::ssh -- --ignored
# Harness infrastructure tests
cargo test -p chaos-tests distributed::harness -- --ignored
# Multi-relay failover scenarios
cargo test -p chaos-tests mr_ -- --ignored
# Cross-machine sync scenarios
cargo test -p chaos-tests cm_ -- --ignored
# Edge device (Server C) scenarios
cargo test -p chaos-tests edge_ -- --ignored
# Network partition & convergence
cargo test -p chaos-tests net_ -- --ignored
cargo test -p chaos-tests conv_ -- --ignored
# Run a single test
cargo test -p chaos-tests mr_01_relay_crash_failover -- --ignored

4.5 Run All Chaos Tests (Single-Host + Distributed)


This only works on Server B (single-host tests require Docker):

cargo test -p chaos-tests -- --ignored

4.6 Run Non-Ignored Tests (Unit Tests Only)

cargo test -p chaos-tests

This runs the pure unit tests (SSH parsing, endpoint ID extraction, etc.) without any infrastructure.


5. Relay Observability

Each relay exposes /health on its HTTP port:

curl -s http://10.0.1.2:8090/health | python3 -m json.tool

Response:

{
  "status": "ok",
  "version": "0.1.0",
  "connections": 3,
  "groups": 2,
  "uptime_seconds": 7200,
  "total_blobs": 150,
  "storage_bytes": 51200,
  "groups_with_data": 5
}
| Field | Meaning |
|---|---|
| connections | Currently active QUIC sessions |
| groups | Groups with active sessions |
| uptime_seconds | Seconds since relay started |
| total_blobs | Blobs stored in SQLite |
| storage_bytes | Total ciphertext bytes in database |
| groups_with_data | Distinct groups with stored data |
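Individual fields can be pulled out of the health JSON with the same `python3` already used above for pretty-printing. The `health_field` helper is illustrative, not part of the tooling:

```shell
# Extract one field from the /health JSON on stdin.
health_field() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)[sys.argv[1]])' "$1"
}

# In practice you would pipe `curl -s http://10.0.1.2:8090/health` in;
# here a sample shaped like the response above stands in.
sample='{"status":"ok","connections":3,"groups":2,"uptime_seconds":7200}'
echo "$sample" | health_field connections   # -> 3
```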

Each relay exposes /metrics in Prometheus text format:

curl -s http://10.0.1.2:8090/metrics

Gauges (current state):

| Metric | Description |
|---|---|
| sync_relay_connections_active | Active QUIC sessions now |
| sync_relay_groups_active | Active sync groups now |
| sync_relay_storage_blobs | Blobs in database now |
| sync_relay_storage_bytes | Ciphertext bytes in database now |
| sync_relay_storage_groups | Groups with stored data now |
| sync_relay_info{version="..."} | Server version |

Counters (monotonic since startup):

| Metric | Description |
|---|---|
| sync_relay_pushes_total | Total PUSH requests handled |
| sync_relay_pulls_total | Total PULL requests handled |
| sync_relay_connections_total | Total connections accepted |
| sync_relay_bytes_received_total | Total ciphertext bytes received |
| sync_relay_bytes_sent_total | Total ciphertext bytes sent |
| sync_relay_blobs_stored_total | Total blobs stored since startup |
| sync_relay_rate_limit_hits_total | Total rate limit rejections |
| sync_relay_errors_total | Total protocol errors |
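Because the counters are monotonic, rates come from diffing two snapshots of /metrics. A sketch using `awk` — the `counter_delta` helper and the snapshot files are hypothetical:

```shell
# Diff one monotonic counter between two saved /metrics snapshots.
# Usage: counter_delta <metric_name> <before_file> <after_file>
counter_delta() {
  b="$(awk -v m="$1" '$1 == m { print $2 }' "$2")"
  a="$(awk -v m="$1" '$1 == m { print $2 }' "$3")"
  echo $((a - b))
}

# Fake snapshots standing in for two scrapes of a relay:
printf 'sync_relay_pushes_total 100\nsync_relay_pulls_total 40\n' > /tmp/before.prom
printf 'sync_relay_pushes_total 130\nsync_relay_pulls_total 55\n' > /tmp/after.prom
counter_delta sync_relay_pushes_total /tmp/before.prom /tmp/after.prom   # -> 30
```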
# Follow all relay logs
ssh [email protected] "docker compose -p dist-chaos logs -f"
# Follow a specific relay
ssh [email protected] "docker compose -p dist-chaos logs -f relay-1"
# Last 100 lines from relay-2
ssh [email protected] "docker compose -p dist-chaos logs --tail 100 relay-2"

With RUST_LOG=sync_relay=debug, you’ll see:

  • Every connection accept/close
  • Every HELLO handshake (device ID, group ID, pending count)
  • Every PUSH (blob ID, cursor, group, bytes)
  • Every PULL (blob count, bytes, cursor range)
  • Every NOTIFY delivery
  • Rate limit hits
  • Cleanup task activity

Quick health dashboard:

for port in 8090 8091 8092; do
  echo "=== Relay on :$port ==="
  ssh [email protected] "curl -s http://localhost:$port/health" | python3 -m json.tool
  echo
done

Quick metrics comparison:

for port in 8090 8091 8092; do
  echo "=== Relay :$port ==="
  ssh [email protected] "curl -s http://localhost:$port/metrics" \
    | grep -E "^sync_relay_(pushes|pulls|bytes|connections_total|storage_bytes)"
  echo
done

6. Test Categories

SSH primitives (5 tests):

| Test | What |
|---|---|
| ssh_exec_server_b_whoami | SSH to Server B, verify user |
| ssh_exec_server_b_docker_version | Docker available on Server B |
| ssh_exec_server_c_whoami | SSH to Server C, verify user |
| ssh_exec_nonexistent_command | Error handling for bad commands |
| ssh_scp_round_trip | SCP file to Server B and back |
Harness infrastructure:

| Test | What |
|---|---|
| distributed_connect_to_relays | Connect to 3 permanent relays, discover Endpoint IDs |
| distributed_relay_health_checks | All 3 relays respond to health checks |
| distributed_server_c_binary_exists | ARM binary present on Server C |
| distributed_server_c_cli_version | sync-cli runs on Server C |
| distributed_server_c_init | sync-cli init works on Server C |
| distributed_init_pair_a | Init and pair on Server A (local) |
| distributed_init_pair_c | Init and pair on Server C (SSH) |
| distributed_init_pair_b_container | Init and pair in Server B container |
| distributed_push_pull_a_to_c | Server A pushes, Server C pulls, data matches |
| distributed_configure_multi_relay | All clients have all 3 relay addresses |
| distributed_netem_relay_latency | tc netem latency on relay container |
| distributed_netem_server_c_loss | tc netem packet loss on Server C |
| distributed_partition_b_c | iptables partition between machines |
| distributed_heal_partition | Remove iptables partition |
Multi-relay failover (MR):

| Test | What |
|---|---|
| mr_01_relay_crash_failover | Kill relay-1, verify failover to relay-2/3 |
| mr_02_fan_out_all_relays | Push fan-out reaches all 3 relays |
| mr_03_relay_restart_new_endpoint | Restart relay, verify new Endpoint ID |
| mr_04_all_relays_down | Kill all 3, verify error, restart 1, verify recovery |
Cross-machine sync (CM):

| Test | What |
|---|---|
| cm_01_a_push_c_pull | Server A pushes 10 messages, Server C pulls all 10 |
| cm_02_bidirectional_sync | Server A pushes 5, Server C pushes 5, both see all 10 |
| cm_03_three_way_sync | Server A + Server B + Server C all push, all see everything |
| cm_04_concurrent_push_pull | 20 rapid pushes from Server A, Server C receives all |
Edge device (EDGE):

| Test | What |
|---|---|
| edge_01_server_c_high_latency | 500ms + 100ms jitter on Server C |
| edge_02_server_c_bandwidth_limit | 128kbps bandwidth limit on Server C |
| edge_03_server_c_partition_recovery | Block Server C, push 10 msgs, unblock, catch up |
| edge_04_server_c_slow_relay_fast_client | 200ms relay latency, bidirectional push/pull |

6.6 Network Partition & Convergence — NET/CONV (4 tests)

| Test | What |
|---|---|
| net_01_partition_a_b | Block Server A <-> Server B, verify error, heal, verify recovery |
| net_02_selective_relay_partition | 100% loss on relay-1 only, clients fail over |
| net_03_asymmetric_chaos | Relay-1: 200ms, Relay-2: 10% loss, Relay-3: clean |
| conv_01_convergence_after_multi_failure | Kill relay + partition Server C + push + heal -> converge |

7. Adding New Tests

use crate::distributed::harness::{DistributedHarness, Machine, ChaosTarget};
use crate::netem::NetemConfig;
// Connect to permanent relays (fail-fast if not running)
let harness = DistributedHarness::connect().await?;
// Init and pair all 3 clients (Server A, Server B, Server C)
harness.init_and_pair_all().await?;
// Push/pull from any machine
harness.push(Machine::ServerA, "hello").await?;
let output = harness.pull(Machine::ServerC).await?;
// Kill/restart relays (tests that modify relays must restore them)
harness.kill_relay(0).await?;
harness.restart_relay(0).await?;
// Chaos injection (tc netem)
let netem = NetemConfig::new().delay(200).jitter(50).loss(5.0);
harness.inject_netem(ChaosTarget::Relay(0), &netem).await?;
harness.inject_netem(ChaosTarget::ServerC, &netem).await?;
harness.clear_netem(ChaosTarget::Relay(0)).await?;
// Network partition (iptables on Server B)
harness.partition("10.0.1.3", "10.0.1.2").await?;
harness.heal_partition("10.0.1.3", "10.0.1.2").await?;
// Collect relay logs for zero-knowledge assertion
let logs = harness.all_relay_logs().await?;
for log in &logs {
    let result = assert_no_plaintext_in_logs(log);
    assert!(result.passed);
}
// Clean up client state (does NOT touch relays)
harness.cleanup().await?;
#[tokio::test]
#[ignore = "requires distributed"]
async fn my_new_scenario() {
    let harness = DistributedHarness::connect().await.expect("connect failed");
    DistributedHarness::ensure_server_c_binary()
        .await
        .expect("ensure_server_c_binary failed");
    harness.init_and_pair_all().await.expect("init_and_pair_all failed");

    // --- Test logic here ---

    // Always clean up
    harness.cleanup().await.expect("cleanup failed");
}
  • All distributed tests go in tests/chaos/src/scenarios/distributed.rs
  • All tests must be #[ignore = "requires distributed"]
  • Tests that kill/restart relays must restore them before returning
  • Tests that inject netem/iptables must clear them before returning
  • Use settle() (3s) for normal propagation delays
  • Use settle_long() (10s) for chaos recovery scenarios
  • Unique message prefixes prevent cross-test interference (UUID per message)
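The unique-prefix convention can be sketched in shell — the harness does this in Rust; `msg_with_prefix` is a hypothetical helper, and `python3` is assumed for UUID generation:

```shell
# Prefix each test message with a fresh UUID so concurrent tests against
# the shared relays cannot collide. Hypothetical helper, shown for shape.
msg_with_prefix() {
  uuid="$(python3 -c 'import uuid; print(uuid.uuid4())')"
  echo "${uuid}-$1"
}

m1="$(msg_with_prefix hello)"
m2="$(msg_with_prefix hello)"
[ "$m1" != "$m2" ] && echo "distinct"
```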

8. Troubleshooting

The harness checks relay health on connect. If relays aren’t running:

# Start them
ssh [email protected] "cd ~/0k-sync && \
docker compose -f tests/chaos/docker-compose.distributed.yml \
-p dist-chaos up -d --build --wait"
# Check what's using the ports
ssh [email protected] "ss -tlnp | grep -E '8090|8091|8092'"
# Kill orphaned containers
ssh [email protected] "docker ps -a --filter 'name=dist-chaos' --format '{{.Names}}'"
ssh [email protected] "docker compose -p dist-chaos down -v --remove-orphans"
# Test SSH manually
ssh [email protected] "whoami"
ssh [email protected] "whoami"
# Check mesh VPN connectivity
wg show # or equivalent mesh VPN status command

The harness auto-builds and SCPs the ARM binary. If it fails:

# Manual cross-compile on Server B
ssh [email protected] "export PATH=\$HOME/.cargo/bin:\$PATH && \
cd ~/0k-sync && cargo install cross 2>/dev/null || true && \
cross build --target aarch64-unknown-linux-gnu -p zerok-sync-cli --release"
# Manual SCP to Server C
ssh [email protected] "scp ~/0k-sync/target/aarch64-unknown-linux-gnu/release/sync-cli \
[email protected]:/tmp/0k-sync-test/sync-cli"
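Before SCPing, it's worth confirming the build really targets ARM. A sketch that checks `file` output — the `is_aarch64_elf` helper and the sample string are illustrative:

```shell
# Succeed if a `file` description looks like an aarch64 ELF binary.
is_aarch64_elf() {
  echo "$1" | grep -q 'ELF 64-bit.*aarch64'
}

# Normally: desc="$(ssh [email protected] 'file ~/0k-sync/target/aarch64-unknown-linux-gnu/release/sync-cli')"
# Sample of what `file` prints for a correct cross-build:
desc='sync-cli: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV)'
is_aarch64_elf "$desc" && echo "ARM binary OK"
```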

If a distributed test fails, the most likely cause is a relay that’s not responding. Check health:

for port in 8090 8091 8092; do
  echo -n "Relay :$port "
  ssh [email protected] "curl -sf http://localhost:$port/health >/dev/null" && echo "OK" || echo "DOWN"
done

If a relay is down after a kill_relay test, restart it:

ssh [email protected] "cd ~/0k-sync && \
docker compose -f tests/chaos/docker-compose.distributed.yml \
-p dist-chaos up -d --wait relay-1"

If a test crashed mid-partition, clean up manually:

ssh [email protected] "sudo iptables -L INPUT -n | grep DROP"
ssh [email protected] "sudo iptables -F INPUT && sudo iptables -F OUTPUT"
# On a relay container
ssh [email protected] "docker exec dist-chaos-relay-1-1 tc qdisc del dev eth0 root 2>/dev/null; echo done"
# On Server C
ssh [email protected] "sudo tc qdisc del dev eth0 root 2>/dev/null; echo done"

9. File Reference

| File | Purpose |
|---|---|
| tests/chaos/src/distributed/mod.rs | Module root |
| tests/chaos/src/distributed/ssh.rs | SSH execution primitives (SshTarget, exec, scp) |
| tests/chaos/src/distributed/config.rs | Machine IPs, paths, ports, timeouts |
| tests/chaos/src/distributed/harness.rs | DistributedHarness orchestrator |
| tests/chaos/src/scenarios/distributed.rs | 16 scenario tests (MR, CM, EDGE, NET, CONV) |
| File | Purpose |
|---|---|
| tests/chaos/docker-compose.distributed.yml | 3-relay + client topology for Server B |
| tests/chaos/Dockerfile.relay | Relay Docker image (shared with single-host tests) |
| tests/chaos/Dockerfile.cli | CLI Docker image (for client container) |
| Endpoint | URL (relay-1) | Format |
|---|---|---|
| Health | http://10.0.1.2:8090/health | JSON |
| Metrics | http://10.0.1.2:8090/metrics | Prometheus text |
| Logs | docker compose -p dist-chaos logs relay-1 | Text (tracing fmt) |
| Machine | Mesh IP | Role |
|---|---|---|
| Server A | 10.0.1.1 | Test orchestrator (macOS) |
| Server B | 10.0.1.2 | Relay host (Linux, 91GB RAM) |
| Server C | 10.0.1.3 | Edge device (Raspberry Pi, ARM) |



Version: 1.0.0 | Date: 2026-02-07