# Distributed Testing Guide

## Table of Contents

- Overview
- Architecture
- Permanent Relay Setup
- Running Tests
- Relay Observability
- Test Categories
- Adding New Tests
- Troubleshooting
- File Reference
## 1. Overview

Distributed testing validates 0k-sync across real machines, real networks, and real latency — not simulated chaos on a Docker bridge. The test infrastructure spans three machines on a WireGuard mesh:
- Server A (Mac Mini, macOS) — Test orchestrator, runs `cargo test`
- Server B (Linux server, 91GB RAM) — Hosts 3 relay instances in Docker
- Server C (Raspberry Pi, ARM Linux) — Edge device under test
There are two tiers of chaos tests:
| Tier | Tests | Where They Run | What They Test |
|---|---|---|---|
| Single-host Docker | 28 scenarios | Server B only | Toxiproxy-mediated chaos (latency, loss, partition) |
| Distributed | 37 scenarios | Server A -> Server B + Server C | Real multi-machine sync, relay failover, edge device behavior |
Both tiers coexist. Single-host tests use `docker-compose.chaos.yml`; distributed tests use `docker-compose.distributed.yml`.
## 2. Architecture

Network: WireGuard mesh. All nodes directly routable. Traffic goes over the real internet (WireGuard tunnels). iroh endpoints publish via Pkarr/DNS discovery.
Key design decisions:

- **Permanent relays.** The 3 relays on Server B start once and stay running. Tests connect to them rather than starting them. This enables rapid iteration, load testing, and cost analysis.
- **Per-test isolation via passphrase.** Each test generates a unique passphrase, creating a unique sync group. Multiple tests can run against the same relays without data collision.
- **SSH orchestration.** All remote commands run via `tokio::process::Command` shelling out to `ssh`. No SSH crate needed — the mesh VPN handles authentication.
- **ARM cross-compilation on Server B.** Server C's binary is cross-compiled on Server B (Linux -> ARM Linux via `cross`), then SCP'd to Server C. Server A is macOS and can't easily cross-compile for ARM Linux.
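The per-test isolation decision above can be sketched in shell. This is an illustrative sketch only — the real harness generates the passphrase in Rust, and `openssl` is assumed to be available here purely for demonstration:

```shell
# Illustrative only: a fresh random passphrase per test means a fresh sync
# group per test, so tests sharing the permanent relays cannot collide.
passphrase="test-$(openssl rand -hex 16)"   # 16 random bytes -> 32 hex chars
echo "$passphrase"
```

Any sufficiently random source works; the only requirement is that two concurrent test runs never derive the same group.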
## 3. Permanent Relay Setup

### 3.1 Starting Relays (One-Time)

From Server A or any machine with SSH access to Server B:

```shell
ssh [email protected] "docker compose -f tests/chaos/docker-compose.distributed.yml \
  -p dist-chaos up -d --build --wait"
```

This builds 3 relay images and starts them with:

- `RUST_LOG=sync_relay=debug,iroh=warn` — full debug logging
- `NET_ADMIN` capability — for tc netem chaos injection
- Health checks on each relay (5s interval, 3s timeout, 5 retries)
Verify all 3 are healthy:

```shell
ssh [email protected] "curl -s http://localhost:8090/health && echo && \
  curl -s http://localhost:8091/health && echo && \
  curl -s http://localhost:8092/health"
```

### 3.2 Discovering Endpoint IDs
Each relay logs its iroh Endpoint ID at startup. The test harness discovers these automatically — you only need to check manually when debugging.
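As a rough illustration of the manual check, the ID can be pulled out of a captured log line with shell string operations. The log line format below is an assumption for illustration — the real tracing output may be shaped differently:

```shell
# Hypothetical log line -- the real relay's startup log format may differ.
log_line='2026-02-07T12:00:00Z INFO sync_relay: listening endpoint_id=ed25519_abc123'
endpoint_id="${log_line##*endpoint_id=}"   # drop everything through "endpoint_id="
echo "$endpoint_id"
```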
### 3.3 Stopping Relays

```shell
ssh [email protected] "docker compose -f tests/chaos/docker-compose.distributed.yml \
  -p dist-chaos down -v --remove-orphans"
```

### 3.4 Rebuilding After Code Changes
```shell
ssh [email protected] "docker compose -f tests/chaos/docker-compose.distributed.yml \
  -p dist-chaos up -d --build --wait"
```

The `--build` flag rebuilds the Docker images. The `--wait` flag blocks until all health checks pass.
## 4. Running Tests

All distributed tests are annotated `#[ignore = "requires distributed"]` and run from Server A.
### 4.1 Prerequisites

- 3 relays running on Server B (see Section 3.1)
- SSH keys configured: passwordless SSH to Server B and Server C must work
- sync-cli built locally: `cargo build -p zerok-sync-cli --release`
- Server C binary available: the harness handles this automatically (cross-compiles on Server B, SCPs to Server C)
### 4.2 Run All Distributed Tests

```shell
cargo test -p chaos-tests distributed -- --ignored
```

This runs all 37 distributed tests: 5 SSH primitives, 16 infrastructure tests, 16 scenario tests.
### 4.3 Run Specific Test Categories

```shell
# SSH primitives only
cargo test -p chaos-tests distributed::ssh -- --ignored

# Harness infrastructure tests
cargo test -p chaos-tests distributed::harness -- --ignored

# Multi-relay failover scenarios
cargo test -p chaos-tests mr_ -- --ignored

# Cross-machine sync scenarios
cargo test -p chaos-tests cm_ -- --ignored

# Edge device (Server C) scenarios
cargo test -p chaos-tests edge_ -- --ignored

# Network partition & convergence
cargo test -p chaos-tests net_ -- --ignored
cargo test -p chaos-tests conv_ -- --ignored
```

### 4.4 Run a Single Test
```shell
cargo test -p chaos-tests mr_01_relay_crash_failover -- --ignored
```

### 4.5 Run All Chaos Tests (Single-Host + Distributed)
This only works on Server B (single-host tests require Docker):

```shell
cargo test -p chaos-tests -- --ignored
```

### 4.6 Run Non-Ignored Tests (Unit Tests Only)

```shell
cargo test -p chaos-tests
```

This runs the pure unit tests (SSH parsing, endpoint ID extraction, etc.) without any infrastructure.
## 5. Relay Observability

### 5.1 Health Endpoint

Each relay exposes `/health` on its HTTP port:

```shell
curl -s http://10.0.1.2:8090/health | python3 -m json.tool
```

Response:

```json
{
  "status": "ok",
  "version": "0.1.0",
  "connections": 3,
  "groups": 2,
  "uptime_seconds": 7200,
  "total_blobs": 150,
  "storage_bytes": 51200,
  "groups_with_data": 5
}
```

| Field | Meaning |
|---|---|
| `connections` | Currently active QUIC sessions |
| `groups` | Groups with active sessions |
| `uptime_seconds` | Seconds since relay started |
| `total_blobs` | Blobs stored in SQLite |
| `storage_bytes` | Total ciphertext bytes in database |
| `groups_with_data` | Distinct groups with stored data |
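For scripting against these fields (e.g. alerting when `connections` drops to zero), a one-liner through `python3` works, since this guide already uses `python3 -m json.tool`. The JSON below is the sample response from above, standing in for a live `curl`:

```shell
# Sample response from Section 5.1 -- in practice this comes from curl.
health='{"status": "ok", "connections": 3, "groups": 2, "uptime_seconds": 7200}'
connections=$(printf '%s' "$health" |
  python3 -c 'import json, sys; print(json.load(sys.stdin)["connections"])')
echo "connections: $connections"
```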
### 5.2 Prometheus Metrics

Each relay exposes `/metrics` in Prometheus text format:

```shell
curl -s http://10.0.1.2:8090/metrics
```

Gauges (current state):
| Metric | Description |
|---|---|
| `sync_relay_connections_active` | Active QUIC sessions now |
| `sync_relay_groups_active` | Active sync groups now |
| `sync_relay_storage_blobs` | Blobs in database now |
| `sync_relay_storage_bytes` | Ciphertext bytes in database now |
| `sync_relay_storage_groups` | Groups with stored data now |
| `sync_relay_info{version="..."}` | Server version |
Counters (monotonic since startup):
| Metric | Description |
|---|---|
| `sync_relay_pushes_total` | Total PUSH requests handled |
| `sync_relay_pulls_total` | Total PULL requests handled |
| `sync_relay_connections_total` | Total connections accepted |
| `sync_relay_bytes_received_total` | Total ciphertext bytes received |
| `sync_relay_bytes_sent_total` | Total ciphertext bytes sent |
| `sync_relay_blobs_stored_total` | Total blobs stored since startup |
| `sync_relay_rate_limit_hits_total` | Total rate limit rejections |
| `sync_relay_errors_total` | Total protocol errors |
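Because counters are monotonic, throughput is just the difference between two samples divided by the sampling interval. A sketch with illustrative values — in practice both samples would come from two scrapes of `/metrics`:

```shell
# Illustrative sample values, not real relay output.
pushes_t0=120   # sync_relay_pushes_total at time t0
pushes_t1=180   # sync_relay_pushes_total 30 seconds later
interval=30
rate=$(( (pushes_t1 - pushes_t0) / interval ))
echo "push rate: ${rate} req/s"
```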
### 5.3 Live Logs

```shell
# Follow all relay logs
ssh [email protected] "docker compose -p dist-chaos logs -f"

# Follow a specific relay
ssh [email protected] "docker compose -p dist-chaos logs -f relay-1"

# Last 100 lines from relay-2
ssh [email protected] "docker compose -p dist-chaos logs --tail 100 relay-2"
```

With `RUST_LOG=sync_relay=debug`, you’ll see:
- Every connection accept/close
- Every HELLO handshake (device ID, group ID, pending count)
- Every PUSH (blob ID, cursor, group, bytes)
- Every PULL (blob count, bytes, cursor range)
- Every NOTIFY delivery
- Rate limit hits
- Cleanup task activity
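Since each event type gets its own debug line, grepping captured logs gives quick event counts. The excerpt below is illustrative only — the real line layout depends on the tracing formatter:

```shell
# Illustrative log excerpt -- not verbatim relay output.
logs='DEBUG sync_relay: PUSH blob=aa cursor=1
DEBUG sync_relay: PULL count=2 bytes=1024
DEBUG sync_relay: PUSH blob=bb cursor=2'
push_count=$(printf '%s\n' "$logs" | grep -c 'PUSH')
echo "PUSH events: $push_count"
```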
### 5.4 Comparing All 3 Relays

Quick health dashboard:

```shell
for port in 8090 8091 8092; do
  echo "=== Relay on :$port ==="
  curl -s "http://localhost:$port/health" | python3 -m json.tool
  echo
done
```

Quick metrics comparison:

```shell
for port in 8090 8091 8092; do
  echo "=== Relay :$port ==="
  curl -s "http://localhost:$port/metrics" \
    | grep -E "^sync_relay_(pushes|pulls|bytes|connections_total|storage_bytes)"
  echo
done
```

## 6. Test Categories
### 6.1 SSH Primitives (5 tests)

| Test | What |
|---|---|
| `ssh_exec_server_b_whoami` | SSH to Server B, verify user |
| `ssh_exec_server_b_docker_version` | Docker available on Server B |
| `ssh_exec_server_c_whoami` | SSH to Server C, verify user |
| `ssh_exec_nonexistent_command` | Error handling for bad commands |
| `ssh_scp_round_trip` | SCP file to Server B and back |
### 6.2 Harness Infrastructure (16 tests)

| Test | What |
|---|---|
| `distributed_connect_to_relays` | Connect to 3 permanent relays, discover Endpoint IDs |
| `distributed_relay_health_checks` | All 3 relays respond to health checks |
| `distributed_server_c_binary_exists` | ARM binary present on Server C |
| `distributed_server_c_cli_version` | sync-cli runs on Server C |
| `distributed_server_c_init` | sync-cli init works on Server C |
| `distributed_init_pair_a` | Init and pair on Server A (local) |
| `distributed_init_pair_c` | Init and pair on Server C (SSH) |
| `distributed_init_pair_b_container` | Init and pair in Server B container |
| `distributed_push_pull_a_to_c` | Server A pushes, Server C pulls, data matches |
| `distributed_configure_multi_relay` | All clients have all 3 relay addresses |
| `distributed_netem_relay_latency` | tc netem latency on relay container |
| `distributed_netem_server_c_loss` | tc netem packet loss on Server C |
| `distributed_partition_b_c` | iptables partition between machines |
| `distributed_heal_partition` | Remove iptables partition |
### 6.3 Multi-Relay Failover — MR (4 tests)

| Test | What |
|---|---|
| `mr_01_relay_crash_failover` | Kill relay-1, verify failover to relay-2/3 |
| `mr_02_fan_out_all_relays` | Push fan-out reaches all 3 relays |
| `mr_03_relay_restart_new_endpoint` | Restart relay, verify new Endpoint ID |
| `mr_04_all_relays_down` | Kill all 3, verify error, restart 1, verify recovery |
### 6.4 Cross-Machine Sync — CM (4 tests)

| Test | What |
|---|---|
| `cm_01_a_push_c_pull` | Server A pushes 10 messages, Server C pulls all 10 |
| `cm_02_bidirectional_sync` | Server A pushes 5, Server C pushes 5, both see all 10 |
| `cm_03_three_way_sync` | Server A + Server B + Server C all push, all see everything |
| `cm_04_concurrent_push_pull` | 20 rapid pushes from Server A, Server C receives all |
### 6.5 Edge Device — EDGE (4 tests)

| Test | What |
|---|---|
| `edge_01_server_c_high_latency` | 500ms + 100ms jitter on Server C |
| `edge_02_server_c_bandwidth_limit` | 128kbps bandwidth limit on Server C |
| `edge_03_server_c_partition_recovery` | Block Server C, push 10 msgs, unblock, catch up |
| `edge_04_server_c_slow_relay_fast_client` | 200ms relay latency, bidirectional push/pull |
### 6.6 Network Partition & Convergence — NET/CONV (4 tests)

| Test | What |
|---|---|
| `net_01_partition_a_b` | Block Server A <-> Server B, verify error, heal, verify recovery |
| `net_02_selective_relay_partition` | 100% loss on relay-1 only, clients fail over |
| `net_03_asymmetric_chaos` | Relay-1: 200ms, Relay-2: 10% loss, Relay-3: clean |
| `conv_01_convergence_after_multi_failure` | Kill relay + partition Server C + push + heal -> converge |
## 7. Adding New Tests

### 7.1 Harness API

```rust
use crate::distributed::harness::{DistributedHarness, Machine, ChaosTarget};
use crate::netem::NetemConfig;

// Connect to permanent relays (fail-fast if not running)
let harness = DistributedHarness::connect().await?;

// Init and pair all 3 clients (Server A, Server B, Server C)
harness.init_and_pair_all().await?;

// Push/pull from any machine
harness.push(Machine::ServerA, "hello").await?;
let output = harness.pull(Machine::ServerC).await?;

// Kill/restart relays (tests that modify relays must restore them)
harness.kill_relay(0).await?;
harness.restart_relay(0).await?;

// Chaos injection (tc netem)
let netem = NetemConfig::new().delay(200).jitter(50).loss(5.0);
harness.inject_netem(ChaosTarget::Relay(0), &netem).await?;
harness.inject_netem(ChaosTarget::ServerC, &netem).await?;
harness.clear_netem(ChaosTarget::Relay(0)).await?;

// Network partition (iptables on Server B)
harness.partition("10.0.1.3", "10.0.1.2").await?;
harness.heal_partition("10.0.1.3", "10.0.1.2").await?;

// Collect relay logs for zero-knowledge assertion
let logs = harness.all_relay_logs().await?;
for log in &logs {
    let result = assert_no_plaintext_in_logs(log);
    assert!(result.passed);
}

// Clean up client state (does NOT touch relays)
harness.cleanup().await?;
```

### 7.2 Test Template
```rust
#[tokio::test]
#[ignore = "requires distributed"]
async fn my_new_scenario() {
    let harness = DistributedHarness::connect().await.expect("connect failed");

    DistributedHarness::ensure_server_c_binary()
        .await
        .expect("ensure_server_c_binary failed");

    harness.init_and_pair_all().await.expect("init_and_pair_all failed");

    // --- Test logic here ---

    // Always clean up
    harness.cleanup().await.expect("cleanup failed");
}
```

### 7.3 Conventions
- All distributed tests go in `tests/chaos/src/scenarios/distributed.rs`
- All tests must be `#[ignore = "requires distributed"]`
- Tests that kill/restart relays must restore them before returning
- Tests that inject netem/iptables must clear them before returning
- Use `settle()` (3s) for normal propagation delays
- Use `settle_long()` (10s) for chaos recovery scenarios
- Unique message prefixes prevent cross-test interference (UUID per message)
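The last convention can be sketched as follows; `python3` stands in here for whatever UUID source the tests actually use (an assumption for illustration):

```shell
# One UUID per message keeps concurrent tests against the shared relays
# from ever matching each other's payloads.
prefix=$(python3 -c 'import uuid; print(uuid.uuid4())')
msg="${prefix}-hello"
echo "$msg"
```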
## 8. Troubleshooting

### "RelaysNotRunning" Error

The harness checks relay health on connect. If relays aren’t running:

```shell
# Start them
ssh [email protected] "docker compose -f tests/chaos/docker-compose.distributed.yml \
  -p dist-chaos up -d --build --wait"
```

### Port Conflicts on Server B
```shell
# Check what's using the ports
ssh [email protected] "ss -tlnp | grep -E ':809[0-2]'"

# Kill orphaned containers
ssh [email protected] "docker compose -p dist-chaos down --remove-orphans"
```

### SSH Failures
```shell
# Test SSH manually
ssh [email protected] whoami
ssh [email protected] whoami

# Check mesh VPN connectivity
wg show   # or equivalent mesh VPN status command
```

### Server C Binary Not Found
The harness auto-builds and SCPs the ARM binary. If it fails:

```shell
# Manual cross-compile on Server B
ssh [email protected] "cd ~/0k-sync && cargo install cross 2>/dev/null || true && \
  cross build --target aarch64-unknown-linux-gnu -p zerok-sync-cli --release"

# Manual SCP to Server C
```

### Tests Hang
The most likely cause is a relay that’s not responding. Check health:

```shell
for port in 8090 8091 8092; do
  echo -n "Relay :$port — "
  curl -s "http://localhost:$port/health" || echo "DOWN"
  echo
done
```

If a relay is down after a `kill_relay` test, restart it:

```shell
ssh [email protected] "docker compose -f tests/chaos/docker-compose.distributed.yml \
  -p dist-chaos up -d --wait relay-1"
```

### Stale iptables Rules
If a test crashed mid-partition, remove the stale iptables rules on Server B manually.

### Stale netem Rules

```shell
# On a relay container
ssh [email protected] "docker exec dist-chaos-relay-1-1 tc qdisc del dev eth0 root 2>/dev/null; echo done"

# On Server C (interface name may differ)
ssh [email protected] "sudo tc qdisc del dev eth0 root 2>/dev/null; echo done"
```

## 9. File Reference
### Test Code

| File | Purpose |
|---|---|
| `tests/chaos/src/distributed/mod.rs` | Module root |
| `tests/chaos/src/distributed/ssh.rs` | SSH execution primitives (SshTarget, exec, scp) |
| `tests/chaos/src/distributed/config.rs` | Machine IPs, paths, ports, timeouts |
| `tests/chaos/src/distributed/harness.rs` | DistributedHarness orchestrator |
| `tests/chaos/src/scenarios/distributed.rs` | 16 scenario tests (MR, CM, EDGE, NET, CONV) |
### Infrastructure

| File | Purpose |
|---|---|
| `tests/chaos/docker-compose.distributed.yml` | 3-relay + client topology for Server B |
| `tests/chaos/Dockerfile.relay` | Relay Docker image (shared with single-host tests) |
| `tests/chaos/Dockerfile.cli` | CLI Docker image (for client container) |
### Relay Observability

| Endpoint | URL (relay-1) | Format |
|---|---|---|
| Health | `http://10.0.1.2:8090/health` | JSON |
| Metrics | `http://10.0.1.2:8090/metrics` | Prometheus text |
| Logs | `docker compose -p dist-chaos logs relay-1` | Text (tracing fmt) |
### Machine Reference

| Machine | Mesh IP | Role |
|---|---|---|
| Server A | 10.0.1.1 | Test orchestrator (macOS) |
| Server B | 10.0.1.2 | Relay host (Linux, 91GB RAM) |
| Server C | 10.0.1.3 | Edge device (Raspberry Pi, ARM) |
See also:
- Chaos Testing Strategy — Single-host chaos testing strategy (68 scenarios)
Version: 1.0.0 | Date: 2026-02-07