
Operator Runbook

Reference procedures for operating the LYDOS server: startup, health verification, incident response, log access, and backup. Keep this page bookmarked for on-call use.

Server restart: source the venv, then run python3 server.py from the project root.
Health check: curl http://localhost:8888/api/health and expect a score of 90+.
API not responding: check the process, port 8888, and .env for required keys.
Backup: tar config/, hedefler/, and ~/.config/lydos/, then encrypt and store offsite.

Server Restart

The LYDOS server is a single FastAPI process on port 8888. To restart it, activate the virtual environment and run server.py from the project root. The server loads all modules, registers Q-engine routers, and initialises agents at startup — this takes 2–5 seconds on a typical machine.

terminal (bash)
# 1. Activate the virtual environment
source ~/.ailydian-venv/bin/activate

# 2. Navigate to the project root
cd ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR

# 3. Start the server (runs in foreground — Ctrl+C to stop)
python3 server.py

# Expected startup output (abridged):
# INFO:     LYDOS Agent OS v12.0.0 starting on http://0.0.0.0:8888
# INFO:     Loaded 29 modules
# INFO:     109 agents registered
# INFO:     MCP server: 162 tools available
# INFO:     Application startup complete.
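When the server is started from a script rather than by hand, the script usually has to wait out the 2–5 second startup before calling the API. The following is a minimal sketch of such a wait loop. It assumes only the /api/health endpoint described in this runbook; the names `fetch_health` and `wait_for_server`, the attempt count, and the delay are illustrative choices, not part of LYDOS.

```python
import json
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8888/api/health"  # endpoint named in this runbook


def fetch_health(url=HEALTH_URL, timeout=2.0):
    """Return the parsed health payload, or None if the server is not reachable yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body
        return None


def wait_for_server(fetch=fetch_health, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll fetch() until it returns a payload; give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        payload = fetch()
        if payload is not None:
            return payload
        if attempt < attempts:
            sleep(delay)  # hypothetical default: one second between polls
    return None
```

Calling `wait_for_server()` right after launching `python3 server.py &` returns the health payload once the server answers, or `None` if it never does. The `fetch` and `sleep` parameters exist so the loop can be exercised without a running server.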

Background service (systemd)

For persistent operation without a terminal, install the systemd unit. The install script creates the unit file and enables it on boot.

terminal (bash)
# Install as a systemd service (run once)
bash scripts/install_service.sh

# Check service status
systemctl status lydos

# Start / stop / restart
systemctl start lydos
systemctl stop lydos
systemctl restart lydos

# Follow logs
journalctl -u lydos -f

Auto-start via session-start hook

When working inside an AI IDE (Claude Code, Cursor, etc.), the hooks/session-start.sh hook starts the server automatically at the beginning of each session. The hook also writes the current health status to .lydos_status.json for fast offline reads.

terminal (bash)
# Manually run the session-start hook (useful for debugging)
bash hooks/session-start.sh

# Read the cached status without hitting the live server
cat .lydos_status.json | python3 -m json.tool
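Because .lydos_status.json is a cached snapshot, a script that reads it may want to know how old it is before trusting it. The sketch below returns the parsed file together with a freshness flag based on the file's modification time. The function name and the five-minute threshold are assumptions made for illustration; the runbook does not define a staleness limit or the file's fields.

```python
import json
import os
import time


def read_cached_status(path=".lydos_status.json", max_age_seconds=300, now=None):
    """Return (payload, is_fresh); payload is None if the file is missing or not valid JSON."""
    try:
        mtime = os.path.getmtime(path)
        with open(path, encoding="utf-8") as fh:
            payload = json.load(fh)
    except (OSError, ValueError):
        return None, False
    current = time.time() if now is None else now
    # max_age_seconds is a hypothetical threshold, not a LYDOS setting
    return payload, (current - mtime) <= max_age_seconds
```

If the flag comes back false, fall back to the live /api/health endpoint.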

Health Check

The health endpoint runs a 29-module diagnostic and returns a composite score out of 100. A score of 90+ means the system is fully operational. The one expected degraded module is analysis_api (score 70) when no Analysis Provider key is configured — this is normal and does not affect core operations.

terminal (bash)
# Full health check — human-readable
curl -s http://localhost:8888/api/health | python3 -m json.tool

# Minimal check: score, status, and any failed modules.
# The Python snippet is single-quoted for the shell, so it uses only double quotes internally.
curl -s http://localhost:8888/api/health | python3 -c '
import sys, json
d = json.load(sys.stdin)
print("Score: {}/100  Status: {}".format(d["score"], d["status"]))
print("Modules: {}/{} healthy".format(d["modules_healthy"], d.get("modules_total", 29)))
if d.get("modules_failed", 0) > 0:
    print("ALERT: failed modules:", d.get("failed_modules", []))
'

# Status endpoint (broader system overview)
curl -s http://localhost:8888/api/status | python3 -m json.tool

# Via lydos CLI
lydos health
lydos health --json

Expected healthy response (abridged):

GET /api/health: healthy response (JSON)
{
  "status": "excellent",
  "score": 94,
  "modules_healthy": 28,
  "modules_degraded": 1,
  "modules_failed": 0,
  "modules_total": 29,
  "total_agents": 109,
  "active_agents": 109,
  "degraded_modules": ["analysis_api"],
  "uptime_seconds": 14400,
  "version": "12.0.0"
}
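A monitoring script can reduce this payload to the two questions an operator cares about: is the system fully operational, and is anything degraded that should not be? The sketch below applies the rules stated in this section (a score of 90+ is healthy, analysis_api is the one expected degraded module). The function name and the shape of the returned dictionary are illustrative assumptions.

```python
EXPECTED_DEGRADED = {"analysis_api"}  # degraded without an Analysis Provider key, per this runbook
HEALTHY_SCORE = 90                    # "a score of 90+ means fully operational"


def evaluate_health(payload, expected_degraded=EXPECTED_DEGRADED):
    """Summarise a /api/health payload for an alerting script."""
    failed = list(payload.get("failed_modules", []))
    degraded = set(payload.get("degraded_modules", []))
    # Anything degraded beyond the expected set deserves a closer look
    unexpected = sorted(degraded - set(expected_degraded))
    score = payload.get("score", 0)
    return {
        "operational": score >= HEALTHY_SCORE and not failed,
        "unexpected_degraded": unexpected,
        "failed": failed,
    }
```

For the sample response above, this yields `operational` as true with no unexpected degraded modules.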

API Not Responding

If curl http://localhost:8888/api/health fails or hangs, work through this checklist in order.

1. Check if the process is running

terminal (bash)
# Check for the server process
ps aux | grep "python3 server.py" | grep -v grep

# If no output, the server is not running — start it
source ~/.ailydian-venv/bin/activate
cd ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR
python3 server.py &

# Check the process started
ps aux | grep "python3 server.py" | grep -v grep

2. Check the port

terminal (bash)
# Verify something is listening on port 8888
ss -tlnp | grep 8888
# Expected: LISTEN  0  128  0.0.0.0:8888  ...

# Or with lsof
lsof -i :8888

# If port is in use by another process, find it and kill it
fuser 8888/tcp
fuser -k 8888/tcp  # Force-kill the process holding port 8888
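Where ss, lsof, or fuser are not installed, the same question (is anything accepting connections on port 8888?) can be answered from Python's standard library. This is a minimal sketch; the function name is illustrative. It only tells you that something is listening, not which process it is.

```python
import socket


def port_is_open(host="127.0.0.1", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False
```

If `port_is_open()` returns true while /api/health still fails, another program may be holding the port.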

3. Check the .env file

terminal (bash)
# Verify required variables are set
grep -E "^(PRIMARY_API_KEY|BILINGUAL_API_KEY)" .env

# If .env is missing or empty, copy from the example
cp .env.example .env
# Then add your API keys

# Check the server can read .env
python3 -c "import dotenv; dotenv.load_dotenv(); import os; print(os.getenv('PRIMARY_API_KEY', 'MISSING')[:8] + '...')"
# Expected: gsk_abcd...  (first 8 chars of your key)
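The grep above prints the key values to the terminal. If you would rather not echo secrets, the following sketch reports only which required keys are missing or empty. It assumes plain KEY=VALUE lines (no `export` prefix, no multi-line values), and the function name is illustrative.

```python
REQUIRED_KEYS = ("PRIMARY_API_KEY", "BILINGUAL_API_KEY")  # names taken from the grep above


def missing_env_keys(env_text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the given .env text."""
    present = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        present[key.strip()] = value.strip().strip("'\"")
    return [k for k in required if not present.get(k)]
```

Typical use is `missing_env_keys(open(".env").read())`; an empty list means both keys are set.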

4. Check the venv

terminal (bash)
# Verify the venv is activated and intact
which python3
# Expected: /home/user/.ailydian-venv/bin/python3

# If not activated
source ~/.ailydian-venv/bin/activate

# If venv is missing or broken, recreate it
python3 -m venv ~/.ailydian-venv
source ~/.ailydian-venv/bin/activate
pip install -r requirements.txt
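The same check can be made from inside the interpreter, which is useful when a wrapper script launches the server. The sketch below reports whether the running Python is inside a virtual environment and which packages from a given list cannot be imported. Both function names are illustrative; fastapi is used as the example in the usage note only because this runbook describes the server as a FastAPI process.

```python
import importlib.util
import sys


def venv_status():
    """Return (active, prefix): whether this interpreter runs inside a virtual environment."""
    active = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    return active, sys.prefix


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Running `missing_packages(["fastapi"])` with the venv active should return an empty list; a non-empty result suggests re-running pip install -r requirements.txt.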

5. Check the startup logs

terminal (bash)
# Run in foreground to see all startup errors
source ~/.ailydian-venv/bin/activate
cd ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR
python3 server.py 2>&1 | head -60

# Look for lines containing "ERROR" or "CRITICAL"
python3 server.py 2>&1 | grep -E "(ERROR|CRITICAL|ImportError|ModuleNotFound)"

# Run with debug logging
LOG_LEVEL=DEBUG python3 server.py

Common Issues

Port 8888 already in use
Run: fuser -k 8888/tcp — this kills whatever is holding the port. Then restart the server.
fuser -k 8888/tcp && python3 server.py
Missing PRIMARY_API_KEY
The server starts but LLM calls fail with status 503. Add your Groq API key to .env and restart.
echo "PRIMARY_API_KEY=gsk_your_key_here" >> .env
Virtual environment not activated
An ImportError on startup usually means the venv is not active. Always activate it before running server.py.
source ~/.ailydian-venv/bin/activate
database locked (SQLite WAL error)
Another server process has the embedded database locked. Kill the other process and restart cleanly.
pkill -f 'python3 server.py' && sleep 1 && python3 server.py
Health score drops below 80
Check /api/health for the list of failed_modules. Each module has a /api/<module>/health endpoint for detailed diagnostics.
curl -s http://localhost:8888/api/health | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('failed_modules',[]))"
MCP server not connecting
Verify the venv path in .mcp.json matches your actual venv. Run the MCP server manually to check for errors.
~/.ailydian-venv/bin/python3 core/infrastructure/mcp_server.py --transport stdio

Incident Severity Matrix

Use this matrix to classify incidents and decide the appropriate response speed. LYDOS is a local development tool — there are no SLAs — but tracking severity helps prioritise when multiple issues arise.

| Severity | Condition | Impact | Response |
| --- | --- | --- | --- |
| P1 | Server completely down; port 8888 not reachable | All agents unavailable, MCP disconnected | Immediate restart. Check process, port, venv, .env. |
| P2 | Server up but health score < 80 or 2+ modules failed | Degraded agent performance, some engines unavailable | Identify failed modules via /api/health, check per-module logs. |
| P3 | Single non-critical engine failed (e.g. analysis_api) | One provider or engine unavailable, fallback active | Check API key for that provider. Non-urgent; investigate at next opportunity. |
| P4 | Cosmetic issues: slow response, UI glitch, log noise | No functional impact | Log the issue, investigate during regular maintenance window. |
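The first three rows of the matrix can be derived mechanically from what the health endpoint reports. The sketch below encodes them as a function. It is an illustration, not a LYDOS feature: it treats any single failed module as non-critical (the matrix does not say which modules are critical), and it cannot detect P4, which covers cosmetic issues the health endpoint does not report.

```python
def classify_incident(reachable, score=None, failed_modules=()):
    """Map observed state to P1, P2, or P3 per the matrix; return None when no row matches."""
    if not reachable:
        return "P1"  # port 8888 not reachable
    failed = list(failed_modules)
    if (score is not None and score < 80) or len(failed) >= 2:
        return "P2"  # low score or two or more failed modules
    if len(failed) == 1:
        return "P3"  # assumption: a single failed module is non-critical
    return None
```

For example, `classify_incident(True, 92, ["analysis_api"])` falls in the P3 row, while an unreachable server is P1 regardless of the other arguments.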

Log Locations

LYDOS writes structured JSON logs via Python logging. By default logs go to stdout, which systemd captures into the journal. The CLI also maintains its own operation log.

| Location | Contents | How to read |
| --- | --- | --- |
| stdout / systemd journal | Server startup, request routing, module errors | journalctl -u lydos -f (systemd) or terminal output |
| ~/.config/lydos/logs/cli.log | CLI commands, auth events, agent task invocations | tail -f ~/.config/lydos/logs/cli.log |
| ~/.config/lydos/audit.jsonl | Q48 Kavach audit trail: every approved/rejected action | python3 -m json.tool --json-lines ~/.config/lydos/audit.jsonl |
| .lydos/history/ | Per-task agent run logs (one JSON file per task) | ls -r .lydos/history/ |
| .lydos_status.json | Cached health status written by session-start hook | python3 -m json.tool .lydos_status.json |
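The audit trail is a JSON Lines file (one JSON object per line), so it has to be parsed line by line rather than as a single document. The sketch below returns the most recent records from any iterable of lines. The function name is illustrative, and no assumption is made about the fields inside each audit record.

```python
import json
from collections import deque


def tail_jsonl(lines, count=5):
    """Parse JSON Lines input and return the last `count` records, skipping blank or malformed lines."""
    records = deque(maxlen=count)  # keeps only the newest `count` entries
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except ValueError:
            continue  # e.g. a partially written final line
    return list(records)
```

To use it on the real file, open ~/.config/lydos/audit.jsonl (after expanding the path with `os.path.expanduser`) and pass the file object as `lines`.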

Changing the log level

terminal (bash)
# Set log level via environment variable before starting
LOG_LEVEL=DEBUG python3 server.py

# Or add to .env for permanent change
echo "LOG_LEVEL=DEBUG" >> .env

# Filter server logs to errors only (useful in production)
LOG_LEVEL=ERROR python3 server.py

Backup and Restore

LYDOS stores state in several locations, chiefly configuration files, the goal store (hedefler/), and memory files. Back up every path listed below to ensure full recovery after a disk failure or migration.

What to back up

| Path | Contents | Frequency |
| --- | --- | --- |
| config/ | agents.yaml, kernel.yaml, master_prompt.md | On every change |
| hedefler/ | Goal storage: long-term goals, sprints, milestones | Daily |
| ~/.config/lydos/ | auth.json, config.yaml, CLI logs, audit trail | Daily (exclude logs if large) |
| .env | API keys; store encrypted in a password manager | On every change; encrypt before storing |
| .lydos/ | Project config, agent run history | Weekly or on project milestones |

Backup script

backup.sh (bash)
#!/usr/bin/env bash
# Minimal LYDOS state backup script
set -euo pipefail

BACKUP_DIR="$HOME/lydos-backup-$(date +%Y%m%d-%H%M%S)"
LYDOS_ROOT="$HOME/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR"

mkdir -p "$BACKUP_DIR"

# Config files (no secrets)
cp -r "$LYDOS_ROOT/config/" "$BACKUP_DIR/config/"

# Goal storage
cp -r "$LYDOS_ROOT/hedefler/" "$BACKUP_DIR/hedefler/"

# CLI auth and global config (contains API key — encrypt this)
cp -r "$HOME/.config/lydos/" "$BACKUP_DIR/lydos-config/"

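# Archive with paths relative to the parent directory, so the tarball
# unpacks to lydos-backup-<timestamp>/ instead of an absolute-path tree
tar czf "$BACKUP_DIR.tar.gz" -C "$(dirname "$BACKUP_DIR")" "$(basename "$BACKUP_DIR")"

# Encrypt symmetrically (requires the passphrase file below; GnuPG 2.1+ needs loopback pinentry)
gpg --symmetric --batch --pinentry-mode loopback \
    --passphrase-file "$HOME/.lydos-backup-passphrase" "$BACKUP_DIR.tar.gz"
rm -rf "$BACKUP_DIR" "$BACKUP_DIR.tar.gz"

echo "Backup created: $BACKUP_DIR.tar.gz.gpg"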
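The paths stored inside the tarball determine the directory layout you get back at restore time, so it is worth being deliberate about them. The sketch below uses Python's tarfile module to show one way of controlling this: the `arcname` argument makes every member path start at the backup directory's own name rather than at its full filesystem path. The function names are illustrative and this is not part of backup.sh.

```python
import tarfile
from pathlib import Path


def archive_directory(src_dir, dest_tar):
    """Write src_dir to a gzip tarball whose member paths start at the directory's own name."""
    src = Path(src_dir)
    with tarfile.open(dest_tar, "w:gz") as tar:
        # arcname replaces the full source path with just the final directory name
        tar.add(str(src), arcname=src.name)
    return dest_tar


def list_members(tar_path):
    """Return the sorted member paths of a gzip tarball, for a quick layout check."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling `list_members` on an archive before encrypting it is a cheap way to confirm that it will unpack where the restore procedure expects.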

Restore procedure

terminal (bash)
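# 1. Stop the server
systemctl stop lydos  # or Ctrl+C if running in terminal

# 2. Decrypt the backup and unpack it into a scratch directory
gpg --decrypt lydos-backup-20260328-120000.tar.gz.gpg > backup.tar.gz
mkdir -p restore
tar xzf backup.tar.gz -C restore

# Locate the unpacked backup directory (its depth depends on how the archive was created)
SRC="$(find restore -type d -name 'lydos-backup-*' | head -n 1)"
LYDOS_ROOT="$HOME/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR"

# 3. Restore config files (the trailing /. copies contents and avoids a nested config/config/)
mkdir -p "$LYDOS_ROOT/config"
cp -r "$SRC/config/." "$LYDOS_ROOT/config/"

# 4. Restore goal storage
mkdir -p "$LYDOS_ROOT/hedefler"
cp -r "$SRC/hedefler/." "$LYDOS_ROOT/hedefler/"

# 5. Restore global config
mkdir -p "$HOME/.config/lydos"
cp -r "$SRC/lydos-config/." "$HOME/.config/lydos/"

# 6. Restore .env (from your password manager; do not store it in the backup as plaintext)
# Re-create .env manually with your API keys

# 7. Restart the server and verify
source ~/.ailydian-venv/bin/activate
cd "$LYDOS_ROOT"
python3 server.py &
sleep 3
curl -s http://localhost:8888/api/health | python3 -m json.tool

# 8. Once the restore is verified, remove the decrypted copies
rm -rf restore backup.tar.gz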
NOTE
The embedded SQLite database is stored at data/lydos.db (created at first run). Before copying it, ideally while the server is stopped, run a WAL checkpoint so that writes still held in the write-ahead log are flushed into the main database file: sqlite3 data/lydos.db "PRAGMA wal_checkpoint(FULL);"
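An alternative to checkpointing and then copying the file is SQLite's online backup API, which Python exposes as `Connection.backup()` (Python 3.7+). It copies a consistent snapshot through a database connection. The sketch below is one way to use it; the function name is illustrative, and whether this suits LYDOS depends on how the server opens the database, which this runbook does not describe.

```python
import os
import sqlite3


def backup_sqlite(src_path, dest_path):
    """Copy a SQLite database to dest_path using the online backup API."""
    if not os.path.exists(src_path):
        # sqlite3.connect would otherwise silently create an empty database here
        raise FileNotFoundError(src_path)
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # consistent snapshot, including committed content still in the WAL
    finally:
        dest.close()
        src.close()
    return dest_path
```

For example, `backup_sqlite("data/lydos.db", "lydos-copy.db")` writes a standalone copy that can be added to the backup directory before it is archived.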

Related Documentation