Operator Runbook

Name: LYDOS Agent OS
Author: LYDOS

Reference procedures for operating the LYDOS server: startup, health verification, incident response, log access, and backup. Keep this page bookmarked for on-call use.

Server restart

source venv, then python3 server.py from the project root.

Health check

curl http://localhost:8888/api/health — expect score 90+.

API not responding

Check process, check port 8888, check .env for required keys.

Backup

config/, hedefler/, ~/.config/lydos/ — tar and encrypt offsite.

Server Restart

The LYDOS server is a single FastAPI process on port 8888. To restart it, activate the virtual environment and run server.py from the project root. The server loads all modules, registers Q-engine routers, and initialises agents at startup — this takes 2–5 seconds on a typical machine.

# 1. Activate the virtual environment
source ~/.ailydian-venv/bin/activate

# 2. Navigate to the project root
cd ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR

# 3. Start the server (runs in foreground — Ctrl+C to stop)
python3 server.py

# Expected startup output (abridged):
# INFO:     LYDOS Agent OS v12.0.0 starting on http://0.0.0.0:8888
# INFO:     Loaded 29 modules
# INFO:     109 agents registered
# INFO:     MCP server: 162 tools available
# INFO:     Application startup complete.
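
Because startup takes a few seconds, a script that starts the server and immediately queries it can fail spuriously. The following is a minimal sketch of a readiness poll against the health endpoint named in this runbook; the function names, retry count, and delay are illustrative assumptions, not part of LYDOS.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

HEALTH_URL = "http://localhost:8888/api/health"  # endpoint named in this runbook


def fetch_health(url: str = HEALTH_URL, timeout: float = 2.0) -> dict:
    """Fetch and decode the health payload; raises OSError if unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 10,   # hypothetical default, tune to your machine
    delay: float = 1.0,   # seconds between attempts
) -> Optional[dict]:
    """Poll until the health endpoint answers; return its payload or None."""
    for attempt in range(attempts):
        try:
            return fetch()
        except (OSError, ValueError):
            # Connection refused or invalid JSON while the server is still loading
            if attempt < attempts - 1:
                time.sleep(delay)
    return None
```

Calling wait_until_ready() with no arguments polls the live server; passing a custom fetch callable makes the loop easy to exercise without one.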

Background service (systemd)

For persistent operation without a terminal, install the systemd unit. The install script creates the unit file and enables it on boot.

# Install as a systemd service (run once)
bash scripts/install_service.sh

# Check service status
systemctl status lydos

# Start / stop / restart
systemctl start lydos
systemctl stop lydos
systemctl restart lydos

# Follow logs
journalctl -u lydos -f

Auto-start via session-start hook

When working inside an AI IDE (Claude Code, Cursor, etc.), the hooks/session-start.sh hook starts the server automatically at the beginning of each session. The hook also writes the current health status to .lydos_status.json for fast offline reads.

# Manually run the session-start hook (useful for debugging)
bash hooks/session-start.sh

# Read the cached status without hitting the live server
cat .lydos_status.json | python3 -m json.tool
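
A cached status is only as fresh as the last hook run, so a reader may want to know how old it is. The sketch below reads the file and flags it as stale based on its modification time. The one-hour threshold and the function name are assumptions for illustration, and the code makes no assumption about which fields the file contains.

```python
import json
import time
from pathlib import Path
from typing import Optional, Tuple


def read_cached_status(
    path: Path, max_age_seconds: float = 3600.0  # hypothetical staleness threshold
) -> Tuple[Optional[dict], bool]:
    """Return (status, is_stale); status is None if the file is missing or unreadable."""
    try:
        age = time.time() - path.stat().st_mtime
        status = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing file, permission problem, or invalid JSON
        return None, True
    return status, age > max_age_seconds
```

For example, read_cached_status(Path(".lydos_status.json")) run from the project root returns the parsed status and whether it is older than the threshold.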

Health Check

The health endpoint runs a 29-module diagnostic and returns a composite score out of 100. A score of 90+ means the system is fully operational. The one expected degraded module is analysis_api (score 70) when no Analysis Provider key is configured — this is normal and does not affect core operations.

# Full health check — human-readable
curl -s http://localhost:8888/api/health | python3 -m json.tool

# Minimal check — just the score
# (the Python snippet is single-quoted so the shell does not strip its double quotes)
curl -s http://localhost:8888/api/health | python3 -c '
import sys, json
d = json.load(sys.stdin)
print("Score: {}/100  Status: {}".format(d["score"], d["status"]))
print("Modules: {}/{} healthy".format(d["modules_healthy"], d.get("modules_total", 29)))
if d.get("modules_failed", 0) > 0:
    print("ALERT: failed modules:", d.get("failed_modules", []))
'

# Status endpoint (broader system overview)
curl -s http://localhost:8888/api/status | python3 -m json.tool

# Via lydos CLI
lydos health
lydos health --json

Expected healthy response (abridged):

{
  "status": "excellent",
  "score": 94,
  "modules_healthy": 28,
  "modules_degraded": 1,
  "modules_failed": 0,
  "modules_total": 29,
  "total_agents": 109,
  "active_agents": 109,
  "degraded_modules": ["analysis_api"],
  "uptime_seconds": 14400,
  "version": "12.0.0"
}
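
When the health check is scripted, it helps to separate the expected degradation (analysis_api) from anything that needs attention. The following sketch reduces a payload of the shape shown above to a verdict. The field names come from the examples in this runbook; the function name and the returned structure are illustrative assumptions.

```python
import json

EXPECTED_DEGRADED = {"analysis_api"}  # degraded by design when no Analysis Provider key is set
HEALTHY_SCORE = 90                    # threshold stated in this runbook


def summarise_health(payload: dict) -> dict:
    """Reduce a /api/health payload to the facts an operator acts on."""
    degraded = set(payload.get("degraded_modules", []))
    failed = list(payload.get("failed_modules", []))
    return {
        "ok": payload.get("score", 0) >= HEALTHY_SCORE and not failed,
        "unexpected_degraded": sorted(degraded - EXPECTED_DEGRADED),
        "failed": failed,
    }


# The abridged healthy response from this section
sample = json.loads("""
{
  "status": "excellent", "score": 94,
  "modules_healthy": 28, "modules_degraded": 1, "modules_failed": 0,
  "modules_total": 29, "degraded_modules": ["analysis_api"]
}
""")
print(summarise_health(sample))
# prints {'ok': True, 'unexpected_degraded': [], 'failed': []}
```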

API Not Responding

If curl http://localhost:8888/api/health fails or hangs, work through this checklist in order.

1. Check if the process is running

# Check for the server process
ps aux | grep "python3 server.py" | grep -v grep

# If no output, the server is not running — start it
source ~/.ailydian-venv/bin/activate
cd ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR
python3 server.py &

# Check the process started
ps aux | grep "python3 server.py" | grep -v grep

2. Check the port

# Verify something is listening on port 8888
ss -tlnp | grep 8888
# Expected: LISTEN  0  128  0.0.0.0:8888  ...

# Or with lsof
lsof -i :8888

# If port is in use by another process, find it and kill it
fuser 8888/tcp
fuser -k 8888/tcp  # Force-kill the process holding port 8888
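
Where ss or lsof is not installed, the same question (is anything accepting connections on port 8888?) can be answered from Python's standard library. This is a minimal sketch; the function name and defaults are illustrative, and it only shows that something is listening, not that it is the LYDOS server.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8888, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False
```

If port_is_open() returns True but the health endpoint still fails, another process is probably holding the port.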

3. Check the .env file

# Verify required variables are set
grep -E "^(PRIMARY_API_KEY|BILINGUAL_API_KEY)" .env

# If .env is missing or empty, copy from the example
cp .env.example .env
# Then add your API keys

# Check the server can read .env
python3 -c "import dotenv; dotenv.load_dotenv(); import os; print(os.getenv('PRIMARY_API_KEY', 'MISSING')[:8] + '...')"
# Expected: gsk_abcd...  (first 8 chars of your key)
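
The grep above prints the key values to the terminal. Where that is undesirable (screen sharing, pasted logs), a check that reports only which keys are missing may be preferable. The sketch below parses .env-style text by hand; it assumes simple KEY=value lines and does not try to replicate every rule of the dotenv parser.

```python
from typing import Iterable, List

REQUIRED_KEYS = ("PRIMARY_API_KEY", "BILINGUAL_API_KEY")  # the keys checked in this step


def missing_env_keys(env_text: str, required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return required keys that are absent or empty in .env-style text."""
    found = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        found[key.strip()] = value.strip().strip("'\"")
    return [k for k in required if not found.get(k)]
```

Typical use is missing_env_keys(open(".env").read()); an empty list means both keys are present and non-empty.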

4. Check the venv

# Verify the venv is activated and intact
which python3
# Expected: /home/user/.ailydian-venv/bin/python3

# If not activated
source ~/.ailydian-venv/bin/activate

# If venv is missing or broken, recreate it
python3 -m venv ~/.ailydian-venv
source ~/.ailydian-venv/bin/activate
pip install -r requirements.txt

5. Check the startup logs

# Run in foreground to see all startup errors
source ~/.ailydian-venv/bin/activate
cd ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR
python3 server.py 2>&1 | head -60

# Look for lines containing "ERROR" or "CRITICAL"
python3 server.py 2>&1 | grep -E "(ERROR|CRITICAL|ImportError|ModuleNotFound)"

# Run with debug logging
LOG_LEVEL=DEBUG python3 server.py

Common Issues

Port 8888 already in use

Run: fuser -k 8888/tcp — this kills whatever is holding the port. Then restart the server.

fuser -k 8888/tcp && python3 server.py

Missing PRIMARY_API_KEY

The server starts but LLM calls fail with status 503. Add your Groq API key to .env and restart.

echo "PRIMARY_API_KEY=gsk_your_key_here" >> .env

Virtual environment not activated

An ImportError on startup usually means the venv is not active, or that dependencies are not installed in it. Always activate the venv before running server.py.

source ~/.ailydian-venv/bin/activate

database locked (SQLite WAL error)

Another server process has the embedded database locked. Kill the other process and restart cleanly.

pkill -f 'python3 server.py' && sleep 1 && python3 server.py

Health score drops below 80

Check /api/health for the list of failed_modules. Each module has a /api/<module>/health endpoint for detailed diagnostics.

curl -s http://localhost:8888/api/health | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('failed_modules',[]))"

MCP server not connecting

Verify the venv path in .mcp.json matches your actual venv. Run the MCP server manually to check for errors.

~/.ailydian-venv/bin/python3 core/infrastructure/mcp_server.py --transport stdio

Incident Severity Matrix

Use this matrix to classify incidents and decide the appropriate response speed. LYDOS is a local development tool — there are no SLAs — but tracking severity helps prioritise when multiple issues arise.

Severity	Condition	Impact	Response
P1	Server completely down — port 8888 not reachable	All agents unavailable, MCP disconnected	Immediate restart. Check process, port, venv, .env.
P2	Server up but health score < 80 or 2+ modules failed	Degraded agent performance, some engines unavailable	Identify failed modules via /api/health, check per-module logs.
P3	Single non-critical engine failed (e.g. analysis_api)	One provider or engine unavailable, fallback active	Check API key for that provider. Non-urgent — investigate at next opportunity.
P4	Cosmetic issues: slow response, UI glitch, log noise	No functional impact	Log the issue, investigate during regular maintenance window.
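
The matrix can be expressed as a small decision function, which is useful if health checks feed a notification script. The sketch below follows the rows above. Two points are assumptions, because the matrix does not cover them: analysis_api is treated as the only non-critical module, and a single failed critical module is escalated to P2.

```python
from typing import Optional, Sequence

NON_CRITICAL = {"analysis_api"}  # assumption: only the example named in the matrix


def classify_incident(
    reachable: bool,
    score: int,
    failed_modules: Sequence[str],
    cosmetic_issue: bool = False,
) -> Optional[str]:
    """Map observed state to a severity from the matrix, or None if nothing is wrong."""
    if not reachable:
        return "P1"  # server down, port 8888 not reachable
    if score < 80 or len(failed_modules) >= 2:
        return "P2"
    if len(failed_modules) == 1:
        # P3 only for a non-critical engine; escalating otherwise is an assumption
        return "P3" if failed_modules[0] in NON_CRITICAL else "P2"
    return "P4" if cosmetic_issue else None
```

For example, classify_incident(True, 92, ["analysis_api"]) returns "P3", matching the third row of the matrix.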

Log Locations

LYDOS writes structured JSON logs via Python logging. By default logs go to stdout, which systemd captures into the journal. The CLI also maintains its own operation log.

Location	Contents	How to read
`stdout / systemd journal`	Server startup, request routing, module errors	journalctl -u lydos -f (systemd) or terminal output
`~/.config/lydos/logs/cli.log`	CLI commands, auth events, agent task invocations	tail -f ~/.config/lydos/logs/cli.log
`~/.config/lydos/audit.jsonl`	Q48 Kavach audit trail: every approved/rejected action, one JSON object per line	python3 -m json.tool --json-lines ~/.config/lydos/audit.jsonl | head -40 (Python 3.8+)
`.lydos/history/`	Per-task agent run logs (one JSON file per task)	ls .lydos/history/ | sort -r | head -5
`.lydos_status.json`	Cached health status written by session-start hook	cat .lydos_status.json | python3 -m json.tool
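
Because audit.jsonl holds one JSON object per line, it has to be parsed line by line rather than as a single document. The following sketch returns the most recent records from such a file. It makes no assumption about the fields inside each record, and the function name is illustrative.

```python
import json
from collections import deque
from pathlib import Path
from typing import List


def tail_jsonl(path: Path, n: int = 5) -> List[dict]:
    """Parse the last n non-blank lines of a JSON Lines file, dropping corrupt ones."""
    last_lines = deque(maxlen=n)  # keeps only the n most recent non-blank lines
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                last_lines.append(line)
    records = []
    for line in last_lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # a partially written final line is possible while the server runs
    return records
```

A call such as tail_jsonl(Path.home() / ".config/lydos/audit.jsonl", 10) yields up to ten of the latest audit entries as dictionaries.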

Changing the log level

# Set log level via environment variable before starting
LOG_LEVEL=DEBUG python3 server.py

# Or add to .env for permanent change
echo "LOG_LEVEL=DEBUG" >> .env

# Filter server logs to errors only (useful in production)
LOG_LEVEL=ERROR python3 server.py

Backup and Restore

LYDOS stores state in three locations: configuration files, the goal store (hedefler/), and memory files. Back up all three to ensure full recovery after a disk failure or migration.

What to back up

Path	Contents	Frequency
`config/`	agents.yaml, kernel.yaml, master_prompt.md	On every change
`hedefler/`	Goal storage — long-term goals, sprints, milestones	Daily
`~/.config/lydos/`	auth.json, config.yaml, CLI logs, audit trail	Daily (exclude logs if large)
`.env`	API keys — store ENCRYPTED in a password manager	On every change — encrypt before storing
`.lydos/`	Project config, agent run history	Weekly or on project milestones

Backup script

backup.sh

#!/usr/bin/env bash
# Minimal LYDOS state backup script
set -euo pipefail

BACKUP_DIR="$HOME/lydos-backup-$(date +%Y%m%d-%H%M%S)"
LYDOS_ROOT="$HOME/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR"

mkdir -p "$BACKUP_DIR"

# Config files (no secrets)
cp -r "$LYDOS_ROOT/config/" "$BACKUP_DIR/config/"

# Goal storage
cp -r "$LYDOS_ROOT/hedefler/" "$BACKUP_DIR/hedefler/"

# CLI auth and global config (contains API key, so the archive is encrypted below)
cp -r "$HOME/.config/lydos/" "$BACKUP_DIR/lydos-config/"

# Archive with relative paths so the tarball unpacks to lydos-backup-<timestamp>/
tar czf "$BACKUP_DIR.tar.gz" -C "$(dirname "$BACKUP_DIR")" "$(basename "$BACKUP_DIR")"

# Encrypt symmetrically with the passphrase file (no GPG key pair is needed).
# GnuPG 2.1+ needs loopback pinentry to read a passphrase file in batch mode.
gpg --symmetric --batch --pinentry-mode loopback \
    --passphrase-file "$HOME/.lydos-backup-passphrase" "$BACKUP_DIR.tar.gz"
rm -rf "$BACKUP_DIR" "$BACKUP_DIR.tar.gz"

echo "Backup created: $BACKUP_DIR.tar.gz.gpg"

Restore procedure

# 1. Stop the server
systemctl stop lydos  # or Ctrl+C if running in terminal

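# 2. Decrypt the backup and unpack it into ./backup
gpg --decrypt lydos-backup-20260328-120000.tar.gz.gpg > backup.tar.gz
mkdir -p backup
tar xzf backup.tar.gz -C backup

# The archive holds a lydos-backup-<timestamp>/ directory; locate it wherever it unpacked
SRC="$(find backup -type d -name 'lydos-backup-*' -print -quit)"

# 3. Restore config files (the trailing /. copies contents without nesting config/config)
mkdir -p ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR/config
cp -r "$SRC/config/." ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR/config/

# 4. Restore goal storage
mkdir -p ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR/hedefler
cp -r "$SRC/hedefler/." ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR/hedefler/

# 5. Restore global config
mkdir -p ~/.config/lydos
cp -r "$SRC/lydos-config/." ~/.config/lydos/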

# 6. Restore .env (from your password manager — do not store in backup as plaintext)
# Re-create .env manually with your API keys

# 7. Restart the server and verify
source ~/.ailydian-venv/bin/activate
cd ~/Masaüstü/AILYDIAN-AGENT-ORCHESTRATOR
python3 server.py &
sleep 3
curl -s http://localhost:8888/api/health | python3 -m json.tool

NOTE

The embedded SQLite database is stored at data/lydos.db (created at first run). Before copying it, run a WAL checkpoint so that changes still held in the write-ahead log are written into the main database file: sqlite3 data/lydos.db "PRAGMA wal_checkpoint(FULL);"
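
An alternative to a checkpoint followed by a file copy is SQLite's online backup API, which Python exposes as sqlite3.Connection.backup (Python 3.7+). This is a different technique from the one in the note, shown here as a sketch; the function name and destination path are illustrative assumptions.

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(src: Path, dest: Path) -> None:
    """Copy a SQLite database to dest using the online backup API."""
    if not src.is_file():
        # sqlite3.connect would otherwise silently create an empty database
        raise FileNotFoundError(src)
    source = sqlite3.connect(str(src))
    target = sqlite3.connect(str(dest))
    try:
        # Copies a consistent snapshot, including committed pages still in the WAL
        source.backup(target)
    finally:
        target.close()
        source.close()
```

For example, snapshot_sqlite(Path("data/lydos.db"), Path("lydos-snapshot.db")) produces a standalone copy that can be added to the backup directory before it is archived.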
