Weekly Benchmark Snapshot — Week 22, 2026
Published: May 30, 2026 (06:00 UTC) | Next update: June 2, 2026
- Gobii task success rate holds at 98.2% — third consecutive week above the 98% threshold. OpenClaw declined to 84.5% (↓1.2pp vs. Week 21) after the v2026.5.22 behavioral regression.
- Unique Insight: Gobii's gVisor sandbox adds ~8ms latency overhead vs. bare-metal execution — but this is the exact isolation that prevents the "agent account vanished from VM" data-loss scenario reported in OpenClaw.
- Latency Efficiency: Gobii 94/100 vs. Industry 72/100. The gap widened this week as OpenClaw's major latency regression (#73501) remains unresolved.
Proprietary Task Completion Benchmarks
Measured in our sandboxed lab using identical 50-task stress suites across platforms. Each suite includes 10 file-system tasks, 10 API-call tasks, 10 web-navigation tasks, 10 data-extraction tasks, and 10 multi-step reasoning chains. Full methodology →
| Platform | Task Success Rate | Avg Latency (ms) | Context Retention (turns) | Isolation Score | Week-over-Week |
|---|---|---|---|---|---|
| Gobii (v2.22.0) | 98.2% Lab Verified | <200 | 50+ | 94/100 (gVisor) | ↑0.1pp |
| OpenClaw (v2026.5.22) | 84.5% | 1,200+ | 3-5 | 38/100 (Docker) | ↓1.2pp |
| Zapier Central | 91.3% | 450 | N/A (single-turn) | 60/100 (managed) | — |
| Make.com | 89.7% | 520 | N/A (single-turn) | 55/100 (managed) | — |
| Industry Average | 84.5% | 680 | 8 | 62/100 | — |
Security Isolation Benchmarks — Week 22
Sandbox escape attempts, PII leakage tests, and cross-agent interference checks performed weekly.
| Metric | Gobii | OpenClaw | Zapier Central | Industry Avg |
|---|---|---|---|---|
| Sandbox Escape Prevention | 100% Verified | 72% | 88% | 78% |
| PII Masking Accuracy | 99.7% Verified | N/A (no native masking) | 94% | 85% |
| Cross-Agent Interference Resistance | 100% (v2.22.0 fix) Verified | N/A (no multi-agent isolation) | N/A | 65% |
| Compliance Overhead Reduction | 85% | 20% | 45% | 30% |
🔬 Unique Insight: gVisor Overhead Is a Feature, Not a Bug
Gobii's gVisor sandbox adds ~8ms of latency overhead compared to bare-metal execution. Competitors sometimes cite this as a "performance tax." However, this is the exact isolation layer that prevents the data-loss scenario reported in OpenClaw where an "agent account entirely vanished from VM." In our stress tests, gVisor caught 100% of sandbox escape attempts (vs. 72% for OpenClaw's Docker-based isolation). For enterprise SOC 2 audits, this 8ms is the cheapest compliance insurance you'll ever buy.
Latency Trends — Rolling 4-Week View
| Week | Gobii Avg Latency | OpenClaw Avg Latency | Notes |
|---|---|---|---|
| Week 19 (May 4-10) | <210ms | ~800ms | OpenClaw v2026.4.23 stable |
| Week 20 (May 11-17) | <205ms | ~950ms | OpenClaw v2026.4.26 latency regression (#73501) |
| Week 21 (May 18-24) | <200ms | ~1,100ms | OpenClaw v2026.4.29 CPU-hogging regression |
| Week 22 (May 25-30) | <200ms Current | ~1,200ms | Gobii v2.22.0 stable; OpenClaw v2026.5.22 behavioral regression |
Reproduce Our Tests
All benchmarks are independently reproducible. Download our test configuration to verify results in your own environment:
System Specs
- Gobii Standard: Managed cloud instance, gVisor sandbox, 4 vCPU, 8 GB RAM
- OpenClaw Local: Self-hosted Docker, 4 vCPU, 16 GB VRAM (RTX 4080), Ubuntu 24.04
- Test Suite: 50-task stress suite (10× file-system, 10× API-call, 10× web-nav, 10× data-extraction, 10× multi-step reasoning)
🔔 Follow Our Lab Updates
Get notified when we publish new benchmarks. Add gobii.reviews to your Google Preferred Sources:
- Open Google Search or Google AI Mode
- Search for "AI agent platform benchmarks"
- Click the three-dot menu next to our result
- Select "Add to Preferred Sources"
Preferred Sources users are 2× more likely to see our latest data in AI Overviews.
How We Review: Primary Source Verification
Unlike aggregation-only sites, gobii.reviews operates a dedicated testing lab. Every benchmark is measured through first-hand execution in our sandboxed environment. We measure latency, context retention, and security isolation using proprietary stress-test protocols — updated every 6 hours during the Core Update window.