strands-multi-engineer-agent

provenance:github:ihsanyurttas/strands-multi-engineer-agent

WHAT THIS AGENT DOES

This agent helps businesses understand how different AI models perform when tackling the same engineering task. It runs the same project through various AI providers (like Claude, GPT-4o, and Llama) and compares their speed, cost, and the quality of their work. This allows teams to make informed decisions about which AI tools are most reliable and effective for their specific needs, ultimately improving efficiency and reducing potential errors. It's particularly useful for engineering teams exploring and evaluating different AI options.


USE CASES

Automation Testing Research

README

# strands-multi-engineer-agent

## Same workflow. Same task. Same tools. Different behavior.

Most LLM benchmarks measure models in isolation.

This project shows what actually happens inside a real engineering agent workflow.

The same task is executed across providers:
Claude, GPT-4o, and Llama 3.2.

Same inputs. Same system.

Different behavior.

- Some models are fast and decisive
- Some overthink and self-review heavily
- Some take 6x longer for the same outcome
- Some fail silently with no output or clear signal

That difference matters.

In agent systems, reliability matters more than raw capability.

Built on [AWS Strands](https://github.com/strands-agents/sdk-python). Swap the provider with one env var (`DEFAULT_PROVIDER`).

---

Not a production engineering agent.

A controlled experiment:
fixed workflow, fixed task, variable provider.

The goal is behavioral comparison — not building the best agent.

---

## Benchmark Output

After each run the agent prints a compact summary:

```
  Provider    openai
  Model       gpt-4o-mini
  Latency     38.5s
  Tool calls  16
  Tokens      4,821 in / 612 out

Done! Results saved to: eval/results/openai_gpt-4o-mini_20260316T163403Z.json
```

The full result is also written as a structured JSON file in `eval/results/`:

```
eval/results/
  anthropic_claude-sonnet-4-6_20260316T130602Z.json
  openai_gpt-4o-mini_20260316T155350Z.json
  ollama_llama3_20260316T161200Z.json
```

Every file captures the full run for one provider:

```json
{
  "provider": "openai",
  "model": "gpt-4o-mini",
  "total_elapsed_seconds": 51.87,
  "total_input_tokens": 4821,
  "total_output_tokens": 612,
  "confidence_score": 7.0,
  "phases": [
    { "phase": "inspect",     "elapsed_seconds": 8.46,  "input_tokens": null, "output_tokens": null },
    { "phase": "plan",        "elapsed_seconds": 3.85,  "input_tokens": null, "output_tokens": null },
    { "phase": "implement",   "elapsed_seconds": 36.55, "input_tokens": null, "output_tokens": null },
    { "phase": "self_review", "elapsed_seconds": 3.02,  "input_tokens": null, "output_tokens": null }
  ]
}
```

Run the same task across all three providers and compare the result files side by side
to see where they differ in speed, cost, and output quality.

Because every provider run produces the same schema, results can be compared programmatically across models and providers.
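
A minimal sketch of such a comparison, assuming the result files use the field names shown in the sample JSON above. The helper names `load_results` and `summarize` are hypothetical and are not part of the repository.

```python
import json
from pathlib import Path


def load_results(results_dir="eval/results"):
    """Load every result JSON file found in the directory."""
    return [json.loads(p.read_text()) for p in sorted(Path(results_dir).glob("*.json"))]


def summarize(results):
    """Return one comparison row per run, fastest run first."""
    rows = []
    for r in results:
        phases = r.get("phases") or []
        # The slowest phase shows where a provider spends most of its time.
        slowest = max(phases, key=lambda p: p["elapsed_seconds"], default=None)
        rows.append({
            "provider": r["provider"],
            "model": r["model"],
            "seconds": r["total_elapsed_seconds"],
            # Token counts may be null in a result file, so treat null as 0.
            "tokens": (r.get("total_input_tokens") or 0) + (r.get("total_output_tokens") or 0),
            "confidence": r.get("confidence_score"),
            "slowest_phase": slowest["phase"] if slowest else None,
        })
    return sorted(rows, key=lambda row: row["seconds"])


if __name__ == "__main__":
    for row in summarize(load_results()):
        print(f'{row["provider"]:<10} {row["model"]:<20} {row["seconds"]:>7.1f}s  '
              f'{row["tokens"]:>7} tok  slowest: {row["slowest_phase"]}')
```

Sorting by elapsed time is just one choice; the same rows could be ordered by token count or confidence score, depending on which trade-off matters for the comparison.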

---

## Architecture

```
strands-multi-engineer-agent/
├── agent/
│   ├── cli.py          # Typer CLI: run / list-tasks / doctor
│   ├── config.py       # Env-var validation + typed config (Pydantic)
│   ├── prompts.py      # Prompt templates per workflow phase
│   └── workflow.py     # Strands agent orchestration loop
├── providers/
│   ├── base_provider.py     # Abstract base + factory (get_strands_model)
│   └── provider_config.py   # Per-provider requirements documentation
├── tools/
│   ├── repo_reader.py   # list_files, read_file  (@tool)
│   ├── search_tools.py  # search_in_repo         (@tool)
│   ├── patch_writer.py  # write_file             (@tool, sandboxed)
│   └── test_runner.py   # run_tests              (@tool)
├── tasks/
│   ├── issues.yaml      # Sample engineering tasks
│   └── task_runner.py   # Load + dispatch tasks
├── eval/
│   ├── metrics.py       # Record + compare run results
│   ├── result_schema.py # Pydantic schema for WorkflowResult
│   └── results/         # JSON output per provider run (gitignored)
├── sample_repos/
│   ├── tiny_fastapi_app/   # Python target with deliberate gaps
│   └── tiny_node_service/  # Node.js target with deliberate gaps
└── content/
    └── medium_notes.md  # Notes
```

### Workflow phases

```
Issue description
       │
       ▼
  ┌─────────────┐
  │  1. Inspect  │  read_file, list_files, search_in_repo
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │  2. Plan     │  (reasoning only, no tools)
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │  3. Implement│  write_file
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │  4. Review   │  self_review (risks + confidence score)
  └──────┬──────┘
         │
         ▼
  WorkflowResult (JSON)
```
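
The phase sequence in the diagram can be sketched as a timed loop. This is an illustration only: `run_phases` and the stub handlers are hypothetical, and the real orchestration lives in `agent/workflow.py` and is not reproduced here.

```python
import time

PHASES = ["inspect", "plan", "implement", "self_review"]


def run_phases(handlers, issue):
    """Run each phase in order, passing every phase the previous phase's output."""
    context = issue
    records = []
    for name in PHASES:
        start = time.perf_counter()
        context = handlers[name](context)
        records.append({
            "phase": name,
            "elapsed_seconds": round(time.perf_counter() - start, 2),
        })
    return context, records


if __name__ == "__main__":
    # Stub handlers stand in for model calls; each one just tags the running context.
    stubs = {name: (lambda ctx, n=name: f"{ctx} -> {n}") for name in PHASES}
    final, records = run_phases(stubs, "Add validation to POST /items")
    print(final)
    for rec in records:
        print(rec)
```

Recording one entry per phase is what makes the per-phase `elapsed_seconds` values in the result file possible.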

### Provider abstraction

```python
# providers/base_provider.py
model = get_strands_model(config)   # ← only call the workflow needs

# The factory returns a Strands-compatible model object
# configured from environment variables, regardless of provider.
```
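
One way such a factory could be structured is a registry keyed by provider name. The sketch below is an assumption about the shape, not the repository's code: `ModelSpec`, `BUILDERS`, and `build_model_spec` are hypothetical names, and the returned object is a plain dataclass standing in for a Strands model so the example runs without the SDK installed. The default values are taken from the Environment Variables table.

```python
import os
from dataclasses import dataclass, field


@dataclass
class ModelSpec:
    """Hypothetical stand-in for a Strands model object."""
    provider: str
    model_id: str
    options: dict = field(default_factory=dict)


def _anthropic(env):
    return ModelSpec("anthropic", env.get("ANTHROPIC_MODEL", "claude-sonnet-4-6"),
                     {"max_tokens": int(env.get("ANTHROPIC_MAX_TOKENS", "4096"))})


def _openai(env):
    return ModelSpec("openai", env.get("OPENAI_MODEL", "gpt-4o"))


def _ollama(env):
    return ModelSpec("ollama", env.get("OLLAMA_MODEL", "llama3.2"),
                     {"host": env.get("OLLAMA_BASE_URL", "http://ollama:11434")})


# Registry: adding a provider means adding one entry here, nothing in the caller.
BUILDERS = {"anthropic": _anthropic, "openai": _openai, "ollama": _ollama}


def build_model_spec(env=None):
    """Pick a builder from DEFAULT_PROVIDER so the caller never names a provider."""
    env = os.environ if env is None else env
    provider = env.get("DEFAULT_PROVIDER", "anthropic")
    if provider not in BUILDERS:
        raise ValueError(f"Unknown provider: {provider!r}")
    return BUILDERS[provider](env)
```

With this shape, `build_model_spec({"DEFAULT_PROVIDER": "ollama"})` yields a spec for `llama3.2`, and the calling code stays identical across providers.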

---

## Portability Principles

1. **Secrets from env vars only** — no hardcoded keys anywhere
2. **Container-friendly by default** — Ollama runs as a Docker Compose service; no host binary assumed
3. **Reproducible setup** — `make setup` bootstraps a clean `venv` from scratch
4. **Explicit runtime selection** — `AGENT_RUNTIME=local|docker`
5. **Provider-agnostic workflow** — `workflow.py` never imports a provider directly

---

## Tasks

Tasks are defined in `tasks/issues.yaml`. Each task is a self-contained engineering problem
run against a sample repository:

```yaml
- id: fastapi-missing-validation
  repo: sample_repos/tiny_fastapi_app
  description: Add Pydantic model validation to the POST /items endpoint
  difficulty: easy
```

Run with: `venv/bin/agent run --task fastapi-missing-validation`

---

## Supported Providers

| Provider | Env vars required | Notes |
|---|---|---|
| `anthropic` | `ANTHROPIC_API_KEY` | Uses Strands Anthropic model. |
| `openai` | `OPENAI_API_KEY` | Uses Strands OpenAI model. |
| `ollama` | `OLLAMA_BASE_URL`, `OLLAMA_MODEL` | No API key. Runs locally via Docker Compose or native install. |

> For low-cost experimentation, start with `gpt-4o-mini` or Ollama.
> Hosted providers incur real API cost. Use `WORKFLOW_MODE=minimal` for cheaper runs.
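
The table above maps naturally onto a pre-flight check, similar in spirit to the config validation that `make doctor` runs. The following is a rough sketch rather than the project's implementation; `REQUIRED_ENV` and `missing_env_vars` are hypothetical names. It treats both Ollama variables as required, following the table, even though the project may fall back to defaults for them.

```python
import os

# Required variables per provider, copied from the Supported Providers table.
REQUIRED_ENV = {
    "anthropic": ["ANTHROPIC_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "ollama": ["OLLAMA_BASE_URL", "OLLAMA_MODEL"],
}


def missing_env_vars(provider, env=None):
    """Return the required variables that are unset or empty for this provider."""
    env = os.environ if env is None else env
    if provider not in REQUIRED_ENV:
        raise ValueError(f"Unknown provider: {provider!r}")
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


if __name__ == "__main__":
    provider = os.environ.get("DEFAULT_PROVIDER", "anthropic")
    missing = missing_env_vars(provider)
    print(f"{provider}: " + ("ok" if not missing else "missing " + ", ".join(missing)))
```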

---

## Setup

```bash
git clone <repo-url>
cd strands-multi-engineer-agent
make setup                # creates venv, copies .env.example → .env, installs deps
# edit .env — set your API key and DEFAULT_PROVIDER
make doctor               # validate config
make run                  # run with DEFAULT_PROVIDER (or: venv/bin/agent run --task ...)
```

**Anthropic:** set `ANTHROPIC_API_KEY` in `.env`

**OpenAI:** set `OPENAI_API_KEY` in `.env`

**Ollama (Docker Compose):**
```bash
make ollama-up            # start container, wait for healthy
make ollama-pull          # pull llama3.2  (MODEL=mistral to override)
make ollama-run           # run agent via Compose
```
**Ollama (native):** set `OLLAMA_BASE_URL=http://localhost:11434` in `.env`, then `make run`

---

## Custom Tasks

Adding a built-in task requires editing `tasks/issues.yaml`. Two shortcuts let you run a custom task without touching that file:

**Ad-hoc task — define inline:**
```bash
venv/bin/agent run \
  --repo sample_repos/tiny_fastapi_app \
  --issue "Add pagination support to GET /items" \
  --difficulty medium
```

**Task file — provide a YAML file:**
```bash
venv/bin/agent run --task-file my_task.yaml
```

```yaml
# my_task.yaml
id: add-pagination
repo: sample_repos/tiny_fastapi_app
description: Add pagination support to GET /items
difficulty: medium
```

Required fields: `repo`, `description`. All other fields are optional.
Only one task source may be used at a time: `--task`, `--repo/--issue`, or `--task-file`.
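
These two rules can be expressed as a small validation step. The sketch below is illustrative only: `validate_task` and `select_task_source` are hypothetical helpers, not the project's CLI code, and the requirement that `--repo` and `--issue` appear together is an assumption drawn from the ad-hoc example above.

```python
def validate_task(task):
    """Raise if a required field is missing or blank; return the task unchanged."""
    missing = [k for k in ("repo", "description") if not str(task.get(k) or "").strip()]
    if missing:
        raise ValueError(f"Task is missing required field(s): {', '.join(missing)}")
    return task


def select_task_source(task_id=None, repo=None, issue=None, task_file=None):
    """Return 'task', 'inline', 'task_file', or None when no source was given."""
    sources = {
        "task": task_id is not None,
        "inline": repo is not None or issue is not None,
        "task_file": task_file is not None,
    }
    chosen = [name for name, used in sources.items() if used]
    if len(chosen) > 1:
        raise ValueError("Use only one of --task, --repo/--issue, or --task-file")
    # Assumption: an inline task needs both halves to be meaningful.
    if chosen == ["inline"] and (repo is None or issue is None):
        raise ValueError("--repo and --issue must be given together")
    return chosen[0] if chosen else None
```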

---

## Environment Variables

All configuration comes from environment variables. Copy `.env.example` to `.env` and fill in values.

| Variable | Required | Default | Description |
|---|---|---|---|
| `DEFAULT_PROVIDER` | No | `anthropic` | Active provider: `anthropic` \| `openai` \| `ollama` |
| `ANTHROPIC_API_KEY` | If using Anthropic | — | Anthropic API key |
| `ANTHROPIC_MODEL` | No | `claude-sonnet-4-6` | Anthropic model ID |
| `ANTHROPIC_MAX_TOKENS` | No | `4096` | Max output tokens for Anthropic |
| `OPENAI_API_KEY` | If using OpenAI | — | OpenAI API key |
| `OPENAI_MODEL` | No | `gpt-4o` | OpenAI model ID |
| `OLLAMA_BASE_URL` | If using Ollama | `http://ollama:11434` | Ollama server URL (`http://localhost:11434` for native) |
| `OLLAMA_MODEL` | If using Ollama | `llama3.2` | Ollama model name — must support tool calling |
| `AGENT_RUNTIME` | No | `loca

[truncated…]

PUBLIC HISTORY

First discovered: Mar 21, 2026

IDENTITY

inferred

Identity inferred from code signals. No PROVENANCE.yml found.


METADATA

platform: github

first seen: Mar 16, 2026

last updated: Mar 17, 2026

last crawled: 3 months ago

version—
