Choosing your agentic framework

You are choosing a worldview, not a library

The instinctive question — "which agentic framework is best?" — has no answer. Every framework that has shipped in the last two years is the product of a credible team responding to a real engineering problem; they are all internally coherent. The right question is the one most teams skip: which mental model fits the problem and the people who will operate the system?

This article walks through the dominant options as of early 2026: what they assume about the world, what they make easy, what they make hard, and how Gen4Travel chose among them. The intended reader is a tech lead about to make a multi-year commitment. We avoid feature lists in favour of architectural patterns: feature lists drift, patterns last.

Read this first

If your workflow is predictable but rich (branching, retries, human-in-the-loop), you probably want a graph framework. If it is inherently dialogic (multiple specialists negotiating an answer), you want a conversation framework. If it is composed of well-defined roles producing artefacts, you want a role framework. Most production systems blend more than one of these patterns.

The four dominant patterns

Pattern 1 — The directed graph (LangGraph)

LangGraph, born inside the LangChain ecosystem and now among the most widely adopted frameworks in the space, models an agent as a directed graph of state transitions. You declare nodes (for planning, tool use, validation, summarisation, human checkpoints) and edges that connect them. Each node receives the current state and computes an update, and the graph runtime routes to the next node.

What this gets you is auditability and predictability. You can draw the graph on a whiteboard; you can show it to a regulator; you can step through it in a debugger. When something goes wrong, the trace is a sequence of node executions with full state at each point. This is invaluable in production, especially in regulated industries where "why did the agent do that?" needs an answer.
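The node, edge and trace ideas above can be sketched in a few lines of plain Python. This is a minimal illustration of the pattern, not LangGraph's API: the names `NODES`, `EDGES` and `run_graph`, and the hard-coded lookup result, are assumptions made for the example.

```python
from typing import Callable, Optional

State = dict
Node = Callable[[State], State]

def plan(state: State) -> State:
    return {**state, "plan": f"look up: {state['question']}"}

def lookup(state: State) -> State:
    # Stand-in for a tool call; a real node would query a service.
    return {**state, "answer": "Gate B12"}

def reply(state: State) -> State:
    return {**state, "reply": f"Your answer: {state['answer']}"}

NODES: dict[str, Node] = {"plan": plan, "lookup": lookup, "reply": reply}

# Each edge function inspects the state and names the next node (None stops).
EDGES: dict[str, Callable[[State], Optional[str]]] = {
    "plan": lambda s: "lookup",
    "lookup": lambda s: "reply" if s.get("answer") else "plan",
    "reply": lambda s: None,
}

def run_graph(entry: str, state: State, max_steps: int = 20):
    """Run nodes until an edge returns None; record state after every node."""
    trace = []
    current: Optional[str] = entry
    while current is not None and len(trace) < max_steps:
        state = NODES[current](state)
        trace.append((current, dict(state)))  # snapshot for audit and replay
        current = EDGES[current](state)
    return state, trace

final_state, trace = run_graph("plan", {"question": "Which gate?"})
for name, snapshot in trace:
    print(name, sorted(snapshot))
```

The `trace` list is the audit artefact: one entry per node execution, each carrying the full state at that point.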

The cost is verbosity. Even a simple "ask question, look up answer, reply" workflow becomes three nodes, two edges, and a state schema. The abstraction earns its keep when workflows get complex; it feels heavy when they are not. Sweet spot: multi-step production workflows where stability and audit matter more than developer speed.

Pattern 2 — The structured conversation (AutoGen)

AutoGen, Microsoft Research's contribution, treats agents as conversational participants in a structured chat. You define roles (planner, coder, reviewer, critic), pick a conversation pattern (group chat, sequential, hierarchical, reflection), and let the agents speak. The runtime mediates turns, persists state, and routes messages.
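A turn-taking runtime of this kind can be sketched without any framework. The example below is a hedged illustration of the group-chat pattern, not AutoGen's API: the agents are scripted functions standing in for LLM calls, and `group_chat`, `Message` and the "approve" stop condition are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    text: str

def planner(history: list) -> str:
    # Revise the plan only if the critic has objected at some point.
    objected = any(m.sender == "critic" and "objection" in m.text for m in history)
    return "revised plan: add tests" if objected else "plan: ship it"

def critic(history: list) -> str:
    last_plan = [m for m in history if m.sender == "planner"][-1].text
    return "approve" if "tests" in last_plan else "objection: no tests"

def group_chat(agents: dict, max_turns: int = 8) -> list:
    """Round-robin mediator: each agent sees the full transcript so far."""
    history: list = []
    order = list(agents)
    for turn in range(max_turns):
        name = order[turn % len(order)]
        text = agents[name](history)
        history.append(Message(name, text))
        if text == "approve":  # hypothetical termination signal
            break
    return history

transcript = group_chat({"planner": planner, "critic": critic})
for message in transcript:
    print(f"{message.sender}: {message.text}")
```

The `max_turns` bound matters more than it looks: with real models in place of the scripted functions, nothing else guarantees the conversation ends.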

The mental model is the most intuitive of the four. Anyone who has run a meeting can immediately reason about a group-chat agent — "the planner proposes, the critic objects, the planner revises, the executor runs". This makes AutoGen unusually accessible for teams without deep agent-system experience, and unusually good for tasks that are inherently dialogic — collaborative coding, document review, brainstorming.

The cost is emergent behaviour. When agents converse freely, they can converge on solutions you did not anticipate, including bad ones. AutoGen is at its best when stakes are moderate and the workflow benefits from creative recombination. It is at its worst when you need a predictable execution path with hard guarantees about what tools get called, in what order, and on what data.

Pattern 3 — Roles and tasks (CrewAI)

CrewAI takes a different angle. You describe agents (their goal, their backstory, their tools) and tasks (what to produce, by which agent, depending on which other tasks). The runtime resolves the dependency graph and runs each task on its agent. The mental model is project management: a crew of specialists, a list of deliverables, a producer.
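Resolving the task dependency graph is the core of this pattern, and the standard library is enough to sketch it. The example below is an illustration under stated assumptions, not CrewAI's API: `Task`, `run_crew` and the string outputs are hypothetical, and each agent is reduced to a name.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class Task:
    name: str
    agent: str
    depends_on: list = field(default_factory=list)

# Declared out of order on purpose: the runtime works out the sequence.
TASKS = [
    Task("edit", agent="editor", depends_on=["draft"]),
    Task("research", agent="researcher"),
    Task("draft", agent="writer", depends_on=["research"]),
    Task("fact_check", agent="checker", depends_on=["edit"]),
]

def run_crew(tasks: list) -> dict:
    """Run every task after its dependencies, passing their outputs along."""
    by_name = {t.name: t for t in tasks}
    graph = {t.name: set(t.depends_on) for t in tasks}
    outputs: dict = {}
    for name in TopologicalSorter(graph).static_order():
        task = by_name[name]
        inputs = [outputs[dep] for dep in task.depends_on]
        # Stand-in for the agent doing real work on its inputs.
        outputs[name] = f"{task.agent} did {name} using {len(inputs)} input(s)"
    return outputs

results = run_crew(TASKS)
print(list(results))
```

Note that the whole task list exists before anything runs, which is exactly the plan-first assumption this pattern makes.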

This pattern shines for content-style workflows: research a topic, draft an article, edit it, fact-check it, format it, publish it. Each step has a clear owner and a clear input/output contract. Non-engineers can read a CrewAI definition and understand it. The framework has become unusually popular for marketing and content ops as a result.

The cost is reactivity. CrewAI assumes a plan-first world: you know what tasks exist before you start. When the world is more reactive — a passenger reports a problem and the agent has to discover what tasks are even relevant — you end up bending the framework, and at that point you are better off with a graph or a conversation pattern.

Pattern 4 — Skills and SDK (Semantic Kernel)

Microsoft's Semantic Kernel is the most enterprise-leaning of the four. It treats agents as SDK-orchestrated skill sets: you define skills (capabilities the agent can invoke; current releases call them plugins) and planners (which compose skills into plans), and wire it all up in C#, Python or Java. The feel is closer to an SDK for agent-aware application development than to a "framework" in the Python-script sense.
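The split between skills and a planner that composes them can be shown with a toy registry. This sketch is not Semantic Kernel's API: the `skill` decorator, `make_plan` and `execute` are hypothetical names, and the keyword-matching planner stands in for what would normally be a model-driven one.

```python
from typing import Callable

SKILLS: dict[str, Callable[[str], str]] = {}

def skill(name: str):
    """Register a function as a named skill (hypothetical decorator)."""
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("translate")
def translate(text: str) -> str:
    return f"[en] {text}"  # stand-in for a real translation call

@skill("summarise")
def summarise(text: str) -> str:
    # Keep only the first sentence.
    return text.split(". ")[0].rstrip(".") + "."

def make_plan(goal: str) -> list:
    """Toy planner: keyword matching in place of a model call."""
    steps = []
    if "translate" in goal:
        steps.append("translate")
    if "summar" in goal:
        steps.append("summarise")
    return steps

def execute(goal: str, text: str) -> str:
    # Apply each planned skill in order, feeding output to the next.
    for name in make_plan(goal):
        text = SKILLS[name](text)
    return text

print(execute("translate and summarise", "Flight delayed. New gate is B12."))
```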

Where it wins: existing .NET stacks. Strong SDK ergonomics, native plugins, deep integration with Azure services. If your organisation is already Microsoft-heavy, Semantic Kernel offers the shortest path from prototype to integrated production system, and the resulting code looks like everything else you ship.

Where it does not: the Python ML ecosystem. Most cutting-edge research, evaluation tools and integrations land in Python first. A team committed to Semantic Kernel will sometimes end up running a Python sidecar for the bleeding edge.

Fig. 1 Four mental models, four representative frameworks. Most production systems blend them; the framework is the dominant idiom, not the only one.

The long tail (and why it matters)

The four patterns above cover the bulk of production deployments, but the broader landscape is much richer. A few of the projects worth knowing about as of early 2026:

LlamaIndex Workflows

The LlamaIndex team, originally focused on retrieval, now ships a workflow primitive that occupies a similar niche to LangGraph — declarative steps with explicit data flow — but with much tighter integration to retrieval and ingestion. If your agent's primary job is to reason over a large corpus (legal, financial, technical documentation), starting from LlamaIndex Workflows often saves weeks of integration work.

Mastra

The TypeScript-native option. Mastra is a relatively new framework but has gained traction quickly with teams whose product is a Node.js application and who do not want a Python sidecar. Its design borrows from LangGraph's graph model but ships in idiomatic TypeScript with type-safe tool definitions. Worth a look if your stack is already JS-heavy.

Inspect AI

From the UK AI Security Institute (formerly the AI Safety Institute), Inspect is not a framework for building agents but a framework for evaluating them. It models agent runs as datasets of trajectories, supports complex evaluators (including human-in-the-loop scoring), and has become the de facto choice for serious agent-evaluation pipelines. We use Inspect inside Gen4Travel for our regression suite.
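To make the idea of a trajectory-based regression suite concrete, here is a minimal sketch in plain Python. It does not use Inspect's API: `Trajectory`, `score` and `regression_failures` are hypothetical names, and the exact-match comparison is a deliberately crude stand-in for a real evaluator.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    case_id: str
    tool_calls: list
    final_answer: str

def score(golden: Trajectory, candidate: Trajectory) -> dict:
    """Compare a candidate run with its golden reference on two axes."""
    return {
        "same_tools": golden.tool_calls == candidate.tool_calls,
        "same_answer": golden.final_answer.strip().lower()
        == candidate.final_answer.strip().lower(),
    }

def regression_failures(goldens: list, candidates: list) -> list:
    """Return the case ids that are missing or diverge from the golden run."""
    by_id = {c.case_id: c for c in candidates}
    failures = []
    for golden in goldens:
        candidate = by_id.get(golden.case_id)
        if candidate is None or not all(score(golden, candidate).values()):
            failures.append(golden.case_id)
    return failures

goldens = [
    Trajectory("rebook", ["search_flights", "hold_seat"], "Seat held"),
    Trajectory("refund", ["lookup_booking", "issue_refund"], "Refund issued"),
]
candidates = [
    Trajectory("rebook", ["search_flights", "hold_seat"], "seat held "),
    Trajectory("refund", ["issue_refund", "lookup_booking"], "Refund issued"),
]
print(regression_failures(goldens, candidates))
```

In this example the second case fails because the tools were called in a different order, even though the final answer matches.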

DSPy and BAML

Both target the prompt-engineering layer rather than the orchestration layer. DSPy lets you write programs in terms of "modules" with declared signatures and learn the prompts automatically; BAML lets you write strongly typed prompt definitions that compile to multi-vendor LLM clients. Neither replaces an agentic framework, but both pair well with one.

Decision criteria that actually predict outcomes

Based on shipping Gen4Travel and watching half a dozen partner teams ship adjacent systems, here are the criteria that we have found genuinely correlate with success or failure of the framework choice. Listed roughly in order of importance.

1. How predictable is your workflow?

If you can sketch the steps on a whiteboard and the sketch will not change every week, choose a graph framework. If the steps are emergent and depend heavily on user input, choose a conversation framework. If you have a fixed set of deliverables produced by specialists, choose a role framework. Every one of the first ten engineers we watched get this decision wrong chose a conversation framework for a predictable workflow, then spent six months adding constraints to make it predictable.

2. What is the operational discipline of the team?

Conversation frameworks demand strong evaluation discipline because emergent behaviour is harder to constrain. Graph frameworks demand strong API discipline because nodes need clean state contracts. Role frameworks demand strong prompt discipline because role definitions are the system's main lever. None of these is a deal-breaker, but each framework type tends to expose the team's weakest discipline as its dominant pain point.

3. Who will operate this in production?

An agent system is not "done" when it ships; it lives, drifts, and needs maintenance. Pick a framework whose tracing, observability and replay tooling fits the operational team's existing habits. Teams running the system on Datadog, Grafana, or Honeycomb will save weeks if the framework has native exporters; teams without an observability stack will need to budget for one regardless.

4. Where does the talent live?

If your team is fluent in Python, the major frameworks all play to your strengths. If you have a strong JS team, look hard at Mastra. If you are a Microsoft-heavy enterprise, Semantic Kernel will have the lowest organisational friction. The best framework is the one your team can actually maintain at 2 a.m. on a Saturday when the production agent is doing something strange.

Framework	Strength	Weakness	Best for	Avoid if
LangGraph	Auditable, explicit, production-ready	Verbose for simple cases	Regulated workflows	You want low ceremony
AutoGen	Intuitive, dialogic, creative	Emergent behaviour hard to bound	Collaborative tasks	Stakes are high
CrewAI	Accessible, role-clear	Plan-first only	Content workflows	World is reactive
Semantic Kernel	Enterprise-grade, multi-language	Lags Python ecosystem	.NET enterprises	You need Python ML cutting edge
LlamaIndex WF	Tight retrieval integration	Younger orchestration story	RAG-heavy agents	Retrieval is incidental
Mastra	TypeScript-native, type-safe	Smaller community	JS/TS teams	You are Python-only

Fig. 2 The dominant frameworks at a glance — early 2026. Treat this as a starting point for shortlisting; the only framework that matters in the end is the one your team will keep operating six months from now.

How Gen4Travel chose

Gen4Travel runs on LangGraph for the orchestrator. The decision came down to three properties that mattered more than any other for our context. Auditability: regulated travel operators need to be able to explain to a customer or a regulator exactly what the agent did and why. The graph trace is that explanation. Human-in-the-loop primitives: travel involves irreversible actions, and LangGraph's checkpoint mechanism mapped cleanly to our wallet confirmation step. Production maturity: by the time we made the call, LangGraph had a year of production deployments behind it across major operators, and the bug curve had flattened.
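The human-in-the-loop requirement boils down to pausing before the irreversible step, persisting the state, and resuming only on an explicit decision. The sketch below illustrates that shape in plain Python. It is not LangGraph's checkpoint API and not Gen4Travel's code: the function names, the `status` values and the quote payload are all assumptions made for the example.

```python
import json

def run_until_checkpoint(state: dict) -> dict:
    """Do the reversible work, then stop before the irreversible charge."""
    state = {**state, "quote": {"item": "rebooking", "amount_eur": 120}}
    state["status"] = "awaiting_confirmation"
    return state

def save_checkpoint(state: dict) -> str:
    # Serialising the state lets the run resume later, in another process.
    return json.dumps(state)

def resume(serialised: str, approved: bool) -> dict:
    """Continue from a saved checkpoint once the human has decided."""
    state = json.loads(serialised)
    if state.get("status") != "awaiting_confirmation":
        raise ValueError("checkpoint is not waiting for confirmation")
    state["status"] = "charged" if approved else "cancelled"
    return state

saved = save_checkpoint(run_until_checkpoint({"user": "u1"}))
print(resume(saved, approved=True)["status"])
print(resume(saved, approved=False)["status"])
```

Because the checkpoint is an immutable string, the same saved state can be resumed with either decision, which is also what makes replaying a disputed transaction possible.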

Inside individual specialised agents, we use lighter tooling: Pydantic-AI for structured outputs (it forces the agent to return validated objects rather than free-form text, which catches a class of hallucinations at the parser layer), Inspect AI for the regression suite (every release runs against a corpus of golden trajectories), and DSPy for one specific module, the inspiration agent's destination ranker, where we benefited from automated prompt optimisation against a labelled dataset.
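Catching hallucinations at the parser layer means refusing any model output that does not match a declared shape. The sketch below shows the idea with the standard library only. It is not Pydantic-AI's API: `FlightOption`, its two fields and `parse_model_output` are hypothetical, chosen for illustration.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class FlightOption:
    flight_number: str
    price_eur: float

    def __post_init__(self):
        # Validate on construction so an invalid object can never exist.
        if not isinstance(self.flight_number, str) or not self.flight_number:
            raise ValueError("flight_number must be a non-empty string")
        is_number = isinstance(self.price_eur, (int, float)) and not isinstance(
            self.price_eur, bool
        )
        if not is_number or self.price_eur <= 0:
            raise ValueError("price_eur must be a positive number")

def parse_model_output(raw: str) -> FlightOption:
    """Turn raw model text into a validated object, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(data, dict) or set(data) != {"flight_number", "price_eur"}:
        raise ValueError("model output has missing or unexpected fields")
    return FlightOption(**data)

option = parse_model_output('{"flight_number": "AF123", "price_eur": 120.5}')
print(option)
```

Downstream code then works with `FlightOption` instances only, so free-form text, missing fields or a negative price fail loudly at the boundary instead of propagating.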

The selection is not religious. We maintain a clean abstraction at the orchestrator level so that we can swap a single agent's runtime without disturbing the others. If a future framework genuinely outperforms LangGraph for our workload, the migration cost is bounded; we did not paint ourselves into a corner. The framework is a tool, not an identity.
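One way to keep that kind of swap cheap is to have the orchestrator depend on a narrow interface rather than on any framework's types. The sketch below illustrates the idea and is an assumption about how such a boundary could look, not a description of Gen4Travel's code: `AgentRuntime`, `Orchestrator` and both ranker classes are hypothetical.

```python
from typing import Protocol

class AgentRuntime(Protocol):
    """The only contract the orchestrator knows about."""
    def run(self, task: str, context: dict) -> dict: ...

class AlphabeticalRanker:
    def run(self, task: str, context: dict) -> dict:
        return {"runtime": "alphabetical", "result": sorted(context["candidates"])}

class ShortestFirstRanker:
    def run(self, task: str, context: dict) -> dict:
        ranked = sorted(context["candidates"], key=len)
        return {"runtime": "shortest_first", "result": ranked}

class Orchestrator:
    def __init__(self) -> None:
        self._agents: dict = {}

    def register(self, name: str, runtime: AgentRuntime) -> None:
        # Registering under an existing name replaces that agent's runtime.
        self._agents[name] = runtime

    def dispatch(self, name: str, task: str, context: dict) -> dict:
        return self._agents[name].run(task, context)

orchestrator = Orchestrator()
orchestrator.register("ranker", AlphabeticalRanker())
context = {"candidates": ["Lisbon", "Rome", "Amsterdam"]}
before = orchestrator.dispatch("ranker", "rank destinations", context)

# Swap one agent's runtime; callers and other agents are untouched.
orchestrator.register("ranker", ShortestFirstRanker())
after = orchestrator.dispatch("ranker", "rank destinations", context)
print(before["runtime"], "->", after["runtime"])
```

The design choice is that the call site, `dispatch("ranker", ...)`, is identical before and after the swap; only the registration line changes.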

The bottom line

The right agentic framework is the one your team will still be operating happily in eighteen months. Match the framework to the workflow, the workflow to the operational discipline, and the operational discipline to the people who will run it. The technology is converging fast; the human factors are not, and they predict outcomes more reliably than any feature comparison.
