
Improving long-horizon planning through harness engineering


Lugman Hussain Khan


DeepPlanning is a benchmark for long-horizon agentic planning released by the Qwen team. Its Travel subset asks an agent to build a multi-day itinerary that satisfies user preferences, respects implicit environment constraints like opening hours and ticket availability, and stays inside an explicit budget. Each task is graded on four metrics: a Commonsense Score, a Personalized Score for user-specified constraints, a Composite Score, and a Case Accuracy that only awards a 1 when both the Commonsense and Personalized scores are perfect.
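The reported numbers are consistent with the Composite Score being the mean of the Commonsense and Personalized scores. As a minimal sketch of the grading logic (the names and aggregation here are assumptions, not the benchmark's actual code):

from dataclasses import dataclass

@dataclass
class TaskResult:
    commonsense: float   # fraction of commonsense checks passed, 0.0 to 1.0
    personalized: float  # fraction of user-constraint checks passed, 0.0 to 1.0

def composite(r: TaskResult) -> float:
    # Assumption: Composite is the mean of the two component scores.
    # The numbers in the results table later in this post are consistent with this.
    return (r.commonsense + r.personalized) / 2

def case_accuracy(results: list[TaskResult]) -> float:
    # Case Accuracy awards a 1 only when BOTH scores are perfect on a task.
    perfect = sum(1 for r in results if r.commonsense == 1.0 and r.personalized == 1.0)
    return perfect / len(results)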

This post walks through the harness changes we applied to Claude Sonnet 4.6 (high reasoning) on the Travel subset, and what each change did to case accuracy.

Setting the baseline

We started with the benchmark's default agent loop and default tool set. Sonnet 4.6 with high reasoning scored 20.8% case accuracy. The Composite score was 73.2%, which means the model satisfied many individual checks but rarely landed every check on a single task.

Inspecting the failed traces surfaced four recurring patterns:

Giving the model a draft pad and a checklist

We made three changes at once: we offloaded sorting of search results to the tools themselves, gave the model a draft pad in the form of a write_draft_plan tool so it could write the itinerary down before validating it, and attached a validation checklist organized by itinerary section.
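As a rough illustration of the draft pad, a tool definition could look like the sketch below, in the JSON-schema style most agent APIs accept. The post names the tool write_draft_plan; every other field here is illustrative rather than taken from the actual harness.

# Hypothetical tool spec for the draft pad.
WRITE_DRAFT_PLAN_TOOL = {
    "name": "write_draft_plan",
    "description": (
        "Write the complete day-by-day itinerary draft so it can be "
        "checked against the validation checklist before final submission."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "plan": {
                "type": "string",
                "description": "Full itinerary: transport, hotels, attractions, and meals.",
            }
        },
        "required": ["plan"],
    },
}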

[Figure: First verification flow]

Here is what the meals section of the checklist looks like:

{
  "meals": [
    "Ensure names of all locations and entities match the tool results.",
    "Strictly schedule all meal activities for 1 to 2 hours.",
    "Ensure scheduled meal times fall entirely within the specific restaurant's open hours.",
    "Maintain at least a 2-hour interval between lunch and dinner.",
    "Include exactly 2 meals (both a lunch and a dinner) on every full non-transfer day.",
    "Strictly follow the arrival/departure meal rules based on the transfer day schedule.",
    "Include all user-specified must-eat venues exactly as specified.",
    "Ensure all chosen restaurants are unique across different days.",
    "Correctly omit breakfast from the itinerary.",
    "Do not select any restaurant marked as 'Temporarily Closed'."
  ]
}

We considered adding a todo-style task tracker but dropped it. Trace analysis showed that Sonnet 4.6 already broke tasks into ordered steps without prompting, so an explicit todo tool would have been redundant.

After these changes, case accuracy moved from 20.8% to 47%. That put the configuration in second place on the benchmark, behind Claude Opus 4.6 with max reasoning at 61.5%.

When the model starts checking boxes without looking

Traces from the 47% run surfaced a new failure mode. By the time the validation phase started, the average context length was around 65,000 tokens. The model would walk through the checklist and mark items as passing without actually re-reading the draft and checking. Obvious violations (a budget overrun visible directly in the plan, an attraction listed twice across days) were stamped as compliant.

The issue was not the existence of a checklist. It was that asking the same in-context model to audit its own long trajectory was unreliable. The validator needed a clean working context for each check.

Splitting validation into focused parallel checks

We changed write_draft_plan so that on submission, the harness automatically runs validation in the background and returns only the failed checks. The agent loop then either revises and resubmits, or, when no checks fail, treats the plan as final.
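In pseudocode, the submission path looks roughly like the sketch below. handle_draft_submission is a hypothetical name, and run_validation (the per-section fan-out) is sketched in the next section.

def run_validation(plan: str, tool_history: list[dict]) -> list[str]:
    """Per-section parallel fan-out; sketched in the next section."""
    ...

def handle_draft_submission(plan: str, tool_history: list[dict]) -> dict:
    # Runs automatically whenever the agent calls write_draft_plan.
    failed_checks = run_validation(plan, tool_history)
    if not failed_checks:
        # No failures: the agent loop treats this draft as the final plan.
        return {"status": "final", "plan": plan}
    # Only the failed checks go back to the agent, which revises and resubmits.
    return {"status": "revise", "failed_checks": failed_checks}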

Validation uses the same model (Sonnet 4.6) but fans out into parallel calls by checklist section. Each parallel call receives:

  1. The validation instruction for that section.
  2. The draft plan as written.
  3. The checklist for that specific section only.
  4. The tool calls and their results that are relevant to that section. Transport and hotel checks see transport and hotel queries. Attraction and meal checks see attraction and restaurant queries. Intra-city travel checks see route queries.

Each validator runs against a short, focused context built specifically for its section. There is no long trajectory to skim past and no incentive to rubber-stamp. Results from the parallel validators are aggregated, and any failed checks are returned to the agent loop for revision.
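A sketch of that fan-out, assuming asyncio for the parallel calls and a call_model stub standing in for the real model client. The routing of tool calls to sections follows the list above, but the section names, tool names, and prompt format are all illustrative.

import asyncio
import json

# Section name -> checklist items (e.g. the "meals" list shown earlier).
CHECKLISTS: dict[str, list[str]] = {}

# Which tool calls each section's validator gets to see, per the list above.
SECTION_TOOLS: dict[str, set[str]] = {
    "transport_hotel": {"search_transport", "search_hotels"},       # assumed names
    "attractions_meals": {"search_attractions", "search_restaurants"},
    "intra_city": {"search_routes"},
}

async def call_model(prompt: str) -> str:
    """Stub for a real Sonnet 4.6 call via whatever client the harness uses."""
    raise NotImplementedError

async def validate_section(section: str, plan: str, tool_history: list[dict]) -> list[str]:
    # A short, focused context: instruction, draft, this section's checklist,
    # and only the tool calls relevant to this section.
    relevant = [t for t in tool_history if t["name"] in SECTION_TOOLS[section]]
    prompt = (
        f"Validate the '{section}' section of this draft travel plan.\n\n"
        f"DRAFT PLAN:\n{plan}\n\n"
        f"CHECKLIST:\n{json.dumps(CHECKLISTS[section], indent=2)}\n\n"
        f"RELEVANT TOOL RESULTS:\n{json.dumps(relevant, indent=2)}\n\n"
        "Return a JSON list containing only the checks that FAIL."
    )
    return json.loads(await call_model(prompt))

async def _validate_all(plan: str, tool_history: list[dict]) -> list[str]:
    # One isolated validator per section, run in parallel, then aggregated.
    per_section = await asyncio.gather(*(
        validate_section(name, plan, tool_history) for name in CHECKLISTS
    ))
    return [check for failed in per_section for check in failed]

def run_validation(plan: str, tool_history: list[dict]) -> list[str]:
    # Synchronous entry point matching the earlier sketch.
    return asyncio.run(_validate_all(plan, tool_history))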

[Figure: Validation harness flow]

Putting it all together

With the validation harness in place, Claude Sonnet 4.6 reached 65.8% case accuracy on the Travel subset, with a Commonsense score of 97.2, a Personalized score of 81.7, and a Composite score of 89.4.

For comparison:

Configuration                  Commonsense  Personalized  Composite  Case Accuracy
Sonnet 4.6 baseline                   83.9          62.5       73.2           20.8
GPT-5.2 high                          88.5          83.3       85.8           35.0
Opus 4.6 max                          86.1          80.3       83.2           61.5
Sonnet 4.6 + Custom Harness           97.2          81.7       89.4           65.8

The harness brought Sonnet 4.6 above Opus 4.6 with max reasoning on case accuracy, at a lower per-call cost.

This exercise was a reminder that agent performance often plateaus because of harness design, not model capability. Sonnet 4.6 already knew how to plan a trip. What it lacked was a clean way to write that plan down before validating it, and a validator that was not drowning in its own history.

Once we offloaded sorting to the tools, forced a draft step, and ran parallel checks in isolated contexts, the model started passing tasks it had been failing consistently. The gap between the baseline's 73.2% composite score and its 20.8% case accuracy is the gap between mostly right and actually done. Closing that gap is a harness problem.
