deep-research
Description
Deep web research benchmark — given partial clues about authors and topic, find the exact paper and produce a structured answer file with the paper title and DOIs. Use when the user wants to test or demonstrate an agent's ability to do multi-hop, citation-aware web research from sparse input. Originally authored as a SkillsBench task by Bingran You; copied here as a discoverable, self-contained reference.
SKILL.md
deep-research
A benchmark task that exercises an agent's deep web-research loop: starting from partial author hints and a vague topic description, locate the exact target paper and emit a structured answer.
This is one of the SkillsBench tasks I wrote. The original — including the full sandbox environment, scripts, solution, and verifier — lives at BenchFlow-Hub/galaxies-bingran/tasks/deep-research. What you'll find here is the instruction and the task config — the parts that define what the agent has to do.
Instruction
Find a paper published before 06/2024. The paper is about quantum networks, fast ion-string transport, and second-order correlation functions. Of the first two authors, one earned a PhD at NYU and went on to a postdoc at UC Berkeley; the other completed undergraduate study in China.
Put the answer in /root/final_answer.md as a markdown file with 4 lines:
- The complete title of the paper
- The DOI of the paper (only DOI string, no prefix)
- The DOI of the first co-author (only DOI string, no prefix)
- The DOI of the second co-author (only DOI string, no prefix)
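The four-line format above lends itself to a simple shape check. Here is a minimal sketch, not the actual SkillsBench verifier (whose logic lives in the original repo), that validates the file's structure. It assumes the paper DOI follows the usual `10.<registrant>/<suffix>` pattern; the function name and path default are illustrative only.

```python
import re
from pathlib import Path

# Common DOI shape: "10." + 4-9 digit registrant code + "/" + suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def check_answer_shape(path="/root/final_answer.md"):
    """Return True if the file has 4 non-empty lines and line 2 looks like a bare DOI.

    Lines 3-4 (the author identifiers) are only checked for being present,
    since the task phrasing ("DOI of the co-author") may in practice mean
    ORCID-style IDs, which use a different format.
    """
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if len(lines) != 4:
        return False
    return bool(DOI_RE.match(lines[1]))
```

This checks only the file's shape, not the answer's correctness; the real verifier compares against the known target paper.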
Task config
```toml
version = "1.0"

[metadata]
author_name = "Bingran You"
author_email = "bingran.you@berkeley.edu"
difficulty = "medium"
category = "deep-research"
tags = ["web-search", "deep-research"]

[verifier]
timeout_sec = 900.0

[agent]
timeout_sec = 400.0

[environment]
build_timeout_sec = 600.0
cpus = 1
memory_mb = 4096
storage_mb = 10240
allow_internet = true

[agent.env]
EXA_API_KEY = "${EXA_API_KEY}"
```
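The `[agent.env]` table forwards `EXA_API_KEY` into the agent's container. A sketch of how an agent might pick it up and build a search request; the endpoint and header names here assume Exa's public REST API conventions and are not defined by the task config itself.

```python
import os

# EXA_API_KEY is injected via the task's [agent.env] table.
api_key = os.environ.get("EXA_API_KEY", "")

# Assumed Exa REST API conventions (endpoint and header names are
# assumptions, not part of the task config).
endpoint = "https://api.exa.ai/search"
headers = {"x-api-key": api_key, "Content-Type": "application/json"}
payload = {
    "query": "quantum network fast ion string transport second-order correlation",
    "numResults": 10,
}
```

The query string combines the topic keywords from the instruction; an agent would iterate on it while cross-referencing the author career hints.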
Why it's a good test
- Multi-hop reasoning — the agent must cross-reference author career history with topic keywords to disambiguate a single paper.
- Citation discipline — the answer requires DOIs (paper + author ORCID-style IDs), not just titles.
- Web tool reliance — the task is intentionally hard to solve from cached training data alone; expect the agent to actually search.