A GDPVal task inspected: when the model has to think and ship documents
The GDPVal interview scheduling task sits in a sweet spot: too structured for a chat prompt, too small for an enterprise workflow — bounded knowledge work a capable person could finish in one sitting.
On GDPVal
GDPVal is OpenAI’s benchmark for measuring model performance on economically valuable, real-world knowledge work rather than synthetic academic tasks. OpenAI frames it as an evaluation across realistic occupational tasks, with an accompanying public leaderboard and paper. I wrote a separate signal on GDPVal here.
For anyone building AI products, that makes it more interesting than another benchmark built around trivia, code golf, or exam-style prompts. What matters here is not just the benchmark score. It is the nature of the work: concrete briefs, real deliverables, imperfect instructions, and tasks that resemble what organizations already pay people to do.
The Interview Scheduling Task
This specific task asks the model to create a master interview schedule for a medical training program interview day, plus two one-page personal itineraries for named applicants.
The brief spells out timing rules, talks, tours, lunch placement, room-level break constraints, a transition buffer between interviews, named doctors with special availability windows, sequencing rules, and formatting expectations for the final Word documents — a genuinely structured operational problem.
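Those rules are exactly the kind that can be read as machine-checkable constraints. A minimal sketch in Python, using invented slot values and an assumed five-minute buffer (the brief's actual numbers are not reproduced here), of what one such check could look like:

```python
from dataclasses import dataclass
from datetime import time

@dataclass(frozen=True)
class Slot:
    """One interview slot. Applicants, interviewers, and rooms are invented."""
    applicant: str
    interviewer: str
    start: time
    end: time
    room: str

def _minutes(t: time) -> int:
    return t.hour * 60 + t.minute

def violates_buffer(a: Slot, b: Slot, buffer_minutes: int = 5) -> bool:
    """True if slot b follows slot a for the same applicant without the
    transition buffer. The 5-minute default is an assumption, not the
    brief's actual figure."""
    if a.applicant != b.applicant:
        return False
    gap = _minutes(b.start) - _minutes(a.end)
    return 0 <= gap < buffer_minutes
```

A scheduler (human or model) that emits `Slot`-like rows can be audited pairwise against checks of this shape, which is what makes the task "structured" rather than open-ended.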
How did ChatGPT perform on the task?
In this run, ChatGPT using GPT-5.4 Thinking returned the requested Word documents after 26 minutes and 16 seconds.
The model did not merely summarize the instructions. It turned them into packaged artifacts: both itineraries are structured one-page Word documents with applicant identity and group, a compact timeline, a visual slot for photo and floor layout, and a row of tour locations. The benchmark is not asking whether the model can explain a schedule. It is asking whether it can produce work product.
Master schedule, two pages
There is room to debate layout choices. There is always room to debate visual polish. But that misses the main point: GenAI can translate a fairly dense operational brief into a concrete artifact, it can preserve multiple constraints over a long output horizon, and it can package the result in a useful document.
Why the original brief worked
The original task works because it is not a first draft of a request. It reads like a scenario that has already been operationalized by humans.
In other words, much of the conceptual thinking has already happened before the task is written. The domain owners already know the shape of the day, the constraints, and the expected output format. I would argue that this is not the most common real-world scenario — and it may suggest a broader shift for knowledge workers from doing to instructing.
How good is the original brief?
On a requirements-quality spectrum, this brief sits closer to “write me a paragraph” than to safety-critical engineering. Three questions help frame it.
1. Is it good enough to produce a useful output once?
Yes. This run shows it is.
2. Is it good enough to reproduce consistently?
Maybe. Probably with variation. The structure is strong, but there is still enough ambiguity that two model runs may make different scheduling choices, document layout decisions, or other hidden assumptions.
3. Is it robust to changing parameters?
The brief will probably break when you change the number of applicants, add a late interviewer, alter lunch timing, change the number of rooms, or introduce a missing image — the unstated assumptions start to matter.
How the same task looks as a formal SRS
I wanted to see the contrast between the prompt OpenAI used and a professional software requirements specification. I had ChatGPT restate the narrative brief in an IEEE-style SRS format. The translation is intellectually tidy, but the difference in register is stark.
Assessment of the SRS for this use case
- The SRS format earns its keep where the cost of failure is high or reproducibility is paramount: the formalization is cleaner for validation, but worse for everyday communication.
- It is not practical for the person giving the instruction. This is not how an operational owner would naturally brief the work, and almost certainly not how Dr. Sinnott would express it…
Can we still learn from the SRS style?
The writer’s side
Writers of a brief like this are usually busy, close to the problem, and thinking in situation and intent — so context, hard rules, and preferences naturally sit in one flow. They leave things unsaid because they’re obvious to them (how many rooms, what happens when two constraints clash), and they don’t think in “requirement 1, requirement 2.”
The reader’s side
Readers — whether a colleague or a model — are trying to do the right thing and not overstep. They can’t always ask; they have to guess. When the brief doesn’t separate “must hold” from “optimize if you can,” they don’t know whether to treat something as non-negotiable or flexible.
When two rules conflict (e.g. “tours 10 minutes” vs. “back by 9:50”), they have no way to know which the author would prioritize. When they want to check “did I do everything?”, long bundled sentences make it hard to tick items off. And when something doesn’t fit — a missing asset, a layout that spills past one page — the brief usually doesn’t say what to do.
The gap
Neither side is failing. The writer is being natural; the reader is being careful. Making the structure explicit, filling in assumptions and conflict rules, labeling hard vs. soft constraints, splitting requirements into checkable units, and adding guidance for edge cases gives both sides more safety and clarity — so the work can be done and checked without mind-reading.
Four practical improvements
The original brief is already better than most prompts. It can still be improved in four ways:
- Make the structure explicit — Separate scenario context, assumptions (e.g. room count, rotation logic, tour duration), required assets, and expected outputs.
- Separate hard constraints from preferences — Label what must hold vs. what to optimize if possible.
- Use atomic, checkable requirements — One requirement per statement so each can be verified and traced.
- Define failure and edge-case handling — What to do when constraints don’t fit, assets are missing, or layout can’t fit one page.
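The second and third improvements can be made concrete. A sketch, with invented requirement IDs and texts, of what “atomic, checkable, and labeled hard or soft” could look like in code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Requirement:
    rid: str
    text: str
    hard: bool                     # must hold vs. optimize if possible
    check: Callable[[dict], bool]  # verifiable against a produced schedule

# Illustrative requirements; IDs, texts, and the schedule shape are
# invented for this sketch, not taken from the actual brief.
requirements = [
    Requirement("R1", "Every applicant gets a lunch slot", True,
                lambda s: all("lunch" in day for day in s.values())),
    Requirement("R2", "Tours are kept to 10 minutes where possible", False,
                lambda s: all(day.get("tour_minutes", 0) <= 10
                              for day in s.values())),
]

def audit(schedule: dict) -> list[str]:
    """Return IDs of failed requirements, hard failures first."""
    failed = [r for r in requirements if not r.check(schedule)]
    return [r.rid for r in sorted(failed, key=lambda r: not r.hard)]
```

Because each requirement is one statement with one check, “did I do everything?” becomes a list of pass/fail IDs instead of a re-read of bundled prose.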
What tool could help Dr. Sinnott create better versions?
A useful service would: take a natural-language task from a domain expert; identify missing assumptions; separate hard constraints from preferences; decompose bundled requirements into atomic statements; flag ambiguity and likely failure points; and produce one artifact that is both human-readable and, if necessary, translatable to a machine format.
It should be highly interactive and precise in its suggestions. It should operate on natural language while quietly introducing structure and rigor: a next-generation IDE for task specification rather than code.
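One way to picture the single artifact such a tool could maintain: a hypothetical spec schema with a lint pass that surfaces the gaps discussed above. The section names mirror the four improvements; nothing here is a real product's schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    """Hypothetical structured form of a natural-language brief."""
    context: str = ""
    assumptions: list[str] = field(default_factory=list)
    hard_constraints: list[str] = field(default_factory=list)
    preferences: list[str] = field(default_factory=list)
    edge_case_policies: list[str] = field(default_factory=list)

def lint(spec: TaskSpec) -> list[str]:
    """Flag the gaps the tool would prompt the writer about."""
    issues = []
    if not spec.assumptions:
        issues.append("No assumptions recorded: room count? tour duration?")
    if not spec.hard_constraints:
        issues.append("No hard constraints: what must always hold?")
    if not spec.edge_case_policies:
        issues.append("No guidance for conflicts or missing assets.")
    return issues
```

The point of the lint pass is conversational, not punitive: each flag becomes a question the tool asks the busy domain owner, so structure accretes without forcing anyone to write an SRS by hand.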
That is the gap this article is about. The brief was good enough to get a useful result once. Making it good enough to get reliable results every time — across changing parameters, different models, and real operational pressure — is a different problem, and one worth solving.