Agentic Engineering Hiring Interview 2026 [+21-Question PDF]

11 min read

We’re convinced: classic coding interviews filter out the wrong candidates in 2026. Whiteboard algorithms test pattern matching under time pressure, a skill that LLMs now do better than any senior. What they don’t check: spec discipline, plan mode, diff review, token hygiene, sub-agent orchestration. Those are the skills that separate a 2026 senior from a mid-level. And they’re testable in 45 minutes on a real codebase.

What this means for hiring: anyone still asking reverse-a-linked-list in 2026 is hiring against a 2019 skill profile. Andrej Karpathy made the point at Sequoia’s AI Ascent 2026: “Most people have still not refactored their hiring process for agentic engineering capability. If you’re giving out puzzles to solve, this is still the old paradigm.”

This article is the selection deep-dive in our cluster on Agentic Engineering and Hiring 2026. It covers the 4-phase format, 21 questions with green-flag and red-flag scoring, and the full interview guide as a PDF download. Built from 8 CTO programs, 50+ active 2026 hiring briefings, and a close analysis of four source interviews with Karpathy, Brockman, Liu, and Cherny.

→ Jump to PDF download

Why whiteboard exercises send the wrong signal in 2026

Karpathy’s April 2026 pivot is described as a terminology shift in What is Agentic Engineering. What concerns us here is the operational consequence: when an engineer spends 80 percent of their time orchestrating (writing specs, reviewing plans, launching sub-agents, reading diffs, managing tokens), the interview has to test exactly those activities.

A whiteboard interview doesn’t. It tests:

  • Who can reconstruct a mid-sized algorithm in 45 minutes under observation.
  • Who has Big-O notation memorized.
  • Who can write under stress without an IDE.

Three skills nobody needs in their day-to-day work in 2026. In other words: you’re hiring for what mattered seven years ago.

What a 2026 senior actually does every day:

  • Slice a spec to 1-2 days of work, not too small and not too big.
  • Use plan mode before an agent touches a single file.
  • Run three to eight parallel sessions across different branches (see the worktree sketch after this list).
  • Watch token spend and pick the tier deliberately.
  • Read diffs before accepting them. No Accept All.
  • Write their own skills that solve recurring problems structurally.
  • Spot a stuck loop, restart instead of rabbit-holing.
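
The parallel-session item usually rests on git worktrees: one checkout per branch, one agent session per checkout. A minimal sketch of the mechanic, with illustrative branch and directory names (not taken from the interview guide):

```bash
# One worktree per task branch: each agent session gets its own working
# directory and can run builds and tests without colliding with the others.
git worktree add ../app-fix-pagination -b fix/pagination
git worktree add ../app-refactor-auth  -b refactor/auth

# Then run one agent session per directory, e.g. one terminal tab each:
#   cd ../app-fix-pagination && claude
#   cd ../app-refactor-auth  && claude

# Remove the worktree once the branch is merged.
git worktree remove ../app-fix-pagination
```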

That’s exactly what the format below checks. Question by question.

The 4-phase interview format at a glance

The format runs in 45 minutes with a deliberate order: open soft with the workflow narrative, then go hands-on with the setup demo plus live task, then failure reflection, then calibration on persistence and anti-patterns.

Phase 1: Workflow narrative (15 min, questions 1-7)

The candidate describes what their typical day looks like: subscription tier, share of code written by hand vs. by AI, spec workflow, custom skills, secret hygiene, review routines, default instructions to the AI. In 15 minutes, this phase reveals whether someone actually works agentically or just talks about tools they use occasionally.

Phase 2: Setup demo plus live task (20-25 min, questions 8-11)

Share screen. Which tools are actually running? What does their CLAUDE.md look like? How many worktrees and parallel sessions? Then a live task on a real codebase, deliberately not LeetCode style. Four task variants by target role are in the PDF (bug reproduction, refactor with spec, greenfield architecture, build-and-break take-home).
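
For interviewers who haven’t seen one: CLAUDE.md is the project-level instruction file Claude Code loads into every session. What a lived-in one contains varies by team; the sketch below is a hypothetical example of the kind of content you’d want to see on screen, not a template from the PDF:

```markdown
# CLAUDE.md (illustrative example)

## Build & test
- Run `npm test` before proposing any diff as done.
- Never touch files under `migrations/` without asking first.

## Conventions
- API errors follow the error envelope documented in docs/errors.md.
- No new dependencies without an explicit go-ahead.

## Standing instructions
- Start in plan mode for any task touching more than one module.
- No external communications actions (email, Slack) on auto-approve.
```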

Phase 3: Failure narrative (10 min, questions 12-15)

“Tell me about the last time an agent built complete nonsense.” This is where the vibe coder separates from the senior. Anyone who breezes through without self-reflection either hasn’t practiced enough or is deep in Dunning-Kruger territory. Anyone who can talk openly about productive failure has actually lived the workflow.

Phase 4: Calibration (10 min, questions 16-21)

Persistence vs. curation, anti-patterns, multiplier skills. Howie Liu named the most common practitioner mistake on Greg Isenberg’s podcast in 2026: “They oneshot something, it’s not quite as profound as what they hoped for, and they kind of give up. The agents are powerful enough to do almost anything you want. The issue is whether you are able to invest the time and coaching and curation to get it there.” This phase checks exactly that: whether the candidate can curate their tool, not just fire it.

Build-and-Break Take-Home (optional, 60 min)

Deliberately not a standard exercise. Karpathy used his Twitter-clone example in the Sequoia talk: “Hiring has to look like: give me a really big project and see someone implement that big project. Like, let’s write a Twitter clone for agents and then make it really good, make it really secure. And then I’m going to use 10 Codex agents to try to break your website. They should not be able to break it.” Build and break in one setup. The sharpest hiring probe available for senior architects and security-relevant roles.

Four sample questions from the interview guide

Rather than copy the entire 21-question PDF into the article, here are four questions as a sample, with the reasoning for why each one works. Every question in the PDF has a worked-out green-flag and red-flag list.

Question 2: Which subscription tier do you run on your daily driver?

The answer reveals more in 10 seconds than three CV pages. Anyone still on a 20-dollar Pro plan in 2026 is not senior in agentic, no matter what the resume says. Senior practitioners run Max 5x or Max 20x with Anthropic, Pro+ or Ultra with Cursor, or direct API access. They can name a rough monthly spend (north of 100 dollars), they’ve blown through quotas at least once, and they can justify their tier choice with concrete workload.

This lines up with Tomasz Tunguz’s thesis on the fourth compensation component: token spend is a 2026 engineering KPI like headcount or cloud spend. No token spend, no hire.

Question 4: Which skills or slash commands have you written yourself? Do you have a pattern for them?

The sharpest question in Phase 1, because it exposes the gap between tool user and tool architect. Anyone who treats skills as just a boilerplate collection has missed the point. Senior answers show: at least 1-2 hand-written skills with a clear pattern (rules + checklist + guide per topic), reuse across projects, an understanding of skills as an onboarding tool for new team members.

Mid-level answers sound like “haven’t needed any so far” or “I just use the default skills.” The gap isn’t about craft. It’s conceptual. And it shows up in two minutes.
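
To make “a clear pattern” concrete: in Claude Code, a custom skill is a markdown file the agent loads on demand (by convention .claude/skills/<name>/SKILL.md with YAML frontmatter). The sketch below is a hypothetical example of the rules + checklist + guide pattern described above, not an excerpt from the PDF:

```markdown
---
name: api-error-handling
description: How this codebase wraps, logs, and surfaces API errors.
---

## Rules
- Every handler returns the shared error envelope; never raw stack traces.
- New error codes go into docs/errors.md in the same PR.

## Checklist
- [ ] The error path has a test, not just the happy path.
- [ ] Log lines include request ID and error code.

## Guide
See src/http/errors.ts for the envelope type and a worked example.
```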

Question 14: When did you last say no to the AI? What did it propose that you rejected?

The question forces honesty. Anyone who never says no accepts too much. What marks a senior isn’t “nothing has ever happened to me,” it’s: “Here’s what happened. Here’s what I’ve done differently since.”

Concrete senior example: the agent proposes a technically clean solution that violates a team coding standard or sidesteps an architecture decision documented in onboarding. The senior says no and translates that no into a skill rule or a CLAUDE.md note so it doesn’t come back. Saying no is a core senior responsibility, not an exception.

The sharpest variant: “Do you recognize overshooting proactivity as a typical agent EQ failure?” Anyone with their own heuristics against it (“no escalations without explicit approval,” “no external communications actions on auto-approve”) shows operational maturity.

Question 19: Tell me about a workflow you only got working after multiple attempts. How long did you keep at it?

Liu calls giving up after the first try the most common practitioner mistake. Anyone who can’t answer this convincingly hasn’t really integrated the agentic workflow.

Senior answers have structure: workflow X didn’t work the first time, worked better the second, worked well the third. The candidate can describe which iterations were needed: refined the skill, adjusted the spec, switched tools, built their own heuristic for when to push a pattern and when to drop it.

Mid-level answers confuse persistence with stubbornness (“I just keep trying until it works”), with no reflection on what’s actually being adjusted. Or worse: “If it doesn’t work the first time, the tool isn’t ready.” That’s exactly the stance Liu names as the practitioner’s main failure mode.

Live task instead of algorithm: what practice actually shows

Phase 2 is the hardest part. And the part where most 2026 hiring managers fail. Classic coding challenges are designed to be solvable in isolation. One function, one clear input-output mapping, one right answer. That worked in 2019 because engineers spent their day on isolated functions.

In 2026 the day looks different. The task is ambiguous. The context is incomplete. The solution requires architecture decisions before the first line of code is written. Live tasks in the interview have to simulate exactly that.

Concrete example from the PDF, task A (bug reproduction with fix): “Here’s a small API with two endpoints. Users report: on the third pagination page request, the same items sometimes come back. Reproduce the bug, find the cause, write a fix with a matching test. 10 minutes.”

What you watch is not whether the bug gets found. It’s how:

  • Does the candidate write a reproduction test first that catches the bug? Or jump straight into the code?
  • How do they use the agent: as a reproduction helper (good) or directly as a fix generator (red flag)?
  • Did they use plan mode before changing anything?
  • Do they read the test output carefully, or accept the first green signal?
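
What “reproduction test first” can look like for this task, as a minimal self-contained sketch: the data shape, field names, and query_page function below are invented stand-ins, and the bug class shown (offset pagination over a sort order with no unique tiebreaker) is one plausible cause, not the PDF’s answer key:

```python
import random

# Hypothetical in-memory stand-in for the endpoint's data layer. Many rows
# share the same created_at value, as after a bulk import.
ROWS = [{"id": i, "created_at": i // 7} for i in range(100)]

def query_page(page: int, size: int = 10) -> list[dict]:
    """Buggy version: ORDER BY created_at with no unique tiebreaker.

    A real database may return tied rows in a different order on every
    request; the shuffle before the (stable) sort simulates that.
    """
    rows = random.sample(ROWS, k=len(ROWS))
    rows.sort(key=lambda r: r["created_at"])
    start = (page - 1) * size
    return rows[start:start + size]

def test_pagination_has_no_duplicates():
    # The reproduction test, written before touching the code: walk the
    # first three pages and assert no item shows up twice. Against the
    # buggy query_page it fails on almost every run.
    seen: set[int] = set()
    for page in (1, 2, 3):
        for row in query_page(page):
            assert row["id"] not in seen, f"item {row['id']} repeated"
            seen.add(row["id"])

# The fix that turns the test green: make the sort key total, e.g.
#   rows.sort(key=lambda r: (r["created_at"], r["id"]))
```

The candidate you want gets there in roughly this order: test first, failure observed, then the one-line sort-key fix, with the diff read before accepting.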

Three of the four tasks in the PDF follow this pattern. Task D, the build-and-break take-home, goes one step further: 60 minutes build, then agents trying to break the system. Karpathy’s original template, no decoration.

What you learn from this format that was invisible before

Classic interviews end with an unresolved gut feeling: “Was that a good one?” Hiring managers then rationalize with bullet points on coding speed, likability, and CV patterns. Most bad hires happen in exactly that vacuum.

The 4-phase format gives you, after 45 minutes, concrete answers to six questions that classic interviews leave open:

  1. Token maturity: does the candidate run a setup that carries 2026 workload? (Phase 1, question 2)
  2. Spec discipline: can they slice specs instead of falling into waterfall? (Phase 1, question 3)
  3. Skill architecture: do they write their own skills with a pattern? (Phase 1, question 4)
  4. Setup hygiene: do they have a real CLAUDE.md, worktrees, parallel sessions? (Phase 2, questions 8-10)
  5. Failure reflection: can they speak openly about productive failure? (Phase 3, all questions)
  6. Multiplier ability: would they ramp a mid-level engineer onto the tool? (Phase 4, question 21)

The scoring heuristic in the PDF: at least 15 of 21 questions in the green-flag range for a 2026 senior hire. More than 5 red flags is a stop signal, regardless of the rest. Question 17 (AI vs. human reviewer) is the sharpest seniority test in the whole format. Anyone who answers black-and-white isn’t senior.

The format isn’t a new invention. It’s an adaptation of the classic behavioral interview to the reality that tool setup now says more about an engineer than algorithm knowledge.

What’s inside the interview guide PDF

The full interview guide includes:

  • All 21 questions with concrete green-flag and red-flag answers per question.
  • Four live-task variants (A-D) for different target roles: bug reproduction (backend senior default), refactor with spec (frontend/mobile), greenfield architecture (tech lead), build-and-break take-home (senior architect, security). Each with scoring criteria.
  • Build-and-Break Take-Home template with Karpathy’s Twitter-clone format as the original reference and a concrete brief for a 60-minute build-and-break setup.
  • Scoring heuristic: the 15-of-21 threshold, the sharpest seniority tests, the typical Phase-1 weaknesses as early indicators.

Sequential. Ready to use in your next hiring call. Print-formatted.

→ Jump to PDF download

If you’re hiring right now

The typical observation in 2026 CTO calls: the hiring format hasn’t changed in three years. Coding challenge, system design, cultural fit. Nobody asks about CLAUDE.md. Nobody checks token spend. Nobody asks when the candidate last said no to the AI.

The result: hires that looked good in the interview and underwhelm in day-to-day work. We place senior freelancers whose workflow already runs at that level from day one. And we help engineering teams bring their own hiring format up to a 2026 standard.

Drop me a quick note on LinkedIn about where you stand. Or send a concrete request to our team. We get back within 48 hours.

FAQs

Why don't classic coding interviews work in 2026?

A whiteboard algorithm tests pattern matching under time pressure. In 2019 that was a decent proxy for engineer quality. In 2026 it just tests whether someone would have been a good engineer seven years ago. What it doesn't test: spec discipline, plan mode, diff review, token hygiene, sub-agent orchestration. Those are the skills that separate a 2026 senior from a mid-level.

What should a modern interview format check instead?

Three things. First, workflow maturity: how the candidate structures their day, which subscription tier they run, how they write specs, when they use plan mode. Second, setup hygiene live: what's in their CLAUDE.md, how many parallel sessions they run, how they handle secrets. Third, failure reflection: where they failed productively and what they changed because of it. Howie Liu put it this way: 'The agents are powerful enough to do almost anything you want. The issue is whether you are able to invest the time and coaching and curation to get it there.' That's exactly what a good format checks.

How long should an agentic engineering interview be?

45 minutes across four phases plus an optional take-home. Phase 1 (15 min) workflow narrative, 7 questions. Phase 2 (20-25 min) setup demo plus live task on a real codebase, 4 questions. Phase 3 (10 min) failure narrative, 4 questions. Phase 4 (10 min) calibration, 6 questions. The take-home as a build-and-break format: deliberately not a standard exercise, but a brief where the candidate has to decide where spec ends and code begins.

How many green flags should a senior hire score?

At least 15 of 21 questions should land clearly in the green-flag range for a 2026 senior hire. More than 5 red flags is a stop signal, regardless of the rest. Phase 4 question 17 (AI vs. human reviewer) is the sharpest seniority test in the whole guide. Anyone who answers black-and-white isn't senior in agentic. Phase 4 question 19 (persistence-curation, after Liu) is the most hidden test.


Ralf Gehrer

CTO & Co-founder of ElevateX and your contact for agentic engineering, AI hiring, and senior-freelance setups.
