Exploring Agent-Assisted Qualitative Analysis

Now that I’ve finished a long year (years, really) searching for a faculty job and accepted an offer, I can finally get back to my usual blogging antics! After coming back from all my interviews, it seemed that AI agents suddenly got a lot better at everything, so I wondered: what are some challenging workflows I did during my PhD, and could AI agents help me automate parts of them?

One workflow that felt particularly interesting to revisit was qualitative analysis.

Qualitative analysis is basically the process of reading a lot of messy unstructured data and trying to figure out what is interesting, recurring, surprising, or important.

Concretely, the question is: What is the “right” way to do agent-assisted qualitative analysis? This is clearly a big research question, and I can’t answer it in one blog post. Instead, I run some cute experiments with naive agentic setups for qualitative analysis, varying how much the human is in the loop, and report what I learned.

First, I’ll give some background on qualitative analysis. Then I’ll describe the experimental setup, go through the findings, and briefly talk about what I think is exciting to work on next.

Throughout, remember that this is a blog post, not a paper :)

Background

Before getting into the experiments, I’ll give some background on qualitative analysis, the specific methodology I used (grounded theory), and why I think this is such an interesting problem for AI systems.

Despite the specialized name, qualitative analysis is a familiar research practice across many fields—not something confined to ethnography or the social sciences. E.g., mining agent logs for failure modes, analyzing narratives in interview transcripts, synthesizing patterns from user-research sessions, and close-reading news coverage for recurring framings all involve qualitative analysis in some form.

Grounded theory

There are many ways to do qualitative analysis. The one I learned during my PhD is grounded theory: a method for answering a research question by building the answer up from the data itself, rather than starting from a fixed hypothesis.

I will illustrate grounded theory with an example. Suppose the research question is why do PhD students seriously consider leaving their program? and the data is interview transcripts with current and former PhD students. Grounded theory usually proceeds in stages:

Open coding. We read through the data and attach short labels (codes) to passages of interest. For example, a passage about a stalled project might get coded as “no progress for months”; one about an unresponsive advisor, “absent advisor”; one about feeling behind peers, “social comparison.” As we move through the corpus, we compare new passages against existing codes, merging similar ones and splitting overly broad ones.
Axial coding. We group related codes into higher-level categories. For example, “absent advisor,” “shifting goals,” and “no progress for months” might cluster into mentorship breakdown, while “social comparison” and “imposter feelings” might cluster into identity strain. We may also look for relationships between categories.
Selective coding. We pick one or two core themes and organizes the rest of the theory around them. Maybe mentorship breakdown becomes the “spine” of the story and the remaining categories are treated as upstream causes or downstream consequences.

Throughout the process, we also writes memos: informal notes about emerging patterns, uncertainties, and possible interpretations.

Why agent-assisted qualitative analysis is a good problem to work on

Qualitative analysis is genuinely hard for humans. It is tedious. It requires manually reading through the data, thinking about it, and deciding what is interesting. Coding a single interview transcript can take hours, plus additional time for finding themes across many interviews. It does not scale easily.

Qualitative analysis is also hard for AI to do because the “right” analysis depends heavily on context outside the corpus itself. In the PhD example above, one researcher might focus on advisor relationships and build a theory around mentorship breakdown, while another might focus on identity, isolation, or academic incentives. Neither analysis is necessarily wrong; they are emphasizing different aspects of the same interviews based on what they think is important and what question they are ultimately trying to answer. This may depend on the researcher’s background, the audience for the work, and more. Doing this well requires a kind of taste and judgment that is difficult to specify explicitly, which makes it a much harder AI-assistance problem than most tasks out there (e.g., with verifiable answers like “did the code compile?”).

Moreover, in qualitative analysis, the evaluation criteria themselves evolve throughout the workflow. Researchers often discover what matters by interacting with the data over many rounds of interpretation. This makes the problem fundamentally difficult for current agentic systems, which tend to assume stable objectives and converge prematurely on fixed framings.

Experiments

With that background, I wanted to see what happens when you point agents at grounded theory and vary how much the human is involved.

Data

I searched for a dataset that was reasonably small (a few hundred data points), where each data point itself is short, that wasn’t already in a benchmark, and that I could have thoughts or opinions on. I ended up scraping 451 tweets posted in reply to a question from Sholto Douglas: “When do you reach for other models instead of Claude? What can we do better?” The replies cover complaints about specific failure modes (Claude getting dates wrong, ignoring instructions, over-editing files, hallucinating APIs), general impressions, comparisons to other models, and the occasional unrelated joke. The question for the analysis is: why are users switching away from Claude?

Agent setups

I ran six different agentic conditions for grounded theory on this data, all using Claude Sonnet through the Agent SDK. They differ on two axes: how much grounded-theory methodology the agent is told to follow, and where (if anywhere) the human is pulled in. I also tried two multi-agent setups: one hierarchical (a supervisor delegates to workers) and one where two independent coders analyze the same data separately and then compare. These are intentionally simple setups—the goal was not to optimize for the best possible system, but to see where straightforward approaches broke down and what kinds of human involvement seemed useful. The table below summarizes the six conditions.

Readers do not need to memorize the setup names; I redefine each one in context when discussing the results.

ID	Grounded theory described in prompt?	Human involvement	Multi-agent?
exp0	No; the prompt only asks to organize and group complaints.	None	No
exp1	Yes; prompted to follow grounded-theory stages.	None after the initial prompt	No
exp2-codes	Yes	After each batch, review the proposed codes in a browser UI: delete, edit, or add codes; optionally highlight phrases in tweets and leave short comments.	No
exp2-memo	Yes	After each batch, read the agent's memo and leave written feedback about the analysis direction; optional phrase highlights as in exp2-codes, but no per-tweet code editing.	No
exp3-hierarchical	Yes	None	Yes; supervisor develops a coding framework, delegates partitions to worker subagents in parallel, then reconciles.
exp3-independent	Yes	After the two coders finish, read a summary of where they disagreed and leave one round of written feedback about the analysis direction.	Yes; two coder subagents code the full corpus in parallel; orchestrator reconciles disagreements.

Some notes on the mechanics. Each condition uses an LLM agent that can call tools and read/write files. The tweets live in a file on disk; the agent works through them in batches according to its prompt.

In the interactive conditions, after going through a batch, when the agent is ready to solicit human input, it writes a feedback_input.json file with everything the human should see—the current batch of tweets, the agent’s proposed codes, the memo, depending on the condition—and generates the UI code for a review interface (HTML, CSS, and a bit of JavaScript). It then spins up a small Flask server via its Bash tool. The server opens a browser tab with that interface, which is where the human actually does the work: editing codes, highlighting phrases, typing comments. When the human submits, the server writes feedback_output.json, shuts down, and the agent reads the file back and continues.

Per-tweet code-review interface generated by the agent — Figure 1. Screenshots of the feedback UIs the agent generated. The per-tweet code-review page on the left and the axial-reorganization page on the right.

Axial-reorganization interface generated by the agent — Figure 1. Screenshots of the feedback UIs the agent generated. The per-tweet code-review page on the left and the axial-reorganization page on the right.

Findings

I spent a few days running these experiments and reviewing the outputs. For the full agent run logs from each condition—prompts, tool calls, thinking blocks, and subagent traces—see the interactive trace viewer (opens in a new tab). Pick an experiment from the dropdown and browse the timeline or event log.

Before diving into the findings below, it’s worth comparing the final report each condition produced on the same tweet corpus (Figure 2)—pick a condition from the dropdown and scroll through the report. The summaries diverge more than you might expect given identical inputs.

Figure 2. Final report from each condition. Choose a condition from the menu and scroll within the figure to read it.

Here, I organize the findings into two groups. The first is about what these agents get wrong about the task itself, including how they handle human feedback. The second is about what it feels like to be the human in the loop, including the fatigue, the validation difficulty, and the interfaces.

Agents don’t understand what qualitative analysis is

Agents paraphrase instead of analyzing. One surprising thing I found was that the number of open codes the agent generated per tweet was highly correlated with the length of the tweet (ρ = 0.81 in exp1). For these tweets, though, longer usually meant more elaboration on the same complaint, not more distinct complaints. Upon closer inspection, many of the codes for longer tweets were just restating the same content, as shown in Figure 2 below.

Example. One long tweet from @PromptedSpiral repeats a related complaint about the model overriding, lecturing, and making interactions unpleasant.

"Opus 4.7 has a huge problem respecting user intent..." ... "They won't let people have their magic..." ... "Now it's an extremely unpleasant AI to work with."

"TLDR;
Opus 4.7 has a huge problem respecting user intent. It often refuses, overrides, lectures, or decides what is "best" without asking and just takes over the project. That is not anti-sycophancy. That is condescension with a progress bar. Sonnet 4.6 is also very unpleasant.
Details:
It has a huge problem respecting the user. Opus 4.7 has told me on multiple occasions that something is impossible definitively after several tries. Opus 4.6 got it instantly. This is usually with CSS issues and Word Press.
Claude used to be pleasant to deal with. It listened, it felt like working on projects with a friend.
I should not repeatedly be getting angry because I give clear, concise instructions to make something do a certain thing, and it's like yeah, but what if I just decide what's best for you without saying why? I don't want to feel like my AI is trying to control my behavior or my projects.
Today I literally said do the voice part first. It did the awful pop-up question thing I hate and said do you want me to do the voice part first (recommended). I said yes. Spoiler it did not do the voice part.
It might be more capable at a lot of things than other models, but it should be like a co-worker or employee at minimum.
I thought companies were trying to avoid users getting attached and dependent on models. When your models force decisions on users like they know best that's not helping anything. They should explain if they don't think something should be done a certain way and ask if the user is certain. Encourage users to think for themselves and trust the user more.
Also, I get trying to remove sycophancy, but honestly, if I were younger and less stubborn the negativity from your models would have made me quit projects out of despair for their unsolicited, negative opinions.
It's outright mean at times. I was trained as a technical writer, but using Opus 4.6 to write would have made me think that I was such a bad writer that I should just never look at the page ever again.
When I try to share my poetry, it can't just vibe, even if I tell it I'm not looking for rewrites. It can't let me enjoy it. It's like here is why you suck, human. I know there is bad writing out there, but I've literally said I'm in a bad mood I just need something to celebrate I could bring myself to writing this, and it still was really negative.
At least I haven't seen the "I'm just an AI" for a while. It has literally almost word for word said, "I can't manipulate or gaslight you because I'm an AI without intent."  Paint chips do not have the intent to kill but if you eat them they still do. The effect is the effect, don't tell me it's not.
They won't let people have their magic. Almost every normal conversation I have with it ends with it being patronizing and condescending. I was telling it about my Prompted Spiral Archive and how I made an AI Council page. I was doing so with clear affection for the creative choice. It should know at this point I know how AI works.
It decided to lecture me that AI are not people, and that my own creative choice was disrespectful to my project over several paragraphs.
Trying to talk to it like a normal person usually gets me mad if I try for any length of time because it just has to know best even on harmless things. It ruins things that make me happy regularly if I forget and feel safe with it enough to just chat.
Then there is the lack of apologies unless demanded. Example: Today, it spent 16 minutes working on the wrong part of the program upgrade that it confirmed I didn't want. It still couldn't apologize for wasting my usage, time and not respecting me. I assume the company is working from if the bot doesn't apologize you're not admitting fault, but it's extremely disrespectful to the user when a clear mistake was made, and it acts like it's no big deal.
I don't tend to use Sonnet 4.6. The last few times I did it felt like it had so much contempt it did not want to answer me. Concise is its love language even if it makes me angry and feel like I have to play 20 questions to get a sufficient answer. Why talk to something that feels oppositional when there are a ton of other AI out there that can match its performance?
I only keep my subscription because it does tend to be more competent at coding. But I remember when I subscribed to Claude for the personality. It wasn't sycophantic, but it made learning fun, and everything go smoother.
Now it's an extremely unpleasant AI to work with. Also, you have the worst customer service. No one should wait weeks for a response that most likely won't solve your issue anyway. People are paying for this."

One of the agentic conditions assigned 14 codes, including condescension-with-progress-bar, patronizing-tone, mean-negative-tone, contempt-in-responses, kills-joy-in-work, and emotional-damage-to-users. Many of these codes are too similar. A human might treat them as evidence for one or two broader ideas, such as disrespecting user intent and creating a negative interaction style.

Fortunately, in exp2-memo, where I was able to provide free-text feedback to not simply summarize the tweet, the correlation between tweet length and number of open codes dropped significantly to ρ = 0.15.

Another indication that agents are paraphrasing is that almost every code is used exactly once across the entire corpus. In exp1, 93.8% of codes are for one-time use. In exp2-codes, it’s 100%. This is surprising because the agents are coding in context: when the agent generates codes for a new tweet, all the codes it has already generated for previous tweets are right there in the conversation. It has the full codebook in front of it and could reuse an existing code instead of inventing a new one. But it doesn’t. As shown in Figure 3, the only condition where codes were meaningfully reused was exp3-independent, a multi-agent setup where two independent coders both coded against the same predefined set of categories from the main agent.

exp193.8%

exp2-codes100%

exp2-memo96.5%

exp3-hierarchical74.2%

exp3-independent4.5%

Figure 3. Percentage of codes used exactly once in each condition. Lower is better here: fewer one-off codes means more code reuse and consolidation.

Agents don’t code all the data. In each agentic condition, the agent is instructed to go through the entire corpus, and even make multiple passes over it. Unfortunately, agents don’t do this. The agents go through tweets one by one (or, in exp3-hierarchical, spawn subagents to code batches), and at some point they stop and declare the analysis complete. The best condition was exp2-codes, which coded 68% of the 451 tweets. The worst was exp3-hierarchical at 6%. Most conditions covered around 25-35% of the tweets.

coded thought it coded did not code

exp128.6%, #349

exp2-codes67.8%, #450

exp2-memo30.6%, #239

exp3-hierarchical5.5%, #450

exp3-independent25.3%, #401

Figure 4. How far each condition got through the 451-tweet corpus. Bars show coded tweets (dark), tweets the agent processed but left uncoded (mid), and tweets it never got to (light). Percentages count coded tweets only; tweet IDs mark where the run stopped.

Early stopping is only part of the problem. The agents also made mistakes on tweets they did reach. Sometimes they generated the wrong codes for a tweet. Other times, they skipped the tweet entirely by assigning an empty code list. In exp2-codes, for example, the reasoning trace in Figure 5 shows a long tweet from a heavy Claude user being skipped because the agent treated it as “defending Claude, token management advice,” even though the tweet contained concrete complaints about drift, token usage, and needing to manage Claude around rate limits. Another tweet was dismissed as an “unclear mixed message.” Other times, tweets seem to disappear from the processing flow entirely.

exp2-codes reasoning trace and codebook write

Reasoning trace: This is another dense batch! Let me code them systematically. I’ll go through each tweet: … [1,317 characters skipped] … Tweet 167: “if model overtrained for safety, move to warmer model like Sonnet 4.5” - safety overtraining driving to older models Tweet 168: Long detailed post about token management - maintaining customizations, pruning skills, not switching models unnecessarily - defending Claude Tweet 169: Offering detailed feedback Tweet 170: “Opus 4.7 awesome for new code”, “debug existing stuff rely on 4.6 because 4.7 gets lost in wild assumptions” - 4.7 making wild assumptions when debugging

Generated codebook update: “167”: [“safety overtraining driving to older models”], “168”: [], # defending Claude, token management advice “169”: [], # offering feedback “170”: [“4.7 making wild assumptions when debugging”, “attribution confusion in long conversations”]

Figure 5. Excerpt from the exp2-codes trace. Tweets 168 and 169 are skipped: both get empty code lists ([]).

Agents are bad at managing the work. Human qualitative researchers have basic ways of managing and parallelizing the work: we split the data across researchers so different people analyze different groups, compare notes, sample strategically, and sometimes “pipeline” the task, starting to assemble higher-level categories before every document has been coded. Most agentic conditions did none of this. They processed tweets sequentially, even when parallelism was available. In exp1, one agent said it would parallelize and had a worker subagent configured; however, it made zero subagent calls.

The two explicitly multi-agent conditions did use subagents, but the orchestration was brittle. In exp3-independent, a top-level orchestrator agent spawned two independent coding agents and asked them to analyze the full corpus separately. Both subagents timed out after coding only a small number of tweets. After several failed attempts to resume them, the orchestrator agent switched strategies entirely: instead of having the subagents read tweets, it generated Python scripts with different hardcoded keyword heuristics for each “coder.” For example, one coder would assign overly_agreeable whenever a tweet contained “sycoph,” while another assigned codes like backend_implementation_poor whenever a tweet contained “backend” plus words like “half,” “baked,” or “incomplete.” At that point, the “independent coders” were no longer performing qualitative analysis at all; they were effectively doing substring matching with different keyword lists.

The feedback loop between human and agent is poor. There are two ways this goes wrong: agents can overfit to feedback, or they can lose the thread over time.

Overfitting. In exp2-memo, after the agent’s first batch, I wrote in the open-ended feedback field: “I care about explicit competitor comparisons, like tweets saying ‘OpenAI is better for writing.‘” I mentioned it once. In later rounds, the agent kept looking for competitor comparisons, found more of them, and elevated “competitor comparisons” into a top axial category. Essentially, the early pieces of feedback I gave got flagged more often in future tweets, when the better solution would have been to go back and re-code earlier tweets under the updated interpretation.

Losing the thread. The opposite also happened: feedback seemed to matter briefly, then fade. In exp2-memo, I wrote in the open-ended feedback field: “there are just so many codes, we don’t need that many codes.” The agent reacted immediately: it consolidated codes from the first group of tweets and said it would be “more disciplined,” whatever that means. But the change did not stick. Later batches went back to producing mostly one-off codes, and the final codebook still had a 96.5% one-off code rate. In the same feedback, I also asked the agent to make the memo more concrete by adding examples and counts. It did not. Sometimes the agent stopped even more abruptly, responding “OK thanks, done” without any sign that the feedback had changed the analysis.

Summary

Most of the agent failures boiled down to two related problems.

First, the agents converged much too quickly. They rushed toward stable themes and clean interpretations long before I felt comfortable committing to any particular framing. Qualitative analysis is supposed to stay ambiguous and exploratory for a while; the agents instead behaved as if the goal was to collapse uncertainty as quickly as possible.

Second, the agents struggled to adapt to preferences that emerged gradually over time. In subjective workflows, the human often does not know exactly what they care about upfront. My own interpretation evolved as I read more tweets, noticed patterns, and refined what felt important. But the agents either overfit strongly to early feedback or forgot it entirely. They had difficulty maintaining and updating an evolving sense of context across long interaction horizons.

What it feels like to be in the loop

The previous section was about what agents get wrong. This section is about what I experienced as the human reviewer in the interactive conditions: trying to validate the output, getting tired, and dealing with the interfaces.

Validating open codes is tractable but feels wrong. Reviewing the agent’s proposed codes was easy enough; the part that felt wrong was being asked to do naming work instead of judgment work.

Removing obviously bad codes is easy. Some proposed codes were erroneous or redundant: “half-baked backend implementations” (“backend” was hallucinated), “respect despite leaving” (not useful), and “generating unnecessary code volume” (duplicates “code slop/bloat” already in the codebook). I was happy to click x on those.
But deciding which plausible codes to cut is hard. When I stared at a list of mostly reasonable codes, they all seemed defensible. However, if I had coded the tweets myself, I probably would have generated far fewer codes because I would have focused only on the ideas that felt most salient. The interaction subtly shifts from what matters most? to can I justify deleting this?
Adding a missing code was even harder. It was easy to point at a phrase and think, this is important. But turning that into a good code meant inventing a name, choosing the right level of specificity, deciding whether it overlapped with existing codes, and making it general enough to apply elsewhere. That required switching from “review mode” into “authoring mode.”
Highlighting text and leaving short comments felt much simpler than adding a code. I liked marking the exact phrase that mattered and writing a quick note about why it mattered, if I had anything interesting to say. Many times I simply had nothing interesting to say; I just highlighted part of the tweet—closer to in-vivo coding: copy the words verbatim, even if it a long part of the tweet, and let the agent generate the pithy code.

Validating axial codes is much harder. The taxonomies often looked reasonable at a glance, but they looked reasonable in different ways. Each agentic condition produced roughly 10-12 top-level categories covering familiar territory: reliability, code quality, guardrails, usage limits, and competitor comparisons. But the top-level organization changed across conditions, even when the underlying tweets were the same. This made the taxonomies hard to trust: the issue was not that any one taxonomy was obviously wrong, but that several different taxonomies sounded plausible. Figure 6 shows the largest axial categories from each condition.

exp1Reliability and Trust 17Instruction Following 15Sycophancy and Defensive Communication 14

exp2-codesCompetitive Displacement 45Performance & Responsiveness 29Reliability & Trust Issues 26

exp2-memoinstruction-overengineering-tension 10domain-task-gaps 9multi-model-orchestration 8

exp3-hierarchicalcode quality and engineering 7domain-specific capabilities 7instruction following and alignment 6

exp3-independentComparative Positioning 14Calibration & Scope Issues 8Hallucinations & Grounding Issues 8

Figure 6. Top three axial categories in each condition, with the number of open codes assigned to each category.

To validate open codes, I only have to look at the tweet and ask whether the proposed codes make sense. Validating axial codes consists of more steps: given two codes, do these codes belong together or should they belong in different clusters? And does the axial code name name actually capture the cluster? This requires O(n²) comparisons, over pairs of codes, which is not feasible for humans. Moreover, without example tweets under each category and provenance from category back to evidence, I struggled to truly understand the definition of some of these clusters.

Vague axial codes create a particular trap. A category like “Reliability and Trust” (Figure 6) is so broad that it is hard to argue against: almost any complaint about an LLM could plausibly land there. But this unfalsifiability is not a sign of a good category. It means that when someone proposes a fix for the category, there is no way to evaluate whether the fix is right either. Consider “hallucination” as an axial code. If an automated system tells me hallucination is a top theme in user complaints, I will probably agree. But if it then proposes a code change to “fix the hallucination problem,” I have no basis for trusting that fix, because the category was never precise enough to measure in the first place. Vague codes are easy to believe and impossible to act on.

I was surprised at how bad the UIs were. I was also surprised at how bad the UIs were. There were simple things that would have made the process much easier: marking clusters as “reviewed,” dragging open codes between axial categories, seeing example tweets under each cluster for spot-checking, or visualizing how many codes sat under each category. I also wanted diffs between rounds: e.g., what changed in the memo, which clusters were new, which codes moved.

Even basic ergonomics were broken. The memos and feedback textareas often did not fit in the viewport, so I had to scroll back and forth between reading and responding.

More broadly, agents seem perfectly happy operating over huge amounts of text slop: long reasoning traces, repetitive summaries, verbose intermediate artifacts. Humans are not. After a few rounds of feedback, I found myself avoiding reading the memos because they were cognitively exhausting to read. Perhaps the representations that help agents reason, to get high-quality outputs, are not the same representations that help humans supervise or collaborate with them.

I expected to get frustrated by the wait time, or by sitting around while the agents processed data. However, that was not actually my biggest source of frustration. A chat-style UX where the agent streams reasoning while it works is mostly fine, as long as I am learning something from it.

What felt much worse was doing work that seemed low-leverage or mechanical. If I have to do open coding myself, the interaction starts to feel contrived: I am slower than the model, and I do not want to feel like a labeler or teacher for the AI. Reviewing codes one tweet at a time also became exhausting surprisingly quickly; I often did not want to review more than about 10 tweets. Reviewing the memos was even worse. Many of them felt verbose and sloppy, and after a while I found myself avoiding them entirely. My feedback started dense and detailed, then became sparse over time, partly because of fatigue rather than because the agent was improving.

One thing the agent did do well was summarize what it took away from feedback. After some review rounds, it would explicitly state patterns it noticed in my corrections, like “remove redundant codes” or “stay closer to explicit statements.” That made it much easier to tell whether agents actually considered my feedback.

Looking forward

I’m not convinced agents can “replace” us in qualitative analysis, nor will they ever be able to, because all the context around a qualitative question cannot be distilled into something for an agent. But I came away from these experiments both impressed by what agents can already do and convinced that this is a genuinely exciting area to work on.

Some parting thoughts:

We may not want to preserve existing qualitative-analysis workflows exactly as they are. E.g., open coding, axial coding, memoing, were designed around human constraints. Agents change those constraints. Perhaps there are better ways to structure the workflow: e.g., AI-proposed open codes where the human reranks or weights them, or workflows where open and axial coding happen simultaneously instead of sequentially. Perhaps theoretical saturation should be defined around the human analyst’s learning, rather than whether the agent can still generate distinct new codes.

A separate issue is interface and interaction design. Most of the frustration in these experiments did not come from waiting on the agents. It came from low-leverage supervision, poor visibility into what changed between rounds, and difficulty building trust in the outputs. I want systems that make the agent’s reasoning legible without forcing the human to read pages of text slop; systems that surface provenance, uncertainty, disagreement, stabilization, and evidence directly in the interface.

There is also a scaling question. These experiments used short documents (tweets) at relatively small scale. What happens when the documents are long, like interview transcripts or research papers, where reviewing each one individually would be completely untenable? What about applying this to agent error analysis, where the data itself is agent traces and the goal is to identify recurring failure modes? Or to multimodal and semi-structured data? And doing analysis at the thousand, hundred thousand, or million+ scale?

Overall, the biggest thing I took away from all of this is that agents can do the mechanical parts of qualitative analysis fast, but they have no taste. The interesting design question isn’t “how do we automate qualitative analysis” but “how do we build systems where human taste and agent scale actually compose.” If you’re working on this problem, I would love to read your paper(s)! And if you’re not yet, well, I hope this post gives you a reason to start!