Microsoft Introduces Critique, A New Multi-Model Deep Research System In M365 Copilot

In Brief

Microsoft has introduced Critique, a new multi-model deep research system inside Researcher, the deep research agent in Microsoft 365 Copilot, as part of a broader push to make Copilot feel more dependable for serious knowledge work instead of just fast drafting.

According to Microsoft, Critique is designed for complex research tasks and works by splitting the job into two parts: one model handles planning, retrieval, synthesis, and drafting, while a second model reviews and refines the output before the final report is produced. Microsoft says the system uses models from frontier labs including OpenAI and Anthropic, and that it is available now through the company’s Frontier program.

Reuters reported that in Critique’s current setup, OpenAI’s GPT generates the response and Anthropic’s Claude reviews it for accuracy and quality before the answer reaches the user. Microsoft has also said it wants this workflow to become bi-directional later on, allowing models to review each other in both directions.

What Critique actually does inside Microsoft 365 Copilot

Microsoft’s own description makes it clear that Critique is not just a cosmetic feature or a new button slapped onto Copilot. It works inside Researcher in Microsoft 365 Copilot and is built for deeper tasks where getting it right matters just as much as getting it done fast. One model does the digging and drafts the report, while the second steps in like an editor, checking the facts, sharpening the structure, and helping turn it into a more reliable final piece.

Microsoft says the whole idea is to separate generation from evaluation, rather than asking one model to brainstorm, write, fact-check, and polish its own work all at once. That distinction matters because a lot of AI failure comes from exactly that one-model bottleneck. When a single system is asked to do everything, it can produce something that looks polished while quietly missing gaps, overreaching on claims, or leaning on weak evidence.

Microsoft says Critique’s review layer is built around rubric-based evaluation, with attention to source reliability, report completeness, and strict evidence grounding. In plain English, the second model is there to ask whether the draft actually answered the question, whether the sourcing is solid, and whether the final narrative is supported instead of merely sounding confident.
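Microsoft has not published code or an API for Critique, but the generate-then-evaluate split it describes follows a familiar pattern. The sketch below is purely illustrative: both model calls are stubs, and every function and rubric field name is an assumption, not Microsoft's actual interface.

```python
# Illustrative sketch of a generate-then-critique loop. The model calls
# are stubbed; Critique's real interface is not public, so all names
# here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Draft:
    text: str
    citations: list = field(default_factory=list)

def generator_model(task: str) -> Draft:
    """Stand-in for model 1: plans, retrieves, synthesizes, drafts."""
    return Draft(text=f"Draft report for: {task}",
                 citations=["source-a", "source-b"])

def reviewer_model(draft: Draft) -> dict:
    """Stand-in for model 2: scores the draft against a rubric
    (source reliability, completeness, evidence grounding)."""
    return {
        "source_reliability": all(c.startswith("source") for c in draft.citations),
        "completeness": len(draft.text) > 0,
        "evidence_grounding": len(draft.citations) >= 2,
    }

def research_with_critique(task: str) -> tuple:
    draft = generator_model(task)      # generation step
    review = reviewer_model(draft)     # separate evaluation step
    if not all(review.values()):       # revise only if a rubric check fails
        draft = generator_model(task + " (revised)")
    return draft, review
```

The point of the pattern is the structural one Microsoft describes: the model that writes the report never grades its own work, so a confident-sounding but weakly grounded draft can be caught before it reaches the user.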

Microsoft is not pitching Critique as a side experiment

One of the more important details in Microsoft’s announcement is that Critique will be the default experience in Researcher when Auto is selected in the model picker. That signals the company sees this as more than an optional lab feature for power users. It is effectively treating multi-model review as the new baseline for deep research quality inside Microsoft 365 Copilot. That is a meaningful product choice, because it suggests Microsoft believes enterprise customers care less about raw response speed than they do about fewer hallucinations, stronger structure, and more confidence in the finished report.

That also fits neatly into Microsoft’s broader messaging around Wave 3 of Microsoft 365 Copilot, where the company has been pushing the idea of Copilot as a “system for work” built on a multi-model advantage rather than on any single AI lab. In Microsoft’s framing, Copilot is meant to pull the best available intelligence from across the industry, grounded in work context through what it calls Work IQ and protected by enterprise data controls. Critique is one of the clearest examples yet of that strategy moving from marketing language into a visible product feature.

The benchmark numbers are a big part of Microsoft’s sales pitch

Microsoft is not only saying Critique feels better. It is saying the system performed better on a formal benchmark. In its technical write-up, the company says it tested Critique on the DRACO benchmark, short for Deep Research Accuracy, Completeness, and Objectivity, which covers 100 complex research tasks across 10 domains. Microsoft says responses were judged across factual accuracy, breadth and depth of analysis, presentation quality, and citation quality, and that Critique outperformed the single-model version of Researcher across all four measures.

The company highlighted the largest gains in breadth and depth of analysis, followed by presentation quality and factual accuracy. It also says the improvements were statistically significant and that Researcher with Critique delivered a +7.0 point aggregated score improvement, or +13.88% over Perplexity Deep Research (Claude Opus 4.6 model), which Microsoft described as the best system reported in the benchmark paper.

[Chart: DRACO benchmark results | Source: Microsoft]

That is an eye-catching claim, especially because the deep research race has become one of the most competitive fronts in enterprise AI. Research tools are no longer being judged only by whether they can gather information, but by whether they can assemble a report that feels decision-ready.

Microsoft’s argument is that the review layer forces the system to identify missing angles, tighten organization, challenge weak claims, and use citations more carefully. Whether customers experience those gains in real workflows will matter more than benchmark charts, but Microsoft is clearly trying to signal that this is a measurable quality jump rather than a vague model update.

Council shows Microsoft is thinking beyond one “best answer”

Critique is not the only feature Microsoft introduced alongside this update. The company also launched Council, a multi-model comparison mode inside Researcher. Microsoft says Council runs Anthropic and OpenAI models simultaneously, allowing each to generate a full standalone report. A separate judge model then creates a distilled summary showing where the reports agree, where they diverge, and what each uniquely contributes. Microsoft Support describes this as Model Council, a mode that preserves both full reports and adds a comparison summary to help users decide which output is stronger or how to combine them.
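The Council workflow described above can be sketched in a few lines. Again, this is only an illustration of the pattern, not Microsoft's implementation: the two model calls are stubs, and the "judge" here is reduced to a simple set comparison of each report's claims.

```python
# Illustrative sketch of the Council pattern: two models each produce a
# standalone report, then a judge distills agreement, divergence, and
# unique contributions. All names and calls are hypothetical stubs.

def model_a(task: str) -> set:
    """Stand-in for one lab's model; returns the claims in its report."""
    return {"growth is strong", "risks are rising"}

def model_b(task: str) -> set:
    """Stand-in for a second lab's model on the same task."""
    return {"growth is strong", "regulation is a factor"}

def council(task: str) -> dict:
    a, b = model_a(task), model_b(task)   # both models run on the full task
    return {
        "agree":    a & b,   # claims both reports make
        "unique_a": a - b,   # what model A alone contributes
        "unique_b": b - a,   # what model B alone contributes
    }
```

In the real feature the judge is itself a model producing a prose summary, and both full reports are preserved alongside it; the set arithmetic here just makes the agree/diverge/unique structure of that summary concrete.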

That is a very interesting signal about where enterprise AI may be heading. For a while, the industry behaved as if the goal was to find one model that could replace all the others. Microsoft’s latest move suggests the more realistic future may be one where companies do not trust any single model enough to make it the only voice in the room.

The timing of Critique is not accidental. Microsoft has been under pressure to show that Microsoft 365 Copilot is becoming more useful, more differentiated, and more valuable as competition intensifies.

Reuters tied the rollout of Critique and Council to Microsoft’s effort to improve Copilot adoption in a market where rivals including Google’s Gemini and Anthropic’s Claude products are pushing hard into workplace AI. Axios also noted that Microsoft’s multi-model strategy has another benefit: it shows the company is not locked into overdependence on OpenAI at a time when frontier model leadership can shift quickly.
