Over three years, the SharePoint engineering team iterated through five major architectures for AI-powered page authoring. We over-constrained early models, underestimated discoverability, were delighted by new language model capabilities every few months, and learned to pivot fast when emerging approaches leapfrogged what we already had in flight.
This post documents what worked, what didn't, and what we changed at each generation. If you're building AI features, or interested in the engineering tradeoffs of shipping LLM-based products at enterprise scale, these are the patterns and pitfalls we wish we'd known sooner.
Where are we today?
Earlier this week, at aka.ms/SharePointAI, we introduced the new agentic SharePoint page authoring experience: a coding-agent-style, minimal-scaffolding, schema-driven system in which the model treats page creation as a coding problem and applies proven code generation techniques for high-quality, predictable outcomes. This architecture only became viable when frontier reasoning models arrived in late 2025 and delivered a step change over their predecessors. Early comparisons against GPT-4.1 show a 30% improvement in eval pass rates, with GPT-5 and Claude reasoning models reaching 85%+. It took five architectural pivots, and the lessons from each, to get here, and we want to share that journey.
The Journey at a Glance
1 - Early 2023 - Full Page
Version 1: SharePoint Copilot (v1)
GPT-3.5 • Formal DSL with multi-step orchestration.
Built on the shared Office Copilot stack for fast access to RAI and compliance tooling. Basic prompts worked in dogfood, but real page creation exposed GPT-3.5's small context window, hallucination rates, and weak instruction following. Without evals, we found these gaps late.
2 - Mid 2024 - Scope Down
Version 2: Rich Text Copilot
GPT-4 • Scoped to inline text rewrites and grounded composition • No evals
Scoped down to text editing on SharePoint pages and proved we could reach high quality. Negative feedback dropped by more than a third over 18 months, a strong signal of product-market fit. Next we focused on improving discoverability.
3 - Late 2024 - Full Page
Version 3: AI Pages with Prompt Engineering
GPT-4 • Full-page generation with RAG • Automated evals introduced late in cycle
Thumbs-down feedback (the percentage of AI suggestions users weren't satisfied with) from Rich Text Copilot (Version 2) told us users wanted to turn documents and meetings into pages. New platform capabilities (Semantic Index, Graph, Design Ideas) made full-page generation worth attempting. Templates produced strong results but open-ended prompts received 3-4x more negative feedback, and users had no way to iterate on individual sections. We introduced automated evaluation for the first time.
4 - Mid 2025 - Scope Down
Version 4: Sections with AI with Prompt Engineering
GPT-4.1 • Section-scoped generation with inline refinement • Evals maturing with grounding metrics and intent analysis
Scoped down to sections, addressing Version 3’s lack of post-generation control and context window limits. GPT-4.1’s larger context window gave us more room for grounding material. Eval infrastructure matured with grounding precision tracking and intent analysis. Keep rate nearly doubled, engagement grew meaningfully. We were onto something.
4b - Mid 2025 - Scope Down
Version 4b: FAQ Web Part — Domain-Specialized Generation
GPT-4.1 • Single web part with domain-tuned prompts and staged generation • Evals maturing
A parallel team took the scope-down approach even further, focusing on a single web part: FAQs. FAQ generation was well within GPT 4.1’s capabilities, and the narrow focus let the team ship high-quality grounded output with strong user satisfaction. The multi-stage pipeline mapped each stage directly to the user’s task rather than to infrastructure concerns, and a custom eval framework enabled rapid prompt iteration.
5 - Early 2026 - Full Page
Version 5: Agentic Workspace + Evals Suite
Thinking Models • Page as code with JSON schemas, targeted deltas, and operation validation • ~400 daily evals, 85%+ pass rate
Code generation was becoming remarkably good across models, so we modeled page content as code and turned page creation into a code delta problem. The LLM reads schemas and current page state, then generates targeted diffs. Combined with a comprehensive eval suite, this gave us our strongest and most model-first system yet.
Version 1: The Illusion of Control
SharePoint Copilot (v1)
GPT-3.5 • Early 2023 • Full Page Creation
What & How
In early 2023, we made our first foray into LLM-based page generation. We wanted to leverage common infrastructure built by the Office teams, so we bet on the shared Office Copilot stack and its formal Domain Specific Language (ODSL), where every page operation (insert_webpart(...), update_section(...)) was explicitly defined. The shared infrastructure gave us critical capabilities like Responsible AI (RAI) filtering, compliance tooling, and an embeddings-based few-shot learning pipeline that let us build quickly. We also integrated content grounding from tenant documents and videos, and Bing image generation for title regions. The system reached broad internal dogfood by mid-2023 and a limited private preview with early adopters by year's end.
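As a rough illustration of what that DSL-driven control looked like, here is a minimal sketch in Python. The operation names and plan shape are hypothetical stand-ins, not the actual ODSL surface; the point is that every operation had to match an explicitly defined vocabulary before it could run.

```python
# Hypothetical sketch of a DSL-constrained plan (not the real ODSL).
# The orchestrator only executed operations defined in the DSL's vocabulary.
ALLOWED_OPS = {"insert_webpart", "update_section", "set_title"}

def validate_plan(plan):
    """Reject any plan containing an operation outside the defined DSL."""
    for step in plan:
        if step["op"] not in ALLOWED_OPS:
            raise ValueError(f"unknown operation: {step['op']}")
    return plan

plan = [
    {"op": "set_title", "args": {"text": "Team Offsite 2023"}},
    {"op": "insert_webpart", "args": {"type": "Text", "section": 1}},
    {"op": "update_section", "args": {"section": 1, "layout": "two-column"}},
]
validate_plan(plan)  # passes: every operation is in the DSL vocabulary
```

The explicit whitelist is what made the system feel controllable, and also what made it brittle: anything the model produced outside the vocabulary simply failed.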
Basic prompts worked well in dogfood, but real page creation intents quickly exposed the limits of GPT-3.5. The model's 4,097-token context window couldn't fit complex pages, it hallucinated frequently, and it struggled to follow multi-step instructions reliably. The multi-stage pipeline amplified these problems because each stage depended on the previous one, causing model errors to cascade through the system. We also lacked eval maturity, which meant we didn't catch these limitations early enough in the development cycle. Overall this was a great learning experience that taught us what the models could and couldn't do, showed us the cost of building without evals, and changed our approach for every subsequent version.
Key Takeaways
- Match scope to model capability. GPT-3.5’s small context window, high hallucination rates, and weak instruction following meant full-page creation was beyond what the model could reliably handle. We learned to assess model limitations honestly before committing to a scope.
- Build evals before you need them. We lacked eval maturity, which meant we didn’t catch GPT-3.5’s limitations until dogfood. Earlier evaluation infrastructure would have surfaced these gaps during development, not after deployment.
- Shared infrastructure accelerates the first mile. Betting on the Office Copilot stack gave us RAI filtering, compliance tooling, and few-shot learning pipelines without building from scratch. For a first attempt, speed to learn matters more than full architectural control.
Version 2: Finding Success by Scoping Down
Rich Text Copilot
GPT-4 • Mid 2024 • Inline Text Assistance
What & How
After Version 1, we took the lesson of matching scope to model capability and put it into practice. Instead of generating whole pages, we focused on the rich text editor on SharePoint pages. GPT-4 gave us better instruction following, double the context window, and lower hallucination rates, and restricting to only the RTE kept inputs well within those limits. We started with a small, canned set of rewrite options like adjust tone, shorten, and expand. The set was small enough to manually validate quality before shipping, which mattered because we still had no automated evals. As users adopted the feature, we added custom prompts and RAG support for attaching documents and meeting transcripts as source material, both driven directly by what thumbs-down feedback told us users needed.
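The canned-rewrite approach can be sketched as a small, fixed map of options to prompt templates. The templates below are hypothetical, but a set this small is exactly what made end-to-end manual validation feasible before shipping.

```python
# Illustrative sketch of canned rewrite options (templates are hypothetical).
# A fixed set this small could be manually validated before shipping.
REWRITE_PROMPTS = {
    "adjust_tone": "Rewrite the text below in a {tone} tone:\n{text}",
    "shorten": "Rewrite the text below more concisely:\n{text}",
    "expand": "Expand the text below with more detail:\n{text}",
}

def build_prompt(option, text, **params):
    """Build the model prompt for one canned rewrite option."""
    if option not in REWRITE_PROMPTS:
        raise ValueError(f"unsupported rewrite option: {option}")
    return REWRITE_PROMPTS[option].format(text=text, **params)

prompt = build_prompt("adjust_tone", "We shipped the feature.", tone="formal")
```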
Our telemetry showed that users who found the feature used it actively, but only a small fraction of page authors ever found the entry points when they needed them. We added floating Copilot icons inspired by Word’s inline text assistant, contextual hints in empty sections, and onboarding nudges. Educating users to adopt AI features remains an ongoing challenge, and these entry points continue to evolve as we learn what works.
Early metrics showed a strong signal: more than half of suggestions were kept. But improving AI quality was only half the problem. The other half was helping users build new habits around a fundamentally different way of creating content, and that required intentional UX work, not just better prompts. The key was facetizing customer intents and recognizing which ones we could best improve. Through persistent effort in prompt tuning, format preservation, and tone calibration, the team reduced negative feedback by more than a third. Constrained rewrites performed best (6% thumbs-down); free-prompt drafts remained the weakest scenario, primarily due to language translation issues.
Key Takeaways
- Scope down to model capabilities. We observed that models in this era were better at scoped tasks like summarizing and rewriting, so we shaped our problems to match their capabilities. Narrowing to inline text editing let us ship a feature that worked well within what GPT-4 could reliably handle.
- Persistent, data-driven improvement compounds. Negative feedback dropped by more than a third, not through one big fix, but iterative refinements in prompt tuning, format preservation, and tone calibration.
- Small scope enables quality without automated evals. We started with a canned set of rewrite options small enough to manually validate before shipping. This let us reach high quality early, even without an eval pipeline. As we expanded to custom prompts and open-ended drafts, the lack of automated evals became a bigger gap, one we addressed more seriously in later versions.
- Facetize customer intents to focus improvement. Not all intents are equal. Constrained rewrites hit 6% thumbs-down while free-prompt drafts stayed much higher. Breaking down user scenarios let us identify which intents we could improve quickly and which ones, like multilingual quality, still needed more work.
Version 3: Full Page Generation using Prompt Engineering
AI Pages
GPT-4 • Late 2024 • Complex Prompt Engineering + RAG
What & How
Rich Text Copilot proved that scoping to model capabilities works. At the same time, thumbs-down feedback from RTE Copilot was telling us what users wanted next: they were asking for ways to turn Word documents and meeting content into full SharePoint pages. That is inherently a full-page problem. New platform capabilities made it worth attempting. The Semantic Index gave us conceptual matching against organizational data, Microsoft Graph provided grounding in documents and meetings, and prompt engineering was proving effective for structured content across Office apps. We built a workflow that pulled grounded content from attached files, retrieved relevant organizational context, fed it to the LLM for page generation, and then applied Design Ideas to make the pages look beautiful. We shipped two modes: templates (newsletters, events, status updates) where the structure constrained the model’s output, and open-ended prompts for blank pages. Our intuition was that templates would produce stronger results, but blank page creation was too common an entry point to leave unsupported.
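The workflow above can be sketched roughly as a four-stage pipeline. The function names and stubbed stages below are illustrative, standing in for calls to attached-file extraction, the Semantic Index and Microsoft Graph, the LLM, and Design Ideas.

```python
# Minimal sketch of the Version 3 workflow; every stage is a stub standing in
# for a real service (names are illustrative, not actual APIs).
def extract_grounding(attached_files):
    """Pull grounded content out of the files the user attached."""
    return [f"content of {f}" for f in attached_files]

def retrieve_org_context(prompt):
    """Retrieve relevant organizational context (Semantic Index / Graph)."""
    return [f"org doc relevant to: {prompt}"]

def generate_page(prompt, grounding, context, template=None):
    """Single-shot LLM generation; a template constrains the structure."""
    structure = template or "model-chosen layout"
    return {"structure": structure, "sources": grounding + context}

def apply_design_ideas(page):
    """Apply visual styling as a final pass."""
    page["styled"] = True
    return page

page = apply_design_ideas(
    generate_page(
        "Summarize our Q3 review",
        extract_grounding(["q3-review.docx"]),
        retrieve_org_context("Q3 review"),
        template="status update",
    )
)
```

Note that the whole pipeline runs single-shot: the user only sees the result after the last stage, which is exactly the all-or-nothing interaction problem described below.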
Template-driven pages achieved strong satisfaction, but the approach had a fundamental scaling problem. Each template needed its own prompt tuning to produce good results, and when users went off the tuned intents, quality dropped sharply. Open-ended prompts received roughly 3–4x more negative feedback than template-based ones. The single-shot, all-or-nothing interaction model made this worse: users waited through a long generation cycle and then had no way to fix individual sections. After investing that time, they didn’t feel in control of the result. We began building automated evaluation infrastructure in this version, measuring grounding accuracy and relevance across our top templates. The evals validated prompt tuning improvements but came together late, closer to release than to development, and open-ended prompts shipped without the same eval coverage that templates had.
Key Takeaways
- Users needed post-generation control. Generation quality and post-generation control are inseparable. Build per-unit editing controls before optimizing first-shot quality.
- Validate intuitions with data. We had an intuition that template-based generation would produce stronger results than open-ended prompts. The data confirmed it: open-ended prompts received roughly 3-4x more negative feedback. Starting with a constrained approach and measuring against the open-ended alternative gave us a clear signal about where to invest next.
- Prompt engineering eats context window. Comprehensive system prompts were less constraining than ODSL but still guided the model down narrow paths. This limited model creativity and consumed a large portion of the context window, hurting reliability on longer pages.
- User feedback reveals the next product bet. Thumbs-down feedback from RTE Copilot told us users wanted to turn documents and meetings into full pages. That signal drove the decision to attempt full-page generation. Feedback loops don’t just improve existing features, they point you toward what to build next.
Version 4: Scoped Section Generation using Prompt Engineering
Sections with AI
GPT-4.1 • Mid 2025 • Section-Level Generation
What & How
We addressed Version 3’s two biggest problems, the inability to iterate and the token bloat of full-page generation, by scoping down to individual sections. Sections were small enough to evaluate in seconds, cheap to regenerate, and focused enough for the model to handle consistently. Users could refine each section with conversational follow-ups like “change the image” or “make it formal,” with design family and genre detection matching the page’s visual style. GPT-4.1’s larger context window gave us more room for grounding, and evals matured with automated grounding metrics, intent bucketing, and LLM-driven quality gates.
The first version chose from ~20 fixed templates. When a request did not match, the system forced the closest fit and silently ignored parts of what the user asked for. This drove roughly 30% of negative feedback. We fixed it by introducing an intermediate representation (IR), a simplified, model-friendly schema that sits between the user’s request and SharePoint’s complex native page format. The model writes to the IR, which maps to full web part configurations on the client. After the switch, keep rate nearly doubled and negative feedback dropped by roughly a quarter.
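To make the IR idea concrete, here is a minimal hypothetical sketch. The field names and the kind-to-web-part mapping are illustrative, not SharePoint’s actual schemas; the point is the shape of the split, where the model writes a small flat schema and a client-side mapper expands it into the richer native configuration.

```python
# Illustrative IR sketch: the model writes a simplified, flat section schema;
# a client-side mapper expands it into a native-style web part config.
# Field names and web part types here are hypothetical.
def ir_to_webpart(ir_section):
    """Expand one IR section into a native-style web part configuration."""
    mapping = {
        "text": "TextWebPart",
        "image": "ImageWebPart",
        "link": "QuickLinksWebPart",
    }
    return {
        "webPartType": mapping[ir_section["kind"]],
        "properties": {"content": ir_section.get("content", "")},
        "layout": ir_section.get("layout", "one-column"),
    }

ir = {"kind": "text", "content": "Welcome to the team site", "layout": "two-column"}
webpart = ir_to_webpart(ir)
```

Keeping the complexity in the deterministic mapper rather than the model is what removed the template force-fitting: the model only ever has to produce the simple side of this translation.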
Key Takeaways
- Smaller chunks plus iteration is the winning formula. Sections are small enough to evaluate in seconds, cheap to regenerate, and focused enough for the model to handle consistently. Keep rate nearly doubled compared to full-page generation.
- An LLM-friendly intermediate representation simplifies the model’s job. SharePoint’s native web part JSON is too complex for an LLM to reason about reliably. The IR gave the model a clean, simplified schema to write to instead, removing the ~20-template constraint and letting it generate layouts directly from natural language.
- Fit before scale, again. The first iteration of Sections with AI validated the section-level interaction model, but template force-fitting drove 30% of negative feedback. We shipped the IR before pushing for broader adoption, and the metrics confirmed it was the right call.
- Patterns validated at section scope carry forward. The IR, design family matching, and genre detection all informed the agentic architecture in Version 5.
Version 4b: Domain-Specialized Generation — The FAQ Web Part
FAQ Web Part
GPT-4.1 • Mid 2025 • Single Web Part, Domain-Tuned Generation
What & How
While Sections with AI scoped page authoring down to sections, a parallel team applied the same “match scope to model capability” lesson from Version 1 at its most extreme: generate just one web part, FAQs. FAQ generation from documents was well within GPT-4.1’s capabilities by this point, so the team didn’t need complex scaffolding. Like Version 2’s canned rewrite options, the team constrained the problem to three well-defined FAQ types (Event, Product, Policy) and built domain-specific instruction sets for each, embedding them directly into the orchestration layer. This kept the surface area small enough to test thoroughly. Authors selected grounding sources and FAQ type, and the model handled the rest.
The system used a three-stage pipeline (categories, then questions, then grounded answers), with users reviewing and editing at each stage. Unlike Version 1’s multi-stage approach where each stage served infrastructure concerns like classification and code generation, here each stage mapped directly to the user’s task and could run as a straightforward prompt with minimal scaffolding. We prioritized the top user tasks in our evals, and a custom evidence-anchored framework cut individual eval runs from roughly 30 minutes to under 10 minutes, giving us a much tighter prompt iteration loop. Authors reported reducing FAQ creation time from hours to minutes, and the feature saw strong adoption with a majority of generated FAQs kept.
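The three-stage pipeline can be sketched roughly like this. The stage functions are stubs standing in for the real prompts, and the `review` callback (a name we introduce here for illustration) marks the human checkpoint between stages.

```python
# Sketch of the three-stage FAQ pipeline: categories -> questions -> answers.
# Each stage is a simple prompt in production; here they are stubbed.
def generate_categories(sources):
    return ["Logistics", "Registration"]

def generate_questions(categories):
    return {c: [f"What about {c.lower()}?"] for c in categories}

def generate_answers(questions, sources):
    # Every answer is grounded in the selected sources.
    return {q: f"Grounded answer from {sources[0]}"
            for qs in questions.values() for q in qs}

def build_faq(sources, review=lambda stage, data: data):
    """Run the pipeline, pausing at each human checkpoint via `review`."""
    categories = review("categories", generate_categories(sources))
    questions = review("questions", generate_questions(categories))
    return review("answers", generate_answers(questions, sources))

faq = build_faq(["event-brief.docx"])
```

Because each stage maps to a user decision rather than an infrastructure step, the checkpoints double as both quality control and a sense of authorship for the user.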
The initial rollout surfaced familiar challenges. The biggest dissatisfaction driver was incorrect output language, a problem Rich Text Copilot also faced with multilingual content. The team resolved it with a language selector. Reliability was another concern: large prompts hitting the orchestration layer caused latency spikes, echoing the same lesson from Version 3 that monolithic prompts create fragility. Breaking prompts into staged calls improved resilience.
Key Takeaways
- Narrower scope, higher quality. FAQ generation was a well-understood task for GPT-4.1, so the team didn’t need the scaffolding overhead of earlier generations. Domain-specific instruction sets for Event, Product, and Policy FAQs replaced generic prompts and eliminated the need for user prompt engineering entirely.
- Multi-stage pipelines work when stages serve the task, not the architecture. Version 1 also used multiple stages, but those stages served infrastructure concerns (classification, embedding lookup, code generation). Here, each stage mapped to the user’s problem: categories, questions, answers. With clean separation of concerns and better model capabilities, each stage could use a simple prompt with minimal scaffolding. Human checkpoints between stages gave authors control and independently validated the “small chunks plus iteration” pattern from Version 4.
- Fast evals accelerate prompt iteration. The team built a custom evidence-anchored evaluation framework that reduced eval cycle time by roughly 6x. This enabled a much tighter feedback loop between prompt changes and quality measurement.
- Dependency-graph grounding enables content freshness. Modeling source-to-answer relationships as a graph (rather than flat retrieval) enabled answer-level change detection and targeted refresh, solving the stale-content problem that plagues enterprise FAQs.
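The dependency-graph grounding in that last takeaway can be sketched as follows. The source identifiers and questions are illustrative; the idea is that each answer records which source passages it depends on, so a change in a source flags only the affected answers.

```python
# Illustrative dependency graph: answer -> list of grounding passages.
# When a passage changes, only the answers that depend on it are refreshed.
dependencies = {
    "What is the refund policy?": ["policy.docx#section-2"],
    "When does the event start?": ["event-brief.docx#agenda"],
    "Where is the venue?": ["event-brief.docx#agenda", "venue.docx#map"],
}

def answers_to_refresh(changed_source):
    """Return only the answers whose grounding includes the changed passage."""
    return [a for a, srcs in dependencies.items() if changed_source in srcs]

stale = answers_to_refresh("event-brief.docx#agenda")
# Flags the two agenda-grounded answers; the refund answer stays untouched.
```

Flat retrieval would force re-checking every answer on any source change; the graph turns refresh into a targeted lookup.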
Version 5: Full Page Generation using Agentic Workspace
Page Builder
GPT 5.x (Thinking Models) • Early 2026 • Model-First Architecture
What & How
By Version 4, we had learned that scoped generation with templates worked, that the IR could dynamically configure web parts, and that eval infrastructure was essential for rapid iteration. At the same time, we observed that code generation was becoming remarkably good across different frontier models. This prompted a fundamental pivot: instead of adding more scaffolding, we modeled page generation as a code generation problem. All page content is represented as JSON using the IR from Version 4, and each web part is similarly modeled with its own JSON schema. The LLM reads the current page state, the page schema, and the web part schemas, then decides on the right code deltas to apply to reach the desired outcome.
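A minimal sketch of the delta idea, with hypothetical operation shapes; the real system works against the IR and full web part schemas, but the mechanics are the same: the model emits a short list of diffs against the current JSON page state rather than regenerating the whole page.

```python
# Sketch of page edits as code deltas (operation shapes are hypothetical).
# The model emits diffs; this function applies them to a copy of the page.
def apply_deltas(page, deltas):
    """Apply replace/insert/remove operations to the page's sections."""
    page = {**page, "sections": [dict(s) for s in page["sections"]]}
    for d in deltas:
        if d["op"] == "replace":
            page["sections"][d["index"]] = d["value"]
        elif d["op"] == "insert":
            page["sections"].insert(d["index"], d["value"])
        elif d["op"] == "remove":
            page["sections"].pop(d["index"])
    return page

current = {"title": "Team Site",
           "sections": [{"kind": "text", "content": "old intro"}]}
deltas = [
    {"op": "replace", "index": 0,
     "value": {"kind": "text", "content": "new intro"}},
    {"op": "insert", "index": 1,
     "value": {"kind": "image", "content": "hero.png"}},
]
updated = apply_deltas(current, deltas)
```

Targeted deltas are also what make incremental accept/reject possible: each operation is small enough for a user to review on its own.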
This approach takes advantage of the LLM’s natural code writing capabilities and world knowledge to produce the right diffs. Adding a new web part means adding a new schema file, which moves prompt engineering from a central orchestration layer to the web part itself. Each web part owns its own schema, prompts, and validation rules, so a bug in one web part does not break others. This isolation also lets us scale development across many engineers, since teams can work on their web parts independently without coordinating on a shared prompt.
An operation validator checks every delta against page semantics and web part semantics before applying it, catching invalid operations that the model might produce. This gives us a safety net that earlier generations lacked, where invalid output either failed silently (Version 1) or produced partial results (Version 4). In practice we found a balance between what goes into the system prompt and what goes into validators. Common rules that the model needs to follow on every turn belong in the system prompt, where they prevent mistakes before they happen. Edge cases like whether an image crop is acceptable belong in validators, which keep the prompt small but cost an extra model turn when they reject output.
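A rough sketch of the validator idea, with hypothetical stand-in schemas and rules; in the real system each web part ships its own schema and validation rules alongside it.

```python
# Sketch of the operation validator: every delta is checked against the
# target web part's schema before it is applied. Schemas are hypothetical.
SCHEMAS = {
    "text": {"required": {"content"}, "allowed": {"content", "layout"}},
    "image": {"required": {"src"}, "allowed": {"src", "crop"}},
}

def validate_operation(delta):
    """Return a rejection reason, or None if the delta is safe to apply."""
    schema = SCHEMAS.get(delta["kind"])
    if schema is None:
        return f"unknown web part kind: {delta['kind']}"
    fields = set(delta["properties"])
    if not schema["required"] <= fields:
        return f"missing required fields: {schema['required'] - fields}"
    if not fields <= schema["allowed"]:
        return f"unexpected fields: {fields - schema['allowed']}"
    return None  # valid: safe to apply

ok = validate_operation({"kind": "text", "properties": {"content": "hello"}})
bad = validate_operation({"kind": "image", "properties": {"crop": "16:9"}})
```

A rejection costs an extra model turn (the model is asked to retry with the reason), which is exactly the tradeoff described above between prompt rules and validator rules.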
Because the scaffolding is minimal, the architecture is model-first. This is the culmination of the eval journey that started with no evals in Version 1, added late automated evaluation in Version 3, matured with grounding metrics and intent analysis in Version 4, and now runs a comprehensive suite across every model we consider. When a new model ships, we run the evals and immediately understand how much per-model tuning it would need to reach production quality. The table below shows how multiple frontier models perform on our eval suite:
| Model | Avg Pass Rate |
| --- | --- |
| Claude Opus 4.5 | 88.32% |
| Claude Opus 4.6 | 86.48% |
| GPT-5 Reasoning | 84.20% |
| GPT-5.2 Reasoning | 83.00% |
| GPT-4.1 | 53.86% |
The eval suite covers a wide range of scenarios, from PR regression gates to complex multi-step authoring:
| Category | Tests | % of Total | Description |
| --- | --- | --- | --- |
| PR | 46 | 12% | Pull request regression gate, run on every PR before merge |
| Core | 135 | 35% | Fundamental operations critical for the feature to work |
| Type 1 | 66 | 17% | Basic operations with clear, unambiguous prompts |
| Type 2 | 95 | 25% | Intermediate complexity with multiple parameters |
| Type 3 | 24 | 6% | Complex multi-step authoring scenarios |
| Type 4 | 18 | 5% | Edge cases and advanced operations |
| Total | 384 | 100% | |
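For completeness, here is a minimal sketch of how per-test results roll up into the average pass rates reported above. The per-test results below are fabricated in shape only; the aggregation logic is the point.

```python
# Sketch of pass-rate aggregation: each eval run yields (category, passed)
# pairs, and the overall rate is the percentage of passing tests.
def pass_rate(results):
    """results: list of (category, passed) tuples -> overall pass percentage."""
    return 100 * sum(passed for _, passed in results) / len(results)

# Illustrative results only (not real eval data).
results = ([("Core", True)] * 120 + [("Core", False)] * 15
           + [("Type 3", True)] * 20 + [("Type 3", False)] * 4)
overall = pass_rate(results)
```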
Reasoning models brought a step change in code generation quality that made this architecture possible, as the eval gap between GPT-4.1 and the reasoning models shows. Earlier models needed extensive scaffolding to produce valid page structures; these models handle it with schemas alone. The eval suite lets us measure that difference precisely and swap models with confidence.
Internal testing shows higher satisfaction scores than all previous generations, along with qualitative feedback indicating that authors use the tool more naturally and spend less time working around limitations. We expect to validate these signals with public preview users.
Key Takeaways
- Model page creation as code generation. The latest reasoning models excel at writing code, so representing pages as JSON and asking the model to generate code deltas against schemas lets us harness capabilities it already has. The operation validator catches invalid output, giving us a safety net that earlier approaches lacked. Adding a new web part is just adding a new schema file, so the system scales without prompt changes.
- Self-contained leaf nodes isolate failures and scale teams. Each web part owns its own schema, prompts, and validation rules. A bug in one web part does not break others, and teams can work on their web parts independently without coordinating on a shared prompt. This is what let us scale development across many engineers.
- Balance system prompts and validators. Common rules belong in the system prompt, where they prevent mistakes before they happen. Edge cases like image crop validation belong in validators, which keep the prompt small but cost an extra model turn when they reject output.
What We Learned Along the Way
Looking across all five generations, certain patterns held regardless of which approach we tried. These aren't theoretical principles. They're lessons we learned the hard way, often by doing the opposite first.
01 - Build for the model you have, design for the model that’s coming
Version 1 needed heavy scaffolding for GPT-3.5. By Version 5, minimal scaffolding lets the model reason. Each generation’s architecture was right for its era but became overhead for the next. Design scaffolding to be removable.
02 - Invest in evals early
Every generation that invested in evaluation infrastructure shipped faster and with fewer surprises. The teams that built evals late spent cycles discovering problems from user feedback that automated testing would have caught earlier. By the time we reached Version 5, our eval suite let us evaluate new models within a day and understand exactly where to focus tuning effort.
03 - Scope down to find fit, then expand
Full page, then text, then full page, then sections and a single web part, then full page again. Each scope-down found product-market fit and produced patterns that fed directly into the next expansion. The lesson is not to stay small but to use narrower scope as a proving ground.
04 - Break down feedback, address each facet
Raw thumbs-down data is noisy. The real value came from breaking feedback into specific intents and addressing each one separately. This told us where to focus (constrained rewrites were strong while free-prompt drafts needed work), revealed platform problems that crossed features (multilingual issues appeared independently in Rich Text Copilot and the FAQ Web Part), and pointed toward new product bets (doc-to-page feedback drove the decision to attempt full-page generation).
05 - Discoverability is a feature
Half of users who found Rich Text Copilot became active users, but only a small fraction ever found it. The FAQ Web Part had a similar problem, with some users unaware it existed at all. Solving discoverability through floating icons, contextual hints, and onboarding nudges became as important as improving the AI itself.
06 - Give users control over the output, not just the input
Version 3’s single-shot, all-or-nothing flow was the biggest deal breaker. Users invested time waiting for a full page and then had no way to fix individual parts. Every subsequent version increased user control: Version 4 let users generate and refine one section at a time, Version 4b added human checkpoints at each pipeline stage, and Version 5 uses incremental deltas that users can accept or adjust individually.
Looking Forward
Three years, multiple architectures, and a lot of pivots taught us that building AI features is less about picking the right approach upfront and more about learning fast from customer feedback and being willing to throw away what isn't working. Each generation made the next one possible. Page Builder is our strongest system yet, and domain-specialized features like the FAQ Web Part show how the same principles (scoped generation, human-in-the-loop, fast evals) extend beyond page authoring to individual web parts. We know from experience that the real test is customer usage. We are heading to public preview and are eager to continue this learning journey with real feedback from real authors.
Join Our Previews and Early Access Program
Try the new SharePoint experience in public preview today
Try the AI in SharePoint experience in public preview today
Nominate your organization for the SharePoint AI Early Access Program