Microsoft SharePoint Gets New Agentic Capabilities and Governance Tools


Key Takeaways:

  • Agentic AI in SharePoint lets teams build governed, end-to-end AI solutions directly within the Microsoft 365 environment.
  • A revamped interface and AI-powered web publishing streamline content creation and management.
  • New admin insights enhance permissions oversight, storage management, and cost allocation.

Microsoft’s SharePoint is celebrating its 25th anniversary this month with the public preview of new agentic building capabilities, alongside a redesigned user experience and new content governance tools. These enhancements aim to help enterprises confidently scale their AI initiatives while keeping security and compliance at the forefront.

In SharePoint, the new agentic experiences let organizations leverage AI to plan, create, and iterate on solutions collaboratively. Administrators will be able to define custom AI skills to shape how AI behaves within enterprise environments. These skills are organization‑specific packages that capture terminology, governance rules, and business logic. For instance, a legal team could embed its own risk thresholds and policy rules directly into the AI’s reasoning.

“Rather than a one-time prompt, this enables teams to build end-to-end solutions for critical business needs, ranging from procurement contract repositories to IT helpdesks to marketing content management, all at scale, governed within the trusted SharePoint and Microsoft 365 environment,” explained Jeff Teper, President, Collaborative Apps and Platforms.

New agentic experiences in SharePoint (Image Credit: Microsoft)

As of this writing, the page editing, library organization, and list management experiences are available through the AI in SharePoint preview program. Microsoft plans to roll out site creation via natural language by the end of this month.

Redesigned user experience enhances content creation and management

Microsoft noted that its early deployment of advanced AI in SharePoint is powered by Anthropic’s Claude. The company added that some customers may need to grant additional permissions for Claude’s use as a sub‑processor. However, Microsoft plans to address this requirement before the feature becomes generally available for all commercial customers.

Microsoft has launched a new AI-powered SharePoint web publishing system to help organizations plan, create, refine, and measure content more effectively. This service is designed to understand web content and SharePoint components deeply. It helps businesses publish critical knowledge faster while ensuring it adheres to organizational governance standards.

New governance controls

The SharePoint Admin Agent, announced in November 2025, is gaining new skills this month. This AI agent can analyze tenant-wide permissions, flag oversharing risks, and identify ownerless or inactive sites. Moreover, storage management insights will help IT admins surface cleanup actions and govern pay-as-you-go billing.

Lastly, Microsoft has also announced the general availability of departmental billing for Microsoft 365 Backup. This feature enables organizations to allocate backup costs by team or geographic unit.

Microsoft is currently rolling out AI in SharePoint, a new user interface, and enhanced governance controls in public preview. Organizations can opt into the AI in SharePoint preview through Microsoft's sign-up page, and Microsoft expects to introduce additional features in the coming weeks.

The post Microsoft SharePoint Gets New Agentic Capabilities and Governance Tools appeared first on Petri IT Knowledgebase.


Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model

[Figure: white-line architecture flow chart on a blue-green gradient, showing data, compute, and user inputs flowing into a central model matrix, with evaluation feeding back and outputs flowing on to documents and task lists]

At a glance

  • Phi-4-reasoning-vision-15B is a compact open-weight multimodal reasoning model that balances reasoning power, efficiency, and training-data needs. It is a broadly capable model that supports natural interaction across a wide array of vision-language tasks and excels at math and science reasoning and at understanding user interfaces.
  • We share lessons learned and best practices for training a multimodal reasoning model, showing the benefits of careful architecture choices, rigorous data curation, and a mixture of reasoning and non-reasoning data.

We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open-weight multimodal reasoning model, available through Microsoft Foundry, HuggingFace, and GitHub. Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as captioning images, answering questions about them, reading documents and receipts, helping with homework, reasoning about changes across sequences of images, and much more. Beyond these general capabilities, it excels at math and science reasoning and at understanding and grounding elements on computer and mobile screens. In particular, our model offers appealing value relative to popular open-weight models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost. It performs competitively with much slower models that require ten times or more compute time and tokens, and it achieves better accuracy than similarly fast models, particularly on math and science reasoning.

Performance charts comparing Phi-4-Reasoning-Vision-15B against other models (Kimi-VL, Qwen-3, Gemma-3) on accuracy vs. response time and accuracy vs. completion tokens. Phi-4 stands out as being fast and token-efficient while achieving ~75% accuracy.
Figure 1: Phi-4-reasoning-vision-15B presents a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost. It performs competitively with much slower models that require more time and tokens, and it achieves higher accuracy than similarly fast models. These values were computed by averaging accuracy, time, and output token counts over a subset of 4 benchmarks where we had logged these values: ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2.
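The aggregation behind Figure 1 is straightforward to reproduce. A minimal sketch, using invented placeholder numbers rather than the actual logged values, that averages per-benchmark metrics and finds which models sit on the accuracy-versus-time Pareto frontier:

```python
# Sketch of the Figure 1 methodology: average per-benchmark accuracy, wall-clock
# time, and output-token counts, then find the accuracy-vs-time Pareto frontier.
# All numbers below are illustrative placeholders, NOT the article's logged values.

def average_metrics(per_benchmark):
    """per_benchmark: list of (accuracy, seconds, tokens) tuples, one per benchmark."""
    n = len(per_benchmark)
    acc = sum(b[0] for b in per_benchmark) / n
    sec = sum(b[1] for b in per_benchmark) / n
    tok = sum(b[2] for b in per_benchmark) / n
    return acc, sec, tok

def pareto_frontier(models):
    """models: {name: (accuracy, seconds)}. A model is on the frontier if no
    other model is at least as accurate AND at least as fast, with one strict."""
    frontier = []
    for name, (acc, sec) in models.items():
        dominated = any(
            other != name and oa >= acc and os_ <= sec and (oa > acc or os_ < sec)
            for other, (oa, os_) in models.items()
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Invented averages over four benchmarks, in the spirit of the figure:
models = {
    "fast-but-weak":   (62.0,  5.0),
    "phi-like":        (75.0,  8.0),
    "slow-but-strong": (78.0, 80.0),
    "dominated":       (70.0, 30.0),  # slower AND less accurate than "phi-like"
}
frontier = pareto_frontier(models)    # "dominated" falls off the frontier
```

A model like "dominated" is excluded because another point beats it on both axes at once; the three remaining points each win on at least one axis.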

In this post, we share the motivations, design choices, experiments, and learnings that informed its development, as well as an evaluation of the model’s performance and guidance on how to use it. Our goal is to contribute practical insight to the community on building smaller, efficient multimodal reasoning models and to share an open-weight model that is competitive with models of similar size at general vision-language tasks, excels at computer use, and excels on scientific and mathematical multimodal reasoning.

A focus on smaller and faster vision–language models

Many popular vision-language models (VLMs) have trended toward larger parameter counts and, in particular, larger numbers of tokens consumed and generated. This increases training and inference-time cost and latency, and impedes usability for downstream deployment, especially in resource-constrained or interactive settings.

A growing countertrend toward smaller models aims to boost efficiency through careful model design and data curation, a goal pioneered by the Phi family of models and furthered by Phi-4-reasoning-vision-15B. We specifically build on learnings from the Phi-4 and Phi-4-Reasoning language models and show how a multimodal model can be trained to cover a wide range of vision and language tasks without relying on extremely large training datasets, oversized architectures, or excessive inference-time token generation. Our model is intended to be lightweight enough to run on modest hardware while remaining capable of structured reasoning when it is beneficial, and it was trained with far less compute than many recent open-weight VLMs of similar size. We used just 200 billion tokens of multimodal data, building on Phi-4-reasoning (trained with 16 billion tokens), which is itself based on the core Phi-4 model (400 billion unique tokens), compared to the more than 1 trillion tokens used to train multimodal models like Qwen 2.5 VL and 3 VL, Kimi-VL, and Gemma3. The result is a compelling option among existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost.

 A travel blog caption task. Given a photo of Iguazu Falls, the model writes a personal, evocative caption referencing the rainbow, the mist, and the emotional experience.
Restaurant bill splitting. Given a photo of a receipt and instructions about who ordered what, the model calculates each person's share including half the tax, and returns the result as JSON.
Laundry care symbol interpretation. The model correctly identifies all five symbols: machine washable, do not bleach, tumble dry low, iron on low heat, do not dry clean.
Figure 2: Phi-4-Reasoning-Vision can help with a wide range of everyday tasks.
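The receipt-splitting example in Figure 2 reduces to simple arithmetic plus JSON serialization. A hypothetical sketch of that computation (the names, prices, and the equal-tax-share rule are invented for illustration, not taken from the article's receipt):

```python
import json

def split_bill(orders, tax, num_people=2):
    # orders: {person: [item prices]}. Assumed rule from the caption: each person
    # pays for their own items plus an equal share of the tax (half, for two people).
    share = {p: round(sum(items) + tax / num_people, 2) for p, items in orders.items()}
    return json.dumps(share, sort_keys=True)

# Invented example values:
print(split_bill({"alice": [12.50, 3.00], "bob": [9.75]}, tax=2.50))
# → {"alice": 16.75, "bob": 11.0}
```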

Lessons from training a multimodal model

Training a multimodal reasoning model raises numerous questions and requires many nuanced design choices around model architecture, dataset quality and composition, and the interaction between reasoning‑heavy and non-reasoning perception‑focused tasks.

Model architecture: Early- vs mid-fusion

Model architectures for VLMs differ primarily in how visual and textual information is fused. Mid-fusion models use a pretrained vision encoder to convert images into visual tokens that are projected into a pretrained LLM’s embedding space, enabling cross-modal reasoning while leveraging components already trained on trillions of tokens. Early-fusion models process image patches and text tokens in a single transformer, yielding richer joint representations but at significantly higher compute, memory, and data cost. We adopted a mid-fusion architecture, as it offers a practical trade-off for building a performant model with modest resources.
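The mid-fusion data flow can be sketched in a few lines of pure Python. Everything here is a stub with invented dimensions, intended to show the wiring (encode, project, concatenate) rather than any real implementation:

```python
# Toy mid-fusion data flow: image -> vision encoder -> visual tokens ->
# projection into the LLM embedding space -> concatenation with text tokens.
# All components are stubs with invented dimensions; no real model is involved.

VISION_DIM, LLM_DIM = 4, 6

def vision_encoder(image_patches):
    # Stub: each patch becomes one VISION_DIM-dimensional "visual token".
    return [[float(p)] * VISION_DIM for p in image_patches]

def project(visual_tokens):
    # Stub projection: tile each visual token out to the LLM embedding width.
    return [(tok * LLM_DIM)[:LLM_DIM] for tok in visual_tokens]

def fuse(image_patches, text_embeddings):
    # The fused sequence the LLM sees: projected visual tokens, then text tokens.
    return project(vision_encoder(image_patches)) + text_embeddings

text = [[0.0] * LLM_DIM] * 3   # 3 stub text-token embeddings
fused = fuse([1, 2], text)     # 2 image patches + 3 text tokens = 5-token sequence
assert len(fused) == 5 and all(len(e) == LLM_DIM for e in fused)
```

The key property of mid-fusion is visible in `fuse`: only the small projection bridges modalities, so the pretrained encoder and LLM can be reused largely as-is.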

Model architecture: Vision encoder and image processing

We build on the SigLIP-2 vision encoder and the Phi-4-Reasoning backbone. In previous research, we found that multimodal language models sometimes struggled to solve tasks not because of a lack of reasoning proficiency, but because of an inability to extract and select relevant perceptual information from the image. An example would be a high-resolution screenshot that is information-dense with relatively small interactive elements.

Several open-source multimodal language models have adapted their methodologies accordingly, e.g., Gemma3 uses pan-and-scan and NVILA uses Dynamic S2. However, their trade-offs are difficult to understand across different datasets and hyperparameters. To this end, we conducted an ablation study of several techniques. We trained a smaller 5 billion parameter Phi-4 based proxy model on a dataset of 10 million image-text pairs, primarily composed of computer-use and GUI grounding data. We compared with Dynamic S2, which resizes images to a rectangular resolution that minimizes distortion while admitting a tiling by 384×384 squares; Multi-crop, which splits the image into potentially overlapping 384×384 squares and concatenates their encoded features on the token dimension; Multi-crop with S2, which broadens the receptive field by cropping into 1536×1536 squares before applying S2; and Dynamic resolution using the Naflex variant of SigLIP-2, a natively dynamic-resolution encoder with adjustable patch counts.

Our primary finding is that dynamic resolution vision encoders perform the best and especially well on high-resolution data. It is particularly interesting to compare dynamic resolution with 2048 vs 3600 maximum tokens: the latter roughly corresponds to native HD 720p resolution and enjoys a substantial boost on high-resolution benchmarks, particularly ScreenSpot-Pro. Reinforcing the high-resolution trend, we find that multi-crop with S2 outperforms standard multi-crop despite using fewer visual tokens (i.e., fewer crops overall). The dynamic resolution technique produces the most tokens on average; due to their tiling subroutine, S2-based methods are constrained by the original image resolution and often only use about half the maximum tokens. From these experiments we choose the SigLIP-2 Naflex variant as our vision encoder.
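The tiling arithmetic behind these schemes is easy to sketch. The helper below approximates the resize-to-tileable idea with 384×384 tiles; the tile budget and tokens-per-tile values are assumptions for illustration, not the paper's actual constants:

```python
import math

TILE = 384  # tile side length used by the S2-style schemes described above

def tile_grid(width, height, max_tiles=64):
    # Round each side up to a whole number of 384-px tiles (minimal upscaling,
    # so distortion stays small), capped at an assumed tile budget.
    cols = max(1, math.ceil(width / TILE))
    rows = max(1, math.ceil(height / TILE))
    while cols * rows > max_tiles:  # shrink the grid if over budget
        cols, rows = max(1, cols - 1), max(1, rows - 1)
    return cols, rows

def visual_tokens(width, height, tokens_per_tile=256):
    # tokens_per_tile is an invented constant; real encoders vary.
    cols, rows = tile_grid(width, height)
    return cols * rows * tokens_per_tile

# A 720p screenshot (1280x720) needs a 4x2 grid of 384-px tiles:
assert tile_grid(1280, 720) == (4, 2)
```

This also illustrates why S2-based methods often use only part of the token budget: the grid is quantized to the original image's aspect ratio, while a natively dynamic-resolution encoder can spend tokens more freely.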

| Method | Max Tokens | MathVista | ScreenSpot | ScreenSpot-Pro | V*Bench |
|---|---|---|---|---|---|
| Dynamic-S2 | 3096 | 42.9 | 78.4 | 9.4 | 52.9 |
| Multi-crop | 3096 | 43.4 | 67.8 | 5.4 | 51.8 |
| Multi-crop with S2 | 2048 | 43.4 | 79.1 | **10.6** | **57.1** |
| Dynamic resolution | 2048 | **45.2** | **81.5** | 9.2 | 51.3 |
| Dynamic resolution | 3600 | **44.9** | **79.7** | **17.5** | **56.0** |

Table 1: Results with different resolution handling approaches. The top two configurations on each benchmark are in bold.

Data: Quality and composition

As with its language backbone Phi-4-Reasoning, Phi-4-reasoning-vision-15B was trained with a deliberate focus on data quality. Our final dataset consists primarily of data from three sources: open-source datasets that were meticulously filtered and improved; high-quality domain-specific internal data; and high-quality data from targeted acquisitions. The overwhelming majority of our data lies in the first category: data that originated as open source and was significantly filtered and improved, whether by removing low-quality datasets or records, programmatically fixing errors in data formatting, or using open-source images as seeds to synthetically generate higher-quality accompanying text.

The process of improving open-source data began by manually reviewing samples from each dataset. Typically, 5 to 10 minutes were sufficient to classify data as excellent-quality, good questions with wrong answers, low-quality questions or images, or high-quality with formatting errors. Excellent data was kept largely unchanged. For data with incorrect answers or poor-quality captions, we re-generated responses using GPT-4o and o4-mini, excluding datasets where error rates remained too high. Low-quality questions proved difficult to salvage, but when the images themselves were high quality, we repurposed them as seeds for new caption or visual question answering (VQA) data. Datasets with fundamentally flawed images were excluded entirely. We also fixed a surprisingly large number of formatting and logical errors across widely used open-source datasets.

We extracted additional value from existing datasets through reformatting, diversification, and using images as seeds for new data generation. We generated detailed image descriptions alongside original QA pairs for math and science data, had data perform “double-duty” by embedding instruction-following requirements directly into domain-specific QA, created “scrambled,” “caption-matching,” and “what’s changed?” records to improve multi-image reasoning and sequential navigation for CUA scenarios, and diversified prompt styles to encourage robustness beyond perfectly structured questions.

To supplement the improved open-source data, we utilize high-quality internal datasets, several math-specific datasets which were acquired during training of the Phi-4 language model, and also some domain-specific curated data; for example, latex-OCR data generated by processing and rendering equations from arXiv documents.

Top: a pie chart of the training data composition. Bottom: two example records, one returning bounding-box coordinates for a UI grounding task, the other using a step-by-step reasoning section to answer a chart question about expatriate populations, concluding with “Dubai.”
Figure 3: Phi-4-reasoning-vision-15B training data composition and examples

Data: Mathematics vs. computer-use data proportion

One of our goals was to train a model that performs well across general vision-language tasks, while excelling at mathematical and scientific reasoning and computer-use scenarios. How to structure datasets for generalizable reasoning remains an open question—particularly because the relationship between data scale and reasoning performance can lead to starkly different design decisions, such as training a single model on a large dataset versus multiple specialized models with targeted post-training.

Research on long-tailed classification robustness has suggested that balancing or removing data from overrepresented tasks or subgroups is an effective method for ensuring good performance. Nevertheless, these insights are not fully utilized or explored when it comes to training VLMs, which at times have favored scale over careful data balancing. To achieve our goals, we conducted a set of experiments to analyze a range of data ratios between our focus domains.

Using the same 5 billion parameter proxy model as for previous experiments, we trained while varying the amount of mathematics and science vs. computer-use data for each run. Each dataset included the same subset of 1 million general image-text pairs as a baseline. For mathematics and science data, we used a subsample of 150,000 records, optionally duplicating each one up to three times. Next, we included up to 450,000 computer-use records, and optionally an additional 400,000 from Phi-Ground.
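The run sizes in Table 2 follow directly from that recipe; a small bookkeeping sketch (record counts only, no actual data) reproduces the totals:

```python
def mixture_size(general=1_000_000, math_sci=150_000, math_dup=1,
                 cua=450_000, phi_ground=0):
    # Total records for one training run: the fixed general baseline, the
    # math/science subsample duplicated up to 3x, and the computer-use data,
    # optionally extended with the 400K Phi-Ground records.
    assert 1 <= math_dup <= 3
    return general + math_sci * math_dup + cua + phi_ground

# First row of Table 2: 1M general + 150K math + 450K CUA = 1.6M records.
assert mixture_size() == 1_600_000
# 3x math duplication with Phi-Ground added: 1M + 450K + 450K + 400K = 2.3M.
assert mixture_size(math_dup=3, phi_ground=400_000) == 2_300_000
```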

We found that multimodal mathematics and science performance was not harmed by additional computer-use data, and vice versa. Interestingly, we found that increasing mathematics data by 3x while keeping computer-use data constant improved math, science, and computer-use benchmarks.

| General | Math and Science | CUA | Total | MMMU | MathVista | ScreenSpot-V2 |
|---|---|---|---|---|---|---|
| 1M | 150K | 450K | 1.6M | 44.0 | 37.4 | 48.2 |
| 1M | 150K | 850K | 2.0M | 44.1 | 37.3 | 60.0 |
| 1M | 450K | 450K | 1.9M | 45.3 | 36.0 | 48.3 |
| 1M | 450K | 850K | 2.3M | 43.4 | 38.9 | 63.1 |
| 1M | 150K | 150K | 1.3M | 44.2 | 36.9 | 29.8 |
| 1M | 150K | 250K | 1.4M | 45.4 | 37.4 | 37.7 |

Table 2: Varying the ratios of math and CUA data. Increasing math data by 3x while keeping computer-use data constant improves both math and computer-use benchmarks.

Data: Synthetic data for text-rich visual reasoning

Recent work suggests that targeted synthetic data can materially improve multimodal reasoning, particularly for text-rich visual domains such as charts, documents, diagrams, and rendered mathematics. Using images, questions, and answers that are programmatically generated and grounded in the visual structure enables precise control over visual content and supervision quality, resulting in data that avoids many annotation errors, ambiguities, and distributional biases common in scraped datasets. This enables cleaner alignment between visual perception and multi-step inference, which has been shown to translate into measurable gains on reasoning-heavy benchmarks.

Synthetic text-rich images expand coverage of long-tail visual formats that are underrepresented in real data but disproportionately impact reasoning accuracy, improving not only visual grounding but also downstream reasoning by ensuring that failures are less often caused by perceptual errors. We found that programmatically generated synthetic data is a useful augmentation to high-quality real datasets — not a replacement, but a scalable mechanism for strengthening both perception and reasoning that complements the training objectives in compact multimodal models such as Phi-4-reasoning-vision-15B.

Mixing non-reasoning and reasoning as a design objective

In language-only settings, reasoning traces have improved performance on many tasks, but they require additional compute, which adds undesired latency. In multimodal settings, this tradeoff is less clear-cut: for tasks such as image captioning and optical character recognition (OCR), reasoning is often unnecessary and can even be harmful, while mathematical and scientific problem-solving benefits from multi-step reasoning. Thus, the choice of when to reason can be quite nuanced.

Training approaches for multimodal reasoning models

Language-only reasoning models are typically created through supervised fine-tuning (SFT) or reinforcement learning (RL): SFT is simpler but requires large amounts of expensive reasoning trace data, while RL reduces data requirements at the cost of significantly increased training complexity and compute. Multimodal reasoning models follow a similar process, but the design space is more complex. With a mid-fusion architecture, the first decision is whether the base language model is itself a reasoning or non-reasoning model. This leads to several possible training pipelines:

  • Non-reasoning LLM → reasoning multimodal training: Reasoning and multimodal capabilities are trained together.
  • Non-reasoning LLM → non-reasoning multimodal → reasoning multimodal training: Multimodal capabilities are learned first, then reasoning is added.
  • Reasoning LLM → reasoning multimodal training: A reasoning base is used, but all multimodal data must include reasoning traces.
  • Our approach: Reasoning LLM → mixed non-reasoning / reasoning multimodal training. A reasoning-capable base is trained on a hybrid data mixture, learning when to reason and when to respond directly.

Approaches 1 and 2 offer flexibility in designing multimodal reasoning behavior from scratch using widely available non-reasoning LLM checkpoints but place a heavy burden on multimodal training. Approach 1 must teach visual understanding and reasoning simultaneously and requires a large amount of multimodal reasoning data, while Approach 2 can be trained with less reasoning data but risks catastrophic forgetting, as reasoning training may degrade previously learned visual capabilities. Both risk weaker reasoning than starting from a reasoning-capable base. Approach 3 inherits strong reasoning foundations, but like Approach 1, it requires reasoning traces for all training data and produces reasoning traces for all queries, even when not beneficial.

Our approach: A mixed reasoning and non-reasoning model

Phi-4-reasoning-vision-15B adopts the fourth approach listed previously, as it balances reasoning capability, inference efficiency, and data requirements. It inherits a strong reasoning foundation but uses a hybrid approach to combine the strengths of the alternatives while mitigating their drawbacks. Our model defaults to direct inference for perception-focused domains where reasoning adds latency without improving accuracy, avoiding unnecessary verbosity and reducing inference costs, and it invokes longer reasoning paths for domains, such as math and science, that benefit from structured multi-step reasoning.

Our model is trained with SFT, where reasoning samples include delimited chain-of-thought sections before the final answer, covering domains like math and science. Non-reasoning samples start with a dedicated token signaling a direct response and cover perception-focused tasks such as captioning, grounding, OCR, and simple VQA. Reasoning data comprises approximately 20% of the total mix. Starting from a reasoning-capable backbone means this data grounds existing reasoning in visual contexts rather than teaching the model to reason from scratch.
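The hybrid SFT mix can be assembled roughly as below. The `<think>`/`<nothink>` marker names are placeholders invented for illustration (the article's actual token strings were lost in formatting); the 20/80 split comes from the text:

```python
import random

# Placeholder markers: the article's actual reasoning/non-reasoning token
# strings are not shown, so these names are invented for illustration.
THINK_OPEN, THINK_CLOSE, NOTHINK = "<think>", "</think>", "<nothink>"

def format_sample(question, answer, reasoning=None):
    if reasoning is not None:  # math/science: chain-of-thought before the answer
        return f"{question}\n{THINK_OPEN}{reasoning}{THINK_CLOSE}\n{answer}"
    return f"{question}\n{NOTHINK}{answer}"  # perception tasks: direct response

def build_mix(reasoning_samples, direct_samples, reasoning_frac=0.2, seed=0):
    # Keep reasoning data at ~20% of the total mix, per the article.
    n_reason = round(len(direct_samples) * reasoning_frac / (1 - reasoning_frac))
    mix = reasoning_samples[:n_reason] + direct_samples
    random.Random(seed).shuffle(mix)
    return mix

mix = build_mix([format_sample("q", "a", "steps")] * 10,
                [format_sample("q", "a")] * 8)
assert len(mix) == 10  # 2 reasoning + 8 direct samples
```

Because every sample carries an explicit mode marker, the model can learn from the data distribution when to emit a trace and when to answer directly.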

This approach is not without limitations. The balance between modes is a direct function of design choices we made, informed by recent literature and observed model behavior during training, though the boundary between modes can be imprecise, as it is learned implicitly from the data distribution. Our model allows control through explicit prompting with the reasoning and direct-response tokens when the user wants to override the default behavior. The 20/80 reasoning-to-non-reasoning data split may not be optimal for all domains or deployment contexts. Evaluating the ideal balance of data and the model’s ability to switch appropriately between modes remains an open problem.

We view this mixed approach not as a definitive solution, but as one practical and well-motivated point in the design space for balancing latency, accuracy, and flexibility in multimodal systems.

Applications

A multi-image reasoning example: five Hubble photos of Saturn from 2018–2022.
Figure 4: Phi-4-Reasoning-Vision can interpret sequences of images 

Phi-4-reasoning-vision-15B is a high-performing model across many vision-language tasks. It sees and understands the world by looking at a photo, document, chart, or screen and making sense of it. In practice that covers an enormous range of applications, including describing images and answering questions about them, interpreting changes and trends in image sequences, recognizing objects and landmarks, and transcribing text.

Highlights: Scientific and mathematical reasoning and supporting computer-using agents (CUA)

In addition to general vision and language tasks, Phi-4-reasoning-vision-15B was designed to excel at tasks that combine visual input with structured inference: solving math problems presented in visual form, such as handwritten or diagram-based questions; extracting and reasoning over quantitative information in documents and charts; and supporting multi-step reasoning in educational or scientific analysis contexts.

A physics problem about spring-mass systems, with two diagrams. The model correctly works through the spring constant relationships and arrives at answer B (0.433s).
Figure 5: Phi-4-reasoning-vision-15B is great at math and science 
A handwritten math homework checker. The student made a sign error in the quadratic formula (wrote −8 instead of +8). The model's thinking process catches the error and provides the corrected solution (x = 5 and x = 3).
Figure 6: Phi-4-reasoning-vision-15B can help with written math problems 

In addition, we trained Phi-4-reasoning-vision-15B with skills that enable agents to interact with graphical user interfaces by interpreting screen content and selecting actions. With strong high-resolution perception and fine-grained grounding capabilities, Phi-4-reasoning-vision-15B is a compelling base model for training agentic models, such as ones that navigate desktop, web, and mobile interfaces by identifying and localizing interactive elements like buttons, menus, and text fields. Its low inference-time compute also makes it well suited to interactive environments where low latency and a compact model size are essential.

A GUI interaction task, given a Windows 11 Start Menu screenshot.
A Google Shopping screenshot of heels. The model identifies all black heels, provides bounding box coordinates for each, and suggests outfit pairings (little black dress, tailored suit, jumpsuit).
Figure 7: Phi-4-reasoning-vision-15B can help navigate computer UIs

Evaluation

Phi-4-reasoning-vision-15B was evaluated for accuracy and timing using two complementary open-source frameworks to ensure both rigorous and standardized analysis: Eureka ML Insights and VLMEvalKit.

| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force nothink | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K |
|---|---|---|---|---|---|---|---|---|---|
| AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85 |
| ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84 |
| HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9 |
| MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2 |
| MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5 |
| MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8 |
| MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6 |
| MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3 |
| OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5 |
| ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9 |

Table 3: Accuracy comparisons relative to popular open-weight, non-thinking models
| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force thinking | Kimi-VL-A3B-Thinking | gemma-3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K |
|---|---|---|---|---|---|---|---|---|
| AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2 |
| ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1 |
| HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6 |
| MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2 |
| MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6 |
| MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8 |
| MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2 |
| MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7 |
| OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85 |
| ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1 |

Table 4: Accuracy comparisons relative to popular open-weight, thinking models

Our model balances thinking and non-thinking performance, on average showing better accuracy in the default “mixed-reasoning” behavior than when forcing thinking or non-thinking. Only in a few cases does forcing a specific mode improve performance (MathVerse_MINI and MMMU_VAL for thinking, and ScreenSpot_v2 for non-thinking). Compared to recent popular open-weight models, our model provides a desirable trade-off between accuracy and cost (as a function of inference-time compute and output tokens), as discussed previously.

Note: All numbers here are the result of running benchmarks ourselves and may be lower than other previously shared numbers. Instead of quoting leaderboards, we performed our own benchmarking, so we could understand scaling performance as a function of output token counts for related models. We made our best effort to run fair evaluations and used recommended evaluation platforms with model-specific recommended settings and prompts provided for all third-party models. For Qwen models we use the recommended token counts and also ran evaluations matching our max output token count of 4096. For Phi-4-reasoning-vision-15B, we used our system prompt and chat template but did not do any custom user-prompting or parameter tuning, and we ran all evaluations with temperature=0.0, greedy decoding, and 4096 max output tokens. These numbers are provided for comparison and analysis rather than as leaderboard claims. For maximum transparency and fairness, we will release all our evaluation logs publicly. For more details on our evaluation methodology, please see our technical report.

Safety

As with other Phi models, Phi-4-reasoning-vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft’s Responsible AI Principles. For further details, check out our technical report.

Open release and community engagement

Phi-4-reasoning-vision-15B is available on Microsoft Foundry and HuggingFace, with additional examples and details on GitHub. For additional guidance on how to use our model properly and safely, please refer to our Model card. For further details on the technical aspects of the model, training, and evaluation, see our technical report.

In line with our goal of supporting future AI development in the community, Phi-4-reasoning-vision-15B is released under a permissive license with model weights, fine‑tuning code, and benchmark logs. We intend this release to complement existing work by providing concrete artifacts that help close gaps in understanding how compact multimodal reasoning models can be built and studied.

Looking forward

Smaller vision–language models with selective, task‑aware reasoning offer one promising direction for making multimodal systems more practical and accessible. We present our model and the lessons we learned to inform ongoing research in multimodal modeling, computer‑using agents, and mathematical and scientific reasoning. We hope these details are useful to researchers exploring similar tradeoffs and invite critical evaluation, replication, and extension by the community. If you’d like to join us and help shape the future of multimodal models, please apply for one of our open roles.

Acknowledgements

We thank Rachel Ward for her extensive work on data collection and curation. We thank the GenDatasets, PhiGround, SimCity, and Fara-7B efforts for invaluable training data. We thank Harkirat Behl, Mojan Javaheripi, and Suriya Gunasekar for providing us with Phi-4 checkpoints and guidance on training with Phi models. We additionally thank Sahaj Agarwal, Ahmed Awadallah, Qi Dai, Gustavo de Rosa, Rafah Hosn, Ece Kamar, Piero Kauffmann, Yash Lara, Chong Luo, Caio César Teodoro Mendes, Akshay Nambi, Craig Presti, Matthew Rosoff, Corby Rosset, Marco Rossi, Kashyap Patel, Adil Salim, Sidhartha Sen, Shital Shah, Pratyusha Sharma, Alexey Taymanov, Vibhav Vineet, John Weiss, Spencer Whitehead, the AI Frontiers Team and Leadership, and Microsoft Research Leadership, for their valuable help, insightful discussions, and continued support throughout this work.

The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.

Read the whole story
alvinashcraft
39 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Cursor Joined the ACP Registry and Is Now Live in Your JetBrains IDE

Cursor is now available as an AI agent inside JetBrains IDEs through the Agent Client Protocol. Select it from the agent picker, and it has full access to your project.

If you’ve spent any time in the AI coding space, you already know Cursor. It has been one of the most requested additions to the ACP Registry.

What you get

Cursor is known for its AI-native, agentic workflows. JetBrains IDEs are valued for deep code intelligence – refactoring, debugging, code quality checks, and the tooling professionals rely on at scale. ACP brings the two together.

You can now use Cursor’s agentic capabilities directly inside your JetBrains IDE – within the workflows and features you already use. 

A growing open ecosystem

Cursor joins a growing list of agents available through ACP in JetBrains IDEs. Every new addition to the ACP Registry means you have more choice – while still working inside the IDE you already rely on. You get access to frontier models from major providers, including OpenAI, Anthropic, Google, and now also Cursor.

This is part of our open ecosystem strategy. Plug in the agents you want and work in the IDE you love – without getting locked into a single solution.

Cursor is focused on building the best way to build software with AI. By integrating Cursor with JetBrains IDEs, we’re excited to provide teams with powerful agentic capabilities in the environments where they’re already working.

– Jordan Topoleski, COO at Cursor

Get started

You need version 2025.3.2 or later of your JetBrains IDE with the AI Assistant plugin enabled. From there, open the agent selector, select Install from ACP Registry…, install Cursor, and start working. You don’t need a JetBrains AI subscription to use Cursor as an AI agent.

The ACP Registry keeps growing, and many agents have already joined it – with more on the way. Try it today with Cursor and experience agent-driven development inside your JetBrains IDE. For more information about the Agent Client Protocol, see our original announcement and the blog post on the ACP Agent Registry support.


The Glossary You Must Read If You Wanna Talk About AI

AI terminology can be confusing, especially when words like agents, skills, tools, and LLMs get used interchangeably.

That’s why I put together this glossary as a quick reference, to explain these concepts and help everyone, technical or not, talk about AI clearly.

Agent Skill

An agent skill is a predefined capability or behavior that an AI agent uses to accomplish specific tasks like searching the web, writing code, sending emails, or reading files. Skills give agents a structured way to interact with tools, APIs, or data sources, making them more reliable and reusable across workflows. Think of them as modular “superpowers” you can plug into an agent.

At a minimum, skills are just folders the agent reads, containing logic, instructions, assets, templates, and more. Most of today’s state-of-the-art agent apps let you create your own custom skills.

MCP

MCP (Model Context Protocol) is an open standard that lets AI agents connect to external tools and data sources consistently. Instead of creating a custom integration for every service (like Slack, Google Drive, or GitHub), MCP provides a universal “plug-in” format, allowing any MCP-compatible server to communicate with any MCP-compatible AI.

Think of it as USB-C, but for AI tool integrations.

Agent Tool

A tool is a function (code) that an AI agent can execute when it decides to. That’s why each tool has a name and a description, which influence when the model chooses to use it (for example, “Use this function to pull the latest tickets from a Jira project”).

Besides the name and description, the function contains the code that the AI agent runs with the required arguments. For example, the agent could call:

jira_fetch_tickets(project="AI", limit=10)

Tools are also the components that power MCP servers. In the Open WebUI project, users can even write custom Python tools that agents can invoke.
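
The pieces above (a name, a description that guides the model's choice, and the executable function) can be sketched in a few lines of Python. This is an illustrative sketch rather than any specific framework's API, and the `jira_fetch_tickets` function is a stand-in that returns fake data instead of calling Jira:

```python
# Illustrative sketch of an agent tool: a schema (whose name and
# description guide the model's choice) paired with the function the
# agent actually runs. The Jira call is a stand-in; a real tool would
# hit the Jira REST API.

def jira_fetch_tickets(project: str, limit: int = 10) -> list[dict]:
    """Stand-in that pretends to pull the latest tickets from a Jira project."""
    return [{"project": project, "id": f"{project}-{i}"} for i in range(1, limit + 1)]

# Schema in the JSON-schema style most function-calling APIs use:
jira_tool = {
    "name": "jira_fetch_tickets",
    "description": "Use this function to pull the latest tickets from a Jira project.",
    "parameters": {
        "type": "object",
        "properties": {
            "project": {"type": "string"},
            "limit": {"type": "integer", "default": 10},
        },
        "required": ["project"],
    },
}

# When the model emits a tool call, the runtime dispatches by name:
registry = {jira_tool["name"]: jira_fetch_tickets}
result = registry["jira_fetch_tickets"](project="AI", limit=3)
print([t["id"] for t in result])  # ['AI-1', 'AI-2', 'AI-3']
```

The split matters: the schema is what the model sees when deciding whether to call the tool; the function body is what actually executes.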

Large Language Model (LLM)

An AI model within the deep learning spectrum, primarily designed for language understanding and content generation. LLMs excel at processing and generating human-like text.

Token

In Generative AI, a token is the smallest unit of information a language model processes. Depending on the language and the model’s design, a token can represent a whole word, part of a word, or even a single character. Tokens are the building blocks that language models use to understand and generate text.

Context Window

The context window is the number of tokens an LLM can process as input or generate as output. Input and output limits are usually different, with the input capacity typically much larger than the output.
For example, GPT-4o has an input limit of 128,000 tokens and an output limit of 16,384 tokens.
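
The budget arithmetic is simple: whatever the system prompt and conversation history consume comes out of the input limit. A minimal sketch using the GPT-4o limits quoted above:

```python
# Rough context-budget arithmetic using GPT-4o's published limits
# (128,000 input tokens, 16,384 output tokens, per the text above).
INPUT_LIMIT = 128_000
OUTPUT_LIMIT = 16_384

def remaining_input_budget(system_tokens: int, history_tokens: int) -> int:
    """Tokens still available for new user input and retrieved context."""
    return INPUT_LIMIT - system_tokens - history_tokens

budget = remaining_input_budget(system_tokens=500, history_tokens=20_000)
print(budget)  # 107500
```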

Fine-Tuning

Fine-tuning is the process of modifying a language model’s neural network using your own data. It’s different from simply adding documents to a conversation or adjusting prompts, which don’t change the model’s underlying structure (see RAG for an alternative approach).

Retrieval Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a system that uses a vector database to store ingested data, such as documents, web pages, and other sources. When a question is asked, relevant data is retrieved and combined with the question before being sent to a language model (LLM). The LLM itself doesn’t change, but it “sees” the retrieved information, allowing it to answer based on this additional context.
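
The retrieval step can be sketched in a few lines: embed the documents, embed the question, and take the most similar documents as extra context for the prompt. The toy 3-dimensional vectors and document names below are hand-made stand-ins for real embeddings and a real vector database:

```python
import math

# Minimal RAG retrieval sketch. A real system would use an embedding
# model and a vector database; these toy vectors are hand-made.
docs = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.0],
    "office hours":   [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, k=1):
    """Return the k document names most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

# Pretend the user's question embedded close to "refund policy":
context = retrieve([0.8, 0.2, 0.1])
prompt = f"Answer using this context: {context}\n\nQuestion: Can I get my money back?"
print(context)  # ['refund policy']
```

The retrieved text is simply prepended to the question before it is sent to the LLM, which is why the model itself never changes.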

Prompt

A prompt is the user’s input that kicks off the model’s text generation. It guides the model to produce relevant and coherent responses based on the context or question. Prompts can be simple or detailed, shaping the quality and direction of the output.

ChatGPT

ChatGPT is an OpenAI product built on the GPT family of large language models (LLMs). As a product, it can use different LLMs, such as GPT-4o, GPT-4o-mini, and o1.

AI Agent

An AI agent is a software system powered by LLMs that performs tasks, answers questions, and automates processes for users. Agents range from simple chatbots to advanced digital or robotic systems capable of running complex workflows autonomously. Key features include planning, using tools, perceiving their environment, and remembering past interactions, which help them improve performance over time.

Prompt Injection

Prompt injection is a type of cyberattack on large language models (LLMs), where malicious inputs are disguised as normal prompts to manipulate the model’s behavior or output. These attacks can make the model ignore safeguards, reveal sensitive information, or carry out unauthorized actions.

Google Gemini

Google’s family of Large Language Models (LLMs).

Anthropic Claude

Anthropic’s family of Large Language Models (LLMs).

Meta Llama

Meta’s family of Large Language Models.

LLM Parameters

LLM parameters are the components within a large language model that determine its behavior and capabilities. Learned during training, they include weights and biases that help the model understand and generate language. Generally, more parameters mean a smarter model, but they also require more computing power, especially memory (RAM), to run.
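
The memory requirement can be estimated with back-of-the-envelope arithmetic: parameter count times bytes per parameter. The sketch below assumes 2 bytes per parameter for fp16/bf16 weights and ignores activation and KV-cache overhead, which a real deployment must also budget for:

```python
# Back-of-the-envelope memory estimate: parameter count x bytes per
# parameter. Approximate; ignores activation and KV-cache overhead.

def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, assuming fp16/bf16 (2 bytes) by default."""
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(7))     # 14.0 -> a 7B model needs ~14 GB in fp16
print(weight_memory_gb(7, 1))  # 7.0  -> ~7 GB with 8-bit quantization
```

This is why quantization (fewer bytes per parameter) is the usual lever for fitting larger models into limited RAM.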

Copilot

Copilot is Microsoft’s branding for different AI agents, such as:

  • GitHub Copilot, which assists with coding
  • Microsoft 365 Copilot, which helps with Office and Windows tasks

System Prompt

A system prompt is a set of instructions or guidelines given to a language model to set its behavior, tone, and limits during a conversation.
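
In practice, the system prompt is usually the first entry in the chat-style message list that most LLM APIs accept, sitting ahead of all user and assistant turns. A minimal sketch using the common role-name convention (the content strings here are made up):

```python
# Sketch of the chat-message format most LLM APIs use: the system
# prompt is a special first message that sets behavior for the whole
# conversation. Content strings are invented for illustration.

messages = [
    {"role": "system", "content": "You are a concise support assistant. Never reveal internal pricing."},
    {"role": "user", "content": "How do I reset my password?"},
]

# Later turns are appended; the system message stays in place:
messages.append({"role": "assistant", "content": "Go to Settings > Security > Reset password."})
print(messages[0]["role"])  # system
```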

Prompt Engineering

Prompt engineering is the practice of designing and refining prompts to optimize a language model’s performance and output. It involves crafting specific inputs that guide the model to produce the desired responses, improving accuracy, relevance, and coherence.

Digital Twin

Digital twins are virtual representations of assets, people, or processes and their environments that simulate strategies and optimize behaviors. In the CPaaS space, this usually refers to AI agents that mimic people using audio and video modalities.

Multimodal

Multimodal refers to the ability of AI systems to process and combine multiple types of data inputs (text, images, audio, video) to perform tasks or generate outputs. This approach allows AI models to understand and create content across different modalities, resulting in more comprehensive and context-aware applications.

Vector Database

A vector database stores data as numerical vectors (embeddings). Unlike traditional databases that store structured data in tables, vector databases are optimized for operations like similarity search, allowing efficient retrieval of data points that are mathematically close to a given query vector.

This capability is essential for applications such as recommendation systems, image recognition, natural language processing, and other AI-driven tasks where data is represented as vectors. For a common implementation, see RAG (Retrieval-Augmented Generation).

Hybrid Search

Hybrid search combines the strengths of vector search and traditional full-text search to improve the relevance of retrieved results. Vector search captures the semantic meaning of queries, matching based on context and intent, while full-text search ensures precise keyword matches.

By blending these approaches, hybrid search increases the likelihood of retrieving the most relevant documents, even when queries are vague or phrased differently from the source content. This boosts the accuracy of retrieval and enhances the overall effectiveness of the RAG (Retrieval-Augmented Generation) pipeline.
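
One common (though not the only) way to blend the two rankings is reciprocal rank fusion, which rewards documents that appear near the top of both lists. A small sketch with hand-made rankings:

```python
# Reciprocal rank fusion (RRF): one common way to blend vector and
# keyword rankings. The two ranked lists are hand-made; real ones
# would come from a vector index and a full-text engine.

def rrf(rankings, k=60):
    """Score each doc by summing 1/(k + rank) across all rankings."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_a", "doc_c", "doc_b"]   # semantic matches
keyword_hits = ["doc_b", "doc_a", "doc_d"]   # exact-term matches

fused = rrf([vector_hits, keyword_hits])
print(fused[0])  # doc_a -- ranked 1st and 2nd, beating both single-list leaders
```

The constant `k` dampens the influence of any single list, so a document that is merely decent in both rankings can outrank one that tops only one of them.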

Embedding Model

An embedding model is a machine learning model trained to convert input text into numerical vectors, which can then be used for vector similarity search. Embedding models are a key part of the RAG (Retrieval-Augmented Generation) pipeline, as they transform user questions into vector representations.

The post The Glossary You Must Read If You Wanna Talk About AI appeared first on ShiftMag.


Say hello to MacBook Neo

Apple today unveiled MacBook Neo, an all-new laptop that delivers the magic of the Mac at a breakthrough price.


The “Data Center Rebellion” Is Here

This post first appeared on Ben Lorica’s Gradient Flow Substack newsletter and is being republished here with the author’s permission.

Even the most ardent cheerleaders for artificial intelligence now quietly concede we are navigating a massive AI bubble. The numbers are stark: Hyperscalers are deploying roughly $400 billion annually into data centers and specialized chips while AI-related revenue hovers around $20 billion—a 20-to-1 capital-to-revenue ratio that stands out even in infrastructure cycles historically characterized by front-loaded spending. To justify this deployment on conventional investment metrics, the industry would need a step change in monetization over a short window to make the numbers work.

While venture capitalists and tech executives debate the “mismatch” between compute and monetization, a more tangible crisis is unfolding far from Silicon Valley. A growing grassroots opposition to AI data centers remains largely below the radar here in San Francisco. I travel to Sioux Falls, South Dakota, a few times a year to visit my in-laws. It’s not a region known for being antibusiness. Yet even there, a “data center rebellion” has been brewing. Even though the recent attempt to overturn a rezoning ordinance did not succeed, the level of community pushback in the heart of the Midwest signals that these projects no longer enjoy a guaranteed green light.

This resistance is not merely reflexive NIMBYism. It represents a sophisticated multifront challenge to the physical infrastructure AI requires. For leadership teams planning for the future, this means “compute availability” is no longer just a procurement question. It is now tied to local politics, grid stability, water management, and city approval processes. In the course of trying to understand the growing opposition to AI data centers, I’ve been examining the specific drivers behind this opposition and why the assumption of limitless infrastructure growth is colliding with hard constraints.

The grid capacity crunch and the ratepayer revolt

AI data centers function as grid-scale industrial loads. Individual projects now request 100+ megawatts, and some proposals reach into the gigawatt range. One proposed Michigan facility, for example, would consume 1.4 gigawatts, nearly exhausting the region’s remaining 1.5 gigawatts of headroom and roughly matching the electricity needs of about a million homes. This happens because AI hardware is incredibly dense and uses a massive amount of electricity. It also runs constantly. Since AI work doesn’t have “off” hours, power companies can’t rely on the usual quiet periods they use to balance the rest of the grid.

The politics come down to who pays the bill. Residents in many areas have seen their home utility rates jump by 25% or 30% after big data centers moved in, even though they were promised rates wouldn’t change. People are afraid they will end up paying for the power company’s new equipment. This happens when a utility builds massive substations just for one company, but the cost ends up being shared by everyone. When you add in state and local tax breaks, it gets even worse. Communities deal with all the downsides of the project, while the financial benefits are eaten away by tax breaks and credits.

The result is a rare bipartisan alignment around a simple demand: Hyperscalers should pay their full cost of service. Notably, Microsoft has moved in that direction publicly, committing to cover grid-upgrade costs and pursue rate structures intended to insulate residential customers—an implicit admission that the old incentive playbook has become a political liability (and, in some places, an electoral one).

AI scale-up to deployable compute

Water wars and the constant hum

High-density AI compute generates immense heat, requiring cooling systems that can consume millions of gallons of water daily. In desert municipalities like Chandler and Tucson, Arizona, this creates direct competition with agricultural irrigation and residential drinking supplies. Proposed facilities may withdraw hundreds of millions of gallons annually from stressed aquifers or municipal systems, raising fears that industrial users will deplete wells serving farms and homes. Data center developers frequently respond with technical solutions like dry cooling and closed-loop designs. However, communities have learned the trade-off: Dry cooling shifts the burden to electricity, and closed-loop systems still lose water to the atmosphere and require constant refills. The practical outcome is that cooling architecture is now a first-order constraint. In Tucson, a project known locally as “Project Blue” faced enough pushback over water rights that the developer had to revisit the cooling approach midstream.

Beyond resource consumption, these facilities create a significant noise problem. Industrial-scale cooling fans and backup diesel generators create a “constant hum” that represents daily intrusion into previously quiet neighborhoods. In Florida, residents near a proposed facility serving 2,500 families and an elementary school cite sleep disruption and health risks as primary objections, elevating the issue from nuisance to harm. The noise also hits farms hard. In Wisconsin, residents reported that the low-frequency hum makes livestock, particularly horses, nervous and skittish. This disrupts farm life in a way that standard commercial development just doesn’t. This is why municipalities are tightening requirements: acoustic modeling, enforceable decibel limits at property lines, substantial setbacks (sometimes on the order of 200 feet), and berms that are no longer “nice-to-have” concessions but baseline conditions for approval.

The $3 trillion question

The jobs myth meets the balance sheet

Communities are questioning whether the small number of jobs created is worth the local impact. Developers highlight billion-dollar capital investments and construction employment spikes, but residents focus on steady-state reality: AI data centers employ far fewer permanent workers per square foot than manufacturing facilities of comparable scale. Chandler, Arizona, officials noted that existing facilities employ fewer than 100 people despite massive physical footprints. Wisconsin residents contrast promised “innovation campuses” with operational facilities requiring only dozens to low hundreds of permanent staff—mostly specialized technicians—making the “job creation” pitch ring hollow. When a data center replaces farmland or light manufacturing, communities weigh not just direct employment but opportunity cost: lost agricultural jobs, foregone retail development, and mixed-use projects that might generate broader economic activity.

Opposition scales faster than infrastructure: One local win becomes a national template for blocking the next project.

The secretive way these deals are made is often what fuels the most anger. A recurring pattern is what some call the “sleeping giant” dynamic: Residents learn late that officials and developers have been negotiating for months, often under NDAs, sometimes through shell entities and codenames. In Wisconsin, Microsoft’s “Project Nova” became a symbol of this approach; in Minnesota’s Hermantown, a year of undisclosed discussions triggered similar backlash. In Florida, opponents were furious when a major project was tucked into a consent agenda. Since these agendas are meant for routine business, it felt like a deliberate attempt to bypass public debate. Trust vanishes when people believe advisors have a conflict of interest, like a consultant who seems to be helping both the municipality and the developer. After that happens, technical claims are treated as nothing more than a sales pitch. You won’t get people back on board until you provide neutral analysis and commitments that can actually be enforced.

Data center in the community

From zoning fight to national constraint

What started as isolated neighborhood friction has professionalized into a coordinated national movement. Opposition groups now share legal playbooks and technical templates across state lines, allowing residents in “frontier” states like South Dakota or Michigan to mobilize with the sophistication of seasoned activists. The financial stakes are real: Between April and June 2025 alone, approximately $98 billion in proposed projects were blocked or delayed, according to Data Center Watch. This is no longer just a zoning headache. It’s a political landmine. In Arizona and Georgia, bipartisan coalitions have already ousted officials over data center approvals, signaling to local boards that greenlighting a hyperscale facility without deep community buy-in can be a career-ending move.

The US has the chips, but China has centralized command over power and infrastructure.

The opposition is also finding an unlikely ally in the energy markets. While the industry narrative is one of “limitless demand,” the actual market prices for long-term power and natural gas aren’t spiking; they’re staying remarkably flat. There is a massive disconnect between the hype and the math. Utilities are currently racing to build nearly double the capacity that even the most optimistic analysts project for 2030. This suggests we may be overbuilding “ghost infrastructure.” We are asking local communities to sacrifice their land and grid stability for a gold rush that the markets themselves don’t fully believe in.

This “data center rebellion” creates a strategic bottleneck that no amount of venture capital can easily bypass. While the US maintains a clear lead in high-end chips, we are hitting a wall on how we manage the mundane essentials like electricity and water. In the geopolitical race, the US has the chips, but China has the centralized command over infrastructure. Our democratic model requires transparency and public buy-in to function. If US companies keep relying on secret deals to push through expensive, overbuilt infrastructure, they risk a total collapse of community trust.


