How to Set Up SMTP for WordPress Emails and Contact Forms

Imagine setting up a lead capture or newsletter registration form, only to later find out that subscribers haven’t been receiving your emails. It’s very frustrating, and unfortunately, it’s one of the most common problems WordPress users face. 

This is why you need to set up SMTP (Simple Mail Transfer Protocol). Without it, emails containing contact form submissions, password reset links, new user registration notices, and WooCommerce order confirmations will likely end up in spam folders — or won’t arrive at all.

Below, we’ll take a closer look at SMTP and how it improves email deliverability. We’ll then show you how to set it up in WordPress using a plugin and troubleshoot common SMTP issues. 

How WordPress sends emails by default

WordPress uses the wp_mail() function to send emails. Under the hood, it relies on the basic mail() function provided by PHP, the programming language WordPress is built with.

While wp_mail() gets the job done, it’s not secure or reliable. This is because it doesn’t authenticate emails sent from your website. 

As a result, your emails will lack the trust signals that email clients like Gmail and Outlook use to verify the authenticity of the sender. This means that they’ll likely be flagged as spam or blocked entirely.

Additionally, many hosting servers aren’t configured to handle complex email protocols. They’re primarily optimized to serve web pages. Therefore, messages sent via wp_mail() may be blocked, delayed, or marked as suspicious by receiving servers. 

Plus, some providers block common SMTP ports or throttle email traffic to prevent abuse. This can further impact email deliverability.

Note that web hosting and email delivery are two different services. Web hosting serves your website to visitors, while email delivery requires configured mail servers, authentication protocols, and sender reputation management.

So, even if your hosting provider lets you create a custom email address for your website, you’ll still want to use a dedicated SMTP service to significantly improve the chances of your emails arriving in your recipients’ inboxes.

What is SMTP (Simple Mail Transfer Protocol)?

SMTP is the standard communication protocol for sending emails across the Internet. Unlike the default wp_mail() method in WordPress, SMTP requires authentication. It also uses encryption to ensure that your messages are delivered securely.

SMTP connects to a mail server (like Gmail, Outlook, or a transactional email service, which your web host may also offer) using a username and password. This connection is encrypted with SSL or TLS, which protects both the sender and the recipient. 
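
To make that handshake concrete, here is a minimal sketch using Python's built-in smtplib module. The host, port, and credentials are placeholders, so swap in the values your SMTP provider gives you.

import smtplib
from email.message import EmailMessage

# Placeholder values: use the host, port, and credentials from your SMTP provider
SMTP_HOST = "smtp.example.com"
SMTP_PORT = 587                      # 587 is the common STARTTLS submission port
SMTP_USER = "you@example.com"
SMTP_PASS = "app-specific-password"

msg = EmailMessage()
msg["From"] = SMTP_USER
msg["To"] = "recipient@example.com"
msg["Subject"] = "SMTP test"
msg.set_content("Sent over an authenticated, encrypted SMTP connection.")

with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
    server.starttls()                    # upgrade the connection to TLS
    server.login(SMTP_USER, SMTP_PASS)   # authenticate with your credentials
    server.send_message(msg)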

How SMTP enhances WordPress email deliverability

Now that you have a clearer idea of what SMTP is, let’s talk about how it boosts WordPress email deliverability. 

For starters, SMTP requires you to authenticate your identity using credentials provided by your email service provider. This ensures that the messages originate from a trusted source (i.e. your website).

Encrypted connections (using SSL or TLS) add another layer of security. These ensure that your emails can’t be intercepted by cyber criminals during transit.

When emails are sent through a trusted SMTP server, they are more likely to be accepted by receiving servers. This improves your domain’s sender reputation over time, so your messages have a lower chance of being marked as spam.

SMTP is supported by all major email services, including Gmail, Outlook/Office 365, Zoho, SendGrid, Mailgun, and Amazon SES. These services provide the credentials you need to send authenticated emails through your WordPress site.

How to set up SMTP in WordPress (a step-by-step guide)

Now, let’s show you how to set up SMTP with a WordPress plugin.

1. Install a WordPress SMTP plugin

One of the best SMTP plugins on the market is MailPoet. Designed by Automattic (the people behind WordPress.com), this email plugin comes with a built-in SMTP sending service that’s very easy to activate. Plus, it boasts a global delivery rate of nearly 99%.

To get started, go to Plugins → Add Plugin and use the search bar to find the tool.

MailPoet listing in the WordPress plugin repository

Click on Install Now and Activate. WordPress will direct you to MailPoet’s setup page.

MailPoet page with the text "better email without leaving WordPress"

Note that you’ll need a MailPoet account to continue. You can create one for free.

option to connect to a MailPoet account

Once you’ve created an account, you can simply connect it to your WordPress website. 

2. Configure SMTP plugin settings

Now, you can configure your SMTP plugin settings. In your WordPress dashboard, go to MailPoet → Settings and select the Send With tab.

MailPoet sending service settings

Here, you can select MailPoet Sending Service to reroute all WordPress emails through the plugin’s SMTP solution. This is available for free for up to 5,000 emails per month. 

You also have the option to connect the plugin to another SMTP service like SendGrid and Amazon SES. To do this, select Other and follow the instructions to connect to your chosen service. 

How to test your WordPress SMTP setup

Most SMTP plugins have a built-in test email feature. You can use it to send a message to your own address and confirm a successful delivery.

You can also submit a form from your website to make sure that contact form notifications are delivered correctly. This is especially useful for checking formatting, reply-to headers, and other content. 

If you use WooCommerce, you could even do a test purchase to check that order notifications and confirmations are received. 

Once you’ve received your test email, you’ll want to examine its headers to ensure that your domain is passing authentication checks. This also helps you confirm that your DNS records (SPF, DKIM, and DMARC) are properly configured and that your SMTP connection is working as it should.

Open the email in your inbox and locate the option to “view original message” or “view headers.” This varies by email client, so you might need to refer to their documentation to see how it’s done. 

If you’re using Gmail, click on the three dots near the sender information and look for the Show original option.

Show original option in Gmail

In the page that opens, you’ll see more information about the message. Look for the following authentication results:

  • spf=pass: Shows that the IP address sending the email is authorized by the domain’s SPF record
  • dkim=pass: Confirms that the email hasn’t been altered during transmission, and that the signature matches the domain’s DKIM key
  • dmarc=pass: Means that SPF and/or DKIM align with your domain’s policies

If any of these checks fail, there might be a problem with your DNS configuration or how your SMTP service handles message signing.
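
If you would rather check these results programmatically (for example, across several test messages), here is a minimal sketch using Python's standard email library. It assumes you've downloaded the raw message to a file, such as via the download option on Gmail's Show original page; the file name below is a placeholder.

from email import policy
from email.parser import BytesParser

# Placeholder path: point this at the raw message you downloaded
with open("test-email.eml", "rb") as f:
    msg = BytesParser(policy=policy.default).parse(f)

# Gmail and most providers record SPF/DKIM/DMARC results in this header
results = msg.get_all("Authentication-Results") or []

for header in results:
    for check in ("spf", "dkim", "dmarc"):
        status = "pass" if f"{check}=pass" in header.lower() else "not passing"
        print(f"{check.upper()}: {status}")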

Troubleshooting common SMTP issues

Setting up SMTP is a straightforward process, but you might encounter some issues. Let’s look at the most commonly reported problems, and how to solve them.

Invalid SMTP credentials

Incorrect credentials are one of the most common SMTP issues. So, you’ll want to double-check your username and password or API key. 

If your provider requires an app-specific password, make sure that you’ve generated it correctly in your email provider’s account settings. If you’re using OAuth authentication, check that the token is still valid and hasn’t expired.

Ports blocked by hosting provider

To prevent spam, some web hosting providers block outbound SMTP traffic on ports like 587 and 465. 

If you’ve done everything correctly but your test email fails to send, you’ll want to reach out to your host. They may need to unblock the necessary ports, or provide an alternative method for sending emails via SMTP.
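
Before opening a support ticket, you can quickly check whether an SMTP port is even reachable from your server. This sketch uses Python's standard socket module; the host is a placeholder, so swap in your provider's SMTP host (2525 is a common alternative port some services offer).

import socket

SMTP_HOST = "smtp.example.com"   # placeholder: your provider's SMTP host
PORTS = [587, 465, 2525]

for port in PORTS:
    try:
        # Try to open a TCP connection with a short timeout
        with socket.create_connection((SMTP_HOST, port), timeout=5):
            print(f"Port {port}: reachable")
    except OSError as exc:
        print(f"Port {port}: blocked or unreachable ({exc})")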

DNS records not propagating

Services like SendGrid, Mailgun, and Amazon SES typically require DNS verification to confirm domain ownership and allow email sending. This involves setting up SPF, DKIM, and sometimes DMARC records. 

These records usually take up to 48 hours to propagate. Until propagation completes, your emails might fail authentication and be marked as suspicious. You can use a DNS propagation checker to monitor their status.
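
You can also look up the records yourself. The sketch below uses the dnspython package (pip install dnspython); the domain and DKIM selector are placeholders, and your SMTP service's documentation will tell you which selector it uses.

import dns.resolver  # pip install dnspython

DOMAIN = "example.com"        # placeholder: your sending domain
DKIM_SELECTOR = "s1"          # placeholder: selector provided by your SMTP service

def txt_records(name: str) -> list[str]:
    try:
        return [r.to_text() for r in dns.resolver.resolve(name, "TXT")]
    except Exception as exc:
        return [f"lookup failed: {exc}"]

print("SPF:", txt_records(DOMAIN))                                   # look for v=spf1
print("DKIM:", txt_records(f"{DKIM_SELECTOR}._domainkey.{DOMAIN}"))  # look for v=DKIM1
print("DMARC:", txt_records(f"_dmarc.{DOMAIN}"))                     # look for v=DMARC1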

Once the propagation period is over, send a test email again. If it’s not working, there might be another issue. 

Plugin conflicts

If you have multiple plugins on your website that are trying to modify mail settings, it can lead to conflicts that affect email delivery. You might also encounter an issue if you have a form builder on your website that’s not compatible with the SMTP plugin or service that you’re using. 

If emails suddenly stop working, try deactivating recently installed or updated plugins. You’ll want to focus on tools related to email, website security, or performance optimization. 

If your emails work when a particular plugin is deactivated, investigate the issue by reaching out to its developers. 

Two-factor authentication and app passwords

If your account has two-factor authentication (2FA) enabled, you might not be able to use your regular password for SMTP. 

Most providers like Gmail and Outlook offer the option to create an app-specific password, and you’ll want to use this in your SMTP plugin (instead of your main account password). If you’re unsure, you’ll likely find information about 2FA and passwords in your provider’s documentation.

Gmail API quota exceeded

If you’ve configured SMTP using Gmail’s API, you’ll likely encounter daily sending limits. Free Gmail accounts are limited to 500 messages per day. Google Workspace accounts may have higher limits. 

If you exceed these quotas, your emails will fail to send until the quota resets. So, you’ll want to monitor usage and consider switching to a premium service if your traffic grows.

Frequently asked questions (FAQ)

Finally, let’s answer some common questions about setting up SMTP in WordPress. 

What’s the difference between WordPress default email and SMTP?

The default WordPress method (wp_mail()) uses your server’s basic mail function and therefore lacks authentication measures. Meanwhile, SMTP connects to a mail server with secure credentials and encryption, improving email deliverability.

What types of WordPress emails does SMTP improve?

SMTP helps with all emails sent from your WordPress site, including contact form submissions, new user registrations, password reset requests, WooCommerce order confirmations, newsletters, and more. 

How much does it cost to set up SMTP for a WordPress website?

You can set up SMTP for free when using a plugin like MailPoet. Additionally, Gmail, Outlook, and other email providers offer free SMTP services with some limitations, while premium services like SendGrid or Mailgun offer free tiers and charge based on volume.

Do I need to pay for an SMTP service?

Not necessarily. You can use MailPoet, which has a free version that works for most small sites. If you have a more complex site with higher email volumes, you might need to get a paid SMTP service that offers better deliverability, analytics, and support.

Can I use Gmail SMTP for WordPress emails?

Yes, you can connect to Gmail SMTP, but you’ll likely need to enable 2FA and create an app-specific password.

Can I use Outlook or Office 365 SMTP for WordPress emails?

Yes. To do this, use smtp.office365.com as the host and port 587 with TLS. You’ll also need to add your full email address as the SMTP username, and generate an app password if you have 2FA enabled.

Do I need technical skills to configure SMTP in WordPress?

No. Most SMTP plugins offer guided setup wizards. Just follow the prompts and enter information like sender name and email address. If you encounter any issues, you can always contact the plugin developers or your mail providers for assistance. 

Can I use SMTP with any WordPress contact form plugin?

Yes. SMTP works with most major form plugins, including Jetpack Forms, WPForms, Ninja Forms, Contact Form 7, and Gravity Forms.

What are recommended contact form plugins for WordPress?

Jetpack Forms is a powerful form plugin by Automattic, the same people behind WordPress.com. It comes with pre-made templates for lead capture, registration, feedback, contact, and other forms. 

It integrates seamlessly with the block editor, and you can add a form on any page or post on your website. Simply add the Form block, choose a template, and customize the fields and appearance to suit your needs. 

Jetpack Forms is free and is included with the default Jetpack plugin. It also integrates with Akismet to prevent spam and Jetpack AI Assistant to create nearly any form with a simple text prompt. 






Azure reliability, resiliency, and recoverability: Build continuity by design

Modern cloud systems are expected to deliver more than uptime. Customers expect consistent performance, the ability to withstand disruption, and confidence that recovery is predictable and intentional.

In Azure, these expectations map to three distinct concepts: reliability, resiliency, and recoverability.

Reliability describes the degree to which a service or workload consistently performs at its intended service level within business-defined constraints and tradeoffs. Reliability is the outcome customers ultimately care about.

To achieve reliable outcomes, workloads are designed along two complementary dimensions. Resiliency is the ability to withstand faults and disruptive conditions (such as infrastructure failures, zonal or regional outages, cyberattacks, or sudden changes in load) and continue operating without customer-visible disruption. Recoverability is the ability to restore normal operations after disruption, returning the workload to a reliable state once resiliency limits are exceeded.

This blog anchors definitions and guidance to the Microsoft Cloud Adoption Framework, the Azure Well‑Architected Framework and the reliability guides for Azure services. Use the Reliability guides to confirm how each service behaves during faults, what protections are built in, and what you must configure and operate, so shared responsibility boundaries stay clear as workloads scale and during recovery scenarios.

Why this matters

When reliability, resiliency, and recoverability are used interchangeably, teams make the wrong design tradeoffs—over-investing in recovery when architectural resiliency is required, or assuming redundancy guarantees reliable outcomes. This post clarifies how these concepts differ, when each applies, and how they guide real design, migration, and incident-readiness decisions in Azure.

Industry perspective: Clarifying common confusion

Azure guidance treats reliability as the goal, achieved through deliberate resiliency and recoverability strategies. Resiliency describes workload behavior during disruption; recoverability describes restoring service after disruption.

Anchor principle: Reliability is the goal. Resiliency keeps you operational during disruption. Recoverability restores service when disruption exceeds design limits.

Part I — Reliability by design: Operating model and workload architecture

Reliable outcomes require alignment between organizational intent and workload architecture. Microsoft Cloud Adoption Framework helps organizations define governance, accountability, and continuity expectations that shape reliability priorities. Azure Well‑Architected Framework translates those priorities into architectural principles, design patterns, and tradeoff guidance.

Part II — Reliability in practice: What you measure and operationalize

Reliability only matters if it is measured and sustained. Teams operationalize reliability by defining acceptable service levels, instrumenting steady-state behavior and customer experience, and validating assumptions with evidence.

Azure Monitor and Application Insights provide observability, while controlled fault testing (for example, with Azure Chaos Studio) helps confirm designs behave as expected under stress.

Practical signals of “enough reliability” include meeting service levels for critical user flows, introducing changes safely, maintaining steady-state performance under expected load, and keeping deployment risk low through disciplined change practices.

Governance mechanisms such as Azure Policy, Azure landing zones, and Azure Verified Modules help apply these practices consistently as environments evolve.

The Reliability Maturity Model can help teams assess how consistently reliability practices are applied as workloads evolve, while remaining scoped to reliability practices rather than resiliency or recoverability architecture.

Part III — Resiliency in practice: From principle to staying operational

Resiliency by design is no longer a late-stage high-availability checklist. For mission-critical workloads, resiliency must be intentional, measurable, and continuously validated—built into how applications are designed, deployed, and operated.

Resiliency by design aims to keep systems operating through disruption wherever possible, not only recover after failures.

Resiliency is a lifecycle, not a feature

Effective practice shifts from isolated configurations to a repeatable lifecycle applied across workloads:

  • Start resilient—embed resiliency at design time using prescriptive architectures, secure-by-default configurations, and platform-native protections.  
  • Get resilient—assess existing applications, identify resiliency gaps, and remediate risks, prioritizing production mission-critical workloads. 
  • Stay resilient—continuously validate, monitor, and improve posture, ensuring configurations don’t drift and assumptions hold as scale, usage patterns, and threat models change.  

Withstanding disruption through architectural design

Resiliency focuses on how workloads behave during disruptive conditions such as failures, sudden changes in load, or unexpected operating stress—so they can continue operating and limit customer-visible impact. Some disruptive conditions are not “faults” in the traditional sense; elastic scale-out is a resiliency strategy for handling demand spikes even when infrastructure is healthy.

In Azure, resiliency is achieved through architectural and operational choices that tolerate faults, isolate failures, and limit their impact. Many decisions begin with failure-domain architecture: availability zones provide physical isolation within a region, zone-resilient configurations enable continued operation through zonal loss, and multi-region designs can extend operational continuity depending on routing, replication, and failover behavior.

The Reliable Web App reference architecture in the Azure Architecture Center illustrates how these principles come together through zone-resilient deployment, traffic routing, and elastic scaling paired with validation practices aligned to WAF. This reinforces a core tenet of resiliency by design: resiliency is achieved through intentional design and continuous verification, not assumed redundancy.  

Traffic management and fault isolation

Traffic management is central to resiliency behavior. Services such as Azure Load Balancer and Azure Front Door can route traffic away from unhealthy instances or regions, reducing user impact during disruption. Design guidance such as load-balancing decision trees can help teams select patterns that match their resiliency goals.

It is also important to distinguish resiliency from disaster recovery. Multi-region deployments may support high availability, fault isolation, or load distribution without necessarily meeting formal recovery objectives, depending on how failover, replication, and operational processes are implemented.

From resource checks to application-centric posture

Customers experience disruption as application outages, not as individual disk or VM failures. Resiliency must therefore be assessed and managed at the application level.

Azure’s zone resiliency experience supports this shift by grouping resources into logical application service groups, assessing risk, tracking posture over time, detecting drift, and guiding remediation with cost visibility. This turns resiliency from an assumption into an explicit, measurable posture.

Validation matters: configuration is not enough

Resiliency should be validated rather than assumed. Teams can simulate disruption through controlled drills, observe application behavior under stress, and measure continuity characteristics during expected scenarios. Strong observability is essential here: it shows how the application performs during and after drills.

Increasingly, assistive capabilities such as the Resiliency Agent (preview) in Azure Copilot help teams assess posture and guide remediation without blurring the distinction between resiliency (remaining operational through disruption) and recoverability (restoring service after disruption).  

What “enough resiliency” looks like: workloads remain functional during expected scenarios; failures are isolated, and systems degrade gracefully rather than causing customer-visible outages.

Part IV – Recoverability in practice: Restoring normal operations after disruption

Recoverability becomes relevant when disruption exceeds what resiliency mechanisms can withstand. It focuses on restoring normal operations after outages, data corruption events, or broader incidents, returning the system to a reliable state.

Recoverability strategies typically involve backup, restore, and recovery orchestration. In Azure, services such as Azure Backup and Azure Site Recovery support these scenarios, with behavior varying by service and configuration.

Recovery requirements such as Recovery Time Objective (RTO) and Recovery Point Objective (RPO) belong here. These metrics define restoration expectations after disruption, not how workloads remain operational during disruption.

Recoverability also depends on operational readiness: teams document runbooks, practice restores, verify backup integrity, and test recovery regularly, so recovery plans work under real pressure.

By separating recoverability from resiliency, teams can ensure recovery planning complements, rather than substitutes for, sound resiliency architecture.

A 30-day action plan: Turning intent into reliable outcomes

Within 30 days, translate concepts into deliberate decisions.

First, identify and classify critical workloads, confirm ownership, and define acceptable service levels and tradeoffs.

Next, assess resiliency posture against expected disruption scenarios (including zonal loss, regional failure, load spikes, and cyber disruption), validate failure-domain choices, and verify traffic management behavior. Use guardrails such as Azure Backup, Microsoft Defender for Cloud, and Microsoft Sentinel to strengthen continuity against cyberattacks.

Then, confirm recoverability paths for scenarios that exceed resiliency limits, including restoration paths and RTO/RPO targets.

Finally, align operational practices—change management, observability, governance, and continuous improvement—and validate assumptions using the Reliability guides for each Azure service.

Designing confident, reliable cloud systems

Modern cloud continuity is defined by how confidently systems perform, withstand disruption, and restore service when needed. Reliability is the outcome to design for; resiliency and recoverability are complementary strategies that make reliable operation possible.

Next step: Explore Azure Essentials for guidance and tools to build secure, resilient, cost-efficient Azure projects. To see how shared responsibility and Azure Essentials come together in practice, read Resiliency in the cloud—empowered by shared responsibility and Azure Essentials on the Microsoft Azure Blog.

For expert-led, outcome-based engagements to strengthen resiliency and operational readiness, Microsoft Unified provides end-to-end support across the Microsoft cloud. To move from guidance to execution, start your project with experts and investments through Azure Accelerate.


Claude Sonnet 4.6 in Microsoft Foundry: Frontier Performance for Scale

Last week, we took a major step forward with the availability of Claude Opus 4.6 in Microsoft Foundry, bringing frontier AI capable of deep reasoning, agentic workflows, and complex decision-making to enterprise developers and builders. If Opus represents the highest tier of AI performance, Sonnet 4.6 builds on that momentum by delivering nearly Opus-level intelligence at a lower price, while often being more token efficient than Claude Sonnet 4.5.

Claude Sonnet 4.6 is available today in Microsoft Foundry, and it is designed for teams who want frontier performance across coding, agents, and professional work at scale. With Sonnet 4.6, customers get access to powerful reasoning and productivity capabilities that make everyday AI a practical reality for development teams, enterprise knowledge workers, and automation scenarios.

Large Context, Adaptive Thinking, and Effort Controls

Claude Sonnet 4.6 delivers frontier intelligence at scale, built for coding, agents, and enterprise workflows.

A major highlight is its 1 million token context window (beta), matching the extended context capabilities of Claude Opus 4.6, alongside 128K maximum output. This enables teams to work across massive codebases, long financial models, multi-document analysis, and extended multi-turn workflows without fragmentation or repeated context resets.

Sonnet 4.6 also introduces adaptive thinking and effort parameters, which give Claude the freedom to think if and when it determines reasoning is required. This is an evolution of traditional extended thinking, optimizing both performance and speed. Teams can use effort parameters to better control quality-latency-cost tradeoffs.

A Developer’s Everyday Model

Claude Sonnet 4.6 is a full upgrade for software development. It is smart enough to work independently through complex codebases and handles iterative workflows without losing quality.

Enterprise software teams can expect Claude Sonnet 4.6 to deliver:

  • Stronger reasoning across code contexts
  • Better understanding of complex codebases
  • Reliable performance across iterative development cycles

Whether you’re building features, refactoring existing modules, or debugging tricky issues, Sonnet 4.6 can follow your workflow, maintain architectural context, and adapt as you iterate.

Sonnet 4.6 is designed for back-and-forth development:

  • You define intent
  • It produces high-quality outputs
  • You guide refinement
  • Deliverables stay consistent through iterations

For teams building in Microsoft Foundry, this translates to fewer context resets, faster cycle times, and smoother development velocity.

Ref: Benchmark table published by Anthropic

Empowering High-Quality Knowledge Work

Sonnet 4.6 makes high-quality knowledge work accessible at scale, enabling teams to produce polished outputs with fewer editing cycles.

Improvements in search, analysis, and content generation make Sonnet 4.6 ideal for everyday enterprise workflows, such as:

  • Drafting and refining reports
  • Summarizing large document sets
  • Generating structured business documentation
  • Producing polished presentations and narratives

Consistent quality across both single-turn tasks and extended multi-turn collaboration ensures teams spend less time refining and more time delivering.

Powerful Computer Use

Claude Sonnet 4.6 is Anthropic’s most capable computer use model yet, scoring 72.5% on OSWorld Verified. With improved precision, the model has better clicking accuracy on difficult UI elements. Claude Sonnet 4.6 enables browser automation at scale without API key dependency. It can navigate, interact, and complete tasks across any browser-based surface, including tools with no API, legacy systems, and sites you’re already logged into.

Claude Sonnet 4.6 can work across apps without explicit instruction. It can read context from one surface and act on another, checking a calendar, responding to a message, and creating an event, without the user having to orchestrate each step.

For organizations running business workflows on systems that predate modern APIs, Sonnet 4.6’s browser-based computer use is transformative. For developers, Sonnet 4.6 is a strong fit for software development workflows as a QA and testing layer. Spinning up a browser when needed, developers can delegate visual inspection and form-based validation.

Versatile Horizontal and Vertical Use Cases

Claude Sonnet 4.6 is a direct upgrade to Sonnet 4.5. Most workflows will require only minimal prompting changes.

  1. Search & Conversational Experiences

Sonnet 4.6 is an excellent choice for high-volume conversational products, delivering consistent quality across multi-turn exchanges while remaining cost-efficient for scale.

  2. Agentic & Multi-Model Pipelines

Sonnet 4.6 can function as both lead agent and sub-agent in multi-model setups. Adaptive thinking, context compaction, and effort controls give developers precise orchestration tools for complex workflows.

  3. Finance & Analytics

With stronger financial modeling intelligence and improved spreadsheet capabilities, Sonnet 4.6 is a strong fit for analysis, compliance review, and data summarization workflows where precision and iteration speed matter.

  4. Enterprise Document & Workflow Production

Users need fewer rounds of editing to reach production-ready documents, spreadsheets, and presentations, making Claude Sonnet 4.6 a strong fit for finance, legal, and other precision-critical verticals where polish and domain accuracy matter.

Built for Scale in Microsoft Foundry

With Claude Sonnet 4.6 available in Microsoft Foundry, customers can deploy near-Opus-level intelligence within an enterprise-grade environment that supports governance, compliance, and operational tooling.

For teams building modern AI workflows, from developer assistants to enterprise automation agents, Claude Sonnet 4.6 provides a powerful, scalable foundation in Microsoft Foundry.

Try it today
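
For a rough sense of what a first call could look like, here is a hedged sketch using the Anthropic Python SDK. The endpoint, environment variables, and model id below are placeholders; the exact values, and the supported access method, come from your Microsoft Foundry deployment's model card.

import os
import anthropic

# Placeholder values: take the real endpoint, key, and model id from your
# Microsoft Foundry deployment. This assumes the deployment exposes an
# Anthropic-compatible endpoint.
client = anthropic.Anthropic(
    base_url=os.environ["FOUNDRY_CLAUDE_ENDPOINT"],
    api_key=os.environ["FOUNDRY_API_KEY"],
)

message = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this repo's open issues by theme."}],
)
print(message.content[0].text)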

To go deeper, join us on February 23 for Model Mondays, where leaders from Anthropic will walk through both Claude Opus 4.6 and Claude Sonnet 4.6, including real-world use cases, architectural guidance, and what’s next for frontier models in enterprise deployment.


How to Develop AI Agents Using LangGraph: A Practical Guide

AI agents are all the rage these days. They’re like traditional chatbots, but they can tap into a whole range of tools in the background, and they decide which tool to use and when to use it to answer your questions.

In this tutorial, I’ll show you how to build this type of agent using LangGraph. We’ll dig into real code from my personal project FinanceGPT, an open-source financial assistant I created to help me with my finances.

You’ll walk away understanding how AI agents actually work under the hood, and you’ll be able to build your own agent for whatever domain you are working on.

What I’ll Cover:

Prerequisites

Before diving in, you should be comfortable with the following:

Python knowledge: You should know how to write Python functions, work with async/await syntax, and understand decorators. The code examples use all three extensively.

Basic LLM/chatbot familiarity: You don't need to be an expert, but knowing what a large language model is and having some experience calling one (via OpenAI's API or similar) will help you follow along.

LangChain basics: We'll be using LangGraph, which is built on top of LangChain. If you've never used LangChain before, it's worth skimming their quickstart guide first.

You'll also need the following tools installed:

  • Python 3.10+

  • An OpenAI API key (the examples use gpt-4-turbo-preview)

  • The following packages, installable via pip:

  pip install langchain langgraph langchain-openai sqlalchemy

If you're planning to follow along with the full FinanceGPT project rather than just the code snippets, you'll also want a PostgreSQL database set up, but that's optional for understanding the core concepts covered here.

What Are AI Agents?

Think of AI agents as traditional chatbots that can answer user questions. But they specialize in figuring out what tools they need and can chain multiple actions together to get an answer.

Here’s an example conversation with my FinanceGPT AI agent:

User: "How much did I spend on groceries this month?"

Agent: [Thinks: I need transaction data filtered by category]

Agent: [Calls search_transactions(category="Groceries")]

Agent: [Gets back: $1,245.67 across 23 transactions]

Agent: "You spent $1,245.67 on groceries this month."

The agent broke down the problem, picked the right tool to use, and generated the answer. This matters a lot when you’re working with messy real world problems where:

  • Questions don’t fit into specific categories

  • You need to pull data from multiple sources

  • Users want to ask followup questions

What is LangGraph?

LangGraph is an open-source extension of LangChain for building stateful AI agents by modeling workflows as nodes and edges in a graph. You can think of your agent’s logic as a flowchart where:

  • Nodes are the actions (for example “ask the LLM” or “run this tool”)

  • Edges are the arrows (what happens next)

  • State is the information passed around

LangGraph is especially good at providing the following benefits:

  1. Flow control: You define exactly what happens when.

  2. Stateful: The framework preserves conversation history for you.

  3. Easy to use: Just adding a decorator to an existing Python function makes it a tool.

  4. Production-ready: It has built-in error handling and retries.

Core Concept 1: Tools

Think of tools as just Python functions your AI agent can call. The LLM uses the function name, docstring, parameters, and return value to understand what each function does and when to use it.

LangChain has a @tool decorator that can convert any function into a tool, for example:

from langchain_core.tools import tool

@tool
def get_current_weather(location: str) -> str:
    """Get the current weather for a location.

    Use this when the user asks about weather conditions.

    Args:
        location: City name (e.g., "San Francisco", "New York")

    Returns:
        Weather description string
    """
    # In real life, you'd call a weather API here
    return f"The weather in {location} is sunny, 72°F"

Notice that the docstring is self-explanatory, as that’s how the LLM decides whether this tool is the right choice or not.

Here is a real example from FinanceGPT. This is a tool that searches through financial transactions:

from langchain_core.tools import tool
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select

# Note: `Document` (used in the query below) is FinanceGPT's SQLAlchemy model for
# uploaded financial documents; import it from your project's models module.

def create_search_transactions_tool(search_space_id: int, db_session: AsyncSession):
    """
    Factory function that creates a search tool with database access.

    This pattern lets you inject dependencies (database, user context)
    while keeping the tool signature clean for the LLM.
    """

    @tool
    async def search_transactions(
        keywords: str | None = None,
        category: str | None = None
    ) -> dict:
        """Search financial transactions by merchant or category.

        Use when users ask about:
        - Spending at specific merchants ("How much at Starbucks?")
        - Spending in categories ("How much on groceries?")
        - Both combined ("Show me restaurant spending at McDonald's")

        Args:
            keywords: Merchant name to search for
            category: Spending category (e.g., "Groceries", "Gas")

        Returns:
            Dictionary with transactions, total amount, and count
        """
        # Query the database
        query = select(Document.document_metadata).where(
            Document.search_space_id == search_space_id
        )
        result = await db_session.execute(query)
        documents = result.all()

        # Filter transactions based on criteria
        all_transactions = []
        for (doc_metadata,) in documents:
            transactions = doc_metadata.get("financial_data", {}).get("transactions", [])

            for txn in transactions:
                # Apply filters
                if category and category.lower() not in str(txn.get("category", "")).lower():
                    continue
                if keywords and keywords.lower() not in txn.get("description", "").lower():
                    continue

                # Include matching transaction
                all_transactions.append({
                    "date": txn.get("date"),
                    "description": txn.get("description"),
                    "amount": float(txn.get("amount", 0)),
                    "category": txn.get("category"),
                })

        # Calculate total and return
        total = sum(abs(t["amount"]) for t in all_transactions if t["amount"] < 0)

        return {
            "transactions": all_transactions[:20],  # Limit results
            "total_amount": total,
            "count": len(all_transactions),
            "summary": f"Found {len(all_transactions)} transactions totaling ${total:,.2f}"
        }

    return search_transactions

Let’s dive into what this code is doing.

The factory function pattern: The tool only takes parameters the LLM can provide (a keyword and category), but it also needs a database session and search_space_id to know whose data to query. The factory function solves this by capturing those dependencies in a closure, so the LLM sees a clean interface while the database wiring stays hidden.

The filtering logic: We loop through all transactions and apply the optional filters. If category is provided, it must appear in the transaction's category field. If keywords is provided, it must appear in the merchant description. Both can be used together, letting the LLM handle questions like "How much did I spend at McDonald's in the Restaurants category?"

The return value: Instead of a raw list, the tool returns a structured dict with a capped result set, a pre-calculated total, and a plain-English summary string. The summary means the LLM can read "Found 23 transactions totaling $1,245.67" and immediately know what to say, rather than parsing the raw data itself.

Key Tool Design Principles

These are the principles that differentiate a good tool from a great tool:

  1. Docstrings: Instead of vague descriptions, you need to be thorough with the explanation of the tool in the docstring. The more examples you give, the better the LLM gets at picking the right tool.

  2. Clean signature: The tool should only take the parameters that the LLM has access to and can provide. If the tool needs user ids, or database connections (and so on), you can hide those in factory functions using closures.

  3. Return both data and summaries: Instead of just the raw data, if you include a summary field, the agent can just use that to understand the output better. Here’s an example:

     {
         "transactions": [...],           # For detailed analysis
         "total_amount": 1245.67,         # Pre-calculated
         "summary": "Found 23 transactions..."  # Ready to send to user
     }
    
  4. Limited context window: Capping results to a finite amount like 20-50 items depending on the use case will make sure your LLM doesn’t choke or hit context limits.

Core Concept 2: Agent State

Your agent carries around information as it works. This is called the agent’s state. For a chatbot, it’s usually the conversation history.

In LangGraph, state is defined with a TypedDict:

from typing import Annotated, Sequence, TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    """
    This is what flows through your agent.

    Messages is a list that keeps growing:
    - User questions
    - Agent responses
    - Tool results
    """
    # add_messages tells LangGraph to append new messages to the history
    # instead of overwriting it on every state update
    messages: Annotated[Sequence[BaseMessage], add_messages]

For complex agents, you can track more than just messages, like:

class FancierState(TypedDict):
    messages: Sequence[BaseMessage]
    user_id: str
    retry_count: int
    last_tool_used: str | None

This matters more than it might seem. Each field here has a real purpose in a sophisticated, production-grade agent. user_id tells every node whose data to fetch without you having to pass it around manually. retry_count helps the agent detect when it’s stuck in a loop so it can bail out gracefully. last_tool_used helps the agent avoid redundant calls.

As the agent grows in complexity, state becomes the single source of truth that keeps every node coordinated.

Why State Matters

State is what separates a conversational agent from a stateless API call. Without it, every message would be processed in isolation, and the agent would have no recollection of what was asked earlier, which tools it already used, or what data it already retrieved.

With state, the full conversation history is passed through each step of the agent’s execution.

Here's what that looks like in practice for our grocery spending example:

When the conversation starts:
{
    "messages": []
}

User asks something:
{
    "messages": [
        HumanMessage("How much did I spend on groceries?")
    ]
}

Agent decides to use a tool:
{
    "messages": [
        HumanMessage("How much did I spend on groceries?"),
        AIMessage(tool_calls=[{name: "search_transactions", ...}]),
        ToolMessage({"total_amount": 1245.67, ...}),
    ]
}

Agent responds with the answer:
{
    "messages": [
        HumanMessage("How much did I spend on groceries?"),
        AIMessage(tool_calls=[...]),
        ToolMessage({...}),
        AIMessage("You spent $1,245.67 on groceries this month.")
    ]
}

Notice that the state keeps growing with every tool call and every result. This means that when the user asks a follow-up like “How does that compare to last month?”, the agent can simply look back and know what “that” refers to.
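
To make that concrete, here is a hedged sketch of a follow-up turn. It assumes agent is the compiled graph we build in the next section and result is the state dictionary returned by a previous agent.ainvoke call:

from langchain_core.messages import HumanMessage

# `result` is the state returned by the previous agent.ainvoke(...) call, so
# result["messages"] already contains the question, tool call, tool result,
# and the final answer shown above.
followup_input = {
    "messages": list(result["messages"]) + [
        HumanMessage(content="How does that compare to last month?")
    ]
}

# Run inside an async function, just like the earlier examples
followup = await agent.ainvoke(followup_input)
print(followup["messages"][-1].content)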

Core Concept 3: The Agent Graph

The graph is the backbone of your agent. Think of it as a collection of tools and an LLM, combined to reason, act, and respond in a structured way. Specifically, it determines the order of operations – that is, what runs first, what happens next, and what conditions determine which path to take.

Without a graph, you would have to manually orchestrate the workflow: calling the LLM, then checking whether it wants to use a tool, executing the tool, and then feeding the result back to it and deciding when to stop. The graph encodes this logic explicitly so that your agent figures out the right sequence.

Each node in the graph is an action like “ask the LLM” or “run a tool” and each edge is a connection between those actions.

With that in mind, let's build one step by step.

Step 1: Create the Agent Node

The agent node is where the LLM makes a decision like “Should I use a tool?” or “Which tool to use?”. Let’s take an example:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Create the LLM with tools
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)

# Create your tools
tools = [
    create_search_transactions_tool(search_space_id, db_session),
    # ... other tools
]

# Bind tools to the LLM so it knows what's available
llm_with_tools = llm.bind_tools(tools)

# Create the system prompt
system_prompt = """You are a helpful AI financial assistant.

Your capabilities:
- Search transactions by merchant, category, or date
- Analyze portfolio performance
- Find tax optimization opportunities

Guidelines:
- Be concise and cite specific data
- Format currency as $X,XXX.XX
- Remind users to consult professionals for tax/investment advice"""

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    MessagesPlaceholder(variable_name="messages"),
])

# Define the agent node function
async def call_agent(state: AgentState):
    """
    The agent node calls the LLM to decide the next action.

    The LLM can:
    1. Call one or more tools
    2. Generate a text response
    3. Both
    """
    messages = state["messages"]

    # Format messages with system prompt
    formatted = prompt.format_messages(messages=messages)

    # Call the LLM
    response = await llm_with_tools.ainvoke(formatted)

    # Return state update (add the LLM's response)
    return {"messages": [response]}

Let’s walk through what's happening here.

First, we initialize the LLM with temperature=0, which makes its outputs as consistent and predictable as the model allows. This is important for an agent that needs to make reliable decisions rather than creative ones.

Next, we call llm.bind_tools(tools). It tells the LLM what tools are available by passing along their names, descriptions, and parameter schemas. Without this, the LLM would have no idea it could call any tools at all. With it, the LLM can look at a user's question and decide both whether a tool is needed and which one to use.

The prompt is built using ChatPromptTemplate, which combines a static system prompt with a MessagesPlaceholder. The placeholder is where the full conversation history gets inserted at runtime, meaning the LLM always has the complete context of the conversation when making its decision.

Last, call_agent is the actual node function. It pulls the current messages from state, formats them with the prompt, calls the LLM, and returns the response to be appended to state. This is the function LangGraph will call every time execution reaches the agent node.

Step 2: Create the Tool Node

LangGraph has a pre-built ToolNode that executes tools:

from langgraph.prebuilt import ToolNode

# This node automatically executes any tools the LLM requested
tool_node = ToolNode(tools)

When the LLM includes tool calls in its response, ToolNode will:

  1. extract the tool calls,

  2. execute each tool with specific params, and

  3. add ToolMessage object with the result to state

Step 3: Define Control Flow

This is where we need to decide when the tool should be used and when it ends.

from langgraph.graph import END

def should_continue(state: AgentState):
    """
    Router function that determines the next step.

    Returns:
        "tools" - if the LLM wants to use tools
        END - if the LLM is done (just text response)
    """
    last_message = state["messages"][-1]

    # Check if the LLM included tool calls
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"

    # No tool calls means we're done
    return END

This tiny function is the decision-maker of your entire agent. After the LLM responds, LangGraph calls should_continue to figure out what to do next. It works by inspecting the last message in state: the LLM's most recent response. If that response contains tool calls, it means the LLM has decided it needs more data before it can answer, so we return "tools" to route execution to the tool node. If there are no tool calls, the LLM has produced a final answer and we return END to stop execution.

This is the mechanism that makes the agent loop. The agent doesn't just call one tool and stop, but it can call a tool, see the result, decide it needs another tool, call that one too, and only stop when it has everything it needs to respond.

Step 4: Assemble the Graph

Now, we can connect everything:

from langgraph.graph import StateGraph

# Create the graph
workflow = StateGraph(AgentState)

# Add nodes
workflow.add_node("agent", call_agent)
workflow.add_node("tools", tool_node)

# Set entry point
workflow.set_entry_point("agent")

# Add conditional edge from agent
workflow.add_conditional_edges(
    "agent",           # From this node
    should_continue,   # Use this function to decide
    {
        "tools": "tools",  # If "tools" is returned, go to tools node
        END: END           # If END is returned, finish
    }
)

# After tools execute, go back to agent
workflow.add_edge("tools", "agent")

# Compile into a runnable agent
agent = workflow.compile()

This is where everything gets wired together. We start by creating a StateGraph and passing it our AgentState type. This tells LangGraph what shape the state will take as it flows through the graph.

We then register our two nodes with add_node. The string name we give each node ("agent" and "tools") is what we'll use to reference them when defining edges. set_entry_point tells LangGraph where execution should begin, which in our case is the agent node.

The conditional edge is where the routing logic plugs in. We're telling LangGraph: "After the agent node runs, call should_continue to decide what happens next, then use this mapping to translate that decision into the next node." If should_continue returns "tools", go to the tools node. If it returns END, stop.

Finally, add_edge("tools", "agent") creates an unconditional edge: after the tools node runs, always go back to the agent node. This is what creates the loop, letting the agent review the tool results and decide whether it's done or needs to keep going. Calling workflow.compile() locks everything in and returns a runnable agent.

Understanding the Flow

Here’s what happens when you run the agent:

User Question
    ↓
[AGENT NODE]
    ↓
[SHOULD_CONTINUE]
    ↓
  Tools needed?
    ↓ YES   ↓ NO
[TOOLS]    [END]
    ↓
[AGENT NODE]
    ↓
[SHOULD_CONTINUE]
    ↓
    ...

The loop above allows the agent to:

  1. Use a tool

  2. See the results

  3. Decide if more tools are needed

  4. Use more tools or generate final answer

How to Put it All Together

Let’s see the complete agent in one place:

from typing import Annotated, Sequence, TypedDict
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

# 1. Define State
class AgentState(TypedDict):
    # append new messages to the conversation history instead of replacing it
    messages: Annotated[Sequence[BaseMessage], add_messages]

# 2. Create Agent Function
def create_agent(tools):
    # Set up LLM
    llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
    llm_with_tools = llm.bind_tools(tools)

    # Create prompt
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful AI assistant."),
        MessagesPlaceholder(variable_name="messages"),
    ])

    # Define nodes
    async def call_agent(state: AgentState):
        formatted = prompt.format_messages(messages=state["messages"])
        response = await llm_with_tools.ainvoke(formatted)
        return {"messages": [response]}

    def should_continue(state: AgentState):
        last_message = state["messages"][-1]
        if hasattr(last_message, "tool_calls") and last_message.tool_calls:
            return "tools"
        return END

    # Build graph
    workflow = StateGraph(AgentState)
    workflow.add_node("agent", call_agent)
    workflow.add_node("tools", ToolNode(tools))
    workflow.set_entry_point("agent")
    workflow.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
    workflow.add_edge("tools", "agent")

    return workflow.compile()

# 3. Use the Agent
async def main():
    # Create tools (simplified example)
    tools = [create_search_transactions_tool(search_space_id=1, db_session=session)]

    # Create agent
    agent = create_agent(tools)

    # Run agent
    result = await agent.ainvoke({
        "messages": [HumanMessage(content="How much did I spend on groceries?")]
    })

    # Get final response
    final_response = result["messages"][-1].content
    print(final_response)

How the Agent Thinks

Let’s use an example to see how the agent reasons.

Example: “How much did I spend on groceries this month?”

Step 1: User Input

State: {
    "messages": [HumanMessage("How much did I spend on groceries this month?")]
}

Step 2: Agent Node

The LLM gets:

  • A system prompt, like the one we defined above

  • User question: “How much did I spend on groceries this month?”

  • List of available tools: search_transactions(keywords, category)

The LLM reasons that this is about spending in a specific category and decides that it should use search_transactions with category=’groceries’. It responds with a tool call:

AIMessage(
    content="",
    tool_calls=[{
        "name": "search_transactions",
        "args": {"category": "Groceries"},
        "id": "call_123"
    }]
)

Step 3: Should Continue

The router sees tool calls and returns “tools”.

Step 4: Tools Node

It executes search_transactions(category="Groceries") and gets:

{
    "transactions": [...],
    "total_amount": 1245.67,
    "count": 23,
    "summary": "Found 23 transactions totaling $1,245.67"
}

And adds this to the state:

ToolMessage(
    content='{"transactions": [...], "total_amount": 1245.67, ...}',
    tool_call_id="call_123"
)

Step 5: Agent Node Again

The LLM now sees the user question, its previous tool call, and the results. The LLM thinks: “I now have the data, the user spent $1,245.67 on groceries. I can answer now.” And the LLM responds with:

AIMessage(content="You spent $1,245.67 on groceries this month across 23 transactions.")

Step 6: Should Continue

No tool calls this time, so returns END.

Final State:

{
    "messages": [
        HumanMessage("How much did I spend on groceries this month?"),
        AIMessage("", tool_calls=[...]),
        ToolMessage('{"total_amount": 1245.67, ...}'),
        AIMessage("You spent $1,245.67 on groceries this month across 23 transactions.")
    ]
}

The user receives: "You spent $1,245.67 on groceries this month across 23 transactions."

Conclusion

Building an AI agent boils down to three ideas:

  1. Tools

  2. State

  3. Graph

LangGraph gives you control, so you are not left hoping that the agent does the right thing – instead, you’re explicitly defining what the “right thing” is.

The FinanceGPT example shows how this works in a real application. By learning these concepts, now you can build specialized agents for different jobs.


Check Out FinanceGPT

All the code examples here came from FinanceGPT. If you want to see these patterns in a complete app, poke around the repo. It's got document processing, portfolio tracking, tax optimization – all built with LangGraph.

If you find this helpful, give the project a star on GitHub – it helps other developers discover it.




How to Elevate Your Database Game: Supercharging Query Performance with Postgres FDW

Foreign data wrappers (FDWs) make remote Postgres tables feel local. That convenience is exactly why FDW performance surprises are so common.

A query that looks like a normal join can execute like a distributed system: rows move across the network, remote statements get executed repeatedly, and the local planner quietly becomes a coordinator. In that world, “fast SQL” is not mainly about CPU or indexes. It’s about data movement and round-trips.

This handbook covers the mechanism that determines whether a federated query behaves like a clean remote query or a chatty distributed workflow: pushdown.

Pushdown is not “moving compute”. Pushdown determines whether filtering, joining, ordering, and aggregation occur at the data source or after the data has already crossed the wire. When pushdown works, the local server receives a reduced result set. When it doesn’t, Postgres often has to fetch broad intermediate sets and finish the work locally.

The chapters ahead will help you build a practical mental model of what is “shippable” in postgres_fdw, why some expressions are blocked, and how to read EXPLAIN (ANALYZE, BUFFERS, VERBOSE) without getting tricked by familiar plan shapes.

After the core method, the handbook covers tuning knobs that matter in production, schema and indexing considerations, benchmarking methodology, monitoring and logging, and a case study that shows what a real pushdown win looks like end-to-end.

The later sections go deeper into advanced shippability edge cases, cost model calibration, and regression-proofing FDW workloads.

Prerequisites

This handbook assumes basic comfort with Postgres query plans. It builds on EXPLAIN (ANALYZE, BUFFERS) rather than reintroducing SQL fundamentals, indexing, or join algorithms.

The focus here is federated execution: how foreign queries behave, and how to reason about them with the same clarity as local plans.

Here’s what you should already be comfortable with:

  • Reading EXPLAIN (ANALYZE, BUFFERS) output and spotting obvious plan smells (row explosions, bad join order, missed indexes).

  • Basic join mechanics (nested loop, hash join, merge join) and why cardinality estimates matter.

  • Postgres statistics at a practical level (ANALYZE, correlation, and what “estimated rows vs actual rows” implies).

And here’s what you need to follow along with the examples:

  • A Postgres “local” instance that will run postgres_fdw and act as the coordinator.

  • A Postgres “remote” instance that holds the foreign tables.

  • Permission on the local side to:

    • CREATE EXTENSION postgres_fdw;

    • create a SERVER and USER MAPPING

    • create FOREIGN TABLE objects (or permission to use existing ones)

  • A way to run queries and capture plans:

    • psql is enough, and so is any GUI, as long as you can run EXPLAIN (ANALYZE, BUFFERS, VERBOSE).

We won’t go through a long environment setup walkthrough. The examples assume the FDW objects exist and focus on plans and behavior.

We also won’t go into general distributed systems theory. Only the pieces that show up in an FDW plan are used.

Executive Summary

The single most important lesson of this handbook is that FDW pushdown reduces data movement. It’s tempting to think of pushdown as merely changing where a calculation happens (“move the work to the remote”). But what really matters is whether the remote server is asked for only the rows you need.

When pushdown is working, the remote server performs the selective join and filtering, and the local Postgres receives a small, already reduced result set. When pushdown fails, the local server becomes a distributed query coordinator: it pulls large intermediate sets over the network and then finishes the heavy lifting locally.

Why does this matter? Because a refactor that makes more of your query shippable to the remote server can slash end‑to‑end latency without changing a single row of output. In the case study we'll explore later, rewriting a query so that the FDW can ship a joined remote query instead of performing multiple foreign scans and local joins reduces runtime from approximately 166 ms to 25 ms. The business logic did not change – the shape of the work changed.

Below is a simple bar chart illustrating that dramatic drop. The chart uses actual timings from the case study. If you run the experiment yourself, the numbers may differ depending on your hardware and network, but the relative difference should be clear.

Bar chart titled "Query Execution Time: Before vs After Refactor." The vertical axis shows execution time in milliseconds. The "Before" bar is much taller, at over 160 ms, while the "After" bar sits at roughly 25 ms, reflecting the improvement after refactoring.

Motivation

Foreign data wrappers let you query remote data using the same SQL syntax you use locally. That convenience is exactly why they can be so deceptive.

A federated query may look like a normal join, but under the hood, it behaves like a distributed system: some part of the plan runs on the remote server, some on the local server, and every boundary between them is a network hop. The slow path is rarely “bad SQL” – it’s usually a combination of two things:

  1. Too many rows are pulled over the network. Without pushdown, the FDW retrieves a large slice of the remote table and applies your filters and joins locally. This may lead to tens of thousands or millions of rows being shipped across the network when you only needed hundreds or fewer.

  2. Too many round-trips. If the plan performs a nested loop that drives a foreign scan, it can end up executing the same remote query hundreds or thousands of times. Each call might be fast on its own, but latency adds up.

This isn't speculation. PostgreSQL's documentation makes clear that a foreign table has no local storage and that Postgres “asks the FDW to fetch data from the external source” [1]. There is no local buffer cache or heap storage to hide mistakes. Every row you retrieve must traverse the network at least once. If your plan fetches more rows than it needs, or repeatedly does so, performance can degrade quickly.

That’s why you should treat the Remote SQL shown in EXPLAIN (VERBOSE) as part of your query plan. It tells you exactly what the remote server is being asked to do. If it’s missing your filters or joins, you know the local server will have to finish the job. The rest of this handbook will teach you how to read that plan, how to force pushdown when possible, and how to recognize the signs that something has gone wrong.

FDW Basics Without the Setup Tax

You might be tempted to skip this section if you've already created foreign tables in your own databases. Don't. Understanding the architecture of foreign data wrappers is essential to understanding why pushdown matters.

SQL/MED in a nutshell

PostgreSQL implements the SQL/MED (Management of External Data) standard through its FDW framework. To access a remote Postgres server via postgres_fdw, you perform four steps:

  1. Install the extension: CREATE EXTENSION postgres_fdw tells Postgres to load the FDW code.

  2. Create a foreign server: CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '...', port '...', dbname '...') defines where the remote server resides and how to connect.

  3. Create a user mapping: CREATE USER MAPPING FOR your_user SERVER foreign_server OPTIONS (user 'remote_user', password '...') tells Postgres how to authenticate on the remote side.

  4. Create a foreign table: CREATE FOREIGN TABLE remote_table (...) SERVER foreign_server OPTIONS (schema_name '...', table_name '...'); defines the columns and references the remote table.

Once you've done that, you can run SELECT statements against the foreign table as if it were local. But the definition hides an important detail: there is no storage associated with that foreign table [1]. Every time you SELECT, INSERT, UPDATE, or DELETE, the FDW must connect to the remote server, build a remote query, send it, and read the results. This overhead is small for simple queries but becomes critical as queries get more complex.
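
Putting those four steps together, a minimal setup might look like the sketch below. The server name, connection details, and table definition are illustrative placeholders, not values from any particular environment.

-- Minimal postgres_fdw setup sketch; all names and credentials are placeholders.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER analytics_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'remote.example.com', port '5432', dbname 'analytics');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER analytics_server
  OPTIONS (user 'report_reader', password 'change-me');

CREATE FOREIGN TABLE remote_orders (
  id          bigint,
  customer_id bigint,
  amount      numeric,
  created_at  timestamptz
)
SERVER analytics_server
OPTIONS (schema_name 'public', table_name 'orders');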

What postgres_fdw does and does not do

postgres_fdw does two things for you:

  1. It builds remote SQL from your query, including pushing down safe filters, joins, sorts, and aggregates when it can.

  2. It fetches rows from the remote server and hands them to the local executor. If some part of your query cannot be executed remotely, the local executor performs that part.

The FDW tries hard to minimize data transfer by sending as much of your WHERE clause as possible to the remote server and by not retrieving unused columns [2]. It also has a number of tuning knobs that we'll explore later (such as fetch_size, use_remote_estimate, fdw_startup_cost, and fdw_tuple_cost [3]). But the real win often comes from structuring your query so that the FDW can push work down.

There's one last architectural point to keep in mind: the remote server runs with a restricted session environment. In remote sessions opened by postgres_fdw, the search_path is set to pg_catalog only, and TimeZone, DateStyle, and IntervalStyle are set to specific values [4]. This means that any functions you expect to run remotely must be schema‑qualified or packaged in a way that the FDW can find them. It also underscores why you should not override session settings for FDW connections unless you know exactly what you are doing [4].

Pushdown Mechanics

At a high level, “pushdown” means pushing as much of your SQL query as possible to the remote server. But the FDW cannot simply send arbitrary SQL. It must be safe and portable for remote evaluation. Postgres uses the term shippable to describe expressions and operations that can be evaluated on the foreign server.

What “shippable” means in practice

An expression is considered shippable if it meets several conditions:

  1. It uses built‑in functions, operators, or data types, or functions/operators from extensions that have been explicitly allow‑listed via the extensions option on the foreign server [2]. If you use a custom function or an extension that has not been declared, the FDW assumes it cannot run remotely.

  2. It’s marked IMMUTABLE. Postgres distinguishes between IMMUTABLE, STABLE, and VOLATILE functions. Only immutable functions – those that always return the same output for the same inputs and don’t depend on session state – are candidates for pushdown [5]. This rule prevents nondeterministic functions, such as now() or random(), from being evaluated remotely, because the result might differ between the local and remote servers.

  3. It doesn’t depend on local collations or type conversions. PostgreSQL’s docs warn that type or collation mismatches can lead to semantic anomalies [1]. If the FDW cannot guarantee that a comparison behaves identically on both servers, it will refuse to push it down. For example, comparing a citext column to a text constant could be unsafe if the remote server doesn’t have the citext extension installed.

From these rules, you can derive a mental checklist: avoid non‑immutable functions in your WHERE clause, keep your join conditions simple and typed correctly, and list any third‑party extensions you want to use in the foreign server’s extensions option so that they are considered shippable [2].

WHERE pushdown

If a WHERE clause consists entirely of shippable expressions, it will be included in the remote query. Otherwise, it will be evaluated locally. This matters because pushing a filter down reduces the number of rows returned to the local server.

Consider a predicate like this:

WHERE created_at >= now() - interval '30 days'

Because now() is volatile (it returns a different value each time it’s called), Postgres cannot assume the remote server will interpret now() the same way. The FDW therefore pulls the entire table and applies the filter locally.

A better approach is to pass a parameter into the query or compute the cutoff timestamp once in the application and embed it into the SQL.
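
As a quick sketch (reusing the illustrative remote_orders table from the setup example), the difference looks like this:

-- Not shippable: now() is volatile, so the filter is applied locally after a broad fetch.
SELECT * FROM remote_orders
WHERE created_at >= now() - interval '30 days';

-- Shippable: the application computes the cutoff and embeds it as a constant (or bind parameter).
SELECT * FROM remote_orders
WHERE created_at >= TIMESTAMPTZ '2024-05-01 00:00:00+00';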

Join pushdown conditions

Joins are the next big lever. When postgres_fdw encounters a join between foreign tables on the same foreign server, it will send the entire join to the remote server unless it believes it will be more efficient to fetch the tables individually or unless the tables use different user mappings [6].

It applies the same precautions described for WHERE clauses: the join condition must be shippable, and both tables must be on the same server. Cross‑server joins are never pushed down – the FDW will perform them locally.

Shippability decision tree

It can be helpful to visualize the shippability rules as a flowchart. Below is a simple decision tree that you can use when inspecting an expression or join clause.

Flowchart for determining SQL expression shippability: it starts with the question of whether the expression appears in a WHERE or JOIN clause, then branches on factors like volatile functions, built‑in functions, type mismatches, and cross‑server joins, concluding with either "Not shippable, evaluated locally" or "Shippable, included in Remote SQL."

If you reach the left side of the tree, the expression will be evaluated locally. If you reach the right side, the FDW can ship it.

Shippable Operations: a Deep Dive

Postgres has been expanding what postgres_fdw can push down over several versions. This section walks through each operation class and the conditions required for pushdown.

Filters (WHERE clauses)

As explained above, simple filters that use built‑in operators and immutable functions are generally pushed down. If you see a Filter: node above a Foreign Scan in your plan, it means some part of your predicate didn’t qualify. Common reasons include using now(), timezone() or other volatile functions, referencing a non‑allow‑listed extension, or comparing different collation settings.

When this happens, the entire table (or at least all rows matching other shippable conditions) is fetched, and the filter is applied locally.

Plan smell: Look for a Foreign Scan node with a Filter: line directly above it. That means filtering happened locally. Also look for broad Remote SQL such as:

SELECT * FROM remote_table WHERE (name = 'Hamdaan')

with none of your other scoping predicates. That's a sign that part of the filter was not pushed down.

Joins

Simple inner joins between foreign tables on the same foreign server are usually pushable. The join condition must satisfy the same shippability rules as filters. If the join involves more than one foreign server, if the join condition uses an unshippable function, or if the foreign tables use different user mappings, the FDW will fetch each table separately and join them locally [6]. This can lead to large intermediate sets being transferred.

Plan smell: A Hash Join or Merge Join where both inputs are Foreign Scan nodes indicates that the join was performed locally. Conversely, a single Foreign Scan representing a join and containing the JOIN ... ON clause in Remote SQL indicates that the join was pushed down.

Aggregates (GROUP BY, COUNT, SUM, and so on)

Starting in PostgreSQL 10, aggregates can be pushed to the remote server when possible. The release notes state explicitly: “push aggregate functions to the remote server,” and explain that this reduces the amount of data that must be transferred from the remote server and offloads aggregate computation [7].

To qualify, both the grouping expressions and the aggregate functions themselves must be shippable. If the FDW cannot push an aggregate, it will fetch the raw rows and perform the aggregation locally.

Plan smell: Look for a GroupAggregate node above a Foreign Scan that returns many rows. When the aggregate is pushed down, there will be no local aggregate node. Instead, the Remote SQL will include a GROUP BY clause.

ORDER BY and LIMIT

Prior to PostgreSQL 12, sorting and limiting were rarely pushed down. In version 12, Etsuro Fujita’s patch allows ORDER BY sorts and LIMIT clauses to be pushed to postgres_fdw foreign servers in more cases [8]. For the sort or limit to be pushed, the underlying scan must be pushable, and the ordering expression must be shippable. Partitioned queries or complicated join trees may still cause the sort or limit to be applied locally.

Plan smell: A local Sort or Limit node above a Foreign Scan indicates the operation was not pushed down. Conversely, a Remote SQL statement containing ORDER BY and LIMIT indicates that pushdown succeeded.

DISTINCT

Distinct operations can be pushed down when the distinct expression list is shippable. But if the distinct is combined with unshippable expressions, or if the distinct is applied after a join that cannot be pushed down, the FDW will retrieve all rows and perform the distinct locally.

Window functions

In practice, window functions are rarely pushed down through postgres_fdw. They often require ordering or partitioning semantics that are difficult to represent portably. If you see a WindowAgg node in your plan, it’s almost always local. That doesn’t mean you can't use window functions with foreign tables, but you should expect them to incur network and CPU costs.

Version differences

Postgres developers continue to improve the FDW layer. Here are some notable changes by version:

  1. PostgreSQL 9.6 introduced remote join pushdown and allowed UPDATE/DELETE pushdown. Before 9.6, all joins were local.

  2. PostgreSQL 10 introduced aggregate pushdown, enabling remote GROUP BY and aggregate functions [7].

  3. PostgreSQL 12 expanded ORDER BY and LIMIT pushdown [8].

  4. PostgreSQL 15 added pushdown for certain CASE expressions and other improvements.

If you learned FDW behavior on an older version, revisit your assumptions.

Pushdown Blockers and Why They Exist

When pushdown fails, it’s not due to bad luck. There’s always a reason grounded in safety or correctness. Here are the most common blockers and how to diagnose them.

Non‑immutable functions

Functions marked VOLATILE or STABLE cannot be pushed down because their results may differ between the local and remote server. Examples include now(), random(), current_user, and user‑defined functions that look at session variables or query the database. Even functions you might think are harmless, like age() or clock_timestamp(), can cause pushdown to fail.

Fix: Compute volatile values in your application or in a CTE before referencing the foreign table. For example, compute timestamp 'now' - interval '30 days' as a constant and compare your created_at column against that constant. Alternatively, move the logic into a stored generated column on the remote table.

Type and collation mismatches

The documentation warns that when types or collations don’t match between the local and remote tables, the remote server may interpret conditions differently [1]. This is particularly insidious when text comparisons, case‑insensitive collations, or non‑default locale settings are used. If Postgres can't guarantee the same semantics, it will pull rows locally and evaluate the expression.

Fix: Make sure that your foreign table definition uses the same data types and collations as the remote table. When in doubt, explicitly cast values to a common type.

Cross‑server joins

Joins across different foreign servers cannot be pushed down. The FDW can only ship a join when both tables reside on the same remote server and use the same user mapping [6]. Otherwise, it will perform two separate scans and join the results locally.

Fix: If you frequently join tables across servers, consider consolidating the tables on a single server, materializing a view on one side, or pulling the smaller table into a temporary local table before joining.

Mixed local and foreign joins

A join between a local table and a foreign table will not be pushed down. Even though the foreign side might be pushdown‑eligible, the FDW cannot join it with local data on the remote server. A nested loop with a parameterized foreign scan is the typical pattern here, resulting in many remote calls.

Fix: Filter or aggregate as much as possible on the foreign side first (via a CTE or by materializing a subset) before joining to local tables.

Remote session settings and search paths

Because postgres_fdw sets a restricted search_path, TimeZone, DateStyle, and IntervalStyle in remote sessions [4], any functions you call must be schema‑qualified or otherwise compatible. If a function relies on the current search path or session settings, it may break or produce different results on the remote side.

Fix: Schema‑qualify remote functions and ensure that any environment‑dependent logic is safe to execute under the default FDW session settings. If necessary, attach SET search_path or other settings to your remote functions.
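
If you control the remote function, one option (a sketch with a hypothetical schema and function name) is to pin its search path so it behaves the same under the FDW's restricted session:

-- Hypothetical remote function; pinning search_path makes it independent of session settings.
ALTER FUNCTION reporting.normalize_name(text)
  SET search_path = reporting, pg_catalog;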

Troubleshooting matrix

The table below maps symptoms in your EXPLAIN plan to likely causes and fixes. Use it as a quick diagnostic tool when something looks off.

  • Symptom: Foreign Scan has loops much greater than 1. Likely cause: a parameterized remote lookup driven by a nested loop, or join conditions that aren't shippable. Suggested fix: rewrite the join so the FDW can ship a single joined query, or batch remote requests via an IN list or temporary table.

  • Symptom: broad Remote SQL that lacks scoping predicates. Likely cause: the WHERE clause contains non‑immutable functions or unsupported operators. Suggested fix: replace volatile functions with constants, allow‑list extension functions, and ensure types and collations match.

  • Symptom: a local Hash Join or Merge Join between two foreign tables. Likely cause: the join could not be pushed down (different servers, different user mappings, or an unshippable join expression). Suggested fix: consolidate tables on one server, align user mappings, or rewrite the join condition.

  • Symptom: a local Sort, Limit, or Unique on top of a Foreign Scan. Likely cause: ORDER BY, LIMIT, or DISTINCT could not be pushed down. Suggested fix: simplify sort expressions, push filters deeper, and check your Postgres version for improvements.

  • Symptom: the plan runs but gives wrong results when pushdown is enabled. Likely cause: a semantic mismatch due to type/collation differences or remote session settings [1] [4]. Suggested fix: align types and collations, schema‑qualify functions, and use stable session settings.

Reading EXPLAIN Like a Pro

SQL execution plan analysis table with columns: exclusive, inclusive, rows x, rows, loops, and node details. Rows display Nested Loop Join, Hash Join, and Seq Scan operations with costs, times, and buffers. Highlighted cells indicate notable metrics.

Many developers skim EXPLAIN plans for local queries, looking at the top nodes and overall cost. For FDW queries, you must invert that habit: read the foreign parts first. The Remote SQL string tells you what the remote server is being asked to do, and the loops field tells you how many times that remote call is executed.

Inspect the Foreign Scan nodes

Start by finding the Foreign Scan node(s). In EXPLAIN (VERBOSE), each foreign scan includes a line like:

Remote SQL: SELECT ...

This line is not a minor detail – it’s the actual SQL that will run on the remote server. Read it carefully. Does it include your WHERE predicates? Does it include your join conditions? If not, you know the local server will pick up the slack.

Look at the loops column. If the loops exceed 1, the same remote query is executed multiple times. For example:

Foreign Scan on public.user_entity  (rows=1 loops=416)
  Remote SQL: SELECT id, tenant_id FROM public.user_entity WHERE enabled AND service_account_client_link IS NULL AND id = $1

This is the “N+1” problem in disguise. The plan executes the foreign scan once per outer row. Multiply the per‑loop cost by the number of loops to understand why the query is slow. The fix is to rewrite the query so that the join and filters are applied in a single remote call.

Recognize InitPlan vs SubPlan

An InitPlan runs once and caches its result. A SubPlan can run per outer row. In FDW queries, subplans often drive parameterized remote scans. If you see a SubPlan attached to a nested loop that feeds a foreign scan, suspect a parameterized remote lookup and look for ways to turn it into an InitPlan or merge it into a single remote query.

Understand CTE materialization

Common table expressions (CTEs) behave differently depending on whether they are marked MATERIALIZED or NOT MATERIALIZED. A materialized CTE is computed once and stored in a temporary structure, then read by the rest of the query. A non‑materialized CTE is inlined into the parent query, allowing optimizations to span across the boundary.

In PostgreSQL 12 and later, CTEs are inlined by default unless they’re referenced multiple times or explicitly marked MATERIALIZED. Materializing a CTE that contains a foreign scan can freeze a broad remote fetch and prevent later clauses from being pushed down. On the other hand, materialization can prevent repeated remote scans if the CTE is referenced multiple times. Use this lever deliberately to control where remote work happens.
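
A sketch of the two forms, again using the illustrative remote_orders table – the keyword decides whether the outer filter can still reach the remote server:

-- Inlined (the default in PG 12+ for a single reference): both filters can be combined and pushed down.
WITH recent AS NOT MATERIALIZED (
  SELECT id, customer_id, amount
  FROM remote_orders
  WHERE created_at >= '2024-01-01'
)
SELECT * FROM recent WHERE customer_id = 42;

-- Materialized: the CTE is computed once with only the date filter; customer_id is applied locally afterwards.
WITH recent AS MATERIALIZED (
  SELECT id, customer_id, amount
  FROM remote_orders
  WHERE created_at >= '2024-01-01'
)
SELECT * FROM recent WHERE customer_id = 42;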

Annotated example

Let's annotate simplified excerpts from a real pair of plans – the original query and its refactor. The goal is to show how to quickly read the relevant parts.

Old plan (excerpt):

Nested Loop  (rows=414 loops=1)
  -> Hash Join  (rows=416 loops=1)
       -> Foreign Scan on public.user_entity (rows=1 loops=416)
            Remote SQL: SELECT id, tenant_id FROM public.user_entity WHERE enabled AND service_account_client_link IS NULL AND id = $1

Refactored plan (excerpt):

Foreign Scan on public.user_attribute (rows=671 loops=1)
  Remote SQL: SELECT ua.user_id, ua.value FROM user_attribute ua JOIN user_entity u ON ua.user_id = u.id JOIN user_group_membership ugm ON ugm.user_id = u.id JOIN keycloak_group g ON g.id = ugm.group_id JOIN tenant r ON u.tenant_id = r.id WHERE ua.name = 'attribute A' AND r.name = 'demo' AND u.enabled AND u.service_account_client_link IS NULL AND (g.name = 'keycloak-group-a' OR g.parent_group = $1)

In the old plan, the Foreign Scan on user_entity executed 416 times, each time retrieving a single row. Its Remote SQL only applies the filters on enabled and service_account_client_link – it doesn’t include the tenant or group scoping. That scoping is applied by the nested loop outside the foreign scan.

In the refactored plan, the single Foreign Scan results from combining user_attribute, user_entity, user_group_membership, keycloak_group, and tenant into one remote query. It retrieves 671 rows in a single call and includes all relevant filters. There is no repeated remote call. The timing difference is driven by the different loop counts and the selectivity of the Remote SQL.

How to Tune postgres_fdw

Once you've structured your query for maximum pushdown, tuning knobs let you squeeze out further performance improvements and adjust planner decisions.

fetch_size

fetch_size controls how many rows postgres_fdw retrieves per network fetch. The default is 100 rows [9]. A small fetch size means more round-trips and lower memory usage. A larger fetch size reduces network overhead at the cost of buffering more rows in memory.

In practice, increasing fetch_size to a few thousand can reduce latency for large result sets. It’s specified either at the foreign server or foreign table level:

ALTER SERVER foreign_server OPTIONS (ADD fetch_size '1000');
ALTER FOREIGN TABLE remote_table OPTIONS (ADD fetch_size '1000');

use_remote_estimate

By default, the planner estimates the cost of foreign scans using local statistics. This can be wildly inaccurate if the foreign table has a different data distribution. Setting use_remote_estimate to true tells postgres_fdw to run EXPLAIN on the remote server to get row count and cost estimates. This can dramatically improve join order selection at the cost of an additional remote query during planning [3]. You can set this per table or per server:

ALTER SERVER foreign_server OPTIONS (SET use_remote_estimate 'true');

fdw_startup_cost and fdw_tuple_cost

These cost parameters model the overhead of starting a foreign scan and the cost per row fetched. Adjusting them can influence the planner’s choice of join strategy. A higher fdw_startup_cost discourages the planner from choosing plans with many small foreign scans (which might generate many remote calls). A higher fdw_tuple_cost discourages plans that fetch large numbers of rows [3]. Use these only after you have solid evidence from EXPLAIN and experiments.
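
For example, here is a sketch of nudging the planner away from chatty plans (the values are illustrative – calibrate them against your own measurements):

-- Illustrative values on the placeholder server from earlier; adjust after measuring.
ALTER SERVER analytics_server
  OPTIONS (ADD fdw_startup_cost '500', ADD fdw_tuple_cost '0.1');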

ANALYZE and analyze_sampling

Running ANALYZE on a foreign table collects local statistics by sampling the remote table [3]. Accurate stats are essential for good estimates when use_remote_estimate is false.

But if the remote table changes frequently, these stats become stale quickly. The analyze_sampling option controls whether sampling happens on the remote side or locally. When analyze_sampling is set to random, system, bernoulli, or auto, ANALYZE will sample rows remotely instead of pulling all rows into the local server [3].
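
A sketch combining both ideas, assuming PostgreSQL 16 or later for analyze_sampling and the illustrative remote_orders table:

-- Sample on the remote side during ANALYZE instead of pulling every row across the network.
ALTER FOREIGN TABLE remote_orders OPTIONS (ADD analyze_sampling 'system');

-- Refresh the local statistics for the foreign table.
ANALYZE remote_orders;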

extensions

The extensions option lists extensions whose functions and operators can be shipped to the remote server [2]. If you rely on functions from citext, pg_trgm, or other extensions, add them to the server definition:

ALTER SERVER foreign_server OPTIONS (SET extensions 'citext,pg_trgm');

A quick knob impact table

  • fetch_size – Primary effect: number of rows per fetch. When to change it: result sets are large and latency dominates. Possible downside: too large a value consumes memory.

  • use_remote_estimate – Primary effect: better row count and cost estimates. When to change it: the planner misestimates foreign scans. Possible downside: extra remote queries during planning.

  • fdw_startup_cost – Primary effect: per‑scan penalty for foreign scans. When to change it: the planner chooses many small foreign scans. Possible downside: wrong values bias the planner.

  • fdw_tuple_cost – Primary effect: cost per row fetched. When to change it: the planner pulls too many rows. Possible downside: mis‑tuned values mislead the planner.

  • extensions – Primary effect: controls which extension functions are shippable. When to change it: you use extension functions in predicates. Possible downside: the extensions must exist and match on both servers.

Schema and Index Recommendations

Pushdown doesn’t eliminate the need for good indexes. In fact, effective pushdown depends on the remote server having indexes that support the filter and join predicates you’re shipping.

Below are some patterns to watch for in FDW queries and the indexes that support them. You can adapt these to your own schema.

  • tenant (remote) – Access pattern: filter by tenant.name. Recommended index: UNIQUE (name) or BTREE (name). Why: resolves the tenant ID quickly.

  • keycloak_group (remote) – Access pattern: filter by name, join by tenant_id, filter on parent_group. Recommended index: composite (tenant_id, name) plus (parent_group). Why: supports resolving the root group and walking a one‑level hierarchy.

  • user_group_membership (remote) – Access pattern: join by user_id, filter by group_id. Recommended index: BTREE (group_id, user_id). Why: efficiently finds users in a set of groups.

  • user_attribute (remote) – Access pattern: filter by name, join by user_id. Recommended index: composite (name, user_id), optionally including value. Why: matches the “attribute name → users → values” flow.

  • user_entity (remote) – Access pattern: filter by tenant_id, enabled, and service_account_client_link IS NULL, join by id. Recommended index: partial index on (tenant_id, id) with a predicate on enabled and service_account_client_link IS NULL. Why: helps the remote planner start from the user table when tenant and user filters are applied.

  • filtercategory (local) – Access pattern: filter by category && uuid[], join on (entitytype, entityid). Recommended index: GIN index on category plus BTREE (entitytype, entityid). Why: speeds up array overlap checks and the join predicate.

In general, indexes should reflect the join order you expect the remote planner to use. If your Remote SQL starts with:

FROM user_attribute ua JOIN user_entity u ON ua.user_id = u.id JOIN user_group_membership ugm ON ...

ensure that indexes exist on user_attribute(user_id) and user_group_membership(user_id).
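
On the remote server, that might translate into definitions like these (sketches – adjust names and column orders to your actual schema and predicates):

-- Remote-side indexes supporting the joins and filters the shipped query will perform.
CREATE INDEX ON user_attribute (name, user_id);
CREATE INDEX ON user_group_membership (group_id, user_id);
CREATE INDEX ON keycloak_group (tenant_id, name);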

Benchmarking Methodology

It’s easy to claim a performance improvement without proper measurement. Here's a repeatable method you can use to benchmark FDW query changes.

  1. Warm the caches. Run each query once to load data into the remote buffer cache and the local FDW connection. Discard the timings.

  2. Measure latencies. Use EXPLAIN (ANALYZE, BUFFERS, VERBOSE) to capture execution times, buffer usage, and remote row counts. Be aware that EXPLAIN ANALYZE adds overhead, so record the raw execution time if possible by running the query directly.

  3. Record remote metrics. On the remote server, enable pg_stat_statements and track the calls, total_time, and rows for each remote query. This gives you a per‑query breakdown and confirms what Remote SQL is executed.

  4. Control for concurrency and network latency. Run benchmarks during a quiet period or isolate the test cluster. If your environment has high network latency, record the round‑trip time separately to attribute delays.

  5. Compare apples to apples. Benchmark the old and new queries under identical conditions. Use the same sample data, same remote server, and same connection settings.

  6. Look at row counts. The primary goal of pushdown is to reduce the number of rows shipped. Compare the rows column of each Foreign Scan node.

Here's a simple matrix you can use to record your experiments:

  • Baseline (old query) – What you're testing: the starting point, with broad remote scans plus local joins. Expected change in Remote SQL: it lacks scoping predicates. Metrics to record: p50/p95 latency, remote row count, local sort/hash time.

  • Refactor (new query) – What you're testing: join and filter pushdown. Expected change in Remote SQL: it includes the joins and filters. Metrics to record: the same metrics, plus the remote row count.

  • Introduce a volatile function – What you're testing: a pushdown blocker. Expected change in Remote SQL: the clause disappears from Remote SQL. Metrics to record: remote row count increases, local filter cost increases.

  • Type or collation mismatch – What you're testing: semantic risk. Expected change in Remote SQL: behavior may change or pushdown may be lost. Metrics to record: compare correctness and row counts.

  • ORDER/LIMIT pushdown – What you're testing: version‑dependent behavior. Expected change in Remote SQL: it includes ORDER BY and LIMIT. Metrics to record: sort time shifts to the remote; the row count should remain the same.

  • use_remote_estimate on/off – What you're testing: planning accuracy. Expected change: the planner uses remote estimates during planning. Metrics to record: planning time, join order, and runtime difference.

Monitoring and Logging

In production, you need to know when a query starts misbehaving. There are two places to look: the local server and the remote server.

Local metrics

  1. pg_stat_statements. This extension tracks planning and execution times, row counts, and buffer hits for each query. Look for high total times relative to rows or calls.

  2. Auto Explain or auto_explain. Turn on auto_explain.log_min_duration_statement to capture slow queries with plans. This will show you the Remote SQL executed and whether the plan changed.

  3. Connection pool metrics. Monitor connection counts and wait events related to FDW operations (for example, PostgresFdwConnect, PostgresFdwGetResult) as described in the documentation [10].

Remote metrics

  1. pg_stat_statements on the remote server. This lets you see which Remote SQL queries are being executed, how often, and how long they take. Compare these with the Remote SQL strings in your local EXPLAIN plans.

  2. Server logs. Increase log_statement or log_min_duration_statement on the remote server to capture long-running remote queries.

Correlating local and remote metrics can reveal patterns such as a new code path causing a surge in remote queries or pushdown failures, leading to heavy remote scans.
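
As a concrete starting point, a query like the following works on either server (a sketch; the total_exec_time column assumes pg_stat_statements on PostgreSQL 13 or later, where total_time was split into planning and execution time):

-- Top statements by total execution time; compare remote entries with the Remote SQL in local plans.
SELECT calls,
       round(total_exec_time) AS total_ms,
       rows,
       left(query, 80)        AS query_start
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;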

Case Study: Refactoring a Keycloak Coverage Query

The theory above may seem abstract until you see it play out in practice. Let's walk through a real example inspired by a Keycloak integration.

The original query calculated coverage: given a list of category IDs, it returned the percentage of users who had attributes mapped to those categories and a JSON array of entity counts. The query used a CTE to build a list of scoped users, then joined it with user attributes, category mappings, and a few other tables.

Symptom

In a test environment with 100K user records, the query averaged 166 ms. This was slower than expected. Running EXPLAIN (ANALYZE, BUFFERS, VERBOSE) showed two foreign scans on the Keycloak database. The first scanned user_entity 416 times (loops = 416). The second pulled all rows from user_attribute where name = 'attributeA' before filtering by tenant and group locally.

Here's a simplified excerpt (numbers are approximate):

Foreign Scan on public.user_entity  (actual time=0.117..0.117 rows=1 loops=416)
  Remote SQL: SELECT id, tenant_id FROM public.user_entity WHERE (enabled AND service_account_client_link IS NULL AND id = $1)
Foreign Scan on public.user_attribute  (actual time=41.267..80.352 rows=80739 loops=1)
  Remote SQL: SELECT value, user_id FROM public.user_attribute WHERE (('attributeA' = name))

The first scan performed a single-row lookup 416 times. The second scan retrieved 80,739 rows because the only condition pushed down was name = 'attributeA'. Tenant and group scoping occurred locally. That meant 80k rows were transferred over the network and then filtered down to about 671 on the local side.

Diagnosis

There were two main issues.

First was the N+1 remote calls on user_entity. The join to user_entity was not pushed down, so the plan executed a remote lookup for each row from user_group_membership. This created 416 remote queries.

Second was the unscoped attribute fetch. Because the WHERE clause included user_entity.tenant_id = tenant.id and keycloak_group.name = 'groupA' in a higher CTE, the FDW could not see those predicates when scanning user_attribute. It therefore fetched all rows with name = 'attributeA' and left the tenant and group filters to the local side.

Refactor

The fix was to inline the tenant and group joins into the user_attribute scan to avoid the nested-loop pattern. The refactored selected_user_attributes CTE looked like this (simplified for readability):

WITH selected_user_attributes AS (
  SELECT DISTINCT ua.user_id, ua.value
  FROM public.user_attribute ua
  JOIN public.user_entity u ON u.id = ua.user_id
  JOIN public.user_group_membership ugm ON ugm.user_id = u.id
  JOIN public.keycloak_group g ON g.id = ugm.group_id
  JOIN public.tenant r ON r.id = u.tenant_id
  WHERE ua.name = 'attributeA'
    AND u.enabled
    AND u.service_account_client_link IS NULL
    AND r.name = 'tenantA'
    AND (g.name = 'groupA' OR g.parent_group = (
SELECT id FROM public.keycloak_group WHERE name = 'groupA' AND tenant_id = r.id
    ))
)

This single query expresses the same scoping logic that previously lived in separate CTEs. Because all the join conditions are on the same foreign server and use built‑in operators, the FDW can push down the entire join. The new plan looked like this:

Foreign Scan  (actual time=7.840..7.856 rows=671 loops=1)
  Remote SQL: SELECT ua.user_id, ua.value FROM user_attribute ua JOIN user_entity u ON ua.user_id = u.id JOIN user_group_membership ugm ON ugm.user_id = u.id JOIN keycloak_group g ON g.id = ugm.group_id JOIN tenant r ON u.tenant_id = r.id WHERE ua.name = 'attributeA' AND u.enabled AND u.service_account_client_link IS NULL AND r.name = 'tenantA' AND (g.name = 'groupA' OR g.parent_group = $1)

Only one remote query is executed, and it returns 671 rows. Tenant and group scoping occur on the remote server. There is no nested loop or repeated remote scan. The final runtime dropped to about 25 ms.

Why it improved

  1. Fewer rows crossing the network. The old plan fetched 80k attribute rows and filtered them locally. The new plan fetched only the 671 scoped rows.

  2. No repeated remote calls. The old plan executed 416 remote scans of user_entity. The new plan performs one joined remote query.

  3. Less local work. Because the join and filtering happen remotely, the local side no longer hashes or filters large sets.

Key takeaway

If you see a Foreign Scan with a high loops count or a Remote SQL that doesn’t contain your filters and joins, you’re leaving performance on the table. Merging filters and joins into a single remote query (subject to shippability rules) often yields orders-of-magnitude improvements.

Checklist and Troubleshooting Guide

The following steps summarize how to approach FDW performance tuning:

  1. Inspect the Remote SQL. Always run EXPLAIN (VERBOSE) and look at what is being sent to the remote. If your predicates are missing, the FDW isn't pushing them down.

  2. Check loops. If the loops are greater than 1 on a Foreign Scan, you are paying for repeated remote calls. Rewrite the query or reorder the joins to make the foreign scan run once.

  3. Make predicates shippable. Replace volatile functions with constants or parameters. Ensure operators and functions are built‑in or explicitly allow‑listed via the extensions option [2].

  4. Align types and collations. Use the same data types and collations on both sides to avoid semantic mismatches [1].

  5. Push joins to the same server. Consolidate tables on one foreign server if possible. Joins across servers cannot be pushed down [6].

  6. Use use_remote_estimate when planning seems off. Enabling remote estimates can improve join order selection [3].

  7. Tune fetch_size and costs if your queries transfer many rows. A bigger fetch_size reduces round-trips; adjusting fdw_startup_cost and fdw_tuple_cost influences the planner [3].

  8. Analyze foreign tables if you rely on local cost estimates. Keep in mind that stats can get stale quickly [3].

  9. Monitor both servers. Use pg_stat_statements on local and remote servers to see how often remote queries run and how long they take.

  10. Test version upgrades. Each major release improves FDW pushdown semantics (for example, aggregates in 10 [7], ORDER/LIMIT in 12 [8]). Retest after upgrading.

Case Study Takeaways

Querying remote data with PostgreSQL’s postgres_fdw can be fast and convenient if you respect the underlying mechanics. Pushdown is the difference between streaming a trickle of relevant rows and hauling an ocean of data across the network. It isn't simply a matter of moving CPU cycles – it changes how much data moves, how many network round-trips occur, and how much work your local server has to do.

The rules may seem restrictive – use only immutable functions, avoid cross‑server joins, align types and collations – but they exist to preserve correctness while enabling optimization.

By reading EXPLAIN from the bottom up, inspecting the Remote SQL, and understanding the shippability rules, you can spot slow patterns quickly. Armed with tuning knobs like fetch_size and use_remote_estimate, and a willingness to rewrite queries to make joins and filters pushable, you can often achieve dramatic performance gains without touching your hardware.

This case study shows that rewriting a query to enable a single-joined remote query reduced runtime from around 166 ms to 25 ms. That sort of improvement is not rare. It’s what happens when you treat FDW queries as distributed queries rather than local queries in disguise.

The next time you debug a slow FDW query, remember this handbook. Check the Remote SQL. Count the loops. Ask yourself: “Am I doing the work close to the data, or am I bringing the data to the work?” Adjust accordingly, and you'll write queries that make the most of Postgres's federated capabilities while keeping your latency in check.

This section closes the case study loop and summarizes exactly what changed in the plan and why it produced a large end-to-end win. The following sections of the handbook turn that single win into a repeatable method: how Postgres determines what is shippable, how to quickly read FDW plans, which operations and versions matter, and how to debug common failure modes that prevent pushdown.

Advanced Operations: A Deeper Dive into Shippability

The previous sections introduced the basic rules around what can be pushed to the remote and why. To really make sense of those rules, you need to see how they play out on the operations you use every day.

This section walks through filters, joins, aggregates, ordering, and limits, DISTINCT queries, and window functions in more detail. By the end, you should have a mental map of which operations to trust and which to double‑check when reading your plans.

Filters and simple predicates

WHERE clauses matter more than you think

When you specify WHERE attribute = 'value' on a foreign table, the FDW will happily transmit that predicate to the remote server as long as the comparison uses built‑in types and immutable operators. For example:

  • WHERE id = 42 is fine

  • WHERE lower(username) = 'hamdaan' is fine – lower() is built‑in and immutable (though collation mismatches can still block pushdown)

  • WHERE created_at >= now() - interval '7 days' is not shippable because now() is volatile

When such a predicate cannot be pushed, the FDW will fetch every row that matches all the shippable predicates and apply the rest locally. That means that a seemingly innocuous call to now() can blow up your network traffic.

The lesson is simple: compute volatile values up front (in your application or in a CTE) and reference them as constants in the query against the foreign table.

Complex expressions are not automatically unsafe

Suppose you have WHERE (status = 'active' AND (age BETWEEN 18 AND 29 OR age > 65)). This entire expression is shippable because it uses built‑in boolean logic, simple comparisons, and immutable operators. The FDW will deparse it into remote SQL and forward it. You only need to worry when one of the subexpressions introduces a function or operator that the FDW doesn’t recognize or cannot safely assume exists on the remote.

A good heuristic is: if you can express your filter using only simple comparisons, boolean logic, and built‑in functions, pushdown should work. When in doubt, check the Remote SQL.

Array and JSON operators

Modern Postgres makes heavy use of array and JSON functions. Many of these, like the array overlap operator && used in the case study, are built‑in and can be shipped. But some JSON and text-processing helpers are provided by extensions rather than core.

If your filter uses one of these, ensure that the extension is available and allow‑listed on the foreign server. Otherwise, the FDW will fetch rows and perform the JSON logic locally. This is rarely what you want when dealing with large JSON columns.

Joins: the good, the bad, and the ugly

Same‑server joins are your friend

If you join multiple foreign tables that are all defined on the same foreign server and user mapping, and if the join condition uses only shippable expressions, then the FDW can generate a single remote join. This is the ideal case.

For example, joining orders and customers on orders.customer_id = customers.id is pushable, as long as both tables reside on the same foreign server. The remote planner will use its own statistics and indexes to plan the join, and the local server will simply iterate through the result. Postgres 9.6 and later support this pattern [6].

Cross‑server joins break pushdown

If you attempt to join two foreign tables that live on different servers (or even on the same remote server but with different user mappings), postgres_fdw will fetch the tables separately and join them locally. This is almost always slower than pushing the join down, because you end up transferring both tables in their entirety.

The FDW design team chose not to support cross‑server joins because there is no portable way to tell two remote servers to cooperate on a join. Your options are: replicate one table on the other server, materialize the smaller table locally before joining, or restructure the query to filter aggressively on each side before joining locally.

Mixed local/foreign joins are tricky

Joining a local table to a foreign table cannot be pushed down, for straightforward reasons: the remote server has no access to your local data. A common pattern that triggers repeated remote calls looks like this:

SELECT u.id, a.value
FROM users u
LEFT JOIN user_attribute a
  ON a.user_id = u.id AND a.name = 'favorite_color';

If users is a local table and user_attribute is foreign, the plan may use a nested loop: for each local u, it executes a remote lookup in user_attribute to retrieve attributes.

The fix is to flip the query: retrieve all relevant rows from user_attribute in one remote scan, then join them locally. Or, if possible, create a small temporary table on the remote side with your u.id values, perform the join entirely remotely, and then fetch the results.

Join conditions matter

Even when joining two foreign tables on the same server, an unshippable join condition will force the join to be local. For example, JOIN ON textcol ILIKE '%foo%' is not pushable because ILIKE might not exist or behave identically on the remote.

If you need case‑insensitive matching, consider lowercasing both sides: LOWER(textcol) = 'foo' (assuming the remote server has the lower() function available and allowed). Similarly, joining on a cast expression (for example, JOIN ON CAST(a.id AS text) = b.text_id) can block pushdown. Define your columns with matching types instead.

Aggregates and grouping

Aggregates are where the data movement story shines. When you can push down a GROUP BY and aggregate functions like COUNT, SUM, AVG, or MAX, you reduce the result set to just the aggregated rows. This can be a difference of several orders of magnitude.

Postgres 10 introduced aggregate pushdown [7]. But not all aggregates are equal:

Simple aggregates such as COUNT(*), SUM(col), AVG(col), MIN(col), and MAX(col) are shippable when applied to shippable expressions. Even COUNT(DISTINCT col) is often shippable, because the remote can deduplicate before counting. The FDW will wrap the aggregate in a remote query and return just the aggregated row.

If you see a GroupAggregate node on the local side, check whether all involved columns and functions are shippable. If they are, ensure that the join conditions above are also pushable.

Filtered aggregates such as COUNT(*) FILTER (WHERE x > 5) or SUM(col) FILTER (WHERE status = 'active') are often pushable, because they translate into SUM(CASE WHEN condition THEN col ELSE 0 END) or COUNT(...). As long as the filter is shippable, the FDW will push it into the remote aggregate.

User‑defined aggregates are rarely pushable. If you have a custom aggregate function, the FDW will not assume that it exists or behaves the same on the remote server. Even if you install the function on both servers, postgres_fdw won't push it unless the function is in an allow‑listed extension.

Grouping sets and rollups are not currently pushable. When you write GROUP BY GROUPING SETS (...) or ROLLUP(...), Postgres will compute the grouping locally even if the underlying scan is remote.

If you need complex rollups, consider performing them in two steps: push down the initial grouping to the remote server to reduce rows, then perform the rollup locally.
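
To see whether an aggregate was shipped, check where the GROUP BY ends up (a sketch using the illustrative remote_orders table):

EXPLAIN (VERBOSE, COSTS OFF)
SELECT customer_id, count(*) AS orders, sum(amount) AS total
FROM remote_orders
GROUP BY customer_id;
-- Pushed down: a single Foreign Scan whose Remote SQL contains the GROUP BY.
-- Not pushed down: a local GroupAggregate sitting above the Foreign Scan.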

ORDER BY, LIMIT, and DISTINCT

Ordering and limiting rows may seem like purely cosmetic features, but they affect how much data is transferred. If the remote can sort and limit, the local server only receives the top N rows. If it cannot, the local server must sort everything.

Postgres 12 expanded the cases where ORDER BY and LIMIT are pushed down [8]. Here are guidelines:

  • Single foreign scan with simple sort: If your query selects from one foreign table and sorts by a shippable expression (for example, ORDER BY created_at DESC), the FDW will include ORDER BY in Remote SQL. It will also push down LIMIT and OFFSET. This is ideal because the remote server does the sort and sends only the top rows.

  • Sort after join: If you sort after joining two foreign tables on the same server, and the join and sort expressions are shippable, the FDW may push both down. But if the sort requires columns from the local side or from a different remote server, the FDW cannot push it down.

  • Sort after aggregation: Sorting aggregated results is often pushable as long as the aggregate itself is pushable. But when grouping occurs locally, the sort remains local.

  • DISTINCT behaves like GROUP BY. If the distinct expression list is shippable, the FDW can push it down. If you write SELECT DISTINCT ON (col1) col2, col3 FROM ... and col3 is not part of the DISTINCT list, Postgres will treat this as GROUP BY and may push it. Be aware that DISTINCT ON semantics differ from plain DISTINCT and may not be pushable in older Postgres versions.
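
The same check works for sorting and limiting. In the simple single-table case on PostgreSQL 12+, the ORDER BY and LIMIT should appear in the Remote SQL (again a sketch against the illustrative remote_orders table):

EXPLAIN (VERBOSE, COSTS OFF)
SELECT id, amount
FROM remote_orders
ORDER BY created_at DESC
LIMIT 10;
-- Pushed down: Remote SQL ends with ORDER BY ... LIMIT, and only ten rows cross the network.
-- Not pushed down: local Sort and Limit nodes appear above the Foreign Scan.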

Window functions

Window functions (for example, ROW_NUMBER() OVER (PARTITION BY ...), RANK(), LAG(), LEAD()) rely on ordering and partitioning across rows.

Postgres has not yet taught postgres_fdw how to push window functions. When you see a WindowAgg node in your plan, it’s almost always local. The FDW will fetch the rows, and the local server will sort, partition, and compute the window. If you need to run window functions on remote data, plan to transfer the data locally.

Version‑specific quirks

The exact pushdown capabilities vary by release. When planning migrations or deciding whether to rely on a pushdown behavior, check the release notes:

  • 9.6: first version to support pushdown of joins and sorts, and remote updates and deletes.

  • 10: introduced aggregate pushdown [7], significantly reducing network use for GROUP BY queries.

  • 11: improved partition pruning and join ordering for foreign tables.

  • 12: expanded ORDER BY and LIMIT pushdown [8].

  • 15: added pushdown for simple CASE expressions and additional built‑in functions.

  • 17 (development at the time of writing) continues to expand shippable constructs. Always test on your target version because subtle improvements can change what the FDW can ship.

Common Anti‑Patterns and How to Avoid Them

Everyone has run into FDW queries that seemed reasonable but turned out to be bottlenecks. Here are a few of the most common mistakes and how to correct them. These examples are deliberately simplified so you can adapt them to your own schema.

Using volatile functions in predicates

Anti‑pattern:

SELECT *
FROM audit_logs
WHERE event_ts >= now() - interval '1 day';

now() is a volatile function, so the FDW refuses to push this predicate. It pulls all rows from audit_logs and filters them locally.

Better:

SELECT *
FROM audit_logs
WHERE event_ts >= $1;

Compute $1 (a timestamp) in your application or upstream query. Or compute it once in a CTE:

WITH cutoff AS (SELECT now() - interval '1 day' AS ts) SELECT * FROM audit_logs, cutoff WHERE event_ts >= cutoff.ts;

The FDW sees a constant and pushes the predicate.

Joining local and foreign data first

Anti‑pattern:

SELECT u.email, ua.value
FROM users u
LEFT JOIN user_attribute ua ON u.id = ua.user_id AND ua.name = 'favorite_movie';

This uses a local table (users) to drive a join to a foreign table (user_attribute). If users has 10,000 rows, the plan can issue 10,000 individual remote queries, each fetching one or zero rows from user_attribute.

Better:

-- Fetch all favorite movies remotely and join locally
WITH remote_movies AS (
  SELECT ua.user_id, ua.value
  FROM user_attribute ua
  WHERE ua.name = 'favorite_movie'
)
SELECT u.email, rm.value
FROM users u
LEFT JOIN remote_movies rm ON u.id = rm.user_id;

Now the FDW issues one query to fetch all relevant attributes, and the join is done locally in one pass.

Cross‑server joins without materialization

Anti‑pattern:

SELECT *
FROM remote_db1.orders o
JOIN remote_db2.customers c ON o.customer_id = c.id;

This is not pushable because the two tables are on different foreign servers. Postgres will fetch orders and customers separately and join them locally. If orders has 1 million rows and customers has 50,000, you will transfer 1.05 million rows.

Better: Replicate or materialize one side on the other server (or locally) before joining. For example, create a materialized view m_customers on remote_db1 containing just the id and name of the customers you need, then join orders and m_customers on the same server. Alternatively, copy customers into a temporary table on the local server and join there.

Complex expressions on join keys

Anti‑pattern:

SELECT *
FROM remote_table a
JOIN remote_table b ON CAST(a.key AS text) = b.key_text;

Casting a numeric key to text prevents pushdown. The remote server cannot use indexes and must return both tables. The local server performs the join and cast.

Better: Align your schemas so that the join columns use the same type. If you cannot change the schema, create a computed column on the remote server with the appropriate type and use it in the join.
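
If you can't change the column type itself, a stored generated column on the remote side is one way to keep the join shippable. This is a sketch; it assumes PostgreSQL 12+, a bigint key column, and that you can alter the remote table (and add the new column to the foreign table definition as well):

-- On the remote server: expose the key in the type the other side joins on.
ALTER TABLE some_remote_table
  ADD COLUMN key_text text GENERATED ALWAYS AS (key::text) STORED;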

Ignoring collation and type mismatches

Anti‑pattern:

SELECT *
FROM remote_table
WHERE citext_col = 'abc';

If the remote server doesn’t have the citext extension installed, the comparison semantics will differ, and the FDW will refuse to ship the filter. This appears harmless until you see the plan and realize all rows were fetched.

Better: Install the same extensions and collations on the remote server, or convert the column to a base type like text on both sides.

Extending Tuning: Calibrating Cost Models

Earlier, we discussed fetch_size, use_remote_estimate, and the cost knobs. This section expands on how to use them strategically.

Balancing fetch size and memory

fetch_size controls how many rows the FDW asks for in each round trip [9]. Think of it as the batch size. The default (100) works well for small result sets. If you expect to retrieve tens of thousands of rows, a higher fetch size reduces the overhead of many network requests. But there are trade‑offs:

  • Memory consumption: Each foreign scan buffers rows until they are consumed. A huge fetch size (for example, 10,000) may allocate more memory than you expect, especially when multiple scans run concurrently. Monitor memory usage as you increase this setting.

  • Latency hiding: If network latency is high, overlapping network requests with local processing can hide some latency. But postgres_fdw does not pipeline multiple fetches – it waits for one batch before requesting the next. This means that a larger batch size reduces the number of waits, but cannot overlap them. If you operate across data centers, consider using a connection pooler or caching layer instead of just increasing fetch_size.

Remote estimates vs. local estimates

The planner uses statistics to estimate how many rows each node will produce, which in turn influences join order. When use_remote_estimate is false (the default), the planner guesses based on local stats collected by ANALYZE on the foreign table. This can be wrong if the remote table has a different distribution than the local sample, or if the table has changed since the last ANALYZE.

Setting use_remote_estimate to true instructs the FDW to run EXPLAIN on the remote server during planning to obtain row counts and cost estimates [3]. This can improve join ordering, especially when joining multiple foreign tables or mixing local and foreign tables. The downside is increased planning time because each remote estimate runs an extra query.

In practice:

  • Enable use_remote_estimate on queries with complex joins where the planner picks obviously wrong join orders. If enabling it improves the plan, consider leaving it on for that server or table.

  • Use ANALYZE on foreign tables periodically if your remote data is relatively static. This populates local stats and can avoid the overhead of remote estimates.

  • Don’t enable use_remote_estimate indiscriminately on simple lookups. The extra planning-time round trips to the remote server may outweigh the benefit.
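
Scoping the option narrowly keeps the planning overhead contained. A sketch with assumed names, enabling remote estimates for one server and for one particularly join-heavy foreign table:

ALTER SERVER analytics_srv OPTIONS (ADD use_remote_estimate 'true');
ALTER FOREIGN TABLE remote_orders OPTIONS (ADD use_remote_estimate 'true');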

Tuning cost parameters

fdw_startup_cost and fdw_tuple_cost control how much the planner thinks it costs to start a foreign scan and fetch each row [3]. If these are too low, the planner may choose a nested loop that generates many small remote calls. If they are too high, the planner might avoid remote scans even when they are efficient.

You can adjust these parameters based on empirical measurement:

  • Increase fdw_startup_cost to discourage the planner from using nested loops that call the remote table repeatedly. A reasonable starting point is a value that reflects the cost of one remote round trip, expressed in planner cost units.

  • Increase fdw_tuple_cost if network bandwidth is limited or expensive. This indicates to the planner that each remote row incurs higher fetch costs than a local row. The planner will prefer plans that filter early on the remote side.

Always adjust these settings gradually and observe the effect on the plan. Keep separate settings per foreign server if network conditions differ.
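
As a concrete sketch, with a server name and values that are assumptions calibrated from measured round-trip latency:

ALTER SERVER wan_srv OPTIONS (ADD fdw_startup_cost '500', ADD fdw_tuple_cost '0.1');
-- Higher startup cost discourages repeated small remote calls;
-- higher tuple cost nudges the planner toward plans that filter remotely.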

When to analyze foreign tables

Running ANALYZE on a foreign table collects sample statistics by pulling a subset of rows from the remote server. This helps the planner estimate row counts when use_remote_estimate is off. It also helps decide whether to use an index on the remote side. You should analyze foreign tables when:

  • The remote table is large and static, and you want accurate local estimates without the overhead of remote estimates.

  • You have just defined a foreign table, and the default stats are empty.

  • You changed the extensions allow‑list to enable more pushdown and want the planner to see the effect.

Conversely, if the remote data changes constantly, ANALYZE results will quickly become stale. In that case, rely on use_remote_estimate instead.
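
When you do run it, the command itself is trivial; the follow-up query (table name is an assumption) shows where the collected statistics land:

ANALYZE remote_ref_data;

SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'remote_ref_data';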

Further Case Studies and Practical Examples

The Keycloak coverage example is not the only place where pushdown matters. The following scenarios illustrate other patterns you may encounter.

Reporting on a sharded logging system

Imagine you store application logs across multiple shards, each a separate Postgres database. You want to produce a report of the number of error logs per service per day.

A naïve approach might combine the raw logs from every shard and aggregate them at the top:

SELECT shard, service, date_trunc('day', log_time) AS day, COUNT(*)
FROM (
  SELECT 1 AS shard, service, log_time FROM shard1.logs
  UNION ALL
  SELECT 2 AS shard, service, log_time FROM shard2.logs
  ...
) x
GROUP BY shard, service, day;

This approach will fetch all log rows to the local server and aggregate them locally. A better solution is to push the grouping to each shard:

SELECT shard, service, day, sum(count)
FROM (
  SELECT 1 AS shard, service, date_trunc('day', log_time) AS day, COUNT(*) AS count
  FROM shard1.logs
  WHERE log_time >= $1 AND log_time < $2
  GROUP BY service, day
  UNION ALL
  SELECT 2 AS shard, service, date_trunc('day', log_time) AS day, COUNT(*)
  FROM shard2.logs
  WHERE log_time >= $1 AND log_time < $2
  GROUP BY service, day
  ...
) x
GROUP BY shard, service, day;

Here, each foreign server returns a small set of aggregated rows instead of raw logs. The outer aggregation sums across shards. This pattern generalizes: push grouping and filtering to the remote side, then combine locally.

Combining remote and local data for analytics

Suppose you have a local table users and a remote table orders. You want to compute the average order amount per user segment. A naïve query might look like:

SELECT u.segment, AVG(o.amount)
FROM users u
JOIN orders o ON o.user_id = u.id
GROUP BY u.segment;

This joins a local table to a remote one, which typically becomes a nested loop driving one remote lookup per user (or a full fetch of orders). The better approach is to aggregate orders remotely by user_id and join on the small result:

WITH remote_totals AS (
  SELECT user_id, SUM(amount) AS total, COUNT(*) AS n
  FROM orders
  GROUP BY user_id
)
SELECT u.segment, SUM(rt.total) / SUM(rt.n) AS avg_amount
FROM users u
JOIN remote_totals rt ON u.id = rt.user_id
GROUP BY u.segment;

This pushes the heavy aggregation to the remote and transfers only one row per user. The local join then groups by segment. As with other examples, the key is to reduce remote rows before they cross the network.

Avoiding pushdown for correctness

There are legitimate cases where you should prevent pushdown because of semantic differences. Postgres lets you do this by adding OFFSET 0 or by wrapping the foreign table in a MATERIALIZED CTE (since Postgres 12, plain CTEs are inlined by default and may still be pushed down).

For example, if a built‑in function behaves differently on the remote due to a version mismatch, you can force local evaluation:

WITH local_eval AS MATERIALIZED (   -- materialization prevents pushdown
  SELECT * FROM remote_table
)
SELECT *
FROM local_eval
WHERE some_complex_expression(local_eval.col) > 0;

Alternatively, a WHERE clause like random() < 0.1 will not push down because random() is volatile – you don't need to force it. But adding OFFSET 0 is a simple hack that prevents any pushdown:

SELECT * FROM remote_table OFFSET 0;

Knowing how to disable pushdown intentionally helps you debug. If a query returns different results when pushdown occurs, suspect type/collation mismatches or remote session settings [4].

Monitoring, Diagnostics, and Regression Testing

Monitoring doesn't end at counting remote rows. To make pushdown reliable in production, you need to set up mechanisms to detect regressions and gather evidence when performance changes.

Automate EXPLAIN regression tests

In addition to unit tests and integration tests, you can add tests that assert the shape of your plans. For instance, if a mission‑critical report must always push down a WHERE clause, you can write a test that runs EXPLAIN (VERBOSE) and checks that the Remote SQL contains the filter. You might even parse the loops count from EXPLAIN (ANALYZE) output and assert that it is 1. When a developer inadvertently adds a non‑immutable function or changes a join, the test will fail. This is akin to snapshot testing for SQL.
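
As a rough illustration, such a check can live entirely in SQL. This sketch assumes a foreign table named remote_events and a hypothetical status filter; it raises an error if the filter does not appear in the Remote SQL:

DO $$
DECLARE
  line      text;
  plan_text text := '';
BEGIN
  -- Collect the textual plan of the query under test.
  FOR line IN EXECUTE
    'EXPLAIN (VERBOSE) SELECT id FROM remote_events WHERE status = ''error'''
  LOOP
    plan_text := plan_text || line || E'\n';
  END LOOP;

  -- Fail loudly if the predicate was not shipped to the remote server.
  IF plan_text !~ 'Remote SQL:.*WHERE.*status' THEN
    RAISE EXCEPTION 'Pushdown regression, filter missing from Remote SQL: %', plan_text;
  END IF;
END $$;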

Monitor pg_stat_statements across servers

Enable pg_stat_statements on both the local and remote servers. On the local side, track the total time, planning time, and rows for each FDW query. On the remote side, track which queries are being executed.

Look for outliers: a query whose remote calls spike or whose average remote rows jump from hundreds to thousands. Those are early signs of pushdown failure.
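
A starting point on the remote side might look like the sketch below. The role name is an assumption (use whatever role your user mapping connects as), and on Postgres 12 or earlier the column is total_time rather than total_exec_time:

SELECT query,
       calls,
       rows,
       total_exec_time / NULLIF(calls, 0) AS avg_ms
FROM pg_stat_statements
WHERE userid = 'fdw_reader'::regrole
ORDER BY calls DESC
LIMIT 20;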

Log remote SQL with auto_explain

Setting auto_explain.log_min_duration (for example, to 500ms) causes Postgres to automatically log slow queries with their plans. Combine this with auto_explain.log_verbose = on and auto_explain.log_nested_statements = on to capture remote SQL as well. When a federated query slows down, the log will show you exactly what remote SQL was executed and how often. This is invaluable in production, where you cannot always run EXPLAIN interactively.
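
A minimal configuration sketch, assuming auto_explain is already loaded via shared_preload_libraries (or session_preload_libraries):

ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';
ALTER SYSTEM SET auto_explain.log_verbose = on;
ALTER SYSTEM SET auto_explain.log_nested_statements = on;
SELECT pg_reload_conf();   -- apply without a restart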

Use connection pooling and prepare statements

postgres_fdw maintains a connection pool keyed on the user mapping. It reuses connections between queries, but you can also use connection pooling at the network level (for example, pgbouncer or pgcat).

Keeping connections warm reduces the startup cost, as captured by fdw_startup_cost. Meanwhile, preparing statements on the remote server (via PREPARE and EXECUTE) can save parse time when the same remote SQL is executed frequently. postgres_fdw can use server‑side prepared statements for parameterized scans.

Regression testing after version upgrades

Every major Postgres release brings improvements to postgres_fdw pushdown semantics. But new releases also change planner heuristics and remote SQL generation. After an upgrade, rerun your key queries with EXPLAIN (VERBOSE), compare the Remote SQL, and benchmark them.

In some cases, a release may push down something previously local, revealing a latent type mismatch or a function difference. In other cases, pushdown may be withheld due to a new rule. Don’t assume that an upgrade automatically improves performance – test it.

Extended Guidelines for Advanced DBAs

To close this handbook, here are consolidated guidelines distilled from the previous sections. They go beyond simple bullet points to capture nuances. Keep them handy for reference or print them out for your team.

  1. Respect the FDW safety model. Immutable functions and built‑in operators are your friends. Anything outside that scope must be explicitly allowed or evaluated locally. Understand which items belong to each category and plan accordingly.

  2. Always read the Remote SQL. Don’t trust your intuition about what is being pushed down. The Remote SQL string is the only source of truth. It indicates whether a predicate, join, sort, or limit operation is occurring remotely. It also shows parameter placeholders (for example, $1) that correspond to values passed from the local plan.

  3. Reduce before you fetch. The network is the highest cost. If the remote can reduce rows through filtering, grouping, or limiting, let it. If it cannot, structure your query to enable it. Avoid queries that require pulling large raw tables and processing them locally.

  4. Beware of join order. The planner sometimes chooses a nested loop with a foreign table as the inner side, resulting in repeated remote calls. Examine loops: if you see a high number, consider rewriting the query or adjusting cost parameters.

  5. Use CTEs strategically. A CTE can isolate remote scans and let you control whether they are materialized once or inlined. Use MATERIALIZED to avoid repeated remote scans when a CTE is referenced multiple times. Use NOT MATERIALIZED to allow optimizations across CTE boundaries.

  6. Instrument, monitor, iterate. Good FDW performance is not a one‑off fix. Monitor queries and plans. Use tests to catch regressions. Adjust tuning knobs and indexes as your data or workload changes. Document your reasoning so others can understand why a particular plan is expected.

  7. Educate your team. Federated queries invite subtle bugs and performance traps. Share the high‑level rules – immutable functions only, cross‑server joins are local, always check remote SQL – so engineers write safer queries by default. A 30‑minute training can save hours of debugging later.

Bringing it All Together

This handbook has covered a lot of ground: from the high‑level principle that pushdown is about data movement, to the nitty‑gritty of join conditions and tuning knobs, to troubleshooting steps and case studies. It is intentionally opinionated and personal: these are the patterns and pitfalls encountered in real systems, not abstract guidelines. By sharing specific examples, I hope the rules become memorable and their interplay with actual workloads becomes clear.

The goal is not just to tell you what to do, but to show you how to think and problem solve: review the plan, trace data movement, and determine whether the query is doing the heavy work in the right place.

That thinking process, practiced enough times, becomes second nature. When you write a new query, you'll automatically consider whether your predicates are immutable, whether the join can be shipped, and whether you are about to trigger an N+1 pattern. When you review plans, you'll start from the Foreign Scan nodes and remote SQL, not the top‑level node. When you tune, you'll know which knobs to twist and in which order.

Keep experimenting. Use the examples here as starting points. Try different structures in a test environment and measure the difference. The more you play with pushdown, the more comfortable you'll become with its constraints and superpowers.

If this handbook helps you avoid one performance incident or saves you from shipping a broken query, it has done its job. Enjoy exploring the federated world of Postgres.

References

[1] [2] [3] [4] [5] [6] [9] [10] PostgreSQL Documentation: F.38. postgres_fdw – access data stored in external PostgreSQL servers (https://www.postgresql.org/docs/current/postgres-fdw.html)

[7] PostgreSQL 10.0 Release Notes (https://www.postgresql.org/docs/release/10.0/)

[8] PostgreSQL 12.0 Release Notes (https://www.postgresql.org/docs/release/12.0/)




Microsoft C++ (MSVC) Build Tools v14.51 Preview Released: How to Opt In


Today we are releasing the first preview of the Microsoft C++ (MSVC) Build Tools version 14.51. This update, shipping in the latest Visual Studio 2026 version 18.4 Insiders release, introduces many C++23 conformance changes, bug fixes, and runtime performance improvements. Check out the release notes for an in-progress list of what’s new. Conformance improvements and bug fixes will be detailed in an upcoming blog post and Insiders release notes in the near future.

We plan to ship more frequent, incremental MSVC Build Tools previews, just as we are shipping more frequent IDE updates. As a result, we have adjusted the process for enabling and using MSVC previews, and this post describes the new process.

We encourage you to explore MSVC previews to adapt to breaking changes and report issues early. MSVC previews do not receive servicing patches and thus should not be used in production environments.

How to opt in

Visual Studio 2026 has changed the process for opting in to MSVC Build Tools previews. Most Visual Studio updates will include fresh MSVC previews, bringing compiler changes to you far faster than ever before. These updates will occur more frequently in the Insiders channel. Soon, you will also be able to install MSVC previews from the Stable channel, though these will be less recent than the builds available in Insiders.

Installing MSVC previews

To install MSVC v14.51 Preview, you must select one or both of these components in the Visual Studio installer depending on what architectures you are targeting for your builds:

  1. MSVC Build Tools for x64/x86 (Preview)
  2. MSVC Build Tools for ARM64/ARM64EC (Preview)

You can install these from the Workloads tab under Desktop development with C++ or from the Individual components tab.

How to install MSVC v14.51 Preview from the C++ desktop workload

Desktop development with C++ workload includes checkboxes for installing MSVC previews on the right side pane.

MSVC v14.51 components under “Individual components”

The Individual components page in the Visual Studio installer includes preview MSVC compilers as well as support libraries and frameworks.

An easy way to find the relevant components under Individual components is to search for “preview”. Here you will also find support libraries and frameworks like MFC, ATL, C++/CLI, and Spectre-mitigated libraries compatible with this MSVC preview.

The components are the same as stable MSVC releases, except they are marked with “(MSVC Preview)” rather than “(Latest)” or a specific version number. Whenever you update Visual Studio, your MSVC preview will also be updated to the latest available build in that installer channel. Preview MSVC builds are not designed with version pinning in mind and do not receive servicing updates, though you can always download fresh builds as you update the IDE.

If you only want to build in the command line, you can also install MSVC v14.51 Preview using the Build Tools for Visual Studio 2026, by selecting the same checkboxes.

Configuring Command Prompts

You can configure MSVC Preview command-line builds by navigating to this path and running the appropriate vcvars for your desired environment:

cmd.exe example for x64 builds:

cd "C:\Program Files\Microsoft Visual Studio\18\Insiders\VC\Auxiliary\Build"
.\vcvars64.bat -vcvars_ver=Preview

A command prompt window is displayed, indicating the initialization of a Visual Studio 2026 developer command prompt with the x64 environment.

Configuring MSBuild Projects

For MSBuild projects, you must enable MSVC preview builds in the project system by setting the new Use MSVC Build Tools Preview property to “Yes” and making sure the MSVC Build Tools Version property is set to “Latest supported”. If MSVC Build Tools Version is set to something other than “Latest supported”, that MSVC version will be used for builds instead. If you wish to switch back to a stable MSVC build, you should set Use MSVC Build Tools Preview to “No”.

Instructions – Enabling MSVC previews in MSBuild projects

First, right-click the project you want to modify in Solution Explorer, select Properties.

Next, make sure your Configuration and Platform at the top are set to what you want to modify.

Under the General tab (open by default), set Use MSVC Build Tools Preview to “Yes”.

To enable MSVC previews in the MSBuild project system, set MSVC Build Tools version to Latest supported and Use MSVC Build Tools Preview to Yes.

Make sure the MSVC Build Tools Version property is set to “Latest supported”, or else your project will build with the version specified there instead.

Lastly, run a build to make sure it works. Your project will now build using the latest preview tools.

Note: For command-line builds, you can also set the new property by running:

msbuild <project_or_solution_file> /p:MSVCPreviewEnabled=true

Configuring CMake Projects

For CMake projects, you should specify the MSVC version in a CMakePresets.json file under the toolset property. The same process applies regardless of what version of MSVC you want to use (and whether it’s a Preview or not).

Instructions – Enabling MSVC previews in CMake projects

First, open your CMake project in Visual Studio. Ensure your workspace has a CMakePresets.json file in the root directory. See Configure and build with CMake Presets | Microsoft Learn if you need help configuring a CMakePresets file.

Then, add a base preset under configurePresets that specifies MSVC v14.51:

{
    "name": "windows-msvc-v1451-base",
    "description": "Base preset for MSVC v14.51",
    "hidden": true,
    "inherits": "windows-base",
    "toolset": {
        "value": "v145,host=x64,version=14.51"
    }
}

Next, add more specific presets for each individual architecture, e.g.:

{
    "name": "x64-debug-msvc-v1451-preview",
    "displayName": "x64 Debug (MSVC v14.51 Preview)",
    "inherits": "windows-msvc-v1451-base",
    "architecture": {
        "value": "x64",
        "strategy": "external"
    },
    "cacheVariables": {
        "CMAKE_BUILD_TYPE": "Debug"
    }
}

Next, select the new build configuration from the list of targets beside the Play button at the top of the IDE.

Lastly, run a build to make sure it works. You can create additional presets the same way for other MSVC versions to easily swap between them.

Known issues

There are several known issues that will be fixed in a future MSVC Build Tools Preview and/or Visual Studio Insiders release.

CMake targets using Visual Studio generator

There is a bug configuring CMake targets using the Visual Studio (MSBuild) generator. A workaround is described below.

First, open Developer Command Prompt for VS Insiders (or the prompt for the version of Visual Studio you are using) as an administrator.

Then, run the following commands, which create a new folder and copy two files from another location into it:

pushd %VCINSTALLDIR%\Auxiliary\Build
mkdir 14.51
copy .\v145\Microsoft.VCToolsVersion.VC.14.51.props .\14.51\Microsoft.VCToolsVersion.14.51.props
copy .\v145\Microsoft.VCToolsVersion.VC.14.51.txt .\14.51\Microsoft.VCToolsVersion.14.51.txt

Lastly, run a build to make sure it works.

Command-line builds using PowerShell

Command line builds in PowerShell (including via Launch-VsDevShell.ps1) are not yet configured for the preview.

C++ CMake tools for Windows dependency on latest stable MSVC

If you are using the CMake tools in Visual Studio, their installer component still has a dependency on the latest stable version of MSVC. Therefore you will need to install both latest stable and latest preview MSVC Build Tools until we correct this dependency relationship.

Try out MSVC v14.51 Preview in Visual Studio 2026!

We encourage you to try out Visual Studio 2026 version 18.4 on the Insiders Channel, along with MSVC version 14.51 Preview. For MSVC, your feedback can help us address any bugs and improve build and runtime performance. Submit feedback using the Help > Send Feedback menu from the IDE, or by navigating directly to Visual Studio Developer Community.

The post Microsoft C++ (MSVC) Build Tools v14.51 Preview Released: How to Opt In appeared first on C++ Team Blog.
