Content Developer II at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

What is Quantization? Quantizing LLMs | Exxact Blog


Deep Learning

Quantization and LLMs - Condensing Models to Manageable Sizes

February 15, 2024 · 8 min read

The Scale and Complexity of LLMs

The incredible abilities of LLMs are powered by their vast neural networks which are made up of billions of parameters. These parameters are the result of training on extensive text corpora and are fine-tuned to make the models as accurate and versatile as possible. This level of complexity requires significant computational power for processing and storage.

[Figure: GPT-4 parameters vs. traditional language models' parameter sizes]

The accompanying bar graph shows the number of parameters across different scales of language models. Moving from smaller to larger models, we see a significant increase: 'small' language models sit in the modest millions of parameters, while 'large' models reach into the tens of billions.

However, it is GPT-4, with 175 billion parameters, that dwarfs the other models' parameter counts. Not only does GPT-4 use the most parameters of any model on the graph, it also powers the most recognizable generative AI application, ChatGPT. Its towering presence is representative of other LLMs of its class, and it illustrates the requirements needed to power the next generation of AI chatbots, as well as the processing power required to support such advanced AI systems.

Fueling Innovation with an Exxact Multi-GPU Server

Training AI models on massive datasets can be accelerated exponentially with the right system. It's not just a high-performance computer, but a tool to propel and accelerate your research.

Configure Now

The Cost of Running LLMs and Quantization

Deploying and operating complex models can get costly because they require either cloud computing on specialized hardware, such as high-end GPUs and AI accelerators, or equivalent on-premises infrastructure, along with continuous energy consumption. Choosing an on-premises solution can save a great deal of money and adds flexibility in hardware choices and the freedom to use the system wherever it is needed, with the trade-off of maintenance and the need to employ skilled staff. High costs can make it challenging for small businesses to train and run an advanced AI. Here is where quantization comes in handy.

What is Quantization?

Quantization is a technique that reduces the numerical precision of each parameter in a model, thereby decreasing its memory footprint. This is akin to compressing a high-resolution image to a lower resolution: the essence and most important details are retained, but at a reduced data size. This approach enables the deployment of LLMs on less powerful hardware without substantial performance loss.

ChatGPT was trained and is deployed using thousands of NVIDIA DGX systems, millions of dollars of hardware, and tens of thousands of dollars more in infrastructure. Quantization can enable solid proof-of-concept deployments, or even fully fledged production deployments, on less spectacular (but still high-performance) hardware.

In the sections to follow, we will dissect the concept of quantization, its methodologies, and its significance in bridging the gap between the highly resource-intensive nature of LLMs and the practicalities of everyday technology use. The transformative power of LLMs can become a staple in smaller-scale applications, offering vast benefits to a broader audience.

Basics of Quantization

Quantizing a large language model refers to the process of reducing the precision of numerical values used in the model. In the context of neural networks and deep learning models, including large language models, numerical values are typically represented as floating-point numbers with high precision (e.g., 32-bit or 16-bit floating-point format). Read more about Floating Point Precision here.

Quantization addresses this by converting those high-precision floating-point numbers into lower-precision representations, such as 16-bit or 8-bit integers, trading some precision for a model that is more memory-efficient and faster during both training and inference. As a result, training and inference require less storage, consume less memory, and can be executed more quickly on hardware that supports lower-precision computations.
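
To make the arithmetic concrete, here is a minimal, self-contained NumPy sketch of the idea. It is an illustration rather than any particular library's implementation; the tensor shape and the affine (scale plus zero-point) scheme are assumptions chosen for clarity.

import numpy as np

# Hypothetical float32 "weights" standing in for one tensor of an LLM layer.
weights = np.random.randn(4096, 4096).astype(np.float32)

# Affine quantization to int8: map [min, max] onto the integer range [-128, 127].
qmin, qmax = -128, 127
w_min, w_max = weights.min(), weights.max()
scale = (w_max - w_min) / (qmax - qmin)          # float step size per integer level
zero_point = int(round(qmin - w_min / scale))    # integer that represents 0.0

q_weights = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.int8)

# Dequantize to approximate the original values at inference time.
deq_weights = (q_weights.astype(np.float32) - zero_point) * scale

print(f"float32 size: {weights.nbytes / 1e6:.1f} MB")    # ~67.1 MB
print(f"int8 size:    {q_weights.nbytes / 1e6:.1f} MB")  # ~16.8 MB (4x smaller)
print(f"mean abs error: {np.abs(weights - deq_weights).mean():.5f}")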

Types of Quantization

To add depth and complexity to the topic, it is critical to understand that quantization can be applied at various stages in the lifecycle of a model's development and deployment. Each method has its distinct advantages and trade-offs and is selected based on the specific requirements and constraints of the use case. 

1. Static Quantization

Static quantization is a technique in which the weights and activations are quantized to a lower bit precision ahead of time and applied to all layers. Both weights and activations are quantized before deployment and remain fixed throughout. Static quantization is a good fit when the memory requirements of the system the model will be deployed to are known in advance (a short code sketch follows the pros and cons below).

  • Pros of Static Quantization
    • Simplifies deployment planning as the quantization parameters are fixed.
    • Reduces model size, making it more suitable for edge devices and real-time applications.
  • Cons of Static Quantization
    • Performance drops are predictable but can be larger in places: because one broad, static scheme is applied to every layer, precision-sensitive parts of the model may suffer more.
    • Limited adaptability to varying input patterns and less robust updates to weights.
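
As a rough illustration of this ahead-of-time workflow, the sketch below uses PyTorch's eager-mode quantization API to fix weight and activation quantization parameters from a calibration pass before deployment. The tiny model, the random calibration data, and the "fbgemm" backend are placeholder assumptions, not the article's own code; consult the framework documentation for production use.

import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig, prepare, convert, QuantStub, DeQuantStub

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # converts float inputs to int8 at runtime
        self.fc1 = nn.Linear(128, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)
        self.dequant = DeQuantStub()  # converts int8 outputs back to float

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model_fp32 = TinyModel().eval()
model_fp32.qconfig = get_default_qconfig("fbgemm")   # weight and activation observers
prepared = prepare(model_fp32)                       # insert observers into the model

# Calibration: run representative data so activation ranges can be fixed ahead of time.
with torch.no_grad():
    for _ in range(32):
        prepared(torch.randn(8, 128))

model_int8 = convert(prepared)                       # freeze scales/zero-points, swap in int8 ops
print(model_int8)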

2. Dynamic Quantization

Dynamic quantization quantizes the weights statically, ahead of time, while the activations are quantized on the fly as data passes through the network during inference. This means that certain parts of the model run at different precisions rather than defaulting to a single fixed quantization.

  • Pros of Dynamic Quantization
    • Balances model compression and runtime efficiency without significant drop in accuracy.
    • Useful for models where activation precision is more critical than weight precision.
  • Cons of Dynamic Quantization
    • Performance improvements aren’t predictable compared to static methods (but this isn’t necessarily a bad thing).
    • Dynamic calculation means more computational overhead and longer training and inference times than the other methods, though it is still lighter weight than running without quantization.
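
For comparison, here is a minimal sketch of dynamic quantization using PyTorch's quantize_dynamic; the stand-in model and shapes are assumptions for illustration only. Weights are converted to int8 up front, while activation ranges are observed at inference time.

import torch
from torch import nn
from torch.ao.quantization import quantize_dynamic

# Stand-in for a transformer block's feed-forward layers.
model_fp32 = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).eval()

# Weights of nn.Linear modules are quantized to int8 now;
# activation scales are computed dynamically at inference time.
model_int8 = quantize_dynamic(model_fp32, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
print(model_int8(x).shape)   # torch.Size([1, 1024])
print(model_int8)            # Linear layers replaced by dynamically quantized versions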

3. Post-Training Quantization (PTQ)

In this technique, quantization is applied after the model has been fully trained. It involves analyzing the distribution of weights and activations and then mapping these values to a lower bit depth. PTQ is commonly used for deployments on resource-constrained devices like edge devices and mobile phones, and it can be either static or dynamic.

  • Pros of PTQ
    • Can be applied directly to a pre-trained model without the need for retraining.
    • Reduces the model size and decreases memory requirements.
    • Improved inference speeds enabling faster computations during and after deployment.
  • Cons of PTQ
    • Potential loss in model accuracy due to the approximation of weights.
    • Requires careful calibration and fine tuning to mitigate quantization errors.
    • May not be optimal for all types of models, particularly those sensitive to weight precision.

4. Quantization Aware Training (QAT)

During training, the model is aware of the quantization operations that will be applied during inference, and its parameters are adjusted accordingly. This allows the model to learn to handle quantization-induced errors (see the sketch after the list below).

  • Pros of QAT
    • Tends to preserve model accuracy better than PTQ, since training accounts for quantization errors.
    • More robust for models sensitive to precision and performs better at inference even at lower precisions.
  • Cons of QAT
    • Requires retraining the model resulting in longer training times.
    • More computationally intensive since it incorporates quantization error checking.
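
A hedged sketch of what a QAT loop can look like in PyTorch's eager-mode API follows; the toy model, random data, and short training loop are placeholders, and a real run would fine-tune the actual pre-trained model on its own data.

import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat, convert, QuantStub, DeQuantStub

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant, self.dequant = QuantStub(), DeQuantStub()
        self.fc = nn.Linear(128, 10)

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().train()
model.qconfig = get_default_qat_qconfig("fbgemm")
model_prepared = prepare_qat(model)              # inserts fake-quantization modules

# Fine-tune with fake quantization so the weights adapt to int8 rounding errors.
opt = torch.optim.SGD(model_prepared.parameters(), lr=1e-3)
for _ in range(100):
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model_prepared(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

model_int8 = convert(model_prepared.eval())      # produce the final int8 model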

5. Binary and Ternary Quantization

These methods quantize the weights to either two values (binary) or three values (ternary), representing the most extreme form of quantization. Weights are constrained to +1, -1 for binary, or +1, 0, -1 for ternary quantization during or after training. This would drastically reduce the number of possible quantization weight values while still being somewhat dynamic.

  • Pros of Binary and Ternary Quantization
    • Maximizes model compression and inferencing speed and has minimal memory requirements.
    • Fast inferencing and quantization calculations enable use on underpowered hardware.
  • Cons of Binary and Ternary Quantization
    • High compression and reduced precision result in a significant drop in accuracy.
    • Not suitable for all types of tasks or datasets and struggles with complex tasks.
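
To illustrate just how aggressive this is, the NumPy sketch below binarizes and ternarizes a random weight matrix using a single per-tensor scaling factor; the threshold heuristic is an assumption for illustration, not a specific published scheme.

import numpy as np

w = np.random.randn(512, 512).astype(np.float32)

# Binary quantization: keep only the sign, plus one per-tensor scale alpha.
alpha_bin = np.abs(w).mean()
w_binary = alpha_bin * np.sign(w)                         # values in {-alpha, +alpha}

# Ternary quantization: zero out small weights, keep the sign of the rest.
threshold = 0.7 * np.abs(w).mean()                        # illustrative threshold
mask = np.abs(w) > threshold
alpha_ter = np.abs(w[mask]).mean() if mask.any() else 0.0
w_ternary = np.where(mask, alpha_ter * np.sign(w), 0.0)   # values in {-alpha, 0, +alpha}

print("distinct weight levels (binary): ", np.unique(w_binary).size)
print("binary approximation error:      ", np.abs(w - w_binary).mean())
print("ternary approximation error:     ", np.abs(w - w_ternary).mean())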

The Benefits & Challenges of Quantization

[Figure: model size before and after quantization]

The quantization of Large Language Models brings multiple operational benefits. Primarily, it achieves a significant reduction in the memory requirements of these models: the goal is for a quantized model's memory footprint to be notably smaller than the original's. That efficiency permits deployment on platforms with more modest memory capabilities, and the lower processing power needed to run a quantized model translates directly into higher inference speeds and quicker response times that enhance the user experience.

On the other hand, quantization can also introduce some loss in model accuracy since it involves approximating real numbers. The challenge is to quantize the model without significantly affecting its performance. You can gauge effectiveness, efficiency, and accuracy by testing your model's accuracy and time to completion before and after quantization, as sketched below.
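
One simple way to run that before-and-after comparison, sketched here with PyTorch and reusing the model_fp32/model_int8 pair from the dynamic quantization sketch above, is to record checkpoint size and average latency for both versions; accuracy or perplexity on a held-out set should be compared in the same way.

import os
import time
import torch

def checkpoint_size_mb(model, path="tmp_model.pt"):
    # Serialize the state dict and measure it on disk.
    torch.save(model.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

def avg_latency_ms(model, example, runs=50):
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(example)
    return (time.perf_counter() - start) / runs * 1e3

example = torch.randn(1, 1024)
for name, m in [("fp32", model_fp32), ("int8", model_int8)]:
    print(name, f"{checkpoint_size_mb(m):.1f} MB", f"{avg_latency_ms(m, example):.2f} ms")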

By optimizing the balance between performance and resource consumption, quantization not only broadens the accessibility of LLMs but also contributes to more sustainable computing practices.

We're Here to Deliver the Tools to Power Your Research

With access to the highest performing hardware, at Exxact, we can offer the platform optimized for your deployment, budget, and desired performance so you can make an impact with your research!

Talk to an Engineer Today




Exploring Visual Studio Code as a PowerShell ISE Alternative

Microsoft has deprecated PowerShell ISE, leading users to consider alternatives such as Visual Studio Code, which provides features like syntax highlighting and an ISE mode.


Attackers exploiting new critical OpenMetadata vulnerabilities on Kubernetes clusters


Attackers are constantly seeking new vulnerabilities to compromise Kubernetes environments. Microsoft recently uncovered an attack that exploits new critical vulnerabilities in OpenMetadata to gain access to Kubernetes workloads and leverage them for cryptomining activity.

OpenMetadata is an open-source platform designed to manage metadata across various data sources. It serves as a central repository for metadata lineage, allowing users to discover, understand, and govern their data. On March 15, 2024, several vulnerabilities in the OpenMetadata platform were published. These vulnerabilities (CVE-2024-28255, CVE-2024-28847, CVE-2024-28253, CVE-2024-28848, CVE-2024-28254), affecting versions prior to 1.3.1, could be exploited by attackers to bypass authentication and achieve remote code execution. Since the beginning of April, we have observed exploitation of these vulnerabilities in Kubernetes environments.

Microsoft highly recommends that customers check clusters that run an OpenMetadata workload and make sure that the image is up to date (version 1.3.1 or later). In this blog, we share our analysis of the attack, provide guidance for identifying vulnerable clusters and using Microsoft security solutions like Microsoft Defender for Cloud to detect malicious activity, and share indicators of compromise that defenders can use for hunting and investigation.

Attack flow

For initial access, the attackers likely identify and target Kubernetes workloads of OpenMetadata exposed to the internet. Once they identify a vulnerable version of the application, the attackers exploit the mentioned vulnerabilities to gain code execution on the container running the vulnerable OpenMetadata image.

After establishing a foothold, the attackers attempt to validate their successful intrusion and assess their level of control over the compromised system. This reconnaissance step often involves contacting a publicly available service. In this specific attack, the attackers send ping requests to domains that end with oast[.]me and oast[.]pro, which are associated with Interactsh, an open-source tool for detecting out-of-band interactions.

OAST domains are publicly resolvable yet unique, allowing attackers to determine network connectivity from the compromised system to attacker infrastructure without generating suspicious outbound traffic that might trigger security alerts. This technique is particularly useful for attackers to confirm successful exploitation and validate their connectivity with the victim, before establishing a command-and-control (C2) channel and deploying malicious payloads.

After gaining initial access, the attackers run a series of reconnaissance commands to gather information about the victim environment. The attackers query information on the network and hardware configuration, OS version, active users, etc.

As part of the reconnaissance phase, the attackers read the environment variables of the workload. In the case of OpenMetadata, those variables might contain connection strings and credentials for various services used for OpenMetadata operation, which could lead to lateral movement to additional resources.

Once the attackers confirm their access and validate connectivity, they proceed to download the payload, a cryptomining-related malware, from a remote server. We observed the attackers using a remote server located in China. The attacker's server hosts additional cryptomining-related malware for both Linux and Windows.

Figure 1. Additional cryptomining-related malware on the attacker's server

The downloaded file’s permissions are then elevated to grant execution privileges. The attacker also added a personal note to the victims:

Figure 2. Note from the attacker

Next, the attackers run the downloaded cryptomining-related malware, and then remove the initial payloads from the workload. Lastly, for hands-on-keyboard activity, the attackers initiate a reverse shell connection to their remote server using the Netcat tool, allowing them to remotely access the container and gain better control over the system. Additionally, for persistence, the attackers use cron jobs for task scheduling, enabling the execution of the malicious code at predetermined intervals.

How to check if your cluster is vulnerable

Administrators who run an OpenMetadata workload in their cluster need to make sure that the image is up to date. If OpenMetadata must be exposed to the internet, make sure you use strong authentication and avoid the default credentials.

To get a list of all the images running in the cluster:

kubectl get pods --all-namespaces -o=jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | grep 'openmetadata'

If there is a pod with a vulnerable image, make sure to update the image to the latest version.
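
If you prefer to script this check, the sketch below uses the official kubernetes Python client and the packaging library to flag OpenMetadata images whose tag is older than 1.3.1. It is a convenience sketch, not Microsoft guidance: it assumes image tags are plain version strings, and anything else still needs manual review.

from kubernetes import client, config
from packaging.version import InvalidVersion, Version

config.load_kube_config()                     # or config.load_incluster_config() inside a pod
pods = client.CoreV1Api().list_pod_for_all_namespaces()

for pod in pods.items:
    for container in pod.spec.containers:
        image = container.image
        if "openmetadata" not in image.lower():
            continue
        tag = image.rsplit(":", 1)[-1] if ":" in image else "latest"
        try:
            vulnerable = Version(tag.lstrip("v")) < Version("1.3.1")
        except InvalidVersion:
            vulnerable = None                 # non-version tag: review manually
        status = "VULNERABLE" if vulnerable else ("check manually" if vulnerable is None else "ok")
        print(f"{pod.metadata.namespace}/{pod.metadata.name}: {image} -> {status}")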

How Microsoft Defender for Cloud capabilities can help

This attack serves as a valuable reminder of why it’s crucial to stay compliant and run fully patched workloads in containerized environments. It also highlights the importance of a comprehensive security solution, as it can help detect malicious activity in the cluster when a new vulnerability is used in the attack. In this specific case, the attackers’ actions triggered Microsoft Defender for Containers alerts, identifying the malicious activity in the container. In the example below, Microsoft Defender for Containers alerted on an attempt to initiate a reverse shell from a container in a Kubernetes cluster, as happened in this attack:

Figure 3. Microsoft Defender for Containers alert for detection of a potential reverse shell

To prevent such attacks, Microsoft Defender for Containers provides agentless vulnerability assessment for Azure, AWS, and GCP, allowing you to identify vulnerable images in the environment, before the attack occurs.  Microsoft Defender Cloud Security Posture Management (CSPM) can help to prioritize the security issues according to their risk. For example, Microsoft Defender CSPM highlights vulnerable workloads exposed to the internet, allowing organizations to quickly remediate crucial threats.

Organizations can also monitor Kubernetes clusters using Microsoft Sentinel via Azure Kubernetes Service (AKS) solution for Sentinel, which enables detailed audit trail for user and system actions to identify malicious activity.

Indicators of compromise (IoCs)

Type                 IoC
Executable SHA-256   7c6f0bae1e588821bd5d66cd98f52b7005e054279748c2c851647097fa2ae2df
Executable SHA-256   19a63bd5d18f955c0de550f072534aa7a6a6cc6b78a24fea4cc6ce23011ea01d
Executable SHA-256   31cd1651752eae014c7ceaaf107f0bf8323b682ff5b24c683a683fdac7525bad
IP                   8[.]222[.]144[.]60
IP                   61[.]160[.]194[.]160
IP                   8[.]130[.]115[.]208

Hagai Ran Kestenberg, Security Researcher
Yossi Weizman, Senior Security Research Manager

Learn more

For the latest security research from the Microsoft Threat Intelligence community, check out the Microsoft Threat Intelligence Blog: https://aka.ms/threatintelblog.

To get notified about new publications and to join discussions on social media, follow us on LinkedIn at https://www.linkedin.com/showcase/microsoft-threat-intelligence, and on X (formerly Twitter) at https://twitter.com/MsftSecIntel.

To hear stories and insights from the Microsoft Threat Intelligence community about the ever-evolving threat landscape, listen to the Microsoft Threat Intelligence podcast: https://thecyberwire.com/podcasts/microsoft-threat-intelligence.

The post Attackers exploiting new critical OpenMetadata vulnerabilities on Kubernetes clusters appeared first on Microsoft Security Blog.


Why configuration is so complicated

Ben and Ryan explore why configuration is so complicated, the right to repair, the best programming languages for beginners, how AI is grading exams in Texas, Automattic’s $125M acquisition of Beeper, and why a major US city’s train system still relies on floppy disks. Plus: The unique challenge of keeping up with a field that’s changing as rapidly as GenAI.

Build Intelligent RAG For Multimodality and Complex Document Structure


The advent of Retrieval-Augmented Generation (RAG) models has been a significant milestone in the field of Natural Language Processing (NLP). These models combine the power of information retrieval with generative language models to produce answers that are not just accurate but also contextually enriched. However, as the digital universe expands beyond textual data, incorporating image understanding and hierarchical document structure analysis into RAG systems is becoming increasingly crucial. This article explores how these two elements can significantly enhance the capabilities of RAG models.

 

Understanding RAG Models

Before diving into the nuances of image understanding and document analysis, let’s briefly touch upon the essence of RAG models. These systems work by first retrieving relevant documents from a vast corpus and then using a generative model to synthesize information into a coherent response. The retrieval component ensures that the model has access to accurate and up-to-date information, while the generative component allows for the creation of human-like text.

 

Image Understanding and Structure Analysis

The Challenge

One of the most significant limitations of traditional RAG models is their inability to understand and interpret visual data. In a world where images accompany textual information ubiquitously, this represents a substantial gap in the model’s comprehension abilities. Documents are not just strings of text; they have structure — sections, subsections, paragraphs, and lists — all of which convey semantic importance. Traditional RAG models often overlook this hierarchical structure, potentially missing out on understanding the document’s full meaning.

The Solution

To bridge this gap, RAG models can be augmented with Computer Vision (CV) capabilities. This involves integrating image recognition and understanding modules that can analyze visual data, extract relevant information, and convert it into a textual format that the RAG model can process. Incorporating hierarchical document structure analysis involves teaching RAG models to recognize and interpret the underlying structure of documents.

 


 

Implementation

  • Visual Feature Extraction: Use pre-trained neural networks to identify objects, scenes, and activities in images.
  • Visual Semantics: Develop algorithms that can understand the context and semantics of the visual content.
  • Multimodal Data Fusion: Combine the extracted visual information with textual data to create a multimodal context for the RAG system.
  • Structure Recognition: Implement algorithms to identify different levels of hierarchy in documents, such as titles, headings, and bullet points.
  • Semantic Role Labeling: Assign semantic roles to different parts of the document, understanding the purpose of each section.
  • Structure-Aware Retrieval: Enhance the retrieval process by considering the hierarchical structure of documents, ensuring that the most relevant sections are used for generation.

 

In this blog, we will look at how to implement this using Azure Document Intelligence, LangChain, and Azure OpenAI.

Prerequisites

Before we implement this, we will need a few prerequisites:

  • GPT-4-Vision-Preview model deployed
  • GPT-4-1106-Preview model deployed
  • text-embedding-ada-002 model deployed
  • Azure Document Intelligence deployed

Once we have the above information, let's get started!

Let’s import the required libraries.

 

 

import os
from dotenv import load_dotenv

load_dotenv('azure.env')

from langchain import hub
from langchain_openai import AzureChatOpenAI
#from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from doc_intelligence import AzureAIDocumentIntelligenceLoader
from langchain_openai import AzureOpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch
from azure.ai.documentintelligence.models import DocumentAnalysisFeature

 

 

Now we are going to write a custom function on top of the LangChain document loader that helps us load the PDF document. The first thing we do is use Azure Document Intelligence, which has a beautiful feature for converting an image into Markdown format. Let's use that.

 

 

import logging
from typing import Any, Iterator, List, Optional
import os

from langchain_core.documents import Document
from langchain_community.document_loaders.base import BaseLoader
from langchain_community.document_loaders.base import BaseBlobParser
from langchain_community.document_loaders.blob_loaders import Blob

logger = logging.getLogger(__name__)


class AzureAIDocumentIntelligenceLoader(BaseLoader):
    """Loads a PDF with Azure Document Intelligence"""

    def __init__(
        self,
        api_endpoint: str,
        api_key: str,
        file_path: Optional[str] = None,
        url_path: Optional[str] = None,
        api_version: Optional[str] = None,
        api_model: str = "prebuilt-layout",
        mode: str = "markdown",
        *,
        analysis_features: Optional[List[str]] = None,
    ) -> None:
        """
        Initialize the object for file processing with Azure Document
        Intelligence (formerly Form Recognizer).

        This constructor initializes a AzureAIDocumentIntelligenceParser object
        to be used for parsing files using the Azure Document Intelligence API.
        The load method generates Documents whose content representations are
        determined by the mode parameter.

        Parameters:
        -----------
        api_endpoint: str
            The API endpoint to use for DocumentIntelligenceClient construction.
        api_key: str
            The API key to use for DocumentIntelligenceClient construction.
        file_path : Optional[str]
            The path to the file that needs to be loaded.
            Either file_path or url_path must be specified.
        url_path : Optional[str]
            The URL to the file that needs to be loaded.
            Either file_path or url_path must be specified.
        api_version: Optional[str]
            The API version for DocumentIntelligenceClient. Setting None to use
            the default value from `azure-ai-documentintelligence` package.
        api_model: str
            Unique document model name. Default value is "prebuilt-layout".
            Note that overriding this default value may result in unsupported
            behavior.
        mode: Optional[str]
            The type of content representation of the generated Documents.
            Use either "single", "page", or "markdown". Default value is "markdown".
        analysis_features: Optional[List[str]]
            List of optional analysis features, each feature should be passed as
            a str that conforms to the enum `DocumentAnalysisFeature` in
            `azure-ai-documentintelligence` package. Default value is None.

        Examples:
        ---------
        >>> obj = AzureAIDocumentIntelligenceLoader(
        ...     file_path="path/to/file",
        ...     api_endpoint="https://endpoint.azure.com",
        ...     api_key="APIKEY",
        ...     api_version="2023-10-31-preview",
        ...     api_model="prebuilt-layout",
        ...     mode="markdown"
        ... )
        """
        assert (
            file_path is not None or url_path is not None
        ), "file_path or url_path must be provided"
        self.file_path = file_path
        self.url_path = url_path

        self.parser = AzureAIDocumentIntelligenceParser(
            api_endpoint=api_endpoint,
            api_key=api_key,
            api_version=api_version,
            api_model=api_model,
            mode=mode,
            analysis_features=analysis_features,
        )

    def lazy_load(
        self,
    ) -> Iterator[Document]:
        """Lazy load given path as pages."""
        if self.file_path is not None:
            yield from self.parser.parse(self.file_path)
        else:
            yield from self.parser.parse_url(self.url_path)

 

 

Now let's define the document parser itself. This parser is intended to load and parse PDF files using Azure's Document Intelligence service (formerly known as Azure Form Recognizer), which uses machine learning models to extract text, key-value pairs, and tables from documents.

lazy_parse is a method that lazily parses a given file, meaning it starts processing the file and yields results as they become available rather than waiting for the entire file to be processed.
 
class AzureAIDocumentIntelligenceParser(BaseBlobParser):
    """Loads a PDF with Azure Document Intelligence (formerly Forms Recognizer)."""

    def __init__(
        self,
        api_endpoint: str,
        api_key: str,
        api_version: Optional[str] = None,
        api_model: str = "prebuilt-layout",
        mode: str = "markdown",
        analysis_features: Optional[List[str]] = None,
    ):
        from azure.ai.documentintelligence import DocumentIntelligenceClient
        from azure.ai.documentintelligence.models import DocumentAnalysisFeature
        from azure.core.credentials import AzureKeyCredential

        kwargs = {}
        if api_version is not None:
            kwargs["api_version"] = api_version

        if analysis_features is not None:
            _SUPPORTED_FEATURES = [
                DocumentAnalysisFeature.OCR_HIGH_RESOLUTION,
            ]
            analysis_features = [
                DocumentAnalysisFeature(feature) for feature in analysis_features
            ]
            if any(
                [feature not in _SUPPORTED_FEATURES for feature in analysis_features]
            ):
                logger.warning(
                    f"The current supported features are: "
                    f"{[f.value for f in _SUPPORTED_FEATURES]}. "
                    "Using other features may result in unexpected behavior."
                )

        self.client = DocumentIntelligenceClient(
            endpoint=api_endpoint,
            credential=AzureKeyCredential(api_key),
            headers={"x-ms-useragent": "langchain-parser/1.0.0"},
            features=analysis_features,
            **kwargs,
        )
        self.api_model = api_model
        self.mode = mode
        assert self.mode in ["single", "page", "markdown"]

    def _generate_docs_page(self, result: Any) -> Iterator[Document]:
        for p in result.pages:
            content = " ".join([line.content for line in p.lines])

            d = Document(
                page_content=content,
                metadata={
                    "page": p.page_number,
                },
            )
            yield d

    def _generate_docs_single(self, file_path: str, result: Any) -> Iterator[Document]:
        md_content = include_figure_in_md(file_path, result)
        yield Document(page_content=md_content, metadata={})

    def lazy_parse(self, file_path: str) -> Iterator[Document]:
        """Lazily parse the blob."""
        blob = Blob.from_path(file_path)
        with blob.as_bytes_io() as file_obj:
            poller = self.client.begin_analyze_document(
                self.api_model,
                file_obj,
                content_type="application/octet-stream",
                output_content_format="markdown" if self.mode == "markdown" else "text",
            )
            result = poller.result()

            if self.mode in ["single", "markdown"]:
                yield from self._generate_docs_single(file_path, result)
            elif self.mode in ["page"]:
                yield from self._generate_docs_page(result)
            else:
                raise ValueError(f"Invalid mode: {self.mode}")

    def parse_url(self, url: str) -> Iterator[Document]:
        from azure.ai.documentintelligence.models import AnalyzeDocumentRequest

        poller = self.client.begin_analyze_document(
            self.api_model,
            AnalyzeDocumentRequest(url_source=url),
            # content_type="application/octet-stream",
            output_content_format="markdown" if self.mode == "markdown" else "text",
        )
        result = poller.result()

        if self.mode in ["single", "markdown"]:
            yield from self._generate_docs_single(result)
        elif self.mode in ["page"]:
            yield from self._generate_docs_page(result)
        else:
            raise ValueError(f"Invalid mode: {self.mode}")

 

If you look at this LangChain document parser, you will see that I have included a method called include_figure_in_md. This method goes through the markdown content, looks for all figures, and replaces each figure with a description of it.

Before that, let's write some utility methods that can help crop an image out of a PDF or image document.

 

 

from PIL import Image
import fitz  # PyMuPDF
import mimetypes
import base64
from mimetypes import guess_type


# Function to encode a local image into data URL
def local_image_to_data_url(image_path):
    # Guess the MIME type of the image based on the file extension
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = 'application/octet-stream'  # Default MIME type if none is found

    # Read and encode the image file
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')

    # Construct the data URL
    return f"data:{mime_type};base64,{base64_encoded_data}"


def crop_image_from_image(image_path, page_number, bounding_box):
    """
    Crops an image based on a bounding box.

    :param image_path: Path to the image file.
    :param page_number: The page number of the image to crop (for TIFF format).
    :param bounding_box: A tuple of (left, upper, right, lower) coordinates for the bounding box.
    :return: A cropped image.
    :rtype: PIL.Image.Image
    """
    with Image.open(image_path) as img:
        if img.format == "TIFF":
            # Open the TIFF image
            img.seek(page_number)
            img = img.copy()

        # The bounding box is expected to be in the format (left, upper, right, lower).
        cropped_image = img.crop(bounding_box)
        return cropped_image


def crop_image_from_pdf_page(pdf_path, page_number, bounding_box):
    """
    Crops a region from a given page in a PDF and returns it as an image.

    :param pdf_path: Path to the PDF file.
    :param page_number: The page number to crop from (0-indexed).
    :param bounding_box: A tuple of (x0, y0, x1, y1) coordinates for the bounding box.
    :return: A PIL Image of the cropped area.
    """
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)

    # Cropping the page. The rect requires the coordinates in the format (x0, y0, x1, y1).
    bbx = [x * 72 for x in bounding_box]
    rect = fitz.Rect(bbx)
    pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), clip=rect)

    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

    doc.close()
    return img


def crop_image_from_file(file_path, page_number, bounding_box):
    """
    Crop an image from a file.

    Args:
        file_path (str): The path to the file.
        page_number (int): The page number (for PDF and TIFF files, 0-indexed).
        bounding_box (tuple): The bounding box coordinates in the format (x0, y0, x1, y1).

    Returns:
        A PIL Image of the cropped area.
    """
    mime_type = mimetypes.guess_type(file_path)[0]

    if mime_type == "application/pdf":
        return crop_image_from_pdf_page(file_path, page_number, bounding_box)
    else:
        return crop_image_from_image(file_path, page_number, bounding_box)

 

 

Next, we write a method that passes an image to the GPT-4 Vision model and gets back a description of that image.

 

 

from openai import AzureOpenAI

aoai_api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
aoai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
aoai_deployment_name = 'gpt-4-vision'     # your model deployment name for GPT-4V
aoai_api_version = '2024-02-15-preview'   # this might change in the future

MAX_TOKENS = 2000


def understand_image_with_gptv(image_path, caption):
    """
    Generates a description for an image using the GPT-4V model.

    Parameters:
    - api_base (str): The base URL of the API.
    - api_key (str): The API key for authentication.
    - deployment_name (str): The name of the deployment.
    - api_version (str): The version of the API.
    - image_path (str): The path to the image file.
    - caption (str): The caption for the image.

    Returns:
    - img_description (str): The generated description for the image.
    """
    client = AzureOpenAI(
        api_key=aoai_api_key,
        api_version=aoai_api_version,
        base_url=f"{aoai_api_base}/openai/deployments/{aoai_deployment_name}"
    )

    data_url = local_image_to_data_url(image_path)
    response = client.chat.completions.create(
        model=aoai_deployment_name,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {
                    "type": "text",
                    "text": f"Describe this image (note: it has image caption: {caption}):" if caption else "Describe this image:"
                },
                {
                    "type": "image_url",
                    "image_url": {"url": data_url}
                }
            ]}
        ],
        max_tokens=2000
    )
    img_description = response.choices[0].message.content
    return img_description
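
The parser above calls include_figure_in_md, which this post references but does not show. Purely as a sketch of the idea, assuming the AnalyzeResult from the preview API exposes a figures collection whose items carry spans (offsets into the returned content), an optional caption, and bounding_regions (as in recent azure-ai-documentintelligence previews), one possible implementation that crops each figure, describes it with GPT-4V, and splices the description back into the markdown could look like this:

def include_figure_in_md(file_path, result):
    """Replace each detected figure in the markdown output with a GPT-4V description.

    Assumes `result.figures` items expose `spans`, `caption`, and `bounding_regions`
    (azure-ai-documentintelligence preview models); adjust to the SDK version you use.
    """
    md_content = result.content
    replacements = []

    for idx, figure in enumerate(getattr(result, "figures", []) or []):
        caption = figure.caption.content if figure.caption else ""

        # Build a (left, top, right, bottom) box in page units from the bounding polygon.
        region = figure.bounding_regions[0]
        xs, ys = region.polygon[0::2], region.polygon[1::2]
        bounding_box = (min(xs), min(ys), max(xs), max(ys))

        # Crop the figure out of the source document and describe it with GPT-4V.
        cropped = crop_image_from_file(file_path, region.page_number - 1, bounding_box)
        image_path = f"figure_{idx}.png"
        cropped.save(image_path)
        description = understand_image_with_gptv(image_path, caption)

        span = figure.spans[0]
        replacements.append((span.offset, span.length, f"\n<!-- Figure {idx} -->\n{description}\n"))

    # Apply replacements back-to-front so earlier offsets stay valid.
    for offset, length, text in sorted(replacements, reverse=True):
        md_content = md_content[:offset] + text + md_content[offset + length:]

    return md_content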

 

 

Now that we have the utility methods set up, we can simply import the Document Intelligence loader and load the document.

 

 

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

loader = AzureAIDocumentIntelligenceLoader(
    file_path='sample.pdf',
    api_key=os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY"),
    api_endpoint=os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"),
    api_model="prebuilt-layout",
    api_version="2024-02-29-preview",
    mode='markdown',
    analysis_features=[DocumentAnalysisFeature.OCR_HIGH_RESOLUTION],
)
docs = loader.load()

 

 

Semantic chunking is a powerful technique used in natural language processing that involves breaking down large pieces of text into smaller, thematically consistent segments or “chunks” that are semantically coherent. The primary goal of semantic chunking is to capture and preserve the inherent meaning within the text, allowing each chunk to contain as much semantically independent information as possible. This process is critically important for various language model applications, such as embedding models and retrieval-augmented generation (RAG), because it helps overcome limitations associated with processing long sequences of text. By ensuring that the data fed into language models (LLMs) is thematically and contextually coherent, semantic chunking enhances the model’s ability to interpret and generate relevant and accurate responses.

 


 

 

Additionally, it improves the efficiency of information retrieval from vector databases by enabling the retrieval of highly relevant information that aligns closely with the user’s intent, thereby reducing noise and maintaining semantic integrity. In essence, semantic chunking serves as a bridge between large volumes of text data and the effective processing capabilities of advanced language models, making it a cornerstone of efficient and meaningful natural language understanding and generation.

 

Let's look at the Markdown header splitter, which splits the document based on its headers.

 

 

# Split the document into chunks based on markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
    ("####", "Header 4"),
    ("#####", "Header 5"),
    ("######", "Header 6"),
    ("#######", "Header 7"),
    ("########", "Header 8"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

docs_string = docs[0].page_content
docs_result = text_splitter.split_text(docs_string)

print("Length of splits: " + str(len(docs_result)))

 

 

Let's initialize both the Azure OpenAI GPT model and the Azure OpenAI embedding model.

 

 

from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import AzureChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = AzureChatOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-12-01-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    model="gpt-4-1106-preview",
    streaming=True,
)

aoai_embeddings = AzureOpenAIEmbeddings(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_deployment="text-embedding-ada-002",
    openai_api_version="2023-12-01-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

 

 

Now let's create an index and store the embeddings in Azure AI Search.

 

 

from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings, OpenAIEmbeddings

# vector_store_address and vector_store_password hold the Azure AI Search
# endpoint URL and admin key for your search service.
index_name: str = "langchain-vector-demo"
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=aoai_embeddings.embed_query,
)

 

 

Finally, let's create our RAG chain. Here I have used a simple retriever, but you can build a more complex retriever that returns both images and text.

 

 

from operator import itemgetter                 # needed by the chain below
from langchain_core.runnables import RunnableMap

# Assumes `prompt` is a chat prompt with `context` and `question` variables,
# and `index` is the vector store created above.

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

retriever_base = index.as_retriever(search_type="similarity", search_kwargs={"k": 5})

rag_chain_from_docs = (
    {
        "context": lambda input: format_docs(input["documents"]),
        "question": itemgetter("question"),
    }
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableMap(
    {"documents": retriever_base, "question": RunnablePassthrough()}
) | {
    "documents": lambda input: [doc.metadata for doc in input["documents"]],
    "answer": rag_chain_from_docs,
}
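
To run the chain end to end, you invoke it with a question; the question string below is only an example.

response = rag_chain_with_source.invoke("What does the plot in this document show?")
print(response["answer"])      # generated answer
print(response["documents"])   # source metadata used for citations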

 

 

 

Now let's put it into action: let's take the PDF example below and ask a question about a plot.

 

[Figure: sample PDF page containing the plot]

 

 

 

Here I ask a question about the plot on this page. As you can see, I get the correct response along with citations.

 

[Figure: model response with citations]

 

Hope you like the blog. Please clap and follow me if you would like to read more blogs like this.

 


Join the Microsoft Developers AI Learning Hackathon and Win Up to $10K in Prizes!




Join the Microsoft Developers AI Learning Hackathon for a chance to hone your AI skills, build an AI copilot with Azure Cosmos DB for MongoDB, and compete for up to $10,000 in prizes.

This global online event runs from April 15th to June 17th and welcomes developers of all levels. Register NOW. By participating, you'll get practical experience with essential database technologies that drive innovations like OpenAI's ChatGPT. Plus, you'll earn an exclusive badge to display your newfound AI expertise.

Don’t miss out on this opportunity to kickstart your AI journey on Azure. Register now and explore the capabilities of Azure Cosmos DB, the fully managed, serverless, distributed database designed for modern app development. With its SLA-backed performance, instant scalability, and support for popular APIs like PostgreSQL, MongoDB, and Apache Cassandra, Azure Cosmos DB is the perfect platform to power your AI applications. Try it for free and follow the latest updates to stay ahead in the world of app development.


What is the Microsoft Developers AI Learning Hackathon? 

The Microsoft Developers AI Learning Hackathon is a global online event where you can learn fundamental AI skills, build your own AI copilot with Azure Cosmos DB for MongoDB, and compete for up to $10K in prizes. The hackathon runs from April 15th to June 17th and is open to developers of all levels and backgrounds.

 

Why should I participate? 

Get hands-on with foundational database technologies, powering innovations like OpenAI’s ChatGPT. Use Azure Cosmos DB for MongoDB in real-world AI scenarios and compete for up to $10,000 in prizes! Plus, participants will earn an exclusive badge to showcase their new AI skills. Don't miss this chance to learn, build, and get started with your AI learning journey on Azure!  



About Azure Cosmos DB 
Azure Cosmos DB is a fully managed and serverless distributed database for modern app development, with SLA-backed speed and availability, automatic and instant scalability, and support for open-source PostgreSQL, MongoDB, and Apache Cassandra. Try Azure Cosmos DB for free here. To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn. 
