Content Developer II at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Bringing GenAI Offline: Running SLMs like Phi-2/Phi-3 and Whisper Models on Mobile Devices


In today's digitally interconnected landscape, language models stand at the forefront of technological innovation, reshaping the way we engage with various platforms and applications. These sophisticated algorithms have become indispensable tools in tasks ranging from text generation to natural language processing, driving efficiency and productivity across diverse sectors.

 

Yet, the reliance on cloud-based solutions presents a notable obstacle in certain contexts. In environments characterized by limited internet connectivity or stringent data privacy regulations, accessing cloud services may prove impractical or even impossible. This dependency on external servers introduces latency issues, security concerns, and operational challenges that hinder the seamless integration of language models into everyday workflows.

 

Enter the solution: running language models offline. By bringing the computational power of models like Phi-2/Phi-3 and Whisper directly to mobile devices, this approach circumvents the constraints of cloud reliance, empowering users to leverage advanced language processing capabilities irrespective of connectivity status.

 

In this blog, we delve into the significance of enabling offline capabilities for LLMs and explore the practicalities of running SLMs on mobile devices, offering insights into the transformative potential of this technology.

How are LLMs deployed today?

In a typical Large Language Model (LLM) deployment scenario, the LLM is hosted on a public cloud infrastructure such as Microsoft Azure using tools like Azure Machine Learning and exposed as an API endpoint. This API serves as the interface through which external applications, such as web applications and mobile apps on Android and iOS devices, interact with the LLM to perform natural language processing tasks. When a user initiates a request through the mobile app, the app sends the input data to the API endpoint and specifies the desired task, such as text generation or sentiment analysis.

[Figure: typical cloud-hosted LLM deployment (concept-deployment.png)]

 

The API processes the request, utilizing the LLM to perform the required task, and returns the result to the mobile app. This architecture enables seamless integration of LLM capabilities into mobile applications, allowing users to leverage advanced language processing functionalities directly from their devices while offloading the computational burden to the cloud infrastructure.
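
To make the request/response flow concrete, here is a minimal client-side sketch of calling a cloud-hosted LLM endpoint. The endpoint URL, authentication header, and JSON payload shape are illustrative assumptions, not the actual Azure Machine Learning contract, which is defined per deployment.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Hypothetical scoring endpoint and key; a real deployment defines its own URI,
// authentication scheme, and request/response schema.
const val ENDPOINT = "https://my-workspace.azureml.example/score"
const val API_KEY = "<your-endpoint-key>"

fun generateText(prompt: String): String {
    val body = """{"task": "text-generation", "prompt": "$prompt"}"""
    val request = HttpRequest.newBuilder()
        .uri(URI.create(ENDPOINT))
        .header("Content-Type", "application/json")
        .header("Authorization", "Bearer $API_KEY")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
    // The network round trip happens here; connectivity, latency, and data privacy
    // all depend on this hop to the cloud.
    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
    return response.body()
}

fun main() {
    println(generateText("Summarize today's safety checklist."))
}

Every call like this requires a network round trip to the hosted model, which is exactly the dependency the offline approach described next removes.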

To overcome the limitations of relying on internet connectivity, to give users the flexibility to interact with their safety copilot even in remote locations or places where internet isn't available, such as basements or underground facilities, and to safeguard privacy, the optimal solution is to run the language models on-device, offline. By deploying these models directly on users' devices, such as mobile phones and tablets, we eliminate the need for continuous internet access and the associated back-and-forth communication with remote servers. This approach empowers users to access their safety copilot anytime, anywhere, without dependency on network connectivity.

What are Small Language Models (SLMs)?

Small Language Models (SLMs) represent a focused subset of artificial intelligence tailored for specific enterprise needs within Natural Language Processing (NLP). Unlike their larger counterparts like GPT-4, SLMs prioritize efficiency and precision over sheer computational power. They are trained on domain-specific datasets, enabling them to navigate industry-specific terminologies and nuances with accuracy. In contrast to Large Language Models (LLMs), which may lack customization for enterprise contexts, SLMs offer targeted, actionable insights while minimizing inaccuracies and the risk of generating irrelevant information. SLMs are characterized by their compact architecture, lower computational demands, and enhanced security features, making them cost-effective and adaptable for real-time applications like chatbots. Overall, SLMs provide tailored efficiency, enhanced security, and lower latency, addressing specific business needs effectively while offering a promising alternative to the broader capabilities of LLMs.

Small Language Models (SLMs) offer enterprises control and customization, efficient resource usage, effective performance, swift training and inference, and resource-efficient deployment. They scale easily, adapt to specific domains, facilitate rapid prototyping, enhance security, and provide transparency. SLMs also have clear limitations and offer cost efficiency, making them an attractive option for businesses seeking AI capabilities without extensive resource investment.


 

Why is running SLMs offline at the edge a challenge?

Running small language models (SLMs) offline on mobile phones enhances privacy, reduces latency, and promotes access. Users can interact with LLM-based applications, receive critical information, and perform tasks even in offline environments, ensuring accessibility and control over personal data. Real-time performance and independence from centralized infrastructure unlock new opportunities for innovation in mobile computing, offering a seamless and responsive user experience. However, running SLMs offline on mobile phones presents several challenges due to the constraints of mobile hardware and the complexity of LLM workloads. Here are some key challenges:

  1. Limited Processing Power: Mobile devices, especially smartphones, have limited computational resources compared to desktop computers or servers. SLMs often require significant processing power to execute tasks such as text generation or sentiment analysis, which can strain the capabilities of mobile CPUs and GPUs.
  2. Memory Constraints: SLMs typically require a significant amount of memory to store model parameters and intermediate computations. Mobile devices have limited RAM compared to desktops or servers, making it challenging to load and run language models efficiently (a back-of-the-envelope estimate follows this list).
  3. Battery Life Concerns: Running resource-intensive tasks like NLP on mobile devices can drain battery life quickly. Optimizing SLMs for energy efficiency is crucial to ensure that offline usage remains practical without significantly impacting battery performance.
  4. Storage Limitations: Storing large language models on mobile devices can be problematic due to limited storage space. Balancing the size of the model with the available storage capacity while maintaining performance is a significant challenge.
  5. Update and Maintenance: Keeping SLMs up to date with the latest improvements and security patches presents challenges for offline deployment on mobile devices. Ensuring seamless updates while minimizing data usage and user inconvenience requires careful planning and implementation.
  6. Real-Time Performance: Users expect responsive performance from mobile applications, even when running complex NLP tasks offline. Optimizing SLMs for real-time inference on mobile devices is crucial to provide a smooth user experience.
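
To make the memory constraint (point 2 above) concrete, here is a rough back-of-the-envelope sketch. The parameter count and bytes-per-weight values are illustrative assumptions, used only to show why quantization is essential before a model of this size fits on a phone.

// Estimate the RAM needed just to hold the model weights.
fun weightMemoryGb(parameters: Double, bytesPerWeight: Double): Double =
    parameters * bytesPerWeight / (1024.0 * 1024.0 * 1024.0)

fun main() {
    val phi2Params = 2.7e9  // Phi-2 has roughly 2.7 billion parameters
    println("fp16 (2 bytes/weight):   %.1f GB".format(weightMemoryGb(phi2Params, 2.0)))
    println("int8 (1 byte/weight):    %.1f GB".format(weightMemoryGb(phi2Params, 1.0)))
    println("int4 (0.5 bytes/weight): %.1f GB".format(weightMemoryGb(phi2Params, 0.5)))
    // Activations, the KV cache, and the host app add to these figures, so even
    // the int4 variant leaves limited headroom on a typical 6-8 GB phone.
}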

How to deploy SLMs on mobile devices?

Deploying Large Language Models (LLMs) on mobile devices involves a sophisticated integration of MediaPipe and WebAssembly technologies to optimize performance and efficiency. MediaPipe, renowned for its on-device ML capabilities, provides a robust framework for running LLMs entirely on mobile devices, thereby eliminating the need for constant network connectivity and offloading computation to remote servers. With the experimental MediaPipe LLM Inference API, developers can seamlessly integrate popular LLMs like Gemma, Phi 2, Falcon, and Stable LM into their mobile applications. This breakthrough is facilitated by a series of optimizations across the on-device stack, including the integration of new operations, quantization techniques, caching mechanisms, and weight sharing strategies. MediaPipe leverages WebAssembly (Wasm) to further enhance the deployment of LLMs on mobile devices.

[Figure: WebAssembly on-device stack (tvm-wasm-stack.png)]

 

Wasm's compact binary format and compatibility with multiple programming languages ensure efficient execution of non-JavaScript code within the mobile environment. By time-slicing GPU access and ensuring platform neutrality, Wasm optimizes GPU usage and facilitates seamless deployment across diverse hardware environments, thus enhancing the performance of LLMs on mobile devices. Additionally, advances such as the WebAssembly System Interface for Neural Networks (WASI-NN) standard extend Wasm's capabilities, promising a future where it plays a pivotal role in democratizing access to AI-grade compute power on mobile devices. Through the combined use of MediaPipe and WebAssembly, developers can deploy LLMs on mobile devices with unprecedented efficiency and performance, revolutionizing on-device AI applications across various platforms.

MediaPipe's LLM Inference API empowers you to harness the power of large language models (LLMs) directly on your Android device. With this tool, you can execute tasks like text generation, natural language information retrieval, and document summarization without relying on external servers. It offers seamless integration with multiple text-to-text LLMs, enabling you to leverage cutting-edge generative AI models within your Android applications, with support for popular SLMs like Phi-2, Gemma, Falcon-RW-1B, and StableLM-3B.


 

The LLM Inference API uses the com.google.mediapipe:tasks-genai library. Add this dependency to the build.gradle file of your Android app:

 
dependencies {
    implementation 'com.google.mediapipe:tasks-genai:0.10.11'
}

Convert model to MediaPipe format

The model conversion process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after 0.10.11.

Install and import the dependencies with the following:

 
$ python3 -m pip install mediapipe

Use the genai.converter library to convert the model:

import mediapipe as mp
from mediapipe.tasks.python.genai import converter

def phi2_convert_config(backend):
    input_ckpt = '/content/phi-2'
    vocab_model_file = '/content/phi-2/'
    output_dir = '/content/intermediate/phi-2/'
    output_tflite_file = f'/content/converted_models/phi2_{backend}.bin'
    return converter.ConversionConfig(
        input_ckpt=input_ckpt, ckpt_format='safetensors', model_type='PHI_2',
        backend=backend, output_dir=output_dir, combine_file_only=False,
        vocab_model_file=vocab_model_file, output_tflite_file=output_tflite_file)

# Run the conversion for the chosen backend ('cpu' or 'gpu').
config = phi2_convert_config('gpu')
converter.convert_checkpoint(config)
The ConversionConfig parameters are:

  • input_ckpt: The path to the model.safetensors or pytorch.bin file. Note that the safetensors format is sometimes sharded into multiple files, e.g. model-00001-of-00003.safetensors, model-00002-of-00003.safetensors. You can specify a file pattern, like model*.safetensors. Accepted values: PATH.
  • ckpt_format: The model file format. Accepted values: "safetensors", "pytorch".
  • model_type: The LLM being converted. Accepted values: "PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B".
  • backend: The processor (delegate) used to run the model. Accepted values: "cpu", "gpu".
  • output_dir: The path to the output directory that hosts the per-layer weight files. Accepted values: PATH.
  • output_tflite_file: The path to the output file, for example "model_cpu.bin" or "model_gpu.bin". This file is only compatible with the LLM Inference API and cannot be used as a general `tflite` file. Accepted values: PATH.
  • vocab_model_file: The path to the directory that stores the tokenizer.json and tokenizer_config.json files. For Gemma, point to the single tokenizer.model file. Accepted values: PATH.

Push model to the device

Push the converted model file (the output_tflite_file produced above) to the Android device.

 
$ adb shell rm -r /data/local/tmp/llm/ # Remove any previously loaded models
$ adb shell mkdir -p /data/local/tmp/llm/
$ adb push model.bin /data/local/tmp/llm/model_phi2.bin

Create the task

The MediaPipe LLM Inference API uses the createFromOptions() function to set up the task. The createFromOptions() function accepts values for the configuration options. For more information on configuration options, see Configuration options.

The following code initializes the task using basic configuration options:

 
// Set the configuration options for the LLM Inference task
val options = LlmInferenceOptions.builder()
    .setModelPath("/data/local/.../")   // path to the converted model on the device
    .setMaxTokens(1000)
    .setTopK(40)
    .setTemperature(0.8)
    .setRandomSeed(101)
    .build()

// Create an instance of the LLM Inference task
llmInference = LlmInference.createFromOptions(context, options)

Configuration options

Use the following configuration options to set up an Android app:

  • modelPath: The path to where the model is stored within the project directory. Value range: PATH. Default: N/A.
  • maxTokens: The maximum number of tokens (input tokens + output tokens) the model handles. Value range: Integer. Default: 512.
  • topK: The number of tokens the model considers at each step of generation; limits predictions to the top k most probable tokens. When setting topK, you must also set a value for randomSeed. Value range: Integer. Default: 40.
  • temperature: The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. When setting temperature, you must also set a value for randomSeed. Value range: Float. Default: 0.8.
  • randomSeed: The random seed used during text generation. Value range: Integer. Default: 0.
  • resultListener: Sets the result listener to receive the results asynchronously. Only applicable when using the async generation method. Default: N/A.
  • errorListener: Sets an optional error listener. Default: N/A.

 

Prepare data

The LLM Inference API accepts the following inputs:

  • prompt (string): A question or prompt.
 
val inputPrompt = "Compose an email to remind Brett of lunch plans at noon on Saturday."

Run the task

Use the generateResponse() method to generate a text response to the input text provided in the previous section (inputPrompt). This produces a single generated response.

 
val result = llmInference.generateResponse(inputPrompt)
logger.atInfo().log("result: $result")

To stream the response, use the generateResponseAsync() method.

 
val options = LlmInference.LlmInferenceOptions.builder()
    ...
    .setResultListener { partialResult, done ->
        logger.atInfo().log("partial result: $partialResult")
    }
    .build()

llmInference.generateResponseAsync(inputPrompt)

Handle and display results

The LLM Inference API returns a LlmInferenceResult, which includes the generated response text.

 
Here's a draft you can use:

Subject: Lunch on Saturday Reminder

Hi Brett,

Just a quick reminder about our lunch plans this Saturday at noon.
Let me know if that still works for you.

Looking forward to it!

Best,
[Your Name]
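
In an app you typically want to surface the streamed text in the UI as it arrives. Below is a minimal sketch of one way to wire the result listener to a view; the layout resource, the responseText view id, and the use of runOnUiThread are assumptions for illustration, not part of the MediaPipe API.

import android.os.Bundle
import android.widget.TextView
import androidx.appcompat.app.AppCompatActivity
import com.google.mediapipe.tasks.genai.llminference.LlmInference

class ChatActivity : AppCompatActivity() {

    private lateinit var llmInference: LlmInference

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_chat)              // hypothetical layout
        val responseText = findViewById<TextView>(R.id.responseText)

        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath("/data/local/tmp/llm/model_phi2.bin")
            .setResultListener { partialResult, done ->
                // The listener may not fire on the main thread, so hop back before touching views.
                runOnUiThread {
                    responseText.append(partialResult)
                    if (done) responseText.append("\n[done]")
                }
            }
            .build()

        llmInference = LlmInference.createFromOptions(this, options)
        llmInference.generateResponseAsync(
            "Compose an email to remind Brett of lunch plans at noon on Saturday."
        )
    }
}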

 


763: Web Scraping + Reverse Engineering APIs


Web scraping 101! Dive into the world of web scraping with Scott and Wes as they explore everything from tooling setup and navigating protected routes to effective data management. In this Tasty Treat episode, you’ll gain invaluable insights and techniques to scrape (almost) any website with ease.

Show Notes

Sick Picks

Shameless Plugs

Hit us up on Socials!

Syntax: X Instagram Tiktok LinkedIn Threads

Wes: X Instagram Tiktok LinkedIn Threads

Scott: X Instagram Tiktok LinkedIn Threads

Randy: X Instagram YouTube Threads





Download audio: https://traffic.libsyn.com/secure/syntax/Syntax_-_763.mp3?dest-id=532671

SQL Server and AI with Muazma Zahid & Bob Ward


How does artificial intelligence fit in with SQL Server? At the Microsoft Fabric Conference in Las Vegas, Richard sat down with Muazma Zahid and Bob Ward to discuss the AI developments in SQL Server. Muazma talks about SQL Server as a crucial source of data for building machine learning models and the new features added to make SQL Server a key store for vector data and other elements of machine learning. There's also Copilot for Azure SQL Database to help with diagnostics in your databases and to use natural language to write queries - it's SQL Server Natural Language Query all over again, but certainly better than ever.

Links

Recorded March 27, 2024





Download audio: https://cdn.simplecast.com/audio/c2165e35-09c6-4ae8-b29e-2d26dad5aece/episodes/8fb3a8b3-7e46-4aea-b8c6-26f057b52ce4/audio/e341c778-321e-422d-9090-a50df815d8dc/default_tc.mp3?aid=rss_feed&feed=cRTTfxcT

General Performance Tip: Optimizing Enum Value Name Retrieval

The article delves into the optimization of Enum value name retrieval in .NET, comparing three approaches. It demonstrates that one of these methods is 16.13 times more efficient, with no memory allocation.






Run Phi-3 SLM on your machine with C# Semantic Kernel and Ollama


Microsoft recently unveiled Phi-3, the latest iteration of their Small Language Model (SLM). And hot on its heels is Ollama, a powerful tool that enables you to run SLMs and LLMs right on your own machine.

Excited to dive in? In this guide, I’ll show you how to harness the power of Phi-3 and Ollama using C# and Semantic Kernel. I’ll walk you through the process of creating a simple console application to get you started on your SLM journey.

So, let’s get coding and unlock the potential of Phi-3 and Ollama on your machine!


Transforming Financial Documents into Smart and Secure Forms in ASP.NET Core C#


This article shows how to transform financial documents into smart and secure forms in ASP.NET Core C#. All the necessary steps, from prepopulating the form fields to digital signatures, are explained in this article.

Read more


