
Getting started with Microsoft Phi-3-mini - Try running the Phi-3-mini on iPhone with ONNX Runtime


Microsoft, Google, and Apple have all, at different times, released small language models (SLMs) adapted to edge devices (Microsoft Phi-3-mini, Google Gemma, and Apple OpenELM). Developers can deploy SLMs offline on devices such as NVIDIA Jetson Orin, Raspberry Pi, and AI PCs, which gives generative AI more application scenarios. We learned several ways to deploy applications in the previous article, so how do we deploy SLM applications to mobile devices?

This article is a preliminary exploration based on the iPhone. Microsoft has released Phi-3-mini in three formats on Hugging Face, among which GGUF and ONNX are quantized models, so we can deploy a quantized Phi-3-mini model according to the hardware at hand. Let's get started and explore the quantized model in the Phi-3-mini ONNX format. If you want to use the GGUF format, the LLM Farm app is recommended.


Generative AI with ONNX Runtime

In the era of AI, the portability of AI models is very important. ONNX Runtime makes it easy to deploy trained models to different devices: developers do not need to worry about the inference framework and can use a unified API to run model inference. In the era of generative AI, ONNX Runtime has also been optimized for this workload (https://onnxruntime.ai/docs/genai/). Through the optimized ONNX Runtime, quantized generative AI models can be inferred on different terminals. With Generative AI with ONNX Runtime, you can call the inference API from Python, C#, or C/C++. For deployment on the iPhone, we can take advantage of the C++ API of Generative AI with ONNX Runtime.
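To give a feel for the unified API before we dive into the iOS-specific C++ build, here is a minimal Python sketch of the same generate flow using the onnxruntime-genai package. The library is in preview, so class and method names (og.Model, og.Tokenizer, og.GeneratorParams, model.generate) may differ between versions, and the model folder path below is only an assumption; treat this as an illustration rather than a definitive snippet.

import onnxruntime_genai as og

# Load the quantized Phi-3-mini ONNX model from its folder (path is an assumption)
model = og.Model("./Phi-3-mini-4k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32")
tokenizer = og.Tokenizer(model)

prompt = "<|system|>You are a helpful AI assistant.<|end|><|user|>Can you introduce yourself?<|end|><|assistant|>"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=100)
params.input_ids = input_tokens

# Generate and decode the first output sequence
output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))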


Steps

A. Preparation

  1. macOS 14+

  2. Xcode 15+

  3. iOS SDK 17.x

  4. Install Python 3.10+ (Conda is recommended)

  5. Install the Python library - python-flatbuffers

  6. Install CMake


B. Compiling ONNX Runtime for iOS


git clone https://github.com/microsoft/onnxruntime.git

cd onnxruntime

./build.sh --build_shared_lib --ios --skip_tests --parallel --build_dir ./build_ios --apple_sysroot iphoneos --osx_arch arm64 --apple_deploy_target 17.4 --cmake_generator Xcode --config Release

Notice

  1. Before compiling, make sure Xcode is configured correctly and set the active developer directory from the terminal:

sudo xcode-select -switch /Applications/Xcode.app/Contents/Developer

  2. ONNX Runtime needs to be compiled for each target platform. For iOS, you can compile for arm64 / x86_64.

  3. It is recommended to compile with the latest iOS SDK. Of course, you can also lower the version to stay compatible with older SDKs.

C. Compiling Generative AI with ONNX Runtime for iOS

Note: Because Generative AI with ONNX Runtime is still in preview, be aware that things may change.


git clone https://github.com/microsoft/onnxruntime-genai

cd onnxruntime-genai

git checkout yguo/ios-build-genai


mkdir ort

cd ort

mkdir include

mkdir lib

cd ../


cp ../onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h ort/include
cp ../onnxruntime/build_ios/Release/Release-iphoneos/libonnxruntime*.dylib* ort/lib

python3 build.py --parallel --build_dir ./build_ios_simulator --ios --ios_sysroot iphoneos --osx_arch arm64 --apple_deployment_target 17.4 --cmake_generator Xcode


D. Create an App application in Xcode

I chose Objective-C for the app because it interoperates more easily with the Generative AI with ONNX Runtime C++ API. Of course, you can also make the relevant calls from Swift through bridging.

 

(Screenshot: creating the app project in Xcode)


E. Copy the ONNX quantized INT4 model to the App application project

We need the INT4 quantized model in ONNX format, which must be downloaded from Hugging Face first.

 

(Screenshot: the Phi-3-mini ONNX model on Hugging Face)

 

After downloading, you need to add it to the Resources directory of the project in Xcode.

 

(Screenshot: the model files added to the Resources directory of the Xcode project)

 

F. Add the C++ API in ViewControllers

Notice:

  1. Add the corresponding C++ header files to the project.

(Screenshot: the C++ header files added to the project)

 

  2. Add onnxruntime-genai.dylib in Xcode.

 

(Screenshot: onnxruntime-genai.dylib added in Xcode)

 

  3. For this sample, the code from the C samples is used directly for testing. You can also build more on top of it (such as a chat UI).

  4. Because you need to call C++, change ViewController.m to ViewController.mm.


    // Load the model from the app bundle's resource path
    NSString *llmPath = [[NSBundle mainBundle] resourcePath];
    char const *modelPath = llmPath.UTF8String;

    auto model = OgaModel::Create(modelPath);

    auto tokenizer = OgaTokenizer::Create(*model);

    // Phi-3 chat template: system prompt, user question, then the assistant turn to complete
    const char* prompt = "<|system|>You are a helpful AI assistant.<|end|><|user|>Can you introduce yourself?<|end|><|assistant|>";

    // Tokenize the prompt
    auto sequences = OgaSequences::Create();
    tokenizer->Encode(prompt, *sequences);

    // Configure generation and run it
    auto params = OgaGeneratorParams::Create(*model);
    params->SetSearchOption("max_length", 100);
    params->SetInputSequences(*sequences);

    auto output_sequences = model->Generate(*params);

    // Decode the first output sequence back to text
    const auto output_sequence_length = output_sequences->SequenceCount(0);
    const auto* output_sequence_data = output_sequences->SequenceData(0);
    auto out_string = tokenizer->Decode(output_sequence_data, output_sequence_length);

    auto tmp = out_string;

G. Look at the running results

(Screenshot: inference output running on the iPhone)

 

Sample Codes: https://github.com/Azure-Samples/Phi-3MiniSamples/tree/main/ios


Summary

This is a very preliminary result. Because I am using an iPhone 12, inference is relatively slow, and CPU usage reaches 130% during inference. It would be better if inference could cooperate with Apple's MLX framework under iOS, so what I am looking forward to in this project is hardware acceleration for iOS in Generative AI with ONNX Runtime. Of course, you can also test on a newer iPhone.

This is just a preliminary exploration, but it is a good start. I look forward to the improvement of Generative AI with ONNX Runtime.

Resources

  1. LLMFarm’s GitHub Repo https://github.com/guinmoon/LLMFarm

  2. Phi3-mini Microsoft Blog https://aka.ms/phi3blog-april

  3. Phi-3 technical report https://aka.ms/phi3-tech-report

  4. Getting started with Phi3 https://aka.ms/phi3gettingstarted

  5. Learn about ONNX Runtime https://github.com/microsoft/onnxruntime

  6. Learn about Generative AI with ONNX Runtime https://github.com/microsoft/onnxruntime-genai


Why should you migrate from OpenAI to Azure OpenAI?



As the field of AI continues to grow, developers are constantly seeking new and innovative ways to integrate it into their work. With the launch of Azure OpenAI Service, developers now have even more tools at their disposal to take advantage of this powerful technology. Azure OpenAI Service can be used to create chatbots, generate text, translate languages, and write different kinds of creative content. As the platform continues to evolve, developers will be able to use it to build even more powerful and sophisticated applications. 

 

What is Azure OpenAI 

Azure OpenAI Service is a fully managed service that allows developers to integrate OpenAI models into their applications. With Azure OpenAI Service, developers can quickly access a wide range of AI models, including natural language processing, computer vision, and more. Azure OpenAI Service provides a simple, easy-to-use API that makes it easy to get started with AI.

 

Strategic AI Provider Selection for Businesses 

The AI service provider landscape is characterized by its rapid evolution and diverse offerings. Informed decision-making requires a careful analysis of the providers' unique strengths, their pricing models, and the congruence with an organization's specific demands and strategic ambitions. 

Let’s look at some of the common scenarios in app migrations and break down major differences 

 

Programming SDKs 

Here are the major changes we need to make to switch our apps from OpenAI to Azure OpenAI. We are going to use the Python SDK for this example. 

  •  API key – The code looks similar, but Azure OpenAI adds api_version and azure_endpoint because you're running your own instance.  

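A hedged sketch with the Python SDK (v1.x); the key, endpoint, and API version values are placeholders:

from openai import OpenAI, AzureOpenAI

# OpenAI
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

# Azure OpenAI: same client shape, plus api_version and azure_endpoint for your own instance
client = AzureOpenAI(
    api_key="YOUR_AZURE_OPENAI_KEY",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com",
)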

 

  • Microsoft Entra ID authentication – This is helpful in adding extra security to our client instance by adding api_version, azure_endpoint and the token_provider. 

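A hedged sketch of keyless authentication with Microsoft Entra ID, assuming the azure-identity package is installed and the endpoint value is a placeholder:

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Request tokens for the Cognitive Services scope instead of passing an API key
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com",
    azure_ad_token_provider=token_provider,
)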

 

  • Keyword argument for model - OpenAI uses the model keyword argument to specify what model to use. Azure OpenAI has the concept of unique model deployments. When you use Azure OpenAI, model should refer to the underlying deployment name you chose when you deployed the model. 

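A hedged sketch using the client created above; the deployment name "my-gpt-35-turbo-deployment" is a placeholder for whatever name you chose when you deployed the model:

messages = [{"role": "user", "content": "Hello!"}]

# OpenAI: model is the model name
response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)

# Azure OpenAI: model is the name of *your* deployment of that model
response = client.chat.completions.create(model="my-gpt-35-turbo-deployment", messages=messages)

print(response.choices[0].message.content)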

 

  • Embeddings multiple input support - OpenAI and Azure OpenAI currently support input arrays up to 2,048 input items for text-embedding-ada-002. Both require the max input token limit per API request to remain under 8,191 for this model. 

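A hedged sketch of passing multiple inputs in one embeddings call (on Azure OpenAI, model is again your deployment name):

inputs = ["first document", "second document", "third document"]  # up to 2,048 items

response = client.embeddings.create(
    model="text-embedding-ada-002",  # or your Azure OpenAI deployment name
    input=inputs,
)

vectors = [item.embedding for item in response.data]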

 

 Other Benefits of migrating from OpenAI to Azure OpenAI 

 

  • Managed Service and Infrastructure: 
    • Azure OpenAI is a fully managed service provided by Microsoft. You don’t need to worry about setting up and maintaining infrastructure, as Azure handles it for you. You just need to spin up your OpenAI instance and start developing. 
    • You can also configure Azure OpenAI Service with managed identities 

  • Security and Compliance: 
    • Azure provides robust security features, including encryption, identity management, and compliance certifications, which makes it an easier choice for startups, companies, and organizations. 
    • If your application deals with sensitive data, Azure OpenAI ensures that your models and data are protected according to industry standards. Your company's data is retained in your own Azure OpenAI instance. 
    • Responsible AI practices for Azure OpenAI models 

  • Azure OpenAI supported programming languages - Azure OpenAI gives you five programming languages (C#, Go, Java, JavaScript, and Python) with SDKs to help you easily interact with the models.  

  • Scalability and High Availability: 
    • Azure’s global infrastructure allows you to scale your AI workloads dynamically. You can handle increased demand by automatically provisioning additional resources. 
    • Azure also provides redundancy across multiple data centers, ensuring high availability and fault tolerance.
       
  • Integration with Other Azure Services: 
  • Cost Optimization: 
    • Azure offers flexible pricing options, including pay-as-you-go (PAYG) and Provisioned Throughput Units (PTUs). With PAYG, you can optimize costs by paying only for the resources you use, while PTUs provide throughput with minimal latency variance, making them ideal for scaling your AI solutions. Each model is priced per unit, ensuring a predictable cost structure for your AI deployments. 

    • Additionally, Azure provides cost management tools to monitor and optimize your spending. You can even approximate the cost of your Azure resources by using the pricing calculator. 

Read More

 

 


Finetune Small Language Model (SLM) Phi-3 using Azure Machine Learning


Motivations for Small Language Models:

· Efficiency: SLMs are computationally more efficient, requiring less memory and storage, and can operate faster due to fewer parameters to process.

· Cost: Training and deploying SLMs is less expensive, making them accessible to a wider range of businesses and suitable for applications in edge computing.

· Customizability: SLMs are more adaptable to specialized applications and can be fine-tuned for specific tasks more readily than larger models.

· Under-Explored Potential: While large models have shown clear benefits, the potential of smaller models trained with larger datasets has been less explored. SLM aims to showcase that smaller models can achieve high performance when trained with enough data.

· Inference Efficiency: Smaller models are often more efficient during inference, which is a critical aspect when deploying models in real-world applications with resource constraints. This efficiency includes faster response times and reduces computational and energy costs.

· Accessibility for Research: By being open-source and smaller in size, SLM is more accessible to a broader range of researchers who may not have the resources to work with larger models. It provides a platform for experimentation and innovation in language model research without requiring extensive computational resources.

· Advancements in Architecture and Optimization: SLM incorporates various architectural and speed optimizations to improve computational efficiency. These enhancements allow SLM to train faster and with less memory, making it feasible to train on commonly available GPUs.

· Open-Source Contribution: The authors of SLM have made the model checkpoints and code publicly available, contributing to the open-source community and enabling further advancements and applications by others.

· End-User Applications: With its excellent performance and compact size, SLM is suitable for end-user applications, potentially even on mobile devices, providing a lightweight platform for a wide range of applications.

· Training Data and Process: SLM training process is designed to be effective and reproducible, using a mixture of natural language data and code data, aiming to make pre-training accessible and transparent.


 

Phi-2 (Microsoft Research)

Phi-2 is the successor of Phi-1.5, the language model created by Microsoft. To improve over Phi-1.5, in addition to doubling the number of parameters to 2.7 billion, Microsoft also extended the training data. Phi-2 outperforms Phi-1.5, as well as LLMs that are 25 times larger, on several public benchmarks, even though it is not aligned/fine-tuned. It is a pre-trained model for research purposes only (non-commercial, non-revenue generating). Forget about the exorbitant fees of larger language models: Phi-2 runs efficiently on even modest hardware, democratizing access to cutting-edge AI for startups and smaller businesses. No more sky-high cloud bills, just smart, affordable solutions on your own terms.

In this example, we are going to learn how to fine-tune Phi-3 using QLoRA (efficient fine-tuning of quantized LLMs) together with Flash Attention. QLoRA is an efficient fine-tuning technique that quantizes a pretrained language model to 4 bits and attaches small "Low-Rank Adapters" which are fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.

Step 1:

Let's prepare the dataset. In this case, we are going to download the ultrachat dataset.

 

from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split='train_sft[:2%]')
print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

 

 

Let's take a shorter version of the dataset to create training and test examples. To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function that takes a sample and returns a string in our instruction format.
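As an illustration only, a sketch of such a formatting function is shown below; the format_instruction name and the <|role|> ... <|end|> layout are assumptions that roughly mirror the Phi-3 chat format, and the training script later relies on tokenizer.apply_chat_template instead of a hand-written function.

# Hypothetical formatting function for ultrachat_200k samples (illustrative only).
def format_instruction(sample):
    # Each sample carries a list of chat messages under "messages"
    turns = []
    for message in sample["messages"]:
        turns.append(f"<|{message['role']}|>\n{message['content']}<|end|>")
    return "\n".join(turns)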

 

 

dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset['train']
train_dataset.to_json(f"data/train.jsonl")
test_dataset = dataset['test']
test_dataset.to_json(f"data/eval.jsonl")

 

 

Let's save the training and test datasets in JSON format. Now let's load the Azure ML SDK. This will help us create the necessary components.

 

 

# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component
from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes

 

 

Now let's create the workspace client.

 

 

credential = DefaultAzureCredential()
workspace_ml_client = None
try:
    workspace_ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    subscription_id = "Enter your subscription_id"
    resource_group = "Enter your resource_group"
    workspace = "Enter your workspace name"
    workspace_ml_client = MLClient(credential, subscription_id, resource_group, workspace)

 

 

Here, let's create a custom training environment.

 

from azure.ai.ml.entities import Environment, BuildContext

env_docker_image = Environment(
    image="mcr.microsoft.com/azureml/curated/acft-hf-nlp-gpu:latest",
    conda_file="environment/conda.yml",
    name="llm-training",
    description="Environment created for llm training.",
)
workspace_ml_client.environments.create_or_update(env_docker_image)

 

 

Let’s look at the conda.yml

 

name: pydata-example
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - pip:
    - bitsandbytes
    - transformers
    - peft
    - accelerate
    - einops
    - datasets

 

 

Let's look at the training script. We are going to use the method introduced in the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a technique to reduce the memory footprint of large language models during fine-tuning, without sacrificing performance. The TL;DR of how QLoRA works:

  • Quantize the pretrained model to 4 bits and freeze it.
  • Attach small, trainable adapter layers (LoRA).
  • Fine-tune only the adapter layers, while using the frozen quantized model for context.
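For reference, the training script below loads the base model in bfloat16; if you want the 4-bit quantized load that the QLoRA recipe describes, a minimal sketch (assuming the bitsandbytes package is installed on the compute) might look like this:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization settings from the QLoRA recipe (illustrative values)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for the matmuls
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
)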

 

%%writefile src/train.py
import os
# import mlflow
import argparse
import sys
import logging

import datasets
from datasets import load_dataset
from peft import LoraConfig
import torch
import transformers
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig

logger = logging.getLogger(__name__)

###################
# Hyper-parameters
###################
training_config = {
    "bf16": True,
    "do_eval": False,
    "learning_rate": 5.0e-06,
    "log_level": "info",
    "logging_steps": 20,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 1,
    "max_steps": -1,
    "output_dir": "./checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_eval_batch_size": 4,
    "per_device_train_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.2,
}

peft_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",
    "modules_to_save": None,
}

train_conf = TrainingArguments(**training_config)
peft_conf = LoraConfig(**peft_config)

###############
# Setup logging
###############
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = train_conf.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()

# Log on each process a small summary
logger.warning(
    f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
    + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}"
)
logger.info(f"Training/evaluation parameters {train_conf}")
logger.info(f"PEFT parameters {peft_conf}")

################
# Model Loading
################
checkpoint_path = "microsoft/Phi-3-mini-4k-instruct"
# checkpoint_path = "microsoft/Phi-3-mini-128k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # load the model with flash-attention support
    torch_dtype=torch.bfloat16,
    device_map=None,
)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
tokenizer.model_max_length = 2048
tokenizer.pad_token = tokenizer.unk_token  # use unk rather than eos token to prevent endless generation
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'

##################
# Data Processing
##################
def apply_chat_template(example, tokenizer):
    messages = example["messages"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example


def main(args):
    train_dataset = load_dataset('json', data_files=args.train_file, split='train')
    test_dataset = load_dataset('json', data_files=args.eval_file, split='train')
    column_names = list(train_dataset.features)

    processed_train_dataset = train_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to train_sft",
    )
    processed_test_dataset = test_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to test_sft",
    )

    ###########
    # Training
    ###########
    trainer = SFTTrainer(
        model=model,
        args=train_conf,
        peft_config=peft_conf,
        train_dataset=processed_train_dataset,
        eval_dataset=processed_test_dataset,
        max_seq_length=2048,
        dataset_text_field="text",
        tokenizer=tokenizer,
        packing=True,
    )
    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()

    #############
    # Evaluation
    #############
    tokenizer.padding_side = 'left'
    metrics = trainer.evaluate()
    metrics["eval_samples"] = len(processed_test_dataset)
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

    ############
    # Save model
    ############
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(model, os.path.join(args.model_dir, "model.pt"))


def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--train-file", type=str, help="Input data for training")
    parser.add_argument("--eval-file", type=str, help="Input data for eval")
    parser.add_argument("--model-dir", type=str, default="./", help="output directory for model")
    parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
    parser.add_argument("--batch-size", default=16, type=int, help="mini batch size for each gpu/process")
    parser.add_argument("--learning-rate", default=0.001, type=float, help="learning rate")
    parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
    parser.add_argument("--print-freq", default=200, type=int, help="frequency of printing training statistics")

    # parse args
    args = parser.parse_args()
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()
    # call main function
    main(args)

 

 

Let's create a training compute.

 

from azure.ai.ml.entities import AmlCompute

# If you have a specific compute size to work with, change it here. By default we use a 1 x V100 compute from the list above.
compute_cluster_size = "Standard_NC6s_v3"

# If you already have a GPU cluster, mention it here. Otherwise a new one named 'gpu-cluster' will be created.
compute_cluster = "gpu-cluster"

try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,
            size=compute_cluster_size,
            tier="Dedicated",
            max_instances=1,  # For multi-node training set this to an integer value more than 1
        )
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        print("Error")

 

 

Now let's submit the job, running the above training script on the AML compute we just created.

 

from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration

job = command(
    inputs=dict(
        train_file=Input(
            type="uri_file",
            path="data/train.jsonl",
        ),
        eval_file=Input(
            type="uri_file",
            path="data/eval.jsonl",
        ),
        epoch=2,
        batchsize=64,
        lr=0.01,
        momentum=0.9,
        prtfreq=200,
        output="./outputs"
    ),
    code="./src",  # local path where the code is stored
    compute='gpu-a100',
    command="accelerate launch train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/52",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
returned_job = workspace_ml_client.jobs.create_or_update(job)
workspace_ml_client.jobs.stream(returned_job.name)

 

 

Let's look at the pipeline output.

 

# check if the `trained_model` output is available
job_name = returned_job.name
print("pipeline job outputs: ", workspace_ml_client.jobs.get(job_name).outputs)

 

 

Once the model is fine-tuned, let's register the model from the job output in the workspace so we can create an endpoint.

 

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

run_model = Model(
    path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/outputs/mlflow_model_folder",
    name="phi-3-finetuned",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
model = workspace_ml_client.models.create_or_update(run_model)

 

 

Let's create the endpoint.

 

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

# Check if the endpoint already exists in the workspace
try:
    endpoint = workspace_ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except:
    # Create an online endpoint if it doesn't exist

    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"Test endpoint for {model.name}",
        identity=IdentityConfiguration(
            type="user_assigned",
            user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],
        )
        if uai_id != ""
        else None,
    )

    # Trigger the endpoint creation
    try:
        workspace_ml_client.begin_create_or_update(endpoint).wait()
        print("\n---Endpoint created successfully---\n")
    except Exception as err:
        raise RuntimeError(
            f"Endpoint creation failed. Detailed Response:\n{err}"
        ) from err

 

Once the endpoint is created we can go ahead and create the deployment.

 

# Initialize deployment parameters
deployment_name = "phi3-deploy"
sku_name = "Standard_NCs_v3"

REQUEST_TIMEOUT_MS = 90000

deployment_env_vars = {
    "SUBSCRIPTION_ID": subscription_id,
    "RESOURCE_GROUP_NAME": resource_group,
    "UAI_CLIENT_ID": uai_client_id,
}

 

For inferencing we will use a different base image.

 

 

from azure.ai.ml.entities import Model, Environment

env = Environment(
    image='mcr.microsoft.com/azureml/curated/foundation-model-inference:latest',
    inference_config={
        "liveness_route": {"port": 5001, "path": "/"},
        "readiness_route": {"port": 5001, "path": "/"},
        "scoring_route": {"port": 5001, "path": "/score"},
    },
)

 

Let's deploy the model.

 

from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment,
)

deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model.id,
    instance_type=sku_name,
    instance_count=1,
    # code_configuration=code_configuration,
    environment=env,
    environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(request_timeout_ms=REQUEST_TIMEOUT_MS),
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)

# Trigger the deployment creation
try:
    workspace_ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err

 

If you want to delete the endpoint, please see the code below.

 

workspace_ml_client.online_deployments.begin_delete(name=deployment_name, endpoint_name=endpoint_name)
workspace_ml_client.online_endpoints.begin_delete(name=endpoint_name)

 

 

Hope this tutorial helps you fine-tune and deploy the Phi-3 model in Azure ML Studio.

Hope you like the blog. Please clap and follow me if you like to read more such blogs coming soon.

References:

https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

https://www.philschmid.de/sagemaker-falcon-180b-qlora


Philip Japikse: Migrating from .NET Framework to .NET 8 - Episode 296


An international speaker; Microsoft MVP, ASPInsider, MCSD, PSM II, PSD, and PST; and a passionate member of the developer community, Phil has been working with .NET since the first betas, developing software for over 35 years, and has been heavily involved in the agile community since 2005, as well as serving as a Professional Scrum Trainer. Phil has taken over the best-selling Pro C# books (Apress Publishing), including Pro C# 10, is the President of the Cincinnati .NET User's Group (Cinnug.org) and the Cincinnati Software Architect Group, co-hosted the Hallway Conversations podcast (Hallwayconversations.com), and founded and runs the CincyDeliver conference (Cincydeliver.org). He also volunteers for the National Ski Patrol. During the day, Phil works as the CTO for Pintas & Mullins. Phil always enjoys learning new tech and is always striving to improve his craft.

 

Topics of Discussion:

[3:47] Philip’s career journey and why he’s still hands-on coding.

[5:37] Sometimes it’s not a technical problem, but a process or human interaction problem.

[6:37] Philip’s love of mentoring.

[8:18] The importance of collaboration.

[9:53] Challenges in migrating applications from .NET Framework to .NET Core.

[12:55] The importance of staying current.

[14:48] Modernizing legacy web applications using .NET Core.

[19:22] Rebuilding an old app using new technology, with challenges and lessons learned.

[24:22] Gradually introducing a new screen using feature flags is better than a "big bang" rewrite.

[26:01] Continuous deployment helps to roll out new features gradually to limited users.

[27:53] Differences between the .NET framework and .NET Core apps, including configuration settings to environmental awareness.

[34:59] Philip’s favorite resources to dig into, including his book.

[41:20] The power of collaborative learning.

 

Mentioned in this Episode:

Clear Measure Way

Architect Forum

Software Engineer Forum

Programming with Palermo — New Video Podcast! Email us at programming@palermo.net.

Clear Measure, Inc. (Sponsor)

.NET DevOps for Azure: A Developer’s Guide to DevOps Architecture the Right Way, by Jeffrey Palermo — Available on Amazon!

Jeffrey Palermo’s Twitter — Follow to stay informed about future events!

“Philip Japikse: Professional C# in .NET - Episode 230”

 

Want to Learn More?

Visit AzureDevOps.Show for show notes and additional episodes.

 





Download audio: https://traffic.libsyn.com/secure/azuredevops/ADP_296_00-07-23.mp3?dest-id=768873

Collection Performance: Leveraging LINQ MAXBy() and MINBy() for Efficient and Readable Code

The article explores the usage of LINQ's MinBy() and MaxBy() methods, which efficiently return the objects with the minimum and maximum values in a sequence based on a specified key selector function.








Interview of our Team Lead Patrick Smacchia at WebsitePlanet

May 6, 2024 · 1 minute read

Here is an interview with our Team Lead Patrick Smacchia at WebsitePlanet. Enjoy 🙂

The post Interview of our Team Lead Patrick Smacchia at WebsitePlanet appeared first on NDepend Blog.
