Chapter 11: Hosting Models on Cloud Platforms

“You don’t have to build the rocket—just rent the launchpad.”

Self-hosting doesn’t always mean racking your own servers in a cold data center. In fact, most developers start their self-hosting journey using cloud platforms—where you get access to powerful GPUs, managed containers, autoscaling endpoints, and observability dashboards.

This chapter will walk you through three major cloud platforms used for hosting LLMs:

  • Amazon SageMaker
  • Google Vertex AI
  • Azure Machine Learning

Each platform has its own philosophy, pricing structure, and developer experience. We’ll compare them and then dive into hands-on hosting steps so you can launch a real chatbot backend from any of them.


When Should You Use Cloud Hosting?

Cloud platforms shine when:

  • You want scalable LLM hosting with managed infrastructure.
  • You need to integrate with cloud-native services (e.g., S3, Pub/Sub, Cloud Run).
  • You don't want to deal with GPU provisioning and model loading logistics.
  • You want endpoint-based inference with SLAs and autoscaling.

They’re perfect for MVPs, production deployments, and internal enterprise tools that require compliance or reliability.


Option 1: Hosting on AWS SageMaker

🔹 Key Features

  • Model Hosting as a Service (inference endpoints with autoscaling)
  • Integrated with S3, CloudWatch, IAM
  • Prebuilt PyTorch, TensorFlow, Hugging Face containers
  • Offers GPU and multi-model endpoints

Basic Workflow

  1. Prepare your model: save it as a .tar.gz containing pytorch_model.bin, config.json, etc. (see the packaging sketch below)
  2. Upload the archive to S3
  3. Create a SageMaker Model using a prebuilt Hugging Face container image
  4. Deploy it as an Endpoint, choosing an instance type (e.g., ml.g5.xlarge for a single A10G GPU)
  5. Call it via the HTTPS API
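For steps 1 and 2, a minimal packaging-and-upload sketch might look like this, assuming a Hugging Face Transformers model and the sagemaker Python SDK (the model ID, local directory, and bucket name are placeholders):

import tarfile

from transformers import AutoModelForCausalLM, AutoTokenizer
import sagemaker

# Load your (fine-tuned) model and tokenizer, then save the files locally
model_id = "your-org/your-model"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.save_pretrained("model")
tokenizer.save_pretrained("model")

# Package the saved files into model.tar.gz (SageMaker expects them at the archive root)
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model", arcname=".")

# Upload the archive to S3; the returned URI becomes model_data in the next example
session = sagemaker.Session()
model_data = session.upload_data("model.tar.gz", bucket="your-bucket", key_prefix="models")
print(model_data)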

Example

from sagemaker.huggingface import HuggingFaceModel

# Point SageMaker at the packaged model in S3 and a prebuilt Hugging Face container
model = HuggingFaceModel(
    model_data='s3://your-bucket/model.tar.gz',
    role='your-sagemaker-role',  # IAM role with SageMaker and S3 permissions
    transformers_version='4.26',
    pytorch_version='1.13',
    py_version='py39'
)

# Create a real-time endpoint on a single GPU instance
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.xlarge'
)
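
Once the endpoint is live, you can call it from the returned predictor object. A quick sketch (the exact payload keys depend on the container and task; "inputs" is the convention for the Hugging Face inference containers):

# Send a prompt to the deployed endpoint and print the generated output
result = predictor.predict({
    "inputs": "Write a haiku about self-hosting."
})
print(result)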

Pros

  • Best for enterprise-scale hosting
  • Native support for multiple models per endpoint
  • Deep IAM + security integration

Cons

  • Can get expensive
  • Steeper learning curve for setup

Option 2: Hosting on Google Vertex AI

🔹 Key Features

  • Model Upload and Container-based Deployment
  • Excellent for TF/ONNX; supports PyTorch via prebuilt or custom prediction containers
  • Integrated with BigQuery, GCS, Firebase, GKE
  • Built-in explainability, monitoring, and logging

Basic Workflow

  1. Package your model in SavedModel, ONNX, or PyTorch format
  2. Upload the model to Vertex AI
  3. Create an Endpoint, choosing a compute type (e.g., n1-standard-4 with a T4 GPU)
  4. Deploy the model via the Console or the gcloud CLI

Example (CLI)

gcloud ai models upload \
  --region=us-central1 \
  --display-name=mymodel \
  --artifact-uri=gs://your-model-dir \
  --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/pytorch-x.y:latest
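
If you prefer to stay in Python, the same upload/deploy/predict flow can be scripted with the google-cloud-aiplatform SDK. A rough sketch, with the project ID, artifact bucket, container image, and payload shape as placeholders:

from google.cloud import aiplatform

aiplatform.init(project="your-gcp-project", location="us-central1")

# Register the model artifacts together with a prediction container image
model = aiplatform.Model.upload(
    display_name="mymodel",
    artifact_uri="gs://your-model-dir",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-x.y:latest",
)

# Deploy the model to a new endpoint backed by a GPU machine type
endpoint = model.deploy(
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)

# Send a prediction request; the instance format depends on your serving container
response = endpoint.predict(instances=[{"prompt": "Hello, Vertex!"}])
print(response.predictions)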

Pros

  • Tight GCP integration (Cloud Run, Pub/Sub, etc.)
  • Cost-effective for small-medium workloads
  • Simplified UI for ML lifecycle

Cons

  • Less Hugging Face support out-of-the-box
  • May require more DevOps skill for custom workflows

Option 3: Hosting on Azure ML

🔹 Key Features

  • Supports ML pipelines, AutoML, and LLMOps
  • Deployment via ACI (Azure Container Instances) or AKS (Azure Kubernetes Service)
  • Works well with OpenAI on Azure (hybrid solution)

Basic Workflow

  1. Register Model
  2. Create Environment (Python + PyTorch/HF)
  3. Define Inference Script (a minimal score.py sketch follows this list)
  4. Deploy via CLI or SDK
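
For step 3, Azure ML expects a scoring script that defines init() and run() entry points. A minimal score.py sketch for a Transformers causal LM (the payload format and generation settings are illustrative):

# score.py
import json
import os

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

generator = None

def init():
    # Runs once at container start; AZUREML_MODEL_DIR points at the registered model files
    global generator
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

def run(raw_data):
    # Runs per request with the raw JSON body
    payload = json.loads(raw_data)
    output = generator(payload["prompt"], max_new_tokens=128)
    return {"completion": output[0]["generated_text"]}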

Example (Deployment Script)

from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

# Register the model files with the workspace
model = Model.register(...)

# Point the deployment at the scoring script and its Python environment
inference_config = InferenceConfig(...)
deployment_config = AciWebservice.deploy_configuration(...)

# Deploy the model as a web service on Azure Container Instances
service = Model.deploy(workspace, 'myservice', [model], inference_config, deployment_config)
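
Once Model.deploy returns, you typically wait for the rollout to finish and then call the service. A sketch using the same SDK objects, with a payload matching the score.py sketch above:

import json

# Block until the ACI deployment finishes and print where it is reachable
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)

# Invoke the scoring script through the SDK (you can also POST to scoring_uri directly)
response = service.run(json.dumps({"prompt": "Hello, Azure!"}))
print(response)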

Pros

  • Enterprise-grade security, Active Directory integration
  • Works well with OpenAI hybrid models (some on Azure, some self-hosted)

Cons

  • Verbose setup process
  • May require Azure-specific skills

Pros & Cons Comparison Table

Feature | SageMaker | Vertex AI | Azure ML
--- | --- | --- | ---
LLM Container Support | Hugging Face, TF, PT | Better for TF/ONNX | Often manual setup
Cost Management | Fine-grained control | Auto-scaling tiers | Pay-as-you-go + AKS
Autoscaling | Yes (real-time) | Yes | Yes
Security | IAM, VPC | IAM, VPC | Azure AD, VNET
Dev Experience | CLI + SDK heavy | UI + CLI friendly | Heavier setup
Multi-Model Support | Yes | Not native | Yes (AKS)

Securing the Endpoints

Cloud providers help you enforce:

  • IAM-based access (internal only, service accounts)
  • Rate limits and quotas
  • HTTPS encryption by default
  • Private VPC endpoints for zero-exposure setups

Pro tip: Always monitor your usage. Even idle endpoints can incur cost.
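
For example, with the SageMaker Python SDK you can tear down the endpoint from the earlier example as soon as you no longer need it (assuming the predictor object is still in scope):

# Delete the real-time endpoint (and its endpoint configuration) so it stops billing
predictor.delete_endpoint()

# Optionally remove the registered model object as well
predictor.delete_model()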


When to Use Each

Scenario | Recommended Platform
--- | ---
Building an enterprise chatbot for internal tools | Azure ML (Active Directory)
Public LLM-based app with real-time usage spikes | AWS SageMaker
Startup-grade MVP or Google-native apps | Google Vertex AI
You want full DevOps control over the image/container | Docker + Render (see next chapter)

Summary

Cloud platforms strike a powerful balance between control and convenience. You don’t need to worry about installing NVIDIA drivers or debugging model-loading quirks—just upload your model, deploy an endpoint, and call it from your chatbot.

This chapter taught you the key workflows and tradeoffs of each major cloud platform. In the next chapter, we’ll go deeper into open-source model hosting—where you get even more flexibility by using your own FastAPI/Docker stack or tools like Hugging Face Inference Endpoints.

Ready to host your own brain—from scratch? Let's get building.