1. Goals of Task 2.2
- Deploy foundation models (FMs) on AWS using different services (Bedrock, SageMaker, ECS/EKS, Lambda).
- Address unique challenges of LLM deployment (latency, cost, context length, safety, governance).
- Design optimized deployment approaches per use case (real‑time, batch, internal apps, hybrid/edge).
2. AWS services for deploying FMs
2.1 Amazon Bedrock
- Fully managed, serverless access to multiple foundation models via a unified API (Amazon, Anthropic, Meta, Mistral, etc.).
- Handles scaling, patching, and security controls, and integrates with IAM, CloudWatch, and existing AWS tooling.
- Supports model customization (fine‑tuning and knowledge bases) without managing GPUs directly.
- Best when you want fast time‑to‑market, minimal ops, and pay‑per‑use pricing.
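As a sketch, a single Bedrock call through the unified Converse API (a model‑agnostic request shape) might look like the following; the model ID and region are placeholders for whatever your account has access to:

```python
import boto3

# Minimal sketch: one client, one Converse call. The same request shape works
# across Bedrock-hosted models; only the modelId changes.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Summarize our return policy in one sentence."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```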
2.2 Amazon SageMaker (incl. SageMaker “AI”)
- Host your own or JumpStart/Marketplace foundation models on managed endpoints.
- Endpoint types: real‑time, serverless, asynchronous inference, and batch transform.
- Fine‑grained control over instance types (GPU, Inferentia, Trainium), autoscaling, networking (VPC), and security.
- Integrates deeply with MLOps: Experiments, Pipelines, Model Registry, and monitoring.
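A minimal deployment sketch using the SageMaker Python SDK's JumpStart support is shown below; the model ID, instance type, and request payload are illustrative and depend on the model you pick:

```python
from sagemaker.jumpstart.model import JumpStartModel

# Minimal sketch: deploy a JumpStart foundation model to a real-time endpoint.
model = JumpStartModel(model_id="meta-textgeneration-llama-3-8b-instruct")  # example model ID

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",   # GPU instance; Inferentia/Trainium types also exist
    accept_eula=True,                # many FM licenses require explicit EULA acceptance
)

# Payload format depends on the serving container (TGI-style shown here).
print(predictor.predict({"inputs": "Write a haiku about autoscaling."}))
```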
2.3 Bedrock ↔ SageMaker via Marketplace
- From the Bedrock Marketplace, you can subscribe to a model and deploy it to a SageMaker endpoint.
- You specify the endpoint name, instance type, and instance count, then Bedrock orchestrates the deployment onto SageMaker.
- Useful when you want Marketplace models but require SageMaker‑level control and integration.
2.4 Containers on ECS/EKS
- Package an LLM inference server (e.g., vLLM, TGI) into a container image.
- Run the container on:
  - ECS on EC2 or Fargate (simplified scaling and ops).
  - EKS for maximum flexibility and integration with the Kubernetes ecosystem.
- Appropriate when you need custom runtimes, advanced scheduling, hybrid/on‑prem deployment, or non‑standard frameworks.
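Assuming a vLLM container exposing its OpenAI‑compatible HTTP API behind an ALB, a client call might look roughly like this (hostname and model name are placeholders):

```python
import requests

# Minimal sketch of a client calling a self-hosted vLLM container behind an ALB.
ALB_URL = "https://llm.internal.example.com/v1/chat/completions"  # placeholder hostname

resp = requests.post(
    ALB_URL,
    json={
        "model": "my-hosted-model",   # the model name the server was started with
        "messages": [{"role": "user", "content": "Classify this ticket: 'login page is down'"}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```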
2.5 Serverless wrappers with API Gateway + Lambda
- Typical pattern: API Gateway → Lambda → Bedrock or SageMaker endpoint.
- Lambda acts as a thin orchestration layer: input validation, auth, routing, logging; the heavy LLM compute happens on Bedrock/SageMaker.
- Works well for light‑to‑moderate‑traffic APIs, POCs, and event‑driven integrations with other AWS services.
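A hedged sketch of the Lambda piece of this pattern (the model ID, validation rule, and response shape are illustrative):

```python
import json
import boto3

# Minimal sketch of API Gateway -> Lambda -> Bedrock. The Lambda only validates
# input and forwards the prompt; Bedrock does the heavy compute.
bedrock_runtime = boto3.client("bedrock-runtime")

def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    prompt = (body.get("prompt") or "").strip()
    if not prompt or len(prompt) > 4000:   # basic validation / rough prompt-size cap
        return {"statusCode": 400, "body": json.dumps({"error": "prompt missing or too long"})}

    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},
    )
    answer = response["output"]["message"]["content"][0]["text"]
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```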
3. Core deployment patterns
3.1 Direct Bedrock API (managed FM)
- The app calls Bedrock's InvokeModel / InvokeModelWithResponseStream directly or via API Gateway.
- No infrastructure management; scaling and availability are handled by AWS.
- Good fit for: chatbots, assistants, Q&A apps, and early‑stage projects.
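A streaming sketch with InvokeModelWithResponseStream; note that the request body here follows the Anthropic messages format on Bedrock, and other model families use different body schemas:

```python
import json
import boto3

# Minimal sketch of token streaming. Print tokens as they arrive instead of
# waiting for the full completion.
bedrock_runtime = boto3.client("bedrock-runtime")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Explain RAG in two sentences."}],
}

response = bedrock_runtime.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model ID
    body=json.dumps(body),
)

for event in response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    if chunk.get("type") == "content_block_delta":
        print(chunk["delta"].get("text", ""), end="", flush=True)
```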
3.2 SageMaker real‑time endpoints
- Deploy a model artifact plus inference code to a real‑time HTTPS endpoint.
- Options:
  - Single‑model endpoints.
  - Multi‑model endpoints (MME) to host multiple models on the same fleet.
- Use autoscaling policies (e.g., based on invocations per minute or concurrency) to absorb traffic spikes.
- Suitable when you need predictable low latency, specific hardware, custom inference logic, or enterprise network controls.
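A sketch of attaching a target‑tracking autoscaling policy to an endpoint variant with Application Auto Scaling (the endpoint/variant names, capacity limits, and target value are placeholders):

```python
import boto3

# Minimal sketch: scale a SageMaker endpoint variant on invocations per instance.
autoscaling = boto3.client("application-autoscaling")

resource_id = "endpoint/my-llm-endpoint/variant/AllTraffic"   # placeholder names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="llm-invocations-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Scale so each instance handles roughly this many invocations per minute.
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```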
3.3 Asynchronous / batch inference
- Asynchronous endpoints: queue incoming requests and process them asynchronously; clients poll for results.
- Batch transform: process large static datasets offline in bulk jobs.
- Ideal for document backfills, large content generation tasks, or workloads where latency is not user‑facing.
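A sketch of submitting work to an asynchronous endpoint (the endpoint name and S3 paths are placeholders; the input payload must already be in S3):

```python
import boto3

# Minimal sketch: submit a request to a SageMaker asynchronous inference endpoint.
runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_async(
    EndpointName="my-async-llm-endpoint",
    InputLocation="s3://my-bucket/requests/batch-0001.json",
    ContentType="application/json",
)

# The response contains an OutputLocation to poll; the endpoint can also be
# configured to publish success/error notifications via SNS.
print(response["OutputLocation"])
```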
3.4 Containerized microservice (ECS/EKS)
- Deploy your own LLM service in containers behind an ALB or API Gateway.
- You manage scaling (e.g., ECS Service Auto Scaling, Karpenter on EKS) and GPU scheduling.
- Enables advanced features: custom batching, multi‑tenant routing, a custom observability stack, or integration with on‑prem GPUs.
4. Unique challenges of LLM deployment
4.1 Latency and throughput
- Large parameter counts increase memory usage and compute time per token, raising end‑to‑end latency.
- High concurrency can exhaust GPU memory and cause queuing or throttling.
- Mitigations:
  - Choose the smallest model that meets quality requirements (model right‑sizing).
  - Use optimized hardware (GPU, Inferentia, Trainium) and optimized serving stacks.
  - Enable token streaming where supported to improve perceived latency.
4.2 Cost management and GPU constraints
- GPUs are expensive and often in limited supply; naive one‑model‑per‑GPU hosting is inefficient.
- Patterns for cost control:
  - Multi‑model endpoints to share hardware across models.
  - Autoscaling based on real traffic patterns, and off‑peak scheduling for batch jobs.
  - Hybrid approach: offload spiky or experimental workloads to Bedrock while keeping steady traffic on self‑hosted endpoints.
4.3 Context length, token limits, and memory
- Longer context windows increase the memory footprint and compute cost per request.
- Design considerations:
  - Enforce maximum token counts for prompt and completion, with truncation or summarization of long inputs (a small enforcement sketch follows this list).
  - Use retrieval‑augmented generation (RAG) to keep prompts small while grounding responses in enterprise data.
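A minimal, illustrative sketch of enforcing token budgets before calling a model (the limits and the characters‑per‑token heuristic are assumptions, not model‑specific values; in practice use the model's tokenizer and documented context window):

```python
# Rough token budgeting applied before any model call.
MAX_PROMPT_TOKENS = 6000
MAX_COMPLETION_TOKENS = 1000

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)      # crude heuristic, not a real tokenizer

def build_request(prompt: str) -> dict:
    if approx_tokens(prompt) > MAX_PROMPT_TOKENS:
        # Truncate from the front, keeping the most recent tail of the input;
        # summarizing the overflow is a better but more expensive option.
        prompt = prompt[-MAX_PROMPT_TOKENS * 4:]
    return {
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": MAX_COMPLETION_TOKENS},
    }
```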
4.4 Reliability, safety, and governance
- Risks: hallucinations, sensitive data leakage, and toxic or non‑compliant content.
- Controls:
  - Guardrails and content filters (e.g., Bedrock Guardrails, custom moderation services); see the sketch after this list.
  - Human‑in‑the‑loop review for high‑risk actions.
  - Centralized logging and auditing via CloudWatch, X‑Ray, and CloudTrail.
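A hedged sketch of attaching a pre‑configured Bedrock Guardrail to a Converse call (the guardrail ID/version and model ID are placeholders for resources you create in your own account):

```python
import boto3

# Minimal sketch: apply an existing Bedrock Guardrail to a model invocation.
bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model ID
    messages=[{"role": "user", "content": [{"text": "How do I reset a customer's password?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-example123",          # hypothetical guardrail ID
        "guardrailVersion": "1",
        "trace": "enabled",                              # emit guardrail trace for auditing
    },
)

# stopReason indicates when the guardrail intervened instead of the model finishing normally.
print(response["stopReason"])
print(response["output"]["message"]["content"][0]["text"])
```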
5. Optimized patterns by use case
5.1 Public chat/assistant (low latency, spiky)
- Requirements: low latency, global access, unpredictable traffic.
- Suggested pattern:
  - Bedrock model with streaming responses.
  - API Gateway + Lambda front‑end, or direct calls from web/mobile.
  - Strong rate limiting and token caps to manage cost.
5.2 Internal enterprise RAG application
- Requirements: private data, compliance, observability, and explainability.
- Suggested pattern:
  - Mid‑sized open‑source model on a SageMaker real‑time endpoint, plus a RAG pipeline over an enterprise index.
  - CI/CD with SageMaker Pipelines and Model Registry for controlled rollouts.
  - VPC‑only access, CloudWatch metrics, and model monitoring.
5.3 High‑volume content generation (batch/async)
- Requirements: high throughput, cost‑efficiency, less strict latency.
- Suggested pattern:
  - SageMaker batch transform or asynchronous inference, or containerized jobs on ECS/EKS (a batch transform sketch follows this list).
  - Spot instances or off‑peak scheduling to reduce GPU cost.
  - Distilled or pruned models when the quality trade‑off is acceptable.
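A sketch of an offline batch transform job with the SageMaker Python SDK (the container image, model artifact, IAM role, and S3 paths are placeholders):

```python
from sagemaker.model import Model

# Minimal sketch: run a bulk, offline generation job with batch transform.
model = Model(
    image_uri="<inference-container-image-uri>",                     # placeholder
    model_data="s3://my-bucket/models/llm/model.tar.gz",             # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",    # placeholder
)

transformer = model.transformer(
    instance_count=2,
    instance_type="ml.g5.2xlarge",
    output_path="s3://my-bucket/generation-output/",
)

transformer.transform(
    data="s3://my-bucket/generation-input/",   # one JSON record per line
    content_type="application/json",
    split_type="Line",
)
transformer.wait()
```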
5.4 Edge or hybrid deployments
- Requirements: data locality, low round‑trip latency, partial offline capability.
- Suggested pattern:
  - Smaller models deployed in containers on Outposts, Local Zones, or on‑prem Kubernetes.
  - Central control plane in AWS (Bedrock/SageMaker) for heavier tasks and centralized governance.
6. Design checklist (exam‑oriented)
- Clarify the workload: online vs. batch, target latency, TPS, and burst patterns.
- Choose the service:
  - Bedrock for managed FMs and minimal operations.
  - SageMaker for fine‑grained control, custom code, and MLOps.
  - ECS/EKS for fully custom stacks or hybrid/edge constraints.
- Address LLM‑specific concerns: model size, context length, token limits, safety controls, and monitoring.
- Plan the lifecycle: experiment → staging → production via CDK/CloudFormation, CodePipeline, and SageMaker Pipelines.