1. Goals of Task 2.2
- Deploy foundation models (FMs) on AWS using different services (Bedrock, SageMaker, ECS/EKS, Lambda).
- Address unique challenges of LLM deployment (latency, cost, context length, safety, governance).
- Design optimized deployment approaches per use case (real‑time, batch, internal apps, hybrid/edge).
2. AWS services for deploying FMs
2.1 Amazon Bedrock
- Fully managed, serverless access to multiple foundation models via a unified API (Amazon, Anthropic, Meta, Mistral, etc.).
- Handles scaling, patching, and security controls, and integrates with IAM, CloudWatch, and existing AWS tooling.
- Supports model customization (fine‑tuning and knowledge bases) without managing GPUs directly.
- Best when you want fast time‑to‑market, minimal ops, and pay‑per‑use pricing.
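As a sketch, a single Bedrock call through the unified Converse API (a model‑agnostic request shape) might look like the following; the model ID and region are placeholders for whatever your account has access to:

```python
import boto3

# Minimal sketch: one client, one Converse call. The same request shape works
# across Bedrock-hosted models; only the modelId changes.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Summarize our return policy in one sentence."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```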
2.2 Amazon SageMaker (incl. SageMaker “AI”)
- Host your own or JumpStart/Marketplace foundation models on managed endpoints.
- Endpoint types: real‑time, serverless, asynchronous inference, and batch transform.
- Fine‑grained control over instance types (GPU, Inferentia, Trainium), autoscaling, networking (VPC), and security.
- Integrates deeply with MLOps: Experiments, Pipelines, Model Registry, and monitoring.
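A minimal deployment sketch using the SageMaker Python SDK's JumpStart support is shown below; the model ID, instance type, and request payload are illustrative and depend on the model you pick:

```python
from sagemaker.jumpstart.model import JumpStartModel

# Minimal sketch: deploy a JumpStart foundation model to a real-time endpoint.
model = JumpStartModel(model_id="meta-textgeneration-llama-3-8b-instruct")  # example model ID

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",   # GPU instance; Inferentia/Trainium types also exist
    accept_eula=True,                # many FM licenses require explicit EULA acceptance
)

# Payload format depends on the serving container (TGI-style shown here).
print(predictor.predict({"inputs": "Write a haiku about autoscaling."}))
```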
2.3 Bedrock ↔ SageMaker via Marketplace
- From the Bedrock Marketplace, you can subscribe to a model and deploy it to a SageMaker endpoint.
- You specify the endpoint name, instance type, and instance count, then Bedrock orchestrates the deployment onto SageMaker.
- Useful when you want Marketplace models but require SageMaker‑level control and integration.
2.4 Containers on ECS/EKS
- Package an LLM inference server (e.g., vLLM, TGI) into a container image.
- Run the container on:
  - ECS on EC2 or Fargate (simplified scaling and ops).
  - EKS for maximum flexibility and integration with the Kubernetes ecosystem.
- Appropriate when you need custom runtimes, advanced scheduling, hybrid/on‑prem deployment, or non‑standard frameworks.
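Assuming a vLLM container exposing its OpenAI‑compatible HTTP API behind an ALB, a client call might look roughly like this (hostname and model name are placeholders):

```python
import requests

# Minimal sketch of a client calling a self-hosted vLLM container behind an ALB.
ALB_URL = "https://llm.internal.example.com/v1/chat/completions"  # placeholder hostname

resp = requests.post(
    ALB_URL,
    json={
        "model": "my-hosted-model",   # the model name the server was started with
        "messages": [{"role": "user", "content": "Classify this ticket: 'login page is down'"}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```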
2.5 Serverless wrappers with API Gateway + Lambda
- Typical pattern: API Gateway → Lambda → Bedrock or SageMaker endpoint.
- Lambda acts as a thin orchestration layer: input validation, auth, routing, logging; the heavy LLM compute happens on Bedrock/SageMaker.
- Works well for light‑to‑moderate‑traffic APIs, POCs, and event‑driven integrations with other AWS services.
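A hedged sketch of the Lambda piece of this pattern (the model ID, validation rule, and response shape are illustrative):

```python
import json
import boto3

# Minimal sketch of API Gateway -> Lambda -> Bedrock. The Lambda only validates
# input and forwards the prompt; Bedrock does the heavy compute.
bedrock_runtime = boto3.client("bedrock-runtime")

def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    prompt = (body.get("prompt") or "").strip()
    if not prompt or len(prompt) > 4000:   # basic validation / rough prompt-size cap
        return {"statusCode": 400, "body": json.dumps({"error": "prompt missing or too long"})}

    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},
    )
    answer = response["output"]["message"]["content"][0]["text"]
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```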
3. Core deployment patterns
3.1 Direct Bedrock API (managed FM)
- The app calls Bedrock's InvokeModel / InvokeModelWithResponseStream directly or via API Gateway.
- No infrastructure management; scaling and availability are handled by AWS.
- Good fit for: chatbots, assistants, Q&A apps, and early‑stage projects.
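A streaming sketch with InvokeModelWithResponseStream; note that the request body here follows the Anthropic messages format on Bedrock, and other model families use different body schemas:

```python
import json
import boto3

# Minimal sketch of token streaming. Print tokens as they arrive instead of
# waiting for the full completion.
bedrock_runtime = boto3.client("bedrock-runtime")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Explain RAG in two sentences."}],
}

response = bedrock_runtime.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model ID
    body=json.dumps(body),
)

for event in response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    if chunk.get("type") == "content_block_delta":
        print(chunk["delta"].get("text", ""), end="", flush=True)
```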
3.2 SageMaker real‑time endpoints
- Deploy a model artifact plus inference code to a real‑time HTTPS endpoint.
- Options:
  - Single‑model endpoints.
  - Multi‑model endpoints (MME) to host multiple models on the same fleet.
- Use autoscaling policies (e.g., based on invocations per minute or concurrency) to absorb traffic spikes.
- Suitable when you need predictable low latency, specific hardware, custom inference logic, or enterprise network controls.
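A sketch of attaching a target‑tracking autoscaling policy to an endpoint variant with Application Auto Scaling (the endpoint/variant names, capacity limits, and target value are placeholders):

```python
import boto3

# Minimal sketch: scale a SageMaker endpoint variant on invocations per instance.
autoscaling = boto3.client("application-autoscaling")

resource_id = "endpoint/my-llm-endpoint/variant/AllTraffic"   # placeholder names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="llm-invocations-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Scale so each instance handles roughly this many invocations per minute.
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```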
3.3 Asynchronous / batch inference
- Asynchronous endpoints: queue incoming requests and process them asynchronously; clients poll for results.
- Batch transform: process large static datasets offline in bulk jobs.
- Ideal for document backfills, large content generation tasks, or workloads where latency is not user‑facing.
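A sketch of submitting work to an asynchronous endpoint (the endpoint name and S3 paths are placeholders; the input payload must already be in S3):

```python
import boto3

# Minimal sketch: submit a request to a SageMaker asynchronous inference endpoint.
runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_async(
    EndpointName="my-async-llm-endpoint",
    InputLocation="s3://my-bucket/requests/batch-0001.json",
    ContentType="application/json",
)

# The response contains an OutputLocation to poll; the endpoint can also be
# configured to publish success/error notifications via SNS.
print(response["OutputLocation"])
```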
3.4 Containerized microservice (ECS/EKS)
- Deploy your own LLM service in containers behind an ALB or API Gateway.
- You manage scaling (e.g., ECS Service Auto Scaling, Karpenter on EKS) and GPU scheduling.
- Enables advanced features: custom batching, multi‑tenant routing, a custom observability stack, or integration with on‑prem GPUs.
4. Unique challenges of LLM deployment
4.1 Latency and throughput
- Large parameter counts increase memory usage and compute time per token, raising end‑to‑end latency.
- High concurrency can exhaust GPU memory and cause queuing or throttling.
- Mitigations:
  - Choose the smallest model that meets quality requirements (model right‑sizing).
  - Use optimized hardware (GPU, Inferentia, Trainium) and optimized serving stacks.
  - Enable token streaming where supported to improve perceived latency.
4.2 Cost management and GPU constraints
- GPUs are expensive and often in limited supply; naive one‑model‑per‑GPU hosting is inefficient.
- Patterns for cost control:
  - Multi‑model endpoints to share hardware across models.
  - Autoscaling based on real traffic patterns, and off‑peak scheduling for batch jobs.
  - Hybrid approach: offload spiky or experimental workloads to Bedrock while keeping steady traffic on self‑hosted endpoints.
4.3 Context length, token limits, and memory
- Longer context windows increase the memory footprint and compute cost per request.
- Design considerations:
  - Enforce maximum token counts for prompt and completion, with truncation or summarization of long inputs (a small enforcement sketch follows this list).
  - Use retrieval‑augmented generation (RAG) to keep prompts small while grounding responses in enterprise data.
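A minimal, illustrative sketch of enforcing token budgets before calling a model (the limits and the characters‑per‑token heuristic are assumptions, not model‑specific values; in practice use the model's tokenizer and documented context window):

```python
# Rough token budgeting applied before any model call.
MAX_PROMPT_TOKENS = 6000
MAX_COMPLETION_TOKENS = 1000

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)      # crude heuristic, not a real tokenizer

def build_request(prompt: str) -> dict:
    if approx_tokens(prompt) > MAX_PROMPT_TOKENS:
        # Truncate from the front, keeping the most recent tail of the input;
        # summarizing the overflow is a better but more expensive option.
        prompt = prompt[-MAX_PROMPT_TOKENS * 4:]
    return {
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": MAX_COMPLETION_TOKENS},
    }
```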
4.4 Reliability, safety, and governance
- Risks: hallucinations, sensitive data leakage, and toxic or non‑compliant content.
- Controls:
  - Guardrails and content filters (e.g., Bedrock Guardrails, custom moderation services); see the sketch after this list.
  - Human‑in‑the‑loop review for high‑risk actions.
  - Centralized logging and auditing via CloudWatch, X‑Ray, and CloudTrail.
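A hedged sketch of attaching a pre‑configured Bedrock Guardrail to a Converse call (the guardrail ID/version and model ID are placeholders for resources you create in your own account):

```python
import boto3

# Minimal sketch: apply an existing Bedrock Guardrail to a model invocation.
bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model ID
    messages=[{"role": "user", "content": [{"text": "How do I reset a customer's password?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-example123",          # hypothetical guardrail ID
        "guardrailVersion": "1",
        "trace": "enabled",                              # emit guardrail trace for auditing
    },
)

# stopReason indicates when the guardrail intervened instead of the model finishing normally.
print(response["stopReason"])
print(response["output"]["message"]["content"][0]["text"])
```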
5. Optimized patterns by use case
5.1 Public chat/assistant (low latency, spiky)
- Requirements: low latency, global access, unpredictable traffic.
- Suggested pattern:
  - Bedrock model with streaming responses.
  - API Gateway + Lambda front‑end, or direct calls from web/mobile.
  - Strong rate limiting and token caps to manage cost.
5.2 Internal enterprise RAG application
- Requirements: private data, compliance, observability, and explainability.
- Suggested pattern:
  - Mid‑sized open‑source model on a SageMaker real‑time endpoint, plus a RAG pipeline over an enterprise index.
  - CI/CD with SageMaker Pipelines and Model Registry for controlled rollouts.
  - VPC‑only access, CloudWatch metrics, and model monitoring.
5.3 High‑volume content generation (batch/async)
- Requirements: high throughput, cost‑efficiency, less strict latency.
- Suggested pattern:
  - SageMaker batch transform or asynchronous inference, or containerized jobs on ECS/EKS (a batch transform sketch follows this list).
  - Spot instances or off‑peak scheduling to reduce GPU cost.
  - Distilled or pruned models when the quality trade‑off is acceptable.
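A sketch of an offline batch transform job with the SageMaker Python SDK (the container image, model artifact, IAM role, and S3 paths are placeholders):

```python
from sagemaker.model import Model

# Minimal sketch: run a bulk, offline generation job with batch transform.
model = Model(
    image_uri="<inference-container-image-uri>",                     # placeholder
    model_data="s3://my-bucket/models/llm/model.tar.gz",             # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",    # placeholder
)

transformer = model.transformer(
    instance_count=2,
    instance_type="ml.g5.2xlarge",
    output_path="s3://my-bucket/generation-output/",
)

transformer.transform(
    data="s3://my-bucket/generation-input/",   # one JSON record per line
    content_type="application/json",
    split_type="Line",
)
transformer.wait()
```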
5.4 Edge or hybrid deployments
- Requirements: data locality, low round‑trip latency, partial offline capability.
- Suggested pattern:
  - Smaller models deployed in containers on Outposts, Local Zones, or on‑prem Kubernetes.
  - Central control plane in AWS (Bedrock/SageMaker) for heavier tasks and centralized governance.
6. Design checklist (exam‑oriented)
- Clarify the workload: online vs. batch, target latency, TPS, and burst patterns.
- Choose the service:
  - Bedrock for managed FMs and minimal operations.
  - SageMaker for fine‑grained control, custom code, and MLOps.
  - ECS/EKS for fully custom stacks or hybrid/edge constraints.
- Address LLM‑specific concerns: model size, context length, token limits, safety controls, and monitoring.
- Plan the lifecycle: experiment → staging → production via CDK/CloudFormation, CodePipeline, and SageMaker Pipelines.