1. Goals of Task 2.2

  • Deploy foundation models (FMs) on AWS using different services (Bedrock, SageMaker, ECS/EKS, Lambda).

  • Address unique challenges of LLM deployment (latency, cost, context length, safety, governance).

  • Design optimized deployment approaches per use case (real‑time, batch, internal apps, hybrid/edge).


2. AWS services for deploying FMs

2.1 Amazon Bedrock

  • Fully managed, serverless access to multiple foundation models via a unified API (Amazon, Anthropic, Meta, Mistral, etc.).

  • Handles scaling, patching, security controls, and integrates with IAM, CloudWatch, and existing AWS tooling.

  • Supports model customization (fine‑tuning and knowledge bases) without managing GPUs directly.

  • Best when you want: fast time‑to‑market, minimal ops, and pay‑per‑use pricing.
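
  As a minimal sketch of what "unified API" means in practice, a single boto3 call through the Converse API might look like the following (the model ID and prompt are placeholders; any Bedrock model your account has access to works):

```python
import boto3

# Bedrock runtime client; the region must have the chosen model enabled.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model ID is an example only.
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize our return policy in two sentences."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```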

2.2 Amazon SageMaker (incl. SageMaker “AI”)

  • Host your own or JumpStart/Marketplace foundation models on managed endpoints.

  • Endpoint types: real‑time, serverless, asynchronous inference, and batch transform.

  • Fine control over instance types (GPU, Inferentia, Trainium), autoscaling, networking (VPC), and security.

  • Integrates deeply with MLOps: Experiments, Pipelines, Model Registry, and monitoring.
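
  A hedged sketch of hosting a JumpStart model on a real‑time endpoint with the SageMaker Python SDK (the model ID, instance type, and endpoint name are illustrative, not prescriptive):

```python
from sagemaker.jumpstart.model import JumpStartModel

# Model ID is an example; gated models may also require accept_eula=True on deploy.
model = JumpStartModel(model_id="huggingface-llm-mistral-7b-instruct")

# Creates a managed HTTPS endpoint on the chosen instance type.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",       # illustrative GPU instance
    endpoint_name="fm-demo-endpoint",    # illustrative name
)

# Payload schema depends on the serving container (TGI-style shown here).
result = predictor.predict(
    {"inputs": "What is retrieval-augmented generation?",
     "parameters": {"max_new_tokens": 128}}
)
print(result)
```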

2.3 Bedrock ↔ SageMaker via Marketplace

  • From the Bedrock Marketplace, you can subscribe to a model and deploy it to a SageMaker endpoint.

  • You specify the endpoint name, instance type, and instance count, then Bedrock orchestrates the deployment onto SageMaker.

  • Useful when you want Marketplace models but require SageMaker‑level control and integration.

2.4 Containers on ECS/EKS

  • Package an LLM inference server (e.g., vLLM, TGI) into a container image.

  • Run the container on:

    • ECS on EC2 or Fargate for simpler scaling and ops (Fargate offers no GPUs, so GPU‑backed inference needs EC2 capacity).

    • EKS for maximum flexibility and integration with Kubernetes ecosystem.

  • Appropriate when you need custom runtimes, advanced scheduling, hybrid/on‑prem deployment, or non‑standard frameworks.

2.5 Serverless wrappers with API Gateway + Lambda

  • Typical pattern: API Gateway → Lambda → Bedrock or SageMaker endpoint.

  • Lambda acts as a thin orchestration layer: input validation, auth, routing, and logging; the heavy LLM compute happens on Bedrock or SageMaker.

  • Works well for light to moderate traffic APIs, POCs, and event‑driven integrations with other AWS services.
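
  A minimal sketch of that thin Lambda layer in front of Bedrock (the event shape assumes an API Gateway proxy integration; the model ID is an example, and auth is elided):

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def handler(event, context):
    # API Gateway proxy integration: the prompt arrives in the JSON body.
    body = json.loads(event.get("body") or "{}")
    prompt = (body.get("prompt") or "")[:4000]  # crude input-size cap

    if not prompt:
        return {"statusCode": 400, "body": json.dumps({"error": "prompt is required"})}

    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 300},
    )
    answer = response["output"]["message"]["content"][0]["text"]
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```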


3. Core deployment patterns

3.1 Direct Bedrock API (managed FM)

  • The app calls Bedrock’s InvokeModel / InvokeModelWithResponseStream (or the Converse / ConverseStream APIs) directly or via API Gateway.

  • No infrastructure management; scaling and availability are handled by AWS.

  • Good fit for: chatbots, assistants, Q&A apps, and early‑stage projects.
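
  Streaming is the piece worth memorizing; a sketch using the streaming Converse variant (model ID is an example) prints tokens as they arrive instead of waiting for the full completion:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Streaming variant of the unified Converse API; model ID is an example.
response = bedrock.converse_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Explain what a VPC endpoint is."}]}],
    inferenceConfig={"maxTokens": 400},
)

# Consume incremental text deltas from the event stream.
for event in response["stream"]:
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"]["text"], end="", flush=True)
```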

3.2 SageMaker real‑time endpoints

  • Deploy a model artifact plus inference code to a real‑time HTTPS endpoint.

  • Options:

    • Single‑model endpoints.

    • Multi‑model endpoints (MME) to host multiple models on the same fleet.

  • Use autoscaling policies (e.g., based on invocations per minute or concurrency) for traffic spikes.

  • Suitable when you need: predictable low latency, specific hardware, custom inference logic, or enterprise network controls.
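
  A hedged sketch of attaching a target‑tracking autoscaling policy to an endpoint variant via Application Auto Scaling (endpoint/variant names and thresholds are illustrative):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Endpoint and variant names are examples; the resource ID format is fixed by SageMaker.
resource_id = "endpoint/fm-demo-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="fm-demo-invocations-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Target invocations per instance per minute; tune from load tests.
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```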

3.3 Asynchronous / batch inference

  • Asynchronous endpoints: requests are queued and processed in the background; results are written to S3 and clients poll for them (or subscribe to an SNS notification).

  • Batch transform: process large static datasets offline in bulk jobs.

  • Ideal for document backfills, large content generation tasks, or workloads where latency is not user‑facing.
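
  For asynchronous endpoints, the client points at a payload already in S3 and gets back an output location to poll; a minimal sketch (bucket, key, and endpoint name are placeholders):

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# The request payload must already exist in S3; names below are examples.
response = runtime.invoke_endpoint_async(
    EndpointName="fm-async-endpoint",
    InputLocation="s3://my-bucket/requests/batch-001.json",
    ContentType="application/json",
)

# The endpoint writes the completed result to S3; poll this location
# (or subscribe to the endpoint's SNS success/error topics) to retrieve it.
print(response["OutputLocation"])
```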

3.4 Containerized microservice (ECS/EKS)

  • Deploy your own LLM service in containers behind an ALB or API Gateway.

  • You manage scaling (e.g., ECS Service Autoscaling, Karpenter on EKS) and GPU scheduling.

  • Enables advanced features: custom batching, multi‑tenant routing, custom observability stack, or integration with on‑prem GPUs.
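
  Regardless of the orchestrator, clients usually talk to the container over plain HTTP; a sketch against a vLLM server’s OpenAI‑compatible route behind an ALB (hostname, port, and model name are assumptions for this deployment):

```python
import requests

# ALB/Service DNS name and model name are placeholders for your deployment.
BASE_URL = "http://llm-service.internal:8000"

payload = {
    "model": "mistral-7b-instruct",
    "prompt": "Write a one-line summary of Amazon ECS.",
    "max_tokens": 64,
    "temperature": 0.2,
}

# vLLM (and other OpenAI-compatible servers) expose /v1/completions; TGI uses /generate instead.
resp = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```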


4. Unique challenges of LLM deployment

4.1 Latency and throughput

  • Large parameter counts increase memory usage and compute time per token, raising end‑to‑end latency.

  • High concurrency can exhaust GPU memory and cause queuing or throttling.

  • Mitigations:

    • Choose the smallest model that meets quality requirements (model right‑sizing).

    • Use optimized hardware (GPU, Inferentia, Trainium) and optimized serving stacks.

    • Enable token streaming where supported to improve perceived latency.

4.2 Cost management and GPU constraints

  • GPUs are expensive and often limited; naive one‑model‑per‑GPU hosting is inefficient.

  • Patterns for cost control:

    • Multi‑model endpoints to share hardware across models.

    • Autoscaling based on real traffic patterns and off‑peak scheduling for batch jobs.

    • Hybrid approach: offload spiky or experimental workloads to Bedrock while keeping steady traffic on self‑hosted endpoints.
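
  The hybrid pattern can be as simple as a routing function in the API layer; a sketch under the assumption that callers tag their workload type (endpoint name, payload schema, and model ID are examples):

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
sm_runtime = boto3.client("sagemaker-runtime")

def generate(prompt: str, workload: str) -> str:
    """Route steady production traffic to a self-hosted SageMaker endpoint and
    spiky or experimental traffic to Bedrock's pay-per-use models."""
    if workload == "steady":
        # Endpoint name and payload format depend on the hosted container (example only).
        resp = sm_runtime.invoke_endpoint(
            EndpointName="fm-demo-endpoint",
            ContentType="application/json",
            Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 256}}),
        )
        return resp["Body"].read().decode()
    # Everything else goes to a managed Bedrock model (example model ID).
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```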

4.3 Context length, token limits, and memory

  • Longer context windows increase memory footprint and compute cost per request.

  • Design considerations:

    • Enforce maximum tokens for prompt and completion, with truncation or summarization of long inputs.

    • Use retrieval‑augmented generation (RAG) to keep prompts small while grounding responses in enterprise data.
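
  A hedged sketch of enforcing a prompt budget before calling the model (the 4‑characters‑per‑token heuristic and the limits are assumptions; a real tokenizer gives exact counts):

```python
MAX_PROMPT_TOKENS = 3000      # illustrative budget for the prompt
MAX_COMPLETION_TOKENS = 500   # illustrative cap passed to the model
CHARS_PER_TOKEN = 4           # rough heuristic; use the model's tokenizer for accuracy

def enforce_budget(context_chunks: list[str], question: str) -> str:
    """Keep the most relevant chunks (assumed pre-ranked) until the prompt budget is hit."""
    budget_chars = MAX_PROMPT_TOKENS * CHARS_PER_TOKEN - len(question)
    kept: list[str] = []
    for chunk in context_chunks:
        if budget_chars - len(chunk) < 0:
            break  # drop or summarize the remainder instead of overflowing the context window
        kept.append(chunk)
        budget_chars -= len(chunk)
    return "\n\n".join(kept) + "\n\nQuestion: " + question
```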

4.4 Reliability, safety, and governance

  • Risks: hallucinations, sensitive data leakage, toxic or non‑compliant content.

  • Controls:

    • Guardrails and content filters (e.g., Bedrock Guardrails, custom moderation services).

    • Human‑in‑the‑loop review for high‑risk actions.

    • Centralized logging and auditing via CloudWatch, X‑Ray, and CloudTrail.
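
  Guardrails attach at invocation time; a sketch of passing a pre‑created guardrail to the Converse API (the guardrail ID/version and model ID are placeholders):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model
    messages=[{"role": "user", "content": [{"text": "How do I reset a user's password?"}]}],
    # The guardrail must be created beforehand; ID and version here are placeholders.
    guardrailConfig={
        "guardrailIdentifier": "gr-example-id",
        "guardrailVersion": "1",
    },
)

# If the guardrail intervenes, stopReason reflects it and the content is blocked or masked.
print(response.get("stopReason"))
print(response["output"]["message"]["content"][0]["text"])
```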


5. Optimized patterns by use case

5.1 Public chat/assistant (low latency, spiky)

  • Requirements: low latency, global access, unpredictable traffic.

  • Suggested pattern:

    • Bedrock model with streaming responses.

    • API Gateway + Lambda front‑end, or direct calls from web/mobile clients.

    • Strong rate limiting and token caps to manage cost.

5.2 Internal enterprise RAG application

  • Requirements: private data, compliance, observability, and explainability.

  • Suggested pattern:

    • Mid‑sized open‑source model on SageMaker real‑time endpoint plus RAG pipeline over enterprise index.

    • CI/CD with SageMaker Pipelines and Model Registry for controlled rollouts.

    • Use VPC‑only access, CloudWatch metrics, and model monitoring.
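
  A minimal sketch of the RAG call path (the retriever is a placeholder for an OpenSearch/Kendra/other enterprise index; the endpoint name and payload schema are assumptions):

```python
import json
import boto3

sm_runtime = boto3.client("sagemaker-runtime")

def retrieve(question: str, k: int = 4) -> list[str]:
    """Placeholder retriever: in practice, query your enterprise vector or search index."""
    return ["<chunk 1 from the enterprise index>", "<chunk 2 from the enterprise index>"]

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer strictly from the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # Endpoint name and payload format depend on the hosted model container (example only).
    resp = sm_runtime.invoke_endpoint(
        EndpointName="rag-llm-endpoint",
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 300}}),
    )
    return resp["Body"].read().decode()
```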

5.3 High‑volume content generation (batch/async)

  • Requirements: high throughput, cost‑efficiency, less strict latency.

  • Suggested pattern:

    • SageMaker batch transform or async inference, or containerized jobs on ECS/EKS.

    • Spot instances or off‑peak scheduling to reduce GPU cost.

    • Distilled or pruned models where quality requirements allow.

5.4 Edge or hybrid deployments

  • Requirements: data locality, low round‑trip latency, partial offline capabilities.

  • Suggested pattern:

    • Smaller models deployed in containers on Outposts, Local Zones, or on‑prem Kubernetes.

    • Central control plane in AWS (Bedrock/SageMaker) for heavier tasks and centralized governance.


6. Design checklist (exam‑oriented)

  • Clarify workload: online vs batch, target latency, TPS, burst patterns.

  • Choose service:

    • Bedrock for managed FMs and minimal operations.

    • SageMaker for fine‑grained control, custom code, and MLOps.

    • ECS/EKS for fully custom stacks or hybrid/edge constraints.

  • Address LLM‑specific concerns:

    • Model size, context length, token limits, safety controls, and monitoring.

  • Plan lifecycle:

    • Experiment → staging → production via CDK/CloudFormation, CodePipeline, and SageMaker Pipelines.