LLMOps & AI Infrastructure

Deployment, monitoring, cost optimization, and scaling LLM applications in production.

12 free resources

  • Docs · 35 min

    Google Cloud - MLOps on Vertex AI

    Pipelines, monitoring, and model registry for LLM apps.

  • Docs · 32 min

    AWS - SageMaker MLOps best practices

    CI/CD for models and multi-account governance.

  • Article · 22 min

    Arize - LLM observability

    Tracing prompts, latency, and drift in production.

  • Docs · 24 min

    WhyLabs - Data and LLM monitoring

    Statistical profiles for inputs and outputs.

  • Article · 20 min

    Honeycomb - OpenTelemetry for AI services

    Correlating traces across model and app tiers.

  • Docs · 26 min

    Datadog - LLM monitoring product docs

    Dashboards for tokens, errors, and spend.

  • Docs · 28 min

    LangSmith - Evaluation & deployment

    Datasets, evaluators, and regression tests for prompts.

  • Article · 24 min

    Weights & Biases - LLMOps platform

    Experiment tracking meets production observability.

  • Docs · 30 min

    Kubernetes - Autoscaling workloads

    Scale inference pods with GPU-aware schedulers.

  • Docs · 32 min

    NVIDIA Triton - Model serving

    Dynamic batching and ensemble pipelines.

  • Article · 18 min

    FinOps Foundation - Unit economics

    Chargeback, forecasting, and cost guardrails for AI.

  • Docs · 22 min

    OpenTelemetry - GenAI semantic conventions

    Standard spans for model calls across vendors.

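Several of these resources (the Datadog spend dashboards and the FinOps unit-economics article in particular) rest on the same arithmetic: price each request from its input and output token counts, roll the cost up per cost center for chargeback, and flag any center that exceeds its budget. The following is a minimal Python sketch of that idea. The model names, per-1K-token prices, team names, and budgets are all hypothetical placeholders, not figures from any listed resource or provider.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; replace with your provider's rates.
PRICES_PER_1K = {
    "model-small": {"input": 0.0005, "output": 0.0015},
    "model-large": {"input": 0.0100, "output": 0.0300},
}


@dataclass
class Usage:
    team: str           # cost center that the request is charged back to
    model: str          # key into PRICES_PER_1K (hypothetical model names)
    input_tokens: int
    output_tokens: int


def request_cost(u: Usage) -> float:
    """Cost of one request in USD: tokens * price-per-1K / 1000, input plus output."""
    price = PRICES_PER_1K[u.model]
    return (u.input_tokens * price["input"] + u.output_tokens * price["output"]) / 1000


def chargeback(usages, budgets):
    """Sum spend per team and list teams whose spend exceeds their budget.

    Teams with no budget entry are never flagged (their budget is treated as unlimited).
    """
    spend = defaultdict(float)
    for u in usages:
        spend[u.team] += request_cost(u)
    over_budget = sorted(
        team for team, total in spend.items() if total > budgets.get(team, float("inf"))
    )
    return dict(spend), over_budget


if __name__ == "__main__":
    usages = [
        Usage("search", "model-small", 1200, 300),
        Usage("support", "model-large", 2000, 800),
        Usage("support", "model-large", 1500, 500),
    ]
    spend, over = chargeback(usages, budgets={"search": 0.01, "support": 0.05})
    for team, total in sorted(spend.items()):
        print(f"{team}: ${total:.5f}")
    print("over budget:", over)  # "support" (about $0.074) exceeds its $0.05 budget
```

Floats are adequate for dashboards and alerting thresholds; for invoicing or chargeback statements that must reconcile to the cent, `decimal.Decimal` is the safer choice.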