Infrastructure articles by Balys Kriksciunas:

- Building an AI Platform Team: Roles, Tools, and Rituals
- GPU FinOps: Reducing Your $10M AI Compute Bill
- Disaggregated Inference: Prefill, Decode, and the New Serving Topology
- Multi-Agent Orchestration Infrastructure: Lessons from Production
- Context Engineering: Storage, Retrieval, and the New Memory Stack
- Agent Infrastructure: What's Different from LLM Serving
- Inference at the Edge: Running LLMs on Consumer GPUs
- Running Sovereign AI: EU and India Infrastructure Playbooks
- MI300X vs H100: AMD's Bet on Inference
- The AI Infrastructure Stack: 2026 Edition
- NVIDIA B200 vs H100: Should You Upgrade?
- Model Evals in Production: Regression Testing Prompts
- LoRA, QLoRA, and PEFT: The Fine-Tuning Infrastructure Guide
- Securing RAG Pipelines: Prompt Injection via Data
- Hybrid Search in Production: BM25 + Dense Retrieval
- Ray Serve vs Kubernetes for Model Serving
- AI FinOps: Tracking Token Spend Across Your Org
- KV Cache Optimization Techniques for LLM Serving
- Speculative Decoding for Production LLMs
- LLM Gateway Patterns: LiteLLM, Portkey, and Kong AI
- FP8 and Quantization: Serving LLMs at Half the Cost
- pgvector at Scale: When Postgres Is Enough
- vLLM vs TGI vs Triton: LLM Inference Server Benchmarks
- Multi-Cloud GPU Strategy: Avoiding Lock-in and Saving 40%
- The State of AI Infrastructure 2025
- Self-Hosting Llama 3: A Production Deployment Guide
- Tracing LLM Applications with OpenTelemetry
- GPU Clouds Compared: CoreWeave, Lambda, Runpod, Fly and the Neoclouds
- PagedAttention Explained: How vLLM Achieves 24x Throughput
- Continuous Batching for LLMs: Why It Matters
- Kubernetes for GPU Workloads: A Primer
- Choosing a Vector Database in 2024: A Practical Guide
- vLLM: The Open-Source Inference Engine Changing LLM Serving
- NVIDIA H100 vs A100: Which GPU Should You Deploy?
- The AI Infrastructure Stack Explained (2024)