Lesson 1: Inference APIs with FastAPI

In research, a machine learning model lives inside an experimental Jupyter Notebook. In production, a model is a high-availability, low-latency utility exposed via standard HTTP endpoints. Let's learn how to leverage FastAPI and Pydantic to build robust, asynchronous inference APIs.

Why FastAPI? It is built on the ASGI (Asynchronous Server Gateway Interface) standard, enabling lightning-fast asynchronous operations matching Node.js and Go speeds, while offering automatic OpenAPI/Swagger documentation generation.

Asynchronous Processing and Concurrency

Traditional WSGI frameworks (like Flask) handle concurrent requests by spinning up operating system threads or processes. If an endpoint is waiting on a slow operation (such as loading a large file or waiting for a model prediction to run on CPU), the executing thread is blocked.

FastAPI uses Python's async and await keywords to implement coroutine-based concurrency. When a request waits for an I/O operation or a heavy model calculation, the event loop pauses that request's coroutine and yields execution to another incoming request. This allows a single worker thread to handle thousands of concurrent requests.

Structured Data Validation with Pydantic

In production APIs, you must never trust user input. A malformed input payload—such as sending strings instead of floating-point features, or an array of the wrong shape—can cause your matrix multiplication calculations (e.g., in PyTorch or NumPy) to crash with a segmentation fault or an uncaught runtime exception.

Pydantic acts as the gatekeeper. By declaring requests using typed schemas, FastAPI validates incoming JSON automatically. If the request does not adhere strictly to the schema, FastAPI rejects it immediately with a standard422 Unprocessable Entity response, ensuring that garbage data never reaches your expensive AI models.

from pydantic import BaseModel, Field
from typing import List

# Enforces that requests contain exactly 4 float elements
class InferenceRequest(BaseModel):
    features: List[float] = Field(..., min_items=4, max_items=4, description="Raw input features")
    scale_factor: float = Field(default=1.0, gt=0, description="Inference scaling coefficient")

Production Lifespan Management

Loading an AI model (like a 500MB neural network) is incredibly expensive, requiring disk read, RAM allocation, and weights transfer to GPU memory. You must load your model **exactly once** during application initialization, caching it in memory so that subsequent API requests run instantaneously.

Modern FastAPI uses the lifespan context manager to manage startup and shutdown events cleanly:

from fastapi import FastAPI
from contextlib import asynccontextmanager

ml_models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load model and cache it in RAM/GPU during startup
    print("Loading ML model parameters...")
    ml_models["predictor"] = lambda x: sum(x) * 1.5
    yield
    # Clean up and release GPU resources on shutdown
    ml_models.clear()
    print("Cleaned up resources.")

app = FastAPI(lifespan=lifespan)

Challenge Tasks: Construct an Inference Server

Implement a production-grade inference server simulator with complete lifespan caching, strict schema validation, and validation exception boundaries:

[ ]Task 1: Define a ModelInput Pydantic model requiring features (list of floats) and a default scale_factor of 1.0.
[ ]Task 2: Define a ModelOutput Pydantic schema returning prediction (float) and status (string).
[ ]Task 3: Initialize a FastAPI application utilizing a lifespan manager that pre-loads mock model weights: [0.25, -0.5, 1.25, 0.0].
[ ]Task 4: Add a POST route at /predict. It must validate that features has exactly 4 elements; if not, raise an HTTPException with code 400.
[ ]Task 5: Compute the inference value inside the route (weighted dot product multiplied by scale_factor) and return the response.
[ ]Task 6: Create a GET endpoint at /health that returns standard metrics: {status: 'healthy', engine: 'FastAPI-ML'}.

from pydantic import BaseModel, Field from typing import List # Enforces that requests contain exactly 4 float elements class InferenceRequest(BaseModel): features: List[float] = Field(..., min_items=4, max_items=4, description="Raw input features") scale_factor: float = Field(default=1.0, gt=0, description="Inference scaling coefficient")

from fastapi import FastAPI from contextlib import asynccontextmanager ml_models = {} @asynccontextmanager async def lifespan(app: FastAPI): # Load model and cache it in RAM/GPU during startup print("Loading ML model parameters...") ml_models["predictor"] = lambda x: sum(x) * 1.5 yield # Clean up and release GPU resources on shutdown ml_models.clear() print("Cleaned up resources.") app = FastAPI(lifespan=lifespan)