Lesson 3: Deploying to Hugging Face & Replicate

Managing virtual machines, network load balancers, and GPU drivers on cloud providers like AWS or GCP takes dedicated engineering effort. Specialized AI platform-as-a-service (PaaS) and serverless providers like Hugging Face Spaces and Replicate streamline deployment by offering automatic HTTPS, scaling to zero, and cold-start optimization.

Hugging Face Spaces

Hugging Face Spaces lets you host interactive ML applications (built with Gradio, Streamlit, or a custom Docker image) directly inside the Hugging Face ecosystem. Spaces use a Git-based workflow: you commit your code, a Dockerfile, or configuration files, and Hugging Face automatically builds, serves, and hosts the application.

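To configure a custom Docker application in a Hugging Face Space, specify the application metadata in a YAML header (front matter) at the top of the README.md file in the root of the Space's repository. Setting sdk: docker selects the Docker runtime, and app_port names the container port that Spaces routes traffic to (7860 if omitted):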

---
title: Stable Diffusion Inference
emoji: 🎨
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
---

# custom-docker-app
Exposing custom FastAPI backend on port 8000.
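
A quick local check can catch a mismatched port before you push. The sketch below is only an illustration: parse_front_matter and check_docker_space are hypothetical helper names, not part of any Hugging Face tooling, and the parser deliberately handles only flat key: value lines rather than full YAML.

```python
def parse_front_matter(readme_text: str) -> dict:
    """Return the flat key/value pairs between the leading '---' lines.

    Simplified on purpose: handles single-line `key: value` pairs only,
    not nested YAML, lists, or quoted strings. Values stay as strings.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # closing '---' was never found


def check_docker_space(meta: dict, expected_port: int) -> list:
    """List the problems that would stop a Docker Space from matching the app."""
    problems = []
    if meta.get("sdk") != "docker":
        problems.append("sdk must be 'docker'")
    if meta.get("app_port") != str(expected_port):
        problems.append(f"app_port should be {expected_port}")
    return problems


README = """---
title: Stable Diffusion Inference
sdk: docker
app_port: 8000
pinned: false
---

# custom-docker-app
"""

meta = parse_front_matter(README)
problems = check_docker_space(meta, expected_port=8000)
print(problems or "front matter matches the app's port")
```

The expected_port argument should be whatever port the server inside the container actually binds to; if the two disagree, the Space builds but requests never reach the app.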

Replicate and Cog Containerization

Replicate is a serverless platform designed specifically to run large deep learning models at scale. Replicate handles autoscaling for you: it scales down to zero instances when idle (so you are not paying for a GPU that sits unused) and scales back up as requests arrive, at the cost of a cold start while a new instance boots and loads the model.
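
To make the cost trade-off concrete, here is a rough arithmetic sketch. Everything in it is an assumption for illustration: the hourly rate and traffic figures are made up, the function names are hypothetical, and it assumes billing covers busy seconds plus cold-start seconds. Check each platform's current pricing before relying on numbers like these.

```python
def always_on_cost(hourly_rate: float, hours: float) -> float:
    """Cost of keeping one GPU instance running for the whole period."""
    return hourly_rate * hours


def scale_to_zero_cost(
    hourly_rate: float,
    busy_seconds: float,
    cold_starts: int = 0,
    cold_start_seconds: float = 0.0,
) -> float:
    """Cost when billing stops while idle.

    Assumption: cold-start time is billed at the same rate as inference time.
    """
    billed_seconds = busy_seconds + cold_starts * cold_start_seconds
    return hourly_rate * billed_seconds / 3600


# Hypothetical day: 3000 s of real inference and 10 cold starts of 60 s each.
HOURLY_RATE = 2.0  # made-up price per GPU hour
persistent = always_on_cost(HOURLY_RATE, hours=24)
serverless = scale_to_zero_cost(
    HOURLY_RATE, busy_seconds=3000, cold_starts=10, cold_start_seconds=60
)
print(f"always-on: {persistent:.2f}, scale-to-zero: {serverless:.2f}")
```

The gap narrows as traffic becomes steadier: once an instance is busy most of the day, scale-to-zero saves little and the cold-start latency becomes the main cost.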

To deploy to Replicate, you use Cog, their open-source tool that wraps ML models in standardized containers. Cog requires two main components:

cog.yaml: A YAML schema that replaces Dockerfiles, defining system-level, CUDA, and Python library dependencies.
predict.py: A structured Python file specifying how your model is loaded and how it runs predictions.

# Example cog.yaml configuration
build:
  gpu: true
  cuda: "11.8"
  python_version: "3.10"
  python_packages:
    - "torch==2.0.1"
    - "transformers==4.30.2"
predict: "predict.py:Predictor"

Defining the Replicate Predictor

The predict.py model script exposes a uniform interface. The setup method runs once when the container starts and loads the weights into GPU memory; predict is then called once per request and handles a single prediction:

from cog import BasePredictor, Input, Path
import torch

class Predictor(BasePredictor):
    def setup(self):
        """Load model weights once and move them to the GPU if one is available"""
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Assumes weights.pt holds a fully pickled model object, not just a state_dict
        self.model = torch.load("weights.pt", map_location=self.device)
        self.model.eval()

    def predict(
        self,
        image: Path = Input(description="Input image file"),
        scale: float = Input(description="Upscale multiplier", default=2.0),
    ) -> Path:
        """Run a single model inference and return the path of the result file"""
        # Inference computations occur here...
        output_path = Path("/tmp/output.png")  # placeholder: write the result to this file
        return output_path
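
The setup-once, predict-per-request lifecycle can be tried locally without Cog installed. The sketch below is a stand-in, not Cog's actual runtime: LifecyclePredictor and serve are hypothetical names, the weight value is invented, and the class only mirrors the two-method shape of a Cog predictor.

```python
class LifecyclePredictor:
    """Plain-Python stand-in with the same setup/predict shape (no cog import)."""

    def __init__(self):
        self.setup_calls = 0
        self.weights = None

    def setup(self):
        # Expensive work (loading weights) belongs here so it happens once.
        self.setup_calls += 1
        self.weights = {"bias": 0.5}  # invented stand-in for real model weights

    def predict(self, value: float, scale: float = 2.0) -> float:
        # Cheap per-request work that reuses whatever setup() loaded.
        return value * scale + self.weights["bias"]


def serve(predictor, requests):
    """Mimic a worker: call setup() once, then predict() for each request."""
    predictor.setup()
    return [predictor.predict(**request) for request in requests]


predictor = LifecyclePredictor()
results = serve(predictor, [{"value": 1.0}, {"value": 2.0, "scale": 3.0}])
print(results, "setup calls:", predictor.setup_calls)
```

Keeping the weight loading out of predict is what makes warm requests fast; only the first request after a cold start pays for setup.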

Comparative Analysis

Platform Feature	Hugging Face Spaces	Replicate Serverless
Execution Type	Persistent instances (always-on or sleep schedules)	Serverless containers (scale to zero, with cold starts)
Scale Dynamics	Manual vertical or horizontal scaling	Elastic auto-scaling from 0 to hundreds of GPUs
Packaging Format	Standard Dockerfiles or built-in SDKs (Gradio)	Cog containers (cog.yaml plus predict.py)
Primary Use Case	Interactive UI demos, portals, lightweight APIs	Production pipelines, large batch jobs, high scale APIs

Challenge Tasks: Build Serverless Metadata

Prepare the structural blueprints to support multi-platform deployments for a serverless diffusion model:

[ ] Task 1: Construct the front-matter YAML structure for a Hugging Face Space README.md, setting SDK to docker and port to 8000.
[ ] Task 2: Define a cog.yaml file detailing GPU settings, CUDA version 11.8, and package dependencies: torch, cog.
[ ] Task 3: Implement the predict.py script by inheriting from BasePredictor.
[ ] Task 4: Write the setup() routine to create a mock dictionary containing weights: { 'gamma': 0.82, 'beta': 0.15 }.
[ ] Task 5: Implement the predict() function parameterizing input arguments (e.g. integer seed and float prompt_weight) and returning the calculated inference result.
