Lesson 4: Creative Generative Project

Construct a creative multimodal content generator combining audio transcription and image diffusion.

Course 6 • Lesson 4 (Project)

Creative Multi-Stage Generative Pipeline

Synthesize your knowledge of Whisper, Stable Diffusion, and Gemini models to design an automated creative workspace. Program an autonomous pipeline that listens, conceptualizes scripts, synthesizes images, and runs self-correcting alignment checks.

Whisper Audio Parsing • Gemini Director Prompting • Stable Diffusion UNet • Gemini Visual QA Loop • FastAPI Orchestrator Project

1. Architectural Blueprint: The Multi-Stage Agentic Loop

Production AI pipelines are rarely single-model systems. Many function as **agentic orchestrations** in which specialized models are chained, each one consuming the output of the stage before it:

Autonomous System Diagram

1. AUDIO CAPTURE (Speech-To-Text)

The user's voice file is resampled to 16 kHz, and Whisper converts the waveform array into a text transcript.

2. CREATIVE DIRECTOR (Gemini LLM)

Gemini reads the raw transcript, writes a visually dense prompt, suggests a color palette as hex codes, and formats the scene parameters.

3. SYNTHESIS ENGINE (Denoising Latents)

Stable Diffusion conditions its U-Net on the prompt's text embeddings through cross-attention, iteratively denoises the latents, and decodes them back to pixels.

4. QUALITY GUARDRAIL (Feedback Loop)

Gemini Visual QA compares the synthesized image against the scene description. A high alignment score passes; a low score returns a corrected prompt and triggers a rerun.
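The four stages above can be sketched as plain functions chained by one orchestrator. This is a minimal sketch rather than the lesson's implementation: every stage is a hypothetical stub (`fake_transcribe`, `fake_direct`, `fake_synthesize`, and `fake_review` are illustrative names, not real model calls), so the data flow is visible without loading any weights.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    transcription: str
    prompt: str
    image: bytes
    score: int


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    direct: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    review: Callable[[bytes, str], int],
) -> PipelineResult:
    """Chain the four stages; each consumes the previous stage's output."""
    text = transcribe(audio)      # 1. audio -> transcript
    prompt = direct(text)         # 2. transcript -> diffusion prompt
    image = synthesize(prompt)    # 3. prompt -> image bytes
    score = review(image, text)   # 4. image + transcript -> alignment score
    return PipelineResult(text, prompt, image, score)


# Hypothetical stand-ins for Whisper, Gemini, and Stable Diffusion.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_direct(text: str) -> str:
    return f"highly detailed illustration of {text}"


def fake_synthesize(prompt: str) -> bytes:
    return prompt.encode("utf-8")


def fake_review(image: bytes, text: str) -> int:
    return 100 if text.encode("utf-8") in image else 0


result = run_pipeline(
    b"a red fox in snow", fake_transcribe, fake_direct, fake_synthesize, fake_review
)
print(result.prompt, result.score)
```

Passing the stages in as callables keeps the orchestrator independent of any particular model, so each stub can later be swapped for the real Whisper, Gemini, or Stable Diffusion call.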

2. Visual Alignment Analysis & The Self-Correction Loop

Traditional generative pipelines are "fire-and-forget". If Stable Diffusion outputs a messy artifact or omits a noun that the prompt asked for, the developer has to notice the problem and rerun generation manually.

By introducing **Multimodal Visual QA Guardrails**, we automate that validation. Gemini acts as the critic: we pass it the original script prompt together with the newly synthesized image, and it compares the two and returns a structured output:

// Structured Alignment Verification Schema

alignment_score: 0 to 100 (how closely the generated image matches the text prompt)

missing_elements: string[] (elements named in the prompt that are missing from the image)

correction_prompt: string (revised prompt text to use when the score is below 85)

If the alignment_score falls below a chosen threshold (85 out of 100 in this lesson), the system substitutes the correction_prompt for the original prompt, raises the CFG (guidance) scale, and reruns Stable Diffusion.
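The gating decision described above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: `parse_report` and `plan_retry` are hypothetical helper names, the JSON string stands in for a critic reply, and the `base_cfg` and `cfg_step` values are assumptions chosen so that one retry lands on a guidance scale of 9.0.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAReport:
    alignment_score: int
    missing_elements: list[str]
    correction_prompt: str


def parse_report(raw_json: str) -> QAReport:
    """Parse the critic's JSON reply and reject out-of-range scores."""
    data = json.loads(raw_json)
    report = QAReport(
        alignment_score=int(data["alignment_score"]),
        missing_elements=list(data["missing_elements"]),
        correction_prompt=str(data["correction_prompt"]),
    )
    if not 0 <= report.alignment_score <= 100:
        raise ValueError(f"alignment_score out of range: {report.alignment_score}")
    return report


def plan_retry(
    report: QAReport,
    threshold: int = 85,     # pass mark taken from the lesson text
    base_cfg: float = 7.5,   # assumed starting guidance scale
    cfg_step: float = 1.5,   # assumed increase applied on a retry
) -> Optional[dict]:
    """Return parameters for a rerun, or None when the image passes."""
    if report.alignment_score >= threshold:
        return None
    return {
        "prompt": report.correction_prompt,
        "guidance_scale": base_cfg + cfg_step,
    }


# Hypothetical critic reply, used only to exercise the helpers.
raw = (
    '{"alignment_score": 62, "missing_elements": ["lantern"], '
    '"correction_prompt": "a fox holding a glowing lantern, snowy forest"}'
)
plan = plan_retry(parse_report(raw))
print(plan)
```

Returning `None` for a passing image keeps the caller's branch simple: a rerun happens only when a plan exists.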

FastAPI Pipeline Walkthrough: Production `pipeline.py`

Study the complete FastAPI application below. It orchestrates the pipeline: it accepts the uploaded audio file, calls each model in turn, and runs the self-correcting Visual QA feedback loop.

from fastapi import FastAPI, UploadFile, File, HTTPException
from pydantic import BaseModel
import torch
import io
import librosa
from google import genai
from google.genai import types
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from diffusers import StableDiffusionPipeline

app = FastAPI(title="Autonomous Multimodal Pipeline")

# Initialize SDK clients (genai.Client() reads the Gemini API key from the environment)
genai_client = genai.Client()
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# float16 halves GPU memory use, but many CPU kernels lack half-precision support
DTYPE = torch.float16 if DEVICE == "cuda" else torch.float32

print(f"Loading pipeline weights on {DEVICE}...")
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").to(DEVICE)
diffusion_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=DTYPE).to(DEVICE)

class SceneOutput(BaseModel):
    scene_description: str
    diffusion_prompt: str
    color_palette: str

class QAReport(BaseModel):
    alignment_score: int
    missing_elements: list[str]
    correction_prompt: str

def transcribe_audio(audio_bytes: bytes) -> str:
    """Stage 1: Whisper voice transcribing"""
    # Load raw buffer and resample to 16kHz
    y, sr = librosa.load(io.BytesIO(audio_bytes), sr=16000)
    input_features = whisper_processor(y, sampling_rate=16000, return_tensors="pt").input_features.to(DEVICE)
    
    predicted_ids = whisper_model.generate(input_features)
    transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]

def creative_director_expansion(transcription: str) -> SceneOutput:
    """Stage 2: Gemini Scene Expansion & Prompt Planning"""
    prompt = f"""
    You are a Creative Visual Director. Take this story outline transcript: "{transcription}"
    Generate a detailed scene description, a rich text prompt for Stable Diffusion, and a suggested color hex palette.
    Return strictly structured JSON format conforming to the SceneOutput schema.
    """
    response = genai_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=SceneOutput,
        ),
    )
    return SceneOutput.model_validate_json(response.text)

def run_visual_qa(image_bytes: bytes, original_script: str) -> QAReport:
    """Stage 4: Multimodal Quality Assurance Verification"""
    image_part = types.Part.from_bytes(data=image_bytes, mime_type="image/png")
    prompt = f"""
    Evaluate the alignment of this synthesized image against this target script: "{original_script}".
    Check for missing elements and detail compliance. Return an alignment score from 0-100.
    Conform exactly to the QAReport schema.
    """
    response = genai_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[image_part, prompt],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=QAReport,
        ),
    )
    return QAReport.model_validate_json(response.text)

@app.post("/api/generate")
async def run_generative_pipeline(audio_file: UploadFile = File(...)):
    try:
        # Read WAV byte payload
        audio_bytes = await audio_file.read()
        
        # 1. Transcribe voice
        transcription = transcribe_audio(audio_bytes)
        
        # 2. Expand prompts with Gemini
        creative_data = creative_director_expansion(transcription)
        
        # 3. Generate Image using Stable Diffusion (First pass)
        print(f"Generating image with prompt: {creative_data.diffusion_prompt}")
        output_image = diffusion_pipe(creative_data.diffusion_prompt, num_inference_steps=25).images[0]
        
        # Save image memory buffer to pass into Gemini
        img_buffer = io.BytesIO()
        output_image.save(img_buffer, format="PNG")
        img_bytes_payload = img_buffer.getvalue()
        
        # 4. Multimodal Quality Gate (first check)
        qa_report = run_visual_qa(img_bytes_payload, creative_data.scene_description)
        
        # Track which prompt produced the image that is finally reported
        final_prompt = creative_data.diffusion_prompt

        # Self-Correction Loop: if alignment is low, regenerate once with the corrected prompt
        if qa_report.alignment_score < 85:
            final_prompt = qa_report.correction_prompt
            print(f"Alignment low ({qa_report.alignment_score}/100). Rerunning with corrected prompt: {final_prompt}")
            # Raise guidance_scale above the pipeline default (7.5) for stronger prompt adherence
            corrected_image = diffusion_pipe(final_prompt, num_inference_steps=30, guidance_scale=9.0).images[0]
            
            img_buffer = io.BytesIO()
            corrected_image.save(img_buffer, format="PNG")
            img_bytes_payload = img_buffer.getvalue()
            
            # Re-evaluate the corrected image against the same scene description
            qa_report = run_visual_qa(img_bytes_payload, creative_data.scene_description)
            output_image = corrected_image

        return {
            "status": "success",
            "transcription": transcription,
            "expanded_scene": creative_data.scene_description,
            "final_prompt": final_prompt,
            "color_palette": creative_data.color_palette,
            "qa_score": qa_report.alignment_score,
            "qa_feedback": qa_report.correction_prompt if qa_report.alignment_score < 85 else "Passed alignment threshold",
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
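The handler above regenerates at most once. One way to generalize it is a bounded loop that retries up to a fixed number of attempts and keeps the best-scoring image. The sketch below is an assumption-laden illustration, not part of the lesson's script: `generate_with_corrections`, `max_attempts`, `base_cfg`, and `cfg_step` are hypothetical names and values, and the two stub functions stand in for Stable Diffusion and the Gemini critic.

```python
from typing import Callable, Tuple


def generate_with_corrections(
    prompt: str,
    generate: Callable[[str, float], bytes],
    evaluate: Callable[[bytes], Tuple[int, str]],
    threshold: int = 85,
    max_attempts: int = 3,
    base_cfg: float = 7.5,
    cfg_step: float = 1.5,
) -> dict:
    """Retry generation until the score passes or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best = None
    attempts = 0
    cfg = base_cfg
    for _ in range(max_attempts):
        attempts += 1
        image = generate(prompt, cfg)
        score, correction = evaluate(image)
        # Keep the best result so a worse retry never replaces a better image.
        if best is None or score > best["score"]:
            best = {"image": image, "score": score, "prompt": prompt}
        if score >= threshold:
            break
        # Next attempt: use the critic's prompt and a stronger guidance scale.
        prompt, cfg = correction, cfg + cfg_step
    return {**best, "attempts": attempts}


# Hypothetical stubs: the "image" is just the encoded prompt.
def stub_generate(prompt: str, guidance_scale: float) -> bytes:
    return prompt.encode("utf-8")


def stub_evaluate(image: bytes) -> Tuple[int, str]:
    if b"lantern" in image:
        return 90, ""
    return 60, image.decode("utf-8") + ", holding a lantern"


outcome = generate_with_corrections("a fox in snow", stub_generate, stub_evaluate)
print(outcome["attempts"], outcome["score"], outcome["prompt"])
```

Capping the attempts bounds both latency and API cost, since each retry costs one diffusion run plus one critic call.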

Workspace Project Milestones

Build your multi-model backend orchestrator server by executing these milestones. Check items off as they successfully build within your active repository.

  • Milestone 1: Complete FastAPI Server Setup (`app.py`)

    Set up a clean FastAPI backend application with an endpoint, /api/generate, that accepts an uploaded WAV file as an UploadFile.

  • Milestone 2: Whisper Audio Resampler & Transcription

    Integrate librosa in the backend handler to resample uploaded audio to exactly 16000 Hz, pass the resulting array to a pre-trained Whisper model, and return the decoded text.

  • Milestone 3: Gemini Visual Director Structured Schema Parser

    Design a Pydantic validation class containing strings for scene details, prompt keywords, and palettes. Instantiate a Gemini client to output JSON matching this structured schema.

  • Milestone 4: Stable Diffusion UNet Denoising Integration

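    Feed the visual prompt written by the Gemini Director stage directly into your local PyTorch Stable Diffusion pipeline and run 25 denoising steps to produce the image. Note that the walkthrough code uses the pipeline's default scheduler; an Euler solver would have to be configured explicitly.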

  • Milestone 5: Multimodal Quality Control QA Evaluation Loop

    Pass the generated image to Gemini as a binary buffer alongside the text script and request an alignment score. Implement conditional routing logic that automatically reruns synthesis when the score falls below the threshold.
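For Milestones 1 and 2, an optional pre-check is to inspect the uploaded WAV header before handing the bytes to librosa. The sketch below is an assumption, not part of the lesson's code: `describe_wav` and `make_silent_wav` are hypothetical helpers built on Python's standard `wave` module, and the silent clip exists only to exercise them.

```python
import io
import wave

TARGET_RATE = 16_000  # sample rate the Whisper stage expects


def describe_wav(audio_bytes: bytes) -> dict:
    """Read basic facts from a WAV header without decoding the samples."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
            "needs_resample": rate != TARGET_RATE,
        }


def make_silent_wav(rate: int, seconds: float) -> bytes:
    """Build a mono 16-bit silent WAV in memory (test fixture only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    return buffer.getvalue()


info = describe_wav(make_silent_wav(44_100, 0.5))
print(info)
```

A check like this lets the endpoint reject non-WAV uploads early (`wave.open` raises `wave.Error` on a malformed header) and log whether resampling will be needed.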