Synthesize your knowledge of Whisper, Stable Diffusion, and Gemini models to design an automated creative workspace. Program an autonomous pipeline that listens, conceptualizes scripts, synthesizes images, and runs self-correcting alignment checks.
Production AI pipelines are rarely single-model systems. The most robust AI products function as **agentic orchestrations** where specialized systems are connected sequentially:
Stage 1 (Whisper): The user's voice recording, captured at or resampled to 16 kHz, is transcribed by Whisper from a waveform array into plain text.
Stage 2 (Gemini): Gemini parses the raw transcript, writes a visually dense image prompt, suggests a hex color palette, and formats the scene parameters.
Stage 3 (Stable Diffusion): The prompt's text embeddings condition the U-Net through cross-attention while it denoises latents, which are then decoded back into pixels.
Stage 4 (Gemini Visual QA): Gemini compares the synthesized image against the scene script. High alignment passes; a low score returns a corrected prompt and triggers a rerun.
Traditional generative pipelines are "fire-and-forget": if Stable Diffusion outputs a messy artifact or skips a critical descriptive noun, the developer has to notice the problem and rerun generation manually.
By introducing **Multimodal Visual QA Guardrails**, we automate that validation. Gemini acts as the critic: we pass it the original script prompt and the newly synthesized image, and it compares the two and returns a structured output:
alignment_score: 0 to 100 (rating precision of generated layout to text prompt)
missing_elements: string[] (list of conceptual nouns missing from spatial field)
correction_prompt: string (re-adjusted prompt text optimizing weights if score < 85)
If alignment_score falls below a chosen threshold (85 in this example), the system injects the returned correction_prompt, raises the CFG (guidance) scale, and reruns Stable Diffusion before evaluating the new image again.
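The retry rule described above can be sketched as a bounded loop. This is an illustrative sketch under stated assumptions rather than the application's actual code: `generate` and `critique` are hypothetical callables standing in for Stable Diffusion and the Gemini critic, and the `max_retries` cap and `cfg_step` increment are assumed values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class QAResult:
    alignment_score: int
    missing_elements: list[str] = field(default_factory=list)
    correction_prompt: str = ""


def self_correct(
    script: str,
    prompt: str,
    generate: Callable[[str, float], str],
    critique: Callable[[str, str], QAResult],
    threshold: int = 85,
    cfg: float = 7.5,       # matches the default guidance_scale in diffusers
    cfg_step: float = 1.5,  # assumed increment per retry
    max_retries: int = 2,   # assumed cap so the loop always terminates
):
    image = generate(prompt, cfg)
    report = critique(image, script)
    attempts = 1
    while report.alignment_score < threshold and attempts <= max_retries:
        # Prefer the critic's rewritten prompt; keep the old one if it is empty.
        prompt = report.correction_prompt or prompt
        cfg += cfg_step
        image = generate(prompt, cfg)
        report = critique(image, script)
        attempts += 1
    return image, report, cfg, attempts


# Toy stand-ins: the "image" is a string, and the critic only looks for "fox".
def demo_generate(prompt: str, cfg: float) -> str:
    return f"{prompt}|cfg={cfg}"


def demo_critique(image: str, script: str) -> QAResult:
    if "fox" in image:
        return QAResult(alignment_score=90)
    return QAResult(60, ["fox"], f"{script}, detailed")


image, report, cfg, attempts = self_correct(
    "a red fox in a snowy forest", "snowy forest", demo_generate, demo_critique
)
print(report.alignment_score, cfg, attempts)
```

Capping the retries matters in practice: a critic that never returns a passing score would otherwise keep the GPU busy indefinitely.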
Analyze the complete Python FastAPI application script. This script orchestrates the pipeline, binding incoming raw files, running model interfaces, and executing the self-correcting Visual QA feedback loops.
from fastapi import FastAPI, UploadFile, File, HTTPException
from pydantic import BaseModel
import torch
import os
import io
import numpy as np
import librosa
from google import genai
from google.genai import types
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from diffusers import StableDiffusionPipeline
app = FastAPI(title="Autonomous Multimodal Pipeline")
# Initialize the Gemini SDK client (the API key is read from the environment, e.g. GEMINI_API_KEY)
genai_client = genai.Client()
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Instantiating Pipeline weights on {DEVICE}...")
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").to(DEVICE)
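# Half precision is intended for GPU inference; fall back to float32 when running on CPU
DTYPE = torch.float16 if DEVICE == "cuda" else torch.float32
diffusion_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=DTYPE).to(DEVICE)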
class SceneOutput(BaseModel):
    scene_description: str
    diffusion_prompt: str
    color_palette: str

class QAReport(BaseModel):
    alignment_score: int
    missing_elements: list[str]
    correction_prompt: str
def transcribe_audio(audio_bytes: bytes) -> str:
    """Stage 1: Whisper voice transcribing"""
    # Load raw buffer and resample to 16kHz
    y, sr = librosa.load(io.BytesIO(audio_bytes), sr=16000)
    input_features = whisper_processor(y, sampling_rate=16000, return_tensors="pt").input_features.to(DEVICE)
    predicted_ids = whisper_model.generate(input_features)
    transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]
def creative_director_expansion(transcription: str) -> SceneOutput:
    """Stage 2: Gemini Scene Expansion & Prompt Planning"""
    prompt = f"""
    You are a Creative Visual Director. Take this story outline transcript: "{transcription}"
    Generate a detailed scene description, a rich text prompt for Stable Diffusion, and a suggested color hex palette.
    Return strictly structured JSON format conforming to the SceneOutput schema.
    """
    # The Python SDK method is snake_case: generate_content (generateContent is the JavaScript spelling)
    response = genai_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=SceneOutput,
        ),
    )
    return SceneOutput.model_validate_json(response.text)
def run_visual_qa(image_bytes: bytes, original_script: str) -> QAReport:
    """Stage 4: Multimodal Quality Assurance Verification"""
    image_part = types.Part.from_bytes(data=image_bytes, mime_type="image/png")
    prompt = f"""
    Evaluate the alignment of this synthesized image against this target script: "{original_script}".
    Check for missing elements and detail compliance. Return an alignment score from 0-100.
    Conform exactly to the QAReport schema.
    """
    # The Python SDK method is snake_case: generate_content
    response = genai_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[image_part, prompt],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=QAReport,
        ),
    )
    return QAReport.model_validate_json(response.text)
@app.post("/api/generate")
async def run_generative_pipeline(audio_file: UploadFile = File(...)):
    try:
        # Read WAV byte payload
        audio_bytes = await audio_file.read()

        # 1. Transcribe voice
        transcription = transcribe_audio(audio_bytes)

        # 2. Expand prompts with Gemini
        creative_data = creative_director_expansion(transcription)
        # Track the prompt that actually produced the returned image
        final_prompt = creative_data.diffusion_prompt

        # 3. Generate Image using Stable Diffusion (First pass)
        print(f"Generating image with prompt: {final_prompt}")
        output_image = diffusion_pipe(final_prompt, num_inference_steps=25).images[0]

        # Save image memory buffer to pass into Gemini
        img_buffer = io.BytesIO()
        output_image.save(img_buffer, format="PNG")
        img_bytes_payload = img_buffer.getvalue()

        # 4. Multimodal Safety & Quality Gate (First check)
        qa_report = run_visual_qa(img_bytes_payload, creative_data.scene_description)

        # Self-Correction Loop: If prompt alignment is low, regenerate once with optimized prompt
        if qa_report.alignment_score < 85:
            print(f"Alignment low ({qa_report.alignment_score}). Rerunning with corrected prompt: {qa_report.correction_prompt}")
            final_prompt = qa_report.correction_prompt
            corrected_image = diffusion_pipe(final_prompt, num_inference_steps=30, guidance_scale=9.0).images[0]
            img_buffer = io.BytesIO()
            corrected_image.save(img_buffer, format="PNG")
            img_bytes_payload = img_buffer.getvalue()

            # Re-evaluate
            qa_report = run_visual_qa(img_bytes_payload, creative_data.scene_description)
            output_image = corrected_image

        return {
            "status": "success",
            "transcription": transcription,
            "expanded_scene": creative_data.scene_description,
            "final_prompt": final_prompt,
            "color_palette": creative_data.color_palette,
            "qa_score": qa_report.alignment_score,
            # A score at or above the threshold is a pass, not necessarily a perfect match
            "qa_feedback": qa_report.correction_prompt if qa_report.alignment_score < 85 else "Alignment threshold met",
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
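To try the endpoint without a microphone, a test payload can be synthesized locally. The sketch below builds a one-second, 16 kHz mono WAV in memory using only the standard library. The helper name `make_test_wav` and the 440 Hz tone are illustrative choices, and a pure tone will not produce a meaningful transcript, so this only exercises the upload and decoding path.

```python
import io
import math
import struct
import wave


def make_test_wav(seconds: float = 1.0, rate: int = 16000, freq: float = 440.0) -> bytes:
    """Return the bytes of a mono 16-bit PCM WAV file containing a sine tone."""
    n_samples = int(seconds * rate)
    frames = bytearray()
    for i in range(n_samples):
        # Scale to 30% of the 16-bit range to keep the tone quiet.
        sample = int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / rate))
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
    return buf.getvalue()


payload = make_test_wav()

# Read the header back to confirm the format before uploading.
with wave.open(io.BytesIO(payload), "rb") as wav:
    info = (wav.getnchannels(), wav.getframerate(), wav.getnframes())
print(info)
```

With the bytes written to a file such as `tone.wav` and the server running locally, the upload can be posted as a multipart form field named `audio_file` (matching the endpoint parameter), for example with `curl -F "audio_file=@tone.wav" http://localhost:8000/api/generate`. The host and port there assume uvicorn's defaults.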
Build your multi-model backend orchestrator server by executing these milestones. Check items off as they successfully build within your active repository.
Set up a clean FastAPI backend application layer containing an endpoint /api/generate configured to receive incoming UploadFile WAV buffers.
Integrate librosa in the backend handler to resample uploaded audio buffers to exactly 16000 Hz, pass the array to a pre-trained Whisper model, and return the decoded text.
Design a Pydantic validation class containing strings for scene details, prompt keywords, and palettes. Instantiate a Gemini client to output JSON matching this structured schema.
Feed the visual prompt synthesized by the Gemini Director stage directly into your local PyTorch Stable Diffusion pipeline and run 25 denoising steps to produce the image. The script as written uses the pipeline's default scheduler; an Euler scheduler would have to be configured explicitly.
Pass the generated image to Gemini as a binary buffer alongside the text script and request an alignment score. Implement conditional routing logic to automatically rerun synthesis when the score falls below the threshold.