Course 6 • Lesson 3

Multimodal Gemini Prompting & Engineering

Explore native multimodal engineering. Learn how Gemini models process image, audio, and text inputs as a single interleaved token sequence, and how to get structured output using typed response schemas.

Native Multimodality • Interleaved Context • Response Schema (JSON) • Large File Upload APIs • Structured OCR Sandbox

1. Architectural Evolution: Native Multimodality

Historically, many multimodal systems were built with Late Fusion: separate pre-trained models (e.g. ResNet for vision, BERT for text) processed their inputs independently, and the resulting latent representations were merged, typically through linear projection layers, before decoding.

Late Fusion (Stitched Networks)

Stitching pre-trained models together limits semantic coherence. The text model cannot steer the vision model during feature extraction, so information the query needs can be lost at the cross-modal boundary (e.g., when mapping a textual query to a specific image region).
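
To make the limitation concrete, here is a toy sketch of late fusion in plain Python. The encoders, dimensions, and weights are invented for illustration and stand in for real pre-trained models. The point is only that the image features are computed without ever seeing the text query.

```python
# Toy late-fusion pipeline. All functions and numbers here are hypothetical
# stand-ins for pre-trained encoders, not a real model.

def encode_image(pixels: list[float]) -> list[float]:
    """Stand-in vision encoder: summarizes the image as [mean, max]."""
    return [sum(pixels) / len(pixels), max(pixels)]

def encode_text(query: str) -> list[float]:
    """Stand-in text encoder: [word count, character count]."""
    return [float(len(query.split())), float(len(query))]

def project(vector: list[float], weights: list[list[float]]) -> list[float]:
    """Linear projection: one output value per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def late_fusion(pixels: list[float], query: str) -> list[float]:
    image_features = encode_image(pixels)   # computed without the query
    text_features = encode_text(query)      # computed without the image
    fused = image_features + text_features  # concatenate the two latents
    # Hypothetical 2x4 projection that merges the concatenated features
    weights = [[1.0, 0.0, 0.5, 0.0],
               [0.0, 1.0, 0.0, 0.1]]
    return project(fused, weights)

pixels = [0.0, 0.5, 1.0, 0.5]
print(encode_image(pixels))
print(late_fusion(pixels, "what is top left"))
```

Whatever the query says, `encode_image` returns the same features, so any region-specific detail it discarded cannot be recovered after the merge.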

Native / Early Fusion (Gemini Architecture)

Gemini is trained from the ground up across modalities in a unified Transformer model. Video frames, audio, and text tokens are projected into a single shared embedding space, so the attention layers can relate temporal, spatial, and linguistic information within one sequence.

2. Visual Tokens & Spatial Patches

To convert an image into tokens that a Transformer can process, Gemini uses visual encoders similar to Vision Transformers (ViT):

Spatial Patching: The image is split into a grid of small squares (e.g. $16 \times 16$ pixel patches).
Linear Projection: Each patch is flattened into a 1D vector and multiplied by a trainable projection matrix to generate visual token embeddings.
Interleaved Insertion: The visual tokens are placed in sequence order among the text tokens. For example, the sequence "Check this layout: [Visual Tokens 1-256] make it mobile responsive." flows through the attention blocks as one contiguous token sequence.
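
The three steps above can be sketched in plain Python. This is a toy illustration, not Gemini's actual encoder: the 4×4 grayscale image, the 2×2 patch size, the embedding width of 3, and the fixed projection weights are all made up so the arithmetic is easy to follow. The same arithmetic would give the 256 visual tokens in the example sequence if one assumes a 256×256 image with 16×16 patches, since (256 / 16)² = 256.

```python
# Toy ViT-style tokenizer: patchify, project, interleave.
# Sizes and weights are illustrative assumptions, not Gemini internals.

def split_into_patches(image, patch):
    """Split a 2D grid into non-overlapping patch x patch squares, each flattened to 1D."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches

def project(flat_patch, weights):
    """Multiply a flattened patch by a (patch_len x embed_dim) projection matrix."""
    embed_dim = len(weights[0])
    return [sum(flat_patch[i] * weights[i][j] for i in range(len(flat_patch)))
            for j in range(embed_dim)]

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 "pixels", values 0..15
patches = split_into_patches(image, patch=2)

# In a real model this matrix is learned; here it is fixed (4 inputs -> 3 dims).
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
visual_tokens = [project(p, weights) for p in patches]

# Interleave: text tokens before, visual tokens in the middle, text tokens after.
sequence = ["Check", "this", "layout:"] + visual_tokens + ["make", "it", "responsive."]

print(len(patches))      # 4 patches from a 4x4 image with 2x2 patches
print(patches[0])        # [0, 1, 4, 5]
print(visual_tokens[0])  # [5, 6, 9]
print(len(sequence))     # 10
```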

3. Programmatic Control: Schema Enforcement & Safety Settings

When building AI engineering tools, output consistency is critical. A standard text completion prompt is prone to hallucinations or conversational filler (e.g. "Here is the parsed JSON you asked for...").

Gemini addresses this at the API level with a typed responseSchema. When a schema is set, decoding is constrained so that the output conforms to the declared JSON structure (braces, commas, colons, and correctly typed values), which means the raw response text should parse as JSON. The schema constrains the shape of the output, not the accuracy of the extracted values, so the parsed result is still worth validating.
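
A defensive check after parsing is cheap even when a schema is enforced. The sketch below uses only the standard library. The `validate_receipt` helper and the sample payload are hypothetical and simply mirror a receipt shape with `vendor`, `total`, and `items` fields.

```python
import json

def validate_receipt(raw_text: str) -> dict:
    """Parse model output and check it against the expected receipt shape."""
    data = json.loads(raw_text)  # raises json.JSONDecodeError on malformed JSON

    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    for key in ("vendor", "total", "items"):
        if key not in data:
            raise ValueError(f"missing required field: {key}")

    if not isinstance(data["vendor"], str):
        raise ValueError("vendor must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(data["total"], bool) or not isinstance(data["total"], (int, float)):
        raise ValueError("total must be a number")
    if not isinstance(data["items"], list):
        raise ValueError("items must be an array")
    for item in data["items"]:
        if not isinstance(item, dict):
            raise ValueError("each item must be an object")

    return data

# Hypothetical model output used only to exercise the helper
sample = ('{"vendor": "Corner Cafe", "total": 7.5, '
          '"items": [{"name": "Latte", "price": 4.5}, {"name": "Bagel", "price": 3.0}]}')
receipt = validate_receipt(sample)
print(receipt["vendor"], receipt["total"], len(receipt["items"]))
```

A check like this catches structural surprises. Whether the extracted total actually matches the receipt still needs a separate sanity check, such as comparing it with the sum of the item prices.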

API Walkthrough: Python & TypeScript Multimodal Implementations

The examples below show the setup needed for multimodal requests: inline Base64 loading for small images, and File API uploads for large files, using Google's official SDKs.

SDK Implementations

// 1. TypeScript Node.js Implementation (@google/genai)

import { GoogleGenAI, Type } from "@google/genai";
import * as fs from "fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function parseReceiptImage(filePath: string) {
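  // Read the local file and Base64-encode it as an inline image part.
  // readFileSync already returns a Buffer, so no extra Buffer.from() is needed.
  const imagePart = {
    inlineData: {
      data: fs.readFileSync(filePath).toString("base64"),
      mimeType: "image/png" // adjust if the receipt is not a PNG
    }
  };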

  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [
      imagePart,
      "Extract the vendor, date, total, and line items from this receipt."
    ],
    config: {
      // Direct Schema Enforcement
      responseMimeType: "application/json",
      responseSchema: {
        type: Type.OBJECT,
        properties: {
          vendor: { type: Type.STRING },
          // The prompt asks for the date, so the schema needs a field for it
          date: { type: Type.STRING },
          total: { type: Type.NUMBER },
          items: {
            type: Type.ARRAY,
            items: {
              type: Type.OBJECT,
              properties: {
                name: { type: Type.STRING },
                price: { type: Type.NUMBER }
              }
            }
          }
        },
        required: ["vendor", "total", "items"]
      }
    }
  });

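  // response.text may be undefined (for example, if no candidate was returned), so guard before parsing
  const text = response.text;
  if (!text) {
    throw new Error("Model returned no text to parse");
  }
  console.log("Parsed JSON:", JSON.parse(text));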
}
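
The TypeScript example hard-codes `image/png`. As a hedged Python sketch of the same inline-data idea, the hypothetical helper `build_inline_part` below Base64-encodes a file and guesses its MIME type from the file extension using only the standard library. The `inline_data`, `mime_type`, and `data` key names follow the shape used in Gemini REST request bodies. The helper itself is illustrative and not part of any SDK.

```python
import base64
import mimetypes
import os
import tempfile

def build_inline_part(file_path: str) -> dict:
    """Return an inline image part with Base64 data and a guessed MIME type."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"unsupported or unknown image type: {file_path}")
    with open(file_path, "rb") as handle:
        encoded = base64.b64encode(handle.read()).decode("ascii")
    return {"inline_data": {"mime_type": mime_type, "data": encoded}}

# Demo with a throwaway file; the bytes are a placeholder, not a real PNG image
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "receipt.png")
    with open(demo_path, "wb") as handle:
        handle.write(b"placeholder bytes")
    part = build_inline_part(demo_path)

print(part["inline_data"]["mime_type"])
```

Guessing from the extension is a convenience, not a guarantee: a mislabeled file still gets the wrong MIME type, so inspecting the file header would be the stricter option.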

# 2. Python Video Reasoning File API Implementation (google-generativeai)

import google.generativeai as genai
import time
import os

# Initialize API credentials
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

def analyze_large_video(video_path: str):
    # Video files are usually too large to send inline, so upload them through the File API first
    print("Uploading video file...")
    video_file = genai.upload_file(path=video_path)

    try:
        # Wait for server-side processing to finish before using the file in a prompt
        while video_file.state.name == "PROCESSING":
            print("Waiting for video processing...")
            time.sleep(5)
            video_file = genai.get_file(video_file.name)

        if video_file.state.name == "FAILED":
            raise RuntimeError("Video processing failed")

        print(f"File active. URI: {video_file.uri}")

        model = genai.GenerativeModel("gemini-2.5-flash")

        # Send the uploaded video alongside the text prompt
        response = model.generate_content([
            video_file,
            "Compile a timeline of scene changes, noting any anomalous events."
        ])

        print("Generation complete:")
        print(response.text)
    finally:
        # Clean up the uploaded resource, even if processing or generation failed
        genai.delete_file(video_file.name)
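
The polling loop in the function above waits indefinitely if a file never leaves the PROCESSING state. One way to bound the wait is a small helper with a deadline. This is a sketch under stated assumptions: `wait_until_active` and its parameters are hypothetical names, and the state lookup is passed in as a plain callable (returning a state string such as "PROCESSING", "ACTIVE", or "FAILED") so the helper can be tried without any SDK or network call.

```python
import time
from typing import Callable

def wait_until_active(fetch_state: Callable[[], str],
                      timeout_s: float = 300.0,
                      poll_interval_s: float = 5.0) -> str:
    """Poll fetch_state() until it stops returning "PROCESSING" or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state != "PROCESSING":
            if state == "FAILED":
                raise RuntimeError("file processing failed")
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"file still processing after {timeout_s} seconds")
        time.sleep(poll_interval_s)

# Simulated state sequence standing in for repeated file lookups
fake_states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
result = wait_until_active(lambda: next(fake_states), timeout_s=1.0, poll_interval_s=0.01)
print(result)  # ACTIVE
```

With the SDK calls used in the listing above, the callable could be something like `lambda: genai.get_file(video_file.name).state.name`.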

Multimodal Gemini Prompting Challenges

Complete these tasks in your own repository. Check off each one once it runs successfully in your development environment.

Task 1: Structured Invoice Schema Parsing
Write a receipt-parser script that uses a responseSchema configuration to extract the vendor name, transaction date, and monetary amounts (as numbers) directly from receipt images.
Task 2: Front-End Landing Page Mockup Transpiler
Build a script that uploads hand-drawn layouts and prompts Gemini to translate the sketched structure into modular, styled React components that follow the drawn layout.
Task 3: File API Large Video Temporal Tracking
Implement a Node.js workflow using ai.files.upload(). Upload a large surveillance clip, wait for it to finish processing, and return a chronologically sorted array of action logs.
Task 4: Multimodal System Instruction & Temperature Controls
Configure safety settings, a low temperature, and a strict system instruction that suppresses conversational replies, so the model acts purely as a parser for visual tables.
