Wednesday, June 10, 2026

Encoders, Decoders, Transformers, and Stable Diffusion

Table 1: The Big Picture

Term | Purpose | Input | Output | Example
Encoder | Compress / understand data | Raw data | Latent representation (embedding) | BERT, Sentence-BERT, CLIP Text Encoder
Decoder | Generate or reconstruct data | Latent representation | Output data | GPT, VAE Decoder
Autoencoder | Learn compressed representations | Input data | Reconstructed input | Image Autoencoder
Autodecoder | Learn latent vectors directly | Learned latent code | Output data | DeepSDF, Neural Shape Models

Table 2: Encoder vs Decoder

Aspect | Encoder | Decoder
Main Goal | Understanding | Generation / Reconstruction
Direction | Input → Latent | Latent → Output
Typical Output | Embedding | Text, Image, Audio, etc.
Used For | Search, Retrieval, Classification | Generation, Reconstruction
Example | BERT | GPT

Table 3: Common Encoder Examples

Model | Architecture | Input | Output | Purpose
BERT | Transformer Encoder | Text | Embedding | Understanding
Sentence-BERT | Transformer Encoder | Text | Sentence Embedding | Semantic Search
E5 | Transformer Encoder | Text | Embedding | RAG Retrieval
BGE | Transformer Encoder | Text | Embedding | Vector Search
CLIP Text Encoder | Transformer Encoder | Text | Text Embedding | Text-to-Image
ResNet | CNN Encoder | Image | Feature Vector | Vision Tasks
ViT | Transformer Encoder | Image | Image Embedding | Vision Tasks

Table 4: Is an Embedding Model an Encoder?

Model Type | Encoder? | Example
Embedding Model | Yes (usually) | BERT, E5, BGE
RAG Embedding Model | Yes | E5, BGE
GPT | No (Decoder-only) | GPT-4, Llama
CLIP Text Encoder | Yes | Stable Diffusion

Rule of Thumb:

Embedding Model ≈ Encoder
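
To make the rule of thumb concrete, here is a minimal sketch of what the encoder's output is used for in search and retrieval: documents and a query become vectors, and ranking is done by cosine similarity. The 3-dimensional vectors and the `search` helper are made up for illustration; a real encoder such as E5 or BGE would produce much larger vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: in practice these come from an encoder model.
documents = {
    "kubernetes pods": [0.9, 0.1, 0.0],
    "docker containers": [0.8, 0.3, 0.1],
    "chocolate cake recipe": [0.0, 0.2, 0.9],
}

# Stands in for the encoder's output on a container-related query.
query_embedding = [0.95, 0.05, 0.0]

def search(query_vec, docs):
    """Return document names ordered from most to least similar."""
    return sorted(
        docs,
        key=lambda name: cosine_similarity(query_vec, docs[name]),
        reverse=True,
    )

ranking = search(query_embedding, documents)
print(ranking)
```

Nothing is generated here: the encoder side only turns inputs into vectors that can be compared.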

Table 5: Decoder Types

Decoder Type | Input | Output | Example
VAE Decoder | Latent Vector | Image | Stable Diffusion VAE
CNN Decoder | Feature Maps | Segmentation Mask / Image | U-Net
RNN Decoder | Context Vector | Sequence | Old Translation Models
Transformer Decoder | Previous Tokens | Next Token | GPT, Llama
Diffusion Decoder* | Noise | Image | Stable Diffusion

*Note: "Diffusion Decoder" is not a formal category; the term is used informally.

Table 6: Transformer Decoder vs Generic Decoder

Feature | Decoder | Transformer Decoder
Meaning | General Concept | Specific Architecture
Purpose | Latent → Output | Sequence Generation
Attention Mechanism | Optional | Yes
Token-by-Token Generation | Not Required | Yes
Example | VAE Decoder | GPT
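
The "Attention Mechanism" and "Token-by-Token Generation" rows are connected: in a Transformer decoder, self-attention is normally restricted by a causal mask so that each position can only look at itself and earlier positions. A small sketch of such a mask follows; the function names are illustrative and not taken from any library.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def visible_positions(mask, i):
    """Indices that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]

mask = causal_mask(3)
for row in mask:
    # Print 1 for "may attend" and 0 for "blocked".
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Each row has one more visible position than the row above it, which is why the model can be trained on whole sequences yet still generate one token at a time.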

Relationship Hierarchy

Decoder
├── VAE Decoder
├── CNN Decoder
├── RNN Decoder
└── Transformer Decoder
      ├── GPT
      ├── Llama
      ├── Gemini
      └── Claude

Table 7: Can Transformer Decoders Work Only on Text?

Data Type | Can Use Transformer Decoder? | Example
Text | Yes | GPT
Images (tokenized) | Yes | ImageGPT
Audio | Yes | AudioLM
Music | Yes | MusicLM
Protein Sequences | Yes | ProGen
Video (tokenized) | Yes | Various Video Transformers

Better Definition

Transformer Decoder = Sequence Generator

NOT

Transformer Decoder = Text Generator
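
The point is easier to see in code: the generation loop only ever handles integer token ids, so it does not care whether those ids stand for word pieces, image patches, or audio codes. Below is a minimal sketch with a hypothetical `toy_next_token` rule standing in for a trained decoder; the `EOS` id is also an assumption.

```python
EOS = 0  # assumed end-of-sequence token id

def toy_next_token(context):
    """Hypothetical stand-in for a trained decoder.

    A real Transformer decoder would attend over `context` and return a
    probability distribution; this toy rule returns (last token + 1) mod 5.
    """
    return (context[-1] + 1) % 5

def generate(prompt, max_new_tokens=10):
    """Autoregressive loop: append one predicted token at a time until EOS."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

print(generate([2]))
```

Swapping the toy rule for a real model changes the quality of the output, not the shape of the loop.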

Table 8: How GPT (Decoder-Only) is Trained

Training Sentence:

I love Kubernetes

Input | Target
I | love
I love | Kubernetes
I love Kubernetes | <EOS>

Loss Function

CrossEntropy(PredictedTokenDistribution, ActualNextToken)

So a decoder-only model does have a training target: at every position, the target is the next token.
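
A small sketch of the same idea in code: build the (input, target) pairs from the training sentence and compute the cross-entropy for one of them. The probabilities in `predicted` are invented for illustration; a real model would output a distribution over its whole vocabulary.

```python
import math

def next_token_pairs(tokens):
    """Split a token list into (input prefix, target token) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, actual_token):
    """Negative log of the probability assigned to the actual next token."""
    return -math.log(predicted_probs[actual_token])

sentence = ["I", "love", "Kubernetes", "<EOS>"]
pairs = next_token_pairs(sentence)
for prefix, target in pairs:
    print(" ".join(prefix), "->", target)

# Hypothetical model output for the prefix "I love".
predicted = {"Kubernetes": 0.7, "Docker": 0.2, "<EOS>": 0.1}
loss = cross_entropy(predicted, "Kubernetes")
print(round(loss, 4))
```

The higher the probability the model puts on the correct next token, the closer the loss gets to zero.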

Table 9: Autoencoder vs Autodecoder

Feature | Autoencoder | Autodecoder
Encoder Present? | Yes | No
Decoder Present? | Yes | Yes
Latent Vector Source | Produced by Encoder | Directly Learned
Typical Use | Compression, Denoising | 3D Shapes, Neural Fields
Example | Variational Autoencoder | DeepSDF

Autoencoder Flow

Input
 ↓
Encoder
 ↓
Latent
 ↓
Decoder
 ↓
Reconstructed Input
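
The flow above can be sketched with a deliberately tiny linear autoencoder: 2-D points are squeezed into one latent number and expanded back, and both halves are trained together on reconstruction error. The data, starting weights, learning rate, and step count are arbitrary choices for illustration; real autoencoders use deep networks and a framework's automatic differentiation.

```python
# Toy data: 2-D points on the line x2 = 2 * x1, so one latent number suffices.
data = [(-1.0, -2.0), (-0.5, -1.0), (0.5, 1.0), (1.0, 2.0)]

def encode(x, w):
    """Encoder: compress a 2-D input into a 1-D latent value."""
    return w[0] * x[0] + w[1] * x[1]

def decode(z, v):
    """Decoder: expand the 1-D latent value back into a 2-D point."""
    return (v[0] * z, v[1] * z)

def reconstruction_loss(w, v):
    """Sum of squared differences between each input and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, w), v)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total

def train(steps=2000, lr=0.01):
    """Plain gradient descent on encoder weights w and decoder weights v."""
    w, v = [0.1, 0.1], [0.1, 0.1]  # arbitrary small starting weights
    for _ in range(steps):
        grad_w, grad_v = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(x, w)
            x_hat = decode(z, v)
            err = (x_hat[0] - x[0], x_hat[1] - x[1])
            # Gradient of the squared error with respect to the latent value.
            dz = 2 * (err[0] * v[0] + err[1] * v[1])
            for i in range(2):
                grad_v[i] += 2 * err[i] * z
                grad_w[i] += dz * x[i]
        w = [w[i] - lr * grad_w[i] for i in range(2)]
        v = [v[i] - lr * grad_v[i] for i in range(2)]
    return w, v

initial_loss = reconstruction_loss([0.1, 0.1], [0.1, 0.1])
w, v = train()
final_loss = reconstruction_loss(w, v)
print(initial_loss, final_loss)
```

After training, a new point can be passed through `encode` to get its latent value, which is exactly what the autodecoder below cannot do.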

Autodecoder Flow

Learned Latent Code
        ↓
      Decoder
        ↓
      Output
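
For contrast, here is a rough autodecoder sketch in the same toy style: there is no encoder at all, and each training sample gets its own latent code that is optimised by gradient descent together with the decoder weights. The data and hyperparameters are again arbitrary illustrative choices, and this is not DeepSDF itself, only the "directly learned latent" idea.

```python
# Toy data: three 2-D points on the line x2 = 2 * x1.
data = [(-1.0, -2.0), (0.5, 1.0), (1.0, 2.0)]

def decode(z, v):
    """Decoder: expand a 1-D latent code into a 2-D point."""
    return (v[0] * z, v[1] * z)

def total_loss(codes, v):
    """Sum of squared differences between each sample and its decoded code."""
    total = 0.0
    for z, x in zip(codes, data):
        x_hat = decode(z, v)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total

def train(steps=2000, lr=0.02):
    """Jointly optimise decoder weights and one latent code per sample."""
    v = [0.1, 0.1]             # decoder weights
    codes = [0.1] * len(data)  # learnable latent codes, no encoder involved
    for _ in range(steps):
        grad_v = [0.0, 0.0]
        new_codes = []
        for z, x in zip(codes, data):
            x_hat = decode(z, v)
            err = (x_hat[0] - x[0], x_hat[1] - x[1])
            grad_z = 2 * (err[0] * v[0] + err[1] * v[1])
            for i in range(2):
                grad_v[i] += 2 * err[i] * z
            new_codes.append(z - lr * grad_z)
        codes = new_codes
        v = [v[i] - lr * grad_v[i] for i in range(2)]
    return codes, v

initial_loss = total_loss([0.1] * len(data), [0.1, 0.1])
codes, v = train()
final_loss = total_loss(codes, v)
print(codes, final_loss)
```

Because no encoder exists, the latent code for an unseen sample has to be found by running the same kind of optimisation against the frozen decoder.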

Table 10: Stable Diffusion Components

Component | Type | Purpose
Text Encoder | Transformer Encoder | Understand Prompt
U-Net | Diffusion Network | Denoising
VAE Encoder | Encoder | Compress Images
VAE Decoder | Decoder | Reconstruct Images
Scheduler | Control Logic | Manage Denoising Steps

Table 11: When is the VAE Encoder Used in Stable Diffusion?

Operation | VAE Encoder Used?
Model Training | Yes
Text → Image | No
Image → Image | Yes
Inpainting | Yes
Outpainting | Yes

Table 12: Stable Diffusion (Training)

Image
  ↓
VAE Encoder
  ↓
Image Latent
  ↓
Add Noise
  ↓
U-Net
  ↓
Predict Noise
During training, the U-Net learns to predict the noise that was added to a latent image representation; removing that predicted noise is what later turns random latents into clean ones.
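
One training step can be sketched as follows, assuming the standard DDPM-style formulation in which a clean latent is blended with Gaussian noise and the network is scored on how well it predicts that noise. `toy_unet`, the latent values, and `alpha_bar` are placeholders for illustration; the real U-Net is also conditioned on the timestep and the text embeddings.

```python
import math
import random

def add_noise(latent, noise, alpha_bar):
    """Blend a clean latent with noise; alpha_bar is the fraction of signal kept."""
    return [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
        for x, n in zip(latent, noise)
    ]

def remove_noise(noisy_latent, noise_estimate, alpha_bar):
    """Invert add_noise given an estimate of the noise that was added."""
    return [
        (y - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for y, n in zip(noisy_latent, noise_estimate)
    ]

def mse(predicted, target):
    """Mean squared error between predicted and actual noise."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def toy_unet(noisy_latent):
    """Hypothetical untrained stand-in for the U-Net: always predicts zero noise."""
    return [0.0 for _ in noisy_latent]

rng = random.Random(0)
image_latent = [0.5, -1.0, 0.25, 2.0]  # pretend output of the VAE encoder
noise = [rng.gauss(0.0, 1.0) for _ in image_latent]
alpha_bar = 0.6  # assumed noise level for one sampled timestep

noisy_latent = add_noise(image_latent, noise, alpha_bar)
loss = mse(toy_unet(noisy_latent), noise)  # training would minimise this

# With a perfect noise prediction, the clean latent is recovered exactly.
recovered = remove_noise(noisy_latent, noise, alpha_bar)
print(loss, recovered)
```

The last two lines show why predicting the noise is enough: once the noise is known, getting back to the clean latent is simple arithmetic.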

Table 13: Stable Diffusion (Text-to-Image Inference)

Prompt
   ↓
Text Encoder
   ↓
Text Embeddings
                +
Random Latent Noise
                ↓
             U-Net (run once per scheduler step)
                ↓
         Clean Latent
                ↓
           VAE Decoder
                ↓
              Image
Important Observation

VAE Encoder is NOT used during standard Text-to-Image generation.
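
The inference flow can be traced with stub components that only record when they are called. Every function body below is a placeholder invented for illustration (the real components are neural networks, and real pipelines add details such as guidance), but the call order mirrors the diagram above.

```python
import random

calls = []  # records which components run, in order

def text_encoder(prompt):
    calls.append("text_encoder")
    return [float(len(word)) for word in prompt.split()]  # placeholder embedding

def unet(latent, step, text_embedding):
    calls.append("unet")
    return [0.1 * x for x in latent]  # placeholder noise estimate

def scheduler_step(latent, noise_estimate):
    calls.append("scheduler")
    return [x - n for x, n in zip(latent, noise_estimate)]

def vae_decoder(latent):
    calls.append("vae_decoder")
    return [round(x, 3) for x in latent]  # placeholder "image"

def text_to_image(prompt, num_steps=3, latent_size=4, seed=0):
    """Prompt -> text embeddings + random latent -> denoising loop -> decode."""
    rng = random.Random(seed)
    text_embedding = text_encoder(prompt)
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]  # random latent noise
    for step in range(num_steps):
        noise_estimate = unet(latent, step, text_embedding)
        latent = scheduler_step(latent, noise_estimate)
    return vae_decoder(latent)

image = text_to_image("a cat on a laptop")
print(calls)
```

The recorded list starts with the text encoder, alternates between U-Net and scheduler, and ends with the VAE decoder; no VAE encoder call appears anywhere.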

Table 14: BERT vs GPT vs Stable Diffusion

Feature | BERT | GPT | Stable Diffusion
Architecture | Encoder-only | Decoder-only | Diffusion + VAE
Input | Text | Text | Text Prompt
Output | Embedding | Text | Image
Primary Goal | Understanding | Generation | Image Generation
Uses Embeddings? | Yes | Internally | Yes
Uses VAE? | No | No | Yes
Uses Transformer Decoder? | No | Yes | No
Uses Transformer Encoder? | Yes | No | Text Encoder Only
