baidu/ERNIE-Image

baidu

Text-to-Image

ERNIE-Image is an open text-to-image generation model developed by Baidu's ERNIE-Image team. It uses a single-stream Diffusion Transformer architecture with 8B parameters and can be paired with a lightweight Prompt Enhancer that expands brief instructions into richer, more structured descriptions. Its strengths are fidelity to complex instructions, rendering of dense or long text, and generation of structured compositions such as posters, comics, storyboards, and multi-panel images.

How to use

Installation and basic usage with Diffusers:
pip install -U diffusers transformers accelerate

import torch
from diffusers import DiffusionPipeline

# Use "mps" instead of "cuda" on Apple silicon devices.
pipe = DiffusionPipeline.from_pretrained("baidu/ERNIE-Image", dtype=torch.bfloat16, device_map="cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]

Recommended usage with ErnieImagePipeline and the Prompt Enhancer:
pip install git+https://github.com/huggingface/diffusers

import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,  # enable the Prompt Enhancer
).images[0]

image.save("output.png")
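The call above uses a portrait size of 848x1264. The model card recommends 1024x1024, 848x1264, and 1264x848. As a minimal sketch, a small helper can pick whichever of those three sizes is closest to a desired aspect ratio. The name `closest_recommended_size` and the log-ratio comparison are illustrative assumptions, not part of Diffusers or the model's API.

```python
import math

# Resolutions named in the model card, as (width, height) pairs.
RECOMMENDED_SIZES = [(1024, 1024), (848, 1264), (1264, 848)]


def closest_recommended_size(aspect_ratio: float) -> tuple[int, int]:
    """Return the recommended (width, height) nearest to width/height = aspect_ratio.

    Hypothetical helper: it only chooses among the three sizes listed above.
    """
    if aspect_ratio <= 0:
        raise ValueError("aspect_ratio must be positive")
    # Compare in log space so that 2:1 and 1:2 are treated symmetrically.
    target = math.log(aspect_ratio)
    return min(
        RECOMMENDED_SIZES,
        key=lambda wh: abs(math.log(wh[0] / wh[1]) - target),
    )
```

The result can then be passed on, for example `width, height = closest_recommended_size(2 / 3)` followed by `pipe(prompt=..., width=width, height=height)`.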

Local server with SGLang:
git clone https://github.com/sgl-project/sglang.git

sglang serve --model-path baidu/ERNIE-Image

curl -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "use_pe": true
  }' \
  --output output.png
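The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call: same endpoint, same JSON fields, and the response body written to a file unchanged. The helper names `build_request` and `generate` are hypothetical, and treating the response as raw file content is an assumption carried over from `--output output.png`; if the server returns JSON instead, the body would need to be decoded first.

```python
import json
import urllib.request

# Address and path taken from the curl example; adjust if the server runs elsewhere.
ENDPOINT = "http://localhost:30000/v1/images/generations"


def build_request(prompt, height=1264, width=848, num_inference_steps=50,
                  guidance_scale=4.0, use_pe=True, endpoint=ENDPOINT):
    """Build (but do not send) a POST request with the same fields as the curl call."""
    payload = {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "use_pe": use_pe,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, output_path="output.png", **kwargs):
    """Send the request and save the raw response body, like `curl --output`."""
    request = build_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = response.read()
    # Assumption: the body is written as-is, with no decoding step.
    with open(output_path, "wb") as f:
        f.write(body)
    return output_path
```

With the server running, `generate("Astronaut in a jungle, cold color palette")` would write the response to `output.png`.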

Features

Text-to-image generation with a single-stream Diffusion Transformer architecture.
8B parameters, a relatively compact size for a high-capacity open model.
Optional Prompt Enhancer that expands brief prompts and improves instruction following.
Strong performance on rendering long, dense, and layout-sensitive text.
Well suited to commercial posters, infographics, UI-style images, comics, and multi-panel compositions.
Supports a range of styles: realistic photography, clean graphic design, cinematic tones, and more stylized aesthetics.
Recommended parameters: 1024x1024 resolution, or other aspect ratios such as 848x1264 and 1264x848; guidance scale 4.0; 50 inference steps.
Can run on consumer GPUs with 24 GB of VRAM, according to the model card.
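As a rough sanity check on the 24 GB figure, the memory needed for the weights alone can be estimated from the parameter count. The sketch below assumes bfloat16 storage (2 bytes per parameter, the dtype used in the usage examples) and counts only the 8B parameters stated above. Other pipeline components, activations, and framework overhead are not included, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights only, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 1024**3


# 8B parameters stored in bfloat16 (2 bytes each).
dit_gib = weight_memory_gib(8e9)
print(f"{dit_gib:.1f} GiB")  # prints "14.9 GiB"
```

Under these assumptions the weights take roughly 15 GiB, which leaves headroom on a 24 GB card for the remaining components and intermediate tensors.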

Use cases

Creating commercial posters with legible text and a controlled layout.
Infographics, interface-style images, and designs with long or dense text.
Comics, storyboards, and multi-panel compositions where organization and spatial coherence matter.
Generating realistic or stylized images from detailed prompts.
Creative prototyping and downstream adaptation of open text-to-image models.
Visual production that requires precise instruction following across multiple objects, relationships, and details.