baidu/ERNIE-Image-Turbo

baidu

Text-to-image

ERNIE-Image-Turbo is an open text-to-image generation model developed by Baidu's ERNIE-Image team. It is the distilled version of ERNIE-Image, built on the same single-stream Diffusion Transformer (DiT) family, and is optimized to generate images with good fidelity in only 8 inference steps. It excels at following complex instructions, text rendering, structured composition, and fast generation for posters, comics, storyboards, multi-panel layouts, and images with dense text content.

How to use

Installation and basic usage with Diffusers:
pip install -U diffusers transformers accelerate

import torch
from diffusers import DiffusionPipeline

# Switch to "mps" for Apple devices
pipe = DiffusionPipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    dtype=torch.bfloat16,
    device_map="cuda"
)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]

Recommended parameters: resolution 1024x1024, 848x1264, 1264x848, 768x1376, 896x1200, 1376x768, or 1200x896; guidance scale 1.0; 8 inference steps.
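
When an application receives an arbitrary target size, one option is to snap it to the nearest recommended resolution. The following is a minimal sketch of that idea, not part of Diffusers or the model repository: `RECOMMENDED_SIZES` and `closest_recommended_size` are hypothetical names, and the sizes are assumed to be written as WIDTHxHEIGHT.

```python
import math

# Recommended sizes from the list above, assumed to be (width, height) pairs.
RECOMMENDED_SIZES = [
    (1024, 1024), (848, 1264), (1264, 848), (768, 1376),
    (896, 1200), (1376, 768), (1200, 896),
]


def closest_recommended_size(width, height):
    """Return the recommended (width, height) with the nearest aspect ratio.

    Ratios are compared on a log scale so that portrait and landscape
    requests are treated symmetrically.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)
    return min(
        RECOMMENDED_SIZES,
        key=lambda size: abs(math.log(size[0] / size[1]) - target),
    )
```

For example, `closest_recommended_size(1920, 1080)` selects the widest landscape entry in the list, and the result can be passed as `width` and `height` to the pipeline.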

Usage with ErnieImagePipeline:
pip install git+https://github.com/huggingface/diffusers

import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=True
).images[0]

image.save("output.png")
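
When several prompts share the same Turbo settings, it can help to keep those settings in one place. The sketch below is one possible wrapper, not part of Diffusers: `TURBO_DEFAULTS` and `generate_batch` are hypothetical names, and the only pipeline behavior assumed is what the example above shows (keyword arguments in, `.images[0]` out).

```python
# Settings copied from the ErnieImagePipeline example above.
TURBO_DEFAULTS = {
    "num_inference_steps": 8,
    "guidance_scale": 1.0,
    "use_pe": True,
}


def generate_batch(pipe, prompts, height=1264, width=848, **overrides):
    """Call `pipe` once per prompt with shared settings.

    `pipe` is any callable that accepts the keyword arguments used in the
    example above and returns an object with an `images` list.
    Keyword overrides take precedence over TURBO_DEFAULTS.
    """
    settings = {**TURBO_DEFAULTS, "height": height, "width": width, **overrides}
    return [pipe(prompt=prompt, **settings).images[0] for prompt in prompts]
```

With the `pipe` object created above, `generate_batch(pipe, ["first prompt", "second prompt"])` would return one image per prompt, each of which can be saved with `image.save(...)`.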

Usage with SGLang:
git clone https://github.com/sgl-project/sglang.git
sglang serve --model-path baidu/ERNIE-Image-Turbo

Generation request via the local API:
curl -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 8,
    "guidance_scale": 1.0,
    "use_pe": true
  }' \
  --output output.png
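
The same request can be issued from Python using only the standard library. This is a rough sketch that mirrors the curl call above: the endpoint and JSON fields are taken from that example, while `build_request` and `save_response` are hypothetical helper names. The response format is not described here, so the sketch simply writes the raw response body to a file, as `curl --output` does.

```python
import json
import urllib.request

# Endpoint taken from the curl example above; adjust host and port as needed.
ENDPOINT = "http://localhost:30000/v1/images/generations"


def build_request(prompt, height=1264, width=848, steps=8,
                  guidance_scale=1.0, use_pe=True, url=ENDPOINT):
    """Build a POST request carrying the same JSON fields as the curl call."""
    payload = {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "use_pe": use_pe,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_response(request, path="output.png", timeout=300):
    """Send the request and write the raw response body to `path`.

    Assumption: the body is written unmodified, like `curl --output`;
    no decoding of the server's response format is attempted.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    with open(path, "wb") as handle:
        handle.write(body)
    return path
```

With a server running locally, `save_response(build_request("a short prompt"))` would send the request and store whatever the server returns in `output.png`.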

Features

Text-to-image generation with a single-stream Diffusion Transformer architecture.
Distilled Turbo version of ERNIE-Image, optimized with DMD (Distribution Matching Distillation) and reinforcement learning (RL) for higher speed and better aesthetics.
Generates high-quality images in only 8 inference steps.
Strong performance when rendering long, dense, layout-sensitive text.
Strong instruction following for prompts with multiple objects, detailed relationships, and knowledge-intensive descriptions.
Well suited to structured compositions such as posters, infographics, comics, storyboards, and multi-panel layouts.
Supports varied styles, including realistic photography, design-oriented imagery, and stylized or cinematic aesthetics.
Can run on consumer GPUs with 24 GB of VRAM.
Apache 2.0 license.
Available on Hugging Face with Safetensors weights, usable through Diffusers, SGLang, and inference providers.

Use cases

Fast image generation from text prompts in latency-sensitive applications.
Creation of posters, infographics, and other visual pieces with legible text and structured layout.
Production of comics, storyboards, and multi-panel compositions where visual organization matters.
Generation of photorealistic images, graphic design imagery, or images in artistic and cinematic styles.
Local prototyping and image generation research on consumer GPUs with 24 GB of VRAM.
Image generation services built on Diffusers, SGLang, or compatible endpoints.