Lightricks/LTX-Video-0.9.1

Lightricks

Text-to-Video

LTX-Video-0.9.1 is a diffusion-based video generation model for text-to-video and image+text-to-video, developed by Lightricks. It targets high-resolution video with realistic and varied content, and can be run locally, in ComfyUI, or through Diffusers. The model card describes it as a DiT-based model capable of generating high-quality video in real time, with output at up to 30 FPS and a target resolution of 1216×704 within the LTX-Video family.

How to use

Quick install with Diffusers:
pip install -U diffusers transformers accelerate

Basic usage example from Hugging Face with Diffusers:
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# switch to "mps" for Apple devices
pipe = DiffusionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.1", torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# video pipelines return frames rather than images
video = pipe(prompt).frames[0]
export_to_video(video, "output.mp4", fps=24)

Local usage from the official repository:
git clone https://github.com/Lightricks/LTX-Video.git
cd LTX-Video
# create env
python -m venv env
source env/bin/activate
python -m pip install -e .\[inference-script\]

Text-to-video inference:
python inference.py --prompt "PROMPT" --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config ltxv-13b-0.9.7-dev.yaml

Image-to-video inference:
python inference.py --prompt "PROMPT" --input_image_path IMAGE_PATH --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config ltxv-13b-0.9.7-dev.yaml

Text-to-video example with Diffusers:
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
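
The length of the resulting clip follows from the frame count and the export frame rate. As a quick check, assuming the 161 frames and 24 fps used in the example above (the `clip_duration_seconds` helper is hypothetical and not a Diffusers function):

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Return the playback duration of a clip in seconds."""
    return num_frames / fps


duration = clip_duration_seconds(161, 24)
print(f"{duration:.2f} s")  # 6.71 s
```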

Image-to-video example with Diffusers:
import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png")
prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background. Flames engulf the structure, with smoke billowing into the air. Firefighters in protective gear rush to the scene, a fire truck labeled '38' visible behind them. The girl's neutral expression contrasts sharply with the chaos of the fire, creating a poignant and emotionally charged scene."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)

Usage tips: prompts must be written in English and work best when they are long, concrete, and describe the camera, lighting, subjects, motion, and visual style. The model is not intended to return factual information and may not follow the prompt perfectly.
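
One way to keep prompts long and concrete is to assemble them from the elements listed above. The sketch below is only an illustration: `build_prompt` and its field names are hypothetical and are not part of LTX-Video or Diffusers.

```python
def build_prompt(subject: str, motion: str, camera: str, lighting: str, style: str) -> str:
    """Join the descriptive elements into one English prompt, skipping empty parts."""
    parts = [subject, motion, camera, lighting, style]
    # Strip whitespace and trailing periods so the sentences join cleanly.
    cleaned = [p.strip().rstrip(".") for p in parts if p and p.strip()]
    return ". ".join(cleaned) + "." if cleaned else ""


prompt = build_prompt(
    subject="A woman with long brown hair smiles at another woman with long blonde hair",
    motion="She turns her head slowly toward the camera",
    camera="The camera angle is a close-up focused on her face",
    lighting="The lighting is warm and natural, as from the setting sun",
    style="The scene appears to be real-life footage",
)
print(prompt)
```

The resulting string can be passed as the `prompt` argument in the Diffusers examples.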

Features

Text-to-video generation from descriptive English prompts.
Image-to-video generation combining an input image with a text description.
Diffusion-based (DiT) video model.
Compatible with Diffusers and ComfyUI workflows.
Uses Safetensors weights.
Works best at resolutions below 720×1280 and with fewer than 257 frames.
Requires resolutions divisible by 32 and a frame count of the form 8k + 1 (a multiple of 8, plus 1, such as 161); otherwise the input is padded and the output is cropped.
Part of the LTX-Video family, which includes 2B, distilled 2B, and 13B variants with different trade-offs between quality, VRAM, and speed.
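
The divisibility constraints in the list above can be checked before calling the pipeline. The following is a minimal sketch and not part of the LTX-Video codebase: the helper names `snap_resolution` and `snap_num_frames` are hypothetical, and rounding down (instead of letting the model pad and crop) is an assumption made for illustration.

```python
def snap_resolution(value: int, multiple: int = 32) -> int:
    """Round a width or height down to the nearest multiple of 32 (minimum 32)."""
    return max(multiple, (value // multiple) * multiple)


def snap_num_frames(num_frames: int, block: int = 8) -> int:
    """Round a frame count down to the nearest value of the form 8 * k + 1 (minimum 1)."""
    k = max(0, (num_frames - 1) // block)
    return block * k + 1


width = snap_resolution(720)    # 704
height = snap_resolution(500)   # 480
frames = snap_num_frames(165)   # 161
print(width, height, frames)
```

Values that already satisfy the constraints, such as 704, 480, and 161, are returned unchanged.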

Use cases

Creating video clips from detailed text descriptions.
Animating a starting image with a text instruction.
Prototyping cinematic scenes, character shots, landscapes, cityscapes, or short narrative sequences.
Integrating video generation into local workflows with Python, CUDA, Diffusers, or ComfyUI.
Experimenting with low-latency or real-time video models within the LTX-Video family.