Skywork/SkyReels-V2-DF-14B-720P-Diffusers

Skywork

Text-to-Video

A 14B-parameter generative video model from the SkyReels V2 series, aimed at long-form cinematic generation at 720p using autoregressive Diffusion Forcing. It supports text-to-video, image-to-video, and video extension, with synchronous or asynchronous inference to improve temporal consistency in long sequences.

How to use

Basic installation with Diffusers:
pip install -U diffusers transformers accelerate

Quick text-to-video example with Diffusers:
# pip install ftfy
import torch
from diffusers import AutoModel, SkyReelsV2DiffusionForcingPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video

model_id = "Skywork/SkyReels-V2-DF-14B-720P-Diffusers"
vae = AutoModel.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipeline = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    model_id, vae=vae, torch_dtype=torch.bfloat16
)
flow_shift = 8.0 # 8.0 for T2V, 5.0 for I2V
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift)
pipeline = pipeline.to("cuda")

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
output = pipeline(
    prompt=prompt,
    num_inference_steps=30,
    height=544, # 720 for 720P
    width=960, # 1280 for 720P
    num_frames=97,
    base_num_frames=97, # 121 for 720P
    ar_step=5,
    causal_block_size=5,
    overlap_history=None, # 17 for long video generations
    addnoise_condition=20,
).frames[0]
export_to_video(output, "video.mp4", fps=24, quality=8)

Image-to-video example with a first and a last frame:
import numpy as np
import torch
import torchvision.transforms.functional as TF
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image

model_id = "Skywork/SkyReels-V2-DF-14B-720P-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipeline = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    model_id, vae=vae, torch_dtype=torch.bfloat16
)
flow_shift = 5.0 # 8.0 for T2V, 5.0 for I2V
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift)
pipeline.to("cuda")

first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

def aspect_ratio_resize(image, pipeline, max_area=720 * 1280):
    aspect_ratio = image.height / image.width
    mod_value = pipeline.vae_scale_factor_spatial * pipeline.transformer.config.patch_size[1]
    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    image = image.resize((width, height))
    return image, height, width

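def center_crop_resize(image, height, width):
    # Scale the image so it fully covers the target size, keeping its aspect ratio
    resize_ratio = max(width / image.width, height / image.height)
    resized_width = round(image.width * resize_ratio)
    resized_height = round(image.height * resize_ratio)
    image = image.resize((resized_width, resized_height))
    # TF.center_crop expects the output size as (height, width)
    image = TF.center_crop(image, [height, width])
    return image, height, width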

first_frame, height, width = aspect_ratio_resize(first_frame, pipeline)
if last_frame.size != first_frame.size:
    last_frame, _, _ = center_crop_resize(last_frame, height, width)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
output = pipeline(
    image=first_frame,
    last_image=last_frame,
    prompt=prompt,
    height=height,
    width=width,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=24, quality=8)
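
The resizing arithmetic used by aspect_ratio_resize above can be checked without loading the model. The following is a minimal sketch, assuming a mod_value of 16; that number is only a stand-in for pipeline.vae_scale_factor_spatial * pipeline.transformer.config.patch_size[1] and should be read from the loaded pipeline in practice. The function name target_size is hypothetical.

```python
import math


def target_size(image_width, image_height, max_area=720 * 1280, mod_value=16):
    """Return (height, width) close to max_area that keeps the aspect ratio.

    Both dimensions are rounded down to a multiple of mod_value.
    mod_value=16 is an assumption for illustration only.
    """
    aspect_ratio = image_height / image_width
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return height, width


print(target_size(1920, 1080))  # (720, 1280)
print(target_size(1000, 1000))  # (960, 960)
```

A 16:9 input therefore maps to the 720 x 1280 configuration, while other aspect ratios are scaled to a similar pixel area.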

For long video generation with the scripts in the SkyReels-V2 repository, use generate_video_df.py with parameters such as --num_frames, --overlap_history, --addnoise_condition, --ar_step, and --causal_block_size. For 720p, the documentation recommends 720 x 1280 x 121 frames as the base configuration.
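
To get a feel for how num_frames, base_num_frames, and overlap_history relate, the sketch below plans overlapping segments in pixel-frame space. It assumes that every segment after the first reuses the last overlap_history frames of the previous one as context. This is a simplification for illustration, not the repository's code (which may operate on latent frames), and plan_segments is a hypothetical helper.

```python
def plan_segments(num_frames, base_num_frames, overlap_history):
    """Return (start, end) frame ranges, end exclusive, covering num_frames.

    Assumption: each new segment starts overlap_history frames before the
    end of the previous segment and is at most base_num_frames long.
    """
    if num_frames <= base_num_frames:
        return [(0, num_frames)]
    if overlap_history >= base_num_frames:
        raise ValueError("overlap_history must be smaller than base_num_frames")

    segments = [(0, base_num_frames)]
    end = base_num_frames
    while end < num_frames:
        start = end - overlap_history  # reuse the tail of the previous segment
        end = min(start + base_num_frames, num_frames)
        segments.append((start, end))
    return segments


print(plan_segments(257, 97, 17))  # [(0, 97), (80, 177), (160, 257)]
```

Under these assumptions, a 257-frame video with base_num_frames=97 and overlap_history=17 needs three segments, each adding 80 new frames after the first.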

Features

Text-to-video generation at a 720p target resolution, with a recommended configuration of 720 x 1280 and 121 frames.
Autoregressive Diffusion Forcing architecture that extends videos by using the last frames of the previous segment as context.
Support for text-to-video, image-to-video with a first and/or last frame, and video-to-video for extending an existing sequence.
Synchronous and asynchronous inference, controlled by parameters such as ar_step, causal_block_size, base_num_frames, and overlap_history.
Integration with Diffusers through SkyReelsV2DiffusionForcingPipeline, SkyReelsV2DiffusionForcingImageToVideoPipeline, and SkyReelsV2DiffusionForcingVideoToVideoPipeline.
Options to reduce or manage VRAM usage, such as offloading, Teacache, and lowering base_num_frames; the 14B model requires a large amount of GPU memory.
Multi-GPU acceleration with xDiT USP, launched via torchrun.
Training and refinement are described as multi-stage pretraining, reinforcement learning for motion quality, Diffusion Forcing training, and high-quality SFT at 540p and 720p.
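
The difference between synchronous and asynchronous inference listed above can be illustrated with a toy schedule. The sketch assumes that frames are grouped into blocks (of causal_block_size latent frames) and that each block starts denoising ar_step iterations after the previous one, so earlier blocks are always at least as clean as later ones; ar_step=0 is treated as the synchronous case. The function name and the "completed steps" representation are hypothetical, and this is not the pipeline's actual scheduling code.

```python
def staggered_schedule(num_blocks, num_steps, ar_step):
    """Return, per iteration, how many denoising steps each block has completed.

    Assumption: block b lags block b-1 by ar_step iterations.
    With ar_step=0 all blocks advance together (synchronous).
    """
    total_iters = num_steps + (num_blocks - 1) * ar_step
    schedule = []
    for it in range(total_iters + 1):
        # Clamp each block's progress to the range [0, num_steps]
        row = [min(max(it - b * ar_step, 0), num_steps) for b in range(num_blocks)]
        schedule.append(row)
    return schedule


for row in staggered_schedule(num_blocks=3, num_steps=4, ar_step=2):
    print(row)
```

In this toy model the asynchronous run takes num_steps + (num_blocks - 1) * ar_step iterations in total, which hints at why asynchronous inference tends to be slower than synchronous inference for the same number of steps.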

Use cases

Generating long cinematic clips from text prompts.
Creating videos conditioned on a first image, a last image, and a text description.
Extending an existing video while preserving style, subject, and temporal continuity.
Producing storyboard-style visuals, narrative scenes, or camera-movement demos that run longer than the output of typical short-clip video models.
Experimenting with autoregressive video generation, asynchronous inference, and evaluation of visual consistency in long sequences.