Stable diffusion thread

doogyhatts · Jul 7, 2025

@AikiBoy
This channel mainly uses AI voice and images only.
If the topic cannot meet the 2hr limit, the remaining content is pasted from other unrelated topics.
On the 15th of this month, YouTube will be revising some of the YPP rules regarding mass-produced and repetitive content.

windwaver · Jul 7, 2025

The requirements look a bit scary to me.

How much does it cost to build a rig to use this?

focus1974 · Jul 7, 2025

doogyhatts said:
Do realistic WoW animations, won't go wrong.

In the next 2 years, the games coming out will be more and more realistic, or of higher graphical fidelity and creativity, even for anime.

I already noticed that the store graphics of the game apps in the Google Play Store have all suddenly been updated to very visually appealing ones, even though the games are from years ago.

doogyhatts · Jul 7, 2025

windwaver said:
The requirements look a bit scary to me.

How much does it cost to build a rig to use this?

For local video generation, I use an RTX 5080.
I mainly use it for image generation, image editing, generating speaking avatars and video upscaling.
My entire rig with monitor costs 3k SGD (excluding the hard disk).

For image generation, I am also using the Seedream 3.0 model on the Dreamina platform.
For video generation, I am still using KlingAI, since I am partially sponsored by them.
I upscale my videos using Topaz Starlight-Mini ($249 USD fixed cost).

If you feel that the images from Midjourney are better, then you have to factor in their subscription plan.
The video output from Midjourney is only 480p, but has very good physics and non-photorealistic rendering.
Some people don't use an expensive local rig. They generate their images and video on Midjourney, then upscale using Starlight-Mini on a cheaper rig (RTX 5060 Ti).

As for the Scarlet Monastery video, the creator replied that he used KlingAI and Veo2 to generate the video clips.
I think his images might have been generated with Imagen4.
But I don't think he used KlingAI entirely for the lip-sync. He could have used either Dreamina or Veo3.

AikiBoy · Jul 7, 2025

doogyhatts said:
For local video generation, I use an RTX 5080.
I mainly use it for image generation, image editing, generating speaking avatars and video upscaling.
My entire rig with monitor costs 3k SGD (excluding the hard disk).

For image generation, I am also using the Seedream 3.0 model on the Dreamina platform.
For video generation, I am still using KlingAI, since I am partially sponsored by them.
I upscale my videos using Topaz Starlight-Mini ($249 USD fixed cost).

If you feel that the images from Midjourney are better, then you have to factor in their subscription plan.
The video output from Midjourney is only 480p, but has very good physics and non-photorealistic rendering.
Some people don't use an expensive local rig. They generate their images and video on Midjourney, then upscale using Starlight-Mini on a cheaper rig (RTX 5060 Ti).

As for the Scarlet Monastery video, the creator replied that he used KlingAI and Veo2 to generate the video clips.
I think his images might have been generated with Imagen4.
But I don't think he used KlingAI entirely for the lip-sync. He could have used either Dreamina or Veo3.

Your setup is so pro. Do you earn a lot making AI videos?

doogyhatts · Jul 7, 2025

AikiBoy said:
Your setup is so pro. Do you earn a lot making AI videos?

Not yet.
I am still building a new audience for WoW-related content.
I had to make a set of reusable sprites for the character, which takes up a lot of time.

And the main problem is that KlingAI's lip-sync is still not updated to the new OmniSync algorithm, for which they have already completed their research.
Running the open-source lip-sync algorithms is very slow on a local machine.

Other people don't care about character consistency, so they just throw in many different characters and environments quickly.
Then they add the spoken audio, generated with ElevenLabs, and make sure there are enough animations to cover the long audio.

AikiBoy · Jul 7, 2025

Lol i want to do something like this

doogyhatts · Jul 7, 2025

AikiBoy said:
Lol i want to do something like this

In this case, you have to pay for Veo3 or Hailuo.
KlingAI's 2.1 master model is very expensive for such kung fu actions, and still has a lot of morphing.

KaiserBreath · Jul 7, 2025

Got any tricks to get Fantasy Talking or Float to bypass the 5s limit? Either the video gets distorted, there is not enough VRAM, or the model is pegged to 5s.

doogyhatts · Jul 7, 2025

KaiserBreath said:
Got any tricks to get Fantasy Talking or Float to bypass the 5s limit? Either the video gets distorted, there is not enough VRAM, or the model is pegged to 5s.

I don't use Fantasy Talking or Float.
Use either Hunyuan Avatar, Omni Avatar or MultiTalk.

I am able to run Hunyuan Avatar and Omni Avatar using 16GB VRAM for 10 seconds of audio.
Omni Avatar is faster but has less body motion. I am using the command line version.
Hunyuan Avatar is slower and has morphing hands. I used it inside Wan2GP.
MultiTalk will OOM for 10 seconds of audio right now. Wait for Wan2GP to integrate it and lower the requirements.

KaiserBreath · Jul 7, 2025

doogyhatts said:
I don't use Fantasy Talking or Float.
Use either Hunyuan Avatar, Omni Avatar or MultiTalk.

I am able to run Hunyuan Avatar and Omni Avatar using 16GB VRAM for 10 seconds of audio.
Omni Avatar is faster but has less body motion. I am using the command line version.
Hunyuan Avatar is slower and has morphing hands. I used it inside Wan2GP.
MultiTalk will OOM for 10 seconds of audio right now. Wait for Wan2GP to integrate it and lower the requirements.

Btw, if we take the last frame of the previous clip as the input image for the next one, will the result look like one continuous clip? Like extending it longer.

Or is the sample image not guaranteed to be the 1st frame?
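
A minimal sketch of the idea in this question: grab the last frame of a clip so it can be used as the input image for the next segment. It assumes ffmpeg is installed and on the PATH. The file names (seg1.mp4, last.png) and the helper names are made up for illustration, and whether the next segment really starts from that frame depends on the model.

```python
import subprocess


def last_frame_command(video_path, image_path):
    """Build an ffmpeg command that saves the final frame of a clip.

    -sseof -1  : start decoding 1 second before the end of the input
    -update 1  : keep overwriting one image file, so the last frame
                 decoded is the one left on disk
    """
    return [
        "ffmpeg", "-y",
        "-sseof", "-1",
        "-i", video_path,
        "-update", "1",
        image_path,
    ]


def extract_last_frame(video_path, image_path):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(last_frame_command(video_path, image_path), check=True)


# Hypothetical usage:
# extract_last_frame("seg1.mp4", "last.png")
# then pass last.png as the input image of the next segment
```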

doogyhatts · Jul 7, 2025

KaiserBreath said:
Btw, if we take the last frame of the previous clip as the input image for the next one, will the result look like one continuous clip? Like extending it longer.

Or is the sample image not guaranteed to be the 1st frame?

There will be a slight colour difference between the last frame of the first segment and the first frame of the second segment.
This is for non-speaking avatars.
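
One rough way to reduce that colour jump is to shift each colour channel of the second segment so that its first frame has the same average colour as the last frame of the first segment. This is only a pure-Python sketch under simple assumptions: a frame is a flat list of (R, G, B) tuples, the helper names are hypothetical, and none of the tools mentioned here are confirmed to work this way. A real pipeline would use an image library and probably something smarter than a mean shift.

```python
def channel_means(frame):
    """Average R, G and B over a frame given as a list of (r, g, b) tuples."""
    n = len(frame)
    return tuple(sum(px[c] for px in frame) / n for c in range(3))


def channel_offsets(frame, reference):
    """Per-channel shift that moves the frame's mean colour onto the reference's."""
    src = channel_means(frame)
    ref = channel_means(reference)
    return tuple(r - s for r, s in zip(ref, src))


def apply_offsets(frame, offsets):
    """Add the offsets to every pixel, clamping to the 0-255 range."""
    return [
        tuple(min(255, max(0, round(px[c] + offsets[c]))) for c in range(3))
        for px in frame
    ]


# Hypothetical usage: compute the offsets once from the boundary frames,
# then apply the same offsets to every frame of the second segment.
# offsets = channel_offsets(first_frame_of_segment_2, last_frame_of_segment_1)
# corrected = [apply_offsets(f, offsets) for f in segment_2_frames]
```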

For speaking avatars, the first frame of the second segment is not guaranteed to be the same as the last frame of the first segment.

KaiserBreath · Jul 7, 2025

doogyhatts said:
There will be a slight colour difference between the last frame of the first segment and the first frame of the second segment.
This is for non-speaking avatars.

For speaking avatars, the first frame of the second segment is not guaranteed to be the same as the last frame of the first segment.

I see. So far there is no real open-source option that can match the fidelity and duration of commercial ones like HeyGen, right?

5s is too short. 10s might be OK to at least complete a sentence lol.

doogyhatts · Jul 7, 2025

KaiserBreath said:
I see. So far there is no real open-source option that can match the fidelity and duration of commercial ones like HeyGen, right?

5s is too short. 10s might be OK to at least complete a sentence lol.

Someone told me on GitHub that he did a 16-second one using MultiTalk on a 3090.

I am not sure what maximum audio length HeyGen can support.

I have not tried going above 10 seconds with the Hunyuan Avatar and Omni Avatar solutions.

KaiserBreath · Jul 7, 2025

doogyhatts said:
Someone told me on GitHub that he did a 16-second one using MultiTalk on a 3090.

I am not sure what maximum audio length HeyGen can support.

I have not tried going above 10 seconds with the Hunyuan Avatar and Omni Avatar solutions.

HeyGen can do long clips, around 1 min, with no cuts. Not sure if there is some editing magic going on, but at least it seems seamless.

doogyhatts · Jul 7, 2025

KaiserBreath said:
HeyGen can do long clips, around 1 min, with no cuts. Not sure if there is some editing magic going on, but at least it seems seamless.

I see. ~~Open-source cannot do this yet.~~

I am waiting for KlingAI to update their lip-sync algorithm to the new one.
Their lip-sync UI now allows multiple speakers and a total audio length of up to 1 min.

doogyhatts · Jul 7, 2025

@KaiserBreath
That same person now tells me on GitHub that he tried a 1 min audio length in MultiTalk and it works.

I think there is no limit on how long a video you can generate. I just tried generating 1 minute of talking video and it works just fine.
I think it generates in chunks, but the VRAM consumption is 17-18GB, which is a bit over 16GB, and that's why you get OOM.
I can't run it on my 5080 either, but I can run it just fine on a 3090 no matter how long the clip is.
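
A quick way to check whether a card has enough memory before trying is to ask nvidia-smi for the total VRAM and compare it with the figure quoted above. This is only a sketch: the 18GB default comes from the 17-18GB quoted in this post, the helper names are made up, and nvidia-smi reports MiB, so the comparison is approximate. Total memory is also not the same as free memory, since the desktop and other programs use some.

```python
import subprocess


def parse_vram_gib(smi_output):
    """Turn nvidia-smi output (one MiB value per GPU, one per line) into GiB."""
    return [int(line.strip()) / 1024 for line in smi_output.splitlines() if line.strip()]


def query_vram_gib():
    """Ask nvidia-smi for the total memory of each GPU, in GiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_gib(result.stdout)


def can_fit(total_gib, needed_gib=18.0):
    """True if the card's total memory covers the assumed requirement."""
    return total_gib >= needed_gib


# Hypothetical usage:
# for gib in query_vram_gib():
#     print(f"{gib:.1f} GiB total, enough for ~18GB: {can_fit(gib)}")
```

With these numbers, a 16GB card fails the check and a 24GB card like the 3090 passes, which matches what was reported.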

KaiserBreath · Jul 7, 2025

doogyhatts said:
@KaiserBreath
That same person now tells me on GitHub that he tried a 1 min audio length in MultiTalk and it works.

Nice, but this one can't work from a reference image of the avatar, right?

I think those are usually harder, and if gestures are included, the output gets even shorter.

doogyhatts · Jul 7, 2025

KaiserBreath said:
Nice, but this one can't work from a reference image of the avatar, right?

I think those are usually harder, and if gestures are included, the output gets even shorter.

It does have a requirement to have an input image in the json file.
https://github.com/MeiGen-AI/MultiTalk/blob/main/examples/multitalk_example_1.json

I am not sure if the 1 min length is due to Kijai's implementation of the sliding context window. It very well could be.
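
To show what a sliding context window means in general, here is a small sketch that splits a long clip into overlapping frame windows. It is not Kijai's actual implementation, which I have not checked. The function name and the numbers in the usage comment (200 frames, window of 81, overlap of 16) are made up for illustration. The point is that each window has a fixed size, so memory use would depend on the window and not on the total clip length.

```python
def sliding_windows(num_frames, window, overlap):
    """Split num_frames into (start, end) ranges of at most `window` frames.

    Consecutive ranges share `overlap` frames, which is what lets a
    generator blend one chunk into the next.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if num_frames <= 0:
        return []

    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end >= num_frames:
            break
        start += step
    return windows


# Hypothetical usage:
# for start, end in sliding_windows(200, window=81, overlap=16):
#     generate frames start..end, reusing the overlapping frames as context
```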

AikiBoy · Jul 8, 2025

How did you get the KlingAI sponsorship? What are the perks?

What must you do for them?
