Wan2GP - Multi Talk - Multi character

Wan2GP MultiTalk – multi-character lip sync. This video shows you my MultiTalk workflow for scenes with multiple speaking characters. We are running Wan2GP version 9.31 by DeepBeepMeep on an Nvidia GeForce RTX 3060 Ti graphics card with only 8GB of VRAM. I show you the settings I use to make sure the characters we want to speak are actually the ones doing the talking. This can be hit or miss, so I’m sharing some tips on how to make it more accurate.

   Follow the free series to build and install open-source options and get started with AI at home for free.

Wan2GP - Multi Person Speaking

 

 

Step 1.  Open your Wan2GP installation. If you don’t use a batch file to launch it, go back to the episode called “Batch File for Starting Wan2GP”:

https://youtu.be/cZnCNoCAvoc?si=9v23zzw5b2l9u9nW

Step 2.  I use presets to help speed up the workflow. If you want to learn how to use presets go to the previous episode on Wan2GP – Presets at this link:

https://youtu.be/b3lX-5EQia4

Select a preset, or select Wan 2.1 from the drop-down menu on the top left and then MultiTalk 480p from the drop-down menu on the top right.

Note: A preset loads everything you had configured when you saved it, with the exception of the start image and the audio file. That is why I use them.

Step 3.  Load an image into the Start video with image window by clicking on Drop Media Here. Always use the highest-resolution image you can. My image has two characters, which makes it easier for MultiTalk to understand who is talking.

 

Step 4.  Select Video Mask Creator from the topmost selection menu. Select Image and load the same image into the Upload image box. Then click the Load Image bar at the bottom of the image, and it opens a new window called Step 2: Add masks.

Click on the character you are trying to isolate; it will turn blue. In this video I click on One’s head only. Once you have the character selected, hit the Add Mask button at the bottom of the Step 2 window, then hit the Image Matting button at the bottom of that same window. It will open two more windows at the bottom: Foreground output and Mask.

Step 5.  In the video I explain what the Bounding Box numbers mean.

Mask BBox Info (Left:Top:Right:Bottom) 47:18:61:40

In my example the numbers mean that the mask’s left edge is 47% in from the left side of the image, its top edge is 18% down from the top, its right edge is 61% in from the left side, and its bottom edge is 40% down from the top. This is the bounding box around the character that the MultiTalk model will make speak the included audio file.

In MultiTalk multi-person speaking there is a place to put this BBox number on the Video Generator tab that we started in.
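If you prefer to measure the box yourself from pixel coordinates, the percentage format can be computed in a few lines. A minimal sketch (the helper name, rounding, and example pixel values are my own, not part of Wan2GP):

```python
def bbox_to_percent(left, top, right, bottom, width, height):
    """Convert a pixel bounding box to Wan2GP's Left:Top:Right:Bottom
    string, where each value is a percentage of image width or height."""
    pct = lambda v, total: round(100 * v / total)
    return f"{pct(left, width)}:{pct(top, height)}:{pct(right, width)}:{pct(bottom, height)}"

# A hypothetical 1280x720 image with a character box from (602, 130) to (781, 288):
print(bbox_to_percent(602, 130, 781, 288, 1280, 720))  # -> 47:18:61:40
```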

Step 6. Click the “Set to Control Image & Mask” button, then copy the bounding box number to take back to the Video Generator screen.

 

Step 7. Now return to the Video Generator tab at the top left. Paste the bounding box number into the Speaker locations box. This field is very sensitive: you must use the format

47:18:61:40 (Left:Top:Right:Bottom) or 47:61 (Left:Right), with a space between the entries for the two speaking characters. In my example I end up with

7:14:40:90 47:18:61:40
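Because the field is so format-sensitive, it can help to sanity-check the string before generating. A hedged sketch: the regular expression below just encodes the two accepted shapes described above (four or two colon-separated numbers per speaker, entries space-separated), not any official Wan2GP validation rule:

```python
import re

# One entry per speaker: Left:Top:Right:Bottom or Left:Right, space-separated.
ENTRY = re.compile(r"^\d{1,3}(:\d{1,3}){1}(:\d{1,3}:\d{1,3})?$")

def check_speaker_locations(value):
    """Return True if every space-separated entry matches the expected shape."""
    return all(ENTRY.match(entry) for entry in value.split(" "))

print(check_speaker_locations("7:14:40:90 47:18:61:40"))   # True
print(check_speaker_locations("7:14:40:90,47:18:61:40"))   # False: comma, not space
```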

 

Step 8. Repeat Steps 4 through 6 for the next character: go back to the Mask Creator tab, delete everything from the first character, and do it again for the next character.

Step 9.   I choose the Two Speakers, assumed to be played in a row option from the Voices drop-down menu. In the video I briefly explain the other options. Now I select my audio files to add to the generation, using the scissor icon to trim them down in length.

Note: In the video I trim my example down to 2 seconds; 3 seconds is optimal for my VRAM because it keeps the length down to 81 frames. You can generate whatever length you want if you have enough VRAM.

Also, in this video I use 480p, but if you keep the video length to 3 seconds or under, I have been able to do 720p even with an 8GB card.
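The seconds-to-frames relationship can be estimated in a couple of lines. This sketch assumes MultiTalk renders at roughly 25 frames per second and that Wan-family models expect frame counts of the form 4n+1 (both are my assumptions here, not documented Wan2GP behavior):

```python
import math

def frame_count(seconds, fps=25):
    """Round a clip length up to the nearest 4n+1 frame count
    (assumed Wan constraint; fps=25 is also an assumption)."""
    n = math.ceil(seconds * fps)
    return n + (1 - n) % 4

print(frame_count(2))  # 53 frames for a 2-second clip
print(frame_count(3))  # 77 frames; just over 3 seconds reaches the 81-frame setting
```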

Step 10. Check the rest of the settings:

Number of Frames: 81

Number of inference steps: 10

Guidance(CFG): 1

Audio Guidance: 5

Sampler: unipc

Shift Scale: 2

Loras: Wan2.1_I2V_14B_FusionX_LoRA

Step Skipping: Tea Cache x1.5 speed up

Prompts: 3d Pixar Style characters. The Snake is looking at the camera and speaking the first audio file provided. The character on the right is looking at the snake and speaking the second audio file. No new backgrounds. No new characters added. 

Negative prompt: No new characters. No new backgrounds. No new body parts. No warped body parts. No red mouths or lips. No added features to characters.

Once you have the settings adjusted to your liking, hit the Generate button. In the video my generation takes 8 minutes 52 seconds and has a weird lighting effect on One’s hand at the end. I then go over some optimizations and some limitations MultiTalk has. I’m sure much of it has to do with my low VRAM and trying to do generations as quickly as possible.