Speaker Selection — API

Speaker selection helps you target the right face when a clip or image contains multiple people. You can either let Sync Labs auto-detect the active speaker (video only) or provide a user-selected point from your UI and forward it via the active_speaker_detection DTO on /v2/generate. For using speaker selection in the web app, see the guide.

When to use what

  • Auto-detect (video only): fastest setup; best for clips with a single or obvious speaker. Set auto_detect: true and omit the manual fields (frame_number, coordinates, bounding_boxes).
  • Manual selection (video or image): best for multiple people or when you want deterministic control. Provide a reference frame and a point on the speaker’s face or, for video, supply per-frame bounding boxes if you already have detections for the clip.

Image inputs (sync-3 only) support manual speaker selection but not auto-detect. The backend rejects auto_detect: true for images.
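The choice between the two modes can be sketched as a small helper that builds the active_speaker_detection object. This is a minimal sketch: the type and function names (InputKind, ManualPoint, buildSpeakerOptions) are hypothetical and not part of the Sync SDK; only the snake_case field names come from this page.

```typescript
// Hypothetical helper types; only the snake_case field names come from this page.
type InputKind = "video" | "image";

interface ManualPoint {
  frameNumber: number;
  coordinates: [number, number];
}

interface ActiveSpeakerDetection {
  auto_detect: boolean;
  frame_number?: number;
  coordinates?: [number, number];
}

// Build the active_speaker_detection object for a request.
// Pass a point for manual selection, or omit it to request auto-detect.
function buildSpeakerOptions(kind: InputKind, point?: ManualPoint): ActiveSpeakerDetection {
  if (point === undefined) {
    if (kind === "image") {
      // Fail early on the client, mirroring the documented backend rejection.
      throw new Error("Image inputs require a manually selected point");
    }
    return { auto_detect: true };
  }
  // Image inputs always use frame 0; video inputs use the frame shown in the UI.
  const frameNumber = kind === "image" ? 0 : point.frameNumber;
  return { auto_detect: false, frame_number: frameNumber, coordinates: point.coordinates };
}
```

Throwing on the client for image inputs without a point avoids a round trip that the backend would reject anyway.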

Workflow: selecting a speaker in video

1. Capture a reference frame

Seek the video to a frame where the target speaker’s face is visible. Keep track of the frame index you show in the UI.

2. Collect a point on the face

Record the [x, y] coordinates of the clicked point on the speaker’s face, in pixels of the extracted frame’s coordinate system. Keep the frame index and coordinates paired.

3. Optional: provide bounding boxes instead

If you already ran face detection over the video, send bounding_boxes as a per-frame array. Each entry is [x1, y1, x2, y2] (top-left to bottom-right) or null when no face is present. This replaces the need for frame_number + coordinates.

For long videos with many frames, use bounding_boxes_url instead to point to an external JSON file — this avoids large request payloads.

4. Send generation request

Set options.active_speaker_detection with either frame_number + coordinates, or bounding_boxes for all frames when you already have detections (no frame_number/coordinates needed in that case). Leave auto_detect false when you want to honor the manual selection.
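The first two steps of this workflow can be sketched as a pure function that turns a click on a scaled video player into a frame index and frame-pixel coordinates. This is only a sketch under stated assumptions: the video fills its element with no letterboxing, the frame rate is constant and known, and frames are indexed from 0. The names ClickInfo and toSpeakerSelection are hypothetical.

```typescript
// All names here are hypothetical; the output field names come from this page.
interface ClickInfo {
  clickX: number;        // click position relative to the displayed video's top-left corner
  clickY: number;
  displayWidth: number;  // rendered size of the player in CSS pixels
  displayHeight: number;
  videoWidth: number;    // native frame size in pixels
  videoHeight: number;
  currentTime: number;   // playback position in seconds
  fps: number;           // assumed constant frame rate
}

interface ManualSelection {
  auto_detect: boolean;
  frame_number: number;
  coordinates: [number, number];
}

function toSpeakerSelection(c: ClickInfo): ManualSelection {
  const clamp = (v: number, max: number): number => Math.min(Math.max(v, 0), max);
  // Scale from displayed size to native frame pixels, then keep the point inside the frame.
  const x = clamp(Math.round((c.clickX / c.displayWidth) * c.videoWidth), c.videoWidth - 1);
  const y = clamp(Math.round((c.clickY / c.displayHeight) * c.videoHeight), c.videoHeight - 1);
  // Approximate the frame index from the playback time (assumes constant fps).
  const frameNumber = Math.floor(c.currentTime * c.fps);
  return { auto_detect: false, frame_number: frameNumber, coordinates: [x, y] };
}
```

In a browser, the inputs would typically come from the video element's videoWidth, videoHeight, and currentTime properties and from getBoundingClientRect(). If your pipeline extracts frames server-side, prefer the frame index it reports over a time-based estimate.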

Workflow: selecting a speaker in images (sync-3)

When your image contains multiple faces, you can specify which one to lipsync by providing manual coordinates.

1. Run face detection on the image

Detect faces in your input image and display bounding boxes in your UI. Use the image’s native pixel dimensions as the coordinate space.

2. Collect a point on the selected face

Record the [x, y] coordinates (in the image’s native pixel space) for the center or clicked point on the target speaker’s face.

3. Send generation request

Set options.active_speaker_detection with frame_number: 0 and coordinates: [x, y]. Do not set auto_detect: true — the backend rejects auto-detect for image inputs.
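As an illustration of the image workflow, the sketch below picks the detected face that contains the user's click and uses the center of its box as the reference point. It assumes your own face detector returns boxes as [x1, y1, x2, y2] in the image's native pixels; the names Box, findClickedFace, and imageSpeakerOptions are hypothetical.

```typescript
// [x1, y1, x2, y2] in the image's native pixel space (output of your own detector).
type Box = [number, number, number, number];

// Return the first detected face whose box contains the click, or null if none does.
function findClickedFace(boxes: Box[], click: [number, number]): Box | null {
  const [cx, cy] = click;
  for (const box of boxes) {
    const [x1, y1, x2, y2] = box;
    if (cx >= x1 && cx <= x2 && cy >= y1 && cy <= y2) {
      return box;
    }
  }
  return null;
}

// Build the manual selection for an image: frame 0 and the center of the chosen face.
function imageSpeakerOptions(box: Box): { auto_detect: boolean; frame_number: number; coordinates: [number, number] } {
  const [x1, y1, x2, y2] = box;
  const center: [number, number] = [Math.round((x1 + x2) / 2), Math.round((y1 + y2) / 2)];
  return { auto_detect: false, frame_number: 0, coordinates: center };
}
```

Using the box center rather than the raw click keeps the point on the face even when the user clicks near the edge of the box.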

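| Aspect             | Video               | Image (sync-3)      |
|--------------------|---------------------|---------------------|
| Auto-detect        | Yes                 | No                  |
| Manual coordinates | Yes                 | Yes                 |
| Bounding boxes     | Yes                 | No                  |
| frame_number       | Current frame index | Always 0            |
| Coordinate space   | Video dimensions    | Image native pixels |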

ActiveSpeaker DTO fields

See the full API reference for active_speaker_detection.

  • auto_detect (boolean, default false): let Sync Labs pick the active speaker automatically.
  • v3 (boolean, optional): enable version 3 of active speaker detection (ASD v3).
  • frame_number (number): frame index that corresponds to the provided coordinates.
  • coordinates ([x, y]): reference point on the speaker’s face in frame_number.
  • bounding_boxes ((number[] | null)[], optional): per-frame array of bounding boxes across the video. Each entry corresponds to one frame: set it to [x1, y1, x2, y2] (x1,y1 = top-left; x2,y2 = bottom-right) for the detected face, or to null if there is no box for that frame. Use this instead of frame_number + coordinates when you have already run detection over the clip.
  • bounding_boxes_url (string, optional): URL pointing to a JSON file containing the bounding boxes. Use this instead of inline bounding_boxes to avoid large request payloads. The JSON file must contain a bounding_boxes array with one entry per frame, matching the format above.
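The fields above can be written down as a TypeScript type with a few client-side sanity checks. This is a sketch, not the SDK's own type: the interface name and the checkSpeakerDto function are hypothetical, and the checks only reflect how this page describes the fields, not confirmed server-side validation.

```typescript
type BoundingBox = [number, number, number, number]; // [x1, y1, x2, y2]

// Hypothetical shape mirroring the field list on this page.
interface ActiveSpeakerDetectionDto {
  auto_detect?: boolean;
  v3?: boolean;
  frame_number?: number;
  coordinates?: [number, number];
  bounding_boxes?: (BoundingBox | null)[];
  bounding_boxes_url?: string;
}

// Return a list of human-readable problems; an empty list means no issue was found.
function checkSpeakerDto(dto: ActiveSpeakerDetectionDto): string[] {
  const problems: string[] = [];
  const hasFrame = dto.frame_number !== undefined;
  const hasCoords = dto.coordinates !== undefined;
  if (hasFrame !== hasCoords) {
    problems.push("frame_number and coordinates must be provided together");
  }
  if (dto.bounding_boxes !== undefined && dto.bounding_boxes_url !== undefined) {
    problems.push("use either bounding_boxes or bounding_boxes_url, not both");
  }
  const boxes = dto.bounding_boxes !== undefined ? dto.bounding_boxes : [];
  for (let i = 0; i < boxes.length; i++) {
    const box = boxes[i];
    // A box runs from top-left (x1, y1) to bottom-right (x2, y2).
    if (box !== null && (box[2] <= box[0] || box[3] <= box[1])) {
      problems.push("frame " + i + ": expected x1 < x2 and y1 < y2");
    }
  }
  return problems;
}
```

Running such checks before sending a request surfaces mistakes like swapped box corners without waiting for an API error.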

Request examples

Manual selection with frame_number and coordinates (TypeScript SDK):

    import { SyncClient } from "@sync.so/sdk";

    const sync = new SyncClient();

    const response = await sync.generations.create({
      input: [
        { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
        { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
      ],
      model: "lipsync-2",
      options: {
        activeSpeakerDetection: {
          autoDetect: false,
          frameNumber: 240,
          coordinates: [640, 360]
        }
      }
    });

Manual selection with frame_number and coordinates (cURL):

    curl -X POST https://api.sync.so/v2/generate \
      -H "x-api-key: $SYNC_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "lipsync-2",
        "input": [
          { "type": "video", "url": "https://assets.sync.so/docs/example-video.mp4" },
          { "type": "audio", "url": "https://assets.sync.so/docs/example-audio.wav" }
        ],
        "options": {
          "active_speaker_detection": {
            "auto_detect": false,
            "frame_number": 240,
            "coordinates": [640, 360]
          }
        }
      }'

Per-frame bounding boxes (TypeScript SDK):

    import { SyncClient } from "@sync.so/sdk";

    const sync = new SyncClient();

    await sync.generations.create({
      input: [
        { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
        { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
      ],
      model: "lipsync-2",
      options: {
        activeSpeakerDetection: {
          autoDetect: false,
          // boundingBoxes aligned to video frames; null where no box is present.
          boundingBoxes: [
            null,                 // frame 0
            [520, 280, 760, 520], // frame 1 (speaker A) -> [x1,y1,x2,y2]
            [120, 260, 320, 500], // frame 2 (speaker B)
            null                  // frame 3
            // ...one entry per frame in the clip
          ]
        }
      }
    });

Image input with manual coordinates (TypeScript SDK, sync-3):

    import { SyncClient } from "@sync.so/sdk";

    const sync = new SyncClient();

    await sync.generations.create({
      input: [
        { type: "image", url: "https://assets.sync.so/docs/example-image.jpg" },
        { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
      ],
      model: "sync-3",
      options: {
        activeSpeakerDetection: {
          autoDetect: false,
          frameNumber: 0,
          coordinates: [512, 384] // center of detected face in image pixels
        }
      }
    });

For long videos, host the bounding boxes in an external JSON file to keep request payloads small. The JSON must follow this format:

    {
      "bounding_boxes": [[520, 280, 760, 520], null, [120, 260, 320, 500], null]
    }

Each array entry corresponds to one frame: [x1, y1, x2, y2] for a detected face, or null when no face is present. The number of entries must match the total frame count.
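One way to produce such a file is to expand sparse detections into a dense per-frame array before serializing. The sketch below assumes your detector reports a frame index and a box for each frame where it found the target face, and that you know the clip's total frame count; the names Detection, toPerFrameBoxes, and toBoundingBoxesJson are hypothetical.

```typescript
type BoundingBox = [number, number, number, number]; // [x1, y1, x2, y2]

// Hypothetical detector output: one record per frame where the target face was found.
interface Detection {
  frame: number;
  box: BoundingBox;
}

// Expand sparse detections into one entry per frame, with null where no face was found.
function toPerFrameBoxes(detections: Detection[], totalFrames: number): (BoundingBox | null)[] {
  const boxes: (BoundingBox | null)[] = [];
  for (let i = 0; i < totalFrames; i++) {
    boxes.push(null);
  }
  for (const d of detections) {
    if (Math.floor(d.frame) !== d.frame || d.frame < 0 || d.frame >= totalFrames) {
      throw new RangeError("frame " + d.frame + " is outside 0.." + (totalFrames - 1));
    }
    boxes[d.frame] = d.box;
  }
  return boxes;
}

// Serialize in the file format described above: a single bounding_boxes array.
function toBoundingBoxesJson(detections: Detection[], totalFrames: number): string {
  return JSON.stringify({ bounding_boxes: toPerFrameBoxes(detections, totalFrames) });
}
```

Building the array from the frame count, rather than from the detections, guarantees that the number of entries matches the clip even when the last frames contain no face.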

Request referencing the hosted file (TypeScript SDK):

    import { SyncClient } from "@sync.so/sdk";

    const sync = new SyncClient();

    await sync.generations.create({
      input: [
        { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
        { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
      ],
      model: "lipsync-2",
      options: {
        activeSpeakerDetection: {
          autoDetect: false,
          boundingBoxesUrl: "https://your-cdn.com/bounding-boxes.json"
        }
      }
    });