Speaker Selection — API

Speaker selection helps you target the right face when a clip or image contains multiple people. You can either let Sync Labs auto-detect the active speaker (video only) or collect a user-selected point in your UI and pass it in the active_speaker_detection object on /v2/generate. To use speaker selection in the web app, see the guide.

When to use what

Auto-detect (video only): fastest setup; best for clips with a single or obvious speaker. Set auto_detect: true and skip the manual fields.
Manual selection (video or image): best for multiple people or when you want deterministic control. Provide a reference frame and a point on the speaker’s face. For video, you can instead supply bounding boxes if you already have detections for every frame.

Image inputs (sync-3 only) support manual speaker selection but not auto-detect. The backend rejects auto_detect: true for images.

Workflow: selecting a speaker in video

Capture a reference frame

Seek the video to a frame where the target speaker’s face is visible. Keep track of the frame index you show in the UI.

Collect a point on the face

Record the [x, y] coordinates (in the same coordinate system/pixels as your extracted frame) for the clicked point on the speaker’s face. Keep the frame index and coordinates paired.
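
The two steps above can be sketched as a small pure function that turns a click on the displayed player into a frame index and native-pixel coordinates. This is only an illustration: the helper name, its input shape, and the constant-frame-rate estimate of the frame index are assumptions, not part of the Sync API. It also assumes the player shows the full frame without letterboxing or cropping.

```typescript
// Hypothetical helper (not part of the Sync SDK): maps a click on a scaled
// video player to native-pixel coordinates plus the frame index on screen.
interface DisplayedClick {
  clickX: number;        // click offset from the player's left edge, in CSS pixels
  clickY: number;        // click offset from the player's top edge, in CSS pixels
  displayWidth: number;  // rendered player width, in CSS pixels
  displayHeight: number; // rendered player height, in CSS pixels
  nativeWidth: number;   // width of the extracted frame, in pixels
  nativeHeight: number;  // height of the extracted frame, in pixels
  currentTime: number;   // playback position in seconds
  fps: number;           // assumed constant frame rate of the clip
}

interface SpeakerPoint {
  frameNumber: number;
  coordinates: [number, number];
}

function toSpeakerPoint(c: DisplayedClick): SpeakerPoint {
  const scaleX = c.nativeWidth / c.displayWidth;
  const scaleY = c.nativeHeight / c.displayHeight;
  // Keep the point inside the frame even if the click lands on the border.
  const clamp = (v: number, max: number): number => Math.min(Math.max(v, 0), max);
  return {
    // Assumption: constant frame rate, so frame index = floor(time * fps).
    frameNumber: Math.floor(c.currentTime * c.fps),
    coordinates: [
      clamp(Math.round(c.clickX * scaleX), c.nativeWidth - 1),
      clamp(Math.round(c.clickY * scaleY), c.nativeHeight - 1),
    ],
  };
}
```

In a browser, the native size would typically come from the video element's videoWidth and videoHeight, and the click offset from the event position minus the element's getBoundingClientRect() origin. Returning the frame index and coordinates together keeps them paired.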

Optional: provide bounding boxes instead

If you already ran face detection over the video, send bounding_boxes as a per-frame array. Each entry is [x1, y1, x2, y2] (top-left to bottom-right) or null when no face is present. This replaces the need for frame_number + coordinates.

For long videos with many frames, use bounding_boxes_url instead to point to an external JSON file — this avoids large request payloads.
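
If your detector only reports the frames where it found the target face, you need to expand that into one entry per frame. A minimal sketch, assuming a hypothetical FrameDetection record shape and that you know the clip's total frame count:

```typescript
type FaceBox = [number, number, number, number]; // [x1, y1, x2, y2]

// Hypothetical detector output: one record per frame where the target face was found.
interface FrameDetection {
  frame: number;
  box: FaceBox;
}

// Expands sparse detections into a per-frame array, with null where no face was found.
function toPerFrameBoxes(detections: FrameDetection[], totalFrames: number): (FaceBox | null)[] {
  const boxes: (FaceBox | null)[] = new Array<FaceBox | null>(totalFrames).fill(null);
  for (const { frame, box } of detections) {
    if (!Number.isInteger(frame) || frame < 0 || frame >= totalFrames) {
      throw new RangeError(`frame ${frame} is outside 0..${totalFrames - 1}`);
    }
    const [x1, y1, x2, y2] = box;
    if (x2 <= x1 || y2 <= y1) {
      throw new RangeError(`box for frame ${frame} must run top-left to bottom-right`);
    }
    boxes[frame] = box;
  }
  return boxes;
}
```

The result can be sent inline as bounding_boxes or written to the file referenced by bounding_boxes_url.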

Send generation request

Set options.active_speaker_detection with either frame_number + coordinates or, when you already have detections, bounding_boxes covering every frame (frame_number and coordinates are not needed in that case). Keep auto_detect set to false so that the manual selection is honored.
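
One way to keep these alternatives from being mixed up in client code is to model them as a discriminated union and derive the request fragment from it. The SpeakerSelection type and toActiveSpeakerDetection function below are hypothetical names used for illustration. Only the snake_case field names in the returned object come from this page.

```typescript
type SelectionBox = [number, number, number, number]; // [x1, y1, x2, y2]

// Hypothetical client-side model: exactly one selection mode at a time.
type SpeakerSelection =
  | { kind: "auto" }
  | { kind: "point"; frameNumber: number; coordinates: [number, number] }
  | { kind: "boxes"; boundingBoxes: (SelectionBox | null)[] }
  | { kind: "boxesUrl"; url: string };

// Builds the value for options.active_speaker_detection in the HTTP body.
function toActiveSpeakerDetection(sel: SpeakerSelection): Record<string, unknown> {
  switch (sel.kind) {
    case "auto":
      return { auto_detect: true };
    case "point":
      return { auto_detect: false, frame_number: sel.frameNumber, coordinates: sel.coordinates };
    case "boxes":
      return { auto_detect: false, bounding_boxes: sel.boundingBoxes };
    case "boxesUrl":
      return { auto_detect: false, bounding_boxes_url: sel.url };
  }
}
```

With this shape, a manual point can never be sent together with auto_detect: true, because each variant carries only its own fields.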

Workflow: selecting a speaker in images (sync-3)

When your image contains multiple faces, you can specify which one to lipsync by providing manual coordinates.

Run face detection on the image

Detect faces in your input image and display bounding boxes in your UI. Use the image’s native pixel dimensions as the coordinate space.

Collect a point on the selected face

Record the [x, y] coordinates (in the image’s native pixel space) for the center or clicked point on the target speaker’s face.
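
If the user picks a face by clicking its bounding box rather than a precise point, the box center is a reasonable point to send. A small sketch, where boxCenter is a hypothetical helper and the box values are made up for illustration:

```typescript
type DetectedFaceBox = [number, number, number, number]; // [x1, y1, x2, y2] in native image pixels

// Returns the integer pixel at the center of a detected face box.
function boxCenter([x1, y1, x2, y2]: DetectedFaceBox): [number, number] {
  return [Math.round((x1 + x2) / 2), Math.round((y1 + y2) / 2)];
}

// Selection for the face the user picked; images always use frame 0.
const imageSelection = {
  autoDetect: false,
  frameNumber: 0,
  coordinates: boxCenter([412, 284, 612, 484]), // [512, 384]
};
```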

Send generation request

Set options.active_speaker_detection with frame_number: 0 and coordinates: [x, y]. Do not set auto_detect: true — the backend rejects auto-detect for image inputs.

Aspect	Video	Image (sync-3)
Auto-detect	Yes	No
Manual coordinates	Yes	Yes
Bounding boxes	Yes	No
`frame_number`	Current frame index	Always `0`
Coordinate space	Video dimensions	Image native pixels

ActiveSpeaker DTO fields

See the full API reference for active_speaker_detection.

auto_detect (boolean, default false): let Sync Labs pick the active speaker automatically.
v3 (boolean, optional): enable v3 of active speaker detection (ASD).
frame_number (number): frame index that corresponds to the provided coordinates.
coordinates ([x, y]): reference point on the speaker’s face in frame_number.
bounding_boxes ((number[] | null)[], optional): per-frame array of bounding boxes across the video. Each entry corresponds to one frame: [x1, y1, x2, y2] (x1, y1 = top-left; x2, y2 = bottom-right) for the detected face, or null if there is no box for that frame. Use this instead of frame_number + coordinates when you have already run detection over the clip.
bounding_boxes_url (string, optional): URL pointing to a JSON file containing the bounding boxes. Use this instead of inline bounding_boxes to avoid large request payloads. The JSON file must contain a bounding_boxes array with one entry per frame, matching the format above.
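
The field list above can be mirrored as a TypeScript type for the raw HTTP body, together with a pre-flight check that catches common mistakes before the request is sent. This is a sketch, not an official SDK type: the interface name, the helper, and the specific checks are assumptions based only on the descriptions on this page.

```typescript
type DtoBox = [number, number, number, number]; // [x1, y1, x2, y2]

// Wire-format (snake_case) view of the documented fields.
interface ActiveSpeakerDetectionDto {
  auto_detect?: boolean;
  v3?: boolean;
  frame_number?: number;
  coordinates?: [number, number];
  bounding_boxes?: (DtoBox | null)[];
  bounding_boxes_url?: string;
}

// Hypothetical client-side check; the backend remains the source of truth.
function findDtoProblems(dto: ActiveSpeakerDetectionDto): string[] {
  const problems: string[] = [];
  const hasFrame = dto.frame_number !== undefined;
  const hasPoint = dto.coordinates !== undefined;
  // The reference point is only meaningful together with its frame.
  if (hasFrame !== hasPoint) {
    problems.push("frame_number and coordinates must be provided together");
  }
  (dto.bounding_boxes ?? []).forEach((box, frame) => {
    if (box !== null && (box[2] <= box[0] || box[3] <= box[1])) {
      problems.push(`bounding_boxes[${frame}] does not run top-left to bottom-right`);
    }
  });
  return problems;
}
```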

Request examples

TypeScript SDK

import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

const response = await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      autoDetect: false,
      frameNumber: 240,
      coordinates: [640, 360]
    }
  }
});

cURL (HTTP)

curl -X POST https://api.sync.so/v2/generate \
  -H "x-api-key: $SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lipsync-2",
    "input": [
      { "type": "video", "url": "https://assets.sync.so/docs/example-video.mp4" },
      { "type": "audio", "url": "https://assets.sync.so/docs/example-audio.wav" }
    ],
    "options": {
      "active_speaker_detection": {
        "auto_detect": false,
        "frame_number": 240,
        "coordinates": [640, 360]
      }
    }
  }'

TypeScript SDK (bounding boxes instead of coordinates)

import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      autoDetect: false,
      // boundingBoxes aligned to video frames; null where no box is present.
      boundingBoxes: [
        null,                               // frame 0
        [520, 280, 760, 520],               // frame 1 (speaker A) -> [x1,y1,x2,y2]
        [120, 260, 320, 500],               // frame 2 (speaker B)
        null                                // frame 3
        // ...one entry per frame in the clip
      ]
    }
  }
});

Image input with speaker selection (sync-3)

import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

await sync.generations.create({
  input: [
    { type: "image", url: "https://assets.sync.so/docs/example-image.jpg" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "sync-3",
  options: {
    activeSpeakerDetection: {
      autoDetect: false,
      frameNumber: 0,
      coordinates: [512, 384]  // center of detected face in image pixels
    }
  }
});

Using bounding_boxes_url (for large payloads)

For long videos, host the bounding boxes in an external JSON file to keep request payloads small. The JSON must follow this format:

{
  "bounding_boxes": [[520, 280, 760, 520], null, [120, 260, 320, 500], null]
}

Each array entry corresponds to one frame: [x1, y1, x2, y2] for a detected face, or null when no face is present. The number of entries must match the total frame count.
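
Because the entry count has to match the frame count, it can help to check that before uploading the file. A minimal sketch, assuming you already know the clip's total frame count. The serializeBoundingBoxesFile helper is hypothetical, and hosting the resulting string at a URL is left to your own storage.

```typescript
type FileBox = [number, number, number, number]; // [x1, y1, x2, y2]

// Produces the JSON text for the file referenced by bounding_boxes_url.
function serializeBoundingBoxesFile(boxes: (FileBox | null)[], totalFrames: number): string {
  if (boxes.length !== totalFrames) {
    throw new Error(`expected ${totalFrames} entries, got ${boxes.length}`);
  }
  return JSON.stringify({ bounding_boxes: boxes });
}
```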

import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      autoDetect: false,
      boundingBoxesUrl: "https://your-cdn.com/bounding-boxes.json"
    }
  }
});