Speaker Selection — API
Speaker selection helps you target the right face when a clip or image contains multiple people. You can either let Sync Labs auto-detect the active speaker (video only) or capture a user-selected point in your UI and forward it in the active_speaker_detection DTO on /v2/generate. To use speaker selection in the web app, see the guide.
When to use what
- Auto-detect (video only): fastest setup; best for single/obvious speaker clips. Set auto_detect: true and skip manual fields.
- Manual selection (video or image): best for multiple people or when you want deterministic control. Provide a reference frame and a point on the speaker’s face, or supply bounding boxes if you already have detections for that frame.
Image inputs (sync-3 only) support manual speaker selection but not auto-detect. The backend rejects auto_detect: true for images.
Workflow: selecting a speaker in video
Capture a reference frame
Seek the video to a frame where the target speaker’s face is visible. Keep track of the frame index you show in the UI.
Collect a point on the face
Record the [x, y] coordinates (in the same coordinate system/pixels as your extracted frame) for the clicked point on the speaker’s face. Keep the frame index and coordinates paired.
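If your UI shows the frame scaled down, the click has to be mapped back to the pixel space of the extracted frame before it is paired with the frame index. The sketch below is one way to do that. The helper name toSpeakerPoint, the Size type, and the rounding to whole pixels are assumptions for illustration, and it assumes the preview is scaled without letterboxing.

```typescript
// Hypothetical helper (not part of the Sync Labs SDK): convert a click on a
// scaled preview into frame_number + coordinates in native frame pixels.
interface Size {
  width: number;
  height: number;
}

interface SpeakerPoint {
  frame_number: number;
  coordinates: [number, number];
}

function toSpeakerPoint(
  click: { x: number; y: number }, // relative to the preview's top-left corner
  displayed: Size,                 // size of the preview element in the UI
  native: Size,                    // size of the extracted frame
  frameNumber: number,             // index of the frame shown in the UI
): SpeakerPoint {
  const scaleX = native.width / displayed.width;
  const scaleY = native.height / displayed.height;
  // Keep the point inside the frame even if the click lands on the edge.
  const clamp = (v: number, max: number): number => Math.min(Math.max(v, 0), max);
  // Rounding to integers is an assumption; the source does not say whether
  // fractional pixels are accepted.
  const x = clamp(Math.round(click.x * scaleX), native.width - 1);
  const y = clamp(Math.round(click.y * scaleY), native.height - 1);
  return { frame_number: frameNumber, coordinates: [x, y] };
}

// A click at the horizontal center of a 640x360 preview of a 1920x1080 frame.
const point = toSpeakerPoint(
  { x: 320, y: 90 },
  { width: 640, height: 360 },
  { width: 1920, height: 1080 },
  42,
);
console.log(JSON.stringify(point));
```

Returning the frame index and the coordinates in one object keeps the two values paired until they are placed in the request.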
Optional: provide bounding boxes instead
If you already ran face detection over the video, send bounding_boxes as a per-frame array. Each entry is [x1, y1, x2, y2] (top-left to bottom-right) or null when no face is present. This replaces the need for frame_number + coordinates.
For long videos with many frames, use bounding_boxes_url instead to point to an external JSON file — this avoids large request payloads.
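Face detectors often return results only for frames where a face was found, while bounding_boxes expects one entry per frame. A minimal sketch of filling the gaps with null follows. The function name toPerFrameBoxes and the 0-based frame indexing are assumptions, not something the source specifies.

```typescript
// [x1, y1, x2, y2]: top-left to bottom-right, as described above.
type Box = [number, number, number, number];

// Hypothetical helper: expand sparse detections (keyed by frame index) into a
// dense per-frame array, using null for frames without a face.
function toPerFrameBoxes(
  detections: { [frame: number]: Box | undefined },
  totalFrames: number,
): (Box | null)[] {
  const boxes: (Box | null)[] = [];
  for (let frame = 0; frame < totalFrames; frame++) {
    const box = detections[frame];
    boxes.push(box !== undefined ? box : null);
  }
  return boxes;
}

// Faces found on frames 0 and 2 of a four-frame clip.
const boundingBoxes = toPerFrameBoxes(
  { 0: [100, 50, 220, 200], 2: [104, 52, 224, 204] },
  4,
);
console.log(JSON.stringify(boundingBoxes));
```

The resulting array can be sent inline as bounding_boxes or written to a JSON file and referenced through bounding_boxes_url.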
Workflow: selecting a speaker in images (sync-3)
When your image contains multiple faces, you can specify which one to lipsync by providing manual coordinates.
Run face detection on the image
Detect faces in your input image and display bounding boxes in your UI. Use the image’s native pixel dimensions as the coordinate space.
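Once the boxes are on screen, a click can be resolved to one detected face and reduced to a single point on it. The sketch below shows one possible approach: hit-test the click against the boxes, prefer the smallest box when several overlap, and use the box center as the point. The helper names and the choice of the box center are assumptions, not requirements of the API.

```typescript
// [x1, y1, x2, y2] in the image's native pixel space.
type Box = [number, number, number, number];

const area = (b: Box): number => (b[2] - b[0]) * (b[3] - b[1]);

// Hypothetical helper: return the detected face under the click, or null.
// When boxes overlap, the smallest one containing the click wins.
function pickFace(boxes: Box[], click: [number, number]): Box | null {
  const [cx, cy] = click;
  const hits = boxes.filter(
    ([x1, y1, x2, y2]) => cx >= x1 && cx <= x2 && cy >= y1 && cy <= y2,
  );
  if (hits.length === 0) return null;
  return hits.reduce((best, b) => (area(b) < area(best) ? b : best));
}

// Hypothetical helper: center of a box, rounded to whole pixels (assumption).
function boxCenter([x1, y1, x2, y2]: Box): [number, number] {
  return [Math.round((x1 + x2) / 2), Math.round((y1 + y2) / 2)];
}

const faces: Box[] = [
  [100, 80, 300, 320],
  [500, 90, 700, 330],
];
const selected = pickFace(faces, [610, 200]);
const coordinates = selected ? boxCenter(selected) : null;
console.log(JSON.stringify(coordinates));
```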
ActiveSpeaker DTO fields
See the full API reference for active_speaker_detection.
- auto_detect (boolean, default false): let Sync Labs pick the active speaker automatically.
- v3 (boolean, optional): enable ASD v3.
- frame_number (number): frame index that corresponds to the provided coordinates.
- coordinates ([x, y]): reference point on the speaker’s face in frame_number.
- bounding_boxes ((number[] | null)[], optional): per-frame array of bounding boxes across the video. Each entry corresponds to that frame: set to [x1, y1, x2, y2] (x1, y1 = top-left; x2, y2 = bottom-right) for the detected face, or null if no box for that frame. Use this instead of frame_number + coordinates when you already run detection over the clip.
- bounding_boxes_url (string, optional): URL pointing to a JSON file containing the bounding boxes. Use this instead of inline bounding_boxes to avoid large request payloads. The JSON file must contain a bounding_boxes array with one entry per frame, matching the format above.
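The fields above can be mirrored in a local type so that mistakes surface before a request is sent. The following is a sketch, not the SDK's own type. The interface name, the optionality of each field, and the checkActiveSpeaker helper are assumptions, and the checks only restate rules given on this page rather than the backend's full validation.

```typescript
type Box = [number, number, number, number]; // [x1, y1, x2, y2]

// Local mirror of the documented fields; which fields are optional is an assumption.
interface ActiveSpeakerDetection {
  auto_detect?: boolean;
  v3?: boolean;
  frame_number?: number;
  coordinates?: [number, number];
  bounding_boxes?: (Box | null)[];
  bounding_boxes_url?: string;
}

// Hypothetical pre-flight check; returns a list of human-readable problems.
function checkActiveSpeaker(
  dto: ActiveSpeakerDetection,
  inputKind: "video" | "image",
): string[] {
  const problems: string[] = [];
  // Stated on this page: auto-detect is video only.
  if (dto.auto_detect && inputKind === "image") {
    problems.push("auto_detect is not supported for image inputs");
  }
  // For video, coordinates are paired with a frame index. Whether images
  // need frame_number is not stated, so it is not checked for them.
  const hasPoint =
    dto.coordinates !== undefined &&
    (inputKind === "image" || dto.frame_number !== undefined);
  const hasBoxes =
    dto.bounding_boxes !== undefined || dto.bounding_boxes_url !== undefined;
  if (!dto.auto_detect && !hasPoint && !hasBoxes) {
    problems.push("provide coordinates (with frame_number for video) or bounding boxes");
  }
  return problems;
}

console.log(checkActiveSpeaker({ auto_detect: true }, "image"));
console.log(checkActiveSpeaker({ frame_number: 12, coordinates: [640, 360] }, "video"));
```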
Request examples
TypeScript SDK
cURL (HTTP)
TypeScript SDK (bounding boxes instead of coordinates)
Image input with speaker selection (sync-3)
Using bounding_boxes_url (for large payloads)
For long videos, host the bounding boxes in an external JSON file to keep request payloads small. The file must contain a top-level bounding_boxes array, in the form {"bounding_boxes": [[x1, y1, x2, y2], null, ...]}.
Each array entry corresponds to one frame: [x1, y1, x2, y2] for a detected face, or null when no face is present. The number of entries must match the total frame count.
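Before uploading such a file, it can help to check it against the rules just described: a bounding_boxes array, one entry per frame, each entry either four numbers or null. The sketch below does this locally. The function names are hypothetical, and the checks cover only what this page states.

```typescript
type Box = [number, number, number, number]; // [x1, y1, x2, y2]

interface BoundingBoxesFile {
  bounding_boxes: (Box | null)[];
}

// True when the value is an array of exactly four finite numbers.
function isBox(value: unknown): value is Box {
  return (
    Array.isArray(value) &&
    value.length === 4 &&
    value.every((n) => typeof n === "number" && isFinite(n))
  );
}

// Hypothetical helper: parse the file contents and verify shape and length.
function parseBoundingBoxesFile(json: string, totalFrames: number): BoundingBoxesFile {
  const data: unknown = JSON.parse(json);
  if (
    typeof data !== "object" ||
    data === null ||
    !Array.isArray((data as { bounding_boxes?: unknown }).bounding_boxes)
  ) {
    throw new Error("expected an object with a bounding_boxes array");
  }
  const boxes = (data as { bounding_boxes: unknown[] }).bounding_boxes;
  if (boxes.length !== totalFrames) {
    throw new Error(`expected ${totalFrames} entries, got ${boxes.length}`);
  }
  boxes.forEach((entry, frame) => {
    if (entry !== null && !isBox(entry)) {
      throw new Error(`frame ${frame}: expected [x1, y1, x2, y2] or null`);
    }
  });
  return { bounding_boxes: boxes as (Box | null)[] };
}

const sample = '{"bounding_boxes": [[10, 20, 110, 140], null, [12, 22, 112, 142]]}';
const parsed = parseBoundingBoxesFile(sample, 3);
console.log(parsed.bounding_boxes.length);
```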

