
Speaker Selection — API


Last updated May 15, 2026

Speaker selection helps you target the right face when a clip contains multiple people. You can either let Sync Labs auto-detect the active speaker, or collect a user-selected point in your UI and forward it in the active_speaker_detection DTO on /v2/generate. To use speaker selection in the web app, see the Speaker Selection guide under WebApp Guides.

When to use what

  • Auto-detect: the fastest setup; best for clips with a single or obvious speaker. Set auto_detect: true and omit the manual fields.
  • Manual selection: best for clips with multiple people, or when you want deterministic control. Provide a reference frame and a point on the speaker’s face, or supply bounding boxes if you already have detections for the clip.
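
As a rough illustration of the choice above, the following sketch builds the active_speaker_detection object for either mode. The helper buildActiveSpeakerOptions and the ManualSelection type are hypothetical and not part of the Sync SDK; only the field names auto_detect, frame_number, and coordinates come from this page.

```typescript
// Hypothetical helper for illustration; not part of the Sync SDK.
interface ManualSelection {
  frameNumber: number;            // frame index shown in your UI
  coordinates: [number, number];  // [x, y] point on the speaker's face
}

// Shape of options.active_speaker_detection as described on this page.
type ActiveSpeakerOptions =
  | { auto_detect: true }
  | { auto_detect: false; frame_number: number; coordinates: [number, number] };

function buildActiveSpeakerOptions(selection?: ManualSelection): ActiveSpeakerOptions {
  // No user selection: fall back to automatic detection.
  if (!selection) {
    return { auto_detect: true };
  }
  // User picked a face: keep auto_detect false so the selection is honored.
  return {
    auto_detect: false,
    frame_number: selection.frameNumber,
    coordinates: selection.coordinates,
  };
}
```

Calling buildActiveSpeakerOptions() with no argument yields the auto-detect form; passing a selection yields the manual form.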

Workflow: selecting a speaker in your app

1. Capture a reference frame

Seek the video to a frame where the target speaker’s face is visible, and keep track of the frame index shown in the UI.

2. Collect a point on the face

Record the [x, y] coordinates of the clicked point on the speaker’s face, in the same pixel coordinate system as your extracted frame. Keep the frame index and the coordinates paired.

3. Optional: provide bounding boxes instead

If you have already run face detection over the video, send bounding_boxes as a per-frame array. Each entry is [x1, y1, x2, y2] (top-left to bottom-right), or null when no face is present. This replaces frame_number + coordinates.

4. Send the generation request

Set options.active_speaker_detection with either frame_number + coordinates, or bounding_boxes covering every frame if you already have detections (frame_number and coordinates are not needed in that case). Leave auto_detect set to false so that the manual selection is honored.
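
Steps 1 and 2 can be sketched as two small conversions: from the player's current time to a frame index, and from a click inside the player to pixel coordinates in the source frame. This is a minimal sketch under several assumptions that this page does not confirm: the video has a constant frame rate, frame indices are zero-based, and the player scales the frame to fill its element with no letterboxing. The names toFrameNumber, toFramePoint, Size, and Point are hypothetical.

```typescript
interface Size { width: number; height: number; }
interface Point { x: number; y: number; }

// Convert a playback time to a frame index (assumes constant fps, zero-based frames).
function toFrameNumber(timeSeconds: number, fps: number): number {
  // The small epsilon guards against floating-point results such as 239.9999999.
  return Math.floor(timeSeconds * fps + 1e-6);
}

// Scale a click inside the displayed player to pixel coordinates of the source frame.
// Assumes the frame fills the displayed element exactly (no letterboxing or cropping).
function toFramePoint(click: Point, displayed: Size, frame: Size): [number, number] {
  const x = Math.round((click.x / displayed.width) * frame.width);
  const y = Math.round((click.y / displayed.height) * frame.height);
  // Clamp so the point always lies inside the frame.
  const clampedX = Math.min(Math.max(x, 0), frame.width - 1);
  const clampedY = Math.min(Math.max(y, 0), frame.height - 1);
  return [clampedX, clampedY];
}

// Example: a click at the center of a 640x360 player showing a 1280x720 video, 8 s in at 30 fps.
const frameNumber = toFrameNumber(8, 30);
const coordinates = toFramePoint(
  { x: 320, y: 180 },
  { width: 640, height: 360 },
  { width: 1280, height: 720 }
);
```

The resulting frameNumber and coordinates stay paired and can be passed as frame_number and coordinates in the request.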

ActiveSpeaker DTO fields

See the full API reference for active_speaker_detection.

  • auto_detect (boolean, default false): let Sync Labs pick the active speaker automatically.
  • v3 (boolean, optional): enable version 3 of active speaker detection (ASD v3).
  • frame_number (number): frame index that corresponds to the provided coordinates.
  • coordinates ([x, y]): reference point on the speaker’s face in frame_number.
  • bounding_boxes ((number[] | null)[], optional): per-frame array of bounding boxes across the video. Each entry corresponds to one frame: set it to [x1, y1, x2, y2] (x1, y1 = top-left; x2, y2 = bottom-right) for the detected face, or to null if there is no box for that frame. Use this instead of frame_number + coordinates when you have already run detection over the clip.
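
The fields above can be written down as a TypeScript type with a few client-side sanity checks before sending a request. This is only a sketch: the interface mirrors the field list on this page, while validateActiveSpeakerDetection and its rules (for example, requiring a non-negative integer frame_number and x1 < x2, y1 < y2) are assumptions for illustration, not documented server-side validation.

```typescript
type BoundingBox = [number, number, number, number]; // [x1, y1, x2, y2]

// Mirrors the fields listed above (HTTP / snake_case form).
interface ActiveSpeakerDetection {
  auto_detect?: boolean;
  v3?: boolean;
  frame_number?: number;
  coordinates?: [number, number];
  bounding_boxes?: (BoundingBox | null)[];
}

function isNonNegativeInteger(n: number): boolean {
  return isFinite(n) && n >= 0 && Math.floor(n) === n;
}

// Hypothetical client-side check; returns a list of problems (empty when none are found).
function validateActiveSpeakerDetection(dto: ActiveSpeakerDetection): string[] {
  const problems: string[] = [];
  if (dto.auto_detect === true) {
    return problems; // manual fields are not required in auto-detect mode
  }
  const hasPoint = dto.frame_number !== undefined && dto.coordinates !== undefined;
  const hasBoxes = dto.bounding_boxes !== undefined;
  if (!hasPoint && !hasBoxes) {
    problems.push("provide frame_number + coordinates, or bounding_boxes");
  }
  if (dto.frame_number !== undefined && !isNonNegativeInteger(dto.frame_number)) {
    problems.push("frame_number must be a non-negative integer");
  }
  if (dto.bounding_boxes !== undefined) {
    dto.bounding_boxes.forEach((box, frame) => {
      // Assumed rule: top-left must be strictly above and left of bottom-right.
      if (box !== null && (box[2] <= box[0] || box[3] <= box[1])) {
        problems.push("bounding_boxes[" + frame + "] must satisfy x1 < x2 and y1 < y2");
      }
    });
  }
  return problems;
}
```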

Request examples

TypeScript SDK

```typescript
import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

const response = await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      // Manual selection: a point on the speaker's face in the given frame.
      autoDetect: false,
      frameNumber: 240,
      coordinates: [640, 360]
    }
  }
});
```

cURL (HTTP)

```bash
curl -X POST https://api.sync.so/v2/generate \
  -H "x-api-key: $SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lipsync-2",
    "input": [
      { "type": "video", "url": "https://assets.sync.so/docs/example-video.mp4" },
      { "type": "audio", "url": "https://assets.sync.so/docs/example-audio.wav" }
    ],
    "options": {
      "active_speaker_detection": {
        "auto_detect": false,
        "frame_number": 240,
        "coordinates": [640, 360]
      }
    }
  }'
```

TypeScript SDK (bounding boxes instead of coordinates)

```typescript
import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      autoDetect: false,
      // boundingBoxes aligned to video frames; null where no box is present.
      boundingBoxes: [
        null,                  // frame 0
        [520, 280, 760, 520],  // frame 1 (speaker A) -> [x1, y1, x2, y2]
        [120, 260, 320, 500],  // frame 2 (speaker B)
        null                   // frame 3
        // ...one entry per frame in the clip
      ]
    }
  }
});
```
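
If your face detector emits sparse results (only the frames where a face was found), they need to be expanded into the per-frame array shown above. The following is a minimal sketch of that expansion; FaceDetection and toPerFrameBoxes are hypothetical names, and it assumes zero-based frame indices and at most one box (the target speaker) per frame.

```typescript
type BoundingBox = [number, number, number, number]; // [x1, y1, x2, y2]

// Hypothetical output shape of your own face detector.
interface FaceDetection {
  frame: number;     // zero-based frame index (assumption)
  box: BoundingBox;  // box of the target speaker in that frame
}

// Expand sparse detections into one entry per frame, with null where no face was found.
function toPerFrameBoxes(detections: FaceDetection[], frameCount: number): (BoundingBox | null)[] {
  const boxes: (BoundingBox | null)[] = [];
  for (let i = 0; i < frameCount; i++) {
    boxes.push(null);
  }
  for (const d of detections) {
    // Ignore detections that fall outside the clip or have a non-integer index.
    if (d.frame >= 0 && d.frame < frameCount && Math.floor(d.frame) === d.frame) {
      boxes[d.frame] = d.box;
    }
  }
  return boxes;
}

// Reproduces the four-frame array from the example above.
const boundingBoxes = toPerFrameBoxes(
  [
    { frame: 1, box: [520, 280, 760, 520] },
    { frame: 2, box: [120, 260, 320, 500] },
  ],
  4
);
```

The resulting array has exactly frameCount entries and can be passed as boundingBoxes (SDK) or bounding_boxes (HTTP).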