
Troubleshooting Transcriptions: When Speakers Blend into One

Why does my transcript show only one speaker when there are two?

Written by Joyc from Voyc
Updated this week

Sometimes, when viewing a conversation in Voyc, you may find that a two-person conversation is transcribed as if only one person were speaking. This usually isn’t a bug in Voyc, but a common challenge in speech-to-text technology called diarisation.

What is Diarisation?

Diarisation means speaker separation. Think of it as drawing boundaries between voices so a transcript looks like a dialogue, not a monologue. When diarisation slips, all voices get bundled into one.

At Voyc, we partner with an industry-leading provider, Deepgram, for transcription, and their system handles the diarisation.

In simple terms, Deepgram listens to the audio, breaks it into small chunks and creates a kind of “voice fingerprint” for each person. It then matches those fingerprints across the conversation so Speaker 1 stays Speaker 1, Speaker 2 stays Speaker 2 and so on. That’s how transcripts get labelled like a real dialogue instead of a confusing wall of text.

For example:

Hi, how are you? I am good thanks, and you? I am well

Becomes:

[Speaker 1] Hi, how are you?

[Speaker 2] I am good thanks, and you?

[Speaker 1] I am well
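To make that last step concrete, here is a minimal Python sketch of turning speaker-labelled words into a dialogue. The word list is a simplified, hypothetical shape (real diarised output carries more fields, such as timestamps); it is not Voyc’s or Deepgram’s actual code.

```python
# Hypothetical diarised output: each word carries a speaker index.
words = [
    {"word": "Hi,", "speaker": 0}, {"word": "how", "speaker": 0},
    {"word": "are", "speaker": 0}, {"word": "you?", "speaker": 0},
    {"word": "I", "speaker": 1}, {"word": "am", "speaker": 1},
    {"word": "good", "speaker": 1}, {"word": "thanks,", "speaker": 1},
    {"word": "and", "speaker": 1}, {"word": "you?", "speaker": 1},
    {"word": "I", "speaker": 0}, {"word": "am", "speaker": 0},
    {"word": "well", "speaker": 0},
]

def format_dialogue(words):
    # Start a new line whenever the speaker index changes.
    lines = []
    current_speaker = None
    for w in words:
        if w["speaker"] != current_speaker:
            current_speaker = w["speaker"]
            lines.append(f"[Speaker {current_speaker + 1}]")
        lines[-1] += " " + w["word"]
    return "\n".join(lines)

print(format_dialogue(words))
# [Speaker 1] Hi, how are you?
# [Speaker 2] I am good thanks, and you?
# [Speaker 1] I am well
```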

Why Does My Conversation Only Show 1 Speaker?

Speaker separation depends heavily on the quality of the input audio and how well the model has been trained for it. Common reasons for diarisation slip-ups include:

  • Language mismatch: If the language or accent (domain) of the conversation doesn’t match the audio settings, diarisation struggles to distinguish voices. For example, if your conversations are in UK English but your Channel language is set to English (US).

  • Audio quality: Background noise, muffled microphones, or heavy compression all make it harder to tell speakers apart.

  • Mono vs stereo: If both speakers are on a single channel (mono), diarisation has to guess. With stereo, each voice gets its own lane and separation becomes almost perfect.
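If you are unsure whether a recording is mono or stereo, you can check before uploading. A minimal sketch using Python’s built-in wave module (which reads uncompressed WAV files only):

```python
import wave

def channel_count(path):
    # Return the number of audio channels in an uncompressed WAV file:
    # 1 means mono, 2 means stereo.
    with wave.open(path, "rb") as wav:
        return wav.getnchannels()
```

For example, `channel_count("call.wav")` returns 2 for a stereo recording and 1 for mono, where "call.wav" stands in for whatever file you plan to upload.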

Since Voyc partners with Deepgram for transcription, this is where their diarisation models come in. Deepgram uses deep learning to segment speech, but just like any AI system, it’s only as good as the data it’s fed.

How Can I Improve Accuracy?

If your transcripts are merging speakers into one or are not being transcribed as expected, there are a few things you can do to improve the odds of cleaner separation and output:

  1. Use stereo audio instead of mono

    This is the silver bullet. If each participant is recorded on their own audio track, diarisation doesn’t need to guess who is speaking; it just labels the tracks. This makes speaker separation effectively perfect.

  2. Check the language settings

    Make sure the Channel language matches the actual language and accent spoken. This helps the diarisation model know what patterns to expect.

  3. Upload higher quality audio

    Use recordings with clear voices and minimal background noise. If you can, avoid heavily compressed formats.
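For readers curious how these settings map to the transcription layer: they correspond to query parameters on a transcription request. The sketch below builds such a request URL in the shape of Deepgram’s publicly documented /v1/listen endpoint. The parameter names come from Deepgram’s documentation, but Voyc configures these for you, so treat this as illustrative rather than as Voyc’s actual integration.

```python
# Illustrative only: a real request would also need an Authorization
# header and the audio payload in the request body.
from urllib.parse import urlencode

params = {
    "diarize": "true",       # label each word with a speaker index
    "multichannel": "true",  # tip 1: transcribe stereo channels separately
    "language": "en-GB",     # tip 2: match the language/accent actually spoken
}
url = "https://api.deepgram.com/v1/listen?" + urlencode(params)
print(url)
```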

📝 Note: You may find that diarisation improves over the duration of the call.

If you have tried the above and your transcription is still inaccurate, please tag the conversation with “transcription is still not as expected”, leave a comment describing the issue, and let us know via the support chatbot so we can find it and investigate.

The Gist

If you see only one speaker in your transcripts, it’s not Voyc ignoring people; it means diarisation needs better input. The cleaner the audio, the sharper the separation: clear voices, minimal background noise and, ideally, separate tracks make all the difference.
