<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://iyakovlev.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://iyakovlev.dev/" rel="alternate" type="text/html" /><updated>2026-04-07T17:23:19+00:00</updated><id>https://iyakovlev.dev/feed.xml</id><title type="html">Ivan Yakovlev</title><subtitle>Personal site of Ivan Yakovlev — speech processing researcher and engineer working on speaker recognition, multilingual TTS, and streaming ASR.</subtitle><author><name>Ivan Yakovlev</name><email>ivan@iyakovlev.dev</email></author><entry><title type="html">We benchmarked 4 neural forced aligners across multiple languages. Here’s what actually works.</title><link href="https://iyakovlev.dev/blog/2026/04/07/forced-alignment-benchmark/" rel="alternate" type="text/html" title="We benchmarked 4 neural forced aligners across multiple languages. Here’s what actually works." /><published>2026-04-07T00:00:00+00:00</published><updated>2026-04-07T00:00:00+00:00</updated><id>https://iyakovlev.dev/blog/2026/04/07/forced-alignment-benchmark</id><content type="html" xml:base="https://iyakovlev.dev/blog/2026/04/07/forced-alignment-benchmark/"><![CDATA[<p>Many TTS data pipelines rely on forced alignment for segmentation and labeling — but there’s no good public benchmark comparing neural approaches <strong>across languages</strong>. Existing evaluations (<a href="https://github.com/lifeiteng/Aligner-SUPERB">Aligner-SUPERB</a>, <a href="https://arxiv.org/abs/2406.19363">Tradition or Innovation, Interspeech 2024</a>) compare aligners against ground-truth word timestamps on English corpora (TIMIT, Buckeye). That’s useful academically, but ground-truth timestamps are expensive to produce and practically don’t exist outside English.</p>

<p>The thing is — you don’t actually need them. Quality multilingual (audio, transcript) datasets already exist: FLEURS covers 100+ languages, Common Voice and MLS cover dozens more. The bottleneck isn’t data, it’s the evaluation method. So we built one that works with any (audio, transcript) corpus.</p>

<p>We tested <a href="https://github.com/facebookresearch/seamless_communication">Seamless</a> (Meta), <a href="https://github.com/m-bain/whisperX">WhisperX</a>, <a href="https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B">Qwen3-ForcedAligner</a>, and a commercial cloud ASR API on <a href="https://huggingface.co/datasets/google/fleurs">FLEURS</a> across EN / ES / FR / RU.</p>

<h2 id="method">Method</h2>

<p>Instead of ground-truth timestamps, we use an implicit WER-based evaluation:</p>

<ol>
  <li>Run a forced aligner on (audio, transcript) pairs to get word-level timestamps</li>
  <li>Crop audio at the aligned word boundaries</li>
  <li>Re-transcribe each crop with an independent ASR (Whisper large-v3-turbo)</li>
  <li>Compute WER between the crop transcription and the expected text</li>
</ol>
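<p>The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark code: <code>align</code> and <code>transcribe</code> are hypothetical stand-ins for an aligner and an independent ASR, and the word objects with <code>.text</code>/<code>.start</code>/<code>.end</code> are an assumed interface.</p>

```python
# Sketch of the align -> crop -> re-transcribe -> score loop.
# `align` and `transcribe` are hypothetical stand-ins; the real benchmark
# wraps each aligner behind a unified interface.

def word_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def score_utterance(audio, transcript, align, transcribe, sr=16_000):
    """Crop at aligned word boundaries, re-transcribe, average per-word WER."""
    errors = []
    for word in align(audio, transcript):  # objects with .text/.start/.end (s)
        crop = audio[int(word.start * sr):int(word.end * sr)]
        hyp = transcribe(crop)             # independent ASR (large-v3-turbo)
        errors.append(word_error_rate([word.text.lower()], hyp.lower().split()))
    return sum(errors) / max(len(errors), 1)
```

<p>The only real dependency here is the WER metric itself; everything else is plumbing around whatever aligner and ASR you plug in.</p>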

<p><strong>Better alignment = tighter crops = lower WER.</strong> Accurate boundaries mean each crop contains exactly the target word. Misaligned boundaries cut words or bleed neighboring audio, causing transcription errors.</p>

<p>To isolate boundary accuracy from each aligner’s underlying recognition quality, results are filtered to utterances where <strong>all</strong> aligners produced a perfect transcription reconstruction (strict mode). Only inner words (<code class="language-plaintext highlighter-rouge">[1:-1]</code>) are scored, to avoid edge effects at utterance boundaries.</p>
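<p>Concretely, the strict filter and inner-word slice look roughly like this. The data layout (dicts keyed by utterance id and aligner name) and the normalization are assumptions for illustration:</p>

```python
# Hypothetical records: corpus maps utt_id -> transcript, alignments maps
# utt_id -> {aligner_name: [aligned word, ...]}.
# "Perfect reconstruction" = joining the aligned words reproduces the transcript.

def reconstructs(words: list[str], transcript: str) -> bool:
    norm = lambda s: " ".join(s.lower().split())
    return norm(" ".join(words)) == norm(transcript)

def strict_filter(corpus, alignments):
    """Keep only utterances where EVERY aligner reconstructed the transcript,
    then drop the first and last word ([1:-1]) to avoid edge effects."""
    kept = {}
    for utt_id, transcript in corpus.items():
        per_aligner = alignments[utt_id]
        if all(reconstructs(w, transcript) for w in per_aligner.values()):
            kept[utt_id] = {a: w[1:-1] for a, w in per_aligner.items()}
    return kept
```

<p>Taking the intersection across aligners keeps the comparison fair: every system is scored on exactly the same words.</p>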

<p><img src="/assets/posts/forced-aligner-bench-scheme-rounded.drawio.png" alt="Benchmark pipeline: align → crop → re-transcribe → score" class="post-figure" /></p>

<h2 id="aligners">Aligners</h2>

<table>
  <thead>
    <tr>
      <th>Aligner</th>
      <th>Type</th>
      <th>Approach</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Seamless</strong></td>
      <td>NAR T2U + unit extraction</td>
      <td>Duration prediction with monotonic alignment (RAD-TTS based)</td>
    </tr>
    <tr>
      <td><strong>Qwen3</strong></td>
      <td>Transformer forced aligner</td>
      <td>LLM-based slot-filling with discrete timestamp prediction</td>
    </tr>
    <tr>
      <td><strong>WhisperX</strong></td>
      <td>CTC-based</td>
      <td>Word boundaries extracted from wav2vec2 CTC emissions</td>
    </tr>
    <tr>
      <td><strong>Cloud ASR</strong></td>
      <td>Cloud API</td>
      <td>Commercial transcription API with alignment output</td>
    </tr>
  </tbody>
</table>

<h2 id="results">Results</h2>

<p>Dataset: FLEURS test split, strict filtering (intersection of utterances with perfect transcription across all aligners), inner words only.</p>

<h3 id="mean-wer-by-language">Mean WER by language</h3>

<table>
  <thead>
    <tr>
      <th>Aligner</th>
      <th>EN</th>
      <th>ES</th>
      <th>FR</th>
      <th>RU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Seamless</td>
      <td><strong>0.067</strong></td>
      <td><strong>0.030</strong></td>
      <td>0.102</td>
      <td>0.055</td>
    </tr>
    <tr>
      <td>Qwen3</td>
      <td>0.072</td>
      <td>0.039</td>
      <td>0.104</td>
      <td>0.059</td>
    </tr>
    <tr>
      <td>Cloud ASR</td>
      <td>0.091</td>
      <td>0.038</td>
      <td><strong>0.098</strong></td>
      <td><strong>0.050</strong></td>
    </tr>
    <tr>
      <td>WhisperX</td>
      <td>0.082</td>
      <td>0.065</td>
      <td>0.155</td>
      <td>0.133</td>
    </tr>
  </tbody>
</table>

<p>Lower is better. <strong>Bold</strong> = best per language.</p>

<h2 id="takeaways">Takeaways</h2>

<p><strong>Seamless is the most consistent performer overall.</strong> Duration prediction with monotonic alignment produces stable word boundaries across languages — the model explicitly learns duration, not just token occurrence.</p>

<p><strong>WhisperX struggles on FR and RU.</strong> CTC emissions predict token occurrence in sequence, not precise duration. The spiked posteriors make boundary extraction unreliable, especially for morphologically rich languages.</p>
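<p>A toy example makes the spiky-posterior problem concrete. WhisperX actually runs a proper forced-alignment trellis over wav2vec2 emissions, but even a simplified greedy view of a CTC path shows why the boundaries come out coarse: non-blank tokens fire on only a frame or two, and everything in between is blank, so the token spans carry little duration information.</p>

```python
# Toy illustration (not WhisperX's actual code): collapsing a framewise CTC
# argmax path into token spans. Blanks dominate between spikes, so each token
# "occupies" only one or two frames and the derived boundaries are coarse.

BLANK = 0

def ctc_segments(path: list[int]) -> list[tuple[int, int, int]]:
    """Collapse a framewise argmax path into (token, start_frame, end_frame)."""
    segments = []
    prev = BLANK
    for t, tok in enumerate(path):
        if tok != BLANK and tok != prev:
            segments.append([tok, t, t + 1])   # new token starts
        elif tok != BLANK and tok == prev:
            segments[-1][2] = t + 1            # same token continues
        prev = tok
    return [tuple(s) for s in segments]

# A spiky path: each token fires on ~1-2 frames, the rest is blank.
path = [0, 0, 5, 0, 0, 0, 7, 7, 0, 0, 3, 0]
print(ctc_segments(path))  # [(5, 2, 3), (7, 6, 8), (3, 10, 11)]
```

<p>Each span covers a fraction of the word’s true duration; where exactly to place the word boundary inside the surrounding blank region is guesswork, which is what duration-predicting models like Seamless avoid.</p>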

<p>We had high expectations for <strong>Qwen3-ForcedAligner</strong> as an LLM-based aligner, but on this benchmark it shows no visible advantage over Seamless. Worth watching, not a clear winner yet.</p>

<p><strong>Cloud ASR edges out Seamless on FR and RU</strong>, but the margins are small and Seamless wins on average. One caveat: the API bundles alignment with its transcription response, so we filtered to utterances where its recognition exactly matched the FLEURS ground truth. That shrank the test set and may slightly favor the API.</p>

<p><strong>Seamless has a practical limitation:</strong> it supports a fixed set of 38 languages. If yours isn’t covered, you’re back to CTC-based approaches.</p>

<p>For our production pipeline at <a href="https://palabra.ai">Palabra AI</a>, Seamless is the current choice.</p>

<hr />

<h2 id="implementation">Implementation</h2>

<p>The benchmark infrastructure — alignment wrappers, crop logic, WER computation, multi-GPU extraction — was built with <a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview">Claude Code</a>. Each aligner is wrapped behind a unified interface, and the full pipeline (align → crop → transcribe → score) runs with a couple of bash scripts.</p>

<p>Code &amp; benchmark: <a href="https://github.com/PalabraAI/forced-aligners-bench">github.com/PalabraAI/forced-aligners-bench</a></p>]]></content><author><name>Ivan Yakovlev</name><email>ivan@iyakovlev.dev</email></author><summary type="html"><![CDATA[Implicit WER-based evaluation of Seamless, WhisperX, Qwen3-ForcedAligner, and a commercial cloud ASR on FLEURS across EN/ES/FR/RU.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://iyakovlev.dev/assets/posts/forced-aligner-bench-scheme-rounded.drawio.png" /><media:content medium="image" url="https://iyakovlev.dev/assets/posts/forced-aligner-bench-scheme-rounded.drawio.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>