By Minseok Song and Hiroshi Yoshioka (Microsoft MVPs)
TL;DR
Recent community feedback, especially from Japanese translations, revealed that many translation failures were not semantic, but structural.
Through detailed issue reports and discussions, we identified recurring patterns such as broken links, malformed code fences, inconsistent list structures, and CJK-specific formatting issues.
In response, Co-op Translator has undergone a series of structural improvements across multiple releases, culminating in v0.18.1 with enhancements such as parser-based code fence handling, list-aware chunking, language-specific Markdown templates, safer CJK emphasis normalization, more robust image migration, and improved internal anchor consistency.
These changes were directly informed by real-world community feedback. We would like to especially thank Hiroshi Yoshioka (Microsoft MVP), whose many detailed reports not only uncovered several of these systemic issues but also made this community report possible.
The result is not just improved Japanese translations, but a more reliable and resilient translation pipeline for any repository that depends on Markdown fidelity.
Introduction
Most translation bugs are not actually translation bugs.
They are structural failures.
They show up as broken links, missing bold markers, unclosed code fences, skipped content, or images that quietly point to the wrong place. To a learner reading translated technical documentation, those issues can make a page feel untrustworthy. To a maintainer localizing documentation at scale, they reveal something deeper: the translation pipeline is not preserving structure as carefully as it preserves meaning.
That insight became much clearer over the past several months through community feedback on Co-op Translator.
Co-op Translator helps maintain educational GitHub content across many languages while keeping Markdown, images, and notebooks synchronized as the source evolves. As Hiroshi Yoshioka reported a series of Japanese translation issues across real Microsoft learning repositories, each issue looked narrow on the surface: a broken link here, a skipped line there, bold markers not surviving around linked text, HTML image tags not being rewritten, or code fences breaking after chunking.
Example of a real community-reported issue where a code block was broken during translation, causing structural corruption in the output.
But taken together, those reports exposed a broader pattern:
The hardest problem was not “translate this sentence.”
The hardest problem was “translate this document without damaging its structure.”
This post is a community report on the hardening work that followed, especially in the recent run-up to v0.18.1, and what we learned from those real-world cases.
Why these reports mattered
One of the most useful things about community feedback is that it reveals failure modes that synthetic tests often miss.
These were not edge cases found in toy Markdown samples. The reports came from real translated content in active educational repositories. That meant we were dealing with the kinds of files maintainers actually have to ship:
- nested lists
- fenced code blocks
- inline HTML
- relative links
- translated headings
- migrated image assets
- CJK punctuation and emphasis edge cases
In other words, we were seeing the kinds of Markdown that break when a translation system is only mostly correct.
1) We stopped treating code fences like a regex problem
Code fences are not a regex problem—they are a structural one.
Left: Regex-based handling breaks code fences and list structure across chunks.
Right: Parser-based processing preserves code blocks and their surrounding context as atomic units.
One of the earliest recurring themes was code fence integrity.
A report on incorrectly handled triple backticks highlighted a classic failure mode: if fenced blocks are detected or split incorrectly, placeholders can fall out of sync, chunk boundaries can be corrupted, and the translated file can come back structurally damaged. A later report showed a closely related issue: list items and indented code placeholders could be split into separate chunks, which then caused broken fences downstream.
The right fix was not another regex patch.
Instead, Co-op Translator moved to a parser-based approach using markdown-it-py for fenced code block detection. This made code block handling spec-aware and more resilient to cases like unmatched fences, variable fence lengths, and info strings. More importantly, it ensured code sections were treated as atomic units during chunking and placeholder restoration.
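The idea can be sketched as follows. This is a simplified illustration, not the actual Co-op Translator internals: the function name `protect_code_fences` is hypothetical, and only the placeholder convention `@@CODE_BLOCK_X@@` comes from the pipeline itself. It uses markdown-it-py's token stream and line maps instead of regex, so unmatched fences, longer fence markers, and info strings are handled per the CommonMark spec:

```python
from markdown_it import MarkdownIt

def protect_code_fences(text: str):
    """Replace each fenced code block with a placeholder token.

    Relies on the parser's line map (token.map) rather than regex,
    so fence detection follows the CommonMark rules.
    """
    md = MarkdownIt("commonmark")
    lines = text.splitlines()
    out, blocks, cursor = [], {}, 0
    for i, tok in enumerate(t for t in md.parse(text) if t.type == "fence"):
        start, end = tok.map  # [first line, line after closing fence)
        out.extend(lines[cursor:start])
        key = f"@@CODE_BLOCK_{i}@@"
        blocks[key] = "\n".join(lines[start:end])  # the whole block, verbatim
        out.append(key)
        cursor = end
    out.extend(lines[cursor:])
    return "\n".join(out), blocks
```

Restoration is the reverse: after translation, each placeholder is swapped back for its stored block, so the code itself never passes through the model.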
This same principle was extended to list-aware chunking.
Rather than splitting Markdown line by line and hoping the model would preserve structure, the pipeline now groups list items together with their continuation lines and indented placeholders such as `@@CODE_BLOCK_X@@`. This prevents bullets and their associated code content from being separated into different translation chunks.
This was not just a better heuristic. It changed the unit of chunking itself.
In practice, this required modifying the chunking pipeline to detect and preserve list-item blocks before token-based splitting. Instead of treating each line independently, we introduced a grouping step that keeps the entire list context intact, including nested indentation and code placeholders.
The change was implemented directly in the chunking logic:
```python
lines = _group_lines_preserving_list_items(part_text)
```
This helper ensures that list items and their associated code blocks are processed as a single unit, preventing structural corruption during translation.
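As a rough sketch of what such a grouping step does (simplified and not the actual implementation; the bullet and placeholder patterns here are assumptions):

```python
import re

BULLET = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")
PLACEHOLDER = re.compile(r"^\s*@@CODE_BLOCK_\d+@@\s*$")

def group_lines_preserving_list_items(lines):
    """Group list items with their continuations so chunking can't split them.

    Indented lines, blank lines inside an item, and @@CODE_BLOCK_X@@
    placeholders stay attached to the bullet that owns them.
    """
    groups, current = [], []
    for line in lines:
        if current and (
            line.startswith("  ") or PLACEHOLDER.match(line) or not line.strip()
        ):
            current.append(line)  # continuation of the open list item
        elif BULLET.match(line):
            if current:
                groups.append("\n".join(current))
            current = [line]  # start a new list-item group
        else:
            if current:
                groups.append("\n".join(current))
                current = []
            groups.append(line)  # ordinary line: its own unit
    if current:
        groups.append("\n".join(current))
    return groups
```

Token-based splitting then operates on these groups rather than on raw lines, so a chunk boundary can never fall between a bullet and its code placeholder.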
Why this mattered
Technical documentation frequently embeds code examples directly under list items or step-by-step instructions. When these relationships are broken during translation, the issue is not just cosmetic. It results in structurally invalid Markdown and misplaced code blocks that can confuse readers and make examples unusable.
These were not edge cases. They appeared in real production documentation where:
- Fenced code blocks became malformed after chunking
- List items and their associated code placeholders were separated into different segments
- Placeholder ordering drifted, breaking reconstruction of the original structure
In practice, this meant that even when the translated text was correct, the document itself could no longer be trusted as a working technical resource.
What changed in practice
Before:
- Code samples could leak out of their list context
- List items and code blocks were split across chunks
- Placeholder ordering could drift, breaking reconstruction
After:
- Code blocks are preserved as atomic units during chunking
- List-bound code samples remain intact
- Placeholder ordering is stable across the pipeline
2) We restored internal link consistency across translation chunks
Even when each chunk appears locally correct, internal links can break at the document level.
Left: Anchor links drift out of sync because headings and links are translated independently across chunks.
Right: After document-level normalization, links correctly resolve to their corresponding translated headings.
Another cluster of issues surfaced when translating longer Markdown documents: internal links would silently break once the content was processed in chunks.
Co-op Translator splits large documents into multiple chunks to fit within model constraints. While this works well for translation itself, it introduces a structural problem. Internal links such as `[Go to section](#section-name)` depend on heading-derived anchor slugs, and those slugs can change during translation. When each chunk is translated independently, links and headings can drift out of sync.
In practice, this meant that even when translated headings and links looked correct locally within a chunk, they no longer matched at the document level. Tables of contents, section jump links, and cross-references inside the same file could silently break.
The right fix was not to rely on chunk-level correctness.
Instead, Co-op Translator introduced a document-level normalization step for internal anchor links.
The pipeline now parses both the source and translated Markdown using markdown-it, extracts headings, generates GitHub-style slugs from the translated headings, and then realigns internal anchor links so they correctly point to their corresponding translated sections. Rather than trusting fragment identifiers produced during chunk-level translation, links are reconciled against the final translated document structure.
This was not just a small post-processing tweak. It changed where consistency is enforced.
In practice, this required introducing a normalization step that runs after all chunks are merged back into a single document. Instead of assuming each chunk is self-consistent, the system now treats the entire document as the source of truth and rebinds internal links accordingly.
The change was implemented as a dedicated normalization pass:
```python
normalize_internal_anchor_links(source_markdown, translated_markdown)
```
This function aligns fragment identifiers with translated heading slugs, ensuring that internal navigation remains valid even when content has been translated in multiple independent chunks.
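A minimal sketch of such a pass, under simplifying assumptions (headings are matched by position, GitHub's slug rules are only approximated, and the real implementation extracts headings with markdown-it rather than regex):

```python
import re

def github_slug(heading: str) -> str:
    # Approximation of GitHub's anchor slugs: lowercase, strip punctuation,
    # collapse whitespace to hyphens. (The real rules have more edge cases.)
    text = heading.strip().lower()
    text = re.sub(r"[^\w\s-]", "", text)
    return re.sub(r"\s+", "-", text)

def normalize_internal_anchor_links(source_md: str, translated_md: str) -> str:
    # Pair headings by position (assumes translation preserves heading order),
    # then rebind #fragment links that still carry source-language slugs.
    heading = re.compile(r"^#{1,6}\s+(.+?)\s*$", re.MULTILINE)
    src_slugs = [github_slug(m.group(1)) for m in heading.finditer(source_md)]
    dst_slugs = [github_slug(m.group(1)) for m in heading.finditer(translated_md)]
    mapping = dict(zip(src_slugs, dst_slugs))
    return re.sub(
        r"\(#([^)\s]+)\)",
        lambda m: "(#" + mapping.get(m.group(1), m.group(1)) + ")",
        translated_md,
    )
```

The key design choice is the same either way: the mapping is built from the fully merged translated document, so links are reconciled against the final structure rather than against whatever an individual chunk happened to produce.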
Why this mattered
Technical documentation relies heavily on internal navigation such as tables of contents, section links, and cross-references within the same file.
When anchor links drift out of sync with translated headings, the document becomes difficult to navigate even if the translation itself is accurate. Readers may click on links that lead to incorrect sections or nowhere at all, which significantly reduces trust in the content.
These issues surfaced in real-world usage where:
- Internal links no longer matched translated heading slugs
- Tables of contents pointed to incorrect or missing sections
- Cross-references silently broke across chunk boundaries
This highlighted that correctness at the chunk level was not enough. Consistency had to be enforced at the document level.
What changed in practice
Before:
- Internal links could drift out of sync with translated headings
- Tables of contents pointed to incorrect or missing sections
- Cross-references silently broke across chunk boundaries
- Long documents behaved like fragmented outputs rather than a single unit
After:
- Internal links are realigned with translated heading slugs at the document level
- Tables of contents correctly resolve to translated sections
- Cross-references remain consistent across the entire document
- Long Markdown documents behave as a single coherent unit
3) We fixed CJK emphasis the safe way
Bold and italic rendering around CJK text was a recurring and subtle failure point.
Issues like “Markdown bold not handled correctly” may look minor, but they reveal a deeper compatibility problem: many Markdown renderers do not consistently apply emphasis when markers sit directly next to CJK characters.
To address this, we introduced a dedicated normalization step for emphasis markers.
Instead of relying on each renderer to interpret `*`, `**`, and `***` correctly in CJK-adjacent cases, Co-op Translator converts them into equivalent HTML tags such as `<em>` and `<strong>` when the target language is Japanese, Korean, or Chinese.
This shifts emphasis rendering from renderer-dependent behavior to deterministic output.
What mattered was not just fixing it, but fixing it safely.
The normalization is strictly scoped to CJK languages and carefully designed to avoid overmatching. It does not mutate inline code spans or unrelated fragments. This is critical, because overly aggressive formatting fixes can easily break code, identifiers, or underscore-heavy technical text.
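A simplified sketch of such a scoped rewrite (illustrative only, not the shipped implementation; `***` handling is omitted for brevity):

```python
import re

STRONG = re.compile(r"\*\*(?!\s)(.+?)(?<!\s)\*\*")
EM = re.compile(r"(?<!\*)\*(?![\s*])(.+?)(?<![\s*])\*(?!\*)")
CJK_LANGS = {"ja", "ko", "zh"}

def normalize_cjk_emphasis(text: str, lang: str) -> str:
    """Rewrite * / ** emphasis as <em> / <strong> for CJK targets only.

    Inline code spans are split out first and never modified, so
    asterisks inside backticks survive untouched.
    """
    if lang not in CJK_LANGS:
        return text  # non-CJK targets keep plain Markdown emphasis
    pieces = []
    for piece in re.split(r"(`[^`]*`)", text):
        if piece.startswith("`"):
            pieces.append(piece)  # leave inline code untouched
        else:
            piece = STRONG.sub(r"<strong>\1</strong>", piece)
            piece = EM.sub(r"<em>\1</em>", piece)
            pieces.append(piece)
    return "".join(pieces)
```

Splitting on code spans before rewriting is what keeps the fix safe: the emphasis regexes never even see backtick-delimited content.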
Unlike whitespace-delimited languages, Japanese, Korean, and Chinese often place characters directly adjacent to emphasis markers without clear boundaries.
For example, a phrase like:
`**example** is ...`
may be translated into Japanese as:
`**例**は ...`
Here, the particle は is attached directly to the emphasized word. In some Markdown renderers, this breaks the expected boundary around the emphasis markers, causing the emphasis to render incorrectly or not at all.
This pattern is not limited to Japanese. Similar boundary issues can appear across CJK languages due to the absence of whitespace between words.
Why this mattered
Formatting bugs around emphasis may look cosmetic, but they affect readability, hierarchy, and trust, especially in instructional documentation where emphasis often signals warnings, key concepts, or required steps.
What changed in practice
Before:
- Emphasis markers could render inconsistently when adjacent to CJK characters
- Bold and italic formatting could break depending on the Markdown renderer
- Fixes risked overmatching and corrupting code or inline technical content
After:
- Emphasis rendering is deterministic across CJK languages using HTML tags
- Bold and italic formatting remains consistent regardless of renderer behavior
- Normalization is safely scoped, avoiding unintended mutations in code and inline content
Next steps
With the recent release, Co-op Translator now exposes a programmatic API that allows the translation pipeline to be executed directly from Python, not only through the CLI.
This is an important step, but it is not the end state.
The immediate focus is improving adoption. Documentation and usage patterns are being developed so that the API can be reliably integrated across different environments and workflows.
More fundamentally, the direction is shifting.
Co-op Translator is evolving from a repository-specific tool into a reusable translation engine that can operate as part of larger content pipelines.
This enables broader use cases, including:
- Long-form content such as eBooks and technical blogs
- Developer documentation and static site projects (for example, Docusaurus or Astro)
- Continuous documentation pipelines that track and update translations as source content evolves
- Multilingual SDK, API documentation, and knowledge base systems
The long-term goal is to treat translation as infrastructure rather than a one-time task.
Instead of generating static outputs, the system is being designed to support continuous updates, structural guarantees, and seamless integration into real-world documentation workflows.
Why community feedback mattered so much here
One of the most encouraging parts of this work is that the most useful reports were not always long reports.
Sometimes a single repository link, a screenshot, and one concrete example of broken output were enough to reveal a structural weakness in the translation engine. That feedback created a valuable loop between people reading translated docs and people maintaining the translation tooling.
Hiroshi's reports did not just identify isolated defects. They helped surface recurring categories of failure:
- code fence integrity
- chunk boundary safety
- link preservation
- CJK emphasis compatibility
- image path migration
- anchor normalization
Once those patterns became visible, the fixes could be implemented in the core and covered with tests, so that the broader ecosystem, not just one file or one repo, would benefit.
Why this matters for learners worldwide
Co-op Translator is used in educational repositories where translated documentation can lower the barrier to learning for people around the world. That raises the quality bar.
A learner should not have to wonder whether a missing bold marker changed the meaning of a sentence.
A learner should not hit a broken anchor halfway through a tutorial.
A learner should not lose trust in a translated page because a code block or image path was corrupted during processing.
Improving those details is not cosmetic. It is part of making global technical education more reliable.
Closing thoughts
This community report comes down to a simple truth:
Translation quality depends on structural quality.
Community feedback helped Co-op Translator get better at preserving the things technical documents depend on most: code fences, lists, links, emphasis, images, and anchors. The result is a more dependable foundation for multilingual documentation, not only for Japanese, but for any repository that needs translated content to behave like a maintained technical artifact rather than a plain text dump.
To everyone who has opened an issue, shared a screenshot, submitted a PR, or stress-tested translated docs in the real world: thank you. That feedback is helping Co-op Translator become a stronger tool for maintainers and a more trustworthy experience for learners.
If you are maintaining multilingual Markdown content, we hope these lessons are useful beyond this project too: use parsers where you can, make structure a first-class concern, and treat community bug reports as design input, not just support tickets.
If you are working on multilingual documentation, you can explore Co-op Translator here:
https://github.com/Azure/co-op-translator
About the authors
Minseok Song (Microsoft MVP) is an OSS maintainer of Co-op Translator focusing on GitHub-native multilingual automation.
Hiroshi Yoshioka (Microsoft MVP) is a community contributor who has played a key role in improving translation quality through detailed real-world feedback.