I’m working with HTML content from newsletters and articles that includes various tags like <div>, <em>, and <b>, plus data attributes. Currently I’m sending the complete HTML markup to OpenAI for translation, which works fine for smaller content.
Sample HTML:
<div><em data-priority="high">Hello</em> and welcome to our <b class="featured">updated website</b>.
Please review <a href="https://sample.org">privacy policy</a> before proceeding.</div>
The problem starts when my HTML content exceeds the model’s token limit. I need a way to break big HTML documents into manageable pieces for separate API calls while keeping the markup intact.
I tried extracting just plain text and translating that separately, but the results aren’t consistent. It works much better when I include the full HTML structure in the translation request.
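For reference, here’s a stripped-down version of what I’m doing now (using the official openai Python package; the model name and prompt wording are just placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_html(html: str, target_lang: str = "German") -> str:
    # One request with the whole document; works until the document outgrows the context window.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": f"Translate the user's HTML into {target_lang}. "
                        "Keep every tag and attribute unchanged; translate only the text content."},
            {"role": "user", "content": html},
        ],
    )
    return response.choices[0].message.content

translated = translate_html('<div><em data-priority="high">Hello</em> and welcome to our '
                            '<b class="featured">updated website</b>.</div>')
```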
Manual DOM parsing and token counting become a nightmare with complex HTML. Been down that road with translation projects.
Automation changed everything for me. Set up a workflow that grabs your HTML, chunks it smartly, hits OpenAI’s API, then puts everything back together without breaking anything.
Built this for our content team. It estimates tokens, keeps parent tags intact, handles rate limits, and retries failed calls automatically. No more manual DOM walking or broken HTML.
Just drop in your HTML file and get the translated version back. The workflow remembers context between chunks so translations stay consistent across the whole document.
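For comparison, just the rate-limit/retry piece looks roughly like this if you hand-roll it (a minimal sketch with the openai package; the backoff numbers and model name are made up), and that’s before any chunking or reassembly logic:

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def translate_chunk(chunk_html: str, retries: int = 5) -> str:
    # Retry with exponential backoff whenever the API rate-limits us.
    delay = 2.0
    for _ in range(retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[
                    {"role": "system",
                     "content": "Translate this HTML fragment to German. Keep tags and attributes unchanged."},
                    {"role": "user", "content": chunk_html},
                ],
            )
            return response.choices[0].message.content
        except RateLimitError:
            time.sleep(delay)
            delay *= 2  # back off harder on each failure
    raise RuntimeError("Gave up after repeated rate-limit errors")
```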
Beats coding chunking logic from scratch, especially when you’re dealing with different HTML structures all the time.
I use this automation platform: https://latenode.com
I think using a DOM parser is a smart move! Breaking it up at <p> or <section> tags will help maintain structure and context, making translations smoother. Hope that helps and good luck with your project!
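Something like this is what I mean (a quick BeautifulSoup sketch; splitting on <p> here, but <section> or other block tags work the same way):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def split_at_paragraphs(html: str) -> list[str]:
    # Each chunk is a complete <p> element, so its inner markup stays intact.
    soup = BeautifulSoup(html, "html.parser")
    return [str(p) for p in soup.find_all("p")]

chunks = split_at_paragraphs(
    "<div><p>Hello <b>world</b>.</p>"
    "<p>Please review our <a href='https://sample.org'>privacy policy</a>.</p></div>"
)
# Translate each chunk separately, then join them back inside the original <div>.
```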
I’ve hit this same problem translating big HTML files. Here’s what actually works: walk through the DOM and find natural break points based on content density, not just tag types. I calculate rough token counts for each content block, then group nearby elements until I’m at about 80% of the limit. That gives you breathing room for the API response. The trick is keeping parent container tags intact - track the opening tags and rebuild them for each chunk. For your HTML sample, I’d wrap each chunk with the parent tags it needs and add a reference system so you can put everything back together in order. This beats plain text extraction by miles, especially when you’ve got messy nested structures.
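Rough sketch of the grouping step (BeautifulSoup plus a crude chars/4 token estimate; the 80% budget and the data-chunk attribute are just my own conventions):

```python
from bs4 import BeautifulSoup

TOKEN_LIMIT = 4000
BUDGET = int(TOKEN_LIMIT * 0.8)  # ~80% of the limit, leaving headroom for the response

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude rule of thumb: roughly 4 characters per token

def chunk_html(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    parent = soup.find("div") or soup  # the container whose tags get rebuilt around each chunk
    blocks = [str(child) for child in parent.children if str(child).strip()]

    groups, current, used = [], [], 0
    for block in blocks:
        cost = estimate_tokens(block)
        if current and used + cost > BUDGET:
            groups.append(current)       # close the chunk before it blows the budget
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        groups.append(current)

    # Rebuild the parent tag around each group and number the chunks so they
    # can be stitched back together in order after translation.
    return [f'<div data-chunk="{i}">{"".join(g)}</div>' for i, g in enumerate(groups)]
```

Real code would also copy over the parent’s original attributes, but the idea is the same: every chunk is valid HTML on its own and carries an index for reassembly.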
I use content-aware chunking and it works great. Instead of cutting text randomly, I parse the HTML to find natural break points like complete sentences or paragraphs. I walk through the DOM tree, build up content, and estimate tokens with a rough character ratio. When I’m near the limit, I back up to the last complete section. The trick is keeping context flow intact - I overlap chunks slightly so the AI stays consistent with tone and terminology. For nested structures, I map original positions and parent relationships, then rebuild everything after translation. Way better results than just splitting on tags.
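Here’s a stripped-down version of that walk (BeautifulSoup again; the chars/4 ratio, the limit, and the one-block overlap are arbitrary numbers):

```python
from bs4 import BeautifulSoup

CHARS_PER_TOKEN = 4     # rough character-to-token ratio
TOKEN_LIMIT = 3000      # arbitrary per-chunk budget

def chunk_with_overlap(html: str, overlap: int = 1) -> list[list[str]]:
    soup = BeautifulSoup(html, "html.parser")
    # Natural break points: whole block-level elements, never mid-sentence.
    blocks = [str(el) for el in soup.find_all(["p", "li", "h1", "h2", "h3"])]

    chunks, current = [], []
    for block in blocks:
        projected = len("".join(current + [block])) // CHARS_PER_TOKEN
        if current and projected > TOKEN_LIMIT:
            chunks.append(current)            # back up to the last complete block
            current = current[-overlap:]      # carry it over so tone and terminology stay consistent
        current.append(block)
    if current:
        chunks.append(current)
    return chunks
```

The overlapping block gets translated twice, so you drop the duplicate when you stitch the results back together.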