Breaking Down Large HTML Content for OpenAI Translation API in PHP Without Losing Structure

I’m working with HTML content from newsletters and articles that includes tags like <div>, <em>, and <b>, plus various attributes. Currently I’m sending the complete HTML to OpenAI for translation, which works fine for smaller content.

Sample HTML:

<div><em data-type="emphasis">Hello</em> and welcome to our <span class="featured">updated website</span>. 
Please review our <a href="https://sample.org">privacy policy</a> before continuing.</div>

The problem starts when the HTML content exceeds the model’s token limit. I need a way to split large HTML documents into manageable pieces for separate API calls while keeping the original markup structure intact.

I tried extracting just the plain text for translation, but that approach doesn’t work well. The results are much better when I include the complete HTML structure in my requests.

I faced a similar challenge with lengthy product descriptions in HTML. Translating large blocks directly leads to lost structure and context. What worked for me was using a DOM parser to break the HTML into smaller, meaningful parts while preserving the tags, keeping each chunk within roughly 2000–3000 tokens so every piece keeps its complete enclosing elements.

It’s essential to record where each chunk sits in the original document so you can piece everything back together after translation. Be careful with deeply nested elements: splitting inside them produces fragments without context and confusing translations, so always cut at element boundaries.
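A minimal sketch of that chunking step using PHP’s built-in DOMDocument is below. It splits only at top-level element boundaries, estimates tokens crudely as characters ÷ 4, and uses a 2500-token budget; the function name, the budget, and the estimate are just my examples, not anything from the OpenAI API.

```php
<?php
// Sketch: split an HTML string into chunks of complete top-level nodes.
// Token counts are estimated as strlen / 4 — swap in a real tokenizer
// if you need accuracy. A single oversized element still becomes its
// own chunk; recurse into it if that happens in your data.
function htmlChunks(string $html, int $maxTokens = 2500): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);           // tolerate imperfect real-world markup
    $doc->loadHTML('<?xml encoding="utf-8"?>' . $html);
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return [$html];                          // nothing to split
    }

    $chunks  = [];
    $current = '';

    foreach ($body->childNodes as $node) {
        $piece = $doc->saveHTML($node);          // serializes the node with tags and attributes intact
        if ($current !== '' && (strlen($current) + strlen($piece)) / 4 > $maxTokens) {
            $chunks[] = $current;                // close the chunk before it exceeds the budget
            $current  = '';
        }
        $current .= $piece;
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;                              // ordered: translate each, then implode('', $translated)
}
```

Because the chunks come back in document order, you can send each one as its own translation request and simply concatenate the translated chunks in the same order to rebuild the full document.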
