I’m working with HTML content from newsletters and articles that includes tags like <div>, <em>, and <b>, along with various attributes. Currently I send the complete HTML to OpenAI for translation, which works fine for smaller documents.
Sample HTML:
<div><em data-type="emphasis">Hello</em> and welcome to our <span class="featured">updated website</span>.
Please review our <a href="https://sample.org">privacy policy</a> before continuing.</div>
The problem happens when my HTML content exceeds the model’s token limit. I need a way to divide large HTML documents into manageable pieces for separate API calls, where each piece keeps its markup well-formed so the translated chunks can be reassembled into the original structure.
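To show what I mean by cutting at safe points, here is a rough sketch of the direction I’ve been considering, using only the standard library’s html.parser: record the offsets of elements at nesting depth 0 and only cut between complete top-level siblings. The function name chunk_html and the max_chars character budget are my own placeholders (a real version would count tokens, not characters, and would recurse into an element that is itself too large, e.g. one giant wrapping <div>):

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect depth tracking.
VOID = {"br", "img", "hr", "input", "meta", "link", "area",
        "base", "col", "embed", "source", "track", "wbr"}

class BoundaryFinder(HTMLParser):
    """Record character offsets of depth-0 start tags: the only places
    where the document can be cut without splitting an open element."""

    def __init__(self, html):
        super().__init__(convert_charrefs=False)
        self.depth = 0
        self.cuts = []
        # Precompute line-start offsets so getpos() (line, col) can be
        # converted to an absolute character offset.
        self.line_starts = [0]
        for i, ch in enumerate(html):
            if ch == "\n":
                self.line_starts.append(i + 1)

    def _offset(self):
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.cuts.append(self._offset())
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

def chunk_html(html, max_chars=2000):
    """Greedily pack complete top-level elements into chunks of at
    most max_chars characters (a crude stand-in for a token budget)."""
    finder = BoundaryFinder(html)
    finder.feed(html)
    finder.close()
    cuts = finder.cuts + [len(html)]
    chunks, start = [], 0
    for i in range(1, len(cuts)):
        if cuts[i] - start > max_chars and cuts[i - 1] > start:
            chunks.append(html[start:cuts[i - 1]])
            start = cuts[i - 1]
    chunks.append(html[start:])
    return chunks
```

Each chunk is then a self-contained HTML fragment that can go into its own translation request, and joining the translated chunks in order reproduces the full document. Is this a reasonable approach, or is there a more robust way to do it?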
I attempted extracting just the plain text for translation, but that approach isn’t working well: the results are much better when I include the complete HTML structure in my requests, and with plain text alone I have no reliable way to put the translated strings back into their original tags.
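For reference, the plain-text extraction I tried was essentially this (a minimal stdlib sketch, hypothetical names), which shows the problem: all tag boundaries are discarded, so there is nothing to map the translated text back onto:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes, dropping all tags and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Running this on the sample above yields the sentence without any of the <em>, <span>, or <a> markup, so after translation the emphasis and link spans can’t be restored reliably.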