Issues with Google Translate API when translating PDF text

I’m trying to write a program that pulls the text out of a PDF file and translates it with the Google Translate API, but I can’t get it to work. I’ve adjusted my approach several times, but nothing has resolved the issue and I can’t tell what the actual problem is.

Here is the code I am currently using to extract and translate the text:

from tika import parser
import os
#importing the translation library (TextBlob wraps Google Translate)
from textblob import TextBlob

#Cleaning up any previous files
#os.remove("arifureta.txt")
#os.remove("arifureta-formater.txt")
#os.remove("arifureta-traduit.txt")

#Extracting text from the PDF
document = parser.from_file('/home/tom/Téléchargements/Arifureta_ From Commonplace to World_s Strongest Vol. 1.pdf')
text = document['content']
text = text.replace('https://mp4directs.com', '')
text = text.replace('\t', '')
text = text.replace('\r', '')

#Writing extracted text to a file
with open("arifureta.txt", "a") as text_file:
    text_file.write(text)

#Formatting text
with open("arifureta-formater.txt", "a") as formatted_file:
    previous_line_empty = 0
    with open("arifureta.txt") as f:
        for line in f:
            if len(line) == 1:
                previous_line_empty += 1
            else:
                previous_line_empty = 0
            if previous_line_empty < 2:
                formatted_file.write(line)

#Translating text
with open("arifureta-traduit.txt", "a") as translated_file:
    text_not_translated = ''
    line_count = 0
    with open("arifureta-formater.txt") as f:
        for line in f:
            translated_file.write(str(TextBlob(text_not_translated).translate(from_lang='en', to='fr')))
            if len(line) > 1:
                text_not_translated += line
                line_count += 1
            if line_count % 1000 == 0:
                blob = TextBlob(text_not_translated)
                try:
                    translated_output = str(blob.translate(from_lang='en', to='fr'))
                    translated_file.write(translated_output)
                    print(translated_output)
                except Exception:
                    pass
                line_count = 0
            if len(line) == 1:
                translated_file.write('\n')

Ultimately, I want the entire text of the PDF translated into the output file. At the moment, though, I’m getting a ‘broken link’ response, which I suspect is due to the volume of text. I’d appreciate any advice on alternative methods or solutions to this problem.

TextBlob’s translate method can be unreliable for large texts like yours. Try switching to the googletrans library instead; in my experience it copes better with chunked input and has fewer timeout issues. Also, your translation logic is backwards: you’re translating an empty string first and only adding content to it afterward.

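Looking at your code, there’s a fundamental logic issue in your translation loop that’s likely causing the problems. You call TextBlob(text_not_translated).translate() at the top of every iteration, before the current line has been added. On the first pass the string is empty, and on every later pass it holds everything accumulated so far, because text_not_translated is never cleared (only line_count is reset). Each request is therefore larger than the last, the same text gets written to the output file over and over, and since that call sits outside the try block, the first failure stops the whole script. Requests that keep growing like this probably contribute to your broken link errors.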
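A minimal sketch of a loop that avoids those problems: accumulate first, translate once per batch, then clear the buffer. The function name and the `translate` callable are placeholders rather than part of any library; `translate` stands in for whatever call you end up using, for example a small wrapper around the TextBlob call from the question.

```python
from typing import Callable, Iterable, Iterator, List


def translate_in_batches(
    lines: Iterable[str],
    translate: Callable[[str], str],
    batch_size: int = 1000,
) -> Iterator[str]:
    """Yield translated text one batch at a time.

    `translate` is a hypothetical callable that takes source text and
    returns the translated text.
    """
    buffer: List[str] = []
    count = 0  # number of non-blank lines currently in the buffer
    for line in lines:
        buffer.append(line)  # keep blank lines so paragraph breaks survive
        if line.strip():
            count += 1
        if count >= batch_size:
            yield translate("".join(buffer))
            buffer, count = [], 0  # reset so nothing is sent twice
    if count:
        # Flush whatever is left after the last full batch.
        yield translate("".join(buffer))
```

With the files from the question, this could be driven as `for piece in translate_in_batches(f, my_translate, batch_size=200): translated_file.write(piece)`, where `my_translate` is your own wrapper. Every line is sent exactly once, and nothing is translated before it has been read.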

I ran into similar issues when working on large document translations last year. The Google Translate API has rate limits and request size restrictions that aren’t immediately obvious. What worked for me was proper chunking with delays between requests, plus retry logic with exponential backoff. Also keep in mind that TextBlob uses Google Translate’s free service, which has stricter limits than the official paid API.
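As a rough illustration of the retry idea, the sketch below retries a failing call and doubles the wait each time. The helper name, the default attempt count, and the one-second base delay are all assumptions chosen for the example, and `translate` is again a placeholder for your own translation call.

```python
import time
from typing import Callable


def translate_with_retry(
    text: str,
    translate: Callable[[str], str],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `translate(text)`, retrying with exponential backoff on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return translate(text)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the real error
            # Wait 1s, 2s, 4s, ... (with the default base_delay) before retrying.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Unlike the bare `except Exception: pass` in the question, the last failure is re-raised here, so you get to see the actual error instead of a silently incomplete output file.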

For large PDFs, you might want to process smaller chunks sequentially rather than accumulating 1000 lines at once. Try reducing your batch size to something like 100-200 lines and add a small delay between translation requests to avoid hitting rate limits.
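One way this could look, as a sketch: group lines into chunks capped by line count and by character count, then pause between requests. The 150-line cap follows the range suggested above; the 4000-character cap and the one-second pause are assumed values, not documented limits of any service, and both function names are made up for the example.

```python
import time
from typing import Callable, Iterable, Iterator, List


def chunk_lines(
    lines: Iterable[str],
    max_lines: int = 150,
    max_chars: int = 4000,
) -> Iterator[str]:
    """Group lines into chunks that respect both limits where possible."""
    chunk: List[str] = []
    size = 0
    for line in lines:
        # Start a new chunk if adding this line would exceed either limit.
        if chunk and (len(chunk) >= max_lines or size + len(line) > max_chars):
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(line)  # a single over-long line still gets its own chunk
        size += len(line)
    if chunk:
        yield "".join(chunk)


def translate_with_pause(
    chunks: Iterable[str],
    translate: Callable[[str], str],
    pause: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Translate chunks one by one, sleeping between consecutive requests."""
    results: List[str] = []
    for index, chunk in enumerate(chunks):
        if index:
            sleep(pause)  # no pause before the first request
        results.append(translate(chunk))
    return results
```

Splitting on line boundaries keeps sentences intact in most cases, which tends to matter more for translation quality than hitting an exact chunk size.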

Your main issue stems from the translation service’s limitations rather than from the code structure itself. I’ve dealt with similar PDF translation projects and found that TextBlob’s underlying service becomes unreliable when processing large volumes of text, especially text extracted from PDFs, which often contains formatting artifacts.

The “broken link” error typically occurs when the translation service can’t handle the request size or encounters malformed text from PDF extraction. PDF text extraction often includes hidden characters, metadata, and formatting that can break translation APIs. Before translating, I recommend adding better text cleaning - remove non-printable characters, normalize whitespace, and filter out any remaining PDF artifacts.
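A minimal cleaning pass along those lines might look like the sketch below. The function name is made up, and the exact rules (NFKC normalization, dropping every control and format character except newlines, allowing at most one blank line in a row) are assumptions you would tune for your own PDF.

```python
import re
import unicodedata


def clean_extracted_text(text: str) -> str:
    """Strip common PDF-extraction debris before sending text to a translator."""
    # Unify line endings and turn tabs into spaces before filtering.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\t", " ")
    # NFKC folds ligatures such as U+FB01 into plain letters and
    # non-breaking spaces into ordinary spaces.
    text = unicodedata.normalize("NFKC", text)
    # Drop control and format characters (Unicode categories starting with
    # "C"), which covers form feeds and zero-width spaces, but keep newlines.
    text = "".join(
        ch for ch in text
        if ch == "\n" or unicodedata.category(ch)[0] != "C"
    )
    text = re.sub(r" {2,}", " ", text)      # collapse runs of spaces
    text = re.sub(r" +\n", "\n", text)      # trim trailing spaces on each line
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()
```

Note that NFKC also rewrites characters such as superscript digits, so check a sample of the output before running it over a whole book.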

Another approach I’ve used successfully is switching to the official Google Cloud Translate API with proper authentication. It’s more reliable for batch processing and handles larger text volumes better than the free services. The small cost is usually worth it for production use cases.
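For reference, a sketch of what that switch could look like with the `google-cloud-translate` package and its `translate_v2` client. It assumes the package is installed and that credentials are configured (for example through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable); the wrapper function and its parameter names are hypothetical.

```python
from typing import Any, Iterable, List, Optional


def translate_with_cloud_api(
    chunks: Iterable[str],
    source: str = "en",
    target: str = "fr",
    client: Optional[Any] = None,
) -> List[str]:
    """Translate each chunk with the Cloud Translation (Basic, v2) client."""
    if client is None:
        # Imported lazily so the function can also be exercised with a stub.
        from google.cloud import translate_v2 as translate

        client = translate.Client()  # uses Application Default Credentials
    translated: List[str] = []
    for chunk in chunks:
        result = client.translate(
            chunk,
            source_language=source,
            target_language=target,
            format_="text",  # plain text in and out, no HTML escaping
        )
        translated.append(result["translatedText"])
    return translated
```

The official API still enforces quotas and per-request size limits, so it is worth keeping the input chunked rather than sending the whole book in one call.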