I’m trying to write a program that extracts the text from a PDF file and translates it with the Google Translate API, but I can’t get it to work. I’ve tried several adjustments to my approach, but none of them has resolved the issue, and I’m not sure what the problem is.
Here is the code I am currently using to extract and translate the text:
from tika import parser
import os
#importing Google Translate library
from textblob import TextBlob
#Cleaning up any previous files
#os.remove("arifureta.txt")
#os.remove("arifureta-formater.txt")
#os.remove("arifureta-traduit.txt")
#Extracting text from the PDF
document = parser.from_file('/home/tom/Téléchargements/Arifureta_ From Commonplace to World_s Strongest Vol. 1.pdf')
text = document['content']
text = text.replace('https://mp4directs.com', '')
text = text.replace('\t', '')
text = text.replace('\r', '')
#Writing extracted text to a file
with open("arifureta.txt", "a") as text_file:
    text_file.write(text)
#Formatting text
with open("arifureta-formater.txt", "a") as formatted_file:
    previous_line_empty = 0
    with open("arifureta.txt") as f:
        for line in f:
            if len(line) == 1:
                previous_line_empty += 1
            else:
                previous_line_empty = 0
            if previous_line_empty < 2:
                formatted_file.write(line)
#Translating text
with open("arifureta-traduit.txt", "a") as translated_file:
    text_not_translated = ''
    line_count = 0
    with open("arifureta-formater.txt") as f:
        for line in f:
            if len(line) > 1:
                text_not_translated += line
                line_count += 1
                #Translating in batches of 1000 non-empty lines
                if line_count == 1000:
                    blob = TextBlob(text_not_translated)
                    try:
                        translated_output = str(blob.translate(from_lang='en', to='fr'))
                        translated_file.write(translated_output)
                        print(translated_output)
                    except Exception as error:
                        #Reporting the failure instead of silently dropping the batch
                        print(error)
                    #Resetting the buffer so the same text is not translated twice
                    text_not_translated = ''
                    line_count = 0
            if len(line) == 1:
                translated_file.write('\n')
    #Translating whatever is left after the last full batch
    if text_not_translated:
        translated_file.write(str(TextBlob(text_not_translated).translate(from_lang='en', to='fr')))
Ultimately, I hope to have the entire text from the PDF translated in the output file. However, currently, I’m getting a ‘broken link’ response, which I suspect might be due to the volume of text. I’m looking for any advice on alternate methods or solutions to this problem.
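One way to test the suspicion about volume would be to send much smaller pieces of text per request. The following is only a minimal sketch of that idea: the helper names `split_into_chunks` and `translate_in_chunks` are hypothetical, the 4000-character limit is a guess rather than a documented limit, and `translate` stands for whatever function actually performs the translation of a single string.

```python
def split_into_chunks(text, max_chars=4000):
    """Group paragraphs into chunks of at most max_chars characters.

    Paragraphs are assumed to be separated by a blank line. A single
    paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = ''
    for paragraph in text.split('\n\n'):
        candidate = paragraph if not current else current + '\n\n' + paragraph
        if len(candidate) <= max_chars or not current:
            # Still within the limit (or nothing buffered yet): keep growing.
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def translate_in_chunks(text, translate, max_chars=4000):
    """Apply translate (any callable taking and returning a str) chunk by chunk."""
    translated = [translate(chunk) for chunk in split_into_chunks(text, max_chars)]
    return '\n\n'.join(translated)
```

With this, the translation step could call something like `translate_in_chunks(text, my_translate)`, where `my_translate` is a placeholder for a function wrapping the translation call for one chunk. If small chunks succeed where the 1000-line batches fail, that would support the idea that request size is the cause.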