Extracting text from DOCX files on Google Drive using the Drive API

I’m stuck trying to get text from a DOCX file stored in Google Drive using their API. I’ve managed to download the file as a byte stream with the get_media method, but I can’t figure out how to convert it to readable text. Here’s what I’ve tried:

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
import io

auth = get_credentials()  # Assume this function exists
drive_service = build('drive', 'v3', credentials=auth)

doc_id = 'your_file_id_here'
request = drive_service.files().get_media(fileId=doc_id)

buffer = io.BytesIO()
downloader = MediaIoBaseDownload(buffer, request)

complete = False
while not complete:
    status, complete = downloader.next_chunk()
    print(f'Download progress: {int(status.progress() * 100)}%')

file_content = buffer.getvalue()

The code above successfully downloads the file, but when I try to decode it, I get ‘invalid continuation byte’ errors. Even when ignoring errors, the result isn’t usable text. Has anyone successfully extracted text from a DOCX file downloaded from Google Drive? Any help would be amazing!

hey dave, i’ve run into this before. the trick is to use a library like python-docx or mammoth. personally, i prefer mammoth cuz it’s simpler. just pip install mammoth and do this:

from io import BytesIO
import mammoth

# file_content is the bytes object from your download code
result = mammoth.extract_raw_text(BytesIO(file_content))
text = result.value

works like a charm for me. good luck!

I’ve encountered this issue before when working with DOCX files from Google Drive. The problem is that a DOCX file is actually a ZIP archive of XML files, not plain text, so the raw bytes can’t be decoded as UTF-8. To extract the text, you’ll need to use a library that can handle the DOCX format.
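
To illustrate why decoding the bytes directly fails, here is a minimal sketch that uses only the standard library. It builds a tiny stand-in archive in memory (an assumption for demonstration, not a complete DOCX) and reads the text from word/document.xml, which is roughly what a DOCX library does internally. The helper names make_stub_docx and paragraphs_from_docx_bytes are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_stub_docx(paragraphs):
    """Build a minimal stand-in archive (hypothetical helper, not a full DOCX)."""
    body = "".join(
        f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


def paragraphs_from_docx_bytes(data):
    """Return one string per <w:p> element found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    result = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs
        result.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return result


file_content = make_stub_docx(["Hello", "World"])
print(file_content[:2])  # ZIP archives begin with the signature bytes b'PK'
print(paragraphs_from_docx_bytes(file_content))
```

For real documents a dedicated library is still the better choice, since this sketch ignores tables, headers, footers and other parts of the format.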

I recommend using the python-docx library. Here’s how you can modify your code:

from docx import Document
from io import BytesIO

# ... your existing code to download the file ...

doc = Document(BytesIO(file_content))
full_text = []
for para in doc.paragraphs:
    full_text.append(para.text)

extracted_text = '\n'.join(full_text)

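This approach should give you the body text of the DOCX file. Note that doc.paragraphs only covers top-level body paragraphs, so text inside tables, headers, and footers is not included. Remember to install python-docx first with pip install python-docx. Hope this helps solve your problem!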
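
If the document also contains tables, a sketch along these lines could pick up the cell text as well. The function name collect_text is hypothetical; it only relies on the paragraphs, tables, rows, cells and text attributes that python-docx exposes, so the demonstration below uses stand-in objects rather than a real file.

```python
from types import SimpleNamespace


def collect_text(doc):
    """Join body paragraphs and table cell text from a python-docx Document.

    Only uses the .paragraphs, .tables, .rows, .cells and .text attributes,
    so it also works with any object shaped the same way.
    """
    lines = [para.text for para in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # One output line per table row, cells separated by tabs
            lines.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(lines)


# Stand-in object for demonstration; with python-docx you would pass
# Document(BytesIO(file_content)) instead.
fake_doc = SimpleNamespace(
    paragraphs=[SimpleNamespace(text="Intro")],
    tables=[SimpleNamespace(rows=[SimpleNamespace(
        cells=[SimpleNamespace(text="a"), SimpleNamespace(text="b")])])],
)
print(collect_text(fake_doc))
```

Table text is appended after all paragraphs here, so the original reading order is not preserved, and merged cells may show up more than once.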

I’ve dealt with this exact scenario in a project I worked on recently. While the python-docx library is a solid choice, I found that using the mammoth library gave me better results, especially with more complex DOCX files.

Here’s what worked for me:

First, install mammoth with pip install mammoth. Then, modify your code like this:

from io import BytesIO
import mammoth

# ... your existing download code here ...

result = mammoth.convert_to_html(BytesIO(file_content))
extracted_html = result.value  # an HTML string, not plain text

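Note that convert_to_html returns HTML rather than plain text, which is how it preserves some of the document’s structure, such as headings and lists. If you only need the plain text, use mammoth.extract_raw_text instead. This has worked well for me when dealing with DOCX files from Google Drive.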

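One caveat: if you’re dealing with a large number of files, you’ll want some error handling so that one corrupt or non-DOCX file doesn’t stop the whole batch, and possibly some concurrency for better performance. Keep in mind that the Google API Python client makes blocking calls, so asyncio only helps if you run those calls in threads; a plain thread pool is often simpler. Just something to keep in mind as you scale up your solution.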
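
As a rough sketch of that idea, the batch helper below runs a per-file function in a thread pool and collects failures instead of raising. Both extract_many and the extract_one callback are hypothetical names; in practice extract_one would download a file and parse it with python-docx or mammoth. Since the client library is built on httplib2, which is not thread-safe, each thread should use its own service object.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_many(file_ids, extract_one, max_workers=4):
    """Run extract_one(file_id) for each ID; return (texts, errors) dicts."""
    texts, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {fid: pool.submit(extract_one, fid) for fid in file_ids}
        for fid, future in futures.items():
            try:
                texts[fid] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                errors[fid] = exc
    return texts, errors


def fake_extract(file_id):
    """Stand-in for a real download-and-parse function (demonstration only)."""
    if file_id == "bad":
        raise ValueError("not a DOCX file")
    return f"text of {file_id}"


texts, errors = extract_many(["a", "bad", "b"], fake_extract)
print(texts)
print(errors)
```

Returning the errors keyed by file ID makes it easy to log or retry only the files that failed.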