I have a Google Document that includes several images and data tables. I want to turn this document into a Markdown file so I can use it as a blog post on my Jekyll site.
I’m wondering if there’s a good way to do this conversion while keeping all the formatting intact. Should I try exporting the Google Doc as a PDF first and then converting that PDF to Markdown? Or is there a better approach?
I’m particularly concerned about what happens to the images and tables during the conversion process. Will they get preserved properly or will I lose the formatting? Has anyone successfully done this kind of conversion before?
Any suggestions for tools or methods that work well for this would be really helpful. I’d prefer to avoid manually rewriting everything if possible.
You want to convert a Google Document containing images and data tables into a Markdown file for use in a Jekyll blog post, while preserving all formatting. You’re unsure of the best approach, considering options like exporting to PDF first and then converting. You’re especially concerned about the integrity of images and tables during the conversion process.
Step-by-Step Guide:
Export to .docx: The most efficient method is to first export your Google Document as a Microsoft Word (.docx) file. This format generally preserves formatting better than exporting directly to HTML or PDF. To do this, open your Google Doc, go to File > Download > Microsoft Word (.docx). Save the file to a convenient location.
Use Pandoc for Conversion: Pandoc is a powerful command-line tool for document conversion. It excels at handling the intricacies of different document formats, including .docx and Markdown. You’ll need to install Pandoc on your system. Instructions for your operating system can be found on the official Pandoc website: https://pandoc.org/installing.html.
Run the Pandoc Conversion Command: Once Pandoc is installed, open your terminal or command prompt and navigate to the directory where you saved the .docx file. Execute the following command:
pandoc document.docx -o output.md --extract-media=./images
Replace document.docx with the actual filename of your Word document and output.md with the desired name for your Markdown file. The --extract-media=./images option extracts any images from the .docx file into a folder named “images” in the current directory (for .docx input, Pandoc typically places the files in a media subfolder inside it) and points the Markdown image links at the extracted files.
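If you convert posts regularly, the conversion described above can be driven from a small script instead of being typed by hand. This is a minimal sketch, assuming Pandoc is installed and on your PATH; the function names build_pandoc_command and convert_docx are made up for illustration, and the only Pandoc options used are -o and --extract-media.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path, output_path, media_dir="./images"):
    """Return the argument list for the docx-to-Markdown conversion."""
    return [
        "pandoc",
        str(docx_path),
        "-o", str(output_path),
        f"--extract-media={media_dir}",
    ]


def convert_docx(docx_path, output_path, media_dir="./images"):
    """Run Pandoc; raises CalledProcessError if the conversion fails."""
    if not Path(docx_path).is_file():
        raise FileNotFoundError(docx_path)
    command = build_pandoc_command(docx_path, output_path, media_dir)
    subprocess.run(command, check=True)


# Example call (needs a real document.docx in the current directory):
# convert_docx("document.docx", "output.md")
```

Keeping the argument list in its own function makes it easy to print or log the exact command when something goes wrong.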
Adjust Image Paths (If Necessary): Pandoc will automatically create Markdown links to your extracted images. However, you may need to adjust these paths to match the file structure of your Jekyll site. Copy the “images” folder into your site’s source tree, not into _site/, which Jekyll regenerates on every build. For example, if your site keeps static files in an assets folder at the project root, move the folder to assets/images/ and change image paths such as ./images/media/image1.png to /assets/images/media/image1.png.
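Rewriting the image paths by hand gets tedious when a post has many images. The following is a rough sketch of doing it with a regular expression; rewrite_image_paths is a hypothetical helper, and the default prefixes (./images/ and /assets/images/) are assumptions that you should change to match your own output and site layout. It only handles the ![alt](path) form, so any HTML img tags in the output would need separate treatment.

```python
import re

# Matches ![alt](path optional-title) and captures the three parts.
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)([^)]*)\)")


def rewrite_image_paths(markdown, old_prefix="./images/", new_prefix="/assets/images/"):
    """Replace old_prefix at the start of each image path with new_prefix."""

    def fix(match):
        alt, path, rest = match.groups()
        if path.startswith(old_prefix):
            path = new_prefix + path[len(old_prefix):]
        # Paths that do not start with old_prefix are written back unchanged.
        return f"![{alt}]({path}{rest})"

    return IMAGE_LINK.sub(fix, markdown)
```

Images that already point elsewhere, such as remote URLs, are left untouched because only paths starting with the old prefix are changed.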
Review and Refine the Markdown: Open the output.md file and review the converted Markdown. Pandoc generally handles tables well, but you might need to manually adjust the alignment or formatting of some tables. Check that all the images render correctly.
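Part of this review can be automated. As a sketch, the helper below lists image links in a Markdown file that do not point to an existing file. The name find_missing_images and the site_root parameter are illustrative assumptions: paths starting with / are resolved against site_root, and other paths are resolved against the folder that contains the Markdown file.

```python
import re
from pathlib import Path

# Captures the path part of ![alt](path ...).
IMAGE_PATH = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def find_missing_images(markdown_file, site_root="."):
    """Return image paths referenced in the file that do not exist on disk."""
    markdown_file = Path(markdown_file)
    text = markdown_file.read_text(encoding="utf-8")
    missing = []
    for path in IMAGE_PATH.findall(text):
        if path.startswith(("http://", "https://")):
            continue  # remote images cannot be checked on disk
        if path.startswith("/"):
            candidate = Path(site_root) / path.lstrip("/")
        else:
            candidate = markdown_file.parent / path
        if not candidate.is_file():
            missing.append(path)
    return missing
```

An empty list means every local image link resolved to a file.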
Common Pitfalls & What to Check Next:
Pandoc Installation: Ensure Pandoc is correctly installed and added to your system’s PATH environment variable. If you encounter errors, double-check the Pandoc installation instructions for your operating system.
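A quick way to confirm the PATH setup from a script is to ask the operating system where the executable is. This is a small sketch using Python's standard shutil.which; the function names are illustrative.

```python
import shutil


def find_pandoc():
    """Return the full path of the pandoc executable, or None if not on PATH."""
    return shutil.which("pandoc")


def describe_pandoc_status():
    """Return a one-line, human-readable installation status."""
    path = find_pandoc()
    if path is None:
        return "pandoc not found on PATH; see https://pandoc.org/installing.html"
    return f"pandoc found at {path}"
```

Printing describe_pandoc_status() before a conversion makes a missing installation obvious before anything else fails.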
Image Paths: Carefully review and correct image paths in your Markdown file to match your Jekyll site’s asset structure. Broken image links are a common issue after conversion.
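Table Formatting: While Pandoc usually handles tables effectively, complex tables (for example merged cells or cells with several paragraphs) might require manual adjustments in the resulting Markdown file. Pandoc’s default Markdown output can also use table syntaxes that Jekyll’s default renderer, kramdown, does not understand; if tables show up as plain text, try adding -t gfm to the command so that Pandoc writes pipe tables where it can.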
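When a table does need to be rebuilt by hand, generating the pipe-table syntax from the raw cell values is less error-prone than typing it. The helper below is a minimal sketch; to_pipe_table and its align parameter are made up for illustration, and it assumes every row has the same number of cells as the header.

```python
def to_pipe_table(header, rows, align=None):
    """Build a Markdown pipe table; align holds 'l', 'c', or 'r' per column."""
    align = align or ["l"] * len(header)
    markers = {"l": ":---", "c": ":---:", "r": "---:"}

    def line(cells):
        # Escape literal pipes so they do not split a cell.
        escaped = (str(cell).replace("|", "\\|") for cell in cells)
        return "| " + " | ".join(escaped) + " |"

    separator = "| " + " | ".join(markers[a] for a in align) + " |"
    return "\n".join([line(header), separator] + [line(row) for row in rows])
```

For example, to_pipe_table(["Name", "Count"], [["a", 1]], align=["l", "r"]) produces a three-line table with a right-aligned second column.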
Alternative Tools: If you encounter difficulties with Pandoc, explore other conversion tools or consider using a Google Doc add-on designed for Markdown conversion (though these might have limitations with complex formatting).
Still running into issues? Share your (sanitized) config files, the exact command you ran, and any other relevant details. The community is here to help!
Skip the manual work and conversion headaches. I’ve been through this nightmare at work more times than I can count.
Automate it. Build a workflow that pulls from the Google Docs API, grabs your content, downloads images, converts tables to Markdown, and pushes everything straight to Jekyll.
You don’t even need to code this. Just connect Google Docs for content extraction, add image processing for media files, convert tables to Markdown, and auto-commit to your repo.
Built something like this last month for our docs team. They update Google Docs, blog posts show up on Jekyll automatically. No more export-convert-upload hell.
Best part? Images get handled properly. The automation grabs them, optimizes them, drops them in the right Jekyll folders, and fixes all the Markdown links.
Saves hours every publish. You can reuse it for all future posts too.
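The workflow itself is not shown in this answer, but its last step, turning converted Markdown into a file Jekyll will publish, is easy to sketch. The code below assumes the standard Jekyll conventions of a _posts folder, a YYYY-MM-DD-title.md filename, and YAML front matter; the function names and the "layout: post" value are assumptions that depend on your theme. Fetching content from Google Docs and committing to the repository are left out.

```python
import re
from datetime import date
from pathlib import Path


def slugify(title):
    """Lowercase the title and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def write_jekyll_post(markdown_body, title, posts_dir="_posts", post_date=None):
    """Write the body with YAML front matter to posts_dir/YYYY-MM-DD-slug.md."""
    post_date = post_date or date.today()
    # Escape backslashes and double quotes so the YAML string stays valid.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    front_matter = "\n".join([
        "---",
        "layout: post",  # assumed layout name; depends on the theme
        f'title: "{safe_title}"',
        f"date: {post_date.isoformat()}",
        "---",
        "",
    ])
    target = Path(posts_dir) / f"{post_date.isoformat()}-{slugify(title)}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(front_matter + "\n" + markdown_body, encoding="utf-8")
    return target
```

The function returns the path it wrote, so a later step can stage that file for a commit.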
Honestly, just grab the Google Docs add-on “Docs to Markdown” - works great for basic conversions. You’ll need to manually upload images to your Jekyll assets folder, but tables convert pretty well. Way easier than dealing with Pandoc or setting up complex workflows.
Based on my experience with similar projects, I recommend exporting your Google Document directly to HTML (File > Download > Web Page (.html, zipped)) instead of starting with a PDF. This method tends to preserve the structure better for later conversion. Once you have the HTML, a tool like Pandoc simplifies the transition to Markdown. When converting, you can run a command like pandoc -f html -t markdown input.html -o output.md. Just be mindful that images and tables may need some extra attention; you’ll want to handle the images manually by saving them to your Jekyll assets folder and adjusting the links accordingly. Overall, this approach minimizes reformatting work while retaining the document’s original layout.
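The manual image step can be shortened with a short script. This is a sketch only, assuming the HTML export arrives as a zip archive that holds the .html file alongside the image files; extract_images and the list of image extensions are illustrative choices, not part of any tool mentioned here.

```python
import zipfile
from pathlib import Path, PurePosixPath

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"}


def extract_images(zip_path, assets_dir):
    """Copy image files from the exported zip into assets_dir; return new paths."""
    assets_dir = Path(assets_dir)
    assets_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            name = PurePosixPath(info.filename)
            if info.is_dir() or name.suffix.lower() not in IMAGE_SUFFIXES:
                continue
            target = assets_dir / name.name  # flatten: keep only the file name
            target.write_bytes(archive.read(info))
            copied.append(target)
    return copied
```

Because the files are flattened into one folder, the image links in the converted Markdown still need to be pointed at that folder afterwards.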