Building a Universal RAG System for Mixed Content Types Including Documents and Media

I need help designing a comprehensive RAG system that can handle mixed content types: text documents, images, charts, and spreadsheets. My source files come in various formats, including Word documents, Excel workbooks, and PDFs.

What I need the system to do:

  • Return image-only responses when needed
  • Provide highly precise text answers for procedural information
  • Maintain logical flow even when perfect accuracy isn’t critical

Current document challenges:

  • Text-only Word and Excel files work fine. For these I just need to tune my embedding approach and retrieval parameters, such as chunk size, chunk overlap, and similar settings
  • Complex files with mixed content (Word docs with pictures, PDFs containing graphs and data tables) are giving me trouble. I haven’t figured out a unified approach yet
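For the chunk-size and overlap tuning mentioned above, a minimal sketch of an overlapping text splitter might look like the following. This is a character-based approximation (a real pipeline would typically count tokens with the model's tokenizer); all names and the 500/100 defaults are illustrative assumptions, not settings from any particular library.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks.

    Character-based stand-in for token chunking; swap in a tokenizer
    (e.g. the embedding model's own) for exact token counts.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each step
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the final window reached the end of the text
    return chunks

doc = "word " * 300  # 1500 characters of placeholder text
pieces = chunk_text(doc, chunk_size=500, overlap=100)
print(len(pieces), len(pieces[0]))  # 4 chunks, first one 500 chars
```

The overlap ensures that a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which usually matters more for procedural text (your "highly precise answers" case) than for narrative content.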

My setup includes:

  • Local language models (Llama 3.1 8B Instruct, Qwen2-7B-Instruct)

Can someone help me design a complete workflow that addresses these different content types? I’m looking for practical guidance on building this kind of multi-format RAG architecture.

Multimodal RAG is tricky for sure! Separating text and images into different indexes is a smart move. unstructured.io is helpful for parsing mixed documents, but don't forget to use CLIP or BLIP for the visual content. Just make sure to route queries according to the content type you detect. Good luck!
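The "route queries according to content type" step above can be sketched as a simple keyword heuristic. This is a toy stand-in, assuming keyword hints are enough to pick an index; a production router would more likely use a classifier or an LLM-based decision, and the hint lists and function names here are illustrative, not from any library.

```python
# Keyword hints for picking a retrieval index (illustrative assumptions).
IMAGE_HINTS = ("diagram", "image", "picture", "chart", "figure", "screenshot")
TABLE_HINTS = ("table", "spreadsheet", "column", "row", "cell", "total")

def route_query(query: str) -> str:
    """Return which index to search: 'image', 'table', or 'text'."""
    q = query.lower()
    if any(word in q for word in IMAGE_HINTS):
        return "image"   # search CLIP/BLIP image embeddings
    if any(word in q for word in TABLE_HINTS):
        return "table"   # search the structured/tabular index
    return "text"        # default: plain text-chunk embeddings

print(route_query("Show me the architecture diagram"))           # image
print(route_query("What is the Q3 total in the spreadsheet?"))   # table
print(route_query("How do I reset my password?"))                # text
```

Each branch then answers in the matching mode: the image index can back the image-only responses the question asks for, while the text index serves precise procedural answers.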