Need Help Building Research Assistant Tool with Local Document Search and AI Models

Hi everyone!

I’m a retired academic looking to create an automated research tool that can analyze my digital document collection. I have thousands of papers and books indexed with a search system called Recoll, and I want to build something that can help me dig deeper into topics.

What I want to build:

Basically, I need a system that can take a research topic, find relevant documents from my collection, then use AI to ask and answer questions about those documents automatically. The idea is to let it run multiple rounds of questions to build up a comprehensive understanding, then generate a final report.

My current setup:

  • Document indexing via Recoll (already working)
  • Want to use Langchain as the framework
  • Planning to run Ollama for local AI processing

Where I’m stuck:

I’m not very experienced with programming and my technical English isn’t great. The main issues I’m facing are:

  1. How to connect Recoll’s search results with Langchain
  2. Creating prompts that generate meaningful research questions
  3. Making the whole process iterative so it builds on previous findings

What I’m hoping for:

Any guidance on implementation approaches, code snippets for the Recoll integration, or just general advice would be amazing. If anyone wants to collaborate on this, I’d be very interested in working together.

The end goal is to make this open source for other researchers to use.

Thanks for any help you can offer!

honestly, the tech integration’s not the hard part - it’s getting good results from iterative prompting. i tried something similar last year and my AI kept asking the same questions after 3-4 rounds. maybe add a ‘question history’ check so it doesn’t circle back to the same topics? also, recoll can output json format which might be easier than xml parsing.
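The ‘question history’ check could be as simple as a normalized-set lookup before each new round. A minimal sketch (the QuestionHistory class and its normalization rule are just illustrative, not part of any library):

```python
def normalize(question: str) -> str:
    """Lowercase and strip punctuation so near-identical questions match."""
    kept = "".join(ch for ch in question.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

class QuestionHistory:
    """Track questions already asked and reject repeats across rounds."""

    def __init__(self):
        self._seen = set()

    def is_new(self, question: str) -> bool:
        """Return True (and record it) the first time a question appears."""
        key = normalize(question)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Exact-match on normalized text only catches literal repeats; for paraphrased duplicates you would need embedding similarity, but this alone usually stops the 3-4 round loop described above.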

Your Langchain and Ollama approach looks promising, but it’s advisable to start with a simpler setup before fully developing the iterative system. I took a similar path during my dissertation.

For integrating Recoll, consider using subprocess calls to run recoll queries from Python and then parse the output (note that recoll’s command-line mode prints one tab-separated result per line, not XML). Something like subprocess.run(['recoll', '-t', '-q', query_string], capture_output=True, text=True) gives you structured results that fit well with Langchain document loaders.

The iterative questioning was a significant challenge for me as well. What helped was maintaining a context file to track findings from each round. Begin with straightforward questions such as ‘What are the main themes here?’ and then use those insights to craft more in-depth questions in subsequent rounds.

I made the mistake of trying to process too many documents at once initially. It’s wise to start small and scale up gradually. Implementing some basic relevance scoring can also help filter out unimportant content early on.

Don’t let your technical English deter you; the documentation for both Langchain and Ollama has improved significantly. Start with their basic examples and modify them step by step.
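A minimal sketch of that subprocess wiring. The parser assumes recoll’s usual tab-separated line layout (mime type, [url], [title], size); verify the exact field order on your own system before relying on it:

```python
import subprocess

def parse_result_line(line: str) -> dict:
    """Parse one tab-separated recoll result line into a dict.
    Assumed layout: mime \t [url] \t [title] \t size -- check your output."""
    parts = line.split("\t")
    record = {"mime": parts[0] if parts else ""}
    if len(parts) > 1:
        record["url"] = parts[1].strip("[]")
    if len(parts) > 2:
        record["title"] = parts[2].strip("[]")
    return record

def search_recoll(query_string: str) -> list[dict]:
    """Run a Recoll query in command-line mode and parse the results."""
    proc = subprocess.run(
        ["recoll", "-t", "-q", query_string],
        capture_output=True, text=True, check=True,
    )
    # Skip header/status lines: real result lines contain tabs.
    return [parse_result_line(l) for l in proc.stdout.splitlines() if "\t" in l]

if __name__ == "__main__":
    for hit in search_recoll("climate adaptation"):
        print(hit.get("url", ""), "-", hit.get("title", ""))
```

Each dict can then be wrapped in a Langchain Document (the url becomes metadata, the file contents the page text).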

try recoll’s python bindings instead of subprocess calls - much cleaner than parsing xml or json. just import recoll and use it directly. for the iterative stuff, track question ‘quality’ somehow. if an ai answer creates new search terms, that’s probably worth following up on.
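For reference, the bindings route looks roughly like this. The API below (connect/query/execute/fetchmany and the doc attributes) is sketched from the Recoll documentation; verify the names against your installed version:

```python
def doc_to_record(doc) -> dict:
    """Flatten a Recoll doc object into a plain dict for Langchain loaders."""
    return {
        "url": getattr(doc, "url", ""),
        "title": getattr(doc, "title", ""),
        "abstract": getattr(doc, "abstract", ""),
    }

def search_with_bindings(terms: str, max_results: int = 20) -> list[dict]:
    """Query Recoll through its Python bindings instead of subprocess."""
    from recoll import recoll  # lazy import: needs the recoll python package
    db = recoll.connect()      # uses the default index config in ~/.recoll
    query = db.query()
    query.execute(terms)
    return [doc_to_record(doc) for doc in query.fetchmany(max_results)]

if __name__ == "__main__":
    for rec in search_with_bindings("climate adaptation"):
        print(rec["url"], "-", rec["title"])
```

No text parsing at all, and you get fielded metadata (title, abstract) directly instead of scraping it out of console output.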

Try adding a vector database to your Recoll setup. I did something similar for legal docs and found that embedding documents with sentence transformers beats keyword search alone. You can run ChromaDB or Qdrant locally, then use Langchain’s vector store retrievers to grab relevant chunks. Here’s what worked for me: structure your iterations around document themes, not random questions. After each AI round, pull out the main concepts and use those to guide your next Recoll searches. Way more focused. For the tech side, wrap Recoll in a Python class that formats results for Langchain. Keep each iteration to 2-3 specific questions and track what you’ve already covered vs. what needs more digging.
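A rough sketch of the chunk-and-store step with ChromaDB’s local client. The chunk sizes, collection name, and file path are arbitrary placeholders, and ChromaDB’s default embedding model is used rather than an explicit sentence-transformers setup:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split a document into overlapping character chunks for embedding."""
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks

if __name__ == "__main__":
    # Hypothetical wiring: store chunks in a persistent local collection.
    import chromadb
    client = chromadb.PersistentClient(path="./research_db")
    collection = client.get_or_create_collection("papers")

    text = open("paper.txt").read()  # e.g. text extracted from a Recoll hit
    chunks = chunk_text(text)
    collection.add(
        documents=chunks,
        ids=[f"paper-{i}" for i in range(len(chunks))],
    )
    # Semantic retrieval: nearest chunks to a natural-language query.
    hits = collection.query(query_texts=["main findings"], n_results=3)
    print(hits["documents"])
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides, which matters more for academic prose than for short documents.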

I’ve built similar research automation systems before, and integration between tools is always the biggest headache. You’ll end up writing tons of custom code to connect Recoll with Langchain, plus all the iterative questioning logic.

Skip the custom integration nightmare - automate the whole workflow instead. Set up triggers that run searches, feed results to your AI models, and chain analysis rounds together without complex coding.

Run Recoll searches from your workflow, then send the results to Ollama’s local HTTP API, and build iterative questioning as automated steps. Each round feeds into the next, building your research report automatically.
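Ollama does expose a local HTTP endpoint (by default http://localhost:11434/api/generate), so one analysis call per round is a small request. A minimal sketch, with the model name as a placeholder:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming request body for Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(model: str, prompt: str) -> str:
    """Send one prompt to a locally running Ollama server, return the reply."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # "llama3" is a placeholder; use whichever model you have pulled.
    print(ask_ollama("llama3", "Summarize the main themes in: ..."))
```

With stream set to False the server returns a single JSON object, which keeps the chaining logic simple: each round's "response" string becomes part of the next round's prompt.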

For prompts, use templates populated with previous findings: “Based on these documents: [search results] and previous analysis: [prior findings], generate 3 follow-up research questions.”
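That template can be plain string formatting; a sketch with illustrative names (Langchain’s PromptTemplate does the same job with added input validation, if you’re already using the framework):

```python
FOLLOW_UP_TEMPLATE = (
    "Based on these documents:\n{search_results}\n\n"
    "And previous analysis:\n{prior_findings}\n\n"
    "Generate {n} follow-up research questions."
)

def build_follow_up_prompt(search_results: str, prior_findings: str,
                           n: int = 3) -> str:
    """Fill the template with this round's results and accumulated findings."""
    return FOLLOW_UP_TEMPLATE.format(
        search_results=search_results,
        prior_findings=prior_findings,
        n=n,
    )
```

Keeping the template as a module-level constant makes it easy to tune the wording between runs without touching the loop logic.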

I built something like this for patent research at work. Runs overnight, processes hundreds of documents, delivers comprehensive reports by morning. Zero manual work.

You can prototype the workflow visually, test each step, then let it run automatically on your documents. Way faster than coding from scratch.

I’ve been hitting the same document analysis headaches at work. The subprocess route everyone talks about? It’s a nightmare when you’re dealing with thousands of files.

You need a visual workflow that connects everything - no custom coding required. Picture building blocks: one handles Recoll queries, another processes with Ollama, another runs the question loop.

I built something like this for our tech docs. Research topic triggers search → results feed the AI → questions get generated and answered → system decides whether to dig deeper or move on.

Treat this as workflow automation, not a coding problem. Map everything visually, test each piece separately, then let it run.

For the iterative bit, add decision logic that checks if you’ve gathered enough info or need more questions. Way cleaner than babysitting context files.

Your open source angle works perfectly here - other researchers can tweak the workflow without touching any code.