Has anyone used multiple AI models to evaluate npm package documentation quality?

I’m responsible for our company’s tech stack decisions, and I’ve noticed a pattern: we consistently run into issues with poorly documented npm packages. Even popular packages sometimes have terrible docs that lead to extended dev time and frustration.

I’m wondering if anyone has created a system to automatically evaluate npm package documentation quality before adding packages to our approved list?

I’ve been exploring Latenode since they offer access to 400+ AI models through a single subscription. I’m thinking about building a validation workflow that could:

  • Analyze documentation completeness (examples, API reference, etc.)
  • Check for recent updates to docs as the package evolves
  • Score readability and clarity
  • Compare documentation quality across similar packages
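Roughly, I imagine the completeness check starting with plain heuristics before spending any model calls - something like this Node sketch (the section names and regexes below are just my guesses at what to look for, not any standard):

```javascript
// Heuristic README completeness check: scores a README by which
// expected sections it contains. The section list is an assumption.
const EXPECTED_SECTIONS = [
  { name: "installation", pattern: /^#+\s*(install|installation|getting started)/im },
  { name: "usage examples", pattern: /^#+\s*(usage|examples?|quick start)/im },
  { name: "api reference", pattern: /^#+\s*(api|reference|methods)/im },
  { name: "troubleshooting", pattern: /^#+\s*(troubleshooting|faq|common (issues|errors))/im },
];

function completenessScore(readme) {
  const found = EXPECTED_SECTIONS.filter((s) => s.pattern.test(readme));
  return {
    score: found.length / EXPECTED_SECTIONS.length,
    missing: EXPECTED_SECTIONS.filter((s) => !s.pattern.test(readme)).map((s) => s.name),
  };
}

const sample = "# my-pkg\n\n## Installation\nnpm i my-pkg\n\n## Usage\n...\n";
console.log(completenessScore(sample));
// score: 0.5, missing: api reference, troubleshooting
```

The AI models would then only need to judge the sections that exist, which should keep prompts focused and cheap.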

Has anyone built something like this? Which AI models worked best for documentation analysis? Any gotchas I should be aware of?

Appreciate any insights!

I actually built this exact system for our team last quarter and it’s been surprisingly effective at filtering out problematic packages.

I created a Latenode workflow that analyzes documentation quality using multiple AI models in sequence - each focusing on different aspects of the docs. Here’s how it works:

First, it pulls the README, GitHub Wiki, and any docs directory content using their API modules. Then it runs this content through a series of specialized AI agents:

  • Claude handles the overall completeness check (it’s better at detecting missing sections)
  • GPT-4 evaluates code examples and API reference quality
  • A custom-trained model checks whether the docs have been updated alongside recent code changes
  • Mistral evaluates readability and generates an overall score
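In outline, the orchestration is just a sequential pipeline where each stage sees the docs plus the findings accumulated so far. Here's a stripped-down, dependency-free sketch - the `callModel` stub stands in for whatever model API you actually use, and the stage prompts and model names are simplified for illustration:

```javascript
// Sequential multi-model review pipeline. Each stage focuses on one
// aspect of the docs and sees the prior stages' findings.
// `callModel` is a stub standing in for a real model API call.
async function callModel(model, prompt) {
  // Placeholder: a real implementation would call the provider's API.
  return `[${model}] reviewed: ${prompt.slice(0, 40)}...`;
}

const STAGES = [
  { model: "claude", task: "List sections that appear to be missing." },
  { model: "gpt-4", task: "Evaluate code examples and API reference quality." },
  { model: "mistral", task: "Rate readability 1-10 and give an overall score." },
];

async function reviewDocs(docs) {
  const findings = [];
  for (const stage of STAGES) {
    const prompt =
      `${stage.task}\n\nDocs:\n${docs}\n\nPrior findings:\n${findings.join("\n")}`;
    findings.push(await callModel(stage.model, prompt));
  }
  return findings;
}

reviewDocs("# example-pkg\n## Usage\n...").then((f) => console.log(f.length)); // 3
```

Passing prior findings forward is what lets the readability/scoring stage at the end weigh what the earlier stages flagged.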

What makes this powerful is having these models work together through a single workflow. Each has different strengths, and Latenode makes it easy to orchestrate them without managing separate API keys for each model.

The system has saved our developers countless hours of frustration. We’ve seen a direct correlation between documentation scores and developer productivity with new packages.

The scoring system integrates with our internal package registry so devs can see documentation quality ratings alongside other metrics when choosing packages.

Definitely worth checking out: https://latenode.com

We built something similar that’s been really helpful for our team. Rather than just a yes/no filter, we created a scoring system that evaluates packages across multiple documentation dimensions.

The most useful metrics we track are:

  1. Example coverage - does the documentation include examples for all major functions/features?
  2. Update freshness - are the docs being maintained alongside code changes?
  3. Beginner friendliness - are concepts explained or just listed?
  4. Troubleshooting section - does it help with common errors?
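If each check returns a 0-1 score, combining the four dimensions into a single rating is simple weighted averaging. Sketch below - the weights are illustrative defaults, not values we actually tuned:

```javascript
// Combine per-dimension scores (each 0-1) into one weighted rating.
// Weights are illustrative, not validated values.
const WEIGHTS = {
  exampleCoverage: 0.35,
  updateFreshness: 0.25,
  beginnerFriendliness: 0.2,
  troubleshooting: 0.2,
};

function overallScore(scores) {
  let total = 0;
  for (const [dim, weight] of Object.entries(WEIGHTS)) {
    if (!(dim in scores)) throw new Error(`missing dimension: ${dim}`);
    total += scores[dim] * weight;
  }
  return Math.round(total * 100); // 0-100 rating for the registry UI
}

console.log(overallScore({
  exampleCoverage: 0.8,
  updateFreshness: 0.5,
  beginnerFriendliness: 1.0,
  troubleshooting: 0.25,
}));
```

Throwing on a missing dimension (rather than defaulting to zero) keeps a broken check from silently tanking a package's rating.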

We use a combination of GPT-4 for the qualitative analysis and some simple heuristics for quantitative measures (like checking if example code actually runs).
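For the "does the example code actually run" heuristic, even a pure syntax check (parse, don't execute) catches a lot. Something like this - the fence string is built dynamically just to keep this snippet self-contained:

```javascript
// Extract fenced js code blocks from a README and check each one for
// syntax errors by parsing it, without executing it.
const FENCE = "`".repeat(3); // built dynamically to avoid literal fences here

function extractJsBlocks(markdown) {
  const re = new RegExp(FENCE + "(?:js|javascript)\\n([\\s\\S]*?)" + FENCE, "g");
  const blocks = [];
  let m;
  while ((m = re.exec(markdown)) !== null) blocks.push(m[1]);
  return blocks;
}

function checkExamples(markdown) {
  return extractJsBlocks(markdown).map((code) => {
    try {
      new Function(code); // parses the code but does not run it
      return { ok: true };
    } catch (err) {
      return { ok: false, error: err.message };
    }
  });
}

const readme =
  "## Usage\n" +
  FENCE + "js\nconst x = add(1, 2);\n" + FENCE + "\n" +
  FENCE + "js\nconst y = add(1, ;\n" + FENCE + "\n"; // second example is broken

console.log(checkExamples(readme).map((r) => r.ok)); // [ true, false ]
```

This only catches syntax errors, not examples that reference APIs that no longer exist - that's where the GPT-4 pass earns its keep.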

One thing we found useful was comparing documentation between similar packages. When choosing between alternatives, the documentation quality differences are often more important than minor feature differences.

I implemented a documentation quality assessment system for our team last year that’s significantly improved our package selection process.

What we found most effective was a multi-dimensional scoring approach rather than a simple good/bad rating. We evaluate packages on:

  1. Comprehensiveness - Does it cover all features and functions?
  2. Example quality - Are the examples realistic and do they work?
  3. Maintenance - Are docs updated with new releases?
  4. Troubleshooting - Does it help solve common issues?

The technical implementation uses a combination of LLMs for analysis and some rule-based checks. For example, we automatically verify that code examples actually compile and run - you’d be surprised how many don’t.

We also incorporated community metrics like Stack Overflow question frequency and GitHub discussions about the package. These often reveal documentation gaps that automated analysis might miss.

I’ve built several documentation evaluation systems and found that different AI models excel at detecting different types of documentation issues.

GPT-4 is excellent at evaluating the logical structure and completeness of documentation, while Claude tends to be better at identifying unclear explanations and missing context. Smaller models like Cohere’s Command model are surprisingly effective at targeted tasks like checking for working code examples.

For implementation, the key is to design specific, targeted prompts rather than asking for a general quality assessment. For example:

  • “Identify missing method parameters in the API reference section”
  • “Evaluate whether code examples match the current API syntax”
  • “Locate sections where implementation details are mentioned but not explained”
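To keep those targeted checks consistent across packages, we template them into prompts with a small helper. A sketch (the check list just mirrors the bullets above; the surrounding prompt wording is illustrative):

```javascript
// Build targeted review prompts instead of one vague "rate these docs"
// request. The check descriptions mirror the examples above.
const TARGETED_CHECKS = [
  "Identify missing method parameters in the API reference section.",
  "Evaluate whether code examples match the current API syntax.",
  "Locate sections where implementation details are mentioned but not explained.",
];

function buildPrompts(packageName, docs) {
  return TARGETED_CHECKS.map(
    (check) =>
      `Package: ${packageName}\nTask: ${check}\n` +
      `Answer with specific section references.\n\nDocumentation:\n${docs}`
  );
}

const prompts = buildPrompts("left-pad", "# left-pad\n...");
console.log(prompts.length); // 3
```

One prompt per check also makes the model outputs easy to diff between package versions.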

Another valuable approach is comparative analysis. When evaluating packages with similar functionality, have the AI models do a side-by-side comparison of documentation quality, which often produces more insightful results than individual evaluations.
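The comparative pass can literally be one prompt with both docs side by side - roughly like this (the package names and dimension list are placeholders):

```javascript
// Build a side-by-side documentation comparison prompt for two
// candidate packages. Dimension names are illustrative.
function comparisonPrompt(a, b) {
  return [
    "Compare the documentation of these two npm packages.",
    "For each dimension (completeness, examples, freshness, troubleshooting),",
    "say which package is stronger and why, then pick an overall winner.",
    "",
    `## Package A: ${a.name}`,
    a.docs,
    "",
    `## Package B: ${b.name}`,
    b.docs,
  ].join("\n");
}

const prompt = comparisonPrompt(
  { name: "pkg-one", docs: "# pkg-one\n..." },
  { name: "pkg-two", docs: "# pkg-two\n..." }
);
console.log(prompt.includes("Package A: pkg-one")); // true
```

Grading both candidates in one context forces the model to commit to relative judgments instead of giving each package a generically positive review.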

We built a simple checker that rates docs on examples, API coverage, and age. GPT-4 works best for this. The biggest win was checking whether examples actually run - so many don’t!

Try doc2vec + similarity clustering

This topic was automatically closed 24 hours after the last reply. New replies are no longer allowed.