Compact Search Indexing Library/API Recommendation

I am looking for a lightweight, open-source search indexing library that fits easily into an embedded web application. It should ideally be written in C, C++, or PHP, and it must not require a database installation for storing indexes. I would prefer the indexes to be kept in a simple file format such as XML or TXT. I considered well-known options like Xapian and CLucene, but they seem too large for my embedded system. The final solution will run on Linux and will index HTML documents. I would appreciate any suggestions for a suitable library or API. Thank you!

Given your requirements, Woozzy could be a good fit. It is less widely known than Xapian or CLucene, but it is a small, open-source search indexing library written in C++. It stores indexes in plain text files, which keeps it lightweight and suits embedded systems with limited resources.

It can also process HTML documents and extract the relevant text for indexing without depending on an external database, which matches your preference for file-based indexes in formats like XML or TXT.


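// Example of Woozzy usage (illustrative; check the library headers
// for the exact class and method names)
#include <woozzy/woozzy.h>

int main() {
    woozzy::Index index;
    // Add one HTML document under the name "doc1.html"
    index.addDocument("doc1.html", "<html>...Document content...</html>");
    // Write the index to a plain text file
    index.saveToFile("index.txt");
    // Further indexing or search calls would go here
    return 0;
}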

Woozzy's codebase is minimal and covers the essential indexing tasks without unnecessary overhead, so it should integrate easily into a Linux-based project that indexes HTML documents. Because it is open source, you can also read the code and adapt it to any additional needs.
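To give a sense of how little code a file-based index needs, here is a minimal sketch in standard C++ that does not use Woozzy or any other library. It strips tags naively, lowercases the words, builds an inverted index in memory, and writes one `term<TAB>doc1,doc2` line per term. The names `stripTags`, `tokenize`, and `InvertedIndex`, as well as the file format, are assumptions made for illustration. Real HTML needs a proper parser, since entities, scripts, and comments are not handled here.

```cpp
#include <cctype>
#include <fstream>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Naive tag stripper: drops everything between '<' and '>'.
// Each tag is replaced by a space so adjacent words do not merge.
std::string stripTags(const std::string& html) {
    std::string text;
    bool inTag = false;
    for (char c : html) {
        if (c == '<') { inTag = true; text += ' '; }
        else if (c == '>') { inTag = false; }
        else if (!inTag) { text += c; }
    }
    return text;
}

// Splits text into lowercase alphanumeric tokens.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Hypothetical in-memory inverted index: term -> set of document names.
class InvertedIndex {
public:
    void addDocument(const std::string& name, const std::string& html) {
        for (const std::string& term : tokenize(stripTags(html))) {
            postings_[term].insert(name);
        }
    }

    // Returns the documents containing the term (empty set if none).
    std::set<std::string> lookup(const std::string& term) const {
        auto it = postings_.find(term);
        return it == postings_.end() ? std::set<std::string>{} : it->second;
    }

    // Writes one "term<TAB>doc1,doc2" line per term, sorted by term.
    void saveToFile(const std::string& path) const {
        std::ofstream out(path);
        if (!out) throw std::runtime_error("cannot open " + path);
        for (const auto& entry : postings_) {
            out << entry.first << '\t';
            bool first = true;
            for (const std::string& doc : entry.second) {
                if (!first) out << ',';
                out << doc;
                first = false;
            }
            out << '\n';
        }
    }

private:
    std::map<std::string, std::set<std::string>> postings_;
};
```

With this sketch, calling `addDocument("doc1.html", "<p>Hello world</p>")` and then `saveToFile("index.txt")` writes two lines, one for `hello` and one for `world`, each followed by a tab and `doc1.html`. A `std::map` keeps the terms sorted, so the file is easy to inspect or diff by hand, at the cost of holding the whole index in memory.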

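You might want to look into Sphinx. It is an efficient full-text search engine written in C++, and it does not need a separate database: it keeps its indexes in its own files on disk (a binary format rather than plain text) and can read documents from an XML stream through its xmlpipe2 source type instead of an SQL source. It normally runs as a separate search daemon, so check that its footprint fits your device. Alternatively, OpenKeyval in PHP could be used if you are working only with key-value pairs and prefer simple file-based storage. Keep in mind that it is a key-value store rather than a search indexer, so any term-to-document index would have to be built on top of it. Both run on Linux.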
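If a plain key-value mapping is all that is needed, a file-backed store can be sketched in a few lines of standard C++. The example below is not Sphinx or OpenKeyval code. The `key<TAB>value` line format and the names `Store`, `saveStore`, and `loadStore` are assumptions made for illustration.

```cpp
#include <fstream>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical store type: an ordered map of string keys to string values.
using Store = std::map<std::string, std::string>;

// Writes each pair as "key<TAB>value" on its own line.
// Keys must not contain tabs; keys and values must not contain newlines.
void saveStore(const Store& store, const std::string& path) {
    std::ofstream out(path);
    if (!out) throw std::runtime_error("cannot open " + path);
    for (const auto& kv : store) {
        out << kv.first << '\t' << kv.second << '\n';
    }
}

// Reads the file back into memory; lines without a tab are skipped.
Store loadStore(const std::string& path) {
    std::ifstream in(path);
    if (!in) throw std::runtime_error("cannot open " + path);
    Store store;
    std::string line;
    while (std::getline(in, line)) {
        const auto tab = line.find('\t');
        if (tab == std::string::npos) continue;
        store[line.substr(0, tab)] = line.substr(tab + 1);
    }
    return store;
}
```

A store like this only answers exact-key lookups. To search HTML documents with it, the keys would have to be terms and the values lists of document names, which is effectively a hand-built inverted index.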