RAG vector search with Azure Cognitive Search returns incorrect chunk identifiers in chat completion responses

I’m working with Azure Cognitive Search and have set up a search index whose skillset splits documents into smaller chunks and creates vector embeddings for them. Calls to the chat completions endpoint return responses with citations, but there’s a major issue with how chunks are identified.

The problem is that when using vector search on chunked documents, all citations come back with chunk_id set to “0” instead of unique identifiers. This makes it impossible to provide meaningful citation references to users.

Here’s what happens with regular keyword search (works correctly):

{
  "response": {
    "role": "assistant",
    "text": "Here is the answer [ref1].",
    "context": {
      "references": [{
        "text": "relevant content here",
        "document": "Manual Guide - 15-08-2023.pdf",
        "source": "https://mysite.sharepoint.com/docs/Manual%20Guide.pdf",
        "path": "/documents/Manual Guide - 15-08-2023.pdf",
        "chunk_id": "15"
      }]
    }
  }
}

But with vector search on chunked content (the problem):

{
  "response": {
    "role": "assistant", 
    "text": "Based on the documents provided...",
    "context": {
      "references": [
        {
          "text": "first chunk content",
          "document": "Policy Manual 2023.pdf",
          "source": "https://storage.blob.core.windows.net/files/Policy%20Manual%202023.pdf",
          "path": null,
          "chunk_id": "0"
        },
        {
          "text": "second chunk content", 
          "document": "Policy Manual 2023.pdf",
          "source": "https://storage.blob.core.windows.net/files/Policy%20Manual%202023.pdf",
          "path": null,
          "chunk_id": "0"
        }
      ]
    }
  }
}

Both chunks show chunk_id as “0” even though they’re different sections. I need either unique chunk identifiers or a way to modify the document titles to show which part they represent.

Is there a configuration option I’m missing, or do I really need to create a custom web skill with Azure Functions just for this basic functionality?

Yeah, this is a known issue with Azure Cognitive Search vector indexing. The chunk_id field gets wiped during vectorization because Azure treats each chunk as its own document instead of tracking parent-child relationships. I hit this same problem last year building a document Q&A system.

My workaround was tweaking the document splitting stage to inject custom metadata before vectorization happens. Instead of relying on chunk_id, I created a custom field called “section_identifier” that combines the document name with a sequence number during text splitting. You can do this by adding a custom skill that processes chunks before they get vectorized: the skill assigns each chunk a unique ID based on its position in the document and saves it as searchable metadata. Your citations end up showing something useful like “Policy_Manual_2023_Section_3” instead of just zeros.

You could also try the integrated vectorization preview features, but they’re still in beta and have their own problems. The custom metadata approach has worked great in my production systems.
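To make the custom-skill idea concrete, here’s a minimal sketch of the handler logic, not a full Azure Function. It follows the custom Web API skill’s JSON contract (a “values” array of records in, the same shape out); the “document_name” and “chunk_ordinal” input names and the “section_identifier” output are assumptions you’d wire up through the skill’s inputs/outputs in your skillset.

```python
import json

def make_section_ids(payload: dict) -> dict:
    """For each record in a custom Web API skill request, build a
    'section_identifier' from the parent document name and the chunk's
    position. Input and output follow the skill's JSON contract."""
    results = []
    for record in payload.get("values", []):
        data = record.get("data", {})
        doc_name = data.get("document_name", "unknown")  # assumed input field
        ordinal = data.get("chunk_ordinal", 0)           # assumed input field
        # "Policy Manual 2023.pdf" -> "Policy_Manual_2023"
        base = doc_name.rsplit(".", 1)[0].replace(" ", "_")
        results.append({
            "recordId": record["recordId"],
            "data": {"section_identifier": f"{base}_Section_{ordinal + 1}"},
            "errors": None,
            "warnings": None,
        })
    return {"values": results}

# A request shaped like what the indexer would POST to the skill endpoint:
request = {"values": [
    {"recordId": "r1", "data": {"document_name": "Policy Manual 2023.pdf", "chunk_ordinal": 0}},
    {"recordId": "r2", "data": {"document_name": "Policy Manual 2023.pdf", "chunk_ordinal": 1}},
]}
response = make_section_ids(request)
print(json.dumps(response["values"][1]["data"]))
```

You’d map section_identifier into a searchable, retrievable index field so it comes back with each citation.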

Hit this exact problem 8 months ago building a compliance doc search system. Azure’s chunking process drops the chunk metadata during vector embedding creation.

Azure’s text splitting skill makes chunks but kills the parent document relationships when vectorizing. Everything gets flattened and you lose position metadata.

Here’s what actually worked without custom functions:

Tweak your text splitting configuration so position data gets embedded right into the chunk content or title. Set up your skillset to append chunk numbers to the document field during processing.

Instead of “Policy Manual 2023.pdf”, your chunks get indexed as “Policy Manual 2023.pdf [Part 1]”, “Policy Manual 2023.pdf [Part 2]”, etc.

Do this at the skillset level before vectorization, not after. Add a skill that derives each chunk’s position from the character offsets the splitting skill emits.

This keeps your existing pipeline intact and gives users meaningful citations without depending on chunk_id. It’s way simpler than rebuilding everything or relying on Azure’s preview features, which aren’t production ready.
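The title-tagging step above can be sketched in a few lines. This is just the labeling logic, assuming you already have the split chunks in hand (in the real pipeline the tagging would live in a skill between splitting and vectorization):

```python
def label_chunks(doc_title: str, chunks: list[str]) -> list[dict]:
    """Tag each text chunk with its position so citations read
    'Policy Manual 2023.pdf [Part 1]' instead of relying on chunk_id."""
    return [
        {"title": f"{doc_title} [Part {i + 1}]", "content": chunk}
        for i, chunk in enumerate(chunks)
    ]

labeled = label_chunks("Policy Manual 2023.pdf",
                       ["first chunk content", "second chunk content"])
print(labeled[1]["title"])  # Policy Manual 2023.pdf [Part 2]
```

The labeled title then flows through to the “document” field in the citation payload, so users can tell Part 1 from Part 2 even when chunk_id is useless.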

Been there, done that. Azure Cognitive Search chunk ID issues are a nightmare for citation systems.

Azure’s chunking doesn’t create meaningful IDs for vector embeddings. You get stuck with generic “0” values because it treats each chunk separately without proper indexing.

Skip the custom Azure Functions and complex skillsets. Move this workflow to Latenode instead. You’ll get a cleaner solution that handles document chunking, vector embeddings, and chunk identification in one flow.

With Latenode:

  • Process documents and create custom chunk IDs with your own logic
  • Generate embeddings using any provider
  • Store everything with proper metadata
  • Build search and citation exactly how you want

I’ve built RAG systems this way and chunk tracking works perfectly since you control the entire pipeline. No more fighting Azure’s limitations or paying for expensive cognitive services.

The automation handles document ingestion to response formatting, giving you meaningful chunk IDs that actually help users find source content.

Ugh, this bug drove me nuts for weeks! Azure’s default chunking mangles chunk metadata during vector indexing. Here’s a quick fix: modify your skillset to add a custom field before vectorization. Just concatenate the document name and chunk position, like “documentName_chunk_1”, “documentName_chunk_2”. Way easier than custom functions, and it works with what you’ve got.
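A minimal version of that concatenation, with a naive fixed-size splitter standing in for Azure’s Text Split skill (the splitter, the 40-character size, and the field names are all illustrative):

```python
def split_and_tag(doc_name: str, text: str, size: int = 40) -> list[dict]:
    """Naive fixed-size splitter standing in for the Text Split skill;
    each chunk carries a 'documentName_chunk_N' identifier built from
    the document name and the chunk's ordinal position."""
    base = doc_name.rsplit(".", 1)[0].replace(" ", "_")
    return [
        {"id": f"{base}_chunk_{i + 1}", "text": text[start:start + size]}
        for i, start in enumerate(range(0, len(text), size))
    ]

parts = split_and_tag("Policy Manual 2023.pdf", "some long document text " * 5)
print([p["id"] for p in parts])
```

Whatever splitter you actually use, the point is the same: assign the identifier at split time, before vectorization flattens the parent-child relationship.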