chunk_id always "0" in vector search citations (Azure Cognitive Search with chunked documents)

I'm using Azure Cognitive Search with a skillset pipeline that chunks documents and generates embeddings. When I call the chat completions endpoint it returns citations, but there's a significant problem with the chunk identifiers.

Issue Description:
With standard keyword searches across complete documents, the chunk_id in each citation is accurate. With vector searches over documents my skillset has split into chunks, however, every citation comes back with chunk_id set to "0", which makes it impossible to tell which chunk a citation refers to.

Working Example of Keyword Search Result:

{
  "message": {
    "role": "assistant", 
    "content": "Here’s the data [reference1].",
    "context": {
      "citations": [{
        "content": "sample content here",
        "title": "Document_Name_2023.pdf",
        "url": "https://example.blob.core.windows.net/files/Document_Name_2023.pdf",
        "filepath": "/files/Document_Name_2023.pdf",
        "chunk_id": "15"
      }]
    }
  }
}

Example of Vector Search Result with Issues:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Based on the content shared...",
      "context": {
        "citations": [
          {
            "content": "first chunk content",
            "title": "Report_2023.pdf",
            "url": "https://storage.blob.core.windows.net/docs/Report_2023.pdf",
            "chunk_id": "0"
          },
          {
            "content": "second chunk content", 
            "title": "Report_2023.pdf",
            "url": "https://storage.blob.core.windows.net/docs/Report_2023.pdf",
            "chunk_id": "0"
          }
        ]
      }
    }
  }]
}

In the examples above, both citations list chunk_id as "0", even though they come from different chunks of the same document.

What I’m Seeking:
I need either distinct chunk_id values for each citation, or a way to modify the title field to include the chunk identifier, e.g. "Report_2023.pdf - section 1" and "Report_2023.pdf - section 2".

Structure of My API Request:

{
  "data_sources": [{
    "type": "azure_search",
    "parameters": {
      "endpoint": "https://mysearch.search.windows.net",
      "authentication": {"type": "api_key", "key": "mykey"},
      "index_name": "documents_index",
      "topNDocuments": 5,
      "query_type": "vectorSimpleHybrid",
      "embeddingEndpoint": "https://myopenai.openai.azure.com/openai/deployments/embeddings/embeddings"
    }
  }],
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What information can you find?"}
  ],
  "temperature": 0.8,
  "max_tokens": 1500
}

Is there a configuration option I might have overlooked? My only idea so far is to write a custom Azure Function, but that feels excessive for such a basic problem.

This chunk_id issue happens because Azure’s vector search pipeline doesn’t sync properly with your enrichment output. I’ve hit this same problem building document processing workflows.

You’re juggling too many pieces - skillset, indexer, vector embeddings, and search config all need perfect sync. Change one thing and everything breaks.

Skip wrestling with Azure’s complex pipeline and automate the whole workflow instead. Build a system that processes documents, creates chunks with unique IDs, generates embeddings, and handles search logic in one flow. You’ll control exactly how chunk_id gets assigned and stays consistent.

I’ve built similar automations that take PDFs, split them into sections, assign proper identifiers, create embeddings, and store everything with consistent metadata. No more wondering why vector search returns different data than keyword search.

The automation handles document upload, chunking with proper ID assignment, embedding generation, plus builds a custom search endpoint that always returns correct chunk_id values. Way cleaner than debugging Azure’s black box.

You can build this document processing workflow at https://latenode.com

This looks like your indexer processes chunks differently than how vector search retrieves them. Azure Cognitive Search with vectorSimpleHybrid queries sometimes pulls from raw indexed content instead of the enriched output that carries the proper chunk identifiers. I've seen this before when my index had both original document fields and enriched chunk fields, but vector search defaulted to the wrong source.

Check that your vector field configuration points to the same enriched content that has proper chunk_id values. You might need to explicitly map chunk_id in your index definition so it uses the skillset output instead of defaulting to the base document.

Also verify whether your embedding generation happens on chunked content or on the full document - if it's the full document, that's probably why chunk identification gets lost in vector searches.
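One concrete knob along these lines, if you're on the "On Your Data" chat API: the azure_search parameters accept a fields_mapping block that pins down which index fields feed the citations. This is a sketch, not a verified config - the snake_case names follow recent API versions, and "content_vector" is a placeholder for whatever your vector field is actually called. Note there's no dedicated chunk_id entry here, so a common workaround is pointing title_field at an index field that already embeds the chunk number:

```json
{
  "fields_mapping": {
    "content_fields": ["content"],
    "title_field": "title",
    "url_field": "url",
    "filepath_field": "filepath",
    "vector_fields": ["content_vector"]
  }
}
```

If this block is absent, the service guesses which fields to surface, and the guess can differ between the keyword and vector retrieval paths.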

Dealt with this nightmare before. Your embedding field and chunk_id field aren’t coming from the same enrichment step in your skillset.

Here’s what’s happening: SplitSkill creates chunks with proper IDs, but your embedding skill processes text without keeping those chunk identifiers. Vector search finds the right content but can’t map back to the original chunk_id.

Here’s what fixed it for me:

Make sure your embedding skill reads from the same enrichment path that carries the chunk IDs. (Note: the built-in SplitSkill only emits a textItems output, so per-chunk IDs generally have to come from a custom skill or from index projections.) Conceptually, you want something like:

{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "name": "split-text",
  "context": "/document",
  "textSplitMode": "pages",
  "inputs": [
    {"name": "text", "source": "/document/content"}
  ],
  "outputs": [
    {"name": "textItems", "targetName": "pages"},
    {"name": "itemIds", "targetName": "chunk_ids"}
  ]
}

Then your embedding skill needs to run once per chunk, with its context scoped to the split output, so each vector stays paired with the chunk it came from:

{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "context": "/document/pages/*",
  "inputs": [
    {"name": "text", "source": "/document/pages/*"}
  ],
  "outputs": [
    {"name": "embedding", "targetName": "vector"}
  ]
}

Also check your index field mappings. Make sure chunk_id comes from the enriched output, not the source document. I had to explicitly map it like "chunk_id": "/document/enriched/chunk_ids" instead of letting it auto-map.

Vector search and keyword search use different code paths in Azure, so they can pull from different field sources if your mappings aren’t explicit.
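The explicit mapping lives on the indexer, not the index schema itself. A minimal sketch - the source path is an assumption matching the chunk_ids targetName used in the split step of this answer, and the wildcard form assumes your index stores one chunk per search document (or a collection-typed field otherwise):

```json
{
  "outputFieldMappings": [
    { "sourceFieldName": "/document/chunk_ids/*", "targetFieldName": "chunk_id" }
  ]
}
```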

Had this exact issue a few months ago. The problem's in your skillset configuration during document processing. The chunk_id gets populated when indexing happens, not during queries - if your text splitting skill isn't assigning unique IDs to each chunk, they all just default to "0".

Check your skillset definition and make sure your SplitSkill or custom chunking skill actually outputs proper chunk identifiers. I had to modify mine to generate sequential IDs for each text segment. Also verify your index schema maps the chunk_id field correctly from the skillset output.

This usually comes from the enrichment pipeline, not the search config - that's why keyword search works but vector search doesn't. They're pulling from different processed versions of your docs.
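If your splitter can't be made to emit IDs, the sequential-ID generation described above usually ends up in a custom skill. A sketch of the skillset entry - the function URI is a placeholder, and the input/output names are assumptions, not a prescribed contract:

```json
{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "uri": "https://your-function.azurewebsites.net/api/assign-chunk-ids",
  "context": "/document",
  "inputs": [
    { "name": "text", "source": "/document/content" }
  ],
  "outputs": [
    { "name": "chunks", "targetName": "pages" },
    { "name": "chunkIds", "targetName": "chunk_ids" }
  ]
}
```

The function itself just splits the text it receives and returns each segment alongside a sequential index, which then flows into the enrichment tree under the targetName paths.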

Check if your index has separate fields for vector content vs. metadata. Azure sometimes creates duplicate fields during indexing - one with chunk_id, one without - and your vector search might be hitting the wrong field.

Run a query directly against your index to see what fields actually exist for each chunk. I bet you'll find multiple versions of the same content with different metadata.

Also, try adding a "select" parameter to your search request to force it to return the chunk_id field explicitly.
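For the direct-query check, you can POST to https://mysearch.search.windows.net/indexes/documents_index/docs/search?api-version=2023-11-01 (service and index names taken from the question; substitute your own api-version if needed) with a body like this - the select forces the metadata fields into the response so you can see what each chunk actually stores:

```json
{
  "search": "*",
  "select": "chunk_id, title, filepath",
  "top": 10
}
```

If chunk_id comes back empty or "0" here too, the problem is in the indexing pipeline, not in how the chat endpoint queries it.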