I’m working with Azure Cognitive Search and have set up a search index whose skillset splits documents into smaller chunks and generates vector embeddings for them. When I call the chat completions endpoint, I get responses with citations, but there’s a major issue with chunk identification.
The problem is that when using vector search on chunked documents, all citations come back with chunk_id set to “0” instead of unique identifiers. This makes it impossible to provide meaningful citation references to users.
Here’s what happens with regular keyword search (works correctly):
{
  "response": {
    "role": "assistant",
    "text": "Here is the answer [ref1].",
    "context": {
      "references": [{
        "text": "relevant content here",
        "document": "Manual Guide - 15-08-2023.pdf",
        "source": "https://mysite.sharepoint.com/docs/Manual%20Guide.pdf",
        "path": "/documents/Manual Guide - 15-08-2023.pdf",
        "chunk_id": "15"
      }]
    }
  }
}
But with vector search on chunked content (the problem):
{
  "response": {
    "role": "assistant",
    "text": "Based on the documents provided...",
    "context": {
      "references": [
        {
          "text": "first chunk content",
          "document": "Policy Manual 2023.pdf",
          "source": "https://storage.blob.core.windows.net/files/Policy%20Manual%202023.pdf",
          "path": null,
          "chunk_id": "0"
        },
        {
          "text": "second chunk content",
          "document": "Policy Manual 2023.pdf",
          "source": "https://storage.blob.core.windows.net/files/Policy%20Manual%202023.pdf",
          "path": null,
          "chunk_id": "0"
        }
      ]
    }
  }
}
Both chunks show chunk_id as “0” even though they’re different sections. I need either unique chunk identifiers or a way to modify the document titles to show which part they represent.
Is there a configuration option I’m missing, or do I really need to create a custom web skill with Azure Functions just for this basic functionality?
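For context, the only stopgap I can think of is relabeling the references on the client side after the response comes back. Below is a minimal sketch of that idea, assuming the response has the shape shown above. The function name label_references and the added keys citation_id and display_title are hypothetical, not part of any Azure API. The part numbers only reflect the order of references in one response, not the chunk's real position in the source document, so this is not a substitute for proper chunk IDs from the index.

```python
from collections import defaultdict
from typing import Any


def label_references(references: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the references with a per-document part label added.

    Hypothetical helper: 'citation_id' and 'display_title' are made-up keys,
    and the numbering is based only on the order within this response.
    """
    seen: dict[str, int] = defaultdict(int)
    labeled = []
    for ref in references:
        # Fall back to a placeholder if the document name is missing or null.
        doc = ref.get("document") or "unknown document"
        seen[doc] += 1
        part = seen[doc]
        labeled.append({
            **ref,  # keep the original fields, including the service's chunk_id
            "citation_id": f"{doc}#{part}",
            "display_title": f"{doc} (part {part})",
        })
    return labeled


if __name__ == "__main__":
    refs = [
        {"text": "first chunk content", "document": "Policy Manual 2023.pdf", "chunk_id": "0"},
        {"text": "second chunk content", "document": "Policy Manual 2023.pdf", "chunk_id": "0"},
    ]
    for item in label_references(refs):
        print(item["display_title"])
```

This at least lets the UI distinguish two citations from the same file, but the labels are not stable across requests, which is why I would prefer a real per-chunk identifier coming from the index.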