I’m using Azure Cognitive Search with an indexing pipeline that splits documents into chunks and generates embeddings for them. When I call the chat completions endpoint, it returns citations, but the chunk identifiers in those citations are wrong.
Issue Description:
With a standard keyword search over complete documents, the chunk_id is populated correctly. With a vector search over documents that my skillset splits into chunks, every citation comes back with chunk_id set to “0”, so I can’t tell which chunk a citation actually refers to.
Working Example of Keyword Search Result:
{
  "message": {
    "role": "assistant",
    "content": "Here’s the data [reference1].",
    "context": {
      "citations": [{
        "content": "sample content here",
        "title": "Document_Name_2023.pdf",
        "url": "https://example.blob.core.windows.net/files/Document_Name_2023.pdf",
        "filepath": "/files/Document_Name_2023.pdf",
        "chunk_id": "15"
      }]
    }
  }
}
Example of Vector Search Result with Issues:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Based on the content shared...",
      "context": {
        "citations": [
          {
            "content": "first chunk content",
            "title": "Report_2023.pdf",
            "url": "https://storage.blob.core.windows.net/docs/Report_2023.pdf",
            "chunk_id": "0"
          },
          {
            "content": "second chunk content",
            "title": "Report_2023.pdf",
            "url": "https://storage.blob.core.windows.net/docs/Report_2023.pdf",
            "chunk_id": "0"
          }
        ]
      }
    }
  }]
}
In the vector search example above, both citations have chunk_id set to “0”, even though they come from different chunks of the same document.
What I’m Seeking:
I need either a distinct chunk_id value for each citation, or a way to customize the title field so that it includes a chunk identifier, such as “Report_2023.pdf - section 1” and “Report_2023.pdf - section 2”.
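To make the second option concrete, here is a minimal sketch of the relabeling I have in mind, done client-side after the response arrives. It assumes the response has the shape shown in the vector search example above; the function name label_citations is my own and not part of any SDK. Note that the section number here is only the order in which citations for the same document appear in the response, not the chunk's real position in the document, which is exactly why I'd prefer a proper chunk_id.

```python
from collections import defaultdict
from copy import deepcopy


def label_citations(response: dict) -> dict:
    """Return a copy of the response whose citation titles carry a per-document counter.

    Hypothetical helper: the counter reflects the order of citations in the
    response, NOT the chunk's true position in the source document.
    """
    labeled = deepcopy(response)  # leave the caller's response untouched
    for choice in labeled.get("choices", []):
        context = choice.get("message", {}).get("context", {})
        seen = defaultdict(int)  # citations seen so far, per source document
        for citation in context.get("citations", []):
            # Group by URL when present, otherwise fall back to the title.
            key = citation.get("url") or citation.get("title", "")
            seen[key] += 1
            title = citation.get("title", "untitled")
            citation["title"] = f"{title} - section {seen[key]}"
    return labeled


if __name__ == "__main__":
    sample = {
        "choices": [{
            "message": {
                "context": {
                    "citations": [
                        {"title": "Report_2023.pdf", "url": "u", "chunk_id": "0"},
                        {"title": "Report_2023.pdf", "url": "u", "chunk_id": "0"},
                    ]
                }
            }
        }]
    }
    for c in label_citations(sample)["choices"][0]["message"]["context"]["citations"]:
        print(c["title"])
```

This only disambiguates the display; it does not recover which chunk of the index each citation came from.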
Structure of My API Request:
{
  "data_sources": [{
    "type": "azure_search",
    "parameters": {
      "endpoint": "https://mysearch.search.windows.net",
      "authentication": {"type": "api_key", "key": "mykey"},
      "index_name": "documents_index",
      "topNDocuments": 5,
      "query_type": "vectorSimpleHybrid",
      "embeddingEndpoint": "https://myopenai.openai.azure.com/openai/deployments/embeddings/embeddings"
    }
  }],
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What information can you find?"}
  ],
  "temperature": 0.8,
  "max_tokens": 1500
}
Is there a configuration setting I might have overlooked? My only idea so far is to write a custom Azure Function, but that feels excessive for something this basic.