Maintaining file metadata when using RecursiveCharacterTextSplitter in langchain

I’m working with langchain’s RecursiveCharacterTextSplitter to break down Python source files into smaller pieces. The problem I’m running into is that once the splitting happens, I can’t tell which text chunk came from which original file.

Is there a way to preserve the source file information so I can map each chunk back to its original filename? I need to maintain this connection for my downstream processing.

import os

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter


def process_repository(git_url):
    os.environ["OPENAI_API_KEY"] = "your-key-here"

    file_contents = []
    supported_extensions = [".py"]

    print("downloading repository")
    local_path = download_repo(git_url)

    for root, dirs, files in os.walk(local_path):
        for filename in files:
            if filename.endswith(tuple(supported_extensions)):
                try:
                    file_path = os.path.join(root, filename)
                    with open(file_path, "r", encoding="utf-8") as file_obj:
                        file_contents.append(file_obj.read())
                except (OSError, UnicodeDecodeError):
                    # skip files that can't be read or decoded
                    continue

    # split content into chunks
    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON,
        chunk_size=5000,
        chunk_overlap=0
    )
    document_chunks = splitter.create_documents(file_contents)

    return document_chunks

You’re losing the connection between chunks and source files because you’re only passing content to create_documents().

I hit this same issue building a code analysis tool last year. Here’s what fixed it:

import os

from langchain.schema import Document
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter


def process_repository(git_url):
    os.environ["OPENAI_API_KEY"] = "your-key-here"

    documents = []
    supported_extensions = [".py"]

    print("downloading repository")
    local_path = download_repo(git_url)

    for root, dirs, files in os.walk(local_path):
        for filename in files:
            if filename.endswith(tuple(supported_extensions)):
                try:
                    file_path = os.path.join(root, filename)
                    with open(file_path, "r", encoding="utf-8") as file_obj:
                        content = file_obj.read()
                        # Create Document with metadata upfront
                        doc = Document(
                            page_content=content,
                            metadata={"source": file_path, "filename": filename}
                        )
                        documents.append(doc)
                except (OSError, UnicodeDecodeError):
                    # skip files that can't be read or decoded
                    continue

    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON,
        chunk_size=5000,
        chunk_overlap=0
    )

    # Split documents while preserving metadata
    document_chunks = splitter.split_documents(documents)

    return document_chunks

Use split_documents() instead of create_documents() and create Document objects with metadata before splitting. Each chunk inherits the metadata from its parent document.

Don’t forget to import Document from langchain.schema.
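If you want to see how that metadata inheritance works without pulling in langchain at all, here's a minimal sketch. The Document class and the splitting logic below are simplified stand-ins, not the real library code; they just illustrate that each chunk gets a copy of its parent document's metadata:

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # simplified stand-in for langchain's Document
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents(documents, chunk_size):
    # mimics the behavior described above: every chunk
    # carries a copy of its parent document's metadata
    chunks = []
    for doc in documents:
        text = doc.page_content
        for i in range(0, len(text), chunk_size):
            chunks.append(Document(
                page_content=text[i:i + chunk_size],
                metadata=dict(doc.metadata),  # copy, don't share
            ))
    return chunks


docs = [
    Document("def foo():\n    pass\n" * 10, {"source": "a.py"}),
    Document("def bar():\n    pass\n" * 10, {"source": "b.py"}),
]
chunks = split_documents(docs, chunk_size=50)
print({c.metadata["source"] for c in chunks})  # both sources survive splitting
```

The key detail is copying the metadata dict per chunk rather than sharing one reference, so mutating one chunk's metadata later can't corrupt its siblings.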

The issue arises because you’re only supplying raw strings to create_documents() without any accompanying source information. The splitter doesn’t have any way to identify the origin of each piece.

When I built a document search system, I ran into the same thing: you have to carry the file path information through the whole pipeline. Collecting the content alone isn't enough; you need to keep the associated metadata alongside it.

To resolve this, create a list of tuples that includes both the content and its metadata prior to splitting. Then use create_documents() with both pieces:

file_data = []
for root, dirs, files in os.walk(local_path):
    for filename in files:
        if filename.endswith(tuple(supported_extensions)):
            try:
                file_path = os.path.join(root, filename)
                with open(file_path, "r", encoding="utf-8") as file_obj:
                    content = file_obj.read()
                    # pair each file's content with its metadata
                    file_data.append((content, {"source": file_path}))
            except (OSError, UnicodeDecodeError):
                continue

# unzip into parallel sequences (assumes at least one file was read)
texts, metadatas = zip(*file_data)
document_chunks = splitter.create_documents(texts, metadatas=metadatas)

Now, each chunk retains its source file in the metadata, making it straightforward to trace back to the original file later.
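For the downstream processing the question mentions, tracing back usually means grouping chunks by their source. Here's a quick sketch of that, using plain dicts shaped like the chunk output (the file paths and contents are made up for illustration):

```python
from collections import defaultdict

# stand-in chunks shaped like the output of create_documents/split_documents
chunks = [
    {"page_content": "def foo(): ...", "metadata": {"source": "repo/a.py"}},
    {"page_content": "def bar(): ...", "metadata": {"source": "repo/b.py"}},
    {"page_content": "def baz(): ...", "metadata": {"source": "repo/a.py"}},
]


def group_by_source(chunks):
    # map each original file path to the list of chunk texts from it
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["metadata"]["source"]].append(chunk["page_content"])
    return dict(grouped)


grouped = group_by_source(chunks)
print(sorted(grouped))            # ['repo/a.py', 'repo/b.py']
print(len(grouped["repo/a.py"]))  # 2
```

With real langchain chunks you'd read `chunk.metadata["source"]` and `chunk.page_content` as attributes instead of dict keys, but the grouping logic is the same.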