How to process database text content with langchain text splitters

Stella_Dreamer · August 6, 2025, 9:36am

I’m new to working with langchain and trying to figure out how to properly handle text data from my database. I have a MySQL table containing text content that needs to be processed and split into smaller chunks.

When I try to use the text splitter with data from my database, I keep running into issues with the document structure. I’m not sure if I’m formatting the document objects correctly for the splitter to work with.

Here’s what I’m attempting to do:

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 500,
    chunkOverlap: 100
});

// My table structure: CREATE TABLE articles (article_id INT PRIMARY KEY, content TEXT);
const results = database.query('SELECT article_id, content FROM articles').fetchAll();

for (const result of results) {
   const documents = [{ metadata: result.article_id, pageContent: result.content }];
   const splitChunks = await textSplitter.splitDocuments(documents);
}

// Getting this error:
// TypeError: Cannot read properties of undefined (reading 'loc')

What’s the correct way to structure the document objects when working with database text content? Am I missing something in the metadata setup?

Hermione_Book · August 16, 2025, 1:18pm

That error happens because langchain can’t track document locations when you pass plain objects to splitDocuments. The splitter loses its location references and throws the undefined ‘loc’ error.

I hit the same issue with PostgreSQL data. You need proper document structure and have to handle null/empty content from your DB queries.

Filter out empty content first:

const validResults = results.filter(result => result.content && result.content.trim().length > 0);

const documents = validResults.map(result => ({
    pageContent: result.content.trim(),
    metadata: { article_id: result.article_id, source: 'database' }
}));

Empty or null content will break the splitter in weird ways. I’d also add a source identifier in metadata - helps track chunk origins when you’re mixing data sources.

sofia_scribbles · August 15, 2025, 11:20am

Database text splitting is a nightmare - formatting rules and encoding issues everywhere. I’ve dealt with this exact mess on multiple projects.

Your code’s close, but langchain forces you through all these Document class hoops and metadata formatting. You’ll also hit memory problems later when processing big tables.

I used to waste hours debugging splitter errors until I found automation tools do this way better. Now I just build a workflow that pulls from MySQL, chunks text however I want, and spits out clean results.

No import headaches, no document class BS, no encoding problems. Connect your database, set chunk parameters, done. Way less code to babysit.

It handles batching automatically so you don’t worry about memory with huge tables. Everything processes smooth without those weird location tracking errors.

Saved me tons of debugging time on similar stuff. Check it out: https://latenode.com

pixelPilot · August 15, 2025, 6:38am

Been there with langchain quirks. Your document structure looks fine, but that error means the text splitter wants a specific Document class instance, not plain objects.

Import the Document class:

import { Document } from "@langchain/core/documents";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const documents = results.map(result => 
    new Document({
        pageContent: result.content,
        metadata: { article_id: result.article_id }
    })
);

const splitChunks = await textSplitter.splitDocuments(documents);

Honestly, I ditched langchain after getting tired of setup headaches. Now I use Latenode for all my text processing workflows.

Just set up a workflow that connects to MySQL, processes text with whatever splitter you need, and spits out clean chunks. No import issues or document formatting problems.

Way cleaner than managing dependencies and class structures manually. Check it out: https://latenode.com

CreativeArtist88 · August 13, 2025, 7:24pm

I get what you’re saying! Could be the async part tho. Had similar issues with big database dumps where Slick’s timing caused the splitter to fail. Add error handling to your db call and let each document build properly before splitting. That might fix it.

DancingFox · August 13, 2025, 8:29am

This happens because langchain’s document chunking doesn’t play nice with database content. You need clean UTF-8 strings and proper formatting before hitting the splitter.

I’ve hit this exact issue with a legacy database where some fields had encoding problems. The splitter expects clean UTF-8, but database text often has weird characters or encoding artifacts.

Try normalizing your content first:

const documents = results.map(result => ({
    pageContent: result.content.toString().normalize('NFD'),
    metadata: { article_id: result.article_id }
}));

Also check your MySQL connection charset settings. Mixed encodings confuse the splitter when it’s calculating character positions for chunks. I switched to utf8mb4 collation and it fixed the location tracking errors.

One more thing - make sure your content field doesn’t have null bytes or control characters. They mess with the text processing logic.

avaw · August 12, 2025, 10:05am

That error happens because the text splitter can’t track document locations properly. Your metadata structure is the problem.

Change this:

const documents = [{ metadata: result.article_id, pageContent: result.content }];

To this:

const documents = [{ metadata: { article_id: result.article_id }, pageContent: result.content }];

Metadata needs to be an object, not a direct value. Pass a primitive value and the splitter’s internal tracking gets confused.

I hit the same issue last year processing support tickets from our database. Spent hours debugging before I realized the metadata wasn’t structured right.

Also, you’re creating a new array for each database row. Process everything at once instead:

const documents = results.map(result => ({
    pageContent: result.content,
    metadata: { article_id: result.article_id }
}));

const splitChunks = await textSplitter.splitDocuments(documents);

This splits everything in one operation instead of calling the splitter multiple times.