CSV file embedding process extremely slow compared to PDF in LangChain - performance issue

I’m having a weird performance problem with LangChain embeddings. I managed to embed a large PDF file (around 400 pages) and it only took about 1-2 hours to complete. But now I’m trying to embed a CSV file that has roughly 40,000 rows with just a single column, and the system is telling me it will take around 24 hours to finish.

Here’s my current setup:

# imports for the snippet below (module paths may vary with your LangChain version)
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

embedding_model = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)

csv_file = 'customer_data.csv'

file_loader = CSVLoader(
    file_path=csv_file,
    encoding='utf-8',
    autodetect_encoding=False
)
raw_data = file_loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
split_docs = splitter.split_documents(raw_data)

db_path = 'vector_db'

vector_store = Chroma.from_documents(
    documents=split_docs,
    embedding=embedding_model,
    persist_directory=db_path,
)

This huge difference in processing time doesn’t make sense to me. The CSV should be simpler to handle than a complex PDF document. What could be causing this performance issue? Is there something wrong with how I’m processing the CSV data that makes it so slow?

The problem’s definitely CSVLoader treating each row as a separate document. That creates massive overhead once you start embedding.

Hit this same issue last year with transaction logs. Instead of fighting LangChain’s CSV handling, I switched everything to Latenode.

Latenode batches your CSV data properly before it touches your embedding model. You can group rows, combine columns, preprocess everything in one workflow.

Here’s how it works - Latenode reads your CSV, chunks rows into whatever size you need, then sends clean batches to your embedding service. Kills those 40k individual API calls.

Went from 20+ hours to under 2 hours on similar datasets. The workflow handles chunking automatically, and you can run multiple embedding processes in parallel.

You can also set up monitoring to track progress and catch failures without babysitting everything.

Check it out: https://latenode.com

CSVLoader’s the culprit here. It turns each row into its own document, so you’re dealing with 40k separate documents before any splitting happens.

I’ve hit this same wall before. CSVLoader makes every single row a document, then your splitter chews through each one. That’s insane overhead.

Here’s what works better:

import pandas as pd
from langchain.schema import Document

# Load CSV with pandas instead
df = pd.read_csv('customer_data.csv')

# Bundle rows into bigger chunks (100 rows per document)
chunk_size = 100
docs = []

for i in range(0, len(df), chunk_size):
    chunk = df.iloc[i:i+chunk_size]
    combined_text = '\n'.join(chunk.iloc[:, 0].astype(str))
    docs.append(Document(page_content=combined_text))

# Use your existing splitter and embedding code
split_docs = splitter.split_documents(docs)

Now the splitter starts from ~400 documents instead of 40k, and the number of chunks (and embedding calls) scales with the amount of text rather than the row count. Should drop your processing time to something sane.

Or just mash all your CSV data into one giant text string and let the splitter do the chunking. Way more efficient than processing row by row.

yeah, csvloader’s terrible with row handling - there’s no batching option, it always gives you one document per row. easiest fix is to dump the whole csv into one text blob first. i do pd.read_csv().to_string() then feed it straight to the splitter - way faster than dealing with 40k separate docs.
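Rough sketch of that approach (it assumes the same customer_data.csv and single text column from the question; exact import paths can differ between LangChain versions):

import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Flatten the whole CSV into one big string
df = pd.read_csv('customer_data.csv')
full_text = df.to_string(index=False)

# One splitter pass over that string does all the chunking
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = splitter.create_documents([full_text])

# docs is now a few hundred chunks instead of 40k row-documents;
# hand it to Chroma.from_documents exactly like before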

Had the same bottleneck with user behavior logs. The problem isn’t just CSVLoader making 40k documents - it’s that every one of those documents becomes its own embedding request to Ollama. CSV rows create tiny fragments that don’t use the model efficiently, unlike PDFs where you get continuous text chunks.

Here’s what worked for me: preprocess the CSV into bigger text blocks before touching LangChain. I grouped related rows by content similarity first, then sent larger chunks to the embedding pipeline. Cut down API calls and improved embedding quality since the model gets proper context.

Also, your chunk_overlap=0 might be hurting you. Sure, it’s faster, but you’re losing connections between rows. I use chunk_overlap=50 - better retrieval without killing performance.
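For reference, that’s a one-line change to the splitter in the original setup:

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)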

24 hours for 40k individual embedding calls sounds right. Batch preprocessing should get you to that 1-2 hour range you hit with the PDF.

Ditch CSVLoader completely. I hit this same wall processing 35k survey rows - it took forever while regular text files blazed through.

Here’s what fixed it: use Python’s csv module to read the file, then group rows together before making Document objects. With single-column data like yours, batch 200-500 rows into one document string with newlines between them.

CSVLoader + RecursiveCharacterTextSplitter is a double-processing mess: you’re splitting documents that are already tiny, making the splitter work overtime on fragments. Read the CSV directly, combine multiple rows into bigger text blocks, and skip the splitter entirely if your chunks are already a good size. You’ll drop from tens of thousands of embedding calls down to hundreds.
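A rough sketch of that, assuming the same single-column customer_data.csv with a header row (the 300-row batch is just one value in the 200-500 range):

import csv
from langchain.schema import Document

# Read the single-column CSV directly with the stdlib csv module
with open('customer_data.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row (drop this line if there isn't one)
    values = [row[0] for row in reader if row]

# Batch a few hundred rows per document and skip the splitter entirely
batch_size = 300
docs = [
    Document(page_content='\n'.join(values[i:i + batch_size]))
    for i in range(0, len(values), batch_size)
]

# ~40,000 rows -> roughly 130 documents to embed; pass docs straight
# to Chroma.from_documents instead of split_docs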