Encoding issues when loading CSV files with Langchain DirectoryLoader - getting undefined character mapping errors

I’m having trouble with character encoding when using Langchain’s DirectoryLoader to process CSV files from a folder.

Here’s my current setup:

from langchain_community.document_loaders import DirectoryLoader
from langchain.document_loaders.csv_loader import CSVLoader

file_loader_options = {"autodetect_encoding": True}
dir_loader = DirectoryLoader(
    path=r'\my_folder_path',
    glob="**/*.csv",
    loader_kwargs=file_loader_options
)
documents = dir_loader.load()

I keep running into two different encoding problems with certain CSV files:

  1. 'charmap' codec can't decode byte 0x9d in position 4492: character maps to <undefined>
  2. All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

The strange thing is that when I use pandas to read the same files with proper encoding settings, everything works fine. But Langchain seems to struggle with these files.

I even tried creating a custom loader class, but I’m still having issues:

from langchain.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders import DirectoryLoader

class MyCSVLoader(CSVLoader):
    def load(self):
        with open(self.file_path, encoding="utf-8") as file:
            data = file.read()
        return self._parse(data)

my_loader = DirectoryLoader(
    r'C:\my_data_folder',
    glob="**/*.csv",
    loader_cls=MyCSVLoader
)
results = my_loader.load()

Any ideas on how to properly handle these encoding issues with Langchain?

As far as I can tell, the encoding options in your DirectoryLoader call never reach a CSVLoader. I hit this same issue migrating legacy data: the files ended up being opened with the system default encoding instead of the one I had configured. Skip DirectoryLoader completely and handle file discovery yourself. Use glob to find your CSV files, then create an individual CSVLoader for each one with an explicit encoding parameter. You’ll get full control over each file’s encoding and won’t depend on how DirectoryLoader forwards its arguments. Those XML compatibility errors? They most likely come from NULL bytes or control characters in the data, which Excel exports often contain. Strip them while processing the individual files and you’ll fix both issues at once.
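A minimal sketch of that do-it-yourself approach, for illustration only. `load_csv_folder` and `make_loader` are hypothetical names, and I’m assuming your CSVLoader version accepts an `encoding` keyword (check yours). The loader factory is passed in as an argument, so the discovery logic itself needs nothing beyond the standard library:

```python
from pathlib import Path


def load_csv_folder(root, make_loader):
    """Find every CSV under root and load each one with its own loader.

    make_loader is any callable that takes a file path string and returns
    an object with a .load() method, for example
    lambda p: CSVLoader(p, encoding="utf-8")  (assumed signature).
    """
    documents, failures = [], {}
    for path in sorted(Path(root).rglob("*.csv")):
        try:
            documents.extend(make_loader(str(path)).load())
        except Exception as exc:
            # Keep going and record which files could not be loaded.
            failures[str(path)] = exc
    return documents, failures
```

With Langchain installed, this could be called as `docs, failed = load_csv_folder(r'C:\my_data_folder', lambda p: CSVLoader(p, encoding='utf-8'))`. The `failed` dict then shows which files need a different encoding or some cleanup.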

Another fix: use errors='ignore' or errors='replace' when opening files in your loader. You’re getting charmap errors because Windows defaults to cp1252 but your CSVs probably have UTF-8 characters. I just use open(file_path, encoding='utf-8', errors='replace') and it handles the weird bytes without crashing.
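To see why byte 0x9d trips the Windows default, here is a small standard-library sketch. It assumes the offending character is a UTF-8 curly quote, which is a common source of that byte but only a guess about your files:

```python
# A right curly quote (U+201D) is encoded in UTF-8 as the bytes E2 80 9D.
raw = "price,note\n10,\u201cfragile\u201d\n".encode("utf-8")

# cp1252 has no character assigned to 0x9d, so strict decoding fails.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Decoding as UTF-8 works. errors="replace" only matters for bytes that
# are not valid UTF-8: they become U+FFFD instead of raising.
clean = raw.decode("utf-8", errors="replace")
damaged = (raw + b"\x9d").decode("utf-8", errors="replace")
```

Keep in mind that errors='replace' and errors='ignore' are lossy: the original bytes are gone once the text is decoded, so they hide a wrong encoding guess rather than fix it.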

Had the same nightmare last month with mixed encoding CSVs. DirectoryLoader can’t handle encoding detection consistently when you’ve got different files in one batch. I fixed it by preprocessing everything first - wrote a quick script that tries multiple encodings (utf-8, latin-1, cp1252) on each CSV and converts them all to utf-8 before Langchain touches them. For the XML compatibility error, you need to strip out null bytes and control characters. Add content = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]', '', content) in your custom loader after reading the file. Extra step but saves hours of debugging.
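The preprocessing pass could look roughly like the sketch below. `to_clean_utf8` is a hypothetical helper name, and the encoding order is my own assumption: latin-1 accepts every byte sequence, so it goes last, otherwise cp1252 would never be tried.

```python
import re
from pathlib import Path

# Stripped here: C0 controls except tab/LF/CR, DEL, and C1 controls
# except NEL (\x85). Same pattern as in the post above.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]")

# utf-8-sig also reads plain UTF-8 and drops a leading BOM if present.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def to_clean_utf8(src, dst):
    """Decode src with the first encoding that works, strip control
    characters, and write the result to dst as UTF-8.

    Returns the name of the encoding that was used.
    """
    raw = Path(src).read_bytes()
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        Path(dst).write_bytes(CONTROL_CHARS.sub("", text).encode("utf-8"))
        return encoding
    raise ValueError(f"Could not decode {src}")
```

This is a heuristic, not real detection: a file can decode without errors under the wrong encoding, so it is worth spot-checking a few converted files before loading everything.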

I had similar problems too. Using chardet was a game changer for me! Run import chardet; detected = chardet.detect(open('file.csv', 'rb').read()) and then pass detected['encoding'] as the encoding for your CSVLoader. It really helped with my weird files, hope this helps!

In your first snippet, autodetect_encoding never reaches a CSVLoader: no loader_cls is set, so DirectoryLoader uses its default loader for every file. That’s your problem.

I hit this same issue processing user data exports at work. Here’s what fixed it:

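import csv
import io

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_core.documents import Document

class EncodingFixCSVLoader(CSVLoader):
    # utf-8-sig also reads plain UTF-8 and strips a BOM. latin-1 accepts any
    # byte sequence, so it must come last or cp1252 is never tried.
    encodings = ['utf-8-sig', 'cp1252', 'latin-1']

    def lazy_load(self):
        for encoding in self.encodings:
            try:
                with open(self.file_path, 'r', encoding=encoding, newline='') as f:
                    content = f.read()
            except UnicodeDecodeError:
                continue
            # Drop NUL bytes and normalise line endings before parsing.
            content = content.replace('\x00', '').replace('\r\n', '\n').replace('\r', '\n')
            reader = csv.DictReader(io.StringIO(content), **self.csv_args)
            # Roughly mirrors CSVLoader's default output: one Document per row.
            for i, row in enumerate(reader):
                text = '\n'.join(f"{key}: {value}" for key, value in row.items())
                yield Document(page_content=text,
                               metadata={'source': str(self.file_path), 'row': i})
            return

        raise ValueError(f"Could not decode {self.file_path}")

    def load(self):
        # Depending on the version, DirectoryLoader calls load() or lazy_load().
        return list(self.lazy_load())

loader = DirectoryLoader(
    path=r'\my_folder_path',
    glob="**/*.csv",
    loader_cls=EncodingFixCSVLoader
)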

It tries multiple encodings and cleans the content before parsing. Haven’t had issues since.
