I’m working on a chatbot that handles different file types using LangChain loaders with a Streamlit frontend. The issue is processing the uploaded files correctly.
import pathlib

import streamlit as st

def run_app():
    st.header("Multi-Document Chat Assistant")
    file_uploads = st.file_uploader(
        "Select files to upload",
        type=["pdf", "txt", "docx", "html", "json", "xml"],
        accept_multiple_files=True
    )
    processed_docs = []
    if file_uploads:
        for file_item in file_uploads:
            st.info(f"Processing: {file_item.name}")
            file_ext = pathlib.Path(file_item.name).suffix.lower()
            if file_ext in DOCUMENT_HANDLERS:
                handler_class, config = DOCUMENT_HANDLERS[file_ext]
                doc_loader = handler_class(file_item, **config)
            else:
                doc_loader = GenericFileLoader(file_item)
            processed_docs.extend(doc_loader.load())
    user_question = st.text_input("Enter your question:")
    if st.button("Submit Query"):
        if user_question:
            try:
                model = initialize_llm()
                vector_store = build_vector_db(processed_docs)
                answer = get_response(user_question, vector_store)
                st.success(f"Answer: {answer}")
            except Exception as error:
                st.error(f"Error occurred: {error}")

if __name__ == "__main__":
    run_app()
When I upload a PDF file, I get this error: TypeError: expected str, bytes or os.PathLike object, not UploadedFile. It seems the loaders expect file paths, but Streamlit gives me UploadedFile objects. How do I convert or handle this properly?
Streamlit’s uploaded files can’t be read directly by LangChain loaders, so you need to get the raw bytes out first. For loaders that can work with byte streams, skip temp files and use io.BytesIO instead; it’s cleaner. Grab the content with file_content = uploaded_file.read() and wrap it in a stream for the loader. Some loaders also expose a from_bytes-style constructor, so check yours.
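For text-based formats like .txt and .json you don’t even need a loader; a minimal sketch of the in-memory approach, where FakeUpload is a stand-in for Streamlit’s UploadedFile (which exposes the same read() and .name interface):

```python
import io
import json

class FakeUpload(io.BytesIO):
    """Stand-in for Streamlit's UploadedFile: a byte stream plus a name."""
    def __init__(self, name, data):
        super().__init__(data)
        self.name = name

def extract_text(uploaded_file):
    """Pull plain text out of an in-memory upload; no temp file needed."""
    raw = uploaded_file.read()  # bytes, same call UploadedFile supports
    if uploaded_file.name.endswith(".json"):
        # Round-trip through json to validate and normalize the content
        return json.dumps(json.loads(raw.decode("utf-8")))
    return raw.decode("utf-8")

upload = FakeUpload("notes.txt", b"hello from memory")
print(extract_text(upload))  # -> hello from memory
```

Binary formats like PDF and DOCX still need a parser, but the same pattern works with any parser that accepts a file-like object.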
LangChain loaders need file paths, not Streamlit’s UploadedFile objects. You’ve got to write the uploaded files to temp storage first. Here’s what works:
import tempfile
import os

def process_uploaded_file(uploaded_file, file_ext):
    with tempfile.NamedTemporaryFile(delete=False, suffix=file_ext) as tmp_file:
        tmp_file.write(uploaded_file.getbuffer())
        tmp_file_path = tmp_file.name
    try:
        handler_class, config = DOCUMENT_HANDLERS[file_ext]
        doc_loader = handler_class(tmp_file_path, **config)
        docs = doc_loader.load()
    finally:
        os.unlink(tmp_file_path)  # Clean up the temp file even if loading fails
    return docs
Swap out your file processing loop with calls to this function. getbuffer() grabs the file bytes so you can write them to a temp file, which gets cleaned up afterwards. I’ve been using this approach for multiple file types in production and it works great.
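The rewritten loop looks like this. To keep the sketch self-contained, DOCUMENT_HANDLERS maps to a throwaway path-based TextLoader, and the upload is an io.BytesIO (which has the same getbuffer() method UploadedFile exposes); in your app, both come from your real handlers and st.file_uploader:

```python
import io
import os
import pathlib
import tempfile

class TextLoader:
    """Stand-in loader that, like LangChain loaders, takes a file path."""
    def __init__(self, path):
        self.path = path
    def load(self):
        with open(self.path, encoding="utf-8") as f:
            return [f.read()]

DOCUMENT_HANDLERS = {".txt": (TextLoader, {})}

def process_uploaded_file(uploaded_file, file_ext):
    # Write the in-memory upload to a real file so path-based loaders work
    with tempfile.NamedTemporaryFile(delete=False, suffix=file_ext) as tmp_file:
        tmp_file.write(uploaded_file.getbuffer())
        tmp_file_path = tmp_file.name
    try:
        handler_class, config = DOCUMENT_HANDLERS[file_ext]
        docs = handler_class(tmp_file_path, **config).load()
    finally:
        os.unlink(tmp_file_path)
    return docs

# The loop from the question, rewritten to call the helper:
upload = io.BytesIO(b"some text")   # UploadedFile also has .getbuffer()
upload.name = "a.txt"
processed_docs = []
for file_item in [upload]:
    ext = pathlib.Path(file_item.name).suffix.lower()
    processed_docs.extend(process_uploaded_file(file_item, ext))
print(processed_docs)  # -> ['some text']
```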
The problem is Streamlit’s UploadedFile objects don’t have file paths, but LangChain loaders expect them. I hit this exact issue last year building something similar.

Don’t manually handle temp files for every loader type. Check first whether the loader supports direct content loading; some have alternate constructors that take content instead of file paths. For loaders that do need file paths, use tempfile.mkstemp(). It gives you both a file descriptor and a path, so you just write the uploaded content there. Use uploaded_file.getvalue() instead of getbuffer(); it’s more reliable across Streamlit versions.

Your error handling needs work too. When a loader fails, you want to keep processing the other files, not break the whole batch. Wrap each file in its own try/except so one bad file doesn’t crash your entire session.
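A minimal sketch of the mkstemp plus per-file try/except combination; parse_file here is a hypothetical stand-in for a real path-based loader, and the (name, bytes) tuples stand in for uploads after calling getvalue():

```python
import os
import tempfile

def parse_file(path):
    """Hypothetical stand-in for a path-based document loader."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    if not text:
        raise ValueError("empty file")
    return [text]

def load_one(name, content):
    # mkstemp returns an open file descriptor AND a path
    fd, tmp_path = tempfile.mkstemp(suffix=os.path.splitext(name)[1])
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)          # the bytes from uploaded_file.getvalue()
        return parse_file(tmp_path)
    finally:
        os.unlink(tmp_path)

uploads = [("good.txt", b"ok"), ("bad.txt", b"")]
docs, failed = [], []
for name, content in uploads:
    try:                              # one try/except per file
        docs.extend(load_one(name, content))
    except Exception:
        failed.append(name)           # keep going; don't kill the batch
print(docs, failed)  # -> ['ok'] ['bad.txt']
```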
Yeah, LangChain document loaders want file paths, but Streamlit gives you UploadedFile objects. You’ve got to save those files temporarily first.
Honestly, managing all these different loaders and temp files is a headache. Been there.
I’d skip this mess and use Latenode for file processing instead. Set up workflows that auto-detect file types and process them without dealing with Streamlit upload conversions.
With Latenode, you create one endpoint that takes any file type, processes it, and returns the extracted content. Your Streamlit app just sends the file and gets clean text back.
I’ve done this for document processing; it’s way cleaner. No temp file juggling or loader headaches.
The workflow handles file type detection automatically and picks the right processing method. You can add new file types without touching your Streamlit code.
Been dealing with this exact issue for years. The problem is LangChain loaders want file paths, but Streamlit gives you file objects in memory.
I built a wrapper that handles the conversion automatically. Instead of managing temp files everywhere, create one function:
import os
import pathlib

def load_streamlit_file(uploaded_file):
    file_ext = pathlib.Path(uploaded_file.name).suffix.lower()
    # Save to temp location
    temp_path = f"/tmp/{uploaded_file.name}"
    with open(temp_path, "wb") as f:
        f.write(uploaded_file.getbuffer())
    try:
        # Load with appropriate handler
        handler_class, config = DOCUMENT_HANDLERS[file_ext]
        loader = handler_class(temp_path, **config)
        docs = loader.load()
    finally:
        # Cleanup, even if the loader raises
        os.remove(temp_path)
    return docs
Then replace your processing loop:
for file_item in file_uploads:
    processed_docs.extend(load_streamlit_file(file_item))
I use /tmp/ instead of tempfile because it’s simpler and gets cleaned up automatically on most systems (note it won’t exist on Windows, and two uploads with the same filename will collide, so use tempfile if that matters). This has handled millions of files in production without issues.
Keep the temp file logic separate from your main app flow.