I’m having trouble understanding why the DocArrayInMemorySearch from langchain needs to connect to the internet. I thought it was supposed to work in memory only.
ConnectionError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken
My environment has limited internet access so external connections fail. Can someone explain what’s happening here and if there’s a way around it?
Been there. The issue isn’t DocArrayInMemorySearch - it’s OpenAI’s embeddings component trying to download tokenizer files the first time they’re needed and not already cached.
I hit this same wall on a project where production servers had no internet. Manual tokenizer caching gets messy fast across multiple environments.
What worked way better: I set up an automation flow that handles embedding prep beforehand. Built a workflow that processes documents, generates embeddings, and packages everything for deployment to restricted environments.
Run the internet-dependent stuff (downloading tokenizers, generating embeddings) in an unrestricted environment, then auto-deploy the processed results where needed.
You can even monitor for new documents and auto-reprocess your vector store when content changes. No more manual file copying or environment variable juggling.
Scales way better than manual solutions, especially with multiple deployments or regular content updates.
Hit me during a production deployment too. OpenAIEmbeddings creates a tiktoken encoder that tries to grab the tokenizer file from OpenAI’s servers when you first run embeddings. DocArrayInMemorySearch works totally offline, but it’s stuck waiting for those embeddings.

Figured out the issue: the download triggers when .from_documents() processes your docs and hits the embedding function. tiktoken checks for cached files first, but if they’re not there, it immediately tries to download them.

Here’s what worked in my locked-down environment: run tiktoken.get_encoding('cl100k_base') in a startup script while you still have internet access, then cut off access. The cached files stick around and OpenAIEmbeddings uses them without any more network calls. Way cleaner than manually copying files between environments.
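As a sanity check before you cut off internet access, you can verify the cache actually got populated. A minimal sketch - the helper names are mine, and the lookup order (TIKTOKEN_CACHE_DIR, then DATA_GYM_CACHE_DIR, then a data-gym-cache folder under the system temp dir) reflects my reading of tiktoken’s source, so double-check against your installed version:

```python
import os
import tempfile


def tiktoken_cache_dir() -> str:
    """Resolve the cache dir the way tiktoken appears to (check your version)."""
    return (
        os.environ.get("TIKTOKEN_CACHE_DIR")
        or os.environ.get("DATA_GYM_CACHE_DIR")
        or os.path.join(tempfile.gettempdir(), "data-gym-cache")
    )


def cache_is_populated() -> bool:
    """True if the cache dir exists and holds at least one cached file."""
    d = tiktoken_cache_dir()
    return os.path.isdir(d) and bool(os.listdir(d))
```

Run cache_is_populated() right after the warm-up script; if it returns False, the startup script didn’t actually write the cache and the restricted environment will still try to download.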
That error’s from the OpenAI embeddings part, not DocArrayInMemorySearch. The library needs to download a tokenizer file to split your text before creating embeddings.
I hit this exact issue deploying to a restricted environment last year. OpenAIEmbeddings needs that tiktoken file to count tokens properly.
Here’s how to fix it:
Option 1: Pre-download the tokenizer
Run this once where you have internet:
Had the same problem yesterday. DocArrayInMemorySearch itself is fine, but it’s the OpenAI embeddings that try to grab that tokenizer file on first use. You can fix this by copying your tiktoken cache from your local machine to the restricted server. Trust me, it saved me a lot of time.
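One way to script that copy. This is a sketch: archive_cache and the /opt/tiktoken_cache path are just illustrative names, and the default cache location (a data-gym-cache folder under the system temp dir) reflects my reading of tiktoken’s source:

```python
import os
import shutil
import tempfile


def archive_cache(cache_dir: str, archive_base: str) -> str:
    """Bundle a tiktoken cache directory into a .tar.gz for transfer."""
    if not os.path.isdir(cache_dir):
        raise FileNotFoundError(f"no cache at {cache_dir} - run get_encoding first")
    return shutil.make_archive(archive_base, "gztar", cache_dir)


# On the machine with internet (default cache lives under the temp dir):
# archive_cache(os.path.join(tempfile.gettempdir(), "data-gym-cache"), "tiktoken_cache")

# On the restricted server, unpack and point tiktoken at it:
# shutil.unpack_archive("tiktoken_cache.tar.gz", "/opt/tiktoken_cache")
# os.environ["TIKTOKEN_CACHE_DIR"] = "/opt/tiktoken_cache"
```

Setting TIKTOKEN_CACHE_DIR on the restricted server (before anything imports tiktoken) is the important part - it tells tiktoken to read from the copied files instead of trying to download.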
Hit this same issue a few weeks ago setting up a demo. DocArrayInMemorySearch sounds like it should work completely offline, but it doesn’t.
Here’s what’s happening: OpenAIEmbeddings downloads a tiktoken encoding file during .from_documents(). Your vector store runs in memory after that, but the initial embedding step needs internet to tokenize documents first.
Cleanest fix is splitting embedding generation from search setup. Don’t call .from_documents() directly in your restricted environment - generate embeddings elsewhere and pass them in.
Try this:
# Run this part where you have internet
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch

api_embeddings = OpenAIEmbeddings()
# embed all documents in one batched call
embeddings = api_embeddings.embed_documents([doc.page_content for doc in document_list])

# Then in restricted environment
vector_store = DocArrayInMemorySearch.from_embeddings(
    text_embedding_pairs=[(doc.page_content, emb) for doc, emb in zip(document_list, embeddings)],
    embedding=api_embeddings
)
You control exactly when the internet call happens and can prep everything offline. Way more predictable than hoping cached files transfer correctly.
The confusion comes from thinking DocArrayInMemorySearch needs internet - it’s actually the OpenAIEmbeddings component that’s the problem. Hit this same issue during a corporate deployment where security blocked external requests. Here’s what’s happening: the OpenAI embeddings client downloads the cl100k_base tokenizer on first run to encode text before generating embeddings. That download happens regardless of your storage choice - even if everything runs in memory afterwards.
Easy workaround I used: run the embedding generation on a machine with internet, then serialize the vector store for transfer. You can pickle the entire DocArrayInMemorySearch object after creating it and load it in your restricted environment - no internet needed after that.
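A sketch of the serialize-and-transfer step. These are generic helpers with names I made up; whether the round-trip works depends on everything the store object holds being picklable, so test it with your actual DocArrayInMemorySearch before relying on it:

```python
import pickle


def save_store(store, path: str) -> None:
    """Pickle the vector store to disk on the machine with internet access."""
    with open(path, "wb") as f:
        pickle.dump(store, f)


def load_store(path: str):
    """Load the pickled store in the restricted environment - no network calls."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Usage: save_store(vector_store, "store.pkl") on the online machine, transfer the file, then vector_store = load_store("store.pkl"). Only unpickle files you produced yourself - pickle can execute arbitrary code on load.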
Alternatively, try local embedding models like sentence-transformers. You download the model weights once, and after that no external calls are needed at all. Performance might be fine for your use case, and you get complete offline functionality.