I’m working on setting up a Vector Search index in Google Cloud Platform and running into a frustrating issue. The process works fine for certain files, but with others I keep hitting the same roadblock.
The error message I’m getting is: 400 There are invalid records in the input file. Embedding size mismatch: expected 768, but got 1 3: There are invalid records in the input file. Embedding size mismatch: expected 768, but got 1
I’m following the official documentation and creating the index with code along these lines (project, bucket URI, and display name are simplified here):
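```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Batch index built from the JSON files sitting in a Cloud Storage bucket.
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="my-vector-search-index",
    contents_delta_uri="gs://my-bucket/embeddings/",
    dimensions=768,
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)
```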
I tried adjusting the approximate_neighbors_count parameter to 1 but that didn’t help at all. Has anyone encountered this embedding dimension mismatch before? What’s the best way to resolve this issue?
Scan your JSONL files and check the length of every embedding array before uploading — that shows you exactly which records have the wrong dimensions. I found records where the embedding model timed out and my script wrote empty arrays or error codes instead of vectors.
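Something like this works — adjust the glob to wherever your local copies live; it assumes the standard {"id": ..., "embedding": [...]} record format:

```python
import glob
import json

EXPECTED_DIM = 768

# Flag any record whose embedding is missing, not a list, or the wrong length.
for path in glob.glob("embeddings/*.json"):
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            emb = record.get("embedding")
            if not isinstance(emb, list) or len(emb) != EXPECTED_DIM:
                got = len(emb) if isinstance(emb, list) else repr(emb)
                print(f"{path}:{line_no} id={record.get('id')} bad embedding ({got})")
```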
The “got 1” part usually means you’ve got a record with embedding: [1] or just embedding: 1 somewhere. Could be from failed API calls, rate limiting, or text that was too long for your embedding model.
Also check if you’re mixing different embedding models in the same dataset. I once accidentally used both text-embedding-ada-002 (1536 dims) and an older model (768 dims) in the same batch.
This embedding dimension error usually means your input data has inconsistent vector sizes - not a code problem. When you see ‘expected 768, but got 1’, some records in your dataset have single values instead of proper 768-dimensional vectors. I hit this exact issue when I accidentally mixed metadata or text fields into the embedding column of my JSONL files. It usually happens during preprocessing, when certain records fail to generate embeddings and end up with placeholder or corrupted values.

Check your input files carefully - look for records where the embedding field has a single number, a null, or a broken array. Also make sure your embedding generation actually worked for every record before you upload to the storage bucket; batch processing sometimes fails silently on specific documents while looking like it completed fine.
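A cheap guard before each record is written catches most of this — rough sketch below; the gecko model name and helper function are just illustrative, swap in whatever you actually use:

```python
import json

import vertexai
from vertexai.language_models import TextEmbeddingModel

EXPECTED_DIM = 768

vertexai.init(project="my-project", location="us-central1")
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

def write_record(doc_id, text, out_file):
    """Embed one document and only write the JSONL record if the vector is sane."""
    try:
        values = model.get_embeddings([text])[0].values
    except Exception as exc:  # rate limits, timeouts, oversized input, etc.
        print(f"embedding failed for {doc_id}: {exc}")
        return False
    if len(values) != EXPECTED_DIM:
        print(f"skipping {doc_id}: got {len(values)} dims, expected {EXPECTED_DIM}")
        return False
    out_file.write(json.dumps({"id": doc_id, "embedding": values}) + "\n")
    return True
```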
Vector Search is super picky about your data pipeline, especially text chunking and embedding generation. I hit this same error when my preprocessing script wrote partial records - ran out of memory during big batch jobs. Wasn’t just bad embeddings either. My chunking logic was creating tiny or empty chunks, so the embedding service just returned placeholder junk. Check that you’re not feeding it empty strings or really short text bits. Also make sure your embedding calls actually finish properly, not just getting HTTP 200s. I started logging the embedding array lengths while generating them - caught tons of edge cases where the model spit out weird formats.
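Dropping the empty and near-empty chunks up front is cheap insurance — a minimal sketch, with a threshold you’d want to tune for your own documents:

```python
MIN_CHUNK_CHARS = 20  # arbitrary floor; tune for your corpus

def usable_chunks(chunks):
    """Yield only chunks with enough text to be worth embedding."""
    for chunk in chunks:
        text = chunk.strip()
        if len(text) < MIN_CHUNK_CHARS:
            print(f"skipping tiny chunk: {text!r}")
            continue
        yield text
```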
Had the same headache last week! Your code’s fine - it’s the data format. You’ve got a record in your JSONL file with a broken embedding, probably just [1] instead of the full 768-element array. This happens when embedding generation fails silently but still writes output. Grep your files for any embedding field whose value isn’t a full-length array.