Creating JSONL format for Vertex AI batch inference from GCS files

I’m struggling with generating a proper JSONL file that includes references to images stored in my Google Cloud Storage bucket. I need this for running batch inference using Vertex AI.

What I’m trying to do:

  • Extract file paths from my GCS bucket
  • Format them appropriately for Vertex AI batch processing

My current approach:

  1. First, I list all files: gsutil ls gs://my-data-bucket/images > file_list.txt
  2. Then, I manually convert the txt file to JSONL format like this:
{"content": "gs://my-data-bucket/images/photo1.jpg", "mimeType": "image/jpeg"}
{"content": "gs://my-data-bucket/images/photo2.jpg", "mimeType": "image/jpeg"}

The problem:
When I submit my batch prediction job, I keep getting an error indicating that the file “cannot be parsed as JSONL.”

I suspect there might be formatting issues with my JSONL structure. Has anyone faced this before? Is there a more straightforward way to export bucket contents into the correct JSONL format that Vertex AI requires?

Had this exact problem with my first batch inference pipeline. Turns out invisible characters from the gsutil output were breaking the JSON parsing. I ditched gsutil and wrote a Python script using the GCS client library to generate the JSONL directly. The script loops through the bucket objects and writes each line with a proper JSON encoder, so quoting and escaping are handled for you. Also make sure your JSONL structure matches what your model wants: some need extra fields like “instances” wrapping the content. Once I switched to generating it programmatically, all the parsing errors went away.
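
Here is a minimal sketch of that kind of script, not a definitive implementation. It assumes the google-cloud-storage package is installed and credentials are already configured. The function names (jsonl_lines, write_jsonl) and the prefix and output path arguments are made up for illustration, and the record shape simply copies the one from the question, so change it to whatever your model actually expects.

```python
import json
import mimetypes
from typing import Iterable, Iterator


def jsonl_lines(bucket_name: str, object_names: Iterable[str]) -> Iterator[str]:
    """Yield one JSON-encoded record per image object name."""
    for name in object_names:
        # Skip "directory" placeholder objects such as "images/".
        if not name or name.endswith("/"):
            continue
        # Guess the MIME type from the file extension; skip non-images.
        mime_type, _ = mimetypes.guess_type(name)
        if mime_type is None or not mime_type.startswith("image/"):
            continue
        # Record shape copied from the question (an assumption, not a
        # confirmed Vertex AI schema).
        record = {"content": f"gs://{bucket_name}/{name}", "mimeType": mime_type}
        # json.dumps handles quoting and escaping and emits a single line.
        yield json.dumps(record)


def write_jsonl(bucket_name: str, prefix: str, out_path: str) -> int:
    """List objects under a prefix and write them to a JSONL file."""
    # Imported here so jsonl_lines can be used without the GCS package.
    from google.cloud import storage

    client = storage.Client()
    names = (blob.name for blob in client.list_blobs(bucket_name, prefix=prefix))
    count = 0
    # newline="\n" forces Unix line endings, even on Windows.
    with open(out_path, "w", encoding="utf-8", newline="\n") as handle:
        for line in jsonl_lines(bucket_name, names):
            handle.write(line + "\n")
            count += 1
    return count
```

Calling write_jsonl("my-data-bucket", "images/", "batch_input.jsonl") would write one record per image under that prefix and return how many lines were written. Because the file is opened as UTF-8 with explicit "\n" endings, it should not pick up carriage returns or a byte order mark along the way.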

Yeah, also make sure each line in your JSONL is a single, complete JSON object, with no blank lines, stray spaces, or weird invisible characters. Any of those could be causing the errors too. Try running the file through a tool like jq to check it!
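
If jq isn't handy, the same kind of check can be sketched in Python. This is only a rough sanity check under my own assumptions about what usually goes wrong (byte order marks, Windows line endings, blank lines, lines that are not JSON objects), and the function name find_jsonl_problems is made up for illustration. It does not validate anything specific to Vertex AI.

```python
import json
from typing import Iterable, List, Tuple


def find_jsonl_problems(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, description) pairs for suspicious lines."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # A byte order mark is invisible in most editors.
        if line.startswith("\ufeff"):
            problems.append((number, "byte order mark at start of line"))
            line = line.lstrip("\ufeff")
        # A trailing carriage return usually means Windows line endings.
        if line.endswith("\r"):
            problems.append((number, "carriage return at end of line"))
            line = line.rstrip("\r")
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        # Each line is expected to be an object, not a list or bare string.
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
    return problems


# Usage (newline="" keeps any "\r" so it can be detected):
# with open("batch_input.jsonl", encoding="utf-8", newline="") as handle:
#     for number, problem in find_jsonl_problems(handle):
#         print(f"line {number}: {problem}")
```

An empty result means every line parsed as a standalone JSON object. It says nothing about whether the fields are the ones your model wants.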