I’m having trouble getting my custom machine learning container to work properly on Vertex AI, even though it runs fine on Cloud Run.
What I’ve done so far:
- Created a Docker image with a PyTorch-based object detection model
- Tested the container locally and it works great
- Successfully deployed the same image on Cloud Run without any problems
- The API uses FastAPI and processes image data sent as base64
My setup includes:
- PyTorch 2.2.0 base image with CUDA 12.1 support
- FastAPI server with Uvicorn and Gunicorn
- Targeting an n1-standard-4 machine type with an NVIDIA Tesla P4 GPU
- Container starts with a shell script that loads the model and launches the server
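For reference, the entrypoint script looks roughly like this (the module path `app.main:app` is a placeholder for my actual app; the port logic reflects my understanding that Vertex AI injects the serving port via `AIP_HTTP_PORT` while Cloud Run sets `PORT`):

```shell
#!/usr/bin/env bash
# Sketch of the container entrypoint; app.main:app is a placeholder.
set -euo pipefail

# Vertex AI injects the serving port via AIP_HTTP_PORT; Cloud Run sets
# PORT; fall back to 8080 for local runs.
resolve_port() {
  echo "${AIP_HTTP_PORT:-${PORT:-8080}}"
}

serve() {
  # Launch the FastAPI app under Gunicorn with Uvicorn workers.
  exec gunicorn app.main:app \
    --worker-class uvicorn.workers.UvicornWorker \
    --bind "0.0.0.0:$(resolve_port)" \
    --timeout 120
}

# In the image the script just calls serve; guarded here so the port
# resolution can be exercised without gunicorn installed.
if [ "${1:-}" = "serve" ]; then
  serve
fi
```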
Deployment process:
- name: Create Detection Model
  run: |
    MODEL_EXISTS=$(gcloud ai models list --region=$LOCATION --filter="displayName=object-detector" --format="value(name)" --limit=1)
    if [ -z "$MODEL_EXISTS" ]; then
      gcloud ai models upload --region=$LOCATION --display-name=object-detector --container-image-uri=$CONTAINER_URI
    else
      echo "Detection model exists, skipping creation."
    fi
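One thing I'm second-guessing in this step: unlike Cloud Run, Vertex AI expects the serving port and the HTTP predict/health routes to be declared on the model at upload time (the platform then exposes them to the container as `AIP_PREDICT_ROUTE` / `AIP_HEALTH_ROUTE`). If that's the problem, I believe the upload would need to look something like the sketch below — the `/predict` and `/health` paths are guesses at what my FastAPI app exposes, and `LOCATION` / `CONTAINER_URI` stand in for the workflow's env vars:

```shell
# Hypothetical variant of the upload step declaring port and routes.
# LOCATION and CONTAINER_URI are placeholders for the workflow env vars.
LOCATION="us-central1"
CONTAINER_URI="us-docker.pkg.dev/my-project/repo/detector:latest"

upload_args=(
  ai models upload
  --region="$LOCATION"
  --display-name=object-detector
  --container-image-uri="$CONTAINER_URI"
  --container-ports=8080               # port the server binds inside the container
  --container-predict-route=/predict   # must match the FastAPI POST route
  --container-health-route=/health     # must return 200 once the model is loaded
)

# Printed for inspection; in CI this would be: gcloud "${upload_args[@]}"
echo "gcloud ${upload_args[*]}"
```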
- name: Check Model Registration Status
  run: |
    max_wait=$WAIT_TIME
    start_time=$(date +%s)
    while true; do
      DETECTOR_MODEL=$(gcloud ai models list --region=$LOCATION --filter="displayName=object-detector" --format="value(name)" --limit=1)
      if [ -n "$DETECTOR_MODEL" ]; then
        echo "Model Ready: $DETECTOR_MODEL"
        echo "DETECTOR_MODEL=$DETECTOR_MODEL" >> $GITHUB_ENV
        break
      fi
      if [ $(( $(date +%s) - start_time )) -gt $max_wait ]; then
        echo "Registration timeout exceeded." && exit 1
      fi
      sleep 15
    done
Endpoint deployment:
- name: Setup Detection Endpoint
  run: |
    DETECTION_ENDPOINT=$(gcloud ai endpoints list --region=$LOCATION --filter="displayName=detector-endpoint" --format="value(name)" --limit=1)
    if [ -z "$DETECTION_ENDPOINT" ]; then
      # endpoints create prints progress to stderr, so re-list afterwards
      # rather than trying to capture the name from the create call
      gcloud ai endpoints create --region=$LOCATION --display-name=detector-endpoint
      DETECTION_ENDPOINT=$(gcloud ai endpoints list --region=$LOCATION --filter="displayName=detector-endpoint" --format="value(name)" --limit=1)
    fi
    gcloud ai endpoints deploy-model $DETECTION_ENDPOINT \
      --region=$LOCATION \
      --model=$DETECTOR_MODEL \
      --display-name=detection-service \
      --machine-type=$INSTANCE_TYPE \
      --accelerator=count=1,type=$ACCELERATOR_TYPE \
      --min-replica-count=$MIN_INSTANCES \
      --enable-access-logging \
      --autoscaling-metric-specs=$SCALING_CONFIG
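Another difference I've noticed between the two platforms is the request shape: a Vertex AI endpoint wraps the request body in an `{"instances": [...]}` envelope before handing it to the container's predict route, whereas on Cloud Run I POST the base64 payload straight to my FastAPI route. A test call would look roughly like this (`PROJECT`, `ENDPOINT_ID`, and the `image_b64` field name are placeholders for my setup):

```shell
# Sketch of a prediction call; the JSON envelope is what Vertex AI
# delivers to the container's predict route.
PROJECT="my-project"       # placeholder
LOCATION="us-central1"     # placeholder
ENDPOINT_ID="1234567890"   # placeholder

payload='{"instances": [{"image_b64": "<base64-encoded image>"}]}'

# Shown as an echo here; the real call needs an auth token:
#   curl -X POST \
#     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#     -H "Content-Type: application/json" \
#     "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:predict" \
#     -d "$payload"
echo "$payload"
```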
The problem:
The model deployment fails on Vertex AI, and I can't find any useful logs in the endpoint monitoring. The exact same container image works perfectly when I deploy it to Cloud Run. Has anyone run into similar issues with custom containers on Vertex AI?
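For what it's worth, I assume the container's stdout/stderr should land in Cloud Logging rather than the monitoring tab, so a query like the following is what I'd try next. The resource type is my best guess for Vertex AI endpoints; printed here as a dry run:

```shell
# Sketch: querying Cloud Logging for the endpoint container's output.
# The resource.type filter is an assumption about Vertex AI endpoints.
read_args=(
  logging read
  'resource.type="aiplatform.googleapis.com/Endpoint"'
  --limit=50
  --format="value(timestamp,textPayload)"
)
echo "gcloud ${read_args[*]}"
```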