Unzipping a Google Drive dataset in Colab for ML training

I’m trying to use a dataset of 2000 images stored as a zip file on my Google Drive for machine learning training. I’ve managed to access the file using PyDrive, but I’m stuck on how to extract and save the contents to a directory in Colab.

Here’s what I’ve done so far:

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

file_id = 'my_zip_file_id'
downloaded = drive.CreateFile({'id': file_id})

This code gets the file, but I can’t figure out how to unzip it. I tried using zipfile, but I get a ‘Not a zipfile’ error:

import io
import zipfile

dataset = io.BytesIO(downloaded.encode('cp862'))
zip_ref = zipfile.ZipFile(dataset, 'r')
zip_ref.extractall()
zip_ref.close()

How can I properly extract the zip file and save its contents to a Colab directory? It would make processing and understanding the dataset much easier. Any help would be appreciated!

Yo, I’ve been there too. Try this:

downloaded.GetContentFile('dataset.zip')
!unzip dataset.zip -d /content/dataset

It’s way simpler: just download the file and use the Unix unzip command. Works like a charm for me. Don’t forget to delete the zip afterwards to save space.
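For the cleanup, something like this should do it (it assumes you’re still in the directory where the zip was downloaded, which is /content by default in Colab):

!rm dataset.zip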

I’ve encountered similar issues when working with large datasets from Google Drive in Colab. Here’s a method that worked for me:

After authenticating and creating the file object, try this approach:

import zipfile
import os

# Download the zip from Drive to Colab's local disk
downloaded.GetContentFile('dataset.zip')

# Extract the zip file
with zipfile.ZipFile('dataset.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/dataset')

# Remove the zip file to save space
os.remove('dataset.zip')

This downloads the zip file to Colab’s file system, extracts its contents to a new directory called ‘dataset’, and then removes the original zip to free up space. You can then access your images from the ‘/content/dataset’ directory.

Remember to adjust the paths if needed. Also, for very large datasets, you might want to consider processing the images in batches to avoid memory issues; a rough sketch of that is below.
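Here’s one way to batch the loading; the batch size, the Pillow-based loading, and the flat directory layout are assumptions to adapt to your own pipeline:

import os
from PIL import Image

image_dir = '/content/dataset'
filenames = sorted(os.listdir(image_dir))
batch_size = 64  # placeholder, tune to your memory budget

for start in range(0, len(filenames), batch_size):
    batch_files = filenames[start:start + batch_size]
    # Load just this batch of images into memory
    images = [Image.open(os.path.join(image_dir, f)) for f in batch_files]
    # ... preprocess / feed the batch to your training step here ...

Hope this helps!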

Having worked with similar setups, I can suggest an alternative approach that might solve your issue. Instead of using PyDrive, you could leverage Google Colab’s built-in integration with Google Drive. Here’s a method that’s proven reliable:

from google.colab import drive
drive.mount('/content/drive')

import zipfile
import os

zip_path = '/content/drive/MyDrive/path_to_your_zip_file.zip'
extract_path = '/content/dataset'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

This mounts your Google Drive, then directly extracts the zip file to a local Colab directory. It’s more straightforward and often more stable than using PyDrive. Just ensure you replace ‘path_to_your_zip_file.zip’ with the actual path in your Drive. After extraction, your images will be available in the ‘/content/dataset’ directory for further processing.
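As a quick sanity check after extraction, you can count what landed in that directory and confirm all 2000 images came through (this assumes the zip doesn’t nest the images in subfolders; if it does, use os.walk instead):

import os

extract_path = '/content/dataset'
# Count the files that were extracted
print(len(os.listdir(extract_path)))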