Unzipping a Google Drive dataset in Colab for ML training

I’m trying to use a dataset of 2000 images stored as a zip file on my Google Drive for machine learning training. I’ve managed to access the file using PyDrive, but I’m stuck on how to extract and save the contents to a directory in Colab.

Here’s what I’ve done so far:

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

file_id = 'my_zip_file_id'
downloaded = drive.CreateFile({'id': file_id})

This code gets the file, but I can’t figure out how to unzip it. I tried using zipfile, but I get a ‘Not a zipfile’ error:

import io
import zipfile

dataset = io.BytesIO(downloaded.encode('cp862'))
zip_ref = zipfile.ZipFile(dataset, 'r')
zip_ref.extractall()
zip_ref.close()

How can I properly extract the zip file and save its contents to a Colab directory? It would make processing and understanding the dataset much easier. Any help would be appreciated!

Yo, I’ve been there too. Try this:

downloaded.GetContentFile('dataset.zip')
!unzip dataset.zip -d /content/dataset

It’s way simpler: just download the file and use the Unix unzip command. Works like a charm for me. Don’t forget to delete the zip afterwards to save space.
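For the cleanup, something like this should do it (it assumes you’re still in the directory where the zip was downloaded, which is /content by default in Colab):

!rm dataset.zip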

I’ve encountered similar issues when working with large datasets from Google Drive in Colab. Here’s a method that worked for me:

After authenticating and creating the file object, try this approach:

import zipfile
import os

# Download the zip from Drive to Colab's local disk
downloaded.GetContentFile('dataset.zip')

# Extract the zip file
with zipfile.ZipFile('dataset.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/dataset')

# Remove the zip file to save space
os.remove('dataset.zip')

This downloads the zip file to Colab’s file system, extracts its contents to a new directory called ‘dataset’, and then removes the original zip to free up space. You can then access your images from the ‘/content/dataset’ directory.

Remember to adjust the paths if needed. Also, for very large datasets, you might want to consider processing the images in batches to avoid memory issues; a rough sketch of that is below.
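Here’s one way to batch the loading; the batch size, the Pillow-based loading, and the flat directory layout are assumptions to adapt to your own pipeline:

import os
from PIL import Image

image_dir = '/content/dataset'
filenames = sorted(os.listdir(image_dir))
batch_size = 64  # placeholder, tune to your memory budget

for start in range(0, len(filenames), batch_size):
    batch_files = filenames[start:start + batch_size]
    # Load just this batch of images into memory
    images = [Image.open(os.path.join(image_dir, f)) for f in batch_files]
    # ... preprocess / feed the batch to your training step here ...

Hope this helps!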

Having worked with similar setups, I can suggest an alternative approach that might solve your issue. Instead of using PyDrive, you could leverage Google Colab’s built-in integration with Google Drive. Here’s a method that’s proven reliable:

from google.colab import drive
drive.mount('/content/drive')

import zipfile
import os

zip_path = '/content/drive/MyDrive/path_to_your_zip_file.zip'
extract_path = '/content/dataset'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

This mounts your Google Drive, then directly extracts the zip file to a local Colab directory. It’s more straightforward and often more stable than using PyDrive. Just ensure you replace ‘path_to_your_zip_file.zip’ with the actual path in your Drive. After extraction, your images will be available in the ‘/content/dataset’ directory for further processing.
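As a quick sanity check after extraction, you can count what landed in that directory and confirm all 2000 images came through (this assumes the zip doesn’t nest the images in subfolders; if it does, use os.walk instead):

import os

extract_path = '/content/dataset'
# Count the files that were extracted
print(len(os.listdir(extract_path)))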