How to upload a dataset in Google Colaboratory? - python

I need to upload a dataset of images to Google Colaboratory. The dataset has a subfolder that contains the images. Everything I found online covers only a single file:
from google.colab import files
uploaded = files.upload()
Is there any way to do it?

For uploading data to Colab, you have four methods.
Method 1
You can upload a file or directory directly in the Colab UI.
The data is saved on the Colab local machine. In my experiments, I noticed three things:
1) The upload speed is good.
2) It keeps the directory structure, but it does not unzip archives automatically. You need to execute this code in a Colab cell:
!mkdir -p {dir_name}
!unzip {zip_file} -d {dir_name}
3) Most importantly, when Colab crashes, the data will be deleted.
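The two shell commands in point 2 can also be written in Python. This is only a minimal sketch using the standard zipfile module; the function name extract_dataset and the example paths are hypothetical and not part of the original answer.

```python
import os
import zipfile


def extract_dataset(zip_path, dest_dir):
    """Extract zip_path into dest_dir, keeping the archive's subfolder layout.

    Returns the sorted list of extracted file paths (directories excluded).
    """
    os.makedirs(dest_dir, exist_ok=True)  # same effect as `mkdir -p`
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        # Directory entries in a zip end with "/", so skip them here
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return sorted(os.path.join(dest_dir, n) for n in names)
```

In a Colab cell this might be called as extract_dataset('/content/images.zip', '/content/images'), where both paths are placeholders for your own archive and target folder.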
Method 2
Execute this code in a Colab cell:
from google.colab import files
uploaded = files.upload()
In my experiments, running the cell displays an upload button. Choose a file while the cell's execution indicator is still running. 1) After execution, the file name appears in the output panel. 2) Refresh the Colab file browser and you will see the file. 3) Or execute !ls and you should see your file. If not, the file was not uploaded successfully.
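To check from code what arrived, you can inspect the value returned by files.upload(), which is a dictionary mapping each file name to its content as bytes. The sketch below uses a stand-in dictionary with the same shape so that it also runs outside Colab; summarize_upload is a hypothetical helper name.

```python
def summarize_upload(uploaded):
    """Map each uploaded file name to its size in bytes."""
    return {name: len(content) for name, content in uploaded.items()}


# In Colab this would be: uploaded = files.upload()
# Here a stand-in dict with the same {name: bytes} shape is used instead.
uploaded = {"cat.jpg": b"\xff\xd8\xff", "labels.csv": b"id,label\n"}
for name, size in summarize_upload(uploaded).items():
    print(f"{name}: {size} bytes")
```

An empty result means no file was selected before the cell finished.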
Method 3
If your data is from Kaggle, you can use the Kaggle API to download it to the Colab local directory.
Method 4
Upload the data to Google Drive, using either 1) the Google Drive web interface or 2) the Drive API (https://developers.google.com/drive/api/v3/quickstart/python). To access the Drive data, use the following code in Colab:
from google.colab import drive
drive.mount('/content/drive')
I would recommend uploading data to Google Drive because it is permanent.
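Once the dataset folder is reachable as a normal directory (for example under the Drive mount point), the images in its subfolders can be collected with the standard library. This is a sketch under assumptions: images_by_subfolder and the extension list are hypothetical, and the dataset path is whatever folder you uploaded.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to your dataset
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def images_by_subfolder(root):
    """Return {subfolder name: sorted list of image paths} for a dataset root."""
    root = Path(root)
    groups = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            rel = path.relative_to(root)
            # The first directory below root is the group key;
            # images lying directly in root are grouped under "."
            key = rel.parts[0] if len(rel.parts) > 1 else "."
            groups.setdefault(key, []).append(str(path))
    return groups
```

With Drive mounted as above, a call could look like images_by_subfolder('/content/drive/MyDrive/dataset'), where the folder name dataset is a placeholder.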

You need to copy your dataset into Google Drive, then obtain the DATA_FOLDER_ID.
The best way to do so is to open the folder in Google Drive and copy the last part of the URL. For example, the folder ID for the link
https://drive.google.com/drive/folders/xxxxxxxxxxxxxxxxxxxxxxxx is xxxxxxxxxxxxxxxxxxxxxxxx
Then you can create local folders and copy each file to Colab recursively.
DATA_FOLDER_ID = 'xxxxxxxxxxxxxxxxxxxxxxxx'
ROOT_PATH = '~/your_path'
!pip install -U -q PyDrive
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# Choose a local (Colab) directory to store the data.
local_root_path = os.path.expanduser(ROOT_PATH)
os.makedirs(local_root_path, exist_ok=True)
def ListFolder(google_drive_id, destination):
    file_list = drive.ListFile({'q': "'%s' in parents and trashed=false" % google_drive_id}).GetList()
    counter = 0
    for f in file_list:
        # If it is a folder, create the local directory and copy its contents into it
        if f['mimeType'] == 'application/vnd.google-apps.folder':
            folder_path = os.path.join(destination, f['title'])
            os.makedirs(folder_path, exist_ok=True)
            print('creating directory {}'.format(folder_path))
            ListFolder(f['id'], folder_path)
        else:
            fname = os.path.join(destination, f['title'])
            f_ = drive.CreateFile({'id': f['id']})
            f_.GetContentFile(fname)
            counter += 1
    print('{} files were copied to {}'.format(counter, destination))
ListFolder(DATA_FOLDER_ID, local_root_path)
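Copying the folder ID out of the link by hand, as described at the start of this answer, can also be done in code. The helper below is a sketch; folder_id_from_url is a hypothetical name, and it assumes the link has the .../folders/<id> shape shown in the example.

```python
from urllib.parse import urlparse


def folder_id_from_url(url):
    """Return the path segment that follows 'folders' in a Drive folder URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    try:
        # Query strings such as ?usp=sharing are not part of the path,
        # so they never end up in the returned ID
        return parts[parts.index("folders") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no folder ID found in {url!r}") from None
```

The result can then be assigned to DATA_FOLDER_ID instead of pasting the ID manually.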

Related

I can see only files created with my script pydrive

As the title says I have an issue with pydrive. I ran the code given in the pydrive quickstart (https://googleworkspace.github.io/PyDrive/docs/build/html/quickstart.html) and I created a settings and credentials file to avoid entering my credentials all the time.
But when I run this code:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
# Rename the downloaded JSON file to client_secrets.json
# The client_secrets.json file needs to be in the same directory as the script.
gauth = GoogleAuth()
drive = GoogleDrive(gauth)
# List files in Google Drive
fileList = drive.ListFile().GetList()
for drive_file in fileList:
    print('title: %s, id: %s' % (drive_file['title'], drive_file['id']))
I can only see the files created with my script. For example if I add this before the list file:
folder = drive.ListFile({'q': "title = 'Python_test' and trashed=false"}).GetList()[0] # get the folder we just created
file = drive.CreateFile({'title': "test.txt", 'parents': [{'id': folder['id']}]})
file.Upload()
I only see the folder and the file ID I just created. And if I manually add a file to my Drive (in the browser, for example), it doesn't appear.
Does anyone have an idea of what's going on?
I just found the problem: it was in my settings.yaml file, where I had added only this oauth_scope:
oauth_scope:
- https://www.googleapis.com/auth/drive.file
but this gives access only to the files created by the app. To correct that, I needed to remove the .file suffix, like this:
oauth_scope:
- https://www.googleapis.com/auth/drive
If you want more details about the different scopes, check this link:
https://developers.google.com/identity/protocols/oauth2/scopes

How to work with Google Colab efficiently?

I am trying to train a neural network on Colab using a GPU. I am now wondering whether I am on the right path and whether all the steps I am doing are necessary, because the process I am following does not seem very efficient to me.
# Install the PyDrive wrapper & import libraries.
# This only needs to be done once per notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
import os
# Choose a local (Colab) directory to store the data.
local_root_path = os.path.expanduser("~/data")
os.makedirs(local_root_path, exist_ok=True)
def ListFolder(google_drive_id, destination):
    file_list = drive.ListFile({'q': "'%s' in parents and trashed=false" % google_drive_id}).GetList()
    counter = 0
    for f in file_list:
        # If it is a folder, create the local directory and copy its contents into it
        if f['mimeType'] == 'application/vnd.google-apps.folder':
            folder_path = os.path.join(destination, f['title'])
            os.makedirs(folder_path, exist_ok=True)
            print('creating directory {}'.format(folder_path))
            ListFolder(f['id'], folder_path)
        else:
            fname = os.path.join(destination, f['title'])
            f_ = drive.CreateFile({'id': f['id']})
            f_.GetContentFile(fname)
            counter += 1
    print('{} files were copied to {}'.format(counter, destination))
ListFolder("1s1Ks_Gf_cW-F-RwXFjBu96svbmqiXB0o", local_root_path)
These commands connect the Colab notebook to my Google Drive and store the data in Colab. Because I have a lot of images (more than 180k), copying the data to Colab takes very, very long, and the connection sometimes breaks. I am now wondering whether it is necessary to copy all the data from my Google Drive to Colab.
If no, what do I have to do instead to work with the data from Google Drive?
If yes, is there a way to do this more efficiently?
Or is there maybe a completely different way I should work with Colab?
You can access files directly on your Google Drive without copying them into the notebook environment.
Execute this code in one cell:
from google.colab import drive
drive.mount('/content/gdrive')
And try:
!ls /content/gdrive
Now you can copy your files from/to /content/gdrive directory and they will appear in your Google Drive.
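Working directly on the mounted directory means a training loop only needs file paths and can open each image when it is needed. The generator below is a sketch of that idea; iter_batches, the batch size, and the *.jpg pattern are assumptions, and whether reading through the mount is fast enough for 180k images would have to be measured.

```python
from pathlib import Path


def iter_batches(root, batch_size, pattern="*.jpg"):
    """Yield lists of at most batch_size file paths found under root."""
    batch = []
    # Only paths are collected here; file contents are not read
    for path in sorted(Path(root).rglob(pattern)):
        batch.append(path)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch
```

A loop such as for batch in iter_batches('/content/gdrive', 64): ... would then load and process 64 images at a time; the root path should point at your own dataset folder inside the mount.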

google drive api to upload all pdfs to google drive

I am using pydrive to upload PDF files to my Google Drive folder. I want to send all *.pdf files in a local folder at once with this code, but I am not sure where to go from here. Should I use glob? If so, I would like to see an example, please.
Working code that sends one file to the designated Google Drive folder:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)  # pass the authenticated GoogleAuth object
folder_id = 'google_drive_id_goes_here'
f = drive.CreateFile({'title': 'testing_pdf',
                      'mimeType': 'application/pdf',
                      'parents': [{'kind': 'drive#fileLink', 'id': folder_id}]})
f.SetContentFile('/Users/Documents/python/google_drive/testing.pdf')
f.Upload()
You can't upload all the files at once. Creating a file with the API is a single operation, and pydrive has no mechanism for uploading more than one file per call.
You're going to have to put this in a loop and upload each file as you go.
import os
directory = 'the/directory/you/want/to/use'
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        # Join with the directory so the file is found regardless of the working directory
        with open(os.path.join(directory, filename)) as f:
            lines = f.read()
        print(lines[10])
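Applied to the question's PDF case, the loop could look like the sketch below. It reuses only the pydrive calls already shown in the question (CreateFile, SetContentFile, Upload) and takes the authenticated GoogleDrive object as a parameter; the function name upload_pdfs and the choice to use each file's base name as its title are assumptions.

```python
import glob
import os


def upload_pdfs(drive, folder_id, directory):
    """Upload every *.pdf in directory to the Drive folder folder_id.

    drive is expected to be an authenticated pydrive GoogleDrive instance.
    Returns the list of uploaded titles.
    """
    uploaded = []
    for path in sorted(glob.glob(os.path.join(directory, "*.pdf"))):
        title = os.path.basename(path)
        f = drive.CreateFile({
            "title": title,
            "mimeType": "application/pdf",
            "parents": [{"kind": "drive#fileLink", "id": folder_id}],
        })
        f.SetContentFile(path)
        f.Upload()  # one API call per file
        uploaded.append(title)
    return uploaded
```

With the question's setup this would be called as upload_pdfs(drive, folder_id, '/Users/Documents/python/google_drive').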

How to remove the file from trash in drive in colab

I use Google Drive in Colab. Basically, I do the following:
from google.colab import drive
drive.mount('/content/gdrive')
After this I can use os functions (listdir, remove) to manipulate the files. The problem is that after removing a file with os.remove, it is not actually removed but goes to the trash. I would like to remove a file completely, but so far I have not found out how to do this.
I tried to locate the file in the trash, but os.listdir('/content/gdrive/.Trash') shows nothing, even though I can see the files there in the web interface.
How can I remove the file from the trash?
It is straightforward to perform this action inside Google Colab by using the pydrive module. In order to delete all files from your Google Drive's Trash folder, code the following lines in your Google Colab notebook:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
my_drive = GoogleDrive(gauth)
After entering the authentication code and creating a valid instance of the GoogleDrive class, write:
for a_file in my_drive.ListFile({'q': "trashed = true"}).GetList():
    # print the name of the file being deleted.
    print(f"the file '{a_file['title']}' is about to get deleted permanently.")
    # delete the file permanently.
    a_file.Delete()
If you'd like to delete a specific file in Trash, you need to change the last chunk of code. Let's assume you have a file named weights-improvement-01-10.5336.hdf5 in your Trash:
for a_file in my_drive.ListFile({'q': "title = 'weights-improvement-01-10.5336.hdf5' and trashed=true"}).GetList():
    # print the name of the file being deleted.
    print(f"the file '{a_file['title']}' is about to get deleted permanently.")
    # delete the file permanently.
    a_file.Delete()
If you want to make other, perhaps more complex queries, e.g. delete a set of files that share the expression weights-improvement- in their names, or files that were all modified before a given date, see:
1) Get all files which matches the query,
2) Search for files and folders.
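As an illustration of such queries, the sketch below assembles the 'q' string for ListFile. It assumes the Drive API v2 field names title and modifiedDate (the same title field the queries above use) and an RFC 3339 timestamp for the date; build_trash_query is a hypothetical helper, and the exact matching rules of contains should be checked in the search guide.

```python
def build_trash_query(title_contains=None, modified_before=None):
    """Build a Drive search query restricted to trashed files.

    title_contains:  substring to match with the `contains` operator
    modified_before: timestamp string such as '2020-01-01T00:00:00'
    """
    clauses = ["trashed = true"]
    if title_contains is not None:
        # Escape backslashes and single quotes inside the quoted value
        escaped = title_contains.replace("\\", "\\\\").replace("'", "\\'")
        clauses.append(f"title contains '{escaped}'")
    if modified_before is not None:
        clauses.append(f"modifiedDate < '{modified_before}'")
    return " and ".join(clauses)


print(build_trash_query(title_contains="weights-improvement-"))
```

The returned string would be passed as my_drive.ListFile({'q': query}).GetList(), and each resulting file deleted with Delete() as in the loops above.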
Jess's answer of using the Google Drive API to clear the trash isn't a good approach, because you might have other data in the bin that you want to keep.
Because files move to the bin upon deletion, this neat trick reduces the file size to 0 before deleting (cannot be undone!):
import os
delete_filepath = 'drive/My Drive/Colab Notebooks/somefolder/examplefile.png'
open(delete_filepath, 'w').close()  # overwrite and make the file blank instead - ref: https://stackoverflow.com/a/4914288/3553367
os.remove(delete_filepath)  # deleting the blank file from Google Drive moves it to the bin
Also answered at: https://stackoverflow.com/a/60729089/3553367
If you're looking for code to remove files from the trash, you can check this SO post answered by Tanaike - Empty Google Drive Trash:
import httplib2
from googleapiclient import discovery
def main():
    # get_credentials() is not shown here; it must return authorized OAuth2 credentials
    credentials = get_credentials()
    http = credentials.authorize(httplib2.Http())
    service = discovery.build('drive', 'v3', http=http)
    service.files().emptyTrash().execute()
or use these methods using Pydrive:
file.Trash() - Move file to trash
file.Untrash() - Move file out of trash
file.Delete() - Permanently delete the file
We can run this code in Google Colab successfully:
from google.colab import drive
drive.mount('/root/gdrive', force_remount=True)
from google.colab import auth
auth.authenticate_user()
from googleapiclient.discovery import build
drive_service = build('drive', 'v3')
drive_service.files().emptyTrash().execute()
But if there are many files in the Google Drive trash bin, we still need to wait for over 20 minutes, sadly.
The rclone command has an immediate effect:
!curl https://rclone.org/install.sh | bash
!rclone config # to set google drive client id etc
#delete all files in trash and don't put them in trash again:
!rclone --config /root/sing/rclone.conf delete gdrive: --drive-trashed-only --drive-use-trash=false
!rclone --config /root/sing/rclone.conf ls gdrive: --drive-trashed-only --fast-list
The easiest way to delete files from the mounted Google Drive is the rm command.
You can delete a specific file using rm:
!rm /content/drive/MyDrive/somefolder/examplefile.png
Or, you can delete a whole folder:
!rm -r /content/drive/MyDrive/somefolder
Or, you can use wildcards for deleting files with specific patterns:
!rm /content/drive/MyDrive/somefolder/*.png
However, if you want to delete files from the trash, you have to use the Google Drive API.
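The wildcard form of rm shown above has a Python counterpart that also reports which files were removed. This is a sketch using pathlib; remove_matching is a hypothetical name, and like rm it acts only on the mounted directory, not on the trash.

```python
from pathlib import Path


def remove_matching(folder, pattern):
    """Delete files in folder whose names match pattern; return removed names."""
    removed = []
    for path in sorted(Path(folder).glob(pattern)):
        if path.is_file():  # leave subdirectories untouched
            path.unlink()
            removed.append(path.name)
    return removed
```

For the example above, the call would be remove_matching('/content/drive/MyDrive/somefolder', '*.png').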

Where is dumped file in Google Colab?

When I run this code in Google Colab:
import pickle
x=10;
output = open('data.pkl', 'wb')
pickle.dump(x,output)
x is saved, and I can access and read this file from another window in Google Colab, but I don't know where the file is. Does anybody know where it is?
It’s in the current directory. You can also download it back to your local machine with
from google.colab import files
files.download('data.pkl')
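To see which directory that is, you can ask Python for the absolute path of the file you just wrote. A small sketch follows; dump_and_locate is a hypothetical helper, and in Colab the working directory is typically /content.

```python
import os
import pickle


def dump_and_locate(obj, filename):
    """Pickle obj to filename and return the file's absolute path."""
    with open(filename, "wb") as output:
        pickle.dump(obj, output)
    # A relative filename is resolved against the current working directory
    return os.path.abspath(filename)
```

Calling print(dump_and_locate(10, 'data.pkl')) shows the full path, and os.getcwd() shows the directory itself.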
You can upload it to your Google drive:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# get the folder id where you want to save your file
file = drive.CreateFile({'parents':[{u'id': folder_id}]})
file.SetContentFile('data.pkl')
file.Upload()
This code fetches data.pkl from the cloud VM and uploads it permanently to your Google Drive under a specific folder.
If you choose not to specify a folder, the file will be uploaded under the root of your Google Drive.
You can save and read the dumped file anywhere in your Google Drive folder.
import pickle
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
# The paths below are relative to /content, Colab's default working directory.
# data is the object you want to persist.
with open('drive/My Drive/data.pickle', 'wb') as pick_insert:
    pickle.dump(data, pick_insert)
with open('drive/My Drive/data.pickle', 'rb') as pick_read:
    data = pickle.load(pick_read)
A saved dump can then be loaded from the same directory, as below:
from pickle import dump, load
dump(stories, open('review_dataset.pkl', 'wb'))
stories = load(open('review_dataset.pkl', 'rb'))
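The same save and load round trip can be wrapped in two small helpers that close the file handles automatically. This is a sketch; save_pickle and load_pickle are hypothetical names, and the path can point to the current directory or to a mounted Drive folder.

```python
import pickle


def save_pickle(obj, path):
    """Write obj to path; the file is closed when the block exits."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_pickle(path):
    """Read back an object previously written with save_pickle."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

For example, save_pickle(stories, 'review_dataset.pkl') followed by stories = load_pickle('review_dataset.pkl').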
In my case, I was trying to access the pickle files in a sub-directory (data) under the current (.) directory.
The data directory has 2 pickle files generated by the pre-processing step.
So I tried @korakot's suggestion from the comments, and it worked fine! This is what I did:
# connect your colab with the drive
from google.colab import drive
drive.mount('/content/drive')
# list the entries in the current directory
import os
os.listdir('.')
# move the sub-directory (data) into Google Drive
!mv /content/data/ /content/drive/MyDrive/
You can obtain the pkl file using the following statements:
from google.colab import files
files.download("model.pkl")
This works not only for pkl files; you can retrieve other data formats as well by changing the extension.
You can save your pkl file to Google Drive by using this instead:
import pickle
from google.colab import drive
drive.mount('/content/drive')
x=10;
output = open('/content/drive/MyDrive/Colab Notebooks/data.pkl', 'wb')
pickle.dump(x,output)
and open it using this code:
import pickle
from google.colab import drive
drive.mount('/content/drive')
x = pickle.load(open('/content/drive/MyDrive/Colab Notebooks/data.pkl', 'rb'))
it worked for me :)
