Loading Yelp Kaggle Dataset on Google Colab Taking Too Long - python

I am having difficulty loading the Yelp Dataset downloaded from Kaggle:
https://www.kaggle.com/yelp-dataset/yelp-dataset
I downloaded the zip file from Kaggle directly into my local drive under my Desktop folder and extracted all the files from it.
I was able to upload 4 of the 5 JSON files from the extracted folder, but one of them would not upload at all (it just keeps uploading forever):
yelp_academic_dataset_review.JSON
This file is around 7 GB, which seems to be simply too large to upload to Google Colab this way.
I also tried uploading the file from my Google Drive instead.
Is there a way around this?
I couldn’t even read the data from this JSON file since I couldn’t upload it.
I tried this code:
from google.colab import files
uploaded = files.upload()
The problem is that nothing then happens for hours on end and the data never loads.
Is there any way to bypass this?
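One workaround, if the extracted file already sits in Google Drive (a sketch under that assumption; the path and the two example columns below are placeholders), is to mount Drive in the Colab runtime instead of pushing 7 GB through files.upload(), and then read the newline-delimited JSON in chunks:

from google.colab import drive
import pandas as pd

# Mount Google Drive so the large file can be read in place, with no browser upload.
drive.mount('/content/drive')

# Placeholder path: adjust to wherever the extracted file sits in your Drive.
path = '/content/drive/MyDrive/yelp/yelp_academic_dataset_review.json'

# The review dump is newline-delimited JSON (one review per line), so it can be
# streamed in chunks; keep only the columns you actually need to save memory.
chunks = pd.read_json(path, lines=True, chunksize=100_000)
reviews = pd.concat(chunk[['stars', 'text']] for chunk in chunks)

Reading in chunks keeps peak memory far lower than trying to load the whole 7 GB file at once.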

Related

How to export a dataframe to csv on local desktop

I have created a dataframe from an existing file. Now I am trying to download it onto my local desktop with the code shown below:
data.to_csv(r'C:\Users\pmishr50\Desktop\Skills\python\new.csv')
The code doesn't show any error, but I can't find my file at the given path.
I have found answers for downloading the data, and for downloading it to Google Drive, but I want to download the data to the path mentioned here.
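If the notebook is running on Colab (or any other remote machine), to_csv with a C:\ path writes to the remote file system, not to the local desktop, which would explain why the file never appears. A minimal sketch assuming Colab: write the CSV on the runtime first, then trigger a browser download so it ends up on the local machine.

from google.colab import files

# Write the CSV inside the Colab runtime's file system first...
data.to_csv('new.csv', index=False)

# ...then push it to the browser; it lands in the local Downloads folder.
files.download('new.csv')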

Uploaded files to azure blob storage are empty or not even present

I can upload files to Azure, but sometimes the files I send to Azure are empty or not even present. I use iothub_client (from azure-iothub-device-client).
First I create a series of CSV and JSON files from an SQL database, which I temporarily store on my PC. These files are then processed by the upload client and uploaded asynchronously to a storage blob.
Basically the structure is:
-- ID
---- csv file
---- csv file
---- json file
Most of the time this works without any problems, but sometimes something goes wrong: one CSV file contains only the header row (though never all of them), and/or the JSON file is missing entirely. The order of uploading is:
csv file one
json file
csv file two
The files are, however, always created correctly and stored on my PC. The code does not give any errors, so the iothub_client seems to be happy with what it's getting as input.
I can't figure out why this goes wrong, as I have not been able to reproduce the error. Retrying to upload the same files results in a correctly executed upload procedure.
Any clues about what can be the cause would be very much appreciated!
I run into this a lot. My best guess is that something has a lock on the file. I ran into this with a logic app that scanned a SharePoint directory every five minutes: it would grab an Excel file and move it to blob storage, but I kept getting 0 KB files. I haven't seen the issue since changing the logic app to check every ten minutes instead, which increases the odds that a user will not have the file open.
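If the root cause really is another process holding a lock, one defensive option on the sending side is to check that each file can be opened and is non-empty before handing it to the upload client, retrying a few times otherwise. A rough sketch, independent of iothub_client (the upload call at the end is a placeholder for your existing flow):

import os
import time

def wait_until_ready(path, attempts=5, delay=2.0):
    # Return True once the file is non-empty and can be opened for reading.
    # On Windows, a file held exclusively by another process may raise
    # PermissionError here, which is exactly the situation we want to wait out.
    for _ in range(attempts):
        try:
            with open(path, 'rb'):
                if os.path.getsize(path) > 0:
                    return True
        except OSError:
            pass
        time.sleep(delay)
    return False

# if wait_until_ready(csv_path):
#     upload_to_blob(csv_path)  # placeholder for the actual iothub_client upload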

Upload data to google bucket with kaggle api and use it in colab

I want to use Kaggle datasets from a Google bucket when using Colab.
First: is there a way to directly upload Kaggle datasets to a Google bucket via the Kaggle API?
Second: how do I use data in a Google bucket from Colab without copying it to the notebook?
At the moment my experience with using google bucket with colab is through a URI for audio transcription such as this:
gcs_uri = 'gs://bucket_name/file_name.wav'
audio = types.RecognitionAudio(uri=gcs_uri)
I'm guessing I can do something similar to load data into a pandas DataFrame directly from a URI. My experience with the Kaggle API is on my local machine, for example:
kaggle competitions download -c petfinder-adoption-prediction
which downloads the data using the Kaggle API. If I load data into a Colab notebook, it is removed between sessions, so my intention in using a Google bucket is to have the data available across multiple sessions.
You could try this solution for your first issue. I'm not sure wget works with the dataset you need, but it suggests it's possible. Note that this isn't via the Kaggle API.
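For the first question, a rough alternative sketch that does go through the Kaggle CLI: download to the Colab VM first and then copy the files into the bucket with gsutil (so not a direct Kaggle-to-bucket upload). This assumes kaggle.json is already in ~/.kaggle/ and the bucket exists:

!pip install -q kaggle
!kaggle competitions download -c petfinder-adoption-prediction -p /content/data
!gsutil -m cp -r /content/data gs://bucket_name/petfinder/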
For the second question (how to use the data without copying it to the notebook): you can actually mount the bucket as a disk on your instance and access the data directly.
Putting them together: mount the bucket locally, move the data into it, and then access it from the notebook.
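As a concrete illustration of reading from the bucket without copying anything into the notebook, pandas can also read straight from a gs:// URI once the session is authenticated; this needs the gcsfs package, and the bucket and file names below are placeholders:

# Authenticate the Colab session so it can reach the bucket.
from google.colab import auth
auth.authenticate_user()

import pandas as pd  # gs:// support in read_csv requires the gcsfs package

# Placeholder URI: replace with your own bucket and object name.
df = pd.read_csv('gs://bucket_name/train.csv')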

BigQuery: loading excel file

Is there any way to load an Excel file directly into BigQuery, instead of converting it to CSV first?
I receive the files every day in Excel format and need to load them into BigQuery. Right now I convert them to CSV manually and then load them into BigQuery.
I am planning to schedule the job.
If it is not possible to load the Excel files directly into BigQuery, then I need to write a process (Python) to convert them to CSV before loading into BigQuery.
Please let me know if there are any better options.
Thanks,
I think you could achieve the above in a few clicks, without any code.
You need to use Google Drive and external (federated) tables.
1) You can manually upload your Excel files to Google Drive, or synchronise them.
2) In Google Drive Settings find:
"**Convert uploads** [x] Convert uploaded files to Google Docs editor format"
and check it.
To access this option, go to https://drive.google.com/drive/my-drive, click on the gear (settings) icon and then choose Settings.
Now your Excel files will be accessible to BigQuery.
3) Last part: https://cloud.google.com/bigquery/external-data-drive
You can access your Excel file by URI (https://cloud.google.com/bigquery/external-data-drive#drive-uri) and then create the table manually using that URI.
You could also do the last step via the API.
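If the scripted route from the question is preferred over federated Drive tables, a minimal sketch of that Python process (read the Excel file with pandas and load the frame into BigQuery, skipping the intermediate CSV) could look like this; the file, project, dataset and table names are placeholders:

import pandas as pd
from google.cloud import bigquery

# Placeholder names: adjust to your own file, project and dataset.
df = pd.read_excel('daily_report.xlsx')  # needs openpyxl for .xlsx files

client = bigquery.Client(project='my-project')
table_id = 'my-project.my_dataset.daily_report'

# Loading the DataFrame directly avoids writing an intermediate CSV at all.
job = client.load_table_from_dataframe(df, table_id)
job.result()  # wait for the load job to finish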

Use mutagen to load metadata of audio file from google drive

I want to write a program that builds a database of my audio files in Google Drive. The language is Python. Does anyone know of a method to retrieve the audio metadata of a file through the Google Drive API? What I want to do is: using the ID of the file from the Google Drive API, load the file into memory and use Mutagen to read the metadata. My problem is how to load the file from the Google Drive API. If possible, I would also like to load only the part of the file containing the metadata, not the audio itself. From my understanding, I am also not sure whether Mutagen can load a file that is already in memory.
Nope, it's not possible using the Drive API.
The metadata in files.get() is the best you can get from the API. To access further information about a file in Drive, like audio metadata, you will have to download the entire file and work with the local copy.
You can obtain the metadata of a file from your Google Drive using Files.get. You can try it here in the API Explorer. In the fields section, click "Use fields editor" and tick "all" so it returns all info about the file.
This is what Drive has to offer. Some audio metadata may not be available as Drive was not built for that niche.
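To illustrate the "download the entire file and work with the local copy" route, here is a sketch that pulls the file's bytes into an in-memory buffer with the Drive API client and hands them to Mutagen (recent Mutagen versions accept file-like objects). The file ID is a placeholder and the credentials object is assumed to come from your existing auth flow:

import io
import mutagen
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

# 'creds' is assumed to come from your existing Google auth flow.
service = build('drive', 'v3', credentials=creds)
file_id = 'YOUR_FILE_ID'  # placeholder

# Download the whole file into memory (the Drive API cannot serve only the tag portion).
request = service.files().get_media(fileId=file_id)
buffer = io.BytesIO()
downloader = MediaIoBaseDownload(buffer, request)
done = False
while not done:
    _, done = downloader.next_chunk()

buffer.seek(0)
audio = mutagen.File(buffer)  # Mutagen can read from a file-like object here
if audio is not None:
    print(audio.info.length, audio.tags)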
