Loading csv.gz from URL to BigQuery - Python

I am trying to load all the csv.gz files from this URL into Google BigQuery. What is the best way to do this?
I tried using PySpark to read the csv.gz files (as I need to perform some data cleaning on them), but I realized that PySpark doesn't support reading files directly from a URL. Would it make sense to load the cleaned versions of the csv.gz files into BigQuery, or should I dump the raw, original csv.gz files into BigQuery and perform my cleaning process in BigQuery itself?
I was reading the "Google BigQuery: The Definitive Guide" book, and it suggests loading the data into Google Cloud Storage first. Do I have to load each csv.gz file into Google Cloud Storage individually, or is there an easier way to do this?
Thanks for your help!

As @Samuel mentioned, you can use the curl command to download the files from the URL and then copy them to a GCS bucket.
If you have heavy transformations to perform on the data, I would recommend Cloud Dataflow; otherwise you can go with a Cloud Dataprep workflow, and finally export your clean data to a BigQuery table.
Choosing BigQuery for the transformations depends entirely on your use case, data size, and budget, i.e., if you have a high data volume, transforming it directly in BigQuery could be costly.
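A minimal sketch of that download-and-load flow, assuming the requests, google-cloud-storage, and google-cloud-bigquery libraries are installed and credentials are configured; the source URL, bucket, and table names below are placeholders:

import requests
from google.cloud import bigquery, storage

SOURCE_URL = "https://example.com/data/file1.csv.gz"  # hypothetical URL
BUCKET_NAME = "my-staging-bucket"                     # hypothetical bucket
TABLE_ID = "my-project.my_dataset.my_table"           # hypothetical table

# Fetch the csv.gz and stage it in GCS (for very large files you would
# stream instead of buffering the whole response in memory).
resp = requests.get(SOURCE_URL)
resp.raise_for_status()
bucket = storage.Client().bucket(BUCKET_NAME)
bucket.blob("staging/file1.csv.gz").upload_from_string(
    resp.content, content_type="application/gzip"
)

# Load it into BigQuery; gzipped CSV in GCS is decompressed automatically.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
bigquery.Client().load_table_from_uri(
    f"gs://{BUCKET_NAME}/staging/file1.csv.gz", TABLE_ID, job_config=job_config
).result()  # wait for the load job to finish

Repeating this per file means no csv.gz ever has to sit on a local disk; any heavier cleaning would then happen in Dataflow or Dataprep between the two steps.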

Related

Upload data to Google bucket with Kaggle API and use it in Colab

I want to use Kaggle datasets from a Google bucket when using Colab.
First: is there a way to upload Kaggle datasets directly to a Google bucket via the Kaggle API?
Second: how do I use data in a Google bucket from Colab without copying it to the notebook?
At the moment, my experience with using a Google bucket from Colab is through a URI for audio transcription, such as this:
gcs_uri = 'gs://bucket_name/file_name.wav'
audio = types.RecognitionAudio(uri=gcs_uri)
I'm guessing I can also do something similar to load data into a pandas DataFrame directly from a URI. My experience with the Kaggle API is on my local machine, for example:
kaggle competitions download -c petfinder-adoption-prediction
which downloads the data using the Kaggle API. If I load data into a Colab notebook, it is removed between sessions, so my intention in using a Google bucket is to have the data available across multiple sessions.
You could try this solution for your first issue. I'm not sure whether wget works with the dataset you need, but it suggests it's possible. Note that this isn't via the Kaggle API, though.
For the second question, how to use the data without copying it to the notebook: you can actually mount the bucket as a disk on your instance and then access the data directly.
Putting the two together, you could have the bucket mounted locally and move the data into it; then you can access it from the notebook.
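For the pandas part you guessed at, here is a minimal sketch, assuming the gcsfs package is available in the Colab runtime (pandas uses it to resolve gs:// URIs); the bucket and file names are placeholders:

from google.colab import auth
import pandas as pd

# Authenticate the Colab session so it can read your private bucket.
auth.authenticate_user()

# pandas reads gs:// URIs through gcsfs, streaming the data from the bucket
# instead of copying it onto the notebook VM first.
df = pd.read_csv("gs://bucket_name/petfinder/train.csv")
print(df.head())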

Python code to load CSV data from Google Storage to BigQuery?

I am pretty new to this, so I wanted to see the code and process for loading data from a CSV file (placed in Google Storage) into a BigQuery table using Python and Dataflow.
Thanks in advance.
There are different BigQuery client libraries depending on the language; for Python it is google-cloud-bigquery.
But if what you want is the exact piece of code to upload a CSV from Google Cloud Storage to BigQuery, the example "Loading CSV data into a new table" might work for you.
The same documentation page also covers "Appending to or overwriting a table with CSV data".
You can also go to the GitHub repository to check all the methods available for Python.
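For reference, a sketch along the lines of that "Loading CSV data into a new table" sample, assuming google-cloud-bigquery is installed; the table ID and bucket URI are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"  # hypothetical table ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # let BigQuery infer the schema
)

uri = "gs://your-bucket/your-file.csv"  # hypothetical source file
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # block until the load job completes

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")

For a job this simple the client library alone is enough; Dataflow only pays off once you need real transformations between the file and the table.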

Trying to get JSON data from python application to google sheets

I have a python program which outputs JSON files. I want to get the JSON files into google sheets.
I looked for a way to upload JSON files directly to google sheets, and couldn't find a way.
This prompted me to look for a way to store my JSON files online, so google sheets could use an API to call the JSON data.
I have tried using Google Cloud Platform, but I could not find a way to call the JSON data from Google Cloud Platform into Google Sheets. I looked into a few other web-based services that offer storage and API access at low or no cost, but I could not find any. I am fairly proficient in Python, but that's the extent of my programming knowledge.
At this point, I am at a loss as far as a method of getting my JSON data into a google spreadsheet. Any and all advice/suggestions are welcome and appreciated, and I am glad to answer any questions.
I would use this library, https://pypi.org/project/tablib/0.9.3/,
to convert it from JSON to xls. Then you can open it directly in Google Sheets.
Edit: found a video which shows how to write dictionary-structured data to CSV:
https://www.youtube.com/watch?v=s1XiCh-mGCA
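A minimal sketch of the tablib route, assuming tablib is installed with xls support (pip install "tablib[xls]"); the JSON payload here is made up:

import tablib

# Made-up JSON output, standing in for what your program produces:
# an array of flat objects, one per row.
json_string = '[{"name": "Alice", "score": 10}, {"name": "Bob", "score": 7}]'

dataset = tablib.Dataset()
dataset.load(json_string, format="json")  # parse the JSON into rows

# Write an .xls file that Google Sheets can open or import directly.
with open("output.xls", "wb") as f:
    f.write(dataset.export("xls"))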

BigQuery: loading Excel file

Is there any way to load an Excel file directly into BigQuery, instead of converting it to CSV first?
I get the files every day in Excel format and need to load them into BigQuery. Right now I am converting them to CSV manually and loading that into BigQuery.
I am planning to schedule the job.
If it's not possible to load the Excel files directly into BigQuery, then I need to write a process (in Python) to convert them to CSV before loading into BigQuery.
Please let me know if there are any better options.
Thanks,
I think you could achieve the above in a few clicks, without any code.
You need to use Google Drive and external (federated) tables.
1) You could upload your Excel files to Google Drive manually, or synchronise them.
2) In Google Drive Settings find:
"**Convert uploads** [x] Convert uploaded files to Google Docs editor format"
and check it.
To access above option go to https://drive.google.com/drive/my-drive, click on the Gear settings icon and then choose Settings.
Now your Excel files will be accessible to BigQuery.
3) Last part: https://cloud.google.com/bigquery/external-data-drive
You can reference your Excel file by URI (https://cloud.google.com/bigquery/external-data-drive#drive-uri) and then create the table manually using that URI.
You could also do the last step via the API, as sketched below.
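A sketch of that API step with the Python client, assuming google-cloud-bigquery is installed and the client's credentials include the Drive scope; the spreadsheet URI and table name are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Point an external (federated) table at the converted spreadsheet in Drive.
external_config = bigquery.ExternalConfig("GOOGLE_SHEETS")
external_config.source_uris = [
    "https://docs.google.com/spreadsheets/d/your-file-id"  # hypothetical URI
]
external_config.autodetect = True  # infer the schema from the sheet

table = bigquery.Table("your-project.your_dataset.excel_data")
table.external_data_configuration = external_config
client.create_table(table)  # queries now read live from the Drive file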

Use mutagen to load metadata of audio file from google drive

I want to write a program, in Python, that builds a database of my audio files in Google Drive. Does anyone know of a method to retrieve the audio metadata of a file from the Google Drive API? What I want to do is, using the file's ID from the Drive API, load the file into memory and use Mutagen to read the metadata. My problem is how to load the file from the Drive API. If possible, I would also like to load only the part of the file containing the metadata, not the audio itself. From my understanding, I am also not sure whether Mutagen can load a file that is already in memory.
Nope, it's not possible using the Drive API.
The metadata from files.get() is the best you can get from the API. For further information about a file in Drive, like audio metadata, you will have to download the entire file and work with the local copy.
You can obtain the metadata of a file in your Google Drive using Files.get. You can try it in the API Explorer. In the fields section, click "Use fields editor" and tick "all" so it returns all the info about the file.
This is what Drive has to offer. Some audio metadata may not be available, as Drive was not built for that niche.
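A minimal sketch of the download-then-parse approach, assuming google-api-python-client and mutagen are installed, service is an authorized Drive v3 client, and the file ID is a placeholder. Recent Mutagen versions do accept file-like objects, so the copy can at least stay in memory rather than on disk:

import io

import mutagen
from googleapiclient.http import MediaIoBaseDownload

def audio_metadata(service, file_id):
    # The Drive API has no way to fetch only the tag region, so download
    # the whole file into an in-memory buffer.
    buf = io.BytesIO()
    request = service.files().get_media(fileId=file_id)
    downloader = MediaIoBaseDownload(buf, request)
    done = False
    while not done:
        _, done = downloader.next_chunk()
    buf.seek(0)
    return mutagen.File(buf)  # parse the tags from the in-memory copy

# Hypothetical usage: meta = audio_metadata(service, "your-file-id")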