I am pretty new to this, so I would like code and a process for loading data from a CSV file (stored in Google Cloud Storage) into a BigQuery table using Python and Dataflow.
Thanks in advance.
There are different BigQuery client libraries depending on the language, including one for Python.
If what you want is the exact piece of code to load a CSV from Google Cloud Storage into BigQuery, this example from the documentation might work for you: "Loading CSV data into a new table".
See also "Appending to or overwriting a table with CSV data" on the same documentation page.
You can also check the library's GitHub repository for all the methods available for Python.
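For orientation, here is a minimal sketch of what such a load can look like with the google-cloud-bigquery client library. It uses a plain BigQuery load job rather than a Dataflow pipeline, and it assumes the CSV has a header row and that schema auto-detection is acceptable. The helper names gcs_uri and load_csv_to_bigquery are hypothetical, and any bucket, file, and table names passed in are your own placeholders.

```python
def gcs_uri(bucket_name, object_name):
    """Build a gs:// URI from a bucket name and an object path."""
    return f"gs://{bucket_name}/{object_name}"


def load_csv_to_bigquery(uri, table_id):
    """Run a BigQuery load job for a CSV in Cloud Storage; return the row count."""
    # Third-party dependency, imported here so the URI helper works without it:
    #   pip install google-cloud-bigquery
    from google.cloud import bigquery

    client = bigquery.Client()  # relies on application default credentials
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumption: the first row is a header
        autodetect=True,      # assumption: let BigQuery infer the schema
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # waits for the job to finish and raises if it failed
    return client.get_table(table_id).num_rows
```

A call would look like load_csv_to_bigquery(gcs_uri("your-bucket", "path/to/data.csv"), "your-project.your_dataset.your_table"), with every name replaced by your own.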
I am trying to load all the csv.gz files from this URL into Google BigQuery. What is the best way to do this?
I tried using PySpark to read the csv.gz files (as I need to perform some data cleaning on them), but I realized that PySpark doesn't support reading files directly from a URL. Would it make sense to load cleaned versions of the csv.gz files into BigQuery, or should I load the raw, original csv.gz files into BigQuery and do the cleaning in BigQuery itself?
I was reading the book "Google BigQuery: The Definitive Guide", and it suggests loading the data into Google Cloud Storage. Do I have to load each csv.gz file into Google Cloud Storage, or is there an easier way to do this?
Thanks for your help!
As @Samuel mentioned, you can use the curl command to download the files from the URL and then copy them to a GCS bucket.
If you have heavy transformations to apply to the data, I would recommend Cloud Dataflow; otherwise you can go with a Cloud Dataprep workflow, and finally export your clean data to a BigQuery table.
Whether to do the transformations in BigQuery depends entirely on your use case, data size, and budget; i.e., if you have a high volume of data, transforming it directly in BigQuery could be costly.
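If the cleaning turns out to be light, the step between downloading and copying to GCS can be done with the Python standard library. The sketch below is only an illustration: the question does not say what cleaning is needed, so it assumes UTF-8 files and treats "cleaning" as trimming whitespace and dropping blank rows. The function names are hypothetical.

```python
import csv
import gzip
import io


def clean_gzipped_csv(raw_bytes):
    """Decompress gzipped CSV bytes, trim each cell, and drop blank rows."""
    text = gzip.decompress(raw_bytes).decode("utf-8")  # assumption: UTF-8 data
    cleaned = []
    for row in csv.reader(io.StringIO(text)):
        row = [cell.strip() for cell in row]
        if any(row):  # skip rows in which every cell is empty
            cleaned.append(row)
    return cleaned


def to_csv_text(rows):
    """Serialize rows back to CSV text, ready to be copied to a GCS bucket."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

The raw bytes can come from a file that curl has already downloaded, for example open("file.csv.gz", "rb").read().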
I wrote a small Python program that works with data from a CSV file. I am tracking some numbers in a Google Sheet, and I created the CSV file by downloading the sheet. I am trying to find a way to have Python read the CSV data directly from Google Sheets, so that I do not have to download a new CSV every time I update the spreadsheet.
I see that the requests library may be able to handle this, but I'm having a hard time figuring it out. I've chosen not to try the Google APIs because this way seems simpler, as long as I don't mind making the sheet public to those with the link, which is fine.
I've tried working from the requests documentation, but I'm a novice programmer and I can't get it to read the data in as a CSV.
This is how the data is currently taken into python:
import csv

file = open('data1.csv', newline='')
reader = csv.reader(file)
Ideally, I would like to replace the file = open() call with the requests library and pull directly from the spreadsheet.
You need to find the correct request URL that downloads the file.
Sample URL:
csv_url='https://docs.google.com/spreadsheets/d/169AMdEzYzH7NDY20RCcyf-JpxPSUaO0nC5JRUb8wwvc/export?format=csv&id=169AMdEzYzH7NDY20RCcyf-JpxPSUaO0nC5JRUb8wwvc&gid=0'
The way to do it is to download your file manually while inspecting the request URL in the Network tab of your browser's Developer Tools.
Then the following is enough:
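import requests

csv_url = 'YOUR_CSV_DOWNLOAD_URL'  # placeholder: paste the export URL you found
res = requests.get(csv_url)
res.raise_for_status()  # stop here on an HTTP error
with open('google.csv', 'wb') as f:  # the with block closes the file for us
    f.write(res.content)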
It will save the CSV file under the name 'google.csv' in the current working directory (usually the folder of your Python script).
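To feed the data straight into csv.reader without writing a file, as the question asks, the response text can be wrapped in an in-memory buffer. This is a minimal sketch, assuming the sheet is shared with anyone who has the link and that the export is UTF-8 encoded; rows_from_csv_text and fetch_sheet_rows are hypothetical helper names.

```python
import csv
import io


def rows_from_csv_text(text):
    """Parse CSV text into a list of rows, as csv.reader would from a file."""
    return list(csv.reader(io.StringIO(text, newline="")))


def fetch_sheet_rows(csv_url):
    """Download the sheet's CSV export and parse it in memory."""
    # Third-party dependency, imported here so the parser works without it:
    #   pip install requests
    import requests

    res = requests.get(csv_url)
    res.raise_for_status()  # avoid parsing an error page as CSV
    res.encoding = "utf-8"  # assumption: the export is UTF-8
    return rows_from_csv_text(res.text)
```

The returned list can be looped over exactly like the reader object in the original program.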
import pandas as pd
import requests

YOUR_SHEET_ID = ''  # the ID from your sheet's URL
r = requests.get(f'https://docs.google.com/spreadsheet/ccc?key={YOUR_SHEET_ID}&output=csv')
r.raise_for_status()  # stop here on an HTTP error
with open('dataset.csv', 'wb') as f:
    f.write(r.content)
df = pd.read_csv('dataset.csv')
print(df.head())
I tried @adirmola's solution, but I had to tweak it a little.
He has a point that you need to find the correct URL for the download request. An easy way to get one is what I'm showing here: add "&output=csv" after your Google Sheet ID.
Hope it helps!
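The intermediate dataset.csv file is optional, since pandas.read_csv accepts a URL directly. Below is a minimal sketch, assuming the sheet is shared with anyone who has the link. It builds the URL from the export pattern in the sample URL quoted earlier in this thread rather than from the "&output=csv" form; sheet_csv_url and read_sheet are hypothetical helper names.

```python
def sheet_csv_url(sheet_id, gid=0):
    """Build a CSV export URL for one tab (gid) of a Google Sheet."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=csv&gid={gid}"
    )


def read_sheet(sheet_id, gid=0):
    """Load one tab of a link-shared Google Sheet into a DataFrame."""
    # Third-party dependency, imported here so the URL helper works without it:
    #   pip install pandas
    import pandas as pd

    return pd.read_csv(sheet_csv_url(sheet_id, gid))
```

The gid value identifies the tab and appears in the sheet's address bar; 0 is normally the first tab.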
I'm not exactly sure about your usage scenario, and Adirmola already provided a very exact answer to your question, but my immediate question is why you want to download the CSV in the first place.
Google Sheets has a Python library, so you can get the data from the sheet directly.
You may also be interested in this answer, since you want to watch for changes in Google Sheets.
I would just like to say that using OAuth keys and the Google Python API is not always an option. I found the above quite useful for my current application.
I have a python program which outputs JSON files. I want to get the JSON files into google sheets.
I looked for a way to upload JSON files directly to google sheets, and couldn't find a way.
This prompted me to look for a way to store my JSON files online, so google sheets could use an API to call the JSON data.
I have tried using Google Cloud Platform, but I could not find a way to call the JSON data from Google Cloud Platform into Google Sheets. I looked into a few other web-based services that offer storage and API services at low or no cost, but I could not find any. I am fairly proficient in Python, but that's the extent of my programming knowledge.
At this point, I am at a loss as far as a method of getting my JSON data into a google spreadsheet. Any and all advice/suggestions are welcome and appreciated, and I am glad to answer any questions.
I would use tablib (https://pypi.org/project/tablib/0.9.3/) to convert it from JSON to XLS. Then you can open it directly in Google Sheets.
Edit: I found a video that shows how to write dictionary-structured data to CSV.
https://www.youtube.com/watch?v=s1XiCh-mGCA
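The dictionary-to-CSV route can also be sketched with only the standard library, with no tablib involved. The sketch assumes the JSON file holds an array of flat objects (nested values would need flattening first); json_records_to_csv is a hypothetical helper name.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)
    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()
```

Writing the returned text to a .csv file gives something Google Sheets can import through File > Import.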
Is there any way to load an Excel file directly into BigQuery, instead of converting it to CSV?
I receive the files every day in Excel format and need to load them into BigQuery. Right now I convert them to CSV manually and then load them into BigQuery.
I am planning to schedule the job.
If it is not possible to load Excel files directly into BigQuery, then I need to write a process (in Python) to convert them to CSV before loading.
Please let me know if there are any better options.
Thanks,
I think you could achieve the above in a few clicks, without any code.
You need to use Google Drive and external (federated) tables.
1) Upload your Excel files to Google Drive manually, or synchronise them.
2) In the Google Drive settings, find the option
"Convert uploads: [x] Convert uploaded files to Google Docs editor format"
and check it.
To access this option, go to https://drive.google.com/drive/my-drive, click the gear (settings) icon, and then choose Settings.
Your Excel files will now be accessible to BigQuery.
3) Last part: https://cloud.google.com/bigquery/external-data-drive
You can reference your Excel file by its Drive URI (see https://cloud.google.com/bigquery/external-data-drive#drive-uri) and then create the table manually using that URI.
You could also do this last step through the API.
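If the Drive route is not an option, the Python conversion step described in the question can be sketched as follows. It assumes pandas with openpyxl is installed and that the first worksheet holds the data; csv_path_for and excel_to_csv are hypothetical helper names.

```python
from pathlib import Path


def csv_path_for(excel_path):
    """Return the .csv path that sits next to the given Excel file."""
    return Path(excel_path).with_suffix(".csv")


def excel_to_csv(excel_path, sheet_name=0):
    """Convert one worksheet of an Excel file to CSV; return the new path."""
    # Third-party dependencies, imported here so the path helper works alone:
    #   pip install pandas openpyxl   (openpyxl is what reads .xlsx files)
    import pandas as pd

    frame = pd.read_excel(excel_path, sheet_name=sheet_name)
    out_path = csv_path_for(excel_path)
    frame.to_csv(out_path, index=False)  # index=False avoids an extra column
    return out_path
```

The resulting CSV can then go through the same load step that is currently done by hand, and the script can be run by whatever scheduler you choose.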
I was testing the Dropbox-provided API for Python. My goal was to read a spreadsheet in my Dropbox without downloading it to local storage.
import dropbox

dbx = dropbox.Dropbox('my-token')
print(dbx.users_get_current_account())
fl = dbx.files_get_preview('/CGPA.xlsx')[1]  # returns a Response object
After the above code, reading fl.text gives HTML output showing the preview that would be seen if the file were opened in a browser, and the data can be parsed from it.
My question is whether the SDK has a built-in method for getting specific information from the spreadsheet, such as the data of a row or a cell, preferably in JSON format. I previously used butterdb to extract data from a Google Drive spreadsheet. Is there such functionality for Dropbox? I could not work it out from the docs: http://dropbox-sdk-python.readthedocs.io/en/master/
No, the Dropbox API doesn't offer the ability to selectively query parts of a spreadsheet file like this without downloading the whole file, but we'll consider it a feature request.
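As a workaround, the whole file can be downloaded into memory (nothing is written to local disk) and parsed there. This is a sketch rather than an SDK feature: it assumes openpyxl is installed, that the data is on the active worksheet, and that the first row is a header. rows_to_json and read_dropbox_sheet are hypothetical helper names.

```python
import io
import json


def rows_to_json(rows):
    """Turn rows (first row = header) into a JSON array of objects."""
    header, *body = rows
    return json.dumps([dict(zip(header, row)) for row in body])


def read_dropbox_sheet(token, path):
    """Download a spreadsheet from Dropbox into memory and return its rows."""
    # Third-party dependencies, imported here so rows_to_json works alone:
    #   pip install dropbox openpyxl
    import dropbox
    from openpyxl import load_workbook

    dbx = dropbox.Dropbox(token)
    _metadata, response = dbx.files_download(path)  # fetches the whole file
    workbook = load_workbook(io.BytesIO(response.content), read_only=True)
    sheet = workbook.active  # assumption: the data is on the active sheet
    return [list(row) for row in sheet.iter_rows(values_only=True)]
```

For the file in the question, rows_to_json(read_dropbox_sheet('my-token', '/CGPA.xlsx')) would give the rows as JSON, provided every cell value is JSON-serializable (dates, for example, are not).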