I want to read a Google Sheet with multiple sheets into one (or several) pandas dataframes.
I don't know the sheet names or the number of sheets in advance.
The trivial attempt fails:
def main():
    path = r"https://docs.google.com/spreadsheets/d/1-MlSisrAxhOyKhrz6S08PG68j667Ym7jGExOyytpCSM/edit?usp=sharing"
    pd.read_excel(path)
fails with
ValueError: Excel file format cannot be determined, you must specify an engine manually.
Specifying an engine explicitly doesn't work either.
All answers to this question refer to .csv, meaning a single sheet, or knowing the sheet name in advance.
Same goes for the 1st Google hit for "read google sheet python pandas".
Is there a standard way of doing this?
If your Spreadsheet is publicly shared, how about the following sample script?
Sample script:
import openpyxl
import pandas as pd
import requests
from io import BytesIO
spreadsheetId = "###" # Please set your Spreadsheet ID.
url = "https://docs.google.com/spreadsheets/export?exportFormat=xlsx&id=" + spreadsheetId
res = requests.get(url)
data = BytesIO(res.content)
xlsx = openpyxl.load_workbook(filename=data)
for name in xlsx.sheetnames:
    values = pd.read_excel(data, sheet_name=name)
    # do something
In this sample script, the publicly shared Spreadsheet is exported as XLSX data. The exported XLSX data is then opened, the sheet names are retrieved, and each sheet is read into a dataframe.
If you want to retrieve only specific sheets, please filter the sheet names from xlsx.sheetnames.
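As a possible variation on the script above, pandas can also parse every sheet in one call: passing sheet_name=None to pd.read_excel returns a dict that maps each sheet name to its dataframe. The following is a minimal sketch, assuming the Spreadsheet is publicly shared; the helper names export_url and read_all_sheets are made up for this illustration and are not part of any library.

```python
from io import BytesIO


def export_url(spreadsheet_id):
    """Build the XLSX export URL used in the sample script above."""
    return (
        "https://docs.google.com/spreadsheets/export"
        f"?exportFormat=xlsx&id={spreadsheet_id}"
    )


def read_all_sheets(spreadsheet_id):
    """Download a publicly shared Spreadsheet and return {sheet name: dataframe}."""
    # Third-party imports are kept local so export_url() works without them.
    import pandas as pd
    import requests

    res = requests.get(export_url(spreadsheet_id), timeout=30)
    res.raise_for_status()  # stop early on an HTTP error response
    # sheet_name=None asks pandas to parse every sheet in the workbook.
    return pd.read_excel(BytesIO(res.content), sheet_name=None, engine="openpyxl")
```

Filtering then becomes a dict operation, for example keeping only the entries of read_all_sheets("###") whose key is in a list of wanted sheet names.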
Note:
If your Spreadsheet is not publicly shared, this thread might be useful.
Related
Do I need read_excel GoogleSheet for doing further search action on its columns in Python?
I must gather data from the entire Google Sheet file. I need to search by sheet name first, then gather information by looking up the values in columns.
I started by looking up the two popular solutions on the internet:
The first one uses the gspread package; as it relies on service_account.json info, I will not use it.
The second one is appropriate for me, but it shows how to export as a CSV file. I need to get the data as an XLSX file.
The code is below:
import pandas as pd
sheet_id=" url "
sheet_name="sample_1"
url=f"https://docs.google...d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
I have both sheet_id and sheet_name, but I need to export as an XLSX file.
Here I see an example of how to read an Excel file. Is there a way to read a Google spreadsheet as an Excel file?
Using Pandas to pd.read_excel() for multiple worksheets of the same workbook
xls = pd.ExcelFile('excel_file_path.xls')
# Now you can list all sheets in the file
xls.sheet_names
# ['house', 'house_extra', ...]
# to read just one sheet to dataframe:
df = pd.read_excel(file_name, sheet_name="house")
I have no problem reading a google sheet using the method I found here:
Python Read In Google Spreadsheet Using Pandas
spreadsheet_id = "<INSERT YOUR GOOGLE SHEET ID HERE>"
url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?format=csv"
df = pd.read_csv(url)
df.to_excel("my_sheet.xlsx")
You need to set the permissions of your sheet though. I found that setting it to "anyone with a link" worked.
UPDATE - based on comments below
If your spreadsheet has multiple tabs and you want to read anything other than the first sheet, you need to specify a sheetID as described here
spreadsheet_id = "<INSERT YOUR GOOGLE spreadsheetId HERE>"
sheet_id = "<INSERT YOUR GOOGLE sheetId HERE>"
url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?gid={sheet_id}&format=csv"
df = pd.read_csv(url)
df.to_excel("my_sheet.xlsx")
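A small helper can assemble these export URLs. This is only a sketch: csv_export_url and gid_from_url are hypothetical helper names, and it assumes the sheetId is the number that appears after gid= in the browser's address bar while the tab is open.

```python
import re
from urllib.parse import urlencode


def csv_export_url(spreadsheet_id, gid=None):
    """Build the CSV export URL, optionally for one tab identified by its gid."""
    params = {}
    if gid is not None:
        params["gid"] = gid
    params["format"] = "csv"
    return (
        f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?"
        + urlencode(params)
    )


def gid_from_url(url):
    """Return the gid found in a browser URL such as .../edit#gid=123, or None."""
    match = re.search(r"[#&?]gid=(\d+)", url)
    return int(match.group(1)) if match else None
```

The resulting URL can be passed to pd.read_csv exactly as in the snippets above.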
Can't seem to find any answer to this, but are there any functions/methods which can get a worksheet ID?
Currently, my code looks like this:
scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
....code to authorize credentials goes here....
sheet = client.open(str(self.googleSheetFile)).worksheet(str(self.worksheet))
client.import_csv('abcdefg1234567abcdefg1234567', contents)
but I don't want to hardcode the abcdefg1234567abcdefg1234567. Is there anything I can do, like sheet.id()?
I believe your goal is as follows.
In order to use import_csv, you want to retrieve the Spreadsheet ID from sheet = client.open(str(self.googleSheetFile)).worksheet(str(self.worksheet)).
You want to achieve this using gspread with python.
In this case, you can retrieve the Spreadsheet ID from client.open(str(self.googleSheetFile)). So please modify your script as follows.
From:
sheet = client.open(str(self.googleSheetFile)).worksheet(str(self.worksheet))
client.import_csv('abcdefg1234567abcdefg1234567', contents)
To:
spreadsheet = client.open(str(self.googleSheetFile))
sheet = spreadsheet.worksheet(str(self.worksheet))
client.import_csv(spreadsheet.id, contents)
Note:
When I looked at the gspread documentation, it says the following, so please be careful about this.
This method removes all other worksheets and then entirely replaces the contents of the first worksheet.
This modified script supposes that you have already been able to get and put values for Google Spreadsheet using Sheets API with gspread.
Reference:
import_csv(file_id, data)
I managed to read data from a Google Sheet file using this method:
# ACCESS GOOGLE SHEET
googleSheetId = 'myGoogleSheetId'
workSheetName = 'mySheetName'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
    googleSheetId,
    workSheetName
)
df = pd.read_csv(URL)
However, after generating a pd.DataFrame that fetches info from the web using selenium, I need to append that data to the Google Sheet.
Question: Do you know a way to export that DataFrame to Google Sheets?
Yes, there is a module called "gspread". Just install it with pip and import it into your script.
Here you can find the documentation:
https://gspread.readthedocs.io/en/latest/
In particular their section on Examples of gspread with pandas.
worksheet.update([dataframe.columns.values.tolist()] + dataframe.values.tolist())
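One detail the one-liner above glosses over is missing values: NaN is not valid JSON, so a dataframe containing NaN may be rejected when it is sent to the API. The sketch below builds the same header-plus-rows list but replaces missing values with empty strings first. The helper name dataframe_to_values is an assumption made for this example, not a gspread function.

```python
import pandas as pd


def dataframe_to_values(df):
    """Return [header row] + data rows, with missing values replaced by ''."""
    # Cast to object first so that '' can replace NaN in numeric columns.
    cleaned = df.astype(object).where(df.notna(), "")
    return [df.columns.tolist()] + cleaned.values.tolist()
```

The result can then be passed to worksheet.update in place of the expression shown above.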
This might be a little late for the original author, but it may help others. The following utility function writes any pandas dataframe to a Google Sheet.
import pygsheets
def write_to_gsheet(service_file_path, spreadsheet_id, sheet_name, data_df):
    """
    this function takes data_df and writes it under spreadsheet_id
    and sheet_name using your credentials under service_file_path
    """
    gc = pygsheets.authorize(service_file=service_file_path)
    sh = gc.open_by_key(spreadsheet_id)
    try:
        sh.add_worksheet(sheet_name)
    except Exception:
        # most likely the worksheet already exists; reuse it below
        pass
    wks_write = sh.worksheet_by_title(sheet_name)
    wks_write.clear('A1',None,'*')
    wks_write.set_dataframe(data_df, (1,1), encoding='utf-8', fit=True)
    wks_write.frozen_rows = 1
Steps to get service_file_path, spreadsheet_id, sheet_name:
1. Click Sheets API | Google Developers
2. Create a new project under Dashboard (provide a relevant project name and other required information)
3. Go to Credentials
4. Click on “Create Credentials” and choose “Service Account”. Fill in all required information, viz. service account name, id, description, etc.
5. Go to Step 2 and 3 and click on “Done”
6. Click on your service account and go to “Keys”
7. Click on “Add Key”, choose “Create New Key” and select “Json”. Your service JSON file will be downloaded. Put it under your repo folder; the path to this file is your service_file_path.
8. In that JSON, the “client_email” key can be found.
9. Create a new Google spreadsheet. Note the URL of the spreadsheet.
10. Give Editor access on the spreadsheet to the "client_email" (step 8) and keep this service JSON file available while running your Python code.
Note: add the JSON file to .gitignore without fail.
11. From the URL (e.g. https://docs.google.com/spreadsheets/d/1E5gTTkuLTs4rhkZAB8vvGMx7MH008HjW7YOjIOvKYJ1/) extract the part between /d/ and / (e.g. 1E5gTTkuLTs4rhkZAB8vvGMx7MH008HjW7YOjIOvKYJ1 in this case), which is your spreadsheet_id.
12. sheet_name is the name of the tab in the Google spreadsheet. By default it is "Sheet1" (unless you have modified it).
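The extraction of spreadsheet_id from the URL can also be done in code. A minimal sketch, assuming the URL has the /spreadsheets/d/<id>/ shape shown above; extract_spreadsheet_id is a hypothetical helper name.

```python
import re

# The ID is the path segment that follows /spreadsheets/d/.
_ID_PATTERN = re.compile(r"/spreadsheets/d/([A-Za-z0-9_-]+)")


def extract_spreadsheet_id(url):
    """Return the part of a Google Sheets URL between /d/ and the next /."""
    match = _ID_PATTERN.search(url)
    if match is None:
        raise ValueError(f"no spreadsheet ID found in {url!r}")
    return match.group(1)
```

The returned string can be passed as spreadsheet_id to write_to_gsheet.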
Google Sheets has a nice api you can use from python (see the docs here), which allows you to append single rows or entire batch updates to a Sheet.
Another way of doing it without that API would be to export the data to a csv file using the python csv library, and then you can easily import that csv file into a Google Sheet.
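The CSV route could look roughly like this with only the standard library. rows_to_csv is a made-up helper name; with a pandas dataframe, df.to_csv("out.csv", index=False) would produce a comparable file.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize a header and a list of rows to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)  # quotes fields that contain commas or quotes
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing the returned text to a .csv file gives something that can be imported from the Google Sheets File menu.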
I am in the process of automating a process, in which I need to upload some data to a Google spreadsheet.
The data is originally located in a pandas dataframe, which is converted to a JSON file for upload.
I am getting to the upload, but I get all the data into each cell, so that cell A1 contains all the data from the entire pandas dataframe; in fact, each cell in the spreadsheet contains all the data :/
Of course, what I want to have happen is to place what is cell A1 in the dataframe, as A1 in the Google spreadsheet and so forth down to cell J173.
I am thinking I need to put in some sort of loop to make this happen, but I am not sure how JSON files work, so I am not succeeding in creating this loop.
I hope one of you can help
Below is the code
#Converting data to a json file for upload
csv_data = csv_data.to_json()
#Updating data
cell_list = sheet.range('A1:J173')
for cell in cell_list:
    cell.value = csv_data
sheet.update_cells(cell_list)
Windows 10
Python 3.8
You want to put the data of dataframe to Google Spreadsheet.
In your script, csv_data of csv_data.to_json() is the dataframe.
You want to achieve this using gspread with python.
From your script, I understood like this.
You have already been able to get and put values for Google Spreadsheet using Sheets API.
Pattern 1:
In this pattern, the method of values_update of gspread is used.
Sample script:
spreadsheetId = "###" # Please set the Spreadsheet ID.
sheetName = "Sheet1" # Please set the sheet name.
csv_data = # <--- please set the dataframe.
client = gspread.authorize(credentials)
spreadsheet = client.open_by_key(spreadsheetId)  # needed for values_update below
values = [csv_data.columns.values.tolist()]
values.extend(csv_data.values.tolist())
spreadsheet.values_update(sheetName, params={'valueInputOption': 'USER_ENTERED'}, body={'values': values})
Pattern 2:
In this pattern, the library of gspread-dataframe is used.
Sample script:
from gspread_dataframe import set_with_dataframe # Please add this.
spreadsheetId = "###" # Please set the Spreadsheet ID.
sheetName = "Sheet1" # Please set the sheet name.
csv_data = # <--- please set the dataframe.
client = gspread.authorize(credentials)
spreadsheet = client.open_by_key(spreadsheetId)
worksheet = spreadsheet.worksheet(sheetName)
set_with_dataframe(worksheet, csv_data)
References:
values_update
gspread-dataframe
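For comparison, the loop from the question can also be made to work. The original code assigned the whole JSON string to every cell; each cell needs its own value instead. The sketch below relies on each cell carrying row and col attributes, as gspread cells do, and uses a stand-in Cell class so it runs without gspread. fill_cells is a hypothetical helper name, and rows is assumed to be a list of lists such as the values list built in Pattern 1.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Stand-in for a gspread cell: only the attributes used here."""
    row: int
    col: int
    value: object = ""


def fill_cells(cell_list, rows):
    """Assign rows[r][c] to each cell based on the cell's own row and col."""
    first_row = min(cell.row for cell in cell_list)
    first_col = min(cell.col for cell in cell_list)
    for cell in cell_list:
        r, c = cell.row - first_row, cell.col - first_col
        # Cells outside the data are left untouched.
        if r < len(rows) and c < len(rows[r]):
            cell.value = rows[r][c]
    return cell_list
```

With gspread, the list returned by sheet.range('A1:J173') would be filled this way and then sent in a single sheet.update_cells(cell_list) call. The two patterns above are shorter and are probably the better default.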
I followed the steps here and here but couldn't upload a pandas dataframe to google sheets.
First I tried the following code:
import gspread
from google.oauth2.service_account import Credentials
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
credentials = Credentials.from_service_account_file('my_json_file_name.json', scopes=scope)
gc = gspread.authorize(credentials)
spreadsheet_key = '1FNMkcPz3aLCaWIbrC51lgJyuDFhe2KEixTX1lsdUjOY'
wks_name = 'Sheet1'
d2g.upload(df_qrt, spreadsheet_key, wks_name, credentials=credentials, row_names=True)
The above code returns an error message like this: AttributeError: module 'df2gspread' has no attribute 'upload', which doesn't make sense since df2gspread does have a function called upload.
Second, I tried to append my data to a dataframe that I artificially created on the google sheet by just entering the column names. This also didn't work and didn't provide any results.
import gspread_dataframe as gd
ws = gc.open("name_of_file").worksheet("Sheet1")
existing = gd.get_as_dataframe(ws)
updated = existing.append(df_qrt)
gd.set_with_dataframe(ws, updated)
Any help will be appreciated, thanks!
You are not importing the package properly.
Just do this
from df2gspread import df2gspread as d2g
When you convert a worksheet to Dataframe using
existing = gd.get_as_dataframe(ws)
All the blank columns and rows in the sheet are now part of the dataframe with NaN values, so when you try to append another dataframe to it, it won't be appended because the columns are mismatched.
Instead, try this to convert the worksheet to a dataframe
existing = pd.DataFrame(ws.get_all_records())
When you export a dataframe to Google Sheets, the index of the dataframe is stored in the first column (it happened in my case, can't be sure).
If the first column is the index, you can remove that column using
existing.drop([''],axis=1,inplace=True)
Then this will work properly.
updated = pd.concat([existing, df_qrt])  # DataFrame.append was removed in pandas 2.0
gd.set_with_dataframe(ws, updated)
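If you would rather keep get_as_dataframe, another option is to drop the all-empty rows and columns before combining. The following is a sketch under that assumption; strip_empty and append_rows are hypothetical helper names, and pd.concat is used because DataFrame.append is no longer available in pandas 2.0 and later.

```python
import pandas as pd


def strip_empty(df):
    """Drop rows and columns that contain nothing but missing values."""
    return df.dropna(axis=0, how="all").dropna(axis=1, how="all")


def append_rows(existing, new_rows):
    """Return existing (without blank rows/columns) followed by new_rows."""
    return pd.concat([strip_empty(existing), new_rows], ignore_index=True)
```

The combined dataframe can then be written back with gd.set_with_dataframe(ws, updated) as above.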