How can I Export Pandas DataFrame to Google Sheets using Python? - python

I managed to read data from a Google Sheet file using this method:
import pandas as pd

# ACCESS GOOGLE SHEET
googleSheetId = 'myGoogleSheetId'
workSheetName = 'mySheetName'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
    googleSheetId,
    workSheetName
)
df = pd.read_csv(URL)
However, after generating a pd.DataFrame that fetches info from the web using selenium, I need to append that data to the Google Sheet.
Question: Do you know a way to export that DataFrame to Google Sheets?

Yes, there is a module called "gspread". Just install it with pip and import it into your script.
The documentation is here:
https://gspread.readthedocs.io/en/latest/
See in particular the section "Examples of gspread with pandas", which writes the header row followed by the data rows:
worksheet.update([dataframe.columns.values.tolist()] + dataframe.values.tolist())
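A minimal sketch of preparing that payload, assuming the DataFrame may contain NaN or None cells, which typically cannot be sent as JSON values. The helper name to_sheet_values is hypothetical and not part of gspread; it works on plain lists so it can be checked without a spreadsheet.

```python
import math


def to_sheet_values(columns, rows):
    """Build a header-plus-rows payload, replacing NaN/None with empty strings."""

    def clean(value):
        if value is None:
            return ""
        if isinstance(value, float) and math.isnan(value):
            return ""
        return value

    return [list(columns)] + [[clean(v) for v in row] for row in rows]


# Assumed usage with a DataFrame `dataframe` and a gspread worksheet `worksheet`:
#   values = to_sheet_values(dataframe.columns.values.tolist(),
#                            dataframe.values.tolist())
#   worksheet.update(values)            # overwrite starting at the top-left cell
#   worksheet.append_rows(values[1:])   # or append only the data rows
```

Since the question asks about appending, skipping the header row (values[1:]) and using append_rows is probably the closer fit than overwriting the sheet.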

This answer may be a little late for the original author, but it should help others. The following utility function writes any pandas DataFrame to a Google Sheet.
import pygsheets

def write_to_gsheet(service_file_path, spreadsheet_id, sheet_name, data_df):
    """
    this function takes data_df and writes it under spreadsheet_id
    and sheet_name using your credentials under service_file_path
    """
    gc = pygsheets.authorize(service_file=service_file_path)
    sh = gc.open_by_key(spreadsheet_id)
    try:
        # Create the tab if it does not exist yet
        sh.add_worksheet(sheet_name)
    except Exception:
        # Adding failed, most likely because the tab already exists; reuse it
        pass
    wks_write = sh.worksheet_by_title(sheet_name)
    # Clear existing content before writing the new data
    wks_write.clear('A1', None, '*')
    wks_write.set_dataframe(data_df, (1, 1), encoding='utf-8', fit=True)
    # Keep the header row visible while scrolling
    wks_write.frozen_rows = 1
Steps to get service_file_path, spreadsheet_id, sheet_name:
1. Click Sheets API | Google Developers.
2. Create a new project under Dashboard (provide a relevant project name and other required information).
3. Go to Credentials.
4. Click on “Create Credentials” and choose “Service Account”. Fill in all required information, viz. service account name, ID, description, etc.
5. Go through Steps 2 and 3 and click on “Done”.
6. Click on your service account and go to “Keys”.
7. Click on “Add Key”, choose “Create New Key” and select “JSON”. Your service JSON file will be downloaded. Put it under your repo folder; the path to this file is your service_file_path.
8. In that JSON file, find the “client_email” key.
9. Create a new Google spreadsheet. Note the URL of the spreadsheet.
10. Give Editor access to the spreadsheet to the "client_email" address (step 8), and keep the service JSON file available while running your Python code.
Note: add the JSON file to .gitignore without fail.
11. From the URL (e.g. https://docs.google.com/spreadsheets/d/1E5gTTkuLTs4rhkZAB8vvGMx7MH008HjW7YOjIOvKYJ1/) extract the part between /d/ and / (1E5gTTkuLTs4rhkZAB8vvGMx7MH008HjW7YOjIOvKYJ1 in this case), which is your spreadsheet_id.
12. sheet_name is the name of the tab in the Google spreadsheet. By default it is "Sheet1" (unless you have modified it).
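The spreadsheet_id extraction described above can be sketched with a regular expression. The function name extract_spreadsheet_id is hypothetical, and the pattern assumes IDs consist of letters, digits, hyphens and underscores.

```python
import re


def extract_spreadsheet_id(url):
    """Return the part of a Google Sheets URL between /d/ and the next slash."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", url)
    if match is None:
        raise ValueError(f"No spreadsheet ID found in: {url}")
    return match.group(1)


# Example with the URL used in the steps above
example_url = (
    "https://docs.google.com/spreadsheets/d/"
    "1E5gTTkuLTs4rhkZAB8vvGMx7MH008HjW7YOjIOvKYJ1/"
)
spreadsheet_id = extract_spreadsheet_id(example_url)
```

The result can be passed as spreadsheet_id to write_to_gsheet.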

Google Sheets has a nice api you can use from python (see the docs here), which allows you to append single rows or entire batch updates to a Sheet.
Another way of doing it without that API would be to export the data to a csv file using the python csv library, and then you can easily import that csv file into a Google Sheet.
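A rough sketch of the CSV route, using only the standard library csv module. The function name rows_to_csv_text and the sample data are illustrative, not from the original answer.

```python
import csv
import io


def rows_to_csv_text(header, rows):
    """Serialize a header and data rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv_text(["name", "price"], [["apple", 1.5], ["pear, green", 2]])

# To produce a file for manual import into Google Sheets:
#   with open("export.csv", "w", newline="") as f:
#       f.write(csv_text)
```

With a DataFrame, df.to_csv("export.csv", index=False) achieves the same result in one call; the resulting file can then be imported through the Google Sheets UI.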

Related

Reading GoogleSheet with pandas dataframe doing search on it

Do I need to read the Google Sheet with read_excel in order to search its columns in Python?
I must gather data from the entire Google Sheet file. I need to search by sheet name first, then gather information by looking up values in the columns.
I started by looking at the two popular solutions on the internet:
The first uses the gspread package. As it relies on service_account.json info, I will not use it.
The second is appropriate for me, but it shows how to export as a CSV file, and I need to get the data as an XLSX file.
The code is below:
import pandas as pd
sheet_id=" url "
sheet_name="sample_1"
url=f"https://docs.google...d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
I have both sheet_id and sheet_name, but I need to export as an XLSX file.
Here I see an example of how to read an Excel file. Is there a way to read a Google spreadsheet the same way, as an Excel file?
Using Pandas to pd.read_excel() for multiple worksheets of the same workbook
xls = pd.ExcelFile('excel_file_path.xls')
# Now you can list all sheets in the file
xls.sheet_names
# ['house', 'house_extra', ...]
# to read just one sheet to dataframe:
df = pd.read_excel(file_name, sheet_name="house")
I have no problem reading a google sheet using the method I found here:
Python Read In Google Spreadsheet Using Pandas
spreadsheet_id = "<INSERT YOUR GOOGLE SHEET ID HERE>"
url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?format=csv"
df = pd.read_csv(url)
df.to_excel("my_sheet.xlsx")
You need to set the permissions of your sheet though. I found that setting it to "anyone with a link" worked.
UPDATE - based on comments below
If your spreadsheet has multiple tabs and you want to read anything other than the first sheet, you need to specify a sheetID as described here
spreadsheet_id = "<INSERT YOUR GOOGLE spreadsheetId HERE>"
sheet_id = "<INSERT YOUR GOOGLE sheetId HERE>"
url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?gid={sheet_id}&format=csv"
df = pd.read_csv(url)
df.to_excel("my_sheet.xlsx")
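The two URL variants above can be folded into one small helper. This is only a sketch: build_export_url is a hypothetical name, and passing fmt="xlsx" assumes the export endpoint accepts that format in the same way it accepts csv.

```python
from urllib.parse import urlencode


def build_export_url(spreadsheet_id, fmt="csv", gid=None):
    """Build the export URL for a publicly shared Google spreadsheet."""
    params = {"format": fmt}
    if gid is not None:
        # gid selects a specific tab; without it the first sheet is exported
        params["gid"] = str(gid)
    return (
        f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}"
        f"/export?{urlencode(params)}"
    )


# Assumed usage:
#   df = pd.read_csv(build_export_url(spreadsheet_id, gid=sheet_id))
```

The sharing caveat still applies: the sheet has to be readable by anyone with the link for an unauthenticated request to succeed.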

Databricks - pyspark.pandas.Dataframe.to_excel does not recognize abfss protocol

I want to save a Dataframe (pyspark.pandas.Dataframe) as an Excel file on the Azure Data Lake Gen2 using Azure Databricks in Python.
I've switched to the pyspark.pandas.Dataframe because it is the recommended one since Spark 3.2.
There's a method called to_excel (see the docs) that can save a file to a container in ADL, but I'm running into problems with the file system access protocols.
From the same class I use the methods to_csv and to_parquet with abfss, and I'd like to do the same for Excel.
So when I try to save it using:
import pyspark.pandas as ps
# Omit the df initialization
file_name = "abfss://CONTAINER@SERVICEACCOUNT.dfs.core.windows.net/FILE.xlsx"
sheet = "test"
df.to_excel(file_name, sheet)
I get the error from fsspec:
ValueError: Protocol not known: abfss
Can someone please help me?
Thanks in advance!
The pandas DataFrame does not support the abfss protocol. It seems that on Databricks you can only read and write files on abfss via a Spark DataFrame. So the solution is to write the file locally and then move it to abfss manually. See this answer here.
You cannot save it directly, but you can write it to a temporary location and then move it to your directory. My code is:
import xlsxwriter
import pandas as pd1

workbook = xlsxwriter.Workbook('data_checks_output.xlsx')
worksheet = workbook.add_worksheet('top_rows')
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd1.ExcelWriter('data_checks_output.xlsx', engine='xlsxwriter')
# dataset is a Spark DataFrame and row_number is an int, both defined elsewhere
output = dataset.limit(10)
output = output.toPandas()
output.to_excel(writer, sheet_name='top_rows', startrow=row_number)
# writer.save() was removed in pandas 2.0; close() also writes the file
writer.close()
After the writer is closed, run the code below, which simply moves the file from its temporary location to your designated location:
%sh
sudo mv file_name.xlsx /dbfs/mnt/fpmount/
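The write-locally-then-move pattern can also be sketched in plain Python instead of a %sh cell. The helper name save_via_local_temp is hypothetical; it assumes the destination (for example a path under /dbfs/mnt/fpmount/, as in the answer) is reachable as an ordinary file path from the driver.

```python
import os
import shutil
import tempfile


def save_via_local_temp(write_fn, destination):
    """Call write_fn(local_path) to create the file, then move it to destination."""
    suffix = os.path.splitext(destination)[1]
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # write_fn opens the path itself
    try:
        write_fn(local_path)
        os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
        shutil.move(local_path, destination)
    finally:
        # Clean up the temp file if writing or moving failed
        if os.path.exists(local_path):
            os.remove(local_path)
    return destination


# Assumed usage with the pandas DataFrame `output` from the answer:
#   save_via_local_temp(
#       lambda path: output.to_excel(path, sheet_name="top_rows"),
#       "/dbfs/mnt/fpmount/data_checks_output.xlsx",
#   )
```

Keeping the writer as a callback means the same helper works for any writer that only accepts local paths.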

Read GoogleSheet with multiple sheets into pandas

I want to read a Google Sheet with multiple sheets into one (or several) pandas DataFrames.
I don't know the sheet names, or the number of sheets in advance.
The trivial attempt fails:
import pandas as pd

def main():
    path = r"https://docs.google.com/spreadsheets/d/1-MlSisrAxhOyKhrz6S08PG68j667Ym7jGExOyytpCSM/edit?usp=sharing"
    pd.read_excel(path)
fails with
ValueError: Excel file format cannot be determined, you must specify an engine manually.
Trying any format doesn't work.
All answers to this question refer to .csv, meaning a single sheet, or knowing the sheet name in advance.
Same goes for the 1st Google hit for "read google sheet python pandas".
Is there a standard way of doing this?
When your Spreadsheet is publicly shared, in your situation, how about the following sample script?
Sample script:
import openpyxl
import pandas as pd
import requests
from io import BytesIO
spreadsheetId = "###" # Please set your Spreadsheet ID.
url = "https://docs.google.com/spreadsheets/export?exportFormat=xlsx&id=" + spreadsheetId
res = requests.get(url)
data = BytesIO(res.content)
xlsx = openpyxl.load_workbook(filename=data)
for name in xlsx.sheetnames:
    values = pd.read_excel(data, sheet_name=name)
    # do something
In this sample script, the publicly shared Spreadsheet is exported as XLSX data. The exported XLSX data is then opened, the sheet names are retrieved, and each sheet is loaded into a dataframe.
If you want to retrieve the specific sheets, please filter the sheet names from xlsx.sheetnames.
Note:
If your Spreadsheet is not publicly shared, this thread might be useful. Ref
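As a possible shortcut, pandas can load every sheet at once when sheet_name=None is passed to read_excel, which returns a dict keyed by sheet name. The sketch below assumes pandas and openpyxl are installed; read_all_sheets is a hypothetical helper name, and it takes the raw XLSX bytes so it does not depend on how they were downloaded.

```python
from io import BytesIO

import pandas as pd


def read_all_sheets(xlsx_bytes):
    """Return {sheet_name: DataFrame} for every sheet in an XLSX payload."""
    # sheet_name=None asks pandas to read all sheets into a dict
    return pd.read_excel(BytesIO(xlsx_bytes), sheet_name=None, engine="openpyxl")


# Assumed usage with the response from the sample script:
#   frames = read_all_sheets(res.content)
#   for name, frame in frames.items():
#       ...  # do something with each sheet
```

This avoids opening the workbook separately with openpyxl just to list the sheet names.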

Using gspread to extract sheet ID

Can't seem to find any answer to this, but are there any functions/methods which can get a worksheet ID?
Currently, my code looks like this:
scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
....code to authorize credentials goes here....
sheet = client.open(str(self.googleSheetFile)).worksheet(str(self.worksheet))
client.import_csv('abcdefg1234567abcdefg1234567', contents)
but I don't want to hardcode the abcdefg1234567abcdefg1234567. Is there anything I can do, like sheet.id()?
I believe your goal is as follows.
In order to use import_csv, you want to retrieve the Spreadsheet ID from sheet = client.open(str(self.googleSheetFile)).worksheet(str(self.worksheet)).
You want to achieve this using gspread with python.
In this case, you can retrieve the Spreadsheet ID from client.open(str(self.googleSheetFile)). So please modify your script as follows.
From:
sheet = client.open(str(self.googleSheetFile)).worksheet(str(self.worksheet))
client.import_csv('abcdefg1234567abcdefg1234567', contents)
To:
spreadsheet = client.open(str(self.googleSheetFile))
sheet = spreadsheet.worksheet(str(self.worksheet))
client.import_csv(spreadsheet.id, contents)
Note:
The gspread documentation for import_csv says the following, so please be careful:
This method removes all other worksheets and then entirely replaces the contents of the first worksheet.
This modified script supposes that you have already been able to get and put values for Google Spreadsheet using Sheets API with gspread.
Reference:
import_csv(file_id, data)

From Python web app: insert data into spreadsheet (e.g. LibreOffice / Excel), calculate and save as pdf

My problem is that I would like to push data (one large dataframe and one image) from my Python web app (running on Tornado web server and Ubuntu) into a spreadsheet, calculate, save as PDF and then deliver it to the frontend.
I took a look at several libs like openpyxl for writing sheets in MS Excel, but that would solve just one part. I was thinking about using LibreOffice and pyoo, but it seems that I need the same Python version on my backend as the one shipped with LibreOffice when importing pyuno.
Has anybody solved a similar issue and can recommend how to approach this?
Thanks
I came up with a, let's say, not pretty but unusual solution that is very flexible for me.
use openpyxl to open an existing Excel workbook that includes the layout (Template)
insert the dataframe into a separate sheet in that workbook
use openpyxl to save as temporary_file.xlsx
call LibreOffice with --headless --convert-to pdf temporary_file.xlsx
While executing the last call, all integrated formulas are recalculated/updated and the PDF is created (you have to configure Calc so that auto-calculation is enabled when files are opened)
deliver the PDF to the frontend or process it as you like
delete temporary_file.xlsx
import datetime
from subprocess import call

import openpyxl
import pandas as pd
from openpyxl.utils.dataframe import dataframe_to_rows

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
# Timestamp prefix keeps temporary workbooks from colliding
now = datetime.datetime.now().strftime("%Y%m%d_%H%M_%f")
wb_template_name = 'Template.xlsx'
wb_temp_name = now + wb_template_name
wb = openpyxl.load_workbook(wb_template_name)
ws = wb['dataframe_sheet']
pdf_convert_cmd = 'soffice --headless --convert-to pdf ' + wb_temp_name
# Append the dataframe (with index and header) row by row
for r in dataframe_to_rows(df, index=True, header=True):
    ws.append(r)
wb.save(wb_temp_name)
call(pdf_convert_cmd, shell=True)
The reason I'm doing this is that I would like to be able to style the layout of the PDF independently from the data. I use named ranges or lookups that reference the separate dataframe sheet in Excel.
I didn't try the image insertion yet, but it should work similarly. There may be a way to improve performance by dumping the dataframe directly into the xlsx file (which is a zipped set of XML files), so that you don't need openpyxl.
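One possible refinement of the conversion step, sketched under the assumption that soffice is on the PATH: passing the command as a list avoids shell quoting problems with file names, and --outdir controls where the PDF lands. The function names here are hypothetical.

```python
import datetime
import subprocess


def temp_workbook_name(template_name, now=None):
    """Prefix the template name with a timestamp, as in the answer's code."""
    now = now or datetime.datetime.now()
    return now.strftime("%Y%m%d_%H%M_%f") + template_name


def build_convert_command(xlsx_path, out_dir="."):
    """Build the headless LibreOffice command that converts a workbook to PDF."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", out_dir,
        xlsx_path,
    ]


def convert_to_pdf(xlsx_path, out_dir="."):
    """Run the conversion; raises CalledProcessError if soffice fails."""
    subprocess.run(build_convert_command(xlsx_path, out_dir), check=True)
```

Using check=True surfaces a failed conversion as an exception instead of silently delivering no PDF to the frontend.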
