Add multiple rows into google spreadsheet using API - python

I need to add multiple (a few hundred) rows into a Google spreadsheet. Currently I'm doing it in a loop:
for row in rows:
    _api_client.InsertRow(row, _spreadsheet_key, _worksheet_id)
which is extremely slow, because rows are added one by one.
Is there any way to speed this up?

Ok, I finally used a batch request. The idea is to send multiple changes in one API request.
Firstly, I created a list of dictionaries, which will be used like rows_map[R][C] to get the value of the cell at row R and column C.
rows_map = [
    {
        1: row['first_column'],
        2: row['second'],
        3: row['and_last'],
    }
    for row in rows
]
Then I get all the cells from the worksheet
query = gdata.spreadsheet.service.CellQuery()
query.return_empty = 'true'
cells = _api_client.GetCellsFeed(self._key, wksht_id=self._raw_events_worksheet_id, query=query)
And create a batch request to modify multiple cells at a time.
batch_request = gdata.spreadsheet.SpreadsheetsCellsFeed()
Then I can modify the spreadsheet (or, in my case, rewrite all the values).
for cell_entry in cells.entry:
    row = int(cell_entry.cell.row) - 2  # skip the header row; rows_map is 0-based
    col = int(cell_entry.cell.col)
    if 0 <= row < len(rows_map):
        cell_entry.cell.inputValue = rows_map[row][col]
    else:
        cell_entry.cell.inputValue = ''
    batch_request.AddUpdate(cell_entry)
And send all the changes in only one request:
_api_client.ExecuteBatch(batch_request, cells.GetBatchLink().href)
NOTES:
Batch requests are possible only with Cell Queries. There is no such mechanism for List Queries.
query.return_empty = 'true' is mandatory. Otherwise the API will return only the cells which are not empty.

Related

Improve performance of 8 million iterations over a dataframe and query it

There is a for loop of 8 million iterations. Each iteration takes 2 sample values from a column of a 1-million-record dataframe (say df_original_nodes), queries those 2 samples in another dataframe (say df_original_rel), and, if the pair does not exist, adds it as a new row to df_original_rel. Finally, df_original_rel is written to a CSV.
This loop takes roughly 24+ hours to complete. How can this be made more performant? I'd be happy if it even took 8 hours rather than anything 12+.
Here is the piece of code:
for j in range(1, n_8000000):
    ran_num = random.randint(0, 1)
    ran_rel_type = rel_type[ran_num]
    df_ran_rel = df_original_nodes["UID"].sample(2, ignore_index=True)
    FROM = df_ran_rel[0]
    TO = df_ran_rel[1]
    if df_original_rel.query("@FROM == FROM and @TO == TO").empty:
        k += 1
        new_row = {"FROM": FROM, "TO": TO, "TYPE": ran_rel_type[0], "PART_OF": ran_rel_type[1]}
        df_original_rel = df_original_rel.append(new_row, ignore_index=True)
df_original_rel.to_csv("output/extra_rel.csv", encoding="utf-8", index=False)
My assumption is that querying the dataframe df_original_rel is the heavy-lifting part, and that df_original_rel also keeps growing as new rows are added.
In my view, lists are faster to traverse and maybe to query, but then there would be another layer of conversion from dataframe to list and vice versa, which could add further complexity.
Some things that should probably help – most of them around "do less Pandas".
Since I don't have your original data or anything like it, I can't test this.
# Grab a regular list of UIDs that we can use with `random.sample`
original_nodes_uid_list = df_original_nodes["UID"].tolist()

# Make a regular set of FROM-TO tuples
rel_from_to_pairs = set(df_original_rel[["FROM", "TO"]].apply(tuple, axis=1).tolist())

# Store new rows here instead of putting them in the dataframe; we'll also update rel_from_to_pairs as we go.
new_rows = []

for j in range(1, 8_000_000):
    # These two lines could probably also be a `random.choice`
    ran_num = random.randint(0, 1)
    ran_rel_type = rel_type[ran_num]
    # Grab a from-to pair from the UID list
    FROM, TO = random.sample(original_nodes_uid_list, 2)
    # If this pair isn't in the set of known pairs...
    if (FROM, TO) not in rel_from_to_pairs:
        # ... prepare a new row to be added later
        new_rows.append({"FROM": FROM, "TO": TO, "TYPE": ran_rel_type[0], "PART_OF": ran_rel_type[1]})
        # ... and since this from-to pair _would_ exist had df_original_rel
        # been updated, update the pairs set.
        rel_from_to_pairs.add((FROM, TO))

# Finally, make a dataframe of the new rows, concatenate it with the old, and output.
df_new_rel = pd.DataFrame(new_rows)
df_original_rel = pd.concat([df_original_rel, df_new_rel], ignore_index=True)
df_original_rel.to_csv("output/extra_rel.csv", encoding="utf-8", index=False)

How can I align columns if rows have different number of values?

I am scraping data with Python. I get a CSV file and can split it into columns in Excel later. But I am encountering an issue I have not been able to solve. Sometimes the scraped items have two statuses and sometimes just one. The second status shifts the other values in the columns to the right, and as a result the dates are not all in the same column, which would be useful for sorting the rows.
Do you have any idea how to make the columns merge if there are two statuses, for example, or other solutions?
Maybe it is also an issue that I still need to separate the values into columns manually in Excel.
Here is my code:
# call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd

# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)

# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page=' + str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)
df = pd.DataFrame(initiative_list)

# create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
"make the columns merge if there are two statuses for example or other solutions"
[If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc, then] it should be taken care of by the following parts of getDetails_iiaRow (defined further below):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things up, you should consider extracting each item in a more specific manner and having each "row" represented as a dictionary (with the column names as the keys, so nothing gets misaligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)

def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }
    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()
    link = iiaEl.get_attribute('href')
    if link[:1] == '/':
        link = f'https://ec.europa.eu{link}'
    iiarDets['link'] = link
    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a')
    ]
and then, since it's all cleaned already, you can directly save the data with:
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever-present fields as the first three columns, followed by the status + date-range columns as pairs. Finally, I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]
where processDiv is defined as:
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[5].text.split("\n")[1]]
        else:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[4].text.split("\n")[1]]
    else:
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data that you wish to retrieve (rather than as simply divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).

Python gspread - get the last row without fetching all the data?

Looking for tips on how to get the data of the latest row of a sheet. I've seen the solution of fetching all the data and then taking the length of that.
But this is of course a waste of all that fetching. Wondering if there is a smarter way to do it, since you can already append data to the last row+1 with worksheet.append_rows([some_data]).
I used the solution @buran mentioned. If you init the worksheet with
add_worksheet(title="title", rows=1, cols=10)
and only append new data via
worksheet.append_rows([some_array])
then @buran's suggestion is brilliant: simply use
worksheet.row_count
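For context, a minimal end-to-end sketch of that pattern (the spreadsheet and worksheet names are illustrative, and a configured gspread service account is assumed; depending on your gspread version you may need to re-fetch the worksheet so its metadata reflects the appended rows):
import gspread

gc = gspread.service_account()                 # assumes credentials are already set up
sh = gc.open("my spreadsheet")                 # hypothetical spreadsheet name
worksheet = sh.add_worksheet(title="title", rows=1, cols=10)

worksheet.append_rows([["a", "b", "c"]])       # each append grows the grid by the rows added

worksheet = sh.worksheet("title")              # re-fetch so row_count is up to date
print(worksheet.row_count)                     # last row that actually holds data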
I found this code in another question; it creates a dummy append in the sheet.
After that, you can search for the location later on:
def get_last_row_with_data(service, value_input_option="USER_ENTERED"):
    last_row_with_data = '1'
    try:
        # creates a dummy row
        dummy_request_append = service.spreadsheets().values().append(
            spreadsheetId='<spreadsheet id>',
            range="{0}!A:{1}".format('Tab Name', 'ZZZ'),
            valueInputOption='USER_ENTERED',
            includeValuesInResponse=True,
            responseValueRenderOption='UNFORMATTED_VALUE',
            body={
                "values": [['']]
            }
        ).execute()
        # Search the dummy row
        a1_range = dummy_request_append.get('updates', {}).get('updatedRange', 'dummy_tab!a1')
        bottom_right_range = a1_range.split('!')[1]
        number_chars = [i for i in list(bottom_right_range) if i.isdigit()]
        last_row_with_data = ''.join(number_chars)
    except Exception as e:
        last_row_with_data = '1'
    return last_row_with_data
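A hypothetical usage sketch (service is assumed to be an authorized Sheets API v4 client, and the spreadsheet id and tab name inside the function must be filled in first):
# Row number (as a string) parsed from the dummy append's updatedRange.
last_row = int(get_last_row_with_data(service))
print(last_row)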
You can see a sample of Append in this documentation.
However, for me it is just easier to use:
# The ID of the sheet you are working with.
Google_sheets_ID = 'ID_of_your_Google_Sheet'

# Define the start row that has data.
# It will later be replaced with the last row.
# In my test sheet, it starts in row 2.
last_row = 2

# Code to get the last row.
# The range will be the column where the information is located.
# Remember to change "sheet1" to the name of your worksheet.
response = service.spreadsheets().values().get(
    spreadsheetId = Google_sheets_ID,
    range = 'sheet1!A1:A'
).execute()

# Add the initial value where the range started to the last row with values.
last_row += len(response['values']) - 1

# If you print last_row, you should see the last row with values in the sheet.
print(last_row)

How can we use the "Google Sheets API" to update a sheet with new data using Python?

https://developers.google.com/sheets/api/reference/rest
I was able to create a sheet and also clear a sheet and do some other operations.
But I am not able to get the number of rows and columns that are already filled, so that I can add some more data to the sheet.
def nextAvailableRow(worksheet):
    rowsUsed = len(list(filter(None, worksheet.col_values(1))))
    return rowsUsed + 1
This function should take in a worksheet as a parameter and output the next available row that is not used. It takes in all the values of the first column and then filters out all the empty rows to find how many rows have been used.
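Since col_values is a gspread worksheet method, a hypothetical usage sketch could look like this (the spreadsheet name and cell value are purely illustrative, and a configured gspread service account is assumed):
import gspread

gc = gspread.service_account()            # assumes credentials are already set up
worksheet = gc.open("My sheet").sheet1    # hypothetical spreadsheet name

row = nextAvailableRow(worksheet)
worksheet.update_cell(row, 1, "new value")   # write into the first unused row of column A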
If you only want to insert new data after the last row with values, you can do it using the spreadsheets.values.append endpoint (Notice you can play with the API at "try this API" and there is also a Python example)
spreadsheet_id = 'your-spreadSheet-id'
ranges = "A1:A"  # It will get the info until the last row with data
value_render_option = "DIMENSION_UNSPECIFIED"
value_input_option = "USER_ENTERED"

# Body for the request
value_range_body = {
    "values": [
        [
            "A11",
            "B11"
        ],
        [
            "A12",
            "B12"
        ]
    ],
    "majorDimension": "DIMENSION_UNSPECIFIED"
}

request = service.spreadsheets().values()\
    .append(spreadsheetId=spreadsheet_id, range=ranges, valueInputOption=value_input_option, body=value_range_body)
request.execute()
The outer array in values represents the rows and the inner ones the columns, as you can see in the example body in my code.
Notice: In another answer, someone recommends using a third-party library instead of Google's client library for Python. I would recommend using the official Google library, because it is guaranteed to be supported by Google.

Add Google sheet with data using Google API v4

I am using Python. I need to add a sheet to a spreadsheet using Google API v4. I can create a sheet using batchUpdate with the spreadsheet id and an addSheet request (it returns the sheetId and creates an empty sheet). But how can I add data to it?
data = {'requests': [
    {
        'addSheet': {
            'properties': {'title': 'New sheet'}
        }
    }
]}
res = service.spreadsheets().batchUpdate(spreadsheetId=s_id, body=data).execute()
SHEET_ID = res['replies'][0]['addSheet']['properties']['sheetId']
You can add this code to write data to a Google Sheet. From the documentation, Reading & Writing Cell Values:
Spreadsheets can have multiple sheets, with each sheet having any number of rows or columns. A cell is a location at the intersection of a particular row and column, and may contain a data value. The Google Sheets API provides the spreadsheets.values collection to enable the simple reading and writing of values.
Writing to a single range
To write data to a single range, use a spreadsheets.values.update request:
values = [
    [
        # Cell values ...
    ],
    # Additional rows ...
]
body = {
    'values': values
}
result = service.spreadsheets().values().update(
    spreadsheetId=spreadsheet_id, range=range_name,
    valueInputOption=value_input_option, body=body).execute()
The body of the update request must be a ValueRange object, though the only required field is values. If range is specified, it must match the range in the URL. In the ValueRange, you can optionally specify its majorDimension. By default, ROWS is used. If COLUMNS is specified, each inner array is written to a column instead of a row.
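For instance, a column-wise write might look like this (a minimal sketch; the range and values are illustrative only):
body = {
    'majorDimension': 'COLUMNS',
    'values': [
        ['Name', 'Alice', 'Bob'],   # written down column A
        ['Score', 42, 17],          # written down column B
    ]
}
result = service.spreadsheets().values().update(
    spreadsheetId=spreadsheet_id, range='Sheet1!A1:B3',
    valueInputOption='RAW', body=body).execute()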
Writing multiple ranges
If you want to write multiple discontinuous ranges, you can use a spreadsheets.values.batchUpdate request:
values = [
    [
        # Cell values
    ],
    # Additional rows
]
data = [
    {
        'range': range_name,
        'values': values
    },
    # Additional ranges to update ...
]
body = {
    'valueInputOption': value_input_option,
    'data': data
}
result = service.spreadsheets().values().batchUpdate(
    spreadsheetId=spreadsheet_id, body=body).execute()
The body of the batchUpdate request must be a BatchUpdateValuesRequest object, which contains a ValueInputOption and a list of ValueRange objects (one for each written range). Each ValueRange object specifies its own range, majorDimension, and the data to input.
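For a more concrete picture, a small filled-in sketch (sheet name, ranges, and values are illustrative only):
data = [
    {'range': 'Sheet1!A1:B2', 'values': [['A1', 'B1'], ['A2', 'B2']]},
    {'range': 'Sheet1!D4:E4', 'values': [['D4', 'E4']]},
]
body = {
    'valueInputOption': 'USER_ENTERED',
    'data': data
}
result = service.spreadsheets().values().batchUpdate(
    spreadsheetId=spreadsheet_id, body=body).execute()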
Hope this helps.
Figured out that after creating a sheet inside the spreadsheet, you can access the range 'nameofsheet!A1', i.e.
service.spreadsheets().values().update(spreadsheetId=s_id, range='New sheet!A1', body=data_i, valueInputOption='RAW').execute()
This request will post the data into the newly created sheet named 'New sheet'.
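The update body (data_i above) is just a ValueRange dict; a minimal sketch with illustrative values:
data_i = {
    'values': [
        ['Name', 'Score'],   # row 1 of the new sheet
        ['Alice', 42],       # row 2
    ]
}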
You could also try a 'more intuitive' library I wrote for the Google Sheets API v4, pygsheets, as I found the default client a bit cumbersome.
import pygsheets
gc = pygsheets.authorize()

# Open the spreadsheet
sh = gc.open('my new ssheet')

# Create a worksheet
wks = sh.add_worksheet("new sheet", rows=50, cols=60)

# Update data with a 2d vector
wks.update_cells('A1:B10', values)

# or skip the range and let it be inferred from the size of values
wks.update_cells('A1', values)
