Adding Current Date column next to current Data Frame using Python

I'm not fully understanding data frames & am in the process of taking a course on them. This one feels like it should be so easy, but I could really use an explanation.
All I want to do is ADD a column next to my current output that has the CURRENT date in the cells.
I'm getting a timestamp using
time = pd.Timestamp.today()
print (time)
But obviously this just prints the timestamp; it isn't connected to the rest of my code.
I was able to accomplish this in Google Sheets (once the output lands), but it would be so much cleaner (and informative) if I could do it right from the script.
This is what it currently looks like:
import requests
import pandas as pd
import gspread
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('152qSpr-4nK9V5uHOiYOWTWUx4ojjVNZMdSmFYov-n50')
waveData = sh.get_worksheet(1)
id_list = [
    "/Belmar-Surf-Report/3683/",
    "/Manasquan-Surf-Report/386/",
    "/Ocean-Grove-Surf-Report/7945/",
    "/Asbury-Park-Surf-Report/857/",
    "/Avon-Surf-Report/4050/",
    "/Bay-Head-Surf-Report/4951/",
    "/Belmar-Surf-Report/3683/",
    "/Boardwalk-Surf-Report/9183/",
]
res = []
for x in id_list:
    df = pd.read_html(requests.get("http://magicseaweed.com" + x).text)[0]
    values = [[x], df.columns.values.tolist(), *df.values.tolist()]  ## does it go within here?
    res.extend(values)
    res.append([])
waveData.append_rows(res, value_input_option="USER_ENTERED")
I thought it would go within values, since that is (I believe) where my columns are built?
Would love to understand this better if someone is willing to take the time.

In your situation, how about the following modification?
From:
waveData.append_rows(res, value_input_option="USER_ENTERED")
To:
waveData.append_rows(res, value_input_option="USER_ENTERED")
# In this case, please add the following script.
row = len(waveData.get_all_values())
col = max([len(e) for e in res])
time = pd.Timestamp.today()
req = {
    "requests": [
        {
            "repeatCell": {
                "cell": {"userEnteredValue": {"stringValue": str(time)}},
                "range": {
                    "sheetId": waveData.id,
                    "startRowIndex": row - len(res) + 1,
                    "endRowIndex": row,
                    "startColumnIndex": col,
                    "endColumnIndex": col + 1,
                },
                "fields": "userEnteredValue",
            }
        }
    ]
}
sh.batch_update(req)
When this script is run, the timestamp from pd.Timestamp.today() is written by the batchUpdate method into the column just after the last column of the appended rows.
If you want to add the timestamp to a specific column, please modify the range in the above script.
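Alternatively, if the goal is just to have the current date beside every data row, the date can be stamped into each DataFrame before res is built, so no second API call is needed. A minimal sketch, reusing id_list, requests, and waveData from the question (the column name "Date" is just an example):

import pandas as pd
import requests

time = str(pd.Timestamp.today())
res = []
for x in id_list:
    df = pd.read_html(requests.get("http://magicseaweed.com" + x).text)[0]
    df["Date"] = time  # every data row now carries the current date in a new column
    values = [[x], df.columns.values.tolist(), *df.values.tolist()]
    res.extend(values)
    res.append([])
waveData.append_rows(res, value_input_option="USER_ENTERED")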

Related

Add dropdown menu (nested list) to Google Spreadsheet with gspread / Google Sheets API to whole column

I am struggling to insert a dropdown menu into a Google Spreadsheet using the gspread module in python.
This question explains how to get a dropdown menu in a spreadsheet:
How to add a nested list with gspread?
However, even if I change the startColumnIndex and endColumnIndex the dropdown menu only shows up in one cell.
I have experimented with something like this:
data = sheet.get_all_values()
values = [request if e[3] != "" else "" for e in data]
cells = sheet.range("C1:C%d" % len(values))
for i, e in enumerate(cells):
    e.value = values[i]
sheet.update_cells(cells)
where request is the dropdown menu. So I want to insert a dropdown menu in the fourth column if the third column is not empty, else I don't want to insert anything.
But as of now, this only works if request is a regular string and not the dropdown formatted cell I want it to be.
This picture shows what I want. You can see that rows 1, 3, and 4 have a dropdown list in the fourth column (as the third column is not empty) while row 2 doesn't have anything.
It seems that it would be better to use the batch_update method in combination with a request instead of the loop, but I can't get it working for multiple cells at the same time (preferably a whole column).
Thank you for your help!
I believe your goal is as follows.
When column "C" is not empty, you want to insert a dropdown list to column "D".
You want to achieve this using gspread for python.
In this case, how about the following sample script?
Sample script:
# Please use your gspread client.
spreadsheetId = "###"  # Please set your Spreadsheet ID.
sheetName = "Sheet1"  # Please set the sheet name.
dropdownOptions = [1, 2, 3, 4, 5]  # These options are from the image you provided.

spreadsheet = client.open_by_key(spreadsheetId)
sheet = spreadsheet.worksheet(sheetName)
sheetId = sheet.id
v = [{"userEnteredValue": str(e)} for e in dropdownOptions]
values = sheet.get_all_values()[1:]
requests = [
    {
        "setDataValidation": {
            "range": {
                "sheetId": sheetId,
                "startRowIndex": i + 1,
                "endRowIndex": i + 2,
                "startColumnIndex": 3,
                "endColumnIndex": 4,
            },
            "rule": {
                "showCustomUi": True,
                "strict": True,
                "condition": {"values": v, "type": "ONE_OF_LIST"},
            },
        }
    }
    for i, r in enumerate(values)
    if r[2] != ""
]
spreadsheet.batch_update({"requests": requests})
When this script is run, all values are retrieved and column "C" is checked. When column "C" is not empty, a dropdown list is inserted into column "D" of that row.
Note:
From the image you provided, it is assumed that your Spreadsheet has a header row as the first row. Please be careful about this.
References:
Method: spreadsheets.batchUpdate
SetDataValidationRequest

How to convert dynamic nested json into csv?

I have some dynamically generated nested JSON that I want to convert to a CSV file using Python. I am trying to use pandas for this. My question is: is there a way to flatten the JSON data and put it in the CSV without knowing in advance which JSON keys need to be flattened? An example of my data is this:
{
    "reports": [
        {
            "name": "report_1",
            "details": {
                "id": "123",
                "more info": "zyx",
                "people": [
                    "person1",
                    "person2"
                ]
            }
        },
        {
            "name": "report_2",
            "details": {
                "id": "123",
                "more info": "zyx",
                "actions": [
                    "action1",
                    "action2"
                ]
            }
        }
    ]
}
More nested json objects can be dynamically generated in the "details" section that I do not know about in advance but need to be represented in their own cell in the csv.
For the above example, I'd want the csv to look something like this:
Name, Id, More Info, People_1, People_2, Actions_1, Actions_2
report_1, 123, zyx, person1, person2, ,
report_2, 123, zyx, , , action1, action2
Here's the code I have:
data = json.loads('{"reports": [{"name": "report_1","details": {"id": "123","more info": "zyx","people": ["person1","person2"]}},{"name": "report_2","details": {"id": "123","more info": "zyx","actions": ["action1","action2"]}}]}')
df = pd.json_normalize(data['reports'])
df.to_csv("test.csv")
And here is the outcome currently:
,name,details.id,details.more info,details.people,details.actions
0,report_1,123,zyx,"['person1', 'person2']",
1,report_2,123,zyx,,"['action1', 'action2']"
I think what you are looking for is:
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
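For example, json_normalize flattens the nested dicts, and a second pass can expand the list-valued columns into numbered columns without knowing their names in advance. A minimal sketch using the sample data from the question (the People_1/Actions_1 naming is just one choice):

import json
import pandas as pd

data = json.loads('{"reports": [{"name": "report_1","details": {"id": "123","more info": "zyx","people": ["person1","person2"]}},{"name": "report_2","details": {"id": "123","more info": "zyx","actions": ["action1","action2"]}}]}')
df = pd.json_normalize(data['reports'])

# Detect list-valued columns dynamically instead of hard-coding their names.
list_cols = [c for c in df.columns
             if df[c].apply(lambda v: isinstance(v, list)).any()]

for col in list_cols:
    # Expand each list into numbered columns, e.g. details.people -> People_1, People_2.
    expanded = df[col].apply(pd.Series)
    base = col.split('.')[-1].title()
    expanded.columns = [f"{base}_{i + 1}" for i in expanded.columns]
    df = df.drop(columns=[col]).join(expanded)

df.to_csv("test.csv", index=False)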
If using pandas doesn't work for you, here's the more canonical Python way of doing it.
You're trying to write out a CSV file, and that implicitly means you must write out a header containing all the keys.
The constraint that you don't know the keys in advance means you can't do this in a single pass.
def convert_record_to_flat_dict(record):
    # You need to figure out exactly how you want to do this; everything
    # should be strings.
    record.update(record.pop('details'))
    return record

header = {}
rows = [[]]  # Leave the header row blank for now.
for record in data['reports']:
    record = convert_record_to_flat_dict(record)
    for key in record.keys():
        if key not in header:
            header[key] = len(header)
            rows[0].append(key)
    row = [''] * len(header)
    for key, index in header.items():
        row[index] = record.get(key, '')
    rows.append(row)

# And you can go back to ensure all rows have the same number of keys:
for row in rows:
    row.extend([''] * (len(header) - len(row)))
Now you have a list of lists that's ready to be sent to csv.writer() or the like.
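For instance, a minimal sketch of that final step, assuming rows was built as above:

import csv

with open("test.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)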
If memory is an issue, another technique is to write out a temporary file and then reprocess it once you know the header.

Python dataframe to nested json file

I have a python dataframe as below.
Python dataframe:
Emp_No  Name  Project  Task
1       ABC   P1       T1
1       ABC   P2       T2
2       DEF   P3       T3
3       IJH   Null     Null
I need to convert it to a JSON file and save it to disk as below.
JSON file:
{
    "Records": [
        {
            "Emp_No": "1",
            "Project_Details": [
                {
                    "Project": "P1",
                    "Task": "T1"
                },
                {
                    "Project": "P2",
                    "Task": "T2"
                }
            ],
            "Name": "ABC"
        },
        {
            "Emp_No": "2",
            "Project_Details": [
                {
                    "Project": "P3",
                    "Task": "T3"
                }
            ],
            "Name": "DEF"
        },
        {
            "Emp_No": "3",
            "Project_Details": [],
            "Name": "IJH"
        }
    ]
}
I feel like this post is not a doubt per se, but a cheeky attempt to avoid formatting the data, hahaha. But, since I'm trying to get used to the dataframe structure and the different ways of handling it, here you go!
import pandas as pd

asutosh_data = {'Emp_No': ["1", "1", "2", "3"], 'Name': ["ABC", "ABC", "DEF", "IJH"],
                'Project': ["P1", "P2", "P3", "Null"], 'Task': ["T1", "T2", "T3", "Null"]}
df = pd.DataFrame(data=asutosh_data)
records = []
dif_emp_no = df['Emp_No'].unique()
for emp_no in dif_emp_no:
    emp_data = df.loc[df['Emp_No'] == emp_no]
    emp_project_details = []
    for index, data in emp_data.iterrows():
        if data["Project"] != "Null":
            emp_project_details.append({"Project": data["Project"], "Task": data["Task"]})
    records.append({"Emp_No": emp_data.iloc[0]["Emp_No"],
                    "Project_Details": emp_project_details,
                    "Name": emp_data.iloc[0]["Name"]})
final_data = {"Records": records}
print(final_data)
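Since the question also asks to save the result to disk, json.dump can handle that; a minimal sketch (the filename is just an example):

import json

with open("records.json", "w", encoding="utf-8") as f:
    json.dump(final_data, f, indent=2, ensure_ascii=False)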
If you have any questions about the code above, feel free to ask. I'll also leave below the documentation I used to solve your problem (you may want to check it):
unique : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html
loc : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
iloc : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

Quickest way to write a column with google sheet API and gspread

I have a column (sheet1!A:A) with 6000 rows, and I would like to write today's date (todays_date) to each cell in the column. I'm currently doing it with the .values_update() method in a while loop, but it takes too much time and raises an APIError due to the quota limit.
x = 0
while x <= len(column):
    sh.values_update(
        'Sheet1!A' + str(x),
        params={
            'valueInputOption': 'USER_ENTERED'
        },
        body={
            'values': [[todays_date]]  # values must be a list of rows (list of lists)
        }
    )
    x += 1
Is there any other way that I can change the cell values altogether?
You want to put a value to all cells in the column "A" in "Sheet1".
You want to achieve this using gspread with python.
You want to reduce the process cost for this situation.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
In this answer, I used the batch_update method of gspread with a RepeatCellRequest of the batchUpdate method of the Sheets API. In this case, the above situation can be achieved with one API call.
Sample script:
Before you run the script, please set the variables of spreadsheetId and sheetName.
import datetime

# "client" is your authorized gspread client.
spreadsheetId = "###"  # Please set the Spreadsheet ID.
sheetName = "Sheet1"  # Please set the sheet name.

sh = client.open_by_key(spreadsheetId)
sheetId = sh.worksheet(sheetName)._properties['sheetId']
todays_date = (datetime.datetime.now() - datetime.datetime(1899, 12, 30)).days
requests = [
    {
        "repeatCell": {
            "range": {
                "sheetId": sheetId,
                "startRowIndex": 0,
                "startColumnIndex": 0,
                "endColumnIndex": 1
            },
            "cell": {
                "userEnteredValue": {
                    "numberValue": todays_date
                },
                "userEnteredFormat": {
                    "numberFormat": {
                        "type": "DATE",
                        "pattern": "dd/mm/yyyy"
                    }
                }
            },
            "fields": "userEnteredValue,userEnteredFormat"
        }
    }
]
res = sh.batch_update({'requests': requests})
print(res)
When you run the above script, today's date is put into all cells of column "A" in the dd/mm/yyyy format.
If you want to change the value or the format, please modify the above script.
At (datetime.datetime.now() - datetime.datetime(1899, 12, 30)).days, the date is converted to its serial number. By this, the value can be used as a date object in Google Spreadsheet.
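As a quick sanity check of that conversion, the serial number Google Sheets shows for a known date can be reproduced:

import datetime

# 2020-01-01 is serial number 43831 in Google Sheets (days since 1899-12-30).
print((datetime.datetime(2020, 1, 1) - datetime.datetime(1899, 12, 30)).days)  # 43831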
References:
batch_update(body)
RepeatCellRequest
If I misunderstood your question and this was not the direction you want, I apologize.

Filtering pandas dataframe by date to count views for timeline of programs

I need to count viewers by program for a streaming channel from a json logfile.
I identify the programs by their start times.
So far I have two Dataframes like this:
The first one contains all the timestamps from the logfile
viewers_from_log = pd.read_json('sqllog.json', encoding='UTF-8')
# Convert date string to pandas datetime object:
viewers_from_log['time'] = pd.to_datetime(viewers_from_log['time'])
Source JSON file:
[
    {
        "logid": 191605,
        "time": "0:00:17"
    },
    {
        "logid": 191607,
        "time": "0:00:26"
    },
    {
        "logid": 191611,
        "time": "0:01:20"
    }
]
The second contains the starting times and titles of the programs
programs_start_time = pd.read_json('programs.json', orient='index')
Source JSON file:
{
    "2019-05-29": [
        {
            "title": "\"Amiről a kövek mesélnek\"",
            "startTime_dt": "2019-05-29T00:00:40Z"
        },
        {
            "title": "Koffer - Kedvcsináló Kul(t)túrák Külföldön",
            "startTime_dt": "2019-05-29T00:22:44Z"
        },
        {
            "title": "Gubancok",
            "startTime_dt": "2019-05-29T00:48:08Z"
        }
    ]
}
So what I need to do is to count the entries / program in the log file and link them to the program titles.
My approach is to slice the log data for each date range from the program data and get the shape. Then I add a column to the program data with the results:
import pandas as pd

# setup test data
log_data = {'Time': ['2019-05-30 00:00:26', '2019-05-30 00:00:50',
                     '2019-05-30 00:05:50', '2019-05-30 00:23:26']}
log_data = pd.DataFrame(data=log_data)
program_data = {'Time': ['2019-05-30 00:00:00', '2019-05-30 00:22:44'],
                'Program': ['Program 1', 'Program 2']}
program_data = pd.DataFrame(data=program_data)

counts = []
for index, row in program_data.iterrows():
    # get counts on selected range
    try:
        log_range = log_data[(log_data['Time'] > program_data.loc[index].values[0])
                             & (log_data['Time'] < program_data.loc[index + 1].values[0])]
        counts.append(log_range.shape[0])
    except KeyError:  # the last program has no successor row
        log_range = log_data[log_data['Time'] > program_data.loc[index].values[0]]
        counts.append(log_range.shape[0])

# add additional column with collected counts
program_data['Counts'] = counts
Output:
Time Program Counts
0 2019-05-30 00:00:00 Program 1 3
1 2019-05-30 00:22:44 Program 2 1
A working (but maybe a little quick and dirty) method:
Use the .shift(-1) method on the timestamp column of the programs_start_time dataframe to get an additional column named date_end, holding the end timestamp of each TV program.
Then for each example_timestamp in the log file, you can query the TV programs dataframe like this: df[(df['date_start'] <= example_timestamp) & (df['date_end'] > example_timestamp)] (make sure you substitute df with your dataframe's name: programs_start_time), which will give you exactly one dataframe row; extract the name of the TV program from it.
Hope this helps!
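A minimal sketch of that idea, using hypothetical column names date_start and Program (adapt them to your programs_start_time dataframe):

import pandas as pd

programs = pd.DataFrame({
    'date_start': pd.to_datetime(['2019-05-30 00:00:00', '2019-05-30 00:22:44']),
    'Program': ['Program 1', 'Program 2'],
})
# The end of each program is the start of the next one; the last program gets NaT.
programs['date_end'] = programs['date_start'].shift(-1)

example_timestamp = pd.Timestamp('2019-05-30 00:05:50')
match = programs[(programs['date_start'] <= example_timestamp)
                 & ((programs['date_end'] > example_timestamp) | programs['date_end'].isna())]
print(match['Program'].iloc[0])  # Program 1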
Solution with histogram, using numpy:
import pandas as pd
import numpy as np
df_p = pd.DataFrame([
    {
        "title": "\"Amiről a kövek mesélnek\"",
        "startTime_dt": "2019-05-29T00:00:40Z"
    },
    {
        "title": "Koffer - Kedvcsináló Kul(t)túrák Külföldön",
        "startTime_dt": "2019-05-29T00:22:44Z"
    },
    {
        "title": "Gubancok",
        "startTime_dt": "2019-05-29T00:48:08Z"
    }
])
df_v = pd.DataFrame([
    {
        "logid": 191605,
        "time": "2019-05-29 0:00:17"
    },
    {
        "logid": 191607,
        "time": "2019-05-29 0:00:26"
    },
    {
        "logid": 191611,
        "time": "2019-05-29 0:01:20"
    }
])
df_p.startTime_dt = pd.to_datetime(df_p.startTime_dt)
df_v.time = pd.to_datetime(df_v.time)
# here's part where I convert datetime to timestamp in seconds - astype(int) casts it to nanoseconds, hence there's // 10**9
programmes_start = df_p.startTime_dt.astype(int).values // 10**9
viewings_starts = df_v.time.astype(int).values // 10**9
# make bins for histogram
# add zero to the beginning of the array
# add value that is time an hour after the start of the last given programme to the end of the array
programmes_start = np.pad(programmes_start, (1, 1), mode='constant', constant_values=(0, programmes_start.max()+3600))
histogram = np.histogram(viewings_starts, bins=programmes_start)
print(histogram[0])
# prints [2 1 0 0]
Interpretation: there were 2 log entries before 'Amiről a kövek mesélnek' started, 1 log entry between the starts of 'Amiről a kövek mesélnek' and 'Koffer - Kedvcsináló Kul(t)túrák Külföldön', 0 log entries between the starts of 'Koffer - Kedvcsináló Kul(t)túrák Külföldön' and 'Gubancok', and 0 entries after the start of 'Gubancok'. Which, looking at the data you provided, seems correct :) Hope this helps.
NOTE: I assume that you have the dates of the viewings. They are not in the example log file, but they appear in the screenshot, so I assumed you can compute/get them somehow and added them by hand to the input dict.
