I am trying to use two ipywidgets to control a dataframe and refresh it as I change the two widgets:
1. Create a new column (Check_value) that shifts the Values column down by the value in widget A.
2. Filter and show only the rows where Check_value exceeds the Values column by more than the value in widget B.
The dataframe:
Dates | Values
Day 1 | 5
Day 2 | 9
Day 3 | 14
Day 4 | 40
Day 5 | 80
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

## Widget A
A = widgets.IntSlider(
    value=1,
    min=1,
    max=30,
    step=1)
## Widget B
B = widgets.IntSlider(
    value=1,
    min=1,
    max=30,
    step=1)
out = widgets.Output()
## Processing
def common_processing(period, filters):
    out.clear_output()
    with out:
        df = pd.read_csv('data.csv')
        df['Dates'] = pd.to_datetime(df['Dates'])
        df['check_value'] = df['Values'].shift(-period)
        df['delta'] = df['check_value'] - df['Values']
        display(df[df['delta'] > filters])
def A_eventhandler(change):
    common_processing(change.new, B.value)
def B_eventhandler(change):
    common_processing(A.value, change.new)
A.observe(A_eventhandler, names='value')
B.observe(B_eventhandler, names='value')
display(A)
display(B)
The data frame displayed does not change with changes in the widget values.
I tried running your code, and the A_eventhandler function is never called when you change the widget value (you can check this by adding print(change) before the common_processing call).
The reason is that your names keyword needs to be a list rather than a string. So try:
A.observe(A_eventhandler, names=['value'])
B.observe(B_eventhandler, names=['value'])
When I set up an observe on a widget, I always do it without any filtering first, just printing the result in the observed function. Then you can add keywords to filter down to just the events and values you need.
Also, don't forget to display(out) somewhere in your code, as you are capturing the output here. Otherwise you will never see anything!
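Putting both fixes together, a minimal sketch of the wiring (reusing the widgets, handlers, and common_processing defined above; the last line is just an optional way to render the table once before any slider is moved):
A.observe(A_eventhandler, names=['value'])
B.observe(B_eventhandler, names=['value'])

display(A)
display(B)
display(out)  # the Output widget must be displayed, otherwise the captured table never appears

common_processing(A.value, B.value)  # optional initial render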
Related
How can I display a row of buttons across the bottom of the screen using Tkinter? The number of buttons is variable: it could be 2 or it could be 10. The number of items being displayed is also variable; in this case it's 5, but it could likewise be 2 or 10. I am using Tkinter and currently have a working program that outputs a grid that looks like this:
-------------------------
| |Title| |
|Item1| |Quantity1|
|Item2| |Quantity2|
|Item3| |Quantity3|
|Item4| |Quantity4|
| | (Intentionally blank)
|Item5| |Quantity5|
-------------------------- (end)
I am outputting like so:
from tkinter import *
window = Tk()
itemLabel = Label(
    window,
    text = "Item1",
).grid(row = 2, column = 0, sticky = 'w')
However, when I try to add buttons, I can't seem to get the formatting correct. If I do it similarly to the label, with "sticky = 'w'", then it overlaps on the left. I only have 3 columns so if I have more than 3 buttons I run out of columns. Below is my desired output (hopefully all of the buttons will be of equal width):
-------------------------
| |Title| |
|Item1| |Quantity1|
|Item2| |Quantity2|
|Item3| |Quantity3|
|Item4| |Quantity4|
| | (Intentionally blank)
|Item5| |Quantity5|
--------------------------
|B#1| B#2|B#3|B#4| B#5|B#6|
--------------------------- (end)
I had a similar problem, but for radiobuttons: a variable number of options which can change, in this case updated on the click of a button. I set a maximum number of columns and spill over into more rows if need be, as in the method below.
def labelling_add_radiobutton(self):
    '''add selected values as radiobuttons for labelling'''
    # destroy any existing radiobuttons in frame
    for child in self.frame3_labelling_labels.winfo_children():
        if isinstance(child, ttk.Radiobutton):
            child.destroy()
    # label variable and create radio button
    self.checkbox_validation_output = tk.StringVar(value = '')
    num_cols = 8  # sets max number of columns, rows are variable
    for count, value in enumerate(list_of_buttons):
        row = int((count + .99) / num_cols) + 1
        col = (((row - 1) + count) - ((row - 1) * num_cols) - (row - 1)) + 1
        ttk.Radiobutton(self.frame3_labelling_labels, text = value,
                        variable = self.checkbox_validation_output, value = value,
                        style = 'UL_radiobutton.TRadiobutton').grid(column = col, row = row,
                        sticky = 'w', padx = 10, pady = 10)
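The same row/column spill-over idea should work for ordinary buttons along the bottom of the grid. Here is a minimal, self-contained sketch (the button texts, the limit of 3 columns, and the start row are made-up values for illustration) that keeps the buttons equal width by letting the columns stretch:
import tkinter as tk

window = tk.Tk()

button_texts = ['B#1', 'B#2', 'B#3', 'B#4', 'B#5', 'B#6']  # hypothetical labels
num_cols = 3     # maximum number of button columns (assumed to match the 3-column grid above)
start_row = 8    # first free row below the item/quantity rows (assumed)

for count, text in enumerate(button_texts):
    row, col = divmod(count, num_cols)     # spill over onto a new row every num_cols buttons
    window.columnconfigure(col, weight=1)  # equal weights let the columns (and buttons) share the width evenly
    tk.Button(window, text=text).grid(row=start_row + row, column=col,
                                      sticky='ew', padx=2, pady=2)

window.mainloop()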
I have a pandas dataframe that includes multiple columns of monthly finance data. There is an input, period, that is specified by the person running the program. It's currently just saved as period, as shown below within the code.
#coded into python
period = ?? (user adds this in from input screen)
I need to create another column of data that uses the input period number to perform a calculation over the other columns.
So, in the above table I'd like to create a new column, 'calculation', that depends on the period input. For example, if a period of 1 was used, the calculation calc1 below would be performed (with the math actually done); period = 2, then calc2; period = 3, then calc3. I only need one column calculated, depending on the period number, but I added three examples in the picture below to show how it would work.
I can do this in SQL using CASE WHEN: using the input period, I then sum the columns I need.
select Account #,
'&Period' AS Period,
'&Year' AS YR,
case
When '&Period' = '1' then sum(d_cf+d_1)
when '&Period' = '2' then sum(d_cf+d_1+d_2)
when '&Period' = '3' then sum(d_cf+d_1+d_2+d_3)
I am unsure how to do this easily in Python (newer learner). Yes, I could create a column for every possible period (1-12) that does each calculation, and then only select that column, but I'd like to learn a more efficient way.
Can you help more or point me in a better direction?
You could certainly do something like
df[['d_cf'] + [f'd_{i}' for i in range(1, period+1)]].sum(axis=1)
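For example, with a hypothetical period of 2 and a frame that has the d_cf, d_1, d_2, ... columns from the SQL snippet, you could assign the result straight into a new column:
period = 2  # value supplied from the input screen
df['calculation'] = df[['d_cf'] + [f'd_{i}' for i in range(1, period + 1)]].sum(axis=1)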
You can do this using a simple function in python:
def get_calculation(df, period=None):
    '''
    df = pandas data frame
    period = integer type
    '''
    if period == 1:
        return df.apply(lambda x: x['d_0'] + x['d_1'], axis=1)
    if period == 2:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'], axis=1)
    if period == 3:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'] + x['d_3'], axis=1)
new_df = get_calculation(df, period = 1)
Setup:
df = pd.DataFrame({'d_0': list(range(1, 7)),
                   'd_1': list(range(10, 70, 10)),
                   'd_2': list(range(100, 700, 100)),
                   'd_3': list(range(1000, 7000, 1000))})
Setup:
import pandas as pd
ddict = {
    'Year': ['2018', '2018', '2018', '2018', '2018'],
    'Account_Num': ['1111', '1122', '1133', '1144', '1155'],
    'd_cf': ['1', '2', '3', '4', '5'],
}
data = pd.DataFrame(ddict)
Create value calculator:
def get_calcs(period):
    # Convert the period to a string
    s = str(period)
    # Number of repetitions is the period value plus one
    n = int(period) + 1
    # Repeat each digit of the period n times
    return ''.join([i * n for i in s])
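A quick sanity check of the helper (it repeats each digit of the period, period + 1 times):
get_calcs(1)  # '11'
get_calcs(3)  # '3333'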
The main function copies the data frame, iterates through the period values, and sets the calculated values to the correct spot, index-wise, in each relevant column:
def process_data(data_frame=data, period_column='d_cf'):
    # Copy the data_frame argument
    df = data_frame.copy(deep=True)
    # Run through each value in our period column
    for i in df[period_column].values.tolist():
        # Create a temporary column
        new_column = 'd_{}'.format(i)
        # Pass the period into our calculator; capture the result
        calculated_value = get_calcs(i)
        # Create a new column based on our period number
        df[new_column] = ''
        # Use indexing to place the calculated value into our desired location
        df.loc[df[period_column] == i, new_column] = calculated_value
    # Return the result
    return df
Start:
Year Account_Num d_cf
0 2018 1111 1
1 2018 1122 2
2 2018 1133 3
3 2018 1144 4
4 2018 1155 5
Result:
process_data(data)
   Year Account_Num d_cf d_1  d_2   d_3    d_4     d_5
0  2018        1111    1  11
1  2018        1122    2      222
2  2018        1133    3            3333
3  2018        1144    4                  44444
4  2018        1155    5                         555555
I've got a dataframe with categorical data.
import numpy as np
import pandas as pd
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import CheckboxGroup, Dropdown

A1 = ["cat", 'apple', 'red', 1, 2]
A2 = ['dog', 'grape', 'blue', 3, 4]
A3 = ['rat', 'grape', 'gray', 5, 6]
A4 = ['toad', 'kiwi', 'yellow', 7, 8]
df_MD = pd.DataFrame([A1, A2, A3, A4], columns=["animal", "fruit", "color", "length", "weight"])
animal fruit color length weight
0 cat apple red 1 2
1 dog grape blue 3 4
2 rat grape gray 5 6
3 toad kiwi yellow 7 8
I want to use bokeh serve to eventually create interactive plots.
I implemented this suggestion on how to add listeners:
tgtCol1 = 'animal'
catList = list(np.unique(df_MD[tgtCol1]))
def updatexy(tgtCol1, catList, df_MD):
    '''creates x and y values based on whether the entry is found in catList'''
    mybool = df_MD[tgtCol1] == catList[0]
    for cc in catList:
        mybool = (mybool) | (df_MD[tgtCol1] == cc)
    df_MD_mybool = df_MD[mybool].copy()
    x = df_MD_mybool['length'].copy()
    y = df_MD_mybool['weight'].copy()
    return (x, y)
x,y = updatexy(tgtCol1,catList,df_MD)
#create dropdown menu for column selection
menu = [x for x in zip(df_MD.columns,df_MD.columns)]
dropdown = Dropdown(label="select column", button_type="success", menu=menu)
def function_to_call(attr, old, new):
    print(dropdown.value)
dropdown.on_change('value', function_to_call)
dropdown.on_click(function_to_call)
#create buttons for category selection
catList = np.unique(df_MD[tgtCol1].dropna())
checkbox = CheckboxGroup(labels=catList)
def function_to_call2(attr, old, new):
    print(checkbox.value)
checkbox.on_change('value', function_to_call2)
checkbox.on_click(function_to_call2)
#slap everything together
layout = column(dropdown,checkbox)
#add it to the layout
curdoc().add_root(layout)
curdoc().title = "Selection Histogram"
This works ok to create the initial set of menus. But when I try to change the column or select different categories I get an error:
TypeError("function_to_call() missing 2 required positional arguments: 'old' and 'new'",)
so I can't even call the "listener" functions.
Once I call the listener functions, how do I update my list of checkbox values, as well as x and y?
I can update them within the scope of function_to_call and function_to_call2, but the global values for x, y, tgtCol1, and catList are unchanged!
I couldn't really find any guide or documentation on how listeners work, but after some experimentation I found that the structure for the listener is wrong. It should be as follows
myWidget = <some widget>(<arguments>)
def function_to_call(d):
    <actions, where d is a string or object corresponding to the widget state>
myWidget.on_click(function_to_call)  # add an event listener to myWidget
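For contrast, on_change (which produced the TypeError in the question) expects three arguments, while on_click expects one. A minimal sketch of the two signatures, where some_widget stands in for the widget created above and the handler names are placeholders (this matches the bokeh version used in the question):
def click_handler(new):                  # on_click passes a single value describing the new widget state
    print('clicked:', new)

def change_handler(attr, old, new):      # on_change passes the attribute name plus its old and new values
    print(attr, 'changed from', old, 'to', new)

some_widget.on_click(click_handler)
some_widget.on_change('value', change_handler)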
So, my code turns out to be
dropdown = Dropdown(label="select column", button_type="success", menu=menu)
def function_to_call(d):  # only one argument, the selection from the dropdown
    catList = list(np.unique(df_MD[d].dropna()))
    checkbox.labels = catList
    checkbox.active = []  # makes it so nothing is checked
dropdown.on_click(function_to_call)
#create buttons for category selection
checkbox = CheckboxGroup(labels=catList)
def function_to_call2(cb):
    tempindex = [x for x in cb]  # cb is some bokeh object, we convert it to a list
    # tgtCol1 is a global variable referring to the currently active column
    # not an ideal solution
    catList = np.unique(df_MD[tgtCol1].dropna())
    if len(tempindex) != 0: catList = list(catList[tempindex])
    x, y = updatexy(tgtCol1, catList, df_MD)  # gets new data based on desired column, category set, and dataframe
    s.data = dict(x=x, y=y)  # 'source' object to update a plot I made, unrelated to this answer, so I'm not showing it
checkbox.on_click(function_to_call2)
This is my first post so please be gentle. I have searched across the world wide web looking for a solution, but I am yet to find one. The problem I'm trying to solve is as follows:
I have a dataset comprised of 500,000+ samples, with 6 features per sample.
I have put this dataset in a multiindexed Pandas DataFrame.
The first level of my DataFrame is the timeseries index, the second level is the ID. It looks as follows:
Time                           id
2017-03-07 10:06:49.963241984  122.0    -7.024347
                               136.0   -11.664985
                               243.0     1.716150
2017-03-07 10:06:50.003462400  122.0    -7.025922
                               136.0   -11.671526
Every timestamp, a number of objects can be seen; they are marked by the label 'id'. For my application, I want to add a temporal dependency by including information
from 5 seconds ago, i.e. in this example from timestamp 10:06:45.
But, importantly, I only want to add this information if the object already existed at that timestamp (so if the id is equal).
I wanted to use the function dataframe.shift, as mentioned here, and I want to do it per level, as indicated by user Unutbu in How do you shift Pandas DataFrame with a multiindex?
My question is as follows:
How do I append extra columns to the original dataframe X with information on what those objects were 5 s ago? I would expect something like the following:
X['x_location_shifted'] = X.groupby(level=1)['x_location'].shift(5*rate)
with the rate being 25 Hz, i.e. we shift 125 "DateTimeIndices", but only if an object with id='...' exists at that timestamp.
EDIT:
The timestamps are not synchronized 100%, so the time gap is not always exactly equal to 0.04. Previously, I used np.argmin(np.abs(time-index)) to find the index closest to the stamp.
For example, in my set, at timestamp 2017-03-07 10:36:03.605008640 there is an object with id == 175 and location_x = 54.323.
id = 175
X.ix['2017-03-07 10:36:03.605008640', id] = 54.323
At timestamp 2017-03-07 10:36:08.604962560 ..... this object with id=175 has a location_x = 67.165955
id = 175
old_time = pd.to_datetime('2017-03-07 10:36:03.605008640')
new_time = old_time + pd.Timedelta('5 seconds')
# Finding the new value of location
X.ix[np.argmin(np.abs(new_time - X.index.get_level_values(0))), id]
So, finally, at timestep 10:36:08 I want to add the information from timestamp 10:36:03, IF the object already existed at that timestamp.
EDIT2:
After trying Maarten Fabré's solution, I came up with my own implementation, which you can find below. If anyone can show me a more pythonic way to do this, please let me know.
for current_time in X.index.get_level_values(0)[125:]:
    # only do if there are objects at current time
    if len(X.ix[current_time].index):
        # Calculate past time
        past_time = current_time - pd.Timedelta('5 seconds')
        # Find index in X.index that is closest to this past time
        past_time_index = np.argmin(np.abs(past_time - X.index.get_level_values(0)))
        # translate the index back to a label
        past_time = X.index[past_time_index][0]
        # in that timestep, cycle the objects
        for obj_id in X.ix[current_time].index:
            # Try looking for the value box_center.x of obj obj_id 5s ago
            try:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = X.ix[(past_time, obj_id), 'box_center.x']
                X.ix[(current_time, obj_id), 'box_center.y.shifted'] = X.ix[(past_time, obj_id), 'box_center.y']
                X.ix[(current_time, obj_id), 'relative_velocity.x.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.x']
                X.ix[(current_time, obj_id), 'relative_velocity.y.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.y']
            # If the key doesn't exist, the object doesn't exist, ergo the field should be np.nan
            except KeyError:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = np.nan
    print('Timestep {}'.format(current_time))
If this is not enough information, please say so and I can add it :)
Cheers and thanks!
Assuming that you have no gaps in the timestamps, one possible solution might be the following, which creates a new index with shifted timestamps and uses that to get the 5 seconds-ago values for each ID.
offset = 5 * rate
# Create a shallow copy of the multiindex levels for modification
modified_levels = list(X.index.levels)
# Shift them
modified_times = pd.Series(modified_levels[0]).shift(offset)
# Fill NaNs with dummy values to avoid duplicates in the new index
modified_times[modified_times.isnull()] = range(sum(modified_times.isnull()))
modified_levels[0] = modified_times
new_index = X.index.set_levels(modified_levels, inplace=False)
X['x_location_shifted'] = X.loc[new_index, 'x_location'].values
If the timestamps are not 100% regular, then you'll either have to round them to the nearest 1/x of a second, or use a loop.
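For the rounding option, a minimal sketch (assuming the 25 Hz rate from the question, i.e. a 40 ms grid, and a pandas version that provides DatetimeIndex.round) could be:
rounded_times = X.index.get_level_values(0).round('40ms')  # snap the timestamps onto the 25 Hz grid
X.index = pd.MultiIndex.from_arrays([rounded_times, X.index.get_level_values(1)],
                                    names=X.index.names)
After rounding, the shifted-index approach above can be applied unchanged.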
Alternatively, you could use a loop along the lines of the following.
Data definition
import pandas as pd
import numpy as np
from io import StringIO
df_str = """
timestamp id location
10:00:00.005 1 a
10:00:00.005 2 b
10:00:00.005 3 c
10:00:05.006 2 a
10:00:05.006 3 b
10:00:05.006 4 c"""
df = pd.read_csv(StringIO(df_str), sep=r'\s+', parse_dates=['timestamp'])  # DataFrame.from_csv is deprecated
delta = pd.to_timedelta(5, unit='s')
margin = pd.to_timedelta(1/50, unit='s')
df['location_shifted'] = np.nan
Loop over the different id's
for label_id in set(df['id']):
    df_id = df[df['id'] == label_id].copy()  # copy to make sure we don't overwrite the original data. Might not be necessary
    df_id['time_shift'] = df_id['timestamp'] + delta
    for row in df_id.itertuples():
        idx = row.Index
        time_dif = abs(df_id['timestamp'] - row.time_shift)
        shifted_locs = df_id[time_dif < margin]
        l = len(shifted_locs)
        if l:
            print(shifted_locs)
            if l == 1:
                idx_shift = shifted_locs.index[0]
            else:
                idx_shift = shifted_locs['time_shift'].idxmin()
            df.loc[idx_shift, 'location_shifted'] = df_id.loc[idx, 'location']
Results
                timestamp  id location location_shifted
0 2017-05-09 10:00:00.005   1        a              NaN
1 2017-05-09 10:00:00.005   2        b              NaN
2 2017-05-09 10:00:00.005   3        c              NaN
3 2017-05-09 10:00:05.006   2        a                b
4 2017-05-09 10:00:05.006   3        b                c
5 2017-05-09 10:00:05.006   4        c              NaN
For any of you arriving here with the same question: I managed to solve it in a (minimally) vectorized way, but it required me to go back to a 3D panel.
3 steps:
- make the frame into a 3D panel
- add the new columns
- fill those columns
From a multi-index 2D frame it's possible to change it to a pandas.Panel, where you convert the 2nd index to one of the axes in the panel.
After this I have a 3D panel with axes [time, objects, parameters]. Then, transpose the panel so that the PARAMETERS are the items, in order to be able to add columns to the data panel. So: transpose the panel, add the columns, transpose back.
dp_new = dp.transpose(2,0,1)
dp_new['shifted_box_center_x']=np.nan
dp_new['shifted_box_center_y']=np.nan
dp_new['shifted_relative_velocity_x']=np.nan
dp_new['shifted_relative_velocity_y']=np.nan
# transpose them back to their original form
dp_new = dp_new.transpose(1,2,0)
Now that we have added the new fields, we can get their names with
new_fields = dp_new.minor_axis[-4:]
The objective is to add information from 5 s ago, if the object existed then. Therefore, we cycle through the time series starting from the moment that lies 5 s after the start. In my case, at a rate of 25 Hz, this is element 5*rate = 125.
Let's first set the time to start from 5 s into the data panel:
time = dp_new.items[125:]
Then we iterate over an enumerated version of this time range. The enumeration starts at 0, which is the index of the data panel at timestep 0; the first timestep, however, is the one at time 0 + 5 seconds.
time = dp_new.items[125:]
for iloc, ts in enumerate(time):
    # Print progress
    print('{} out of {}'.format(ts, dp.items[-1]), end="\r", flush=True)
    # Generate new INDEX field, by taking the field ID and dropping the NaN values
    ids = dp_new.loc[ts].id.dropna().values
    # Drop the nan field from the frame
    dp_new[ts].dropna(thresh=5, inplace=True)
    # save the original indices
    original_index = {'index': dp_new.loc[ts].index, 'id': dp_new.loc[ts].id.values}
    # set the index to field id
    dp_new[ts].set_index(['id'], inplace=True)
    # Check if the vector ids does NOT contain ALL ZEROS
    if np.any(ids):  # Check for all zeros
        df_past = dp_new.iloc[iloc].copy()  # SCREENSHOT AT TS=5s --> ILOC = 0
        df_past.dropna(thresh=5, inplace=True)  # drop the nan rows
        df_past.set_index(['id'], inplace=True)  # set the index to field ID
        # 'fields' holds the original (unshifted) column names corresponding to new_fields
        dp_new[ts].loc[original_index['id'], new_fields] = df_past[fields].values
This will only fill in the fields for rows whose id is in ids.
This code was able to run on a 300,000-element file in about 5 minutes.
Note: I spent quite some time on this, mainly because of how one indexes a panel. At first, I thought calling the 3 dimensions would work, as stated in the pandas help, but it seems that this is not the case.
dp_new[ts, ids, new_fields] = values does NOT work.
I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe with a date just before that of the left dataframe. It's probably easiest to show with an example.
df1:
group date teacher
a 1/10/00 1
a 2/27/00 1
b 1/7/00 1
b 4/5/00 1
c 2/9/00 2
c 9/12/00 2
df2:
teacher date hair length
1 1/1/00 4
1 1/5/00 8
1 1/30/00 20
1 3/20/00 100
2 1/1/00 0
2 8/10/00 50
Gives us:
group date teacher hair length
a 1/10/00 1 8
a 2/27/00 1 20
b 1/7/00 1 8
b 4/5/00 1 100
c 2/9/00 2 0
c 9/12/00 2 50
Edit 1:
Hacked together a way to do this. Basically I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow; surely there must be a better way.
One way to do this is to create a new column in the left data frame, which will (for a given row's date) determine the value that is closest and earlier:
df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle Null values or other corner cases.
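As a concrete sketch of that merge (assuming the date columns are already parsed as datetimes and that the match should also respect teacher; column names follow the example above):
matched = pd.merge(df1, df2, how='left',
                   left_on=['teacher', 'join_date'],
                   right_on=['teacher', 'date'],
                   suffixes=('', '_df2'))
Note that the join_date helper above scans all of df2's dates rather than only those of the matching teacher, so for a strict per-teacher match you may want to compute it within a groupby('teacher') instead.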
This is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right hand data frame just until the date is larger:
# Assuming df1 and df2 are sorted by the dates
df1['hair length'] = 0  # initialize
r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)
for i, l_row in df1.iterrows():
    cur_hair_length = 0  # Assume 0 works when df1 has a date earlier than df2
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break
    df1.loc[i, 'hair length'] = cur_hair_length
Seems like the quickest way to do this is using sqlite via pysqldf:
# assuming the sqldf helper from the pandasql package (the answer mentions pysqldf)
from pandasql import sqldf

def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys):
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError as e:
        raise ValueError('Need to pass in both a group and date key for both tables') from e

    # Note: can't actually use group here as a field name due to sqlite
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                   FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                                MAX(tableb.{date_b}) AS tdate
                         FROM tablea
                         JOIN tableb
                           ON tablea.{group_a}=tableb.{group_b}
                          AND tablea.{date_a}>=tableb.{date_b}
                         GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                        ) AS a
                   JOIN tableb b
                     ON a.{group_a}=b.{group_b}
                    AND a.tdate=b.{date_b};
                """.format(group_a=tablea_group, date_a=tablea_date,
                           group_b=tableb_group, date_b=tableb_date,
                           temp_date='join_date', base_id=base_id)  # base_id is assumed to be defined elsewhere

    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())
    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + tablea_keys,
                    right_on=['group', tableb_group, 'join_date'])