Probability Density Function using pandas data - python

I would like to model the probability of an event occurring given the existence of the previous event.
To give you more context, I plan to group my data by anonymous_id, sort the values of the grouped dataset by timestamp (ts) and calculate the probability of the sequence of sources (utm_source) the person goes through. The person is represented by a unique anonymous_id. So the desired end goal is the probability of someone who came from a Facebook source to then come through from a Google source etc
I have been told that something like SciPy's gaussian_kde would be useful for this. However, from playing around with it, it appears to require numerical inputs. So far I have grouped and sorted the data:
test_sample = test_sample.groupby('anonymous_id').apply(lambda x: x.sort_values(['ts'])).reset_index(drop=True)
but I am not sure what to try next.
I have also tried this, but I don't think that it makes much sense:
stats.gaussian_kde(test_two['utm_source'])
Here is a sample of my data:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9},
'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
2: '0002226e-26a4-4f55-9578-2eff2999de7e',
3: '00022b83-240e-4ef9-aaad-ac84064bb902',
4: '00022b83-240e-4ef9-aaad-ac84064bb902',
5: '00022b83-240e-4ef9-aaad-ac84064bb902',
6: '00022b83-240e-4ef9-aaad-ac84064bb902',
7: '00022b83-240e-4ef9-aaad-ac84064bb902',
8: '00022b83-240e-4ef9-aaad-ac84064bb902',
9: '0002ed69-4aff-434d-a626-fc9b20ef1b02'},
'ts': {0: '2018-04-11 06:59:20.206000',
1: '2019-05-18 05:59:11.874000',
2: '2018-09-10 18:19:25.260000',
3: '2017-10-11 08:20:18.092000',
4: '2017-10-11 08:20:31.466000',
5: '2017-10-11 08:20:37.345000',
6: '2017-10-11 08:21:01.322000',
7: '2017-10-11 08:21:14.145000',
8: '2017-10-11 08:23:47.526000',
9: '2019-06-12 10:42:50.401000'},
'utm_source': {0: nan,
1: 'facebook',
2: 'facebook',
3: 'google',
4: nan,
5: 'facebook',
6: 'google',
7: 'adwords',
8: 'youtube',
9: nan},
'rank': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 4, 7: 5, 8: 6, 9: 1}}
Note: I converted the dataframe to a dictionary.

Here is one way you can do it (if I understand correctly):
from itertools import chain
from collections import Counter
groups = (df
    .sort_values(by='ts')
    .dropna()
    .groupby('anonymous_id').utm_source
    .agg(list)
    .reset_index()
)
groups['transitions'] = groups.utm_source.apply(lambda x: list(zip(x,x[1:])))
all_transitions = Counter(chain(*groups.transitions.tolist()))
Which gives you (on your example data):
In [42]: all_transitions
Out[42]:
Counter({('google', 'facebook'): 1,
('facebook', 'google'): 1,
('google', 'adwords'): 1,
('adwords', 'youtube'): 1})
Or are you looking for something different?
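The counts above are not yet probabilities. One way to get there, as a minimal sketch, is to divide each transition count by the total number of transitions leaving the same source, which gives an estimate of P(next source | current source). The names totals and transition_probs are made up for illustration, and the counts are copied from the sample output above.

```python
from collections import Counter, defaultdict

# Transition counts copied from the sample output above.
all_transitions = Counter({
    ('google', 'facebook'): 1,
    ('facebook', 'google'): 1,
    ('google', 'adwords'): 1,
    ('adwords', 'youtube'): 1,
})

# Total number of observed transitions leaving each source.
totals = defaultdict(int)
for (src, _dst), n in all_transitions.items():
    totals[src] += n

# Estimated P(next = dst | current = src) = count(src -> dst) / count(src -> any).
transition_probs = {
    (src, dst): n / totals[src]
    for (src, dst), n in all_transitions.items()
}

for (src, dst), p in sorted(transition_probs.items()):
    print(f"P({dst} | {src}) = {p:.2f}")
```

With only a handful of rows these estimates are very rough; on the full dataset the same division applies unchanged.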

Related

how to add text on each rectangle in collection using matplotlib?

I am using matplotlib to plot and visualize some graphs in xy coordinates.
I need to highlight some regions, and I use rectangles for this.
I would also like to add some text on each rectangle so that the regions can be told apart. How can I do this using patches, given that I have a lot of objects in the plot?
Here is the code I use to plot rectangles:
import pandas as pd
import matplotlib.pyplot as plt
# sample data for rectangles visualization
windows_df = pd.DataFrame( {'window_index_num': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}, 'left_pulse_num': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}, 'right_pulse_num': {0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 7, 6: 8, 7: 9, 8: 10, 9: 11}, 'idx_of_left_pulse': {0: 0, 1: 4036, 2: 4080, 3: 4107, 4: 4368, 5: 4491, 6: 4529, 7: 4624, 8: 4626, 9: 4639}, 'idx_of_right_pulse': {0: 4080, 1: 4107, 2: 4368, 3: 4491, 4: 4529, 5: 4624, 6: 4626, 7: 4639, 8: 4679, 9: 4781}, 'left_pulse_pos_in_E': {0: 10.002042118364418, 1: 40.29395464818188, 2: 41.19356816747343, 3: 41.76060061888303, 4: 47.90221207147802, 5: 51.27679395217831, 6: 52.39165780468267, 7: 55.37561818764979, 8: 55.47294132608167, 9: 55.99635666692289}, 'right_pulse_pos_in_E': {0: 41.19356816747343, 1: 41.76060061888303, 2: 47.90221207147802, 3: 51.27679395217831, 4: 52.39165780468267, 5: 55.37561818764979, 6: 55.47294132608167, 7: 55.99635666692289, 8: 57.33777021469516, 9: 60.984834434908144}, 'idx_window_left_border': {0: 0, 1: 3990, 2: 4058, 3: 4093, 4: 4237, 5: 4429, 6: 4510, 7: 4576, 8: 4625, 9: 4632}, 'idx_window_right_border': {0: 4094, 1: 4238, 2: 4430, 3: 4510, 4: 4577, 5: 4625, 6: 4633, 7: 4659, 8: 4730, 9: 4792}, 'left_win_pos_in_E': {0: 10.002042118364418, 1: 39.38459790393702, 2: 40.74003692229216, 3: 41.46513255508269, 4: 44.66179219947279, 5: 49.53272998148, 6: 51.82972979173252, 7: 53.82159300113625, 8: 55.40803086073492, 9: 55.76645477820397}, 'right_win_pos_in_E': {0: 41.48613320837913, 1: 44.6852679849016, 2: 49.56014983071213, 3: 51.82972979173252, 4: 53.85265044341121, 5: 55.40803086073492, 6: 55.79921126600202, 7: 56.66110947958804, 8: 59.119140585251095, 9: 61.39880967219205}, 'window_width': {0: 4095, 1: 249, 2: 373, 3: 418, 4: 341, 5: 197, 6: 124, 7: 84, 8: 106, 9: 161}, 'window_width_in_E': {0: 31.48409109001471, 1: 5.300670080964579, 2: 8.820112908419965, 3: 10.364597236649828, 4: 9.190858243938415, 5: 5.875300879254915, 6: 3.9694814742695, 7: 
2.8395164784517917, 8: 3.7111097245161773, 9: 5.632354893988079}, 'sum_pulses_duration_in_E': {0: 0.5157099691135514, 1: 0.5408987779694527, 2: 0.6869248977656355, 3: 0.7304908951030242, 4: 0.7269657511683718, 5: 0.537271616198268, 6: 0.7609034761658222, 7: 0.6178183490930067, 8: 0.8269277926972265, 9: 0.5591109437337494}, 'sum_pulse_sq': {0: 3.7944375922206044, 1: 3.8756992116858715, 2: 2.9661915477796663, 3: 3.070559830941317, 4: 3.0597037730539385, 5: 10.2020204659669, 6: 45.77535573608872, 7: 45.87630607524008, 8: 39.10335270063814, 9: 3.437205923490125}, 'pulse_to_window_rate': {0: 0.01638001769335214, 1: 0.10204347180781788, 2: 0.07788164447530765, 3: 0.0704794290047244, 4: 0.0790966122938326, 5: 0.09144580460471718, 6: 0.1916883807363909, 7: 0.2175787158769594, 8: 0.22282493757444324, 9: 0.09926770493999569}, 'max_height_in_window': {0: 20.815950580921104, 1: 20.815950580921104, 2: 5.324888970962656, 3: 5.324888970962656, 4: 5.14075603114903, 5: 86.81228155905252, 6: 110.06755904473022, 7: 110.06755904473022, 8: 110.06755904473022, 9: 14.735092268739246}, 'min_height_in_window': {0: -0.011928180619527797, 1: 1.6172637244080776, 2: 1.6172637244080776, 3: 0.8658702248969847, 4: 0.8658702248969847, 5: 0.8658702248969847, 6: 1.8476229914953515, 7: 2.918666252051556, 8: 3.2397786967451707, 9: 2.4893555139463266}, 'windows_sq': {0: 655.3712842149647, 1: 110.33848645112575, 2: 46.96612194869083, 3: 55.19032951390669, 4: 47.24795994896218, 5: 510.0482741740266, 6: 436.911136546121, 7: 312.538647650477, 8: 408.4727887246568, 9: 82.9932690531994}} )
fig_w, axs_w = plt.subplots()
#theoretical cross-section
#axs_w.plot(df_wo_NANS['E'], df_wo_NANS['theo_cs'], marker = "o", markersize = 1, linewidth = 1.0, alpha=0.6, color = 'green', label = 'Theo Cross Section')
axs_w.grid(color = 'grey', linestyle = '--', linewidth = 0.2)
#windows rectangular
from matplotlib.collections import PatchCollection
from matplotlib.patches import Rectangle
boxes = []
for index, row in windows_df.iterrows():
    current_rect_left_corner = (row['left_win_pos_in_E'], row['min_height_in_window'])
    current_w = row['window_width_in_E']
    current_h = row['max_height_in_window'] - row['min_height_in_window']
    boxes.append(Rectangle(current_rect_left_corner, current_w, current_h))
    left = row['left_win_pos_in_E']
    right = row['right_win_pos_in_E']
    bottom = row['min_height_in_window']
    top = row['max_height_in_window']
    #mark of the start of the current window
    axs_w.text(
        left, #left corner, #0.5*(left+right), #middle of the rectangle
        top, #top
        str(index),
        horizontalalignment='center',
        verticalalignment='center',
        fontsize=5
    )
    #mark of the end of the current window
    axs_w.text(
        right, #right corner, #0.5*(left+right), #middle of the rectangle
        top+0.5*bottom, #top
        str(index)+'e',
        horizontalalignment='center',
        verticalalignment='center',
        fontsize=5
    )
pc = PatchCollection(boxes, facecolor='y', alpha=0.2, edgecolor='black')
axs_w.add_collection(pc)
I added the text marks inside the loop, but is it possible to do this with patches and collections to make the code more efficient?

python dataframe to dictionary with multiple columns in keys and values

I am working on an optimization problem and need to create indexing to build a mixed-integer mathematical model. I am using python dictionaries for the task. Below is a sample of my dataset. Full dataset is expected to have about 400K rows if that matters.
# sample input data
pd.DataFrame.from_dict({'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}})
The data frame looks like this:
I am looking to generate a dictionary from this dataframe, one entry per row, where the keys and values are tuples made of multiple column values. The output would look like this:
(origin, dest, product, ship_date): (origin, dest, product, truck_in)
# for example, first two rows will become a dictionary key-value pair like
{('perris', 'alexandria', 'bike', '2/25/2022'): ('perris', 'alexandria', 'bike', '3/1/2022'),
('perris', 'alexandria', 'bike', '2/26/2022'): ('perris', 'alexandria', 'bike', '3/2/2022')}
I am very new to python and couldn't figure out how to do this. Any help is appreciated. Thanks!
You can loop through the DataFrame.
Assuming your DataFrame is called "df", this gives you the dict:
result_dict = {}
for idx, row in df.iterrows():
    result_dict[(row.origin, row.dest, row['product'], row.ship_date)] = (
        row.origin, row.dest, row['product'], row.truck_in)
Since looping through 400k rows will take some time, have a look at tqdm (https://tqdm.github.io/) to get a progress bar with a time estimate that quickly tells you if the approach works for your dataset.
Also, note that 400K dictionary entries may take up a lot of memory, so you may want to estimate whether the dict fits in memory.
Another way, which wastes memory but is faster, is to do it in pandas.
Create a new column with the value for the dictionary:
df['value'] = df.apply(lambda x: (x.origin, x.dest, x['product'], x.truck_in), axis=1)
Then set the index and convert to a dict:
df.set_index(['origin','dest','product','ship_date'])['value'].to_dict()
The approach below splits the initial dataframe into two dataframes that will be the source of the keys and values in the dictionary. These are then converted to arrays in order to get away from working with dataframes as soon as possible. The arrays are converted to tuples and zipped together to create the key:value pairs.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(
{'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}}
)
#display(df)
#desired output: (origin, dest, product, ship_date): (origin, dest, product, truck_in)
#slice df to key/value chunks
#list to array
# set_index without inplace avoids modifying a slice of df (SettingWithCopyWarning)
ship = df[['origin', 'dest', 'product', 'ship_date']].set_index('origin')
keys_array = ship.to_records()
truck = df[['origin', 'dest', 'product', 'truck_in']].set_index('origin')
values_array = truck.to_records()
#array_of_tuples = map(tuple, an_array)
keys_map = map(tuple, keys_array)
values_map = map(tuple, values_array)
#tuple_of_tuples = tuple(array_of_tuples)
keys_tuple = tuple(keys_map)
values_tuple = tuple(values_map)
zipp = zip(keys_tuple, values_tuple)
dict2 = dict(zipp)
print(dict2)
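The same key/value pairing can also be built by zipping the columns directly, which skips the row-by-row iterrows and apply steps. This is only a minimal sketch, assuming the column names from the question and a shortened two-row copy of the sample data; the name result is made up for illustration.

```python
import pandas as pd

# Two rows of the sample data, reduced to the columns that are needed.
df = pd.DataFrame({
    'origin': ['perris', 'perris'],
    'dest': ['alexandria', 'alexandria'],
    'product': ['bike', 'bike'],
    'ship_date': ['02/25/2022', '02/26/2022'],
    'truck_in': ['03/01/2022', '03/02/2022'],
})

# Each zip yields one tuple per row; df['product'] is used because
# df.product refers to the DataFrame.product method, not the column.
keys = zip(df['origin'], df['dest'], df['product'], df['ship_date'])
values = zip(df['origin'], df['dest'], df['product'], df['truck_in'])

result = dict(zip(keys, values))
print(result)
```

If two rows share the same key tuple, the later row silently overwrites the earlier one, as with any dict.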

How to export to excel with pandas dataframe with multi column

I'm stuck exporting a multi-index dataframe to Excel in the layout I'm looking for.
This is what I'm looking for in Excel.
I know I have to add an extra index parameter on the left for the rows of SRR (%) and Traction (-), but how?
My code so far:
import pandas as pd
import matplotlib.pyplot as plt
data = {'Step 1': {'Step Typ': 'Traction', 'SRR (%)': {1: 8.384, 2: 9.815, 3: 7.531, 4: 10.209, 5: 7.989, 6: 7.331, 7: 5.008, 8: 2.716, 9: 9.6, 10: 7.911}, 'Traction (-)': {1: 5.602, 2: 6.04, 3: 2.631, 4: 2.952, 5: 8.162, 6: 9.312, 7: 4.994, 8: 2.959, 9: 10.075, 10: 5.498}, 'Temperature': 30, 'Load': 40}, 'Step 3': {'Step Typ': 'Traction', 'SRR (%)': {1: 2.909, 2: 5.552, 3: 5.656, 4: 9.043, 5: 3.424, 6: 7.382, 7: 3.916, 8: 2.665, 9: 4.832, 10: 3.993}, 'Traction (-)': {1: 9.158, 2: 6.721, 3: 7.787, 4: 7.491, 5: 8.267, 6: 2.985, 7: 5.882, 8: 3.591, 9: 6.334, 10: 10.43}, 'Temperature': 80, 'Load': 40}, 'Step 5': {'Step Typ': 'Traction', 'SRR (%)': {1: 4.765, 2: 9.293, 3: 7.608, 4: 7.371, 5: 4.87, 6: 4.832, 7: 6.244, 8: 6.488, 9: 5.04, 10: 2.962}, 'Traction (-)': {1: 6.656, 2: 7.872, 3: 8.799, 4: 7.9, 5: 4.22, 6: 6.288, 7: 7.439, 8: 7.77, 9: 5.977, 10: 9.395}, 'Temperature': 30, 'Load': 70}, 'Step 7': {'Step Typ': 'Traction', 'SRR (%)': {1: 9.46, 2: 2.83, 3: 3.249, 4: 9.273, 5: 8.792, 6: 9.673, 7: 6.784, 8: 3.838, 9: 8.779, 10: 4.82}, 'Traction (-)': {1: 5.245, 2: 8.491, 3: 10.088, 4: 9.988, 5: 4.886, 6: 4.168, 7: 8.628, 8: 5.038, 9: 7.712, 10: 3.961}, 'Temperature': 80, 'Load': 70} }
df = pd.DataFrame(data)
items = list()
series = list()
for item, d in data.items():
    items.append(item)
    series.append(pd.DataFrame.from_dict(d))
df = pd.concat(series, keys=items)
# set_index(..., inplace=True) returns None, so it cannot be chained with .T
df.set_index(['Step Typ', 'Load', 'Temperature']).T.to_excel('testfile.xlsx')
The picture below shows df.set_index(['Step Typ', 'Load', 'Temperature']).T as a dataframe (close, but not exactly what I'm looking for):
Edit 1:
I found a good solution. It is not exactly what I was looking for, but it is still worth using.
df.reset_index().drop(["level_0","level_1"], axis=1).pivot(columns=["Step Typ", "Load", "Temperature"], values=["SRR (%)", "Traction (-)"]).apply(lambda x: pd.Series(x.dropna().values)).to_excel("solution.xlsx")
Can you explain clearly and show the output you are looking for?
To export a table to Excel, use df.to_excel('path', index=True) or df.to_excel('path', index=False),
where index controls whether the index column is written to the file.

How to update a seaborn line plot with ipywidgets checkboxes?

I am struggling with the ipywidgets module.
I am trying to make a plot where you can toggle lines off/on with checkboxes based on a province.
fig, ax = plt.subplots(figsize=(10,10))
sns.lineplot(data=df5, x="Date_of_report", y="Total_reported", hue="Province", ax=ax)
provinces = df5["Province"].unique()
chk = [widgets.Checkbox(description=a) for a in provinces]
def updatePlot(**kwargs):
    print([(k,v) for k, v in kwargs.items()])
widgets.interact(updatePlot, **{c.description: c.value for c in chk})
As you can see, I can draw the checkboxes and it prints out the status of the boxes,
but I don't know how to update the seaborn line plot so that when you select, say, Drenthe, it only shows the line for Drenthe.
Here is the dataframe as a dict:
{'Date_of_report': {0: Timestamp('2020-03-13 10:00:00'), 1: Timestamp('2020-03-13 10:00:00'), 2: Timestamp('2020-03-13 10:00:00'), 3: Timestamp('2020-03-13 10:00:00'), 4: Timestamp('2020-03-13 10:00:00'), 5: Timestamp('2020-03-13 10:00:00'), 6: Timestamp('2020-03-13 10:00:00'), 7: Timestamp('2020-03-13 10:00:00'), 8: Timestamp('2020-03-13 10:00:00'), 9: Timestamp('2020-03-13 10:00:00')}, 'Province': {0: 'Drenthe', 1: 'Flevoland', 2: 'Friesland', 3: 'Gelderland', 4: 'Groningen', 5: 'Limburg', 6: 'Noord-Brabant', 7: 'Noord-Holland', 8: 'Overijssel', 9: 'Utrecht'}, 'Total_reported': {0: 14, 1: 7, 2: 8, 3: 64, 4: 4, 5: 71, 6: 377, 7: 66, 8: 18, 9: 83}, 'Hospital_admission': {0: 0, 1: 3, 2: 2, 3: 9, 4: 1, 5: 17, 6: 65, 7: 4, 8: 0, 9: 7}, 'Deceased': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 3, 6: 5, 7: 0, 8: 0, 9: 0}}
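One possible direction, sketched under assumptions: the checkbox states arrive in updatePlot as keyword arguments, so they can be turned into a list of ticked provinces and used to filter the dataframe before plotting. The helper name selected_rows and the three-row df5 below are made up for illustration; only the filtering step is shown, not the widget or redraw wiring.

```python
import pandas as pd

# Reduced stand-in for df5 with the columns relevant to filtering.
df5 = pd.DataFrame({
    "Province": ["Drenthe", "Flevoland", "Friesland"],
    "Total_reported": [14, 7, 8],
})

def selected_rows(df, **checked):
    """Return only the rows whose Province checkbox is ticked."""
    wanted = [name for name, is_on in checked.items() if is_on]
    return df[df["Province"].isin(wanted)]

subset = selected_rows(df5, Drenthe=True, Flevoland=False, Friesland=True)
print(subset["Province"].tolist())
```

Inside updatePlot, the filtered frame could then be passed to sns.lineplot in place of df5; how the figure is cleared and redrawn depends on the notebook backend and is not covered here.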

How to change from index to multiindex - pandas

I've got a data frame structured as below:
dict1 = {'id': {0: 11, 1: 12, 2: 13, 3: 14, 4: 15, 5: 16, 6: 19, 7: 18, 8: 17},
'var1': {0: 20.272108843537413,
1: 21.088435374149658,
2: 20.68027210884354,
3: 23.945578231292515,
4: 22.857142857142854,
5: 21.496598639455787,
6: 39.18367346938776,
7: 36.46258503401361,
8: 34.965986394557824},
'var2': {0: 27.731092436974773,
1: 43.907563025210074,
2: 55.67226890756303,
3: 62.81512605042017,
4: 71.63865546218487,
5: 83.40336134453781,
6: 43.48739495798319,
7: 59.243697478991606,
8: 67.22689075630252},
'var3': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}}
ex = pd.DataFrame(dict1).set_index('id')
id is set as an index, but now I would like to create a MultiIndex from var3 and id. But my following attempt fails:
ex.set_index(['var3', 'id'])
How can I set a MultiIndex directly from the existing Index? I know I can call reset_index first and then set a MultiIndex, but it feels like there has to be a more elegant way.
DataFrame.set_index has an append argument, which is False by default.
If you have a DataFrame already indexed by "id", and you'd like to append "var3" to that, simply invoke:
new_df = ex.set_index("var3", append=True)
As suggested by @piRSquared in the comments, you can also swap the order if you would like "var3" to come first, by chaining a call to swaplevel. I.e.:
new_df = ex.set_index("var3", append=True).swaplevel(0, 1)
Like this:
ex.set_index(['var3', ex.index])
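The two answers can be compared side by side. The sketch below uses a shortened, rounded version of the question's data (an assumption made to keep it small) and checks that both routes end up with the same (var3, id) index.

```python
import pandas as pd

# Shortened version of the question's frame, already indexed by id.
ex = pd.DataFrame({
    'id': [11, 12, 19],
    'var1': [20.3, 21.1, 39.2],
    'var3': [1, 1, 2],
}).set_index('id')

# Approach 1: append var3 to the existing index, then move it to the front.
a = ex.set_index('var3', append=True).swaplevel(0, 1)

# Approach 2: pass the column name together with the current index.
b = ex.set_index(['var3', ex.index])

print(list(a.index.names))
print(list(a.index) == list(b.index))
```

Without the swaplevel call, the first approach keeps id as the outer level and var3 as the inner one.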
