I am using matplotlib for plotting and convenient visualization of some graphs in xy coordinates.
I need to highlight some regions - and I use rectangles for this.
But I would like to add some text on top of each rectangle so I can tell those regions apart. How can I do this efficiently with patches, given that I have a lot of objects in the plot?
Here is the code I use to plot rectangles:
import pandas as pd
import matplotlib.pyplot as plt

# sample data for rectangles visualization
windows_df = pd.DataFrame( {'window_index_num': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}, 'left_pulse_num': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}, 'right_pulse_num': {0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 7, 6: 8, 7: 9, 8: 10, 9: 11}, 'idx_of_left_pulse': {0: 0, 1: 4036, 2: 4080, 3: 4107, 4: 4368, 5: 4491, 6: 4529, 7: 4624, 8: 4626, 9: 4639}, 'idx_of_right_pulse': {0: 4080, 1: 4107, 2: 4368, 3: 4491, 4: 4529, 5: 4624, 6: 4626, 7: 4639, 8: 4679, 9: 4781}, 'left_pulse_pos_in_E': {0: 10.002042118364418, 1: 40.29395464818188, 2: 41.19356816747343, 3: 41.76060061888303, 4: 47.90221207147802, 5: 51.27679395217831, 6: 52.39165780468267, 7: 55.37561818764979, 8: 55.47294132608167, 9: 55.99635666692289}, 'right_pulse_pos_in_E': {0: 41.19356816747343, 1: 41.76060061888303, 2: 47.90221207147802, 3: 51.27679395217831, 4: 52.39165780468267, 5: 55.37561818764979, 6: 55.47294132608167, 7: 55.99635666692289, 8: 57.33777021469516, 9: 60.984834434908144}, 'idx_window_left_border': {0: 0, 1: 3990, 2: 4058, 3: 4093, 4: 4237, 5: 4429, 6: 4510, 7: 4576, 8: 4625, 9: 4632}, 'idx_window_right_border': {0: 4094, 1: 4238, 2: 4430, 3: 4510, 4: 4577, 5: 4625, 6: 4633, 7: 4659, 8: 4730, 9: 4792}, 'left_win_pos_in_E': {0: 10.002042118364418, 1: 39.38459790393702, 2: 40.74003692229216, 3: 41.46513255508269, 4: 44.66179219947279, 5: 49.53272998148, 6: 51.82972979173252, 7: 53.82159300113625, 8: 55.40803086073492, 9: 55.76645477820397}, 'right_win_pos_in_E': {0: 41.48613320837913, 1: 44.6852679849016, 2: 49.56014983071213, 3: 51.82972979173252, 4: 53.85265044341121, 5: 55.40803086073492, 6: 55.79921126600202, 7: 56.66110947958804, 8: 59.119140585251095, 9: 61.39880967219205}, 'window_width': {0: 4095, 1: 249, 2: 373, 3: 418, 4: 341, 5: 197, 6: 124, 7: 84, 8: 106, 9: 161}, 'window_width_in_E': {0: 31.48409109001471, 1: 5.300670080964579, 2: 8.820112908419965, 3: 10.364597236649828, 4: 9.190858243938415, 5: 5.875300879254915, 6: 3.9694814742695, 7: 2.8395164784517917, 8: 3.7111097245161773, 9: 5.632354893988079}, 'sum_pulses_duration_in_E': {0: 0.5157099691135514, 1: 0.5408987779694527, 2: 0.6869248977656355, 3: 0.7304908951030242, 4: 0.7269657511683718, 5: 0.537271616198268, 6: 0.7609034761658222, 7: 0.6178183490930067, 8: 0.8269277926972265, 9: 0.5591109437337494}, 'sum_pulse_sq': {0: 3.7944375922206044, 1: 3.8756992116858715, 2: 2.9661915477796663, 3: 3.070559830941317, 4: 3.0597037730539385, 5: 10.2020204659669, 6: 45.77535573608872, 7: 45.87630607524008, 8: 39.10335270063814, 9: 3.437205923490125}, 'pulse_to_window_rate': {0: 0.01638001769335214, 1: 0.10204347180781788, 2: 0.07788164447530765, 3: 0.0704794290047244, 4: 0.0790966122938326, 5: 0.09144580460471718, 6: 0.1916883807363909, 7: 0.2175787158769594, 8: 0.22282493757444324, 9: 0.09926770493999569}, 'max_height_in_window': {0: 20.815950580921104, 1: 20.815950580921104, 2: 5.324888970962656, 3: 5.324888970962656, 4: 5.14075603114903, 5: 86.81228155905252, 6: 110.06755904473022, 7: 110.06755904473022, 8: 110.06755904473022, 9: 14.735092268739246}, 'min_height_in_window': {0: -0.011928180619527797, 1: 1.6172637244080776, 2: 1.6172637244080776, 3: 0.8658702248969847, 4: 0.8658702248969847, 5: 0.8658702248969847, 6: 1.8476229914953515, 7: 2.918666252051556, 8: 3.2397786967451707, 9: 2.4893555139463266}, 'windows_sq': {0: 655.3712842149647, 1: 110.33848645112575, 2: 46.96612194869083, 3: 55.19032951390669, 4: 47.24795994896218, 5: 510.0482741740266, 6: 436.911136546121, 7: 312.538647650477, 8: 408.4727887246568, 9: 
82.9932690531994}} )
fig_w, axs_w = plt.subplots()
#theoretical cross-section
#axs_w.plot(df_wo_NANS['E'], df_wo_NANS['theo_cs'], marker = "o", markersize = 1, linewidth = 1.0, alpha=0.6, color = 'green', label = 'Theo Cross Section')
axs_w.grid(color = 'grey', linestyle = '--', linewidth = 0.2)
#windows rectangular
from matplotlib.collections import PatchCollection
from matplotlib.patches import Rectangle
boxes = []
for index, row in windows_df.iterrows():
    current_rect_left_corner = (row['left_win_pos_in_E'], row['min_height_in_window'])
    current_w = row['window_width_in_E']
    current_h = row['max_height_in_window'] - row['min_height_in_window']
    boxes.append(Rectangle(current_rect_left_corner, current_w, current_h))

    left = row['left_win_pos_in_E']
    right = row['right_win_pos_in_E']
    bottom = row['min_height_in_window']
    top = row['max_height_in_window']

    # mark of the start of the current window
    axs_w.text(
        left,  # left corner; 0.5*(left+right) would be the middle of the rectangle
        top,
        str(index),
        horizontalalignment='center',
        verticalalignment='center',
        fontsize=5,
    )
    # mark of the end of the current window
    axs_w.text(
        right,  # right corner
        top + 0.5*bottom,  # slightly offset from the top
        str(index) + 'e',
        horizontalalignment='center',
        verticalalignment='center',
        fontsize=5,
    )
pc = PatchCollection(boxes, facecolor='y', alpha=0.2, edgecolor='black')
axs_w.add_collection(pc)
I added the text marks with a loop, but is it possible to do it with patches and collections to make the code more efficient?
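As far as I know, a PatchCollection can only hold Patch artists, so Text objects cannot be batched into a collection the same way; a per-label loop is still needed. Here is a minimal sketch of a slightly cheaper labeling loop (itertuples instead of iterrows), assuming the same columns as above:
# Sketch: Text cannot go into a PatchCollection, but itertuples() makes the
# labeling loop cheaper than iterrows() on large frames.
for row in windows_df.itertuples():
    axs_w.text(row.left_win_pos_in_E, row.max_height_in_window,
               str(row.Index), ha='center', va='center', fontsize=5)
    axs_w.text(row.right_win_pos_in_E,
               row.max_height_in_window + 0.5 * row.min_height_in_window,
               str(row.Index) + 'e', ha='center', va='center', fontsize=5)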
I am new to Python and I am struggling to reshape my DataFrame.
For a particular client (contact_id), I want to add a new date column that contains the DTHR_OPERATION date for TYPE_OPER_VALIDATION = 3 minus the DTHR_OPERATION date for TYPE_OPER_VALIDATION = 1.
If TYPE_OPER_VALIDATION is equal to 3 and there is less than an hour's difference between those two dates, I want to add a string such as 'connection' in the new column.
I get the error "'Series' object has no attribute 'total_seconds'" when I try to check whether the time difference is less than or equal to an hour. I tried many solutions I found on the Internet, but I always seem to have a data type issue.
Here is my code snippet:
df_oper_one = merged_table.loc[(merged_table['TYPE_OPER_VALIDATION']==1), ['contact_id','TYPE_OPER_VALIDATION','DTHR_OPERATION']]
df_oper_three = merged_table.loc[(merged_table['TYPE_OPER_VALIDATION']==3), ['contact_id','TYPE_OPER_VALIDATION','DTHR_OPERATION']]
connection = []
for row in merged_table['contact_id']:
    if (df_validation.loc[(df_validation['TYPE_OPER_VALIDATION']==3)]) & ((pd.to_datetime(df_oper_three['DTHR_OPERATION'], format='%Y-%m-%d %H:%M:%S') - pd.to_datetime(df_oper_one['DTHR_OPERATION'], format='%Y-%m-%d %H:%M:%S').total_seconds()) <= 3600):
        connection.append('connection')
    # if diff_date.total_seconds() <= 3600: connection.append('connection')
    else:
        connection.append('null')
merged_table['connection'] = pd.Series(connection)
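The immediate exception comes from calling total_seconds() on a whole Series: it exists on a single Timedelta, while a Series of timedeltas needs the .dt accessor. A minimal sketch of just that fix (note the two Series must be index-aligned for the subtraction to be meaningful, which is why the answer below merges them first):
# Sketch: .dt.total_seconds() is the vectorized form, one value per row
diff = (pd.to_datetime(df_oper_three['DTHR_OPERATION'], format='%Y-%m-%d %H:%M:%S')
        - pd.to_datetime(df_oper_one['DTHR_OPERATION'], format='%Y-%m-%d %H:%M:%S'))
seconds = diff.dt.total_seconds()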
Hello Nicolas and welcome to Stack Overflow. Please remember to always include sample data to reproduce your issue. Here is sample data to reproduce part of your dataframe:
df = pd.DataFrame({'Id contact':['cf2e79bc-8cac-ec11-9840-000d3ab078e6']*12+['865c5edf-c7ac-ec11-9840-000d3ab078e6']*10,
'DTHR OPERATION':['11/10/2022 07:07', '11/10/2022 07:29', '11/10/2022 15:47', '11/10/2022 16:22', '11/10/2022 16:44', '11/10/2022 18:06', '12/10/2022 07:11', '12/10/2022 07:25', '12/10/2022 17:21', '12/10/2022 18:04', '13/10/2022 07:09', '13/10/2022 18:36', '14/09/2022 17:59', '15/09/2022 09:34', '15/09/2022 19:17', '16/09/2022 08:31', '16/09/2022 19:18', '17/09/2022 06:41', '17/09/2022 11:19', '17/09/2022 15:48', '17/09/2022 16:13', '17/09/2022 17:07'],
'lastname':['BOUALAMI']*12+['VERVOORT']*10,
'TYPE_OPER_VALIDATION':[1, 3, 1, 3, 3, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 3, 3]})
df['DTHR OPERATION'] = pd.to_datetime(df['DTHR OPERATION'], dayfirst=True)  # dates above are dd/mm/yyyy
I would recommend creating a new table to more easily accomplish your task:
df2 = pd.merge(df[['Id contact', 'DTHR OPERATION']][df['TYPE_OPER_VALIDATION']==3], df[['Id contact', 'DTHR OPERATION']][df['TYPE_OPER_VALIDATION']==1], on='Id contact', suffixes=('_type3','_type1'))
Then find the time difference:
df2['seconds'] = (df2['DTHR OPERATION_type3']-df2['DTHR OPERATION_type1']).dt.total_seconds()
Finally, flag connections of an hour or less:
import numpy as np

df2['connection'] = np.where(df2['seconds']<=3600, 'yes', 'no')
Hope this helps!
Sure, here is the information you are looking for:
from pandas import Timestamp  # the dumps below use bare Timestamp

df_contact = pd.DataFrame({'contact_id': {0: '865C5EDF-C7AC-EC11-9840', 1: '9C9690B1-F8AC-EC11', 2: '4DD27359-14AF-EC11-9840', 3: '0091373E-E7F4-4170-BCAC'}, 'birthdate': {0: Timestamp('2005-05-19 00:00:00'), 1: Timestamp('1982-01-28 00:00:00'), 2: Timestamp('1997-05-15 00:00:00'), 3: Timestamp('2005-03-22 00:00:00')}, 'fullname': {0: 'Laura VERVO', 1: 'Mélanie ALBE', 2: 'Eric VANO', 3: 'Jean Docq'}, 'lastname': {0: 'VERVO', 1: 'ALBE', 2: 'VANO', 3: 'Docq'}, 'age': {0: 17, 1: 40, 2: 25, 3: 17}})
df_validation = pd.DataFrame({'validation_id': {0: 8263835881, 1: 8263841517, 2: 8263843376, 3: 8263843377, 4: 8263843381, 5: 8263843382, 6: 8263863088, 7: 8263863124, 8: 8263868113, 9: 8263868123}, 'LIBEL_LONG_PRODUIT_TITRE': {0: 'Mens NEXT 12-17', 1: 'Ann NEXT 25-64%B', 2: 'Ann EXPRESS CBLANCHE', 3: 'Multi 8 NEXT', 4: 'Ann EXPRESS 18-24', 5: 'SNCB+TEC NEXT ABO', 6: 'Ann EXPRESS 18-24', 7: 'Ann EXPRESS 12-17%B', 8: '1 jour EX Réfugié', 9: 'Ann EXPRESS 2564%B'}, 'DTHR_OPERATION': {0: Timestamp('2022-10-01 00:02:02'), 1: Timestamp('2022-10-01 00:22:45'), 2: Timestamp('2022-10-01 00:02:45'), 3: Timestamp('2022-10-01 00:02:49'), 4: Timestamp('2022-10-01 00:07:03'), 5: Timestamp('2022-10-01 00:07:06'), 6: Timestamp('2022-10-01 00:07:40'), 7: Timestamp('2022-10-01 00:31:51'), 8: Timestamp('2022-10-01 00:03:33'), 9: Timestamp('2022-10-01 00:07:40')}, 'TYPE_OPER_VALIDATION': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 3, 7: 3, 8: 2, 9: 1}, 'NUM_SERIE_SUPPORT': {0: '2040121921', 1: '2035998914', 2: '2034456458', 3: '14988572652829627697', 4: '2035956003', 5: '2033613155', 6: '2040119429', 7: '2036114867', 8: '14988572650230713650', 9: '2040146199'}})
df_support = pd.DataFrame({'support_id': {0: '8D3A331D-3E86-EC11-93B0', 1: '44863926-3E86-EC11', 2: '45863926-3E86-EC11-93B0', 3: '46863926-3E86-EC11-93B0', 4: '47863926-3E86-EC11-93B0', 5: 'E3863926-3E86-EC11-93B0', 6: '56873926-3E86-EC11-93B0', 7: 'E3CE312C-3E86-EC11-93B0', 8: 'F3CE312C-3E86-EC11-93B0', 9: '3CCF312C-3E86-EC11-93B0'}, 'bd_linkedcustomer': {0: '15CCC384-C4AD-EC11-9840', 1: '9D27061D-14AE-EC11-9840', 2: '74CAE68F-D4AC-EC11-9840', 3: '18F5FE1A-58AC-EC11-983F', 4: None, 5: '9FBDA103-2FAD-EC11', 6: 'EEA1FB63-75AC-EC11-9840', 7: 'F150EC3D-0DAD-EC11-9840', 8: '111DE8C4-CAAC-EC11-9840', 9: None}, 'bd_supportserialnumber': {0: '44884259', 1: '2036010559', 2: '62863150', 3: '2034498160', 4: '62989611', 5: '2036094315', 6: '2033192919', 7: '2036051529', 8: '2036062236', 9: '2033889172'}})
df2 = pd.DataFrame({'support_id': {0: '4BE73E8C-B8F9-EC11-BB3D', 1: '4BE73E8C-B8F9-EC11-BB3D', 2: '4BE73E8C-B8F9-EC11-BB3D', 3: '4BE73E8C-B8F9-EC11-BB3D', 4: '4BE73E8C-B8F9-EC11-BB3D', 5: '4BE73E8C-B8F9-EC11-BB3D', 6: '4BE73E8C-B8F9-EC11', 7: '4BE73E8C-B8F9-EC11-BB3D', 8: '4BE73E8C-B8F9-EC11-BB3D', 9: '4BE73E8C-B8F9-EC11-BB3D'}, 'bd_linkedcustomer': {0: '9C9690B1-F8AC-EC11-9840', 1: '9C9690B1-F8AC-EC11-9840', 2: '9C9690B1-F8AC-EC11-9840', 3: '9C9690B1-F8AC-EC11-9840', 4: '9C9690B1-F8AC-EC11-9840', 5: '9C9690B1-F8AC-EC11-9840', 6: '9C9690B1-F8AC-EC11-9840', 7: '9C9690B1-F8AC-EC11-9840', 8: '9C9690B1-F8AC-EC11-9840', 9: '9C9690B1-F8AC-EC11-9840'}, 'bd_supportserialnumber': {0: '2036002771', 1: '2036002771', 2: '2036002771', 3: '2036002771', 4: '2036002771', 5: '2036002771', 6: '2036002771', 7: '2036002771', 8: '2036002771', 9: '2036002771'}, 'contact_id': {0: '9C9690B1-F8AC-EC11-9840', 1: '9C9690B1-F8AC-EC11-9840', 2: '9C9690B1-F8AC-EC11-9840', 3: '9C9690B1-F8AC-EC11-9840', 4: '9C9690B1-F8AC-EC11-9840', 5: '9C9690B1-F8AC-EC11-9840', 6: '9C9690B1-F8AC-EC11-9840', 7: '9C9690B1-F8AC-EC11-9840', 8: '9C9690B1-F8AC-EC11-9840', 9: '9C9690B1-F8AC-EC11-9840'}, 'birthdate': {0: Timestamp('1982-01-28 00:00:00'), 1: Timestamp('1982-01-28 00:00:00'), 2: Timestamp('1982-01-28 00:00:00'), 3: Timestamp('1982-01-28 00:00:00'), 4: Timestamp('1982-01-28 00:00:00'), 5: Timestamp('1982-01-28 00:00:00'), 6: Timestamp('1982-01-28 00:00:00'), 7: Timestamp('1982-01-28 00:00:00'), 8: Timestamp('1982-01-28 00:00:00'), 9: Timestamp('1982-01-28 00:00:00')}, 'fullname': {0: 'Mélanie ALBE', 1: 'Mélanie ALBE', 2: 'Mélanie ALBE', 3: 'Mélanie ALBE', 4: 'Mélanie ALBE', 5: 'Mélanie ALBE', 6: 'Mélanie ALBE', 7: 'Mélanie ALBE', 8: 'Mélanie ALBE', 9: 'Mélanie ALBE'}, 'lastname': {0: 'ALBE', 1: 'ALBE', 2: 'ALBE', 3: 'ALBE', 4: 'ALBE', 5: 'ALBE', 6: 'ALBE', 7: 'ALBE', 8: 'ALBE', 9: 'ALBE'}, 'age': {0: 40, 1: 40, 2: 40, 3: 40, 4: 40, 5: 40, 6: 40, 7: 40, 8: 40, 9: 40}, 'validation_id': {0: 8264573419, 1: 8264574166, 2: 8264574345, 3: 8264676975, 4: 8265441741, 5: 8272463799, 6: 8272471694, 7: 8274368291, 8: 8274397366, 9: 8277077728}, 'LIBEL_LONG_PRODUIT_TITRE': {0: 'Ann NEXT 25-64', 1: 'Ann NEXT 25-64', 2: 'Ann NEXT 25-64', 3: 'Ann NEXT 25-64', 4: 'Ann NEXT 25-64', 5: 'Ann NEXT 25-64', 6: 'Ann NEXT 25-64', 7: 'Ann NEXT 25-64', 8: 'Ann NEXT 25-64', 9: 'Ann NEXT 25-64'}, 'DTHR_OPERATION': {0: Timestamp('2022-10-01 08:30:18'), 1: Timestamp('2022-10-01 12:23:34'), 2: Timestamp('2022-10-01 07:47:46'), 3: Timestamp('2022-10-01 13:11:54'), 4: Timestamp('2022-10-01 12:35:02'), 5: Timestamp('2022-10-04 08:34:23'), 6: Timestamp('2022-10-04 08:04:50'), 7: Timestamp('2022-10-04 17:17:47'), 8: Timestamp('2022-10-04 15:20:29'), 9: Timestamp('2022-10-05 07:54:14')}, 'TYPE_OPER_VALIDATION': {0: 3, 1: 1, 2: 1, 3: 3, 4: 3, 5: 3, 6: 1, 7: 1, 8: 1, 9: 1}, 'NUM_SERIE_SUPPORT': {0: '2036002771', 1: '2036002771', 2: '2036002771', 3: '2036002771', 4: '2036002771', 5: '2036002771', 6: '2036002771', 7: '2036002771', 8: '2036002771', 9: '2036002771'}})
df3 = pd.DataFrame({'contact_id': {0: '9C9690B1-F8AC-EC11-9840', 1: '9C9690B1-F8AC-EC11-9840', 2: '9C9690B1-F8AC-EC11-9840', 3: '9C9690B1-F8AC-EC11-9840', 4: '9C9690B1-F8AC-EC11-9840', 5: '9C9690B1-F8AC-EC11-9840', 6: '9C9690B1-F8AC-EC11-9840', 7: '9C9690B1-F8AC-EC11-9840', 8: '9C9690B1-F8AC-EC11-9840', 9: '9C9690B1-F8AC-EC11-9840'}, 'DTHR_OPERATION_type3': {0: Timestamp('2022-10-01 08:30:18'), 1: Timestamp('2022-10-01 08:30:18'), 2: Timestamp('2022-10-01 08:30:18'), 3: Timestamp('2022-10-01 08:30:18'), 4: Timestamp('2022-10-01 08:30:18'), 5: Timestamp('2022-10-01 08:30:18'), 6: Timestamp('2022-10-01 08:30:18'), 7: Timestamp('2022-10-01 08:30:18'), 8: Timestamp('2022-10-01 08:30:18'), 9: Timestamp('2022-10-01 08:30:18')}, 'DTHR_OPERATION_type1': {0: Timestamp('2022-10-01 12:23:34'), 1: Timestamp('2022-10-01 07:47:46'), 2: Timestamp('2022-10-04 08:04:50'), 3: Timestamp('2022-10-04 17:17:47'), 4: Timestamp('2022-10-04 15:20:29'), 5: Timestamp('2022-10-05 07:54:14'), 6: Timestamp('2022-10-05 18:22:42'), 7: Timestamp('2022-10-06 08:14:28'), 8: Timestamp('2022-10-06 18:19:33'), 9: Timestamp('2022-10-08 07:46:45')}, 'seconds': {0: -13996.0, 1: 2552.0, 2: -257672.00000000003, 3: -290849.0, 4: -283811.0, 5: -343436.0, 6: -381144.0, 7: -431050.0, 8: -467355.00000000006, 9: -602187.0}, 'first_connection': {0: 'no', 1: 'yes', 2: 'no', 3: 'no', 4: 'no', 5: 'no', 6: 'no', 7: 'no', 8: 'no', 9: 'no'}})
df4 = pd.DataFrame({'contact_id': {0: '9C9690B1-F8AC-EC11-9840', 1: '9C9690B1-F8AC-EC11-9840', 2: '9C9690B1-F8AC-EC11-9840', 3: '9C9690B1-F8AC-EC11-9840', 4: '9C9690B1-F8AC-EC11-9840', 5: '9C9690B1-F8AC-EC11-9840', 6: '9C9690B1-F8AC-EC11-9840', 7: '9C9690B1-F8AC-EC11-9840', 8: '9C9690B1-F8AC-EC11-9840', 9: '9C9690B1-F8AC-EC11-9840'}, 'DTHR_OPERATION_type3': {0: Timestamp('2022-10-01 08:30:18'), 1: Timestamp('2022-10-01 08:30:18'), 2: Timestamp('2022-10-01 08:30:18'), 3: Timestamp('2022-10-01 08:30:18'), 4: Timestamp('2022-10-01 08:30:18'), 5: Timestamp('2022-10-01 08:30:18'), 6: Timestamp('2022-10-01 08:30:18'), 7: Timestamp('2022-10-01 08:30:18'), 8: Timestamp('2022-10-01 08:30:18'), 9: Timestamp('2022-10-01 08:30:18')}, 'DTHR_OPERATION_type3bis': {0: Timestamp('2022-10-01 08:30:18'), 1: Timestamp('2022-10-01 13:11:54'), 2: Timestamp('2022-10-01 12:35:02'), 3: Timestamp('2022-10-04 08:34:23'), 4: Timestamp('2022-10-05 08:27:04'), 5: Timestamp('2022-10-05 19:05:29'), 6: Timestamp('2022-10-06 08:34:21'), 7: Timestamp('2022-10-06 18:37:56'), 8: Timestamp('2022-10-06 19:08:30'), 9: Timestamp('2022-10-08 13:01:13')}, 'seconds_type3': {0: 0.0, 1: -16896.0, 2: -14684.000000000002, 3: -259445.00000000003, 4: -345406.0, 5: -383711.0, 6: -432243.0, 7: -468458.00000000006, 8: -470292.00000000006, 9: -621055.0}, 'second_or_more_connection': {0: 'no', 1: 'no', 2: 'no', 3: 'no', 4: 'no', 5: 'no', 6: 'no', 7: 'no', 8: 'no', 9: 'no'}})
The desired result is a df5 with the following columns [['contact_id', 'fullname', 'validation_id', 'LIBEL_LONG_PRODUIT_TITRE', 'TYPE_OPER_VALIDATION']] as well as the new column df5['connection']. Don't hesitate to reach out if you need further information or clarifications. Many thanks for your support :)
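Here is a minimal sketch of how df5 could be assembled from the frames above, assuming the column names as posted (flags is a hypothetical intermediate, and 'yes' is taken to win over 'no' per contact):
# Hypothetical sketch: carry the per-contact 'first_connection' flag from df3
# onto the validation rows to build df5.
flags = df3.groupby('contact_id')['first_connection'].max()  # 'yes' sorts after 'no'
df5 = df2[['contact_id', 'fullname', 'validation_id',
           'LIBEL_LONG_PRODUIT_TITRE', 'TYPE_OPER_VALIDATION']].copy()
df5['connection'] = df5['contact_id'].map(flags).fillna('no')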
I found a way to implement the stackplot when my x-axis is just a list of numbers.
import pandas as pd
import matplotlib.pyplot as plt
d = {'time_key': {0: '2021-03-01',
1: '2021-03-01',
2: '2021-03-01',
3: '2021-03-01'},
'target': {0: 2, 1: 1, 2: 0, 3: 3},
'count': {0: 400, 1: 300, 2: 200, 3: 100},
'fraction': {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}}
df = pd.DataFrame(d)
plt.stackplot(range(1), df[df.target==0].fraction, df[df.target==1].fraction,
              df[df.target==2].fraction, df[df.target==3].fraction)
But I want to generalize the plot to a list of many dates.
d = {'time_key': {0: '2021-03-01',
1: '2021-03-01',
2: '2021-03-01',
3: '2021-03-01',
4: '2021-04-01',
5: '2021-04-01',
6: '2021-04-01',
7: '2021-04-01',
8: '2021-05-01',
9: '2021-05-01',
10: '2021-05-01',
11: '2021-05-01'},
'target': {0: 2,
1: 1,
2: 0,
3: 3,
4: 2,
5: 1,
6: 0,
7: 3,
8: 2,
9: 1,
10: 0,
11: 3},
'count': {0: 163,
1: 110,
2: 90,
3: 38,
4: 113,
5: 97,
6: 56,
7: 34,
8: 85,
9: 57,
10: 42,
11: 16},
'fraction': {0: 0.18091009988901222,
1: 0.1220865704772475,
2: 0.09988901220865705,
3: 0.042175360710321866,
4: 0.12541620421753608,
5: 0.1076581576026637,
6: 0.06215316315205328,
7: 0.03773584905660377,
8: 0.09433962264150944,
9: 0.06326304106548279,
10: 0.04661487236403995,
11: 0.017758046614872364}}
And I'd like to assign the dates to the x-axis in ascending order to see the dynamics of the proportions.
Is there a proper way to implement this?
The approximate desired output plot (I need time_key x-axis though):
Try:
dfp = df.set_index(['time_key','target'])['count'].unstack()
dfp.div(dfp.sum(axis=1), axis=0).plot.bar(stacked=True)
Output:
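If an actual stackplot is preferred over stacked bars, here is a sketch of the same idea: pivot so each target becomes a column, with time_key as the sorted x-axis:
# Sketch: one column per target, fractions stacked over time_key
pivot = df.pivot_table(index='time_key', columns='target', values='fraction').sort_index()
plt.stackplot(pivot.index, pivot.T.values, labels=pivot.columns)
plt.legend(loc='upper left')
plt.show()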
Another useful solution is:
d = {0: {'2021-03-01': 0.2, '2021-04-01': 0.25, '2021-05-01': 0.3},
1: {'2021-03-01': 0.3, '2021-04-01': 0.25, '2021-05-01': 0.3},
2: {'2021-03-01': 0.4, '2021-04-01': 0.25, '2021-05-01': 0.3},
3: {'2021-03-01': 0.1, '2021-04-01': 0.25, '2021-05-01': 0.1}}
df = pd.DataFrame(d)
plt.style.use('classic')  # apply the style before creating the figure so it takes effect
fig, ax = plt.subplots(figsize=(9, 6))
df.plot.area(ax=ax)
When I use:
df[["Type 2", "Type 4"]].applymap(lambda n: int(n, 16))
it stops with an error because of invalid values in the Type 2 column (negative values, NaN, strings...) for the hex conversion. How can I ignore this error or mark the invalid values as zero?
{'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
You can write a small function that handles this exception and use it in place of your lambda. For example:
def lambda_int(n):
    try:
        return int(n, 16)
    except ValueError:  # not a valid base-16 string (add TypeError to also survive real NaN floats)
        return 0

df[["Type 2", "Type 4"]] = df[["Type 2", "Type 4"]].applymap(lambda_int)
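As a side note: on pandas 2.1 and later, DataFrame.applymap is deprecated in favor of the element-wise DataFrame.map, so the last line would become:
# pandas >= 2.1: DataFrame.map is the element-wise replacement for applymap
df[["Type 2", "Type 4"]] = df[["Type 2", "Type 4"]].map(lambda_int)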
Please go through this; I reconstructed your question and the steps to follow are below.
1. The first dictionary you provided does not have a NaN value; it has the string "NaN":
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
import pandas as pd
df = pd.DataFrame(data)
df.head()
To check for NaNs in your df and remove them:
columns_with_na = df.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
print(len(columns_with_na.sort_values(ascending = False))) #print them in descending order
This prints 0 and 0 because there is no NaN.
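If you want the literal string 'NaN' in the original data to count as missing, one option (a sketch) is to convert it to a real NaN first:
import numpy as np

# Sketch: turn the string 'NaN' into a real missing value
df = df.replace('NaN', np.nan)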
I reconstructed your data to include a NaN by using numpy.nan:
import numpy as np
#recreated a dataset and included a nan value : np.nan at Type 2
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: np.nan,
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
df2 = pd.DataFrame(data)
df2.head()
#sum up number of columns with nan
columns_with_na = df2.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
print(len(columns_with_na.sort_values(ascending = False)))
This prints 1 and 1 because there is a NaN in the Type 2 column.
#drop nan values
df2 = df2.dropna(how = 'any')
#sum up number of columns with nan
columns_with_na = df2.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
#prints 0 because I dropped all the nan values
df2.head()
To fill NaN in the whole df with 0 use:
df2.fillna(0, inplace = True)
To fill NaN with 0 in df2['Type 2'] only:
# if you don't want to change the original dataframe, set inplace to False
df2['Type 2'].fillna(0, inplace = True)  # inplace is set to True to change the original df
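Tying this back to the hex conversion, a sketch that combines both answers: fill missing entries with the string '0' (a valid hex value), then convert with the try/except helper from the other answer:
# Sketch: fill real NaN with the hex string '0', then convert safely
df2['Type 2'] = df2['Type 2'].fillna('0')
df2[['Type 2', 'Type 4']] = df2[['Type 2', 'Type 4']].applymap(lambda_int)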
Below is my dataframe. I made some transformations to create the category column and dropped the original column it was derived from. Now I need to do a group-by to remove the dups; e.g. Love and Fashion can be rolled up via a groupby sum.
df.columns = array([category, clicks, revenue, date, impressions, size], dtype=object)
df.values=
[[Love 0 0.36823 2013-11-04 380 300x250]
[Love 183 474.81522 2013-11-04 374242 300x250]
[Fashion 0 0.19434 2013-11-04 197 300x250]
[Fashion 9 18.26422 2013-11-04 13363 300x250]]
Here is the index that was created when I created the dataframe:
print df.index
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48])
I assume I want to drop the index and create date and category as a MultiIndex, then do a groupby sum of the metrics. How do I do this with a pandas DataFrame?
df.head(15).to_dict()= {'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}}
Python is 2.7 and pandas is 0.7.0 on Ubuntu 12.04. Below is the error I get when I run the following:
import pandas
print pandas.__version__
df = pandas.DataFrame.from_dict(
{
'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'},
'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229},
'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
}
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()
Traceback (most recent call last):
File "/home/ubuntu/workspace/devops/reports/groupby_sub.py", line 9, in <module>
df.set_index(['date', 'category'], inplace=True)
File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1927, in set_index
raise Exception('Index has duplicate keys: %s' % duplicates)
Exception: Index has duplicate keys: [('2013-11-04', 'Celebs'), ('2013-11-04', 'Fashion'), ('2013-11-04', 'Health'), ('2013-11-04', 'Love'), ('2013-11-04', 'Movies')]
You can create the index on the existing dataframe. With the subset of data provided, this works for me:
import pandas
df = pandas.DataFrame.from_dict(
{
'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'},
'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229},
'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
}
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()
If you're having duplicate index issues with the full dataset, you'll need to clean up the data a bit. Remove the duplicate rows if that's amenable. If the duplicate rows are valid, then what sets them apart from each other? If you can add that to the dataframe and include it in the index, that's ideal. If not, just create a dummy column that defaults to 1, but can be 2 or 3 or ... N in the case of N duplicates -- and then include that field in the index as well.
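On newer pandas versions (the question is on 0.7.0), the dummy-column idea can be expressed directly with cumcount; a sketch:
# Sketch: number duplicate (date, category) rows so the triple is unique
df['dup_n'] = df.groupby(['date', 'category']).cumcount()
df.set_index(['date', 'category', 'dup_n'], inplace=True)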
Alternatively, I'm pretty sure you can skip the index creation and directly groupby with columns:
df.groupby(by=['date', 'category']).sum()
Again, that works on the subset of data that you posted.
I usually need this when I try to unstack a MultiIndex and it fails because there are duplicate values.
Here is the simple command that I run to find the problematic items:
df.groupby(level=df.index.names).count()
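On recent pandas, the offending labels can also be pulled out directly; a sketch:
# Sketch: list the duplicated index entries themselves
print(df.index[df.index.duplicated()])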