Create a temporal network with teneto (Python)?

I have the following dataframe (here is a portion of it):
{'date': {0: Timestamp('2020-10-03 00:00:00'),
1: Timestamp('2020-10-03 00:00:00'),
2: Timestamp('2020-10-03 00:00:00'),
3: Timestamp('2020-10-03 00:00:00'),
4: Timestamp('2020-10-24 00:00:00'),
5: Timestamp('2020-10-24 00:00:00'),
6: Timestamp('2020-10-24 00:00:00'),
7: Timestamp('2020-10-24 00:00:00'),
8: Timestamp('2020-10-25 00:00:00'),
9: Timestamp('2020-10-25 00:00:00')},
'from': {0: 7960001,
1: 25500005,
2: 4660001,
3: 91000032,
4: 280001,
5: 26100016,
6: 30001114,
7: 12000016,
8: 79000523,
9: 74000114},
'to': {0: 30000934,
1: 74000351,
2: 4660001,
3: 91000031,
4: 66000413,
5: 26100022,
6: 26100024,
7: 12000016,
8: 79000321,
9: 74000122},
'weight': {0: 17.1,
1: 15.0,
2: 931.6,
3: 145.9,
4: 29.3,
5: 25.8,
6: 15.0,
7: 132.4,
8: 51.5,
9: 492.9}}
I want to build a temporal network out of this time-series graph/network data, with respect to both time and clusters.
Here is what I am trying to do (df is the dataframe above):
import teneto
t = list(df.index())
netin = {'i': df['from'], 'j': df['to'], 't': t, 'weight': df['weight']}
df = pd.DataFrame(data=netin)
tnet = TemporalNetwork(from_df=df)
tnet.network
Keep getting:
TypeError: 'RangeIndex' object is not callable
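A hedged sketch of a likely fix: df.index is a property, not a method, so list(df.index()) calls a RangeIndex object and raises exactly this TypeError. Beyond that, as I understand teneto's API, TemporalNetwork(from_df=...) wants integer node ids i, j and integer time indices t, so the raw ids and Timestamps need re-encoding; the mapping below is my assumption about the intended encoding, not code from the question:

import pandas as pd
import teneto

# Map raw node ids and dates to consecutive integers 0..N-1 and 0..T-1.
nodes = sorted(set(df['from']) | set(df['to']))
node_ix = {n: ix for ix, n in enumerate(nodes)}
times = sorted(df['date'].drop_duplicates())
t_ix = {d: ix for ix, d in enumerate(times)}

netin = pd.DataFrame({
    'i': df['from'].map(node_ix),
    'j': df['to'].map(node_ix),
    't': df['date'].map(t_ix),
    'weight': df['weight'],
})
tnet = teneto.TemporalNetwork(from_df=netin)  # qualified, since only `import teneto` was used
tnet.network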

Related

How to subtract two dates based on a filter of two other columns

I am new to Python and I am struggling to reshape my DataFrame.
For a particular client (contact_id), I want to add a new date column that subtracts the DTHR_OPERATION date for TYPE_OPER_VALIDATION = 3 from the DTHR_OPERATION date for TYPE_OPER_VALIDATION = 1.
If TYPE_OPER_VALIDATION is equal to 3 and there is less than an hour's difference between those two dates, I want to write a string such as 'connection' in the new column.
I get the error "'Series' object has no attribute 'total_seconds'" when I try to check whether the time difference is less than or equal to an hour. I tried many solutions I found on the internet, but I always seem to have a data type issue.
Here is my code snippet:
df_oper_one = merged_table.loc[(merged_table['TYPE_OPER_VALIDATION']==1), ['contact_id','TYPE_OPER_VALIDATION','DTHR_OPERATION']]
df_oper_three = merged_table.loc[(merged_table['TYPE_OPER_VALIDATION']==3), ['contact_id','TYPE_OPER_VALIDATION','DTHR_OPERATION']]
connection = []
for row in merged_table['contact_id']:
    if (df_validation.loc[(df_validation['TYPE_OPER_VALIDATION']==3)]) & ((pd.to_datetime(df_oper_three['DTHR_OPERATION'], format='%Y-%m-%d %H:%M:%S') - pd.to_datetime(df_oper_one['DTHR_OPERATION'], format='%Y-%m-%d %H:%M:%S').total_seconds()) <= 3600):
        connection.append('connection')
        # if diff_date.total_seconds() <= 3600: connection.append('connection')
    else:
        connection.append('null')
merged_table['connection'] = pd.Series(connection)
Hello Nicolas and welcome to Stack Overflow. Please remember to always include sample data to reproduce your issue. Here is sample data to reproduce part of your dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id contact':['cf2e79bc-8cac-ec11-9840-000d3ab078e6']*12+['865c5edf-c7ac-ec11-9840-000d3ab078e6']*10,
'DTHR OPERATION':['11/10/2022 07:07', '11/10/2022 07:29', '11/10/2022 15:47', '11/10/2022 16:22', '11/10/2022 16:44', '11/10/2022 18:06', '12/10/2022 07:11', '12/10/2022 07:25', '12/10/2022 17:21', '12/10/2022 18:04', '13/10/2022 07:09', '13/10/2022 18:36', '14/09/2022 17:59', '15/09/2022 09:34', '15/09/2022 19:17', '16/09/2022 08:31', '16/09/2022 19:18', '17/09/2022 06:41', '17/09/2022 11:19', '17/09/2022 15:48', '17/09/2022 16:13', '17/09/2022 17:07'],
'lastname':['BOUALAMI']*12+['VERVOORT']*10,
'TYPE_OPER_VALIDATION':[1, 3, 1, 3, 3, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 3, 3]})
df['DTHR OPERATION'] = pd.to_datetime(df['DTHR OPERATION'], dayfirst=True)  # dates are day-first (DD/MM/YYYY)
I would recommend creating a new table to more easily accomplish your task:
df2 = pd.merge(df[['Id contact', 'DTHR OPERATION']][df['TYPE_OPER_VALIDATION']==3], df[['Id contact', 'DTHR OPERATION']][df['TYPE_OPER_VALIDATION']==1], on='Id contact', suffixes=('_type3','_type1'))
Then find the time difference:
df2['seconds'] = (df2['DTHR OPERATION_type3']-df2['DTHR OPERATION_type1']).dt.total_seconds()
Finally, flag connections of an hour or less:
df2['connection'] = np.where(df2['seconds']<=3600, 'yes', 'no')
Hope this helps!
Sure, here is the information you are looking for:
df_contact = pd.DataFrame({'contact_id': {0: '865C5EDF-C7AC-EC11-9840', 1: '9C9690B1-F8AC-EC11', 2: '4DD27359-14AF-EC11-9840', 3: '0091373E-E7F4-4170-BCAC'}, 'birthdate': {0: Timestamp('2005-05-19 00:00:00'), 1: Timestamp('1982-01-28 00:00:00'), 2: Timestamp('1997-05-15 00:00:00'), 3: Timestamp('2005-03-22 00:00:00')}, 'fullname': {0: 'Laura VERVO', 1: 'Mélanie ALBE', 2: 'Eric VANO', 3: 'Jean Docq'}, 'lastname': {0: 'VERVO', 1: 'ALBE', 2: 'VANO', 3: 'Docq'}, 'age': {0: 17, 1: 40, 2: 25, 3: 17}})
df_validation = pd.DataFrame({'validation_id': {0: 8263835881, 1: 8263841517, 2: 8263843376, 3: 8263843377, 4: 8263843381, 5: 8263843382, 6: 8263863088, 7: 8263863124, 8: 8263868113, 9: 8263868123}, 'LIBEL_LONG_PRODUIT_TITRE': {0: 'Mens NEXT 12-17', 1: 'Ann NEXT 25-64%B', 2: 'Ann EXPRESS CBLANCHE', 3: 'Multi 8 NEXT', 4: 'Ann EXPRESS 18-24', 5: 'SNCB+TEC NEXT ABO', 6: 'Ann EXPRESS 18-24', 7: 'Ann EXPRESS 12-17%B', 8: '1 jour EX Réfugié', 9: 'Ann EXPRESS 2564%B'}, 'DTHR_OPERATION': {0: Timestamp('2022-10-01 00:02:02'), 1: Timestamp('2022-10-01 00:22:45'), 2: Timestamp('2022-10-01 00:02:45'), 3: Timestamp('2022-10-01 00:02:49'), 4: Timestamp('2022-10-01 00:07:03'), 5: Timestamp('2022-10-01 00:07:06'), 6: Timestamp('2022-10-01 00:07:40'), 7: Timestamp('2022-10-01 00:31:51'), 8: Timestamp('2022-10-01 00:03:33'), 9: Timestamp('2022-10-01 00:07:40')}, 'TYPE_OPER_VALIDATION': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 3, 7: 3, 8: 2, 9: 1}, 'NUM_SERIE_SUPPORT': {0: '2040121921', 1: '2035998914', 2: '2034456458', 3: '14988572652829627697', 4: '2035956003', 5: '2033613155', 6: '2040119429', 7: '2036114867', 8: '14988572650230713650', 9: '2040146199'}})
df_support = pd.DataFrame({'support_id': {0: '8D3A331D-3E86-EC11-93B0', 1: '44863926-3E86-EC11', 2: '45863926-3E86-EC11-93B0', 3: '46863926-3E86-EC11-93B0', 4: '47863926-3E86-EC11-93B0', 5: 'E3863926-3E86-EC11-93B0', 6: '56873926-3E86-EC11-93B0', 7: 'E3CE312C-3E86-EC11-93B0', 8: 'F3CE312C-3E86-EC11-93B0', 9: '3CCF312C-3E86-EC11-93B0'}, 'bd_linkedcustomer': {0: '15CCC384-C4AD-EC11-9840', 1: '9D27061D-14AE-EC11-9840', 2: '74CAE68F-D4AC-EC11-9840', 3: '18F5FE1A-58AC-EC11-983F', 4: None, 5: '9FBDA103-2FAD-EC11', 6: 'EEA1FB63-75AC-EC11-9840', 7: 'F150EC3D-0DAD-EC11-9840', 8: '111DE8C4-CAAC-EC11-9840', 9: None}, 'bd_supportserialnumber': {0: '44884259', 1: '2036010559', 2: '62863150', 3: '2034498160', 4: '62989611', 5: '2036094315', 6: '2033192919', 7: '2036051529', 8: '2036062236', 9: '2033889172'}})
df2 = pd.DataFrame({'support_id': {0: '4BE73E8C-B8F9-EC11-BB3D', 1: '4BE73E8C-B8F9-EC11-BB3D', 2: '4BE73E8C-B8F9-EC11-BB3D', 3: '4BE73E8C-B8F9-EC11-BB3D', 4: '4BE73E8C-B8F9-EC11-BB3D', 5: '4BE73E8C-B8F9-EC11-BB3D', 6: '4BE73E8C-B8F9-EC11', 7: '4BE73E8C-B8F9-EC11-BB3D', 8: '4BE73E8C-B8F9-EC11-BB3D', 9: '4BE73E8C-B8F9-EC11-BB3D'}, 'bd_linkedcustomer': {0: '9C9690B1-F8AC-EC11-9840', 1: '9C9690B1-F8AC-EC11-9840', 2: '9C9690B1-F8AC-EC11-9840', 3: '9C9690B1-F8AC-EC11-9840', 4: '9C9690B1-F8AC-EC11-9840', 5: '9C9690B1-F8AC-EC11-9840', 6: '9C9690B1-F8AC-EC11-9840', 7: '9C9690B1-F8AC-EC11-9840', 8: '9C9690B1-F8AC-EC11-9840', 9: '9C9690B1-F8AC-EC11-9840'}, 'bd_supportserialnumber': {0: '2036002771', 1: '2036002771', 2: '2036002771', 3: '2036002771', 4: '2036002771', 5: '2036002771', 6: '2036002771', 7: '2036002771', 8: '2036002771', 9: '2036002771'}, 'contact_id': {0: '9C9690B1-F8AC-EC11-9840', 1: '9C9690B1-F8AC-EC11-9840', 2: '9C9690B1-F8AC-EC11-9840', 3: '9C9690B1-F8AC-EC11-9840', 4: '9C9690B1-F8AC-EC11-9840', 5: '9C9690B1-F8AC-EC11-9840', 6: '9C9690B1-F8AC-EC11-9840', 7: '9C9690B1-F8AC-EC11-9840', 8: '9C9690B1-F8AC-EC11-9840', 9: '9C9690B1-F8AC-EC11-9840'}, 'birthdate': {0: Timestamp('1982-01-28 00:00:00'), 1: Timestamp('1982-01-28 00:00:00'), 2: Timestamp('1982-01-28 00:00:00'), 3: Timestamp('1982-01-28 00:00:00'), 4: Timestamp('1982-01-28 00:00:00'), 5: Timestamp('1982-01-28 00:00:00'), 6: Timestamp('1982-01-28 00:00:00'), 7: Timestamp('1982-01-28 00:00:00'), 8: Timestamp('1982-01-28 00:00:00'), 9: Timestamp('1982-01-28 00:00:00')}, 'fullname': {0: 'Mélanie ALBE', 1: 'Mélanie ALBE', 2: 'Mélanie ALBE', 3: 'Mélanie ALBE', 4: 'Mélanie ALBE', 5: 'Mélanie ALBE', 6: 'Mélanie ALBE', 7: 'Mélanie ALBE', 8: 'Mélanie ALBE', 9: 'Mélanie ALBE'}, 'lastname': {0: 'ALBE', 1: 'ALBE', 2: 'ALBE', 3: 'ALBE', 4: 'ALBE', 5: 'ALBE', 6: 'ALBE', 7: 'ALBE', 8: 'ALBE', 9: 'ALBE'}, 'age': {0: 40, 1: 40, 2: 40, 3: 40, 4: 40, 5: 40, 6: 40, 7: 40, 8: 40, 9: 40}, 'validation_id': {0: 8264573419, 1: 8264574166, 2: 8264574345, 3: 8264676975, 4: 8265441741, 5: 8272463799, 6: 8272471694, 7: 8274368291, 8: 8274397366, 9: 8277077728}, 'LIBEL_LONG_PRODUIT_TITRE': {0: 'Ann NEXT 25-64', 1: 'Ann NEXT 25-64', 2: 'Ann NEXT 25-64', 3: 'Ann NEXT 25-64', 4: 'Ann NEXT 25-64', 5: 'Ann NEXT 25-64', 6: 'Ann NEXT 25-64', 7: 'Ann NEXT 25-64', 8: 'Ann NEXT 25-64', 9: 'Ann NEXT 25-64'}, 'DTHR_OPERATION': {0: Timestamp('2022-10-01 08:30:18'), 1: Timestamp('2022-10-01 12:23:34'), 2: Timestamp('2022-10-01 07:47:46'), 3: Timestamp('2022-10-01 13:11:54'), 4: Timestamp('2022-10-01 12:35:02'), 5: Timestamp('2022-10-04 08:34:23'), 6: Timestamp('2022-10-04 08:04:50'), 7: Timestamp('2022-10-04 17:17:47'), 8: Timestamp('2022-10-04 15:20:29'), 9: Timestamp('2022-10-05 07:54:14')}, 'TYPE_OPER_VALIDATION': {0: 3, 1: 1, 2: 1, 3: 3, 4: 3, 5: 3, 6: 1, 7: 1, 8: 1, 9: 1}, 'NUM_SERIE_SUPPORT': {0: '2036002771', 1: '2036002771', 2: '2036002771', 3: '2036002771', 4: '2036002771', 5: '2036002771', 6: '2036002771', 7: '2036002771', 8: '2036002771', 9: '2036002771'}})
df3 = pd.DataFrame({'contact_id': {0: '9C9690B1-F8AC-EC11-9840', 1: '9C9690B1-F8AC-EC11-9840', 2: '9C9690B1-F8AC-EC11-9840', 3: '9C9690B1-F8AC-EC11-9840', 4: '9C9690B1-F8AC-EC11-9840', 5: '9C9690B1-F8AC-EC11-9840', 6: '9C9690B1-F8AC-EC11-9840', 7: '9C9690B1-F8AC-EC11-9840', 8: '9C9690B1-F8AC-EC11-9840', 9: '9C9690B1-F8AC-EC11-9840'}, 'DTHR_OPERATION_type3': {0: Timestamp('2022-10-01 08:30:18'), 1: Timestamp('2022-10-01 08:30:18'), 2: Timestamp('2022-10-01 08:30:18'), 3: Timestamp('2022-10-01 08:30:18'), 4: Timestamp('2022-10-01 08:30:18'), 5: Timestamp('2022-10-01 08:30:18'), 6: Timestamp('2022-10-01 08:30:18'), 7: Timestamp('2022-10-01 08:30:18'), 8: Timestamp('2022-10-01 08:30:18'), 9: Timestamp('2022-10-01 08:30:18')}, 'DTHR_OPERATION_type1': {0: Timestamp('2022-10-01 12:23:34'), 1: Timestamp('2022-10-01 07:47:46'), 2: Timestamp('2022-10-04 08:04:50'), 3: Timestamp('2022-10-04 17:17:47'), 4: Timestamp('2022-10-04 15:20:29'), 5: Timestamp('2022-10-05 07:54:14'), 6: Timestamp('2022-10-05 18:22:42'), 7: Timestamp('2022-10-06 08:14:28'), 8: Timestamp('2022-10-06 18:19:33'), 9: Timestamp('2022-10-08 07:46:45')}, 'seconds': {0: -13996.0, 1: 2552.0, 2: -257672.00000000003, 3: -290849.0, 4: -283811.0, 5: -343436.0, 6: -381144.0, 7: -431050.0, 8: -467355.00000000006, 9: -602187.0}, 'first_connection': {0: 'no', 1: 'yes', 2: 'no', 3: 'no', 4: 'no', 5: 'no', 6: 'no', 7: 'no', 8: 'no', 9: 'no'}})
df4 = pd.DataFrame({'contact_id': {0: '9C9690B1-F8AC-EC11-9840', 1: '9C9690B1-F8AC-EC11-9840', 2: '9C9690B1-F8AC-EC11-9840', 3: '9C9690B1-F8AC-EC11-9840', 4: '9C9690B1-F8AC-EC11-9840', 5: '9C9690B1-F8AC-EC11-9840', 6: '9C9690B1-F8AC-EC11-9840', 7: '9C9690B1-F8AC-EC11-9840', 8: '9C9690B1-F8AC-EC11-9840', 9: '9C9690B1-F8AC-EC11-9840'}, 'DTHR_OPERATION_type3': {0: Timestamp('2022-10-01 08:30:18'), 1: Timestamp('2022-10-01 08:30:18'), 2: Timestamp('2022-10-01 08:30:18'), 3: Timestamp('2022-10-01 08:30:18'), 4: Timestamp('2022-10-01 08:30:18'), 5: Timestamp('2022-10-01 08:30:18'), 6: Timestamp('2022-10-01 08:30:18'), 7: Timestamp('2022-10-01 08:30:18'), 8: Timestamp('2022-10-01 08:30:18'), 9: Timestamp('2022-10-01 08:30:18')}, 'DTHR_OPERATION_type3bis': {0: Timestamp('2022-10-01 08:30:18'), 1: Timestamp('2022-10-01 13:11:54'), 2: Timestamp('2022-10-01 12:35:02'), 3: Timestamp('2022-10-04 08:34:23'), 4: Timestamp('2022-10-05 08:27:04'), 5: Timestamp('2022-10-05 19:05:29'), 6: Timestamp('2022-10-06 08:34:21'), 7: Timestamp('2022-10-06 18:37:56'), 8: Timestamp('2022-10-06 19:08:30'), 9: Timestamp('2022-10-08 13:01:13')}, 'seconds_type3': {0: 0.0, 1: -16896.0, 2: -14684.000000000002, 3: -259445.00000000003, 4: -345406.0, 5: -383711.0, 6: -432243.0, 7: -468458.00000000006, 8: -470292.00000000006, 9: -621055.0}, 'second_or_more_connection': {0: 'no', 1: 'no', 2: 'no', 3: 'no', 4: 'no', 5: 'no', 6: 'no', 7: 'no', 8: 'no', 9: 'no'}})
The desired result is a df5 with the following columns [['contact_id', 'fullname', 'validation_id', 'LIBEL_LONG_PRODUIT_TITRE', 'TYPE_OPER_VALIDATION']] as well as the new column df5['connection']. Don't hesitate to reach out if you need further information or clarifications. Many thanks for your support :)
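If it helps, here is one hypothetical way to assemble df5 from the frames above (an assumption about the intended join, not tested code from this thread): collapse df3's per-pair flag to one flag per type-3 validation, then attach it to df2's rows.

# Hypothetical: one flag per (contact, type-3 timestamp), 'yes' if any
# type-1 validation fell within the hour.
flag = (df3.groupby(['contact_id', 'DTHR_OPERATION_type3'])['first_connection']
           .agg(lambda s: 'yes' if s.eq('yes').any() else 'no')
           .reset_index()
           .rename(columns={'DTHR_OPERATION_type3': 'DTHR_OPERATION',
                            'first_connection': 'connection'}))

df5 = (df2.merge(flag, on=['contact_id', 'DTHR_OPERATION'], how='left')
          .fillna({'connection': 'no'})
          [['contact_id', 'fullname', 'validation_id',
            'LIBEL_LONG_PRODUIT_TITRE', 'TYPE_OPER_VALIDATION', 'connection']])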

Error: "cannot reindex from a duplicate axis" when looping sns relplots

I've got a DataFrame that is structured similarly to this one, but with more variables:
{'Date': {0: Timestamp('2021-01-01 00:00:00'),
1: Timestamp('2021-01-01 00:00:00'),
2: Timestamp('2021-01-01 00:00:00'),
3: Timestamp('2021-02-01 00:00:00'),
4: Timestamp('2021-02-01 00:00:00'),
5: Timestamp('2021-02-01 00:00:00'),
6: Timestamp('2021-03-01 00:00:00'),
7: Timestamp('2021-03-01 00:00:00'),
8: Timestamp('2021-03-01 00:00:00')},
'Share': {0: 'nflx',
1: 'aapl',
2: 'amzn',
3: 'nflx',
4: 'aapl',
5: 'amzn',
6: 'nflx',
7: 'aapl',
8: 'amzn'},
'x': {0: 534,
1: 126,
2: 3270,
3: 590,
4: 172,
5: 3059,
6: 552,
7: 160,
8: 3462}}
I'm trying to loop over sns relplots, but I'm getting the error "cannot reindex from a duplicate axis".
Code I've tried:
for i in [df["x"], df["y"], df["z"]]:
sns.relplot(data=df.reset_index(),
x="Date",
y=i,
hue="Share",
kind="line",
height=10,
aspect=1.7).savefig("{i}.png");
I know the error tells me that the index has duplicates, but to my knowledge they should be temporarily gone after reset_index(). Moreover, I think I have to keep the date variable as the index in order to do some of the variable-specific calculations. Is the issue in the plot code, or what is the solution?
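A hedged sketch of a likely fix (not from the thread): pass column names rather than Series objects to seaborn, so the lookup happens inside data and no index alignment is attempted; the file name also needs an f-string to interpolate the loop variable.

import seaborn as sns

for col in ["x", "y", "z"]:  # assumes df has these three columns
    sns.relplot(data=df, x="Date", y=col, hue="Share",
                kind="line", height=10, aspect=1.7).savefig(f"{col}.png")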

Filtering Dataframe based on Multiple Date Conditions

I'm working with the following DataFrame:
id slotTime EDD EDD-10M
0 1000000101068957 2021-05-12 2021-12-26 2021-02-26
1 1000000100849718 2021-03-20 2021-04-05 2020-06-05
2 1000000100849718 2021-03-20 2021-04-05 2020-06-05
3 1000000100849718 2021-03-20 2021-04-05 2020-06-05
4 1000000100849718 2021-03-20 2021-04-05 2020-06-05
I would like to only keep the rows where the slotTime is between EDD-10M and EDD:
df['EDD-10M'] < df['slotTime'] < df['EDD']
I have tried using the following method:
df.loc[df[df['slotTime'] < df['EDD']] & df[df['EDD-10M'] < df['slotTime']]]
However it yields the following error
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Please advise.
To replicate the above DataFrame use the below snippet:
import pandas as pd
from pandas import Timestamp
df = {
'id': {0: 1000000101068957,
1: 1000000100849718,
2: 1000000100849718,
3: 1000000100849718,
4: 1000000100849718,
5: 1000000100849718,
6: 1000000100849718,
7: 1000000100849718,
8: 1000000100849718,
9: 1000000100849718},
'EDD': {0: Timestamp('2021-12-26 00:00:00'),
1: Timestamp('2021-04-05 00:00:00'),
2: Timestamp('2021-04-05 00:00:00'),
3: Timestamp('2021-04-05 00:00:00'),
4: Timestamp('2021-04-05 00:00:00'),
5: Timestamp('2021-04-05 00:00:00'),
6: Timestamp('2021-04-05 00:00:00'),
7: Timestamp('2021-04-05 00:00:00'),
8: Timestamp('2021-04-05 00:00:00'),
9: Timestamp('2021-04-05 00:00:00')},
'EDD-10M': {0: Timestamp('2021-02-26 00:00:00'),
1: Timestamp('2020-06-05 00:00:00'),
2: Timestamp('2020-06-05 00:00:00'),
3: Timestamp('2020-06-05 00:00:00'),
4: Timestamp('2020-06-05 00:00:00'),
5: Timestamp('2020-06-05 00:00:00'),
6: Timestamp('2020-06-05 00:00:00'),
7: Timestamp('2020-06-05 00:00:00'),
8: Timestamp('2020-06-05 00:00:00'),
9: Timestamp('2020-06-05 00:00:00')},
'slotTime': {0: Timestamp('2021-05-12 00:00:00'),
1: Timestamp('2021-03-20 00:00:00'),
2: Timestamp('2021-03-20 00:00:00'),
3: Timestamp('2021-03-20 00:00:00'),
4: Timestamp('2021-03-20 00:00:00'),
5: Timestamp('2021-03-20 00:00:00'),
6: Timestamp('2021-03-20 00:00:00'),
7: Timestamp('2021-03-20 00:00:00'),
8: Timestamp('2021-03-20 00:00:00'),
9: Timestamp('2021-03-20 00:00:00')}}
df = pd.DataFrame(df)
You just need to group your conditions with parentheses:
df[(df['slotTime'] < df['EDD']) & (df['EDD-10M'] < df['slotTime'])]
Otherwise, operator precedence applies & before the comparisons and it all falls apart.
Alternatively, you may wish to use the .between() method (assuming you have datetime series), with the lower bound passed first:
df[df['slotTime'].between(df['EDD-10M'], df['EDD'])]
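One caveat worth hedging: Series.between is inclusive on both ends by default, while the condition above is strict, so on pandas 1.3+ you can exclude the bounds:

df[df['slotTime'].between(df['EDD-10M'], df['EDD'], inclusive='neither')]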
You can use the between() method, as someone already answered, or try it like this:
df.loc[(df['EDD-10M'] < df['slotTime']) & (df['slotTime'] < df['EDD'])]
You should use ( and ) around multiple conditions.
You can do this by using query:
df.query("(slotTime < EDD) & (`EDD-10M` < slotTime)")

How to pretty print labels on chart

I have this dataframe:
{'date': {0: Timestamp('2019-10-31 00:00:00'),
1: Timestamp('2019-11-30 00:00:00'),
2: Timestamp('2019-12-31 00:00:00'),
3: Timestamp('2020-01-31 00:00:00'),
4: Timestamp('2020-02-29 00:00:00'),
5: Timestamp('2020-03-31 00:00:00'),
6: Timestamp('2020-04-30 00:00:00'),
7: Timestamp('2020-05-31 00:00:00'),
8: Timestamp('2020-06-30 00:00:00'),
9: Timestamp('2020-07-31 00:00:00'),
10: Timestamp('2020-08-31 00:00:00')},
'rate': {0: 100.0,
1: 99.04595078851037,
2: 101.09797599729458,
3: 102.29581878702609,
4: 104.72409825791058,
5: 109.45297539163114,
6: 118.24943699089361,
7: 119.65432196709045,
8: 117.82108184647535,
9: 118.6223497519237,
10: 120.32838345607335}}
When I plot it I get a clogged x-axis.
How do I plot it such that I get the month name and year on the x-axis, for instance Nov,19?
I am using this to plot:
chart = sns.lineplot('date', 'rate', data=cdf, marker="o")
If I add more data points it doesn't display them, even if I change the size:
Data:
{'date': {0: Timestamp('2019-01-31 00:00:00'),
1: Timestamp('2019-02-28 00:00:00'),
2: Timestamp('2019-03-31 00:00:00'),
3: Timestamp('2019-04-30 00:00:00'),
4: Timestamp('2019-05-31 00:00:00'),
5: Timestamp('2019-06-30 00:00:00'),
6: Timestamp('2019-07-31 00:00:00'),
7: Timestamp('2019-08-31 00:00:00'),
8: Timestamp('2019-09-30 00:00:00'),
9: Timestamp('2019-10-31 00:00:00'),
10: Timestamp('2019-11-30 00:00:00'),
11: Timestamp('2019-12-31 00:00:00'),
12: Timestamp('2020-01-31 00:00:00'),
13: Timestamp('2020-02-29 00:00:00'),
14: Timestamp('2020-03-31 00:00:00'),
15: Timestamp('2020-04-30 00:00:00'),
16: Timestamp('2020-05-31 00:00:00'),
17: Timestamp('2020-06-30 00:00:00'),
18: Timestamp('2020-07-31 00:00:00')},
'rate': {0: 100.0,
1: 98.1580633244672,
2: 102.03029115707123,
3: 107.12429902683576,
4: 112.60187555657997,
5: 108.10306860500229,
6: 105.35473845070196,
7: 105.13286204895526,
8: 106.11760178061557,
9: 107.76819930852,
10: 106.77041938461862,
11: 108.84840098309556,
12: 110.29751856107903,
13: 112.93762886874026,
14: 118.04947620270883,
15: 127.80912766377679,
16: 128.90556903738158,
17: 126.96768455091889,
18: 127.95060601426769}}
I have posted the updated data.
from pandas import Timestamp
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(
{'date':
{0: Timestamp('2019-10-31 00:00:00'),
1: Timestamp('2019-11-30 00:00:00'),
2: Timestamp('2019-12-31 00:00:00'),
3: Timestamp('2020-01-31 00:00:00'),
4: Timestamp('2020-02-29 00:00:00'),
5: Timestamp('2020-03-31 00:00:00'),
6: Timestamp('2020-04-30 00:00:00'),
7: Timestamp('2020-05-31 00:00:00'),
8: Timestamp('2020-06-30 00:00:00'),
9: Timestamp('2020-07-31 00:00:00'),
10: Timestamp('2020-08-31 00:00:00')},
'rate':
{0: 100.0,
1: 99.04595078851037,
2: 101.09797599729458,
3: 102.29581878702609,
4: 104.72409825791058,
5: 109.45297539163114,
6: 118.24943699089361,
7: 119.65432196709045,
8: 117.82108184647535,
9: 118.6223497519237,
10: 120.32838345607335}})
df['datelabel'] = df['date'].apply(lambda x: x.strftime('%b %d'))
chart = sns.lineplot(x='date', y='rate', data=df, marker='o')
chart.set_xticklabels(df.datelabel, rotation=45)
plt.show()
Here's one approach:
Build a lambda function to apply strftime with our target representation over each record in date.
For date formatting, see https://strftime.org/:
%b - month as the locale's abbreviated name.
%d - day of the month as a zero-padded decimal number.
We save the result as a set of reference labels to be applied to the chart via set_xticklabels. Additionally, you can rotate the labels with the rotation parameter.
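An alternative sketch, assuming a matplotlib-backed figure (a standard matplotlib idiom, not part of the answer above): let matplotlib place and format the date ticks itself, which stays correct as more data points are added.

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns

ax = sns.lineplot(x='date', y='rate', data=df, marker='o')
ax.xaxis.set_major_locator(mdates.MonthLocator())            # one tick per month
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b,%y'))  # e.g. Nov,19
plt.setp(ax.get_xticklabels(), rotation=45)
plt.show()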

koalas groupby -> apply returns 'cannot insert "key", already exists'

I've been struggling with this issue and haven't been able to solve it. I have the following dataframe:
from pandas import Timestamp
import databricks.koalas as ks
x = ks.DataFrame.from_records(
{'ds': {0: Timestamp('2018-10-06 00:00:00'),
1: Timestamp('2017-06-08 00:00:00'),
2: Timestamp('2018-10-22 00:00:00'),
3: Timestamp('2017-02-08 00:00:00'),
4: Timestamp('2019-02-03 00:00:00'),
5: Timestamp('2019-02-26 00:00:00'),
6: Timestamp('2017-04-15 00:00:00'),
7: Timestamp('2017-07-02 00:00:00'),
8: Timestamp('2017-04-04 00:00:00'),
9: Timestamp('2017-03-20 00:00:00'),
10: Timestamp('2018-06-09 00:00:00'),
11: Timestamp('2017-01-15 00:00:00'),
12: Timestamp('2018-05-07 00:00:00'),
13: Timestamp('2018-01-17 00:00:00'),
14: Timestamp('2017-07-11 00:00:00'),
15: Timestamp('2018-12-17 00:00:00'),
16: Timestamp('2018-12-05 00:00:00'),
17: Timestamp('2017-05-22 00:00:00'),
18: Timestamp('2017-08-13 00:00:00'),
19: Timestamp('2018-05-21 00:00:00')},
'store': {0: 81,
1: 128,
2: 81,
3: 128,
4: 25,
5: 128,
6: 11,
7: 124,
8: 43,
9: 25,
10: 25,
11: 124,
12: 124,
13: 128,
14: 81,
15: 11,
16: 124,
17: 11,
18: 167,
19: 128},
'stock': {0: 1,
1: 236,
2: 3,
3: 9,
4: 36,
5: 78,
6: 146,
7: 20,
8: 12,
9: 12,
10: 15,
11: 25,
12: 10,
13: 7,
14: 0,
15: 230,
16: 80,
17: 6,
18: 110,
19: 8},
'sells': {0: 1.0,
1: 17.0,
2: 1.0,
3: 2.0,
4: 1.0,
5: 2.0,
6: 7.0,
7: 1.0,
8: 1.0,
9: 1.0,
10: 2.0,
11: 1.0,
12: 1.0,
13: 1.0,
14: 1.0,
15: 1.0,
16: 1.0,
17: 3.0,
18: 2.0,
19: 1.0}}
)
and this function that I want to use in a groupby-apply:
import numpy as np
def compute_indicator(df):
return (
df.copy()
.assign(
indicator=lambda x: x['a'] < np.percentile(x['b'], 80)
)
.astype(int)
.fillna(1)
)
Where df is meant to be a pandas DataFrame. If I do a group-by apply using pandas, the code executes as expected:
import pandas as pd
# This runs
a = pd.DataFrame.from_dict(x.to_dict()).groupby('store').apply(compute_indicator)
but when trying to run the same on koalas it gives me the following error: ValueError: cannot insert store, already exists
x.groupby('store').apply(compute_indicator)
# ValueError: cannot insert store, already exists
I cannot use a type annotation in compute_indicator because some columns are not fixed (they travel around with the dataframe, meant to be used by other transformations).
What should I do to run the code in koalas?
As of Koalas 0.29.0, when koalas.DataFrame.groupby(keys).apply(f) runs for the first time over an untyped function f, it has to infer the schema, and to do this it runs pandas.DataFrame.head(n).groupby(keys).apply(f). The problem is that the pandas apply receives as its argument the dataframe with the groupby keys both as index and as columns (see this issue).
The result of pandas.DataFrame.head(n).groupby(keys).apply(f) is then converted to a koalas.DataFrame, so if f doesn't drop the key columns this conversion raises an exception because of the duplicated column names (see issue).
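A hedged workaround sketch based on that explanation (untested against Koalas 0.29.0): have the function drop the groupby key before returning, so the conversion back to a koalas DataFrame has no duplicate 'store' column to insert.

import numpy as np

def compute_indicator(df):
    return (
        df.copy()
        .assign(indicator=lambda x: x['a'] < np.percentile(x['b'], 80))
        .astype(int)
        .fillna(1)
        .drop(columns='store')  # drop the key so re-inserting it cannot collide
    )

x.groupby('store').apply(compute_indicator)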
