Difficult date calculation in DataFrame in Python Pandas?

I have a DataFrame like below:
rng = pd.date_range('2020-12-11', periods=5, freq='T')
df = pd.DataFrame({ 'Date': rng, 'status': ['active', 'active', 'finished', 'finished', 'active'] })
And I need to create 2 new columns in this DataFrame:
New1 = number of days from the "Date" column until today, for rows with status 'active'
New2 = number of days from the "Date" column until today, for rows with status 'finished'
(A sample of the expected result was shown in the original post.)

Use Series.rsub to subtract from the right side, with today's date obtained from Timestamp('now') and Timestamp.floor; convert the resulting timedeltas to days with Series.dt.days and assign the new columns by condition with Series.where:
rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({'Date': rng,
                   'status': ['active', 'active', 'finished', 'finished', 'active']})
days = df['Date'].rsub(pd.Timestamp('now').floor('d')).dt.days
df['New1'] = days.where(df['status'].eq('active'))
df['New2'] = days.where(df['status'].eq('finished'))
print (df)
        Date    status  New1  New2
0 2020-12-01    active  13.0   NaN
1 2020-12-02    active  12.0   NaN
2 2020-12-03  finished   NaN  11.0
3 2020-12-04  finished   NaN  10.0
4 2020-12-05    active   9.0   NaN
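A small follow-up (my note, not part of the original answer): Series.where fills the non-matching rows with NaN, which forces New1 and New2 to float. With a reasonably recent pandas, the nullable Int64 dtype keeps whole numbers alongside missing values:
df['New1'] = days.where(df['status'].eq('active')).astype('Int64')
df['New2'] = days.where(df['status'].eq('finished')).astype('Int64')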

Related

Loop to create multiple lists, with different naming

I would like to create a loop that creates multiple lists, each named differently.
I have a dataframe read from an Excel file that I am trying to filter by month (ideally the lists should be named 1, 2, 3, etc.).
Each month should produce one list.
In the end I need to loop through those lists again to compute the average and the length.
If you have any questions, let me know.
import pandas as pd
#read data
excel = 'short data.xlsx'
data = pd.read_excel(excel, parse_dates=['Closed Date Time'])
df = pd.DataFrame(data)
# data.info()
#Format / delete time from date column
data['Closed Date Time'] = pd.to_datetime(data['Closed Date Time'])
df['Close_Date'] = data['Closed Date Time'].dt.date
df['Close_Date'] = pd.to_datetime(df['Close_Date'])
#loop to create multiple lists
times = 12
for _ in range(times):
    if times <= 9:
        month = df[df['Close_Date'].dt.strftime('%Y-%m') == f'2018-0{times}']
    else:
        month = df[df['Close_Date'].dt.strftime('%Y-%m') == f'2018-{times}']
(Example data was attached as an image in the original post.)
Creating lists with different names is usually the wrong idea.
You should rather create a single list with sublists (using indexes instead of names) or a single dictionary with the names as keys. Or, even better, create a single DataFrame with all the values (in rows or columns); it will be more useful for later calculations.
And all of this may not even need a for-loop.
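For illustration, a minimal sketch of the dictionary idea (my example; it assumes df already has the datetime Close_Date column built in the question's code):
monthly = {name: group
           for name, group in df.groupby(df['Close_Date'].dt.strftime('%Y-%m'))}

# loop over the dictionary instead of separately named lists
for name, group in monthly.items():
    print(name, len(group))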
But I think you can do it in a different way. You could create a column with the year and month:
df['Year_Month'] = df['Close_Date'].dt.strftime('%Y-%m')
And later use groupby() to run a function on every month, without for-loops.
averages = df.groupby('Year_Month').mean()
sizes = df.groupby('Year_Month').size()
Minimal working code with example data:
import pandas as pd
#df = pd.read_excel('short data.xlsx', parse_dates=['Closed Date Time'])
data = {
    'Closed Date Time': ['2022.10.25 01:00', '2022.10.24 01:00', '2018.10.25 01:00', '2018.10.24 01:00', '2018.10.23 01:00'],
    'Price': [1, 2, 3, 4, 5],
    'User': ['A','A','A','B','C'],
}
df = pd.DataFrame(data)
print(df)
df['Closed Date Time'] = pd.to_datetime(df['Closed Date Time'])
df['Close_Date'] = df['Closed Date Time'].dt.date
df['Close_Date'] = pd.to_datetime(df['Close_Date'])
df['Year_Month'] = df['Close_Date'].dt.strftime('%Y-%m')
print(df)
print('\n--- averages ---\n')
averages = df.groupby('Year_Month').mean()
print(averages)
print('\n--- sizes ---\n')
sizes = df.groupby('Year_Month').size()
print(sizes)
Result:
Closed Date Time Price
0 2022.10.25 01:00 1
1 2022.10.24 01:00 2
2 2018.10.25 01:00 3
3 2018.10.24 01:00 4
4 2018.10.23 01:00 5
Closed Date Time Price Close_Date Year_Month
0 2022-10-25 01:00:00 1 2022-10-25 2022-10
1 2022-10-24 01:00:00 2 2022-10-24 2022-10
2 2018-10-25 01:00:00 3 2018-10-25 2018-10
3 2018-10-24 01:00:00 4 2018-10-24 2018-10
4 2018-10-23 01:00:00 5 2018-10-23 2018-10
--- averages ---
Price
Year_Month
2018-10 4.0
2022-10 1.5
--- sizes ---
Year_Month
2018-10 3
2022-10 2
dtype: int64
EDIT:
data = df.groupby('Year_Month').agg({'Price':['mean','size']})
print(data)
Result:
Price
mean size
Year_Month
2018-10 4.0 3
2022-10 1.5 2
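A possible variant (my addition, not part of the original answer) is named aggregation, which gives flat column names instead of the two-level header; the names mean_price and n_rows below are just my own labels:
data = df.groupby('Year_Month').agg(mean_price=('Price', 'mean'),
                                    n_rows=('Price', 'size'))
print(data)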
EDIT:
An example with .groupby() and .apply() to run a more complex function on every group. It then uses .to_dict() and .plot():
import pandas as pd
#df = pd.read_excel('short data.xlsx', parse_dates=['Closed Date Time'])
data = {
    'Closed Date Time': ['2022.10.25 01:00', '2022.10.24 01:00', '2018.10.25 01:00', '2018.10.24 01:00', '2018.10.23 01:00'],
    'Price': [1, 2, 3, 4, 5],
    'User': ['A','A','A','B','C'],
}
df = pd.DataFrame(data)
#print(df)
df['Closed Date Time'] = pd.to_datetime(df['Closed Date Time'])
df['Close_Date'] = df['Closed Date Time'].dt.date
df['Close_Date'] = pd.to_datetime(df['Close_Date'])
df['Year_Month'] = df['Close_Date'].dt.strftime('%Y-%m')
#print(df)
def calculate(group):
    #print(group)
    #print(group['Price'].mean())
    #print(group['User'].unique().size)
    result = {
        'Mean': group['Price'].mean(),
        'Users': group['User'].unique().size,
        'Div': group['Price'].mean()/group['User'].unique().size
    }
    return pd.Series(result)
data = df.groupby('Year_Month').apply(calculate)
print(data)
print('--- dict ---')
print(data.to_dict())
#print(data.to_dict('dict'))
print('--- records ---')
print(data.to_dict('records'))
print('--- list ---')
print(data.to_dict('list'))
print('--- index ---')
print(data.to_dict('index'))
import matplotlib.pyplot as plt
data.plot(kind='bar', rot=0)
plt.show()
Result:
Mean Users Div
Year_Month
2018-10 4.0 3.0 1.333333
2022-10 1.5 1.0 1.500000
--- dict ---
{'Mean': {'2018-10': 4.0, '2022-10': 1.5}, 'Users': {'2018-10': 3.0, '2022-10': 1.0}, 'Div': {'2018-10': 1.3333333333333333, '2022-10': 1.5}}
--- records ---
[{'Mean': 4.0, 'Users': 3.0, 'Div': 1.3333333333333333}, {'Mean': 1.5, 'Users': 1.0, 'Div': 1.5}]
--- list ---
{'Mean': [4.0, 1.5], 'Users': [3.0, 1.0], 'Div': [1.3333333333333333, 1.5]}
--- index ---
{'2018-10': {'Mean': 4.0, 'Users': 3.0, 'Div': 1.3333333333333333}, '2022-10': {'Mean': 1.5, 'Users': 1.0, 'Div': 1.5}}

Datetime conversion - convert only date for rows not containing time

I have the dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'ID': ['AB01', 'AB02', 'AB03', 'AB04', 'AB05', 'AB06'],
        'l_date': ["1/4/2021", "1/4/2021", '1/5/2021', '1/5/2021', '1/8/2021', np.nan],
        'l_time': ["17:05", "6:00", "13:43:10", "00:00", np.nan, np.nan]
    }
)
And I want to create a new column, d_datetime, which combines l_date and l_time into a datetime column.
My code is this:
cols = ['l_date','l_time']
df['d_datetime'] = df[cols].astype(str).agg(' '.join, axis=1)
df['d_datetime'] = df['d_datetime'].replace({'nan':''}, regex=True)
df['d_datetime'] = pd.to_datetime(df['d_datetime'], errors="coerce").dt.strftime("%d/%m/%Y %H:%M")
Now, this generates the time for AB05 as 00:00 and creates the datetime. But for the rows which don't have a time in column l_time, I want d_datetime to contain only the date. How can I achieve this?
Initially I tried
df['d_datetime'] = df['d_datetime'].replace({' 00:00':''}, regex=True)
But this removes the time for AB04 too, and I don't want that. How can I achieve the desired end result?
UPDATE
From the result I get so far: I want to check if l_time is NaN and, if it is, apply replace({'00:00': ''}) to that row. How can I achieve this?
Use:
df['d_datetime'] = (pd.to_datetime(df['l_date']).dt.strftime("%d/%m/%Y") + ' ' +
                    pd.to_datetime(df['l_time']).dt.time.replace(np.nan, '').astype(str).str[0:5]).str.strip()
OUTPUT:
     ID    l_date    l_time        d_datetime
0  AB01  1/4/2021     17:05  04/01/2021 17:05
1  AB02  1/4/2021      6:00  04/01/2021 06:00
2  AB03  1/5/2021  13:43:10  05/01/2021 13:43
3  AB04  1/5/2021     00:00  05/01/2021 00:00
4  AB05  1/8/2021       NaN        08/01/2021
5  AB06       NaN       NaN               NaN
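To address the update (blanking out the time only where l_time is missing), the same replacement can be restricted to those rows: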
df.loc[df["l_time"].isnull(), "d_datetime"] = df["d_datetime"].replace(
{"00:00": ""}, regex=True
)
Here is the solution:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'ID': ['AB01', 'AB02', 'AB03', 'AB04', 'AB05', 'AB06'],
        'l_date': ["1/4/2021", "1/4/2021", '1/5/2021', '1/5/2021', '1/8/2021', np.nan],
        'l_time': ["17:05", "6:00", "13:43:10", "00:00", np.nan, np.nan]
    }
)
df.l_time = df.l_time.fillna('')
df['d_datetime'] = df['l_date'].astype(str) + " " + df['l_time'].astype(str)
print(df)
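One note on this approach (my observation, not part of the answer): for AB06 the missing l_date becomes the literal string 'nan' after astype(str), so you may also want to strip the result and mask those rows, for example:
df['d_datetime'] = df['d_datetime'].str.strip().replace({'nan': np.nan})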

Conditionally merging dataframes on time variable

I have two dataframes that I want to merge on the date, id and time variables in order to compute a duration.
from numpy import *
from pandas import *
df1 = DataFrame({
    'id': ['a']*4,
    'date': ['02-02-2015']*4,
    'time_1': ['08:00:00', '09:00:00', '10:30:00', '12:45']})
df1
id date time
0 a 02-02-2015 08:00:00
1 a 02-02-2015 09:00:00
2 a 02-02-2015 10:30:00
3 a 02-02-2015 12:45:00
-------------------------------------------------------------------------------------------------
df2 = DataFrame({
    'id': ['a']*7,
    'date': ['02-02-2015']*7,
    'time_2': ['08:00:00', '08:09:00', '08:04:01', '08:52:36', '09:34:25', '10:30:00', '11:23:38']})
df2
id date time
0 a 02-02-2015 08:00:00
1 a 02-02-2015 08:09:00
2 a 02-02-2015 08:04:01
3 a 02-02-2015 08:52:36
4 a 02-02-2015 09:00:00
5 a 02-02-2015 10:30:00
6 a 02-02-2015 11:23:38
The rule that I want my merge to follow is that each row in df2 needs to go with the closest previous time in df1.
The intermediate result would be
intermediateResult = DataFrame({
    'id': ['a']*8,
    'date': ['02-02-2015']*8,
    'time_1': ['08:00:00', '08:00:00', '08:00:00', '08:00:00', '09:00:00', '10:30:00', '10:30:00', '12:45'],
    'time_2': ['08:00:00', '08:09:00', '08:04:01', '08:52:36', '09:34:25', '10:30:00', '11:23:38', nan]})
intermediateResult
id date time_1 time_2
0 a 02-02-2015 08:00:00 08:00:00
1 a 02-02-2015 08:00:00 08:09:00
2 a 02-02-2015 08:00:00 08:04:01
3 a 02-02-2015 08:00:00 08:52:36 # end
4 a 02-02-2015 09:00:00 09:34:25 # end
5 a 02-02-2015 10:30:00 10:30:00
6 a 02-02-2015 10:30:00 11:23:38 # end
7 a 02-02-2015 12:45 NaN
Finally, I want to get the time difference between the latest time_2 of each period (indicated with the comment # end) and its corresponding time_1.
The final result would look like this
finalResult = DataFrame({
    'id': ['a']*4,
    'date': ['02-02-2015']*4,
    'Duration': ['00:52:36', '00:34:25', '00:53:38', nan]})
finalResult
id date Duration
0 a 02-02-2015 00:52:36
1 a 02-02-2015 00:34:25
2 a 02-02-2015 00:53:38
3 a 02-02-2015 NaN
Using different merge methods, I came to the same answer. Eventually I used merge_asof with direction='backward', as per your request. Unfortunately it is not identical to yours in the sense that I have no NaN. Happy to help further if you give information on how you end up with NaN in one row.
#Join date to time and coerce to datetime
df1['datetime'] = pd.to_datetime(df1.date.str.cat(df1.time_1, sep=' '))
df2['datetime'] = pd.to_datetime(df2.date.str.cat(df2.time_2, sep=' '))
df2['time_2'] = df2['time_2'].apply(lambda x: (x[-5:]))  # strip the hours from time_2; I anticipate using it as a duration
#Sort to allow merge_asof
df1 = df1.sort_values('datetime')
df2 = df2.sort_values('datetime')
#Merge the dataframes, joining on datetime to the nearest hour
df3 = pd.merge_asof(df2, df1, on='datetime', by='id', tolerance=pd.Timedelta('2H'), allow_exact_matches=True, direction='backward').dropna()
#df3 = df2.merge(df1, left_on=df2.datetime.dt.hour, right_on=df1.datetime.dt.hour, how='left').drop(columns=['key_0', 'id_y', 'date_y']).fillna(0)  # alternative merge
df3.set_index('datetime', inplace=True)  # set datetime as index
df3['minutes'] = df3.index.minute  # extract the minute in each row; it looks like you want the highest minute in each hour
#Group by hour; idxmax selects the index with the highest minutes in each hour, then drop unwanted rows
finalResult = df3.loc[df3.groupby([df3.index.hour, df3.date_x])['minutes'].idxmax()].reset_index().drop(columns=['datetime', 'time_1', 'date_y', 'minutes'])
finalResult.columns = ['id', 'date', 'Duration(min)']
finalResult
Using the solution suggested by @wwnde, I've found one that scales better to my real data set:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': ['a']*4,
    'date': ['02-02-2015']*4,
    'time_1': ['08:00:00', '09:00:00', '10:30:00', '12:45:00']
})
df2 = pd.DataFrame({
    'id': ['a']*7,
    'date': ['02-02-2015',
             '02-02-2015',
             '03-02-2015',  # small change here relative to the df in my first post
             '02-02-2015',
             '02-02-2015',
             '02-02-2015',
             '02-02-2015'],
    'time_2': ['08:00:00', '08:09:00', '08:04:01', '08:52:36', '09:34:25', '10:30:00', '11:23:38']
})
----------------------------------------------------
def preproDf(df1, df2, time_1, time_2, _id, date):
    '''
    Preprocess the dataframes for the following operations
    df1: pd.DataFrame, left dataframe
    df2: pd.DataFrame, right dataframe
    time_1: str, name of the time column in the left dataframe
    time_2: str, name of the time column in the right dataframe
    _id: str, name of the id variable. Should be the same for both dataframes
    date: str, name of the date variable. Should be the same for both dataframes
    return: None
    '''
    df2[time_2] = df2[time_2].apply(pd.to_datetime)
    df1[time_1] = df1[time_1].apply(pd.to_datetime)
    # sort in place to allow merge_asof
    df1.sort_values([_id, date, time_1], inplace=True)
    df2.sort_values([_id, date, time_2], inplace=True)
def processDF(df1, df2, time_1, time_2, _id, date):
    # initialisation with the first group
    groupKeys = list(df2.groupby([_id, date]).groups.keys())
    dfGroup = groupKeys[0]
    group = df2.groupby([_id, date]).get_group(dfGroup)
    rslt = pd.merge_asof(group, df1, left_on=time_2, right_on=time_1, by=[_id, date],
                         tolerance=pd.Timedelta('2H'), allow_exact_matches=True, direction='backward')  # .dropna()
    # For loop to stack the remaining groups into one array
    for group in groupKeys[1:]:  # iteration starts at the second element
        group = df2.groupby([_id, date]).get_group(group)
        item = pd.merge_asof(group, df1, left_on=time_2, right_on=time_1, by=[_id, date],
                             tolerance=pd.Timedelta('2H'), allow_exact_matches=True, direction='backward')  # .dropna()
        rslt = np.vstack((rslt, item))
    rslt = pd.DataFrame(rslt, columns=item.columns)
    # Creating the timeDifference variable
    rslt['timeDifference'] = rslt[time_2] - rslt[time_1]
    # Getting the actual result
    rslt = rslt.groupby([_id, date, time_1]).timeDifference.max()
    rslt = pd.DataFrame(rslt).reset_index()
    rslt = rslt.rename({time_1: 'openTime'}, axis='columns')
    return rslt
The result:
preproDf(df1, df2, 'time_1', 'time_2', 'id', 'date')
processDF(df1, df2, 'time_1', 'time_2', 'id', 'date')
id date time_1 screenOnDuration
0 a 02-02-2015 2020-05-29 08:00:00 00:52:36
1 a 02-02-2015 2020-05-29 09:00:00 00:34:25
2 a 02-02-2015 2020-05-29 10:30:00 00:53:38
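For reference, here is a stripped-down sketch of the matching rule on its own (my simplification of the answers above, not the posted solution; it does not reproduce the NaN row for the 12:45 period, because merge_asof only keeps df2 rows here):
import pandas as pd

df1 = pd.DataFrame({'id': ['a']*4, 'date': ['02-02-2015']*4,
                    'time_1': ['08:00:00', '09:00:00', '10:30:00', '12:45:00']})
df2 = pd.DataFrame({'id': ['a']*7, 'date': ['02-02-2015']*7,
                    'time_2': ['08:00:00', '08:09:00', '08:04:01', '08:52:36',
                               '09:34:25', '10:30:00', '11:23:38']})

# Build full datetimes so merge_asof can compare them, and sort (merge_asof requires sorted keys)
df1['dt1'] = pd.to_datetime(df1['date'] + ' ' + df1['time_1'], dayfirst=True)
df2['dt2'] = pd.to_datetime(df2['date'] + ' ' + df2['time_2'], dayfirst=True)
df1 = df1.sort_values('dt1')
df2 = df2.sort_values('dt2')

# Each df2 row gets the closest previous df1 time; the largest gap per df1 time is that period's duration
matched = pd.merge_asof(df2, df1, left_on='dt2', right_on='dt1', by=['id', 'date'], direction='backward')
duration = (matched.assign(gap=matched['dt2'] - matched['dt1'])
                   .groupby(['id', 'date', 'dt1'])['gap'].max()
                   .reset_index(name='Duration'))
print(duration)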

Cumulative sum over days in python

I have the following dataframe:
date money
0 2018-01-01 20
1 2018-01-05 30
2 2018-02-15 7
3 2019-03-17 150
4 2018-01-05 15
...
2530 2019-03-17 350
And I need:
[(2018-01-01,20),(2018-01-05,65),(2018-02-15,72),...,(2019-03-17,572)]
So I need to do a cumulative sum of money over all days:
So far I have tried many things, and the closest I think I've got is:
graph_df.date = pd.to_datetime(graph_df.date)
temporary = graph_df.groupby('date').money.sum()
temporary = temporary.groupby(temporary.index.to_period('date')).cumsum().reset_index()
But this gives me ValueError: Invalid frequency: date
Could anyone help please?
Thanks
I don't think you need the second groupby. You can simply add a column with the cumulative sum.
This does the trick for me:
import pandas as pd
df = pd.DataFrame({'date': ['01-01-2019','04-06-2019', '07-06-2019'], 'money': [12,15,19]})
df['date'] = pd.to_datetime(df['date']) # this is not strictly needed
tmp = df.groupby('date')['money'].sum().reset_index()
tmp['money_sum'] = tmp['money'].cumsum()
Converting the date column to an actual date is not needed for this to work.
list(map(tuple, df.groupby('date', as_index=False)['money'].sum().values))
Edit:
df = pd.DataFrame({'date': ['2018-01-01', '2018-01-05', '2018-02-15', '2019-03-17', '2018-01-05'],
                   'money': [20, 30, 7, 150, 15]})
#df['date'] = pd.to_datetime(df['date'])
#df = df.sort_values(by='date')
temporary = df.groupby('date', as_index=False)['money'].sum()
temporary['money_cum'] = temporary['money'].cumsum()
Result:
>>> list(map(tuple, temporary[['date', 'money_cum']].values))
[('2018-01-01', 20),
('2018-01-05', 65),
('2018-02-15', 72),
('2019-03-17', 222)]
You can try using df.groupby('date').sum():
example data frame:
df
date money
0 01/01/2018 20
1 05/01/2018 30
2 15/02/2018 7
3 17/03/2019 150
4 05/01/2018 15
5 17/03/2019 550
6 15/02/2018 13
df['cumsum'] = df.money.cumsum()
list(zip(df.groupby('date').tail(1)['date'], df.groupby('date').tail(1)['cumsum']))
[('01/01/2018', 20),
('05/01/2018', 222),
('17/03/2019', 772),
('15/02/2018', 785)]
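One caveat worth noting (my observation, not part of the answer): the running cumsum only follows calendar order if the frame is sorted by date first, for example:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values('date').reset_index(drop=True)
df['cumsum'] = df['money'].cumsum()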

pandas merge on date range

I have two dataframes,
df = pd.DataFrame({'Date': ['2011-01-02', '2011-04-10', '2015-02-02', '2016-03-03'], 'Price': [100, 200, 300, 400]})
df2 = pd.DataFrame({'Date': ['2011-01-01', '2014-01-01'], 'Revenue': [14, 128]})
I want to add df2.Revenue to df to produce the table below, using both date columns for reference.
Date Price Revenue
2011-01-02 100 14
2011-04-10 200 14
2015-02-02 300 128
2016-03-03 400 128
As above, the revenue is added according to df2.Date and df.Date.
Use merge_asof:
df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
df3 = pd.merge_asof(df, df2, on='Date')
print (df3)
        Date  Price  Revenue
0 2011-01-02    100       14
1 2011-04-10    200       14
2 2015-02-02    300      128
3 2016-03-03    400      128
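One small caveat (my note; both sample frames here are already in ascending order): merge_asof requires the on column to be sorted in both frames, so with unsorted real data you would sort first:
df = df.sort_values('Date')
df2 = df2.sort_values('Date')
df3 = pd.merge_asof(df, df2, on='Date')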
