I am looking for a way to create the column 'min_value' in the dataframe df below. For each row i, take all records of the dataframe that belong to the same ['Date_A', 'Date_B'] group as row i and have an 'Advance' smaller than the 'Advance' of row i, then set 'min_value' for row i to the minimum of the column 'Amount' over that subset:
Initial dataframe:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [10,103,200,5,8,150],
'Amount' : [180,220,200,230,220,240]})
df = df [['Date_A', 'Date_B', 'Advance', 'Amount']]
df
Desired output:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [10,103,200,5,8,150],
'Amount' : [180,220,200,230,220,240],
'min_value': [180,180,180,230,230,220] })
df_out = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out
I wrote the following loop, which I think does the job, but it takes far too long to run. I guess there must be much more efficient ways to accomplish this.
for i in range(len(df)):
    date1 = df['Date_A'][i]    # Date_A of row i
    date2 = df['Date_B'][i]    # Date_B of row i
    advance = df['Advance'][i] # Advance of row i
    # minimum Amount among rows with the same dates and a smaller Advance
    df.loc[i, 'min_value'] = df[(df['Date_A'] == date1) & (df['Date_B'] == date2) & (df['Advance'] < advance)]['Amount'].min()
# for the smallest advance value of a group, set min_value to the row's own amount
df.loc[df['min_value'].isnull(), 'min_value'] = df['Amount']
df
I hope it is clear enough, thanks for your help.
Improvement question
Thanks a lot for the answer. For the last part, the NA rows, instead of the row's own amount I'd like to use the overall minimum amount of the Date_A, Date_B, Advance grouping, so that I get the overall minimum of the last day before Date_A.
Improvement desired output (two records for the smallest advance value)
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2017-12-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-1-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [5,8,150,5],
'Amount' : [230,220,240,225],
'min_value': [225,230,220,225] })
df_out = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out
Thanks
You can use groupby on 'Date_A' and 'Date_B' after sorting the values by 'Advance', and apply cummin followed by shift to the column 'Amount'. Then use fillna with the values from the column 'Amount', such as:
df['min_value'] = (df.sort_values('Advance').groupby(['Date_A','Date_B'])['Amount']
.apply(lambda ser_g: ser_g.cummin().shift()).fillna(df['Amount']))
and you get:
Date_A Date_B Advance Amount min_value
0 2017-12-25 2018-01-01 10 180 180.0
1 2017-12-25 2018-01-01 103 220 180.0
2 2017-12-25 2018-01-01 200 200 180.0
3 2018-01-25 2018-02-01 5 230 230.0
4 2018-01-25 2018-02-01 8 220 230.0
5 2018-01-25 2018-02-01 150 240 220.0
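Depending on the pandas version, groupby(...).apply may prepend the group keys to the result index, which can break the index alignment the snippet above relies on. Below is a minimal sketch of the same cummin/shift idea written with transform, which keeps the original row index. The sample frame is rebuilt from the question (dates kept as timestamps for brevity); everything else is an illustration rather than the answerer's exact code.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Date_A': pd.to_datetime(['2017-12-25'] * 3 + ['2018-01-25'] * 3),
    'Date_B': pd.to_datetime(['2018-01-01'] * 3 + ['2018-02-01'] * 3),
    'Advance': [10, 103, 200, 5, 8, 150],
    'Amount': [180, 220, 200, 230, 220, 240],
})

# Within each (Date_A, Date_B) group, ordered by Advance:
# running minimum of Amount, shifted down one row so that each row
# only sees rows with a smaller Advance.
prev_min = (df.sort_values('Advance')
              .groupby(['Date_A', 'Date_B'])['Amount']
              .transform(lambda s: s.cummin().shift()))

# Rows with no smaller Advance in their group fall back to their own Amount
df['min_value'] = prev_min.fillna(df['Amount'])
print(df)
```

Note that shift treats "previous row" as "smaller Advance", so tied Advance values within a group would need extra handling.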
Related
The logic of what I am trying to do I think is best explained with code:
import pandas as pd
import numpy as np
from datetime import timedelta
np.random.seed(365)  # seed NumPy's generator (the name random was never imported)
#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [start_date + timedelta(days = np.random.exponential(scale = 100)) for start_date in start_date]
df = pd.DataFrame(
{"start_date":start_date,
"end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
I first create a pd.Series with the 1st day of every month in the entire history of the data:
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I then want to do is count the number of df["start_date"] values which are less than the 1st day of each month in the series and where the df["end_date"] values are null (recorded as NaT)
I would think I would use a for loop to do this and somehow groupby the dates series so that the resulting output looks something like this:
month_start  count
2015-01-01   5
2015-02-01   10
2015-03-01   35
The count column in the resulting output is a count of the number of df rows where the df["start_date"] values are less than the 1st of each month in the series and where the df["end_date"] values are null - this occurs for every value in the series
Here is the logic of what I am trying to do:
df.groupby(by = dates)[["start_date", "end_date"]].apply(
lambda x: [x["start_date"] < date for date in dates] & x["end_date"].isnull == True
)
Is this what you want:
df2 = df[df['end_date'].isnull()]
dates_count = dates.apply(lambda x: df2[df2['start_date'] < x]['start_date'].count())
print(pd.concat([dates, dates_count], axis=1))
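If the per-month filtering above turns out to be slow on large data, one possible alternative is to sort the open-ended start dates once and let np.searchsorted count how many fall strictly before each month start. This is a sketch on a small made-up frame (the five rows and the month range below are assumptions for illustration, not the question's random data):

```python
import numpy as np
import pandas as pd

# Small hypothetical sample: three rows have no end date (NaT)
df = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2015-01-09", "2015-01-15", "2015-02-02", "2015-02-20", "2015-03-05"]),
    "end_date": pd.to_datetime(
        ["2015-03-01", None, None, "2015-04-01", None]),
})
dates = pd.Series(pd.date_range("2015-01-01", "2015-03-01", freq="MS"))

# Sorted start dates of the rows whose end_date is missing
open_starts = np.sort(df.loc[df["end_date"].isna(), "start_date"].to_numpy())

# side="left" returns, for each month start, the number of
# open start dates strictly earlier than it
counts = np.searchsorted(open_starts, dates.to_numpy(), side="left")

result = pd.DataFrame({"month_start": dates, "count": counts})
print(result)
```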
IIUC, group by period (shifted by 1 month) and count the NaT, then cumsum to accumulate the counts:
(df['end_date'].isna()
.groupby(df['start_date'].dt.to_period('M').add(1).dt.start_time)
.sum()
.cumsum()
)
Output:
start_date
2015-02-01 0
2015-03-01 0
2015-04-01 0
2015-05-01 0
2015-06-01 0
...
2022-06-01 122
2022-07-01 127
2022-08-01 133
2022-09-01 138
2022-10-01 140
Name: end_date, Length: 93, dtype: int64
I have two dataframes, they have a start/end datetime and a value. Not the same number of rows. The intervals which overlap may not be in the same row/index.
df1
start_datetime end_datetime value
08:50 09:50 5
09:52 10:10 6
10:50 11:30 2
df2
start_datetime end_datetime value
08:51 08:59 3
09:52 10:02 9
10:03 10:30 1
11:03 11:39 1
13:10 13:15 0
I would like to calculate the sum of duration time when df1 and df2 overlap only if df1.value > df2.value.
During one df2 time interval, df1 can overlap multiple times, and the condition is only sometimes True.
I tried something like that:
time = timedelta()
for i, row1 in df1.iterrows():
    t1 = pd.Interval(row1.start, row1.end)
    for j, row2 in df2.iterrows():
        t2 = pd.Interval(row2.start, row2.end)
        if t1.overlaps(t2) and row1.value > row2.value:
            # the overlap runs from the later start to the earlier end
            latest_start = max(row1.start, row2.start)
            earliest_end = min(row1.end, row2.end)
            delta = earliest_end - latest_start
            time += delta
I can loop over every df1 row and test it against all of df2, but that is not optimized.
expected output (example):
Timedelta('0 days 00:99:99')
Here is my solution:
Create DataFrames:
df1 = pd.DataFrame(
{"start_datetime1": ['08:50' ,'09:52' ,'10:50 ' ],
'end_datetime1' : ['09:50','10:10','11:30'] ,
'value1': [5,6,2]})
df2 = pd.DataFrame(
{"start_datetime2": ['08:51' ,'09:52' ,'10:03 ','11:03 ','13:10 ' ],
'end_datetime2' : ['08:59','10:02','10:30','11:39', '13:15'] ,
'value2': [3,9,1,1,0]})
df2["start_datetime2"]= pd.to_datetime(df2["start_datetime2"])
df2["end_datetime2"]= pd.to_datetime(df2["end_datetime2"])
df1["start_datetime1"]= pd.to_datetime(df1["start_datetime1"])
df1["end_datetime1"]= pd.to_datetime(df1["end_datetime1"])
Combine dataframes to make comparison easier. Combined dataframe has all possible matches :
df1['temp'] = 1 #temporary keys to create all combinations
df2['temp'] = 1
df_combined = pd.merge(df1,df2,on='temp').drop('temp',axis=1)
Compare values with lambda function:
df_combined['Result'] = df_combined.apply(
    # overlap duration = earliest end - latest start
    lambda row: min(row["end_datetime1"], row["end_datetime2"]) -
                max(row["start_datetime1"], row["start_datetime2"])
    if pd.Interval(row['start_datetime1'], row['end_datetime1']).overlaps(
        pd.Interval(row['start_datetime2'], row['end_datetime2'])) and
       row["value1"] > row["value2"]
    else pd.Timedelta(0), axis=1)
df_combined
Result :
total_timedelta = df_combined['Result'].sum()
0 days 00:42:00
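For reference, the same all-pairs comparison can be sketched without a row-wise apply. The version below assumes pandas 1.2 or later for merge(how="cross") and uses shortened, hypothetical column names (start1, end1, value1, and so on); the overlap of each pair is taken as earliest end minus latest start, and only positive overlaps where value1 > value2 are summed.

```python
import pandas as pd

# Sample intervals from the question, parsed as times on a common dummy date
df1 = pd.DataFrame({
    "start1": pd.to_datetime(["08:50", "09:52", "10:50"], format="%H:%M"),
    "end1": pd.to_datetime(["09:50", "10:10", "11:30"], format="%H:%M"),
    "value1": [5, 6, 2],
})
df2 = pd.DataFrame({
    "start2": pd.to_datetime(["08:51", "09:52", "10:03", "11:03", "13:10"], format="%H:%M"),
    "end2": pd.to_datetime(["08:59", "10:02", "10:30", "11:39", "13:15"], format="%H:%M"),
    "value2": [3, 9, 1, 1, 0],
})

# Every df1 row paired with every df2 row
pairs = df1.merge(df2, how="cross")

# Latest start and earliest end of each pair
latest_start = pairs["start1"].where(pairs["start1"] >= pairs["start2"], pairs["start2"])
earliest_end = pairs["end1"].where(pairs["end1"] <= pairs["end2"], pairs["end2"])
overlap = earliest_end - latest_start

# Keep pairs that really overlap and satisfy the value condition
keep = (overlap > pd.Timedelta(0)) & (pairs["value1"] > pairs["value2"])
total = overlap[keep].sum()
print(total)
```

The cross join still builds len(df1) * len(df2) rows, so for very large inputs an interval-based join would be worth considering instead.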
Dataframe:
I would like to filter for customer_ids that first appear on or after a certain date, in this case 2019-01-10, and then create a new df with a list of these new customers
df
date customer_id
2019-01-01 429492
2019-01-01 344343
2019-01-01 949222
2019-01-10 429492
2019-01-10 344343
2019-01-10 129292
Output df
customer_id
129292
This is what I have tried so far but this gives me also customer_id's that were active before 10th January 2019
s = df.loc[df["date"]>="2019-01-10", "customer_id"]
df_new = df[df["customer_id"].isin(s)]
df_new
You can use boolean indexing with filtering with Series.isin:
df["date"] = pd.to_datetime(df["date"])
mask1 = df["date"]>="2019-01-10"
mask2 = df["customer_id"].isin(df.loc[~mask1,"customer_id"])
df = df.loc[mask1 & ~mask2, ['customer_id']]
print (df)
customer_id
5 129292
df['date'] = pd.to_datetime(df['date'])
cutoff = pd.to_datetime('2019-01-10')
mask = df['date'] >= cutoff
customers_before = df.loc[~mask, 'customer_id'].unique().tolist()
customers_after = df.loc[mask, 'customer_id'].unique().tolist()
result = set(customers_after) - set(customers_before)
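Another way to express "first appears on or after the cutoff" is to compute each customer's earliest date and filter on that. A short sketch on the question's sample data (the names first_seen and new_customers are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-01",
             "2019-01-10", "2019-01-10", "2019-01-10"],
    "customer_id": [429492, 344343, 949222, 429492, 344343, 129292],
})
df["date"] = pd.to_datetime(df["date"])
cutoff = pd.Timestamp("2019-01-10")

# Earliest date on which each customer shows up
first_seen = df.groupby("customer_id")["date"].min()

# Customers whose first appearance is on or after the cutoff
new_customers = (first_seen[first_seen >= cutoff]
                 .index.to_frame(index=False))
print(new_customers)
```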
The question says "then create a new df with a list of new customers". Read strictly, the output would be empty in this case: 2019-01-10 is the last date, so no new customers appear after it.
If you instead want the list of customers seen on or after a certain date:
df=pd.DataFrame({
'date':['2019-01-01','2019-01-01','2019-01-01',
'2019-01-10','2019-01-10','2019-01-10'],
'customer_id':[429492,344343,949222,429492,344343,129292]
})
certain_date=pd.to_datetime('2019-01-10')
df.date=pd.to_datetime(df.date)
df=df[
df.date>=certain_date
]
print(df)
date customer_id
3 2019-01-10 429492
4 2019-01-10 344343
5 2019-01-10 129292
If your 'date' column has datetime objects you just have to do:
from datetime import datetime
df_new = df[df['date'] >= datetime(2019, 1, 10)]['customer_id']
If your 'date' column doesn't contain datetime objects, you should convert it first by using the to_datetime method:
df['date'] = pd.to_datetime(df['date'])
And then apply the methodology described above.
I have the following data, extracted with pandas from csv files:
df1=pd.read_csv(path,parse_dates=True)
The print of df1 gives :
control Avg_return
2019-09-07 True 0
2019-06-06 True 0
2019-02-19 True 0
2019-01-17 True 0
2018-12-20 True 0
2018-11-27 True 0
2018-10-12 True 0
... ... ...
Then I load the second csv file:
df2=pd.read_csv(path,parse_dates=True)
The print of df2 gives :
return
2010-01-01 NaN
2010-04-01 0.010920
2010-05-01 -0.004404
2010-06-01 -0.025209
2010-07-01 -0.023280
... ...
The aim of my code is:
1. Take a date from df1.
2. Subtract 6 days from the date taken in step 1.
3. Subtract 244 days from the date taken in step 1.
4. Take all the returns in df2 between these two dates.
5. Compute the mean of these returns and store it in Avg_return.
I did this :
for i in range(0,df1_row):
    #I go through my data df1
    if (control.iloc[i]==True):
        #I check if control_1 is true
        date_1=df1.index[i]-pd.to_timedelta(6, unit='d')
        # I remove 6 days from my date
        date_2=df1.index[i]-pd.to_timedelta(244, unit='d')
        # I remove 244 days from my date
        df1.loc[i,"Average_return"] = df2[[date_1:date_2],["return"]].mean()
        # I want to make the mean of the return between my date-6 days and my date-244 days
Unfortunately it gives me this error :
df1.loc[i,"Average_return"] = df2[[date1:date2],["return"]].mean()
^
SyntaxError: invalid syntax
Is someone able to help me? :)
The following looks a bit ugly, but I think it works :)
Dummy df's:
import numpy as np
import pandas as pd
cols = ['date', 'control', 'Avg_return']
data = [
[pd.to_datetime('2019-09-07'), True, 0],
[pd.to_datetime('2019-06-06'), True, 0]
]
df1 = pd.DataFrame(data, columns=cols)
cols2 = ['date', 'return']
data2 = [
[pd.to_datetime('2010-01-01'), np.nan],
[pd.to_datetime('2010-04-01'), 0.010920],
[pd.to_datetime('2019-09-01'), 1]
]
df2 = pd.DataFrame(data2, columns=cols2)
Drafted solution:
import datetime as dt  # needed for dt.timedelta below
df1['date_minus_6'] = df1['date'] - dt.timedelta(days=6)
df1['date_minus_244'] = df1['date'] - dt.timedelta(days=244)
for i in range(0, df1.shape[0]):
    for j in range(0, df2.shape[0]):
        if df2['date'].iloc[j] == df1['date_minus_6'].iloc[i]:
            df1['Avg_return'].iloc[i] = (
                df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
            ).mean()
        elif df2['date'].iloc[j] == df1['date_minus_244'].iloc[i]:
            df1['Avg_return'].iloc[i] = (
                df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
            ).mean()
Output:
date control Avg_return date_minus_6 date_minus_244
0 2019-09-07 True 1.0 2019-09-01 2019-01-06
1 2019-06-06 True 0.0 2019-05-31 2018-10-05
import csv
import pandas as pd
df1=pd.read_csv('dsf1.csv',parse_dates=True)
df2=pd.read_csv('dsf2.csv',parse_dates=True)
df1.columns = ['date', 'control', 'return']
df2.columns = ['date', 'return']
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
for i in range(0, df1.shape[0]):
    if df1['control'][i] == True:
        # both offsets are taken from the date of the current df1 row i
        date_1 = df1['date'][i] - pd.to_timedelta(6, unit='d')
        date_2 = df1['date'][i] - pd.to_timedelta(244, unit='d')
        #I'm not sure if average_return has the correct condition, but adjust as you see fit
        df1.loc[i, 'average_return'] = (df1[df1['date'] > date_1]['return'] - df2[df2['date'] > date_2]['return']).mean()
print(df1)
This is a different approach without an explicit loop over all rows:
# make sure your index is a datetime index
df1.index = pd.to_datetime(df1.index)
df1['date_1'] = df1.index - pd.to_timedelta(6, unit='d')
df1['date_2'] = df1.index - pd.to_timedelta(244, unit='d')
# date_2 (244 days back) is the earlier bound, so on an ascending df2 index the slice runs date_2 -> date_1
df1['Average_return'] = df1.apply(lambda r: df2.loc[r['date_2']: r['date_1'], 'return'].mean(), axis=1)
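To make the windowing concrete, here is a small self-contained sketch with made-up returns (the four df2 rows and the helper name window_mean are assumptions for illustration). It assumes df2 has a sorted DatetimeIndex, so that label slicing with timestamps that are not in the index still works, and it only fills rows where control is True.

```python
import numpy as np
import pandas as pd

# Hypothetical returns, indexed by date in ascending order
df2 = pd.DataFrame(
    {"return": [np.nan, 0.01, 0.03, -0.02]},
    index=pd.to_datetime(["2019-01-01", "2019-03-01", "2019-08-30", "2019-09-05"]),
).sort_index()

df1 = pd.DataFrame(
    {"control": [True, False]},
    index=pd.to_datetime(["2019-09-07", "2019-06-06"]),
)

def window_mean(day):
    """Mean of df2['return'] between day - 244 days and day - 6 days (inclusive)."""
    start = day - pd.Timedelta(days=244)
    end = day - pd.Timedelta(days=6)
    return df2.loc[start:end, "return"].mean()

# Only compute the average where control is True, leave NaN otherwise
df1["Avg_return"] = [
    window_mean(day) if flag else np.nan
    for day, flag in zip(df1.index, df1["control"])
]
print(df1)
```

For 2019-09-07 the window runs from 2019-01-06 to 2019-09-01, so only the 0.01 and 0.03 returns are averaged.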
I have a dataframe with datetime index. First of all, here is my fake data.
import pandas as pd
data1 = {'date' : ['20190219 093100', '20190219 103200','20190219 171200','20190219 193900','20190219 194500','20190220 093500','20190220 093600'],
'number' : [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916 , 14.95]}
df1 = pd.DataFrame(data1)
df1 = df1.set_index('date')
df1.index = pd.to_datetime(df1.index).strftime('%Y-%m-%d %H:%M:%S')
What I want to do is create a new column named "New_column" holding a categorical variable, 'Yes' or 'No', depending on whether the value in the "number" column increases by at least 20 percent later in the same day.
So in this fake data, only the second value "12.85" will be "Yes", because it has increased by 23.35 percent at the timestamp "2019-02-19 19:45:00".
Even though the first value is 25% greater than the 3rd value, the higher value came first, so this is not an increase over time and should not be counted.
After the process, I should have NaN in the "New_column" for the last row of each day.
I have been trying many different ways to do it using:
pandas.DataFrame.pct_change
pandas.DataFrame.diff
How can I do this in a Pythonic way?
Initial setup
import numpy as np
import pandas as pd
data = {
    'datetime' : ['20190219 093100', '20190219 103200','20190219 171200','20190219 193900','20190219 194500','20190220 093500','20190220 093600'],
    'number' : [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916 , 14.95]
}
df = pd.DataFrame(data)
# parse with an explicit format instead of a unit-less astype('datetime64')
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y%m%d %H%M%S')
df = df.sort_values('datetime')
df['date'] = df['datetime'].dt.date
df['New_column'] = 'No'
Find all rows that see a 20% increase later in the same day
indeces_true = set([])
for idx_low, row_low in df.iterrows():
    for idx_high, row_high in df.iterrows():
        if (row_low['date'] == row_high['date'] and
            row_low['datetime'] < row_high['datetime'] and
            row_low['number'] * 1.2 < row_high['number']):
            indeces_true.add(idx_low)
# Assign 'Yes' for the true rows
for i in indeces_true:
    df.loc[i, 'New_column'] = 'Yes'
# Last timestamp every day assigned as NaN
df.loc[df['date'] != df['date'].shift(-1), 'New_column'] = np.nan
# Optionally convert to categorical variable
df['New_column'] = pd.Categorical(df['New_column'])
Output
>>> df
datetime number date New_column
0 2019-02-19 09:31:00 18.6125 2019-02-19 No
1 2019-02-19 10:32:00 12.8500 2019-02-19 Yes
2 2019-02-19 17:12:00 14.8900 2019-02-19 No
3 2019-02-19 19:39:00 15.8301 2019-02-19 No
4 2019-02-19 19:45:00 15.8500 2019-02-19 NaN
5 2019-02-20 09:35:00 14.9160 2019-02-20 No
6 2019-02-20 09:36:00 14.9500 2019-02-20 NaN
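The double loop above compares every pair of rows. A possible loop-free variant is to compute, per day, the maximum of all later values with a reversed cummax and compare it against 1.2 times the current value. This is a sketch of that idea on the same data (the helper name later_max is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "datetime": ["20190219 093100", "20190219 103200", "20190219 171200",
                 "20190219 193900", "20190219 194500", "20190220 093500",
                 "20190220 093600"],
    "number": [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916, 14.95],
})
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y%m%d %H%M%S")
df = df.sort_values("datetime").reset_index(drop=True)
day = df["datetime"].dt.date

def later_max(s):
    # Running max taken from the end of the day backwards, then shifted
    # so each row sees only strictly later rows (NaN for the last row)
    return s.iloc[::-1].cummax().iloc[::-1].shift(-1)

future_max = df.groupby(day)["number"].transform(later_max)

# 'Yes' when some later value of the same day exceeds the current one by 20%
df["New_column"] = np.where(future_max > df["number"] * 1.2, "Yes", "No")
# The last row of each day has nothing to compare against
df.loc[future_max.isna(), "New_column"] = np.nan
print(df)
```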