Create a running max column in DataFrame for each day - python

I have a simple DataFrame that looks something like this:
TimeStamp, Value
1-Jan 06:10, 5
1-Jan 08:15, 7
1-Jan 15:30, 3
2-Jan 07:05, 1
2-Jan 10:15, 3
2-Jan 13:30, 2
How can I add a third column to the same DataFrame that would show me the running max value of 'Value' for each day and reset with each next day? I want the DataFrame to look like this:
TimeStamp, Value, DayMax
1-Jan 06:10, 5, 7
1-Jan 08:15, 7, 7
1-Jan 15:30, 3, 7
2-Jan 07:05, 1, 3
2-Jan 10:15, 3, 3
2-Jan 13:30, 2, 3
I tried using .rolling().max(...), but I need the max value even in earlier rows, before the maximum is encountered and before min_periods is reached. I also need the max to reset each day, so a fixed window parameter does not fit.
I am hoping to avoid looping and complex code manipulations, as I will be doing it over a very large DataFrame, so would much prefer something built-in!

If you convert the TimeStamp column to a datetime using to_datetime then you can groupby on the date and call transform to return a Series that is the max value for each day:
In [54]:
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'], format='%d-%b %H:%M')
df
Out[54]:
TimeStamp Value
0 1900-01-01 06:10:00 5
1 1900-01-01 08:15:00 7
2 1900-01-01 15:30:00 3
3 1900-01-02 07:05:00 1
4 1900-01-02 10:15:00 3
5 1900-01-02 13:30:00 2
In [55]:
df['DayMax'] = df.groupby(df['TimeStamp'].dt.date)['Value'].transform('max')
df
Out[55]:
TimeStamp Value DayMax
0 1900-01-01 06:10:00 5 7
1 1900-01-01 08:15:00 7 7
2 1900-01-01 15:30:00 3 7
3 1900-01-02 07:05:00 1 3
4 1900-01-02 10:15:00 3 3
5 1900-01-02 13:30:00 2 3
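The title asks for a "running" max while the desired output shows the overall max per day, so here is a minimal sketch of both side by side. It rebuilds the sample data from the question; the RunningMax column and the use of dt.normalize() as the grouping key are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "TimeStamp": ["1-Jan 06:10", "1-Jan 08:15", "1-Jan 15:30",
                  "2-Jan 07:05", "2-Jan 10:15", "2-Jan 13:30"],
    "Value": [5, 7, 3, 1, 3, 2],
})
df["TimeStamp"] = pd.to_datetime(df["TimeStamp"], format="%d-%b %H:%M")

# Midnight of each timestamp serves as the per-day grouping key
day = df["TimeStamp"].dt.normalize()

# Overall maximum of each day, broadcast back to every row of that day
df["DayMax"] = df.groupby(day)["Value"].transform("max")

# Maximum seen so far within each day (hypothetical extra column)
df["RunningMax"] = df.groupby(day)["Value"].cummax()

print(df)
```

Both calls are vectorised group operations, so no explicit Python loop over rows is involved.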

Related

datetime hour component to column python pandas

I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it into a format with one row per day and one column per time of day, with the cells filled from the Value column.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried a pivot table, but without success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
Replace None with any string to rename the axis instead of removing its label.
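As a self-contained sketch of the pivot and the axis renaming together, the snippet below uses four of the rows from the question. Formatting the column labels with strftime("%H:%M") and naming the index "Day" are assumptions made here to match the layout the question asks for; the original answer uses dt.time and keeps the Date labels.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    "Date": ["2022-01-01 10:00:00", "2022-01-01 10:30:00",
             "2022-01-01 11:00:00", "2022-02-15 21:00:00"],
    "Value": [7, 5, 3, 8],
})
df["Date"] = pd.to_datetime(df["Date"])

out = df.pivot_table(
    values="Value",
    index=df["Date"].dt.date,                 # one row per calendar day
    columns=df["Date"].dt.strftime("%H:%M"),  # one column per time of day
    fill_value=0,                             # missing day/time pairs become 0
)

# Drop the label above the columns and give the index a name of our choosing
out = out.rename_axis(index="Day", columns=None)
print(out)
```

Note that pivot_table aggregates with the mean by default, so duplicate day/time pairs would be averaged.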

Match datetime YYYY-MM-DD object in pandas dataframe

I have a pandas DataFrame of the form:
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 2014-01-01 00:00:00
I am interested only in the year, month and day in the birth column of the dataframe. I tried converting the column with pandas' datetime support, but it resulted in an error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1054-02-07 00:00:00
The birth column is an object dtype.
My guess is that one of the dates is invalid. I would rather not pass errors="coerce" to to_datetime, because every item is important and I only need the YYYY-MM-DD part.
I also tried a regex through the pandas string methods:
df["birth"].str.find("(\d{4})-(\d{2})-(\d{2})")
But this is returning NANs. How can I resolve this?
Thanks
Because these values cannot all be converted to datetimes, you can split each string on whitespace and keep the first part:
df['birth'] = df['birth'].str.split().str[0]
Then, if necessary, convert to periods, which can represent spans outside the Timestamp bounds (see "Representing out-of-bounds spans" in the pandas documentation).
print (df)
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 0-01-01 00:00:00
def to_per(x):
    splitted = x.split('-')
    return pd.Period(year=int(splitted[0]),
                     month=int(splitted[1]),
                     day=int(splitted[2]), freq='D')
df['birth'] = df['birth'].str.split().str[0].apply(to_per)
print (df)
id amount birth
0 4 78.0 1980-02-02
1 5 24.0 1989-03-03
2 6 49.5 2014-01-01
3 7 34.0 2014-01-01
4 8 49.5 0000-01-01
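On the regex attempt in the question: Series.str.find searches for a literal substring and returns a position, so it does not interpret the pattern as a regular expression. A minimal sketch using Series.str.extract instead, which does take a regex, is shown below. The three sample strings (including the out-of-bounds 1054 date from the error message) and the column names birth_date, year, month and day are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "birth": ["1980-02-02 00:00:00",
              "1054-02-07 00:00:00",   # outside the Timestamp range
              "2014-01-01 00:00:00"],
})

# One capture group: keep the YYYY-MM-DD part as a string
df["birth_date"] = df["birth"].str.extract(r"(\d{4}-\d{2}-\d{2})", expand=False)

# Named groups: split year, month and day into integer columns
parts = (
    df["birth"]
    .str.extract(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
    .astype(int)
)

print(df)
print(parts)
```

Because the values stay strings or plain integers, no datetime conversion happens and the out-of-bounds row is kept.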

Dynamic Dates difference calculation Pandas

customer_id Order_date
1 2015-01-16
1 2015-01-19
2 2014-12-21
2 2015-01-10
1 2015-01-10
3 2018-01-18
3 2017-03-04
4 2019-11-05
4 2010-01-01
3 2019-02-03
Let's say I have data like the above.
For an e-commerce firm, some people buy regularly, some buy once a year, some buy once a month, and so on. I need to find the number of days between consecutive transactions for each customer.
The list will vary in length per customer: some will have transacted a thousand times, some once, some ten times. Any ideas on how to achieve this?
Output needed:
customer_id Order_date_Difference_in_days
1 6,3 # 2015-01-10 to 2015-01-16 is 6 days; 2015-01-16 to 2015-01-19 is 3 days
2 20
3 320,381
4 3595
Basically these are the differences between dates after sorting them first for each customer id
You can also use the below for the current output:
m=(df.assign(Diff=df.sort_values(['customer_id','Order_date'])
.groupby('customer_id')['Order_date'].diff().dt.days).dropna())
m=m.assign(Diff=m['Diff'].astype(str)).groupby('customer_id')['Diff'].agg(','.join)
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
Name: Diff, dtype: object
First, make sure Order_date is a proper datetime by calling df['Order_date'] = pd.to_datetime(df['Order_date']). Then sort the data by customer id and order date, and take the difference to the previous order within each customer:
df.sort_values(['customer_id','Order_date'],inplace=True)
# diff() within each group gives the gap to that customer's previous order
df["days"] = df.groupby("customer_id")["Order_date"].diff().dt.days
print(df)
customer_id Order_date days
4 1 2015-01-10 NaN
0 1 2015-01-16 6.0
1 1 2015-01-19 3.0
2 2 2014-12-21 NaN
3 2 2015-01-10 20.0
6 3 2017-03-04 NaN
5 3 2018-01-18 320.0
9 3 2019-02-03 381.0
8 4 2010-01-01 NaN
7 4 2019-11-05 3595.0
Then you can do a simple agg, but you'll need to convert the values to strings first.
df.dropna().groupby("customer_id")["days"].agg(
    lambda x: ",".join(x.astype(str))
).to_frame()
days
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
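Both answers above print the gaps as floats (6.0,3.0) because the NaN in each customer's first row forces a float column. A minimal sketch that drops those rows before casting to int, so the joined strings match the format requested in the question, could look like this. The column name gap is a placeholder chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 1, 3, 3, 4, 4, 3],
    "Order_date": ["2015-01-16", "2015-01-19", "2014-12-21", "2015-01-10",
                   "2015-01-10", "2018-01-18", "2017-03-04", "2019-11-05",
                   "2010-01-01", "2019-02-03"],
})
df["Order_date"] = pd.to_datetime(df["Order_date"])
df = df.sort_values(["customer_id", "Order_date"])

# Days since the same customer's previous order (NaN for the first order)
gaps = df.groupby("customer_id")["Order_date"].diff().dt.days

result = (
    df.assign(gap=gaps)
      .dropna(subset=["gap"])                                # drop first orders
      .assign(gap=lambda d: d["gap"].astype(int).astype(str))
      .groupby("customer_id")["gap"]
      .agg(",".join)                                         # e.g. "6,3"
)
print(result)
```

Customers with a single order have no gap and therefore do not appear in the result.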

How to define if-else function using dataframe columns as arguments in python?

I need to write a function and then apply it for a dataframe's column in pandas.
My dataframe looks like this. The data is sorted by id and then by period.
period id column1
0 2013-01-31 5 NaT
1 2013-02-28 5 28 days
2 2013-03-31 5 31 days
3 2013-04-30 5 30 days
4 2016-05-31 6 NaT
5 2016-06-30 6 30 days
6 2016-08-31 6 62 days
The new column's values should be defined according to the values in column1:
if column1 is NaT or column1 > 31 days,
then the new column equals the value in the period column.
Otherwise, the value of the new column should be copied from its previous row:
new column, row i = new column, row i-1.
I am very new to python and my code doesn't work:
def f(x):
if not x or x > 31
return x=df['period']
else
return x=x.shift()
df['newcolumn'] = df['column1'].apply(f)
The output should be this:
period id column1 newcolumn
0 2013-01-31 5 NaT 2013-01-31
1 2013-02-28 5 28 days 2013-01-31
2 2013-03-31 5 31 days 2013-01-31
3 2013-04-30 5 30 days 2013-01-31
4 2016-05-31 6 NaT 2016-05-31
5 2016-06-30 6 30 days 2016-05-31
6 2016-08-31 6 62 days 2016-08-31
Any help would be much appreciated.
First, it may be necessary to convert period to datetime using pd.to_datetime:
df['period']=pd.to_datetime(df['period'])
Then you can use Series.where with Series.ffill:
df['newcolumn']=df['period'].where((df["column1"]>pd.Timedelta("31 days"))|(df["column1"].isnull())).ffill()
print(df)
period id column1 newcolumn
0 2013-01-31 5 NaT 2013-01-31
1 2013-02-28 5 28 days 2013-01-31
2 2013-03-31 5 31 days 2013-01-31
3 2013-04-30 5 30 days 2013-01-31
4 2016-05-31 6 NaT 2016-05-31
5 2016-06-30 6 30 days 2016-05-31
6 2016-08-31 6 62 days 2016-08-31
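You can use Series.where(cond, other), which keeps the original value where the condition holds and substitutes other elsewhere. Passing df["column1"].shift() as other would fill in timedeltas from column1 rather than the previous row of the new column, so leave other at its default (missing) and forward-fill instead:
cond = df["column1"].isnull() | (df["column1"] > pd.Timedelta("31D"))
df["newcolumn"] = df["period"].where(cond).ffill()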

python pandas series loc value from multi index

I have a series that looks like this
2014 7 2014-07-01 -0.045417
8 2014-08-01 -0.035876
9 2014-09-02 -0.030971
10 2014-10-01 -0.027471
11 2014-11-03 -0.032968
12 2014-12-01 -0.031110
2015 1 2015-01-02 -0.028906
2 2015-02-02 -0.035563
3 2015-03-02 -0.040338
4 2015-04-01 -0.032770
5 2015-05-01 -0.025762
6 2015-06-01 -0.019746
7 2015-07-01 -0.018541
8 2015-08-03 -0.028101
9 2015-09-01 -0.043237
10 2015-10-01 -0.053565
11 2015-11-02 -0.062630
12 2015-12-01 -0.064618
2016 1 2016-01-04 -0.064852
I want to be able to get the value from a date. Something like:
myseries.loc('2015-10-01') and it returns -0.053565
The index entries are tuples of the form (2016, 1, 2016-01-04).
You can do it like this:
In [32]:
df.loc(axis=0)[:,:,'2015-10-01']
Out[32]:
value
year month date
2015 10 2015-10-01 -0.053565
You can also pass slice for each level:
In [39]:
df.loc[(slice(None),slice(None),'2015-10-01'), :]
Out[39]:
value
year month date
2015 10 2015-10-01 -0.053565
Or just pass the first 2 index levels:
In [40]:
df.loc[2015,10]
Out[40]:
value
date
2015-10-01 -0.053565
Try xs:
print(s.xs('2015-10-01',level=2,axis=0))
#year datetime
#2015 10 -0.053565
#Name: series, dtype: float64
print(s.xs(7,level=1,axis=0))
#year datetime
#2014 2014-07-01 -0.045417
#2015 2015-07-01 -0.018541
#Name: series, dtype: float64
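To get the single scalar the question asks for, rather than a one-row Series, the xs result can be reduced with .iloc[0]. The sketch below builds a small three-row Series with the same (year, month, date) index layout. The level names, the series name and the use of an explicit pd.Timestamp key are assumptions made for this example.

```python
import pandas as pd

dates = pd.to_datetime(["2015-09-01", "2015-10-01", "2015-11-02"])
index = pd.MultiIndex.from_arrays(
    [dates.year, dates.month, dates],
    names=["year", "month", "date"],
)
s = pd.Series([-0.043237, -0.053565, -0.062630], index=index, name="series")

# Cross-section on the date level; the result is indexed by (year, month)
match = s.xs(pd.Timestamp("2015-10-01"), level="date")
value = match.iloc[0]

# Equivalent lookup with a boolean mask on the same level
mask = s.index.get_level_values("date") == pd.Timestamp("2015-10-01")
value_via_mask = s[mask].iloc[0]

print(value, value_via_mask)
```

If the date could be missing, check that match is non-empty before calling .iloc[0].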
