Selecting dates with .gt() - python

I am trying to select the dates where the percentage change is over 1%. To do this my code is as follows:
df1 has 109 rows × 6 columns
`df1['Close'].pct_change().gt(0.01).index` produces:
DatetimeIndex(['2020-12-31', '2021-01-04', '2021-01-05', '2021-01-06',
'2021-01-07', '2021-01-08', '2021-01-11', '2021-01-12',
'2021-01-13', '2021-01-14',
...
'2021-05-25', '2021-05-26', '2021-05-27', '2021-05-28',
'2021-06-01', '2021-06-02', '2021-06-03', '2021-06-04',
'2021-06-07', '2021-06-08'],
dtype='datetime64[ns]', name='Date', length=109, freq=None)
This is not right: there are very few dates with a change over 1%, yet I am still getting the same length of 109 that I would get without .gt(). Could you please advise why it is showing all the dates?

Calling .index on the boolean Series produced by .gt() returns the full index, regardless of the True/False values, which is why the length stays 109. Select only the True rows with boolean indexing:
df1.loc[df1["Close"].pct_change().gt(1)].index
>>> df1
Open Close
Date
2021-01-01 5 7
2021-01-02 1 3
2021-01-03 1 2
2021-01-04 10 6
2021-01-05 5 10
2021-01-06 6 9
2021-01-07 8 1
2021-01-08 1 3
2021-01-09 10 5
2021-01-10 7 3
>>> df1.loc[df1["Close"].pct_change().gt(1)].index
DatetimeIndex(['2021-01-04', '2021-01-08'], dtype='datetime64[ns]', name='Date', freq=None)
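The distinction can be seen on a minimal sketch (the prices below are invented; only the `df1`/`Close` names mirror the question). `.gt(0.01)` yields a boolean Series whose index is still the full DatetimeIndex, so `.index` alone never filters anything; `.loc` with the mask does.

```python
import pandas as pd

# Toy frame standing in for the question's df1 (values made up)
idx = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'])
df1 = pd.DataFrame({'Close': [100.0, 100.5, 102.0, 102.1]}, index=idx)

mask = df1['Close'].pct_change().gt(0.01)  # boolean Series, one entry per row
print(len(mask.index))                     # 4: .index keeps every date, True or False
print(df1.loc[mask].index)                 # only the dates where the change exceeds 1%
```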


Dataframe - Datetime, get cumulated sum of previous day

I have a dataframe with the following columns:
datetime: HH:MM:SS (not continuous, there are some missing days)
date: ['datetime'].dt.date
X = various values
X_daily_cum = df.groupby(['date']).X.cumsum()
So X_daily_cum is the cumulative sum of X grouped per day; it resets every day.
Code to reproduce:
import pandas as pd

df = pd.DataFrame([['2021-01-01 10:10', 3],
                   ['2021-01-03 13:33', 7],
                   ['2021-01-03 14:44', 6],
                   ['2021-01-07 17:17', 2],
                   ['2021-01-07 07:07', 4],
                   ['2021-01-07 01:07', 9],
                   ['2021-01-09 09:09', 3]],
                  columns=['datetime', 'X'])
# note: %M:%S parses '10:10' as minutes:seconds, hence the 00:10:10-style timestamps below
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %M:%S')
df['date'] = df['datetime'].dt.date
df['X_daily_cum'] = df.groupby(['date']).X.cumsum()
print(df)
Now I would like a new column that takes for value the cumulated sum of previous available day, like that:
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 00:10:10 3 2021-01-01 3 NaN
1 2021-01-03 00:13:33 7 2021-01-03 7 3
2 2021-01-03 00:14:44 6 2021-01-03 13 3
3 2021-01-07 00:17:17 2 2021-01-07 2 13
4 2021-01-07 00:07:07 4 2021-01-07 6 13
5 2021-01-07 00:01:07 9 2021-01-07 15 13
6 2021-01-09 00:09:09 3 2021-01-09 3 15
Is there a clean way to do it with pandas, perhaps with an apply?
I have managed to do it in an ugly way: copying the df, removing the datetime granularity, selecting the last record of each date, and joining that new df back onto the original. I would like a more elegant solution.
Thanks for the help
Use Series.mask with Series.duplicated to set every value except the last one per date to missing, then shift the values and forward-fill:
df['last_day_cum_value'] = (df['X_daily_cum'].mask(df['date'].duplicated(keep='last'))
                                             .shift()
                                             .ffill())
print (df)
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 00:10:10 3 2021-01-01 3 NaN
1 2021-01-03 00:13:33 7 2021-01-03 7 3.0
2 2021-01-03 00:14:44 6 2021-01-03 13 3.0
3 2021-01-07 00:17:17 2 2021-01-07 2 13.0
4 2021-01-07 00:07:07 4 2021-01-07 6 13.0
5 2021-01-07 00:01:07 9 2021-01-07 15 13.0
6 2021-01-09 00:09:09 3 2021-01-09 3 15.0
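The intermediate steps of that chain can be inspected on a cut-down frame (values invented, column names as above): duplicated(keep='last') flags every row that is not its date's last row, and mask blanks those out before the shift/ffill.

```python
import pandas as pd

# Toy frame with the same column names (values invented for illustration)
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-03', '2021-01-03', '2021-01-07'],
                   'X_daily_cum': [3, 7, 13, 2]})

dup = df['date'].duplicated(keep='last')  # True for every non-final row of a date
masked = df['X_daily_cum'].mask(dup)      # keep only each date's last cumulative value
result = masked.shift().ffill()           # previous day's final value, filled down
print(dup.tolist())                       # [False, True, False, False]
print(result.tolist())
```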
Old solution:
Use DataFrame.drop_duplicates keeping the last row per date, then Series.shift to get each previous date's value, and finally Series.map to build the new column:
s = df.drop_duplicates('date', keep='last').set_index('date')['X_daily_cum'].shift()
print (s)
date
2021-01-01 NaN
2021-01-03 3.0
2021-01-07 13.0
2021-01-09 15.0
Name: X_daily_cum, dtype: float64
df['last_day_cum_value'] = df['date'].map(s)
print (df)
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 00:10:10 3 2021-01-01 3 NaN
1 2021-01-03 00:13:33 7 2021-01-03 7 3.0
2 2021-01-03 00:14:44 6 2021-01-03 13 3.0
3 2021-01-07 00:17:17 2 2021-01-07 2 13.0
4 2021-01-07 00:07:07 4 2021-01-07 6 13.0
5 2021-01-07 00:01:07 9 2021-01-07 15 13.0
6 2021-01-09 00:09:09 3 2021-01-09 3 15.0

Cumulative metric in continuous date range based on non-continuous date changes

On specific dates, a metric starting at 0 increases by a value. Given a set of non-continuous dates and values, is it possible to produce a column with the metric?
Input - metric changes per day
date value
02-03-2022 00:00:00 10
03-03-2022 00:00:00 0
06-03-2022 00:00:00 2
10-03-2022 00:00:00 18
Output - metric calculated for continuous range of days (starting value = 0 unless change applies already on first day)
0 metric
0 2022-02-28 0
1 2022-03-01 0
2 2022-03-02 10
3 2022-03-03 10
4 2022-03-04 10
5 2022-03-05 10
6 2022-03-06 12
7 2022-03-07 12
8 2022-03-08 12
9 2022-03-09 12
10 2022-03-10 30
11 2022-03-11 30
12 2022-03-12 30
13 2022-03-13 30
Code example
import pandas as pd

df = pd.DataFrame({'date': ['02-03-2022 00:00:00',
                            '03-03-2022 00:00:00',
                            '06-03-2022 00:00:00',
                            '10-03-2022 00:00:00'],
                   'value': [10, 0, 2, 18]},
                  index=[0, 1, 2, 3])
df2 = pd.DataFrame(pd.date_range(start='28-02-2022', end='13-03-2022'))
df2['metric'] = 0  # TODO
Map the values from df onto df2 by date, fill the missing values with 0, and then take the cumulative sum:
df['date'] = pd.to_datetime(df.date, format='%d-%m-%Y %H:%M:%S')
df2['metric'] = df2[0].map(df.set_index('date')['value']).fillna(0).cumsum()
df2
0 metric
0 2022-02-28 0.0
1 2022-03-01 0.0
2 2022-03-02 10.0
3 2022-03-03 10.0
4 2022-03-04 10.0
5 2022-03-05 10.0
6 2022-03-06 12.0
7 2022-03-07 12.0
8 2022-03-08 12.0
9 2022-03-09 12.0
10 2022-03-10 30.0
11 2022-03-11 30.0
12 2022-03-12 30.0
13 2022-03-13 30.0
df.reindex is useful for this: reindex onto the full date range, then apply fillna(0) and cumsum.
import pandas as pd

df = pd.DataFrame({'date': ['02-03-2022 00:00:00',
                            '03-03-2022 00:00:00',
                            '06-03-2022 00:00:00',
                            '10-03-2022 00:00:00'],
                   'value': [10, 0, 2, 18]},
                  index=[0, 1, 2, 3])
df['date'] = pd.to_datetime(df.date, format='%d-%m-%Y %H:%M:%S')
res = (df.set_index('date')
         .reindex(pd.date_range(start='2022-02-28', end='2022-03-13'))
         .fillna(0)
         .cumsum()
         .reset_index()
         .rename(columns={'index': 'date', 'value': 'metric'}))
print(res)
date metric
0 2022-02-28 0.0
1 2022-03-01 0.0
2 2022-03-02 10.0
3 2022-03-03 10.0
4 2022-03-04 10.0
5 2022-03-05 10.0
6 2022-03-06 12.0
7 2022-03-07 12.0
8 2022-03-08 12.0
9 2022-03-09 12.0
10 2022-03-10 30.0
11 2022-03-11 30.0
12 2022-03-12 30.0
13 2022-03-13 30.0

Count number of occurences in past 14 days of certain value

I have a pandas dataframe with a date column and an id column. I would like to return the number of occurrences of the id of each line in the 14 days prior to the corresponding date of that line. That means I would like to get "1, 2, 1, 2, 3, 4, 1". How can I do this? Performance is important, since the dataframe has a length of about 200,000 rows. Thanks!
date        id
2021-01-01  1
2021-01-04  1
2021-01-05  2
2021-01-06  2
2021-01-07  1
2021-01-08  1
2021-01-28  1
Assuming the input is sorted by date, you can use a GroupBy.rolling approach:
# only required if date is not datetime type
df['date'] = pd.to_datetime(df['date'])
(df.assign(count=1)
   .set_index('date')
   .groupby('id')
   .rolling('14d')['count'].sum()
   .sort_index(level='date')
   .reset_index()  # optional if order is not important
)
output:
id date count
0 1 2021-01-01 1.0
1 1 2021-01-04 2.0
2 2 2021-01-05 1.0
3 2 2021-01-06 2.0
4 1 2021-01-07 3.0
5 1 2021-01-08 4.0
6 1 2021-01-28 1.0
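For reference, here is the same approach as a self-contained script, with the question's dates and ids transcribed from the table:

```python
import pandas as pd

# Data transcribed from the question's table
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-04', '2021-01-05', '2021-01-06',
                            '2021-01-07', '2021-01-08', '2021-01-28'],
                   'id': [1, 1, 2, 2, 1, 1, 1]})
df['date'] = pd.to_datetime(df['date'])

# Count rows per id inside a trailing 14-day window ending at each row's date
out = (df.assign(count=1)
         .set_index('date')
         .groupby('id')
         .rolling('14d')['count'].sum()
         .sort_index(level='date')
         .reset_index())
print(out['count'].tolist())  # [1.0, 2.0, 1.0, 2.0, 3.0, 4.0, 1.0]
```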
I am not sure whether this is the best idea or not, but the code below is what I have come up with:
from datetime import timedelta
df["date"] = pd.to_datetime(df["date"])
newColumn = []
for index, row in df.iterrows():
    endDate = row["date"]
    startDate = endDate - timedelta(days=14)
    id = row["id"]
    summation = df[(df["date"] >= startDate) & (df["date"] <= endDate) & (df["id"] == id)]["id"].count()
    newColumn.append(summation)
df["check_column"] = newColumn
df
Output

   date                 id  check_column
0  2021-01-01 00:00:00  1   1
1  2021-01-04 00:00:00  1   2
2  2021-01-05 00:00:00  2   1
3  2021-01-06 00:00:00  2   2
4  2021-01-07 00:00:00  1   3
5  2021-01-08 00:00:00  1   4
6  2021-01-28 00:00:00  1   1
Explanation
In this approach, I have used iterrows in order to loop over the dataframe's rows. Additionally, I have used timedelta in order to subtract 14 days from the date column.

Resample a Python Series or Dataframe at Dates Given

Given a Pandas dataframe or series, I would like to resample it at specific points in time.
This might mean dropping values or adding new values by forward filling previous ones.
Example
Given the Series X defined by
import pandas
rng_X = pandas.to_datetime(
    ['2021-01-01', '2021-01-02', '2021-01-07', '2021-01-08', '2021-02-01'])
X = pandas.Series([0, 2, 4, 6, 8], rng_X)
X
2021-01-01 0
2021-01-02 2
2021-01-07 4
2021-01-08 6
2021-02-01 8
Resample X at dates
rng_Y = pandas.to_datetime(
    ['2021-01-02', '2021-01-03', '2021-01-07', '2021-01-08', '2021-01-09', '2021-01-10'])
The expected output is
2021-01-02 2
2021-01-03 2
2021-01-07 4
2021-01-08 6
2021-01-09 6
2021-01-10 6
2021-01-01 is dropped from the output since it isn't in rng_Y.
2021-01-03 is added to the output with its value copied forward from 2021-01-02, since it does not exist in X.
2021-01-09 and 2021-01-10 are also added to the output with values copied from 2021-01-08.
2021-02-01 is dropped from the output since it does not exist in rng_Y.
Try reindex with method set to 'ffill':
X = X.reindex(rng_Y, method='ffill')
X:
2021-01-02 2
2021-01-03 2
2021-01-07 4
2021-01-08 6
2021-01-09 6
2021-01-10 6
dtype: int32
Complete Code:
import pandas as pd
rng_X = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-07', '2021-01-08',
                        '2021-02-01'])
rng_Y = pd.to_datetime(['2021-01-02', '2021-01-03', '2021-01-07', '2021-01-08',
                        '2021-01-09', '2021-01-10'])
X = pd.Series([0, 2, 4, 6, 8], rng_X)
X = X.reindex(rng_Y, method='ffill')
print(X)
If X was a DataFrame (df) instead of a Series:
df = pd.DataFrame([0, 2, 4, 6, 8], index=rng_X, columns=['X'])
df = df.reindex(rng_Y, method='ffill')
df:
X
2021-01-02 2
2021-01-03 2
2021-01-07 4
2021-01-08 6
2021-01-09 6
2021-01-10 6
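As a possible alternative (not from the original answer), Series.asof accepts an array of dates and returns the last observation at or before each one, which produces the same forward-filled values here:

```python
import pandas as pd

rng_X = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-07', '2021-01-08',
                        '2021-02-01'])
rng_Y = pd.to_datetime(['2021-01-02', '2021-01-03', '2021-01-07', '2021-01-08',
                        '2021-01-09', '2021-01-10'])
X = pd.Series([0, 2, 4, 6, 8], rng_X)

# last value at or before each requested date; dates outside rng_Y are dropped
resampled = X.asof(rng_Y)
print(resampled)
```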

aggregate by week of daily column

Hello guys, I want to aggregate (sum) this by week. Note that the date column is per day.
Does anyone know how to do it?
Thank you kindly,
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02',
                            '2021-01-03', '2021-01-04', '2021-01-05',
                            '2021-01-06', '2021-01-07', '2021-01-08',
                            '2021-01-09'],
                   'revenue': [5, 3, 2,
                               10, 12, 2,
                               1, 0, 6]})
df
date revenue
0 2021-01-01 5
1 2021-01-02 3
2 2021-01-03 2
3 2021-01-04 10
4 2021-01-05 12
5 2021-01-06 2
6 2021-01-07 1
7 2021-01-08 0
8 2021-01-09 6
Expected output:
2021-01-04    31
Use DataFrame.resample with an aggregate sum, but it is necessary to change the default closed side to left, together with the label parameter:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W-Mon', on='date', closed='left', label='left').sum()
print (df1)
revenue
date
2020-12-28 10
2021-01-04 31
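The same Monday-anchored weekly bins can also be produced with groupby plus pd.Grouper, an equivalent spelling not taken from the answer above:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03',
                                           '2021-01-04', '2021-01-05', '2021-01-06',
                                           '2021-01-07', '2021-01-08', '2021-01-09']),
                   'revenue': [5, 3, 2, 10, 12, 2, 1, 0, 6]})

# Grouper mirrors resample's closed/label handling for Monday-anchored weeks
df1 = df.groupby(pd.Grouper(key='date', freq='W-MON', closed='left', label='left')).sum()
print(df1)
```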
