Create a counter of date values for a given max-min interval - python

Given the following python pandas DataFrame:
| date | column_1 | column_2 |
| ---------- | -------- | -------- |
| 2022-02-01 | val | val2 |
| 2022-02-03 | val1 | val |
| 2022-02-01 | val | val3 |
| 2022-02-04 | val2 | val |
| 2022-02-27 | val2 | val4 |
I want to create a new DataFrame with one row for each date between the minimum and maximum dates of the original DataFrame. The counter column contains the number of rows with that date.
| date | counter |
| ---------- | -------- |
| 2022-02-01 | 2 |
| 2022-02-02 | 0 |
| 2022-02-03 | 1 |
| 2022-02-04 | 1 |
| 2022-02-05 | 0 |
...
| 2022-02-26 | 0 |
| 2022-02-27 | 1 |

Count dates first & remove duplicates using Drop duplicates. Fill intermidiate dates with Pandas has asfreq function for datetimeIndex, this is basically just a thin, but convenient wrapper around reindex() which generates a date_range and calls reindex.
import pandas as pd

# Count rows per date, then keep a single row per date
df['counts'] = df['date'].map(df['date'].value_counts())
df = df.drop_duplicates(subset='date', keep="first")
# Reindex to daily frequency between the min and max date; dates with no rows get NaN, filled with 0
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('D').reset_index()
df = df.fillna(0)
print(df)
Gives:
date counts
0 2022-02-01 2.0
1 2022-02-02 0.0
2 2022-02-03 1.0
3 2022-02-04 1.0
4 2022-02-05 0.0
5 2022-02-06 0.0
6 2022-02-07 0.0
7 2022-02-08 0.0
8 2022-02-09 0.0
9 2022-02-10 0.0
10 2022-02-11 0.0
11 2022-02-12 0.0
12 2022-02-13 0.0
13 2022-02-14 0.0
14 2022-02-15 0.0
15 2022-02-16 0.0
16 2022-02-17 0.0
17 2022-02-18 0.0
18 2022-02-19 0.0
19 2022-02-20 0.0
20 2022-02-21 0.0
21 2022-02-22 0.0
22 2022-02-23 0.0
23 2022-02-24 0.0
24 2022-02-25 0.0
25 2022-02-26 0.0
26 2022-02-27 1.0
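Since asfreq('D') is just a convenience wrapper around reindex(), the same result can also be written explicitly with date_range and reindex. This is only a sketch of that equivalent route, assuming df still holds the original question's data with a 'date' column:
import pandas as pd

dates = pd.to_datetime(df['date'])
# Count rows per date, then reindex against the full daily range,
# filling dates that never occur with 0
counter = dates.value_counts()
full_range = pd.date_range(dates.min(), dates.max(), freq='D')
res = (counter.reindex(full_range, fill_value=0)
              .rename_axis('date')
              .reset_index(name='counter'))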

Many ways to do this. Here is mine. Probably not optimal, but at least I am not iterating rows nor using .apply, which are both sure recipes for slow solutions.
import pandas as pd
import numpy as np
import datetime
# A minimal example (you should provide such an example next time)
df=pd.DataFrame({'date':pd.to_datetime(['2022-02-01', '2022-02-03', '2022-02-01', '2022-02-04', '2022-02-27']), 'c1':['val','val1','val','val2','val2'], 'c2':range(5)})
# A delta of 1 day, to build the list of dates
dt=datetime.timedelta(days=1)
# Result dataframe, with a count of 0 for now
res=pd.DataFrame({'date':df.date.min()+dt*np.arange((df.date.max()-df.date.min()).days+1), 'count':0})
# Count rows per date
countDates=df[['date', 'c1']].groupby('date').agg('count')
# Merge the counted dates with the target array, filling missing values with 0
res['count']=res.merge(countDates, on='date', how='left').fillna(0)['c1']
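For a quick sanity check, printing the first rows of res should look roughly like the expected output from the question (the counts come out as floats because of the fillna on the merged column):
print(res.head())
#         date  count
# 0 2022-02-01    2.0
# 1 2022-02-02    0.0
# 2 2022-02-03    1.0
# 3 2022-02-04    1.0
# 4 2022-02-05    0.0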

Related

Filling Missing Date Column using groupby method

I have a dataframe that looks something like:
+---+----+---------------+------------+------------+
| | id | date1 | date2 | days_ahead |
+---+----+---------------+------------+------------+
| 0 | 1 | 2021-10-21 | 2021-10-24 | 3 |
| 1 | 1 | 2021-10-22 | NaN | NaN |
| 2 | 1 | 2021-11-16 | 2021-11-24 | 8 |
| 3 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 4 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 5 | 3 | 2021-10-26 | 2021-10-31 | 5 |
| 6 | 3 | 2021-10-30 | 2021-11-04 | 5 |
| 7 | 3 | 2021-11-02 | NaN | NaN |
| 8 | 3 | 2021-11-04 | 2021-11-04 | 0 |
| 9 | 4 | 2021-10-28 | NaN | NaN |
+---+----+---------------+------------+------------+
I am trying to fill the missing data with the days_ahead median of each id group,
For example:
Median of id 1 = 5.5 which rounds to 6
filled value of date2 at index 1 should be 2021-10-28
Similarly, for id 3 Median = 5
filled value of date2 at index 7 should be 2021-11-07
And,
for id 4 Median = NaN
filled value of date2 at index 9 should be 2021-10-28
I Tried
df['date2'].fillna(df.groupby('id')['days_ahead'].transform('median'), inplace = True)
But this fills date2 with the numeric medians rather than dates.
Although I could use lambda and apply to detect the numbers and turn them into dates, how do I use groupby and fillna together directly?
You can round the medians, convert them to timedeltas with to_timedelta, add them to date1 using the fill_value parameter, and use the result to replace the missing values:
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
td = pd.to_timedelta(df.groupby('id')['days_ahead'].transform('median').round(), unit='d')
df['date2'] = df['date2'].fillna(df['date1'].add(td, fill_value=pd.Timedelta(0)))
print (df)
id date1 date2 days_ahead
0 1 2021-10-21 2021-10-24 3.0
1 1 2021-10-22 2021-10-28 NaN
2 1 2021-11-16 2021-11-24 8.0
3 2 2021-10-22 2021-10-24 2.0
4 2 2021-10-22 2021-10-24 2.0
5 3 2021-10-26 2021-10-31 5.0
6 3 2021-10-30 2021-11-04 5.0
7 3 2021-11-02 2021-11-07 NaN
8 3 2021-11-04 2021-11-04 0.0
9 4 2021-10-28 2021-10-28 NaN
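A minimal standalone sketch (hypothetical data, not the question's frame) of why the fill_value parameter matters: when a group's median is NaN, the timedelta becomes NaT, and fill_value=pd.Timedelta(0) makes the addition fall back to date1 itself, which is exactly what index 9 (id 4) needs:
import pandas as pd

date1 = pd.Series(pd.to_datetime(['2021-10-22', '2021-10-28']))
td = pd.to_timedelta(pd.Series([6.0, float('nan')]), unit='d')  # NaN median -> NaT

# Without fill_value the second result would be NaT; with it, date1 is kept as-is
print(date1.add(td, fill_value=pd.Timedelta(0)))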

How can I compute the most recent value from a column in a second dataset for each individual?

I have a pandas dataframe values that looks like:
person | date | value
-------|------------|------
A | 01-01-2020 | 1
A | 01-08-2020 | 2
A | 01-12-2020 | 3
B | 01-02-2020 | 4
B | 01-05-2020 | 5
B | 01-06-2020 | 6
And another dataframe encounters that looks like:
person | date
-------|------------
A | 01-01-2020
A | 01-03-2020
A | 01-06-2020
A | 01-11-2020
A | 01-12-2020
A | 01-15-2020
B | 01-01-2020
B | 01-04-2020
B | 01-06-2020
B | 01-08-2020
B | 01-09-2020
B | 01-10-2020
What I'd like to end up with is a merged dataframe that adds a third column to the encounters dataset with the most recent value of value for the corresponding person (shown below). Is there a straightforward way to do this in pandas?
person | date | most_recent_value
-------|------------|-------------------
A | 01-01-2020 | 1
A | 01-03-2020 | 1
A | 01-06-2020 | 1
A | 01-11-2020 | 2
A | 01-12-2020 | 3
A | 01-15-2020 | 3
B | 01-01-2020 | None
B | 01-04-2020 | 4
B | 01-06-2020 | 6
B | 01-08-2020 | 6
B | 01-09-2020 | 6
B | 01-10-2020 | 6
This is essentially merge_asof:
import numpy as np
import pandas as pd

values['date'] = pd.to_datetime(values['date'])
encounters['date'] = pd.to_datetime(encounters['date'])

# The temporary 'rank' column preserves the original row order of `encounters`,
# since merge_asof requires both frames to be sorted on the 'on' key
(pd.merge_asof(encounters.assign(rank=np.arange(encounters.shape[0]))
                         .sort_values('date'),
               values.sort_values('date'),
               by='person', on='date')
   .sort_values('rank')
   .drop('rank', axis=1)
)
Output:
person date value
0 A 2020-01-01 1.0
2 A 2020-01-03 1.0
4 A 2020-01-06 1.0
9 A 2020-01-11 2.0
10 A 2020-01-12 3.0
11 A 2020-01-15 3.0
1 B 2020-01-01 NaN
3 B 2020-01-04 4.0
5 B 2020-01-06 6.0
6 B 2020-01-08 6.0
7 B 2020-01-09 6.0
8 B 2020-01-10 6.0
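If preserving the original row order of encounters does not matter, the rank bookkeeping can be dropped, since merge_asof only requires both frames to be sorted on the 'on' key. A simplified variant under that assumption:
out = pd.merge_asof(encounters.sort_values('date'),
                    values.sort_values('date'),
                    by='person', on='date')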

Pandas Running Total by Group, Time Interval

I would like to take a table of customer orders like this:
customer_id | order_date | amount
0 | 2020-03-01 | 10.00
0 | 2020-03-02 | 2.00
1 | 2020-03-02 | 5.00
1 | 2020-03-02 | 1.00
2 | 2020-03-08 | 2.00
1 | 2020-03-09 | 1.00
0 | 2020-03-10 | 1.00
And create a table calculating a running total by week. Something like:
order_week | 0 | 1 | 2
2020-03-01 | 12.00 | 6.00 | 0.00
2020-03-08 | 13.00 | 7.00 | 2.00
Thanks so much for your help!!
IIUC:
df['order_date'] = pd.to_datetime(df['order_date'])
(df.groupby(['customer_id',df.order_date.dt.floor('7D')])
.amount.sum()
.unstack('customer_id',fill_value=0)
.cumsum()
)
Output:
customer_id 0 1 2
order_date
2020-02-27 12.0 6.0 0.0
2020-03-05 13.0 7.0 2.0
@Quang Hoang's answer is beautiful and concise, but did you want strictly 7-day bins or calendar weeks?
I had a go at partitioning by calendar week because I wanted the dates stated in your expected outcome to appear. Obviously @Quang Hoang's experience is unmatched. Feel free to criticize, because I am learning.
Coerce date to datetime and set it as the index:
df['order_date'] = pd.to_datetime(df['order_date'])
df.set_index(df['order_date'], inplace=True)
df.drop(columns=['order_date'], inplace=True)
Group by customer id and resample the amount weekly:
df.groupby('customer_id')['amount'].apply(lambda x:x.resample('W').sum()).unstack('customer_id',fill_value=0).cumsum()
Outcome

Calculate streak in pandas without apply

I have a DataFrame like this:
date | type | column1
----------------------------
2019-01-01 | A | 1
2019-02-01 | A | 1
2019-03-01 | A | 1
2019-04-01 | A | 0
2019-05-01 | A | 1
2019-06-01 | A | 1
2019-07-01 | B | 1
2019-08-01 | B | 1
2019-09-01 | B | 0
I want to have a column called "streak" that has a streak, but grouped by column "type":
date | type | column1 | streak
-------------------------------------
2019-01-01 | A | 1 | 1
2019-02-01 | A | 1 | 2
2019-03-01 | A | 1 | 3
2019-04-01 | A | 0 | 0
2019-05-01 | A | 1 | 1
2019-06-01 | A | 1 | 2
2019-07-01 | B | 1 | 1
2019-08-01 | B | 1 | 2
2019-09-01 | B | 0 | 0
I managed to do it like that:
def streak(df):
    grouper = (df.column1 != df.column1.shift(1)).cumsum()
    df['streak'] = df.groupby(grouper).cumsum()['column1']
    return df
df = df.groupby(['type']).apply(streak)
But I'm wondering if it's possible to do it inline without using a groupby and apply, because my DataFrame contains about 100M rows and it takes several hours to process.
Any ideas on how to optimize this for speed?
You want the cumsum of 'column1', grouping by 'type' plus the cumsum of a Boolean Series which resets the group at every 0.
df['streak'] = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()
date type column1 streak
0 2019-01-01 A 1 1
1 2019-02-01 A 1 2
2 2019-03-01 A 1 3
3 2019-04-01 A 0 0
4 2019-05-01 A 1 1
5 2019-06-01 A 1 2
6 2019-07-01 B 1 1
7 2019-08-01 B 1 2
8 2019-09-01 B 0 0
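To see why the streak resets, it helps to look at the intermediate grouping key on the example data; every 0 in column1 starts a new group, so the cumulative sum restarts there (an illustration only, not part of the answer's code):
key = df.column1.eq(0).cumsum()
# column1: 1  1  1  0  1  1  1  1  0
# key:     0  0  0  1  1  1  1  1  2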
IIUC, this is what you need.
m = df.column1.ne(df.column1.shift()).cumsum()
df['streak'] = df.groupby([m, 'type'])['column1'].cumsum()
Output
date type column1 streak
0 1/1/2019 A 1 1
1 2/1/2019 A 1 2
2 3/1/2019 A 1 3
3 4/1/2019 A 0 0
4 5/1/2019 A 1 1
5 6/1/2019 A 1 2
6 7/1/2019 B 1 1
7 8/1/2019 B 1 2
8 9/1/2019 B 0 0
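The difference from the previous answer's grouping key is that ne(shift()).cumsum() starts a new group at every change of value, not only at zeros; on the example data it looks like this (an illustration only):
m = df.column1.ne(df.column1.shift()).cumsum()
# column1: 1  1  1  0  1  1  1  1  0
# m:       1  1  1  2  3  3  3  3  4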

Create new columns based on other columns' values

I'm trying to do some feature engineering for a pandas data frame.
Say I have this:
Data frame 1:
X | date | is_holiday
a | 1/4/2018 | 0
a | 1/5/2018 | 0
a | 1/6/2018 | 1
a | 1/7/2018 | 0
a | 1/8/2018 | 0
...
b | 1/1/2018 | 1
I'd like to have an additional indicator for some dates, to indicate if the date is before 1 and 2 days from a holiday, and also 1 and 2 days after.
Data frame 1:
X | date | is_holiday | one_day_before_hol | ... | one_day_after_hol
a | 1/4/2018 | 0 | 0 | ... | 0
a | 1/5/2018 | 0 | 1 | ... | 0
a | 1/6/2018 | 1 | 0 | ... | 0
a | 1/7/2018 | 0 | 0 | ... | 1
a | 1/8/2018 | 0 | 0 | ... | 0
...
b | 1/1/2018 | 1 | 0 | ... | 0
Is there any efficient way to do it? I believe I could do it with for loops, but since I'm new to python, I'd like to see if there is an elegant way to do it. Dates might not be adjacent or continuous (i.e. for some values of the X column, a specific date might not be present).
Thank you so much!
Use pandas.DataFrame.groupby.shift:
import pandas as pd
g = df.groupby('X')['is_holiday']
df['one_day_before'] = g.shift(-1).fillna(0)
df['two_day_before'] = g.shift(-2).fillna(0)
df['one_day_after'] = g.shift(1).fillna(0)
Output:
X date is_holiday one_day_before two_day_before one_day_after
0 a 1/4/2018 0 0.0 1.0 0.0
1 a 1/5/2018 0 1.0 0.0 0.0
2 a 1/6/2018 1 0.0 0.0 0.0
3 a 1/7/2018 0 0.0 0.0 1.0
4 a 1/8/2018 0 0.0 0.0 0.0
5 b 1/1/2018 1 0.0 0.0 0.0
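The question also asks about two days after a holiday; with the same approach that would presumably just be one more shift (a sketch extending the code above, not part of the original answer):
df['two_day_after'] = g.shift(2).fillna(0)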
You could shift:
import pandas as pd
df = pd.DataFrame([1,0,0,1,1,0], columns=['day'])
df.head()
day
0 1
1 0
2 0
3 1
4 1
df['One Day Before'] = df['day'].shift(-1)
df['One Day After'] = df['day'].shift(1)
df['Two Days before'] = df['day'].shift(-2)
df
day One Day Before One Day After Two Days before
0 1 0.0 NaN 0.0
1 0 0.0 1.0 1.0
2 0 1.0 0.0 1.0
3 1 1.0 0.0 0.0
4 1 0.0 1.0 NaN
5 0 NaN 1.0 NaN
This moves is_holiday up or down into a new column. You will have to deal with the NaNs though.
