Mark every Nth row per group using pandas - python

I have a DataFrame of customer info with their purchase details. I am trying to add a new column that marks every 3rd purchase made by the same customer.
Given below is the DataFrame:
customer_name,bill_no,date
Mark,101,2018-10-01
Scott,102,2018-10-01
Pete,103,2018-10-02
Mark,104,2018-10-02
Mark,105,2018-10-04
Scott,106,2018-10-21
Julie,107,2018-10-03
Kevin,108,2018-10-07
Steve,109,2018-10-02
Mark,110,2018-10-06
Mark,111,2018-10-02
Mark,112,2018-10-05
Mark,113,2018-10-05
I want to flag every 3rd purchase made by the same customer. So in this case, I would like to add a flag for the bill_no values below:
Mark,105,2018-10-04
Mark,112,2018-10-05
Basically every multiple of 3 bill generated for the same customer.
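For reference, one way to rebuild this sample frame so the snippets below run as-is (a sketch; dtypes are assumptions):
import io
import pandas as pd

csv_text = """customer_name,bill_no,date
Mark,101,2018-10-01
Scott,102,2018-10-01
Pete,103,2018-10-02
Mark,104,2018-10-02
Mark,105,2018-10-04
Scott,106,2018-10-21
Julie,107,2018-10-03
Kevin,108,2018-10-07
Steve,109,2018-10-02
Mark,110,2018-10-06
Mark,111,2018-10-02
Mark,112,2018-10-05
Mark,113,2018-10-05"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])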

Using groupby.cumcount:
n = 3
# 1-based position of each purchase within its customer's history
df['flag'] = df.groupby('customer_name').cumcount() + 1
# keep only positions that are exact multiples of n
df['flag'] = ((df['flag'] % n) == 0).astype(int)
print(df)
customer_name bill_no date flag
0 Mark 101 2018-10-01 0
1 Scott 102 2018-10-01 0
2 Pete 103 2018-10-02 0
3 Mark 104 2018-10-02 0
4 Mark 105 2018-10-04 1
5 Scott 106 2018-10-21 0
6 Julie 107 2018-10-03 0
7 Kevin 108 2018-10-07 0
8 Steve 109 2018-10-02 0
9 Mark 110 2018-10-06 0
10 Mark 111 2018-10-02 0
11 Mark 112 2018-10-05 1
12 Mark 113 2018-10-05 0
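The same logic also fits in a single chained expression, equivalent to the two steps above:
n = 3
df['flag'] = df.groupby('customer_name').cumcount().add(1).mod(n).eq(0).astype(int)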

If actually getting the indices is important, you should use groupby + apply with slicing on the index:
n = 3
idx = df.groupby('customer_name', group_keys=False).apply(
    lambda x: x.index[n-1::n].to_series())
# So you can query these rows easily.
df.loc[idx]
customer_name bill_no date
4 Mark 105 2018-10-04
11 Mark 112 2018-10-05
Now, mark them using the indices:
df['flag'] = 0
df.loc[idx, 'flag'] = 1
df
customer_name bill_no date flag
0 Mark 101 2018-10-01 0
1 Scott 102 2018-10-01 0
2 Pete 103 2018-10-02 0
3 Mark 104 2018-10-02 0
4 Mark 105 2018-10-04 1
5 Scott 106 2018-10-21 0
6 Julie 107 2018-10-03 0
7 Kevin 108 2018-10-07 0
8 Steve 109 2018-10-02 0
9 Mark 110 2018-10-06 0
10 Mark 111 2018-10-02 0
11 Mark 112 2018-10-05 1
12 Mark 113 2018-10-05 0
If performance is important, use Sandeep's solution instead.


Find cumulative maximum of a pandas series, but reset the cumulative maximum when 0 is encountered

Consider the series below:
100
102
101
103
0
12
123
14
I want the result to be as follows:
100
102
102
103
0
12
123
123
Let d be the variable containing your series. Group by the cumulative sum of d == 0, then take the cummax within each group:
d.groupby(d.eq(0).cumsum()).cummax()
Out[37]:
0 100
1 102
2 102
3 103
4 0
5 12
6 123
7 123
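As a self-contained sketch of the same technique (series values taken from the question):
import pandas as pd

d = pd.Series([100, 102, 101, 103, 0, 12, 123, 14])
# d.eq(0).cumsum() increments at every zero, so each zero opens a new group;
# cummax then restarts from that zero onward.
print(d.groupby(d.eq(0).cumsum()).cummax().tolist())
# [100, 102, 102, 103, 0, 12, 123, 123]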

In Pandas, giving a datetime index, with rows on all work days, how to determine if a row is beginning of week or end of week?

I have a set of stock information with a datetime index. The stock market only opens on weekdays, so all my rows are weekdays, which is fine. I would like to determine whether a row is the start or the end of its week, which might NOT always fall on Monday/Friday due to holidays. A better idea is to check whether there is a row entry for the next/previous day in the dataframe (since my data is guaranteed to exist only for workdays), but I don't know how to calculate this. Here is an example of my data:
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Here is my current code
# Date fields
def DateFields(df_input):
    dates = df_input.index.to_series()
    df_input['day_of_week'] = dates.dt.dayofweek
    df_input['day_of_month'] = dates.dt.day
    df_input['day_of_year'] = dates.dt.dayofyear
    df_input['month_of_year'] = dates.dt.month
    df_input['isWeekStart'] = "No"  # <--- Need help here
    df_input['isWeekEnd'] = "No"    # <--- Need help here
    df_input['date'] = dates.dt.strftime('%Y-%m-%d')
    return df_input
How can I calculate if a row is beginning of week and end of week?
Example of what I am looking for:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
5/1/2017 0 1 121 5 1 0
5/2/2017 1 2 122 5 0 0
5/3/2017 2 3 123 5 0 0
5/4/2017 3 4 124 5 0 1 # short week, Thursday is last work day
5/8/2017 0 8 128 5 1 0
5/9/2017 1 9 129 5 0 0
5/10/2017 2 10 130 5 0 0
5/11/2017 3 11 131 5 0 0
5/12/2017 4 12 132 5 0 1
5/15/2017 0 15 135 5 1 0
5/16/2017 1 16 136 5 0 0
5/17/2017 2 17 137 5 0 0
5/18/2017 3 18 138 5 0 0
5/19/2017 4 19 139 5 0 1
5/23/2017 1 23 143 5 1 0 # short week, Tuesday is first work day
5/24/2017 2 24 144 5 0 0
5/25/2017 3 25 145 5 0 0
5/26/2017 4 26 146 5 0 1
5/30/2017 1 30 150 5 1 0
EDIT: I forgot that some holidays fall in the middle of the week. In this situation, it would be good if the code could treat the days before and after as separate "weeks", marked accordingly. Although if it's not smart enough to figure this out, just getting the long weekends right would be a good start.
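For reference, a minimal way to rebuild a frame like the sample above for testing (a sketch; only the date column is strictly needed by the answers below):
import pandas as pd

dates = pd.to_datetime(['2017-05-01', '2017-05-02', '2017-05-03', '2017-05-04',
                        '2017-05-08', '2017-05-09', '2017-05-10', '2017-05-11',
                        '2017-05-12', '2017-05-15', '2017-05-16', '2017-05-17',
                        '2017-05-18', '2017-05-19', '2017-05-23', '2017-05-24',
                        '2017-05-25', '2017-05-26', '2017-05-30'])
df = pd.DataFrame({'date': dates})
df['day_of_week'] = df['date'].dt.dayofweek
df['day_of_month'] = df['date'].dt.day
df['day_of_year'] = df['date'].dt.dayofyear
df['month_of_year'] = df['date'].dt.month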
Here's an idea with BusinessDay:
prev_working_day = df['date'] - pd.tseries.offsets.BusinessDay(1)
df['isFirstWeekDay'] = (df['date'].dt.isocalendar().week !=
                        prev_working_day.dt.isocalendar().week)
And similarly for the last business day (see the sketch after the output below). Note that plain BusinessDay only skips weekends; to account for holidays, use a CustomBusinessDay with a holiday calendar such as the US federal one. Check out this post for a different calendar.
Output:
date day_of_week day_of_month day_of_year month_of_year isFirstWeekDay
0 2017-05-01 0 1 121 5 True
1 2017-05-02 1 2 122 5 False
2 2017-05-03 2 3 123 5 False
3 2017-05-04 3 4 124 5 False
4 2017-05-08 0 8 128 5 True
5 2017-05-09 1 9 129 5 False
6 2017-05-10 2 10 130 5 False
7 2017-05-11 3 11 131 5 False
8 2017-05-12 4 12 132 5 False
9 2017-05-15 0 15 135 5 True
10 2017-05-16 1 16 136 5 False
11 2017-05-17 2 17 137 5 False
12 2017-05-18 3 18 138 5 False
13 2017-05-19 4 19 139 5 False
14 2017-05-23 1 23 143 5 False
15 2017-05-24 2 24 144 5 False
16 2017-05-25 3 25 145 5 False
17 2017-05-26 4 26 146 5 False
18 2017-05-30 1 30 150 5 False
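The symmetric flag for the last business day, as a sketch (isLastWeekDay is a name introduced here; the same weekend-only caveat applies):
next_working_day = df['date'] + pd.tseries.offsets.BusinessDay(1)
df['isLastWeekDay'] = (df['date'].dt.isocalendar().week !=
                       next_working_day.dt.isocalendar().week)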
Here's an approach using weekly groupby.
df['date'] = pd.to_datetime(df['date'])
business_days = (df.assign(date_copy=df['date'])
                   .groupby(pd.Grouper(key='date_copy', freq='W'))['date']
                   .apply(list).to_frame())
business_days['isWeekStart'] = business_days['date'].apply(lambda x: [1 if i == min(x) else 0 for i in x])
business_days['isWeekEnd'] = business_days['date'].apply(lambda x: [1 if i == max(x) else 0 for i in x])
business_days = business_days.apply(pd.Series.explode)
pd.merge(df, business_days, left_on='date', right_on='date')
output:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
0 2017-05-01 0 1 121 5 1 0
1 2017-05-02 1 2 122 5 0 0
2 2017-05-03 2 3 123 5 0 0
3 2017-05-04 3 4 124 5 0 1
4 2017-05-08 0 8 128 5 1 0
5 2017-05-09 1 9 129 5 0 0
6 2017-05-10 2 10 130 5 0 0
7 2017-05-11 3 11 131 5 0 0
8 2017-05-12 4 12 132 5 0 1
9 2017-05-15 0 15 135 5 1 0
10 2017-05-16 1 16 136 5 0 0
11 2017-05-17 2 17 137 5 0 0
12 2017-05-18 3 18 138 5 0 0
13 2017-05-19 4 19 139 5 0 1
14 2017-05-23 1 23 143 5 1 0
15 2017-05-24 2 24 144 5 0 0
16 2017-05-25 3 25 145 5 0 0
17 2017-05-26 4 26 146 5 0 1
18 2017-05-30 1 30 150 5 1 1
Note that 2017-05-30 is marked as both WeekStart and WeekEnd because it is the only date of that week.
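A more compact variant of the same idea using transform instead of exploding lists (a sketch; assumes df['date'] is already datetime):
wk = df['date'].dt.to_period('W')  # calendar-week key (Mon-Sun)
df['isWeekStart'] = df['date'].eq(df.groupby(wk)['date'].transform('min')).astype(int)
df['isWeekEnd'] = df['date'].eq(df.groupby(wk)['date'].transform('max')).astype(int)
This also marks 2017-05-30 as both start and end, since it is the only workday in its week.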

How would I rank within a groupby object based on another row condition in Pandas? Example included

The dataframe below has 4 columns: runner, race_date, height_in_inches, top_ten_finish.
I want to group by race_date and, if the runner finished in the top ten for that race_date, rank his height_in_inches among only the other runners who also finished in the top ten for that race_date. How would I do this?
This is the original dataframe:
>>> import pandas as pd
>>> d = {"runner":[&apos;mike&apos;,&apos;paul&apos;,&apos;jim&apos;,&apos;dave&apos;,&apos;douglas&apos;],
... "race_date":[&apos;2019-02-02&apos;,&apos;2019-02-02&apos;,&apos;2020-02-02&apos;,&apos;2020-02-01&apos;,&apos;2020-02-01&apos;],
... "height_in_inches":[72,68,70,74,73],
... "top_ten_finish":["yes","yes","no","yes","no"]}
>>> df = pd.DataFrame(d)
>>> df
runner race_date height_in_inches top_ten_finish
0 mike 2019-02-02 72 yes
1 paul 2019-02-02 68 yes
2 jim 2020-02-02 70 no
3 dave 2020-02-01 74 yes
4 douglas 2020-02-01 73 no
>>>
and this is what I'd like the result to look like. Notice how if they didn't finish in the top 10 of a race, they get a value of 0 for that new column.
runner race_date height_in_inches top_ten_finish if_top_ten_height_rank
0 mike 2019-02-02 72 yes 1
1 paul 2019-02-02 68 yes 2
2 jim 2020-02-02 70 no 0
3 dave 2020-02-01 74 yes 1
4 douglas 2020-02-01 73 no 0
Thank you!
We can filter to the top-ten rows, then groupby with rank:
df['rank'] = df[df.top_ten_finish.eq('yes')].groupby('race_date')['height_in_inches'].rank(ascending=False)
df['rank'] = df['rank'].fillna(0)
df
Out[87]:
runner race_date height_in_inches top_ten_finish rank
0 mike 2019-02-02 72 yes 1.0
1 paul 2019-02-02 68 yes 2.0
2 jim 2020-02-02 70 no 0.0
3 dave 2020-02-01 74 yes 1.0
4 douglas 2020-02-01 73 no 0.0
You can filter and rank on groupby() then assign back:
df['if_top_ten_height_rank'] = (df.loc[df['top_ten_finish']=='yes','height_in_inches']
                                  .groupby(df['race_date']).rank(ascending=False)
                                  .reindex(df.index, fill_value=0)
                                  .astype(int))
Output:
runner race_date height_in_inches top_ten_finish if_top_ten_height_rank
-- -------- ----------- ------------------ ---------------- ------------------------
0 mike 2019-02-02 72 yes 1
1 paul 2019-02-02 68 yes 2
2 jim 2020-02-02 70 no 0
3 dave 2020-02-01 74 yes 1
4 douglas 2020-02-01 73 no 0
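An equivalent variant that masks the heights first, so only top-ten rows ever enter the ranking (NaN heights are skipped by rank by default):
masked = df['height_in_inches'].where(df['top_ten_finish'].eq('yes'))
df['if_top_ten_height_rank'] = (masked.groupby(df['race_date'])
                                      .rank(ascending=False)
                                      .fillna(0)
                                      .astype(int))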

Generating rows in a pandas dataframe to make up for missing values of a column (or multiple columns)

I have the following dataframe.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 4 101 79
6 4 102 21
7 5 101 129
8 6 101 561
Notice that for sensor_id 102 there are no values for hour = 3. This is because the sensors do not generate a row of data when the hourly_count is zero. So sensor 102 should have hourly_count = 0 at hour = 3; this is just how the original data was collected.
Ideally I would like code that fills in this gap: if there are 2 sensors, each sensor should have a record for every hour, and if one is missing, insert a row in the dataframe for that sensor and hour with hourly_count set to 0.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
Any help is really appreciated.
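For reference, one way to rebuild the sample frame (a sketch):
import pandas as pd

df = pd.DataFrame({'hour': [1, 1, 2, 2, 3, 4, 4, 5, 6],
                   'sensor_id': [101, 102, 101, 102, 101, 101, 102, 101, 101],
                   'hourly_count': [651, 19, 423, 12, 356, 79, 21, 129, 561]})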
Using DataFrame.reindex, you can explicitly define your index. This is useful if you are missing data from both sensors for a particular hour. You can also extend the hours beyond what you have; in the following example, the index extends out to hour 8.
new_ix = pd.MultiIndex.from_product([range(1,9), [101, 102]], names=['hour', 'sensor_id'])
df_new = df.set_index(['hour', 'sensor_id'])
df_new.reindex(new_ix, fill_value=0).reset_index()
Output:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
12 7 101 0
13 7 102 0
14 8 101 0
15 8 102 0
Use pandas.DataFrame.pivot and then unstack with reset_index (note that pivot takes keyword arguments in recent pandas):
new_df = df.pivot(index='sensor_id', columns='hour', values='hourly_count').fillna(0).unstack().reset_index()
print(new_df)
Output:
hour sensor_id 0
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
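Note that the value column is literally labelled 0 after reset_index; a quick cleanup sketch:
new_df.columns = ['hour', 'sensor_id', 'hourly_count']
new_df['hourly_count'] = new_df['hourly_count'].astype(int)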
Assume the missing rows occur for sensor_id 102 only. One way is to create a new df with all combinations of hour and sensor_id, then left-merge this new df with the original df to get hourly_count, and fillna:
a = df.hour.unique()
df1 = pd.MultiIndex.from_product([a, [101, 102]]).to_frame(index=False, name=['hour', 'sensor_id'])
Out[157]:
hour sensor_id
0 1 101
1 1 102
2 2 101
3 2 102
4 3 101
5 3 102
6 4 101
7 4 102
8 5 101
9 5 102
10 6 101
11 6 102
df1.merge(df, on=['hour','sensor_id'], how='left').fillna(0)
Out[161]:
hour sensor_id hourly_count
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
Other way: using unstack with fill_value
df.set_index(['hour', 'sensor_id']).unstack(fill_value=0).stack().reset_index()
Out[171]:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0

Pandas Collapse and Stack Multi-level columns

I want to break down multi level columns and have them as a column value.
Original data input (Excel): [screenshot omitted]
As read into a dataframe:
Company Name Company code 2017-01-01 00:00:00 Unnamed: 3 Unnamed: 4 Unnamed: 5 2017-02-01 00:00:00 Unnamed: 7 Unnamed: 8 Unnamed: 9 2017-03-01 00:00:00 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 NaN NaN Product A Product B Product C Product D Product A Product B Product C Product D Product A Product B Product C Product D
1 Company A #123 1 5 3 5 0 2 3 4 0 1 2 3
2 Company B #124 600 208 30 20 600 213 30 15 600 232 30 12
3 Company C #125 520 112 47 15 520 110 47 10 520 111 47 15
4 Company D #126 420 165 120 31 420 195 120 30 420 182 120 58
Intended dataframe: [screenshot omitted]
I have tried stack() and unstack(), and also swaplevel, but I couldn't get the date columns to drop down into rows. Merged cells in Excel produce NaN in the dataframe, and when it is the header cells that are merged, I end up with Unnamed columns. How do I work around this? Am I missing something really simple here?
Using stack:
df.stack(level=0).reset_index(level=1)
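That one-liner assumes the frame already carries a two-level column index (date on top, product below). A hedged end-to-end sketch that rebuilds such an index from the messy frame shown above and then stacks it (the level names 'date' and 'product' are assumptions):
import pandas as pd

# Top level: the date headers, forward-filled across the Unnamed gaps
# that Excel's merged cells leave behind.
top = pd.Series(df.columns[2:]).astype(str)
top = top.where(~top.str.startswith('Unnamed')).ffill()
# Bottom level: the product names sitting in the first data row.
bottom = df.iloc[0, 2:]

tidy = df.iloc[1:].set_index(['Company Name', 'Company code'])
tidy.columns = pd.MultiIndex.from_arrays([top, bottom], names=['date', 'product'])

# Move the date level down into the rows, as in the answer above.
result = tidy.stack(level=0).reset_index()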
