Redefining a pandas dataframe based on its group - python

Iam using this dataframe
source fruit 2019 2020 2021
0 a apple 3 1 1
1 a banana 4 3 5
2 a orange 2 2 2
3 b apple 3 4 5
4 b banana 4 5 2
5 b orange 1 6 4
i want to refine it like this
source fruit 2019 2020 2021
0 a total 9 6 8
1 a seeds 5 3 3
2 a banana 4 3 5
3 b total 8 15 11
4 b seeds 4 10 9
5 b banana 4 5 2
total is sum of all fruits in that year for each source.
seeds is the sum of fruits containing seeds for each year for each source.
I tried
Appending new empty rows : Insert a new row after every nth row & Insert row at any position
But wasn't getting the expected result.
What would be the best way to get the desired output?

TRY:
df1 = df.groupby('source', as_index=False).sum().assign(fruit = 'total')
seeds = ['orange','apple']
df2 = df.loc[df['fruit'].isin(seeds)].groupby('source', as_index=False).sum().assign(fruit = 'seeds')
final_df = pd.concat([df.loc[~df['fruit'].isin(seeds)], df1,df2])

Related

Counting distinct, until a certain condition based on another row is met

I have the following df
Original df
Step | CampaignSource | UserId
1 Banana Jeff
1 Banana John
2 Banana Jefferson
3 Website Nunes
4 Banana Jeff
5 Attendance Nunes
6 Attendance Antonio
7 Banana Antonio
8 Website Joseph
9 Attendance Joseph
9 Attendance Joseph
Desired output
Steps | CampaignSource | CountedDistinctUserid
1 Website 2 (Because of different userids)
2 Banana 1
3 Banana 1
4 Website 1
5 Banana 1
6 Attendance 1
7 Attendance 1
8 Attendance 1
9 Attendance 1 (but i want to have 2 here even tho they have similar user ids and because is the 9th step)
What i want to do is impose a condition where if the step column which is made by strings equals '9', i want to count the userids as non distinct, any ideas on how i could do that? I tried applying a function but i just couldnt make it work.
What i am currently doing:
df[['Steps','UserId','CampaignSource']].groupby(['Steps','CampaignSource'],as_index=False,dropna=False).nunique()
You can group by "Step" and use a condition on the group name:
df.groupby('Step')['UserId'].apply(lambda g: g.nunique() if g.name<9 else g.count())
output:
Step
1 2
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 2
Name: UserId, dtype: int64
As DataFrame:
(df.groupby('Step', as_index=False)
.agg(CampaignSource=('CampaignSource', 'first'),
CountedDistinctUserid=('CampaignSource', lambda g: g.nunique() if g.name<9 else g.count())
)
)
output:
Step CampaignSource CountedDistinctUserid
0 1 Banana 2
1 2 Banana 1
2 3 Website 1
3 4 Banana 1
4 5 Attendance 1
5 6 Attendance 1
6 7 Banana 1
7 8 Website 1
8 9 Banana 2
You can apply different functions to different groups depending if condition match.
out = (df[['Steps','UserId','CampaignSource']]
.groupby(['Steps','CampaignSource'],as_index=False,dropna=False)
.apply(lambda g: g.assign(CountedDistinctUserid=( [len(g)]*len(g)
if g['Steps'].eq(9).all()
else [g['UserId'].nunique()]*len(g) ))))
print(out)
Steps UserId CampaignSource CountedDistinctUserid
0 1 Jeff Banana 2
1 1 John Banana 2
2 2 Jefferson Banana 1
3 3 Nunes Website 1
4 4 Jeff Banana 1
5 5 Nunes Attendance 1
6 6 Antonio Attendance 1
7 7 Antonio Banana 1
8 8 Joseph Website 1
9 9 Joseph Attendance 2
10 9 Joseph Attendance 2

Pandas: Create column with rolling sum of previous n rows of another column for within the same id/group

Sample dataset:
id fruit
0 7 NaN
1 7 apple
2 7 NaN
3 7 mango
4 7 apple
5 7 potato
6 3 berry
7 3 olive
8 3 olive
9 3 grape
10 3 NaN
11 3 mango
12 3 potato
In fruit column value of NaN and potato is 0. All other strings value is 1. I want to generate a new column sum_last_3 where each row calculates the sum of previous 3 rows (inclusive) of fruit column. When a new id appears, it should calculate from the beginning.
Output I want:
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
My Code:
df['sum_last5'] = (df['fruit'].ne('potato') & df['fruit'].notna())
.groupby('id',sort=False, as_index=False)['fruit']
.rolling(min_periods=1, window=3).sum().astype(int).values
You can modify your codes slightly, as follows:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
.groupby(df['id'],sort=False)
.rolling(min_periods=1, window=3).sum().astype(int)
.droplevel(0)
)
or use .values as in your codes:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
.groupby(df['id'],sort=False)
.rolling(min_periods=1, window=3).sum().astype(int)
.values
)
Your codes are close, just need to change id to df['id'] in the .groupby() call (since the main subject for calling .groupby() is now a boolean series rather than df itself, so .groupby() cannot recognize the id column by the column label 'id' alone and need also the dataframe name to fully qualify/identify the column).
Also remove as_index=False since this parameter is for dataframe rather than (boolean) series here.
Result:
print(df)
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1

Transforming Dataframe from days to weeks and aggregating quantity column

This is a tricky one and I'm having a difficult time aggregating this data by week. So, starting on 5/26/20, for each week what is the total quantity? That is the desired dataframe. My data has 3 months worth of data points where some 'products' have 0 quantities and this needs to be reflected in the desired df.
Original DF:
Product Date Qty
A 5/26/20 4
A 5/28/20 2
A 5/31/20 2
A 6/02/20 1
A 6/03/20 5
A 6/05/20 2
B 5/26/20 1
B 5/27/20 8
B 6/02/20 2
B 6/06/20 10
B 6/14/20 7
Desired DF
Product Week Qty
A 1 9
A 2 7
A 3 0
B 1 11
B 2 10
B 3 7
We can do it with transform , then create the new week with subtract
s = (df.Date-df.groupby('Product').Date.transform('min')).dt.days//7 + 1
s = df.groupby([df.Product, s]).Qty.sum().unstack(fill_value=0).stack().reset_index()
s
Out[348]:
Product Date 0
0 A 1 8
1 A 2 8
2 A 3 0
3 B 1 9
4 B 2 12
5 B 3 7

How to concat two python DataFrames where if the row already exits it doesn't add it, if not, append it

I'm pretty new to python.
I am trying to concat two dataframes (df1, df2) where if a row already exists in df1 then it is not added. if not, it adds to df1.
I don't want to use .concat().drop_duplicates() because I don't want duplicates within the same DataFrame to be removed.
BackStory:
I have multiple csv files that are exported from a software in different locations once in a while I want to merge these into one file. the problem is the exported files will have the same data as before along with the new records made within that period of time. therefore I need to check if the record is already in there as I will be executing the same code each time I export the data.
for the sake of example:
import pandas as pd
main_df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[2,5,4,5],[9,8,7,6],[8,5,6,7]])
df1 = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[1,5,4,8],[7,3,5,7],[4,3,8,5],[4,3,8,5]])
main_df
0 1 2 3
0 1 2 3 4 --duplicates I want to include--
1 1 2 3 4 --duplicates I want to include--
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
df1
0 1 2 3
0 1 2 3 4 --duplicates I want to exclude--
1 1 2 3 4 --duplicates I want to exclude--
2 4 2 5 1 --duplicates I want to exclude--
3 2 4 1 5 --duplicates I want to exclude--
4 1 5 4 8
5 7 3 5 7
6 4 3 8 5 --duplicates I want to include--
7 4 3 8 5 --duplicates I want to include--
I need the end result to be
main_df (after code execution)
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
7 1 5 4 8
8 7 3 5 7
9 4 3 8 5
10 4 3 8 5
I hope I have explained my issue in a clear way. Thank you
Check for every row in df1 whether it exists in main_df using pandas apply, and turn that into a mask by negating it with the ~ operator. I like using functools partial to make explicit that we are comparing to main_df.
import pandas as pd
from functools import partial
main_df = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[2,5,4,5],
[9,8,7,6],
[8,5,6,7]
])
df1 = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[1,5,4,8],
[7,3,5,7],
[4,3,8,5],
[4,3,8,5]
])
def has_row(df, row):
return (df == row).all(axis = 1).any()
main_df_has_row = partial(has_row, main_df)
duplicate_rows = df1.apply(main_df_has_row, axis = 1)
df1_add = df1.loc[~duplicate_rows]
pd.concat([main_df, df1_add])

Filtering Pandas Dataframe by mean of last N values

I'm trying to get all records where the mean of the last 3 rows is greater than the overall mean for all rows in a filtered set.
_filtered_d_all = _filtered_d.iloc[:, 0:50].loc[:, _filtered_d.mean()>0.05]
_last_n_records = _filtered_d.tail(3)
Something like this
_filtered_growing = _filtered_d.iloc[:, 0:50].loc[:, _last_n_records.mean() > _filtered_d.mean()]
However, the problem here is that the value length is incorrect. Any tips?
ValueError: Series lengths must match to compare
Sample Data
This has an index on the year and month, and 2 columns.
Col1 Col2
year month
2005 12 0.533835 0.170679
12 0.494733 0.198347
2006 3 0.440098 0.202240
6 0.410285 0.188421
9 0.502420 0.200188
12 0.522253 0.118680
2007 3 0.378120 0.171192
6 0.431989 0.145158
9 0.612036 0.178097
12 0.519766 0.252196
2008 3 0.547705 0.202163
6 0.560985 0.238591
9 0.617320 0.199537
12 0.343939 0.253855
Why not just boolean index directly on your filtered DataFrame with
df[df.tail(3).mean() > df.mean()]
Demo
>>> df
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
4 9 7 8 9 4
>>> df[df.tail(3).mean() > df.mean()]
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
Update example for MultiIndex edit
The same should work fine for your MultiIndex sample, we just have to mask a bit differently of course.
>>> df
col1 col2
2005 12 -0.340088 -0.574140
12 -0.814014 0.430580
2006 3 0.464008 0.438494
6 0.019508 -0.635128
9 0.622645 -0.824526
12 -1.674920 -1.027275
2007 3 0.397133 0.659467
6 0.026170 -0.052063
9 0.835561 0.608067
12 0.736873 -0.613877
2008 3 0.344781 -0.566392
6 -0.653290 -0.264992
9 0.080592 -0.548189
12 0.585642 1.149779
>>> df.loc[:,df.tail(3).mean() > df.mean()]
col2
2005 12 -0.574140
12 0.430580
2006 3 0.438494
6 -0.635128
9 -0.824526
12 -1.027275
2007 3 0.659467
6 -0.052063
9 0.608067
12 -0.613877
2008 3 -0.566392
6 -0.264992
9 -0.548189
12 1.149779

Categories