Group by with sum conditions [duplicate] - python

This question already has answers here:
Python Pandas Conditional Sum with Groupby
(3 answers)
Closed 4 years ago.
I have the following df and I'd like to group it by Date & Ref, but with sum conditions.
Specifically, I need to group by Date & Ref and sum the 'Q' column only where P is greater than or equal to PP.
from pandas import DataFrame

df = DataFrame({'Date': ['1', '1', '1', '1'],
                'Ref': ['one', 'one', 'two', 'two'],
                'P': ['50', '65', '30', '38'],
                'PP': ['63', '63', '32', '32'],
                'Q': ['10', '15', '20', '10']})
df.groupby(['Date', 'Ref'])['Q'].sum()   # does the right grouping, but sums the whole column
df.loc[df['P'] >= df['PP'], 'Q'].sum()   # has the right sum condition, but does not split by Date & Ref
Is there a way to do that?
Many thanks in advance

Just filter prior to grouping:
In[15]:
df[df['P'] >= df['PP']].groupby(['Date','Ref'])['Q'].sum()
Out[15]:
Date Ref
1 one 15
two 10
Name: Q, dtype: object
This reduces the size of the df up front, so it will also speed up the groupby operation.
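Note that in the sample df the numeric columns are stored as strings, which is why the result above shows dtype: object. A minimal sketch converting the relevant columns to integers first, so the comparison and the sum are numeric:
num = df.astype({'P': int, 'PP': int, 'Q': int})
num[num['P'] >= num['PP']].groupby(['Date', 'Ref'])['Q'].sum()
# Date  Ref
# 1     one    15
#       two    10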

You could do:
import pandas as pd

df = pd.DataFrame({'Date': ['1', '1', '1', '1'],
                   'Ref': ['one', 'one', 'two', 'two'],
                   'P': ['50', '65', '30', '38'],
                   'PP': ['63', '63', '32', '32'],
                   'Q': ['10', '15', '20', '10']})

def conditional_sum(x):
    return x[x['P'] >= x['PP']].Q.sum()

result = df.groupby(['Date', 'Ref']).apply(conditional_sum)
print(result)
Output
Date Ref
1 one 15
two 10
dtype: object
UPDATE
If you want to sum multiple columns in the output, you could use loc:
def conditional_sum(x):
    return x.loc[x['P'] >= x['PP'], ['Q', 'P']].sum()

result = df.groupby(['Date', 'Ref']).apply(conditional_sum)
print(result)
Output
Q P
Date Ref
1 one 15.0 65.0
two 10.0 38.0
Note that in the example above I used column P for the sake of showing how to do it with multiple columns.
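An alternative sketch (not part of the answer above): instead of filtering or using apply, zero out the non-qualifying rows before a plain groupby sum. This keeps every (Date, Ref) group in the result even when no row in it qualifies, assuming the columns have been converted to numeric first:
num = df.astype({'P': int, 'PP': int, 'Q': int})
result = (num.assign(Q=num['Q'].where(num['P'] >= num['PP'], 0))
             .groupby(['Date', 'Ref'])['Q'].sum())
print(result)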

Related

column comparison of two dataframes, return df with mismatches - python

I want to print, for two dataframes, the rows where there is a mismatch in a given column, here "second_column".
"first_column" is a key value that identifies the same product in both dataframes.
import pandas as pd

data1 = {
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['1', '2', '2'],
    'third_column': ['1', '2', '2'],
    'fourth_column': ['1', '2', '2']
}
df1 = pd.DataFrame(data1)
print(df1)

test = df1['second_column'].nunique()

data2 = {
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['3', '4', '2'],
    'third_column': ['1', '2', '2'],
    'fourth_column': ['1', '2', '2']
}
df2 = pd.DataFrame(data2)
print(df2)
expected output:
IIUC
btw, your screenshots don't match your DF definition
df1.loc[~df1['second_column'].isin(df2['second_column'])]

  first_column second_column third_column fourth_column
0          id1             1             1             1

df2.loc[~df2['second_column'].isin(df1['second_column'])]

  first_column second_column third_column fourth_column
0          id1             3             1             1
1          id2             4             2             2
The compare method can do what you want.
different_rows = df1.compare(df2, align_axis=1).index
df1.loc[different_rows]
With this method, one important caveat is that it expects identically labeled frames; extra rows (extra index entries) in one frame will not be returned as differences.
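If the two frames may contain extra rows, a workaround sketch (assuming first_column uniquely identifies the same product in both frames, as stated in the question) is to align them on that key before comparing:
a = df1.set_index('first_column')
b = df2.set_index('first_column')
common = a.index.intersection(b.index)   # keys present in both frames
different_rows = a.loc[common].compare(b.loc[common], align_axis=1).index
print(a.loc[different_rows])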
Or, if you want to find differences in one column only, you can first join on the index and then check where the joined columns disagree:
joined_df = df1.join(df2['second_column'], rsuffix='_df2')
diff = joined_df['second_column']!=joined_df['second_column_df2']
print(joined_df.loc[diff, df1.columns])
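For the sample frames above, this would print the df1 rows whose second_column disagrees with df2:
  first_column second_column third_column fourth_column
0          id1             1             1             1
1          id2             2             2             2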

Filter Pandas Dataframe columns with a complex condition generated by a predicate function (defined on columns)

I would like to filter out pandas dataframe columns using a condition defined by a predicate function on the columns, for example (in general it may be much more sophisticated, with rather complex dependencies between different elements of the series):
def detect_jumps(data, jump_factor=5):
    for i in range(1, len(data)):
        if data[i] - data[i - 1] > jump_factor:
            return True
    return False
on a dataframe df:
import pandas as pd

data = [
    {'A': '10', 'B': '10', 'C': '100', 'D': '100', 'E': '0'},
    {'A': '15', 'B': '16', 'C': '105', 'D': '104', 'E': '10'},
    {'A': '20', 'B': '20', 'C': '110', 'D': '110', 'E': '11'},
]
df = pd.DataFrame(data)
i.e.
A B C D E
0 10 10 100 100 0
1 15 16 105 104 10
2 20 20 110 110 11
It should filter out columns B (col[1] - col[0] == 6 > 5), D (col[2] - col[1] == 6 > 5) and E (col[1] - col[0] == 10 > 5).
Or, with the predicate detect_jumps(data, 9), it should only filter out column E (col[1] - col[0] == 10 > 9).
Are there any ways to use such functions as a condition for filtering?
You don't need a custom function; use vectorized operations:
df2 = df.loc[:, ~df.astype(int).diff().gt(5).any()]
output:
A C
0 10 100
1 15 105
2 20 110
Nevertheless, using your function:
df2 = df.loc[:, [not detect_jumps(c) for label, c in df.astype(int).items()]]
# OR
df2 = df[[label for label, c in df.astype(int).items() if not detect_jumps(c)]]
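For the jump_factor=9 case from the question, the same vectorized pattern applies, passing the threshold to gt:
df2 = df.loc[:, ~df.astype(int).diff().gt(9).any()]   # drops only E, whose jump is 10 > 9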

How to replace values in one DataFrame in Python with values from another DataFrame where the dates (in datetime format) match? [duplicate]

This question already has answers here:
Update a column values from another df where ids are same
(2 answers)
How do I merge dataframes on unique column values?
(1 answer)
Closed last year.
Given I have two DataFrames:
import pandas as pd

df1 = pd.DataFrame([['2017', '1'],
                    ['2018', '2'],
                    ['2019', '3'],
                    ['2020', '4'],
                    ['2021', '5'],
                    ['2022', '6']],
                   columns=['datetime', 'values'])
df2 = pd.DataFrame([['2018', '0'],
                    ['2019', '0'],
                    ['2020', '0']],
                   columns=['datetime', 'values'])
print(df1)
print(df2)
(Assume the values in the 'datetime' column have datetime format and are not strings.)
How can I replace the values in df1 with the values of df2 where the datetime exists in both, without using loops?
You can use combine_first after temporarily setting the index to whatever you want to use as matching columns:
(df2.set_index('datetime')
.combine_first(df1.set_index('datetime'))
.reset_index()
)
output:
datetime values
0 2017 1
1 2018 0
2 2019 0
3 2020 0
4 2021 5
5 2022 6
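An alternative sketch (not from the answer above): DataFrame.update overwrites df1 in place with the matching non-NaN values from df2, again aligning on a temporary datetime index (this assumes 'datetime' is unique in both frames):
df1 = df1.set_index('datetime')
df1.update(df2.set_index('datetime'))   # overwrite where the datetime index matches
df1 = df1.reset_index()
print(df1)   # same result as the combine_first version above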

Conditional grouping pandas DataFrame

I have a DataFrame with the columns below:
import pandas as pd

df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
                   'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
                   'Try': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'Batch': [0, 0, 0, 0, 0, 0, 0, 0, 0]})
In each batch, a name gets arbitrarily many tries to reach the greatest Lenght.
What I want to do is create a column win that has the value 1 for the greatest Lenght in a batch and 0 otherwise, with the following conditions:
If one name holds the greatest Lenght in a batch across multiple tries, only the first try will have the value 1 in win (see "Abe" in the expected output below).
If two separate names hold an equal greatest Lenght, then both will have the value 1 in win.
What I have managed to do so far is:
df.groupby(['Batch', 'Name'])['Lenght'].apply(lambda x: (x == x.max()).map({True: 1, False: 0}))
But it doesn't support all the conditions; any insight would be highly appreciated.
Expected output:
df = pd.DataFrame({'Name': ['Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
                   'Lenght': ['10', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
                   'Try': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'Batch': [0, 0, 0, 0, 0, 0, 0, 0, 0],
                   'win': [0, 1, 0, 1, 0, 0, 0, 0, 0]})
Many thanks.
Use GroupBy.transform to get the max value per group and compare it to the Lenght column with Series.eq; then, to map True -> 1 and False -> 0, cast the values to integers with Series.astype:
# data changed: the first row now duplicates the second row (to show ties within one Try)
df = pd.DataFrame({'Name': ['Karl', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy', 'Abe', 'Karl', 'Billy'],
                   'Lenght': ['12.5', '12.5', '11', '12.5', '12', '11', '12.5', '10', '5'],
                   'Try': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'Batch': [0, 0, 0, 0, 0, 0, 0, 0, 0]})

df['Lenght'] = df['Lenght'].astype(float)

# m1: rows whose Lenght equals the maximum of their Batch
m1 = df.groupby('Batch')['Lenght'].transform('max').eq(df['Lenght'])
df1 = df[m1]
# m2: among the max rows, names whose maximum occurs in a single Try
m2 = df1.groupby('Name')['Try'].transform('nunique').eq(1)
# m3: first occurrence of each Name within a Batch among the max rows
m3 = ~df1.duplicated(['Name', 'Batch'])
df['new'] = ((m2 | m3) & m1).astype(int)
print(df)
Name Lenght Try Batch new
0 Karl 12.5 0 0 1
1 Karl 12.5 0 0 1
2 Billy 11.0 0 0 0
3 Abe 12.5 1 0 1
4 Karl 12.0 1 0 0
5 Billy 11.0 1 0 0
6 Abe 12.5 2 0 0
7 Karl 10.0 2 0 0
8 Billy 5.0 2 0 0

How to append a list after looping over a dataframe column?

Assuming I have a dataframe as follows:
import pandas as pd

df = pd.DataFrame({'ids': ['1', '1', '1', '1', '2', '2', '2', '3', '3'],
                   'values': ['5', '8', '7', '12', '2', '1', '3', '15', '4']},
                  dtype='int32')
ids  values
1    5
1    8
1    7
1    12
2    2
2    1
2    3
3    15
3    4
What I would like to do is loop over the values column, check which values are greater than 6, and append the corresponding id from the ids column to an (initially empty) list.
Even if an id (say 3) has multiple values (4 and 15) and only one of them is greater than 6, I would still like that id to be appended to the list.
Example:
Assuming we run a loop over the above mentioned dataframe df, I would like the output as follows:
more = [1, 3]
less = [2]
with more =[] and less = [] being pre-initialized empty lists
What I have so far:
I tried implementing this, but I am surely making some mistake. The code I have:
less = []
more = []
for value in df['values']:
    for id in df['ids']:
        if (value > 6):
            more.append(id)
        else:
            less.append(id)
Use groupby and boolean indexing to create your lists. This will be much faster than looping:
g = df.groupby('ids')['values'].max()
mask = g.gt(6)
more = g[mask].index.tolist()
less = g[~mask].index.tolist()
print(more)
print(less)
[1, 3]
[2]
You can use dataframe indexing to pick out all the ids whose values are greater than 6 and create a set of unique ids:
setA = set(df[df['values'] > 6]['ids'])
This will create a set of all ids in the dataframe:
setB = set(df['ids'])
Now,
more = list(setA)
and for less, take the set difference:
less = list(setB.difference(setA))
That's it!
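Putting those steps together, a minimal end-to-end sketch of this set-based approach:
setA = set(df[df['values'] > 6]['ids'])   # ids with at least one value greater than 6
setB = set(df['ids'])                     # all ids
more = list(setA)
less = list(setB.difference(setA))
print(more)   # [1, 3] (set order may vary)
print(less)   # [2]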
