Applying a pandas GroupBy with mixed boolean and numerical values - python

How can I apply a pandas groupby to columns that are numerical and boolean? I want to sum over the numerical columns, and I want the aggregation of the boolean values to be any, that is, True if there are any Trues and False if all values are False.

Performing a sum aggregation will give the desired result as long as you cast the boolean columns back to boolean afterwards. Example:
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'bool': [True, False, True, True, False, False],
                   'c': [10, 10, 15, 15, 20, 20]})
id bool c
0 1 True 10
1 1 False 10
2 2 True 15
3 2 True 15
4 3 False 20
5 3 False 20
df.groupby('id').sum()
bool c
id
1 1.0 20
2 2.0 30
3 0.0 40
As you can see, when applying the sum, True is counted as 1 and False as 0, so any group that contains at least one True ends up with a nonzero sum. That effectively acts as the desired any operation once you cast the result back to boolean:
out = df.groupby('id').sum()
out['bool'] = out['bool'].astype(bool)
out
     bool   c
id
1    True  20
2    True  30
3   False  40

You can also choose which function to aggregate each column with:
df.groupby("id").agg({
    "bool": lambda arr: any(arr),
    "c": sum,
})
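If you prefer to avoid the lambda, pandas' built-in aggregation names do the same thing; a minimal sketch, assuming the same df as above:
df.groupby("id").agg({"bool": "any", "c": "sum"})
Or, with named aggregation (pandas >= 0.25), which also lets you rename the output columns:
df.groupby("id").agg(bool_any=("bool", "any"), c_sum=("c", "sum"))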

Related

Calculate cumulative sum based on threshold and condition in another column numpy

I have a data frame and I'd like to calculate a cumulative sum based on 2 conditions:
the 1st is a boolean already in the table,
and the 2nd is a fixed threshold that the cumulative sum is checked against.
I've succeeded with the 1st or the 2nd on its own, but I find it hard to combine both.
For the first one I used groupby:
df['group'] = np.cumsum((df['IsSuccess'] != df['IsSuccess'].shift(1)))
df['SumSale'] = df[['Sale', 'group']].groupby('group').cumsum()
For the 2nd, frompyfunc:
sumlm = np.frompyfunc(lambda a,b: b if (a+b>5) else a+b, 2, 1)
df['SumSale'] = sumlm.accumulate(df['Sale'], dtype=object)
My df is below, and SumSaleExpected is the result I'm looking for.
df2 = pd.DataFrame({'Sale': [10, 2, 2, 1, 3, 2, 1, 3, 5, 5],
                    'IsSuccess': [False, True, False, False, True, False, True, False, False, False],
                    'SumSaleExpected': [10, 12, 2, 3, 6, 2, 3, 6, 11, 16]})
So to summarize, I'd like the sum to start cumulating again once that sum is over 5 and the row's IsSuccess is True. I'd like to avoid a for loop if possible as well.
Thank you for the help!
I hope I've understood your question right. This example will subtract the necessary value (a "reset") when the cumulative sum of Sale is greater than 5 and IsSuccess == True:
df["SumSale"] = df["Sale"].cumsum()
# "reset" when SumSale>5 and IsSuccess==True
m = df["SumSale"].gt(5) & df["IsSuccess"].eq(True)
df.loc[m, "to_remove"] = df["SumSale"]
df["to_remove"] = df["to_remove"].ffill().shift().fillna(0)
df["SumSale"] -= df["to_remove"]
df = df.drop(columns="to_remove")
print(df)
Prints:
Sale IsSuccess SumSale
0 1 False 1.0
1 2 True 3.0
2 3 False 6.0
3 2 False 8.0
4 4 True 12.0
5 3 False 3.0
6 5 True 8.0
7 5 False 5.0
EDIT: Here is a version that matches your expected output. A generator keeps the running sum and resets it when the previous row's IsSuccess is True and the running sum has exceeded 5:
def fn():
    # coroutine that carries the running sum across rows
    sale, success = yield
    cum = sale
    while True:
        sale, success = yield cum
        if success and cum > 5:
            cum = sale        # reset: previous row was a success and the sum exceeded 5
        else:
            cum += sale

s = fn()
next(s)  # prime the generator
df["ss"] = df["IsSuccess"].shift()  # previous row's IsSuccess
df["SumSale"] = df.apply(lambda x: s.send((x["Sale"], x["ss"])), axis=1)
df = df.drop(columns="ss")
print(df)
Prints:
Sale IsSuccess SumSaleExpected SumSale
0 10 False 10 10
1 2 True 12 12
2 2 False 2 2
3 1 False 3 3
4 3 True 6 6
5 2 False 2 2
6 1 True 3 3
7 3 False 6 6
8 5 False 11 11
9 5 False 16 16
You can modify your group approach to account for both conditions by taking the cumsum() of the two conditions:
cond1 = df.Sale.cumsum().gt(5).shift().bfill()
cond2 = df.IsSuccess.shift().bfill()
df['group'] = (cond1 & cond2).cumsum()
Now that group accounts for both conditions, you can directly cumsum() within these pseudogroups:
df['SumSale'] = df.groupby('group').Sale.cumsum()
# Sale IsSuccess group SumSale
# 0 1 False 0 1
# 1 2 True 0 3
# 2 3 False 0 6
# 3 2 False 0 8
# 4 4 True 0 12
# 5 3 False 1 3

(Python) Selecting rows containing a string in ANY column?

I am trying to iterate through a dataframe and return the rows that contain a string "x" in any column.
This is what I have been trying
for col in df:
    rows = df[df[col].str.contains(searchTerm, case=False, na=False)]
However, it only returns up to 2 rows when I search for a string that I know exists in more rows than that.
How do I make sure it is searching every row of every column?
Edit: My end goal is to get the row and column of the cell containing the string searchTerm
Welcome!
Agree with all the comments. It's generally best practice to find a way to accomplish what you want in Pandas/Numpy without iterating over rows/columns.
If the objective is to "find rows where any column contains the value 'x'", life is a lot easier than you think.
Below is some data:
import pandas as pd

df = pd.DataFrame({
    'a': range(10),
    'b': ['x', 'b', 'c', 'd', 'x', 'f', 'g', 'h', 'i', 'x'],
    'c': [False, False, True, True, True, False, False, True, True, True],
    'd': [1, 'x', 3, 4, 5, 6, 7, 8, 'x', 10]
})
print(df)
a b c d
0 0 x False 1
1 1 b False x
2 2 c True 3
3 3 d True 4
4 4 x True 5
5 5 f False 6
6 6 g False 7
7 7 h True 8
8 8 i True x
9 9 x True 10
So clearly rows 0, 1, 4, 8 and 9 should be included.
If we just do df == 'x', pandas broadcasts the comparison across the whole dataframe:
df == 'x'
a b c d
0 False True False False
1 False False False True
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False False False False
7 False False False False
8 False False False True
9 False True False False
But pandas also has the handy .any method, to check for True along an axis. Since we want to check across all columns for each row, we want axis=1:
rows = (df == 'x').any(axis=1)
print(rows)
0 True
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 True
Note that if you want your solution to be truly case insensitive, like what you're doing with the .str method, you might need something more like:
rows = (df.applymap(lambda x: str(x).lower() == 'x')).any(axis=1)
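As an aside (not in the original answer): on pandas 2.1+, DataFrame.applymap is deprecated in favour of DataFrame.map, so the equivalent would be:
rows = df.map(lambda x: str(x).lower() == 'x').any(axis=1)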
The correct rows are flagged without any looping. And you get a series back that can be used for indexing the original dataframe:
df.loc[rows]
a b c d
0 0 x False 1
1 1 b False x
4 4 x True 5
8 8 i True x
9 9 x True 10
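For the edit at the end of the question (getting the row and column of each matching cell), one possible sketch, assuming the same df and a search term held in searchTerm, is to build the boolean mask with str.contains and stack it:
searchTerm = 'x'
mask = df.apply(lambda col: col.astype(str).str.contains(searchTerm, case=False, na=False))
hits = mask.stack()                    # Series indexed by (row, column)
locations = hits[hits].index.tolist()  # keep only the True cells
print(locations)                       # e.g. [(0, 'b'), (1, 'd'), (4, 'b'), (8, 'd'), (9, 'b')]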

What happened to python's ~ when working with boolean?

In a pandas DataFrame, I have a series of boolean values. In order to filter to rows where the boolean is True, I can use: df[df.column_x]
I thought in order to filter to only rows where the column is False, I could use: df[~df.column_x]. I feel like I have done this before, and have seen it as the accepted answer.
However, this fails because ~df.column_x converts the values to integers. See below.
import pandas as pd  # version 0.24.2

a = pd.Series(['a', 'a', 'a', 'a', 'b', 'a', 'b', 'b', 'b', 'b'])
b = pd.Series([True, True, True, True, True, False, False, False, False, False], dtype=bool)
c = pd.DataFrame(data=[a, b]).T
c.columns = ['Classification', 'Boolean']
print(~c.Boolean)
0 -2
1 -2
2 -2
3 -2
4 -2
5 -1
6 -1
7 -1
8 -1
9 -1
Name: Boolean, dtype: object
print(~b)
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
dtype: bool
Basically, I can use c[~b], but not c[~c.Boolean]
Am I just dreaming that this used to work?
Ah, since you created c by using the DataFrame constructor and then T,
let's first look at what we have before the T:
pd.DataFrame([a, b])
Out[610]:
0 1 2 3 4 5 6 7 8 9
0 a a a a b a b b b b
1 True True True True True False False False False False
So pandas makes each column have a single dtype; if the values are mixed, it converts the column to object.
After the T, what dtype do we have for each column?
The dtypes in your c:
c.dtypes
Out[608]:
Classification object
Boolean object
The Boolean column became object dtype; on plain Python integers, ~True is -2 and ~False is -1, which is why you get the unexpected output for ~c.Boolean.
How to fix it? Use concat:
c = pd.concat([a, b], axis=1)
c.columns = ['Classification', 'Boolean']
~c.Boolean
Out[616]:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
Name: Boolean, dtype: bool
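Alternatively, a small sketch (not from the original answer): keep your existing c and just cast the column back to bool before negating:
c['Boolean'] = c['Boolean'].astype(bool)
~c.Boolean   # now gives the expected boolean negation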

Pandas Pivot Table Keeps Returning False Instead of 0

I am making a pivot table using pandas. If I set aggfunc=sum or aggfunc=count on a column of boolean values, it works fine provided there's at least one True in the column. E.g. [True, False, True, True, False] would return 3. However, if all the values are False, then the pivot table outputs False instead of 0. No matter what, I can't get around it. The only way I can circumvent it is to define a function as follows:
def f(x):
    mySum = sum(x)
    return "0" if mySum == 0 else mySum
and then set aggfunc=lambda x: f(x). While that works visually, it still disturbs me that outputting a string is the only way I can get the 0 to stick. If I cast it to an int, or try to return 0.0, or do anything numeric at all, False is always the result.
Why is this, and how do I get the pivot table to actually give me 0 in this case (by only modifying the aggfunc, not the dataframe itself)?
df = pd.DataFrame({'count': [False] * 12, 'index': [0, 1] * 6, 'cols': ['a', 'b', 'c'] * 4})
print(df)
outputs
cols count index
0 a False 0
1 b False 1
2 c False 0
3 a False 1
4 b False 0
5 c False 1
6 a False 0
7 b False 1
8 c False 0
9 a False 1
10 b False 0
11 c False 1
You can use astype (docs) to cast the pivoted result to int:
import numpy as np

res = df.pivot_table(values='count', aggfunc=np.sum, columns='cols', index='index').astype(int)
print(res)
outputs
cols a b c
index
0 0 0 0
1 0 0 0
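If you'd rather stay within the question's constraint of only changing aggfunc, a hedged alternative sketch is to force the aggregation itself to return a plain int:
res = df.pivot_table(values='count', columns='cols', index='index',
                     aggfunc=lambda x: int(x.sum()))
print(res)
This should print 0s rather than False, since each cell is produced as a Python int (an assumption worth verifying on your pandas version).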

Multiply columns with range between 0 and 1 by 100 in Pandas

Given a Pandas dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'C': [11, 12, 13, 14, 15]})
A B C
0 1 0.1 11
1 2 0.2 12
2 3 0.3 13
3 4 0.4 14
4 5 0.5 15
For all of the columns where the range of values is between 0 and 1, I'd like to multiply all values in those columns by a constant (say, 100). I don't know a priori which columns have values between 0 and 1 and there are 100+ columns.
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15
I've tried using .min() and .max() and compared them to the desired range to return True/False values for each column.
(df.min() >= 0) & (df.max() <= 1)
A False
B True
C False
but it isn't obvious how to then select the True columns and multiply those values by 100.
Update
I came up with this solution instead, selecting only the columns where the mask is True:
mask = (df.min() >= 0) & (df.max() <= 1)
col_names = df.columns[mask]
df[col_names] = df[col_names] * 100
Something like this?
to_multiply = [col for col in df if 1 >= min(df[col]) >= 0 and 1 >= max(df[col]) >= 0]
df[to_multiply] = df[to_multiply] * 100
We can construct a boolean mask that tests whether the values in the df are greater than (gt) 0 and less than (lt) 1, call np.all with axis=0 to reduce it to a per-column mask, filter the columns with that mask, and then multiply all values in those columns by 100:
In [58]:
df[df.columns[np.all(df.gt(0) & df.lt(1),axis=0)]] *= 100
df
Out[58]:
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15
Breaking the above down:
In [61]:
df.gt(0) & df.lt(1)
Out[61]:
A B C
0 False True False
1 False True False
2 False True False
3 False True False
4 False True False
In [62]:
np.all(df.gt(0) & df.lt(1),axis=0)
Out[62]:
array([False, True, False], dtype=bool)
In [63]:
df.columns[np.all(df.gt(0) & df.lt(1),axis=0)]
Out[63]:
Index(['B'], dtype='object')
You can update your DataFrame based on your selection criteria:
df.update(df.loc[:, (df.ge(0).all() & df.le(1).all())].mul(100))
>>> df
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15
Any column which is greater than or equal to zero and less than or equal to one is multiplied by 100.
Other comparison operators:
.ge (greater than or equal to)
.gt (greater than)
.le (less than or equal to)
.lt (less than)
.eq (equals)
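For reference, a minimal sketch (assuming the same df) of the same selection written with .loc instead of update:
in_range = df.ge(0).all() & df.le(1).all()   # per-column mask: all values in [0, 1]
df.loc[:, in_range] = df.loc[:, in_range] * 100
print(df)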
Use .all() to check if all values are within range and if true, multiply them -
for col in df.columns:
    if (0 < df[col]).all() and (df[col] < 1).all():
        df[col] = df[col] * 100
In [1878]: df
Out[1878]:
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15
