Append values in new column based on 2 different conditions in Python

I have a sample data set which is similar to the one defined below.
import pandas as pd

dict_1 = {'Id': [1, 1, 2, 2, 3, 4],
          'boolean_val': [True, False, True, False, True, False],
          'sal': [1000, 2000, 1500, 2500, 3500, 4500]}
test = pd.DataFrame(dict_1)
test.head(10)
I have to create 2 new columns in test dataframe i.e. output_True & output_False based on given conditions:
a) If Id[0] == Id[1] and boolean_val == True, then put sal[0] (because this is the value when boolean_val is True) in output_True, else "NA".
b) If Id[0] == Id[1] and boolean_val == False, then put sal[1] (because this is the value when boolean_val is False) in output_False, else "NA".
c) If Id[0] != Id[1] and boolean_val == True, then put that row's sal value in output_True; else if Id[0] != Id[1] and boolean_val == False, then put that row's sal value in output_False.
If I have not framed my question clearly, please check the dataframe below; I want my output to look like the output_True and output_False columns shown there.
dict_1 = {'Id': [1, 1, 2, 2, 3, 4],
          'boolean_val': [True, False, True, False, True, False],
          'sal': [1000, 2000, 1500, 2500, 3500, 4500],
          'output_True': [1000, "NA", 1500, "NA", 3500, "NA"],
          'output_False': [2000, "NA", 2500, "NA", "NA", 4500]}
output_df = pd.DataFrame(dict_1)
output_df.head(10)
I have tried np.where() and a list comprehension, but my output does not show the correct values. Can someone please help me with this?

Use loc to assign your values for the boolean column. For the second condition you can use .shift() and compare your Id[0] == Id[1] values and sum based on that:
dict_1 = {'Id': [1, 1, 2, 2, 3, 4],
          'boolean_val': [True, False, True, False, True, False],
          'sal': [1000, 2000, 1500, 2500, 3500, 4500]}
test = pd.DataFrame(dict_1)
test
Id boolean_val sal
0 1 True 1000
1 1 False 2000
2 2 True 1500
3 2 False 2500
4 3 True 3500
5 4 False 4500
import numpy as np

cond1 = test.boolean_val
test.loc[cond1, 'output_True'] = test.sal
cond2 = test.Id.shift(-1).eq(test.Id)
test['output_False'] = np.nan
test.loc[cond2, 'output_False'] = test['sal'] + test['output_True']
test
Id boolean_val sal output_True output_False
0 1 True 1000 1000.0 2000.0
1 1 False 2000 NaN NaN
2 2 True 1500 1500.0 3000.0
3 2 False 2500 NaN NaN
4 3 True 3500 3500.0 NaN
5 4 False 4500 NaN NaN
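If you also want the literal "NA" strings from the question instead of NaN, a small follow-up (my addition to this answer, assuming the test frame above):

cols = ['output_True', 'output_False']
test[cols] = test[cols].astype(object).fillna("NA")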

Here's a way to get your desired output:
df = test.pivot(index='Id', columns='boolean_val', values='sal')
df = df.assign(boolean_val=df.loc[:,True].notna()).set_index('boolean_val', append=True)
df = df.rename(columns={True:'output_True', False:'output_False'})[['output_True', 'output_False']]
output_df = test.join(df, on=['Id','boolean_val'])
for col in ('output_True', 'output_False'):
    output_df[col] = np.where(output_df[col].isna(), "NA", output_df[col].astype(pd.Int64Dtype()))
Output:
Id boolean_val sal output_True output_False
0 1 True 1000 1000 2000
1 1 False 2000 NA NA
2 2 True 1500 1500 2500
3 2 False 2500 NA NA
4 3 True 3500 3500 NA
5 4 False 4500 NA 4500
Explanation:
- use pivot() to create an intermediate dataframe df with True and False columns containing the corresponding sal values for each Id
- add a boolean_val column which contains True unless a given row's True column is NaN
- set Id, boolean_val as the index for df
- rename the True and False columns to output_True and output_False and swap their positions (to match the desired output)
- use join() to create output_df, which is test with the added columns output_True and output_False
- replace NaN with the string "NA" and change sal values from float to int in output_True and output_False
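To make the first step easier to follow, here is roughly what df looks like right after the pivot() call (a sketch, assuming the test frame from the question):

df = test.pivot(index='Id', columns='boolean_val', values='sal')
print(df)
# boolean_val   False    True
# Id
# 1            2000.0  1000.0
# 2            2500.0  1500.0
# 3               NaN  3500.0
# 4            4500.0     NaN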

Related

Count NA and non-NA per group in pandas

I assume this is a simple task for pandas, but I don't get it.
I have data like this:
Group Val
0 A 0
1 A 1
2 A <NA>
3 A 3
4 B 4
5 B <NA>
6 B 6
7 B <NA>
And I want to know the frequency of valid and invalid values in Val per group Group. This is the expected result.
A B Total
Valid 3 2 5
NA 1 2 3
Here is code to generate that sample data.
#!/usr/bin/env python3
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAABBBB'),
    'Val': range(8)
})

# set some values to NA
for idx in [2, 5, 7]:
    df.iloc[idx, 1] = pd.NA

print(df)
What I tried is something with grouping:
>>> df.groupby('Group').agg(lambda x: x.isna())
Val
Group
A [False, False, True, False]
B [False, True, False, True]
>>> df.groupby('Group').apply(lambda x: x.isna())
Group Val
0 False False
1 False False
2 False True
3 False False
4 False False
5 False True
6 False False
7 False True
You are close with using groupby and isna:

new = (df.groupby(['Group', df['Val'].isna().replace({True: 'NA', False: 'Valid'})])['Group']
         .count()
         .unstack(level=0))
new['Total'] = new.sum(axis=1)
print(new)
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
Here is one way to do it:

# crosstab to summarize;
# convert Val to NA or Valid depending on the value
df2 = (pd.crosstab(df['Val'].isna().map({True: 'NA', False: 'Valid'}),
                   df['Group'])
         .reset_index()
         .rename_axis(columns=None))
df2['Total'] = df2.sum(axis=1, numeric_only=True)  # add Total column
out = df2.set_index('Val')  # set index to match expected output
out
out
A B Total
Val
NA 1 2 3
Valid 3 2 5
If you need both row and column totals, it's even simpler with crosstab:

df2 = pd.crosstab(df['Val'].isna().map({True: 'NA', False: 'Valid'}),
                  df['Group'],
                  margins=True, margins_name='Total')
df2
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
Total 4 4 8
Another possible solution, based on pandas.pivot_table and on the following ideas:
- Add a new column, status, which contains NA or Valid depending on whether the corresponding value is NaN or not.
- Create a pivot table, using len as the aggregation function.
- Add the Total column, by summing by rows.
import numpy as np

(df.assign(status=np.where(df['Val'].isna(), 'NA', 'Valid'))
   .pivot_table(index='status', columns='Group', values='Val',
                aggfunc=lambda x: len(x))
   .reset_index()
   .rename_axis(None, axis=1)
   .assign(Total=lambda x: x.sum(axis=1, numeric_only=True)))
Output:
status A B Total
0 NA 1 2 3
1 Valid 3 2 5
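For what it's worth, the same table can also be built without pivot_table, using groupby/size, which counts rows regardless of NA (a variant sketch under the same assumptions):

counts = (df.assign(status=np.where(df['Val'].isna(), 'NA', 'Valid'))
            .groupby(['status', 'Group']).size().unstack())
counts['Total'] = counts.sum(axis=1)
print(counts)
# Group   A  B  Total
# status
# NA      1  2      3
# Valid   3  2      5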

Calculate cumulative sum based on threshold and condition in another column with numpy

I have a data frame and I'd like to calculate a cumulative sum based on 2 conditions:
- the 1st is a boolean that is already in the table,
- the 2nd is a fixed threshold that checks what the cumulative sum is.
I've succeeded with the 1st or the 2nd, but I find it hard to combine both.
For the first one I used groupby:

df['group'] = np.cumsum(df['IsSuccess'] != df['IsSuccess'].shift(1))
df['SumSale'] = df[['Sale', 'group']].groupby('group').cumsum()

For the 2nd, frompyfunc:

sumlm = np.frompyfunc(lambda a, b: b if (a + b > 5) else a + b, 2, 1)
df['SumSale'] = sumlm.accumulate(df['Sale'], dtype=object)
My df is below, and SumSaleExpected is the result I'm looking for:
df2 = pd.DataFrame({'Sale': [10, 2, 2, 1, 3, 2, 1, 3, 5, 5],
                    'IsSuccess': [False, True, False, False, True, False, True, False, False, False],
                    'SumSaleExpected': [10, 12, 2, 3, 6, 2, 3, 6, 11, 16]})
So to summarize, I'd like the cumulative sum to restart once the sum is over 5 and the row's IsSuccess is True. I'd also like to avoid a for loop if possible.
Thank you for your help!
I hope I've understood your question right. This example will subtract the necessary value ("reset") when the cumulative sum of Sale is greater than 5 and IsSuccess == True:
df["SumSale"] = df["Sale"].cumsum()
# "reset" when SumSale>5 and IsSuccess==True
m = df["SumSale"].gt(5) & df["IsSuccess"].eq(True)
df.loc[m, "to_remove"] = df["SumSale"]
df["to_remove"] = df["to_remove"].ffill().shift().fillna(0)
df["SumSale"] -= df["to_remove"]
df = df.drop(columns="to_remove")
print(df)
Prints:
Sale IsSuccess SumSale
0 1 False 1.0
1 2 True 3.0
2 3 False 6.0
3 2 False 8.0
4 4 True 12.0
5 3 False 3.0
6 5 True 8.0
7 5 False 5.0
EDIT:

def fn():
    # running cumulative sum; reset the sum to the current sale when the
    # previous row was a success and the sum so far already exceeded 5
    sale, success = yield
    cum = sale
    while True:
        sale, success = yield cum
        if success and cum > 5:
            cum = sale
        else:
            cum += sale

s = fn()
next(s)  # prime the generator so it can receive values via send()

df["ss"] = df["IsSuccess"].shift()
df["SumSale"] = df.apply(lambda x: s.send((x["Sale"], x["ss"])), axis=1)
df = df.drop(columns="ss")
print(df)
Prints:
Sale IsSuccess SumSaleExpected SumSale
0 10 False 10 10
1 2 True 12 12
2 2 False 2 2
3 1 False 3 3
4 3 True 6 6
5 2 False 2 2
6 1 True 3 3
7 3 False 6 6
8 5 False 11 11
9 5 False 16 16
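One usage caveat worth adding (my note, not part of the original answer): the generator s keeps its state between send() calls, so create and prime a fresh instance before re-running the apply:

s = fn()   # fresh generator, otherwise the old cumulative state leaks in
next(s)    # prime it so it is ready to receive values via send()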
You can modify your group approach to account for both conditions by taking the cumsum() of the two conditions:
cond1 = df.Sale.cumsum().gt(5).shift().bfill()
cond2 = df.IsSuccess.shift().bfill()
df['group'] = (cond1 & cond2).cumsum()
Now that group accounts for both conditions, you can directly cumsum() within these pseudogroups:
df['SumSale'] = df.groupby('group').Sale.cumsum()
# Sale IsSuccess group SumSale
# 0 1 False 0 1
# 1 2 True 0 3
# 2 3 False 0 6
# 3 2 False 0 8
# 4 4 True 0 12
# 5 3 False 1 3

Find duplicate rows among different groups with pandas

Problem
Consider the following dataframe:
import pandas

data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pandas.DataFrame(data_so, columns=['ID', 'letter'])
I want to obtain a new column where all duplicates in different groups are True. All other duplicates in the same group should be False.
What I've tried
I've tried using
df_so['dup'] = df_so.duplicated(subset=['letter'], keep=False)
but the result is not what I want:
The first occurrence of A in the first group (row 0) is True because there is a duplicate in another group (row 7). However, all other occurrences of A in the same group (row 2) should be False.
If row 7 is deleted, then row 0 should be False because A is not present anymore in any other group.
What you need is essentially the AND of two different duplicated() calls:
- ~df_so.duplicated() deals within groups
- df_so.drop_duplicates().duplicated(subset='letter', keep=False).fillna(True) deals between groups, ignoring current-group duplicates
Code:
import pandas as pd
data_so = {'ID': [100, 100, 100, 200, 200, 300, 300, 300],
           'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A']}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])
df_so['dup'] = (~df_so.duplicated()
                & df_so.drop_duplicates().duplicated(subset='letter', keep=False).fillna(True))
print(df_so)
Output:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
Other case:
data_so = { 'ID': [100, 100, 100, 200, 200, 300, 300], 'letter': ['A','B','A','C','D','E','D'] }
Output:
ID letter dup
0 100 A False
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
As you clarified in the comment, you need an additional mask besides the current duplicated:
m1 = df_so.duplicated(subset=['letter'], keep=False)
m2 = ~df_so.groupby('ID').letter.apply(lambda x: x.duplicated())
df_so['dup'] = m1 & m2
Output:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
8 300 A False
Note: I added row=8 as in the comment.
My idea for this problem:
import datatable as dt
df = dt.Frame(df_so)
df[:1, :, dt.by("ID", "letter")]
I would group by both the ID and letter column. Then simply select the first row.
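For readers without datatable installed, a rough pandas equivalent of that dedupe step might be (my sketch, not from the original answer):

# keep the first row of every (ID, letter) group,
# mirroring df[:1, :, dt.by("ID", "letter")]
first_rows = df_so.drop_duplicates(subset=['ID', 'letter'])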

Applying a pandas GroupBy with mixed boolean and numerical values

How can I apply a pandas groupby to columns that are numerical and boolean? I want to sum over the numerical columns and I want the aggregation of the boolean values to be any, that is True if there are any Trues and False if there are only False.
Performing a sum aggregation will give the desired result as long as you cast the boolean column back to a boolean type afterwards. Example:
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'bool': [True, False, True, True, False, False],
                   'c': [10, 10, 15, 15, 20, 20]})
id bool c
0 1 True 10
1 1 False 10
2 2 True 15
3 2 True 15
4 3 False 20
5 3 False 20
df.groupby('id').sum()
bool c
id
1 1.0 20
2 2.0 30
3 0.0 40
As you can see, when applying the sum, True is cast to 1 and False to 0, so a nonzero sum effectively acts as the desired any operation. Casting the grouped result back to boolean:

out = df.groupby('id').sum()
out['bool'] = out['bool'].astype(bool)

     bool   c
id
1    True  20
2    True  30
3   False  40
You can choose the aggregation function per column with the following:

df.groupby("id").agg({
    "bool": lambda arr: any(arr),
    "c": sum,
})
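The lambda is optional here; pandas also accepts its built-in aggregator names, which should be equivalent:

df.groupby("id").agg({"bool": "any", "c": "sum"})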

Condensing pandas dataframe by dropping missing elements

Problem
I have a dataframe that looks like this:
Key Var ID_1 Var_1 ID_2 Var_2 ID_3 Var_3
1 True 1.0 True NaN NaN 5.0 True
2 True NaN NaN 4.0 False 7.0 True
3 False 2.0 False 5.0 True NaN NaN
Each row has exactly 2 non-null sets of data (ID/Var), and the remaining third is guaranteed to be null. What I want to do is "condense" the dataframe by removing the missing elements.
Desired Output
Key Var First_ID First_Var Second_ID Second_Var
1 True 1 True 5 True
2 True 4 False 7 True
3 False 2 False 5 True
The ordering is not important, so long as the Id/Var pairs are maintained.
Current Solution
Below is a working solution that I have:
import pandas as pd
import numpy as np
data = pd.DataFrame({'Key': [1, 2, 3], 'Var': [True, True, False], 'ID_1': [1, np.NaN, 2],
                     'Var_1': [True, np.NaN, False], 'ID_2': [np.NaN, 4, 5], 'Var_2': [np.NaN, False, True],
                     'ID_3': [5, 7, np.NaN], 'Var_3': [True, True, np.NaN]})
sorted_columns = ['Key', 'Var', 'ID_1', 'Var_1', 'ID_2', 'Var_2', 'ID_3', 'Var_3']
data = data[sorted_columns]

output = np.empty(shape=[data.shape[0], 6], dtype=str)
for i, *row in data.itertuples():
    output[i] = [element for element in row if np.isfinite(element)]
print(output)
[['1' 'T' '1' 'T' '5' 'T']
['2' 'T' '4' 'F' '7' 'T']
['3' 'F' '2' 'F' '5' 'T']]
This is acceptable, but not ideal. I can live with not having the column names, but my big issue is having to cast the data inside the array into a string in order to avoid my booleans being converted to numeric.
Are there other solutions that do a better job at preserving the data? Bonus points if the result is a pandas dataframe.
There is one simple solution, i.e. push the NaNs to the right and drop the all-NaN columns:

ndf = data.apply(lambda x: sorted(x, key=pd.isnull), axis=1).dropna(axis=1)
Output:
Key Var ID_1 Var_1 ID_2 Var_2
0 1 True 1 True 5 True
1 2 True 4 False 7 True
2 3 False 2 False 5 True
Hope it helps.
A NumPy solution from Divakar, for about 10x speed:

def mask_app(a):
    # allocate an all-NaN output array with the input's shape and dtype
    out = np.full(a.shape, np.nan, dtype=a.dtype)
    # mask of the non-NaN cells
    mask = ~np.isnan(a.astype(float))
    # left-align the valid values within each row
    out[np.sort(mask, 1)[:, ::-1]] = a[mask]
    return out

ndf = pd.DataFrame(mask_app(data.values), columns=data.columns).dropna(axis=1)
Key Var ID_1 Var_1 ID_2 Var_2
0 1 True 1 True 5 True
1 2 True 4 False 7 True
2 3 False 2 False 5 True
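If the First_/Second_ column names from the desired output matter, a final rename is all that's left (my addition, assuming ndf from either approach above):

ndf.columns = ['Key', 'Var', 'First_ID', 'First_Var', 'Second_ID', 'Second_Var']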
