Say I have a DataFrame like the following:
df = pd.DataFrame({
'group':['A', 'A', 'A', 'B', 'B', 'C'],
'amount':[100, -100, 50, 30, -30, 40]
})
I would like to add another column that indicates whether each row's amount can be paired (i.e. the same amount, one positive and one negative) within its group.
For example, in group A, 100 and -100 can be paired, so both are True, while 50 has no pair, so it is False (see the following table).
group  amount  pair
A      100     True
A      -100    True
A      50      False
B      30      True
B      -30     True
C      40      False
What would be the most efficient way to do this?
We can take the absolute value of the amount column, then build the pair column from the rows that DataFrame.duplicated flags:
df['pair'] = df.assign(amount=df['amount'].abs()).duplicated(keep=False)
keep=False marks every member of a duplicated set as True, not only the later occurrences. If the DataFrame has more than these 2 columns, restrict the check to group and amount, for example with the subset argument of duplicated.
df:
group amount pair
0 A 100 True
1 A -100 True
2 A 50 False
3 B 30 True
4 B -30 True
5 C 40 False
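If the frame carries extra columns, the duplicate check can be limited to the two relevant ones. A minimal sketch, assuming a hypothetical extra column named note that is not part of the original example:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'amount': [100, -100, 50, 30, -30, 40],
    'note': ['u', 'v', 'w', 'x', 'y', 'z'],  # hypothetical extra column
})

# Compare only group and the absolute amount; 'note' is ignored by the check.
df['pair'] = (
    df.assign(amount=df['amount'].abs())
      .duplicated(subset=['group', 'amount'], keep=False)
)
print(df)
```

Without subset, the differing note values would make every row unique and nothing would be flagged.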
Update to handle duplicate values while ensuring that only positive/negative pairs are matched, using pivot_table:
Updated DataFrame:
df = pd.DataFrame({
'group': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
'amount': [100, -100, 50, 50, 30, -30, 40]
})
Pivot to wide form and check for pairs:
df['abs_amount'] = df['amount'].abs()
df = df.join(
    df.pivot_table(index=['group', 'abs_amount'],
                   columns=df['amount'].gt(0),
                   values='amount',
                   aggfunc='first')
      .notnull().all(axis=1)
      .rename('pair'),
    on=['group', 'abs_amount']
).drop('abs_amount', axis=1)
df:
group amount pair
0 A 100 True
1 A -100 True
2 A 50 False
3 A 50 False
4 B 30 True
5 B -30 True
6 C 40 False
The pivot_table:
df['abs_amount'] = df['amount'].abs()
df.pivot_table(index=['group', 'abs_amount'],
               columns=df['amount'].gt(0),
               values='amount',
               aggfunc='first')
amount             False   True
group abs_amount
A     50             NaN   50.0  # Multiple 50s but no -50
      100         -100.0  100.0
B     30           -30.0   30.0
C     40             NaN   40.0
Check that both the negative and the positive value are present in each row:
df.pivot_table(index=['group', 'abs_amount'],
               columns=df['amount'].gt(0),
               values='amount',
               aggfunc='first').notnull().all(axis=1)
group  abs_amount
A      50            False
       100            True
B      30             True
C      40            False
dtype: bool
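The same check can also be written without reshaping. This is a sketch of an alternative, not the pivot_table approach above: it asks, per (group, absolute amount), whether at least one positive and at least one negative row exist.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'amount': [100, -100, 50, 50, 30, -30, 40]
})

# Group rows by (group, absolute amount).
keys = [df['group'], df['amount'].abs()]

# A row is paired only if its key holds both a positive and a negative amount.
has_pos = df['amount'].gt(0).groupby(keys).transform('any')
has_neg = df['amount'].lt(0).groupby(keys).transform('any')
df['pair'] = has_pos & has_neg
print(df)
```

The two 50s in group A share a key but have no negative counterpart, so both stay False.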
I have data similar to this
data = {'A': [10,20,30,10,-10], 'B': [100,200,300,100,-100], 'C':[1000,2000,3000,1000, -1000]}
df = pd.DataFrame(data)
df
Index  A    B     C
0      10   100   1000
1      20   200   2000
2      30   300   3000
3      10   100   1000
4      -10  -100  -1000
Here the rows at index 0, 3 and 4 hold exactly the same values, except that one of them is negative. For such cases I want a fourth column D populated with the value 'Exact opposite' for index 3 and 4 (any one matching pair will do).
Similar to
Index  A    B     C      D
0      10   100   1000
1      20   200   2000
2      30   300   3000
3      10   100   1000   Exact opposite
4      -10  -100  -1000  Exact opposite
One approach I can think of is to add a column that sums the values of all the columns:
column_names=['A','B','C']
df['Sum Val'] = df[column_names].sum(axis=1)
Index  A    B     C      Sum Val
0      10   100   1000   1110
1      20   200   2000   2220
2      30   300   3000   3330
3      10   100   1000   1110
4      -10  -100  -1000  -1110
and then to check whether there are any negative values and look up the corresponding equal positive value, but I could not proceed from there.
Please pardon any mistakes.
Like this maybe:
In [68]: import numpy as np
# Flag rows whose absolute values duplicate another row
In [69]: df['D'] = np.where(df.abs().duplicated(keep=False), 'Duplicate', '')
# If the flagged rows sum to 0, they are exact opposites of each other
In [78]: mask = df.D.eq('Duplicate')
    ...: if df.loc[mask, ['A', 'B', 'C']].sum(axis=1).sum() == 0:
    ...:     df.loc[mask, 'D'] = 'Exact Opposite'
    ...:
In [79]: df
Out[79]:
    A    B     C               D
0  10  100  1000  Exact Opposite
1  20  200  2000
2  30  300  3000
3 -10 -100 -1000  Exact Opposite
To follow your logic, just add abs with groupby; the output then returns the indices of each pair as a list:
df.reset_index().groupby(df['Sum Val'].abs())['index'].agg(list)
Out[367]:
Sum Val
1110 [0, 3]
2220 [1]
3330 [2]
Name: index, dtype: object
import pandas as pd
data = {'A': [10, 20, 30, -10], 'B': [100, 200,300, -100], 'C': [1000, 2000, 3000,-1000]}
df = pd.DataFrame(data)
print(df)
df['total'] = df.sum(axis=1)
df['total'] = df['total'].apply(lambda x: "Exact opposite" if sum(df['total'] == -1*x) else "")
print(df)
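Comparing sums can match unrelated rows that happen to add up to opposite totals. A minimal sketch that compares whole rows instead, using the question's five-row data; note that it flags every row that has an exact opposite anywhere in the frame (here 0, 3 and 4), not just one pair:

```python
import numpy as np
import pandas as pd

data = {'A': [10, 20, 30, 10, -10],
        'B': [100, 200, 300, 100, -100],
        'C': [1000, 2000, 3000, 1000, -1000]}
df = pd.DataFrame(data)

cols = ['A', 'B', 'C']
rows = pd.MultiIndex.from_frame(df[cols])      # each row as a tuple
negated = pd.MultiIndex.from_frame(-df[cols])  # each row with every sign flipped

# A row has an exact opposite if it appears among the negated rows.
df['D'] = np.where(rows.isin(negated), 'Exact opposite', '')
print(df)
```

A row made entirely of zeros would count as its own opposite under this check.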
I have data similar to this
data = {'A': [10,20,30,10,-10, 20,-20, 10], 'B': [100,200,300,100,-100, 30,-30,100], 'C':[1000,2000,3000,1000, -1000, 40,-40, 1000]}
df = pd.DataFrame(data)
df
Index  A    B     C
0      10   100   1000
1      20   200   2000
2      30   300   3000
3      10   100   1000
4      -10  -100  -1000
5      20   30    40
6      -20  -30   -40
7      10   100   1000
Here the column sums for index 0, 3 and 7 equal 1110 and the sum for index 4 equals -1110; likewise the sums for index 5 and 6 equal 90 and -90. These are exact opposites, so for such cases I want a fourth column D populated with the value 'Exact opposite' for index 3 and 4, and for index 5 and 6 (pairing the nearest indices).
Similar to
Index  A    B     C      D
0      10   100   1000
1      20   200   2000
2      30   300   3000
3      10   100   1000   Exact opposite
4      -10  -100  -1000  Exact opposite
5      20   30    40     Exact opposite
6      -20  -30   -40    Exact opposite
7      10   100   1000
One approach I can think of is to add a column that sums the values of all the columns:
column_names=['A','B','C']
df['Sum Val'] = df[column_names].sum(axis=1)
Index  A    B     C      Sum Val
0      10   100   1000   1110
1      20   200   2000   2220
2      30   300   3000   3330
3      10   100   1000   1110
4      -10  -100  -1000  -1110
5      20   30    40     90
6      -20  -30   -40    -90
7      10   100   1000   1110
and then to check whether there are any negative values and look up the corresponding equal positive value, but I could not proceed from there.
This should do the trick:
import pandas as pd
data = {'A': [10,20,30,10,-10, 30, 10], 'B': [100,200,300,100,-100, 300, 100], 'C':[1000,2000,3000,1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)
print(df)
cols = ['A', 'B', 'C']  # compare every value column
for i in df.index[:-1]:  # each row is compared with the next one
    for col in cols:
        if not df[col][i] == -df[col][i+1]:
            break
    else:
        # no break: every column of row i+1 is the negation of row i
        df.at[i, 'D'] = 'Exact opposite'
        df.at[i+1, 'D'] = 'Exact opposite'
print(df)
This solution only considers 2 adjacent lines.
The following code compares all lines so it also detects lines 0 and 6:
data = {'A': [10,20,30,10,-10, 30, 10], 'B': [100,200,300,100,-100, 300, 100], 'C':[1000,2000,3000,1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)
print(df)
cols = ['A', 'B', 'C']  # compare every value column
for i in df.index:
    for j in df.index[i:]:
        for col in cols:
            if not df[col][i] == -df[col][j]:
                break
        else:
            # no break: row j is the negation of row i in every column
            df.at[i, 'D'] = 'Exact opposite'
            df.at[j, 'D'] = 'Exact opposite'
print(df)
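Neither loop pairs rows one-to-one, which the "nearest index" requirement of the second question asks for. A minimal single-pass sketch under one assumption: "nearest" is read as the most recent earlier row that is still unmatched. The names unmatched and paired are illustrative only.

```python
import pandas as pd

data = {'A': [10, 20, 30, 10, -10, 20, -20, 10],
        'B': [100, 200, 300, 100, -100, 30, -30, 100],
        'C': [1000, 2000, 3000, 1000, -1000, 40, -40, 1000]}
df = pd.DataFrame(data)

cols = ['A', 'B', 'C']
unmatched = {}   # row tuple -> indices still waiting for a partner
paired = set()   # indices that ended up in a pair

for idx, row in zip(df.index, df[cols].itertuples(index=False, name=None)):
    opposite = tuple(-v for v in row)
    waiting = unmatched.get(opposite)
    if waiting:
        # Pair with the most recent (nearest) unmatched opposite row.
        paired.update((waiting.pop(), idx))
    else:
        unmatched.setdefault(row, []).append(idx)

df['D'] = ['Exact opposite' if i in paired else '' for i in df.index]
print(df)
```

Each row is used in at most one pair, so index 0 and 7 stay blank once 3 has been matched with 4.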
I am trying to add values to a column based on a couple of conditions. Here is the code example:
import pandas as pd
df1 = pd.DataFrame({'Type': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'], 'Val': [20, -10, 20, -10, 30, -20, 40, -30]})
df2 = pd.DataFrame({'Type': ['A', 'A', 'B', 'B', 'C', 'C'], 'Cat':['p', 'n', 'p', 'n','p', 'n'], 'Val': [30, -40, 20, -30, 10, -20]})
for index, _ in df1.iterrows():
    if df1.loc[index,'Val'] >=0:
        df1.loc[index,'Val'] = df1.loc[index,'Val'] + float(df2.loc[(df2['Type'] == df1.loc[index,'Type']) & (df2['Cat'] == 'p'), 'Val'].iloc[0])
    else:
        df1.loc[index,'Val'] = df1.loc[index,'Val'] + float(df2.loc[(df2['Type'] == df1.loc[index,'Type']) & (df2['Cat'] == 'n'), 'Val'].iloc[0])
For each value in the 'Val' column of df1, I want to add values from df2, based on the type and whether the original value was positive or negative.
The expected output for this example would be alternating 50 and -50 in df1. The above code does the job, but it is too slow to be usable for a large data set. Is there a better way to do this?
Try adding a Cat column to df1, merging with df2, summing the Val columns across axis 1, then dropping the extra columns:
import numpy as np
df1['Cat'] = np.where(df1['Val'].lt(0), 'n', 'p')
df1 = df1.merge(df2, on=['Type', 'Cat'], how='left')
df1['Val'] = df1[['Val_x', 'Val_y']].sum(axis=1)
df1 = df1.drop(['Cat', 'Val_x', 'Val_y'], axis=1)
Type Val
0 A 50
1 A -50
2 A 50
3 A -50
4 B 50
5 B -50
6 C 50
7 C -50
Add new column with np.where
df1['Cat'] = np.where(df1['Val'].lt(0), 'n', 'p')
Type Val Cat
0 A 20 p
1 A -10 n
2 A 20 p
3 A -10 n
4 B 30 p
5 B -20 n
6 C 40 p
7 C -30 n
merge on Type and Cat
df1 = df1.merge(df2, on=['Type', 'Cat'], how='left')
Type Val_x Cat Val_y
0 A 20 p 30
1 A -10 n -40
2 A 20 p 30
3 A -10 n -40
4 B 30 p 20
5 B -20 n -30
6 C 40 p 10
7 C -30 n -20
sum Val columns:
df1['Val'] = df1[['Val_x', 'Val_y']].sum(axis=1)
Type Val_x Cat Val_y Val
0 A 20 p 30 50
1 A -10 n -40 -50
2 A 20 p 30 50
3 A -10 n -40 -50
4 B 30 p 20 50
5 B -20 n -30 -50
6 C 40 p 10 50
7 C -30 n -20 -50
drop extra columns:
df1 = df1.drop(['Cat', 'Val_x', 'Val_y'], axis=1)
Type Val
0 A 50
1 A -50
2 A 50
3 A -50
4 B 50
5 B -50
6 C 50
7 C -50
import numpy as np
df1['sign'] = np.sign(df1.Val)
df2['sign'] = np.sign(df2.Val)
df = pd.merge(df1, df2, on=['Type', 'sign'], suffixes=('_df1', '_df2'))
df['Val'] = df.Val_df1 + df.Val_df2
df = df.drop(columns=['Val_df1', 'sign', 'Val_df2'])
df
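Another option that avoids the temporary merge columns is a keyed lookup. This is a sketch, assuming each (Type, Cat) pair occurs exactly once in df2; the names lookup, cat and keys are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Type': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
                    'Val': [20, -10, 20, -10, 30, -20, 40, -30]})
df2 = pd.DataFrame({'Type': ['A', 'A', 'B', 'B', 'C', 'C'],
                    'Cat': ['p', 'n', 'p', 'n', 'p', 'n'],
                    'Val': [30, -40, 20, -30, 10, -20]})

# Lookup table keyed by (Type, Cat); assumes the keys are unique in df2.
lookup = df2.set_index(['Type', 'Cat'])['Val']

# Build the matching key for every row of df1 from its type and sign
# (zero counts as 'p', as in the original loop).
cat = np.where(df1['Val'].lt(0), 'n', 'p')
keys = pd.MultiIndex.from_arrays([df1['Type'], cat])

# Fetch the df2 value for each key and add it positionally.
df1['Val'] = df1['Val'] + lookup.reindex(keys).to_numpy()
print(df1)
```

A key missing from df2 would come back as NaN rather than raising, so the result is worth checking for missing values on real data.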
Problem
Consider the following dataframe:
data_so = {
'ID': [100, 100, 100, 200, 200, 300, 300, 300],
'letter': ['A','B','A','C','D','E','D','A'],
}
df_so = pandas.DataFrame (data_so, columns = ['ID', 'letter'])
I want to obtain a new column where all duplicates in different groups are True. All other duplicates in the same group should be False.
What I've tried
I've tried using
df_so['dup'] = df_so.duplicated(subset=['letter'], keep=False)
but the result is not what I want:
The first occurrence of A in group 1 (row 0) is True because there is a duplicate in another group (row 7). However all other occurrences of A in the same group (row 2) should be False.
If row 7 is deleted, then row 0 should be False because A is not present anymore in any other group.
What you need is essentially the AND of two different duplicated() calls:
~df_so.duplicated() handles duplicates within a group: it is False for repeated (ID, letter) rows.
df_so.drop_duplicates().duplicated(subset='letter',keep=False).fillna(True) handles duplicates between groups, ignoring repeats within the current group.
Code:
import pandas as pd
data_so = { 'ID': [100, 100, 100, 200, 200, 300, 300, 300], 'letter': ['A','B','A','C','D','E','D','A'], }
df_so = pd.DataFrame (data_so, columns = ['ID', 'letter'])
df_so['dup'] = ~df_so.duplicated() & df_so.drop_duplicates().duplicated(subset='letter',keep=False).fillna(True)
print(df_so)
Output:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
Other case:
data_so = { 'ID': [100, 100, 100, 200, 200, 300, 300], 'letter': ['A','B','A','C','D','E','D'] }
Output:
ID letter dup
0 100 A False
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
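The same two conditions can be spelled out with a groupby, which may be easier to read. This is a sketch of an equivalent formulation, not the code above; the names in_other_group and first_in_group are illustrative only.

```python
import pandas as pd

data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])

# True where the letter occurs under more than one ID.
in_other_group = df_so.groupby('letter')['ID'].transform('nunique').gt(1)

# True only for the first occurrence of each (ID, letter) pair.
first_in_group = ~df_so.duplicated(subset=['ID', 'letter'])

df_so['dup'] = in_other_group & first_in_group
print(df_so)
```

Both masks share the original row index, so no index alignment is involved in the final AND.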
As you clarified in the comment, you need an additional mask besides the current duplicated call:
m1 = df_so.duplicated(subset=['letter'], keep=False)
# group_keys=False keeps the original row index so that m1 and m2 align
m2 = ~df_so.groupby('ID', group_keys=False).letter.apply(lambda x: x.duplicated())
df_so['dup'] = m1 & m2
Out[157]:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
8 300 A False
Note: I added row=8 as in the comment.
My idea for this problem:
import datatable as dt
df = dt.Frame(df_so)
df[:1, :, dt.by("ID", "letter")]
I would group by both the ID and letter column. Then simply select the first row.
My DataFrame looks like this:
A B
100 1
100 2
200 2
200 3
I need to find all possible combinations of A and B values and create a new DataFrame with these combinations and a third column indicating whether each combination is present in the original df:
A B C
100 1 True
100 2 True
100 3 False
200 1 False
200 2 True
200 3 True
How I'm doing it now:
import pandas as pd
df = pd.DataFrame({'A' : [100,100,200,200], 'B' : [1,2,2,3]})
df['D'] = 42
df2 = df[['A','D']].merge(df[['B','D']], on='D')[['A','B']].drop_duplicates()
i1 = df.set_index(['A','B']).index
i2 = df2.set_index(['A','B']).index
df2['C'] = i2.isin(i1)
print(df2)
It works, but looks ugly. Is there a cleaner way?
You can use:
create a new column filled with True
set_index on the columns used for the combinations
create a MultiIndex.from_product from the levels of the df1 index
reindex the original df, filling combinations that do not exist with False
reset_index to turn the MultiIndex back into columns
df['C'] = True
df1 = df.set_index(['A','B'])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
df = df1.reindex(mux, fill_value=False).reset_index()
print (df)
A B C
0 100 1 True
1 100 2 True
2 100 3 False
3 200 1 False
4 200 2 True
5 200 3 True
With the help of itertools and tuple
import itertools
newdf = pd.DataFrame(list(itertools.product(df['A'].unique(),df['B'].unique())),columns = df.columns)
dft = list(df.itertuples(index=False))
newdf['C'] = newdf.apply(lambda x: tuple(x) in dft,axis=1)
Output :
A B C
0 100 1 True
1 100 2 True
2 100 3 False
3 200 1 False
4 200 2 True
5 200 3 True
Using cartesian_product (from pd.core.reshape.util, an internal pandas module) and pd.merge
In [415]: combs = pd.core.reshape.util.cartesian_product(
df.set_index(['A', 'B']).index.levels)
In [416]: combs
Out[416]:
[array([100, 100, 100, 200, 200, 200], dtype=int64),
array([1, 2, 3, 1, 2, 3], dtype=int64)]
In [417]: (pd.DataFrame({'A': combs[0], 'B': combs[1]})
.merge(df, how='left', indicator='C')
.replace({'C': {'both': True, 'left_only': False}}) )
Out[417]:
A B C
0 100 1 True
1 100 2 True
2 100 3 False
3 200 1 False
4 200 2 True
5 200 3 True
For combs, you could also,
In [432]: pd.core.reshape.util.cartesian_product([df.A.unique(), df.B.unique()])
Out[432]:
[array([100, 100, 100, 200, 200, 200], dtype=int64),
array([1, 2, 3, 1, 2, 3], dtype=int64)]
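The combinations can also be built with public API only. A minimal sketch, assuming a pandas version that supports how='cross' in merge (1.2 or later); the names combos and out are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [100, 100, 200, 200], 'B': [1, 2, 2, 3]})

# Every unique A paired with every unique B.
combos = df[['A']].drop_duplicates().merge(
    df[['B']].drop_duplicates(), how='cross')

# The left merge keeps all combinations; the indicator column records
# whether each one was found in the original frame.
out = combos.merge(df, how='left', indicator='C')
out['C'] = out['C'].eq('both')
print(out)
```

If df contained repeated (A, B) rows, the left merge would repeat those combinations, so dropping duplicates from df first would be needed in that case.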