XOR operation on 2 pandas DataFrames - python

Is there a way to remove from the first DataFrame all rows that can be found in the second DataFrame, and to add the rows that exist only in the second DataFrame (i.e. XOR)? Here's a twist: the first DataFrame has one column that shall be ignored during the comparison.
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [4, 5, 6],
                    'spec': ['A', 'B', 'C']})
df2 = pd.DataFrame({'col1': [1, 9],
                    'col2': [4, 9]})
result = pd.DataFrame({'col1': [2, 3, 9],
                       'col2': [5, 6, 9],
                       'spec': ['B', 'C', 'df2']})
df1 = df1.astype(str)
df2 = df2.astype(str)
This is analogous to a UNION (not UNION ALL) operation.
Combine
col1 col2 spec
0 1 4 A
1 2 5 B
2 3 6 C
and
col1 col2
0 1 4
1 9 9
to
col1 col2 spec
1 2 5 B
2 3 6 C
1 9 9 df2

You could concatenate and drop duplicates:
out = (pd.concat([df1, df2.assign(spec='df2')])
         .drop_duplicates(subset=['col1', 'col2'], keep=False))
or filter out the common rows and concatenate:
out = pd.concat([df1[~df1[['col1', 'col2']].isin(df2[['col1', 'col2']]).all(axis=1)],
                 df2[~df2.isin(df1[['col1', 'col2']]).all(axis=1)].assign(spec='df2')])
Output:
col1 col2 spec
1 2 5 B
2 3 6 C
1 9 9 df2
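Note that DataFrame.isin aligns on the index, so the second variant only drops common rows when they happen to share index labels. A more index-independent sketch (my addition, assuming neither frame contains duplicate rows) uses an outer merge with an indicator column:
# flag each row as 'left_only', 'right_only', or 'both'
merged = df1.merge(df2, on=['col1', 'col2'], how='outer', indicator=True)
# keep rows that are not in both frames (the symmetric difference)
out = merged[merged['_merge'] != 'both'].drop(columns='_merge')
# rows coming only from df2 have no 'spec' value; label them as requested
out['spec'] = out['spec'].fillna('df2')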

Related

How to add interleaving rows as result of sort / groups?

I have the following sample input data:
import pandas as pd
df = pd.DataFrame({'col1': ['x', 'y', 'z'], 'col2': [1, 2, 3], 'col3': ['a', 'a', 'b']})
I would like to sort and group by col3 while interleaving the summaries on top of the corresponding group in col1 and get the following output:
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
I can of course do the aggregation part:
df.sort_values(by=['col3']).groupby(by=['col3']).sum()
col2
col3
a 3
b 3
but I am not sure how to interleave the group labels on top of col1.
Use a custom function that prepends a summary row to each group:
def f(x):
    # prepend a one-row summary (group label, group sum) to the group
    summary = pd.DataFrame({'col1': x.name, 'col2': x['col2'].sum()}, index=[0])
    return pd.concat([summary, x])

df = (df.sort_values(by=['col3'])
        .groupby(by=['col3'], group_keys=False)
        .apply(f)
        .drop(columns='col3')
        .reset_index(drop=True))
print(df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
A more performant solution is to use GroupBy.ngroup for the indices, aggregate the sum, and finally join the values back with concat, sorting the index with a stable mergesort. Stability matters here: each summary row shares an index value with its group and comes first in the concatenation, so the stable sort keeps it on top of its group:
df = df.sort_values(by=['col3'])
df1 = df.groupby(by=['col3'])['col2'].sum().rename_axis('col1').reset_index()
df2 = df.set_index(df.groupby(by=['col3']).ngroup())
df = (pd.concat([df1, df2])
        .sort_index(kind='mergesort', ignore_index=True)
        .drop(columns='col3'))
print(df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
What about:
(df.melt(id_vars='col2')
   .rename(columns={'value': 'col1'})
   .groupby('col1')['col2'].sum()  # select col2 so the leftover 'variable' column is not summed
   .reset_index()
)
output (note the rows come out sorted by col1, not interleaved per group):
col1 col2
0 a 3
1 b 3
2 x 1
3 y 2
4 z 3
def function1(dd: pd.DataFrame):
    # insert the summary row just above each group via a fractional index label
    df.loc[dd.index.min() - 0.5, ['col1', 'col2']] = [dd.name, dd.col2.sum()]

df.groupby('col3').apply(function1).pipe(lambda dd: df.sort_index(ignore_index=True)).drop('col3', axis=1)
output
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
or use the pandasql library
def function1(dd: pd.DataFrame):
    return dd.sql("select '{}' as col1,{} as col2 union select col1,col2 from self".format(dd.name, dd.col2.sum()))

df.groupby('col3').apply(function1).reset_index(drop=True)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3

How to identify unique elements in two dataframes and append with a new row

I am trying to write a function that takes in two dataframes with a different number of rows, finds the elements that are unique to each dataframe in the first column, and then appends a new row that only contains the unique element to the dataframe where it does not exist. For example:
>>> d1 = {'col1': [1, 2, 5], 'col2': [3, 4, 6]}
>>> df1 = pd.DataFrame(data=d1)
>>> df1
col1 col2
0 1 3
1 2 4
2 5 6
>>> d2 = {'col1': [1, 2, 6], 'col2': [3, 4, 7]}
>>> df2 = pd.DataFrame(data=d2)
>>> df2
col1 col2
0 1 3
1 2 4
2 6 7
>>> standarized_unique_elems(df1, df2)
>>> df1
col1 col2
0 1 3
1 2 4
2 5 6
3 6 NaN
>>> df2
col1 col2
0 1 3
1 2 4
2 6 7
3 5 NaN
Before posting this question, I gave it my best shot, but can't figure out a good way to append a new row at the bottom of each dataframe with the unique element. Here is what I have so far:
def standardize_shape(df1, df2):
    unique_elements = list(set(df1.iloc[:, 0]).symmetric_difference(set(df2.iloc[:, 0])))
    for elem in unique_elements:
        if elem not in df1.iloc[:, 0].tolist():
            # append a new row with the unique element, rest of the values NaN
            pass
        if elem not in df2.iloc[:, 0].tolist():
            # append a new row with the unique element, rest of the values NaN
            pass
    return (df1, df2)
I am still new to Pandas, so any help would be greatly appreciated!
We can do
out1 = pd.concat([df1, pd.DataFrame({'col1': df2.loc[~df2.col1.isin(df1.col1), 'col1']})])
Out[269]:
col1 col2
0 1 3.0
1 2 4.0
2 5 6.0
2 6 NaN
#out2 = pd.concat([df2,pd.DataFrame({'col1':df1.loc[~df1.col1.isin(df2.col1),'col1']})])
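Putting it together, here is a minimal sketch of the full helper in the shape of the asker's standardize_shape (my code; it assumes the first columns of both frames carry the same name, here col1):
def standardize_shape(df1, df2):
    # first-column values missing from the other frame
    only_in_2 = df2.loc[~df2.iloc[:, 0].isin(df1.iloc[:, 0]), df2.columns[0]]
    only_in_1 = df1.loc[~df1.iloc[:, 0].isin(df2.iloc[:, 0]), df1.columns[0]]
    # appending a frame that has only col1 fills the remaining columns with NaN
    df1 = pd.concat([df1, only_in_2.to_frame()], ignore_index=True)
    df2 = pd.concat([df2, only_in_1.to_frame()], ignore_index=True)
    return df1, df2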

pandas: groupby two columns and get random selection of groups such that each value in the first column will be represented by a single group

It's similar to this question, but with an additional level of complexity.
In my case, I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'col1': list('aaabbbabababbaaa'), 'col2': list('cdddccdsssssddcd'), 'val': range(0, 16)})
output:
col1 col2 val
0 a c 0
1 a d 1
2 a d 2
3 b d 3
4 b c 4
5 b c 5
6 a d 6
7 b s 7
8 a s 8
9 b s 9
10 a s 10
11 b s 11
12 b d 12
13 a d 13
14 a c 14
15 a d 15
My goal is to select random groups of groupby(['col1', 'col2']) such that each value of col1 will be selected only once.
This can be executed by the following code:
import numpy as np

g = df.groupby('col1')
indexes = []
for _, group in g:
    g_ = group.groupby('col2')
    a = np.arange(g_.ngroups)
    np.random.shuffle(a)
    indexes.extend(group[g_.ngroup().isin(a[:1])].index.tolist())
output:
print(df[df.index.isin(indexes)])
col1 col2 val
4 b c 4
5 b c 5
8 a s 8
10 a s 10
However, I'm looking for a more concise and pythonic way to solve this.
Another option is to shuffle your two columns with sample and drop_duplicates by col1, so that you keep only one (col1, col2) couple per col1 value. Then merge the result with df to select all the rows with these couples.
print(df.merge(df[['col1','col2']].sample(frac=1).drop_duplicates('col1')))
col1 col2 val
0 b s 7
1 b s 9
2 b s 11
3 a s 8
4 a s 10
or with groupby and sample, much the same idea, but selecting only one row per col1 value before the merge:
df.merge(df[['col1','col2']].groupby('col1').sample(n=1))
EDIT: to get both the selected rows and the other rows, you can use the indicator parameter of merge and do a left merge, then query each part separately:
m = df.merge(df[['col1','col2']].groupby('col1').sample(1), how='left', indicator=True)
print(m)
select_ = m.query('_merge=="both"')[df.columns]
print(select_)
comp_ = m.query('_merge=="left_only"')[df.columns]
print(comp_)
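A small aside from me: GroupBy.sample requires pandas >= 1.1, and passing random_state makes the selection reproducible (42 here is an arbitrary seed):
# reproducible variant: fix the seed so the same (col1, col2) couples are picked each run
picked = df[['col1', 'col2']].groupby('col1').sample(n=1, random_state=42)
out = df.merge(picked)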

Merge Pandas DataFrame using apply() to only merge on partial match in two columns

I need to merge two pandas DataFrames but not only on exact column values, but also on approximate ones.
For example, I have these two DataFrames:
import pandas as pd
d = {'col1': ["a", "b", "c", "d"], 'col2': [3, 4, 66, 120]}
df = pd.DataFrame(data=d)
col1 col2
0 a 3
1 b 4
2 c 66
3 d 120
d2 = {'col1a': ["aa", "bb", "cc", "dd"], 'col2b': [3, 4, 67, 100]}
df2 = pd.DataFrame(data=d2)
col1a col2b
0 aa 3
1 bb 4
2 cc 67
3 dd 100
Now, if I simply join them on col2 and col2b columns, I will only get two rows where the column values are exactly the same.
pd.merge(df, df2, how='inner', left_on='col2', right_on='col2b')
col1 col2 col1a col2b
0 a 3 aa 3
1 b 4 bb 4
Now, say, for the simplicity of the example, I also want to match column values that are within +1 or -1 of the integer value in the left DataFrame. In our example, the value 66 in the left DataFrame should be matched to 67 in the right DataFrame, in addition to the rows with values 3 and 4:
col1 col2 col1a col2b
0 a 3 aa 3
1 b 4 bb 4
2 c 66 cc 67
I am not sure how to approach this problem; would I somehow need to merge based on the approximate column values, using apply()?
Here is one way, using merge_asof:
pd.merge_asof(df, df2, left_on='col2', right_on='col2b', tolerance=1, direction='nearest').dropna()
Out[7]:
col1 col2 col1a col2b
0 a 3 aa 3.0
1 b 4 bb 4.0
2 c 66 cc 67.0
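One caveat worth adding (not from the original answer): merge_asof requires both frames to be sorted on the join keys. The example data already is; for unsorted data you would sort first:
# merge_asof needs sorted keys; sort both frames before joining
out = pd.merge_asof(df.sort_values('col2'),
                    df2.sort_values('col2b'),
                    left_on='col2', right_on='col2b',
                    tolerance=1, direction='nearest').dropna()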

Pandas replace, multi column criteria

I'm trying to replace values in a Pandas data frame, based on certain criteria on multiple columns. For a single column criteria this can be done very elegantly with a dictionary (e.g. Remap values in pandas column with a dict):
import pandas as pd
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2}, 'col2': {0:10, 1:20, 2:20}})
rdict = {1:'a', 2:'b'}
df2 = df.replace({"col1": rdict})
Input df:
col1 col2
0 1 10
1 1 20
2 2 20
Resulting df2:
col1 col2
0 a 10
1 a 20
2 b 20
I'm trying to extend this to criteria over multiple columns (e.g. where col1==1 and col2==10 -> replace). For a single criterion this can be done like:
df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'c'
Which results in a df3:
col1 col2
0 c 10
1 1 20
2 2 20
My real life problem has a large number of criteria, which would involve a large number of df3.loc[((criteria1)&(criteria2)), column] = value calls, which is far less elegant than the replacement using a dictionary as a "lookup table". Is it possible to extend the elegant solution (df2 = df.replace({"col1": rdict})) to a setup where values in one column are replaced by criteria based on multiple columns?
An example of what I'm trying to achieve (although in my real life case the number of criteria is a lot larger):
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})
df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'a'
df3.loc[((df['col1']==1)&(df['col2']==20)), 'col1'] = 'b'
df3.loc[((df['col1']==2)&(df['col2']==10)), 'col1'] = 'c'
df3.loc[((df['col1']==2)&(df['col2']==20)), 'col1'] = 'd'
Input df:
col1 col2
0 1 10
1 1 20
2 2 10
3 2 20
Resulting df3:
col1 col2
0 a 10
1 b 20
2 c 10
3 d 20
We can use merge.
Suppose your df looks like
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2, 4:2, 5:1}, 'col2': {0:10, 1:20, 2:10, 3:20, 4: 20, 5:10}})
col1 col2
0 1 10
1 1 20
2 2 10
3 2 20
4 2 20
5 1 10
And your conditional replacement can be represented as another dataframe:
df_replace
col1 col2 val
0 1 10 a
1 1 20 b
2 2 10 c
3 2 20 d
(As OP (Bart) pointed out, you can save this in a csv file.)
Then you can use
df = df.merge(df_replace, on=["col1", "col2"], how="left")
col1 col2 val
0 1 10 a
1 1 20 b
2 2 10 c
3 2 20 d
4 2 20 d
5 1 10 a
Then you just need to drop col1.
As MaxU pointed out, there could be rows that do not get replaced, resulting in NaN. We can use a line like
df["val"] = df["val"].combine_first(df["col1"])
to fill in values from col1 wherever the merged val is NaN.
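For completeness, a compact version of the whole pattern (my sketch, reusing df and df_replace from above):
out = (df.merge(df_replace, on=['col1', 'col2'], how='left')
         .assign(col1=lambda d: d['val'].combine_first(d['col1']))  # keep col1 where no rule matched
         .drop(columns='val'))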
Demo:
Source DF:
In [120]: df
Out[120]:
col1 col2
0 1 10
1 1 10
2 1 20
3 1 20
4 2 10
5 2 20
6 3 30
Conditions & Replacements DF:
In [121]: cond
Out[121]:
col1 col2 repl
1 1 20 b
2 2 10 c
0 1 10 a
3 2 20 d
Solution:
In [121]: res = df.merge(cond, how='left')
yields:
In [122]: res
Out[122]:
col1 col2 repl
0 1 10 a
1 1 10 a
2 1 20 b
3 1 20 b
4 2 10 c
5 2 20 d
6 3 30 NaN # <-- NOTE
In [123]: res['col1'] = res.pop('repl').fillna(res['col1'])
In [124]: res
Out[124]:
col1 col2
0 a 10
1 a 10
2 b 20
3 b 20
4 c 10
5 d 20
6 3 30
This method is likely to be more efficient than the pandas-level alternatives, as it relies on numpy arrays and a plain dictionary mapping.
import pandas as pd
df = pd.DataFrame({'col1': {0: 1, 1: 1, 2: 2, 3: 2}, 'col2': {0: 10, 1: 20, 2: 10, 3: 20}})
rdict = {(1, 10): 'a', (1, 20): 'b', (2, 10): 'c', (2, 20): 'd'}
df['col1'] = list(map(rdict.get, [tuple(x) for x in df[['col1', 'col2']].values]))
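An equivalent spelling (my sketch, applying the same tuple-keyed rdict to the original df) maps the dictionary over a MultiIndex built from the two criteria columns; pairs missing from rdict come back as NaN:
# build (col1, col2) keys and look each one up in the tuple-keyed dictionary
keys = pd.MultiIndex.from_frame(df[['col1', 'col2']])
df['col1'] = keys.map(rdict)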
