I have a dataframe df1 which looks like:
c k l
0 A 1 a
1 A 2 b
2 B 2 a
3 C 2 a
4 C 2 d
and another called df2 like:
c l
0 A b
1 C a
I would like to filter df1, keeping only the rows whose (c, l) values ARE NOT in df2. The pairs to filter out are therefore (A, b) and (C, a). So far I have tried the isin method:
d = df1[~(df1['l'].isin(df2['l']) & df1['c'].isin(df2['c']))]
That seems too complicated to me, and it returns:
c k l
2 B 2 a
4 C 2 d
but I'm expecting:
c k l
0 A 1 a
2 B 2 a
4 C 2 d
You can do this efficiently using isin on a multiindex constructed from the desired columns:
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})
keys = list(df2.columns.values)
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
df1[~i1.isin(i2)]
I think this improves on @IanS's similar solution because it doesn't depend on the column dtype (i.e. it works with numbers as well as strings).
(Above answer is an edit. Following was my initial answer)
Interesting! This is something I haven't come across before... I would probably solve it by merging the two DataFrames, then dropping the rows where df2 has a match. Here is an example, which makes use of a temporary marker column:
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})
# create a column marking df2 values
df2['marker'] = 1
# join the two, keeping all of df1's indices
joined = pd.merge(df1, df2, on=['c', 'l'], how='left')
joined
# extract desired columns where marker is NaN
joined[pd.isnull(joined['marker'])][df1.columns]
There may be a way to do this without the temporary marker column, but I can't think of one. As long as your data isn't huge, the method above should be fast enough.
This is pretty succinct and works well once both frames are indexed by the key columns, as in the answer above:
df1 = df1[~df1.index.isin(df2.index)]
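For reference, here is a minimal end-to-end sketch of that pattern, assuming both frames are first indexed by the shared key columns as in the accepted answer (variable names are illustrative only):

import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'], 'l': ['b', 'a']})

keys = ['c', 'l']
df1i = df1.set_index(keys)  # index each frame by the columns that define a match
df2i = df2.set_index(keys)

# keep only the rows of df1 whose (c, l) pair does not appear in df2
result = df1i[~df1i.index.isin(df2i.index)].reset_index()
print(result)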
Using DataFrame.merge & DataFrame.query:
A more elegant method is to do a left join with indicator=True, then keep only the rows marked left_only using query:
d = (
    df1.merge(df2,
              on=['c', 'l'],
              how='left',
              indicator=True)
       .query('_merge == "left_only"')
       .drop(columns='_merge')
)
print(d)
c k l
0 A 1 a
2 B 2 a
4 C 2 d
indicator=True adds an extra column _merge to the result, marking each row as left_only, both, or right_only:
df1.merge(df2, on=['c', 'l'], how='left', indicator=True)
c k l _merge
0 A 1 a left_only
1 A 2 b both
2 B 2 a left_only
3 C 2 a both
4 C 2 d left_only
I think this is quite a simple approach when you want to filter a dataframe based on multiple columns from another dataframe, or even based on a custom list.
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})
#values of df2 columns 'c' and 'l' that will be used to filter df1
idxs = list(zip(df2.c.values, df2.l.values)) #[('A', 'b'), ('C', 'a')]
#keep only the rows of df1 whose (c, l) pair is not among the df2 pairs (idxs)
df1 = df1[~pd.Series(list(zip(df1.c, df1.l)), index=df1.index).isin(idxs)]
How about:
df1['key'] = df1['c'] + df1['l']
d = df1[~df1['key'].isin(df2['c'] + df2['l'])].drop(['key'], axis=1)
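One caveat: plain concatenation can collide (for example, 'AB' + 'c' and 'A' + 'Bc' produce the same key). If that is a risk, a separator that cannot occur in the data avoids it; a small sketch, assuming '|' never appears in either column:

key1 = df1['c'] + '|' + df1['l']
key2 = df2['c'] + '|' + df2['l']
d = df1[~key1.isin(key2)]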
Another option that avoids creating an extra column or doing a merge would be to do a groupby on df2 to get the distinct (c, l) pairs and then just filter df1 using that.
gb = df2.groupby(['c', 'l']).groups
df1[[p not in gb for p in zip(df1['c'], df1['l'])]]
For this small example, it actually seems to run a bit faster than the pandas-based approach (666 µs vs. 1.76 ms on my machine), but I suspect it could be slower on larger examples since it's dropping into pure Python.
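For what it's worth, the same idea works with a plain Python set of tuples, which skips the groupby entirely; a sketch on the same data (not benchmarked here, so treat the timings above as indicative only):

pairs = set(zip(df2['c'], df2['l']))
df1[[p not in pairs for p in zip(df1['c'], df1['l'])]]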
You can concatenate both DataFrames and drop all duplicates:
pd.concat([df1, df2]).drop_duplicates(subset=['c', 'l'], keep=False)
Output:
c k l
0 A 1.0 a
2 B 2.0 a
4 C 2.0 d
Note that this method doesn't work if df1 itself contains duplicate (c, l) pairs, because keep=False will drop those rows as well.
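If you are unsure whether that caveat applies to your data, a quick check before relying on keep=False might look like this (just a sketch):

# True if df1 itself already contains repeated (c, l) pairs, in which case
# keep=False would also drop those legitimate rows
df1.duplicated(subset=['c', 'l']).any()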
I have two pandas dataframes: one with IDs and values, and another that maps IDs to other IDs. The objective is to build a new dataframe based on df1: for each SourceId in df1, look up df2 (the mapping dataframe) for matching SourceId values. If a match is found, a new row is created with the same value as in df1; if multiple matches are found (e.g. for ids A and C), multiple rows are created, and if only one match is found (e.g. for id B), only one row is created.
The code below does exactly what I want, but it does it very slowly. In my original dataset df1 has 440K rows and df2 has mappings for thousands of different IDs - currently the code runs at 10-25 iterations per second, which is far too slow.
Is there a faster way to do this that would benefit from matrix calculations/other benefits of numpy/pandas?
import pandas as pd
df1 = pd.DataFrame({
    'SourceId': ['A', 'B', 'C', 'A', 'C', 'B'],
    'value': [1, 5, 12, 30, 32, 55],
    'time': [pd.to_datetime('2020-04-04 08:49:52.166498900+0000'),
             pd.to_datetime('2020-08-14 06:12:40.860460500+0000'),
             pd.to_datetime('2020-05-13 09:20:50.052688900+0000'),
             pd.to_datetime('2020-03-09 13:55:17.335340600+0000'),
             pd.to_datetime('2020-08-14 09:30:56.359635400+0000'),
             pd.to_datetime('2020-01-31 23:03:46.539892900+0000')],
    'otherInfo': ['0A10a', '055jA', 'boAqz', '0t,m5A', '09tjq1', 'akk_1!']})
df2 = pd.DataFrame({'SourceId': ['A', 'A', 'B', 'C', 'C', 'C'],
                    'TargetId': ['A', 'Q', 'B', 'C', 'B', 'X'],
                    'trueIfMatch': [1, 0, 1, 1, 0, 0]})
df3 = pd.DataFrame()
for r in df1.itertuples():
    SourceId = r.SourceId
    value = r.value
    time = r.time
    otherInfo = r.otherInfo
    if SourceId in df2.SourceId.unique():
        entries = df2.loc[df2.SourceId == SourceId].TargetId.tolist()
        for entry in entries:
            df3 = df3.append({
                'sourceId': SourceId,
                'targetId': entry,
                'value': value,
                'time': time,
                'otherInfo': otherInfo
            }, ignore_index=True)
display(df3)
Use df.merge with sort_values:
In [2293]: df3 = df1.merge(df2, on='SourceId').sort_values('value')
In [2294]: df3
Out[2294]:
SourceId value TargetId
0 A 1 A
1 A 1 Q
4 B 5 B
6 C 12 C
7 C 12 B
8 C 12 X
2 A 30 A
3 A 30 Q
9 C 32 C
10 C 32 B
11 C 32 X
5 B 55 B
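Note that the real merge result also carries the time, otherInfo and trueIfMatch columns; they are simply not shown above. If you want to reproduce the exact column names and casing produced by the original loop, a sketch along these lines should work (assuming an inner join on SourceId is acceptable):

df3 = (df1.merge(df2[['SourceId', 'TargetId']], on='SourceId', how='inner')
          .rename(columns={'SourceId': 'sourceId', 'TargetId': 'targetId'})
          .loc[:, ['sourceId', 'targetId', 'value', 'time', 'otherInfo']])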
I am trying to find the number of people of a certain group who appear in other groups. For instance, here is the Pandas dataframe:
d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'], 'group': ['a', 'b', 'c', 'b', 'b', 'a']}
Which looks like this:
Ash: a
Psyduck: b
Pikachu: c
Charizard: b
Ash: b
Psyduck: a
I am trying to create a cross tabulation that looks like the following:
a b c
a 2 2 0
b 2 3 0
c 0 0 1
Essentially, this cross tab shows how many members of group x are also members of group y. For example, there are 2 people who are in both group a and group b, so there is a 2 at the intersection of that row and column.
I have used Pandas cross tab function but it doesn't give the result that I am looking for.
import pandas as pd
d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'], 'group': ['a', 'b', 'c', 'b', 'b', 'a']}
df = pd.DataFrame(d)
df = df.merge(df, on='name')
print(
pd.crosstab(df.group_x, df.group_y)
)
Output:
group_y a b c
group_x
a 2 2 0
b 2 3 0
c 0 0 1
Demo: https://repl.it/#alexmojaki/TragicFrigidConditions
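The key step is the self-merge on name: it produces one row for every pair of group memberships a person has (including the pair of a group with itself), and crosstab then simply counts those pairs. A quick look at the intermediate frame, rebuilt from scratch here so it can be run standalone, makes that concrete:

import pandas as pd

d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'],
     'group': ['a', 'b', 'c', 'b', 'b', 'a']}
base = pd.DataFrame(d)

# every pairing of a person's group memberships, including (x, x) pairs
pairs = base.merge(base, on='name')
print(pairs[['name', 'group_x', 'group_y']])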
This is a simplified example of what I want to do:
data1 = {'one':['A', 'E', 'G'], 'two':['B', 'D', 'H'], 'three':['C', 'F', 'J']}
df1 = pd.DataFrame(data1)
df1
one two three
0 A B C
1 E D F
2 G H J
data2 = {'one':['C', 'F', 'P'], 'two':['B', 'D', 'R'], 'three':['A', 'E', 'C']}
df2 = pd.DataFrame(data2)
df2
one two three
0 C B A
1 F D E
2 P R C
I want a function that will show me something like this:
diff(df1, df2) # this syntax can be different
one two three from
0 G H J df1
1 P R C df2
Basically, find rows where column two has the same text in both DataFrames; if the one and three columns are merely swapped, that is fine and the row should not be added to the new frame.
I know how to do it with a loop, but I would like to know the pandas way of doing this.
Using pandas.Index.symmetric_difference
df1.set_index(df1.apply(frozenset, 1), inplace=True)
df2.set_index(df2.apply(frozenset, 1), inplace=True)
df1['from'] = 'df1'
df2['from'] = 'df2'
new_df = pd.concat([df1, df2]).loc[df1.index.symmetric_difference(df2.index)].reset_index(drop=True)
print(new_df)
Output:
one three two from
0 G J H df1
1 P C R df2
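For clarity, the frozenset index is what makes rows whose one and three columns are merely swapped compare as equal; a tiny illustration of that property:

# element order does not matter in a frozenset, so swapped columns
# collapse to the same index value
frozenset(['A', 'B', 'C']) == frozenset(['C', 'B', 'A'])  # True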
Simple enough: just compare the columns that you want to be the same and filter on that. In your example:
pd.concat([df.loc[df1["two"] != df2["two"]] for df in (df1, df2)], axis=0)
EDIT: if you want the "from" column as well, change the above line to:
pd.concat([df.loc[df1["two"] != df2["two"]].assign(from_df=df_name) for df, df_name in zip((df1, df2), ("df1", "df2"))], axis=0)
I'm using Python pandas and I want to reshape a dataframe in which one index spans multiple columns, so that the values end up in a single column. Where possible, I also want to drop the zero values.
I have this data frame
index A B C
a 8 0 1
b 2 3 0
c 0 4 0
d 3 2 7
I'd like my output to look like this
index data value
a A 8
b A 2
d A 3
b B 3
c B 4
d B 2
a C 1
d C 7
===
I solved this task as below. My original data has 2 indexes, and the 0 values in the dataframe were actually NaN.
At first, I tried to apply the melt function while removing NaN values, following this (How to melt a dataframe in Pandas with the option for removing NA values), but I couldn't, because my original data has several columns ('value_vars'). So I re-organized the dataframe in 2 steps:
Firstly, I turned the multiple columns into one column with the melt function,
Then I removed the NaN values in each row with the dropna function (a minimal sketch follows below).
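A minimal sketch of those two steps on the example frame above (assuming, as noted, that the 0 entries are really missing values):

import pandas as pd

df = pd.DataFrame({'index': ['a', 'b', 'c', 'd'],
                   'A': [8, 2, 0, 3],
                   'B': [0, 3, 4, 2],
                   'C': [1, 0, 0, 7]})

melted = (df.melt(id_vars='index', var_name='data', value_name='value')
            .replace(0, pd.NA)  # treat 0 as missing, per the note above
            .dropna(subset=['value'])
            .reset_index(drop=True))
print(melted)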
This looks a little like the melt function in pandas, with the only difference being the index.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
Here is some code you can run to test:
import pandas as pd
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},'B': {0: 1, 1: 3, 2: 5},'C': {0: 2, 1: 4, 2: 6}})
pd.melt(df)
With a little manipulation, you could work around the indexing issue.
This is not particularly pythonic, but if you have a limited number of columns, you could make do with:
molten = pd.melt(df)
a = molten.merge(df, left_on='value', right_on = 'A')
b = molten.merge(df, left_on='value', right_on = 'B')
c = molten.merge(df, left_on='value', right_on = 'C')
merge = pd.concat([a,b,c])
try this:
a = [['a', 8, 0, 1], ['b', 2, 3, 0] ... ]
cols = ['A', 'B', 'C']
result = [[[a[i][0], cols[j], a[i][j + 1]] for i in range(len(a))] for j in range(len(cols))]
output:
[[['a', 'A', 8], ['b', 'A', 2]], [['a', 'B', 0], ['b', 'B', 3]] ... ]