Compare two Dataframes but just on specific columns - python

I have two DataFrames (df1 and df2):
df1:
 A   B   C   D
12  52  16  23
19  32  30  09
df2:
 A   G   C   D    E
12  13  16  04  100
I want to create a new column in df1 called 'Compare'.
Then I want to compare the columns 'A' and 'C' of both DataFrames, and if they are the same, set 'Compare' in that row to 'X'.
result = df1[df1["A"].isin(df2["A"].tolist())]
does not work.

You can chain the two conditions with & (bitwise AND) or | (bitwise OR) and set the new values with numpy.where:
mask = df1["A"].isin(df2["A"]) & df1["C"].isin(df2["C"])
df1['Compare'] = np.where(mask, 'X', '')
print(df1)
    A   B   C   D Compare
0  12  52  16  23       X
1  19  32  30   9
Or use DataFrame.merge with left join and indicator=True:
s = df1[['A','C']].merge(df2[['A','C']], how='left', indicator=True)['_merge']
df1['Compare'] = np.where(s == 'both', 'X', '')
print(df1)
    A   B   C   D Compare
0  12  52  16  23       X
1  19  32  30   9
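For reference, a minimal self-contained version of both approaches (a sketch that rebuilds the question's data; it assumes only pandas and numpy):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [12, 19], 'B': [52, 32], 'C': [16, 30], 'D': [23, 9]})
df2 = pd.DataFrame({'A': [12], 'G': [13], 'C': [16], 'D': [4], 'E': [100]})

# approach 1: two isin masks combined with bitwise AND
mask = df1['A'].isin(df2['A']) & df1['C'].isin(df2['C'])
df1['Compare'] = np.where(mask, 'X', '')

# approach 2: left merge on ['A', 'C'] with indicator=True
s = df1[['A', 'C']].merge(df2[['A', 'C']], how='left', indicator=True)['_merge']
df1['Compare'] = np.where(s == 'both', 'X', '')
print(df1)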

How to split dataframe cells using a delimiter into different dataframes, with conditions

There are other questions on the same topic and they helped but I have an extra twist.
I have a dataframe with multiple values in most (but not all) cells.
df = pd.DataFrame({'a':["10-30-410","20-40-500","25-50"], 'b':["5-8-9","4", "99"]})
index  a          b
0      10-30-410  5-8-9
1      20-40-500  4
2      25-50      99
How can I split each cell by the dash "-" and create three new dataframes? Note that not all cells have multiple values, in which case the second and third dataframes get NA or blank (treating these as strings).
So I need df1 to be the first of those values:
index  a   b
0      10  5
1      20  4
2      25  99
And df2 would be:
index  a   b
0      30  8
1      40
2      50
And likewise for df3:
index  a    b
0      410  9
1      500
2
I got df1 with this:
df1 = df.replace(r'(\d+).*(\d+).*(\d+)+', r'\1', regex=True)
But df2 doesn't quite work. I get the second values but also 4 and 99, which should be blank.
df2 = df.replace(r'(\d+)-(\d+).*', r'\2', regex=True)
index  a   b
0      30  8
1      40  4   (should be blank)
2      50  99  (should be blank)
Is this the right approach? I'm pretty good on regex but fuzzy with groups. Thank you.
Use str.split + concat + stack to get the data in a more usable format:
new_df = pd.concat(
    (df['a'].str.split('-', expand=True),
     df['b'].str.split('-', expand=True)),
    keys=('a', 'b'),
    axis=1
).stack(dropna=False).droplevel(0)
new_df:
      a     b
0    10     5
1    30     8
2   410     9
0    20     4
1    40  None
2   500  None
0    25    99
1    50  None
2  None  None
Expandable option for n cols:
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
Then groupby level 0 + reset_index to create a list of dataframes:
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
dfs:
[    a   b
 0  10   5
 1  20   4
 2  25  99,
      a     b
 0   30     8
 1   40  None
 2   50  None,
       a     b
 0   410     9
 1   500  None
 2  None  None]
Complete Working Example:
import pandas as pd

df = pd.DataFrame({
    'a': ["10-30-410", "20-40-500", "25-50"],
    'b': ["5-8-9", "4", "99"]
})
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
print(dfs)
You can also try with filter:
k = pd.concat((df[c].str.split('-', expand=True).add_prefix(c + '-')
               for c in df.columns), axis=1).fillna('')
df1 = k.filter(like='0')
df2 = k.filter(like='1')
df3 = k.filter(like='2')
NOTE: To strip the digit from the columns, use: k.filter(like='0').rename(columns=lambda x: x.split('-')[0])
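Applying that NOTE to all three frames at once might look like this (a sketch building on k from above; range(3) assumes at most three split parts per cell):
df1, df2, df3 = (k.filter(like=str(i)).rename(columns=lambda x: x.split('-')[0])
                 for i in range(3))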

Group dataframe by separation in intervals

I have a dataframe that has intervals and a label associated with each. I need to group and aggregate rows that lie within a given distance of the others.
For example, rows whose start/end are within 3 units of the start/end of other rows should be grouped, with their label fields concatenated:
In [16]: df = pd.DataFrame([
...: [ 1, 3,'a'], [ 4,10,'b'],
...: [15,17,'c'], [18,20,'d'],
...: [27,30,'e'], [31,40,'f'], [41,42,'g'],
...: [50,54,'h']],
...: columns=['start', 'end', 'label'])
...:
In [17]: df
Out[17]:
   start  end label
0      1    3     a
1      4   10     b
2     15   17     c
3     18   20     d
4     27   30     e
5     31   40     f
6     41   42     g
7     50   54     h
Desired output:
In [18]: df_desired = group_by_interval(df)
In [19]: df_desired
Out[19]:
   start  end  label
0      1   10    a b
1     15   20    c d
2     27   42  e f g
3     50   54      h
How can I execute this sort of grouping by interval with a dataframe?
I have found a similar SO question, but it's a little different since I don't know where to cut a priori.
You can create a grouper based on the gap condition and aggregate: comparing each row's start with the previous row's end flags where a new group starts, and the cumulative sum turns those flags into group ids:
grouper = ((df['start'] - df['end'].shift()) > 3).cumsum()
df.groupby(grouper).agg({'start': 'first', 'end': 'last', 'label': lambda x: ' '.join(x)})
   start  end  label
0      1   10    a b
1     15   20    c d
2     27   42  e f g
3     50   54      h
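Put together as a runnable sketch (the final reset_index is an addition to get the clean 0..n index shown in the desired output):
import pandas as pd

df = pd.DataFrame([
    [1, 3, 'a'], [4, 10, 'b'],
    [15, 17, 'c'], [18, 20, 'd'],
    [27, 30, 'e'], [31, 40, 'f'], [41, 42, 'g'],
    [50, 54, 'h']],
    columns=['start', 'end', 'label'])

# True whenever the gap to the previous row's end exceeds 3;
# the cumulative sum turns each run of close rows into one group id
grouper = ((df['start'] - df['end'].shift()) > 3).cumsum()

out = (df.groupby(grouper)
         .agg({'start': 'first', 'end': 'last', 'label': ' '.join})
         .reset_index(drop=True))
print(out)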

pandas DataFrame: normalize one JSON column and merge with other columns

I have a pandas DataFrame with one column containing JSON data as a list of dicts in each row. I want to normalize the JSON column and duplicate the non-JSON columns:
# creating dataframe
import json
import pandas as pd

df_actions = pd.DataFrame(columns=['id', 'actions'])
rows = [[12, json.loads('[{"type": "a","value": "17"},{"type": "b","value": "19"}]')],
        [15, json.loads('[{"type": "a","value": "1"},{"type": "b","value": "3"},{"type": "c","value": "5"}]')]]
df_actions.loc[0] = rows[0]
df_actions.loc[1] = rows[1]
>>> df_actions
   id                                            actions
0  12  [{'type': 'a', 'value': '17'}, {'type': 'b', '...
1  15  [{'type': 'a', 'value': '1'}, {'type': 'b', 'v...
I want:
>>> df_actions_parsed
id  type  value
12     a     17
12     b     19
15     a      1
15     b      3
15     c      5
I can normalize the JSON data using:
pd.concat([pd.json_normalize(x) for x in df_actions['actions']], ignore_index=True)
but I don't know how to join that back to the id column of the original DataFrame.
You can use concat with a dict comprehension, using pop to extract the column, then remove the second index level and join back to the original:
df1 = (pd.concat({i: pd.DataFrame(x) for i, x in df_actions.pop('actions').items()})
         .reset_index(level=1, drop=True)
         .join(df_actions)
         .reset_index(drop=True))
Which is the same as:
df1 = (pd.concat({i: pd.json_normalize(x) for i, x in df_actions.pop('actions').items()})
         .reset_index(level=1, drop=True)
         .join(df_actions)
         .reset_index(drop=True))
print(df1)
  type value  id
0    a    17  12
1    b    19  12
2    a     1  15
3    b     3  15
4    c     5  15
Another solution, if performance is important:
L = [{'i': k, **y} for k, v in df_actions.pop('actions').items() for y in v]
df_actions = df_actions.join(pd.DataFrame(L).set_index('i')).reset_index(drop=True)
print(df_actions)
   id type value
0  12    a    17
1  12    b    19
2  15    a     1
3  15    b     3
4  15    c     5
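For clarity, the list comprehension above unrolls to something like this (a sketch; records and lookup are illustrative names, and df_actions is assumed to still hold the 'actions' column):
records = []
for k, v in df_actions.pop('actions').items():  # k = row label, v = list of dicts
    for y in v:
        records.append({'i': k, **y})           # tag each dict with its source row label

lookup = pd.DataFrame(records).set_index('i')   # one row per dict, indexed by source row
df_actions = df_actions.join(lookup).reset_index(drop=True)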
Here's another solution that uses explode and json_normalize:
exploded = df_actions.explode("actions")
pd.concat([exploded["id"].reset_index(drop=True),
           pd.json_normalize(exploded["actions"])], axis=1)
Here's the result:
   id type value
0  12    a    17
1  12    b    19
2  15    a     1
3  15    b     3
4  15    c     5

Multiply dataframe with values from another dataframe

I have two dataframes
df1 = pd.DataFrame([[1,2],[3,4],[5,6],[7,8]], index=['a','b','c','a'], columns=['d','e'])
   d  e
a  1  2
b  3  4
c  5  6
a  7  8
df2 = pd.DataFrame([['a',10],['b',20],['c',30],['f',40]])
   0   1
0  a  10
1  b  20
2  c  30
3  f  40
I want each row of df1 to be multiplied by the factor whose label in df2 matches that row's index (e.g. 20 for 'b'),
so my output should look like:
     d    e
a   10   20
b   60   80
c  150  180
a   70   80
Please suggest a solution that works when df1 is hundreds of rows long; all I could think of was looping through df1.index.
Use set_index and reindex to align df2 with df1, then mul:
In [1150]: df1.mul(df2.set_index(0).reindex(df1.index)[1], axis=0)
Out[1150]:
     d    e
a   10   20
b   60   80
c  150  180
a   70   80
Create a mapping and call df.apply:
In [1128]: mapping = dict(df2.values)
In [1129]: df1.apply(lambda x: x * mapping[x.name], axis=1)
Out[1129]:
     d    e
a   10   20
b   60   80
c  150  180
a   70   80
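A vectorized variant of the same mapping idea (a sketch building on the df1/df2 above, not from the original answers) avoids apply entirely:
mapping = dict(df2.values)                   # {'a': 10, 'b': 20, 'c': 30, 'f': 40}
factors = df1.index.map(mapping).to_numpy()  # one factor per row: [10, 20, 30, 10]
result = df1.mul(factors, axis=0)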
IIUC:
In [55]: df1 * pd.DataFrame(np.tile(df2[[1]], 2), columns=df1.columns, index=df2[0])
Out[55]:
     d    e
a   10   20
a   70   80
b   60   80
c  150  180
Helper DF:
In [57]: pd.DataFrame(np.tile(df2[[1]], 2), columns=df1.columns, index=df2[0])
Out[57]:
    d   e
0
a  10  10
b  20  20
c  30  30
This is straightforward. You just make sure they share a common axis, then you can combine them. First, put the lookup column into the index:
df2.set_index(0, inplace=True)
    1
0
a  10
b  20
c  30
Now you can put that column into df1 very easily:
df1['multiplying_factor'] = df2[1]
Now you just want to multiply two columns:
df1['final_value'] = df1.e * df1.multiplying_factor
Now df1 looks like:
   d  e  multiplying_factor  final_value
a  1  2                  10           20
b  3  4                  20           80
c  5  6                  30          180
a  7  8                  10           80

Pandas check if row exist in another dataframe and append index

I'm having a problem iterating over my dataframe. The way I'm doing it takes a long time, and I don't have that many rows (around 300k).
What am I trying to do?
Check if one DF (A) contains the values of two columns of the other DF (B). You can think of this as a multi-column key.
If True, get the index of DF.B and assign it to one column of DF.A.
If False, two steps:
a. append the two columns not found to DF.B
b. assign the new ID to DF.A (I couldn't do this one)
This is my code, where:
df is DF.A and df_id is DF.B:
SampleID and ParentID are the two columns I want to check for existence in both dataframes
Real_ID is the column to which I want to assign the id of DF.B (df_id)
for index, row in df.iterrows():
    # check if columns exist in the other dataframe
    real_id = df_id[(df_id['SampleID'] == row['SampleID']) & (df_id['ParentID'] == row['ParentID'])]
    if real_id.empty:
        # row does not exist, append to df_id
        df_id = df_id.append(row[['SampleID', 'ParentID']])
    else:
        # row exists, assign id of df_id to df
        row['Real_ID'] = real_id.index
EXAMPLE:
DF.A (df)
  Real_ID  SampleID  ParentID Something AnotherThing
0               20        21         a            b
1               10        11         a            b
2               40        51         a            b
DF.B (df_id)
   SampleID  ParentID
0        10        11
1        20        21
Result:
   Real_ID  SampleID  ParentID Something AnotherThing
0        1        10        11         a            b
1        0        20        21         a            b
2        2        40        51         a            b
   SampleID  ParentID
0        20        21
1        10        11
2        40        51
Again, this solution is very slow. I'm sure there is a better way to do this, and that's why I'm asking here. Unfortunately this was all I had after some hours...
Thanks
You can do it this way.
Data (pay attention to the index in the B DF):
In [276]: cols = ['SampleID', 'ParentID']
In [277]: A
Out[277]:
   Real_ID  SampleID  ParentID Something AnotherThing
0      NaN        10        11         a            b
1      NaN        20        21         a            b
2      NaN        40        51         a            b
In [278]: B
Out[278]:
   SampleID  ParentID
3        10        11
5        20        21
Solution:
In [279]: merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)
In [280]: merged
Out[280]:
   SampleID  ParentID     _merge
0        10        11       both
1        20        21       both
2        40        51  left_only
In [281]: B = pd.concat([B, merged.loc[merged._merge == 'left_only', cols]])
In [282]: B
Out[282]:
   SampleID  ParentID
3        10        11
5        20        21
2        40        51
In [285]: A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']
In [286]: A
Out[286]:
   Real_ID  SampleID  ParentID Something AnotherThing
0        3        10        11         a            b
1        5        20        21         a            b
2        2        40        51         a            b
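On current pandas releases .ix and DataFrame.append no longer exist, so a self-contained restatement of the same approach (a sketch) would be:
import pandas as pd

cols = ['SampleID', 'ParentID']
A = pd.DataFrame({'Real_ID': [None] * 3,
                  'SampleID': [10, 20, 40], 'ParentID': [11, 21, 51],
                  'Something': ['a'] * 3, 'AnotherThing': ['b'] * 3})
B = pd.DataFrame({'SampleID': [10, 20], 'ParentID': [11, 21]}, index=[3, 5])

merged = A[cols].merge(B, on=cols, how='outer', indicator=True)
B = pd.concat([B, merged.loc[merged['_merge'] == 'left_only', cols]])  # append unseen keys
A['Real_ID'] = A[cols].merge(B.reset_index(), on=cols)['index']        # map keys to B's index
print(A)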
