How to match and select values from dataframes of different lengths?

I have 2 data frames of different lengths. Each has 3 id columns, and I would like to transfer the value from the second data frame to the first one only when all the id columns are the same. For example, the value 'bb' will be added to a new column in df1 at the 3rd row. How should I do it?
df1 = pd.DataFrame({'id1' : [1, 2, 'aa', 4, 5], 'id2': ['a', 'a', 'aa', 'd','e'], 'id3': ['p', 'r', 'aa', 'i', 't']})
df2 = pd.DataFrame({'id1': [ 6, 7, 6, 5, 4, 1, 'aa' ], 'id2':['f','d','c','f','b','z','aa'], 'id3':['a', 'f', 'q', 'b', 't', 't','aa'], \
'value' : [5, 4,7,6, 8, 5 , 'bb']})
id1 id2 id3
0 1 a p
1 2 a r
2 aa aa aa
3 4 d i
4 5 e t
[5 rows x 3 columns]
id1 id2 id3 value
0 6 f a 5
1 7 d f 4
2 6 c q 7
3 5 f b 6
4 4 b t 8
5 1 z t 5
6 aa aa aa bb

You can just perform a left style merge:
In [116]:
df1.merge(df2, how='left')
Out[116]:
id1 id2 id3 value
0 1 a p NaN
1 2 a r NaN
2 aa aa aa bb
3 4 d i NaN
4 5 e t NaN
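Without an `on` argument, merge joins on every column name the two frames share, which here happens to be exactly the three id columns. A minimal sketch that names the keys explicitly, so the result does not change if the frames later gain other common columns (the names `keys` and `result` are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'id1': [1, 2, 'aa', 4, 5],
                    'id2': ['a', 'a', 'aa', 'd', 'e'],
                    'id3': ['p', 'r', 'aa', 'i', 't']})
df2 = pd.DataFrame({'id1': [6, 7, 6, 5, 4, 1, 'aa'],
                    'id2': ['f', 'd', 'c', 'f', 'b', 'z', 'aa'],
                    'id3': ['a', 'f', 'q', 'b', 't', 't', 'aa'],
                    'value': [5, 4, 7, 6, 8, 5, 'bb']})

# Join only on the three id columns; rows of df1 without a match keep NaN.
keys = ['id1', 'id2', 'id3']
result = df1.merge(df2, on=keys, how='left')
print(result)
```

Note that if df2 contained several rows with the same id triple, the left merge would repeat the matching df1 row once per match.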

Related

Create multiple columns with Pandas .apply()

I have two pandas DataFrames, both containing the same categories but different 'id' columns. In order to illustrate, the first table looks like this:
df = pd.DataFrame({
'id': list(np.arange(1, 12)),
'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
'weight': list(np.random.randint(1, 5, 11))
})
df['weight_sum'] = df.groupby('category')['weight'].transform('sum')
df['p'] = df['weight'] / df['weight_sum']
Output:
id category weight weight_sum p
0 1 a 4 14 0.285714
1 2 a 4 14 0.285714
2 3 a 2 14 0.142857
3 4 a 4 14 0.285714
4 5 b 4 8 0.500000
5 6 b 4 8 0.500000
6 7 c 3 15 0.200000
7 8 c 4 15 0.266667
8 9 c 2 15 0.133333
9 10 c 4 15 0.266667
10 11 c 2 15 0.133333
The second contains only 'id' and 'category'.
What I'm trying to do is to create a third DataFrame that would inherit the id of the second DataFrame, plus three new columns for the ids of the first DataFrame - each should be selected based on the p column, which represents its weight within that category.
I've tried multiple solutions and was thinking of applying np.random.choice and .apply(), but couldn't figure out a way to make that work.
EDIT:
The desired output would be something like:
user_id id_1 id_2 id_3
0 2 3 1 2
1 3 2 2 3
2 4 1 3 1
With each id being selected based on its probability and respective category (both DataFrames have this column), and the same id not showing up more than once for the same user_id.
IIUC, you want to select random IDs of the same category with weighted probabilities. For this you can construct a helper dataframe (dfg) and use apply:
df2 = pd.DataFrame({
'id': np.random.randint(1, 12, size=11),
'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c']})
dfg = df.groupby('category').agg(list)
df3 = df2.join(df2['category']
.apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
p=dfg.loc[r, 'p'],
size=3)))
.add_prefix('id_')
)
Output:
id category id_0 id_1 id_2
0 11 a 2 3 3
1 10 a 2 3 1
2 4 a 1 2 3
3 7 a 2 1 4
4 5 b 6 5 5
5 10 b 6 5 6
6 8 c 9 8 8
7 11 c 7 8 7
8 11 c 10 8 8
9 4 c 9 10 10
10 1 c 11 11 9
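The output above can repeat an id within a row, because np.random.choice samples with replacement by default, while the question asks for each id to appear at most once per user. A possible variation is to draw without replacement, sketched below under some assumptions: the helper `pick_ids`, the small `df2` with a `user_id` column, and the seed are made up for illustration, and categories with fewer than three ids are padded with NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed is arbitrary, only for repeatable runs

df = pd.DataFrame({
    'id': np.arange(1, 12),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
    'weight': rng.integers(1, 5, size=11),
})
df['p'] = df['weight'] / df.groupby('category')['weight'].transform('sum')

# Hypothetical second table: one user per category.
df2 = pd.DataFrame({'user_id': [2, 3, 4], 'category': ['a', 'b', 'c']})

def pick_ids(category, k=3):
    """Draw up to k distinct ids from one category, weighted by p."""
    group = df[df['category'] == category]
    n = min(k, len(group))  # a category may hold fewer than k ids
    chosen = rng.choice(group['id'].to_numpy(), size=n, replace=False,
                        p=group['p'].to_numpy())
    return list(chosen) + [np.nan] * (k - n)  # pad short rows with NaN

picked = pd.DataFrame([pick_ids(c) for c in df2['category']],
                      index=df2.index,
                      columns=['id_1', 'id_2', 'id_3'])
df3 = df2.join(picked)
print(df3)
```

Category b only has two ids, so its third column stays NaN here; whether to pad, raise, or allow repeats in that case is a design choice the question leaves open.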

Replace values in dataframe where updated versions are in another dataframe [duplicate]

I have two dataframes, something like this:
df1 = pd.DataFrame({
'Identity': [3, 4, 5, 6, 7, 8, 9],
'Value': [1, 2, 3, 4, 5, 6, 7],
'Notes': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})
df2 = pd.DataFrame({
'Identity': [4, 8],
'Value': [0, 128],
})
In[3]: df1
Out[3]:
Identity Value Notes
0 3 1 a
1 4 2 b
2 5 3 c
3 6 4 d
4 7 5 e
5 8 6 f
6 9 7 g
In[4]: df2
Out[4]:
Identity Value
0 4 0
1 8 128
I'd like to use df2 to overwrite df1 but only where values exist in df2, so I end up with:
Identity Value Notes
0 3 1 a
1 4 0 b
2 5 3 c
3 6 4 d
4 7 5 e
5 8 128 f
6 9 7 g
I've been searching through all the various merge, combine, join functions etc, but I can't seem to find one that does what I want. Is there a simple way of doing this?
Use:
df1['Value'] = df1['Identity'].map(df2.set_index('Identity')['Value']).fillna(df1['Value'])
Or use set_index with reindex, then reset_index and fillna (the parentheses let the expression span two lines):
df1['Value'] = (df2.set_index('Identity').reindex(df1['Identity'])
                .reset_index(drop=True)['Value'].fillna(df1['Value']))
>>> df1
Identity Value Notes
0 3 1.0 a
1 4 0.0 b
2 5 3.0 c
3 6 4.0 d
4 7 5.0 e
5 8 128.0 f
6 9 7.0 g
>>>
This fills missing rows in df2 with NaN and fills the NaNs with df1 values.
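As the output shows, the intermediate NaNs turn the Value column into floats. If the integer dtype matters, one option is to cast back after filling. This is a sketch that assumes Identity is unique in df2 and that every replacement value is a whole number; `lookup` and `new_values` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Identity': [3, 4, 5, 6, 7, 8, 9],
    'Value': [1, 2, 3, 4, 5, 6, 7],
    'Notes': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})
df2 = pd.DataFrame({
    'Identity': [4, 8],
    'Value': [0, 128],
})

# Series mapping Identity -> updated Value (Identity assumed unique in df2).
lookup = df2.set_index('Identity')['Value']

# Rows missing from df2 become NaN after map, then fall back to the old value.
new_values = df1['Identity'].map(lookup).fillna(df1['Value'])

# Restore the original integer dtype that the NaN step converted to float.
df1['Value'] = new_values.astype(df1['Value'].dtype)
print(df1)
```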

How to merge multiple column of same data frame

How to merge multiple column values into one column of same data frame and get new column with unique values.
Column1 Column2 Column3 Column4 Column5
0 a 1 2 3 4
1 a 3 4 5
2 b 6 7 8
3 c 7 7
Output:
Column A
a
a
b
c
1
3
6
7
2
4
5
8
Use unstack or melt to reshape, then remove missing values with dropna and duplicates with drop_duplicates (either line below works):
df1 = df.unstack().dropna().drop_duplicates().reset_index(drop=True).to_frame('A')
df1 = df.melt(value_name='A')[['A']].dropna().drop_duplicates().reset_index(drop=True)
print (df1)
A
0 a
1 b
2 c
3 1
4 3
5 6
6 7
7 2
8 4
9 5
10 8
Here is another way to do it if you are ok using numpy. This will handle either NaNs or empty strings in the original dataframe and is a bit faster than unstack or melt.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Column1': ['a', 'a', 'b', 'c'],
'Column2': [1, 3, 6, 7],
'Column3': [2, 4, 7, 7],
'Column4': [3, 5, 8, np.nan],
'Column5': [4, '', '', np.nan]})
u = pd.unique(df.values.flatten(order='F'))
u = u[np.where(~np.isin(u, ['']) & ~pd.isnull(u))[0]]
df1 = pd.DataFrame(u, columns=['A'])
print(df1)
A
0 a
1 b
2 c
3 1
4 3
5 6
6 7
7 2
8 4
9 5
10 8

Performing outer join that merges joined columns

I am performing an outer join on two DataFrames:
df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'date': [4, 5, 6, 7, 8],
'str': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 6],
'date': [4, 5, 6, 7, 8],
'str': ['A', 'B', 'C', 'D', 'Q']})
pd.merge(df1, df2, on=["id","date"], how="outer")
This gives the result
date id str_x str_y
0 4 1 a A
1 5 2 b B
2 6 3 c C
3 7 4 d D
4 8 5 e NaN
5 8 6 NaN Q
Is it possible to perform the outer join such that the str-columns are concatenated? In other words, how to perform the join such that we get the DataFrame
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
where all NaN have been set to None.
I don't think so. A possible solution is to replace the NaNs with empty strings and join the columns together:
df = (pd.merge(df1, df2, on=["id","date"], how="outer", suffixes=('','_'))
.assign(str=lambda x: x['str'].fillna('') + x['str_'].fillna(''))
.drop(columns='str_'))
Similar alternative:
df = (pd.merge(df1, df2, on=["id","date"], how="outer", suffixes=('','_'))
.assign(str=lambda x: x.filter(like='str').fillna('').values.sum(axis=1))
.drop(columns='str_'))
print (df)
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
If the ('id', 'date') pair is unique in each data frame, you can set those columns as the index and add the dataframes:
icols = ['date', 'id']
df1.set_index(icols).add(df2.set_index(icols), fill_value='').reset_index()
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q

Pandas loop a dataframe and compare all rows with other DF rows and assign a value

I have two DF:
df1 = pd.DataFrame({'A':[3, 2, 5, 1, 6], 'B': [4, 8, 5, 6, 2], 'C': [4, 8, 3, 8, 0], 'D':[1, 5, 2, 5, 7], 'zebra': [5, 7, 2, 4, 8]})
df2 = pd.DataFrame({'B': [7, 3, 5, 8, 8], 'D':[4, 5, 8, 5, 3] })
print(df1)
print(df2)
A B C D zebra
0 3 4 4 1 5
1 2 8 8 5 7
2 5 5 3 2 2
3 1 6 8 5 4
4 6 2 0 7 8
B D
0 7 4
1 3 5
2 5 8
3 8 5
4 8 3
This is a simple example; the real df1 has 1000k+ rows and 10+ columns, while df2 has only 24 rows and fewer columns. I would like to go through all rows in df2 and compare specific columns (for example 'B' and 'D') with the columns of the same name in df1. If the values in both columns match a row in df1, the zebra value from that df1 row should be assigned to the matching df2 row in a new zebra column. If no match is found, assign 0 or NaN.
B D zebra
0 7 4 nan
1 3 5 nan
2 5 8 nan
3 8 5 7
4 8 3 nan
In the example, only the row with index 3 in df2 ('B': 8, 'D': 5) matches a row in df1, namely the row with index 1 (NOTE: the row index should not matter in the comparison), so the corresponding zebra value 7 is assigned to df2.
A merge would do
df2.merge(df1[['B', 'D', 'zebra']], on = ['B', 'D'], how = 'left')
B D zebra
0 7 4 NaN
1 3 5 NaN
2 5 8 NaN
3 8 5 7.0
4 8 3 NaN
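Two details may matter on the real data: a left merge repeats a df2 row once for every matching df1 row, and the question also allows 0 instead of NaN. A sketch that covers both, using the values from the printed tables in the question; keeping the first zebra per (B, D) pair is an assumption, not something the question specifies.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 5, 1, 6], 'B': [4, 8, 5, 6, 2],
                    'C': [4, 8, 3, 8, 0], 'D': [1, 5, 2, 5, 7],
                    'zebra': [5, 7, 2, 4, 8]})
df2 = pd.DataFrame({'B': [7, 3, 5, 8, 8], 'D': [4, 5, 8, 5, 3]})

# Keep one zebra per (B, D) pair so the merge cannot add rows to df2.
# Assumption: the first occurrence in df1 is the one to keep.
lookup = df1[['B', 'D', 'zebra']].drop_duplicates(subset=['B', 'D'])

result = df2.merge(lookup, on=['B', 'D'], how='left')

# Optional: use 0 for "no match" and restore the integer dtype.
result['zebra'] = result['zebra'].fillna(0).astype(int)
print(result)
```

With 24 rows in df2, the merge builds its lookup from df1 once, so it should scale far better than looping over the rows in Python.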
