How to merge multiple columns of the same data frame - python

How can I merge the values of multiple columns of the same data frame into one column, keeping only the unique values?
  Column1  Column2  Column3  Column4  Column5
0       a        1        2        3        4
1       a        3        4        5
2       b        6        7        8
3       c        7        7
Output:
Column A
a
a
b
c
1
3
6
7
2
4
5
8

Use unstack or melt to reshape, remove missing values with dropna and duplicates with drop_duplicates:
df1 = df.unstack().dropna().drop_duplicates().reset_index(drop=True).to_frame('A')
df1 = df.melt(value_name='A')[['A']].dropna().drop_duplicates().reset_index(drop=True)
print(df1)
A
0 a
1 b
2 c
3 1
4 3
5 6
6 7
7 2
8 4
9 5
10 8
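For reference, here is a self-contained, runnable version of the melt approach. It assumes the blank cells in the sample data are stored as NaN (the original question doesn't say whether they are NaN or empty strings):

```python
import numpy as np
import pandas as pd

# Sample data; blank cells are assumed to be NaN here.
df = pd.DataFrame({'Column1': ['a', 'a', 'b', 'c'],
                   'Column2': [1, 3, 6, 7],
                   'Column3': [2, 4, 7, 7],
                   'Column4': [3, 5, 8, np.nan],
                   'Column5': [4, np.nan, np.nan, np.nan]})

# Reshape to one long column, then drop missing values and duplicates.
df1 = (df.melt(value_name='A')[['A']]
         .dropna()
         .drop_duplicates()
         .reset_index(drop=True))
print(df1)
```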

Here is another way to do it if you are OK using numpy. This will handle either NaNs or empty strings in the original dataframe and is a bit faster than unstack or melt.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Column1': ['a', 'a', 'b', 'c'],
                   'Column2': [1, 3, 6, 7],
                   'Column3': [2, 4, 7, 7],
                   'Column4': [3, 5, 8, np.nan],
                   'Column5': [4, '', '', np.nan]})
u = pd.unique(df.values.flatten(order='F'))
u = u[np.where(~np.isin(u, ['']) & ~pd.isnull(u))[0]]
df1 = pd.DataFrame(u, columns=['A'])
print(df1)
A
0 a
1 b
2 c
3 1
4 3
5 6
6 7
7 2
8 4
9 5
10 8


Replace values in dataframe where updated versions are in another dataframe [duplicate]

I have two dataframes, something like this:
df1 = pd.DataFrame({
    'Identity': [3, 4, 5, 6, 7, 8, 9],
    'Value': [1, 2, 3, 4, 5, 6, 7],
    'Notes': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})
df2 = pd.DataFrame({
    'Identity': [4, 8],
    'Value': [0, 128],
})
In[3]: df1
Out[3]:
Identity Value Notes
0 3 1 a
1 4 2 b
2 5 3 c
3 6 4 d
4 7 5 e
5 8 6 f
6 9 7 g
In[4]: df2
Out[4]:
Identity Value
0 4 0
1 8 128
I'd like to use df2 to overwrite df1 but only where values exist in df2, so I end up with:
Identity Value Notes
0 3 1 a
1 4 0 b
2 5 3 c
3 6 4 d
4 7 5 e
5 8 128 f
6 9 7 g
I've been searching through all the various merge, combine, join functions etc, but I can't seem to find one that does what I want. Is there a simple way of doing this?
Use:
df1['Value'] = df1['Identity'].map(df2.set_index('Identity')['Value']).fillna(df1['Value'])
Or try set_index with reindex, followed by reset_index and fillna:
df1['Value'] = (df2.set_index('Identity')
                   .reindex(df1['Identity'])
                   .reset_index(drop=True)['Value']
                   .fillna(df1['Value']))
>>> df1
Identity Value Notes
0 3 1.0 a
1 4 0.0 b
2 5 3.0 c
3 6 4.0 d
4 7 5.0 e
5 8 128.0 f
6 9 7.0 g
>>>
This fills the rows missing from df2 with NaN, then replaces those NaNs with the original df1 values.
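Another common pattern for this kind of conditional overwrite (an alternative not shown above) is DataFrame.update, which overwrites cells of the caller in place wherever the other frame has a non-NaN value for a matching index. A minimal sketch, assuming Identity values are unique in both frames:

```python
import pandas as pd

df1 = pd.DataFrame({'Identity': [3, 4, 5, 6, 7, 8, 9],
                    'Value': [1, 2, 3, 4, 5, 6, 7],
                    'Notes': list('abcdefg')})
df2 = pd.DataFrame({'Identity': [4, 8],
                    'Value': [0, 128]})

# Align both frames on Identity, overwrite matching cells, restore the column.
out = df1.set_index('Identity')
out.update(df2.set_index('Identity'))  # in place; only non-NaN cells of df2 win
out = out.reset_index()
print(out)
```

Note that update works in place and may promote the overwritten column to float.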

Performing outer join that merges joined columns

I am performing an outer join on two DataFrames:
df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'date': [4, 5, 6, 7, 8],
                    'str': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 6],
                    'date': [4, 5, 6, 7, 8],
                    'str': ['A', 'B', 'C', 'D', 'Q']})
pd.merge(df1, df2, on=["id","date"], how="outer")
This gives the result
date id str_x str_y
0 4 1 a A
1 5 2 b B
2 6 3 c C
3 7 4 d D
4 8 5 e NaN
5 8 6 NaN Q
Is it possible to perform the outer join such that the str-columns are concatenated? In other words, how to perform the join such that we get the DataFrame
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
where all NaN have been set to None.
I think not; a possible solution is to replace the NaNs with empty strings and join the columns together:
df = (pd.merge(df1, df2, on=["id", "date"], how="outer", suffixes=('', '_'))
        .assign(str=lambda x: x['str'].fillna('') + x['str_'].fillna(''))
        .drop(columns='str_'))
Similar alternative:
df = (pd.merge(df1, df2, on=["id", "date"], how="outer", suffixes=('', '_'))
        .assign(str=lambda x: x.filter(like='str').fillna('').values.sum(axis=1))
        .drop(columns='str_'))
print (df)
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
If the ('id', 'date') pairs are unique in each data frame, you can set the index and add the dataframes.
icols = ['date', 'id']
df1.set_index(icols).add(df2.set_index(icols), fill_value='').reset_index()
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
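Since adding strings concatenates them, another option (not in the original answers, and assuming every str value really is a string) is to stack both frames and let groupby sum the pieces per key:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'date': [4, 5, 6, 7, 8],
                    'str': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 6],
                    'date': [4, 5, 6, 7, 8],
                    'str': ['A', 'B', 'C', 'D', 'Q']})

# Stack the rows, then concatenate the str values for each (id, date) pair;
# sum() on string values performs concatenation.
df = (pd.concat([df1, df2])
        .groupby(['id', 'date'], as_index=False)['str']
        .sum())
print(df)
```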

Pandas loop a dataframe and compare all rows with other DF rows and assign a value

I have two DF:
df1 = pd.DataFrame({'A': [3, 2, 5, 1, 6], 'B': [4, 8, 5, 6, 2], 'C': [4, 8, 3, 8, 0], 'D': [1, 5, 2, 5, 7], 'zebra': [5, 7, 2, 4, 8]})
df2 = pd.DataFrame({'B': [7, 3, 5, 8, 8], 'D': [4, 5, 8, 5, 3]})
print(df1)
print(df2)
A B C D zebra
0 3 4 4 1 5
1 2 8 8 5 7
2 5 5 3 2 2
3 1 6 8 5 4
4 6 2 0 7 8
B D
0 7 4
1 3 5
2 5 8
3 8 5
4 8 3
This is a simple example; the real df1 has 1000k+ rows and 10+ columns, while df2 has only 24 rows and fewer columns. I would like to loop over all rows in df2 and compare specific columns (for example 'B' and 'D') against the same columns in df1. If the values in columns 'B' and 'D' of a df2 row match the values in the same columns of some df1 row, assign that df1 row's 'zebra' value to a new 'zebra' column in df2; if no match is found, assign 0 or NaN.
B D zebra
0 7 4 nan
1 3 5 nan
2 5 8 nan
3 8 5 7
4 8 3 nan
In the example, only row index 3 in df2 ('B': 8, 'D': 5) matches a row in df1 (row index 1; note that row indices themselves play no role in the comparison), so the corresponding 'zebra' value 7 is assigned to df2.
A merge would do
df2.merge(df1[['B', 'D', 'zebra']], on = ['B', 'D'], how = 'left')
B D zebra
0 7 4 NaN
1 3 5 NaN
2 5 8 NaN
3 8 5 7.0
4 8 3 NaN
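A self-contained version of the merge, using the values shown in the printed frames; if you prefer zeros to NaN for the non-matching rows, as the question allows, a fillna can be chained on:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 5, 1, 6], 'B': [4, 8, 5, 6, 2],
                    'C': [4, 8, 3, 8, 0], 'D': [1, 5, 2, 5, 7],
                    'zebra': [5, 7, 2, 4, 8]})
df2 = pd.DataFrame({'B': [7, 3, 5, 8, 8], 'D': [4, 5, 8, 5, 3]})

# Left merge keeps every df2 row; unmatched rows get NaN, replaced by 0 here.
out = df2.merge(df1[['B', 'D', 'zebra']], on=['B', 'D'], how='left')
out['zebra'] = out['zebra'].fillna(0)
print(out)
```

If a (B, D) pair occurs more than once in df1, the merge would produce one output row per match, so uniqueness of the key columns in df1 is assumed here.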

Comparing every values columns and rows in dataframes

I have two dataframes of different sizes, and I would like to run a comparison over four columns (two pairs of two).
Essentially I would like to find where df1['A'] == df2['A'] and df1['B'] == df2['B'], and return df1['C']'s value plus df2['C']'s value.
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2, 3, 4, 3], "B": [2, 5, 4, 7, 5], "C": [1, 2, 8, 0, 0]})
df2 = pd.DataFrame({"A": [1, 3, 2, 4, 8], "B": [5, 4, 5, 9, 1], "C": [1, 3, 4, 4, 6]})
df1:
A B C
0 1 2 1
1 2 5 2
2 3 4 8
3 4 7 0
4 3 5 0
...
df2:
A B C
0 1 5 1
1 3 4 3
2 2 5 4
3 4 9 4
4 8 1 6
...
in: df1['A'] == df2['A'] & where df1['B'] == df2['B']
df1['D'] = df1['C'] + df2['C']
out: df1:
A B C D
0 1 2 1 nan
1 2 5 2 6
2 3 4 8 11
3 4 7 0 nan
4 3 5 0 nan
My actual dataframes are much larger (120000ish rows of data with values for both 'A' columns range from 1 to 700 and 'B' from 1 to 300) so I know it might be a longer process.
You can merge the two DataFrames on columns A and B. Since you want to keep all values from df1, do a left merge of df1 and df2. The merged column C from df2 will be null wherever A and B don't match. After the merge, it's just a matter of renaming the merged column and doing a sum.
# Do a left merge, keeping df1 column names unchanged.
df1 = pd.merge(df1, df2, how='left', on=['A', 'B'], suffixes=('', '_2'))
# Add the two columns, fill locations that don't match with zero, and rename.
df1['C_2'] = df1['C_2'].add(df1['C']).fillna(0)
df1.rename(columns={'C_2': 'D'}, inplace=True)
You could first merge the two dataframes
In [145]: dff = pd.merge(df1, df2, on=['A', 'B'], how='left')
In [146]: dff
Out[146]:
A B C_x C_y
0 1 2 1 NaN
1 2 5 2 4
2 3 4 8 3
3 4 7 0 NaN
Then take a row-wise sum over the C_-prefixed columns with skipna=False, so that a missing C_y makes the whole sum NaN, and fill those NaNs with zero.
In [147]: dff['C'] = dff.filter(regex='C_').sum(skipna=False, axis=1).fillna(0)
In [148]: dff
Out[148]:
A B C_x C_y C
0 1 2 1 NaN 0
1 2 5 2 4 6
2 3 4 8 3 11
3 4 7 0 NaN 0
And, you can drop/pick required columns.
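To see which rows actually matched before summing, merge's indicator flag is handy; a small sketch combining it with the add-and-fill step above (suffixes and column names chosen here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 4, 3], "B": [2, 5, 4, 7, 5],
                    "C": [1, 2, 8, 0, 0]})
df2 = pd.DataFrame({"A": [1, 3, 2, 4, 8], "B": [5, 4, 5, 9, 1],
                    "C": [1, 3, 4, 4, 6]})

# indicator=True adds a _merge column: 'both' marks rows present in df1 and df2.
dff = pd.merge(df1, df2, on=['A', 'B'], how='left',
               suffixes=('', '_2'), indicator=True)
dff['D'] = dff['C'].add(dff['C_2']).fillna(0)
print(dff[['A', 'B', 'C', 'D', '_merge']])
```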

Duplicating a Pandas DF N times

So right now, if I multiply a list, i.e. x = [1,2,3] * 2, I get x as [1,2,3,1,2,3], but this doesn't work with Pandas.
So if I want to duplicate a pandas DF, I have to turn a column into a list and multiply it:
col_x_duplicates = list(df['col_x'])*N
new_df = DataFrame(col_x_duplicates, columns=['col_x'])
Then do a join on the original data:
pd.merge(new_df, df, on='col_x', how='left')
This duplicates the pandas DF N times. Is there an easier way, or even a quicker one?
Actually, since you want to duplicate the entire dataframe (and not each element), numpy.tile() may be better:
In [69]: import pandas as pd; import numpy as np
In [70]: arr = np.array([[1, 2, 3], [4, 5, 6]])
In [71]: arr
Out[71]:
array([[1, 2, 3],
[4, 5, 6]])
In [72]: df = pd.DataFrame(np.tile(arr, (5, 1)))
In [73]: df
Out[73]:
0 1 2
0 1 2 3
1 4 5 6
2 1 2 3
3 4 5 6
4 1 2 3
5 4 5 6
6 1 2 3
7 4 5 6
8 1 2 3
9 4 5 6
[10 rows x 3 columns]
In [75]: df = pd.DataFrame(np.tile(arr, (1, 3)))
In [76]: df
Out[76]:
0 1 2 3 4 5 6 7 8
0 1 2 3 1 2 3 1 2 3
1 4 5 6 4 5 6 4 5 6
[2 rows x 9 columns]
Here is a one-liner to make a DataFrame with n copies of DataFrame df
n_df = pd.concat([df] * n)
Example:
df = pd.DataFrame(
    data=[[34, 'null', 'mark'], [22, 'null', 'mark'], [34, 'null', 'mark']],
    columns=['id', 'temp', 'name'],
    index=pd.Index([1, 2, 3], name='row')
)
n = 4
n_df = pd.concat([df] * n)
Then n_df is the following DataFrame:
id temp name
row
1 34 null mark
2 22 null mark
3 34 null mark
1 34 null mark
2 22 null mark
3 34 null mark
1 34 null mark
2 22 null mark
3 34 null mark
1 34 null mark
2 22 null mark
3 34 null mark
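If the goal is to repeat each row element-wise rather than tile the whole frame back-to-back, a related idiom (not from the original answers) is Index.repeat with loc; a small sketch contrasting the two, with a hypothetical col_x column:

```python
import pandas as pd

df = pd.DataFrame({'col_x': [1, 2, 3]})

# Tile: the whole frame repeated back-to-back, like list * n.
tiled = pd.concat([df] * 2, ignore_index=True)

# Repeat: each row duplicated n times in place.
repeated = df.loc[df.index.repeat(2)].reset_index(drop=True)

print(tiled['col_x'].tolist())     # [1, 2, 3, 1, 2, 3]
print(repeated['col_x'].tolist())  # [1, 1, 2, 2, 3, 3]
```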
