Merging with pandas while keeping NaNs at bottom - python

Let's say I have 3 dataframes, each with a single column. In each df, there are slightly more rows than in the previous one. For example:
df1 = col1
1 a
2 b
3 c
df2 = col2
1 x
2 y
3 z
4 w
5 q
df3 = col3
1 A
2 B
3 C
4 D
5 E
6 F
7 G
and I want to get exactly this:
res = col1 col2 col3
1 a x A
2 b y B
3 c z C
4 - w D
5 - q E
6 - - F
7 - - G
That is, I want the rows to stay in the order in which they were added, so that the NaNs (-) end up at the bottom.
I tried this:
import pandas as pd
total = pd.DataFrame()
total = pd.merge(total,df1,how='outer',left_index=True,right_index=True)
total = pd.merge(total,df2,how='outer',left_index=True,right_index=True)
total = pd.merge(total,df3,how='outer',left_index=True,right_index=True)
but I keep getting the table in a seemingly random order. Stuff like:
res = col1 col2 col3
1 a x A
4 - w D
3 c z C
5 - q E
2 b y B
7 - - G
6 - - F
How can I force the final df to take the desired form?
Thanks!

Use pd.concat and pass axis=1 to concatenate column-wise; the rows are aligned on the index:
In [203]:
pd.concat([df1,df2,df3], axis=1)
Out[203]:
col1 col2 col3
1 a x A
2 b y B
3 c z C
4 NaN w D
5 NaN q E
6 NaN NaN F
7 NaN NaN G
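For a version that can be run as-is, here is a minimal sketch. The frames are rebuilt from the tables in the question, and the 1-based integer index is an assumption read off those printouts.

```python
import pandas as pd

# Frames rebuilt from the question; the 1-based index is an assumption.
df1 = pd.DataFrame({"col1": list("abc")}, index=range(1, 4))
df2 = pd.DataFrame({"col2": list("xyzwq")}, index=range(1, 6))
df3 = pd.DataFrame({"col3": list("ABCDEFG")}, index=range(1, 8))

# Column-wise concatenation: rows are aligned on the index and
# positions missing from a shorter frame are filled with NaN.
res = pd.concat([df1, df2, df3], axis=1)
print(res)
```

If the row order still looks shuffled with real data, calling res.sort_index() afterwards is a simple way to restore it.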

Related

how to compare two column in two dataframes using a complex condition

Let's suppose I have a dataframe:
import numpy as np
import pandas as pd
a = [['A',np.nan,2,'x|x|x|y'],['B','a|b',56,'b|c'],['C','c|e|e',65,'f|g'],['D','h',98,'j'],['E','g',98,'k|h'],['F','a|a|a|a|a|b',98,np.nan],['G','w',98,'p'],['H','s',98,'t|u']]
df1 = pd.DataFrame(a, columns=['1', '2','3','4'])
df1
1 2 3 4
0 A NaN 2 x|x|x|y
1 B a|b 56 b|c
2 C c|e|e 65 f|g
3 D h 98 j
4 E g 98 k|h
5 F a|a|a|a|a|b 98 NaN
6 G w 98 p
7 H s 98 t|u
and another dataframe:
a = [['x'],['b'],['h'],['v']]
df2 = pd.DataFrame(a, columns=['1'])
df2
1
0 x
1 b
2 h
3 v
I want to compare column 1 in df2 with column 2 and 4 (splitting it by "|") in df1, and if the value matches with either or both column 2 or 4 (after splitting), I want to extract only those rows of df1 in another dataframe with an added column that will have the value of df2 that matched with either column 2 or column 4 of df1.
For example, the result would look something like this:
1 2 3 4 5
0 A NaN 2 x|x|x|y x
1 B a|b 56 b|c b
2 F a|a|a|a|a|b 98 NaN b
3 D h 98 j h
4 E g 98 k|h h
One solution: join the values of both columns into a single Series with DataFrame.agg, split it with Series.str.split, keep only the values present in df2 using DataFrame.where with DataFrame.isin, join the remaining values per row while dropping NaNs, and finally filter out the rows where the new column is an empty string:
df11 = df1[['2','4']].fillna('').agg('|'.join, 1).str.split('|', expand=True)
df1['5'] = (df11.where(df11.isin(df2['1'].tolist()))
                .apply(lambda x: ','.join(set(x.dropna())), axis=1))
df1 = df1[df1['5'].ne('')]
print (df1)
1 2 3 4 5
0 A NaN 2 x|x|x|y x
1 B a|b 56 b|c b
3 D h 98 j h
4 E g 98 k|h h
5 F a|a|a|a|a|b 98 NaN b
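The same idea can also be written row by row with plain Python sets, which may be easier to follow step by step. This is only a sketch: matched_tokens is a hypothetical helper name, and sorting the matches is a choice made here to keep the output order stable.

```python
import numpy as np
import pandas as pd

rows = [['A', np.nan, 2, 'x|x|x|y'], ['B', 'a|b', 56, 'b|c'],
        ['C', 'c|e|e', 65, 'f|g'], ['D', 'h', 98, 'j'],
        ['E', 'g', 98, 'k|h'], ['F', 'a|a|a|a|a|b', 98, np.nan],
        ['G', 'w', 98, 'p'], ['H', 's', 98, 't|u']]
df1 = pd.DataFrame(rows, columns=['1', '2', '3', '4'])
df2 = pd.DataFrame([['x'], ['b'], ['h'], ['v']], columns=['1'])

wanted = set(df2['1'])

def matched_tokens(row):
    """Hypothetical helper: tokens from columns '2' and '4' found in df2."""
    tokens = set()
    for cell in (row['2'], row['4']):
        if isinstance(cell, str):  # NaN cells are floats, so they are skipped
            tokens.update(cell.split('|'))
    return ','.join(sorted(tokens & wanted))

result = df1.copy()
result['5'] = df1.apply(matched_tokens, axis=1)
# Keep only the rows where at least one token matched.
result = result[result['5'].ne('')]
print(result)
```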

Reverse Row Values in Pandas DataFrame

I'm working on a pandas data frame where I want to find the farthest out non-null value in each row and then reverse the order of those values and output a data frame with the row values reversed without leaving null values in the first column. Essentially reversing column order and shifting non-null values to the left.
IN:
1 2 3 4 5
1 a b c d e
2 a b c
3 a b c d
4 a b c
OUT:
1 2 3 4 5
1 e d c b a
2 c b a
3 d c b a
4 c b a
For each row, create a new Series with the same indexes but with the values reversed:
def reverse(s):
    # Strip the NaN on both ends, but not in the middle
    idx1 = s.first_valid_index()
    idx2 = s.last_valid_index()
    idx = s.loc[idx1:idx2].index
    return pd.Series(s.loc[idx[::-1]].values, index=idx)

df.apply(reverse, axis=1)
Result:
1 2 3 4 5
1 e d c b a
2 c b a NaN NaN
3 d c b a NaN
4 c NaN b a NaN
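If gaps in the middle of a row should be closed as well, so that every row is packed to the left, a variant along these lines might work. It is a sketch under assumptions: the input frame is reconstructed from the printed tables (row 4 is assumed to have a NaN between b and c, as the result above suggests), and reverse_left_packed is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Input reconstructed from the printed tables (an assumption).
df = pd.DataFrame(
    [list("abcde"),
     ["a", "b", "c", np.nan, np.nan],
     ["a", "b", "c", "d", np.nan],
     ["a", "b", np.nan, "c", np.nan]],
    index=[1, 2, 3, 4], columns=[1, 2, 3, 4, 5])

def reverse_left_packed(s):
    """Hypothetical helper: drop every NaN, reverse, then pack to the left."""
    vals = s.dropna().values[::-1]
    packed = pd.Series(vals, index=s.index[:len(vals)])
    # Reindex so each row comes back with the full set of columns.
    return packed.reindex(s.index)

out = df.apply(reverse_left_packed, axis=1)
print(out)
```

Unlike the function above, this drops NaNs everywhere in the row, so information about where the gaps were is lost.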

Group by within a groupby then averaging

Let's say I have a dataframe (I'll just use a simple example) that looks like this:
import pandas as pd
df = {'Col1':[3,4,2,6,5,7,3,4,9,7,1,3],
'Col2':['B','B','B','B','A','A','A','A','C','C','C','C',],
'Col3':[1,1,2,2,1,1,2,2,1,1,2,2]}
df = pd.DataFrame(df)
Which gives a dataframe like so:
Col1 Col2 Col3
0 3 B 1
1 4 B 1
2 2 B 2
3 6 B 2
4 5 A 1
5 7 A 1
6 3 A 2
7 4 A 2
8 9 C 1
9 7 C 1
10 1 C 2
11 3 C 2
What I want to do is several steps:
1) For each unique value in Col2, and for each unique value in Col3, average Col1. So a desired output would be:
Avg Col2 Col3
1 3.5 B 1
2 4 B 2
3 6 A 1
4 3.5 A 2
5 8 C 1
6 2 C 2
2) Now, for each unique value in Col3, I want the highest average and the corresponding value in Col2. So
Best Avg Col2 Col3
1 8 C 1
2 4 B 2
My attempt has been using df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'}).groupby(['Col3']).agg({'Col1':'max'})
This gives me the highest average for each Col3 value, but not the corresponding Col2 label. Thank you for any help you can give!
After the first groupby, use sort_values and drop_duplicates:
g1=df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'})
g1.sort_values('Col1').drop_duplicates('Col3',keep='last')
Out[569]:
Col3 Col2 Col1
4 2 B 4.0
2 1 C 8.0
Or, if the maximum mean can be tied within a Col3 group and you want to keep every tied row:
g1[g1.Col1==g1.groupby('Col3').Col1.transform('max')]
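Putting both variants together on the data from the question gives a self-contained sketch. The names best_one and best_all are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': [3, 4, 2, 6, 5, 7, 3, 4, 9, 7, 1, 3],
    'Col2': ['B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'],
    'Col3': [1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2],
})

# Step 1: mean of Col1 for every (Col3, Col2) pair.
g1 = df.groupby(['Col3', 'Col2'], as_index=False).agg({'Col1': 'mean'})

# Step 2a: one best row per Col3 (ties are broken arbitrarily).
best_one = g1.sort_values('Col1').drop_duplicates('Col3', keep='last')

# Step 2b: every row whose mean equals the maximum of its Col3 group.
best_all = g1[g1['Col1'] == g1.groupby('Col3')['Col1'].transform('max')]

print(best_one)
print(best_all)
```

With this data there are no ties, so both variants select the same two rows, only in a different order.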
Do the following (I modified your code slightly to make it a bit shorter):
df2 = df.groupby(['Col3','Col2'], as_index = False).mean()
When you print the result, for your input, you will get:
Col3 Col2 Col1
0 1 A 6.0
1 1 B 3.5
2 1 C 8.0
3 2 A 3.5
4 2 B 4.0
5 2 C 2.0
Then run:
res = df2.loc[df2.groupby('Col3').Col1.idxmax()]
When you print the result, you will get:
Col3 Col2 Col1
2 1 C 8.0
4 2 B 4.0
As you can see, idxmax gives the index label of the row with the maximal element in each group, and this result can be passed to loc. With the default integer index used here, iloc selects the same rows, but loc stays correct for any index.
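Because idxmax returns index labels rather than positions, a label-based lookup is the safer choice whenever the index is not the default 0..n-1 range. The sketch below uses made-up labels 10 to 15 (an assumption, purely for illustration) to show the lookup with loc.

```python
import pandas as pd

# Averages from the step above, with hypothetical non-default labels.
df2 = pd.DataFrame(
    {'Col3': [1, 1, 1, 2, 2, 2],
     'Col2': ['A', 'B', 'C', 'A', 'B', 'C'],
     'Col1': [6.0, 3.5, 8.0, 3.5, 4.0, 2.0]},
    index=[10, 11, 12, 13, 14, 15])

# idxmax yields one index *label* per Col3 group.
best_labels = df2.groupby('Col3')['Col1'].idxmax()

# Label-based selection works regardless of how the index is numbered.
res = df2.loc[best_labels]
print(res)
```

Here a positional lookup with iloc would fail, since positions 12 and 14 do not exist in a six-row frame.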

Set dataframe column using values from matching indices in another dataframe

I would like to set values in col2 of DF1 using the value held at the matching index of col2 in DF2:
DF1:
col1 col2
index
0 a
1 b
2 c
3 d
4 e
5 f
DF2:
col1 col2
index
2 a x
3 d y
5 f z
DF3:
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
If I just try and set DF1['col2'] = DF2['col2'] then col2 comes out as all NaN values in DF3 - I take it this is because the indices are different. However when I try and use map() to do something like:
DF1.index.to_series().map(DF2['col2'])
then I still get the same NaN column, but I thought it would map the values over where the index matches...
What am I not getting?
You need join or assign:
df = df1.join(df2['col2'])
print (df)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
Or:
df1 = df1.assign(col2=df2['col2'])
# same as
# df1['col2'] = df2['col2']
print (df1)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
If nothing matches and all values are NaN, check whether the indices of both DataFrames have the same dtype:
print (df1.index.dtype)
print (df2.index.dtype)
If they differ, convert them with astype:
df1.index = df1.index.astype(int)
df2.index = df2.index.astype(int)
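To see the dtype problem in isolation, here is a small sketch. The string-labelled index on df1 is an assumption made to reproduce the symptom; the data otherwise follows the question.

```python
import pandas as pd

# df1 has *string* labels, df2 has *integer* labels (assumed mismatch).
df1 = pd.DataFrame({'col1': list('abcdef')},
                   index=['0', '1', '2', '3', '4', '5'])
df2 = pd.DataFrame({'col1': ['a', 'd', 'f'], 'col2': ['x', 'y', 'z']},
                   index=[2, 3, 5])

# '2' (str) never equals 2 (int), so alignment finds no match at all.
mismatched = df1.assign(col2=df2['col2'])

# After converting the labels, alignment works as expected.
df1.index = df1.index.astype(int)
aligned = df1.assign(col2=df2['col2'])

print(mismatched)
print(aligned)
```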
Bad solution (check index 2): combine_first takes values from df2 first, so col1 at index 2 becomes a instead of c:
df = df2.combine_first(df1)
print (df)
col1 col2
index
0 a NaN
1 b NaN
2 a x
3 d y
4 e NaN
5 f z
You can simply use concat, since you are combining on the index:
df = pd.concat([df1['col1'], df2['col2']],axis = 1)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z

How to "multiply" python pandas dataframes (as if they were vectors)?

I'm learning pandas. I have two dataframes:
df1 =
quality1 value
A 1
B 2
C 3
df2 =
quality2 value
D 1
E 10
F 100
I want to multiply them (as I might do with vectors to get a matrix). The answer should be:
df3 =
quality1 quality2 value
A D 1
E 10
F 100
B D 2
E 20
F 200
C D 3
E 30
F 300
How can I achieve this?
It's not the prettiest, but it would work:
>>> df1["dummy"] = 1
>>> df2["dummy"] = 1
>>> dfm = df1.merge(df2, on="dummy")
>>> dfm["value"] = dfm.pop("value_x") * dfm.pop("value_y")
>>> del dfm["dummy"]
>>> dfm
quality1 quality2 value
0 A D 1
1 A E 10
2 A F 100
3 B D 2
4 B E 20
5 B F 200
6 C D 3
7 C E 30
8 C F 300
Until we get native support for a Cartesian join (whistles and looks away..), merging on a dummy column is an easy way to get the same effect. The intermediate frame looks like
>>> dfm
quality1 value_x dummy quality2 value_y
0 A 1 1 D 1
1 A 1 1 E 10
2 A 1 1 F 100
3 B 2 1 D 1
4 B 2 1 E 10
5 B 2 1 F 100
6 C 3 1 D 1
7 C 3 1 E 10
8 C 3 1 F 100
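Newer pandas releases (1.2 and later) do support a Cartesian join directly through how='cross', which removes the need for the dummy column. A minimal sketch, with the frames rebuilt from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'quality1': list('ABC'), 'value': [1, 2, 3]})
df2 = pd.DataFrame({'quality2': list('DEF'), 'value': [1, 10, 100]})

# Cross join: every row of df1 paired with every row of df2.
# The shared column name 'value' gets the default _x / _y suffixes.
dfm = df1.merge(df2, how='cross')

# Replace the two value columns with their product.
dfm['value'] = dfm.pop('value_x') * dfm.pop('value_y')
print(dfm)
```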
You could also use the cartesian function from scikit-learn:
import numpy as np
import pandas as pd
from sklearn.utils.extmath import cartesian

# Your data:
df1 = pd.DataFrame({'quality1':list('ABC'), 'value':[1,2,3]})
df2 = pd.DataFrame({'quality2':list('DEF'), 'value':[1,10,100]})
# Make the matrix of labels:
dfm = pd.DataFrame(cartesian((df1.quality1.values, df2.quality2.values)),
                   columns=['quality1', 'quality2'])
# Multiply values (np.tile replaces the removed pd.np alias):
dfm['value'] = df1.value.values.repeat(df2.value.size) * np.tile(df2.value.values, df1.value.size)
print(dfm.set_index(['quality1', 'quality2']))
Which yields:
value
quality1 quality2
A D 1
E 10
F 100
B D 2
E 20
F 200
C D 3
E 30
F 300
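Since the question describes this as multiplying vectors to get a matrix, the same result can be sketched without scikit-learn by combining NumPy's outer product with a MultiIndex built from the two label columns. This is one possible approach, not the method of the answers above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'quality1': list('ABC'), 'value': [1, 2, 3]})
df2 = pd.DataFrame({'quality2': list('DEF'), 'value': [1, 10, 100]})

# 3x3 matrix of pairwise products: rows follow df1, columns follow df2.
outer = np.outer(df1['value'], df2['value'])

# Label pairs in the same row-major order as outer.ravel().
index = pd.MultiIndex.from_product(
    [df1['quality1'], df2['quality2']], names=['quality1', 'quality2'])

df3 = pd.DataFrame({'value': outer.ravel()}, index=index)
print(df3)
```

Keeping outer as a 2-D array, with df1's labels as the index and df2's labels as the columns, would give the matrix form instead of the long form.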
