Pandas single match join - python

Is there a way to do a "first available match" join? Something that would create the final df inside the function 'some_magic_merge':
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'joincol': ['a','a','b','b','b','b','d','d'],
                    'val1': [1,2,3,4,5,6,7,8]})
df2 = pd.DataFrame({'joincol': ['a','a','a','b','b','d'],
                    'val2': [1,2,3,4,5,6]})
final_df = some_magic_merge(df1,df2)
print(final_df)
print(df1)
print(df2)
Output of final_df:
joincol val1 val2
0 a 1 1.0
1 a 2 2.0
2 b 3 4.0
3 b 4 5.0
4 b 5 NaN
5 b 6 NaN
6 d 7 6.0
7 d 8 NaN
Output of df1 and df2:
joincol val1
0 a 1
1 a 2
2 b 3
3 b 4
4 b 5
5 b 6
6 d 7
7 d 8
joincol val2
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 d 6

Use GroupBy.cumcount to create a helper column that numbers the duplicates within each group, then do a left join in merge:
final_df = pd.merge(df1.assign(g=df1.groupby('joincol').cumcount()),
                    df2.assign(g=df2.groupby('joincol').cumcount()),
                    how='left', on=['joincol','g']).drop('g', axis=1)
print(final_df)
joincol val1 val2
0 a 1 1.0
1 a 2 2.0
2 b 3 4.0
3 b 4 5.0
4 b 5 NaN
5 b 6 NaN
6 d 7 6.0
7 d 8 NaN
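To see why this works: cumcount numbers the duplicate keys within each group on both sides, so merging on ['joincol','g'] pairs the n-th occurrence of each key exactly once. A minimal runnable sketch of the same idea on the example data:

```python
import pandas as pd

df1 = pd.DataFrame({'joincol': ['a','a','b','b','b','b','d','d'],
                    'val1': [1, 2, 3, 4, 5, 6, 7, 8]})
df2 = pd.DataFrame({'joincol': ['a','a','a','b','b','d'],
                    'val2': [1, 2, 3, 4, 5, 6]})

# cumcount numbers each row within its group: a -> 0,1  b -> 0,1,2,3  d -> 0,1
g1 = df1.groupby('joincol').cumcount()
g2 = df2.groupby('joincol').cumcount()

# joining on (joincol, g) matches the n-th occurrence on each side at most once;
# df1 rows with no counterpart in df2 get NaN in val2
final_df = (pd.merge(df1.assign(g=g1), df2.assign(g=g2),
                     how='left', on=['joincol', 'g'])
              .drop('g', axis=1))
```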

Related

Comparing two dataframes with same columns and different rows

I have two dataframes, df1 and df2, and I want to compare them.
import pandas as pd
df1 = pd.DataFrame({'Column1': ['f','c','b','d','e','g','h'],
                    'Column2': ['1','2','3','4','5','7','8']})
df2 = pd.DataFrame({'Column1': ['f','b','d','e','a','g'],
                    'Column2': ['1','3','4','5','6','7']})
To compare the dataframes, I use pandas merge. Here's my code:
df = pd.merge(df1,df2, how="outer", on="Column1")
And the result:
Column1 Column2_x Column2_y
0 f 1 1
1 c 2 NaN
2 b 3 3
3 d 4 4
4 e 5 5
5 g 7 7
6 h 8 NaN
7 a NaN 6
But I don't want this result. The output I want is below; how can I get it?
What I want:
Column1 Column2_x Column2_y
0 f 1 1
1 c 2 NaN
2 b 3 3
3 d 4 4
4 e 5 5
5 a NaN 6
6 g 7 7
7 h 8 NaN
It looks like you want to preserve the row order in df1. Try:
import numpy as np

(df1.assign(enum=np.arange(len(df1)))
    .merge(df2, on='Column1', how='outer')
    .sort_values('enum')
    .drop(columns=['enum'])
)
Output:
Column1 Column2_x Column2_y
0 f 1 1
1 c 2 NaN
2 b 3 3
3 d 4 4
4 e 5 5
5 g 7 7
6 h 8 NaN
7 a NaN 6
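An equivalent way to express the same idea, assuming df1 has the default RangeIndex, is to expose df1's index as the sort key via reset_index instead of an explicit counter; rows that only exist in df2 get NaN in that key and sort to the end:

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': ['f','c','b','d','e','g','h'],
                    'Column2': ['1','2','3','4','5','7','8']})
df2 = pd.DataFrame({'Column1': ['f','b','d','e','a','g'],
                    'Column2': ['1','3','4','5','6','7']})

# reset_index turns df1's RangeIndex into an 'index' column that records
# each row's original position; NaN (df2-only rows) sorts last by default
out = (df1.reset_index()
          .merge(df2, on='Column1', how='outer')
          .sort_values('index')
          .drop(columns=['index'])
          .reset_index(drop=True))
```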

Add Missing Values To Pandas Groups

Let's say I have a DataFrame like:
import pandas as pd
df = pd.DataFrame({"Quarter": [1,2,3,4,1,2,3,4,4],
                   "Type": ["a","a","a","a","b","b","c","c","d"],
                   "Value": [4,1,3,4,7,2,9,4,1]})
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 c 9
7 4 c 4
8 4 d 1
For each Type, there needs to be a total of 4 rows that represent one of four quarters as indicated by the Quarter column. So, it would look like:
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b NaN
7 4 b NaN
8 1 c NaN
9 2 c NaN
10 3 c 9
11 4 c 4
12 1 d NaN
13 2 d NaN
14 3 d NaN
15 4 d 1
Then, where there are missing values in the Value column, fill the missing values using the next closest available value with the same Type (if it's the last quarter that is missing then fill with the third quarter):
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b 2
7 4 b 2
8 1 c 9
9 2 c 9
10 3 c 9
11 4 c 4
12 1 d 1
13 2 d 1
14 3 d 1
15 4 d 1
What's the best way to accomplish this?
Use reindex:
idx = pd.MultiIndex.from_product([
    df['Type'].unique(),
    range(1, 5)
], names=['Type', 'Quarter'])

df.set_index(['Type', 'Quarter']).reindex(idx) \
  .groupby('Type') \
  .transform(lambda v: v.ffill().bfill()) \
  .reset_index()
You can use set_index and unstack to create the missing rows (assuming each quarter appears in at least one type), then ffill and bfill across the columns, and finally stack and reset_index to return to the original shape:
df = df.set_index(['Type', 'Quarter']).unstack() \
       .ffill(axis=1).bfill(axis=1) \
       .stack().reset_index()
print (df)
Type Quarter Value
0 a 1 4.0
1 a 2 1.0
2 a 3 3.0
3 a 4 4.0
4 b 1 7.0
5 b 2 2.0
6 b 3 2.0
7 b 4 2.0
8 c 1 9.0
9 c 2 9.0
10 c 3 9.0
11 c 4 4.0
12 d 1 1.0
13 d 2 1.0
14 d 3 1.0
15 d 4 1.0
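Both answers produce the same filled frame; a runnable sketch on the sample data, just to confirm the group-wise ffill/bfill logic of each approach:

```python
import pandas as pd

df = pd.DataFrame({"Quarter": [1,2,3,4,1,2,3,4,4],
                   "Type": ["a","a","a","a","b","b","c","c","d"],
                   "Value": [4,1,3,4,7,2,9,4,1]})

# reindex approach: build the full Type x Quarter grid, then fill per Type
idx = pd.MultiIndex.from_product([df['Type'].unique(), range(1, 5)],
                                 names=['Type', 'Quarter'])
res1 = (df.set_index(['Type', 'Quarter']).reindex(idx)
          .groupby('Type')
          .transform(lambda v: v.ffill().bfill())
          .reset_index())

# unstack approach: pivot Quarter into columns, fill row-wise, pivot back
res2 = (df.set_index(['Type', 'Quarter']).unstack()
          .ffill(axis=1).bfill(axis=1)
          .stack().reset_index())
```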

add nan if missing consecutive values

I have a dataframe like
df2 = pd.DataFrame(data=[[1,4],[2,2],[2,1],[5,2],[5,3]], columns=['A','B'])
df2
Out[117]:
A B
0 1 4
1 2 2
2 2 1
3 5 2
4 5 3
and I would like to add NaN to column B where consecutive values are missing in column A.
The dataframe should become:
df2
Out[117]:
A B
0 1 4.0
1 2 2.0
2 2 1.0
3 3 NaN
4 4 NaN
5 5 2.0
6 5 3.0
Could you please help me?
You can construct a dataframe to append, concatenate, then sort:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1,4],[2,2],[2,1],[5,2],[5,3]], columns=['A','B'])
# construct dataframe of the missing consecutive A values to append
arr = np.arange(df['A'].min(), df['A'].max() + 1)
arr = arr[~np.isin(arr, df['A'].values)]
df_append = pd.DataFrame({'A': arr})
# concatenate and sort
res = pd.concat([df, df_append]).sort_values('A')
print(res)
A B
0 1 4.0
1 2 2.0
2 2 1.0
0 3 NaN
1 4 NaN
3 5 2.0
4 5 3.0
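An alternative sketch that reaches the same result in one step (a variant, not the answer's exact code): left-merge the full consecutive range onto df, so missing A values pick up NaN in B automatically while duplicated A values keep one output row each:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1,4],[2,2],[2,1],[5,2],[5,3]], columns=['A','B'])

# frame holding every consecutive value of A from min to max
full = pd.DataFrame({'A': np.arange(df['A'].min(), df['A'].max() + 1)})

# left merge keeps the full range in order; A values duplicated in df
# produce one row per duplicate, missing values get NaN in B
res = full.merge(df, on='A', how='left')
```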

Python DataFrame: replacing values from DataFrame to other DataFrame with same index and columns

I have two dataframes. "df" is my original dataframe with 100000+ rows, and "df_result" is another that contains only certain columns and certain indexes of df. I have changed the values in the "df_result" columns and want to apply them back to my original dataframe "df". I have mapped the column names and index of "df_result" to match the right index of "df", but it does not contain every index of "df" (e.g., df.index is [0,1,2,.....,92808,92809] and df_result.index is [23429,23430,32349,42099,45232,.....,91324,91423]). Is there an efficient way to put every value in "df_result" back into the original "df" at the corresponding index and columns? Thank you!
You can use combine_first:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df_result = pd.DataFrame({'A':list('abc'),
                          'B':[4,5,4],
                          'C':[7,9,3],
                          'D':[5,7,1],
                          'E':[5,3,6],
                          'F':list('klo')}, index=[2,4,5])
print (df_result)
A B C D E F
2 a 4 7 5 5 k
4 b 5 9 7 3 l
5 c 4 3 1 6 o
df = df_result.combine_first(df)
print (df)
A B C D E F
0 a 4.0 7.0 1.0 5.0 a
1 b 5.0 8.0 3.0 3.0 a
2 a 4.0 7.0 5.0 5.0 k
3 d 5.0 4.0 7.0 9.0 b
4 b 5.0 9.0 7.0 3.0 l
5 c 4.0 3.0 1.0 6.0 o
Another solution, which works with NaNs too, is to concatenate the DataFrames and remove duplicated rows by index:
df = pd.concat([df_result, df])
df = df[~df.index.duplicated()].sort_index()
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 a 4 7 5 5 k
3 d 5 4 7 9 b
4 b 5 9 7 3 l
5 c 4 3 1 6 o
EDIT:
does this work with np.nan values also? And what if df has more columns than df_result?
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[np.nan,4,8,9,4,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 NaN 1 5 a
1 b 5 4.0 3 3 a
2 c 4 8.0 5 6 a
3 d 5 9.0 7 9 b
4 e 5 4.0 1 2 b
5 f 4 3.0 0 4 b
df_result = pd.DataFrame({'A':list('abc'),
                          'B':[np.nan,50,40],
                          'E':[50,30,60],
                          'F':list('klo')}, index=[2,4,5])
print (df_result)
A B E F
2 a NaN 50 k
4 b 50.0 30 l
5 c 40.0 60 o
You can set values in df by index and column names with loc:
df.loc[df_result.index, df_result.columns] = df_result
print (df)
A B C D E F
0 a 4.0 NaN 1 5 a
1 b 5.0 4.0 3 3 a
2 a NaN 8.0 5 50 k
3 d 5.0 9.0 7 9 b
4 b 50.0 4.0 1 30 l
5 c 40.0 3.0 0 60 o
This function should work if you don't have any NAs:
df.update(df_result)
Note that update modifies df in place and returns None, so don't assign its result back to df.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
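The NA caveat exists because DataFrame.update only takes non-NaN values from the other frame, and it works in place. A small sketch illustrating both points:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [1.0, 2.0, 3.0]})
df_result = pd.DataFrame({'B': [np.nan, 20.0]}, index=[0, 1])

df.update(df_result)   # modifies df in place; the return value is None

# row 0 keeps its old value because NaN in df_result is ignored,
# row 1 is overwritten, row 2 is untouched (its index is not in df_result)
```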

How do I join two dataframes based on values in selected columns?

I am trying to join (merge) two dataframes based on the values in certain columns.
For instance, to merge by the values in columns A and B.
So, having df1
A B C D L
0 4 3 1 5 1
1 5 7 0 3 2
2 3 2 1 6 4
And df2
A B E F L
0 4 3 4 5 1
1 5 7 3 3 2
2 3 8 5 5 5
I want to get a d3 with such structure
A B C D E F L
0 4 3 1 5 4 5 1
1 5 7 0 3 3 3 2
2 3 2 1 6 NaN NaN 4
3 3 8 NaN NaN 5 5 5
Can you please help me? I've tried both the merge and join methods but haven't succeeded.
UPDATE: (for updated DFs and new desired DF)
In [286]: merged = pd.merge(df1, df2, on=['A','B'], how='outer', suffixes=('','_y'))
In [287]: merged.L.fillna(merged.pop('L_y'), inplace=True)
In [288]: merged
Out[288]:
A B C D L E F
0 4 3 1.0 5.0 1.0 4.0 5.0
1 5 7 0.0 3.0 2.0 3.0 3.0
2 3 2 1.0 6.0 4.0 NaN NaN
3 3 8 NaN NaN 5.0 5.0 5.0
Data:
In [284]: df1
Out[284]:
A B C D L
0 4 3 1 5 1
1 5 7 0 3 2
2 3 2 1 6 4
In [285]: df2
Out[285]:
A B E F L
0 4 3 4 5 1
1 5 7 3 3 2
2 3 8 5 5 5
OLD answer:
You can use the pd.merge(..., how='outer') method:
In [193]: pd.merge(a,b, on=['A','B'], how='outer')
Out[193]:
A B C D E F
0 4 3 1.0 5.0 4.0 5.0
1 5 7 0.0 3.0 3.0 3.0
2 3 2 1.0 6.0 NaN NaN
3 3 8 NaN NaN 5.0 5.0
Data:
In [194]: a
Out[194]:
A B C D
0 4 3 1 5
1 5 7 0 3
2 3 2 1 6
In [195]: b
Out[195]:
A B E F
0 4 3 4 5
1 5 7 3 3
2 3 8 5 5
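A self-contained version of the updated answer, assuming the DataFrames shown above; it assigns the filled column back instead of calling fillna with inplace=True on a column attribute, which recent pandas versions discourage:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [4, 5, 3], 'B': [3, 7, 2],
                    'C': [1, 0, 1], 'D': [5, 3, 6], 'L': [1, 2, 4]})
df2 = pd.DataFrame({'A': [4, 5, 3], 'B': [3, 7, 8],
                    'E': [4, 3, 5], 'F': [5, 3, 5], 'L': [1, 2, 5]})

# outer merge keeps key pairs from both sides; the empty suffix keeps
# df1's L name, while df2's copy becomes L_y
merged = pd.merge(df1, df2, on=['A', 'B'], how='outer', suffixes=('', '_y'))

# fill L from the df2 copy for rows that exist only in df2, then drop L_y
merged['L'] = merged['L'].fillna(merged.pop('L_y'))
```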
