Fill in values based on a different dataframe's values in pandas [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have the following dataframes:
df1 = pd.DataFrame({'ID': ['foo', 'foo', 'bar', 'foo', 'baz', 'foo'], 'value': [1, 2, 3, 5, 4, 3]})
df2 = pd.DataFrame({'ID': ['foo', 'bar', 'baz', 'foo'],'age': [10, 21, 32, 15]})
I would like to create a new column in df1 called age, taking the values from df2 that match on 'ID'. When an 'ID' value appears more than once in df1, I would like the age to be duplicated (instead of NaN).
I tried a merge of df1 and df2, but it produces NaNs instead of duplicates.
The Pandas Merging 101 answers do not cover this problem.

I think you need an outer join:
df = pd.merge(df1, df2, on='ID', how='outer')
print(df)
    ID  value  age
0  foo      1   10
1  foo      1   15
2  foo      2   10
3  foo      2   15
4  foo      5   10
5  foo      5   15
6  foo      3   10
7  foo      3   15
8  bar      3   21
9  baz      4   32
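If you only want to keep the rows of df1 (and, unlike this toy df2, your lookup table has a single age per ID), a plain left join does the same lookup while preserving df1's row order; a minimal sketch under that assumption:
df1 = df1.merge(df2, on='ID', how='left')
Each repeated 'ID' in df1 simply receives the matching age again, so you get duplicated values rather than NaN.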

Related

Outer merge between pandas dataframes and imputing NA with preceding row

I have two dataframes containing the same columns:
df1 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,3,4,5,6]})
df2 = pd.DataFrame({'a': [1,3,4],
'b': [2,4,5]})
I want df2 to have the same number of rows as df1. Any values of a from df1 that are not present in df2 should be copied over, and the corresponding values of b should be taken from the previous row.
In other words, I want df2 to look like this:
df2 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,2,4,5,5]})
EDIT: I'm looking for an answer that is independent of the number of columns
Use DataFrame.merge with only the a column from df1, then forward fill the missing values introduced by the left join:
df = df1[['a']].merge(df2, how='left', on='a').ffill()
print (df)
   a    b
0  1  2.0
1  2  2.0
2  3  4.0
3  4  5.0
4  5  5.0
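Note that b comes back as float because the left join introduces NaN before the fill. If you need integers again, cast after filling:
df['b'] = df['b'].astype(int)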
Or use merge_asof:
df = pd.merge_asof(df1[['a']], df2, on='a')
print (df)
   a  b
0  1  2
1  2  2
2  3  4
3  4  5
4  5  5
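One caveat: merge_asof requires both frames to be sorted by the on key. Here a is already sorted, but in the general case you may need to sort first, e.g.:
df = pd.merge_asof(df1[['a']].sort_values('a'), df2.sort_values('a'), on='a')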

pandas compare 2 dataframes with different column names and different shapes [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have 2 dataframes with different numbers of rows and different column names. I want to compare them and get the matching rows, restricted to specific columns, as output.
e.g.
df1 = pd.DataFrame({'foo': [11, 22, 33], 'bar': ['aa', 'ab', 'ac'], 'foobar': [111, 222, 333]})
df2 = pd.DataFrame({'AA': [1,22], 'BB': ['see','ab'], 'CC': [123,222]})
df1:
   foo bar  foobar
0   11  aa     111
1   22  ab     222
2   33  ac     333
df2:
   AA   BB   CC
0   1  see  123
1  22   ab  222
df2 does not necessarily have the same number of rows and columns as df1.
Expected output, for the rows of df2 that match in df1:
df3:
   foo bar  foobar
1   22  ab     222
I have tried using np.all, but this seems to work only if both dataframes have the same number of rows, or df2 has a single row.
df3 = df1.loc[np.all(df1[['bar','foobar']].values == df2[['BB','CC']].values, axis=1),:]
Essentially I need either the matching rows or the differing rows, from either df1 or df2.
Expected output, for the rows of df1 that are not matched in df2:
df3:
   foo bar  foobar
0   11  aa     111
2   33  ac     333
Also imagine the case where the column order is different; I will do the column mapping myself. For example: if the column values of a, b, c in df1 equal the column values of d, e, f in df2, get me the matched rows from df1 or df2.
df1 = pd.DataFrame({'foo': [11, 22, 33], 'bar': ['aa', 'ab', 'ac'], 'foobar': [111, 222, 333], 'barfoo':[2,22,34]})
df2 = pd.DataFrame({'AA': [22,33], 'CC': [222,333], 'BB': ['ab','ac']})
Output: in this case I am matching on (foo:AA, bar:BB, foobar:CC).
df3:
   foo bar  foobar  barfoo
1   22  ab     222      22
2   33  ac     333      34
Appreciated, thanks.
You can temporarily rename the columns of df2 and perform the inner join (a.k.a. merge) on the two dataframes. It will find all rows that are present in both dataframes:
mapper = dict(zip(df2, df1)) # Column mapper
df2.rename(columns=mapper).merge(df1)
# foo bar foobar
#0 22 ab 222
import pandas as pd
df1 = pd.DataFrame({'foo': [11, 22, 33], 'bar': ['aa', 'ab', 'ac'], 'foobar': [111, 222, 333]})
df2 = pd.DataFrame({'AA': [1,22], 'BB': ['see','ab'], 'CC': [123,222]})
df3 = df2.rename(columns={'AA': 'foo', 'BB': 'bar', 'CC': 'foobar'})
df3 = df1.merge(df3, how='inner', indicator=False)
print('df1\n',df1)
print('df2\n',df2)
print('df3\n',df3)
Output
df1
   foo bar  foobar
0   11  aa     111
1   22  ab     222
2   33  ac     333
df2
   AA   BB   CC
0   1  see  123
1  22   ab  222
df3
   foo bar  foobar
0   22  ab     222
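Neither answer shows the second part of the question (the rows of df1 that have no match in df2). A minimal sketch, reusing the positional column mapper from the first answer together with merge's indicator flag:
mapper = dict(zip(df2, df1))  # assumes df2's columns line up positionally with df1's
merged = df1.merge(df2.rename(columns=mapper), how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
For the first example this keeps the 'aa' and 'ac' rows of df1.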

How can I join two dataframes with update in some rows, using Pandas?

I'm new to pandas and I would like to know how I can join two files and update existing lines, taking into account a specific column. The files have thousands of lines. For example:
Df_1:
A B C D
1 2 5 4
2 2 6 8
9 2 2 1
Now, my table 2 has exactly the same columns. I want to join the two tables, replacing the rows that appear in both tables but where column C was changed/updated, and adding the new rows that only exist in this second table (df_2). For example:
Df_2:
A B C D
2 2 7 8
9 2 3 1
3 4 6 7
1 2 3 4
So the result I want is the union of the two tables, with a few rows updated in that specific column, like this:
Df_result:
A B C D
1 2 5 4
2 2 7 8
9 2 3 1
3 4 6 7
1 2 3 4
How can I do this with the merge or concatenate function? Or is there another way to get the result I want?
Thank you!
You need at least one column as a reference, i.e. something that tells you which rows need to be updated.
I'm assuming that in your case it is "A" and "B".
import pandas as pd
ref = ['A','B']
df_result = pd.concat([df_1, df_2], ignore_index = True)
df_result = df_result.drop_duplicates(subset=ref, keep='last')
Here is a full example.
d = {'col1': [1, 2, 3], 'col2': ["a", "b", "c"], 'col3': ["aa", "bb", "cc"]}
df1 = pd.DataFrame(data=d)
d = {'col1': [1, 4, 5], 'col2': ["a", "d", "f"], 'col3': ["dd","ee", "ff"]}
df2 = pd.DataFrame(data=d)
df_result = pd.concat([df1, df2], ignore_index=True)
df_result = df_result.drop_duplicates(subset=['col1','col2'], keep='last')
df_result
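For reference, this example produces the following, because the duplicate (1, 'a') key keeps only the row coming from df2:
   col1 col2 col3
1     2    b   bb
2     3    c   cc
3     1    a   dd
4     4    d   ee
5     5    f   ff
Add .reset_index(drop=True) at the end if you want a clean 0-based index.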

Python/Pandas - What is the most efficient way to replace values in specific columns [duplicate]

This question already has an answer here:
Replace values in a pandas series via dictionary efficiently
(1 answer)
Closed 4 years ago.
Suppose you have a data frame
df = pd.DataFrame({'a':[1,2,3,4],'b':[2,4,6,8],'c':[2,4,5,6]})
and you want to replace specific values in columns 'a' and 'c' (but not 'b'). For example, replacing 2 with 20, and 4 with 40.
The following will not work since it is setting values on a copy of a slice of the DataFrame:
df[['a','c']].replace({2:20, 4:40}, inplace=True)
A loop will work:
for col in ['a','c']:
    df[col].replace({2: 20, 4: 40}, inplace=True)
But a loop seems inefficient. Is there a better way to do this?
According to the documentation on replace, you can specify a dictionary for each column:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [2, 4, 5, 6]})
lookup = {col : {2: 20, 4: 40} for col in ['a', 'c']}
df.replace(lookup, inplace=True)
print(df)
Output
    a  b   c
0   1  2  20
1  20  4  40
2   3  6   5
3  40  8   6
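If you prefer to avoid inplace (and the chained-assignment pitfall from the question), assigning the replaced subset back to the same columns also works:
df[['a', 'c']] = df[['a', 'c']].replace({2: 20, 4: 40})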

Merge multiple data frames with different dimensions using Pandas [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have the following data frames (in reality there are more than 3).
import pandas as pd
df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'],'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar','qux'],'val': [1, 2, 3,10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar',],'val': [20, 100]})
# Note that the value in column 'head' is always unique
What I want to do is to merge them based on the head column. Whenever a head value does not exist in one of the data frames, it should be assigned NA.
In the end it will look like this:
     head1  head2  head3
foo     11      1     NA
bix     22     NA     NA
bar     32      3    100
xoo     NA      2     20
qux     NA     10     NA
How can I achieve that using Pandas?
You can use pandas.concat selecting the axis=1 to concatenate your multiple DataFrames.
Note however that I've first set the index of df1, df2 and df3 to use the key values (foo, bar, etc.) rather than the default integers.
import pandas as pd
df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'],'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar','qux'],'val': [1, 2, 3,10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar',],'val': [20, 100]})
df1 = df1.set_index('head1')
df2 = df2.set_index('head2')
df3 = df3.set_index('head3')
df = pd.concat([df1, df2, df3], axis = 1)
columns = ['head1', 'head2', 'head3']
df.columns = columns
print(df)
     head1  head2  head3
bar     32      3    100
bix     22    NaN    NaN
foo     11      1    NaN
qux    NaN     10    NaN
xoo    NaN      2     20
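As an alternative in the spirit of Pandas Merging 101, you can also rename each key column to a common name and chain outer merges over the original (un-indexed) frames; a minimal sketch, where the val1/val2/val3 names are just illustrative:
from functools import reduce

dfs = [df1.rename(columns={'head1': 'head', 'val': 'val1'}),
       df2.rename(columns={'head2': 'head', 'val': 'val2'}),
       df3.rename(columns={'head3': 'head', 'val': 'val3'})]
# chain outer merges on the shared 'head' column
merged = reduce(lambda left, right: left.merge(right, on='head', how='outer'), dfs)
print(merged)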
