Merging/Concat/Joining two dataframes - python

I have a pandas dataframe with a distinct code identifier, as detailed below:
df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5], ['f', 5]],
                   columns=['code', 'value1'])
and a second dataframe as follows:
df2 = pd.DataFrame([['a', 11], ['b', 12], ['c', 13], ['d', 14], ['e', 15], ['f', 16],
                    ['g', 17], ['h', 2], ['i', 3], ['j', 4], ['k', 5], ['l', 5]],
                   columns=['code', 'value2'])
I would like to see only the codes identified in df1 (i.e. a-f) and have a third column entitled value2.
I have tried
df1 = df1.join(df2, on = 'Code')
but I keep getting NaN values.
I have looked in several places and seen merge, concat and join, but none of them appears to work.

Try this:
df1 = df1.merge(df2, on = 'code')
since you named the column 'code', not 'Code'.
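For what it's worth (an extra note, not part of the answer above): join matches on the other frame's index rather than on a column, so even with the correct name 'code', a plain df1.join(df2, on='code') would still not line the codes up, because df2 is indexed 0-11 rather than by code. A minimal join-based sketch, assuming the df1 and df2 defined in the question:
# join looks up df1['code'] in the index of the other frame, so index df2 by code first
df1.join(df2.set_index('code'), on='code')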

To see only the codes identified in df1 (i.e. a-f) and add a third column entitled value2, use the merge method with how='inner' and on='code':
>>> df1.merge(df2, how='inner', on='code')
code value1 value2
0 a 1 11
1 b 2 12
2 c 3 13
3 d 4 14
4 e 5 15
5 f 5 16

Use:
>>> df1.merge(df2, how='inner', on='code')
code value1 value2
0 a 1 11
1 b 2 12
2 c 3 13
3 d 4 14
4 e 5 15
5 f 5 16
Or did you mean a merge with how='outer'?
>>> df1.merge(df2, how='outer', on='code')
code value1 value2
0 a 1.0 11
1 b 2.0 12
2 c 3.0 13
3 d 4.0 14
4 e 5.0 15
5 f 5.0 16
6 g NaN 17
7 h NaN 2
8 i NaN 3
9 j NaN 4
10 k NaN 5
11 l NaN 5
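As a side note rather than part of the answers above: an inner merge keeps exactly df1's codes here only because every code in df1 also appears in df2. If some of df1's codes could be missing from df2 and should still show up in the result (with NaN in value2), a left merge is the hedged alternative:
# keep every row of df1; value2 becomes NaN for any code with no match in df2
df1.merge(df2, how='left', on='code')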

Related

How can I merge two dataframes that have the same columns but different row values? [duplicate]

This question was closed as a duplicate of: What is the difference between combine_first and fillna?
I'm trying to put together two dataframes that have the same columns and number of rows, but one of them has NaN in some rows and the other doesn't.
This example uses 2 dataframes, but I have to do this with around 50 and end up with all of them merged into one.
DF1:
id b c
0 1 15 1
1 2 nan nan
2 3 2 3
3 4 nan nan
DF2:
id b c
0 1 nan nan
1 2 26 6
2 3 nan nan
3 4 60 3
Desired output:
id b c
0 1 15 1
1 2 26 6
2 3 2 3
3 4 60 3
If you have
df1 = pd.DataFrame(np.nan, index=[0, 1], columns=[0, 1])
df2 = pd.DataFrame([[0, np.nan], [0, np.nan]], index=[0, 1], columns=[0, 1])
df3 = pd.DataFrame([[np.nan, 1], [np.nan, 1]], index=[0, 1], columns=[0, 1])
Then you can update df1
for df in [df2, df3]:
    df1.update(df)

print(df1)
0 1
0 0.0 1.0
1 0.0 1.0
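For the asker's actual case (around 50 frames sharing an id column, each filling gaps in the others), a hedged sketch, assuming id is unique within every frame, is to align the frames on id and fold them together with combine_first, the method the linked duplicate also points to. The names frames and combined below are illustrative only:
import pandas as pd
import numpy as np
from functools import reduce

df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'b': [15, np.nan, 2, np.nan], 'c': [1, np.nan, 3, np.nan]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4], 'b': [np.nan, 26, np.nan, 60], 'c': [np.nan, 6, np.nan, 3]})

frames = [df1, df2]  # in practice this list would hold all ~50 dataframes
# combine_first keeps the left frame's values and fills its NaNs from the right frame, aligned on id
combined = reduce(lambda left, right: left.combine_first(right),
                  (f.set_index('id') for f in frames)).reset_index()
print(combined)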

Setting the index after merging with pandas?

Executing the following merge
import pandas as pd
s = pd.Series(range(5, 10), index=range(10, 15), name='score')
df = pd.DataFrame({'id': (11, 13), 'value': ('a', 'b')})
pd.merge(s, df, 'left', left_index=True, right_on='id')
results in this data frame:
score id value
NaN 5 10 NaN
0.0 6 11 a
NaN 7 12 NaN
1.0 8 13 b
NaN 9 14 NaN
Why does Pandas take the index from the right data frame as the index for the result, instead of the index from the left series, even though I specified both a left merge and left_index=True? The documentation says
left: use only keys from left frame
which I interpreted differently from the result I am actually getting. What I expected was the following data frame.
score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
I am using Python 3.7.5 with Pandas 0.25.3.
Here's what happens:
- for rows where the merge keys match, the output index is taken from the right frame's index (here [0, 1])
- missing keys are replaced with NaN
- the NaNs cause the index dtype to be upcast to float
To set the index, just assign to it:
s2 = pd.merge(s, df, how='left', left_index=True, right_on='id')
s2.index = s.index
score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
You can also merge on s (just because I dislike calling pd.merge directly):
(s.to_frame()
   .merge(df, how='left', left_index=True, right_on='id')
   .set_axis(s.index, axis=0, inplace=False))
score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
You can do this with reset_index:
df = (pd.merge(s, df, 'left', left_index=True, right_on='id')
        .reset_index(drop=True)
        .set_index('id')
        .rename_axis(index=None))
df.insert(1, 'id', df.index)
score id value
10 5 10 NaN
11 6 11 a
12 7 12 NaN
13 8 13 b
14 9 14 NaN
Since I do not need the duplicated information in both the id column and the index, I went with a combination of the answers from cs95 and oppressionslayer, and did the following:
pd.merge(s, df, 'left', left_index=True, right_on='id').set_index('id')
Which results in this data frame:
score value
id
10 5 NaN
11 6 a
12 7 NaN
13 8 b
14 9 NaN
Since this is different from what I initially asked for, I am leaving the answer from cs95 as the accepted answer, but I think this use case needs to be documented as well.

Merge small file into big file and give NaNs for the rows that do not match in python

I would like to merge two data frames, a big one and a small one. An example of the data frames follows:
# small data frame construction
>>> d1 = {'col1': ['A', 'B'], 'col2': [3, 4]}
>>> df1 = pd.DataFrame(data=d1)
>>> df1
col1 col2
0 A 3
1 B 4
# big data frame construction
>>> d2 = {'col1': ['A', 'B', 'C', 'D', 'E'], 'col2': [3, 4, 6, 7, 8]}
>>> df2 = pd.DataFrame(data=d2)
>>> df2
col1 col2
0 A 3
1 B 4
2 C 6
3 D 7
4 E 8
The code I am looking for should produce the following output (a data frame with the big data frame's shape and column names, and NaNs in the rows that were not matched by the small data frame):
col1 col2
0 A 3
1 B 4
2 NA NA
3 NA NA
4 NA NA
The code I have tried:
>>> print(pd.merge(df1, df2, left_index=True, right_index=True, how='right', sort=False))
col1_x col2_x col1_y col2_y
0 A 3.0 A 3
1 B 4.0 B 4
2 NaN NaN C 6
3 NaN NaN D 7
4 NaN NaN E 8
You can pass the suffixes parameter so that the overlapping columns coming from df2 get a trailing _, and then drop those added columns using Series.str.endswith, an inverted mask (~) and boolean indexing with loc:
df = pd.merge(df1, df2,
              left_index=True,
              right_index=True,
              how='right',
              sort=False,
              suffixes=('', '_'))
print (df)
col1 col2 col1_ col2_
0 A 3.0 A 3
1 B 4.0 B 4
2 NaN NaN C 6
3 NaN NaN D 7
4 NaN NaN E 8
df = df.loc[:, ~df.columns.str.endswith('_')]
print (df)
col1 col2
0 A 3.0
1 B 4.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
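As an aside, not from the original answer: since the two frames here line up purely by their integer index, a simpler hedged sketch is to skip the merge and reindex the small frame to the big frame's index, which yields the big frame's shape and column names with NaN for the rows the small frame does not cover:
# rows 2-4 have no counterpart in df1 and come back as NaN
df = df1.reindex(df2.index)
print(df)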

Join/merge dataframes and preserve the row-order

I work in python and pandas.
Let's suppose that I have the following two dataframes df_1 and df_2 (INPUT):
# df1
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
# df2
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
I want to process it to join/merge them to get a new dataframe which looks like that (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
So basically it is a right merge/join, but preserving the row order of the original right dataframe.
However, if I do this:
df_2 = df_1.merge(df_2[['A', 'B']], on=['A', 'B'], how='right')
then I get this:
A B C
0 5 1 1.0
1 2 7 NaN
2 3 3 NaN
3 5 0 NaN
So I get the right rows joined/merged but the output dataframe does not have the same row-order as the original right dataframe.
How can I do the join/merge and preserve the row-order too?
The code to create the original dataframes is the following:
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
columns = ['A', 'B', 'C']
data_2 = [[2, 5, 3, 5], [7, 1, 3, 0], [np.nan, np.nan, np.nan, np.nan]]
data_2 = np.array(data_2).T
df_2 = pd.DataFrame(data=data_2, columns=columns)
I think that by using either .join() or .update() I could get what I want but to start with I am quite surprised that .merge() does not do this very simple thing too.
I think it is a bug.
A possible solution is a left join:
df_2 = df_2.merge(df_1, on=['A', 'B'], how='left', suffixes=('_','')).drop('C_', axis=1)
print (df_2)
A B C
0 2.0 7.0 NaN
1 5.0 1.0 1.0
2 3.0 3.0 NaN
3 5.0 0.0 NaN
You can play with the index between the two dataframes:
print(df)
# A B C
# 0 5 1 1.0
# 1 2 7 NaN
# 2 3 3 NaN
# 3 5 0 NaN
df = df.set_index('B')
df = df.reindex(index=df_2['B'])
df = df.reset_index()
df = df[['A', 'B', 'C']]
print(df)
# A B C
# 0 2 7.0 NaN
# 1 5 1.0 1.0
# 2 3 3.0 NaN
# 3 5 0.0 NaN
One quick way is:
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
As I discussed with @jezrael above, and if I am not missing something: if you do not need both C columns from the original dataframes but only the column C with the matching values, then .update() is the quickest way, since you do not have to drop the columns that you do not need.
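Another common pattern, offered here only as a hedged extra sketch rather than one of the answers above, is to carry the right frame's original row positions through the merge and sort on them afterwards; this restores the order for any merge type:
# remember df_2's original positions, merge, then put the rows back in that order
out = (df_2[['A', 'B']].reset_index()
         .merge(df_1, on=['A', 'B'], how='left')
         .sort_values('index')
         .drop(columns='index')
         .reset_index(drop=True))
print(out)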

Concatenating dataframes creates too many columns

I am reading in a number of csv files using a loop; all of them have 38 columns. I add them all to a list and then concatenate them into a single dataframe. My issue is that despite all these csv files having 38 columns, my resultant dataframe somehow ends up with 105 columns.
How can I make the resultant dataframe have the correct 38 columns and stack all of the rows on top of each other?
import boto3
import pandas as pd
import io

s3 = boto3.resource('s3')
client = boto3.client('s3')
bucket = s3.Bucket('alpha-enforcement-data-engineering')
appended_data = []
for obj in bucket.objects.filter(Prefix='closed/closed_processed/year_201'):
    print(obj.key)
    df = pd.read_csv(f's3://alpha-enforcement-data-engineering/{obj.key}', low_memory=False)
    print(df.shape)
    appended_data.append(df)

df_closed = pd.concat(appended_data, axis=0, sort=False)
print(df_closed.shape)
TL;DR: check your column headers.
c = appended_data[0].columns
df_closed = pd.concat(
    [df.set_axis(c, axis=1, inplace=False) for df in appended_data], sort=False)
This happens because your column headers are different. Pandas will align your DataFrames on the headers when concatenating vertically, and will insert empty columns for DataFrames where that header is not present. Here's an illustrative example:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})
df
A B
0 1 4
1 2 5
2 3 6
df2
C D
0 7 10
1 8 11
2 9 12
pd.concat([df, df2], axis=0, sort=False)
A B C D
0 1.0 4.0 NaN NaN
1 2.0 5.0 NaN NaN
2 3.0 6.0 NaN NaN
0 NaN NaN 7.0 10.0
1 NaN NaN 8.0 11.0
2 NaN NaN 9.0 12.0
This creates 4 columns, whereas you wanted only two. Try:
df2.columns = df.columns
pd.concat([df, df2], axis=0, sort=False)
A B
0 1 4
1 2 5
2 3 6
0 7 10
1 8 11
2 9 12
Which works as expected.
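Since the root cause is headers that differ between files, often only by case or stray whitespace, a hedged extra sketch (assuming the differences really are cosmetic) is to normalize each frame's column names before concatenating, instead of overwriting them with the first file's headers:
# strip whitespace and lowercase the headers so identical columns line up
for df in appended_data:
    df.columns = df.columns.str.strip().str.lower()
df_closed = pd.concat(appended_data, axis=0, sort=False)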
