I work in Python with pandas.
Let's suppose that I have the following two dataframes, df_1 and df_2 (INPUT):
# df_1
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
# df_2
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
I want to join/merge them to get a new dataframe which looks like this (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
So basically it is a right merge/join, but preserving the row order of the original right dataframe.
However, if I do this:
df_2 = df_1.merge(df_2[['A', 'B']], on=['A', 'B'], how='right')
then I get this:
A B C
0 5 1 1.0
1 2 7 NaN
2 3 3 NaN
3 5 0 NaN
So the rows are joined/merged correctly, but the output dataframe does not have the same row order as the original right dataframe.
How can I do the join/merge and preserve the row-order too?
The code to create the original dataframes is the following:
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
columns = ['A', 'B', 'C']
data_2 = [[2, 5, 3, 5], [7, 1, 3, 0], [np.nan, np.nan, np.nan, np.nan]]
data_2 = np.array(data_2).T
df_2 = pd.DataFrame(data=data_2, columns=columns)
I think that by using either .join() or .update() I could get what I want, but to start with I am quite surprised that .merge() does not do this very simple thing too.
I think it is a bug.
Possible solution with left join:
df_2 = df_2.merge(df_1, on=['A', 'B'], how='left', suffixes=('_','')).drop('C_', axis=1)
print (df_2)
A B C
0 2.0 7.0 NaN
1 5.0 1.0 1.0
2 3.0 3.0 NaN
3 5.0 0.0 NaN
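A shorter variant of the same idea is to merge only the key columns, so no suffixes are needed, and assign the matched column back (a sketch, assuming each (A, B) pair in df_2 matches at most one row in df_1):
# merge df_2's keys against df_1, keep the order and length of df_2, then assign C back
df_2['C'] = df_2[['A', 'B']].merge(df_1, on=['A', 'B'], how='left')['C'].to_numpy()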
You can play with the index shared between the two dataframes:
print(df)
# A B C
# 0 5 1 1.0
# 1 2 7 NaN
# 2 3 3 NaN
# 3 5 0 NaN
df = df.set_index('B')
df = df.reindex(index=df_2['B'])
df = df.reset_index()
df = df[['A', 'B', 'C']]
print(df)
# A B C
# 0 2 7.0 NaN
# 1 5 1.0 1.0
# 2 3 3.0 NaN
# 3 5 0.0 NaN
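If 'B' alone is not unique, the same trick works with both key columns as a MultiIndex (a sketch, assuming the (A, B) pairs are unique):
df = df.set_index(['A', 'B'])
df = df.reindex(index=pd.MultiIndex.from_frame(df_2[['A', 'B']]))  # reorder to df_2's key order
df = df.reset_index()[['A', 'B', 'C']]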
One quick way is:
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
As I discussed with jezrael above, and if I am not missing something: if you do not need both C columns from the original dataframes but only the column C with the matching values, then .update() is the quickest way, since you do not have to drop the columns you do not need.
Related
I might be doing something wrong, but I was trying to calculate a rolling average (let's use sum instead in this example for simplicity) after grouping the dataframe. Up to that point it all works well, but when I apply a shift I find the values spill over into the group below. See the example below:
import pandas as pd
df = pd.DataFrame({'X': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Y': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
grouped_df = df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum().shift(periods=1)
print(grouped_df)
Expected result:
X
A 0 NaN
1 NaN
2 3.0
B 3 NaN
4 NaN
5 3.0
C 6 NaN
7 NaN
8 3.0
Result I actually get:
X
A 0 NaN
1 NaN
2 3.0
B 3 5.0
4 NaN
5 3.0
C 6 5.0
7 NaN
8 3.0
You can see the result of A2 gets passed to B3 and the result of B5 to C6. I'm not sure whether this is the intended behaviour and I'm doing something wrong, or whether there is a bug in pandas?
Thanks
The problem is that
df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
returns a new Series; when you then chain .shift(), you shift that Series as a whole, not within each group.
You need another groupby to shift within the group:
grouped_df = (df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
.groupby(level=0).shift(periods=1)
)
Or use groupby.transform:
grouped_df = (df.groupby('X')['Y']
.transform(lambda x: x.rolling(window=2, min_periods=2)
.sum().shift(periods=1))
)
Output:
X
A 0 NaN
1 NaN
2 3.0
B 3 NaN
4 NaN
5 3.0
C 6 NaN
7 NaN
8 3.0
Name: Y, dtype: float64
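If you want the result back as a column of the original df: the first approach returns a Series with a ('X', original index) MultiIndex, so drop the group level before assigning (a sketch; 'Y_roll' is just an illustrative name):
df['Y_roll'] = grouped_df.droplevel(0)  # drop the 'X' level so the Series aligns with df's index
The transform version already aligns with df's index, so it can be assigned directly.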
This question already has an answer here:
What is the difference between combine_first and fillna?
(1 answer)
Closed 2 years ago.
I'm trying to put together two dataframes that have the same columns and number of rows, but one of them has NaN in some rows and the other doesn't.
This example is with 2 dataframes, but I have to do this with around 50 dataframes and get them all merged into one.
DF1:
id b c
0 1 15 1
1 2 nan nan
2 3 2 3
3 4 nan nan
DF2:
id b c
0 1 nan nan
1 2 26 6
2 3 nan nan
3 4 60 3
Desired output:
id b c
0 1 15 1
1 2 26 6
2 3 2 3
3 4 60 3
If you have
df1 = pd.DataFrame(np.nan, index=[0, 1], columns=[0, 1])
df2 = pd.DataFrame([[0, np.nan], [0, np.nan]], index=[0, 1], columns=[0, 1])
df3 = pd.DataFrame([[np.nan, 1], [np.nan, 1]], index=[0, 1], columns=[0, 1])
Then you can update df1
for df in [df2, df3]:
    df1.update(df)
print(df1)
0 1
0 0.0 1.0
1 0.0 1.0
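For the original problem with ~50 dataframes, the same idea scales by folding combine_first (from the linked duplicate) over the whole list (a sketch; dfs is assumed to be your list of frames, with 'id' identifying rows):
from functools import reduce

# fill the NaNs of each frame with values from the next one, aligned on 'id'
merged = reduce(lambda left, right: left.combine_first(right),
                (df.set_index('id') for df in dfs)).reset_index()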
I am reading in a number of CSV files using a loop; all have 38 columns. I add them all to a list and then concatenate to create a dataframe. My issue is that despite all these CSV files having 38 columns, my resultant dataframe somehow ends up with 105 columns.
How can I make the resultant dataframe have the correct 38 columns and stack all of the rows on top of each other?
import boto3
import pandas as pd
import io
s3 = boto3.resource('s3')
client = boto3.client('s3')
bucket = s3.Bucket('alpha-enforcement-data-engineering')
appended_data = []
for obj in bucket.objects.filter(Prefix='closed/closed_processed/year_201'):
    print(obj.key)
    df = pd.read_csv(f's3://alpha-enforcement-data-engineering/{obj.key}', low_memory=False)
    print(df.shape)
    appended_data.append(df)
df_closed = pd.concat(appended_data, axis=0, sort=False)
print(df_closed.shape)
TL;DR: check your column headers.
c = appended_data[0].columns
df_closed = pd.concat(
    [df.set_axis(c, axis=1, inplace=False) for df in appended_data], sort=False)
This happens because your column headers are different. Pandas will align your DataFrames on the headers when concatenating vertically, and will insert empty columns for DataFrames where that header is not present. Here's an illustrative example:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})
df
A B
0 1 4
1 2 5
2 3 6
df2
C D
0 7 10
1 8 11
2 9 12
pd.concat([df, df2], axis=0, sort=False)
A B C D
0 1.0 4.0 NaN NaN
1 2.0 5.0 NaN NaN
2 3.0 6.0 NaN NaN
0 NaN NaN 7.0 10.0
1 NaN NaN 8.0 11.0
2 NaN NaN 9.0 12.0
This creates 4 columns, whereas you wanted only two. Try:
df2.columns = df.columns
pd.concat([df, df2], axis=0, sort=False)
A B
0 1 4
1 2 5
2 3 6
0 7 10
1 8 11
2 9 12
Which works as expected.
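If you want to see which files introduced the extra headers before forcing them to match, here is a quick diagnostic sketch (appended_data is the list built in the question):
base = set(appended_data[0].columns)
for i, df in enumerate(appended_data[1:], start=1):
    extra = set(df.columns) ^ base          # headers that differ from the first file
    if extra:
        print(i, sorted(map(str, extra)))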
I find that after using pd.concat() to concatenate two dataframes that share a column name, df.fillna() no longer works correctly with the dict parameter specifying which value to use for each column.
I don't know why. Is something wrong with my understanding?
import numpy as np
import pandas as pd

a1 = pd.DataFrame({'a': [1, 2, 3]})
a2 = pd.DataFrame({'a': [1, 2, 3]})
b = pd.DataFrame({'b': [np.nan, 20, 30]})
c = pd.DataFrame({'c': [40, np.nan, 60]})
x = pd.concat([a1,a2, b, c], axis=1)
print(x)
x = x.fillna({'b':10, 'c': 50})
print(x)
Initial dataframe:
a a b c
0 1 1 NaN 40.0
1 2 2 20.0 NaN
2 3 3 30.0 60.0
Data is unchanged after df.fillna():
a a b c
0 1 1 NaN 40.0
1 2 2 20.0 NaN
2 3 3 30.0 60.0
As mentioned in the comments, there's a problem assigning values to a dataframe in the presence of duplicate column names.
However, you can use this workaround:
for col, val in {'b': 10, 'c': 50}.items():
    new_col = x[col].fillna(val)                   # filled copy of the column
    idx = int(x.columns.get_loc(col))              # remember its position
    x = x.drop(col, axis=1)                        # drop the original column...
    x.insert(loc=idx, column=col, value=new_col)   # ...and re-insert the filled one
print(x)
result:
a a b c
0 1 1 10.0 40.0
1 2 2 20.0 50.0
2 3 3 30.0 60.0
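An alternative sketch is to avoid the duplicate label in the first place, e.g. by suffixing the 'a' columns before concatenating; then the plain fillna dict works:
# give the two 'a' columns distinct names so no duplicate labels are created
x = pd.concat([a1.add_suffix('_1'), a2.add_suffix('_2'), b, c], axis=1)
x = x.fillna({'b': 10, 'c': 50})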
I'm working with a huge dataframe in Python and sometimes I need to add an empty row, or several rows, at a specific position in the dataframe. For this question I create a small dataframe df in order to show what I want to achieve.
df = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=['A', 'B', 'C'])
A B C
0 4 5 2
1 6 7 0
2 8 1 9
Let's say I need to add an empty row whenever there is a zero value in column 'C'. Here the empty row should be added after the second row. So in the end I want to have a new dataframe like:
new_df
A B C
0 4 5 2
1 6 7 0
2 nan nan nan
3 8 1 9
I tried with concat and append, but I didn't get what I wanted. Could you help me, please?
You can try it this way:
l = df[df['C'] == 0].index.tolist()   # indices of rows where C == 0
for c, i in enumerate(l):
    # split after the matching row; the offset c accounts for rows already inserted
    dfs = np.split(df, [i + 1 + c])
    df = pd.concat([dfs[0], pd.DataFrame([[np.NaN, np.NaN, np.NaN]], columns=df.columns), dfs[1]], ignore_index=True)
print(df)
Input:
A B C
0 4 3 0
1 4 0 4
2 4 4 2
3 3 2 1
4 3 1 2
5 4 1 4
6 1 0 4
7 0 2 0
8 2 0 3
9 4 1 3
Output:
A B C
0 4.0 3.0 0.0
1 NaN NaN NaN
2 4.0 0.0 4.0
3 4.0 4.0 2.0
4 3.0 2.0 1.0
5 3.0 1.0 2.0
6 4.0 1.0 4.0
7 1.0 0.0 4.0
8 0.0 2.0 0.0
9 NaN NaN NaN
10 2.0 0.0 3.0
11 4.0 1.0 3.0
Last thing: it can happen that the last row has 0 in 'C', so you can add:
if df["C"].iloc[-1] == 0 :
df.loc[len(df)] = [np.NaN, np.NaN, np.NaN]
Try using slicing.
First, you need to find the rows where C == 0, so let's create a boolean mask for this. I'll just name it a:
a = (df['C'] == 0)
So, whenever C == 0, a == True.
Now we need to find the index of each row where C == 0, create an empty row and add it to the df:
df2 = df.copy()  # make a copy because we want to be safe here
for i in df.loc[a].index:
    empty_row = pd.DataFrame([], index=[i])  # creating the empty row
    j = i + 1  # just to make things easier to read
    df2 = pd.concat([df2.loc[:i], empty_row, df2.loc[j:]])  # slicing the df
df2 = df2.reset_index(drop=True)  # reset the index
I must say I don't know the size of your df or whether this is fast enough, but give it a try.
In case you know the index where you want to insert a new row, concat can be a solution.
Example dataframe:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# A B C
# 0 1 4 7
# 1 2 5 8
# 2 3 6 9
Your new row as a dataframe with index 1:
new_row = pd.DataFrame({'A': np.nan, 'B': np.nan,'C': np.nan}, index=[1])
Inserting your new row after the second row:
new_df = pd.concat([df.loc[:1], new_row, df.loc[2:]]).reset_index(drop=True)
# A B C
# 0 1.0 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN NaN NaN
# 3 3.0 6.0 9.0
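The same idea extends to every row where C == 0 (a sketch, assuming df still has the default RangeIndex):
pieces, start = [], 0
for i in df.index[df['C'] == 0]:
    pieces.append(df.loc[start:i])                                   # rows up to and including the zero row
    pieces.append(pd.DataFrame([{c: np.nan for c in df.columns}]))   # one empty row
    start = i + 1
pieces.append(df.loc[start:])                                        # remaining rows
new_df = pd.concat(pieces, ignore_index=True)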
Something like this should work for you, inserting an all-NaN row after each row where C is 0 by giving it a fractional index and then sorting:
zero_idx = df.index[df['C'] == 0]
for key in zero_idx:
    df.loc[key + 0.5] = np.nan    # new all-NaN row that will sort in right after row `key`
df = df.sort_index().reset_index(drop=True)