Trouble Merging a Dataframe with a Series in Pandas - python

I've run into a bit of a sticky problem with pandas merge functionality. Here's a toy example of my issue:
df1 = pd.DataFrame({'A': [0, 1, 2, 3],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
s = pd.Series(['E0', 'E1', 'E2', 'E3'], index=[0, 1, 2, 3])
If I now want to left merge these based on column A in the dataframe and the index in the series, i.e.:
pd.merge(df1, s, how='left', left_on='A', right_index=True)
I get the following error:
IndexError: list index out of range
Which I don't really understand. To make things more confusing, replacing the series with another dataframe means that everything works just fine:
df2 = pd.DataFrame({'E': ['E0', 'E1', 'E2', 'E3'],
                    'F': ['F0', 'F1', 'F2', 'F3']},
                   index=[0, 1, 2, 3])
pd.merge(df1, df2, how='left', left_on='A', right_index=True)
Gives:
A B C D E F
0 0 B0 C0 D0 E0 F0
1 1 B1 C1 D1 E1 F1
2 2 B2 C2 D2 E2 F2
3 3 B3 C3 D3 E3 F3
I'm probably missing something very basic, but I suspect this is one of those things that would be helpful to know for the future!
Many thanks in advance.

pd.merge expects DataFrames as its first two arguments. The second argument cannot be a Series. However, you could convert s to a frame using its to_frame method:
In [10]: pd.merge(df1, s.to_frame(), how='left', left_on='A', right_index=True)
Out[10]:
A B C D 0
0 0 B0 C0 D0 E0
1 1 B1 C1 D1 E1
2 2 B2 C2 D2 E2
3 3 B3 C3 D3 E3
Notice that the last column name is 0. You can control the name of that column by giving the Series s a name:
In [15]: s.name = 'Foo'
Then the last column name becomes Foo:
In [17]: pd.merge(df1, s.to_frame(), how='left', left_on='A', right_index=True)
Out[17]:
A B C D Foo
0 0 B0 C0 D0 E0
1 1 B1 C1 D1 E1
2 2 B2 C2 D2 E2
3 3 B3 C3 D3 E3
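As a side note beyond the original answer: to_frame also accepts the column name directly, and newer pandas versions accept a named Series as the right operand of merge, so either of the following sketches should give the same Foo column (assuming the df1 and s defined above):
pd.merge(df1, s.to_frame(name='Foo'), how='left', left_on='A', right_index=True)
pd.merge(df1, s.rename('Foo'), how='left', left_on='A', right_index=True)  # named Series, newer pandas only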

Related

DataFrame merge to exclude extra index column [duplicate]

I want to merge two dataframes, df1 and df2, and use the script below:
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df = df1.join(df2.iloc[1:].reset_index(), how='inner')
But there is an extra column named 'index' which I don't want. Is there any way to remove it?
A B index C D
0 A0 B0 1 C1 D1
1 A1 B1 2 C2 D2
2 A2 B2 3 C3 D3
What I want is just like this:
A B C D
0 A0 B0 C1 D1
1 A1 B1 C2 D2
2 A2 B2 C3 D3
Just pass drop=True to reset_index in your code.
df = df1.join(df2.iloc[1:].reset_index(drop=True), how='inner')
A B C D
0 A0 B0 C1 D1
1 A1 B1 C2 D2
2 A2 B2 C3 D3

Joining two dataframes using paired index keys

I have two dataframes from which I've identified match candidates to merge together into one dataframe. Each original dataframe has its own index and the indices do not correspond to each other (and there are no other columns to merge on). However, I have an array of paired indices indicating which individual record pairs should be joined together. I have been unable to find any resources for this type of merge, including in the merging guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html.
The method I've come up with is clunky and slow, and I'm sure there's a better way.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

# The array of paired index keys
matches = np.array([(3, 4),
                    (1, 5),
                    (0, 6),
                    (2, 7)])

keys = {i[0]: i[1] for i in matches}
merged = pd.DataFrame()
for key, value in keys.items():
    left = df1.loc[[key]].reset_index()
    right = df2.loc[[value]].reset_index()
    together = pd.concat([left, right], axis=1)
    merged = merged.append(together)
You could do something like this: sort the dataframes according to matches using reindex, then reset_index and pd.concat the dataframes:
d1 = df1.reindex(matches[:,0]).reset_index()
d2 = df2.reindex(matches[:,1]).reset_index()
pd.concat([d1,d2], axis=1)
Output:
index A B C D index A B C D
0 3 A3 B3 C3 D3 4 A4 B4 C4 D4
1 1 A1 B1 C1 D1 5 A5 B5 C5 D5
2 0 A0 B0 C0 D0 6 A6 B6 C6 D6
3 2 A2 B2 C2 D2 7 A7 B7 C7 D7
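If the duplicated index columns in that output are unwanted, a small follow-up (my addition, not from the answer) is to drop them by label after concatenating:
pd.concat([d1, d2], axis=1).drop(columns='index')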
In case you don't want the index columns in your dataframe, converting your array of paired indices into a dictionary and using the rename function can also be an approach:
matches = {3: 4, 1: 5, 0: 6, 2: 7}
print(pd.concat([df1.rename(index=matches), df2], axis=1))
However, if you want to keep matches declared as an np.array, the code below will do the trick:
matches = np.array([(3, 4),
                    (1, 5),
                    (0, 6),
                    (2, 7)])
print(pd.concat([df1.rename(index=dict(matches)), df2], axis=1))
In both cases, output will be:
A B C D A B C D
4 A3 B3 C3 D3 A4 B4 C4 D4
5 A1 B1 C1 D1 A5 B5 C5 D5
6 A0 B0 C0 D0 A6 B6 C6 D6
7 A2 B2 C2 D2 A7 B7 C7 D7
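Another possible sketch, not from the thread (the pairs frame and its left_idx/right_idx column names are purely illustrative): treat the matches array as a small key table and merge each original frame against it by index:
pairs = pd.DataFrame(matches, columns=['left_idx', 'right_idx'])
out = (pairs
       .merge(df1, left_on='left_idx', right_index=True)
       .merge(df2, left_on='right_idx', right_index=True, suffixes=('_df1', '_df2')))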

Merge two DataFrames with different column names

I have two DataFrames with different values and different numbers of rows. All index values in the second DataFrame are present in the first, so I need a DataFrame with the same number of rows as the second DataFrame and the data from both DataFrames.
Example
df1 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[['a1', 'b1', 'c1'], ['a2', 'b2', 'c2'], ['a3', 'b3', 'c3']],
                   index=[1, 2, 3])
df2 = pd.DataFrame(columns=["D", "E"],
                   data=[['d2', 'e2'], ['d3', 'e3']],
                   index=[2, 3])
print (df1)
print (df2)
Out:
A B C
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
D E
2 d2 e2
3 d3 e3
I want:
A B C D E
2 a2 b2 c2 d2 e2
3 a3 b3 c3 d3 e3
I tried this:
merge = df1.merge(df2, how='inner', left_on=df1.index, right_on=df2.index)
But I receive an error:
IndexError: indices are out-of-bounds
Note: Not all indices in df1 are in df2, but all indices in df2 are in df1.
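A minimal sketch of one way to do this (not from the original thread): since both join keys are the index, merge on the indexes directly instead of passing them to left_on/right_on, or use join, which aligns on the index by default:
merge = df1.merge(df2, how='inner', left_index=True, right_index=True)
# or, equivalently for this case:
merge = df1.join(df2, how='inner')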

pandas dataframe row manipulation

I'm sure that I'm missing something simple, but I haven't been able to figure this one out.
I have a DataFrame in Pandas with multiple rows that have the same keys, but different information. I want to place these rows onto the same row.
df = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                   'B': ['B0', 'B1', 'B2', 'B3']})
This will give me a dataframe with 4 rows and 3 columns, but there is a duplicate value 'K0' in 'key'.
Is there any way to turn this into a dataframe with 3 rows, and 5 columns like shown below?
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'A': ['A0', 'A2', 'A3'],
                    'B': ['B0', 'B2', 'B3'],
                    'A_1': ['A1', 'NaN', 'NaN'],
                    'B_1': ['B1', 'NaN', 'NaN']})
Perform a groupby on the cumcount of 'key', then concatenate the individual groups together.
gps = []
for i, g in df.groupby(df.groupby('key').cumcount()):
    gps.append(g.drop('key', axis=1).add_suffix(i + 1).reset_index(drop=True))
r = pd.concat(gps, axis=1).sort_index(axis=1)
r['key'] = df.key.unique()
r
A1 A2 B1 B2 key
0 A0 A1 B0 B1 K0
1 A2 NaN B2 NaN K1
2 A3 NaN B3 NaN K2
You can shorten this somewhat using a list comprehension:
r = pd.concat(
        [g.drop('key', axis=1).add_suffix(i + 1).reset_index(drop=True)
         for i, g in df.groupby(df.groupby('key').cumcount())],
        axis=1).sort_index(axis=1)
r['key'] = df.key.unique()
r
A1 A2 B1 B2 key
0 A0 A1 B0 B1 K0
1 A2 NaN B2 NaN K1
2 A3 NaN B3 NaN K2
Let's use set_index, groupby, cumcount, and unstack, then flatten the resulting MultiIndex columns with map and format:
df_out = df.set_index(['key', df.groupby('key').cumcount()]).unstack()
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()
Output:
key A_0 A_1 B_0 B_1
0 K0 A0 A1 B0 B1
1 K1 A2 None B2 None
2 K2 A3 None B3 None
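A close variant, not from the original answer (the helper column n is illustrative): pivot builds the same wide shape in one step, and the MultiIndex columns can be flattened the same way:
df_out = df.assign(n=df.groupby('key').cumcount()).pivot(index='key', columns='n')
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()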
I think this alters the layout. Just put key as an index to access the fields:
df2 = df.set_index([df.key, df.index])
Then
In [248]: df2.loc['K1']
Out[248]:
A B key
2 A2 B2 K1
In [249]: df2.loc['K0']
Out[249]:
A B key
0 A0 B0 K0
1 A1 B1 K0
and iterate over the rows.

pandas concat ignore_index doesn't work

I am trying to column-bind dataframes and am having an issue with pandas concat, as ignore_index=True doesn't seem to work:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 2, 3, 4])
df2 = pd.DataFrame({'A1': ['A4', 'A5', 'A6', 'A7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D2': ['D4', 'D5', 'D6', 'D7']},
                   index=[5, 6, 7, 3])
df1
# A B D
# 0 A0 B0 D0
# 2 A1 B1 D1
# 3 A2 B2 D2
# 4 A3 B3 D3
df2
# A1 C D2
# 5 A4 C4 D4
# 6 A5 C5 D5
# 7 A6 C6 D6
# 3 A7 C7 D7
dfs = [df1, df2]
df = pd.concat(dfs, axis=1, ignore_index=True)
print(df)
and the result is
0 1 2 3 4 5
0 A0 B0 D0 NaN NaN NaN
2 A1 B1 D1 NaN NaN NaN
3 A2 B2 D2 A7 C7 D7
4 A3 B3 D3 NaN NaN NaN
5 NaN NaN NaN A4 C4 D4
6 NaN NaN NaN A5 C5 D5
7 NaN NaN NaN A6 C6 D6
Even if I reset index using
df1.reset_index()
df2.reset_index()
and then try
pd.concat([df1,df2],axis=1)
it still produces the same result!
If I understood you correctly, this is what you would like to do.
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 2, 3, 4])
df2 = pd.DataFrame({'A1': ['A4', 'A5', 'A6', 'A7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D2': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
df = pd.concat([df1, df2], axis=1)
Which gives:
A B D A1 C D2
0 A0 B0 D0 A4 C4 D4
1 A1 B1 D1 A5 C5 D5
2 A2 B2 D2 A6 C6 D6
3 A3 B3 D3 A7 C7 D7
Actually, I would have expected that df = pd.concat(dfs,axis=1,ignore_index=True) gives the same result.
This is the excellent explanation from jreback:
ignore_index=True ‘ignores’, meaning doesn’t align on the joining axis. it simply pastes them together in the order that they are passed, then reassigns a range for the actual index (e.g. range(len(index)))
so the difference between joining on non-overlapping indexes (assume axis=1 in the example), is that with ignore_index=False (the default), you get the concat of the indexes, and with ignore_index=True you get a range.
The ignore_index option is working in your example; you just need to know that it ignores the labels along the axis of concatenation, which in your case is the columns. (Perhaps a better name would be ignore_labels.) If you want the concatenation to ignore the index labels, then your axis variable has to be set to 0 (the default).
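A small illustration of that point, assuming the df1 and df2 defined in the question (this sketch is mine, not part of the quoted explanation):
# With axis=1, ignore_index discards the column labels and replaces them with a range:
pd.concat([df1, df2], axis=1, ignore_index=True).columns  # RangeIndex(start=0, stop=6, step=1)
# With axis=0 (the default), it is the row labels that are replaced instead:
pd.concat([df1, df2], axis=0, ignore_index=True).index    # RangeIndex(start=0, stop=8, step=1)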
In case you want to retain the index of the left data frame, set the index of df2 to be df1's index using set_index:
pd.concat([df1, df2.set_index(df1.index)], axis=1)
Agree with the comments, always best to post expected output.
Is this what you are seeking?
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 2, 3, 4])
df2 = pd.DataFrame({'A1': ['A4', 'A5', 'A6', 'A7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D2': ['D4', 'D5', 'D6', 'D7']},
                   index=[5, 6, 7, 3])
df1 = df1.transpose().reset_index(drop=True).transpose()
df2 = df2.transpose().reset_index(drop=True).transpose()
dfs = [df1, df2]
df = pd.concat(dfs, axis=0, ignore_index=True)
print(df)
0 1 2
0 A0 B0 D0
1 A1 B1 D1
2 A2 B2 D2
3 A3 B3 D3
4 A4 C4 D4
5 A5 C5 D5
6 A6 C6 D6
7 A7 C7 D7
You can use numpy's concatenate to achieve the result.
import numpy as np

cols = df1.columns.to_list() + df2.columns.to_list()
dfs = [df1, df2]
df = np.concatenate(dfs, axis=1)
df = pd.DataFrame(df, columns=cols)
Out[1]:
A B D A1 C D2
0 A0 B0 D0 A4 C4 D4
1 A1 B1 D1 A5 C5 D5
2 A2 B2 D2 A6 C6 D6
3 A3 B3 D3 A7 C7 D7
Thanks for asking. I had the same issue.
For some reason "ignore_index=True" doesn't help in my case.
I wanted to keep the index from the first dataset and ignore the second index, and this worked for me (note that reset_index must be called without inplace=True here, otherwise it returns None):
X_train = pd.concat([train_sp, X_train.reset_index(drop=True)], axis=1)
