I want to merge two DataFrames, df1 and df2, using the script below:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df = df1.join(df2.iloc[1:].reset_index(), how='inner')
But the result has an extra column named 'index' which I don't want. Is there any way to remove it?
A B index C D
0 A0 B0 1 C1 D1
1 A1 B1 2 C2 D2
2 A2 B2 3 C3 D3
What I want is just like this:
A B C D
0 A0 B0 C1 D1
1 A1 B1 C2 D2
2 A2 B2 C3 D3
Just pass drop=True to reset_index, so the old index is discarded instead of being inserted as a column:
df = df1.join(df2.iloc[1:].reset_index(drop=True), how='inner')
A B C D
0 A0 B0 C1 D1
1 A1 B1 C2 D2
2 A2 B2 C3 D3
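To see where the extra column comes from, here is a minimal sketch comparing the two calls on the sliced frame. The variable names tail, kept and dropped are only for illustration and are not part of the original script.

```python
import pandas as pd

df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

# Slicing keeps the original labels, so the index here is 1, 2, 3.
tail = df2.iloc[1:]

# Default behaviour: the old index is moved into a new column called 'index'.
kept = tail.reset_index()

# With drop=True the old index is thrown away and rows are renumbered from 0.
dropped = tail.reset_index(drop=True)

print(list(kept.columns))     # ['index', 'C', 'D']
print(list(dropped.columns))  # ['C', 'D']
print(list(dropped.index))    # [0, 1, 2]
```

Because dropped is renumbered 0, 1, 2, it lines up with the index of df1 when joined.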
I have two DataFrames with different values and different numbers of rows. Every index value in the second DataFrame also appears in the first, so I need a DataFrame with the same number of rows as the second one, containing the data from both DataFrames.
Example
df1 = pd.DataFrame(columns=["A", "B", "C"], data=[['a1' , 'b1', 'c1'], ['a2', 'b2', 'c2'], ['a3', 'b3', 'c3']], index=[1,2,3])
df2 = pd.DataFrame(columns=["D", "E"], data=[['d2' , 'e2'], ['d3' , 'e3']], index=[2,3])
print (df1)
print (df2)
Out:
A B C
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
D E
2 d2 e2
3 d3 e3
I want:
A B C D E
2 a2 b2 c2 d2 e2
3 a3 b3 c3 d3 e3
I tried this:
merge = df1.merge(df2, how='inner', left_on=df1.index, right_on=df2.index)
But I get an error:
IndexError: indices are out-of-bounds
Note: not every index value in df1 is in df2, but every index value in df2 is in df1.
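One way to get the wanted result is to merge on the index directly, using the left_index and right_index flags of merge, or equivalently DataFrame.join. This is only a sketch on the example data and is not taken from the original thread; the names merged and joined are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[['a1', 'b1', 'c1'], ['a2', 'b2', 'c2'], ['a3', 'b3', 'c3']],
                   index=[1, 2, 3])
df2 = pd.DataFrame(columns=["D", "E"],
                   data=[['d2', 'e2'], ['d3', 'e3']],
                   index=[2, 3])

# Inner merge on the index of both frames: only labels 2 and 3 survive.
merged = df1.merge(df2, how='inner', left_index=True, right_index=True)

# join aligns on the index by default, so this gives the same rows.
joined = df1.join(df2, how='inner')

print(merged)
```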
I'm sure that I'm missing something simple, but I haven't been able to figure this one out.
I have a DataFrame in Pandas with multiple rows that have the same keys, but different information. I want to place these rows onto the same row.
df = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                   'B': ['B0', 'B1', 'B2', 'B3']})
This gives me a dataframe with 4 rows and 3 columns, but the value 'K0' appears twice in 'key'.
Is there any way to turn this into a dataframe with 3 rows, and 5 columns like shown below?
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'A': ['A0', 'A2', 'A3'],
                    'B': ['B0', 'B2', 'B3'],
                    'A_1': ['A1', 'NaN', 'NaN'],
                    'B_1': ['B1', 'NaN', 'NaN']})
Perform groupby on cumcount, then concatenate individual groups together.
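gps = []
for i, g in df.groupby(df.groupby('key').cumcount()):
    # Drop the key, suffix the columns with the occurrence number, renumber rows from 0
    gps.append(g.drop(columns='key').add_suffix(str(i + 1)).reset_index(drop=True))
r = pd.concat(gps, axis=1).sort_index(axis=1)
r['key'] = df.key.unique()
r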
A1 A2 B1 B2 key
0 A0 A1 B0 B1 K0
1 A2 NaN B2 NaN K1
2 A3 NaN B3 NaN K2
You can shorten this somewhat using a list comprehension -
r = pd.concat(
    [g.drop(columns='key').add_suffix(str(i + 1)).reset_index(drop=True)
     for i, g in df.groupby(df.groupby('key').cumcount())],
    axis=1)\
    .sort_index(axis=1)
r['key'] = df.key.unique()
r
A1 A2 B1 B2 key
0 A0 A1 B0 B1 K0
1 A2 NaN B2 NaN K1
2 A3 NaN B3 NaN K2
Let's use set_index, groupby, cumcount, and unstack, then flatten multiindex with map and format:
df_out = df.set_index(['key', df.groupby('key').cumcount()]).unstack()
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()
Output:
key A_0 A_1 B_0 B_1
0 K0 A0 A1 B0 B1
1 K1 A2 None B2 None
2 K2 A3 None B3 None
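The step that makes this work is cumcount, which numbers each row within its key group. The sketch below shows that intermediate result and the unstacked frame before the columns are flattened; the names occurrence and wide are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                   'B': ['B0', 'B1', 'B2', 'B3']})

# Position of each row within its 'key' group: the second K0 row gets 1.
occurrence = df.groupby('key').cumcount()
print(occurrence.tolist())  # [0, 1, 0, 0]

# Index by (key, occurrence), then move the occurrence level into the columns.
wide = df.set_index(['key', occurrence]).unstack()

# The columns are now a MultiIndex of (original column, occurrence) pairs.
print(list(wide.columns))  # [('A', 0), ('A', 1), ('B', 0), ('B', 1)]
```

Keys that occur only once have no value for occurrence 1, so those cells are left missing.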
I think that alters the layout. Instead, just put key in the index and use it to access the fields:
df2 = df.set_index([df.key,df.index])
Then
In [248]: df2.loc['K1']
Out[248]:
A B key
2 A2 B2 K1
In [249]: df2.loc['K0']
Out[249]:
A B key
0 A0 B0 K0
1 A1 B1 K0
and iterate over the rows.
I have the data frame below:
A B C
1 A1 B1 [C1, C2]
2 A2 B2 [C3, C4]
I wish to transform it to
A B C
1 A1 B1 C1
2 A1 B1 C2
3 A2 B2 C3
4 A2 B2 C4
What should I do? Thanks
One really simple way of doing it is as follows (this assumes every list in C has exactly two elements):
import pandas as pd
df = pd.DataFrame([['A1', 'B1', ['C1', 'C2']],['A2', 'B2', ['C3', 'C4']]], columns = ['A', 'B', 'C'])
df1 = df.copy()
df1['C'] = df['C'].apply(lambda x: x[0])
df2 = df.copy()
df2['C'] = df['C'].apply(lambda x: x[1])
pd.concat([df1, df2]).sort_values('A')
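If the lists can have different lengths, DataFrame.explode may be a more general option. This is a sketch under the assumption that pandas 0.25 or later is available; it is not part of the original answer, and the name flat is illustrative.

```python
import pandas as pd

df = pd.DataFrame([['A1', 'B1', ['C1', 'C2']],
                   ['A2', 'B2', ['C3', 'C4']]],
                  columns=['A', 'B', 'C'])

# explode emits one row per list element and repeats the other columns;
# reset_index(drop=True) replaces the repeated index labels with 0..n-1.
flat = df.explode('C').reset_index(drop=True)

print(flat)
```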
I have a large data frame to merge. To make sure the merge can take place in a multiprocessing manner, I decided to use indexes. But after creating the indexes I get a KeyError.
For example:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
(Pdb) df1
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
And the second DataFrame:
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'C': ['C1', 'C2', 'C3', 'C4']})
(Pdb) df2
A C
0 A0 C1
1 A1 C2
2 A2 C3
3 A3 C4
Now I set indexes for both the data frames where column A is the index.
df1.set_index('A', inplace=True)
df2.set_index('A', inplace=True)
(Pdb) df1
B
A
A0 B0
A1 B1
A2 B2
A3 B3
(Pdb) df2
C
A
A0 C1
A1 C2
A2 C3
A3 C4
Now when I do the merge:
(Pdb) result = pd.merge(df1, df2, on='A')
*** KeyError: 'A'
But if I do this without creating the index, the merge takes place without a KeyError.
(Pdb) df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
(Pdb) df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'C': ['C1', 'C2', 'C3', 'C4']})
(Pdb) result = pd.merge(df1, df2, on='A')
(Pdb) result
A B C
0 A0 B0 C1
1 A1 B1 C2
2 A2 B2 C3
3 A3 B3 C4
If you merge on the index, you should both:
- not specify the key in merge
- use the left_index=True, right_index=True arguments to merge
Otherwise, you have to explicitly tell merge what your key is using on=.
When you specify on='A' you're telling Pandas that you want to use the column A to do the merge. When you make A the index, it's no longer a column in the frame.
You can either merge, and then make A the index, or make A the index and then specify left_index=True and right_index=True.
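Both options can be sketched on the example data as follows. The names result and alt are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']}).set_index('A')
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'C': ['C1', 'C2', 'C3', 'C4']}).set_index('A')

# Option 1: A is already the index on both sides, so merge on the indexes.
result = pd.merge(df1, df2, left_index=True, right_index=True)

# Option 2: turn A back into a column, merge on it, then make it the index again.
alt = pd.merge(df1.reset_index(), df2.reset_index(), on='A').set_index('A')

print(result)
```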
I've run into a bit of a sticky problem with pandas merge functionality. Here's a toy example of my issue:
df1 = pd.DataFrame({'A': [0, 1, 2, 3],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
s = pd.Series(['E0', 'E1', 'E2', 'E3'], index = [0,1,2,3])
If I now want to left merge these based on column A in the dataframe and the index in the series, i.e.:
pd.merge(df1,s,how='left', left_on = 'A', right_index = True)
I get the following error:
IndexError: list index out of range
Which I don't really understand. To make things more confusing, replacing the series with another dataframe means that everything works just fine:
df2 = pd.DataFrame({'E': ['E0', 'E1', 'E2', 'E3'],
                    'F': ['F0', 'F1', 'F2', 'F3']},
                   index=[0, 1, 2, 3])
pd.merge(df1,df2,how='left', left_on = 'A', right_index = True)
Gives:
A B C D E F
0 0 B0 C0 D0 E0 F0
1 1 B1 C1 D1 E1 F1
2 2 B2 C2 D2 E2 F2
3 3 B3 C3 D3 E3 F3
I'm probably missing something very basic, but I suspect this is one of those things that would be helpful to know for the future!
Many thanks in advance.
pd.merge expects DataFrames as its first two arguments, so the second argument cannot be an unnamed Series like s. However, you could convert s to a frame using its to_frame method:
In [10]: pd.merge(df1, s.to_frame(), how='left', left_on='A', right_index=True)
Out[10]:
A B C D 0
0 0 B0 C0 D0 E0
1 1 B1 C1 D1 E1
2 2 B2 C2 D2 E2
3 3 B3 C3 D3 E3
Notice that the last column name is 0. You can control the name of that column by giving the Series s a name:
In [15]: s.name = 'Foo'
Then the last column name becomes Foo:
In [17]: pd.merge(df1, s.to_frame(), how='left', left_on='A', right_index=True)
Out[17]:
A B C D Foo
0 0 B0 C0 D0 E0
1 1 B1 C1 D1 E1
2 2 B2 C2 D2 E2
3 3 B3 C3 D3 E3
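As an alternative to setting s.name first, to_frame also takes a name argument, so the column can be named at conversion time. This is a small sketch on a trimmed version of the example; the column name 'E' is just an illustrative choice.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 2, 3],
                    'B': ['B0', 'B1', 'B2', 'B3']},
                   index=[0, 1, 2, 3])
s = pd.Series(['E0', 'E1', 'E2', 'E3'], index=[0, 1, 2, 3])

# Name the resulting column while converting, leaving s itself unchanged.
merged = pd.merge(df1, s.to_frame(name='E'),
                  how='left', left_on='A', right_index=True)

print(merged)
```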