I have two DataFrames with different values and different numbers of rows. All index values of the second DataFrame are present in the first, so I need a DataFrame with the same number of rows as the second DataFrame, containing the data of both DataFrames.
Example
df1 = pd.DataFrame(columns=["A", "B", "C"], data=[['a1' , 'b1', 'c1'], ['a2', 'b2', 'c2'], ['a3', 'b3', 'c3']], index=[1,2,3])
df2 = pd.DataFrame(columns=["D", "E"], data=[['d2' , 'e2'], ['d3' , 'e3']], index=[2,3])
print (df1)
print (df2)
Out:
A B C
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
D E
2 d2 e2
3 d3 e3
I want:
A B C D E
2 a2 b2 c2 d2 e2
3 a3 b3 c3 d3 e3
I try this:
merge = df1.merge(df2, how='inner', left_on=df1.index, right_on=df2.index)
But I receive an error:
IndexError: indices are out-of-bounds
Note: not all index values in df1 are in df2, but all index values in df2 are in df1.
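One sketch of a fix, assuming an index-aligned inner join is what's wanted: join aligns on the index by default, so there is no need to pass the index objects to left_on/right_on (which is what triggers the error above):

```python
import pandas as pd

df1 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[['a1', 'b1', 'c1'], ['a2', 'b2', 'c2'], ['a3', 'b3', 'c3']],
                   index=[1, 2, 3])
df2 = pd.DataFrame(columns=["D", "E"],
                   data=[['d2', 'e2'], ['d3', 'e3']],
                   index=[2, 3])

# join aligns on the index by default; how='inner' keeps only
# the labels present in both frames (here 2 and 3)
result = df1.join(df2, how='inner')
print(result)
```

df1.merge(df2, left_index=True, right_index=True) would do the same thing.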
Related
I have these two dataframes:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A3'],'B': ['B0', 'B1', 'B2']})
I would like to merge these two dataframes now by the entries in column 'A'. But I do not want to keep the rows which match but the ones which do not match to each other.
That means, I would like to get a new dataframe which looks like this one:
df_new = pd.DataFrame({'A':['A3'], 'B':['B2']})
How could I do this?
Thanks a lot!
A merge with an outer join gives you a similar result:
df1.merge(df2, how = 'outer', on = 'A', indicator = True)
A B_x B_y _merge
0 A0 B0 B0 both
1 A1 B1 B1 both
2 A2 B2 NaN left_only
3 A3 NaN B2 right_only
which can be filtered by query:
df1.merge(df2, how = 'outer', on = 'A', indicator = True).query("_merge != 'both'")
A B_x B_y _merge
2 A2 B2 NaN left_only
3 A3 NaN B2 right_only
Note that indicator=True creates the _merge column, which is suitable for filtering rows.
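If only the rows unique to df2 are wanted (matching df_new exactly), the same indicator column can be filtered to 'right_only'; the B_x/B_y names come from merge's default suffixes:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A3'], 'B': ['B0', 'B1', 'B2']})

# keep rows present only in df2, then restore the original column names
out = (df1.merge(df2, how='outer', on='A', indicator=True)
          .query("_merge == 'right_only'")
          .rename(columns={'B_y': 'B'})[['A', 'B']])
print(out)
```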
Try this, using isin (the mask should test df2's values against df1's):
df2.loc[~df2.A.isin(df1.A)]
A B
2 A3 B2
This question already has answers here:
How to reset index in a pandas dataframe? [duplicate]
I want to merge two dataframes df1 and df2 using the script below:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df = df1.join(df2.iloc[1:].reset_index(), how='inner')
But there is one extra column named 'index' which I don't want. Is there any way to remove it?
A B index C D
0 A0 B0 1 C1 D1
1 A1 B1 2 C2 D2
2 A2 B2 3 C3 D3
What I want is just like this:
A B C D
0 A0 B0 C1 D1
1 A1 B1 C2 D2
2 A2 B2 C3 D3
Just pass drop=True to reset_index in your code.
df = df1.join(df2.iloc[1:].reset_index(drop=True), how='inner')
A B C D
0 A0 B0 C1 D1
1 A1 B1 C2 D2
2 A2 B2 C3 D3
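An alternative sketch that avoids creating the throwaway column in the first place: set_axis relabels the sliced frame's index to match df1 directly (this assumes the slice and df1 have the same length):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

# relabel the slice's index to 0..2 so it lines up with df1's index
df = df1.join(df2.iloc[1:].set_axis(df1.index), how='inner')
print(df)
```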
I have 2 dataframes with missing values that I want to merge, completing the data of each from the other.
A simple visualisation:
df1 :
A,B,C
A1,B1,C1
A2,B2,
A3,B3,C3
df2 :
A,B,C
A1,,C1
A4,B4,C4
A2,B2,C2
The result wanted:
A,B,C
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4
Basically, merge the dataframes without duplicates in column "A", filling in any missing values in a row by comparing values for the same "A" between dataframes.
I tried many things I saw in the Pandas documentation plus solutions on Stack Exchange, but failed every time.
These are all the different things I tried :
pd.merge_ordered(df1, df2, fill_method='ffill', left_by='A')
df1.combine_first(df2)
df1.update(df2)
pd.concat([df1, df2])
pd.merge(df1, df2, on=['A','B','C'], how='right')
pd.merge(df1, df2, on=['A','B','C'], how='outer')
pd.merge(df1, df2, on=['A','B','C'], how='left')
df1.join(df2, how='outer')
df1.join(df2, how='left')
df1.set_index('A').join(df2.set_index('A'))
(You can see I was quite desperate at the end)
Any idea how to do that?
Did you try combine_first with A as the index?
df1.set_index('A').combine_first(df2.set_index('A')).reset_index()
A B C
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
Or you can use groupby with first (this assumes numpy is imported as np):
pd.concat([df1, df2]).replace('', np.nan).groupby('A', as_index=False).first()
Out[53]:
A B C
0 A1 B1 C1
1 A2 B2 C2
2 A3 B3 C3
3 A4 B4 C4
Setup
Since you wrote them as csvs, I'm going to assume they were csvs.
df1 = pd.read_csv('df1.csv', sep=',', index_col=0)
df2 = pd.read_csv('df2.csv', sep=',', index_col=0)
Solution
Use fillna after having used align
pd.DataFrame.fillna(*df1.align(df2))
B C
A
A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4
You can use reset_index if you insist but I think it's prettier to leave it as it is.
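The starred one-liner is equivalent to this more explicit two-step form, which may be easier to read (here the frames are built inline with np.nan rather than read from CSV):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'C': ['C1', np.nan, 'C3']}).set_index('A')
df2 = pd.DataFrame({'A': ['A1', 'A4', 'A2'],
                    'B': [np.nan, 'B4', 'B2'],
                    'C': ['C1', 'C4', 'C2']}).set_index('A')

# align reindexes both frames to the union of their indexes;
# fillna then fills df1's holes with df2's values
left, right = df1.align(df2)
result = left.fillna(right)
print(result)
```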
You can use the pandas categorical data type to set an ordered list of categories, sort on these ordered categories, and drop rows with null values to get your desired results:
import pandas as pd
from pandas.api.types import CategoricalDtype

# Create first dataframe from OP values
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'C': ['C1', '', 'C3']})
# create second dataframe from original values
df2 = pd.DataFrame({'A': ['A1', 'A4', 'A2'],
                    'B': ['', 'B4', 'B2'],
                    'C': ['C1', 'C4', 'C2']})
# concatenate the two together for a long dataframe
final = pd.concat([df1, df2])
# specify the letters in your dataset
letters = ['A', 'B', 'C']
# create a placeholder dictionary to store the categorical datatypes
cat_dict = {}
# iterate over the letters
for let in letters:
    # create the ordered categories - set the range for the max # of values
    cats = ['{}{}'.format(let, num) for num in range(1000)]
    # create the ordered categorical datatype
    cat_type = CategoricalDtype(cats, ordered=True)
    # insert into the placeholder
    cat_dict[let] = cat_type
# properly format your columns as the ordered categories
final['A'] = final['A'].astype(cat_dict['A'])
final['B'] = final['B'].astype(cat_dict['B'])
final['C'] = final['C'].astype(cat_dict['C'])
# finally sort on the three columns and drop rows with NA values
final.sort_values(['A', 'B', 'C']).dropna(how='any')
# which outputs desired results
A B C
0 A1 B1 C1
2 A2 B2 C2
2 A3 B3 C3
1 A4 B4 C4
While this is a bit longer, one nice thing about doing it this way is that your data can be in any order on input. This inserts an inherent rank into the values within each column, so A1 < A2 < A3 and so on, which also enables you to sort on the columns.
I'm sure that I'm missing something simple, but I haven't been able to figure this one out.
I have a DataFrame in Pandas with multiple rows that have the same keys, but different information. I want to place these rows onto the same row.
df = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                   'B': ['B0', 'B1', 'B2', 'B3']})
This gives me a dataframe with 4 rows and 3 columns, but there is a duplicate value 'K0' in 'key'.
Is there any way to turn this into a dataframe with 3 rows, and 5 columns like shown below?
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'A': ['A0', 'A2', 'A3'],
                    'B': ['B0', 'B2', 'B3'],
                    'A_1': ['A1', 'NaN', 'NaN'],
                    'B_1': ['B1', 'NaN', 'NaN']})
Perform groupby on cumcount, then concatenate individual groups together.
gps = []
for i, g in df.groupby(df.groupby('key').cumcount()):
    gps.append(g.drop(columns='key').add_suffix(str(i + 1)).reset_index(drop=True))
r = pd.concat(gps, axis=1).sort_index(axis=1)
r['key'] = df.key.unique()
r
A1 A2 B1 B2 key
0 A0 A1 B0 B1 K0
1 A2 NaN B2 NaN K1
2 A3 NaN B3 NaN K2
You can shorten this somewhat using a list comprehension -
r = pd.concat(
    [g.drop(columns='key').add_suffix(str(i + 1)).reset_index(drop=True)
     for i, g in df.groupby(df.groupby('key').cumcount())],
    axis=1).sort_index(axis=1)
r['key'] = df.key.unique()
r
A1 A2 B1 B2 key
0 A0 A1 B0 B1 K0
1 A2 NaN B2 NaN K1
2 A3 NaN B3 NaN K2
Let's use set_index, groupby, cumcount, and unstack, then flatten the MultiIndex with map and format:
df_out = df.set_index(['key', df.groupby('key').cumcount()]).unstack()
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()
Output:
key A_0 A_1 B_0 B_1
0 K0 A0 A1 B0 B1
1 K1 A2 None B2 None
2 K2 A3 None B3 None
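A roughly equivalent sketch using pivot with the cumcount as the column level (this assumes a pandas version where pivot accepts a list for values):

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K0', 'K1', 'K2'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                   'B': ['B0', 'B1', 'B2', 'B3']})

# number the repeats within each key, then pivot them into columns
wide = (df.assign(n=df.groupby('key').cumcount())
          .pivot(index='key', columns='n', values=['A', 'B']))
# flatten the resulting MultiIndex columns to A_0, A_1, B_0, B_1
wide.columns = ['{}_{}'.format(col, n) for col, n in wide.columns]
print(wide.reset_index())
```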
I think this alters the layout. Just put key in the index to access the fields:
df2 = df.set_index([df.key,df.index])
Then
In [248]: df2.loc['K1']
Out[248]:
A B key
2 A2 B2 K1
In [249]: df2.loc['K0']
Out[249]:
A B key
0 A0 B0 K0
1 A1 B1 K0
and iterate over the rows.
I've run into a bit of a sticky problem with pandas merge functionality. Here's a toy example of my issue:
df1 = pd.DataFrame({'A': [0, 1, 2, 3],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
s = pd.Series(['E0', 'E1', 'E2', 'E3'], index = [0,1,2,3])
If I now want to left merge these based on column A in the dataframe and the index in the series, i.e.:
pd.merge(df1,s,how='left', left_on = 'A', right_index = True)
I get the following error:
IndexError: list index out of range
Which I don't really understand. To make things more confusing, replacing the series with another dataframe means that everything works just fine:
df2 = pd.DataFrame({'E': ['E0', 'E1', 'E2', 'E3'],
                    'F': ['F0', 'F1', 'F2', 'F3']},
                   index=[0, 1, 2, 3])
pd.merge(df1,df2,how='left', left_on = 'A', right_index = True)
Gives:
A B C D E F
0 0 B0 C0 D0 E0 F0
1 1 B1 C1 D1 E1 F1
2 2 B2 C2 D2 E2 F2
3 3 B3 C3 D3 E3 F3
I'm probably missing something very basic, but I suspect this is one of those things that would be helpful to know for the future!
Many thanks in advance.
pd.merge expects DataFrames as its first two arguments; the second argument cannot be a Series. However, you can convert s to a frame using its to_frame method:
In [10]: pd.merge(df1, s.to_frame(), how='left', left_on='A', right_index=True)
Out[10]:
A B C D 0
0 0 B0 C0 D0 E0
1 1 B1 C1 D1 E1
2 2 B2 C2 D2 E2
3 3 B3 C3 D3 E3
Notice that the last column name is 0. You can control that column's name by giving the Series s a name:
In [15]: s.name = 'Foo'
Then the last column name becomes Foo:
In [17]: pd.merge(df1, s.to_frame(), how='left', left_on='A', right_index=True)
Out[17]:
A B C D Foo
0 0 B0 C0 D0 E0
1 1 B1 C1 D1 E1
2 2 B2 C2 D2 E2
3 3 B3 C3 D3 E3
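For what it's worth, this restriction was lifted in later pandas releases (0.24, if I remember the version right): merge now accepts a *named* Series directly, so the to_frame step can be skipped:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 2, 3],
                    'B': ['B0', 'B1', 'B2', 'B3']})
s = pd.Series(['E0', 'E1', 'E2', 'E3'], index=[0, 1, 2, 3], name='E')

# a named Series is accepted as the right operand in modern pandas;
# an unnamed Series still raises an error
merged = pd.merge(df1, s, how='left', left_on='A', right_index=True)
print(merged)
```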