I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0 1 20 0 1
1 2 23 1 1
2 3 19 1 1
3 4 19 0 0
>df_jun
id quantity attr_1 attr_3
0 5 8 1 0
1 6 13 0 1
2 7 20 1 1
3 8 25 1 1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on="id", e.g.), but that duplicates all columns except id (attr_1_x, attr_1_y), which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
I think in this case concat is what you want:
In [12]:
pd.concat([df_may, df_jun], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
By passing axis=0 here you are stacking the DataFrames on top of each other, which I believe is what you want, producing NaN values where a column is absent from its respective DataFrame.
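For reference, here is a minimal runnable version of the above (a sketch that rebuilds the frames from the data in the question):

import pandas as pd

df_may = pd.DataFrame({'id': [1, 2, 3, 4],
                       'quantity': [20, 23, 19, 19],
                       'attr_1': [0, 1, 1, 0],
                       'attr_2': [1, 1, 1, 0]})
df_jun = pd.DataFrame({'id': [5, 6, 7, 8],
                       'quantity': [8, 13, 20, 25],
                       'attr_1': [1, 0, 1, 1],
                       'attr_3': [0, 1, 1, 1]})

# Stack the two frames; columns present in only one frame are filled with NaN.
mayjundf = pd.concat([df_may, df_jun], axis=0, ignore_index=True)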
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has 3x trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
# id trial trial trial
# 0 3 1 4 1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
# id trial
# 0 5 9
# 1 2 6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})
for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)
pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2
# 0 3 1 4 1
# 1 5 9 NaN NaN
# 2 2 6 NaN NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
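If you'd rather not depend on pandas' private parser internals (they move between versions), a small hand-rolled helper achieves the same renaming; dedup_columns below is a hypothetical name, not a pandas API:

def dedup_columns(cols):
    # Rename duplicates as name, name.1, name.2, ... like the CSV parser does.
    counts = {}
    out = []
    for c in cols:
        if c in counts:
            counts[c] += 1
            out.append(f'{c}.{counts[c]}')
        else:
            counts[c] = 0
            out.append(c)
    return out

for df in [A, B]:
    df.columns = dedup_columns(df.columns)
pd.concat([A, B], ignore_index=True)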
I ran into this problem today with concat, append, and merge alike, and I got around it by adding a sequentially numbered helper column and then doing an outer join:
helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
df1.merge(df2, on='helper', how='outer')
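As an aside, the same helper column can be built without the Python loops; a vectorized sketch, assuming df1 and df2 as above:

# Number the rows of both frames consecutively, then outer-join on that key.
df1['helper'] = range(1, len(df1) + 1)
df2['helper'] = range(len(df1) + 1, len(df1) + len(df2) + 1)
df1.merge(df2, on='helper', how='outer')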
I am trying to concatenate two data frames with
pd.concat([df1.set_index(["t", "tc"]), df2.set_index(["t", "tc"])], axis=1)
It can happen that in df1 the index is not unique. In that case, I want the corresponding entry in df2 to be inserted into all the rows with that index. Unfortunately, instead of doing that, concat gives me an error. I thought ignore_index=True might help, but I still get the error ValueError: cannot handle a non-unique multi-index!
Is there an alternative to concat that does what I want?
For example:
df1
t tc a
a 1 5
b 1 6
a 1 7
df2:
t tc b
a 1 8
b 1 10
result(after resetting the index):
t tc a b
a 1 5 8
b 1 6 10
a 1 7 8
Using .merge you can get what you need:
df1.merge(df2, on=['t', 'tc'])
#result
t tc a b
0 a 1 5 8
1 a 1 7 8
2 b 1 6 10
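If you also want to preserve df1's original row order (a, b, a, as in the desired result) and keep rows whose key is missing from df2, a left join does that; a minimal sketch, assuming the frames are built from the question's data:

import pandas as pd

df1 = pd.DataFrame({'t': ['a', 'b', 'a'], 'tc': [1, 1, 1], 'a': [5, 6, 7]})
df2 = pd.DataFrame({'t': ['a', 'b'], 'tc': [1, 1], 'b': [8, 10]})

# how='left' keeps df1's row order and fills NaN where df2 has no match.
result = df1.merge(df2, on=['t', 'tc'], how='left')
#    t  tc  a   b
# 0  a   1  5   8
# 1  b   1  6  10
# 2  a   1  7   8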
I currently have a dataframe of countries by series, with values ranging from 0-25.
I want to sort the df so that the highest values appear in the top left (first), while the lowest appear in the bottom right (last).
FROM
A B C D ...
USA 4 0 10 16
CHN 2 3 13 22
UK 2 1 8 14
...
TO
D C A B ...
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
...
In this, the column with the highest values is now first, and the same is true with the index.
I have considered reindexing, but this loses the 'Countries' Index.
D C A B ...
0 22 13 2 3
1 16 10 4 0
2 14 8 2 1
...
I have thought about creating a new column and row holding the mean or sum of values for the respective column/row, but is this the most efficient way? And how would I then sort the df once I have the new rows/columns?
Is there a way to reindex using...
df_mv.reindex(df_mv.mean(or sum)().sort_values(ascending = False).index, axis=1)
... that would allow me to keep the country index, and simply sort it accordingly?
Thanks for any and all advice or assistance.
EDIT
Intended result organizes columns AND rows from largest to smallest.
Regarding the first row of the A and B columns in the intended output, these are supposed to be 2, 3 respectively. This is because the intended result interprets the A column as greater than the B column in both sum and mean (even though either sum or mean can be considered for the 'value' of a row/column).
By saying the higher numbers would be in the top left, while the lower ones would be in the bottom right, I simply meant this as a general trend for the resulting df. It is the columns and rows as whole however, that are the intended focus. I apologize for the confusion.
You could use:
rows_index=df.max(axis=1).sort_values(ascending=False).index
col_index=df.max().sort_values(ascending=False).index
new_df=df.loc[rows_index,col_index]
print(new_df)
D C A B
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
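Since the question mentions ranking by mean or sum, the same pattern works with those aggregations; a variant sketch (on this data the ordering comes out the same):

rows_index = df.sum(axis=1).sort_values(ascending=False).index
col_index = df.sum().sort_values(ascending=False).index
new_df = df.loc[rows_index, col_index]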
Use .T to transpose rows to columns and vice versa:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.T
df = df.sort_values(df.columns[0], ascending=False).T
Result:
>>> df
D C B A
CHN 22 13 3 2
USA 16 10 0 4
UK 14 8 1 2
Here's another way, this time without transposing but using axis=1 as an argument:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.sort_values(df.index[0], axis=1, ascending=False)
Using numpy:
arr = df.to_numpy()
order = np.max(arr, axis=1).argsort()[::-1]    # row order: largest row-max first
arr = np.sort(arr[order, :], axis=1)[:, ::-1]  # sort each row descending
# Reorder the index together with the rows so the labels stay aligned.
# Note: because each row is sorted independently, the column labels no longer
# correspond to the original columns.
df1 = pd.DataFrame(arr, index=df.index[order], columns=df.columns)
print(df1)
Output:
A B C D
CHN 22 13 3 2
USA 16 10 4 0
UK 14 8 2 1
I have a dataframe df with the shape (4573,64) that I'm trying to pivot. The last column is an 'id' with two possible string values 'old' and 'new'. I would like to set the first 63 columns as index and then have the 'id' column across the top with values being the count of 'old' or 'new' for each index row.
I've created a list object named cols out of the column labels that I want as the index.
I tried the following:
df.pivot(index=cols, columns='id')['id']
This gives an error: 'all arrays must be same length'.
I also tried the following to see if I could get a sum, but no luck either:
pd.pivot_table(df,index=cols,values=['id'],aggfunc=np.sum)
Any ideas greatly appreciated.
I found a thread online about a possible bug in pandas 0.23.0 where pandas.pivot_table() will not accept a multi-index as long as it contains NaNs (link to GitHub in comments). My workaround was to do
df.fillna('empty', inplace=True)
then the solution below:
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
as proposed by jezrael works as intended, hence that answer is accepted.
I believe you need to convert the column names (all except id) to a list and then aggregate the size with unstack:
df = pd.DataFrame({'B':[4,4,4,5,5,4],
                   'C':[1,1,9,4,2,3],
                   'D':[1,1,5,7,1,0],
                   'E':[0,0,6,9,2,4],
                   'id':list('aaabbb')})
print (df)
B C D E id
0 4 1 1 0 a
1 4 1 1 0 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
cols = df.columns[:-1].tolist()  # every column except 'id'
df1 = df.groupby(cols + ['id']).size().unstack(fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
Solution with pivot_table:
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
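If you'd rather have the result as ordinary columns instead of a MultiIndex, a small follow-up sketch:

# Turn the index levels back into regular columns and drop the 'id' columns label.
flat = df1.reset_index()
flat.columns.name = None
print(flat)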
I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it was possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working because some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to remove any row whose value is not a clean integer from the allowed set.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
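Note that newer NumPy versions also provide np.isin, which supersedes in1d for this use; an equivalent sketch:

df1 = df[np.isin(df['a'], [1, 3]) & np.isin(df['b'], [6, 7])]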
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for invalid values; then it is possible to filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
                   'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
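To combine both checks from the question, coercing away junk like 1abc and then keeping only the allowed values, a combined sketch (assuming the allowed sets [1,3] and [6,7] from the question):

# Coerce to numbers first; invalid strings become NaN and fail both tests below.
a = pd.to_numeric(df['a'], errors='coerce')
b = pd.to_numeric(df['b'], errors='coerce')
df1 = df[a.isin([1, 3]) & b.isin([6, 7])]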
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
                   'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
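And if you want to remove those rows rather than just inspect them, a one-line sketch:

# Keep only rows without any NaN/None.
df1 = df.dropna()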
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7
I have multiple DataFrames that I want to merge, where I would like the fill value to be an empty string rather than NaN. Some of the DataFrames already have NaN values in them. concat sort of does what I want but fills empty values with NaN. How does one not fill them with NaN, or specify the fill_value, to achieve something like this:
>>> df1
Value1
0 1
1 NaN
2 3
>>> df2
Value2
1 5
2 NaN
3 7
>>> merge_multiple_without_nan([df1,df2])
Value1 Value2
0 1
1 NaN 5
2 3 NaN
3 7
This is what concat does:
>>> concat([df1,df2], axis=1)
Value1 Value2
0 1 NaN
1 NaN 5
2 3 NaN
3 NaN 7
Well, I couldn't find any function in concat or merge that would handle this by itself, but the code below works without much hassle:
df1 = pd.DataFrame({'Value1': [1, np.nan, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'Value2': [5, np.nan, 7]}, index=[1, 2, 3])
# Step 1: replace the pre-existing NaN values with temporary placeholders.
df = pd.concat([df1.fillna('X'), df2.fillna('Y')], axis=1)
df=
Value1 Value2
0 1 NaN
1 X 5
2 3 Y
3 NaN 7
Step 2:
df.fillna('', inplace=True)
df=
Value1 Value2
0 1
1 X 5
2 3 Y
3 7
Step 3:
df.replace(to_replace=['X','Y'], value=np.nan, inplace=True)
df=
Value1 Value2
0 1
1 NaN 5
2 3 NaN
3 7
After using concat, you can iterate over the DataFrames you merged, find the indices that are missing, and fill them in with an empty string. This should work for concatenating an arbitrary number of DataFrames, as long as your column names are unique.
# Concatenate all of the DataFrames.
merge_dfs = [df1, df2]
full_df = pd.concat(merge_dfs, axis=1)
# Find missing indices for each merged frame, fill with an empty string.
for partial_df in merge_dfs:
    missing_idx = full_df.index.difference(partial_df.index)
    full_df.loc[missing_idx, partial_df.columns] = ''
The resulting output using your sample data:
Value1 Value2
0 1
1 NaN 5
2 3 NaN
3 7
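For completeness, a minimal setup to run the snippet above (a sketch rebuilding the question's frames):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Value1': [1, np.nan, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'Value2': [5, np.nan, 7]}, index=[1, 2, 3])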