Joining multiple dataframes with multiple common columns

I have multiple dataframes like this-
df=pd.DataFrame({'a':[1,2,3],'b':[3,4,5],'c':[4,6,7]})
df2=pd.DataFrame({'a':[1,2,3],'d':[66,24,55],'c':[4,6,7]})
df3=pd.DataFrame({'a':[1,2,3],'f':[31,74,95],'c':[4,6,7]})
I want this output-
a c
0 1 4
1 2 6
2 3 7
These are the columns common to all three dataframes. I am looking for a solution that works for any number of columns without having to specify the common columns by hand, as in the answers I have seen on SO (the actual dataframes are huge).

If you need to keep only the columns that have the same name and the same content in every DataFrame, you can convert each column to a tuple and compare:
dfs = [df, df2, df3]
df1 = pd.concat([x.apply(tuple) for x in dfs], axis=1)
cols = df1.index[df1.eq(df1.iloc[:, 0], axis=0).all(axis=1)]
df_common = df[cols]  # separate name, so the input df2 is not overwritten
print (df_common)
a c
0 1 4
1 2 6
2 3 7
If the column names differ and only the content needs to be compared:
df=pd.DataFrame({'a':[1,2,3],'b':[3,4,5],'c':[4,6,7]})
df2=pd.DataFrame({'r':[1,2,3],'t':[66,24,55],'l':[4,6,7]})
df3=pd.DataFrame({'f':[1,2,3],'g':[31,74,95],'m':[4,6,7]})
dfs = [df, df2, df3]
p = [x.apply(tuple).tolist() for x in dfs]
a = set(p[0]).intersection(*p)
print (a)
{(4, 6, 7), (1, 2, 3)}
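
For the same-name case, a rough alternative that skips the tuple conversion is to compare columns directly with Series.equals. This is only a sketch: the helper name common_identical_columns is made up for illustration, and it assumes all frames share the same index.

```python
import pandas as pd

def common_identical_columns(frames):
    """Return names of columns that appear in every frame with identical values.

    Hypothetical helper, not part of pandas.
    """
    first, *rest = frames
    # Columns whose name exists in every frame, in the first frame's order
    shared = [c for c in first.columns if all(c in f.columns for f in rest)]
    # Of those, keep the ones whose contents match in every frame
    return [c for c in shared if all(first[c].equals(f[c]) for f in rest)]

df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5], 'c': [4, 6, 7]})
df2 = pd.DataFrame({'a': [1, 2, 3], 'd': [66, 24, 55], 'c': [4, 6, 7]})
df3 = pd.DataFrame({'a': [1, 2, 3], 'f': [31, 74, 95], 'c': [4, 6, 7]})

cols = common_identical_columns([df, df2, df3])
result = df[cols]
print(result)
```

Because nothing has to be specified by name, this should also work when the frames have many columns.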

You can use reduce to apply the function r_common cumulatively to the dataframes in dfs, from left to right, reducing the list to a single dataframe df_common. Inside r_common, the intersection method finds the columns common to the two dataframes d1 and d2, and eq(...).all() keeps only those whose contents match.
from functools import reduce

def r_common(d1, d2):
    cols = d1.columns.intersection(d2.columns).tolist()
    m = d1[cols].eq(d2[cols]).all()
    return d1[m[m].index]

df_common = reduce(r_common, dfs) # dfs = [df, df2, df3]
Result:
# print(df_common)
a c
0 1 4
1 2 6
2 3 7

A combination of reduce, intersection, filter and concat could help with your use case:
dfs = (df,df2,df3)
cols = [ent.columns for ent in dfs]
cols
[Index(['a', 'b', 'c'], dtype='object'),
Index(['a', 'd', 'c'], dtype='object'),
Index(['a', 'f', 'c'], dtype='object')]
#find the common columns to all :
from functools import reduce
universal_cols = reduce(lambda x,y : x.intersection(y), cols).tolist()
universal_cols
['a', 'c']
#filter for only universal_cols for each df
updates = [ent.filter(universal_cols) for ent in dfs]
If the columns and contents of the columns are the same, then you can skip the list comprehension and just filter from only one dataframe:
#let's use the first dataframe
output = df.filter(universal_cols)
If the columns' contents are different, then concatenate and drop duplicates:
#concatenate and drop duplicates
res = pd.concat(updates).drop_duplicates()
res #output has the same result
a c
0 1 4
1 2 6
2 3 7

Related

Merging dataframes and keeping columns in place?

I have the following data frames:
DF1
df1 = pd.DataFrame(columns = ['Key', 'Value'])
df1['Key'] = ['A', 'B', 'C', 'D']
DF2
df2 = pd.DataFrame(columns = ['Key', 'Value'])
df2['Key'] = ['A', 'C']
df2['Value'] = [1,7]
I would like to merge these two data frames such that the data from DF2 under the column 'Value' is filled in DF1, where the remaining letters 'B' and 'D' have zero.
I tried this:
df3 = pd.merge(df1,df2,how='outer', on = 'Key')
However, this creates an additional column Value_x and Value_y which is not what I want.
Thanks

I think the shortest way to accomplish this is
df1[['Key']].merge(df2, on='Key', how='outer')
By not including Value from the left frame, you avoid ending up with two Value columns in the resulting data frame.

You could remove the Value column from df1 and use your existing merge.
Or only use the Key column from df1 when merging.
df3 = pd.merge(df1['Key'],df2,how='outer', on = 'Key').fillna(value=0)
Key Value
0 A 1.0
1 B 0.0
2 C 7.0
3 D 0.0

Another way to do it is by concatenating the dataframes and then grouping the Key value like this:
df3 = pd.concat([df1, df2]).fillna(0).groupby('Key').sum().reset_index()
Output:
Key Value
0 A 1
1 B 0
2 C 7
3 D 0
This way is a little verbose but easier to read and extensible to more than 2 DFs.
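
Another option, if merging is not required, is to look the values up with Series.map. The sketch below assumes the keys in df2 are unique (map needs a unique index to look up against); the name lookup is just for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['A', 'B', 'C', 'D']})
df2 = pd.DataFrame({'Key': ['A', 'C'], 'Value': [1, 7]})

# Series indexed by Key, used as a lookup table (assumes unique keys in df2)
lookup = df2.set_index('Key')['Value']

# Keys missing from df2 map to NaN, which is then replaced by 0
df1['Value'] = df1['Key'].map(lookup).fillna(0).astype(int)
print(df1)
```

This keeps the row order and the column layout of df1 unchanged.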

concat by taking the values from column

I have a list ['df1', 'df2'] holding the names of some dataframes that have been filtered on a few conditions. I then converted this list to a dataframe using
df = pd.DataFrame(list1)
now the df has only one column
0
df1
df2
sometimes it may also have
0
df1
df2
df3
I want to concatenate all of these. My static code is
df_new = pd.concat([df1,df2],axis=1)
or
df_new = pd.concat([df1,df2,df3],axis=1)
How can I make this dynamic (without specifying df1, df2 myself) so that it takes the values from the column and concatenates them?

Collect the data frames in a list and pass that list to concat:
import pandas as pd
lists = [[1,2,3],[4,5,6]]
arr = []
for l in lists:
    new_df = pd.DataFrame(l)
    arr.append(new_df)
df = pd.concat(arr,axis=1)
df
Result :
0 0
0 1 4
1 2 5
2 3 6
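
If the names really do arrive as strings, one possible approach is to keep the dataframes in a dictionary keyed by name and look them up. This is a sketch only: the dictionary frames, its contents, and the column names x, y, z are invented for illustration.

```python
import pandas as pd

# Hypothetical registry: the frames themselves, keyed by their names
frames = {
    'df1': pd.DataFrame({'x': [1, 2, 3]}),
    'df2': pd.DataFrame({'y': [4, 5, 6]}),
    'df3': pd.DataFrame({'z': [7, 8, 9]}),
}

# The list of names, as in the question; it may hold two, three or more entries
names = ['df1', 'df2']

# Look up each frame by name and concatenate side by side
df_new = pd.concat([frames[n] for n in names], axis=1)
print(df_new)
```

Changing names to ['df1', 'df2', 'df3'] concatenates all three without touching the concat call.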

Find all duplicate columns in a collection of data frames

Having a collection of data frames, the goal is to identify the duplicated column names and return them as a list.
Example
The input are 3 data frames df1, df2 and df3:
df1 = pd.DataFrame({'a':[1,5], 'b':[3,9], 'e':[0,7]})
a b e
0 1 3 0
1 5 9 7
df2 = pd.DataFrame({'d':[2,3], 'e':[0,7], 'f':[2,1]})
d e f
0 2 0 2
1 3 7 1
df3 = pd.DataFrame({'b':[3,9], 'c':[8,2], 'e':[0,7]})
b c e
0 3 8 0
1 9 2 7
The output is a list [b, e]

pd.Series.duplicated
Since you are using Pandas, you can use pd.Series.duplicated after concatenating column names:
# concatenate column labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])
# keep all duplicates only, then extract unique names
res = s[s.duplicated(keep=False)].unique()
print(res)
array(['b', 'e'], dtype=object)
pd.Series.value_counts
Alternatively, you can extract a series of counts and identify rows which have a count greater than 1:
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)]).value_counts()
res = s[s > 1].index
print(res)
Index(['e', 'b'], dtype='object')
collections.Counter
The classic Python solution is to use collections.Counter followed by a list comprehension. Recall that list(df) returns the columns in a dataframe, so we can use this map and itertools.chain to produce an iterable to feed Counter.
from itertools import chain
from collections import Counter
c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))
res = [k for k, v in c.items() if v > 1]
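
If you also want to know which data frames each duplicated name comes from, a small variation on the counting idea may help. This is a sketch; the names frames, owners and duplicates are made up here.

```python
from collections import defaultdict

import pandas as pd

df1 = pd.DataFrame({'a': [1, 5], 'b': [3, 9], 'e': [0, 7]})
df2 = pd.DataFrame({'d': [2, 3], 'e': [0, 7], 'f': [2, 1]})
df3 = pd.DataFrame({'b': [3, 9], 'c': [8, 2], 'e': [0, 7]})

# Label each frame so the result can say where a column name was seen
frames = {'df1': df1, 'df2': df2, 'df3': df3}

# Map every column name to the labels of the frames that contain it
owners = defaultdict(list)
for label, frame in frames.items():
    for col in frame.columns:
        owners[col].append(label)

# Keep only the names that occur in more than one frame
duplicates = {col: labels for col, labels in owners.items() if len(labels) > 1}
print(duplicates)
```

list(duplicates) then gives the plain list of duplicated names.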

Here is my code for this problem. It compares the columns of just two data frames by content, without concatenating them.
def getDuplicateColumns(df1, df2):
    df_compare = pd.DataFrame({'df1':df1.columns.to_list()})
    df_compare["df2"] = ""
    # Iterate over all the columns in df1
    for x in range(df1.shape[1]):
        # Select column at xth index.
        col = df1.iloc[:, x]
        # Collect the names of all df2 columns equal to this column
        duplicateColumnNames = []
        for y in range(df2.shape[1]):
            # Select column at yth index.
            otherCol = df2.iloc[:, y]
            # Check if two columns at x y index are equal
            if col.equals(otherCol):
                duplicateColumnNames.append(df2.columns.values[y])
        df_compare.loc[df_compare["df1"]==df1.columns.values[x], "df2"] = str(duplicateColumnNames)
    return df_compare

Pandas concatenate alternating columns

I have two dataframes as follows:
df2 = pd.DataFrame(np.random.randn(5,2),columns=['A','C'])
df3 = pd.DataFrame(np.random.randn(5,2),columns=['B','D'])
I wish to get the columns in an alternating fashion such that I get the result below:
df4 = pd.DataFrame()
for i in range(len(df2.columns)):
df4[df2.columns[i]]=df2[df2.columns[i]]
df4[df3.columns[i]]=df3[df3.columns[i]]
df4
A B C D
0 1.056889 0.494769 0.588765 0.846133
1 1.536102 2.015574 -1.279769 -0.378024
2 -0.097357 -0.886320 0.713624 -1.055808
3 -0.269585 -0.512070 0.755534 0.855884
4 -2.691672 -0.597245 1.023647 0.278428
I think I'm being really inefficient with this solution. What is the more pythonic/ pandic way of doing this?
p.s. In my specific case the column names are not A,B,C,D and aren't alphabetically arranged. Just so know which two dataframes I want to combine.

If you need something more dynamic, first zip the column names of both DataFrames and then flatten the result:
df5 = pd.concat([df2, df3], axis=1)
print (df5)
A C B D
0 0.874226 -0.764478 1.022128 -1.209092
1 1.411708 -0.395135 -0.223004 0.124689
2 1.515223 -2.184020 0.316079 -0.137779
3 -0.554961 -0.149091 0.179390 -1.109159
4 0.666985 1.879810 0.406585 0.208084
#http://stackoverflow.com/a/10636583/2901002
print (list(sum(zip(df2.columns, df3.columns), ())))
['A', 'B', 'C', 'D']
print (df5[list(sum(zip(df2.columns, df3.columns), ()))])
A B C D
0 0.874226 1.022128 -0.764478 -1.209092
1 1.411708 -0.223004 -0.395135 0.124689
2 1.515223 0.316079 -2.184020 -0.137779
3 -0.554961 0.179390 -0.149091 -1.109159
4 0.666985 0.406585 1.879810 0.208084

How about this?
df4 = pd.concat([df2, df3], axis=1)
Or do they have to be in a specific order? Anyway, you can always reorder them:
df4 = df4[['A','B','C','D']]
And without writing out the columns:
df4 = df4[[item for items in zip(df2.columns, df3.columns) for item in items]]

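You could concat and then reindex (reindex_axis was removed in pandas 1.0). Note that this slicing gives the alternating order here, but not for frames with more than two columns each.
df = pd.concat([df2, df3], axis=1)
df.reindex(columns=df.columns[::2].tolist() + df.columns[1::2].tolist())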

Append even indices to df2 columns and odd indices to df3 columns. Use these new levels to sort.
df2_ = df2.T.set_index(np.arange(len(df2.columns)) * 2, append=True).T
df3_ = df3.T.set_index(np.arange(len(df3.columns)) * 2 + 1, append=True).T
df = pd.concat([df2_, df3_], axis=1).sort_index(axis=1, level=1)
df.columns = df.columns.droplevel(1)
df
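
When the two frames do not have the same number of columns, zip stops at the shorter one and the extra columns are lost. A possible workaround is itertools.zip_longest. This is a sketch with made-up frames left and right, assuming column names are unique across both frames and none of them is None.

```python
from itertools import chain, zip_longest

import pandas as pd

# Example frames with different numbers of columns (illustrative data)
left = pd.DataFrame({'A': [1, 2], 'C': [3, 4], 'E': [5, 6]})
right = pd.DataFrame({'B': [7, 8], 'D': [9, 10]})

# Pair up the column names; the shorter side is padded with None
pairs = zip_longest(left.columns, right.columns)

# Flatten the pairs and drop the padding
order = [c for c in chain.from_iterable(pairs) if c is not None]

combined = pd.concat([left, right], axis=1)[order]
print(combined)
```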

move column in pandas dataframe

I have the following dataframe:
a b x y
0 1 2 3 -1
1 2 4 6 -2
2 3 6 9 -3
3 4 8 12 -4
How can I move columns b and x such that they are the last 2 columns in the dataframe? I would like to specify b and x by name, but not the other columns.

You can rearrange columns directly by specifying their order:
df = df[['a', 'y', 'b', 'x']]
In the case of larger dataframes where the column titles are dynamic, you can use a list comprehension to select every column not in your target set and then append the target set to the end.
>>> df[[c for c in df if c not in ['b', 'x']]
+ ['b', 'x']]
a y b x
0 1 -1 2 3
1 2 -2 4 6
2 3 -3 6 9
3 4 -4 8 12
To make it more bullet proof, you can ensure that your target columns are indeed in the dataframe:
cols_at_end = ['b', 'x']
df = df[[c for c in df if c not in cols_at_end]
+ [c for c in cols_at_end if c in df]]

cols = list(df.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('b')) #Remove b from list
cols.pop(cols.index('x')) #Remove x from list
df = df[cols+['b','x']] #Create new dataframe with columns in the order you want

For example, to move column "name" to be the first column in df you can use insert:
column_to_move = df.pop("name")
# insert column with insert(location, column_name, column_value)
df.insert(0, "name", column_to_move)
similarly, if you want this column to be e.g. third column from the beginning:
df.insert(2, "name", column_to_move )

You can use the approach below. It's very simple, and similar to the good answer given by Charlie Haley.
df1 = df.pop('b') # remove column b and store it in df1
df2 = df.pop('x') # remove column x and store it in df2
df['b']=df1 # add the b series back as a 'new' column.
df['x']=df2 # add the x series back as a 'new' column.
Now you have your dataframe with the columns 'b' and 'x' in the end. You can see this video from OSPY : https://youtu.be/RlbO27N3Xg4

similar to ROBBAT1's answer above, but hopefully a bit more robust:
df.insert(len(df.columns)-1, 'b', df.pop('b'))
df.insert(len(df.columns)-1, 'x', df.pop('x'))
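
The one-liners above rely on Python evaluating call arguments from left to right: the position is computed while the column is still in the frame, and pop then shortens the frame by one, so the position equals the new number of columns. Spelled out step by step, as a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8],
                   'x': [3, 6, 9, 12], 'y': [-1, -2, -3, -4]})

for name in ['b', 'x']:
    # Computed before the pop, while the column is still present
    position = len(df.columns) - 1
    # Removes the column, so the frame now has exactly `position` columns
    column = df.pop(name)
    # Inserting at index == number of columns appends at the end
    df.insert(position, name, column)

print(df.columns.tolist())
```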

This function will reorder your columns without losing data. Any columns you do not mention stay in the middle, in their original order:
def reorder_columns(columns, first_cols=(), last_cols=(), drop_cols=()):
    # A list comprehension keeps the original order; a set would not
    excluded = set(first_cols) | set(last_cols) | set(drop_cols)
    middle = [c for c in columns if c not in excluded]
    return list(first_cols) + middle + list(last_cols)
Example usage:
my_list = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
# Output:
['fourth', 'third', 'first', 'sixth', 'second']
To assign to your dataframe, use:
my_list = df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
df = df[reordered_cols]

Simple solution:
old_cols = df.columns.values
new_cols= ['a', 'y', 'b', 'x']
df = df.reindex(columns=new_cols)

An alternative, more generic method:
from pandas import DataFrame

def move_columns(df: DataFrame, cols_to_move: list, new_index: int) -> DataFrame:
    """
    This method re-arranges the columns in a dataframe to place the desired columns at the desired index.
    ex Usage: df = move_columns(df, ['Rev'], 2)
    :param df:
    :param cols_to_move: The names of the columns to move. They must be a list
    :param new_index: The 0-based location to place the columns.
    :return: Return a dataframe with the columns re-arranged
    """
    other = [c for c in df if c not in cols_to_move]
    start = other[0:new_index]
    end = other[new_index:]
    return df[start + cols_to_move + end]

You can use pd.Index.difference with np.hstack, then reindex or use label-based indexing. In general, it's a good idea to avoid list comprehensions or other explicit loops with NumPy / Pandas objects.
cols_to_move = ['b', 'x']
# sort=False keeps the remaining columns in their original order
new_cols = np.hstack((df.columns.difference(cols_to_move, sort=False), cols_to_move))
# OPTION 1: reindex
df = df.reindex(columns=new_cols)
# OPTION 2: direct label-based indexing
df = df[new_cols]
# OPTION 3: loc label-based indexing
df = df.loc[:, new_cols]
print(df)
# a y b x
# 0 1 -1 2 3
# 1 2 -2 4 6
# 2 3 -3 6 9
# 3 4 -4 8 12
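
One thing to watch with Index.difference is that it sorts its result by default, which goes unnoticed when the remaining columns already happen to be in alphabetical order. A small sketch with deliberately unsorted, made-up column names:

```python
import numpy as np
import pandas as pd

# Column names chosen so that sorted order differs from original order
df = pd.DataFrame({'z': [1], 'b': [2], 'a': [3], 'x': [4]})
cols_to_move = ['b', 'x']

# Default behaviour: the remaining labels come back sorted
sorted_rest = df.columns.difference(cols_to_move)

# sort=False: the remaining labels keep their original order
ordered_rest = df.columns.difference(cols_to_move, sort=False)

new_cols = np.hstack((ordered_rest, cols_to_move))
df = df.reindex(columns=new_cols)
print(df.columns.tolist())
```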

You can use movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'b')
mc.MoveToLast(df,'x')
Hope that helps.
P.S : The package can be found here. https://pypi.org/project/movecolumn/

You can also do this as a one-liner:
df.drop(columns=['b', 'x']).assign(b=df['b'], x=df['x'])

Move any column to the last position of the dataframe:
df= df[ [ col for col in df.columns if col != 'col_name_to_moved' ] + ['col_name_to_moved']]
Move any column to the first column of dataframe:
df= df[ ['col_name_to_moved'] + [ col for col in df.columns if col != 'col_name_to_moved' ]]
where col_name_to_moved is the column that you want to move.

Using a Pokémon dataset as an example, the columns of my dataframe are
['Name', '#', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
Here is the code:
import pandas as pd
df = pd.read_html('https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6')[0]
cols = df.columns.to_list()
cos_end= ["Name", "Total", "HP", "Defense"]
# Pop each target column and append it, so the targets end up last, in the given order
for j in cos_end:
    cols.append(cols.pop(cols.index(j)))
print(cols)
df = df.reindex(columns=cols)
print(df)
