I would like to merge nine Pandas dataframes together into a single dataframe, joining on two columns and controlling the resulting column names. Is this possible?
I have nine datasets. All of them have the following columns:
org, name, items, spend
I want to join them into a single dataframe with the following columns:
org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3...
I've been reading the documentation on merging and joining. I can currently merge two datasets together like this:
ad = pd.merge(df_presents, df_trees,
              on=['org', 'name'],
              suffixes=['_presents', '_trees'])
This works great; running print(list(ad.columns.values)) shows me the following columns:
[u'org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees'...]
But how can I do this for nine dataframes? merge only seems to accept two at a time, and if I do it sequentially, my column names are going to end up very messy.
You could use functools.reduce to iteratively apply pd.merge to each of the DataFrames:
result = functools.reduce(merge, dfs)
This is equivalent to
result = dfs[0]
for df in dfs[1:]:
result = merge(result, df)
To pass the on=['org', 'name'] argument, you could use functools.partial to define the merge function:
merge = functools.partial(pd.merge, on=['org', 'name'])
Since specifying the suffixes parameter in functools.partial would only allow one fixed choice of suffix, and since here we need a different suffix for each pd.merge call, I think it would be easiest to prepare the DataFrames' column names before calling pd.merge:
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)
For example,
import pandas as pd
import numpy as np
import functools
np.random.seed(2015)
N = 50
dfs = [pd.DataFrame(np.random.randint(5, size=(N, 4)),
                    columns=['org', 'name', 'items', 'spend']) for i in range(9)]
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)
merge = functools.partial(pd.merge, on=['org', 'name'])
result = functools.reduce(merge, dfs)
print(result.head())
yields
org name items_df1 spend_df1 items_df2 spend_df2 items_df3 \
0 2 4 4 2 3 0 1
1 2 4 4 2 3 0 1
2 2 4 4 2 3 0 1
3 2 4 4 2 3 0 1
4 2 4 4 2 3 0 1
spend_df3 items_df4 spend_df4 items_df5 spend_df5 items_df6 \
0 3 1 0 1 0 4
1 3 1 0 1 0 4
2 3 1 0 1 0 4
3 3 1 0 1 0 4
4 3 1 0 1 0 4
spend_df6 items_df7 spend_df7 items_df8 spend_df8 items_df9 spend_df9
0 3 4 1 3 0 1 2
1 3 4 1 3 0 0 3
2 3 4 1 3 0 0 0
3 3 3 1 3 0 1 2
4 3 3 1 3 0 0 3
Would doing a big pd.concat() and then renaming all the columns work for you? Something like:
desired_columns = ['items', 'spend']
big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)

new_columns = ['org', 'name']
for i in range(num_dataframes):
    new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])
big_df.columns = new_columns
This should give you columns like:
org, name, items_df0, spend_df0, items_df1, spend_df1, ..., items_df8, spend_df8
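Note that this relies on every frame listing the same (org, name) pairs in the same row order. A safer variant (just a sketch of my own, assuming dfs holds the nine frames) aligns on the join keys instead:

pieces = [df.set_index(['org', 'name']).add_suffix('_df%i' % i)
          for i, df in enumerate(dfs)]
# axis=1 concat aligns on the (org, name) index; assumes each pair is unique per frame
big_df = pd.concat(pieces, axis=1).reset_index()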
I've wanted this as well at times but have been unable to find a built-in pandas way of doing it. Here is my suggestion (and my plan for the next time I need it):
Create an empty dictionary, merge_dict.
Loop through the index you want for each of your data frames and add the desired values to the dictionary with the index as the key.
Generate a new index as sorted(merge_dict).
Generate a new list of data for each column by looping through merge_dict.items().
Create a new data frame with index=sorted(merge_dict) and columns created in the previous step.
Basically, this is somewhat like a hash join in SQL. It seems like the most efficient way I can think of and shouldn't take too long to code up; a rough sketch follows.
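A rough, untested sketch of the idea (assuming dfs is a list of frames, each indexed by the join key, with every key appearing exactly once per frame):

merge_dict = {}
for df in dfs:
    for key, row in df.iterrows():
        # collect each frame's values under the shared key
        merge_dict.setdefault(key, []).extend(row.tolist())

new_index = sorted(merge_dict)
result = pd.DataFrame([merge_dict[key] for key in new_index], index=new_index)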
Good luck.
Related
I have a folder with 30 csvs. All of them have unique columns from one another with the exception of a single "UNITID" column. I'm looking to do a groupby function on that UNITID column across all the csvs.
Ultimately I want a single dataframe with all the columns next to each other for each UNITID.
Any thoughts on how I can do that?
Thanks in advance.
Perhaps you could merge the dataframes together, one at a time? Something like this:
# get a list of your csv paths somehow
list_of_csvs = get_filenames_of_csvs()
# load the first csv file into a DF to start with
big_df = pd.read_csv(list_of_csvs[0])
# merge to other csvs into the first, one at a time
for csv in list_of_csvs[1:]:
    df = pd.read_csv(csv)
    big_df = big_df.merge(df, how="outer", on="UNITID")
All the csvs will be merged together based on UNITID, preserving the union of all columns.
An alternative one-liner to dustin's solution would be the combination of functools' reduce function and DataFrame.merge(),
like so,
from functools import reduce # standard library, no need to pip it
from pandas import DataFrame
# make some dfs
df1
id col_one col_two
0 0 a d
1 1 b e
2 2 c f
df2
id col_three col_four
0 0 A D
1 1 B E
2 2 C F
df3
id col_five col_six
0 0 1 4
1 1 2 5
2 2 3 6
The one-liner:
reduce(lambda x, y: x.merge(y, on="id"), [df1, df2, df3])
id col_one col_two col_three col_four col_five col_six
0 0 a d A D 1 4
1 1 b e B E 2 5
2 2 c f C F 3 6
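Applied to the CSV question above, the same pattern might look like this (reusing list_of_csvs and the UNITID key from the loop-based answer, with pandas imported as pd):

dfs = (pd.read_csv(f) for f in list_of_csvs)
big_df = reduce(lambda x, y: x.merge(y, how="outer", on="UNITID"), dfs)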
functools.reduce docs
pandas.DataFrame.merge docs
I have two pandas data frames (df1 and df2):
# df1
ID COL
1 A
2 F
2 A
3 A
3 S
3 D
4 D
# df2
ID VAL
1 1
2 0
3 0
3 1
4 0
My goal is to append the corresponding VAL from df2 to each ID in df1. However, the relationship is not one-to-one (this is my client's fault and there's nothing I can do about this). To solve this problem, I want to sort df1 by df2['ID'] such that df1['ID'] is identical to df2['ID'].
So basically, for any row i in 0 to len(df2):
if df1.loc[i, 'ID'] == df2.loc[i, 'ID'] then keep row i in df1.
if df1.loc[i, 'ID'] != df2.loc[i, 'ID'] then drop row i from df1 and repeat.
The desired result is:
ID COL
1 A
2 F
3 A
3 S
4 D
This way, I can use pandas.concat([df1, df2['VAL']], axis=1) to assign df2['VAL'] to df1.
Is there a standardized way to do this? Does pandas.merge() have a method for doing this?
Before this gets voted as a duplicate, please realize that len(df1) != len(df2), so threads like this are not quite what I'm looking for.
This can be done with a merge on both ID and the order of each occurrence within each ID:
(df1.assign(idx=df1.groupby('ID').cumcount())
    .merge(df2.assign(idx=df2.groupby('ID').cumcount()),
           on=['ID', 'idx'],
           suffixes=['', '_drop'])
    [df1.columns]
)
Output:
ID COL
0 1 A
1 2 F
2 3 A
3 3 S
4 4 D
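For reference, groupby('ID').cumcount() just numbers the repeats within each ID, which is what lets equal occurrences pair up; on df2 it yields:

df2.assign(idx=df2.groupby('ID').cumcount())

   ID  VAL  idx
0   1    1    0
1   2    0    0
2   3    0    0
3   3    1    1
4   4    0    0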
The simplest way I can see of getting the result you want is:
# Add a count for each repetition of the ids to temporary frames
x = df1.assign(id_counter=df1.groupby('ID').cumcount())
y = df2.assign(id_counter=df2.groupby('ID').cumcount())
# Merge using the ID and the repetition counter
df1 = pd.merge(x, y, how='right', on=['ID', 'id_counter']).drop('id_counter', axis=1)
Which would produce this output:
ID COL VAL
0 1 A 1
1 2 F 0
2 3 A 0
3 3 S 1
4 4 D 0
I have the following dataframe in Pandas
OfferPreference_A OfferPreference_B OfferPreference_C
A B A
B C C
C S G
I have the following dictionary of the unique values across all the columns:
dict1 = {'A': 1, 'B': 2, 'C': 3, 'S': 4, 'G': 5, 'D': 6}
I also have a list of the column names:
columnlist=['OfferPreference_A', 'OfferPreference_B', 'OfferPreference_C']
I am trying to get the following table as the output:
OfferPreference_A OfferPreference_B OfferPreference_C
1 2 1
2 3 3
3 4 5
How do I do this?
Use:
# if a value does not match, you get NaN
df = df[columnlist].applymap(dict1.get)
Or:
# if a value does not match, you keep the original value
df = df[columnlist].replace(dict1)
Or:
# if a value does not match, you get NaN
df = df[columnlist].stack().map(dict1).unstack()
print(df)
OfferPreference_A OfferPreference_B OfferPreference_C
0 1 2 1
1 2 3 3
2 3 4 5
You can use map for this as shown below, assuming the values will always match:
for col in columnlist:
    df[col] = df[col].map(dict1)
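If some values might be missing from dict1 after all, one defensive variant (my addition, not part of the question) falls back to the original value instead of NaN:

for col in columnlist:
    df[col] = df[col].map(dict1).fillna(df[col])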
I have two dataframes df1 and df2 that are defined like so:
df1 df2
Out[69]: Out[70]:
A B A B
0 2 a 0 5 q
1 1 s 1 6 w
2 3 d 2 3 e
3 4 f 3 1 r
My goal is to concatenate the dataframes by alternating the rows so that the resulting dataframe is like this:
dff
Out[71]:
A B
0 2 a <--- belongs to df1
0 5 q <--- belongs to df2
1 1 s <--- belongs to df1
1 6 w <--- belongs to df2
2 3 d <--- belongs to df1
2 3 e <--- belongs to df2
3 4 f <--- belongs to df1
3 1 r <--- belongs to df2
As you can see the first row of dff corresponds to the first row of df1 and the second row of dff is the first row of df2. The pattern repeats until the end.
I tried to reach my goal by using the following lines of code:
import pandas as pd
df1 = pd.DataFrame({'A':[2,1,3,4], 'B':['a','s','d','f']})
df2 = pd.DataFrame({'A':[5,6,3,1], 'B':['q','w','e','r']})
dfff = pd.DataFrame()
for i in range(0, 4):
    dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
    dfff = pd.concat([dfff, dfx])
However, this approach doesn't work because df1.iloc[i] and df2.iloc[i] are automatically reshaped into columns instead of rows, and I cannot revert the process (even by using .T).
Question: Can you please suggest me a nice and elegant way to reach my goal?
Optional: Can you also provide an explanation about how to convert a column back to row?
I'm unable to comment on the accepted answer, but note that the sort operation is unstable by default, so you must choose a stable sorting algorithm:
pd.concat([df1, df2]).sort_index(kind='mergesort')
IIUC
In [64]: pd.concat([df1, df2]).sort_index()
Out[64]:
A B
0 2 a
0 5 q
1 1 s
1 6 w
2 3 d
2 3 e
3 4 f
3 1 r
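As for the optional question: df1.iloc[i] is a Series, and .T is a no-op on a Series, which is why the transpose seemed to do nothing. To turn such a row back into a one-row frame you can use .to_frame().T:

df1.iloc[0].to_frame().T

   A  B
0  2  a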
I currently have two dataframes that share a matching column. For example:
Data frame 1 with columns : A,B,C
Data frame 2 with column : A
I want to keep all rows in the first dataframe whose A values appear in df2. For example, if df1 and df2 are:
df1
A B C
0 1 3
4 2 5
6 3 1
8 0 0
2 1 1
df2
A
4
6
1
So in this case, I want to only keep the second and third line of df1.
I tried doing it like this, but it didn't work since both dataframes are pretty big:
for index, row in df1.iterrows():
    counter = 0
    for index2, row2 in df2.iterrows():
        if row["A"] == row2["A"]:
            counter = counter + 1
    if counter == 0:
        df1.drop(index, inplace=True)
Use isin to test for membership:
In [176]:
df1[df1['A'].isin(df2['A'])]
Out[176]:
A B C
1 4 2 5
2 6 3 1
Or use the merge method:
import pandas
df1 = pandas.DataFrame([[0,1,3],[4,2,5],[6,3,1],[8,0,0],[2,1,1]], columns=['A', 'B', 'C'])
df2 = pandas.DataFrame([4,6,1], columns=['A'])
df2.merge(df1, on='A')
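For the example frames above this keeps just the matching rows, though note that, unlike the isin approach, merge resets the index:

   A  B  C
0  4  2  5
1  6  3  1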