pandas join DataFrame force suffix? - python

How can I force a suffix on a merge or join? I understand it's possible to provide one if there is a collision, but in my case I'm merging df1 with df2, which doesn't cause any collision, and then merging again on df2, which does use the suffixes. I would prefer for each merge to have a suffix, because it gets confusing if I do different combinations, as you can imagine.

You could force a suffix on the actual DataFrame:
In [11]: df_a = pd.DataFrame([[1], [2]], columns=['A'])
In [12]: df_b = pd.DataFrame([[3], [4]], columns=['B'])
In [13]: df_a.join(df_b)
Out[13]:
A B
0 1 3
1 2 4
By appending a suffix to its column names:
In [14]: df_a.columns = df_a.columns.map(lambda x: str(x) + '_a')
In [15]: df_a
Out[15]:
A_a
0 1
1 2
Now joins won't need the suffix correction, whether they collide or not:
In [16]: df_b.columns = df_b.columns.map(lambda x: str(x) + '_b')
In [17]: df_a.join(df_b)
Out[17]:
A_a B_b
0 1 3
1 2 4

You can add a suffix to all of a DataFrame's column names with the add_suffix method.
This makes a one-liner merge with forced suffixes more bearable, for example:
df_merged = df1.merge(df2.add_suffix('_2'))
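Note that add_suffix renames every column, including any join keys, so if you force suffixes on both sides you usually have to name the keys explicitly. A minimal sketch, assuming two hypothetical frames that share a key column 'id':
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'val': [10, 20]})
df2 = pd.DataFrame({'id': [1, 2], 'val': [30, 40]})

# Suffix both sides, then merge on the (now suffixed) key columns.
df_merged = df1.add_suffix('_1').merge(df2.add_suffix('_2'),
                                       left_on='id_1', right_on='id_2')
print(df_merged.columns.tolist())  # ['id_1', 'val_1', 'id_2', 'val_2']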

Pandas merge will only give the new columns a suffix when there is already a column with the same name. When I need to force the new columns to have a suffix, I create an empty column with the name of the column that I want to join:
df["colName"] = ""  # create an empty column
df.merge(right=df1, suffixes=("_a", "_b"))
You can later drop the empty column.
You could do the same for more than one column, or for every column in df.columns.values.
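A minimal end-to-end sketch of this trick, using hypothetical frames joined on a 'key' column (pass on= explicitly so the dummy column is not treated as a join key):
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'val': [10, 20]})
right = pd.DataFrame({'key': [1, 2], 'other': [30, 40]})

left['other'] = ''  # force a collision so merge applies the suffixes
merged = left.merge(right, on='key', suffixes=('_a', '_b'))
merged = merged.drop(columns=['other_a'])  # drop the dummy column again
print(merged.columns.tolist())  # ['key', 'val', 'other_b']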

This is what I've been using to pandas.merge two DataFrames and force suffixing:
import pandas as pd

def merge_force_suffix(left, right, **kwargs):
    on_col = kwargs['on']
    suffix_tuple = kwargs['suffixes']

    def suffix_col(col, suffix):
        if col != on_col:
            return str(col) + suffix
        else:
            return col

    left_suffixed = left.rename(columns=lambda x: suffix_col(x, suffix_tuple[0]))
    right_suffixed = right.rename(columns=lambda x: suffix_col(x, suffix_tuple[1]))
    del kwargs['suffixes']
    return pd.merge(left_suffixed, right_suffixed, **kwargs)
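For example, with two hypothetical frames sharing a 'key' column:
df1 = pd.DataFrame({'key': [1, 2], 'val': [10, 20]})
df2 = pd.DataFrame({'key': [1, 2], 'val': [30, 40]})

merged = merge_force_suffix(df1, df2, on='key', suffixes=('_1', '_2'))
print(merged.columns.tolist())  # ['key', 'val_1', 'val_2']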

Related

Apply the same operation to multiple DataFrames efficiently

I have two data frames with the same columns, and similar content.
I'd like to apply the same functions to each, without having to brute-force them or concatenate the dfs. I tried to pass the objects into nested dictionaries, but that seems more trouble than it's worth (I don't believe dataframe.to_dict supports passing into an existing list).
However, it appears that the for loop stores the list of dfs in the df object, and I don't know how to get it back to the original dfs... see my example below.
df1 = {'Column1': [1,2,2,4,5],
       'Column2': ["A","B","B","D","E"]}
df1 = pd.DataFrame(df1, columns=['Column1','Column2'])
df2 = {'Column1': [2,11,2,2,14],
       'Column2': ["B","Y","B","B","V"]}
df2 = pd.DataFrame(df2, columns=['Column1','Column2'])

def filter_fun(df1, df2):
    for df in (df1, df2):
        df = df[(df['Column1']==2) & (df['Column2'].isin(['B']))]
    return df1, df2

filter_fun(df1, df2)
If you write the filter as a function you can apply it in a list comprehension:
def filter(df):
    return df[(df['Column1']==2) & (df['Column2'].isin(['B']))]
df1, df2 = [filter(df) for df in (df1, df2)]
I would recommend concatenation with custom specified keys, because 1) it is easy to assign it back, and 2) you can do the same operation once instead of twice.
# Concatenate df1 and df2
df = pd.concat([df1, df2], keys=['a', 'b'])
# Perform your operation
out = df[(df['Column1'] == 2) & df['Column2'].isin(['B'])]
out.loc['a'] # result for `df1`
Column1 Column2
1 2 B
2 2 B
out.loc['b'] # result for `df2`
Column1 Column2
0 2 B
2 2 B
3 2 B
This should work fine for most operations. For groupby, you will want to group on the 0th index level as well.
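For example, to aggregate each original frame separately, include the 0th index level (the 'a'/'b' keys from the concat above) in the grouping; a small sketch:
# Count matching rows per original frame.
print(out.groupby(level=0).size())
# Group on the frame key plus an ordinary column.
print(df.groupby([df.index.get_level_values(0), 'Column2'])['Column1'].sum())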

Pandas dataframe, delete rows in between 2 rows that have same values in some columns

Given a pandas dataframe, how would I delete all rows that lie in between two rows that have the same values in two specific columns? In my case I have columns x, y and id. If an x-y pair appears twice in the dataframe, I would like to delete all rows that lie in between those two.
Example:
import pandas as pd
df1 = pd.DataFrame({'x':[1,2,3,2,1,3,4],
                    'y':[1,2,3,4,3,3,4],
                    'id':[1,2,3,4,5,6,7]})
As you can see the value pair x=3,y=3 appears twice in the dataframe, once at id=3, once at id=6.
How could I spot these rows and drop all rows in between?
So that I would get this for example:
df1 = pd.DataFrame({'x':[1,2,3,4],
                    'y':[1,2,3,4],
                    'id':[1,2,3,7]})
The dataframe could also look like this, with even more "duplicates", such as the 4,2 pair in my next example. I want to spot the outer duplicates, so that deleting the rows in between them also eliminates all other rows that appear twice or more. For example:
df1 = pd.DataFrame({'x':[1,2,3,4,1,4,3,4],
                    'y':[1,2,3,2,3,2,3,4],
                    'id':[1,2,3,4,5,6,7,8]})
# outer duplicates: (3,3) at id=3 and id=7
# inner duplicates: (4,2) at id=4 and id=6
# should become:
df1 = pd.DataFrame({'x':[1,2,3,4],
                    'y':[1,2,3,4],
                    'id':[1,2,3,8]})
For my example this amounts to a kind of loop elimination in the graph that I represent with the dataframe.
How would I implement that?
One possible solution:
Let's start from creation of your DataFrame (here I omitted the required import):
d = {'id': [1,2,3,4,5,6,7,8], 'x': [1,2,3,4,1,4,3,4], 'y': [1,2,3,2,3,2,3,4]}
df = pd.DataFrame(data=d)
Note that the index values are consecutive numbers (from 0), which will be used later.
Then we have to find duplicated rows, marking all instances (keep=False):
dups = df[df.duplicated(subset=['x', 'y'], keep=False)]
These duplicates should then be grouped on x and y:
gr = dups.groupby(['x', 'y'])
Then the number of the group to which each row belongs should be added to df, as e.g. a grpNo column:
df['grpNo'] = gr.ngroup()
The next step is to find the first and last index of the rows that fall in the first group (group No == 0) and save them in ind1 and ind2:
ind1 = df[df['grpNo'] == 0].index[0]
ind2 = df[df['grpNo'] == 0].index[-1]
Then we find a list of index values to be deleted:
indToDel = df[(df.index > ind1) & (df.index <= ind2)].index
To perform actual deletion of rows, we should execute:
df.drop(indToDel, inplace=True)
And the last step is to delete the grpNo column, which is not needed any more:
df.drop('grpNo', axis=1, inplace=True)
The result is:
id x y
0 1 1 1
1 2 2 2
2 3 3 3
7 8 4 4
So the whole script can be as follows:
import pandas as pd
d = {'id': [1,2,3,4,5,6,7,8], 'x': [1,2,3,4,1,4,3,4], 'y': [1,2,3,2,3,2,3,4]}
df = pd.DataFrame(data=d)
dups = df[df.duplicated(subset=['x', 'y'], keep=False)]
gr = dups.groupby(['x', 'y'])
df['grpNo'] = gr.ngroup()
ind1 = df[df['grpNo'] == 0].index[0]
ind2 = df[df['grpNo'] == 0].index[-1]
indToDel = df[(df.index > ind1) & (df.index <= ind2)].index
df.drop(indToDel, inplace=True)
df.drop('grpNo', axis=1, inplace=True)
print(df)
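Note that this script only removes rows between the first duplicated pair (grpNo == 0). A sketch of one way to generalize it, assuming (as in your examples) that inner duplicate ranges always lie within an outer one:
def drop_between_duplicates(df, cols):
    # Mark all rows whose value pair in `cols` occurs more than once.
    dups = df[df.duplicated(subset=cols, keep=False)]
    to_drop = set()
    for _, grp in dups.groupby(cols):
        first, last = grp.index[0], grp.index[-1]
        # Drop everything after the first occurrence up to the last one.
        to_drop.update(df.index[(df.index > first) & (df.index <= last)])
    return df.drop(list(to_drop))

print(drop_between_duplicates(pd.DataFrame(data=d), ['x', 'y']))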
This works for both your examples, although I'm not sure it generalizes to all the examples you have in mind:
df1[df1['x']==df1['y']]

How do I filter an empty DataFrame and still keep the columns of that DataFrame?

Here is an example of why pandas is a terribly designed, hacked-together library:
import pandas as pd
df = pd.DataFrame()
df['A'] = [1,2,3]
df['B'] = [4,5,6]
print(df)
df1 = df[df.A.apply(lambda x:x == 4)]
df2 = df1[df1.B.apply(lambda x:x == 1)]
print(df2)
This will print
df
A B
0 1 4
1 2 5
2 3 6
df2
Empty DataFrame
Columns: []
Index: []
Note the Columns: [], which means any further filtering/selecting on df2 will fail. This is a huge issue, because it means I now always have to check whether a table is empty before attempting to select from it, which is garbage behaviour.
For clarity, the sensible, thoughtful, reasonable, not totally broken behaviour would be to preserve the columns.
Anyone care to offer some hack I can apply on top of the collection of hacks which is the dataframe API?
Pandas considers almost all the situations we need, especially such simple cases.
PS: There is nothing wrong with pandas here:
df1 = df.loc[df.A.apply(lambda x:x == 4)]
df2 = df1.loc[df1.B.apply(lambda x:x == 1)]
df1
Out[53]:
Empty DataFrame
Columns: [A, B]
Index: []
df2
Out[54]:
Empty DataFrame
Columns: [A, B]
Index: []
Alternatively, cast the mask to bool explicitly. On an empty frame, apply returns an empty object-dtype Series, which pandas treats as a sequence of labels rather than as a boolean mask; astype(bool) fixes that:
df2 = df1[df1.B.apply(lambda x:x == 1).astype(bool)]
All other answers are missing the point (except for Wen's, which is an ok alternative)
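A quick way to see what is going on (a sketch; the mask's dtype is the culprit):
mask = df1.B.apply(lambda x: x == 1)
print(mask.dtype)  # object when df1 is empty, bool otherwise
print(df1[mask.astype(bool)].columns.tolist())  # ['A', 'B'] is preserved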

Pandas concatenate alternating columns

I have two dataframes as follows:
df2 = pd.DataFrame(np.random.randn(5,2),columns=['A','C'])
df3 = pd.DataFrame(np.random.randn(5,2),columns=['B','D'])
I wish to get the columns in an alternating fashion such that I get the result below:
df4 = pd.DataFrame()
for i in range(len(df2.columns)):
    df4[df2.columns[i]] = df2[df2.columns[i]]
    df4[df3.columns[i]] = df3[df3.columns[i]]
df4
A B C D
0 1.056889 0.494769 0.588765 0.846133
1 1.536102 2.015574 -1.279769 -0.378024
2 -0.097357 -0.886320 0.713624 -1.055808
3 -0.269585 -0.512070 0.755534 0.855884
4 -2.691672 -0.597245 1.023647 0.278428
I think I'm being really inefficient with this solution. What is the more pythonic/pandic way of doing this?
P.S. In my specific case the column names are not A, B, C, D and aren't alphabetically arranged; I chose them just so you know which two dataframes I want to combine.
If you need something more dynamic, first zip the column names of both DataFrames and then flatten them:
df5 = pd.concat([df2, df3], axis=1)
print (df5)
A C B D
0 0.874226 -0.764478 1.022128 -1.209092
1 1.411708 -0.395135 -0.223004 0.124689
2 1.515223 -2.184020 0.316079 -0.137779
3 -0.554961 -0.149091 0.179390 -1.109159
4 0.666985 1.879810 0.406585 0.208084
#http://stackoverflow.com/a/10636583/2901002
print (list(sum(zip(df2.columns, df3.columns), ())))
['A', 'B', 'C', 'D']
print (df5[list(sum(zip(df2.columns, df3.columns), ()))])
A B C D
0 0.874226 1.022128 -0.764478 -1.209092
1 1.411708 -0.223004 -0.395135 0.124689
2 1.515223 0.316079 -2.184020 -0.137779
3 -0.554961 0.179390 -0.149091 -1.109159
4 0.666985 0.406585 1.879810 0.208084
How about this?
df4 = pd.concat([df2, df3], axis=1)
Or do they have to be in a specific order? Anyway, you can always reorder them:
df4 = df4[['A','B','C','D']]
And without writing out the columns:
df4 = df4[[item for items in zip(df2.columns, df3.columns) for item in items]]
You could concat and then reindex to interleave the columns (reindex_axis did the same job but has since been deprecated and removed):
df = pd.concat([df2, df3], axis=1)
df.reindex(columns=df.columns[::2].tolist() + df.columns[1::2].tolist())
Append even indices to df2 columns and odd indices to df3 columns. Use these new levels to sort.
import numpy as np

df2_ = df2.T.set_index(np.arange(len(df2.columns)) * 2, append=True).T
df3_ = df3.T.set_index(np.arange(len(df3.columns)) * 2 + 1, append=True).T
df = pd.concat([df2_, df3_], axis=1).sort_index(axis=1, level=1)
df.columns = df.columns.droplevel(1)
df

move column in pandas dataframe

I have the following dataframe:
a b x y
0 1 2 3 -1
1 2 4 6 -2
2 3 6 9 -3
3 4 8 12 -4
How can I move columns b and x such that they are the last 2 columns in the dataframe? I would like to specify b and x by name, but not the other columns.
You can rearrange columns directly by specifying their order:
df = df[['a', 'y', 'b', 'x']]
In the case of larger dataframes where the column titles are dynamic, you can use a list comprehension to select every column not in your target set and then append the target set to the end.
>>> df[[c for c in df if c not in ['b', 'x']]
+ ['b', 'x']]
a y b x
0 1 -1 2 3
1 2 -2 4 6
2 3 -3 6 9
3 4 -4 8 12
To make it more bullet proof, you can ensure that your target columns are indeed in the dataframe:
cols_at_end = ['b', 'x']
df = df[[c for c in df if c not in cols_at_end]
+ [c for c in cols_at_end if c in df]]
cols = list(df.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('b')) #Remove b from list
cols.pop(cols.index('x')) #Remove x from list
df = df[cols+['b','x']] #Create new dataframe with columns in the order you want
For example, to move column "name" to be the first column in df you can use insert:
column_to_move = df.pop("name")
# insert column with insert(location, column_name, column_value)
df.insert(0, "name", column_to_move)
similarly, if you want this column to be e.g. third column from the beginning:
df.insert(2, "name", column_to_move)
You can use the approach below. It's very simple, but similar to the good answer given by Charlie Haley:
df1 = df.pop('b')  # remove column b and store it in df1
df2 = df.pop('x')  # remove column x and store it in df2
df['b'] = df1  # add the b series back as the last column
df['x'] = df2  # add the x series back as the last column
Now you have your dataframe with the columns 'b' and 'x' at the end. You can see this video from OSPY: https://youtu.be/RlbO27N3Xg4
Similar to ROBBAT1's answer above, but hopefully a bit more robust. Note that pop shrinks the frame first, so the insertion point for the end is len(df.columns), not len(df.columns)-1:
df.insert(len(df.columns), 'b', df.pop('b'))
df.insert(len(df.columns), 'x', df.pop('x'))
This function will reorder your columns without losing data; any unlisted columns remain in the middle of the data set, in their original order:
def reorder_columns(columns, first_cols=[], last_cols=[], drop_cols=[]):
    skip = set(first_cols) | set(last_cols) | set(drop_cols)
    middle = [c for c in columns if c not in skip]  # keep original order
    return first_cols + middle + last_cols
Example usage:
my_list = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
# Output:
['fourth', 'third', 'first', 'sixth', 'second']
To assign to your dataframe, use:
my_list = df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
df = df[reordered_cols]
Simple solution:
new_cols = ['a', 'y', 'b', 'x']
df = df.reindex(columns=new_cols)
An alternative, more generic method:
from pandas import DataFrame

def move_columns(df: DataFrame, cols_to_move: list, new_index: int) -> DataFrame:
    """
    This method re-arranges the columns in a dataframe to place the desired columns at the desired index.
    ex Usage: df = move_columns(df, ['Rev'], 2)
    :param df:
    :param cols_to_move: The names of the columns to move. They must be a list.
    :param new_index: The 0-based location to place the columns.
    :return: A dataframe with the columns re-arranged.
    """
    other = [c for c in df if c not in cols_to_move]
    start = other[0:new_index]
    end = other[new_index:]
    return df[start + cols_to_move + end]
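Applied to the question's frame, for instance (new_index counts positions among the columns that stay put, here 'a' and 'y'):
df = move_columns(df, ['b', 'x'], 2)  # column order becomes a, y, b, x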
You can use pd.Index.difference with np.hstack, then reindex or use label-based indexing. Note that Index.difference returns the remaining labels in sorted order. In general, it's a good idea to avoid list comprehensions or other explicit loops with NumPy / Pandas objects.
import numpy as np

cols_to_move = ['b', 'x']
new_cols = np.hstack((df.columns.difference(cols_to_move), cols_to_move))
# OPTION 1: reindex
df = df.reindex(columns=new_cols)
# OPTION 2: direct label-based indexing
df = df[new_cols]
# OPTION 3: loc label-based indexing
df = df.loc[:, new_cols]
print(df)
# a y b x
# 0 1 -1 2 3
# 1 2 -2 4 6
# 2 3 -3 6 9
# 3 4 -4 8 12
You can use the movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'b')
mc.MoveToLast(df,'x')
Hope that helps.
P.S.: The package can be found here: https://pypi.org/project/movecolumn/
You can also do this as a one-liner:
df.drop(columns=['b', 'x']).assign(b=df['b'], x=df['x'])
Move any column to the last column of the dataframe:
df = df[[col for col in df.columns if col != 'col_name_to_moved'] + ['col_name_to_moved']]
Move any column to the first column of the dataframe:
df = df[['col_name_to_moved'] + [col for col in df.columns if col != 'col_name_to_moved']]
where col_name_to_moved is the column that you want to move.
I use a Pokémon database as an example; the columns of my database are
['Name', '#', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
Here is the code:
import pandas as pd

df = pd.read_html('https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6')[0]
cols = df.columns.to_list()
cols_end = ["Name", "Total", "HP", "Defense"]
for i, j in enumerate(cols_end, start=len(cols) - len(cols_end)):
    cols.insert(i, cols.pop(cols.index(j)))
print(cols)
df = df.reindex(columns=cols)
print(df)
