I have a dataframe with 700+ columns. I am doing a groupby with one column, let's say df.a, and I want to aggregate every column by mean except the last 10, which I want to aggregate by max. I am aware of creating a conditional dictionary and then passing it into a groupby like this:
d = {'DATE': 'last', 'AE_NAME': 'last', 'ANSWERED_CALL': 'sum'}
res = df.groupby(df.a).agg(d)
However, with so many columns, I do not want to have to write this all out. Is there a quick way to do this?
You could use zip and some not really elegant code (imo), but it works:
cols = df.drop("A", axis=1).columns # drop groupby column since not in agg
len_means = len(cols[:-10]) # grabbing all cols except the last ten ones
len_max = len(cols[-10:] # grabbing the last ten cols length
d_means = {i:j for i,j in zip(cols[:-10], ["mean"]*len_means)}
d_max = {i:j for i,j in zip(cols[-10:], ["max"]*len_max)}
d = d_means.update(d_max}
res = df.groupby(df.a).agg(d)
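A slightly more compact way to build the same mapping, under the same assumptions about column order, is dict.fromkeys:
# dict.fromkeys maps many keys to the same value in one call
d = {**dict.fromkeys(cols[:-10], 'mean'), **dict.fromkeys(cols[-10:], 'max')}
res = df.groupby(df.a).agg(d)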
Edit: since OP mentioned the columns to aggregate by max are actually named differently (they end with the letter c):
c_cols = [col for col in df.columns if col.endswith('c')]
non_c_cols = [col for col in df.columns if col not in c_cols]
and one only needs to plug these columns into the code above to get the result.
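A sketch of that, assuming the groupby column a should itself be excluded from the aggregation dictionary:
# hypothetical layout: columns ending in "c" get max, every other column gets mean
d_means = {col: 'mean' for col in non_c_cols if col != 'a'}
d_max = {col: 'max' for col in c_cols}
res = df.groupby(df.a).agg({**d_means, **d_max})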
I would approach this problem as follows:
1) Define a cutoff for which columns to select
2) Select the columns you need
3) Create both your mean and max aggregations with GroupBy
4) Join both dataframes together:
# example dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 10), columns=list('abcdefghij'))
df.insert(0, 'ID', ['aaa', 'bbb', 'aaa', 'ccc', 'bbb'])
ID a b c d e f g h i j
0 aaa 0.228208 0.822641 0.407747 0.416335 0.039717 0.854789 0.108124 0.666190 0.074569 0.329419
1 bbb 0.285293 0.274654 0.507607 0.527335 0.599833 0.511760 0.747992 0.930221 0.396697 0.959254
2 aaa 0.844373 0.431420 0.083631 0.656162 0.511913 0.486187 0.955340 0.130358 0.759013 0.181874
3 ccc 0.259888 0.992480 0.365106 0.041288 0.833069 0.474904 0.212645 0.178981 0.595891 0.143127
4 bbb 0.823457 0.172947 0.907415 0.719616 0.632012 0.199703 0.672745 0.563852 0.120827 0.092455
cutoff = 7
mean_cols = df.columns[:cutoff]
max_cols = ['ID'] + df.columns[cutoff:].tolist()
df1 = df[mean_cols].groupby('ID').mean()
df2 = df[max_cols].groupby('ID').max()
df = df1.join(df2).reset_index()
ID a b c d e f g h i j
0 aaa 0.536290 0.627031 0.245689 0.536248 0.275815 0.670488 0.955340 0.666190 0.759013 0.329419
1 bbb 0.554375 0.223800 0.707511 0.623476 0.615923 0.355732 0.747992 0.930221 0.396697 0.959254
2 ccc 0.259888 0.992480 0.365106 0.041288 0.833069 0.474904 0.212645 0.178981 0.595891 0.143127
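This way you never have to spell out a 700-entry aggregation dictionary: each aggregation runs on its own column slice, and the join aligns the two results on the group key.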
I have a pandas dataframe like as given below
import numpy as np
import pandas as pd

dfx = pd.DataFrame({'min_temp': [38,36,np.nan,38,37,39], 'max_temp': [41,39,39,41,43,44],
                    'min_hr': [89,87,85,84,82,86], 'max_hr': [91,98,np.nan,94,92,96], 'min_sbp': [21,23,25,27,28,29],
                    'ethnicity': ['A','B','C','D','E','F'], 'Gender': ['M','F','F','F','F','F']})
What I would like to do is
1) Identify all columns that contain min and max.
2) Find their corresponding pair. ex: min_temp and max_temp are a pair. Similarly min_hr and max_hr are a pair
3) Convert these two columns into one column and name it rel_temp. See below for the formula:
rel_temp = (max_temp - min_temp)/min_temp
This is what I was trying. Note that my real data has several thousand records and hundreds of columns like this.
def myfunc(n):
    return lambda a, b: ((b - a) / a)

dfx.apply(myfunc(col for col in dfx.columns))  # didn't know how to apply string contains here
I expect my output to be like this. Please note that only min and max columns have to be transformed. Rest of the columns in dataframe should be left as is.
The idea is to create df1 and df2 with the same column names using DataFrame.filter and rename, then subtract and divide all columns at once with DataFrame.sub and DataFrame.div:
df1 = dfx.filter(like='max').rename(columns=lambda x: x.replace('max','rel'))
df2 = dfx.filter(like='min').rename(columns=lambda x: x.replace('min','rel'))
df = df1.sub(df2).div(df2).join(dfx.loc[:, ~dfx.columns.str.contains('min|max')])
print(df)
rel_temp rel_hr ethnicity Gender
0 0.078947 0.022472 A M
1 0.083333 0.126437 B F
2 NaN NaN C F
3 0.078947 0.119048 D F
4 0.162162 0.121951 E F
5 0.128205 0.116279 F F
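Note that min_sbp vanishes from the result: it never produces a rel_ column because it has no max_sbp partner, and the final join keeps only the columns whose names contain neither min nor max.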
Try using:
cols = dfx.columns
con = cols[cols.str.contains('_')]
for i in con.str.split('_').str[-1].unique():
    # only build a rel_ column when both the min_ and max_ versions exist
    if 'min_%s' % i in con and 'max_%s' % i in con:
        dfx['rel_%s' % i] = (dfx['max_%s' % i] - dfx['min_%s' % i]) / dfx['min_%s' % i]
dfx = dfx.drop(con, axis=1)
print(dfx)
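With the sample dfx this leaves ethnicity, Gender, rel_temp and rel_hr; the unpaired min_sbp is skipped by the guard and then dropped.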
I was trying to clean up column names in a dataframe, but only for a subset of the columns.
Somehow it doesn't work when trying to replace column names on a slice of the dataframe. Why is that?
Let's say we have the following dataframe:
Note: at the bottom there is copy-pasteable code to reproduce the data.
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})
This is because pandas' index is immutable. If you check the documentation for class pandas.Index, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace (remember to assign the result, since rename returns a copy by default):
df = df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
To overwrite column names you can use the .rename() method:
So, it will look like:
df.rename(columns={'ColAfjkj': 'ColA',
                   'ColBhuqwa': 'ColB',
                   'ColCouiqw': 'ColC'},
          inplace=True)
More info regarding rename can be found in the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:, 1:4].columns
Then, use a list comprehension with a conditional to rename just the columns you want:
df.columns = [x if x not in mask else x[:4] for x in df.columns]
Having a collection of data frames, the goal is to identify the duplicated column names and return them as a list.
Example
The input are 3 data frames df1, df2 and df3:
import pandas as pd

df1 = pd.DataFrame({'a':[1,5], 'b':[3,9], 'e':[0,7]})
a b e
0 1 3 0
1 5 9 7
df2 = pd.DataFrame({'d':[2,3], 'e':[0,7], 'f':[2,1]})
d e f
0 2 0 2
1 3 7 1
df3 = pd.DataFrame({'b':[3,9], 'c':[8,2], 'e':[0,7]})
b c e
0 3 8 0
1 9 2 7
The output is the list ['b', 'e'].
pd.Series.duplicated
Since you are using Pandas, you can use pd.Series.duplicated after concatenating column names:
# concatenate column labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])
# keep all duplicates only, then extract unique names
res = s[s.duplicated(keep=False)].unique()
print(res)
array(['b', 'e'], dtype=object)
pd.Series.value_counts
Alternatively, you can extract a series of counts and identify rows which have a count greater than 1:
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)]).value_counts()
res = s[s > 1].index
print(res)
Index(['e', 'b'], dtype='object')
collections.Counter
The classic Python solution is to use collections.Counter followed by a list comprehension. Recall that list(df) returns the columns of a dataframe, so we can combine this with map and itertools.chain to produce an iterable to feed Counter.
from itertools import chain
from collections import Counter
c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))
res = [k for k, v in c.items() if v > 1]
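With the three example frames this yields the same pair (Counter preserves insertion order on Python 3.7+):
print(res)  # ['b', 'e']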
Here is my code for this problem. It compares just two data frames, without concatenating them, and records which columns of df2 are equal to each column of df1.
def getDuplicateColumns(df1, df2):
    df_compare = pd.DataFrame({'df1': df1.columns.to_list()})
    df_compare["df2"] = ""
    # Iterate over all the columns in df1
    for x in range(df1.shape[1]):
        # Select the column at the xth index
        col = df1.iloc[:, x]
        # Collect every column in df2 that is equal to it
        duplicateColumnNames = []
        for y in range(df2.shape[1]):
            # Select the column at the yth index
            otherCol = df2.iloc[:, y]
            # Check if the two columns are equal
            if col.equals(otherCol):
                duplicateColumnNames.append(df2.columns.values[y])
        df_compare.loc[df_compare["df1"] == df1.columns.values[x], "df2"] = str(duplicateColumnNames)
    return df_compare
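A quick usage sketch with df1 and df3 from the example above; note that Series.equals compares values and dtypes, not column names:
# b and e hold identical values in df1 and df3, so both are reported
print(getDuplicateColumns(df1, df3))
#   df1    df2
# 0   a     []
# 1   b  ['b']
# 2   e  ['e']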
Hi, I currently have a function that can split values in a cell delimited by a newline. However, the function below only lets me pass one column at a time; I was wondering if there is a way to pass it multiple columns, or in fact the whole dataframe.
A sample would be like this
A B C
1\n2\n3 2\n5 A
The code is below
def tidy_split(df, column, sep='|', keep=False):
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df
It currently works when I run
df1 = tidy_split(df, 'A', '\n')
After running the function on column A only:
A B C
1 2\n5 A
2 2\n5 A
3 2\n5 A
I was hoping to be able to pass in more than the one accepted argument, in this case splitting column 'B' as well. Previously I attempted passing in a lambda and using apply, but both still require the positional 'column' argument. Would appreciate any help given! Was also thinking a loop might be possible.
EDIT: Desired output (each number refers to something important):
Before
A B C
1\n2\n3 2\n5 A
After
A B C
1 2 A
2 5 A
3 n/a A
Input:
A B C
0 1\n2\n3 2\n5 A
Code:
import pandas as pd
cols = df.columns.tolist()
# create list in each cell by detecting '\n'
for col in cols:
df[col] = df[col].apply(lambda x: str(x).split("\n"))
# empty dataframe to store result
dfs = pd.DataFrame()
# loop over rows to construct small dataframes
# and then accumulate each to the resulting dataframe
for ind, row in df.iterrows():
a_vals = row['A']
b_vals = row['B'] + ["n/a"] * (len(a_vals) - len(row['B']))
c_vals = row['C'] + [row['C'][0]] * (len(a_vals) - len(row['C']))
temp = pd.DataFrame({'A': a_vals, 'B': b_vals, 'C': c_vals})
dfs = pd.concat([dfs, temp], axis=0, ignore_index=True)
Output:
A B C
0 1 2 A
1 2 5 A
2 3 n/a A
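If you are on a newer pandas (1.3+, where DataFrame.explode accepts a list of columns), a sketch that generalizes the same idea to every column at once, padding short cells with "n/a" and repeating single-value cells such as C, might look like this:
import pandas as pd

df = pd.DataFrame({'A': ['1\n2\n3'], 'B': ['2\n5'], 'C': ['A']})

# split every cell on the newline character -> a dataframe of lists
lists = df.apply(lambda col: col.astype(str).str.split('\n'))
# the longest list in each row decides how far the other cells are padded
width = lists.apply(lambda col: col.str.len()).max(axis=1)

def pad(cell, n):
    # repeat single-value cells (like C); pad multi-value cells with "n/a"
    filler = cell[0] if len(cell) == 1 else 'n/a'
    return cell + [filler] * (n - len(cell))

for col in lists.columns:
    lists[col] = [pad(c, n) for c, n in zip(lists[col], width)]

res = lists.explode(list(lists.columns), ignore_index=True)
print(res)
#    A    B  C
# 0  1    2  A
# 1  2    5  A
# 2  3  n/a  A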
I have the following dataframe:
a b x y
0 1 2 3 -1
1 2 4 6 -2
2 3 6 9 -3
3 4 8 12 -4
How can I move columns b and x such that they are the last 2 columns in the dataframe? I would like to specify b and x by name, but not the other columns.
You can rearrange columns directly by specifying their order:
df = df[['a', 'y', 'b', 'x']]
In the case of larger dataframes where the column titles are dynamic, you can use a list comprehension to select every column not in your target set and then append the target set to the end.
>>> df[[c for c in df if c not in ['b', 'x']]
+ ['b', 'x']]
a y b x
0 1 -1 2 3
1 2 -2 4 6
2 3 -3 6 9
3 4 -4 8 12
To make it more bulletproof, you can ensure that your target columns are indeed in the dataframe:
cols_at_end = ['b', 'x']
df = df[[c for c in df if c not in cols_at_end]
+ [c for c in cols_at_end if c in df]]
cols = list(df.columns.values)  # Make a list of all of the columns in the df
cols.pop(cols.index('b'))  # Remove b from list
cols.pop(cols.index('x'))  # Remove x from list
df = df[cols + ['b', 'x']]  # Create a new dataframe with columns in the order you want
For example, to move column "name" to be the first column in df you can use insert:
column_to_move = df.pop("name")
# insert column with insert(location, column_name, column_value)
df.insert(0, "name", column_to_move)
Similarly, if you want this column to be e.g. the third column from the beginning:
df.insert(2, "name", column_to_move)
You can use the approach below. It's very simple, and similar to the good answer given by Charlie Haley.
df1 = df.pop('b')  # remove column b and store it in df1
df2 = df.pop('x')  # remove column x and store it in df2
df['b'] = df1  # add the b series back as a 'new' column
df['x'] = df2  # add the x series back as a 'new' column
Now you have your dataframe with the columns 'b' and 'x' at the end. You can see this video from OSPY: https://youtu.be/RlbO27N3Xg4
Similar to ROBBAT1's answer above, but hopefully a bit more robust:
df.insert(len(df.columns)-1, 'b', df.pop('b'))
df.insert(len(df.columns)-1, 'x', df.pop('x'))
This function will reorder your columns without losing data. Any omitted columns remain in the center of the data set:
def reorder_columns(columns, first_cols=[], last_cols=[], drop_cols=[]):
    # note: the set differences below do not preserve the relative order
    # of the middle columns
    columns = list(set(columns) - set(first_cols))
    columns = list(set(columns) - set(drop_cols))
    columns = list(set(columns) - set(last_cols))
    new_order = first_cols + columns + last_cols
    return new_order
Example usage:
my_list = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
# Output:
['fourth', 'third', 'first', 'sixth', 'second']
To assign to your dataframe, use:
my_list = df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
df = df[reordered_cols]
Simple solution:
old_cols = df.columns.values  # the existing order, for reference
new_cols = ['a', 'y', 'b', 'x']
df = df.reindex(columns=new_cols)
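Note that reindex silently adds an all-NaN column for any label in new_cols that does not exist in the dataframe, so double-check the spelling of the names.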
An alternative, more generic method:
from pandas import DataFrame
def move_columns(df: DataFrame, cols_to_move: list, new_index: int) -> DataFrame:
    """
    Re-arrange the columns in a dataframe to place the desired columns at the desired index.

    Example usage: df = move_columns(df, ['Rev'], 2)

    :param df: The dataframe to re-arrange
    :param cols_to_move: The names of the columns to move, as a list
    :param new_index: The 0-based location to place the columns
    :return: A dataframe with the columns re-arranged
    """
    other = [c for c in df if c not in cols_to_move]
    start = other[0:new_index]
    end = other[new_index:]
    return df[start + cols_to_move + end]
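For the question's dataframe (columns a, b, x, y), moving b and x to the end would then be:
# after removing b and x, two columns remain, so new_index=2 places them at the end
df = move_columns(df, ['b', 'x'], 2)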
You can use pd.Index.difference with np.hstack, then reindex or use label-based indexing. In general, it's a good idea to avoid list comprehensions or other explicit loops with NumPy / Pandas objects.
import numpy as np

cols_to_move = ['b', 'x']
new_cols = np.hstack((df.columns.difference(cols_to_move), cols_to_move))
# OPTION 1: reindex
df = df.reindex(columns=new_cols)
# OPTION 2: direct label-based indexing
df = df[new_cols]
# OPTION 3: loc label-based indexing
df = df.loc[:, new_cols]
print(df)
# a y b x
# 0 1 -1 2 3
# 1 2 -2 4 6
# 2 3 -3 6 9
# 3 4 -4 8 12
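One caveat: pd.Index.difference returns the remaining labels in sorted order (here a and y happen to be sorted already), so this suits cases where the relative order of the unmoved columns does not matter.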
You can use the movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'b')
mc.MoveToLast(df,'x')
Hope that helps.
P.S.: The package can be found here: https://pypi.org/project/movecolumn/
You can also do this as a one-liner:
df.drop(columns=['b', 'x']).assign(b=df['b'], x=df['x'])
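Since assign adds the keyword columns in the order given, b and x end up last; the result is a new dataframe rather than an in-place change.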
This will move any column to the last column of the dataframe:
df = df[[col for col in df.columns if col != 'col_name_to_moved'] + ['col_name_to_moved']]
And this will move any column to the first column of the dataframe:
df = df[['col_name_to_moved'] + [col for col in df.columns if col != 'col_name_to_moved']]
where col_name_to_moved is the column that you want to move.
I use the Pokémon database as an example; the columns of my dataframe are
['Name', '#', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
Here is the code:
import pandas as pd
df = pd.read_html('https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6')[0]
cols = df.columns.to_list()
cols_end = ["Name", "Total", "HP", "Defense"]
# pop each target column out of the list and re-insert it in the final positions
for i, j in enumerate(cols_end, start=(len(cols) - len(cols_end))):
    cols.insert(i, cols.pop(cols.index(j)))
print(cols)
df = df.reindex(columns=cols)
print(df)