Identify min, max columns and transform as relative difference column - python

I have a pandas dataframe as given below:
import pandas as pd
import numpy as np

dfx = pd.DataFrame({'min_temp': [38,36,np.nan,38,37,39], 'max_temp': [41,39,39,41,43,44],
                    'min_hr': [89,87,85,84,82,86], 'max_hr': [91,98,np.nan,94,92,96], 'min_sbp': [21,23,25,27,28,29],
                    'ethnicity': ['A','B','C','D','E','F'], 'Gender': ['M','F','F','F','F','F']})
What I would like to do is
1) Identify all columns that contain min or max.
2) Find their corresponding pair, e.g. min_temp and max_temp are a pair; similarly min_hr and max_hr are a pair.
3) Convert each pair into one column, e.g. rel_temp. See below for the formula:
rel_temp = (max_temp - min_temp) / min_temp
This is what I was trying. Note that my real data has several thousand records and hundreds of columns like this:
def myfunc(n):
    return lambda a, b: ((b - a) / a)

dfx.apply(myfunc(col for col in dfx.columns)) # didn't know how to apply string contains here
I expect my output to be like this. Please note that only the min and max columns have to be transformed; the rest of the columns in the dataframe should be left as is.

The idea is to create df1 and df2 with the same column names using DataFrame.filter and rename, then subtract and divide all columns with DataFrame.sub and DataFrame.div:
df1 = dfx.filter(like='max').rename(columns=lambda x: x.replace('max','rel'))
df2 = dfx.filter(like='min').rename(columns=lambda x: x.replace('min','rel'))
df = df1.sub(df2).div(df2).join(dfx.loc[:, ~dfx.columns.str.contains('min|max')])
print (df)
   rel_temp    rel_hr ethnicity Gender
0  0.078947  0.022472         A      M
1  0.083333  0.126437         B      F
2       NaN       NaN         C      F
3  0.078947  0.119048         D      F
4  0.162162  0.121951         E      F
5  0.128205  0.116279         F      F
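As a side note, min_sbp has no max_sbp partner in the sample data. A minimal variant of the same filter/rename idea (my sketch, not part of the original answer) only builds rel_* columns for suffixes present on both sides and leaves unpaired columns untouched:
mins = dfx.filter(like='min').rename(columns=lambda x: x.replace('min','rel'))
maxs = dfx.filter(like='max').rename(columns=lambda x: x.replace('max','rel'))
common = mins.columns.intersection(maxs.columns)        # suffixes that have both min_ and max_
rel = maxs[common].sub(mins[common]).div(mins[common])  # (max - min) / min
paired = [c.replace('rel','min') for c in common] + [c.replace('rel','max') for c in common]
out = rel.join(dfx.drop(columns=paired))                # min_sbp, ethnicity, Gender stay as is
print (out)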

Try using:
cols = dfx.columns
con = cols[cols.str.contains('_')]
for i in con.str.split('_').str[-1].unique():
    df = dfx[[x for x in con if i in x]]
    dfx['rel_%s' % i] = (df['max_%s' % i] - df['min_%s' % i]) / df['min_%s' % i]
dfx = dfx.drop(con, axis=1)
print(dfx)
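Note that this loop assumes every suffix has both a min_ and a max_ column; with the sample data it would look up a non-existent 'max_sbp'. A hedged tweak (my addition, not part of the original answer) that skips unpaired suffixes could look like this:
cols = dfx.columns
con = cols[cols.str.contains('_')]
for i in con.str.split('_').str[-1].unique():
    # only transform suffixes that have both columns, so 'max_sbp' is never looked up
    if 'min_%s' % i in dfx.columns and 'max_%s' % i in dfx.columns:
        dfx['rel_%s' % i] = (dfx['max_%s' % i] - dfx['min_%s' % i]) / dfx['min_%s' % i]
# drop only the columns that were actually paired and transformed
paired = [c for c in con if 'rel_%s' % c.split('_')[-1] in dfx.columns]
dfx = dfx.drop(paired, axis=1)
print(dfx)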

Related

python: concatenate unknown number of columns in pandas DataFrame

I need to concatenate some columns in a pandas DataFrame with "_" as separator and store the result in a new column in the same DataFrame. The problem is that I don't know in advance which and how many columns to concatenate. The labels of the columns to be concatenated are determined at run time of the program and stored in a list.
Example:
import pandas as pd
df=pd.DataFrame(data={'col.a':['a','b','c'],'col.b':['d','e','f'], 'col.c':['g','h','i']})
col.a col.b col.c
0 a d g
1 b e h
2 c f i
cols_to_concat = ['col.a','col.c']
Desired result:
col.a col.b col.c cols.concat
0 a d g a_g
1 b e h b_h
2 c f i c_i
I need a method for generating df['cols.concat'] that works for a df with any number of columns and where cols_to_concat is an arbitrary subset of df.columns.
Supposing you have a list with the column names to concatenate, you could use apply and concatenate the values as follows:
import pandas as pd
df=pd.DataFrame(data={'col.a':['a','b','c'],
'col.b':['d','e','f'],
'col.c':['g','h','i']})
#this is the list of columns to concatenate
cols_to_cat = ['col.a','col.b','col.c']
df['concat'] = df[cols_to_cat].apply(lambda x: '_'.join(x), axis=1)
This should do the trick.
EDIT
you could concatenate any number of columns with this:
cols_to_cat = ['col.a','col.c']
df['concat'] = df[cols_to_cat].apply(lambda x: '_'.join(x), axis=1)
You could even repeat columns:
cols_to_cat = ['col.a','col.c','col.a']
df['concat'] = df[cols_to_cat].apply(lambda x: '_'.join(x), axis=1)
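If the row-wise apply turns out slow on many rows, a hedged alternative (my sketch, assuming all listed columns hold strings) is Series.str.cat, which concatenates the remaining columns onto the first one in a vectorized way:
cols_to_cat = ['col.a','col.c']
# concatenate the rest of the listed columns onto the first one with '_' as separator
df['concat'] = df[cols_to_cat[0]].str.cat(df[cols_to_cat[1:]], sep='_')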

Pandas apply lambda returning a tuple and insert into respective column

How can a pandas apply return a tuple whose results get inserted into the respective columns?
def foo(n, m):
    a = n + 1
    b = m + 2
    return a, b

df['a'], df['b'] = df.apply(lambda x: foo(x['n'], x['m']), axis=1)
n and m in the lambda function are the columns whose values are grabbed, respectively. Meanwhile, I would like to return the result from foo and insert it into columns a and b.
But the error message I get is "too many values to unpack (expected 2)".
I could store the returned tuple in a temporary column and split it, but I was wondering if there is any way apply will be able to split the results into columns?
Add result_type='expand' to DataFrame.apply and assign to a subset of columns given as a list:
df = pd.DataFrame({'n':[1,2], 'm':[5,6]})

def foo(n, m):
    a = n + 1
    b = m + 2
    return a, b

df[['a','b']] = df.apply(lambda x: foo(x['n'], x['m']), axis=1, result_type='expand')
print (df)
   n  m  a  b
0  1  5  2  7
1  2  6  3  8
Or convert the returned tuple to a Series:
df[['a','b']] = df.apply(lambda x: pd.Series(foo(x['n'], x['m'])), axis=1)
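A further hedged option (my addition, not from the original answer): since apply with the default result_type returns a Series of tuples here, the tuples can also be unpacked with zip.
df['a'], df['b'] = zip(*df.apply(lambda x: foo(x['n'], x['m']), axis=1))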

Pandas groupby different aggregation values with big dataframe

I have a dataframe with 700+ columns. I am doing a groupby with one column, let's say df.a, and I want to aggregate every column by mean except the last 10, which I want to aggregate by max. I am aware of creating a conditional dictionary and then passing it into a groupby like this:
d = {'DATE': 'last', 'AE_NAME': 'last', 'ANSWERED_CALL': 'sum'}
res = df.groupby(df.a).agg(d)
However, with so many columns, I do not want to have to write this all out. Is there a quick way to do this?
You could use zip and some not really elegant code (imo), but it works:
cols = df.drop("a", axis=1).columns # drop the groupby column since it is not in the agg
len_means = len(cols[:-10]) # grabbing all cols except the last ten ones
len_max = len(cols[-10:]) # length of the last ten cols
d_means = {i:j for i,j in zip(cols[:-10], ["mean"]*len_means)}
d_max = {i:j for i,j in zip(cols[-10:], ["max"]*len_max)}
d = {**d_means, **d_max} # merge the two dicts (dict.update returns None)
res = df.groupby(df.a).agg(d)
Edit: since OP mentioned the columns are named differently (ending with the letter c):
c_cols = [col for col in df.columns if col.endswith('c')]
non_c_cols = [col for col in df.columns if col not in c_cols]
and one only needs to plug these columns into the code above to get the result.
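For what it's worth, a more compact way to build the same aggregation mapping (my sketch, assuming the last ten columns get max and everything else mean) is dict.fromkeys:
cols = df.drop("a", axis=1).columns
d = {**dict.fromkeys(cols[:-10], "mean"), **dict.fromkeys(cols[-10:], "max")}
res = df.groupby(df.a).agg(d)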
I would approach this problem as follows:
Define a cutoff for which columns to select
Select the columns you need
Create both your mean and max aggregations with GroupBy
Join both dataframes together:
# example dataframe
df = pd.DataFrame(np.random.rand(5,10), columns=list('abcdefghij'))
df.insert(0, 'ID', ['aaa', 'bbb', 'aaa', 'ccc', 'bbb'])
ID a b c d e f g h i j
0 aaa 0.228208 0.822641 0.407747 0.416335 0.039717 0.854789 0.108124 0.666190 0.074569 0.329419
1 bbb 0.285293 0.274654 0.507607 0.527335 0.599833 0.511760 0.747992 0.930221 0.396697 0.959254
2 aaa 0.844373 0.431420 0.083631 0.656162 0.511913 0.486187 0.955340 0.130358 0.759013 0.181874
3 ccc 0.259888 0.992480 0.365106 0.041288 0.833069 0.474904 0.212645 0.178981 0.595891 0.143127
4 bbb 0.823457 0.172947 0.907415 0.719616 0.632012 0.199703 0.672745 0.563852 0.120827 0.092455
cutoff = 7
mean_cols = df.columns[:cutoff]
max_cols = ['ID'] + df.columns[cutoff:].tolist()
df1 = df[mean_cols].groupby('ID').mean()
df2 = df[max_cols].groupby('ID').max()
df = df1.join(df2).reset_index()
ID a b c d e f g h i j
0 aaa 0.536290 0.627031 0.245689 0.536248 0.275815 0.670488 0.955340 0.666190 0.759013 0.329419
1 bbb 0.554375 0.223800 0.707511 0.623476 0.615923 0.355732 0.747992 0.930221 0.396697 0.959254
2 ccc 0.259888 0.992480 0.365106 0.041288 0.833069 0.474904 0.212645 0.178981 0.595891 0.143127

Python: Pivot Table/group by specific conditions

I'm trying to change the structure of my data from a text file (.txt). The data looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
And I would like to transform it into this format (like a pivot table in Excel, where the column name is the character between ":" and each group always starts with :1:):
Group  :1:  :2:  :3:  :4:
1       A    B    C
2       D    E    F    G
3       H         I    J
Does anyone have any idea? Thanks in advance.
First create a DataFrame with read_csv and header=None, because there is no header in the file:
import pandas as pd
from io import StringIO
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), header=None)
print (df)
      0
0  :1:A
1  :2:B
2  :3:C
3  :1:D
4  :2:E
5  :3:F
6  :4:G
7  :1:H
8  :3:I
9  :4:J
Extract the original column with DataFrame.pop, strip the leading : with Series.str.strip and split the values into 2 new columns with Series.str.split. Then create groups by comparing column a with the string '1' via Series.eq (==) and taking Series.cumsum, create a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a  1  2  3  4
a
1  A  B  C
2  D  E  F  G
3  H     I  J
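If the column labels and index name from the question are wanted verbatim, an optional cosmetic step (my addition, not part of the original answer) is:
df1.columns = [':%s:' % c for c in df1.columns]  # '1' -> ':1:'
df1.index.name = 'Group'
print (df1)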
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index.name = 'Group'
res.index = range(1, res.shape[0] + 1)
res
Group  :1:  :2:  :3:  :4:
1       A    B    C
2       D    E    F    G
3       H         I    J
Another way to do this:
#read the file
with open("t.txt") as f:
    content = f.readlines()
#Create a dictionary and read each line from file to keep the column names (ex, :1:) as keys and rows (ex, A) as values in the dictionary.
my_dict={}
for v in content:
    key = v.rstrip(':')[0:3] # take the value ':1:'
    value = v.rstrip(':')[3] # take value 'A'
    my_dict.setdefault(key,[]).append(value)
#convert dictionary to dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict,orient='index').transpose()
df
The output will look like this:
:1: :2: :3: :4:
0 A B C G
1 D E F J
2 H None I None

Find all duplicate columns in a collection of data frames

Having a collection of data frames, the goal is to identify the duplicated column names and return them as a list.
Example
The input are 3 data frames df1, df2 and df3:
df1 = pd.DataFrame({'a':[1,5], 'b':[3,9], 'e':[0,7]})
a b e
0 1 3 0
1 5 9 7
df2 = pd.DataFrame({'d':[2,3], 'e':[0,7], 'f':[2,1]})
d e f
0 2 0 2
1 3 7 1
df3 = pd.DataFrame({'b':[3,9], 'c':[8,2], 'e':[0,7]})
b c e
0 3 8 0
1 9 2 7
The output is the list ['b', 'e'].
pd.Series.duplicated
Since you are using Pandas, you can use pd.Series.duplicated after concatenating column names:
# concatenate column labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])
# keep all duplicates only, then extract unique names
res = s[s.duplicated(keep=False)].unique()
print(res)
array(['b', 'e'], dtype=object)
pd.Series.value_counts
Alternatively, you can extract a series of counts and identify rows which have a count greater than 1:
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)]).value_counts()
res = s[s > 1].index
print(res)
Index(['e', 'b'], dtype='object')
collections.Counter
The classic Python solution is to use collections.Counter followed by a list comprehension. Recall that list(df) returns the columns of a dataframe, so we can use this with map and itertools.chain to produce an iterable to feed Counter.
from itertools import chain
from collections import Counter
c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))
res = [k for k, v in c.items() if v > 1]
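As a hedged generalisation of the same idea (my addition, with a hypothetical helper name), the Counter approach wraps easily into a function that takes any list of frames:
def duplicated_columns(frames):
    # count every column label across all frames and keep those seen more than once
    c = Counter(chain.from_iterable(map(list, frames)))
    return [k for k, v in c.items() if v > 1]

print(duplicated_columns([df1, df2, df3]))  # ['b', 'e']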
Here is my code for this problem, for comparing only two data frames, without concatenating them.
def getDuplicateColumns(df1, df2):
    df_compare = pd.DataFrame({'df1': df1.columns.to_list()})
    df_compare["df2"] = ""
    # Iterate over all the columns in df1
    for x in range(df1.shape[1]):
        # Select column at xth index.
        col = df1.iloc[:, x]
        # Collect every column of df2 that is value-equal to it
        duplicateColumnNames = []
        for y in range(df2.shape[1]):
            # Select column at yth index.
            otherCol = df2.iloc[:, y]
            # Check if the two columns at x, y index are equal
            if col.equals(otherCol):
                duplicateColumnNames.append(df2.columns.values[y])
        df_compare.loc[df_compare["df1"] == df1.columns.values[x], "df2"] = str(duplicateColumnNames)
    return df_compare
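A hypothetical usage example with the question's frames (my addition), comparing columns by value rather than by name:
res = getDuplicateColumns(df1, df2)
print(res)
# expected: the 'e' row of df1 is matched by ['e'] in df2, while 'a' and 'b' have no value-equal partner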
