python: concatenate unknown number of columns in pandas DataFrame

I need to concatenate some columns in a pandas DataFrame with "_" as separator and store the result in a new column in the same DataFrame. The problem is that I don't know in advance which and how many columns to concatenate. The labels of the columns to be concatenated are determined at run time of the program and stored in a list.
Example:
import pandas as pd
df=pd.DataFrame(data={'col.a':['a','b','c'],'col.b':['d','e','f'], 'col.c':['g','h','i']})
  col.a col.b col.c
0     a     d     g
1     b     e     h
2     c     f     i
cols_to_concat = ['col.a','col.c']
Desired result:
  col.a col.b col.c cols.concat
0     a     d     g         a_g
1     b     e     h         b_h
2     c     f     i         c_i
I need a method for generating df['cols.concat'] that works for a df with any number of columns and where cols_to_concat is an arbitrary subset of df.columns.

Supposing you have a list of the column names to concatenate, you could use apply and join the values:
import pandas as pd
df = pd.DataFrame(data={'col.a': ['a', 'b', 'c'],
                        'col.b': ['d', 'e', 'f'],
                        'col.c': ['g', 'h', 'i']})
#this is the list of columns to concatenate
cols_to_cat = ['col.a','col.b','col.c']
df['concat'] = df[cols_to_cat].apply(lambda x: '_'.join(x), axis=1)
This should do the trick.
EDIT
You can concatenate any number of columns with this:
cols_to_cat = ['col.a','col.c']
df['concat'] = df[cols_to_cat].apply(lambda x: '_'.join(x), axis=1)
You could even repeat columns:
cols_to_cat = ['col.a','col.c','col.a']
df['concat'] = df[cols_to_cat].apply(lambda x: '_'.join(x), axis=1)
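One caveat worth noting: '_'.join only works when every selected column already holds strings, and raises a TypeError otherwise. Casting with astype(str) first keeps the same pattern working for mixed dtypes; a minimal sketch (the num column here is a made-up example, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'col.a': ['a', 'b', 'c'],
                   'col.b': ['d', 'e', 'f'],
                   'num': [1, 2, 3]})
cols_to_cat = ['col.a', 'num']
# astype(str) converts the numeric column so '_'.join can handle it
df['concat'] = df[cols_to_cat].astype(str).apply('_'.join, axis=1)
print(df['concat'].tolist())  # ['a_1', 'b_2', 'c_3']
```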

Related

Merge a list of dataframes to create one dataframe by selecting last column only

I have a list of dataframes
new_list=[df1,df2,df3]
I know the following command to merge them:
pd.concat(new_list, axis=1)
How can I create a dataframe by selecting the last column of each dataframe in the list, without using for loops?
Pick the last columns using map:
import pandas as pd
# This is to have a test list to use
df1 = pd.DataFrame({"a":[1,2,3], "b":[2,3,4]})
df2 = pd.DataFrame({"c":[1,2,3], "d":[2,3,4]})
new_list = [df1, df2]
# logic
pd.concat(map(lambda x:x[x.columns[-1]], new_list), axis=1)
OUTPUT:
   b  d
0  2  2
1  3  3
2  4  4
Or use this:
import pandas as pd
new_list = [df1.iloc[:,-1:], df2.iloc[:,-1:], df3.iloc[:,-1:]]
pd.concat(new_list, axis = 1)
You can also use a lambda:
import pandas as pd
func = lambda x: x.iloc[:, -1] #select all rows and last column
new_list = [func(df1), func(df2), func(df3)]
pd.concat(new_list, axis=1)
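If a list comprehension counts as acceptable here, the map/lambda variant above can also be written this way; a small sketch reusing the same test frames:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
df2 = pd.DataFrame({"c": [1, 2, 3], "d": [2, 3, 4]})
new_list = [df1, df2]
# iloc[:, -1] selects the last column of each frame as a Series
result = pd.concat([d.iloc[:, -1] for d in new_list], axis=1)
```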

Add data series content in another data series

Is there a way to concatenate the information of two data series in pandas? Not append or merge the information of two data frames, but actually combine the content of each data series in a new data series.
Example:
ColumnA (Item type) Row 2 = 1 (float64)
ColumnB (Item number) Row 2 = 1 (float64)
ColumnC (Registration Date) Row 2 = 04/07/2018 (43285) (datetime64[ns])
In Excel I would concatenate the rows in Columns A, B, C and have the number in each column combined, using the formula =CONCAT(A2, B2, C2).
The result would be 1143285 in another cell D2, for example.
Is there a way for me to do that in Pandas? I could only find ways to join, combined or append the series in the data frame but not in the series itself.
You can use this:
df['D'] = df.apply(lambda row: str(row['A']) + str(row['B']) + str(row['C']), axis=1)
Following your example, it would be:
import pandas as pd
d = {'A': [1],'B':[1],'C':[43285]}
df = pd.DataFrame(data=d)
df['D'] = df.apply(lambda row: str(row['A']) + str(row['B']) + str(row['C']), axis=1)
Output:
   A  B      C        D
0  1  1  43285  1143285
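For larger frames, a vectorized alternative to the row-wise apply; a sketch on the same example data:

```python
import pandas as pd

d = {'A': [1], 'B': [1], 'C': [43285]}
df = pd.DataFrame(data=d)
# Cast each column to str and concatenate with +, avoiding a per-row apply
df['D'] = df['A'].astype(str) + df['B'].astype(str) + df['C'].astype(str)
print(df['D'].tolist())  # ['1143285']
```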

How to insert Pandas dataframe into another Pandas dataframe without wrapping it in a Series?

import pandas as pd
df1 = pd.DataFrame([{'a':1,'b':2}])
df2 = pd.DataFrame([{'x':100}])
df2['y'] = [df1]
# df2.loc[:, 'y'].shape == (1,)
# type(df2.loc[:, 'y'][0]) == pandas.core.frame.DataFrame
I want to make a df a column in an existing row. However Pandas wraps this df in a Series object so that I cannot access it with dot notation such as df2.y.a to get the value 1. Is there a way to make this not occur or is there some constraint on object type for df elements such that this is impossible?
the desired output is a df like:
x y
0 100 a b
0 1 2
and type(df2.y) == pd.DataFrame
You can combine two DataFrame objects along the columns axis, which I think achieves what you're trying to do. Let me know if this is what you're looking for:
import pandas as pd
df1 = pd.DataFrame([{'a':1,'b':2}])
df2 = pd.DataFrame([{'x':100}])
combined_df = pd.concat([df1, df2], axis=1)
print(combined_df)
   a  b    x
0  1  2  100
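If nesting really is required, a sketch of storing the frame as an object cell and reading it back. Note that df2.y itself remains a Series (pandas columns are always Series); only the cell's value is a DataFrame:

```python
import pandas as pd

df1 = pd.DataFrame([{'a': 1, 'b': 2}])
df2 = pd.DataFrame([{'x': 100}])
df2['y'] = [df1]          # the cell holds the DataFrame as a plain object
inner = df2.loc[0, 'y']   # retrieve the nested DataFrame back out
print(inner.a[0])  # 1
```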

Identify min, max columns and transform as relative difference column

I have a pandas dataframe like the one given below:
import numpy as np
import pandas as pd
dfx = pd.DataFrame({'min_temp': [38, 36, np.nan, 38, 37, 39],
                    'max_temp': [41, 39, 39, 41, 43, 44],
                    'min_hr': [89, 87, 85, 84, 82, 86],
                    'max_hr': [91, 98, np.nan, 94, 92, 96],
                    'min_sbp': [21, 23, 25, 27, 28, 29],
                    'ethnicity': ['A', 'B', 'C', 'D', 'E', 'F'],
                    'Gender': ['M', 'F', 'F', 'F', 'F', 'F']})
What I would like to do is:
1) Identify all columns whose names contain min and max.
2) Find their corresponding pair, e.g. min_temp and max_temp are a pair; similarly min_hr and max_hr are a pair.
3) Convert each pair into one column and name it e.g. rel_temp. See below for the formula:
rel_temp = (max_temp - min_temp) / min_temp
This is what I was trying. Do note that my real data has several thousand records and hundreds of columns like this
def myfunc(n):
    return lambda a, b: ((b - a) / a)

dfx.apply(myfunc(col for col in dfx.columns))  # didn't know how to apply string contains here
I expect my output to be like this. Please note that only min and max columns have to be transformed. Rest of the columns in dataframe should be left as is.
The idea is to create df1 and df2 with the same column names using DataFrame.filter and rename, then subtract and divide all columns with DataFrame.sub and DataFrame.div:
df1 = dfx.filter(like='max').rename(columns=lambda x: x.replace('max','rel'))
df2 = dfx.filter(like='min').rename(columns=lambda x: x.replace('min','rel'))
df = df1.sub(df2).div(df2).join(dfx.loc[:, ~dfx.columns.str.contains('min|max')])
print (df)
   rel_temp    rel_hr ethnicity Gender
0  0.078947  0.022472         A      M
1  0.083333  0.126437         B      F
2       NaN       NaN         C      F
3  0.078947  0.119048         D      F
4  0.162162  0.121951         E      F
5  0.128205  0.116279         F      F
Try using:
cols = dfx.columns
con = cols[cols.str.contains('_')]
for i in con.str.split('_').str[-1].unique():
    df = dfx[[x for x in con if i in x]]
    dfx['rel_%s' % i] = (df['max_%s' % i] - df['min_%s' % i]) / df['min_%s' % i]
dfx = dfx.drop(con, axis=1)
print(dfx)
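The filter/rename idea can be packaged as a self-contained sketch; this variant strips the min_/max_ prefix so the two sub-frames align by suffix (the reduced sample data below is hypothetical, not the question's full frame):

```python
import numpy as np
import pandas as pd

dfx = pd.DataFrame({'min_temp': [38, 36, np.nan], 'max_temp': [41, 39, 39],
                    'min_hr': [89, 87, 85], 'max_hr': [91, 98, np.nan],
                    'Gender': ['M', 'F', 'F']})
# Pair min_/max_ columns by stripping the prefix so both frames share names
mins = dfx.filter(like='min_').rename(columns=lambda c: c.replace('min_', ''))
maxs = dfx.filter(like='max_').rename(columns=lambda c: c.replace('max_', ''))
rel = maxs.sub(mins).div(mins).add_prefix('rel_')
# Keep the non-min/max columns unchanged
out = rel.join(dfx.drop(columns=dfx.filter(regex='^(min|max)_').columns))
```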

concat by taking the values from column

I have a list ['df1', 'df2'] in which I have stored the names of some dataframes that have been filtered on a few conditions. Then I converted this list to a dataframe using
df = pd.DataFrame(list1)
Now the df has only one column:
0
df1
df2
Sometimes it may also have:
0
df1
df2
df3
I want to concatenate all of these. My static code is
df_new = pd.concat([df1,df2],axis=1) or
df_new = pd.concat([df1,df2,df3],axis=1)
How can I make it dynamic (without me specifying df1, df2 by hand) so that it takes the values and concatenates them?
Use a list to collect the dataframes:
import pandas as pd
lists = [[1, 2, 3], [4, 5, 6]]
arr = []
for l in lists:
    new_df = pd.DataFrame(l)
    arr.append(new_df)
df = pd.concat(arr, axis=1)
df
Result:
   0  0
0  1  4
1  2  5
2  3  6
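If the list currently holds the dataframes' names rather than the objects themselves, one way to make the concat dynamic is to keep the actual frames, for example in a dict keyed by name. A sketch (the frames dict and its contents are hypothetical stand-ins for the filtered dataframes):

```python
import pandas as pd

# Keep the filtered frames themselves, not strings like 'df1', 'df2'
frames = {'df1': pd.DataFrame({'a': [1, 2]}),
          'df2': pd.DataFrame({'b': [3, 4]})}
# Works for any number of frames without naming them one by one
df_new = pd.concat(list(frames.values()), axis=1)
```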