Is there a way to concatenate the information of two data series in pandas? Not append or merge the information of two data frames, but actually combine the content of each data series in a new data series.
Example:
ColumnA (Item type) Row 2 = 1 (float64)
ColumnB (Item number) Row 2 = 1 (float64)
ColumnC (Registration Date) Row 2 = 04/07/2018 (43285) (datetime64[ns])
In Excel I would concatenate the cells in columns A, B and C so that the numbers in each column are combined, using the formula =concat(A2, B2, C2)
The result would be 1143285 in another cell D2, for example.
Is there a way for me to do that in Pandas? I could only find ways to join, combine or append the series in the data frame, but not the content of the series itself.
You can use DataFrame.apply with a row-wise lambda:
df['D'] = df.apply(lambda row: str(row['A']) + str(row['B']) + str(row['C']), axis=1)
Following your example it would be
import pandas as pd
d = {'A': [1],'B':[1],'C':[43285]}
df = pd.DataFrame(data=d)
df['D'] = df.apply(lambda row: str(row['A']) + str(row['B']) + str(row['C']), axis=1)
Output:
A B C D
0 1 1 43285 1143285
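If the columns really are float64, as in the question, str(1.0) gives '1.0', so the apply version above would produce '1.01.043285.0' rather than '1143285'. A minimal sketch of a vectorized alternative, assuming the values are whole numbers that can safely be cast to int (the column names A, B, C follow the example above; the float data is made up for illustration):

```python
import pandas as pd

# Same toy data as above, but stored as float64 like in the question
df = pd.DataFrame({'A': [1.0], 'B': [1.0], 'C': [43285.0]})

# Cast to int first so 1.0 becomes '1' rather than '1.0', then to str
as_text = df[['A', 'B', 'C']].astype(int).astype(str)

# String Series support +, which concatenates element-wise
df['D'] = as_text['A'] + as_text['B'] + as_text['C']
print(df)
```

Note that D holds strings here; wrap it in astype(int) if a numeric result is needed.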
I need to concatenate some columns in a pandas DataFrame with "_" as separator and store the result in a new column in the same DataFrame. The problem is that I don't know in advance which and how many columns to concatenate. The labels of the columns to be concatenated are determined at run time of the program and stored in a list.
Example:
import pandas as pd
df=pd.DataFrame(data={'col.a':['a','b','c'],'col.b':['d','e','f'], 'col.c':['g','h','i']})
col.a col.b col.c
0 a d g
1 b e h
2 c f i
cols_to_concat = ['col.a','col.c']
Desired result:
col.a col.b col.c cols.concat
0 a d g a_g
1 b e h b_h
2 c f i c_i
I need a method for generating df['cols.concat'] that works for a df with any number of columns and where cols_to_concat is an arbitrary subset of df.columns.
Supposing you have a list of the column names to concatenate, you could use apply and join the values of each row:
import pandas as pd
df=pd.DataFrame(data={'col.a':['a','b','c'],
'col.b':['d','e','f'],
'col.c':['g','h','i']})
#this is the list of columns to concatenate
cols_to_cat = ['col.a','col.b','col.c']
df['concat'] = df[cols_to_cat].apply(lambda x: '_'.join(x), axis=1)
This should do the trick.
EDIT
You could concatenate any number of columns with this:
cols_to_cat = ['col.a','col.c']
df['concat'] = df[cols_to_cat].apply(lambda x: '_'.join(x), axis=1)
You could even repeat columns:
cols_to_cat = ['col.a','col.c','col.a']
df['concat'] = df[cols_to_cat].apply(lambda x: '_'.join(x), axis=1)
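One caveat: '_'.join only accepts strings, so the lambda above raises a TypeError if any selected column holds numbers. A sketch that casts first, assuming the string representation of each value is acceptable (col.n is a hypothetical numeric column added for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col.a': ['a', 'b', 'c'],
                   'col.b': ['d', 'e', 'f'],
                   'col.n': [1, 2, 3]})   # hypothetical numeric column

cols_to_cat = ['col.a', 'col.n']          # decided at run time

# astype(str) makes every value joinable; agg with axis=1 joins each row
df['concat'] = df[cols_to_cat].astype(str).agg('_'.join, axis=1)
print(df['concat'].tolist())
```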
I have a pandas dataframe like as given below
dfx = pd.DataFrame({'min_temp' :[38,36,np.nan,38,37,39],'max_temp': [41,39,39,41,43,44],
'min_hr': [89,87,85,84,82,86],'max_hr': [91,98,np.nan,94,92,96], 'min_sbp':[21,23,25,27,28,29],
'ethnicity':['A','B','C','D','E','F'],'Gender':['M','F','F','F','F','F']})
What I would like to do is
1) Identify all columns that contain min and max.
2) Find their corresponding pair. ex: min_temp and max_temp are a pair. Similarly min_hr and max_hr are a pair
3) Convert these two columns into one column and name it as rel_temp. See below for formula
rel_temp = (max_temp - min_temp)/min_temp
This is what I was trying. Do note that my real data has several thousand records and hundreds of columns like this
def myfunc(n):
    return lambda a,b : ((b-a)/a)
dfx.apply(myfunc(col for col in dfx.columns)) # didn't know how to apply string contains here
I expect my output to be like this. Please note that only min and max columns have to be transformed. Rest of the columns in dataframe should be left as is.
The idea is to create df1 and df2 with the same column names using DataFrame.filter and rename, then subtract and divide all columns with DataFrame.sub and DataFrame.div:
df1 = dfx.filter(like='max').rename(columns=lambda x: x.replace('max','rel'))
df2 = dfx.filter(like='min').rename(columns=lambda x: x.replace('min','rel'))
df = df1.sub(df2).div(df2).join(dfx.loc[:, ~dfx.columns.str.contains('min|max')])
print (df)
rel_temp rel_hr ethnicity Gender
0 0.078947 0.022472 A M
1 0.083333 0.126437 B F
2 NaN NaN C F
3 0.078947 0.119048 D F
4 0.162162 0.121951 E F
5 0.128205 0.116279 F F
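Note that the sample dfx also has min_sbp with no max_sbp partner, and a substring filter on 'min' picks that column up too. A sketch that pairs columns explicitly and leaves unpaired ones untouched, assuming the min_/max_ prefix convention from the question (the names suffixes and out are introduced here for illustration):

```python
import numpy as np
import pandas as pd

dfx = pd.DataFrame({'min_temp': [38, 36, np.nan, 38, 37, 39],
                    'max_temp': [41, 39, 39, 41, 43, 44],
                    'min_hr': [89, 87, 85, 84, 82, 86],
                    'max_hr': [91, 98, np.nan, 94, 92, 96],
                    'min_sbp': [21, 23, 25, 27, 28, 29],
                    'ethnicity': list('ABCDEF'),
                    'Gender': list('MFFFFF')})

# Suffixes that have both a min_ and a max_ column
suffixes = [c[len('min_'):] for c in dfx.columns
            if c.startswith('min_') and 'max_' + c[len('min_'):] in dfx.columns]

out = dfx.copy()
for s in suffixes:
    lo, hi = dfx['min_' + s], dfx['max_' + s]
    out['rel_' + s] = (hi - lo) / lo            # (max - min) / min
    out = out.drop(columns=['min_' + s, 'max_' + s])

print(out)
```

Because the pairs are found by name, this should scale to hundreds of such columns without listing them by hand.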
Try using:
cols = dfx.columns
con = cols[cols.str.contains('_')]
for i in con.str.split('_').str[-1].unique():
    # skip suffixes that lack a min_/max_ pair (e.g. min_sbp has no max_sbp)
    if 'min_%s' % i not in con or 'max_%s' % i not in con:
        continue
    dfx['rel_%s' % i] = (dfx['max_%s' % i] - dfx['min_%s' % i]) / dfx['min_%s' % i]
dfx = dfx.drop(con, axis=1)
print(dfx)
I have a list ['df1', 'df2'] in which I have stored some dataframes that were filtered on a few conditions. I then converted this list to a dataframe using
df = pd.DataFrame(list1)
Now df has only one column:
0
df1
df2
Sometimes it may also have:
0
df1
df2
df3
I want to concatenate all of these. My static code is
df_new = pd.concat([df1,df2],axis=1) or
df_new = pd.concat([df1,df2,df3],axis=1)
How can I make it dynamic (without specifying df1, df2 myself) so that it takes the values and concatenates them?
Collect the data frames in a list and pass that list to pd.concat:
import pandas as pd
lists = [[1,2,3],[4,5,6]]
arr = []
for l in lists:
    new_df = pd.DataFrame(l)
    arr.append(new_df)
df = pd.concat(arr,axis=1)
df
Result :
0 0
0 1 4
1 2 5
2 3 6
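For the question as asked, the entries 'df1' and 'df2' are only strings; pd.concat needs the DataFrame objects themselves. One option is to keep the frames in a dict (or list) when they are created and pass its values to pd.concat. A sketch, where frames and its contents are hypothetical stand-ins for the filtered DataFrames:

```python
import pandas as pd

# Hypothetical stand-ins for the filtered DataFrames
frames = {
    'df1': pd.DataFrame({'x': [1, 2, 3]}),
    'df2': pd.DataFrame({'y': [4, 5, 6]}),
}

# Works for any number of entries; no names are hard-coded
df_new = pd.concat(list(frames.values()), axis=1)
print(df_new)
```

Adding a third entry to frames requires no change to the concat line.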
I have a dataframe (df1) of 5 columns (a,b,c,d,e) with 6 rows and another dataframe (df2) with 2 columns (a,z) with 20000 rows.
How do I map and merge those dataframes on the value of column 'a'?
That is, each row of df1 (5 columns) should be matched to the row of df2 (2 columns) with the same 'a' value, returning a new df which has 6 columns (5 from df1 and 1 mapped from df2) and 6 rows.
By using pd.concat:
import pandas as pd
import numpy as np
columns_df1 = ['a','b','c','d']
columns_df2 = ['a','z']
data_df1 = [['abc','def','ghi','xyz'],['abc2','def2','ghi2','xyz2'],['abc3','def3','ghi3','xyz3'],['abc4','def4','ghi4','xyz4']]
data_df2 = [['a','z'],['a2','z2']]
df_1 = pd.DataFrame(data_df1, columns=columns_df1)
df_2 = pd.DataFrame(data_df2, columns=columns_df2)
print(df_1)
print(df_2)
frames = [df_1, df_2]
print (pd.concat(frames))
OUTPUT:
Edit:
To replace NaN values you could use pandas.DataFrame.fillna:
print (pd.concat(frames).fillna("NULL"))
Replace "NULL" with anything you want, e.g. 0
OUTPUT:
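Note that pd.concat stacks the two frames rather than matching rows on a. If the goal is to look up z for each a in df1, a left merge is one way to do that. A sketch with hypothetical data, assuming a is unique in df2 so that the row count of df1 is preserved:

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['k1', 'k2', 'k3'],
                    'b': [1, 2, 3],
                    'c': [4, 5, 6]})
df2 = pd.DataFrame({'a': ['k3', 'k1', 'k9'],   # lookup table, any length
                    'z': ['z3', 'z1', 'z9']})

# how='left' keeps every row of df1; unmatched keys get NaN in z
merged = df1.merge(df2, on='a', how='left')
print(merged)
```

The result has the columns of df1 plus z, with one row per row of df1.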
I have 2 dataframes, df1 and df2, and want to do the following, storing results in df3:
for each row in df1:
    for each row in df2:
        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results
        for each cell(column) in df1:
            for the cell in df2 whose column name is the same as for the cell in df1:
                compare the cells (using some comparing function func(a,b) ) and,
                depending on the result of the comparison, write result into the
                appropriate column of the "df1-1, df2-1" row of df3)
For example, something like:
df1
A B C D
foo bar foobar 7
gee whiz herp 10
df2
A B C D
zoo car foobar 8
df3
df1-df2 A B C D
foo-zoo func(foo,zoo) func(bar,car) func(foobar,foobar) func(7,8)
gee-zoo func(gee,zoo) func(whiz,car) func(herp,foobar) func(10,8)
I've started with this:
for r1 in df1.iterrows():
    for r2 in df2.iterrows():
        for c1 in r1:
            for c2 in r2:
but am not sure what to do with it, and would appreciate some help.
So to continue the discussion in the comments, you can use vectorization, which is one of the selling points of a library like pandas or numpy. Ideally, you shouldn't ever be calling iterrows(). To be a little more explicit with my suggestion:
# with df1 and df2 provided as above, an example
df3 = df1['A'] * 3 + df2['A']
# recall that df2 only has the one row, so index alignment leaves a NaN for the unmatched row of df1
df3
0 foofoofoozoo
1 NaN
Name: A, dtype: object
# more generally
# we know that df1 and df2 share column names, so we can initialize df3 with those names
df3 = pd.DataFrame(columns=df1.columns)
for colName in df1:
    df3[colName] = func(df1[colName], df2[colName])
Now, you could even have different functions applied to different columns by, say, creating lambda functions and then zipping them with the column names:
# some example functions
colAFunc = lambda x, y: x + y
colBFunc = lambda x, y: x - y
....
columnFunctions = [colAFunc, colBFunc, ...]
# initialize df3 as above
df3 = pd.DataFrame(columns=df1.columns)
for func, colName in zip(columnFunctions, df1.columns):
    df3[colName] = func(df1[colName], df2[colName])
The only "gotcha" that comes to mind is that you need to be sure that your function is applicable to the data in your columns. For instance, if you were to do something like df1['A'] - df2['A'] (with df1, df2 as you have provided), that would raise a TypeError, as the subtraction of two strings is undefined. Just something to be aware of.
Edit, re: your comment: That is doable as well. Iterate over the dfX.columns that is larger, so you don't run into a KeyError, and throw an if statement in there:
# all the other jazz
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']]
# so iterate over df2 columns
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])
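The column-wise loops above compare rows that share an index label. The question asks for every row of df1 against every row of df2, which is a cross join. A sketch using merge(how='cross') (available in pandas 1.2 and later); func here is a hypothetical element-wise equality check standing in for the asker's comparison function, and pairs is a name introduced for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['foo', 'gee'], 'B': ['bar', 'whiz'],
                    'C': ['foobar', 'herp'], 'D': [7, 10]})
df2 = pd.DataFrame({'A': ['zoo'], 'B': ['car'],
                    'C': ['foobar'], 'D': [8]})

def func(a, b):
    # Hypothetical comparison: element-wise equality of two Series
    return a == b

# Every (df1 row, df2 row) pair; shared column names get suffixes
pairs = df1.merge(df2, how='cross', suffixes=('_1', '_2'))

df3 = pd.DataFrame({col: func(pairs[col + '_1'], pairs[col + '_2'])
                    for col in df1.columns})
df3.index = pairs['A_1'] + '-' + pairs['A_2']   # e.g. 'foo-zoo'
print(df3)
```

The cross join materializes len(df1) * len(df2) rows, so this is only practical when that product fits in memory.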