join two columns of different dataframes into another dataframe - python

I have two dataframes:
one:
[A]
1
2
3
two:
[B]
7
6
9
How can I join two columns of different dataframes into another dataframe?
Like this:
[A][B]
1 7
2 6
3 9
I already tried that:
result = A
result = result.rename(columns={'employee_id': 'A'})
result['B'] = pd.Series(B['employee_id'])
and
B_column = B["employee_id"]
result = pd.concat([result,B_column], axis = 1)
result
but I still couldn't get it to work.

import pandas as pd
df1 = pd.DataFrame(data = {"A" : range(1, 4)})
df2 = pd.DataFrame(data = {"B" : range(7, 10)})
df = df1.join(df2)
Gives
   A  B
0  1  7
1  2  8
2  3  9

While there are various ways to accomplish this, one would be to just merge them on the index.
Something like this:
dfResult = dfA.merge(dfB, left_index=True, right_index=True, how='inner')
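A runnable sketch with the frames from the question (the names df1/df2 are assumed; left_index/right_index avoids the extra key_0 column that passing the indexes as join keys would create):
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2, 3]})
df2 = pd.DataFrame({"B": [7, 6, 9]})
# Rows are aligned by position through the shared RangeIndex.
result = df1.merge(df2, left_index=True, right_index=True, how="inner")
print(result)
#    A  B
# 0  1  7
# 1  2  6
# 2  3  9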

Related

Pandas apply row-wise a function and create multiple new columns

What is the best way to apply a row-wise function and create multiple new columns?
I have two dataframes and working code, but it's most likely not optimal.
df1 (dataframe has thousands of rows and xx number of columns)
sic  data1       data2       data3       data4       data5
5    0.90783598  0.84722083  0.47149924  0.98724123  0.50654476
6    0.53442684  0.59730371  0.92486887  0.61531646  0.62784041
3    0.56806423  0.09619383  0.33846097  0.71878313  0.96316724
8    0.86933042  0.64965755  0.94549745  0.08866519  0.92156389
12   0.651328    0.37193774  0.9679044   0.36898991  0.15161838
6    0.24555531  0.50195983  0.79114578  0.9290596   0.10672607
df2 (column header maps to the sic-code in df1. There are in total 12 sic-codes and the dataframe is thousands of rows long)
        1           2           3           4           5           6           7           8           9           10          11          12
c_bar   0.4955329   0.92970292  0.68049726  0.91325006  0.55578465  0.78056519  0.53954711  0.90335326  0.93986402  0.0204794   0.51575764  0.61144255
a1_bar  0.75781444  0.81052669  0.99910449  0.62181902  0.11797144  0.40031316  0.08561665  0.35296894  0.14445697  0.93799762  0.80641802  0.31379671
a2_bar  0.41432552  0.36313911  0.13091618  0.39251953  0.66249636  0.31221897  0.15988528  0.1620938   0.55143589  0.66571044  0.68198944  0.23806947
a3_bar  0.38918855  0.83689178  0.15838139  0.39943204  0.48615188  0.06299899  0.86343819  0.47975619  0.05300611  0.15080875  0.73088725  0.3500239
a4_bar  0.47201384  0.90874121  0.50417142  0.70047698  0.24820601  0.34302454  0.4650635   0.0992668   0.55142391  0.82947194  0.28251699  0.53170308
I achieved the result I need with the following code:
ind_list = np.arange(1, 13)  # Create list of industries

def c_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['const', i]

def a1_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a1bar', i]

def a2_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a2bar', i]

def a3_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a3bar', i]

def a4_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a4bar', i]

mlev_merge['c_bar'] = mlev_merge.apply(c_bar, axis=1, result_type='expand')
mlev_merge['a1_bar'] = mlev_merge.apply(a1_bar, axis=1, result_type='expand')
mlev_merge['a2_bar'] = mlev_merge.apply(a2_bar, axis=1, result_type='expand')
mlev_merge['a3_bar'] = mlev_merge.apply(a3_bar, axis=1, result_type='expand')
mlev_merge['a4_bar'] = mlev_merge.apply(a4_bar, axis=1, result_type='expand')
The output is something like this:
sic  data1       data2       data3       data4       c_bar       a1_bar      a2_bar      a3_bar      a4_bar
5    0.10316948  0.61408639  0.04042675  0.79255749  0.56357931  0.42920472  0.20701581  0.67639811  0.37778029
6    0.5730904   0.16753145  0.27835136  0.00178992  0.51793793  0.06772307  0.15084885  0.12451806  0.33114948
3    0.87710893  0.66834187  0.14286608  0.12609769  0.75873957  0.72586804  0.6081763   0.14598001  0.21557266
8    0.24565579  0.56195558  0.93316676  0.20988936  0.67404545  0.65221594  0.79758557  0.67093021  0.33400764
12   0.79703344  0.61066111  0.94602909  0.56218703  0.92384307  0.30836159  0.72521994  0.00795362  0.76348227
6    0.86604791  0.28454782  0.97229172  0.21853932  0.75650652  0.40788056  0.53233553  0.60326386  0.27399405
Cell values in the example are randomly generated, but the point is to map based on sic-codes and add rows from df2 as new columns into df1.
To do this, you need to:
1. Transpose df2 so that its columns are correct for concatenation
2. Index it with the df1["sic"] column to get the correct rows
3. Reset the index of the obtained rows of df2 using .reset_index(drop=True), so that the dataframes can be concatenated correctly. (This replaces the current index, e.g. 5, 6, 3, 8, 12, 6, with a new one, e.g. 0, 1, 2, 3, 4, 5, while keeping the actual values the same. This is so that pandas doesn't get confused while concatenating them.)
4. Concatenate the two dataframes
Note: I used a method based off of this to read in the dataframe, and it assumed that the columns of df2 were strings but the values of the sic column of df1 were ints. Therefore I used .astype(str) to get step 2 working. If this is not actually the case, you may need to remove the .astype(str).
Here is the single line of code to do these things:
merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)
Here is the full code I used:
from io import StringIO
import pandas as pd
df1 = pd.read_csv(StringIO("""
sic data1 data2 data3 data4 data5
5 0.90783598 0.84722083 0.47149924 0.98724123 0.50654476
6 0.53442684 0.59730371 0.92486887 0.61531646 0.62784041
3 0.56806423 0.09619383 0.33846097 0.71878313 0.96316724
8 0.86933042 0.64965755 0.94549745 0.08866519 0.92156389
12 0.651328 0.37193774 0.9679044 0.36898991 0.15161838
6 0.24555531 0.50195983 0.79114578 0.9290596 0.10672607
"""), sep="\t")
df2 = pd.read_csv(StringIO("""
1 2 3 4 5 6 7 8 9 10 11 12
c_bar 0.4955329 0.92970292 0.68049726 0.91325006 0.55578465 0.78056519 0.53954711 0.90335326 0.93986402 0.0204794 0.51575764 0.61144255
a1_bar 0.75781444 0.81052669 0.99910449 0.62181902 0.11797144 0.40031316 0.08561665 0.35296894 0.14445697 0.93799762 0.80641802 0.31379671
a2_bar 0.41432552 0.36313911 0.13091618 0.39251953 0.66249636 0.31221897 0.15988528 0.1620938 0.55143589 0.66571044 0.68198944 0.23806947
a3_bar 0.38918855 0.83689178 0.15838139 0.39943204 0.48615188 0.06299899 0.86343819 0.47975619 0.05300611 0.15080875 0.73088725 0.3500239
a4_bar 0.47201384 0.90874121 0.50417142 0.70047698 0.24820601 0.34302454 0.4650635 0.0992668 0.55142391 0.82947194 0.28251699 0.53170308
"""), sep="\t", index_col=[0])
merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)
print(merged)
which produces the output:
sic data1 data2 data3 ... a1_bar a2_bar a3_bar a4_bar
0 5 0.907836 0.847221 0.471499 ... 0.117971 0.662496 0.486152 0.248206
1 6 0.534427 0.597304 0.924869 ... 0.400313 0.312219 0.062999 0.343025
2 3 0.568064 0.096194 0.338461 ... 0.999104 0.130916 0.158381 0.504171
3 8 0.869330 0.649658 0.945497 ... 0.352969 0.162094 0.479756 0.099267
4 12 0.651328 0.371938 0.967904 ... 0.313797 0.238069 0.350024 0.531703
5 6 0.245555 0.501960 0.791146 ... 0.400313 0.312219 0.062999 0.343025
[6 rows x 11 columns]
Try transposing df2 and applying transformations to it.
Transposing a data frame means converting the rows into columns of your data frame.
df2_tr = df2.T.apply(mapFunc, axis=0)  # mapFunc stands in for whatever per-column transformation you need
Then you can concatenate the transformed columns of df2 with the columns of df1, using df1 = pd.concat([df1, df2_tr], axis=1).
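Spelled out as a minimal runnable sketch (mapFunc here is just a placeholder; swap in whatever transformation you actually need):
import pandas as pd
df1 = pd.DataFrame({"sic": [5, 6, 3]})
df2 = pd.DataFrame([[0.11, 0.22, 0.33, 0.44, 0.55, 0.66]],
                   index=["c_bar"], columns=[1, 2, 3, 4, 5, 6])
def mapFunc(col):
    return col  # placeholder per-column transformation
df2_tr = df2.T.apply(mapFunc, axis=0)                   # rows of df2 become columns
picked = df2_tr.loc[df1["sic"]].reset_index(drop=True)  # one row of values per sic code
print(pd.concat([df1, picked], axis=1))
#    sic  c_bar
# 0    5   0.55
# 1    6   0.66
# 2    3   0.33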

Python : Remove all data in a column of a dataframe and keep the last value in the first row

Let's say that I have a simple Dataframe.
import pandas as pd
data1 = [12,34,'fsdf',678,'','','dfs','','']
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4
5
6 dfs
7
8
I want to delete all the data in the column except the last non-empty value, which I want to keep in the first row. The column can have thousands of rows. So I would like this result:
Data
0 dfs
1
2
3
4
5
6
7
8
And I have to keep the shape of this dataframe, so not removing rows.
What are the simplest functions to do that efficiently?
Thank you
Get the index of the last non-empty string value and assign that value to the first row of the column:
s = df1.loc[df1['Data'].iloc[::-1].ne('').idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
If empty strings are missing values:
import numpy as np
data1 = [12,34,'fsdf',678,np.nan,np.nan,'dfs',np.nan,np.nan]
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4 NaN
5 NaN
6 dfs
7 NaN
8 NaN
s = df1.loc[df1['Data'].iloc[::-1].notna().idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
A simple pandas condition check like this can help:
df1['Data'] = [df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]] + [''] * (len(df1) - 1)
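Split into named steps, the same logic reads (a sketch, assuming the blanks really are empty strings rather than NaN):
last_val = df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]  # last non-empty value in the column
df1['Data'] = [last_val] + [''] * (len(df1) - 1)         # keep it in row 0, blank out the rest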
You can replace '' with NaN using df.replace, then use df.last_valid_index:
import numpy as np
val = df1.loc[df1.replace('', np.nan).last_valid_index(), 'Data']
# Below two lines taken from #jezrael's answer
df1.loc[0, 'Data'] = val
df1.loc[1:, 'Data'] = ''
Or
You can use np.full with fill_value set to np.nan here.
val = df1.loc[df1.replace("", np.nan).last_valid_index(), "Data"]
df1 = pd.DataFrame(np.full(df1.shape, np.nan),
                   index=df1.index,
                   columns=df1.columns)
df1.loc[0, "Data"] = val

Pandas - Interleave / Zip two DataFrames by row

Suppose I have two dataframes:
>> df1
0 1 2
0 a b c
1 d e f
>> df2
0 1 2
0 A B C
1 D E F
How can I interleave the rows? i.e. get this:
>> interleaved_df
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
(Note my real DFs have identical columns, but not the same number of rows).
What I've tried
inspired by this question (very similar, but asks about columns):
import pandas as pd
from itertools import chain, zip_longest
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2])
new_index = chain.from_iterable(zip_longest(df1.index, df2.index))
# new_index now holds the interleaved row indices
interleaved_df = concat_df.reindex(new_index)
ValueError: cannot reindex from a duplicate axis
The last call fails because df1 and df2 have some identical index values (which is also the case with my real DFs).
Any ideas?
You can sort the index after concatenating and then reset the index, i.e.:
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2]).sort_index().reset_index(drop=True)
Output :
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
EDIT (OmerB): To keep the interleaved order regardless of the index values:
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']]).reset_index()
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']]).reset_index()
concat_df = pd.concat([df1,df2]).sort_index().set_index('index')
Use toolz.interleave
In [1024]: from toolz import interleave
In [1025]: pd.DataFrame(interleave([df1.values, df2.values]))
Out[1025]:
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F
Here's an extension of #Bharath's answer that can be applied to DataFrames with user-defined indexes without losing them, using pd.MultiIndex.
Define Dataframes with the full set of column/ index labels and names:
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df1.columns.name = 'cols'
df1.index.name = 'rows'
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df2.columns.name = 'cols'
df2.index.name = 'rows'
Add DataFrame ID to MultiIndex:
df1.index = pd.MultiIndex.from_product([[1], df1.index], names=["df_id", df1.index.name])
df2.index = pd.MultiIndex.from_product([[2], df2.index], names=["df_id", df2.index.name])
Then use #Bharath's concat() and sort_index():
data = pd.concat([df1, df2], axis=0, sort=True)
data.sort_index(axis=0, level=data.index.names[::-1], inplace=True)
Output:
cols col_a col_b col_c
df_id rows
1 one a b c
2 one A B C
1 two d e f
2 two D E F
You could also preallocate a new DataFrame, and then fill it using a slice.
import numpy as np
import pandas as pd

def interleave(dfs):
    # Preallocate one empty column per dtype, then fill every
    # len(dfs)-th row from each input frame.
    data = np.transpose(np.array([np.empty(dfs[0].shape[0] * len(dfs), dtype=dt)
                                  for dt in dfs[0].dtypes]))
    out = pd.DataFrame(data, columns=dfs[0].columns)
    for ix, df in enumerate(dfs):
        out.iloc[ix::len(dfs), :] = df.values
    return out
The preallocation code is taken from this question.
While there's a chance it could outperform the index method for certain data types / sizes, it won't behave gracefully if the DataFrames have different sizes.
Note - for ~200000 rows with 20 columns of mixed string, integer and floating types, the index method is around 5x faster.
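A quick usage check with the question's frames (output assumes the function above):
print(interleave([df1, df2]))
#    0  1  2
# 0  a  b  c
# 1  A  B  C
# 2  d  e  f
# 3  D  E  F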
You can try it this way:
In [31]: import pandas as pd
...: from itertools import chain, zip_longest
...:
...: df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
...: df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
In [32]: concat_df = pd.concat([df1,df2]).sort_index()
...:
In [33]: interleaved_df = concat_df.reset_index(drop=1)
In [34]: interleaved_df
Out[34]:
0 1 2
0 a b c
1 A B C
2 d e f
3 D E F

How to factorize two data frames simultaneously with python-pandas?

I have two data frames: one is user-item-rating and the other is side information about the items:
#df1
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM BDASK99000 1.0
#df2
B000NWJTKW ....
BDASK99000 ....
Now I'd like to map the item and user names to integer IDs. I know there is a way to do it with factorize:
df.apply(lambda x: pd.factorize(x)[0] + 1)
But I'd like to ensure that the integer IDs of the items are consistent across the two data frames. So the resulting data frames would be:
#df1
1 1 5.0
2 1 4.0
3 2 1.0
#df2
1 ...
2 ...
Do you know how to ensure that? Thanks in advance!
Concatenate the common column(s), and apply pd.factorize (or pd.Categorical) on that:
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
For example,
import pandas as pd
df1 = pd.DataFrame(
[('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])
df2 = pd.DataFrame(
[('B000NWJTKW', 10),
('BDASK99000', 20)], columns=['item', 'extra'])
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
codes, uniques = pd.factorize(df1['user'])
df1['user'] = codes + 1
print(df1)
print(df2)
yields
# df1
user item rating
0 1 1 5
1 2 1 4
2 3 2 1
# df2
item extra
0 1 10
1 2 20
Another way to work around the problem (if you have enough memory) would be to merge the two DataFrames: df3 = pd.merge(df1, df2, on='item', how='outer'), and then factorize df3['item']:
df3 = pd.merge(df1, df2, on='item', how='outer')
for col in ['item', 'user']:
    df3[col] = pd.factorize(df3[col])[0] + 1
print(df3)
yields
user item rating extra
0 1 1 5 10
1 2 1 4 10
2 3 2 1 20
Another option could be to apply factorize on the first dataframe, and then apply the resulting mapping to the second dataframe:
# create factorization:
idx, levels = pd.factorize(df1['item'])
# replace the item codes in the first dataframe with the new index value
df1['item'] = idx
# create a dictionary mapping the original code to the new index value
d = {code: i for i, code in enumerate(levels)}
# apply this mapping to the second dataframe
df2['item'] = df2.item.apply(lambda code: d[code])
This approach will only work if every level is present in both dataframes.
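If some items appear in only one of the frames, a hedged workaround is to build one shared category list and encode both frames against it (a sketch using pd.Categorical, which the first answer also mentions):
import pandas as pd
# One ordered list of all item codes seen in either frame.
cats = pd.concat([df1['item'], df2['item']]).unique()
df1['item'] = pd.Categorical(df1['item'], categories=cats).codes + 1
df2['item'] = pd.Categorical(df2['item'], categories=cats).codes + 1
Any value not listed in categories would get code -1 (so 0 after the +1), which is why the combined list matters.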

Which data structure in Python to use to replace Excel 2-dim array of strings/amounts?

I am using xlwings to replace my VB code with Python but since I am not an experienced programmer I was wondering - which data structure to use?
Data is in .xls in 2 columns and has the following form; In VB I lift this into a basic two dimensional array arrCampaignsAmounts(i, j):
Col 1: 'market_channel_campaign_product'; Col 2: '2334.43 $'
Then I concatenate words from 4 columns on another sheet into a similar 'string', into another 2-dim array arrStrings(i, j):
'Austria_Facebook_Winter_Active vacation'; 'rowNumber'
Finally, I search for strings from the first array within strings from the second array; if found, I write the amounts into the rowNumber from arrStrings(i, 2).
Would I use 4 lists for this task?
Two dictionaries?
Something else?
Definitely use pandas DataFrames. Here are references and very simple DataFrame examples.
#reference: http://pandas.pydata.org/pandas-docs/stable/10min.html
#reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html.
import numpy as np
import pandas as pd

def df_dupes(df_in):
    '''
    Returns [object, count] pairs for each unique item in the dataframe.
    '''
    if isinstance(df_in, list) or isinstance(df_in, tuple):
        df_in = pd.DataFrame(df_in)
    return df_in.groupby(df_in.columns.tolist(), as_index=False).size()

def df_filter_example(df):
    '''
    In [96]: df
    Out[96]:
       A  B  C  D
    0  1  4  9  1
    1  4  5  0  2
    2  5  5  1  0
    3  1  3  9  6
    '''
    df = pd.DataFrame([[1,4,9,1],[4,5,0,2],[5,5,1,0],[1,3,9,6]], columns=['A','B','C','D'])
    return df[(df.A == 1) & (df.D == 6)]

def df_compare(df1, df2, compare_col_list, join_type):
    '''
    df_compare compares 2 dataframes.
    Returns left, right, inner or outer join.
    df1 is the first/left dataframe
    df2 is the second/right dataframe
    compare_col_list is a list of column names that must match between df1 and df2
    join_type = 'inner', 'left', 'right' or 'outer'
    '''
    return pd.merge(df1, df2, how=join_type, on=compare_col_list)

def df_compare_examples():
    df1 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['c1', 'c2', 'c3'])
    '''    c1  c2  c3
    0   1   2   3
    1   4   5   6
    2   7   8   9 '''
    df2 = pd.DataFrame([[4,5,6],[7,8,9],[10,11,12]], columns=['c1', 'c2', 'c3'])
    '''    c1  c2  c3
    0   4   5   6
    1   7   8   9
    2  10  11  12 '''
    # One can see that df1 contains 1 row ([1,2,3]) not in df2 and
    # df2 contains 1 row ([10,11,12]) not in df1.
    # Assume c1 is not relevant to the comparison, so we merge on cols c2 and c3.
    df_merge = pd.merge(df1, df2, how='outer', on=['c2','c3'])
    print(df_merge)
    '''   c1_x  c2  c3  c1_y
    0     1   2   3   NaN
    1     4   5   6     4
    2     7   8   9     7
    3   NaN  11  12    10 '''
    # One can see that columns c2 and c3 are returned. We also received
    # columns c1_x and c1_y, where c1_x is the value of column c1
    # in the first dataframe and c1_y is the value of c1 in the second
    # dataframe. As such,
    # any row that contains c1_y = NaN is a row from df1 not in df2 &
    # any row that contains c1_x = NaN is a row from df2 not in df1.
    df1_unique = pd.merge(df1, df2, how='left', on=['c2','c3'])
    df1_unique = df1_unique[df1_unique['c1_y'].isnull()]
    print(df1_unique)
    df2_unique = pd.merge(df1, df2, how='right', on=['c2','c3'])
    print(df2_unique)
    df_common = pd.merge(df1, df2, how='inner', on=['c2','c3'])
    print(df_common)

def delete_column_example():
    print('create df')
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a','b','c'])
    print('drop (delete/remove) column')
    col_name = 'b'
    df.drop(col_name, axis=1, inplace=True)  # or df = df.drop(col_name, axis=1)

def delete_rows_example():
    print('\n\ncreate df')
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['col_1','col_2','col_3'])
    print(df)
    print('\n\nappend rows')
    # df.append was removed in pandas 2.0; pd.concat does the same job here
    df = pd.concat([df, pd.DataFrame([[11,22,33]], columns=['col_1','col_2','col_3'])])
    print(df)
    print('\n\ndelete rows where (based on) column value')
    df = df[df.col_1 == 4]
    print(df)
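For the specific lookup described in the question, a merge on the concatenated key is the pandas counterpart of the nested VB string search. A minimal sketch (all column names here are assumptions, not taken from the workbook):
import pandas as pd
# Sheet 1: 'market_channel_campaign_product' key plus an amount column.
campaigns = pd.DataFrame({
    "key": ["Austria_Facebook_Winter_Active vacation"],
    "amount": ["2334.43 $"],
})
# Sheet 2: the four columns that get concatenated into the same kind of key.
rows = pd.DataFrame({
    "market": ["Austria"], "channel": ["Facebook"],
    "campaign": ["Winter"], "product": ["Active vacation"],
})
rows["key"] = rows[["market", "channel", "campaign", "product"]].agg("_".join, axis=1)
# The merge plays the role of the row-number bookkeeping: each row gets its amount.
matched = rows.merge(campaigns, on="key", how="left")
print(matched[["key", "amount"]])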
