Mapping to dataframes based on one column - python

I have a dataframe (df1) with 5 columns (a, b, c, d, e) and 6 rows, and another dataframe (df2) with 2 columns (a, z) and 20000 rows.
How do I map and merge these dataframes on the value of column 'a'?
That is, each row of df1 should look up its 'a' value in df2 and pick up the matching 'z', giving a new dataframe with 6 columns (5 from df1 plus the mapped column from df2) and 6 rows.

By using pd.concat:
import pandas as pd
import numpy as np
columns_df1 = ['a','b','c','d']
columns_df2 = ['a','z']
data_df1 = [['abc','def','ghi','xyz'],['abc2','def2','ghi2','xyz2'],['abc3','def3','ghi3','xyz3'],['abc4','def4','ghi4','xyz4']]
data_df2 = [['a','z'],['a2','z2']]
df_1 = pd.DataFrame(data_df1, columns=columns_df1)
df_2 = pd.DataFrame(data_df2, columns=columns_df2)
print(df_1)
print(df_2)
frames = [df_1, df_2]
print (pd.concat(frames))
Edit:
To replace NaN values you could use pandas.DataFrame.fillna:
print (pd.concat(frames).fillna("NULL"))
Replace "NULL" with anything you want, e.g. 0.
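Note that pd.concat stacks the two frames rather than matching rows on 'a'. If the goal is a lookup on 'a', a left merge may be closer to what the question describes. The following is only a minimal sketch: the sample values are made up for illustration, and it assumes each 'a' value appears at most once in df2 (duplicates would multiply the matching df1 rows).

```python
import pandas as pd

# Hypothetical sample data standing in for the question's df1 and df2.
df1 = pd.DataFrame({
    "a": ["k1", "k2", "k3"],
    "b": [1, 2, 3],
    "c": [4, 5, 6],
})
df2 = pd.DataFrame({
    "a": ["k3", "k1", "k9"],
    "z": ["z3", "z1", "z9"],
})

# A left merge keeps every row of df1 (in its original order) and looks up
# z by the value in 'a'. Keys missing from df2 (here 'k2') get NaN in z.
merged = df1.merge(df2, on="a", how="left")
print(merged)
```

The result has the columns of df1 plus 'z', and exactly as many rows as df1 as long as the keys in df2 are unique.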

Related

Sum all combinations of 2 columns from 2 dataframes

I have 2 dataframes, df1 and df2 (same index and number of rows), and I would like to create a new dataframe whose columns are the sums of all combinations of one column from df1 and one column from df2. Example:
input :
import pandas as pd
df1 = pd.DataFrame([[10,20]])
df2 = pd.DataFrame([[1,2]])
output :
import pandas as pd
df3 = pd.DataFrame([[11,12,21,22]])
Use MultiIndex.from_product for all combinations and sum DataFrames with repeated values by DataFrame.reindex:
mux = pd.MultiIndex.from_product([df1.columns, df2.columns])
df = df1.reindex(mux, level=0, axis=1) + df2.reindex(mux, level=1, axis=1)
df.columns = range(len(df.columns))
IIUC you can do this with numpy.
>>> import numpy as np
>>> n = df1.shape[1]
>>> pd.DataFrame(df1.values.repeat(n, axis=1) + np.tile(df2.values, n))
    0   1   2   3
0  11  12  21  22
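If the two frames have several rows, or a different number of columns, NumPy broadcasting is one way to get the same pairwise sums. This is a sketch under the assumption that both frames share the same row order; the sample values below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: 2 rows, df1 with 2 columns and df2 with 3 columns.
df1 = pd.DataFrame([[10, 20], [30, 40]])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

a = df1.to_numpy()  # shape (rows, n1)
b = df2.to_numpy()  # shape (rows, n2)

# a[:, :, None] has shape (rows, n1, 1); b[:, None, :] has shape (rows, 1, n2).
# Broadcasting adds every df1 column to every df2 column within each row,
# and reshape flattens the (n1, n2) grid into n1 * n2 columns.
sums = (a[:, :, None] + b[:, None, :]).reshape(len(df1), -1)
df3 = pd.DataFrame(sums, index=df1.index)
print(df3)
```

The columns come out ordered as (df1 col 0 + each df2 col), then (df1 col 1 + each df2 col), matching the order in the question's example.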

concat by taking the values from column

I have a list ['df1', 'df2'] in which I have stored some dataframes that were filtered on a few conditions. I then converted this list to a dataframe using
df = pd.DataFrame(list1)
Now df has only one column:
0
df1
df2
Sometimes it may also have:
0
df1
df2
df3
I want to concatenate all of these. My static code is either
df_new = pd.concat([df1,df2],axis=1)
or
df_new = pd.concat([df1,df2,df3],axis=1)
How can I make this dynamic (without naming df1, df2 explicitly) so that it takes the values from the column and concatenates them?
Use a list to collect the dataframes, then concatenate them:
import pandas as pd
lists = [[1,2,3],[4,5,6]]
arr = []
for l in lists:
    new_df = pd.DataFrame(l)
    arr.append(new_df)
df = pd.concat(arr,axis=1)
df
Result:
   0  0
0  1  4
1  2  5
2  3  6
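If what you actually hold is a list of names such as 'df1' and 'df2', one option is to keep the dataframes in a dict keyed by name and look them up when concatenating. The dict name frames and the sample columns below are assumptions made for this sketch, not something from the question.

```python
import pandas as pd

# Hypothetical registry: names mapped to the already-filtered DataFrames.
frames = {
    "df1": pd.DataFrame({"x": [1, 2, 3]}),
    "df2": pd.DataFrame({"y": [4, 5, 6]}),
    "df3": pd.DataFrame({"z": [7, 8, 9]}),
}

# The list of names can change at run time (two entries, three, ...).
names = ["df1", "df3"]

# Look up each name and concatenate side by side.
df_new = pd.concat([frames[name] for name in names], axis=1)
print(df_new)
```

This avoids resolving variable names from strings; the list of names alone decides which frames are concatenated.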

Pandas merge on part of two columns

I have two dataframes with a common column called 'upc' as such:
df1:
upc
23456793749
78907809834
35894796324
67382808404
93743008374
df2:
upc
4567937
9078098
8947963
3828084
7430083
Notice that the df2 'upc' values are the innermost 7 digits of the df1 'upc' values.
Note that both df1 and df2 have other columns not shown above.
What I want to do is an inner merge on 'upc', but matching only on the innermost 7 digits. How can I achieve this?
1) Create both dataframes and convert to string type.
2) pd.merge the two frames, but using the left_on keyword to access the inner 7 characters of your 'upc' series
import pandas as pd
df1 = pd.DataFrame(data=[
    23456793749,
    78907809834,
    35894796324,
    67382808404,
    93743008374,], columns=['upc1'])
df1 = df1.astype(str)
df2 = pd.DataFrame(data=[
    4567937,
    9078098,
    8947963,
    3828084,
    7430083,], columns=['upc2'])
df2 = df2.astype(str)
# slice off the first two and last two characters of upc1 to get the merge key
pd.merge(df1, df2, left_on=df1['upc1'].astype(str).str[2:-2], right_on='upc2', how='inner')
Out[5]:
          upc1     upc2
0  23456793749  4567937
1  78907809834  9078098
2  35894796324  8947963
3  67382808404  3828084
4  93743008374  7430083
Using str.extract, match each item in df1 against the values in df2, then use the result as the key to merge with df2:
df1['keyfordf2']=df1.astype(str).upc.str.extract(r'({})'.format('|'.join(df2.upc.astype(str).tolist())),expand=True).fillna(False)
df1.merge(df2.astype(str),left_on='keyfordf2',right_on='upc')
Out[273]:
         upc_x  keyfordf2    upc_y
0  23456793749    4567937  4567937
1  78907809834    9078098  9078098
2  35894796324    8947963  8947963
3  67382808404    3828084  3828084
4  93743008374    7430083  7430083
You could make a new column in df1 and merge on that.
import pandas as pd
df1= pd.DataFrame({'upc': [ 23456793749, 78907809834, 35894796324, 67382808404, 93743008374]})
df2= pd.DataFrame({'upc': [ 4567937, 9078098, 8947963, 3828084, 7430083]})
df1['upc_old'] = df1['upc'] #in case you still need the old (longer) upc column
df1['upc'] = df1['upc'].astype(str).str[2:-2].astype(int)
merged_df = pd.merge(df1, df2, on='upc')
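A variation on the same idea, in case it helps: keep the short codes as strings (so a code that starts with 0 would not lose its leading zero) and build the key in a temporary column. The column name upc_inner and the price column are made up for this sketch, and the slice [2:9] assumes the 7 digits always start at the third character.

```python
import pandas as pd

df1 = pd.DataFrame({"upc": [23456793749, 78907809834]})
# Hypothetical extra column 'price' to show that other df2 columns carry over.
df2 = pd.DataFrame({"upc": ["4567937", "9078098"], "price": [1.5, 2.5]})

# Temporary key: characters 2..8 (seven characters) of the long code, as text.
left = df1.assign(upc_inner=df1["upc"].astype(str).str[2:9])

# Inner merge on the temporary key; df2's 'upc' is renamed to 'upc_short'.
merged = left.merge(df2, left_on="upc_inner", right_on="upc",
                    how="inner", suffixes=("", "_short"))
print(merged)
```

Because assign returns a new frame, df1 itself is left unchanged.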

Concat dataframes on different columns

I have 3 different csv files and I want to concatenate their values. The only condition is that the data from the first csv must go in column A of the new csv, the second in column B and the third in column C. All csv files have the same number of rows.
I also need to change the three headers to ['año_pasado','mes_pasado','este_mes'].
import pandas as pd
df = pd.read_csv('año_pasado_subastas2.csv', sep=',')
df1 = pd.read_csv('mes_pasado_subastas2.csv', sep=',')
df2 = pd.read_csv('este_mes_subastas2.csv', sep=',')
df1
>>>
Subastas
166665859
237944547
260106086
276599496
251813654
223790056
179340698
177500866
239884764
234813107
df2
>>>
Subastas
212003586
161813617
172179313
209185016
203804433
198207783
179410798
156375658
130228140
124964988
df3
>>>
Subastas
142552750
227514418
222635042
216263925
196209965
140984000
139712089
215588302
229478041
222211457
The output that I need is:
año_pasado,mes_pasado,este_mes
166665859,212003586,142552750
237944547,161813617,227514418
260106086,172179313,222635042
276599496,209185016,216263925
251813654,203804433,196209965
223790056,198207783,140984000
179340698,179410798,139712089
177500866,156375658,215588302
239884764,130228140,229478041
234813107,124964988,222211457
I think you need concat of Series. If each file holds only one column, squeeze it to a Series after reading (older pandas had a squeeze=True argument in read_csv for this, but it was removed in pandas 2.0); otherwise select the column. For the new column names use the keys parameter:
df = pd.read_csv('año_pasado_subastas2.csv').squeeze('columns')
df1 = pd.read_csv('mes_pasado_subastas2.csv').squeeze('columns')
df2 = pd.read_csv('este_mes_subastas2.csv').squeeze('columns')
cols = ['año_pasado','mes_pasado','este_mes']
df = pd.concat([df, df1, df2], keys = cols, axis=1)
Or:
df = pd.read_csv('año_pasado_subastas2.csv')
df1 = pd.read_csv('mes_pasado_subastas2.csv')
df2 = pd.read_csv('este_mes_subastas2.csv')
cols = ['año_pasado','mes_pasado','este_mes']
df = pd.concat([df['Subastas'], df1['Subastas'], df2['Subastas']], keys = cols, axis=1)
print (df)
   año_pasado  mes_pasado   este_mes
0   166665859   212003586  142552750
1   237944547   161813617  227514418
2   260106086   172179313  222635042
3   276599496   209185016  216263925
4   251813654   203804433  196209965
5   223790056   198207783  140984000
6   179340698   179410798  139712089
7   177500866   156375658  215588302
8   239884764   130228140  229478041
9   234813107   124964988  222211457
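To try this without the actual files, here is a small self-contained sketch. It uses in-memory text in place of the three CSVs (only the first two rows of each, and the file-to-column assignment is my assumption) and renders the CSV text that DataFrame.to_csv would write to a file.

```python
import io
import pandas as pd

# In-memory stand-ins for the three CSV files (first two rows of each).
csv_last_year = io.StringIO("Subastas\n166665859\n237944547\n")
csv_last_month = io.StringIO("Subastas\n212003586\n161813617\n")
csv_this_month = io.StringIO("Subastas\n142552750\n227514418\n")

cols = ["año_pasado", "mes_pasado", "este_mes"]

# Read each single-column file and squeeze it to a Series.
series = [
    pd.read_csv(f).squeeze("columns")
    for f in (csv_last_year, csv_last_month, csv_this_month)
]

# keys= supplies the new column names; axis=1 places the Series side by side.
result = pd.concat(series, keys=cols, axis=1)

# Without a path, to_csv returns the CSV text; index=False drops the row index.
csv_text = result.to_csv(index=False)
print(csv_text)
```

Passing a filename to to_csv instead would write the same text to disk.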

Merge after groupby

I'm having trouble using pd.merge after groupby. Here's my hypothetical:
import pandas as pd
from pandas import DataFrame
import numpy as np
df1 = DataFrame({'key': [1,1,2,2,3,3],
                 'var11': np.random.randn(6),
                 'var12': np.random.randn(6)})
df2 = DataFrame({'key': [1,2,3],
                 'var21': np.random.randn(3),
                 'var22': np.random.randn(3)})
# group var11 in df1 by key
grouped = df1['var11'].groupby(df1['key'])
# calculate the mean of var11 by key
grouped = grouped.mean()
print(grouped)
key
1    1.399430
2    0.568216
3   -0.612843
dtype: float64
print(grouped.index)
Int64Index([1, 2, 3], dtype='int64')
print(df2)
   key     var21     var22
0    1 -0.381078  0.224325
1    2  0.836719 -0.565498
2    3  0.323412 -1.616901
df2 = pd.merge(df2, grouped, left_on = 'key', right_index = True)
At this point, I get IndexError: list index out of range.
When using groupby, the grouping variable ('key' in this example) becomes the index for the resultant series, which is why I specify 'right_index = True'. I've tried other syntax without success. Any advice?
I think you should just do this:
In [140]:
df2 = pd.merge(df2,
               pd.DataFrame(grouped, columns=['mean']),
               left_on='key',
               right_index=True)
print(df2)
   key     var21     var22      mean
0    1  0.324476  0.701254  0.400313
1    2 -1.270500  0.055383 -0.293691
2    3  0.804864  0.566747  0.628787
[3 rows x 4 columns]
The reason it didn't work is that grouped is a Series, not a DataFrame.
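Another way to sidestep the Series-versus-DataFrame issue is to turn the grouped result into a regular DataFrame with 'key' as a column and merge on that. This is only a sketch: it uses fixed numbers instead of np.random.randn so the result is reproducible, and the column name 'mean' is just a label chosen here.

```python
import pandas as pd

# Fixed values instead of random ones, so the means are easy to check.
df1 = pd.DataFrame({"key": [1, 1, 2, 2, 3, 3],
                    "var11": [1.0, 3.0, 2.0, 4.0, 5.0, 7.0]})
df2 = pd.DataFrame({"key": [1, 2, 3],
                    "var21": [0.1, 0.2, 0.3]})

# Mean of var11 per key, renamed and converted to a DataFrame
# in which 'key' is an ordinary column again.
means = (df1.groupby("key")["var11"].mean()
            .rename("mean")
            .reset_index())

# Both sides now have a 'key' column, so a plain merge on it works.
merged = df2.merge(means, on="key")
print(merged)
```

With both frames carrying 'key' as a column, there is no need for left_on/right_index at all.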
