Sum all combinations of 2 columns from 2 dataframes - python

I have 2 dataframes df1 and df2 (same index and number of rows), and I would like to create a new dataframe which columns are the sum of all combinations of 2 columns from df1 and df2, example :
input :
import pandas as pd
df1 = pd.DataFrame([[10,20]])
df2 = pd.DataFrame([[1,2]])
output :
import pandas as pd
df3 = pd.DataFrame([[11,12,21,22]])

Use MultiIndex.from_product for all combinations and sum DataFrames with repeated values by DataFrame.reindex:
mux = pd.MultiIndex.from_product([df1.columns, df2.columns])
df = df1.reindex(mux, level=0, axis=1) + df2.reindex(mux, level=1, axis=1)
df.columns = range(len(df.columns))

IIUC you can do this with numpy.
>>> import numpy as np
>>> n = df1.shape[1]
>>> pd.DataFrame(df1.values.repeat(n) + np.tile(df2.values, n))
0 1 2 3
0 11 12 21 22

Related

concat by taking the values from column

i have a list ['df1', 'df2'] where i have stores some dataframes which have been filtered on few conditions. Then i have converted this list to dataframe using
df = pd.DataFrame(list1)
now the df has only one column
0
df1
df2
sometimes it may also have
0
df1
df2
df3
i wanted to concate all these my static code is
df_new = pd.concat([df1,df2],axis=1) or
df_new = pd.concat([df1,df2,df3],axis=1)
how can i make it dynamic (without me specifying as df1,df2) so that it takes the values and concat it.
Using array to add the lists and data frames :
import pandas as pd
lists = [[1,2,3],[4,5,6]]
arr = []
for l in lists:
new_df = pd.DataFrame(l)
arr.append(new_df)
df = pd.concat(arr,axis=1)
df
Result :
0 0
0 1 4
1 2 5
2 3 6

nunique into new dataframe

I have some data in pandas:
df1
df1['ID_A'].nunique()
5
df2
df2['ID_B'].nunique()
6
df3
df1['ID_A'].nunique()
2
df4
df2['ID_B'].nunique()
9
and so-on until 200 df.
how to make new dataframe based on this nunique
my expected result looks like this:
combine ID_A ID_B
combine_1 5 6
combine_2 2 9
thank you
Use list comprehension with list of DataFrames and if necessary change index names by list comprehensions with f-strings:
df1 = pd.DataFrame({'ID_A':[1,2,3,4,5,5],
'ID_B':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'ID_A':[1,2,1,2,1,1,1,2,1],
'ID_B':[1,2,3,4,5,6,7,8,9]})
dfs = [df1, df2]
df = pd.DataFrame([x.nunique() for x in dfs])
df.index = [f'combine_{x+1}' for x in df.index]
df.index.name= 'combine'
print (df)
ID_A ID_B
combine
combine_1 5 6
combine_2 2 9
If necessary filter only columns by list:
cols = ['ID_A', 'ID_B']
dfs = [df1, df2]
df = pd.DataFrame([x[cols].nunique() for x in dfs])
#filter only columns starting by ID_
#df = pd.DataFrame([x.filter(regex='^ID_').nunique() for x in dfs])
df.index = [f'combine_{x+1}' for x in df.index]
df.index.name= 'combine'

Mapping to dataframes based on one column

I have a dataframe (df1) of 5 columns (a,b,c,d,e) with 6 rows and another dataframe (df2) with 2 columns (a,z) with 20000 rows.
How do I map and merge those dataframes using ('a') value.
So that df1 having 5 columns should map values in df2 having 2 columns with 'a' value and return a new df which has 6 columns (5 from df1 and 1 mapped row in df2) with 6 rows.
By using pd.concat:
import pandas as pd
import numpy as np
columns_df1 = ['a','b','c','d']
columns_df2 = ['a','z']
data_df1 = [['abc','def','ghi','xyz'],['abc2','def2','ghi2','xyz2'],['abc3','def3','ghi3','xyz3'],['abc4','def4','ghi4','xyz4']]
data_df2 = [['a','z'],['a2','z2']]
df_1 = pd.DataFrame(data_df1, columns=columns_df1)
df_2 = pd.DataFrame(data_df2, columns=columns_df2)
print(df_1)
print(df_2)
frames = [df_1, df_2]
print (pd.concat(frames))
OUTPUT:
Edit:
To replace NaN values you could use pandas.DataFrame.fillna:
print (pd.concat(frames).fillna("NULL"))
Replcae NULL with anything you want e.g. 0
OUTPUT:

Concat dataframes on different columns

I have 3 different csv files and I'm looking for concat the values. The only condition I need is that the first csv dataframe must be in column A of the new csv, the second csv dataframe in the column B and the Thirth csv dataframe in the C Column. The quantity of rows is the same for all csv files.
Also I need to change the three headers to ['año_pasado','mes_pasado','este_mes']
import pandas as pd
df = pd.read_csv('año_pasado_subastas2.csv', sep=',')
df1 = pd.read_csv('mes_pasado_subastas2.csv', sep=',')
df2 = pd.read_csv('este_mes_subastas2.csv', sep=',')
df1
>>>
Subastas
166665859
237944547
260106086
276599496
251813654
223790056
179340698
177500866
239884764
234813107
df2
>>>
Subastas
212003586
161813617
172179313
209185016
203804433
198207783
179410798
156375658
130228140
124964988
df3
>>>
Subastas
142552750
227514418
222635042
216263925
196209965
140984000
139712089
215588302
229478041
222211457
The output that I need is:
año_pasado,mes_pasado,este_mes
166665859,124964988,142552750
237944547,161813617,227514418
260106086,172179313,222635042
276599496,209185016,216263925
251813654,203804433,196209965
223790056,198207783,140984000
179340698,179410798,139712089
177500866,156375658,215588302
239884764,130228140,229478041
234813107,124964988,222211457
I think you need concat of Series created by squeeze=True if one column data only or selecting columns and for new columns names use parameter keys:
df = pd.read_csv('año_pasado_subastas2.csv', squeeze=True)
df1 = pd.read_csv('mes_pasado_subastas2.csv', squeeze=True)
df2 = pd.read_csv('este_mes_subastas2.csv', squeeze=True)
cols = ['año_pasado','mes_pasado','este_mes']
df = pd.concat([df, df1, df2], keys = cols, axis=1)
Or:
df = pd.read_csv('año_pasado_subastas2.csv')
df1 = pd.read_csv('mes_pasado_subastas2.csv')
df2 = pd.read_csv('este_mes_subastas2.csv')
cols = ['año_pasado','mes_pasado','este_mes']
df = pd.concat([df['Subastas'], df1['Subastas'], df2['Subastas']], keys = cols, axis=1)
print (df)
año_pasado mes_pasado este_mes
0 166665859 212003586 142552750
1 237944547 161813617 227514418
2 260106086 172179313 222635042
3 276599496 209185016 216263925
4 251813654 203804433 196209965
5 223790056 198207783 140984000
6 179340698 179410798 139712089
7 177500866 156375658 215588302
8 239884764 130228140 229478041
9 234813107 124964988 222211457

Merge after groupby

I'm having trouble using pd.merge after groupby. Here's my hypothetical:
import pandas as pd
from pandas import DataFrame
import numpy as np
df1 = DataFrame({'key': [1,1,2,2,3,3],
'var11': np.random.randn(6),
'var12': np.random.randn(6)})
df2 = DataFrame({'key': [1,2,3],
'var21': np.random.randn(3),
'var22': np.random.randn(3)})
#group var11 in df1 by key
grouped = df1['var11'].groupby(df1['key'])
# calculate the mean of var11 by key
grouped = grouped.mean()
print grouped
key
1 1.399430
2 0.568216
3 -0.612843
dtype: float64
print grouped.index
Int64Index([1, 2, 3], dtype='int64')
print df2
key var21 var22
0 1 -0.381078 0.224325
1 2 0.836719 -0.565498
2 3 0.323412 -1.616901
df2 = pd.merge(df2, grouped, left_on = 'key', right_index = True)
At this point, I get IndexError: list index out of range.
When using groupby, the grouping variable ('key' in this example) becomes the index for the resultant series, which is why I specify 'right_index = True'. I've tried other syntax without success. Any advice?
I think you should just do this:
In [140]:
df2 = pd.merge(df2,
pd.DataFrame(grouped, columns=['mean']),
left_on='key',
right_index=True)
print df2
key var21 var22 mean
0 1 0.324476 0.701254 0.400313
1 2 -1.270500 0.055383 -0.293691
2 3 0.804864 0.566747 0.628787
[3 rows x 4 columns]
The reason it didn't work is that grouped is a Series not a DataFrame

Categories