Create a dataframe of combinations with an ID with pandas [duplicate] - python

This question already has answers here:
cartesian product in pandas
(13 answers)
Closed 19 days ago.
I'm running into a wall in terms of how to do this with Pandas. Given a dataframe (df1) with an ID column, and a separate dataframe (df2), how can I combine the two to make a third dataframe that preserves the ID column with all the possible combinations it could have?
df1
ID name.x
1 a
2 b
3 c
df2
name.y
l
m
dataframe creation:
df1 = pd.DataFrame({'ID':[1,2,3],'name.x':['a','b','c']})
df2 = pd.DataFrame({'name.y':['l','m']})
combined df
ID name.x name.y
1 a l
1 a m
2 b l
2 b m
3 c l
3 c m

Create a column on each dataframe that holds the same value, do a full outer join on that column, then keep the columns you want:
df1 = pd.DataFrame({'ID':[1,2,3],'name.x':['a','b','c']})
df2 = pd.DataFrame({'name.y':['l','m']})
df1['join_col'] = True
df2['join_col'] = True
df3 = pd.merge(df1,df2, how='outer',on = 'join_col')
print(df3[['ID','name.x','name.y']])
will output:
ID name.x name.y
0 1 a l
1 1 a m
2 2 b l
3 2 b m
4 3 c l
5 3 c m
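On pandas 1.2 or later, the helper column can be skipped, because merge accepts how='cross'. The following is a minimal sketch under that version assumption; the name combined is only illustrative and does not come from the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'name.x': ['a', 'b', 'c']})
df2 = pd.DataFrame({'name.y': ['l', 'm']})

# how='cross' pairs every row of df1 with every row of df2,
# so no shared join column is needed (requires pandas >= 1.2).
combined = df1.merge(df2, how='cross')
print(combined)
```

Unlike the join_col approach, this leaves df1 and df2 unmodified.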

Related

How can a duplicate row be dropped with some condition [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 9 months ago.
Simple DataFrame:
df = pd.DataFrame({'A': [1,1,2,2], 'B': [0,1,2,3], 'C': ['a','b','c','d']})
df
A B C
0 1 0 a
1 1 1 b
2 2 2 c
3 2 3 d
For every value of column A (that is, per group), I want the value of column C at the row where column B is maximal. For example, for group 1 of column A the maximum of column B is 1, so I want the value "b" of column C:
A C
0 1 b
1 2 d
Column B should not be assumed to be sorted. Performance is the top priority, then elegance.
Check with sort_values +drop_duplicates
df.sort_values('B').drop_duplicates(['A'],keep='last')
Out[127]:
A B C
1 1 1 b
3 2 3 d
df.groupby('A').apply(lambda x: x.loc[x['B'].idxmax(), 'C'])
# A
#1 b
#2 d
Use idxmax to find the index where B is maximal, then select column C within that group (using a lambda function).
Here's a little fun with groupby and nlargest:
(df.set_index('C')
.groupby('A')['B']
.nlargest(1)
.index
.to_frame()
.reset_index(drop=True))
A C
0 1 b
1 2 d
Or, sort_values, groupby, and last:
df.sort_values('B').groupby('A')['C'].last().reset_index()
A C
0 1 b
1 2 d
Similar solution to @Jondiedoop's, but avoids the apply:
u = df.groupby('A')['B'].idxmax()
df.loc[u, ['A', 'C']].reset_index(drop=True)
A C
0 1 b
1 2 d
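As a quick sanity check, the sketch below runs three of the approaches above on the sample frame and compares their results. The variable names (by_sort, by_last, by_idxmax) are made up for illustration. Note that the sample data has no ties in B within a group, so this says nothing about how each approach breaks ties.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [0, 1, 2, 3], 'C': ['a', 'b', 'c', 'd']})

# sort by B, then keep the last (largest-B) row seen for each A
by_sort = (df.sort_values('B')
             .drop_duplicates('A', keep='last')[['A', 'C']]
             .reset_index(drop=True))

# sort by B, then take the last C in each group
by_last = df.sort_values('B').groupby('A')['C'].last().reset_index()

# look up the row label of the maximum B per group, then select those rows
by_idxmax = df.loc[df.groupby('A')['B'].idxmax(), ['A', 'C']].reset_index(drop=True)

print(by_sort['C'].tolist(), by_last['C'].tolist(), by_idxmax['C'].tolist())
```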

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
my goal is get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match both A and B columns, based on the column B from df_2.
Columns A and B repeat their content after getting to 4. The order matters here and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (
    pd.merge_asof(left=df_1, right=df_2, left_on='idx', right_on='idx',
                  left_by='A', right_by='B', direction='backward', tolerance=2)
    .dropna()
    .drop(labels='idx', axis='columns')
    .reset_index(drop=True)
)
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2,
left_on=['idx', 'A'], right_on=['idx', 'B'])
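To see how the two approaches differ on the question's data, here is a small sketch. It assumes idx is an ordinary column rather than the index (the question does not say which), and the names exact and nearest are only illustrative. The exact merge pairs rows only when idx is identical in both frames, while merge_asof looks backward for the closest earlier idx with the same value, within the tolerance.

```python
import pandas as pd

df_1 = pd.DataFrame({'idx': range(6),
                     'A': [1, 2, 3, 4, 1, 2],
                     'X': list('ABCDEF')})
df_2 = pd.DataFrame({'idx': range(6),
                     'B': [1, 2, 4, 2, 3, 1],
                     'Y': list('HIJKLM')})

# exact match: idx must be equal and A must equal B
exact = df_1.merge(df_2, left_on=['idx', 'A'], right_on=['idx', 'B'])

# as-of match: same value (A == B), nearest df_2 row at or before
# the df_1 idx, at most 2 positions away; unmatched rows are dropped
nearest = (pd.merge_asof(df_1, df_2, on='idx',
                         left_by='A', right_by='B',
                         direction='backward', tolerance=2)
             .dropna()
             .reset_index(drop=True))

print(exact[['X', 'Y']])
print(nearest[['X', 'Y']])
```

On this data the exact merge keeps only the first two pairs, whereas the as-of merge also recovers the D/J and F/K pairs shown in the desired result.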

Assign each unique value of column to whole Dataframe as if data frame duplicate itself based on value of another column

I am trying to iterate over the values of a column in df2 and assign each value to the whole of df1, as if df1 were repeated once for every value of the df2 column.
Let's say I have df1 as below:
df1
1
2
3
and df2 as per below:
df2
A
B
C
I want a third dataframe, df3, to look like this:
df3
1 A
2 A
3 A
1 B
2 B
3 B
1 C
2 C
3 C
So far I have tried the code below:
for i, value in ACS_shock['scenario'].iteritems():
df1['sec'] = df1[i] = value[:]
But when I generate the file from df1, my output looks like this:
1 A B C
2 A B C
3 A B C
Any idea how I can correct this code?
Much appreciated.
You can use pd.concat and np.repeat:
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.Series([1,2,3])
>>> df1
0 1
1 2
2 3
dtype: int64
>>> df2 = pd.Series(list('ABC'))
>>> df2
0 A
1 B
2 C
dtype: object
>>> df3 = pd.DataFrame({'df1': pd.concat([df1]*3).reset_index(drop=True),
'df2': np.repeat(df2, 3).reset_index(drop=True)})
>>> df3
df1 df2
0 1 A
1 2 A
2 3 A
3 1 B
4 2 B
5 3 B
6 1 C
7 2 C
8 3 C
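The answer above hard-codes 3 as the repeat count. A hedged variant that derives the counts from the inputs could look like the sketch below; it swaps pd.concat for np.tile, and the names s1 and s2 are illustrative stand-ins for the two Series.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series(list('ABC'))

df3 = pd.DataFrame({
    # repeat the whole of s1 once per element of s2: 1 2 3 1 2 3 ...
    'df1': np.tile(s1.to_numpy(), len(s2)),
    # repeat each element of s2 once per element of s1: A A A B B B ...
    'df2': np.repeat(s2.to_numpy(), len(s1)),
})
print(df3)
```

Because both counts come from len(), the same code works when the two inputs have different lengths.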

Pandas multiply two data frames to get product

I have two data frames with different variable names
df1 = pd.DataFrame({'A':[2,2,3],'B':[5,5,6]})
>>> df1
A B
0 2 5
1 2 5
2 3 6
df2 = pd.DataFrame({'C':[3,3,3],'D':[5,5,6]})
>>> df2
C D
0 3 5
1 3 5
2 3 6
I want to create a third data frame where the n-th column is the product of the n-th columns in the first two data frames. In the above example, df3 would have two columns X and Y, where df3.X = df1.A * df2.C and df3.Y = df1.B * df2.D
df3 = pd.DataFrame({'X':[6,6,9],'Y':[25,25,36]})
>>> df3
X Y
0 6 25
1 6 25
2 9 36
Is there a simple pandas function that allows me to do this?
You can use mul, to multiply df1 by the values of df2:
df3 = df1.mul(df2.values)
df3.columns = ['X','Y']
>>> df3
X Y
0 6 25
1 6 25
2 9 36
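You can also use numpy:
df3 = np.multiply(df1, df2.values)
Note: most NumPy operations accept a pandas Series or DataFrame. Pass df2.values here because df1 and df2 have different column names, and recent pandas versions align DataFrame operands on their labels before applying the operation.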
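The reason for going through the raw values is that pandas arithmetic between two DataFrames matches columns by label, not by position. The sketch below illustrates the difference on the question's data; the names by_label and by_position are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 3], 'B': [5, 5, 6]})
df2 = pd.DataFrame({'C': [3, 3, 3], 'D': [5, 5, 6]})

# Label-based: no column names are shared, so every cell is NaN
by_label = df1 * df2

# Position-based: multiply the underlying arrays, then name the columns
by_position = pd.DataFrame(df1.to_numpy() * df2.to_numpy(),
                           columns=['X', 'Y'], index=df1.index)

print(by_label)
print(by_position)
```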

pandas dataframe reshape cast [duplicate]

This question already has answers here:
Concatenate strings from several rows using Pandas groupby
(8 answers)
Closed 6 years ago.
I have a dataframe like this:
import pandas
df=pandas.DataFrame([['a','b'],['a','c'],['b','c'],['b','d'],['c','f']],columns=['id','key'])
print(df)
id key
0 a b
1 a c
2 b c
3 b d
4 c f
the result that I wanted:
id key
0 a b,c
1 b c,d
2 c f
I tried to use the pivot function, but I didn't get the result I wanted. The cast package in R seems to handle this problem. Thanks!
You need groupby with apply join:
df1 = df.groupby('id')['key'].apply(','.join).reset_index()
print (df1)
id key
0 a b,c
1 b c,d
2 c f
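A numpy approach:
import numpy as np
import pandas as pd
g = df.id.values
k = df.key.values
# stable sort, so keys keep their original order within each id
a = g.argsort(kind='mergesort')
gg = g[a]
kg = k[a]
# positions where the sorted id changes (last row of every group except the final one)
w = np.where(gg[:-1] != gg[1:])[0]
pd.DataFrame(dict(
    id=gg[np.append(w, len(a) - 1)],
    key=[','.join(l.tolist()) for l in np.split(kg, w + 1)]
))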
id key
0 a b,c
1 b c,d
2 c f
speed versus intuition
