There are two dataframes with a one-to-one row correspondence. I can retrieve the idxmax of all value columns in df1.
Input:
df1 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1':[76,23,43,34,0,78,34],'value2':[1,45,8,0,76,45,56]})
df2 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1_pair':[0,0,0,0,180,180,90],'value2_pair':[0,0,0,0,90,180,90]})
df=df1.loc[df1.iloc[:,1:].idxmax(), 'ref']
Output: df1, df2 and df
ref value1 value2
0 2 76 1
1 4 23 45
2 6 43 8
3 8 34 0
4 10 0 76
5 12 78 45
6 14 34 56
ref value1_pair value2_pair
0 2 0 0
1 4 0 0
2 6 0 0
3 8 0 0
4 10 180 90
5 12 180 180
6 14 90 90
5 12
4 10
Name: ref, dtype: int64
Now I want to create a df which contains 3 columns
Desired Output df:
ref max value corresponding value
12 78 180
10 76 90
What are the best options to extract the corresponding values from df2?
Your main problem is matching the columns between df1 and df2. Let's rename them properly, melt both dataframes, merge and extract:
(df1.melt('ref')
    .merge(df2.rename(columns={'value1_pair': 'value1',
                               'value2_pair': 'value2'})
              .melt('ref'),
           on=['ref', 'variable'])
    .sort_values('value_x')
    .groupby('variable').last()
)
Output:
ref value_x value_y
variable
value1 12 78 180
value2 10 76 90
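A hedged alternative sketch that skips the melt/merge and reuses the idxmax positions directly, assuming df1 and df2 share the same row index (the result column names below are chosen for illustration, not taken from either frame):
import pandas as pd

# Row label of each value column's maximum in df1.
idx = df1.iloc[:, 1:].idxmax()

# Pull the matching rows from df1 and df2; the '_pair' suffix mirrors df2's column names.
out = pd.DataFrame({
    'ref': df1.loc[idx, 'ref'].values,
    'max value': [df1.at[i, col] for col, i in idx.items()],
    'corresponding value': [df2.at[i, col + '_pair'] for col, i in idx.items()],
})
print(out)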
I have two pandas dataframes, df1 and df2, where I need to fill df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step, where I am having issues, is to map df3 onto df1 and conduct a conditional check to see whether df1['amount'] is greater than df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
Printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what I am doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1
The solution can be simplified by removing .reset_index() for df3 and passing the Series directly to map:
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
Alternative: cast the boolean mask to integer, converting True/False to 1/0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print (df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1
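For reference, here is a self-contained sketch that rebuilds the sample frames from the tables in the question (the constructor code is my own reconstruction) and applies the approach above:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [234, 213, 241, 543, 524, 142, 987, 658],
                    'code': [3, 3, 3, 3, 2, 2, 2, 3],
                    'amount': [9.8, 18, 6.4, 2, 1.8, 14, 11, 17]})
df2 = pd.DataFrame({'c_id': [1, 1, 1, 2, 2, 3, 3, 3],
                    'name': ['Aus', 'Aus', 'Aus', 'Ned', 'Ned', 'Mar', 'Mar', 'Mar'],
                    'role': ['leader', 'client', 'chair', 'chair', 'leader',
                             'client', 'chair', 'leader'],
                    'sum_column': [6, 1, 7, 8, 3, 5, 2, 4]})

# Per-group totals, indexed by c_id, mapped onto df1 via the 'code' column.
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
print(df1)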
You forgot to add quotes around sum_column:
df1['seq']=np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)
I have two dataframes, and I want to change values in the first dataframe for the IDs that appear in both dataframes.
Suppose I have:
df1 = [ID price
1 200
4 300
5 120
7 230
8 110
9 90
12 180]
and
df2 = [ID price count
3 340 27
4 60 10
5 290 2]
after replace:
df1 = [ID price
1 200
4 60
5 290
7 230
8 110
9 90
12 180]
My first try:
df1.loc[df1.ID.isin(df2.ID),['price']] = df2.loc[df2.ID.isin(df1.ID),['price']].values
but it isn't correct.
Assuming ID is the index (or can be set as the index), you can just use update():
In []:
df1.update(df2)
df1
Out[]:
price
ID
1 200.0
4 60.0
5 290.0
7 230.0
8 110.0
9 90.0
12 180.0
If you need to set_index():
df = df1.set_index('ID')
df.update(df2.set_index('ID'))
df1 = df.reset_index()
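If you'd rather avoid the set_index round-trip, a hedged alternative sketch (assuming df1 and df2 have the ID and price columns shown above) is to map the new prices by ID and keep the original price where df2 has no match:
# Map new prices by ID; NaN where df2 has no matching ID, then fall back to the old price.
s = df2.set_index('ID')['price']
df1['price'] = df1['ID'].map(s).fillna(df1['price']).astype(int)  # cast assumes integer prices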
I have two columns "ID" and "division" as shown below.
df = pd.DataFrame(np.array([['111', 'AAA'],['222','AAA'],['333','BBB'],['444','CCC'],['444','AAA'],['222','BBB'],['111','BBB']]),columns=['ID','division'])
ID division
0 111 AAA
1 222 AAA
2 333 BBB
3 444 CCC
4 444 AAA
5 222 BBB
6 111 BBB
The expected output is shown below, where I need to pivot on the same column but the count depends on "division". This should be presented as a heatmap.
df = pd.DataFrame(np.array([['0','2','1','1'],['2','0','1','1'],['1','1','0','0'],['1','1','0','0']]),columns=['111','222','333','444'],index=['111','222','333','444'])
111 222 333 444
111 0 2 1 1
222 2 0 1 1
333 1 1 0 0
444 1 1 0 0
So, technically, I am computing the overlap between IDs with respect to division.
Example:
The overlap between IDs 111 and 222 is 2 (AAA and BBB), whereas the overlap between 111 and 444 is 1 (AAA).
I could do this in Excel in two steps (not sure if they help):
Step1:=SUM(COUNTIFS($B$2:$B$8,$B2,$A$2:$A$8,$G2),COUNTIFS($B$2:$B$8,$B2,$A$2:$A$8,H$1))-1
Step2:=IF($G12=H$1,0,SUMIFS(H$2:H$8,$G$2:$G$8,$G12))
But is there any way we can do it in Python using dataframes? Appreciate your help.
Case 2:
If the input is
df = pd.DataFrame(np.array([['111', 'AAA', '4'], ['222', 'AAA', '5'], ['333', 'BBB', '6'],
                            ['444', 'CCC', '3'], ['444', 'AAA', '2'], ['222', 'BBB', '2'],
                            ['111', 'BBB', '7']]), columns=['ID', 'division', 'count'])
ID division count
0 111 AAA 4
1 222 AAA 5
2 333 BBB 6
3 444 CCC 3
4 444 AAA 2
5 222 BBB 2
6 111 BBB 7
Expected output would be
df_result = pd.DataFrame(np.array([['0','18','13','6'],['18','0','8','7'],['13','8','0','0'],['6','7','0','0']]),columns=['111','222','333','444'],index=['111','222','333','444'])
111 222 333 444
111 0 18 13 6
222 18 0 8 7
333 13 8 0 0
444 6 7 0 0
Calculation: here there is an overlap between 111 and 222 with respect to divisions AAA and BBB, hence the sum is 4+5+2+7 = 18.
One way to do this is a self-join with merge followed by pd.crosstab:
df_out = df.merge(df, on='division')
results = pd.crosstab(df_out.ID_x, df_out.ID_y)
np.fill_diagonal(results.values, 0)
Output:
ID_y 111 222 333 444
ID_x
111 0.0 2.0 1.0 1.0
222 2.0 0.0 1.0 1.0
333 1.0 1.0 0.0 0.0
444 1.0 1.0 0.0 0.0
Case 2
df = pd.DataFrame(np.array([['111', 'AAA','4'],['222','AAA','5'],['333','BBB','6'],
['444','CCC','3'],['444','AAA','2'], ['222','BBB','2'],
['111','BBB','7']]),columns=['ID','division','count'])
df['count'] = df['count'].astype(int)
df_out = df.merge(df, on='division')
df_out = df_out.assign(count = df_out.count_x + df_out.count_y)
results = pd.crosstab(df_out.ID_x, df_out.ID_y, df_out['count'], aggfunc='sum').fillna(0)
np.fill_diagonal(results.values, 0)
Output:
ID_y 111 222 333 444
ID_x
111 0.0 18.0 13.0 6.0
222 18.0 0.0 8.0 7.0
333 13.0 8.0 0.0 0.0
444 6.0 7.0 0.0 0.0
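Since the question asks for a heatmap, a minimal sketch of the presentation step (assuming seaborn and matplotlib are available; results is the crosstab built above) could be:
import matplotlib.pyplot as plt
import seaborn as sns

# Cast back to int so the annotations render as whole numbers.
sns.heatmap(results.astype(int), annot=True, fmt='d', cmap='Blues')
plt.title('ID overlap')
plt.show()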
I have a dataframe:
date id
0 12-12-2015 123
1 13-12-2015 123
2 15-12-2015 123
3 16-12-2015 123
4 18-12-2015 123
5 10-12-2015 456
6 13-12-2015 456
7 15-12-2015 456
And I want to get
id date count
0 123 10-12-2015 0
1 123 11-12-2015 0
2 123 12-12-2015 1
3 123 13-12-2015 1
4 123 14-12-2015 0
5 123 15-12-2015 1
6 123 16-12-2015 1
7 123 17-12-2015 0
8 123 18-12-2015 1
9 456 10-12-2015 1
10 456 11-12-2015 0
11 456 12-12-2015 0
12 456 13-12-2015 1
13 456 14-12-2015 0
14 456 15-12-2015 1
I tried this before:
df = df.groupby('id').resample('D').size().reset_index(name='val')
But that only fills in dates between each id's own first and last date. How can I count over a common period for every id?
You can achieve what you want by reindexing the aggregation of each group and filling NaNs with 0.
import io
import pandas as pd
data = io.StringIO("""\
date id
0 12-12-2015 123
1 13-12-2015 123
2 15-12-2015 123
3 16-12-2015 123
4 18-12-2015 123
5 10-12-2015 456
6 13-12-2015 456
7 15-12-2015 456""")
df = pd.read_csv(data, delim_whitespace=True)
df['date'] = pd.to_datetime(df['date'], format="%d-%m-%Y")
startdate = df['date'].min()
enddate = df['date'].max()
alldates = pd.date_range(startdate, enddate, freq='D', name='date')
def process_id(g):
return g.resample('D').size().reindex(alldates).fillna(0)
output = (df.set_index('date')
.groupby('id')
.apply(process_id)
.stack()
.rename('val')
.reset_index('id'))
print(output)
# id val
# date
# 2015-12-10 123 0.0
# 2015-12-11 123 0.0
# 2015-12-12 123 1.0
# 2015-12-13 123 1.0
# 2015-12-14 123 0.0
# 2015-12-15 123 1.0
# 2015-12-16 123 1.0
# 2015-12-17 123 0.0
# 2015-12-18 123 1.0
# 2015-12-10 456 1.0
# 2015-12-11 456 0.0
# 2015-12-12 456 0.0
# 2015-12-13 456 1.0
# 2015-12-14 456 0.0
# 2015-12-15 456 1.0
# 2015-12-16 456 0.0
# 2015-12-17 456 0.0
# 2015-12-18 456 0.0
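If you need the exact layout from the question (columns id, date, count with integer counts), a small follow-up reshape along these lines should work; renaming val to count is my own mapping:
result = (output.reset_index()
                .rename(columns={'val': 'count'})
                .astype({'count': int})
                .sort_values(['id', 'date'])
                .reset_index(drop=True)[['id', 'date', 'count']])
print(result)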