My sample df looks like this:
sid score completed
101 70 NaN
102 56 Yes
101 65 No
103 88 Yes
103 50 NaN
102 42 No
105 79 NaN
....
What do I want?
I want to groupby sid and take the max score from the score column.
For the completed column, I want to take the value Yes if the group contains a Yes, otherwise No, and simply NaN if the group contains neither Yes nor No.
My final df should look like this:
sid score_max completed
101 70 No
102 56 Yes
103 88 Yes
105 79 NaN
....
What did I do?
df_groupby = df.groupby(['sid']).agg(
score_max = ('score','max'),
completed = ('completed', any(completed="Yes"))
)
However, the solution does not work. Could you please assist me in solving this problem?
Use ordered pd.CategoricalDtype to solve your problem:
>>> df.astype({'completed': pd.CategoricalDtype(['No', 'Yes'], ordered=True)}) \
.groupby('sid') \
.agg(score_max=('score', 'max'), completed=('completed', 'max')) \
.reset_index()
sid score_max completed
0 101 70 No
1 102 56 Yes
2 103 88 Yes
3 105 79 NaN
Detail about the ordered category:
df1 = pd.DataFrame({'Col1': ['No', 'Yes', np.NaN]})
df1['Col1'] = df1['Col1'].astype(pd.CategoricalDtype(['No', 'Yes'],
ordered=True))
>>> df1['Col1'].min()
'No'
>>> df1['Col1'].max()
'Yes'
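For comparison, the same Yes/No/NaN rule can also be spelled out with a lambda in agg; a quick sketch, assuming numpy is imported as np:
out = (df.groupby('sid')
         .agg(score_max=('score', 'max'),
              completed=('completed',
                         lambda s: 'Yes' if s.eq('Yes').any()
                                   else ('No' if s.eq('No').any() else np.nan)))
         .reset_index())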
If I have a dataframe and I want to sum the values of the columns, I can do something like:
import pandas as pd
studentdetails = {
"studentname":["Ram","Sam","Scott","Ann","John","Bobo"],
"mathantics" :[80,90,85,70,95,100],
"science" :[85,95,80,90,75,100],
"english" :[90,85,80,70,95,100]
}
index_labels=['r1','r2','r3','r4','r5','r6']
df = pd.DataFrame(studentdetails, index=index_labels)
print(df)
df3 = df.sum()
print(df3)
col_list= ['studentname', 'mathantics', 'science']
print( df[col_list].sum())
How can I do something similar, but get the sum of absolute values of some columns instead of the plain sum (which in this particular case would be the same)?
I tried abs in several ways but it did not work.
Edit:
studentname mathantics science english
r1 Ram 80 85 90
r2 Sam 90 95 -85
r3 Scott -85 80 80
r4 Ann 70 90 70
r5 John 95 -75 95
r6 Bobo 100 100 100
Expected output
mathantics 520
science 525
english 520
Edit2:
The col_list cannot include string-valued columns.
You need numeric columns for absolute values. Any of the following works:
# exclude the string column explicitly
col_list = df.columns.difference(['studentname'])
df[col_list].abs().sum()

# or move the string column into the index
df.set_index('studentname').abs().sum()

# or keep only the numeric dtypes (requires import numpy as np)
df.select_dtypes(np.number).abs().sum()
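For instance, with the edited data from the question, a quick sketch reproducing the expected sums:
import pandas as pd

df = pd.DataFrame({
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, -85, 70, 95, 100],
    "science": [85, 95, 80, 90, -75, 100],
    "english": [90, -85, 80, 70, 95, 100]},
    index=['r1', 'r2', 'r3', 'r4', 'r5', 'r6'])

print(df.set_index('studentname').abs().sum())
# mathantics    520
# science       525
# english       520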
I have this dataframe:
x y z parameter
0 26 24 25 Age
1 35 37 36 Age
2 57 52 54.5 Age
3 160 164 162 Hgt
4 182 163 172.5 Hgt
5 175 167 171 Hgt
6 95 71 83 Wgt
7 110 68 89 Wgt
8 89 65 77 Wgt
I'm using pandas to get this final result:
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
I'm using groupby() to extract and isolate rows with the parameter Hgt from the original dataframe.
First, I added a column to set it as an index:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
And the dataframe came out like this:
index x y z parameter
0 0 26 24 25 Age
1 1 35 37 36 Age
2 2 57 52 54.5 Age
3 3 160 164 162 Hgt
4 4 182 163 172.5 Hgt
5 5 175 167 171 Hgt
6 6 95 71 83 Wgt
7 7 110 68 89 Wgt
8 8 89 65 77 Wgt
Then, I used the following code to group based on index and extract the columns I need:
df1 = df.groupby('index')[['x', 'y','parameter']]
And the output was:
x y parameter
0 26 24 Age
1 35 37 Age
2 57 52 Age
3 160 164 Hgt
4 182 163 Hgt
5 175 167 Hgt
6 95 71 Wgt
7 110 68 Wgt
8 89 65 Wgt
After that, I used the following code to isolate only Hgt values:
df2 = df1[df1['parameter'] == 'Hgt']
When I ran df2, I got an error saying:
IndexError: Column(s) ['x', 'y', 'parameter'] already selected
Am I missing something here? What to do to get the final result?
Because you asked what you did wrong, let me point out the unnecessary/bad code.
Without any judgement (this is just to help you improve future code), almost everything is incorrect. It feels like a succession of complicated ways to do unnecessary things. Let me give some details:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
This seems a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified by using range(df.shape[0]) directly. Note also that DataFrame.insert works in place and returns None, so assigning its result back to df would actually leave you with None.
But this step is not even needed for a groupby as you can group by index level:
df.groupby(level=0)
But... the groupby is useless anyway, as you only have single-membered groups.
Also, when you do:
df1 = df.groupby('index')[['x', 'y','parameter']]
df1 is not a DataFrame but a DataFrameGroupBy object. Storing one in a variable is very useful when you know what you're doing, but here it is the cause of the error, because you treated it as a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy to get a DataFrame back, which you didn't (likely because, as seen above, there isn't much interesting to do on single-membered groups).
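To make that concrete, a small sketch (assuming the df with the added 'index' column shown above):
g = df.groupby('index')[['x', 'y', 'parameter']]
print(type(g))           # DataFrameGroupBy, not a DataFrame
print(type(g.first()))   # an aggregation such as .first() returns a DataFrame again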
So when you run:
df1[df1['parameter'] == 'Hgt']
again, everything is wrong, as df1['parameter'] is equivalent to df.groupby('index')[['x', 'y','parameter']]['parameter'] (the cause of the error, because you select 'parameter' twice). Even if you removed this error, the equality comparison would give a single True/False, as you would still have a DataFrameGroupBy and not a DataFrame, and this would then incorrectly try to subselect a nonexistent column of the DataFrameGroupBy.
I hope it helped!
Do you really need groupby?
>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
Python: how to get unique IDs and remove duplicates from column 1 (ID) and column 3 (Description), then get the median for column 2
ID      Value  Description
123456  116    xx
123456  117    xx
123456  113    xx
123456  109    xz
123456  108    xz
123456  98     xz
121214  115    abc
121214  110    abc
121214  103    abc
121214  117    abz
121214  120    abz
121214  125    abz
151416  114    zxc
151416  135    zxc
151416  127    zxc
151416  145    zxm
151416  125    zxm
151416  121    zxm
The processed table should look like:
ID      xx   xz   abc  abz  zxc  zxm
123456  110  151  0    0    0    0
121214  0    0    132  113  0    0
151416  0    0    0    0    124  115
I went for the approach of the mean, but your "expected output" example doesn't give a mean. Am I misunderstanding what you mean?
pd.pivot_table(DF, 'Value', index='ID', columns='Description')
This should do the trick; the default aggregation function is the mean, so that's ideal. More info can be found in the pandas pivot_table documentation (mind you, DF is the input dataframe).
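If the median from the question title is what's needed, with zeros for missing ID/Description combinations as in the expected table, pivot_table takes both as options; a quick sketch:
import pandas as pd

# aggfunc='median' matches the question title; fill_value=0 fills
# ID/Description combinations that have no rows at all
out = pd.pivot_table(DF, values='Value', index='ID',
                     columns='Description', aggfunc='median', fill_value=0)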
Maybe this approach will work for you?
import pandas as pd

d = {'ID': [1,1,2,3,3,4,4,4,4,5,5], 'Value': [5,6,7,8,9,7,8,5,1,2,4]}
df = pd.DataFrame(data=d)

unique = set(df['ID'])
value_mean = []
for i in unique:
    # mean of 'Value' for each unique ID
    a = df[df['ID'] == i]['Value']
    a = a.mean()
    value_mean.append(a)
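The same per-ID means can also be obtained more directly with a groupby, for what it's worth:
df.groupby('ID')['Value'].mean()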
Well, you have e.g. 6 rows with ID '123456'. If you only want unique IDs, you need to remove 5 of those rows, and by doing so you will no longer have duplicate Description values either. The question is: do you want unique ID values, unique Description values, or unique combinations of both?
There are probably more options to solve this. What you could do is combine ID and Description into a new column and remove the duplicates from the DataFrame. Hopefully this helps.
import pandas as pd
a = {'ID': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5],
'Value': [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6],
'Description': ['a','a','b','b','c','d','d','a','c','d','e','e','e','a','b']}
df = pd.DataFrame(data=a)
# build a helper column combining ID and Description, then drop duplicates on it
unique_combined = []
for i in range(len(df)):
    unique_combined.append(str(df.iloc[i]['ID']) + df.iloc[i]['Description'])

df['un'] = unique_combined
df.drop_duplicates(subset=['un'])
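Note that drop_duplicates also accepts several columns directly, so the helper column isn't strictly necessary; a minimal equivalent would be:
df.drop_duplicates(subset=['ID', 'Description'])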
Following on from this question, I have a dataset as such:
ChildID MotherID preDiabetes
0 20 455 No
1 20 455 Not documented
2 13 102 NaN
3 13 102 Yes
4 702 946 No
5 82 571 No
6 82 571 Yes
7 82 571 Not documented
8 60 530 NaN
Which I have transformed to the following such that each mother has a single value for preDiabetes:
ChildID MotherID preDiabetes
0 20 455 No
1 13 102 Yes
2 702 946 No
3 82 571 Yes
4 60 530 No
I did this by applying the following logic:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if preDiabetes != "Yes" for a particular MotherID, I will assign preDiabetes a value of "No"
However, after thinking about this again, I realised that I should preserve NaN values to impute them later on, rather than just assign them 'No'.
So I should edit my logic to be:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if all values for preDiabetes==NaN for a particular MotherID, assign preDiabetes a single NaN value
else assign preDiabetes a value of "No"
So, in the above table MotherID=530 should have a value of NaN for preDiabetes like so:
ChildID MotherID preDiabetes
0 20 455 No
1 13 102 Yes
2 702 946 No
3 82 571 Yes
4 60 530 NaN
I tried doing this using the following line of code:
df=df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
lambda x: 'Yes' if 'Yes' in x.values else (np.NaN if np.NaN in x.values.all() else 'No'))
However, running this line of code is resulting in the following error:
TypeError: 'in ' requires string as left operand, not float
I'd appreciate it if you could point out what I am doing wrong. Thank you.
You can try:
import pandas as pd
import numpy as np
import io
data_string = """ChildID,MotherID,preDiabetes
20,455,No
20,455,Not documented
13,102,NaN
13,102,Yes
702,946,No
82,571,No
82,571,Yes
82,571,Not documented
60,530,NaN
"""
data = io.StringIO(data_string)
df = pd.read_csv(data, sep=',', na_values=['NaN'])
df.fillna('no_value', inplace=True)
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
    lambda x: 'Yes' if 'Yes' in x.values else (np.NaN if (x.values == 'no_value').all() else 'No'))
df
Result:
MotherID ChildID
102 13 Yes
455 20 No
530 60 NaN
571 82 Yes
946 702 No
Name: preDiabetes, dtype: object
You can do it using a custom function:
def func(s):
    if s.eq('Yes').any():
        return 'Yes'
    elif s.isna().all():
        return np.nan
    else:
        return 'No'
df = (df
      .groupby(['ChildID', 'MotherID'])
      .agg({'preDiabetes': func})
      .reset_index())
print(df)
ChildID MotherID preDiabetes
0 13 102 Yes
1 20 455 No
2 60 530 NaN
3 82 571 Yes
4 702 946 No
Try:
df['preDiabetes'] = df['preDiabetes'].map({'Yes': 1, 'No': 0}).fillna(-1)
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].max().map({1: 'Yes', 0: 'No', -1: np.nan}).reset_index()
The first line converts preDiabetes to numbers, treating everything other than Yes or No as NaN (denoted by -1).
The second line takes the per-group maximum: if at least one value in the group is Yes, we output Yes; if the group has both No and NaN, we output No; if all values are NaN, we output NaN.
Outputs:
>>> df
MotherID ChildID preDiabetes
0 102 13 Yes
1 455 20 No
2 530 60 NaN
3 571 82 Yes
4 946 702 No
To best illustrate, consider the following SQL illustration.
Table StockPrices, where BarSeqId is a sequential number and each increment is information from the next minute of trading.
The goal, in a pandas DataFrame, is to transform this data:
StockPrice BarSeqId LongProfitTarget
105 0 109
100 1 105
103 2 107
103 3 108
104 4 110
105 5 113
into this data:
StockPrice BarSeqId LongProfitTarget TargetHitBarSeqId
106 0 109 Nan
100 1 105 3
103 2 107 5
105 3 108 Nan
104 4 110 Nan
107 5 113 Nan
That is, to create a new column that gives the soonest future time-frame at which the price target will be hit, relative to the current time-frame.
Here is how it could be achieved in SQL:
SELECT S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget,
min(S2.BarSeqId) as TargetHitBarSeqId
FROM StockPrices S1
left outer join StockPrices S2 on S1.BarSeqId < S2.BarSeqId and
S2.StockPrice>=S1.LongProfitTarget
GROUP BY S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget
I would like the answer to be as follows:
someDataFrame['TargetHitBarSeqId'] = (pandas expression here ...)
assume that someDataFrame already has columns: StockPrice, BarSeqId, LongProfitTarget
Data edited to illustrate the case:
So in the second row the result should be
100 1 105 3
and NOT
100 1 105 0
since 3, and not 0, occurs after 1.
It is important that the BarSeqId in question occurs in the future (i.e. is greater than the current BarSeqId).
Here's one solution:
import pandas as pd
import numpy as np
df = <your input data frame>
def get_barseqid(longProfitTarget):
    try:
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.iloc[idx].BarSeqId
    except:
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)
Output:
StockPrice BarSeqId LongProfitTarget TargetHitBarSeqId
0 100 1 105 3.0
1 103 2 107 5.0
2 105 3 108 NaN
3 104 4 110 NaN
4 107 5 113 NaN
import pandas as pd
import numpy as np

df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5],
                   'LongProfitTarget': [109, 105, 107, 108, 110, 113]})

def get_barseqid(longProfitTarget, barseq):
    try:
        # only consider bars strictly after the current one that reach the target
        idx = df[(df.StockPrice >= longProfitTarget) & (df.BarSeqId > barseq)].index[0]
        return df.iloc[idx].BarSeqId
    except IndexError:
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget'], row['BarSeqId']), axis=1)
df
The key point for me was the need to combine the two conditions with the element-wise & operator rather than Python's regular and/or.
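A minimal illustration of that point, using a hypothetical Series rather than the question's data:
import pandas as pd

s = pd.Series([1, 2, 3])
mask = (s > 1) & (s < 3)   # element-wise AND: [False, True, False]
# (s > 1) and (s < 3)      # raises ValueError: the truth value of a Series is ambiguous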
Assuming data is manageable, consider a cross join followed by filter and groupby, which would replicate the SQL query:
cdf = pd.merge(df.assign(key=1), df.assign(key=1), on='key', suffixes=['','_'])\
.query('(BarSeqId < BarSeqId_) & (LongProfitTarget <= StockPrice_)')\
.groupby(['StockPrice', 'BarSeqId', 'LongProfitTarget'])['BarSeqId_'].min()
print(cdf)
# StockPrice BarSeqId LongProfitTarget
# 100 1 105 3
# 103 2 107 5
# Name: BarSeqId_, dtype: int64
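If the full shape of the desired output is needed (every original row, with NaN where no future hit exists), one way, sketched here assuming cdf is the grouped Series above, is to merge it back onto df:
out = df.merge(cdf.rename('TargetHitBarSeqId').reset_index(),
               on=['StockPrice', 'BarSeqId', 'LongProfitTarget'],
               how='left')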