Calculate mean of each subsequent group of 2 rows with pandas - python

I'm trying to calculate the mean of each subsequent group of 2 rows for all data frame. I think I got that with the following line:
df.groupby(np.arange(len(df))//2).mean()
However, the problem is that not all values are numeric. In that case, if the second line of the group is numeric while the first is not, instead of the mean, the value stays the same as the second row. In case of both lines be non numeric, the value should be assigned with 0.
For a better visualization, I have this dataframe:
Well Ct
0 A1 Undetermined
1 A2 Undertermined
2 A3 Undetermined
3 A4 41.2
4 B1 42
5 B2 43
What I'm trying to obtain is:
Well Ct
0 A1-A2 0.0
1 A3-A4 41.2
2 B1/B2 42.5
Is there any way to do this or other similar question that was already been posted?

Use pandas.to_numeric to coerce non-numeric values to NaNs (which pandas will ignore by default when calculating means), then use groupby + agg to assign your final groups.
df.Ct = pd.to_numeric(df.Ct, errors='coerce')
df.groupby(np.arange(df.shape[0]) // 2).agg({'Well': '-'.join, 'Ct': 'mean'}).fillna(0)
Well Ct
0 A1-A2 0.0
1 A3-A4 41.2
2 B1-B2 42.5

Related

Python pandas groupby category and integer variables results in pandas last and tail difference

UPDATE:
Please download my full dataset here.
my datatype is:
>>> df.dtypes
increment int64
spread float64
SYM_ROOT category
dtype: object
I have realized that the problem might have been caused by the fact that my SYM_ROOT is a category variable.
To replicate the issue you might want to do the following first:
df=pd.read_csv("sf.csv")
df['SYM_ROOT']=df['SYM_ROOT'].astype('category')
But I am still puzzled as in why my SYM_ROOT will result in the gaps in increment being filled with NA? Unless groupby category and integer value will result in a balanced panel by default.
I noticed that the behaviour of pd.groupby().last is different from that of pd.groupby().tail(1).
For example, suppose I have the following data:
increment is an integer that spans from 0 to 4680. However, for some SYM_ROOT variable, there are gaps in between. For example, 4 could be missing from it.
What I want to do is to keep the last observation per group.
If I do df.groupby(['SYM_ROOT','increment']).last(), the dataframe becomes:
While if I do df.groupby(['SYM_ROOT','increment']).tail(1), the dataframe becomes:
It looks to me that the last() statement will create a balanced time-series data and fill in the gaps with NaN, while the tail(1) statement doesn't. Is it correct?
Update :
Your columns increment is category
df=pd.DataFrame({'A':[1,1,2,2],'B':[1,1,2,3],'C':[1,1,1,1]})
df.B=df.B.astype('category')
df.groupby(['A','B']).last()
Out[590]:
C
A B
1 1 1.0
2 NaN
3 NaN
2 1 NaN
2 1.0
3 1.0
When you using tail it will not make up the miss level since , tail is more like dataframe base , not single columns
df.groupby(['A','B']).tail(1)
Out[593]:
A B C
1 1 1 1
2 2 2 1
3 2 3 1
After hange it using astype
df.B=df.B.astype('int')
df.groupby(['A','B']).last()
Out[591]:
C
A B
1 1 1
2 2 1
3 1
It is actually an issue here at Github, where the problem is mainly caused by groupby categories guessing the values.

How to filter Pandas rows based on last/next row?

I have two data sets from different pulse oximeters, and plot them with pyplot as displayed below. As you may see, the green data sheet has alot of outliers(vertical drops). In my work I've defined these outlayers as non-valid in for my statistical analysis, they are must certainly not measurements. Therefore I argue that I can simply remove them.
The characteristics of these rogue values is that they're single(or top two) value outliers(see df below). The "real" sample values are either the same as the previous value, or +-1. In e.g. java(pseudo code) I would do something like:
for(i; i <df.length; i++)
if (df[i+1|-1].spo2 - df[i].spo2 > 1|-1)
df[i].drop
What would be the pandas(numpy?) equivalent of what I'm trying to do, remove values that is more/less than 1 compared to the last/next value?
df:
time, spo2
1900-01-01 18:18:41.194 98.0
1900-01-01 18:18:41.376 98.0
1900-01-01 18:18:41.559 78.0
1900-01-01 18:18:41.741 98.0
1900-01-01 18:18:41.923 98.0
1900-01-01 18:18:42.105 90.0
1900-01-01 18:18:42.288 97.0
1900-01-01 18:18:42.470 97.0
1900-01-01 18:18:42.652 98.0
have a look at pandas.DataFrame.shift. This is a column-wise operation that shifts all rows in a given column to another row of another column:
# original df
x1
0 0
1 1
2 2
3 3
4 4
# shift down
df.x2 = df.x1.shift(1)
x1 x2
0 0 NaN # Beware
1 1 0
2 2 1
3 3 2
4 4 3
# Shift up
df.x2 = df.x1.shift(-1)
x1 x2
0 0 1
1 1 2
2 2 3
3 3 4
4 4 NaN # Beware
You can use this to move spo2 of timestamp n+1 next to spo2 in the timestamp n row. Then, filter based on conditions applied to that one row.
df['spo2_Next'] = df['spo2'].shift(-1)
# replace NaN to allow float comparison
df.spo2_Next.fillna(1, inplace = True)
# Apply your row-wise condition to create filter column
df.loc[((df.spo2_Next - df.spo2) > 1) or ((df.spo2_Next - df.spo2) < 1), 'Outlier'] = True
# filter
df_clean = df[df.Outlier != True]
# remove filter column
del df_clean['Outlier']
When you filter a pandas dataframe like:
df[ df.colum1 = 2 & df.colum2 < 3 ], you are:
comparing a numeric series to a scalar value and generating a boolean series
obtaining two boolean series and doing a logical and
then using a numeric series to filter the data frame (the false values will not be added in the new data frame)
So you just need create an iterative algorithm over the data frame to produce such boolean array, and use it to filter the dataframe, as in:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
df[ [True, False, True]]
You can also create a closure to filter the data frame (using df.apply), and keeping previous observations in the closure to detect abrupt changes, but this would be way too complicated. I would go for the straightforward imperative solution.

Subtract 2 values from one another within 1 column after groupby

I am very sorry if this is a very basic question but unfortunately, I'm failing miserably at figuring out the solution.
I need to subtract the first value within a column (in this case column 8 in my df) from the last value & divide this by a number (e.g. 60) after having applied groupby to my pandas df to get one value per id. The final output would ideally look something like this:
id
1 1523
2 1644
I have the actual equation which works on its own when applied to the entire column of the df:
(df.iloc[-1,8] - df.iloc[0,8])/60
However I fail to combine this part with the groupby function. Among others, I tried apply, which doesn't work.
df.groupby(['id']).apply((df.iloc[-1,8] - df.iloc[0,8])/60)
I also tried creating a function with the equation part and then do apply(func)but so far none of my attempts have worked. Any help is much appreciated, thank you!
Demo:
In [204]: df
Out[204]:
id val
0 1 12
1 1 13
2 1 19
3 2 20
4 2 30
5 2 40
In [205]: df.groupby(['id'])['val'].agg(lambda x: (x.iloc[-1] - x.iloc[0])/60)
Out[205]:
id
1 0.116667
2 0.333333
Name: val, dtype: float64

Adding calculated constant value into Python data frame

I'm new to Python, and I believe this is very basic question (sorry for that), but I tried to look for a solution here: Better way to add constant column to pandas data frame and here: add column with constant value to pandas dataframe and in many other places...
I have a data frame like this "toy" sample:
A B
10 5
20 12
50 200
and I want to add new column (C) which will be the division of the last data cells of A and B (50/200); So in my example, I'd like to get:
A B C
10 5 0.25
20 12 0.25
50 200 0.25
I tried to use this code:
groupedAC ['pNr'] = groupedAC['cIndCM'][-1:]/groupedAC['nTileCM'][-1:]
but I'm getting the result only in the last cell (I believe it's a result of my code acting as a "pointer" and not as a number - but as I said, I tried to "convert" my result into a constant (even using temp variables) but with no success).
Your help will be appreciated!
You need to index it with .iloc[-1] instead of .iloc[-1:], because the latter returns a Series and thus when assigning back to the data frame, the index needs to be matched:
df.B.iloc[-1:] # return a Series
#2 150
#Name: B, dtype: int64
df['C'] = df.A.iloc[-1:]/df.B.iloc[-1:] # the index has to be matched in this case, so only
# the row with index = 2 gets updated
df
# A B C
#0 10 5 NaN
#1 20 12 NaN
#2 50 200 0.25
df.B.iloc[-1] # returns a constant
# 150
df['C'] = df.A.iloc[-1]/df.B.iloc[-1] # there's nothing to match when assigning the
# constant to a new column, the value gets broadcasted
df
# A B C
#0 10 5 0.25
#1 20 12 0.25
#2 50 200 0.25

Slice column in panda database and averaging results

If I have a pandas database such as:
timestamp label value new
etc. a 1 3.5
b 2 5
a 5 ...
b 6 ...
a 2 ...
b 4 ...
I want the new column to be the average of the last two a's and the last two b's... so for the first it would be the average of 5 and 2 to get 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's but I'm not sure how to get an average of just the last two. I'm kinda new to python and coding so this might not be possible idk.
Edit: I should also mention this is not for a class or anything this is just for something I'm doing on my own and that this will be on a very large dataset. I'm just using this as an example. Also I would want each A and each B to have its own value for the last 2 average so the dimension of the new column will be the same as the others. So for the third line it would be the average of 2 and whatever the next a would be in the data set.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd
data = {'label': ['a','b','a','b','a','b'], 'value':[1,2,5,6,2,4]}
df = pd.DataFrame(data)
grouped = df.groupby('label')
results = {'label':[], 'tail_mean':[]}
for item, grp in grouped:
subset_mean = grp.tail(2).mean()[0]
results['label'].append(item)
results['tail_mean'].append(subset_mean)
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of your results only, if you need them, plus a column with it merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; probably no reason to do it the longer way I showed here unless you also need to perform more operations like this that you could do inside the same loop.

Categories