Get mean and standard deviation from only certain columns in a DataFrame - Python

I have a DataFrame in which, for each time point (row), the columns A1, A2, A3; A4, A5, A6; ... are groups of 3 replicates. I would like to get the average and standard deviation for each group of 3 per row and add them to a new DataFrame.
I have tried:
new_df['A1-A3_mean']=np.mean(df[['A1','A2','A3']],axis=1)
new_df['A1-A3_std']=np.std(df[['A1','A2','A3']],axis=1)
which works but is quite manual and time consuming. I tried using groupby('Time').agg({'mean','std'}) but I don't know how to specify that it should always take 3 columns at a time. Ideally the resulting columns would be named A1-3_mean / A1-3_stdev.
Thanks in advance!

You can try:
N = 3
cols = list(df.drop(columns='Time'))
# map each column to its block label, e.g. A1, A2, A3 -> 'A1-A3'
mapper = {c: f'{cols[i - i % N]}-{cols[i - i % N + N - 1]}' for i, c in enumerate(cols)}
g = df[cols].rename(columns=mapper).groupby(level=0, axis=1)
out = pd.concat({x: g.agg(x) for x in ['mean', 'std']}, axis=1)
Output:
       mean               std
      A1-A3     A4-A6    A1-A3     A4-A6
0  4.666667  3.000000  2.886751  2.000000
1  2.666667  4.333333  1.154701  3.214550
2  6.333333  4.333333  2.309401  1.154701
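A more explicit sketch of the same idea, in case the rename/groupby step feels opaque: slice the replicate columns in blocks of 3 by position and compute both statistics per block. This assumes the time column is named 'Time'; also note that pandas' std uses ddof=1 while np.std defaults to ddof=0, so the values can differ slightly from the manual approach above.
N = 3
vals = df.drop(columns='Time')
new_df = pd.DataFrame(index=df.index)
for i in range(0, vals.shape[1], N):
    block = vals.iloc[:, i:i + N]                  # e.g. A1, A2, A3
    label = f'{block.columns[0]}-{block.columns[-1]}'
    new_df[f'{label}_mean'] = block.mean(axis=1)
    new_df[f'{label}_std'] = block.std(axis=1)     # pass ddof=0 to match np.std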

Related

Pandas Resample-Sum without Zero filling

When resampling a Series with mean aggregation (daily to monthly), missing datetimes are filled with NaNs, which is fine since we can simply remove them with .dropna(). However, with sum/total aggregation, missing datetimes are filled with zeros, which is technically correct but a bit bothersome, as masks are needed to remove them.
Is there a more efficient way to resample with an aggregate sum without zero-filling or masking? Preferably something similar to dropna(), but for dropping zeros.
For example:
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-03-01', '2000-03-02', '2000-05-01', '2000-05-02'])
# wanted output
# 2000-01-31 2.0
# 2000-03-31 2.0
# 2000-05-31 2.0
# ideal output but for aggregate sum.
ser.resample('M').mean().dropna()
# 2000-01-31 1.0
# 2000-03-31 1.0
# 2000-05-31 1.0
# not ideal
ser.resample('M').sum()
# 2000-01-31 2
# 2000-02-29 0
# 2000-03-31 2
# 2000-04-30 0
# 2000-05-31 2
Using .groupby() with pd.Grouper() shows exactly the same behavior as resampling.
# not ideal
ser.groupby(pd.Grouper(freq='M')).sum()
# 2000-01-31 2
# 2000-02-29 0
# 2000-03-31 2
# 2000-04-30 0
# 2000-05-31 2
Using .groupby() with index.year is also doable; however, there does not seem to be an equivalent 'identity' for the calendar month. Note that .index.month is not what we are after, since it would merge the same month across different years.
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2002-03-01', '2002-03-02', '2005-05-01', '2005-05-02'])
ser.groupby(ser.index.year).sum()
# 2000 2
# 2002 2
# 2005 2
Use pd.offsets.MonthEnd and add it to the DatetimeIndex of ser to create a month-end grouper, then use Series.groupby with this grouper and aggregate with sum or mean:
grp = ser.groupby(ser.index + pd.offsets.MonthEnd())
s1, s2 = grp.sum(), grp.mean()
Result:
print(s1)
2000-01-31 2
2002-03-31 2
2005-05-31 2
dtype: int64
print(s2)
2000-01-31 1.0
2002-03-31 1.0
2005-05-31 1.0
dtype: float64
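If you prefer to stay with resample, a minimal sketch of an alternative (assuming a pandas version where Resampler.sum supports the min_count parameter, 0.22+): passing min_count=1 makes empty bins NaN instead of 0, so they can be dropped with dropna() exactly as in the mean case.
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-03-01', '2000-03-02', '2000-05-01', '2000-05-02'])
# bins with fewer than min_count values become NaN instead of 0
ser.resample('M').sum(min_count=1).dropna()
# 2000-01-31    2.0
# 2000-03-31    2.0
# 2000-05-31    2.0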

How to apply a moving average in Python considering periodic boundary conditions in the data

I would like to perform a moving average that takes periodic boundary conditions into account. I will try to make myself clear.
I have this data:
Date,Q
1989-01-01 00:00,0
1989-01-02 00:00,1
1989-01-03 00:00,4
1989-01-04 00:00,6
1989-01-05 00:00,8
1989-01-06 00:00,10
1989-01-07 00:00,11
I would like to compute the moving average over 3 values: the previous, the current, and the next.
In particular, I would like an option of the "rolling" function where the first value (index 0 in Python) also takes the last value into account, and vice versa the last value takes the first one into account. This would give me a sort of periodic boundary condition.
Indeed, I have applied the following:
First, I read the dataframe
df = pd.read_csv(fname, index_col = 0, parse_dates=True)
then I apply the "rolling" as
df['Q'] = pd.Series(df["Q"].rolling(3, center=True).mean())
However, I get the following results:
Date
1989-01-01 NaN
1989-01-02 1.66
1989-01-03 3.66
1989-01-04 6
1989-01-05 8
1989-01-06 9.66
1989-01-07 NaN
I know that I could apply the "min_periods=1" option, but this is not what I want. Indeed, it is clear that in the second row the result is correct:
1.66 = (0+1+4)/3
However, I would like to have this result in the first row:
(0+1+11)/3
As you can see, the number 11 is the value of the last row. Similarly, in the last row I expect:
(10+11+0)/3
where 0 is the value of the first row.
Do you have any suggestions or ideas?
Thanks,
Diego
I would just duplicate the last value before the first one and the first value after the last one, sort the DataFrame, and do the rolling average. Afterwards it is enough to drop the added rows again:
# add the last value one day before the start, and the first value one day after the end
df.loc[df.index[0] - pd.offsets.Day(1), 'Q'] = df.iloc[-1]['Q']
df.loc[df.index[-2] + pd.offsets.Day(1), 'Q'] = df.iloc[0]['Q']  # index[-2]: the row just added sits at the end
df = df.sort_index()
df['Q'] = df["Q"].rolling(3, center=True).mean()
It gives as expected:
Q
Date
1989-01-01 4.000000
1989-01-02 1.666667
1989-01-03 3.666667
1989-01-04 6.000000
1989-01-05 8.000000
1989-01-06 9.666667
1989-01-07 7.000000
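An alternative sketch that avoids adding temporary rows: NumPy's 'wrap' padding mode implements exactly this periodic boundary condition, so you can pad the raw values and take a centered 3-point average with a convolution (this assumes the original 7-row df, before any rows were appended).
import numpy as np
q = df['Q'].to_numpy()
padded = np.pad(q, 1, mode='wrap')                   # prepend the last value, append the first
df['Q_avg'] = np.convolve(padded, np.ones(3) / 3, mode='valid')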

How to add a new row to the output of the pandas describe function in Python

Here is my Python question:
I am asked to generate an output table which contains the number of NaN values in each variable (there are more than 10 variables in the data), plus the min, max, mean, std, 25%, 50%, and 75%. I used the describe function in pandas to create the summary table, which gave me everything I want except the number of NaN values per variable. I am thinking about adding the NaN counts as a new row to the output generated by describe.
Can anyone help with this?
output = input_data.describe(include=[np.number])  # this gives the summary table
count_nan = input_data.isnull().sum(axis=0)  # this counts the NaN values in each variable
How can I add the second as a new row to the first table?
You could use .append to append a new row to a DataFrame:
In [21]: output.append(pd.Series(count_nan, name='nans'))
Out[21]:
0 1 2 3 4
count 4.000000 4.000000 4.000000 4.000000 4.000000
mean 0.583707 0.578610 0.566523 0.480307 0.540259
std 0.142930 0.358793 0.309701 0.097326 0.277490
min 0.450488 0.123328 0.151346 0.381263 0.226411
25% 0.519591 0.406628 0.478343 0.406436 0.429003
50% 0.549012 0.610845 0.607350 0.478787 0.516508
75% 0.613127 0.782827 0.695530 0.552658 0.627764
max 0.786316 0.969421 0.900046 0.582391 0.901610
nans 0.000000 0.000000 0.000000 0.000000 0.000000
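Note that DataFrame.append was removed in pandas 2.0. On current versions the same row can be added with label-based assignment; a minimal sketch using the question's variables:
# alignment is by column label, so extra non-numeric columns in count_nan are simply ignored
output.loc['nans'] = count_nan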

Definite numerical integration in a python pandas dataframe

I have a pandas DataFrame with a variable number of columns. I'd like to numerically integrate each column of the DataFrame so that I can evaluate the definite integral from row 0 to row n. I have a function that works on a 1D array, but is there a better way to do this on a pandas DataFrame so that I don't have to iterate over columns and cells? I was thinking of some way of using applymap, but I can't see how to make it work.
This is the function that works on a 1D array:
def findB(x, y):
    y_int = np.zeros(y.size)
    y_int_min = np.zeros(y.size)   # not used below
    y_int_max = np.zeros(y.size)   # not used below
    end = y.size - 1
    # trapezoid rule, accumulated element by element; the last element stays 0
    y_int[0] = (y[1] + y[0]) / 2 * (x[1] - x[0])
    for i in range(1, end, 1):
        j = i + 1
        y_int[i] = (y[j] + y[i]) / 2 * (x[j] - x[i]) + y_int[i - 1]
    return y_int
I'd like to replace it with something that calculates multiple columns of a dataframe all at once, something like this:
B_df = y_df.applymap(integrator)
EDIT:
Starting dataframe dB_df:
Sample1 1 dB Sample1 2 dB Sample1 3 dB Sample1 4 dB Sample1 5 dB Sample1 6 dB
0 2.472389 6.524537 0.306852 -6.209527 -6.531123 -4.901795
1 6.982619 -0.534953 -7.537024 8.301643 7.744730 7.962163
2 -8.038405 -8.888681 6.856490 -0.052084 0.018511 -4.117407
3 0.040788 5.622489 3.522841 -8.170495 -7.707704 -6.313693
4 8.512173 1.896649 -8.831261 6.889746 6.960343 8.236696
5 -6.234313 -9.908385 4.934738 1.595130 3.116842 -2.078000
6 -1.998620 3.818398 5.444592 -7.503763 -8.727408 -8.117782
7 7.884663 3.818398 -8.046873 6.223019 4.646397 6.667921
8 -5.332267 -9.163214 1.993285 2.144201 4.646397 0.000627
9 -2.783008 2.288842 5.836786 -8.013618 -7.825365 -8.470759
Ending dataframe B_df:
Sample1 1 B Sample1 2 B Sample1 3 B Sample1 4 B Sample1 5 B Sample1 6 B
0 0.000038 0.000024 -0.000029 0.000008 0.000005 0.000012
1 0.000034 -0.000014 -0.000032 0.000041 0.000036 0.000028
2 0.000002 -0.000027 0.000010 0.000008 0.000005 -0.000014
3 0.000036 0.000003 -0.000011 0.000003 0.000002 -0.000006
4 0.000045 -0.000029 -0.000027 0.000037 0.000042 0.000018
5 0.000012 -0.000053 0.000015 0.000014 0.000020 -0.000023
6 0.000036 -0.000023 0.000004 0.000009 0.000004 -0.000028
7 0.000046 -0.000044 -0.000020 0.000042 0.000041 -0.000002
8 0.000013 -0.000071 0.000011 0.000019 0.000028 -0.000036
9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
In the above example,
(x[j]-x[i]) = 0.000008
First of all, you can achieve this with vectorized operations. Each element of the integration is just the mean of the current and next y values, scaled by the corresponding difference in x. The final integral is just the cumulative sum of these elements. Something like:
def findB(x, y):
    """
    x : pandas.Series
    y : pandas.DataFrame
    """
    mean_y = (y[:-1] + y.shift(-1)[:-1]) / 2
    delta_x = x.shift(-1)[:-1] - x[:-1]
    scaled_int = mean_y.multiply(delta_x, axis='index')
    cumulative_int = scaled_int.cumsum(axis='index')
    return cumulative_int.shift(1).fillna(0)
Here DataFrame.shift and Series.shift are used to align the "next" elements with the current ones. You have to use DataFrame.multiply rather than the * operator to ensure that the proper axis is used ('index' rather than 'columns'). Finally, DataFrame.cumsum provides the integration step, and DataFrame.fillna ensures that you have a row of zeros as in the original solution. The advantage of using only native pandas functions is that you can pass in a DataFrame with any number of columns and have it operate on all of them simultaneously.
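If a SciPy dependency is acceptable, scipy.integrate.cumulative_trapezoid performs the same trapezoid-rule accumulation over all columns in one call. A sketch, assuming x holds the sample positions (evenly spaced 0.000008 apart, as in the question) and dB_df is the starting frame; note that the conventional zero lands in the first row rather than the last:
import numpy as np
import pandas as pd
from scipy.integrate import cumulative_trapezoid

x = np.arange(len(dB_df)) * 0.000008   # hypothetical, evenly spaced sample positions
B_df = pd.DataFrame(
    cumulative_trapezoid(dB_df.to_numpy(), x, axis=0, initial=0),
    index=dB_df.index,
    columns=dB_df.columns,
)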
Do you really need the numeric values of the integral? Maybe you just need a picture? Then it is easier using pyplot.
import matplotlib.pyplot as plt
# Introduce a column *bin* holding left limits of our bins.
df['bin'] = pd.cut(df['volume2'], 50).apply(lambda bin: bin.left)
# Group by bins and calculate *f*.
g = df[['bin', 'universe']].groupby('bin').sum()
# Plot the function using cumulative=True.
plt.hist(list(g.index), bins=50, weights=list(g['universe']), cumulative=True)
plt.show()

How to calculate rolling mean on a GroupBy object using Pandas?

How to calculate rolling mean on a GroupBy object using Pandas?
My Code:
df = pd.read_csv("example.csv", parse_dates=['ds'])
df = df.set_index('ds')
grouped_df = df.groupby('city')
I want to calculate the rolling mean on each of the groups in my GroupBy object.
I tried pd.rolling_mean(grouped_df, 3).
Here is the error I get:
AttributeError: 'DataFrameGroupBy' object has no attribute 'dtype'
Edit: Should I maybe iterate over the groups and calculate the rolling mean on each group as I go?
You could try iterating over the groups
In [39]: df = pd.DataFrame({'a':list('aaaaabbbbbaaaccccbbbccc'),"bookings":range(1,24)})
In [40]: grouped = df.groupby('a')
In [41]: for group_name, group_df in grouped:
....: print group_name
....: print pd.rolling_mean(group_df['bookings'],3)
....:
a
0 NaN
1 NaN
2 2.000000
3 3.000000
4 4.000000
10 6.666667
11 9.333333
12 12.000000
dtype: float64
b
5 NaN
6 NaN
7 7.000000
8 8.000000
9 9.000000
17 12.333333
18 15.666667
19 19.000000
dtype: float64
c
13 NaN
14 NaN
15 15
16 16
20 18
21 20
22 22
dtype: float64
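On modern pandas, where pd.rolling_mean no longer exists, the same per-group rolling mean can be computed directly on the GroupBy object without iterating. A sketch with the same toy data:
df = pd.DataFrame({'a': list('aaaaabbbbbaaaccccbbbccc'), "bookings": range(1, 24)})
# the result is indexed by (group label, original row index)
rolling = df.groupby('a')['bookings'].rolling(3).mean()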
You want the dates in your left column (the index) and the city values as separate columns. One way to do this is to set the index on date and city, and then unstack. This is equivalent to a pivot table. You can then compute your rolling mean in the usual fashion.
df = pd.read_csv("example.csv", parse_dates=['ds'])
df = df.set_index(['ds', 'city']).unstack('city')
rm = pd.rolling_mean(df, 3)
I wouldn't recommend wrapping this in a function, as the data for a given city can simply be selected as follows (the : selects all rows):
df.loc[:, city]
