Unable to use certain basic statistical functions in Pandas groupby aggregate - python

I have an experiment where 'depth' is measured for varying 'force' and 'scanning speeds'. Five runs are conducted for each set of variables.
I have to compute the maximum depth measured across the five runs as well as the standard deviation of the measurements. To this end, I have constructed a Pandas dataframe as follows:
force scanspeed depth
0 0.5 10 3.541
1 0.5 20 2.531
2 0.5 10 3.020
3 1 10 2.130
4 0.5 20 1.502
5 0.5 10 4.102
6 2 50 2.413
...
(100+ rows)
For this dataframe, I want to groupby using the force and scanspeed columns and generate the maximum and standard deviation for each group (there are multiple rows with the same force and scanspeed). However, in running the following line:
print(subframe.groupby(['force', 'scanspeed'])['depth'].agg([max, std]))
the function std is not recognized, prompting NameError: name 'std' is not defined.
Other functions found not to work include: mean, median, corr, var, count, np.std. I have not tested the full range of functions available but so far it seems like only max and min work despite all of these functions coming from the same pandas library (aside from np.std of course).
I'd appreciate any help regarding this issue.

If you're sure that np.std is otherwise accessible in that statement's scope, agg also allows you to pass the names of certain functions as strings:
print(subframe.groupby(['force', 'scanspeed'])['depth'].agg([max, 'std']))
That line seemed to work for me without importing anything except for pandas.
Otherwise, maybe try a call like np.std([0,1]) right before that statement to make sure it doesn't throw an error as well. Or you could try putting in import numpy as np on the line right before.
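For reference, here is a minimal, self-contained sketch of the string-name approach, built from the sample values in the question (the toy frame below is just an illustration):
import pandas as pd

# Toy frame built from the sample rows in the question.
subframe = pd.DataFrame({
    'force':     [0.5, 0.5, 0.5, 1.0, 0.5, 0.5],
    'scanspeed': [10, 20, 10, 10, 20, 10],
    'depth':     [3.541, 2.531, 3.020, 2.130, 1.502, 4.102],
})

# Passing the aggregation names as strings ('max', 'std') avoids relying on
# bare identifiers like std being defined in the calling scope.
print(subframe.groupby(['force', 'scanspeed'])['depth'].agg(['max', 'std']))
Groups with a single measurement will show NaN for the standard deviation, since pandas uses ddof=1 by default.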

Related

Windows of difference between 2 time series

I am trying to find 3 areas of difference between 2 time series. I am able to see the difference between the 2, but I want to eventually automatically detect the biggest difference and the smallest between the 2 curves. Using the following code I can view the difference between the 2 curves, but I want to be able to find the 3 areas (chronologically) by defining a number of points or a time period, like in the image. So, for example, find 3 windows of a week each where the difference is small, then big, and then small again. Any idea if there is a built-in function for this?
Thank you
ax.fill_between(
    x=feature.reset_index().index,
    y1=feature[1],   # feature.1 is not valid Python; positional columns need bracket indexing
    y2=feature[2],
    alpha=0.3
)
The 2 time series and the 3 areas that I would like to find
As a concept:
Define a large time window from t_0 to T, find the initial minimum in the difference of the two series (i.e. the minimum of the spread), and record the time at which it occurs. If you have an aligned DataFrame of the two series, this is rudimentary: find the minimum of the difference and look up the loc of that item to identify its time within the window.
Then restrict your search to t_min_1 to T and search for the maximum, again obtaining the loc of that maximum value in the spread. Lastly, search over t_max to T for a local minimum within the spread and find the loc of that value.
For your given window, this returns the times of the first minimum (t_min_1), the following maximum (t_max), and the final minimum (t_min_2), one for each event.
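A minimal sketch of that idea, under the assumption that the two series are aligned in a time-indexed DataFrame (the column names and toy data below are made up for illustration):
import numpy as np
import pandas as pd

# Toy example: two aligned daily series; the data is purely illustrative.
idx = pd.date_range('2020-01-01', periods=90, freq='D')
t = np.linspace(0, 6, 90)
df = pd.DataFrame({'series_a': np.sin(t),
                   'series_b': np.sin(t) + 0.5 * np.sin(2 * t)}, index=idx)

# Spread = magnitude of the difference between the two curves.
spread = (df['series_a'] - df['series_b']).abs()

t_min_1 = spread.idxmin()               # time of the initial minimum of the spread
t_max = spread.loc[t_min_1:].idxmax()   # time of the maximum after that minimum
t_min_2 = spread.loc[t_max:].idxmin()   # time of the following minimum
print(t_min_1, t_max, t_min_2)
From those three times you can then slice out the windows themselves, e.g. df.loc[t_min_1:t_max].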

Pandas performance improvement over pd.apply with eval()

I'm trying to optimize my script performance via Pandas. I'm running into a roadblock where I need to apply a large number of filters to a DataFrame and store a few totals from the results.
Currently the fastest way I can make this happen is running a For Loop on the list of filters (as strings) and using eval() to calculate the totals:
for filter_str in filter_list:
    data_filtered = data[eval(filter_str)]
    avg_change = data_filtered['NewChangePerc'].mean()
Here's my attempt at using pd.apply() to speed this up because I can't think of a vectorized way to make it happen (the filters are in a DataFrame this time instead of a list):
def applying(x):
    f = data[eval(x)]
    avg_change = f['NewChangePerc'].mean()
    return avg_change  # return the value so apply() actually collects it

filter_df.processed.apply(applying)
The main goal is to simply make it as fast as possible. What I don't understand is why a For Loop is faster than pd.apply(). It's about twice as fast.
Any input would be greatly appreciated.
UPDATE
Here's more specifics about what I'm trying to accomplish:
Take a data set of roughly 67 columns and 2500 rows.
Code Name ... Price NewChangePerc
0 JOHNS33 Johnson, Matthew ... 0.93 0.388060
1 QUEST01 Questlove, Inc. ... 18.07 0.346498
2 773NI01 773 Entertainment ... 1.74 0.338462
3 CLOVE03 Cloverfield, Sam ... 21.38 0.276418
4 KITET08 Kite Teal 6 ... 0.38 0.225806
Then take a list of filters.
['Price > 5.0', 'NewChangePerc < .02']
Apply each filter to the data and calculate certain values, such as the average NewChangePerc.
For example, when applying 'Price > 5.0', the average NewChangePerc would be ~0.31.
Now grow that list of filters to a length of about 1,000, and it starts to take some time to run. I need to cut that time down as much as possible. I've run out of ideas and can't find any solutions beyond what I've listed above, but they're just too slow (~0.86s for 50 filters with the For Loop; ~1.65s for 50 filters with pd.apply()). Are there any alternatives?
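For reference, here is a minimal, runnable reconstruction of the filter loop using the sample rows and filters above. Since the filter strings reference column names directly, I've assumed they are meant to be evaluated against the frame, so this sketch uses DataFrame.eval rather than the built-in eval():
import pandas as pd

# Sample rows taken from the question (other columns omitted).
data = pd.DataFrame({
    'Code': ['JOHNS33', 'QUEST01', '773NI01', 'CLOVE03', 'KITET08'],
    'Price': [0.93, 18.07, 1.74, 21.38, 0.38],
    'NewChangePerc': [0.388060, 0.346498, 0.338462, 0.276418, 0.225806],
})

filter_list = ['Price > 5.0', 'NewChangePerc < .02']

results = {}
for filter_str in filter_list:
    mask = data.eval(filter_str)          # boolean Series built from the string expression
    avg_change = data[mask]['NewChangePerc'].mean()
    results[filter_str] = avg_change
print(results)
For 'Price > 5.0' this prints roughly 0.31, matching the figure quoted above.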

Conditional Rolling Sum using filter on groupby group rows

I've been trying without success to find a way to create an "average_gain_up" in Python and have gotten a bit stuck. Being new to groupby, there is something about how it treats functions that I've not managed to grasp, so any intuition on how to think through these types of problems would be helpful.
Problem:
Create a rolling 14-day sum, only summing values that are > 0.
import pandas as pd

new = pd.DataFrame([[1, -2, 3, -2, 4, 5], ['a', 'a', 'a', 'b', 'b', 'b']])
new = new.T  # transposing into a friendly groupby format
# Group by a or b, filter to only the positive values, then take a rolling sum;
# we keep NAs to ensure the sum is run over 14 values.
groupby = new.groupby(1)[0].filter(lambda x: x > 0, dropna=False).rolling(14).sum()
Intended Sum Frame:
x.all()/len(x) result:
This throws a TypeError: "the filter must return a boolean result".
From reading other answers, I understand this is because I'm asking whether a whole Series/DataFrame is greater than 0.
The above code does run with len(x), which again makes sense in that context.
I tried with .all() as well, but it doesn't behave as intended: .all() returns a single boolean per group, and the sum then becomes just a simple rolling sum.
I've also tried creating a list of booleans to say which values are positive and which are not, but that yields an error too; this time I'm not sure why.
groupby1=new.groupby(1)[0]
groupby2=[y>0 for x in groupby1 for y in x[1] ]
groupby_try=new.groupby(1)[0].filter(lambda x:groupby2,dropna=False).rolling(2).sum()
1) How do I make the above code work, and what is wrong in how I am thinking about it?
2) Is this the "best practice" way to do these types of operations?
Any help appreciated; let me know if I've missed anything or if any further clarification is needed.
According to the doc on filter after a groupby, it is not meant to filter values within a group but to drop groups as a whole when they don't meet some criterion; in the first example given in the docs, a group is kept only if the sum of all its elements is above 2.
One way could be to first replace all the negative values in new[0] with 0, using np.clip for example, and then do the groupby, rolling and sum, such as:
print (np.clip(new[0],0,np.inf).groupby(new[1]).rolling(2).sum())
1
a 0 NaN
1 1.0
2 3.0
b 3 NaN
4 4.0
5 9.0
Name: 0, dtype: float64
This way avoids modifying the data in new; if you don't mind modifying it, you can change column 0 with new[0] = np.clip(new[0], 0, np.inf) and then do new.groupby(1)[0].rolling(2).sum(), which gives the same result.
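Spelled out, that in-place variant would look something like the sketch below (same toy frame as in the question). Note that I've added an explicit astype(float), since the transpose in the question leaves column 0 with object dtype, which the rolling sum may refuse:
import numpy as np
import pandas as pd

new = pd.DataFrame([[1, -2, 3, -2, 4, 5], ['a', 'a', 'a', 'b', 'b', 'b']]).T
new[0] = new[0].astype(float)          # the transpose leaves column 0 as object dtype
new[0] = np.clip(new[0], 0, np.inf)    # overwrite column 0 in place: negatives become 0
print(new.groupby(1)[0].rolling(2).sum())
This prints the same result as the non-destructive version above.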

Why max and min functions are returning unexpected results using Pandas?

I am using ECG data in csv format and read the data as:
myECG = pd.read_csv('ECG_MIT.csv');
Then I extracted a column called 'ECG' from the data read above (I am calling it ECG_data) and attempted to derive some useful metrics. These include the following.
print 'Max val in ECG: ', ECG_data.max(); #reports 1023
print 'Min val in ECG: ', ECG_data.min(); # reports 0
The results are wrong: via Excel's max and min functions I see that the max value is actually 800 and the min value is 474. I also printed sample values and checked, and tried alternate forms like "max(ECG_data)" and "min()".
Also, when I use:
print "Data Summary: \n",myECG.describe()
I seem to see the same wrong values in the statistics reported. What am I doing wrong here? Pls help. Thanks.
A few suggestions for you birdie:
1. Based on your examples, I'm going to assume your data is all integers.
2. Next step will be to validate that. If it's not integers, then convert it.
3. Sort your data in excel ascending to confirm what your excel min and max functions are yielding.
4. How does that differ from pandas?
5. In pandas, try calling the min or max function with the column name.
Hope this helps!
0 and 1023 happen to be the min and max values of a 10 bit integer.
So you are probably getting the min/max limits of some underlying object (buffers, for example, tend to grow in powers of 2, and 1024 is one of them).
You will need to check if ECG_data is the right type of object, and if you're using the min()/max() functions in the correct way.
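Along the lines of the suggestions above (check the type, convert if needed, and call min/max on the named column), a quick sketch of what that check might look like; the file and column names are taken from the question and assumed to exist:
import pandas as pd

myECG = pd.read_csv('ECG_MIT.csv')       # assumes the file from the question is present
print(myECG['ECG'].dtype)                # an object dtype here would explain the odd min/max

# Force a numeric conversion; anything non-numeric becomes NaN instead of skewing min/max.
ECG_data = pd.to_numeric(myECG['ECG'], errors='coerce')
print('Max val in ECG:', ECG_data.max())
print('Min val in ECG:', ECG_data.min())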

Using describe() with weighted data -- mean, standard deviation, median, quantiles

I'm fairly new to python and pandas (coming from SAS as my workhorse analytical platform), so I apologize in advance if this has already been asked and answered. (I've searched through the documentation as well as this site and haven't been able to find an answer yet.)
I've got a dataframe (called resp) containing respondent level survey data. I want to perform some basic descriptive statistics on one of the fields (called anninc [short for annual income]).
resp["anninc"].describe()
Which gives me the basic stats:
count 76310.000000
mean 43455.874862
std 33154.848314
min 0.000000
25% 20140.000000
50% 34980.000000
75% 56710.000000
max 152884.330000
dtype: float64
But there's a catch. Given how the sample was built, the respondent data needs to be weight-adjusted so that not everyone is treated as "equal" in the analysis. I have another column in the dataframe (called tufnwgrp) that represents the weight that should be applied to each record during the analysis.
In my prior SAS life, most of the procs have options to process data with weights like this. For example, a standard proc univariate giving the same results would look something like this:
proc univariate data=resp;
var anninc;
output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;
And the same analysis using weighted data would look something like this:
proc univariate data=resp;
var anninc;
weight tufnwgrp;
output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;
Is there a similar sort of weighting option available in pandas for methods like describe() etc?
There is a statistics and econometrics library (statsmodels) that appears to handle this. Here's an example that extends @MSeifert's answer here on a similar question.
import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})
wdf = DescrStatsW(df.x, weights=df.wt, ddof=1)
print(wdf.mean)
print(wdf.std)
print(wdf.quantile([0.25, 0.50, 0.75]))
67.0
23.6877840059
p
0.25 50
0.50 71
0.75 87
I don't use SAS, but this gives the same answer as the stata command:
sum x [fw=wt], detail
Stata actually has a few weight options and in this case gives a slightly different answer if you specify aw (analytical weights) instead of fw (frequency weights). Also, Stata requires fw to be an integer whereas DescrStatsW allows non-integer weights. Weights are more complicated than you'd think... This is starting to get into the weeds, but there is a great discussion of weighting issues for calculating the standard deviation here.
Also note that DescrStatsW does not appear to include functions for min and max, but as long as your weights are non-zero this should not be a problem as the weights don't affect the min and max. However, if you did have some zero weights, it might be nice to have weighted min and max, but it's also easy to calculate in pandas:
df.x[ df.wt > 0 ].min()
df.x[ df.wt > 0 ].max()
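As a quick cross-check of the weighted mean, numpy's np.average accepts a weights argument directly and agrees with DescrStatsW on the toy data above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})
print(np.average(df.x, weights=df.wt))   # 67.0, matching DescrStatsW's weighted mean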
