Using describe() with weighted data -- mean, standard deviation, median, quantiles - python

I'm fairly new to Python and pandas (coming from SAS as my workhorse analytical platform), so I apologize in advance if this has already been asked and answered. (I've searched through the documentation as well as this site looking for an answer and haven't been able to find one yet.)
I've got a dataframe (called resp) containing respondent-level survey data. I want to perform some basic descriptive statistics on one of the fields (called anninc, short for annual income):
resp["anninc"].describe()
Which gives me the basic stats:
count 76310.000000
mean 43455.874862
std 33154.848314
min 0.000000
25% 20140.000000
50% 34980.000000
75% 56710.000000
max 152884.330000
dtype: float64
But there's a catch. Given how the sample was built, the respondent data had to be weight-adjusted so that not everyone is treated as "equal" during the analysis. I have another column in the dataframe (called tufnwgrp) that represents the weight that should be applied to each record during the analysis.
In my prior SAS life, most of the procs have options to process data with weights like this. For example, a standard proc univariate to give the same results would look something like this:
proc univariate data=resp;
var anninc;
output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;
And the same analysis using weighted data would look something like this:
proc univariate data=resp;
var anninc;
weight tufnwgrp;
output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;
Is there a similar sort of weighting option available in pandas for methods like describe() etc?

There is a statistics and econometrics library (statsmodels) that appears to handle this. Here's an example that extends @MSeifert's answer to a similar question.
import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})
wdf = DescrStatsW(df.x, weights=df.wt, ddof=1)
print(wdf.mean)
print(wdf.std)
print(wdf.quantile([0.25, 0.50, 0.75]))
67.0
23.6877840059
p
0.25 50
0.50 71
0.75 87
I don't use SAS, but this gives the same answer as the Stata command:
sum x [fw=wt], detail
Stata actually has a few weight options, and in this case it gives a slightly different answer if you specify aw (analytic weights) instead of fw (frequency weights). Also, Stata requires fw to be an integer, whereas DescrStatsW allows non-integer weights. Weights are more complicated than you'd think... This is starting to get into the weeds, but there is a great discussion of the weighting issues involved in calculating the standard deviation here.
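As a sanity check on the frequency-weight interpretation (a small sketch, not part of the original answer): with integer weights, DescrStatsW with ddof=1 should match ordinary statistics computed on the data expanded with np.repeat.
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})
expanded = np.repeat(df.x.to_numpy(), df.wt.to_numpy())  # repeat each x, wt times

wdf = DescrStatsW(df.x, weights=df.wt, ddof=1)
print(np.isclose(wdf.mean, expanded.mean()))          # True
print(np.isclose(wdf.std, expanded.std(ddof=1)))      # True
print(np.percentile(expanded, [25, 50, 75]))          # [50. 71. 87.]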
Also note that DescrStatsW does not appear to include functions for min and max, but as long as your weights are non-zero this should not be a problem, since the weights don't affect the min and max. However, if you did have some zero weights, it might be nice to have a weighted min and max; that's easy to calculate in pandas:
df.x[ df.wt > 0 ].min()
df.x[ df.wt > 0 ].max()
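Putting the pieces together, here is one possible describe()-like summary for weighted data (a sketch; weighted_describe is a made-up helper name, not a pandas or statsmodels function):
import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

def weighted_describe(values, weights):
    """Return a Series resembling describe(), but using the weights."""
    d = DescrStatsW(values, weights=weights, ddof=1)
    q = d.quantile([0.25, 0.50, 0.75], return_pandas=True)
    nonzero = weights > 0            # weights don't move min/max, but drop zero-weight rows
    return pd.Series({
        "count": weights.sum(),      # sum of weights, not row count
        "mean": d.mean,
        "std": d.std,
        "min": values[nonzero].min(),
        "25%": q.loc[0.25],
        "50%": q.loc[0.50],
        "75%": q.loc[0.75],
        "max": values[nonzero].max(),
    })

# e.g. weighted_describe(resp["anninc"], resp["tufnwgrp"])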

Related

Python Pandas DataFrame Describe Gives Wrong Results?

While I was looking at the results of the describe() method, I noticed something very strange. The data is the House Prices data from Kaggle. Below, you can see the code and the result for the "Condition2" feature:
train.groupby(train["Condition2"].fillna('None'))["SalePrice"].describe()
On the other hand, when I look at the data in Excel, the quantiles do not match.
If 33% of the data points have a SalePrice of 85K, how can the 25% quantile be 95.5K? It seems really weird, or maybe I'm missing something. Could anybody explain this?
Quartiles seek to divide the data into four equal groups, so with a small sample like this (n=6) the value of the 25% quantile isn't necessarily going to be one of your observed values. There are different methods of calculating quantile values. describe() uses the linear method, described in the docs as:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
If you switch to the lower method, it produces different results:
>>> feedr = train_df[train_df.Condition2 == 'Feedr']
>>> feedr.SalePrice.quantile(.50)
127500.0
>>> feedr.SalePrice.quantile(.50, interpolation='lower')
127000
>>> feedr.SalePrice.quantile(.25, interpolation='lower')
85000
>>> feedr.SalePrice.quantile(.25)
95500.0
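To see what the linear rule is doing, here is a small worked illustration on toy numbers (not the Kaggle data):
import pandas as pd

s = pd.Series([10, 20, 40, 80, 160, 320])      # n = 6, already sorted

# The 25% quantile sits at position 0.25 * (n - 1) = 1.25,
# i.e. between the 2nd value (i = 20) and the 3rd value (j = 40).
pos = 0.25 * (len(s) - 1)
i, j = s.iloc[int(pos)], s.iloc[int(pos) + 1]
fraction = pos - int(pos)

print(i + (j - i) * fraction)                  # 25.0 (the docs' formula by hand)
print(s.quantile(0.25))                        # 25.0 (linear, the default)
print(s.quantile(0.25, interpolation='lower')) # 20   (just takes i)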

Regarding min_periods in the corr() function in Python

I am trying to create a correlation matrix. This is regarding the documentation here on min_periods. What I understand is that min_periods is the number of days over which the correlation is calculated? So for example:
corr = df['Close'].corr(method= 'pearson', min_periods=10)
This would give me the correlation between pairs as calculated on a shifting 10-day basis? Please let me know if I understand it right.
It means you need at least 10 valid pairs of observations; otherwise the result will be np.nan. The documentation states:
Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.
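For illustration (a small sketch, not from the original answer), here is how min_periods behaves on DataFrame.corr() when one column pair has too few complete observations:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, np.nan, np.nan, np.nan, 9.0],   # only 2 complete pairs with "a"
})

print(df.corr(method="pearson"))                 # the a/b entry is computed from the 2 complete pairs
print(df.corr(method="pearson", min_periods=3))  # the a/b entry is NaN because 2 < 3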

How to reverse a seasonal log difference of timeseries in python

Could you please help me with this issue? I have searched a lot but cannot solve it. I have a multivariate dataframe of electricity consumption and I am forecasting with a VAR (vector autoregression) model for time series.
I made the predictions, but I need to reverse the transformed series (energy_log_diff), since I applied a seasonal log difference to make the series stationary, in order to get back the real energy values:
df['energy_log'] = np.log(df['energy'])
df['energy_log_diff'] = df['energy_log'] - df['energy_log'].shift(1)
For that, I did first:
df['energy'] = np.exp(df['energy_log_diff'])
This is supposed to give the energy difference between two values lagged by 365 days, but I am not sure about this either.
How can I do this?
The reason we use log differences is that they are additive, so we can take the cumulative sum and then multiply by the last observed value.
last_energy=df['energy'].iloc[-1]
df['energy']=(np.exp(df['energy'].cumsum())*last_energy)
As for seasonality: if you de-seasoned the log diff, simply add (or multiply) it back before you do the above step; if you de-seasoned the original series, then add it back after.
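As a quick illustration of why this works (toy numbers, not the questioner's data, and not part of the original answer), you can log-difference a short series and rebuild the "forecast" part from the cumulative sum of the diffs scaled by the last observed level:
import numpy as np
import pandas as pd

energy = pd.Series([100.0, 110.0, 121.0, 133.1, 146.41])
log_diff = np.log(energy).diff()          # what the model is trained on

last_observed = energy.iloc[2]            # pretend the last 2 points are "forecasts"
future_diffs = log_diff.iloc[3:]          # predicted log differences
rebuilt = np.exp(future_diffs.cumsum()) * last_observed

print(rebuilt.values)                     # [133.1, 146.41] (up to float error)
print(energy.iloc[3:].values)             # matches the original levels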
Short answer - you have to run inverse transformations in the reversed order which in your case means:
Inverse transform of differencing
Inverse transform of log
How to convert differenced forecasts back is described e.g. here (it has an R tag, but there is no code and the idea is the same for Python). In your post you calculate the exponential, but you have to reverse the differencing first before doing that.
You could try this:
energy_log_diff_rev = []
v_prev = v_0
for v in df['energy_log_diff']:
    v_prev += v
    energy_log_diff_rev.append(v_prev)
Or, if you prefer pandas way, you can try this (only for the first order difference):
energy_log_diff_rev = df['energy_log_diff'].expanding(min_periods=0).sum() + v_0
Note the v_0 value, which is the original value (after the log transformation, before differencing); it is described in the link above.
Then, after this step, you can do the exponential (inverse of log):
energy_orig = np.exp(energy_log_diff_rev)
Notes/Questions:
You mention values lagged by 365, but you are shifting the data by 1. Does that mean you have yearly data? Or would you rather do df['energy_log_diff'] = df['energy_log'] - df['energy_log'].shift(365) instead (in the case of daily data)? A sketch for that seasonal case follows below.
You want to reverse-transform the predicted time series, is that right? Or am I missing something? In that case you would apply the inverse transformations to the predictions, not to the data I used above for explanation.
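For the seasonal (shift(365)) variant mentioned in the first note, a possible sketch (invert_seasonal_log_diff is a hypothetical helper, not a library function): each reconstructed log value needs the observed log value from 365 steps earlier, so you walk forward one step at a time using the tail of the observed history.
import numpy as np
import pandas as pd

def invert_seasonal_log_diff(observed, predicted_diffs, season=365):
    """observed: pandas Series of original values up to the forecast start
    (must contain at least `season` points);
    predicted_diffs: pandas Series of forecast log(x[t]) - log(x[t-season])."""
    history = list(np.log(observed.to_numpy()))
    levels = []
    for d in predicted_diffs:
        log_x = d + history[-season]    # log(x[t]) = diff + log(x[t-season])
        history.append(log_x)
        levels.append(np.exp(log_x))
    return pd.Series(levels, index=predicted_diffs.index)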

Unable to use certain basic statistical functions in Pandas groupby aggregate

I have an experiment where 'depth' is measured for varying 'force' and 'scanning speeds'. Five runs are conducted for each set of variables.
I have to compute the maximum depth measured across the five runs as well as the standard deviation of the measurements. To this end, I have constructed a Pandas dataframe as follows:
force scanspeed depth
0 0.5 10 3.541
1 0.5 20 2.531
2 0.5 10 3.020
3 1 10 2.130
4 0.5 20 1.502
5 0.5 10 4.102
6 2 50 2.413
...
(100+ rows)
For this dataframe, I want to groupby using the force and scanspeed columns and generate the maximum and standard deviation for each group (there are multiple rows with the same force and scanspeed). However, in running the following line:
print(subframe.groupby(['force', 'scanspeed'])['depth'].agg([max, std]))
the function std is not recognized, prompting NameError: name 'std' is not defined.
Other functions found not to work include: mean, median, corr, var, count, np.std. I have not tested the full range of functions available but so far it seems like only max and min work despite all of these functions coming from the same pandas library (aside from np.std of course).
I'd appreciate any help regarding this issue.
If you're sure that np.std is otherwise accessible in that statement's scope, agg also allows you to pass the names of certain functions as strings:
print(subframe.groupby(['force', 'scanspeed'])['depth'].agg([max, 'std']))
That line seemed to work for me without importing anything except for pandas.
Otherwise, try a call like np.std([0, 1]) right before that statement to make sure it doesn't throw an error as well, or try adding import numpy as np on the line right before.
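For completeness, here's a small self-contained sketch (built from the rows shown in the question) using only string names, which doesn't depend on builtins or numpy being in scope at all; the named-aggregation form also lets you label the output columns:
import pandas as pd

subframe = pd.DataFrame({
    'force':     [0.5, 0.5, 0.5, 1.0, 0.5, 0.5, 2.0],
    'scanspeed': [10,  20,  10,  10,  20,  10,  50],
    'depth':     [3.541, 2.531, 3.020, 2.130, 1.502, 4.102, 2.413],
})

# string names resolve to the pandas implementations
print(subframe.groupby(['force', 'scanspeed'])['depth'].agg(['max', 'std']))

# the same thing with explicit output column names
print(subframe.groupby(['force', 'scanspeed']).agg(
    max_depth=('depth', 'max'),
    std_depth=('depth', 'std'),
))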

Understanding the percentiles= calculation in describe() in Python

I am trying to understand the following:
1) How are the percentiles calculated?
2) Why did Python not return the values to me in sorted order (which was my expectation)?
3) My requirement is to know the actual value below which x% of the population lies. How do I do that?
Thanks
Python 2:
import pandas as pd

new = pd.DataFrame({'a': range(10), 'b': [60510, 60053, 54968, 62269, 91107, 29812, 45503, 6460, 62521, 37128]})
print new.describe(percentiles=[0, 0.1, 0.2, 0.3, 0.4, 0.50, 0.6, 0.7, 0.8, 0.90, 1])
1) How are the percentiles calculated?
The 90th percentile/quantile means 10% of the data is greater than that value and 90% of the data falls below it. By default, it's based on linear interpolation. This is why in your a column the values increment by 0.9 instead of landing on the original data values of [0, 1, 2, ...]. If you want to use the nearest observed values instead of interpolation, you can use the quantile method instead of describe and change the interpolation parameter.
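For example (an illustration using the new DataFrame from the question), quantile() with interpolation='nearest' reports actual observed values rather than interpolated ones:
print(new['a'].quantile([0.1, 0.5, 0.9], interpolation='nearest'))
print(new['b'].quantile([0.1, 0.5, 0.9], interpolation='nearest'))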
2) Why did Python not return the values in sorted order (which was my expectation)?
Your question is unclear here. It does return values in a sorted order, indexed according to the .describe output: count, mean, std, min, quantiles from low to high, max. If you only want quantiles and not the other statistics, you can use the quantile method instead.
3) My requirement is to know the actual value below which x% of the population lies. How do I do that?
Nothing is wrong with the output. Those quantiles are accurate, although they aren't very meaningful when your data only has 10 observations.
Edit: It wasn't originally clear to me that you were attempting to do stats on a frequency table. I don't know of a direct solution in pandas that doesn't involve moving your data over to a numpy array. You could use numpy.repeat to get a raw list of observations to put back into pandas and run descriptive stats on:
import numpy as np

vals = np.array(new.a)
freqs = np.array(new.b)
observations = np.repeat(vals, freqs)
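Then descriptive statistics on the expanded array behave as if each value of a had been repeated b times; for example (an illustrative follow-up, not from the original answer, assuming pandas is imported as pd):
expanded = pd.Series(observations)
print(expanded.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]))
# a single cutoff: the value below which 90% of the weighted population lies
print(np.percentile(observations, 90))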
