Pandas describe vs scipy.stats percentileofscore with NaN? - python

I'm running into a weird situation where df.describe() gives me percentile markers that disagree with scipy.stats.percentileofscore, because of NaNs, I think.
My df is:
f_recommend
0 3.857143
1 4.500000
2 4.458333
3 NaN
4 3.600000
5 NaN
6 4.285714
7 3.587065
8 4.200000
9 NaN
When I run df.describe(percentiles=[.25, .5, .75]) I get:
f_recommend
count 7.000000
mean 4.069751
std 0.386990
min 3.587065
25% 3.728571
50% 4.200000
75% 4.372024
max 4.500000
I get the same values when I run with NaN removed.
When I look up a specific value, however, running scipy.stats.percentileofscore(df['f_recommend'], 3.61, kind='mean') gives me the 28th percentile with the NaNs present and the 20th without.
Any thoughts to explain this discrepancy?
ETA:
I don't believe the problem is that we're calculating percentiles differently. That only matters when you compute the percentile between the same two numbers in different ways. Here, describe gives the 25th percentile as 3.72, so there is absolutely no way that 3.61 can be the 28th percentile; none of the formulas should give that.
In particular, when I use describe on the 7 values without NaN, I get the same results, so describe is ignoring NaN, which is fine. But when I run percentileofscore without the NaNs, I get a number that doesn't match.
ETA 2:
Simpler example:
In [48]: d = pd.DataFrame([1,2,3,4,5,6,7])
In [49]: d.describe()
Out[49]:
0
count 7.000000
mean 4.000000
std 2.160247
min 1.000000
25% 2.500000
50% 4.000000
75% 5.500000
max 7.000000
In [50]: sp.stats.percentileofscore(d[0], 2.1, kind = 'mean')
Out[50]: 28.571428571428573
the "kind" argument doesn't matter because 2.1 is unique.

scipy.stats.percentileofscore does not ignore nan, nor does it check for the value and handle it in some special way. It is just another floating point value in your data. This means the behavior of percentileofscore with data containing nan is undefined, because of the behavior of nan in comparisons:
In [44]: np.nan > 0
Out[44]: False
In [45]: np.nan < 0
Out[45]: False
In [46]: np.nan == 0
Out[46]: False
In [47]: np.nan == np.nan
Out[47]: False
Those results are all correct--that is how nan is supposed to behave. But that means, in order to know how percentileofscore handles nan, you have to know how the code does comparisons. And that is an implementation detail that you shouldn't have to know, and that you can't rely on to be the same in future versions of scipy.
If you investigate the behavior of percentileofscore, you'll find that it behaves as if nan were infinite. For example, if you replace the nans with a value larger than any other value in the input, you'll get the same results:
In [53]: percentileofscore([10, 20, 25, 30, np.nan, np.nan], 18)
Out[53]: 16.666666666666664
In [54]: percentileofscore([10, 20, 25, 30, 999, 999], 18)
Out[54]: 16.666666666666664
Unfortunately, you can't rely on this behavior. If the implementation changes in the future, nan might end up behaving like negative infinity, or have some other unspecified behavior.
The solution to this "problem" is simple: don't give percentileofscore any nan values. You'll have to clean up your data first. Note that this can be as simple as:
result = percentileofscore(a[~np.isnan(a)], score)
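Applied to the data from the question, that cleanup step looks like the following (a minimal sketch; the array below is the f_recommend column from the question, and 3.61 is the score being looked up):

```python
import numpy as np
from scipy.stats import percentileofscore

# The f_recommend column from the question, NaNs included
a = np.array([3.857143, 4.5, 4.458333, np.nan, 3.6, np.nan,
              4.285714, 3.587065, 4.2, np.nan])

# Drop the NaNs before scoring; with them present the result is
# implementation-dependent.
clean = a[~np.isnan(a)]
pct = percentileofscore(clean, 3.61, kind='mean')
print(pct)
```

With the three NaNs removed, two of the seven remaining values fall below 3.61, so the score lands at 200/7 ≈ 28.57, consistent with the 25th-percentile marker of 3.72 that describe reported.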

The answer is very simple.
There is no universally accepted formula for computing percentiles, in particular when your data contains ties or when it cannot be perfectly broken down in equal-size buckets.
For instance, have a look at the documentation in R: there are nine different types of formulas! https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
At the end, it comes down to understanding which formula is used and whether the differences are big enough to be a problem in your case.
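To see the point concretely, here is a sketch comparing a few of the conventions pandas exposes through the interpolation parameter of Series.quantile, using the 1..7 series from the question:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7])

# The 25th percentile falls between the sorted values 2 and 3;
# different conventions resolve that position differently.
for interp in ['linear', 'lower', 'higher', 'midpoint']:
    print(interp, s.quantile(0.25, interpolation=interp))
```

'linear' gives the 2.5 that describe reported, while 'lower' and 'higher' give 2 and 3. None of these is wrong; they are simply different conventions.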

Calculating the mean of a dataframe column that contains missing values

Let's take an example.
Suppose we have a dataframe with a column named "f1":
f1 : {2, 4, NaN, 1, NaN, 15}
When we apply mean imputation to it, we write code like this:
dataframe['f1'].fillna(dataframe['f1'].mean())
My doubt is about what dataframe['f1'].mean() does with the NaNs. I know it excludes them from the summation (the numerator) because they can't be added, but what I want to know is whether they are included or excluded in the denominator when dividing by the total number of values.
Is the mean computed like this:
mean(f1) = (2+4+1+15)/6 (include NaN in the total number of values)
or this way:
mean(f1) = (2+4+1+15)/4 (exclude NaN from the total number of values)?
Also, please explain why.
Thanks in advance.
pd.Series.mean calculates the mean over the non-NaN values only, so for the data above the mean is (2+4+1+15)/4 = 5.5, where 4 is the number of non-NaN values; this is the default behavior. If you want the mean computed with all rows in the denominator, you can call fillna(0) before calling mean():
Calling mean() directly:
df['f1'].fillna(df['f1'].mean())
0 2.0
1 4.0
2 5.5 <------
3 1.0
4 5.5 <------
5 15.0
Name: f1, dtype: float64
Calling mean() after fillna(0):
df['f1'].fillna(df['f1'].fillna(0).mean())
0 2.000000
1 4.000000
2 3.666667 <------
3 1.000000
4 3.666667 <------
5 15.000000
Name: f1, dtype: float64
According to the official documentation of pandas.DataFrame.mean, the skipna parameter excludes NA/null values. If they were excluded from the numerator but not from the denominator, that would be explicitly mentioned in the documentation. You can prove it to yourself by running a simple experiment on a dummy dataframe such as the one you gave in the question.
The reason NA/null values should be excluded from the denominator is statistical correctness. The mean is the sum of the numbers divided by the count of those numbers. If a value cannot be added to the summation, it is pointless to add an extra count to the denominator for it: counting it in the denominator is equivalent to treating the NA/null value as 0. But the value is not 0; it is unknown, unobserved, hidden, etc.
If you do know something about the nature of the distribution in practice, you could interpolate or fill the NA/null values accordingly and then take the mean of all the values. For instance, if you realize that the feature in question behaves linearly, you could interpolate the missing values with the "linear" approach.
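As a sketch of both points on the f1 values from the question: the default mean excludes NaNs from numerator and denominator alike, and linear interpolation (pandas.Series.interpolate) fills each gap from its neighbours instead of using one global value:

```python
import numpy as np
import pandas as pd

f1 = pd.Series([2, 4, np.nan, 1, np.nan, 15])

print(f1.mean())                        # (2+4+1+15)/4 = 5.5: NaNs out of both sums
print(f1.interpolate(method='linear'))  # the gaps become 2.5 and 8.0
```

Note that linear interpolation only makes sense if the row order carries meaning (e.g. a time series).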

How does pandas rolling std calculate?

I am using a.rolling(5).std() to get a std series over a window (size=5; a is a pd.Series),
but I found the result is not what I expected.
here is the example:
In [15]: a = [-49, -50, -50, -51, -48]
In [16]: pd.Series(a).rolling(5).std()
Out[16]:
0 NaN
1 NaN
2 NaN
3 NaN
4 1.140175
dtype: float64
In [17]: np.std(a)
Out[17]: 1.0198039027185568
I think the last element of pd.Series(a).rolling(5).std() should be equal to np.std(a), so why isn't it?
This is probably due to Pandas normalizing by N - 1 instead of N. See the first note at https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.std.html
You can change this behavior using the degrees of freedom argument, ddof, e.g. pd.Series(a).rolling(5).std(ddof=0).
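A quick check with the numbers from the question (a minimal sketch; ddof is the degrees-of-freedom parameter documented for Rolling.std):

```python
import numpy as np
import pandas as pd

a = [-49, -50, -50, -51, -48]
s = pd.Series(a)

# Default ddof=1 is the sample std (divide by N-1); ddof=0 is the
# population std (divide by N), which is what np.std uses by default.
print(s.rolling(5).std().iloc[-1])        # matches np.std(a, ddof=1)
print(s.rolling(5).std(ddof=0).iloc[-1])  # matches np.std(a)
```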

Interpolation still leaving NaN's (pandas groupby)

I have a Dataframe with the location of some customers (so I have a column with Customer_id and others with Lat and Lon) and I am trying to interpolate the NaN's according to each customer.
For example, if I interpolate with the nearest approach here (I made up the values here):
Customer_id Lat Lon
A 1 1
A NaN NaN
A 2 2
B NaN NaN
B 4 4
I would like the NaN for B to be 4 and not 2.
I have tried this
series.groupby('Customer_id').apply(lambda group: group.interpolate(method = 'nearest', limit_direction = 'both'))
And the number of NaN's goes down from 9003 to 94. But I'm not understanding why it is still leaving some missing values.
I checked and these 94 missing values corresponded to records from customers that were already being interpolated. For example,
Customer_id Lat
0. A 1
1. A NaN
2. A NaN
3. A NaN
4. A NaN
It would interpolate correctly until some value (let's say it interpolates 1, 2 and 3 correctly) and then leaves 4 as NaN.
I have tried setting a limit in interpolate greater than the maximum number of records per client, but it still doesn't work out. I don't know where my mistake is; can somebody help out?
(I don't know if it's relevant to mention, but I fabricated my own NaN's for this, using the code from "Replace some values in a dataframe with NaN's if the index of the row does not exist in another dataframe". I don't think the problem is there, but since I'm very confused about where the issue actually is, I'll just leave it here.)
When you interpolate with nearest, it can only fill missing values that lie in between non-missing ones. (You'll notice this because you get an error when there's only one non-null value, as in your example.) The remaining null values are "edges", which are taken care of with .bfill().ffill(), matching the nearest logic. This is also the appropriate way to "interpolate" a group with only one non-missing value.
def my_interp(x):
    if x.notnull().sum() > 1:
        return x.interpolate(method='nearest').ffill().bfill()
    else:
        return x.ffill().bfill()
df.groupby('Customer_id').transform(my_interp)
# Lat Lon
#0 1.0 1.0
#1 1.0 1.0
#2 2.0 2.0
#3 4.0 4.0
#4 4.0 4.0

Ordinary Least Squares Regression for multiple columns in Pandas Dataframe

I'm trying to find a way to iterate code for a linear regression over many columns, all the way up to Z3. Here is a snippet of the dataframe, called df1:
Time A1 A2 A3 B1 B2 B3
1 1.00 6.64 6.82 6.79 6.70 6.95 7.02
2 2.00 6.70 6.86 6.92 NaN NaN NaN
3 3.00 NaN NaN NaN 7.07 7.27 7.40
4 4.00 7.15 7.26 7.26 7.19 NaN NaN
5 5.00 NaN NaN NaN NaN 7.40 7.51
6 5.50 7.44 7.63 7.58 7.54 NaN NaN
7 6.00 7.62 7.86 7.71 NaN NaN NaN
This code returns the slope coefficient of a linear regression for ONE column only and concatenates the value to a numpy array called series. Here is what it looks like for extracting the slope of the first column:
from sklearn.linear_model import LinearRegression
series = np.array([])  # blank array to append results to
df2 = df1[~np.isnan(df1['A1'])]  # remove the rows where this column is NaN so sklearn can fit
df3 = df2[['Time', 'A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:, 0], npMatrix[:, 1]
slope = LinearRegression().fit(X, Y)
m = slope.coef_[0]
series = np.concatenate((series, m), axis=0)
As it stands now, I am reusing this slice of code, replacing "A1" with a new column name all the way up to "Z3", which is extremely inefficient. I know there are many easy ways to do this with some modules, but the intermediate NaN values in the timeseries seem to limit me to this method, or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1', for example, with col in the code, but this does not seem to work.
Is there any way I can do this more efficiently?
Thank you!
One liner (or three)
time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
['Slope'], df.columns)
Broken down with a bit of explanation
Using the closed form of OLS, the slope estimate is (XᵀX)⁻¹XᵀY.
In this case X is time, where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd used single brackets, I'd have gotten a series with one dimension, and then the dot products aren't as pretty.
(XᵀX)⁻¹Xᵀ is np.linalg.pinv(time.T.dot(time)).dot(time.T).
Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we can do them all at once? You have to deal with the NaNs somehow. How would you imagine dealing with them? Only fitting over the times where you had data? That is equivalent to placing zeroes in the NaN spots. So I did.
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
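As a minimal sketch of that pinv line on made-up data (the column values here are hypothetical, not from the question; note the formula fits a regression through the origin, with no intercept term):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: A1 is exactly 2 * Time, so its slope should come out as 2
df = pd.DataFrame({'Time': [1.0, 2.0, 3.0, 4.0],
                   'A1':   [2.0, 4.0, 6.0, 8.0]})

time = df[['Time']]  # double brackets keep it two-dimensional
slopes = pd.DataFrame(
    np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
    ['Slope'], df.columns)
print(slopes)  # Time's slope against itself is 1.0; A1's slope is 2.0
```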
Looping is a decent strategy for a modest number of columns (say, fewer than a thousand). Without seeing your implementation, I can't say what went wrong, but here's my version, which works:
slopes = []
for c in df1.columns:
    if c == "Time":
        continue  # skip the x-axis column itself
    mask = ~np.isnan(df1[c])
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.

The mean of a dataframe of NaN's is zero, instead of NaN

I have a dataframe of various timeseries, where the data starts at various points in time. So to have the same starting point, they are all padded with NaN, like so:
location townA townB
datanumber 1234 1235
1940-01-01 NaN NaN
1940-02-01 NaN NaN
1940-03-01 NaN NaN
1940-04-01 NaN NaN
1940-05-01 0.53 NaN
I need to get the average across all my locations, so it seems like meandf = locdf.mean(axis=1) should do the job. The documentation for pd.DataFrame.mean tells me that
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
so it does skip NA values (which are the same as NaN, right?), just as all the other functions in pandas do, so I would expect a result like
1940-01-01 NaN
1940-02-01 NaN
1940-03-01 NaN
1940-04-01 NaN
1940-05-01 0.53
but I get
1940-01-01 0
1940-02-01 0
1940-03-01 0
1940-04-01 0
1940-05-01 0.53
which wreaks havoc afterwards, because everything else in pandas works well with NaN, hence I always use it.
Specifying skipna=True again just to be sure produces the same result, and numeric_only doesn't change anything either.
So what am I doing wrong?
This is a known confusing issue with pandas/numpy. In short, the actual outcome of the operation depends on the version of bottleneck that you have installed, because pandas defers to bottleneck for these calculations. See also https://github.com/pydata/pandas/issues/9422 (and GH11409).
bottleneck changed its implementation of nansum to return 0 on all-NaN arrays instead of NaN, to match the behaviour of numpy's nansum. For this reason, the actual behaviour in pandas can be inconsistent depending on whether, and which version of, bottleneck is installed.
The numpy behaviour:
In [2]: a = np.array([np.nan, np.nan, np.nan])
In [3]: a
Out[3]: array([ nan, nan, nan])
In [4]: np.nansum(a)
Out[4]: 0.0
The logic is that the sum of nothing is 0 (you get nothing as you skip all NaNs here).
By default, pandas deviates from this behaviour and does return NaN (the result you expected):
In [6]: s = pd.Series(a)
In [7]: s.sum()
Out[7]: nan
When you have bottleneck installed, it is used for this calculation. Previously, bottleneck also returned NaN, so you would get consistent behaviour whether or not bottleneck was installed. However, a more recent version of bottleneck (>= 1.0) changed the behaviour to match numpy's nansum.
So if you have this version of bottleneck installed, you will see another behaviour:
In [1]: a = np.array([np.nan, np.nan, np.nan])
In [2]: np.nansum(a)
Out[2]: 0.0
In [3]: s = pd.Series(a)
In [4]: s.sum()
Out[4]: 0.0
In [5]: import bottleneck
In [6]: bottleneck.__version__
Out[6]: '1.0.0'
I think there is something to be said for both results (0 or NaN), and neither of them is 'wrong', but of course what is most confusing/problematic is that the behaviour differs between pandas and numpy/bottleneck.
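For what it's worth, later pandas versions (0.22 and up) settled this with an explicit min_count parameter on sum: its default of 0 makes an all-NaN (or empty) sum come out as 0.0 regardless of bottleneck, while the mean of all-NaN stays NaN. A sketch, assuming a reasonably recent pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, np.nan])

print(s.sum())             # 0.0: the sum of an empty set of observations
print(s.sum(min_count=1))  # nan: require at least one non-NaN value
print(s.mean())            # nan: the mean of no observations is undefined
```

So if you want the old NaN-propagating sum back, min_count=1 gives it to you explicitly.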
