Correlation between dichotomous variable and continuous variable - python

I have two dataframes, one (Lots) structured as follows:
Lot Group    Lot Number    Booking Stage    Date
1            216000.00     HPRESM           2020-08-28
2            890000.01     PART             2013-04-17
and the other one (measurements) as follows:
Mid        Date          Measurement 1    Measurement 2
1901827    2020-08-28    44.5             23.22
2981632    2013-04-17    49.0             34.5
The date column in both dataframes contains unique dates, and the two columns are identical and of the same length.
What I am trying to do is compute correlations between the measurement columns, which are continuous variables, and the lot group, which is either 1 (good lot) or 2 (bad lot), i.e. a dichotomous variable. The measurement variables contain a lot of NaNs (over 50%). I tried to compute the point-biserial correlation, as I read it is the appropriate measure between these two types of variables, but I get nan for the statistic and 1 for the p-value.
import scipy.stats as ss

# Select only the numeric measurement columns
columns = measurement.select_dtypes(exclude=["object", "datetime"]).columns
for col in columns:
    stat, p = ss.pointbiserialr(lots["LosGruppe"], measurement[col])
    print(f"Variable: {col}, Correlation: {stat}, P-Value: {p}")
Output:
Variable: Mes 1, Correlation: nan, P-Value: 1.0
Variable: Mes 2, Correlation: nan, P-Value: 1.0
Variable: Mes 3, Correlation: nan, P-Value: 1.0
Variable: Mes 4, Correlation: nan, P-Value: 1.0
Variable: Mes 5, Correlation: nan, P-Value: 1.0
What would you advise as a solution or cause of this issue and what is a suitable correlation method between such variables?

Point-biserial correlation is a good approach here, but the problem you have is with the missing values. You will need to remove these first with dropna().
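For example, a minimal sketch (assuming the two frames line up on their Date columns and that the group column is the LosGruppe one referenced in your code):

import scipy.stats as ss

# Pair each measurement row with its lot group via the shared Date column
merged = lots.merge(measurement, on="Date")

columns = measurement.select_dtypes(exclude=["object", "datetime"]).columns
for col in columns:
    # Keep only the rows where both the group and the measurement are present
    valid = merged[["LosGruppe", col]].dropna()
    stat, p = ss.pointbiserialr(valid["LosGruppe"], valid[col])
    print(f"Variable: {col}, Correlation: {stat}, P-Value: {p}")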


Calculating the mean of a dataframe column that contains missing values

Let's take an example.
Suppose we have a dataframe with a column named "f1":
f1 : {2, 4, NaN, 1, NaN, 15}
When we apply mean imputation to it, we write code like this:
dataframe['f1'].fillna(dataframe['f1'].mean())
My doubt is about how dataframe['f1'].mean() computes the mean of f1. I know it excludes the NaN values from the summation (the numerator) because they cannot be added, but what I want to know is whether they are included or excluded in the denominator when dividing by the total number of values.
Is the mean computed like this:
mean(f1) = (2 + 4 + 1 + 15) / 6   (include NaNs in the total number of values)
or like this:
mean(f1) = (2 + 4 + 1 + 15) / 4   (exclude NaNs from the total number of values)
Also, please explain why.
Thanks in advance.
pd.Series.mean calculates the mean over the non-NaN values only, so for the data above the mean is (2 + 4 + 1 + 15) / 4 = 5.5, where 4 is the number of non-NaN values. This is the default behaviour when calculating the mean. If you want a mean that uses all the rows in the denominator, you can call fillna(0) before mean():
Calling mean() directly:
df['f1'].fillna(df['f1'].mean())
0 2.0
1 4.0
2 5.5 <------
3 1.0
4 5.5 <------
5 15.0
Name: f1, dtype: float64
Calling mean() after fillna(0):
df['f1'].fillna(df['f1'].fillna(0).mean())
0 2.000000
1 4.000000
2 3.666667 <------
3 1.000000
4 3.666667 <------
5 15.000000
Name: f1, dtype: float64
According to the official documentation of pandas.DataFrame.mean, the "skipna" parameter excludes NA/null values. If they were excluded from the numerator but not the denominator, this would be explicitly mentioned in the documentation. You can prove to yourself that they are excluded from the denominator with a simple experiment on a dummy dataframe such as the one you gave in the question.
The reason NA/null values should be excluded from the denominator is statistical correctness. The mean is the sum of the values divided by their count. If a value cannot be added to the sum, it makes no sense to count it in the denominator; counting it there is equivalent to treating the NA/null value as 0, but the value is not 0, it is unknown, unobserved, hidden, etc.
If you know something about the nature of the distribution in practice, you can interpolate or fill the NA/null values accordingly and then take the mean of all the values. For instance, if the feature in question behaves linearly, you could interpolate the missing values with the "linear" method.
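As a quick sanity check, a small experiment along those lines (using the f1 values from the question) might look like this:

import numpy as np
import pandas as pd

s = pd.Series([2, 4, np.nan, 1, np.nan, 15], name="f1")

# Default: NaNs are skipped in both the numerator and the denominator
print(s.mean())               # (2 + 4 + 1 + 15) / 4 = 5.5

# Counting NaNs in the denominator is the same as treating them as 0
print(s.fillna(0).mean())     # (2 + 4 + 0 + 1 + 0 + 15) / 6

# If the feature behaves linearly, interpolate first and then take the mean
print(s.interpolate(method="linear").mean())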

smoothing curve with pandas and interpolate not modifying data

I'm sure I'm not doing this right. I have a dataframe with a series of data, basically year and a value. I want to smoothen the curve and was looking to use spline to test results.
Basically I was trying to take a column and return the new datapoints into another column:
df['smooth'] = df['value'].interpolate(method='spline', order=3, s=0.)
but the results between smooth and value are the same.
value periodDate smooth diffSmooth
6 422976.72 2019 422976.72 0.0
7 190865.94 2018 190865.94 0.0
8 188440.89 2017 188440.89 0.0
9 192481.64 2016 192481.64 0.0
10 191958.64 2015 191958.64 0.0
11 681376.60 2014 681376.60 0.0
Any suggestions of what I'm doing wrong?
According to the Pandas docs, the interpolate function fills missing values in a sequence, so for example linear interpolation would be [0, 1, NaN, 3] -> [0, 1, 2, 3]. In short, you're using the wrong function. If you want to fit a spline, sklearn or scipy or numpy may be better bets.
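If fitting a spline is what you are after, a rough sketch with scipy.interpolate.UnivariateSpline might look like the following (the smoothing factor s is something you would have to tune; the column names follow the question):

from scipy.interpolate import UnivariateSpline

# The spline fit requires x values in increasing order
df = df.sort_values("periodDate")

# Fit a cubic smoothing spline; a larger s gives a smoother, less faithful curve
spline = UnivariateSpline(df["periodDate"], df["value"], k=3, s=1e10)

# Evaluate the fitted spline at the original years
df["smooth"] = spline(df["periodDate"])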

pandas.Series.interpolate() along "index" shows unexpected results

A pandas.Series() called "bla" in my example contains pressures in Pa as the index and wind speeds in m/s as values:
bla
100200.0 2.0
97600.0 NaN
91100.0 NaN
85000.0 3.0
82600.0 NaN
...
6670.0 NaN
5000.0 2.0
4490.0 NaN
3880.0 NaN
3000.0 9.0
Length: 29498, dtype: float64
bla.index
Float64Index([100200.0, 97600.0, 91100.0, 85000.0, 82600.0, 81400.0,
79200.0, 73200.0, 70000.0, 68600.0,
...
11300.0, 10000.0, 9970.0, 9100.0, 7000.0, 6670.0,
5000.0, 4490.0, 3880.0, 3000.0],
dtype='float64', length=29498)
As the wind speed values are NaN more often than not, I intended to interpolate considering the different pressure levels in order to have more wind speed values to work with.
The docs of interpolate() state that there's a method called "index" which interpolates considering the index-values, but the results don't make sense as compared to the initial values:
bla.interpolate(method="index", axis=0, limit=1, limit_direction="both")
100200.0 **2.00**
97600.0 10.40
91100.0 8.00
85000.0 **3.00**
82600.0 9.75
...
6670.0 3.00
5000.0 **2.00**
4490.0 9.00
3880.0 5.00
3000.0 **9.00**
Length: 29498, dtype: float64
I marked the original values in boldface.
I'd rather expect something like the result when using "linear":
bla.interpolate(method="linear", axis=0, limit=1, limit_direction="both")
100200.0 **2.000000**
97600.0 2.333333
91100.0 2.666667
85000.0 **3.000000**
82600.0 4.600000
...
6670.0 4.500000
5000.0 **2.000000**
4490.0 4.333333
3880.0 6.666667
3000.0 **9.000000**
Nevertheless, I'd like to use "index" properly as the interpolation method, since it should be the most accurate: the pressure levels mark the "distance" between the wind speed values used for interpolation.
By and large, I'd like to understand how the interpolation results using "index" with the pressure levels could turn out so counterintuitive, and how I could make them sound.
Thanks to @ALollz in the first comment underneath my question, I figured out where the issue lay:
It was just that my dataframe had two index levels, the outer one being unique measurement timestamps and the inner one a standard range index.
I should have looked at each subset associated with a unique timestamp separately.
Within these subsets, interpolation makes sense and the results come out just right.
Example:
# Loop over all unique timestamps in the outermost index level
for timestamp in df.index.get_level_values(level=0).unique():
    # Extract the sub-DataFrame for the current timestamp
    df_subset = df.loc[timestamp, :]
    # Carry out interpolation on a column of interest
    df_subset["column of interest"] = df_subset["column of interest"].interpolate(
        method="linear", axis=0, limit=1, limit_direction="both"
    )
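A more compact equivalent, assuming the same two-level index, would be to let groupby do the per-timestamp splitting (a sketch, not the code from the original answer):

# Interpolate within each outer-level group in a single call
df["column of interest"] = (
    df.groupby(level=0)["column of interest"]
      .transform(lambda s: s.interpolate(method="linear", limit=1,
                                         limit_direction="both"))
)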

Pandas describe vs scipy.stats percentileofscore with NaN?

I'm in a weird situation where pandas describe gives me percentile markers that disagree with scipy.stats.percentileofscore, because of NaNs, I think.
My df is:
f_recommend
0 3.857143
1 4.500000
2 4.458333
3 NaN
4 3.600000
5 NaN
6 4.285714
7 3.587065
8 4.200000
9 NaN
When I run df.describe(percentiles=[.25, .5, .75]) I get:
f_recommend
count 7.000000
mean 4.069751
std 0.386990
min 3.587065
25% 3.728571
50% 4.200000
75% 4.372024
max 4.500000
I get the same values when I run with NaN removed.
When I want to look up a specific value, however, and run scipy.stats.percentileofscore(df['f_recommend'], 3.61, kind='mean'), I get the 28th percentile with NaN and the 20th without.
Any thoughts to explain this discrepancy?
ETA:
I don't believe the problem is that we're calculating percentiles differently, because that only matters when you're calculating percentiles of the same numbers in different ways. Here, describe gives the 25th percentile as 3.72, so there is absolutely no way that 3.61 can be at the 28th percentile; none of the formulas should give that.
In particular, when I use describe on the 7 values without NaN, I get the same values, so describe is ignoring NaN, which is fine. But when I run percentileofscore without the NaN I get a number that doesn't match.
ETA 2:
Simpler example:
In [48]: d = pd.DataFrame([1,2,3,4,5,6,7])
In [49]: d.describe()
Out[49]:
0
count 7.000000
mean 4.000000
std 2.160247
min 1.000000
25% 2.500000
50% 4.000000
75% 5.500000
max 7.000000
In [50]: sp.stats.percentileofscore(d[0], 2.1, kind = 'mean')
Out[50]: 28.571428571428573
the "kind" argument doesn't matter because 2.1 is unique.
scipy.stats.percentileofscore does not ignore nan, nor does it check for the value and handle it in some special way. It is just another floating point value in your data. This means the behavior of percentileofscore with data containing nan is undefined, because of the behavior of nan in comparisons:
In [44]: np.nan > 0
Out[44]: False
In [45]: np.nan < 0
Out[45]: False
In [46]: np.nan == 0
Out[46]: False
In [47]: np.nan == np.nan
Out[47]: False
Those results are all correct--that is how nan is supposed to behave. But that means, in order to know how percentileofscore handles nan, you have to know how the code does comparisons. And that is an implementation detail that you shouldn't have to know, and that you can't rely on to be the same in future versions of scipy.
If you investigate the behavior of percentileofscore, you'll find that it behaves as if nan were infinite. For example, if you replace nan with a value larger than any other value in the input, you'll get the same results:
In [53]: percentileofscore([10, 20, 25, 30, np.nan, np.nan], 18)
Out[53]: 16.666666666666664
In [54]: percentileofscore([10, 20, 25, 30, 999, 999], 18)
Out[54]: 16.666666666666664
Unfortunately, you can't rely on this behavior. If the implementation changes in the future, nan might end up behaving like negative infinity, or have some other unspecified behavior.
The solution to this "problem" is simple: don't give percentileofscore any nan values. You'll have to clean up your data first. Note that this can be as simple as:
result = percentileofscore(a[~np.isnan(a)], score)
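Applied to the question's column, the same cleanup might look like this (df and the score 3.61 come from the question):

from scipy.stats import percentileofscore

# Drop the NaNs before asking for the percentile, as recommended above
clean = df['f_recommend'].dropna()
print(percentileofscore(clean, 3.61, kind='mean'))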
The answer is very simple.
There is no universally accepted formula for computing percentiles, in particular when your data contains ties or when it cannot be perfectly broken down in equal-size buckets.
For instance, have a look at the documentation in R. There are more than seven types of formulas! https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
At the end, it comes down to understanding which formula is used and whether the differences are big enough to be a problem in your case.
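To see this concretely, numpy exposes several of these definitions through the method argument of np.percentile (called interpolation in older numpy releases), and they give different answers for the same seven values from the simpler example above:

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7])

# Different percentile definitions disagree on the 25th percentile
for method in ["linear", "lower", "higher", "nearest", "midpoint"]:
    print(method, np.percentile(data, 25, method=method))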

Finding minimum and maximum value for each row, excluding NaN values

I have code that plots multiple wind speed values during a day at 50 different altitudes. I'm trying to make it give me the minimum and maximum values at each altitude so I can see the minimum and maximum winds experienced during the day.
I've tried np.min(wind_speed, axis=0) but this gives me nan, since I have a line that reads bad wind speed values in as nan. How can I avoid the nan values and get the actual minimum and maximum values occurring during the day?
To ignore the NaN values use np.nanmin and the analogous np.nanmax:
np.nanmin(wind_speed, axis=0)
np.nanmax(wind_speed, axis=0)
This will ignore the NaN values as desired.
Example:
In [93]:
wind_speed = np.array([234,np.NaN,343, np.NaN])
wind_speed
Out[93]:
array([ 234., nan, 343., nan])
In [94]:
print(np.nanmin(wind_speed, axis=0), np.nanmax(wind_speed, axis=0))
234.0 343.0
