smoothing curve with pandas and interpolate not modifying data - python

I'm sure I'm not doing this right. I have a dataframe with a series of data, basically a year and a value. I want to smooth the curve and was looking at using a spline to test the results.
Basically I was trying to take a column and return the new datapoints into another column:
df['smooth'] = df['value'].interpolate(method='spline', order=3, s=0.)
but the results between smooth and value are the same.
value periodDate smooth diffSmooth
6 422976.72 2019 422976.72 0.0
7 190865.94 2018 190865.94 0.0
8 188440.89 2017 188440.89 0.0
9 192481.64 2016 192481.64 0.0
10 191958.64 2015 191958.64 0.0
11 681376.60 2014 681376.60 0.0
Any suggestions of what I'm doing wrong?

According to the Pandas docs, the interpolate function fills missing values in a sequence; for example, linear interpolation turns [0, 1, NaN, 3] into [0, 1, 2, 3]. Your value column has no NaNs, so there is nothing to fill and the column comes back unchanged. In short, you're using the wrong function: to fit a smoothing spline, scipy (or sklearn or numpy) is a better bet.
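For the smoothing itself, a minimal sketch using scipy's UnivariateSpline (assuming the periodDate and value columns from the question; the smoothing factor s is a knob you would tune, with s=0 reproducing the data exactly):
from scipy.interpolate import UnivariateSpline
df = df.sort_values('periodDate')            # x must be increasing for UnivariateSpline
spline = UnivariateSpline(df['periodDate'], df['value'], k=3, s=1e10)  # s here is a placeholder; tune it
df['smooth'] = spline(df['periodDate'])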

Related

Spline interpolation on dataframes by row

I have the following data frame:
OBJECTID 2017 2018 2019 2020 2021
1.0 NaN NaN 7569.183179 7738.162829 7907.142480
2.0 NaN NaN 766.591146 783.861122 801.131099
3.0 NaN NaN 8492.215747 8686.747704 8881.279662
4.0 NaN NaN 40760.327825 41196.877473 41633.427120
5.0 NaN NaN 6741.819674 6788.981231 6836.142788
I am trying to apply a spline interpolation on each row to get the values for 2017 and 2018 using the following code:
years = list(range(2017,2022))
df[years] = df[years].interpolate(method="spline", order =1, limit_direction="both", axis=1)
However, I get the following error:
ValueError: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating.
The dataframe in this question is just a subset of a much larger dataset I am using. All of the examples I have seen do the spline interpolation down each column, but I can't seem to get it to work across each row. I feel like it's a simple solution and I'm just missing it. Could someone please help?
It appears to be because the dtype of the index (really the columns, since axis=1) is object in your case, as the columns also contain a string name. Even though you are grabbing a slice of the columns that contains only integer years, the overall columns dtype remains object, and interpolate checks that dtype and punts when it sees object.
Example - even though the years are stored as integers the overall dtype is object:
df.columns
Index(['OBJECTID', 2017, 2018, 2019, 2020, 2021], dtype='object')
If we did this:
df.drop(columns=['OBJECTID'], inplace=True)
df.columns = df.columns.astype('uint64')
df.columns
UInt64Index([2017, 2018, 2019, 2020, 2021], dtype='uint64')
Then the axis=1 interpolation works:
years = list(range(2017,2022))
df[years] = df[years].interpolate(method="spline", order =1, limit_direction="both", axis=1)
2017 2018 2019 2020 2021
0 7231.223878 7400.203528 7569.183179 7738.162829 7907.142480
1 732.051193 749.321169 766.591146 783.861122 801.131099
2 8103.151832 8297.683789 8492.215747 8686.747704 8881.279662
3 39887.228530 40323.778178 40760.327825 41196.877473 41633.427120
4 6647.496560 6694.658117 6741.819674 6788.981231 6836.142788
Dropping the OBJECTID was done to illustrate what is going on.
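A variation on the same idea (a sketch, not from the original answer): instead of dropping OBJECTID, move it into the index so the remaining columns are purely numeric years:
df = df.set_index('OBJECTID')
df.columns = df.columns.astype('int64')      # columns dtype is now numeric
df = df.interpolate(method='spline', order=1, limit_direction='both', axis=1)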

Correlation between dichotomous variable and continuous variable

I have two dataframes. One (Lots) is structured as follows:
Lot Group  Lot Number  Booking Stage  Date
1          216000.00   HPRESM         2020-08-28
2          890000.01   PART           2013-04-17
and the other one (measurement) as follows:
Mid      Date        Measurement 1  Measurement 2
1901827  2020-08-28  44.5           23.22
2981632  2013-04-17  49.0           34.5
The date column in each dataframe contains unique dates, the dates are identical across the two dataframes, and both have the same length.
What I am trying to do is compute correlations between the measurement columns, which are continuous variables, and the lot group, which is either 1 (good lot) or 2 (bad lot), i.e. a dichotomous variable. The measurement variables have a lot of NaNs (over 50%). I tried to compute the point-biserial correlation, as I read it is used to measure correlation between these two types of variables, but I get nan for the statistic and 1 for the p-value.
import scipy.stats as ss

columns = measurement.select_dtypes(exclude=["object", "datetime"]).columns
for col in columns:
    stat, p = ss.pointbiserialr(lots["LosGruppe"], measurement[col])
    print(f"Variable: {col}, Correlation: {stat}, P-Value: {p}")
Output:
Variable: Mes 1, Correlation: nan, P-Value: 1.0
Variable: Mes 2, Correlation: nan, P-Value: 1.0
Variable: Mes 3, Correlation: nan, P-Value: 1.0
Variable: Mes 4, Correlation: nan, P-Value: 1.0
Variable: Mes 5, Correlation: nan, P-Value: 1.0
What would you advise as a solution or cause of this issue and what is a suitable correlation method between such variables?
Point-biserial correlation is a good approach here, but the problem is the missing values: the NaNs propagate through the calculation, which is why the statistic comes out as nan. You will need to remove them first, for example with dropna().
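A sketch of how that could look, assuming (as stated in the question) the two dataframes are aligned row by row and share the same index; mask out the NaN rows per measurement column before calling pointbiserialr:
import scipy.stats as ss

columns = measurement.select_dtypes(exclude=["object", "datetime"]).columns
for col in columns:
    mask = measurement[col].notna()          # rows where this measurement is present
    stat, p = ss.pointbiserialr(lots.loc[mask, "LosGruppe"], measurement.loc[mask, col])
    print(f"Variable: {col}, Correlation: {stat}, P-Value: {p}")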

Python - Pandas: how can I interpolate between values that grow exponentially?

I have a Pandas Series that contains the price evolution of a product (my country has high inflation), or, say, the number of coronavirus-infected people in a certain country. The values in both of these datasets grow exponentially; that means that if you had something like [3, NaN, 27] you'd want to interpolate so that the missing value is filled with 9. I checked the interpolation methods in the Pandas documentation, but unless I missed something, I didn't find anything about this type of interpolation.
I can do it manually: take the geometric mean, or, when more values are missing, get the average growth rate as (final value / initial value)^(1 / distance between them) and then multiply accordingly. But there are a lot of values to fill in my Series, so how do I do this automatically? I guess I'm missing something, since this seems very basic.
Thank you.
You could take the logarithm of your series, interpolate linearly, and then transform it back to the original scale with the exponential.
import pandas as pd
import numpy as np
arr = np.exp(np.arange(1,10))
arr = pd.Series(arr)
arr[3] = None
0 2.718282
1 7.389056
2 20.085537
3 NaN
4 148.413159
5 403.428793
6 1096.633158
7 2980.957987
8 8103.083928
dtype: float64
arr = np.log(arr) # Transform according to assumed process.
arr = arr.interpolate('linear') # Interpolate.
np.exp(arr) # Invert previous transformation.
0 2.718282
1 7.389056
2 20.085537
3 54.598150
4 148.413159
5 403.428793
6 1096.633158
7 2980.957987
8 8103.083928
dtype: float64
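As a quick sanity check (not part of the original answer), the filled value is the geometric mean of its neighbours, which is exactly what the question asked for:
import numpy as np
np.sqrt(20.085537 * 148.413159)   # ~54.598, matching the interpolated value above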

wasserstein distance for multiple histograms

I'm trying to calculate the distance matrix between histograms. I can only find code for calculating the distance between 2 histograms, and my data has more than 10. My data is a CSV file and the histograms come in columns that add up to 100. It consists of about 65,000 entries; I only ran with 20% of the data, but the code still does not work.
I've tried distance_matrix from scipy.spatial, but it ignores the fact that the data are histograms and treats them as ordinary numerical data. I've also tried the Wasserstein distance, but the error was "object too deep for desired array":
from scipy.stats import wasserstein_distance
distance = wasserstein_distance(df3, df3)
I expected the result to be something like this:
0 1 2 3 4 5
0 0.000000 259.730341 331.083554 320.302997 309.577373 249.868085
1 259.730341 0.000000 208.368304 190.441382 262.030304 186.033572
2 331.083554 208.368304 0.000000 112.255111 256.269253 227.510879
3 320.302997 190.441382 112.255111 0.000000 246.350482 205.346804
4 309.577373 262.030304 256.269253 246.350482 0.000000 239.642379
but I got an error instead:
ValueError: object too deep for desired array
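wasserstein_distance compares two 1-D distributions at a time, which is why passing a whole DataFrame raises that error. A minimal sketch of building the pairwise matrix by hand, assuming each column of df3 is one histogram and all histograms share the same bins (bin positions taken as 0, 1, 2, ... here):
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

bins = np.arange(len(df3))                   # common bin positions for every histogram
n = df3.shape[1]
dist = pd.DataFrame(np.zeros((n, n)))
for i in range(n):
    for j in range(i + 1, n):
        d = wasserstein_distance(bins, bins,
                                 u_weights=df3.iloc[:, i],
                                 v_weights=df3.iloc[:, j])
        dist.iloc[i, j] = dist.iloc[j, i] = d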

Pandas describe vs scipy.stats percentileofscore with NaN?

I'm having a weird situation where df.describe is giving me percentile markers that disagree with scipy.stats.percentileofscore, because of NaNs, I think.
My df is:
f_recommend
0 3.857143
1 4.500000
2 4.458333
3 NaN
4 3.600000
5 NaN
6 4.285714
7 3.587065
8 4.200000
9 NaN
When I run df.describe(percentiles=[.25, .5, .75]) I get:
f_recommend
count 7.000000
mean 4.069751
std 0.386990
min 3.587065
25% 3.728571
50% 4.200000
75% 4.372024
max 4.500000
I get the same values when I run with NaN removed.
When I want to look up a specific value, however, and run scipy.stats.percentileofscore(df['f_recommend'], 3.61, kind='mean'), I get the 28th percentile with NaN and the 20th without.
Any thoughts to explain this discrepancy?
ETA:
I don't believe the problem is that we're calculating percentiles differently, because that only matters when the different formulas are applied to the same numbers. But here, describe gives the 25th percentile as 3.72, so there is no way that 3.61 can be the 28th percentile; none of the formulas should give that.
In particular, when I use describe on the 7 values without NaN, I get the same values, so describe is ignoring NaN, which is fine. But when I run percentileofscore without the NaN I get a number that doesn't match.
ETA 2:
Simpler example:
In [48]: d = pd.DataFrame([1,2,3,4,5,6,7])
In [49]: d.describe()
Out[49]:
0
count 7.000000
mean 4.000000
std 2.160247
min 1.000000
25% 2.500000
50% 4.000000
75% 5.500000
max 7.000000
In [50]: sp.stats.percentileofscore(d[0], 2.1, kind = 'mean')
Out[50]: 28.571428571428573
the "kind" argument doesn't matter because 2.1 is unique.
scipy.stats.percentileofscore does not ignore nan, nor does it check for the value and handle it in some special way. It is just another floating point value in your data. This means the behavior of percentileofscore with data containing nan is undefined, because of the behavior of nan in comparisons:
In [44]: np.nan > 0
Out[44]: False
In [45]: np.nan < 0
Out[45]: False
In [46]: np.nan == 0
Out[46]: False
In [47]: np.nan == np.nan
Out[47]: False
Those results are all correct--that is how nan is supposed to behave. But that means, in order to know how percentileofscore handles nan, you have to know how the code does comparisons. And that is an implementation detail that you shouldn't have to know, and that you can't rely on to be the same in future versions of scipy.
If you investigate the behavior of percentileofscore, you'll find that it behaves as if nan were infinite. For example, if you replace nan with a value larger than any other value in the input, you'll get the same results:
In [53]: percentileofscore([10, 20, 25, 30, np.nan, np.nan], 18)
Out[53]: 16.666666666666664
In [54]: percentileofscore([10, 20, 25, 30, 999, 999], 18)
Out[54]: 16.666666666666664
Unfortunately, you can't rely on this behavior. If the implementation changes in the future, nan might end up behaving like negative infinity, or have some other unspecified behavior.
The solution to this "problem" is simple: don't give percentileofscore any nan values. You'll have to clean up your data first. Note that this can be as simple as:
result = percentileofscore(a[~np.isnan(a)], score)
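Applied to the question's column, that could look like (a sketch using pandas):
result = percentileofscore(df['f_recommend'].dropna(), 3.61, kind='mean')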
The answer is very simple.
There is no universally accepted formula for computing percentiles, in particular when your data contains ties or when it cannot be perfectly broken down into equal-size buckets.
For instance, have a look at the documentation in R; there are nine different formulas! https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
At the end, it comes down to understanding which formula is used and whether the differences are big enough to be a problem in your case.
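To see the two definitions side by side on the simpler example from the question (describe/np.percentile interpolate between order statistics, while percentileofscore reports how much of the data lies at or below the score):
import numpy as np
from scipy.stats import percentileofscore

data = np.array([1, 2, 3, 4, 5, 6, 7])
np.percentile(data, 25)                    # 2.5   -> value at the 25th percentile (linear interpolation)
percentileofscore(data, 2.1, kind='mean')  # 28.57 -> percentile rank of the value 2.1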
