Calculate a prediction interval for a dataset Python - python

I have the following table:
perc
0 59.98797
1 61.89383
2 61.08403
3 61.00661
4 62.64753
5 62.18118
6 60.74520
7 57.83964
8 62.09705
9 57.07985
10 58.62777
11 60.02589
12 58.74948
13 59.14136
14 58.37719
15 58.27401
16 59.67806
17 58.62855
18 58.45272
19 57.62186
20 58.64749
21 58.88152
22 54.80138
23 59.57697
24 60.26713
25 60.96022
26 55.59813
27 60.32104
28 57.95403
29 58.90658
30 53.72838
31 57.03986
32 58.14056
33 53.62257
34 57.08174
35 57.26881
36 48.80800
37 56.90632
38 59.08444
39 57.36432
consisting of various percentages.
I'm interested in creating a probability distribution based on these percentages for the sake of coming up with a prediction interval (say 95%) of what we would expect a new observation of this percentage to be within.
I initially was doing the following, but upon testing with my sample data I remembered that CIs capture the mean, not a new observation.
import scipy.stats as st
import numpy as np
# Get data in a list
lst = list(percDone['perc'])
# create 95% confidence interval
st.t.interval(alpha=0.95, df=len(lst)-1,
loc=np.mean(lst),
scale=st.sem(lst))
Thanks!

Related

Create Xarray dataset from grd files and list of grid coordinates

I am trying to use a Canadian gridded historical dataset of temperature anomalies but it seems that I don't have the skills to pull that off. The grd file are temperatures anomalies on what I believe is a highly regular grid. I have no experience with that kind of grid and I am having trouble building the xarray dataset.
What I have (a subset of the grd and the text file is accessible here) :
2075 '.grd' files ('t190001.grd' to 't202112.grd' following "t{year}{month}.grd" structure)
1 txt file listing the grid coordinates called "CANGRD_points_LL.txt"
From this I would like to build a xarray dataset in order to do some analysis.
Naively, I thought the grid files were already georeferenced and all so I started by doing this :
import glob
import rioxarray as rio
import pandas as pd
import numpy as np
import xarray as xr
#not used for the moment even though I believe that will be needed
#df = pd.read_csv(r"CANGRD_points_LL.txt", sep = ' ', header=None)
list_files = sorted(set(glob.glob(r"t?????[0-2].grd" ) + glob.glob(r"t????0[0-9].grd" )))
times = pd.date_range("1900/01/01",freq='M', periods= len(list_files))
datarrays = [rio.open_rasterio(rst, masked=True,band_as_variable=True).assign_coords(time = t).expand_dims(dim='time').squeeze() for rst,t in zip(list_files, times)]
ds = xr.concat(datarrays,dim='time').rename({'band_1' : 'tas', 'y': 'lat', 'x' : 'lon'})
But as I plotted the results it became evident that my coordinates were only the indices of the pixels :
So I believe I have to use the txt file provided, however, I have no idea how to make the xarray grid using the grid's coordinates and how to make that match with my array obtained by loading a grid via rioxarray. Here is a sample, the complete file is available above. What baffles me is that most of the 11874 lines of the dataframe resulting from the txt file seem to be unique, so how could I fit an array of dimensions 125 lon by 95 lat into it.
0 1 2 3
0 0 0 40.0451 -129.8530
1 0 1 40.1780 -129.3650
2 0 2 40.3080 -128.8740
3 0 3 40.4348 -128.3801
4 0 4 40.5585 -127.8834
5 0 5 40.6790 -127.3840
6 0 6 40.7963 -126.8817
7 0 7 40.9104 -126.3768
8 0 8 41.0211 -125.8693
9 0 9 41.1286 -125.3591
10 0 10 41.2327 -124.8465
11 0 11 41.3335 -124.3314
12 0 12 41.4308 -123.8140
13 0 13 41.5247 -123.2942
14 0 14 41.6151 -122.7722
15 0 15 41.7020 -122.2481
16 0 16 41.7853 -121.7218
17 0 17 41.8651 -121.1936
18 0 18 41.9413 -120.6634
19 0 19 42.0139 -120.1313
20 0 20 42.0828 -119.5975
21 0 21 42.1481 -119.0620
22 0 22 42.2097 -118.5249
23 0 23 42.2675 -117.9863
24 0 24 42.3216 -117.4462
25 0 25 42.3720 -116.9049
26 0 26 42.4186 -116.3622
27 0 27 42.4614 -115.8185
28 0 28 42.5005 -115.2736
29 0 29 42.5357 -114.7279
30 0 30 42.5670 -114.1812
31 0 31 42.5946 -113.6338
32 0 32 42.6182 -113.0857
33 0 33 42.6381 -112.5371
34 0 34 42.6540 -111.9880
35 0 35 42.6661 -111.4385
36 0 36 42.6743 -110.8888
37 0 37 42.6786 -110.3389
38 0 38 42.6791 -109.7889
39 0 39 42.6757 -109.2390
40 0 40 42.6684 -108.6892
41 0 41 42.6572 -108.1397
42 0 42 42.6421 -107.5905
43 0 43 42.6232 -107.0417
44 0 44 42.6004 -106.4935
45 0 45 42.5738 -105.9459
46 0 46 42.5433 -105.3991
47 0 47 42.5090 -104.8531
48 0 48 42.4708 -104.3081
49 0 49 42.4289 -103.7640
Here is the view of one grid file loaded as xarray,
Any help would be greatly appreciated! Thank you so much
I directly asked on the Xarray Github discussion here is the original answer from Keewis:
https://github.com/pydata/xarray/discussions/7443#discussioncomment-4700261
The grid file contains stacked 2D coordinates, which I guess is due to the grid's original coordinate system not being aligned with the lat / lon axes.
To read the coordinates into 2D coordinates you can use:
df = pd.read_csv(r"CANGRD_points_LL.txt", sep=" ", header=None, names=["y", "x", "lat", "lon"])
grid = df.set_index(["y", "x"]).to_xarray().set_coords(["lat", "lon"])
raw = xr.concat([...], dim="time")
ds = xr.merge([raw, grid]).assign_coords(time=times).rename_vars(...)

Pre-processing single feature containing different scales

How do I preprocess this data containing a single feature with different scales? This will then be used for supervised machine learning classification.
Data
import pandas as pd
import numpy as np
np.random.seed = 4
df_eur_jpy = pd.DataFrame({"value": np.random.default_rng().uniform(0.07, 3.85, 50)})
df_usd_cad = pd.DataFrame({"value": np.random.default_rng().uniform(0.0004, 0.02401, 50)})
df_usd_cad["ticker"] = "usd_cad"
df_eur_jpy["ticker"] = "eur_jpy"
df = pd.concat([df_eur_jpy,df_usd_cad],axis=0)
df.head(1)
value ticker
0 0.161666 eur_jpy
We can see the different tickers contain data with a different scale when looking at the max/min of this groupby:
df.groupby("ticker")["value"].agg(['min', 'max'])
min max
ticker
eur_jpy 0.079184 3.837519
usd_cad 0.000405 0.022673
I have many tickers in my real data and would like to combine all of these in the one feature (pandas column) and use with an estimator in sci-kit learn for supervised machine learning classification.
If I Understand Carefully (IIUC), you can use the min-max scaling formula:
You can apply this formula to your dataframe with implemented sklearn.preprocessing.MinMaxScaler like below:
from sklearn.preprocessing import MinMaxScaler
df2 = df.pivot(columns='ticker', values='value')
# ticker eur_jpy usd_cad
# 0 3.204568 0.021455
# 1 1.144708 0.013810
# ...
# 48 1.906116 0.002058
# 49 1.136424 0.022451
df2[['min_max_scl_eur_jpy', 'min_max_scl_usd_cad']] = MinMaxScaler().fit_transform(df2[['eur_jpy', 'usd_cad']])
print(df2)
Output:
ticker eur_jpy usd_cad min_max_scl_eur_jpy min_max_scl_usd_cad
0 3.204568 0.021455 0.827982 0.896585
1 1.144708 0.013810 0.264398 0.567681
2 2.998154 0.004580 0.771507 0.170540
3 1.916517 0.003275 0.475567 0.114361
4 0.955089 0.009206 0.212517 0.369558
5 3.036463 0.019500 0.781988 0.812471
6 1.240505 0.006575 0.290608 0.256373
7 1.224260 0.020711 0.286163 0.864584
8 3.343022 0.020564 0.865864 0.858280
9 2.710383 0.023359 0.692771 0.978531
10 1.218328 0.008440 0.284540 0.336588
11 2.005472 0.022898 0.499906 0.958704
12 2.056680 0.016429 0.513916 0.680351
13 1.010388 0.005553 0.227647 0.212368
14 3.272408 0.000620 0.846543 0.000149
15 2.354457 0.018608 0.595389 0.774092
16 3.297936 0.017484 0.853528 0.725720
17 2.415297 0.009618 0.612035 0.387285
18 0.439263 0.000617 0.071386 0.000000
19 3.335262 0.005988 0.863740 0.231088
20 2.767412 0.013357 0.708375 0.548171
21 0.830678 0.013824 0.178478 0.568255
22 1.056041 0.007806 0.240138 0.309336
23 1.497400 0.023858 0.360896 1.000000
24 0.629698 0.014088 0.123489 0.579604
25 3.758559 0.020663 0.979556 0.862509
26 0.964214 0.010302 0.215014 0.416719
27 3.680324 0.023647 0.958150 0.990918
28 3.169445 0.017329 0.818372 0.719059
29 1.898905 0.017892 0.470749 0.743299
30 3.322663 0.020508 0.860293 0.855869
31 2.735855 0.010578 0.699741 0.428591
32 2.264645 0.017853 0.570816 0.741636
33 2.613166 0.021359 0.666173 0.892456
34 1.976168 0.001568 0.491888 0.040928
35 3.076169 0.013663 0.792852 0.561335
36 3.330470 0.013048 0.862429 0.534891
37 3.600527 0.012340 0.936318 0.504426
38 0.653994 0.008665 0.130137 0.346288
39 0.587896 0.013134 0.112052 0.538567
40 0.178353 0.011326 0.000000 0.460781
41 3.727127 0.016738 0.970956 0.693658
42 1.719622 0.010939 0.421696 0.444123
43 0.460177 0.021131 0.077108 0.882665
44 3.124722 0.010328 0.806136 0.417826
45 1.011988 0.007631 0.228085 0.301799
46 3.833281 0.003896 1.000000 0.141076
47 3.289872 0.017223 0.851322 0.714495
48 1.906116 0.002058 0.472721 0.062020
49 1.136424 0.022451 0.262131 0.939465

how to sequentially assign two numbers in an array?

I try to assign two numbers diagonally to each other in the matrix according to certain procedures.
At first the first 1st number in the penultimate line of the line with the 2nd number in the last line, then the first number in the line up with the 2nd number in the penultimate line, etc..This sequence is shown in the example below. The matrix does not always have to be the same size.
Example
a=np.array([[11,12,13],
[21,22,23],
[31,32,33]])
required output:
21 32
11 22
11 33
22 33
12 23
or
a=np.array([[11,12,13,14],
[21,22,23,24],
[31,32,33,34],
[41,42,43,44]])
required output:
31 42
21 32
21 43
32 43
11 22
11 33
11 44
22 33
22 44
12 23
12 34
23 34
13 24
It is possible?
Here's an iterative solution, assuming a square matrix. Modifying this for non-square matrices shouldn't be hard.
import numpy as np
a=np.array([[11,12,13,14],
[21,22,23,24],
[31,32,33,34],
[41,42,43,44]])
w,h = a.shape
for y0 in range(1,h):
y = h-y0-1
for x in range(h-y-1):
print( a[y+x,x], a[y+x+1,x+1] )
for x in range(1,w-1):
for y in range(w-x-1):
print( a[y,x+y], a[y+1,x+y+1] )

Select/Group rows from a data frame with the nearest values for a specific column(s)

I have the two columns in a data frame (you can see a sample down below)
Usually in columns A & B I get 10 to 12 rows with similar values.
So for example: from index 1 to 10 and then from index 11 to 21.
I would like to group these values and get the mean and standard deviation of each group.
I found this following line code where I can get the index of the nearest value. but I don't know how to do this repetitively:
Index = df['A'].sub(df['A'][0]).abs().idxmin()
Anyone has any ideas on how to approach this problem?
A B
1 3652.194531 -1859.805238
2 3739.026566 -1881.965576
3 3742.095325 -1878.707674
4 3747.016899 -1878.728626
5 3746.214554 -1881.270329
6 3750.325368 -1882.915532
7 3748.086576 -1882.406672
8 3751.786422 -1886.489485
9 3755.448968 -1885.695822
10 3753.714126 -1883.504098
11 -337.969554 24.070990
12 -343.019575 23.438956
13 -344.788697 22.250254
14 -346.433460 21.912217
15 -343.228579 22.178519
16 -345.722368 23.037441
17 -345.923108 23.317620
18 -345.526633 21.416528
19 -347.555162 21.315934
20 -347.229210 21.565183
21 -344.575181 22.963298
22 23.611677 -8.499528
23 26.320500 -8.744512
24 24.374874 -10.717384
25 25.885272 -8.982414
26 24.448127 -9.002646
27 23.808744 -9.568390
28 24.717935 -8.491659
29 25.811393 -8.773649
30 25.084683 -8.245354
31 25.345618 -7.508419
32 23.286342 -10.695104
33 -3184.426285 -2533.374402
34 -3209.584366 -2553.310934
35 -3210.898611 -2555.938332
36 -3214.234899 -2558.244347
37 -3216.453616 -2561.863807
38 -3219.326197 -2558.739058
39 -3214.893325 -2560.505207
40 -3194.421934 -2550.186647
41 -3219.728445 -2562.472566
42 -3217.630380 -2562.132186
43 234.800448 -75.157523
44 236.661235 -72.617806
45 238.300501 -71.963103
46 239.127539 -72.797922
47 232.305335 -70.634125
48 238.452197 -73.914015
49 239.091210 -71.035163
50 239.855953 -73.961841
51 238.936811 -73.887023
52 238.621490 -73.171441
53 240.771812 -73.847028
54 -16.798565 4.421919
55 -15.952454 3.911043
56 -14.337879 4.236691
57 -17.465204 3.610884
58 -17.270147 4.407737
59 -15.347879 3.256489
60 -18.197750 3.906086
A simpler approach consist in grouping the values where the percentage change is not greater than a given threshold (let's say 0.5):
df['Group'] = (df.A.pct_change().abs()>0.5).cumsum()
df.groupby('Group').agg(['mean', 'std'])
Output:
A B
mean std mean std
Group
0 3738.590934 30.769420 -1880.148905 7.582856
1 -344.724684 2.666137 22.496995 0.921008
2 24.790470 0.994361 -9.020824 0.977809
3 -3210.159806 11.646589 -2555.676749 8.810481
4 237.902230 2.439297 -72.998817 1.366350
5 -16.481411 1.341379 3.964407 0.430576
Note: I have only used the "A" column, since the "B" column appears to follow the same pattern of consecutive nearest values. You can check if the identified groups are the same between columns with:
grps = (df[['A','B']].pct_change().abs()>1).cumsum()
grps.A.eq(grps.B).all()
I would say that if you know the length of each group/index set you want then you can first subset the column and row with :
df['A'].iloc[0:11].mean()
Then figure out a way to find standard deviation.

Given a discrete distribution, how do I round a number to the closest value in that distribution?

What I ultimately want to do is round the expected value of a discrete random variable distribution to a valid number in the distribution. For example if I am drawing evenly from the numbers [1, 5, 6], the expected value is 4 but I want to return the closest number to that (ie, 5).
from scipy.stats import *
xk = (1, 5, 6)
pk = np.ones(len(xk))/len(xk)
custom = rv_discrete(name='custom', values=(xk, pk))
print(custom.expect())
# 4.0
def round_discrete(discrete_rv_dist, val):
# do something here
return answer
print(round_discrete(custom, custom.expect()))
# 5.0
I don't know apriori what distribution will be used (ie might not be integers, might be an unbounded distribution), so I'm really struggling to think of an algorithm that is sufficiently generic. Edit: I just learned that rv_discrete doesn't work on non-integer xk values.
As to why I want to do this, I'm putting together a monte-carlo simulation, and want a "nominal" value for each distribution. I think that the EV is the most physically appropriate rather than the mode or median. I might have values in the downstream simulation that have to be one of several discrete choices, so passing a value that is not within that set is not acceptable.
If there's already a nice way to do this in Python that would be great, otherwise I can interpret math into code.
Here is R code that I think will do what you want, using Poisson data to illustrate:
set.seed(322)
x = rpois(100, 7) # 100 obs from POIS(7)
a = mean(x); a
[1] 7.16 # so 7 is the value we want
d = min(abs(x-a)); d # min distance btw a and actual Pois val
[1] 0.16
u = unique(x); u # unique Pois values observed
[1] 7 5 4 10 2 9 8 6 11 3 13 14 12 15
v = u[abs(u-a)==d]; v # unique val closest to a
[1] 7
Hope you can translate it to Python.
Another run:
set.seed(323)
x = rpois(100, 20)
a = mean(x); a
[1] 20.32
d = min(abs(x-a)); d
[1] 0.32
u = unique(x)
v = u[abs(u-a)==d]; v
[1] 20
x
[1] 17 16 20 23 23 20 19 23 21 19 21 20 22 25 13 15 19 19 14 27 19 30 17 19 23
[26] 16 23 26 33 16 11 23 14 21 24 12 18 20 20 19 26 12 22 24 20 22 17 23 11 19
[51] 19 26 17 17 11 17 23 21 26 13 18 28 22 14 17 25 28 24 16 15 25 26 22 15 23
[76] 27 19 21 17 23 21 24 23 22 23 18 25 14 24 25 19 19 21 22 16 28 18 11 25 23
u
[1] 17 16 20 23 19 21 22 25 13 15 14 27 30 26 33 11 24 12 18 28
Figured it out, and tested it working. If I plug my value X into the cdf, then I can plug that probability P = cdf(X) into the ppf. The values at ppf(P +- epsilon) will give me the closest values in the set to X.
Or more geometrically, for a discrete pmf, the point (X,P) will lie on a horizontal portion of the corresponding cdf. When you invert the cdf, (P,X) is now on a vertical section of the ppf. Taking P +- eps will give you the 2 nearest flat portions of the ppf connected to that vertical jump, which correspond to the valid values X1, X2. You can then do a simple difference to figure out which is closer to your target value.
import numpy as np
eps = np.finfo(float).eps
ev = custom.expect()
p = custom.cdf(ev)
ev_candidates = custom.ppf([p - eps, p, p + eps])
ev_candidates_distance = abs(ev_candidates - ev)
ev_closest = ev_candidates[np.argmin(ev_candidates_distance)]
print(ev_closest)
# 5.0
Terms:
pmf - probability mass function
cdf - cumulative distribution function (cumulative sum of the pdf)
ppf - percentage point function (inverse of the cdf)
eps - epsilon (smallest possible increment)
Would the function ceil from the math library help? For example:
from math import ceil
print(float(ceil(3.333333333333333)))

Categories