Python: np.mean() giving wrong means?

The issue
So I have 50 netCDF4 data files that contain decades of monthly temperature predictions on a global grid. I'm using np.mean() to make an ensemble average of all 50 data files together while preserving time length & spatial scale, but np.mean() gives me two different answers. The first time I run that block of code, it gives me a result that, when averaged over latitude & longitude & plotted against the individual runs, is slightly lower than what the ensemble mean should be. If I re-run the block, it gives me a different mean that looks correct.
The code
I can't copy every line here since it's long, but here's what I do for each run.
import numpy as np
from netCDF4 import Dataset

#Historical (1950-2020) data
ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_195001-202012.nc") #Import the data file
tash1 = ncin_1.variables['tas'][:] #Extract the tas (temperature) variable
ncin_1.close() #Close the file to free memory
#Repeat for future (2021-2100) data
ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_202101-210012.nc")
tasr1 = ncin_1.variables['tas'][:]
ncin_1.close()
#Concatenate historical & future files together to make one time series array
tas11 = np.concatenate((tash1,tasr1),axis=0)
#Subtract the 1950-1979 mean to obtain anomalies
tas11 = tas11 - np.mean(tas11[0:359],axis=0,dtype=np.float64)
And I repeat that 49 more times for the other datasets. Each array (tas11, tas12, etc.) has the shape (1812, 64, 128), corresponding to time length in months, latitude, and longitude.
To get the ensemble mean, I do the following.
#Move all tas data to one array
alltas = np.zeros((1812,64,128,51)) #months, lat, lon, members (no ensemble mean value yet)
alltas[:,:,:,0] = tas11
(...)
alltas[:,:,:,49] = tas50
#Calculate ensemble mean & fill into 51st slot in axis 3
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
When I check a coordinate & month, the ensemble mean is off from what it should be. Here's what a plot of globally averaged temperatures from 1950-2100 looks like with the first mean (with monthly values averaged into annual values). The black line is the ensemble mean & the colored lines are the individual runs.
Obviously that deviated below the real ensemble mean. Here's what the plot looks like when I run alltas[:,:,:,50]=np.mean(alltas,axis=3,dtype=np.float64) a second time & keep everything else the same.
Much better.
The question
Why does np.mean() calculate the wrong value the first time? I tried specifying the data type as float64 when using np.mean(), as in this question: Wrong numpy mean value?
But it didn't work. Is there any way I can fix it so it works correctly the first time? I don't want this problem to occur in a calculation where a math error isn't so easy to notice.

In the line
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
the argument to mean should be alltas[:,:,:,:50]:
alltas[:,:,:,50] = np.mean(alltas[:,:,:,:50], axis=3, dtype=np.float64)
Otherwise you are including the final slab of zeros in the calculation of the ensemble means, which drags the result down by a factor of 50/51. That also explains why re-running the line looks right: after the first run, slot 50 no longer holds zeros but 50/51 of the true mean, so the second pass comes out at about 99.96% of the correct value, close enough to look fine on a plot but still not exact.
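As a side note, a sketch of a layout that avoids this class of bug entirely is to keep the members and the ensemble mean apart, stacking the 50 anomaly arrays (tas11 ... tas50 from the question) along a new axis:
import numpy as np

members = [tas11, tas12]  # ..., tas50 - the 50 anomaly arrays from the question
alltas = np.stack(members, axis=-1)                 #shape (1812, 64, 128, n_members)
ens_mean = alltas.mean(axis=-1, dtype=np.float64)   #ensemble mean, shape (1812, 64, 128)
#If the 51-slot layout is still needed, append the mean afterwards:
alltas = np.concatenate([alltas, ens_mean[..., None]], axis=-1)
This way the mean is never computed over an array that also contains placeholder zeros.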

Related

Strange results when scaling data using scikit learn

I have an input dataset that has 4 time series with 288 values for 80 days, so the actual shape is (80, 4, 288). I would like to cluster different days. I have 80 days and all of them have 4 time series: outside temperature, solar radiation, electrical demand, electricity prices. What I want is to group similar days with regard to these 4 time series combined into clusters. Days belonging to the same cluster should have similar time series.
Before clustering the days using k-means or Ward's method, I would like to scale them using scikit-learn. For this I have to transform the data into a 2-dimensional array with the shape (80, 4*288) = (80, 1152), as scikit-learn's StandardScaler does not accept 3-dimensional input. The StandardScaler just standardizes features by removing the mean and scaling to unit variance.
Now I scale this data using scikit-learn's StandardScaler:
import numpy as np
from sklearn.preprocessing import StandardScaler
import pandas as pd
data_Unscaled = pd.read_csv("C:/Users/User1/Desktop/data_Unscaled.csv", sep=";")
scaler = StandardScaler()
data_Scaled = scaler.fit_transform(data_Unscaled)
np.savetxt("C:/Users/User1/Desktop/data_Scaled.csv", data_Scaled, delimiter=";")
When I now compare the unscaled and scaled data e.g. for the first day (1 row) and the 4th time series (columns 864 - 1152 in the csv file), the results look quite strange as you can see in the following figure:
As far as I see it, they are not in line with each other. For example in the timeslots between 111 and 201 the unscaled data does not change at all whereas the scaled data fluctuates. I can't explain that. Do you have any idea why this is happening and why they don't seem to be in line?
Here is the unscaled input data with shape (80,1152): https://filetransfer.io/data-package/CfbGV9Uk#link
and here the scaled output of the scaling with shape (80,1152): https://filetransfer.io/data-package/23dmFFCb#link
You have two issues here: scaling and clustering. As the question title refers to scaling, I'll handle that one in detail. The clustering issue is probably better suited for CrossValidated.
You don't say it, but it seems natural that all temperatures, be it on day 1 or day 80, are measured on the same scale. The same holds for the other three variables. So, for the purpose of scaling you essentially have four time series.
StandardScaler, like basically everything in sklearn, expects your observations to be organised in rows and your variables in columns. It treats each column separately, subtracting its mean from all the values in the column and dividing the resulting values by their standard deviation.
I reckon from your data that the first 288 entries in each row correspond to one variable, the next 288 to the second one etc. You need to reshape these data to form 288*80=23040 rows and 4 columns, one for each variable.
You then apply StandardScaler to that array and reshape the result back into the original shape, with 80 rows and 4*288=1152 columns. The code below should do the trick:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
data_Unscaled = pd.read_csv("C:/Users/User1/Desktop/data_Unscaled.csv", sep=";", header=None)
X = data_Unscaled.to_numpy()                       # shape (80, 1152)
# reshape into 4 long columns, one per variable (each 80*288 = 23040 values long)
X_narrow = np.array([X[:, i*288:(i+1)*288].ravel() for i in range(4)]).T
scaler = StandardScaler()
X_narrow_scaled = scaler.fit_transform(X_narrow)   # scale each variable separately
# reshape back to the original (80, 1152) layout, day by day
X_scaled = np.array([X_narrow_scaled[i*288:(i+1)*288, :].T.ravel() for i in range(80)])
# Plot the original data for one day and one variable:
i = 3   # variable index (the 4th time series)
j = 0   # day index (the first day)
plt.plot(X[j, i*288:(i+1)*288])
plt.title('TimeSeries_Unscaled')
plt.show()
# plot the scaled data:
plt.plot(X_scaled[j, i*288:(i+1)*288])
plt.title('TimeSeries_Scaled')
plt.show()
resulting in the following graphs:
The line
X_narrow = np.array([X[:, i*288:(i+1)*288].ravel() for i in range(4)]).T
uses a list comprehension to generate the four columns of the long, narrow array X_narrow. Basically, it is just shorthand for a for-loop over your four variables. It takes the first 288 columns of X, flattens them into a vector, and puts that vector into the first column of X_narrow. Then it does the same for the next 288 columns, X[:, 288:576], and then for the third and the fourth block of the 288 observed values per day. This way, each column in X_narrow contains one long time series, spanning 80 days (with 288 observations per day), of exactly one of your variables (outside temperature, solar radiation, electrical demand, electricity prices).
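For what it's worth, the same rearrangement can be written without list comprehensions by reshaping and transposing; a sketch that assumes the same (80, 1152) layout with each variable occupying a block of 288 columns, and X / StandardScaler from the snippet above:
X_narrow = X.reshape(80, 4, 288).transpose(0, 2, 1).reshape(-1, 4)    # (23040, 4), one column per variable
X_narrow_scaled = StandardScaler().fit_transform(X_narrow)
X_scaled = X_narrow_scaled.reshape(80, 288, 4).transpose(0, 2, 1).reshape(80, -1)    # back to (80, 1152)
This produces the same arrays as the list-comprehension version, so the choice is purely a matter of taste.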
Now, you might try to cluster X_scaled using K-means, but I doubt it will work. You have just 80 points in a 1152-dimensional space, so the curse of dimensionality will almost certainly kick in. You'll most probably need to perform some kind of dimensionality reduction, but, as I noted above, that's a different question.

Python: efficient way to calculate moving average for fixed time window (NOT fixed observation window)

Problem description
Say I have:
vector of time, dtype is numpy datetime64,
vector of parameters, dtype is numpy float
time horizon, dtype is numpy timedelta64
And time.shape == parameters.shape. Values of time are unique and distances between elements are not even.
The goal I have: for each moment t of time, calculate some statistic (for instance mean, min, max, sum, etc.) of the parameters vector over the time period from time[t] - horizon to time[t].
The rookie way would be to use a loop (which I want to avoid for performance reasons) or some pandas aggregation/resampling (also not ideal, as aggregation creates a new time vector, while I want to preserve my original one).
My current approach
I create the following matrix. The visualization below is based on real data and shows why I need a different range to calculate the statistic for each observation separately: sometimes 15 minutes of history contains 5,000 observations, sometimes only a few hundred. This is also something I measure - how many events occurred within the fixed time horizon.
past = (time < time[:, None]) & (time > (time - horizon)[:, None])
plt.imshow(past)
The first problem: creating a matrix like the one above is time-consuming for long observation vectors. Is there a better way to create such a matrix? The matrix shown represents real data for one day, but the vectors can be longer (up to 50,000 unique observations, and what I'm really aiming for is scalability).
Later I use TensorFlow to calculate the desired statistic: first I multiply the matrices together, so I have data only where past is True, and then I calculate the desired statistic (mean, count, or whatever I want) over the rows of the produced matrix. What is returned is a vector with shape == parameters.shape.
The second question: is there a better way to do that? By better, of course, I mean faster.
EDIT
Sample code
import datetime
import numpy as np
import matplotlib.pyplot as plt

def multiply_time(param, time):
    # time is the boolean `past` matrix; keep parameter values only where past is True
    if param.shape[0] == 1 or param.ndim == 1:
        _temp_param = np.ma.masked_equal(param * time, 0).data
    else:
        _temp_param = np.ma.masked_equal(np.sum(param, axis=1) * time, 0).data
    # average the remaining (non-zero) values in each row
    return_param = np.nanmean(np.where(_temp_param != 0, _temp_param, np.nan), axis=1)
    return return_param
horizon = np.timedelta64(10,'s')
increment = np.timedelta64(1,'s')
vector_len = 100
parameters = np.random.rand(vector_len)
# create time vector where distances between elements are not even
increment_vec = np.cumsum(np.random.randint(0,10,vector_len)*increment)
time = np.datetime64(datetime.datetime.now()) + increment_vec
past = (time < time[:, None]) & (time > (time - horizon )[:, None])
plt.imshow(past)
result = multiply_time(parameters, past)
import pandas as pd
pd_result = pd.DataFrame(parameters).rolling(10,1).mean()
plt.plot(time,result, c='r', label='desired')
plt.plot(time,parameters,c='g', label='original')
plt.plot(time,pd_result,c='b', label='pandas')
plt.legend()
plt.show()
EDIT 2:
I guess we can close this, as the answer using pandas rolling gives the best results.
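For reference, a minimal sketch of that pandas time-based rolling approach (assuming time and parameters from the sample code above): pandas accepts a time offset as the window when the Series has a DatetimeIndex, so the original time vector is preserved.
import pandas as pd

s = pd.Series(parameters, index=pd.DatetimeIndex(time))
rolling_mean = s.rolling('10s').mean()      #mean over the trailing 10-second window
rolling_count = s.rolling('10s').count()    #number of events in the same window
The window endpoints differ slightly from the past matrix (pandas includes the current observation by default); the closed= argument of rolling() can adjust that if needed.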

Generating large simulations and inserting the same array multiple times into another array at different locations

I am working to generate a Monte Carlo simulation for oil wells. The end goal is to have all the wells with a smoothed probabilistic production curve. I have optimized what I can, but each of the 3 apply statements I am listing takes a very long time (hours) when I use my full dataset and the number of simulations I want. The code I included has 10 iterations. If you crank it up to 10,000, which is the goal, it really starts to drag.
I have generated a pandas DataFrame that has all the future wells I want to model, with the probability of each well being chosen next to be drilled.
I then created a DataFrame where I grouped everything into the categories I want to use to determine the order in which the model will choose the wells. So my "timing" DataFrame contains my categories, an array of every well index in each category, and an array of the wells' probabilities.
This all is done in a few seconds. The next part works, but gets very slow.
Next I use a NumPy Generator's choice() with the probabilities to randomly generate the order of the wells for i simulations. As other posts have noted, njit does not work with the probability array. In the result, one dimension of the array is the order in which the wells will be chosen for each category, and the other dimension is each simulation. There are about 150 categories, 10,000s of wells in each category, and I am hoping to run 10,000 simulations.
a is an array of indexes of wells that can be chosen
size is the length of that array
p is the probability that each well will get chosen
Next I link my timing DataFrame to the DataFrame with all of the wells in it. This attaches the previous array to the wells array. Then I search this array for the well index to figure out, for each simulation, when that specific well is going to get drilled. This generates a 1-D array with the order that well is going to be drilled in, per simulation.
This function gets called on 100,000s of wells, and as I increase the number of simulations it really slows down.
order is an array of the order each well is drilled per simulation
index is the index of that well
The final difficulty I am having is averaging out the production curve for the wells. I have how much oil each well will produce per month. I need to insert that curve into the array at the point when the well is drilled in each simulation, then average all of those values together to get the average production of the well across all the simulations.
I have also tried creating an np.zeros array and then using the np.insert function, but I could not figure out how to insert an array multiple times without a loop, and generating the initial array of 0's took longer than the current method I had. (I overcame inserting the array multiple times by converting everything to a string, inserting the type curve as a string, and then converting back to an array of numbers, but this did not seem efficient.) I need the number of leading 0's to match the month in which the well is drilled in each simulation.
order is the time in months that each well will get drilled
curve is the production curve passed as a list
m is the highest value of the months that the well is drilled in all simulations
import numpy as np
from numba import njit
import datetime
import math

def TimingGenerator(a, size, p):
    i = 10  # number of simulations (10 here; the goal is 10,000)
    g = np.random.Generator(np.random.PCG64())
    order = np.concatenate([g.choice(a=a, size=size, replace=False, p=p) for z in range(i)]).reshape(i, size)
    return order

#njit
def OrderGenerator(order, index):
    result = np.where(order == index)[1]
    return result

def CurveAverager(order, curve, m):
    matrix = np.array([[0] * math.ceil(i) + curve + [0] * int(m - math.ceil(i)) for i in order])
    result = np.mean(matrix, axis=0)
    return result
begin_time = datetime.datetime.now()

size = 8000
g = np.random.Generator(np.random.PCG64())
a = g.choice(20_000, size=size, replace=False)
p = np.random.randint(1, 100, size=size)
p = p / np.sum(p)

for i in range(150):
    q = TimingGenerator(a, size, p)
print(datetime.datetime.now() - begin_time)

index = np.amin(q)
for i in range(100000):
    order = OrderGenerator(q, index)
print(datetime.datetime.now() - begin_time)

order = order / 15
curve = list(range(600, 0, -1))
for i in range(20000):
    avgcurve = CurveAverager(order, curve, size)
print(datetime.datetime.now() - begin_time)
Thanks for any help you can offer. I am willing to greatly alter my code if you can think of anything to help speed it up. Not sure if there is a better way to apply probabilities and smooth out the production curve which is really the end goal.
Cheers.
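Not a full answer, but a sketch of one way the CurveAverager step could be vectorized: instead of building Python lists row by row, the shifted copies of the curve can be accumulated with np.add.at. This assumes the same row layout as CurveAverager (row length m + len(curve)) and that every value in order is at most m; the function name is just illustrative.
import numpy as np

def curve_averager_fast(order, curve, m):
    curve = np.asarray(curve, dtype=float)
    starts = np.ceil(order).astype(np.int64)          #leading-zero count per simulation
    total = np.zeros(m + curve.size)                  #same row length as CurveAverager
    cols = starts[:, None] + np.arange(curve.size)    #target columns for each shifted copy
    np.add.at(total, cols, curve)                     #accumulate every shifted copy of the curve
    return total / starts.size                        #average over simulations
This builds a single accumulator array instead of a full (n_simulations, m + len(curve)) matrix on every call; whether that is enough at 10,000 simulations would need profiling.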

Python: Fitting lines to all the decreasing parts of my data (max to min) and extrapolating them (max to max)?

I have daily water level measurements (hydraulic head) over several years (stored in a series with datetime index). I'm trying to fit a line to all the decreasing parts of the data. These straight lines should then be extrapolated until the next max of the data. If the first point is a minimum I want to fit a straight line till the next max. This is illustrated in the picture below.
I managed to code this problem in Python, but in a very "ugly" way using 150 lines of code (lots of if statements).
My approach: smooth the data by fitting splines. Then use find_peaks from scipy.signal to find the extrema (multiplying by -1 to get the minima). Since this function does not handle the first and the last point, I used if statements to deal with those. Then I use two for loops to do the curve fitting and the extrapolation: one for the case where the data starts with a min and another for the case where it starts with a max, as the boundaries of my "fit interval" and my "extrapolation interval" differ in each case. If the data starts with a min, I use a straight line for the first interval. The result of my code is shown in the image.
Image showing result of my code
Any ideas how to do this in a better way, without using so many lines of code?
The following code snippet shows my approach for the case where the data starts with a maximum
#hydraulic_head is a series of interpolated (spline) hydraulic head measurements with a datetime index
from scipy.signal import find_peaks
import pandas as pd
import numpy as np

peak_max = hydraulic_head[find_peaks(hydraulic_head)[0]]    #hydraulic head at the maxima
peak_min = hydraulic_head[find_peaks(hydraulic_head*-1)[0]] #hydraulic head at the minima

parameter_estimated = {}           #parameters of the fitted lines, keyed by interval
interpolation_out = pd.DataFrame() #growing frame where the extrapolated lines are stored

for gr in range(1, len(peak_max.index), 1):
    interval_fit = hydraulic_head[peak_max.index[gr-1]:peak_min.index[gr-1]] #interval to fit the curve, from max to min
    t_fit = (interval_fit.index - interval_fit.index[0]).total_seconds().values #time in seconds
    parameters = np.polyfit(t_fit, interval_fit.values, 1) #fit a line
    parameter_estimated[gr] = parameters #store the parameters of the line in a dict
    interval_extrapolate = hydraulic_head[peak_max.index[gr-1]:peak_max.index[gr]] #interval to extrapolate, max to max
    t_extrapolate = (interval_extrapolate.index - interval_extrapolate.index[0]).total_seconds().values #transform to time
    values_extrapolated = parameters[0]*t_extrapolate + parameters[1] #extrapolate the line
    new_index = interval_extrapolate.index #get the index from the extrapolated interval
    new_series = pd.DataFrame(data=values_extrapolated, index=new_index, columns=['extrapolated']) #new data frame with extrapolated values
    interpolation_out = pd.concat([interpolation_out, new_series]) #append to the growing frame
Possible other approach: using masks to find the intervals, numbering them, and then possibly using groupby to extract the intervals. I didn't manage to do it this way (a sketch of that idea is given below).
It's my first question here, so I'm open to any suggestions for improving its formulation.
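A minimal sketch of that mask/groupby idea, assuming hydraulic_head is a pandas Series with a DatetimeIndex; it only identifies the decreasing runs and fits a line to each one, without the extrapolation to the next max:
import numpy as np
import pandas as pd

falling = hydraulic_head.diff().lt(0)                #True wherever the series decreases
segment_id = (falling != falling.shift()).cumsum()   #number consecutive runs of equal state
fits = {}
for seg_id, seg in hydraulic_head.groupby(segment_id):
    if len(seg) < 2 or not falling.loc[seg.index].all():
        continue                                     #keep only the decreasing runs
    t = (seg.index - seg.index[0]).total_seconds().values
    fits[seg_id] = np.polyfit(t, seg.values, 1)      #slope & intercept of the fitted line
Each entry in fits could then be evaluated over the corresponding max-to-max interval, exactly as in the loop above.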

How do I get peak values back from fourier transform?

I suspect that there's something I'm missing in my understanding of the Fourier Transform, so I'm looking for some correction (if that's the case). How should I gather peak information from the first plot below?
The dataset is hourly data for 911 calls over the past 17 years (for a particular city).
I've removed the trend from my data, and am now removing the seasonality. When I run the Fourier transform, I get the following plot:
I believe the dataset does have some seasonality to it (looking at weekly data, I have this pattern):
How do I pick out the values of the peaks in the first plot? Presumably, for all of the "peaks" under, say, 5000 in the first plot, I may ignore including that seasonality in my final model, but only at a loss of accuracy, correct?
Here's the bit of code I'm working with, currently:
import numpy as np
import matplotlib.pyplot as plt
from scipy import fftpack

fft = fftpack.fft(calls_grouped_hour.detrended_residuals - calls_grouped_hour.detrended_residuals.mean())
plt.plot(1./(17*365)*np.arange(len(fft)), np.abs(fft))
plt.xlim([-.1, 23/2]);
EDIT:
After Mark Snider's initial answer, I have the following plot:
Adding code attempt to get peak values from fft:
Do I need to convert the values back using ifft first?
fft_x_y = np.stack((fft.real, fft.imag), -1)
peaks = []
for x, y in np.abs(fft_x_y):
    if y >= 0:
        peaks.append(x)
peaks = np.unique(peaks)
print('Length: ', len(peaks))
print('Peak values: ', '\n', np.sort(peaks))
threshold = 5000
fft[np.abs(fft)<threshold] = 0
This'll give you an fft that ignores everything except the peaks. And no, I wouldn't imagine that the "noise" represents actual seasonality. The peak at fft[0] doesn't represent seasonality, either - it's a multiple of the mean of the data, so if you plan on subtracting the ifft of the peaks I wouldn't include fft[0] either unless you want your data to be centered.
If you want just the peak values and not the full fft that you can invert, you can just do this:
peaks = [np.abs(value) for value in fft if np.abs(value)>threshold]
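To also recover which frequencies the surviving peaks sit at, a short sketch using the same 1/(17*365) scaling as the plotting code in the question (so the frequencies come out in cycles per day):
peak_idx = np.where(np.abs(fft) > threshold)[0]
peak_idx = peak_idx[(peak_idx != 0) & (peak_idx < len(fft)//2)]  #drop the mean term and the mirrored half
freqs = peak_idx / (17*365)            #cycles per day, matching the question's x-axis
amplitudes = np.abs(fft[peak_idx])
for f, a in zip(freqs, amplitudes):
    print(f"frequency {f:.4f} cycles/day, amplitude {a:.1f}")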
