I wrote the following script to integrate (average) data over intervals in Python:
# N = points to mean in the array
# data = original data
# data_mean = average data each N points
data_mean = np.array([np.mean(i) for i in np.array_split(data, len(data)/N)])
How could I do that in IDL?
There is a "mean" function, but is there an "array_split"-like one?
The array_split functionality is usually handled with REFORM, which creates a two- (or higher-) dimensional array from a 1-dimensional array using the same values. For example:
n = 20
data = randomu(seed, 100)
data = reform(data, 100 / n, n)
print, mean(data, dimension=2)
The IDL mean function is equivalent to the numpy mean function, and the IDL reform can be used similarly to the numpy array_split:
data_mean = mean(reform(data, n_elements(data) / N, N), dimension=2)
If you don't mind data ending up with different dimensions, you can greatly speed this up using the /overwrite keyword:
data_mean = mean(reform(data, n_elements(data) / N, N, /overwrite), dimension=2)
Finally, if you have a version of IDL before IDL 8.0, then you won't have the dimension keyword for the mean function. Use this (less elegant) pattern instead:
data_mean = total(reform(data, n_elements(data) / N, N), 2) / N
Note that this version with total also accepts the /nan keyword, so that it works even when some data are missing.
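For comparison, here is a rough NumPy-side sketch of the same block averaging, assuming len(data) is an exact multiple of N (the same restriction REFORM has). Note that np.nanmean divides by the number of finite points in each block, unlike total(..., /NAN) / N, which still divides by N:
import numpy as np

N = 20
data = np.random.random(100)
# contiguous blocks of N points, then the mean of each block
data_mean = np.mean(data.reshape(len(data) // N, N), axis=1)
# if some points are NaN, np.nanmean ignores them in each block's average
data_mean_nan = np.nanmean(data.reshape(len(data) // N, N), axis=1)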
I have to calculate the Fourier transform of acceleration data that I've already coded. I have to do it the old-fashioned way (that is, without the numpy np.fft.fft command, which I don't fully master either). So this is what I have for the integration:
ri = 1j # first time defining a complex number in python
Fmax = 50 # Hz, the maximum frequency to consider
df = 0.01 # frequency diferential
nf = int(Fmax / df) # number of sample points for frequency
# and I already have UD_Acc defined as a 1D numpy array, then the "for loop":
Int_UD = []
for i in range(UD_Acc.size):
    w = []
    for j in range(nf):
        w.append(2 * np.pi * df * (j - 1))
    Int_UD.append(Int_UD[i - 1] + UD_Acc[i] * np.exp(ri * w * (i - 1) * dt1))
First of all, inside the for loop the w variable triggers this warning:
Expected type 'complex', got 'List[Union[Union[float, int], Any]]' instead
And then, even if I run it, it says that the list index is out of range.
I know it may seem a little rudimentary to integrate like this, or to compute a Fourier transform without using scipy or np.fft, but it is for class and I'm trying to understand the basics, so thanks in advance.
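For reference, a minimal sketch of how such a hand-rolled transform is often organised (one coefficient per frequency, summed over all time samples) might look like the following; UD_Acc and dt1 are placeholders here, standing in for the real acceleration record and its sampling step:
import numpy as np

dt1 = 0.005                      # placeholder sampling interval in s
UD_Acc = np.random.randn(2000)   # placeholder for the real acceleration record

Fmax, df = 50, 0.01
freqs = np.arange(int(Fmax / df)) * df   # frequency axis in Hz
t = np.arange(UD_Acc.size) * dt1         # time axis in s

# one Fourier coefficient per frequency: sum of a(t) * exp(-2*pi*i*f*t) * dt
Int_UD = np.array([np.sum(UD_Acc * np.exp(-2j * np.pi * f * t)) * dt1
                   for f in freqs])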
I have a question about resampling a 2-D array.
Sometimes, geoscience data needs to be transformed from its original size to another size. If the ratio is the same for each axis, the task is simple: np.reshape allows coarsening a 100x100 2-D array to 50x50 without data loss. The code is shown below:
## create the original data
xc1, xc2, yc1, yc2 = 100, 110, 35, 45
XSIZE,YSIZE=100,100
lon,lat = np.linspace(xc1,xc2,XSIZE),np.linspace(yc1,yc2,YSIZE)
pop = np.random.uniform(low=1000, high=50000, size=(XSIZE*YSIZE,)).reshape(YSIZE,XSIZE)
## reshape
shape = np.array(pop.shape, dtype=float)
coarseness = 2 # the new shape will be 50 x 50
new_shape = coarseness * np.ceil(shape/coarseness).astype(int)
zp_pop = np.zeros(new_shape)
zp_pop[:int(shape[0]), :int(shape[1])] = pop
temp = zp_pop.reshape((new_shape[0] // coarseness, coarseness,
new_shape[1] // coarseness, coarseness))
coarse_pop = np.sum(temp, axis=(1,3))
print (pop.sum())
print (coarse_pop.sum())
However, when the coarsening factor is different for each axis, this method cannot be applied, so I turned to another approach. Here is an example where I tried to use an FFT to generate a 60x80 array as output:
from scipy import fftpack
pop_fft = fftpack.fft2(pop,shape = (60,80))
pop_res = fftpack.ifft2(pop_fft).real
print(pop.sum())
print(pop_res.sum())
254208134.8356425
122048754.13639387
The data loss was significant, which is why I am posting my issue here. Maybe the resampling function I used was not correct, or perhaps there is a better approach to this situation. Any advice or comments are highly appreciated!
When you set up the 'coarse array' yourself, you sum over adjacent entries instead of computing the average or interpolating.
This way the sums over all elements in the coarse and original arrays are identical: (coarse_pop.sum() - pop.sum()) / (0.5 * (pop.sum() + coarse_pop.sum())) evaluates to -1.1638426077573779e-16, i.e. only a tiny numerical error.
If you instead compare the mean of the fftpack-resampled coarse array, it matches up:
print(pop.mean())
print(pop_res.mean())
25606.832220313503
25496.03271480075
Alternatively, you can correct for the number of elements yourself:
print(pop.sum())
print(pop_res.sum()*100*100/(60*80))
256068322.20313504
254960327.14800745
I don't know the details of your problem, but the fftpack way of downsampling the array makes more sense to me. If it's not what you want, you can apply the prefactor to the original array, like pop_fft = fftpack.fft2(pop * 100 * 100 / (60 * 80), shape=(60, 80)).
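For example, a rough end-to-end sketch of that prefactor correction, assuming pop is the 100x100 array from the question:
import numpy as np
from scipy import fftpack

pop = np.random.uniform(low=1000, high=50000, size=(100, 100))
scale = pop.size / (60 * 80)                 # ratio of element counts
pop_fft = fftpack.fft2(pop * scale, shape=(60, 80))
pop_res = fftpack.ifft2(pop_fft).real
print(pop.sum(), pop_res.sum())              # the totals should now be comparable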
I am trying to implement a normalization function manually rather than using scikit-learn's. The reason is that I need to define the maximum and minimum parameters manually, and scikit-learn doesn't allow that alteration.
I successfully implemented this to normalize the values between 0 and 1, but it is taking a very long time to run.
Question: Is there a more efficient way to do this? How can I make it execute faster?
Shown below is my code:
scaled_train_data = scale(train_data)
def scale(data):
    for index, row in data.iterrows():
        X_std = (data.loc[index, "Close"] - 10) / (2000 - 10)
        data.loc[index, "Close"] = X_std
    return data
2000 and 10 are the values that I defined manually rather than taking the minimum and maximum of the dataset.
Thank you in advance.
Use numpy arrays. You can also set your min and max manually.
import numpy as np
data = np.array(df)
_min = np.min(data, axis=0)
_max = np.max(data, axis=0)
normed_data = (data - _min) / (_max - _min)
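If you want the fixed bounds from the question rather than the data's own extrema, the last line would simply become:
normed_data = (data - 10) / (2000 - 10)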
Why loop? You can just use
train_data['close'] = (train_data['close'] - 10)/(2000 - 10)
to make use of vectorized numpy functions. Of course, you could also put this in a function, if you prefer.
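For instance, a minimal sketch of the function form, with the manual bounds from the question as defaults (the DataFrame here is just a stand-in):
import pandas as pd

def scale_close(df, lo=10, hi=2000):
    # rescale the 'close' column linearly using the fixed bounds
    df['close'] = (df['close'] - lo) / (hi - lo)
    return df

# hypothetical data frame standing in for train_data
train_data = pd.DataFrame({'close': [12.0, 950.0, 1999.0]})
train_data = scale_close(train_data)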
Alternatively, if you want to rescale to a linear range, you could use sklearn.preprocessing.MinMaxScaler (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html). The advantage of this is that you can save the fitted scaler and then rescale the test data in the same manner.
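A rough sketch of that fit-once, reuse-on-test pattern; the column name and the small frames below are placeholders:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train_data = pd.DataFrame({'Close': [12.0, 950.0, 1999.0]})
test_data = pd.DataFrame({'Close': [500.0, 1500.0]})

scaler = MinMaxScaler()                      # rescales to [0, 1] by default
train_data[['Close']] = scaler.fit_transform(train_data[['Close']])
test_data[['Close']] = scaler.transform(test_data[['Close']])   # same parameters reused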
Short Question
I have a large 10000x10000 elements image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation on the values contained within each bin.
How do I extract the indices of each bin to efficiently perform my calculation using the bins values?
What I am looking for is a solution which avoids the bottleneck of having to select every time ind == j from my large array. Is there a way to obtain directly, in one go, the indices of the elements belonging to every bin?
Detailed Explanation
1. Straightforward Solution
One way to achieve what I need is to use code like the following (see e.g. this related answer), where I digitize my values and then have a j-loop selecting the digitized indices equal to j, like below:
import numpy as np
# This function func() is just a placemark for a much more complicated function.
# I am aware that my problem could be easily sped up in the specific case of
# of the sum() function, but I am looking for a general solution to the problem.
def func(x):
    y = np.sum(x)
    return y

vals = np.random.random(10**8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
result = [func(vals[ind == j]) for j in range(1, nbins)]
This is not what I want, as it selects ind == j from my large array every single time. This makes the solution very inefficient and slow.
2. Using binned_statistic
The above approach turns out to be the same as what is implemented in scipy.stats.binned_statistic, for the general case of a user-defined function. Using Scipy directly, an identical output can be obtained with the following:
import numpy as np
from scipy.stats import binned_statistic
vals = np.random.random(10**8)
results = binned_statistic(vals, vals, statistic=func, bins=100, range=[0, 1])[0]
3. Using labeled_comprehension
Another Scipy alternative is to use scipy.ndimage.measurements.labeled_comprehension. Using that function, the above example would become
import numpy as np
from scipy.ndimage import labeled_comprehension
vals = np.random.random(10**8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
result = labeled_comprehension(vals, ind, np.arange(1, nbins), func, float, 0)
Unfortunately, this form is also inefficient and, in particular, has no speed advantage over my original example.
4. Comparison with IDL language
To further clarify, what I am looking for is functionality equivalent to the REVERSE_INDICES keyword of the HISTOGRAM function in the IDL language. Can this very useful functionality be efficiently replicated in Python?
Specifically, using the IDL language the above example could be written as
vals = randomu(s, 1e8)
nbins = 100
bins = [0:1:1./nbins]
h = histogram(vals, MIN=bins[0], MAX=bins[-2], NBINS=nbins, REVERSE_INDICES=r)
result = dblarr(nbins)
for j=0, nbins-1 do begin
  jbins = r[r[j]:r[j+1]-1] ; Selects indices of bin j
  result[j] = func(vals[jbins])
endfor
The above IDL implementation is about 10 times faster than the NumPy one, because the bin indices do not have to be re-selected for every bin. The speed difference in favour of the IDL implementation increases with the number of bins.
I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic but can be orders of magnitude faster
import numpy as np
from scipy.sparse import csr_matrix
def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''
    N = len(values)
    r0, r1 = range
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))
    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]
I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.
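Hypothetical usage, mirroring the call from the question (note the positional arguments of this custom version):
vals = np.random.random(10**6)
result = binned_statistic(vals, vals, np.sum, 100, (0.0, 1.0))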
I assume that the binning, done in the example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all.
import matplotlib.pyplot as plt

vals = np.random.random(10**4)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
new_order = np.argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]
# slower way of calculating first_hit (first version of this post)
# _, first_hit = np.unique(ind, return_index=True)
# faster way:
first_hit = np.searchsorted(ind, np.arange(1, nbins-1))
first_hit.sort()
# example of using the data:
for j in range(len(first_hit) - 1):
    # I am using a plotting function for your func, to show that they cluster
    plt.plot(ordered_vals[first_hit[j]:first_hit[j+1]], 'o')
The resulting figure shows that the bins are indeed clusters, as expected.
You can halve the computation time by sorting the array first, then use np.searchsorted.
vals = np.random.random(10**8)
vals.sort()
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
results = [func(vals[np.searchsorted(ind, j, side='left'):
                     np.searchsorted(ind, j, side='right')])
           for j in range(1, nbins)]
Using 1e8 as my test case, I go from 34 seconds of computation to about 17.
One efficient solution is using the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
npi.group_by(ind).split(vals)
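A sketch of how the per-bin function from the question could then be applied to each group (ind and vals as defined earlier):
result = [func(g) for g in npi.group_by(ind).split(vals)]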
Pandas has very fast grouping code (I think it's written in C), so if you don't mind loading the library, you could do this:
import pandas as pd
pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').sum().values
or more generally:
pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').agg(func).values
The latter is slower for standard aggregation functions (like sum, mean, etc.), though.
I have an array where discrete sine wave values are recorded and stored. I want to find the max and min of the waveform. Since the sine wave data consists of voltages recorded with a DAQ, there will be some noise, so I want to do a weighted average. Assuming self.yArray contains my sine wave values, here is my code so far:
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range (0, length-(filtersize+1)):
    for y in range (0,filtersize):
        summation = sum(self.yArray[x+y])
    ave = summation/filtersize
    filterarray.append(ave)
My issue seems to be in the second for loop, where, depending on my averaging window size (filtersize), I want to sum up the values in the window and take their average. I receive an error saying:
summation = sum(self.yArray[x+y])
TypeError: 'float' object is not iterable
I am an EE with very little experience in programming, so any help would be greatly appreciated!
The other answers correctly describe your error, but this type of problem really calls out for NumPy. NumPy will run faster, be more memory efficient, and is more expressive and convenient here. Here's an example:
import numpy as np
import matplotlib.pyplot as plt
# make a sine wave with noise
times = np.arange(0, 10*np.pi, .01)
noise = .1*np.random.ranf(len(times))
wfm = np.sin(times) + noise
# smoothing it with a running average in one line using a convolution
# using a convolution, you could also easily smooth with other filters
# like a Gaussian, etc.
n_ave = 20
smoothed = np.convolve(wfm, np.ones(n_ave)/n_ave, mode='same')
plt.plot(times, wfm, times, -.5+smoothed)
plt.show()
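For instance, swapping the boxcar for the Gaussian mentioned in the comments above is just a different kernel in the same convolution; a rough sketch, reusing wfm and np from the block above:
sigma = 5  # kernel width in samples (arbitrary choice here)
kernel = np.exp(-0.5 * (np.arange(-3 * sigma, 3 * sigma + 1) / sigma) ** 2)
kernel /= kernel.sum()                       # normalize so the signal level is preserved
smoothed_gauss = np.convolve(wfm, kernel, mode='same')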
If you don't want to use numpy, it should also be noted that there's a logical error in your program that results in the TypeError. The problem is that in the line
summation = sum(self.yArray[x+y])
you're using sum inside the loop where you're also accumulating the sum. Either use sum without the loop, or loop through the array and add up the elements yourself, but not both (and it's doing both, i.e. applying sum to a single indexed array element, that leads to the error in the first place). That is, here are two solutions:
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range (0, length-(filtersize+1)):
    summation = sum(self.yArray[x:x+filtersize]) # sum over section of array
    ave = summation/filtersize
    filterarray.append(ave)
or
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range (0, length-(filtersize+1)):
    summation = 0.
    for y in range (0,filtersize):
        summation += self.yArray[x+y]
    ave = summation/filtersize
    filterarray.append(ave)
self.yArray[x+y] is returning a single item out of the self.yArray list. If you are trying to get a subset of the yArray, you can use the slice operator instead:
summation = sum(self.yArray[x:x+filtersize])
to return an iterable that the sum builtin can use.
A bit more information about python slices can be found here (scroll down to the "Sequences" section): http://docs.python.org/2/reference/datamodel.html#the-standard-type-hierarchy
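As a tiny illustration with a hypothetical list:
y = [1, 4, 2, 9]
print(sum(y[1:3]))   # 6 -- sums the items at indices 1 and 2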
You could use numpy, like:
import numpy
filtersize = 2
ysums = numpy.cumsum(numpy.array(self.yArray, dtype=float))
ylags = numpy.roll(ysums, filtersize)
ylags[0:filtersize] = 0.0
moving_avg = (ysums - ylags) / filtersize
Your original code attempts to call sum on the float value stored at yArray[x+y], where x+y evaluates to an integer index into the list.
Try:
summation = sum(self.yArray[x:x+filtersize])
Indeed numpy is the way to go. One of the nice features of Python is list comprehensions, which let you do away with the typical nested for-loop constructs. Here is an example for your particular problem:
import numpy as np
step=2
res = [np.sum(myarr[i:i+step], dtype=float) / step for i in range(len(myarr) - step + 1)]