Calculate histogram of distances between points in big data set - python

I have big data set, representing 1.2M points in 220 dimensional periodic space (x changes fom (-pi,pi))... (matrix: 1.2M x 220).
I would like to calculate histogram of distances between these points taking into account periodicity. I have written some code in python but still it works quite slow for my test case (I am not even trying to run it on the whole set...).
Can you maybe take a look and help me with some tweaking?
Any suggestions and comments much appreciated.
import numpy as np
# 1000x220 test set (-pi,pi)
d=np.random.random((1000, 220))*2*np.pi-np.pi
# calculating theoretical limit on the histogram range, max distance between
# two points can be pi in each dimension
m=np.zeros(np.shape(d)[1])+np.pi
m_=np.sqrt(np.sum(m**2))
# hist range is from 0 to mm
mm=np.floor(m_)
bins=mm/0.01
m=np.zeros(bins)
# proper calculations
import time
start_time = time.time()
for i in range(np.shape(d)[0]):
diff=d[:-(i+1),:]-d[i+1:,:]
diff=np.absolute(diff)
adiff=diff-np.pi
diff=np.pi-np.absolute(adiff)
s=np.sqrt(np.einsum('ij,ij->i', diff,diff))
m+=np.histogram(s,range=(0,mm),bins=bins)[0]
print time.time() - start_time

I think you will see the most improvement from breaking the main loop to smaller parts by dividing range(...) to a couple of smaller ranges and use the threading module to have a couple of threads run the loop concurrently

Related

How to parallelize writing to same cell in a numpy array?

Background: I have millions of points in 2D space with (x_position, y_position, value) associated with each point. I am trying to summarize these points by creating an image, where each pixel can contain multiple points. To summarize, each pixel stores the sum of values at that (x_pixel, y_pixel) location in the image.
Question: How can I do this efficiently? Currently, my code does something like this:
image = np.zeros((4096,4096))
for each point in data:
x_pixel, y_pixel = convertPointPos2PixelPos(point)
image[x_pixel, y_pixel] += point.getValue()
but the ETA for this code completing is 450 hours, which is unacceptable. Is there a way to parallelize this? The code is writing to the same image[x,y] index multiple times. I found StackOverflow posts that suggest using multiprocessing, but I think needing to lock to prevent race conditions will mean this will take just as much time as it would without parallelizing.
Assuming you want something on a regular grid, you can use simple division to bin your data. Here is an example:
size = (4096, 4096)
data = np.random.rand(100000000, 3)
image = np.zeros(size)
coords = data[:, :2]
min = coords.min(0)
max = coords.max(0)
index = np.floor_divide(coords - min, (max - min) / np.subtract(size, 1), out=np.empty(coords.shape, dtype=int), casting='unsafe')
index is now an array of indices into image where you want to add the corresponding values. You can do an unbuffered add using np.add.at:
np.add.at(image, tuple(index.T), data[:, -1])
If your data range is better defined than just the bounding box of the coordinates, you can save a little time by not computing coord.max() and coord.min().
The result is something like this:
This entire operation takes 6.4sec on my very moderately powered machine for 10M points, including the call to plt.imshow, plt.colorbar and garbage collection before runs.
Timing collected using the %%timeit cell magic in IPython.
Either way, you're well under 450 hours. Even if your coordinate transformation is not linear binning, I expect that you can run in reasonable time as long as you vectorize it properly. Also, multiprocessing is not likely to give you a huge boost since it requires copying data around.

Generating large simulations and inserting the same array multiple times into another array at different locations

I am working to generate a monte carlo simulation for oil wells. The end goal is to have all the wells with a smoothed probabilistic production curve. I have optimized what I can, but each of the 3 apply statements I am listing take so much time when I use my full dataset and the number of simulations I want. (Hours) The code I included is has 10 iterations. If you crank it up to 10,000 which is the goal it really starts to drag.
I have generated a Panda that has all the future wells I want to model with a probability of that well being chosen next to be drilled.
I then created a panda where I grouped everything into the categories I want to use to figure out the order that the model will choose the wells. So my "timing" panda contains my categories and an array of every index of those wells in those categories and an array of the well's probabilities.
This all is done in a few seconds. The next part works, but gets very slow.
Next I use a numpy generator choice with percentages to randomly generate the order of the wells for i simulations. As other posts have noted #njit does not work with the probability array. The result is 1 dimension of the array is the order that the wells will be chosen by each category, and the other dimension is each simulation. There are about 150 categories, and 10,000s of wells in each categories. I am hoping to run 10,000 simulations.
a is an array of indexes of wells that can be chosen
size is the length of that array
per is the probability that each well will get chosen
Next I link my timing panda to my panda with all of the wells in it. This attaches the previous array to the wells array. Then I search this array for the well index to figure out for each simulation when that specific well is going to get run. This generates a 1d array with what order that well is going be drilled in each simulation.
This function gets called on 100,000s of wells and as I increase the number of simulations it really slows down.
order is an array of the order each well is drilled per simulation
index is the index of that well
The final difficulty I am having is the averaging out the production curve for the wells. I have how much oil will be produced by each well per month. I need to insert that curve into the array at each point when the well is drilled, then average all of those values together to get the average production of the well given all the simulations.
I have also tried creating an np.zeros array then using the np.insert function, but I could not figure out how to insert an array multiple times without a loop and generating the initial array of 0's took longer than the current method I had. (I overcame inserting the array multiple times by covering everything to a string, inserting the type curve as a string then converting back to an array of numbers, but this did not seem efficient). I need to have the number of leading 0's
order is the time in months that each well will get drilled
curve is the production curve passed as a list
m is the highest value of the months that the well is drilled in all simulations
import numpy as np
from numba import njit
import datetime
import math
def TimingGenerator(a, size, p):
i = 10
g = np.random.Generator(np.random.PCG64())
order = np.concatenate([g.choice(a=a, size=size, replace=False, p=p) for z in range(i)]).reshape(i, size)
return order
#njit
def OrderGenerator(order, index):
result = np.where(order == index)[1]
return result
def CurveAverager(order, curve, m):
matrix = np.array([[0] * math.ceil(i) + curve + [0] * int((m - math.ceil(i))) for i in order])
result = np.mean(matrix, axis=0)
return result
begin_time = datetime.datetime.now()
size = 8000
g = np.random.Generator(np.random.PCG64())
a = g.choice(20_000, size=size, replace=False)
p = np.random.randint(1,100, size=size)
p = p/np.sum(p)
for i in range(150):
q = TimingGenerator(a,size,p)
print(datetime.datetime.now() - begin_time)
index = np.amin(q)
for i in range(100000):
order = OrderGenerator(q, index)
print(datetime.datetime.now() - begin_time)
order = order / 15
curve = list(range(600, 0, -1))
for i in range(20000):
avgcurve = CurveAverager(order, curve, size)
print(datetime.datetime.now() - begin_time)
Thanks for any help you can offer. I am willing to greatly alter my code if you can think of anything to help speed it up. Not sure if there is a better way to apply probabilities and smooth out the production curve which is really the end goal.
Cheers.

Python: Fitting lines to alle the decreasing parts of my data (max to min) and extrapolate them (max to max)?

I have daily water level measurements (hydraulic head) over several years (stored in a series with datetime index). I'm trying to fit a line to all the decreasing parts of the data. These straight lines should then be extrapolated until the next max of the data. If the first point is a minimum I want to fit a straight line till the next max. This is illustrated in the picture below.
I managed to code this problem in Python but in a very "ugly" way using 150 lines of code (lot of if statements).
My approach: smooth the data by fitting splines. Then use find_peaks of scipy.signal to find the extremas (multiply by -1 to get min). As this function does not deal with the first and the last point I used if statements to deal with this. Then I use two for loops to do the curve fitting and the extrapolation. I used one for loop in case the data starts with a min and another in case the data starts with a max as the boundaries of my "fit interval" and my "extrapolation interval" differ for each case. In case the data starts with a min I used a straight line for the first interval. The result of my code is shown in the image.
Image showing result of my code
Any ideas how to do this in a better way? Without using so many lines of code
The following code snippet shows my approach for the case where the data starts with a maximum
#hydraulic_head is a series of interpolated (spline) hydraulic head measurements with a datetime index
from scipy.signal import find_peaks
import pandas as pd
import numpy as np
peak_max=hydraulic_head[find_peaks(hydraulic_head)[0]] #hydraulic head at max
peak_min=hydraulic_head[find_peaks(hydraulic_head*-1)[0]] #hydraulic head at min
for gr in range(1,len(peak_max.index),1):
interval_fit=hydraulic_head[peak_max.index[gr-1]:peak_min.index[gr-1]] #interval to fit curve from max to min
t_fit=(interval_fit.index-interval_fit.index[0]).total_seconds().values #time in seconds
parameters=np.polyfit(t_fit,interval_max_min.values,1) #fit a line
parameter_estimated[gr]=parameterss #store the paramters of the line in a dict
interval_extrapolate=hydraulic_head[peak_max.index[gr-1]:peak_max.index[gr]] #interval to extrapolate
t_extrapolate=(interval_extrapolate.index-interval_extrapolate.index[0]).total_seconds().values #transform to time
values_extrapolated=parameters[0]*t_extrapolate+parameters[1] #extrapolate the line
new_index=interval_extrapolate.index #get the index from the extrapolated interval
new_series=pd.DataFrame(data=values_extrapolated,index=new_index,columns=['extrapolated']) #new data frame with extrapolated values
interpolation_out=pd.concat([interpolation_out,new_series]) #growing frame where lines are stored
Possible other approach: Using masks to find the intervals, numerate them and then posibily use groupby to extract the intervals. I didn't manage to do it this way.
It's my first question here. Open for any improvement on question formulation

How to generate noisy mock time series or signal (in Python)

Quite often I have to work with a bunch of noisy, somewhat correlated time series. Sometimes I need some mock data to test my code, or to provide some sample data for a question on Stack Overflow. I usually end up either loading some similar dataset from a different project, or just adding a few sine functions and noise and spending some time to tweak it.
What's your approach? How do you generate noisy signals with certain specs? Have I just overlooked some blatantly obvious standard package that does exactly this?
The features I would generally like to get in my mock data:
Varying noise levels over time
Some history in the signal (like a random walk?)
Periodicity in the signal
Being able to produce another time series with similar (but not exactly the same) features
Maybe a bunch of weird dips/peaks/plateaus
Being able to reproduce it (some seed and a few parameters?)
I would like to get a time series similar to the two below [A]:
I usually end up creating a time series with a bit of code like this:
import numpy as np
n = 1000
limit_low = 0
limit_high = 0.48
my_data = np.random.normal(0, 0.5, n) \
+ np.abs(np.random.normal(0, 2, n) \
* np.sin(np.linspace(0, 3*np.pi, n)) ) \
+ np.sin(np.linspace(0, 5*np.pi, n))**2 \
+ np.sin(np.linspace(1, 6*np.pi, n))**2
scaling = (limit_high - limit_low) / (max(my_data) - min(my_data))
my_data = my_data * scaling
my_data = my_data + (limit_low - min(my_data))
Which results in a time series like this:
Which is something I can work with, but still not quite what I want. The problem here is mainly that:
it doesn't have the history/random walk aspect
it's quite a bit of code and tweaking (this is especially a problem if i want to share a sample time series)
I need to retweak the values (freq. of sines etc.) to produce another similar but not exactly the same time series.
[A]: For those wondering, the time series depicted in the first two images is the traffic intensity at two points along one road over three days (midnight to 6 am is clipped) in cars per second (moving hanning window average over 2 min). Resampled to 1000 points.
Have you looked into TSimulus? By using Generators, you should be able generate data with specific patterns, periodicity, and cycles.
The TSimulus project provides tools for specifying the shape of a time series (general patterns, cycles, importance of the added noise, etc.) and for converting this specification into time series values.
Otherwise, you can try "drawing" the data yourself and exporting those data points using Time Series Maker.

Vectorizing for loops in python with numpy multidimensional arrays

I'm trying to improve the performance of this code below. Eventually it will be using much bigger arrays but I thought I would start of with something simple that works then look at where is is slow, optimise it then try it out on the full size. Here is the original code:
#Minimum example with random variables
import numpy as np
import matplotlib.pyplot as plt
n=4
# Theoretical Travel Time to each station
ttable=np.array([1,2,3,4])
# Seismic traces,measured at each station
traces=np.random.random((n, 506))
dt=0.1
# Forward Problem add energy to each trace at the deserired time from a given origin time
given_origin_time=1
for i in range(n):
# Energy will arrive at the sample equivelant to origin time + travel time
arrival_sample=int(round((given_origin_time+ttable[i])/dt))
traces[i,arrival_sample]=2
# The aim is to find the origin time by trying each possible origin time and adding the energy up.
# Where this "Stack" is highest is likely to be the origin time
# Find the maximum travel time
tmax=ttable.max()
# We pad the traces to avoid when we shift by a travel time that the trace has no value
traces=np.lib.pad(traces,((0,0),(round(tmax/dt),round(tmax/dt))),'constant',constant_values=0)
#Available origin times to search for relative to the beginning of the trace
origin_times=np.linspace(-tmax,len(traces),len(traces)+round(tmax/dt))
# Create an empty array to fill with our stack
S=np.empty((origin_times.shape[0]))
# Loop over all the potential origin times
for l,otime in enumerate(origin_times):
# Create some variables which we will sum up over all stations
sum_point=0
sqrr_sum_point=0
# Loop over each station
for m in range(n):
# Find the appropriate travel time
ttime=ttable[m]
# Grap the point on the trace that corresponds to this travel time + the origin time we are searching for
point=traces[m,int(round((tmax+otime+ttime)/dt))]
# Sum up the points
sum_point+=point
# Sum of the square of the points
sqrr_sum_point+=point**2
# Create the stack by taking the square of the sums dived by sum of the squares normalised by the number of stations
S[l]=sum_point#**2/(n*sqrr_sum_point)
# Plot the output the peak should be at given_origin_time
plt.plot(origin_times,S)
plt.show()
I think the problem i dont understand the broacasting and indexing of multidimensional arrays. After this I will be extended the dimensions to search for x,y,z which would be given by increaseing the dimension ttable. I will probably try and implement either pytables or np.memmap to help with the large arrays.
With some quick profiling, it appears that the line
point=traces[m,int(round((tmax+otime+ttime)/dt))]
is taking ~40% of the total program's runtime. Let's see if we can vectorize it a bit:
ttime_inds = np.around((tmax + otime + ttable) / dt).astype(int)
# Loop over each station
for m in range(n):
# Grap the point on the trace that corresponds to this travel time + the origin time we are searching for
point=traces[m,ttime_inds[m]]
We noticed that the only thing changing in the loop (other than m) was ttime, so we pulled it out and vectorized that part using numpy functions.
That was the biggest hotspot, but we can go a bit further and remove the inner loop entirely:
# Loop over all the potential origin times
for l,otime in enumerate(origin_times):
ttime_inds = np.around((tmax + otime + ttable) / dt).astype(int)
points = traces[np.arange(n),ttime_inds]
sum_point = points.sum()
sqrr_sum_point = (points**2).sum()
# Create the stack by taking the square of the sums dived by sum of the squares normalised by the number of stations
S[l]=sum_point#**2/(n*sqrr_sum_point)
EDIT: If you want to remove the outer loop as well, we need to pull otime out:
ttime_inds = np.around((tmax + origin_times[:,None] + ttable) / dt).astype(int)
Then, we proceed as before, summing over the second axis:
points = traces[np.arange(n),ttime_inds]
sum_points = points.sum(axis=1)
sqrr_sum_points = (points**2).sum(axis=1)
S = sum_points # **2/(n*sqrr_sum_points)

Categories