Create a density plot of vertical lines in python

I have a bunch of data containing coordinate intervals within one large region, which I want to plot, and then create a density plot showing where in the region more intervals overlap than others.
As a very basic example, I've just plotted some horizontal lines for given intervals. I can't really find any good examples of how to create a better plot of intervals. I've looked into seaborn, but I'm not entirely sure about it. So here I've just created a basic example of what I am trying to do.
import numpy as np
import matplotlib.pyplot as plt
x1 = np.linspace(1, 30, 100)
x2 = np.linspace(10, 40, 100)
x3 = np.linspace(2, 50, 100)
x4 = np.linspace(40, 60, 100)
x5 = np.linspace(30, 78, 100)
x6 = np.linspace(82, 99, 100)
x7 = np.linspace(66, 85, 100)
x = [x1, x2, x3, x4, x5, x6, x7]
y = np.linspace(1, len(x), len(x))
fig, ax = plt.subplots()
for i in range(len(x)):
    ax.hlines(y[i], xmin=x[i][0], xmax=x[i][-1], linewidth=1)
plt.xlim(-5, 105)
plt.show()
And then I would like to create a density plot of the number of intervals overlapping. Does anyone have suggestions on how to proceed with this?
Thanks for your help and suggestions.

This seems to do what you want:
def count(xi):
    # Sample the whole region at integer coordinates 0..100.
    samples = np.linspace(0, 100, 101)
    # True wherever a sample falls inside the interval (xi[0], xi[-1]].
    return (xi[0] < samples) & (samples <= xi[-1])
is_in_range = np.apply_along_axis(count, arr=x, axis=1)
density = np.sum(is_in_range, axis=0)
The general idea is to make some output linspace, then check to see if those coordinates are in the ranges in the array x — that's what the function count does. Then apply_along_axis runs this function on every row (i.e. every 1D array) in your array x.
Here's what I get when I plot density:
You might want to adjust the <= and < signs in the count function to handle the edges as you like.
If your actual data have a different format, or if there are multiple intervals in one array, you will need to adjust this.
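To visualize it, a minimal sketch (assuming the samples grid from count above and the matplotlib imports from the question):
samples = np.linspace(0, 100, 101)
fig, ax = plt.subplots()
# Filled step curve of how many intervals cover each sample point
ax.fill_between(samples, density, step="mid", alpha=0.4)
ax.plot(samples, density)
ax.set_xlabel("position in region")
ax.set_ylabel("number of overlapping intervals")
plt.show()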

Related

How to match two different graphs as much as possible by reducing difference in y-axis step by step by iteration in python

I am trying to match the two graphs drawn below as closely as possible by shifting one graph onto the other in python.
Figure
The two graphs have different ranges in x, and they are drawn from two array datasets.
What I am thinking is to shift one of them one step per iteration, and let it move until the difference between the two datasets (or graphs) is minimized.
Yet, I have no idea how to start.
What I'd do is try to shift one of the sets such that the root mean square of the difference between it and the other one is minimised. You could also narrow the criterion down to a region of interest in the data (I'm guessing around the peak). To compute the RMS error, you'll need to interpolate the data onto the same x-values. Here's an example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
# Create data
x0 = np.linspace(0, 2.*np.pi, 101)
y0 = np.sin(x0)
x1 = np.linspace(0, 2.*np.pi, 201)
y1 = np.sin(x1+0.1*np.pi)
def target(x):
    # Interpolate set 1 onto the grid of set 0 while shifting it by x.
    y1interp = np.interp(x0, x1 + x, y1)
    # Compute RMS error between the two data sets with set 1 shifted by x.
    return np.sqrt(np.sum((y0 - y1interp)**2.))
result = minimize(target, method="BFGS", x0=[0.])  # bounds=[(-0.2, 0.2)] would work with some methods only
print(result)
plt.figure()
plt.plot(x0, y0, "r", x1, y1, "b")
plt.plot(x1+result.x, y1, "k", lw=2)
plt.legend(["set 0", "set 1", "set 1 shifted"])
Result:
Note that scipy.optimize.minimize is quite sensitive to the settings, so you'll need to play with them to make it better suited to your problem:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
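If you do want a bounded search, a method that supports bounds could be swapped in, e.g. (an assumed variation, not part of the original run above):
result = minimize(target, x0=[0.], method="L-BFGS-B", bounds=[(-0.2, 0.2)])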

Least-square fitting for points in 2d doesn't pass through symmetrical axis

I'm trying to draw the best fitting line for given (x,y) data points.
The image below shows the data points (red pixels) and the estimated line (green), which I obtained using the following library call.
import numpy as np
# A is the design matrix (columns [x, 1]) built from the x coordinates
m, c = np.linalg.lstsq(A, y)[0]
Documentation for the library module used
We can see the data points are roughly symmetrically distributed. The problem is: why doesn't this line have a gradient similar to the long symmetric axis through the data points? Can you please explain whether this result is correct, and if so, how it gives the minimum error? (The line is drawn correctly using the gradient returned by the lstsq method.) Thank you.
EDIT
Here is the code I'm trying. The input image can be downloaded from here. In this code I've not forced the line to pass through the center of the pixel distribution. (Note: here I've used polyfit instead of lstsq; both give the same results.)
import numpy as np
import cv2
import math
img = cv2.imread('points.jpg', 1)
h, w = img.shape[:2]
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
points = np.argwhere(gray > 10)  # (row, col) pairs of sufficiently bright pixels
y = points[:, 0]
x = points[:, 1]
m, c = np.polyfit(x, y, 1)  # calculate least-squares fit line
# calculate two coordinates (x1,y1), (x2,y2) on the line
angle = np.arctan(m)
x1, y1, length = 0, int(c), 500
x2 = int(round(math.ceil(x1 + length * np.cos(angle)), 0))
y2 = int(round(math.ceil(y1 + length * np.sin(angle)), 0))
# draw the line on the color image
cv2.line(img, (x1, y1), (x2, y2), (0, 255, 0), 1, cv2.LINE_8)
# show the output image
cv2.namedWindow("Display window", cv2.WINDOW_AUTOSIZE)
cv2.imshow("Display window", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
How can I have the line pass through the longest symmetric axis of the pixel distribution? Can I use principal component analysis?
It's hard to say why this would be the case. The bottom line is that I can't see the data you're using, and I can't see what the calculated slope and y intercept are for the data you're using.
Here are a couple of things that could explain what we're seeing:
(1) The density of data points is actually quite different than it appears to a casual glance and everything is working properly.
(2) You're sending the wrong arguments to the least squares function and you've got a GIGO situation. (I haven't used numpy's least squares algorithm, so I can't check this.)
(3) The scatter plot and the line plot don't agree on the scale of the axes.
(4) The least squares function in question is broken.
(5) You're not passing the same data to the least squares algorithm as you're passing to the plotting routine.
(6) The data formatting is funky so that the scatter plot and least squares routines are interpreting your data differently.
I can't know which of these is the problem, and unless it's (3), I expect we'd need more data to be able to distinguish between these possibilities.
Here's how I'd proceed if I were you: (1) Create a small artificial data set that sits on a line and pass it to the least squares function and see if it spits out the right numbers. See if these look right when plotted or not. (2) If this looks okay, record the output of the least squares algorithm, see if you can find another least squares program to calculate the slope and y intercept and compare them. If they're the same, it's probably not the routine, it's probably something to do with plotting.
If you get this far and it's still a mystery, let us know what you've found and maybe we can make another suggestion.
Good luck.
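For step (1), a minimal sanity check might look like this (a sketch; note that lstsq needs a design matrix with a column of ones if you want it to fit an intercept rather than force the line through the origin):
import numpy as np
# Synthetic points on the known line y = 2x + 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
# Design matrix [x, 1]; without the ones column, the fit has no intercept
A = np.vstack([x, np.ones_like(x)]).T
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
print(m, c)  # should recover 2.0 and 1.0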
If the red dots truly represent your data, you are probably applying your linear regression function in a way that forces the line through the origin. How do I know? When using linear regression on two variables x and y, the line will pass through a few specific points, for example the point given by the average of x and the average of y, and, depending on your specification, a calculated or fixed intercept on the y axis. If all values of x and y are positive, you will get a line that looks like yours when the line is forced through the origin. Not much more can be said before you provide some reproducible data and code.
EDIT:
I didn't have much luck with the reproducible sample provided, so I built an example with random numbers to elaborate on my original answer. I think statsmodels is a decent library for linear regression analysis. First, I'll address this earlier comment:
If all variables of x and y are positive, you will have a line that looks like yours if the line is forced through the origin.
You'll see an increasing effect of this the larger your numbers are (the further away from the origin your numbers are). Using sm.OLS(y, sm.add_constant(x)).fit() and sm.OLS(y, x).fit() for two different sets of numbers will show you exactly what I mean. First, I'll run a regression on the dataset below without an estimated constant (the line goes through the origin). This will give us a plot that resembles your original plot:
# Libraries
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Data
np.random.seed(123)
x = np.random.normal(size=2500) + 100
y = x * 2 + np.random.normal(size=2500) + 100
# Regression
results1 = sm.OLS(y,x).fit()
regLine_origin = x*results1.params[0]
# Plot
fig, ax = plt.subplots()
ax.scatter(x, y, c='red', s=4)
ax.scatter(x, regLine_origin, c = 'green', s = 1)
ax.patch.set_facecolor('black')
plt.show()
Next, I'll include a constant in the regression. Now, the yellow line will represent what I think you were after in your question:
# Libraries
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Data
np.random.seed(123)
x = np.random.normal(size=2500) + 100
y = x * 2 + np.random.normal(size=2500) + 100
# Regression
results1 = sm.OLS(y,x).fit()
results2 = sm.OLS(y,sm.add_constant(x)).fit()
regLine_origin = x*results1.params[0]
regLine_constant = results2.params[0] + x*results2.params[1]
# Plot
fig, ax = plt.subplots()
ax.scatter(x, y, c='red', s=4)
ax.scatter(x, regLine_origin, c = 'green', s = 1)
ax.scatter(x, regLine_constant, c = 'yellow', s = 1)
ax.patch.set_facecolor('black')
plt.show()
And lastly, we can take a look at what happens when the numbers are closer to the origin, so to speak. Here, I'll remove the +100 part when the numbers are produced:
# The following is changed in the snippet above:
# Data
x = np.random.normal(size=2500)
y = x * 2 + np.random.normal(size=2500)
And that's why I think your original regression line is set to go through the origin. Have a look at the statsmodels package. Here you can study the details of the estimate by running print(results2.summary()):
And as you've already seen in the snippets above, you'll have direct access to the regression coefficients by using results2.params.
Edit 2: My explanation still isn't 100% valid. The x and y values have to differ a bit in size for this effect to show. You'll certainly find situations where the line goes through the origin no matter the size of the numbers.
Have a look at the different x labels, and you'll see what I mean.
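As for the principal component analysis idea raised in the question: the longest symmetric axis of the point cloud is its first principal axis, which falls out of the covariance matrix directly. A minimal sketch, assuming the x and y pixel coordinate arrays from the question's code:
import numpy as np
pts = np.column_stack([x, y]).astype(float)
centered = pts - pts.mean(axis=0)
# The eigenvector of the covariance matrix with the largest eigenvalue
# points along the longest axis of the distribution.
eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
direction = eigvecs[:, np.argmax(eigvals)]
slope = direction[1] / direction[0]  # gradient dy/dx of the symmetric axis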

Fill missing array values using extrapolated plot Python

I have a 2D numpy array containing X and Y data. The X axis contains time information with a resolution of nanoseconds. My problem occurs because I need to compare a simulated signal and a real signal. The problem with the simulated signal is that the simulator, for optimization purposes, uses different step sizes, as shown in fig. 1.
On the other hand, my real data was acquired by an oscilloscope, and its points are recorded exactly 1 ns apart. Because of this I need to have the same scale on the X axis to make a correct comparison. How can I get the extra points to make my data have a constant step between the points?
EDIT 1
I need these new points to fill my array so that the simulated data has a constant step, as shown in fig. 2.
The green points show an example of data extracted from the extrapolated data.
A common way to do this is to simply duplicate some points (adding a point with the same average value doesn't change most statistical measures much).
You then have to change the dataset every time you change the scale. That takes time on every scale change, but it is super easy. If you don't have to change the scale too often, you can try it.
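A minimal sketch of that duplication idea (the arrays here are made up for illustration): every point of the regular 1 ns grid takes the value of the last simulated sample at or before it.
import numpy as np
# Hypothetical simulated samples with uneven time steps (in ns)
t_sim = np.array([0.0, 1.0, 3.5, 4.0, 7.2, 10.0])
y_sim = np.array([0.0, 0.3, 0.9, 1.0, 0.4, 0.0])
# Regular 1 ns grid matching the oscilloscope data
t_grid = np.arange(0.0, 10.0, 1.0)
# Index of the last simulated sample at or before each grid point
idx = np.searchsorted(t_sim, t_grid, side="right") - 1
y_grid = y_sim[np.clip(idx, 0, len(y_sim) - 1)]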
This problem was solved using the scipy interpolate module, e.g.:
interpolate.py
import matplotlib.pyplot as plt
from scipy import interpolate as inter
import numpy as np
Fs = 0.1   # sampling frequency
f = 0.01   # signal frequency
sample = 10
x = np.arange(sample)
y = np.sin(2 * np.pi * f * x / Fs)
# Build an interpolator from the coarse samples ...
inte = inter.interp1d(x, y)
# ... and evaluate it on a finer, constant-step grid
new_x = np.arange(0, 9, 0.1)
new_y = inte(new_x)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(new_x,new_y,s=5,marker='.')
ax1.scatter(x,y,s=50,marker='*')
plt.show()
This code gives the following result.

Discrete fourier transformation from a list of x-y points

What I'm trying to do is calculate the period from a list of x-y points that has a periodic pattern. With my limited mathematics knowledge I know that the Fourier transform can do this sort of thing.
I'm writing Python code.
I found a related answer here, but it uses an evenly-distributed x axis, i.e. dt is fixed, which isn't the case for me. Since I don't really understand the math behind it, I'm not sure if it would work properly in my code.
My question is, does it work? Or, is there some method in numpy that already does my work? Or, how can I do it?
EDIT: All values are Pythonic float (i.e. double-precision)
For samples that are not evenly spaced, you can use scipy.signal.lombscargle to compute the Lomb-Scargle periodogram. Here's an example, with a signal whose dominant frequency is 2.5 rad/s.
from __future__ import division
import numpy as np
from scipy.signal import lombscargle
import matplotlib.pyplot as plt
np.random.seed(12345)
n = 100
x = np.sort(10*np.random.rand(n))
# Dominant periodic signal
y = np.sin(2.5*x)
# Add some smaller periodic components
y += 0.15*np.cos(0.75*x) + 0.2*np.sin(4*x+.1)
# Add some noise
y += 0.2*np.random.randn(x.size)
plt.figure(1)
plt.plot(x, y, 'b')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
dxmin = np.diff(x).min()
duration = x.ptp()
freqs = np.linspace(1/duration, n/duration, 5*n)
periodogram = lombscargle(x, y, freqs)
kmax = periodogram.argmax()
print("%8.3f" % (freqs[kmax],))
plt.figure(2)
plt.plot(freqs, np.sqrt(4*periodogram/(5*n)))
plt.xlabel('Frequency (rad/s)')
plt.grid()
plt.axvline(freqs[kmax], color='r', alpha=0.25)
plt.show()
The script prints 2.497 and generates the following plots:
As a starting point (a sketch of these steps follows below):
(I assume all coordinates are positive integers; otherwise map them to a reasonable range like 0..4095.)
find the max coordinates xMax, yMax in the list
make a 2D array with dimensions yMax, xMax
fill it with zeros
walk through your list, setting the array elements corresponding to the coordinates to 1
make a 2D Fourier transform
look for peculiarities (peaks) in the FT result
This page from SciPy covers the basics of how the Discrete Fourier Transform works:
http://docs.scipy.org/doc/numpy-1.10.0/reference/routines.fft.html
They also provide an API for the DFT. For your case, you should look at how to use fft2.
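A minimal sketch of the steps outlined above (the point list is made up for illustration):
import numpy as np
# Hypothetical integer (x, y) points with a periodic pattern in x
points = [(3, 5), (7, 5), (11, 5), (15, 5)]
xmax = max(p[0] for p in points) + 1
ymax = max(p[1] for p in points) + 1
# Binary image: 1 where a point exists, 0 elsewhere
grid = np.zeros((ymax, xmax))
for px, py in points:
    grid[py, px] = 1
# 2D DFT; strong off-center peaks in the magnitude indicate periodicity
spectrum = np.abs(np.fft.fft2(grid))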

Finding corresponding bins between two data sets

So I have two data sets which overlap in their parameter space:
I want to bin up the red set and find the standard deviation of each bin. Then for each point in the blue set, I want to find which red bin that point corresponds to and grab the standard deviation calculated for that bin.
So far, I've been using scipy.stats.binned_statistic_2d, but I'm not sure where to go from here:
import scipy.stats
import numpy as np
# given numpy recarrays red_set and blue_set with columns x, y, values
nbins = 50
red_bins = scipy.stats.binned_statistic_2d(red_set['x'],
                                           red_set['y'],
                                           red_set['values'],
                                           statistic=np.std,
                                           bins=nbins)
blue_bins = scipy.stats.binned_statistic_2d(blue_set['x'],
                                            blue_set['y'],
                                            blue_set['values'],
                                            statistic='count',
                                            bins=[red_bins[1], red_bins[2]])
Now, I don't know how to get the value of the corresponding red bin for each blue point. I know that scipy.stats.binned_statistic_2d's fourth return value is a binnumber for each input data point, but I don't know how to translate that to the actual calculated statistic (the standard deviation in this example).
I know that the blue set is getting binned exactly the same as the red (a quick plot will confirm this). It seems like it should be totally straightforward to grab the corresponding red bin, but I can't figure it out.
Let me know if I can make my question clearer
You need to make sure you specify the same range when binning the data; that way, the corresponding indices of the bins will be consistent. I've used the lower-level numpy function histogram2d here; the extension to standard deviations can be done in the same way using scipy.stats.binned_statistic_2d:
import numpy as np
import matplotlib.pyplot as plt
#Setup random data
red = np.random.randn(100,2)
blue = np.random.randn(100,2)
#plot
plt.plot(red[:,0],red[:,1],'r.')
plt.plot(blue[:,0],blue[:,1],'b.')
#Specify limits of binned data
xmin = -3.; xmax = 3.
ymin = -3.; ymax = 3.
#Bin data using hist2d
rbins, xrb, yrb = np.histogram2d(red[:,0],red[:,1],bins=10,range=[[xmin,xmax],[ymin,ymax]])
bbins, xbb, ybb = np.histogram2d(blue[:,0],blue[:,1],bins=10,range=[[xmin,xmax],[ymin,ymax]])
#Check that bins correspond to the same positions in space
assert all(xrb == xbb)
assert all(yrb == ybb)
#Obtain centers of the bins and plot the difference
xc = xrb[:-1] + 0.5 * (xrb[1:] - xrb[:-1])
yc = yrb[:-1] + 0.5 * (yrb[1:] - yrb[:-1])
#histogram2d returns counts as (nx, ny); transpose to contourf's (ny, nx) convention
plt.contourf(xc, yc, (rbins - bbins).T, alpha=0.4)
plt.colorbar()
plt.show()
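To then pull the red-bin statistic for each blue point, a sketch along these lines might work (assuming the red_bins result and the recarrays from the question's snippet; np.digitize returns 1-based indices, hence the shift and clip):
import numpy as np
stat, x_edge, y_edge, _ = red_bins
ix = np.clip(np.digitize(blue_set['x'], x_edge) - 1, 0, stat.shape[0] - 1)
iy = np.clip(np.digitize(blue_set['y'], y_edge) - 1, 0, stat.shape[1] - 1)
# standard deviation of the red bin each blue point falls into
blue_std = stat[ix, iy]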
