Finding a probability density function that reproduces a histogram in Python

All the data I actually have is a picture of a histogram, from which I can read off the bin heights and bin width, the median, and the one-sigma errors.
The histogram is skewed, so the 16th and 84th percentiles are not symmetric about the median. I found that the median and the errors can be reproduced with a skewed Gaussian function; however, the histogram generated from the PDF I found is difficult to match no matter how much I play with bin numbers and bin widths.
I understand that I can't possibly recreate the histogram exactly, but I would be very happy with something that is close enough.
My best idea is to loop over possible parameters of the skewed Gaussian, build a histogram for each, quantify the difference somehow (e.g. the difference in heights at all points), and keep the best one. I suspect that would be a very slow process, though, and I'm quite sure there is something in scipy that does this more quickly. Please point me to anything useful if possible.

IMO your best shot is to treat the data as points and fit a function with scipy.optimize.curve_fit.
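For example, a minimal sketch of that idea, assuming the bin centres and heights have been read off the picture (the numbers below are made up) and that scipy.stats.skewnorm is an acceptable stand-in for the skewed Gaussian:

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import skewnorm

# Bin centres and heights read off the picture of the histogram (made-up values)
centers = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
heights = np.array([0.05, 0.20, 0.45, 0.30, 0.15, 0.05, 0.01])

def skew_pdf(x, a, loc, scale, amp):
    # Skew-normal PDF with a free amplitude, so the heights need not be normalised
    return amp * skewnorm.pdf(x, a, loc=loc, scale=scale)

# Initial guesses: no skew, centred on the tallest bin, unit width and amplitude
p0 = [0.0, centers[np.argmax(heights)], 1.0, 1.0]
params, cov = curve_fit(skew_pdf, centers, heights, p0=p0)
print(params)  # fitted skewness (a), location, scale and amplitude

You can then check the result by drawing samples from skewnorm with the fitted parameters and histogramming them with the same bin edges as the original plot.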
This post might also help:

Related

Probability Density Function in histogram2D

I've been working on a model to show the number (of people) in groups who walk by a certain point, and the direction they come from.
I have data from 1994 up until 2020, with a time step of 10 minutes (a lot of data!).
After getting all the data, I used np.histogram2d to plot the distribution in a polar plot and got something like this:
For this I used plt.pcolormesh (not relevant, but that's how I got this).
The data I have is all stored in an np.array and the values are the ones obtained when np.histogram2d was used:
count, xedges, yedges = np.histogram2d(Umag, phi, bins=(xedges, yedges))
Where count is the data stored in the np.array mentioned before, Umag is the data used and phi is the direction the people come from.
I have now been asked to calculate the probability density function of this set of data, meaning that what I want to show is the probability of a group of 3 people coming from 45° (as an example).
I really don't know how to do this, so if someone has any suggestions, they would be greatly appreciated!
I have thought about using scipy.stats.norm.pdf but I can't seem to figure out how to do it!
Thank you for any help you can give!
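In case it is useful, a minimal sketch of one common approach: normalise the 2D counts so they sum to 1, which gives the probability of landing in each (Umag, direction) bin. The array names mirror the question, but the data and bin edges below are made up.

import numpy as np

# Made-up stand-ins for the Umag values and the direction phi (degrees)
rng = np.random.default_rng(0)
Umag = rng.normal(3.0, 1.0, size=10_000)
phi = rng.uniform(0.0, 360.0, size=10_000)

xedges = np.linspace(0.0, 6.0, 13)        # 12 bins for Umag
yedges = np.arange(0.0, 375.0, 15.0)      # 15-degree direction sectors

count, xedges, yedges = np.histogram2d(Umag, phi, bins=(xedges, yedges))

# Probability mass of each (Umag, direction) bin
prob = count / count.sum()

# e.g. the probability of the bin containing Umag = 3 and phi = 45 degrees
i = np.searchsorted(xedges, 3.0, side='right') - 1
j = np.searchsorted(yedges, 45.0, side='right') - 1
print(prob[i, j])

# Alternatively, density=True returns a true PDF (it integrates to 1 over the bin areas)
pdf, _, _ = np.histogram2d(Umag, phi, bins=(xedges, yedges), density=True)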

Is there a way to count the number of points within a certain area on a graph?

I've got output graphs that look like this:
My question is, is there an easy way for me to count the number of points within each of the obvious 'lines' or 'streaks' of particles? In other words, I need to find the density of each of the streaks separately. Most of them overlap, which is where my issue comes in.
I've tried specifying x and y limits but again, the overlapping comes into play. The existing code is just importing and plotting values.
Ken, thanks for your comment. I went down that path and found that single linkage works best for the type of clusters I have. I also had to find a multiplying factor for my own data first, because the clustering was failing where the data overlapped. With this data the different colours represent different clusters. The dendrogram x-axis is labelled with the cluster densities, but they aren't in order! I've yet to find an efficient way around this. I manually adjusted the dendrogram to produce 2 clusters first, which told me the density of the first shell (it produced 2 clusters: one of the first shell and one with everything else), then repeated it for 3, 4, etc.
Sorry if none of this makes sense! It's quite late/early here.
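A minimal sketch of the single-linkage approach described above, assuming the plotted points are in two arrays x and y; the data, the rescaling factor, and the cluster count are all placeholders to tune for your own data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up points forming two vertical streaks
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 0.05, 500), rng.normal(1.0, 0.05, 500)])
y = np.concatenate([rng.uniform(0, 10, 500), rng.uniform(0, 10, 500)])

# Rescale one axis (the multiplying factor mentioned above) so the streaks separate
scale = 20.0
pts = np.column_stack([x * scale, y])

Z = linkage(pts, method='single')                  # single-linkage hierarchical clustering
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the tree into 2 clusters

# Count the points in each cluster, i.e. the density of each streak
for lab in np.unique(labels):
    print(lab, np.sum(labels == lab))

Repeating with t=3, 4, ... mirrors the manual procedure of peeling off one shell at a time.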

Python adaptive histogram widths

I am currently working on a project where I have to bin data in up to 10 dimensions. This works totally fine with numpy.histogramdd; however, I have one serious obstacle:
My parameter space is pretty large, but only a fraction of it is actually inhabited by data (say, a few % or so...). In those regions the data is quite rich, so I would like to use relatively small bin widths. The problem, however, is that the RAM usage totally explodes. I see usage of 20 GB+ for only 5 dimensions, which is already not practical. I tried defining the grid myself, but the problem persists...
My idea would be to manually specify the bin edges, using very large bin widths for empty regions of the data space. Only in regions where I actually have data would I need to go to a finer scale.
I was wondering if anyone here knows of such an implementation already which works in arbitrary numbers of dimensions.
thanks 😊
I think you should first remap your data, then create the histogram, and then interpret the histogram knowing the values have been transformed. One possibility would be to tweak the histogram tick labels so that they display mapped values.
One possible way of doing it, for example, would be:
Sort one dimension of the data as a one-dimensional array;
Integrate this array, so you have a cumulative distribution;
Find the steepest part of this distribution, and choose a horizontal interval corresponding to a "good" bin size for the peak of your histogram - that is, a size that gives you good resolution;
Find the size of this same interval along the vertical axis. That will give you a bin size to apply along the vertical axis;
Create the bins using the vertical span of that bin - that is, "draw" horizontal, equidistant lines to create your bins, instead of the most common way of drawing vertical ones;
That way, you'll have lots of bins where the data is dense, and fewer bins where it is sparse (a sketch of this follows the notes below).
Two things to consider:
The mapping function is the cumulative distribution of the sorted values along that dimension. This can be quite arbitrary. If the distribution resembles some well known algebraic function, you could define it mathematically and use it to perform a two-way transform between actual value data and "adaptive" histogram data;
This applies to only one dimension. Care must be taken as to how this would work if the histograms from multiple dimensions are to be combined.
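A rough one-dimensional sketch of that idea, assuming the data for one dimension sits in a 1-D array; np.quantile effectively does the sort-and-integrate step, and the resulting equal-probability edges can then be passed per dimension to numpy.histogramdd (how best to combine dimensions is left open, as noted above):

import numpy as np

# Made-up 1-D data: densely populated near 0, sparse over a wide range elsewhere
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 0.1, 9_000), rng.uniform(-50.0, 50.0, 1_000)])

# Equal-probability bin edges: each bin holds roughly the same number of points,
# so the bins are narrow where the data is dense and wide where it is sparse
n_bins = 50
edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))

counts, edges = np.histogram(data, bins=edges)

# The same per-dimension edges can be handed to numpy.histogramdd:
# counts_nd, edges_nd = np.histogramdd(points, bins=[edges_dim0, edges_dim1, ...])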

KDE is very slow with large data

When I try to make a scatter plot, colored by density, it takes forever.
Probably because the length of the data is quite big.
This is basically how I do it:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
xy = np.vstack([np.array(x_values), np.array(y_values)])
z = gaussian_kde(xy)(xy)
plt.scatter(np.array(x_values), np.array(y_values), c=z, s=100, edgecolor='')
As an additional info, I have to add that:
>>> len(x_values)
809649
>>> len(y_values)
809649
Is there any other option to get the same result but with better speed?
No, there are no good solutions.
Every point has to be prepared and a circle drawn for it, which will probably be hidden by other points.
My tricks (note: these may change the output slightly):
Get the minimum and maximum, and set the image to that size, so that the figure does not need to be redone.
Remove as much data as possible:
duplicate data;
values rounded to a chosen precision (e.g. of the floats), after which the resulting duplicates are removed. You can calculate the precision from half the dot size (or from the resolution of the graph, if you want the original look).
Less data means more speed, and removing a point is far quicker than drawing it on the graph (where it would be overwritten anyway).
Often heatmaps are more interesting for huge data sets: they give more information. But in your case, I think you still have too much data.
Note: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde also has a nice example (with just 2000 points). In any case, that page also uses my first trick.
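A minimal sketch of the round-and-deduplicate trick; the data and the precision of one decimal place are made up, and the precision should really be chosen from the dot size or the figure resolution:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Made-up stand-ins for the x_values / y_values from the question
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x + rng.normal(scale=0.5, size=100_000)

# Round to a chosen precision and keep only one point per rounded location
precision = 1
pts = np.round(np.column_stack([x, y]), precision)
unique_pts = np.unique(pts, axis=0)

# Evaluate the KDE on the much smaller set of unique points
kde = gaussian_kde(unique_pts.T)
z = kde(unique_pts.T)

plt.scatter(unique_pts[:, 0], unique_pts[:, 1], c=z, s=100)
plt.show()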
I would suggest plotting a sample of the data.
If the sample is large enough you should get the same distribution.
Making sure the plot is representative of the entire data set is also quite easy, as you can simply take multiple samples and compare them.
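For instance, a minimal sketch of that sampling approach; the data and the sample size of 10,000 are placeholders:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Made-up stand-ins for the ~800k x_values / y_values from the question
rng = np.random.default_rng(0)
x_values = rng.normal(size=809_649)
y_values = x_values + rng.normal(scale=0.5, size=809_649)

# Draw a random subset of the points
idx = rng.choice(len(x_values), size=10_000, replace=False)
xs, ys = x_values[idx], y_values[idx]

xy = np.vstack([xs, ys])
z = gaussian_kde(xy)(xy)
plt.scatter(xs, ys, c=z, s=100)
plt.show()

Repeating this with a few different samples and comparing the plots is a quick check that the subset is representative.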

How to use interpolation to calculate a force based on angle

I am trying to make a Python script that will output a force based on a measured angle. The inputs are time, the curve, and the angle, but I am having trouble using interpolation to fit the force to the curve. I looked at scipy.interpolate, but I'm not sure it will help me because the points aren't evenly spaced.
numpy.interp does not require your points to be evenly distributed. I'm not certain whether "The inputs are time, the curve and the angle" means you have three independent variables; if so, you will have to adapt this quite a bit... But for one-variable problems, interp is the way to go.
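A minimal sketch of looking a force up from a measured angle with numpy.interp; the angle/force table below is made up, and np.interp only requires the x points (the angles here) to be sorted in increasing order, not evenly spaced:

import numpy as np

# Made-up calibration curve: force measured at unevenly spaced angles (degrees)
angles = np.array([0.0, 5.0, 12.0, 20.0, 35.0, 60.0, 90.0])
forces = np.array([0.0, 1.2, 3.5, 6.0, 9.8, 14.1, 16.0])

# Linear interpolation between the tabulated samples
measured_angle = 27.5
force = np.interp(measured_angle, angles, forces)
print(force)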
