Probability Density Function in histogram2D - python

I've been working on a model to show the number of people in groups who walk by a certain point, and the direction they come from.
I have data from 1994 up to 2020, with a time step of 10 minutes (a lot of data!).
After getting all the data, I used np.histogram2d to plot the distribution in a polar plot and got something like this:
For this I used plt.pcolormesh (not relevant, but that's how I got this).
The data I have is all stored in an np.array, and the values are the ones obtained when np.histogram2d was used:
count, xedges, yedges = np.histogram2d(Umag, phi, bins=(xedges, yedges))
where count is the data stored in the np.array mentioned before, Umag is the data used, and phi is the direction the people come from.
I have now been asked to calculate the probability density function of this set of data, meaning that what I want to show is the probability of a group of 3 people coming from 45° (as an example).
I really don't know how to do this, so if someone has any suggestion it would be greatly appreciated!
I have thought about using scipy.stats.norm.pdf but I can't seem to figure out how to do it!
Thank you for any help you can give!
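One possible approach (not from the original post): np.histogram2d can normalise the counts for you via density=True, and the probability of landing in a single bin is then the density times the bin area. A minimal sketch, with made-up stand-in data for Umag and phi:

import numpy as np

# Hypothetical stand-ins for the real Umag / phi arrays.
rng = np.random.default_rng(0)
Umag = rng.integers(1, 10, size=5000)          # group sizes
phi = rng.uniform(0, 2 * np.pi, size=5000)     # arrival directions in radians

# density=True normalises the counts so the histogram integrates to 1.
pdf, xedges, yedges = np.histogram2d(Umag, phi, bins=(9, 36), density=True)

# Probability of a single bin = density * bin area.
bin_area = np.outer(np.diff(xedges), np.diff(yedges))
prob = pdf * bin_area
print(prob.sum())  # ~1.0

# e.g. probability that a group of 3 arrives from ~45 degrees:
i = np.searchsorted(xedges, 3, side='right') - 1
j = np.searchsorted(yedges, np.deg2rad(45), side='right') - 1
print(prob[i, j])

Equivalently, dividing the raw count array by count.sum() gives the per-bin probabilities directly.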

Related

Interpolation of 3D Arrays (python)

Hopefully there isn't already a similar question. So far, I am struggling with interpolating, in Python, between two different datasets of arrays in order to define the z-values for one of them.
In other words: this is my original data set, just plotted in 3D space.
Original data set
Now I am trying to interpolate z-values from the original data set (the best solution would be a "quadratic" polynomial function) onto these x/y points (z=0):
Sought data set with z=0, sought are the new-interpolated z-values
Unfortunately, I was only able to illustrate this kind of plot, which doesn't make sense at all:
My best solution - unfortunately
Here are my defined arrays: "pla_data.txt" is the original data; "sof_data.txt" is the desired data:
https://wetransfer.com/downloads/0eb1957ceae06e8a7ac44d71063d19cc20220113164938/417b2b853840b5019eccd7d68c3df16a20220113164938/cf9729
Hopefully, somebody could help me. Thank you!
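One way to attempt this (a sketch, not a verified solution for these particular files) is scipy.interpolate.griddata, under the assumption that both text files hold whitespace-separated x, y, z columns:

import numpy as np
from scipy.interpolate import griddata

# Assumption: both files contain three columns x, y, z.
pla = np.loadtxt("pla_data.txt")   # original data with known z-values
sof = np.loadtxt("sof_data.txt")   # target x/y points whose z is sought

# Interpolate z at the new x/y locations; 'cubic' is the closest built-in
# option to a smooth surface, 'linear' is more robust.
z_new = griddata(points=pla[:, :2], values=pla[:, 2],
                 xi=sof[:, :2], method='cubic')

# Points outside the convex hull of the original data come back as NaN;
# fall back to nearest-neighbour values there if needed.
mask = np.isnan(z_new)
z_new[mask] = griddata(pla[:, :2], pla[:, 2], sof[mask, :2], method='nearest')

If a genuinely quadratic polynomial surface is required, a least-squares fit of the terms 1, x, y, x², xy, y² would be an alternative to griddata.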

How to plot a cube or parallelepiped with a function-related color map

I am unfortunately quite inexperienced with Python, and programming in general. I am devoting a lot of time to getting better, but this one really got me.
I need to evaluate a time-evolving function within the boundaries of a solid, a cube if you like.
My idea was to plot a 3D surface with x, y and z being the dimensions of my solid, and the colormap being the values of the function I mentioned at a given point in time. The final result would be a video with the sequence of plots for a given time interval.
I've been banging my head against matplotlib recently, but I don't understand the need for numpy 2D arrays in surface plotting. The examples given in the docs are not quite relevant, as my function values come from a numerical solution, hence there is no explicit relation between x, y, z and F(x, y, z).
Does anyone have any suggestion? I hope I haven't been negligent on the doc reading on the topic.
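One suggestion, as a sketch under the assumption that F is (or can be resampled onto) a regular grid: matplotlib's Axes3D.voxels accepts a facecolors array, so the field values can be mapped through a colormap and painted onto the cube:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

# Hypothetical stand-in for the numerical solution: F sampled on a regular
# grid inside the cube at one instant in time.
nx = ny = nz = 10
x, y, z = np.meshgrid(np.linspace(0, 1, nx),
                      np.linspace(0, 1, ny),
                      np.linspace(0, 1, nz), indexing='ij')
F = np.sin(3 * x) * np.cos(3 * y) * z          # placeholder field

# Normalise F to [0, 1] and map it through a colormap; ax.voxels takes a
# facecolors array with one RGBA value per voxel.
norm = (F - F.min()) / (F.max() - F.min())
colors = cm.viridis(norm)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.voxels(np.ones_like(F, dtype=bool), facecolors=colors)
plt.show()

For the video, the same figure can be redrawn per time step, for example with matplotlib.animation.FuncAnimation, saving one frame per step.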

Is there a way to count the number of points within a certain area on a graph?

I've got output graphs that look like this:
My question is, is there an easy way for me to count the number of points within each of the obvious 'lines' or 'streaks' of particles? In other words, I need to find the density of each of the streaks separately. Most of them are overlapping which is where my issue comes in.
I've tried specifying x and y limits but again, the overlapping comes into play. The existing code is just importing and plotting values.
Ken, thanks for your comment. I went along that path and found that single linkage works best for the type of clusters I have. I also had to find a multiplying factor for my own data first, because the clustering was failing with the data overlapping. With this data the different colours represent different clusters. The dendrogram x-axis is labelled with the cluster densities, but they aren't in order! I'm yet to find an efficient way around this. I manually adjusted the dendrogram to produce 2 clusters first, which told me the density of the first shell (it produced 2 clusters, 1 for the first shell and 1 with everything else). Then I repeated it for 3, 4, etc.
Sorry if none of this makes sense! It's quite late/early here.
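For reference, a minimal single-linkage sketch along the lines described in the comment above, with made-up points standing in for the real streaks; the number of clusters to cut at is an assumption you would set from the plot:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical x/y point coordinates standing in for the plotted particles.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(loc=(i, i), scale=0.05, size=(200, 2))
                    for i in range(4)])

# Single-linkage hierarchical clustering, then cut the tree into a chosen
# number of clusters (here 4, one per visible streak).
Z = linkage(points, method='single')
labels = fcluster(Z, t=4, criterion='maxclust')

# Number of points in each streak:
for cluster_id, count in zip(*np.unique(labels, return_counts=True)):
    print(cluster_id, count)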

Finding a probability density function that reproduces a histogram in Python

So all the data that I actually have is a picture of a histogram from which I can get heights and bin width, the median and one sigma errors.
The histogram is skewed, so the 16th and 84th quantiles are not symmetric. I found that the median and the errors can be replicated with a skewed Gaussian function; however, the histogram resulting from my found pdf is difficult to match no matter how much I play with bin numbers and bin widths.
I understand that I can't possibly recreate the histogram exactly, but I would be very happy with something that is close enough.
My best idea is to loop through possible parameters of the skewed Gaussian, make a histogram, somehow quantify the difference (like the difference in heights at all points) and find the best one. I think that might be a very long process though, and I'm sure there is something in scipy that does this quicker. Please refer me to anything useful if possible.
IMO your best shot is to treat the data as points and fit a function with scipy.optimize.curve_fit
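A minimal sketch of that curve_fit suggestion, assuming the bin edges and heights have been read off the picture (the values below are placeholders) and fitting scipy.stats.skewnorm:

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import skewnorm

# Placeholder bin edges and heights standing in for the values read off the picture.
edges = np.linspace(-3, 5, 21)
centers = 0.5 * (edges[:-1] + edges[1:])
heights = skewnorm.pdf(centers, a=4, loc=0, scale=1.5)

# Fit a skew-normal density to the (center, height) points; p0 is a rough
# initial guess for (skewness, location, scale).
def skew_pdf(x, a, loc, scale):
    return skewnorm.pdf(x, a, loc=loc, scale=scale)

params, _ = curve_fit(skew_pdf, centers, heights, p0=[1, centers.mean(), 1])
print(params)  # fitted skewness, location, scale

If the picture gives raw counts rather than densities, divide the heights by heights.sum() * bin_width before fitting.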
This post might also help:

KDE is very slow with large data

When I try to make a scatter plot, colored by density, it takes forever.
Probably because the data set is quite big.
This is basically how I do it:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
xy = np.vstack([np.array(x_values), np.array(y_values)])
z = gaussian_kde(xy)(xy)
plt.scatter(np.array(x_values), np.array(y_values), c=z, s=100, edgecolor='none')
As an additional info, I have to add that:
>>> len(x_values)
809649
>>> len(y_values)
809649
Is there any other option to get the same results, but with better speed?
No, there are no good solutions.
Every point has to be prepared, and a circle drawn for it, which will probably be hidden by other points.
My tricks (note: these may slightly change the output):
Get the minimum and maximum, and set the image to that size, so that the figure does not need to be redone.
Remove as much data as possible:
remove duplicate data;
convert to a chosen precision (e.g. of floats) and remove duplicate data (a sketch of this is below). You may calculate the precision from half the dot size (or from the resolution of the graph, if you want the original look).
Less data means more speed. Removing a point is far quicker than drawing it on the graph (where it would be overwritten anyway).
Often heatmaps are more interesting for huge data sets: they give more information. But in your case, I think you still have too much data.
Note: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde also has a nice example (with just 2000 points). In any case, that page also uses my first point.
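A sketch of the "round and de-duplicate" trick mentioned above, with random stand-in data; the chosen precision is an assumption and should roughly match your marker size:

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical stand-ins for the real x_values / y_values arrays.
rng = np.random.default_rng(0)
x_values = rng.normal(size=100_000)
y_values = rng.normal(size=100_000)

# Round to a precision that roughly matches the marker size, then drop
# duplicates; far fewer points reach the expensive KDE evaluation.
xy = np.round(np.vstack([x_values, y_values]), decimals=1)
xy_reduced = np.unique(xy, axis=1)

z = gaussian_kde(xy_reduced)(xy_reduced)
print(xy.shape[1], "->", xy_reduced.shape[1], "points")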
I would suggest plotting a sample of the data.
If the sample is large enough you should get the same distribution.
Making sure the plot is representative of the entire data set is also quite easy, as you can simply take multiple samples and compare them.
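A sketch of the sampling approach, again with stand-in data; the sample size is an assumption to tune against how stable the resulting plot looks:

import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the real x_values / y_values arrays.
rng = np.random.default_rng(0)
x_values = rng.normal(size=800_000)
y_values = rng.normal(size=800_000)

# Evaluate the KDE on a random sample instead of all ~800k points.
idx = rng.choice(len(x_values), size=10_000, replace=False)
xy = np.vstack([np.asarray(x_values)[idx], np.asarray(y_values)[idx]])
z = gaussian_kde(xy)(xy)

plt.scatter(xy[0], xy[1], c=z, s=5)
plt.show()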
