KDE is very slow with large data - python

When I try to make a scatter plot colored by density, it takes forever, probably because the data set is quite large.
This is basically how I do it:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
xy = np.vstack([np.array(x_values), np.array(y_values)])
z = gaussian_kde(xy)(xy)
plt.scatter(np.array(x_values), np.array(y_values), c=z, s=100, edgecolor='none')
As additional info, I should add that:
>>> len(x_values)
809649
>>> len(y_values)
809649
Is there any other option to get the same result, but with better performance?

No, there are no really good solutions.
Every point has to be prepared and a circle drawn for it, and most of these will end up hidden behind other points.
My tricks (note that these may change the output slightly):
Get the minimum and maximum first and set the figure to that size, so the figure does not need to be redrawn.
Remove as much data as possible:
drop duplicate points;
round the values to a chosen precision and then drop the duplicates (see the sketch below). You can derive the precision from half the dot size (or from the resolution of the graph, if you want the original look).
Less data means more speed: removing a point is far quicker than drawing one that will only be painted over anyway.
Often a heatmap is more interesting for huge data sets: it gives more information. But in your case, I think you still have too much data.
Note: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde also has a nice example (with just 2000 points). In any case, that page also uses my first point.
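A minimal sketch of the rounding-and-deduplication idea, assuming x_values and y_values are the lists from the question and that two decimal places is an acceptable precision for this data:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# Round to a chosen precision (assumption: 2 decimals is close enough here),
# then keep only the unique points - far fewer points to evaluate and draw.
xy = np.vstack([x_values, y_values])
xy_unique = np.unique(np.round(xy, decimals=2), axis=1)

z = gaussian_kde(xy_unique)(xy_unique)
plt.scatter(xy_unique[0], xy_unique[1], c=z, s=100, edgecolor='none')
plt.show()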

I would suggest plotting a sample of the data.
If the sample is large enough you should get the same distribution.
It is also easy to check that the plot is representative of the entire data set: simply take several samples and compare them.
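A quick sketch of the sampling approach, assuming a sample of 10,000 points is enough to show the distribution (that size is an assumption; adjust as needed):
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

x = np.array(x_values)
y = np.array(y_values)

# Draw a random subset of the indices (assumption: 10,000 points is representative)
rng = np.random.default_rng(0)
idx = rng.choice(len(x), size=10000, replace=False)

xy = np.vstack([x[idx], y[idx]])
z = gaussian_kde(xy)(xy)
plt.scatter(x[idx], y[idx], c=z, s=100, edgecolor='none')
plt.show()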

Related

Whisker plots are not clear enough to analyze data

I'm trying to analyze a set of costs using python.
The columns in the data frame are
'TotalCharges', 'TotalPayments', 'TotalDirectVariableCost', 'TotalDirectFixedCost', 'TotalIndirectVariableCost', 'TotalIndirectFixedCost'.
When I tried to plot them using whisker plots, this is how they were displayed:
I need to properly analyze these data and understand their behavior.
The following are my questions.
Is there any way that I can use whisker plots more clearly?
I believe that, since these are costs, we cannot ignore them as outliers. So, keeping the data as it is, what else can I use to represent the data more clearly?
Thanks
There are a couple of things you could do (a sketch follows at the end of this answer):
use a larger plot area
rotate the axis labels
plot one axis on a log scale
That said, I think you should examine once again your understanding of what a box and whisker plot is for.
Additionally, you might consider posting this on the Math or Cross Validated site as this doesn't have much to do with code.
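A minimal sketch of those adjustments, assuming the costs sit in a pandas DataFrame named df (that name is an assumption) with the columns listed in the question:
import matplotlib.pyplot as plt

cols = ['TotalCharges', 'TotalPayments', 'TotalDirectVariableCost',
        'TotalDirectFixedCost', 'TotalIndirectVariableCost', 'TotalIndirectFixedCost']

fig, ax = plt.subplots(figsize=(12, 6))              # larger plot area
ax.boxplot([df[c].dropna() for c in cols], labels=cols)
ax.set_yscale('log')                                 # log scale keeps small and large costs visible
plt.xticks(rotation=45, ha='right')                  # rotate labels so they do not overlap
plt.tight_layout()
plt.show()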

use numpy/matplotlib to draw a histogram with range of bottom/top/infinite

I don't think the title is precise enough. If anyone can improve it, please do.
I am using numpy and matplotlib to draw a distribution diagram. As far as I know, np.histogram can only take a range with a bottom and a top value, but I'd like it to take three values: bottom, top and infinity.
For example
MW = [121, 131, ..., 976, 1400]  # hundreds of out-of-order items
b, bins = np.histogram(MW, bins=10, range=(0, 1000))
ax.bar(bins[:-1] + 50, b, align='center', facecolor='grey', alpha=0.5, width=100)
With this code, I can draw a distribution diagram whose ten bins cover (0-100, 100-200, ..., 900-1000). But there are a few numbers higher than 1000, and I want to put them in a "(1000, +∞)" bin. It seems the range parameter would have to become (0, 1000, infinity / a number big enough), but that is not available.
An awful way to do it is with a trick such as:
MW = [x if x < 1000 else 1001 for x in MW]
b, bins = np.histogram(MW, bins=11, range=(0, 1100))
Then change the x-axis labels of the plot.
Is there any better way to implement this?
If a trick is the only way, is it possible to change the x-axis labels quickly?
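One possibility, sketched below under the assumption that MW is the list from the question: np.histogram also accepts an explicit sequence of bin edges, and np.inf can serve as the last edge. The overflow bar is then drawn at a finite position and only its tick label says "+inf":
import numpy as np
import matplotlib.pyplot as plt

edges = list(np.arange(0, 1001, 100)) + [np.inf]   # 0-100, ..., 900-1000, 1000-inf
b, bins = np.histogram(MW, bins=edges)

fig, ax = plt.subplots()
centers = np.arange(50, 1150, 100)                 # draw the overflow bin as a normal 100-wide bar
ax.bar(centers, b, align='center', facecolor='grey', alpha=0.5, width=100)
ax.set_xticks(np.arange(0, 1200, 100))
ax.set_xticklabels([str(v) for v in range(0, 1100, 100)] + ['+inf'])
plt.show()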

Finding a probability density function that reproduces a histogram in Python

So all the data that I actually have is a picture of a histogram, from which I can get the bin heights and widths, the median and the one-sigma errors.
The histogram is skewed, so the 16th and 84th percentiles are not symmetric. I found that the median and the errors can be replicated with a skewed Gaussian function; however, the histogram produced from the PDF I found is difficult to match to the original, no matter how much I play with bin numbers and bin widths.
I understand that I can't possibly recreate the histogram exactly, but I will be very happy with something that is close enough.
My best idea is to loop through possible parameters of the skewed Gaussian, make a histogram, somehow quantify the difference (like the difference in heights at all points) and find the best one. I think that might be a very long process, though, and I'm fairly sure there is something in scipy that does this quicker. Please refer me to anything useful if possible.
IMO your best shot is to treat the data as points and fit a function with scipy.optimize.curve_fit
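A minimal sketch of that curve_fit idea, using a skew-normal PDF; the bin centers and heights below are hypothetical placeholders standing in for the values read off the picture:
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import skewnorm

# Hypothetical bin centers and heights read off the histogram picture
bin_centers = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5])
heights = np.array([2.0, 10.0, 25.0, 30.0, 20.0, 9.0, 3.0, 1.0])

def skewed_gaussian(x, a, loc, scale, amplitude):
    # Skew-normal PDF scaled to the histogram's height
    return amplitude * skewnorm.pdf(x, a, loc=loc, scale=scale)

# Start with no skew, centred on the tallest bin
p0 = [0.0, bin_centers[np.argmax(heights)], 1.0, heights.max()]
params, cov = curve_fit(skewed_gaussian, bin_centers, heights, p0=p0)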
This post might also help:

Python adaptive histogram widths

I am currently working on a project where I have to bin up to 10-dimensional data. This works totally fine with numpy.histogramdd; however, I have one serious obstacle:
My parameter space is pretty large, but only a fraction of it is actually inhabited by data (say, maybe a few % or so...). In those regions the data is quite rich, so I would like to use relatively small bin widths. The problem, however, is that the RAM usage totally explodes. I see usage of 20GB+ for only 5 dimensions, which is already not practical. I tried defining the grid myself, but the problem persists...
My idea would be to manually specify the bin edges, where I just use very large bin widths for empty regions in the data space. Only in regions where I actually have data, I would need to go to a finer scale.
I was wondering if anyone here knows of such an implementation already which works in arbitrary numbers of dimensions.
thanks 😊
I think you should first remap your data, then create the histogram, and then interpret the histogram knowing the values have been transformed. One possibility would be to tweak the histogram tick labels so that they display mapped values.
One possible way of doing it, for example, would be:
Sort one dimension of the data as a one-dimensional array;
Integrate this array, so you have a cumulative distribution;
Find the steepest part of this distribution, and choose a horizontal interval corresponding to a "good" bin size for the peak of your histogram - that is, a size that gives you good resolution;
Find the size of this same interval along the vertical axis. That will give you a bin size to apply along the vertical axis;
Create the bins using the vertical span of that bin - that is, "draw" horizontal, equidistant lines to create your bins, instead of the most common way of drawing vertical ones;
That way, you'll have lots of bins where the data is more dense, and fewer bins where it is more sparse (a sketch follows after the notes below).
Two things to consider:
The mapping function is the cumulative distribution of the sorted values along that dimension. This can be quite arbitrary. If the distribution resembles some well known algebraic function, you could define it mathematically and use it to perform a two-way transform between actual value data and "adaptive" histogram data;
This applies to only one dimension. Care must be taken as how this would work if the histograms from multiple dimensions are to be combined.
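A rough sketch of that idea for a single dimension, using quantile-based bin edges so that equal steps are taken along the cumulative (vertical) axis; the lognormal data here is only a placeholder:
import numpy as np

rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=1.0, size=100000)   # placeholder 1-D data, dense near zero

# Bin edges taken from the cumulative distribution: equal steps along the
# vertical (probability) axis rather than along the value axis.
n_bins = 20
edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

counts, edges = np.histogram(values, bins=edges)
# Result: narrow bins where the data is dense, wide bins where it is sparse.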

Python: how to plot points with little overlapping

I am using python to plot points. The plot shows the relationship between area and the number of points of interest (POIs) in that area. I have 3000 area values and 3000 POI counts.
Now the plot looks like this:
The problem is that, at the lower left side, the points severely overlap each other, so it is hard to get enough information out of them. Most areas are not that big and they don't have many POIs.
I want to make a plot with little overlap. I am wondering whether I can use an unevenly distributed axis or a histogram to make a better plot. Can anyone help me?
I would suggest using a logarithmic scale for the y axis. You can either use pyplot.semilogy(...) or pyplot.yscale('log') (http://matplotlib.org/api/pyplot_api.html).
Note that points where area <= 0 will not be rendered.
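A quick sketch of the log-scale suggestion; 'areas' and 'poi_counts' are assumed names for the two 3000-element sequences from the question:
import matplotlib.pyplot as plt

plt.scatter(areas, poi_counts, s=10, alpha=0.5)   # smaller, semi-transparent dots
plt.yscale('log')                                 # spreads out the crowded lower range
plt.xlabel('area')
plt.ylabel('# of POIs')
plt.show()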
I think we have two major choices here: first, adjusting this plot, and second, displaying your data in another type of plot.
For the first option, I would suggest clipping the boundaries. You have plenty of empty space around the borders; if you limit the plot to the data's boundaries, your data will scale better. On top of that, you could plot the points with smaller dots, so that they overlap less.
The second option would be to display the data in a different kind of view, such as a histogram. This might give better insight into the distribution of your data among different bins, but it is a completely different type of view compared to the former plot.
I would suggest first adjusting the plot by limiting its boundaries to the data points, so that the plot area has enough space to scale the data, and trying histograms later. But as I mentioned, these are two different things and would give different insights about your data.
For adjusting you might try this:
x1, x2, y1, y2 = plt.axis()   # current axis limits
plt.axis((x1, x2, y1, y2))    # tighten these values so the axes hug the data
You would probably need to make minor adjustments to the axis variables. Note that there are definitely better options than this, but it was the first thing that came to my mind.
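A more concrete version of that adjustment, assuming the plotted data is in arrays named x and y (hypothetical names):
import numpy as np
import matplotlib.pyplot as plt

plt.scatter(x, y, s=5)                                   # smaller dots overlap less
plt.axis((np.min(x), np.max(x), np.min(y), np.max(y)))   # clip the axes to the data range
plt.show()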
