Histogram with same frequencies - python

I am plotting a histogram, with another set of data, but the frequencies are all 1, no matter how I change the number of bins. I did this with data generated from a normal distribution in the following fashion
x=npr.normal(0,2,(1,100))
plt.hist(x,bins=10)
and I get the following histogram:
This happens even if I increase the number of simulations to 1000 or 10000.
How do I plot a histogram that displays the bell shape of the normal distribution?
Thanks in advance.

You are plotting one histogram for each column of your input array. Since x has shape (1, 100), that is 100 separate histograms with one value each, so every frequency is 1.
x=npr.normal(0,2,(1,100))
plt.hist(x[0],bins=10)
will do (note that I am selecting the first (and only) row of x).
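To make the difference concrete, here is a minimal sketch comparing the two calls side by side. It assumes NumPy's newer default_rng generator in place of npr (my substitution, not from the question), and the variable names n_wrong and n_right are only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)      # assumption: stands in for npr in the question
x = rng.normal(0, 2, (1, 100))      # shape (1, 100): one row, 100 columns

fig, (ax_wrong, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# A 2D input is split by column: 100 datasets holding one value each,
# so no bar can ever be taller than 1.
n_wrong, _, _ = ax_wrong.hist(x, bins=10)
ax_wrong.set_title("hist(x): one dataset per column")

# Flattening to 1D gives a single dataset of 100 values.
n_right, _, _ = ax_right.hist(x.ravel(), bins=10)
ax_right.set_title("hist(x.ravel()): one dataset")
```

Call plt.show() to display the figure. x.ravel() and x[0] give the same values here; passing size=100 instead of (1, 100) when drawing the sample should avoid the problem in the first place.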

Related

How to check if a vector histogram correlates with uniform distribution?

I have a vector of floats V with values from 0 to 1. I want to create a histogram with some bin width, say A = 0.01, and check how close the resulting histogram is to a uniform distribution, getting a single value from zero to one, where 0 means correlating perfectly and 1 means not correlating at all. For me, correlation here means first of all histogram shape.
How would one do such a thing in Python with numpy?
You can create the histogram with np.histogram. Then you can generate the uniform histogram from the average of the previously retrieved histogram with np.mean. Finally, you can compare the two with a statistic such as the Pearson coefficient, using scipy.stats.pearsonr.
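One caveat: the Pearson coefficient is undefined when one of its inputs is constant, and a flat reference histogram is constant, so scipy.stats.pearsonr returns NaN in that case. The sketch below therefore swaps in a different measure, the total variation distance between the normalized histogram and a flat one. This is not the method suggested above, just one possible score that is 0 for a perfectly flat histogram and approaches 1 when all values fall into a single bin. The function name uniformity_distance and its bin_width parameter are hypothetical.

```python
import numpy as np

def uniformity_distance(v, bin_width=0.01):
    """Total variation distance between the histogram of v and a flat histogram.

    Hypothetical helper: 0 means perfectly flat, and the maximum is
    1 - 1/n_bins, reached when every value lands in the same bin.
    """
    n_bins = int(round(1.0 / bin_width))
    counts, _ = np.histogram(v, bins=n_bins, range=(0.0, 1.0))
    observed = counts / counts.sum()            # bin probabilities of the data
    expected = np.full(n_bins, 1.0 / n_bins)    # bin probabilities of a uniform
    return 0.5 * np.abs(observed - expected).sum()

rng = np.random.default_rng(0)
print(uniformity_distance(rng.uniform(0, 1, 100_000)))  # small, close to 0
print(uniformity_distance(np.full(1000, 0.5)))          # close to 1
```

If a significance test is wanted rather than a score, scipy.stats.chisquare on the raw counts is another option; by default it compares them against equal expected frequencies.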

Interpolate smaller histogram bins (Pandas / Plotly)

I have a histogram of probability data, where the probabilities are bucketed in bins of size 10. When I display the histogram as a heatmap in plotly using buckets that are the same size as the data, I get this:
I'd like to view the heatmap at a finer granularity, like buckets of size 1. Since the data is in buckets of size 10, it looks like this:
How do I interpolate the data to fill in values for the gaps?
One idea is to divide the bucket from 190 to 200 evenly between 190-191, 191-192, etc, but that would not accurately represent the bell-curve shape of the histogram. Since the peak is at ~200, ideally there would be more weight at 199-200 than at 190-191.
I just found distplot. I can probably use that to get a good idea of the fine-grained distribution. I'll post back with the results when I have them.

How to plot linear regression between two continuous values?

I am trying to implement a machine learning algorithm to predict house prices in New York City.
Now, when I try to plot (using Seaborn) the relationship between two columns of my house-prices dataset, 'gross_sqft_thousands' (the gross area of the property in thousands of square feet) and the target column 'sale_price_millions', I get a weird plot like this one:
Code used to plot:
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df);
When I try to plot the number of commercial units (commercial_units column) versus sale_price_millions, I also get a weird plot like this one:
I get these weird plots even though, in the correlation matrix, sale_price correlates very well with both variables (gross_sqft_thousands and commercial_units).
What am I doing wrong, and what should I do to get a clean plot, with fewer points and a clear fit, like this plot:
Here is a part of my dataset:
Your housing price dataset is much larger than the tips dataset shown in that Seaborn example plot, so scatter plots made with default settings will be massively overcrowded.
The second plot looks "weird" because it plots a (practically) continuous variable, sales price, against an integer-valued variable, total_units.
The following solutions come to mind:
Downsample the dataset with something like sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df[::10]). The [::10] part selects every 10th line from clean_df. You could also try clean_df.sample(frac=0.1, random_state=12345), which randomly samples 10% of all rows without replacement (using a random seed for reproducibility).
Reduce the alpha (opacity) and/or size of the scatterplot points with sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df, scatter_kws={"alpha": 0.1, "s": 1}).
For plot 2, add a bit of "jitter" (random noise) to the y-axis variable with sns.regplot(..., y_jitter=0.05).
For more, check out the Seaborn documentation on regplot: https://seaborn.pydata.org/generated/seaborn.regplot.html
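The three suggestions above can be combined roughly as follows. This is only a sketch: the real clean_df is not shown in the question, so the code builds a synthetic stand-in with made-up values under the same column names, and ci=None is my addition to skip the bootstrap confidence band and keep the example fast.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for clean_df (assumption: the real data is not available).
rng = np.random.default_rng(12345)
n_rows = 5000
sqft = rng.gamma(shape=2.0, scale=5.0, size=n_rows)
clean_df = pd.DataFrame({
    "gross_sqft_thousands": sqft,
    "commercial_units": rng.poisson(lam=2.0, size=n_rows),
    "sale_price_millions": 0.4 * sqft + rng.normal(0, 1.5, n_rows),
})

# Option 1: downsample, either every 10th row or a random 10% of rows.
every_tenth = clean_df[::10]
sampled = clean_df.sample(frac=0.1, random_state=12345)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Option 2: smaller, more transparent points.
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=sampled,
            scatter_kws={"alpha": 0.3, "s": 5}, ci=None, ax=ax1)

# Option 3: jitter the integer-valued variable on the y-axis.
sns.regplot(x="sale_price_millions", y="commercial_units", data=sampled,
            y_jitter=0.05, scatter_kws={"alpha": 0.3, "s": 5}, ci=None, ax=ax2)
```

The jitter only affects how the points are drawn; the regression line is still fitted to the original values.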

2D histogram colour by "label fraction" of data in each bin

Following on from the post found here: 2D histogram coloured by standard deviation in each bin
I would like to colour each bin in a 2D grid by the fraction of points whose label values are below a certain threshold in Python.
Note that, in this dataset, each point has a continuous label value between 0-1.
For example here is a histogram I made whereby the colour denotes the standard deviation of label values of all points in each bin:
The way this was done was by using scipy.stats.binned_statistic_2d() (see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic_2d.html) and setting the statistic argument to 'std'.
But is there a way to change this kind of plot so that the colouring is representative of the fraction of points in each bin with label value below 0.5 for example?
It could be that the only way to do this is by explicitly defining a grid of some kind and calculating the fractions but I'm not sure of the best way to do that so any help on this matter would be greatly appreciated!
Maybe using scipy.stats.binned_statistic_2d or numpy.histogram2d and being able to return the raw data values in each bin as a multi dimensional array would help in being able to quickly compute the fractions explicitly.
The fraction of elements in an array below a threshold can be calculated as
fraction = lambda a, threshold: len(a[a<threshold])/len(a)
Hence you can call
scipy.stats.binned_statistic_2d(x, y, values, statistic=lambda a: fraction(a, 0.5))
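For reference, here is a self-contained sketch of that call on synthetic data. The arrays x, y and labels are random stand-ins (assumptions, since the original data is not shown), and fraction_below is a hypothetical named version of the lambda above that also copes with empty input.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data: positions in the unit square, labels in [0, 1).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 2000)
y = rng.uniform(0, 1, 2000)
labels = rng.uniform(0, 1, 2000)

def fraction_below(a, threshold=0.5):
    """Fraction of values in a that are below threshold (NaN if a is empty)."""
    a = np.asarray(a)
    return np.mean(a < threshold) if a.size else np.nan

result = stats.binned_statistic_2d(x, y, labels, statistic=fraction_below, bins=10)
frac_grid = result.statistic   # shape (10, 10); first axis follows x, second follows y
```

To draw it, something like plt.pcolormesh(result.x_edge, result.y_edge, frac_grid.T) should work; the transpose is needed because the first axis of the statistic array corresponds to x.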

How to plot joint probability distribution of three elements?

I have 3 columns of data. I have used numpy.histogramdd with 100 bins to find the joint probability distribution of these elements. Now I want to plot the result using matplotlib. The data now has shape 100x100x100. What is the best way to plot it?
