I'm working with the credit card data set available on Kaggle. I want to plot a histogram of the column named 'Amount' from the data file named 'credit'. I want the plot to cover the whole range of Amount, but that is not what I get. The range of Amount is [0, 25691.16], but the plot only appears to cover a range of max_value/num_bins. What change to the code is needed to get a plot over the full range mentioned above?
In the example code below, the plot shows a single bar of width 2569.116 (range/num_bins). What I need is 10 bars covering the entire range.
plt.hist(credit['Amount'], bins=10, density=True, range=(0, 25691.16), facecolor='red', alpha=0.5)
Your code is right. I think it is the nature of your data: you cannot see the other bars because the density in the first bin is very high. In other words, almost all 'Amount' values are in (0, 2569.116), and only a few fall in the intervals (2569.116, 5138.232), ..., (23122.044, 25691.16).
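One way to check this is to print the bin counts and redraw the histogram with a logarithmic count axis. The sketch below is only an illustration: the `amount` array is synthetic, heavily right-skewed data standing in for credit['Amount'], not the Kaggle data itself.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Assumption: synthetic, right-skewed values standing in for credit['Amount'].
amount = rng.lognormal(mean=3.0, sigma=1.5, size=100_000)
amount = np.clip(amount, 0, 25691.16)

# Counting per bin shows that all 10 bins exist; the first one just dominates.
counts, edges = np.histogram(amount, bins=10, range=(0, 25691.16))
print(counts)

# A log-scaled count axis makes the sparsely populated bins visible.
fig, ax = plt.subplots()
ax.hist(amount, bins=10, range=(0, 25691.16), log=True,
        facecolor="red", alpha=0.5)
ax.set_xlabel("Amount")
ax.set_ylabel("count (log scale)")
fig.savefig("amount_hist_log.png")
```

With real data, replacing `amount` by credit['Amount'] should be enough; whether a log axis or a narrower `range` is the better choice depends on what the plot is meant to show.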
I am plotting a histogram, with another set of data, but the frequencies are all 1, no matter how I change the number of bins. I did this with data generated from a normal distribution in the following fashion
x=npr.normal(0,2,(1,100))
plt.hist(x,bins=10)
and I get the following histogram:
This happens even if I increase the number of simulations to 1000 or 10000.
How do I plot a histogram that displays the bell shape of the normal distribution?
Thanks in advance.
You are plotting one histogram for each column of your input array. Since x has shape (1, 100), that is 100 histograms, each containing a single value.
x=npr.normal(0,2,(1,100))
plt.hist(x[0],bins=10)
will do (note that I am selecting the first (and only) row of x).
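A minimal sketch of the same fix, assuming `npr` in the question is `numpy.random`; here a NumPy `Generator` is used instead, and `ravel()` flattens the (1, 100) array, which is equivalent to selecting `x[0]` in this case.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Same shape as in the question: 1 row, 100 columns.
x = rng.normal(0, 2, (1, 100))

# Flatten to a 1-D array so hist sees one dataset of 100 values
# instead of 100 datasets of one value each.
flat = x.ravel()

fig, ax = plt.subplots()
n, bin_edges, patches = ax.hist(flat, bins=10)
fig.savefig("normal_hist.png")
```

Drawing the samples with shape `(100,)` in the first place, e.g. `rng.normal(0, 2, 100)`, avoids the problem entirely.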
I am trying to implement a machine learning algorithm to predict house prices in New York City.
Now, when I try to plot (using Seaborn) the relationship between two columns of my house price dataset, 'gross_sqft_thousands' (the gross area of the property in thousands of square feet) and the target column 'sale_price_millions', I get a weird plot like this one:
Code used to plot:
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df);
When I plot the number of commercial units (commercial_units column) versus sale_price_millions, I also get a weird plot like this one:
I get these weird plots even though, in the correlation matrix, sale_price correlates very well with both variables (gross_sqft_thousands and commercial_units).
What am I doing wrong, and what should I do to get a clean plot, with fewer points and a clear fit, like this plot:
Here is a part of my dataset:
Your housing price dataset is much larger than the tips dataset shown in that Seaborn example plot, so scatter plots made with default settings will be massively overcrowded.
The second plot looks "weird" because it plots a (practically) continuous variable, sales price, against an integer-valued variable, total_units.
The following solutions come to mind:
Downsample the dataset with something like sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df[::10]). The [::10] part selects every 10th line from clean_df. You could also try clean_df.sample(frac=0.1, random_state=12345), which randomly samples 10% of all rows without replacement (using a random seed for reproducibility).
Reduce the alpha (opacity) and/or size of the scatterplot points with sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df, scatter_kws={"alpha": 0.1, "s": 1}).
For plot 2, add a bit of "jitter" (random noise) to the y-axis variable with sns.regplot(..., y_jitter=0.05).
For more, check out the Seaborn documentation on regplot: https://seaborn.pydata.org/generated/seaborn.regplot.html
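The two downsampling options can be tried in isolation before plotting. This is a sketch only: the `clean_df` built here is synthetic data with the column names from the question, not the real housing dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 10_000

# Assumption: synthetic stand-in for the question's clean_df.
sqft = rng.gamma(shape=2.0, scale=2.0, size=n_rows)
clean_df = pd.DataFrame({
    "gross_sqft_thousands": sqft,
    "sale_price_millions": 0.5 * sqft + rng.normal(0, 0.5, n_rows),
    "commercial_units": rng.integers(0, 5, n_rows),
})

# Option 1: keep every 10th row (deterministic, depends on row order).
every_tenth = clean_df[::10]

# Option 2: random 10% sample without replacement, reproducible via the seed.
sampled = clean_df.sample(frac=0.1, random_state=12345)

print(len(clean_df), len(every_tenth), len(sampled))
```

Either result can then be passed as the `data=` argument of sns.regplot. If the rows are sorted in some meaningful way (for example by date or price), the random sample is usually the safer choice.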
I'm new to heatmaps and seaborn. I've been trying this for days but haven't been able to find a solution or any related threads. I think I'm setting up the problem incorrectly but would like to know if what I'm trying to do, is possible with seaborn heatmaps... or if a heatmap isn't the right graphic representation for what I want to show.
I have a .csv file of scores. It looks something like:
Genus,Pariaconus,Trioza,Non-members
-40,-80,-90,-300
-40.15,-80,-100,-320
-40.17,-86,-101,-470
,-86.2,-130,-488
,,-132,-489
,,,-500
...
As I try to show above, the columns are different lengths. Let's say length of (the number of values in) Genus is 10, Pariaconus is 15, Trioza is 20, and Non-members is 18,000.
In addition, the columns and rows are not related to each other. Each score is individual and just falls under the column group. What I want to show with the heatmap is the range of scores that occur in each column.
I would ideally like to represent the data using a heatmap, where:
X-axis is "Genus", "Pariaconus", "Trioza", "Non-members".
Y-axis is the range of scores that occur in the dataset. In the example above, Y-axis values would go from -40 to -500.
Colorbar is the normalized population of the columns that get that score in the Y-axis. For example, if 100% of the Genus column scores around -40, that area in Y-axis would be colored red (for 1.0). The remainder of the y-axis for Genus would be colored blue (for 0.0), because no scores for Genus are in the range -50 to -500. For the purposes of my project, I'd like to show that the majority of scores of "Genus" fall in a certain range, "Pariaconus" in another range, "Non-members" in another range, and so on.
The reason I want to represent this with a heatmap and not, say, a line graph is because line graphs would suggest that there is a trend between rows in the same column. In the example above (Genus column), a line/scatter graph would make it seem that there's a relationship between the score -40, -41, and -45 as you move down the X-axis. In contrast, I just want to show the range of scores in each column.
With the data in the .csv format above, right now I have the following heatmap: https://imgur.com/a/VwgQwfQ
I get this with the line of code:
sns.heatmap(df, cmap="coolwarm")
In this heatmap, the values of the Y-axis are automatically set as the row indices from the .csv file, and the colormap values are the scores (values of the rows).
If I could just figure out how to swap the colormap and the Y-axis, then I hope that I could then move on to figuring out how to normalize the populations of each column instead of having it as the raw indices: 0 to 18000. But I've been trying to do this for days and haven't come close to what I want.
Ideally, I would want something like this: https://imgur.com/a/3A0eaOD. Of course, in the heatmap, there would be gradients instead of solid colors.
If anyone can answer, I had these questions:
Is what I'm trying to do achievable/is this something I can do with a heatmap? Or should I be using a different representation?
Is this possibly a problem with how my input data is represented? If so, what is the correct representation when building a heatmap like this?
Any other guidance would be super appreciated.
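The layout described in this question (score bins on the Y-axis, groups on the X-axis, color as the fraction of each group's scores in a bin) can be sketched by histogramming each column over shared bin edges. This is only one possible approach, not an answer from the thread; the scores are the few sample values from the .csv excerpt above, and the bin width of 20 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Sample values from the .csv excerpt; real columns would be much longer
# and of different lengths, which is fine for this approach.
groups = {
    "Genus": np.array([-40, -40.15, -40.17]),
    "Pariaconus": np.array([-80, -80, -86, -86.2]),
    "Trioza": np.array([-90, -100, -101, -130, -132]),
    "Non-members": np.array([-300, -320, -470, -488, -489, -500]),
}

# Shared bin edges from -500 to -40 (assumption: 23 bins of width 20).
edges = np.linspace(-500, -40, 24)

fractions = {}
for name, scores in groups.items():
    counts, _ = np.histogram(scores, bins=edges)
    # Normalize per column so groups of different size are comparable.
    fractions[name] = counts / counts.sum()

# Rows are score bins (labelled by bin center), columns are groups.
centers = (edges[:-1] + edges[1:]) / 2
heat = pd.DataFrame(fractions, index=np.round(centers, 1))

# Put the highest scores at the top, as in the desired figure.
heat = heat.iloc[::-1]
print(heat.head())
```

The resulting frame has the shape a heatmap expects, so something like sns.heatmap(heat, cmap="coolwarm", vmin=0, vmax=1) should then color each cell by the normalized population rather than by the raw score.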
I'm encountering some matplotlib behaviour that I do not understand.
I have a DataFrame:
august.head()
                        value
config name
low max velocity    -0.000145
medium max velocity -0.000165
reference           -0.000198
high max velocity   -0.000192
When I plot this dataframe using
plt.plot(august)
I get the following plot:
My data seems plotted chaotically and the blue line 'comes back to a previous x value' (sorry, that's the best I can do for a description of my problem)
I would like to see my data plotted with plt.plot(august) just as when I plot it using
august.plot()
Which gives me a good, ordered graph:
Any ideas?
Thanks
Maybe the config names were ordered alphabetically?
In that case you could associate an integer with each config name and use the names as tick labels, as in the question "plot-with-custom-text-for-x-axis-points".
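A minimal sketch of that suggestion, assuming the DataFrame looks like the `august.head()` output in the question: plot the values against integer positions, then relabel the ticks with the config names so the order of the rows is preserved.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Rebuilt from the august.head() output shown in the question.
august = pd.DataFrame(
    {"value": [-0.000145, -0.000165, -0.000198, -0.000192]},
    index=pd.Index(
        ["low max velocity", "medium max velocity",
         "reference", "high max velocity"],
        name="config name",
    ),
)

# One integer position per row, in DataFrame order.
positions = list(range(len(august)))

fig, ax = plt.subplots()
(line,) = ax.plot(positions, august["value"].to_numpy(), marker="o")

# Show the config names instead of the integers on the x-axis.
ax.set_xticks(positions)
ax.set_xticklabels(list(august.index), rotation=45, ha="right")
fig.tight_layout()
fig.savefig("august.png")
```

Because the x-values are plain integers, the line is drawn in row order regardless of how the string labels would otherwise be sorted or categorized.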
I started with the matplotlib radar example but values below some min values disappear.
I have a gist here.
The result looks like
As you can see in the gist, the values for D and E in series A are both 3 but they don't show up at all.
There is some scaling going on.
In order to find out what the problem is I started with the original values and removed one by one.
When I removed one whole series then the scale would shrink.
Here is an example: after removing Factor 5, the scale in the [0, 0.2] range shrinks.
From
to
I don't care so much about the scaling but I would like my values at 3 score to show up.
Many thanks
Actually, the values for D and E in series A do show up, although they are plotted in the center of the plot. This is because the limits of your "y-axis" (the radial axis) are autoscaled.
If you want to have a fixed "minimum radius", you can simply put ax.set_ylim(bottom=0) in your for-loop.
If you want the minimum radius to be a number relative to the lowest plotted value, you can include something like ax.set_ylim(np.asarray(data.values()).flatten().min() - margin) in the for-loop, where margin is the distance from the lowest plotted value to the center of the plot.
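Both options can be tried on a small polar plot. The sketch below does not use the gist's data; the labels and values are hypothetical, chosen so that D and E share the lowest value of 3, as in the question.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical series: D and E share the lowest value, 3.
labels = ["A", "B", "C", "D", "E"]
values = np.array([8, 6, 7, 3, 3])

# One angle per label, then repeat the first point to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(values), endpoint=False)
closed_angles = np.append(angles, angles[0])
closed_values = np.append(values, values[0])

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(closed_angles, closed_values, marker="o")
ax.set_xticks(angles)
ax.set_xticklabels(labels)

# Option 1: fix the center of the plot at radius 0.
ax.set_ylim(bottom=0)

# Option 2 (alternative): keep the center a fixed margin below the minimum.
margin = 1
# ax.set_ylim(bottom=values.min() - margin)

fig.savefig("radar.png")
```

With the lower limit fixed, the points at 3 are drawn away from the center instead of collapsing onto it.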
With fixed center at radius 0 (added markers to better show that the points are plotted):
By setting margin = 1, and using the relative y-limits, I get this output: