I'm encountering some matplotlib behaviour that I do not understand.
I have a dataframe:
august.head()

                         value
config name
low max velocity     -0.000145
medium max velocity  -0.000165
reference            -0.000198
high max velocity    -0.000192
When I plot this dataframe using
plt.plot(august)
I get the following plot:
My data is plotted chaotically: the blue line doubles back to previous x values (sorry, that's the best description I can manage).
I would like to see my data plotted with plt.plot(august) just as when I plot it using
august.plot()
Which gives me a good, ordered graph:
Any ideas?
Thanks
Maybe the config names were ordered alphabetically?
In that case you could associate an integer with each config name, as in the sketch below and here:
plot-with-custom-text-for-x-axis-points
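A minimal sketch of that idea, using the values from august.head() above (the ordering is illustrative):

import matplotlib.pyplot as plt

# Values taken from the dataframe shown above; order is illustrative.
configs = ['low max velocity', 'medium max velocity',
           'reference', 'high max velocity']
values = [-0.000145, -0.000165, -0.000198, -0.000192]
positions = range(len(configs))

plt.plot(positions, values)                  # plot against integer positions
plt.xticks(positions, configs, rotation=45)  # relabel ticks with config names
plt.tight_layout()
plt.show()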
I am working on an IPL cricket dataset which has over-by-over batting stats for all the teams.
I want to visualise how different cricket grounds affect the batting team's total score. I tried a simple scatter plot, but the stadium names are too long and do not show clearly.
Do I have to convert the 35 stadium values into numeric values? Nothing is printed when I try to find their correlation with the target variable.
The data set:
The problem with reading the plot (the x-axis):
You can change the size of the font and/or rotate the labels: https://matplotlib.org/api/matplotlib_configuration_api.html#matplotlib.rc
You can make your plot bigger by setting figsize.
(add this as the first line):
plt.figure(figsize=(14, 8))
and then rotate the xticks (at the end):
plt.xticks(rotation=90)
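Putting the two together, a minimal sketch (df, 'stadium', and 'total_score' are hypothetical stand-ins for the questioner's dataframe and columns):

import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the IPL dataframe.
df = pd.DataFrame({'stadium': ['M. A. Chidambaram Stadium', 'Eden Gardens',
                               'Wankhede Stadium'],
                   'total_score': [165, 180, 172]})

plt.figure(figsize=(14, 8))                    # widen the figure so labels fit
plt.scatter(df['stadium'], df['total_score'])  # hypothetical column names
plt.xticks(rotation=90, fontsize=8)            # rotate and shrink tick labels
plt.tight_layout()
plt.show()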
I'm working on the credit card data available on Kaggle. I want to plot a histogram of the column named 'Amount' from the data file named 'credit'. I want the plot to cover the whole range of Amount, which is [0, 25691.16], but the plot only shows up to max_value/num_bins. What change to the code is required to get a plot over the full range?
With the example code below, the plot shows a single bar of width 2569.116 (range/num_bins); what I need is 10 bars covering the entire range:
plt.hist(credit['Amount'], 10, density=True, range=(0, 25691.16), facecolor='red', alpha=0.5)
Your code is right. I think it is the nature of your data that you cannot see the other bars, since the density in the first bin is very high. In other words, almost all 'Amount' values lie in (0, 2569.116) and only a few fall in the intervals (2569.116, 5138.232), ..., (23122.044, 25691.16).
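One way to confirm this is to put the counts on a logarithmic scale, which keeps the nearly empty bins visible (a sketch of the same call with log=True; credit is the questioner's dataframe):

import matplotlib.pyplot as plt

# Same histogram, but with a log y-axis so the sparse high-amount
# bins remain visible next to the dominant first bin.
plt.hist(credit['Amount'], 10, range=(0, 25691.16),
         facecolor='red', alpha=0.5, log=True)
plt.show()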
I am trying to implement a machine-learning algorithm to predict house prices in New York City.
Now, when I try to plot (using Seaborn) the relationship between two columns of my house-prices dataset, 'gross_sqft_thousands' (the gross area of the property in thousands of square feet) and the target column, 'sale_price_millions', I get a weird plot like this one:
Code used to plot:
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df);
When I try to plot the number of commercial units (the commercial_units column) against sale_price_millions, I also get a weird plot, like this one:
These plots look weird even though, in the correlation matrix, sale_price correlates well with both variables (gross_sqft_thousands and commercial_units).
What am I doing wrong, and what should I do to get a clean plot, with fewer points and a clear fit like this one:
Here is a part of my dataset:
Your housing price dataset is much larger than the tips dataset shown in that Seaborn example plot, so scatter plots made with default settings will be massively overcrowded.
The second plot looks "weird" because it plots a (practically) continuous variable, sale price, against an integer-valued variable, commercial_units.
The following solutions come to mind (a combined sketch follows the list):
Downsample the dataset with something like sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df[::10]). The [::10] part selects every 10th row from clean_df. You could also try clean_df.sample(frac=0.1, random_state=12345), which randomly samples 10% of all rows without replacement (using a random seed for reproducibility).
Reduce the alpha (opacity) and/or size of the scatterplot points with sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df, scatter_kws={"alpha": 0.1, "s": 1}).
For plot 2, add a bit of "jitter" (random noise) to the y-axis variable with sns.regplot(..., y_jitter=0.05).
For more, check out the Seaborn documentation on regplot: https://seaborn.pydata.org/generated/seaborn.regplot.html
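A combined sketch of the three suggestions (clean_df is the questioner's dataframe; the column names are taken from the question):

import seaborn as sns
import matplotlib.pyplot as plt

# Randomly keep 10% of the rows, reproducibly.
sample = clean_df.sample(frac=0.1, random_state=12345)

# Plot 1: small, faded points so overlaps stay readable.
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands",
            data=sample, scatter_kws={"alpha": 0.1, "s": 1})
plt.show()

# Plot 2: vertical jitter separates the integer-valued counts.
sns.regplot(x="sale_price_millions", y="commercial_units",
            data=sample, y_jitter=0.05, scatter_kws={"alpha": 0.3, "s": 5})
plt.show()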
I would like to represent my data (which consist of 256 values) using a bokeh heat map where each value has its own color (so every item with the same value should have the same color).
I've been experimenting, and bokeh is binning values into ranges for me, e.g. every value between 24 and 47 gets the same color, but I wish to have a color for each value.
What is the best way to approach this problem?
I've been experimenting with palettes, and some perform much better than others; Inferno256, for example, does a good job. But is that the correct way to solve this? Is there a way to tell the chart/heat map to display every value with its own color (by specifying ranges?), or should I, for example, define a palette of 256 colors? Any thoughts?
Example where bokeh creates big ranges for me:
# bokeh.charts comes from older bokeh releases; bp is bokeh.palettes
from bokeh.charts import HeatMap, output_file, show
import bokeh.palettes as bp

Data = column_of_values[:1000]
data = {'fruit': [1] * len(Data),                 # sections
        'fruit_count': Data,                      # values to color by
        'sample': list(range(1, len(Data) + 1))}  # x positions
hm = HeatMap(data, x='sample', y='fruit', values='fruit_count',
             palette=bp.Plasma11, title='Fruits', stat=None)
hm.width = 5000
output_file('heatmap.html')
show(hm)
The second part of my question (if possible): does bokeh handle big data well?
For example, plotting 1,000 values behaves differently from plotting 10,000 with the same code; the values seem to be squashed together. Should I fix that by expanding the width, or by something else? :-)
Heat map plotting 1000 values then 10,000 values
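For the one-color-per-value part, a minimal sketch using a LinearColorMapper from the bokeh.plotting API (not the bokeh.charts HeatMap above; all names and sizes are illustrative):

from bokeh.models import ColumnDataSource, LinearColorMapper
from bokeh.palettes import Inferno256
from bokeh.plotting import figure, output_file, show

values = list(range(256))  # illustrative: one sample per possible value
source = ColumnDataSource(data={'sample': list(range(1, len(values) + 1)),
                                'value': values})

# 256 palette colors spanning exactly the 0-255 range: every distinct
# value gets its own palette entry instead of sharing a binned color.
mapper = LinearColorMapper(palette=Inferno256, low=0, high=255)

p = figure(width=1000, height=150, title='One color per value')
p.rect(x='sample', y=1, width=1, height=1, source=source,
       fill_color={'field': 'value', 'transform': mapper},
       line_color=None)
output_file('heatmap.html')
show(p)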
I started with the matplotlib radar example, but values below a certain minimum disappear.
I have a gist here.
The result looks like
As you can see in the gist, the values for D and E in series A are both 3 but they don't show up at all.
There is some scaling going on.
In order to find out what the problem is, I started with the original values and removed them one by one.
When I removed a whole series, the scale would shrink.
Here is an example: removing Factor 5 shrinks the scale in the [0, 0.2] range.
From
to
I don't care so much about the scaling, but I would like my values with a score of 3 to show up.
Many thanks
Actually, the values for D and E in series A do show up, although they are plotted in the center of the plot. This is because the limits of your "y-axis" are autoscaled.
If you want to have a fixed "minimum radius", you can simply put ax.set_ylim(bottom=0) in your for-loop.
If you want the minimum radius to be relative to the lowest plotted value, you can include something like ax.set_ylim(np.asarray(list(data.values())).flatten().min() - margin) in the for-loop, where margin is the distance from the lowest plotted value to the center of the plot (the list(...) wrapper is needed on Python 3, where dict.values() is a view).
With fixed center at radius 0 (added markers to better show that the points are plotted):
By setting margin = 1, and using the relative y-limits, I get this output:
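A self-contained sketch of the relative-limit version (the series values are illustrative stand-ins for the gist's data):

import matplotlib.pyplot as plt
import numpy as np

# Illustrative series in the same shape as the gist: five spokes each.
data = {'A': [4, 3, 5, 3, 3], 'B': [5, 4, 4, 2, 3]}
margin = 1

theta = np.linspace(0, 2 * np.pi, 5, endpoint=False)
ax = plt.subplot(projection='polar')
for name, values in data.items():
    # Repeat the first point so the polygon closes.
    ax.plot(np.append(theta, theta[0]), values + values[:1],
            marker='o', label=name)

# Relative lower limit: the smallest plotted value sits `margin`
# above the center instead of collapsing onto it.
lowest = np.asarray(list(data.values())).flatten().min()
ax.set_ylim(bottom=lowest - margin)
ax.legend()
plt.show()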