I am trying to build a simple histogram. For some reason, my bars are behaving abnormally. As you can see in this picture, my bar over "3" is moved to the right side. I am not sure what caused it. I did align='mid' but it did not fix it.
This is the code that I used to create it:
def createBarChart(colName):
    df[colName].hist(align='mid')
    plt.title(str(colName))
    RUNS = [1,2,3,4,5]
    plt.xticks(RUNS)
    plt.show()
for column in colName:
    createBarChart(column)
And this is what I got:
[Image: bar is not centered over 3]
To recreate my data:
df = pd.DataFrame(np.random.randint(1,6,size=(100, 4)), columns=list('ABCD'))
Thank you for your help!
P/s: idk if this info is relevant, but I am using seaborn-whitegrid style. I tried to recreate a plot with sample data and it's still showing up. Is it a bug?
[Image: hist created using random data]
The hist function is behaving exactly as it is supposed to. By default it splits the data you pass into 10 bins, with the left edge of the first bin at the data's minimum value and the right edge of the last bin at its maximum. The chart below shows the randomly generated data binned this way, with red dashed lines to mark the edges of the bins.
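To see where those default edges fall for data like the question's, here is a minimal sketch using np.histogram, which applies the same default of 10 equal-width bins between the minimum and the maximum. The sample values are made up for illustration.

```python
import numpy as np

# Made-up sample with the same value range (1 to 5) as the question's data.
data = np.array([1, 1, 2, 3, 3, 3, 4, 5])

# Same default binning as hist(): 10 equal-width bins from min to max.
counts, edges = np.histogram(data, bins=10)

# Print each bin's span and how many values fell into it.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```

Each bin is 0.4 wide, so the value 3 lands in the bin that starts at 3.0 and its bar is drawn to the right of the tick, while 2 and 4 happen to fall in bins centred on them.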
The way around this is to define the bin edges yourself, with a slight adjustment to the minimum and maximum values to centre the bars over the x axis ticks. This can be done quite easily with numpy's linspace function (using column A in the randomly generated data frame as an example):
bins = np.linspace(df["A"].min() - .5, df["A"].max() + .5, 6)
df["A"].hist(bins=bins)
We ask for 6 values because we are defining the bin edges; 6 edges give 5 bins, as shown in this chart:
If you wanted to keep the gaps between the bars you can increase the number of bins to 9 and adjust the offset slightly, but this wouldn't work in all cases (it works here because every value is either 1, 2, 3, 4 or 5).
bins = np.linspace(df["A"].min() - .25, df["A"].max() + .25, 10)
df["A"].hist(bins=bins)
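As a quick sanity check of the two bin definitions above, the sketch below computes the edges with NumPy and confirms where the values 1 to 5 land. The sample array is made up, and the np.arange variant is my own assumption about how one might generalise this to integer data with a different range.

```python
import numpy as np

data = np.array([1, 1, 2, 3, 3, 3, 4, 5])  # made-up integer sample

# Five unit-width bins whose centres sit on 1, 2, 3, 4, 5.
edges = np.linspace(data.min() - 0.5, data.max() + 0.5, 6)
centres = (edges[:-1] + edges[1:]) / 2

# Assumed generalisation: one unit-width bin per integer, for any integer range.
general_edges = np.arange(data.min() - 0.5, data.max() + 1.5)

# Nine half-width bins: values fall in every other bin, which leaves the gaps.
gap_edges = np.linspace(data.min() - 0.25, data.max() + 0.25, 10)
gap_counts, _ = np.histogram(data, bins=gap_edges)

print(centres)
print(gap_counts)
```

The arange form avoids hard-coding the number of edges, which is the part of the linspace call that has to change when the data covers a different range.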
Finally, as this data contains discrete values and really you are plotting the counts, you could use the value_counts function to create a series that can then be plotted as a bar chart:
df["A"].value_counts().sort_index().plot(kind="bar")
# Provide a 'color' argument if you need all of the bars to look the same.
df["A"].value_counts().sort_index().plot(kind="bar", color="steelblue")
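One thing to watch with this approach: a value that never occurs gets no bar at all, so the x axis can silently skip it. A possible workaround, sketched here with a made-up column in which 4 is missing, is to reindex the counts over the full range before plotting. The range 1 to 5 is an assumption based on the question's data.

```python
import pandas as pd

scores = pd.Series([1, 1, 2, 3, 3, 3, 5])  # made-up sample; note there is no 4

counts = scores.value_counts().sort_index()

# Fill in the missing value so every category from 1 to 5 gets a slot.
full_counts = counts.reindex(range(1, 6), fill_value=0)
print(full_counts)

# full_counts.plot(kind="bar") would then draw five evenly spaced bars.
```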
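Try something like this in your code to put all of the histogram bars in the same place relative to the ticks:
plt.hist("Your data goes here", bins=range(1,7), align='left', rwidth=1, density=True)
Put your data where I wrote "Your data goes here". Note that density=True replaces the older normed=True argument, which was removed in Matplotlib 3.1.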
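To check that integer edges with align='left' really centre the bars on the ticks, here is a small sketch that reads the bar positions back from the patches hist returns. The data is made up, and the Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

data = [1, 1, 2, 3, 3, 3, 4, 5]  # made-up sample

# Edges at 1, 2, ..., 6; align='left' centres each bar on its bin's left edge.
counts, edges, patches = plt.hist(data, bins=range(1, 7), align='left', rwidth=1)

# Recover the centre of every drawn bar from its rectangle.
centres = [p.get_x() + p.get_width() / 2 for p in patches]
print(centres)
plt.xticks(range(1, 6))
```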
Related
I am drawing some graphs and I want to import them into LaTeX in a 2-by-2 layout. One of the problems is that the values on the y-axis of one graph range from 1 to 6, while for another graph they range from 1 to 200. Because of that, when I import the graphs into my document, they do not look good. Is there any way to set the same width for the values on the y-axis?
You can set the y axis limits using ax.set_ylim or plt.ylim:
# Set axis from 1 to 200
ax.set_ylim((1,200))
# Or just set it directly - this will also act on the current axis
plt.ylim((1,200))
Edit: The question is about widths rather than limits.
I think making the subplots together on one figure should solve this problem.
plt.figure()
plt.subplot(2,2,1)
plt.plot(x1,y1)
.
.
plt.subplot(2,2,4)
plt.plot(x4,y4)
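A slightly fuller sketch of the same idea, using plt.subplots to get a 2-by-2 grid in one figure. The data here is made up, with one panel reaching 6 and another reaching 200 as in the question, and the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 11)
# Made-up series with very different y ranges.
series = [x * 0.6, x * 20, x ** 0.5, x * 2]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, y in zip(axes.flat, series):
    ax.plot(x, y)

# Let matplotlib fit tick labels of different widths inside the one figure.
fig.tight_layout()
# fig.savefig("plots.pdf")  # hypothetical file name; include this single file in LaTeX
```

Because all four panels live in one figure, the spacing is worked out once for the whole grid instead of separately for four images.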
I'm trying to use matplotlib and contourf to generate some filled (polar) contour plots of velocity data. I have some data (MeanVel_Z_Run16_np) I am plotting on theta (Th_Run16) and r (R_Run16), as shown here:
fig,ax = plt.subplots(subplot_kw={'projection':'polar'})
levels = np.linspace(-2.5,4,15)
cplot = ax.contourf(Th_Run16,R_Run16,MeanVel_Z_Run16_np,levels,cmap='plasma')
ax.set_rmax(80)
ax.set_rticks([15,30,45,60])
rlabels = ax.get_ymajorticklabels()
for label in rlabels:
    label.set_color('#E6E6FA')
cbar = plt.colorbar(cplot,pad=0.1,ticks=[0,3,6,9,12,15])
cbar.set_label(r'$V_{Z}$ [m/s]')
plt.show()
This generates the following plot:
Velocity plot with 15 levels:
Which looks great (and accurate), outside of that random straight orange line roughly between 90deg and 180deg. I know that this is not real data because I plotted this in MATLAB and it did not appear there. Furthermore, I have realized it appears to relate to the number of contour levels I use. For example, if I bump this code up to 30 levels instead of 15, the result changes significantly, with odd triangular regions of uniform value:
Velocity plot with 30 levels:
Does anyone know what might be going on here? How can I get contourf to just plot my data without these strange misrepresentations? I would like to use 15 contour levels at least. Thank you.
I am experimenting with JointGrid from Seaborn. I used .plot_joint() to plot my scatter plot, group-colored using the hue parameter. I have filtered my dataset to only include 2 of the 5 groups, to prevent too much overlap in the plots.
The plotted points appear correct, in that they match what I expect from the two groups I chose. Additionally, I double-checked my filtering by viewing the filtered dataframe. That too was correct as it contained only the two groups I chose.
However the legend that is automatically plotted along with the scatterplot is incorrect. It shows 4 groups (not sure why not 5), and the coloring is also incorrect. For 2 groups I would expect only the Red and Blue colors (the first 2 colors in the Set1 palette), but my 2nd group is colored with the 4th color in the Set1 palette.
plt.rcParams['figure.figsize'] = (12, 4)
df_tmp = df[df.Kmeans_Clusters.isin([0, 3])].copy()
# initialize Joint Grid
grid = sns.JointGrid(data=df_tmp, x='MP', y='PTS')
# plot scatter (main plot)
grid = grid.plot_joint(sns.scatterplot, data=df_tmp, hue='Kmeans_Clusters',
                       palette='Set1')
# plot marginal distplot for cluster 0, X & Y
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 0].MP, ax=grid.ax_marg_x,
             vertical=False, color='firebrick', label='Cluster0')
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 0].PTS, ax=grid.ax_marg_y,
             vertical=True, color='firebrick', label='Cluster0')
# plot marginal distplot for cluster 3, X & Y
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 3].MP, ax=grid.ax_marg_x,
             vertical=False, color='steelblue', label='Cluster3')
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 3].PTS, ax=grid.ax_marg_y,
             vertical=True, color='steelblue', label='Cluster3')
plt.suptitle('PTS vs MP, Cluster 0 & 3\n1982-2019', y=1.05, fontsize=20)
plt.show()
[Image: jointgrid_incorrect_legend_and_coloring]
--- Update---
I just tried this with a simple scatterplot (no JointGrid) and I can repeat my previous observation. Is there just something I am not understanding with the hue parameter and the scatterplot() function?
I do not see this issue with lmplot()
plt.rcParams['figure.figsize'] = (12, 4)
df_tmp = df[df.Kmeans_Clusters.isin([0, 3])].copy()
sns.scatterplot(data=df_tmp, y='PTS', x='MP', hue='Kmeans_Clusters', palette='Set1')
plt.title('PTS vs MP\n1982-2019')
plt.xlabel('Minutes Played Annually')
plt.ylabel('Points Scored Annually')
plt.show()
Again, once I fine-tuned my searches, I was able to find the solution. In fact, here's another Stack Overflow question that asks the same thing and is answered in detail: The `hue` parameter in Seaborn.relplot() skips an integer when given numerical data?.
Pasting the solution I used, as described in the link above:
"""
An alternative is to make sure the values are treated as categorical. Unfortunately, even if you plug in the numbers as strings, they will be converted to numbers, falling back to the same mechanism described above. This may be seen as a bug.
However, one choice you have is to use real categories, like e.g. single letters.
'cluster':list("ABCDE")
works fine,
"""
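A minimal sketch of that workaround with pandas, mapping the numeric cluster ids to letters before plotting. The column name Kmeans_Clusters comes from the question; the sample rows, the cluster_label column and the letter mapping are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the filtered dataframe from the question.
df_tmp = pd.DataFrame({
    "MP": [1200, 2500, 800, 3000],
    "PTS": [400, 1500, 250, 2100],
    "Kmeans_Clusters": [0, 3, 0, 3],
})

# Hypothetical mapping from numeric cluster id to a letter label.
letters = {0: "A", 1: "B", 2: "C", 3: "D", 4: "E"}
df_tmp["cluster_label"] = df_tmp["Kmeans_Clusters"].map(letters)

print(df_tmp["cluster_label"].unique())
# Then pass hue="cluster_label" instead of hue="Kmeans_Clusters" to sns.scatterplot.
```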
I want to create a pie chart from a single column of my dataframe; say the column name is 'Score'. I have stored scores in this column as below:
Score
.92
.81
.21
.46
.72
.11
.89
Now I want to create a pie chart showing the percentage of scores in each range.
Say 0-0.4 is 30%, 0.4-0.7 is 35%, and 0.7+ is 35%.
I am using the code below:
df1['bins'] = pd.cut(df1['Score'],bins=[0,0.5,1], labels=["0-50%","50-100%"])
df1 = df.groupby(['Score', 'bins']).size().unstack(fill_value=0)
df1.plot.pie(subplots=True,figsize=(8, 3))
With the above code I am getting the pie chart, but I don't know how I can do this using percentages.
My pie chart looks like this for now:
Cutting the dataframe up into bins is the right first step. After that, you can use value_counts with normalize=True to get the relative frequencies of the values in the bins column. This shows what percentage of the data falls into each of the ranges defined by the bins.
In terms of plotting the pie chart, I'm not sure I understood correctly, but it seems you would like to display the correct legend values and the percentage value in each slice of the pie.
The pandas.DataFrame.plot documentation is a good place to see all the parameters that can be passed into the plot method. You can specify which columns to use for x and y, and by default the dataframe index is used as the legend in the pie plot.
To show the percentage value per slice, you can use the autopct parameter. As mentioned in this answer, you can use all the normal matplotlib plt.pie() flags in the plot method as well.
Bringing everything together, this is the resultant code and the resultant chart:
import pandas as pd

df = pd.DataFrame({'Score': [0.92,0.81,0.21,0.46,0.72,0.11,0.89]})
df['bins'] = pd.cut(df['Score'], bins=[0,0.4,0.7,1], labels=['0-0.4','0.4-0.7','0.7-1'], right=True)
# Name the column explicitly: pandas 2.0+ calls the value_counts result 'proportion', not 'bins'
bin_percent = (df['bins'].value_counts(normalize=True) * 100).rename('percent').to_frame()
plot = bin_percent.plot.pie(y='percent', figsize=(5, 5), autopct='%1.1f%%')
Plot of Pie Chart
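For reference, a short sketch of what the binning step produces for the seven sample scores, without the plotting. It uses the same bin edges and labels as above; the variable names are mine.

```python
import pandas as pd

scores = pd.Series([0.92, 0.81, 0.21, 0.46, 0.72, 0.11, 0.89])
labels = ["0-0.4", "0.4-0.7", "0.7-1"]
binned = pd.cut(scores, bins=[0, 0.4, 0.7, 1], labels=labels)

# Share of scores in each range, in percent, in bin order.
percent = binned.value_counts(normalize=True).sort_index() * 100
print(percent.round(1))
```

Two of the seven scores fall in 0-0.4, one in 0.4-0.7 and four in 0.7-1, which is what the slice percentages in the pie reflect.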
I wrote the following program in Python to obtain equi-width histograms, but when I plot it I get a single line in the figure instead of a histogram. Can someone please help me figure out where I am going wrong?
import numpy as np
import matplotlib.pyplot as plt
for num in range(0,5):
    hist, bin_edges = np.histogram([1000, 98,99992,8474,95757,958574,97363,97463,1,4,5], bins = 5)
    plt.bar(bin_edges[:-1], hist, width = 1000)
    plt.xlim(min(bin_edges), max(bin_edges))
    plt.show()
Additionally, I want to label each plot with its "num" value, which ranges from 0 to 4. In the example above I have kept my data constant, but I intend to change the data for different "num" values.
Look at your bin edges:
>>> bin_edges
array([ 1.00000000e+00, 1.91715600e+05, 3.83430200e+05,
5.75144800e+05, 7.66859400e+05, 9.58574000e+05])
Your bin positions range from 1 to approximately 1 million, but you only gave the bars a width of 1000. Your bars, where they exist at all, are too skinny to be seen. Also, most of the bars have zero height, because most of the bins are empty:
>>> hist
array([10, 0, 0, 0, 1])
The "line" you see is the last bin, with one element. This bin covers a span of approximately 200000, but the bar width is only 1000, so it is very thin relative to the amount of space it is supposed to cover. The bar of height 10 is also there, but it's also very skinny, and jammed up against the left edge of the plot, so it's basically invisible.
It doesn't make sense to try to use constant-width bars while also placing them at x-coordinates that correspond to their size. By putting the bars at those x-coordinates, you are already spacing them out proportional to the bin widths; making the bars skinnier doesn't bring them closer together, it just makes them invisible.
If you want to use constant-width bars, you should put them at sequential X positions and use labels on the axis to show the values the bins represent. Here's a simple example with your data:
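# align='edge' starts each bar at its x position; since Matplotlib 2.0 the default is 'center'
plt.bar(np.arange(len(bin_edges)-1), hist, width=1, align='edge')
plt.xticks((np.arange(len(bin_edges))-0.5)[1:], bin_edges[:-1])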
You'll have to decide how you want to format those labels.
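Putting that together with the request to label each plot with its num value, one possible version is sketched below. The label format, the plot_hist helper name and the use of a title for the label are my choices, not something from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to show windows
import matplotlib.pyplot as plt
import numpy as np


def plot_hist(data, num, bins=5):
    """Draw equal-width bars, one per bin, labelled with the bin ranges."""
    hist, bin_edges = np.histogram(data, bins=bins)
    positions = np.arange(len(hist))
    # One "low-high" label per bin, rounded to whole numbers.
    labels = [f"{lo:.0f}-{hi:.0f}" for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

    fig, ax = plt.subplots()
    ax.bar(positions, hist, width=1, align="center", edgecolor="black")
    ax.set_xticks(positions)
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_title(f"num = {num}")
    fig.tight_layout()
    return hist, labels


data = [1000, 98, 99992, 8474, 95757, 958574, 97363, 97463, 1, 4, 5]
for num in range(0, 5):
    hist, labels = plot_hist(data, num)
    plt.close("all")  # call plt.show() instead when working interactively
```

Here the bars are centred on sequential positions and each tick carries the range of its bin, so the bar heights stay readable even though the bins are about 190000 wide.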