Matplotlib plotting range of values as a bar - python

I am stuck with the following problem: using Matplotlib, I need to plot an array of data where the abscissa is a range of values (e.g. [1000..2000]), while the ordinate is a single value.
I need to plot the data as a bar that starts at 1000 (in the example above) and ends at 2000. On the ordinate, the bar sits at the level of the single value mentioned above.
Any ideas? I looked through various examples, but I only see bars and histograms, which do something different.

Just use plot to make a wide line:
import matplotlib.pyplot as plt
# Set the cap style to "butt"; otherwise the line is drawn slightly longer than required.
plt.plot([1000, 2000], [5, 5], lw=10, color="orange", solid_capstyle="butt")
plt.yticks(range(10))
plt.xticks(range(500, 3000, 500))
plt.margins(0.5)
plt.show()
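
If several ranges have to be drawn, an alternative to a thick line is a horizontal bar whose left edge and width are given in data coordinates. This is a minimal sketch, not part of the original answer: the arrays `starts`, `ends` and `levels` are made-up example data, and `height=0.4` is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical example data: each bar spans [start, end] on the x-axis
# and sits at the given level on the y-axis.
starts = np.array([1000, 1500, 600])
ends = np.array([2000, 2400, 900])
levels = np.array([5, 3, 7])

fig, ax = plt.subplots()
# barh takes the left edge and the width, so width = end - start.
bars = ax.barh(y=levels, width=ends - starts, left=starts, height=0.4, color="orange")
ax.set_yticks(range(10))
ax.set_xlim(500, 2500)
# plt.show()  # uncomment to display the figure
```

Because the bar extent is set in data units, no cap-style adjustment is needed; the trade-off is that the bar thickness is also in data units rather than points.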


How to create a stacked barchart for a large dataset in Python?

For research purposes at my university, I need to create a stacked bar chart for speech data. I would like to represent the hours of speech on the y-axis and the frequency on the x-axis. The speech comes from different components, hence the stacked part of the chart. The data resides in a Pandas dataframe, which has a lot of columns, but the important ones are "component", "hours" and "ps_med_freq", which are used in the graph.
A simplified view of the DF (it has 6.2k rows and 120 columns, a-k components):
component  filename   ps_med_freq (rounded to integer)  hours (length)  ...
a          fn0001_ps  230                               0.23
b          fn0002_ps  340                               0.12
c          fn003_ps   278                               0.09
I have already tried this with matplotlib, seaborn or just the plot method from the Pandas dataframe itself. None seem to work properly.
A snippet of seaborn code I have tried:
sns.barplot(data=meta_dataframe, x='ps_med_freq', y='hours', hue='component', dodge=False)
And basically all variations of this as well.
Below you can see one of the most "viable" results I've had so far:
[image: example of failed graph]
It seems to have a lot of inexplicable grey blobs, which I first attributed to the large dataset, but if I just plot it as a histogram and count the frequencies instead of showing them by hour, it works perfectly fine. Does anyone know a solution to this?
Thanks in advance!
P.S.: Yes, I realise this is a huge dataset and at first sight, the graph seems useless with that much data on it, but matplotlib has interactive graphs where you can zoom etc., where this kind of graph becomes useful for my purpose.
With sns.barplot you're creating a bar for each individual frequency value. You'll probably want to group similar frequencies together, as with sns.histplot(..., multiple='stack'). If you want a lot of detail, you can increase the number of bins for the histogram. Note that sns.barplot never creates stacks, it would just plot each bar transparently on top of the others.
You can create a histogram, using the hours as weights, so they get summed.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some suitable random test data
np.random.seed(20230104)
component_prob = np.random.uniform(0.1, 2, 7)
component_prob /= component_prob.sum()
df = pd.DataFrame({'component': np.random.choice([*'abcdefg'], 6200, p=component_prob),
                   'ps_med_freq': (np.random.normal(0.05, 1, 6200).cumsum() + 200).astype(int),
                   'hours': np.random.randint(1, 39, 6200) * .01})
# create bins for every range of 10, suitably rounded,
# and shifted by 0.001 to avoid floating point roundings
bins = np.arange(df['ps_med_freq'].min() // 10 * 10 - 0.001, df['ps_med_freq'].max() + 10, 10)
plt.figure(figsize=(16, 5))
ax = sns.histplot(data=df, x='ps_med_freq', weights='hours', hue='component', palette='bright',
                  multiple='stack', bins=bins)
# sns.move_legend(ax, loc='upper left', bbox_to_anchor=(1.01, 1.01)) # legend outside
sns.despine()
plt.tight_layout()
plt.show()
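
To see what using the hours as weights means, the effect can be inspected on a tiny example with NumPy's `np.histogram`, which accepts a `weights` argument of the same kind. This is only an illustrative sketch: the five values in `freq` and `hours` and the bin edges are made up and are not taken from the question's data.

```python
import numpy as np

# Tiny made-up sample: median frequency and duration (hours) per recording.
freq = np.array([201, 205, 212, 219, 225])
hours = np.array([0.5, 0.25, 1.0, 0.5, 2.0])
edges = np.array([200, 210, 220, 230])

# Without weights each recording counts as 1;
# with weights its hours are added to the bin instead.
counts, _ = np.histogram(freq, bins=edges)
hours_per_bin, _ = np.histogram(freq, bins=edges, weights=hours)

print(counts)         # number of recordings per bin
print(hours_per_bin)  # total hours per bin
```

The stacked plot applies the same summation separately for each component and draws the resulting bar segments on top of each other.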

Histogram shows unlimited bins despite bin specification in matplotlib

I have some error data. When I tried to make a histogram of it, the bins came out very wide, as shown in the image below.
Below is the code:
import matplotlib.pyplot as plt
plt.figure()
plt.hist(error)
plt.title('histogram of error')
plt.xlabel('error')
plt.show()
When I specify the bins explicitly, as we usually do and as in the code below, I get the histogram shown below:
plt.figure()
plt.hist(error, bins=[-4, -3, -2, -1, 0, 1, 2, 3, 4])
# plt.hist(error, bins=6)
plt.title('histogram of error')
plt.xlabel('error')
plt.show()
I would like the histogram to look nice, something like the one below (an example from Google), with clearly defined bins.
I tried seaborn's displot and it gave a nice plot, as shown below.
import seaborn as sns
sns.displot(error, bins=[-4, -3, -2, -1, 0, 1, 2, 3, 4])
plt.title('histogram of error')
plt.xlabel('error')
plt.show()
Why is matplotlib not able to make this plot? Did I miss anything, or do I need to set something in order to get the usual histogram plot? Please point it out.
The matplotlib documentation for plt.hist() explains that the first parameter can either be a 1D array or a sequence of 1D arrays. The latter case is used if you pass in a 2D array, where each column is treated as a separate dataset, and results in plotting a separate bar with cycling colors for each of the columns.
This is what we see in your example: The X-axis ticks still correspond to the bin-edges that were passed in - but for each bin there are many bars. So, I'm assuming you passed in a multidimensional array.
To fix this, simply flatten your data before passing it to matplotlib, e.g. plt.hist(np.ravel(error), bins=bins).
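
The difference can be reproduced with a small sketch. The data here is an assumption: the question does not show `error`, so a random 100 x 5 array stands in for it.

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumed stand-in for the asker's data: a 2D array of errors (100 rows, 5 columns).
rng = np.random.default_rng(0)
error = rng.normal(size=(100, 5))
bins = [-4, -3, -2, -1, 0, 1, 2, 3, 4]

fig, (ax_2d, ax_flat) = plt.subplots(1, 2, figsize=(10, 4))

# 2D input: one set of bars per dataset, drawn side by side within each bin.
counts_2d, _, _ = ax_2d.hist(error, bins=bins)
ax_2d.set_title("2D input")

# Flattened input: a single histogram over all values.
counts_flat, _, _ = ax_flat.hist(np.ravel(error), bins=bins)
ax_flat.set_title("flattened input")
# plt.show()  # uncomment to display the figure
```

Summing the per-dataset counts bin by bin gives the same numbers as the flattened histogram, so flattening changes only how the data is grouped, not what is counted.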

Vary legend properties for different data in seaborn scatterplots

I have created a seaborn scatterplot for a dataset, where I set the size parameter to one column and the hue parameter to another. The hue column only contains five different values and is supposed to help classify my data, while the size column contains many more values and represents actual numeric data. In the current dataset, my hue values are only 0, 2, and 4, but with the "brief" legend option the legend labels do not match those values, which is very confusing. With the "full" legend option, the hue labels are correct, but there are far too many size labels. Therefore I would like to display the full legend for my hue parameter, but only a brief legend for the size parameter, because it consists of lots of unique values.
[image: how the overcrowded "full" legend looks]
[image: the "brief" legend that is confusingly labeled]
Edit: I added some code that demonstrates the issue for a random dataset. To restate my question, I want the "shape" values to be fully depicted in the legend, while the "size" values have to be shortened (equivalent to the legend setting "brief").
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x_condition=np.arange(0,20,1)
y_condition=np.arange(0,20,1)
size=np.random.randint(0,200,20)
# I haven't made a random distribution here, because I wanted to make sure it contains at least one of each [0,2,4]
shape=[0,2,0,4]*5
df=pd.DataFrame({"x_condition":x_condition,"y_condition":y_condition,"size":size,"shape":shape})
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x="x_condition", y="y_condition", hue="shape", size="size", data=df, palette="coolwarm", legend="brief")
plt.show()
sns.scatterplot(x="x_condition", y="y_condition", hue="shape", size="size", data=df, palette="coolwarm", legend="full")
plt.show()
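
No answer was recorded for this question. One possible approach, offered only as a sketch under a stated assumption: seaborn abbreviates the "brief" legend for numeric variables, so converting the hue column to strings should make it categorical and list each of its levels, while the numeric size column stays abbreviated. The random data mirrors the question's example; the string conversion is the assumed workaround, not something from the thread.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x_condition": np.arange(20),
    "y_condition": np.arange(20),
    "size": rng.integers(0, 200, 20),
    "shape": [0, 2, 0, 4] * 5,
})

# Assumption: a string-valued hue column is treated as categorical, so the
# "brief" legend lists all of its levels while still sampling the numeric sizes.
df["shape"] = df["shape"].astype(str)

ax = sns.scatterplot(data=df, x="x_condition", y="y_condition",
                     hue="shape", size="size", palette="coolwarm", legend="brief")
labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(labels)
# plt.show()  # uncomment to display the figure
```

Printing the legend labels is a quick way to check whether the hue levels 0, 2 and 4 all appear without one entry per size value.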

Differences between bar plots in Matplotlib and pandas

I feel like I'm missing something ridiculously basic here.
If I'm trying to create a bar chart with values from a dataframe, what's the difference between calling .plot on the dataframe object and just entering the data within plt.plot's parentheses?
e.g.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
VERSUS
df.groupby('category').count().plot(kind='bar')?
Can someone please walk me through what the difference is and when I should use either? I get that with plt.plot I'm calling the plot method of the plt (Matplotlib) library, whereas when I do df.plot I'm calling plot on the dataframe? What does that mean exactly -- that the dataframe has a plot object?
Those are different plotting methods. Fundamentally, they both produce a matplotlib object, which can be shown via one of the matplotlib backends.
There is, however, an important difference. Pandas bar plots are categorical in nature. This means that bars are positioned at consecutive integer positions, and each bar gets a tick with a label taken from the index of the dataframe.
For example:
import matplotlib.pyplot as plt
import pandas as pd
s = pd.Series([30,20,10,40], index=[1,4,5,9])
s.plot.bar()
plt.show()
Here, there are four bars. The first is at position 0, with the first label of the series' index, 1. The second is at position 1, with the label 4, and so on.
In contrast, a matplotlib bar plot is numeric in nature. Compare this to
import matplotlib.pyplot as plt
import pandas as pd
s = pd.Series([30,20,10,40], index=[1,4,5,9])
plt.bar(s.index, s.values)
plt.show()
Here the bars are at the numerical position of the index; the first bar at 1, the second at 4 etc. and the axis labelling is independent of where the bars are.
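
The two behaviours can be checked by reading the bar positions back from the axes. This is a minimal sketch; the helper `bar_centers` is a hypothetical name introduced here for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([30, 20, 10, 40], index=[1, 4, 5, 9])
fig, (ax_pd, ax_mpl) = plt.subplots(1, 2)

s.plot.bar(ax=ax_pd)           # pandas: categorical positions
ax_mpl.bar(s.index, s.values)  # matplotlib: numeric positions

def bar_centers(ax):
    # Centre of each bar rectangle along the x-axis, rounded to hide float noise.
    return [round(float(p.get_x() + p.get_width() / 2), 6) for p in ax.patches]

pandas_centers = bar_centers(ax_pd)
mpl_centers = bar_centers(ax_mpl)
print(pandas_centers)  # consecutive positions starting at 0
print(mpl_centers)     # the numeric index values
```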
Note that you can achieve a categorical bar plot with matplotlib by casting your values to strings.
plt.bar(s.index.astype(str), s.values)
The result looks similar to the pandas plot, except for some minor tweaks like rotated labels and bar widths. In case you are interested in tweaking some sophisticated properties, it will be easier to do with a matplotlib bar plot, because that directly returns the bar container with all the bars.
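bc = plt.bar(s.index.astype(str), s.values)
for bar in bc:
    # set_edgecolor is only an example; any setter of the bar rectangles works here
    bar.set_edgecolor("black")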
The pandas plot function uses Matplotlib's pyplot to do the plotting; it acts as a shortcut.
I was similarly confused when I started trying to visualise my data, but in the end I decided to learn matplotlib because it gives you more control over the visualisation.
I think it depends on the data you have. If you have a clean data frame and you just want to plot something quickly, then you can use df.plot. For example, you can group by a column and then specify x and y axes.
If you want a more complicated graph, then working directly with matplotlib is better. In the end, matplotlib will give you more options.
This is a good reference to start with: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/understand-df-plot-in-pandas/

Pyplot colormap line by line

I'm just getting started with plotting in Python using the very nice pyplot. I want to show the evolution of two series of data over time. Instead of a plain plot of the data as a function of time, I'd like to have a scatter plot (data1, data2) where the time component is shown as a color gradient.
In my two-column file, the time is given by the line number. It could either be written as a third column in the file, or pyplot could work out the line number on its own.
Can anyone help me do that?
Thanks a lot.
Nicolas
When plotting with matplotlib.pyplot.scatter you can pass a third array via the keyword argument c. The values in this array determine the color of each scatter point. You then pick an appropriate colormap from matplotlib.cm and pass it with the cmap keyword argument.
This toy example creates two datasets, data1 and data2. It also creates an array colors of values equally spaced between 0 and 1, with the same length as data1 and data2. It doesn't need to know the "line number"; it only needs the total number of data points, over which it spaces the colors equally.
I've also added a colorbar. You can remove this by removing the plt.colorbar() line.
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
N = 500
data1 = np.random.randn(N)
data2 = np.random.randn(N)
colors = np.linspace(0,1,N)
plt.scatter(data1, data2, c=colors, cmap=cm.Blues)
plt.colorbar()
plt.show()
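
For the two-column file described in the question, the line number itself can serve as the color value. This is a sketch under assumptions: the file contents are invented, and an in-memory `io.StringIO` object stands in for the real file, whose path would be passed to `np.loadtxt` instead.

```python
import io
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for the two-column file; pass a real path to np.loadtxt instead.
fake_file = io.StringIO("0.0 1.0\n0.5 0.8\n1.0 0.5\n1.5 0.1\n")
data = np.loadtxt(fake_file)  # shape (n_lines, 2)
data1, data2 = data[:, 0], data[:, 1]

# The line number serves as the time axis, so no third column is needed.
line_number = np.arange(len(data))

fig, ax = plt.subplots()
points = ax.scatter(data1, data2, c=line_number, cmap="Blues")
fig.colorbar(points, ax=ax, label="line number")
# plt.show()  # uncomment to display the figure
```

With this variant the colorbar is labelled in line numbers rather than in the 0 to 1 range used above.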
