Why is matplotlib .plot(kind='bar') plot so different to .plot() - python

This may be a very stupid question, but when plotting a Pandas DataFrame using .plot() it is very quick and produces a graph with an appropriate index. As soon as I try to change this to a bar chart, it just seems to lose all formatting and the index goes wild. Why is this the case? And is there an easy way to just plot a bar chart with the same format as the line chart?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.date_range(start='01/01/2012', end='31/12/2018')
df['Value'] = np.random.randint(low=5, high=100, size=len(df))
df.set_index('Date', inplace=True)
df.plot()
plt.show()
df.plot(kind='bar')
plt.show()
Update:
For comparison, if I take the data and put it into Excel, then create a line plot and a bar ('column') plot it instantly will convert the plot and keep the axis labels as they were for the line plot. If I try to produce many (thousands) of bar charts in Python with years of daily data, this takes a long time. Is there just an equivalent way of doing this Excel transformation in Python?

Pandas bar plots are categorical in nature; i.e. each bar is a separate category and gets its own label. Plotting numeric bar plots (in the same manner as line plots) is not currently possible with pandas.
In contrast matplotlib bar plots are numerical if the input data is numbers or dates. So
plt.bar(df.index, df["Value"])
produces a bar chart whose x-axis is a true date axis, like the one in the line plot.
Note, however, that because the 2557 data points in your dataframe are distributed over only a few hundred pixels, not all bars are actually drawn. Conversely, if you want every bar to be shown, each one needs to be at least one pixel wide in the final image. With 5% margins on each side, this means your figure needs to be more than 2800 pixels wide, or saved in a vector format.
So rather than showing daily data, maybe it makes sense to aggregate to monthly or quarterly data first.
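As a rough sketch of the aggregation suggested above (the monthly mean, the `"MS"` month-start binning and the bar width of 20 days are illustrative choices, not something the question prescribes), the daily values can be resampled before being handed to matplotlib's numeric bar plot:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Same shape of data as in the question: one random value per day, 2012-2018.
idx = pd.date_range(start="2012-01-01", end="2018-12-31", freq="D")
df = pd.DataFrame({"Value": np.random.randint(5, 100, size=len(idx))}, index=idx)

# Aggregate to one value per month; "MS" labels each bin by its first day.
# (The mean is an assumption here; sum() or max() may suit your data better.)
monthly = df["Value"].resample("MS").mean()

fig, ax = plt.subplots(figsize=(10, 4))
# On a date axis the bar width is measured in days.
ax.bar(monthly.index, monthly.values, width=20, align="edge")
ax.set_ylabel("Monthly mean of Value")
# Call plt.show() or fig.savefig(...) to view the result.
```

With 84 bars instead of 2557, each bar gets several pixels even in a figure of ordinary size.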

The default .plot() connects all your data points with straight lines and produces a line plot.
On the other hand, .plot(kind='bar') plots each data point as a discrete bar. To get proper formatting on the x-axis, you will have to modify the tick labels after plotting.
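One possible way to adjust the tick labels after plotting is sketched below. It assumes a shorter two-year date range than the question (to keep the example quick) and labels only the first day of each quarter; both choices are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

idx = pd.date_range(start="2017-01-01", end="2018-12-31", freq="D")
df = pd.DataFrame({"Value": np.random.randint(5, 100, size=len(idx))}, index=idx)

ax = df.plot(kind="bar", width=1.0, legend=False)

# pandas draws bar i at x = i, so tick positions are integer row offsets.
positions = np.flatnonzero(df.index.is_quarter_start)
labels = df.index[positions].strftime("%Y-%m")

# Replace the one-label-per-bar default with a handful of readable labels.
ax.set_xticks(positions)
ax.set_xticklabels(labels, rotation=45)
# Call plt.show() to view the result.
```

This keeps the categorical pandas plot but thins the labels; it does not turn the x-axis into a real date axis.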

Related

How to plot certain row and column using panda dataframe?

I have a very simple data frame but I could not plot a line using a row and a column. Here is an image, I would like to plot a "line" that connects them.
I tried to plot it but x-axis disappeared. And I would like to swap those axes. I could not find an easy way to plot this simple thing.
Try:
import matplotlib.pyplot as plt
# Categories will be the x axis, Seconds will be the y axis
plt.plot(data["Categories"], data["Seconds"])
plt.show()
Matplotlib generates the axis dynamically, so if you want the labels of the x-axis to appear you'll have to increase the size of your plot.

Differences between bar plots in Matplotlib and pandas

I feel like I'm missing something ridiculously basic here.
If I'm trying to create a bar chart with values from a dataframe, what's the difference between calling .plot on the dataframe object and just entering the data within plt.plot's parentheses?
e.g.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
VERSUS
df.groupby('category').count().plot(kind='bar')?
Can someone please walk me through what the difference is and when I should use either? I get that with plt.plot I'm calling the plot method of the plt (Matplotlib) library, whereas when I do df.plot I'm calling plot on the dataframe? What does that mean exactly -- that the dataframe has a plot object?
Those are different plotting methods. Fundamentally, they both produce a matplotlib object, which can be shown via one of the matplotlib backends.
There is, however, an important difference. Pandas bar plots are categorical in nature. This means that bars are positioned at consecutive integers, and each bar gets a tick with a label taken from the index of the dataframe.
For example:
import matplotlib.pyplot as plt
import pandas as pd
s = pd.Series([30,20,10,40], index=[1,4,5,9])
s.plot.bar()
plt.show()
Here, there are four bars. The first is at position 0, with the first label of the series' index, 1. The second is at position 1, with the label 4, etc.
In contrast, a matplotlib bar plot is numeric in nature. Compare this to
import matplotlib.pyplot as plt
import pandas as pd
s = pd.Series([30,20,10,40], index=[1,4,5,9])
plt.bar(s.index, s.values)
plt.show()
Here the bars are at the numerical position of the index; the first bar at 1, the second at 4 etc. and the axis labelling is independent of where the bars are.
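To check this difference without looking at the images, the bar positions can be read back from the two axes. This is only a small inspection sketch; the helper name `bar_centers` is made up for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([30, 20, 10, 40], index=[1, 4, 5, 9])
fig, (ax_pd, ax_mpl) = plt.subplots(ncols=2)

s.plot.bar(ax=ax_pd)           # categorical: bars at 0, 1, 2, 3
ax_mpl.bar(s.index, s.values)  # numeric: bars at 1, 4, 5, 9

def bar_centers(ax):
    """Return the x coordinate of the centre of every bar on the axes."""
    return [round(p.get_x() + p.get_width() / 2, 6) for p in ax.patches]

pandas_centers = bar_centers(ax_pd)
mpl_centers = bar_centers(ax_mpl)
# The pandas tick labels come from the index, not from the bar positions.
pandas_labels = [t.get_text() for t in ax_pd.get_xticklabels()]
print(pandas_centers, pandas_labels, mpl_centers)
```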
Note that you can achieve a categorical bar plot with matplotlib by casting your values to strings.
plt.bar(s.index.astype(str), s.values)
The result looks similar to the pandas plot, except for some minor tweaks like rotated labels and bar widths. In case you are interested in tweaking some sophisticated properties, it will be easier to do with a matplotlib bar plot, because that directly returns the bar container with all the bars.
bc = plt.bar(s.index, s.values)
for bar in bc:
    # each bar is a Rectangle; set_edgecolor is just one example of a setter
    bar.set_edgecolor("black")
The pandas plot function uses Matplotlib's pyplot to do the plotting; it is essentially a shortcut.
I was similarly confused when I started trying to visualise my data, but I eventually decided to learn matplotlib because it gives you more control over the visualisation.
I think it depends on the data you have. If you have a clean data frame and you just want to plot something quickly, then you can use df.plot. For example, you can group by a column and then specify x and y axes.
If you want a more complicated graph, then working directly with matplotlib is better. In the end, matplotlib will give you more options.
This is a good reference to start with: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/understand-df-plot-in-pandas/
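As a small illustration of the "shortcut" idea, the sketch below shows that a pandas plot call hands back an ordinary matplotlib Axes, which can then be customised with matplotlib methods. The `category` column and its values are invented for the example.

```python
import matplotlib.axes
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one row per observation, with a category column.
df = pd.DataFrame({"category": ["a", "b", "a", "c", "a"]})
counts = df.groupby("category").size()

# The quick pandas route ...
ax = counts.plot(kind="bar")
print(isinstance(ax, matplotlib.axes.Axes))

# ... still gives access to everything matplotlib offers.
ax.set_ylabel("count")
for bar in ax.patches:
    bar.set_edgecolor("black")
# Call plt.show() to view the result.
```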

Python Pandas Matplotlib : How to Plot Graph without Numerics?

I want to plot a bar graph (or other graphs) in Python from a Pandas dataframe, using two columns that don't contain numeric data. One column is Operating System, the other is Computer Name. I want to plot a graph showing how many systems are running each OS; the sample data is like below.
How can I plot a bar graph or other graphs for these two columns? When I try the code below:
ax = dfdefault[['Operating System','Computer Name']].plot(kind='bar')
ax.set_xlabel("Hour", fontsize=12)
ax.set_ylabel("V", fontsize=12)
plt.show()
I get this error:
Error:
TypeError: Empty 'DataFrame': no numeric data to plot
You will need to count the occurrence of each operating system first and then plot using a bar graph or pie chart. bar expects numeric data already, which you don't have. Counting will take care of this. Here is an example using a pie chart:
df = pd.DataFrame(
[['asd', 'win'],
['sdf', 'mac'],
['aww', 'win'],
['dd', 'linux']],
columns=['computer', 'os']
)
df['os'].value_counts().plot.pie()
A bar chart would work similarly. Just change pie to bar.
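For completeness, a bar-chart version of the same example could look like the sketch below. The axis labels and `rot=0` are cosmetic assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    [['asd', 'win'],
     ['sdf', 'mac'],
     ['aww', 'win'],
     ['dd', 'linux']],
    columns=['computer', 'os']
)

# Count how many computers run each operating system ...
os_counts = df['os'].value_counts()

# ... and plot the counts, which are numeric, as bars.
ax = os_counts.plot.bar(rot=0)
ax.set_xlabel("Operating system")
ax.set_ylabel("Number of computers")
# Call plt.show() to view the result.
```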

How do I create a multiline plot using seaborn?

I am trying out Seaborn to make my plot visually better than matplotlib. I have a dataset which has a column 'Year' which I want to plot on the X-axis and 4 Columns say A,B,C,D on the Y-axis using different coloured lines. I was trying to do this using the sns.lineplot method but it allows for only one variable on the X-axis and one on the Y-axis. I tried doing this
sns.lineplot(data_preproc['Year'],data_preproc['A'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['B'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['C'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['D'], err_style=None)
But this way I don't get a legend in the plot to show which coloured line corresponds to what. I tried checking the documentation but couldn't find a proper way to do this.
Seaborn favors the "long format" as input. The key ingredient to convert your DataFrame from its "wide format" (one column per measurement type) into long format (one column for all measurement values, one column to indicate the type) is pandas.melt. Given a data_preproc structured like yours, filled with random values:
num_rows = 20
years = list(range(1990, 1990 + num_rows))
data_preproc = pd.DataFrame({
'Year': years,
'A': np.random.randn(num_rows).cumsum(),
'B': np.random.randn(num_rows).cumsum(),
'C': np.random.randn(num_rows).cumsum(),
'D': np.random.randn(num_rows).cumsum()})
A single plot with four lines, one per measurement type, is obtained with
sns.lineplot(x='Year', y='value', hue='variable',
data=pd.melt(data_preproc, ['Year']))
(Note that 'value' and 'variable' are the default column names returned by melt, and can be adapted to your liking.)
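To see what the long format looks like, and how the default names can be changed, here is a pandas-only sketch. The names `series` and `reading` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

num_rows = 20
data_preproc = pd.DataFrame({
    'Year': list(range(1990, 1990 + num_rows)),
    'A': np.random.randn(num_rows).cumsum(),
    'B': np.random.randn(num_rows).cumsum(),
    'C': np.random.randn(num_rows).cumsum(),
    'D': np.random.randn(num_rows).cumsum()})

# Wide -> long: one row per (Year, series) pair instead of one column per series.
long_df = pd.melt(data_preproc, id_vars=['Year'],
                  var_name='series', value_name='reading')

print(long_df.head())
print(long_df.shape)
```

The resulting frame can then be passed on as `sns.lineplot(x='Year', y='reading', hue='series', data=long_df)`.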
This:
sns.lineplot(data=data_preproc)
will do what you want.
See the documentation:
sns.lineplot(x="Year", y="signal", hue="label", data=data_preproc)
You probably need to re-organize your dataframe in a suitable way so that there is one column for the x data, one for the y data, and one which holds the label for the data point.
You can also just use matplotlib.pyplot. If you import seaborn, much of the improved design is also used for "regular" matplotlib plots. Seaborn is really "just" a collection of methods which conveniently feed data and plot parameters to matplotlib.

Plotting multiple timeseries power data using matplotlib and pandas

I have a csv file of power levels at several stations (4 in this case, though "HUT4" is not in this short excerpt):
2014-06-21T20:03:21,HUT3,74
2014-06-21T21:03:16,HUT1,70
2014-06-21T21:04:31,HUT3,73
2014-06-21T21:04:33,HUT2,30
2014-06-21T22:03:50,HUT3,64
2014-06-21T23:03:29,HUT1,60
(etc . .)
The times are not synchronised across stations. The power level is (in this case) integer percent. Some machines report in volts (~13.0), which would be an additional issue when plotting.
The data is easy to read into a dataframe, to index the dataframe, to put into a dictionary. But I can't get the right syntax to make a meaningful plot. Either all stations on a single plot sharing a timeline that's big enough for all stations, or as separate plots, maybe a subplot for each station. If I do:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Power_Log.csv',names=['DT','Station','Power'])
df2=df.groupby(['Station']) # set 'Station' as the data index
d = dict(iter(df2)) # make a dictionary including each station's data
for stn in d.keys():
    d[stn].plot(x='DT',y='Power')
plt.legend(loc='lower right')
plt.savefig('Station_Power.png')
I do get a plot but the X axis is not right for each station.
I have not figured out yet how to do four independent subplots, which would free me from making a wide-enough timescale.
I would greatly appreciate comments on getting a single plot right and/or getting good looking subplots. The subplots do not need to have synchronised X axes.
I'd rather plot the typical way, something like:
import matplotlib.pyplot as plt
plt.plot([1,2,3,4], [1,4,9,16], 'ro')
plt.axis([0, 6, 0, 20])
plt.savefig('Station_Power.png')  # savefig needs a file name
( http://matplotlib.org/users/pyplot_tutorial.html )
Re more subplots: simply call plt.plot() multiple times, once for each data series.
P.S. you can set xticks this way: Changing the "tick frequency" on x or y axis in matplotlib?
Sorry for the comment above where I needed to add code. Still learning . .
From the 5th code line:
import matplotlib.dates as mdates
for stn in d.keys():
    plt.figure()
    d[stn].interpolate().plot(x='DT',y='Power',title=stn,rot=45)
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%D/%M/%Y'))
    plt.savefig('Station_Power_'+stn+'.png')
Does more or less what I want to do except the DateFormatter line does not work. I would like to shorten my datetime data to show just date. If it places ticks at midnight that would be brilliant but not strictly necessary.
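Two things are likely to interfere with the DateFormatter line, though this is an educated guess rather than a confirmed diagnosis: the DT column is read as plain strings unless it is parsed as dates, and in strftime notation `%M` means minutes (day and month are `%d` and `%m`). The sketch below uses the six sample rows from the question inline, parses the timestamps, and draws one independent subplot per station with matplotlib directly. The figure size and marker are arbitrary.

```python
import io
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# The excerpt from the question, inlined so the sketch is self-contained.
csv_text = """2014-06-21T20:03:21,HUT3,74
2014-06-21T21:03:16,HUT1,70
2014-06-21T21:04:31,HUT3,73
2014-06-21T21:04:33,HUT2,30
2014-06-21T22:03:50,HUT3,64
2014-06-21T23:03:29,HUT1,60
"""
df = pd.read_csv(io.StringIO(csv_text), names=['DT', 'Station', 'Power'],
                 parse_dates=['DT'])  # real datetimes instead of strings

stations = sorted(df['Station'].unique())
fig, axes = plt.subplots(nrows=len(stations), squeeze=False,
                         figsize=(8, 2.5 * len(stations)))

for ax, stn in zip(axes.ravel(), stations):
    grp = df[df['Station'] == stn]
    ax.plot(grp['DT'], grp['Power'], marker='o')
    ax.set_title(stn)
    # One tick per day, placed at midnight, labelled with the date only.
    ax.xaxis.set_major_locator(mdates.DayLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%d/%m/%Y'))
    ax.tick_params(axis='x', labelrotation=45)

fig.tight_layout()
# Call plt.show() or fig.savefig(...) to view the result.
```

The six sample rows span only a few hours, so no midnight tick falls inside the plotted range here; with several days of data the daily ticks appear. Each subplot keeps its own x-range because the axes are not shared.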
The key to getting a continuous plot is to use the interpolate() method in the plot.
With this data having different x scales from station to station a plot of all stations on the same graph does not work. HUT4 in my data has far fewer records and only plots to about 25% of the scale even though the datetime values cover more or less the same range as the other HUTs.
