Series markers in pandas dataframe plots

I'm plotting a pandas dataframe which contains multiple time series.
I have more series than the number of colors matplotlib chooses from, so there is ambiguity in mapping legend colors to plots.
I haven't seen any matplotlib examples that assign markers as a batch across all series, and I'm wondering if there's a way to pass a list of marker styles that df.plot() can rotate through in the same way it chooses colors.
df.plot(markers = ??)

A for loop would be sufficient:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.arange(16).reshape(4, -1))
for c, m in zip(df, 'oxds'):
    df[c].plot(marker=m)  # one marker per column
plt.legend()
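If you would rather avoid the loop, df.plot() also accepts a style argument that takes one matplotlib format string per column, so a short marker list can be cycled to cover every series. A minimal sketch, assuming made-up data with 12 columns and an arbitrary choice of six markers:

```python
from itertools import cycle, islice

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: 12 series, more than the default color cycle holds.
df = pd.DataFrame(np.arange(48).reshape(4, -1))

# Repeat a short marker list until there is one format string per column.
markers = ["o", "x", "d", "s", "^", "v"]
styles = [m + "-" for m in islice(cycle(markers), df.shape[1])]

ax = df.plot(style=styles)  # marker plus solid line for each column
ax.legend(ncol=3)
```

Colors still come from the default cycle, so each series ends up with a distinct color and marker combination. Call plt.show() to display the figure.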

Related

Why is matplotlib .plot(kind='bar') plot so different to .plot()

This may be a very stupid question, but when plotting a Pandas DataFrame using .plot() it is very quick and produces a graph with an appropriate index. As soon as I try to change this to a bar chart, it just seems to lose all formatting and the index goes wild. Why is this the case? And is there an easy way to just plot a bar chart with the same format as the line chart?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.date_range(start='2012-01-01', end='2018-12-31')
df['Value'] = np.random.randint(low=5, high=100, size=len(df))
df.set_index('Date', inplace=True)
df.plot()
plt.show()
df.plot(kind='bar')
plt.show()
Update:
For comparison, if I put the data into Excel and create a line plot and a bar ('column') plot, Excel converts the plot instantly and keeps the axis labels as they were for the line plot. If I try to produce many (thousands) of bar charts in Python with years of daily data, it takes a long time. Is there an equivalent way of doing this Excel transformation in Python?
Pandas bar plots are categorical in nature; i.e. each bar is a separate category and gets its own label. Plotting numeric bar plots (in the same manner as line plots) is not currently possible with pandas.
In contrast, matplotlib bar plots are numerical if the input data is numbers or dates. So
plt.bar(df.index, df["Value"])
produces a bar plot with a proper date axis.
Note, however, that because there are 2557 data points in your dataframe, distributed over only a few hundred pixels, not all bars are actually drawn. Conversely, if you want every bar to be shown, each needs to be at least one pixel wide in the final image. With 5% margins on each side, that means your figure needs to be more than 2800 pixels wide (2557 / 0.9 ≈ 2841), or saved in a vector format.
So rather than showing daily data, maybe it makes sense to aggregate to monthly or quarterly data first.
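As a rough illustration of the aggregation idea, here is a sketch that resamples the daily values to monthly means before drawing the bars. The choice of the mean and the bar width of 25 days are assumptions for the example, not something taken from the question:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same shape of data as in the question: one random value per day.
idx = pd.date_range(start="2012-01-01", end="2018-12-31", freq="D")
df = pd.DataFrame({"Value": np.random.randint(5, 100, size=len(idx))}, index=idx)

# One mean value per calendar month, labelled with the first day of the month.
monthly = df["Value"].resample("MS").mean()

fig, ax = plt.subplots()
# The x axis is a date axis, so the bar width is given in days.
ax.bar(monthly.index, monthly.values, width=25)
ax.set_ylabel("Monthly mean of Value")
```

This draws 84 bars instead of 2557, which fits comfortably in a default-sized figure.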
The default .plot() connects all your data points with straight lines and produces a line plot.
On the other hand, .plot(kind='bar') plots each data point as a discrete bar. To get proper formatting on the x-axis, you will have to modify the tick labels after plotting.
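Modifying the tick labels after plotting could look something like the sketch below. It assumes a shorter made-up series of 120 days and an arbitrary step of 30 bars between labels:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up daily data, short enough that every bar is visible.
idx = pd.date_range("2018-01-01", periods=120, freq="D")
df = pd.DataFrame({"Value": np.random.randint(5, 100, size=len(idx))}, index=idx)

ax = df.plot(kind="bar", width=1.0, legend=False)

# Pandas puts the bars at integer positions 0..n-1, so label every 30th one by hand.
step = 30
positions = list(range(0, len(df), step))
ax.set_xticks(positions)
ax.set_xticklabels([df.index[i].strftime("%Y-%m-%d") for i in positions], rotation=45)
```

The bars stay categorical; only the labels are thinned out so the axis remains readable.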

Plot stacked histogram with grouped DataFrame

I want a stacked histogram where the different classes are visible.
At the moment I have the histogram without classes with this code:
plt.hist(hist_matrix2.column_name)
I also have another histogram of the same data, grouped by class, from this code:
hist_matrix2.groupby("number").column_name.plot.hist(alpha=0.5, bins = [0,5,10,15,20,25,30], stacked = True)
The classes show up in the result, but the histogram is not stacked even though the parameter is set. What can I do to stack the classes?
plt.hist has a built-in stacking flag you can set:
plt.hist(hist_matrix2.column_name, stacked=True)
Edit, in response to your question: for long data (with multiple levels stacked), you first need to restructure the data into a list of lists, one per class:
wide = hist_matrix2.pivot(columns='number', values='column_name')
# This creates many missing values, which pandas does not like, so we drop them
widelist = [wide[col].dropna() for col in wide.columns]
# and the stacked graph is here
plt.hist(widelist, stacked=True)
plt.show()
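For anyone who wants to try this without the original data, here is a self-contained sketch. The DataFrame is a made-up stand-in for hist_matrix2, and it builds the per-class lists with groupby instead of pivot, which is a swapped-in alternative rather than the method above:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up stand-in for hist_matrix2: a value column and a class column.
hist_matrix2 = pd.DataFrame({
    "column_name": rng.integers(0, 30, size=300),
    "number": rng.integers(1, 4, size=300),
})

# One array of values per class; groupby avoids the NaN-filled wide table.
groups = {key: grp["column_name"].to_numpy()
          for key, grp in hist_matrix2.groupby("number")}

bins = [0, 5, 10, 15, 20, 25, 30]
counts, edges, _ = plt.hist(list(groups.values()), bins=bins, stacked=True,
                            label=[f"class {k}" for k in groups])
plt.legend()
```

With stacked=True the returned counts are cumulative across classes, so the last row holds the total height of each bar.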

Color time-series based on column values in pandas

I have a time-series in a pandas DataFrame (df.data in the example) and want to color the plot based on the values of another column (df.colors in the example; values are 0, 1, and 2 in this case, but it would be good / more portable if it would also work with floats).
import pandas as pd
from numpy.random import seed, randn, randint
n = 10
seed(1)
df = pd.DataFrame(data={"data": randn(n), "colors": randint(0, 3, n)},
                  index=pd.date_range(start="2016-01-01", periods=n))
df.data.plot(style=".", ms=10)
What I am looking for is something like
df.data.plot(style=".", color=df.colors)
(which does not work), in order to produce a plot like this:
Here the markers are colored red, orange, and green, for colors==0, 1, and 2, respectively. It's relatively easy to do this manually for few data and few colors, but is there a straightforward way to do this automatically?
There seems to be a solution using plt.scatter and colormaps, as shown in the answer to How to use colormaps to color plots of Pandas DataFrames, but using plt.scatter with a datetime index destroys the convenient automatic axis scaling of using df.data.plot(...). Is there a way using this notation?
One way to achieve this would be to use DF.replace and create a nested dictionary to specify the color values for the int/float values to be mapped against.
# matplotlib 3.6 renamed this style to 'seaborn-v0_8-white'; use that name on newer versions
plt.style.use('seaborn-white')
df.replace({'colors':{0:'red',1:'orange',2:'green'}}, inplace=True)
You could then perform DF.groupby on it to keep the colors same for each subgroup of the groupby object on every iteration step.
for index, group in df.groupby('colors'):
    group['data'].plot(style=".", x_compat=True, ms=10, color=index, grid=True)
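A variation on the same idea that leaves the DataFrame untouched is to keep the value-to-color mapping in a separate dictionary and look the color up inside the loop. This is only a sketch: the palette dictionary and the legend labels are assumptions added for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

n = 10
rng = np.random.default_rng(1)
df = pd.DataFrame({"data": rng.standard_normal(n), "colors": rng.integers(0, 3, n)},
                  index=pd.date_range(start="2016-01-01", periods=n))

# Assumed mapping from column value to color, as described in the question.
palette = {0: "red", 1: "orange", 2: "green"}

fig, ax = plt.subplots()
for value, group in df.groupby("colors"):
    # Each subgroup is drawn on the same axes with its own color.
    group["data"].plot(ax=ax, style=".", ms=10, color=palette[value],
                       x_compat=True, label=f"colors == {value}")
ax.legend()
```

Because the plotting still goes through the pandas plot method, the date axis keeps its automatic formatting.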

Sorted bar charts with pandas/matplotlib or seaborn

I have a dataset of 5000 products with 50 features. One of the columns is 'colors' and there are more than 100 colors in the column. I'm trying to plot a bar chart to show only the top 10 colors and how many products there are in each color.
top_colors = df.colors.value_counts()
top_colors[:10].plot(kind='barh')
plt.xlabel('No. of Products');
Using Seaborn:
sns.factorplot("colors", data=df , palette="PuBu_d");
1) Is there a better way to do this?
2) How can I replicate this with Seaborn?
3) How do I plot such that the highest count is at the top (i.e. black at the very top of the bar chart)?
An easy trick might be to invert the y axis of your plot, rather than futzing with the data:
import string
import numpy as np
import pandas as pd
s = pd.Series(np.random.choice(list(string.ascii_uppercase), 1000))
counts = s.value_counts()
ax = counts.iloc[:10].plot(kind="barh")
ax.invert_yaxis()  # largest count ends up at the top
When this answer was written, Seaborn's barplot did not support horizontally oriented bars, but if you want to control the order the bars appear in you can pass a list of values to the x_order param (called order in later Seaborn releases). I think it's easier to use the pandas plotting methods here anyway.
If you want to use pandas then you can first sort:
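# barh draws the first row at the bottom, so sort ascending to put the largest count on top
top_colors[:10].sort_values(ascending=True).plot(kind='barh')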
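To check the ordering on something small, here is a sketch with a made-up color column whose counts are chosen so the ranking is unambiguous (the data and the top-3 cutoff are assumptions for the example):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up color column with clearly different counts per color.
colors = pd.Series(["black"] * 50 + ["white"] * 30 + ["red"] * 20 + ["blue"] * 10)

top = colors.value_counts().head(3)       # largest counts first
ax = top.sort_values().plot(kind="barh")  # ascending, so the largest bar is drawn last (top)
ax.set_xlabel("No. of Products")
```

The y tick labels run from bottom to top, so the most frequent color is the last label and sits at the top of the chart.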
Seaborn already styles your pandas plots, but you can also use:
sns.barplot(x=top_colors.index, y=top_colors.values)

Using a Pandas dataframe index as values for x-axis in matplotlib plot

I have time series in a Pandas DataFrame with a number of columns which I'd like to plot. Is there a way to set the x-axis to always use the index from a DataFrame?
When I use the .plot() method from Pandas the x-axis is formatted correctly; however, when I pass my dates and the column(s) I'd like to plot directly to matplotlib, the graph doesn't plot correctly. Thanks in advance.
plt.plot(site2.index.values, site2['Cl'])
plt.show()
FYI: site2.index.values produces this (I've cut out the middle part for brevity):
array([
'1987-07-25T12:30:00.000000000+0200',
'1987-07-25T16:30:00.000000000+0200',
'2010-08-13T02:00:00.000000000+0200',
'2010-08-31T02:00:00.000000000+0200',
'2010-09-15T02:00:00.000000000+0200'
],
dtype='datetime64[ns]')
It seems the issue was that I had .values. Without it (i.e. site2.index) the graph displays correctly.
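A minimal sketch of the working version, assuming a small made-up stand-in for site2 with irregular timestamps and a 'Cl' column:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for site2: irregular timestamps and a 'Cl' column.
index = pd.to_datetime(["1987-07-25 12:30", "1987-07-25 16:30",
                        "2010-08-13 02:00", "2010-08-31 02:00"])
site2 = pd.DataFrame({"Cl": [1.2, 1.5, 0.9, 1.1]}, index=index)

fig, ax = plt.subplots()
ax.plot(site2.index, site2["Cl"], marker="o")  # the index supplies the x values
fig.autofmt_xdate()                            # rotate the date labels
```

Passing the index itself lets matplotlib treat the x values as dates and pick its own date ticks.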
You can use plt.xticks to set the x-axis tick locations and labels. Try:
plt.plot(site2['Cl'].values)  # plot against integer positions
plt.xticks(range(len(site2)), site2.index.astype(str))  # locations, labels
plt.show()
see the documentation for more details: http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.xticks
That's built right into the plot() method.
You can use yourDataFrame.plot(use_index=True) to put the DataFrame index on the x-axis.
Read More Here: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.plot.html
If you want matplotlib to select a 'sensible' scale, as I did, here is one way to use a Pandas DataFrame index as the values for the x-axis in a matplotlib plot. Code:
fig, ax = plt.subplots()
ax.plot(site2['Cl'].values)  # plot against integer positions
x_ticks = ax.get_xticks()  # use matplotlib default xticks
x_ticks = [int(t) for t in x_ticks if t == int(t) and 0 <= t < len(site2)]
ax.set_xticks(x_ticks)
ax.set_xticklabels(site2.index[x_ticks].astype(str))
