how do I change the frequency while producing a bar plot - python

This is an addition to the original question i had asked here
Unable to change the tick frequency on my chart
The answer works absolutely fine, but when my index starts from say 2100 (instead of 0) in my original Q,the graph looks incorrect.
How do I fix it?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,10,(90,1)),columns=['Values'])
df.index = np.arange(2100,2190,1)
df.plot(kind='bar', xticks=np.arange(2100,2190,5))

With bar plots, the xticks are the range index. So you want:
df = pd.DataFrame(np.random.randint(1,10,(90,1)),columns=['Values'])
df.index = np.arange(2100,2190,1)
ax = df.plot(kind='bar')
ax.set_xticks(np.arange(0,len(df),5))
ax.set_xticklabels(df.index[::5]);
Output:

Related

Plotly Express Chart Gaps Even with Index

I am having trouble eliminating datetime gaps within a dataset that i'm trying to create a very simple line chart in plotly express and I have straight lines on the graph connecting datapoints over a gap in the data (weekends).
Dataframe simply has an index of datetime (to the hour) called sale_date, and cols called NAME, COST with approximately 30 days worth of data.
df['sale_date'] = pd.to_datetime(df['sale_date'])
df = df.set_index('sale_date')
px.line(df, x=df.index, y='COST', color='NAME')
I've seen a few posts regarding this issue and one recommended setting datetime as the index, but it still yields the gap lines.
The data in the example may not be the same as yours, but the point is that you can change the x-axis data to string data instead of date/time data, or change the x-axis type to category, and add a scale and tick text.
import pandas as pd
import plotly.express as px
import numpy as np
np.random.seed(2021)
date_rng = pd.date_range('2021-08-01','2021-08-31', freq='B')
name = ['apple']
df = pd.DataFrame({'sale_date':pd.to_datetime(date_rng),
'COST':np.random.randint(100,3000,(len(date_rng),)),
'NAME':np.random.choice(name,size=len(date_rng))})
df = df.set_index('sale_date')
fig= px.line(df, x=[d.strftime('%m/%d') for d in df.index], y='COST', color='NAME')
fig.show()
xaxis update
fig= px.line(df, x=df.index, y='COST', color='NAME')
fig.update_xaxes(type='category',
tickvals=np.arange(0,len(df)),
ticktext=[d.strftime('%m/%d') for d in df.index])

Plot stacked bar chart from pandas data frame

I have dataframe:
payout_df.head(10)
What would be the easiest, smartest and fastest way to replicate the following excel plot?
I've tried different approaches, but couldn't get everything into place.
Thanks
If you just want a stacked bar chart, then one way is to use a loop to plot each column in the dataframe and just keep track of the cumulative sum, which you then pass as the bottom argument of pyplot.bar
import pandas as pd
import matplotlib.pyplot as plt
# If it's not already a datetime
payout_df['payout'] = pd.to_datetime(payout_df.payout)
cumval=0
fig = plt.figure(figsize=(12,8))
for col in payout_df.columns[~payout_df.columns.isin(['payout'])]:
plt.bar(payout_df.payout, payout_df[col], bottom=cumval, label=col)
cumval = cumval+payout_df[col]
_ = plt.xticks(rotation=30)
_ = plt.legend(fontsize=18)
Besides the lack of data, I think the following code will produce the desired graph
import pandas as pd
import matplotlib.pyplot as plt
df.payout = pd.to_datetime(df.payout)
grouped = df.groupby(pd.Grouper(key='payout', freq='M')).sum()
grouped.plot(x=grouped.index.year, kind='bar', stacked=True)
plt.show()
I don't know how to reproduce this fancy x-axis style. Also, your payout column must be a datetime, otherwise pd.Grouper won't work (available frequencies).

Side-by-side boxplot of multiple columns of a pandas DataFrame

One year of sample data:
import pandas as pd
import numpy.random as rnd
import seaborn as sns
n = 365
df = pd.DataFrame(data = {"A":rnd.randn(n), "B":rnd.randn(n)+1},
index=pd.date_range(start="2017-01-01", periods=n, freq="D"))
I want to boxplot these data side-by-side grouped by the month (i.e., two boxes per month, one for A and one for B).
For a single column sns.boxplot(df.index.month, df["A"]) works fine. However, sns.boxplot(df.index.month, df[["A", "B"]]) throws an error (ValueError: cannot copy sequence with size 2 to array axis with dimension 365). Melting the data by the index (pd.melt(df, id_vars=df.index, value_vars=["A", "B"], var_name="column")) in order to use seaborn's hue property as a workaround doesn't work either (TypeError: unhashable type: 'DatetimeIndex').
(A solution doesn't necessarily need to use seaborn, if it is easier using plain matplotlib.)
Edit
I found a workaround that basically produces what I want. However, it becomes somewhat awkward to work with once the DataFrame includes more variables than I want to plot. So if there is a more elegant/direct way to do it, please share!
df_stacked = df.stack().reset_index()
df_stacked.columns = ["date", "vars", "vals"]
df_stacked.index = df_stacked["date"]
sns.boxplot(x=df_stacked.index.month, y="vals", hue="vars", data=df_stacked)
Produces:
here's a solution using pandas melting and seaborn:
import pandas as pd
import numpy.random as rnd
import seaborn as sns
n = 365
df = pd.DataFrame(data = {"A": rnd.randn(n),
"B": rnd.randn(n)+1,
"C": rnd.randn(n) + 10, # will not be plotted
},
index=pd.date_range(start="2017-01-01", periods=n, freq="D"))
df['month'] = df.index.month
df_plot = df.melt(id_vars='month', value_vars=["A", "B"])
sns.boxplot(x='month', y='value', hue='variable', data=df_plot)
month_dfs = []
for group in df.groupby(df.index.month):
month_dfs.append(group[1])
plt.figure(figsize=(30,5))
for i,month_df in enumerate(month_dfs):
axi = plt.subplot(1, len(month_dfs), i + 1)
month_df.plot(kind='box', subplots=False, ax = axi)
plt.title(i+1)
plt.ylim([-4, 4])
plt.show()
Will give this
Not exactly what you're looking for but you get to keep a readable DataFrame if you add more variables.
You can also easily remove the axis by using
if i > 0:
y_axis = axi.axes.get_yaxis()
y_axis.set_visible(False)
in the loop before plt.show()
This is quite straight-forward using Altair:
alt.Chart(
df.reset_index().melt(id_vars = ["index"], value_vars=["A", "B"]).assign(month = lambda x: x["index"].dt.month)
).mark_boxplot(
extent='min-max'
).encode(
alt.X('variable:N', title=''),
alt.Y('value:Q'),
column='month:N',
color='variable:N'
)
The code above melts the DataFrame and adds a month column. Then Altair creates box-plots for each variable broken down by months as the plot columns.

Date removed from x axis on overlaid plots matplotlib

I am trying to show time series lines representing an effort amount using matplotlib and pandas.
I've got my DF's to all to overlay in one plot, however when I do python seems to strip the x axis of the date and input some numbers. (I'm not sure where these come from but at a guess, not all days contain the same data so python has reverted to using an index id number). If I plot any one of these they come up with date on the x-axis.
Any hints or solutions to make the x axis show date for the multiple plot would be much appreciated.
This is the single figure plot with time axis:
Code I'm using to plot is
fig = pl.figure()
ax = fig.add_subplot(111)
ax.plot(b342,color='black')
ax.plot(b343,color='blue')
ax.plot(b344,color='red')
ax.plot(b345,color='green')
ax.plot(b346,color='pink')
ax.plot(fi,color='yellow')
plt.show()
This is the multiple plot fig with weird x axis:
One option would be to manually specify the x-axis based on the DataFrame index, and then plot directly using matplotlib.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# make up some data
n = 100
dates = pd.date_range(start = "2015-01-01", periods = n, name = "yearDate")
dfs = []
for i in range(3):
df = pd.DataFrame(data = np.random.random(n)*(i + 1), index = dates,
columns = ["FishEffort"] )
df.df_name = str(i)
dfs.append(df)
# plot it directly using matplotlib instead of through the DataFrame
fig = plt.figure()
ax = fig.add_subplot()
for df in dfs:
plt.plot(df.index,df["FishEffort"], label = df.df_name)
plt.legend()
plt.show()
Another option would be to concatenate your DataFrames and plot using Pandas. If you give your "FishEffort" field the correct label name when loading the data or via DataFrame.rename then the labels will be specified automatically.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n = 100
dates = pd.date_range(start = "2015-01-01", periods = n, name = "yearDate")
dfs = []
for i in range(3):
df = pd.DataFrame(data = np.random.random(n)*(i + 1), index = dates,
columns = ["DataFrame #" + str(i) ] )
df.df_name = str(i)
dfs.append(df)
df = pd.concat(dfs, axis = 1)
df.plot()
I've found an answer that does what I want, it seems that calling plt.plot wasn't using the date as the x axis, however calling it using the pandas documentation did the trick.
ax = b342.plot(label='342')
b343.plot(ax=ax, label='test')
b344.plot(ax=ax)
b345.plot(ax=ax)
b346.plot(ax=ax)
fi.plot(ax=ax)
plt.show()
I was wondering if anyone knew hwo to change the labels here?

python pandas plot with uneven timeseries index (with count evenly distributed)

My dataframe has uneven time index.
how could I find a way to plot the data, and local the index automatically? I searched here, and I know I can plot something like
e.plot()
but the time index (x axis) will be even interval, for example per 5 minutes.
if I have to 100 data in first 5 minutes and 6 data for the second 5 minutes, how do I plot
with number of data evenly. and locate the right timestamp on x axis.
here's even count, but I don't know how to add time index.
plot(e['Bid'].values)
example of data format as requested
Time,Bid
2014-03-05 21:56:05:924300,1.37275
2014-03-05 21:56:05:924351,1.37272
2014-03-05 21:56:06:421906,1.37275
2014-03-05 21:56:06:421950,1.37272
2014-03-05 21:56:06:920539,1.37275
2014-03-05 21:56:06:920580,1.37272
2014-03-05 21:56:09:071981,1.37275
2014-03-05 21:56:09:072019,1.37272
and here's the link
http://code.google.com/p/eu-ats/source/browse/trunk/data/new/eur-fix.csv
here's the code, I used to plot
import numpy as np
import pandas as pd
import datetime as dt
e = pd.read_csv("data/ecb/eur.csv", dtype={'Time':object})
e.Time = pd.to_datetime(e.Time, format='%Y-%m-%d %H:%M:%S:%f')
e.plot()
f = e.copy()
f.index = f.Time
x = [str(s)[:-7] for s in f.index]
ff = f.set_index(pd.Series(x))
ff.index.name = 'Time'
ff.plot()
Update:
I added two new plots for comparison to clarify the issue. Now I tried brute force to convert timestamp index back to string, and plot string as x axis. the format easily got messed up. it seems hard to customize location of x label.
Ok, it seems like what you're after is that you want to move around the x-tick locations so that there are an equal number of points between each tick. And you'd like to have the grid drawn on these appropriately-located ticks. Do I have that right?
If so:
import pandas as pd
import urllib
import matplotlib.pyplot as plt
import seaborn as sbn
content = urllib.urlopen('https://eu-ats.googlecode.com/svn/trunk/data/new/eur-fix.csv')
df = pd.read_csv(content, header=0)
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S:%f')
every30 = df.loc[df.index % 30 == 0, 'Time'].values
fig, ax = plt.subplots(1, 1, figsize=(9, 5))
df.plot(x='Time', y='Bid', ax=ax)
ax.set_xticks(every30)
I have tried to reproduce your issue, but I can't seem to. Can you have a look at this example and see how your situation differs?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
np.random.seed(0)
idx = pd.date_range('11:00', '21:30', freq='1min')
ser = pd.Series(data=np.random.randn(len(idx)), index=idx)
ser = ser.cumsum()
for i in range(20):
for j in range(8):
ser.iloc[10*i +j] = np.nan
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
ser.plot(ax=axes[0])
ser.dropna().plot(ax=axes[1])
gives the following two plots:
There are a couple differences between the graphs. The one on the left doesn't connect the non-continuous bits of data. And it lacks vertical gridlines. But both seem to respect the actual index of the data. Can you show an example of your e series? What is the exact format of its index? Is it a datetime_index or is it just text?
Edit:
Playing with this, my guess is that your index is actually just text. If I continue from above with:
idx_str = [str(x) for x in idx]
newser = ser
newser.index = idx_str
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
newser.plot(ax=axes[0])
newser.dropna().plot(ax=axes[1])
then I get something like your problem:
More edit:
If this is in fact your issue (the index is a bunch of strings, not really a bunch of timestamps) then you can convert them and all will be well:
idx_fixed = pd.to_datetime(idx_str)
fixedser = newser
fixedser.index = idx_fixed
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
fixedser.plot(ax=axes[0])
fixedser.dropna().plot(ax=axes[1])
produces output identical to the first code sample above.
Editing again:
To see the uneven spacing of the data, you can do this:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
fixedser.plot(ax=axes[0], marker='.', linewidth=0)
fixedser.dropna().plot(ax=axes[1], marker='.', linewidth=0)
Let me try this one from scratch. Does this solve your issue?
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
import urllib
content = urllib.urlopen('https://eu-ats.googlecode.com/svn/trunk/data/new/eur-fix.csv')
df = pd.read_csv(content, header=0, index_col='Time')
df.index = pd.to_datetime(df.index, format='%Y-%m-%d %H:%M:%S:%f')
df.plot()
The thing is, you want to plot bid vs time. If you've put the times into your index then they become your x-axis for "free". If the time data is just another column, then you need to specify that you want to plot bid as the y-axis variable and time as the x-axis variable. So in your code above, even when you convert the time data to be datetime type, you were never instructing pandas/matplotlib to use those datetimes as the x-axis.

Categories