Python vs matplotlib - Chart generation issue - python

I have the Python code below, but the chart it produces (see the attachment) is really messy. Can anybody tell me how to fix this and put the days in ascending order on the x-axis?
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("C/desktop/data.xlsx")
df = df.loc[df['month'] == 8]
df = df.astype({'day': str})
plt.plot( 'day', 'cases', data=df)
At first I did not convert the day to str, so it came out like this.
Because it had decimal numbers, I converted it to str, and now this happens.

What you got is typical of an unsorted dataset with many points per group.
As you did not provide an example, here is one:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)
df = pd.DataFrame({'day': np.random.randint(1, 21, size=100),
                   'cases': np.random.randint(0, 50000, size=100),
                   })
plt.plot('day', 'cases', data=df)
There is no reason to plot a line in this case, you can use a scatter plot instead:
plt.scatter('day', 'cases', data=df)
To make more sense of your data, you can also compute an aggregated value per day (e.g. the mean):
plt.plot('day', 'cases', data=df.groupby('day', as_index=False)['cases'].mean())
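To address the ascending-order part of the question directly, here is a minimal sketch. It assumes the day column holds whole-number days that were read as floats; the column names come from the question, but the sample values are made up. It converts day to integers instead of strings, averages the cases per day, and sorts before plotting:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample standing in for the Excel data in the question.
df = pd.DataFrame({
    "month": [8] * 6,
    "day": [3.0, 1.0, 2.0, 1.0, 3.0, 2.0],
    "cases": [30, 10, 20, 14, 34, 26],
})

daily = (
    df.loc[df["month"] == 8]
      .astype({"day": int})                            # integer days sort numerically
      .groupby("day", as_index=False)["cases"].mean()  # one value per day
      .sort_values("day")                              # ascending x values
)

fig, ax = plt.subplots()
ax.plot(daily["day"], daily["cases"], marker="o")
ax.set_xlabel("day")
ax.set_ylabel("mean cases")
```

Keeping the days numeric means matplotlib places them on a continuous axis, so a missing day shows up as a gap instead of being silently skipped.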

Related

Python seaborn line plot when i plot the x axis values are out of order (even though they are in order in dataframe)

I'm using seaborn to do a line plot.
here's a sample data:
error_mean.head(5)
output is below:
error_rate
10 0.829440
20 0.833747
30 0.835182
40 0.837922
50 0.835835
So the index values are indeed ordered (or at least they seem to be).
Here's my code plotting the above data:
plt.figure(figsize=(15,5))
sns.lineplot(x=error_mean.index.values, y=error_mean['error_rate'])
and I keep getting a plot like the following:
As you can see, the x-axis values are out of order! I tried googling this but I couldn't find similar issues answered.
I'd appreciate any help!
I guess the issue is that error_mean.index.values is an array of strings rather than integers. You need to convert it to int.
Check the difference between:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df1 = pd.DataFrame([
    ["10", 0.829440],
    ["20", 0.833747],
    ["100", 0.835182],
    ["40", 0.837922],
    ["50", 0.835835]])
sns.lineplot(x=df1[0], y=df1[1])
and
df1 = pd.DataFrame([
    ["10", 0.829440],
    ["20", 0.833747],
    ["100", 0.835182],
    ["40", 0.837922],
    ["50", 0.835835]])
sns.lineplot(x=(df1[0]).astype(int), y=df1[1])
So I would try:
plt.figure(figsize=(15,5))
sns.lineplot(x=error_mean.index.values.astype(int), y=error_mean['error_rate'])
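The reason string labels end up out of order is that strings are compared character by character, so "100" comes before "20". A small pandas-only sketch makes the difference visible; the frame below is a hypothetical stand-in for error_mean that reuses the values from the example above:

```python
import pandas as pd

# Hypothetical stand-in for error_mean: the index labels are strings.
error_mean = pd.DataFrame(
    {"error_rate": [0.829440, 0.833747, 0.835182, 0.837922, 0.835835]},
    index=["10", "20", "100", "40", "50"],
)

# Strings compare character by character, so "100" sorts before "20".
as_text = error_mean.sort_index().index.tolist()

# After converting the labels to integers the order is numeric.
error_mean.index = error_mean.index.astype(int)
as_numbers = error_mean.sort_index().index.tolist()

print(as_text)
print(as_numbers)
```

Checking error_mean.index.dtype before plotting is a quick way to tell which of the two cases applies.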

Inconsistent automatic pandas date labeling

I was wondering how exactly pandas formats the x-axis dates. I am using the same script on a bunch of data results, which all have the same pandas df format. However, pandas formats the dates of each df differently. How could this be made more consistent?
Each df has a DatetimeIndex like this, with dtype='datetime64[ns]':
>>> df.index
DatetimeIndex(['2014-10-02', '2014-10-03', '2014-10-04', '2014-10-05',
               '2014-10-06', '2014-10-07', '2014-10-08', '2014-10-09',
               '2014-10-10', '2014-10-11',
               ...
               '2015-09-23', '2015-09-24', '2015-09-25', '2015-09-26',
               '2015-09-27', '2015-09-28', '2015-09-29', '2015-09-30',
               '2015-10-01', '2015-10-02'],
              dtype='datetime64[ns]', name='Date', length=366, freq=None)
Eventually, I plot with df.plot() where the df has two columns.
But the axes of the plots have different styles, like this:
I would like all plots to have the x-axis style of the first plot. pandas should do this automatically, so I'd prefer not to start formatting xticks by hand, since I have quite a lot of data to plot. Could anyone explain what to do? Thanks!
EDIT:
I'm reading two csv-files from 2015. The first has the model results of about 200 stations, the second has the gauge measurements of the same stations. Later, I read another two csv-files from 2016 with the same format.
import pandas as pd
df_model = pd.read_csv(path_model, sep=';', index_col=0, parse_dates=True)
df_gauge = pd.read_csv(path_gauge, sep=';', index_col=0, parse_dates=True)
df = pd.DataFrame(columns=['model', 'gauge'], index=df_model.index)
df['model'] = df_model['station_1'].copy()
df['gauge'] = df_gauge['station_1'].copy()
df.plot()
I do this for each year, so the x-axis should look the same, right?
I do not think this is possible unless you modify the pandas library. I looked around a bit for options that one may set in pandas, but couldn't find one. Pandas tries to intelligently select the type of axis ticks using logic implemented here (I think). So in my opinion, it would be best to define your own function to make the plots and then override the tick formatting (although you do not want to do that).
There are many references around the internet which show how to do this. I used this one by "Simone Centellegher" and this Stack Overflow answer to come up with a function that may work for you (tested in Python 3.7.1 with matplotlib 3.0.2, pandas 0.23.4):
import pandas as pd
import numpy as np
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
# Pass the df, the x values, and the list of columns you want to plot
def my_plotter(df, xaxis, y_cols):
    fig, ax = plt.subplots()
    ax.plot(xaxis, df[y_cols])
    # Minor ticks on every month, major ticks on every year
    ax.xaxis.set_minor_locator(mdates.MonthLocator())
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_minor_formatter(mdates.DateFormatter('%b'))
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b\n%Y'))
    # Remove overlapping major and minor ticks
    majticklocs = ax.xaxis.get_majorticklocs()
    minticklocs = ax.xaxis.get_minorticklocs()
    minticks = ax.xaxis.get_minor_ticks()
    for i in range(len(minticks)):
        cur_mintickloc = minticklocs[i]
        if cur_mintickloc in majticklocs:
            minticks[i].set_visible(False)
    return fig, ax
df = pd.DataFrame({'values': np.random.randint(0, 1000, 36)},
                  index=pd.date_range(start='2014-01-01',
                                      end='2016-12-31', freq='M'))
fig, ax = my_plotter(df, df.index, ["values"])
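If fixed month and year ticks are too rigid, a different option (not the approach above, and not something pandas does on its own) is to plot with matplotlib directly and attach the same locator and formatter pair to every figure. The sketch below assumes matplotlib 3.1 or newer, where mdates.ConciseDateFormatter is available; the function name plot_with_fixed_dates and the random data are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_with_fixed_dates(df):
    """Plot every column of df against its DatetimeIndex with one tick style."""
    fig, ax = plt.subplots()
    for col in df.columns:
        ax.plot(df.index, df[col], label=col)
    # The same locator/formatter pair is used for every call, so the
    # labeling rules no longer depend on pandas' automatic choice.
    locator = mdates.AutoDateLocator()
    ax.xaxis.set_major_locator(locator)
    ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
    ax.legend()
    return fig, ax

# Made-up data with the same shape as in the question: 366 daily values.
idx = pd.date_range(start="2014-10-02", periods=366)
rng = np.random.default_rng(0)
df = pd.DataFrame({"model": rng.random(366), "gauge": rng.random(366)}, index=idx)
fig, ax = plot_with_fixed_dates(df)
```

Calling the same function for the 2015 and 2016 frames should then give matching axis styles, because the formatting rules are set explicitly each time.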

Plot stacked bar chart from pandas data frame

I have dataframe:
payout_df.head(10)
What would be the easiest, smartest and fastest way to replicate the following excel plot?
I've tried different approaches, but couldn't get everything into place.
Thanks
If you just want a stacked bar chart, then one way is to use a loop to plot each column in the dataframe and just keep track of the cumulative sum, which you then pass as the bottom argument of pyplot.bar
import pandas as pd
import matplotlib.pyplot as plt
# If it's not already a datetime
payout_df['payout'] = pd.to_datetime(payout_df.payout)
cumval = 0
fig = plt.figure(figsize=(12, 8))
for col in payout_df.columns[~payout_df.columns.isin(['payout'])]:
    plt.bar(payout_df.payout, payout_df[col], bottom=cumval, label=col)
    cumval = cumval + payout_df[col]
_ = plt.xticks(rotation=30)
_ = plt.legend(fontsize=18)
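Since the question did not include data, here is a self-contained variant of the same cumulative-bottom idea that can be run as is. Only the payout column name comes from the question; the other columns and all values are made up. It draws the bars at integer positions and labels them with formatted dates, which is an assumption about the desired look rather than a reproduction of the Excel chart:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data; only the 'payout' column name comes from the question.
payout_df = pd.DataFrame({
    "payout": pd.to_datetime(["2018-01-31", "2018-02-28", "2018-03-31"]),
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0, 6.0],
})

fig, ax = plt.subplots()
positions = np.arange(len(payout_df))
bottom = np.zeros(len(payout_df))
for col in payout_df.columns.drop("payout"):
    # Each series starts where the previous ones ended.
    ax.bar(positions, payout_df[col], bottom=bottom, label=col)
    bottom = bottom + payout_df[col].to_numpy()

ax.set_xticks(positions)
ax.set_xticklabels(payout_df["payout"].dt.strftime("%b %Y"), rotation=30)
ax.legend()
```

Integer positions keep the bars evenly spaced and equally wide even when the dates are irregular, at the cost of losing a true time axis.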
Besides the lack of data, I think the following code will produce the desired graph
import pandas as pd
import matplotlib.pyplot as plt
df.payout = pd.to_datetime(df.payout)
grouped = df.groupby(pd.Grouper(key='payout', freq='M')).sum()
grouped.plot(x=grouped.index.year, kind='bar', stacked=True)
plt.show()
I don't know how to reproduce this fancy x-axis style. Also, your payout column must be a datetime, otherwise pd.Grouper won't work (available frequencies).

Multiple single plots in seaborn with pandas groupby data

My issue is very specific, I guess, but I can't seem to find a proper solution, and I'm clueless about the error output that I get.
Anyway, I have a pandas dataframe loaded from an SQLite database.
data_frame = pd.read_sql_query(
    "SELECT (total_comb + total_comb_rc) as total_comb, p_val, w_length from {tn}".format(
        tn=table_name), conn)
With that loaded, I group the data by the 'w_length' value.
for i, group in data_frame.groupby('w_length'):
Now, I want to plot a scatter plot for each group created with seaborn lmplot.
for i, group in data_frame.groupby('w_length'):
    sns.lmplot(x=group['total_comb'], y=group['p_val'],
               data=group,
               fit_reg=False)
    sns.despine()
    plt.savefig('test_scatter'+i+'.png', dpi=400)
But for some reason I'm getting this output:
'[ 6.95485628e-02 3.53641178e-01 3.46862200e+06 4.11684800e+06] not in index'
and no plot file.
I know I'm doing something wrong, but I can't seem to figure it out.
PS: I know I can do something like this.
sns.lmplot(x='total_comb', y='p_val',
           data=data_frame,
           fit_reg=False,
           hue="w_length", x_jitter=.1, col="w_length", col_wrap=3, size=4)
but I also need the separate plots for each 'w_length'.
Thanks!!
Supposing the problem is not due to the data collection from the sql database, it's probably due to the fact that you call
sns.lmplot(x=group['total_comb'], y=group['p_val'], data=group)
instead of
sns.lmplot(x='total_comb', y='p_val', data=group)
Here is a working example, which produces two separate plots:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np; np.random.seed(42)
x = np.arange(24)
y = np.random.randint(1,10, len(x))
cat = np.random.choice(["A", "B"], size=len(x))
df = pd.DataFrame({"x": x, "y": y, "cat": cat})
for i, group in df.groupby('cat'):
    sns.lmplot(x="x", y="y", data=group, fit_reg=False)
    plt.savefig(__file__+str(i)+".png")
plt.show()
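One more detail in the question's loop is worth checking: if w_length is numeric, 'test_scatter'+i+'.png' raises a TypeError, because a number cannot be concatenated to a string. That is why the example above wraps the group key in str(). A minimal pandas-only sketch with made-up rows (the column names follow the question) shows what the loop yields:

```python
import pandas as pd

# Made-up rows; the column names follow the question.
data_frame = pd.DataFrame({
    "total_comb": [10, 20, 30, 40],
    "p_val": [0.07, 0.35, 0.02, 0.50],
    "w_length": [8, 8, 12, 12],
})

filenames = []
group_sizes = []
for key, group in data_frame.groupby("w_length"):
    # key is an integer here, so format it instead of using '+'.
    filenames.append(f"test_scatter{key}.png")
    # group is an ordinary DataFrame, so the plotting call can take
    # column names (x="total_comb", y="p_val") together with data=group.
    group_sizes.append(len(group))

print(filenames)
print(group_sizes)
```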

Time-series boxplot in pandas

How can I create a boxplot for a pandas time-series where I have a box for each day?
Sample dataset of hourly data where one box should consist of 24 values:
import numpy as np
import pandas as pd
n = 480
ts = pd.Series(np.random.randn(n),
               index=pd.date_range(start="2014-02-01",
                                   periods=n,
                                   freq="H"))
ts.plot()
I am aware that I could make an extra column for the day, but I would like to have proper x-axis labeling and x-limit functionality (like in ts.plot()), so being able to work with the datetime index would be great.
There is a similar question for R/ggplot2 here, if it helps to clarify what I want.
If it's an option for you, I would recommend using Seaborn, which is a wrapper for Matplotlib. You could do it yourself by looping over the groups from your time series, but that's much more work.
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
Which gives:
Note that I'm passing the day of year as the grouper to seaborn; if your data spans multiple years this won't work. You could then consider something like:
ts.index.to_series().apply(lambda x: x.strftime('%Y%m%d'))
Edit: for 3-hourly boxes you could use this as a grouper, but it only works if the timestamps have no minutes or smaller units defined:
[(dt - datetime.timedelta(hours=int(dt.hour % 3))).strftime('%Y%m%d%H') for dt in ts.index]
(Not enough rep to comment on accepted solution, so adding an answer instead.)
The accepted code has two small errors: (1) it needs a numpy import and (2) the x and y parameters in the boxplot statement need to be swapped. The following produces the plot shown.
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
I have a solution that may be helpful. It only uses native pandas and allows for hierarchical date-time grouping (i.e. spanning years). The key is that if you pass a function to groupby(), it will be called on each element of the dataframe's index. If your index is a DatetimeIndex (or similar), you can access all of the dt's convenience functions for resampling!
Try this:
n = 480
ts = pd.DataFrame(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
ts.groupby(lambda x: x.strftime("%Y-%m-%d")).boxplot(subplots=False, figsize=(12,9), rot=90)
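To see what the function passed to groupby() actually produces, the groups can be inspected before plotting. This is a minimal sketch on the same kind of random hourly data; the frequency is given as a pd.Timedelta here only to avoid depending on a particular spelling of the hourly alias:

```python
import numpy as np
import pandas as pd

n = 480
index = pd.date_range(start="2014-02-01", periods=n, freq=pd.Timedelta(hours=1))
ts = pd.DataFrame(np.random.randn(n), index=index)

# The function receives each index label (a Timestamp); its return value
# becomes the group key, so every calendar day forms one group.
grouped = ts.groupby(lambda stamp: stamp.strftime("%Y-%m-%d"))
sizes = grouped.size()

print(sizes.head())
```

With 480 hourly values this gives 20 groups of 24 rows each, which is exactly one box per day in the boxplot call above.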
