Struggling to convert grouped data to a boxplot with Pandas - python

I have a data set and want to create a box plot of one column grouped by another, but I cannot figure out how to code the boxplot properly:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
scooters = pd.read_csv(.......)
df = pd.DataFrame(scooters)
dx = df.groupby("Day of Week")["Trip Duration"]
box_plt = sb.boxplot(data = dx, x = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']);
box_plt.set(xlabel = ['Day of Week'], ylabel = ['Trip Duration'])
box_plt.plot()
plot.show()
Currently nothing happens when I run the above code. Every resource is definitely where it's supposed to be, but the goal, which was to group the Trip Durations by each day of the week and then make a box plot for each day, has been incredibly confusing. Any tips would be appreciated for how I can make a plot for each day. When I print the groups they're correctly grouped, as in there's a 0 group for Monday with all those values, 1 for Tuesday, etc.

Some remarks:
Seaborn does the grouping for you; it cannot create a boxplot from a groupby object and needs the original, ungrouped data.
pd.read_csv() already creates a dataframe, you shouldn't convert it a second time to a dataframe. (So, scooters in your example is already a dataframe)
For readability and checking with tutorials, it helps to use the standard abbreviations such as sns for seaborn
Seaborn automatically puts the dataframe columns as names for x and y labels.
order= can be used to fix an order on the x-axis. Otherwise, the order comes from the order they appear in the dataframe. (Using pd.Categorical() on the dataframe column is another way to set an order.)
sns.boxplot doesn't return a plot, it creates a plot and returns the ax on which the boxplot has been drawn.
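To illustrate the first and last remarks: a groupby result is a lazy grouping object rather than a table of values, which is why it cannot be passed as data. If you do want to work from the groups yourself, a minimal sketch could look like the following. It assumes synthetic data in place of the CSV and uses plain matplotlib instead of seaborn.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the scooter data (the real CSV is not available here)
days_of_week = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
rng = np.random.default_rng(0)
scooters = pd.DataFrame({'Day of Week': np.repeat(days_of_week, 50),
                         'Trip Duration': rng.exponential(20, size=7 * 50)})

grouped = scooters.groupby('Day of Week')['Trip Duration']
print(type(grouped).__name__)  # a groupby object, not the data itself

# Pull the values of each group out explicitly, in the desired order
durations = [grouped.get_group(day).to_numpy() for day in days_of_week]

fig, ax = plt.subplots()
boxes = ax.boxplot(durations)  # one box per list entry, at positions 1..7
ax.set_xticks(range(1, len(days_of_week) + 1))
ax.set_xticklabels(days_of_week)
ax.set(xlabel='Day of Week', ylabel='Trip Duration')  # plain strings, not lists
```

With seaborn, none of this manual work is needed: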
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
days_of_week = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
scooters = pd.DataFrame({'Day of Week': np.repeat(days_of_week, 200),
'Trip Duration': np.abs(np.random.randn(7, 200).cumsum(axis=1)).ravel() * 30 + 20})
sns.set_style('darkgrid')
ax = sns.boxplot(data=scooters, x='Day of Week', y='Trip Duration', order=days_of_week, palette='turbo')
plt.tight_layout()
plt.show()
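The pd.Categorical() alternative mentioned in the remarks can be sketched as follows. The data is synthetic and deliberately shuffled; the groupby at the end is only there to show that the category order is respected.

```python
import numpy as np
import pandas as pd

days_of_week = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
rng = np.random.default_rng(1)
# Rows deliberately in random day order
scooters = pd.DataFrame({'Day of Week': rng.choice(days_of_week, size=300),
                         'Trip Duration': rng.exponential(20, size=300)})

# Attach an explicit order to the column instead of passing order= to every plot
scooters['Day of Week'] = pd.Categorical(scooters['Day of Week'],
                                         categories=days_of_week, ordered=True)

# Grouped results now come out in category order, not alphabetical order
medians = scooters.groupby('Day of Week', observed=False)['Trip Duration'].median()
print(medians.index.tolist())
```

Seaborn's categorical plots should then follow the same category order without an explicit order= argument.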

Related

Python vs matplotlib - Chart generation issue

I have the Python code below, but the output is a chart like the one in the attachment, and it is really messy. Can anybody tell me how to fix this and put the days in ascending order on the X axis?
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("C/desktop/data.xlsx")
df = df.loc[df['month'] == 8]
df = df.astype({'day': str})
plt.plot( 'day', 'cases', data=df)
At first, I didn't convert the day to str, so it came out like this.
Because the days had decimal numbers, I converted them to str. Now this happens.
What you got is typical of an unsorted dataset with many points per group.
As you did not provide an example, here is one:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)
df = pd.DataFrame({'day': np.random.randint(1,21,size=100),
'cases': np.random.randint(0,50000,size=100),
})
plt.plot('day', 'cases', data=df)
There is no reason to plot a line in this case, you can use a scatter plot instead:
plt.scatter('day', 'cases', data=df)
To make more sense of your data, you can also compute an aggregated value (e.g., the mean) per day:
plt.plot('day', 'cases', data=df.groupby('day', as_index=False)['cases'].mean())
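If a line plot with the days in ascending order is still wanted, one option is to keep day numeric, aggregate to one value per day, and sort before plotting. This is a sketch on the same kind of synthetic data; the float-valued day column and the choice of sum as the aggregate are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
# 'day' stored as floats, mimicking a column that shows decimal numbers
df = pd.DataFrame({'day': np.random.randint(1, 21, size=100).astype(float),
                   'cases': np.random.randint(0, 50000, size=100)})

# Cast to int rather than str so the axis stays numeric and sorts correctly
df['day'] = df['day'].astype(int)

# One value per day, sorted ascending by day
daily = df.groupby('day', as_index=False)['cases'].sum().sort_values('day')

fig, ax = plt.subplots()
ax.plot(daily['day'], daily['cases'], marker='o')
ax.set(xlabel='day', ylabel='cases')
```

Call plt.show() afterwards to display the figure in a script.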

Dates in X-axis using pandas and matplotlib

I am trying to plot some data from pandas. First I group by week and count the rows in each week, then I want to plot a point for each date. However, when I plot, only some of the dates appear on the axis, not all of them.
I am using the following code:
my_data = res1.groupby(pd.Grouper(key='d', freq='W-MON')).agg('count').u
p1, = plt.plot(my_data, '.-')
a = plt.xticks(rotation=45)
My result is the following:
I wanted a value in the x-axis for each date in the grouped dataframe.
EDIT: I tried to use plt.xticks(list(my_data.index.astype(str)), rotation=45)
The plot I get is the following:
Please find a working chunk of code below:
from datetime import date, timedelta
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
a = pd.Series(np.random.randint(10, 99, 10))
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator())
plt.plot(pd.date_range(date(2016,1,1), periods=10, freq='D'), a)
plt.gcf().autofmt_xdate()
Hope it helps :)
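Applied to weekly grouped data like in the question, the same idea could be sketched as below. The frame res1 here is a hypothetical stand-in with a datetime column d and a counted column u; instead of a DayLocator, the ticks are placed explicitly at every grouped date.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Hypothetical stand-in for res1: one row per event, with a datetime column 'd'
rng = np.random.default_rng(2)
res1 = pd.DataFrame({'d': pd.to_datetime('2021-01-01')
                          + pd.to_timedelta(rng.integers(0, 70, size=500), unit='D'),
                     'u': 1})

# Count events per week (weeks ending on Monday)
my_data = res1.groupby(pd.Grouper(key='d', freq='W-MON'))['u'].count()

fig, ax = plt.subplots()
ax.plot(my_data.index, my_data.to_numpy(), '.-')
# Place one tick at every grouped date instead of relying on the automatic locator
ax.set_xticks(mdates.date2num(my_data.index))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
fig.autofmt_xdate(rotation=45)
```

With many weeks the labels will crowd, so this is only practical for a modest number of groups.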

datetime x-axis matplotlib labels causing uncontrolled overlap

I'm trying to plot a pandas series with a 'pandas.tseries.index.DatetimeIndex'. The x-axis labels stubbornly overlap, and I cannot make them presentable, even with several suggested solutions.
I tried stackoverflow solution suggesting to use autofmt_xdate but it doesn't help.
I also tried the suggestion to plt.tight_layout(), which fails to make an effect.
ax = test_df[(test_df.index.year ==2017) ]['error'].plot(kind="bar")
ax.figure.autofmt_xdate()
#plt.tight_layout()
print(type(test_df[(test_df.index.year ==2017) ]['error'].index))
UPDATE: That I'm using a bar chart is an issue. A regular time-series plot shows nicely-managed labels.
A pandas bar plot is a categorical plot. It shows one bar for each index entry at integer positions on the scale. Hence the first bar is at position 0, the next at 1, etc. The labels correspond to the dataframe's index. If you have 100 bars, you'll end up with 100 labels. This makes sense because pandas cannot know whether those should be treated as categories or as ordinal/numeric data.
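A quick way to see this is to inspect the tick locations of a pandas bar plot. The small frame below is made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

idx = pd.date_range('2017-01-01', periods=10)
df = pd.DataFrame({'error': np.arange(10.0)}, index=idx)

ax = df['error'].plot(kind='bar')
positions = ax.get_xticks()
print(positions)                  # tick locations are bar positions, not dates
print(len(ax.get_xticklabels()))  # one label per bar
```

The dates only appear as label text; the axis itself knows nothing about them, so date locators and autofmt_xdate cannot thin them out.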
If instead you use a normal matplotlib bar plot, it will treat the dataframe index numerically. This means the bars have their position according to the actual dates and labels are placed according to the automatic ticker.
import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
# pd.datetime has been removed from recent pandas; a date string works directly
datelist = pd.date_range('2017-01-01', periods=42).tolist()
df = pd.DataFrame(np.cumsum(np.random.randn(42)),
columns=['error'], index=pd.to_datetime(datelist))
plt.bar(df.index, df["error"].values)
plt.gcf().autofmt_xdate()
plt.show()
An additional advantage is that matplotlib.dates locators and formatters can then be used, e.g. to label the first and fifteenth of each month with a custom format:
import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# pd.datetime has been removed from recent pandas; a date string works directly
datelist = pd.date_range('2017-01-01', periods=93).tolist()
df = pd.DataFrame(np.cumsum(np.random.randn(93)),
columns=['error'], index=pd.to_datetime(datelist))
plt.bar(df.index, df["error"].values)
plt.gca().xaxis.set_major_locator(mdates.DayLocator((1,15)))
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%d %b %Y"))
plt.gcf().autofmt_xdate()
plt.show()
In your situation, the easiest would be to manually create labels and spacing, and apply that using ax.xaxis.set_major_formatter.
Here's a possible solution:
Since no sample data was provided, I tried to mimic the structure of your dataset in a dataframe with some random numbers.
The setup:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
# A dataframe with random numbers to run tests on
np.random.seed(123456)
rows = 100
df = pd.DataFrame(np.random.randint(-10,10,size=(rows, 1)), columns=['error'])
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
test_df = df.copy(deep = True)
# Plot of data that mimics the structure of your dataset
# figsize goes into plot(); calling plt.figure() afterwards would open a new, empty figure
ax = test_df[(test_df.index.year ==2017) ]['error'].plot(kind="bar", figsize=(15,8))
ax.figure.autofmt_xdate()
A possible solution:
test_df = df.copy(deep = True)
# figsize goes into plot(); calling plt.figure() afterwards would open a new, empty figure
ax = test_df[(test_df.index.year ==2017) ]['error'].plot(kind="bar", figsize=(15,8))
# Make a list of empty myLabels
myLabels = ['']*len(test_df.index)
# Set labels on every 20th element in myLabels
myLabels[::20] = [item.strftime('%Y - %m') for item in test_df.index[::20]]
ax.xaxis.set_major_formatter(ticker.FixedFormatter(myLabels))
plt.gcf().autofmt_xdate()
# Tilt the labels
plt.setp(ax.get_xticklabels(), rotation=30, fontsize=10)
plt.show()
You can easily change the formatting of labels by checking strftime.org
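As a variation on the formatter approach, the ticks themselves can be thinned so that only every 20th bar gets a tick and a label. This is a sketch on the same kind of random test data; the step of 20 and the label format are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(123456)
rows = 100
test_df = pd.DataFrame({'error': np.random.randint(-10, 10, size=rows)},
                       index=pd.date_range('2017-01-01', periods=rows))

ax = test_df['error'].plot(kind='bar', figsize=(15, 8))

# Bars sit at positions 0..rows-1, so put a tick on every 20th position only
step = 20
tick_positions = np.arange(0, rows, step)
tick_labels = test_df.index[::step].strftime('%Y - %m').tolist()
ax.set_xticks(tick_positions)
ax.set_xticklabels(tick_labels, rotation=30, fontsize=10)
```

Unlike the empty-string labels, this also removes the tick marks between the labelled bars.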

Time-series boxplot in pandas

How can I create a boxplot for a pandas time-series where I have a box for each day?
Sample dataset of hourly data where one box should consist of 24 values:
import numpy as np
import pandas as pd
n = 480
ts = pd.Series(np.random.randn(n),
index=pd.date_range(start="2014-02-01",
periods=n,
freq="H"))
ts.plot()
I am aware that I could make an extra column for the day, but I would like to have proper x-axis labeling and x-limit functionality (like in ts.plot()), so being able to work with the datetime index would be great.
There is a similar question for R/ggplot2 here, if it helps to clarify what I want.
If it's an option for you, I would recommend using Seaborn, which is a wrapper for Matplotlib. You could do it yourself by looping over the groups from your time series, but that's much more work.
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
# Recent seaborn versions require x and y to be passed as keywords
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
Which gives:
Note that I'm passing the day of year as the grouper to seaborn; if your data spans multiple years this wouldn't work. You could then consider something like:
ts.index.to_series().apply(lambda x: x.strftime('%Y%m%d'))
Edit: for 3-hourly boxes you could use this as a grouper, but it only works if the timestamps have no minutes or finer components:
[(dt - datetime.timedelta(hours=int(dt.hour % 3))).strftime('%Y%m%d%H') for dt in ts.index]
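As an alternative to the string-based groupers, pandas can build the same keys directly from the DatetimeIndex. This sketch swaps in normalize() and floor(), which are not part of the answer above; the size() calls are only there to show how many values end up in each box.

```python
import numpy as np
import pandas as pd

n = 480
ts = pd.Series(np.random.randn(n),
               index=pd.date_range(start="2014-02-01", periods=n, freq="h"))

# One key per calendar day; unlike dayofyear this stays unique across years
daily_key = ts.index.normalize()
# One key per 3-hour block; minutes and seconds are floored away as well
three_hour_key = ts.index.floor("3h")

daily_sizes = ts.groupby(daily_key).size()
three_hour_sizes = ts.groupby(three_hour_key).size()
print(daily_sizes.head())
print(three_hour_sizes.head())
```

Either key can be passed as the x values of the boxplot in place of ts.index.dayofyear.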
(Not enough rep to comment on accepted solution, so adding an answer instead.)
The accepted code has two small errors: (1) the numpy import needs to be added and (2) the x and y parameters in the boxplot statement need to be swapped. The following produces the plot shown.
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
# Recent seaborn versions require x and y to be passed as keywords
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
I have a solution that may be helpful. It only uses native pandas and allows for hierarchical date-time grouping (i.e., spanning years). The key is that if you pass a function to groupby(), it will be called on each element of the dataframe's index. If your index is a DatetimeIndex (or similar), each element is a Timestamp, so you can use all of its datetime convenience methods to build the group key!
Try this:
import numpy as np
import pandas as pd
n = 480
ts = pd.DataFrame(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
ts.groupby(lambda x: x.strftime("%Y-%m-%d")).boxplot(subplots=False, figsize=(12,9), rot=90)
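To see what the function-based grouping produces before plotting, the groups can be inspected directly. This is a small sketch; the column name value and the second grouping by hour are added for illustration.

```python
import numpy as np
import pandas as pd

n = 480
ts = pd.DataFrame({"value": np.random.randn(n)},
                  index=pd.date_range(start="2014-02-01", periods=n, freq="h"))

# The function receives each index label (a Timestamp) and returns its group key
by_day = ts.groupby(lambda stamp: stamp.strftime("%Y-%m-%d"))
group_sizes = by_day.size()
print(group_sizes.head(3))

# Any Timestamp attribute can serve as a key, e.g. the hour of the day
by_hour = ts.groupby(lambda stamp: stamp.hour)
print(by_hour.ngroups)
```

Each key becomes one box when boxplot(subplots=False) is called on the grouped object.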

Create a weekly timetable using matplotlib

Edit: I changed Data Type to Pandas DataFrame that looks like this (datetime.datetime,int) in order to make the problem more simple.
Original Post:
I have a numpy array of data reports that looks like this (datetime.datetime,int,int) and I can't seem to plot it right. I need the X axis to cover 24 hours, using this array:
np.array([datetime.datetime.time(x) for x in DataArr])
The Y axis should be the days (Monday, Tuesday and so on) from the datetime, and the int should give me different colors for different events, but I can't find an example on matplotlib's web site.
An example of what I'm looking for:
It sounds like you want something like this?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# I'm using pandas here just to easily create a series of dates.
time = pd.date_range('01/01/2013', '05/20/2013', freq='2H')
z = np.random.random(time.size)
# There are other ways to do this, but we'll exploit how matplotlib internally
# handles dates. They're floats where a difference of 1.0 corresponds to 1 day.
# Therefore, modulo 1 results in the time of day. The +1000 yields a valid date.
t = mdates.date2num(time) % 1 + 1000
# Pandas makes getting the day of the week trivial...
day = time.dayofweek
fig, ax = plt.subplots()
scat = ax.scatter(t, day, c=z, s=100, edgecolor='none')
ax.xaxis_date()
ax.set(yticks=range(7),
yticklabels=['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
# Optional formatting tweaks
# '%I' is the portable 12-hour code; '%l' is a GNU extension and fails on some platforms
ax.xaxis.set_major_formatter(mdates.DateFormatter('%I%p'))
ax.margins(0.05)
fig.colorbar(scat)
plt.show()
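The modulo trick used above can be checked in isolation. The two timestamps in this sketch are arbitrary examples.

```python
import matplotlib.dates as mdates
import pandas as pd

stamps = pd.to_datetime(['2013-01-01 06:00', '2013-03-15 18:00'])

# Matplotlib represents dates as floats counted in days
day_numbers = mdates.date2num(stamps)
# The fractional part is the fraction of the day that has elapsed
time_of_day = day_numbers % 1
hours = time_of_day * 24
print(hours)

# Monday is 0, Sunday is 6
weekdays = list(stamps.dayofweek)
print(weekdays)
```

Because only the fractional part is kept, events from different dates land on the same 24-hour axis, and the day of the week separates them vertically.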
