How to make a categorical barplot with time series in Bokeh? - python

I'd like to make a categorical barplot with timeseries on the x-axis.
My dataframe looks like this:
VRI TIME QTY
0 308 00:00:00 613.0
1 308 00:15:00 581.0
...
92 309 00:00:00 299.0
93 309 00:15:00 300.5
...
188 310 00:00:00 166.0
189 310 00:15:00 125.0
...
284 328 00:00:00 133.5
285 328 00:15:00 85.5
The VRI needs to be the categorical variable, so I'd like to create 4 bargraphs next to each other.
On the X-axis I would like to have the TIME column, which consists of all the hours of a day per 15 minutes.
This is what my code looks like right now:
source = ColumnDataSource(vri_data)
p = figure(x_axis_type='datetime', title='Total traffic intensity per VRI', plot_width=1000)
p.vbar(x='time',top='aantal', width=timedelta(minutes=10), source=source, hover_line_color="black")
p.xaxis.axis_label = 'Time'
p.yaxis.axis_label = 'Traffic intensity'
hover = HoverTool(tooltips=
[
('Time', '@time'),
('Traffic Intensity', '@aantal'),
('VRI Number', '@vri')
])
p.add_tools(hover)
show(p)
It outputs this:
In this plot all the 4 graphs are placed on top of each other, making some invisible. Now what I would like is to have 4 bargraphs next to each other instead of on top of each other, one for every distinct VRI value.
I have tried to use:
p = figure(x_range = vri_data['vri'], ...
But this outputs ValueError: Unrecognized range input:
Does anyone know a fix in order to get the plot as I want it?
Thanks!

There are two options:
Turn the X axis into a proper categorical one, making each of those 15-minute intervals a separate category. That would allow you to use nested categories, as described in the Bokeh documentation.
Do it all manually. Either add a color column to the data source and pass it to the corresponding vbar parameter, or just create 4 vbars, one for each VRI value.
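A minimal sketch of the first option (nested categories), assuming lowercase column names vri, time and aantal as used in the question's code. The sample values are a hypothetical two-slot excerpt of the question's data, and the figure size is left at its default because the name of the width parameter differs between Bokeh versions:

```python
import pandas as pd
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure

# Hypothetical excerpt shaped like the question's frame (two time slots only).
vri_data = pd.DataFrame({
    "vri": [308, 308, 309, 309, 310, 310, 328, 328],
    "time": ["00:00", "00:15"] * 4,
    "aantal": [613.0, 581.0, 299.0, 300.5, 166.0, 125.0, 133.5, 85.5],
})

# Sort so that bars are grouped by time slot first, then by VRI.
vri_data = vri_data.sort_values(["time", "vri"])

# Nested factors are tuples of strings: (outer category, inner category).
factors = [(t, str(v)) for t, v in zip(vri_data["time"], vri_data["vri"])]

source = ColumnDataSource(data=dict(x=factors, aantal=list(vri_data["aantal"])))

# A FactorRange built from the tuples gives a two-level categorical x-axis.
p = figure(x_range=FactorRange(*factors), title="Total traffic intensity per VRI")
p.vbar(x="x", top="aantal", width=0.9, source=source)
p.xaxis.axis_label = "Time"
p.yaxis.axis_label = "Traffic intensity"
```

Calling show(p) from bokeh.plotting would then render four adjacent bars per time slot. With all 96 quarter-hour slots this gives 384 bars, so the axis labels will probably need thinning or rotating.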


Using missingno but got incorrect result

I have a dataframe for air pollution with several missing gaps like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To specify, here's the data link: shorturl.at/blBN1
Take the first column, "ambient temperature" (AMB_TEMP), for instance.
Its missing-value statistics are given below:
Length of time series: 87648
Number of Missing Values: 746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NA series): 1(occurring 50 times)
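For reference, gap statistics like the ones listed above can be computed directly in pandas. This is only a sketch on a hypothetical ten-value series, not the actual AMB_TEMP column:

```python
import numpy as np
import pandas as pd

# Hypothetical series with three NA gaps of sizes 2, 1 and 3.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0, np.nan, np.nan, np.nan])

is_na = s.isna()

# A new run starts whenever the NA flag differs from the previous row.
run_id = (is_na != is_na.shift()).cumsum()

# Keep only the NA rows and count how many rows fall in each run.
gap_sizes = is_na[is_na].groupby(run_id[is_na]).size()

n_missing = int(is_na.sum())          # total number of missing values
n_gaps = len(gap_sizes)               # number of separate gaps
longest_gap = int(gap_sizes.max())    # longest run of consecutive NAs
average_gap = float(gap_sizes.mean()) # average gap size
```

Comparing such numbers with what the plot shows is one way to check whether the plot or the data is off.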
I want to plot an overview of the missing values, and this is what I've done:
import missingno as msno
missing_plot = msno.matrix(df , freq='Y')
and got a figure like this:
Obviously, the first column (AMB_TEMP) does not match the real data: it shows only three horizontal lines, but there should be at least 136.
Update: Thanks to Patrick, I also tried plotting only one column, but nothing improved.
Is there an error in the code, or is something else wrong?

Rolling Mean ValueError

I am trying to plot the rolling mean on a double-axis graph. However, I get the ValueError: view limit minimum -36867.6 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units error. My columns do have datetime objects in them so I am not sure why this is happening.
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
lns1 = ax1.plot(df5['TIME'],
df5["y"])
lns2 = ax2.plot(df3_plot.rolling(window=3).mean(),
color='black')
df5 looks like this:
TIME y
0 1990-01-01 3.380127
1 1990-02-01 3.313274
2 1990-03-01 4.036463
3 1990-04-01 3.813060
4 1990-05-01 3.847867
...
355 2019-08-01 8.590325
356 2019-09-01 7.642616
357 2019-10-01 8.362921
358 2019-11-01 7.696176
359 2019-12-01 8.206370
And df3_plot looks like this:
date y
0 1994-01-01 239.274414
1 1994-02-01 226.126581
2 1994-03-01 211.591748
3 1994-04-01 214.708679
4 1995-01-01 223.093071
...
99 2018-04-01 181.889699
100 2019-01-01 174.500096
101 2019-02-01 179.803310
102 2019-03-01 175.570419
103 2019-04-01 176.697451
Furthermore, the graph comes out fine if I don't try using a rolling mean for df3_plot. This means that the x-axis is a datetime for both. When I have
lns2 = ax2.plot(df3_plot['date'],
df3_plot['y'],
color='black')
I get this graph
Edit
Suppose that df5 has another column 'y2' that is correctly rolling meaned with 'y'. How can I graph and label it properly? I currently have
df6 = df5.rolling(window=12).mean()
lns1 = ax1.plot(
df6,
label = 'y', # how do I add 'y2' label correctly?
linewidth = 2.0)
df6 looks like this:
TIME y y2
0 1990-01-01 NaN NaN
1 1990-02-01 NaN NaN
2 1990-03-01 NaN NaN
3 1990-04-01 NaN NaN
4 1990-05-01 NaN NaN
... ... ... ...
355 2019-08-01 10.012447 8.331901
356 2019-09-01 9.909044 8.263813
357 2019-10-01 9.810155 8.185539
358 2019-11-01 9.711690 8.085016
359 2019-12-01 9.619968 8.035330
Making 'date' into the index of my dataframe did the trick: df3_plot.set_index('date', inplace=True).
However, I'm not sure why the error messages are different for @dm2 and me.
You already caught this, but the problem is that rolling by default works on the index. There is also an on parameter for setting a column to work on instead:
rolling = df3_plot.rolling(window=3, on='date').mean()
lns2 = ax2.plot(rolling['date'], rolling['y'], color='black')
Note that if you just do df3_plot.rolling(window=3).mean(), you get this:
y
0 NaN
1 NaN
2 0.376586
3 0.168073
4 0.258431
.. ...
299 0.285585
300 0.327987
301 0.518088
302 0.300169
303 0.299366
[304 rows x 1 columns]
Seems like matplotlib tries to plot y here since there is only one column. But the index is int, not dates, so I believe that leads to the error you saw when trying to plot over the other date axis.
When you use on to create rolling in my example, the result still has date and y columns, so you still need to reference the appropriate columns when plotting.
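The two fixes discussed above (the on parameter, or moving 'date' into the index) can be compared side by side. This is a sketch on a hypothetical five-row frame shaped like df3_plot, with rounded values; the Agg backend is only selected so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical excerpt shaped like df3_plot (values rounded for illustration).
df3_plot = pd.DataFrame({
    "date": pd.to_datetime(
        ["1994-01-01", "1994-02-01", "1994-03-01", "1994-04-01", "1995-01-01"]
    ),
    "y": [239.0, 226.0, 211.0, 215.0, 223.0],
})

# Option 1: roll over 'y' while keeping 'date' as a regular column.
rolled_on = df3_plot.rolling(window=3, on="date").mean()

# Option 2: make 'date' the index, so the result is indexed by date.
rolled_idx = df3_plot.set_index("date").rolling(window=3).mean()

fig, ax = plt.subplots()
# With option 1 the x and y columns must be named explicitly.
ax.plot(rolled_on["date"], rolled_on["y"], color="black", label="3-period mean")
ax.legend()
```

With option 2, ax.plot(rolled_idx["y"]) would pick up the dates from the index, which is why setting the index also made the original error go away.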

Create Multiple Subplots of sns.factorplot based on Dataframe Integer Column Values

I have a dataframe like the one below,
df_sales:
ProductCode Weekly_Units_Sold Is_Promo
Date
2015-01-11 1 49.0 No
2015-01-11 2 35.0 No
2015-01-11 3 33.0 No
2015-01-11 4 40.0 No
2015-01-11 5 53.0 No
... ... ... ...
2015-07-26 313 93.0 No
2015-07-26 314 4.0 No
2015-07-26 315 1.0 No
2015-07-26 316 5.0 No
2015-07-26 317 2.0 No
I want to observe the effect of promotions on each ProductCode with sns.factorplot, using code like this:
sns.factorplot(data= df_sales,
x= 'Is_Promo',
y= 'Weekly_Units_Sold',
hue= 'ProductCode');
It works, but the result is confusing and overlapping because all 317 products are plotted in one figure (https://i.stack.imgur.com/fgrjV.png).
When I subset the dataframe with this code:
df_sales=df_sales.query('1<=ProductCode<=10')
the plot is much more readable:
https://i.stack.imgur.com/NTQev.png
So I want to draw subplots by splitting the data into ranges of 10 product codes: the first subplot for ProductCode [1-10], the second for [11-20], and so on up to [291-300], [301-310] and [311-317].
My failed attempts:
g=sns.FacetGrid(df_sales,col='ProductCode')
g.map(sns.factorplot,'Is_Promo','Weekly_Units_Sold')
sns.factorplot(data= df_sales,
x= 'Is_Promo',
y= 'Weekly_Units_Sold',
hue= 'ProductCode');
Here I did not split the data into ranges of 10 product codes.
I just tried to create one subplot per ProductCode, but that
gave me an image size error.
So how can I create subplots of sns.factorplot split by ProductCode range to get more readable results?
Thanks
You need to create a new column with a unique value for each group of products. A simple way of doing that is using pd.cut()
import numpy as np
import pandas as pd
import seaborn as sns
Nproducts = 100
Ngroups = 10
df1 = pd.DataFrame({'ProductCode':np.arange(Nproducts),
'Weekly_Units_Sold': np.random.random(size=Nproducts),
'Is_Promo':'No'})
df2 = pd.DataFrame({'ProductCode':np.arange(Nproducts),
'Weekly_Units_Sold': np.random.random(size=Nproducts),
'Is_Promo':'Yes'})
df = pd.concat([df1,df2])
df['ProductGroup'] = pd.cut(df['ProductCode'], Ngroups, labels=False)
After that, you can facet based on ProductGroup, and plot whatever relationship you want for each group.
g = sns.FacetGrid(data=df, col='ProductGroup', col_wrap=3, hue='ProductCode')
g.map(sns.pointplot, 'Is_Promo', 'Weekly_Units_Sold', order=['No','Yes'])
Note that this uses seaborn v0.10.0. factorplot() was renamed to catplot() in v0.9, so you may have to adjust for version differences.
EDIT: to create a legend, I had to modify the code a bit, moving the hue parameter out of the FacetGrid call:
g = sns.FacetGrid(data=df, col='ProductGroup', col_wrap=3)
g.map_dataframe(sns.pointplot, 'Is_Promo', 'Weekly_Units_Sold', order=['Yes','No'], hue='ProductCode')
for ax in g.axes.ravel():
ax.legend(loc=1, bbox_to_anchor=(1.1,1))
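One caveat: pd.cut() with an integer number of bins produces equal-width bins over the observed range, which will not necessarily line up with the exact 1-10, 11-20, ..., 311-317 ranges the question asks for. A possible alternative, sketched here on a hypothetical handful of product codes, is plain integer division. The GroupLabel column is an illustrative addition, not part of the original answer:

```python
import pandas as pd

# Hypothetical subset of product codes at the edges of the requested ranges.
df_sales = pd.DataFrame({"ProductCode": [1, 10, 11, 20, 301, 310, 311, 317]})

# Codes 1-10 map to group 0, 11-20 to group 1, ..., 311-317 to group 31.
df_sales["ProductGroup"] = (df_sales["ProductCode"] - 1) // 10

# Readable labels such as "1-10"; the last range is clipped to the largest code.
start = df_sales["ProductGroup"] * 10 + 1
end = (start + 9).clip(upper=df_sales["ProductCode"].max())
df_sales["GroupLabel"] = start.astype(str) + "-" + end.astype(str)
```

Either column can then be passed as col= to sns.FacetGrid in the same way as ProductGroup above.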

Change tick frequency for datetime axis [duplicate]

This question already has an answer here:
Change tick frequency on X (time, not number) frequency in matplotlib
(1 answer)
Closed 3 years ago.
I have the following dataframe:
Date Prod_01 Prod_02
19 2018-03-01 49870 0.0
20 2018-04-01 47397 0.0
21 2018-05-01 53752 0.0
22 2018-06-01 47111 0.0
23 2018-07-01 53581 0.0
24 2018-08-01 55692 0.0
25 2018-09-01 51886 0.0
26 2018-10-01 56963 0.0
27 2018-11-01 56732 0.0
28 2018-12-01 59196 0.0
29 2019-01-01 57221 5.0
30 2019-02-01 55495 472.0
31 2019-03-01 65394 753.0
32 2019-04-01 59030 1174.0
33 2019-05-01 64466 2793.0
34 2019-06-01 58471 4413.0
35 2019-07-01 64785 6110.0
36 2019-08-01 63774 8360.0
37 2019-09-01 64324 9558.0
38 2019-10-01 65733 11050.0
And I need to plot a time series of the 'Prod_01' column.
The 'Date' column is in the pandas datetime format.
So I used the following command:
plt.figure(figsize=(10,4))
plt.plot('Date', 'Prod_01', data=test, linewidth=2, color='steelblue')
plt.xticks(rotation=45, horizontalalignment='right');
Output:
However, I want to change the frequency of the xticks to one month, so I get one tick and one label for each month.
I have tried the following command:
plt.figure(figsize=(10,4))
plt.plot('Date', 'Prod_01', data=test, linewidth=2, color='steelblue')
plt.xticks(np.arange(1, len(test), 1), test['Date'] ,rotation=45, horizontalalignment='right');
But I get this:
How can I solve this problem?
Thanks in advance.
I'm not very familiar with pandas data frames. However, I can't see why this wouldn't work with any pyplot plot.
According to the top SO answer on a related post by ImportanceOfBeingErnest:
The spacing between ticklabels is exclusively determined by the space between ticks on the axes.
So, to change the distance between ticks and their labels, you can do this:
Suppose a cluttered and base-10 centered person displays the following graph:
Producing it takes the following code, which imports matplotlib.ticker:
import numpy as np
import matplotlib.pyplot as plt
# Import this, too
import matplotlib.ticker as ticker
# Arbitrary graph with x-axis = [-32..32]
x = np.linspace(-32, 32, 1024)
y = np.sinc(x)
# -------------------- Look Here --------------------
# Access plot's axes
axs = plt.axes()
# Set distance between major ticks (which always have labels)
axs.xaxis.set_major_locator(ticker.MultipleLocator(5))
# Sets distance between minor ticks (which don't have labels)
axs.xaxis.set_minor_locator(ticker.MultipleLocator(1))
# -----------------------------------------------------
# Plot and show graph
plt.plot(x, y)
plt.show()
To change where the labels are placed, you can change the distance between the 'major ticks'. You can also change the smaller 'minor ticks' in between, which don't have a number attached. E.g., on a clock, the hour ticks have numbers on them and are larger (major ticks) with smaller, unlabeled ones between marking the minutes (minor ticks).
By changing the --- Look Here --- part to:
# -------------------- Look Here --------------------
# Access plot's axes
axs = plt.axes()
# Set distance between major ticks (which always have labels)
axs.xaxis.set_major_locator(ticker.MultipleLocator(8))
# Sets distance between minor ticks (which don't have labels)
axs.xaxis.set_minor_locator(ticker.MultipleLocator(4))
# -----------------------------------------------------
You can generate the cleaner and more elegant graph below:
Hope that helps!
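Since the question's x-axis holds dates rather than plain numbers, the same locator idea can be applied with matplotlib.dates instead of matplotlib.ticker. This is a sketch that rebuilds the question's Date and Prod_01 columns by hand; the "%Y-%m" label format is an arbitrary choice, and the Agg backend is only selected so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# The question's data: monthly values from 2018-03 to 2019-10.
test = pd.DataFrame({
    "Date": pd.date_range("2018-03-01", periods=20, freq="MS"),
    "Prod_01": [49870, 47397, 53752, 47111, 53581, 55692, 51886, 56963,
                56732, 59196, 57221, 55495, 65394, 59030, 64466, 58471,
                64785, 63774, 64324, 65733],
})

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(test["Date"], test["Prod_01"], linewidth=2, color="steelblue")

# One major tick per month, labelled as year-month.
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))

# Rotate and right-align the date labels so they do not overlap.
fig.autofmt_xdate(rotation=45, ha="right")
```

Changing interval to 2 or 3 would thin the ticks to every second or third month.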

pandas display categories incorrect displayed in matplotlib

I am trying to represent categories in matplotlib and for some reason I have categories overlapping on x-axis, as well as missing categories, but y-axis values present. I marked this with red arrows in the picture from the bottom of the question.
The data is contained in sales.csv file that looks like this:
date,first name,last name,city,cost,rooms,bathrooms,type,status
2018-03-04 12:13:21,Linda,Evangelista,Balm Beach,333000,2,2,townhouse,sold
2018-02-01 07:20:20,Rita,Ford,Balm Beach,818000,2,2,detached,sold
2018-03-08 07:13:00,Ali,Hassan,Bowmanville,413000,2,2,bungalow,forsale
2018-05-08 21:00:00,Rashid,Forani,Bowmanville,467000,2,2,townhouse,sold
2018-02-07 16:43:00,Kumar,Yoshi,Bowmanville,613000,3,3,bungalow,sold
2018-01-05 13:43:00,Srini,Santinaram,Bowmanville,723000,2,2,bungalow,forsale
2018-01-03 14:19:00,Maria,Dugall,Brampton,900000,4,3,semidetached,forsale
2018-05-04 19:22:00,Zina,Evangel,Burlington,221000,1,1,townhouse,forsale
2018-05-01 19:44:00,Pierre,Merci,Gatineau,3199000,14,14,bungalow,forsale
2018-05-31 18:10:00,Istvan,Kerekes,Kingston,1110000,4,5,bungalow,sold
2018-03-25 08:22:00,Dumitru,Plamada,Kingston,1650000,5,5,bungalow,forsale
2018-01-01 11:54:00,John,Smith,Markham,1200000,3,3,bungalow,sold
2018-05-07 15:30:00,Arturo,Gonzales,Mississauga,187000,3,3,bungalow,forsale
2018-03-07 22:20:00,Lei,Zhang,North York,122000,1,1,townhouse,forsale
2018-05-04 20:04:00,William,King,Oaks,,3,3,bungalow,sold
2018-03-04 13:05:00,Jeffrey,Kong,Oakville,,2,2,townhouse,forsale
2018-01-04 17:23:00,Abdul,Karrem,Orillia,883000,3,4,townhouse,sold
2018-03-01 13:09:00,Jean,Paumier,Ottawa,1520000,4,4,townhouse,sold
2018-02-01 10:00:00,Ken,Beaufort,Ottawa,3440000,5,5,bungalow,forsale
2018-02-15 11:33:00,Gheorghe,Ionescu,Richmond Hill,1630000,4,3,bungalow,forsale
2018-01-05 10:32:00,Ion,Popescu,Scarborough,1420000,5,3,semidetached,sold
2018-02-07 11:44:00,Xu,Yang,Toronto,422000,2,2,townhouse,forsale
2018-05-29 00:33:00,Giovanni,Gianparello,Toronto,1917000,4,4,bungalow,forsale
2018-03-25 08:27:00,John,Saint-Claire,Toronto,3337000,5,4,bungalow,forsale
2018-01-06 14:06:00,Ann,Murdoch Pyrell,Toronto,1427000,5,4,bungalow,forsale
2018-02-15 13:12:00,Claire,Coldwell,Toronto,3777000,5,4,bungalow,forsale
2018-01-02 09:37:00,Kyle,MCDonald,Toronto,,2,2,townhouse,forsale
2018-02-01 21:22:00,Miriam,Berg,Toronto,,4,4,townhouse,forsale
The code to load the data and display the graph is below:
import pandas as pd
import matplotlib.pyplot as plt
# Load data
sales_brute = pd.read_csv('sales.csv', parse_dates=True, index_col='date')
# Fix the columns names by stripping the extra spaces
sales_brute = sales_brute.rename(columns=lambda x: x.strip())
# Fix the N/A from cost column
sales_brute['cost'].fillna(sales_brute['cost'].mean(), inplace=True)
# Draw a scatter plot of cost by city, with red markers.
plt.scatter(sales_brute['city'], sales_brute['cost'], color='red')
# Rotate the tick labels by 70 degrees
plt.xticks(sales_brute['city'], rotation=70)
plt.tight_layout()
# Add grid
plt.grid()
plt.show()
and the result looks strange, like this:
Incorrect display of categories
Maybe we have different versions of matplotlib, but I can't use plt.scatter at all with sales_brute['city'] as first argument.
ValueError: could not convert string to float: 'Toronto'
Instead I made up a new x-axis:
x = range(len(sales_brute))
plt.scatter(x=x, y=sales_brute['cost'], color='red')
plt.xticks(x, sales_brute['city'], rotation=70)
plt.show()
Which results in:
(some stretching required to see the full names)
plt.scatter seems to be happy to take strings as the x-coordinate and arrange them in alphabetical order. plt.xticks, however, wants a list matching the number of ticks and in the same order.
If you change:
plt.xticks(sales_brute['city'], rotation=70)
to
plt.xticks(sales_brute['city'].sort_values().unique(), rotation=70),
you'll get the effect you want.
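A version-independent variant of the same idea is to map each city to an integer position yourself and label each position exactly once. This is a sketch on a hypothetical five-row excerpt of sales.csv; the Agg backend is only selected so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical excerpt of the question's data.
sales = pd.DataFrame({
    "city": ["Balm Beach", "Balm Beach", "Bowmanville", "Toronto", "Toronto"],
    "cost": [333000, 818000, 413000, 422000, 1917000],
})

# codes: one integer per row; cities: the sorted unique city names.
codes, cities = pd.factorize(sales["city"], sort=True)

fig, ax = plt.subplots()
ax.scatter(codes, sales["cost"], color="red")

# Exactly one tick and one label per distinct city.
ax.set_xticks(range(len(cities)))
ax.set_xticklabels(cities, rotation=70)
ax.grid()
fig.tight_layout()
```

Because the ticks are placed on the integer codes, rows from the same city share one x position and no label is repeated or dropped.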
