How to change the color of line plotted on y axis midway in plotly? - python

I have a dataframe with 2 columns: 'Ground Truth' and 'Predicted Values', that I am plotting using plotly express.
timestamp             Ground Truth   Predicted Values
2012-04-01 00:30:00   251.71         NA
2012-04-01 00:15:00   652.782        NA
2012-04-01 00:00:00   458.099        NA
2012-03-31 23:45:00   3504.664       NA
2012-03-31 23:30:00   1215.76        1230
2012-03-31 23:15:00   -21.48         -19.99
2012-03-31 23:00:00   -8.538         -7.42
2012-03-31 22:40:00   -5.11          -5.2
Code for the plot:
import plotly.express as px

# markers expects a boolean in plotly express
fig = px.line(df, x=df.index, y=['Ground Truth', 'Predicted Values'], markers=True)
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="1H", step="hour", stepmode="backward"),
            dict(step="all")
        ])
    )
)
fig.show()
With the current code, one line is plotted for each column (as shown below). I would like the blue line to switch to a different color from the point where the red line starts. Can someone please help? (In the dataframe, the red line is the 'Predicted Values' column, which starts where the NA values end.)
Current graph (plotted on different values):

I managed to solve it based on the comment from @r-beginners above. I'll add the code here in case it helps anyone.
Use boolean indexing. I create a new column in the DataFrame called 'New Column' and fill it with NaN values. I then use boolean indexing to set 'New Column' to the corresponding 'Ground Truth' value wherever 'Predicted Values' is not NaN. The ~ operator inverts the boolean mask returned by isna(), so it selects the rows where 'Predicted Values' is not NaN.
import numpy as np

# Create a new column and populate it with 'Ground Truth' values where 'Predicted Values' are not NaN
df['New Column'] = np.nan
df.loc[~df['Predicted Values'].isna(), 'New Column'] = df.loc[~df['Predicted Values'].isna(), 'Ground Truth']
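The answer stops before the plotting step, so here is a minimal sketch of how the new column might be used, built on the sample rows from the question. The helper name `build_figure` and the choice to plot 'New Column' as a third trace are assumptions, not part of the original answer. `Series.where` is used as a shorter equivalent of the two-step boolean indexing.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; the 'NA' entries become NaN.
df = pd.DataFrame(
    {
        "Ground Truth": [251.71, 652.782, 458.099, 3504.664,
                         1215.76, -21.48, -8.538, -5.11],
        "Predicted Values": [np.nan, np.nan, np.nan, np.nan,
                             1230, -19.99, -7.42, -5.2],
    },
    index=pd.to_datetime([
        "2012-04-01 00:30:00", "2012-04-01 00:15:00",
        "2012-04-01 00:00:00", "2012-03-31 23:45:00",
        "2012-03-31 23:30:00", "2012-03-31 23:15:00",
        "2012-03-31 23:00:00", "2012-03-31 22:40:00",
    ]),
)
df.index.name = "timestamp"

# Keep 'Ground Truth' only where a prediction exists; all other rows become NaN.
has_prediction = df["Predicted Values"].notna()
df["New Column"] = df["Ground Truth"].where(has_prediction)


def build_figure(frame):
    """Hypothetical helper: plot all three columns as separate traces."""
    # Imported here so the data preparation above also runs without plotly.
    import plotly.express as px

    # 'New Column' is listed after 'Ground Truth', so it is drawn on top of it
    # in a different color over the range where predictions exist.
    return px.line(
        frame,
        x=frame.index,
        y=["Ground Truth", "New Column", "Predicted Values"],
        markers=True,
    )
```

Calling `build_figure(df).show()` should then display the ground truth in one color up to the start of the predictions and in another color afterwards.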

Related

Altair missing value in graph

I would like to visualize a dataframe using Altair.
It is a line chart and a bar chart in one graph, drawn for each group (ID) in my dataframe.
My dataframe has missing values. According to https://altair-viz.github.io/user_guide/transform/impute.html
missing entries are skipped and a line is drawn across the missing data point.
This is actually what I want, but it does not seem to work with my data.
I get a break in my line graph where the value is missing.
I prepared a simple example to explain my problem:
import altair as alt
import numpy as np
import pandas as pd

# create dataframe
df = pd.DataFrame({
    'date': ['2020-04-03', '2020-04-04', '2020-04-05', '2020-04-06',
             '2020-04-03', '2020-04-04', '2020-04-05', '2020-04-06'],
    'ID': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'bar': [np.nan, 8, np.nan, np.nan, np.nan, 8, np.nan, np.nan],
    'line': [8, np.nan, 10, 8, 4, 5, 6, 7]})
df:
         date ID  bar  line
0  2020-04-03  a  NaN   8.0
1  2020-04-04  a  8.0   NaN
2  2020-04-05  a  NaN  10.0
3  2020-04-06  a  NaN   8.0
4  2020-04-03  b  NaN   4.0
5  2020-04-04  b  8.0   5.0
6  2020-04-05  b  NaN   6.0
7  2020-04-06  b  NaN   7.0
# create graph
bars = alt.Chart(df).mark_bar(color="grey", size=5).encode(
    alt.X('monthdate(date):O'), y='bar:Q')
lines = alt.Chart(df).mark_line(point=True, size=2).encode(
    alt.X('monthdate(date):O'), y='line:Q')
alt.layer(bars + lines, width=350, height=150).facet(
    facet=alt.Facet('ID:N'),
).resolve_axis(y='independent', x='independent')
it gives this image
Has anyone an idea why the line has a break (a) and how to draw the line through the missing data point?
I know I could use "impute" to calculate the mean and replace the missing value.
But this implies a data point for the date which is actually not true.
Thanks for any hints, ideas or help!
This happens because the value is recorded as NaN in the dataframe: there is a valid date entry for this observation, but a NaN for the y-axis, which can't be plotted.
This is what you have currently:
df = pd.DataFrame({'date': ['2020-04-03', '2020-04-04', '2020-04-05', '2020-04-06','2020-04-03', '2020-04-04','2020-04-05','2020-04-06'],
'ID': ['a','a','a','a','b','b','b','b'],
'line': [8,np.nan,10,8, 4, 5,6,7] })
alt.Chart(df).mark_line(point=True,size=2,).encode(
alt.X('monthdate(date):O'), y='line:Q')
If you drop the NaNs, you will get the behavior that you want
alt.Chart(df.dropna()).mark_line(point=True,size=2).encode(
alt.X('monthdate(date):O'), y='line:Q')
For your example above, if you want the bar plot to retain all values and not drop the rows with NaN in the line column while still using both layer and facet, you need to reference the same dataframe in both charts and use Altair's transform_filter instead of pandas dropna:
(alt.Chart(df).mark_line(point=True,size=2)
.transform_filter('isValid(datum.line)')
.encode(alt.X('monthdate(date):O'), y='line:Q'))
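Putting the pieces together, a minimal sketch of the full layered and faceted chart might look like the following. The helper name `build_chart` and the shared `base` chart are illustrative choices, not from the answer. The `line_rows` frame is only a pandas-side mirror of what `isValid(datum.line)` keeps, included so the effect of the filter can be inspected without rendering anything.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-04-03", "2020-04-04", "2020-04-05", "2020-04-06",
             "2020-04-03", "2020-04-04", "2020-04-05", "2020-04-06"],
    "ID": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "bar": [np.nan, 8, np.nan, np.nan, np.nan, 8, np.nan, np.nan],
    "line": [8, np.nan, 10, 8, 4, 5, 6, 7],
})

# pandas equivalent of the filter: the rows the line layer will actually draw.
line_rows = df[df["line"].notna()]


def build_chart(data):
    """Hypothetical helper: bars keep every row, the line skips missing values."""
    # Imported here so the data inspection above also runs without Altair.
    import altair as alt

    # Both layers start from the same Chart(data), so they share one dataframe.
    base = alt.Chart(data).encode(alt.X("monthdate(date):O"))
    bars = base.mark_bar(color="grey", size=5).encode(y="bar:Q")
    lines = (
        base.mark_line(point=True, size=2)
        .transform_filter("isValid(datum.line)")  # affects only the line layer
        .encode(y="line:Q")
    )
    return (
        alt.layer(bars, lines)
        .properties(width=350, height=150)
        .facet(facet=alt.Facet("ID:N"))
        .resolve_axis(y="independent", x="independent")
    )
```

Because the filter is attached to the line layer only, the bar on 2020-04-04 for group a is still drawn, while the line for that group connects 2020-04-03 directly to 2020-04-05.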

How to make a categorical barplot with time series in Bokeh?

I'd like to make a categorical barplot with timeseries on the x-axis.
My dataframe looks like this:
     VRI  TIME      QTY
0    308  00:00:00  613.0
1    308  00:15:00  581.0
...
92   309  00:00:00  299.0
93   309  00:15:00  300.5
...
188  310  00:00:00  166.0
189  310  00:15:00  125.0
...
284  328  00:00:00  133.5
285  328  00:15:00  85.5
The VRI needs to be the categorical variable, so I'd like to create 4 bargraphs next to each other.
On the X-axis I would like to have the TIME column, which consists of all the hours of a day per 15 minutes.
This is what my code looks like right now:
from datetime import timedelta
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show

source = ColumnDataSource(vri_data)
p = figure(x_axis_type='datetime', title='Total traffic intensity per VRI', plot_width=1000)
p.vbar(x='time', top='aantal', width=timedelta(minutes=10), source=source, hover_line_color="black")
p.xaxis.axis_label = 'Time'
p.yaxis.axis_label = 'Traffic intensity'
# Bokeh tooltips reference data source columns with an @ prefix
hover = HoverTool(tooltips=[
    ('Time', '@time'),
    ('Traffic Intensity', '@aantal'),
    ('VRI Number', '@vri')
])
p.add_tools(hover)
show(p)
It outputs this:
In this plot all the 4 graphs are placed on top of each other, making some invisible. Now what I would like is to have 4 bargraphs next to each other instead of on top of each other, one for every distinct VRI value.
I have tried to use:
p = figure(x_range = vri_data['vri'], ...
But this outputs ValueError: Unrecognized range input:
Does anyone know a fix in order to get the plot as I want it?
Thanks!
There are two options:
1. Turn the X axis into a proper categorical one, making each of those 15-minute intervals a separate category. That would allow you to use nested categories as described in the Bokeh documentation.
2. Do it all manually. Either add a color column to the data source and specify the corresponding vbar parameter, or just create 4 vbars, one for each VRI value.
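The first option can be sketched as follows, using only the eight rows visible in the question as a stand-in for the full data. The helper name `build_plot` and the column names `x` and `qty` in the data source are assumptions made for illustration. Nested categories in Bokeh are tuples of strings, so the VRI number is converted with `str`.

```python
import pandas as pd

# Small stand-in for vri_data, taken from the rows shown in the question.
vri_data = pd.DataFrame({
    "VRI": [308, 308, 309, 309, 310, 310, 328, 328],
    "TIME": ["00:00:00", "00:15:00"] * 4,
    "QTY": [613.0, 581.0, 299.0, 300.5, 166.0, 125.0, 133.5, 85.5],
})

# One (outer, inner) factor per bar: the VRI groups the bars, the time orders them.
factors = [(str(vri), time) for vri, time in zip(vri_data["VRI"], vri_data["TIME"])]


def build_plot(factors, tops):
    """Hypothetical helper: one group of bars per VRI, side by side."""
    # Imported here so the factor construction above also runs without Bokeh.
    from bokeh.models import ColumnDataSource, FactorRange
    from bokeh.plotting import figure

    source = ColumnDataSource(data=dict(x=factors, qty=list(tops)))
    p = figure(x_range=FactorRange(*factors),
               title="Total traffic intensity per VRI")
    # On a categorical axis the bar width is a fraction of one category slot.
    p.vbar(x="x", top="qty", width=0.9, source=source)
    p.xaxis.major_label_orientation = 1  # rotate the inner labels (radians)
    p.xaxis.axis_label = "Time"
    p.yaxis.axis_label = "Traffic intensity"
    return p
```

Passing the result of `build_plot(factors, vri_data["QTY"])` to `bokeh.plotting.show` should render four groups of bars next to each other. With a full day of 15-minute intervals per VRI the inner labels get crowded, so hiding or thinning them may be necessary.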

pandas categories displayed incorrectly in matplotlib

I am trying to represent categories in matplotlib and for some reason I have categories overlapping on x-axis, as well as missing categories, but y-axis values present. I marked this with red arrows in the picture from the bottom of the question.
The data is contained in sales.csv file that looks like this:
date,first name,last name,city,cost,rooms,bathrooms,type,status
2018-03-04 12:13:21,Linda,Evangelista,Balm Beach,333000,2,2,townhouse,sold
2018-02-01 07:20:20,Rita,Ford,Balm Beach,818000,2,2,detached,sold
2018-03-08 07:13:00,Ali,Hassan,Bowmanville,413000,2,2,bungalow,forsale
2018-05-08 21:00:00,Rashid,Forani,Bowmanville,467000,2,2,townhouse,sold
2018-02-07 16:43:00,Kumar,Yoshi,Bowmanville,613000,3,3,bungalow,sold
2018-01-05 13:43:00,Srini,Santinaram,Bowmanville,723000,2,2,bungalow,forsale
2018-01-03 14:19:00,Maria,Dugall,Brampton,900000,4,3,semidetached,forsale
2018-05-04 19:22:00,Zina,Evangel,Burlington,221000,1,1,townhouse,forsale
2018-05-01 19:44:00,Pierre,Merci,Gatineau,3199000,14,14,bungalow,forsale
2018-05-31 18:10:00,Istvan,Kerekes,Kingston,1110000,4,5,bungalow,sold
2018-03-25 08:22:00,Dumitru,Plamada,Kingston,1650000,5,5,bungalow,forsale
2018-01-01 11:54:00,John,Smith,Markham,1200000,3,3,bungalow,sold
2018-05-07 15:30:00,Arturo,Gonzales,Mississauga,187000,3,3,bungalow,forsale
2018-03-07 22:20:00,Lei,Zhang,North York,122000,1,1,townhouse,forsale
2018-05-04 20:04:00,William,King,Oaks,,3,3,bungalow,sold
2018-03-04 13:05:00,Jeffrey,Kong,Oakville,,2,2,townhouse,forsale
2018-01-04 17:23:00,Abdul,Karrem,Orillia,883000,3,4,townhouse,sold
2018-03-01 13:09:00,Jean,Paumier,Ottawa,1520000,4,4,townhouse,sold
2018-02-01 10:00:00,Ken,Beaufort,Ottawa,3440000,5,5,bungalow,forsale
2018-02-15 11:33:00,Gheorghe,Ionescu,Richmond Hill,1630000,4,3,bungalow,forsale
2018-01-05 10:32:00,Ion,Popescu,Scarborough,1420000,5,3,semidetached,sold
2018-02-07 11:44:00,Xu,Yang,Toronto,422000,2,2,townhouse,forsale
2018-05-29 00:33:00,Giovanni,Gianparello,Toronto,1917000,4,4,bungalow,forsale
2018-03-25 08:27:00,John,Saint-Claire,Toronto,3337000,5,4,bungalow,forsale
2018-01-06 14:06:00,Ann,Murdoch Pyrell,Toronto,1427000,5,4,bungalow,forsale
2018-02-15 13:12:00,Claire,Coldwell,Toronto,3777000,5,4,bungalow,forsale
2018-01-02 09:37:00,Kyle,MCDonald,Toronto,,2,2,townhouse,forsale
2018-02-01 21:22:00,Miriam,Berg,Toronto,,4,4,townhouse,forsale
The code to load the data and display the graph is below:
import pandas as pd
import matplotlib.pyplot as plt
# Load data
sales_brute = pd.read_csv('sales.csv', parse_dates=True, index_col='date')
# Fix the columns names by stripping the extra spaces
sales_brute = sales_brute.rename(columns=lambda x: x.strip())
# Fix the N/A from cost column
sales_brute['cost'] = sales_brute['cost'].fillna(sales_brute['cost'].mean())
# Draws a scattered plot, price by cities. Change the colors of plot.
plt.scatter(sales_brute['city'], sales_brute['cost'], color='red')
# Rotates the ticks with 70 grd
plt.xticks(sales_brute['city'], rotation=70)
plt.tight_layout()
# Add grid
plt.grid()
plt.show()
and the result looks strange, like this:
Incorrect display of categories
Maybe we have different versions of matplotlib, but I can't use plt.scatter at all with sales_brute['city'] as first argument.
ValueError: could not convert string to float: 'Toronto'
Instead I made up a new x-axis:
x = range(len(sales_brute))
plt.scatter(x=x, y=sales_brute['cost'], color='red')
plt.xticks(x, sales_brute['city'], rotation=70)
plt.show()
Which results in:
(some stretching required to see the full names)
plt.scatter seems to be happy to take strings as the x-coordinate and arrange them in alphabetical order. plt.xticks, however, wants a list matching the number of ticks and in the same order.
If you change:
plt.xticks(sales_brute['city'], rotation=70)
to
plt.xticks(sales_brute['city'].sort_values().unique(), rotation=70),
you'll get the effect you want.
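Another way to avoid the mismatch, assuming a matplotlib version that accepts strings as x values, is to leave the tick positions alone and only rotate the labels. The sketch below uses five rows from the question's CSV. The helper name `plot_cost_by_city` is illustrative and not from either answer.

```python
import io

import pandas as pd

# A few rows from sales.csv in the question, including one with a missing cost.
csv_text = """date,first name,last name,city,cost,rooms,bathrooms,type,status
2018-03-04 12:13:21,Linda,Evangelista,Balm Beach,333000,2,2,townhouse,sold
2018-03-08 07:13:00,Ali,Hassan,Bowmanville,413000,2,2,bungalow,forsale
2018-05-04 20:04:00,William,King,Oaks,,3,3,bungalow,sold
2018-02-07 11:44:00,Xu,Yang,Toronto,422000,2,2,townhouse,forsale
2018-05-29 00:33:00,Giovanni,Gianparello,Toronto,1917000,4,4,bungalow,forsale
"""

sales = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="date")
# Replace missing costs with the column mean, assigning the result back.
sales["cost"] = sales["cost"].fillna(sales["cost"].mean())


def plot_cost_by_city(frame):
    """Hypothetical helper: scatter of cost per city with rotated labels."""
    # Imported here so the data loading above also runs without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # String x values create a categorical axis with one tick per distinct city.
    ax.scatter(frame["city"], frame["cost"], color="red")
    # Rotate the existing labels instead of passing new tick positions.
    ax.tick_params(axis="x", labelrotation=70)
    ax.grid(True)
    fig.tight_layout()
    return fig
```

Since no tick positions are passed, the labels stay attached to the categories matplotlib created, and cities with several sales (Toronto here) share a single tick.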

Pandas: Histogram Plotting

I have a dataframe with dates (datetime) in python. How can I plot a histogram with 30 min bins from the occurrences using this dataframe?
starttime
1 2016-09-11 00:24:24
2 2016-08-28 00:24:24
3 2016-07-31 05:48:31
4 2016-09-11 00:23:14
5 2016-08-21 00:55:23
6 2016-08-21 01:17:31
.............
989872 2016-10-29 17:31:33
989877 2016-10-02 10:00:35
989878 2016-10-29 16:42:41
989888 2016-10-09 07:43:27
989889 2016-10-09 07:42:59
989890 2016-11-05 14:30:59
I have tried looking at examples from Plotting series histogram in Pandas and A per-hour histogram of datetime using Pandas. But they seem to be using a bar plot which is not what I need. I have attempted to create the histogram using temp.groupby([temp["starttime"].dt.hour, temp["starttime"].dt.minute]).count().plot(kind="hist") giving me the results as shown below
If possible I would like the X axis to display the time(e.g 07:30:00)
I think you need a bar plot. For an axis showing times, the simplest approach is to convert the datetimes to strings with strftime:
temp = temp.resample('30min', on='starttime').count()
ax = temp.groupby(temp.index.strftime('%H:%M')).sum().plot(kind="bar")
# for a cleaner axis, hide every second tick label
spacing = 2
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
    if label not in visible:
        label.set_visible(False)
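As an alternative sketch of the same idea, the 30-minute bins can be computed directly from the column by flooring each timestamp and counting the resulting time-of-day labels. This swaps `resample` for `dt.floor` plus `value_counts`, which is my substitution rather than the answer's method. The helper name `plot_counts` is illustrative, and the six timestamps are the first rows shown in the question.

```python
import pandas as pd

temp = pd.DataFrame({"starttime": pd.to_datetime([
    "2016-09-11 00:24:24", "2016-08-28 00:24:24", "2016-07-31 05:48:31",
    "2016-09-11 00:23:14", "2016-08-21 00:55:23", "2016-08-21 01:17:31",
])})

# Round each timestamp down to its 30-minute slot and keep only the time of day.
slots = temp["starttime"].dt.floor("30min").dt.strftime("%H:%M")

# Count occurrences per slot, then add the empty slots so all 48 bins appear.
all_slots = pd.date_range("2000-01-01", periods=48, freq="30min").strftime("%H:%M")
counts = slots.value_counts().reindex(all_slots, fill_value=0)


def plot_counts(series):
    """Hypothetical helper: draw the counts as touching bars, like a histogram."""
    # pandas draws this with matplotlib, which must be installed to call it.
    ax = series.plot(kind="bar", width=1.0)
    ax.set_xlabel("time of day")
    ax.set_ylabel("occurrences")
    return ax
```

Setting `width=1.0` makes adjacent bars touch, which gives the bar plot the look of a histogram while keeping readable time labels on the x-axis.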

Plot datetime.date / time series in a pandas dataframe

I created a pandas dataframe from some value counts on particular calendar dates. Here is how I did it:
time_series = pd.DataFrame(df['Operation Date'].value_counts().reset_index())
time_series.columns = ['date', 'count']
Basically, it has two columns: the first, "date", holds datetime.date objects, and the second, "count", holds plain integer values. Now I'd like to plot a scatter or a KDE to show how the value changes over the calendar days.
But when I try:
time_series.plot(kind='kde')
plt.show()
I get a plot where the x-axis is from -50 to 150 as if it is parsing the datetime.date objects as integers somehow. Also, it is yielding two identical plots rather than just one.
Any idea how I can plot them and see the calendars day along the x-axis?
Are you sure you have datetime values? I just tried this and it worked fine:
df =
                   date  count
7   2012-06-11 16:51:32    1.0
3   2012-09-28 08:05:14   12.0
19  2012-10-01 18:01:47    4.0
2   2012-10-03 15:18:23   29.0
6   2012-12-22 19:50:43    4.0
1   2013-02-19 19:54:03   28.0
9   2013-02-28 16:08:40   17.0
12  2013-03-12 08:42:55    6.0
4   2013-04-04 05:27:27    6.0
17  2013-04-18 09:40:37   29.0
11  2013-05-17 16:34:51   22.0
5   2013-07-07 14:32:59   16.0
14  2013-10-22 06:56:29   13.0
13  2014-01-16 23:08:46   20.0
15  2014-02-25 00:49:26   10.0
18  2014-03-19 15:58:38   25.0
0   2014-03-31 05:53:28   16.0
16  2014-04-01 09:59:32   27.0
8   2014-04-27 12:07:41   17.0
10  2014-09-20 04:42:39   21.0
df = df.sort_values('date', ascending=True)
plt.plot(df['date'], df['count'])
plt.xticks(rotation='vertical')
EDIT:
if you want a scatter plot you can:
plt.plot(df['date'], df['count'], '*')
plt.xticks(rotation='vertical')
If the column is datetime dtype (not object), then you can call plot() directly on the dataframe. You don't need to sort by date either, it's done behind the scenes if x-axis is datetime.
df['date'] = pd.to_datetime(df['date'])
df.plot(x='date', y='count', kind='scatter', rot='vertical');
You can also pass many arguments to make the plot nicer (add titles, change figsize and fontsize, rotate ticklabels, set subplots axis etc.) See the docs for full list of possible arguments.
df.plot(x='date', y='count', kind='line', rot=45, legend=None,
title='Count across time', xlabel='', fontsize=10, figsize=(12,4));
You can even use another column to color scatter plots. In the example below, the months are used to assign color. Tip: To get the full list of possible colormaps, pass any gibberish string to colormap and the error message will show you the full list.
df.plot(x='date', y='count', kind='scatter', rot=90, c=df['date'].dt.month, colormap='tab20', sharex=False);
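To connect this back to the question, the sketch below rebuilds a `time_series` frame from datetime.date objects, which pandas stores with object dtype, and converts the column before plotting. The six dates are made-up sample data, and the helper name `plot_time_series` is illustrative.

```python
import datetime as dt

import pandas as pd

# Made-up sample: one row per operation, stored as datetime.date objects.
df = pd.DataFrame({"Operation Date": [
    dt.date(2012, 6, 11), dt.date(2012, 6, 11), dt.date(2012, 9, 28),
    dt.date(2012, 10, 1), dt.date(2012, 10, 1), dt.date(2012, 10, 1),
]})

# Same construction as in the question.
time_series = pd.DataFrame(df["Operation Date"].value_counts().reset_index())
time_series.columns = ["date", "count"]
dtype_before = str(time_series["date"].dtype)  # object, not datetime64

# Convert to a real datetime column and put the rows in calendar order.
time_series["date"] = pd.to_datetime(time_series["date"])
time_series = time_series.sort_values("date").reset_index(drop=True)


def plot_time_series(frame):
    """Hypothetical helper: scatter of counts over calendar days."""
    # pandas draws this with matplotlib, which must be installed to call it.
    return frame.plot(x="date", y="count", kind="scatter", rot=90)
```

Checking the dtype before plotting is a quick way to tell whether the x-axis will be treated as dates or as generic objects.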
