I would like to visualize a dataframe using Altair.
It is a line chart and a bar chart in one graph, drawn for each group (ID) in my dataframe.
My dataframe has missing values. According to https://altair-viz.github.io/user_guide/transform/impute.html
missing entries are skipped and a line is drawn across the missing data point.
This is actually what I want, but with my data this does not seem to work.
I get a break in my line graph where the value is missing.
I prepared a simple example to explain my problem:
import altair as alt
import numpy as np
import pandas as pd
#create dataframe
df = pd.DataFrame({'date': ['2020-04-03', '2020-04-04', '2020-04-05', '2020-04-06','2020-04-03', '2020-04-04','2020-04-05','2020-04-06'],
'ID': ['a','a','a','a','b','b','b','b'],'bar': [np.nan,8,np.nan,np.nan, np.nan, 8,np.nan,np.nan],
'line': [8,np.nan,10,8, 4, 5,6,7] })
df:
date ID bar line
0 2020-04-03 a NaN 8.0
1 2020-04-04 a 8.0 NaN
2 2020-04-05 a NaN 10.0
3 2020-04-06 a NaN 8.0
4 2020-04-03 b NaN 4.0
5 2020-04-04 b 8.0 5.0
6 2020-04-05 b NaN 6.0
7 2020-04-06 b NaN 7.0
# create graph
bars = alt.Chart(df).mark_bar(color="grey", size=5).encode(
alt.X('monthdate(date):O'), y='bar:Q')
lines = alt.Chart(df).mark_line(point=True,size=2,).encode(
alt.X('monthdate(date):O'), y='line:Q')
alt.layer(bars + lines,width=350,height=150).facet(facet=alt.Facet('ID:N'),
).resolve_axis(y='independent',x='independent')
it gives this image
Does anyone have an idea why the line has a break (in group a) and how to draw the line through the missing data point?
I know I could use "impute" to calculate the mean and replace the missing value.
But this implies a data point for the date which is actually not true.
Thanks for any hints, ideas or help!
It is because the value is recorded as NaN in the dataframe, so there is a valid date entry for this observation but a NaN for the y-axis, which can't be plotted.
This is what you have currently:
df = pd.DataFrame({'date': ['2020-04-03', '2020-04-04', '2020-04-05', '2020-04-06','2020-04-03', '2020-04-04','2020-04-05','2020-04-06'],
'ID': ['a','a','a','a','b','b','b','b'],
'line': [8,np.nan,10,8, 4, 5,6,7] })
alt.Chart(df).mark_line(point=True,size=2,).encode(
alt.X('monthdate(date):O'), y='line:Q')
If you drop the NaNs, you will get the behavior that you want:
alt.Chart(df.dropna()).mark_line(point=True,size=2).encode(
alt.X('monthdate(date):O'), y='line:Q')
For your example above, if you want the bar plot to retain all values and not drop the rows with NaN in the line column while still using both layer and facet, you need to reference the same dataframe in both charts and use Altair's transform_filter instead of pandas dropna:
(alt.Chart(df).mark_line(point=True,size=2)
.transform_filter('isValid(datum.line)')
.encode(alt.X('monthdate(date):O'), y='line:Q'))
I am trying to plot categories in matplotlib, and for some reason some categories overlap on the x-axis while others are missing even though their y-axis values are present. I marked this with red arrows in the picture at the bottom of the question.
The data is contained in sales.csv file that looks like this:
date,first name,last name,city,cost,rooms,bathrooms,type,status
2018-03-04 12:13:21,Linda,Evangelista,Balm Beach,333000,2,2,townhouse,sold
2018-02-01 07:20:20,Rita,Ford,Balm Beach,818000,2,2,detached,sold
2018-03-08 07:13:00,Ali,Hassan,Bowmanville,413000,2,2,bungalow,forsale
2018-05-08 21:00:00,Rashid,Forani,Bowmanville,467000,2,2,townhouse,sold
2018-02-07 16:43:00,Kumar,Yoshi,Bowmanville,613000,3,3,bungalow,sold
2018-01-05 13:43:00,Srini,Santinaram,Bowmanville,723000,2,2,bungalow,forsale
2018-01-03 14:19:00,Maria,Dugall,Brampton,900000,4,3,semidetached,forsale
2018-05-04 19:22:00,Zina,Evangel,Burlington,221000,1,1,townhouse,forsale
2018-05-01 19:44:00,Pierre,Merci,Gatineau,3199000,14,14,bungalow,forsale
2018-05-31 18:10:00,Istvan,Kerekes,Kingston,1110000,4,5,bungalow,sold
2018-03-25 08:22:00,Dumitru,Plamada,Kingston,1650000,5,5,bungalow,forsale
2018-01-01 11:54:00,John,Smith,Markham,1200000,3,3,bungalow,sold
2018-05-07 15:30:00,Arturo,Gonzales,Mississauga,187000,3,3,bungalow,forsale
2018-03-07 22:20:00,Lei,Zhang,North York,122000,1,1,townhouse,forsale
2018-05-04 20:04:00,William,King,Oaks,,3,3,bungalow,sold
2018-03-04 13:05:00,Jeffrey,Kong,Oakville,,2,2,townhouse,forsale
2018-01-04 17:23:00,Abdul,Karrem,Orillia,883000,3,4,townhouse,sold
2018-03-01 13:09:00,Jean,Paumier,Ottawa,1520000,4,4,townhouse,sold
2018-02-01 10:00:00,Ken,Beaufort,Ottawa,3440000,5,5,bungalow,forsale
2018-02-15 11:33:00,Gheorghe,Ionescu,Richmond Hill,1630000,4,3,bungalow,forsale
2018-01-05 10:32:00,Ion,Popescu,Scarborough,1420000,5,3,semidetached,sold
2018-02-07 11:44:00,Xu,Yang,Toronto,422000,2,2,townhouse,forsale
2018-05-29 00:33:00,Giovanni,Gianparello,Toronto,1917000,4,4,bungalow,forsale
2018-03-25 08:27:00,John,Saint-Claire,Toronto,3337000,5,4,bungalow,forsale
2018-01-06 14:06:00,Ann,Murdoch Pyrell,Toronto,1427000,5,4,bungalow,forsale
2018-02-15 13:12:00,Claire,Coldwell,Toronto,3777000,5,4,bungalow,forsale
2018-01-02 09:37:00,Kyle,MCDonald,Toronto,,2,2,townhouse,forsale
2018-02-01 21:22:00,Miriam,Berg,Toronto,,4,4,townhouse,forsale
The code to load the data and display the graph is below:
import pandas as pd
import matplotlib.pyplot as plt
# Load data
sales_brute = pd.read_csv('sales.csv', parse_dates=True, index_col='date')
# Fix the columns names by stripping the extra spaces
sales_brute = sales_brute.rename(columns=lambda x: x.strip())
# Fill the missing values in the cost column with the column mean
sales_brute['cost'] = sales_brute['cost'].fillna(sales_brute['cost'].mean())
# Draw a scatter plot of cost by city, with red markers
plt.scatter(sales_brute['city'], sales_brute['cost'], color='red')
# Rotate the tick labels by 70 degrees
plt.xticks(sales_brute['city'], rotation=70)
plt.tight_layout()
# Add grid
plt.grid()
plt.show()
and the result looks strange:
[Image: incorrect display of categories]
Maybe we have different versions of matplotlib, but I can't use plt.scatter at all with sales_brute['city'] as first argument.
ValueError: could not convert string to float: 'Toronto'
Instead I made up a new x-axis:
x = range(len(sales_brute))
plt.scatter(x=x, y=sales_brute['cost'], color='red')
plt.xticks(x, sales_brute['city'], rotation=70)
plt.show()
Which results in:
(some stretching required to see the full names)
plt.scatter seems to be happy to take strings as the x-coordinate and arrange them in alphabetical order. plt.xticks, however, wants a list matching the number of ticks and in the same order.
If you change:
plt.xticks(sales_brute['city'], rotation=70)
to
plt.xticks(sales_brute['city'].sort_values().unique(), rotation=70),
you'll get the effect you want.
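If the behavior of string x-values differs between matplotlib versions, one way to sidestep it is to make the category positions explicit. The sketch below combines the two suggestions above: it maps each city to an integer position with `pd.factorize` (my choice, not something the answers use) and then labels one tick per city. The rows are a handful taken from sales.csv above; the `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# A few (city, cost) pairs copied from sales.csv
sales = pd.DataFrame({
    'city': ['Toronto', 'Balm Beach', 'Toronto', 'Bowmanville', 'Balm Beach'],
    'cost': [422000, 333000, 1917000, 413000, 818000],
})

# codes[i] is the position of row i's city within the sorted unique cities
codes, cities = pd.factorize(sales['city'], sort=True)

fig, ax = plt.subplots()
ax.scatter(codes, sales['cost'], color='red')

# One tick per unique city, in the same order the codes refer to
ax.set_xticks(range(len(cities)))
ax.set_xticklabels(list(cities), rotation=70)
ax.grid()
fig.tight_layout()
```

Rows that share a city end up at the same x position, and every category gets exactly one label. Call `plt.show()` at the end in an interactive session.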
I created a pandas dataframe from some value counts on particular calendar dates. Here is how I did it:
time_series = pd.DataFrame(df['Operation Date'].value_counts().reset_index())
time_series.columns = ['date', 'count']
Basically, it has two columns: the first, "date", holds datetime.date objects, and the second, "count", holds plain integer values. Now I'd like to plot a scatter or a KDE to show how the value changes over the calendar days.
But when I try:
time_series.plot(kind='kde')
plt.show()
I get a plot where the x-axis is from -50 to 150 as if it is parsing the datetime.date objects as integers somehow. Also, it is yielding two identical plots rather than just one.
Any idea how I can plot them and see the calendar days along the x-axis?
Are you sure the column is a datetime? I just tried this and it worked fine:
df:
date count
7 2012-06-11 16:51:32 1.0
3 2012-09-28 08:05:14 12.0
19 2012-10-01 18:01:47 4.0
2 2012-10-03 15:18:23 29.0
6 2012-12-22 19:50:43 4.0
1 2013-02-19 19:54:03 28.0
9 2013-02-28 16:08:40 17.0
12 2013-03-12 08:42:55 6.0
4 2013-04-04 05:27:27 6.0
17 2013-04-18 09:40:37 29.0
11 2013-05-17 16:34:51 22.0
5 2013-07-07 14:32:59 16.0
14 2013-10-22 06:56:29 13.0
13 2014-01-16 23:08:46 20.0
15 2014-02-25 00:49:26 10.0
18 2014-03-19 15:58:38 25.0
0 2014-03-31 05:53:28 16.0
16 2014-04-01 09:59:32 27.0
8 2014-04-27 12:07:41 17.0
10 2014-09-20 04:42:39 21.0
df = df.sort_values('date', ascending=True)
plt.plot(df['date'], df['count'])
plt.xticks(rotation='vertical')
EDIT:
If you want a scatter plot, you can use:
plt.plot(df['date'], df['count'], '*')
plt.xticks(rotation='vertical')
If the column is datetime dtype (not object), then you can call plot() directly on the dataframe. You don't need to sort by date either; it's done behind the scenes if the x-axis is datetime.
df['date'] = pd.to_datetime(df['date'])
df.plot(x='date', y='count', kind='scatter', rot='vertical');
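To check whether the conversion is needed in the first place, it can help to look at the column's dtype before and after. A minimal sketch, assuming the dates start out as `datetime.date` objects as described in the question (the `ops` dataframe and its three dates are a made-up stand-in for the real data):

```python
import datetime as dt
import pandas as pd

# Hypothetical stand-in for df['Operation Date'] from the question
ops = pd.DataFrame({
    'Operation Date': [dt.date(2020, 1, 2), dt.date(2020, 1, 1), dt.date(2020, 1, 2)],
})

time_series = ops['Operation Date'].value_counts().reset_index()
time_series.columns = ['date', 'count']

# datetime.date objects are stored as generic Python objects
dtype_before = time_series['date'].dtype

# After conversion the column has a proper datetime64 dtype
time_series['date'] = pd.to_datetime(time_series['date'])
dtype_after = time_series['date'].dtype

print(dtype_before, '->', dtype_after)
```

If the first dtype printed is `object`, the plotting code is not seeing real datetimes yet.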
You can also pass many arguments to make the plot nicer (add titles, change figsize and fontsize, rotate tick labels, set subplot axes, etc.). See the docs for the full list of possible arguments.
df.plot(x='date', y='count', kind='line', rot=45, legend=None,
title='Count across time', xlabel='', fontsize=10, figsize=(12,4));
You can even use another column to color scatter plots. In the example below, the months are used to assign color. Tip: To get the full list of possible colormaps, pass any gibberish string to colormap and the error message will show you the full list.
df.plot(x='date', y='count', kind='scatter', rot=90, c=df['date'].dt.month, colormap='tab20', sharex=False);