I have a very large dataset that I cannot plot directly using HoloViews. I want to make a scatterplot with categorical data. Unfortunately my data is very sparse and many points have NA as their category. I would like to make these points gray. Is there any way to tell Datashader to do this?
Here is how I do it now (more or less as proposed in https://holoviews.org/user_guide/Large_Data.html ), with an example:
import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
import datashader as ds
from datashader.colors import Sets1to3
from holoviews.operation.datashader import datashade,spread
# np.nan is used instead of np.NaN: the capitalized alias was removed in NumPy 2.0
raw_data = [('Alice', 60, 'London', 5),
            ('Bob', 14, 'Delhi', 7),
            ('Charlie', 66, np.nan, 11),
            ('Dave', np.nan, 'Delhi', 15),
            ('Eveline', 33, 'Delhi', 4),
            ('Fred', 32, 'New York', np.nan),
            ('George', 95, 'Paris', 11)
           ]
# Create a DataFrame object
df = pd.DataFrame(raw_data, columns=['Name', 'Age', 'City', 'Experience'])
df['City']=pd.Categorical(df['City'])
x='Age'
y='Experience'
color='City'
cats=df[color].cat.categories
# Just to make clear how many points of a given category we have
for cat in cats:
    print(cat, ((df[color]==cat) & (df[x].notnull()) & (df[y].notnull())).sum())
# Make dummy-points (currently the only way to make a legend: https://holoviews.org/user_guide/Large_Data.html)
color_key=[(name,color) for name, color in zip(cats,Sets1to3)]
color_points = hv.NdOverlay({n: hv.Points([0,0], label=str(n)).opts(color=c,size=0) for n,c in color_key})
# Create the plot with datashader
points=hv.Points(df, [x, y],label="%s vs %s" % (x, y),)
datashaded=datashade(points,aggregator=ds.by(color)).opts(width=800, height=480)
(spread(datashaded,px=4, shape='square')*color_points).opts(legend_position='right')
It produces the following picture:
You can see some issues:
Most importantly although there is just one person from Paris you see that the NA-person (Charlie) is also printed in purple, the color for Paris. Is there a way to make the dot gray? I have tried many plots and it seems like the NAs always take the color of the last item in the legend.
Then there are some minor issues that I did not want to open separate questions for. (If you think they deserve their own questions please tell me; I am new to Stack Overflow and appreciate your advice.)
One other problem:
The dots are not all of the same size. This is quite ugly. Is there a way to change that?
And then there is also a question that I have: Does the datashader internally also use the .cat.categories-method to decide what color to use? How are the colors, that datashader uses, determined? Because I wonder whether the legend is always in correct order (showing the correct colors: If you permute the order in cats then color_key and cats are not in the same order anymore and the legend shows wrong colors). It seems to always work the way I do but I feel a bit insecure.
And maybe someone wants to give their opinion whether Points is okay to use for scatterplots in this case. Because I do not see any difference to Scatter and also semantically there is not really one variable that causes the other (although one might argue that age causes experience in this case, but I am going to plot variables where it is not easy at all to find those kinds of causalities) so it is best to use Points if I understood the documentation https://holoviews.org/reference/elements/bokeh/Points.html correctly.
Most importantly although there is just one person from Paris you see that the NA-person (Charlie) is also printed in purple, the color for Paris. Is there a way to make the dot gray? I have tried many plots and it seems like the NAs always take the color of the last item in the legend.
Right now I believe Datashader replaces NaNs with zeros (see https://github.com/holoviz/datashader/blob/master/datashader/transfer_functions/__init__.py#L351). Seems like a good feature request to be able to supply Datashader with a color to use for NaNs instead, but in the meantime, I'd recommend replacing the NaNs with an actual category name like "Other" or "Missing" or "Unknown", and then both the coloring and the legend should reflect that name.
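The workaround above (giving the NaNs a real category name) might look roughly like the following. This is only a sketch on a small made-up Series rather than the full DataFrame from the question, and 'Missing' is just one possible label.

```python
import numpy as np
import pandas as pd

# Small stand-in for df['City'] from the question (assumed data)
cities = pd.Series(pd.Categorical(['London', 'Delhi', np.nan, 'Delhi', 'Paris']))

# A categorical column only accepts known categories, so the new label
# has to be registered first; fillna can then assign it to the NaN entries.
filled = cities.cat.add_categories('Missing').fillna('Missing')

print(filled.cat.categories.tolist())
print(filled.tolist())
```

With the question's data, the same two calls would be applied to df['City'] before building cats and color_key, so that 'Missing' gets its own legend entry.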
One other problem: The dots are not all of the same size. This is quite ugly. Is there a way to change that?
Usually Datashader in a Bokeh HoloViews plot will render once initially before it is put into a Bokeh layout, and will be triggered to update once the layout is finished with a final version. Here, the initial rendering is being auto-ranged to precisely the range of the data points, then clipped by the boundaries of the plot (making squares near the edges become rectangles), and then the range of the plot is updated once the legend is added. To see how that works, remove *color_points and you'll see the same shape of dots, but now cropped by the plot edges:
You can manually trigger an update to the plot by zooming or panning slightly once it's displayed, but to force it to update without needing manual intervention, you can supply an explicit plot range:
points=hv.Points(df, [x, y],label="%s vs %s" % (x, y),).redim.range(Age=(0,90), Experience=(0,14))
It would be great if you could file a bug report on HoloViews asking why it is not refreshing automatically in this case when the legend is included. Hopefully a simple fix!
Does the datashader internally also use the .cat.categories-method to decide what color to use? How are the colors, that datashader uses, determined? Because I wonder whether the legend is always in correct order (showing the correct colors: If you permute the order in cats then color_key and cats are not in the same order anymore and the legend shows wrong colors). It seems to always work the way I do but I feel a bit insecure.
Definitely! Whenever you show a legend, you should be sure to pass in the color key as a dictionary so that you can be sure that the legend and the plot coloring use the same colors for each category. Just pass in color_key={k:v for k,v in color_key} when you call datashade.
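As a rough sketch of that idea, the color key can be built once as a dictionary and then reused for both the plot and the legend. The category names, the 'Missing' label, and the three hex colors below are assumptions for illustration (the hex values are the first three ColorBrewer Set1 colors, standing in for Sets1to3).

```python
# Assumed category names; 'Missing' is the hypothetical label given to NaNs
categories = ['Delhi', 'London', 'Paris', 'Missing']

# Stand-in palette (first three ColorBrewer Set1 colors)
palette = ['#e41a1c', '#377eb8', '#4daf4a']

# Map every real category to a palette color, keyed by name rather than position
named = [c for c in categories if c != 'Missing']
color_key = {name: color for name, color in zip(named, palette)}

# Give the placeholder category an explicit gray
color_key['Missing'] = '#808080'

for name, color in color_key.items():
    print(name, color)
```

The same dictionary would then be passed as color_key=color_key to datashade and iterated over to build the dummy legend points, so the two cannot drift apart if the category order changes.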
And maybe someone wants to give their opinion whether Points is okay to use for scatterplots in this case. Because I do not see any difference to Scatter and also semantically there is not really one variable that causes the other (although one might argue that age causes experience in this case, but I am going to plot variables where it is not easy at all to find those kinds of causalities) so it is best to use Points if I understood the documentation https://holoviews.org/reference/elements/bokeh/Points.html correctly.
Points is meant for a 2D location in a plane where both axes are interchangeable, such as the physical x,y location on a geometric plane. Points has two independent dimensions, and expects you to have any dependent dimensions be used for the color, marker shape, etc. It's true that if you don't know which variable might be dependent on the other, a Scatter is tricky to use, but simply choosing to put something on the x axis will set people up to think that that variable is independent, so there's not much that can be done about it. Definitely not appropriate to use Points in this case.
In Python's seaborn, what is the difference between countplot and catplot?
E.g.:
sns.catplot(x='class', y='survived', hue='sex', kind='bar', data=titanic);
sns.countplot(y='deck', hue='class', data=titanic);
seaborn.countplot
Shows the counts of observations in each categorical bin using bars.
seaborn.catplot
Provides access to several axes-level functions that show the relationship between a numerical and one or more categorical variables using one of several visual representations.
There is a lot of overhead in catplot, or for that matter in FacetGrid, that will ensure that the categories are synchronized along the grid. Consider e.g. that you have a variable you plot along the columns of the grid for which not every age group occurs. You would still need to show that non-occurring age group and hold on to its color. Hence, two countplots next to each other do not necessarily make up one catplot.
However, if you are only interested in a single countplot, a catplot is clearly overkill. On the other hand, even a single countplot is overkill compared to a barplot of the counts.
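To illustrate what "a barplot of the counts" could mean, here is a minimal sketch that computes the counts directly with pandas. The tiny DataFrame is made up and only borrows the column names 'deck' and 'class' from the example above; it is not the real titanic dataset.

```python
import pandas as pd

# Made-up sample using the column names from the question
df = pd.DataFrame({
    'deck':  ['A', 'B', 'B', 'C', 'C', 'C'],
    'class': ['First', 'First', 'Second', 'First', 'Third', 'Third'],
})

# Observations per category: the numbers a plain countplot would draw
counts = df['deck'].value_counts().sort_index()

# Observations per category and hue level, i.e. the grouped variant
by_class = pd.crosstab(df['deck'], df['class'])

print(counts)
print(by_class)
```

Either result could then be drawn with counts.plot.bar() or by_class.plot.bar() if matplotlib is installed.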
I want to make an area plot in seaborn or matplotlib using an index of categorical variables. I have tried a few things but I can't seem to get it. Here's an image of my dataframe. Thanks for any help.
Here are some examples of what I've tried.
plt.plot(areaData.index.values,areaData['Badassery'], data=areaData)
plt.plot(areaData['Badassery'])
I'm not really sure what else I should be doing. Usually I get errors like "Series objects are mutable, and thus can't be hashed" or a blank chart.
It appears that your categorical variable is set as the index; you want to reset it so that you can use it as a column.
# reset the index (reset_index returns a new DataFrame, so assign the result)
areaData = areaData.reset_index()
areaData.plot.area(y='Badassery')
Refer to the documentation on area plots with pandas.
This is not a duplicate, because existing answers to similar questions don't describe exactly what I need.
Matplotlib has great formatters inside and I love to use them:
ax.xaxis.set_major_locator(matplotlib.dates.MonthLocator())
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b%y'))
They let me plot such stock market charts:
This is what I need, but it has one issue: weekends. They are present on the x axis and make my chart a little ugly.
Other questions about this issue advise creating a custom formatter, and they show examples of such formatters. But none of them format as nicely as matplotlib does:
May19, Jun19, Jul19...
I mean this line of code:
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b%y'))
My question is: how can I format the x axis like matplotlib does (May19, Jun19, Jul19...) without showing the weekends, when the stock market is closed?
What you could almost always do is something similar to what Nic Wanavit suggested.
Manually set your labels, depending on what you need on your axis.
In this case especially, the plot looks a bit ugly because there are timespans with no actual data (the weekends), so pyplot simply connects the surrounding points across the corresponding length of the x axis.
What you can do then is plot your data at equal distances, which is correct if the data is daily. Otherwise consider interpolating it, e.g. with pandas' built-in interpolation.
To keep pyplot from automatically detecting the index I had to do this:
df['plotidx'] = range(len(df['close']))
Here all the closing values for the stock are stored in a column named 'close'.
Then plot against this column.
Then you can obtain all the ticks created via
labels = [item.get_text() for item in ax.get_xticklabels()]
Adjust them as desired with
labels[i] = string_for_the_label_no_i
Then get them back on the graph using
ax.xaxis.set_ticklabels(labels)
You then need to somewhat "update" the plot. Also keep in mind that, as the documentation also says, resizing a lot could leave the labels in strange locations.
It is a kind of workaround, but it worked fine for me because it feels natural to plot the data at equal distances next to each other rather than making up some data for the weekends.
Greets
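One possible way to combine the equally spaced 'plotidx' idea above with month labels like May19 is sketched below. Note that it swaps in a FuncFormatter instead of editing the tick label strings by hand, and that the price data, the date range, and the helper name format_date are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import pandas as pd

# Made-up daily closes on business days only (no weekend rows)
dates = pd.bdate_range("2019-05-01", "2019-07-31")
df = pd.DataFrame({"close": range(len(dates))}, index=dates)

# Equally spaced integer positions, as in the answer above
df["plotidx"] = range(len(df))

def format_date(x, pos=None):
    """Translate an integer position back into a 'May19'-style label."""
    i = int(round(x))
    if i < 0 or i >= len(df):
        return ""
    return df.index[i].strftime("%b%y")

fig, ax = plt.subplots()
ax.plot(df["plotidx"], df["close"])

# Put one tick on the first trading day of each month
month_starts = df.groupby([df.index.year, df.index.month])["plotidx"].min().tolist()
ax.set_xticks(month_starts)
ax.xaxis.set_major_formatter(FuncFormatter(format_date))
fig.savefig("closes.png")
```

Because the labels are computed from the tick positions rather than stored as fixed strings, they should stay consistent when the plot is zoomed or resized.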
To set the x ticks, assuming that you have the dates in the DataFrame column df['dates']:
ax.xaxis.set_ticks(df['dates'])
It's a simple thing but I've searched for quite a while without success: I want to customise a figure legend by reversing the horizontal order of the symbols and labels.
In Gnuplot, this is simply achieved by set key reverse. Example: change x data1 to data1 x. In matplotlib, there seems to be no user-friendly solution. Thus, I thought about changing a kind of handle anchor or just shifting the handle's position, but couldn't find any point to start with.
The requested feature is already there, as the keyword markerfirst of the legend command.
plt.plot([1,2],[3,4], label='labeltext')
plt.legend(markerfirst=False)
plt.show()
If you want to make it your default behaviour, you can change the value of legend.markerfirst in rcParams, see customizing matplotlib.