Python [Bokeh charts] Heatmap ranges & coloring - python

I would like to represent my data (which consists of 256 distinct values) using a Bokeh heat map where each value has its own color (so every item with the same value should have the same color).
I've been experimenting, and Bokeh groups the values into ranges for me (for example, everything between 24 and 47 gets the same color), but I want a separate color for each value.
What is the best way to approach this problem?
I've been experimenting with palettes, and some perform much better than others. For example, Inferno256 does a good job, but is that the correct way to solve this? Is there a way to tell the chart/heat map to display every value with its own color (by specifying ranges?), or should I, for example, define a palette of 256 colors? Any thoughts?
Example where Bokeh creates big ranges for me:
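# imports assumed from the question's context (legacy bokeh.charts API)
from bokeh.charts import HeatMap, output_file, show
import bokeh.palettes as bp

# column_of_values holds the full list of values and is defined elsewhere
Data = column_of_values[:1000]
data = {'fruit': [1] * len(Data),  # Sections
        'fruit_count': Data,
        'sample': list(range(1, len(Data) + 1))}
hm = HeatMap(data, x='sample', y='fruit', values='fruit_count',
             palette=bp.Plasma11, title='Fruits', stat=None)
hm.width = 5000
output_file('heatmap.html')
show(hm)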
The second part of my question (if possible): does Bokeh handle big data well?
For example, plotting 1000 values looks different from plotting 10,000 with the same code; the values seem to be squashed together. Should I fix that by expanding the width, or by something else? :-)
Heat map plotting 1000 values then 10,000 values
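One way to get a distinct color per value, sketched under the assumption that the values are integers from 0 to 255 (the question only says there are 256 values), is to build a palette with exactly 256 entries and look each value up directly, so that no binning into ranges can happen. The helper names below are hypothetical, and the code uses only the standard library.

```python
import colorsys

def make_palette(n=256):
    """Return n distinct hex colors spread around the hue wheel."""
    palette = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 1.0, 1.0)
        palette.append('#%02x%02x%02x' % (round(r * 255), round(g * 255), round(b * 255)))
    return palette

def color_for(value, palette):
    """Look up the color for an integer value (assumed 0 <= value < len(palette))."""
    return palette[value]

palette = make_palette(256)
values = [0, 24, 47, 24, 255]  # made-up sample values
colors = [color_for(v, palette) for v in values]
```

A ready-made 256-entry palette such as Inferno256 serves the same purpose: with as many palette entries as distinct values, each value can land in its own bin, whereas Plasma11 has only 11 colors and therefore forces ranges.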

Related

Random colors in pyplot.scatter based on value

I have a large pandas dataframe that I clustered and the cluster id is stored in a column of a dataframe.
I would like to display clusters in such a way that each cluster has a different color.
I tried doing this with a colormap, but the problem is that I have too many points and clusters, so nearby clusters get assigned only slightly different colors. When I plot all of them I just get a big mashup that looks like this:
Note that this image contains about 4000 clusters, but because the colors of clusters are just assigned top to bottom, nearby clusters blend together.
I would like nearby clusters to be painted in different colors, so I tried making a random color for each cluster and then assigning each point a color based on its cluster label, like this:
import random
import matplotlib.pyplot as plt

# creating a color for each distinct cluster label
colors = [(random.random(), random.random(), random.random())
          for _ in range(len(set(data['labels'])))]
# assigning color to a point based on its cluster label
for index, row in data.iterrows():
    plt.scatter(row['x'], row['y'], color=colors[int(row['labels'])])
Now this program works, but it is much slower than the vectorized version above.
Is there a way to color each cluster in clearly different colors without writing a for loop?
This creates a random colormap of 256 colors that you can then pass to scatter:
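import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def segmentation_cmap():
    vals = np.linspace(0, 1, 256)
    np.random.shuffle(vals)  # random order, so nearby labels get unrelated colors
    return ListedColormap(plt.cm.CMRmap(vals))

plt.scatter(data['x'], data['y'], c=data['labels'], s=1, cmap=segmentation_cmap())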
You could add more colors, but at some point you would have trouble seeing the differences anyway!
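If exact per-cluster colors matter more than a fixed 256-color map, another option is to keep the random colors from the question but build the per-point color list up front, so that it can be passed to a single scatter call. This is a sketch with made-up labels; `colors_for_labels` is a hypothetical helper, not part of Matplotlib.

```python
import random

def colors_for_labels(labels, seed=0):
    """Assign one random RGB tuple per distinct label; return one color per point."""
    rng = random.Random(seed)  # seeded so the colors are reproducible
    palette = {label: (rng.random(), rng.random(), rng.random())
               for label in sorted(set(labels))}
    return [palette[label] for label in labels]

labels = [0, 1, 0, 2, 1]  # made-up cluster labels
point_colors = colors_for_labels(labels)
```

The resulting list can then be passed once, for example plt.scatter(data['x'], data['y'], c=point_colors, s=1), instead of calling scatter for every row.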

how to create interactive graph on a large data set?

I am trying to create an interactive graph using HoloViews on a large data set. Below is a sample of the data file, called trackData.csv:
Event      Time             ID   Venue
Javeline   11:25:21:012345  JVL  Dome
Shot pot   11:25:22:778929  SPT  Dome
4x4        11:25:21:993831  FOR  Track
4x4        11:25:22:874293  FOR  Track
Shot pot   11:25:21:087822  SPT  Dome
Javeline   11:25:23:878792  JVL  Dome
Long Jump  11:25:21:892902  LJP  Aquatic
Long Jump  11:25:22:799422  LJP  Aquatic
This is how I read the data and plot a scatter plot.
import pandas as pd
import holoviews as hv

trackData = pd.read_csv('trackData.csv')
scatter = hv.Scatter(trackData, 'Time', 'ID')
scatter
Because this data set is quite large, zooming in and out of the scatter plot is very slow, and I would like to speed this up.
I researched this and found that HoloViews' decimate is recommended for large datasets, but I don't know how to use it in the above code.
Most things I tried seem to throw an error. Also, is there a way to make sure the Time column is converted to microseconds? Thanks in advance for the help.
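On the Time column: assuming every value has the form HH:MM:SS:ffffff with a six-digit final field, as in the sample (an assumption, since the full file is not shown), a timestamp can be converted to microseconds since midnight with plain string handling. `time_to_micros` is a hypothetical helper name.

```python
def time_to_micros(stamp):
    """Convert 'HH:MM:SS:ffffff' to integer microseconds since midnight."""
    hours, minutes, seconds, micros = (int(part) for part in stamp.split(':'))
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000 + micros

print(time_to_micros('11:25:21:012345'))  # 41121012345
```

With pandas, this could be applied to the whole column with trackData['Time'].map(time_to_micros).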
Datashader indeed does not handle categorical axes as used here, but that's not so much a limitation of the software as of my imagination -- what should it be doing with them? A Datashader scatterplot (Canvas.points) is meant for a very large number of points located on a continuously indexed 2D plane. Such a plot approximates a 2D probability distribution function, accumulating points per pixel to show the density in that region, and revealing spatial patterns across pixels.
A categorical axis doesn't have the same properties that a continuous numerical axis does, because there's no spatial relationship between adjacent values. Specifically in this case, there's no apparent meaning to an ordering of the ID field (it appears to be a letter code for a sporting event type), so I can't see any meaning to accumulating across ID values per pixel the way Datashader is designed to do. Even if you convert IDs to numbers, you'll either just get random-looking noise (if there are more ID values than vertical pixels), or a series of spotty lines (if there are fewer ID values than pixels).
Here, maybe there are only a few dozen or so unique ID values, but many, many time measurements? In that case most people would use a box, violin, histogram, or ridge plot per ID, to see the distribution of values for each ID value. A Datashader points plot is a 2D histogram, but if one axis is categorical you're really dealing with a set of 1D histograms, not a single combined 2D histogram, so just use histograms if that's what you're after.
If you really do want to try plotting all the points per ID as raw points, you could do that using vertical spike events as in https://examples.pyviz.org/iex_trading/IEX_stocks.html . You can also add some vertical jitter and then use Datashader, but that's not something directly supported right now, and it doesn't have the clear mathematical interpretation that a normal Datashader plot does (in terms of approximating a density function).
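The "set of 1D histograms" idea can be sketched as follows: bin the time values separately for each ID instead of accumulating across IDs. The records below are the sample rows from the question with times reduced to seconds past 11:25:00; the function name and bin settings are illustrative assumptions.

```python
from collections import defaultdict

def per_category_histograms(records, start, bin_width, n_bins):
    """Count values per bin separately for each category."""
    hists = defaultdict(lambda: [0] * n_bins)
    for category, value in records:
        idx = int((value - start) // bin_width)
        if 0 <= idx < n_bins:  # ignore values outside the binned range
            hists[category][idx] += 1
    return dict(hists)

records = [
    ('JVL', 21.012345), ('SPT', 22.778929), ('FOR', 21.993831), ('FOR', 22.874293),
    ('SPT', 21.087822), ('JVL', 23.878792), ('LJP', 21.892902), ('LJP', 22.799422),
]
hists = per_category_histograms(records, start=21.0, bin_width=1.0, n_bins=3)
for category in sorted(hists):
    print(category, hists[category])
```

Each ID gets its own row of counts, which could then be drawn as one histogram (or box or violin plot) per ID.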
The disadvantage of decimate() is that it downsamples your datapoints.
I think you need datashader() here, but datashader doesn't like that ID is a categorical variable instead of a numerical value.
So a solution could be to convert your categorical variable to a numerical code.
See the code example below for both hvPlot (which I prefer) and HoloViews:
import io
import pandas as pd
import hvplot.pandas
import holoviews as hv
# dynspread is for making point sizes larger when using datashade
from holoviews.operation.datashader import datashade, dynspread
# sample data
text = """
Event      Time             ID   Venue
Javeline   11:25:21:012345  JVL  Dome
Shot pot   11:25:22:778929  SPT  Dome
4x4        11:25:21:993831  FOR  Track
4x4        11:25:22:874293  FOR  Track
Shot pot   11:25:21:087822  SPT  Dome
Javeline   11:25:23:878792  JVL  Dome
Long Jump  11:25:21:892902  LJP  Aquatic
Long Jump  11:25:22:799422  LJP  Aquatic
"""
# create dataframe and parse time (columns are separated by two or more spaces)
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S:%f')
df = df.set_index('Time').sort_index()
# get a column that converts categorical id's to numerical id's
df['ID'] = pd.Categorical(df['ID'])
df['ID_code'] = df['ID'].cat.codes
# use this to overwrite numerical yticks with categorical yticks
yticks=[(0, 'FOR'), (1, 'JVL'), (2, 'LJP'), (3, 'SPT')]
# this is the hvplot solution: set datashade=True
df.hvplot.scatter(
    x='Time',
    y='ID_code',
    datashade=True,
    dynspread=True,
    padding=0.05,
).opts(yticks=yticks)
# this is the holoviews solution
scatter = hv.Scatter(df, kdims=['Time'], vdims=['ID_code'])
dynspread(datashade(scatter)).opts(yticks=yticks, padding=0.05)
More info on datashader and decimate:
http://holoviews.org/user_guide/Large_Data.html
Resulting plot:

Best way to plot square grid of coordinates, each assigned a colour

I am looking for the best way to plot data in the following format. I have 960 x 960 points evenly spaced over the range -2.2 to 2.2 on the x-axis and -2 to 2 on the y-axis. Based on some mathematics, each point is then assigned one of three colours.
The data is formatted as a list of 921600 elements (960*960 = 921600), each of which is a 3-element list [x_value, y_value, colour], where colour is either 'r', 'b' or 'g'. The method I have used is just a scatter plot in matplotlib, which gives an image similar to the following:
The code used for this is roughly:
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 16))
plt.ylabel("Velocity (x_dot OR y)")
plt.xlabel("Position (x)")
plt.scatter(x_data, y_data, color=colour, marker='s', s=0.5)
Although plotting speed doesn't matter, I imagine that there is a neater way of doing this. I used this figure as an example because its file size is within the limit for Stack Overflow images. However, when the image has a lot of mixed colours, the smoothed edges of the 's' square-like marker start to show and the image file size is very large.
I really only need one square/pixel for each coordinate, and the labels are less important. Can anyone suggest a better way to plot this data? Thanks.
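Since the points lie on a regular grid, one option is to turn the flat list into a 2D grid of RGB values and draw it as an image, which gives exactly one cell per coordinate. The sketch below assumes the [x, y, colour] layout described above and evenly spaced coordinates; `to_rgb_grid` is a hypothetical helper and the tiny 2 x 2 input is made up.

```python
RGB = {'r': (1.0, 0.0, 0.0), 'g': (0.0, 1.0, 0.0), 'b': (0.0, 0.0, 1.0)}

def to_rgb_grid(points, nx, ny, x_range, y_range):
    """Place [x, y, colour] points into an ny-by-nx grid of RGB tuples."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[(1.0, 1.0, 1.0)] * nx for _ in range(ny)]  # white where no point lands
    for x, y, colour in points:
        col = round((x - x_min) / (x_max - x_min) * (nx - 1))
        row = round((y - y_min) / (y_max - y_min) * (ny - 1))
        grid[row][col] = RGB[colour]
    return grid

points = [[-2.2, -2.0, 'r'], [2.2, -2.0, 'g'], [-2.2, 2.0, 'b'], [2.2, 2.0, 'r']]
grid = to_rgb_grid(points, nx=2, ny=2, x_range=(-2.2, 2.2), y_range=(-2.0, 2.0))
```

With Matplotlib, such a grid could be drawn with plt.imshow(grid, origin='lower', extent=[-2.2, 2.2, -2, 2], interpolation='nearest'); for the data in the question, nx and ny would both be 960.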

Any way to swap my seaborn heatmap's colormap values with its y-axis values?

I'm new to heatmaps and seaborn. I've been trying this for days but haven't been able to find a solution or any related threads. I think I'm setting up the problem incorrectly, but I would like to know whether what I'm trying to do is possible with seaborn heatmaps, or whether a heatmap isn't the right graphic representation for what I want to show.
I have a .csv file of scores. It looks something like:
Genus,Pariaconus,Trioza,Non-members
-40,-80,-90,-300
-40.15,-80,-100,-320
,-40.17,-86,-101,-470
,-86.2,-130,-488
,,-132,-489
,,,-500
...
As I try to show above, the columns are different lengths. Let's say length of (the number of values in) Genus is 10, Pariaconus is 15, Trioza is 20, and Non-members is 18,000.
In addition, the columns and rows are not related to each other. Each score is individual and just falls under the column group. What I want to show with the heatmap is the range of scores that occur in each column.
I would ideally like to represent the data using a heatmap, where:
X-axis is "Genus", "Pariaconus", "Trioza", "Non-members".
Y-axis is the range of scores that occur in the dataset. In the example above, Y-axis values would go from -40 to -500.
Colorbar is the normalized population of the columns that get that score in the Y-axis. For example, if 100% of the Genus column scores around -40, that area in Y-axis would be colored red (for 1.0). The remainder of the y-axis for Genus would be colored blue (for 0.0), because no scores for Genus are in the range -50 to -500. For the purposes of my project, I'd like to show that the majority of scores of "Genus" fall in a certain range, "Pariaconus" in another range, "Non-members" in another range, and so on.
The reason I want to represent this with a heatmap and not, say, a line graph is because line graphs would suggest that there is a trend between rows in the same column. In the example above (Genus column), a line/scatter graph would make it seem that there's a relationship between the score -40, -41, and -45 as you move down the X-axis. In contrast, I just want to show the range of scores in each column.
With the data in the .csv format above, right now I have the following heatmap: https://imgur.com/a/VwgQwfQ
I get this with the line of code:
sns.heatmap(df, cmap="coolwarm")
In this heatmap, the values of the Y-axis are automatically set as the row indices from the .csv file, and the colormap values are the scores (values of the rows).
If I could just figure out how to swap the colormap and the Y-axis, then I hope that I could then move on to figuring out how to normalize the populations of each column instead of having it as the raw indices: 0 to 18000. But I've been trying to do this for days and haven't come close to what I want.
Ideally, I would want something like this: https://imgur.com/a/3A0eaOD. Of course, in the heatmap, there would be gradients instead of solid colors.
If anyone can answer, I had these questions:
Is what I'm trying to do achievable/is this something I can do with a heatmap? Or should I be using a different representation?
Is this possibly a problem with how my input data is represented? If so, what is the correct representation when building a heatmap like this?
Any other guidance would be super appreciated.
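The layout described in the question (score ranges on the Y-axis, groups on the X-axis, colour showing the fraction of each group's scores in that range) amounts to one normalised histogram per column over shared bins. A minimal sketch with a few scores taken from the sample above and a hypothetical helper name; it assumes all scores lie between `low` and `high`:

```python
def normalised_histograms(columns, low, high, n_bins):
    """Return {name: fraction of scores per bin}; each column's fractions sum to 1."""
    width = (high - low) / n_bins
    result = {}
    for name, scores in columns.items():
        counts = [0] * n_bins
        for score in scores:
            idx = min(int((score - low) / width), n_bins - 1)  # top edge goes in last bin
            counts[idx] += 1
        result[name] = [c / len(scores) for c in counts]
    return result

columns = {
    'Genus': [-40, -40.15, -40.17],
    'Non-members': [-300, -320, -470, -488, -489, -500],
}
fractions = normalised_histograms(columns, low=-500, high=0, n_bins=5)
```

Because every column is reduced to the same number of bins, the unequal column lengths no longer matter. Turning `fractions` into a DataFrame with one row per bin would give a table that sns.heatmap can draw directly.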

How to set pivots on colorbar at customisable location in Matplotlib?

I have some trouble using the Matplotlib colorbar; perhaps I am not understanding the documentation (I am not a native English speaker) or its core concept.
Suppose I have a matrix of data (shape N*2). I want to make a scatter plot of this data and colour the points based on a column of float labels (shape N*1). I know how to use colorbar and ScalarMappable.
However, I am interested in some pivot values in this label column, and I wish to present these values at particular positions on the colorbar. For example, I want to position the label value 0 at the 1/3 point or in the middle, which in the colormap I choose could have a white or grey colour.
But if I understand it correctly, colorbar only takes a data array that has been mapped to [0, 1] from the original data in [min, max]. In this case, the pivot value that I am interested in would end up somewhere arbitrary, unless I define my normalisation function very carefully.
So, to put the white colour I prefer for my pivot value in the middle of the colour bar, I have to define a normalisation function that not only normalises my data but also places the pivot value at position 0.5.
With my limited Matplotlib experience, this is the only solution I know.
Ideally, given a column of float data, I could pick some pivot values and assign them special positions, then have the data normalised and passed to the colormap. For the colorbar, I could set special colours at those previously defined positions and get a corresponding colorbar with the right tick locator and tick labels indicating my pivot values.
I am looking for an easier way (from the standard library) to achieve this.
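The normalisation described above, where the pivot is forced to 0.5 and each side is scaled linearly, can be written in a few lines. This is a plain-Python sketch of the idea with a hypothetical function name and made-up limits; recent Matplotlib versions also ship a norm with this behaviour, matplotlib.colors.TwoSlopeNorm(vcenter, vmin, vmax), which can be passed as norm= to scatter.

```python
def pivot_normalise(value, vmin, pivot, vmax):
    """Map vmin -> 0.0, pivot -> 0.5, vmax -> 1.0, linearly on each side."""
    if value <= pivot:
        return 0.5 * (value - vmin) / (pivot - vmin)
    return 0.5 + 0.5 * (value - pivot) / (vmax - pivot)

# made-up label range: -1 .. 4 with the pivot at 0
for v in (-1.0, 0.0, 2.0, 4.0):
    print(v, pivot_normalise(v, vmin=-1.0, pivot=0.0, vmax=4.0))
```

With a diverging colormap whose midpoint is white, this places the pivot value on the white colour regardless of how asymmetric the data range is.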
It would be very helpful if you could post the plot that you wish to make. But based on my understanding, you just want to do something to the colorbar at one or more particular spots. That is easy; the following case shows an example of writing a text string at 0.5.
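import numpy as np
import matplotlib.pyplot as plt

x1 = np.random.random(1000)
x2 = np.random.random(1000)
x3 = np.random.random(1000)
plt.scatter(x1, x2, c=x3, marker='+')
cb = plt.colorbar()
# map a data value to its fractional height (0 to 1) on the colorbar
color_norm = lambda x: (x - cb.vmin) / (cb.vmax - cb.vmin)
# transAxes places the text in axes-fraction coordinates of the colorbar
cb.ax.text(0.5, color_norm(0.5), 'Do something.\nRight here', color='r',
           transform=cb.ax.transAxes)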
If you want the value 0.5 at exactly 1/3 of the colorbar's height, you need to adjust the colorbar limits using the cb.set_clim((cmin, cmax)) method. Infinitely many (cmin, cmax) pairs fit that need, so additional constraints are necessary, such as keeping the min constant, keeping the max constant, or keeping max-min constant.
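As an illustration of one such constraint, keeping the maximum fixed, the lower limit follows from requiring (pivot - cmin) / (cmax - cmin) = fraction. The helper below is a hypothetical sketch, not a Matplotlib function.

```python
def cmin_for_pivot(pivot, fraction, cmax):
    """Lower limit that puts `pivot` at `fraction` of the colorbar height, cmax fixed."""
    if not 0 < fraction < 1:
        raise ValueError('fraction must be strictly between 0 and 1')
    return (pivot - fraction * cmax) / (1 - fraction)

cmin = cmin_for_pivot(pivot=0.5, fraction=1 / 3, cmax=1.0)
position = (0.5 - cmin) / (1.0 - cmin)  # fractional height of the pivot, about 1/3
```

The resulting pair can then be passed to set_clim; in recent Matplotlib versions this is likely to be done on the mappable instead, for example cb.mappable.set_clim(cmin, cmax).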
