How to reduce the number of data points in a scatter chart? - python

I'm having trouble plotting a huge number of X,Y points in a scatter chart using plotly and Python: the browser can't render that many points and crashes after some time. (I've also tried the Scattergl option: https://plot.ly/python/webgl-vs-svg/)
Are there any algorithms to reduce this huge number of points without losing the original shape of the scatter chart? Maybe something like the iterative end-point fit algorithm?
EDIT:
some code
import plotly.graph_objs as go
from plotly.offline import plot
import numpy as np

N = 1000000
trace = go.Scattergl(
    x=np.random.randn(N),
    y=np.random.randn(N),
    mode='markers',
    marker=dict(
        line=dict(
            width=1,
            color='#404040')
    )
)
data = [trace]
layout = go.Layout(title='A Simple Plot', width=1000, height=350)
fig = go.Figure(data=data, layout=layout)
plot(fig)

If you are just trying to visualize the regions where the data points exist, it might be more effective to convert the x-y data into a grid of densities. This may be better than a scatter plot because when you have a very large number of points, the points can obscure each other so you really have no idea how many points there are in certain areas.
I'm not familiar with plotly (I use matplotlib.pyplot) but I see there is at least one way to do this.
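A minimal sketch of the density-grid idea, assuming only NumPy: np.histogram2d counts how many points fall into each cell of a regular grid, so a million points collapse into a fixed-size array. The 200 x 200 grid and the variable names are illustrative choices, not something taken from the question.

```python
import numpy as np

# Synthetic stand-in for the real data (assumption: 1-D arrays of equal length).
rng = np.random.default_rng(0)
N = 1_000_000
x = rng.standard_normal(N)
y = rng.standard_normal(N)

# Count the points falling into each cell of a 200 x 200 grid.
# The bin count is arbitrary; tune it to the resolution you need.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=200)

# counts[i, j] is the number of points with x in bin i and y in bin j.
# Heatmap functions usually expect rows to run along y, so transpose.
density = counts.T

print(density.shape)       # (200, 200)
print(int(density.sum()))  # 1000000: every point lands in exactly one cell
```

The resulting array can then be handed to a heatmap, for example go.Heatmap(z=density) in plotly or imshow in matplotlib, which draws 40,000 cells no matter how large N is.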

One way would be to randomly sample from the scatter points. As long as you sample enough points, the sample is very likely to have a similar shape to the full data set.
For example, to randomly sample 10,000 of the 1 million points you would use
i_plot = np.random.choice(N, size=10000, replace=False)
trace = go.Scattergl(
    x=np.random.randn(N)[i_plot],
    y=np.random.randn(N)[i_plot],
    mode='markers',
    marker=dict(
        line=dict(
            width=1,
            color='#404040')
    )
)
This snippet might look silly, but in reality you'll have actual arrays instead of np.random.randn(N), so it will make sense to randomly sample from those arrays.
You'll want to test different numbers of points, and probably increase it to the maximum number of points the engine can handle without lagging or crashing.
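As a rough sketch of the same idea with arrays that already exist, the indices can be drawn once and reused for both coordinates, so that each sampled x stays paired with its own y. The names x, y, n_keep and idx are placeholders, and the synthetic data only stands in for the real arrays.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the real data; in practice x and y already exist.
N = 1_000_000
x = rng.standard_normal(N)
y = rng.standard_normal(N)

# Draw 10,000 distinct indices once and apply them to both arrays,
# which keeps every sampled x together with its matching y.
n_keep = 10_000
idx = rng.choice(N, size=n_keep, replace=False)
x_small = x[idx]
y_small = y[idx]

print(x_small.shape, y_small.shape)  # (10000,) (10000,)
```

x_small and y_small can then be passed as the x and y arguments of go.Scattergl in place of the full arrays.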

You should try the Datashader package (http://datashader.readthedocs.io/en/latest/), which focuses on exactly that: transforming a huge number of data points into something more amenable to visualization. Its authors also explain why their approach might be better than a simple heatmap: https://anaconda.org/jbednar/plotting_pitfalls/notebook

Related

Plotly scatter3d go empty dealing with a huge datapoints

I am trying to plot a huge number of data points. If I use the following code, it works properly:
import numpy as np
import pandas as pd
import plotly.graph_objects as go

N = 615677
df = pd.DataFrame(dict(x=np.random.randn(N),
                       y=np.random.randn(N),
                       z=np.random.randn(N)))
marker_data = go.Scatter3d(
    x=np.random.randn(N),
    y=np.random.randn(N),
    z=np.random.randn(N),
    marker=go.scatter3d.Marker(size=1),
    mode='markers',
)
fig = go.Figure(data=marker_data)
fig.show()
figure 1, N=615677, normal plot
However, if I set
N = 615678
I get an empty graph: it plots only the axes, without any data points.
figure 2, N=615678, wrong plot
Does anyone know what causes this? I can work around it by downsampling, but that may not be the best way.

t-SNE map into 2D or 3D plot

import pandas as pd
from influxdb import InfluxDBClient
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# username, password, FROMT and TOT are assumed to be defined elsewhere
features = ["Ask1", "Bid1", "smooth_midprice", "BidSize1", "AskSize1"]
client = InfluxDBClient(host='127.0.0.1', port=8086, database='data',
                        username=username, password=password)
series = "DCIX_2016_11_15"
sql = "SELECT * FROM {} where time >= '{}' AND time <= '{}' ".format(series, FROMT, TOT)
df = pd.DataFrame(client.query(sql).get_points())
# Separating out the features
X = df.loc[:, features].values
# Standardizing the features
X = StandardScaler().fit_transform(X)
tsne = TSNE(n_components=3, n_jobs=5).fit_transform(X)
I would like to map my 5 features into a 2D or 3D plot. I am a bit confused about how to do that. How can I build a plot from that information?
You already have most of the work done. t-SNE is a common visualization for understanding high-dimensional data, and right now the variable tsne is an array where each row represents a set of (x, y, z) coordinates from the obtained embedding. You could use other visualizations if you would like, but t-SNE is probably a good starting place.
As far as actually seeing the results, even though you have the coordinates available you still need to plot them somehow. The matplotlib library is a good option, and that's what we'll use here.
To plot in 2D you have a couple of options. You can either keep most of your code the same and simply perform a 2D t-SNE with
tsne = TSNE(n_components=2, n_jobs=5).fit_transform(X)
Or you can just use the components you have and only look at two of them at a time. The following snippet should handle either case:
import matplotlib.pyplot as plt
plt.scatter(*zip(*tsne[:,:2]))
plt.show()
The zip(*...) transposes your data so that you can pass the x coordinates and the y coordinates individually to scatter(), and the [:,:2] piece selects two coordinates to view. You could ignore it if your data is already 2D, or you could replace it with something like [:,[0,2]] to view, for example, the 0th and 2nd features in higher-dimensional data rather than just the first 2.
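To make the transposition concrete, here is a small sketch on a made-up array (emb is a hypothetical stand-in for the t-SNE output). It checks that zip(*...) yields the same coordinates as plain column slicing, which means plt.scatter(emb[:, 0], emb[:, 1]) would be an equivalent call that avoids the Python-level unpacking.

```python
import numpy as np

# Stand-in for a t-SNE result: 5 points embedded in 3 dimensions.
emb = np.arange(15, dtype=float).reshape(5, 3)

# zip(*rows) regroups the rows into one tuple per column.
xs, ys = zip(*emb[:, :2])

# Column slicing gives the same coordinates directly.
assert np.allclose(xs, emb[:, 0])
assert np.allclose(ys, emb[:, 1])

# Fancy indexing picks non-adjacent components, e.g. the 0th and 2nd.
xs02, zs02 = emb[:, [0, 2]].T
print(xs02)  # [ 0.  3.  6.  9. 12.]
print(zs02)  # [ 2.  5.  8. 11. 14.]
```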
To plot in 3D the code looks much the same, at least for a minimal version.
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(*zip(*tsne))
plt.show()
The main differences are a use of 3D plotting libraries and making a 3D subplot.
Adding color: t-SNE visualizations are typically more helpful if they're color-coded somehow. One example might be the smooth midprice you currently have stored in X[:,2]. For exploratory visualizations, I find 2D plots more helpful, so I'll use that as the example:
plt.scatter(*zip(*tsne[:,:2]), c=X[:,2])
You still need the imports and whatnot, but by passing the keyword argument c you can color code the scatter plot. To adjust how that numeric data is displayed, you could use a different color map like so:
plt.scatter(*zip(*tsne[:,:2]), c=X[:,2], cmap='RdBu')
As the name might suggest, this colormap consists of a gradient between red and blue, and the lower values of X[:,2] will correspond to red.

How to make legends have colours that correspond to data points using python's plotly?

I'm following the plotly documentation for colouring a scatter graph. Here is my code:
I first create a fake data frame with the same shape as what I'm working with
import pandas
import colorlover as cl
import plotly
import plotly.graph_objs as go
import numpy
data = numpy.random.normal(0, 1, 3*6*11*2)
data = data.reshape(((3*6*11), 2))
data = pandas.DataFrame(data)
sub_experiments = ['Subexperiment_{}'.format(i) for i in [1, 2, 3]]
repeats = ['Repeat_{}'.format(i) for i in range(1, 7)]
time = ['{}h'.format(i) for i in range(11)]
array = [sub_experiments, repeats, time]
idx = pandas.MultiIndex.from_product(array, names=['SubExperiment', 'Repeats', 'Time'])
data.index = idx
Now I want to create a scatter graph with plotly:
scatters = []
for label, df in data.groupby(level=[0, 1]):
    # one colour per row of the group, taken from the 'Paired' scale
    scales = cl.scales[str(df.shape[0])]
    colour = scales['qual']['Paired']
    d = go.Scatter(
        x=df[0],
        y=df[1],
        mode='markers',
        name='_'.join(label),
        marker=dict(color=colour, line=dict(color='black')),
    )
    scatters.append(d)
And it looks like this:
Note that since I've made up data for this example and I'm actually doing principal component analysis, the plot in the above screenshot shows clusters while the example code will not.
The problem here is that plotly has not coloured the legend like the points.
How can I colour the legend in the same way as the points?
You are passing an array of colours to each trace, i.e. each trace has several different colours, so picking one of them for the legend doesn't really make sense (try removing color=colour to see the effect).
Based on your code I am not sure what you are expecting to see, i.e. whether the colour should be based on the label (subexperiment + repeat) or on the time (as defined in the code).
In the first case, you could just pick one color for each trace. In the latter case, plotting the time and assigning one color to each time makes more sense in my opinion.
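A sketch of the first option, one colour per trace, might look like the following. The palette is a hypothetical list of colour strings, and the traces are built as plain dicts in the shape plotly accepts for scatter traces, so they should be usable as go.Figure(data=traces). With six palette entries, the colour here ends up following the repeat number.

```python
from itertools import product

# Hypothetical palette; any list of CSS colour strings would do.
palette = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b']

sub_experiments = ['Subexperiment_{}'.format(i) for i in [1, 2, 3]]
repeats = ['Repeat_{}'.format(i) for i in range(1, 7)]

traces = []
for n, label in enumerate(product(sub_experiments, repeats)):
    traces.append(dict(
        type='scatter',
        mode='markers',
        name='_'.join(label),
        # A single colour string (not a list) applies to the whole trace,
        # so the legend entry can show the same colour as the points.
        marker=dict(color=palette[n % len(palette)],
                    line=dict(color='black')),
        x=[],  # fill in the x values of this group
        y=[],  # fill in the y values of this group
    ))

print(len(traces))        # 18
print(traces[0]['name'])  # Subexperiment_1_Repeat_1
```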

Changing size of scattered points in matplotlib

I am doing some plotting using cartopy and matplotlib, and I am producing a few images using the same set of points with a different domain size shown in each image. As my domain size gets bigger, the size of each plotted point remains fixed, so eventually as I zoom out, things get scrunched up, overlapped, and generally messy. I want to change the size of the points, and I know that I could do so by plotting them again, but I am wondering if there is a way to change their size without going through that process another time.
this is the line that I am using to plot the points:
plt.scatter(longs, lats, color = str(platformColors[platform]), zorder = 2, s = 8, marker = 'o')
and this is the line that I am using to change the domain size:
ax.set_extent([lon-offset, lon+offset, lat-offset, lat+offset])
Any advice would be greatly appreciated!
The collection returned by scatter has a set_sizes method, which you can use to set new sizes. For example:
import matplotlib.pylab as pl
import numpy as np
x = np.random.random(10)
y = np.random.random(10)
s = np.random.random(10)*100
pl.figure()
l=pl.scatter(x,y,s=s)
s = np.random.random(10)*100
l.set_sizes(s)
It seems that set_sizes only accepts arrays, so for a constant marker size you could do something like:
l.set_sizes(np.ones(x.size)*100)
Or for a relative change, something like:
l.set_sizes(l.get_sizes()*2)
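Applied to the zooming problem in the question, a sketch could look like this. It uses plain matplotlib axes limits rather than cartopy's set_extent, and zoom_out is a hypothetical helper, not part of either library. Because s is an area in points squared, dividing it by factor**2 shrinks the marker diameter by the same factor by which the view grows.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10)
y = rng.random(10)

fig, ax = plt.subplots()
points = ax.scatter(x, y, s=8)  # same starting size as in the question


def zoom_out(ax, points, factor):
    """Widen both axes by `factor` and shrink the markers to match (hypothetical helper)."""
    x0, x1 = ax.get_xlim()
    y0, y1 = ax.get_ylim()
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * factor / 2
    half_h = (y1 - y0) * factor / 2
    ax.set_xlim(cx - half_w, cx + half_w)
    ax.set_ylim(cy - half_h, cy + half_h)
    # s is an area in points**2: divide by factor**2 to scale the diameter by 1/factor.
    points.set_sizes(points.get_sizes() / factor**2)


zoom_out(ax, points, 2.0)
fig.canvas.draw()  # re-render with the new limits and sizes
print(points.get_sizes())
```

The points are not plotted again; only the size array of the existing collection is replaced before the next draw.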
http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter
These are the parameters that plt.scatter takes. The s parameter is the size of the scattered points, so change s to whatever you like, for example:
plt.scatter(longs, lats, color = str(platformColors[platform]), zorder = 2, s = 20, marker = 'o')

Speeding up matplotlib scatter plots

I'm trying to make an interactive program which primarily uses matplotlib to make scatter plots of rather a lot of points (10k-100k or so). Right now it works, but changes take too long to render. Small numbers of points are ok, but once the number rises things get frustrating in a hurry. So, I'm working on ways to speed up scatter, but I'm not having much luck.
There's the obvious way to do things (the way it's implemented now)
(I realize the plot redraws without updating. I didn't want to alter the fps result with large calls to random).
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import time
X = np.random.randn(10000) #x pos
Y = np.random.randn(10000) #y pos
C = np.random.random(10000) #will be color
S = (1+np.random.randn(10000)**2)*3 #size
#build the colors from a color map
colors = mpl.cm.jet(C)
#there are easier ways to do static alpha, but this allows
#per point alpha later on.
colors[:,3] = 0.1
fig, ax = plt.subplots()
fig.show()
background = fig.canvas.copy_from_bbox(ax.bbox)
#this makes the base collection
coll = ax.scatter(X,Y,facecolor=colors, s=S, edgecolor='None',marker='D')
fig.canvas.draw()
sTime = time.time()
for i in range(10):
    print(i)
    # don't change anything, but redraw the plot
    ax.cla()
    coll = ax.scatter(X, Y, facecolor=colors, s=S, edgecolor='None', marker='D')
    fig.canvas.draw()
# note: elapsed/10 is seconds per frame; 10/elapsed would be frames per second
print('%2.1f FPS' % ((time.time() - sTime) / 10))
Which gives a speedy 0.7 fps
Alternatively, I can edit the collection returned by scatter. That way I can change color and position, but I don't know how to change the size of each point. I think that would look something like this:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import time
X = np.random.randn(10000) #x pos
Y = np.random.randn(10000) #y pos
C = np.random.random(10000) #will be color
S = (1+np.random.randn(10000)**2)*3 #size
#build the colors from a color map
colors = mpl.cm.jet(C)
#there are easier ways to do static alpha, but this allows
#per point alpha later on.
colors[:,3] = 0.1
fig, ax = plt.subplots()
fig.show()
background = fig.canvas.copy_from_bbox(ax.bbox)
#this makes the base collection
coll = ax.scatter(X,Y,facecolor=colors, s=S, edgecolor='None', marker='D')
fig.canvas.draw()
sTime = time.time()
for i in range(10):
    print(i)
    # don't change anything, but redraw the plot
    coll.set_facecolors(colors)
    coll.set_offsets(np.array([X, Y]).T)
    # for starters let's not change anything!
    fig.canvas.restore_region(background)
    ax.draw_artist(coll)
    fig.canvas.blit(ax.bbox)
# note: elapsed/10 is seconds per frame; 10/elapsed would be frames per second
print('%2.1f FPS' % ((time.time() - sTime) / 10))
This results in a slower 0.7 fps. I wanted to try using CircleCollection or RegularPolygonCollection, as this would allow me to change the sizes easily, and I don't care about changing the marker. But, I can't get either to draw so I have no idea if they'd be faster. So, at this point I'm looking for ideas.
I've been through this a few times trying to speed up scatter plots with large numbers of points, variously trying:
Different marker types
Limiting colours
Cutting down the dataset
Using a heatmap / grid instead of a scatter plot
And none of these things worked. Matplotlib is just not very performant when it comes to scatter plots. My only recommendation is to use a different plotting library, though I haven't personally found one that was suitable. I know this doesn't help much, but it may save you some hours of fruitless tinkering.
We are actively working on performance for large matplotlib scatter plots.
I'd encourage you to get involved in the conversation (http://matplotlib.1069221.n5.nabble.com/mpl-1-2-1-Speedup-code-by-removing-startswith-calls-and-some-for-loops-td41767.html) and, even better, test out the pull request that has been submitted to make life much better for a similar case (https://github.com/matplotlib/matplotlib/pull/2156).
HTH
