How to draw a heat map with a large data set in Python

I am trying to plot a sine wave in which the color of the curve at each point represents its tangential slope.
For example, a data frame of roughly 3600 * 2000 cells is filled like this:
import numpy as np
import pandas as pd
import seaborn
x_axis = list(range(0, 3601))
y_axis = list(range(-1000, 1001))
wave = pd.DataFrame(index=y_axis, columns=x_axis)
for i in range(0, 3601, 1):
    # scale first, then round, so the row label is an exact integer in [-1000, 1000]
    y = int(round(np.sin(np.radians(i / 10)) * 1000))
    wave.loc[y, i] = -abs(y)
wave = wave.fillna(0)
wave[wave == 0] = np.nan
seaborn.heatmap(wave)
Using seaborn.heatmap(wave), the heatmap is generated as in the attached image. What I am looking for, however, is to draw maybe 50-100 sine waves like this in one picture, so the data frame would grow to about 360000 * 10000. At that size I still want to show a similar heatmap, or any kind of drawing that can represent the value change for each cell. My workstation seems to freeze when I use the seaborn heatmap with this dataset.
One idea is to normalize all the values to 0-255 and use some GLV plotting function; I am still researching it.

You could create a similar plot using plt.scatter:
import matplotlib.pyplot as plt
import numpy as np
x_axis = np.arange(0, 360, 0.1)
y = np.round(np.sin(np.radians(x_axis)), 3) * 1000
plt.scatter(x_axis, y, c=-np.abs(y), s=1, cmap='gist_heat')
plt.show()
To get a wider curve, just increase s. To get rid of the white part of the colormap, you can move the color limits (called vmin and vmax). By default they are the minimum and maximum of the given color values; in this case the maximum is 0 and the minimum is -1000. Setting vmax to +100 would leave out roughly 10% of the color range.
plt.scatter(x_axis, y, c=-np.abs(y), vmax=0.1*y.max(), s=10, cmap='gist_heat')
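For the 50-100 waves mentioned in the question, the same scatter approach can work on plain NumPy arrays instead of one huge DataFrame. The following is a minimal sketch and not part of the original answer: the helper names sine_wave_points and plot_waves, the vertical offset between waves, and the choice to color by the analytic derivative cos(x) are all assumptions made for illustration.

```python
import numpy as np


def sine_wave_points(n_waves=50, step_deg=0.1, offset=2.5):
    """Return flat x, y and slope arrays for n_waves vertically stacked sine waves.

    The slope is the analytic derivative cos(x). Stacking the waves with a
    constant vertical offset is an assumption made for this illustration.
    """
    x_deg = np.arange(0, 360, step_deg)
    x_rad = np.radians(x_deg)
    offsets = offset * np.arange(n_waves)[:, None]   # shape (n_waves, 1)
    y = np.sin(x_rad)[None, :] + offsets             # shape (n_waves, n_points)
    slope = np.broadcast_to(np.cos(x_rad), y.shape)  # same slope for every wave
    x = np.broadcast_to(x_deg, y.shape)
    return x.ravel(), y.ravel(), slope.ravel()


def plot_waves(x, y, slope):
    """Draw all points in a single scatter call, colored by slope."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots()
    points = ax.scatter(x, y, c=slope, s=1, cmap="coolwarm", rasterized=True)
    fig.colorbar(points, ax=ax, label="slope")
    return fig, ax


# 50 waves at 0.1 degree resolution: a few hundred thousand points, no DataFrame needed
x, y, slope = sine_wave_points(n_waves=50)
```

Calling plot_waves(x, y, slope) and saving or showing the returned figure draws every wave in one call. Only the points on the curves are stored, so memory grows with the number of samples rather than with the area of the grid.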

Related

How to helpfully plot time series data in python

I have a two-dimensional set of time series data containing about 1000 samples, i.e. a long list of 1000 elements in which each entry is a list of two numbers.
It can be thought of as the position (x and y coordinates) of a car, with a picture taken each second for 1000 seconds. When I plot this, as seen below, you get a decent idea of the trajectory, but it's unclear where the car starts or finishes, i.e. which direction it is traveling in. I was thinking about including arrows between each pair of points, but I think this would get quite cluttered (maybe you know a way to overcome that issue?). I also thought of colouring each point with a spectrum that makes it clear how time increases, i.e. hotter points to colder points as time goes on. Any idea how to achieve this in matplotlib?

I believe both your ideas would work well, I just think you need to test which option works best for your case.
Option 1: arrows
To avoid a cluttered plot, I believe you could plot arrows between only a selection of points to show the general direction of your trajectory. In my example below I only plot an arrow between points 1 and 2, 6 and 7, and so on. You might want to increase the spacing between the arrows to make this work for your long series. It is also possible to connect points that are separated by, say, 10 points to make the arrows more clearly visible.
import numpy as np
import matplotlib.pyplot as plt
# example data
x = np.linspace(0, 10, 100)
y = x
plt.figure()
# plot the data points
for i in range(len(x)):
    plt.plot(x[i], y[i], "ro")
# plot arrows between points 1 and 2, 6 and 7 and so on.
for i in range(0, len(x)-1, 5):
    plt.arrow(x[i], y[i], x[i+1] - x[i], y[i+1] - y[i], color="black", zorder=2, width=0.05)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
This yields this plot.
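For a series of 1000 points, the per-arrow loop can also be replaced by a single vectorized call. The sketch below swaps plt.arrow for matplotlib's quiver, which is a different technique from the one in the answer; the helper names arrow_segments and plot_with_arrows and the default spacing of 5 are assumptions for illustration.

```python
import numpy as np


def arrow_segments(x, y, every=5):
    """Return start points and displacement vectors for every `every`-th step."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.arange(0, len(x) - 1, every)  # same indices as range(0, len(x)-1, every)
    return x[idx], y[idx], x[idx + 1] - x[idx], y[idx + 1] - y[idx]


def plot_with_arrows(x, y, every=5):
    """Plot the points and draw one arrow per selected step in a single quiver call."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    x0, y0, dx, dy = arrow_segments(x, y, every)
    fig, ax = plt.subplots()
    ax.plot(x, y, "ro", markersize=3)
    # angles/scale_units/scale are set so arrows are drawn in data coordinates
    ax.quiver(x0, y0, dx, dy, angles="xy", scale_units="xy", scale=1, zorder=2)
    return fig, ax


# example data, as in the answer
x = np.linspace(0, 10, 100)
y = x
x0, y0, dx, dy = arrow_segments(x, y, every=5)
```

Because each arrow spans only one sampling step, it may be very short for dense data; passing x[::10], y[::10] with every=1 instead connects points that are 10 samples apart.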
Option 2: colors
You can generate any number of colors from a colormap, meaning you can make a list of 1000 sequential colors. This way you can plot each of your points in an increasingly warm color.
Example:
import numpy as np
import matplotlib.pyplot as plt
# example data
x = np.linspace(0, 10, 100)
y = x
# generate 100 (number of data points) colors from colormap
colors = [plt.get_cmap("coolwarm")(i) for i in np.linspace(0,1, len(x))]
plt.figure()
# plot the data points with the generated colors
for i in range(len(x)):
    plt.plot(x[i], y[i], color=colors[i], marker="o")
plt.xlabel('x')
plt.ylabel('y')
plt.show()
This yields this figure, where the oldest data point is cool (blue) and the newest is red (warm).
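The same coloring can be had without building the color list by hand, by passing the time index to scatter and letting the colormap do the mapping. This is a minimal sketch rather than the answer's own code; the function name plot_colored_by_time is hypothetical, and the sample index stands in for time because the question does not give timestamps.

```python
import numpy as np


def plot_colored_by_time(x, y, t):
    """Scatter the trajectory with color encoding the time value t."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots()
    sc = ax.scatter(x, y, c=t, cmap="coolwarm")
    fig.colorbar(sc, ax=ax, label="sample index")  # legend for the time axis
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return fig, ax


# example data, as in the answer
x = np.linspace(0, 10, 100)
y = x
t = np.arange(len(x))  # assumption: one sample per time step
```

Calling plot_colored_by_time(x, y, t) gives one artist instead of 100 separate plot calls, and the colorbar tells the reader which end of the trajectory is the start.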

How to make a spectrum plot

I am trying to replicate a spectrum plot like the figure below with both Python and Matlab, no success so far.
The image is from Electric Field Instrument data. The plot should have time on x-axis, frequency on y-axis and colorbar on the right y-axis.
The data is a two-dimensional matrix: each row represents a time stamp and each column a different frequency after the FFT. The problem is that the data has a lot of NaN values and only a few frequencies have data, so when I used plt.imshow() it gave me a completely blank image. Besides, the values range from 1e-12 to 1e-7, which is very small.
Any hint on how to visualize image like this would be greatly appreciated.
Screenshot of the data. The data is from NASA EFI data.
I used plt.imshow in Python and imagesc in Matlab with the whole 2D matrix; both gave me a blank image of a single color.
Below are my Python attempts; all of them gave me wrong images:
plt.matshow(dt, cmap='jet'); plt.colorbar(); plt.show()
for i in range(dt.shape[0]):
    plt.plot(dt.iloc[i, :]); plt.show()

You shouldn't use imshow because this will display it as if it were an image (because you have a 2D matrix).
You need to plot each row separately, like so:
import numpy as np
import matplotlib.pyplot as plt
sin1 = np.sin(np.linspace(0, 2*np.pi, 100))
sin2 = np.sin(np.linspace(0, 2*np.pi, 100)) + 0.5
sin3 = np.sin(np.linspace(0, 2*np.pi, 100)) + 1
sin1[10] = np.nan
sin2[20] = np.nan
sin3[30] = np.nan
data = np.array([sin1, sin2, sin3])
# plot each row as a separate series
for i in range(data.shape[0]):
    plt.plot(data[i, :])
plt.show()
and then the nan's should just be empty spots in the graph.
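If a time-frequency image with a colorbar is still wanted, as the question describes, one alternative to the line plots is pcolormesh with a logarithmic color scale and the NaNs masked out. This is a different technique from the one in the answer and only a sketch: the synthetic data, the helper names make_example_spectrum and plot_spectrum, and the use of integer indices for the time and frequency axes are all assumptions.

```python
import numpy as np


def make_example_spectrum(n_times=200, n_freqs=64, seed=0):
    """Build a synthetic (time x frequency) matrix that is mostly NaN.

    Values lie between 1e-12 and 1e-7 to mimic the range quoted in the
    question; the data itself is made up for this illustration.
    """
    rng = np.random.default_rng(seed)
    power = 10.0 ** rng.uniform(-12, -7, size=(n_times, n_freqs))
    power[:, ::2] = np.nan  # pretend only every other frequency has data
    return power


def plot_spectrum(power, times, freqs):
    """Show time on x, frequency on y, and log-scaled power as color."""
    import matplotlib.pyplot as plt            # imported lazily; only for drawing
    from matplotlib.colors import LogNorm

    masked = np.ma.masked_invalid(power)       # NaN cells are left blank
    fig, ax = plt.subplots()
    mesh = ax.pcolormesh(times, freqs, masked.T,
                         norm=LogNorm(vmin=masked.min(), vmax=masked.max()),
                         shading="auto", cmap="jet")
    fig.colorbar(mesh, ax=ax, label="power")
    ax.set_xlabel("time")
    ax.set_ylabel("frequency")
    return fig, ax


power = make_example_spectrum()
times = np.arange(power.shape[0])
freqs = np.arange(power.shape[1])
masked = np.ma.masked_invalid(power)
```

The logarithmic norm matters here because values spanning five orders of magnitude collapse into a single color on a linear scale, which may be why the imshow attempt looked blank.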

How do I correctly implement contours of histograms with logscale binning in numpy/matplotlib

I am trying to plot contours of data that has been binned using numpy.histogram2d, except the bins are set using numpy.logspace (equal binning in log space).
Unfortunately, this results in a strange behavior that I can't seem to resolve: the placement of the contours does not match the location of the points in x/y. I plot both the 2D histogram of the data and the contours, and they do not overlap.
It looks like the contours are being placed at the physical location on the plot in linear space, where I expect them to be placed in log space.
It's a strange phenomenon that I think can be best described by the following plots, using identical data but binned in different ways:
Here is a minimum working example to produce the logbinned data:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(loc=500, scale=100,size=10000)
y = np.random.normal(loc=600, scale=60, size=10000)
nbins = 50
bins = (np.logspace(np.log10(10),np.log10(1000),nbins),np.logspace(np.log10(10),np.log10(1000),nbins))
HH, xe, ye = np.histogram2d(x,y,bins=bins)
plt.hist2d(x,y,bins=bins,cmin=1);
grid = HH.transpose()
extent = np.array([xe.min(), xe.max(), ye.min(), ye.max()])
cs = plt.contourf(grid,2,extent=extent,extend='max',cmap='plasma',alpha=0.5,zorder=100)
plt.contour(grid,2,extent=extent,colors='k',zorder=100)
plt.yscale('log')
plt.xscale('log')
It's fairly clear what is happening: the contour is getting misplaced due to the scaling of the bins. I'd like to be able to plot the histogram and the contour here together.
If anyone has an idea of how to resolve this, that would be very helpful - thanks!

This is your problem:
cs = plt.contourf(grid,2,extent=extent,...)
You are passing in a single 2d array specifying the values of the histograms, but you aren't passing the x and y coordinates these data correspond to. By only passing in extent there's no way for pyplot to do anything other than assume that the underlying grid is uniform, stretched out to fit extent.
So instead what you have to do is to define x and y components for each value in grid. You have to think a bit how to do this, because you have (n, n)-shaped data and (n+1,)-shaped edges to go with it. We should probably choose the center of each bin to associate a data point with. So we need to find the midpoint of each bin, and pass those arrays to contour[f].
Something like this:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng()
size = 10000
x = rng.normal(loc=500, scale=100, size=size)
y = rng.normal(loc=600, scale=60, size=size)
nbins = 50
bins = (np.geomspace(10, 1000, nbins),) * 2
HH, xe, ye = np.histogram2d(x, y, bins=bins)
fig, ax = plt.subplots()
ax.hist2d(x, y, bins=bins, cmin=1)
grid = HH.transpose()
# compute bin midpoints
midpoints = (xe[1:] + xe[:-1])/2, (ye[1:] + ye[:-1])/2
cs = ax.contourf(*midpoints, grid, levels=2, extend='max', cmap='plasma', alpha=0.5, zorder=100)
ax.contour(*midpoints, grid, levels=2, colors='k', zorder=100)
# these are a red herring during debugging:
#ax.set_yscale('log')
#ax.set_xscale('log')
(I've cleaned up your code a bit.)
Alternatively, if you want to avoid having those white strips at the top and edge, you can keep your bin edges, and pad your grid with zeros:
grid_padded = np.pad(grid, [(0, 1)])
cs = ax.contourf(xe, ye, grid_padded, levels=2, extend='max', cmap='plasma', alpha=0.5, zorder=100)
ax.contour(xe, ye, grid_padded, levels=2, colors='k', zorder=100)
This gives us something like
This seems prettier, but if you think about your data this is less exact, because your data points are shifted with respect to the bin coordinates they correspond to. If you look closely you can see the contours being shifted with respect to the output of hist2d. You could fix this by generating geomspaces with one more final value which you only use for this final plotting step, and again use the midpoints of these edges (complete with a last auxiliary one).
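Since the bins are equally spaced in log space, another option for the bin centers is the geometric mean of adjacent edges, which sits in the middle of each bin on a log axis. This is a variation on the arithmetic midpoints used above, not something stated in the answer, and the helper name geometric_centers is hypothetical.

```python
import numpy as np


def geometric_centers(edges):
    """Centers of bins in log space: the geometric mean of adjacent edges."""
    edges = np.asarray(edges, dtype=float)
    return np.sqrt(edges[:-1] * edges[1:])


# same bin edges as in the answer
edges = np.geomspace(10, 1000, 50)
centers = geometric_centers(edges)

# arithmetic midpoints for comparison; they lie slightly to the right
# of the log-space centers in every bin
arithmetic = (edges[:-1] + edges[1:]) / 2
```

Passing geometric_centers(xe) and geometric_centers(ye) to contour and contourf in place of the arithmetic midpoints would place each value at the visual center of its bin once the axes are switched to log scale.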

Using numpy arrays to count the number of points within the cells of a regular grid

I am working with a large number of 3D points, each with x,y,z values stored in numpy arrays.
For background, the points will always fall within a cylinder of fixed radius, and height = max z value of the points.
My objective is to split the bounding cylinder (or column if it is easier) into e.g. 1 m height strata, and then count the number of points within each cell
of a regular grid (e.g. 1 m x 1 m) overlaid on each strata.
Conceptually, the operation would be the same as overlaying a raster and counting the points intersecting each pixel.
The grid of cells can form a square or a disk, it doesn't matter.
After a lot of searching and reading, my current thinking is to use some combination of numpy.linspace and numpy.meshgrid to generate the vertices of each cell stored within an array and test each cell against each point to see if it is 'in'. This seems inefficient, especially when working with thousands of points.
The numpy / scipy suite seems well suited to the problem, but I have not found a solution yet. Any suggestions would be much appreciated.
I have included a few example points and some code to visualize the data.
# Setup
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Load in X,Y,Z values from a sub-sample of 10 points for testing
# XY Values are scaled to a reasonable point of origin
z_vals = np.array([3.08,4.46,0.27,2.40,0.48,0.21,0.31,3.28,4.09,1.75])
x_vals = np.array([22.88,20.00,20.36,24.11,40.48,29.08,36.02,29.14,32.20,18.96])
y_vals = np.array([31.31,25.04,31.86,41.81,38.23,31.57,42.65,18.09,35.78,31.78])
# This plot is instructive to visualize the problem
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x_vals, y_vals, z_vals, c='b', marker='o')
plt.show()

I am not sure I understand perfectly what you are looking for, but since every "cell" seems to have a 1 m side in all directions, couldn't you:
round all your values to integers (rasterize your data), probably with some floor function;
create a bijection from these integer coordinates to something more convenient, with something like:
(64**2)*x + (64)*y + z # assuming all values are in [0,63]
(you can put z at the beginning instead if you want to focus more easily on height later);
compute the histogram of each "cell" (several functions from numpy/scipy can do it);
revert the bijection if needed (i.e. recover the "true" coordinates of each cell once the count is known).
Maybe I didn't understand well, but in case it helps...
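The steps above can be sketched as follows, using the ten sample points from the question. This is one possible reading of the suggestion, not the answerer's code: the function name count_points_per_cell is hypothetical, and np.ravel_multi_index is used as the bijection in place of the hand-written 64-based formula so that no fixed coordinate range has to be assumed.

```python
import numpy as np

# sample points from the question
z_vals = np.array([3.08, 4.46, 0.27, 2.40, 0.48, 0.21, 0.31, 3.28, 4.09, 1.75])
x_vals = np.array([22.88, 20.00, 20.36, 24.11, 40.48, 29.08, 36.02, 29.14, 32.20, 18.96])
y_vals = np.array([31.31, 25.04, 31.86, 41.81, 38.23, 31.57, 42.65, 18.09, 35.78, 31.78])


def count_points_per_cell(x, y, z, cell=1.0):
    """Count points per cubic cell via floor -> integer key -> unique -> decode."""
    # step 1: rasterize by flooring to integer cell indices
    ijk = np.floor(np.column_stack([x, y, z]) / cell).astype(np.int64)
    ijk_min = ijk.min(axis=0)
    shifted = ijk - ijk_min                          # make indices non-negative
    dims = tuple(int(d) for d in shifted.max(axis=0) + 1)
    # step 2: bijection from (i, j, k) to a single integer key
    keys = np.ravel_multi_index(shifted.T, dims)
    # step 3: histogram of the keys (only occupied cells are returned)
    unique_keys, counts = np.unique(keys, return_counts=True)
    # step 4: revert the bijection to recover the cell indices
    cells = np.column_stack(np.unravel_index(unique_keys, dims)) + ijk_min
    return cells, counts


cells, counts = count_points_per_cell(x_vals, y_vals, z_vals)
```

Each row of cells is the integer (x, y, z) index of an occupied cell, so multiplying by the cell size gives its lower corner. Empty cells never appear in the output, which keeps the result small when the points are sparse.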

Thanks @Baruchel. It turns out the n-dimensional histograms suggested by @DilithiumMatrix provide a fairly simple solution to the problem I posted. After some reading, here is my current solution for anyone else who faces a similar problem.
As this is my first Python/NumPy effort, any improvements or suggestions, especially regarding performance, would be welcome. Thanks.
# Setup
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Load in X,Y,Z values from a sub-sample of 10 points for testing
# XY Values are scaled to a reasonable point of origin
z_vals = np.array([3.08,4.46,0.27,2.40,0.48,0.21,0.31,3.28,4.09,1.75])
x_vals = np.array([22.88,20.00,20.36,24.11,40.48,29.08,36.02,29.14,32.20,18.96])
y_vals = np.array([31.31,25.04,31.86,41.81,38.23,31.57,42.65,18.09,35.78,31.78])
# Updated code below
# Variables needed for 2D,3D histograms
xmax, ymax, zmax = int(x_vals.max())+1, int(y_vals.max())+1, int(z_vals.max())+1
xmin, ymin, zmin = int(x_vals.min()), int(y_vals.min()), int(z_vals.min())
xrange, yrange, zrange = xmax-xmin, ymax-ymin, zmax-zmin
xedges = np.linspace(xmin, xmax, (xrange + 1), dtype=int)
yedges = np.linspace(ymin, ymax, (yrange + 1), dtype=int)
zedges = np.linspace(zmin, zmax, (zrange + 1), dtype=int)
# Make the 2D histogram
h2d, xedges, yedges = np.histogram2d(x_vals, y_vals, bins=(xedges, yedges))
assert h2d.sum() == len(x_vals), "Unclassified points in the array"
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.imshow(h2d.transpose(), extent=extent, interpolation='none', origin='lower')
# Transpose and origin must be used to make the array line up when using imshow, unsure why
# Plot settings, not sure yet on matplotlib update/override objects
plt.grid(True, which='both')
plt.xticks(xedges)
plt.yticks(yedges)
plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')
plt.plot(x_vals, y_vals, 'ro')
plt.show()
# 3-dimensional histogram with 1 x 1 x 1 m bins. Produces point counts in each 1m3 cell.
xyzstack = np.stack([x_vals,y_vals,z_vals], axis=1)
h3d, Hedges = np.histogramdd(xyzstack, bins=(xedges, yedges, zedges))
assert h3d.sum() == len(x_vals), "Unclassified points in the array"
h3d.shape # Shape of the array should be same as the edge dimensions
testzbin = np.sum(np.logical_and(z_vals >= 1, z_vals < 2)) # Slice to test with
np.sum(h3d[:,:,1]) == testzbin # Test num points in second bins
np.sum(h3d, axis=2) # Sum of all vertical points above each x,y 'pixel'
# only in this example the h2d and np.sum(h3d,axis=2) arrays will match as no z bins have >1 points
# Remaining issue - how to get a r x c count of empty z bins.
# i.e. for each 'pixel' how many z bins contained no points?
# Possible solution is to reshape to use logical operators
count2d = h3d.reshape(xrange * yrange, zrange) # Maintain dimensions per num 3D cells defined
zerobins = (count2d == 0).sum(1)
zerobins.shape
# Get back to x,y grid with counts - ready for output as image with counts=pixel digital number
bincount_pixels = zerobins.reshape(xrange,yrange)
# Appears to work, perhaps there is a way without reshapeing?
PS if you are facing a similar problem scikit patch extraction looks like another possible solution.
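On the question of counting empty z bins without reshaping: a boolean comparison summed along the z axis gives the same per-pixel count directly. A minimal sketch using the same ten points follows; building the 1 m edges with np.arange is an assumption made to keep the example short, and the reshape version is kept only to show that both results agree.

```python
import numpy as np

# sample points from the question
z_vals = np.array([3.08, 4.46, 0.27, 2.40, 0.48, 0.21, 0.31, 3.28, 4.09, 1.75])
x_vals = np.array([22.88, 20.00, 20.36, 24.11, 40.48, 29.08, 36.02, 29.14, 32.20, 18.96])
y_vals = np.array([31.31, 25.04, 31.86, 41.81, 38.23, 31.57, 42.65, 18.09, 35.78, 31.78])

points = np.column_stack([x_vals, y_vals, z_vals])
# 1 m bin edges covering the data along each axis
edges = [np.arange(np.floor(c.min()), np.floor(c.max()) + 2) for c in points.T]
h3d, _ = np.histogramdd(points, bins=edges)

# for each (x, y) pixel: how many z bins contain no points / at least one point
empty_per_pixel = (h3d == 0).sum(axis=2)
occupied_per_pixel = (h3d > 0).sum(axis=2)

# the reshape-based version from the post, for comparison
reshaped = (h3d.reshape(-1, h3d.shape[2]) == 0).sum(axis=1).reshape(h3d.shape[:2])
```

Summing along axis=2 keeps the (x, y) layout intact, so the result can be passed to imshow in the same way as h2d.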

matplotlib: disregard outliers when plotting

I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.
With matplotlib, I plot against the range [0, max_data_value]
How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?
Should I simply take the 95 percentile and have the range [0, 95_percentile] on the x-axis?

There's no single "best" test for an outlier. Ideally, you should incorporate a-priori information (e.g. "This parameter shouldn't be over x because of blah...").
Most tests for outliers use the median absolute deviation, rather than the 95th percentile or some other variance-based measurement. Otherwise, the variance/stddev that is calculated will be heavily skewed by the outliers.
Here's a function that implements one of the more common outlier tests.
def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)
    modified_z_score = 0.6745 * diff / med_abs_deviation
    return modified_z_score > thresh
As an example of using it, you'd do something like the following:
import numpy as np
import matplotlib.pyplot as plt
# The function above... In my case it's in a local utilities module
from sci_utilities import is_outlier
# Generate some data
x = np.random.random(100)
# Append a few "bad" points
x = np.r_[x, -3, -10, 100]
# Keep only the "good" points
# "~" operates as a logical not operator on boolean numpy arrays
filtered = x[~is_outlier(x)]
# Plot the results
fig, (ax1, ax2) = plt.subplots(nrows=2)
ax1.hist(x)
ax1.set_title('Original')
ax2.hist(filtered)
ax2.set_title('Without Outliers')
plt.show()

If you aren't fussed about rejecting outliers as mentioned by Joe and it is purely aesthetic reasons for doing this, you could just set your plot's x axis limits:
plt.xlim(min_x_data_value,max_x_data_value)
Where the values are your desired limits to display.
plt.ylim(min,max) works to set limits on the y axis also.

I think using pandas quantile is useful and much more flexible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
pd_series = pd.Series(np.random.normal(size=300))
pd_series_adjusted = pd_series[pd_series.between(pd_series.quantile(.05), pd_series.quantile(.95))]
ax1.boxplot(pd_series)
ax1.set_title('Original')
ax2.boxplot(pd_series_adjusted)
ax2.set_title('Adjusted')
plt.show()

I usually pass the data through the function np.clip. If you have some reasonable estimate of the maximum and minimum value of your data, just use that. If you don't have a reasonable estimate, the histogram of clipped data will show you the size of the tails, and if the outliers are really just outliers the tail should be small.
What I run is something like this:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(3, size=100000)
plt.hist(np.clip(data, -15, 8), bins=333, density=True)
You can compare the results if you change the min and max in the clipping function until you find the right values for your data.
In this example, you can see immediately that the max value of 8 is not good because you are removing a lot of meaningful information. The min value of -15 should be fine since the tail is not even visible.
You could probably write some code that, based on this, finds good bounds that minimize the size of the tails according to some tolerance.
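One simple way to automate that search is to take the bounds from quantiles, so that a chosen fraction of the data falls in each clipped tail. This is only a sketch of the idea, not the answerer's code: the function name clip_bounds and the default tolerance of 0.1% per tail are assumptions.

```python
import numpy as np


def clip_bounds(data, tail_fraction=0.001):
    """Return (low, high) so that about tail_fraction of the data lies beyond each bound."""
    low, high = np.quantile(data, [tail_fraction, 1.0 - tail_fraction])
    return float(low), float(high)


rng = np.random.default_rng(0)
data = rng.normal(3, size=100_000)

low, high = clip_bounds(data, tail_fraction=0.001)
clipped = np.clip(data, low, high)  # values beyond the bounds pile up at the edges
```

The clipped array can then go into plt.hist as in the example above; the height of the two edge bins shows directly how much data was pushed to the bounds.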

In some cases (e.g. in histogram plots such as the one in Joe Kington's answer) rescaling the plot could show that the outliers exist but that they have been partially cropped out by the zoom scale. Removing the outliers would not have the same effect as just rescaling. Automatically finding appropriate axes limits seems generally more desirable and easier than detecting and removing outliers.
Here's an autoscale idea using percentiles and data-dependent margins to achieve a nice view.
# xdata = some x data points ...
# ydata = some y data points ...
# Finding limits for y-axis
ypbot = np.percentile(ydata, 1)
yptop = np.percentile(ydata, 99)
ypad = 0.2*(yptop - ypbot)
ymin = ypbot - ypad
ymax = yptop + ypad
Example usage:
fig = plt.figure(figsize=(6, 8))
ax1 = fig.add_subplot(211)
ax1.scatter(xdata, ydata, s=1, c='blue')
ax1.set_title('Original')
ax1.axhline(y=0, color='black')
ax2 = fig.add_subplot(212)
ax2.scatter(xdata, ydata, s=1, c='blue')
ax2.axhline(y=0, color='black')
ax2.set_title('Autoscaled')
ax2.set_ylim([ymin, ymax])
plt.show()
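The limit calculation above can be wrapped in a small reusable helper. This is a sketch under the same 1st/99th percentile and 20% padding choices; the function name percentile_limits and the synthetic ydata with a single extreme value are assumptions for illustration.

```python
import numpy as np


def percentile_limits(values, lower=1, upper=99, pad_fraction=0.2):
    """Axis limits from two percentiles, widened by a data-dependent margin."""
    values = np.asarray(values, dtype=float)
    bottom, top = np.percentile(values, [lower, upper])
    pad = pad_fraction * (top - bottom)
    return bottom - pad, top + pad


# synthetic data: values in [0, 1] plus one extreme outlier
ydata = np.r_[np.linspace(0.0, 1.0, 1000), 100.0]
ymin, ymax = percentile_limits(ydata)
```

The result can be passed straight to ax.set_ylim(ymin, ymax). The outlier at 100 is still in the data; it is simply outside the visible range.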
