I have a data set of distances between two particles, and I want to bin these data in custom bins. For example, I want to see how many distance values lie in the interval from 1 to 2 micrometers, and so on. I wrote some code for this, and it seems to work. This is my code for this part:
#Custom binning of data
bins= [0,1,2,3,4,5,6,7,8,9,10]
fig, ax = plt.subplots(n,m,figsize = (30,10)) #using this because I actually have 5 histograms, but only posted one here
ax.hist(dist_from_spacer1, bins=bins, edgecolor="k")
ax.set_xlabel('Distance from spacer 1 [µm]')
ax.set_ylabel('counts')
plt.xticks(bins)
plt.show()
However, now I wish to extract those data values from the intervals, and store them into lists. I tried to use:
np.histogram(dist_from_spacer1, bins=bins)
However, this just gives the number of data points in each bin and the bin edges, like this:
(array([ 0, 0, 44, 567, 481, 279, 309, 202, 117, 0]),
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
How can I get the exact data that belong to each histogram bin?
Yes, np.histogram calculates only what a histogram needs: the bin boundaries and the count for each bin, not the individual data points. However, the bin boundaries are sufficient to achieve what you want by using np.digitize:
counts, bins = np.histogram(dist_from_spacer1)
indices = np.digitize(dist_from_spacer1, bins)
# digitize returns indices from 0 to len(bins), so len(bins) + 1 lists are needed
lists = [[] for _ in range(len(bins) + 1)]
for i, x in zip(indices, dist_from_spacer1):
    lists[i].append(x)
In your case, the bin boundaries are predefined, so you can use np.digitize directly.
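As a minimal sketch of that direct use, assuming a small made-up sample in place of dist_from_spacer1 (the names distances and values_per_bin are only for illustration):

```python
import numpy as np

# Hypothetical sample standing in for dist_from_spacer1
distances = np.array([2.1, 2.9, 3.5, 3.7, 4.2, 8.99, 10.0, 0.5])
bins = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# digitize returns i such that bins[i-1] <= x < bins[i];
# 0 means below bins[0], len(bins) means at or above bins[-1]
indices = np.digitize(distances, bins)

# One list per interval [bins[k], bins[k+1])
values_per_bin = [
    distances[indices == k + 1].tolist() for k in range(len(bins) - 1)
]
```

One difference to keep in mind: np.histogram counts a value equal to the last edge (10 here) in the last bin, while np.digitize with the default right=False places it past the last bin, so it is left out of values_per_bin above.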
I'm displaying some data using matplotlib.pyplot.pcolormesh in python, and I want to leave blank spaces where there are missing data points.
Suppose I've collected data for x values 0 to 10, and y values 0 to 10, but not every such value. At present, I initialize my data storage array using np.zeros((11,11)), then use a for loop to change the values of that array to the data value if I have the data for that point.
That leaves me with a bunch of data plus some zeros in an array. When I plot this, it is impossible to distinguish between points that have no data and points which have data with small values.
Is it possible to have missing data points clearly distinct from non-missing data points? For example, in the code below I want the squares at (3,1), (5,7), and (8,8) colored but the rest of the squares white.
I've tried initializing my data storage array with np.empty((11,11)) and np.full((11,11),np.nan) as well, but they both produce the same output as np.zeros. Here's the code below:
import numpy as np
import matplotlib.pyplot as plt
data_storage = np.zeros((11,11))
collected_data = [[3, 1, 45.2], [5, 7, 23.9], [8, 8, 78.4]]
for data in collected_data:
    x_coord = data[0]
    y_coord = data[1]
    value = data[2]
    data_storage[y_coord,x_coord] = value
all_x_values = np.linspace(0,10,11)
all_y_values = np.linspace(0,10,11)
plt.pcolormesh(all_x_values, all_y_values, data_storage)
plt.show()
One approach is to change all zeros to NaN, which would make the corresponding cells transparent.
Please note that the x and y values for pcolormesh are for the grid points, not for the centers, so you need one more value in each dimension (11 cells, 12 cell borders). This makes it possible to create color meshes with unequal cell sizes. If you want the ticks to sit nicely at the center of the cells, you can put the cell borders at the halves.
(In the code below the for loop has been written more concisely.)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
data_storage = np.zeros((11, 11))
collected_data = [[3, 1, 45.2], [5, 7, 23.9], [8, 8, 78.4]]
for x_coord, y_coord, value in collected_data:
    data_storage[y_coord, x_coord] = value
all_x_values = np.arange(0, 12) - 0.5
all_y_values = np.arange(0, 12) - 0.5
plt.pcolormesh(all_x_values, all_y_values, np.where(data_storage == 0, np.nan, data_storage))
plt.gca().xaxis.set_major_locator(MultipleLocator(1))
plt.gca().yaxis.set_major_locator(MultipleLocator(1))
plt.colorbar()
plt.show()
An alternative approach could be to create a colormap, set an 'under' color and set vmin to a value slightly larger than 0. Optionally, the 'under' color can be visualized in the colorbar with extend='min'.
from copy import copy
# plt.get_cmap is used because matplotlib.cm.get_cmap was removed in recent Matplotlib releases
my_cmap = copy(plt.get_cmap('viridis'))
my_cmap.set_under('lightgrey')
plt.pcolormesh(all_x_values, all_y_values, data_storage, cmap=my_cmap, vmin=0.000001)
plt.colorbar(extend='min', extendrect=True)
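Both approaches above treat every zero as missing, which would also hide a genuine measurement of 0. A possible variant, sketched here under the assumption that pcolormesh leaves NaN or masked cells unpainted as described above, is to start from an all-NaN array so that only cells you actually fill get a color (cell_edges and masked are names made up for this sketch):

```python
import numpy as np

collected_data = [[3, 1, 45.2], [5, 7, 23.9], [8, 8, 78.4]]

# Start with every cell marked as missing
data_storage = np.full((11, 11), np.nan)
for x_coord, y_coord, value in collected_data:
    data_storage[y_coord, x_coord] = value

# 11 cells need 12 borders; the 0.5 offset puts ticks at the cell centers
cell_edges = np.arange(data_storage.shape[0] + 1) - 0.5

# Optional: a masked array makes the missing cells explicit
masked = np.ma.masked_invalid(data_storage)
n_present = masked.count()
```

The result could then be drawn with plt.pcolormesh(cell_edges, cell_edges, masked).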
I have the following set of OHLC data:
[[datetime.datetime(2020, 7, 1, 6, 30), '0.00013449', '0.00013866', '0.00013440', '0.00013857', '430864.00000000', 1593579599999, '59.09906346', 1885, '208801.00000000', '28.63104974', '0', 3.0336828016952944], [datetime.datetime(2020, 7, 1, 7, 0), '0.00013854', '0.00013887', '0.00013767', '0.00013851', '162518.00000000', 1593581399999, '22.48036621', 809, '78014.00000000', '10.79595625', '0', -0.02165439584236435], [datetime.datetime(2020, 7, 1, 7, 30), '0.00013851', '0.00013890', '0.00013664', '0.00013780', '313823.00000000', 1593583199999, '43.21919087', 1077, '157083.00000000', '21.62390537', '0', -0.5125983683488642], [datetime.datetime(2020, 7, 1, 8, 0), '0.00013771', '0.00013818', '0.00013654', '0.00013707', '126925.00000000', 1593584999999, '17.44448931', 428, '56767.00000000', '7.79977280', '0', -0.46474475346744676], [datetime.datetime(2020, 7, 1, 8, 30), '0.00013712', '0.00013776', '0.00013656', '0.00013757', '62261.00000000', 1593586799999, '8.54915420', 330, '26921.00000000', '3.69342184', '0', 0.3281796966161107], [datetime.datetime(2020, 7, 1, 9, 0), '0.00013757', '0.00013804', '0.00013628', '0.00013640', '115154.00000000', 1593588599999, '15.80169390', 510, '52830.00000000', '7.24924784', '0', -0.8504761212473579], [datetime.datetime(2020, 7, 1, 9, 30), '0.00013640', '0.00013675', '0.00013598', '0.00013675', '66186.00000000', 1593590399999, '9.02070446', 311, '24798.00000000', '3.38107106', '0', 0.25659824046919455], [datetime.datetime(2020, 7, 1, 10, 0), '0.00013655', '0.00013662', '0.00013577', '0.00013625', '56656.00000000', 1593592199999, '7.71123423', 367, '27936.00000000', '3.80394497', '0', -0.2196997436836377], [datetime.datetime(2020, 7, 1, 10, 30), '0.00013625', '0.00013834', '0.00013625', '0.00013799', '114257.00000000', 1593593999999, '15.70194874', 679, '56070.00000000', '7.70405037', '0', 1.2770642201834814], [datetime.datetime(2020, 7, 1, 11, 0), '0.00013812', '0.00013822', '0.00013630', '0.00013805', '104746.00000000', 1593595799999, 
'14.39147417', 564, '46626.00000000', '6.39959586', '0', -0.05068056762237037], [datetime.datetime(2020, 7, 1, 11, 30), '0.00013805', '0.00013810', '0.00013720', '0.00013732', '37071.00000000', 1593597599999, '5.10447229', 231, '16349.00000000', '2.25258584', '0', -0.5287939152480996], [datetime.datetime(2020, 7, 1, 12, 0), '0.00013733', '0.00013741', '0.00013698', '0.00013724', '27004.00000000', 1593599399999, '3.70524540', 161, '15398.00000000', '2.11351192', '0', -0.06553557125171522], [datetime.datetime(2020, 7, 1, 12, 30), '0.00013724', '0.00013727', '0.00013687', '0.00013717', '27856.00000000', 1593601199999, '3.81864840', 140, '11883.00000000', '1.62931445', '0', -0.05100553774411102], [datetime.datetime(2020, 7, 1, 13, 0), '0.00013716', '0.00013801', '0.00013702', '0.00013741', '83867.00000000', 1593602999999, '11.54964001', 329, '42113.00000000', '5.80085155', '0', 0.18226888305628908], [datetime.datetime(2020, 7, 1, 13, 30), '0.00013741', '0.00013766', '0.00013690', '0.00013707', '50299.00000000', 1593604799999, '6.90474065', 249, '20871.00000000', '2.86749244', '0', -0.2474346845207872], [datetime.datetime(2020, 7, 1, 14, 0), '0.00013707', '0.00013736', '0.00013680', '0.00013704', '44745.00000000', 1593606599999, '6.13189248', 205, '14012.00000000', '1.92132206', '0', -0.02188662727072625], [datetime.datetime(2020, 7, 1, 14, 30), '0.00013704', '0.00014005', '0.00013703', '0.00013960', '203169.00000000', 1593608399999, '28.26967457', 904, '150857.00000000', '21.00600041', '0', 1.8680677174547595]]
That looks like this:
I'm trying to detect a pattern that looks like the one above in other sets of OHLC data. It doesn't have to be the same, it only needs to be similar, i.e. the number of candles doesn't have to be the same. Just the shape needs to be similar.
The problem:
I don't know where to start to accomplish this. I know it's not easy to do, but I'm sure there is a way to do this.
What I have tried:
Until now, I only managed to cut away manually the OHLC data that I don't need, so that I can only have the patterns I want. Then, I plotted it using a Pandas dataframe:
import mplfinance as mpf
import numpy as np
import pandas as pd
df = pd.DataFrame([x[:6] for x in OHLC],
columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume'])
format = '%Y-%m-%d %H:%M:%S'
df['Date'] = pd.to_datetime(df['Date'], format=format)
df = df.set_index(pd.DatetimeIndex(df['Date']))
df["Open"] = pd.to_numeric(df["Open"],errors='coerce')
df["High"] = pd.to_numeric(df["High"],errors='coerce')
df["Low"] = pd.to_numeric(df["Low"],errors='coerce')
df["Close"] = pd.to_numeric(df["Close"],errors='coerce')
df["Volume"] = pd.to_numeric(df["Volume"],errors='coerce')
mpf.plot(df, type='candle', figscale=2, figratio=(50, 50))
What I thought: A possible solution to this problem is using Neural Networks, so I would have to feed images of the patterns I want to a NN and let the NN loop through other charts and see if it can find the patterns I specified. Before going this way, I was looking for simpler solutions, since I don't know much about Neural Networks and I don't know what kind of NN I would need or what tools I would be supposed to use.
Another solution I was thinking about was the following: I would need, somehow, to convert the pattern I want to find in other datasets into a series of values. So for example the OHLC data I posted above would be quantified, somehow, and on another set of OHLC data I would just need to find values that get close to the pattern I want. This approach is very empirical for now and I don't know how to put that in code.
A tool I was suggested to use: Stumpy
What I need:
I don't need the exact code, I only need an example, an article, a library or any kind of source that can point me to how to detect a certain pattern, specified by me, in an OHLC data set. I hope I was specific enough; any kind of advice is appreciated!
Stumpy will work for you.
Basic Methodology
The basic gist of the algorithm is to compute a matrix profile of a data stream, and then use that to find areas that are similar. (You can think of the matrix profile as a sliding window that gives a rating of how closely two patterns match using Z-normalized Euclidean Distance.)
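To make the sliding-window idea concrete, here is a naive sketch of a Z-normalized Euclidean distance between one query pattern and every window of a longer series. It is only an illustration of the distance measure, not how Stumpy computes it; znorm and distance_profile are helper names invented for this sketch, and the numbers are made up.

```python
import numpy as np

def znorm(a):
    """Scale a window to zero mean and unit standard deviation."""
    a = np.asarray(a, dtype=float)
    std = a.std()
    return (a - a.mean()) / std if std > 0 else np.zeros_like(a)

def distance_profile(query, series):
    """Distance from `query` to every window of `series` of the same length."""
    m = len(query)
    q = znorm(query)
    series = np.asarray(series, dtype=float)
    return np.array([
        np.linalg.norm(q - znorm(series[i:i + m]))
        for i in range(len(series) - m + 1)
    ])

pattern = [1.0, 3.0, 2.0, 5.0]
# The window starting at index 3 is the pattern scaled by 10
series = [7.0, 7.5, 7.0, 10.0, 30.0, 20.0, 50.0, 8.0, 7.0]

profile = distance_profile(pattern, series)
best_start = int(np.argmin(profile))  # start of the closest window
```

Because each window is normalized before comparison, the scaled copy of the pattern gets a distance of essentially zero even though its absolute values are ten times larger.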
This article explains matrix profiles in a pretty straightforward way. Here's an excerpt that explains what you want:
Simply put, a motif is a repeated pattern in a time series and a discord is an anomaly.
With the Matrix Profile computed, it is simple to find the top-K number of motifs or
discords. The Matrix Profile stores the distances in Euclidean space meaning that a
distance close to 0 is most similar to another sub-sequence in the time series and a
distance far away from 0, say 100, is unlike any other sub-sequence. Extracting the lowest
distances gives the motifs and the largest distances gives the discords.
The benefits of using a matrix profile can be found here.
The gist of what you want to do is compute the matrix profile, then look for minima. Minima mean the sliding window matched another place well.
This example shows how to use it to find repeating patterns in one data set:
To reproduce their results, I downloaded the DAT file manually, then opened and read it locally instead of using their broken urllib calls to get the data.
Replace
context = ssl.SSLContext() # Ignore SSL certificate verification for simplicity
url = "https://www.cs.ucr.edu/~eamonn/iSAX/steamgen.dat"
raw_bytes = urllib.request.urlopen(url, context=context).read()
data = io.BytesIO(raw_bytes)
with
steam_df = None
with open("steamgen.dat", "r") as data:
    steam_df = pd.read_csv(data, header=None, sep=r"\s+")
I also had to add some plt.show() calls since I ran it outside of Jupyter. With those tweaks, you can run their example and see how it works.
Here's the full code I used, so you don't have to repeat what I did:
import pandas as pd
import stumpy
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import urllib
import ssl
import io
import os
def change_plot_size(width, height, plt):
    fig_size = plt.rcParams["figure.figsize"]
    fig_size[0] = width
    fig_size[1] = height
    plt.rcParams["figure.figsize"] = fig_size
    plt.rcParams["xtick.direction"] = "out"
change_plot_size(20, 6, plt)
colnames = ["drum pressure", "excess oxygen", "water level", "steam flow"]
# Read the manually downloaded file instead of fetching it with urllib
steam_df = None
with open("steamgen.dat", "r") as data:
    steam_df = pd.read_csv(data, header=None, sep=r"\s+")
steam_df.columns = colnames
steam_df.head()
plt.suptitle("Steamgen Dataset", fontsize="25")
plt.xlabel("Time", fontsize="20")
plt.ylabel("Steam Flow", fontsize="20")
plt.plot(steam_df["steam flow"].values)
plt.show()
m = 640
mp = stumpy.stump(steam_df["steam flow"], m)
true_P = mp[:, 0]
fig, axs = plt.subplots(2, sharex=True, gridspec_kw={"hspace": 0})
plt.suptitle("Motif (Pattern) Discovery", fontsize="25")
axs[0].plot(steam_df["steam flow"].values)
axs[0].set_ylabel("Steam Flow", fontsize="20")
rect = Rectangle((643, 0), m, 40, facecolor="lightgrey")
axs[0].add_patch(rect)
rect = Rectangle((8724, 0), m, 40, facecolor="lightgrey")
axs[0].add_patch(rect)
axs[1].set_xlabel("Time", fontsize="20")
axs[1].set_ylabel("Matrix Profile", fontsize="20")
axs[1].axvline(x=643, linestyle="dashed")
axs[1].axvline(x=8724, linestyle="dashed")
axs[1].plot(true_P)
def compare_approximation(true_P, approx_P):
    fig, ax = plt.subplots(gridspec_kw={"hspace": 0})
    ax.set_xlabel("Time", fontsize="20")
    ax.axvline(x=643, linestyle="dashed")
    ax.axvline(x=8724, linestyle="dashed")
    ax.set_ylim((5, 28))
    ax.plot(approx_P, color="C1", label="Approximate Matrix Profile")
    ax.plot(true_P, label="True Matrix Profile")
    ax.legend()
    plt.show()
approx = stumpy.scrump(steam_df["steam flow"], m, percentage=0.01, pre_scrump=False)
approx.update()
approx_P = approx.P_
seed = np.random.randint(100000)
np.random.seed(seed)
approx = stumpy.scrump(steam_df["steam flow"], m, percentage=0.01, pre_scrump=False)
compare_approximation(true_P, approx_P)
# Refine the profile
for _ in range(9):
    approx.update()
approx_P = approx.P_
compare_approximation(true_P, approx_P)
# Pre-processing
approx = stumpy.scrump(
    steam_df["steam flow"], m, percentage=0.01, pre_scrump=True, s=None
)
approx.update()
approx_P = approx.P_
compare_approximation(true_P, approx_P)
Self join vs. join against target
Note that this example was a "self join", meaning it was looking for repeated patterns in its own data. You'll want to join with the target you are looking to match.
Looking at the signature of stumpy.stump shows you how to do this:
def stump(T_A, m, T_B=None, ignore_trivial=True):
    """
    Compute the matrix profile with parallelized STOMP

    This is a convenience wrapper around the Numba JIT-compiled parallelized
    `_stump` function which computes the matrix profile according to STOMP.

    Parameters
    ----------
    T_A : ndarray
        The time series or sequence for which to compute the matrix profile
    m : int
        Window size
    T_B : ndarray
        The time series or sequence that contain your query subsequences
        of interest. Default is `None` which corresponds to a self-join.
    ignore_trivial : bool
        Set to `True` if this is a self-join. Otherwise, for AB-join, set this
        to `False`. Default is `True`.

    Returns
    -------
    out : ndarray
        The first column consists of the matrix profile, the second column
        consists of the matrix profile indices, the third column consists of
        the left matrix profile indices, and the fourth column consists of
        the right matrix profile indices.
    """
What you'll want to do is pass the data (pattern) you want to look for as T_B and then the larger sets you want to look in as T_A. The window size specifies how large of a search area you want (this will probably be the length of your T_B data, I'd imagine, or smaller if you want).
Once you have the matrix profile, you will just want to do a simple search and get the indices of the lowest values. Each window starting at one of those indices is a good match. You may also want to define a maximum acceptable distance, so that you only consider a window a match if its matrix profile value falls below that threshold.
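That search could look roughly like the sketch below. It works on any 1-D array of distances, such as the first column of the array returned by stumpy.stump; the function name find_matches, the threshold, and the sample values are all made up for illustration. Skipping overlapping windows is an extra design choice here, so that one good match is not reported several times at neighboring indices.

```python
import numpy as np

def find_matches(profile, m, max_distance, max_matches=3):
    """Return start indices of the best non-overlapping windows."""
    profile = np.asarray(profile, dtype=float)
    matches = []
    for idx in np.argsort(profile):  # lowest distance first
        if profile[idx] > max_distance:
            break  # every remaining window is a worse match
        # Skip windows that overlap one that was already accepted
        if all(abs(idx - kept) >= m for kept in matches):
            matches.append(int(idx))
        if len(matches) == max_matches:
            break
    return matches

# Hypothetical distances, as if taken from a matrix profile
profile = [5.0, 0.4, 0.5, 4.0, 6.0, 0.9, 3.0, 7.0]
matches = find_matches(profile, m=3, max_distance=1.0)
```

With these numbers the windows starting at indices 1 and 5 are kept; index 2 is dropped because it overlaps the better match at index 1.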
Another thing to realize is that your data set is really several correlated data sets (Open, High, Low, Close, and Volume). You'll have to decide which you want to match. Maybe you want a good match just for the opening prices, or maybe you want a good match for all of them. You'll have to decide what a good match means and calculate the matrix profile for each, then decide what to do if only one or a couple of those subsets match. For example, one data set may match the opening prices well, but the closing prices may not match as well. Another set's volume may match and that's it. Maybe you'll want to see if the normalized prices match (meaning you'd only be looking at the shape and not the relative magnitudes, i.e. a $1 stock going to $10 would look the same as a $10 one going to $100). All of that is pretty straightforward once you can compute a matrix profile.
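One way to picture the per-column decision is the sketch below, which compares a pattern and a single candidate window column by column using a Z-normalized distance. All prices, the column choice, and the 0.5 threshold are invented for illustration; in practice the per-column distances would come from a matrix profile computed for each column.

```python
import numpy as np

def znorm(a):
    """Scale values to zero mean and unit standard deviation."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean()) / a.std()

def shape_distance(a, b):
    """Z-normalized Euclidean distance between two equal-length windows."""
    return float(np.linalg.norm(znorm(a) - znorm(b)))

# Hypothetical pattern and candidate window
pattern = {"Open": [1.0, 2.0, 4.0, 10.0], "Close": [2.0, 4.0, 10.0, 9.0]}
candidate = {
    "Open": [10.0, 20.0, 40.0, 100.0],  # same shape at 10x the price
    "Close": [20.0, 15.0, 12.0, 11.0],  # different shape (falling)
}

per_column = {
    name: shape_distance(pattern[name], candidate[name]) for name in pattern
}
threshold = 0.5  # assumed cut-off, to be tuned on real data
matches_all = all(d < threshold for d in per_column.values())
matches_any = any(d < threshold for d in per_column.values())
```

Here the Open columns match almost perfectly despite the tenfold price difference, while the Close columns do not, so the outcome depends on whether you require all columns to match or only some of them.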