Is this an error in the seaborn.lineplot hue parameter? - python

With this code snippet, I'm expecting a line plot with one line per hue, which has these distinct values: [1, 5, 10, 20, 40].
import math
import pandas as pd
import seaborn as sns
sns.set(style="whitegrid")
TANH_SCALING = [1, 5, 10, 20, 40]
X_VALUES = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
COLUMNS = ['x', 'y', 'hue group']
tanh_df = pd.DataFrame(columns=COLUMNS)
for sc in TANH_SCALING:
    data = {
        COLUMNS[0]: X_VALUES,
        COLUMNS[1]: [math.tanh(x/sc) for x in X_VALUES],
        COLUMNS[2]: len(X_VALUES)*[sc]}
    tanh_df = tanh_df.append(
        pd.DataFrame(data=data, columns=COLUMNS),
        ignore_index=True
    )
sns.lineplot(x=COLUMNS[0], y=COLUMNS[1], hue=COLUMNS[2], data=tanh_df);
However, what I get is a hue legend with values [0, 15, 30, 45], and an additional line, like so:
Is this a bug or am I missing something obvious?

This is a known bug of seaborn when the hue can be cast to integers. You could add a prefix to the hue so casting to integers fails:
for sc in TANH_SCALING:
    data = {
        COLUMNS[0]: X_VALUES,
        COLUMNS[1]: [math.tanh(x/sc) for x in X_VALUES],
        COLUMNS[2]: len(X_VALUES)*[f'A{sc}']} # changes here
    tanh_df = tanh_df.append(
        pd.DataFrame(data=data, columns=COLUMNS),
        ignore_index=True
    )
Or after you created your data:
# data creation
for sc in TANH_SCALING:
    data = {
        COLUMNS[0]: X_VALUES,
        COLUMNS[1]: [math.tanh(x/sc) for x in X_VALUES],
        COLUMNS[2]: len(X_VALUES)*[sc]}  # data left unchanged
    tanh_df = tanh_df.append(
        pd.DataFrame(data=data, columns=COLUMNS),
        ignore_index=True
    )
# hue manipulation
sns.lineplot(x=COLUMNS[0], y=COLUMNS[1],
hue='A_' + tanh_df[COLUMNS[2]].astype(str), # change hue here
data=tanh_df);

As @LudvigH's comment on the other answer says, this isn't a bug, even if the default behavior is surprising in this case. As explained in the docs:
The default treatment of the hue (and to a lesser extent, size) semantic, if present, depends on whether the variable is inferred to represent “numeric” or “categorical” data. In particular, numeric variables are represented with a sequential colormap by default, and the legend entries show regular “ticks” with values that may or may not exist in the data. This behavior can be controlled through various parameters, as described and illustrated below.
Here are two specific ways to control the behavior.
If you want to keep the numeric color mapping but have the legend show the exact values in your data, set legend="full":
sns.lineplot(x=COLUMNS[0], y=COLUMNS[1], hue=COLUMNS[2], data=tanh_df, legend="full")
If you want to have seaborn treat the levels of the hue parameter as discrete categorical values, pass a named categorical colormap or either a list or dictionary of the specific colors you want to use:
sns.lineplot(x=COLUMNS[0], y=COLUMNS[1], hue=COLUMNS[2], data=tanh_df, palette="deep")
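For readers on a recent pandas, note that DataFrame.append was removed in pandas 2.0, so the data-building loop from the question no longer runs there. The following is a minimal sketch, not the original author's code, that builds the same frame with pd.concat and adds a string-valued hue column. The column name "hue label" and the "scale " prefix are illustrative assumptions; the prefix mirrors the first answer's trick of making the hue values non-numeric.

```python
import math
import pandas as pd

TANH_SCALING = [1, 5, 10, 20, 40]
X_VALUES = list(range(16))

# One small frame per scaling factor, concatenated once at the end
# (replaces the repeated DataFrame.append calls, removed in pandas 2.0).
frames = [
    pd.DataFrame({
        "x": X_VALUES,
        "y": [math.tanh(x / sc) for x in X_VALUES],
        "hue group": sc,  # scalar is broadcast to every row
    })
    for sc in TANH_SCALING
]
tanh_df = pd.concat(frames, ignore_index=True)

# Hypothetical extra column: non-numeric labels, one per scaling factor.
tanh_df["hue label"] = "scale " + tanh_df["hue group"].astype(str)
```

Passing hue="hue label" to sns.lineplot should then give one discrete legend entry per scaling factor, while hue="hue group" keeps the numeric mapping described in the docs quote above.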

Related

Extracting data from a histogram with custom bins in Python

I have a data set of distances between two particles, and I want to bin these data in custom bins. For example, I want to see how many distance values lie in the interval from 1 to 2 micrometers, and so on. I wrote some code for this, and it seems to work. This is my code for this part:
#Custom binning of data
bins= [0,1,2,3,4,5,6,7,8,9,10]
fig, ax = plt.subplots(n,m,figsize = (30,10)) #using this because I actually have 5 histograms, but only posted one here
ax.hist(dist_from_spacer1, bins=bins, edgecolor="k")
ax.set_xlabel('Distance from spacer 1 [µm]')
ax.set_ylabel('counts')
plt.xticks(bins)
plt.show()
However, now I wish to extract those data values from the intervals, and store them into lists. I tried to use:
np.histogram(dist_from_spacer1, bins=bins)
However, this just gives how many data points are on each bin and the bin intervals, just like this:
(array([ 0, 0, 44, 567, 481, 279, 309, 202, 117, 0]),
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
How can I get the exact data that belong to each histogram bin?
Yes, np.histogram computes only what a histogram needs, namely the bin boundaries and the count in each bin, so it does not return the individual data points. However, the bin boundaries are sufficient to achieve what you want by using np.digitize:
counts, bins = np.histogram(dist_from_spacer1)
indices = np.digitize(dist_from_spacer1, bins)
# np.digitize returns indices from 0 to len(bins), so one extra list is needed
lists = [[] for _ in range(len(bins) + 1)]
for i, x in zip(indices, dist_from_spacer1):
    lists[i].append(x)
In your case, the bin boundaries are predefined, so you can pass them to np.digitize directly.
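As a rough illustration of using np.digitize with the predefined bins from the question, here is a small sketch. The distances array is made-up sample data standing in for dist_from_spacer1, and the dictionary layout is just one possible way to store the result.

```python
import numpy as np

bins = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Made-up sample standing in for dist_from_spacer1.
distances = np.array([2.1, 2.9, 3.5, 4.0, 7.25, 9.99])

# Index i means bins[i-1] <= x < bins[i].
indices = np.digitize(distances, bins)

# Map each (left edge, right edge) interval to the values that fall in it.
values_per_bin = {
    (bins[i - 1], bins[i]): distances[indices == i].tolist()
    for i in range(1, len(bins))
}
```

One difference to keep in mind: np.histogram counts a value exactly equal to the last edge in the last bin, whereas np.digitize assigns it index len(bins), so such a value would need separate handling here.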

Is it possible to leave blank spaces in matplotlib's pcolormesh plots?

I'm displaying some data using matplotlib.pyplot.pcolormesh in python, and I want to leave blank spaces where there are missing data points.
Suppose I've collected data for x values 0 to 10, and y values 0 to 10, but not every such value. At present, I initialize my data storage array using np.zeros((11,11)), then use a for loop to change the values of that array to the data value if I have the data for that point.
That leaves me with a bunch of data plus some zeros in an array. When I plot this, it is impossible to distinguish between points that have no data and points which have data with small values.
Is it possible to have missing data points clearly distinct from non-missing data points? For example, in the code below I want the squares at (3,1), (5,7), and (8,8) colored but the rest of the squares white.
I've tried initializing my data storage array with np.empty((11,11)) and np.full((11,11),np.nan) as well, but they both produce the same output as np.zeros. Here's the code below:
import numpy as np
import matplotlib.pyplot as plt
data_storage = np.zeros((11,11))
collected_data = [[3, 1, 45.2], [5, 7, 23.9], [8, 8, 78.4]]
for data in collected_data:
    x_coord = data[0]
    y_coord = data[1]
    value = data[2]
    data_storage[y_coord,x_coord] = value
all_x_values = np.linspace(0,10,11)
all_y_values = np.linspace(0,10,11)
plt.pcolormesh(all_x_values, all_y_values, data_storage)
plt.show()
One approach is to change all zeros to NaN, which makes the corresponding cells transparent.
Note that the x and y values passed to pcolormesh are the positions of the cell borders, not the cell centers, so you need one more value in each dimension (11 cells, 12 cell borders). This makes it possible to create color meshes with unequal cell sizes. If you want the ticks centered on the cells, put the cell borders at the half-integer positions.
(In the code below the for loop has been written more concisely.)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
data_storage = np.zeros((11, 11))
collected_data = [[3, 1, 45.2], [5, 7, 23.9], [8, 8, 78.4]]
for x_coord, y_coord, value in collected_data:
    data_storage[y_coord, x_coord] = value
all_x_values = np.arange(0, 12) - 0.5
all_y_values = np.arange(0, 12) - 0.5
plt.pcolormesh(all_x_values, all_y_values, np.where(data_storage == 0, np.nan, data_storage))
plt.gca().xaxis.set_major_locator(MultipleLocator(1))
plt.gca().yaxis.set_major_locator(MultipleLocator(1))
plt.colorbar()
plt.show()
An alternative approach could be to create a colormap, set an 'under' color and set vmin to a value slightly larger than 0. Optionally, the 'under' color can be visualized in the colorbar with extend='min'.
from copy import copy
my_cmap = copy(plt.get_cmap('viridis'))
my_cmap.set_under('lightgrey')
plt.pcolormesh(all_x_values, all_y_values, data_storage, cmap=my_cmap, vmin=0.000001)
plt.colorbar(extend='min', extendrect=True)
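Both approaches above treat the value 0 as "missing", which would also hide a genuine measurement of 0. A possible variant, sketched here under the assumption that missing cells should be tracked explicitly, is to start from an array of NaN and mask the invalid entries. The name masked is illustrative.

```python
import numpy as np

collected_data = [[3, 1, 45.2], [5, 7, 23.9], [8, 8, 78.4]]

# Start with every cell marked as missing, then fill in what was measured.
data_storage = np.full((11, 11), np.nan)
for x_coord, y_coord, value in collected_data:
    data_storage[y_coord, x_coord] = value

# Masked array: NaN cells are masked, measured cells (even 0.0) are kept.
masked = np.ma.masked_invalid(data_storage)
```

Passing masked to plt.pcolormesh in place of the np.where(...) expression should leave the masked cells blank while coloring only the measured ones.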

Detecting pattern in OHLC data in Python [closed]

I have the following set of OHLC data:
[[datetime.datetime(2020, 7, 1, 6, 30), '0.00013449', '0.00013866', '0.00013440', '0.00013857', '430864.00000000', 1593579599999, '59.09906346', 1885, '208801.00000000', '28.63104974', '0', 3.0336828016952944], [datetime.datetime(2020, 7, 1, 7, 0), '0.00013854', '0.00013887', '0.00013767', '0.00013851', '162518.00000000', 1593581399999, '22.48036621', 809, '78014.00000000', '10.79595625', '0', -0.02165439584236435], [datetime.datetime(2020, 7, 1, 7, 30), '0.00013851', '0.00013890', '0.00013664', '0.00013780', '313823.00000000', 1593583199999, '43.21919087', 1077, '157083.00000000', '21.62390537', '0', -0.5125983683488642], [datetime.datetime(2020, 7, 1, 8, 0), '0.00013771', '0.00013818', '0.00013654', '0.00013707', '126925.00000000', 1593584999999, '17.44448931', 428, '56767.00000000', '7.79977280', '0', -0.46474475346744676], [datetime.datetime(2020, 7, 1, 8, 30), '0.00013712', '0.00013776', '0.00013656', '0.00013757', '62261.00000000', 1593586799999, '8.54915420', 330, '26921.00000000', '3.69342184', '0', 0.3281796966161107], [datetime.datetime(2020, 7, 1, 9, 0), '0.00013757', '0.00013804', '0.00013628', '0.00013640', '115154.00000000', 1593588599999, '15.80169390', 510, '52830.00000000', '7.24924784', '0', -0.8504761212473579], [datetime.datetime(2020, 7, 1, 9, 30), '0.00013640', '0.00013675', '0.00013598', '0.00013675', '66186.00000000', 1593590399999, '9.02070446', 311, '24798.00000000', '3.38107106', '0', 0.25659824046919455], [datetime.datetime(2020, 7, 1, 10, 0), '0.00013655', '0.00013662', '0.00013577', '0.00013625', '56656.00000000', 1593592199999, '7.71123423', 367, '27936.00000000', '3.80394497', '0', -0.2196997436836377], [datetime.datetime(2020, 7, 1, 10, 30), '0.00013625', '0.00013834', '0.00013625', '0.00013799', '114257.00000000', 1593593999999, '15.70194874', 679, '56070.00000000', '7.70405037', '0', 1.2770642201834814], [datetime.datetime(2020, 7, 1, 11, 0), '0.00013812', '0.00013822', '0.00013630', '0.00013805', '104746.00000000', 1593595799999, 
'14.39147417', 564, '46626.00000000', '6.39959586', '0', -0.05068056762237037], [datetime.datetime(2020, 7, 1, 11, 30), '0.00013805', '0.00013810', '0.00013720', '0.00013732', '37071.00000000', 1593597599999, '5.10447229', 231, '16349.00000000', '2.25258584', '0', -0.5287939152480996], [datetime.datetime(2020, 7, 1, 12, 0), '0.00013733', '0.00013741', '0.00013698', '0.00013724', '27004.00000000', 1593599399999, '3.70524540', 161, '15398.00000000', '2.11351192', '0', -0.06553557125171522], [datetime.datetime(2020, 7, 1, 12, 30), '0.00013724', '0.00013727', '0.00013687', '0.00013717', '27856.00000000', 1593601199999, '3.81864840', 140, '11883.00000000', '1.62931445', '0', -0.05100553774411102], [datetime.datetime(2020, 7, 1, 13, 0), '0.00013716', '0.00013801', '0.00013702', '0.00013741', '83867.00000000', 1593602999999, '11.54964001', 329, '42113.00000000', '5.80085155', '0', 0.18226888305628908], [datetime.datetime(2020, 7, 1, 13, 30), '0.00013741', '0.00013766', '0.00013690', '0.00013707', '50299.00000000', 1593604799999, '6.90474065', 249, '20871.00000000', '2.86749244', '0', -0.2474346845207872], [datetime.datetime(2020, 7, 1, 14, 0), '0.00013707', '0.00013736', '0.00013680', '0.00013704', '44745.00000000', 1593606599999, '6.13189248', 205, '14012.00000000', '1.92132206', '0', -0.02188662727072625], [datetime.datetime(2020, 7, 1, 14, 30), '0.00013704', '0.00014005', '0.00013703', '0.00013960', '203169.00000000', 1593608399999, '28.26967457', 904, '150857.00000000', '21.00600041', '0', 1.8680677174547595]]
That looks like this:
I'm trying to detect a pattern that looks like the one above in other sets of OHLC data. It doesn't have to be the same, it only needs to be similar, i.e. the number of candles doesn't have to be the same. Just the shape needs to be similar.
The problem:
I don't know where to start to accomplish this. I know it's not easy to do, but I'm sure there is a way to do this.
What I have tried:
Until now, I only managed to cut away manually the OHLC data that I don't need, so that I can only have the patterns I want. Then, I plotted it using a Pandas dataframe:
import mplfinance as mpf
import numpy as np
import pandas as pd
df = pd.DataFrame([x[:6] for x in OHLC],
columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume'])
format = '%Y-%m-%d %H:%M:%S'
df['Date'] = pd.to_datetime(df['Date'], format=format)
df = df.set_index(pd.DatetimeIndex(df['Date']))
df["Open"] = pd.to_numeric(df["Open"],errors='coerce')
df["High"] = pd.to_numeric(df["High"],errors='coerce')
df["Low"] = pd.to_numeric(df["Low"],errors='coerce')
df["Close"] = pd.to_numeric(df["Close"],errors='coerce')
df["Volume"] = pd.to_numeric(df["Volume"],errors='coerce')
mpf.plot(df, type='candle', figscale=2, figratio=(50, 50))
What I thought: A possible solution to this problem is using Neural Networks, so I would have to feed images of the patterns I want to a NN and let the NN loop through other charts and see if it can find the patterns I specified. Before going this way, I was looking for simpler solutions, since I don't know much about Neural Networks, what kind of NN I would need, or what tools I would be supposed to use.
Another solution I was thinking about was the following: I would need, somehow, to convert the pattern I want to find on other datasets in a series of values. So for example the OHLC data I posted above would be quantified, somehow, and on another set of OHLC data I would just need to find values that get close to the pattern I want. This approach is very empirical for now and I don't know how to put that in code.
A tool I was suggested to use: Stumpy
What I need:
I don't need the exact code, I only need an example, an article, a library or any kind of source that can point me to how to detect a certain pattern specified by me in an OHLC data set. I hope I was specific enough; any kind of advice is appreciated!
Stumpy will work for you.
Basic Methodology
The basic gist of the algorithm is to compute a matrix profile of a data stream, and then use that to find areas that are similar. (You can think of the matrix profile as a sliding window that gives a rating of how closely two patterns match using Z-normalized Euclidean Distance).
This article explains matrix profiles in a pretty straightforward way. Here's an excerpt that explains what you want:
Simply put, a motif is a repeated pattern in a time series and a discord is an anomaly.
With the Matrix Profile computed, it is simple to find the top-K number of motifs or
discords. The Matrix Profile stores the distances in Euclidean space meaning that a
distance close to 0 is most similar to another sub-sequence in the time series and a
distance far away from 0, say 100, is unlike any other sub-sequence. Extracting the lowest
distances gives the motifs and the largest distances gives the discords.
The benefits of using a matrix profile can be found here.
The gist of what you want to do is compute the matrix profile, then look for minima. Minima mean the sliding window matched another place well.
This example shows how to use it to find repeating patterns in one data set:
To reproduce their results, I navigated to the DAT file, downloaded it manually, then opened and read it locally instead of using their broken urllib calls to get the data.
Replace
context = ssl.SSLContext() # Ignore SSL certificate verification for simplicity
url = "https://www.cs.ucr.edu/~eamonn/iSAX/steamgen.dat"
raw_bytes = urllib.request.urlopen(url, context=context).read()
data = io.BytesIO(raw_bytes)
with
steam_df = None
with open("steamgen.dat", "r") as data:
    steam_df = pd.read_csv(data, header=None, sep=r"\s+")
I also had to add some plt.show() calls since I ran it outside of Jupyter. With those tweaks, you can run their example and see how it works.
Here's the full code I used, so you don't have to repeat what I did:
import pandas as pd
import stumpy
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import urllib
import ssl
import io
import os
def change_plot_size(width, height, plt):
    fig_size = plt.rcParams["figure.figsize"]
    fig_size[0] = width
    fig_size[1] = height
    plt.rcParams["figure.figsize"] = fig_size
    plt.rcParams["xtick.direction"] = "out"
change_plot_size(20, 6, plt)
colnames = ["drum pressure", "excess oxygen", "water level", "steam flow"]
# The tutorial's urllib download did not work for me, so this reads a manually
# downloaded copy of https://www.cs.ucr.edu/~eamonn/iSAX/steamgen.dat instead.
steam_df = None
with open("steamgen.dat", "r") as data:
    steam_df = pd.read_csv(data, header=None, sep=r"\s+")
steam_df.columns = colnames
steam_df.head()
plt.suptitle("Steamgen Dataset", fontsize="25")
plt.xlabel("Time", fontsize="20")
plt.ylabel("Steam Flow", fontsize="20")
plt.plot(steam_df["steam flow"].values)
plt.show()
m = 640
mp = stumpy.stump(steam_df["steam flow"], m)
true_P = mp[:, 0]
fig, axs = plt.subplots(2, sharex=True, gridspec_kw={"hspace": 0})
plt.suptitle("Motif (Pattern) Discovery", fontsize="25")
axs[0].plot(steam_df["steam flow"].values)
axs[0].set_ylabel("Steam Flow", fontsize="20")
rect = Rectangle((643, 0), m, 40, facecolor="lightgrey")
axs[0].add_patch(rect)
rect = Rectangle((8724, 0), m, 40, facecolor="lightgrey")
axs[0].add_patch(rect)
axs[1].set_xlabel("Time", fontsize="20")
axs[1].set_ylabel("Matrix Profile", fontsize="20")
axs[1].axvline(x=643, linestyle="dashed")
axs[1].axvline(x=8724, linestyle="dashed")
axs[1].plot(true_P)
def compare_approximation(true_P, approx_P):
    fig, ax = plt.subplots(gridspec_kw={"hspace": 0})
    ax.set_xlabel("Time", fontsize="20")
    ax.axvline(x=643, linestyle="dashed")
    ax.axvline(x=8724, linestyle="dashed")
    ax.set_ylim((5, 28))
    ax.plot(approx_P, color="C1", label="Approximate Matrix Profile")
    ax.plot(true_P, label="True Matrix Profile")
    ax.legend()
    plt.show()
approx = stumpy.scrump(steam_df["steam flow"], m, percentage=0.01, pre_scrump=False)
approx.update()
approx_P = approx.P_
seed = np.random.randint(100000)
np.random.seed(seed)
approx = stumpy.scrump(steam_df["steam flow"], m, percentage=0.01, pre_scrump=False)
compare_approximation(true_P, approx_P)
# Refine the profile
for _ in range(9):
    approx.update()
approx_P = approx.P_
compare_approximation(true_P, approx_P)
# Pre-processing
approx = stumpy.scrump(
steam_df["steam flow"], m, percentage=0.01, pre_scrump=True, s=None
)
approx.update()
approx_P = approx.P_
compare_approximation(true_P, approx_P)
Self join vs. join against target
Note that this example was a "self join", meaning it was looking for repeated patterns in its own data. You'll want to join with the target you are looking to match.
Looking at the signature of stumpy.stump shows you how to do this:
def stump(T_A, m, T_B=None, ignore_trivial=True):
    """
    Compute the matrix profile with parallelized STOMP

    This is a convenience wrapper around the Numba JIT-compiled parallelized
    `_stump` function which computes the matrix profile according to STOMP.

    Parameters
    ----------
    T_A : ndarray
        The time series or sequence for which to compute the matrix profile
    m : int
        Window size
    T_B : ndarray
        The time series or sequence that contain your query subsequences
        of interest. Default is `None` which corresponds to a self-join.
    ignore_trivial : bool
        Set to `True` if this is a self-join. Otherwise, for AB-join, set this
        to `False`. Default is `True`.

    Returns
    -------
    out : ndarray
        The first column consists of the matrix profile, the second column
        consists of the matrix profile indices, the third column consists of
        the left matrix profile indices, and the fourth column consists of
        the right matrix profile indices.
    """
What you'll want to do is pass the data (pattern) you want to look for as T_B and the larger data set you want to search in as T_A, with ignore_trivial=False since this is an AB-join rather than a self-join (see the docstring above). The window size m specifies how long each compared subsequence is (this will probably be the length of your T_B data, or smaller if you want).
Once you have the matrix profile, do a simple search for the indices of the lowest values. Each window starting at such an index is a good match. You may also want to define a threshold so that a window only counts as a match if its matrix profile value is below that threshold.
Another thing to realize is that your data set is really several correlated data sets (Open, High, Low, Close, and Volume). You'll have to decide which you want to match. Maybe you want a good match just for the opening prices, or maybe you want a good match for all of them. You'll have to decide what a good match means and calculate the matrix for each, then decide what to do if only one or a couple of those subsets match. For example, one data set may match the opening prices well, but close prices don't match as well. Another set's volume may match and that's it. Maybe you'll want to see if the normalized prices match (meaning you'd only be looking at the shape and not the relative magnitudes, i.e. a $1 stock going to $10 would look the same as a $10 one going to $100). All of that is pretty straightforward once you can compute a matrix profile.
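To make the "normalized shape" idea concrete, here is a plain-NumPy sketch of the Z-normalized Euclidean distance between one pattern and every window of a longer series. This is not Stumpy's implementation (Stumpy computes the same kind of distance far more efficiently); the function names znorm and distance_profile and the sample numbers are made up for illustration.

```python
import numpy as np

def znorm(a):
    """Scale a sequence to zero mean and unit standard deviation."""
    a = np.asarray(a, dtype=float)
    std = a.std()
    # A constant window has no shape; map it to all zeros.
    return (a - a.mean()) / std if std > 0 else np.zeros_like(a)

def distance_profile(pattern, series):
    """Distance from the pattern to each same-length window of the series."""
    pattern = znorm(pattern)
    series = np.asarray(series, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(series, len(pattern))
    return np.array([np.linalg.norm(znorm(w) - pattern) for w in windows])

# Made-up data: the series contains the pattern scaled by 10 at index 3.
pattern = [1.0, 3.0, 2.0, 5.0]
series = [7, 7, 8, 10, 30, 20, 50, 9, 9, 8]

profile = distance_profile(pattern, series)
best = int(np.argmin(profile))  # start index of the best-matching window
```

Because each window is normalized before comparison, a move from 10 to 50 matches a move from 1 to 5 exactly, which is the "$1 to $10 looks like $10 to $100" behavior described above. A threshold on profile[best] can then decide whether the best window counts as a match at all.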

An elegant way to add long names and units to plots with Holoviews

I've started to use Holoviews with Python3 and Jupyter notebooks, and I'm looking for a good way to put long names and units on my plot axis. An example looks like this:
import holoviews as hv
import pandas as pd
from IPython.display import display
hv.notebook_extension()
dataframe = pd.DataFrame({"time": [0, 1, 2, 3],
"photons": [10, 30, 20, 15],
"norm_photons": [0.33, 1, 0.67, 0.5],
"rate": [1, 3, 2, 1.5]}, index=[0, 1, 2, 3])
hvdata = hv.Table(dataframe, kdims=["time"])
display(hvdata.to.curve(vdims='rate'))
This gives me a nice plot, but instead of 'time' on the x-axis and 'rate' on the y-axis, I would prefer something like 'Time (ns)' and 'Rate (1/s)', but I don't want to type that in the code every time.
I've found this blog post by PhilippJFR which kind of does what I need, but the DFrame() function which he uses is deprecated, so I would like to avoid using that, if possible. Any ideas?
Turns out it's easy to do but hard to find in the documentation. You just pass a holoviews.Dimension instead of a string as the kdims parameter:
hvdata = hv.Table(dataframe, kdims=[hv.Dimension('time', label='Time', unit='ns')])
display(hvdata.to.curve(vdims=hv.Dimension('rate', label='Rate', unit='1/s')))
You can find good alternatives in this SO question:
Setting x and y labels with holoviews
I like doing it like this:
Creating a tuple with the name of the variable and the long name you would like to see printed on the plot:
hvdata = hv.Table(
dataframe,
kdims=[('time', 'Time (ns)')],
vdims=[('rate', 'Rate (1/s)')],
)

python pandas histogram plot including NaN values

I want to draw a histogram of some data. Sorry that I could not attach a sample histogram, as I don't have enough reputation; I hope my description of the problem is understandable. I am using Python pandas, and I realize that any NaN value is treated as a 0 by pandas. Is there any method I can use to include the count of NaN values in the histogram? What I mean is that the x-axis should have the NaN value as well. Please help... Thank you very much.
I was looking for the same thing. I ended up with the following solution:
import matplotlib.pyplot as plt
import pandas
figure = plt.figure(figsize=(6, 9), dpi=100)
graph = figure.add_subplot(111)
data = pandas.Series(data)
freq = data.value_counts()  # counts of the non-missing values (NaN is dropped by default)
graph.bar(freq.index, freq.values)  # gives the graph without NaN
graph.bar([0], [data.isna().sum()])  # gives a bar for the number of missing values at x=0
figure.show()
This gave me a histogram with a column at 0 showing the number of missing values in the data.
Did you try replacing NaN with some other unique value and then plot the histogram?
x = -1  # some unique value that does not otherwise occur in the data
plt.hist(df.replace(np.nan, x))
As pointed out by Sreeram TP, it is possible to use the argument dropna=False in the function value_counts to include the counts of NaNs.
df = pd.DataFrame({'feature1': [1, 2, 2, 4, 3, 2, 3, 4, np.nan],
                   'feature2': [4, 4, 3, 4, 1, 4, 3, np.nan, np.nan]})
# Calculates the histogram for feature1
counts = df['feature1'].value_counts(dropna=False)
counts.plot.bar(title='feat1', grid=True)
By using .iloc[::-1] on the output of value_counts(), you can reverse its order.
The code would look like this:
df["column"].value_counts().iloc[::-1]
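As a small illustration of the value_counts(dropna=False) approach from the answers above, the sketch below counts the values including NaN, orders the result by value, and relabels the index so that the missing-value bar gets a readable tick label. The sample series and the "NaN" label are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value.
feature = pd.Series([1, 2, 2, 4, 3, 2, 3, 4, np.nan])

# Count every value including NaN, ordered by value (NaN sorts last).
counts = feature.value_counts(dropna=False).sort_index()

# Replace the float index with string labels so NaN shows up on the axis.
counts.index = ["NaN" if pd.isna(v) else str(int(v)) for v in counts.index]
```

Calling counts.plot.bar() on the result should then draw one bar per value plus a separate bar labeled NaN.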
