how to create interactive graph on a large data set?

how to create interactive graph on a large data set? - python

I am trying to create an interactive graph using holoviews on a large data set. Below is a sample of the data file called trackData.cvs
Event Time ID Venue
Javeline 11:25:21:012345 JVL Dome
Shot pot 11:25:22:778929 SPT Dome
4x4 11:25:21:993831 FOR Track
4x4 11:25:22:874293 FOR Track
Shot pot 11:25:21:087822 SPT Dome
Javeline 11:25:23:878792 JVL Dome
Long Jump 11:25:21:892902 LJP Aquatic
Long Jump 11:25:22:799422 LJP Aquatic
This is how I read the data and plot a scatter plot.
trackData = pd.read_csv('trackData.csv')
scatter = hv.Scatter(trackData, 'Time', 'ID')
scatter
Because this data set is quite huge, zooming in and out of the scatter plot is very slow and would like to speed this process up.
I researched and found about holoviews decimate that is recommended on large datasets but I don't know how to use in the above code.
Most cases I tried seems to throw an error. Also, is there a way to make sure the Time column is converted to micros? Thanks in advance for the help

Datashader indeed does not handle categorical axes as used here, but that's not so much a limitation of the software than of my imagination -- what should it be doing with them? A Datashader scatterplot (Canvas.points) is meant for a very large number of points located on a continuously indexed 2D plane. Such a plot approximates a 2D probability distribution function, accumulating points per pixel to show the density in that region, and revealing spatial patterns across pixels.
A categorical axis doesn't have the same properties that a continuous numerical axis does, because there's no spatial relationship between adjacent values. Specifically in this case, there's no apparent meaning to an ordering of the ID field (it appears to be a letter code for a sporting event type), so I can't see any meaning to accumulating across ID values per pixel the way Datashader is designed to do. Even if you convert IDs to numbers, you'll either just get random-looking noise (if there are more ID values than vertical pixels), or a series of spotty lines (if there are fewer ID values than pixels).
Here, maybe there are only a few dozen or so unique ID values, but many, many time measurements? In that case most people would use a box, violin, histogram, or ridge plot per ID, to see the distribution of values for each ID value. A Datashader points plot is a 2D histogram, but if one axis is categorical you're really dealing with a set of 1D histograms, not a single combined 2D histogram, so just use histograms if that's what you're after.
If you really do want to try plotting all the points per ID as raw points, you could do that using vertical spike events as in https://examples.pyviz.org/iex_trading/IEX_stocks.html . You can also add some vertical jitter and then use Datashader, but that's not something directly supported right now, and it doesn't have the clear mathematical interpretation that a normal Datashader plot does (in terms of approximating a density function).

The disadvantage of decimate() is that it downsamples your datapoints.
I think you need datashader() here, but datashader doesn't like that ID is a categorical variable instead of a numerical value.
So a solution could be to convert your categorical variable to a numerical code.
See the code example below for both hvPlot (which I prefer) and HoloViews:
import io
import pandas as pd
import hvplot.pandas
import holoviews as hv
# dynspread is for making point sizes larger when using datashade
from holoviews.operation.datashader import datashade, dynspread
# sample data
text = """
Event Time ID Venue
Javeline 11:25:21:012345 JVL Dome
Shot pot 11:25:22:778929 SPT Dome
4x4 11:25:21:993831 FOR Track
4x4 11:25:22:874293 FOR Track
Shot pot 11:25:21:087822 SPT Dome
Javeline 11:25:23:878792 JVL Dome
Long Jump 11:25:21:892902 LJP Aquatic
Long Jump 11:25:22:799422 LJP Aquatic
"""
# create dataframe and parse time
df = pd.read_csv(io.StringIO(text), sep='\s{2,}', engine='python')
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S:%f')
df = df.set_index('Time').sort_index()
# get a column that converts categorical id's to numerical id's
df['ID'] = pd.Categorical(df['ID'])
df['ID_code'] = df['ID'].cat.codes
# use this to overwrite numerical yticks with categorical yticks
yticks=[(0, 'FOR'), (1, 'JVL'), (2, 'LJP'), (3, 'SPT')]
# this is the hvplot solution: set datashader=True
df.hvplot.scatter(
x='Time',
y='ID_code',
datashade=True,
dynspread=True,
padding=0.05,
).opts(yticks=yticks)
# this is the holoviews solution
scatter = hv.Scatter(df, kdims=['Time'], vdims=['ID_code'])
dynspread(datashade(scatter)).opts(yticks=yticks, padding=0.05)
More info on datashader and decimate:
http://holoviews.org/user_guide/Large_Data.html
Resulting plot:

Related

How to determine multiple Periodicities present in Timeseries data?

My objective is to detect all kinds of seasonalities and their time periods that are present in a timeseries waveform.
I'm currently using the following dataset:
https://www.kaggle.com/rakannimer/air-passengers
At the moment, I've tried the following approaches:
1) Use of FFT:
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
#https://www.kaggle.com/rakannimer/air-passengers
df=pd.read_csv('AirPassengers.csv')
df.head()
frequency_eval_max = 100
A_signal_rfft = scipy.fft.rfft(df['#Passengers'], n=frequency_eval_max)
n = np.shape(A_signal_rfft)[0] # np.size(t)
frequencies_rel = len(A_signal_fft)/frequency_eval_max * np.linspace(0,1,int(n))
fig=plt.figure(3, figsize=(15,6))
plt.clf()
plt.plot(frequencies_rel, np.abs(A_signal_rfft), lw=1.0, c='paleturquoise')
plt.stem(frequencies_rel, np.abs(A_signal_rfft))
plt.xlabel("frequency")
plt.ylabel("amplitude")
This results in the following plot:
But it doesn't result in anything conclusive or comprehensible.
Ideally I wish to see the peaks representing daily, weekly, monthly and yearly seasonality.
Could anyone point out what am I doing wrong?
2) Autocorrelation:
from pandas.plotting import autocorrelation_plot
plt.rcParams.update({'figure.figsize':(10,6), 'figure.dpi':120})
autocorrelation_plot(df['#Passengers'].tolist())
After doing which I get a plot like the following:
But how do I read this plot and how can I derive the presence of the various seasonalities and their periods from this?
3) SLT Decomposition Algorithm
df.set_index('Month',inplace=True)
df.index=pd.to_datetime(df.index)
#drop null values
df.dropna(inplace=True)
df.plot()
result=seasonal_decompose(df['#Passengers'], model='multiplicable', period=12)
result.seasonal.plot()
This gives the following plot:
But here I can only see one kind of seasonality.
So how do we detect all the types of seasonalities and their time periods that are present using this method?
Hence, I've tried 3 different approaches but they seem either erroneous or incomplete.
Could anyone please help me out with the most effective approach (even apart from the ones I've tried) to detect all kinds of seasonalities and their time periods for any given timeseries data?

I still think a Fourier analysis is the way to go, its just that the 0-frequency result is shadowing any insight.
This is essentially the square of the average of your data set, and all records are positive, far from the typical sinusoidal function you would analyze with Fourier Transforms. So simply subtract the average of your dataset to your dataset before doing the FFT and see how it looks. This would also help with the autocorrelation technique.
Also, you MUST give units to your frequency values. Do not settle for the raw values from the FFT. Those are related to the sampling frequency and span of your dataset. Reason about it and adequately label the daily, weekly, monthly and anual frequencies in your chart.

using FFT, you can get the fundamental frequency. you can then use a low-pass filter or just manually select the first n frequencies. these frequencies will correspond to the 'seasonalities'. transform your filtered FFT into time domain and you can visualize the most basic underlying repetitions, you can easily calculate the time period of those repetitions and visualize it by individually plotting the F0,F1,... in time domain.

How to plot linear regression between two continous values?

I am trying to implement a Machine-Learning algorithm to predict house prices in New-York-City.
Now, when I try to plot (using Seaborn) the relationship between two columns of my house-prices dataset: 'gross_sqft_thousands' (the gross area of the property in thousands of square feets) and the target-column which is the 'sale_price_millions', I get a weird plot like this one:
Code used to plot:
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df);
When I try to plot the number of commercial units (commercial_units column) versus the sale_price_millions, I get also a weird plot like this one:
These weird plots, although in the correlation matrix, the sale_price correlates very good with both variables (gross_sqft_thousands and commercial_units).
What am I doing wrong, and what should I do to get great plot, with less points and a clear fitting like this plot:
Here is a part of my dataset:

Your housing price dataset is much larger than the tips dataset shown in that Seaborn example plot, so scatter plots made with default settings will be massively overcrowded.
The second plot looks "weird" because it plots a (practically) continuous variable, sales price, against an integer-valued variable, total_units.
The following solutions come to mind:
Downsample the dataset with something like sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df[::10]). The [::10] part selects every 10th line from clean_df. You could also try clean_df.sample(frac=0.1, random_state=12345), which randomly samples 10% of all rows
without replacement (using a random seed for reproducibility).
Reduce the alpha (opacity) and/or size of the scatterplot points with sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df, scatter_kws={"alpha": 0.1, "s": 1}).
For plot 2, add a bit of "jitter" (random noise) to the y-axis variable with sns.regplot(..., y_jitter=0.05).
For more, check out the Seaborn documentation on regplot: https://seaborn.pydata.org/generated/seaborn.regplot.html

Python [Bokeh charts] Heatmap ranges & coloring

I would like to represent my data (which consist of 256 values) using a bokeh heat map where each value has its own color (so every item with the same value should have the same color).
I've been experimenting and bokeh is doing ranges for me such as the range between 24 - 47 has the same color and so on, but i wish to have a color for each value.
What is the best way to approach this problem?
I've been experimenting with palettes and some perform way better than others, for example Inferno256 is doing a good job but is that the correct way to solve this? I mean is there a way to tell the chart/heat-map to display every value with a color (specify ranges?) or should i for example define a palette of 256 colors any thoughts?
Example where bokeh create big ranges for me:
Data=column_of_values[:1000]
data = {'fruit': [1]*len(Data), # Sections
'fruit_count': Data,
'sample': list(range(1,len(Data)+1))}
hm = HeatMap(data, x='sample', y='fruit', values='fruit_count', palette=bp.Plasma11 , title='Fruits', stat=None)
hm.width=5000
output_file('heatmap.html')
show(hm)
The second part of my question (if possible), does bokeh handle big data well?
forexample plotting 1000 values is different from plotting 10,000 using the same code, the values seem to be smashed together or something, should I fix that by expanding the width or something else :-)
Heat map plotting 1000 values then 10,000 values

How to set pivots on colorbar at customisable location in Matplotlib?

I have some trouble in using Matplotlib colorbar, perhaps I am not understanding the documentation (I am not a native English speaker) or its core concept.
Suppose I have a matrix of data (shape, N*2). I want to make a scatter plot of this data and add a color scheme based on a column of label (N*1), in float. I know how to use colorbar and scalarmappable.
But, I am interested in some pivot values in this label column, and I wish to present these value in some interesting position of the colorbar. For example, label value 0, I want to position it at 1/3 place or in the middle -- which in the colorbar I choose could have a white or grey colour.
But if I understand it correctly, colorbar only takes data array that mapped in [0, 1] from the original data in [min, max]. In this case, the pivot value that I am interested would be end up in somewhere random, unless I define my normalisation function very carefully.
So to put the white colour I prefer for my pivot value is in the middle of the colour bar, I have to have defined the normalisation function which not only normalised my data, but also make the pivot value at the position of 0.5.
For my limited Matplotlib experience, this is the solution I know.
Ideally, suppose I have a column of float data, I could pick some pivot value, and give them some special position. and then I get them normalised and give to the colormap. The colorbar, however, I could set special colours for those special positions that I previous defined. and get a corresponding colorbar with the right tick locator and tick labels, that indicate my special pivot value.
I am looking for an easier way (from the standard lib) that I could use achieve this.

It will be very helpful if you can post a plot that you wish to make. But based on my understanding, you just want to do something to the colorbar at one or more particular spot. That is easy, the following cases shows a example of writing a text string at 0.5.
x1=np.random.random(1000)
x2=np.random.random(1000)
x3=np.random.random(1000)
plt.scatter(x1, x2, c=x3, marker='+')
cb=plt.colorbar()
color_norm=lambda x: (x-cb.vmin)/(cb.vmax-cb.vmin)
cb.ax.text(0.5, color_norm(0.5), 'Do something.\nRight here', color='r')
If you want to have value 0.5 at exactly 1/3 height of the colorbar, you need to adjust the colorbar limit using cb.set_clim((cmin, cmax)) method. There will be infinite possible (cmin, cmax) fit your need so additional constrains are necessary, such as keeping the min constant or keeping the max constant or keeping the max-min constant.

Python - FFT leads to wrong physical meanings

I am new to Python.
I intend to do Fourier Transform to an array of discrete points, (time, acceleration), and plot the result out.
I copy and paste the sample FFT code, and modify accordingly.
Please see codes:
import numpy as np
import matplotlib.pyplot as plt
# Load the .txt file in
myData = np.loadtxt('twenty_z_up.txt')
# Extract the time and acceleration columns
time = copy(myData[:,0])
# Extract the acceleration columns
zAcc = copy(myData[:,3])
t = np.arange(10080)
sp = np.fft.fft(zAcc)
freq = np.fft.fftfreq(t.shape[-1])
plt.plot(freq, sp.real)
myData is a rectangular matrix with 10080 rows and 10 columns.
Thus, zAcc is the row3 extracted from the matrix.
In the plot drawn by Spyder, most of the harmonics concentrated around 0.
They are all extremely small.
But my data are actually the accelerations of the phone carried by a walking person (including the gravity). So I expect the most significant harmonic happens around 2Hz.
Why is the graph non-sense?
Thanks in advance!
==============UPDATES: My Graphs======================
The first time domain one:
x-axis is in millisecond.
y-axis is in m/s^2, due to earth gravity, it has a DC offset of ~10.

You do get two spikes at (approximately) 2Hz. Your sampling period is around 2.8 ms (as best as I can infer from your first plot), giving +/-2Hz the normalized frequency of +/-0.056, which is about where your spikes are. fft.fftfreq by default returns the normalized frequency (which scales the sampling period). You can set the d argument to be the sampling period, and you'll get a vector containing the actual frequency.
Your huge spike in the middle is obviously the DC offset (which you can trivially remove by subtracting the mean).

As others said, we need to see the data, post it somewhere. Just to check, try first fixing the timestep size in fftfreq, then plot this synthetic signal, and then plot your signal to see how they compare:
timestep=1./50.#Assume sampling at 50Hz. Change this accordingly.
N=10080#the number of samples
T=N*timestep
t = np.linspace(0,T,N)#needed only to generate xAcc_synthetic
freq=2.#peak a frequency at 2Hz
#generate synthetic signal at 2Hz and add some noise to it
xAcc_synthetic = sin((2*np.pi)*freq*t)+np.random.rand(N)*0.2
sp_synthetic = np.fft.fft(xAcc_synthetic)
freq = np.fft.fftfreq(t.size,d=timestep)
print max(abs(freq))==(1/timestep)/2.#simple check highest freq.
plt.plot(freq, abs(sp_synthetic))
xlabel('Hz')
Now, at the x axis equal to 2 you actually have a physical frequency of 2Hz, and you may spot the more pronounced peak you are looking for. Moreover, you may want to have a look also at yAcc and zAcc.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.