I have two sets of data, both time series of the same variable vs. time, and I have imported and plotted them using pandas and matplotlib.
from os import chdir
chdir('C:\\Users\\me\\Documents\\Folder')
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read in csv file
file_df = pd.read_csv('C://Users//me//Documents//Folder//file.csv')
# define csv columns and assign values
VarA = file_df.loc[:, 'VarA'].values
TimeA = file_df.loc[:, 'TimeA'].values
VarB = file_df.loc[:, 'VarB'].values
TimeB = file_df.loc[:, 'TimeB'].values
# plot data selection and aesthetics
plt.plot(TimeA, VarA, label='VarA')
plt.plot(TimeB, VarB, label='VarB')
# plot labels
plt.xlabel('Time')
plt.ylabel('Variable')
#plot and add legend based on plot labels
plt.legend()
In both cases, the variable is sampled between 0 minutes and 320 minutes. However, one data set has 775 samples (taken at random intervals across the 320 minutes) and the other data set has 1732 samples (again, taken at random intervals across the 320 minutes).
Essentially, what I want to do is make two new datasets based on the old ones, where I have the variable vs. time between 0 and 320 minutes, but both with the same number of data points taken at the same time steps (e.g. the variable at every minute, for 320 samples).
I'm guessing some interpolation is required? I genuinely don't know where to start. I have both datasets in the same .csv and I need them to be the same sample size so that I can run the following calculation. At the moment it doesn't run because 'VarA' and 'VarB' have different amounts of data.
x_values = VarB
y_values = VarA
correlation_matrix = np.corrcoef(x_values, y_values)
correlation_xy = correlation_matrix[0,1]
r_squared = correlation_xy**2
I think resample might be useful here.
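A minimal sketch of the interpolation idea, assuming linear interpolation is acceptable for this data: both series are evaluated on one shared time grid with np.interp, after which np.corrcoef receives arrays of equal length. The synthetic arrays and the one-minute grid below are stand-ins, not the real measurements. With the real CSV, any NaN padding in the shorter pair of columns would need to be dropped first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two irregularly sampled series (assumed data)
time_a = np.sort(rng.uniform(0, 320, 775))
time_b = np.sort(rng.uniform(0, 320, 1732))
var_a = np.sin(time_a / 40)
var_b = np.sin(time_b / 40) + rng.normal(0, 0.1, time_b.size)

# Shared grid: one point per minute from 0 to 320 inclusive (321 points)
common_time = np.arange(0, 321)

# np.interp needs increasing sample times; values outside the sampled
# range are held at the first/last sample
var_a_i = np.interp(common_time, time_a, var_a)
var_b_i = np.interp(common_time, time_b, var_b)

# Both arrays now have the same length, so the correlation can be computed
correlation_xy = np.corrcoef(var_b_i, var_a_i)[0, 1]
r_squared = correlation_xy ** 2
```

The grid spacing is a judgment call: a grid much finer than the sparser series mostly adds interpolated points rather than information.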
There are a lot of ways to solve this. What is the problem that you're trying to solve by calculating the correlation between two variables over time?
One option is to smooth each series over time and then compute the correlation on the smoothed values. A simple way to do this is a locally weighted regression smoother, such as R's loess function (an exponentially weighted moving average is another option). There are more sophisticated methods too.
Here's some example code in R, where I took a cosine function and the same function with random noise added. To do a loess fit, use the loess() function; the fitted values are in the "fitted" component of the object it returns.
x = seq(from = 1, to = 100)
y1 = cos(x / 10)
y2 = cos(x / 10) + rnorm(100, mean = 0, sd = 0.25)
smooth_y2 = loess(y2 ~ x)
plot(x, y1, type = 'l')
lines(x, smooth_y2$fitted, type = 'l', col = 'red')
print(cor(y1, smooth_y2$fitted))
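For readers working in Python like the question, here is a rough counterpart of the R snippet, with one swap stated plainly: it uses a centered moving average from np.convolve instead of loess, so the numbers will differ from the R output. The window length of 9 and the random seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.arange(1, 101)
y1 = np.cos(x / 10)
y2 = np.cos(x / 10) + rng.normal(0, 0.25, x.size)

# Centered moving average (NOT loess); 'valid' keeps only the points
# where the full window fits, so there are no edge artifacts
window = 9
kernel = np.ones(window) / window
smooth_y2 = np.convolve(y2, kernel, mode="valid")

# Trim y1 to the same central region before correlating
half = window // 2
y1_trimmed = y1[half:-half]

r_raw = np.corrcoef(y1, y2)[0, 1]
r_smooth = np.corrcoef(y1_trimmed, smooth_y2)[0, 1]
print(r_raw, r_smooth)
```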
I want to align two signals that are similar but shifted using cross-correlation. While this question has been answered a few times before (see references at the bottom), this situation is slightly different and / or I was unable to get the solutions work in my application.
The main difference is that the signals have different sampling rates and that I am inputting not just two signals, but their corresponding time vectors as well.
I thought I would be able to solve this problem by just interpolating both datasets onto the same time line, but I could not get this to work properly.
Here's what I have tried so far.
Create two signals at different sampling rates, the second being shifted by 7 seconds w.r.t the first signal. They are however the same signals if not for the different sampling rate.
import matplotlib.pyplot as plt
import numpy as np
from scipy.signal import correlate
from scipy.interpolate import interp1d
dt1 = 2.4
t1 = np.arange(0,20,dt1)
y1 = np.sin(t1) + t1/10
dt2 = 1
t2 = np.arange(0,20,dt2)
y2 = np.sin(t2) + t2/10
offset_t2 = 7 # would want to recover this eventually.
t2 = t2 + offset_t2
In order to not have to deal with the issue of the different sampling rates, I interpolate the two datasets onto timelines with the same sampling rate (the coarser one).
max_dt = max(dt1,dt2)
t1_resampled = np.arange(t1[0],t1[-1],max_dt)
t2_resampled = np.arange(t2[0],t2[-1],max_dt)
y1_resampled = interp1d(t1,y1)(t1_resampled)
y2_resampled = interp1d(t2,y2)(t2_resampled)
I try to use the maximum of the cross-correlation to get the shift that I need to apply but that does not yield the right result as shown in this plot.
fig,axs=plt.subplots(2,1)
ax = axs[0]
ax.plot(t1,y1,"-o",label='y1')
ax.plot(t2,y2,"-o",label='y2')
xcorr = correlate(y1_resampled,y2_resampled)
argmax_index = np.argmax(xcorr)
shift = (argmax_index-(len(y2_resampled)+1))*max_dt
ax.plot(t2+shift,y2,"-o",label='y2 shifted')
ax = axs[1]
ax.plot(xcorr)
ax.scatter(argmax_index,xcorr[argmax_index],color='red')
axs[0].legend()
print(f"computed shift: {shift}\nexpected shift: {offset_t2}")
Clearly the blue and the green curve do not overlap and the computed shift of -4.8 does not match the offset of 7.
So I wonder if someone could help me implementing the shift function that I need for my example. It should return a value delta_t such that when plotting (t1,y1) and (t2+delta_t,y2) the signals overlap as well as possible.
It should look something like the following snippet, but I am unable to implement it.
def shift(t1,y1,t2,y2)->float:
    # If necessary, interpolate to same sampling rate.
    # But this might not be necessary.
    max_dt = max(dt1,dt2)
    t1_resampled = np.arange(t1[0],t1[-1],max_dt)
    t2_resampled = np.arange(t2[0],t2[-1],max_dt)
    y1_resampled = interp1d(t1,y1)(t1_resampled)
    y2_resampled = interp1d(t2,y2)(t2_resampled)
    # Do something with the cross correlation ...
    # ...
    # delta_t = ...
    return delta_t
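One possible way to fill in this skeleton, offered as a sketch under stated assumptions rather than a definitive implementation. It differs from the attempt above in three ways: the mean is removed before correlating, the zero-lag position of a 'full' cross-correlation is taken as index len(y2_resampled) - 1, and the different start times of the two resampled grids are added back. The name estimate_shift and the use of np.interp and np.correlate (in place of interp1d and scipy.signal.correlate) are choices made for this sketch.

```python
import numpy as np

def estimate_shift(t1, y1, t2, y2):
    """Return delta_t such that (t2 + delta_t, y2) best overlays (t1, y1)."""
    # Resample both signals at the coarser of the two sampling intervals
    dt = max(np.median(np.diff(t1)), np.median(np.diff(t2)))
    t1_r = np.arange(t1[0], t1[-1], dt)
    t2_r = np.arange(t2[0], t2[-1], dt)
    y1_r = np.interp(t1_r, t1, y1)
    y2_r = np.interp(t2_r, t2, y2)

    # Remove the mean so a constant offset does not dominate the correlation
    y1_r = y1_r - y1_r.mean()
    y2_r = y2_r - y2_r.mean()

    xcorr = np.correlate(y1_r, y2_r, mode="full")
    # In 'full' mode, index len(y2_r) - 1 corresponds to zero lag
    lag = np.argmax(xcorr) - (len(y2_r) - 1)

    # The lag is relative to each grid's first sample, so add the
    # difference between the two start times
    return lag * dt + (t1_r[0] - t2_r[0])

# Same test signals as in the question
t1 = np.arange(0, 20, 2.4)
y1 = np.sin(t1) + t1 / 10
t2 = np.arange(0, 20, 1.0)
y2 = np.sin(t2) + t2 / 10
t2 = t2 + 7

delta_t = estimate_shift(t1, y1, t2, y2)
print(delta_t)
```

For these example signals the result should be about -7 rather than +7: t2 was moved forward by offset_t2, so it has to be moved back to overlay t1. The resolution of the estimate is limited to multiples of the resampling interval.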
References that did not help
Use of pandas.shift() to align datasets based on scipy.signal.correlate
Python aligning, stretching and synchronizing array data in python (signal processing)
Python cross correlation - why does shifting a timeseries not change the results (lag)?
I'm trying to plot the Amplitude (dBFS) vs. Time (s) plot of an audio (.wav) file using matplotlib. I managed to do that with the following code:
def convert_to_decibel(sample):
    ref = 32768  # Signed 16-bit PCM WAV file, so the maximum magnitude is 2^15 = 32768.
    if sample != 0:
        return 20 * np.log10(abs(sample) / ref)
    else:
        return 20 * np.log10(0.000001)
from scipy.io.wavfile import read as readWav
from scipy.fftpack import fft
import matplotlib.pyplot as gplot1
import matplotlib.pyplot as gplot2
import numpy as np
import struct
import gc
wavfile1 = '/home/user01/audio/speech.wav'
wavsamplerate1, wavdata1 = readWav(wavfile1)
wavdlen1 = wavdata1.size
wavdtype1 = wavdata1.dtype
gplot1.rcParams['figure.figsize'] = [15, 5]
pltaxis1 = gplot1.gca()
gplot1.axhline(y=0, c="black")
gplot1.xticks(np.arange(0, 10, 0.5))
gplot1.yticks(np.arange(-200, 200, 5))
gplot1.grid(linestyle = '--')
wavdata3 = np.array([convert_to_decibel(i) for i in wavdata1], dtype=np.int16)
yvals3 = wavdata3
t3 = wavdata3.size / wavsamplerate1
xvals3 = np.linspace(0, t3, wavdata3.size)
pltaxis1.set_xlim([0, t3 + 2])
pltaxis1.set_title('Amplitude (dBFS) vs Time(s)')
pltaxis1.plot(xvals3, yvals3, '-')
which gives the following output:
I had also plotted the Power Spectral Density (PSD, in dBm) using the code below:
from scipy.signal import welch as psd # Computes PSD using Welch's method.
fpsd, wPSD = psd(wavdata1, wavsamplerate1, nperseg=1024)
gplot2.rcParams['figure.figsize'] = [15, 5]
pltpsdm = gplot2.gca()
gplot2.axhline(y=0, c="black")
pltpsdm.plot(fpsd, 20*np.log10(wPSD))
gplot2.xticks(np.arange(0, 4000, 400))
gplot2.yticks(np.arange(-150, 160, 10))
pltpsdm.set_xlim([0, 4000])
pltpsdm.set_ylim([-150, 150])
gplot2.grid(linestyle = '--')
which gives the output as:
The second output above, which uses Welch's method, is more presentable. The dBFS plot, though informative, is not very presentable IMO. Is this because of:
the difference in the domains (time in case of 1st output vs frequency in the 2nd output)?
the way plot function is implemented in pyplot?
Also, is there a way I can plot my dBFS output as a peak-to-peak style of plot just like in my PSD (dBm) plot rather than a dense stem plot?
Would be much helpful and would appreciate any pointers, answers or suggestions from experts here as I'm just a beginner with matplotlib and plots in python in general.
TL;DR
This has nothing to do with pyplot.
The frequency domain is different from the time domain, but that's not why you didn't get what you want.
The calculation of dBFS in your code is wrong.
You should split your data into frames, calculate the RMS or peak value of every frame, and then convert that value to dBFS instead of applying this transformation to every sample point.
When we talk about amplitude, we are talking about a periodic signal. When we read data from a sound file, we read in a series of sample points of a signal (which may or may not be periodic). The value of every sample point represents, say, a voltage or sound pressure value sampled at a specific time.
We assume that, within a very short time interval, 10 ms for example, the signal is stationary. Every such interval is called a frame.
Usually a specific function is applied to each frame to reduce the sudden change at its edges; these functions are called window functions. If you do nothing to a frame, you have effectively applied a rectangular window to it.
An example: when the sampling frequency of your sound is 44100 Hz, a 10 ms frame contains 44100*0.01=441 sample points. That's what the nperseg argument means in your psd function, but it has nothing to do with dBFS.
Given the knowledge above, now we can talk about the amplitude.
There are two methods to get the value of the amplitude in every frame:
The most straightforward one is to take the maximum (peak) value in every frame.
Another one is to calculate the RMS (Root Mean Square) of every frame.
After that, the peak values or RMS values can be converted to dBFS values.
Let's start coding:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
# Determine full scale (maximum possible amplitude) from the bit depth
bit_depth = 16
full_scale = 2 ** (bit_depth - 1)  # signed PCM: sample magnitudes go up to 2**15 = 32768
# dbFS function
to_dbFS = lambda x: 20 * np.log10(x / full_scale)
# Read in the wave file
fname = "01.wav"
fs,data = wavfile.read(fname)
# Determine frame length(number of sample points in a frame) and total frame numbers by window length(how long is a frame in seconds)
window_length = 0.01
signal_length = data.shape[0]
frame_length = int(window_length * fs)
nframes = signal_length // frame_length
# Get frames by broadcast. No overlaps are used.
idx = frame_length * np.arange(nframes)[:,None] + np.arange(frame_length)
frames = data[idx].astype("int64") # Convert to int64 to avoid integer overflow
# Get RMS and peaks
rms = ((frames**2).sum(axis=1)/frame_length)**.5
peaks = np.abs(frames).max(axis=1)
# Convert them to dbfs
dbfs_rms = to_dbFS(rms)
dbfs_peak = to_dbFS(peaks)
# Let's start to plot
# Get time arrays of every sample point and every frame
frame_time = np.arange(nframes) * window_length
data_time = np.linspace(0,signal_length/fs,signal_length)
# Plot
f,ax = plt.subplots()
ax.plot(data_time,data,color="k",alpha=.3)
# Plot the dbfs values on a twin x Axes since the y limits are not comparable between data values and dbfs
tax = ax.twinx()
tax.plot(frame_time,dbfs_rms,label="RMS")
tax.plot(frame_time,dbfs_peak,label="Peak")
tax.legend()
f.tight_layout()
# Save several detail views
f.savefig("whole.png",dpi=300)
ax.set_xlim(1,2)
f.savefig("1-2sec.png",dpi=300)
ax.set_xlim(1.295,1.325)
f.savefig("1.2-1.3sec.png",dpi=300)
The whole time span looks like this (the unit of the right axis is dBFS):
And the voiced part looks like this:
You can see that the dBFS values become greater as the amplitudes become greater at the start of the vowel:
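As a quick sanity check of the frame-based conversion, here is a small sketch on a synthetic tone instead of a WAV file. It assumes signed 16-bit PCM with a full scale of 2**15; the 8 kHz sampling rate and the 100 Hz tone are made-up values chosen so that each 10 ms frame holds exactly one period.

```python
import numpy as np

fs = 8000                      # assumed sampling rate in Hz
full_scale = 2 ** 15           # assumed: signed 16-bit PCM
t = np.arange(fs) / fs         # one second of sample times

# Near-full-scale 100 Hz sine, i.e. one period per 10 ms frame
signal = (full_scale - 1) * np.sin(2 * np.pi * 100 * t)

# Split into non-overlapping 10 ms frames
frame_length = int(round(0.01 * fs))
nframes = signal.size // frame_length
frames = signal[: nframes * frame_length].reshape(nframes, frame_length)

# Per-frame RMS and peak, then conversion to dBFS
rms = np.sqrt((frames ** 2).mean(axis=1))
peaks = np.abs(frames).max(axis=1)
dbfs_rms = 20 * np.log10(rms / full_scale)
dbfs_peak = 20 * np.log10(peaks / full_scale)

print(dbfs_peak[0], dbfs_rms[0])
```

A full-scale sine should come out at roughly 0 dBFS peak and roughly -3 dBFS RMS, which is a convenient way to confirm that the reference level is right.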
this is quite a specific problem I was hoping the community could help me out with. Thanks in advance.
So I have 2 sets of data, one is experimental and the other is based off of an equation. I am trying to fit my data points to this curve and hence obtain the missing variables I am interested in. Namely, a and b in the Ebfit function.
Here is the code:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as spys
from scipy.optimize import curve_fit
time = [60,220,520,1840]
Moment = [0.64227262,0.468318916,0.197100772,0.104512508]
Temperature = 25 # Bake temperature in degrees C
Nb = len(Moment) # Number of bake measurements
Baketime_a = time #[s]
N_Device = 10000 # No. of devices considered in the array
T_ambient = 273 + Temperature
kt = 0.0256*(T_ambient/298) # In units of eV
f0 = 1e9 # Attempt frequency
def Ebfit(x,a,b):
    Eb_mean = a*(0.0256/kt) # Eb at bake temperature
    Eb_sigma = b*Eb_mean
    Foursigma = 4*Eb_sigma
    Eb_a = np.linspace(Eb_mean-Foursigma,Eb_mean+Foursigma,N_Device)
    dEb = Eb_a[1] - Eb_a[0]
    pdfEb_a = spys.norm.pdf(Eb_a,Eb_mean,Eb_sigma)
    ## Retention Time
    DMom = np.zeros(len(x),float)
    tau = (1/f0)*np.exp(Eb_a)
    for bb in range(len(x)):
        DMom[bb]= (1 - 2*(sum(pdfEb_a*(1 - np.exp(np.divide(-x[bb],tau))))*dEb))
    return DMom
a = 30
b = 0.10
params,extras = curve_fit(Ebfit,time,Moment)
x_new = list(range(0,2000,1))
y_new = Ebfit(x_new,params[0],params[1])
plt.plot(time,Moment, 'o', label = 'data points')
plt.plot(x_new,y_new, label = 'fitted curve')
plt.legend()
The main problem I am having is that fitting the data to the function does not work when I use a large number of points. In the code above, when I use the 4 points (time and Moment), the code works fine.
I get the following values for a and b.
array([ 29.11832766, 0.13918353])
The expected values for a is (23-50) and b is (0.06 - 0.15). So these values are within the acceptable range. This is the corresponding plot:
However, when I use my actual experimental normalized data with about 500 points.
EDIT: This data:
Normalized Data
https://www.dropbox.com/s/64zke4wckxc1r75/Normalized%20Data.csv?dl=0
Raw Data
https://www.dropbox.com/s/ojgse5ibp59r8nw/Data1.csv?dl=0
I get the following values and plot for a and b which are out of the acceptable range,
array([-13.76687781, -12.90494196])
I know these values are wrong and if I were to do it manually (slowly adjusting values to obtain the proper fit) it would be around a=30.1 and b=0.09. And when plotted looks as such:
I have tried changing the initial guess values for a & b, other sets of experimental data as well and other suggestions in similar threads. None seem to work for me. Any help you can provide is appreciated. Thanks.
ADDITIONAL INFORMATION
The model I am trying to fit the data to comes from the following equation:
where Dmom = 1 - 2*Psw
a is the Eb value while b is the Sigma value where, Eb has a range of values determined by the probability density function and 4 times of the sigma values (i.e. Foursigma). This distribution is then summed up to use for the final equation.
It looks like you do need to play around with the initial guesses for a and b after all. Perhaps the function you're fitting is not very well behaved, which is why it's so prone to fail for initial guesses away from the global minimum. That being said, here's a working example of how to fit your data:
import pandas as pd
data_df = pd.read_csv('data.csv')
time = data_df['Time since start, Time [s]'].values
moment = data_df['Signal X direction, Moment [emu]'].values
params, extras = curve_fit(Ebfit, time, moment, p0=[40, 0.3])
Yields the values of a and b of:
In [6]: params
Out[6]: array([ 30.47553689, 0.08839412])
Which results in a nicely aligned fit of a function.
x_big = np.linspace(1, 1800, 2000)
y_big = Ebfit(x_big, params[0], params[1])
plt.plot(time, moment, 'o', alpha=0.5, label='all points')
plt.plot(x_big, y_big, label = 'fitted curve')
plt.legend()
plt.show()
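Since the question states acceptable ranges for a and b, another thing that may help is passing them to curve_fit as bounds together with p0. The sketch below is only an illustration under assumptions: ebfit_vec is a hypothetical vectorized rewrite of Ebfit that assumes a 25 C bake (so Eb_mean equals a), it uses fewer integration points than the original for speed, and it is fitted to noise-free synthetic data generated from a=30, b=0.09 rather than to the real measurements.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

F0 = 1e9        # attempt frequency, as in the question
N_GRID = 2000   # integration points (assumed; the question uses 10000)

def ebfit_vec(x, a, b):
    """Vectorized form of the model, assuming a 25 C bake so Eb_mean == a."""
    x = np.asarray(x, dtype=float)
    sigma = b * a
    eb = np.linspace(a - 4 * sigma, a + 4 * sigma, N_GRID)
    d_eb = eb[1] - eb[0]
    pdf = norm.pdf(eb, a, sigma)
    tau = np.exp(eb) / F0
    # Switching probability for all times at once
    # (rows: times in x, columns: points of the Eb grid)
    p_sw = (pdf * (1 - np.exp(-x[:, None] / tau))).sum(axis=1) * d_eb
    return 1 - 2 * p_sw

# Noise-free synthetic data from known parameters (NOT the real data)
t = np.linspace(1, 1800, 200)
moment = ebfit_vec(t, 30.0, 0.09)

# Acceptable ranges quoted in the question: a in 23-50, b in 0.06-0.15
lower, upper = [23, 0.06], [50, 0.15]
params, cov = curve_fit(ebfit_vec, t, moment, p0=[35, 0.1],
                        bounds=(lower, upper))
print(params)
```

With bounds, the optimizer can no longer wander into negative values of b, where the normal density is undefined. The initial guess p0 has to lie inside the bounds.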
I have an array with probability values stored in it. Some values are 0. I need to plot a histogram such that there is an equal number of elements in each bin. I tried using matplotlib's hist function, but that only lets me decide the number of bins. How do I go about plotting this? (A normal plot and hist work, but they are not what is needed.)
I have 10000 entries. Only 200 have values greater than 0, and they lie between 0.0005 and 0.2. The distribution isn't even: only one element has the value 0.2, whereas approximately 2000 have the value 0.0005. So plotting it was an issue, as the bins had to be of unequal width with an equal number of elements.
The task does not make much sense to me, but the following code does what I understood to be the thing to do.
I also think the last lines of the code are what you really wanted to do: using different bin widths to improve the visualization (but without targeting an equal number of samples within each bin). I used astroML's hist with bins='blocks' (astropy supports this too).
Code
# Python 3 -> beware the // operator!
import numpy as np
import matplotlib.pyplot as plt
from astroML import plotting as amlp
N_VALUES = 1000
N_BINS = 100
# Create fake data
prob_array = np.random.randn(N_VALUES)
prob_array /= np.max(np.abs(prob_array),axis=0) # scale a bit
# Sort array
prob_array = np.sort(prob_array)
# Calculate bin-borders,
bin_borders = [np.amin(prob_array)] + [prob_array[(N_VALUES // N_BINS) * i] for i in range(1, N_BINS)] + [np.amax(prob_array)]
print('SAMPLES: ', prob_array)
print('BIN-BORDERS: ', bin_borders)
# Plot hist
counts, x, y = plt.hist(prob_array, bins=bin_borders)
plt.xlim(bin_borders[0], bin_borders[-1] + 1e-2)
print('COUNTS: ', counts)
plt.show()
# And this is, what i think, what you really want
fig, (ax1, ax2) = plt.subplots(2)
left_blob = np.random.randn(N_VALUES // 10) + 3  # integer division: randn needs an int size in Python 3
right_blob = np.random.randn(N_VALUES) + 110
both = np.hstack((left_blob, right_blob)) # data is hard to visualize with equal bin-widths
ax1.hist(both)
amlp.hist(both, bins='blocks', ax=ax2)
plt.show()
Output
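A shorter route to equal-count bin edges, sketched here under the assumption that the values are (nearly) all distinct, is to take quantiles of the data as the edges. The exponential sample is a made-up stand-in for the non-zero probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-in for the 200 non-zero probability values
values = rng.exponential(0.02, size=200)

n_bins = 10
# Edges at the 0%, 10%, ..., 100% quantiles give bins of unequal
# width that each hold the same share of the data
edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))

counts, _ = np.histogram(values, bins=edges)
print(counts)
```

The same edges can be passed to plt.hist(values, bins=edges). With many tied values (for example thousands of zeros) the quantile edges coincide and the counts can no longer be equal, so ties would have to be handled separately.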
I have time series data from a repeated-measures eyetracking experiment.
The dataset consists of a number of respondents, and for each respondent there are 48 trials.
The data set has a variable ('saccade') which records the transitions between eye fixations and a variable ('time') which ranges from 0 to 1 for each trial. The transitions are classified into three different categories ('ver', 'hor' and 'diag').
Here is a script that will create a small example data set in python (one participant and two trials):
import numpy as np
import pandas as pd
saccade1 = np.array(['diag','hor','ver','hor','diag','ver','hor','diag','diag',
'diag','hor','ver','ver','ver','ver','diag','ver','ver','hor','hor','hor','diag',
'diag','ver','ver','ver','ver'])
time1 = np.array(range(len(saccade1)))/float(len(saccade1)-1)
trial1 = [1]*len(time1)
saccade2 = np.array(['diag','ver','hor','diag','diag','diag','hor','ver','hor',
'diag','hor','ver','ver','ver','ver','diag','ver','ver','hor','diag',
'diag','hor','hor','diag','diag','ver','ver','ver','ver','hor','diag','diag'])
time2 = np.array(range(len(saccade2)))/float(len(saccade2)-1)
trial2 = [2]*len(time2)
saccade = np.append(saccade1,saccade2)
time = np.append(time1,time2)
trial = np.append(trial1,trial2)
subject = [1]*len(time)
df = pd.DataFrame(index=range(len(subject)))
df['subject'] = subject
df['saccade'] = saccade
df['trial'] = trial
df['time'] = time
Alternatively I have made a csv-file with the same data which can be downloaded here
I would like to be able to make a so-called scarf plot to visualize the sequence of transitions over time, but I have no clue how to make these plots.
I would like plots (for each participant separately) where time is on the x-axis and trial is on the y-axis. For each trial I would like the transitions represented as colored "stacked" bars.
The only example I have of these kinds of plots are in the book "Eye Tracking - A comprehensive guide to methods and measures" (fig. 6.8b) link
Can anyone tell/help me in doing this?
(I can work with either Python or R, preferably Python.)
Here is a solution in R using ggplot2. You need to compute time2 so that it holds the elapsed time of each segment instead of the cumulative time.
library(ggplot2)
dataset <- read.csv("~/Downloads/example_data_for_scarf.csv")
dataset$trial <- factor(dataset$trial)
dataset$saccade <- factor(dataset$saccade)
dataset$time2 <- c(0, diff(dataset$time))
dataset$time2[dataset$time == 0] <- 0
ggplot(dataset, aes(x = trial, y = time2, fill = saccade)) +
geom_bar(stat = "identity") +
coord_flip()
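Since Python was preferred, here is a rough matplotlib counterpart using broken_barh, as a sketch rather than a finished solution. It assumes that each transition lasts until the next one starts (the last one ends at time 1.0). The tiny data frame and the colour mapping are made up for illustration, with the same column names as in the question.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small stand-in data set with the same columns as in the question
df = pd.DataFrame({
    "trial":   [1, 1, 1, 1, 2, 2, 2],
    "time":    [0.0, 0.25, 0.5, 1.0, 0.0, 0.5, 1.0],
    "saccade": ["diag", "hor", "ver", "hor", "ver", "diag", "diag"],
})
colors = {"ver": "tab:blue", "hor": "tab:orange", "diag": "tab:green"}  # arbitrary

fig, ax = plt.subplots()
for trial, grp in df.groupby("trial"):
    grp = grp.sort_values("time")
    starts = grp["time"].to_numpy()
    # Assumption: a segment runs until the next transition; the last ends at 1.0
    widths = np.diff(np.append(starts, 1.0))
    ax.broken_barh(list(zip(starts, widths)), (trial - 0.4, 0.8),
                   facecolors=[colors[s] for s in grp["saccade"]])

ax.set_xlabel("time")
ax.set_ylabel("trial")
ax.set_yticks(sorted(df["trial"].unique()))
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure. For several participants, the loop could be wrapped in an outer loop over df.groupby("subject") with one figure per subject.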