Plotting a discrete variable over time (scarf plot) - python

I have time series data from a repeated-measures eye-tracking experiment.
The dataset consists of a number of respondents, and for each respondent there are 48 trials.
It has a variable ('saccade') giving the transition between eye fixations and a variable ('time') that ranges from 0 to 1 within each trial. The transitions are classified into three categories ('ver', 'hor' and 'diag').
Here is a script that creates a small example dataset in Python (one participant, two trials):
import numpy as np
import pandas as pd
saccade1 = np.array(['diag','hor','ver','hor','diag','ver','hor','diag','diag',
'diag','hor','ver','ver','ver','ver','diag','ver','ver','hor','hor','hor','diag',
'diag','ver','ver','ver','ver'])
time1 = np.array(range(len(saccade1)))/float(len(saccade1)-1)
trial1 = [1]*len(time1)
saccade2 = np.array(['diag','ver','hor','diag','diag','diag','hor','ver','hor',
'diag','hor','ver','ver','ver','ver','diag','ver','ver','hor','diag',
'diag','hor','hor','diag','diag','ver','ver','ver','ver','hor','diag','diag'])
time2 = np.array(range(len(saccade2)))/float(len(saccade2)-1)
trial2 = [2]*len(time2)
saccade = np.append(saccade1,saccade2)
time = np.append(time1,time2)
trial = np.append(trial1,trial2)
subject = [1]*len(time)
df = pd.DataFrame(index=range(len(subject)))
df['subject'] = subject
df['saccade'] = saccade
df['trial'] = trial
df['time'] = time
Alternatively, I have made a csv file with the same data, which can be downloaded here.
I would like to make a so-called scarf plot to visualize the sequence of transitions over time, but I have no clue how to create one.
I would like plots (one per participant) with time on the x-axis and trial on the y-axis. For each trial, the transitions should be represented as colored "stacked" bars.
The only example I have of this kind of plot is in the book "Eye Tracking - A Comprehensive Guide to Methods and Measures" (fig. 6.8b).
Can anyone help me do this?
(I can work with either Python or R - preferably Python.)

Here is a solution in R using ggplot2. You first need to compute a new variable, time2, that holds the time elapsed since the previous sample rather than the cumulative time.
library(ggplot2)
dataset <- read.csv("~/Downloads/example_data_for_scarf.csv")
dataset$trial <- factor(dataset$trial)
dataset$saccade <- factor(dataset$saccade)
dataset$time2 <- c(0, diff(dataset$time))
dataset$time2[dataset$time == 0] <- 0
ggplot(dataset, aes(x = trial, y = time2, fill = saccade)) +
  geom_bar(stat = "identity") +
  coord_flip()
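Since you prefer Python, here is a minimal matplotlib sketch of the same idea, assuming the DataFrame df built above (the color mapping is an arbitrary choice of mine): each trial becomes one horizontal row of colored segments whose widths are the elapsed time between consecutive samples.
import matplotlib.pyplot as plt
colors = {'ver': 'tab:blue', 'hor': 'tab:orange', 'diag': 'tab:green'}
fig, ax = plt.subplots()
for trial, g in df.groupby('trial'):
    # segment width = time until the next sample (the last sample gets width 0)
    widths = g['time'].diff().shift(-1).fillna(0).values
    ax.barh(trial, widths, left=g['time'].values,
            color=[colors[s] for s in g['saccade']])
ax.set_xlabel('time')
ax.set_ylabel('trial')
plt.show()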

Related

Align two signals with different sampling rates using cross correlation

I want to align two signals that are similar but shifted using cross-correlation. While this question has been answered a few times before (see the references at the bottom), this situation is slightly different, and/or I was unable to get the existing solutions to work in my application.
The main difference is that the signals have different sampling rates and that I am passing in not just the two signals but their corresponding time vectors as well.
I thought I could solve this by interpolating both datasets onto a common timeline, but I could not get it to work properly.
Here's what I have tried so far.
First, create two signals at different sampling rates, the second shifted by 7 seconds w.r.t. the first. Apart from the sampling rate, they are the same signal.
import matplotlib.pyplot as plt
import numpy as np
from scipy.signal import correlate
from scipy.interpolate import interp1d
dt1 = 2.4
t1 = np.arange(0,20,dt1)
y1 = np.sin(t1) + t1/10
dt2 = 1
t2 = np.arange(0,20,dt2)
y2 = np.sin(t2) + t2/10
offset_t2 = 7 # would want to recover this eventually.
t2 = t2 + offset_t2
In order to not have to deal with the issue of the different sampling rates, I interpolate the two datasets onto timelines with the same sampling rate (the coarser one).
max_dt = max(dt1,dt2)
t1_resampled = np.arange(t1[0],t1[-1],max_dt)
t2_resampled = np.arange(t2[0],t2[-1],max_dt)
y1_resampled = interp1d(t1,y1)(t1_resampled)
y2_resampled = interp1d(t2,y2)(t2_resampled)
I try to use the maximum of the cross-correlation to get the shift that I need to apply but that does not yield the right result as shown in this plot.
fig,axs=plt.subplots(2,1)
ax = axs[0]
ax.plot(t1,y1,"-o",label='y1')
ax.plot(t2,y2,"-o",label='y2')
xcorr = correlate(y1_resampled,y2_resampled)
argmax_index = np.argmax(xcorr)
shift = (argmax_index-(len(y2_resampled)+1))*max_dt
ax.plot(t2+shift,y2,"-o",label='y2 shifted')
ax = axs[1]
ax.plot(xcorr)
ax.scatter(argmax_index,xcorr[argmax_index],color='red')
axs[0].legend()
print(f"computed shift: {shift}\nexpected shift: {offset_t2}")
Clearly the blue and green curves do not overlap, and the computed shift of -4.8 does not match the offset of 7.
So I wonder if someone could help me implement the shift function I need for my example. It should return a value delta_t such that, when plotting (t1,y1) and (t2+delta_t,y2), the signals overlap as well as possible.
It should look something like the following snippet, but I am unable to implement it.
def shift(t1,y1,t2,y2)->float:
    # If necessary, interpolate to same sampling rate.
    # But this might not be necessary.
    max_dt = max(dt1,dt2)
    t1_resampled = np.arange(t1[0],t1[-1],max_dt)
    t2_resampled = np.arange(t2[0],t2[-1],max_dt)
    y1_resampled = interp1d(t1,y1)(t1_resampled)
    y2_resampled = interp1d(t2,y2)(t2_resampled)
    # Do something with the cross correlation ...
    # ...
    # delta_t = ...
    return delta_t
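For what it's worth, here is a possible sketch of such a function (my own attempt, not from the original post). The two corrections relative to the code above are that zero lag of the full cross-correlation sits at index len(y2_resampled) - 1, not len(y2_resampled) + 1, and that the different start times of the two grids must be added back; removing the means also keeps the ramp from dominating the correlation. For the example data it returns -7.0, i.e. minus the applied offset, since y2 has to be moved back in time to overlap y1.
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import correlate

def shift(t1, y1, t2, y2) -> float:
    # Resample both signals onto grids with the coarser sampling rate.
    max_dt = max(np.median(np.diff(t1)), np.median(np.diff(t2)))
    t1r = np.arange(t1[0], t1[-1], max_dt)
    t2r = np.arange(t2[0], t2[-1], max_dt)
    y1r = interp1d(t1, y1)(t1r)
    y2r = interp1d(t2, y2)(t2r)
    # Remove the means so the linear trend does not dominate the correlation.
    xcorr = correlate(y1r - y1r.mean(), y2r - y2r.mean())
    # Index len(y2r) - 1 of the full cross-correlation corresponds to zero lag.
    lag = np.argmax(xcorr) - (len(y2r) - 1)
    # Convert the lag to seconds and add back the offset between the grids.
    return lag * max_dt + (t1r[0] - t2r[0])

print(shift(t1, y1, t2, y2))  # -7.0 for the example above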
References that did not help
Use of pandas.shift() to align datasets based on scipy.signal.correlate
Python aligning, stretching and synchronizing array data in python (signal processing)
Python cross correlation - why does shifting a timeseries not change the results (lag)?

Making two time-series with different sampling rates comparable

I have two sets of data, both time series of the same variable vs. time, which I have imported and plotted using pandas and matplotlib.
from os import chdir
chdir('C:\\Users\\me\\Documents\\Folder')
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read in csv file
file_df = pd.read_csv('C://Users//me//Documents//Folder//file.csv')
# define csv columns and assign values
VarA = file_df.loc[:, 'VarA'].values
TimeA = file_df.loc[:, 'TimeA'].values
VarB = file_df.loc[:, 'VarB'].values
TimeB = file_df.loc[:, 'TimeB'].values
# plot data selection and aesthetics
plt.plot(TimeA, VarA, label='VarA')
plt.plot(TimeB, VarB, label='VarB')
# plot labels
plt.xlabel('Time')
plt.ylabel('Variable')
#plot and add legend based on plot labels
plt.legend()
In both cases, the variable is sampled between 0 and 320 minutes. However, one dataset has 775 samples (taken at random intervals across the 320 minutes) and the other has 1732 samples (again, taken at random intervals).
Essentially, I want to make two new datasets, based on the old ones, with the variable vs. time between 0 and 320 minutes, but both with the same number of data points taken at the same time steps (e.g. one value per minute, 320 samples).
I'm guessing some interpolation is required? I genuinely don't know where to start. I have both datasets in the same .csv, and I need them to be the same sample size so that I can run the following calculation. At the moment it doesn't run because 'VarA' and 'VarB' have different numbers of data points.
x_values = VarB
y_values = VarA
correlation_matrix = np.corrcoef(x_values, y_values)
correlation_xy = correlation_matrix[0,1]
r_squared = correlation_xy**2
I think resample might be useful here.
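As a rough sketch of that idea (assuming TimeA holds minutes since the start, as above), you can build a TimedeltaIndex and let pandas aggregate into one-minute bins:
import pandas as pd
sA = pd.Series(VarA, index=pd.to_timedelta(TimeA, unit='m'))
# average the samples falling into each one-minute bin, then fill empty bins
sA_1min = sA.resample('1min').mean().interpolate()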
There are a lot of ways to solve this. What is the underlying problem you're trying to solve by correlating the two variables over time?
One option is to calculate some kind of weighted moving average over time and then correlate the smoothed series. An exponentially weighted moving average is the simplest choice; locally weighted regression, such as R's loess() function, is another, and there are more sophisticated methods too.
Here's some example code, where I took a cosine function and the same function with random noise added. Fit with the loess() function; the fitted values are in the fitted component of the object it returns.
x = seq(from = 1, to = 100)
y1 = cos(x / 10)
y2 = cos(x / 10) + rnorm(100, mean = 0, sd = 0.25)
smooth_y2 = loess(y2 ~ x)
plot(x, y1, type = 'l')
lines(x, smooth_y2$fitted, type = 'l', col = 'red')
print(cor(y1, smooth_y2$fitted))
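If you just want the two series on a common grid in Python, here is a minimal sketch of the interpolation approach (assuming TimeA and TimeB are in minutes and sorted ascending, as in the question):
import numpy as np
t_common = np.arange(0, 321)               # one sample per minute, 0..320
VarA_i = np.interp(t_common, TimeA, VarA)  # linear interpolation onto the grid
VarB_i = np.interp(t_common, TimeB, VarB)
correlation_xy = np.corrcoef(VarA_i, VarB_i)[0, 1]
print(correlation_xy**2)                   # r squared on the aligned series
pandas' Series.ewm() provides the exponentially weighted smoothing mentioned above if you want to smooth the aligned series before correlating.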

ggplot summarise mean value of categorical variable on y axis

I am trying to replicate a Python plot in R that I found in this Kaggle notebook: Titanic Data Science Solutions
This is the Python code that generates the plot; the dataset used can be found here:
import seaborn as sns
...
grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
Here is the resulting plot.
The Survived column takes values of 0 and 1 (survived or not) and the y-axis displays its mean per Pclass. When searching for a way to calculate the mean using ggplot2, I usually find the stat_summary() function. The best I could do was this:
library(dplyr)
library(ggplot2)
...
train_df %>%
  ggplot(aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
  stat_summary(fun.y = mean, geom = "line") +
  facet_grid(Embarked ~ .)
The output can be found here.
There are some issues:
There seems to be an empty facet, maybe from NAs in Embarked?
The points don't align with the line.
The lines are different from those in the Python plot.
I also think I haven't fully grasped ggplot's layering concept. I would like to take the geom = "line" out of the stat_summary() call and add it as a + geom_line() instead.
There is actually an empty level (i.e. "") in train_df$Embarked. You can filter that out before plotting.
train_df <- read.csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
train_df <- subset(train_df, Embarked != "")
ggplot(train_df, aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
  stat_summary(fun.data = 'mean_cl_boot') +
  geom_line(stat = 'summary', fun.y = mean) +
  facet_grid(Embarked ~ .)
You can replicate the Python plot by drawing confidence intervals with stat_summary(). Your lines with stat_summary() were fine, but I've rewritten that layer as a geom_line() call, as you asked.
Note that your ggplot code doesn't draw any points, so I can't answer that part; presumably you were drawing the raw values, which are just many 0s and 1s.

Pandas: parallel plots using groupby

I was wondering if anyone could help me with parallel coordinate plotting.
First, this is what the data looks like:
It is manipulated from: https://data.cityofnewyork.us/Transportation/2016-Yellow-Taxi-Trip-Data/k67s-dv2t
I'm trying to normalize some features and use them to compute the mean trip distance, passenger count and payment amount for each day of the week.
from pandas.plotting import parallel_coordinates  # pandas.tools.plotting in older pandas versions
features = ['trip_distance','passenger_count','payment_amount']
# normalize each feature to the 0-1 range
for feature in features:
    df[feature] = (df[feature]-df[feature].min())/(df[feature].max()-df[feature].min())
#change format to datetime
pickup_time = pd.to_datetime(df['pickup_datetime'], format ='%d/%m/%y %H:%M')
#fill dayofweek column with 0~6 0:Monday and 6:Sunday
df['dayofweek'] = pickup_time.dt.weekday
mean_trip = df.groupby('dayofweek').trip_distance.mean()
mean_passanger = df.groupby('dayofweek').passenger_count.mean()
mean_payment = df.groupby('dayofweek').payment_amount.mean()
#parallel_coordinates('notsurewattoput')
If I print mean_trip, it shows the mean for each day of the week, but I'm not sure how to use this to draw a parallel coordinate plot with all three means on the same plot.
Does anyone know how to implement this?
I think you can replace the three separate mean aggregations, which produce three Series, with a single one that outputs a DataFrame. Change:
mean_trip = df.groupby('dayofweek').trip_distance.mean()
mean_passanger = df.groupby('dayofweek').passenger_count.mean()
mean_payment = df.groupby('dayofweek').payment_amount.mean()
to:
from pandas.plotting import parallel_coordinates  # pandas.tools.plotting in older pandas versions
cols = ['trip_distance','passenger_count','payment_amount']
df1 = df.groupby('dayofweek', as_index=False)[cols].mean()
#https://stackoverflow.com/a/45082022
parallel_coordinates(df1, class_column='dayofweek', cols=cols)
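Each weekday (0 = Monday through 6 = Sunday) then becomes one colored line running across the three feature axes; call matplotlib's plt.show() if the figure does not display automatically.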

Curve fitting with large number of data points

This is quite a specific problem that I was hoping the community could help me out with. Thanks in advance.
I have two sets of data: one is experimental, and the other is based on an equation. I am trying to fit my data points to this curve and hence obtain the missing variables I am interested in, namely a and b in the Ebfit function.
Here is the code:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as spys
from scipy.optimize import curve_fit
time = [60,220,520,1840]
Moment = [0.64227262,0.468318916,0.197100772,0.104512508]
Temperature = 25 # Bake temperature in degrees C
Nb = len(Moment) # Number of bake measurements
Baketime_a = time #[s]
N_Device = 10000 # No. of devices considered in the array
T_ambient = 273 + Temperature
kt = 0.0256*(T_ambient/298) # In units of eV
f0 = 1e9 # Attempt frequency
def Ebfit(x,a,b):
    Eb_mean = a*(0.0256/kt) # Eb at bake temperature
    Eb_sigma = b*Eb_mean
    Foursigma = 4*Eb_sigma
    # Discretize the energy-barrier distribution over +/- four sigma
    Eb_a = np.linspace(Eb_mean-Foursigma,Eb_mean+Foursigma,N_Device)
    dEb = Eb_a[1] - Eb_a[0]
    pdfEb_a = spys.norm.pdf(Eb_a,Eb_mean,Eb_sigma)
    ## Retention time
    DMom = np.zeros(len(x),float)
    tau = (1/f0)*np.exp(Eb_a)
    for bb in range(len(x)):
        # Weighted sum over the barrier distribution at each bake time
        DMom[bb] = 1 - 2*(sum(pdfEb_a*(1 - np.exp(np.divide(-x[bb],tau))))*dEb)
    return DMom
a = 30
b = 0.10
params,extras = curve_fit(Ebfit,time,Moment)
x_new = list(range(0,2000,1))
y_new = Ebfit(x_new,params[0],params[1])
plt.plot(time,Moment, 'o', label = 'data points')
plt.plot(x_new,y_new, label = 'fitted curve')
plt.legend()
The main problem I am having is that the fit does not work when I use a large number of points. When I use the 4 points above (time & Moment), this code works fine.
I get the following values for a and b:
array([ 29.11832766, 0.13918353])
The expected range for a is 23-50 and for b is 0.06-0.15, so these values are acceptable. This is the corresponding plot:
However, the fit breaks down when I use my actual experimental normalized data, which has about 500 points.
EDIT: This data:
Normalized Data
https://www.dropbox.com/s/64zke4wckxc1r75/Normalized%20Data.csv?dl=0
Raw Data
https://www.dropbox.com/s/ojgse5ibp59r8nw/Data1.csv?dl=0
I get the following values for a and b, which are out of the acceptable range:
array([-13.76687781, -12.90494196])
I know these values are wrong; if I adjust them manually (slowly tweaking until the fit looks right), I get roughly a = 30.1 and b = 0.09, which plots as follows:
I have tried changing the initial guesses for a and b, other sets of experimental data, and other suggestions from similar threads. None seem to work for me. Any help you can provide is appreciated. Thanks.
ADDITIONAL INFORMATION
The model I am trying to fit comes from the following equation (reconstructed here from the Ebfit code above):
Dmom = 1 - 2*Psw, with Psw = sum over Eb of pdfEb(Eb) * (1 - exp(-t/tau(Eb))) * dEb, where tau(Eb) = (1/f0) * exp(Eb/kT).
a sets the mean barrier Eb while b is the sigma value; Eb takes a range of values determined by the normal probability density function out to four times sigma (i.e. Foursigma), and this distribution is summed up to give the final equation.
It looks like you do need to play around with the initial guesses for a and b after all. Perhaps the function you're fitting is not very well behaved, which is why it's so prone to failure for initial guesses away from the global minimum. That being said, here's a working example of how to fit your data:
import pandas as pd
data_df = pd.read_csv('data.csv')
time = data_df['Time since start, Time [s]'].values
moment = data_df['Signal X direction, Moment [emu]'].values
params, extras = curve_fit(Ebfit, time, moment, p0=[40, 0.3])
This yields the following values of a and b:
In [6]: params
Out[6]: array([ 30.47553689, 0.08839412])
which result in a nicely aligned fit:
x_big = np.linspace(1, 1800, 2000)
y_big = Ebfit(x_big, params[0], params[1])
plt.plot(time, moment, 'o', alpha=0.5, label='all points')
plt.plot(x_big, y_big, label = 'fitted curve')
plt.legend()
plt.show()
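If the fit still wanders off for other datasets, one more option (my addition, not part of the original answer) is to constrain the parameters to the physically acceptable ranges stated in the question; curve_fit supports this via its bounds argument (note that p0 must lie inside the bounds):
params, extras = curve_fit(Ebfit, time, moment, p0=[40, 0.1],
                           bounds=([23, 0.06], [50, 0.15]))  # a in 23-50, b in 0.06-0.15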
