I have two sets of data, both time series of the same variable vs. time, which I have imported and plotted using pandas and matplotlib.
from os import chdir
chdir('C:\\Users\\me\\Documents\\Folder')
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read in csv file
file_df = pd.read_csv('C://Users//me//Documents//Folder//file.csv')
# define csv columns and assign values
VarA = file_df.loc[:, 'VarA'].values
TimeA = file_df.loc[:, 'TimeA'].values
VarB = file_df.loc[:, 'VarB'].values
TimeB = file_df.loc[:, 'TimeB'].values
# plot data selection and aesthetics
plt.plot(TimeA, VarA, label='VarA')
plt.plot(TimeB, VarB, label='VarB')
# plot labels
plt.xlabel('Time')
plt.ylabel('Variable')
#plot and add legend based on plot labels
plt.legend()
In both cases, the variable is sampled between 0 minutes and 320 minutes. However, one data set has 775 samples (taken at random intervals across the 320 minutes) and the other data set has 1732 samples (again, taken at random intervals across the 320 minutes).
Essentially, what I want to do is make two new datasets, based on the old ones, with the variable vs. time between 0 and 320 minutes, but both with the same number of data points taken at the same time steps (e.g. one sample per minute, giving 320 samples each).
I'm guessing some interpolation is required? I genuinely don't know where to start. I have both datasets in the same .csv and I need them to be the same sample size so that I can run the following calculation. At the moment it doesn't run because 'VarA' and 'VarB' have different numbers of data points.
x_values = VarB
y_values = VarA
correlation_matrix = np.corrcoef(x_values, y_values)
correlation_xy = correlation_matrix[0,1]
r_squared = correlation_xy**2
I think resample might be useful here.
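For concreteness, here is a minimal sketch of the kind of thing I imagine, using linear interpolation with np.interp (this assumes TimeA and TimeB are in minutes and sorted in increasing order; common_time, VarA_i and VarB_i are names made up for illustration):
import numpy as np
# a shared grid with one sample per minute from 0 to 320
common_time = np.arange(0, 321)
# linearly interpolate each series onto the shared grid
VarA_i = np.interp(common_time, TimeA, VarA)
VarB_i = np.interp(common_time, TimeB, VarB)
# now both arrays have the same length, so the correlation can run
correlation_xy = np.corrcoef(VarA_i, VarB_i)[0, 1]
r_squared = correlation_xy**2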
There are a lot of ways to solve this. What is the problem that you're trying to solve by calculating the correlation between two variables over time?
One option is to calculate some kind of weighted moving average over time, and then do the correlation on the smoothed series. A simple choice is an exponentially weighted moving average; another is a locally weighted regression such as loess. There are more sophisticated methods too.
Here's some example code, where I took a cosine function and the same function with random noise added. To do a loess fit, use the loess() function, and to access the fitted values use the "fitted" component of the object that loess() returns.
x = seq(from = 1, to = 100)
y1 = cos(x / 10)
y2 = cos(x / 10) + rnorm(100, mean = 0, sd = 0.25)
smooth_y2 = loess(y2 ~ x)
plot(x, y1, type = 'l')
lines(x, smooth_y2$fitted, type = 'l', col = 'red')
print(cor(y1, smooth_y2$fitted))
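If you would rather stay in Python, statsmodels ships a comparable lowess smoother. Here is a rough sketch mirroring the R example above (x, y1, y2 are the same toy cosine data; frac is a smoothing span you would tune for your own data):
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess
x = np.arange(1, 101)
y1 = np.cos(x / 10)
y2 = np.cos(x / 10) + np.random.normal(0, 0.25, 100)
# lowess returns an (n, 2) array of (x, fitted) pairs, sorted by x
fitted = lowess(y2, x, frac=0.3)[:, 1]
print(np.corrcoef(y1, fitted)[0, 1])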
I am new to Fourier theory and I've seen very good tutorials on how to apply the FFT to a signal and plot it in order to see the frequencies it contains. However, all of them create a mix of sines as their data, and I am having trouble adapting that to my real problem.
I have 242 hourly observations with a daily periodicity, meaning that my period is 24 hours, so I expect to see a peak corresponding to that 24-hour period in my FFT plot.
A sample of my data.csv is here:
https://pastebin.com/1srKFpJQ
Data plotted:
My code:
data = pd.read_csv('data.csv',index_col=0)
data.index = pd.to_datetime(data.index)
data = data['max_open_files'].astype(float).values
N = data.shape[0] #number of elements
t = np.linspace(0, N * 3600, N) #converting hours to seconds
s = data
fft = np.fft.fft(s)
T = t[1] - t[0]
f = np.linspace(0, 1 / T, N)
plt.ylabel("Amplitude")
plt.xlabel("Frequency [Hz]")
plt.bar(f[:N // 2], np.abs(fft)[:N // 2] * 1 / N, width=1.5) # 1 / N is a normalization factor
plt.show()
This outputs a very weird result where it seems I am getting the same value for every frequency.
I suppose that the problem comes from the definition of N, t and T, but I cannot find anything online that has helped me understand this clearly. Please help :)
EDIT1:
With the code provided in charles' answer I get a spike around 0 that seems very weird. I have used rfft and rfftfreq instead to avoid having too many frequencies.
I have read that this might be because of the DC component of the series, so after subtracting the mean I get:
I am having trouble interpreting this: the spikes seem to happen periodically, but the values in Hz don't let me recover my period of 24 hours. Does anybody know how to interpret this? What am I missing?
The problem you're seeing is because the bars are too wide, and you're only seeing one bar. You will have to change the width of the bars to 0.00001 or smaller to see them show up.
Instead of using a bar chart, build your x axis with fftfreq = np.fft.fftfreq(len(s)) and then plot the magnitude of the transform with plt.plot(fftfreq, np.abs(fft)):
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv',index_col=0)
data.index = pd.to_datetime(data.index)
data = data['max_open_files'].astype(float).values
N = data.shape[0] #number of elements
t = np.linspace(0, N * 3600, N) #converting hours to seconds
s = data
fft = np.fft.fft(s)
fftfreq = np.fft.fftfreq(len(s))
T = t[1] - t[0]
f = np.linspace(0, 1 / T, N)
plt.ylabel("Amplitude")
plt.xlabel("Frequency [Hz]")
plt.plot(fftfreq,fft)
plt.show()
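To relate the spectrum to the 24-hour period you expect, it may be easier to work in cycles per hour than in Hz. Here is a rough sketch that reuses s from the code above and assumes one sample per hour:
spec = np.abs(np.fft.rfft(s - s.mean()))   # remove the DC component before transforming
freqs = np.fft.rfftfreq(len(s), d=1.0)     # with hourly samples, frequencies are in cycles per hour
peak = freqs[np.argmax(spec[1:]) + 1]      # strongest non-zero frequency
print(1.0 / peak)                          # period in hours of the dominant component; for a clean daily cycle this should be close to 24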
Could you please advise me on the following:
I gather data from an Arduino ADC and store the data in a list on a Raspberry Pi 4 with Python 3.
The list is called 'dataList' and contains 1024 10-bit samples. This all works fine: I can reproduce the sampled signal on the Raspberry Pi.
I would like to compute the power spectrum of the acquired signal using numpy's FFT.
I tried the following:
[see below]
This should illustrate what I'm trying to do; however this produces incoherent output. The sampled signal has a frequency of about 300 Hz. I would be very grateful for any hints in the right direction!
def show_FFT(window):
    # 1024-point FFT of the sampled data
    fft = np.fft.fft(dataList, 1024)
    # draw the magnitude of the first 512 bins (the positive frequencies)
    for X_value in range(0, 512, 1):
        Y_value = fft[X_value]
        gfxdraw.pixel(window, X_value, int(abs(Y_value)), black)
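For reference, what I am ultimately after is something like the following rough sketch, where fs is just a placeholder for whatever sampling rate I end up using and dataList is the list described above:
import numpy as np
fs = 2048.0                                       # placeholder sampling rate in Hz; replace with the real ADC rate
spectrum = np.abs(np.fft.rfft(dataList))**2       # power spectrum of the 1024 samples
freqs = np.fft.rfftfreq(len(dataList), d=1.0/fs)  # frequency of each bin in Hz
peak = freqs[np.argmax(spectrum[1:]) + 1]         # strongest non-DC bin; should land near 300 Hz if fs is correct
print(peak)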
As you mentioned in your question, you have a data set with X starting from 0, but keep in mind that numpy.fft.fft computes a discrete Fourier transform (DFT) of equally spaced samples, and I must mention that it works best on a symmetric range of data from -x to x. You can simply try it with a Gaussian function, change the parameters as you wish and see the results.
Since you didn't give any data set here, I will refer you to a general case with the code below:
import numpy as np
from scipy import interpolate
import matplotlib.pyplot as plt
# create data from dataframes
x = np.random.rand(50) # unequally spaced measurements
x.sort()
y = np.exp(-x*x) #measured signal
Based on the answer here, you can resample your data onto equally spaced points with:
f = interpolate.interp1d(x, y)
num = 500
xx = np.linspace(x[0], x[-1], num)
yy = f(xx)
plt.close('all')
plt.plot(x,y,'bo')
plt.plot(xx,yy, 'g.-')
plt.show()
Then you can make your x data symmetric very simply by:
x = xx
y = yy
xsample = x - ((x.max() - x.min()) / 2)
xsample = xsample - (xsample.max() + xsample.min()) / 2
x = xsample
Then if you try the FFT you will get the correct result:
ysample = yy
ysample_fft = np.fft.fftshift(np.abs(np.fft.fft(ysample/ysample.max()))) / np.sqrt(len(ysample))
plt.plot(xsample,ysample_fft/ysample_fft.max(),'b--')
plt.show()
I found this code:
import numpy as np
import matplotlib.pyplot as plt
# We create 1000 realizations with 500 steps each
n_stories = 1000
t_max = 500
t = np.arange(t_max)
# Steps can be -1 or 1 (note that randint excludes the upper limit)
steps = 2 * np.random.randint(0, 1 + 1, (n_stories, t_max)) - 1
# The time evolution of the position is obtained by successively
# summing up individual steps. This is done for each of the
# realizations, i.e. along axis 1.
positions = np.cumsum(steps, axis=1)
# Determine the time evolution of the mean square distance.
sq_distance = positions**2
mean_sq_distance = np.mean(sq_distance, axis=0)
# Plot the distance d from the origin as a function of time and
# compare with the theoretically expected result where d(t)
# grows as a square root of time t.
plt.figure(figsize=(10, 7))
plt.plot(t, np.sqrt(mean_sq_distance), 'g.', t, np.sqrt(t), 'y-')
plt.xlabel(r"$t$")
plt.tight_layout()
plt.show()
Instead of steps that are just -1 or 1, I would like the steps to follow a standard normal distribution ... but when I insert np.random.normal(0,1,1000) instead of np.random.randint(...) it is not working.
I am really new to Python btw.
Many thanks in advance and Kind regards
You are passing a single number as the third parameter of np.random.normal, so you get a 1-D array instead of a 2-D one; see the documentation. Try this:
steps = np.random.normal(0, 1, (n_stories, t_max))
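For reference, here is a minimal self-contained sketch of the modified simulation with normally distributed steps (same variable names as the code in the question):
import numpy as np
import matplotlib.pyplot as plt
n_stories = 1000
t_max = 500
t = np.arange(t_max)
# draw each step from a standard normal distribution instead of {-1, 1}
steps = np.random.normal(0, 1, (n_stories, t_max))
positions = np.cumsum(steps, axis=1)
mean_sq_distance = np.mean(positions**2, axis=0)
# with unit-variance steps the mean square distance still grows linearly in t,
# so the RMS distance still follows sqrt(t)
plt.plot(t, np.sqrt(mean_sq_distance), 'g.', t, np.sqrt(t), 'y-')
plt.xlabel(r"$t$")
plt.show()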
I have some noisy data that can contain between 0 and n gaussian shapes. I am trying to implement an algorithm that takes the highest data points and fits a gaussian to them, as per the following 'scheme':
New attempt, steps:
1. fit a spline through all data points
2. get the first derivative of the spline function
3. get the two data points (left/right) around the maximum-intensity point where f'(x) is approximately 0
4. fit a gaussian through the data points returned from step 3
4a. plot the gaussian (stopping at the baseline) in the pdf
5. calculate the area under the gaussian curve
6. calculate the area under the raw data points
7. calculate the percentage of the total area explained by the gaussian area
I have implemented this concept using the following code (minimal working example):
#! /usr/bin/env python
from scipy.interpolate import InterpolatedUnivariateSpline
from scipy.optimize import curve_fit
from scipy.signal import argrelextrema
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
data = [(9.60380153195,187214),(9.62028167623,181023),(9.63676350256,174588),(9.65324602212,169389),(9.66972824591,166921),(9.68621215187,167597),(9.70269675106,170838),(9.71918105436,175816),(9.73566703995,181552),(9.75215371878,186978),(9.76864010158,191718),(9.78512816681,194473),(9.80161692526,194169),(9.81810538757,191203),(9.83459553243,186603),(9.85108637051,180273),(9.86757691233,171996),(9.88406913682,163653),(9.90056205454,156032),(9.91705467586,149928),(9.93354897998,145410),(9.95004397733,141818),(9.96653867816,139042),(9.98303506191,137546),(9.99953213889,138724)]
data2 = [(9.60476933166,163571),(9.62125990879,156662),(9.63775225872,150535),(9.65424539203,146960),(9.67073831905,146794),(9.68723301904,149326),(9.70372850238,152616),(9.72022377931,155420),(9.73672082933,156151),(9.75321866271,154633),(9.76971628954,151549),(9.78621568961,148298),(9.80271587303,146333),(9.81921584976,146734),(9.83571759987,150351),(9.85222013334,156612),(9.86872245996,164192),(9.88522656011,171199),(9.90173144362,175697),(9.91823612015,176867),(9.93474257034,175029),(9.95124980389,171762),(9.96775683032,168449),(9.98426563055,165026)]
def gaussFunction(x, *p):
    """Gaussian peak shape used for the curve fit."""
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))

def quantify(data):
    """Fit a gaussian to the highest peak in the data and plot the result."""
    backGround = 105000 # Normally this is dynamically determined but this value is fine for testing on the provided data
    time, intensity = zip(*data)
    x_data = np.array(time)
    y_data = np.array(intensity)
    newX = np.linspace(x_data[0], x_data[-1], int(2500*(x_data[-1]-x_data[0])))
    f = InterpolatedUnivariateSpline(x_data, y_data)
    fPrime = f.derivative()
    newY = f(newX)
    newPrimeY = fPrime(newX)
    maxm = argrelextrema(newPrimeY, np.greater)
    minm = argrelextrema(newPrimeY, np.less)
    breaks = maxm[0].tolist() + minm[0].tolist()
    maxPoint = 0
    for index, j in enumerate(breaks):
        try:
            if max(newY[breaks[index]:breaks[index+1]]) > maxPoint:
                maxPoint = max(newY[breaks[index]:breaks[index+1]])
                xData = newX[breaks[index]:breaks[index+1]]
                yData = [x - backGround for x in newY[breaks[index]:breaks[index+1]]]
        except: # the last break has no right-hand neighbour
            pass
    # Gaussian fit on main points
    newGaussX = np.linspace(x_data[0], x_data[-1], int(2500*(x_data[-1]-x_data[0])))
    p0 = [np.max(yData), xData[np.argmax(yData)], 0.1]
    try:
        coeff, var_matrix = curve_fit(gaussFunction, xData, yData, p0)
        newGaussY = gaussFunction(newGaussX, *coeff)
        newGaussY = [x + backGround for x in newGaussY]
        # Generate plot for visual confirmation
        fig = plt.figure()
        ax = fig.add_subplot(111)
        plt.plot(x_data, y_data, 'b*')
        plt.plot((newX[0], newX[-1]), (backGround, backGround), 'red')
        plt.plot(newX, newY, color='blue', linestyle='dashed')
        plt.plot(newGaussX, newGaussY, color='green', linestyle='dashed')
        plt.title("Test")
        plt.xlabel("rt [m]")
        plt.ylabel("intensity [au]")
        plt.savefig("Test.pdf", bbox_inches="tight")
        plt.close(fig)
    except: # the gaussian fit did not converge
        pass

# Call the test
#quantify(data)
quantify(data2)
where normally the background (red line in the pictures below) is dynamically determined, but for the sake of this example I have set it to a fixed number. The problem I have is that for some data it works really well:
Corresponding f'(x):
However, for some other data it fails horrendously:
Corresponding f'(x):
Therefore, I would like to hear some suggestions or ideas on why this happens and on potential approaches to fix it. I have included the data that is shown in the picture below (in case anyone wants to try it):
The error lay in the following bit:
breaks = maxm[0].tolist() + minm[0].tolist()
for index,j in enumerate(breaks):
The breaks list contains both the maxima and the minima, but they are not sorted by time, which resulted in the list yielding the following data points for the poor fit: 9.78, 9.62 and 9.86.
The program would then examine data from 9.78 to 9.62 and from 9.62 to 9.86, which meant that 9.62 to 9.86 contained the highest intensity data point, yielding the fit shown in the second graph.
The fix was rather simple: just sort the breaks before iterating over them, as follows:
breaks = maxm[0].tolist() + minm[0].tolist()
breaks = sorted(breaks)
for index,j in enumerate(breaks):
The program then yielded a fit more closely resembling what I would expect: