This sounds complicated, but a simple plot makes it easy to understand:
I have three curves, each the cumulative sum of some values over time; these are the blue lines.
I want to average the three curves (or combine them in some statistically sound way) into one smooth curve and add a confidence interval.
I tried one simple solution: combining all the data into one curve, averaging it with the "rolling" function in pandas, and computing the rolling standard deviation. I plotted the result as the purple curve with the confidence interval around it.
The problem, both with my real data and in the plot above, is that the combined curve isn't smooth at all, and the confidence interval has sharp jumps. That is not a good representation of the three separate curves, since none of them has jumps.
Is there a better way to represent the three curves as one smooth curve with a sensible confidence interval?
I supply test code, tested on Python 3.5.1 with numpy and pandas (keep the seed unchanged to get the same curves).
There is one constraint: increasing the number of points for the "rolling" function isn't a solution for me, because some of my data is too short for that.
Test code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
np.random.seed(seed=42)
## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0,10000,size=100), columns=['vals'])
df1_combined_sorted = pd.concat([df1_time, df1_values], axis = 1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])
df2_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000,13000,size=100), columns=['vals'])
df2_combined_sorted = pd.concat([df2_time, df2_values], axis = 1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])
df3_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0,4000,size=100), columns=['vals'])
df3_combined_sorted = pd.concat([df3_time, df3_values], axis = 1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])
## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative,
    df2_combined_sorted_cumulative, df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time = pd.concat([df1_combined_sorted['time'],
df2_combined_sorted['time'], df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis = 1)
## creating confidence intervals
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()
## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'],
ma['vals'] + 2 * mstd['vals'],color='b', alpha=0.2)
plt.plot(df_all_sorted['time'],ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
matplotlib.use('Agg')
plt.show()
First of all, your sample code could be rewritten to make better use of pandas. For example:
np.random.seed(seed=42)
## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    df = pd.concat([times, vals], axis = 1).sort_values(by=['time']).\
        reset_index().drop('index', axis=1)
    df['cumulative'] = df.vals.cumsum()
    return df
# generate the dataframes
df1,df2,df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)
# join
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])
# render function
def render(window=10):
    # compute rolling means and confidence intervals
    mean_val = df_all.cumulative.rolling(window).mean()
    std_val = df_all.cumulative.rolling(window).std()
    min_val = mean_val - 2*std_val
    max_val = mean_val + 2*std_val
    plt.figure(figsize=(16,9))
    for df in dfs:
        plt.plot(df.time, df.cumulative, c='blue')
    plt.plot(df_all.time, mean_val, c='r')
    plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
    plt.show()
Your curves probably aren't smooth because the rolling window is too small. Increasing the window size gives smoother graphs. For example, render(20) gives:
while render(30) gives:
A better approach, though, might be to interpolate each df['cumulative'] series over the entire time range and compute the mean and confidence interval across those series. With that in mind, we can modify the code as follows:
np.random.seed(seed=42)
## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    # note that we set time as index of the returned data
    df = pd.concat([times, vals], axis = 1).dropna().set_index('time').sort_index()
    df['cumulative'] = df.vals.cumsum()
    return df
df1,df2,df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)
# rename column for later plotting
for i,df in zip(range(3),dfs):
    df.rename(columns={'cumulative':f'cumulative_{i}'}, inplace=True)
# concatenate the dataframes with common time index
df_all = pd.concat(dfs,sort=False).sort_index()
# interpolate each cumulative column linearly
df_all.interpolate(inplace=True)
# plot graphs
mean_val = df_all.iloc[:,1:].mean(axis=1)
std_val = df_all.iloc[:,1:].std(axis=1)
min_val = mean_val - 2*std_val
max_val = mean_val + 2*std_val
fig, ax = plt.subplots(1,1,figsize=(16,9))
df_all.iloc[:,1:4].plot(ax=ax)
plt.plot(df_all.index, mean_val, c='purple')
plt.fill_between(df_all.index, min_val, max_val, color='blue', alpha=.2)
plt.show()
and we get:
Related
I have been trying to estimate the power spectrum of a time series using the Fourier transform. I tried two variations of the spectral density estimate, both based on np.fft.rfft. The two functions are the following:
def TracePSD_1st(x, dt):
    """
    Estimate Power spectral density:
    Inputs:
    x : timeseries, np.array
    dt: 1/sampling frequency
    """
    B_pow = np.abs(np.fft.rfft(x, norm='ortho'))**2
    freqs = np.fft.rfftfreq(len(x), dt)
    # keep positive frequencies only; mask both arrays so they stay aligned
    mask = freqs > 0
    return freqs[mask], B_pow[mask]
def TracePSD_2nd(x, dt):
    """
    Estimate Power spectral density:
    Inputs:
    x : timeseries, np.array
    dt: 1/sampling frequency
    """
    N = len(x)
    yf = np.fft.rfft(x)
    B_pow = abs(yf) ** 2 / N * dt
    # rfftfreq matches the length of the rfft output (fftfreq does not)
    freqs = np.fft.rfftfreq(N, dt)
    mask = freqs > 0
    return freqs[mask], B_pow[mask]
The issue arises when I downsample my original time series and re-estimate the spectrum. The first method gives a different PSD depending on the resolution, while the second one gives a very similar result.
The results I am getting when using these two functions are shown below:
The weird thing is that the PSD estimated using the first method is in rough accordance with Parseval's theorem, while the second one is not.
Any suggestions as to which method is correct? Or is an improved version needed?
I append here a piece of code to reproduce the figures I just showed, using a time series corresponding to fractional Brownian motion (you will need to pip install fbm):
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbm import fbm
# create a synthetic timeseries using fractional Brownian motion (if you don't have fbm: pip install fbm)
start_time = datetime.datetime.now()
# Create index for timeseries
end_time = datetime.datetime.now()+ pd.Timedelta('1H')
freq = '10ms'
index = pd.date_range(
start = start_time,
end = end_time,
freq = freq
)
# Generate a fBm realization
fbm_sample = fbm(n=len(index), hurst=0.75, length=1, method='daviesharte')
# Create a dataframe to resample the timeseries.
df_b = pd.DataFrame({'DateTime': index, 'Br':fbm_sample[:-1]}).set_index('DateTime')
#Original version of timeseries
y = df_b.Br
# Resample the synthetic timeseries
resolution = 100  # target resolution in ms; assumed value, the original post does not define it
x = df_b.Br.resample(str(int(resolution))+"ms").mean()
# Estimate the sampling rate
dtx = (x.dropna().index.to_series().diff()/np.timedelta64(1, 's')).median()
dty = (y.dropna().index.to_series().diff()/np.timedelta64(1, 's')).median()
# Estimate PSD using first method
resy = TracePSD_1st(y, dty)
resx = TracePSD_1st(x, dtx)
# Estimate PSD using second method
resya = TracePSD_2nd(y, dty)
resxa = TracePSD_2nd(x, dtx)
fig, ax =plt.subplots(1, 3, figsize=(30,10), sharex=True, sharey=True )
ax[0].loglog(resy[0], resy[1], label ='Original timeseries, 1st method')
ax[0].loglog(resx[0], resx[1], label ='Downsampled timeseries, 1st method')
ax[0].text(5*1e-4, 1e-8, r'$\frac{Power_{Real}}{Power_{Fourier}}$ = '+ str(round(sum(abs(y**2))/ sum(abs(resy[1])) ,2)), fontsize =20)
ax[0].legend()
y = df_b.Br
x = df_b.Br.resample(str(int(resolution))+"ms").mean()
dtx = (x.dropna().index.to_series().diff()/np.timedelta64(1, 's')).median()
dty = (y.dropna().index.to_series().diff()/np.timedelta64(1, 's')).median()
ax[1].loglog(resya[0], resya[1], label ='Original timeseries, 2nd method')
ax[1].loglog(resxa[0], resxa[1], label ='Downsampled timeseries, 2nd method')
ax[1].text(5*1e-4, 1e-8, r'$\frac{Power_{Real}}{Power_{Fourier}}$ = '+ str(round(sum(abs(y**2))/ sum(abs(resya[1])) ,2)), fontsize =20)
ax[1].legend()
ax[2].loglog(resy[0], resy[1], label ='Original timeseries, 1st method')
ax[2].loglog(resya[0], resya[1], label ='Original timeseries, 2nd method')
for i in range(3):
    ax[i].set_ylabel(r'$PSD$')
    ax[i].set_xlabel(r'$Frequency \ [Hz]$')
ax[2].legend()
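For comparison, here is a minimal sketch of one common normalization, the one-sided periodogram, under which the PSD integrates to the mean power of the signal. The function name one_sided_psd and the white-noise test signal are assumptions made for illustration; they are not part of the code above. With this convention the density is (dt/N)|X|^2, the same scaling the second function uses, whereas norm='ortho' yields |X|^2/N, which differs by a factor of dt and therefore shifts when the sampling interval changes.

```python
import numpy as np

def one_sided_psd(x, dt):
    """One-sided periodogram in units of x**2 per Hz (assumed convention)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.rfft(x)
    psd = (dt / n) * np.abs(spectrum) ** 2
    # Double every bin that has a negative-frequency partner. The DC bin and,
    # for even n, the Nyquist bin have no partner and are left as they are.
    if n % 2 == 0:
        psd[1:-1] *= 2
    else:
        psd[1:] *= 2
    freqs = np.fft.rfftfreq(n, dt)
    return freqs, psd

# Hypothetical test signal: white noise sampled every 10 ms
rng = np.random.default_rng(0)
dt = 0.01
x = rng.standard_normal(4096)

freqs, psd = one_sided_psd(x, dt)
df = freqs[1] - freqs[0]          # frequency resolution, 1 / (n * dt)

power_time = np.mean(x ** 2)      # mean power in the time domain
power_freq = np.sum(psd) * df     # integral of the PSD over frequency
print(power_time, power_freq)
```

The Parseval check here compares mean(x**2) with sum(PSD) * df, not with the bare sum of the PSD values, which may explain why a correctly scaled density can look inconsistent with Parseval's theorem when the frequency spacing is left out.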
I have data showing the price to lease different cars.
I have created a matrix showing the correlations between each of the variables involved, but I do not trust it. In my experience, these are not the correlations I should be seeing. The blp (the cost to purchase the car outright) should be the most important factor, yet I'm getting seats and engine volume. (Engine volume I can understand, but seats?)
Perhaps the problem is how I scaled my data.
correlation matrix image
from matplotlib import pyplot
import pandas as pd
import numpy as np
from sklearn import preprocessing
def scale_this_data(data, col_names):
    print("scaling data now")
    for col in data.columns:
        # only scale the requested columns
        if col in col_names:
            # the CSV is read as strings, so convert to float first
            np_arr = data[col].astype(float).values.reshape(-1, 1)
            min_max_scaler = preprocessing.MinMaxScaler()
            np_arr = min_max_scaler.fit_transform(np_arr)
            old = data[col].iloc[3]
            # flatten back to 1-D before assigning to the column
            data[col] = np_arr.ravel()
            print(str(old) + " became " + str(data[col].iloc[3]))
    return data
Path = "new_ratebook.csv"
col_names = ['Net Rental2','Doors2', 'Seats2', 'BHP2', 'Eng CC2', 'CO22', 'blp2']
data = pd.read_csv(Path , dtype = str , index_col=False, low_memory=False)
data = scale_this_data(data, col_names)
data.to_csv("scaleddata.csv")
correlations = data.corr()
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=0, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,7,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(col_names)
ax.set_yticklabels(col_names)
pyplot.savefig('correlations.png')
pyplot.show()
Question: how do I confirm to myself that the correlation is correct?
You can confirm it in various ways. Some are the following:
Verify that the data are correct.
Take a subset of your data (reduce the length of your data frame).
Calculate it by hand (a good way to convince yourself), with a calculator, in Excel, or with an online Pearson correlation coefficient calculator.
By the way, correlation does not imply causation.
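The hand check suggested above can be sketched as follows. The column names and numbers are made up for illustration and are not taken from the ratebook; pearson_by_hand is a hypothetical helper that applies the textbook definition so the result can be compared with what pandas reports.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of a ratebook; the numbers are invented for illustration.
sample = pd.DataFrame({
    "blp": [15000.0, 22000.0, 31000.0, 18000.0, 40000.0, 27000.0],
    "rental": [210.0, 280.0, 390.0, 250.0, 520.0, 330.0],
})

def pearson_by_hand(x, y):
    """Pearson r from its definition: co-deviation over the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_manual = pearson_by_hand(sample["blp"], sample["rental"])
r_pandas = sample["blp"].corr(sample["rental"])

# Min-max scaling is a positive linear map, so it leaves Pearson r unchanged.
scaled = (sample - sample.min()) / (sample.max() - sample.min())
r_scaled = scaled["blp"].corr(scaled["rental"])

print(r_manual, r_pandas, r_scaled)
```

Because the Pearson coefficient is invariant under min-max scaling, the scaling step on its own should not change the correlation values; a surprising matrix is more likely to come from the data themselves or from how the rows and columns are labelled.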
I tried to plot the probability density function (PDF) of my data after finding the best-fit parameters, but the plot shows a flat line instead of a curve.
Is it a matter of scaling?
Is it a problem of continuous versus discrete data? Data file is available here
The purpose is to find the best-fitting distributions and then plot their PDFs.
My data values are small, such as 0.21, 1.117, etc. The data statistics and PDF plots are shown below:
My script is given below:
from time import time
from datetime import datetime
start_time = datetime.now()
import pandas as pd
pd.options.display.float_format = '{:.4f}'.format
import numpy as np
import pickle
import scipy
import scipy.stats
import matplotlib.pyplot as plt
# squeeze the single-column frame to a Series (the squeeze=True keyword was removed in pandas 2.0)
data = pd.read_csv("line_RXC_data.csv", usecols=['R']).squeeze("columns")
df=data
y_std=df
# del yy
import warnings
warnings.filterwarnings("ignore")
# Create an index array (x) for data
y=df
#
# Create an index array (x) for data
x = np.arange(len(y))
size = len(y)
#simple visualisation of the data
plt.hist(y)
plt.title("Histogram of resistance ")
plt.xlabel("Resistance data visualization ")
plt.ylabel("Frequency")
plt.show()
y_df = pd.DataFrame(y)
tt=y_df.describe()
print(tt)
dist_names = [
'foldcauchy',
'beta',
'expon',
'exponnorm',
'norm',
'lognorm',
'dweibull',
'pareto',
'gamma'
]
x = np.arange(len(df))
size = len(df)
y_std = df
y=df
chi_square = []
p_values = []
# Set up 50 bins for chi-square test
# Observed data will be approximately evenly distributed across all bins
percentile_bins = np.linspace(0,100,51)
percentile_cutoffs = np.percentile(y_std, percentile_bins)
observed_frequency, bins = (np.histogram(y_std, bins=percentile_cutoffs))
cum_observed_frequency = np.cumsum(observed_frequency)
# Loop through candidate distributions
for distribution in dist_names:
    s1 = time()
    # Set up distribution and get fitted distribution parameters
    dist = getattr(scipy.stats, distribution)
    # print("1")
    param = dist.fit(y_std)
    # print("2")
    # Obtain the KS test p-value, round it to 5 decimal places
    p = scipy.stats.kstest(y_std, distribution, args=param)[1]
    p = np.around(p, 5)
    p_values.append(p)
    # print("3")
    # Get expected counts in percentile bins
    # This is based on a 'cumulative distribution function' (cdf)
    cdf_fitted = dist.cdf(percentile_cutoffs, *param[:-2], loc=param[-2],
                          scale=param[-1])
    # print("4")
    expected_frequency = []
    for bin in range(len(percentile_bins)-1):
        expected_cdf_area = cdf_fitted[bin+1] - cdf_fitted[bin]
        expected_frequency.append(expected_cdf_area)
    # calculate chi-squared
    expected_frequency = np.array(expected_frequency) * size
    cum_expected_frequency = np.cumsum(expected_frequency)
    ss = sum (((cum_expected_frequency - cum_observed_frequency) ** 2) / cum_observed_frequency)
    chi_square.append(ss)
    print(f"chi_square {distribution} time: {time() - s1}")
    # print("std of predicted probability : ", np.std(cum_observed_frequency))
# print("std of predicted probability : ", np.std(cum_observed_frequency))
# Collate results and sort by goodness of fit (best at top)
results = pd.DataFrame()
results['Distribution'] = dist_names
results['chi_square'] = chi_square
results['p_value'] = p_values
results.sort_values(['chi_square'], inplace=True)
# Report results
print ('\nDistributions sorted by goodness of fit:')
print ('----------------------------------------')
print (results)
#%%
# Divide the observed data into 100 bins for plotting (this can be changed)
number_of_bins = 100
bin_cutoffs = np.linspace(np.percentile(y,0), np.percentile(y,99),number_of_bins)
# Create the plot
plt.figure(figsize=(7, 4))
h = plt.hist(y, bins = bin_cutoffs, color='0.70')
# Get the top distributions from the previous phase
number_distributions_to_plot = 5
dist_names = results['Distribution'].iloc[0:number_distributions_to_plot]
#%%
# Create an empty list to store fitted distribution parameters
parameters = []
# Loop through the distributions to get line fit and parameters
for dist_name in dist_names:
    # Set up distribution and store distribution parameters
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    parameters.append(param)
    # Get line for each distribution (and scale to match observed data)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
    scale_pdf = np.trapz (h[0], h[1][:-1]) / np.trapz (pdf_fitted, x)
    pdf_fitted *= scale_pdf
    # Add the line to the plot
    plt.plot(pdf_fitted, label=dist_name)
# Set the plot x axis to contain 99% of the data
# This can be removed, but sometimes outlier data makes the plot less clear
plt.xlim(0,np.percentile(y,99))
# Add legend and display plot
plt.legend()
plt.title(u'Data distribution characteristics \n' )
plt.xlabel(u'Resistance')
plt.ylabel('Frequency')
plt.show()
# Store distribution parameters in a dataframe (this could also be saved)
dist_parameters = pd.DataFrame()
dist_parameters['Distribution'] = (
results['Distribution'].iloc[0:number_distributions_to_plot])
dist_parameters['Distribution parameters'] = parameters
# Print parameter results
print ('\nDistribution parameters:')
print ('------------------------')
for index, row in dist_parameters.iterrows():
    # use positional access explicitly; row[0] on a labelled Series is deprecated
    print ('\nDistribution:', row.iloc[0])
    print ('Parameters:', row.iloc[1] )
If you look at the following categorical frequency analysis, you'll see that you have only 15 distinct values spread across the range with large gaps in between—not a continuum of values. Half the observations have the value 0.211, with another ~36% occurring at the value 1.117, ~8% at 0.194, and ~4% at 0.001. I think it's a mistake to treat this as continuous data.
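A frequency analysis of this kind can be sketched with pandas as shown below. The series here is a synthetic stand-in whose proportions only loosely follow the ones quoted above (the value 0.5 is invented to fill the remainder); it is not the contents of line_RXC_data.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'R' column; proportions are illustrative only.
values = np.repeat([0.211, 1.117, 0.194, 0.001, 0.5], [50, 36, 8, 4, 2])
r = pd.Series(values, name="R")

# How many distinct values are there, and how often does each occur?
n_distinct = r.nunique()
counts = r.value_counts()                 # sorted by frequency, descending
share = r.value_counts(normalize=True)    # same ordering, as fractions

print(f"{n_distinct} distinct values in {len(r)} observations")
print(pd.DataFrame({"count": counts, "share": share}))
```

When a handful of distinct values account for nearly all observations, as in this stand-in, fitting continuous densities to the column is unlikely to be meaningful, and a table or bar chart of the frequencies describes the data more faithfully.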
I have multiple Dataframes (up to 30) which all contain timestamps with associated values. The timestamp in the DataFrames do not necessarily overlap and the recorded values can only stay the same or increase. A DataFrame may look like this:
time coverage
0 0.000000 32.111748
1 0.875050 32.482579
2 1.850576 32.784133
3 3.693440 34.205134
...
I uploaded a couple of csv files with data here 1, 2, 3, 4.
So what I am trying to do is to plot the increase of the mean and median coverage values over time for all recordings, as follows:
# data is a list of dataframes
keys = ["Run " + str(i) for i in range(len(data))]
glued = pd.concat(data, keys=keys).reset_index(level=0).rename(columns={'level_0': 'Run'})
glued["roundtime"] = glued["time"] / 60
glued["roundtime"] = glued["roundtime"].round(0) # round to the nearest whole minute
f, (ax1, ax2) = plt.subplots(2)
my_dpi = 96
stepsize = 5
start = 0
end = 60
ax1.set_title("Mean")
ax2.set_title("Median")
f.set_size_inches(1980 / my_dpi, 1080 / my_dpi)
ax1 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator="mean", data=glued, ax=ax1)
ax1.set(xlabel="Time", ylabel="Coverage in percent")
ax1.xaxis.set_ticks(np.arange(start, end, stepsize))
ax1.set_xlim(0, 70)
ax2 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator='median', data=glued, ax=ax2)
ax2.set(xlabel="Time", ylabel="Coverage in percent")
ax2.xaxis.set_ticks(np.arange(start, end, stepsize))
ax2.set_xlim(0, 70)
plt.show()
The result looks like this.
However, the curve should never decrease as the "coverage" values can never decrease either. The reason for this, I suspect, is that at certain points in time I only have recordings of some DataFrames with lower values and therefore the mean/median is also lower.
I tried to fix this by aligning the indices of all the DataFrames and filling missing values with previous recordings, before doing any of the previous code. Like this:
#create a common index
index = None
for df in data:
    df.set_index("time", inplace=True, drop=False)
    if index is not None:
        index = index.union(df.index)
    else:
        index = df.index
# reindex all dataframes and fill missing values
new_data = []
for df in data:
    print(df)
    # reindex inserts NaN at timestamps this run did not record
    new_df = df.reindex(index)
    # forward-fill with the previous recording
    new_df = new_df.ffill()
    new_data.append(new_df)
data = new_data
The result, however, does not change much and still decreases at certain times. It looks like this:
Is this approach wrong or am I simply missing something?
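The alignment idea described above can be sketched on a small scale as follows. The two runs and their numbers are made up, and the sketch assumes every run has a recording at the first timestamp; it keeps time in the index rather than in a column, so that forward-filling only touches the coverage values.

```python
import numpy as np
import pandas as pd

# Two hypothetical runs with mostly non-overlapping timestamps (made-up numbers).
run_a = pd.DataFrame({"time": [0.0, 1.0, 3.0], "coverage": [30.0, 32.0, 35.0]})
run_b = pd.DataFrame({"time": [0.0, 2.0, 4.0], "coverage": [10.0, 11.0, 12.0]})
runs = [run_a, run_b]

# One Series per run, indexed by time.
series = [run.set_index("time")["coverage"] for run in runs]

# Wide table: one column per run, rows are the union of all timestamps.
keys = ["Run " + str(i) for i in range(len(runs))]
wide = pd.concat(series, axis=1, keys=keys).sort_index()

# Carry each run's last recorded value forward to timestamps it did not record.
wide = wide.ffill()

# Aggregate across runs at every timestamp.
mean_curve = wide.mean(axis=1)
median_curve = wide.median(axis=1)
print(wide.assign(mean=mean_curve, median=median_curve))
```

One thing worth checking in the reindexing code above: with drop=False the "time" column is forward-filled along with "coverage", so the filled rows carry the timestamp of the previous recording. If the plot is later built from that column, the filled rows land on the old timestamps and the gaps are effectively still there. Runs that start later than others would also leave leading NaN values, which this sketch does not handle.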
I have a pandas data frame with two columns one is temperature the other is time.
I would like to make third and fourth columns called min and max. Each of these columns would be filled with NaNs except where there is a local min or max, in which case it would hold the value of that extremum.
Here is a sample of what the data looks like, essentially I am trying to identify all the peaks and low points in the figure.
Are there any built in tools with pandas that can accomplish this?
The solution offered by fuglede is great, but if your data is very noisy (like the data in the picture) you will end up with lots of misleading local extrema. I suggest using scipy.signal.argrelextrema(). It has its own limitations, but it has a useful feature: you can specify the number of points to compare on each side, which acts as a kind of noise filter. For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import argrelextrema
# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])
n = 5 # number of points to be checked before and after
# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal,
order=n)[0]]['data']
df['max'] = df.iloc[argrelextrema(df.data.values, np.greater_equal,
order=n)[0]]['data']
# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
plt.plot(df.index, df['data'])
plt.show()
Some points:
You might need to check the points afterward to ensure there are no twin points very close to each other.
You can play with n to filter out noisy points.
argrelextrema returns a tuple, and the [0] at the end extracts a numpy array.
Assuming that the column of interest is labelled data, one solution would be
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1]*0.9 + r)
df = pd.DataFrame(xs, columns=['data'])
# Find local peaks
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]
# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
df.data.plot()
using Numpy
ser = np.random.randint(-40, 40, 100) # 100 points
peak = np.where(np.diff(ser) < 0)[0]
or
double_difference = np.diff(np.sign(np.diff(ser)))
# +1 because double_difference[i] describes the point at index i + 1
peak = np.where(double_difference == -2)[0] + 1
using Pandas
ser = pd.Series(np.random.randint(2, 5, 100))
peak_df = ser[(ser.shift(1) < ser) & (ser.shift(-1) < ser)]
peak = peak_df.index
You can do something similar to Foad's .argrelextrema() solution, but with the Pandas .rolling() function:
# Find local peaks
n = 5 #rolling period
local_min_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).min()]
local_max_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).max()]
# select the 'data' column explicitly so this also works when df has other columns
plt.scatter(local_min_vals.index, local_min_vals['data'], c='r')
plt.scatter(local_max_vals.index, local_max_vals['data'], c='g')
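To get the min and max columns the question asks for (NaN everywhere except at the extrema), the rolling comparison can be combined with Series.where. This is a minimal sketch on a tiny made-up series, so the result is easy to verify by eye; the window size of 3 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Tiny made-up series so the extrema are easy to check by eye.
df = pd.DataFrame({"data": [1, 3, 2, 0, 2, 5, 4]})

n = 3  # rolling period; odd, so each point sits at the centre of its window
roll = df["data"].rolling(n, center=True)

# Keep the value where it equals the window extreme, NaN elsewhere.
df["min"] = df["data"].where(df["data"] == roll.min())
df["max"] = df["data"].where(df["data"] == roll.max())

print(df)
```

The first and last points are never flagged, because their centred windows are incomplete and the rolling result there is NaN. On flat stretches every point in the plateau would be flagged, which is the same twin-point caveat mentioned for argrelextrema.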