How to use Copulae in Python

I'm using the Copulae package, following the example at https://github.com/DanielBok/copulae.
My understanding is that the simulated values should have a similar distribution to the input ones.
I would like to see the output dataframe of the fit (that is, the simulated data) and check its distribution. However, "fitted" produced below is not an array or dataframe that I can open.
How can I extract the fitted data?
from copulae import NormalCopula
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(8)
data = np.random.normal(size=(300, 8))
plt.hist(data[:, 1], bins=100)  # checking input data histogram

cop = NormalCopula(8)
fitted = cop.fit(data)  # fitting data with copula
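If what you want are simulated draws, a minimal sketch, assuming the random() method shown in the package README (fit() fits the parameters in place and returns the copula object itself, not simulated data):

# `fitted` is the NormalCopula object, not an array of simulated values.
simulated = cop.random(300)  # draw 300 simulated observations

# compare a simulated column's histogram with the input's; note that copula
# draws live on [0, 1] (uniform marginals), so they will not match the
# normal input histogram directly
plt.hist(simulated[:, 1], bins=100)
plt.show()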


How to implement a Butterworth filter

I am trying to implement a Butterworth filter with Python in a Jupyter notebook. I wrote this code following a tutorial.
The data come from a CSV file called Samples.csv.
The data in Samples.csv look like this:
998,4778415
1009,209592
1006,619094
1001,785406
993,9426543
990,1408991
992,736118
995,8127334
1002,381664
1006,094429
1000,634799
999,3287747
1002,318812
999,3287747
1004,427698
1008,516733
1007,964781
1002,680906
1000,14449
994,257009
The column is called Euclidian Norm. The data range from 0 to 1679.286158 and there are 1838 rows.
This is the code in Jupyter:
from scipy.signal import filtfilt
from scipy import stats
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
def plot():
    data = pd.read_csv('Samples.csv', sep=";", decimal=",")
    sensor_data = data[['Euclidian Norm']]
    sensor_data = np.array(sensor_data)
    time = np.linspace(0, 1679.286158, 1838)
    plt.plot(time, sensor_data)
    plt.show()
    filtered_signal = bandPassFilter(sensor_data)
    plt.plot(time, sensor_data)
    plt.show()

def bandPassFilter(signal):
    fs = 4000.0
    lowcut = 20.0
    highcut = 50.0
    nyq = 0.5 * fs
    low = lowcut / nyq
    high = highcut / nyq
    order = 2
    b, a = scipy.signal.butter(order, [low, high], 'bandpass', analog=False)
    y = scipy.signal.filtfilt(b, a, signal, axis=0)
    return y

plot()
My problem is that nothing changes in my data. It doesn't filter my data; the graph of the filtered data is the same as the source data. Does anyone know what could be wrong?
The first graph is the source data and the second graph is the filtered data. They look identical, as if they were the same graph.
I can't comment yet.
You're never using filtered_signal: plot() calls plt.plot with the same arguments twice.
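A minimal fix to the question's plot() (only the second plt.plot call changes):

def plot():
    data = pd.read_csv('Samples.csv', sep=";", decimal=",")
    sensor_data = np.array(data[['Euclidian Norm']])
    time = np.linspace(0, 1679.286158, 1838)
    plt.plot(time, sensor_data)        # raw signal
    plt.show()
    filtered_signal = bandPassFilter(sensor_data)
    plt.plot(time, filtered_signal)    # plot the filtered signal, not the raw one
    plt.show()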
Here's one of my implementations with added interpolation, very similar to yours:
import numpy as np
import scipy.signal
import scipy.interpolate
import matplotlib.pyplot as plt

def butterFit(data, freq, order=2):
    b, a = scipy.signal.butter(order, freq)  # gets coefficients for filtfilt
    return scipy.signal.filtfilt(b, a, data)

def plotFilteredSplines(timeframe, data, amount_points):
    # Generate evenly spread indices for the data points.
    indices = np.arange(0, len(data), amount_points)
    cutoff_freq = 2 / (2 / 10 * len(timeframe))
    # Smooth the data with butter :)
    data = butterFit(data, cutoff_freq)
    # Plot filtered data.
    plt.plot(timeframe, data, '-.')
    interpol_x = np.linspace(timeframe[0], timeframe[-1], 100)
    # Get the cubic spline approximation function.
    interpolation = scipy.interpolate.interp1d(timeframe, data, kind='cubic')
    # Plot the interpolation over the extended time frame.
    plt.plot(interpol_x, interpolation(interpol_x), '-r')
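A hypothetical call, assuming the definitions above and a made-up test signal shaped like the question's data:

t = np.linspace(0, 1679.286158, 1838)
signal = 1000 + np.sin(2 * np.pi * t / 200) + np.random.normal(0, 0.5, t.size)  # synthetic stand-in for Samples.csv
plotFilteredSplines(t, signal, amount_points=10)
plt.show()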

Plotting the noise spectrum of the data

I have a single-column file (containing only one column) and a matrix file (containing 10 columns) of noisy data, and I want to plot the noise spectrum of both files using Python.
Sample data for the single-column file:
-1.064599999999999921e-02
-1.146800000000000076e-02
-1.011899999999999952e-02
-7.400200000000000007e-03
-4.306500000000000432e-03
-1.644800000000000081e-03
1.936600000000000127e-04
1.239199999999999980e-03
1.759200000000000043e-03
2.019799999999999981e-03
2.148699999999999916e-03
2.153099999999999806e-03
2.008799999999999822e-03
1.700899999999999981e-03
1.181500000000000042e-03
3.194000000000000116e-04
-1.072000000000000036e-03
-3.133799999999999954e-03
And sample data for the matrix file:
-2.596100000000000057e-03 -1.856000000000000011e-03 -1.821400000000000102e-02 5.023599999999999594e-03 -1.064599999999999921e-02 -1.906300000000000008e-02 -6.370799999999999380e-05 5.814800000000000177e-03 -5.391800000000000412e-03 -1.311000000000000013e-02
1.636700000000000047e-03 -8.651600000000000176e-04 -2.490799999999999959e-02 1.645399999999999988e-02 -1.146800000000000076e-02 -4.609199999999999929e-03 6.475800000000000306e-03 1.265800000000000085e-02 1.855799999999999898e-03 -5.387499999999999928e-03
4.516499999999999682e-03 1.438899999999999901e-03 -2.911599999999999952e-02 2.590800000000000047e-02 -1.011899999999999952e-02 2.378800000000000012e-02 1.080200000000000084e-02 1.994299999999999892e-02 8.882299999999999224e-03 2.866500000000000124e-03
5.604699999999999786e-03 4.557799999999999872e-03 -2.870800000000000088e-02 2.832300000000000095e-02 -7.400200000000000007e-03 2.882099999999999940e-02 1.145799999999999944e-02 2.488800000000000040e-02 1.367299999999999939e-02 8.998799999999999508e-03
4.797400000000000275e-03 7.657399999999999970e-03 -2.582800000000000026e-02 2.288000000000000103e-02 -4.306500000000000432e-03 8.315499999999999975e-03 7.967600000000000030e-03 2.487999999999999934e-02 1.516600000000000066e-02 1.177899999999999954e-02
2.314300000000000038e-03 9.749700000000000033e-03 -2.252099999999999935e-02 1.762000000000000025e-02 -1.644800000000000081e-03 -1.257800000000000064e-02 1.220600000000000070e-03 1.866299999999999903e-02 1.377199999999999952e-02 1.163999999999999931e-02
-1.290700000000000094e-03 9.894599999999999923e-03 -1.928900000000000059e-02 1.360300000000000051e-02 1.936600000000000127e-04 -2.438999999999999849e-02 -6.739199999999999878e-03 6.961199999999999853e-03 1.086299999999999939e-02 1.015199999999999957e-02
-5.137400000000000300e-03 7.453800000000000009e-03 -1.615099999999999869e-02 1.018799999999999914e-02 1.239199999999999980e-03 -1.585699999999999957e-02 -1.349500000000000005e-02 -7.773600000000000301e-03 7.680499999999999827e-03 9.148399999999999241e-03
-8.159500000000000086e-03 2.403600000000000094e-03 -1.270400000000000001e-02 5.359000000000000048e-03 1.759200000000000043e-03 -9.746799999999999908e-03 -1.730999999999999900e-02 -2.229599999999999985e-02 4.641100000000000433e-03 9.871700000000000613e-03
-9.419600000000000195e-03 -4.305599999999999705e-03 -8.259700000000000028e-03 -3.140800000000000015e-03 2.019799999999999981e-03 -5.883300000000000161e-03 -1.772100000000000064e-02 -2.695099999999999926e-02 1.592399999999999892e-03 1.255299999999999992e-02
-8.469000000000000833e-03 -1.101399999999999949e-02 -2.205400000000000155e-03 -1.641199999999999951e-02 2.148699999999999916e-03 -3.635199999999999890e-03 -1.558000000000000010e-02 -1.839000000000000010e-02 -1.408900000000000039e-03 1.642899999999999916e-02
-5.529599999999999967e-03 -1.553999999999999999e-02 5.413199999999999956e-03 -4.248000000000000040e-03 2.153099999999999806e-03 -2.403199999999999868e-03 -1.255099999999999966e-02 -8.339100000000000332e-03 -3.665700000000000035e-03 2.009499999999999828e-02
I tried following https://www.earthinversion.com/techniques/visualizing-power-spectral-density-demo-obspy/ but could not make it work for my ASCII data set. I hope experts may help me. Thanks in advance.
Maybe this can give you a start. Given your matrix of data in the file "x.data", this plots the raw data as 10 curves, then runs an FFT on each column and displays the FFT. The FFT isn't very interesting with only 12 points, but it will spark ideas.
There's still the problem of "how do you define noise?" The signals you presented do not seem to be very noisy. Unless you know what kind of signal you're expecting, an FFT might not do much good.
import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

data = np.loadtxt("x.data")

for i in range(data.shape[1]):
    plt.plot(data[:, i])
plt.show()

for i in range(data.shape[1]):
    f = scipy.fftpack.fft(data[:, i])
    plt.plot(np.abs(f))
plt.show()
Use numpy.loadtxt() to convert the data to a numpy array. Then you can apply the method described in the link you provided in order to obtain the spectra. E.g.:
import numpy as np
data = np.loadtxt("file.txt")
Then you plot the spectrum for that data. E.g.:
import matplotlib.pyplot as plt
import scipy.fftpack
yf = scipy.fftpack.fft(data)
fig, ax = plt.subplots()
ax.plot(np.abs(yf))
plt.show()
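If you want a physical frequency axis rather than raw bin indices, here is a sketch, assuming the single-column file and a placeholder sample spacing dt (the question does not state the sampling rate):

import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

data = np.loadtxt("file.txt")
dt = 1.0  # placeholder sample spacing; replace with your real one
n = len(data)
freqs = scipy.fftpack.fftfreq(n, d=dt)[:n // 2]        # positive frequencies only
power = np.abs(scipy.fftpack.fft(data))[:n // 2] ** 2  # power spectrum
plt.plot(freqs, power)
plt.xlabel("frequency")
plt.ylabel("power")
plt.show()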

Value Error with Python's PyALE function for 2D ALE plots

I am creating Accumulated Local Effect (ALE) plots using Python's PyALE function, with a RandomForestRegressor as the model.
I can create 1D ALE plots. However, I get a ValueError when I try to create a 2D ALE plot using the same model and training data.
Here is my code.
ale(training_data, model=model1, feature=["feature1", "feature2"])
I can plot a 1D ALE plot for feature1 and feature2 with the following code.
ale(training_data, model=model1, feature=["feature1"], feature_type="continuous")
ale(training_data, model=model1, feature=["feature2"], feature_type="continuous")
There are no missing or infinite values for any column in the data frame.
I am getting the following error with the 2D ALE plot command.
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This is a link to the function https://pypi.org/project/PyALE/#description
I am not sure why I am getting this error. I would appreciate some help on this.
Thank you,
Rohin
This issue was addressed in release v1.1.2 of the PyALE package. For those using earlier versions, the workaround mentioned in the issue thread on GitHub is to reset the index of the dataset fed to the function ale. For completeness, here's code that reproduces the error and the workaround:
from PyALE import ale
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.ensemble import RandomForestRegressor
# get the raw diamond data (from R's ggplot2)
dat_diamonds = pd.read_csv(
    "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv"
)
X = dat_diamonds.loc[:, ~dat_diamonds.columns.str.contains("price")].copy()
y = dat_diamonds.loc[:, "price"].copy()
features = ["carat","depth", "table", "x", "y", "z"]
# fit the model
model = RandomForestRegressor(random_state=1345)
model.fit(X[features], y)
# sample the data
random.seed(1234)
indices = random.sample(range(X.shape[0]), 10000)
sampleData = X.loc[indices, :]
# get the effects.....
# This throws the error
ale_eff = ale(X=sampleData[features], model=model, feature=["z", "table"], grid_size=100)
# This will work, just reset the index with drop=True
ale_eff = ale(X=sampleData[features].reset_index(drop=True), model=model, feature=["z", "table"], grid_size=100)
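Alternatively, since the fix shipped in v1.1.2, upgrading PyALE (for example via pip install --upgrade PyALE) should make the reset_index workaround unnecessary.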

Get better fit on test data using Auto_Arima

I am using the AirPassengers dataset to predict a time series. For the model, I chose auto_arima to forecast the predicted values. However, it seems the order chosen by auto_arima is unable to fit the data, as the resulting chart shows.
What can I do to get a better fit?
My code for those that want to try:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from pmdarima import auto_arima
df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")
df = df.rename(columns={"#Passengers":"Passengers"})
df.Month = pd.to_datetime(df.Month)
df.set_index('Month',inplace=True)
train,test=df[:-24],df[-24:]
model = auto_arima(train,trace=True,error_action='ignore', suppress_warnings=True)
model.fit(train)
forecast = model.predict(n_periods=24)
forecast = pd.DataFrame(forecast,index = test.index,columns=['Prediction'])
plt.plot(train, label='Train')
plt.plot(test, label='Valid')
plt.plot(forecast, label='Prediction')
plt.show()
from sklearn.metrics import mean_squared_error
print(mean_squared_error(test['Passengers'],forecast['Prediction']))
Thank you for reading. Any advice is appreciated.
This series is not stationary, and no amount of differencing will make it so (notice that the amplitude of the variations keeps increasing). However, transforming the data first by taking logs should do better (experiment shows that it does do better, but not what I would call well). Setting the seasonality with m=12, as I suggest in the comment, and taking logs produces a fit that is essentially perfect.
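A sketch of that combination, assuming the question's train/test split, np.log/np.exp for the transform, and m=12 for monthly data:

log_train = np.log(train)  # stabilizes the growing seasonal amplitude
model = auto_arima(log_train, m=12, trace=True,
                   error_action='ignore', suppress_warnings=True)
forecast = np.exp(model.predict(n_periods=24))  # back-transform to passenger counts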
The problem was that I did not specify m. In this case I assigned m the value 12, denoting a monthly cycle, i.e. each data row is a month. That's how I understand it (source).
Feel free to comment; I'm not entirely sure, as I am new to using ARIMA.
Code:
model = auto_arima(train,m=12,trace=True,error_action='ignore', suppress_warnings=True)
Just add m=12,to denote that the data is monthly.
Result: the forecast now tracks the test data closely.

Output K-Means to CSV with SciKit Learn - give cluster names

I have the below scikit-learn script, which outputs a nice chart with each of the clusters.
I have a couple of questions:
- How can I export this to CSV, with a cluster name or ID?
- How can I name the clusters?
- How can I make sure the clusters are always named the same thing? For example, I want to call the top-right segment 'high spenders'; how do I do that so it is always labeled correctly?
Thanks!
#import the required libraries
# - matplotlib is a charting library
# - Seaborn builds on top of Matplotlib and introduces additional plot types. It also makes your traditional Matplotlib plots look a bit prettier.
# - Numpy is numerical Python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.datasets import make_blobs  # in older scikit-learn: sklearn.datasets.samples_generator
from sklearn.cluster import KMeans
#Generate sample data, with distinct clusters for testing
#n_samples = the number of datapoints, equally split across each clusters
#centers = The number of centers to generate (number of clusters) - a center is the arithmetic mean of all the points belonging to the cluster.
#cluster_std = the standard deviation of the clusters - a quantity expressing by how much the members of a group differ from the mean value for the group (how tight is the cluster going to be)
#random_state = controls the random number generator being used. If you don't set
# random_state, a new random value is generated each time you execute the code, so the
# train and test datasets would have different values each time. With a fixed value
# (random_state=1 or any other value) the result will be the same every time, i.e. the
# same values in the train and test datasets.
#make_blobs generates "isotropic Gaussian blobs" - X is a numpy array with two columns which contain the (x, y) Gaussian coordinates of these points, whereas y contains the list of categories for each.
#X, y = simply means that the output of make_blobs() has two elements, that are assigned to X and y.
X, y = make_blobs(n_samples=300, centers=4,
                  cluster_std=0.50, random_state=0)
#X now looks like this - column zero becomes the X axis, column1 becomes the Y axis
array([[ 1.85219907,  1.10411295],
       [-1.27582283,  7.76448722],
       [ 1.0060939 ,  4.43642592],
       [-1.20998253,  7.83203579],
       [ 1.92461484,  1.06347673],
       [ 2.28565919,  0.79166208],
       [-1.57379043,  2.69773813],
       [ 1.04917913,  4.31668562],
       [-1.07436851,  7.93489945],
       [-1.15872975,  7.97295642],
       ...])
#The below statement, will enable us to visualise matplotlib charts, even in ipython
#Using matplotlib backend: MacOSX
#Populating the interactive namespace from numpy and matplotlib
%pylab
#plot the chart
#s = the size of the points.
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], s=50);
#now, I am defining that I want to find 4 clusters within the data. The general rule I follow is to have 7 times fewer clusters than datapoints.
kmeans = KMeans(n_clusters=4)
#build the model, based on X with the number of clusters defined above
kmeans.fit(X)
#now we're going to find clusters in the randomly generated dataset
predict = kmeans.predict(X)
#now we can plot the prediction
#c = colour, which is based on the predict variable we defined above
#s = the size of the plots
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], c=predict, s=50)
Based on your code, the following worked for me. You can certainly stay with numpy for storing the CSV, but I simply prefer pandas. The sorting line should give you the same results every time you run the code. However, since the initialization of the clusters can have an impact, I would also set a seed in your code, e.g. np.random.seed(42), and call the KMeans function with the random_state parameter, e.g. kmeans = KMeans(n_clusters=4, random_state=42).
# transform to dataframe
import pandas as pd
import seaborn as sns
df = pd.DataFrame(X)
df.columns = ["var1", "var2"]
df["cluster"] = predict
colors = sns.color_palette()[0:4]
df = df.sort_values("cluster")
# check plot
sns.scatterplot(df["var1"], df["var2"], hue=df["cluster"], palette=colors)
plt.show()
# define rename schema
mynames = {"0": "center_left", "1": "top_left", "2": "bot_right", "3": "center"}
df["cluster_name"] = [mynames[str(i)] for i in df.cluster]
# plot again to verify order
sns.scatterplot(df["var1"], df["var2"], hue=df["cluster_name"],
palette=colors)
sns.despine()
plt.show()
# save dataframe as CSV
df.to_csv("myoutput.csv")
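A minimal sketch of the seeding mentioned above, assuming the same X as in the question, so the cluster IDs (and therefore the names) are reproducible across runs:

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(42)                              # seed numpy's global RNG
kmeans = KMeans(n_clusters=4, random_state=42)  # fix the centroid initialization
predict = kmeans.fit_predict(X)                 # labels are now stable across runs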
The first plot shows the clusters colored by numeric ID, the second shows them colored by cluster_name, and the CSV contains var1, var2, cluster, and cluster_name for every point.
