I recreated the "simple example" from the documentation of the chemical_kinetics module, which is used to load, plot, and fit chemical kinetics data. I used the module to fit some of my own data successfully, but now I also want to get numerical values out of the data.Dataset that is created. The Dataset holds all the input data, but also the parameters and the output data used to plot the fitted line.
The following code is used to import the data:
ds = data.Dataset(
    files_c = ["data/concentrations vs time.csv"],
    t_label = "Time [a.u.]",
    c_label = "Concentration [a.u.]"
)
And the following adds the fitting data:
from chemical_kinetics import fit

fit.fit_dataset(
    dataset = ds,
    derivatives = derivatives,
    parameters = parameters,
    c0 = c0
)
The function
fit.print_result(ds)
only reports the values for each parameter, and
plot.plot_c(ds)
plots all the data I want to extract. I want to numerically extract the very data that is plotted.
I tried some of the usual methods for getting data out of a pandas.DataFrame, since the documentation says that data.Dataset creates pandas.DataFrames, but they never work.
The entire example is listed at the following link:
https://chemical-kinetics.readthedocs.io/en/latest/simple_example.html
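For reference, here is a minimal sketch of how the plotted data might be pulled off the Dataset object. The attribute names df_c and df_c_fit below are my assumptions (the docs suggest the Dataset stores its tables as DataFrame attributes), so inspect the object first to confirm what your version actually exposes:
import pandas as pd

# List every attribute of the Dataset that is actually a pandas DataFrame
for name in dir(ds):
    if isinstance(getattr(ds, name), pd.DataFrame):
        print(name)

# Assuming attributes named df_c (input data) and df_c_fit (fitted curves) exist:
ds.df_c.to_csv("concentrations_raw.csv", index=False)
ds.df_c_fit.to_csv("concentrations_fit.csv", index=False)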
I am using tsfresh to extract features from my data.
Initial data:
My initial data was time series data from a machine sensor.
I used the third column to add another column named hub, which represents the cycles of the machine. I also converted the timestamp to an integer "timestep" for each cycle, resulting in this DataFrame:
When extracting the features, the algorithm returns 787 features for each of my data rows.
from tsfresh import extract_features
extracted_features = extract_features(df_sample, column_id="hub", column_sort="step")
features = extracted_features.columns.tolist()
But when I use the select_features method with the labeled vector y, it gives back an empty DataFrame. I don't understand why.
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
impute(extracted_features)
features_filtered = select_features(extracted_features, y)
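A possible explanation worth checking (my assumption, not something established above): select_features runs a per-feature hypothesis test against y and keeps only features that pass an FDR correction, so an empty DataFrame means nothing passed at the default fdr_level of 0.05. It is also worth confirming that y is indexed by the same ids as the rows of extracted_features. A quick sketch to test the first point:
from tsfresh import select_features

# Loosen the FDR level (default 0.05); if features now survive,
# the significance tests were the reason for the empty result
features_filtered = select_features(extracted_features, y, fdr_level=0.5)
print(features_filtered.shape)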
I am pretty new to feature extraction. If anybody has any pointers as to how I could extract good features from cyclic time series data, I would be very thankful.
The plot of the sensor value over the crankshaft position is shown below.
I am creating Accumulated Local Effect (ALE) plots using Python's PyALE package, with a RandomForestRegressor model.
I can create 1D ALE plots. However, I get a ValueError when I try to create a 2D ALE plot using the same model and training data.
Here is my code:
ale(training_data, model=model1, feature=["feature1", "feature2"])
I can plot a 1D ALE plot for each of feature1 and feature2 with the following code:
ale(training_data, model=model1, feature=["feature1"], feature_type="continuous")
ale(training_data, model=model1, feature=["feature2"], feature_type="continuous")
There are no missing or infinite values for any column in the data frame.
I am getting the following error with the 2D ALE plot command:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This is a link to the package: https://pypi.org/project/PyALE/#description
I am not sure why I am getting this error. I would appreciate some help on this.
This issue was addressed in release v1.1.2 of the PyALE package. For those using earlier versions, the workaround mentioned in the issue thread on GitHub is to reset the index of the dataset fed to the function ale; the sampled data keeps its original, non-contiguous index, which earlier versions mishandled. For completeness, here's code that reproduces the error and the workaround:
from PyALE import ale
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.ensemble import RandomForestRegressor
# get the raw diamond data (from R's ggplot2)
dat_diamonds = pd.read_csv(
    "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv"
)
X = dat_diamonds.loc[:, ~dat_diamonds.columns.str.contains("price")].copy()
y = dat_diamonds.loc[:, "price"].copy()
features = ["carat","depth", "table", "x", "y", "z"]
# fit the model
model = RandomForestRegressor(random_state=1345)
model.fit(X[features], y)
# sample the data
random.seed(1234)
indices = random.sample(range(X.shape[0]), 10000)
sampleData = X.loc[indices, :]
# Get the effects
# This call throws the error:
ale_eff = ale(X=sampleData[features], model=model, feature=["z", "table"], grid_size=100)
# This will work, just reset the index with drop=True
ale_eff = ale(X=sampleData[features].reset_index(drop=True), model=model, feature=["z", "table"], grid_size=100)
I have two time series representing two independent periods of data observation. I would like to fit an autoregressive model to this data. In other words, I would like to perform two partial fits, or two sessions of incremental learning.
This is a simplified description of a not-unusual scenario which could also apply to batch fitting on large datasets.
How do I do this (in statsmodels or otherwise)? Bonus points if the solution can generalise to other time-series models like ARIMA.
In pseudocode, something like:
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
res = AutoReg(data_1, lags=12).fit()
res.aic
# This is more like what I would like to do
model = AutoReg(lags=12)
model.partial_fit(data_1)
model.partial_fit(data_2)
model.results.aic
Statsmodels does not directly have this functionality. As Kevin S mentioned, though, pmdarima does have a wrapper that provides it, specifically the update method. Per their documentation: "Update the model fit with additional observed endog/exog values."
See the example below, built around your particular code:
from pmdarima.arima import ARIMA
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
model = ARIMA(order=(12,0,0))
model.fit(data_1)
# Update the fitted model with the new observations
model.update(data_2)
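To then forecast from the updated model, pmdarima's predict takes the horizon as n_periods:
# Forecast the next 12 observations from the updated model
forecast = model.predict(n_periods=12)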
I don't know how to achieve that with AutoReg, but I think it can be done somehow; you may need to manually evaluate the results or add the data by hand.
But in ARIMA and SARIMAX it's already implemented, and it's simple.
For incremental learning there are three related functions, documented here. The first is apply, which uses the fitted parameters on new, unrelated data. Then there are extend and append; append can optionally refit. I don't know the exact difference, though.
Here is my example, which is different from yours but similar...
import numpy as np
from statsmodels.tsa.api import ARIMA

data = np.array(range(200))
order = (4, 2, 1)
model = ARIMA(data, order=order)
fitted_model = model.fit()
prediction = fitted_model.forecast(7)

# Apply the fitted parameters to entirely new data
new_data = np.array(range(600, 800))
fitted_model = fitted_model.apply(new_data)
new_prediction = fitted_model.forecast(7)

print(prediction)      # [200. 201. 202. 203. 204. 205. 206.]
print(new_prediction)  # [800. 801. 802. 803. 804. 805. 806.]
apply replaces all the data, so it can be used on unrelated data (with an unknown index). I profiled it, and apply is very fast in comparison to fit.
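For the incremental-fitting scenario in the question, here is a minimal sketch using append on the results object (append and extend are documented methods of statsmodels results classes; my reading is that refit=False keeps the original parameters while refit=True re-estimates them, but I have not verified the details):
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//2]
data_2 = data[len(data)//2:]

# Fit on the first period, then append the second period of observations
res = ARIMA(data_1, order=(12, 0, 0)).fit()
res_updated = res.append(data_2, refit=False)  # refit=True would re-estimate
print(res_updated.aic)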
I'm using the Copulae package, following the example at https://github.com/DanielBok/copulae.
My understanding is that the simulated values should have a distribution similar to the input ones.
I would like to see the output DataFrame of the fit (that is, the simulated data) and check its distribution. However, fitted, produced below, is not an array or DataFrame that I can open.
How can I extract the fitted data?
from copulae import NormalCopula
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(8)
data = np.random.normal(size=(300, 8))
plt.hist(data[:, 1], bins=100)  # checking the input data histogram

cop = NormalCopula(8)
cop.fit(data)  # fitting the copula to the data
fitted = cop.fit(data)
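As far as I can tell (partly an assumption on my part): fit estimates the copula parameters in place and returns the copula object itself, not simulated data. The fitted parameters live on the copula, and simulated samples come from its random method; note these are copula samples with uniform marginals, not reconstructions of the input data. A minimal sketch, continuing from the code above:
print(cop.params)  # the fitted copula parameters

# Draw 300 simulated observations from the fitted copula
simulated = cop.random(300)
plt.hist(simulated[:, 1], bins=100)
plt.show()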
I have the following code, which runs through the following steps:
Draw a number of points from a true distribution.
Use those points with curve_fit to extract the parameters.
Check whether those parameters are, on average, close to the true values.
(You can do this by creating the "pull distribution" and checking whether it follows a standard normal distribution.)
# This script calculates the mean and standard deviation for
# the pull distributions on the estimators that curve_fit returns
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import gauss  # user module providing draw_1dGauss and gaussFun

numTrials = 10000

# Pull given by (a_j - a_true)/a_error
error_vec_A = []
error_vec_mean = []
error_vec_sigma = []

# Loop to determine pull distribution
for i in range(numTrials):
    # Draw from primary distribution
    mean = 0; var = 1; sigma = np.sqrt(var)
    N = 20000
    A = 1/np.sqrt(2*np.pi*var)
    points = gauss.draw_1dGauss(mean, var, N)

    # Histogram parameters
    bin_size = 0.1; min_edge = mean - 6*sigma; max_edge = mean + 9*sigma
    Nn = (max_edge - min_edge)/bin_size
    bins = np.linspace(min_edge, max_edge, int(Nn) + 1)

    # Obtain histogram from primary distributions
    hist, bin_edges = np.histogram(points, bins, density=True)
    bin_centres = (bin_edges[:-1] + bin_edges[1:])/2

    # Initial guess
    p0 = [5, 2, 4]
    coeff, var_matrix = curve_fit(gauss.gaussFun, bin_centres, hist, p0=p0)

    # Get the fitted curve
    hist_fit = gauss.gaussFun(bin_centres, *coeff)

    # Error on the estimates (square roots of the covariance diagonal)
    error_parameters = np.sqrt(np.diag(var_matrix))

    # Obtain the pull for each parameter: A, mu, sigma
    A_std = (coeff[0] - A)/error_parameters[0]
    mean_std = (coeff[1] - mean)/error_parameters[1]
    sigma_std = (np.abs(coeff[2]) - sigma)/error_parameters[2]

    # Store results in containers
    error_vec_A.append(A_std)
    error_vec_mean.append(mean_std)
    error_vec_sigma.append(sigma_std)

# Plot the distribution of each estimator
plt.figure(1); plt.hist(error_vec_A, bins, density=True); plt.title('Pull of A')
plt.figure(2); plt.hist(error_vec_mean, bins, density=True); plt.title('Pull of Mu')
plt.figure(3); plt.hist(error_vec_sigma, bins, density=True); plt.title('Pull of Sigma')

# Store key information regarding distribution
mean_A = np.mean(error_vec_A); sigma_A = np.std(error_vec_A)
mean_mu = np.mean(error_vec_mean); sigma_mu = np.std(error_vec_mean)
mean_sigma = np.mean(error_vec_sigma); sigma_sig = np.std(error_vec_sigma)
info = np.array([[mean_A, sigma_A], [mean_mu, sigma_mu], [mean_sigma, sigma_sig]])
My problem is that I don't know how to use Python to format the data into a table. I have to manually go into the variables and paste them into Google Docs to present the information. I'm just wondering how I can do that using pandas or some other library.
Here's an example of the manual insertion:
                          Trial 1    Trial 2    Trial 3
Seed                      [0.2,0,1]  [10,2,5]   [5,2,4]
Bins for individual runs  20         20         20
Points Thrown             1000       1000       1000
Number of Runs            5000       5000       5000
Bins for pull dist fit    20         20         20
Mean_A                    -0.11177   -0.12249   -0.10965
sigma_A                   1.17442    1.17517    1.17134
Mean_mu                   0.00933    -0.02773   -0.01153
sigma_mu                  1.38780    1.38203    1.38671
Mean_sig                  0.05292    0.06694    0.04670
sigma_sig                 1.19411    1.18438    1.19039
I would like to automate this table, so that if I change the parameters in my code, I get a new table with the new data.
I would go with the csv module to generate a presentable table.
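A minimal sketch with the standard-library csv module, using the Trial 3 numbers from the manual table above as example data:
import csv

# (label, value) pairs taken from the Trial 3 column of the manual table
rows = [
    ("Mean_A", -0.10965), ("sigma_A", 1.17134),
    ("Mean_mu", -0.01153), ("sigma_mu", 1.38671),
    ("Mean_sig", 0.04670), ("sigma_sig", 1.19039),
]
with open("pull_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Quantity", "Trial 3"])
    writer.writerows(rows)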
If you're not already using it, the IPython notebook is really good for rendering rich display formats. It's really good in a lot of other ways, too.
It will render pandas DataFrame objects as an HTML table when they're either the last, unreturned value in a cell, or if you explicitly call the IPython.core.display.display function instead of print.
If you're not already using pandas, I highly recommend it. It's basically a wrapper around 2D and 3D numpy arrays; it's just as fast, but it has nice naming conventions, data grouping and filtering functions, and some other cool stuff.
At that point, it depends on how you want to present it. You can use nbconvert to render a whole notebook as static HTML or a PDF. You can copy-paste the HTML table into Excel or PowerPoint or an e-mail.
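For instance, a short sketch that wraps the info array computed at the end of the question's script in a labeled DataFrame:
import pandas as pd

# info holds [[mean_A, sigma_A], [mean_mu, sigma_mu], [mean_sigma, sigma_sig]]
summary = pd.DataFrame(info, index=["A", "mu", "sigma"],
                       columns=["pull mean", "pull std"])
summary  # as the last expression in a notebook cell, this renders as an HTML table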