DataFrame NumPy value accuracy


I'm using the following code to scale values from one interval to another. "outputs_max" and "outputs_min" are numpy arrays, and so (as a result) are "slope" and "intercept".
To display the result "scaled_outputs" more clearly, I used pandas to create a DataFrame from the file "out.npy", which I called "output_array". The resulting "scaled_outputs" is therefore a DataFrame too, and is later stored as a numpy file.
import pandas as pd
import numpy as np
output_file = np.load("U:\\out.npy")
output_array = pd.DataFrame(output_file)
desired_upper_bound = 1
desired_lower_bound = 0
slope = (desired_upper_bound - desired_lower_bound) / (outputs_max - outputs_min)
intercept = desired_upper_bound - (slope * rounded_outputs_max)
scaled_outputs = slope * output_array + intercept
np.save("U:\\scaled_outputs.npy", scaled_outputs)
Am I losing accuracy in the values by creating a DataFrame and passing it into the equation? Would it be better to pass the numpy array "output_file" and create a DataFrame from "scaled_outputs"?
The result in the console is displayed with at most 5 decimals, which is why I'm asking.

No, you're not losing precision or accuracy by using a dataframe in your equation. What you're seeing on the console is a result of display precision. You can change the display.precision property to see more digits when a dataframe is displayed.
pd.set_option("display.precision", 10)
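To check this yourself, a minimal sketch along the following lines may help. The sample values and the slope and intercept (2.0 and 0.5) are made up for illustration and are not taken from the question's data. It compares the same scaling done on a plain array and on a DataFrame, then prints the DataFrame with a higher display precision.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values standing in for the contents of out.npy
values = np.array([0.123456789012345, 1 / 3, 2 / 3])
frame = pd.DataFrame(values)

# Same linear scaling applied to the raw array and to the DataFrame
scaled_from_array = 2.0 * values + 0.5
scaled_from_frame = (2.0 * frame + 0.5).to_numpy().ravel()

# The DataFrame column keeps the float64 dtype of the array
print(frame.dtypes.iloc[0])

# The stored numbers are identical; only the printed form differs
print(np.array_equal(scaled_from_array, scaled_from_frame))

# Temporarily raise the display precision to see more digits
with pd.option_context("display.precision", 15):
    print(frame)
```

Using pd.option_context instead of pd.set_option keeps the change local to the with block, which can be convenient when you only want the extra digits for one printout.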

Related

How to make a data frame combining different regression results in python?

I am running some regression models to predict performance.
After running the models I created a variable to see the predictions (y_pred_* are lists with 2567 values):
y_pred_LR = regressor.predict(X_test)
y_pred_SVR = regressor2.predict(X_test)
y_pred_RF = regressor3.predict(X_test)
The type of these prediction arrays is "Array of float64", while y_test is a DataFrame.
I wanted to create a table with the results. I tried several approaches (calling them as lists, trying to convert them, trying to select the values) and have not succeeded so far. Could anyone help?
My last attempt was as follows:
comparison = pd.DataFrame({'Real': y_test, 'LR': y_pred_LR, 'RF': y_pred_RF, 'SVM': y_pred_SVR})
In this case the DataFrame is created but the values don't appear.
Additionally, I would like to create two new rows with the mean and standard deviation of the results, and these rows should be located at the beginning (the first rows) of the DataFrame.
Thanks

import pandas as pd
import numpy as np
real = np.array([2] * 10).reshape(-1,1)
y_pred_LR = np.array([0] * 10)
y_pred_SVR = np.array([1] * 10)
y_pred_RF = np.array([5] * 10)
real = real.flatten()
comparison = pd.DataFrame({'real':real,'y_pred_LR':y_pred_LR,'y_pred_SVR':y_pred_SVR,"y_pred_RF":y_pred_RF})
Mean = comparison.mean(axis=0)
StD = comparison.std(axis=0)
Mean_StD = pd.concat([Mean,StD],axis=1).T
result = pd.concat([Mean_StD,comparison],ignore_index=True)
print(result)
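A variation on the code above, as a rough sketch: it assumes y_test is a single-column DataFrame (the data and the column name "target" here are hypothetical), flattens it to a 1-D array before building the table, and keeps the labels "mean" and "std" on the two summary rows instead of resetting the index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for y_test (single-column DataFrame) and the predictions
y_test = pd.DataFrame({"target": [2.0, 4.0, 6.0]}, index=[10, 11, 12])
y_pred_LR = np.array([1.0, 4.0, 7.0])
y_pred_RF = np.array([2.0, 5.0, 5.0])

# Flatten y_test to 1-D so every column is a plain array of equal length
comparison = pd.DataFrame(
    {
        "Real": y_test.to_numpy().ravel(),
        "LR": y_pred_LR,
        "RF": y_pred_RF,
    },
    index=y_test.index,
)

# Column-wise mean and standard deviation as two labelled rows
summary = comparison.agg(["mean", "std"])

# Put the summary rows first, followed by the per-sample values
result = pd.concat([summary, comparison])
print(result)
```

Keeping the row labels makes it easier to tell the summary rows apart from the data rows; use ignore_index=True as in the answer above if a plain 0..n index is preferred.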

Limited factor loading output in python factor-analyzer

I am new here but I hope you guys can help me out.
I'm trying to conduct factor analysis with word vectors in python using the Factor-Analyzer module. I have a DataFrame with 100 columns and more than 15,000 rows.
I did not receive any error when I performed a factor analysis. Below is the output:
FactorAnalyzer(bounds=(0.005, 1), impute='median', is_corr_matrix=False,
method='minres', n_factors=10, rotation='varimax',
rotation_kwargs={}, use_smc=True)
But when I try to get the loadings, it only returns 100 rows. I want to get the loadings for all rows.
Here is my code:
import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
import numpy as np
import pickle
factor_df = pd.read_pickle("word_vectors.pkl")
factor_df = pd.DataFrame(data=factor_df)
fa = FactorAnalyzer(n_factors=10, rotation='varimax')
fa.fit(factor_df)
loading = fa.loadings_
loadings_df = pd.DataFrame(fa.loadings_)
loadings_df
The pickle file for my dataset is here.

Factor loadings are the weights and correlations between each variable (column in your DataFrame) and the factor, so the fa.loadings_ object is an array with shape (number_of_variables, number_of_factors) - in your example (100, 10).
If you would like to get a transformed DataFrame with only 10 columns in each row, you should call fa.transform(factor_df) after fa.fit(factor_df). The returned array will have shape (number_of_rows, number_of_factors) - in your example (15_000, 10).
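The difference between the two shapes can be illustrated without the factor_analyzer package. The sketch below is an assumption-laden stand-in: it uses a principal-component style decomposition in plain NumPy on random data, not the minres method that FactorAnalyzer uses, and the sizes are scaled down. It only shows why loadings have one row per variable while scores have one row per observation.

```python
import numpy as np

# Hypothetical, scaled-down data: 1500 rows (observations), 20 columns (variables)
rng = np.random.default_rng(0)
n_rows, n_variables, n_factors = 1500, 20, 3
data = rng.normal(size=(n_rows, n_variables))

# Standardize each column, then decompose the correlation matrix
z = (data - data.mean(axis=0)) / data.std(axis=0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)

# Keep the components with the largest eigenvalues
order = np.argsort(eigvals)[::-1][:n_factors]

# Loadings: one row per variable, one column per factor
loadings = eigvecs[:, order] * np.sqrt(eigvals[order])

# Scores: one row per observation, one column per factor
scores = z @ eigvecs[:, order]

print(loadings.shape)  # (20, 3)
print(scores.shape)    # (1500, 3)
```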

How to efficiently index a numpy array based on varying start and stop indexes per row

I have a 2D numpy array with rows being time series of a feature, based on which I'm training a neural network. For generalisation purposes, I would like to subset these time series at random points. I'd like them to have a minimum subset length as well. However, the network requires fixed length time series, so I need to pre-pad the resulting subsets with zeroes.
Currently, I'm doing it using the code below, which includes a nasty for-loop, because I don't know how to use fancy indexing for this particular problem. As this piece of code is part of the network data generator, it needs to be fast to keep pace with the data-hungry GPU. Does anyone know a numpy way of doing this without the for-loop?
import numpy as np
import matplotlib.pyplot as plt
# Amount of time series to consider
batchsize = 25
# Original length of the time series
timesteps = 150
# As an example, fill the 2D array with sine function time series
sinefunction = np.expand_dims(np.sin(np.arange(timesteps)), axis=0)
originalarray = np.repeat(sinefunction, batchsize, axis=0)
# Now the real thing, we want:
# - to start the time series at a random moment (between 0 and maxstart)
# - to end the time series at a random moment
# - however with a minimum length of the resulting subset time series (minlength)
maxstart = 50
minlength = 75
# get random starts
randomstarts = np.random.choice(np.arange(0, maxstart), size=batchsize)
# get random stops
randomstops = np.random.choice(np.arange(maxstart + minlength, timesteps), size=batchsize)
# determine the resulting random sizes of the subset time series
randomsizes = randomstops - randomstarts
# finally create a new 2D array with all the randomly subset time series, however pre-padded with zeros
# THIS IS THE FOR LOOP WE SHOULD TRY TO AVOID
cutarray = np.zeros_like(originalarray)
for i in range(batchsize):
    cutarray[i, -randomsizes[i]:] = originalarray[i, randomstarts[i]:randomstops[i]]
To show what goes in and out of the function:
# Show that it worked
f, ax = plt.subplots(2, 1)
ax[0].imshow(originalarray)
ax[0].set_title('original array')
ax[1].imshow(cutarray)
ax[1].set_title('zero-padded subset array')

Approach #1 : Views-based
We can leverage scikit-image's view_as_windows, which is based on np.lib.stride_tricks.as_strided, to get sliding windowed views into a zero-padded version of the input and assign into a zero-padded version of the output. All of that padding is needed for a vectorized solution on account of the ragged nature of the subsets. The upside is that working on views is efficient in both memory and performance.
The implementation would look something like this -
from skimage.util.shape import view_as_windows
n = randomsizes.max()
max_extent = randomstarts.max()+n
padlen = max_extent - originalarray.shape[1]
p = np.zeros((originalarray.shape[0],padlen),dtype=originalarray.dtype)
a = np.hstack((originalarray,p))
w = view_as_windows(a,(1,n))[...,0,:]
out_vals = w[np.arange(len(randomstarts)),randomstarts]
out_starts = originalarray.shape[1]-randomsizes
out_extensions_max = out_starts.max()+n
out = np.zeros((originalarray.shape[0],out_extensions_max),dtype=originalarray.dtype)
w2 = view_as_windows(out,(1,n))[...,0,:]
w2[np.arange(len(out_starts)),out_starts] = out_vals
cutarray_out = out[:,:originalarray.shape[1]]
Approach #2 : With masking
cutarray_out = np.zeros_like(originalarray)
r = np.arange(originalarray.shape[1])
m = (randomstarts[:,None]<=r) & (randomstops[:,None]>r)
s = originalarray.shape[1]-randomsizes
m2 = s[:,None]<=r
cutarray_out[m2] = originalarray[m]
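As a sanity check of the masking idea, here is a self-contained sketch that rebuilds the question's setup and compares the masked result with the for-loop. The names cols, src_mask, dst_mask, cut_loop and cut_masked are mine, and np.random.default_rng is used in place of np.random.choice. The approach relies on boolean indexing visiting elements in row-major order, so each row's selected source values land in that row's right-aligned destination slots.

```python
import numpy as np

rng = np.random.default_rng(0)
batchsize, timesteps = 25, 150
maxstart, minlength = 50, 75

# Same example data as in the question: repeated sine time series
originalarray = np.repeat(np.sin(np.arange(timesteps))[None, :], batchsize, axis=0)
randomstarts = rng.integers(0, maxstart, size=batchsize)
randomstops = rng.integers(maxstart + minlength, timesteps, size=batchsize)
randomsizes = randomstops - randomstarts

# Reference result using the original for-loop
cut_loop = np.zeros_like(originalarray)
for i in range(batchsize):
    cut_loop[i, -randomsizes[i]:] = originalarray[i, randomstarts[i]:randomstops[i]]

# Vectorized version: one mask for the source slice, one for the destination slots
cols = np.arange(timesteps)
src_mask = (randomstarts[:, None] <= cols) & (cols < randomstops[:, None])
dst_mask = cols >= (timesteps - randomsizes)[:, None]

cut_masked = np.zeros_like(originalarray)
cut_masked[dst_mask] = originalarray[src_mask]

print(np.array_equal(cut_loop, cut_masked))
```

Both masks select exactly randomsizes[i] elements in row i, which is what makes the flat assignment line up row by row.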

Python 3.7+Numpy+pandas Arrays Selecting data between a range

OK, I'll try to explain my problem. I have a CSV file with data; the columns are wavelength and amplitude, as shown in the image.
[Image: CSV data]
I want to select only the data between 500 nm and 800 nm (wave):
import pandas as pd
import numpy as np
excelfile=pd.read_csv('Files/660nm.csv');
excelfile.head();
wave = excelfile['Longitud'];
wave = np.array(wave);
X = excelfile['Amplitud'];
X = np.array(X);
wave = wave[(wave > 500) & (wave < 800)]
This does what I want at first, but I want to extend this selection to the amplitude column (X), so that I have two arrays of the same dimensions. In my actual code I have to build an index to select the data in the amplitude array (X):
indices = np.arange(382,775,1)
X = np.take(X, indices)
But this is not good practice. If I could extend the selection on the first column to the amplitude column, I wouldn't have to build another array to index X and check its extent. Any ideas?
Thanks.

As @ALollz pointed out, you shouldn't split the DataFrame up. Instead, just filter the whole DataFrame on wavelength. See the docs for DataFrame.loc.
import pandas as pd
import numpy as np
# some dummy data
excelfile = pd.DataFrame({'Longitud': np.random.random(100) * 1000,
                          'Amplitud': np.arange(100)})
wave = excelfile['Longitud']
excelfile_filtered = excelfile.loc[(wave > 500) & (wave < 800)]
X = excelfile_filtered['Amplitud'].values  # yields an array
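If you do want to keep two separate NumPy arrays, as in the question, the same boolean mask can be stored once and applied to both. A minimal sketch with made-up sample values (the names mask, wave_sel and X_sel are mine):

```python
import numpy as np

# Hypothetical wavelength (nm) and amplitude samples
wave = np.array([450.0, 500.0, 550.0, 700.0, 800.0, 900.0])
X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Build the mask once from the wavelengths...
mask = (wave > 500) & (wave < 800)

# ...and apply it to both arrays so they stay aligned
wave_sel = wave[mask]
X_sel = X[mask]

print(wave_sel)  # [550. 700.]
print(X_sel)     # [0.3 0.4]
```

This avoids hard-coding index positions such as np.arange(382, 775, 1), which would silently go out of date if the file changes.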

Difference and subsequent error between using Pandas Series and Numpy Arrays

When doing some estimations, calculations, and other fun stuff in Python I came across something really weird and upsetting.
I have this thing where I estimate some parameters using ML-estimation, and have previously assumed that everything was peachy and fine. I read csv-data with pandas, and use the subsequent data for the estimation. Therefore, the data has originally been passed down to the ML-estimation function as Pandas Series. Today I wanted to try some matrix-operations on a thing in the calculation for kicks-and-giggles, and converted the input-data to numpy arrays. However, when I ran the code, the estimation results were different. After restoring some of the multiplications, it was still different. Then I changed back to using Pandas series, and it returned to the previously expected result.
This is where I got curious and now turn to you. Is there a rounding difference between float64 Numpy arrays and float64 Pandas Series that is large enough to make my calculations turn out so drastically different?
Consider the following code example containing a sample from my ML-estimator:
import pandas as pd
import numpy as np
import math
values = [3.41527085753, 3.606855606852, 3.5550625070226231, 3.680327020956565, \
3.30270511221, 3.704752803295, 3.6307205395804001, 3.200863997609199, \
2.90599272353, 3.555062501231, 2.8047528032711295, 3.415270760685753, \
3.50599277872, 3.445622506242, 3.3047528084632258, 3.219431344191376, \
3.68032756565, 3.451542245654, 3.2244456543387564, 2.999848273256456]
Ps = pd.Series(values, dtype=np.float64)
Narr = np.array(values, dtype=np.float64)
def getLambda(S, delta = 1/255):
    n = len(S) - 1
    Sx = sum( S[0:-1] )
    Sy = sum( S[1:] )
    Sxx = sum( S[0:-1]**2 )
    Sxy = sum( S[0:-1]*S[1:] )
    mu = (Sy*Sxx - Sx*Sxy) / ( n*(Sxx - Sxy) - (Sx**2 - Sx*Sy) )
    lambd = np.log((Sxy - mu*Sx - mu*Sy + n*mu**2) / (Sxx -2*mu*Sx + n*mu**2) )/ delta
    a = math.exp(-lambd*delta)
    return mu, a, lambd
print("Numpy Array calculation gives me mu = {}, alpha = {} and Lambda = {}".format(getLambda(Narr)[0], getLambda(Narr)[1], getLambda(Narr)[2]))
print("Pandas Series calculation gives me mu = {}, alpha = {} and Lambda = {}".format(getLambda(Ps)[0], getLambda(Ps)[1], getLambda(Ps)[2]))
The values are just some random values picked from my original data in my larger project.
This will, at least for me, print:
>> Numpy Array calculation gives me mu = 3.378432651661709, alpha = 102.09644146650535 and Lambda = -1179.6090571432392
>> Pandas Series calculation gives me mu = 3.3981019891871247, alpha = nan and Lambda = nan
The procedure, method, and original data are identical, and there is still a difference of about 0.019669 already in the calculation of mu, which to me is really weird and upsetting.
If this is due to a difference in rounding between the two ways of handling data (keep in mind that I explicitly stated that it should be float64 in both cases), it's weird, as it makes me question which of them I should use and why. Otherwise, there has to be a bug in one of them? Or is there a third alternative which explains everything and which I did not know of to begin with?
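One candidate for that third alternative, offered as something to check rather than a confirmed diagnosis of the numbers above: arithmetic between two pandas Series aligns on index labels, while NumPy arrays are combined by position. A product of two shifted slices, like the one used for Sxy, therefore means different things for the two types. The sketch below uses three made-up values and .iloc for explicit positional slicing.

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0]
arr = np.array(values)
ser = pd.Series(values)

# NumPy multiplies by position: [1*2, 2*3]
positional = arr[0:-1] * arr[1:]

# pandas aligns on labels: the slices carry labels (0, 1) and (1, 2),
# so only label 1 exists in both and the other labels become NaN
aligned = ser.iloc[0:-1] * ser.iloc[1:]

# Dropping the index first restores the positional behaviour
by_position = ser.to_numpy()[0:-1] * ser.to_numpy()[1:]

print(positional)   # [2. 6.]
print(aligned)
print(by_position)  # [2. 6.]
```

If this is what happens in getLambda, passing S.to_numpy() (or S.values) into the function would make the Series input behave like the array input.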
