Help please!!
I've seen some answers here, but they didn't help me. I need to reconstruct the initial data from two matrices, using the first ten principal components. The first matrix, Z (X_reduced_417), is the result of applying sklearn.decomposition.PCA. The second matrix, F (X_loadings_417), is the weight (loadings) matrix. The answer is: initial data = Z*F + mean_matrix. How do I use sklearn to find Z?
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn.datasets, sklearn.decomposition

df_loadings = pd.read_csv('X_loadings_417.csv', header=None)
df_reduced = pd.read_csv('X_reduced_417.csv', header=None)
import pandas as pd
import numpy as np

# Load the loadings (F) and reduced (Z) matrices from the CSV files
df_loadings = pd.read_csv("X_loadings_417.csv", header=None)
df_reduced = pd.read_csv("X_reduced_417.csv", header=None)

# Convert the DataFrames to numpy arrays
F = df_loadings.values
Z = df_reduced.values

# The column-wise mean of the original data is needed to reconstruct it;
# X is assumed to hold the original data here (otherwise load the means from a file)
mean_matrix = np.mean(X, axis=0)

# Reconstruct the original data using the first ten principal components
# (assumes F stores one component per row, shape (n_components, n_features); use F.T otherwise)
X_reconstructed = Z[:, :10].dot(F[:10, :]) + mean_matrix
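For reference, here is a minimal sketch of how Z, F, and the mean relate to sklearn's PCA (the random X below is a stand-in, not the asker's data): Z is what pca.fit_transform returns, F corresponds to pca.components_, pca.mean_ plays the role of mean_matrix, and the manual reconstruction matches pca.inverse_transform.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)               # stand-in for the original data

pca = PCA(n_components=10)
Z = pca.fit_transform(X)                  # reduced matrix (the role of X_reduced_417)
F = pca.components_                       # loadings, shape (10, n_features)
mean_matrix = pca.mean_                   # column means of X

X_manual = Z.dot(F) + mean_matrix         # manual reconstruction from ten components
X_builtin = pca.inverse_transform(Z)      # sklearn's built-in reconstruction
print(np.allclose(X_manual, X_builtin))   # True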
I am using pyfolio to calculate the max drawdown and other risk indicators. What should be adjusted to get the correct value?
The max drawdown should be near 27%; I don't know why a negative value is returned, and the whole drawdown table doesn't look correct or as expected.
Thanks in advance.
Benchmark and results files are attached.
import pandas as pd
import pyfolio as pf
import os
import matplotlib.pyplot as plt
from pandas import read_csv
from pyfolio.utils import (to_utc, to_series)
from pyfolio.tears import (create_full_tear_sheet,
                           create_simple_tear_sheet,
                           create_returns_tear_sheet,
                           create_position_tear_sheet,
                           create_txn_tear_sheet,
                           create_round_trip_tear_sheet,
                           create_interesting_times_tear_sheet)

# Strategy and benchmark returns, indexed by date
test_returns = read_csv("C://temp//test_return.csv", index_col=0, parse_dates=True, header=None, squeeze=True)
print(test_returns)

benchmark_returns = read_csv("C://temp//benchmark.csv", index_col=0, parse_dates=True, header=None, squeeze=True)
print(benchmark_returns)

# Full returns tear sheet, saved to disk
fig = pf.create_returns_tear_sheet(test_returns, benchmark_rets=benchmark_returns, return_fig=True)
fig.savefig("risk.png")

# Max drawdown and the full drawdown table
maxdrawdown = pf.timeseries.max_drawdown(test_returns)
print(maxdrawdown)

table = pf.timeseries.gen_drawdown_table(test_returns)
print(table)
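Two things worth checking here (my assumptions, not confirmed from the posted files): pyfolio expects a series of periodic simple returns rather than cumulative values or prices, and as far as I know its max drawdown is reported as a negative fraction, so -0.27 would mean a 27% drawdown. A quick cross-check of the value with plain pandas:

def max_drawdown_check(returns):
    # returns: pandas Series of periodic simple returns (e.g. 0.01 == +1%)
    cumulative = (1.0 + returns).cumprod()      # growth of one unit of capital
    running_max = cumulative.cummax()           # highest value seen so far
    drawdown = cumulative / running_max - 1.0   # drop from the running peak, <= 0
    return drawdown.min()                       # most negative drop

print(max_drawdown_check(test_returns))         # should come out near -0.27 if 27% is right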
I want to write a Python program that iterates over each row of a data matrix in a .csv file, passes each row as input to a time-series-analysis model, and stores the output of each row (a single value) in a column.
So far, I have tried iterating over the rows, passing each one through the model, and printing the output:
import pandas as pd
import numpy as np
from statsmodels.tsa.ar_model import AR
from random import random

data = pd.read_csv('EXAMPLEMATRIX.csv', header=None)
for i in data.iterrows():
    df = np.asarray(i)
    model = AR(df)
    model_fit = model.fit()
    yhat = model_fit.predict(len(df), len(df))
    print(yhat)
but I get an error:
ValueError: maxlag should be < nobs
Please help me solve this problem, point out where it is going wrong, or provide a reference for solving it.
Thanks in advance.
Use this instead:
import pandas as pd
import numpy as np
from statsmodels.tsa.ar_model import AR

data = pd.read_csv('EXAMPLEMATRIX.csv', header=None)
for i in range(data.shape[0]):
    row = data.iloc[i]
    model = AR(row.values)
    model_fit = model.fit()
    yhat = model_fit.predict(len(row), len(row))
    print(yhat)
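Since the goal was to store each row's single output in a column rather than just print it, a small extension of the loop above (a sketch, reusing the same hypothetical EXAMPLEMATRIX.csv data) collects the predictions in a list and attaches them as a new column:

predictions = []
for i in range(data.shape[0]):
    row = data.iloc[i]
    model_fit = AR(row.values).fit()
    yhat = model_fit.predict(len(row), len(row))   # one value just past the end of the row
    predictions.append(yhat[0])

data['prediction'] = predictions   # one output value per row, stored as a column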
I am working with a series of images. I read them first and store them in a list, then I convert them to a DataFrame, and finally I would like to apply Isomap. When I read the images (I have 84 of them) I get an 84x2303 DataFrame of objects, and each object itself also looks like a DataFrame. I am wondering how to convert all of it to numeric so I can run Isomap on it and then plot the result.
Here is my code:
import pandas as pd
from scipy import misc
from mpl_toolkits.mplot3d import Axes3D
import matplotlib
import matplotlib.pyplot as plt
import glob
from sklearn import manifold

samples = []
path = 'Datasets/ALOI/32/*.png'
files = glob.glob(path)

for name in files:
    img = misc.imread(name)
    img = img[::2, ::2]              # downsample by taking every second pixel
    x = (img / 255.0).reshape(-1, 3)
    samples.append(x)

df = pd.DataFrame.from_records(samples)
print(df.dtypes)
print(df.shape)
Thanks!
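A minimal sketch of what I think is needed, reusing the imports and files list from the code above (the n_neighbors=6 and n_components=3 choices are my assumptions): flatten each image into a single 1-D float vector so the DataFrame holds plain numbers instead of nested arrays, then Isomap can be fitted directly.

samples = []
for name in files:
    img = misc.imread(name)
    img = img[::2, ::2]
    samples.append((img / 255.0).reshape(-1))   # one flat numeric row per image

df = pd.DataFrame(samples)                      # shape (n_images, n_pixels), float dtype
print(df.dtypes.unique())                       # all float64 now

iso = manifold.Isomap(n_neighbors=6, n_components=3)
T = iso.fit_transform(df)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(T[:, 0], T[:, 1], T[:, 2])
plt.show()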
I am new to Python. I have a numpy array whose size is 66049x1 (66049 rows and 1 column). The values are sorted smallest to largest and are floats, with some of them repeated.
I need to determine the frequency of occurrence of each value (the number of times a given value is equalled but not surpassed, i.e. X <= x in statistical terms), in order to later plot the sample cumulative distribution function.
The code I am currently using is below, but it is extremely slow, as it has to loop 66049x66049 = 4362470401 times. Is there any way to speed it up? Would dictionaries perhaps help? Unfortunately I cannot change the size of the arrays I am working with.
+++Function header+++
...
...
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2]
x1=numpy.delete(x, 0, 0)
x2=numpy.zeros((x1.shape[0]))
x2=sorted(x1)
x3=numpy.around(x2, decimals=3)
count=numpy.zeros(len(x3))
# Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
    temp = x3[i]
    for j in range(len(x3)):
        if (temp <= x3[j]):
            count[j] = count[j] + 1

# Creates a 2D array with (value, occurrences)
x4 = numpy.zeros((len(x3), 2))
for i in range(len(x3)):
    x4[i, 0] = x3[i]
    x4[i, 1] = numpy.around((count[i] / x1.shape[0]), decimals=3)
...
...
+++Function continues+++
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000, 1))
df = pd.DataFrame(arr)

cnt = Counter(df[0])                      # occurrences of each distinct value
df_p = pd.DataFrame(cnt, index=['data'])
df_p.T.plot(kind='hist')                  # transpose so each distinct value becomes a row
plt.show()
That whole script took a very short time to execute (~2 s) for a (100,000 x 1) array. I didn't time it precisely, but if you provide the time yours took we can compare.
I used Counter from collections to count the number of occurrences; my experience with it has always been great time-wise. I converted the result into a DataFrame to plot it and used T to transpose.
Your data does repeat a bit, but you can try to refine it some more. As it is, it's pretty fast.
Edit
Create CDF using cumsum()
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000, 1))
df = pd.DataFrame(arr)

cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T.sort_index()   # sort the distinct values before accumulating
df_p['cumu'] = df_p['data'].cumsum()
df_p['cumu'].plot(kind='line')
plt.show()
Edit 2
For a scatter() plot you must specify x and y explicitly. Also, calling df_p['cumu'] will result in a Series, not a DataFrame.
To properly display a scatter plot you'll need the following:
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000, 1))
df = pd.DataFrame(arr)

cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T.sort_index()   # sort before the cumulative sum
df_p['cumu'] = df_p['data'].cumsum()
df_p.plot(kind='scatter', x='data', y='cumu')
plt.show()
You should use np.where and then count the length of the obtained vector of indices:
indices = np.where(x3 <= value)
count = len(indices[0])
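If you need this count for every value at once, a fully vectorised alternative (my addition, not part of the answer above) is np.searchsorted, which on a sorted array returns, for each value, the number of elements less than or equal to it:

import numpy as np

# x3 is assumed to be the sorted, rounded array from the question
counts = np.searchsorted(x3, x3, side='right')   # counts[i] = number of elements <= x3[i]
fractions = counts / float(len(x3))              # empirical CDF value for each element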
If efficiency counts, you can use the numpy function bincount, which needs integers:
import numpy as np

a = np.random.rand(66049).reshape((66049, 1)).round(3)
z = np.bincount(np.int32(1000 * a[:, 0]))   # occurrences of each rounded value
It takes about 1 ms.
Regards.
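Following up on the bincount approach above: to get from these per-value counts to what the question asks for (the number of elements less than or equal to each value), a cumulative sum over the bincount result is enough. A sketch under the same scaling assumption (values rounded to 3 decimals and mapped to integer bins by multiplying by 1000):

import numpy as np

a = np.random.rand(66049).round(3)          # stand-in for the sorted data
ints = np.int32(1000 * a)                   # map 0.000..1.000 to integer bins 0..1000
counts = np.bincount(ints)                  # occurrences of each bin
cumulative = np.cumsum(counts)              # number of elements <= each bin value

values = np.arange(len(counts)) / 1000.0    # bin values back on the original scale
ecdf = cumulative / float(a.size)           # empirical CDF in [0, 1]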
# for counting a single value
mask = (my_np_array == value_to_count).astype('uint8')
# or a condition
mask = (my_np_array <= max_value).astype('uint8')
count = np.sum(mask)