Limited factor loading output in the Python factor-analyzer module

I am new here but I hope you guys can help me out.
I'm trying to conduct factor analysis with word vectors in python using the Factor-Analyzer module. I have a DataFrame with
100 columns and more than 15,000 rows.
I did not receive any error when I performed a factor analysis. Below is the output:
FactorAnalyzer(bounds=(0.005, 1), impute='median', is_corr_matrix=False,
method='minres', n_factors=10, rotation='varimax',
rotation_kwargs={}, use_smc=True)
But when I try to get the loadings, it only returns 100 rows. I want to get the loadings for all rows.
Here is my code:
import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
import numpy as np
import pickle
factor_df = pd.read_pickle("word_vectors.pkl")
factor_df = pd.DataFrame(data=factor_df)
fa = FactorAnalyzer(n_factors=10, rotation='varimax')
fa.fit(factor_df)
loading = fa.loadings_
loadings_df = pd.DataFrame(fa.loadings_)
loadings_df
The pickle file for my dataset is here.

Factor loadings are the weights/correlations between each variable (a column in your DataFrame) and each factor, so the fa.loadings_ object is an array with shape (number_of_variables, number_of_factors) - in your example (100, 10).
If you would like the transformed data with only 10 columns per row, call fa.transform(factor_df) after fa.fit(factor_df). The returned array will have shape (number_of_rows, number_of_factors) - in your example (15000, 10).
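For example, a minimal sketch reusing the names from the question (the pickle path is the question's own):
import pandas as pd
from factor_analyzer import FactorAnalyzer

factor_df = pd.read_pickle("word_vectors.pkl")          # ~15,000 rows x 100 columns

fa = FactorAnalyzer(n_factors=10, rotation='varimax')
fa.fit(factor_df)

loadings = pd.DataFrame(fa.loadings_)                   # shape (100, 10): one row per variable
scores = pd.DataFrame(fa.transform(factor_df))          # shape (~15000, 10): one row per observation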

Related

How to generate a correlation matrix of a dataset with a large file size in Python?

I'm trying to generate a correlation matrix based on gene expression levels. I have a dataset with gene names as columns and individual experiments as rows, with expression levels in the cells. The matrix is 55,000 genes wide and 150,000 experiments tall, so I broke the computation into chunks because my computer cannot hold the entire set in memory.
This was my attempt:
import pandas as pd
import numpy as np

file_path = 'data.tsv'
chunksize = 10**6

corr_matrix = pd.DataFrame()
for chunk in pd.read_csv(file_path, delimiter='\t', chunksize=chunksize):
    chunk_corr = chunk.corr()
    corr_matrix = (corr_matrix + chunk_corr) / 2

print(corr_matrix)
However, running this code slowly eats up RAM until it crashes my system/Jupyter Lab once all of it is used.
Is there a better way to run this that might use different cores?
I'm not familiar with making Python work with data this large.
UPDATE:
I discovered Dask, which supposedly handles both the size and the multithreading. I rewrote my code as:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da
import dask.bag as db

# Read the dataframe with a large sample size to overcome a value error
df = dd.read_csv('data.tsv', sep='\t', sample=1000000000)

# Generate the correlation matrix
corr_matrix = np.zeros((54675, 54675))
corr_matrix = df.corr(method='pearson')
corr_matrix = corr_matrix.compute()

# Print the correlation matrix
print(corr_matrix)
UPDATE: This also slowly eats up RAM and crashes once it hits my RAM limit. Back to the drawing board.
UPDATE: No one here had a workable suggestion, so I just used a supercomputer with 60 GB of RAM to generate the correlation matrix with pandas.
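For reference, averaging per-chunk correlations does not reproduce the correlation of the full dataset. A sketch of a chunked approach that does (it accumulates column sums and cross-products, then forms the correlation once at the end) is below; note that with 55,000 columns the float64 accumulator alone is roughly 24 GB, so float32 or a column subset may still be needed:
import numpy as np
import pandas as pd

file_path = 'data.tsv'      # same file as in the question
chunksize = 10**4

n = 0
col_sum = None              # per-column sums
cross = None                # matrix of cross-products X^T X
cols = None

for chunk in pd.read_csv(file_path, delimiter='\t', chunksize=chunksize):
    x = chunk.to_numpy(dtype=np.float64)
    if col_sum is None:
        cols = chunk.columns
        col_sum = np.zeros(x.shape[1])
        cross = np.zeros((x.shape[1], x.shape[1]))
    n += x.shape[0]
    col_sum += x.sum(axis=0)
    cross += x.T @ x

# covariance and correlation from the accumulated sums
mean = col_sum / n
cov = (cross - n * np.outer(mean, mean)) / (n - 1)
std = np.sqrt(np.diag(cov))
corr_matrix = pd.DataFrame(cov / np.outer(std, std), index=cols, columns=cols)
print(corr_matrix)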

Read HDF data into a 3D array and save it as a dataframe in Python

I am currently working with the NASA aerosol optical depth data (MCD19A2), which is a NASA satellite level-three product. I have uploaded the data. I want to save the data as a dataframe that includes the longitude, latitude, and values. I have successfully converted the 0.47 um band file into a three-dimensional array. How do I convert this array into a dataframe with X, Y, and the value?
Below are the codes I have tried:
from osgeo import gdal
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

rds = gdal.Open("MCD19A2.A2006001.h26v04.006.2018036214627.hdf")
names = rds.GetSubDatasets()
names[0][0]
# 'HDF4_EOS:EOS_GRID:"MCD19A2.A2006001.h26v04.006.2018036214627.hdf":grid1km:Optical_Depth_047'

aod_047 = gdal.Open(names[0][0])
a47 = aod_047.ReadAsArray()
a47[1].shape
# (1200, 1200)
I would like the result to look like this:

X (n=1200)    Y (n=1200)    AOD_047
8896067       5559289       0.0123
I know that in R this can be done by
require('gdalUtils')
require('raster')
require('rgdal')
file.name<-"MCD19A2.A2006001.h26v04.006.2018036214627.hdf"
sds <- get_subdatasets(file.name)
gdal_translate(sds[1], dst_dataset = paste0('tmp047', basename(file.name), '.tiff'), b = nband)
r.047 <- raster(paste0('tmp047', basename(file.name), '.tiff'))
df.047 <- raster::as.data.frame(r.047, xy = T)
names(df.047)[3] <- 'AOD_047'
But R leans heavily on memory, and saving to and reading back the 'tif' uses a lot of it, so I want to do this task in Python. Thanks a lot for your help.
You can use pandas:
import pandas as pd
df=pd.read_hdf('filename.hdf')
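If pd.read_hdf cannot open this HDF4-EOS container, a sketch built on the GDAL objects already used in the question (assuming the subdataset's geotransform describes the 1-km grid) would be:
from osgeo import gdal
import numpy as np
import pandas as pd

sds_name = 'HDF4_EOS:EOS_GRID:"MCD19A2.A2006001.h26v04.006.2018036214627.hdf":grid1km:Optical_Depth_047'
ds = gdal.Open(sds_name)
band = ds.ReadAsArray()[0]                 # first orbit/band of the 3-D array, shape (1200, 1200)

# pixel-centre coordinates from the geotransform: (x_origin, dx, 0, y_origin, 0, dy)
gt = ds.GetGeoTransform()
x = gt[0] + (np.arange(ds.RasterXSize) + 0.5) * gt[1]
y = gt[3] + (np.arange(ds.RasterYSize) + 0.5) * gt[5]
xx, yy = np.meshgrid(x, y)

df_047 = pd.DataFrame({'X': xx.ravel(), 'Y': yy.ravel(), 'AOD_047': band.ravel()})
The coordinates come out in the grid's own (sinusoidal) projection, which should correspond to the X/Y columns produced by the R workflow above.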

How to iterate over rows of .csv file and pass each row to a time-series analysis model?

I want to write a program in Python that iterates over each row of a data matrix in a .csv file, passes each row as input to a time-series-analysis model, and stores the output of each row (a single value) as a column.
So far, I have tried iterating over the rows, passing each one through the model, and printing the output:
import pandas as pd
import numpy as np
from statsmodels.tsa.ar_model import AR
from random import random

data = pd.read_csv('EXAMPLEMATRIX.csv', header=None)
for i in data.iterrows():
    df = np.asarray(i)
    model = AR(df)
    model_fit = model.fit()
    yhat = model_fitd.predict(len(df), len(df))
    print(yhat)
but I get an error:
ValueError: maxlag should be < nobs
Please help me solve this problem, point out where it is going wrong, or point me to a reference for solving it.
Thanks in advance.
Use this instead:
import pandas as pd
import numpy as np
from statsmodels.tsa.ar_model import AR

data = pd.read_csv('EXAMPLEMATRIX.csv', header=None)

for i in range(data.shape[0]):
    row = data.iloc[i]
    model = AR(row.values)
    model_fit = model.fit()
    yhat = model_fit.predict(len(row), len(row))
    print(yhat)
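To keep the per-row outputs as a column, as the question asks, a small extension of the same loop (a sketch; the column name and output filename are made up) could collect the predictions in a list first:
import pandas as pd
from statsmodels.tsa.ar_model import AR

data = pd.read_csv('EXAMPLEMATRIX.csv', header=None)

predictions = []
for i in range(data.shape[0]):
    row = data.iloc[i]
    model_fit = AR(row.values).fit()
    # forecast the single step just past the end of the row
    yhat = model_fit.predict(len(row), len(row))
    predictions.append(yhat[0])

data['ar_prediction'] = predictions                              # hypothetical column name
data.to_csv('EXAMPLEMATRIX_with_predictions.csv', index=False)   # hypothetical filename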

DataFrame NumPy value accuracy

I'm using the following code to scale values from one interval to another. "outputs_max" and "outputs_min" are NumPy arrays, and so (as a result) are "slope" and "intercept".
For higher clarity when displaying the result "scaled_outputs", I used pandas to create a DataFrame from the file "out.npy", which I called "output_array". The result "scaled_outputs" is therefore displayed as a DataFrame too and is later stored as a NumPy file.
import pandas as pd
import numpy as np
output_file = np.load("U:\\out.npy")
output_array = pd.DataFrame(output_file)
desired_upper_bound = 1
desired_lower_bound = 0
slope = (desired_upper_bound - desired_lower_bound) / (outputs_max - outputs_min)
intercept = desired_upper_bound - (slope * rounded_outputs_max)
scaled_outputs = slope * output_array + intercept
np.save("U:\\scaled_outputs.npy", scaled_outputs)
Am I losing accuracy of the values by creating a DataFrame and passing it into the equation? Would it be better to pass the numpy array "output_file" and creating a DataFrame of "scaled_outputs"?
The result in the console is displayed with 5 decimals at max, which is why I'm asking.
No, you're not losing precision or accuracy by using a dataframe in your equation. What you're seeing on the console is a result of display precision. You can change the display.precision property to see more digits when a dataframe is displayed.
pd.set_option("display.precision", 10)
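A quick check (a sketch with a made-up value) shows the stored float is untouched; only the printed representation is rounded:
import pandas as pd

df = pd.DataFrame({'x': [0.123456789012345]})   # hypothetical value

print(df)                      # shown with the default display precision (6 digits)
print(df['x'].iloc[0])         # the underlying float64 still holds the full value

pd.set_option("display.precision", 10)
print(df)                      # more digits are now displayed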

How to combine numpy ndarrays?

I have MODIS atmospheric product. I used the code below to read the data.
%matplotlib inline
import numpy as np
from pyhdf import SD
import matplotlib.pyplot as plt

files = ['file1.hdf', 'file2.hdf', 'file3.hdf']

for n in files:
    hdf = SD.SD(n)
    lat = hdf.select('Latitude')[:]
    lon = hdf.select('Longitude')[:]
    sds = hdf.select('Deep_Blue_Aerosol_Optical_Depth_550_Land')
    data = sds.get()
    attributes = sds.attributes()
    scale_factor = attributes['scale_factor']
    data = data * scale_factor
    plt.contourf(lon, lat, data)
The problem is that on some days there are 3 datasets (as in this case; other days have four), so I cannot use hstack or vstack to merge these datasets.
My intention is to get a single array out of the three separate data arrays.
I have also attached the data files at this link: https://drive.google.com/open?id=0B2rkXkOkG7ExYW9RNERaZU5lam8
Your help will be highly appreciated.
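One way to end up with a single array per variable, regardless of how many granules a day has (a sketch; it assumes the pooled points only need to be plotted, not regridded), is to flatten each granule and concatenate, then plot the scattered points with tricontourf instead of contourf:
import numpy as np
from pyhdf import SD
import matplotlib.pyplot as plt

files = ['file1.hdf', 'file2.hdf', 'file3.hdf']   # 3 or 4 granules, depending on the day

lats, lons, vals = [], [], []
for n in files:
    hdf = SD.SD(n)
    sds = hdf.select('Deep_Blue_Aerosol_Optical_Depth_550_Land')
    data = sds.get() * sds.attributes()['scale_factor']
    # flatten so granules with different shapes can still be pooled
    lats.append(hdf.select('Latitude')[:].ravel())
    lons.append(hdf.select('Longitude')[:].ravel())
    vals.append(data.ravel())

lat_all = np.concatenate(lats)
lon_all = np.concatenate(lons)
data_all = np.concatenate(vals)

# the pooled points are no longer on a regular grid, so use tricontourf (or scatter)
plt.tricontourf(lon_all, lat_all, data_all)
plt.show()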
