I would like some assistance with a problem I have. I have a big CSV file (shape (6239292, 5)) and want to apply an unsupervised machine learning technique (k-modes, via the kmodes package). My code is this:
import numpy as np
import pandas as pd
print("initialising")
syms = np.genfromtxt('foo.csv', delimiter = ';', dtype=str, skip_header=1, invalid_raise=False)[:, 0:]
print(syms.shape)
X = np.genfromtxt('foo.csv',dtype=object, delimiter=';', invalid_raise=False, skip_header=1)[:, 1:]
X[:, 0] = X[:, 0].astype(float)
from kmodes.kprototypes import KPrototypes
print("Imported successfully")
kproto = KPrototypes(n_clusters=6, init='random', n_init=2, verbose=2)
clusters = kproto.fit_predict(X, categorical=[2,1,3,])
Due to the size of the file, it's taking forever. Is there any technique I could use to reduce the time? Thank you in advance!
You can select just the first n rows with pandas, for example:
read_csv(..., nrows=999999)
or skip some rows and then read the next n rows:
read_csv(..., skiprows=1000000, nrows=999999)
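For example, here is a minimal sketch of clustering a sample with pandas and kmodes, mirroring the column handling in the question (the sample size, the ';' delimiter, and the assumption that the file's first column is an identifier all follow the code above):
import pandas as pd
from kmodes.kprototypes import KPrototypes

# Read only a sample of the rows; the header row is consumed automatically
sample = pd.read_csv('foo.csv', sep=';', nrows=500000)
# Drop the first column and treat column 0 of X as numerical, as in the original code
X = sample.iloc[:, 1:].to_numpy(dtype=object)
X[:, 0] = X[:, 0].astype(float)
kproto = KPrototypes(n_clusters=6, init='random', n_init=2, verbose=2)
clusters = kproto.fit_predict(X, categorical=[2, 1, 3])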
There shouldn't be a problem with your results, thanks to the Central Limit Theorem:
The Central Limit Theorem (CLT) is a statistical theory that states that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.
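A quick numerical illustration with synthetic data (not the poster's file): the mean of a large random subsample tracks the mean of the full array closely.
import numpy as np

# Synthetic demonstration: a 500k-row random subsample of a 6M-row column
# has nearly the same mean as the full column
rng = np.random.default_rng(0)
full = rng.exponential(scale=3.0, size=6000000)
sample = rng.choice(full, size=500000, replace=False)
print(full.mean(), sample.mean())  # the two means closely agree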
I'm trying to generate a correlation matrix based on gene expression levels. I have a dataset with gene names in the columns and individual experiments in the rows, with expression levels in the cells. The matrix is 55,000 genes wide and 150,000 experiments tall, so I broke the computation down into chunks because my computer cannot hold the entire set in memory.
This was my attempt:
import pandas as pd
import numpy as np
file_path = 'data.tsv'
chunksize = 10**6
corr_matrix = pd.DataFrame()
for chunk in pd.read_csv(file_path, delimiter='\t', chunksize=chunksize):
    chunk_corr = chunk.corr()
    corr_matrix = (corr_matrix + chunk_corr) / 2
print(corr_matrix)
However, running this code slowly eats up RAM until it crashes my system/Jupyter Lab once all of it has been consumed.
Is there a better way to run this that might use different cores?
I'm not familiar with making python work with such large data.
UPDATE:
I discovered Dask, which supposedly should handle both the size and the multithreading. I rewrote my code as:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da
import dask.bag as db
#Read the dataframe with large sample size to overcome a value error
df = dd.read_csv('data.tsv',sep = '\t', sample=1000000000)
#Generate correlation matrix
corr_matrix = np.zeros((54675,54675))
corr_matrix = df.corr(method='pearson')
corr_matrix = corr_matrix.compute()
#Print correlation matrix
print(corr_matrix)
Update: this also slowly eats up RAM and crashes once it hits my RAM limit. Back to the drawing board.
Second update: no one on this website was helpful, so I just used a supercomputer with 60 GB of RAM to generate the correlation matrix with pandas.
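For reference, one way to keep memory flat is to accumulate the sufficient statistics (row count, column sums, and the cross-product matrix) chunk by chunk and assemble the correlation matrix once at the end. A minimal sketch, assuming every column in data.tsv is numeric; note that the p x p accumulator itself must still fit in memory (about 24 GB in float64 for 55,000 genes, so float32 or an on-disk array may be needed):
import numpy as np
import pandas as pd

file_path = 'data.tsv'
chunksize = 10**6

n = 0       # total number of rows seen
s = None    # running column sums
ss = None   # running sum of cross products, X.T @ X
for chunk in pd.read_csv(file_path, delimiter='\t', chunksize=chunksize):
    x = chunk.to_numpy(dtype=np.float64)
    n += x.shape[0]
    s = x.sum(axis=0) if s is None else s + x.sum(axis=0)
    ss = x.T @ x if ss is None else ss + x.T @ x

# Assemble covariance and correlation from the accumulated statistics
mean = s / n
cov = ss / n - np.outer(mean, mean)
std = np.sqrt(np.diag(cov))
corr_matrix = cov / np.outer(std, std)
print(corr_matrix)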
I'm looking for a way to improve the quality of the eigenvectors produced by sklearn's TruncatedSVD. The documentation at scikit-learn.org suggests that the n_oversamples parameter is a good place to start. I have a sparse 2200 x 2200 matrix as input (provided as three separate files consisting of row indexes, column indexes, and data values). Here's my code:
from array import array
import sys
import numpy as np
import struct
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
path="c:\\users\\lenwh\\documents\\wikipedia\\weights\\"
file=sys.argv[1]
dims=int(sys.argv[2]) #I use 300
with open(path+ file + ".rows","rb") as f:
    rows=np.fromfile(f,dtype=np.int32)
with open(path+ file + ".cols","rb") as f:
    cols=np.fromfile(f,dtype=np.int32)
with open(path+ file + ".data","rb") as f:
    data = np.fromfile(f, dtype=np.float32)
rowCount=len(np.unique(rows))
csr=csr_matrix((data, (rows, cols)), shape=(rowCount, rowCount))
vectorsfile=path+"eigens.vec"
transfile=path+ file + ".eig"
oversamples = 10
pca=TruncatedSVD(n_components=dims, n_oversamples=oversamples)
pca.fit(csr)
np.savetxt(transfile,pca.transform(csr),fmt='%16f')
The problem is that whether I have oversamples set to 10, 100, or 1000, the results are not discernibly different: the explained variance is the same in every case, as is the performance of the results in my application. At a minimum, I expected the quality of the explained variance to change. I would appreciate any explanation of where my expectations are misguided, and whether there are any other settings -- or alternatives to TruncatedSVD -- that I could look to other than the n_components setting.
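For what it's worth, here is a minimal sketch of two settings that might be worth comparing on the matrix built above (csr, with dims components): more power iterations for the randomized solver, and the exact ARPACK solver as a baseline. The parameter values are illustrative only, and n_oversamples requires scikit-learn 1.1 or later.
from sklearn.decomposition import TruncatedSVD

# Randomized solver with more power iterations and oversampling
rand_svd = TruncatedSVD(n_components=dims, algorithm='randomized',
                        n_iter=10, n_oversamples=100, random_state=0)
rand_svd.fit(csr)
# Exact ARPACK solver as a reference point
arpack_svd = TruncatedSVD(n_components=dims, algorithm='arpack', random_state=0)
arpack_svd.fit(csr)
print(rand_svd.explained_variance_ratio_.sum(),
      arpack_svd.explained_variance_ratio_.sum())
If the randomized and ARPACK results already agree closely, the randomized approximation is essentially converged for this matrix, which would explain why raising n_oversamples makes no visible difference.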
My question has to do with a very large dataset I'm running a regression on in Python. I have categorical data (gender, industry, region, salary groupings, etc.) that I would like to run a regression on with statsmodels. The whole dataframe comes out to about 83 columns wide after using pd.get_dummies() on roughly 5 million rows.
Code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from datetime import datetime as dt
#Start time
print('Start Time: ', dt.now())
#Variables
groups = ['sex', 'central_age', 'group_size', 'industry', 'region', 'salary']
base_cases = ['sex_Male', 'central_age_47.0', 'group_size_F. 100-249', 'salary_A. < 25',
'industry_H. Manufacturing - heavy, steel etc.', 'region_C. Division 3: East North Central']
aggregates = ['death_amount_exposed', 'death_claim_amount']
#Read/ format data to transform data into categorical variables
df = pd.read_pickle(r'./Life_Mortality_Data.pkl')
df = df[df['death_amount_exposed']!=0]
df['central_age'] = df['central_age'].apply(str)
final = pd.get_dummies(df[groups]).join(df[aggregates]).astype(float)
final.drop(base_cases, axis=1, inplace=True)
#Prepare string of variables to regress on in the next step
var_columns = list(final.columns)
for i in aggregates:
    var_columns.remove(i)
variables = '+'.join('Q("' + i + '")' for i in var_columns)
#Training and testing with Poisson model
print('Regression Time: ', dt.now(), '\n')
res1 = smf.glm(formula='death_claim_amount ~'+variables, data=final, offset=np.log(final['death_amount_exposed']), family=sm.families.Poisson(sm.families.links.log())).fit()
#Print stats summary, base cases, and multiplicative factors
print(res1.summary())
print('Base Cases:')
for case in base_cases:
    print(case)
print('\nParameters:\n', np.exp(res1.params))
#This takes the result of a statsmodel results table and transforms it into a dataframe
def results_summary_to_dataframe(results):
    pvals = results.pvalues
    coeff = results.params
    std_err = results.bse
    conf_lower = results.conf_int()[0]
    conf_higher = results.conf_int()[1]
    results_df = pd.DataFrame({"pvals":pvals,
                               "coeff":coeff,
                               "std_error":std_err,
                               "conf_lower":conf_lower,
                               "conf_higher":conf_higher
                               })
    #Reordering columns
    results_df = results_df[["coeff","std_error","pvals","conf_lower","conf_higher"]]
    return results_df
#Write data to excel
results_summary_to_dataframe(res1).to_excel(r'./All_Regression_Amounts_v1.xlsx')
#End time
print('\nEnd Time: ', dt.now())
The problem I'm having is that I run out of memory at the point where the statsmodels regression is run. I am using the 64-bit version of Python on Windows and have 32 GB of memory, which I thought would be more than enough for this kind of computation, but I'm not sure whether I'm failing to use all of the available memory or whether something is wrong with my code. I'm very new to this kind of analysis and to handling this much data. I'd really appreciate any help on what I can do to resolve this error.
When building linear models on datasets that are too large to hold in memory, your best bet is to train the model with stochastic gradient descent. This fits the model iteratively, by gradient descent on repeated small samples of the data rather than on all the data at once.
Scikit-learn has an SGDClassifier class that fits a linear model this way. You could take a look at that and see if it might work for you.
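For illustration, a rough sketch of incremental fitting with partial_fit over CSV chunks. Since the target here (death_claim_amount) is continuous, this uses the regression counterpart SGDRegressor rather than SGDClassifier; the file name 'final.csv' (a hypothetical dump of the dummy-encoded frame) and the chunk size are assumptions.
import pandas as pd
from sklearn.linear_model import SGDRegressor

target = 'death_claim_amount'
offset = 'death_amount_exposed'

model = SGDRegressor(loss='squared_error', penalty='l2', random_state=0)
# Stream the dummy-encoded data in chunks so the full design matrix never sits in memory
for chunk in pd.read_csv('final.csv', chunksize=100000):  # hypothetical export of the `final` frame
    y = chunk[target].to_numpy()
    X = chunk.drop(columns=[target, offset]).to_numpy()
    model.partial_fit(X, y)

print(model.coef_[:5])
Note that this fits a plain squared-error linear model rather than the Poisson GLM with an exposure offset from the question, so treat it as a starting point rather than a drop-in replacement.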
import sys
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.svm import SVC
filename = sys.argv[1]
datafile = sio.loadmat(filename)
data = datafile['bow']
sizedata=[len(data), len(data[0])]
gap=[]
SD=[]
for knum in xrange(10,20):
    print knum
    #Clustering original Data
    kmeanspp = KMeans(n_clusters=knum,init = 'k-means++',max_iter = 100,n_jobs = 1)
    kmeanspp.fit(data)
    dispersion = kmeanspp.inertia_
    #Clustering Reference Data
    nrefs = 10
    refDisp = np.zeros(nrefs)
    for nref in xrange(nrefs):
        refdata = np.random.random_sample((sizedata[0],sizedata[1]))
        refkmeans = KMeans(n_clusters=knum,init='k-means++',max_iter=100,n_jobs=1)
        refkmeans.fit(refdata)
        refdisp = refkmeans.inertia_
        refDisp[nref]=np.log(refdisp)
    mean_log_refdisp = np.mean(refDisp)
    gap.append(mean_log_refdisp-np.log(dispersion))
    #Calculating standard deviation
    sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
    SD.append(sd)
SD = [sd*((1+(1/nrefs))**0.5) for sd in SD]
#determining optimal k
opt_k = None
diff = []
for i in xrange(len(gap)-1):
    diff = (SD[i+1]-(gap[i+1]-gap[i]))
    if diff>0:
        opt_k = i+10
        break
print diff
plt.plot(np.linspace(10,19,10,True),gap)
plt.show()
Here I am trying to implement the Gap Statistic method for determining the optimal number of clusters. But the problem is that every time I run the code I get a different value for k.
What is the solution to the problem?
How can the value of optimal k differ for the same data?
I have stored the data in a .mat file beforehand and I am passing it as an argument via terminal
I am looking for the smallest value of k for which Gap(k) >= Gap(k+1) - s(k+1), where s(k+1) = sd(k+1) * sqrt(1 + 1/B), sd is the standard deviation of the reference distribution, and B is the number of Monte Carlo reference samples.
Stated otherwise, I am searching for the value of k for which
s(k+1) - Gap(k+1) + Gap(k) >= 0
A couple of problems with your simulation:
1- sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
Why did you multiply the second component of the zip by nrefs? That is not needed according to the original paper.
2-
if diff>0:
    opt_k = i+10
    break
With if diff>0 you actually want diff>=0, since equality can also happen.
As for why you get a different number of clusters each time: as others said, it is a Monte Carlo simulation, so there is inherent randomness, and the result also depends on what you are clustering and on your dataset. I suggest you also test your results against the Silhouette and Elbow methods to get a better idea of the number of clusters.
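A quick sketch of that silhouette check, reusing the data matrix loaded from the .mat file above (the range of k mirrors the code in the question; treat the scores as a rough guide only):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for knum in range(10, 20):
    labels = KMeans(n_clusters=knum, init='k-means++', max_iter=100,
                    random_state=0).fit_predict(data)
    print(knum, silhouette_score(data, labels))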
One option is to run your function several times, average the gap statistics and the s values, and find the smallest k where the average s(k+1) - Gap(k+1) + Gap(k) is greater than or equal to zero.
This will take longer but give a more reliable result.
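For illustration, a minimal sketch of that averaging idea, assuming the data matrix from the question. The helper re-implements a single gap pass with scikit-learn's KMeans and a uniform [0, 1) reference distribution (as in the question's code); the number of passes (5) is arbitrary.
import numpy as np
from sklearn.cluster import KMeans

def gap_and_s(data, krange, nrefs=10, rng=None):
    # One pass of the gap statistic: returns arrays gap[k] and s[k] over krange
    rng = np.random.default_rng() if rng is None else rng
    gaps, s_vals = [], []
    for k in krange:
        disp = KMeans(n_clusters=k, init='k-means++', max_iter=100).fit(data).inertia_
        ref_logs = [np.log(KMeans(n_clusters=k, init='k-means++', max_iter=100)
                           .fit(rng.random(data.shape)).inertia_)
                    for _ in range(nrefs)]
        gaps.append(np.mean(ref_logs) - np.log(disp))
        s_vals.append(np.std(ref_logs) * np.sqrt(1 + 1.0 / nrefs))
    return np.array(gaps), np.array(s_vals)

# Average several independent passes, then apply the selection rule once
krange = list(range(10, 20))
runs = [gap_and_s(data, krange) for _ in range(5)]
gap = np.mean([g for g, _ in runs], axis=0)
s = np.mean([sv for _, sv in runs], axis=0)
for i in range(len(krange) - 1):
    if gap[i] >= gap[i + 1] - s[i + 1]:
        print("optimal k:", krange[i])
        break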
I have a 24000 * 316 numpy matrix; each row represents a time series with 316 time points, and I am computing the Pearson correlation between each pair of these time series. As a result I would have a 24000 * 24000 numpy matrix of Pearson values.
My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works (though still slowly). I am wondering whether it is expected to be this slow (it takes more than a day!!!), and what I might be able to do about it...
If it helps, this is my code... nothing special or hard...
import numpy
import scipy.stats

def SimMat(mat,name):
    mrange = mat.shape[0]
    print "mrange:", mrange
    nTRs = mat.shape[1]
    print "nTRs:", nTRs
    SimM = numpy.zeros((mrange,mrange))
    for i in range(mrange):
        SimM[i][i] = 1
    for i in range(mrange):
        for j in range(i+1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])
            if(pearV[1] <= 0.05):
                if(pearV[0] >= 0.5):
                    print "Pearson value:", pearV[0]
                    SimM[i][j] = pearV[0]
                    SimM[j][i] = 0
                else:
                    SimM[i][j] = SimM[j][i] = 0
    numpy.savetxt(name, SimM)
    return SimM, nTRs
Thanks
The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5 GB). There is no reason to keep the already computed coefficients in memory. For problems like this, I like to use HDF5 to store the intermediate results, since it works nicely with numpy. Here is a complete, minimal working example:
import numpy as np
import h5py
from scipy.stats import pearsonr
# Create the dataset
h5 = h5py.File("data.h5",'w')
h5["test"] = np.random.random(size=(24000,316))
h5.close()
# Compute dot products
h5 = h5py.File("data.h5",'r+')
A = h5["test"][:]
N = A.shape[0]
out = h5.require_dataset("pearson", shape=(N,N), dtype=float)
for i in range(N):
    out[i] = [pearsonr(A[i],A[j])[0] for j in range(N)]
Testing the first 100 rows suggests this will take only about 8 hours on a single core. If you parallelize it, you should see roughly linear speedup with the number of cores.
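As an aside, here is a vectorized sketch of the same computation (an alternative to the per-pair pearsonr loop above, not part of the original answer): z-scoring each row once turns every Pearson coefficient into a dot product, so a whole block of rows can be filled with one matrix multiplication. The block size and the "pearson_fast" dataset name are arbitrary choices.
import numpy as np
import h5py

h5 = h5py.File("data.h5", 'r+')
A = h5["test"][:]
N = A.shape[0]
# Center and unit-normalize each row once: a dot product of two such rows is their Pearson r
Z = A - A.mean(axis=1, keepdims=True)
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
out = h5.require_dataset("pearson_fast", shape=(N, N), dtype=float)
block = 1000
for start in range(0, N, block):
    out[start:start + block] = Z[start:start + block] @ Z.T
h5.close()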