KNN on Spark dataframe with 15 Million records - python

I have a Pyspark dataframe like this:
0 [0.010904288850724697, -0.010935504920780659, ...
1 [0.34882408380508423, -0.19240069389343262, -0...
2 [0.13833148777484894, -0.23080679774284363, -0...
3 [0.12398581206798553, -0.4803846478462219, -0....
4 [0.16033919155597687, -0.06204992160201073, -0.
Now I want to find the 100 nearest neighbors for each of these arrays.
Here's my attempt:
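import numpy as np
from sklearn.neighbors import NearestNeighbors
df_collect = df.toPandas()  # pulls the whole Spark DataFrame onto the driver
features = np.array(df_collect.features.to_list())
knnobj = NearestNeighbors(n_neighbors=100).fit(features)
distance_mat, neighbours_mat = knnobj.kneighbors(features)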
But because df is so big, this takes too long. I know I can broadcast and parallelize the last step, but I can't work out how to fit a Spark DataFrame to a scikit-learn KNN model. Is there another way to do this?
I have also read articles that mention ANN (Approximate Nearest Neighbor), Sparkit-Learn, and spark_sklearn, but I can't find a nearest-neighbor implementation in them. Can anyone suggest what to do next?

1. Load the data with a library such as datatable, cuDF, or dask. On large files they are usually faster than pandas.
2. Reduce memory consumption (by up to 90% in some cases) by casting each column to the smallest subtype that can hold its values.
3. Choose a data manipulation library you are comfortable with, or one that fits what you need.
4. Take a 10–20% sample of the data for rapid analysis and experimentation.
5. Think in vectors and use vectorized functions.
6. Choose a fast ML library like CatBoost for building baselines and doing feature engineering.
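Tips 2, 4, and 5 can be sketched for the nearest-neighbor case as follows. This is a minimal illustration, not the scikit-learn model from the question: the random data, the function name `knn_batched`, and the batch size are all assumptions made for the example. It casts the vectors to float32, takes a 10% sample, and computes exact Euclidean neighbors in batches so that only one `batch_size x n` distance block is held in memory at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 16))  # stand-in for the real vectors (float64)

# Tip 2: cast to a smaller dtype; float32 needs half the memory of float64.
features32 = features.astype(np.float32)

# Tip 4: experiment on a 10% random sample first.
sample_idx = rng.choice(len(features32), size=len(features32) // 10, replace=False)
sample = features32[sample_idx]


def knn_batched(data, k, batch_size=64):
    """Exact k nearest neighbours (Euclidean), computed one batch of rows at a time."""
    sq_norms = (data ** 2).sum(axis=1)
    neighbours = np.empty((len(data), k), dtype=np.int64)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        rows = np.arange(len(batch))
        # Tip 5: vectorized squared distances via ||a||^2 - 2 a.b + ||b||^2
        d2 = sq_norms[start:start + len(batch), None] - 2.0 * batch @ data.T + sq_norms[None, :]
        d2[rows, start + rows] = np.inf  # a point is not its own neighbour
        part = np.argpartition(d2, k, axis=1)[:, :k]  # k smallest, unordered
        order = np.argsort(np.take_along_axis(d2, part, axis=1), axis=1)
        neighbours[start:start + len(batch)] = np.take_along_axis(part, order, axis=1)
    return neighbours


neighbours = knn_batched(sample, k=10)
print(neighbours.shape)
```

For the full 15 million rows an exact all-pairs search is still quadratic in the number of rows, which is where the approximate methods mentioned in the question become relevant.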

Related

How do I format my training data for an LSTM network using Keras when I have multiple varying length time-series data? [duplicate]

This question already has an answer here:
Multivariate LSTM with missing values
(1 answer)
Closed 2 years ago.
I have two sets of training data of different lengths. I'll call these data series the x_train data. Their shapes are (70480, 7) and (69058, 7), respectively. Each column represents a different sensor reading.
I am trying to use an LSTM network on this data. Should I merge the data into one object? How would I do that?
I also have two sets of data that are the resultant output from the x_train data. These are both of size (315,1). Would I use this as my y_train data?
So far I have read the data using pandas.read_csv() as follows:
c4_x_train = pd.read_csv('path')
c4_y_train = pd.read_csv('path')
c6_x_train = pd.read_csv('path')
c6_y_train = pd.read_csv('path')
Any clarification is appreciated. Thanks!
Just a few points
For fast file reading, consider using a different format like parquet or feather. Be careful about deprecation, though: for long-term storage, csv is just fine.
pd.concat is your friend here. Use it like this:
from pathlib import Path
import pandas as pd
dir_path = Path(r"yourFolderPath")  # wrap the string in Path so that .glob() is available
files_list = [str(p) for p in dir_path.glob("**/*.csv")]  # all csv files, recursively
if files_list:
    source_dfs = [pd.read_csv(file_) for file_ in files_list]
    df = pd.concat(source_dfs, ignore_index=True)
You can then use this df for your training.
Now, regarding the training: as always, that really depends. If you have the datetime in those csvs and the measurements are continuous, go right ahead. If there are breaks in between the measurements, you might run into problems. Depending on trends, seasonality, and noise, you could interpolate the missing data. There are multiple approaches, such as the naive approach, filling gaps with the mean, forecasting from the preceding values, and many more. There is no right or wrong; it really depends on what your data looks like.
EDIT: Comments don't like codeblocks.
Works like this:
Example:
#df1:
time value
1 1.4
2 2.5
#df2:
time value
3 1.1
4 1.0
#will be glued together to become df = pd.concat([df1, df2], ignore_index=True)
time value
1 1.4
2 2.5
3 1.1
4 1.0
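The gap-filling options mentioned in this answer (mean fill, carrying the previous value forward, interpolation) can be compared on a toy series. The values below are made up for illustration; which method is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# A short series with gaps, standing in for one sensor column.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

mean_filled = s.fillna(s.mean())               # fill gaps with the column mean
forward_filled = s.ffill()                     # naive approach: repeat the last seen value
interpolated = s.interpolate(method="linear")  # straight line between known neighbours

print(pd.DataFrame({
    "original": s,
    "mean": mean_filled,
    "ffill": forward_filled,
    "linear": interpolated,
}))
```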

How to cluster data based on a subset of attributes (4 attributes)?

I have a pandas DataFrame that holds data for some objects, including the position of some parts of each object (Left, Top, Right, Bottom).
For example:
ObjectID  Left  Right  Top  Bottom
1         0     0      0    0
2         20    15     5    5
3         3     2      0    0
How can I cluster the objects based on these 4 attributes?
Is there a clustering algorithm or technique that you would recommend?
Almost all clustering algorithms are multivariate and can be used here. So your question is too broad.
It may be worth looking at appropriate distance measures first.
No specific recommendation would be well founded, because we don't know how your data is distributed.
Depending on the data type and final objective, you can try k-means, k-modes, or k-prototypes. If your data has a mix of categorical and continuous variables, you can try the partitioning around medoids (PAM) algorithm. However, as another user stated earlier, could you give more information about the type of data and its variance?
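As a rough illustration of the k-means option on the four position attributes, here is a small sketch using the three rows from the question. It is a plain NumPy version of Lloyd's algorithm written for this example rather than a library implementation; the function name `kmeans`, the choice of k=2, and the initialization from the first k rows are assumptions made only to keep the example deterministic.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ObjectID": [1, 2, 3],
    "Left": [0, 20, 3],
    "Right": [0, 15, 2],
    "Top": [0, 5, 0],
    "Bottom": [0, 5, 0],
})

# Cluster on the subset of attributes only; ObjectID is left out.
X = df[["Left", "Right", "Top", "Bottom"]].to_numpy(dtype=float)


def kmeans(X, k, n_iter=20):
    """Basic Lloyd's algorithm with Euclidean distance."""
    centers = X[:k].copy()  # illustrative init: the first k rows
    for _ in range(n_iter):
        # distance of every row to every center, shape (n_rows, k)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers


labels, centers = kmeans(X, k=2)
df["cluster"] = labels
print(df)
```

With real data it is usually worth scaling the attributes first and trying several values of k, since Euclidean distance is sensitive to the range of each column.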

Performance decrease for huge amount of columns. Pyspark

I ran into a problem processing a wide Spark DataFrame (about 9000 columns, sometimes more).
Task:
Create a wide DF via groupBy and pivot.
Transform the columns into a vector and feed it into KMeans from pyspark.ml.
So I built the wide frame, created the vector with VectorAssembler, cached it, and trained KMeans on it.
Assembling took about 11 minutes and KMeans took 2 minutes for 7 different cluster counts, on my PC in standalone mode, for a frame of ~500x9000. By contrast, the same processing in pandas (pivot the df and iterate over 7 cluster counts) takes less than one minute.
I understand that standalone mode, caching, and so on add overhead and reduce performance, but this really discourages me.
Could somebody explain how I can avoid this overhead?
How do people work with wide DFs without using VectorAssembler and taking this performance hit?
A more formal question (for SO rules) would be: how can I speed up this code?
%%time
tmp = (df_states.select('ObjectPath', 'User', 'PropertyFlagValue')
       .groupBy('User')
       .pivot('ObjectPath')
       .agg({'PropertyFlagValue':'max'})
       .fillna(0))
ignore = ['User']
assembler = VectorAssembler(
    inputCols=[x for x in tmp.columns if x not in ignore],
    outputCol='features')
Wall time: 36.7 s
print(tmp.count(), len(tmp.columns))
552, 9378
%%time
transformed = assembler.transform(tmp).select('User', 'features').cache()
Wall time: 10min 45s
%%time
lst_levels = []
for num in range(3, 14):
    kmeans = KMeans(k=num, maxIter=50)
    model = kmeans.fit(transformed)
    lst_levels.append(model.computeCost(transformed))
rs = [i-j for i,j in list(zip(lst_levels, lst_levels[1:]))]
for i, j in zip(rs, rs[1:]):
    if i - j < j:
        print(rs.index(i))
        kmeans = KMeans(k=rs.index(i) + 3, maxIter=50)
        model = kmeans.fit(transformed)
        break
Wall time: 1min 32s
Config:
.config("spark.sql.pivotMaxValues", "100000") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.config("spark.sql.shuffle.partitions", "4") \
.config("spark.sql.inMemoryColumnarStorage.batchSize", "1000") \
VectorAssembler's transform function processes all the columns and stores metadata on each column in addition to the original data. This takes time, and also takes up RAM.
To put an exact figure on how much things have grown, you can dump your data frame before and after the transformation as parquet files and compare. In my experience, a feature vector built by VectorAssembler can be about 10x larger than one built by hand or by other feature extraction methods, and that was for a logistic regression with only 10 parameters. Things will get a lot worse with a data set that has as many columns as yours.
A few suggestions:
See if you can build your feature vector another way. I'm not sure how performant this would be in Python, but I've got a lot of mileage out of this approach in Scala. I've noticed something like a 5x-6x performance difference when comparing logistic regressions (10 params) on manually built vectors, or on vectors built with other extraction methods (TF-IDF), against ones built with VectorAssembler.
See if you can reshape your data to reduce the number of columns that need to be processed by VectorAssembler.
See if increasing the RAM available to Spark helps.
The solution I actually found was to use map on the RDD.
First, create a map of values for each row.
Also extract all the distinct names.
In the penultimate step, look up each name in each row's map and return its value, or 0 if nothing was found.
Finally, run VectorAssembler on the results.
Advantages:
You don't have to create a wide DataFrame with a huge number of columns, so you avoid the overhead. (The time dropped from 11 minutes to 1.)
You still work on the cluster and execute your code within the Spark paradigm.
Example of code: scala implementation.
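The steps above can be sketched in plain Python to show the idea. This is not the linked Scala implementation and it does not use Spark: the record layout `(User, ObjectPath, PropertyFlagValue)` follows the question, while the sample values and the helper name `to_dense` are assumptions for illustration. In Spark, the per-user map and the lookup step would run as RDD transformations instead of local loops.

```python
from collections import defaultdict

# Long-format records: (User, ObjectPath, PropertyFlagValue)
records = [
    ("alice", "/a", 1),
    ("alice", "/b", 3),
    ("bob", "/b", 2),
    ("alice", "/a", 5),
]

# Step 1: one map of values per user, keeping the max per ObjectPath
# (mirrors groupBy('User') with agg max, but without pivoting).
per_user = defaultdict(dict)
for user, path, value in records:
    per_user[user][path] = max(value, per_user[user].get(path, value))

# Step 2: all distinct names, in a fixed order that defines the vector positions.
names = sorted({path for _, path, _ in records})


# Step 3: look each name up in a user's map, falling back to 0 when absent.
def to_dense(values_by_path):
    return [values_by_path.get(name, 0) for name in names]


vectors = {user: to_dense(values) for user, values in per_user.items()}
print(names)
print(vectors)
```

The wide table never exists as thousands of separate columns; each user goes straight from a small map to a single feature vector.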

How to use dask to quickly access subsets of the data?

One of the main reasons I love pandas is that it's easy to home in on subsets, e.g. df[df.sample.isin(['a', 'c', 'p'])] or df[df.age < 35]. Is dask dataframe good at (optimized for) this as well? The tutorials I've seen have focused on whole-column manipulations.
My specific application is (thousands of named GCMS samples) x (~20000 time points per sample) x (500 m/z channels) x (intensity), and I'm looking for the fastest tool to pull arbitrary subsets, e.g.
df[df.sample.isin([...]) & df.rt.lt(800) & df.rt.gt(600) & df.mz.isin(...)]
If dask is a good choice, then I would appreciate advice on how best to structure it.
What I've tried
What I've tried so far is to convert each sample to a pandas dataframe that looks like
smp rt 14 15 16 17 18
0 160602_JK_OFCmix:1 271.0 64088.0 9976.0 26848.0 23928.0 89600.0
1 160602_JK_OFCmix:1 271.1 65472.0 10880.0 28328.0 24808.0 91840.0
2 160602_JK_OFCmix:1 271.2 64528.0 10232.0 27672.0 25464.0 90624.0
3 160602_JK_OFCmix:1 271.3 63424.0 10272.0 27600.0 25064.0 90176.0
4 160602_JK_OFCmix:1 271.4 64816.0 10640.0 27592.0 24896.0 90624.0
('smp' is sample name, 'rt' is retention time, 14,15,...500 are m/z channels), save to hdf with zlib, level=1, then make the dask dataframe with
ddf = dd.read_hdf('*.hdf5', key='/*', chunksize=100000, lock=False)
but df = ddf[ddf.smp.isin([...a couple of samples...])].compute() is 100x slower than ddf['57'].mean().compute().
(Note: this is with dask.set_options(get=dask.multiprocessing.get))
Your dask.dataframe is backed by an HDF file, so every time you do any operation you're reading in the data from disk. This is great if your data doesn't fit in memory but wasteful if your data does fit in memory.
If your data fits in memory
Instead, if your data fits in memory then try backing your dask.dataframe off of a Pandas dataframe:
# ddf = dd.read_hdf(...)
ddf = dd.from_pandas(df, npartitions=20)
I expect you'll see better performance from the threaded or distributed schedulers: http://dask.pydata.org/en/latest/scheduler-choice.html
If your data doesn't fit in memory
Try to reduce the number of bytes you have to read by specifying a set of columns to read in your read_hdf call
df = dd.read_hdf(..., columns=['57'])
Or, better yet, use a data store that lets you efficiently load individual columns. You could try something like Feather or Parquet, though both are in early stages:
https://github.com/wesm/feather
http://fastparquet.readthedocs.io/en/latest/
I suspect that if you're careful to avoid reading in all of the columns at once you could probably get by with just Pandas instead of using Dask.dataframe.
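For the pandas-only route, the kind of subset query from the question can be sketched as below. The tiny frame is made up for illustration (column names `smp`, `rt`, and the m/z channels follow the layout shown in the question); the point is to build one boolean mask and select only the columns that are needed.

```python
import pandas as pd

df = pd.DataFrame({
    "smp": ["s1", "s1", "s2", "s3", "s3"],
    "rt": [271.0, 650.0, 700.0, 799.9, 800.0],
    "14": [64088.0, 65472.0, 64528.0, 63424.0, 64816.0],
    "15": [9976.0, 10880.0, 10232.0, 10272.0, 10640.0],
})

wanted_samples = ["s1", "s3"]  # hypothetical sample names
wanted_mz = ["14"]             # hypothetical m/z channels to keep

# Combine the row conditions into a single mask, then pick rows and columns in one step.
mask = df["smp"].isin(wanted_samples) & df["rt"].gt(600) & df["rt"].lt(800)
subset = df.loc[mask, ["smp", "rt"] + wanted_mz]
print(subset)
```

Note that a column literally named sample has to be accessed as df['sample'], because df.sample is a DataFrame method in pandas.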

increasing pandas dataframe imputation performance

I want to impute a large datamatrix (90*90000) and later an even larger one (150000*800000) using pandas.
At the moment I am testing with the smaller one on my laptop (8gb ram, Haswell core i5 2.2 GHz, the larger dataset will be run on a server).
The columns have some missing values that I want to impute with the most frequent one over all rows.
My working code for this is:
freq_val = pd.Series(mode(df.iloc[:,6:])[0][0], df.iloc[:,6:].columns.values) # most frequent value per column, starting from the first SNP column (the second row of 'mode' gives the actual frequencies)
df_imputed = df.iloc[:,6:].fillna(freq_val) # impute unknown SNP values with the most frequent value of the respective column
The imputation takes about 20 minutes on my machine. Is there another implementation that would increase performance?
try this:
df_imputed = df.iloc[:, 6:].fillna(df.iloc[:, 6:].apply(lambda x: x.mode()).iloc[0])
I tried different approaches. The key takeaway is that the mode function is really slow. Instead, I implemented the same functionality using np.unique (with return_counts=True) and np.bincount. The latter is supposedly faster, but it doesn't work with NaN values.
The optimized code now needs about 28 s to run. MaxU's answer needs ~48 s on my machine to finish.
The code:
sub = df.iloc[:, 6:]  # SNP columns only (.ix has been removed from pandas)
freq_val = np.zeros(sub.shape[1])
for i in range(sub.shape[1]):
    col = sub.iloc[:, i].dropna().to_numpy()  # ignore NaN when counting
    values, counts = np.unique(col, return_counts=True)
    freq_val[i] = values[counts.argmax()]  # store the value itself, not its index
freq_val_series = pd.Series(freq_val, index=sub.columns)
df_imputed = sub.fillna(freq_val_series)
Thanks for the input!
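A small check of the np.unique approach against pandas' built-in mode, on made-up data. The helper name `most_frequent_per_column` and the column names are assumptions for illustration. The toy values are chosen so that the most frequent value differs from its position in the array returned by np.unique, which is the case where using `counts.argmax()` directly as the fill value would go wrong.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "snp1": [1, 1, np.nan, 2],
    "snp2": [2, np.nan, 2, 1],
    "snp3": [0, 0, 0, np.nan],
})


def most_frequent_per_column(frame):
    """Most frequent non-NaN value of each column, via np.unique."""
    modes = {}
    for name in frame.columns:
        col = frame[name].dropna().to_numpy()
        values, counts = np.unique(col, return_counts=True)
        modes[name] = values[counts.argmax()]  # index into values, not the index itself
    return pd.Series(modes)


freq = most_frequent_per_column(df)
df_imputed = df.fillna(freq)

# Cross-check against the built-in (slower on wide frames) pandas mode.
reference = df.mode().iloc[0]
print(freq.tolist(), reference.tolist())
```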
