Performance decrease for huge amount of columns. Pyspark - python

I ran into a problem processing a wide Spark DataFrame (about 9,000 columns, sometimes more).
Task:
Create a wide DF via groupBy and pivot.
Transform the columns into a vector and feed it to KMeans from pyspark.ml.
So I built the wide frame, created the vector with VectorAssembler, cached it, and trained KMeans on it.
For a frame of ~500x9000, assembling took about 11 minutes and KMeans took 2 minutes for 7 different cluster counts on my PC in standalone mode. By contrast, the same processing in pandas (pivot the df and iterate over 7 cluster counts) takes less than one minute.
I understand that standalone mode, caching, and so on add overhead and reduce performance, but this really discourages me.
Could somebody explain how I can avoid this overhead?
How do people work with wide DFs without using VectorAssembler and taking this performance hit?
A more formal question (for SO rules) would be: how can I speed up this code?
%%time
tmp = (df_states.select('ObjectPath', 'User', 'PropertyFlagValue')
       .groupBy('User')
       .pivot('ObjectPath')
       .agg({'PropertyFlagValue':'max'})
       .fillna(0))
ignore = ['User']
assembler = VectorAssembler(
    inputCols=[x for x in tmp.columns if x not in ignore],
    outputCol='features')
Wall time: 36.7 s
print(tmp.count(), len(tmp.columns))
552, 9378
%%time
transformed = assembler.transform(tmp).select('User', 'features').cache()
Wall time: 10min 45s
%%time
lst_levels = []
for num in range(3, 14):
    kmeans = KMeans(k=num, maxIter=50)
    model = kmeans.fit(transformed)
    lst_levels.append(model.computeCost(transformed))
rs = [i-j for i,j in list(zip(lst_levels, lst_levels[1:]))]
for i, j in zip(rs, rs[1:]):
    if i - j < j:
        print(rs.index(i))
        kmeans = KMeans(k=rs.index(i) + 3, maxIter=50)
        model = kmeans.fit(transformed)
        break
Wall time: 1min 32s
Config:
.config("spark.sql.pivotMaxValues", "100000") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.config("spark.sql.shuffle.partitions", "4") \
.config("spark.sql.inMemoryColumnarStorage.batchSize", "1000") \

VectorAssembler's transform function processes all the columns and stores metadata on each column in addition to the original data. This takes time, and also takes up RAM.
To put an exact figure on how much things have grown, you can dump your data frame before and after the transformation as parquet files and compare. In my experience, a feature vector built by VectorAssembler can be about 10x larger than one built by hand or by other feature extraction methods, and this was for a logistic regression with only 10 parameters. Things will get a lot worse with a data set with as many columns as you have.
A few suggestions:
See if you can build your feature vector another way. I'm not sure how performant this would be in Python, but I've got a lot of mileage out of this approach in Scala. I've noticed something like a 5x-6x performance difference for logistic regressions (10 params) when comparing manually built vectors, or vectors built using other extraction methods (TF-IDF), against VectorAssembled ones.
See if you can reshape your data to reduce the number of columns that need to be processed by VectorAssembler.
See if increasing the RAM available to Spark helps.

The solution I eventually found was to use map on the RDD.
First, we create a map of values for each row.
We also extract all distinct names.
In the penultimate step, we look up each name in the row's map and return its value, or 0 if nothing was found.
Finally, we run VectorAssembler on the results.
Advantages:
You don't have to create a wide DataFrame with a huge number of columns, so you avoid the overhead. (Run time dropped from 11 minutes to 1.)
You still work on the cluster and execute your code in the Spark paradigm.
Code example: Scala implementation.
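For PySpark, the steps above might look roughly like the sketch below. It is not a port of the Scala implementation: the helper names (build_index, to_sparse_entries, assemble_features) are made up for illustration, it builds a sparse vector directly instead of running VectorAssembler at the end, and it assumes ObjectPath and PropertyFlagValue are never null. Only the two pure-Python helpers are exercised here; the Spark wiring is untested.

```python
def build_index(names):
    """Map each distinct name to a fixed vector position."""
    return {name: pos for pos, name in enumerate(sorted(names))}


def to_sparse_entries(pairs, index):
    """Collapse (name, value) pairs into {position: max value}.

    Zeros are dropped, which matches fillna(0) on the pivoted frame.
    """
    entries = {}
    for name, value in pairs:
        pos = index[name]
        entries[pos] = max(entries.get(pos, value), value)
    return {pos: float(v) for pos, v in entries.items() if v != 0}


def assemble_features(df_states):
    """Spark wiring (assumption, untested): one sparse vector per User."""
    from pyspark.ml.linalg import Vectors  # lazy import; requires pyspark

    # Distinct names define the vector layout.
    names = [r[0] for r in df_states.select("ObjectPath").distinct().collect()]
    index = build_index(names)
    size = len(index)

    # (User, (ObjectPath, value)) pairs, grouped per user.
    pairs = df_states.rdd.map(
        lambda r: (r["User"], (r["ObjectPath"], r["PropertyFlagValue"])))
    vectors = pairs.groupByKey().map(
        lambda kv: (kv[0], Vectors.sparse(size, to_sparse_entries(kv[1], index))))
    return vectors.toDF(["User", "features"])
```

The resulting frame has the same User and features columns as the VectorAssembler output in the question, so it could be cached and passed to KMeans.fit in the same way.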

Related

Quickest way to access & compare huge data in Python

I am a newbie to Pandas, and somewhat newbie to python
I am looking at stock data, which I read in as CSV and typical size is 500,000 rows.
The data looks like this
'''
'''
I need to check the data against itself - the basic algorithm is a loop similar to
Row = 0
x = get "low" price in row ROW
y = CalculateSomething(x)
go through the rest of the data, compare against y
if (a):
append ("A") at the end of row ROW # in the dataframe
else
print ("B") at the end of row ROW
Row = Row +1
the next iteration, the datapointer should reset to ROW 1. then go through same process
each time, it adds notes to the dataframe at the ROW index
I looked at Pandas, and figured the way to try this would be to use two loops, and copying the dataframe to maintain two separate instances
The actual code looks like this (simplified)
df = pd.read_csv('data.csv')
calc1 = 1 # this part is confidential so set to something simple
calc2 = 2 # this part is confidential so set to something simple
def func3_df_index(df):
    dfouter = df.copy()
    for outerindex in dfouter.index:
        dfouter_openval = dfouter.at[outerindex,"Open"]
        for index in df.index:
            if (df.at[index,"Low"] <= (calc1) and (index >= outerindex)) :
                dfouter.at[outerindex,'notes'] = "message 1"
                break
            elif (df.at[index,"High"] >= (calc2) and (index >= outerindex)):
                dfouter.at[outerindex,'notes'] = "message2"
                break
            else:
                dfouter.at[outerindex,'notes'] = "message3"
This method takes a long time (7+ minutes per 5K rows), which will be far too slow for 500,000 rows. Some data sets may exceed 1 million rows.
I have tried the two-loop method with the following variants:
using iloc, e.g. df.iloc[index,2]
using at, e.g. df.at[index,"low"]
using numpy and at, e.g. df.at[index,"low"] = np.where((df.at[index,"low"] < ...
The data consists of floating-point values and datetime strings.
Is it better to use numpy? Or is there an alternative to using two loops?
Any other methods, such as R, Mongo, or some other database, would also be useful; I just need the results and am not tied to Python.
Any help and constructs would be greatly appreciated.
Thanks in advance
You are copying the dataframe and manually looping over the indices. This will almost always be slower than vectorized operations.
If you only care about one row at a time, you can simply use the csv module.
numpy is not "better"; pandas uses numpy internally.
Alternatively, load the data into a database. Examples include sqlite, mysql/mariadb, postgres, or maybe DuckDB, then use query commands against that. This has the added advantage of allowing type conversion from strings to floats, so numerical analysis is easier.
If you really want to process a file in parallel directly from Python, you could move to Dask or PySpark, although Pandas should also work with some tuning. If you load the data into a database first, the Pandas read_sql function would be a better starting point.
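As a rough illustration of the vectorized route for the simplified code in the question, where calc1 and calc2 are constants, the inner loop can be replaced by a reverse cumulative minimum that finds, for every row, the next position at which each condition holds. The function name annotate_first_hit is hypothetical, and the sketch assumes the index is in ascending order so that positions can stand in for the index comparisons. It does not cover thresholds that depend on the outer row.

```python
import numpy as np
import pandas as pd


def annotate_first_hit(df, calc1, calc2):
    """Label each row by which condition is met first at or after that row."""
    n = len(df)
    pos = np.arange(n)
    low_hit = df["Low"].to_numpy() <= calc1
    high_hit = df["High"].to_numpy() >= calc2

    # next_low[i] = smallest j >= i where low_hit[j] is True, or n if none.
    next_low = np.minimum.accumulate(np.where(low_hit, pos, n)[::-1])[::-1]
    next_high = np.minimum.accumulate(np.where(high_hit, pos, n)[::-1])[::-1]

    notes = np.full(n, "message3", dtype=object)
    notes[next_high < n] = "message2"
    # The Low check runs first in the original loop, so it wins ties.
    notes[(next_low < n) & (next_low <= next_high)] = "message 1"

    out = df.copy()
    out["notes"] = notes
    return out
```

This does a constant number of passes over the data instead of one inner loop per row, so the cost grows linearly with the number of rows.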
You can split the main dataset into smaller datasets, e.g. 50 sub-datasets of 10,000 rows each, to increase speed. Run your functions on each sub-dataset using threading or concurrency, and then combine the final results.

Applying Column-based Data Transformations on PySpark DataFrames in Parallel

I've searched across SO a bit and haven't been able to find a question that resembles mine; I hope this isn't a duplicate, but feel free to point me in the right direction if a similar question has already been asked!
I'm in the process of K-Fold mean-encoding a set of categorical vectors in a very large dataset (think, 30+ million rows).
I currently have my code set up such that:
the dataframe is split into random subsets using randomSplit()
for each split, I iterate through each column of type categorical and calculate the mean-encoding for that column and split
I keep track of the split's mean-encoding results in a dictionary
following completion of all splits, I average the results
My problem is that this is taking a good amount of time (to perform the mean-encoding calculation on a single column across 5 splits takes a little over 6 minutes; I have multiple hundreds of categorical columns) and I'm fairly certain that I can speed it up by simply running the task in parallel (i.e. apply the same function to all splits simultaneously). However, I can't seem to figure out how to perform said function in parallel using PySpark's built-in functionality. I'm not interested in bringing in threading or pools simply because I'm unsure how it actually interacts with PySpark (if I'm totally wrong and this is the optimal way to go, please let me know).
If it helps, here is the function I've put together for the purposes of calculating the mean-encoding for a specified column for a specified DF, followed by the loop that I'm talking about. Any way to increase the efficiency and speed of this would be hugely appreciated.
def determine_means(df, col, target):
    """
    :param df: pyspark.sql.dataframe
        dataframe to apply target mean encoding
    :param col: str list
        column to apply target encoding
    :param target: str
        target column
    :return:
        dict of {string:float}
    """
    means = df.groupby(F.col(col)).agg(F.mean(target).alias(f"{col}_mean_encoding"))
    means = means.withColumn(f"{col}_mean_encoding", means[f"{col}_mean_encoding"].cast(FloatType()))
    means = means.toPandas()
    return dict(zip(list(means[col].values), list(means[f"{col}_mean_encoding"].values)))
meta_means_dict = dict()
splits = PYSPARK_DF.randomSplit([.2, .2, .2, .2, .2])
for sp in splits:
    for col in CATEGORICAL_COLUMNS:
        if col not in meta_means_dict.keys():
            meta_means_dict[col] = dict()
        for k, v in determine_means(sp, col, TARGET_COL).items():
            if k in meta_means_dict[col].keys():
                meta_means_dict[col][k].append(v)
            else:
                meta_means_dict[col][k] = [v]
Does anyone have any advice or tips?
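One direction that might be worth trying, sketched here under assumptions rather than offered as a tested answer: instead of materializing five splits and running one job per split, tag each row with a random fold number and group by (fold, column) so that a single Spark job covers all folds of a column. The names average_fold_means, mean_encode, and the "fold" column are hypothetical, and only the pure-Python averaging step is exercised below.

```python
from collections import defaultdict


def average_fold_means(rows):
    """rows: iterable of (fold, category, mean).

    Returns {category: average of the per-fold means}.
    """
    per_category = defaultdict(list)
    for _fold, category, mean in rows:
        per_category[category].append(mean)
    return {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}


def mean_encode(df, col, target, n_folds=5, seed=0):
    """Spark wiring (assumption, untested): one job per column, not per split."""
    from pyspark.sql import functions as F  # lazy import; requires pyspark

    # Random fold id in [0, n_folds); plays the role of randomSplit().
    with_fold = df.withColumn("fold", (F.rand(seed) * n_folds).cast("int"))
    rows = (with_fold.groupBy("fold", col)
            .agg(F.mean(target).alias("fold_mean"))
            .collect())
    return average_fold_means((r["fold"], r[col], r["fold_mean"]) for r in rows)
```

The aggregation itself still runs in parallel across the cluster; the difference is that the driver launches one job per column rather than one per column and split.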

KNN on Spark dataframe with 15 Million records

I have a Pyspark dataframe like this:
0 [0.010904288850724697, -0.010935504920780659, ...
1 [0.34882408380508423, -0.19240069389343262, -0...
2 [0.13833148777484894, -0.23080679774284363, -0...
3 [0.12398581206798553, -0.4803846478462219, -0....
4 [0.16033919155597687, -0.06204992160201073, -0.
Now I want to find the 100 nearest neighbors for each of these arrays.
Here's my try:
df_collect = df.toPandas()
features = np.array(df_collect.features.to_list())
knnobj = NearestNeighbors(n_neighbors=100).fit(features)
distance_mat, neighbours_mat = knnobj.kneighbors(features)
But as df is too big, it takes too long. I know I can broadcast and parallelize the last step, but I can't find how to fit a Spark df to a scikit-learn KNN model. Is there any other way to do this?
I also read some articles that mention ANN (Approximate Nearest Neighbor), Sparkit-Learn, and spark_sklearn, but I can't find a nearest-neighbor implementation in them. Can anyone guide me on what to do next?
1. Load the data using libraries like datatable, cuDF or dask. They are often faster than Pandas for large files.
2. Reduce memory consumption by up to 90% by casting each column to the smallest subtype possible.
3. Choose a data manipulation library you are comfortable with, or one based on what you need.
4. Take a 10–20% sample of the data for rapid analysis and experimentation.
5. Think in vectors and use vectorized functions.
6. Choose a fast ML library like CatBoost for building baselines and doing feature engineering.
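For point 2, a minimal pandas sketch of downcasting might look like this. The function name downcast_numeric is hypothetical; it relies on pd.to_numeric with the downcast argument, and note that float columns drop to float32, which loses precision and may matter for distance computations.

```python
import pandas as pd


def downcast_numeric(df):
    """Return a copy with integer and float columns cast to smaller subtypes."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col):
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_float_dtype(col):
            out[name] = pd.to_numeric(col, downcast="float")
    return out
```

Comparing df.memory_usage(deep=True).sum() before and after shows how much is actually saved on a given data set.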

How to apply multiple functions to several chunks of a dask dataframe?

I have a dataframe of 500,000 lines and 3 columns. I would like to compute the result of three functions for every chunk of 5,000 lines in the dataframe (that is, 100 chunks). Two of the three functions are user-defined, while the third is the mean of the values in column 3.
At the moment, I am first extracting a chunk, and then computing the results of the functions for that chunk. For the mean of column 3 I am using df.iloc[:,2].compute().mean() but the other functions are performed outside of dask.
Is there a way to leverage dask's multithreading ability, taking the entire dataframe and a chunk size as input, and have it computing the same functions but automatically? This feels like the more appropriate way of using Dask.
Also, this feels like a basic dask question to me, so please if this is a duplicate, just point me to the right place (I'm new to dask and I might have not looked for the right thing so far).
I would repartition your dataframe, and then use the map_partitions function to apply each of your functions in parallel
df = df.repartition(npartitions=100)
a = df.map_partitions(func1)
b = df.map_partitions(func2)
c = df.map_partitions(func3)
a, b, c = dask.compute(a, b, c)
You can create an artificial column for grouping indices into those 100 chunks.
ranges = np.arange(0, df.shape[0], 5000)
df['idx_group'] = ranges.searchsorted(df.index, side='right')
Then use this idx_group to perform your operations using pandas groupby.
NOTE: You can play with searchsorted to exactly fit your chunk requirements.
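The idx_group idea above can be sketched in plain pandas as follows. The function name summarize_chunks and the funcs dictionary are made up for illustration, and the sketch groups by row position rather than by df.index, so it does not depend on the index being a clean 0..N-1 range.

```python
import numpy as np
import pandas as pd


def summarize_chunks(df, chunk_size, funcs):
    """Apply each function in funcs to consecutive chunks of chunk_size rows."""
    ranges = np.arange(0, len(df), chunk_size)
    # Group number for every row position: 1 for the first chunk, 2 for the next...
    groups = ranges.searchsorted(np.arange(len(df)), side="right")
    rows = {
        key: {name: func(chunk) for name, func in funcs.items()}
        for key, chunk in df.groupby(groups)
    }
    return pd.DataFrame.from_dict(rows, orient="index")
```

A call such as summarize_chunks(df, 5000, {"mean_c": lambda chunk: chunk.iloc[:, 2].mean(), "f1": func1, "f2": func2}) would return one row per chunk and one column per function, where func1 and func2 stand for the two user-defined functions.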

increasing pandas dataframe imputation performance

I want to impute a large datamatrix (90*90000) and later an even larger one (150000*800000) using pandas.
At the moment I am testing with the smaller one on my laptop (8gb ram, Haswell core i5 2.2 GHz, the larger dataset will be run on a server).
The columns have some missing values that I want to impute with the most frequent one over all rows.
My working code for this is:
freq_val = pd.Series(mode(df.ix[:,6:])[0][0], df.ix[:,6:].columns.values) #most frequent value per column, starting from the first SNP column (second row of 'mode'gives actual frequencies)
df_imputed = df.ix[:,6:].fillna(freq_val) #impute unknown SNP values with most frequent value of respective columns
The imputation takes about 20 minutes on my machine. Is there another implementation that would increase performance?
try this:
df_imputed = df.iloc[:, 6:].fillna(df.iloc[:, 6:].apply(lambda x: x.mode()).iloc[0])
I tried different approaches. The key learning is that the mode function is really slow. Alternatively, I implemented the same functionality using np.unique (return_counts=True) and np.bincount. The latter is supposedly faster, but it doesn't work with NaN values.
The optimized code now needs about 28 s to run. MaxU's answer needs ~48 s on my machine to finish.
The code:
cols = df.columns[6:]
freq_val = np.zeros(len(cols))
for i, col in enumerate(cols):
    # drop NaN first so a missing value can never win the count
    values, counts = np.unique(df[col].dropna().to_numpy(), return_counts=True)
    # store the most frequent value itself, not its position in the counts array
    freq_val[i] = values[counts.argmax()]
freq_val_series = pd.Series(freq_val, index=cols)
df_imputed = df[cols].fillna(freq_val_series)
Thanks for the input!
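The np.bincount route mentioned above can be made to work by masking out NaN before counting. This is only a sketch, assuming the SNP values are small non-negative integer codes stored as floats with NaN for missing entries; the function name impute_mode_bincount is hypothetical, and ties go to the smallest code.

```python
import numpy as np
import pandas as pd


def impute_mode_bincount(df):
    """Fill NaN in every column with that column's most frequent integer code."""
    out = df.copy()
    for col in out.columns:
        values = out[col].to_numpy(dtype=float)
        present = values[~np.isnan(values)].astype(np.int64)
        if present.size == 0:
            continue  # nothing to learn from an all-NaN column
        out[col] = out[col].fillna(np.bincount(present).argmax())
    return out
```

It would be applied to the SNP block only, e.g. impute_mode_bincount(df.iloc[:, 6:]).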
