I am trying to remove outliers from all my columns. I already used a fairly large instance on AWS (4 vCPUs + 16 GiB), but it still couldn't get through.
num_col = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
data[num_col] = data[num_col].apply(lambda x: x.clip(*x.quantile([0.01, 0.99])))  # clip each column to its 1st/99th percentiles
There are 102 columns in total from which I need to remove outliers.
Is there a more efficient way to write this so it runs faster and uses less memory?
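One thing that might help, as a rough sketch (keeping the same data and num_col names, and assuming the per-column quantile table fits easily in memory), is to compute all quantiles once and let pandas clip column-wise instead of applying a Python lambda per column:

lower = data[num_col].quantile(0.01)  # Series of per-column 1st percentiles
upper = data[num_col].quantile(0.99)  # Series of per-column 99th percentiles
data[num_col] = data[num_col].clip(lower=lower, upper=upper, axis=1)

Whether this also reduces peak memory depends on the dtypes; downcasting to float32 beforehand is a separate lever.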
I've searched across SO a bit and haven't been able to find a question that resembles mine; I hope this isn't a duplicate, but feel free to point me in the right direction if a similar question has already been asked!
I'm in the process of K-Fold mean-encoding a set of categorical vectors in a very large dataset (think, 30+ million rows).
I currently have my code set up such that:
the dataframe is split into random subsets using randomSplit()
for each split, I iterate through each column of type categorical and calculate the mean-encoding for that column and split
I keep track of the split's mean-encoding results in a dictionary
following completion of all splits, I average the results
My problem is that this is taking a good amount of time (performing the mean-encoding calculation on a single column across 5 splits takes a little over 6 minutes, and I have several hundred categorical columns), and I'm fairly certain I can speed it up by running the task in parallel (i.e. applying the same function to all splits simultaneously). However, I can't seem to figure out how to do that using PySpark's built-in functionality. I'm not interested in bringing in threading or pools simply because I'm unsure how they actually interact with PySpark (if I'm totally wrong and this is the optimal way to go, please let me know).
If it helps, here is the function I've put together for the purposes of calculating the mean-encoding for a specified column for a specified DF, followed by the loop that I'm talking about. Any way to increase the efficiency and speed of this would be hugely appreciated.
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def determine_means(df, col, target):
    """
    :param df: pyspark.sql.DataFrame
        dataframe to apply target mean encoding
    :param col: str
        column to apply target encoding
    :param target: str
        target column
    :return:
        dict of {string: float}
    """
    means = df.groupby(F.col(col)).agg(F.mean(target).alias(f"{col}_mean_encoding"))
    means = means.withColumn(f"{col}_mean_encoding", means[f"{col}_mean_encoding"].cast(FloatType()))
    means = means.toPandas()
    return dict(zip(list(means[col].values), list(means[f"{col}_mean_encoding"].values)))

meta_means_dict = dict()
splits = PYSPARK_DF.randomSplit([.2, .2, .2, .2, .2])
for sp in splits:
    for col in CATEGORICAL_COLUMNS:
        if col not in meta_means_dict.keys():
            meta_means_dict[col] = dict()
        for k, v in determine_means(sp, col, TARGET_COL).items():
            if k in meta_means_dict[col].keys():
                meta_means_dict[col][k].append(v)
            else:
                meta_means_dict[col][k] = [v]
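For comparison, here is a rough sketch of how all folds could be aggregated in a single Spark job by tagging each row with a fold id, so the driver-side loop over splits goes away (this assumes the same PYSPARK_DF, CATEGORICAL_COLUMNS and TARGET_COL; the random fold column stands in for randomSplit(), mean_encodings is a hypothetical result dict, and averaging the per-fold means is taken as the end goal):

from pyspark.sql import functions as F

df_folded = PYSPARK_DF.withColumn("fold", (F.rand(seed=42) * 5).cast("int"))  # 5 pseudo-folds

mean_encodings = dict()
for col in CATEGORICAL_COLUMNS:
    per_level = (df_folded
                 .groupBy("fold", col)
                 .agg(F.mean(TARGET_COL).alias("enc"))   # per-fold mean for each level
                 .groupBy(col)
                 .agg(F.mean("enc").alias("enc"))        # average across the folds
                 .toPandas())
    mean_encodings[col] = dict(zip(per_level[col], per_level["enc"]))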
Does anyone have any advice or tips?
I have a dataframe with this structure.
I built this dfp with 100 rows of the original for testing,
and then I tried a pivot operation to get a dataframe like this.
The problem with the pivot operation on all the data is that the result would have 131209 rows and 39123 columns. When I try the operation, memory fills up and my PC restarts.
I tried segmenting the dataframe into 10 or 20 chunks. The pivot works, but when I do a concat operation it blows up the memory again.
My PC has 16 GB of memory. I have also tried Colab, but it runs out of memory there too.
Is there a format or another strategy for performing this operation?
You may try this:
dfu = dfp.groupby(['order_id','product_id'])[['my_column']].sum().unstack().fillna(0)
Another way is to split product_id into parts, process them separately, and concatenate the results back together:
front_part = []  # fill with the product_id values for the first chunk
rear_part = []   # fill with the product_id values for the second chunk
dfp_f = dfp[dfp['product_id'].isin(front_part)]
dfp_r = dfp[dfp['product_id'].isin(rear_part)]
dfs_f = dfp_f.pivot(index='order_id', columns='product_id', values=['my_column']).fillna(0)
dfs_r = dfp_r.pivot(index='order_id', columns='product_id', values=['my_column']).fillna(0)
dfs = pd.concat([dfs_f, dfs_r], axis=1)
front_part and rear_part split product_id into two groups; you need to fill each list with the actual product_id values that belong to it.
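If the goal is mainly to get a workable order x product matrix, a sparse representation is another option, since almost all of the 131209 x 39123 cells are zero. A minimal sketch with scipy.sparse (assuming the same dfp with columns order_id, product_id and my_column; duplicate pairs are summed when converting to CSR, which matches the groupby sum above):

from scipy import sparse

orders = dfp['order_id'].astype('category')      # row labels
products = dfp['product_id'].astype('category')  # column labels

mat = sparse.coo_matrix(
    (dfp['my_column'].to_numpy(),
     (orders.cat.codes, products.cat.codes)),
    shape=(len(orders.cat.categories), len(products.cat.categories)),
).tocsr()  # rows follow orders.cat.categories, columns follow products.cat.categories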
I have a large dataset (around 8 million rows x 25 columns) in pandas, and I am struggling to use the diff() function on a subset of the data in a performant manner.
Here is what my dataset looks like:
                   prec type
location_id hours
135         78     12.0    A
            79     14.0    A
            80     14.3    A
            81     15.0    A
            82     15.0    A
            83     15.0    A
            84     15.5    A
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is apply the diff() function to the prec column for each location. The original dataset accumulates the prec numbers; by applying diff() I get the actual prec value for each hour.
With these in mind, I have implemented the following algorithm in Pandas:
import numpy as np

# Filter the data first
df_filtered = df_data[df_data.type == "A"] # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours
# Apply the diff()
for location_id, data_of_location in df_filtered.groupby(level="location_id"):
    df_data.loc[data_of_location.index, "prec"] = data_of_location.prec.diff().replace(np.nan, 0.0)
del df_filtered
This works really well functionally; however, the performance and the memory consumption are horrible. It takes around 30 minutes on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
Also, the overall memory consumption of the Python script skyrockets during this operation; it grows by around 300%! The memory consumed by the main df_data data frame doesn't change, but the overall process memory consumption rises.
With the input from @Quang Hoang and @Ben.T, I figured out a solution that is pretty fast but still consumes a lot of memory.
# Filter the data first
df_filtered = df_data[df_data.type == "A"] # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours
# Apply the diff()
df_diffed = df_filtered.groupby(level="location_id").prec.diff().replace(np.nan, 0.0)
df_data.loc[df_diffed.index, "prec"] = df_diffed
del df_diffed
del df_filtered
I am guessing 2 things can be done to improve memory usage:
df_filtered seems like a copy of the data; that should increase the memory a lot.
df_diffed is also a copy.
The memory usage is very intensive while computing these two variables. I am not sure if there is any in-place way to execute such operations.
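For the copy issue, one option, sketched here under the assumption that the type column and the hours index level look as in the sample above, is to skip df_filtered entirely and pull only the prec column through a boolean mask:

hours = df_data.index.get_level_values("hours")
mask = (df_data["type"] == "A") & (hours > 0) & (hours <= 120)

# only the selected prec values are copied, not all ~20 columns
diffed = (df_data.loc[mask, "prec"]
          .groupby(level="location_id")
          .diff()
          .fillna(0.0))
df_data.loc[diffed.index, "prec"] = diffed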
I ran into a problem processing a wide Spark dataframe (about 9000 columns, sometimes more).
Task:
Create a wide DF via groupBy and pivot.
Transform the columns into a vector and feed it into KMeans from pyspark.ml.
So I built the wide frame, tried to create a vector with VectorAssembler, cached it, and trained KMeans on it.
Assembling took about 11 minutes and KMeans took 2 minutes for 7 different cluster counts on my PC in standalone mode, for a frame of ~500x9000. By contrast, the same processing in pandas (pivot the df, iterate over 7 cluster counts) takes less than a minute.
Obviously I understand there is overhead and a performance hit with standalone mode, caching and so on, but it really discourages me.
Could somebody explain how I can avoid this overhead?
How do people work with wide DFs without using VectorAssembler and taking the performance hit?
A more formal question (per SO rules) would be: how can I speed up this code?
%%time
tmp = (df_states.select('ObjectPath', 'User', 'PropertyFlagValue')
       .groupBy('User')
       .pivot('ObjectPath')
       .agg({'PropertyFlagValue': 'max'})
       .fillna(0))
ignore = ['User']
assembler = VectorAssembler(
    inputCols=[x for x in tmp.columns if x not in ignore],
    outputCol='features')
Wall time: 36.7 s
print(tmp.count(), len(tmp.columns))
552, 9378
%%time
transformed = assembler.transform(tmp).select('User', 'features').cache()
Wall time: 10min 45s
%%time
lst_levels = []
for num in range(3, 14):
    kmeans = KMeans(k=num, maxIter=50)
    model = kmeans.fit(transformed)
    lst_levels.append(model.computeCost(transformed))
rs = [i - j for i, j in list(zip(lst_levels, lst_levels[1:]))]
for i, j in zip(rs, rs[1:]):
    if i - j < j:
        print(rs.index(i))
        kmeans = KMeans(k=rs.index(i) + 3, maxIter=50)
        model = kmeans.fit(transformed)
        break
Wall time: 1min 32s
Config:
.config("spark.sql.pivotMaxValues", "100000") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.config("spark.sql.shuffle.partitions", "4") \
.config("spark.sql.inMemoryColumnarStorage.batchSize", "1000") \
VectorAssembler's transform function processes all the columns and stores metadata on each column in addition to the original data. This takes time, and also takes up RAM.
To put an exact figure on how much things have increased, you can dump your data frame before and after the transformation as parquet files and compare. In my experience, a feature vector built by hand or other feature extraction methods compared to one built by VectorAssembler can cause a size increase of 10x and this was for a logistic regression with only 10 parameters. Things will get a lot worse with a data set with as many columns as you have.
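As a quick sketch of that check, using the tmp and assembler objects from the question (the output paths are placeholders):

tmp.write.mode("overwrite").parquet("/tmp/before_assembler")
assembler.transform(tmp).write.mode("overwrite").parquet("/tmp/after_assembler")
# compare the two directory sizes on disk to see the overhead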
A few suggestions:
See if you can build your feature vector another way. I'm not sure how performant this would be in Python, but I've got a lot of mileage out of this approach in Scala. I've noticed something like a 5x-6x performance difference comparing logistic regressions (10 params) for manually built vectors or vectors built using other extraction methods (TF-IDF) than VectorAssembled ones.
See if you can reshape your data to reduce the number of columns that need to be processed by VectorAssembler.
See if increasing the RAM available to Spark helps.
Actually, the solution was found in an RDD map.
First of all, we create a map of values for each row.
We also extract all distinct names.
In the penultimate step, we look up each of the distinct names in the row's map and return its value, or 0 if nothing is found.
Then VectorAssembler on the results.
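A rough PySpark sketch of those steps (assuming Spark 2.4+ for map_from_entries, the same df_states columns as in the question, and a numeric PropertyFlagValue; here the dense vector is built directly inside the map, which folds the final assembling step in):

from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors

# 1. per-user map of {ObjectPath: max(PropertyFlagValue)}
user_maps = (df_states
             .groupBy('User', 'ObjectPath')
             .agg(F.max('PropertyFlagValue').alias('val'))
             .groupBy('User')
             .agg(F.map_from_entries(
                 F.collect_list(F.struct('ObjectPath', 'val'))).alias('vals')))

# 2. all distinct ObjectPath names fix the vector layout
names = [r[0] for r in df_states.select('ObjectPath').distinct().collect()]

# 3. per row, look each name up in the map (0 if missing) and build a dense vector,
#    so the wide ~9000-column dataframe is never materialised
features = user_maps.rdd.map(
    lambda row: (row['User'],
                 Vectors.dense([float(row['vals'].get(name, 0.0)) for name in names]))
).toDF(['User', 'features'])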
Advantages:
You don't have to create a wide dataframe with a huge number of columns, and hence you avoid that overhead. (The time went down from 11 minutes to 1.)
You still work on the cluster and execute your code within the Spark paradigm.
Example of code: Scala implementation.
I want to impute a large data matrix (90*90000), and later an even larger one (150000*800000), using pandas.
At the moment I am testing with the smaller one on my laptop (8 GB RAM, Haswell Core i5 2.2 GHz; the larger dataset will be run on a server).
The columns have some missing values that I want to impute with the most frequent one over all rows.
My working code for this is:
from scipy.stats import mode

freq_val = pd.Series(mode(df.iloc[:, 6:])[0][0], df.iloc[:, 6:].columns.values)  # most frequent value per column, starting from the first SNP column (the second row of `mode` gives the actual frequencies)
df_imputed = df.iloc[:, 6:].fillna(freq_val)  # impute unknown SNP values with the most frequent value of the respective column
The imputation takes about 20 minutes on my machine. Is there another implementation that would increase performance?
Try this:
df_imputed = df.iloc[:, 6:].fillna(df.iloc[:, 6:].apply(lambda x: x.mode()).iloc[0])
I tried different approaches. The key learning is that the mode function is really slow. Alternatively, I implemented the same functionality using np.unique (return_counts=True) and np.bincount. The latter is supposedly faster, but it doesn't work with NaN values.
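For illustration, a minimal sketch of how np.bincount could still be applied by masking the NaNs first (this assumes the SNP columns are coded as small non-negative integers, which is not guaranteed in general):

import numpy as np

col = df.iloc[:, 6].to_numpy(dtype=float)      # one SNP column as an example
valid = col[~np.isnan(col)].astype(np.int64)   # drop NaNs, cast to int for bincount
most_frequent = np.bincount(valid).argmax()    # index equals value, so argmax is the mode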
The optimized code now needs about 28 s to run. MaxU's answer needs ~48 s on my machine to finish.
The code:
n_snp_cols = df.iloc[:, 6:].shape[1]
freq_val = np.zeros(n_snp_cols)
for i in range(n_snp_cols):
    vals, counts = np.unique(df.iloc[:, i + 6].dropna(), return_counts=True)
    freq_val[i] = vals[counts.argmax()]  # the most frequent value itself, not the index of its count
freq_val_series = pd.Series(freq_val, df.iloc[:, 6:].columns.values)
df_imputed = df.iloc[:, 6:].fillna(freq_val_series)
Thanks for the input!