increasing pandas dataframe imputation performance - python

I want to impute a large data matrix (90*90000) and later an even larger one (150000*800000) using pandas.
At the moment I am testing with the smaller one on my laptop (8 GB RAM, Haswell Core i5 at 2.2 GHz; the larger dataset will be run on a server).
The columns have some missing values that I want to impute with the most frequent one over all rows.
My working code for this is:
from scipy.stats import mode
import pandas as pd

# most frequent value per column, starting from the first SNP column
# (the second row of 'mode' gives the actual frequencies)
freq_val = pd.Series(mode(df.ix[:, 6:])[0][0], df.ix[:, 6:].columns.values)
# impute unknown SNP values with the most frequent value of the respective column
df_imputed = df.ix[:, 6:].fillna(freq_val)
The imputation takes about 20 minutes on my machine. Is there another implementation that would increase performance?

try this:
df_imputed = df.iloc[:, 6:].fillna(df.iloc[:, 6:].apply(lambda x: x.mode()).iloc[0])
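As a small variant (not from the answer above), pandas also has a DataFrame-level mode(), so the per-column apply can be dropped; a minimal sketch:

cols = df.iloc[:, 6:]
# DataFrame.mode() skips NaN and returns the per-column modes; row 0 holds the
# most frequent value of each column, which fillna then broadcasts column-wise
df_imputed = cols.fillna(cols.mode().iloc[0])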

I tried different approaches. The key learning is that the mode function is really slow. Alternatively, I implemented the same functionality using np.unique (return_counts=True) and np.bincount. The latter is supposedly faster, but it doesn't work with NaN values.
The optimized code now needs about 28 s to run. MaxU's answer needs ~48 s on my machine to finish.
The code:
import numpy as np
import pandas as pd

n_cols = np.shape(df.ix[:, 6:])[1]
freq_val = np.zeros(n_cols)
for i in range(n_cols):
    vals, counts = np.unique(df.ix[:, i + 6].dropna(), return_counts=True)
    freq_val[i] = vals[counts.argmax()]  # the most frequent value itself, not the index of its count
freq_val_series = pd.Series(freq_val, df.ix[:, 6:].columns.values)
df_imputed = df.ix[:, 6:].fillna(freq_val_series)
Thanks for the input!
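For completeness, a hedged sketch of the np.bincount variant mentioned above; it assumes the SNP values are coded as small non-negative integers, and it sidesteps the NaN problem by dropping missing values per column before counting:

import numpy as np
import pandas as pd

cols = df.iloc[:, 6:]  # the same SNP slice as df.ix[:, 6:] above
freq_val_series = pd.Series(
    {c: np.bincount(cols[c].dropna().astype(int)).argmax() for c in cols.columns}
)
df_imputed = cols.fillna(freq_val_series)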

Related

Applying Column-based Data Transformations on PySpark DataFrames in Parallel

I've searched across SO a bit and haven't been able to find a question that resembles mine; I hope this isn't a duplicate, but feel free to point me in the right direction if a similar question has already been asked!
I'm in the process of K-Fold mean-encoding a set of categorical vectors in a very large dataset (think, 30+ million rows).
I currently have my code set up such that:
the dataframe is split into random subsets using randomSplit()
for each split, I iterate through each column of type categorical and calculate the mean-encoding for that column and split
I keep track of the split's mean-encoding results in a dictionary
following completion of all splits, I average the results
My problem is that this is taking a good amount of time (performing the mean-encoding calculation on a single column across 5 splits takes a little over 6 minutes, and I have several hundred categorical columns), and I'm fairly certain I can speed it up by simply running the task in parallel (i.e. applying the same function to all splits simultaneously). However, I can't seem to figure out how to do that using PySpark's built-in functionality. I'm not interested in bringing in threading or pools, simply because I'm unsure how they actually interact with PySpark (if I'm totally wrong and that is the optimal way to go, please let me know).
If it helps, here is the function I've put together for calculating the mean-encoding for a specified column of a specified DF, followed by the loop that I'm talking about. Any way to increase the efficiency and speed of this would be hugely appreciated.
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

def determine_means(df, col, target):
    """
    :param df: pyspark.sql.DataFrame
        dataframe to apply target mean encoding to
    :param col: str
        column to apply target encoding to
    :param target: str
        target column
    :return:
        dict of {string: float}
    """
    means = df.groupby(F.col(col)).agg(F.mean(target).alias(f"{col}_mean_encoding"))
    means = means.withColumn(f"{col}_mean_encoding",
                             means[f"{col}_mean_encoding"].cast(FloatType()))
    means = means.toPandas()
    return dict(zip(list(means[col].values), list(means[f"{col}_mean_encoding"].values)))

meta_means_dict = dict()
splits = PYSPARK_DF.randomSplit([.2, .2, .2, .2, .2])
for sp in splits:
    for col in CATEGORICAL_COLUMNS:
        if col not in meta_means_dict.keys():
            meta_means_dict[col] = dict()
        for k, v in determine_means(sp, col, TARGET_COL).items():
            if k in meta_means_dict[col].keys():
                meta_means_dict[col][k].append(v)
            else:
                meta_means_dict[col][k] = [v]
Does anyone have any advice or tips?
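For illustration only (this is not an answer from the thread, just a hedged sketch of one possible direction): tagging every split with a fold id and unioning them back lets Spark aggregate all folds for a column in a single job, instead of looping over the splits in Python. Names such as PYSPARK_DF, CATEGORICAL_COLUMNS and TARGET_COL follow the question's code.

from functools import reduce
import pyspark.sql.functions as F

splits = PYSPARK_DF.randomSplit([.2] * 5)
# tag each split with a fold id and union everything back into one frame
tagged = reduce(
    lambda a, b: a.unionByName(b),
    [sp.withColumn('fold', F.lit(i)) for i, sp in enumerate(splits)],
)

meta_means_dict = {}
for col in CATEGORICAL_COLUMNS:
    # per-fold mean encodings for this column, computed in one Spark job
    per_fold = (tagged
                .groupBy('fold', F.col(col))
                .agg(F.mean(TARGET_COL).alias('enc')))
    # mirror the "average the results" step across folds
    averaged = per_fold.groupBy(col).agg(F.mean('enc').alias('enc')).toPandas()
    meta_means_dict[col] = dict(zip(averaged[col], averaged['enc']))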

High performance apply on group by pandas

I need to calculate a percentile on a column of a pandas dataframe. A subset of the dataframe is shown below:
I want to calculate the 20th percentile of SaleQTY, but for each group of ["Barcode", "ShopCode"]:
so I define a function as below:
def quant(group):
    group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
    return group
And I apply this function to each group of my sales data, which has almost 18 million rows and roughly 3 million groups of ["Barcode", "ShopCode"]:
quant_sale = sales.groupby(['Barcode','ShopCode']).apply(quant)
That took 2 hours to complete on a Windows server with 128 GB of RAM and 32 cores.
That makes no sense, because it is only a small part of my code, so I started searching the net for ways to improve the performance.
I came up with a "numba" solution with the code below, which didn't work:
import numpy as np
from numba import njit, jit

@jit(nopython=True)
def quant_numba(df):
    final_quant = []
    for bar_shop, group in df.groupby(['Barcode', 'ShopCode']):
        group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
        final_quant.append((bar_shop, group["Quantile"]))
    return final_quant

result = quant_numba(sales)
It seems that I cannot use pandas objects within this decorator.
I am not sure whether I can use multiprocessing (I'm unfamiliar with the whole concept) or whether there is any other way to speed up my code, so any help would be appreciated.
You can try DataFrameGroupBy.quantile:
df1 = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].quantile(0.2)
Or, as mentioned by @Jon Clements, for a new column filled with the group percentiles use GroupBy.transform:
df['Quantile'] = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].transform('quantile', q=0.2)
There is a built-in function in pandas called quantile().
quantile() will give you the nth percentile of a column in a df.
Doc reference link
geeksforgeeks example reference
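A minimal usage sketch with the question's column names (a quick illustration, not taken from the linked references):

# 20th percentile of SaleQTY over the whole frame, then per (Barcode, ShopCode)
overall_p20 = sales['SaleQTY'].quantile(0.2)
per_group_p20 = sales.groupby(['Barcode', 'ShopCode'])['SaleQTY'].quantile(0.2)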

Updating Dictionaries from Dataframe Very Slow in Python

I want to read each row of dataframe and add them into a dictionary.
The below code takes 18 seconds to run. The dataframe has about 150000 rows. vehicledid and engineconfigid are numerical values.
engineconfigid = {}
for index, row in data_engineconfig.iterrows():
    engineconfigid.update({row['vehicleid-h']: row['engineconfigid-h']})
However, the following code takes hours. The only difference is that there are more values to add and some of the values are strings. What accounts for the bulk of the difference between the two? The strings are not big. My program runs at 20% CPU (single core) and only uses 60 MB of RAM.
enginebase = {}  # initialization implied by the question, shown here for completeness
for index, row in data_enginebase.iterrows():
    enginebase.update({row['enginebaseid-f']: {'liter': row['liter-f'],
                                               'cc': row['cc-f'],
                                               'cid': row['cid-f'],
                                               'cylinders-f': row['cylinders-f']}})
You can try using set_index. Rather than iterating over the rows, this should give better results:
# answer 1
engineconfigid = data_engineconfig.set_index('vehicleid-h')['engineconfigid-h'].to_dict()
# answer 2
data_engineconfig.to_dict(orient='index')
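For the second, slower snippet, a hedged sketch (not from the original answer) of the same set_index idea; it keeps the original '-f' column names as the inner dict keys, which differs slightly from the hand-built dict above:

enginebase = (
    data_enginebase
    .set_index('enginebaseid-f')[['liter-f', 'cc-f', 'cid-f', 'cylinders-f']]
    .to_dict(orient='index')
)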

Latency in large file analysis with python pandas

I have a very large file (10 GB) with approximately 400 billion lines; it is a csv with 4 fields. Here is the description: the first field is an ID, the second is the current position of that ID, and the third field is a correlative number assigned to the row.
Similar to this:
41545496|4154|1
10546767|2791|2
15049399|491|3
38029772|491|4
15049399|1034|5
My intention is to create a fourth column (old position), in another file or the same one, that stores the position at which the ID last appeared. What I do is check whether the ID number has already appeared before; if so, I look up its last appearance and assign the position it had there to its old-position field. If the ID has not appeared before, I assign its current position in that same row to its old-position field.
Something like this:
41545496|4154|1|4154
10546767|2791|2|2791
15049399|491|3|491
38029772|491|4|491
15049399|1034|5|491
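As an aside (not part of the original question), the transformation being described can be expressed directly in pandas with a grouped shift; a minimal sketch, assuming the columns load as 0 (ID), 1 (position) and 2 (row number) when the file is read without a header:

import pandas as pd

df = pd.read_csv('file_in.csv', sep='|', header=None)
# previous position of the same ID (column 0), falling back to the current
# position (column 1) for an ID's first appearance
df[3] = df.groupby(0)[1].shift().fillna(df[1]).astype('int64')
df.to_csv('file_out.csv', sep='|', index=False, header=False)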
I have created a program that reads and analyses the file, but it only processes about 10 thousand lines per minute, so reading the entire file takes a very long time, approximately more than 5 days.
import pandas as pd

with open('file_in.csv', 'rb') as inf:
    df = pd.read_csv(inf, sep='|', header=None)

cont = 0
df[3] = 0

def test(x):
    global cont
    a = df.iloc[:cont, 0]
    try:
        index = a[a == df[0][cont]].index[-1]
        df[3][cont] = df[1][index]
    except IndexError:
        df[3][cont] = df[1][cont]
        pass
    cont += 1

df.apply(test, axis=1)
df.to_csv('file_out.csv', sep='|', index=False, header=False)
I have a computer with 64 processors and 64 GB of RAM at the university, but it still takes a long time. Is there any way to reduce that time? Thank you very much!
Processing the data efficiently
You have two main problems in your approach:
That amount of data should have never been written to a text file
Your approach needs (n^2/2) comparisons
A better idea is to index-sort your array before doing the actual work. Then you need only about 2n operations for the comparisons and n*log(n) operations for the sorting in the worst case.
I also used numba to compile that function which will speed up the computation time by a factor of 100 or more.
import numpy as np
import numba as nb

# This function isn't very hard to vectorize, but I expect better
# performance and easier understanding when doing it this way
@nb.njit()
def last_IDs(data, idx_1):
    # I assume that all values in the second column are positive
    res = np.zeros(data.shape[0], dtype=np.int64) - 1
    for i in range(1, data.shape[0]):
        if data[idx_1[i], 0] == data[idx_1[i - 1], 0]:
            res[idx_1[i]] = data[idx_1[i - 1], 1]
    same_ID = res == -1
    res[same_ID] = data[same_ID, 1]
    return res

# reading the data is the hardest thing to do efficiently
data = np.genfromtxt('Test.csv', delimiter='|', dtype=np.int64)

# it is important that we use a stable sort algorithm here
idx_1 = np.argsort(data[:, 0], kind='mergesort')
column_4 = last_IDs(data, idx_1)
For performant writing and reading data have a look at: https://stackoverflow.com/a/48997927/4045774
If you don't get at least 100 M/s IO-speed, please ask.
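As a small, hedged follow-up to point 1 (an assumption, not taken from the linked answer): once column_4 has been computed by the snippet above, the result can go into a binary container such as HDF5 instead of another text file, for example:

import numpy as np
import h5py

# 'data' and 'column_4' come from the snippet above
out = np.column_stack((data, column_4))   # append the old-position column
with h5py.File('file_out.h5', 'w') as f:
    f.create_dataset('rows', data=out, compression='gzip')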

Performance decrease for huge amount of columns. Pyspark

I ran into a problem processing a wide Spark dataframe (about 9000 columns, and sometimes more).
Task:
Create wide DF via groupBy and pivot.
Transform the columns into a vector and feed it to KMeans from pyspark.ml.
So I built the wide frame, tried to create the vector with VectorAssembler, cached it, and trained KMeans on it.
On my PC in standalone mode, for a frame of about 500x9000, the assembling took about 11 minutes and KMeans took 2 minutes for 7 different cluster counts. Doing the same in pandas (pivoting the df and iterating over the 7 cluster counts) takes less than a minute.
Obviously I understand the overhead and performance loss of standalone mode, caching and so on, but it really discourages me.
Could somebody explain how I can avoid this overhead?
How do people work with wide DFs, instead of using VectorAssembler and taking the performance hit?
A more formal question (to comply with SO rules) would be: how can I speed up this code?
%%time
tmp = (df_states.select('ObjectPath', 'User', 'PropertyFlagValue')
       .groupBy('User')
       .pivot('ObjectPath')
       .agg({'PropertyFlagValue': 'max'})
       .fillna(0))
ignore = ['User']
assembler = VectorAssembler(
    inputCols=[x for x in tmp.columns if x not in ignore],
    outputCol='features')
Wall time: 36.7 s
print(tmp.count(), len(tmp.columns))
552, 9378
%%time
transformed = assembler.transform(tmp).select('User', 'features').cache()
Wall time: 10min 45s
%%time
lst_levels = []
for num in range(3, 14):
    kmeans = KMeans(k=num, maxIter=50)
    model = kmeans.fit(transformed)
    lst_levels.append(model.computeCost(transformed))
rs = [i - j for i, j in list(zip(lst_levels, lst_levels[1:]))]
for i, j in zip(rs, rs[1:]):
    if i - j < j:
        print(rs.index(i))
        kmeans = KMeans(k=rs.index(i) + 3, maxIter=50)
        model = kmeans.fit(transformed)
        break
Wall time: 1min 32s
Config:
.config("spark.sql.pivotMaxValues", "100000") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.config("spark.sql.shuffle.partitions", "4") \
.config("spark.sql.inMemoryColumnarStorage.batchSize", "1000") \
VectorAssembler's transform function processes all the columns and stores metadata on each column in addition to the original data. This takes time, and also takes up RAM.
To put an exact figure on how much things have grown, you can dump your data frame as parquet files before and after the transformation and compare the sizes. In my experience, a feature vector built by VectorAssembler can be around 10x the size of one built by hand or with other feature-extraction methods, and that was for a logistic regression with only 10 parameters. Things will get a lot worse with a data set with as many columns as you have.
A few suggestions:
See if you can build your feature vector another way. I'm not sure how performant this would be in Python, but I've got a lot of mileage out of this approach in Scala. I've noticed something like a 5x-6x performance difference when comparing logistic regressions (10 params) on manually built vectors, or vectors built using other extraction methods (TF-IDF), versus VectorAssembled ones.
See if you can reshape your data to reduce the number of columns that need to be processed by VectorAssembler.
See if increasing the RAM available to Spark helps.
Actually, the solution was found in a map over the RDD.
First of all, we create a map of values for each row.
We also extract all distinct names.
In the penultimate step, we look up each key of the row's map in the dict of names and return the value, or 0 if nothing was found.
Then we run VectorAssembler on the results.
Advantages:
You don't have to create a wide dataframe with a huge number of columns, and so you avoid that overhead. (The run time went from 11 minutes down to 1.)
You still work on the cluster and execute your code in the Spark paradigm.
Example of code: Scala implementation.
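Since only a Scala implementation is linked, here is a hedged Python sketch of the described approach (an approximation using the question's column names, not the author's actual code); for brevity it builds the feature vector directly instead of running VectorAssembler on the results:

from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

def merge_maps(a, b):
    # keep the max flag value per ObjectPath, mirroring agg({'PropertyFlagValue': 'max'})
    out = dict(a)
    for k, v in b.items():
        out[k] = max(out.get(k, v), v)
    return out

# per-user map of ObjectPath -> max PropertyFlagValue
user_maps = (df_states.rdd
             .map(lambda r: (r['User'], {r['ObjectPath']: r['PropertyFlagValue']}))
             .reduceByKey(merge_maps))

# all distinct names, fixing the feature order
names = sorted(df_states.select('ObjectPath').distinct()
               .rdd.map(lambda r: r[0]).collect())

# look each name up in the row's map, defaulting to 0, and build the vector
transformed = (user_maps
               .map(lambda kv: Row(User=kv[0],
                                   features=Vectors.dense([kv[1].get(n, 0) for n in names])))
               .toDF()
               .cache())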
