High performance apply on group by pandas - python

I need to calculate a percentile on a column of a pandas DataFrame. I want the 20th percentile of SaleQTY for each group of ["Barcode", "ShopCode"], so I define a function as below:
import numpy as np

def quant(group):
    group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
    return group
And I apply this function to each group of my sales data, which has almost 18 million rows and roughly 3 million groups of ["Barcode", "ShopCode"]:
quant_sale = sales.groupby(['Barcode','ShopCode']).apply(quant)
That took 2 hours to complete on a Windows server with 128 GB of RAM and 32 cores.
That makes no sense, because this is only one small part of my code, so I started searching the net for ways to improve the performance.
I came up with a "numba" solution, with the code below, which didn't work:
from numba import njit, jit

@jit(nopython=True)
def quant_numba(df):
    final_quant = []
    for bar_shop, group in df.groupby(['Barcode', 'ShopCode']):
        group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
        final_quant.append((bar_shop, group["Quantile"]))
    return final_quant

result = quant_numba(sales)
It seems that I cannot use pandas objects inside a function decorated this way.
I am not sure whether I could use multiprocessing (a concept I am unfamiliar with), or whether there is some other way to speed up my code, so any help would be appreciated.

You can try DataFrameGroupBy.quantile:
df1 = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].quantile(0.2)
Or, as mentioned by @Jon Clements, for a new column filled with the per-group percentiles use GroupBy.transform:
df['Quantile'] = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].transform('quantile', q=0.2)

There is a built-in function in pandas called quantile(); it will give you the nth percentile of a column in a DataFrame. See the pandas documentation and the GeeksforGeeks example for reference.
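For illustration, on a small made-up dataframe (column names taken from the question, values invented), the per-group 20th percentile can be computed like this:

import pandas as pd

# toy data with the question's column names; the values are invented
toy = pd.DataFrame({
    "Barcode": [1, 1, 1, 2, 2],
    "ShopCode": ["A", "A", "B", "B", "B"],
    "SaleQTY": [5, 9, 3, 7, 4],
})

# one value per (Barcode, ShopCode) group
print(toy.groupby(["Barcode", "ShopCode"])["SaleQTY"].quantile(0.2))

# or broadcast back onto every row, as in the transform approach above
toy["Quantile"] = toy.groupby(["Barcode", "ShopCode"])["SaleQTY"].transform("quantile", q=0.2)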


Quickest way to access & compare huge data in Python

I am a newbie to pandas, and somewhat of a newbie to Python.
I am looking at stock data, which I read in as CSV; typical size is 500,000 rows.
The data has "Open", "High", and "Low" price columns plus a datetime column.
I need to check the data against itself - the basic algorithm is a loop similar to
Row = 0
x = get "low" price in row ROW
y = CalculateSomething(x)
go through the rest of the data, compare against y
if (a):
    append "A" at the end of row ROW  # in the dataframe
else:
    print "B" at the end of row ROW
Row = Row + 1
On the next iteration, the data pointer should reset to row 1 and go through the same process; each pass adds a note to the dataframe at the ROW index.
I looked at pandas and figured the way to try this would be to use two loops, copying the dataframe so as to maintain two separate instances.
The actual code looks like this (simplified)
import pandas as pd

df = pd.read_csv('data.csv')

calc1 = 1  # this part is confidential so set to something simple
calc2 = 2  # this part is confidential so set to something simple

def func3_df_index(df):
    dfouter = df.copy()
    for outerindex in dfouter.index:
        dfouter_openval = dfouter.at[outerindex, "Open"]
        for index in df.index:
            if df.at[index, "Low"] <= calc1 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message 1"
                break
            elif df.at[index, "High"] >= calc2 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message2"
                break
            else:
                dfouter.at[outerindex, 'notes'] = "message3"
This method takes a long time (7+ minutes per 5,000 rows), which will be far too long for 500,000 rows, and some data may exceed 1 million rows.
I have tried the two-loop method with the following variants:
using iloc, e.g. df.iloc[index, 2]
using at, e.g. df.at[index, "low"]
using numpy with at, e.g. df.at[index, "low"] = np.where((df.at[index, "low"] < ..."
The data consists of floating-point values and datetime strings.
Is it better to use numpy? Is there perhaps an alternative to using two loops?
Any other methods, such as R, Mongo, or some other database, i.e. different from Python, would also be useful; I just need the results and am not necessarily tied to Python.
Any help and constructs would be greatly appreciated. Thanks in advance.
You are copying the dataframe and manually looping over the indices. This will almost always be slower than vectorized operations.
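As a rough illustration of what a vectorized version of this particular loop could look like, here is a sketch. It assumes the simplified code above, i.e. calc1 and calc2 are constants and the dataframe has a default RangeIndex, so "the rest of the data" means positions at or after the current row:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')
calc1, calc2 = 1, 2  # placeholders for the confidential calculations

n = len(df)
idx = np.arange(n)

def first_true_at_or_after(mask):
    # for each position i, the smallest j >= i where mask[j] is True (or n if none),
    # computed as a suffix minimum over the hit positions
    pos = np.where(mask, idx, n)
    return np.minimum.accumulate(pos[::-1])[::-1]

first_low = first_true_at_or_after(df["Low"].to_numpy() <= calc1)
first_high = first_true_at_or_after(df["High"].to_numpy() >= calc2)

# same precedence as the loop: the "Low" branch wins ties, "message3" if neither ever fires
df["notes"] = np.where(
    (first_low <= first_high) & (first_low < n), "message 1",
    np.where(first_high < first_low, "message2", "message3"),
)

The nested loop is O(n^2) in the worst case; this version is O(n).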
If you only care about one row at a time, you can simply use the csv module.
numpy is not "better"; pandas uses numpy internally.
Alternatively, load the data into a database, for example SQLite, MySQL/MariaDB, PostgreSQL, or perhaps DuckDB, and then run query commands against that. This has the added advantage of type conversion from strings to floats, so numerical analysis is easier.
If you really want to process the file in parallel directly from Python, you could move to Dask or PySpark, although pandas should work with some tuning; as a start, pandas' read_sql function against such a database would already work better.
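A minimal sketch of the database route, using the standard-library sqlite3 module together with pandas (the file, table, and column names are just placeholders):

import sqlite3
import pandas as pd

con = sqlite3.connect("prices.db")

# load the CSV once; to_sql creates the table and stores typed columns
df = pd.read_csv("data.csv")
df.to_sql("prices", con, if_exists="replace", index=False)

# then query subsets instead of scanning the whole frame row by row in Python
lows = pd.read_sql("SELECT * FROM prices WHERE Low <= 1", con)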
You could split the main dataset into smaller datasets, e.g. 50 sub-datasets with 10,000 rows each, to increase speed. Run your function on each sub-dataset using threading or another concurrency approach, then combine the final results.
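A minimal sketch of that split-and-combine idea with a process pool; note that this only helps when each chunk can be processed independently, which is not quite true of the forward-scanning logic above unless the chunks overlap, and process_chunk here is only a placeholder:

import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    # placeholder for the real per-chunk computation
    chunk = chunk.copy()
    chunk["notes"] = "message3"
    return chunk

def parallel_apply(df, n_chunks=50):
    chunks = np.array_split(df, n_chunks)   # split by rows into n_chunks pieces
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_chunk, chunks))
    return pd.concat(results)

if __name__ == "__main__":
    result = parallel_apply(pd.read_csv("data.csv"))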

How to find the mean of subseries in DataFrames?

My personal side project right now is to analyze GDP growth rates per capita. More specifically, I want to find the average growth rate for each decade since 1960 and then analyze it.
I pulled data from the World Bank API ("wbgapi") as a DataFrame:
import pandas as pd
import wbgapi as wb
gdp=wb.data.DataFrame('NY.GDP.PCAP.KD.ZG')
gdp.head()
Output: the head of the gdp DataFrame, with one row per country and one column per year.
I then used nested for loops to calculate the mean for every decade and added it to a new dataframe.
row, col = gdp.shape
meandata = pd.DataFrame(columns=['Country', 'Decade', 'MeanGDP', 'Region'])
for r in range(0, row, 1):
    countrydata = gdp.iloc[r]
    for c in range(0, col - 9, 10):
        decade = 1960 + c
        tenyeargdp = countrydata.array[c:c + 10].mean()
        meandata = meandata.append({'Country': gdp.iloc[r].name, 'Decade': decade,
                                    'MeanGDP': tenyeargdp}, ignore_index=True)
meandata.head(10)
The code works and generates the expected output.
However, I have a few questions about this step:
Is there a more efficient way to access these subseries of a dataframe? I have read that for loops should rarely be used with dataframes and that one should vectorize operations instead.
Is the complexity O(n^2) since there are 2 for loops?
The second step is to group the individual countries by region for future analysis. For this I rely on the World Bank API, which defines regions, each with a list of member economies/countries.
I iterate through the regions and the member list of each region; if a country is part of a region's member list, I add that region code to the country's rows.
Since an economy/country can be part of multiple regions (e.g. 'USA' can be part of both NA and HIC (high-income country)), I concatenate each region code onto the previously added ones.
for rg in wb.region.list():
    for co in wb.region.members(rg['code']):
        str1 = '-' + meandata.loc[meandata['Country'] == co, ['Region']].astype(str)
        meandata.loc[meandata['Country'] == co, ['Region']] = rg['code'] + str1
The code mostly works; however, it sometimes gives the error message that 'meandata' is not defined. I use JupyterLab.
Additionally, is there a simpler/more efficient way of doing the second step?
Thanks for reading and helping. Also, this is my first python/pandas coding experience, and as such general feedback is appreciated.
Consider using groupby:
The aggregation is based on the columns you pass as a list to groupby. In the sample below I group by 'Country' and 'Region' and take the mean of 'MeanGDP':
meandata = meandata.groupby(['Country', 'Region']).agg({'MeanGDP': 'mean'}).reset_index()
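For the first step (the decade means), a loop-free sketch along the same groupby lines might look like the following. It assumes wbgapi returns one row per economy with year columns labelled like 'YR1960', and, unlike the original loop, it also keeps the last, partial decade:

# derive the decade from each year column label, e.g. 'YR1967' -> 1960
years = gdp.columns.str.extract(r'(\d{4})', expand=False).astype(int)
decade = (years // 10) * 10

# average the year columns within each decade, then reshape to long format
decade_means = gdp.T.groupby(decade.values).mean().T
meandata = (decade_means
            .rename_axis(columns='Decade')
            .stack()
            .rename('MeanGDP')
            .reset_index())

The first column will carry the name of gdp's index (e.g. 'economy') rather than 'Country', so a rename may still be needed before the region step.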

Pandas: how does corrwith() work in this function?

The function finds the correlation of any store with every other store.
Input: the store number which is to be compared.
Output: a dataframe with correlation coefficient values.
def calcCorr(store):
    a = []
    metrix = pre_df[['TOT_SALES', 'TXN_PER_CUST']]  # add metrics as required
    for i in metrix.index:
        a.append(metrix.loc[store].corrwith(metrix.loc[i[0]]))
    df = pd.DataFrame(a)
    df.index = metrix.index
    df = df.drop_duplicates()
    df.index = [s[0] for s in df.index]
    df.index.name = "STORE_NBR"
    return df
I don't understand this part: corrwith(metrix.loc[i[0]]). Why is there a [0]? Thanks for your help!
The dataframe pre_df has a MultiIndex whose first level is STORE_NBR, with metric columns such as TOT_SALES and TXN_PER_CUST.
As commented, this should not be the way to go, as it produces a lot of duplicates: it loops through every row of the MultiIndex but only keeps the first level, STORE_NBR, which is what the i[0] does (it pulls the store number out of each index tuple). The function can be written as:
def calcCorr1(store, df):
    return pd.DataFrame({k: df.loc[store].corrwith(df.loc[k])
                         for k in df.index.unique('STORE_NBR')}).T
Notice that instead of looping through all the rows, we only loop through the unique values in the first level (STORE_NBR). Since each store contains many rows, this is an order of magnitude less runtime.
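For example, it could be called like this (the store number 77 is just a placeholder; the metric columns are the same two used in the question):

metrix = pre_df[['TOT_SALES', 'TXN_PER_CUST']]
corr_df = calcCorr1(77, metrix)  # correlations of store 77 with every store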

increasing pandas dataframe imputation performance

I want to impute a large data matrix (90 × 90,000), and later an even larger one (150,000 × 800,000), using pandas.
At the moment I am testing with the smaller one on my laptop (8 GB RAM, Haswell Core i5 2.2 GHz); the larger dataset will be run on a server.
The columns have some missing values that I want to impute with the most frequent one over all rows.
My working code for this is:
from scipy.stats import mode

# most frequent value per column, starting from the first SNP column
# (the second row of 'mode' gives the actual frequencies)
freq_val = pd.Series(mode(df.ix[:, 6:])[0][0], df.ix[:, 6:].columns.values)
# impute unknown SNP values with the most frequent value of the respective column
df_imputed = df.ix[:, 6:].fillna(freq_val)
The imputation takes about 20 minutes on my machine. Is there another implementation that would increase performance?
try this:
df_imputed = df.iloc[:, 6:].fillna(df.iloc[:, 6:].apply(lambda x: x.mode()).iloc[0])
I tried different approaches. The key learning is that the mode function is really slow. Alternatively, I implemented the same functionality using np.unique (return_counts=True) and np.bincount. The latter is supposedly faster, but it doesn't work with NaN values.
The optimized code now needs about 28 s to run. MaxU's answer needs ~48 s on my machine to finish.
The code:
iter = range(np.shape(df.ix[:, 6:])[1])
freq_val = np.zeros(np.shape(df.ix[:, 6:])[1])
for i in iter:
    vals, count = np.unique(df.ix[:, i + 6], return_counts=True)
    freq_val[i] = vals[count.argmax()]  # the most frequent value in this column
freq_val_series = pd.Series(freq_val, df.ix[:, 6:].columns.values)
df_imputed = df.ix[:, 6:].fillna(freq_val_series)
Thanks for the input!
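As a side note on the np.bincount limitation mentioned above: a possible workaround, assuming the SNP columns hold small non-negative integer codes such as 0/1/2, is to drop the NaNs before counting:

col = df.ix[:, 6].values.astype(float)       # one SNP column as an example
valid = col[~np.isnan(col)].astype(int)      # bincount needs non-negative integers
most_frequent = np.bincount(valid).argmax()  # the index with the largest count is the most frequent value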

Creating dataframe by merging a number of unknown length dataframes

I am trying to do some analysis on baseball pitch F/x data. All the pitch data is stored in a pandas dataframe with columns like 'Pitch speed' and 'X location.' I have a wrapper function (using pandas.query) that, for a given pitch, will find other pitches with similar speed and location. This function returns a pandas dataframe of unknown size. I would like to use this function over large numbers of pitches; for example, to find all pitches similar to those thrown in a single game. I have a function that does this correctly, but it is quite slow (probably because it is constantly resizing resampled_pitches):
def get_pitches_from_templates(template_pitches, all_pitches):
    resampled_pitches = pd.DataFrame(columns=all_pitches.columns.values.tolist())
    for i, row in template_pitches.iterrows():
        resampled_pitches = resampled_pitches.append(get_pitches_from_template(row, all_pitches))
    return resampled_pitches
I have tried to rewrite the function using pandas.apply on each row, or by creating a list of dataframes and then merging, but can't quite get the syntax right.
What would be the fastest way to do this type of sampling and merging?
It sounds like you should use pd.concat for this: collect the per-template results in a list and concatenate once at the end.
def get_pitches_from_templates(template_pitches, all_pitches):
    res = []
    for i, row in template_pitches.iterrows():
        res.append(get_pitches_from_template(row, all_pitches))
    return pd.concat(res)
I think that a merge might be even faster. Using df.iterrows() isn't recommended, as it creates a Series for every row.
