I'm trying to use the multiprocessing package to compute a function on a very large Pandas dataframe. However, I ran into the following error:
OverflowError: cannot serialize a bytes objects larger than 4GiB
After applying the solution to this question and using protocol 4 for pickling, I ran into the following error instead, which is also quoted by the solution itself:
error: 'i' format requires -2147483648 <= number <= 2147483647
The answer to that question then suggests using the dataframe as a global variable.
But ideally I would like the dataframe to still be an input of the function, without the multiprocessing library copying and pickling it multiple times in the background.
Is there some other way I can design the code so that it does not run into this issue?
I was able to replicate the problem with this example:
import multiprocessing as mp
import pandas as pd
import numpy as np
import time
import functools

# build a dataframe large enough to exceed the 4 GiB pickle limit (size is illustrative, ~5 GB)
df = pd.DataFrame(np.random.randn(10**8, 6))
print('Total memory usage for the dataframe: {} GB'.format(df.memory_usage().sum() / 1e9))

def slow_function(some_parameter, df):
    time.sleep(1)
    return some_parameter

parameters = list(range(100))
with mp.Pool(20) as pool:
    function = functools.partial(slow_function, df=df)
    results = pool.map(function, parameters)
Try Dask. It splits the dataframe into partitions and runs the computation in parallel, so the whole dataframe never has to be pickled at once.
import dask.dataframe as dd
df = dd.read_csv('data.csv')
Docs: https://docs.dask.org/en/latest/dataframe-api.html
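If the dataframe already lives in memory as a pandas object, a minimal sketch of the same idea (the partition count, the scheduler choice, and the per-partition function are my assumptions, not part of this answer) is to wrap it in a Dask dataframe and map the work over partitions, so each worker process only receives its own chunk rather than a pickled copy of the whole frame:

import dask.dataframe as dd
import pandas as pd
import numpy as np

# stand-in for the large frame from the question
df = pd.DataFrame(np.random.randn(10**6, 4), columns=['a', 'b', 'c', 'd'])

# split the pandas dataframe into 20 partitions
ddf = dd.from_pandas(df, npartitions=20)

def per_partition(part):
    # placeholder for the real per-chunk computation
    return part.sum()

# each worker gets one partition; the full frame is never pickled as a single object
results = ddf.map_partitions(per_partition).compute(scheduler='processes')
print(results)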
Tqdm documentation shows an example of tqdm working on pandas apply using progress_apply. I adapted the following code from here https://tqdm.github.io/docs/tqdm/ on a process that regularly takes several minutes to perform (func1 is a regex function).
from tqdm import tqdm
tqdm.pandas()
df.progress_apply(lambda x: func1(x.textbody), axis=1)
The resulting progress bar doesn't show any progress. It just jumps from 0 at the start of the loop to 100 when it is finished. I am currently running tqdm version 4.61.2.
Utilizing tqdm with pandas
Generally speaking, people tend to use lambdas when performing operations on a column or row. This can be done in a number of ways.
Please note that if you are working in a Jupyter notebook you should use tqdm_notebook instead of tqdm.
Also, I'm not sure what your code looks like, but if you're simply following the example given in the tqdm docs and only performing 100 iterations, computers are fast and will blow through that before your progress bar has time to update. It may be more instructive to use a larger dataset like the one I provide below.
Example 1:
from tqdm import tqdm # version 4.62.2
import pandas as pd # version 1.4.1
import numpy as np
tqdm.pandas(desc='My bar!') # lots of cool parameters you can pass here.
# the below line generates a very large dataset for us to work with.
df = pd.DataFrame(np.random.randn(100000000, 4), columns=['a','b','c','d'])
# the below line will square the contents of each element in a column-wise
# fashion
df.progress_apply(lambda x: x**2)
Output: (screenshot of the 'My bar!' progress bar)
Example 2:
# you could apply a function within the lambda expression for more complex
# operations. And keeping with the above example...
tqdm.pandas(desc='My bar!') # lots of cool parameters you can pass here.
# the below line generates a very large dataset for us to work with.
df = pd.DataFrame(np.random.randn(100000000, 4), columns=['a','b','c','d'])
def function(x):
return x**2
df.progress_apply(lambda x: function(x))
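As a follow-up to the Jupyter note above, a minimal sketch of the notebook variant (the tqdm.notebook import path is what current tqdm releases provide; treat it as an assumption if you are on an older version):

from tqdm.notebook import tqdm   # notebook-friendly progress bar
import pandas as pd
import numpy as np

tqdm.pandas(desc='My bar!')
df = pd.DataFrame(np.random.randn(1000000, 4), columns=['a', 'b', 'c', 'd'])
df.progress_apply(lambda x: x**2)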
I have 12 million records from an e-shop. I would like to compute association rules using the efficient_apriori package. The problem is that 12 million observations are too many, so the computation takes too much time. Is there a way to speed up the algorithm? I am thinking about some parallel processing, or compiling the Python code to C. I tried PyPy, but PyPy does not support the pandas package. Thank you for any help or ideas.
If you want to see my code:
import pandas as pd
from efficient_apriori import apriori
orders = pd.read_csv("orders.csv", sep=";")
customer = orders.groupby("id_customer")["name"].agg(tuple).tolist()
itemsets, rules = apriori(
customer, min_support=100/len(customer), min_confidence=0
)
You can use this approach to run this task in parallel:
from multiprocessing import Pool
from efficient_apriori import apriori

# thresholds taken from the question
MIN_SUPPORT = 100 / len(customer)
MIN_CONFIDENCE = 0

length_of_input_file = len(customer)
total_offset_count = 4  # number of parallel processes to run
offset = length_of_input_file // total_offset_count

# split the transactions into four roughly equal chunks
dataNew1 = customer[0:offset]
dataNew2 = customer[offset:2*offset]
dataNew3 = customer[2*offset:3*offset]
dataNew4 = customer[3*offset:]

def calculate_frequent_itemset(fractional_data):
    """Compute the frequent itemsets and rules for one chunk of the data."""
    itemsets, rules = apriori(fractional_data, min_support=MIN_SUPPORT,
                              min_confidence=MIN_CONFIDENCE)
    return itemsets, rules

p = Pool()
frequent_itemsets = p.map(calculate_frequent_itemset,
                          (dataNew1, dataNew2, dataNew3, dataNew4))
p.close()
p.join()

itemsets1, rules1 = frequent_itemsets[0]
itemsets2, rules2 = frequent_itemsets[1]
itemsets3, rules3 = frequent_itemsets[2]
itemsets4, rules4 = frequent_itemsets[3]
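One caveat with this split is that each chunk returns counts relative to its own slice of the data, so the four partial results still need to be combined. A minimal sketch of such a merge, assuming the itemsets returned by efficient_apriori are a dict mapping itemset length to {itemset: count} (as its README shows); note that an itemset which fell below the threshold in some chunk is missing from that chunk's output, so the merged counts are only a lower bound:

from collections import defaultdict

def merge_itemsets(per_chunk_itemsets):
    """Sum per-chunk counts for every itemset that was frequent in at least one chunk."""
    merged = defaultdict(lambda: defaultdict(int))
    for itemsets in per_chunk_itemsets:
        for length, counts in itemsets.items():
            for itemset, count in counts.items():
                merged[length][itemset] += count
    # convert back to plain dicts
    return {length: dict(counts) for length, counts in merged.items()}

combined = merge_itemsets([itemsets1, itemsets2, itemsets3, itemsets4])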
I am trying to optimize a function in Python using optimize.fmin from scipy. The function should optimize a vector of parameters, given initial conditions and arguments. However, I keep receiving the following error when I try to run the optimization, even though running the function itself works:
IndexError: too many indices for array, line 1, in parametrization
In brief, my code is like:
import numpy as np # import numpy library
import pandas as pd # import pandas library
from scipy import optimize # import optimize from scipy library
from KF_GATSM import KF_GATSM # import script with Kalman filter
yields=pd.read_excel('data.xlsx',index_col=None,header=None) # Import observed yields
Omega0=pd.read_excel('parameters.xlsx') # Import initial parameters
# Function to optimize
def GATSM(Omega, yields, N):
    # recover parameters
    Omega = np.matrix(Omega)
    muQ, muP = parametrization(N, Omega)  # run parametrization
    Y = muQ + muP  # or any other function
    return Y

# Parametrization of the function
def parametrization(N, Omega):
    muQ = np.matrix([[Omega[0,0], 0, 0]]).T  # intercept, risk-neutral world
    muP = np.matrix([[Omega[1,0], Omega[2,0], Omega[3,0]]]).T  # intercept, physical world
    return muQ, muP

# Run optimization
def MLE(yields, Omega0):
    # extract the number of yield maturities (columns)
    N = np.shape(yields)[1]
    # local optimization
    omega_opt = optimize.fmin(GATSM, np.array(Omega0)[:,0], args=(yields, N))
    return omega_opt
I solved the issue. It seems that, inside the objective function called by scipy.optimize.fmin, I cannot select an element of the parameter array like this (although it works on the original NumPy matrix):
Omega[0,0]
Omega[0]
The trick is to use:
Omega.item(0)
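For context, a minimal standalone sketch (names and numbers here are illustrative, not from the question): optimize.fmin flattens the initial guess and passes the objective a 1-D ndarray, so two-index access such as Omega[0,0] raises the IndexError, while .item(0) works regardless of shape.

import numpy as np
from scipy import optimize

def objective(omega):
    # omega arrives as a flat 1-D ndarray, even if the initial guess was 2-D
    a = omega.item(0)    # works for any shape
    # a = omega[0, 0]    # would raise "IndexError: too many indices for array"
    return (a - 3.0) ** 2

omega0 = np.matrix([[1.0]])   # 2-D initial guess, as in the question
result = optimize.fmin(objective, np.array(omega0)[:, 0], disp=False)
print(result)    # approximately [3.]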
I want to create a Python file that uses code stored in a database.
I have a table called CodeTable that has these data:
ID Code
-----------
1 import pymssql import pandas as pd import matplotlib.pyplot as plt import seaborn as sns import numpy as np df = pd.read_csv(r'C:\Projects\G.csv') plt.figure(figsize=(12, 9))
2 X = 1 + MasterKey
and in my code I have this:
MasterKey = 7
#Some code to call Record with ID = 2 from DB
# a function to execute Python dynamically <-------- I need this?!!
print(MasterKey) #<------------ Should return 8
Thanks
You can use the exec builtin function. For example, exec("print('Hello World!')")
Exec Documentation:
This function supports dynamic execution of Python code. object must be either a string or a code object. If it is a string, the string is parsed as a suite of Python statements which is then executed (unless a syntax error occurs).
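Applied to the question, a minimal sketch (the database fetch is mocked with a string here; in practice the Code value for ID = 2 would come from a pymssql query) passes an explicit namespace to exec so you can read back whatever the stored code assigns:

# mock of the Code column for the row with ID = 2
stored_code = "X = 1 + MasterKey"

namespace = {"MasterKey": 7}   # variables the stored code is allowed to see
exec(stored_code, namespace)   # run the snippet inside that namespace

print(namespace["X"])          # 8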
I think this is a pandas DataFrame, so we can use pd.eval:
pd.eval(df.loc[df.ID==2,'Code'].str.split('=').str[-1])[0]
8
I have a pandas dataframe and I want to plot slices of it in a function using multiprocessing. Even though the function process_expression works when I call it on its own, when I run it through multiprocessing it does not produce any plots.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import seaborn as sns
import sys
from multiprocessing import Pool
import os

os.system("taskset -p 0xff %d" % os.getpid())

# df_coding and gene_ids are loaded earlier: df_coding is the expression table,
# gene_ids is the list of Gene_id values to plot
def process_expression(gn_name, df_gn=df_coding):
    df_part = df_gn.loc[df_gn['Gene_id'] == gn_name]
    df_part = df_part.drop('Gene_id', axis=1)
    df_part = df_part.drop('Transcript_biotype', axis=1)
    COUNT100 = df_part[df_part > 100].count()
    COUNT10 = df_part[df_part > 10].count() - COUNT100
    COUNT1 = df_part[df_part > 1].count() - COUNT100 - COUNT10
    COUNT0 = df_part[df_part > 0].count() - COUNT100 - COUNT10 - COUNT1
    result = pd.concat([COUNT0, COUNT1, COUNT10, COUNT100], axis=1)
    result.columns = ['0 TO 1', '1 TO 10', '10 TO 100', '>100']
    result.plot(kind='bar', figsize=(50, 20), fontsize=7, stacked=True)
    plt.savefig('./expression_levels/all_genes/' + gn_name + '.png')  # ,bbox_inches='tight'
    plt.close()

pool = Pool()
gn = pool.map(process_expression, gene_ids)
pool.close()
pool.join()
The df_coding table looks something like this (it has more columns; I removed some):
Isoform_name,heart,heart.1,lung.3,Gene_id,Transcript_biotype
ENST00000296782,0.14546900000000001,0.161245,0.09479889999999999,ENSG00000164327,protein_coding
ENST00000357387,6.53902,5.86969,7.057689999999999,ENSG00000164327,protein_coding
ENST00000514735,0.0,0.0,0.0,ENSG00000164327,protein_coding
The input dataframe df_coding has a column Gene_id containing the gene names (the gn_name values). What I want is to take, for each gn_name, only the rows of the dataframe whose Gene_id matches it and plot a bar plot based on that slice.
For example, if I call process_expression('ENSG00000164327') for that specific gn_name, the output is a stacked bar chart of the counts per expression range (screenshot omitted).
What am I doing wrong? I know that the process stops at the plotting command when I run it with multiprocessing.
The problem is the interaction between multiprocessing and matplotlib. With multiprocessing you create a completely new context in each process. The new process does not (and cannot) successfully initialize the plotting context because it is already initialized in the parent process.
If you are trying to overcome a performance issue then you may be on the right track. However, plotting back to the correctly initialized context of the parent process will require you to go a lot deeper into the structure of the underlying matplotlib guts. Here is an example of setting a data pipe back to the original application. Really this is only going to help if you are dealing with a lot of processing of the data before it is plotted. It doesn't look like that is what you are doing here.
If you are trying to get a visual effect like stacked / overlayed results then you probably want to look into repeating the plot function or modifying the data structure to better represent what you want to visualize.
So. What problem are you trying to solve? A performance problem, or a visualization problem? If it is a visualization problem then you do NOT want to use multiprocessing.
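As a side note that goes beyond this answer: if the goal really is just to save PNG files from worker processes (no windows), a common workaround is to force the non-interactive Agg backend inside each worker before pyplot is imported, so no GUI context is needed. A minimal sketch with placeholder data:

import multiprocessing as mp

def plot_one(gn_name):
    # choose the backend before importing pyplot, inside the child process
    import matplotlib
    matplotlib.use('Agg')            # non-interactive backend, no GUI context required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(['0 TO 1', '1 TO 10', '10 TO 100', '>100'], [3, 2, 1, 0])  # placeholder counts
    fig.savefig(gn_name + '.png')
    plt.close(fig)
    return gn_name

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        done = pool.map(plot_one, ['ENSG00000164327'])
    print(done)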