How do I get tqdm working on pandas apply?

The tqdm documentation shows an example of tqdm working on pandas apply using progress_apply. I adapted the following code from https://tqdm.github.io/docs/tqdm/ for a process that regularly takes several minutes to run (func1 is a regex function).
from tqdm import tqdm
tqdm.pandas()
df.progress_apply(lambda x: func1(x.textbody), axis=1)
The resulting progress bar doesn't show any progress: it just jumps from 0 at the start of the loop to 100 when it is finished. I am currently running tqdm version 4.61.2.

Utilizing tqdm with pandas
Generally speaking, people tend to use lambdas when performing operations on a column or row. This can be done in a number of ways.
Please note that if you are working in a Jupyter notebook you should use tqdm_notebook instead of tqdm.
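For notebook use, a minimal sketch (assuming a reasonably recent tqdm, where the widget-based bar lives in the tqdm.notebook module, the successor to the older tqdm_notebook import):
from tqdm.notebook import tqdm  # widget-based progress bar for Jupyter
import pandas as pd
import numpy as np
tqdm.pandas(desc='My bar!')
df = pd.DataFrame(np.random.randn(1_000_000, 4), columns=['a', 'b', 'c', 'd'])
df.progress_apply(lambda x: x**2)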
Also, I'm not sure what your code looks like, but if you're simply following the example given in the tqdm docs and you're only performing 100 iterations, computers are fast and will blow through that before your progress bar has time to update. Perhaps it would be more instructive to use a larger dataset like the one I provide below.
Example 1:
from tqdm import tqdm # version 4.62.2
import pandas as pd # version 1.4.1
import numpy as np
tqdm.pandas(desc='My bar!') # lots of cool parameters you can pass here.
# the below line generates a very large dataset for us to work with.
df = pd.DataFrame(np.random.randn(100000000, 4), columns=['a','b','c','d'])
# the below line will square the contents of each element in a column-wise
# fashion
df.progress_apply(lambda x: x**2)
Output: a tqdm progress bar labeled 'My bar!' (screenshot omitted)
Example 2:
# you could apply a function within the lambda expression for more complex
# operations. And keeping with the above example...
tqdm.pandas(desc='My bar!') # lots of cool parameters you can pass here.
# the below line generates a very large dataset for us to work with.
df = pd.DataFrame(np.random.randn(100000000, 4), columns=['a','b','c','d'])
def function(x):
    return x**2
df.progress_apply(lambda x: function(x))
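Closer to the question's row-wise usage, here is a sketch assuming a text column named textbody and a stand-in regex helper func1 (the column and function names come from the question; the regex itself is made up just so the example runs):
import re
import pandas as pd
from tqdm import tqdm
tqdm.pandas(desc='Row-wise regex')
def func1(text):
    # hypothetical stand-in for the question's regex function
    return len(re.findall(r'\w+', text))
df = pd.DataFrame({'textbody': ['lorem ipsum dolor sit amet'] * 500_000})
df['n_tokens'] = df.progress_apply(lambda x: func1(x.textbody), axis=1)
With enough rows, the bar advances visibly instead of jumping straight from 0 to 100.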

Related

Should I use numpy's Random Generator?

I have a large Python code that I've been maintaining/updating/expanding since ~2014. Recently I came across numpy's Random Number Generator Policy (2018-05) and now I'm a bit confused.
I'm not sure what changed, or whether I should update my code accordingly to use the new Random Generator. For example, the Random sampling docs say:
# Do this
from numpy.random import default_rng
rng = default_rng()
vals = rng.standard_normal(10)
more_vals = rng.standard_normal(10)
# instead of this
from numpy import random
vals = random.standard_normal(10)
more_vals = random.standard_normal(10)
All my code depends on the (old?) syntax shown in the second block (i.e., I don't use default_rng but simple calls to np.random.seed(), np.random.uniform(), np.random.normal(), etc), and I don't know why I should use the first block instead of the second block.
Could someone shed some light over this please?
1. default_rng is part of the new Generator API introduced in NumPy 1.17, which only supports Python 3, so it is not available to old Python 2 code.
2. On Python 3 with a recent NumPy, both the first and the second block you mentioned run without error.
3. The legacy np.random.* functions (np.random.seed, np.random.standard_normal, etc.) are kept for backward compatibility but are frozen and no longer improved; NumPy recommends default_rng for new code, so switching now protects you if the legacy interface is ever deprecated.
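As a rough migration sketch (the seed value 42 is just an example; note that the two APIs use different bit generators, so the same seed gives different numbers):
import numpy as np
# legacy interface: global, module-level state
np.random.seed(42)
old_vals = np.random.normal(size=5)
old_uni = np.random.uniform(size=5)
# new Generator API: the state lives in an explicit rng object
rng = np.random.default_rng(42)
new_vals = rng.normal(size=5)
new_uni = rng.uniform(size=5)
print(old_vals)
print(new_vals)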

Importing Numpy increases the execution time of the first iteration in timeit.repeat

Using timeit.repeat to measure the execution time of some expressions, I realized that the first iteration takes significantly longer if I imported numpy beforehand. Consider the following example:
from __future__ import print_function
import timeit
import numpy as np # comment this line
results = timeit.repeat(
    "d['a']",
    setup="d = dict(zip('abc', '123'))",
    repeat=5, number=10**6
)
print(['{:.2e}'.format(x) for x in results])
I obtain the following results:
['5.38e-02', '2.72e-02', '2.70e-02', '2.68e-02', '2.70e-02']
The first iteration took significantly longer than the remaining ones (I verified this pattern by running the code multiple times).
Now, when I comment out the import numpy as np line in the above code, the timing results change as follows:
['2.73e-02', '2.71e-02', '2.65e-02', '2.68e-02', '2.66e-02']
Here the execution time of the first iteration is comparable to the others.
This behavior doesn't occur on Python 3.8, where I obtain similar timings whether or not NumPy has been imported:
['2.64e-02', '2.89e-02', '2.65e-02', '2.63e-02', '2.63e-02'] # with 'import numpy'
['2.63e-02', '2.56e-02', '2.52e-02', '2.50e-02', '2.51e-02'] # without 'import numpy'
What causes this increase in execution time in conjunction with import numpy for Python 2.7?
Detailed version information:
Python 2.7.12: numpy==1.14.0
Python 3.8.1: numpy==1.18.1
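One way to check whether this is a one-off cost attached to the very first timed run (rather than something that slows every call) is to add a throwaway warm-up measurement before collecting the numbers; a sketch, reusing the expression from above:
import timeit
import numpy as np  # comment this line to compare
setup = "d = dict(zip('abc', '123'))"
stmt = "d['a']"
# throwaway warm-up run absorbs any first-call overhead
timeit.repeat(stmt, setup=setup, repeat=1, number=10**6)
results = timeit.repeat(stmt, setup=setup, repeat=5, number=10**6)
print(['{:.2e}'.format(x) for x in results])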

Pandas multiprocessing on very large dataframe

I'm trying to use the multiprocessing package to compute a function on a very large Pandas dataframe. However I ran into a problem with the following error:
OverflowError: cannot serialize a bytes objects larger than 4GiB
After applying the solution to this question and using protocol 4 for pickling, I ran into the following error instead, which is also quoted by the solution itself:
error: 'i' format requires -2147483648 <= number <= 2147483647
The answer to this question then suggests using the dataframe as a global variable.
But ideally I would like the dataframe to remain an input of the function, without the multiprocessing library copying and pickling it multiple times in the background.
Is there some other way I can design the code to not run into the issue?
I was able to replicate the problem with this example:
import multiprocessing as mp
import pandas as pd
import numpy as np
import time
import functools
# df is the very large dataframe, loaded earlier in the real code
print('Total memory usage for the dataframe: {} GB'.format(df.memory_usage().sum() / 1e9))
def slow_function(some_parameter, df):
    time.sleep(1)
    return some_parameter
parameters = list(range(100))
with mp.Pool(20) as pool:
    function = functools.partial(slow_function, df=df)
    results = pool.map(function, parameters)
Try Dask. A Dask dataframe is split into partitions and your function runs on those partitions in parallel, so the entire pandas frame never has to be pickled and shipped to each worker in one piece.
import dask.dataframe as dd
df = dd.read_csv('data.csv')
Docs: https://docs.dask.org/en/latest/dataframe-api.html
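A minimal sketch of that idea applied to a frame that already exists in memory (the partition count and the per-partition work are placeholders):
import dask.dataframe as dd
import pandas as pd
import numpy as np
# stand-in for the real, very large dataframe
df = pd.DataFrame(np.random.randn(1_000_000, 4), columns=['a', 'b', 'c', 'd'])
# split the pandas frame into partitions that Dask can process in parallel
ddf = dd.from_pandas(df, npartitions=20)
def per_partition(part):
    # placeholder for the real per-chunk work
    return part.sum()
result = ddf.map_partitions(per_partition).compute()
print(result)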

optimize.fmin error: IndexError: too many indices for array

I am trying to optimize a function in Python, using optimize.fmin from SciPy. The function should optimize a vector of parameters, given initial conditions and arguments. However, I keep receiving the following error when I try to run the optimization, even though the function itself runs fine on its own:
IndexError: too many indices for array, line 1, in parametrization
In brief, my code is like:
import numpy as np # import numpy library
import pandas as pd # import pandas library
from scipy import optimize # import optimize from scipy library
from KF_GATSM import KF_GATSM # import script with Kalman filter
yields=pd.read_excel('data.xlsx',index_col=None,header=None) # Import observed yields
Omega0=pd.read_excel('parameters.xlsx') # Import initial parameters
# Function to optimize
def GATSM(Omega, yields, N):
    # recover parameters
    Omega = np.matrix(Omega)
    muQ, muP = parametrization(N, Omega)  # run parametrization
    Y = muQ + muP  # or any other function
    return Y
# Parametrization of the function
def parametrization(N, Omega):
    muQ = np.matrix([[Omega[0,0], 0, 0]]).T  # intercept, risk-neutral world
    muP = np.matrix([[Omega[1,0], Omega[2,0], Omega[3,0]]]).T  # intercept, physical world
    return muQ, muP
# Run optimization
def MLE(data, Omega0):
    # extract the number of yield maturities
    N = np.shape(yields)[1]
    # local optimization
    omega_opt = optimize.fmin(GATSM, np.array(Omega0)[:,0], args=(yields, N))
    return omega_opt
I solved the issue. It seems the parameter vector that optimize.fmin passes to the objective is a flat 1-D array, so element access that works on my original 2-D parameter array, such as:
Omega[0,0]
Omega[0]
does not do what I expect inside the optimization and triggers the IndexError. The trick is to pick elements by flat position instead:
Omega.item(0)
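A stripped-down sketch of the same pattern (a toy quadratic objective, nothing to do with the actual GATSM model), showing that .item() works on the flat parameter array fmin passes in:
import numpy as np
from scipy import optimize
def objective(omega):
    # omega arrives as a flat 1-D ndarray; omega[0, 0] here would raise
    # "IndexError: too many indices for array"
    a = omega.item(0)
    b = omega.item(1)
    return (a - 1.0) ** 2 + (b + 2.0) ** 2
omega0 = np.array([0.0, 0.0])
omega_opt = optimize.fmin(objective, omega0)
print(omega_opt)  # approximately [1, -2]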

Plotting pandas dataframe and multiprocessing in Python

I have a pandas dataframe and I want to plot slices of it in a function using multiprocessing. Even though the function process_expression works when I call it on its own, it does not produce any plots when I run it through multiprocessing.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import seaborn as sns
import sys
from multiprocessing import Pool
import os
os.system("taskset -p 0xff %d" % os.getpid())  # reset the CPU affinity mask so workers can use all cores
# df_coding and gene_ids are loaded earlier in the real script
def process_expression(gn_name, df_gn=df_coding):
    df_part = df_gn.loc[df_gn['Gene_id'] == gn_name]
    df_part = df_part.drop('Gene_id', axis=1)
    df_part = df_part.drop('Transcript_biotype', axis=1)
    COUNT100 = df_part[df_part > 100].count()
    COUNT10 = df_part[df_part > 10].count() - COUNT100
    COUNT1 = df_part[df_part > 1].count() - COUNT100 - COUNT10
    COUNT0 = df_part[df_part > 0].count() - COUNT100 - COUNT10 - COUNT1
    result = pd.concat([COUNT0, COUNT1, COUNT10, COUNT100], axis=1)
    result.columns = ['0 TO 1', '1 TO 10', '10 TO 100', '>100']
    result.plot(kind='bar', figsize=(50, 20), fontsize=7, stacked=True)
    plt.savefig('./expression_levels/all_genes/' + gn_name + '.png')  # ,bbox_inches='tight'
    plt.close()
pool = Pool()
gn = pool.map(process_expression, gene_ids)
pool.close()
pool.join()
The df_coding table looks something like this (it has more columns; I erased some):
Isoform_name,heart,heart.1,lung.3,Gene_id,Transcript_biotype
ENST00000296782,0.14546900000000001,0.161245,0.09479889999999999,ENSG00000164327,protein_coding
ENST00000357387,6.53902,5.86969,7.057689999999999,ENSG00000164327,protein_coding
ENST00000514735,0.0,0.0,0.0,ENSG00000164327,protein_coding
The input dataframe df_coding has a Gene_id column containing the gn_name values. Each time, I want to take only the rows whose Gene_id matches gn_name[i] and draw a bar plot from that slice.
For example, calling process_expression('ENSG00000164327') directly for that specific gn_name produces the expected stacked bar plot (screenshot omitted).
What am I doing wrong? I know that the process stops at the plotting command when I run it with multiprocessing.
The problem is the interaction between multiprocessing and matplotlib. With multiprocessing you create a completely new context in each worker process, and that fresh context does not (and cannot) successfully initialize the plotting machinery, because it is already initialized in the parent process.
If you are trying to overcome a performance issue then you may be on the right track. However, plotting back into the correctly initialized context of the parent process requires going a lot deeper into matplotlib's internals, for example by setting up a data pipe from the workers back to the original application. That is really only going to help if you do a lot of processing of the data before it is plotted, which does not look like what you are doing here.
If you are trying to get a visual effect like stacked / overlaid results, then you probably want to look into repeating the plot function or modifying the data structure to better represent what you want to visualize.
So. What problem are you trying to solve? A performance problem, or a visualization problem? If it is a visualization problem then you do NOT want to use multiprocessing.
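If it is a visualization problem, a plain loop does the same work inside a single, correctly initialized matplotlib context; a sketch, assuming df_coding and gene_ids are already loaded as in the question:
# sequential alternative: one process, one matplotlib context
for gn_name in gene_ids:
    process_expression(gn_name, df_gn=df_coding)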
