Plotting pandas dataframe and multiprocessing in Python - python

I have a pandas dataframe and I want to plot slices of it, in a function using multiprocessing. Even though the function "process_expression" works when I call it independently, when I use the "multiprocessing" option it is not giving any plots.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import seaborn as sns
import sys
from multiprocessing import Pool
import os
os.system("taskset -p 0xff %d" % os.getpid())
pool = Pool()
gn = pool.map(process_expression, gene_ids)
pool.close()
pool.join()
def process_expression(gn_name, df_gn=df_coding):
df_part = df_gn.loc[df_gn['Gene_id'] == gn_name]
df_part = df_part.drop('Gene_id', 1)
df_part = df_part.drop('Transcript_biotype', 1)
COUNT100 = df_part[df_part >100 ].count()
COUNT10 = (df_part[df_part >10 ].count()) - COUNT100
COUNT1 = (df_part[df_part >1].count())- COUNT100 - COUNT10
COUNT0 = (df_part[df_part >0].count())- COUNT100-COUNT10- COUNT1
result = pd.concat([COUNT0,COUNT1,COUNT10,COUNT100], axis=1)
result.columns = [ '0 TO 1', '1 TO 10','10 TO 100', '>100']
result.plot( kind='bar', figsize=(50, 20), fontsize=7, stacked=True)
plt.savefig('./expression_levels/all_genes/'+gn_name+'.png')#,bbox_inches='tight')
plt.close()
the df_coding table is something like (it has more columns, I erased some):
Isoform_name,heart,heart.1,lung.3,Gene_id,Transcript_biotype
ENST00000296782,0.14546900000000001,0.161245,0.09479889999999999,ENSG00000164327,protein_coding
ENST00000357387,6.53902,5.86969,7.057689999999999,ENSG00000164327,protein_coding
ENST00000514735,0.0,0.0,0.0,ENSG00000164327,protein_coding
The input dataframe df_coding is a dataframe with a column Gene_id. In this column I have a list of gn_name. What I want is to take each time only the parts of the dataframe which have the name gn_name[i] in the Gene_id column and plot a barplot based on this dataframe.
For example if I call the 'process_expression('ENSG00000164327')', which is a specific gn_name, the output is something like this:
What am I doing wrong? I know that the process stops at the plotting command when I run it with multiprocessing.

The problem is between multiprocessing and matplotlib. With multiprocessing you create a completely new context with each process. The new context does not (and can not) successfully initialize the context because it is already initialized in the parent process.
If you are trying to overcome a performance issue then you may be on the right track. However, plotting back to the correctly initialized context of the parent process will require you to go a lot deeper into the structure of the underlying matplotlib guts. Here is an example of setting a data pipe back to the original application. Really this is only going to help if you are dealing with a lot of processing of the data before it is plotted. It doesn't look like that is what you are doing here.
If you are trying to get a visual effect like stacked / overlayed results then you probably want to look into repeating the plot function or modifying the data structure to better represent what you want to visualize.
So. What problem are you trying to solve? A performance problem, or a visualization problem? If it is a visualization problem then you do NOT want to use multiprocessing.

Related

How do I get tqdm working on pandas apply?

Tqdm documentation shows an example of tqdm working on pandas apply using progress_apply. I adapted the following code from here https://tqdm.github.io/docs/tqdm/ on a process that regularly take several minutes to perform (func1 is a regex function).
from tqdm import tqdm
tqdm.pandas()
df.progress_apply(lambda x: func1(x.textbody), axis=1)
The resulting progress bar doesn't show any progress. It just jumps from 0 at the start of the loop to 100 when it is finished. I am currently running tqdm version 4.61.2
Utilizing tqdm with pandas
Generally speaking, people tend to use lambdas when performing operations on a column or row. This can be done in a number of ways.
Please note: that if you are working in jupyter notebook you should use tqdm_notebook instead of tqdm.
Also I'm not sure what your code looks like but if you're simply following the example given in the tqdm docs, and you're only performing 100 interations, computers are fast and will blow through that before your progress bar has time to update. Perhaps it would be more instructive to use a larger dataset like I provided below.
Example 1:
from tqdm import tqdm # version 4.62.2
import pandas as pd # version 1.4.1
import numpy as np
tqdm.pandas(desc='My bar!') # lots of cool paramiters you can pass here.
# the below line generates a very large dataset for us to work with.
df = pd.DataFrame(np.random.randn(100000000, 4), columns=['a','b','c','d'])
# the below line will square the contents of each element in an column-wise
# fashion
df.progress_apply(lambda x: x**2)
Output:
Output
Example 2:
# you could apply a function within the lambda expression for more complex
# operations. And keeping with the above example...
tqdm.pandas(desc='My bar!') # lots of cool paramiters you can pass here.
# the below line generates a very large dataset for us to work with.
df = pd.DataFrame(np.random.randn(100000000, 4), columns=['a','b','c','d'])
def function(x):
return x**2
df.progress_apply(lambda x: function(x))

When I run '''sns.histplot(df['price'])''' in pycharm I get the code output but no graph, why is this?

I'm using pycharm to run some code using Seaborn. I'm very new to python and am just trying to learn the ropes so I'm following a tutorial online. I've imported the necessary libraries and have run the below code
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# import the data saved as a csv
df = pd.read_csv('summer-products-with-rating-and-performance_2020-08.csv')
df["has_urgency_banner"] = df["has_urgency_banner"].fillna(0)
df["discount"] = (df["retail_price"] -
df["price"])/df["retail_price"]
df["rating_five_percent"] = df["rating_five_count"]/df["rating_count"]
df["rating_four_percent"] = df["rating_four_count"]/df["rating_count"]
df["rating_three_percent"] = df["rating_three_count"]/df["rating_count"]
df["rating_two_percent"] = df["rating_two_count"]/df["rating_count"]
df["rating_one_percent"] = df["rating_one_count"]/df["rating_count"]
ratings = [
"rating_five_percent",
"rating_four_percent",
"rating_three_percent",
"rating_two_percent",
"rating_one_percent"
]
for rating in ratings:
df[rating] = df[rating].apply(lambda x: x if x>= 0 and x<= 1 else 0)
# Distribution plot on price
sns.histplot(df['price'])
My output is as follows:
Process finished with exit code 0
so I know there are no errors in the code but I don't see any graphs anywhere as I'm supposed to.
Ive found a way around this by using this at the end
plt.show()
which opens a new tab and uses matplotlib to show me a similar graph.
However in the code I'm using to follow along, matplotlib is not imported or used (I understand that seaborn has built in Matplotlib functionality) as in the plt.show statement is not used but the a visual graph is still achieved.
I've also used print which gives me the following
AxesSubplot(0.125,0.11;0.775x0.77)
Last point to mention is that the code im following along with uses the following
import seaborn as sns
# Distribution plot on price
sns.distplot(df['price'])
but distplot has now depreciated and I've now used histplot because I think that's the best alternative vs using displot, If that's incorrect please let me know.
I feel there is a simple solution as to why I'm not seeing a graph but I'm not sure if it's to do with pycharm or due to something within the code.
matplotlib is a dependency of seaborn. As such, importing matplotlib with import matplotlib.pyplot as plt and calling plt.show() does not add any overhead to your code.
While it is annoying that there is no sns.plt.show() at this time (see this similar question for discussion), I think this is the simplest solution to force plots to show when using PyCharm Community.
Importing matplotlib in this way will not affect how your exercises run as long as you use a namespace like plt.
Be aware the 'data' must be pandas DataFrame object, not: <class 'pandas.core.series.Series'>
I using this, work finely:
# Distribution plot on price
sns.histplot(df[['price']])
plt.show()

Pandas multiprocessing on very large dataframe

I'm trying to use the multiprocessing package to compute a function on a very large Pandas dataframe. However I ran into a problem with the following error:
OverflowError: cannot serialize a bytes objects larger than 4GiB
After applying the solution to this question and using protocol 4 for pickling, I ran into the following error instead, which is also quoted by the solution itself:
error: 'i' format requires -2147483648 <= number <= 2147483647
The answer to this question then suggests to use the dataframe as a global variable.
But ideally I would like the dataframe to still be an input of the function, without having the multiprocessing library copying and pickling it multiple times in the background.
Is there some other way I can design the code to not run into the issue?
I was able to replicate the problem with this example:
import multiprocessing as mp
import pandas as pd
import numpy as np
import time
import functools
print('Total memory usage for the dataframe: {} GB'.format(df.memory_usage().sum() / 1e9))
def slow_function(some_parameter, df):
time.sleep(1)
return some_parameter
parameters = list(range(100))
with mp.Pool(20) as pool:
function = functools.partial(slow_function, df=df)
results = pool.map(function, parameters)
try Dask
import dask.dataframe as dd
df = dd.read_csv('data.csv')
docs : https://docs.dask.org/en/latest/dataframe-api.html

Python ggplot and ggplotly

Former R user, I used to combine extensively ggplot and plot_ly libraries via the ggplotly() function to display data.
Newly arrived in Python, I see that the ggplot library is available, but cant find anything on a simple combination with plotly for graphical reactive displays.
What I would look for is something like :
from ggplot import*
import numpy as np
import pandas as pd
a = pd.DataFrame({'grid': np.arange(-4, 4),
'test_data': np.random.random_integers(0, 10,8)})
p2 = ggplot(a, aes(x = 'grid', y = 'test_data'))+geom_line()
p2
ggplotly(p2)
Where the last line would launch a classic plotly dynamic viewer with all the great functionalities of mouse graphical interactions, curves selections and so on...
Thanks for your help :),
Guillaume
This open plotnine issue describes a similar enhancement request.
Currently the mpl_to_plotly function seems to work sometimes (for some geoms?), but not consistently. The following code, seems to work ok.
from plotnine import *
from plotly.tools import mpl_to_plotly as ggplotly
import numpy as np
import pandas as pd
a = pd.DataFrame({'grid': np.arange(-4, 4),
'test_data': np.random.randint(0, 10,8)})
p2 = ggplot(a, aes(x = 'grid', y = 'test_data')) + geom_point()
ggplotly(p2.draw())
You don't need ggplotly in python if all you are seeking is an interactive interface.
ggplot (or at least plotnine, which is the implementation that I am using) uses matplotlib which is already interactive, unlike the R ggplot2 package that requires plotly on top.

Creating movie from a series of matplotlib plots using matplotlib.animation

I have a script which generates a series of time dependent plots. I'd like to "stitch" these together to make a movie.
I would preferably like to use matplotlib.animation. I have looked at examples from matplotlib documentation but I can't understand how it works.
My script currently makes 20 plots at successive time values and saves these as 00001.png up to 000020.png:
from scipy.integrate import odeint
from numpy import *
from math import cos
import pylab
omega=1.4
delta=0.1
F=0.35
def f(initial,t):
x,v=initial
xdot=v
vdot=x-x**3-delta*v-F*cos(omega*t)
return array([xdot,vdot])
T=2*pi/omega
nperiods = 100
totalsteps= 1000
small=int((totalsteps)/nperiods)
ntransients= 10
initial=[-1,0]
kArray= linspace(0,1,20)
for g in range (0,20):
k=kArray[g]
x,v=initial
xpc=[]
vpc=[]
if k==0:
x,v=x,v
else:
for i in range(1,nperiods)
x,v=odeint(f,[x,v],linspace(0,k*T,small))[-1] )
for i in range (1,nperiods):
x,v=odeint(f,[x,v],linspace(k*T,T+k*T,small))[-1]
xpc.append(x)
vpc.append(v)
xpc=xpc[ntransients:]
vpc=vpc[ntransients:]
pylab.figure(17.8,10)
pylab.scatter(xpc,vpc,color='red',s=0.2)
pylab.ylim([-1.5,1.5])
pylab.xlim([-2,2])
pylab.savefig('0000{0}.png'.format(g), dpi=200)
I'd appreciate any help. Thank you.
I think matplotlib.animation.FuncAnimation is what you're looking for. Basically, it repeatedly calls a defined function, passing in (optional) arguments as needed. This is exactly what you're already doing in your for g in range(0,20): code. You can also define an init function to get things set up. Check out the base class matplotlib.animation.Animation for more info on formats, saving, the MovieWriter class, etc.

Categories