Run short Python code directly in Snakemake

I have a snakemake pipeline where I need to do a small step of processing the data (applying a rolling average to a dataframe).
I would like to write something like this:
rule average_df:
    input:
        # script = ,
        df_raw = "{sample}_raw.csv"
    params:
        window = 83
    output:
        df_avg = "{sample}_avg.csv"
    shell:
        """
        python
        import pandas as pd
        df = pd.read_csv("{input.df_raw}")
        df = df.rolling(window={params.window}, center=True, min_periods=1).mean()
        df.to_csv("{output.df_avg}")
        """
However, it does not work.
Do I have to create a Python file with those four lines of code? The alternative that occurs to me is a bit cumbersome. It would be:
average_df.py
import pandas as pd

def average_df(i_path, o_path, window):
    df = pd.read_csv(i_path)
    df = df.rolling(window=window, center=True, min_periods=1).mean()
    df.to_csv(o_path)
    return None

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='Apply a rolling average to a csv file')
    parser.add_argument('-i_path', '--input_path', help='csv file', required=True)
    parser.add_argument('-o_path', '--output_path', help='csv file', required=True)
    parser.add_argument('-w', '--window', help='window for averaging', type=int, required=True)
    args = vars(parser.parse_args())
    i_path = args['input_path']
    o_path = args['output_path']
    window = args['window']
    average_df(i_path, o_path, window)
And then have the snakemake rule like this:
rule average_df:
    input:
        script = "average_df.py",
        df_raw = "{sample}_raw.csv"
    params:
        window = 83
    output:
        df_avg = "{sample}_avg.csv"
    shell:
        """
        python {input.script} --input_path {input.df_raw} --output_path {output.df_avg} --window {params.window}
        """
Is there a smarter or more efficient way to do this? That would be great! Looking forward to your input!

This can be achieved via the run directive:
rule average_df:
    input:
        df_raw = "{sample}_raw.csv"
    params:
        window = 83
    output:
        df_avg = "{sample}_avg.csv"
    run:
        import pandas as pd
        df = pd.read_csv(input.df_raw)
        df = df.rolling(window=params.window, center=True, min_periods=1).mean()
        df.to_csv(output.df_avg)
Note that all snakemake objects are available directly via input, output, params, etc.
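Possibly also worth knowing: Snakemake's script directive covers the same need while keeping the code in a separate file; the rule's objects are exposed to the script through an injected snakemake object. A minimal sketch, assuming a hypothetical file scripts/average_df.py:
rule average_df:
    input:
        df_raw = "{sample}_raw.csv"
    params:
        window = 83
    output:
        df_avg = "{sample}_avg.csv"
    script:
        "scripts/average_df.py"
with scripts/average_df.py containing:
import pandas as pd

# the snakemake object is injected by the script directive
df = pd.read_csv(snakemake.input.df_raw)
df = df.rolling(window=snakemake.params.window, center=True, min_periods=1).mean()
df.to_csv(snakemake.output.df_avg)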

The run directive seems the way to go. It may be good to know that you could do the same using python's -c argument, which runs a script passed as a string. E.g.:
shell:
    r"""
    python -c '
import pandas as pd
df = pd.read_csv("{input.df_raw}")
etc etc...
'
    """

Related

Passing wildcard values in params in snakemake

I am trying to clean up a data pipeline using snakemake. It looks like wildcards are what I need, but I can't manage to make them work in params.
My function needs a parameter that depends on the wildcard value. For instance, let's say it depends on sample, which can be either A or B.
I tried the following (my real example is more complicated, but this is basically what I am trying to do):
sample = ["A","B"]
import pandas as pd
def dummy_example(sample):
return pd.DataFrame({"values": [0,1], "sample": sample})
rule all:
input:
"mybucket/sample_{sample}.csv"
rule testing_wildcards:
output:
newfile="mybucket/sample_{sample}.csv"
params:
additional="{sample}"
run:
df = dummy_example(params.additional)
df.to_csv(output.newfile, index = False)
which gives me the following error:
Wildcards in input files cannot be determined from output files:
'sample'
I followed the docs and put expand in the output section. For the params, it looked like this section and this thread gave me everything I needed:
sample_list = ["A","B"]
import pandas as pd
import re
def dummy_example(sample):
return pd.DataFrame({"values": [0,1], "sample": sample})
def get_wildcard_from_output(output):
return re.search(r'sample_(.*?).csv', output).group(1)
rule all:
input:
expand("sample_{sample}.csv", sample = sample_list)
rule testing_wildcards:
output:
newfile=expand("sample_{sample}.csv", sample = sample_list)
params:
additional=lambda wildcards, output: get_wildcard_from_output(output)
run:
print(params.additional)
df = dummy_example(params.additional)
df.to_csv(output.newfile, index = False)
InputFunctionException in line 16 of /home/jovyan/work/Snakefile:
Error:
TypeError: expected string or bytes-like object
Wildcards:
Is there some way to catch the value of the wildcard in params so that I can use it in run?
I think that you are trying to get the sample wildcard to use as a parameter in your script. (Incidentally, the TypeError above comes from passing the whole output Namedlist, rather than a string, to re.search.)
The wc variable is an instance of snakemake.io.Wildcards, which is a snakemake.io.Namedlist.
You can call .get(key) on these objects, so we can use a lambda function to generate the params:
samples_from_wc=lambda wc: wc.get("sample"), which can then be used in run/shell as params.samples_from_wc.
sample_list = ["A","B"]
import pandas as pd
def dummy_data(sample):
return pd.DataFrame({"values": [0, 1], "sample": sample})
rule all:
input: expand("sample_{sample}.csv", sample=sample_list)
rule testing_wildcards:
output:
newfile="sample_{sample}.csv"
params:
samples_from_wc=lambda wc: wc.get("sample")
run:
# Load input
df = dummy_data(params.samples_from_wc)
# Write output
df.to_csv(output.newfile, index=False)
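As a side note, within a run block the wildcards object is also available directly, so for a case this simple the params indirection could arguably be dropped. A minimal sketch, reusing dummy_data from above:
rule testing_wildcards:
    output:
        newfile="sample_{sample}.csv"
    run:
        # wildcards is available directly inside run blocks
        df = dummy_data(wildcards.sample)
        df.to_csv(output.newfile, index=False)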

How to customise my csv file using Python instead of an Excel macro

I want to store the result of my Python code in a csv file, but with my current code the result does not show up in the csv file.
I have converted the macro VB file to Python... Any advice would be much appreciated, because I am new to this.
*Unable to enter the full code due to a site error.
Please find my code below:
import pandas as pd
import csv
import numpy as np
from vb2py.vbfunctions import *
from vb2py.vbdebug import *

def My_custom_MACRO():
    #
    # My_custom_MACRO Macro
    #
    #
    Range('A1:A2').Select()
    Range('A2').Activate()
    Columns('A:A').EntireColumn.AutoFit()
    Columns('G:G').Select()
    Selection.Insert(Shift=xlToRight, CopyOrigin=xlFormatFromLeftOrAbove)
    Columns('P:P').EntireColumn.AutoFit()
    Columns('P:P').Select()
    Selection.Cut(Destination=Columns('G:G'))
    Range('G53').Select()
    ActiveWindow.SmallScroll(Down=-45)
    Range('G1').Select()
    ActiveCell.FormulaR1C1 = 'Live Deli'
    ActiveWindow.SmallScroll(Down=-9)
    ActiveSheet.Range('$A$1:$N$201').AutoFilter(Field=7, Criteria1='>50', Operator=xlAnd)
    ActiveSheet.Range('$A$1:$N$201').AutoFilter(Field=4, Criteria1='>50', Operator=xlAnd)
    ActiveSheet.Range('$A$1:$N$201').AutoFilter(Field=7, Criteria1='>50', Operator=xlAnd)
    ActiveWindow.SmallScroll(Down=0)
    ActiveSheet.Range('$A$1:$N$201').AutoFilter(Field=4, Criteria1='>50%', Operator=xlAnd)

df_1 = pd.read_csv(r'D:\proj\project.csv', My_custom_MACRO)
df_1.to_csv(r'D:\proj\project_output.csv')
I don't see a sort in your macro, just a column move and a filter. Try this:
import pandas as pd

df = pd.read_csv(r'c:\temp\project.csv')
cols = df.columns
# cut column P and paste it as column G, named 'Live Deli'
colP = df.pop(cols[14])
df.insert(14, '', '')  # placeholder to keep the remaining columns in place
df.insert(6, 'Live Deli', colP)
# apply the filters to columns 4 and 7
df1 = df.loc[(df[cols[3]] > 0.5) & (df['Live Deli'] > 50)]
# save
df1.to_csv(r'c:\temp\project_output.csv', index=False)
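One note on the thresholds: Excel stores percentages as fractions, so the macro's '>50%' criterion on field 4 corresponds to df[cols[3]] > 0.5 here, while the plain '>50' on field 7 maps to > 50 on the moved 'Live Deli' column.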

`mp.pool.ThreadPool` fails when `mp.Pool` works normally

I'm using the multiprocessing Python library to run feature selection for a machine learning problem in parallel. The function accepts a pandas dataframe as input and returns some figures.
When I execute this function using mp.Pool's map() everything runs smoothly. However, if I substitute it with mp.pool.ThreadPool's map() it fails with this error:
AssertionError: Number of manager items must equal union of block items
# manager items: 15, # tot_items: 20.
Strangely, I was running the ThreadPool code normally until yesterday. Then I tried to re-run it and started getting these errors. I need ThreadPool since this is an IO-bound job, and it was running much faster compared to Pool.
EDIT:
The code goes like this (Python 2.7):
import multiprocessing as mp
import pandas as pd  # version 0.22.0

def main_functionality(df, params):
    df = df[params['feature']]
    # Run 5-fold cross-validation
    data_df = pd.DataFrame(....)
    pred_df = pred_df.append(data_df)
    return ...  # statistics from pred_df

def a_function(df_init, feature, params_init):
    params = dict(params_init)
    df = df_init.copy()
    params['feature'] = feature
    try:
        results = main_functionality(df, params)
    except:
        results = (0, 0, 0)
    return results

def b_function(df, features):
    pool = mp.pool.ThreadPool(4)
    params = {...}
    results = pool.map(a_function, [(df, f, params) for f in features])
    results_df = pd.DataFrame(results)
    results_df.to_csv(...)

if __name__ == '__main__':
    df = pd.read_csv(...)  # A big CSV file (i.e. a few GBs)
    features = [i for i in df.columns if i ....]
    b_function(df, features)
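For what it's worth, one plausible culprit is that all ThreadPool workers share the same df_init object, and pandas' internal block consolidation (triggered, for example, by df_init.copy()) was not thread-safe in old versions, which can surface as exactly this kind of AssertionError about manager items versus block items. A hedged sketch of one way to avoid the shared mutation, reusing the names from the question (a_function is reworked to take a single feature argument so pool.map can be used directly):
import functools
import threading
import multiprocessing.pool

copy_lock = threading.Lock()

def a_function(feature, df_init, params_init):
    params = dict(params_init)
    params['feature'] = feature
    with copy_lock:  # serialize the copy: pandas block consolidation is not thread-safe
        df = df_init.copy()
    return main_functionality(df, params)

def b_function(df, features, params):
    pool = multiprocessing.pool.ThreadPool(4)
    worker = functools.partial(a_function, df_init=df, params_init=params)
    results = pool.map(worker, features)  # map over features only
    pool.close()
    pool.join()
    return results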

Is it possible to extract information about HOW a user-defined function is being called?

Or is it possible to capture the function call itself in any way (describing which values are assigned to the different arguments)?
Sorry for the poor phrasing of the question. Let me explain with some reproducible code:
import pandas as pd
import numpy as np
import matplotlib.dates as mdates
import inspect

# 1. Here is a DataFrame with some random numbers
np.random.seed(123)
rows = 10
df = pd.DataFrame(np.random.randint(90, 110, size=(rows, 2)), columns=list('AB'))
datelist = pd.date_range(pd.datetime(2017, 1, 1).strftime('%Y-%m-%d'), periods=rows).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
# print(df)

# 2. And here is a very basic function to do something with the dataframe
def manipulate(df, factor):
    df = df * factor
    return df

# 3. Now I can describe the function using:
print(inspect.getargspec(manipulate))
# And get:
# ArgSpec(args=['df', 'factor'], varargs=None, keywords=None, defaults=None)
# __main__:1: DeprecationWarning: inspect.getargspec() is
# deprecated, use inspect.signature() or inspect.getfullargspec()

# 4. But what I'm really looking for is a way to extract or store
# the function AND the arguments used when the function is called, like this:
df2 = manipulate(df=df, factor=20)
# So in the example using inspect, the desired output could be:
# ArgSpec(args=['df = df', 'factor = 20'], varargs=None, and so on...
I realize that this may seem a bit peculiar, but being able to do something like this would actually be of great use to me. If anyone is interested, I'd be happy to explain everything in more detail, including how this would fit into my data science workflow.
Thank you for any suggestions!
You can bind the parameters to the function and create a new callable:
import functools
func = functools.partial(manipulate, df=df, factor=20)
The resulting partial object allows argument inspection and modification via the attributes args and keywords:
func.keywords  # {'df': <pandas dataframe>, 'factor': 20}
and it can finally be called using
func()
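If the goal is just to record how a call would be made, rather than to create a callable, inspect.signature(...).bind(...) can also capture the argument assignments; a minimal sketch:
import inspect

bound = inspect.signature(manipulate).bind(df=df, factor=20)
bound.apply_defaults()
print(bound.arguments)  # maps each parameter name to the value it would receive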

Get CSV from TensorFlow summaries

I have some very large TensorFlow summaries. If these are plotted using TensorBoard, I can download CSV files from them.
However, plotting these using TensorBoard would take a very long time. I found in the docs that there is a method for reading the summary directly in Python. This method is summary_iterator and can be used as follows:
import tensorflow as tf

for e in tf.train.summary_iterator(path_to_events_file):
    print(e)
Can I use this method to create CSV files directly? If so, how can I do this? This would save a lot of time.
One possible way of doing it would be like this:
from tensorboard.backend.event_processing import event_accumulator
import numpy as np
import pandas as pd
import sys

def create_csv(inpath, outpath):
    sg = {event_accumulator.COMPRESSED_HISTOGRAMS: 1,
          event_accumulator.IMAGES: 1,
          event_accumulator.AUDIO: 1,
          event_accumulator.SCALARS: 0,
          event_accumulator.HISTOGRAMS: 1}
    ea = event_accumulator.EventAccumulator(inpath, size_guidance=sg)
    ea.Reload()
    scalar_tags = ea.Tags()['scalars']
    df = pd.DataFrame(columns=scalar_tags)
    for tag in scalar_tags:
        events = ea.Scalars(tag)
        # use a list comprehension rather than map() so this also works on Python 3
        scalars = np.array([x.value for x in events])
        df.loc[:, tag] = scalars
    df.to_csv(outpath)

if __name__ == '__main__':
    inpath = sys.argv[1]
    outpath = sys.argv[2]
    create_csv(inpath, outpath)
Please note that this code loads the entire event file into memory, so it is best to run it on a cluster. For information about the sg argument of the EventAccumulator, see this SO question.
An additional improvement might be to not only store the value of each scalar, but also the step.
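For instance, a minimal sketch of that improvement, relying on the fact that each scalar event carries both a step and a value; indexing each tag's series by step also keeps tags of different lengths aligned:
def create_csv_with_steps(inpath, outpath):
    ea = event_accumulator.EventAccumulator(
        inpath, size_guidance={event_accumulator.SCALARS: 0})
    ea.Reload()
    # one column per tag, indexed by the global step of each event
    series = {tag: pd.Series({x.step: x.value for x in ea.Scalars(tag)})
              for tag in ea.Tags()['scalars']}
    pd.DataFrame(series).to_csv(outpath, index_label='step')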
Note: the main code snippet above was updated for recent versions of TF. For TF < 1.1, use the following import instead:
from tensorflow.tensorboard.backend.event_processing import event_accumulator as eva
