I am trying to clean up a data pipeline using Snakemake. It looks like wildcards are what I need, but I can't get them to work in params.
My function needs a parameter that depends on the wildcard value. For instance, say
it depends on a sample that can be either A or B.
I tried the following (my real example is more complicated, but this is essentially what I am trying to do):
import pandas as pd

sample = ["A", "B"]

def dummy_example(sample):
    return pd.DataFrame({"values": [0, 1], "sample": sample})

rule all:
    input:
        "mybucket/sample_{sample}.csv"

rule testing_wildcards:
    output:
        newfile="mybucket/sample_{sample}.csv"
    params:
        additional="{sample}"
    run:
        df = dummy_example(params.additional)
        df.to_csv(output.newfile, index=False)
which gives me the following error:
Wildcards in input files cannot be determined from output files:
'sample'
I followed the doc and put expand in the output section.
For the params, it looked like this section and this thread were giving me everything needed:
import re

import pandas as pd

sample_list = ["A", "B"]

def dummy_example(sample):
    return pd.DataFrame({"values": [0, 1], "sample": sample})

def get_wildcard_from_output(output):
    return re.search(r'sample_(.*?).csv', output).group(1)

rule all:
    input:
        expand("sample_{sample}.csv", sample=sample_list)

rule testing_wildcards:
    output:
        newfile=expand("sample_{sample}.csv", sample=sample_list)
    params:
        additional=lambda wildcards, output: get_wildcard_from_output(output)
    run:
        print(params.additional)
        df = dummy_example(params.additional)
        df.to_csv(output.newfile, index=False)
InputFunctionException in line 16 of /home/jovyan/work/Snakefile:
Error:
TypeError: expected string or bytes-like object
Wildcards:
Is there a way to capture the value of the wildcard in params so it can be used in run?
I think you are trying to get the sample wildcard to use as a parameter in your script.
The wc variable is an instance of snakemake.io.Wildcards, which is a snakemake.io.Namedlist.
You can call .get(key) on these objects, so we can use a lambda function to generate the params:
samples_from_wc=lambda wc: wc.get("sample"), and use it in the run/shell block as params.samples_from_wc.
import pandas as pd

sample_list = ["A", "B"]

def dummy_data(sample):
    return pd.DataFrame({"values": [0, 1], "sample": sample})

rule all:
    input:
        expand("sample_{sample}.csv", sample=sample_list)

rule testing_wildcards:
    output:
        newfile="sample_{sample}.csv"
    params:
        samples_from_wc=lambda wc: wc.get("sample")
    run:
        # Load input
        df = dummy_data(params.samples_from_wc)
        # Write output
        df.to_csv(output.newfile, index=False)
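Note that inside a run block you can also read the wildcard directly from the wildcards object, without going through params at all. A minimal sketch of the same rule (same dummy_data helper as above):

```
rule testing_wildcards:
    output:
        newfile="sample_{sample}.csv"
    run:
        # wildcards is available directly inside run
        df = dummy_data(wildcards.sample)
        df.to_csv(output.newfile, index=False)
```

The params route is still useful when the value needs to be interpolated into a shell command.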
I have a snakemake pipeline where I need to do a small processing step on the data (applying a rolling average to a dataframe).
I would like to write something like this:
rule average_df:
    input:
        # script = ,
        df_raw = "{sample}_raw.csv"
    params:
        window = 83
    output:
        df_avg = "{sample}_avg.csv"
    shell:
        """
        python
        import pandas as pd
        df=pd.read_csv("{input.df_raw}")
        df=df.rolling(window={params.window}, center=True, min_periods=1).mean()
        df.to_csv("{output.df_avg}")
        """
However, it does not work.
Do I have to create a Python file with those four lines of code? The alternative that occurs to me is a bit cumbersome. It would be
average_df.py
import pandas as pd

def average_df(i_path, o_path, window):
    df = pd.read_csv(i_path)
    df = df.rolling(window=window, center=True, min_periods=1).mean()
    df.to_csv(o_path)
    return None

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='Apply a rolling average to a csv file')
    parser.add_argument('-i_path', '--input_path', help='input csv file', required=True)
    parser.add_argument('-o_path', '--output_path', help='output csv file', required=True)
    parser.add_argument('-w', '--window', help='window for averaging', type=int, required=True)
    args = vars(parser.parse_args())
    i_path = args['input_path']
    o_path = args['output_path']
    window = args['window']
    average_df(i_path, o_path, window)
And then have the snakemake rule like this:
rule average_df:
    input:
        script = "average_df.py",
        df_raw = "{sample}_raw.csv"
    params:
        window = 83
    output:
        df_avg = "{sample}_avg.csv"
    shell:
        """
        python {input.script} --input_path {input.df_raw} --output_path {output.df_avg} --window {params.window}
        """
Is there a smarter or more efficient way to do this? That would be great! Looking forward to your input!
This can be achieved via the run directive:
rule average_df:
    input:
        df_raw = "{sample}_raw.csv"
    params:
        window = 83
    output:
        df_avg = "{sample}_avg.csv"
    run:
        import pandas as pd
        df = pd.read_csv(input.df_raw)
        df = df.rolling(window=params.window, center=True, min_periods=1).mean()
        df.to_csv(output.df_avg)
Note that all snakemake objects are available directly via input, output, params, etc.
The run directive seems the way to go. It may also be good to know that you can do the same using Python's -c argument to run a script passed as a string, e.g.:
shell:
    r"""
    python -c '
    import pandas as pd
    df = pd.read_csv("{input.df_raw}")
    etc etc...
    '
    """
I'm currently working on an ETL project with messy data that I'm trying to get right.
For this, I'm trying to create a function that takes the names of my DataFrames and exports them to CSV files that will be easy for me to deal with in Power BI.
I've started with a function that takes my DataFrames and cleans the dates:
import pandas as pd

df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if "Fact" in s:
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = x[['periode_facture']].apply(lambda x: pd.to_datetime(x, format='%Y%m'))
If I don't convert 'x' to a DataFrame, it doesn't work, but that's not my problem.
As you can see, I have set up a list variable that I would like to fill with the names of the DataFrames, and the names only. Unfortunately, after a lot of tries, I haven't succeeded yet, so... there it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your dataframe first using df.name (probably when you create them / read data into them).
Then you can access the name like a normal attribute:
import pandas as pd

df = pd.DataFrame(data=[1, 2, 3])
df.name = 'my df'
and can use
df.to_csv(f'{df.name}.csv', encoding='utf-8')
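One caveat: pandas does not preserve the .name attribute through most operations, so a more robust pattern is to keep the frames in a dict keyed by their names. A sketch under that assumption (export_all and the frame names here are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Keep each DataFrame under an explicit name instead of relying on df.name
frames = {
    "factures": pd.DataFrame({"values": [1, 2, 3]}),
    "clients": pd.DataFrame({"values": [4, 5, 6]}),
}

def export_all(frames, out_dir, encoding="utf-8"):
    """Write every DataFrame in the dict to '<out_dir>/<name>.csv'."""
    written = []
    for name, df in frames.items():
        path = os.path.join(out_dir, f"{name}.csv")
        df.to_csv(path, encoding=encoding)
        written.append(path)
    return written

out_dir = tempfile.mkdtemp()
written = export_all(frames, out_dir)
```

This also sidesteps the original problem entirely: the "name" never lives on the DataFrame, so it cannot be lost.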
I have a list with all the file names as below, and I need to sort them and process them in ascending order. The code I used works fine on the python3 command line but not in pyspark. The code I tried is:
from datetime import datetime

def sorted_paths(paths):
    paths.sort(key=lambda path: datetime.strptime(path.split('_')[2], '%Y%m%d'))
    return paths
Gives an error:
Error: time data daily doesn't match the format '%Y%m%d'
Input List is as below:
file_d_20190101_htp.csv
file_d_20180401_html.csv
file_d_20200701_ksh.csv
file_d_20190301_htp.csv
Required output
file_d_20180401_html.csv
file_d_20190101_htp.csv
file_d_20190301_htp.csv
file_d_20200701_ksh.csv
Just do this; it is convenient and quick. Because the dates are zero-padded YYYYMMDD and the prefixes are identical, a plain lexicographic sort already yields chronological order:

paths = ['file_d_20190101_htp.csv',
         'file_d_20180401_html.csv',
         'file_d_20200701_ksh.csv',
         'file_d_20190301_htp.csv',
         ]
paths.sort()  # in-place sort
You can use Python's built-in function sorted to resolve this:

import datetime

arr = ['file_d_20190101_htp.csv',
       'file_d_20180401_html.csv',
       'file_d_20200701_ksh.csv',
       'file_d_20190301_htp.csv']

print(sorted(arr, key=lambda x: datetime.datetime.strptime(x.split("_")[2], '%Y%m%d')))
One way using dateutil.parser:
import dateutil.parser as dparser
f = lambda x: dparser.parse(x, fuzzy=True)
sorted(paths, key=f)
Output:
['file_d_20180401_html.csv',
'file_d_20190101_htp.csv',
'file_d_20190301_htp.csv',
'file_d_20200701_ksh.csv']
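The original error ("time data daily doesn't match the format") suggests the real listing contains names whose third token is not a date (a hypothetical file_d_daily_summary.csv, say). A defensive key that tolerates such names might look like this; the choice to sort non-date names last is an assumption:

```python
from datetime import datetime

def date_key(path):
    """Parse the third '_'-separated token as YYYYMMDD; push non-dates to the end."""
    token = path.split('_')[2]
    try:
        return (0, datetime.strptime(token, '%Y%m%d'))
    except ValueError:
        return (1, datetime.max)  # names without a date sort after all dated ones

paths = ['file_d_20190101_htp.csv',
         'file_d_daily_summary.csv',
         'file_d_20180401_html.csv']
result = sorted(paths, key=date_key)
```

The (flag, date) tuple key keeps the comparison well-defined even when strptime fails.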
I am working on code where I need to map multiple NamedTuples placed into a list.
Below is a code example. My main problem is the mapping of the list of dual NamedTuples PeopleName and PeopleAge: I'm not clear how this can be done. Should it be done in two steps: 1/ extract the full row into a generic NamedTuple, then 2/ split the record into the different NamedTuples PeopleName and PeopleAge?
from typing import NamedTuple, List

import pandas as pd

data = [["tom", 10, "ab 11"], ["nick", 15, "ab 22"], ["juli", 14, "ab 11"]]
people = pd.DataFrame(data, columns=["Name", "Age", "PostalCode"])

PeopleName = NamedTuple("PeopleName", [("Name", str)])
PeopleAge = NamedTuple("PeopleAge", [("Age", int)])
PeoplePC = NamedTuple("PeoplePC", [("PostalCode", str)])

# The code below is not correct
Demography = NamedTuple(
    "Demography", [("names", List[(PeopleName, PeopleAge)]), ("postalcodes", PeoplePC)],
)

def to_nested_tuple(k, g):
    peoples = list(
        g["Name"].to_frame().itertuples(name="Person", index=False),
        # rec["Age"].to_frame().itertuples(name="PeopleAge", index=False),
    )
    return Demography(peoples, PeoplePC(k))

d = [to_nested_tuple(*item) for item in people.groupby("PostalCode")]
print(d)
The annotation List[(PeopleName, PeopleAge)] throws TypeError: Too many parameters for typing.List; actual 2, expected 1.
A tuple of two different types should be annotated with typing.Tuple:
List[Tuple[PeopleName, PeopleAge]]
But to annotate arguments, it is preferable to use an abstract collection type such as Sequence or Iterable:
Demography = NamedTuple(
    "Demography", [("names", Sequence[Tuple[PeopleName, PeopleAge]]), ("postalcodes", PeoplePC)],
)
Instead of applying to_nested_tuple to each group, I would build the result directly in the following way:
d = [Demography([(PeopleName(row['Name']), PeopleAge(row['Age'])) for _, row in group.iterrows()], PeoplePC(k))
     for k, group in people.groupby("PostalCode")]
Now, the result would be printed as:
[Demography(names=[(PeopleName(Name='tom'), PeopleAge(Age=10)), (PeopleName(Name='juli'), PeopleAge(Age=14))], postalcodes=PeoplePC(PostalCode='ab 11')),
Demography(names=[(PeopleName(Name='nick'), PeopleAge(Age=15))], postalcodes=PeoplePC(PostalCode='ab 22'))]
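Putting the corrected annotation and the comprehension together, a self-contained version (same toy data as in the question):

```python
from typing import NamedTuple, Sequence, Tuple

import pandas as pd

data = [["tom", 10, "ab 11"], ["nick", 15, "ab 22"], ["juli", 14, "ab 11"]]
people = pd.DataFrame(data, columns=["Name", "Age", "PostalCode"])

PeopleName = NamedTuple("PeopleName", [("Name", str)])
PeopleAge = NamedTuple("PeopleAge", [("Age", int)])
PeoplePC = NamedTuple("PeoplePC", [("PostalCode", str)])
Demography = NamedTuple(
    "Demography",
    [("names", Sequence[Tuple[PeopleName, PeopleAge]]), ("postalcodes", PeoplePC)],
)

# One Demography per postal code, pairing each name with its age
d = [
    Demography(
        [(PeopleName(row["Name"]), PeopleAge(row["Age"])) for _, row in group.iterrows()],
        PeoplePC(k),
    )
    for k, group in people.groupby("PostalCode")
]
```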
Use list(df.itertuples()) where df is your dataframe.
I am trying to create a dataframe with Python, which works fine with the following command:

df_test2 = DataFrame(index=idx, data=["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"])

but when I try to get the data from a variable instead of hard-coding it into the data argument, e.g.:

r6 = ["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]
df_test2 = DataFrame(index=idx, data=r6)

I expected this to be the same and to work, but I get:
ValueError: DataFrame constructor not properly called!
Reason for the error:
r6 most likely holds a string representation of the list (e.g. it was read from a file or returned as text), and a bare string is not a valid data argument for the DataFrame constructor.
Fix/solution:

import ast

# convert the string representation back to a real Python list
r6 = ast.literal_eval(r6)

# and use it as the input
df_test2 = DataFrame(index=idx, data=r6)

which will solve the error.
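To illustrate, here is a sketch assuming r6 arrived as a string (that assumption is what makes the constructor fail; the index labels are made up):

```python
import ast

import pandas as pd

# A variable that merely *looks* like a list -- it is actually a string
r6 = '["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]'

caught = None
try:
    pd.DataFrame(data=r6)  # a bare string raises ValueError
except ValueError as e:
    caught = e

values = ast.literal_eval(r6)  # parse the string back into a real list
df = pd.DataFrame(data=values, index=["a", "b", "c"])
```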