I am sorry for the low-level question; I am a junior developer trying to learn Snakemake along with click. Please help me understand, for this example, how I can pass a list of paths as the input of a rule, and then
receive that list in a Python script.
Snakemake:
path_1 = 'data/raw/data2process/'
path_2 = 'data/raw/table.xlsx'
rule:
    input:
        list_of_paths = "list of all paths to .xlsx/.csv/.xls files from path_1",
        other_table = path_2
    output:
        {some .xlsx file}
    shell:
        "script_1.py {input.list_of_paths} {output} && "
        "script_2.py {input.other_table} {output}"
script_1.py:
@click.command()
@click.argument("input_list_of_paths", type=*??*)
@click.argument("out_path", type=click.Path())
def foo(input_list_of_paths: list, out_path: str):
    df = pd.DataFrame()
    for path in input_list_of_paths:
        table = pd.read_excel(path)
        **do smthng**
        df = pd.concat([df, table])
    df.to_excel(out_path)
script_2.py:
@click.command()
@click.argument("input_path", type=click.Path(exists=True))
@click.argument("output_path", type=click.Path())
def foo_1(input_path: str, output_path: str):
    table = pd.read_excel(input_path)
    **do smthng**
    table.to_excel(output_path)
Using pathlib, and the glob method of a Path object, you could proceed as follows:
from itertools import chain
from pathlib import Path

path_1 = Path('data/raw/data2process/')
exts = ["xlsx", "csv", "xls"]
path_1_path_lists = [
    list(path_1.glob(f"*.{ext}"))
    for ext in exts
]
path_1_all_paths = list(chain.from_iterable(path_1_path_lists))
chain.from_iterable "flattens" the list of lists, but I'm not sure Snakemake even needs a flat list for the input of its rules.
Then, in your rule:
input:
    list_of_paths = path_1_all_paths,
    other_table = path_2
I think that Path objects can be used directly. Otherwise, you need to turn them into strings with str:
input:
    list_of_paths = [str(p) for p in path_1_all_paths],
    other_table = path_2
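On the click side, the type=*??* placeholder is not really where the answer lies: click's variadic arguments (nargs=-1) collect however many paths Snakemake expands into the shell line, as a tuple of strings. A minimal sketch (the command name and the echo body are illustrative, not from the original scripts):

```python
import click
from click.testing import CliRunner

@click.command()
@click.argument("input_paths", nargs=-1, type=click.Path())
@click.argument("out_path", type=click.Path())
def concat_tables(input_paths, out_path):
    # input_paths arrives as a tuple of path strings, one per file
    click.echo(f"concatenating {len(input_paths)} tables into {out_path}")

# simulate `script_1.py a.xlsx b.csv out.xlsx`, as Snakemake's shell line would expand it
result = CliRunner().invoke(concat_tables, ["a.xlsx", "b.csv", "out.xlsx"])
print(result.output.strip())
```

In the actual script_1.py, the command body would build the DataFrame with pd.read_excel/pd.concat as in the question; here it only echoes so the argument handling stays visible. Click assigns all but the last argument to the nargs=-1 parameter and the final one to out_path.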
I want to ignore the paths that raise the error:
'Path does not exist'
when I read parquet files with PySpark. For example, I have a list of paths:
list_paths = ['path1', 'path2', 'path3']
and read the files like:
dataframe = spark.read.parquet(*list_paths)
but the path path2 does not exist. In general, I do not know in advance which paths do not exist, so I want to ignore path2 automatically. How can I do this and obtain a single dataframe?
You can use Hadoop FS API to check if the files exist before you pass them to spark.read:
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
filtered_paths = [p for p in list_paths if Path(p).getFileSystem(conf).exists(Path(p))]
dataframe = spark.read.parquet(*filtered_paths)
Where sc is the SparkContext.
Maybe you can do
import os

existing_paths = [path for path in list_paths if os.path.exists(path)]
dataframe = spark.read.parquet(*existing_paths)
(Note that os.path.exists checks the local filesystem only, so this is suitable when the parquet files are local.)
Adding to @blackbishop's answer, you can further use Hadoop pattern strings to check for files/objects before loading them.
It's also worth noting that spark.read.load() accepts lists of path strings.
from functools import partial
from typing import Iterator
from pyspark.sql import SparkSession

def iterhadoopfiles(spark: SparkSession, path_pattern: str) -> Iterator[str]:
    """Return iterator of object/file paths that match path_pattern."""
    sc = spark.sparkContext
    FileUtil = sc._gateway.jvm.org.apache.hadoop.fs.FileUtil
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    hadoop_config = sc._jsc.hadoopConfiguration()
    p = Path(path_pattern)
    return (
        str(x)
        for x in FileUtil.stat2Paths(
            p.getFileSystem(hadoop_config).globStatus(p)
        )
    )

def pathnotempty(spark: SparkSession, path_pattern: str) -> bool:
    """Return true if path matches at least one object/file."""
    try:
        next(iterhadoopfiles(spark, path_pattern))
    except StopIteration:
        return False
    return True

paths_to_load = list(filter(partial(pathnotempty, spark), ["file:///*.parquet"]))
spark.read.format('parquet').load(paths_to_load)
How could I take the following variable, fIn = T1_r.nii.gz,
append the suffix _brain, and create the following output filename?
fOut = T1_r_brain.nii.gz
When I use the following command line
fIn2, file_extension = os.path.splitext(fIn)
it only removes the .gz extension.
Thank you for your help
Fred
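The behaviour in the question can be seen directly: os.path.splitext splits off only the final extension, so double extensions like .nii.gz need extra handling:

```python
import os

fIn = "T1_r.nii.gz"
root, ext = os.path.splitext(fIn)
print(root, ext)  # T1_r.nii .gz -- only the last extension is split off
```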
I had to write a utility for this, and here's what I came up with.
from pathlib import Path

def add_str_before_suffixes(filepath, string: str) -> Path:
    """Append a string to a filename immediately before extension(s).

    Parameters
    ----------
    filepath : Path-like
        Path to modify. Can contain multiple extensions like `.bed.gz`.
    string : str
        String to append to filename.

    Returns
    -------
    Instance of `pathlib.Path`.

    Examples
    --------
    >>> add_str_before_suffixes("foo", "_baz")
    PosixPath('foo_baz')
    >>> add_str_before_suffixes("foo.bed", "_baz")
    PosixPath('foo_baz.bed')
    >>> add_str_before_suffixes("foo.bed.gz", "_baz")
    PosixPath('foo_baz.bed.gz')
    """
    filepath = Path(filepath)
    suffix = "".join(filepath.suffixes)
    orig_name = filepath.name.replace(suffix, "")
    new_name = f"{orig_name}{string}{suffix}"
    return filepath.with_name(new_name)
Here is an example:
>>> f_in = "T1_r.nii.gz"
>>> add_str_before_suffixes(f_in, "_brain")
PosixPath('T1_r_brain.nii.gz')
Alternatively, a simple split-based approach (this assumes the base name itself contains no extra dots):
split_path = 'T1_r.nii.gz'.split('.')
split_path[0] += '_brain'
final_path = ".".join(split_path)
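A quick executable check of the split-based approach on the filename from the question:

```python
split_path = 'T1_r.nii.gz'.split('.')
split_path[0] += '_brain'
final_path = ".".join(split_path)
print(final_path)  # T1_r_brain.nii.gz
```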
I have some xml files in a folder, for example 'assests/2020/2010.xml', 'assests/2020/20005.xml', 'assests/2020/20999.xml', etc. I want to get the filename with the maximum value in the '2020' folder. For the three files above, the output should be 20999.xml.
I am trying as following:
import glob
import os
list_of_files = glob.glob('assets/2020/*')
# latest_file = max(list_of_files, key=os.path.getctime)
# print (latest_file)
I haven't been able to work out the logic to get the required file.
Here is a resource with the closest answer to my query, but I couldn't build the logic from it.
You can use pathlib to glob for the xml files and access the Path object attributes like .name and .stem:
from pathlib import Path
list_of_files = Path('assets/2020/').glob('*.xml')
print(max((Path(fn).name for fn in list_of_files), key=lambda fn: int(Path(fn).stem)))
Output:
20999.xml
I can't test it out right now, but you may try this:
files = []
for filename in list_of_files:
    filename = str(filename)
    filename = filename.replace('.xml', '')  # assuming it's not printing your complete directory path
    filename = int(filename)
    files += [filename]
print(files)
This should get you your filenames in integer format and now you should be able to sort them in descending order and get the first item of the sorted list.
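The descending-sort idea above can be completed as a short sketch (using the example paths from the question):

```python
import os

list_of_files = ['assests/2020/2010.xml', 'assests/2020/20005.xml', 'assests/2020/20999.xml']
# strip the directory and the '.xml' extension, keeping the numeric stem as an int
numbers = [int(os.path.splitext(os.path.basename(f))[0]) for f in list_of_files]
numbers.sort(reverse=True)  # descending order, so the largest comes first
print(f"{numbers[0]}.xml")  # 20999.xml
```

Converting to int before sorting matters: as strings, '2999' would sort above '20999'.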
Use re to search for the appropriate endings in your file paths. If found, use re again to extract the number.
import re
list_of_files = [
    'assests/2020/2010.xml',
    'assests/2020/20005.xml',
    'assests/2020/20999.xml'
]
highest_nr = -1
highest_nr_file = ''
for f in list_of_files:
    re_result = re.findall(r'\d+\.xml$', f)
    if re_result:
        nr = int(re.findall(r'\d+', re_result[0])[0])
        if nr > highest_nr:
            highest_nr = nr
            highest_nr_file = f
print(highest_nr_file)
Result
assests/2020/20999.xml
You can also try this way.
import os, re
path = "assests/2020/"
files = [
    "assests/2020/2010.xml",
    "assests/2020/20005.xml",
    "assests/2020/20999.xml"
]
n = [int(re.findall(r'\d+\.xml$',file)[0].split('.')[0]) for file in files]
output = str(max(n))+".xml"
print("Biggest max file name of .xml file is ",os.path.join(path,output))
Output:
Biggest max file name of .xml file is assests/2020/20999.xml
import glob

xmlFiles = []
# this will collect the numeric part of each xml filename in your directory
for file in glob.glob("*.xml"):
    xmlFiles.append(int(file[:-4]))  # strip the '.xml' extension and compare numerically
# this will print the maximum one
print(f"{max(xmlFiles)}.xml")
I'm writing a module that will take in an array of strings among other command line arguments. The array would be something like:
['PUPSFF', 'PCASPE', 'PCASEN']
My module has a method that will search for files matching a possible format in a directory:
def search(self, fundCode, type):
    funds_string = '_'.join(fundCode)
    files = set(os.listdir(self.unmappedDir))
    file_match = 'citco_unmapped_{type}_{funds}_{start}_{end}.csv'.format(type=type, funds=funds_string, start=self.startDate, end=self.endDate)
    if file_match in files:
        filename = os.path.join(self.unmappedDir, file_match)
        return self.read_file(filename)
    else:
        Logger.error('No {type} file/s found for {funds}, between {start} and {end}'.format(type=type, funds=fundCode, start=self.startDate, end=self.endDate))
So if my directory has a file like this one:
citco_unmapped_positions_PUPSFF_PCASPE_PCASEN_2018-07-01_2018-07-11.csv
And I pass this array as the cmd line argument: ['PUPSFF', 'PCASPE', 'PCASEN']
After calling my method (and passing in the rest of the self arguments) like this:
positions = alerter.search(alerter.fundCodes, 'positions')
It will search, find that file, and do whatever it needs to do.
However, I want it to be independent of the order, so that it will still find the file if the command line arguments are written like this:
['PCASPE', 'PCASEN', 'PUPSFF'] or
['PCASEN', 'PUPSFF', 'PCASPE'] or whatever
Any ideas on how to go on about this?
Use the all function to check that each of the needed tags is in the file name. This example should get you going:
files = [
"citco_unmapped_positions_PUPSFF_PCASPE_PCASEN_2018-07-01_2018-07-11.csv", # yes
"citco_unmapped_positions_PUPSFF_NO_WAY_PCASEN_2018-07-01_2018-07-11.csv", # no
"citco_unmapped_positions_PCASEN_PCASEN_PUPSFF_2018-07-01_2018-07-11.csv", # no
"citco_unmapped_positions_PCASPE_PCASEN_PUPSFF_2018-07-01_2018-07-11.csv", # yes
]
tags = ['PUPSFF', 'PCASPE', 'PCASEN']
for fname in files:
    if all(tag in fname for tag in tags):
        # the file is a match
        print("Match", fname)
Output:
Match citco_unmapped_positions_PUPSFF_PCASPE_PCASEN_2018-07-01_2018-07-11.csv
Match citco_unmapped_positions_PCASPE_PCASEN_PUPSFF_2018-07-01_2018-07-11.csv
Found a possible solution with permutations from itertools:
from itertools import permutations

def search(self, fundCodes, type):
    perms = self.find_permutations(fundCodes)
    files = set(os.listdir(self.unmappedDir))
    for perm in perms:
        fund_codes = '_'.join(perm)
        file_match = 'citco_unmapped_{type}_{funds}_{start}_{end}.csv'.format(type=type, funds=fund_codes, start=self.startDate, end=self.endDate)
        if file_match in files:
            filename = os.path.join(self.unmappedDir, file_match)
            return self.read_file(filename)
    else:
        Logger.error('No {type} file/s found for {funds}, between {start} and {end}'.format(type=type, funds=fund_codes, start=self.startDate, end=self.endDate))

def find_permutations(self, codes):
    return list(permutations(codes))
Probably really slow though.
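If the permutations approach proves too slow (it generates n! candidate names for n fund codes), one alternative, not from the original answers, is to compare the fund codes as sets so order never matters. In this hypothetical sketch, find_match and its parameters are illustrative names:

```python
def find_match(filenames, fund_codes, type_, start, end):
    """Return the first filename whose fund codes equal fund_codes, in any order."""
    wanted = frozenset(fund_codes)
    prefix = f"citco_unmapped_{type_}_"
    suffix = f"_{start}_{end}.csv"
    for name in filenames:
        if name.startswith(prefix) and name.endswith(suffix):
            # the middle segment holds the underscore-joined fund codes
            middle = name[len(prefix):-len(suffix)]
            if frozenset(middle.split('_')) == wanted:
                return name
    return None

files = ["citco_unmapped_positions_PCASEN_PUPSFF_PCASPE_2018-07-01_2018-07-11.csv"]
match = find_match(files, ['PUPSFF', 'PCASPE', 'PCASEN'], 'positions', '2018-07-01', '2018-07-11')
print(match)  # citco_unmapped_positions_PCASEN_PUPSFF_PCASPE_2018-07-01_2018-07-11.csv
```

This does a single linear pass over the directory listing instead of one set-membership test per permutation.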
I need to translate chunks of matlab code into Python. My code seems to be 'unreachable' though. Any idea why this is happening?
Also: am I doing it right? I'm a real newbie.
Matlab code:
function Dir = getScriptDir()
    fullPath = mfilename('fullpath');
    [Dir, ~, ~] = fileparts(fullPath);
end

function [list, listSize] = getFileList(Dir)
    DirResult = dir(Dir);
    list = DirResult(~[DirResult.isdir]); % select files
    listSize = size(list);
end
My Python code:
def Dir = getScriptDir():
return os.path.dirname(os.path.realpath(__file__)
def getFileList(Dir):
list = os.listdir(Dir)
listSize = len(list)
getFileList() = [list, listSize]
Your syntax is incorrect. If I'm reading this correctly, you're trying to get the names of the files in the same directory as the script and print the number of files in that list.
Here's an example of how you might do this (based on the program you gave):
import os

def getFileList(directory=os.path.dirname(os.path.realpath(__file__))):
    list = os.listdir(directory)
    listSize = len(list)
    return [list, listSize]

print(getFileList())
Output example:
[['program.py', 'data', 'syntax.py'], 3]
Your function definitions were incorrect. I have modified the code you provided. You can also consolidate the getScriptDir() functionality into the getFileList() function.
import os

def getFileList():
    dir = os.path.dirname(os.path.realpath(__file__))
    list = os.listdir(dir)
    listSize = len(list)
    fileList = [list, listSize]
    return fileList

print(getFileList())
Returns: (in my environment)
[['test.py', 'test.txt', 'test2.py', 'test2.txt', 'test3.py', 'test4.py', 'testlog.txt', '__pycache__'], 8]
Your script functions, including getScriptDir (modified):
import os

def getScriptDir():
    return os.path.dirname(os.path.realpath(__file__))

def getFileList(dir):
    list = os.listdir(dir)
    listSize = len(list)
    fileList = [list, listSize]
    return fileList

dir = getScriptDir()
print(getFileList(dir))
Remember that you need to return variables from a python-function to get their results.
More on how to define your own functions in python: https://docs.python.org/3/tutorial/controlflow.html#defining-functions