Passing multiple text files as arguments to a script using a pattern - python

First of all, I'd like to state that this is a debugging question for an exercise, but I can't get any help from the lecturer, and as much as I've read up on arguments I can't seem to figure it out, so here I am. I have a Python script that compares .txt files passed as arguments. Currently it is called as follows:
python compare.py -s stop_list.txt NEWS/news01.txt NEWS/news02.txt
and the files are parsed into a list of names using
import sys, re, getopt, glob

opts, args = getopt.getopt(sys.argv[1:], 'hs:bI:')
opts = dict(opts)

filenames = args
if '-I' in opts:
    filenames = glob.glob(opts['-I'])

print('INPUT-FILES:', ' '.join(filenames))
print(filenames)
I can pass more than two files by simply listing them together
python compare.py -s stop_list.txt NEWS/news01.txt NEWS/news02.txt NEWS/news03.txt NEWS/news04.txt
but this can quickly become impractical.
Now it is suggested that more files can be passed using a pattern
python compare.py -s stop_list.txt -I ’NEWS/news??.txt’
i.e.:
python compare.py -s stop_list.txt -I ’NEWS/news0[123].txt’
However, it seems to behave a bit weirdly. First of all, if I write:
python compare.py -s stop_list.txt -I NEWS/news01.txt NEWS/news02.txt
only news01.txt will be passed to the script.
Secondly, when using the pattern as suggested there is no input whatsoever. I can't really tell whether the code for parsing the input files is wrong and needs altering, or whether I'm doing something wrong.
The -h states:
USE: python <PROGNAME> (options) file1...fileN
OPTIONS:
    -h      : print this help message
    -b      : use BINARY weights (default: count weighting)
    -s FILE : use stoplist file FILE
    -I PATT : identify input files using pattern PATT
              (otherwise uses files listed on command line)
Thanks in advance :)

Check the quotes. They look like typographic (curly) quotes, which the shell and glob treat as ordinary characters rather than quoting. Try plain ' or " instead.
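For instance, once the pattern is wrapped in plain ASCII quotes it reaches the script unexpanded and glob.glob() does the matching itself; a quick sanity check of the pattern alone (assuming it is run from the same directory you invoke compare.py from) might be:

import glob

# With plain quotes the shell passes the pattern through literally,
# so glob.glob() sees 'NEWS/news0[123].txt' and expands it in Python.
print(glob.glob('NEWS/news0[123].txt'))
# e.g. ['NEWS/news01.txt', 'NEWS/news02.txt', 'NEWS/news03.txt']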

Related

MD task with `snakemake`

I want to create a very simple pipeline for molecular dynamics simulation. The program (Amber) just wants 3 files as input, and produces a lot of files, some of which I will need in the future. So my pipeline is extremely simple:
Check that *.in, *.prmtop and *.rst are in the folder (I guarantee there is only one file for each of these extensions) and warn if these files are not present
Run shell command (based on name of all input files)
Check that *.out, mden, mdinfo and *.nc were produced
That's all. It's the standard approach for the program I deal with. One folder, one task, short and simple file names based on file purpose, not on its content.
I wrote a simple pipeline:
rule all:
    input: '{inp}.out'

rule amber:
    input:
        '{inp}.in',
        '{top}.prmtop',
        '{coord}.rst'
    output:
        '{inp}.out',
        'mden',
        'mdinfo',
        '{inp}.nc'
    shell:
        'pmemd.cuda'
        ' -O'
        ' -i {inp}.in'
        ' -o {inp}.out'
        ' -p {top}.prmtop'
        ' -c {coord}.rst'
        ' -r {inp}.rst'
        ' -x {inp}.nc'
        ' -ref {coord}.rst'
And it doesn't work.
All inputs in the all rule must be explicit. (Why? Why can't it be a regex or a wildcard expression? If I see *.out in the folder and the status code of the shell script was 0, that's it, the work is done.)
I must use all wildcards from input in output, but I want to use some only in shell or in other rules.
I must not expect to get files like mden with potentially "non-unique" names because they could be changed by another task, but I know there will be only one task, and that's simply how my MD program works (yeah, I know about Amber's -e and -inf flags, but that's over-complicating a simple task).
So, I would like to decide whether it is worth using snakemake for this or not. It's a very simple task, but I have already spent several hours on it; I see a lot of documentation and a lot of examples that I can't apply to my case. snakemake looks like exactly what I need, but I can't express this simple task in general terms with the framework. I don't want to specify explicit filenames, because I'd lose flexibility: I want to run hundreds of simple tasks automatically, where only the input files differ. I'm sure I just haven't figured out how to handle this framework yet. Maybe you can show me how I should? Thank you!
Hopefully this will put you in the right direction.
If I understand correctly, the input to snakemake is a folder containing the input files to amber. You know that this folder contains one .in file, one .prmtop file, and one .rst file but you don't know the full names of these files.
If you want snakemake to run on a single input folder, then you don't need wildcards at all and the script below should do.
import glob
import os

input_folder = config['amber_folder']

# We don't know the name of the input file. We only know it ends in '.in'
inp = glob.glob(os.path.join(input_folder, '*.in'))
assert len(inp) == 1
inp = inp[0]

name = os.path.splitext(os.path.basename(inp))[0]
output_folder = name + '_results'
out = os.path.join(output_folder, name + '.out')

rule all:
    input:
        out

rule amber:
    input:
        inp= inp,
        top= glob.glob(os.path.join(input_folder, '*.prmtop')),
        rst= glob.glob(os.path.join(input_folder, '*.rst')),
    output:
        out= out,
        nc= os.path.join(output_folder, name + '.nc'),
        mden= os.path.join(output_folder, 'mden'),
        mdinfo= os.path.join(output_folder, 'mdinfo'),
    shell:
        r"""
        pmemd.cuda \
            -O \
            -i {input.inp} \
            -o {output.out} \
            -p {input.top} \
            -c {input.rst} \
            -r {input.rst} \
            -x {output.nc} \
            -ref {input.rst}
        """
Execute with:
snakemake -j 1 -C amber_folder='your-input-folder'
If you have many input folders you could write a for-loop to execute the command above, but it is probably better to pass the list of inputs to snakemake and let it handle them.
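If you do go the for-loop route, a minimal driver sketch mirroring the command above (the folder names here are hypothetical) might look like this:

import subprocess

# Hypothetical list of input folders, each holding one .in/.prmtop/.rst set.
amber_folders = ['run01', 'run02', 'run03']

for folder in amber_folders:
    # One snakemake invocation per folder, as in the command above.
    subprocess.run(['snakemake', '-j', '1', '-C', 'amber_folder=' + folder], check=True)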

I am trying to print the last line of every file in a directory using shell command from python script

I am storing the number of files in a directory in a variable and storing their names in an array. I'm unable to store file names in the array.
Here is the piece of code I have written.
import os
temp = os.system('ls -l /home/demo/ | wc -l')
no_of_files = temp - 1
command = "ls -l /home/demo/ | awk 'NR>1 {print $9}'"
file_list=[os.system(command)]
for i in range(len(file_list)):
    os.system('tail -1 file_list[i]')
Your shell scripting is orders of magnitude too complex.
output = subprocess.check_output('tail -qn1 *', shell=True)
or if you really prefer,
os.system('tail -qn1 *')
which however does not capture the output in a Python variable.
If you have a recent-enough Python, you'll want to use subprocess.run() instead. You can also easily let Python do the enumeration of the files to avoid the pesky shell=True:
output = subprocess.check_output(['tail', '-qn1'] + os.listdir('.'))
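For reference, a roughly equivalent subprocess.run() form (a sketch; capture_output and text need Python 3.7+) might be:

import os
import subprocess

# Same idea as above, but via subprocess.run(), capturing the output as text.
result = subprocess.run(['tail', '-qn1'] + os.listdir('.'),
                        capture_output=True, text=True)
print(result.stdout)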
As noted above, if you genuinely just want the output to be printed to the screen and not be available to Python, you can of course use os.system() instead, though subprocess is recommended even in the os.system() documentation because it is much more versatile and more efficient to boot (if used correctly). If you really insist on running one tail process per file (perhaps because your tail doesn't support the -q option?) you can do that too, of course:
for filename in os.listdir('.'):
    os.system("tail -n 1 '%s'" % filename)
This will still work incorrectly if you have a file name which contains a single quote. There are workarounds, but avoiding a shell is vastly preferred (so back to subprocess without shell=True and the problem of correctly coping with escaping shell metacharacters disappears because there is no shell to escape metacharacters from).
for filename in os.listdir('.'):
    print(subprocess.check_output(['tail', '-n1', filename]))
Finally, tail doesn't particularly do anything which cannot easily be done by Python itself.
for filename in os.listdir('.'):
    with open(filename, 'r') as handle:
        for line in handle:
            pass
        # print the last one only
        print(line.rstrip('\r\n'))
If you have knowledge of the expected line lengths and the files are big, maybe seek to somewhere near the end of the file, though obviously you need to know how far from the end to seek in order to be able to read all of the last line in each of the files.
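A rough sketch of that seek-based approach (the 1024-byte window is an arbitrary guess and must be at least as long as the longest expected line) could be:

import os

def last_line(filename, window=1024):
    # Read only the final `window` bytes and take the last line from them.
    with open(filename, 'rb') as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(0, size - window))
        lines = handle.read().splitlines()
        return lines[-1].decode('utf-8', errors='replace') if lines else ''

for filename in os.listdir('.'):
    if os.path.isfile(filename):
        print(last_line(filename))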
os.system returns the exit code of the command and not the output. Try using subprocess.check_output with shell=True.
Example:
>>> import subprocess
>>> a = subprocess.check_output("ls -l /home/demo/ | awk 'NR>1 {print $9}'", shell=True)
>>> a.decode("utf-8").split("\n")
Edit (as suggested by @tripleee): you probably don't want to do this, as it will get messy. Python has great functions for things like this. For example:
>>> import glob
>>> names = glob.glob("/home/demo/*")
will directly give you a list of files and folders inside that folder. Once you have this, you can just do len(names) to get the count your first command was computing.
Another option is:
>>> import os
>>> os.listdir("/home/demo")
Here, glob will give you the whole filepath /home/demo/file.txt and os.listdir will just give you the filename file.txt
The ls -l /home/demo/ | wc -l command also doesn't give the correct value, as ls -l prints a "total X" summary line at the top before listing the files.
You could likely use a loop without much issue:
import os

files = [f for f in os.listdir('.') if os.path.isfile(f)]

for f in files:
    with open(f, 'rb') as fh:
        last = fh.readlines()[-1].decode()
        print('file: {0}\n{1}\n'.format(f, last))
Output:
file: file.txt
Hello, World!
...
If your files are large then readlines() probably isn't the best option. Maybe go with tail instead:
import subprocess

for f in files:
    print('file: {0}'.format(f))
    subprocess.check_call(['tail', '-n', '1', f])
    print('\n')
The decode is optional; for text, "utf-8" usually works, and if it's a mix of binary, text, etc., then something such as "iso-8859-1" should usually work.
You are not able to store the file names because os.system does not return the output as you expect it to; it returns the exit status. For more information, see the documentation quoted below.
From the docs
On Unix, the return value is the exit status of the process encoded in the format specified for wait(). Note that POSIX does not specify the meaning of the return value of the C system() function, so the return value of the Python function is system-dependent.
On Windows, the return value is that returned by the system shell after running command, given by the Windows environment variable COMSPEC: on command.com systems (Windows 95, 98 and ME) this is always 0; on cmd.exe systems (Windows NT, 2000 and XP) this is the exit status of the command run; on systems using a non-native shell, consult your shell documentation.
os.system executes Linux shell commands as they are. To get the output of these shell commands you have to use the Python subprocess module.
Note: in your case you can get the file names using either the glob module or os.listdir(); see How to list all files of a directory.
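Putting those two points together, a small sketch (using the directory path from the question) might be:

import os
import subprocess

directory = '/home/demo/'

# os.listdir() replaces the `ls | awk` step; subprocess captures tail's output.
file_list = [f for f in os.listdir(directory)
             if os.path.isfile(os.path.join(directory, f))]
no_of_files = len(file_list)

for name in file_list:
    last = subprocess.check_output(['tail', '-1', os.path.join(directory, name)])
    print(last.decode('utf-8'), end='')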

for fi in sys.argv[1:]: argument list too long

I am trying to execute a python script on all text files in a folder:
for fi in sys.argv[1:]:
And I get the following error
-bash: /usr/bin/python: Argument list too long
The way I call this Python function is the following:
python functionName.py *.txt
The folder has around 9000 files. Is there some way to run this function without having to split my data into more folders, etc.? Splitting the files would not be very practical because I will have to execute the function on even more files in the future... Thanks
EDIT: Based on the selected correct reply and the comments of the replier (Charles Duffy), what worked for me is the following:
printf '%s\0' *.txt | xargs -0 python ./functionName.py
because I don't have a valid shebang.
This is an OS-level problem (limit on command line length), and is conventionally solved with an OS-level (or, at least, outside-your-Python-process) solution:
find . -maxdepth 1 -type f -name '*.txt' -exec ./your-python-program '{}' +
...or...
printf '%s\0' *.txt | xargs -0 ./your-python-program
Note that this runs your-python-program once per batch of files found, where the batch size is dependent on the number of names that can fit in ARG_MAX; see the excellent answer by Marcus Müller if this is unsuitable.
No. That is a kernel limitation for the length (in bytes) of a command line.
Typically, you can determine that limit by doing
getconf ARG_MAX
which, at least for me, yields 2097152 (bytes), which means about 2MB.
I recommend using Python to work through a folder yourself, i.e. giving your Python program the ability to work with directories instead of individual files, or to read file names from a file.
The former can easily be done using os.walk(...), whereas the second option is (in my opinion) the more flexible one. Use the argparse module to give your Python program an easy-to-use command line syntax, then add an argument of a file type (see the reference documentation), and Python will automatically be able to understand special filenames like -, meaning that instead of
for fi in sys.argv[1:]:
do
for fi in opts.file_to_read_filenames_from.read().split(chr(0)):
which would even allow you to do something like
find -iname '*.txt' -type f -print0|my_python_program.py -file-to-read-filenames-from -
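Spelled out, a minimal argparse sketch of that idea (the option name is taken from the snippet above; the per-file work is just a placeholder) might be:

import argparse

parser = argparse.ArgumentParser()
# argparse.FileType understands '-' as "read from standard input".
parser.add_argument('--file-to-read-filenames-from',
                    type=argparse.FileType('r'), required=True)
opts = parser.parse_args()

# The filenames are NUL-separated, as produced by find ... -print0.
for fi in opts.file_to_read_filenames_from.read().split(chr(0)):
    if fi:
        print('would process', fi)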
Don't do it this way. Pass the mask to your Python script (e.g. call it as python functionName.py "*.txt") and expand it using glob (https://docs.python.org/2/library/glob.html).
I would think about using the glob module. With this module you invoke your program like:
python functionName.py "*.txt"
then the shell will not expand *.txt into file names. Your Python program will receive *.txt in its argument list, and you can pass it to glob.glob():
for fi in glob.glob(sys.argv[1]):
...
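A complete minimal version of that approach (the per-file work is just a placeholder) could be:

import sys
import glob

# The shell passes "*.txt" through literally because of the quotes;
# glob.glob() then expands it inside Python, so ARG_MAX never comes into play.
for fi in glob.glob(sys.argv[1]):
    with open(fi) as handle:
        line_count = sum(1 for _ in handle)   # placeholder for the real per-file work
    print(fi, line_count)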

Find (bash command) doesn't work with subprocess?

I have renamed a CSS class name in a number of (Python/Django) templates. The CSS, however, is spread across multiple files in multiple directories. I have a Python snippet that starts from the root directory and then recursively renames the class in all the CSS files.
from os import walk, curdir
import subprocess

COMMAND = "find %s -iname *.css | xargs sed -i s/[Ff][Oo][Oo]/bar/g"
test_command = 'echo "This is just a test. DIR: %s"'

def renamer(command):
    print command  # Please ignore the print commands.
    proccess = subprocess.Popen(command.split(), stdout=subprocess.PIPE)
    op = proccess.communicate()[0]
    print op

for root, dirs, files in walk(curdir):
    if root:
        command = COMMAND % root
        renamer(command)
It doesn't work, gives:
find ./cms/djangoapps/contentstore/management/commands/tests -iname *.css | xargs sed -i s/[Ee][Dd][Xx]/gurukul/g
find: paths must precede expression: |
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]
find ./cms/djangoapps/contentstore/views -iname *.css | xargs sed -i s/[Ee][Dd][Xx]/gurukul/g
find: paths must precede expression: |
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]
When I copy and run the same command (printed above), find doesn't error out and sed either gets no input files or it works.
What is wrong with the python snippet?
You're not trying to run a single command, but a shell pipeline of multiple commands, and you're trying to do it without invoking the shell. That can't possibly work. The way you're doing this, | is just one of the arguments to find, which is why find is telling you that it doesn't understand that argument with that "paths must precede expression: |" error.
You can fix that by adding shell=True to your Popen.
But a better solution is to do the pipeline in Python and keep the shell out of it. See Replacing Older Functions with the subprocess Module in the docs for an explanation, but I'll show an example.
Meanwhile, you should never use split to split a command line. The best solution is to write the list of separate arguments instead of joining them up into a string just to split them out. If you must do that, use the shlex module; that's what it's for. But in your case, even that won't help you, because you're inserting random strings verbatim, which could easily have spaces or quotes in them, and there's no way anything—shlex or otherwise—can reconstruct the data in the first place.
So:
from subprocess import Popen, PIPE

pfind = Popen(['find', root, '-iname', '*.css'], stdout=PIPE)
pxargs = Popen(['xargs', 'sed', '-i', 's/[Ff][Oo][Oo]/bar/g'],
               stdin=pfind.stdout, stdout=PIPE)
pfind.stdout.close()
output = pxargs.communicate()[0]
But there's an even better solution here.
Python has os.walk to do the same thing as find (you could simulate xargs easily, but there's really no need to), and it has its own re module to use instead of sed. So, why not use them?
Or, conversely, bash is much better at driving and connecting up simple commands than Python, so if you'd rather use find and sed instead of os.walk and re.sub, why write the driving script in Python in the first place?
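If you take the pure-Python route, a rough os.walk()/re.sub() sketch of the same rename might be:

import os
import re

# Stands in for the sed expression s/[Ff][Oo][Oo]/bar/g.
pattern = re.compile('foo', re.IGNORECASE)

for root, dirs, files in os.walk(os.curdir):
    for name in files:
        if not name.lower().endswith('.css'):
            continue
        path = os.path.join(root, name)
        with open(path) as handle:
            text = handle.read()
        new_text = pattern.sub('bar', text)
        if new_text != text:
            # Only rewrite files that actually changed.
            with open(path, 'w') as handle:
                handle.write(new_text)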
The problem is the pipe. To use a pipe with the subprocess module, you have to pass shell=True.
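If you do keep the pipeline and hand it to the shell, a minimal (hedged) example of that could be:

import subprocess

# shell=True hands the whole string, pipe included, to /bin/sh.
subprocess.call("find . -iname '*.css' | xargs sed -i 's/[Ff][Oo][Oo]/bar/g'", shell=True)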

Get a file from CLI input

How do you get a file name from the command line when you run Python code? Say your code opens a file and reads its lines, but the file varies whenever you run it; how do you say:
python code.py input.txt
so the code analyzes "input.txt"? What would you have to do in the actual Python code? I know, this is a pretty vague question, but I don't really know how to explain it any better.
A great option is the fileinput module, which will grab any or all filenames from the command line, and give the specified files' contents to your script as though they were one big file.
import fileinput

for line in fileinput.input():
    process(line)
More information here.
import sys
filename = sys.argv[-1]
This will get the last argument on the command line. If no arguments are passed, it will be the script name itself, as sys.argv[0] is the name of the running program.
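A slightly safer variant (a sketch) that checks an argument was actually given might be:

import sys

# Bail out with a usage message if no file name was given.
if len(sys.argv) < 2:
    sys.exit('usage: python code.py input.txt')

filename = sys.argv[1]
with open(filename) as handle:
    for line in handle:
        print(line.rstrip('\n'))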
Using argparse is quite intuitive:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--file", "-f", type=str, required=True)
args = parser.parse_args()
Now the name of the file is located in:
args.file
You just have to run the program a little differently:
python code.py -f input.txt
Command line parameters are available as a list via the sys module's argv list. The first element in the list is the name of the program (sys.argv[0]). The remaining elements are the command line parameters.
See also the getopt, optparse, and argparse modules for more complex command line parsing.
If you're using Linux or Windows PowerShell, you can pipe (" | ") the output of cat on the input.txt file into your script. Supposing you have the input.txt file and your code.py file in the same directory, you could use:
cat input.txt | python code.py
This will provide Python with input on STDIN. For example, if you're trying to get names from the input.txt file:
input.txt has
john,matthew,peter,albert
code.py has
print(" is not ".join(input().rstrip().split(',')))
would give
john is not matthew is not peter is not albert
I also like argparse: it's clean, simple, fairly standard, gives free error handling, and adds a [-h] option to help the user.
Here is a version that does not need named parameters, which may be annoying for a very simple script:
#!/usr/bin/python3
import argparse
arg_parser = argparse.ArgumentParser( description = "Copy source_file as target_file." )
arg_parser.add_argument( "source_file" )
arg_parser.add_argument( "target_file" )
arguments = arg_parser.parse_args()
source = arguments.source_file
target = arguments.target_file
print( "Copying [{}] to [{}]".format(source, target) )
Example of how it handles errors and help for you:
>my_test.py
usage: my_test.py [-h] source_file target_file
my_test.py: error: the following arguments are required: source_file, target_file
>my_test.py my_source.cpp
usage: my_test.py [-h] source_file target_file
my_test.py: error: the following arguments are required: target_file
>my_test.py -h
usage: my_test.py [-h] source_file target_file
Copy source_file as target_file.
positional arguments:
source_file
target_file
optional arguments:
-h, --help show this help message and exit
>my_test.py my_source.cpp my_target.cpp
Copying [my_source.cpp] to [my_target.cpp]
In addition to what is mentioned in the already existing answers, there is another alternative relying on the Command Line Interface Creation Kit (Click). Its latest stable version at the time I posted this answer is version 6. The official documentation has examples of how to deal with files and pass them as command line arguments.
Just use the built-in raw_input function (Python 2) to ask for the input file name as a string:
inFile = ""
inFile = raw_input("Enter the input File Name: ")
Now you can open the file for reading using with open(inFile, 'r').
