I'm struggling to get GNU Parallel working. I have a shell script that calls a python program several thousand times with different input parameters:
python /path/to/program/run.py A_02_01 input.fasta > /path/to/output/out.txt
python /path/to/program/run.py A_02_02 input.fasta > /path/to/output/out.txt
I tried using gnu parallel like so:
cat iedb_classi_call.sh | parallel --recstart 'python' --recend '\n' --pipe bash
But all my output files are empty, and I can't figure out why. I'm not getting any errors from GNU Parallel.
Before I added the recstart and recend options, I was getting non-empty output files for some python calls, but other calls weren't executing at all and produced errors like:
run.py: error: incorrect number of arguments
bash: line 422: 01_ input.fasta: command not found
Usage: run.py allele fasta_file
This made me think parallel was splitting the input into chunks at the wrong boundaries, which is why I added the --recstart / --recend options.
I'm using GNU Parallel version 20180722.
(This should be a comment as it does not answer the actual question, but formatting code does not work very well).
If the lines are like:
python /path/to/program/run.py A_02_01 input.fasta > /path/to/output/out.txt
python /path/to/program/run.py A_02_02 input.fasta > /path/to/output/out.txt
Then you might be able to do:
cat file-with-A_names |
parallel --results /path/{}.out python /path/to/program/run.py {} input.fasta >/dev/null
The output will be stored in /path/A....out.
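If you do not already have file-with-A_names, one way to build it is to pull the allele names out of the existing script. This is only a sketch and assumes every line of iedb_classi_call.sh follows the pattern shown above, with the allele as the third whitespace-separated field:
# Extract the allele name (3rd field) from each 'python ... run.py ALLELE input.fasta > ...' line
awk '/run\.py/ {print $3}' iedb_classi_call.sh > file-with-A_names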
I have created a pseudorandom number generator and I am trying to pipe out the data similar to if someone ran
"$ cat /dev/urandom/"
Specifically, I am trying to pipe out the data to the dieharder RNG test suite.
Typically, reading from urandom to dieharder looks like
"$ cat /dev/urandom | dieharder -a -g 200"
My program is designed to generate numbers indefinitely and output them in main like this:
def main():
    ...  # setup and variables
    for _ in iter(int, 1):  # infinite loop
        PRNG_VAL = PRNG_FUNC(count_str, pad, 1)  # returns b'xx'
        PRNG_VAL = int(PRNG_VAL, 16)  # returns integer
        sys.stdout.write(chr(PRNG_VAL))  # integer to chr, similar to /dev/urandom type output
Obviously, when I run something like
"$ cat ./top.py | dieharder ..."
the result is just the contents of the file being read, not the output of running it.
How do I, instead of reading the contents of 'top.py', run the file and pipe its output into dieharder, similar to reading from /dev/urandom?
Any help is appreciated!
How do I, instead of reading the contents of 'top.py', run the file and pipe its output into dieharder…
In sh, and most other POSIX shells, you run it the same way you'd normally run it, and pipe that (the same way you're piping the output of cat):
./top.py | dieharder
… or:
python top.py | dieharder
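For the ./top.py form to work, the first line of top.py must be a shebang pointing at your interpreter (e.g. #!/usr/bin/env python), and the file needs execute permission. For example, reusing the dieharder options from your question:
chmod +x top.py
./top.py | dieharder -a -g 200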
The reason you use cat /dev/urandom is that urandom isn't a program, it's a file. Of course it's not a regular file full of bytes sitting on the disk; it's a special device file, backed by a device driver and created as a device node (with mknod, or automatically by udev on modern systems), but you don't have to worry about that (unless you want to write your own device drivers); it acts as if it were a regular file full of bytes. You can't easily do the same thing, but then you don't have to.
You should read a good tutorial on basic shell scripting. I don't have one to recommend, but I'm sure there are plenty of them.
I am trying to get some stats for a directory in HDFS: the number of files/subdirectories and the size of each. I started out thinking that I could do this in bash.
#!/bin/bash
OP=$(hadoop fs -ls hdfs://mydirectory)
echo $(wc -l < "$OP")
I only have this much so far, and I quickly realised that python might be a better option. However, I am not able to figure out how to execute hadoop commands like hadoop fs -ls from python.
Try the following snippet:
import subprocess

output = subprocess.Popen(["hadoop", "fs", "-ls", "/user"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for line in output.stdout:
    print(line)
Additionally, with the subprocess module you can get the return status, output and error message separately, as shown in the sketch below.
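A minimal sketch of that pattern, using Popen.communicate() (the /user path is just the same placeholder as above):
import subprocess

# communicate() drains stdout/stderr and waits for the process to exit
proc = subprocess.Popen(["hadoop", "fs", "-ls", "/user"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()
print("return code:", proc.returncode)
print("output:", out)
print("errors:", err)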
See https://docs.python.org/2/library/commands.html for your options, including how to get the return status (in case of an error). The basic code you're missing is
import commands
hdir_list = commands.getoutput('hadoop fs -ls hdfs://mydirectory')
Yes: deprecated in 2.6, still useful in 2.7, but removed from Python 3. If that bothers you, switch to
os.system(<command string>)
... or better yet use subprocess.call (introduced in 2.4).
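If you are on Python 2.7 or 3, a rough subprocess-based equivalent might be (hdfs://mydirectory is the same placeholder used above):
import subprocess

# check_output raises CalledProcessError on a non-zero exit status
hdir_list = subprocess.check_output(["hadoop", "fs", "-ls", "hdfs://mydirectory"]).decode()
print(hdir_list)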
I am working on extracting PDFs from SEC filings. They usually come like this:
SEC Filing Example
For whatever reason, when I save the raw PDF to a .text file and then try to run
uudecode -o output_file.pdf input_file.txt
from the python subprocess.call() function, or any other python function that allows commands to be executed from the command line, the PDF files that are generated are corrupted. If I run the same command directly from the command line, there is no corruption.
When taking a closer look at the PDF file being output from the python script, it looks like the file ends prematurely. Is there some sort of output limit when executing a command line command from python?
Thanks!
This script worked fine for me running under Python 3.4.1 on Fedora 21 x86_64 with uudecode 4.15.2:
import subprocess
subprocess.call("uudecode -o output_file.pdf input_file.txt", shell=True)
Using the linked SEC filing (length: 173,141 B; sha1: e4f7fa2cbb3422411c2f2968d954d6bb9808b884), the decoded PDF (length: 124,557 B; sha1: 1676320e1d9923e14d19451c16688198bc93ca0d) appears correct when viewed.
There may be something else in your environment causing the problem. You may want to add additional details to your question.
Is there some sort of output limit when executing a command line command from python?
If by "output limit" you mean the size of the file being written by uudecode, then no. The only type of "output limit" you need to worry about when using the subprocess module is when you pass stdout=PIPE or stderr=PIPE when creating a child process. If the child process writes enough data to either of these streams, and your script does not regularly drain them, the child process will block (see the subprocess module documentation). In my test, uudecode wrote nothing to stdout or stderr.
This might be a simple question, but I am new to bash scripting and have spent quite a bit of time on this with no luck; I hope I can get an answer here.
I am trying to write a bash script that reads individual lines from a text file and passes them along as argument for a python script. I have a list of files (which I have saved into a single text file, all on individual lines) that I need to be used as arguments in my python script, and I would like to use a bash script to send them all through. Of course I can take the tedious way and copy/paste the rest of the python command to individual lines in the script, but I would think there is a way to do this with the "read line" command. I have tried all sorts of combinations of commands, but here is the most recent one I have:
#!/bin/bash
# Command Output Test
cat infile.txt << EOF
while read line
do
VALUE = $line
python fits_edit_head.py $line $line NEW_PARA 5
echo VALUE+"huh"
done
EOF
When I do this, all I get returned is the individual lines from the input file. I have the extra VALUE there to see if it will print that, but it does not. Clearly there is something simple about the "read line" command that I do not understand but after messing with it for quite a long time, I do not know what it is. I admit I am still a rookie to this bash scripting game, and not a very good one at that. Any help would certainly be appreciated.
You probably meant:
while read line; do
VALUE=$line ## No spaces allowed
python fits_edit_head.py "$line" "$line" NEW_PARA 5 ## Quote properly to isolate arguments well
echo "$VALUE+huh" ## You don't expand without $
done < infile.txt
Python may also read stdin, so it could accidentally consume the lines from infile.txt that are meant for the while loop; to avoid that, use another file descriptor:
while read -u 4 line; do
...
done 4< infile.txt
Better yet, if you're using Bash 4.0 or newer, it's safer and cleaner to use readarray:
readarray -t lines < infile.txt
for line in "${lines[#]}; do
...
done
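Putting that together, a complete version might look like this (the python call is the one from the question; the echo just shows the value being processed):
#!/bin/bash
# Read all file names into an array, then loop over them
readarray -t lines < infile.txt
for line in "${lines[@]}"; do
    python fits_edit_head.py "$line" "$line" NEW_PARA 5
    echo "$line huh"
done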
Is there a one-liner approach to running the following python script in Linux bash, without saving any temporary file (except /dev/std*)?
My python script test.py takes a filename as an argument, but also reads sys.stdin as streaming input.
#!/usr/bin/python
# test.py
import sys

fn = sys.argv[1]
checkofflist = []
with open(fn, 'r') as f:
    for line in f.readlines():
        checkofflist.append(line)

for line in sys.stdin:
    if line in checkofflist:
        pass  # do something to line
I would like to do something like
hadoop fs -cat inputfile.txt > /dev/stdout | cat streamingfile.txt | python test.py /dev/stdin
But of course this doesn't work, since the middle cat corrupts the /dev/stdin content intended for the script. Being able to do this would be nice, since then I wouldn't need to save HDFS files locally every time I need to work with them.
I think what you're looking for is:
python test.py <( hadoop fs -cat inputfile.txt ) <streamingfile.txt
In bash, <( ... ) is Process Substitution. The command inside the parentheses is run with its output connected to a fifo or equivalent, and the name of the fifo (or /dev/fd/n if bash is able to use an unnamed pipe) is substituted as an argument. The tool sees a filename, which it can just open and use normally. (>(...) is also available, with input connected to a fifo, in case you want a named streaming output.)
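A quick way to see what the shell actually substitutes (the exact fd number will vary):
echo <(printf 'a\nb\nc\n')    # prints something like /dev/fd/63
cat  <(printf 'a\nb\nc\n')    # opens that fd and prints a, b, c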
Without relying on bash process substitution, you might also try
hadoop fs -cat inputfile.txt | python test.py streamingfile.txt
This provides streamingfile.txt as a command-line argument for test.py to use as a file name to open, as well as providing the contents of inputfile.txt on standard input.