Is there a one-liner approach to running the following Python script in Linux bash, without saving any temporary file (except /dev/std*)?
My Python script test.py takes a filename as an argument, but also reads sys.stdin as a streaming input.
#!/usr/bin/python
# test.py
import sys

fn = sys.argv[1]
checkofflist = []
with open(fn, 'r') as f:
    for line in f.readlines():
        checkofflist.append(line)
for line in sys.stdin:
    if line in checkofflist:
        pass  # do something to line
I would like to do something like
hadoop fs -cat inputfile.txt > /dev/stdout | cat streamingfile.txt | python test.py /dev/stdin
But of course this doesn't work, since the middle cat corrupts the intended /dev/stdin content. Being able to do this would be nice, since then I wouldn't need to save HDFS files locally every time I need to work with them.
I think what you're looking for is:
python test.py <( hadoop fs -cat inputfile.txt ) <streamingfile.txt
In bash, <( ... ) is process substitution. The command inside the parentheses is run with its output connected to a FIFO or equivalent, and the name of the FIFO (or /dev/fd/n, if bash is able to use an unnamed pipe) is substituted as an argument. The tool sees a filename, which it can simply open and use normally. (>( ... ) is also available, with its input connected to a FIFO, in case you want a named streaming output.)
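As a quick illustration of what the tool actually receives (the exact /dev/fd number varies from run to run and system to system):
$ echo <(true)
/dev/fd/63
$ cat <(printf 'a\nb\n')
a
b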
Without relying on bash process substitution, you might also try
hadoop fs -cat inputfile.txt | python test.py streamingfile.txt
This provides streamingfile.txt as a command-line argument for test.py to use as a file name to open, as well as providing the contents of inputfile.txt on standard input.
Related
In a Python project, I need the output of an external (non-Python) command.
Let's call it identify_stuff.*
Command-line scenario
When called from command-line, this command requires a file name as an argument.
If its input is generated dynamically, we cannot pipe it into the command; neither of these works:
cat input/* | ./identify_stuff > output.txt
cat input/* | ./identify_stuff - > output.txt
... it strictly requires a file name it can open, so one needs to create a temporary file on disk for the output of the first command, from which the second command can read the data.
However, the identify_stuff program iterates over the input lines only once; no seeking or re-reading is involved.
So in Bash we can avoid the temporary file with the <(...) construct.
This works:
./identify_stuff <(cat input/*) > output.txt
This connects the output of the first command to a special path of the form /dev/fd/N, which can be opened as a stream just like the path of a regular file on disk.
Actual scenario: call from within Python
Now, instead of just cat input/*, the input text is created inside a Python program, which continues to run after the output of identify_stuff has been captured.
The natural choice for calling an external command is the standard library's subprocess.run().
For performance reasons, I would like to avoid creating a file on-disk.
Is there any way to do this with the subprocess tools?
The stdin and input parameters of subprocess.run won't work, because the external command ignores STDIN and specifically requires a file-name argument.
*Actually, it's this tool: https://github.com/jakelever/Ab3P/blob/master/identify_abbr.C
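For what it's worth, one way to replicate bash's <( ... ) from Python is to create an OS-level pipe and hand the child its read end by its /dev/fd path. This is a sketch, assuming a Linux-style /dev/fd filesystem; run_identify and its wiring are hypothetical and not tested against identify_stuff:
import os
import subprocess
import threading

def run_identify(lines):
    # Create a pipe; the child opens the read end via /dev/fd,
    # mirroring what bash's <(...) does.
    read_fd, write_fd = os.pipe()
    proc = subprocess.Popen(
        ['./identify_stuff', '/dev/fd/%d' % read_fd],
        stdout=subprocess.PIPE,
        pass_fds=(read_fd,),  # keep the read end open in the child
    )
    os.close(read_fd)  # the parent only writes
    # Feed the input from a thread so a large output cannot deadlock us.
    def feed():
        with os.fdopen(write_fd, 'w') as writer:
            for line in lines:
                writer.write(line)
    feeder = threading.Thread(target=feed)
    feeder.start()
    output, _ = proc.communicate()
    feeder.join()
    return output
run_identify(iterable_of_lines) returns the tool's stdout as bytes, with no file ever touching the disk. On platforms without /dev/fd, a named FIFO created with os.mkfifo in a temporary directory would be the fallback.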
I am storing the number of files in a directory in a variable, and I want to store their names in an array, but I'm unable to store the file names in the array.
Here is the piece of code I have written.
import os

temp = os.system('ls -l /home/demo/ | wc -l')
no_of_files = temp - 1
command = "ls -l /home/demo/ | awk 'NR>1 {print $9}'"
file_list = [os.system(command)]
for i in range(len(file_list)):
    os.system('tail -1 file_list[i]')
Your shell scripting is orders of magnitude too complex.
import subprocess
output = subprocess.check_output('tail -qn1 *', shell=True)
or if you really prefer,
os.system('tail -qn1 *')
which however does not capture the output in a Python variable.
If you have a recent-enough Python, you'll want to use subprocess.run() instead. You can also easily let Python do the enumeration of the files to avoid the pesky shell=True:
output = subprocess.check_output(['tail', '-qn1'] + os.listdir('.'))
As noted above, if you genuinely just want the output to be printed to the screen and not be available to Python, you can of course use os.system() instead, though subprocess is recommended even in the os.system() documentation because it is much more versatile and more efficient to boot (if used correctly). If you really insist on running one tail process per file (perhaps because your tail doesn't support the -q option?) you can do that too, of course:
for filename in os.listdir('.'):
    os.system("tail -n 1 '%s'" % filename)
This will still misbehave if you have a file name which contains a single quote. There are workarounds, but avoiding a shell is vastly preferred: go back to subprocess without shell=True, and the problem of correctly escaping shell metacharacters disappears, because there is no shell to escape them for.
for filename in os.listdir('.'):
    print(subprocess.check_output(['tail', '-n1', filename]))
Finally, tail doesn't particularly do anything which cannot easily be done by Python itself.
for filename in os.listdir('.'):
    with open(filename, 'r') as handle:
        for line in handle:
            pass
        # print the last one only
        print(line.rstrip('\r\n'))
If you know the expected line lengths and the files are big, maybe seek to somewhere near the end of the file, though obviously you need to know how far from the end to seek in order to be able to read all of the last line in each file; see the sketch below.
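A sketch of that approach (CHUNK is an assumed upper bound on line length, not something from the original answer):
import os

CHUNK = 4096  # assumed maximum line length; adjust for your data

def last_line(filename):
    with open(filename, 'rb') as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(0, size - CHUNK))
        data = handle.read()
    # Drop a trailing newline, then keep everything after the last one.
    return data.rstrip(b'\r\n').rsplit(b'\n', 1)[-1].decode()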
os.system returns the exit code of the command, not its output. Try using subprocess.check_output with shell=True.
Example:
>>> import subprocess
>>> a = subprocess.check_output("ls -l /home/demo/ | awk 'NR>1 {print $9}'", shell=True)
>>> a.decode("utf-8").split("\n")
Edit (as suggested by @tripleee): you probably don't want to do this, as it will get unwieldy. Python has great functions for things like this. For example:
>>> import glob
>>> names = glob.glob("/home/demo/*")
will directly give you a list of the files and folders inside that folder. Once you have this, len(names) gives you what the first command computed.
Another option is:
>>> import os
>>> os.listdir("/home/demo")
Here, glob will give you the whole file path, /home/demo/file.txt, whereas os.listdir will just give you the file name, file.txt.
The ls -l /home/demo/ | wc -l command also does not give the correct value, because ls -l prints a summary line ("total X", the total block count) above the file entries, so the line count is one higher than the number of files.
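For example, in a directory containing two files (an illustrative listing, not real output):
$ ls -l /home/demo/
total 8
-rw-r--r-- 1 demo demo 12 Jan  1 00:00 a.txt
-rw-r--r-- 1 demo demo 34 Jan  1 00:00 b.txt
$ ls -l /home/demo/ | wc -l
3
wc -l counts three lines, but only two of them are files.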
You could likely use a loop without much issue:
import os

files = [f for f in os.listdir('.') if os.path.isfile(f)]
for f in files:
    with open(f, 'rb') as fh:
        last = fh.readlines()[-1].decode()
    # the with block closes fh automatically; no explicit close needed
    print('file: {0}\n{1}\n'.format(f, last))
Output:
file: file.txt
Hello, World!
...
If your files are large then readlines() probably isn't the best option. Maybe go with tail instead:
import subprocess

for f in files:
    print('file: {0}'.format(f))
    subprocess.check_call(['tail', '-n', '1', f])
    print('\n')
The decode is optional; for text, "utf-8" usually works, while for a mix of binary and text, an encoding such as "iso-8859-1" should usually work.
You are not able to store the file names because os.system does not return the output you expect it to; it returns the command's exit status. For more information, see the os.system documentation.
From the docs
On Unix, the return value is the exit status of the process encoded in the format specified for wait(). Note that POSIX does not specify the meaning of the return value of the C system() function, so the return value of the Python function is system-dependent.
On Windows, the return value is that returned by the system shell after running command, given by the Windows environment variable COMSPEC: on command.com systems (Windows 95, 98 and ME) this is always 0; on cmd.exe systems (Windows NT, 2000 and XP) this is the exit status of the command run; on systems using a non-native shell, consult your shell documentation.
os.system executes shell commands as-is. To capture the output of these shell commands you have to use Python's subprocess module.
Note: in your case you can get the file names using either the glob module or os.listdir(); see How to list all files of a directory.
I have created a pseudorandom number generator and I am trying to pipe out its data, similar to what happens if someone runs
$ cat /dev/urandom
Specifically, I am trying to pipe out the data to the dieharder RNG test suite.
Typically, reading from urandom into dieharder looks like
$ cat /dev/urandom | dieharder -a -g 200
My program is designed to generate numbers indefinitely, and it outputs them in main as:
def main():
    ...  # setup and variables
    for _ in iter(int, 1):  # infinite loop
        PRNG_VAL = PRNG_FUNC(count_str, pad, 1)  # returns b'xx'
        PRNG_VAL = int(PRNG_VAL, 16)  # returns integer
        sys.stdout.write(chr(PRNG_VAL))  # integer to chr, similar to /dev/urandom output
Obviously, when I run something like
$ cat ./top.py | dieharder ...
the resulting output is just the contents of the file.
How do I, instead of reading the contents of top.py, run the file and pipe its output into dieharder, similar to reading from /dev/urandom?
Any help is appreciated!
How do I, instead of reading the contents of top.py, run the file and pipe its output into dieharder…
In sh, and most other POSIX shells, you run it the same way you'd normally run it, and pipe that (the same way you're piping the output of cat):
./top.py | dieharder
… or:
python top.py | dieharder
The reason you use cat /dev/urandom is that urandom isn't a program, it's a file. Of course it's not a regular file full of bytes sitting on the disk; it's a special device file, created with mknod and backed by a device driver, but you don't have to worry about that (unless you want to write your own device drivers); it acts as if it were a regular file full of bytes. You can't easily create such a device file yourself, but then you don't have to.
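You can see that it is a device rather than a program in a directory listing (the leading c marks a character device; the timestamp here is illustrative):
$ ls -l /dev/urandom
crw-rw-rw- 1 root root 1, 9 Jan  1 00:00 /dev/urandom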
You should read a good tutorial on basic shell scripting. I don't have one to recommend, but I'm sure there are plenty of them.
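One caveat worth adding (an observation beyond the original answer, assuming Python 3): sys.stdout.write(chr(PRNG_VAL)) emits encoded text, not raw bytes, so values of 128 and above become multi-byte UTF-8 sequences, which would distort the byte stream dieharder sees. Writing to the underlying binary buffer avoids that:
import sys

PRNG_VAL = 0xAB  # stand-in for one generated value in range(256)
# Write one raw byte per value, bypassing text encoding entirely.
sys.stdout.buffer.write(bytes([PRNG_VAL]))
sys.stdout.buffer.flush()  # keep the pipe fed despite output buffering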
Is it possible to have Python continually read from stdin when it is fed by another source, such as a file? Basically I'm trying to have my script echo its stdin input, and I'd like to use a file or an external source to interact with it (while it remains open).
An example might be (input.py):
#!/usr/bin/python
import sys

line = sys.stdin.readline()
while line:
    print line,
    line = sys.stdin.readline()
Executing this directly, I can continuously enter text and it echoes back while the script remains alive. If you use an external source though, such as a file or input from bash, the script exits immediately after receiving the input:
$ echo "hello" | python input.py
hello
$
Ultimately what I'd like to do is:
$ tail -f file | python input.py
Then, whenever the file is updated, input.py would echo back anything that is added to the file, while remaining open. Maybe I'm approaching this the wrong way, or I'm simply clueless, but is there a way to do it?
Your pipeline is already close: because tail -f keeps the pipe open, stdin never reaches end-of-file, so the readline() loop keeps running as new lines arrive. Use the -F option to tail to also make it reopen the file if it gets renamed or deleted and a new file is created with the original name. Some editors write files this way, and logfile rotation scripts also usually work this way (they rename the original file to filename.1 and create a new log file).
$ tail -F file | python input.py
I have a simple python script that just takes in a filename, and spits out a modified version of that file. I would like to redirect stdout (using '>' from the command line) so that I can use my script to overwrite a file with my modifications, e.g. python myScript.py test.txt > test.txt
When I do this, the resulting test.txt does not contain any text from the original test.txt, only the additions made by myScript.py. However, if I don't redirect stdout, the modifications come out correctly.
To be more specific, here's an example:
myScript.py:
#!/usr/bin/python
import sys
fileName = sys.argv[1]
sys.stderr.write('opening ' + fileName + '\n')
fileHandle = open(fileName)
currFile = fileHandle.read()
fileHandle.close()
sys.stdout.write('MODIFYING\n\n' + currFile + '\n\nMODIFIED!\n')
test.txt:
Hello World
Result of python myScript.py test.txt > test.txt:
MODIFYING
MODIFIED!
The reason it works that way is that, before Python even starts, bash interprets the redirection operator and opens an output stream to write stdout to the file. That operation truncates the file to size 0; in other words, it clears the contents of the file. So by the time your Python script starts, it sees an empty input file.
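You can demonstrate the truncation without Python at all (tr stands in for any filter here):
$ echo hello > test.txt
$ tr a-z A-Z < test.txt > test.txt
$ cat test.txt
$
The shell truncates test.txt before tr ever reads it, so tr sees an empty file and the original data is gone.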
The simplest solution is to redirect stdout to a different file, and then afterwards, rename it to the original filename.
python myScript.py test.txt > test.out && mv test.out test.txt
Alternatively, you could alter your Python script to write the modified data back to the file itself, so you wouldn't have to redirect standard output at all.
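For instance (a minimal sketch reusing the names from myScript.py; error handling omitted):
import sys

fileName = sys.argv[1]
with open(fileName) as fileHandle:
    currFile = fileHandle.read()
# Reopen for writing only after the full contents are in memory.
with open(fileName, 'w') as fileHandle:
    fileHandle.write('MODIFYING\n\n' + currFile + '\n\nMODIFIED!\n')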
Try redirecting to a new file instead; the > redirection operator truncates the target file before the command runs, which is why the original contents vanish.
The utility sponge present in the moreutils package in Debian can handle this gracefully.
python myScript.py test.txt | sponge test.txt
As indicated by its name, sponge will drain its standard input completely before opening test.txt and writing out the full contents of stdin.