Take raw streaming data from stdout into a Python program

I have a HackRF hardware unit that feeds a continuous raw uint8 data stream into a Linux shell pipe.
For example, this pipes continuous data to another application in a Linux shell:
hackrf_transfer -r /dev/stdout -f 92700000 -s 8000000 - | (another application)
In Python, this does the same:
hackout = subprocess.Popen(['hackrf_transfer', '-r', '/dev/stdout', '-f', '92700000', '-s', '8000000'], stdout=subprocess.PIPE)
BUT I cannot get the HackRF pipe stream into a Python script. For example, I may want to decimate that raw data stream or manipulate it in some way and then send it on to another subprocess application, like so:
(HackRF) source subprocess >> a Python script >> sink subprocess (e.g. baudline)
or, in a single Python script:
source hackrf >> my_function >> sink application
I can do source >> sink in a Python script when both applications already accept a shell command, such as a hackrf subprocess piped into a Baudline subprocess's stdin. In other words, if the two apps work with a pipe in the shell, they work in a Python subprocess call. But I can't get a Python function in between that shell pipe to alter the data.
Does anyone have any ideas on how I could go about this, please?

The output of hackrf_transfer is a byte stream, not line-oriented, so readlines() doesn't work; use read(8*1024) instead.
If I use hackout.stdout.read() or hackout.communicate(), it 'sinks' the data stream.
Right, those calls without arguments cannot be used to read a continuous data stream in parallel.
read(size=-1):
As a convenience, if size is unspecified or -1, all bytes until EOF are returned.
communicate(input=None, timeout=None):
Read data from stdout and stderr, until end-of-file is reached. Wait for process to terminate.
This is why I said to use read(8*1024).
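For illustration, a minimal sketch of such a chunked read loop on the hackout pipe defined above (handle_chunk is a hypothetical placeholder for whatever processing you want to do):
while True:
    chunk = hackout.stdout.read(8*1024)
    if not chunk:           # empty result means the pipe was closed
        break
    handle_chunk(chunk)     # hypothetical per-chunk handler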
It's not raising errors or messages with this: data = hackout.stdout.readlines(8*1024), but I want to take this and feed it back to stdout. I tried sys.stdin.write(data); it's not writing, but it's seeing 'data' as a list. So it has captured the data, but I can't write that captured data back out.
I hope you meant read rather than readlines, for the reason I stated at the beginning of this post.
Here's a sketch of working code, based on what you posted in a since-deleted "answer":
import subprocess
import numpy as np

hackout = subprocess.Popen(['hackrf_transfer', …], stdout=subprocess.PIPE)
# We need to start the "sink subprocess" at the outset, and that with stdin=PIPE
baudline = subprocess.Popen("baudline … -stdin …", stdin=subprocess.PIPE, shell=True)

def decimator():
    for iq_samples in iter(lambda: bytearray(hackout.stdout.read(8*1024)), b''):
        # convert the samples chunk for use by numpy, if you wish
        dat = np.array(iq_samples)
        dat = dat.astype(float)/255
        # dat = … do further processing, as you wish
        # now convert the data back for use by baudline -format u8 and write it out
        baudline.stdin.write((dat*255).astype('i1').tostring())

decimator()

Related

Python's sub-process returns truncated output when using PIPE to read very long outputs

We have a rasterization utility developed in NodeJS that converts an HTML string to the Base64 of the rendered HTML page. We use it by running the utility through the subprocess module and then reading its STDOUT via a PIPE. The basic code that implements this is as follows:
from subprocess import run, PIPE
result = run(['capture', tmp_file.name, '--type', 'jpeg'], stdout=PIPE, stderr=PIPE, check=True)
output = result.stdout.decode('utf-8')
The output contains the Base64 string of the rendered HTML page. As the Base64 is very large for large pages, I have noticed that for some HTML pages the output is truncated and incomplete. But this happens randomly, so the Base64 can be correct for a page one time and truncated the next. It is important to mention that I'm currently using threading (10 threads) to convert HTML to Base64 images concurrently, so that might play a role here.
I analyzed this in detail and found that, under the hood, subprocess.run uses the _communicate method, which in turn uses os.read() to read from the PIPE. I printed its output and found that it is also truncated, which is why STDOUT is truncated. Strange behavior altogether.
Finally, I was able to solve this by using a file handle instead of the PIPE and it works perfectly.
with open(output_filename, 'w+') as out_file:
    result = run(['capture', tmp_file.name, '--type', 'jpeg'], stdout=out_file, stderr=PIPE, check=True)
I'm just curious why the PIPE fails to handle complete output and that too, randomly.
When you run subprocess, the command gets executed by bash.
When you use PIPE as stdout, bash internally stores the data in a temporary buffer, which has a hard limit of 128 KB; anything that spills over 128 KB gets truncated.
The best way to handle large output is to capture it in a file.

Python: have check_call output to file continuously?

Is it possible to get the following check_call procedure:
logPath="log.txt"
with open(logPath, "w") as log:
    subprocess.check_call(command, stdout=log, stderr=subprocess.STDOUT)
to output the stdout and stderr to a file continuously?
On my machine, the output is written to the file only after subprocess.check_call has finished.
To achieve continuous output, perhaps we could modify the buffer length of the log file stream?
Not without some OS tricks.
That happens because the output is usually line-buffered (i.e. the buffer is flushed after each newline character) when the output is a terminal, but block-buffered when the output is a file or pipe. In the block-buffered case you won't see the output written "continuously"; rather, it is written once every 1 K or 4 K or whatever the block size is.
This is the default behavior of libc, so if the subprocess is written in C and uses printf()/fprintf(), it will check whether the output is a terminal or a file and change the buffering mode accordingly.
The concept of buffering is (better) explained at http://www.gnu.org/software/libc/manual/html_node/Buffering-Concepts.html
This is done for performance reasons (see the answer to this question).
If you can modify the subprocess's code, you can put a call to flush() after each line or whenever needed.
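For illustration, if the subprocess happened to be a Python program you control, a minimal sketch (produce_lines is a hypothetical stand-in for its real output loop):
import sys

def produce_lines():
    # hypothetical stand-in for the program's real work
    for i in range(10):
        yield "line %d" % i

for line in produce_lines():
    sys.stdout.write(line + "\n")
    sys.stdout.flush()   # push each line out immediately, even when stdout is a pipe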
Otherwise there are external tools to force line-buffered mode (by tricking programs into believing the output is a terminal); see the stdbuf sketch after this list:
unbuffer (part of the expect package)
stdbuf
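For illustration, a sketch of the stdbuf route applied to the check_call example above. This assumes command is a list of arguments (the value shown here is hypothetical) and that the program buffers via libc stdio, which is what stdbuf can influence:
import subprocess

logPath = "log.txt"
command = ["my_program", "--flag"]   # hypothetical command, as in the question
with open(logPath, "w") as log:
    # stdbuf -oL asks the child to line-buffer its stdout even though it is redirected
    subprocess.check_call(["stdbuf", "-oL"] + command,
                          stdout=log, stderr=subprocess.STDOUT)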
Possibly related:
Force line-buffering of stdout when piping to tee (suggests use of unbuffer)
java subprocess does not write its output until it terminates (a shorter explanation of mine written years ago)
How to get "instant" output of "tail -f" as input? (suggests stdbuf usage)
Piping of grep is not working with tail? (only for grep)

using Python subprocess to redirect stdout to stdin?

Using the subprocess module, I'm calling a program from the shell that outputs a binary file to STDOUT.
I use Popen() to call the program, and then I want to pass the stream to a function in a Python package (called "pysam") that unfortunately cannot accept Python file objects, but can read from STDIN. So what I'd like to do is have the output of the shell command go from STDOUT into STDIN.
How can this be done from within Popen/subprocess module? This is the way I'm calling the shell program:
p = subprocess.Popen(my_cmd, stdout=subprocess.PIPE, shell=True).stdout
This will read "my_cmd"'s STDOUT output and get a stream to it in p. Since my Python module cannot read from "p" directly, I am trying to redirect STDOUT of "my_cmd" back into STDIN using:
p = subprocess.Popen(my_cmd, stdout=subprocess.PIPE, stdin=subprocess.PIPE, shell=True).stdout
I then call my module, which uses "-" as a placeholder for STDIN:
s = pysam.Samfile("-", "rb")
The above call just means read from STDIN (denoted "-") and read it as a binary file ("rb").
When I try this, I just get binary output sent to the screen, and it doesn't look like the Samfile() function can read it. This occurs even if I remove the call to Samfile, so I think it's my call to Popen that is the problem and not downstream steps.
EDIT: In response to answers, I tried:
sys.stdin = subprocess.Popen(tagBam_cmd, stdout=subprocess.PIPE, shell=True).stdout
print "Opening SAM.."
s = pysam.Samfile("-","rb")
print "Done?"
sys.stdin = sys.__stdin__
This seems to hang. I get the output:
Opening SAM..
but it never gets past the Samfile("-", "rb") line. Any idea why?
Any idea how this can be fixed?
EDIT 2: I am adding a link to the Pysam documentation in case it helps; I really cannot figure this out. The documentation page is:
http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/usage.html
and the specific note about streams is here:
http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/usage.html#using-streams
In particular:
"""
Pysam does not support reading and writing from true python file objects, but it does support reading and writing from stdin and stdout. The following example reads from stdin and writes to stdout:
infile = pysam.Samfile( "-", "r" )
outfile = pysam.Samfile( "-", "w", template = infile )
for s in infile: outfile.write(s)
It will also work with BAM files. The following script converts a BAM formatted file on stdin to a SAM formatted file on stdout:
infile = pysam.Samfile( "-", "rb" )
outfile = pysam.Samfile( "-", "w", template = infile )
for s in infile: outfile.write(s)
Note, only the file open mode needs to changed from r to rb.
"""
So I simply want to take the stream coming from Popen, which reads stdout, and redirect that into stdin, so that I can use Samfile("-", "rb") as the above section states is possible.
thanks.
I'm a little confused that you see binary on stdout if you are using stdout=subprocess.PIPE; however, the overall problem is that you need to work with sys.stdin if you want to trick pysam into using it.
For instance:
sys.stdin = subprocess.Popen(my_cmd, stdout=subprocess.PIPE, shell=True).stdout
s = pysam.Samfile("-", "rb")
sys.stdin = sys.__stdin__ # restore original stdin
UPDATE: This assumed that pysam, running inside the Python interpreter, would use the interpreter's notion of stdin when "-" is specified. Unfortunately, it doesn't: when "-" is specified, it reads directly from file descriptor 0.
In other words, it is not using Python's concept of stdin (sys.stdin), so replacing sys.stdin has no effect on pysam.Samfile(). It is also not possible to take the output from the Popen call and somehow "push" it onto file descriptor 0; that descriptor is read-only and its other end is connected to your terminal.
The only real way to get that output onto file descriptor 0 is to move the pysam call into an additional script and connect the two from the first one. That ensures the output of the Popen in the first script ends up on file descriptor 0 of the second.
So, in this case, your best option is to split this into two scripts. The first one will invoke my_cmd and use its output as the input to a second Popen that runs another Python script invoking pysam.Samfile("-", "rb").
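A minimal sketch of that two-script split, with hypothetical file names (reader.py is the second script, which actually calls pysam):
# first script: run my_cmd and connect its stdout to reader.py's stdin
import subprocess
producer = subprocess.Popen(my_cmd, stdout=subprocess.PIPE, shell=True)
consumer = subprocess.Popen(["python", "reader.py"], stdin=producer.stdout)
producer.stdout.close()   # let the producer see SIGPIPE if reader.py exits early
consumer.wait()

# reader.py: here "-" really does refer to file descriptor 0, which is the pipe
import pysam
infile = pysam.Samfile("-", "rb")
for read in infile:
    pass   # process each record here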
In the specific case of dealing with pysam, I was able to work around the issue using a named pipe (http://docs.python.org/library/os.html#os.mkfifo), which is a pipe that can be accessed like a regular file. In general, you want the consumer (reader) of the pipe to be listening before you start writing to the pipe, to ensure you don't miss anything. However, pysam.Samfile("-", "rb") will hang, as you noted above, if nothing is already registered on stdin.
Assuming you're dealing with a prior computation that takes a decent amount of time (e.g. sorting the BAM before passing it into pysam), you can start that prior computation and then listen on the stream before anything gets output:
import os
import tempfile
import subprocess
import shutil
import pysam
# Create a named pipe
tmpdir = tempfile.mkdtemp()
samtools_prefix = os.path.join(tmpdir, "namedpipe")
fifo = samtools_prefix + ".bam"
os.mkfifo(fifo)
# The example below sorts the file 'input.bam',
# creates a pysam.Samfile object of the sorted data,
# and prints out the name of each record in sorted order
# Your prior process that spits out data to stdout/a file
# We pass samtools_prefix as the output prefix, knowing that its
# ending file will be named what we called the named pipe
subprocess.Popen(["samtools", "sort", "input.bam", samtools_prefix])
# Read from the named pipe
samfile = pysam.Samfile(fifo, "rb")
# Print out the names of each record
for read in samfile:
    print read.qname
# Clean up the named pipe and associated temp directory
shutil.rmtree(tmpdir)
If your system supports it; you could use /dev/fd/# filenames:
process = subprocess.Popen(args, stdout=subprocess.PIPE)
samfile = pysam.Samfile("/dev/fd/%d" % process.stdout.fileno(), "rb")

using a python list as input for linux command that uses stdin as input

I am using python scripts to load data to a database bulk loader.
The input to the loader is stdin. I have been unable to get the correct syntax to call the Unix-based bulk loader while passing it the contents of a Python list to be loaded.
I have been reading about Popen and PIPE, but they have not been behaving as I expect.
The Python list contains database records to be bulk-loaded. In Linux it would look similar to this:
echo "this is the string being written to the DB" | sql -c "COPY table FROM stdin"
What would be the correct way to replace the echo statement with a Python list to be used with this command?
I do not have sample code for this process, as I have been experimenting with the features of Popen and PIPE using some very simple syntax and not obtaining the desired result.
Any help would be very much appreciated.
Thanks
If your data is short and simple, you could preformat the entire list and do it simply with subprocess, like this:
import subprocess
data = ["list", "of", "stuff"]
proc = subprocess.Popen(["sql", "-c", "COPY table FROM stdin"], stdin=subprocess.PIPE)
proc.communicate("\n".join(data))
If the data is too big to preformat like this, you can attempt to use the stdin pipe directly, though the subprocess module can be flaky with pipes if you also need to read from stdout/stderr.
for line in data:
    print >>proc.stdin, line
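A slightly fuller sketch of that direct-write route, closing stdin when finished so the loader sees end-of-input (on Python 3 each line would need to be bytes, e.g. (line + "\n").encode()):
import subprocess

data = ["record one", "record two"]   # example records standing in for your list
proc = subprocess.Popen(["sql", "-c", "COPY table FROM stdin"], stdin=subprocess.PIPE)
for line in data:
    proc.stdin.write(line + "\n")     # Python 2 str; encode to bytes on Python 3
proc.stdin.close()                    # signal end of input to the loader
proc.wait()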

Running a command line from python and piping arguments from memory

I was wondering if there is a way to run a command-line executable from Python but pass it the argument values from memory, without having to write the in-memory data to a temporary file on disk. From what I have seen, it seems that subprocess.Popen(args) is the preferred way to run programs from inside Python scripts.
For example, I have a PDF file in memory. I want to convert it to text using the command-line tool pdftotext, which is present in most Linux distros. But I would prefer not to write the in-memory PDF file to a temporary file on disk.
pdfInMemory = myPdfReader.read()
convertedText = subprocess.<method>(['pdftotext', ??]) <- what is the value of ??
What method should I call, and how should I pipe the in-memory data into its input and pipe its output back into another variable in memory?
I am guessing there are other PDF modules that can do the conversion in memory, and information about those modules would be helpful. But for future reference, I am also interested in how to pipe input and output to the command line from inside Python.
Any help would be much appreciated.
With Popen.communicate:
import subprocess
out, err = subprocess.Popen(["pdftotext", "-", "-"], stdin=subprocess.PIPE, stdout=subprocess.PIPE).communicate(pdf_data)
os.tmpfile is useful if you need a seekable object. It uses a file, but it's nearly as simple as the pipe approach, with no need for cleanup.
tf=os.tmpfile()
tf.write(...)
tf.seek(0)
subprocess.Popen( ... , stdin = tf)
This may not work on Posix-impaired OS 'Windows'.
Popen.communicate from subprocess takes an input parameter that is used to send data to stdin; you can use that to feed in your data. You also get the program's output back from communicate, so you don't have to write it to a file.
The documentation for communicate explicitly warns that everything is buffered in memory, which in this case seems to be exactly what you want.
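Tying that back to the question, a minimal sketch (pdfInMemory is the PDF bytes read earlier in the question; "-" tells pdftotext to read from stdin and write to stdout, as in the first answer):
import subprocess

proc = subprocess.Popen(["pdftotext", "-", "-"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
convertedText, _ = proc.communicate(pdfInMemory)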
