Hadoop commands from python

I am trying to get some stats for a directory in HDFS: the number of files/subdirectories and the size of each. I started out thinking that I could do this in bash.
#!/bin/bash
OP=$(hadoop fs -ls hdfs://mydirectory)
echo "$OP" | wc -l
I only have this much so far, and I quickly realised that Python might be a better option for this. However, I am not able to figure out how to execute Hadoop commands like hadoop fs -ls from Python.

Try the following snippet:
import subprocess

output = subprocess.Popen(["hadoop", "fs", "-ls", "/user"],
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for line in output.stdout:
    print(line)
Additionally, you can refer to this subprocess example, where you can get the return status, output and error message separately.
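For instance, a minimal sketch using subprocess.run (available from Python 3.5), with the same /user placeholder path as above:

import subprocess

# check=False (the default) means a non-zero exit status does not raise,
# so the return code, output and error text can be inspected directly.
result = subprocess.run(
    ["hadoop", "fs", "-ls", "/user"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
)

print("return code:", result.returncode)
print("stdout:", result.stdout)
print("stderr:", result.stderr)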

See https://docs.python.org/2/library/commands.html for your options, including how to get the return status (in case of an error). The basic code you're missing is
import commands
hdir_list = commands.getoutput('hadoop fs -ls hdfs://mydirectory')
Yes: deprecated in 2.6, still useful in 2.7, but removed from Python 3. If that bothers you, switch to
os.system(command_string)
... or better yet use subprocess.call (introduced in 2.4).
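Applied back to the original question, a rough subprocess sketch might look like this; the column positions are an assumption about the usual hadoop fs -ls output format, and hdfs://mydirectory is the placeholder path from the question:

import subprocess

listing = subprocess.check_output(
    ["hadoop", "fs", "-ls", "hdfs://mydirectory"],
    universal_newlines=True,
).splitlines()

# hadoop fs -ls normally prints a "Found N items" header before the entries.
entries = [line for line in listing if not line.startswith("Found")]
print("number of files/subdirectories:", len(entries))

for entry in entries:
    fields = entry.split()
    # In the usual listing format the fifth column is the size in bytes
    # and the last column is the path; directories typically show size 0.
    print(fields[-1], fields[4])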

Related

Python script: get latest tag info from remote Git repository?

I have found a way to get the info on the latest Git tag from the remote repository:
git ls-remote --tags --refs --sort="version:refname" git://github.com/git/git.git | awk -F/ 'END{print$NF}'
This works fine from the command line.
How would I get this info from a Python script?
I tried os.system(command), but it doesn't seem like a good choice.
You can use subprocess.check_output and do what your awk incantation does in Python (namely print the last line's last /-separated segment).
Using the list form of arguments for subprocess.check_output ensures you don't need to think about shell escaping (in fact, the shell isn't even involved).
import subprocess

repo_url = "https://github.com/git/git.git"
output_lines = subprocess.check_output(
    [
        "git",
        "ls-remote",
        "--tags",
        "--refs",
        "--sort=version:refname",
        repo_url,
    ],
    encoding="utf-8",
).splitlines()
last_line_ref = output_lines[-1].rpartition("/")[-1]
print(last_line_ref)

Use of '&' in a python subprocess for background process

Are there any differences between the following 2 lines:
subprocess.Popen(command + '> output.txt', shell=True)
subprocess.Popen(command +' &> output.txt', shell=True)
As Popen already runs the command in the background, should I use &? Does the use of & ensure that the command keeps running even after the Python script finishes?
Please let me know the difference between the two lines, and also suggest which of the two is better.
Thanks.
&> redirects standard error to the same destination as standard output, which means both the command's normal output and its error messages will be written to output.txt.
Using > alone sends only standard output to output.txt; standard error can be captured separately with command > output.txt 2> error.txt.
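As an aside, the same redirection can be done from Python without involving the shell at all; a minimal sketch, where command stands for a hypothetical argument list:

import subprocess

command = ["some_command", "arg1"]  # hypothetical; substitute your own

with open("output.txt", "w") as out:
    # Equivalent of `command &> output.txt`: stderr is merged into stdout
    # and both end up in the file.
    subprocess.Popen(command, stdout=out, stderr=subprocess.STDOUT)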

gnu parallel --pipe producing empty output files

I'm struggling to get gnu parallel running. I have a shell script that calls a Python program several thousand times with different input params:
python /path/to/program/run.py A_02_01 input.fasta > /path/to/output/out.txt
python /path/to/program/run.py A_02_02 input.fasta > /path/to/output/out.txt
I tried using gnu parallel like so:
cat iedb_classi_call.sh | parallel --recstart 'python' --recend '\n' --pipe bash
But all my output files are empty. I'm struggling to figure out why. I'm not getting errors from gnu parallel.
Before I added the recstart and recend options, I was getting non-empty output files for some python calls, but other calls weren't executing and produced errors like:
run.py: error: incorrect number of arguments
bash: line 422: 01_ input.fasta: command not found
Usage: run.py allele fasta_file
This made me think parallel was splitting the input into chunks at the wrong record boundaries, so I added the --recstart / --recend parameters.
I'm using gnu parallel version 20180722
(This should be a comment, as it does not answer the actual question, but code does not format well in comments.)
If the lines are like:
python /path/to/program/run.py A_02_01 input.fasta > /path/to/output/out.txt
python /path/to/program/run.py A_02_02 input.fasta > /path/to/output/out.txt
Then you might be able to do:
cat file-with-A_names |
parallel --results /path/{}.out python /path/to/program/run.py {} input.fasta >/dev/null
The output will be stored in /path/A....out.
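As an aside (not part of the original answer), since the rest of this page is about driving commands from Python: the same fan-out can also be sketched with concurrent.futures, assuming a hypothetical names.txt containing one A_... name per line:

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_one(name):
    # Each call writes to its own output file, mirroring the shell script.
    with open("/path/to/output/%s.out" % name, "w") as out:
        return subprocess.call(
            ["python", "/path/to/program/run.py", name, "input.fasta"],
            stdout=out,
        )

with open("names.txt") as f:
    names = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=8) as pool:
    exit_codes = list(pool.map(run_one, names))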

use two pipelines for python input file argument and stdin streaming

Is there a one-liner approach to running the following Python script in Linux bash, without saving any temporary file (except /dev/std*)?
My Python script test.py takes a filename as an argument, but also reads sys.stdin as a streaming input.
#test.py
#!/usr/bin/python
import sys

fn = sys.argv[1]
checkofflist = []
with open(fn, 'r') as f:
    for line in f.readlines():
        checkofflist.append(line)

for line in sys.stdin:
    if line in checkofflist:
        pass  # do something to line
I would like to do something like
hadoop fs -cat inputfile.txt > /dev/stdout | cat streamingfile.txt | python test.py /dev/stdin
But of course this doesn't work, since the middle cat corrupts the intended /dev/stdin content. Being able to do this would be nice, since then I wouldn't need to save HDFS files locally every time I need to work with them.
I think what you're looking for is:
python test.py <( hadoop fs -cat inputfile.txt ) <streamingfile.txt
In bash, <( ... ) is Process Substitution. The command inside the parentheses is run with its output connected to a fifo or equivalent, and the name of the fifo (or /dev/fd/n if bash is able to use an unnamed pipe) is substituted as an argument. The tool sees a filename, which it can just open and use normally. (>(...) is also available, with input connected to a fifo, in case you want a named streaming output.)
Without relying on bash process substitution, you might also try
hadoop fs -cat inputfile.txt | python test.py streamingfile.txt
This provides streamingfile.txt as a command-line argument for test.py to use as a file name to open, as well as providing the contents of inputfile.txt on standard input.

Is there any way to get ps output programmatically?

I've got a webserver that I'm presently benchmarking for CPU usage. What I'm doing is essentially running one process to slam the server with requests, then running the following bash script to determine the CPU usage:
#! /bin/bash
for (( ;; ))
do
echo "`python -c 'import time; print time.time()'`, `ps -p $1 -o '%cpu' | grep -vi '%CPU'`"
sleep 5
done
It would be nice to be able to do this in Python so I can run one script instead of two. I can't seem to find any platform-independent way (or at least one that works on both Linux and OS X) to get the ps output in Python without actually launching another process to run the command. I can do that, but it would be really nice if there were an API for it.
Is there a way to do this, or am I going to have to launch the external script?
You could check out this question about parsing ps output using Python.
One of the answers suggests using the PSI python module. It's an extension though, so I don't really know how suitable that is for you.
The question also shows how you can call a ps subprocess from Python :)
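For completeness, a minimal sketch of that subprocess approach, assuming a known pid:

import subprocess

pid = 1234  # hypothetical process id

out = subprocess.check_output(["ps", "-p", str(pid), "-o", "%cpu"],
                              universal_newlines=True)
# The first line of the output is the "%CPU" header; the second holds the value.
cpu = float(out.splitlines()[1])
print(cpu)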
My preference is to do something like this.
collection.sh
for (( ;; ))
do
date; ps -p $1 -o '%cpu'
done
Then run collection.sh >someFile while you "slam the server with requests".
Then kill this collection.sh operation after the server has been slammed.
At the end, you'll have a file with your log of date stamps and CPU values.
analysis.py
import datetime

with open("someFile", "r") as source:
    for line in source:
        if line.strip() == "%CPU":
            continue
        try:
            date = datetime.datetime.strptime(line.strip(), "%a %b %d %H:%M:%S %Z %Y")
        except ValueError:
            cpu = float(line)
            print date, cpu # or whatever else you want to do with this data.
You could query the CPU usage with PySNMP. This has the added benefit of being able to take measurements from a remote computer. For that matter, you could install a VM of Zenoss or its kin, and let it do the monitoring for you.
If you don't want to invoke ps, why not try the /proc file system? You can write your Python program to read files from /proc and extract the data you want. I did this in Perl, by writing inlined C code in a Perl script; I think you can find a similar way in Python. It is doable, but you need to go through the /proc file system and figure out what you want and how to get it.
http://www.faqs.org/docs/kernel/x716.html
The URL above might give you an initial push.
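For example, a rough Linux-only sketch that reads cumulative CPU time from /proc/<pid>/stat (field layout per proc(5)); sampling it twice and dividing by the wall-clock interval would give a usage percentage:

import os

def cpu_seconds(pid):
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split off everything up to the closing parenthesis first.
    fields = data.rsplit(")", 1)[1].split()
    # utime and stime (fields 14 and 15 of the full line) now sit at
    # indexes 11 and 12, measured in clock ticks.
    ticks = int(fields[11]) + int(fields[12])
    return ticks / float(os.sysconf("SC_CLK_TCK"))

print(cpu_seconds(os.getpid()))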
