Running python code in Apache Nifi ExecuteStreamCommand

I'm trying to run Python code in the NiFi ExecuteStreamCommand processor.
The code uses modules that are not pure Python, such as pandas and NumPy, so NiFi's ExecuteScript processor is not an option.
The issue is with reading the incoming flow file and modifying its content.
Apparently it is possible to read the incoming flow file from STDIN and write the result to STDOUT; see this SO question:
Python Script using ExecuteStreamCommand
But I have not been able to get this working.
1.
I tried simply reading in a CSV from STDIN and modifying it, but when the result is sent to a PutFile processor the file is unchanged.
import sys
import pandas as pd
import io
df = pd.read_csv(io.StringIO(sys.stdin.read(1)))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df2 = df.append(df2)
2.
I tried wrapping some other code in a function and returning the result, on the assumption that the function's return value would go to STDOUT, but the outcome was the same.
def convert_csv_dataframe():
    a = pd.read_csv(io.StringIO(sys.stdin.read(1)))
    a.replace(["ABC", "AB"], "A", inplace=True)
    return a
convert_csv_dataframe()
If anybody can help it would be most appreciated.
EDIT:
This code works. The issue was in NiFi: I was reading from the "original" relationship instead of the "output stream" relationship. Note that this reads only one line from stdin, but I don't think that should make a difference. The only question I have left is: can I reference the flow file itself (not its contents) from ExecuteStreamCommand?
import sys
a = sys.stdin.readline()
a = a.upper()
sys.stdout.write(a)

I think you need to write to STDOUT somewhere in your script. I don't know much Python, but both examples look like you read from STDIN and then modify data in memory, but never write it back out.
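To illustrate the point above, here is a minimal sketch of the read, modify, write pattern. It uses only the standard csv module so that it runs without extra packages; the `transform` helper and the `REPLACEMENTS` table are made up for illustration and are not part of the original question. With pandas the same shape would be pd.read_csv(sys.stdin) followed by df.to_csv(sys.stdout, index=False).

```python
import csv
import io

# Hypothetical replacement table, mirroring the replace() call in the question.
REPLACEMENTS = {"ABC": "A", "AB": "A"}

def transform(src, dst):
    """Read CSV rows from src, replace matching cells, write the rows to dst."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow([REPLACEMENTS.get(cell, cell) for cell in row])

# Demo with in-memory streams. In the script run by the processor the call
# would be transform(sys.stdin, sys.stdout), so that whatever is written to
# STDOUT becomes the content of the outgoing flow file.
demo_out = io.StringIO()
transform(io.StringIO("x,y\nABC,AB\n"), demo_out)
```

The key difference from the two attempts in the question is the explicit write to the output stream; a return value alone is never printed.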

Related

How to use "dump_pstats" properly to retrieve the sorted data of the "cProfile" into "txt" file?

As the title indicates, I am having trouble retrieving the information from dump_stats properly. Without further ado, here is my simple code.
Code
import cProfile
import pstats
def fun_to_profile():
    ...  # code to be profiled
profiler = cProfile.Profile()
profiler.runcall(fun_to_profile)
stats = pstats.Stats(profiler)  # wrap the collected data so it can be sorted and printed
stats.sort_stats('cumulative')
stats.print_stats()
stats.dump_stats("output.txt")
This is the simplest code I could find, and I really have read the documentation multiple times.
Problem
My problem is that when I open the file "output.txt", it is either empty or full of unreadable characters. Do I need to give the file a specific extension, or is the issue with my compiler?
Thanks in advance.

Working with cProfile turns out to be easy and straightforward. I figured out the solution to the problem.
First of all, note that dump_stats writes binary profile data rather than text, so a name such as "file.dat" is more appropriate for that file. We then need to read it back and write it out in the desired format, such as a text file.
For that we need the following piece of code:
import cProfile
import pstats
cProfile.run("fun_to_profile()", "Out_put_profile.dat")  # run the function and save the raw profile data
with open("Profile_time.txt", "w") as f:
    p = pstats.Stats("Out_put_profile.dat", stream=f)
    p.sort_stats("time").print_stats()  # sort by time spent and write readable text to f
And just like this we have more material for analyzing the code, in a human-readable format. Thanks to IDG TECHtalk for sharing the solution.
Link to the youtube video: https://youtu.be/dmnA3axZ3FY.
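If the intermediate .dat file is not needed, the readable report can also be built directly from the profiler object. The following is only a sketch: the body of `fun_to_profile` is a placeholder workload invented for the example, and `report_text` is a name chosen here.

```python
import cProfile
import io
import pstats

def fun_to_profile():
    # Placeholder workload (an assumption, just so there is something to measure).
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.runcall(fun_to_profile)

# pstats.Stats accepts a profiler object as well as a file name;
# stream= redirects print_stats() away from the console.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)

# Plain text; it can be written to a .txt file with an ordinary open(..., "w").
report_text = report.getvalue()
```

The text produced by print_stats() is what belongs in a .txt file, whereas dump_stats() is meant for reloading the data later with pstats.Stats.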

Open and display a file within a python script in Linux

I'd like to open/display an Excel file that I'm saving as part of Python processing, from within the Python script.
The save portion of the code works fine (i.e., it saves successfully, and I can open and view it from a Nautilus window), but attempting to open it programmatically throws errors no matter how I attempt it.
I've been using Popen from the subprocess module:
from subprocess import Popen
Popen('./temp/testtest.xlsx')
Gives:
PermissionError: [Errno 13] Permission denied: './temp/testtest.xlsx'
I subsequently tried changing file permissions:
import os
from subprocess import Popen
os.chmod('./temp/testtest.xlsx',0o777)
Popen('./temp/testtest.xlsx')
Which gave:
Out[127]: <subprocess.Popen at 0x7faeb22a4b00>
invalid file (bad magic number): Exec format error
And against better judgement tried running as shell:
from subprocess import Popen
Popen('./temp/testtest.xlsx', shell=True)
Which gave:
invalid file (bad magic number): Exec format error
Out[129]: <subprocess.Popen at 0x7faeb22a46a0>
I also tried it with the file saved in a different directory with similar errors. If it matters, I'm using the openpyxl module to create and save the excel file, but I have the same issues even if it's an excel file I created manually.

The argument to Popen() needs to be a command that opens the file, not the file itself. If you're using LibreOffice, the program is localc; if you're using OpenOffice it's oocalc.
from subprocess import Popen
import os
f = os.path.join('temp', 'testtest.xlsx')  # the file saved in the question
Popen(['localc', f])
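As a variation on the same idea, many Linux desktops ship the xdg-open command, which opens a file with whatever application is registered for its type. The sketch below assumes xdg-open (or another opener you name) is installed; `viewer_command` and `open_file` are helper names invented for this example.

```python
import shutil
import subprocess

def viewer_command(path, opener="xdg-open"):
    """Return the argument list that would open path, or None if opener is not on PATH."""
    exe = shutil.which(opener)
    return [exe, path] if exe else None

def open_file(path, opener="xdg-open"):
    """Launch the opener for path without waiting for it to exit."""
    cmd = viewer_command(path, opener)
    if cmd is None:
        raise FileNotFoundError(f"{opener} is not installed or not on PATH")
    return subprocess.Popen(cmd)

# Example call (not executed here): open_file("temp/testtest.xlsx")
```

Checking with shutil.which first gives a clear error when the opener is missing, instead of the less obvious failures shown in the question.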

You have many options here. You could install LibreOffice, an open source office suite that is fairly decent; you should then be able to open the file directly with ooffice --calc "filename". If you really want to stay with Python, you could save the data to a .csv file; the Anaconda Python distribution includes the pandas library, so you can read the .csv into a data frame fairly easily. First import pandas as pd, then
pd.read_csv("File_name.csv") returns a dataframe, but remember to import os and call os.chdir(r"/path/to/data") first.
From that point pandas lets you easily access the data for plotting or manipulation.
The following reference covers the functionality of a data frame, so you can see whether it suits your needs:
Python Pandas DataFrame

subprocess, Popen, and stdin: Seeking practical advice on automating user input to .exe

Despite my obviously beginning Python skills, I’ve got a script that pulls a line of data from a 2,000-row CSV file, reads key parameters, and outputs a buffer CSV file organized as an N-by-2 rectangle, and uses the subprocess module to call the external program POVCALLC.EXE, which takes a CSV file organized that way as input. The relevant portion of the code is shown below. I THINK that subprocess or one of its methods should allow me to interact with the external program, but am not quite sure how - or indeed whether this is the module I need.
In particular, when POVCALLC.EXE starts it first asks for the input file, which in this case is buffer.csv. It then asks for several additional parameters, including the name of an output file, which come from outside the snippet below. It then starts computing results, and then asks for further user input, including several carriage returns. Obviously, I would prefer to automate this interaction for the 2,000 rows in the original CSV.
Am I on the right track with subprocess, or should I be looking elsewhere to automate this interaction with the external executable?
Many thanks in advance!
# Begin inner loop to fetch Lorenz curve data for each survey
for i in range(int(L_points_number)):
index = 3 * i
line = []
P = L_points[index]
line.append(P)
L = L_points[index + 1]
line.append(L)
with open('buffer.csv', 'a', newline='') as buffer:
writer = csv.writer(buffer, delimiter=',')
P=1
line.append(P)
L=1
line.append(L)
writer.writerow(line)
subprocess.call('povcallc.exe')
# TODO: CALL povcallc and compute results
# TODO: USE Regex to interpret results and append them to
# output file

If your program expects these arguments on standard input (e.g. after running POVCALLC you type CSV filenames into the console), you could use subprocess.Popen() [see https://docs.python.org/3/library/subprocess.html#subprocess.Popen ] with stdin redirection (stdin=PIPE), and use the returned object to send data to stdin.
It would look something like this:
# text=True (Python 3.7+) lets communicate() accept a str instead of bytes
my_proc = subprocess.Popen('povcallc.exe', stdin=subprocess.PIPE, text=True)
my_proc.communicate(input="my_filename_which_is_expected_by_the_program.csv\n")
If you also pass stdout=subprocess.PIPE and stderr=subprocess.PIPE, you can use the tuple returned by communicate to check the program's stdout and stderr (see the link to the docs for more).
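When the program asks several questions in a row, all the answers can be sent in a single communicate() call, one per line. The sketch below uses a tiny Python child process as a stand-in for the interactive program, since povcallc.exe itself is not available here; `CHILD` and `run_with_answers` are names invented for the example, and the approach only works if the real program reads its prompts from standard input.

```python
import subprocess
import sys

# Stand-in for the interactive program (hypothetical): it reads two answers
# from stdin and echoes them back, separated by "|".
CHILD = (
    "import sys; "
    "a = sys.stdin.readline().strip(); "
    "b = sys.stdin.readline().strip(); "
    "print(a + '|' + b)"
)

def run_with_answers(cmd, answers):
    """Start cmd, feed it one answer per line, and return (exit code, stdout, stderr)."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    # An empty string in answers would act as a bare carriage return.
    out, err = proc.communicate("\n".join(answers) + "\n")
    return proc.returncode, out, err

code, out, err = run_with_answers(
    [sys.executable, "-c", CHILD], ["buffer.csv", "results.csv"]
)
```

Inside the loop over the 2,000 rows, the same helper would be called once per row with the real executable in place of the stand-in.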

using a python list as input for linux command that uses stdin as input

I am using Python scripts to load data into a database bulk loader.
The input to the loader is stdin. I have been unable to find the correct syntax to call the Unix-based bulk loader and pass it the contents of a Python list to be loaded.
I have been reading about Popen and PIPE but they have not been behaving as I expect.
The Python list contains database records to be bulk loaded. In Linux it would look similar to this:
echo "this is the string being written to the DB" | sql -c "COPY table FROM stdin"
What would be the correct way to replace the echo statement with a Python list to be used with this command?
I do not have sample code for this process, as I have been experimenting with the features of Popen and PIPE with some very simple syntax and not obtaining the desired result.
Any help would be very much appreciated.
Thanks

If your data is short and simple, you could preformat the entire list and pass it in one go with subprocess like this:
import subprocess
data = ["list", "of", "stuff"]
# text=True (Python 3.7+) lets the pipe accept str instead of bytes
proc = subprocess.Popen(["sql", "-c", "COPY table FROM stdin"], stdin=subprocess.PIPE, text=True)
proc.communicate("\n".join(data))
If the data is too big to preformat like this, you can write to the stdin pipe directly instead of calling communicate, though this can deadlock if you also need to read from stdout/stderr through pipes.
for line in data:
    proc.stdin.write(line + "\n")
proc.stdin.close()  # signal end of input to the loader
proc.wait()
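One way to stream records while still capturing the loader's output is to send stdout to a temporary file rather than a second pipe, which sidesteps the deadlock risk. This is a sketch under assumptions: `LOADER` is a hypothetical stand-in that merely counts the lines it receives (the real `sql` command is not available here), and `stream_records` is a name made up for the example.

```python
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the bulk loader: counts the lines it gets on stdin.
LOADER = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"]

def stream_records(cmd, records):
    """Write records to cmd's stdin one per line; return (exit code, captured stdout)."""
    with tempfile.TemporaryFile(mode="w+") as captured:
        proc = subprocess.Popen(
            cmd, stdin=subprocess.PIPE, stdout=captured, text=True
        )
        for record in records:
            proc.stdin.write(record + "\n")
        proc.stdin.close()  # end-of-input, so the child can finish
        code = proc.wait()
        captured.seek(0)
        return code, captured.read()

code, output = stream_records(LOADER, ["row 1", "row 2", "row 3"])
```

Because records is only iterated once, it can be a generator as well as a list, so the full data set never has to be joined into one string.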

Running a command line from python and piping arguments from memory

I was wondering if there is a way to run a command line executable in Python, but pass it the argument values from memory, without having to write the in-memory data to a temporary file on disk. From what I have seen, subprocess.Popen(args) seems to be the preferred way to run programs from inside Python scripts.
For example, I have a pdf file in memory. I want to convert it to text using the commandline function pdftotext which is present in most linux distros. But I would prefer not to write the in-memory pdf file to a temporary file on disk.
pdfInMemory = myPdfReader.read()
convertedText = subprocess.<method>(['pdftotext', ??]) <- what is the value of ??
Which method should I call, and how should I pipe the in-memory data into its input and pipe its output back into another variable in memory?
I am guessing there are other pdf modules that can do the conversion in memory and information about those modules would be helpful. But for future reference, I am also interested about how to pipe input and output to the commandline from inside python.
Any help would be much appreciated.

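With Popen.communicate:
import subprocess
# stdin must be a pipe too, otherwise communicate() has nowhere to send pdf_data
out, err = subprocess.Popen(["pdftotext", "-", "-"], stdin=subprocess.PIPE, stdout=subprocess.PIPE).communicate(pdf_data)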

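A temporary file is useful if you need something seekable. It uses a file, but it's nearly as simple as the pipe approach, and there is no need for cleanup. This was originally written with os.tmpfile, which exists only in Python 2; tempfile.TemporaryFile is the Python 3 replacement.
import tempfile
tf = tempfile.TemporaryFile()
tf.write(...)
tf.seek(0)
subprocess.Popen( ... , stdin = tf)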
This may not work on Posix-impaired OS 'Windows'.

Popen.communicate from subprocess takes an input parameter that is used to send data to stdin, you can use that to input your data. You also get the output of your program from communicate, so you don't have to write it into a file.
The documentation for communicate explicitly warns that everything is buffered in memory, which seems to be exactly what you want to achieve.
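For reference, subprocess.run wraps the same communicate() mechanism in a single call. The sketch below pipes bytes through a child process entirely in memory; because pdftotext may not be installed, it uses a hypothetical stand-in command (`UPPER`) that upper-cases its input, and `pipe_through` is a helper name invented here.

```python
import subprocess
import sys

# Hypothetical stand-in for a filter such as pdftotext: reads bytes from
# stdin and writes them upper-cased to stdout.
UPPER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
]

def pipe_through(cmd, data):
    """Send data (bytes) to cmd's stdin and return whatever it writes to stdout."""
    result = subprocess.run(
        cmd,
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout

converted = pipe_through(UPPER, b"in-memory bytes")
```

With pdftotext installed, the command from the answer above, ["pdftotext", "-", "-"], would take the place of UPPER and data would be the PDF bytes.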
