Modify flow file attributes in NiFi with Python sys.stdout?

In my pipeline I have a flow file that contains some data I'd like to add as attributes to the flow file. I know in Groovy I can add attributes to flow files, but I am less familiar with Groovy and much more comfortable with using Python to parse strings (which is what I'll need to do to extract the values of these attributes). The question is, can I achieve this in Python when I use ExecuteStreamCommand to read in a file with sys.stdin.read() and write out my file with sys.stdout.write()?
So, for example, I use the code below to extract the timestamp from my flowfile. How do I then add ts as an attribute when I'm writing out ff?
import sys

ff = sys.stdin.read()     # entire incoming FlowFile content
t_split = ff.split('\t')  # tab-separated fields
ts = t_split[0]           # first field is the timestamp
sys.stdout.write(ff)      # write the content back out unchanged

Instead of writing back the entire file again, you can simply write the extracted attribute value from the input FlowFile:
sys.stdout.write(ts)  # the timestamp, in your case
and then set the Output Destination Attribute property of the ExecuteStreamCommand processor to the desired attribute name.
The output of the stream command will then be put into an attribute of the original FlowFile, which can be found in the original relationship queue.
For more details, you can refer to ExecuteStreamCommand-Properties
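Putting that together, a minimal sketch of the stream-command script for this approach (assuming the timestamp is the first tab-separated field and that Output Destination Attribute is set to, say, ts):
import sys

ff = sys.stdin.read()   # incoming FlowFile content
ts = ff.split('\t')[0]  # first tab-separated field
sys.stdout.write(ts)    # stdout becomes the value of the 'ts' attribute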

If you're not importing any native (CPython) modules, you can try ExecuteScript with Jython rather than ExecuteStreamCommand. I have an example in Jython in an ExecuteScript cookbook. Note that you don't use stdin/stdout with ExecuteScript, instead you have to get the flow file from the session and either transfer it as-is (after you're done reading) or overwrite it (there are examples in the second part of the cookbook).
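A minimal Jython sketch along those lines, assuming the standard ExecuteScript bindings (session, REL_SUCCESS), an attribute name of ts, and the same tab-separated content as above:
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

class ReadFirstField(InputStreamCallback):
    def __init__(self):
        self.ts = None
    def process(self, inputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        self.ts = text.split('\t')[0]

flowFile = session.get()
if flowFile is not None:
    callback = ReadFirstField()
    session.read(flowFile, callback)  # read content without modifying it
    flowFile = session.putAttribute(flowFile, 'ts', callback.ts)
    session.transfer(flowFile, REL_SUCCESS)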

Related

Passing SOAP arguments from the command line

I have a python script that successfully sends SOAP to insert a record into a system. The values are static in the test; I need to make them dynamic, i.e. passed as a command-line argument or another stored value.
execute: python myscript.py
<d4p1:Address>MainStreet</d4p1:Address> ....this works to add hard coded "MainStreet"
execute: python myscript.py MainStreet
...this is now trying to pass the argument MainStreet
<d4p1:Address>sys.argv[1]</d4p1:Address> ....this does not work
It saves the literal text "sys.argv[1]" as the address. I have imported sys, and I have tried %, {}, etc. from web searches. What syntax am I missing?
You need to read a little about how to create strings in Python; below is how it could look in your code. Sorry, it's hard to say more without seeing your actual code. Also, you shouldn't really build XML like that; use, for instance, the xml module from the standard library.
test = "<d4p1:Address>" + sys.argv[1] + "</d4p1:Address>"
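If you do want to build the XML with the standard library instead of string concatenation, a minimal sketch could look like this (the namespace URI below is a placeholder; use the one from your actual SOAP envelope):
import sys
import xml.etree.ElementTree as ET

NS = "http://example.com/d4p1"         # placeholder namespace URI
addr = ET.Element("{%s}Address" % NS)
addr.text = sys.argv[1]                # value passed on the command line
print(ET.tostring(addr, encoding="unicode"))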

How do I save python code in JSON

I've got a large python project with several components, that exchange information with JSON files. Actually, this project is our internal tool for analysis and integration testing, and our developers use it either from web-UI, or from a command line.
The python modules process a labeled database consisting of a large number of files, with labels encoded in the file names. For example, the file name ab001l_AS_5_15Fps_1.raw tells us that it stores data from user ab001l, collected in session number 1 under conditions that we encode as AS.
Several such encodings exist.
JSON files usually store file names.
My question is: how can I save a python code into JSON file, so that another module could load it and decode file name into components?
I guess you can store python code as text in JSON, then use the exec built-in function to execute the text. See
https://docs.python.org/3/library/functions.html?highlight=exec#exec.
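For instance, a minimal sketch of that idea, using the file-name format from the question (the field layout here is an assumption, not your actual encoding):
import json

# Decoding logic stored as source text inside a JSON document.
blob = json.dumps({
    "decoder": "def decode(name):\n"
               "    user, condition = name.split('_')[:2]\n"
               "    return {'user': user, 'condition': condition}\n"
})

ns = {}
exec(json.loads(blob)["decoder"], ns)           # defines decode() in ns
print(ns["decode"]("ab001l_AS_5_15Fps_1.raw"))  # {'user': 'ab001l', 'condition': 'AS'}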
But it seems a much better approach to share your module and import your module like any python code.
You can use jsonpickle. Please check the documentation page for usage.
import jsonpickle

class Thing(object):
    def __init__(self, name):
        self.name = name

obj = Thing('Awesome')
frozen = jsonpickle.encode(obj)     # JSON string representing obj
thawed = jsonpickle.decode(frozen)  # reconstructed Thing instance
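For reference, frozen is an ordinary JSON string, so it can live in the same JSON files the question describes; roughly (assuming the class is defined in __main__):
print(frozen)       # e.g. {"py/object": "__main__.Thing", "name": "Awesome"}
print(thawed.name)  # Awesome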

subprocess, Popen, and stdin: Seeking practical advice on automating user input to .exe

Despite my obviously beginning Python skills, I’ve got a script that pulls a line of data from a 2,000-row CSV file, reads key parameters, and outputs a buffer CSV file organized as an N-by-2 rectangle, and uses the subprocess module to call the external program POVCALLC.EXE, which takes a CSV file organized that way as input. The relevant portion of the code is shown below. I THINK that subprocess or one of its methods should allow me to interact with the external program, but am not quite sure how - or indeed whether this is the module I need.
In particular, when POVCALLC.EXE starts it first asks for the input file, which in this case is buffer.csv. It then asks for several additional parameters, including the name of an output file, which come from outside the snippet below. It then starts computing results, and then asks for further user input, including several carriage returns. Obviously, I would prefer to automate this interaction for the 2,000 rows in the original CSV.
Am I on the right track with subprocess, or should I be looking elsewhere to automate this interaction with the external executable?
Many thanks in advance!
# Begin inner loop to fetch Lorenz curve data for each survey
for i in range(int(L_points_number)):
    index = 3 * i
    line = []
    P = L_points[index]
    line.append(P)
    L = L_points[index + 1]
    line.append(L)
    with open('buffer.csv', 'a', newline='') as buffer:
        writer = csv.writer(buffer, delimiter=',')
        P = 1
        line.append(P)
        L = 1
        line.append(L)
        writer.writerow(line)

subprocess.call('povcallc.exe')
# TODO: CALL povcallc and compute results
# TODO: USE Regex to interpret results and append them to
#       output file
If your program expects these arguments on the standard input (e.g. after running POVCALLC you type csv filenames into the console), you could use subprocess.Popen() [see https://docs.python.org/3/library/subprocess.html#subprocess.Popen ] with stdin redirection (stdin=PIPE), and use the returned object to send data to stdin.
It would look something like this:
my_proc = subprocess.Popen('povcallc.exe', stdin=subprocess.PIPE, text=True)  # text=True so a str can be sent
my_proc.communicate(input="my_filename_which_is_expected_by_the_program.csv")
You can also use the tuple returned by communicate to automatically check the program's stdout and stderr (see the link to the docs for more).
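For the interaction described in the question, one sketch would be to join all the expected answers with newlines and send them in a single communicate() call; the file names and the number of prompts below are placeholders:
import subprocess

answers = "\n".join([
    "buffer.csv",   # answer to the input-file prompt
    "results.csv",  # answer to the output-file prompt
    "", "", "",     # blank lines for the carriage-return prompts
]) + "\n"

proc = subprocess.Popen(
    ["povcallc.exe"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True,
)
out, err = proc.communicate(input=answers)
print(out)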

read() from an ExFileObject always causes a StreamError exception

I am trying to read only one file from a tar.gz archive. All operations on the tarfile object work fine, but when I read from a concrete member, a StreamError is always raised. Check this code:
import tarfile

fd = tarfile.open('file.tar.gz', 'r|gz')
for member in fd.getmembers():
    if not member.isfile():
        continue
    cfile = fd.extractfile(member)
    print cfile.read()
    cfile.close()
fd.close()
cfile.read() always causes "tarfile.StreamError: seeking backwards is not allowed"
I need to read the contents into memory, not dump them to a file (extractall works fine).
Thank you!
The problem is this line:
fd = tarfile.open('file.tar.gz', 'r|gz')
You don't want 'r|gz', you want 'r:gz'.
If I run your code on a trivial tarball, I can even print out the member and see test/foo, and then I get the same error on read that you get.
If I fix it to use 'r:gz', it works.
From the docs:
mode has to be a string of the form 'filemode[:compression]'
...
For special purposes, there is a second format for mode: 'filemode|[compression]'. tarfile.open() will return a TarFile object that processes its data as a stream of blocks. No random seeking will be done on the file… Use this variant in combination with e.g. sys.stdin, a socket file object or a tape device. However, such a TarFile object is limited in that it does not allow to be accessed randomly, see Examples.
'r|gz' is meant for when you have a non-seekable stream, and it only provides a subset of the operations. Unfortunately, it doesn't seem to document exactly which operations are allowed—and the link to Examples doesn't help, because none of the examples use this feature. So, you have to either read the source, or figure it out through trial and error.
But, since you have a normal, seekable file, you don't have to worry about that; just use 'r:gz'.
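For completeness, here is the question's snippet with only the mode changed, reading each member into memory:
import tarfile

with tarfile.open('file.tar.gz', 'r:gz') as fd:
    for member in fd.getmembers():
        if not member.isfile():
            continue
        data = fd.extractfile(member).read()  # bytes, kept in memory
        print(data)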
In addition to the file-mode issue, in my case I was seeking on a network stream. I had the same error when trying to requests.get the file, so I extracted everything to a tmp directory:
import os
import tarfile
from lzma import LZMAFile

# stream == requests.get
inputs = [tarfile.open(fileobj=LZMAFile(stream), mode='r|')]
t = "/tmp"
for tarfileobj in inputs:
    tarfileobj.extractall(path=t, members=None)
for fn in os.listdir(t):
    with open(os.path.join(t, fn)) as payload:
        print(payload.read())

How to diff file and output stream "on-the-fly"?

I need to create a diff file using the standard UNIX diff command with the python subprocess module. The problem is that I must compare a file and a stream without creating a temporary file. I thought about using named pipes via the os.mkfifo method, but didn't reach any good result. Can you give a simple example of how to solve this? I tried:
fifo = 'pipe'
os.mkfifo(fifo)
op = popen('cat ', fifo)
print >> open(fifo, 'w'), output
os.unlink(fifo)
proc = Popen(['diff', '-u', dumpfile], stdin=op, stdout=PIPE)
but it seems like diff doesn't see the second argument.
You can use "-" as an argument to diff to mean stdin.
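A minimal sketch of that approach, assuming the file name is in dumpfile and the stream contents are already in a string called output:
from subprocess import PIPE, Popen

dumpfile = 'dumpfile.txt'  # placeholder path
output = "line1\nline2\n"  # placeholder for the in-memory stream contents

proc = Popen(['diff', '-u', dumpfile, '-'], stdin=PIPE, stdout=PIPE, text=True)
diff_text, _ = proc.communicate(input=output)
print(diff_text)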
You could perhaps consider using the difflib python module (I've linked to an example here) and create something that generates and prints the diff directly rather than relying on diff. The various function methods inside difflib can receive character buffers which can be processed into diffs of various types.
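As a sketch of the difflib route, under the same assumptions about dumpfile and output as above:
import difflib

dumpfile = 'dumpfile.txt'  # placeholder path
output = "line1\nline2\n"  # placeholder for the in-memory stream contents

with open(dumpfile) as f:
    file_lines = f.readlines()

stream_lines = output.splitlines(keepends=True)
for line in difflib.unified_diff(file_lines, stream_lines,
                                 fromfile=dumpfile, tofile='stream'):
    print(line, end='')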
Alternatively, you can construct a shell pipeline and use process substitution like so
diff <(cat pipe) dumpfile # You compare the output of a process and a physical file without explicitly using a temporary file.
For details, check out http://tldp.org/LDP/abs/html/process-sub.html
