In my pipeline I have a flow file that contains some data I'd like to add as attributes to the flow file. I know in Groovy I can add attributes to flow files, but I am less familiar with Groovy and much more comfortable with using Python to parse strings (which is what I'll need to do to extract the values of these attributes). The question is, can I achieve this in Python when I use ExecuteStreamCommand to read in a file with sys.stdin.read() and write out my file with sys.stdout.write()?
So, for example, I use the code below to extract the timestamp from my flowfile. How do I then add ts as an attribute when I'm writing out ff?
import sys
ff = sys.stdin.read()
t_split = ff.split('\t')
ts = t_split[0]
sys.stdout.write(ff)
Instead of writing back the entire file again, you can simply write the attribute value from the input FlowFile
sys.stdout.write(ts)  # the timestamp, in your case
and then, set the Output Destination Attribute property of the ExecuteStreamCommand processor with the desired attribute name.
The output of the stream command is then stored in that attribute of the original FlowFile, which is routed to the original relationship.
For more details, you can refer to ExecuteStreamCommand-Properties
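A minimal sketch of such a script, assuming (as in the question) that the timestamp is the first tab-separated field. The helper name write_timestamp and the sample content are hypothetical. In NiFi you would call it with sys.stdin and sys.stdout; the in-memory streams below only stand in for them so the sketch runs anywhere.

```python
import io


def write_timestamp(in_stream, out_stream):
    """Read the FlowFile content and write only its first tab-separated field."""
    content = in_stream.read()
    ts = content.split('\t')[0]
    out_stream.write(ts)
    return ts


# Stand-ins for sys.stdin / sys.stdout (assumption: made-up sample content).
# In the real script: write_timestamp(sys.stdin, sys.stdout)
fake_stdin = io.StringIO("2017-08-01 12:00:00\tsome\tpayload\n")
fake_stdout = io.StringIO()
write_timestamp(fake_stdin, fake_stdout)
captured = fake_stdout.getvalue()
```

Whatever the script writes to stdout is what ends up in the attribute, so nothing else (such as debug output) should be printed there.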
If you're not importing any native (CPython) modules, you can try ExecuteScript with Jython rather than ExecuteStreamCommand. I have an example in Jython in an ExecuteScript cookbook. Note that you don't use stdin/stdout with ExecuteScript, instead you have to get the flow file from the session and either transfer it as-is (after you're done reading) or overwrite it (there are examples in the second part of the cookbook).
I'm confused on how exactly we should use the python sh library, specifically the sh.Command(). Basically, I wish to pass input_file_a to program_b.py and store its output in a different directory as output_file_b, how should I achieve this using the sh library in python?
If you mean input and output redirection, then see here (in) and here (out) respectively. It looks like, to "redirect" stdin, you need to pass the actual bytes as an argument (e.g. read them beforehand). According to their documentation, the following should work (untested, as I don't work with sh; please let me know if it works for you, or fix whatever is missing):
import sh
python3 = sh.Command("python3")
with open(input_file_a, 'r') as ifile:
    python3("program_b.py", _in=ifile.read(), _out=output_file_b)
Note that you may need to pass the search_paths argument to sh.Command for it to find python3. You may also need to give the full path to program_b.py, or call os.chdir() accordingly.
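For comparison, here is the same redirection done with the standard library's subprocess module instead of sh. This is a swapped-in technique, not the sh API. The helper name run_with_redirection is hypothetical, and the program_b.py written below is only a stand-in that upper-cases its input so the sketch can run on its own.

```python
import os
import subprocess
import sys
import tempfile


def run_with_redirection(script, input_path, output_path):
    """Run `script` with stdin taken from input_path and stdout sent to output_path."""
    with open(input_path, 'rb') as src, open(output_path, 'wb') as dst:
        subprocess.run([sys.executable, script], stdin=src, stdout=dst, check=True)


# Throwaway files for the demo; the output goes to a different directory.
workdir = tempfile.mkdtemp()
out_dir = os.path.join(workdir, "out")
os.makedirs(out_dir)
program_b = os.path.join(workdir, "program_b.py")
input_file_a = os.path.join(workdir, "input_file_a")
output_file_b = os.path.join(out_dir, "output_file_b")

with open(program_b, "w") as f:
    f.write("import sys\nsys.stdout.write(sys.stdin.read().upper())\n")
with open(input_file_a, "w") as f:
    f.write("hello\n")

run_with_redirection(program_b, input_file_a, output_file_b)
with open(output_file_b) as f:
    result = f.read()
```

Passing open file objects lets the child process read and write the files directly, so the input never has to be loaded into memory first.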
I'm developing uTensor and the code generator for it, utensor_cgen.
https://github.com/uTensor/uTensor
https://github.com/uTensor/utensor_cgen
For the code generator, I have to parse a quantized graph_def protocol buffer file (the pb file).
However, I found that there are some input nodes that are not visible in TensorBoard but are visible in graph_def.node.
Their name starts with a ^ and when I look into tf.import_graph_def source code, I found this _IsControlInput function:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/framework/importer.py#L92
I really need to know the definition of "control input" so I can properly parse the pb file and convert it into a correct uTensor implementation targeting MCU.
Can somebody tell me where I can find the definition?
It doesn't seem to be in the documentation.
Thanks.
It might be related to control dependencies. https://www.tensorflow.org/api_docs/python/tf/Graph#control_dependencies
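In a GraphDef, each entry in a node's input list names a producer node, optionally followed by ":index" to pick a specific output. Entries prefixed with "^" are control inputs: they carry no tensor and only require that the named node runs first. A minimal sketch of separating the two kinds, using plain strings so it runs without TensorFlow (the helper name split_inputs and the node names are hypothetical):

```python
def split_inputs(input_names):
    """Split a NodeDef-style input list into data inputs and control inputs."""
    data_inputs = []
    control_inputs = []
    for name in input_names:
        if name.startswith('^'):
            # Control input: strip the marker to get the node that must run first.
            control_inputs.append(name[1:])
        else:
            data_inputs.append(name)
    return data_inputs, control_inputs


# With a real graph you would pass list(node.input) for each node in graph_def.node.
data, control = split_inputs(["x_quantized", "weights:1", "^init_op"])
```

For a code generator, the data inputs determine which tensors an operator consumes, while the control inputs only constrain execution order.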
I'm importing an mp3 file using IPython (more specifically, the IPython.display.display(IPython.display.Audio()) command) and I wanted to know if there was a specific way you were supposed to format the file path.
The documentation says it takes the file path, so I assumed (perhaps incorrectly) that it should be something like \home\downloads\randomfile.mp3, which I used an online converter to convert into Unicode. I put that in (using, of course, filename=u'unicode here') but that didn't work, instead giving a bunch of errors. I tried reformatting it in different ways (just \downloads\randomfile.mp3, etc.) but none of them worked. For those who are curious, here are the Unicode characters: \u005c\u0044\u006f\u0077\u006e\u006c\u006f\u0061\u0064\u0073\u005c\u0062\u0064\u0061\u0079\u0069\u006e\u0073\u0074\u0072\u0075\u006d\u0065\u006e\u0074\u002e\u006d\u0070\u0033 which translates to \home\Downloads\bdayinstrument.mp3, I believe.
So, am I doing something wrong? What is the correct way to format the "file path"?
Thanks!
I have a python script that analyzes a set of error messages and checks for each message if it matches a certain pattern (regular expression) in order to group these messages. For example "file x does not exist" and "file y does not exist" would match "file .* does not exist" and be accounted as two occurrences of "file not found" category.
As the number of patterns and categories is growing, I'd like to put these "regular expression/display string" pairs in a configuration file, basically a dictionary serialization of some sort.
I would like this file to be editable by hand, so I'm discarding any form of binary serialization, and also I'd rather not resort to xml serialization to avoid problems with characters to escape (& <> and so on...).
Do you have any idea of what could be a good way of accomplishing this?
Update: thanks to Daren Thomas and Federico Ramponi, but I cannot have an external python file with possibly arbitrary code.
I sometimes just write a Python module (i.e. file) called config.py or something, with the following contents:
config = {
    'name': 'hello',
    'see?': 'world'
}
this can then be 'read' like so:
from config import config
config['name']
config['see?']
easy.
You have two decent options:
Python standard config file format, using ConfigParser
YAML, using a library like PyYAML
The standard Python configuration files look like INI files with [sections] and key : value or key = value pairs. The advantages to this format are:
No third-party libraries necessary
Simple, familiar file format.
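A minimal sketch of the INI approach for the regex/message pairs, assuming Python 3 (where the module is named configparser). The section name [patterns] and the layout are assumptions: the category is used as the key and the regular expression as the value, since a regex may contain ":" or "=", which would be read as delimiters in a key. Interpolation is turned off so that a "%" in a pattern is taken literally.

```python
import configparser
import re

# Stand-in for the contents of a hand-edited config file.
INI_TEXT = """
[patterns]
file not found = file .* does not exist
authorization error = user .* not found
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(INI_TEXT)  # use parser.read("errors.ini") for a real file

# Compile once: a list of (compiled_regex, category) pairs.
compiled = [(re.compile(regex), category)
            for category, regex in parser.items("patterns")]


def categorize(message):
    """Return the category of the first matching pattern, or None."""
    for regex, category in compiled:
        if regex.search(message):
            return category
    return None
```

Note that ConfigParser lower-cases keys by default, so category names come back in lower case unless optionxform is overridden.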
YAML is different in that it is designed to be a human friendly data serialization format rather than specifically designed for configuration. It is very readable and gives you a couple different ways to represent the same data. For your problem, you could create a YAML file that looks like this:
file .* does not exist : file not found
user .* not found : authorization error
Or like this:
{ file .* does not exist: file not found,
user .* not found: authorization error }
Using PyYAML couldn't be simpler:
import yaml
with open('my.yaml') as f:
    errors = yaml.safe_load(f)  # plain yaml.load() needs an explicit Loader in current PyYAML
At this point errors is a Python dictionary with the expected format. YAML is capable of representing more than dictionaries: if you prefer a list of pairs, use this format:
-
  - file .* does not exist
  - file not found
-
  - user .* not found
  - authorization error
Or
[ [file .* does not exist, file not found],
[user .* not found, authorization error]]
Either form produces a list of lists when the file is loaded.
One advantage of YAML is that you could use it to export your existing, hard-coded data out to a file to create the initial version, rather than cut/paste plus a bunch of find/replace to get the data into the right format.
The YAML format will take a little more time to get familiar with, but using PyYAML is even simpler than using ConfigParser, with the advantage that YAML gives you more options for how your data is represented.
Either one sounds like it will fit your current needs. ConfigParser will be easier to start with, while YAML gives you more flexibility in the future if your needs expand.
Best of luck!
I've heard that ConfigObj is easier to work with than ConfigParser. It is used by a lot of big projects: IPython, Trac, TurboGears, etc.
From their introduction:
ConfigObj is a simple but powerful config file reader and writer: an ini file round tripper. Its main feature is that it is very easy to use, with a straightforward programmer's interface and a simple syntax for config files. It has lots of other features though:
Nested sections (subsections), to any level
List values
Multiple line values
String interpolation (substitution)
Integrated with a powerful validation system
    including automatic type checking/conversion
    repeated sections
    and allowing default values
When writing out config files, ConfigObj preserves all comments and the order of members and sections
Many useful methods and options for working with configuration files (like the 'reload' method)
Full Unicode support
I think you want the ConfigParser module in the standard library. It reads and writes INI style files. The examples and documentation in the standard documentation I've linked to are very comprehensive.
If you are the only one that has access to the configuration file, you can use a simple, low-level solution. Keep the "dictionary" in a text file as a list of tuples (regexp, message) exactly as if it was a python expression:
[
("file .* does not exist", "file not found"),
("user .* not authorized", "authorization error")
]
In your code, load it, then eval it, and compile the regexps in the result:
import re
with open("messages.py") as f:
    messages = eval(f.read())  # caution: you must be sure of what's in that file
messages = [(re.compile(r), m) for (r, m) in messages]
and you end up with a list of tuples (compiled_regexp, message).
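If the file cannot be fully trusted (the question's update rules out arbitrary code), one possible variation is to swap eval for ast.literal_eval from the standard library, which accepts only Python literals such as strings, tuples and lists, and raises an error on anything else. A minimal sketch, with the file contents inlined as a string and a hypothetical helper name load_patterns:

```python
import ast
import re

# Stand-in for the text read from the configuration file.
CONFIG_TEXT = '''
[
    ("file .* does not exist", "file not found"),
    ("user .* not authorized", "authorization error"),
]
'''


def load_patterns(text):
    """Parse a literal list of (regexp, message) tuples and compile the regexps."""
    raw = ast.literal_eval(text.strip())  # literals only: no calls, no names
    return [(re.compile(r), m) for (r, m) in raw]


messages = load_patterns(CONFIG_TEXT)
```

The file format stays exactly the same, so this is a drop-in change on the loading side only.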
I typically do as Daren suggested, just make your config file a Python script:
patterns = {
    'file .* does not exist': 'file not found',
    'user .* not found': 'authorization error',
}
Then you can use it as:
import re
import config
for pattern in config.patterns:
    if re.search(pattern, log_message):
        print(config.patterns[pattern])
This is what Django does with their settings file, by the way.
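To get the per-category counts described in the question, the same dictionary can be combined with collections.Counter. This is only a sketch: the helper name count_categories and the sample messages are made up, and the patterns dictionary is inlined here rather than imported from config.py so that the example is self-contained.

```python
import re
from collections import Counter

patterns = {
    'file .* does not exist': 'file not found',
    'user .* not found': 'authorization error',
}


def count_categories(log_messages, patterns):
    """Count how many messages fall into each category (first match wins)."""
    compiled = [(re.compile(p), label) for p, label in patterns.items()]
    counts = Counter()
    for message in log_messages:
        for regex, label in compiled:
            if regex.search(message):
                counts[label] += 1
                break
    return counts


counts = count_categories(
    ["file x does not exist", "file y does not exist", "user bob not found"],
    patterns,
)
```

Compiling the patterns once up front avoids recompiling them for every message.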