I've got a large Python project with several components that exchange information through JSON files. The project is our internal tool for analysis and integration testing, and our developers use it either from a web UI or from the command line.
The Python modules process a labeled database consisting of a large number of files, where the labels are encoded in the file names. For example, the file name ab001l_AS_5_15Fps_1.raw tells us that it stores data from user ab001l, collected in session number 1 under conditions that we encode as AS.
There are several such encodings.
JSON files usually store file names.
My question is: how can I save Python code into a JSON file, so that another module can load it and decode a file name into its components?
I guess you could store Python code as text in JSON and then use the exec built-in function to execute that text; see
https://docs.python.org/3/library/functions.html?highlight=exec#exec.
But it seems a much better approach to share the decoding module and import it like any other Python code.
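For example, here is a minimal sketch of the import approach, assuming a hypothetical shared module filename_codec.py and the naming pattern from the question (the meanings of the middle fields are project-specific, so they are kept generic here):

# filename_codec.py -- hypothetical shared module; every component imports
# this instead of passing decoder code around inside JSON files.
import os

def decode_filename(filename):
    """Split a name like 'ab001l_AS_5_15Fps_1.raw' into labeled parts."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    parts = stem.split('_')
    return {
        'user': parts[0],           # e.g. 'ab001l'
        'condition': parts[1],      # e.g. 'AS'
        'other': parts[2:-1],       # remaining encoded fields, e.g. ['5', '15Fps']
        'session': int(parts[-1]),  # e.g. 1
    }

Any module can then call decode_filename() on the names it reads from the JSON files, and the JSON files keep storing only file names.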
You can use jsonpickle. Please check the documentation page for usage.
import jsonpickle

class Thing(object):
    def __init__(self, name):
        self.name = name

obj = Thing('Awesome')
frozen = jsonpickle.encode(obj)
thawed = jsonpickle.decode(frozen)
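Since the encoded string is plain JSON text, one module could also write it to a file and another could read it back; a minimal sketch with a hypothetical file name thing.json:

with open('thing.json', 'w') as f:
    f.write(frozen)

with open('thing.json') as f:
    thawed = jsonpickle.decode(f.read())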
I have a bibtex file that I get from the frontend and I'm trying to parse this file with biblib (a Python library for parsing bibtex files). Because I get the file from the frontend, it's not stored in a file on my computer. The file gets passed through a variable from the frontend to Python and is then stored in the Python variable fileFromFrontend. So I can use, for example:
bibtexFile = fileFromFrontend.read()
to read the file.
Now I'm trying to do something like the following to print the parsed file in the Python terminal:
from pybtex.database.input import bibtex
parser = bibtex.Parser()
bibtexFile= parser.parse_file(fileFromFrontend)
print (bibtexFile.entries)
but then I get this error:
-->bibtexFile = parser.parse_file(filesFromFrontend)
-->with open_file(filename, encoding=self.encoding) as f:
-->AttributeError: __enter__
This is probably because the parser tries to open the file, but it doesn't have to open this file, it just needs to read it. I don't know which function of the biblib library to use for parsing the file from a variable and haven't found anything so far to solve my problem.
Hopefully somebody can help
thanks
According to the documentation ( https://docs.pybtex.org/api/parsing.html ) there are methods
parse_string and parse_bytes which could work.
So, something like this:
from pybtex.database.input import bibtex
parser = bibtex.Parser()
bibtexFile= parser.parse_bytes(fileFromFrontend.read())
print (bibtexFile.entries)
I don't have pybtex installed, so I couldn't try it myself, but give those methods a try. parse_bytes and parse_string need the bib format as a second parameter; in the examples that is bibtex, so I tried it here.
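If the frontend gives you text rather than bytes, the module-level helper in pybtex.database may also be worth a try; a small sketch (untested, so check the exact signature against the docs linked above):

from pybtex.database import parse_string

# fileFromFrontend.read() is assumed to return the bibtex source as a string
bib_data = parse_string(fileFromFrontend.read(), bib_format='bibtex')
print(bib_data.entries)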
In my pipeline I have a flow file that contains some data I'd like to add as attributes to the flow file. I know in Groovy I can add attributes to flow files, but I am less familiar with Groovy and much more comfortable with using Python to parse strings (which is what I'll need to do to extract the values of these attributes). The question is, can I achieve this in Python when I use ExecuteStreamCommand to read in a file with sys.stdin.read() and write out my file with sys.stdout.write()?
So, for example, I use the code below to extract the timestamp from my flowfile. How do I then add ts as an attribute when I'm writing out ff?
import sys
ff = sys.stdin.read()
t_split = ff.split('\t')
ts = t_split[0]
sys.stdout.write(ff)
Instead of writing back the entire file again, you can simply write the attribute value from the input FlowFile:
sys.stdout.write(ts)  # the timestamp in your case
and then, set the Output Destination Attribute property of the ExecuteStreamCommand processor with the desired attribute name.
Hence, the output of the stream command will be put into an attribute of the original FlowFile, and that FlowFile can be found in the original relationship queue.
For more details, you can refer to ExecuteStreamCommand-Properties
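Putting that together with the code from the question, the whole stream command script reduces to something like this sketch (the attribute name itself is whatever you put in Output Destination Attribute):

import sys

ff = sys.stdin.read()     # entire flow file content piped in by ExecuteStreamCommand
ts = ff.split('\t')[0]    # first tab-separated field, i.e. the timestamp
sys.stdout.write(ts)      # stdout becomes the value of the destination attribute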
If you're not importing any native (CPython) modules, you can try ExecuteScript with Jython rather than ExecuteStreamCommand. I have an example in Jython in an ExecuteScript cookbook. Note that you don't use stdin/stdout with ExecuteScript, instead you have to get the flow file from the session and either transfer it as-is (after you're done reading) or overwrite it (there are examples in the second part of the cookbook).
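For reference, here is a rough Jython sketch of that ExecuteScript approach, following the cookbook pattern; the attribute name 'ts' and the tab-splitting are carried over from the question, and the details may need adjusting for your NiFi version:

from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

class ReadFirstField(InputStreamCallback):
    """Reads the flow file content and remembers the first tab-separated field."""
    def __init__(self):
        self.ts = None
    def process(self, inputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        self.ts = text.split('\t')[0]

flowFile = session.get()
if flowFile is not None:
    callback = ReadFirstField()
    session.read(flowFile, callback)                              # content is left as-is
    flowFile = session.putAttribute(flowFile, 'ts', callback.ts)  # add the new attribute
    session.transfer(flowFile, REL_SUCCESS)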
In a Python project that I collaborate on, we initially intend to parse information from an input fasta file into a dictionary.
The parsing method is already implemented (here and here), and the problem is: the code works fine when running in Python 3 (the fasta file is loaded, its information is parsed into the FDB data structure, and then it's saved in a new fdb file), but when it runs in Python 2 the generated dictionary doesn't contain the value information from the read fasta file, just the keys.
The links above show the code developed for parsing, and the block below contains the test we execute (which works fine with Python 3 but doesn't save the fasta information in Python 2).
print("Instantiating a FastaDB object...")
fasta_db = FastaDB()
print("Defining input file name...")
filename = "../FastaDB/test2.fasta"
username = "inacio_medeiros"
print("Invoking FDB parsing...")
parsed_fdb_structure = fasta_db.ImportFasta(filename, username)
print("Saving in file...")
content = json.dumps(parsed_fdb_structure)
fdb_file_name = filename+".fdb"
fdb_file = open(fdb_file_name, "w")
fdb_file.write(content)
Does anyone have an idea why dictionaries are working fine in Python 3, but not in Python 2?
The problem is not the dictionary, but how the classes were created. While in Python 3 all classes inherit from object (unless you explicitly make them inherit from another class), in Python 2 they don't.
Hence, class A() in Python 3 is the same as class A(object), but in Python 2 they are different things: the latter is a "new-style class", while the former is an "old-style class". I'm a Python 3 guy too, so this is new for me, but you can find more information in this SO thread.
TL;DR: just replace class FDBRegister(): with class FDBRegister(object): and it will work! I tested it here ;)
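A minimal illustration of the difference in Python 2 (hypothetical class names; in Python 3 both forms behave the same):

class Old():           # old-style class in Python 2
    pass

class New(object):     # new-style class in Python 2 (the only kind in Python 3)
    pass

print(type(Old()))     # Python 2 prints: <type 'instance'>
print(type(New()))     # Python 2 prints: <class '__main__.New'>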
I have two python scripts, scriptA and scriptB, which run on Unix systems. scriptA takes 20s to run and generates a number X. scriptB needs X when it is run and takes around 500ms. I need to run scriptB every day but scriptA only once every month, so I don't want to run scriptA from scriptB. I also don't want to manually edit scriptB each time I run scriptA. I thought of updating a file through scriptA, but I'm not sure where such a file would ideally be placed so that scriptB can read it later, independent of the location of these two scripts. What is the best way of storing this value X on a Unix system so that it can be used later by scriptB?
Many programs on Linux/Unix keep their config in /etc/ and use a subfolder in /var/ for other files.
But you would probably need root privileges for that.
If you run the scripts from your home folder, then you could create a file ~/.scriptB.rc or a folder ~/.scriptB/ or ~/.config/scriptB/.
See also the Filesystem Hierarchy Standard on Wikipedia.
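A minimal sketch of that idea, assuming a per-user directory ~/.config/scriptB/ and a hypothetical file name x_value:

import os

CONFIG_DIR = os.path.expanduser('~/.config/scriptB')
VALUE_FILE = os.path.join(CONFIG_DIR, 'x_value')

# in scriptA: store X
def save_x(x):
    if not os.path.isdir(CONFIG_DIR):
        os.makedirs(CONFIG_DIR)
    with open(VALUE_FILE, 'w') as f:
        f.write(str(x))

# in scriptB: read X back
def load_x():
    with open(VALUE_FILE) as f:
        return f.read().strip()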
It sounds like you want to serialize ScriptA's results, save it in a file or database somewhere, then have ScriptB read those results (possibly also modifying the file or updating the database entry to indicate that those results have now been processed).
To make that work you need for ScriptA and ScriptB to agree on the location and format of the data ... and you might want to implement some sort of locking to ensure that ScriptB doesn't end up with corrupted inputs if it happens to be run at the same time that ScriptA is writing or updating the data (and, conversely, that ScriptA doesn't corrupt the data store by writing thereto while ScriptB is accessing it).
Of course ScriptA and ScriptB could each have a filename or other data location hard-coded into their sources. However, that would violate the DRY Principle. So you might want them to share a configuration file. (Of course the configuration filename is also repeated in these sources ... or at least the import of the common bit of configuration code ... but the latter still ensures that an installation/configuration detail (the location and, possibly, format of the data store) is decoupled from the source code. Thus it can be changed (in the shared config) without affecting the rest of the code for either script.)
As for precisely which type of file and serialization to use ... that's a different question.
These days, as strange as it may sound, I'd suggest using SQLite3. It may seem like overkill to use an SQL "database" for simply storing a single value. However, SQLite3 is included in the Python standard library, and it only needs a filename for configuration.
You could also use a pickle or JSON or even YAML (which would require a third party module) ... or even just text or some binary representation using something like struct. However, any of those will require that you parse your results and deal with any parsing or formatting errors. JSON would be the simplest option among these alternatives. Additionally you'd have to do your own file locking and handling if you wanted ScriptA and ScriptB (and, potentially, any other scripts you ever write for manipulating this particular data) to be robust against any chance of concurrent operations.
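For comparison, the JSON file-based alternative might look like this minimal sketch (hypothetical path, no locking):

import json

DATA_FILE = '/var/tmp/scriptA_result.json'   # hypothetical shared location

# ScriptA: store the computed value X
with open(DATA_FILE, 'w') as f:
    json.dump({'value': X}, f)

# ScriptB: read it back
with open(DATA_FILE) as f:
    X = json.load(f)['value']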
The advantage of SQLite3 is that it handles the parsing and decoding and the locking and concurrency for you. You create the table once (perhaps embedded in ScriptA as a rarely used "--initdb" option for occasions when you need to recreate the data store). Your code to read it might look as simple as:
#!/usr/bin/python
import sqlite3
db = sqlite3.connect('./foo.db')
cur = db.cursor()
results = cur.execute('SELECT value, MAX(date) FROM results').fetchone()[0]
... and writing a new value would look a bit like:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('INSERT INTO results (value) VALUES (?)', (myvalue,))
All of this assuming you had, at some time, initialized the data store (foo.db in this example) with something like:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('CREATE TABLE IF NOT EXISTS results (value INTEGER NOT NULL, date TIMESTAMP DEFAULT current_timestamp)')
(Actually you could just execute that command every time if you wanted your scripts to recover silently from the old data being cleaned out).
This might seem like more code than a JSON file-based approach. However, SQLite3 provides ACID (transactional) semantics as well as abstracting away the serialization and deserialization.
Also note that I'm glossing over a few details. My examples above actually create a whole table of results, with timestamps for when they were written to your datastore. These would accumulate over time and, if you were using this approach, you'd periodically want to clean up your "results" table with a command like:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('DELETE FROM results WHERE date < ?', cur.execute('SELECT MAX(date) FROM results').fetchone())
Alternatively, if you really never want access to your prior results, then change the INSERT into an UPDATE like so:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('UPDATE results SET value=(?)', (mynewvalue,))
(Also note that (mynewvalue,) is a single-element tuple. The DB-API requires that our parameters be wrapped in tuples, which is easy to forget when you first start using it with single parameters such as this.)
Obviously, if you took this UPDATE-only approach, you could drop the 'date' column from the 'results' table and all those references to MAX(date) from the queries.
I chose to use the slightly more complex schema in my earlier examples because it allows your scripts to be a bit more robust with very little additional complexity. You could then do other error checking (detecting missing values where ScriptB finds that ScriptA hasn't been run as intended, for example).
Edit/run crontab -e:
# this will run every month on the 25th at 2am
0 2 25 * * python /path/to/scriptA.py > /dev/null
# this will run every day at 2:10 am
10 2 * * * python /path/to/scriptB.py > /dev/null
Create an external file for both scripts:
In scriptA:
>>> with open('/path/to/test_doc','w+') as f:
... f.write('1')
...
In scriptB:
>>> with open('/path/to/test_doc','r') as f:
... v = f.read()
...
>>> v
'1'
You can take a look at PyPubSub
It's a Python package which provides a publish-subscribe API that facilitates event-based programming.
It'll give you an OS-independent solution to your problem and only requires a few additional lines of code in both A and B.
Also, you don't need to handle messy files!
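A minimal PyPubSub sketch; note that this works when A and B run inside the same Python process, so it fits a long-running application rather than two independently scheduled scripts:

from pubsub import pub

def on_x_ready(x):                 # "B" side: listener
    print('received X =', x)

pub.subscribe(on_x_ready, 'x_ready')

# "A" side: publish the value once it is computed
pub.sendMessage('x_ready', x=42)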
Assuming you are not running the two scripts at the same time, you can (pickle and) save the go-between object anywhere, as long as when you load and save the file you point to the same system path. For example:
import pickle # or import cPickle as pickle
# Create a python object like a dictionary, list, etc.
favorite_color = { "lion": "yellow", "kitty": "red" }
# Write to file ScriptA
f_myfile = open('C:\\My Documents\\My Favorite Folder\\myfile.pickle', 'wb')
pickle.dump(favorite_color, f_myfile)
f_myfile.close()
# Read from file ScriptB
f_myfile = open('C:\\My Documents\\My Favorite Folder\\myfile.pickle', 'rb')
favorite_color = pickle.load(f_myfile) # variables come out in the order you put them in
f_myfile.close()
I got a file that contains a data structure with test results from a Windows user. He created this file using the pickle.dump command. On Ubuntu, I tried to load these test results with the following program:
import pickle
import my_module
f = open('results', 'r')
print pickle.load(f)
f.close()
But I get an error inside the pickle module saying there is no module named "my_module".
Could the problem be due to corruption in the file, or is moving from Windows to Linux the cause?
The problem lies in pickle's way of handling newline characters. Some of the line-feed characters cripple module names in the dumped/loaded data.
Storing and loading files in binary mode may help, but I was having trouble with that too. After a long time reading docs and searching I found that pickle supports several different "protocols" for storing data, and due to backward compatibility it uses the oldest one by default: protocol 0, the original ASCII protocol.
You can select a more modern protocol by specifying the protocol keyword while dumping the data, something like this:
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=2)
or by choosing the highest protocol available (currently 2):
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=pickle.HIGHEST_PROTOCOL)
The protocol version is stored in the dump file, so the load() function handles it automatically.
Regards
You should open the pickled file in binary mode, especially if you are using pickle on different platforms. See this and this question for an explanation.
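Applied to the code from the question, that just means opening in 'rb' mode (and, since the pickle references classes from my_module, that module still has to be importable on the Ubuntu machine):

import pickle
import my_module  # must be importable, since the pickle refers to objects defined there

with open('results', 'rb') as f:   # binary mode is the important part
    print(pickle.load(f))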