I'm trying to retrieve pickle data I have uploaded to an OpenStack object storage using openstacksdk's connection.get_object(container, object). I get a response from it, but the file body is a string; I can even save it to a file with the outfile option without issues. However, I would like to work with the data directly, without having to save it to a file first and then load it into pickle.
Simply using pickle's load and loads doesn't work, as neither takes string objects. Is there another way to retrieve the data so I can work with the pickled data directly, or is there some way to convert the string / set a config parameter on get_object()?
If you are using Python 3, pickle expects a bytes-like object. The load method takes a file object opened in binary mode and relies on that object to provide bytes back to pickle. When you use the loads method you need to provide it a bytes-like object, not a string, so you will need to convert the string to bytes.
Best way to convert string to bytes in Python 3?
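For illustration, a minimal sketch of that conversion (latin-1 round-trips arbitrary byte values, assuming the string was produced by a latin-1 decode of the original bytes):
import pickle

obj = {'a': 1}
data = pickle.dumps(obj)    # pickled payload as bytes
print(pickle.loads(data))   # works: bytes-like input

# If the payload arrives as a str, encode it back to bytes first.
# latin-1 preserves every byte value on a decode/encode round trip.
as_str = data.decode('latin-1')
print(pickle.loads(as_str.encode('latin-1')))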
EDIT:
I found the solution. For pickled objects, or any other files retrieved from OpenStack with openstacksdk, there are a few ways of dealing with the data without resorting to disk.
First, my implemented solution was to use the connection's get_object_raw method:
import pickle
from openstack import connection

conn = connection.Connection(**auth_args)  # auth placeholders as in the question
pickle.loads(conn.get_object_raw('containerName', 'ObjectName').content)
get_object_raw returns a requests Response object whose content attribute holds the binary file content, which is the pickle payload one can load with pickle.loads.
You could also create a temporary in-memory file with io.BytesIO and use it as the outfile argument in get_object on the connection object.
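A sketch of that approach, assuming get_object accepts a file-like object as the outfile argument described in the question:
import io
import pickle

buf = io.BytesIO()
conn.get_object('containerName', 'ObjectName', outfile=buf)  # write body into memory
buf.seek(0)              # rewind before unpickling
data = pickle.load(buf)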
In my pipeline I have a flow file that contains some data I'd like to add as attributes to the flow file. I know in Groovy I can add attributes to flow files, but I am less familiar with Groovy and much more comfortable with using Python to parse strings (which is what I'll need to do to extract the values of these attributes). The question is, can I achieve this in Python when I use ExecuteStreamCommand to read in a file with sys.stdin.read() and write out my file with sys.stdout.write()?
So, for example, I use the code below to extract the timestamp from my flowfile. How do I then add ts as an attribute when I'm writing out ff?
import sys

ff = sys.stdin.read()     # incoming FlowFile content on stdin
t_split = ff.split('\t')
ts = t_split[0]           # first tab-separated field is the timestamp
sys.stdout.write(ff)      # stdout becomes the outgoing FlowFile content
Instead of writing back the entire file again, you can simply write the attribute value from the input FlowFile:
sys.stdout.write(ts)  # timestamp in your case
and then set the Output Destination Attribute property of the ExecuteStreamCommand processor to the desired attribute name.
Hence, the output of the stream command will be put into an attribute of the original FlowFile, which can then be found in the original relationship queue.
For more details, you can refer to ExecuteStreamCommand-Properties
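Putting it together, a minimal sketch of the script for this setup (the attribute name itself is configured on the processor, not in the script):
import sys

ff = sys.stdin.read()   # incoming FlowFile content
ts = ff.split('\t')[0]  # extract the timestamp field
sys.stdout.write(ts)    # this output becomes the attribute value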
If you're not importing any native (CPython) modules, you can try ExecuteScript with Jython rather than ExecuteStreamCommand. I have an example in Jython in an ExecuteScript cookbook. Note that you don't use stdin/stdout with ExecuteScript; instead you have to get the flow file from the session and either transfer it as-is (after you're done reading) or overwrite it (there are examples in the second part of the cookbook).
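For illustration, a rough Jython sketch of that pattern, assuming the standard ExecuteScript bindings (session, REL_SUCCESS) and an attribute name of ts (adapt as needed):
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

class ReadTimestamp(InputStreamCallback):
    def __init__(self):
        self.ts = None
    def process(self, inputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        self.ts = text.split('\t')[0]

flowFile = session.get()
if flowFile is not None:
    callback = ReadTimestamp()
    session.read(flowFile, callback)  # read content without modifying it
    flowFile = session.putAttribute(flowFile, 'ts', callback.ts)
    session.transfer(flowFile, REL_SUCCESS)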
I am using MQTT for the first time to transfer some binary files. So far I have no issues transferring them using code like the one below:
import paho.mqtt.client as paho

with open("./file_name.csv.gz", "rb") as f:
    file_content = f.read()

byte_array = bytearray(file_content)

mqttc = paho.Client()
mqttc.will_set("/event/dropped", "Sorry, I seem to have died.")
mqttc.connect(...)  # connection definition here
mqttc.publish("hello/world", byte_array)
However, together with the file itself there is some extra info I want to send (the original file name, creation date, etc.). I can't find any proper way to transfer it using MQTT. Is there any way to do that, or do I need to add that info to the message byteArray itself? How would I do that?
You need to build your own data structure to hold the file and its metadata.
How you build that structure is up to you. A couple of options would be (a sketch follows the list):
base64/uuencode-encode the file and add it as a field in a JSON object, save the metadata as other fields, then publish the JSON object.
Build a Python dict with the file as one field and the metadata as other fields, then use pickle to serialise the dict.
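A rough sketch of the first option; the field names, broker host, and topic are illustrative only:
import base64
import json
import time

import paho.mqtt.client as paho

with open("./file_name.csv.gz", "rb") as f:
    payload = {
        "filename": "file_name.csv.gz",
        "created": time.time(),
        "data": base64.b64encode(f.read()).decode("ascii"),  # bytes -> JSON-safe text
    }

mqttc = paho.Client()
mqttc.connect("broker.example.com")  # assumed broker host
mqttc.publish("hello/world", json.dumps(payload))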
I have created an hdf5 file using the file = open() command. In this case, I can write and read the file, but it gives me an attribute error when I try file.keys(). The error is AttributeError: 'file' object has no attribute 'keys'.
Then I created a new hdf5 file using the file = h5py.File() command. In this case, I can read the file and use file.keys() without any error, but I cannot write to the file. The error is AttributeError: 'File' object has no attribute 'write'.
What are the reasons behind these errors? Is there any difference between the 'file' object and the 'File' object?
open() returns an object of type file, which is the built-in standard Python type for representing a file. This has quite a simple, low-level interface, and you would use it if you were reading a text file or parsing the content (be that text or binary) yourself. You can read the docs on the methods the file type has here - https://docs.python.org/2/library/stdtypes.html#bltin-file-objects
h5py.File() returns a different type of object that has additional functionality for handling the hdf5 format and provides its own, different API, e.g. the keys() method you mention.
When opening an h5py.File() you must specify how you want to open it, e.g. r+ for read/write mode. Someone with a better understanding of the h5py library may be able to give a fuller explanation, but the reason you cannot call write() on the h5py.File() object is simply that it has no write method, as the error message suggests.
Check out the API docs for h5py; it provides different methods for writing different data to the file - http://docs.h5py.org/en/latest/high/dataset.html
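For example, a minimal sketch of writing and then reading back a dataset (file and dataset names are illustrative):
import h5py
import numpy as np

with h5py.File('example.h5', 'w') as f:   # 'w' opens for writing
    f.create_dataset('my_data', data=np.arange(10))

with h5py.File('example.h5', 'r') as f:   # 'r' opens for reading
    print(list(f.keys()))    # ['my_data']
    print(f['my_data'][:])   # [0 1 2 3 4 5 6 7 8 9]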
I'm trying to decide on the best internal interface to use in my code, specifically around how to handle file contents. Really, the file contents are just binary data, so bytes is sufficient to represent them.
I'm storing files in different remote locations, so have a couple of different classes for reading and writing. I'm trying to figure out the best interface to use for my functions. Originally I was using file paths, but that was suboptimal because it meant that disk was always used (which meant lots of clumsy tempfiles).
There are several areas of the code that have the same requirement, and would directly use whatever was returned from this interface. As a result whatever abstraction I choose will touch a fair bit of code.
What are the various tradeoffs to using BytesIO vs bytes?
def put_file(location, contents_as_bytes):   # accepts raw bytes
def put_file(location, contents_as_fp):      # accepts a file-like object
def get_file_contents(location):             # returns bytes
def get_file_contents(location, fp):         # writes into a file-like object
Playing around, I've found that using the file-like interfaces (BytesIO, etc.) requires a bit of administrative overhead in terms of seek(0) etc. That raises questions like (see the sketch after this list):
is it better to seek before you start, or after you've finished?
do you seek to the start or just operate from the position the file is in?
should you tell() to maintain the position?
looking at something like shutil.copyfileobj, it doesn't do any seeking
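To make the overhead concrete, here is the rewind dance a BytesIO typically needs between writing and reading:
import io

buf = io.BytesIO()
buf.write(b'payload')  # pointer is now at the end of the buffer
buf.seek(0)            # rewind before handing off for reading
print(buf.read())      # b'payload'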
One advantage I've found with using file-like interfaces instead is that they allow passing in the fp to write into when you're retrieving data, which seems to give a good deal of flexibility.
import io

def get_file_contents(location, write_into=None):
    if write_into is None:
        write_into = io.BytesIO()
    # get the contents and put it into write_into
    return write_into
get_file_contents('blah', file_on_disk)
get_file_contents('blah', gzip_file)
get_file_contents('blah', temp_file)
get_file_contents('blah', bytes_io)
new_bytes_io = get_file_contents('blah')
# etc
Is there a good reason to prefer BytesIO over just using fixed bytes when designing an interface in python?
The benefit of io.BytesIO objects is that they implement a common-ish interface (commonly known as a 'file-like' object). BytesIO objects have an internal pointer (whose position is returned by tell()) and for every call to read(n) the pointer advances n bytes. Ex.
import io
buf = io.BytesIO(b'Hello world!')
buf.read(1) # Returns b'H'
buf.tell() # Returns 1
buf.read(1) # Returns b'e'
buf.tell() # Returns 2
# Set the pointer to 0.
buf.seek(0)
buf.read(1) # Returns b'H', like the first call.
In your use case, neither the bytes object nor the io.BytesIO object may be the best solution, since both hold the complete contents of your files in memory.
Instead, you could look at tempfile.TemporaryFile (https://docs.python.org/3/library/tempfile.html).
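A small sketch of that alternative: a TemporaryFile behaves like a regular file object but is backed by disk, so the full contents never have to sit in memory at once:
import tempfile

with tempfile.TemporaryFile() as tmp:
    tmp.write(b'some file contents')  # stream chunks in here
    tmp.seek(0)                       # rewind before reading back
    print(tmp.read(4))                # b'some'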
I have an issue with Python.
My case: I have a gzipped file from a partner platform (i.e. h..p//....namesite.../xxx).
If I click the link in my browser, it downloads a file (e.g. namefile.xml.gz).
So... if I read this file with Python, I can decompress and read it.
Code:
import gzip

content = gzip.open('namefile.xml.gz', 'rb')
print content.read()
But I can't do this if I read the file from the remote source. From the remote file I can only read the compressed data; I can't decompress it.
Code:
import urllib2

response = urllib2.urlopen(url)
encoded = response.read()  # raw gzip-compressed bytes
print encoded
With this code I can read the compressed data... but I can't decompress it with gzip or zlib.
Any advice?
Thanks a lot
Unfortunately the method @Aya suggests does not work, since GzipFile makes extensive use of the seek method of the file object (not supported by response).
So you have basically two options:
Read the contents of the remote file into io.BytesIO (gzip operates on bytes, and BytesIO is seekable), and pass the object into gzip.GzipFile (if the file is small; a sketch follows the list)
download the file into a temporary file on disk, and use gzip.open (if the file is large)
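A sketch of the first option (url as in the question; Python 2 style to match the code above):
import gzip
import io
import urllib2

response = urllib2.urlopen(url)
buf = io.BytesIO(response.read())            # buffer the whole body in memory
content = gzip.GzipFile(fileobj=buf).read()  # BytesIO supports seek, so this works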
There is another option (which requires some coding): implement your own reader using the zlib module. It is rather easy, but you will need to know about a magic constant (How can I decompress a gzip stream with zlib?).
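The magic constant in question is the wbits argument; a minimal sketch of the zlib approach (compressed_chunk stands for data read from the response):
import zlib

# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer.
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
content = d.decompress(compressed_chunk)  # feed chunks as they arrive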
If you use Python 3.2 or later, the bug in GzipFile (requiring tell support) is fixed, but they apparently aren't going to backport the fix to Python 2.x.
For Python v3.2 or later, you can use the gzip.GzipFile class to wrap the file object returned by urllib2.urlopen() (urllib.request.urlopen() in Python 3), with something like this...
import urllib2
import gzip

response = urllib2.urlopen(url)
gunzip_response = gzip.GzipFile(fileobj=response)  # wrap the response stream
content = gunzip_response.read()                   # decompresses as it reads
print content
...which will transparently decompress the response stream as you read it.