I have an API that returns a .gz file. The application from which I am calling the API accepts only JSON. Is there a way to attach the returned .gz file to a JSON object?
Would converting the .gz file to base64 and then creating a JSON object like
{ "file": "the base64 string" } work?
print(json.dumps({'file': base64.b64decode(response_alert.content)}))
I get the error
Object of type bytes is not JSON serializable
String encoding is str→bytes and decoding is bytes→str. Base64 is the reverse, as it encodes binary data as characters rather than characters as binary data. However, since many of its use cases involve ASCII protocols like SMTP, base64.b64encode actually requires and returns bytes (ASCII in the latter case). You therefore want
json.dumps(dict(file=base64.b64encode(response_alert.content).decode()))
which takes advantage of the default encoding (UTF-8) supporting ASCII text. On the other end, you don’t have to bother encoding back to bytes, since str is accepted by base64.b64decode.
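A minimal round trip of that approach, with the raw gzip bytes stubbed in place of response_alert.content (which is an assumption here, since we don't see the API call itself):

```python
import base64
import json

# Stand-in for response_alert.content (the raw .gz bytes from the API)
gz_bytes = b"\x1f\x8b\x08\x00example payload"

# Encode: bytes -> base64 bytes -> ASCII str, then embed in JSON
payload = json.dumps({"file": base64.b64encode(gz_bytes).decode()})

# Decode on the other end: b64decode accepts str directly
restored = base64.b64decode(json.loads(payload)["file"])
assert restored == gz_bytes
```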
I would like to generate a PDF report using the pdfme library. I need Polish characters to come through as well. The example report ends with:
with open('document.pdf', 'wb') as f:
    build_pdf(document, f)
So I cannot add encoding="utf-8". Is there any way I can still use Polish characters?
I tried:
Changing to write mode and setting encoding to utf-8. Getting: "TypeError: write() argument must be str, not bytes".
Adding .encode("utf-8") to strings containing Polish characters. Example: "Paweł".encode("utf-8"). Getting: "TypeError: value of . attr must be of type str, list or tuple: b'Pawe\xc5\x82'"
In this case, the part of the code responsible for dealing with the Unicode characters is the PDF library. The build_pdf call there, for whatever library it is, has to be able to handle any character in "document". And if it fails, it is the configuration of the PDF library, the owner of the "build_pdf" call, that has to be changed so that it will handle all the characters you need.
"utf-8" is just one way of expressing characters as bytes. A PDF file is a binary file, and it has internal headers, structures and settings to do its own character-encoding handling: your text may end up inside the PDF either encoded as UTF-8 or in some other, legacy encoding, but that will be transparent to you and anyone using the PDF file.
It may be that the document, if it is text (we don't know if it is plain text, or some object from your library that has already been pre-processed), and if your library says that build_pdf can accept bytes instead, could be encoded prior to this call:
build_pdf(document.encode('utf-8'), f). But that would be a strange way of working; it is likely that either build_pdf will do the encoding, or whatever process generated the document has already done so.
To get more meaningful help, you have to say which library you are using to generate the PDF, and include the import lines in your code, including the creation of your document, so that we have a minimal reproducible example: i.e. I can copy your code, paste it in a .py file here, install the lib, run it, and see a corrupted PDF file with the Polish characters mangled; then I, and others, may be able to fix it. Otherwise, this answer is as far as I can get.
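Both TypeErrors in the question come from mixing str and bytes rather than from pdfme itself; a quick stdlib-only illustration of the mismatch (using in-memory streams to stand in for the files):

```python
import io

# A text-mode sink, like open(..., 'w', encoding='utf-8'), rejects bytes:
text_file = io.StringIO()
try:
    text_file.write("Paweł".encode("utf-8"))  # bytes into a str-only sink
except TypeError as e:
    print(e)  # string argument expected, got 'bytes'

# A binary-mode sink, like the 'wb' file build_pdf writes to, takes bytes fine:
binary_file = io.BytesIO()
binary_file.write("Paweł".encode("utf-8"))
assert binary_file.getvalue() == b"Pawe\xc5\x82"
```

So the 'wb' mode in the example is correct as given; the str-vs-bytes juggling belongs inside the library, not around it.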
I am trying to make a Python script that encodes any file into a txt file. So far I can encode an input file into a txt file that contains the original file's name, content, and a SHA hash for integrity checking (if you wish I can dump the code so far here).
Currently, I just build a standard list and dump that list into a txt file.
However, when reading from the list I get strings, and have no way of casting a string into a "bytes-like object"; a Google search only shows conversion (cast and convert are different) methods.
Is there any way to cast the string the way you can do str(variable)?
Also, on a side note: does "rb" include metadata, title, etc.?
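For the casting part of the question, a minimal sketch: str.encode and bytes.decode convert between the two types, and the bytes(s, encoding) constructor is the closest analogue to a str(variable)-style cast:

```python
s = "hello"

b = s.encode("utf-8")            # str -> bytes
assert b == b"hello"
assert bytes(s, "utf-8") == b    # constructor form, analogous to str(variable)
assert b.decode("utf-8") == s    # bytes -> str, back again
```

As for the side note: "rb" only changes how the file's contents are read (raw bytes, no newline translation); it does not pull in filesystem metadata such as the name or timestamps.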
I am using an API that only takes strings. It's intended to store things. I would like to read in a binary file, convert the binary data to a string, and store the string. Then I would like to retrieve the string, convert it back to binary, and save the file.
So what I am trying to do is (in Python):
PDF -> load into program as string -> store string -> retrieve string -> save as binary PDF file
For example, I have a PDF called PDFfile. I want to read it in:
datafile=open(PDFfile,'rb')
pdfdata=datafile.read()
When I read up on the .read() function, the docs say it's supposed to return a string. It does not, or if it does, it's also picking up the parts that mark it as binary. I have two lines of code later that print it out:
print(pdfdata[:20])
print(str(pdfdata[:20]))
The result is this:
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
Those look like two bytes objects to me, but apparently the second one is a string. When I do type(pdfdata) I get bytes.
I am struggling to get a clean string that represents the PDF file, which I can then convert back to bytes. The API fails if I send it without stringifying it:
str(pdfdata)
I have also tried playing around with encode and decode, but I get errors that encode/decode can't handle 0xc4, which is apparently in the binary file.
The final oddity:
When I store str(pdfdata) and retrieve it into retdata, I print some bytes out of each and compare to the original:
print(pdfdata[:20])
print(retdata[:20])
I get really different results:
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
b'%PDF-1.3\n%\xc4\xe
But the data is there, if I show 50 characters of the retdata
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\
Needless to say, when I retrieve the data and store it as a PDF, it's corrupted and doesn't work.
When I save the stringified PDF and the string version of the retrieved data, they are identical, so the storage and retrieval of a string is working fine.
So I think the corruption is happening when I convert to a string.
I know I'm getting loquacious, but you guys like to have all the info.
OK I got this to work. The key was to properly encode the binary data BEFORE it was turned into a string.
Step 1) Read in binary data
datafile=open(PDFfile,'rb')
pdfdatab=datafile.read() #this is binary data
datafile.close()
Step 2) encode the data into a bytes array
import codecs
b64PDF = codecs.encode(pdfdatab, 'base64')
Step 3) convert bytes array into a string
Sb64PDF=b64PDF.decode('utf-8')
Now the string can be restored. To get it back, you just go through the reverse. Load string data from storage into string variable retdata.
#so we have a string and want it to be bytes
bretdata=retdata.encode('utf-8')
#now lets get it back into the binary file format
bPDFout=codecs.decode(bretdata, 'base64')
#open a new file and write the decoded data into it!
datafile=open(newPDFFile,'wb')
datafile.write(bPDFout)
datafile.close()
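The same round trip can be written with the base64 module alone, which avoids the trailing newlines the codecs 'base64' codec inserts. A sketch with the file contents stubbed as in-memory bytes (in the real script they come from datafile.read()):

```python
import base64

# Stand-in for the bytes read from the PDF in 'rb' mode
pdf_bytes = b"%PDF-1.3\n%\xc4\xe5\xf2\xe5"

# bytes -> base64 str: safe to hand to any API that only takes strings
stored = base64.b64encode(pdf_bytes).decode("ascii")

# str -> original bytes, ready for open(newPDFFile, 'wb').write(...)
restored = base64.b64decode(stored)
assert restored == pdf_bytes
```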
My server is going to be sending a JSON, serialized as a string, through a socket to another client machine. I'll take my final json and do this:
import json
python_dict_obj = { "id" : 1001, "name" : "something", "file" : <???> }
serialized_json_str = json.dumps(python_dict_obj)
I'd like to have one of the fields in my JSON have the value that is a file, encoded as a string.
Performance-wise (but also interoperability-wise) what is the best way to encode a file using python? Base64? Binary? Just the raw string text?
EDIT - For those suggestion base64, something like this?
# get file
import base64
import json
with open(filename, 'rb') as f:  # binary mode: b64encode needs bytes
    filecontents = f.read()
encoded = base64.b64encode(filecontents).decode('ascii')  # str for JSON
python_dict_obj['file'] = encoded
serialized_json_str = json.dumps(python_dict_obj)
# ... sent to client via socket
# decoding
json_again = json.loads(serialized_json_str)
filecontents_again = base64.b64decode(json_again['file'])
I'd use base64. JSON isn't designed to communicate binary data. So unless your file's content is vanilla text, it "should be" encoded to use vanilla text. Virtually everything can encode and decode base64. If you instead use (for example) Python's repr(file_content), that also produces "plain text", but the receiving end would need to know how to decode the string escapes Python's repr() uses.
JSON cannot handle binary data. You will need to encode the data as text before serializing, and the easiest encoding to use is Base64. You do not need the URL-safe form unless something further down the processing chain requires it.
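For reference, the standard and URL-safe alphabets differ only in two characters ('+' and '/' versus '-' and '_'), and either one inflates the payload by roughly a third (4 output characters per 3 input bytes). A small sketch:

```python
import base64

data = bytes(range(256))  # exercises the full output alphabet

std = base64.b64encode(data)
url = base64.urlsafe_b64encode(data)

assert b"+" in std and b"/" in std   # standard alphabet
assert b"-" in url and b"_" in url   # URL-safe substitutes
assert len(std) == 344               # 256 bytes -> ceil(256/3) * 4 chars
assert base64.urlsafe_b64decode(url) == data
```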
In my Google App Engine based app, I am fetching data from a SOAP web service.
The problem is that one of the tags contains base64-encoded data. I decode it using
decodedStr = base64.b64decode(str(content))
It seems that the decoding is not done correctly, as I get garbage data in decodedStr. I think the problem is that the content string is falsely parsed as a Unicode string instead of a simple byte string.
Can any Python guru tell me how to handle b64 encoded data in Python?
For now I am using this workaround
fileContent = str(fileContent)
fileContent = fileContent[3:-3]
self.response.out.write(base64.b64decode(fileContent))
You could try using base64.decodestring (deprecated; in Python 3 use base64.decodebytes) or, if you were passed a URL, base64.urlsafe_b64decode.
Make sure that the data is not in base16 or base32.
Strange. If the content were not b64 encoded, the call to decode should raise a TypeError exception. I assume that's not happening?
Which would lead me to wonder how you know the resulting decodedStr is not what you're after?
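The [3:-3] slicing in the question's workaround is most likely compensating for str() on bytes, which produces the repr (a b'...' wrapper plus escape sequences) rather than the text. Decoding explicitly avoids the workaround entirely; a sketch with stand-in content for the SOAP tag:

```python
import base64

content = b"aGVsbG8="  # stand-in for the tag's base64 payload as bytes

# str() on bytes gives the repr, not the characters:
assert str(content) == "b'aGVsbG8='"

# decode to ASCII first (or pass the bytes straight to b64decode):
decoded = base64.b64decode(content.decode("ascii"))
assert decoded == b"hello"
assert base64.b64decode(content) == b"hello"  # bytes input works too
```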