Decompress a gzip compressed dictionary object using Python

I need to decompress "H4sIAAAAAAAA/6tWKkktLjFUsjI00lEAs42UrCAMpVoAbyLr+R0AAAA=", which is actually the compressed form of {"test1":12, "test2": "test"}. In Python I'm using the gzip library and getting the response below:
>>> import gzip
>>> gzip.decompress("H4sIAAAAAAAA/6tWKkktLjFUsjI00lEAs42UrCAMpVoAbyLr+R0AAAA=".encode("UTF-8"))
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/data/python-3.8.10/lib/python3.8/gzip.py", line 548, in decompress
return f.read()
File "/data/python-3.8.10/lib/python3.8/gzip.py", line 292, in read
return self._buffer.read(size)
File "/data/python-3.8.10/lib/python3.8/gzip.py", line 479, in read
if not self._read_gzip_header():
File "/data/python-3.8.10/lib/python3.8/gzip.py", line 427, in _read_gzip_header
raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'H4')
Is there any way to decompress the string in Python?

The string is Base64-encoded. Therefore:
import gzip
import base64
b = base64.b64decode('H4sIAAAAAAAA/6tWKkktLjFUsjI00lEAs42UrCAMpVoAbyLr+R0AAAA=')
r = gzip.decompress(b)
print(r.decode())
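Since the decompressed bytes are the JSON document from the question, you can go one step further and load them into a dict (a small follow-up sketch; parsing with json is an assumption about what you want to do with the result):
import base64
import gzip
import json

raw = base64.b64decode('H4sIAAAAAAAA/6tWKkktLjFUsjI00lEAs42UrCAMpVoAbyLr+R0AAAA=')
data = json.loads(gzip.decompress(raw).decode('utf-8'))
print(data)   # {'test1': 12, 'test2': 'test'}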

Related

Transform docx to html raises python MemoryError

I have a function that converts a docx to html and a large docx file to be converted.
The problem is that this function is part of a bigger program and the converted html is parsed afterwards, so I cannot switch to another converter without impacting the rest of the code (which is not wanted). I'm running Python 2.7.13 installed as 32-bit; changing to 64-bit is also not desired.
This is the function:
import logging

import ooxml
from ooxml import serialize

def trasnformDocxtoHtml(inputFile, outputFile):
    logging.basicConfig(filename='ooxml.log', level=logging.INFO)
    dfile = ooxml.read_from_file(inputFile)
    with open(outputFile, 'w') as htmlFile:
        htmlFile.write(serialize.serialize(dfile.document))
and here's the error:
>>> import library
>>> library.trasnformDocxtoHtml(r'large_file.docx', 'output.html')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "library.py", line 9, in trasnformDocxtoHtml
dfile = ooxml.read_from_file(inputFile)
File "C:\Python27\lib\site-packages\ooxml\__init__.py", line 52, in read_from_file
dfile.parse()
File "C:\Python27\lib\site-packages\ooxml\docxfile.py", line 46, in parse
self._doc = parse_from_file(self)
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 655, in parse_from_file
document = parse_document(doc_content)
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 463, in parse_document
document.elements.append(parse_table(document, elem))
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 436, in parse_table
for p in tc.xpath('./w:p', namespaces=NAMESPACES):
File "src\lxml\etree.pyx", line 1583, in lxml.etree._Element.xpath
MemoryError
no mem for new parser
MemoryError
Could I somehow increase the buffer memory in python? Or fix the function without impacting the html output format?

botocore s3 put has issue hashing file due to encoding?

I'm having trouble figuring out why this file, whose contents are "DELETE ME LATER" and which is opened with encoding utf-8, causes an exception in botocore when it is being hashed.
import io

# client is assumed to be an S3 client created elsewhere in the script (e.g. boto3.client('s3'))
with io.open('deleteme', 'r', encoding='utf-8') as f:
    try:
        resp = client.put_object(
            Body=f,
            Bucket='s3-bucket-actual-name-for-real',
            Key='testing/a/put'
        )
        print('deleteme exists')
        print(resp)
    except:
        print('deleteme could not put')
        raise
Produces:
deleteme could not put
Traceback (most recent call last):
  File "./test_operator.py", line 41, in <module>
    Key='testing/a/put'
  File "/Users/lamblin/VEnvs/awscli/lib/python3.6/site-packages/botocore/client.py", line 312, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/lamblin/VEnvs/awscli/lib/python3.6/site-packages/botocore/client.py", line 582, in _make_api_call
    request_signer=self._request_signer, context=request_context)
  File "/Users/lamblin/VEnvs/awscli/lib/python3.6/site-packages/botocore/hooks.py", line 242, in emit_until_response
    responses = self._emit(event_name, kwargs, stop_on_response=True)
  File "/Users/lamblin/VEnvs/awscli/lib/python3.6/site-packages/botocore/hooks.py", line 210, in _emit
    response = handler(**kwargs)
  File "/Users/lamblin/VEnvs/awscli/lib/python3.6/site-packages/botocore/handlers.py", line 201, in conditionally_calculate_md5
    calculate_md5(params, **kwargs)
  File "/Users/lamblin/VEnvs/awscli/lib/python3.6/site-packages/botocore/handlers.py", line 179, in calculate_md5
    binary_md5 = _calculate_md5_from_file(body)
  File "/Users/lamblin/VEnvs/awscli/lib/python3.6/site-packages/botocore/handlers.py", line 193, in _calculate_md5_from_file
    md5.update(chunk)
TypeError: Unicode-objects must be encoded before hashing
Now this can be avoided by opening the file with 'rb', but isn't the file object f clearly using an encoding?
The encoding specified to io.open in mode='r' is used to decode the content. So when you iterate f, the content has already been converted from bytes to str (text) by Python.
To interface with botocore directly, open your file with mode 'rb' and drop the encoding kwarg. There is no point in decoding it to text when the first thing botocore has to do in order to transport the content is encode it back into bytes again.
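A minimal sketch of that fix (assuming the client in the question is a boto3 S3 client; the bucket and key are taken from the question):
import boto3

client = boto3.client('s3')                        # assumption: the question's client is a boto3 S3 client
with open('deleteme', 'rb') as f:                  # binary mode, no encoding: botocore hashes raw bytes
    resp = client.put_object(
        Body=f,
        Bucket='s3-bucket-actual-name-for-real',
        Key='testing/a/put'
    )
print(resp['ResponseMetadata']['HTTPStatusCode'])  # 200 on success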

urllib alongwith json to save to a variable

Please correct my code. I am trying to save the result of this web page in json format to a variable in python.
Error:
Traceback (most recent call last):
File "C:/Users/Varen/Desktop/json_v1.py", line 5, in <module>
json.dump(link, f)
File "C:\Python27\lib\json\__init__.py", line 189, in dump
for chunk in iterable:
File "C:\Python27\lib\json\encoder.py", line 442, in _iterencode
o = _default(o)
File "C:\Python27\lib\json\encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <addinfourl at 53244992 whose fp = <socket._fileobject object at 0x032B4AF0>> is not JSON serializable
Code:
import urllib
import json
link = urllib.urlopen("http://www.saferproducts.gov/RestWebServices/Recall?RecallDateStart=2015-01-01&RecallDateEnd=2015-12-31&format=json")
with open('link.json', 'w') as f:
    json.dump(link, f)
You need to read the data from the file-like object returned by urlopen() and, since you want it in a variable, parse it with json.loads():
import urllib
import json

link = urllib.urlopen("http://www.saferproducts.gov/RestWebServices/Recall?RecallDateStart=2015-01-01&RecallDateEnd=2015-12-31&format=json")
data = json.loads(link.read())   # parse the JSON body into a Python object (the "variable")
with open('link.json', 'w') as f:
    json.dump(data, f)
will do the trick.
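If all you need is the raw JSON text on disk, you can also skip re-serializing and write the body directly (a small alternative sketch, Python 2 like the code above):
import urllib

link = urllib.urlopen("http://www.saferproducts.gov/RestWebServices/Recall?RecallDateStart=2015-01-01&RecallDateEnd=2015-12-31&format=json")
with open('link.json', 'w') as f:
    f.write(link.read())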

How to upload binary file with ftplib in Python?

My Python 2 script uploads files nicely using this method, but Python 3 is presenting problems and I'm stuck as to where to go next (googling hasn't helped).
from ftplib import FTP
ftp = FTP(ftp_host, ftp_user, ftp_pass)
ftp.storbinary('STOR myfile.txt', open('myfile.txt'))
The error I get is
Traceback (most recent call last):
File "/Library/WebServer/CGI-Executables/rob3/functions/cli_f.py", line 12, in upload
ftp.storlines('STOR myfile.txt', open('myfile.txt'))
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/ftplib.py", line 454, in storbinary
conn.sendall(buf)
TypeError: must be bytes or buffer, not str
I tried altering the code to
from ftplib import FTP
ftp = FTP(ftp_host, ftp_user, ftp_pass)
ftp.storbinary('STOR myfile.txt'.encode('utf-8'), open('myfile.txt'))
But instead I got this
Traceback (most recent call last):
File "/Library/WebServer/CGI-Executables/rob3/functions/cli_f.py", line 12, in upload
ftp.storbinary('STOR myfile.txt'.encode('utf-8'), open('myfile.txt'))
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/ftplib.py", line 450, in storbinary
conn = self.transfercmd(cmd)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/ftplib.py", line 358, in transfercmd
return self.ntransfercmd(cmd, rest)[0]
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/ftplib.py", line 329, in ntransfercmd
resp = self.sendcmd(cmd)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/ftplib.py", line 244, in sendcmd
self.putcmd(cmd)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/ftplib.py", line 179, in putcmd
self.putline(line)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/ftplib.py", line 172, in putline
line = line + CRLF
TypeError: can't concat bytes to str
Can anybody point me in the right direction?
The issue is not with the command argument but with the file object. Since you're storing binary, you need to open the file with the 'rb' flag:
>>> ftp.storbinary('STOR myfile.txt', open('myfile.txt', 'rb'))
'226 File receive OK.'
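For completeness, a minimal Python 3 sketch of the whole upload using context managers (host, credentials and filename are placeholders):
from ftplib import FTP

with FTP('ftp.example.com', 'user', 'password') as ftp:   # placeholders; FTP is a context manager on Python 3.3+
    with open('myfile.txt', 'rb') as fh:                   # 'rb' so storbinary receives bytes
        ftp.storbinary('STOR myfile.txt', fh)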
To append to a file over FTP (plain FTP, not SFTP):
import ftplib
ftp = ftplib.FTP('localhost')
ftp.login('user', 'password')
fin = open('foo.txt', 'rb')   # binary mode here as well
ftp.storbinary('APPE foo2.txt', fin, 1)
Ref: Thanks to Noah

Python gzip: is there a way to decompress from a string?

I've read this SO post around the problem to no avail.
I am trying to decompress a .gz file coming from a URL.
url_file_handle=StringIO( gz_data )
gzip_file_handle=gzip.open(url_file_handle,"r")
decompressed_data = gzip_file_handle.read()
gzip_file_handle.close()
... but I get TypeError: coercing to Unicode: need string or buffer, cStringIO.StringI found
What's going on?
Traceback (most recent call last):
File "/opt/google/google_appengine-1.2.5/google/appengine/tools/dev_appserver.py", line 2974, in _HandleRequest
base_env_dict=env_dict)
File "/opt/google/google_appengine-1.2.5/google/appengine/tools/dev_appserver.py", line 411, in Dispatch
base_env_dict=base_env_dict)
File "/opt/google/google_appengine-1.2.5/google/appengine/tools/dev_appserver.py", line 2243, in Dispatch
self._module_dict)
File "/opt/google/google_appengine-1.2.5/google/appengine/tools/dev_appserver.py", line 2161, in ExecuteCGI
reset_modules = exec_script(handler_path, cgi_path, hook)
File "/opt/google/google_appengine-1.2.5/google/appengine/tools/dev_appserver.py", line 2057, in ExecuteOrImportScript
exec module_code in script_module.__dict__
File "/home/jldupont/workspace/jldupont/trunk/site/app/server/tasks/debian/repo_fetcher.py", line 36, in <module>
main()
File "/home/jldupont/workspace/jldupont/trunk/site/app/server/tasks/debian/repo_fetcher.py", line 30, in main
gziph=gzip.open(fh,'r')
File "/usr/lib/python2.5/gzip.py", line 49, in open
return GzipFile(filename, mode, compresslevel)
File "/usr/lib/python2.5/gzip.py", line 95, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: coercing to Unicode: need string or buffer, cStringIO.StringI found
If your data is already in a string, try zlib, which claims to be fully gzip compatible:
import zlib
decompressed_data = zlib.decompress(gz_data, 16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS tells zlib to expect a gzip wrapper
Read more: http://docs.python.org/library/zlib.html
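For illustration, a quick round trip (gzip.compress is used here only to fabricate a gzip stream to decompress, and requires Python 3.2+):
import gzip
import zlib

gz_data = gzip.compress(b'{"test1": 12}')
print(zlib.decompress(gz_data, 16 + zlib.MAX_WBITS))   # b'{"test1": 12}'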
gzip.open is a shorthand for opening a file; what you want is gzip.GzipFile, to which you can pass a fileobj:
open(filename, mode='rb', compresslevel=9)
#Shorthand for GzipFile(filename, mode, compresslevel).
vs
class GzipFile
__init__(self, filename=None, mode=None, compresslevel=9, fileobj=None)
# At least one of fileobj and filename must be given a non-trivial value.
so this should work for you
gzip_file_handle = gzip.GzipFile(fileobj=url_file_handle)
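Applied to the snippet from the question, that looks roughly like this (Python 2, since the question uses cStringIO; gz_data is the raw gzip bytes fetched from the URL):
import gzip
from StringIO import StringIO

url_file_handle = StringIO(gz_data)
gzip_file_handle = gzip.GzipFile(fileobj=url_file_handle)
decompressed_data = gzip_file_handle.read()
gzip_file_handle.close()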
You can use gzip.decompress from the built-in gzip Python library (available in Python 3.2+).
Example on how to decompress bytes:
import gzip
gzip.decompress(gzip_data)
Documentation
https://docs.python.org/3.5/library/gzip.html#gzip.decompress
Consider using gzip.GzipFile if you don't like passing obscure arguments to zlib.decompress.
When you deal with urllib2.urlopen response that can be either gzip-compressed or uncompressed:
import gzip
from StringIO import StringIO
# response = urllib2.urlopen(...
content_raw = response.read()
if 'gzip' in response.info().getheader('Content-Encoding'):
    content = gzip.GzipFile(fileobj=StringIO(content_raw)).read()
When you deal with a file that can store either gzip-compressed or uncompressed data:
import gzip
# some_file = open(...
try:
    content = gzip.GzipFile(fileobj=some_file).read()
except IOError:
    some_file.seek(0)
    content = some_file.read()
The examples above are in Python 2.7
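For reference, a rough Python 3 equivalent of the response-handling example (assuming response comes from urllib.request.urlopen, which is an assumption; adjust the header lookup to your HTTP client):
import gzip
import io
from urllib.request import urlopen

response = urlopen('http://example.com/data.json.gz')        # placeholder URL
content_raw = response.read()
if 'gzip' in (response.headers.get('Content-Encoding') or ''):
    content = gzip.GzipFile(fileobj=io.BytesIO(content_raw)).read()
else:
    content = content_raw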
