How can I convert pdf response to pdf file? - python

I'm using DocRaptor to convert HTML to PDF. DocRaptor does the conversion and sends me a response, but I'm having trouble understanding how to turn that response into a PDF file.
Here's what the response looks like:
b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\n\n1 0 obj\n<</Type /Catalog\n/Pages 2 0 R>>\nendobj\n\n2 0 obj\n<</Type /Pages\n/Kids [3 0 R]\n/Count 1>>
\nendobj\n\n4 0 obj\n<</Length 5 0 R\n/Filter /FlateDecode>>\nstream\nx\x9cs\n\xe125\xd13\x00\x02\x05s#3=sSC#\x85\x90\x14.}7C\x05C#\x88x
H\x1a\x97\x86GjNN\xbeB\xb8\xa6BH\x16\x97\x89\x81\x9e\x81\x91\xa9\x89\x82\x0
... ... ...
... ... lots of code ... ...
... ... ...
<</Info 10 0 R\n/Size 11\n/Root 1 0 R\n/ID [<5FCD137048BC4E60BF5E3D2E3741CD4B> <5FCD137048BC4E60BF5E3D2E3741CD4B>]>>\nstartxref\n12234\n
%%EOF\n'
I was thinking of doing something like this:
# DocRaptor response
response = doc_api.create_doc({ "type": "pdf", "document_content": "<html><body>Hello World!</body></html>" })
with open("test.pdf", "wb") as f:
    f.write(response)
file = open(f.name, 'r').read()
Error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 195: character maps to <undefined>
How can I achieve this ?

Use binary mode when opening the file for reading:
with open('test.pdf', 'rb') as f:
    doc = f.read()
Without the binary flag, Python 3 assumes the file contains text in the default locale encoding (locale.getpreferredencoding(False)) and will attempt to decode the incoming data into a unicode string:
>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'
On my system the default encoding is UTF-8, so in text mode Python tries to decode the bytes into a str object, which fails as soon as the file contains data that is not valid UTF-8. Your 'charmap' error suggests a Windows system, where the default is a cp125x codec and the failure is the same in kind.
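Putting both halves together, a minimal sketch (assuming doc_api is an already-configured DocRaptor client object, as in the question):
# Write the raw PDF bytes DocRaptor returned, then read them back.
# `doc_api` is assumed to be a configured DocRaptor client as in the question.
response = doc_api.create_doc({
    "type": "pdf",
    "document_content": "<html><body>Hello World!</body></html>",
})

with open("test.pdf", "wb") as f:   # binary write: bytes go in unchanged
    f.write(response)

with open("test.pdf", "rb") as f:   # binary read: bytes come out, no decoding
    doc = f.read()

assert doc.startswith(b"%PDF")      # sanity check that we round-tripped a PDF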

Related

Python emoji to bytes write to txt file

I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f525' in position 0: character maps to <undefined>
I would like to write, for example, "๐Ÿ”ฅ" to a txt file, and it should appear as \U0001f525 in the txt file.
Here's my code:
test1 = f"{config['emoji']}"
with open('emoji.txt', 'w') as f:
    f.write(test1)
test1 = "๐Ÿ”ฅ"
with open('emoji.txt', 'w') as f:
    transformed = (test1
                   .encode('utf-16', 'surrogatepass')
                   .decode('utf-16')
                   .encode("raw_unicode_escape")
                   .decode("latin_1"))
    f.write(transformed)
Adapted from this answer
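As a quick sanity check (a sketch, assuming the file was written as above), the escaped text can be decoded back into the original emoji with the unicode_escape codec:
with open('emoji.txt', 'r') as f:
    escaped = f.read()                   # the literal text \U0001f525

# Turn the escape sequence back into the actual character.
restored = escaped.encode('ascii').decode('unicode_escape')
print(restored == "\U0001f525")          # True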

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I am trying to scrape a picture from a link and save it to an image file. The request response returns a byte stream, so I am using decode('utf-8') to convert it to a unicode stream. However, I am facing the following error:
print (info.decode(('utf-8')))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
from urllib import request

img = request.urlopen('http://www.py4inf.com/cover.jpg')
fhand = open('cover.jpg', 'w')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    print(info.decode('utf-8'))
    fhand.write(info.decode('utf-8'))
print(size, 'characters copied.')
fhand.close()
Please let me know how I can proceed. Thanks.
The file should be opened in binary mode, and then you can copy the stream byte for byte. Since shutil already has a handy helper utility, you can use it:
import shutil
import os
from urllib import request

img = request.urlopen('http://www.py4inf.com/cover.jpg')
with open('cover.jpg', 'wb') as fhand:
    shutil.copyfileobj(img, fhand)
print(os.stat('cover.jpg').st_size, 'bytes copied')
Don't use Unicode transformations for JPG images.
Unicode is for text. What you are downloading is not text, it is something else.
Try this:
from urllib import request

img = request.urlopen('http://www.py4inf.com/cover.jpg')
fhand = open('cover.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)
fhand.close()
print(size, 'bytes copied.')
Or, more simply:
from urllib import request
request.urlretrieve('http://www.py4inf.com/cover.jpg', 'cover.jpg')
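(Note that the Python 3 documentation lists urlretrieve under "Legacy interface" and warns it may become deprecated, so the shutil.copyfileobj version above is the more future-proof of the two.)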

Python JSON to CSV - bad encoding, UnicodeDecodeError: 'charmap' codec can't decode byte

I have a problem converting nested JSON to CSV. For this I use https://github.com/vinay20045/json-to-csv (forked a bit to support Python 3.4); here is the full json-to-csv.py file.
Converting works if I set
#Base Condition
else:
    reduced_item[str(key)] = (str(value)).encode('utf8','ignore')
and
fp = open(json_file_path, 'r', encoding='utf-8')
but when I import the CSV into MS Excel I see garbled Cyrillic characters, for example \xe0\xf1; English text is fine.
I experimented with setting encode('cp1251','ignore'), but then I got the error
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
import sys
import json
import csv
##
# This function converts an item like
# {
# "item_1":"value_11",
# "item_2":"value_12",
# "item_3":"value_13",
# "item_4":["sub_value_14", "sub_value_15"],
# "item_5":{
# "sub_item_1":"sub_item_value_11",
# "sub_item_2":["sub_item_value_12", "sub_item_value_13"]
# }
# }
# To
# {
# "node_item_1":"value_11",
# "node_item_2":"value_12",
# "node_item_3":"value_13",
# "node_item_4_0":"sub_value_14",
# "node_item_4_1":"sub_value_15",
# "node_item_5_sub_item_1":"sub_item_value_11",
# "node_item_5_sub_item_2_0":"sub_item_value_12",
# "node_item_5_sub_item_2_0":"sub_item_value_13"
# }
##
def reduce_item(key, value):
    global reduced_item

    # Reduction Condition 1
    if type(value) is list:
        i = 0
        for sub_item in value:
            reduce_item(key + '_' + str(i), sub_item)
            i = i + 1
    # Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key + '_' + str(sub_key), value[sub_key])
    # Base Condition
    else:
        reduced_item[str(key)] = (str(value)).encode('cp1251', 'ignore')

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("\nUsage: python json_to_csv.py <node_name> <json_in_file_path> <csv_out_file_path>\n")
    else:
        # Reading arguments
        node = sys.argv[1]
        json_file_path = sys.argv[2]
        csv_file_path = sys.argv[3]

        fp = open(json_file_path, 'r', encoding='cp1251')
        json_value = fp.read()
        raw_data = json.loads(json_value)

        processed_data = []
        header = []
        for item in raw_data[node]:
            reduced_item = {}
            reduce_item(node, item)
            header += reduced_item.keys()
            processed_data.append(reduced_item)

        header = list(set(header))
        header.sort()

        with open(csv_file_path, 'wt+') as f:  # wb+ for python 2.7
            writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL, delimiter=',')
            writer.writeheader()
            for row in processed_data:
                writer.writerow(row)

        print("Just completed writing csv file with %d columns" % len(header))
How can I convert the Cyrillic correctly, and how can I skip the bad characters?
You need to know which Cyrillic encoding the file you are going to open uses.
For example, this is enough in Python 3:
with open(args.input_file, 'r', encoding="cp866") as input_file:
    data = input_file.read()
    structure = json.loads(data)
In Python 3 the data variable is automatically a Unicode str; in Python 2 there might be problems feeding the input to json.
Also try printing a line in the Python interpreter and see whether the symbols look right. Without the input file it is hard to tell if everything is correct. Are you sure this is a Python problem and not an Excel one? Did you try opening the file in Notepad++ or a similar encoding-aware editor?
The most important thing when working with encodings is checking that both the input and the output are right. I would suggest looking here.
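Since the complaint is specifically about Excel, one likely fix (a sketch; the file names are placeholders, and it assumes the input really is cp1251) is to read with the declared encoding and write the CSV as UTF-8 with a BOM, which Excel uses to detect the encoding:
import csv
import json

# Assumption: the source JSON is cp1251-encoded; adjust if chardet says otherwise.
with open('data.json', 'r', encoding='cp1251') as fp:
    raw_data = json.load(fp)

# 'utf-8-sig' writes a BOM, so MS Excel recognises the file as UTF-8;
# no characters need to be thrown away with 'ignore'.
with open('out.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    for key, value in raw_data.items():
        writer.writerow([key, value])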
Maybe you could use chardet to detect the file's encoding:
import chardet
import json

File = 'arq.GeoJson'
enc = chardet.detect(open(File, 'rb').read())['encoding']
with open(File, 'r', encoding=enc) as f:
    data = json.load(f)
This avoids having to guess the encoding.

how to write a unicode csv in Python 2.7

I want to write data to files where a row from a CSV should look like this list (directly from the Python console):
row = ['\xef\xbb\xbft_11651497', 'http://kozbeszerzes.ceu.hu/entity/t/11651497.xml', "Szabolcs Mag '98 Kft.", 'ny\xc3\xadregyh\xc3\xa1za', 'ny\xc3\xadregyh\xc3\xa1za', '4400', 't\xc3\xbcnde utca 20.', 47.935175, 21.744975, u'Ny\xedregyh\xe1za', u'Borb\xe1nya', u'Szabolcs-Szatm\xe1r-Bereg', u'Ny\xedregyh\xe1zai', u'20', u'T\xfcnde utca', u'Magyarorsz\xe1g', u'4405']
Python 2's csv module does not handle Unicode, but I had a UnicodeWriter wrapper:
import cStringIO, codecs, csv

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
However, these lines still produce the dreaded encoding error message below:
f.write(codecs.BOM_UTF8)
writer = UnicodeWriter(f)
writer.writerow(row)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 9: ordinal not in range(128)
What is there to do? Thanks!
You are passing bytestrings containing non-ASCII data in, and these are being decoded to Unicode using the default codec at this line:
self.writer.writerow([unicode(s).encode("utf-8") for s in row])
unicode(bytestring) with data that cannot be decoded as ASCII fails:
>>> unicode('\xef\xbb\xbft_11651497')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
Decode the data to Unicode before passing it to the writer:
row = [v.decode('utf8') if isinstance(v, str) else v for v in row]
This assumes that your bytestring values contain UTF-8 data. If you have a mix of encodings, you will have to decode to Unicode at the point of origin, where your program first sourced the data. You really want to do that anyway, regardless of where the data came from, even when it already was encoded as UTF-8.
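A minimal sketch of the fix in context (assuming the UnicodeWriter class and the row list from the question, with a placeholder output file name):
# -*- coding: utf-8 -*-
# Python 2: decode any byte strings in the row to unicode before writing.
import codecs

with open('out.csv', 'wb') as f:            # 'out.csv' is a placeholder
    f.write(codecs.BOM_UTF8)
    writer = UnicodeWriter(f)               # the wrapper from the question
    row = [v.decode('utf8') if isinstance(v, str) else v for v in row]
    writer.writerow(row)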

How to pass Unicode string as argument to urllib.urlencode()

I'm using Microsoft's free translation service to translate some Hindi characters to English. They don't provide an API for Python, but I borrowed code from: tinyurl.com/dxh6thr
I'm trying to use the 'Detect' method as described here: tinyurl.com/bxkt3we
The 'hindi.txt' file is saved in a Unicode charset.
>>> hindi_string = open('hindi.txt').read()
>>> data = { 'text' : hindi_string }
>>> token = msmt.get_access_token(MY_USERID, MY_TOKEN)
>>> request = urllib2.Request('http://api.microsofttranslator.com/v2/Http.svc/Detect?'+urllib.urlencode(data))
>>> request.add_header('Authorization', 'Bearer '+token)
>>> response = urllib2.urlopen(request)
>>> print response.read()
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">en</string>
>>>
The response shows that the Translator detected 'en' instead of 'hi' (for Hindi). When I check the type, it shows as a plain byte string:
>>> type(hindi_string)
<type 'str'>
For reference, here is content of 'hindi.txt':
เคนเคพเคฏ, เค•เฅˆเคธเฅ‡ เค†เคช เค†เคœ เค•เคฐ เคฐเคนเฅ‡ เคนเฅˆเค‚เฅค เคฎเฅˆเค‚ เค…เคšเฅเค›เฅ€ เคคเคฐเคน เคธเฅ‡, เค†เคชเค•เฅ‹ เคงเคจเฅเคฏเคตเคพเคฆ เค•เคฐ เคฐเคนเคพ เคนเฅ‚เคเฅค
I'm not sure if using string.encode or string.decode applies here. If it does, what do I need to encode/decode from/to? What is the best method to pass a Unicode string as a urllib.urlencode argument? How can I ensure that the actual Hindi characters are passed as the argument?
Thank you.
** Additional Information **
I tried using codecs.open() as suggested, but I get the following error:
>>> hindi_new = codecs.open('hindi.txt', encoding='utf-8').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\codecs.py", line 671, in read
return self.reader.read(size)
File "C:\Python27\lib\codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Here is the repr(hindi_string) output:
>>> repr(hindi_string)
"'\\xff\\xfe9\\t>\\t/\\t,\\x00 \\x00\\x15\\tH\\t8\\tG\\t \\x00\\x06\\t*\\t \\x00
\\x06\\t\\x1c\\t \\x00\\x15\\t0\\t \\x000\\t9\\tG\\t \\x009\\tH\\t\\x02\\td\\t \
\x00.\\tH\\t\\x02\\t \\x00\\x05\\t'"
Your file is utf-16, so you need to decode the content before sending it:
hindi_string = open('hindi.txt').read().decode('utf-16')
data = { 'text' : hindi_string.encode('utf-8') }
...
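The clue is in the repr output: the leading \xff\xfe bytes are the UTF-16 little-endian byte order mark (BOM). A quick check (a sketch, using the file name from the question):
import codecs

raw = open('hindi.txt', 'rb').read()
print(raw.startswith(codecs.BOM_UTF16_LE))   # True: \xff\xfe is the UTF-16 LE BOM
hindi_string = raw.decode('utf-16')          # the utf-16 codec strips the BOM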
You could try opening the file using codecs.open and decoding it with utf-8:
import codecs
with codecs.open('hindi.txt', encoding='utf-8') as f:
    hindi_text = f.read()
