I am working on a Scrapy spider, trying to extract text from multiple PDFs in a directory using slate (https://pypi.python.org/pypi/slate). I have no interest in saving the actual PDF to disk, so I've been advised to look into the io.BytesIO class at https://docs.python.org/2/library/io.html#buffered-streams. Based on Creating bytesIO object, I have initialized the BytesIO class with the PDF body, but now I need to pass the data to the slate module. So far I have:
def save_pdf(self, response):
    in_memory_pdf = BytesIO(response.body)
    with open(in_memory_pdf, 'rb') as f:
        doc = slate.PDF(f)
        print(doc[0])
I'm getting:
in_memory_pdf.read(response.body)
TypeError: integer argument expected, got 'str'
How can I get this working?
edit:
with open(in_memory_pdf, 'rb') as f:
TypeError: coercing to Unicode: need string or buffer, _io.BytesIO found
edit 2:
def save_pdf(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    doc = slate.PDF(in_memory_pdf)
    print(doc)
You already know the answer. It is stated in the Python TypeError message and in the documentation:
class io.BytesIO([initial_bytes])
BytesIO accepts bytes, and you are passing it the contents, i.e. response.body, which is a string.
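As a minimal sketch of the second error in the question (the one raised by open()): open() expects a path, while a BytesIO object is already an open file-like object and can be handed straight to whatever wants to read from it. The first_bytes function below is a hypothetical stand-in for a parser that takes a file object; it is not part of slate, and the PDF body is fake data.

```python
import io


def first_bytes(fileobj, n=4):
    """Hypothetical consumer standing in for a parser that expects a file object."""
    return fileobj.read(n)


# Fake data standing in for response.body.
in_memory_pdf = io.BytesIO(b'%PDF-1.4 fake body')

# open() wants a path (str, bytes or os.PathLike), so this raises TypeError.
try:
    open(in_memory_pdf, 'rb')
except TypeError as exc:
    open_error = exc

# Passing the buffer directly works, because it already behaves like an open file.
header = first_bytes(in_memory_pdf)
```

This matches what "edit 2" in the question ends up doing: the buffer goes to the consumer without an intermediate open().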
I'm currently building an application in Python where I have a class Corpus. I would like to convert an instance of this class to JSON and save it to a JSON file, then load the file and finally turn the JSON back into a Corpus instance.
To do that I use the jsonpickle library. The problem is that when I load the JSON, the result is a dictionary, while jsonpickle.decode wants a string. I tried to convert the dictionary to a string, but it's not working.
I hope someone will be able to help me.
Here is the code of my class "Json" (to save and load my Corpus):
import json, jsonpickle

class Json:
    def __init__(self):
        self.corpus = {}

    def saveCorpus(self,corpus):
        jsonCorpus = jsonpickle.encode(corpus,indent=4,make_refs=False)
        with open('json_data.json', 'w') as outfile:
            outfile.write(jsonCorpus)

    def loadCorpus(self):
        with open('json_data.json', 'r') as f:
            self.corpus = json.load(f)

    def getCorpus(self):
        return self.corpus
error :
TypeError: the JSON object must be str, bytes or bytearray, not dict
I found the problem.
The issue was the way I was saving my json to a file.
Here is the solution:
def saveCorpus(self,corpus):
    jsonCorpus = jsonpickle.encode(corpus,indent=4,make_refs=False)
    with open('json_data.json', 'w') as outfile:
        json.dump(jsonCorpus, outfile)
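A rough illustration of why this change helps, using only the standard json module. The encoded string below is a stand-in for what jsonpickle.encode() would return (an assumption: any JSON document held in a Python str behaves the same way here), and io.StringIO stands in for the file on disk.

```python
import io
import json

# Stand-in for the string returned by jsonpickle.encode().
encoded = json.dumps({"name": "corpus", "docs": [1, 2, 3]})

# Original saveCorpus: the string is written as-is, so the file holds a JSON object.
plain_file = io.StringIO()
plain_file.write(encoded)
plain_file.seek(0)
loaded_plain = json.load(plain_file)      # parsed into a dict

# Fixed saveCorpus: json.dump() wraps the string in a JSON string literal.
wrapped_file = io.StringIO()
json.dump(encoded, wrapped_file)
wrapped_file.seek(0)
loaded_wrapped = json.load(wrapped_file)  # parsed back into the original str

# Alternative: keep the plain file and read it as text instead of json.load().
plain_file.seek(0)
loaded_text = plain_file.read()           # also the original str
```

Either way, the value handed to jsonpickle.decode is a str. The trade-off of the json.dump approach is that the file on disk contains an escaped JSON string rather than a readable JSON object.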
def convertToBinaryData(filename):
    # Convert digital data to binary format
    with open(filename, 'rb') as file:
        binaryData = file.read()
    return binaryData
This is my function for converting an image to binary...
uploaded_file = request.files['file']
if uploaded_file.filename != '':
    uploaded_file.save(uploaded_file.filename)
    empPicture = convertToBinaryData(uploaded_file)
and this is the block of code where the uploaded file is received and saved, however, when it runs, I get this error...
with open(filename, 'rb') as file:
TypeError: expected str, bytes or os.PathLike object, not FileStorage
I'm pretty new to Python and I've been stuck on this for a while. Any help would be appreciated. Thanks in advance.
When calling convertToBinaryData you are passing uploaded_file, which is not a filename but an object.
You need to pass the filename (with the correct path if it was saved in a custom location) to your convertToBinaryData function.
Something like this:
convertToBinaryData(uploaded_file.filename)
uploaded_file is not a filename, it's a Flask FileStorage object. You can read from this directly, you don't need to call open().
So just do:
empPicture = uploaded_file.read()
See Read file data without saving it in Flask
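The two answers above can be combined into one small helper. This is only a sketch: read_binary is a hypothetical name, not part of Flask or Werkzeug, and io.BytesIO stands in for the uploaded file object.

```python
import io
import os


def read_binary(source):
    """Return the bytes behind `source` (hypothetical helper).

    `source` may be a filesystem path or any object with a read() method,
    such as an uploaded-file object.
    """
    if isinstance(source, (str, bytes, os.PathLike)):
        # A path: this is the only case where open() is appropriate.
        with open(source, 'rb') as fh:
            return fh.read()
    # A file-like object: read from it directly.
    return source.read()


# io.BytesIO stands in for the uploaded file here.
upload = io.BytesIO(b'\x89PNG fake image bytes')
data = read_binary(upload)
```

One caveat: if save() has already been called on the upload, the stream position may be at the end, so a seek(0) may be needed before reading from it again.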
Following multiple suggestions from other StackOverflow questions and the mutagen documentation, I was able to come up with code to get and set every ID3 tag in both MP3 and MP4 files. The issue I have is with setting the cover art for M4B files.
I have reproduced the code exactly like it is laid out in this answer:
Embedding album cover in MP4 file using Mutagen
But I am still receiving errors when I attempt to run the code. If I run the code with the 'albumart' value by itself I receive the error:
MP4file.tags['covr'] = albumart
Exception has occurred: TypeError
can't concat int to bytes
However, if I surround the albumart variable with brackets, as shown in the aforementioned StackOverflow question, I get this output:
MP4file.tags['covr'] = [albumart]
Exception has occurred: struct.error
required argument is not an integer
Here is the function in its entirety. The MP3 section works without any problems.
from mutagen.mp3 import MP3
from mutagen.mp4 import MP4, MP4Cover

def set_cover(filename, cover):
    r = requests.get(cover)
    with open('C:/temp/cover.jpg', 'wb') as q:
        q.write(r.content)
    if(filename.endswith(".mp3")):
        MP3file = MP3(filename, ID3=ID3)
        if cover.endswith('.jpg') or cover.endswith('.jpeg'):
            mime = 'image/jpg'
        else:
            mime = 'image/png'
        with open('C:/temp/cover.jpg', 'rb') as albumart:
            MP3file.tags.add(APIC(encoding=3, mime=mime, type=3, desc=u'Cover', data=albumart.read()))
        MP3file.save(filename)
    else:
        MP4file = MP4(filename)
        if cover.endswith('.jpg') or cover.endswith('.jpeg'):
            cover_format = 'MP4Cover.FORMAT_JPEG'
        else:
            cover_format = 'MP4Cover.FORMAT_PNG'
        with open('C:/temp/cover.jpg', 'rb') as f:
            albumart = MP4Cover(f.read(), imageformat=cover_format)
        MP4file.tags['covr'] = [albumart]
I have been trying to figure out what I am doing wrong for two days now. If anyone can help me spot the problem I would be in your debt.
Thanks!
In the source code of mutagen at the location where the exception is being raised I've found the following lines:
def __render_cover(self, key, value):
    ...
    for cover in value:
        try:
            imageformat = cover.imageformat
        except AttributeError:
            imageformat = MP4Cover.FORMAT_JPEG
        ...
        Atom.render(b"data", struct.pack(">2I", imageformat, 0) + cover))
    ...
Here key is the name of the cover tag and value is the data read from the image, wrapped in an MP4Cover object. It turns out that if you iterate over an MP4Cover object, as the above code does, each iteration yields one byte of the image as an int.
Moreover, in Python 3, struct.pack returns an object of type bytes. I think the cover argument was intended to be the collection of bytes taken from the cover image.
In the code you've given above, the bytes of the cover image are wrapped inside an object of type MP4Cover, which cannot be added to bytes as is done in the second argument of Atom.render.
To avoid having to edit or patch the mutagen source code, the trick is to convert the MP4Cover object to bytes and wrap the result in a collection, as shown below.
import requests
from mutagen.id3 import ID3, APIC
from mutagen.mp3 import MP3
from mutagen.mp4 import MP4, MP4Cover

def set_cover(filename, cover):
    r = requests.get(cover)
    with open('C:/temp/cover.jpg', 'wb') as q:
        q.write(r.content)
    if(filename.endswith(".mp3")):
        MP3file = MP3(filename, ID3=ID3)
        if cover.endswith('.jpg') or cover.endswith('.jpeg'):
            mime = 'image/jpg'
        else:
            mime = 'image/png'
        with open('C:/temp/cover.jpg', 'rb') as albumart:
            MP3file.tags.add(APIC(encoding=3, mime=mime, type=3, desc=u'Cover', data=albumart.read()))
        MP3file.save(filename)
    else:
        MP4file = MP4(filename)
        if cover.endswith('.jpg') or cover.endswith('.jpeg'):
            cover_format = 'MP4Cover.FORMAT_JPEG'
        else:
            cover_format = 'MP4Cover.FORMAT_PNG'
        with open('C:/temp/cover.jpg', 'rb') as f:
            albumart = MP4Cover(f.read(), imageformat=cover_format)
        MP4file.tags['covr'] = [bytes(albumart)]
        MP4file.save(filename)
I've also added MP4file.save(filename) as the last line of the code to persist the changes made to the file.
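For reference, here is a small stdlib-only sketch of how the two exceptions from the question can arise, without involving mutagen at all. The image bytes are fake and 13 is just a stand-in integer for an image format code. The second case suggests, as an assumption not verified against mutagen itself, that the quoted strings assigned to cover_format may be what triggers the struct.error, in which case passing the constant MP4Cover.FORMAT_JPEG (without quotes) would be worth trying.

```python
import struct

image_bytes = b'\x89PNG\r\n'          # stand-in for the bytes read from the cover file
header = struct.pack('>2I', 13, 0)    # 13 is a stand-in integer format code

# Case 1: the cover is assigned without a list, so the loop iterates over
# the image itself. Iterating bytes yields ints, and bytes + int is undefined.
first_item = next(iter(image_bytes))
try:
    header + first_item
except TypeError as exc:
    concat_error = exc

# Case 2: the image format is a str such as 'MP4Cover.FORMAT_JPEG'
# instead of an integer constant, so struct.pack rejects it.
try:
    struct.pack('>2I', 'MP4Cover.FORMAT_JPEG', 0)
except struct.error as exc:
    pack_error = exc
```

If that assumption holds, it would also explain why bytes(albumart) works: the plain bytes object has no imageformat attribute, so the except AttributeError branch in the excerpt above falls back to the integer constant.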
I am working on a Scrapy spider, trying to extract the text from multiple PDF files in a directory, using slate. I have no interest in saving the actual PDF to disk, so I've been advised to look into the io.BytesIO class at https://docs.python.org/2/library/io.html#buffered-streams.
However, I'm not sure how to pass the PDF body to the BytesIO class and then pass the in-memory PDF to slate to get the text. So far I have:
class Ove_Spider(BaseSpider):
    name = "ove"
    allowed_domains = ['myurl.com']
    start_urls = ['myurl/hgh/']

    def parse(self, response):
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        in_memory_pdf = BytesIO()
        in_memory_pdf.read(response.body) # Trying to read in PDF which is in response body
I'm getting:
in_memory_pdf.read(response.body)
TypeError: integer argument expected, got 'str'
How can I get this working?
When you call in_memory_pdf.read(response.body), read() expects the number of bytes to read, not the data itself. You want to initialize the buffer, not read from it.
In Python 2, just initialize BytesIO as:
in_memory_pdf = BytesIO(response.body)
In Python 3, you cannot use BytesIO with a string because it expects bytes. The error message shows that response.body is of type str: we have to encode it.
in_memory_pdf = BytesIO(bytes(response.body,'ascii'))
But since a PDF is binary data, I suppose that response.body would be bytes, not str. In that case, the simple in_memory_pdf = BytesIO(response.body) works.
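A short Python 3 sketch of the difference between initializing the buffer and reading from it. The body value is fake data standing in for response.body.

```python
import io

body = b'%PDF-1.4 fake body'   # stand-in for response.body

in_memory_pdf = io.BytesIO(body)      # initialise the buffer with the data
first_five = in_memory_pdf.read(5)    # read(n) takes a byte count
rest = in_memory_pdf.read()           # no argument: read to the end

# The call from the question passes data where a byte count is expected.
try:
    io.BytesIO().read(body)
except TypeError as exc:
    read_error = exc
```

After the two reads the position is at the end of the buffer, so a seek(0) would be needed before handing it to another consumer.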
Suppose the Flask-Admin view below (note I'm using flask_wtf, not wtforms). I'd like to upload a CSV file and then, in on_model_change, parse the CSV and do some processing before returning the result, which will then be stored in the model. However, I get the error: TypeError: coercing to Unicode: need string or buffer, FileField found
from flask_wtf.file import FileField

class myView(ModelView):
    [...]
    def scaffold_form(self):
        form_class = super(myView, self).scaffold_form()
        form_class.csv = FileField('Upload CSV')
        return form_class

    def on_model_change(self, form, model):
        csv = form.csv
        csv_data = self.parse_file(csv)
        model.csv_data = csv_data

    def parse_file(self, csv):
        with open(csv, 'rb') as csvfile:
            data = csv.reader(csvfile, delimiter=',')
            for row in data:
                doSomething()
When accessing csv.data, I'll get <FileStorage: u'samplefile.csv' ('text/csv')> but this object doesn't actually contain the csv's data.
Okay, after digging further into the flask_wtf module I was able to find enough to go on and get a workaround. The FileField object has a data attribute that wraps the werkzeug.datastructures.FileStorage class which exposes the stream attribute. The docs say this typically points to the open file resource, but since I'm doing this in memory in this case it's a stream buffer io.BytesIO object.
Attempting to open():
with open(field.data.stream, 'rU') as csv_data:
    [...]
Will result in a TypeError: coercing to Unicode: need string or buffer, _io.BytesIO found.
BUT, csv.reader accepts any iterable that yields lines, including a file-like buffer, so we can pass the buffer straight to csv.reader:
buffer = csv_field.data.stream # csv_field is the FileField obj
csv_data = csv.reader(buffer, delimiter=',')
for row in csv_data:
    print row
I found it interesting that if you need additional coercion to/from Unicode UTF-8, the examples in the csv docs provide a snippet on wrapping an encoder/decoder.
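The snippet above is Python 2. Under Python 3, csv.reader expects text rather than bytes, so a byte stream would need decoding first. A minimal sketch, assuming the upload is UTF-8 encoded and using io.BytesIO with fake rows as a stand-in for csv_field.data.stream:

```python
import csv
import io

# Stand-in for csv_field.data.stream (an in-memory upload).
buffer = io.BytesIO(b'name,count\nfoo,1\nbar,2\n')

# csv.reader needs text lines in Python 3, so decode the byte stream first.
# UTF-8 is an assumption about the upload's encoding.
text_stream = io.TextIOWrapper(buffer, encoding='utf-8', newline='')
rows = list(csv.reader(text_stream, delimiter=','))
```

Passing newline='' follows the csv module's recommendation for file objects, so that newlines embedded in quoted fields are handled by the reader itself.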
For me, this did the trick:
def on_model_change(self, form, model):
    tweet_file = form.tweet_keywords_file
    buffer = tweet_file.data.stream
    file_data = buffer.getvalue()