I am working on a Scrapy spider, trying to extract the text from multiple PDF files in a directory using slate. I have no interest in saving the actual PDFs to disk, so I've been advised to look into the io.BytesIO class at https://docs.python.org/2/library/io.html#buffered-streams.
However, I'm not sure how to pass the PDF body to the BytesIO class and then pass the in-memory PDF to slate to get the text. So far I have:
import urlparse

from io import BytesIO
from scrapy.http import Request
from scrapy.spider import BaseSpider


class Ove_Spider(BaseSpider):
    name = "ove"
    allowed_domains = ['myurl.com']
    start_urls = ['myurl/hgh/']

    def parse(self, response):
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        in_memory_pdf = BytesIO()
        in_memory_pdf.read(response.body)  # Trying to read in the PDF, which is in the response body
I'm getting:
in_memory_pdf.read(response.body)
TypeError: integer argument expected, got 'str'
How can I get this working?
When you do in_memory_pdf.read(response.body), the argument you are supposed to pass is the number of bytes to read. You want to initialize the buffer with your data, not read into it.
In Python 2, just initialize BytesIO as:
in_memory_pdf = BytesIO(response.body)
In Python 3, you cannot use BytesIO with a string, because it expects bytes. The error message shows that response.body is of type str: we have to encode it.
in_memory_pdf = BytesIO(bytes(response.body, 'ascii'))
But as a PDF can be binary data, I suppose that response.body would be bytes, not str. In that case, the simple in_memory_pdf = BytesIO(response.body) works.
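To illustrate the initialize-then-read semantics (a standalone sketch, independent of Scrapy):

from io import BytesIO

buf = BytesIO(b'%PDF-1.4 example')  # initialize the buffer with data
print(buf.read(4))                  # read(n) takes a byte count: prints b'%PDF'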
I have a JSON file.

with open('list.json', "r") as f:
    r_list = json.load(f)

It crashes with:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I validated the JSON online and it checks out. The schema is very simple:
{"foo": [
{"name": "AAA\u2019s BBB CCC", "url": "/foome/foo"}
]}
Things I have tried:
the file encoding
a dummy file
I have run out of ideas - is it something where json.load expects binary input?
Edit 1
The code works in a plain script, but does not work inside the Scrapy class:
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import json


class myScraper(scrapy.Spider):
    name = "testScraper"

    def start_requests(self):
        with open('test.json') as f:
            self.logger.info(f.read())  # shows the file content
            r_list = json.load(f)  # breaks with the error msg
        yield "foo"

    def parse(self, response):
        self.logger.info("foo")
test.json:
{
    "too": "foo"
}
Most likely your file is empty.
Example:
https://repl.it/#mark_boyle_sp/SphericalImpressiveIrc
updated:
Your file handle is exhausted, as also discussed in the comments.
Since you log the file's contents first, the read position is at the end of the file, so json.load sees what looks like an empty file, hence the exception.
Reset the read position, or read the contents into a local variable and operate on that:
json_str = f.read()
self.logger.info(json_str) #shows the file content
r_list = json.loads(json_str)
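Alternatively, you can rewind the file instead of caching its contents; a minimal sketch using the file object's seek method:

with open('test.json') as f:
    self.logger.info(f.read())  # consumes the stream; position is now at EOF
    f.seek(0)                   # rewind to the beginning
    r_list = json.load(f)       # now sees the full document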
updated again
(I assume) the Scrapy issue you are having is in the parse method? The response body is a bytes object; you will need to decode it and use loads on the resulting string, like so:

def parse(self, response):
    self.logger.info("foo")
    resp_str = response.body.decode('utf-8')
    self.logger.info(resp_str)  # shows the response
    r_list = json.loads(resp_str)
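As a shorter variant (assuming a reasonably recent Scrapy, whose text responses expose a text attribute holding the body already decoded):

def parse(self, response):
    r_list = json.loads(response.text)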
I am writing a function in a Python script which will read a JSON file and print it.
The script reads:
def main(conn):
    global link, link_ID
    with open('ad_link.json', 'r') as statusFile:
        status = json.loads(statusFile.read())
    statusFile.close()
    print(status)
    link_data = json.load[status]
    link = link_data["link"]
    link_ID = link_data["link_id"]
    print(link)
    print(link_ID)
I am getting this error:
link_data = json.load[status]
TypeError: 'function' object is not subscriptable
What is the issue?
The content of ad_link.json (this is how the file I am receiving is saved):
"{\"link\": \"https://res.cloudinary.com/dnq9phiao/video/upload/v1534157695/Adidas-Break-Free_nfadrz.mp4\", \"link_id\": \"ad_Bprise_ID_Adidas_0000\"}"
The function that receives and writes the JSON file:

def on_message2(client, userdata, message):
    print("New MQTT message received. File %s line %d" % (filename, cf.f_lineno))
    print("message received?/'/'/' ", str(message.payload.decode("utf-8")),
          "topic", message.topic, "retained ", message.retain)
    global links
    links = str(message.payload.decode("utf-8"))
    logging.debug("Got new mqtt message as %s" % message.payload.decode("utf-8"))
    status_data = str(message.payload.decode("utf-8"))
    print(status_data)
    print("in function on_message2")
    with open("ad_link.json", "w") as outFile:
        json.dump(status_data, outFile)
    time.sleep(3)
The output of this function
New MQTT message received. File C:/Users/arunav.sahay/PycharmProjects/MediaPlayer/venv/Include/mediaplayer_db_mqtt.py line 358
message received?/'/'/' {"link": "https://res.cloudinary.com/dnq9phiao/video/upload/v1534157695/Adidas-Break-Free_nfadrz.mp4", "link_id": "ad_Bprise_ID_Adidas_0000"} topic ios_push retained 1
{"link": "https://res.cloudinary.com/dnq9phiao/video/upload/v1534157695/Adidas-Break-Free_nfadrz.mp4", "link_id": "ad_Bprise_ID_Adidas_0000"}
EDIT
I found out that the error is in the JSON format: I am receiving the JSON data in the wrong format. How do I correct that?
There are two major errors here:
You are trying to use the json.load function as a sequence or dictionary mapping. It's a function; you can only call it: you'd use json.load(file_object). Since status is actually a string, you'd have to use json.loads(status) to actually decode a JSON document stored in a string (see the illustration just after this list).
In on_message2, you encoded JSON data to JSON again. Now you have to decode it twice. That's an unfortunate waste of computer resources.
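To illustrate the load / loads distinction from the first point (a minimal sketch; example.json is a placeholder file name):

import json

with open('example.json') as f:
    obj = json.load(f)          # json.load reads from a file object

obj = json.loads('{"a": 1}')    # json.loads parses a string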
In the on_message2 function, the message.payload object is a bytes value containing a UTF-8 encoded JSON document. If you want to write that to a file, don't decode it to text, and don't encode the text to JSON again. Just write those bytes directly to a file:
def on_message2(client, userdata, message):
    logging.debug("Got new mqtt message as %s" % message.payload.decode("utf-8"))
    with open("ad_link.json", "wb") as out:
        out.write(message.payload)
Note the 'wb' mode; that opens a file in binary mode for writing, at which point you can write the bytes object to that file.
When you open a file without a b in the mode, you open a file in text mode, and when you write a text string to that file object, Python encodes that text to bytes for you. The default encoding depends on your OS settings, so without an explicit encoding argument to open() you can't even be certain that you end up with UTF-8 JSON bytes again! Since you already have a bytes value, there is no need to manually decode then have Python encode again, so use a binary file object and avoid that decode / encode dance here too.
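For contrast, a sketch of the text-mode equivalent with an explicit encoding, in case you do want text mode (an alternative, not what this answer recommends):

with open("ad_link.json", "w", encoding="utf-8") as out:
    out.write(message.payload.decode("utf-8"))  # decode to text, then re-encode as UTF-8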
You can now load the file contents with json.load() without having to decode again:
def main(conn):
    with open('ad_link.json', 'rb') as status_file:
        status = json.load(status_file)
    link = status["link"]
    link_id = status["link_id"]
Note that I opened the file as binary again. As of Python 3.6, the json.load() function can work both with binary files and text files, and for binary files it can auto-detect if the JSON data was encoded as UTF-8, UTF-16 or UTF-32.
If you are using Python 3.5 or earlier, open the file as text, but do explicitly set the encoding to UTF-8:
def main(conn):
    with open('ad_link.json', 'r', encoding='utf-8') as status_file:
        status = json.load(status_file)
    link = status["link"]
    link_id = status["link_id"]
def main(conn):
    global link, link_ID
    with open('ad_link.json', 'r') as statusFile:
        link_data = json.loads(statusFile.read())
        link = link_data["link"]
        link_ID = link_data["link_id"]
        print(link)
        print(link_ID)
Replace loads with load when dealing with a file object, which supports read-like operations; and since the file stores JSON inside a JSON string, a second loads is needed to decode the embedded document:
def main(conn):
    global link, link_ID
    with open('ad_link.json', 'r') as statusFile:
        status = json.load(statusFile)
    status = json.loads(status)
    link = status["link"]
    link_ID = status["link_id"]
    print(link)
    print(link_ID)
I have the following spider:
import os

import pandas as pd
import scrapy


class Downloader(scrapy.Spider):
    name = "sor_spider"
    download_folder = FOLDER

    def get_links(self):
        df = pd.read_excel(LIST)
        return df["Value"].loc

    def start_requests(self):
        urls = self.get_links()
        for url in urls.iteritems():
            index = {"index": url[0]}
            yield scrapy.Request(url=url[1], callback=self.download_file, errback=self.errback_httpbin, meta=index, dont_filter=True)

    def download_file(self, response):
        url = response.url
        index = response.meta["index"]
        content_type = response.headers['Content-Type']
        download_path = os.path.join(self.download_folder, r"{}".format(str(index)))
        with open(download_path, "wb") as f:
            f.write(response.body)
        yield LinkCheckerItem(index=response.meta["index"], url=url, code="downloaded")

    def errback_httpbin(self, failure):
        yield LinkCheckerItem(index=failure.request.meta["index"], url=failure.request.url, code="error")
It should:
read the Excel file with links (LIST)
go to each link and download the file to the FOLDER
log the results in LinkCheckerItem (I am exporting it to CSV)
That would normally work fine, but my list contains files of different types: zip, pdf, doc, etc.
These are the examples of links in my LIST:
https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=2c5fb68702294531afd03041e877ca84
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1173293
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1263289
https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=eb9f06d2b837401eba9c66c8bf5be813
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=952317
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=1042224
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1160005
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=925955
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166563
http://npoimpuls.ru/templates/npoimpuls/material/documents/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA%20%D0%B0%D1%84%D1%84%D0%B8%D0%BB%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%BD%D1%8B%D1%85%20%D0%BB%D0%B8%D1%86%20%D0%BD%D0%B0%2030.06.2016.pdf
http://нпоимпульс.рф/assets/download/sal30.09.2017.pdf
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166287
I would like it to save each file with its original extension, whatever it is, just like my browser does when it prompts me to save a file.
I tried to use response.headers["Content-Type"] to find out the type, but in this case it's always application/octet-stream.
How could I do it?
You need to parse the Content-Disposition header for the correct file name.
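A minimal sketch of how that could look inside download_file (my own illustration, not code from the answer; it uses the standard library's cgi.parse_header, which is deprecated in recent Python versions, and falls back to the index when no filename is sent):

import cgi
import os

def download_file(self, response):
    index = response.meta["index"]
    # The header arrives as bytes, e.g. b'attachment; filename="report.pdf"'
    disposition = response.headers.get('Content-Disposition', b'').decode('utf-8')
    _, params = cgi.parse_header(disposition)
    filename = params.get('filename', str(index))  # fall back to the index
    download_path = os.path.join(self.download_folder, filename)
    with open(download_path, "wb") as f:
        f.write(response.body)
    yield LinkCheckerItem(index=index, url=response.url, code="downloaded")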
I'm trying to scrape some PDFs from a website with Scrapy. Instead of letting Scrapy name these files, I want to name the PDFs with the titles I scraped from the website. So I defined ReportsPDFPipeline and overrode the file_path function.
class ReportsPDFPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # print("我被调用了")  # debug print: "I was called"
        file_guid = request.meta["title"]
        return "full/%s" % (file_guid)
The problem is that there are some Unicode (Chinese) characters in the title, so no PDF files were stored under this path.
Then I tried a simple case:
class ReportsPDFPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # print("我被调用了")  # debug print: "I was called"
        return u"full/" + u"我被调用了" + u".PDF"
This time, the file could be renamed and stored, but the file name came out as messy, garbled characters.
What am I supposed to do to rename the files correctly?
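One possible workaround (an assumption on my part, since the garbling suggests a filesystem or console encoding mismatch): percent-encode the title, so the stored name is plain ASCII and cannot be mangled. A minimal sketch:

from urllib.parse import quote

from scrapy.pipelines.files import FilesPipeline


class ReportsPDFPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        title = request.meta["title"]
        # quote() percent-encodes the UTF-8 bytes of the title, so the
        # resulting file name is ASCII-only, e.g. full/%E6%88%91....pdf
        return "full/%s.pdf" % quote(title, safe="")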
I am working on a Scrapy spider, trying to extract text from multiple PDFs in a directory, using slate (https://pypi.python.org/pypi/slate). I have no interest in saving the actual PDF to disk, and so I've been advised to look into the io.BytesIO class at https://docs.python.org/2/library/io.html#buffered-streams. Based on Creating bytesIO object, I have initialized the BytesIO class with the PDF body, but now I need to pass the data to the slate module. So far I have:
def save_pdf(self, response):
    in_memory_pdf = BytesIO(response.body)
    with open(in_memory_pdf, 'rb') as f:
        doc = slate.PDF(f)
        print(doc[0])
I'm getting:
in_memory_pdf.read(response.body)
TypeError: integer argument expected, got 'str'
How can I get this working?
edit:
with open(in_memory_pdf, 'rb') as f:
TypeError: coercing to Unicode: need string or buffer, _io.BytesIO found
edit 2:
def save_pdf(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    doc = slate.PDF(in_memory_pdf)
    print(doc)
You already know the answer; it is clearly stated in the Python TypeError message and in the documentation:
class io.BytesIO([initial_bytes])
BytesIO accepts bytes, but you are passing it the contents of response.body, which is a str.
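Putting it together, a minimal sketch of a working callback (assuming Python 2, where response.body is a str of raw bytes, and slate accepting a file-like object, as edit 2 above shows):

from io import BytesIO

import slate


def save_pdf(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))  # wrap the raw bytes; no open() call needed
    doc = slate.PDF(in_memory_pdf)                 # slate reads from the file-like buffer
    print(doc[0])                                  # text of the first page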