How to read start_urls from a csv file in Scrapy?

I have two spiders, say A and B. A scrapes a bunch of URLs and writes them into a csv file, and B scrapes inside those URLs, reading them from the csv file generated by A. But B throws a FileNotFoundError before A can actually create the file. How can I make my spiders behave such that B waits until A comes back with the URLs? Any other solution would be helpful.
WriteToCsv.py file
import csv

def write_to_csv(item):
    with open('urls.csv', 'a', newline='') as csvfile:
        fieldnames = ['url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerow({'url': item})

class WriteToCsv(object):
    def process_item(self, item, spider):
        if item['url']:
            write_to_csv("http://pypi.org" + item["url"])
        return item
Pipelines.py file
ITEM_PIPELINES = {
    'PyPi.WriteToCsv.WriteToCsv': 100,
    'PyPi.pipelines.PypiPipeline': 300,
}
read_csv method
import csv

def read_csv():
    x = []
    with open('urls.csv', 'r') as csvFile:
        reader = csv.reader(csvFile)
        for row in reader:
            x.append(''.join(row))
    return x
start_urls in B spider file
start_urls = read_csv() #Error here

I would consider using a single spider with two methods, parse and final_parse. As far as I can tell from the context you have provided, there is no need to write the URLs to disk.
parse should contain the logic for scraping the URLs that spider A is currently writing to the csv and should return a new request with a callback to the final_parse method.
def parse(self, response):
    url = do_something(response.body_as_unicode())
    return scrapy.Request(url, callback=self.final_parse)
final_parse should then contain the parsing logic that was previously in spider B.
def final_parse(self, response):
    item = do_something_else(response.body_as_unicode())
    return item
Note: If you need to pass any additional information from parse to final_parse you can use the meta argument of scrapy.Request.
If you do need the URLs, you could add them as a field to your item; they can be accessed with response.url.
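Putting the two callbacks together, a minimal sketch of how such a combined spider could look (the spider name, start URL, CSS selector, and the listing_url meta key are illustrative assumptions, not taken from the question):

import scrapy

class CombinedSpider(scrapy.Spider):
    name = "combined"
    start_urls = ["https://pypi.org/"]  # placeholder listing page

    def parse(self, response):
        # stage "A": collect the detail-page URLs from the listing page
        for href in response.css("a::attr(href)").extract():
            url = response.urljoin(href)
            # pass extra context along via the meta argument if final_parse needs it
            yield scrapy.Request(url, callback=self.final_parse, meta={"listing_url": response.url})

    def final_parse(self, response):
        # stage "B": parse each detail page and keep the URL as an item field
        yield {
            "url": response.url,
            "found_on": response.meta["listing_url"],
        }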


How to append multiple values under a csv file header with Python

This is my code, and I can't append values under the 'Title, Ingredients, Instructions, Nutrients, Image, Link' header.
from recipe_scrapers import scrape_me
import requests
from recipe_scrapers import scrape_html
from csv import writer

with open('recipe.csv', 'w', encoding='utf8', newline='') as file:
    # create a new CSV file and write the header: Title, Ingredients, Instructions, Nutrition_Facts, image, links
    thewriter = writer(file)
    header = ['Title', 'Ingredients', 'Instructions', 'Nutrition_Facts', 'image', 'links']
    thewriter.writerow(header)

url = "https://www.allrecipes.com/recipe/220751/quick-chicken-piccata/"
html = requests.get(url).content
scraper = scrape_html(html=html, org_url=url)
for scrap in scraper:
    # this loop adds the Title, Ingredients, instructions, nutrients, Image, link values
    info = ['title, Ingredients, instructions, nutrients,Image,link']
    thewriter.writerow(info)
    Title = scraper.title()
    Ingredients = scraper.ingredients()
    instructions = scraper.instructions()
    nutrients = scraper.nutrients()
    Image = scraper.image()
    link = scraper.links()
    print(scrap)
How can I solve this?
There are a number of problems with your code. Firstly, your indentation is off. You create the thewriter variable in one code block and then try to access it in another. To fix this, you will have to indent all the code below your with open statement to the same level.
Secondly, according to the recipe-scrapers docs, scraper is an AllRecipesCurated object that cannot be iterated, so your line:
for scrap in scraper:
makes no sense, since you're trying to iterate over a non-iterable object, and it will give you an error.
Finally, these two lines:
info = ['title, Ingredients, instructions, nutrients,Image,link']
thewriter.writerow(info)
mean that you will always have the heading written into your file, not the data you get from calling the URL. You should instead write the data you extract from the URL:
thewriter.writerow([scraper.title(), scraper.ingredients(), scraper.instructions(), scraper.nutrients(), scraper.image(), scraper.links()])
Here is the full code fixed. You should be able to get the correct results using it:
import requests
from recipe_scrapers import scrape_html
from csv import writer

with open('recipe.csv', 'w', encoding='utf8', newline='') as file:
    # create a new CSV file and write the header row: Title, Ingredients, Instructions, Nutrition_Facts, image, links
    thewriter = writer(file)
    header = ['Title', 'Ingredients', 'Instructions', 'Nutrition_Facts', 'image', 'links']
    thewriter.writerow(header)

    url = "https://www.allrecipes.com/recipe/220751/quick-chicken-piccata/"
    html = requests.get(url).content
    scraper = scrape_html(html=html, org_url=url)
    thewriter.writerow([scraper.title(), scraper.ingredients(), scraper.instructions(), scraper.nutrients(), scraper.image(), scraper.links()])
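If you later want to write several recipes into the same file, a small extension of the same idea (the list of URLs below is hypothetical) loops over the URLs and writes one row per recipe:

import requests
from recipe_scrapers import scrape_html
from csv import writer

# hypothetical list of recipe pages to scrape
urls = [
    "https://www.allrecipes.com/recipe/220751/quick-chicken-piccata/",
    # add more recipe URLs here
]

with open('recipe.csv', 'w', encoding='utf8', newline='') as file:
    thewriter = writer(file)
    thewriter.writerow(['Title', 'Ingredients', 'Instructions', 'Nutrition_Facts', 'image', 'links'])
    for url in urls:
        html = requests.get(url).content
        scraper = scrape_html(html=html, org_url=url)
        # one row per recipe, in the same column order as the header
        thewriter.writerow([scraper.title(), scraper.ingredients(), scraper.instructions(),
                            scraper.nutrients(), scraper.image(), scraper.links()])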

Reading a file in python: json.decoder.JSONDecodeError

I have a json file.
with open('list.json', "r") as f:
    r_list = json.load(f)
crashes with:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I checked the schema online and the schema works.
The schema is very simple:
{"foo": [
{"name": "AAA\u2019s BBB CCC", "url": "/foome/foo"}
]}
Tried to play with:
file encoding
a dummy file
I've run out of ideas. Is it something where json.load expects binary input?
Edit 1
The code works in a plain script, but does not work inside the Scrapy class:
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import json

class myScraper(scrapy.Spider):
    name = "testScraper"

    def start_requests(self):
        with open('test.json') as f:
            self.logger.info(f.read())  # shows the file content
            r_list = json.load(f)  # breaks with the error msg
        yield "foo"

    def parse(self, response):
        self.logger.info("foo")
'test.json'
{
    "too": "foo"
}
Most likely your file is empty.
Example:
https://repl.it/#mark_boyle_sp/SphericalImpressiveIrc
Updated:
Your file handle is exhausted, as also discussed in the comments. Since you log the file's contents first, the read position is already at the end of the file, so json.load sees what looks like an empty file, hence the exception. Reset the file position or read the contents into a local variable and operate on that:
json_str = f.read()
self.logger.info(json_str) #shows the file content
r_list = json.loads(json_str)
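Alternatively, a minimal sketch of resetting the read position instead (assuming the same test.json file as above):

with open('test.json') as f:
    self.logger.info(f.read())  # reading for the log moves the position to the end of the file
    f.seek(0)                   # rewind so json.load sees the whole file again
    r_list = json.load(f)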
Updated again:
(I assume) the Scrapy issue you are having is in the parse method? The response body is a bytes object; you will need to decode it and use json.loads on the resulting string, like so:
def parse(self, response):
    self.logger.info("foo")
    resp_str = response.body.decode('utf-8')
    self.logger.info(resp_str)  # shows the response
    r_list = json.loads(resp_str)
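As a side note (not part of the original answer): Scrapy 2.2 and later also provide a built-in helper that decodes and parses the body in one step:

def parse(self, response):
    # response.json() decodes the body and parses it as JSON (Scrapy 2.2+)
    r_list = response.json()
    self.logger.info(r_list)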

scrapy - download with correct extension

I have the following spider:
class Downloader(scrapy.Spider):
    name = "sor_spider"
    download_folder = FOLDER

    def get_links(self):
        df = pd.read_excel(LIST)
        return df["Value"].loc

    def start_requests(self):
        urls = self.get_links()
        for url in urls.iteritems():
            index = {"index": url[0]}
            yield scrapy.Request(url=url[1], callback=self.download_file, errback=self.errback_httpbin, meta=index, dont_filter=True)

    def download_file(self, response):
        url = response.url
        index = response.meta["index"]
        content_type = response.headers['Content-Type']
        download_path = os.path.join(self.download_folder, r"{}".format(str(index)))
        with open(download_path, "wb") as f:
            f.write(response.body)
        yield LinkCheckerItem(index=response.meta["index"], url=url, code="downloaded")

    def errback_httpbin(self, failure):
        yield LinkCheckerItem(index=failure.request.meta["index"], url=failure.request.url, code="error")
It should:
read the Excel file with links (LIST)
go to each link and download the file to FOLDER
log results in LinkCheckerItem (I am exporting it to csv)
That would normally work fine, but my list contains files of different types: zip, pdf, doc, etc.
These are examples of links in my LIST:
https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=2c5fb68702294531afd03041e877ca84
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1173293
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1263289
https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=eb9f06d2b837401eba9c66c8bf5be813
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=952317
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=1042224
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1160005
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=925955
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166563
http://npoimpuls.ru/templates/npoimpuls/material/documents/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA%20%D0%B0%D1%84%D1%84%D0%B8%D0%BB%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%BD%D1%8B%D1%85%20%D0%BB%D0%B8%D1%86%20%D0%BD%D0%B0%2030.06.2016.pdf
http://нпоимпульс.рф/assets/download/sal30.09.2017.pdf
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166287
I would like it to save each file with its original extension, whatever it is... just like my browser does when it opens a dialog to save a file.
I tried to use response.headers["Content-Type"] to find out the type, but in this case it's always application/octet-stream.
How could I do it?
You need to parse the Content-Disposition header for the correct file name.
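A minimal sketch of how download_file could use that header (the regex and the empty-extension fallback are assumptions; real-world Content-Disposition values can be more varied, e.g. RFC 5987 filename*= forms, so treat this as a starting point rather than the definitive approach):

import os
import re

def download_file(self, response):
    index = response.meta["index"]
    # the header typically looks like: attachment; filename="report_2017.pdf"
    disposition = response.headers.get("Content-Disposition", b"").decode("utf-8", errors="ignore")
    match = re.search(r'filename="?([^";]+)"?', disposition)
    # fall back to no extension if the header is missing or unparseable
    extension = os.path.splitext(match.group(1))[1] if match else ""
    download_path = os.path.join(self.download_folder, "{}{}".format(index, extension))
    with open(download_path, "wb") as f:
        f.write(response.body)
    yield LinkCheckerItem(index=index, url=response.url, code="downloaded")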

Can't find any way to close a csv file

I've written a script in Python Scrapy to get different IDs and their corresponding names from a webpage. When I execute my script, the results come through correctly and I get a CSV file filled with data. I'm using Python 3.6, and when I use Scrapy's built-in command (meant to write data to a CSV file), I always get a CSV file with a blank line in every alternate row. However, I tried the following instead and it does the job: it produces a CSV file without the blank-line issue.
My question: how can I close the csv file when the job is done?
This is my try so far:
import scrapy, csv

class SuborgSpider(scrapy.Spider):
    name = "suborg"
    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0, 7)]

    def __init__(self):
        self.file = open("output.csv", "w", newline="")

    def parse(self, response):
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}
            writer = csv.writer(self.file)
            writer.writerow([idnum, name])
You can close the actual file instead. Do it in the closed() method, which is automatically called when the spider is closed:
def closed(self, reason):
    self.file.close()
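For context, a minimal sketch of how the whole open/write/close lifecycle fits together in one spider (the header row is an assumption, not in the original code):

import csv
import scrapy

class SuborgSpider(scrapy.Spider):
    name = "suborg"
    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0, 7)]

    def __init__(self):
        # opened once, when the spider is instantiated
        self.file = open("output.csv", "w", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["ID", "Name"])  # assumed header row

    def parse(self, response):
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}
            self.writer.writerow([idnum, name])

    def closed(self, reason):
        # called automatically when the crawl finishes, so the file is closed exactly once
        self.file.close()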

JSON.Dump doesn't capture the whole stream

So I have a simple crawler that crawls 3 store location pages and parses the locations of the stores to JSON. I print(app_data['stores']) and it prints all three pages of stores. However, when I try to write it out, I only get one of the three pages, at random, written to my JSON file. I'd like everything that streams in to be written to the file. Any help would be great. Here's the code:
import scrapy
import json
import js2xml
from pprint import pprint

class StlocSpider(scrapy.Spider):
    name = "stloc"
    allowed_domains = ["bestbuy.com"]
    start_urls = (
        'http://www.bestbuy.com/site/store-locator/11356',
        'http://www.bestbuy.com/site/store-locator/46617',
        'http://www.bestbuy.com/site/store-locator/77521',
    )

    def parse(self, response):
        js = response.xpath('//script[contains(.,"window.appData")]/text()').extract_first()
        jstree = js2xml.parse(js)
        # print(js2xml.pretty_print(jstree))
        app_data_node = jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]
        app_data = js2xml.make_dict(app_data_node)
        print(app_data['stores'])
        for store in app_data['stores']:
            yield store
        with open('stores.json', 'w') as f:
            json.dump(app_data['stores'], f, indent=4)
You are opening the file for writing every time, but you want to append. Try changing the last part to this:
with open('stores.json', 'a') as f:
    json.dump(app_data['stores'], f, indent=4)
Where 'a' opens the file for appending.
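Note that appending a separate json.dump per response produces several JSON documents back to back in one file, which is not itself valid JSON. If you need a single valid JSON file, one common alternative (an assumption on my part, not part of the original answer) is to drop the manual file handling, just yield the stores, and let Scrapy's feed export write them:

def parse(self, response):
    js = response.xpath('//script[contains(.,"window.appData")]/text()').extract_first()
    jstree = js2xml.parse(js)
    app_data = js2xml.make_dict(jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0])
    # just yield the items; no open()/json.dump() here
    for store in app_data['stores']:
        yield store

Then run the spider with the built-in JSON exporter, e.g. scrapy crawl stloc -o stores.json, which collects the yielded items from all three pages into one valid JSON array.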
