How to use Scrapy Request and get the response in the same place? - python

I am writing a Scrapy crawler to scrape data from an e-commerce website.
The website has color variants, and each variant has its own price, sizes, and stock for those sizes.
To get the price, sizes, and stock for a variant, I need to visit the link of that variant (color).
And all of the data needs to end up in one record.
I have tried using requests, but it is slow and sometimes fails to load the page.
I have written the crawler using requests.get(), passed the response into scrapy.selector.Selector(), and parsed the data from there.
My question is: is there any way to use scrapy.Request() and get the response where I use it, not in a callback function? I need the response in the same place, something like below:
response = scrapy.Request(url=variantUrl)
sizes = response.xpath('sizesXpath').extract()
I know scrapy.Request() requires a parameter callback=self.callbackparsefunction
that will be called when Scrapy generates the response, to handle that generated response. I do not want to use callback functions; I want to handle the response in the current function.
Or is there any way to return the response from the callback function to the function where scrapy.Request() is written, something like below:
def parse(self, response):
    variants = response.xpath('variantXpath').extract()
    for variant in variants:
        res = scrapy.Request(url=variant, callback=self.parse_color)
        # use of the res response

def parse_color(self, response):
    return response

Take a look at the scrapy-inline-requests package; I think it's exactly what you are looking for.
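A minimal sketch of the usual pattern with that package (the spider name, URL, and XPaths below are placeholders, not taken from the question):

import scrapy
from inline_requests import inline_requests


class VariantSpider(scrapy.Spider):
    name = "variants"                              # hypothetical spider name
    start_urls = ["https://example.com/product"]   # placeholder product URL

    @inline_requests
    def parse(self, response):
        item = {"name": response.xpath("//h1/text()").get()}
        for href in response.xpath("//a[@class='variant']/@href").getall():
            # The yield hands the request to Scrapy and resumes here with
            # the response, instead of dispatching to a separate callback.
            variant_response = yield scrapy.Request(response.urljoin(href))
            item.setdefault("sizes", []).extend(
                variant_response.xpath("//li[@class='size']/text()").getall()
            )
        yield item

The requests still go through the scheduler and downloader, so they stay asynchronous under the hood; only the code reads as if the response were returned in place.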

Related

Scrapy: How to parse the same URL twice, check if the results are identical, and if not relaunch until two are identical

I've just started web scraping using Python and Scrapy.
I am trying to scrape data from numerous URLs from a website. The URL structure is always the same, as is the data within the page. Thus, the spider was fairly easy to make, and the scraping seemed quite easy until now.
However, I've noticed that, even if I use a proxy manager, some results I get from the response differ from the ones I get if I visit the URL manually in my web browser. From what I have understood, it might be some kind of 'honeypot'.
I've thought about a way to bypass this problem:
Fetch the same URL twice
Compare the result values
If they are similar, validate them
If not, relaunch the request until I get a pair of identical values
I am unfortunately too limited for now to find a technical answer. How could I make my spider always parse the same URL twice? Should I instead use 2 spiders and compare the results in another way?
Here is the raw code of my spider:
import scrapy


def clean_prices(price_str: str):
    if price_str:
        return (price_str.replace('\u202f', '').replace(',', '.').replace(" ", "")
                .replace("\xa0", "").replace('\n', '').replace('€', ''))
    else:
        return price_str


class PricesSpider(scrapy.Spider):
    name = "prices"
    download_delay = 2.5
    start_urls = [...]

    def parse(self, response):
        apart_cat_range = response.css("div.prices-summary__apartment-prices ul.prices-summary__price-range")
        yield {
            'prices': clean_prices(apart_cat_range.css("li:nth-child(3)::text").get()).split("à")[1]
            if apart_cat_range.css("li:nth-child(3)::text").get() is not None
            else None
        }
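One hedged way to implement the "fetch twice and compare" idea described above, as a sketch only (the start URL, selector, and retry cap are simplified placeholders based on the spider code):

import scrapy


class PricesCheckSpider(scrapy.Spider):
    name = "prices_check"                          # hypothetical spider name
    download_delay = 2.5
    start_urls = ["https://example.com/listing"]   # placeholder start URL

    def parse(self, response):
        first_price = response.css("ul.prices-summary__price-range li:nth-child(3)::text").get()
        # Request the exact same URL again; dont_filter=True bypasses Scrapy's duplicate filter
        yield scrapy.Request(
            response.url,
            callback=self.parse_again,
            dont_filter=True,
            meta={"previous_price": first_price, "attempts": 1},
        )

    def parse_again(self, response):
        previous_price = response.meta["previous_price"]
        current_price = response.css("ul.prices-summary__price-range li:nth-child(3)::text").get()
        if current_price == previous_price:
            # Two consecutive fetches agree, so accept the value
            yield {"prices": current_price}
        elif response.meta["attempts"] < 5:        # arbitrary retry cap
            yield scrapy.Request(
                response.url,
                callback=self.parse_again,
                dont_filter=True,
                meta={"previous_price": current_price, "attempts": response.meta["attempts"] + 1},
            )

An alternative would be to keep a single callback and carry the observed values in meta, but the two-callback version keeps the first fetch and the comparison step easy to follow.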

Passing a list as an argument to a function in python

I'm new to Python and I'm struggling with passing a list as an argument to a function.
I've written a block of code to take a URL, extract all links from the page, and put them into a list (links=[]). I want to pass this list to a function that filters out any link that is not from the same domain as the starting link (aka the first in the list) and outputs a new list (filtered_list = []).
This is what I have:
import requests
from bs4 import BeautifulSoup

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)

print(filter_links(links))
When I run this, I get an unfiltered list and below that, I get None.
Eventually I want to pass the filtered list to a function that grabs the html from each page in the domain linked on the homepage, but I am trying to tackle this problem 1 process at a time. Any tips would be much appreciated, thank you :)
EDIT
I can now pass the list of URLs to the filter_links() function; however, I'm filtering out too much now. Eventually I want to pass several different start URLs through this program, so I need a generic way of filtering URLs that are within the same domain as the starting URL. I have used the built-in startswith method, but it's filtering out everything except the starting URL. I think I could use regex, but this should work too?
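For the record, a hedged sketch of a domain-based filter using the standard library's urllib.parse (the helper name is made up; links and start_url are the variables from the code above). Comparing netloc values instead of using startswith keeps relative links such as "/about" from being discarded:

from urllib.parse import urljoin, urlparse

def filter_links_by_domain(links, start_url):
    start_domain = urlparse(start_url).netloc
    filtered_links = []
    for link in links:
        # Resolve relative links against the start URL before comparing domains
        absolute = urljoin(start_url, link)
        if urlparse(absolute).netloc == start_domain:
            filtered_links.append(absolute)
    return filtered_links

print(filter_links_by_domain(links, start_url))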

Passing Multiple URLs (strings) Into One request.get statement Python

I am trying to have a requests.get statement with two URLs in it. What I am aiming to do is have requests (the Python module) make two requests based on a list or two strings I provide. How can I pass multiple strings from a list into a requests.get statement, and have requests go to each URL (string) and do something?
Thanks
Typically, if we are talking about the Python requests library, it only runs one URL GET request at a time. If what you are trying to do is perform multiple requests with a list of known URLs, then it's quite easy.
import requests

my_links = ['https://www.google.com', 'https://www.yahoo.com']
my_responses = []
for link in my_links:
    # .text holds the page body; use .json() only if the endpoint returns JSON
    payload = requests.get(link).text
    print('got response from {}'.format(link))
    my_responses.append(payload)
    print(payload)
my_responses now has all the content from the pages.
You don't. The requests.get() method (or any other method, really) takes a single URL and makes a single HTTP request, because that is what most humans want it to do.
If you need to make two requests, you must call that method twice.
requests.get(url)
requests.get(another_url)
Of course, these calls are synchronous; the second will only begin once the first response is received.
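If the requests really do need to run at the same time, a hedged sketch using the standard library's concurrent.futures (the URLs are placeholders) would be:

import requests
from concurrent.futures import ThreadPoolExecutor

urls = ['https://www.example.com', 'https://www.example.org']  # placeholder URLs

with ThreadPoolExecutor(max_workers=2) as executor:
    # map() issues the GET requests in parallel threads and yields the
    # responses in the same order as the input URLs
    responses = list(executor.map(requests.get, urls))

for url, response in zip(urls, responses):
    print(url, response.status_code)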

Creating multiple requests from same method in Scrapy

I am parsing webpages that have a similar structure to this page.
I have the following two functions:
def parse_next(self, response):
    # implementation goes here
    # create Request(the_next_link, callback=parse_next)
    # for link in discovered_links:
    #     create Request(link, callback=parse_link)

def parse_link(self, response):
    pass
I want parse_next() to create a request for the *Next link on the web page. At the same time, I want it to create requests for all the URLs that were discovered on the current page, using parse_link() as the callback. Note that I want parse_next to recursively use itself as a callback, because this seems to me to be the only possible way to generate requests for all the *Next links.
*Next: the link that appears beside all the numbers on that page
How am I supposed to solve this problem?
Use a generator function and loop through your links, then call this on the links that you want to make a request to:
for link in links:
    yield Request(link.url)
Since you are using scrapy, I'm assuming you have link extractors set up.
So, just declare your link extractor as a variable like this:
link_extractor = SgmlLinkExtractor(allow=('.+'))
Then, in the parse function, call the link extractor on the response to get the links:
links = self.link_extractor.extract_links(response)
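Putting the two pieces together, a hedged sketch of what the spider could look like (the spider name, start URL, and the CSS selector for the *Next link are assumptions, and LinkExtractor is used in place of the older SgmlLinkExtractor):

import scrapy
from scrapy.linkextractors import LinkExtractor


class PagingSpider(scrapy.Spider):
    name = "paging"                                # hypothetical spider name
    start_urls = ["https://example.com/list"]      # placeholder start URL
    link_extractor = LinkExtractor(allow=('.+',))

    def parse(self, response):
        # Default entry point; hand off to the recursive paging callback
        return self.parse_next(response)

    def parse_next(self, response):
        # Requests for every link discovered on the current page
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_link)

        # Request for the *Next page, handled recursively by parse_next itself
        next_href = response.css("a.next::attr(href)").get()  # assumed selector
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse_next)

    def parse_link(self, response):
        pass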
Here you go:
http://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained

HTTP POST and parsing JSON with Scrapy

I have a site that I want to extract data from. The data retrieval is very straightforward.
It takes the parameters via HTTP POST and returns a JSON object. So, I have a list of queries that I want to run and then repeat at certain intervals to update a database. Is Scrapy suitable for this, or should I be using something else?
I don't actually need to follow links but I do need to send multiple requests at the same time.
What does the POST request look like? There are many variations, such as simple query parameters (?a=1&b=2), a form-like payload (the body contains a=1&b=2), or any other kind of payload (the body contains a string in some format, like JSON or XML).
In Scrapy it is fairly straightforward to make POST requests; see: http://doc.scrapy.org/en/latest/topics/request-response.html#request-usage-examples
For example, you may need something like this:
# Warning: take care of the undefined variables and modules!
def start_requests(self):
    payload = {"a": 1, "b": 2}
    yield Request(url, self.parse_data, method="POST", body=urllib.urlencode(payload))

def parse_data(self, response):
    # do stuff with data...
    data = json.loads(response.body)
For handling requests and retrieving responses, Scrapy is more than enough. And to parse JSON, just use the json module from the standard library:
import json
data = ...
json_data = json.loads(data)
Hope this helps!
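If the endpoint instead expects a JSON body rather than a form-encoded one, a hedged variant of the example above (the endpoint URL and payload keys are placeholders) could be:

import json
import scrapy


class ApiSpider(scrapy.Spider):
    name = "api"   # hypothetical spider name

    def start_requests(self):
        payload = {"a": 1, "b": 2}  # placeholder query parameters
        yield scrapy.Request(
            "https://example.com/api/search",           # placeholder endpoint
            method="POST",
            body=json.dumps(payload),
            headers={"Content-Type": "application/json"},
            callback=self.parse_data,
        )

    def parse_data(self, response):
        # response.text is the JSON string returned by the endpoint
        yield json.loads(response.text)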
Based on my understanding of the question, you just want to fetch/scrape data from a web page at certain intervals. Scrapy is generally used for crawling.
If you just want to make HTTP POST requests, you might consider using the Python requests library.
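For instance, a hedged sketch of running a list of queries with requests (the endpoint URL and payloads are placeholders); scheduling the repeats could then be handled by cron or a simple loop with a sleep:

import requests

queries = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]  # placeholder query payloads

for payload in queries:
    # json= sends a JSON body; use data= instead for form-encoded bodies
    response = requests.post("https://example.com/api/search", json=payload, timeout=30)
    response.raise_for_status()
    print(response.json())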
