I have an intranet page with multiple input fields. I need Scrapy to run a search using the page's "search for products" input field, which has an id of "searchBox".
I have been able to locate the correct search box using both Scrapy and Beautiful Soup, but I am not sure how to pass that data to Scrapy's form submission function correctly.
In Method 1 I tried simply passing the results to Scrapy's FormRequest.from_response function as an input, but it does not work.
Method 1 - Using Scrapy to find the data
#Search for products
def parse(self, response):
    ##Let's try search using scrapy only
    sel = Selector(response)
    results = sel.xpath("//*[contains(@id, 'searchBox')]")
    for result in results:
        print(result.extract()) #Print out what scrapy found
    return scrapy.FormRequest.from_response(results, formdata = {'Item': 'Whirlpool Washing Machine'}) #formdata is the data we are sending
Method 2 - Using Beautiful soup to find the data
#Search for products
def parse(self, response):
    ##Let's try search using Beautiful Soup only
    soup = BeautifulSoup(response.text, 'html.parser')
    product_search = []
    product_search.append(soup.find("input", id="searchBox"))
    print(product_search) #Print what BS found
About the Scrapy variant:
Prefer yield over return in the callback, so that it can emit more than one request or item.
FormRequest.from_response expects the response itself as its first argument, not a selector or the input element you found. As far as I can tell from your code, you are passing it the matched input. To pick a specific form on the page, use the formid, formname, formnumber, formxpath or formcss argument.
Try something like:
yield scrapy.FormRequest.from_response(response, formcss='form', formdata={'Item': 'Whirlpool Washing Machine'})
Adjust the form selector in this expression to match your page. Also check what else the request needs, such as headers or cookies.
Hello everyone, I'm a beginner at scraping and I am trying to scrape all iPhones on https://www.electroplanet.ma/
This is the script I wrote:
import re

import scrapy
from ..items import EpItem

class ep(scrapy.Spider):
    name = "ep"
    start_urls = ["https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=1",
                  "https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=2"
                  ]

    def parse(self, response):
        products = response.css("ol li") # find all items on the page
        for product in products:
            try:
                lien = product.css("a.product-item-link::attr(href)").get() # get the link of each item
                image = product.css("a.product-item-photo::attr(href)").get() # get the image
                # to visit each item page and scrape it, I use the follow method
                # I pass image as an argument to parse_item because I couldn't scrape the image from the item's page
                # I think it's hidden
                yield response.follow(lien, callback=self.parse_item, cb_kwargs={"image": image})
            except:
                pass

    def parse_item(self, response, image):
        item = EpItem()
        item["Nom"] = response.css(".ref::text").get()
        pattern = re.compile(r"\s*(\S+(?:\s+\S+)*)\s*") # captures the text without surrounding whitespace
        item["Catégorie"] = pattern.search(response.xpath("//h1/a/text()").get()).group(1)
        item["Marque"] = pattern.search(response.xpath("//*[@data-th='Marque']/text()").get()).group(1)
        try:
            item["RAM"] = pattern.search(response.xpath("//*[@data-th='MÉMOIRE RAM']/text()").get()).group(1)
        except:
            pass
        item["ROM"] = pattern.search(response.xpath("//*[@data-th='MÉMOIRE DE STOCKAGE']/text()").get()).group(1)
        item["Couleur"] = pattern.search(response.xpath("//*[@data-th='COULEUR']/text()").get()).group(1)
        item["lien"] = response.request.url
        item["image"] = image
        item["état"] = "neuf"
        item["Market"] = "Electro Planet"
        yield item
I had trouble scraping all the pages because the site uses JavaScript to move between them, so I wrote every page link into start_urls. I believe that's not the best practice, so I'm asking for advice on how to improve my code.
You can use the scrapy-playwright plugin to scrape interactive websites. For start_urls, just add the site's main index URL if there is only one website, and see the Scrapy docs on making the spider follow page links automatically instead of writing them out manually.
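If the listing pages really are reachable through the ?p=N query parameter, as the two start_urls in the question suggest, a browser may not be needed at all. Here is a rough sketch of a helper that derives the next page URL from the current one. The function name next_page_url is made up, and it assumes the site returns a page with no products once you go past the last page.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(url, param="p"):
    """Return `url` with its page-number query parameter increased by one."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # A missing parameter is treated as page 1.
    current = int(query.get(param, ["1"])[0])
    query[param] = [str(current + 1)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

In parse, after the loop over products, you could then yield response.follow(next_page_url(response.url), callback=self.parse) only when the current page actually contained product links, so the crawl stops by itself at the end.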
I am new to Scrapy and Python.
I want to scrape data from a search result. When the page loads, the default content appears, but what I need to scrape is the filtered content, including its pagination.
Here's the URL
https://teslamotorsclub.com/tmc/post-ratings/6/posts
I need to scrape the items from the Time Filter: "Today" result.
I tried different approaches but none of them works.
What I have so far is mostly the layout structure:
class TmcnfSpider(scrapy.Spider):
    name = 'tmcnf'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/post-ratings/6/posts']

    def start_requests(self):
        #Show form from a filtered search result
        pass

    def parse(self, response):
        #some code scraping item
        #Yield url for pagination
        pass
To get the posts for the "Today" filter, you need to send a POST request to https://teslamotorsclub.com/tmc/post-ratings/6/posts along with a payload. The following should fetch the results you are interested in.
import scrapy

class TmcnfSpider(scrapy.Spider):
    name = "teslamotorsclub"
    start_urls = ["https://teslamotorsclub.com/tmc/post-ratings/6/posts"]

    def parse(self,response):
        payload = {'time_chooser':'4','_xfToken':''}
        yield scrapy.FormRequest(response.url,formdata=payload,callback=self.parse_results)

    def parse_results(self,response):
        for items in response.css("h3.title > a::text").getall():
            yield {"title":items.strip()}
I am a newbie and I've written a Scrapy script in Python to collect information recursively.
First, it scrapes the links of the cities that offer tours, then it follows each city link to its page. Next, it extracts the needed tour information for that city before moving on to the next page, and so on. Pagination is driven by JavaScript, with no visible link.
The command I used to get the result along with a csv output is:
scrapy crawl practice -o practice.csv -t csv
The expected result is a CSV file:
title, city, price, tour_url
t1, c1, p1, url_1
t2, c2, p2, url_2
...
The problem is that the CSV file is empty. Execution stops at parse_page, and callback=self.parse_item doesn't work. I don't know how to fix it. Maybe my workflow is invalid or my code has issues. Thanks for your help.
name = 'practice'
start_urls = ['https://www.klook.com/vi/search?query=VI%E1%BB%86T%20NAM%20&type=country',]

def parse(self, response): # Extract cities from country
    hxs = HtmlXPathSelector(response)
    urls = hxs.select("//div[@class='swiper-wrapper cityData']/a/@href").extract()
    for url in urls:
        url = urllib.parse.urljoin(response.url, url)
        self.log('Found city url: %s' % url)
        yield response.follow(url, callback=self.parse_page) # Link to city

def parse_page(self, response): # Move to next page
    url_ = response.request.url
    yield response.follow(url_, callback=self.parse_item)
    # I will use selenium to move next page because of next button is running
    # on javascript without fixed url.

def parse_item(self, response): # Extract tours
    for block in response.xpath("//div[@class='m_justify_list m_radius_box act_card act_card_lg a_sd_move j_activity_item js-item ']"):
        article = {}
        article['title'] = block.xpath('.//h3[@class="title"]/text()').extract()
        article['city'] = response.xpath(".//div[@class='g_v_c_mid t_mid']/h1/text()").extract()# fixed
        article['price'] = re.sub(" +","",block.xpath(".//span[@class='latest_price']/b/text()").extract_first()).strip()
        article['tour_url'] = 'www.klook.com'+block.xpath(".//a/@href").extract_first()
        yield article
hxs = HtmlXPathSelector(response) # not needed: the response already exposes a selector, so call response.xpath directly
Instead of:
url = urllib.parse.urljoin(response.url, url)
use:
url = response.urljoin(url)
Yes, it will stop there: the request made in parse_page goes to the same URL as the previous one, so it is dropped as a duplicate. You need to add dont_filter=True to that request.
Instead of using Selenium, figure out what request the website performs using JavaScript (watch the Network tab of the developer tools of your browser while you navigate) and reproduce a similar request.
The website uses JSON requests underneath to fetch the items, and JSON is much easier to parse than the HTML.
Also, if you are not familiar with Scrapy’s asynchronous nature, you are likely to get unexpected issues while using it in combination with Selenium.
Solutions like Splash or Selenium are only meant to be used as a last resort, when everything else fails.
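To give an idea of what parsing such a JSON response might look like, here is a small sketch. Every key name in it ("result", "activities", "title", "price", "url") is a placeholder I made up, not the site's real schema. Check the actual response in the Network tab and adjust accordingly. The function name parse_tours is hypothetical as well.

```python
import json


def parse_tours(body, city):
    """Yield one dict per tour from a JSON response body.

    The key names below are hypothetical placeholders; replace them with
    the ones seen in the real response.
    """
    payload = json.loads(body)
    for activity in payload.get("result", {}).get("activities", []):
        yield {
            "title": activity.get("title"),
            "city": city,
            "price": activity.get("price"),
            "tour_url": activity.get("url"),
        }
```

Inside a Scrapy callback you would call it with response.text and yield each dict it produces. Paging then usually comes down to changing a page or offset parameter of the same request.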
Although I've seen several similar questions here regarding this, none seem to precisely define the process for achieving this task. I borrowed largely from the Scrapy script located here but since it is over a year old I had to make adjustments to the xpath references.
My current code looks like this:
import scrapy
from tripadvisor.items import TripadvisorItem

class TrSpider(scrapy.Spider):
    name = 'trspider'
    start_urls = [
        'https://www.tripadvisor.com/Hotels-g29217-Island_of_Hawaii_Hawaii-Hotels.html'
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)
        next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)
        next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorItem()
        item['headline'] = response.xpath('translate(//div[@class="quote"]/text(),"!"," ")').extract()[0][1:-1]
        item['review'] = response.xpath('translate(//div[@class="entry"]/p,"\n"," ")').extract()[0]
        item['bubbles'] = response.xpath('//span[contains(@class,"ui_bubble_rating")]/@alt').extract()[0]
        item['date'] = response.xpath('normalize-space(//span[contains(@class,"ratingDate")]/@content)').extract()[0]
        item['hotel'] = response.xpath('normalize-space(//span[@class="altHeadInline"]/a/text())').extract()[0]
        return item
When running the spider in its current form, I scrape the first page of reviews for each hotel listed on the start_urls page but the pagination doesn't flip to the next page of reviews. From what I suspect, this is because of this line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
Since these pages load dynamically, there is no existing href for the next page on the current page. Investigating further I've read that these requests are sending a POST request using XHR. By exploring the "Network" tab in Firefox "Inspect" I can see both a Request URL and Form Data that might be needed to flip the page according to other posts on SO regarding the same topic.
However, it seems that the other posts refer to a static URL starting point when trying to pass a FormRequest using Scrapy. With TripAdvisor, the URL will always change based on the name of the hotel we're looking at, so I'm not sure how to choose a URL when using FormRequest to submit the form data: reqNum=1&changeSet=REVIEW_LIST (this form data also never seems to change from page to page).
Alternatively, there doesn't appear to be a way to extract the URL shown in the "Network" tab's "Request URL". These pages do have URLs that change from page to page but the way TripAdvisor is set up, I cannot seem to extract them from the source code. The review pages change by incrementing the part of the URL that is -orXX- where "XX" is a number. For example:
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or10-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or15-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
So, my question is whether or not it is possible to paginate using the XHR request/form data or do I need to manually build a list of URLs for each hotel that adds the -orXX-?
Well I ended up discovering an xpath that apparently allowed pagination of the reviews, but it's funny because every time I checked the underlying HTML the href link never changed from referring to /Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html even if I was on page 10 for example. It seems the "-orXX-" part of the link always increments the XX by 5 so I'm not sure why this works.
All I did was change the line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
to:
next_page = response.xpath('//link[@rel="next"]/@href')
and I now have more than 41K extracted reviews. I would love to get others' opinions on handling this problem in other situations.
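For anyone who does want the manual route from the question, building the -orXX- URLs is straightforward. This is just a sketch based on the four example URLs above. The function name review_page_url is made up, and the step of 5 reviews per page is taken from those examples rather than from anything TripAdvisor documents.

```python
import re


def review_page_url(url, offset):
    """Return the review-page URL for the given review offset (0, 5, 10, ...)."""
    # Offset 0 is the plain first page, which has no -orXX- segment.
    segment = "-Reviews-" if offset == 0 else f"-Reviews-or{offset}-"
    # Replace "-Reviews-" together with any existing "orXX-" part.
    return re.sub(r"-Reviews-(?:or\d+-)?", segment, url, count=1)
```

You would still need to know when to stop, for example by reading the total review count from the first page, which is why the rel="next" link is the more convenient option when it is available.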
I am crawling a site with Scrapy. The parse method first extracts all the category links and then dispatches a request for each one with parse_category as the callback.
The problem is that if a category has only one product, the site redirects to that product's page, and my parse_category fails to recognize this page.
How do I parse that redirected category page with the product page parser?
Here is an example.
parse finds 3 category pages.
http://example.com/products/samsung
http://example.com/products/dell
http://example.com/products/apple
parse_category is called for all those pages. Each returns an HTML page with a list of products. But apple has a single product, iMac 27", so it redirects to http://example.com/products/apple/imac_27. This is a product page. The category parser fails to parse it.
I need the product parse method, parse_product, to be called in this scenario. How do I do that?
I could add some logic to my parse_category method and call parse_product from there, but I don't want that. I want Scrapy to do it. I can supply URL patterns or any other necessary information.
Here is the code.
class ExampleSpider(BaseSpider):
    name = u'example.com'
    allowed_domains = [u'www.example.com']
    start_urls = [u'http://www.example.com/category.aspx']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        anchors = hxs.select('/xpath')
        for anchor in anchors:
            yield Request(urljoin(get_base_url(response), anchor), callback=self.parse_category)

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)
        products = hxs.select(products_xpath).extract()
        for url in products:
            yield Request(url, callback=self.parse_product)

    def parse_product(self, response):
        # product parsing ...
        pass
You can opt to write a downloader middleware which implements the process_response method. Whenever your response is for a product URL instead of a category, create a copy of the Request object and change the callback function to your product parser.
In the end, return the new Request object from the middleware. Note: You might need to set dont_filter to True for the new Request to ensure the DupeFilter doesn't filter the Request.
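A rough sketch of such a middleware could look like the following. The class name, the URL pattern and the check on the callback's name are all assumptions made for illustration: the pattern simply treats any URL with one extra path segment after the category as a product page, which fits the example URLs in the question.

```python
import re

# Hypothetical pattern: a product URL has one more path segment
# than a category URL, e.g. /products/apple/imac_27.
PRODUCT_URL = re.compile(r"/products/[^/]+/[^/]+/?$")


class CategoryRedirectMiddleware:
    """Downloader middleware that reroutes redirected category requests."""

    def process_response(self, request, response, spider):
        callback_name = getattr(request.callback, "__name__", None)
        if callback_name == "parse_category" and PRODUCT_URL.search(response.url):
            # Returning a Request here reschedules it, so the product
            # page will be downloaded a second time.
            return request.replace(
                url=response.url,
                callback=spider.parse_product,
                # Keep the dupe filter from dropping the repeated URL.
                dont_filter=True,
            )
        return response
```

The middleware then has to be enabled through the DOWNLOADER_MIDDLEWARES setting. Because the replaced request uses parse_product as its callback, it passes through the middleware untouched the second time, so there is no loop.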