I'm working on a web scraping project that pulls product data from a website. I've been doing projects like this for a few months with pretty good success, but this most recent website is giving me some trouble. This is the website I'm scraping: https://www.phoenixcontact.com/online/portal/us?1dmy&urile=wcm%3apath%3a/usen/web/home. Here's an example of a product page: https://www.phoenixcontact.com/online/portal/us/?uri=pxc-oc-itemdetail:pid=3248125&library=usen&pcck=P-15-11-08-02-05&tab=1&selectedCategory=ALL. I have a program that navigates to each product page and extracts the majority of the information using BeautifulSoup.
Where I run into issues is trying to get the product numbers of all the products under the "Accessories" tab. I tried using Selenium rather than Beautiful Soup to pull up the page and actually click through the accessories pages, but the website throws a 403 error if you try to update the page by clicking the page numbers or arrows, or by changing the displayed number of products. The buttons themselves don't have an actual link; their href attribute is "#", which just takes you back to the top of the section after the list updates. However, I have found that the request URL in the XHR request fired when you click one of those page links leads to a page with the product information. From there I can make slight changes to the site= and itemsPerPage= parts of the URL and scrape the information pretty easily.
I am scraping 30,000 of these product pages, and each one has a different request URL for the XHR request, with no recognizable relationship between the page URL and the request URL. Any ideas on how to get the XHR request URL from each page?
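Once a request URL has been captured, the site= and itemsPerPage= tweaks described above can be automated with the standard library. A minimal sketch, using a made-up URL in place of a real captured one:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def rewrite_xhr_url(xhr_url, **overrides):
    """Return a copy of an XHR request URL with selected query
    parameters (e.g. site=, itemsPerPage=) replaced."""
    parts = urlparse(xhr_url)
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    params.update({k: str(v) for k, v in overrides.items()})
    return urlunparse(parts._replace(query=urlencode(params)))

# Hypothetical captured XHR URL -- the real one comes from the
# Network panel, as described above.
captured = "https://example.com/accessories?site=us&itemsPerPage=10&page=1"
print(rewrite_xhr_url(captured, itemsPerPage=100))
# → https://example.com/accessories?site=us&itemsPerPage=100&page=1
```

For capturing the XHR URL itself on each of the 30,000 pages, the selenium-wire package may be worth a look: it wraps the Selenium webdriver and records every request the browser makes in driver.requests, so the accessories XHR URL could be picked out per page instead of guessed at.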
I'm pretty fluent in Selenium and Beautiful Soup, but other web scraping packages are unfamiliar to me and would warrant a little extra explanation.
EDIT: The first screenshot shows what happens when I try to use Selenium to navigate through the pages: the product list doesn't change, and Selenium reports the error shown. [screenshot: Selenium error]
The second screenshot shows the XHR request I've found; I just need a way to retrieve that URL on each page so I can hand it to Beautiful Soup. [screenshot: XHR request]
I've developed a web scraper that extracts reviews from a particular shopping website. It's written in Python, and the scraping is done with Selenium + BS4, but my client thinks it's TOO SLOW and wants it to use Requests instead. To scrape the reviews, I have to wait until they show up (or click a review tab) and then page through every review. I'm guessing the review div is loaded by an XHR/AJAX call, because the whole page doesn't reload when I click the next page. All the parsing is done with BeautifulSoup.
I'm leaving a URL so you can all go and check:
https://smartstore.naver.com/hoskus/products/4351439834?NaPm=ct%3Dkeuq83q8%7Cci%3Df5e8bd34633b9b48d81db83b289f1b2e0512d2f0%7Ctr%3Dslsl%7Csn%3D888653%7Chk%3D9822b0c3e9b322fa2d874575218c223ce2454a42
I've always thought Requests seems to read the HTML far faster than Selenium, but I don't know how to get at the HTML when it's all hidden behind buttons. Does anybody have an idea, or something I can refer to?
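The usual answer here is that the XHR call behind the review tab can be replayed directly with Requests once its URL and parameters are copied from the browser's Network panel. A minimal sketch — the endpoint, parameter names, and header values below are placeholders for illustration, not the site's real API:

```python
import requests

# Hypothetical review endpoint -- the real path and parameters must be
# copied from the XHR entry in the browser's Network panel.
BASE = "https://example.com/reviews"

def build_review_request(product_id, page):
    """Prepare the paged review request that the browser sends via XHR."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0",        # many sites reject the default UA
        "Referer": "https://example.com/",  # some XHR endpoints check this
    })
    req = requests.Request("GET", BASE,
                           params={"productId": product_id, "page": page})
    return session.prepare_request(req)

prepared = build_review_request("4351439834", 2)
print(prepared.url)  # → https://example.com/reviews?productId=4351439834&page=2
```

In real use you would call session.send(prepared) (or just session.get(BASE, params=...)) in a loop over pages and feed each response body to BeautifulSoup or json.loads, which avoids starting a browser at all.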
Hello, I am trying to scrape the following link with bs4 in Python: "https://eprocure.gov.in/eprocure/app;jsessionid=9AD8A7A17E1B2868527E25799DBE45A2.eprocgep2?page=FrontEndLatestActiveTenders&service=page". For the first page everything seems to be OK, but when I navigate to the next page the URL pattern changes completely. Here is the next-page URL pattern: "https://eprocure.gov.in/eprocure/app?component=%24TablePages.linkPage&page=FrontEndLatestActiveTenders&service=direct&session=T&sp=AFrontEndLatestActiveTenders%2Ctable&sp=2". Because of the pattern change I cannot automate the scraping for every page, and when I try to scrape the second page manually, the soup object cannot fetch any of the tags, even though the network inspector shows those tags for the second page. Can anyone solve the issue and scrape all of the pages? Please share your solution.
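If the second-page pattern above is stable apart from the trailing sp= index (an assumption worth verifying against a few pages by hand), the paging URLs can be generated rather than discovered. A sketch of that URL construction — note the requests themselves should go through one requests.Session, so the jsessionid cookie set by the first page is reused on later pages:

```python
from urllib.parse import urlencode

BASE = "https://eprocure.gov.in/eprocure/app"

def page_url(page_number):
    """Rebuild the Tapestry paging URL from the question, varying only
    the final sp= value (assumed here to be the page number)."""
    params = [
        ("component", "$TablePages.linkPage"),
        ("page", "FrontEndLatestActiveTenders"),
        ("service", "direct"),
        ("session", "T"),
        ("sp", "AFrontEndLatestActiveTenders,table"),
        ("sp", str(page_number)),
    ]
    return BASE + "?" + urlencode(params)

# For page_number=2 this reproduces the second-page URL quoted above.
print(page_url(2))
```

The likely reason the manually fetched second page came back empty is that it was requested without the session cookie from page one, so the server had no paging state to serve; fetching page_url(n) with the same Session object addresses that.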
I am trying to scrape this page:
https://www.jny.com/collections/bottoms
It has a total of 55 products, with only 24 listed once the page is loaded. However, the div contains the list of all 55 products. I am trying to scrape that using Scrapy like this:
def parse(self, response):
    print("in here")
    # note: the '@' signs in the XPath were mangled to '#' in the original post
    self.product_url = response.xpath(
        '//div[@class="collection-grid js-filter-grid"]//a/@href').getall()
    print(len(self.product_url))
    print(self.product_url)
It only gives me a list of length 25. How do I get the rest?
I would suggest scraping it through the API directly - the other option would be rendering Javascript using something like Splash/Selenium, which is really not ideal.
If you open up the Network panel in the Developer Tools in Chrome/Firefox, filter down to only the XHR requests, and reload the page, you should be able to see all of the requests being sent out. Some of those requests can help us figure out how the data is being loaded into the HTML. Here's a screenshot of what's going on behind the scenes. [screenshot: Network panel]
Clicking on those requests can give us more details on how the requests are being made and the request structure. At the end of the day, for your use case, you would probably want to send out a request to https://www.jny.com/collections/bottoms/products.json?limit=250&page=1 and parse the body_html attribute for each Product in the response (perhaps using scrapy.selector.Selector) and use that however you want. Good luck!
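To make the parsing step concrete, here is a sketch of working with that products.json response. The payload below is a made-up stand-in with the same shape (Shopify-style product objects with handle and body_html fields); in practice it would come from requests.get("https://www.jny.com/collections/bottoms/products.json", params={"limit": 250, "page": 1}).json():

```python
# Minimal stand-in for the products.json response body; the real one
# is fetched over HTTP as described above.
payload = {
    "products": [
        {"handle": "wide-leg-pant", "body_html": "<p>Wide-leg pant</p>"},
        {"handle": "slim-jean", "body_html": "<p>Slim jean</p>"},
    ]
}

# Shopify stores conventionally serve each product at /products/<handle>,
# so the product URLs the spider wanted can be rebuilt from the handles.
urls = ["https://www.jny.com/products/" + p["handle"]
        for p in payload["products"]]
print(urls)
```

From there each body_html string can be handed to scrapy.selector.Selector (or BeautifulSoup) if the description markup itself needs parsing.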
I am trying to scrape a website that you have to access through a specific link. The issue is that this link takes you to the home page, and each product on the website has a unique URL. My question is: what should I do to access these product pages in order to scrape them and download the PDFs?
I am used to just looping through the URLs directly, but I have never had to go through one link to access the other URLs. Any help would be great.
I am using Python and bs4.
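One common pattern for this: open the entry link once with a requests.Session (so any cookies it sets persist), parse the home page with bs4 to collect the per-product URLs, then fetch each of those with the same session. The selector and URLs below are invented for illustration — the real ones depend on the site's markup:

```python
from bs4 import BeautifulSoup

def product_links(home_html, base="https://example.com"):
    """Collect absolute product URLs from the home page's HTML.
    The 'a.product' selector is a placeholder for the real markup."""
    soup = BeautifulSoup(home_html, "html.parser")
    return [base + a["href"] for a in soup.select("a.product")]

# Stand-in home-page snippet for demonstration.
sample = '<a class="product" href="/p/1">A</a><a class="product" href="/p/2">B</a>'
print(product_links(sample))
# → ['https://example.com/p/1', 'https://example.com/p/2']
```

For the PDFs, loop session.get(url) over those links, find the .pdf hrefs on each product page the same way, and write each response's .content to a file.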
Hi, I have successfully scraped all the pages of a few shopping websites using Python and regular expressions.
But now I am stuck trying to scrape all the pages of a particular website where the next page's follow-up link is not present on the current page, like this one: http://www.jabong.com/men/clothing/mens-jeans/
This website loads the next page's data into the same page dynamically via AJAX calls, so while scraping I am only able to get the data from the first page. But I need to scrape all the items on every page of the website.
I cannot see a way to get the source code of all the pages of this type of website, where the next page's follow-up link is not available on the current page. Please help me with this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down, the page detects when they reach the end of the current set of results and loads the next set as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages till you find a page with no results.
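That loop can be sketched with the HTTP call factored out; the stub fetcher below stands in for a real requests.get(url, params={"page": n}) plus item extraction:

```python
def scrape_all_pages(fetch_page):
    """Request page 1, 2, 3, ... until a page returns no items.
    `fetch_page(n)` stands in for fetching ?page=n and extracting
    the result items from the response."""
    items, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:  # an empty page means we've run out of results
            break
        items.extend(batch)
        page += 1
    return items

# Stubbed fetcher for demonstration: three pages of results, then nothing.
fake_site = {1: ["jean-a", "jean-b"], 2: ["jean-c"], 3: ["jean-d"]}
print(scrape_all_pages(lambda n: fake_site.get(n, [])))
# → ['jean-a', 'jean-b', 'jean-c', 'jean-d']
```

Factoring the fetch out like this also makes it easy to add politeness delays or retries around the real HTTP call without touching the loop.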
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. The key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.