Web scraping multi-page issue - Python

Hello, I am trying to scrape the following link with bs4 in Python: "https://eprocure.gov.in/eprocure/app;jsessionid=9AD8A7A17E1B2868527E25799DBE45A2.eprocgep2?page=FrontEndLatestActiveTenders&service=page". For the first page everything seems to be OK, but when I navigate to the next page the URL pattern changes completely. Here is the next page's URL pattern: "https://eprocure.gov.in/eprocure/app?component=%24TablePages.linkPage&page=FrontEndLatestActiveTenders&service=direct&session=T&sp=AFrontEndLatestActiveTenders%2Ctable&sp=2". Because of this pattern change I cannot automate the scraping process for every page. And when I try to scrape the second page manually, the soup object cannot find any of the tags, even though the network inspector shows those tags for the second page. Can anyone solve this and scrape all of the pages? Please share your solution.
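The site appears to run on Apache Tapestry, which keeps pagination state in the server-side session, so the jsessionid cookie from the first request has to accompany the later ones; a `requests.Session` handles that automatically. A rough sketch (the URL templates are copied from the question above; the `table tr` selector is an assumption to be checked against the real markup):

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://eprocure.gov.in/eprocure/app"
FIRST_PAGE = BASE + "?page=FrontEndLatestActiveTenders&service=page"

def page_url(page_number):
    """Tapestry pagination URL for a given page; page 1 is the entry page."""
    if page_number == 1:
        return FIRST_PAGE
    return (BASE
            + "?component=%24TablePages.linkPage"
            + "&page=FrontEndLatestActiveTenders&service=direct&session=T"
            + "&sp=AFrontEndLatestActiveTenders%2Ctable"
            + "&sp=" + str(page_number))

def scrape_pages(max_pages=3):
    session = requests.Session()  # keeps the jsessionid cookie between pages
    rows = []
    for n in range(1, max_pages + 1):
        resp = session.get(page_url(n), timeout=30)
        soup = BeautifulSoup(resp.text, "html.parser")
        # "table tr" is a guess -- inspect the tender table's actual markup
        rows.extend(tr.get_text(" ", strip=True) for tr in soup.select("table tr"))
    return rows
```

Fetching the sp=2 URL in a fresh session fails precisely because the server has no matching session state, which would explain the empty soup on page two.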

Related

Selenium project to Requests

I’ve developed a web scraper that extracts reviews from a particular shopping website. It’s written in Python and uses Selenium + BS4. But my client thinks it’s TOO SLOW and wants it rewritten to use Requests. To scrape the reviews, I have to wait until the reviews show up (or click a review tab) and then page through every review. I’m guessing the review div is loaded by an XHR/AJAX call, because the whole page doesn’t reload when I click the next page. All the parsing is done with BeautifulSoup.
I’m leaving a URL so you can all go and check:
https://smartstore.naver.com/hoskus/products/4351439834?NaPm=ct%3Dkeuq83q8%7Cci%3Df5e8bd34633b9b48d81db83b289f1b2e0512d2f0%7Ctr%3Dslsl%7Csn%3D888653%7Chk%3D9822b0c3e9b322fa2d874575218c223ce2454a42
I’ve always thought Requests reads the HTML far faster than Selenium does, but I don’t know how to obtain the HTML when it’s all hidden behind buttons. Does anybody have an idea, or something I can refer to?
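The usual trick is to open the browser's Network tab, filter by XHR/Fetch, click the review tab, and copy the request the page makes, then replay it with Requests and page through. A sketch of that pattern, where the endpoint URL and payload shape are hypothetical placeholders to be replaced with whatever DevTools actually shows for the Naver store:

```python
import requests

REVIEW_API = "https://example.com/api/v1/reviews"  # HYPOTHETICAL endpoint

def review_payload(product_id, page, page_size=20):
    """Request body for one page of reviews (field names are assumptions)."""
    return {"productId": product_id, "page": page, "pageSize": page_size}

def fetch_all_reviews(product_id):
    session = requests.Session()
    # many sites reject requests' default User-Agent, so mimic a browser
    session.headers["User-Agent"] = "Mozilla/5.0"
    reviews, page = [], 1
    while True:
        resp = session.post(REVIEW_API,
                            json=review_payload(product_id, page),
                            timeout=30)
        batch = resp.json().get("reviews", [])
        if not batch:          # an empty page means we've paged past the end
            break
        reviews.extend(batch)
        page += 1
    return reviews
```

Since the XHR usually returns JSON, you often don't need BeautifulSoup at all for this part, which is where most of the speedup over Selenium comes from.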

Getting URL of an XHR request in Python?

I'm working on a web scraping project for scraping products off a website. I've been doing projects like this for a few months with pretty good success, but this most recent website is giving me some trouble. This is the website I'm scraping: https://www.phoenixcontact.com/online/portal/us?1dmy&urile=wcm%3apath%3a/usen/web/home. Here's an example of a product page: https://www.phoenixcontact.com/online/portal/us/?uri=pxc-oc-itemdetail:pid=3248125&library=usen&pcck=P-15-11-08-02-05&tab=1&selectedCategory=ALL. I have a program that lets me navigate to each product page and extract a majority of the information using BeautifulSoup.
The place I run into issues is trying to get the product numbers of all the products under the "Accessories" tab. I tried to use Selenium rather than Beautiful Soup to pull up the page and actually click through the Accessories pages, but the website throws a 403 error if you try to update the page by clicking the page numbers or arrows, or by changing the displayed number of products. The buttons themselves don't have an actual link; their href is "#", which takes you back to the top of the section after the list updates. I have found that the request URL of the XHR request fired when you click one of those page links leads to a page with the product information. From there I can make slight changes to the site= and itemsPerPage= parts of the URL and scrape the information pretty easily.
I am scraping 30,000 of these product pages and each one has a different request URL for the XHR request, but there's no recognizable relationship between the page URL and the request URL. Any ideas on how to get the XHR request URL from each page?
I'm pretty fluent in Selenium and Beautiful soup, but any other web scraping packages are unfamiliar and would warrant a little extra explanation.
EDIT: This shows what happens when I try to use Selenium to navigate through the pages. The product list doesn't change, and it gives that error. Selenium Error
This shows the XHR request that I've found. I just need a way to retrieve that URL to give to Beautiful Soup. XHR Request
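One option for retrieving that URL programmatically is selenium-wire, a wrapper around Selenium that records every request the browser makes in `driver.requests`, so the XHR URL can be pulled out after loading the page. A sketch, assuming selenium-wire and a Chrome driver are installed; the `"itemdetail"` keyword is only a guess at what distinguishes the accessories XHR, so check the real URL in DevTools first:

```python
def matching_urls(urls, keyword):
    """Keep only captured request URLs that contain the keyword."""
    return [u for u in urls if keyword in u]

def capture_xhr_urls(page_url, keyword="itemdetail"):
    # selenium-wire records every request the browser makes in driver.requests
    from seleniumwire import webdriver  # pip install selenium-wire
    driver = webdriver.Chrome()
    try:
        driver.get(page_url)
        # ...click through the Accessories pagination with normal Selenium
        # calls here, then inspect everything the browser requested:
        return matching_urls([req.url for req in driver.requests], keyword)
    finally:
        driver.quit()
```

Once the request URL is captured for one product, the site= and itemsPerPage= tweaks described above can be applied with plain Requests, without Selenium in the loop for all 30,000 pages.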

How to loop through each page of website for web scraping with BeautifulSoup

I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to iteratively update the url to scrape each page. I am new to Python and have looked at a few different solutions to similar questions, but have not figured out how to apply them to my particular url. I think I need to iteratively update the url or somehow click the next button and then loop my existing code through each page. I appreciate any solutions.
url: https://jobs.utcaerospacesystems.com/search-jobs
First, BeautifulSoup has nothing to do with GETting web pages: you fetch the page yourself, then feed it to bs4 for parsing.
The problem with the page you linked is that it's javascript - it only renders correctly in a browser (or any other javascript VM).
#Fabricator is on the right track - you'll need to watch the developer console and see what AJAX requests the JS is sending to the server. In this case, also take a look at the query-string params, which include one called CurrentPage - that's probably the one you want to focus on.
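A sketch of that approach: replay the AJAX request with Requests and increment CurrentPage until a page comes back empty. The endpoint path and CSS selector below are assumptions to be replaced with what the Network tab actually shows:

```python
import requests
from bs4 import BeautifulSoup

# assumed AJAX path -- copy the real one from the developer console
SEARCH_URL = "https://jobs.utcaerospacesystems.com/search-jobs/results"

def page_params(page):
    """Query-string parameters for one page of results."""
    return {"CurrentPage": page}

def scrape_all_jobs(max_pages=50):
    jobs = []
    with requests.Session() as session:
        for page in range(1, max_pages + 1):
            resp = session.get(SEARCH_URL, params=page_params(page), timeout=30)
            soup = BeautifulSoup(resp.text, "html.parser")
            # selector is a placeholder -- inspect the job-list markup
            postings = soup.select("section#search-results-list a")
            if not postings:   # empty page => no more results
                break
            jobs.extend(a.get_text(" ", strip=True) for a in postings)
    return jobs
```

The existing first-page parsing code can then run inside the loop body instead of the placeholder selector.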

Scrape and Get main content of any web site

I want to scrape a web page and get the title and main content of any web site. I saw this: if you copy any article URL (for example http://en-maktoob.news.yahoo.com/pakistani-army-fuels-anger-securing-swat-taliban-025337458.html) into the textbox and press Enter, the web page extracts the title and article and summarizes it. It works for most websites. I want to know how it works for any website without site-specific HTML tag parsing. How do you get the main article of each webpage?
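Generic extractors (readability-lxml, newspaper3k, trafilatura) do this with text-density heuristics rather than per-site tag parsing: score candidate blocks by how much plain text they contain relative to link text, and keep the densest one. A minimal sketch of that idea with BeautifulSoup (the scoring weights are illustrative, not from any of those libraries):

```python
from bs4 import BeautifulSoup

def main_content(html):
    """Return (title, best-guess main text) using a simple density score."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    best, best_score = "", 0.0
    for tag in soup.find_all(["article", "div", "section", "p"]):
        text = tag.get_text(" ", strip=True)
        link_text = " ".join(a.get_text(" ", strip=True)
                             for a in tag.find_all("a"))
        # penalise link-heavy blocks (navigation, sidebars, footers)
        score = len(text) - 2 * len(link_text)
        if score > best_score:
            best, best_score = text, score
    return title, best
```

Real extractors add many more signals (tag names like `<article>`, class-name hints, punctuation density), but the core "most text, fewest links" idea is the same.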

Scrape all the pages of a website when next page's follow-up link is not available in the current page source code

Hi, I have successfully scraped all the pages of a few shopping websites using Python and regular expressions.
But now I am stuck scraping all the pages of a particular website where the next page's follow-up link is not present in the current page, like this one: http://www.jabong.com/men/clothing/mens-jeans/
This website loads the next pages' data into the same page dynamically via AJAX calls, so I am only able to scrape the data of the first page. But I need to scrape all the items on all pages of that website.
I cannot find a way to get the source code of all the pages of this type of website, where the next page's follow-up link is not available in the current page. Please help me with this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down, the page detects when they reach the end of the current set of results and loads the next set as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages till you find a page with no results.
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. The key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.
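A sketch of the keep-requesting-until-empty approach described above, using the ?page=N pattern observed in the proxy; the item selector is a placeholder to be replaced after inspecting the page source:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.jabong.com/men/clothing/mens-jeans/"

def page_url(page):
    """URL for page N of the listing (page 1 has no query string)."""
    return BASE_URL if page == 1 else BASE_URL + "?page=" + str(page)

def scrape_all_items(item_selector="div.product", max_pages=100):
    items = []
    with requests.Session() as session:
        for page in range(1, max_pages + 1):
            resp = session.get(page_url(page), timeout=30)
            soup = BeautifulSoup(resp.text, "html.parser")
            found = soup.select(item_selector)
            if not found:      # a page with no results means we're done
                break
            items.extend(el.get_text(" ", strip=True) for el in found)
    return items
```

The max_pages cap is just a safety net so a selector typo (which would match nothing, or match on every page) can't turn into an infinite loop.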