Selenium project to Requests - python

I’ve developed a web scraper that extracts reviews from a particular shopping website. It’s written in Python, and the scraping is based on Selenium + BS4. But my client thinks it’s TOO SLOW and wants it rewritten to use Requests. To scrape the reviews, I have to wait until the reviews show up (or click a review tab) and then page through every review. I’m guessing the review div is loaded via XHR/AJAX, because the whole page doesn’t reload when I click the next page. All the parsing is done with BeautifulSoup.
I’m leaving a URL so you can go and check:
https://smartstore.naver.com/hoskus/products/4351439834?NaPm=ct%3Dkeuq83q8%7Cci%3Df5e8bd34633b9b48d81db83b289f1b2e0512d2f0%7Ctr%3Dslsl%7Csn%3D888653%7Chk%3D9822b0c3e9b322fa2d874575218c223ce2454a42
I’ve always heard that Requests can fetch the HTML far faster than Selenium, but I don’t know how to obtain the HTML when it’s all hidden behind buttons. Does anybody have an idea, or something I can refer to?
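Once the review XHR is located in the browser's Network tab (filter by XHR, click the next-page button, and watch what fires), the Requests side can look something like this minimal sketch. Note that the endpoint URL and parameter names below are placeholders, not the site's real ones; copy the actual request that DevTools shows you:

```python
import requests

def review_params(product_id: str, page: int, page_size: int = 20) -> dict:
    """Build query parameters for one page of reviews.
    The parameter names here are placeholders -- use the real ones from DevTools."""
    return {"productId": product_id, "page": page, "pageSize": page_size}

def fetch_review_page(url: str, product_id: str, page: int) -> dict:
    """GET one page of reviews from the XHR endpoint found in DevTools."""
    resp = requests.get(url, params=review_params(product_id, page),
                        headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    return resp.json()
```

You would then loop `page` upward until the endpoint stops returning reviews, feeding each JSON response straight to your existing parsing code instead of BeautifulSoup.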

Related

Scraping a website whose URL doesn't change when clicking the "next page" button

I'm trying to scrape a BBC website
https://www.bbc.com/news/topics/c95yz8vxvy8t/hong-kong-anti-government-protests
and I would like to get all the news articles. But the URL doesn't change when clicking on the next page button, so I can only get the first page's information. Can anyone help? I'm using Selenium, but I'm familiar with Requests too. Thanks!
Use the developer console in your browser, go to the Network tab, and disable the cache.
You can see API requests being made for each page change. You don't need Selenium; you can just use Requests or aiohttp.
This is an example:
https://push.api.bbci.co.uk/batch?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged%2Fabout%2Fd5803bfc-472d-4abf-b334-d3fc4aa8ebf9%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro%2FpageNumber%2F2%2Fversion%2F1.5.6?timeout=5
Type "batch" in the filter bar and you should see only the API calls I believe to be responsible for the page change.
You can find the topic's about id (d5803bfc-472d-4abf-b334-d3fc4aa8ebf9) in the webpage source; in this case, https://www.bbc.com/news/topics/c95yz8vxvy8t/hong-kong-anti-government-protests
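Putting the pieces above together, a sketch might template the observed batch URL on the about id and page number. The URL structure is copied verbatim from the example request above; the shape of the JSON response is not verified here:

```python
import requests

# The batch URL observed in the Network tab, with the about id and page
# number templated out; everything else is copied from the example request.
BASE = ("https://push.api.bbci.co.uk/batch?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged"
        "%2Fabout%2F{about_id}%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro"
        "%2FpageNumber%2F{page}%2Fversion%2F1.5.6?timeout=5")

def batch_url(about_id: str, page: int) -> str:
    """Build the batch API URL for a given topic id and page number."""
    return BASE.format(about_id=about_id, page=page)

def fetch_page(about_id: str, page: int) -> dict:
    """Fetch one page of articles as JSON."""
    resp = requests.get(batch_url(about_id, page))
    resp.raise_for_status()
    return resp.json()
```

Looping `page` from 1 upward then replaces clicking the next-page button entirely.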

How to approach web-scraping in python

I am new to Python and have just started on web scraping. I have to scrape data from this realtor site.
I need to scrape all the details of real-estate agents according to their real-estate agency.
For this, in the web browser I have to follow these steps:
Go to the site.
Click on the agency offices button, enter the 4000 pin in the search box, and submit.
Then we get the list of agencies.
Go to the Our Team tab, where we get the agents.
Then go to each agent's page and record their information.
Can anyone tell me how to approach this? What's the best way to build this type of scraper?
Do I have to use Selenium for the interaction with the pages?
I have worked with Requests, BeautifulSoup, and simple form submission using mechanize.
For a search-driven site like this, I would recommend either Selenium or Requests with sessions. The advantage of Selenium is that it will probably work; the disadvantage is that it will be slow. For Selenium you can just use the Selenium IDE (a Firefox add-on) to record what you do, then get the HTML from the webpage and use BeautifulSoup to parse the data.
If you want to scrape the data quickly and without using many resources, I usually use Requests with sessions. To scrape a website like this, open a modern web browser (Firefox, Chrome) and use its network tools (usually located in the developer tools, or via right-click > Inspect Element). Once you are recording the network traffic, interact with the webpage to see the connections made to the server. For an example search, they may use a suggestions endpoint, e.g.:
https://suggest.example.com.au/smart-suggest?query=4000&n=7&regions=false
The response will probably be JSON containing the suggested results. Once you select a suggestion, you can just submit a request with those search parameters, e.g.:
https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000
The URLs for the agents will then be in that HTML page; you just need to send a separate request to each page and extract the information with BeautifulSoup.
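The session-based flow described above could be sketched roughly like this. The example.com.au endpoints are the illustrative placeholders from this answer, and the `a.agent-profile` selector is invented; both need to be replaced with what you actually observe in the network tools:

```python
import requests
from bs4 import BeautifulSoup

# One session for the whole flow, so cookies persist across requests.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

def suggest(query: str):
    """Hit the (illustrative) suggestion endpoint from the example above."""
    resp = session.get("https://suggest.example.com.au/smart-suggest",
                       params={"query": query, "n": 7, "regions": "false"})
    resp.raise_for_status()
    return resp.json()

def agent_links(html: str):
    """Pull agent profile links out of a results page's HTML.
    The CSS selector is a guess -- inspect the real page for the right one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.agent-profile")]
```

Each link returned by `agent_links` would then get its own `session.get` and a BeautifulSoup pass to record the agent's details.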
You might want to give Node and jQuery a try. I used to use Python all the time, but it gets messy and hard to maintain after a while.
Using Node, you can turn the page HTML into a DOM object and then scrape all the data very easily using jQuery. I have done this for IMDb here: "Using jQuery & NodeJS to scrape the web" by @asimmittal, https://medium.com/@asimmittal/using-jquery-nodejs-to-scrape-the-web-9bb5d439413b
You can modify this to scrape Yelp.

How to loop through each page of website for web scraping with BeautifulSoup

I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to iteratively update the URL to scrape each page. I am new to Python and have looked at a few different solutions to similar questions, but have not figured out how to apply them to my particular URL. I think I need to iteratively update the URL, or somehow click the next button and then loop my existing code through each page. I appreciate any solutions.
url: https://jobs.utcaerospacesystems.com/search-jobs
First, BeautifulSoup doesn't have anything to do with GETting web pages; you fetch the webpage yourself, then feed it to bs4 for processing.
The problem with the page you linked is that it's rendered by JavaScript: it only renders correctly in a browser (or any other JavaScript VM).
@Fabricator is on the right track: you'll need to watch the developer console and see what AJAX requests the JS is sending to the server. In this case, also take a look at the query string params, which include one called CurrentPage; that's probably the one you want to focus on.
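A minimal sketch of the approach above: walk the CurrentPage parameter until a page comes back with no postings. The CurrentPage name comes from the observed query string; the CSS selector is a placeholder and needs to be taken from the real response:

```python
import requests
from bs4 import BeautifulSoup

def extract_titles(html: str):
    """Pull posting titles out of one page of results.
    The selector is a placeholder -- inspect the real response for the right one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("section#search-results a")]

def scrape_all_pages(base_url: str = "https://jobs.utcaerospacesystems.com/search-jobs"):
    """Walk CurrentPage=1,2,3,... until a page has no postings."""
    titles, page = [], 1
    while True:
        resp = requests.get(base_url, params={"CurrentPage": page})
        resp.raise_for_status()
        found = extract_titles(resp.text)
        if not found:
            break
        titles.extend(found)
        page += 1
    return titles
```

If the postings are instead delivered by a separate AJAX endpoint (as the answer suggests they may be), point `base_url` at that endpoint rather than the HTML page.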

Python scraping using inspect element or firebug

As I was going through this YouTube scraping tutorial, https://www.youtube.com/watch?v=qbEN3boz7_M, I learned that instead of scraping the "public" page loaded heavily with all the other stuff, there is a way to find a "private" page that serves the necessary information much more efficiently, using Inspect Element/Firebug:
Google Chrome > Inspect Element > Network > XHR
The person in the YouTube video uses stock prices as an example and is able to locate a "private" page that can be scraped much more quickly and is less intensive on the server. But when I tried the sites I wanted to scrape, for example http://www.rottentomatoes.com/m/grigris/, going through Inspect Element (Chrome) > Network > XHR and checking the headers' request URL and preview, I didn't seem to find anything useful.
Am I missing something? How can I tell whether raw or condensed information is hidden somewhere? Using the Rotten Tomatoes page as an example, how can I tell if there is 1) a "private" page that gives the title and year of the movie, and 2) a summary page (in a CSV-like format) that "stores" all the movies' titles and years in one place?
You can only find XHR requests if the page is dynamically loading data. In your example, the only thing of note is this URL:
http://www.rottentomatoes.com/api/private/v1.0/users/current/ratings/771355871
which contains some information about the movie in JSON:
{"media":{"type":"movie","id":771355871,"title":"Grigris","url":"http://www.rottentomatoes.com/m/grigris/","year":2014,"mpaa":"Unrated","runtime":"1 hr. 40 min.","synopsis":"Despite a bum leg, 25-year-old Grigris has hopes of becoming a professional dancer, making some extra cash putting his killer moves to good use on the...","thumbnail":"http://content6.flixster.com/movie/11/17/21/11172196_mob.jpg","cast":[{"name":"Souleymane Démé","id":"771446344"},{"name":"Anaïs Monory","id":"771446153"}]}}
Make sure you have the Chrome developer tools open when you load the site. If not, the developer tools won't capture any requests. You can open them and refresh the page; then you should see the requests under the XHR filter.
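To pull the two fields you asked about out of the JSON shown above, a sketch could look like this. `fetch_movie` assumes the endpoint still responds to a plain GET, which is not guaranteed:

```python
import requests

def movie_summary(payload: dict) -> dict:
    """Extract the title and year from the JSON structure shown above."""
    media = payload["media"]
    return {"title": media["title"], "year": media["year"]}

def fetch_movie(api_url: str) -> dict:
    """GET the private endpoint (found via Network > XHR) and summarize it."""
    resp = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    return movie_summary(resp.json())
```

There is no guarantee of a single summary page listing every movie, though; each of these endpoints is keyed by the movie's internal id, which you find on its public page.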

Scrape all the pages of a website when next page's follow-up link is not available in the current page source code

Hi, I have successfully scraped all the pages of a few shopping websites using Python and regular expressions.
But now I'm having trouble scraping all the pages of a particular website where the next page's follow-up link is not present in the current page, like this one: http://www.jabong.com/men/clothing/mens-jeans/
This website loads the next pages' data into the same page dynamically via AJAX calls, so while scraping I am only able to get the data of the first page. But I need to scrape all the items present on all pages of that website.
I can't find a way to get the source code of all the pages of this type of website, where the next page's follow-up link is not available in the current page. Please help me with this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down, the page detects when they reach the end of the current set of results and loads the next set as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages until you find a page with no results.
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. The key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.
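The keep-requesting-until-empty approach above might be sketched like this. The `?page=N` parameter is the one observed in the example URLs; the `div.product` selector is a placeholder to be replaced after inspecting the real `?page=2` response:

```python
import requests
from bs4 import BeautifulSoup

def extract_items(html: str):
    """Pull product entries out of one page. div.product is a placeholder
    selector -- inspect the ?page=2 response to find the real wrapper."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.select("div.product")]

def scrape_all_items(base_url: str = "http://www.jabong.com/men/clothing/mens-jeans/"):
    """Request ?page=1, ?page=2, ... until a page comes back with no items."""
    items, page = [], 1
    while True:
        resp = requests.get(base_url, params={"page": page})
        resp.raise_for_status()
        found = extract_items(resp.text)
        if not found:
            break
        items.extend(found)
        page += 1
    return items
```

This mimics exactly the requests the infinite-scroll JavaScript would have made, without needing a browser.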
