Navigate through all the search results pages with BeautifulSoup - python

I can't seem to grasp this.
How can I make BeautifulSoup parse every page by following the "Next page" link up to the last page, and stop parsing when no "Next page" link is found? The site in question is the one whose truncated ...ru/moskva/transport URLs appear in the answer below.
I try looking for the element name of the Next button and use find to locate it, but I don't know how to make this repeat until all pages are scraped.
Thank you

BeautifulSoup only gives you the tools; how to go about navigating pages is something you need to work out yourself, in a flow-diagram sense.
Taking the page you mentioned and clicking through a few of the pages, it seems that when we are on page 1, no page parameter is shown in the URL:
htt...ru/moskva/transport
and we see in the source of the page:
<div class="pagination-pages clearfix">
<span class="pagination-page pagination-page_current">1</span>
<a class="pagination-page" href="/moskva/transport?p=2">2</a>
Let's check what happens when we go to page 2:
ht...ru/moskva/transport?p=2
<div class="pagination-pages clearfix">
<a class="pagination-page" href="/moskva/transport">1</a>
<span class="pagination-page pagination-page_current">2</span>
<a class="pagination-page" href="/moskva/transport?p=3">3</a>
Perfect, now we have the layout. One more thing to check before we make our soup: what happens when we go to a page past the last available page, which at the time of this writing was 40161?
ht...ru/moskva/transport?p=40161
We change this to:
ht...ru/moskva/transport?p=40162
The page seems to go back to page 1 automatically. Great!
So now we have everything we need to make our soup loop.
Instead of clicking Next each time, just construct the URL yourself; you know the pieces required:
url = ht...ru/moskva/$searchterm?p=$pagenum
I'm assuming transport is the search term (I don't know, I can't read Russian), but you get the idea: construct the URL, then do a requests call:
request = requests.get(url)
mysoup = bs4.BeautifulSoup(request.text, 'html.parser')
And now you can wrap that whole thing in a while loop, and each time except the first time check:
mysoup.select('.pagination-page_current')[0].text == '1'
This says: each time we get the page, find the currently selected page by its class pagination-page_current; select returns a list, so we take the first element with [0], get its text with .text, and check whether it equals '1'.
This should only be true in two cases: the first page you request, and when you have gone past the last page (which loops back to page 1). So you can use this to stop the loop, or handle it however you want.
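Putting that together, a minimal sketch might look like this (the host in base_url is truncated above, so treat it as a placeholder you need to fill in yourself):

import bs4
import requests

base_url = 'ht...ru/moskva/transport'  # placeholder: fill in the real host from the URLs above
page = 1
while True:
    url = base_url if page == 1 else '{}?p={}'.format(base_url, page)
    request = requests.get(url)
    mysoup = bs4.BeautifulSoup(request.text, 'html.parser')
    current = mysoup.select('.pagination-page_current')[0].text
    if page > 1 and current == '1':
        break  # we wrapped around past the last page, so we are done
    # ... scrape the listings on this page here ...
    page += 1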
This should be everything you need to do this properly. :)

BeautifulSoup by itself does not load pages. You need to use something like requests, scrape the URL you want to follow, load it and pass its content to another BS4 soup.
import requests
from bs4 import BeautifulSoup

# Scrape your url
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # You can now scrape the new page
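For the question's "keep following Next page until there is none" loop, a rough sketch might look like this (the start URL and the way the next-page link is located are assumptions; adapt them to the real markup):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://example.com/search'  # hypothetical starting page
while url:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # ... scrape the current page here ...
    next_link = soup.find('a', string='Next page')  # assumed link text; adjust to the actual site
    url = urljoin(url, next_link['href']) if next_link else None  # stop when there is no "Next page" link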

Related

Cannot find the text I want to scrape in the Page Source

Simple question: why is it that when I inspect element I see the data I want embedded within the JS tags, but when I go directly to the page source I do not see it at all?
As an example, basically I am looking to get the description of the eBay listing. In this case, the text in the body of the listing that reads "BRAND NEW Factory Sealed
Playstation 5 (PS5) Bluray Disc System Console [...]
We usually ship within 24 hours of purchase."
Sample code below. If I search for the text within the printout, I cannot find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/272037717929'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
It's probably because eBay is using JavaScript to load content into the page. A workaround for this problem would be to use something like Playwright or Selenium. I personally prefer the first option. It uses a Chromium browser to actually get the page contents, and therefore runs the JavaScript in the process.
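A minimal Playwright sketch of that idea (the URL is the one from the question; the rest is a generic pattern, not anything eBay-specific):

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://www.ebay.com/itm/272037717929')
    html = page.content()  # HTML after the page's JavaScript has run
    browser.close()

soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())  # the listing description should now be searchable in this output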

Load entire html page in python

I need to store an entire HTML page in a str variable.
I'm doing this:
import requests
from bs4 import BeautifulSoup
url = my_url
response = requests.get(url)
page = str(BeautifulSoup(response.content, 'html.parser'))
This works, but the page at my_url is not "complete". It is a website where, as you scroll towards the end, new things load, and I need the whole page, not only the part that is visible at first.
Is there a way to load the entire page and then store it?
I also tried loading the page manually and then looking at the source code, but the final part of the page is still not visible there.
Alternatively, all I really want from the my_url page are the links inside it, and all of them look like:
my_url/something/first-post
my_url/something/second-post
Is there another way to find all the links? That is, all the possible URLs that start with "my_url/something/"?
Thanks in advance
I think you should use Selenium and then scroll down with it to get the entire page.
As far as I know, requests can't handle dynamic pages.
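A rough Selenium sketch of that scroll-until-the-end idea (my_url is the placeholder from the question; the Chrome driver and the fixed sleep are assumptions):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(my_url)  # my_url is the placeholder from the question

# keep scrolling until the page height stops growing, i.e. nothing new loads
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # crude wait for the newly loaded content
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

page = str(BeautifulSoup(driver.page_source, 'html.parser'))
driver.quit()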
For the alternative option, you can find the <a> tags via find_all:
links = soup.find_all('a')
To keep only the ones starting with your prefix, you can use the following:
result = [link['href'] for link in links if link.get('href', '').startswith('my_url/something/')]

Webscraping Multi page issue

Hello, I am trying to scrape the following link with bs4 in Python: "https://eprocure.gov.in/eprocure/app;jsessionid=9AD8A7A17E1B2868527E25799DBE45A2.eprocgep2?page=FrontEndLatestActiveTenders&service=page". For the first page everything seems to be OK. But when I navigate to the next page, the URL pattern changes completely. Here is the next page URL pattern: "https://eprocure.gov.in/eprocure/app?component=%24TablePages.linkPage&page=FrontEndLatestActiveTenders&service=direct&session=T&sp=AFrontEndLatestActiveTenders%2Ctable&sp=2".
Because of the pattern change I cannot automate the scraping for every page. And when I try to scrape the second page manually, the soup object cannot fetch any of the tags, even though the network inspector shows those tags for the second page. Can anyone solve the issue and scrape all of the pages? Please share your solution.

Web scraping for divs inserted by scripts

Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to look at a URL and extract all divs with a particular class. However, the result is always empty, even though I can see the divs when I "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (by their class name) using BeautifulSoup? I eventually want to read and follow the hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested in getting all the divs with class 'product-list-item'.
Try using Selenium to run the JavaScript:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
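Once you point the driver at the H&M page instead, you can hand the rendered HTML to BeautifulSoup and pull out the divs and their hrefs; a small sketch (the class name product-list-item is the one from the question):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', class_='product-list-item'):  # class name from the question
    for a in div.find_all('a', href=True):
        print(a['href'])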
Check the request that can be found in Chrome dev tools > Network; you can get all the info by changing that URL.
The reason you got nothing from that specific URL is simply that the info you need is not there.
So first let me explain a little bit about how that page is loaded in a browser. When you request that page (http://www.hm.com/sg/products/ladies), the literal content is returned in the very first phase (which is what you got from your urllib2 request). Then the browser starts to read and parse that content; basically it tells the browser where to find everything it needs to render the whole page (e.g. CSS to control layout, additional JavaScript/URLs/pages to populate certain areas, etc.), and the browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded, and the info you want is not in the original URL, so you need to find out which URL is used to populate those areas and go after that specific URL instead.
So now we need to find out what happens behind the scenes, and a tool is needed to capture all traffic when that page loads (I would recommend Fiddler).
As you can see, lots of things happen when you open that page in a browser (and that's only part of the whole page-loading process). So by educated guess, the info you need should be in one of those three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup; the built-in json module can do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. The requests library is a great tool for this kind of job.
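A sketch of that approach with requests and the json module (the endpoint URL is a placeholder; the real one has to be copied from the captured api.hm.com traffic):

import json
import requests

api_url = 'PASTE_THE_API_HM_COM_URL_FROM_THE_CAPTURED_TRAFFIC'  # placeholder, not a real endpoint

response = requests.get(api_url)
data = response.json()  # the response is already JSON, so no HTML parsing is needed
print(json.dumps(data, indent=2))  # inspect the structure, then pick out the fields you want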
Try this one:
from bs4 import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(), 'lxml')
scrapdiv = open('scrapdiv.txt', 'w')
product_lists = soup.findAll("div", {"class": "o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()

How to get the href and associated information using scrapy?

I'm new to Scrapy but have been using Python for a while. I took lessons from the Scrapy docs along with the XPath selectors. Now, I would like to turn that knowledge into a small project. I'm trying to scrape the job links and the associated info like job title, location, emails (if any), and phone numbers (if any) from the job board https://www.germanystartupjobs.com/ using Scrapy.
I have this starter code,
import scrapy


class GermanSpider(scrapy.Spider):
    # spider name
    name = 'germany'

    # the first page of the website
    start_urls = ['https://www.germanystartupjobs.com/']
    print start_urls

    def parse(self, response):
        pass

    def parse_detail(self, response):
        pass
and I will run the spider with scrapy runspider germany.
Inside the parse function, I would like to get the hrefs and details inside the parse_detail function.
When I open the mentioned page with Chrome developer tools and inspect the listed jobs, I see that all the jobs are inside this ul:
<ul id="job-listing-view" class="job_listings job-listings-table-bordered">
and then the separate jobs are listed in the many inner divs of
<div class="job-info-row-listing-class"> with associated info; for example, the href is provided inside <a href="https://www.germanystartupjobs.com/job/foodpanda-berlin-germany-2-sem-manager-mf/">
Other divs provide the job title, company name, location etc., with markup such as:
<div>
<h4 class="job-title-class">
SEM Manager (m/f) </h4>
</div>
<div class="job-company-name">
<normal>foodpanda<normal> </normal></normal></div>
</div>
<div class="location">
<div class="job-location-class"><i class="glyphicon glyphicon-map-marker"></i>
Berlin, Germany </div>
</div>
The first step will be to get the hrefs in the parse function and then the associated info inside parse_detail using the response. I find that the email and the phone number are only provided when you open the links from the hrefs, but the title and location are provided inside the divs of the current page.
As I mentioned, I have okay programming skills in Python, but I struggle with using XPaths even after that tutorial. How do I find the links and the associated info? Some sample code with a little explanation would help a lot.
I tried using this code:
# firstly
for element in response.css(".job-info-row-listing-class"):
    href = element.xpath('@href').extract()[0]
    print href
    yield scrapy.Request(href, callback=self.parse_detail)

# secondly
values = response.xpath('//div[@class="job-info-row-listing-class"]//a/text()').extract()
for v in values:
    print v

#
values = response.xpath('//ul[@id="job-listing-view"]//div[@class="job-info-row-listing-class"]//a/text()').extract()
They seem to return nothing so far after running the spider with scrapy runspider germany.
You probably won't be able to extract the information on this site that easily, since the actual job listings are loaded via a POST request.
How do you know this?
Type scrapy shell "https://www.germanystartupjobs.com/" in your terminal of choice. (This opens up the, you guessed it, shell, which is highly recommended when first starting to scrape a website. There you can try out functions, XPaths etc.)
In the shell, type view(response). This opens the response scrapy is getting in your default browser.
When the page has finished loading, you should be able to see that there are no job listings. This is because they are loaded through a POST request.
How do we find out what request it is? (I work with Firebug for Firefox; I don't know how it works in Chrome.)
Fire up Firebug (e.g. by right-clicking on an element and clicking Inspect with Firebug). This opens up Firebug, which is essentially like the developer tools in Chrome. I prefer it.
Here you can click the Network-Tab. If there is nothing there, reload the page.
Now you should be able to see the request with which the job listings are loaded.
In this case, the request to https://www.germanystartupjobs.com/jm-ajax/get_listings/ returns a JSON object (click JSON) with the HTML code as part of it.
For your spider this means that you will need to tell Scrapy to fetch this request and process the HTML part of the JSON object in order to be able to apply your XPaths.
You do this by importing the json module at the top of your spider and then doing something along the lines of:
data = json.loads(response.body)
html = data['html']
selector = scrapy.Selector(text=html, type="html")
For example, if you'd like to extract all the URLs from the site and follow them, you'd need to specify the XPath where the URLs are found and yield a new request to each URL. So basically you're telling Scrapy: "Look, here is the URL, now go and follow it."
An example of such an XPath would be:
url = selector.xpath('//a/@href').extract()
So everything in the brackets is your XPath. You don't need to specify the whole path from ul[@id="job-listing-view"]/ or so; you just need to make sure it is an identifiable path. Here, for example, the only a-tags on the site are the ones with the URLs you want.
This is pretty much the basic stuff.
I strongly recommend you play around in the shell until you feel you have the hang of XPaths. Take a site that looks quite easy, without any such background requests, and see if you can find any element you want through XPaths.
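Putting those pieces together, a sketch of the spider could look roughly like this (the endpoint and the 'html' key are the ones mentioned above; the detail XPaths are left for you to fill in):

import json

import scrapy


class GermanSpider(scrapy.Spider):
    name = 'germany'
    # the AJAX endpoint mentioned above, instead of the normal start page
    start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']

    def parse(self, response):
        data = json.loads(response.body)
        selector = scrapy.Selector(text=data['html'], type="html")
        for href in selector.xpath('//a/@href').extract():
            yield scrapy.Request(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # fill in your own XPaths here for the title, location, email, phone, ...
        pass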
