How to go to next page when web crawling using Python - python

I am trying to do web crawling using Python, but I can't figure out how to change pages automatically.
I found the URL pattern, but I don't know how to go to the next page automatically until the last page is reached.
The pattern is:
'http://.../sortBy=helpful&pageNumber=0'
'http://.../sortBy=helpful&pageNumber=1'
'http://.../sortBy=helpful&pageNumber=2'
'http://.../sortBy=helpful&pageNumber=3'
and so on ...
import re
from urllib.parse import urljoin

def review_next_page(page=1):
    list_url = 'https://www.amazon.com/Quest-Nutrition-Protein-Apple-2-12oz/product-reviews/B00U3RGAMW/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber={0}'.format(page)
    list_url = [urljoin(list_url, review_link) for review_link in ???]
    return list_url
I am trying to increase the last number by 1 until it reaches the end.
Should I use a for loop?
Thanks in advance!

Not directly answering the question, but this is something that can be easily and conveniently handled by Scrapy's CrawlSpider class and its link extractors. You can configure which pattern an href should match for the link to be followed. In your case, it would be something like:
Rule(LinkExtractor(allow=r'sortBy=helpful&pageNumber=\d+$'), callback='parse_page')
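For context, a minimal CrawlSpider sketch built around that rule; the spider name, start URL, and the body of parse_page are placeholders, not part of the original answer:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ReviewSpider(CrawlSpider):
    name = 'reviews'  # placeholder name
    # Placeholder start URL following the pageNumber pattern from the question
    start_urls = ['https://www.example.com/product-reviews/?sortBy=helpful&pageNumber=0']

    rules = (
        # Follow every link whose href matches the pagination pattern and
        # hand each fetched page to parse_page
        Rule(LinkExtractor(allow=r'sortBy=helpful&pageNumber=\d+$'),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Extract whatever you need from each page here
        yield {'url': response.url}
With follow=True the spider keeps chasing new pageNumber links it finds on each page, so the pagination stops on its own once no further links match.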

Related

Can Python get a Href link on page one, and then get a paragraph from page 2?

I am fairly new to Python, but I was wondering if I could use Python and its modules to retrieve an href from page 1, and then the first paragraph from page 2.
Q2: Also, how could I scrape the first 10 link hrefs with the same div class on page one, and then scrape the first 10 paragraphs, while looping?
Yes, I believe you should be able to.
Try looking up the requests and beautifulsoup Python modules.
There are two python modules that I would use for this: requests and regular expressions. I would use requests to get the website raw html and then use a regex to get for example your paragraph:
import requests, re
site = requests.get("http://somewebsite.com").text
paragraphs = re.findall(r"<p>(.*?)</p>", site, re.DOTALL)
firstPara = paragraphs[0]
print(firstPara)
The requests line here is self-explanatory, and the regex says: look for the first <p> tag, then the brackets mean return just the .*? part, which matches as few characters (.) as possible (*?) up to the closing </p> tag. Finally, re.DOTALL means that newlines will also be matched as part of the lookup.
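To tie this back to the original two-step question, here is a hedged sketch using requests plus BeautifulSoup instead of a regex; the div class "some-class" and the site URL are placeholders, not values from the question:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "http://somewebsite.com"
page1 = BeautifulSoup(requests.get(base).text, "html.parser")

# Collect the first 10 hrefs from divs that share the same (placeholder) class
links = []
for div in page1.find_all("div", class_="some-class")[:10]:
    a = div.find("a", href=True)
    if a:
        links.append(urljoin(base, a["href"]))

# Visit each linked page and print its first paragraph
for href in links:
    page2 = BeautifulSoup(requests.get(href).text, "html.parser")
    first_para = page2.find("p")
    if first_para:
        print(first_para.get_text(strip=True))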
An alternative to using beautifulsoup would be to use the webbrowser module.
With the webbrowser module you can open in the default webbrowser or even specify a preferred browser to open (using the default is preferable, however, as of course there is no guarantee that the user's preference will match yours!)
So you could open a url like so:
import webbrowser
webbrowser.open_new('https://stackoverflow.com/help/formatting')
or like this:
import webbrowser
a = webbrowser.get('chrome') #target chrome (e.g)
a.open('https://www.stackoverflow.com')
Unfortunately, if you just stick a hashtag (for an anchor) onto the end of a URL, webbrowser doesn't seem to like it. Instead, you should define your anchor in a variable and pass it into a function as a parameter:
def open_anchor(self, anchor):
    """ Open selected anchor in the default webbrowser """
    webbrowser.open(anchor)
There are more webbrowser examples on this page
Hope this helps

Python 3.5.2 web-scraping - list index out of range

I am new to web scraping and trying to scrape all the contents of the Restaurant's Details form so that I can proceed with my further scraping.
import requests
from bs4 import BeautifulSoup
import urllib
url = "https://www.foodpanda.in/restaurants"
r=requests.get(url)
soup=BeautifulSoup(r.content,"html.parser")
print(soup.find_all("Section",class_="js-infscroll-load-more-here")[0])
The problem is with accessing the element at index 0 for soup.find_all("Section", class_="js-infscroll-load-more-here"), because the result is an empty list.
HTML has no notion of an uppercase Section tag, and regardless, even in the page source itself it is section, not Section, with a lowercase s:
section = soup.find_all("section",class_="js-infscroll-load-more-here")[0]
Since there is only one you can also use find:
section = soup.find("section",class_="js-infscroll-load-more-here")
Both of which will find what you are looking for.
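If you want to guard against the page changing again, a small hedged variation of the same fix that avoids the IndexError entirely:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.foodpanda.in/restaurants")
soup = BeautifulSoup(r.content, "html.parser")

# find() returns None when the tag is missing, so there is no empty-list indexing
section = soup.find("section", class_="js-infscroll-load-more-here")
if section is not None:
    print(section)
else:
    print("section not found")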

web-scraping, regex and iteration in python

I have the following url 'http://www.alriyadh.com/file/278?&page=1'
I would like to write a regex to access urls from page=2 till page=12
For example, this url is needed 'http://www.alriyadh.com/file/278?&page=4', but not page = 14
I reckon what will work is a function that iterates over the specified 10 pages to access all the urls within them. I have tried this regex, but it does not work:
'.*?=[2-9]'
My aim is to get the content from those urls using the newspaper package. I simply want this data for my research.
Thanks in advance
This does not require a regex; a simple preset loop will do.
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.alriyadh.com/file/278?&page='

for page in range(2, 13):
    html = requests.get(url + str(page)).text
    soup = bs(html)
Here's a regex to access the proper range (i.e. 2-12):
([2-9]|1[012])
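As a hedged illustration (not part of the original answer), anchoring that alternation to the page parameter keeps digits elsewhere in the URL, such as the 278 in the path, from matching:
import re

# page numbers 2-9 or 10-12, anchored to the end of the URL
pattern = re.compile(r'page=([2-9]|1[012])$')

urls = [
    'http://www.alriyadh.com/file/278?&page=4',   # in range -> matches
    'http://www.alriyadh.com/file/278?&page=14',  # out of range -> no match
]
for u in urls:
    print(u, bool(pattern.search(u)))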
Judging by what you have now, I am unsure that your regex will work as you intend it to. Perhaps I am misinterpreting your regex altogether, but is the '?=' intended to be a lookahead?
Or are you actually searching for a '?' immediately followed by a '=' immediately followed by any number 2-9?
How familiar are you with regexes in general? This particular one seems dangerously vague to produce a meaningful match.
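Since the stated end goal is to pull the content with the newspaper package, here is a hedged sketch of that step, assuming the newspaper3k Article API; whether each listing page parses cleanly depends on the site:
from newspaper import Article

base_url = 'http://www.alriyadh.com/file/278?&page='

for page in range(2, 13):
    # If needed, newspaper can be given a language hint, e.g. Article(url, language='ar')
    article = Article(base_url + str(page))
    article.download()
    article.parse()
    print(article.title)
    # article.text holds the extracted body content for your research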

Creating multiple requests from same method in Scrapy

I am parsing webpages that have a similar structure to this page.
I have the following two functions:
def parse_next(self, response):
    # implementation goes here
    # create Request(the_next_link, callback=parse_next)
    # for link in discovered_links:
    #     create Request(link, callback=parse_link)

def parse_link(self, response):
    pass
I want parse_next() to create a request for the *Next link on the web page. At the same time, I want it to create requests for all the URLs that were discovered on the current page, using parse_link() as the callback. Note that I want parse_next to recursively use itself as a callback, because this seems to me to be the only possible way to generate requests for all the *Next links.
*Next: the link that appears beside all the page numbers on that page.
How am I supposed to solve this problem?
Use a generator function, loop through your links, and then call this on the links that you want to make a request to:
for link in links:
    yield Request(link.url)
Since you are using scrapy, I'm assuming you have link extractors set up.
So, just declare your link extractor as a variable like this:
link_extractor = SgmlLinkExtractor(allow=('.+'))
Then, in the parse function, call the link extractor on the response to get 'the_next_link' and the other links:
links = self.link_extractor.extract_links(response)
Here you go:
http://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained
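Putting the pieces above together, a hedged sketch of how parse_next could yield both kinds of requests; the spider name, start URL, allow pattern, and restrict_css selector are placeholders, not taken from the question's page:
import scrapy
from scrapy.linkextractors import LinkExtractor

class PagingSpider(scrapy.Spider):
    name = 'paging'                              # placeholder name
    start_urls = ['http://example.com/page/1']   # placeholder start URL

    link_extractor = LinkExtractor(allow=r'/item/\d+')      # placeholder pattern for links to scrape
    next_extractor = LinkExtractor(restrict_css='a.next')   # placeholder selector for the "Next" link

    def parse(self, response):
        # Entry point: hand the first page to parse_next
        return self.parse_next(response)

    def parse_next(self, response):
        # Requests for every URL discovered on the current page
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_link)

        # Request for the "Next" page, with parse_next as its own callback
        for link in self.next_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_next)

    def parse_link(self, response):
        pass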

web scraping in python

I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []

while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")

    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass

    # Find the next page to scrape
    prev = soup.find("prev_signature")

    # Check if another page of result exists - if not break from loop
    if prev == None:
        break

    # Get the previous URL
    url = prev.contents[0]

    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data there to go through, and I have no idea whether this is against the website's terms of service, so you would need to check that yourself.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on the site in a short amount of time, slowing down legitimate users' requests. Not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer (see the sketch after this list)
Scrape smartly
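A minimal sketch of that timer idea, matching the urllib2 style used above; the URLs and the one-second delay are arbitrary placeholders:
import time
import urllib2

# Placeholder URLs; substitute whatever pages you are allowed to fetch
urls = ['http://example.com/page{}'.format(i) for i in range(1, 4)]

for url in urls:
    html = urllib2.urlopen(url).read()
    # process html here, then pause before the next request
    time.sleep(1)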
I took a quick glance at that page, and it appears to me that they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be using some sort of REST call. By doing this you lessen the load on their server by only requesting the data you need. It will also be easier for you to actually process the data, because it will be in a nice format.
Edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
What do you mean by not working? An empty list or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
