When I run my scraper it fetches titles and the hrefs to those titles from a webpage. The page has a pagination block in the footer containing six next-page links, which are scraped by the second "print" in my scraper. But at this point I can't make use of these next-page links: I can't find a way to feed them back into the function so that I can grab the titles and hrefs from each next page as well. Sorry for any mistakes I've made, and thanks in advance for taking a look.
import requests
from lxml import html

Page_link = "http://www.wiseowl.co.uk/videos/"

def GrabbingData(url):
    base = "http://www.wiseowl.co.uk"
    response = requests.get(url)
    tree = html.fromstring(response.text)
    title = tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]/a/text()')
    link = tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]/a/@href')
    for i, j in zip(title, link):
        print(i, j)
    pagination = tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//a[@class='woPagingItem' or @class='woPagingNext']/@href")
    for nextp in pagination:
        print(base + nextp)

GrabbingData(Page_link)
You can easily make it a recursive function, like this:
import requests
from lxml import html

Page_link = "http://www.wiseowl.co.uk/videos/"
visited_links = []

def GrabbingData(url):
    base = "http://www.wiseowl.co.uk"
    response = requests.get(url)
    visited_links.append(url)
    tree = html.fromstring(response.text)
    title = tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]//a/text()')
    link = tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]//a/@href')
    for i, j in zip(title, link):
        print(i, j)
    pagination = tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//a[@class='woPagingItem' or @class='woPagingNext']/@href")
    for nextp in pagination:
        url1 = str(base + nextp)
        if url1 not in visited_links:
            # print(url1)
            GrabbingData(url1)

if __name__ == "__main__":
    GrabbingData(Page_link)
Since the HTML on a next-page URL will contain "Back" links, I also added a visited_links list, so you don't revisit pages you have already scraped and don't end up in an infinite loop.
The last part starting with
if __name__ == "__main__":
is commonly used so that a function is called only when the file is run directly (as opposed to being imported).
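For illustration, here is a minimal sketch of how that guard behaves (the module name scraper.py is just an assumed example, not part of the original code):

# scraper.py
def GrabbingData(url):
    print("scraping", url)

if __name__ == "__main__":
    # Runs only when you execute `python scraper.py` directly,
    # not when another module does `import scraper`.
    GrabbingData("http://www.wiseowl.co.uk/videos/")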
I'm new to programming and cannot figure out why this won't loop. It prints and converts the first item exactly how I want, but stops after the first iteration.
from bs4 import BeautifulSoup
import requests
import re
import json

url = 'http://books.toscrape.com/'
page = requests.get(url)
html = BeautifulSoup(page.content, 'html.parser')
section = html.find_all('ol', class_='row')

for books in section:
    # Title Element
    header_element = books.find("article", class_='product_pod')
    title_element = header_element.img
    title = title_element['alt']

    # Price Element
    price_element = books.find(class_='price_color')
    price_str = str(price_element.text)
    price = price_str[1:]

    # Create JSON
    final_results_json = {"Title": title, "Price": price}
    final_result = json.dumps(final_results_json, sort_keys=True, indent=1)

    print(title)
    print(price)
    print()
    print(final_result)
First, clarify what you are looking for: presumably you wish to print the title, price and final_result for every book scraped from books.toscrape.com. The code is working exactly as written, it just doesn't match that expectation. Notice that you are finding all the "ol" tags with class "row", and there is only one such element on the page, so section has a single element and the for loop iterates just once.
How to debug it?
Check the type of section with type(section).
Print section to see what it contains.
Add some print statements inside the for loop to understand what happens on each iteration.
It isn't hard to debug this one.
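For example, a quick check along those lines (just a sketch added here, not part of the original answer) makes the problem visible right away:

section = html.find_all('ol', class_='row')
print(type(section))   # <class 'bs4.element.ResultSet'>
print(len(section))    # 1 -- there is only one <ol class="row"> on the page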
You need to change that line to:
section = html.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
There is only one <ol> in that document.
I think you want
for book in section[0].find_all('li'):
An ol is an ordered list, of which there is only one here; inside that ol there are many li (list item) elements, one per book.
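Putting those suggestions together, a minimal corrected sketch (assuming the same imports, page, and selectors as the question) iterates over the li elements instead of the single ol:

from bs4 import BeautifulSoup
import requests
import json

page = requests.get('http://books.toscrape.com/')
html = BeautifulSoup(page.content, 'html.parser')

# There is only one <ol class="row">; the books are its <li> children.
section = html.find_all('ol', class_='row')
for book in section[0].find_all('li'):
    title = book.find("article", class_='product_pod').img['alt']
    price = book.find(class_='price_color').text[1:]  # strip the currency symbol, as in the question
    print(json.dumps({"Title": title, "Price": price}, sort_keys=True, indent=1))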
I'm building a Scrapy spider that crawls two pages (e.g. PageDucky and PageHorse), and I pass those two pages in the start_urls field.
But for pagination I need to take my URL and concatenate "?page=" onto it, so I can't just pass the entire list.
I already tried a for loop, but without success.
Does anyone know how I can make the pagination work for both pages?
Here is my code for now:
class QuotesSpider(scrapy.Spider):
    name = 'QuotesSpider'
    start_urls = ['https://PageDucky.com', 'https://PageHorse.com']
    categories = []
    count = 1

    def parse(self, response):
        # Get categories
        urli = response.url
        QuotesSpider.categories = urli[urli.find('/browse')+7:].split('/')
        QuotesSpider.categories.pop(0)

        # GET ITEMS PER PAGE AND CALC THE PAGINATION
        items = int(response.xpath(
            '*//div[@id="body"]/div/label[@class="item-count"]/text()').get().replace(' items', ''))
        pages = items / 10

        # CALL THE OTHER DEF TO READ THE PAGE ITSELF
        for i in response.css('div#body div a::attr(href)').getall():
            if i[:5] == '/item':
                yield scrapy.Request('http://mainpage' + i, callback=self.parseobj)

        # HERE IS THE PROBLEM, I TESTED AND WITHOUT FOR LOOP WORKS FOR ONE URL ONLY
        for y in QuotesSpider.start_urls:
            if pages >= QuotesSpider.count:
                next_page = y + '?page=' + str(QuotesSpider.count)
                QuotesSpider.count = QuotesSpider.count + 1
                yield scrapy.Request(next_page, callback=self.parse)
Whatever website you're scraping, find the xpath/css location where the 'next page' button is. Get the href of that, and yield your next request to that link.
Alternatively, you don't need start_urls at all if you write your own start_requests method, where you can put custom logic such as looping through your desired URLs and appending the correct page number to each. See: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
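As a rough illustration of that second approach (a sketch only; the pages_to_crawl count is a made-up fixed value rather than anything computed from the real sites):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'QuotesSpider'
    base_urls = ['https://PageDucky.com', 'https://PageHorse.com']
    pages_to_crawl = 5  # assumed fixed count, just for the sketch

    def start_requests(self):
        # Build every paginated URL up front instead of relying on start_urls.
        for base in self.base_urls:
            for page in range(1, self.pages_to_crawl + 1):
                yield scrapy.Request(base + '?page=' + str(page), callback=self.parse)

    def parse(self, response):
        # ... parse the items on this page ...
        pass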
UPDATE WITH SOLUTION
I can't use the "href" because it isn't the same link; for example, page 01 was 'https:pageducky.com' and page 02 was 'https:duckyducky.com?page=2'.
So I used response.url and manipulated the string around the ?page= part, something like this:
resp1 = response.url[:response.url.find('?page=')]
resp = resp1 + '?page=' + str(QuotesSpider.count)
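A slightly more defensive variant of the same idea (not from the original post; it simply splits off any existing query string instead of searching for '?page='):

base = response.url.split('?')[0]  # drop any existing query string
next_page = base + '?page=' + str(QuotesSpider.count)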
I have the following piece of code which extracts all links from a page and puts them in a list (links=[]), which is then passed to the function filter_links().
I wish to filter out any links that are not from the same domain as the starting link, aka the first link in the list. This is what I have:
import requests
from bs4 import BeautifulSoup
import re

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')

links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
        return filtered_links

print(filter_links(links))
I have used the built-in startswith method, but it's filtering out everything except the starting URL.
Eventually I want to pass several different start URLs through this program, so I need a generic way of filtering URLs that are within the same domain as the starting URL. I think I could use regex, but shouldn't this function work too?
Try this:
import requests
from bs4 import BeautifulSoup
import re
import tldextract

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')

links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    ext = tldextract.extract(start_url)
    domain = ext.domain
    filtered_links = []
    for link in links:
        if domain in link:
            filtered_links.append(link)
    return filtered_links

print(filter_links(links))
Note:
You need to get that return statement out of the for loop. As written, it returns after iterating over just one element, so only the first matching item in the list ever gets returned.
Use the tldextract module to extract the domain name from the URLs more reliably (see the short sketch after the output below). If you want to explicitly check whether the links start with links[0] instead, that's up to you.
Output:
['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']
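For reference, a minimal sketch of the tldextract idea on its own, comparing registered domains instead of doing a plain substring match (the example links below are just illustrative):

import tldextract

start_url = "http://www.enzymebiosystems.org/"
start_domain = tldextract.extract(start_url).registered_domain  # 'enzymebiosystems.org'

links = ["http://enzymebiosystems.org/contact-us/", "https://twitter.com/example"]
same_domain = [l for l in links if tldextract.extract(l).registered_domain == start_domain]
print(same_domain)  # only the enzymebiosystems.org link survives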
Okay, so you made an indentation error in filter_links(links). The function should look like this:
def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
    return filtered_links
Notice that in your code the return statement sits inside the for loop, so the loop executes once and then immediately returns the list.
Hope this helps :)
Possible Solution
What if you kept all links which 'contain' the domain?
For example
import pandas as pd

links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

all_links = pd.DataFrame(links, columns=["Links"])

enzyme_df = all_links[all_links.Links.str.contains("enzymebiosystems")]
# results in a dataframe with links containing "enzymebiosystems"
If you want to search multiple domains, see this answer
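Along those lines, one way to keep links matching any of several domains with the same str.contains approach (a sketch; the link list and the second domain are made up for illustration):

import re
import pandas as pd

all_links = pd.DataFrame(
    ["http://enzymebiosystems.org/contact-us/", "https://example.com/page", "https://twitter.com/x"],
    columns=["Links"],
)

# str.contains accepts a regex, so join the allowed domains with "|" (escaped to be safe).
allowed = ["enzymebiosystems", "example.com"]
filtered = all_links[all_links.Links.str.contains("|".join(map(re.escape, allowed)))]
print(filtered)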
I am trying to create a web crawler that parses all the HTML on a page, grabs a specified (via raw_input) link, follows that link, and then repeats this process a specified number of times (once again via raw_input). I am able to grab the first link and successfully print it. However, I am having problems "looping" the whole process, and I usually grab the wrong link. This is the first link:
https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
(Full disclosure: this question pertains to an assignment for a Coursera course.)
Here's my code:
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
rpt = raw_input('Enter Position')
rpt = int(rpt)
cnt = raw_input('Enter Count')
cnt = int(cnt)
count = 0
counts = 0
tags = list()
soup = None
x = 0
while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup.findAll('a')
    for tag in tags:
        url = tag.get('href')
        count = count + 1
        if count == rpt:
            break
    counts = counts + 1
    if counts == cnt:
        x == 1
    else:
        continue
print url
Based on DJanssens' response, I found the solution:
url = tags[position-1].get('href')
did the trick for me!
Thanks for the assistance!
I also worked on that course, and with help from a friend I got this worked out:
import urllib
from bs4 import BeautifulSoup

url = "http://python-data.dr-chuck.net/known_by_Happy.html"
rpt = 7
position = 18
count = 0
counts = 0
tags = list()
soup = None
x = 0
while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.findAll('a')
    url = tags[position-1].get('href')
    count = count + 1
    if count == rpt:
        break
print url
I believe this is what you are looking for:
import urllib
from bs4 import *

url = raw_input('Enter - ')
position = int(raw_input('Enter Position'))
count = int(raw_input('Enter Count'))

# perform the loop "count" times.
for _ in xrange(0, count):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup.findAll('a')
    for tag in tags:
        url = tag.get('href')
    tags = soup.findAll('a')
    # if the link does not exist at that position, show error.
    if not tags[position-1]:
        print "A link does not exist at that position."
    # if the link at that position exist, overwrite it so the next search will use it.
    url = tags[position-1].get('href')
    print url
The code now loops the number of times specified in the input; on each pass it takes the href at the given position and replaces url with it, so every iteration follows the chain one link further.
I advise you to use full names for variables, which are a lot easier to understand. In addition, you can cast and read them in a single line, which makes the beginning of your script easier to follow.
Here are my 2 cents:
import urllib
# import ssl
from bs4 import BeautifulSoup

# 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = raw_input('Enter URL : ')
position = int(raw_input('Enter position : '))
count = int(raw_input('Enter count : '))

print('Retrieving: ' + url)
soup = BeautifulSoup(urllib.urlopen(url).read())

for x in range(1, count + 1):
    link = list()
    for tag in soup('a'):
        link.append(tag.get('href', None))
    print('Retrieving: ' + link[position - 1])
    soup = BeautifulSoup(urllib.urlopen(link[position - 1]).read())
I'm trying to scrape a list of URLs from the European Parliament's Legislative Observatory. I do not type in any search keyword, in order to get all links to documents (currently 13172). I can easily scrape the list of the first 10 results displayed on the website using the code below. However, I want all the links, so that I would not need to somehow press the next-page button. Please let me know if you know of a way to achieve this.
import requests, bs4, re

# main url of the Legislative Observatory's search site
url_main = 'http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y'

# function gets a list of links to the procedures
def links_to_procedures(url_main):
    # requesting html code from the main search site of the Legislative Observatory
    response = requests.get(url_main)
    soup = bs4.BeautifulSoup(response.text)  # loading text into Beautiful Soup
    links = [a.attrs.get('href') for a in soup.select('div.procedure_title a')]  # getting a list of links of the procedure titles
    return links

print(links_to_procedures(url_main))
You can follow the pagination by specifying the page GET parameter.
First, get the results count, then calculate the number of pages to process by dividing that count by the number of results per page. Then iterate over the pages one by one and collect the links:
import re
from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y')
soup = BeautifulSoup(response.content)

# get the results count
num_results = soup.find('span', class_=re.compile('resultNum')).text
num_results = int(re.search('(\d+)', num_results).group(1))
print "Results found: " + str(num_results)

results_per_page = 50
base_url = "http://www.europarl.europa.eu/oeil/search/result.do?page={page}&rows=%s&sort=d&searchTab=y&sortTab=y&x=1411566719001" % results_per_page

links = []
for page in xrange(1, num_results/results_per_page + 1):
    print "Current page: " + str(page)

    url = base_url.format(page=page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content)

    links += [a.attrs.get('href') for a in soup.select('div.procedure_title a')]

print links