I wrote a simple Python crawler to fetch urls from a website. Here is the code:
from bs4 import BeautifulSoup
import requests as req

def get_soup(url):
    content = req.get(url).content
    return BeautifulSoup(content, 'lxml')

def extract_links(url):
    soup = get_soup(url)
    a_tags = soup.find_all('a', class_="kkyou true-load-invoker")
    links = set(a_tag.get('href') for a_tag in a_tags)
    return links
def set_of_links(url, size):
    '''
    breadth-first search for article hyperlinks
    '''
    seen = set()
    active = extract_links(url)  # start from the given URL
    while active:
        next_active = set()
        for item in active:
            for result in extract_links(item):
                if result not in seen:
                    if len(seen) >= size:
                        break
                    else:
                        seen.add(result)
                        next_active.add(result)
        active = next_active
    return seen
Essentially, I get a soup from a start url I specify, extract all urls within the start url that have the class kkyou true-load-invoker, and then I repeat the process in a breadth-first fashion for all urls I have collected. I stop this process when I have seen a certain number of urls.
Until a couple of weeks ago, I had no problem running this. I could specify any number of urls and it would fetch them for me. I just tried the exact same code today, and it only returns a maximum of 14 urls. For example, if I ask it to fetch me 50 urls, it will only fetch 10 and stop. Clearly, this cannot be a problem with the code because I changed nothing! I am wondering whether the page I am trying to crawl is using some mechanism to stop me from crawling an "excessive" number of pages. The page I am trying to crawl is this (choose any article as a start url).
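One quick way I could check this (a diagnostic sketch, not part of my crawler - the User-Agent header and the delay are just guesses at what might matter) is to log the status code and the number of matching anchors per request, to see whether the server starts serving error pages or pages without those links:

import time
import requests as req
from bs4 import BeautifulSoup

def debug_fetch(url):
    # log the HTTP status, the response size and how many matching anchors came back
    resp = req.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.content, 'lxml')
    n_links = len(soup.find_all('a', class_="kkyou true-load-invoker"))
    print(f"{resp.status_code} | {len(resp.content)} bytes | {n_links} matching links | {url}")
    time.sleep(1)  # small delay, in case the server rate-limits rapid requests
    return soup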
Any insights on this will be greatly appreciated! I am a complete novice in web crawling.
Related
Hello everyone, I'm a beginner at scraping and I am trying to scrape all iPhones on https://www.electroplanet.ma/
This is the script I wrote:
import re

import scrapy
from ..items import EpItem

class ep(scrapy.Spider):
    name = "ep"
    start_urls = ["https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=1",
                  "https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=2"
                  ]

    def parse(self, response):
        products = response.css("ol li")  # find all items on the page
        for product in products:
            try:
                lien = product.css("a.product-item-link::attr(href)").get()   # link of each item
                image = product.css("a.product-item-photo::attr(href)").get() # image of each item
                # To get into each item's page and scrape it, I use the follow method.
                # I pass the image as an argument to parse_item because I couldn't
                # scrape the image from the item's page (I think it's hidden).
                yield response.follow(lien, callback=self.parse_item, cb_kwargs={"image": image})
            except Exception:
                pass

    def parse_item(self, response, image):
        item = EpItem()
        item["Nom"] = response.css(".ref::text").get()
        pattern = re.compile(r"\s*(\S+(?:\s+\S+)*)\s*")
        item["Catégorie"] = pattern.search(response.xpath("//h1/a/text()").get()).group(1)
        item["Marque"] = pattern.search(response.xpath("//*[@data-th='Marque']/text()").get()).group(1)
        try:
            item["RAM"] = pattern.search(response.xpath("//*[@data-th='MÉMOIRE RAM']/text()").get()).group(1)
        except Exception:
            pass
        item["ROM"] = pattern.search(response.xpath("//*[@data-th='MÉMOIRE DE STOCKAGE']/text()").get()).group(1)
        item["Couleur"] = pattern.search(response.xpath("//*[@data-th='COULEUR']/text()").get()).group(1)
        item["lien"] = response.request.url
        item["image"] = image
        item["état"] = "neuf"
        item["Market"] = "Electro Planet"
        yield item
I had trouble scraping all the pages, because the site uses JavaScript for pagination, so I wrote every page link into start_urls. I believe this is not the best practice, so I am asking for advice on how to improve my code.
You can use the scrapy-playwright plugin to scrape interactive websites. As for start_urls, just add the main index URL if there is only one website, and check this link in the Scrapy docs to make the spider follow the pagination links automatically instead of writing them all manually.
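As a rough sketch of what following the pagination automatically can look like (the selectors and the URL pattern below are assumptions and need to be checked against the site's real markup, and this only works if the pagination links are present in the HTML - if they are generated purely by JavaScript, scrapy-playwright would have to render the page first):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class EpCrawlSpider(CrawlSpider):
    name = "ep_crawl"
    # single entry point; the rules below discover the other pages
    start_urls = ["https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone"]

    rules = (
        # follow pagination links such as ?p=2, ?p=3, ... (the pattern is an assumption)
        Rule(LinkExtractor(allow=r"\?p=\d+")),
        # send each product page to parse_item
        Rule(LinkExtractor(restrict_css="a.product-item-link"), callback="parse_item"),
    )

    def parse_item(self, response):
        # placeholder: extract the fields here as in your original parse_item
        yield {"url": response.url}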
I'm building a crawler that downloads all .pdf files of a given website and its subpages. For this, I've used built-in functionality around the simplified recursive function below, which retrieves all links of a given URL.
However, this becomes quite slow the longer it crawls a given website (it may take 2 minutes or longer per URL).
I can't quite figure out what's causing this and would really appreciate suggestions on what needs to be changed in order to increase the speed.
import re
import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"https://www.srs-stahl.de/{page_url}").text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
It is not that easy to figure out what actively slows down your crawling - it may be the way you crawl, the server of the website, ...
In your code, you request a URL, grab the links and immediately call the function on itself again, so you only ever keep track of URLs that have already been requested.
You may want to work with "queues" to keep the process more transparent.
One advantage is that if the script aborts, you still have this information stored and can use it to resume from the URLs you have already collected but not yet visited - quite the opposite of your recursion, which may have to start over from an earlier point to ensure it gets all URLs.
Another point is that you request the PDF files but never use the response in any way. Wouldn't it make more sense to either download and save them directly, or skip the request and at least keep the links in a separate "queue" for post-processing (see the sketch after the example below)?
Collected information in comparison - based on the same number of iterations:
Code in question:
    pages --> 24
Example code (without delay):
    urlsVisited --> 24
    urlsToVisit --> 87
    urlsToDownload --> 67
Example
Just to demonstrate - feel free to create defs, classes and structure it to your needs. Note that I added some delay, but you can skip it if you like. The "queues" used to demonstrate the process are lists, but they should be files, a database, ... to store your data safely.
import requests, time
from bs4 import BeautifulSoup

baseUrl = 'https://www.srs-stahl.de'

urlsToDownload = []
urlsToVisit = ["https://www.srs-stahl.de/"]
urlsVisited = []

def crawl(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    for a in soup.select('a[href^="/"]'):
        url = f"{baseUrl}{a['href']}"

        if '.pdf' in url and url not in urlsToDownload:
            urlsToDownload.append(url)
        else:
            if url not in urlsToVisit and url not in urlsVisited:
                urlsToVisit.append(url)

while urlsToVisit:
    url = urlsToVisit.pop(0)
    try:
        crawl(url)
    except Exception as e:
        print(f'Failed to crawl: {url} -> error {e}')
    finally:
        urlsVisited.append(url)
    time.sleep(2)
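Following up on the post-processing point above, a minimal sketch of downloading the collected PDFs afterwards (file naming and error handling kept deliberately simple) could look like this:

import os
import requests

def download_pdfs(urls, target_dir="pdfs"):
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        # derive a file name from the last URL segment
        filename = os.path.join(target_dir, url.rsplit("/", 1)[-1] or "unnamed.pdf")
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            with open(filename, "wb") as f:
                f.write(resp.content)
        except Exception as e:
            print(f"Failed to download {url} -> {e}")

download_pdfs(urlsToDownload)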
So, I am parsing emails from many websites.
1)
I take them from the front page and from the contacts section ('kont' or 'cont' in hrefs).
There could be many links with 'kont' or 'cont' on the front page.
I don't want to visit all of them in the "for" loop.
I would like the program to move on to another website as soon as the data is found in one of those links (email_list_2 != []). How do I do that?
2)
There is some redundancy in the code: I yield data at the front page because I am afraid the request from the for loop might be unsuccessful, in which case I would lose the data from the front page.
Can I just yield {'site': site,
                  'email_list_1': email_list_1,
                  'email_list_2': []} if the data is not found,
or
{'site': site,
 'email_list_1': email_list_1,
 'email_list_2': ['xyz']} if the data is found, without yielding twice?
Please help
Regards,
import re

import scrapy
from bs4 import BeautifulSoup

# emailRegex and website_list are defined elsewhere in my project

class QuotesSpider(scrapy.Spider):
    name = 'enrichment'
    start_urls = website_list

    def parse(self, response):
        site = response.url
        data = response.text
        email_list_1 = emailRegex.findall(data)
        yield {'lvl': '1',
               'site': site,
               'email_list_1': email_list_1,
               'email_list_2': [],
               }
        soup = BeautifulSoup(data, 'lxml')
        for link in soup.find_all('a'):
            raw_url = link.get('href')
            full_url = str(site) + str(raw_url)
            if (re.search('cont', full_url) != None or
                    re.search('kont', full_url) != None):
                yield scrapy.Request(url=full_url,
                                     callback=self.parse_2d_level,
                                     meta={'site': site, 'email_list_1': email_list_1}
                                     )

    def parse_2d_level(self, response):
        site = response.meta['site']
        email_list_1 = response.meta['email_list_1']
        data_2 = response.text
        email_list_2 = emailRegex.findall(data_2)
        yield {'lvl': '2',
               'site': site,
               'email_list_1': email_list_1,
               'email_list_2': email_list_2,
               }
I'm not sure I fully understand your question, but here it goes:
1 - You want to scrape PAGE1, look for 'cont' or 'kont' and, if these components exist, make a new request for PAGE2. In PAGE2 you search for email_list_2 and yield the results. You asked:
I would like the program to go to another website when the data is
found in one of those links (email_list_2 != []). how to do that?
What website do you want it to go? Is it a follow on the page you are already scraping? Is it another website in your start_urls?
In the current state, after parsing PAGE2 (in the parse_2d_level method) your spider will yield results whether it found values for email_list_2 or not. If there are other requests in the queue, Scrapy will go on to execute those; if there aren't, the spider will end.
2 - You want to make sure the data you already found before the loop is yielded in case the request from inside the loop fails. Since you said:
the request from the for loop would be unsuccessful
I'll assume you are only worried about the REQUEST failing; there are other ways your parsing could fail.
For failed requests you can catch and handle the issue with a Scrapy signal called spider_error; take a look here.
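A rough sketch of wiring that up (the handler name is just for illustration):

from scrapy import signals

class QuotesSpider(scrapy.Spider):
    name = 'enrichment'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_error fires whenever a callback raises an exception while processing a response
        crawler.signals.connect(spider.handle_spider_error, signal=signals.spider_error)
        return spider

    def handle_spider_error(self, failure, response, spider):
        spider.logger.error(f"Parsing failed for {response.url}: {failure.value!r}")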
3 - You should take a look at Scrapy's selectors; they are a very powerful tool. You don't need BeautifulSoup for the parsing, and the selectors will help a lot with precision.
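For example, the link extraction in parse could be done with a CSS selector instead of BeautifulSoup (a sketch that reuses the names from your spider):

def parse(self, response):
    site = response.url
    email_list_1 = emailRegex.findall(response.text)  # emailRegex as defined in your project
    yield {'lvl': '1', 'site': site, 'email_list_1': email_list_1, 'email_list_2': []}

    # select only anchors whose href contains 'cont' or 'kont'
    for raw_url in response.css('a[href*="cont"]::attr(href), a[href*="kont"]::attr(href)').getall():
        yield response.follow(raw_url,
                              callback=self.parse_2d_level,
                              meta={'site': site, 'email_list_1': email_list_1})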
I am quite new to Python and am building a web scraper, which will scrape the following page and links in them: https://www.nalpcanada.com/Page.cfm?PageID=33
The problem is that the page defaults to displaying the first 10 search results; however, I want to scrape all 150 search results (when 'All' is selected, there are 150 links).
I have tried messing around with the URL, but the URL remains static no matter what display results option is selected. I have also tried to look at the Network section of the Developer Tools on Chrome, but can't seem to figure out what to use to display all results.
Here is my code so far:
import bs4
import requests
import csv
import re

response = requests.get('https://www.nalpcanada.com/Page.cfm?PageID=33')
soup = bs4.BeautifulSoup(response.content, "html.parser")

urls = []
for a in soup.findAll('a', href=True, class_="employerProfileLink", text="Vancouver, British Columbia"):
    urls.append(a['href'])

pagesToCrawl = ['https://www.nalpcanada.com/' + url + '&QuestionTabID=47' for url in urls]

for pages in pagesToCrawl:
    html = requests.get(pages)
    soupObjs = bs4.BeautifulSoup(html.content, "html.parser")
    nameOfFirm = soupObjs.find('div', class_="ip-left").find('h2').next_element
    tbody = soupObjs.find('div', {"id": "collapse8"}).find('tbody')
    offers = tbody.find('td').next_sibling.next_sibling.next_element
    seeking = tbody.find('tr').next_sibling.next_sibling.find('td').next_sibling.next_sibling.next_element
    print('Firm name:', nameOfFirm)
    print('Offers:', offers)
    print('Seeking:', seeking)
    print('Hireback Rate:', int(offers) / int(seeking))
Replacing your requests.get call with this code seems to work. The reason is that you weren't passing the cookie that controls how many results are displayed.
response = requests.get(
    'https://www.nalpcanada.com/Page.cfm',
    params={'PageID': 33},
    cookies={'DISPLAYNUM': '100000000'}
)
The only other issue I came across was that a ValueError was being raised by this line when certain links (like YLaw Group) don't seem to have "offers" and/or "seeking".
print('Hireback Rate:', int(offers) / int(seeking))
I just commented out the line since you will have to decide what to do in those cases.
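If you do want to keep it, one option (just a sketch, not the only way to handle it) is to wrap the division and fall back to a placeholder when the values aren't usable:

try:
    print('Hireback Rate:', int(offers) / int(seeking))
except (ValueError, ZeroDivisionError):
    print('Hireback Rate: n/a')  # firm page has no usable offers/seeking numbers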
I have the following code for a web crawler in Python 3:
import requests
from bs4 import BeautifulSoup
import re

def get_links(link):
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))
    return return_links

def recursive_search(links):
    for i in links:
        links.extend(get_links(i))
    recursive_search(links)

recursive_search(get_links("https://www.brandonskerritt.github.io"))
The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.
I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.
This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, it might be that the links are in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.
I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.
Any help will be appreciated!
There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
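As a starting point, a few of the settings that guide discusses look like this (the values are illustrative, not recommendations):

# settings.py
CONCURRENT_REQUESTS = 100              # raise global concurrency for a broad crawl
CONCURRENT_REQUESTS_PER_DOMAIN = 8     # keep per-domain pressure reasonable
REACTOR_THREADPOOL_MAXSIZE = 20        # more threads for DNS resolution
LOG_LEVEL = 'INFO'                     # less logging overhead than DEBUG
ROBOTSTXT_OBEY = True                  # you mentioned wanting to respect robots.txt
DEPTH_LIMIT = 2                        # optional: keep a demonstration crawl bounded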
To recreate the behaviour you need in Scrapy, you must
set your start URL,
write a parse function that follows all links and recursively calls itself, adding the requested URLs to a spider variable.
An untested example (that can, of course, be refined):
class AllSpider(scrapy.Spider):
    name = 'all'
    start_urls = ['https://yourgithub.com']

    def __init__(self):
        self.links = []

    def parse(self, response):
        self.links.append(response.url)
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.
A simple spider that follows all links:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FollowAllSpider(CrawlSpider):
    name = 'follow_all'
    start_urls = ['https://example.com']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass
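Since you only need this for demonstration purposes, you could also bound the crawl in your project settings (CLOSESPIDER_PAGECOUNT comes from the built-in CloseSpider extension; the values here are arbitrary):

# settings.py
CLOSESPIDER_PAGECOUNT = 100   # stop the spider after roughly 100 responses
DOWNLOAD_DELAY = 1            # be gentle while demonstrating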