I am learning web scraping for a hobby project (using Selenium in Python).
I'm trying to get the information on product listings. There are about 100 web pages, and I am sequentially loading each one, processing its data, and then moving on to the next.
This process takes over 5 minutes, with the major bottleneck being the loading time of each page.
As I am only "reading" from the pages (not interacting with any of them), I would like to know: is it possible to send requests for all the pages together (as opposed to waiting for one page to load before requesting the next) and process the data as they arrive?
PS: Please tell me if there are other solutions to reduce the loading time.
You could use the Python Requests module and the built-in threading module to make it faster. For example:
import threading
import requests

list_of_links = [
    # All your links here
]

threads = []
all_html = {
    # This will be a dictionary of the links as key and the HTML of the links as values
}

def get_html(link):
    r = requests.get(link)
    all_html[link] = r.text

for link in list_of_links:
    thread = threading.Thread(target=get_html, args=(link,))
    thread.start()
    threads.append(thread)

for t in threads:
    t.join()

print(all_html)
print("DONE")
I'm building a crawler that downloads all .pdf files of a given website and its subpages. For this, I've built functionality around the simplified recursive function below, which retrieves all links of a given URL.
However, it becomes quite slow the longer it crawls a given website (it may take 2 minutes or longer per URL).
I can't quite figure out what's causing this and would really appreciate suggestions on what needs to be changed in order to increase the speed.
import re
import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"https://www.srs-stahl.de/{page_url}").text
    soup = BeautifulSoup(html, "html.parser")

    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
It is not that easy to figure out what actively slows down your crawling. It may be the way you crawl, the server of the website, and so on.
In your code, you request a URL, grab the links, and recursively call the function as soon as you find a new link, so you only ever append URLs that have already been requested.
You may want to work with "queues" to keep the process more transparent.
One advantage is that if the script aborts, you still have this information stored and can use it to resume from the URLs you have already collected but not yet visited. That is quite the opposite of your for loop, which may have to start over from an earlier point to ensure it gets all URLs.
Another point is that you request the PDF files but never use the response in any way. Wouldn't it make more sense to either download and save them directly, or to skip the request and keep the links in a separate "queue" for post-processing?
Collected information in comparison (based on the same number of iterations):
Code in question:
pages --> 24
Example code (without delay):
urlsVisited --> 24
urlsToVisit --> 87
urlsToDownload --> 67
Example
Just to demonstrate; feel free to create defs, classes, and structure to suit your needs. Note that I added some delay, but you can skip it if you like. The "queues" used to demonstrate the process are lists, but they should really be files, a database, etc., so that your data is stored safely.
import requests, time
from bs4 import BeautifulSoup

baseUrl = 'https://www.srs-stahl.de'

urlsToDownload = []
urlsToVisit = ["https://www.srs-stahl.de/"]
urlsVisited = []

def crawl(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    for a in soup.select('a[href^="/"]'):
        url = f"{baseUrl}{a['href']}"

        if '.pdf' in url and url not in urlsToDownload:
            urlsToDownload.append(url)
        else:
            if url not in urlsToVisit and url not in urlsVisited:
                urlsToVisit.append(url)

while urlsToVisit:
    url = urlsToVisit.pop(0)
    try:
        crawl(url)
    except Exception as e:
        print(f'Failed to crawl: {url} -> error {e}')
    finally:
        urlsVisited.append(url)
    time.sleep(2)
I'm trying to scrape data from this review site. It first goes through the first page, checks whether there's a 2nd page, and then goes to it too. The problem is when getting to the 2nd page: the page takes time to update, and I still get the first page's data instead of the 2nd.
For example, if you go here, you will see how it takes time to load the page 2 data.
I tried to put in a timeout or sleep, but it didn't work. I'd prefer a solution with minimal package/browser dependency (like webdriver.PhantomJS()), as I need to run this code in my employer's environment and I'm not sure if I can use it. Thank you!!
from urllib.request import Request, urlopen
from time import sleep
from socket import timeout
from bs4 import BeautifulSoup

req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req, timeout=10).read()
webpage = web_byte.decode('utf-8')
parsed_html = BeautifulSoup(webpage, features="lxml")

true = parsed_html.find('div', {'class':['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
while true:
    true = parsed_html.find('div', {'class':['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
    if not true:
        true = False
    else:
        req = Request(softwareadvice + '?review.page=2', headers=hdr)
        sleep(10)
        webpage = urlopen(req, timeout=10)
        sleep(10)
        webpage = webpage.read().decode('utf-8')
        parsed_html = BeautifulSoup(webpage, features="lxml")
The reviews are loaded from an external source via an Ajax request. You can use this example to load them:
import re
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/"
api_url = (
    "https://pkvwzofxkc.execute-api.us-east-1.amazonaws.com/production/reviews"
)

params = {
    "q": "s*|-s*",
    "facet.gdm_industry_id": '{"sort":"bucket","size":200}',
    "fq": "(and product_id: '{}' listed:1)",
    "q.options": '{"fields":["pros^5","cons^5","advice^5","review^5","review_title^5","vendor_response^5"]}',
    "size": "50",
    "start": "50",
    "sort": "completeness_score desc,date_submitted desc",
}

# get product id
soup = BeautifulSoup(requests.get(url).content, "html.parser")
a = soup.select_one('a[href^="https://reviews.softwareadvice.com/new/"]')
id_ = int("".join(re.findall(r"\d+", a["href"])))

params["fq"] = params["fq"].format(id_)

for start in range(0, 3):  # <-- increase the number of pages here
    params["start"] = 50 * start
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some data:
    for h in data["hits"]["hit"]:
        if "review" in h["fields"]:
            print(h["fields"]["review"])
            print("-" * 80)
Prints:
After 2 years using Twilio services, mainly phone and messages, I can say I am so happy I found this solution to handle my communications. It is so flexible, Although it has been a little bit complicated sometimes to self-learn about online phoning systems it saved me from a lot of hassles I wanted to avoid. The best benefit you get is the ultra efficient support service
--------------------------------------------------------------------------------
An amazingly well built product -- we rarely if ever had reliability issues -- the Twilio Functions were an especially useful post-purchase feature discovery -- so much so that we still use that even though we don't do any texting. We also sometimes use FracTEL, since they beat Twilio on pricing 3:1 for 1-800 texts *and* had MMS 1-800 support long before Twilio.
--------------------------------------------------------------------------------
I absolutely love using Twilio, have had zero issues in using the SIP and text messaging on the platform.
--------------------------------------------------------------------------------
Authy by Twilio is a run-of-the-mill 2FA app. There's nothing special about it. It works when you're not switching your hardware.
--------------------------------------------------------------------------------
We've had great experience with Twilio. Our users sign up for text notification and we use Twilio to deliver them information. That experience has been well-received by customers. There's more to Twilio than that but texting is what we use it for. The system barely ever goes down and always shows us accurate information of our usage.
--------------------------------------------------------------------------------
...and so on.
I have been scraping many types of websites, and I think that in the world of scraping there are roughly two types of websites.
The first is "URL-based" websites (i.e. you send a request with a URL and the server responds with HTML tags from which elements can be directly extracted), and the second is "JavaScript-rendered" websites (i.e. the only response you get is JavaScript, and you can only see the HTML tags after it has run).
In the former case, you can freely navigate through the website with bs4. But in the latter case, you cannot always rely on URLs as a rule of thumb.
The site you are trying to scrape is built with Angular.js, which is based on client-side rendering. So the response you get is JavaScript code, not HTML tags with the page content in them. You have to run that code to get the content.
About the code you posted:
req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req, timeout=10).read() # response is javascript, not page content you want...
webpage = web_byte.decode('utf-8')
All you can get is the JavaScript code that must be run to get the HTML elements. That is why you get the same page (response) every time.
So, what to do? Is there any way to run JavaScript within bs4? I don't think there is an appropriate way to do that. You can use Selenium for this. With Selenium you can literally wait until the page fully loads, you can click buttons and anchors, and you can get the page content at any time.
Headless browsers in Selenium might also work, which means you don't have to see the controlled browser opening on your computer.
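If Selenium turns out to be acceptable after all, a minimal, hedged sketch with a headless browser and an explicit wait might look like this (the CSS selector for the review elements is a placeholder you would need to check against the real page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)

# Page 2 of the reviews, as in the question.
driver.get("https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/?review.page=2")

# Wait (up to 20 s) until the JavaScript has actually rendered the reviews.
# ".review-card" is a placeholder selector; inspect the page for the real one.
WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".review-card"))
)

html = driver.page_source  # now contains the rendered page-2 markup
driver.quit()

From there you could hand html to BeautifulSoup exactly as in your current code.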
Here are some links that might be of help to you.
scrape html generated by javascript with python
https://sadesmith.com/2018/06/15/blog/scraping-client-side-rendered-data-with-python-and-selenium
Thanks for reading.
I am working on a web scraping project, and I have to get links for 19,062 facilities. If I use a for loop, it will take almost 3 hours to complete. I tried writing a generator but failed to work out the logic, and I am not sure it can be done with a generator anyway. So, does any Python expert have an idea how to get what I want faster? In my code, I execute it for just 20 ids. Thanks.
import requests, json
from bs4 import BeautifulSoup as bs

url = 'https://hilfe.diakonie.de/hilfe-vor-ort/marker-json.php?ersteller=&kategorie=0&text=&n=55.0815&e=15.0418321&s=47.270127&w=5.8662579&zoom=20000'
res = requests.get(url).json()
url_1 = 'https://hilfe.diakonie.de/hilfe-vor-ort/info-window-html.php?id='

# extracting all the id= from the .json res object
id = []
for item in res['items'][0]["elements"]:
    id.append(item["id"])

# opening a .json file and making a dict for links
file = open('links.json', 'a')
links = {'links': []}

def link_parser(url, id):
    resp = requests.get(url + id).content
    soup = bs(resp, "html.parser")
    link = soup.select_one('p > a').attrs['href']
    links['links'].append(link)

# dumping the dict into the links.json file
for item in id[:20]:
    link_parser(url_1, item)

json.dump(links, file)
file.close()
In web scraping, speed is not a good idea! You will be hitting the server numerous times a second and will most likely get blocked if you use a for loop. A generator will not make this quicker. Ideally, you want to hit the server once and process the data locally.
If it were me, I would use a framework like Scrapy, which encourages good practice and provides various Spider classes to support standard techniques.
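As a rough, hedged sketch of how that could look for the pages in the question (the spider name, the settings values, and the output handling are my assumptions, not part of the original answer):

import json
import scrapy

class FacilitySpider(scrapy.Spider):
    name = "facilities"  # illustrative name
    start_urls = [
        "https://hilfe.diakonie.de/hilfe-vor-ort/marker-json.php"
        "?ersteller=&kategorie=0&text=&n=55.0815&e=15.0418321"
        "&s=47.270127&w=5.8662579&zoom=20000"
    ]
    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,  # let Scrapy adapt the request rate
        "DOWNLOAD_DELAY": 0.25,        # stay polite to the server
    }

    def parse(self, response):
        # The marker endpoint returns JSON containing the facility ids.
        data = json.loads(response.text)
        for element in data["items"][0]["elements"]:
            url = (
                "https://hilfe.diakonie.de/hilfe-vor-ort/"
                f"info-window-html.php?id={element['id']}"
            )
            yield scrapy.Request(url, callback=self.parse_facility)

    def parse_facility(self, response):
        # Same selector as in the question's code.
        yield {"link": response.css("p > a::attr(href)").get()}

You could run it with, for example, scrapy runspider facilities_spider.py -o links.json, and Scrapy would schedule the requests concurrently while respecting the throttling settings.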
I'm scraping a Python list of web domains and would like to put a 4-second delay between each scrape in order to comply with robots.txt. I would like each iteration to run asynchronously, so the loop continues firing every 4 seconds, irrespective of whether the scrape for that particular page has finished.
I have tried implementing asyncio gather and coroutines and was beginning to attempt throttling. However, my solutions were getting very complex, and I believe there must be a simpler way, or that I am missing something here. In one of my past versions, I just put a sleep(4) inside the for loop, though to my updated understanding this is bad because it sleeps the entire interpreter, so the other loops won't run asynchronously during that time.
import requests
import csv
from bs4 import BeautifulSoup

csvFile = open('test.csv', 'w+')

urls = [
    'domain1', 'domain2', 'domain3'...
]
YOURAPIKEY = <KEY>

writer = csv.writer(csvFile)
writer.writerow(('Scraped text', 'other info 1', 'other info 2'))

lastI = len(urls) - 1
for i, a in enumerate(urls):
    payload = {'api_key': YOURAPIKEY, 'url': a}
    r = requests.get('http://api.scraperapi.com', params=payload)
    soup = BeautifulSoup(r.text, 'html.parser')

    def parse(self, response):
        scraper_url = 'http://api.scraperapi.com/?api_key=YOURAPIKEY&url=' + a
        yield scrapy.Request(scraper_url, self.parse)

    price_cells = soup.select('.step > b.whb:first-child')
    lastF = len(price_cells) - 1
    for f, price_cell in enumerate(price_cells):
        writer.writerow((price_cell.text.rstrip(), '...', '...'))
        print(price_cell.text.rstrip())
        if (i == lastI and f == lastF):
            print('closing now')
            csvFile.close()
There are no errors with the above code that I can tell. I just want each loop iteration to keep running at 4-second intervals and the results coming back from each fetch to be saved to the CSV file as they arrive.
In Scrapy, the appropriate setting in the settings.py file would be:
DOWNLOAD_DELAY
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported.
DOWNLOAD_DELAY = 4 # 4s of delay
https://doc.scrapy.org/en/latest/topics/settings.html
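If you prefer to keep the delay with the spider itself rather than in settings.py, Scrapy also lets a spider override settings via custom_settings. A minimal sketch (the spider name, the domain list, and the output fields are placeholders based on the question, not a tested implementation):

import scrapy

class PriceSpider(scrapy.Spider):
    name = "prices"  # placeholder name
    # Per-spider override of the project-wide DOWNLOAD_DELAY setting.
    custom_settings = {"DOWNLOAD_DELAY": 4}
    start_urls = ["https://domain1", "https://domain2"]  # your list of domains

    def parse(self, response):
        # Same selector as in the question's code.
        for cell in response.css(".step > b.whb:first-child"):
            yield {"scraped_text": cell.css("::text").get(default="").rstrip()}

Scrapy's scheduler then spaces the requests for you, so there is no need for sleep() anywhere in the parsing code.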
Thanks in advance for your help. I'm new to Python and trying to figure out how to use the threading module to scrape the NY Daily News site for URLs. I put the following together, and the script is scraping, but it doesn't seem to be any faster than it was before, so I'm not sure the threading is happening. Can you let me know if it is? Can I write in anything so that I can tell? And do you have any other tips about threading?
Thank you.
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import os
import io
import threading

def fetch_url():
    for i in xrange(15500, 6100, -1):
        page = urllib2.urlopen("http://www.nydailynews.com/search-results/search-results-7.113?kw=&tfq=&afq=&page={}&sortOrder=Relevance&selecturl=site&q=the&sfq=&dtfq=seven_years".format(i))
        soup = BeautifulSoup(page.read())
        snippet = soup.find_all('h2')
        for h2 in snippet:
            for link in h2.find_all('a'):
                logfile.write("http://www.nydailynews.com" + link.get('href') + "\n")
        print "finished another url from page {}".format(i)

with open("dailynewsurls.txt", 'a') as logfile:
    threads = threading.Thread(target=fetch_url())
    threads.start()
The below is a naive implementation (which will very quickly get you blacklisted from nydailynews.com):
def fetch_url(i, logfile):
    page = urllib2.urlopen("http://www.nydailynews.com/search-results/search-results-7.113?kw=&tfq=&afq=&page={}&sortOrder=Relevance&selecturl=site&q=the&sfq=&dtfq=seven_years".format(i))
    soup = BeautifulSoup(page.read())
    snippet = soup.find_all('h2')
    for h2 in snippet:
        for link in h2.find_all('a'):
            logfile.write("http://www.nydailynews.com" + link.get('href') + "\n")
    print "finished another url from page {}".format(i)

with open("dailynewsurls.txt", 'a') as logfile:
    threads = []
    for i in xrange(15500, 6100, -1):
        t = threading.Thread(target=fetch_url, args=(i, logfile))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
Note that fetch_url takes the number to substitute into the URL as an argument, and each possible value of that argument is started in its own, separate thread.
I would strongly suggest dividing the job into smaller batches and running one batch at a time, as sketched below.
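For instance, a hedged sketch of batching on top of the code above (the batch size is an arbitrary choice, not a recommendation from the original answer):

BATCH_SIZE = 20  # arbitrary; tune to what the site tolerates

with open("dailynewsurls.txt", 'a') as logfile:
    page_numbers = list(xrange(15500, 6100, -1))
    for start in range(0, len(page_numbers), BATCH_SIZE):
        batch = page_numbers[start:start + BATCH_SIZE]
        threads = [threading.Thread(target=fetch_url, args=(i, logfile))
                   for i in batch]
        for t in threads:
            t.start()
        # Wait for the whole batch to finish before starting the next one.
        for t in threads:
            t.join()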
No, you're not using threads. threads = threading.Thread(target=fetch_url()) calls fetch_url() in your main thread, waits for it to complete and passes its return value (None) to the threading.Thread constructor.
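For comparison, the difference between the two forms:

# Calls fetch_url() immediately in the main thread and passes its return
# value (None) to the Thread constructor -- no concurrency at all:
threading.Thread(target=fetch_url())

# Passes the function object itself, so the thread calls it later:
threading.Thread(target=fetch_url)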