How to handle large-scale web scraping? - python

The Situation:
I recently started web scraping with Selenium and Scrapy, and I am working on a project where I have a CSV file containing 42 thousand zip codes. My job is to take each zip code, enter it on this site, and scrape all the results.
The Problem:
The problem is that in doing this I have to keep clicking the 'load more' button until all the results have been displayed, and only once that has finished can I collect the data.
This may not seem like much of an issue, but it takes about 2 minutes per zip code and I have 42,000 zip codes to get through.
The Code:
import scrapy
from selenium import webdriver
from selenium.common.exceptions import ElementClickInterceptedException, ElementNotInteractableException, NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.common.keys import Keys
from items import CareCreditItem
from datetime import datetime
import os
from scrapy.crawler import CrawlerProcess

global pin_code
pin_code = input("enter pin code")

class CareCredit1Spider(scrapy.Spider):
    name = 'care_credit_1'
    start_urls = ['https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty//?Sort=D&Radius=75&Page=1']

    def start_requests(self):
        directory = os.getcwd()
        options = webdriver.ChromeOptions()
        options.headless = True
        options.add_experimental_option("excludeSwitches", ["enable-logging"])
        path = os.path.join(directory, "Chromedriver.exe")
        driver = webdriver.Chrome(path, options=options)

        # URL of the website
        url = "https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty/" + pin_code + "/?Sort=D&Radius=75&Page=1"
        driver.maximize_window()

        # open the link in the browser
        driver.get(url)
        driver.implicitly_wait(200)

        # accept the cookie banner if it appears
        try:
            cookies = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
            cookies.click()
        except NoSuchElementException:
            pass

        # keep clicking 'load more' until it disappears or stops being clickable
        load_more_button_exists = True
        while load_more_button_exists:
            try:
                load_more = driver.find_element_by_xpath('//*[@id="next-page"]')
                load_more.click()
                driver.implicitly_wait(30)
            except (ElementNotInteractableException, NoSuchElementException):
                load_more_button_exists = False
            except (ElementClickInterceptedException, StaleElementReferenceException):
                pass

        try:
            previous_page = driver.find_element_by_xpath('//*[@id="previous-page"]')
            previous_page.click()
        except NoSuchElementException:
            pass

        # hand every result link over to Scrapy
        results = driver.find_elements_by_class_name('dl-result-item')
        for element in results:
            link = element.find_element_by_tag_name('a')
            c = link.get_property('href')
            yield scrapy.Request(c)

        # close the browser once all detail-page requests have been scheduled
        driver.quit()

    def parse(self, response):
        item = CareCreditItem()
        item['Practise_name'] = response.css('h1 ::text').get()
        item['address'] = response.css('.google-maps-external ::text').get()
        item['phone_no'] = response.css('.dl-detail-phone ::text').get()
        yield item

now = datetime.now()
# use '-' in the date so the timestamp makes a valid file name
dt_string = now.strftime("%d-%m-%Y")
dt = now.strftime("%H-%M-%S")
file_name = dt_string + "_" + dt + "zip-code" + pin_code + ".csv"

process = CrawlerProcess(settings={
    'FEED_URI': file_name,
    'FEED_FORMAT': 'csv'
})
process.crawl(CareCredit1Spider)
process.start()
print("CSV File is Ready")
items.py
import scrapy

class CareCreditItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Practise_name = scrapy.Field()
    address = scrapy.Field()
    phone_no = scrapy.Field()
The Question:
Essentially my question is simple: is there a way to optimize this code so that it performs faster? Or what other methods could handle scraping this data without it taking forever?

Since the site loads the data dynamically from an API, you can retrieve the data directly from that API. This will speed things up quite a bit, but I'd still implement a wait to avoid hitting a rate limit.
import requests
import time
import pandas as pd

zipcode = '00704'
radius = 75

url = f'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}&PracticePhone=&Profession=&location={zipcode}&Page=1'
req = requests.get(url)
r = req.json()
data = r['results']

# the first response tells us how many pages there are in total
for i in range(2, r['maxPage'] + 1):
    url = f'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}&PracticePhone=&Profession=&location={zipcode}&Page={i}'
    req = requests.get(url)
    r = req.json()
    data.extend(r['results'])
    time.sleep(1)

df = pd.DataFrame(data)
# use '-' rather than '/' in the timestamp so the file name is valid
df.to_csv(f'{pd.Timestamp.now().strftime("%d-%m-%Y_%H-%M-%S")}zip-code{zipcode}.csv')

There are multiple ways in which you can do this.
1. Create a distributed system in which you run the spider across multiple machines so the crawls run in parallel.
This is, in my opinion, the better option, as you can also build a scalable, dynamic solution that you will be able to reuse many times over.
There are many ways of doing this. Normally it consists of dividing the seed list (the zip codes) into several separate seed lists so that separate processes each work on their own seed list; the downloads then run in parallel, so with 2 machines it goes roughly 2 times faster, with 10 machines roughly 10 times faster, and so on.
For this I would suggest looking into AWS, namely AWS Lambda, AWS EC2 instances or even AWS Spot Instances; these are the ones I have worked with previously and they are not terribly hard to work with.
2. Alternatively, if you want to run it on a single machine, you can look into multithreading with Python, which can help you run the process in parallel on that one machine (see the sketch after this list).
3. Another option, particularly if it is a one-off process, is to run it simply with requests, which may speed it up; but with a massive number of seeds it is usually faster to develop a process that runs in parallel.
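As a concrete illustration of options 2 and 3 combined, here is a minimal sketch that splits the work across a thread pool and reuses the API approach from the answer above. The fetch_zipcode helper, the zipcodes.csv file and the worker count are assumptions for the example, not part of the original post.

import csv
import time
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_zipcode(zipcode, radius=75):
    # Fetch every result page for one zip code from the locator API (hypothetical helper).
    base = ('https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService'
            f'&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}'
            f'&PracticePhone=&Profession=&location={zipcode}&Page=')
    data, page, max_page = [], 1, 1
    while page <= max_page:
        r = requests.get(base + str(page)).json()
        data.extend(r['results'])
        max_page = r['maxPage']
        page += 1
        time.sleep(1)                       # stay polite and avoid rate limits
    return zipcode, data

with open('zipcodes.csv') as f:             # assumed: one zip code per row
    zipcodes = [row[0] for row in csv.reader(f) if row]

results = {}
with ThreadPoolExecutor(max_workers=8) as ex:   # a handful of workers is usually enough
    futures = [ex.submit(fetch_zipcode, z) for z in zipcodes]
    for fut in as_completed(futures):
        zipcode, data = fut.result()
        results[zipcode] = data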

Related

Beautifulsoup4 find_all not getting the results I need

I'm trying to get data from flashscore.com for a project I'm doing as part of my self-taught Python study:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.flashscore.com/")
soup = BeautifulSoup(res.text, "lxml")
games = soup.find_all("div", {'class':['event__match', 'event__match--scheduled', 'event__match--twoLine']})
print(games)
When I run this, it returns an empty list [].
Why?
When an empty list is returned in find_all(), that means the elements that you specified could not be found.
Make sure that what you are trying to scrape isn't added dynamically (for example, inside an iframe in some cases).
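As a quick sanity check (not from the original answer), you can look for the expected class name in the raw HTML that requests receives; if it is missing there, the content is being injected later by JavaScript:

import requests

res = requests.get("https://www.flashscore.com/", headers={"User-Agent": "Mozilla/5.0"})
# Likely prints False here, because the match rows are added by JavaScript after the page loads.
print("event__match" in res.text)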
The failure is due to the fact that the website uses a set of Ajax technologies, specifically content added dynamically by client-side JavaScript. Client-side scripting code is executed in the browser itself, not at the web server level, and its success depends on the browser's ability to interpret and execute it correctly. With the BeautifulSoup library alone, the program you wrote only inspects the static HTML code. JavaScript can be executed, for example, with the help of the Selenium library: https://www.selenium.dev/. Below is the full code for the data that I suppose you are interested in:
# crawler_her_sel.py
import time

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import pandas as pd


def firefoxdriver(my_url):
    """
    Prepare the headless Firefox browser for the work.
    """
    options = Options()
    options.add_argument("--headless")
    driver = Firefox(options=options)
    return driver


def scrapingitems(driver, my_list, my_xpath):
    """
    Create appropriate lists of the data for the pandas library.
    """
    try:
        elem_to_scrap = driver.find_element(By.XPATH, my_xpath).text
        my_list.append(elem_to_scrap)
    except Exception:
        elem_to_scrap = ""
        my_list.append(elem_to_scrap)


# Variable with the URL of the website.
my_url = "https://www.flashscore.com/"

# Prepare the browser for the work.
driver = firefoxdriver(my_url)

# Load the website code as the Selenium object.
driver.get(my_url)

# Prepare the blank dictionary to fill in for pandas.
matches = {}

# Prepare the lists for the scraped data.
countries = []
leagues = []
home_teams = []
scores_home = []
scores_away = []
away_teams = []

# Wait for the page to fully render.
try:
    element = WebDriverWait(driver, 25).until(
        EC.presence_of_element_located((By.CLASS_NAME, "adsclick")))
except TimeoutException:
    print("Loading took too much time! Please rerun the script.")
except Exception as e:
    print(str(e))
else:
    # Load the website code as the BeautifulSoup object.
    pageSource = driver.page_source
    bsObj = BeautifulSoup(pageSource, "lxml")

    # Determine the number of football matches with the help of BeautifulSoup.
    games_1 = bsObj.find_all(
        "div", {"class": "event__participant event__participant--home"})
    games_2 = bsObj.find_all(
        "div", {"class": "event__participant event__participant--home fontBold"})
    games_3 = bsObj.find_all(
        "div", {"class": "event__participant event__participant--away"})
    games_4 = bsObj.find_all(
        "div", {"class": "event__participant event__participant--away fontBold"})

    # Determine the number of countries for the given football matches.
    all_countries = driver.find_elements(By.CLASS_NAME, "event__title--type")

    # Determine the number of loop iterations.
    sum_to_iterate = (len(all_countries) + len(games_1) + len(games_2)
                      + len(games_3) + len(games_4))

    for ind in range(1, (sum_to_iterate + 1)):
        # Scrape the country names.
        xpath_countries = ('//div[@class="sportName soccer"]/div[' + str(ind)
                           + ']/div[2]/div/span[1]')
        scrapingitems(driver, countries, xpath_countries)

        # Scrape the league names.
        xpath_leagues = ('//div[@class="sportName soccer"]/div[' + str(ind)
                         + ']/div[2]/div/span[2]')
        scrapingitems(driver, leagues, xpath_leagues)

        # Scrape the home team names.
        xpath_home_teams = ('//div[@class="sportName soccer"]/div[' + str(ind)
                            + ']/div[3]')
        scrapingitems(driver, home_teams, xpath_home_teams)

        # Scrape the home team scores.
        xpath_scores_home = ('//div[@class="sportName soccer"]/div[' + str(ind)
                             + ']/div[5]')
        scrapingitems(driver, scores_home, xpath_scores_home)

        # Scrape the away team scores.
        xpath_scores_away = ('//div[@class="sportName soccer"]/div[' + str(ind)
                             + ']/div[6]')
        scrapingitems(driver, scores_away, xpath_scores_away)

        # Scrape the away team names.
        xpath_away_teams = ('//div[@class="sportName soccer"]/div[' + str(ind)
                            + ']/div[4]')
        scrapingitems(driver, away_teams, xpath_away_teams)

    # Add the lists with the scraped data to the dictionary in the correct order.
    matches["Countries"] = countries
    matches["Leagues"] = leagues
    matches["Home_teams"] = home_teams
    matches["Scores_for_home_teams"] = scores_home
    matches["Scores_for_away_teams"] = scores_away
    matches["Away_teams"] = away_teams

    # Create the data frame with the help of the pandas package.
    df_res = pd.DataFrame(matches)

    # Save the properly formatted data to a csv file. The date and the time
    # of the scraping are encoded in the file name.
    name_of_file = lambda: "flashscore{}.csv".format(time.strftime(
        "%Y%m%d-%H.%M.%S"))
    df_res.to_csv(name_of_file(), encoding="utf-8")
finally:
    driver.quit()
The result of the script is a CSV file which, when loaded into Excel, gives a table of countries, leagues, teams and scores.
It is also worth mentioning that you need to download the appropriate driver for your browser: https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/.
In addition, here are links to two other interesting scripts that deal with scraping the https://www.flashscore.com/ portal: How can i scrape a football results from flashscore using python and Scraping stats with Selenium.
I would also like to raise the legal issues here. The robots.txt file at https://www.flashscore.com/robots.txt indicates that you can scrape the home page. But the "General Terms of Use" say, quoting: "Without prior authorisation in writing from the Provider, Visitors are not authorised to copy, modify, tamper with, distribute, transmit, display, reproduce, transfer, upload, download or otherwise use or alter any of the content of the App."
This unfortunately introduces ambiguity, and ultimately it is not clear what the owner really wants. I therefore recommend that you do not use this script constantly, and certainly not for commercial purposes, and I ask the same of other visitors to this website. I wrote this script purely to learn scraping and do not intend to use it at all.
The finished script can be downloaded from my GitHub.

Python multithreading crawler for unknown size

I have a list of pages to crawl using Selenium.
Let's say the website is example.com/1...N (N is unknown up front).
from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

def crawl_example(page):
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get(f"example.com/{page}")
    # Do some processing
    result = "Fetched data"
    return result

N_THREAD = 10
MAX_SIZE = 100

with ThreadPoolExecutor(N_THREAD) as ex:
    futures = [ex.submit(crawl_example, page) for page in range(MAX_SIZE)]
Setting MAX_SIZE triggers unnecessary requests for pages after N, so I wanted to find a better solution.
I could only think of creating a global variable (is_done) or adding another parameter to the function.
What would be the most Pythonic approach to solve the above issue?
Initialize a last_page variable to infinity (preferably as a class or module-level variable), then update it and crawl with the following logic.
Since two threads can update last_page at the same time, guard the update with a lock so that a higher page number cannot overwrite a last_page that was already set by a lower page:
from threading import Lock

last_page = float('inf')    # shared upper bound; set once an empty page is found
last_page_lock = Lock()

def crawl_page(page):
    global last_page
    if page > last_page:
        return None              # a lower page already turned out to be empty, skip
    if page_empty():             # the question's hypothetical check for an empty page
        with last_page_lock:
            # keep the lowest empty page seen so far
            last_page = min(last_page, page)
        return None
    ...
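For completeness, here is one possible way to drive crawl_page from a thread pool, submitting pages in small batches so that no further batches are scheduled once last_page is known. This is only a sketch; it assumes crawl_page returns the scraped data, or None when it skips or hits an empty page, and the batch size and worker count are illustrative.

from concurrent.futures import ThreadPoolExecutor

BATCH = 20
results, start = [], 1
with ThreadPoolExecutor(max_workers=10) as ex:
    while start <= last_page:
        batch = range(start, start + BATCH)
        # ex.map keeps the page order and blocks until the whole batch is done
        results.extend(r for r in ex.map(crawl_page, batch) if r is not None)
        start += BATCH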

Skip selenium Webdriver.get() call inside for loop if it takes too long

Hey guys, I'm having trouble understanding how I can add exceptions to a for-in-range loop. Right now I'm pulling URLs from an Excel sheet and scraping the information while moving through the pages until I reach page 200. The thing is that not all URLs have pages up to 200, so it takes a lot of time until the loop ends and the program can continue with another URL. Is there a way to implement exceptions into the code here?
from selenium import webdriver
import pandas as pd
import time

driver = webdriver.Chrome("C:/Users/Acer/Desktop/chromedriver.exe")
companies = []

df = pd.read_excel('C:/Users/Acer/Desktop/urls.xlsx')
for index, row in df.iterrows():
    base_url = row['urls']
    for i in range(1, 201, 1):
        url = "{base_url}?curpage={i}".format(base_url=base_url, i=i)
        driver.get(url)
        time.sleep(2)
        name = driver.find_elements_by_xpath('//a/div/div/p')
        for names in name:
            print(names.text, url)
            companies.append([names.text, url])
You can set a max timeout on the WebDriver and then watch for TimeoutException in the loop:
from selenium.common.exceptions import TimeoutException

MAX_TIMEOUT_SECONDS = 5

driver = webdriver.Chrome("C:/Users/Acer/Desktop/chromedriver.exe")
driver.set_page_load_timeout(MAX_TIMEOUT_SECONDS)

for i in range(1, 201):
    try:
        url = "{base_url}?curpage={i}".format(base_url=base_url, i=i)
        driver.get(url)
    except TimeoutException:
        # skip this page if it takes more than 5 seconds to load
        continue
    ...  # process the scraped URL as usual
If a timeout occurs, the current iteration is skipped via continue.

Is time.sleep() enough to safely create a delay for a simple webscraper?

I'm using a web scraping script, without a headless browser, to scrape about 500 inputs from Transfermarkt for a personal project.
According to best practices, I need to randomize my scraping pattern, use delays, and deal with errors/loading delays in order to scrape Transfermarkt successfully without raising any flags.
I understand how Selenium and ChromeDriver can help with all of these in order to scrape more safely, but I've used requests and BeautifulSoup to create a much simpler web scraper:
import requests, re, ast
from bs4 import BeautifulSoup
import pandas as pd

i = 1
url_list = []
while True:
    page = requests.get('https://www.transfermarkt.us/spieler-statistik/wertvollstespieler/marktwertetop?page=' + str(i), headers={'User-Agent': 'Mozilla/5.0'}).text
    parsed_page = BeautifulSoup(page, 'lxml')
    all_links = []
    for link in parsed_page.find_all('a', href=True):
        link = str(link['href'])
        all_links.append(link)
    r = re.compile('.*profil/spieler.*')
    player_links = list(filter(r.match, all_links))
    for plink in range(0, 25):
        url_list.append('https://www.transfermarkt.us' + player_links[plink])
    i += 1
    if i > 20:
        break

final_url_list = []
for i in url_list:
    int_page = requests.get(i, headers={'User-Agent': 'Mozilla/5.0'}).text
    parsed_int_page = BeautifulSoup(int_page, 'lxml')
    graph_container = parsed_int_page.find('div', class_='large-7 columns small-12 marktwertentwicklung-graph')
    graph_a = graph_container.find('a')
    graph_link = graph_a.get('href')
    final_url_list.append('https://www.transfermarkt.us' + graph_link)

for url in final_url_list:
    r = requests.get('https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290', headers={'User-Agent': 'Mozilla/5.0'})
    p = re.compile(r"'data':(.*)}\],")
    s = p.findall(r.text)[0]
    s = s.encode().decode('unicode_escape')
    data = ast.literal_eval(s)
    # rest of the code to write scraped info below this
I was wondering whether this is generally considered a safe enough way to scrape a website like Transfermarkt if I add time.sleep() from the time library, as detailed here, to create a delay long enough for the page to load (say 10 seconds), so that I can scrape the 500 inputs successfully without raising any flags.
I would also forgo randomized clicks (which I think can only be done with Selenium/ChromeDriver) to mimic human behaviour, and was wondering whether that too would be OK to leave out and still scrape safely.
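For illustration only (this is not from the original post), a randomized time.sleep() between requests would slot into the loops above like this; the 5-10 second range is just an example around the 10-second figure mentioned:

import time
import random

for url in final_url_list:
    # ... the existing request and parsing code for this url ...
    time.sleep(random.uniform(5, 10))   # pause a random 5-10 s between requests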

Python and Selenium: I am automating web scraping among pages. How can I loop by Next button?

I have already written several lines of code to pull URLs from this website:
http://www.worldhospitaldirectory.com/United%20States/hospitals
The code is below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv

driver = webdriver.Firefox()
driver.get('http://www.worldhospitaldirectory.com/United%20States/hospitals')

url = []
pagenbr = 1

while pagenbr <= 115:
    current = driver.current_url
    driver.get(current)
    lks = driver.find_elements_by_xpath('//*[@href]')
    for ii in lks:
        link = ii.get_attribute('href')
        if '/info' in link:
            url.append(link)
    print('page ' + str(pagenbr) + ' is done.')
    if pagenbr <= 114:
        elm = driver.find_element_by_link_text('Next')
        driver.implicitly_wait(10)
        elm.click()
        time.sleep(2)
    pagenbr += 1

ls = list(set(url))
with open('US_GeneralHospital.csv', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for u in ls:
        wr.writerow([u])
It works very well for pulling each individual link from this website.
The problem is that I have to change the number of pages to loop over manually every time.
I want to upgrade this code so that it works out how many iterations it needs by itself, rather than by manual input.
Thank you very much.
It is a bad idea to hardcode the number of pages in your script. Try to simply click the "Next" button while it is enabled:
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        # do whatever you need to do on the page
        driver.find_element_by_xpath('//li[not(@class="disabled")]/span[text()="Next"]').click()
    except NoSuchElementException:
        break
This should allow you to keep scraping pages until the last page is reached.
Also note that the lines current = driver.current_url and driver.get(current) make no sense at all, so you can skip them.
