I am creating a web crawler using Scrapy and Selenium.
The code looks like this:
class MySpider(scrapy.Spider):
    urls = [...]  # a very long list of URLs

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        item = Item()
        item['field1'] = response.xpath('some xpath').extract()[0]
        yield item

        sub_item_url = response.xpath('some other xpath').extract()[0]
        # Sub items are JavaScript-generated, so a web driver is needed
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        driver = webdriver.Chrome(chrome_options=options)
        driver.set_window_size(1920, 1080)
        sub_item_generator = self.get_sub_item_generator(driver, sub_item_url)
        while True:
            try:
                yield next(sub_item_generator)
            except StopIteration:
                break
        driver.close()

    def get_sub_item_generator(self, driver, url):
        # Crawling with the web driver goes here, which takes a long time to finish
        yield sub_item
The problem is that the crawler runs for a while and then crashes because it runs out of memory. Scrapy keeps scheduling new URLs from the list, so too many web driver processes end up running.
Is there any way to tell the Scrapy scheduler not to schedule a new URL while a certain number of web driver processes are already running?
You could try setting CONCURRENT_REQUESTS to something lower than the default of 16 (as shown here):
class MySpider(scrapy.Spider):
    # urls = [...]  # a very long list of URLs
    custom_settings = {
        'CONCURRENT_REQUESTS': 5  # the default of 16 seemed like it was too much?
    }
Try using driver.quit() instead of driver.close()
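It is also worth making sure the browser is always released, even when an item fails to parse. Here is a rough sketch of the parse_item from the question (same placeholder XPaths and imports), wrapping the sub-item crawl in try/finally so driver.quit() always runs:
def parse_item(self, response):
    item = Item()
    item['field1'] = response.xpath('some xpath').extract()[0]
    yield item

    sub_item_url = response.xpath('some other xpath').extract()[0]
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(chrome_options=options)
    driver.set_window_size(1920, 1080)
    try:
        # Even if a sub item blows up mid-crawl, the finally block still runs.
        for sub_item in self.get_sub_item_generator(driver, sub_item_url):
            yield sub_item
    finally:
        driver.quit()  # quit() ends the browser *and* the chromedriver process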
I had the same problem despite using driver.close(), so I added this to kill all Firefox instances before the script starts:
from subprocess import call
call(["killall", "firefox"])
Related
I have a problem with running Scrapy. It seems like Scrapy is skipping the last pages. For example, I've set 20 pages to scrape, but Scrapy misses the last 10 or 7 of them. It has no problem when I set a single page, e.g. for page in range(6, 7). The terminal shows that it is scraping all pages from 1 to 100, but the output in my database ends at random pages. Any ideas why that is happening?
Maybe there is a way to run Scrapy synchronously, scraping every item on the first page -> second page -> third page, and so on.
class SomeSpider(scrapy.Spider):
    name = 'default'
    urls = [f'https://www.somewebsite.com/pl/c/cat?page={page}' for page in range(1, 101)]

    service = Service(ChromeDriverManager().install())
    options = Options()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument("--headless")
    options.add_argument("--allow-running-insecure-content")
    options.add_argument("--enable-crash-reporter")
    options.add_argument("--disable-popup-blocking")
    options.add_argument("--disable-default-apps")
    options.add_argument("--incognito")
    driver = webdriver.Chrome(service=service, options=options)

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        for videos in response.css('div.card-img'):
            item = WebsitescrapperItem()

            link = f'https://www.somewebsite.com{videos.css("a.item-link").attrib["href"]}'
            SomeSpider.driver.get(link)
            domain_name = SomeSpider.driver.current_url
            SomeSpider.driver.back()

            item['name'] = videos.css('span.title::text').get().strip()
            item['duration'] = videos.css('span.duration::text').get().strip()
            item['image'] = videos.css('img.thumb::attr(src)').get()
            item['url'] = domain_name
            item['hd'] = videos.css('span.hd-icon').get()
            yield item
Try issuing the paginated requests from the parse callback, in this form:
def parse(self, response):
    # do some stuff
    for page in range(self.total_pages):
        yield Request(f'https://example.com/search?page={page}',
                      callback=self.parse)
Also, if you yield multiple requests from start_requests, or have multiple URLs in start_urls, those will be handled asynchronously, according to your concurrency settings (Scrapy's default is 8 concurrent requests per domain, 16 total). Make sure you set them accordingly in settings.py.
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
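If you would rather not touch the project-wide settings.py, the same throttling can be done per spider via custom_settings. A minimal sketch (the exact numbers are only examples, not values from the original post):
class SomeSpider(scrapy.Spider):
    name = 'default'
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,             # only one request in flight at a time
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # a single connection to the target site
        'DOWNLOAD_DELAY': 1,                  # wait one second between requests
    }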
If you want to run it synchronously, you would do it like so:
def parse(self, response, current_page):
    url = 'https://www.somewebsite.com/pl/c/cat?page={}'
    # do some stuff
    self.current_page += 1
    yield Request(url.format(self.current_page), callback=self.parse,
                  cb_kwargs={'current_page': self.current_page})
TL;DR
In Scrapy, I want each Request to wait until all spider parse callbacks have finished, so the whole process is sequential, like this:
Request1 -> Crawl1 -> Request2 -> Crawl2 ...
But what is happening now:
Request1 -> Request2 -> Request3 ...
Crawl1
Crawl2
Crawl3 ...
Long version
I am new to scrapy + selenium web scraping.
I am trying to scrape a website whose contents are heavily updated with JavaScript. First I open the website with Selenium and log in. After that, I use a downloader middleware that handles the requests with Selenium and returns the responses. Below is the middleware's process_request implementation:
class XYZDownloaderMiddleware:
    '''Other functions are as is. I just changed this one'''

    def process_request(self, request, spider):
        driver = request.meta['driver']

        # We are opening a new link
        if request.meta['load_url']:
            driver.get(request.url)
            WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))

        # We are clicking on an element to get new data using JavaScript.
        elif request.meta['click_bet']:
            element = request.meta['click_bet']
            element.click()
            WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))

        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding="utf-8", request=request)
In settings, I have also set CONCURRENT_REQUESTS = 1 so that multiple driver.get() calls are not made and Selenium can peacefully load responses one by one.
Now what I see happening is that Selenium opens each URL, Scrapy lets Selenium wait for the response to finish loading, and then the middleware returns the response properly (it goes into the if request.meta['load_url'] block).
But after I get the response, I want to use the Selenium driver (in the parse(response) functions) to click on each of the elements by yielding a Request, and to have the middleware return the updated HTML (the elif request.meta['click_bet'] block).
The Spider is minimally like this:
class XYZSpider(scrapy.Spider):
    def start_requests(self):
        start_urls = [
            'https://www.example.com/a',
            'https://www.example.com/b'
        ]
        self.driver = self.getSeleniumDriver()

        for url in start_urls:
            request = scrapy.Request(url=url, callback=self.parse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '/div/bla/bla'
            request.meta['click_bet'] = None
            yield request

    def parse(self, response):
        urls = response.xpath('//a/@href').getall()

        for url in urls:
            request = scrapy.Request(url=url, callback=self.rightSectionParse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '//div[contains(@class, "rightSection")]'
            request.meta['click_bet'] = None
            yield request

    def rightSectionParse(self, response):
        ...
So what is happening is that Scrapy is not waiting for the spider to finish parsing. Scrapy gets the response and then, in parallel, calls the parse callback and fetches the next response. But the Selenium driver needs to be used by the parse callback before the next request is processed.
I want the requests to wait until the parse callback is finished.
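One pattern that produces that ordering (the same idea as the chained-request snippet earlier on this page) is to yield only the first request from start_requests and let each parse callback yield the next one once it is done. A rough sketch, reusing the meta keys from the question; page_urls, _make_request, and the index bookkeeping are made-up helpers, not part of the original code:
class XYZSpider(scrapy.Spider):
    page_urls = ['https://www.example.com/a', 'https://www.example.com/b']

    def start_requests(self):
        self.driver = self.getSeleniumDriver()
        # Only the first URL is scheduled; each callback schedules the next one.
        yield self._make_request(self.page_urls[0], index=0)

    def _make_request(self, url, index):
        request = scrapy.Request(url=url, callback=self.parse, cb_kwargs={'index': index})
        request.meta['driver'] = self.driver
        request.meta['load_url'] = True
        request.meta['wait_for_xpath'] = '/div/bla/bla'
        request.meta['click_bet'] = None
        return request

    def parse(self, response, index):
        # ... do all the Selenium clicking/parsing for this page first ...
        if index + 1 < len(self.page_urls):
            yield self._make_request(self.page_urls[index + 1], index + 1)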
I'm facing a problem where, when I click on a button, JavaScript handles the action and then redirects to a new page in a new window (it's similar to clicking an <a> with target="_blank"). In Scrapy/Splash I don't know how to get the content from the new page (I mean, I don't know how to control that new page).
Can anyone help?
script = """
function main(splash)
assert(splash:go(splash.args.url))
splash:wait(0.5)
local element = splash:select('div.result-content-columns div.result-title')
local bounds = element:bounds()
element:mouse_click{x=bounds.width/2, y=bounds.height/2}
return splash:html()
end
"""
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': self.script})
Issue:
The problem is that you can't scrape HTML that is outside of your selection scope. When a new link is clicked, if there is an iframe involved, it is rarely brought into scope for scraping.
Solution:
Choose a method of selecting the new iframe, and then proceed to parse the new html.
The Scrapy-Splash method
(This is an adaptation of Mikhail Korobov's solution from this answer)
If you are able to get the src link of the new page that pops up, that may be the most reliable approach; however, you can also try selecting the iframe this way:
# ...
yield SplashRequest(url, self.parse_result, endpoint='render.json',
                    args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }
The Selenium method
(requires pip install selenium bs4, and possibly a ChromeDriver download from here for your OS: Selenium Chromedrivers) Supports JavaScript parsing! Woohoo!
With the following code, this will switch scopes to the new frame:
# Goes at the top
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

# Your path depends on where you downloaded/located your chromedriver.exe
CHROME_PATH = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
CHROMEDRIVER_PATH = 'chromedriver.exe'
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
chrome_options.add_argument("--log-level=3")
chrome_options.add_argument("--headless")  # Speeds things up if you don't need a GUI
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH

browser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, chrome_options=chrome_options)
url = "example_js_site.com"  # Your site goes here
browser.get(url)
time.sleep(3)  # An unsophisticated way to wait for the new page to load.
browser.switch_to.frame(0)
soup = BeautifulSoup(browser.page_source.encode('utf-8').strip(), 'lxml')

# This will return any content found in tags called '<table>'
table = soup.find_all('table')
My favorite of the two options is Selenium, but try the first solution if you are more comfortable with it!
I am writing a Scrapy spider with Selenium to crawl a dynamic web page.
I am pretty sure the regular expression works fine, but the page_link from the LinkExtractor is getting nothing and the program terminates before the parse_item function gets called. I can't figure out what is wrong.
class OikotieSpider(CrawlSpider):
    name = 'oikotie'
    allowed_domains = [my_domain]
    start_urls = ['https://asunnot.oikotie.fi/myytavat-uudisasunnot?cardType=100&locations=%5B%22helsinki%22%5D&newDevelopment=1&buildingType%5B%5D=1&buildingType%5B%5D=256&pagination=1']

    def __init__(self):
        CrawlSpider.__init__(self)
        chrome_driver = 'mydriver_location'
        os.environ["webdriver.chrome.driver"] = chrome_driver
        chromeOptions = webdriver.ChromeOptions()
        prefs = {"profile.managed_default_content_settings.images": 2}
        chromeOptions.add_experimental_option("prefs", prefs)
        # driver instance and call
        self.driver = webdriver.Chrome(executable_path=chrome_driver, chrome_options=chromeOptions)
        self.driver.get('my_url')
        self.selector = Selector(text=self.driver.page_source)
        self.driver.close()
        self.driver.quit()

    page_link = LinkExtractor(allow=('myytavat-asunnot\/helsinki\/\d+',))
    rules = (
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(page_link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        print("parse_item is called!!")
        self.driver.get(response.url)
        self.driver.implicitly_wait(30)
        return ....
I think you should use a downloader middleware to fetch the web page. In the downloader middleware, you can initialize the browser and use it to get the page. Look at this:
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=DownloadMiddleware
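For illustration, a rough sketch of such a downloader middleware, assuming the spider keeps its Selenium driver on self.driver instead of quitting it in __init__ (the class name and the wait are just examples, not a required API):
from scrapy.http import HtmlResponse

class SeleniumMiddleware:
    """Fetch each request with the spider's Selenium driver instead of Scrapy's downloader."""

    def process_request(self, request, spider):
        spider.driver.get(request.url)
        spider.driver.implicitly_wait(30)   # crude wait for the JavaScript-rendered links
        body = spider.driver.page_source
        # Returning a Response from process_request short-circuits Scrapy's own download.
        return HtmlResponse(spider.driver.current_url, body=body,
                            encoding='utf-8', request=request)
It would then be enabled under DOWNLOADER_MIDDLEWARES in settings.py.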
It looks like your LinkExtractor allow argument is not a regex against the absolute URL; it needs to be (see https://doc.scrapy.org/en/latest/topics/link-extractors.html).
Now, you could technically get there by just sticking .* in front of the current regex... but that would be terrible :). Just make it absolute.
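For example, an "absolute" version of the allow pattern from the question might look like this (the exact domain and path are an assumption based on the start URL):
# Hypothetical: match the full item URL instead of only a path fragment.
page_link = LinkExtractor(
    allow=(r'https://asunnot\.oikotie\.fi/myytavat-asunnot/helsinki/\d+',)
)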
I'm scraping a website that depends heavily on JavaScript. The main page from which I need to extract the URLs to be parsed depends on JavaScript, so I have to modify start_requests.
I'm looking for a way to connect start_requests with the LinkExtractor and with process_match.
class MatchSpider(CrawlSpider):
    name = "match"
    allowed_domains = ["whoscored"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class, "match-report")]//@href'), callback='parse_item'),
    )

    def start_requests(self):
        url = 'https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/Fixtures/England-Premier-League-2016-2017'
        browser = Browser(browser='Chrome')
        browser.get(url)
        # should return a request with the html body from the Selenium driver so that the LinkExtractor rule can be applied

    def process_match(self, response):
        match_item = MatchItem()
        regex = re.compile("matchCentreData = \{.*?\};", re.S)
        match = re.search(regex, response.text).group()
        match = match.replace('matchCentreData =', '').replace(';', '')
        match_item['match'] = json.loads(match)
        match_item['url'] = response.url
        match_item['project'] = self.settings.get('BOT_NAME')
        match_item['spider'] = self.name
        match_item['server'] = socket.gethostname()
        match_item['date'] = datetime.datetime.now()
        yield match_item
A wrapper I'm using around Selenium:
class Browser:
    """
    selenium on steroids. allows you to create different types of browsers plus
    adds methods for safer calls
    """

    def __init__(self, browser='Firefox'):
        """
        type: silent or not
        browser: chrome or firefox
        """
        self.browser = browser
        self._start()

    def _start(self):
        '''
        starts browser
        '''
        if self.browser == 'Chrome':
            chrome_options = webdriver.ChromeOptions()
            prefs = {"profile.managed_default_content_settings.images": 2}
            chrome_options.add_extension('./libcommon/adblockpluschrome-1.10.0.1526.crx')
            chrome_options.add_experimental_option("prefs", prefs)
            chrome_options.add_argument("user-agent={0}".format(random.choice(USER_AGENTS)))
            self.driver_ = webdriver.Chrome(executable_path='./libcommon/chromedriver', chrome_options=chrome_options)
        elif self.browser == 'Firefox':
            profile = webdriver.FirefoxProfile()
            profile.set_preference("general.useragent.override", random.choice(USER_AGENTS))
            profile.add_extension('./libcommon/adblock_plus-2.7.1-sm+tb+an+fx.xpi')
            profile.set_preference('permissions.default.image', 2)
            profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
            profile.set_preference("webdriver.load.strategy", "unstable")
            self.driver_ = webdriver.Firefox(profile)
        elif self.browser == 'PhantomJS':
            self.driver_ = webdriver.PhantomJS()
            self.driver_.set_window_size(1120, 550)

    def close(self):
        self.driver_.close()

    def return_when(self, condition, locator):
        """
        returns browser execution when condition is met
        """
        for _ in range(5):
            with suppress(Exception):
                wait = WebDriverWait(self.driver_, timeout=100, poll_frequency=0.1)
                wait.until(condition(locator))
                self.driver_.execute_script("return window.stop")
                return True
        return False

    def __getattr__(self, name):
        """
        ruby-like method missing: derive methods not implemented to attribute that
        holds selenium browser
        """
        def _missing(*args, **kwargs):
            return getattr(self.driver_, name)(*args, **kwargs)
        return _missing
There are two problems I see after looking into this. Forgive any ignorance on my part; it's been a while since I was last in the Python/Scrapy world.
First: How do we get the HTML from Selenium?
According to the Selenium docs, the driver should have a page_source attribute containing the contents of the page.
browser = Browser(browser='Chrome')
browser.get(url)
html = browser.driver_.page_source
browser.close()
You may want to make this a function in your browser class to avoid accessing browser.driver_ from MatchSpider.
# class Browser
def page_source(self):
    return self.driver_.page_source
# end class

browser.get(url)
html = browser.page_source()
Second: How do we override Scrapy's internal web requests?
It looks like Scrapy tries to decouple the behind-the-scenes web requests from the what-am-I-trying-to-parse functionality of each spider you write. start_requests() should "return an iterable with the first Requests to crawl", and make_requests_from_url(url) (which is called if you don't override start_requests()) takes "a URL and returns a Request object". When internally processing a spider, Scrapy creates a plethora of Request objects that are executed asynchronously, and each subsequent Response is sent to parse(response)... the spider never actually does the processing from Request to Response itself.
Long story short, this means you would need to create middleware for the Scrapy downloader to use Selenium. Then you can remove your overridden start_requests() method and add a start_urls attribute. Specifically, your SeleniumDownloaderMiddleware should override the process_request(request, spider) method to use the above Selenium code.
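For illustration, a rough sketch of what that SeleniumDownloaderMiddleware could look like, reusing the Browser wrapper from the question (the page_source access through driver_ and the middleware wiring are assumptions, not code from the original post):
from scrapy.http import HtmlResponse

class SeleniumDownloaderMiddleware:
    """Let Selenium fetch every request and hand the rendered HTML back to Scrapy."""

    def __init__(self):
        self.browser = Browser(browser='Chrome')

    def process_request(self, request, spider):
        self.browser.get(request.url)
        html = self.browser.driver_.page_source
        # Returning a Response here stops Scrapy from downloading the URL itself,
        # so the LinkExtractor rules run against the Selenium-rendered body.
        return HtmlResponse(request.url, body=html, encoding='utf-8', request=request)
You would then register it under DOWNLOADER_MIDDLEWARES in settings.py, drop the overridden start_requests(), and list the fixtures URL in start_urls.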