I'm currently using a combination of Scrapy and Selenium to quickly search the USPTO TradeMark database. These pages have a session token attached.
The approaches I've tried and read about aren't integrated enough: Selenium can pass found URLs to Scrapy, but Scrapy then makes a new request to that page, which invalidates the token. So I need Selenium to deliver the rendered HTML to Scrapy for parsing. Is this possible?
# -*- coding: utf-8 -*-
# from terminal run: scrapy crawl trademarks -o items.csv -t csv
import time

import scrapy
from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver


class TrademarkscrapeItem(scrapy.Item):
    category = Field()
    wordmark = Field()
    registrant = Field()
    registration_date = Field()
    description = Field()


class TradeMarkSpider(CrawlSpider):
    name = "trademarks"
    allowed_domains = ["uspto.gov"]
    start_urls = ['http://www.uspto.gov']

    def __init__(self):
        super(TradeMarkSpider, self).__init__()
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Navigate through the site to get to the page I want to scrape
        self.driver.get(response.url)
        next = self.driver.find_element_by_xpath("//*[@id='menu-84852-1']/a")
        next.click()
        time.sleep(2)  # Let any js render in page

        next = self.driver.find_element_by_xpath("//*[@id='content']/article/ul[1]/li[1]/article/h4/a")
        next.click()
        time.sleep(2)

        # How to get this next part to point at Selenium-delivered HTML?
        TradeDict = {}
        SelectXpath = Selector(SeleniumHTML).xpath  # SeleniumHTML is pseudocode
        TradeDict['description'] = SelectXpath("//*[@id='content']/article/div/p/text()").extract()

        self.driver.close()
        return TradeDict
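For what it's worth, this is possible: Scrapy's Selector accepts a text argument, so the HTML that Selenium has already rendered can be parsed directly instead of re-requesting the URL (which is what invalidates the token). A minimal sketch of that idea, standing in for the pseudocode section above; the helper name is illustrative, not from the original code:

from scrapy.selector import Selector

def parse_rendered_page(driver):
    # driver is an already-navigated Selenium webdriver instance;
    # building the Selector from driver.page_source avoids a second HTTP request,
    # so the session token stays valid
    sel = Selector(text=driver.page_source)
    return {
        'description': sel.xpath("//*[@id='content']/article/div/p/text()").extract(),
    }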
I'm not sure if this is the correct place for this question.
Here's my question:
If I run Scrapy, it can't see the email addresses in the page source; they are only shown when you hover over a user who has one.
When I run my spider, I get no emails. What am I doing wrong?
Thank You.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re


class MailsSpider(CrawlSpider):
    name = 'mails'
    allowed_domains = ['biorxiv.org']
    start_urls = ['https://www.biorxiv.org/content/10.1101/2022.02.28.482253v3']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        emails = re.findall(r'[\w\.]+@[\w\.]+', response.text)
        print(response.url)
        print(emails)
Assuming you're allowed to scrape email contacts from a public website: as said, Scrapy does not load JavaScript, so you need a full rendering browser such as Playwright to get the addresses.
I've written a quick and dirty example of how it could work; you can start from here if you wish (after you've installed Playwright, of course).
import scrapy
from scrapy.http import Request
from scrapy.http import HtmlResponse
from playwright.sync_api import sync_playwright


class PhaseASpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        yield Request('https://www.biorxiv.org/content/10.1101/2022.02.28.482253v3', callback=self.parse_page)

    def parse_page(self, response):
        with sync_playwright() as p:
            browser = p.firefox.launch(headless=False)
            self.page = browser.new_page()
            url = 'https://www.biorxiv.org/content/10.1101/2022.02.28.482253v3'
            self.page.goto(url)
            self.page.wait_for_load_state("load")
            html_page = self.page.content()
            browser.close()

        response_sel = HtmlResponse(url=url, body=html_page, encoding='utf-8')
        mails = response_sel.xpath('//a[contains(@href, "mailto")]/@href').extract()
        for mail in mails:
            print(mail.split('mailto:')[1])
I'm trying to extract comments from a news page. The crawler starts at the homepage and follows all the internal links found on the site. The comments only appear on the article pages, and they are embedded from an external website, so the comment section sits in a JavaScript iframe. Here's an example article page.
My first step was to build a crawler and a Selenium middleware. The crawler follows all the links, and those are loaded through Selenium:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawlerSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ['www.merkur.de', 'disqus.com/embed/comments/']
    start_urls = ['https://www.merkur.de/welt/novavax-corona-totimpfstoff-omikron-zulassung-impfstoff-weihnachten-wirkung-covid-lauterbach-zr-91197497.html']

    rules = [Rule(LinkExtractor(allow=r'.*'), callback='parse', follow=True)]

    def parse(self, response):
        title = response.xpath('//html/head/title/text()').extract_first()
        iframe_url = response.xpath('//iframe[@title="Disqus"]//@src').get()
        yield Request(iframe_url, callback=self.next_parse, meta={'title': title})

    def next_parse(self, response):
        title = response.meta.get('title')
        comments = response.xpath("//div[@class='post-message ']/div/p").getall()
        yield {
            'title': title,
            'comments': comments
        }
To get access to the iframe elements the Scrapy Request goes through the middleware:
import time

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware(object):

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=chrome_options)

    # Here you get the requests made to the URLs the LinkExtractor found,
    # fetch them with Selenium and return the rendered page as the response.
    def process_request(self, request, spider):
        self.driver.get(request.url)
        element = self.driver.find_element_by_xpath('//div[@id="disqus_thread"]')
        self.driver.execute_script("arguments[0].scrollIntoView();", element)
        time.sleep(1)
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
I am getting the right link from the iframe src here, but my CrawlerSpider is not yielding the iframe_url request, so I can't follow the link from the iframe. What am I doing wrong here? I really appreciate your help!
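One thing worth double-checking (the settings file isn't shown, so this is an assumption): a downloader middleware like the one above only runs if it is enabled in the project settings, e.g.:

# settings.py (sketch; the module path 'myproject.middlewares' is a placeholder
# for wherever SeleniumMiddleware actually lives in your project)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
}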
I am making a web crawler with Scrapy that will visit a list of URLs and return all cookies from these domains, including those set by third parties.
This spider follows all links on the given URLs and writes each cookie in a separate text file:
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
import requests


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = "a"
    start_urls = ['http://www.dailymail.co.uk/home/index.html']

    rules = (Rule(LinkExtractor(), callback='parse_url', follow=False), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url

        if response.headers.getlist('Set-Cookie'):
            page = response.url.split("/")[-2]
            filename = '%s.txt' % page
            with open(filename, 'wb') as f:
                for cookie in response.headers.getlist('Set-Cookie'):
                    f.write(cookie)
This results in 11 different text files each containing a cookie. The result is inconsistent with that produced by the website cookie-checker.com.
Is there a way to find all cookies set on a page using Scrapy?
Some cookies could be set client-side (via JavaScript).
I suggest you use Selenium + PhantomJS to collect all client- and server-side cookies.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://www.example.com/')
cookies = driver.get_cookies()
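A minimal sketch (not part of the original answer) that combines this with the question's goal of writing the cookies out to a text file; any Selenium-supported browser works the same way:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.dailymail.co.uk/home/index.html')

with open('cookies.txt', 'w') as f:
    for cookie in driver.get_cookies():
        # each cookie is a dict with keys such as 'name', 'value' and 'domain'
        f.write('%s=%s; domain=%s\n' % (cookie['name'], cookie['value'], cookie.get('domain', '')))

driver.quit()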
I am trying to use a Scrapy spider to crawl a website using a FormRequest to send a keyword to the search query on a city-specific page. Seems straightforward with what I read, but I'm having trouble. Fairly new to Python so sorry if there is something obvious I'm overlooking.
Here are the main 3 sites I was trying to use to help me:
Mouse vs Python [1]; Stack Overflow [2]; Scrapy.org [3]
From the source code of the specific url I am crawling: www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents
From the source of the particular page I found:
<input name="dnn$ctl01$txtSearch" type="text" maxlength="255" size="20" id="dnn_ctl01_txtSearch" class="NormalTextBox" autocomplete="off" placeholder="Search..." />
I think the name of the search box is "dnn_ctl01_txtSearch", which I would use as in the example cited as [2], and I wanted to input "toyota" as my keyword in the vehicle search.
Here is the code I have for my spider right now; I am aware I am importing excessive stuff at the beginning:
import scrapy
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents"]
    start_urls = ['http://www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents/']

    def start_requests(self):
        return [FormRequest("www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents",
                            formdata={'dnn$ctl01$txtSearch': 'toyota'},
                            callback=self.parse)]

    def parse(self, response):
        print response.status
Why is it not searching or printing any results? Is the example I'm copying from only intended for logging in to websites, not for entering text into search bars?
Thanks,
Dan the newbie Python writer
Here you go :)
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser


class Cars(scrapy.Item):
    Make = scrapy.Field()
    Model = scrapy.Field()
    Year = scrapy.Field()
    Entered_Yard = scrapy.Field()
    Section = scrapy.Field()
    Color = scrapy.Field()


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = (
        'http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US',
    )

    def parse(self, response):
        section_color = response.xpath(
            '//div[@class="pypvi_notes"]/p/text()').extract()
        info = response.xpath('//td["pypvi_make"]/text()').extract()
        for element in range(0, len(info), 4):
            item = Cars()
            item["Make"] = info[element]
            item["Model"] = info[element + 1]
            item["Year"] = info[element + 2]
            item["Entered_Yard"] = info[element + 3]
            item["Section"] = section_color.pop(
                0).replace("Section:", "").strip()
            item["Color"] = section_color.pop(0).replace("Color:", "").strip()
            yield item
        # open_in_browser(response)
        # inspect_response(response, self)
The page that you're trying to scrape is generated by an AJAX call.
Scrapy by default doesn't load any dynamically generated JavaScript content, including AJAX. Almost all sites that load data dynamically as you scroll down the page do so using AJAX.
Trapping AJAX calls is pretty simple using either Chrome Dev Tools or Firebug for Firefox.
All you have to do is observe the XHR requests in Chrome Dev Tools or Firebug. XHR is an AJAX request.
Once you find the link, you can go change its attributes.
This is the link that the XHR request in Chrome Dev Tools gave me:
http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US
I've changed the page size to 1000 up there to get 1,000 results per page. The default was 15.
There's also a page number over there which you would ideally increase till you capture all the data.
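If you prefer to walk the pages instead of relying on a huge pageSize, the usual pattern looks roughly like this; the stop condition is an assumption about what an empty page looks like, so adjust it after inspecting the real response:

import scrapy


class LkqPagesSpider(scrapy.Spider):
    name = "lkq_pages"
    base_url = ('http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/'
                'getVehicleInventory.aspx?store=224&page={page}&filter=toyota'
                '&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US')

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(page=0), callback=self.parse, meta={'page': 0})

    def parse(self, response):
        # stop when the page no longer contains any inventory notes (assumed marker)
        if not response.xpath('//div[@class="pypvi_notes"]'):
            return
        # ... extract the items here, as in the spider above ...
        next_page = response.meta['page'] + 1
        yield scrapy.Request(self.base_url.format(page=next_page), callback=self.parse,
                             meta={'page': next_page})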
The web page requires a JavaScript rendering framework to load the content in your Scrapy code.
Use Splash and refer to its documentation for usage.
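A minimal sketch of what that looks like with the scrapy-splash package, assuming a Splash instance is running locally on port 8050 and the scrapy-splash middlewares are enabled as described in its README; the URL is a placeholder:

import scrapy
from scrapy_splash import SplashRequest


class RenderedSpider(scrapy.Spider):
    name = "rendered"
    # settings.py needs SPLASH_URL = 'http://localhost:8050' plus the
    # scrapy-splash middleware entries from its README

    def start_requests(self):
        # 'wait' gives the page time to finish running its JavaScript
        yield SplashRequest('http://www.example.com/', self.parse, args={'wait': 2})

    def parse(self, response):
        # response.text now contains the JavaScript-rendered HTML
        self.log('Rendered page length: %d' % len(response.text))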
I was trying to scrape a link that uses an AJAX call for pagination.
I am trying to crawl http://www.demo.com, and in the .py file I restricted the XPath; the code is:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from sum.items import sumItem


class Sumspider1(CrawlSpider):
    name = 'sumDetailsUrls'
    allowed_domains = ['sum.com']
    start_urls = ['http://www.demo.com']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//ul[@id="pager"]/li[8]/a'), callback='parse_start_url', follow=True),
    )

    # use parse_start_url if your spider wants to crawl from the first page, so we override it
    def parse_start_url(self, response):
        print '********************************************1**********************************************'
        # //div[@class="showMoreCars hide"]/a
        # .//ul[@id="pager"]/li[8]/a/@href
        self.log('Inside - parse_item %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = sumItem()
        item['page'] = response.url

        title = hxs.xpath('.//h1[@class="page-heading"]/text()').extract()
        print '********************************************title**********************************************', title

        urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
        print '**********************************************2***url*****************************************', urls

        finalurls = []
        for url in urls:
            print '---------url-------', url
            finalurls.append(url)

        item['urls'] = finalurls
        return item
My items.py file contains
from scrapy.item import Item, Field


class sumItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page = Field()
    urls = Field()
Still, I'm not getting the exact output; I'm not able to fetch all the pages when I crawl it.
I hope the below code will help.
somespider.py
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from demo.items import DemoItem
from selenium import webdriver


def removeUnicodes(strData):
    if(strData):
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
    return strData


class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self):
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub", webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)
        item = DemoItem()
        finalurls = []

        while True:
            next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
            try:
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')
                for url in urls:
                    url = url.get_attribute("href")
                    finalurls.append(removeUnicodes(url))

                item['urls'] = finalurls
            except:
                break

        self.driver.close()
        return item
items.py
from scrapy.item import Item, Field


class DemoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()
Note:
You need to have the Selenium RC server running, because HTMLUNITWITHJS works only with Selenium RC when using Python.
Start your Selenium RC server by issuing the command:
java -jar selenium-server-standalone-2.44.0.jar
Run your spider with the command:
scrapy crawl domainurls -o someoutput.json
You can check with your browser how the requests are made.
Behind the scenes, right after you click the "show more cars" button, your browser requests JSON data to feed your next page. You can take advantage of this fact and deal with the JSON data directly, without needing a JavaScript engine such as Selenium or PhantomJS.
In your case, as a first step you should simulate a user scrolling down the page given by your start_url parameter, while profiling your network requests at the same time, to discover the endpoint the browser uses to request that JSON. To discover this endpoint there is generally an XHR (XMLHttpRequest) section in the browser's profiling tool, such as in Safari, where you can navigate through all the resources/endpoints used to request the data.
Once you discover this endpoint, it's a straightforward task: you give your spider the endpoint you just discovered as its start_url, and as you process and navigate through the JSON you can work out whether there is a next page to request.
P.S.: I checked for you; the endpoint URL is http://www.carwale.com/webapi/classified/stockfilters/?city=194&kms=0-&year=0-&budget=0-&pn=2
In this case my browser requested the second page, as you can see in the pn parameter. It is important that you set some header parameters before you send the request. I noticed that in your case the headers are:
Accept: text/plain, */*; q=0.01
Referer: http://www.carwale.com/used/cars-in-trichy/
X-Requested-With: XMLHttpRequest
sourceid: 1
User-Agent: Mozilla/5.0...
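Putting that together, a sketch of a spider that hits the JSON endpoint directly and pages through it via the pn parameter; the 'stocks' key used in the stop check is an assumption about the JSON shape, so adjust it after looking at a real response:

import json

import scrapy


class CarwaleApiSpider(scrapy.Spider):
    name = "carwale_api"
    base_url = 'http://www.carwale.com/webapi/classified/stockfilters/?city=194&kms=0-&year=0-&budget=0-&pn={pn}'
    # headers observed in the browser's XHR profile; the User-Agent can be set
    # globally via the USER_AGENT setting
    headers = {
        'Accept': 'text/plain, */*; q=0.01',
        'Referer': 'http://www.carwale.com/used/cars-in-trichy/',
        'X-Requested-With': 'XMLHttpRequest',
        'sourceid': '1',
    }

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(pn=1), headers=self.headers,
                             callback=self.parse, meta={'pn': 1})

    def parse(self, response):
        data = json.loads(response.text)
        if not data.get('stocks'):  # assumed key; adjust after inspecting the JSON
            return
        yield {'page': response.meta['pn'], 'stocks': data['stocks']}
        # keep requesting the next page until the endpoint stops returning results
        next_pn = response.meta['pn'] + 1
        yield scrapy.Request(self.base_url.format(pn=next_pn), headers=self.headers,
                             callback=self.parse, meta={'pn': next_pn})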