I am building a CrawlSpider to scrape statutory law data from the following website (https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm). I am aiming to extract the statute text, which should be matched by the XPath //div[@class = 'first']/p/text().
All of my Scrapy requests yield incomplete HTML responses, so when I run the relevant XPath queries I get an empty list. However, when I use the requests library, the HTML downloads correctly.
Using an online XPath tester, I've verified that my XPath queries should produce the desired content. Using scrapy shell, I've viewed the response object from Scrapy in my browser, and it looks just like it does when I'm browsing natively. I've tried enabling middleware for both BeautifulSoup and Selenium, but neither has appeared to work.
Here's my crawl spider:
class AZspider(CrawlSpider):
    name = "arizona"
    start_urls = [
        "https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm",
    ]
    rule = (Rule(LinkExtractor(restrict_xpaths="//div[@class = 'article']"), callback="parse_stats_az", follow=True),)

    def parse_stats_az(self, response):
        statutes = response.xpath("//div[@class = 'first']/p")
        yield {
            "statutes": statutes
        }
And here's the code that successfully generated the correct response object:
az_leg = requests.get("https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm")
I'm trying to capture dynamic content from a webpage. The data is rendered on the page only after the initial content has loaded.
On one webpage the response shown in the console is JSON formatted, and on the second one it is HTML.
I've tried to work with Scrapy and urllib3, but did not manage to capture anything other than the static data from the webpage itself.
Here is what I've tried with Scrapy:
class spider(scrapy.Spider):
    name = 'myspider'
    start_urls = [url]

    def parse(self, response):
        yield scrapy.FormRequest('myurl',
                                 callback=self.write_vente,
                                 headers=headers,
                                 meta={'proxy': 'https://' + str(proxy)})

    def write_vente(self, response):
        filename = 'vente.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
If you know of any solution, library, or framework, or even another programming language, that would let me do this, please let me know.
Thanks
The most commonly used tool for scraping data from dynamic websites is Selenium WebDriver. It has good Python support and can run headless, and you will find plenty of articles if you search for it in combination with scraping.
Scrapy does have some support for pre-rendering dynamic content or for using Selenium in combination with Scrapy, see: https://docs.scrapy.org/en/latest/topics/dynamic-content.html#topics-javascript-rendering
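When the browser console shows the data arriving as JSON, it is often possible to skip rendering altogether and parse that response body directly. Below is a minimal sketch using only the standard library. The payload shape (the keys "results", "title" and "price") is a made-up assumption for illustration, not the format of any real site:

```python
import json


def rows_from_json_body(body: bytes) -> list:
    """Turn the raw body of a JSON (XHR) response into flat rows.

    The keys "results", "title" and "price" are hypothetical; replace them
    with whatever the endpoint seen in the browser's network tab returns.
    """
    payload = json.loads(body.decode("utf-8"))
    rows = []
    for entry in payload.get("results", []):
        # Missing keys become None instead of raising KeyError.
        rows.append({"title": entry.get("title"), "price": entry.get("price")})
    return rows


if __name__ == "__main__":
    sample = b'{"results": [{"title": "Flat A", "price": 100000}]}'
    print(rows_from_json_body(sample))
```

Inside a Scrapy callback, response.body could be passed to a helper like this once the request to the JSON endpoint has been reproduced.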
I’m trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I’m stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own:
1. Datacamp course on Scrapy
2. Following Pagination Links with Scrapy
3. http://scrapingauthority.com/2016/11/22/scrapy-login/
4. Scrapy - Following Links
5. Relative URL to absolute URL Scrapy
However, I can't seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I’m logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the "parse3" part, as I cannot navigate another level deeper (or maybe I can, but "open_in_browser" does not open the website any more; it only does so if I put it after parse or parse2). My understanding is that I chain multiple parse functions one after another to navigate through the website.
Datacamp says I always need to start with a "start request function", which is what I tried, but in YouTube videos and elsewhere I saw that most people start directly with parse functions. Using "inspect" on the website for parse3, I see that this time the href is a relative link, so I tried different methods (see source 5) to navigate to it, as I thought this might be the source of the error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess


class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)


process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to crawl a website completely. Let's say you want to crawl all links in the header of the website and then open each link to see the page it refers to.
To achieve this, first identify what you need to scrape, work out CSS or XPath selectors for those links, and put them in a rule. A rule without a callback only follows the links it extracts; to process the pages, assign a callback method (avoid naming it parse, since CrawlSpider uses parse internally). I am attaching a dummy example of creating rules, which you can adapt to your case:
rules = (
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item')
)
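On the relative-link concern raised in the question: resolving an href against the URL of the page it was found on is the same idea as the standard library's urljoin, which is a convenient way to check what an absolute URL should look like. A small sketch with made-up URLs (none of them come from the site in the question):

```python
from urllib.parse import urljoin

# Hypothetical URL of the page on which the link was found.
PAGE_URL = "https://example.com/section/page.html"


def absolutize(base: str, href: str) -> str:
    """Resolve an href (relative or already absolute) against its page URL."""
    return urljoin(base, href)


if __name__ == "__main__":
    for href in ("sub/item.html", "/top/item.html", "https://other.example/x"):
        print(href, "->", absolutize(PAGE_URL, href))
```

An href that is already absolute passes through unchanged, so applying the same step to every extracted link is harmless.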
I'm very new to python and am trying to build a script that will eventually extract page titles and s from specified URLs to a .csv in the format I specify.
I have managed to get the spider to work in CMD using:
response.xpath("/html/head/title/text()").get()
So the xpath must be right.
Unfortunately, when I run the file my spider is in, it never seems to work properly. I think the issue is in the final block of code, but all the guides I follow seem to use CSS. I feel more comfortable with XPath because you can simply copy and paste it from Dev Tools.
import scrapy


class PageSpider(scrapy.Spider):
    name = "dorothy"
    start_urls = [
        "http://www.example.com",
        "http://www.example.com/blog"]

    def parse(self, response):
        for title in response.xpath("/html/head/title/text()"):
            yield {
                "title": sel.xpath("Title a::text").extract_first()
            }
I expected that to give me the page titles of the above URLs.
First of all, the second URL in self.start_urls is invalid and returns a 404, so you will end up with only one title extracted.
Second, you need to read more about selectors: you extracted the title correctly in your shell test, but got confused when using it in your spider.
Scrapy will call the parse method for each URL in self.start_urls, so you don't need to iterate through titles; you only have one per page.
You can also access the <title> tag directly by using // at the beginning of your XPath expression. See this text copied from W3Schools:
/ Selects from the root node
// Selects nodes in the document from the current node that match the selection no matter where they are
Here is the fixed code:
import scrapy


class PageSpider(scrapy.Spider):
    name = "dorothy"
    start_urls = [
        "http://www.example.com"
    ]

    def parse(self, response):
        yield {
            "title": response.xpath('//title/text()').extract_first()
        }
I have the following code for a web crawler in Python 3:
import re

import requests
from bs4 import BeautifulSoup


def get_links(link):
    """Return every absolute (http/https) link found on the page at `link`."""
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for anchor in soup.find_all('a', attrs={'href': re.compile("^http")}):
            return_links.append(anchor.get('href'))
    return return_links


def recursive_search(links):
    # The list grows while it is being iterated, so the crawl keeps going.
    for i in links:
        links.extend(get_links(i))
    recursive_search(links)


recursive_search(get_links("https://www.brandonskerritt.github.io"))
The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.
I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.
This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, it might be that the links are in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.
I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.
Any help will be appreciated!
There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
To recreate the behaviour you need in Scrapy, you must:
set your start URL in your spider.
write a parse function that follows all links and recursively calls itself, adding the requested URLs to a spider variable.
An untested example (that can be, of course, refined):
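import scrapy


class AllSpider(scrapy.Spider):
    name = 'all'
    start_urls = ['https://yourgithub.com']

    def __init__(self, *args, **kwargs):
        # Let Scrapy finish its own spider setup before adding our state.
        super().__init__(*args, **kwargs)
        self.links = []

    def parse(self, response):
        # Record the visited URL, then follow every link on the page.
        self.links.append(response.url)
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)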
If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.
A simple spider that follows all links:
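from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FollowAllSpider(CrawlSpider):
    name = 'follow_all'
    start_urls = ['https://example.com']
    # No allowed_domains and an unrestricted LinkExtractor: every link is followed.
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass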
Hi all, I am trying to get all the results from the link given in the code, but my code does not return all of them. The page says it contains 2132 results, but the spider returns only 20:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import Flipkart


class Test(Spider):
    name = "flip"
    allowed_domains = ["flipkart.com"]
    start_urls = ["http://www.flipkart.com/mobiles/pr?sid=tyy,4io& otracker=ch_vn_mobile_filter_Mobile%20Brands_All"
                  ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="pu-details lastUnit"]')
        items = []
        for site in sites:
            item = Flipkart()
            item['title'] = site.xpath('div[1]/a/text()').extract()
            items.append(item)
        return items
That is because the site only shows 20 results at a time, and loading of more results is done with JavaScript when the user scrolls to the bottom of the page.
You have two options here:
Find a link on the site which shows all results on a single page (doubtful it exists, but some sites may do so when passed an optional query string, for example).
Handle JavaScript events in your spider. The default Scrapy downloader doesn't do this, so you can either analyze the JS code and send the event signals yourself programmatically or use something like Selenium w/ PhantomJS to let the browser deal with it. I'd recommend the latter since it's more fail-proof than the manual approach of interpreting the JS yourself. See this question for more information, and Google around, there's plenty of information on this topic.
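If the site does accept an offset or page parameter in the query string (the first option above, or the requests found by analyzing the JavaScript), the page URLs can be generated up front. This is a sketch only: the parameter name "start" and the example URL are assumptions, and the real name, if one exists, has to be read from the requests the site itself makes while scrolling.

```python
from math import ceil
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paged_urls(base_url: str, total: int, page_size: int, param: str = "start"):
    """Build one URL per result page by setting an offset query parameter.

    `param` is a hypothetical parameter name, not one confirmed for any site.
    """
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    urls = []
    for page in range(ceil(total / page_size)):
        # Offsets run 0, page_size, 2 * page_size, ...
        query[param] = str(page * page_size)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


if __name__ == "__main__":
    pages = paged_urls("https://example.com/list?sid=abc", total=2132, page_size=20)
    print(len(pages), pages[0], pages[-1])
```

With 2132 results at 20 per page this produces 107 URLs, which could then be used as the spider's start_urls.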