Scrapy spider cannot log into discord - python

i am trying to make a scraper for Discord to get all the members of a server, i am stuck at the login though, i can't find the csrf token anywhere in the source code for the page maybe that is why i'm getting this error since a few sources say that it is required but i'm not sure, here's my spider causing the problem
from scrapy.http import FormRequest
class RecruteSpider(scrapy.Spider):
name = "Recruteur"
def start_requests(self)
urls = [
'https://discord.com/login',
]
for url in urls:
yield scrapy.Request(url=url, callback=self.login)
def login(self, response):
url = 'https://discord.com/login'
formdata = {"username":"SecretUserName", "password":"SecretPassword"}
yield FormRequest.from_response(
response = response,
url = url,
formdata = formdata,
callback = self.afterLogin
)
def afterLogin(self, response):
print("Success!!")
#do stuff
Wen i run the program i get the error
ValueError: No element found in <200 https://discord.com/login>
Even though there clearly is a form element at that url.
I have also tried using the login url as response variable in the Form response but i get the error
AttributeError: 'str' object has no attribute 'encoding'
if you need any extra detail feel free to ask, any help is greatly appreciated, thanks in advance.

The error you are getting is because discord loads the /login page using javascript and therefore the response does not contain any form elements. You need to render the javascript using either scrapy-playwright(personal favourite), selenium or scrapy-splash.
Also your formdata variable contains invalid keys. See screenshot of the payload sent to the server in the browser.
Using scrapy-playwright, I was able to get to the callback function as below. Also note that the discord server may require you to solve a captcha once you send the login request which presents another challenge that you will need to solve.
discord.py
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import FormRequest
class RecruteSpider(scrapy.Spider):
name = "Recruteur"
def start_requests(self):
urls = ['https://discord.com/login']
for url in urls:
yield scrapy.Request(url=url, callback=self.login, meta={"playwright": True})
def login(self, response):
url = 'https://discord.com/login'
formdata = {"login":"SecretUserName", "password":"SecretPassword"}
yield FormRequest.from_response(
response = response,
url = url,
formdata = formdata,
callback = self.afterLogin
)
def afterLogin(self, response):
print("Success!!")
#do stuff
if __name__ == "__main__":
process = CrawlerProcess(settings={
"TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
"DOWNLOAD_HANDLERS": {
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}, })
process.crawl(RecruteSpider)
process.start()

Related

Can't get desired results using try/except clause within scrapy

I've written a script in scrapy to make proxied requests using newly generated proxies by get_proxies() method. I used requests module to fetch the proxies in order to reuse them in the script. What I'm trying to do is parse all the movie links from it's landing page and then fetch the name of each movie from it's target page. My following script can use rotation of proxies.
I know there is an easier way to change proxies, like it is described here HttpProxyMiddleware but I would still like to stick to the way I'm trying here.
website link
This is my current attempt (It keeps using new proxies to fetch a valid response but every time it gets 503 Service Unavailable):
import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess
def get_proxies():
response = requests.get("https://www.us-proxy.org/")
soup = BeautifulSoup(response.text,"lxml")
proxy = [':'.join([item.select_one("td").text,item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
return proxy
class ProxySpider(scrapy.Spider):
name = "proxiedscript"
handle_httpstatus_list = [503]
proxy_vault = get_proxies()
check_url = "https://yts.am/browse-movies"
def start_requests(self):
random.shuffle(self.proxy_vault)
proxy_url = next(cycle(self.proxy_vault))
request = scrapy.Request(self.check_url,callback=self.parse,dont_filter=True)
request.meta['https_proxy'] = f'http://{proxy_url}'
yield request
def parse(self,response):
print(response.meta)
if "DDoS protection by Cloudflare" in response.css(".attribution > a::text").get():
random.shuffle(self.proxy_vault)
proxy_url = next(cycle(self.proxy_vault))
request = scrapy.Request(self.check_url,callback=self.parse,dont_filter=True)
request.meta['https_proxy'] = f'http://{proxy_url}'
yield request
else:
for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
nlink = response.urljoin(item)
yield scrapy.Request(nlink,callback=self.parse_details)
def parse_details(self,response):
name = response.css("#movie-info h1::text").get()
yield {"Name":name}
if __name__ == "__main__":
c = CrawlerProcess({'USER_AGENT':'Mozilla/5.0'})
c.crawl(ProxySpider)
c.start()
To make sure whether the request is being proxied, I printed response.meta and could get results like this {'https_proxy': 'http://142.93.127.126:3128', 'download_timeout': 180.0, 'download_slot': 'yts.am', 'download_latency': 0.237013578414917, 'retry_times': 2, 'depth': 0}.
As I've overused the link to check how the proxied request within scrapy works, I'm getting 503 Service Unavailable error at this moment and I can see this keyword within the response DDoS protection by Cloudflare. However, I get valid response when I try with requests module applying the same logic I implemented here.
My earlier question: why I can't get the valid response as (I suppose) I'm using proxies in the right way? [solved]
Bounty Question: how can I define try/except clause within my script so that it will try with different proxies once it throws connection error with a certain proxy?
According to scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware docs
(and source)
proxy meta key is expected to use (not https_proxy)
#request.meta['https_proxy'] = f'http://{proxy_url}'
request.meta['proxy'] = f'http://{proxy_url}'
As scrapy didn't received valid meta key - your scrapy application didn't use proxies
The start_requests() function is just the entry point. On subsequent requests, you would need to resupply this metadata to the Request object.
Also, errors can occur on two levels: proxy and target server
We need to handle bad response codes from both the proxy and the target server. Proxy errors are returned by the middelware to the errback function. The target server response can be handled during parsing from the response.status
import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess
def get_proxies():
response = requests.get("https://www.us-proxy.org/")
soup = BeautifulSoup(response.text, "lxml")
proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in
soup.select("table.table tbody tr") if "yes" in item.text]
# proxy = ['https://52.0.0.1:8090', 'https://52.0.0.2:8090']
return proxy
def get_random_proxy(proxy_vault):
random.shuffle(proxy_vault)
proxy_url = next(cycle(proxy_vault))
return proxy_url
class ProxySpider(scrapy.Spider):
name = "proxiedscript"
handle_httpstatus_list = [503, 502, 401, 403]
check_url = "https://yts.am/browse-movies"
proxy_vault = get_proxies()
def handle_middleware_errors(self, *args, **kwargs):
# implement middleware error handling here
print('Middleware Error')
# retry request with different proxy
yield self.make_request(url=args[0].request._url, callback=args[0].request._meta['callback'])
def start_requests(self):
yield self.make_request(url=self.check_url, callback=self.parse)
def make_request(self, url, callback, dont_filter=True):
return scrapy.Request(url,
meta={'proxy': f'https://{get_random_proxy(self.proxy_vault)}', 'callback': callback},
callback=callback,
dont_filter=dont_filter,
errback=self.handle_middleware_errors)
def parse(self, response):
print(response.meta)
try:
if response.status != 200:
# implement server status code handling here - this loops forever
print(f'Status code: {response.status}')
raise
else:
for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
nlink = response.urljoin(item)
yield self.make_request(url=nlink, callback=self.parse_details)
except:
# if anything goes wrong fetching the lister page, try again
yield self.make_request(url=self.check_url, callback=self.parse)
def parse_details(self, response):
print(response.meta)
try:
if response.status != 200:
# implement server status code handeling here - this loops forever
print(f'Status code: {response.status}')
raise
name = response.css("#movie-info h1::text").get()
yield {"Name": name}
except:
# if anything goes wrong fetching the detail page, try again
yield self.make_request(url=response.request._url, callback=self.parse_details)
if __name__ == "__main__":
c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
c.crawl(ProxySpider)
c.start()

Why is the print() function not echoing to console?

I haven't written any Python code in over 10 years. So I'm trying to use Scrapy to assemble some information off of a website:
import scrapy
class TutorialSpider(scrapy.Spider):
name = "tutorial"
def start_requests(self):
urls = [
'https://example.com/page/1',
'https://example.com/page/2',
]
for url in urls:
print(f'{self.name} spider')
print(f'url is {url}')
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
print(response.url)
self.log(response.url)
sys.stdout.write('hello')
I'm trying to parse the url in the parse method. What I can't figure out is, why will those simple print statements not print anything to stdout? They are silent. There doesn't seem to be a way to echo anything there back to the console, and I am very curious about what I am missing here.
Both requests you're doing in your spider receive 404 Not found responses. By default, Scrapy ignores responses with such a status and your callback doesn't get called.
In order to have your self.parse callback called for such responses, you have to add the 404 status code to the list of handled status codes using the handle_httpstatus_list meta key (more info here).
You could change your start_requests method so that the requests instruct Scrapy to handle even 404 responses:
import scrapy
class TutorialSpider(scrapy.Spider):
name = "tutorial"
def start_requests(self):
urls = [
'https://example.com/page/1',
'https://example.com/page/2',
]
for url in urls:
print(f'{self.name} spider')
print(f'url is {url}')
yield scrapy.Request(
url=url,
callback=self.parse,
meta={'handle_httpstatus_list': [404]},
)
def parse(self, response):
print(response.url)
self.log(response.url)
sys.stdout.write('hello')

Can't get all http requests in Scrapy

I'm trying to access a site and check if no links redirecting to a page within the site that are down. As there is no sitemap available, I'm using Scrapy to crawl the site and get all links on every page, but I can't get it to output a file with all the links found and their status code. The site I'm using to test the code is quotes.toscrape.com and my code is:
from scrapy.spiders import Spider
from mytest.items import MytestItem
from scrapy.http
import Request
import re
class MySpider(Spider):
name = "sample"
allowed_domains = ["quotes.toscrape.com"]
start_urls = ["http://quotes.toscrape.com"]
def parse(self, response):
links = response.xpath('//a/#href').extract()
\# We stored already crawled links in this list
crawledLinks = []
for link in links:
\# If it is a proper link and is not checked yet, yield it to the Spider
if link not in crawledLinks:
link = "http://quotes.toscrape.com" + link
crawledLinks.append(link)
yield Request(link, self.parse)
I've tried adding the following lines after yield:
item = MytestItem()
item['url'] = link
item['status'] = response.status
yield item
But it gets me a bunch of duplicates and no url with status 404 or 301. Does anyone know how I can get all the urls with the status?
Scrapy by default does not return any unsuccessful requests, but you can fetch them and handle them in one of your functions if you set errback on the request.
def parse(self, response):
# some code
yield Request(link, self.parse, errback=self.parse_error)
def parse_error(self, failure):
# log the response as an error
The parameter failure will contain more information on the exact reason for failure, because it could be HTTP errors (where you can fetch a response), but also DNS lookup errors and such (where there is no response).
The documentation contains an example how to use failure to determine the error reason and access Response if available:
def errback_httpbin(self, failure):
# log all failures
self.logger.error(repr(failure))
# in case you want to do something special for some errors,
# you may need the failure's type:
if failure.check(HttpError):
# these exceptions come from HttpError spider middleware
# you can get the non-200 response
response = failure.value.response
self.logger.error('HttpError on %s', response.url)
elif failure.check(DNSLookupError):
# this is the original request
request = failure.request
self.logger.error('DNSLookupError on %s', request.url)
elif failure.check(TimeoutError, TCPTimedOutError):
request = failure.request
self.logger.error('TimeoutError on %s', request.url)
You should use the HTTPERROR_ALLOW_ALL in your settings or set the meta key handle_httpstatus_all = Truein all your requests, please refer to the docs for more information.

Scrapy ignore request for a specific domain

I try to crawl the forum category of craiglist.org (https://forums.craigslist.org/).
My spider:
class CraigslistSpider(scrapy.Spider):
name = "craigslist"
allowed_domains = ["forums.craigslist.org"]
start_urls = ['http://geo.craigslist.org/iso/us/']
def error_handler(self, failure):
print failure
def parse(self, response):
yield Request('https://forums.craigslist.org/',
self.getForumPage,
dont_filter=True,
errback=self.error_handler)
def getForumPage(self, response):
print "forum page"
I have this message by the error callback:
[Failure instance: Traceback: :
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:455:callback
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:563:_startRunCallbacks
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1316:gotResult
--- ---
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1258:_inlineCallbacks
/usr/local/lib/python2.7/site-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py:37:process_request
/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
/usr/local/lib/python2.7/site-packages/scrapy/downloadermiddlewares/robotstxt.py:46:process_request_2
]
But i have this problem only with the forum section of Craigslist. It might be because is https for the forum section in contrary of the rest of website.
So, impossible to get a response...
An idea ?
I post a solution that I found for get around the problem.
I have used urllib2 library. Look:
import urllib2
from scrapy.http import HtmlResponse
class CraigslistSpider(scrapy.Spider):
name = "craigslist"
allowed_domains = ["forums.craigslist.org"]
start_urls = ['http://geo.craigslist.org/iso/us/']
def error_handler(self, failure):
print failure
def parse(self, response):
# Get a valid request with urllib2
req = urllib2.Request('https://forums.craigslist.org/')
# Get the content of this request
pageContent = urllib2.urlopen(req).read()
# Parse the content in a HtmlResponse compatible with Scrapy
response = HtmlResponse(url=response.url, body=pageContent)
print response.css(".forumlistcolumns li").extract()
With this solution, you can parse a good request in a valid Scrapy request and use this normaly.
There is probably a better method but this one is functional.
I think you are dealing with robots.txt. Try running your spider with
custom_settings = {
"ROBOTSTXT_OBEY": False
}
You can also test it using command line settings: scrapy crawl craigslist -s ROBOTSTXT_OBEY=False.

Why doesn't this FormRequest log me in?

Complete Python newb here so I may be asking something painfully obvious, but I've searched through this site, the Scrapy docs, and Google and I'm completely stuck on this problem.
Essentially, I want to use Scrapy's FormRequest to log me in to a site so that I can scrape and save some stats from various pages. The issue is that the response I receive from the site after submitting the form just returns me to the home page (without any login error notifications in the response body). I'm not sure how I am botching this log-in process. Although it is a pop-up login form, I don't think that should be an issue since using Firebug, I can extract the relevant html code (and xpath) for the form embedded in the webpage.
Thanks for any help. The code is pasted below (I replaced my actual username and password):
# -*- coding: utf-8 -*-
import scrapy
class dkspider(scrapy.Spider):
name = "dkspider"
allowed_domains = ["draftkings.com"]
start_urls = ['https://www.draftkings.com/contest-lobby']
def parse(self, response):
return scrapy.http.FormRequest.from_response(response,
formxpath = '//*[#id="login_form"]',
formdata = {'username' : 'myusername', 'password' : 'mypass'},
callback = self.started)
def started(self, response):
filename = 'attempt1.html'
with open(filename, 'wb') as f:
f.write(response.body)
if 'failed' in response.body:
print 'Errors!'
else:
print 'Success'
Seems like your parameters don't match(should be login instead of username) and you are missing some of them in your formdata. This is what firebug shows me is delivered when trying to log in:
Seems like layoutType and returnUrl can just be hardcoded in but profillingSessionId needs to be retrieved from the page source. I checked the source and found this there:
so your Spider should look something like this:
def parse(self, response):
return FormRequest(
url='https://www.draftkings.com/account/login',
formdata={'login': 'login', # login instead of username
'password': 'password',
'profillingSessionId': ''.join(
response.xpath("//input[#id='tmxSessionId']/#value").extract()),
'returnUrl': '',
'layoutType': '2'},
callback=self.started)
def started(self, response):
# Reload the landing page
return Request(self.start_urls[0], self.logged_in)
def logged_in(self, response):
# logged in page here
pass

Categories