Complete Python newb here so I may be asking something painfully obvious, but I've searched through this site, the Scrapy docs, and Google and I'm completely stuck on this problem.
Essentially, I want to use Scrapy's FormRequest to log me in to a site so that I can scrape and save some stats from various pages. The issue is that the response I receive from the site after submitting the form just returns me to the home page (without any login error notifications in the response body). I'm not sure how I am botching this log-in process. Although it is a pop-up login form, I don't think that should be an issue since using Firebug, I can extract the relevant html code (and xpath) for the form embedded in the webpage.
Thanks for any help. The code is pasted below (I replaced my actual username and password):
# -*- coding: utf-8 -*-
import scrapy

class dkspider(scrapy.Spider):
    name = "dkspider"
    allowed_domains = ["draftkings.com"]
    start_urls = ['https://www.draftkings.com/contest-lobby']

    def parse(self, response):
        return scrapy.http.FormRequest.from_response(response,
            formxpath='//*[@id="login_form"]',
            formdata={'username': 'myusername', 'password': 'mypass'},
            callback=self.started)

    def started(self, response):
        filename = 'attempt1.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        if 'failed' in response.body:
            print 'Errors!'
        else:
            print 'Success'
Seems like your parameters don't match (it should be login instead of username) and you are missing some of them in your formdata. Firebug shows login, password, profillingSessionId, returnUrl, and layoutType being delivered when trying to log in.
Seems like layoutType and returnUrl can just be hardcoded in, but profillingSessionId needs to be retrieved from the page source. I checked the source and found it there:
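(Judging from the xpath in the code below, it is a hidden input roughly like this; the exact attributes are an assumption:)

<input type="hidden" id="tmxSessionId" value="..." />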
So your spider should look something like this:

from scrapy import FormRequest, Request

def parse(self, response):
    return FormRequest(
        url='https://www.draftkings.com/account/login',
        formdata={'login': 'login',  # login instead of username
                  'password': 'password',
                  'profillingSessionId': ''.join(
                      response.xpath("//input[@id='tmxSessionId']/@value").extract()),
                  'returnUrl': '',
                  'layoutType': '2'},
        callback=self.started)

def started(self, response):
    # Reload the landing page
    return Request(self.start_urls[0], self.logged_in)

def logged_in(self, response):
    # logged in page here
    pass
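To actually verify the login inside logged_in, one option is to look for a string that only appears for authenticated users (a minimal sketch; the 'Sign Out' marker is an assumption, substitute whatever the logged-in page really shows):

def logged_in(self, response):
    # Assumed marker: something that only renders when authenticated
    if 'Sign Out' in response.body:
        self.log('Logged in successfully')
    else:
        self.log('Login appears to have failed')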
I am trying to make a scraper for Discord to get all the members of a server, but I am stuck at the login. I can't find the CSRF token anywhere in the source code for the page; maybe that is why I'm getting this error, since a few sources say it is required, but I'm not sure. Here's the spider causing the problem:
import scrapy
from scrapy.http import FormRequest

class RecruteSpider(scrapy.Spider):
    name = "Recruteur"

    def start_requests(self):
        urls = [
            'https://discord.com/login',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.login)

    def login(self, response):
        url = 'https://discord.com/login'
        formdata = {"username": "SecretUserName", "password": "SecretPassword"}
        yield FormRequest.from_response(
            response=response,
            url=url,
            formdata=formdata,
            callback=self.afterLogin
        )

    def afterLogin(self, response):
        print("Success!!")
        # do stuff
When I run the program I get the error

ValueError: No element found in <200 https://discord.com/login>

even though there clearly is a form element at that URL.

I have also tried using the login URL as the response argument in the FormRequest, but then I get the error

AttributeError: 'str' object has no attribute 'encoding'

If you need any extra detail feel free to ask; any help is greatly appreciated, thanks in advance.
The error you are getting is because Discord loads the /login page using JavaScript, and therefore the response does not contain any form elements. You need to render the JavaScript using scrapy-playwright (personal favourite), Selenium, or scrapy-splash.
Also, your formdata variable contains invalid keys; the payload the browser actually sends to the server uses login rather than username.
Using scrapy-playwright, I was able to get to the callback function as below. Also note that Discord may require you to solve a captcha once you send the login request, which presents another challenge that you will need to solve.
discord.py
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import FormRequest

class RecruteSpider(scrapy.Spider):
    name = "Recruteur"

    def start_requests(self):
        urls = ['https://discord.com/login']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.login, meta={"playwright": True})

    def login(self, response):
        url = 'https://discord.com/login'
        formdata = {"login": "SecretUserName", "password": "SecretPassword"}
        yield FormRequest.from_response(
            response=response,
            url=url,
            formdata=formdata,
            callback=self.afterLogin
        )

    def afterLogin(self, response):
        print("Success!!")
        # do stuff

if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    })
    process.crawl(RecruteSpider)
    process.start()
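Note that for this to run, scrapy-playwright and a browser binary both need to be installed first:

pip install scrapy-playwright
playwright install chromium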
I'm trying to make a login request to the following web page using Scrapy: https://www.greatbuyproducts.com/scs/checkout.ssp?is=login&login=T&fragment=login-register#login-register
I'm having issues dealing with this web page's parameters because I can't find the token anywhere; maybe I have to generate it, but I'm not sure.
This is the code I have so far; the issue is where the "token" parts should be.
import scrapy

class GreatBuy(scrapy.Spider):
    name = "greatbuy"
    start_urls = [
        'https://www.greatbuyproducts.com/scs/checkout.ssp?is=login&login=T&fragment=login-register#login-register']

    def parse(self, response):
        token = ...  # I don't know how to get it
        yield scrapy.FormRequest(
            'https://www.greatbuyproducts.com/scs/services/Account.Login.Service.ss?n=2&c=5237170',
            formdata={'token': token,  # the token part is the problem
                      'email': 'my@email.com', 'password': 'mypass', 'send': ''},
            callback=self.startscraper)

    def startscraper(self, response):
        yield scrapy.Request('https://www.greatbuyproducts.com/', callback=self.verifylogin)

    def verifylogin(self, response):
        print(response.text)
I noticed the query string parameters don't change; can I use them for something here?
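One pattern worth trying (the same approach the DraftKings answer above uses for profillingSessionId) is to pull the token out of a hidden input on the login page and post it along with the credentials. This is only a sketch; the input name 'token' and the formdata key are assumptions about how the page exposes it:

def parse(self, response):
    # Assumption: the token is exposed as a hidden <input name="token"> on the login page
    token = response.xpath("//input[@name='token']/@value").get()
    yield scrapy.FormRequest(
        'https://www.greatbuyproducts.com/scs/services/Account.Login.Service.ss?n=2&c=5237170',
        formdata={'token': token, 'email': 'my@email.com', 'password': 'mypass', 'send': ''},
        callback=self.startscraper)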
I am trying out Scrapy now. I tried the example code on the http://doc.scrapy.org/en/1.0/intro/overview.html page, and extracting the recent questions with the tag 'bigdata' worked well. But when I tried to extract questions with both the tags 'bigdata' and 'python', the results were not correct: questions having only the 'bigdata' tag came back in the result, even though in the browser I get questions with both tags correctly. Please find the code below:
import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/bigdata?page=1&sort=newest&pagesize=50']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
When I change start_urls as
start_urls = ['https://stackoverflow.com/questions/tagged/bigdata+python?page=1&sort=newest&pagesize=50']
the results contain questions with only the 'bigdata' tag. How do I get only the questions that have both tags?
Edit: I think what is happening is that Scrapy is going into the per-tag pages linked from the main page I gave, because the tags on each question are links to the main page for that tag. How can I edit this code to make Scrapy not go into the tag pages and scrape only the questions on that page? I tried using rules like below, but the results were still not right.
rules = (Rule(LinkExtractor(restrict_css='.question-summary h3 a::attr(href)'), callback='parse_question'),)
The url you have (as well as the initial css rules) is correct; or more simply:
start_urls = ['https://stackoverflow.com/questions/tagged/python+bigdata']
Extrapolating from this, this will also work:
start_urls = ['https://stackoverflow.com/questions/tagged/bigdata%20python']
The issue you are running into however, is that stackoverflow appears to require you to be logged in to access the multiple tag search feature. To see this, simply log out of your stackoverflow session and try the same url in your browser. It will redirect you to a page of results for the first of the two tags only.
TL;DR the only way to get the multiple tags feature appears to be logging in (enforced via session cookies)
Thus, when using Scrapy, the fix is to authenticate the session (log in) before doing anything else, then proceed to parse as normal. To do this, you can use an InitSpider instead of a Spider and add the appropriate login methods. Assuming you log in with StackOverflow directly (as opposed to through Google or the like), I was able to get it working as expected like this:
import scrapy
import getpass
from scrapy.spiders.init import InitSpider

class StackOverflowSpider(InitSpider):
    name = 'stackoverflow'
    login_page = 'https://stackoverflow.com/users/login'
    start_urls = ['https://stackoverflow.com/questions/tagged/bigdata+python']

    def parse(self, response):
        ...

    def parse_question(self, response):
        ...

    def init_request(self):
        return scrapy.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.FormRequest.from_response(response,
            formdata={'email': 'yourEmailHere@foobar.com',
                      'password': getpass.getpass()},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "/users/logout" in response.body:
            self.log("Successfully logged in")
            return self.initialized()
        else:
            self.log("Failed login")
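For reference, InitSpider runs init_request() before touching start_urls, and the crawl of start_urls only begins once self.initialized() is returned, which is why the login callback has to hand control back that way on success.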
I'm trying to log in to a web page using Scrapy. However, it does not seem to work, and I am unable to properly debug it because I can't see what is happening through Scrapy. This is the code I've got so far:
# -*- coding: utf-8 -*-
import scrapy

class WordGetterSpider(scrapy.Spider):
    name = "word_getter"
    #allowed_domains = ["germanpod101.com"]
    start_urls = (
        'http://www.example.com/member/login_new.php',
    )

    def parse(self, response):
        return scrapy.FormRequest.from_response(response,
            formdata={'amember_login': 'my username', 'amember_password': 'my password'},
            callback=self.parse_index)

    def parse_index(self, response):
        print response.body
        print response.xpath('//title/text()').extract()
The printed body contains the login form, and it is apparent that I am not logged in. If I try to visit other pages that require login, I am redirected back to the login page.
Does anybody have any good tips on how I can debug this, or how I can make the login work?
I have added this to my settings as well:
COOKIES_ENABLED = True
COOKIES_DEBUG = True
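A quick way to see exactly what Scrapy received after the form post is to open the response in a local browser from inside the callback; open_in_browser is a standard Scrapy utility (a minimal sketch):

from scrapy.utils.response import open_in_browser

def parse_index(self, response):
    # Pops the response Scrapy actually got into your browser,
    # which makes a failed login immediately visible
    open_in_browser(response)

You can also inspect the form fields interactively with scrapy shell 'http://www.example.com/member/login_new.php'.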
I'm trying to do the following:
log into a web page (in my case zendesk.com)
use that session to do some POST requests
In fact, Zendesk lacks some APIs (creating/altering macros), which I now need to simulate by scripting a browser session.
So I'm not writing a spider; I am trying to interact with the website as my script proceeds, and the POST requests are not known from the start, only during the run of my script.
In the Scrapy docs, there is the following example to illustrate how to use an authenticated session in Scrapy:
class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # continue scraping with authenticated session...
But it looks like this only works for scraping; in my case I just want to "hold" the session and keep working with it.
Is there a way to achieve this with scrapy, or are there tools that better fit this task?
Thanks a lot @wawaruk. Based on the Stack Overflow post you linked, this is the solution I came up with:
import urllib, urllib2, cookielib, re
zendesk_subdomain = 'mysub'
zendesk_username = '...'
zendesk_password = '...'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
resp = opener.open('http://%s.zendesk.com/access/unauthenticated' % (zendesk_subdomain))
s = resp.read()
data = dict()
data['authenticity_token'] = re.findall('<input name="authenticity_token" type="hidden" value="([^"]+)"', s)[0]
data['return_to'] = 'http://%s.zendesk.com/login' % zendesk_subdomain
data['user[email]'] = zendesk_username
data['user[password]'] = zendesk_password
data['commit'] = 'Log in'
data['remember_me'] = '1'
opener.open('https://%s.zendesk.com/access/login' % zendesk_subdomain, urllib.urlencode(data))
From there, all pages can be accessed with opener, e.g.
opener.open('http://%s.zendesk.com/rules/new?filter=macro' % zendesk_subdomain)
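Since the original goal was to issue further POST requests with the session: passing a data argument to opener.open() turns the request into a POST, so a macro-creating request would look something like this (a sketch; the endpoint path and field name are hypothetical and need to be read out of the actual macro form):

payload = urllib.urlencode({'macro[title]': 'my macro'})
opener.open('http://%s.zendesk.com/rules' % zendesk_subdomain, payload)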