Giving Syntax error in Scrapy (Python) - XPath

I'm using Scrapy Crawler to extract some details like username, upvotes, join date etc.
I'm using XPath for extracting the contents from each user's webpage.
Code:
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.spiders import BaseSpider
from scrapy.http import FormRequest
from loginform import fill_login_form
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
class UserSpider(scrapy.Spider):
    name = 'userspider'
    start_urls = ['http://forum.nafc.org/login/']

    # Getting the list of usernames
    user_names = ['Bob', 'Tom']  # list of usernames

    def __init__(self, *args, **kwargs):
        super(UserSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        return [FormRequest.from_response(response,
                formdata={'registerUserName': 'user', 'registerPass': 'password'},
                callback=self.after_main_login)]

    def after_main_login(self, response):
        for user in self.user_names:
            user_url = 'profile/' + user
            yield response.follow(user_url, callback=self.parse_user_pages)

    def parse_user_pages(self, response):
        yield {
            "USERNAME": response.xpath('//div[contains(@class, "main") and contains(@class, "no-sky-main")]/h1[contains(@class, "thread-title")]/text()').extract_first()
            "UPVOTES": response.xpath('//div[contains(@class, "proUserInfoLabelLeft") and @id="proVotesCap"]/text()').extract()[0]
        }

if __name__ == "__main__":
    spider = UserSpider()
Error looks like this
P.S. I have manually checked the syntax of my XPath in the Scrapy shell and it was working fine.
Is there anything that I'm not noticing in the code?

You're missing a , after your first dict element:
{"USERNAME": response.xpath(...).extract_first(),
"UPVOTES": response.xpath(...).extract()[0]}

Related

Does Scrapy crawl HTML that calls :hover to display additional information?

I'm not sure if this is the correct place for this question.
Here's my question:
If I run scrapy, it can't see the email addresses in the page source. The page has email addresses that are visible only when you hover over a user who has one.
When I run my spider, I get no emails. What am I doing wrong?
Thank You.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re

class MailsSpider(CrawlSpider):
    name = 'mails'
    allowed_domains = ['biorxiv.org']
    start_urls = ['https://www.biorxiv.org/content/10.1101/2022.02.28.482253v3']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        emails = re.findall(r'[\w\.]+@[\w\.]+', response.text)
        print(response.url)
        print(emails)
Assuming you're allowed to scrape email contacts from a public website: as said, Scrapy does not load JS scripts, so you need a fully rendered browser like Playwright to get the addresses.
I've written down a quick and dirty example of how it could work; you can start from here if you wish (after you've installed Playwright, of course).
import scrapy
from scrapy.http import Request, FormRequest
from playwright.sync_api import sync_playwright
from scrapy.http import HtmlResponse

class PhaseASpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        yield Request('https://www.biorxiv.org/content/10.1101/2022.02.28.482253v3', callback=self.parse_page)

    def parse_page(self, response):
        with sync_playwright() as p:
            browser = p.firefox.launch(headless=False)
            self.page = browser.new_page()
            url = 'https://www.biorxiv.org/content/10.1101/2022.02.28.482253v3'
            self.page.goto(url)
            self.page.wait_for_load_state("load")
            html_page = self.page.content()
            response_sel = HtmlResponse(url="my HTML string", body=html_page, encoding='utf-8')
            mails = response_sel.xpath('//a[contains(@href, "mailto")]/@href').extract()
            for mail in mails:
                print(mail.split('mailto:')[1])

Scrapy spider outputs empty csv file

This is my first question here and I'm learning how to code by myself, so please bear with me.
I'm working on a final CS50 project, in which I'm trying to build a website that aggregates online Spanish courses from edx.org and maybe other open online course websites. I'm using the Scrapy framework to scrape the filtered results of Spanish courses on edx.org... Here is my first Scrapy spider, in which I'm trying to get into each course's link to then get its name (after I get the code right, I'll also get the description, course url and more stuff).
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader

class Course_item(Item):
    name = Field()
    #description = Field()
    #img_url = Field()

class Course_spider(CrawlSpider):
    name = 'CourseSpider'
    allowed_domains = ['https://www.edx.org/']
    start_urls = ['https://www.edx.org/course/?language=Spanish']
    rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)

    def parse_item(self, response):
        item = ItemLoader(Course_item, response)
        item.add_xpath('name', '//*[@id="course-intro-heading"]/text()')
        yield item.load_item()
When I run the spider with "scrapy runspider edxSpider.py -o edx.csv -t csv", I get an empty csv file, and I also think it isn't getting into the right Spanish course results.
Basically I want to get into each course from this link (edx Spanish courses) and get the name, description, provider, page url and img url.
Any ideas on what might be the problem?
You can't get edx content with a simple request; it uses JavaScript rendering to get the course elements dynamically, so CrawlSpider won't work in this case, because you need to find specific elements inside the response body to generate a new Request that will get what you need.
The real request (to get the urls of the courses) is this one, but you need to generate it from the previous response body (although you could also just visit it and get the correct data).
So, to generate the real request, you need data that is inside a script tag:
from scrapy import Spider
import re
import json

class Course_spider(Spider):
    name = 'CourseSpider'
    allowed_domains = ['edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    def parse(self, response):
        script_text = response.xpath('//script[contains(text(), "Drupal.settings")]').extract_first()
        parseable_json_data = re.search(r'Drupal.settings, ({.+})', script_text).group(1)
        json_data = json.loads(parseable_json_data)
        ...
Now you have what you need in json_data and only need to build the URL string, for example along the lines of the sketch below.
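The exact layout of json_data isn't shown here, so as a hypothetical sketch you could simply scan the parsed settings for course-run ids (the course-v1:... keys used by the catalog API, see the next answer) and request each one; parse_course is a callback you would still have to write:
        # hypothetical continuation inside parse(): scan json_data for course-run ids
        # and build the catalog API URL for each one (parse_course is not defined above)
        course_ids = set(re.findall(r'course-v1:[\w.+]+', json.dumps(json_data)))
        for course_id in course_ids:
            url = 'https://www.edx.org/api/catalog/v2/courses/' + course_id
            yield response.follow(url, callback=self.parse_course)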
This page uses JavaScript to get data from the server and add it to the page.
It uses urls like
https://www.edx.org/api/catalog/v2/courses/course-v1:IDBx+IDB33x+3T2017
The last part is the course's id, which you can find in the HTML:
<main id="course-info-page" data-course-id="course-v1:IDBx+IDB33x+3T2017">
Code
from scrapy.http import Request
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
import json

class Course_spider(CrawlSpider):
    name = 'CourseSpider'
    allowed_domains = ['www.edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']
    rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)

    def parse_item(self, response):
        print('parse_item url:', response.url)
        course_id = response.xpath('//*[@id="course-info-page"]/@data-course-id').extract_first()
        if course_id:
            url = 'https://www.edx.org/api/catalog/v2/courses/' + course_id
            yield Request(url, callback=self.parse_json)

    def parse_json(self, response):
        print('parse_json url:', response.url)
        item = json.loads(response.body)
        return item

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',  # csv, json, xml
    'FEED_URI': 'output.csv',
})
c.crawl(Course_spider)
c.start()
from scrapy.http import Request
from scrapy import Spider
import json

class edx_scraper(Spider):
    name = "edxScraper"
    start_urls = [
        'https://www.edx.org/api/v1/catalog/search?selected_facets[]=content_type_exact%3Acourserun&selected_facets[]=language_exact%3ASpanish&page=1&page_size=9&partner=edx&hidden=0&content_type[]=courserun&content_type[]=program&featured_course_ids=course-v1%3AHarvardX+CS50B+Business%2Ccourse-v1%3AMicrosoft+DAT206x+1T2018%2Ccourse-v1%3ALinuxFoundationX+LFS171x+3T2017%2Ccourse-v1%3AHarvardX+HDS2825x+1T2018%2Ccourse-v1%3AMITx+6.00.1x+2T2017_2%2Ccourse-v1%3AWageningenX+NUTR101x+1T2018&featured_programs_uuids=452d5bbb-00a4-4cc9-99d7-d7dd43c2bece%2Cbef7201a-6f97-40ad-ad17-d5ea8be1eec8%2C9b729425-b524-4344-baaa-107abdee62c6%2Cfb8c5b14-f8d2-4ae1-a3ec-c7d4d6363e26%2Ca9cbdeb6-5fc0-44ef-97f7-9ed605a149db%2Cf977e7e8-6376-400f-aec6-84dcdb7e9c73'
    ]

    def parse(self, response):
        data = json.loads(response.text)
        for course in data['objects']['results']:
            url = 'https://www.edx.org/api/catalog/v2/courses/' + course['key']
            yield response.follow(url, self.course_parse)
        if data['objects'].get('next') is not None:
            yield response.follow(data['objects']['next'], self.parse)

    def course_parse(self, response):
        course = json.loads(response.text)
        yield {
            'name': course['title'],
            'effort': course['effort'],
        }

Selenium inside scrapy does not work

I have a scrapy CrawlSpider that parses links and returns HTML content just fine. For JavaScript pages, however, I enlisted Selenium to access the 'hidden' content. The problem is that while Selenium works outside the scrapy parsing, it does not work inside the parse_item function.
from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistReviewItem
import scrapy
from selenium import selenium
from selenium import webdriver

class MySpider(CrawlSpider):
    name = "spidername"
    allowed_domains = ["XXXXX"]
    start_urls = ['XXXXX']

    rules = (
        Rule(LinkExtractor(allow=('reviews\?page')), callback='parse_item'),
        Rule(LinkExtractor(allow=('.',), deny=('reviews\?page',)), follow=True))

    def __init__(self):
        # this page loads
        CrawlSpider.__init__(self)
        self.selenium = webdriver.Firefox()
        self.selenium.get('XXXXX')
        self.selenium.implicitly_wait(30)

    def parse_item(self, response):
        # this page doesn't
        print response.url
        self.driver.get(response.url)
        self.driver.implicitly_wait(30)
        # ...do things
You have some variable issues. In the __init__ method you assign the browser instance to self.selenium, but in parse_item you use self.driver as the browser instance. I have updated your script; try it now.
from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistReviewItem
import scrapy
from selenium import selenium
from selenium import webdriver

class MySpider(CrawlSpider):
    name = "spidername"
    allowed_domains = ["XXXXX"]
    start_urls = ['XXXXX']

    rules = (
        Rule(LinkExtractor(allow=('reviews\?page')), callback='parse_item'),
        Rule(LinkExtractor(allow=('.',), deny=('reviews\?page',)), follow=True))

    def __init__(self):
        # this page loads
        CrawlSpider.__init__(self)
        self.driver = webdriver.Firefox()
        self.driver.get('XXXXX')
        self.driver.implicitly_wait(30)

    def parse_item(self, response):
        # this page doesn't
        print response.url
        self.driver.get(response.url)
        self.driver.implicitly_wait(30)
        # ...do things
Great! A combination of Hassan's answer and better knowledge of the urls I was scraping led to the answer (it turns out the website had planted 'fake' urls that never loaded).

How to submit a form in scrapy?

I tried to use scrapy to complete the login and collect my project commit count. And here is the code.
from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import Spider
from scrapy.utils.response import open_in_browser

class GitSpider(Spider):
    name = "github"
    allowed_domains = ["github.com"]
    start_urls = ["https://www.github.com/login"]

    def parse(self, response):
        formdata = {'login': 'username',
                    'password': 'password'}
        yield FormRequest.from_response(response,
                                        formdata=formdata,
                                        clickdata={'name': 'commit'},
                                        callback=self.parse1)

    def parse1(self, response):
        open_in_browser(response)
After running the code with
scrapy runspider github.py
it should show me the result page of the form, which should be a failed login on the same page, since the username and password are fake. However, it shows me the search page. The log file is located in pastebin.
How should the code be fixed? Thanks in advance.
Your problem is that FormRequest.from_response() uses a different form - the "search form". But you wanted it to use the "log in form" instead. Provide a formnumber argument:
yield FormRequest.from_response(response,
                                formnumber=1,
                                formdata=formdata,
                                clickdata={'name': 'commit'},
                                callback=self.parse1)
Here is what I see opened in the browser after applying the change (used "fake" user):
Solution using webdriver.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
from scrapy.contrib.spiders import CrawlSpider

class GitSpider(CrawlSpider):
    name = "gitscrape"
    allowed_domains = ["github.com"]
    start_urls = ["https://www.github.com/login"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        login_form = self.driver.find_element_by_name('login')
        password_form = self.driver.find_element_by_name('password')
        commit = self.driver.find_element_by_name('commit')
        login_form.send_keys("yourlogin")
        password_form.send_keys("yourpassword")
        actions = ActionChains(self.driver)
        actions.click(commit)
        actions.perform()
        # by this point you are logged in to github and have access
        # to all data in the main menu
        time.sleep(3)
        self.driver.close()
Using the "formname" argument also works:
yield FormRequest.from_response(response,
                                formname='Login',
                                formdata=formdata,
                                clickdata={'name': 'commit'},
                                callback=self.parse1)

JSON Response and Scrapy

I'm trying to parse a JSON response from the New York Times API with Scrapy to CSV so that I could have a summary of all related articles to a particular query. I'd like to spit this out as a CSV with link, publication date, summary, and title so that I could run a few keyword searches on the summary description. I'm new to both Python and Scrapy but here's my spider (I'm getting an HTTP 400 error). I've xx'ed out my api key in the spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from nytimesAPIjson.items import NytimesapijsonItem
import json
import urllib2

class MySpider(BaseSpider):
    name = "nytimesapijson"
    allowed_domains = ["http://api.nytimes.com/svc/search/v2/articlesearch"]
    req = urllib2.urlopen('http://api.nytimes.com/svc/search/v2/articlesearch.json?q="financial crime"&facet_field=day_of_week&begin_date=20130101&end_date=20130916&page=2&rank=newest&api-key=xxx')

    def json_parse(self, response):
        jsonresponse = json.loads(response)
        item = NytimesapijsonItem()
        item["pubDate"] = jsonresponse["pub_date"]
        item["description"] = jsonresponse["lead_paragraph"]
        item["title"] = jsonresponse["print_headline"]
        item["link"] = jsonresponse["web_url"]
        items.append(item)
        return items
If anybody has any ideas/suggestions, including ones outside of Scrapy, please let me know. Thanks in advance.
You should set start_urls and use the parse method:
from scrapy.spider import BaseSpider
import json

class MySpider(BaseSpider):
    name = "nytimesapijson"
    allowed_domains = ["api.nytimes.com"]
    start_urls = ['http://api.nytimes.com/svc/search/v2/articlesearch.json?q="financial crime"&facet_field=day_of_week&begin_date=20130101&end_date=20130916&page=2&rank=newest&api-key=xxx']

    def parse(self, response):
        jsonresponse = json.loads(response.body)
        print jsonresponse
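From there you can populate the item from the question instead of just printing; a minimal sketch, assuming the articles sit under jsonresponse['response']['docs'] (as in the v2 Article Search JSON), that the headline is an object with a "main" key, and reusing NytimesapijsonItem from the question's project:
    def parse(self, response):
        jsonresponse = json.loads(response.body)
        # assumption: articles are nested under response -> docs in the v2 Article Search JSON
        for doc in jsonresponse["response"]["docs"]:
            item = NytimesapijsonItem()
            item["pubDate"] = doc.get("pub_date")
            item["description"] = doc.get("lead_paragraph")
            item["title"] = doc.get("headline", {}).get("main")
            item["link"] = doc.get("web_url")
            yield item
Running the spider with -o summaries.csv would then give the CSV of link, publication date, summary and title the question describes.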
