[I'm working on OS X]
I'll post the relevant portions of my program down below:
Spider:
# -*- coding: utf-8 -*-
import scrapy
import pandas as pd
from ..items import Homedepotv2Item
from scrapy.http import Request


class HomedepotspiderSpider(scrapy.Spider):
    name = 'homeDepotSpider'
    allowed_domains = ['homedepot.com']

    pathName = '/Users/user/Desktop/homeDepotv2Helpers/homeDepotInfo.csv'
    # pathName: the path to the CSV file
    # skiprows: the first row holds the title, which we don't need
    export = pd.read_csv(pathName, skiprows=[0], header=None)
    omsList = export.values.T[1].tolist()  # transpose the matrix and take the second column

    start_urls = ['https://www.homedepot.com/p/{omsID}'.format(omsID=omsID)
                  for omsID in omsList]
    def parse(self, response):
        # call the Home Depot parse function
        for item in self.parseHomeDepot(response):
            yield item
Settings:
BOT_NAME = 'homeDepotv2'
SPIDER_MODULES = ['homeDepotv2.spiders']
NEWSPIDER_MODULE = 'homeDepotv2.spiders'
When I try running my spider with the command scrapy crawl homeDepotSpider, I get this error:
ModuleNotFoundError: No module named 'homeDepotv2'
Initially I thought I had a directory error, so instead of using cd to find my directory I pasted in the path of the spider's directory, which was
/Users/userName/homeDepotv2_Spider/build/lib/homeDepotv2
However, that still returned the same error.
Not too sure what's wrong here, so any help would be appreciated!
And here is the file hierarchy:
Check this video: Path append | how to fix "Module not found" with Scrapy items.py
I had the same problem; the solution is to use:
from sys import path
path.append('/Users/userName/homeDepotv2_Spider')
You may need to check/modify the path, as Scrapy makes two directories with the same name.
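For context, a minimal sketch of where that append would go at the top of the spider file (the path is just the one from this question; in general it is typically the project directory that contains scrapy.cfg, and the package name below is this question's homeDepotv2):
from sys import path
path.append('/Users/userName/homeDepotv2_Spider')

import scrapy
from homeDepotv2.items import Homedepotv2Item  # now resolvable as a top-level package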
(This is my items.py)
import scrapy


class FreelanceItem(scrapy.Item):
    url = scrapy.Field()
    url = scrapy.Field()
When I started another Python session and imported the package:
import scrapy
from scrapy.item import Item, Field
from freelance.items import FreelanceItem
I get this:
ModuleNotFoundError: No module named 'freelance'
What should I do?
Thanks.
You're accessing it the wrong way.
Let's say you are in a directory called PythonTest, which also contains your main.py file.
Steps:
Create a folder named "freelance" in this PythonTest directory.
Add an empty file named __init__.py to this directory (the freelance dir); this tells Python it is a package.
Add your items.py file to this directory as well (see the layout sketch below).
Now go to your main.py and add the line:
from freelance.items import FreeLanceItem
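The resulting layout looks roughly like this (PythonTest and main.py are just the example names used above):
PythonTest/
    main.py
    freelance/
        __init__.py
        items.py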
Also make sure your code is correctly indented (see below):
import scrapy


class FreeLanceItem(scrapy.Item):
    url = scrapy.Field()
    url = scrapy.Field()
running the code should not produce an error anymore.
Let me know if this helped!
What am I doing wrong with this script, given that it is not outputting a CSV file with the data? I am running the script with scrapy runspider yellowpages.py -o items.csv and still nothing comes out but a blank CSV file. I have followed different answers here and also watched YouTube videos trying to figure out where I am making the mistake, and I still cannot work out what I am not doing correctly.
# -*- coding: utf-8 -*-
import scrapy
import requests

search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q)
page = page.url
items = ()


class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]

    def parse(self, response):
        self.log("I just visited: " + response.url)
        items = response.css('a[class=business-name]::attr(href)')
        for item in items:
            print(item)
Simple spider without project.
Use my code; I wrote comments to make it easier to understand. This spider looks for all blocks on all pages for a given pair of parameters, "service" and "location". To run it in your case, use:
scrapy runspider yellowpages.py -a servise="Plumbers" -a location="Hammond, LA" -o Hammondsplumbers.csv
The code will also work with any queries. For example:
scrapy runspider yellowpages.py -a servise="Doctors" -a location="California, MD" -o MDDoctors.json
etc...
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.exceptions import CloseSpider


class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/']

    # We can use any "servise" + "location" pair in our request
    def __init__(self, servise=None, location=None):
        self.servise = servise
        self.location = location

    def parse(self, response):
        # If "servise" and "location" are defined
        if self.servise and self.location:
            # Create the search phrase using "servise" and "location"
            search_url = 'search?search_terms={}&geo_location_terms={}'.format(self.servise, self.location)
            # Send a request to "yellowpages.com" + "search_url", then call parse_result
            yield Request(url=response.urljoin(search_url), callback=self.parse_result)
        else:
            # Else close our spider
            # You can add a default value if you want.
            self.logger.warning('=== Please use keys -a servise="service_name" -a location="location" ===')
            raise CloseSpider()

    def parse_result(self, response):
        # All result blocks, without AD posts
        posts = response.xpath('//div[@class="search-results organic"]//div[@class="v-card"]')
        for post in posts:
            yield {
                'title': post.xpath('.//span[@itemprop="name"]/text()').extract_first(),
                'url': response.urljoin(post.xpath('.//a[@class="business-name"]/@href').extract_first()),
            }
        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        # If we have a next page url
        if next_page:
            # Send a request to "yellowpages.com" + "next_page", then call parse_result again
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_result)
    for item in items:
        print(item)
Put yield instead of print there:
    for item in items:
        yield item
On inspection of your code, I notice a number of problems:
First, you initialize items to a tuple, when it should be a list: items = [].
You should change your name property to reflect the name you want on your crawler so you can use it like so: scrapy crawl my_crawler where name = "my_crawler".
start_urls is supposed to contain strings, not Request objects. You should change the entry from page to the exact search string you want to use. If you have a number of search strings and want to iterate over them, I would suggest using a middleware.
When you try to extract the data from the CSS selector, you're forgetting to call getall() (or extract()), which is what actually turns your selector into the string data you can use.
Also, you shouldn't be printing to the standard output stream, because a lot of logging goes there and it will make your output file really messy. Instead, you should extract the responses into items, for example using item loaders.
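For illustration, a minimal item-loader sketch (BusinessItem and the yellow.items module are hypothetical; adapt them to whatever is defined in your own items.py):
from scrapy.loader import ItemLoader
from yellow.items import BusinessItem  # hypothetical project and item names

# ...inside your spider class:
def parse(self, response):
    for card in response.css('div.v-card'):
        loader = ItemLoader(item=BusinessItem(), selector=card)
        # add_css() collects the matching strings into the named field
        loader.add_css('name', 'a.business-name::text')
        loader.add_css('url', 'a.business-name::attr(href)')
        yield loader.load_item()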
Finally, you're probably missing the appropriate settings from your settings.py file. You can find the relevant documentation here.
FEED_FORMAT = "csv"
FEED_EXPORT_FIELDS = ["Field 1", "Field 2", "Field 3"]
I am looking at the Architecture Overview page in the Scrapy documentation, but I still have a few questions regarding data and/or control flow.
Scrapy Architecture
Default File Structure of Scrapy Projects
scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
which, I'm assuming, becomes
import scrapy


class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
so that errors are thrown when trying to populate undeclared fields of Product instances
>>> product = Product(name='Desktop PC', price=1000)
>>> product['lala'] = 'test'
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Question #1
Where, when, and how does our crawler become aware of items.py if we have created class CrowdfundingItem in items.py?
Is this done in...
__init__.py?
my_crawler.py?
def __init__() of mycrawler.py?
settings.py?
pipelines.py?
def __init__(self, dbpool) of pipelines.py?
somewhere else?
Question #2
Once I have declared an item such as Product, how do I then store the data by creating instances of Product in a context similar to the one below?
import scrapy
from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.webdriver.firefox.options import Options


class MycrawlerSpider(CrawlSpider):
    name = 'mycrawler'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/']

    def parse(self, response):
        options = Options()
        options.add_argument('-headless')
        browser = webdriver.Firefox(firefox_options=options)
        browser.get(self.start_urls[0])
        elements = browser.find_elements_by_xpath('//section')
        count = 0
        for ele in elements:
            name = browser.find_element_by_xpath('./div[@id="name"]').text
            price = browser.find_element_by_xpath('./div[@id="price"]').text
            # If I am not sure how many items there will be,
            # and hence cannot declare them explicitly,
            # how would I go about creating named instances of Product?
            # Obviously the line below will not work, but how can you accomplish this?
            count += 1
            varName + count = Product(name=name, price=price)
            ...
Lastly, say we forego naming the Product instances altogether, and instead simply create unnamed instances.
for ele in elements:
    name = browser.find_element_by_xpath('./div[@id="name"]').text
    price = browser.find_element_by_xpath('./div[@id="price"]').text
    Product(name=name, price=price)
If such instances are indeed stored somewhere, where are they stored? By creating instances this way, would it be impossible to access them?
Using an Item is optional; they're just a convenient way to declare your data model and apply validation. You can also use a plain dict instead.
If you do choose to use Item, you will need to import it into the spider; it is not discovered automatically. With the project layout above, that import would be:
from myproject.items import CrowdfundingItem
As a spider runs the parse method on each page, you can load the extracted data into your Item or dict. Once it's loaded, yield it, which passes it back to the scrapy engine for processing downstream, in pipelines or exporters. This is how scrapy enables "storage" of the data you scrape.
For example:
yield Product(name='Desktop PC', price=1000) # uses Item
yield {'name':'Desktop PC', 'price':1000} # plain dict
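To address Question #2 directly: you do not need a named variable per item; just yield each instance as you create it inside the loop and let the engine collect them downstream. A minimal sketch (the XPaths and fields are only illustrative):
def parse(self, response):
    # Yield one Product per matching block; no named variables are needed.
    for section in response.xpath('//section'):
        yield Product(
            name=section.xpath('.//div[@id="name"]/text()').extract_first(),
            price=section.xpath('.//div[@id="price"]/text()').extract_first(),
        )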
I have an up-and-running web scraper; the tickers are listed in a separate Excel document. I am using ScrapingHub's API because it is accessible anywhere and is very convenient. I want to write code that updates and scrapes from whatever is listed on the Excel sheet.
With my Excel list, how can I have my code update automatically (i.e. I add MSFT to my Excel sheet and this updates my code to include MSFT)?
Additionally, is there any way to have it deploy automatically?
--==Spider Code==--
**tickers appended in each link (search criteria)
import scrapy
import collections
from collections import OrderedDict
from scrapy.spiders import XMLFeedSpider
from tickers.items import tickersItem


class Spider(XMLFeedSpider):
    name = "NewsScraper"
    allowed_domains = ["yahoo.com"]
    start_urls = (
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=ABIO,ACFN,AEMD,AEZS,AITB',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=BGMD,BIOA',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=CANF,CBIO,CCCR',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=DRIO,DRWI,DXTR,ENCR',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=GNMX,GNUS,GPL,HIPP,HSGX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=MBOT,MBVX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=NBY,NNVC,NTRP',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=PGRX,PLXP',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=SANW,SBOT,SCON,SCYX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=UNXL,UQM,URRE',
    )
    itertag = 'item'

    def parse_node(self, response, node):
        item = collections.OrderedDict()
        item['Title'] = node.xpath('title/text()').extract_first()
        item['PublishDate'] = node.xpath('pubDate/text()').extract_first()
        item['Description'] = node.xpath('description/text()').extract_first()
        item['Link'] = node.xpath('link/text()').extract_first()
        return item
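For the spreadsheet part, one way is to build start_urls from the sheet when the spider starts, so each run picks up whatever tickers are currently listed. A minimal sketch, assuming pandas is available, the sheet is a one-column file named tickers.xlsx (hypothetical name), and the batch size of 20 tickers per feed URL is just an illustrative choice:

import pandas as pd  # reading .xlsx files also needs openpyxl installed

def build_start_urls(path='tickers.xlsx', chunk_size=20):
    # Read the first column of the sheet into a list of ticker symbols.
    tickers = pd.read_excel(path, header=None)[0].dropna().astype(str).tolist()
    base = 'https://feeds.finance.yahoo.com/rss/2.0/headline?s={}'
    # Group the tickers into comma-separated batches, one feed URL per batch.
    return [base.format(','.join(tickers[i:i + chunk_size]))
            for i in range(0, len(tickers), chunk_size)]

Setting start_urls = tuple(build_start_urls()) in the spider would then reflect whatever is currently in the sheet each time the spider is deployed and run.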
I'm trying to make a script that runs many spiders, but I'm getting ImportError: No module named project_name.settings.
My script looks like this:
import os
os.system("scrapy crawl spider1")
os.system("scrapy crawl spider2")
....
os.system("scrapy crawl spiderN")
My settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for project_name
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'project_name'
ITEM_PIPELINES = {
    'project_name.pipelines.project_namePipelineToJSON': 300,
    'project_name.pipelines.project_namePipelineToDB': 800
}
SPIDER_MODULES = ['project_name.spiders']
NEWSPIDER_MODULE = 'project_name.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'project_name (+http://www.yourdomain.com)'
And my spiders look like any normal spider, quite simple ones actually...
import scrapy
from scrapy.crawler import CrawlerProcess
from Projectname.items import ProjectnameItem


class ProjectnameSpiderClass(scrapy.Spider):
    name = "Projectname"
    allowed_domains = ["Projectname.com"]
    start_urls = ["...urls..."]

    def parse(self, response):
        item = ProjectnameItem()
I gave them generic names, but you get the idea. Is there a way to solve this error?
Edit 2018:
You need to run the spiders from inside the project folder, meaning that os.system("scrapy crawl spider1") has to be executed from the directory of the project that contains spider1 (i.e. the one with scrapy.cfg).
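A minimal sketch of that approach (the project path is hypothetical; point it at the directory that contains scrapy.cfg):

import subprocess

PROJECT_DIR = "/path/to/project_name"  # hypothetical: the folder with scrapy.cfg

for spider in ["spider1", "spider2", "spiderN"]:
    # Running with cwd set to the project directory lets Scrapy resolve
    # project_name.settings through scrapy.cfg.
    subprocess.run(["scrapy", "crawl", spider], cwd=PROJECT_DIR, check=True)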
Or you can do as I did in the past and put all the code in a single file (old answer; I no longer recommend it, but it is still a useful and decent solution).
Well, in case someone comes upon this question: I finally used a heavily modified version of https://gist.github.com/alecxe/fc1527d6d9492b59c610, provided by alecxe in another question. Hope this helps.