I'm trying to make a script that runs several spiders, but I'm getting ImportError: No module named project_name.settings
My script looks like this:
import os
os.system("scrapy crawl spider1")
os.system("scrapy crawl spider2")
....
os.system("scrapy crawl spiderN")
My settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for project_name
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'project_name'
ITEM_PIPELINES = {
    'project_name.pipelines.project_namePipelineToJSON': 300,
    'project_name.pipelines.project_namePipelineToDB': 800
}
SPIDER_MODULES = ['project_name.spiders']
NEWSPIDER_MODULE = 'project_name.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'project_name (+http://www.yourdomain.com)'
And my spiders look like any normal spider, quite simple ones actually...
import scrapy
from scrapy.crawler import CrawlerProcess
from Projectname.items import ProjectnameItem
class ProjectnameSpiderClass(scrapy.Spider):
    name = "Projectname"
    allowed_domains = ["Projectname.com"]
    start_urls = ["...urls..."]

    def parse(self, response):
        item = ProjectnameItem()
I gave them generic names, but you get the idea. Is there a way to solve this error?
Edit 2018:
You need to run the spiders from the project folder, meaning that os.system("scrapy crawl spider1") has to be executed from the folder that contains spider1 (i.e. the Scrapy project directory).
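For example, a minimal sketch of the runner script with an explicit working directory (the path below is a placeholder, not taken from the question):

import os

# Placeholder path: replace with your Scrapy project root (the folder containing scrapy.cfg)
os.chdir("/path/to/project_name")

os.system("scrapy crawl spider1")
os.system("scrapy crawl spider2")
os.system("scrapy crawl spiderN")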
Or you can do as I did in the past and put all the code in a single file (old answer, not recommended by me anymore, but still a useful and decent solution).
Well, in case someone comes across this question: I ended up using a heavily modified version of this gist https://gist.github.com/alecxe/fc1527d6d9492b59c610 provided by alecxe in another question. Hope this helps.
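A common way to run several spiders from a single script (in the spirit of that old single-file answer) is Scrapy's CrawlerProcess; a rough sketch, with placeholder spider imports, run from inside the project so get_project_settings() can find settings.py:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Placeholder imports: replace with your actual spider classes
from project_name.spiders.spider1 import Spider1
from project_name.spiders.spider2 import Spider2

process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # blocks here until all crawls are finished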
[I'm working on OS X.]
I'll post the relevant portions of my program below:
Spider:
# -*- coding: utf-8 -*-
import scrapy
import pandas as pd
from scrapy.http import Request

from ..items import Homedepotv2Item


class HomedepotspiderSpider(scrapy.Spider):
    name = 'homeDepotSpider'
    allowed_domains = ['homedepot.com']

    # pathName: path to the CSV file with the product data
    pathName = '/Users/user/Desktop/homeDepotv2Helpers/homeDepotInfo.csv'
    # skiprows: the first row is occupied by the title, we don't need that
    export = pd.read_csv(pathName, skiprows=[0], header=None)
    omsList = export.values.T[1].tolist()  # transpose the matrix + get the second column

    start_urls = ['https://www.homedepot.com/p/{omsID}'.format(omsID=omsID)
                  for omsID in omsList]

    def parse(self, response):
        # call the Home Depot parsing helper
        for item in self.parseHomeDepot(response):
            yield item
Settings:
BOT_NAME = 'homeDepotv2'
SPIDER_MODULES = ['homeDepotv2.spiders']
NEWSPIDER_MODULE = 'homeDepotv2.spiders'
When I try running my spider by using the command: scrapy crawl homeDepotSpider
I get this error ModuleNotFoundError: No module named 'homeDepotv2'
Initially I thought it was a directory problem, so instead of using cd to navigate there I pasted in the path of the spider's directory, which was
/Users/userName/homeDepotv2_Spider/build/lib/homeDepotv2
However that still returned the same error.
Not too sure what's wrong here, so any help would be appreciated!
And here is the file hierarchy:
Check this video: Path append | how to fix "Module not found" with Scrapy items.py
I had the same problem; the solution is to use:
from sys import path
path.append('/Users/userName/homeDepotv2_Spider')
You may need to check/modify the path, as Scrapy creates two directories with the same name.
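In other words, something like this at the top of the spider file, before the items import (the path is the one from the question; adjust it to your own project root):

from sys import path
path.append('/Users/userName/homeDepotv2_Spider')  # directory that contains the homeDepotv2 package

from homeDepotv2.items import Homedepotv2Item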
I am new to Scrapy, and I built a simple spider that scrapes my local news site for titles and the number of comments. It scrapes well, but I have a problem with the language encoding.
I have created a Scrapy project that I then run through anaconda prompt to save the output to a file like so (from the project directory):
scrapy crawl MySpider -o test.csv
When I then open the exported file with the following code:
with open('test.csv', 'r', encoding="L2") as f:
    file = f.read()
I also tried saving it to JSON, opening it in Excel, and changing to different encodings from there... always unreadable, though the garbled characters differ. I am Czech, if that is relevant; I need characters like ěščřžýáíé etc., but what I get is Latin.
What I get: Varuje pĹ\x99ed
What I want: Varuje před
Here is my spider code. I did not change anything in the settings or the pipeline, though I tried multiple tips from other threads that do so. I have already spent two hours on this, browsing Stack Overflow and the documentation, and I can't find the solution; it's becoming a headache. I'm not a programmer, so that may be the reason... anyway:
import scrapy

urls = []
for number in range(1, 101):
    urls.append('https://www.idnes.cz/zpravy/domaci/' + str(number))


class MySpider(scrapy.Spider):
    name = "MySpider"

    def start_requests(self):
        urls = ['https://www.idnes.cz/zpravy/domaci/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_main)

    def parse_main(self, response):
        articleBlocks = response.xpath('//div[contains(@class, "art")]')
        articleLinks = articleBlocks.xpath('.//a[@class="art-link"]/@href')
        linksToFollow = articleLinks.extract()
        for url in linksToFollow:
            yield response.follow(url=url, callback=self.parse_arts)
            print(url)

    def parse_arts(self, response):
        for article in response.css('div#content'):
            yield {
                'title': article.css('h1::text').get(),
                'comments': article.css('li.community-discusion > a > span::text').get(),
            }
Scrapy saves feed exports with utf-8 encoding by default.
Opening the file with the correct encoding displays the characters fine.
If you want to change the encoding used, you can do so by using the FEED_EXPORT_ENCODING setting (or using FEEDS instead).
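For example, a minimal sketch (the values are just illustrations):

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'   # or 'utf-8-sig' if Excel should detect the encoding via a BOM

# reading the exported file back with the matching encoding
with open('test.csv', 'r', encoding='utf-8') as f:
    data = f.read()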
After one more hour of trial and error, I solved this. The problem was not in Scrapy; it was saving correctly in UTF-8. The problem was in the command:
scrapy crawl idnes_spider -o test.csv
that I ran to save it. When I run the command:
scrapy crawl idnes_spider -s FEED_URI=test.csv -s FEED_FORMAT=csv
It works.
(This is my items.py)
import scrapy
class FreelanceItem(scrapy.Item):
    url = scrapy.Field()
    url = scrapy.Field()
When I start another Python file and import the package:
import scrapy
from scrapy.item import Item , Field
from freelance.items import FreelanceItem
I get this :
ModuleNotFoundError: No module named 'freelance'
What should I do?
Thanks.
You're accessing it the wrong way.
Let's say you are in a directory called PythonTest, where you also have your main.py file.
Steps:
Create a folder named "freelance" in this PythonTest directory
Add an empty file named "__init__.py" in this directory (the freelance dir); this tells Python it is a package
Add your items.py file to this directory as well
Now go to your 'main.py' and add the line:
from freelance.items import FreeLanceItem
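With those steps done, the layout would look roughly like this (names taken from the steps above):

PythonTest/
    main.py
    freelance/
        __init__.py
        items.py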
Also make sure to have correct indentation in your code (see below):
import scrapy

class FreeLanceItem(scrapy.Item):
    url = scrapy.Field()
    url = scrapy.Field()
Running the code should no longer produce an error.
Let me know if this helped!
What am I doing wrong with the script that it isn't outputting a CSV file with the data? I am running the script with scrapy runspider yellowpages.py -o items.csv and nothing comes out but a blank CSV file. I have followed different answers here and also watched YouTube videos trying to figure out where I am making the mistake, and I still cannot figure out what I am doing wrong.
# -*- coding: utf-8 -*-
import scrapy
import requests
search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q)
page = page.url
items = ()
class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]

    def parse(self, response):
        self.log("I just visited: " + response.url)
        items = response.css('a[class=business-name]::attr(href)')
        for item in items:
            print(item)
A simple spider without a project.
Use my code; I wrote comments to make it easier to understand. This spider looks for all blocks on all pages for a given pair of "service" and "location" parameters. To run it, use:
In your case:
scrapy runspider yellowpages.py -a servise="Plumbers" -a location="Hammond, LA" -o Hammondsplumbers.csv
The code will also work with any queries. For example:
scrapy runspider yellowpages.py -a servise="Doctors" -a location="California, MD" -o MDDoctors.json
etc...
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.exceptions import CloseSpider


class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/']

    # We can use any pair of servise + location in our request
    def __init__(self, servise=None, location=None):
        self.servise = servise
        self.location = location

    def parse(self, response):
        # If "servise" and "location" are defined
        if self.servise and self.location:
            # Create the search URL using "servise" and "location"
            search_url = 'search?search_terms={}&geo_location_terms={}'.format(self.servise, self.location)
            # Send a request to "yellowpages.com" + "search_url", then call parse_result
            yield Request(url=response.urljoin(search_url), callback=self.parse_result)
        else:
            # Else close our spider
            # You could add default values here if you want.
            self.logger.warning('=== Please use keys -a servise="service_name" -a location="location" ===')
            raise CloseSpider()

    def parse_result(self, response):
        # All blocks without AD posts
        posts = response.xpath('//div[@class="search-results organic"]//div[@class="v-card"]')
        for post in posts:
            yield {
                'title': post.xpath('.//span[@itemprop="name"]/text()').extract_first(),
                'url': response.urljoin(post.xpath('.//a[@class="business-name"]/@href').extract_first()),
            }

        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        # If we have a next page url
        if next_page:
            # Send a request to "yellowpages.com" + "next_page", then call parse_result again
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_result)
for item in items:
    print(item)

Put yield instead of print there:

for item in items:
    yield item
On inspection of your code, I notice a number of problems:
First, you initialize items to a tuple, when it should be a list: items = [].
You should change your name property to reflect the name you want on your crawler so you can use it like so: scrapy crawl my_crawler where name = "my_crawler".
start_urls is supposed to contain strings, not Request objects. You should change the entry from page to the exact search string you want to use. If you have a number of search strings and want to iterate over them, I would suggest using a middleware.
When you extract the data from the CSS selector you're forgetting to call getall() (or extract()), which actually turns your selector into the string data you can use.
Also, you shouldn't be printing to the standard output stream, because a lot of logging goes there and it will make your output file really messy. Instead, you should extract the responses into items, for example with item loaders (see the sketch after this list).
Finally, you're probably missing the appropriate settings from your settings.py file. You can find the relevant documentation here.
FEED_FORMAT = "csv"
FEED_EXPORT_FIELDS = ["Field 1", "Field 2", "Field 3"]
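As a rough illustration of the item/loader suggestion above (YellowpageItem and its fields are made up for this sketch, not taken from the question):

import scrapy
from scrapy.loader import ItemLoader


class YellowpageItem(scrapy.Item):
    # Hypothetical item holding the fields you want exported
    name = scrapy.Field()
    url = scrapy.Field()


# Inside your spider class:
def parse(self, response):
    for business in response.css('a.business-name'):
        loader = ItemLoader(item=YellowpageItem(), selector=business)
        loader.add_css('name', '::text')        # business name text
        loader.add_css('url', '::attr(href)')   # relative link to the listing
        yield loader.load_item()                # note: without output processors each field is a list of values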
So I'm playing around with Scrapy, which is a set of classes that lets you do web scraping, and I wanted to throw some data into a database, but I'm having trouble using the MySQL methods while extending the Scrapy spider class.
Here is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
import MySQLdb


class test(BaseSpider):
    # if I don't extend the class, the MySQL part works, but the Scrapy functionality does not
    name = "test"
    allowed_domains = ["some-website.com"]  # I know this is probably not a real website... just using it as an example
    start_urls = [
        "http://some-website.com",
    ]

    db = MySQLdb.connect(
        host='localhost',
        user='root',
        passwd='',
        db='scrap'
    )
    #cursor = db.cursor()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for title in hxs.select('//a[@class="title"]/text()').extract():
            print title
            cursor.execute("INSERT INTO `scrap`.`shows` (id, title) VALUES (NULL , '" + title + "');")
I am still a noob at Python, so any help would be greatly appreciated.
Something is wrong with your architecture.
A spider's job is to parse pages, extract data, and put it into an Item. It is a pipeline's job to save the data from an Item to a database:
Typical use for item pipelines are:
cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database
So, make a pipeline and register its path in ITEM_PIPELINES in settings.py, then work with the DB in that pipeline.
I think you need to read the tutorial and see the examples.
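A rough sketch of such a pipeline, keeping the question's MySQLdb driver (the table name comes from the question, everything else is illustrative):

import MySQLdb


class MySQLStorePipeline(object):
    def open_spider(self, spider):
        # Connection details are placeholders; adjust to your setup
        self.db = MySQLdb.connect(host='localhost', user='root', passwd='', db='scrap')
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.commit()
        self.db.close()

    def process_item(self, item, spider):
        # Parameterized query; assumes `id` is auto-increment and the item has a 'title' field
        self.cursor.execute("INSERT INTO shows (title) VALUES (%s)", (item['title'],))
        return item

And enable it in settings.py (the module path is a placeholder):

ITEM_PIPELINES = {'myproject.pipelines.MySQLStorePipeline': 300}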
Maybe you should define self.cursor?
That way the cursor will be accessible in the class methods.
I do not know much about Scrapy, but most probably you should do that in the __init__ method or in a get_cursor method of the class test (which you should probably rename to Test).
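If you do want to keep the cursor on the spider itself (rather than in a pipeline), a minimal sketch of that suggestion, reusing the question's old-style Scrapy imports and connection details:

import MySQLdb
from scrapy.spider import BaseSpider


class Test(BaseSpider):
    name = "test"

    def __init__(self, *args, **kwargs):
        super(Test, self).__init__(*args, **kwargs)
        # Connection details as in the question; adjust to your setup
        self.db = MySQLdb.connect(host='localhost', user='root', passwd='', db='scrap')
        self.cursor = self.db.cursor()

    def parse(self, response):
        # ... extract titles here, then e.g.:
        # self.cursor.execute("INSERT INTO shows (title) VALUES (%s)", (title,))
        pass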