Printing scrapy data to csv - python

Hi, I started using Scrapy recently and wrote a crawler. But when I output the data to CSV, everything is printed in a single row. How can I print each value on its own row?
In my case I am printing links from a website. It works well when printed in JSON format.
Here's the code.
The items.py file:
import scrapy
from scrapy.item import Item, Field
class ErcessassignmentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    link = Field()
    #pass
The mycrawler.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector  # deprecated
from scrapy.selector import Selector
from ercessAssignment.items import ErcessassignmentItem
class MySpider(BaseSpider):
    name = "ercessSpider"
    allowed_domains = ["site_url"]
    start_urls = ["site_url"]
    def parse(self, response):
        hxs = Selector(response)
        links = hxs.xpath("//p")
        items = []
        for linkk in links:
            item = ErcessassignmentItem()
            item["link"] = linkk.xpath("//a/@href").extract()
            items.append(item)
            return items

You should have proper indentation in code
import scrapy
from scrapy.item import Item, Field
class ErcessassignmentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    link = Field()
Then, in your spider, do not use return inside the loop: the for loop will run only once and only one row will be written to the CSV. Use yield instead.
Second, where is your code that writes the items to CSV? I guess you are using Scrapy's default way of storing items.
In case you do not already know, run your scraper like this:
scrapy crawl ercessSpider -o my_output.csv
Your spider code should look like this (notice the changes I made):
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector  # deprecated
from scrapy.selector import Selector
from ercessAssignment.items import ErcessassignmentItem
class MySpider(BaseSpider):
    name = "ercessSpider"
    allowed_domains = ["site_url"]
    start_urls = ["site_url"]
    def parse(self, response):
        hxs = Selector(response)
        links = hxs.xpath("//p")
        for linkk in links:
            item = ErcessassignmentItem()
            item["link"] = linkk.xpath("//a/@href").extract()
            yield item
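One thing to watch for even after switching to yield: extract() returns a list, and as far as I know Scrapy's CSV exporter joins a list-valued field into a single cell. Here is a minimal sketch of the difference, using only the standard csv module on Python 3 rather than Scrapy itself. The link values and the helper names (to_csv, one_row_per_link) are made up for illustration.

```python
import csv
import io

def to_csv(rows):
    """Write a list of {"link": ...} dicts to CSV text and return it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["link"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def one_row_per_link(links):
    """Build one item per link, so each link gets its own CSV row."""
    return [{"link": link} for link in links]

links = ["/a", "/b", "/c"]  # hypothetical hrefs

# One item holding the whole list: every link lands in the same cell.
single_cell = to_csv([{"link": ",".join(links)}])

# One item per link: one row per link.
per_row = to_csv(one_row_per_link(links))

print(single_cell)
print(per_row)
```

In a spider, the equivalent would be to loop over the extracted hrefs and yield one item for each.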

for href in response.xpath("//a/@href").extract():
    item = ErcessassignmentItem()
    item["link"] = href  # a single link per item, so a single link per row
    yield item
This works well with a CSS selector too; if the two solutions above are not working, you can try this.

Your code above does not print anything. Moreover, I don't see any .csv part. Also, the items list created in parse() will never be longer than 1 because of what looks like an indentation error to me (i.e. you return after the first iteration of the for loop). For better readability, you could use the for/else construct here:
def parse(self, response):
    hxs = Selector(response)
    links = hxs.xpath("//p")
    items = []
    for linkk in links:
        item = ErcessassignmentItem()
        item["link"] = linkk.xpath("//a/@href").extract()
        items.append(item)
    else:  # after for loop is finished
        # either return items
        # or print link in items here without returning
        for link in items:  # take one link after another
            print(link)  # and print it in one line each
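The effect of a return inside the loop body can be shown without Scrapy. This is a small sketch in plain Python; the function names and the sample links are hypothetical.

```python
def parse_with_return_in_loop(links):
    """Mimics the indentation error: return sits inside the for body."""
    items = []
    for link in links:
        items.append({"link": link})
        return items  # leaves the function during the first iteration
    return items  # only reached when links is empty

def parse_with_yield(links):
    """Generator version: hands back one item per link."""
    for link in links:
        yield {"link": link}

sample = ["/a", "/b", "/c"]  # made-up links
print(len(parse_with_return_in_loop(sample)))
print(len(list(parse_with_yield(sample))))
```

The first function never sees the second and third links, while the generator produces an item for each of them.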

Related

Scrapy Spider cannot Extract contents of web page using xpath

I have a Scrapy spider and I am using XPath selectors to extract the contents of the page. Kindly check where I am going wrong:
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import HtmlXPathSelector
from medicalproject.items import MedicalprojectItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import Request
class MySpider(CrawlSpider):
    name = "medical"
    allowed_domains = ["yananow.org"]
    start_urls = ["http://yananow.org/query_stories.php"]
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'display_story.php\?\id\=\d+']),callback='parse_page',follow=True),
    )
    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td')
        items = []
        for title in titles:
            item = MedicalprojectItem()
            item["patient_name"] = title.xpath("/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td/img[1]/text()").extract()
            item["stories"] = title.xpath("/html/body/div/table/tbody/tr[2]/td/table/tbody/tr/td/div/font/p/text()").extract()
            items.append(item)
        return(items)
There are a lot of issues with your code so here is a different approach.
I opted against a CrawlSpider to have more control over the scraping process. Especially with grabbing the name from the query page and the story from a detail page.
I tried to simplify the XPath statements by not diving into the (nested) table structures but looking for patterns of content. So if you want to extract a story ... there must be a link to a story.
Here comes the tested code (with comments):
# -*- coding: utf-8 -*-
import scrapy
class MyItem(scrapy.Item):
    name = scrapy.Field()
    story = scrapy.Field()
class MySpider(scrapy.Spider):
    name = 'medical'
    allowed_domains = ['yananow.org']
    start_urls = ['http://yananow.org/query_stories.php']
    def parse(self, response):
        rows = response.xpath('//a[contains(@href,"display_story")]')
        #loop over all links to stories
        for row in rows:
            myItem = MyItem() # Create a new item
            myItem['name'] = row.xpath('./text()').extract() # assign name from link
            story_url = response.urljoin(row.xpath('./@href').extract()[0]) # extract url from link
            request = scrapy.Request(url = story_url, callback = self.parse_detail) # create request for detail page with story
            request.meta['myItem'] = myItem # pass the item with the request
            yield request
    def parse_detail(self, response):
        myItem = response.meta['myItem'] # extract the item (with the name) from the response
        text_raw = response.xpath('//font[@size=3]//text()').extract() # extract the story (text)
        myItem['story'] = ' '.join(map(unicode.strip, text_raw)) # clean up the text and assign to item
        yield myItem # return the item
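Note that the clean-up step above relies on unicode.strip, which only exists on Python 2. A possible Python 3 counterpart is sketched below. The helper name clean_story and the sample fragments are made up, and unlike the original line this version also drops fragments that are empty after stripping, so it is not an exact equivalent.

```python
def clean_story(text_raw):
    """Strip each extracted text fragment and join them with single spaces."""
    stripped = (fragment.strip() for fragment in text_raw)
    # Skip fragments that were whitespace only, to avoid double spaces.
    return " ".join(fragment for fragment in stripped if fragment)

fragments = ["  My story starts here. ", "\n", "It continues here.\n"]  # hypothetical
print(clean_story(fragments))
```

In parse_detail this would be used as myItem['story'] = clean_story(text_raw).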

SgmlLinkExtractor not displaying results or following link

I am having problems fully understanding how SGML Link Extractor works. When making a crawler with Scrapy, I can successfully extract data from links using specific URLS. The problem is using Rules to follow a next page link in a particular URL.
I think the problem lies in the allow() attribute. When the Rule is added to the code, the results do not display in the command line and the link to the next page is not followed.
Any help is greatly appreciated.
Here is the code...
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.spiders import Rule
from tutorial.items import TutorialItem
class AllGigsSpider(CrawlSpider):
    name = "allGigs"
    allowed_domains = ["http://www.allgigs.co.uk/"]
    start_urls = [
        "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html",
        "http://www.allgigs.co.uk/whats_on/London/festivals-1.html",
        "http://www.allgigs.co.uk/whats_on/London/comedy-1.html",
        "http://www.allgigs.co.uk/whats_on/London/theatre_and_opera-1.html",
        "http://www.allgigs.co.uk/whats_on/London/dance_and_ballet-1.html"
    ]
    rules = (Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//div[@class="more"]',)), callback="parse_me", follow= True),
    )
    def parse_me(self, response):
        hxs = HtmlXPathSelector(response)
        infos = hxs.xpath('//div[@class="entry vevent"]')
        items = []
        for info in infos:
            item = TutorialItem()
            item ['artist'] = hxs.xpath('//span[@class="summary"]//text()').extract()
            item ['date'] = hxs.xpath('//abbr[@class="dtstart dtend"]//text()').extract()
            item ['endDate'] = hxs.xpath('//abbr[@class="dtend"]//text()').extract()
            item ['startDate'] = hxs.xpath('//abbr[@class="dtstart"]//text()').extract()
            items.append(item)
        return items
        print items
The problem is in the restrict_xpaths - it should point to a block where a link extractor should look for links. Don't specify allow at all:
rules = [
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'),
         callback="parse_me",
         follow=True),
]
And you need to fix your allowed_domains:
allowed_domains = ["www.allgigs.co.uk"]
Also note that the print items statement in the parse_me() callback is unreachable, since it comes after the return statement. And, inside the loop, you should not apply the XPath expressions to hxs (the whole page); apply them to info, so that they are evaluated in the context of the current entry. You can also simplify parse_me():
def parse_me(self, response):
    for info in response.xpath('//div[@class="entry vevent"]'):
        item = TutorialItem()
        item['artist'] = info.xpath('.//span[@class="summary"]//text()').extract()
        item['date'] = info.xpath('.//abbr[@class="dtstart dtend"]//text()').extract()
        item['endDate'] = info.xpath('.//abbr[@class="dtend"]//text()').extract()
        item['startDate'] = info.xpath('.//abbr[@class="dtstart"]//text()').extract()
        yield item
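On the allowed_domains fix: the setting expects bare host names, not full URLs. If you only have the URL at hand, the host name can be pulled out with the standard library. This is a small Python 3 sketch; the helper name host_for_allowed_domains is made up.

```python
from urllib.parse import urlparse

def host_for_allowed_domains(url):
    """Return only the host part of a URL (no scheme, no path)."""
    return urlparse(url).hostname

print(host_for_allowed_domains("http://www.allgigs.co.uk/"))
print(host_for_allowed_domains("http://www.allgigs.co.uk/whats_on/London/comedy-1.html"))
```

Both calls give the same host, which is the value to put in allowed_domains.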

HtmlXPathSelector for Scrapy returning null results

I just started learning python / Scrapy. I was able to follow tutorials successfully but I am struggling with a 'test' scraping that I want to do on my own.
What I am trying to do now is go on http://jobs.walmart.com/search/finance-jobs and scrape the job listing.
However, I think I may be doing something wrong in the XPath, but I am not sure what.
There is no "id" for that table, so I am using its class.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class MySpider(BaseSpider):
    name = "walmart"
    allowed_domains = ["jobs.walmart.com"]
    start_urls = ["http://jobs.walmart.com/search/finance-jobs"]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//table[@class='tableSearchResults']")
        items = []
        for titles in titles:
            item = walmart()
            item ["title"] = titles.select("a/text()").extract()
            item ["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items
here is what the page source looks like:
The problem, as you said, is your XPath. It is always useful to run:
scrapy view http://jobs.walmart.com/search/finance-jobs
before running your spider, to see what the website looks like from Scrapy's point of view.
This should work now:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class MySpider(BaseSpider):
    name = "walmart"
    allowed_domains = ["jobs.walmart.com"]
    start_urls = ["http://jobs.walmart.com/search/finance-jobs"]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//table[@class='tableSearchResults']/tr")
        items = []
        for title in titles:
            if title.select("td[@class='td1']/a").extract():
                item = walmart()  # new item per row, so rows do not overwrite each other
                item ["title"] = title.select("td[@class='td1']/a/text()").extract()
                item ["link"] = title.select("td[@class='td1']/a/@href").extract()
                items.append(item)
        return items
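One detail worth checking in code like this is where the item is created. A new item is needed for every row; if a single object is created once and reused, every entry in the list refers to that same object and ends up showing the last row's values. A plain-Python sketch of the effect, using dicts in place of Scrapy items (the function names and titles are made up):

```python
def collect_with_shared_item(titles):
    """Creates the item once, outside the loop, and keeps overwriting it."""
    item = {}
    items = []
    for title in titles:
        item["title"] = title
        items.append(item)  # appends the same object every time
    return items

def collect_with_fresh_items(titles):
    """Creates a new item inside the loop for every title."""
    items = []
    for title in titles:
        item = {"title": title}
        items.append(item)
    return items

job_titles = ["Analyst", "Controller"]  # hypothetical values
print(collect_with_shared_item(job_titles))
print(collect_with_fresh_items(job_titles))
```

With the shared item, both list entries show the last title; with fresh items, each entry keeps its own.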

Scraping with scrapy

I am trying to dig a little deeper with Scrapy, but I can only get the title of what I am scraping and not any of the details. Here is the code that I have so far:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tcgplayer1.items import Tcgplayer1Item
class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["http://www.tcgplayer.com/"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div[@class='magicCard']")
        vendor = hxs.select("//tr[@class='vendor']")
        items = []
        for titles in titles:
            item = Tcgplayer1Item()
            item ["cardname"] = titles.select("//li[@class='cardName']/a/text()").extract()
            item ["price"] = vendor.select("//td[@class='price']/br/text()").extract()
            item ["quantity"] = vendor.select("//td[@class='quantity']/td/text()").extract()
            items.append(item)
        return items
I cannot get the price and quantity to show any results. Each card has several vendors, each with their own prices and quantities. I think that is where I am having problems. Any help will be greatly appreciated.
First of all, here's the fixed version of the code:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tcgplayer1.items import Tcgplayer1Item
class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["http://www.tcgplayer.com/"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]
    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='magicCard']")
        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]
            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            yield item
There were multiple issues with the code:
the vendor class name needs to contain a trailing space: "vendor " - it was tricky to find
there are multiple vendors per-item - you need to define vendor inside the loop
you are redefining titles variable in the loop
xpath expressions in the loop should be relative .//
use Selector instead of deprecated HtmlXPathSelector
use xpath() instead of deprecated select()
use normalize-space() to eliminate new-lines and extra spaces in price and quantity xpaths
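Two of the points above, the trailing space in the class name and normalize-space(), can be illustrated in plain Python. This is only a rough sketch: the helper names and the raw price string are made up, and Python's split() treats a few more characters as whitespace than XPath does.

```python
def normalize_space(text):
    """Rough Python counterpart of XPath normalize-space():
    trims the ends and collapses inner whitespace runs to one space."""
    return " ".join(text.split())

def has_class(class_attribute, wanted):
    """True if `wanted` is one of the space-separated class names."""
    return wanted in class_attribute.split()

raw_price = "\n        $1.25\n    "  # hypothetical text node from a price cell
print(repr(normalize_space(raw_price)))

# @class='vendor' is an exact string comparison, so a trailing space matters.
print("vendor " == "vendor")
print(has_class("vendor ", "vendor"))
```

Comparing class names token by token, as has_class does, avoids depending on stray spaces in the attribute value.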
First, you could change
item ["price"] = vendor.select("//td[@class='price']/br/text()").extract()
item ["quantity"] = vendor.select("//td[@class='quantity']/td/text()").extract()
To:
item ["price"] = titles.select(".//td[@class='price']/br/text()").extract()
item ["quantity"] = titles.select(".//td[@class='quantity']/td/text()").extract()
This ensures you only get the price and quantity rows for the card you want, because the leading dot makes the expression relative to the current card.
You may also have to remove the /br and /td from the selectors, so your code would look like this:
item ["price"] = titles.select(".//td[@class='price']/text()").extract()
item ["quantity"] = titles.select(".//td[@class='quantity']/text()").extract()

scrapy xpath selector repeats data

I am trying to extract the business name and address from each listing and export them to a CSV file, but I am having problems with the output CSV. I think bizs = hxs.select("//div[@class='listing_content']") may be causing the problems.
yp_spider.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from yp.items import Biz
class MySpider(BaseSpider):
    name = "ypages"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/sanfrancisco/restaraunts"]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        bizs = hxs.select("//div[@class='listing_content']")
        items = []
        for biz in bizs:
            item = Biz()
            item['name'] = biz.select("//h3/a/text()").extract()
            item['address'] = biz.select("//span[@class='street-address']/text()").extract()
            print item
            items.append(item)
items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html
from scrapy.item import Item, Field
class Biz(Item):
    name = Field()
    address = Field()
    def __str__(self):
        return "Website: name=%s address=%s" % (self.get('name'), self.get('address'))
The output from 'scrapy crawl ypages -o list.csv -t csv' is a long list of business names then locations and it repeats the same data several times.
You should add a "." to make the XPath relative. Here is the relevant passage from the Scrapy documentation (http://doc.scrapy.org/en/0.16/topics/selectors.html):
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
>>> for p in divs.select('//p'): # this is wrong - gets all <p> from the whole document
>>>     print p.extract()
This is the proper way to do it (note the dot prefixing the .//p XPath):
>>> for p in divs.select('.//p'): # extracts all <p> inside
>>>     print p.extract()
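The repeated output can be reproduced outside Scrapy. The sketch below uses the standard library's ElementTree on Python 3 instead of Scrapy selectors, and made-up markup with two listings. ElementTree does not accept a leading // on an element, so the document-wide search is run from the root to imitate what //h3 does in Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two listings.
PAGE = """<html><body>
<div class="listing_content"><h3>Alpha Cafe</h3></div>
<div class="listing_content"><h3>Beta Diner</h3></div>
</body></html>"""

root = ET.fromstring(PAGE)
listings = root.findall(".//div[@class='listing_content']")

def names_document_wide():
    """Per listing, search the whole document (what '//h3' does)."""
    return [[h3.text for h3 in root.findall(".//h3")] for _ in listings]

def names_relative():
    """Per listing, search only inside that listing (what './/h3' does)."""
    return [[h3.text for h3 in listing.findall(".//h3")] for listing in listings]

print(names_document_wide())
print(names_relative())
```

The document-wide version returns every name for every listing, which matches the repeated data in the CSV; the relative version returns one name per listing.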
