Save Scrapy 'start_urls' and store properly in a Data Frame - python

I am using Scrapy to scrape some website data, but I can't work out how to get the data into the right shape.
This is the output of my code (see code below):
In the command line:
scrapy crawl myspider -o items.csv
Output:
asin_product product_name
ProductA,,,ProductB,,,ProductC,,, BrandA,,,BrandB,,,BrandC,,,
ProductA,,,ProductD,,,ProductE,,, BrandA,,,BrandB,,,BrandA,,,
# Note that each row represents one of the start_urls and that the three
# commas ',,,' separate the individual values.
Desired output:
scrapy crawl myspider -o items.csv
Start_URL asin_product product_name
URL1 ProductA BrandA
URL1 ProductB BrandB
URL1 ProductC BrandC
URL2 ProductA BrandA
URL2 ProductD BrandB
URL2 ProductE BrandA
My Scrapy code:
import scrapy
from amazon.items import AmazonItem

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    # Use working product URL below
    start_urls = [
        "https://www.amazon.com/s?k=shoes&ref=nb_sb_noss_2",  # This should be URL 1
        "https://www.amazon.com/s?k=computer&ref=nb_sb_noss_2",  # This should be URL 2
    ]

    def parse(self, response):
        items = AmazonItem()
        title = response.xpath('//*[@class="a-size-base-plus a-color-base a-text-normal"]/text()').extract()
        asin = response.xpath('//*[@class ="a-link-normal"]/@href').extract()
        # Note that I divided the products with ',,,' to make it easy to separate
        # them. I am aware that this is not the best approach.
        items['product_name'] = ',,,'.join(title).strip()
        items['asin_product'] = ',,,'.join(asin).strip()
        yield items

First of all, it's recommended to use CSS selectors when querying by class.
Now to your code:
The product name is inside the a tag (the product URL), so you can iterate through the links and store both the URL and the title.
<a class="a-link-normal a-text-normal" href="/adidas-Mens-Lite-Racer-Running/dp/B071P19D3X/ref=sr_1_3?keywords=shoes&qid=1554132536&s=gateway&sr=8-3">
<span class="a-size-base-plus a-color-base a-text-normal">Adidas masculina Lite Racer byd tênis de corrida</span>
</a>
You need to create one AmazonItem object per line in your CSV file.
def parse(self, response):
    # You need to improve this CSS selector because some of the links are
    # not products; this is why I check whether title is None and continue.
    for product in response.css('a.a-link-normal.a-text-normal'):
        # product is a selector
        title = product.css('span.a-size-base-plus.a-color-base.a-text-normal::text').get()
        if not title:
            continue
        # The selector is already the a tag, so we only need to extract its href attribute value.
        asin = product.xpath('./@href').get()
        item = AmazonItem()
        item['product_name'] = title.strip()
        item['asin_product'] = asin.strip()
        yield item

Make the start_url available in the parse method
Instead of using start_urls, you can yield your initial requests from a method named start_requests (see https://docs.scrapy.org/en/latest/intro/tutorial.html?highlight=start_requests#our-first-spider).
With each request you can pass the start URL as meta data. This meta data is then available within your parse method (see https://docs.scrapy.org/en/latest/topics/request-response.html?highlight=meta#scrapy.http.Request.meta).
def start_requests(self):
    urls = [...]  # this is equal to your start_urls
    for start_url in urls:
        # Pass the loop variable itself as the URL and carry it along in meta.
        yield scrapy.Request(url=start_url, meta={"start_url": start_url})

def parse(self, response):
    start_url = response.meta["start_url"]
Yield multiple items, one for each product
Instead of joining titles and brands, you can yield several items from parse. For the example below I assume the lists title and asin have the same length.
# Distinct loop names avoid shadowing the title and asin lists.
for product_title, product_asin in zip(title, asin):
    item = AmazonItem()
    item['product_name'] = product_title
    item['asin_product'] = product_asin
    yield item
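As a Scrapy-free illustration of the row shape this produces, the sketch below pairs each title with its link and tags every row with the start URL it came from. The sample values and the helper name rows_for_page are made up for the example; in a spider the two lists would come from the selectors and start_url from response.meta.

```python
import csv
import io


def rows_for_page(start_url, titles, asins):
    """Build one dict per product, each tagged with the page's start URL."""
    return [
        {"Start_URL": start_url, "asin_product": asin, "product_name": title}
        for title, asin in zip(titles, asins)
    ]


# Hypothetical data standing in for what the selectors would return.
rows = rows_for_page("URL1", ["BrandA", "BrandB"], ["ProductA", "ProductB"])

# Write the rows to an in-memory CSV to show the resulting layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Start_URL", "asin_product", "product_name"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because zip stops at the shorter list, a length mismatch silently drops products, which is why looping over each product link (as in the per-link parse method earlier in this answer) is usually the more robust choice.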
PS: You should check Amazon's robots.txt. They might not allow you to scrape their site and could ban your IP (https://www.amazon.de/robots.txt).

Related

Python Scrapy - saving a 'category' for each entry based on first webpage

I am scraping BBC food for recipes. The logic is as follows:
Main page with about 20 cuisines
-> in each cuisine, there's usually ~20 recipes on 1-3 pages for each letter.
-> in each recipe, there are about 6 things I scrape (ingredients, rating etc.)
Therefore, my logic is: get to main page, create request, extract all cuisine links, then follow each, from there extract each page of recipes, follow each recipe link, and from each recipe finally get all data. Note this is not finished yet as I need to implement the spider to also go through all letters.
I would love to have a 'category' column, i.e. for each recipe from the "african cuisine" link the column should say "african", for each recipe from the "italian cuisine" link it should say "italian", and so on.
Desired outcome:
cook_time prep_time name cuisine
10 30 A italian
20 10 B italian
30 20 C indian
20 10 D indian
30 20 E indian
Here is my spider:
import scrapy
from recipes_cuisines.items import RecipeItem

class ItalianSpider(scrapy.Spider):
    name = "italian_spider"

    def start_requests(self):
        start_urls = ['https://www.bbc.co.uk/food/cuisines']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse_cuisines)

    def parse_cuisines(self, response):
        cuisine_cards = response.xpath('//a[contains(@class,"promo__cuisine")]/@href').extract()
        for url in cuisine_cards:
            yield response.follow(url=url, callback=self.parse_main)

    def parse_main(self, response):
        recipe_cards = response.xpath('//a[contains(@class,"main_course")]/@href').extract()
        for url in recipe_cards:
            yield response.follow(url=url, callback=self.parse_card)
        next_page = response.xpath('//div[@class="pagination gel-wrap"]/ul[@class="pagination__list"]/li[@class="pagination__list-item pagination__priority--0"]/a[@class="pagination__link gel-pica-bold"]/@href').get()
        if next_page is not None:
            next_page_url = response.urljoin(next_page)
            print(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_main)

    def parse_card(self, response):
        item = RecipeItem()
        item['name'] = response.xpath('//h1[contains(@class,"title__text")]/text()').extract()
        item['prep_time'] = response.xpath('//div[contains(@class,"recipe-metadata-wrap")]/p[@class="recipe-metadata__prep-time"]/text()').extract_first()
        item['cook_time'] = response.xpath('//p[contains(@class,"cook-time")]/text()').extract_first()
        item['servings'] = response.xpath('//p[contains(@class,"serving")]/text()').extract_first()
        # contains() takes two arguments: the attribute and the substring.
        item['ratings_amount'] = response.xpath('//div[contains(@class,"aggregate-rating")]/span[contains(@class,"aggregate-rating__total")]/text()').extract()
        #item['ratings_amount'] = response.xpath('//*[@id="main-content"]/div[1]/div[4]/div/div[1]/div/div[1]/div[2]/div[1]/span[2]/text()').extract()
        item['ingredients'] = response.css('li.recipe-ingredients__list-item > a::text').extract()
        return item
and items:
import scrapy

class RecipeItem(scrapy.Item):
    name = scrapy.Field()
    prep_time = scrapy.Field()
    cook_time = scrapy.Field()
    servings = scrapy.Field()
    ratings_amount = scrapy.Field()
    rating = scrapy.Field()
    ingredients = scrapy.Field()
    cuisine = scrapy.Field()
Note I am saving the output via
scrapy crawl italian_spider -o test.csv
I have read the documentation and tried several things, such as adding the extracted cuisine to a parse_cuisine or parse_main methods, but all yield an error.
There are two ways to do this. The most common way to pass information from one page to another is to use cb_kwargs in your scrapy.Request:
def parse_cuisine(self, response):
    cuisine = response.xpath('//h1/text()').get()
    # Select the href attributes so that urljoin receives URLs rather than serialized <a> elements.
    for recipe_url in response.xpath('//div[@id="az-recipes--recipes"]//a[.//h3]/@href').getall():
        yield scrapy.Request(
            url=response.urljoin(recipe_url),
            callback=self.parse_recipe,
            cb_kwargs={'cuisine': cuisine},
        )

def parse_recipe(self, response, cuisine):
    print(cuisine)
But on this website you can also find it on the recipe page itself (inside the ingredients section, after parsing the JSON):
import json

def parse_recipe(self, response):
    recipe_raw = response.xpath('//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()').get()
    recipe = json.loads(recipe_raw)
    cuisine = recipe['recipeCuisine']
Update: This XPath '//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()' finds the script node that has a type attribute with the value application/ld+json and also contains the string "@type":"Recipe" in its text.
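To make the JSON step concrete, here is a minimal sketch that parses a JSON-LD string and reads the cuisine from it. The sample string is invented for illustration and is not taken from the actual site; in the spider, recipe_raw would be the text returned by the XPath above.

```python
import json

# Hypothetical contents of a <script type="application/ld+json"> node.
recipe_raw = (
    '{"@context":"https://schema.org","@type":"Recipe",'
    '"name":"Jollof rice","recipeCuisine":"African"}'
)

recipe = json.loads(recipe_raw)
# .get() returns None instead of raising KeyError when the key is absent.
cuisine = recipe.get("recipeCuisine")
print(cuisine)
```

Using .get() instead of indexing is a defensive choice for pages where the recipeCuisine key might be missing; the item can then simply be yielded with an empty cuisine.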

1: my spider is giving me all the results in one liners on csv file

First, if I use extract_first, Scrapy gives me only the first element of each page; if I run it as shown below, it returns all the content I want, but on a single line.
Second, I can't make Scrapy follow the links I just scraped and get information from inside those pages; that attempt returns an empty CSV file.
from scrapy import Spider
from companies.items import CompaniesItem
import re

class companiesSpider(Spider):
    name = "companies"
    allowed_domains = ['http://startup.miami',]
    # Defining the list of pages to scrape
    start_urls = ["http://startup.miami/category/startups/page/" + str(1*i) + "/" for i in range(0, 10)]

    def parse(self, response):
        rows = response.xpath('//*[@id="datafetch"]')
        for row in rows:
            link = row.xpath('.//h2/a/@href').extract()
            name = row.xpath('.//header/h2/a/text()').extract()
            item = CompaniesItem()
            item['link'] = link
            item['name'] = name
            yield item
Your parse method is not yielding any requests or items. In the code below we go through the pages and get the URLs and names. In parse_detail you can add additional data to the item.
Instead of hardcoding 10 pages, we check whether there is a next page and, if there is, run parse again on it.
from scrapy import Spider
from ..items import CompaniesItem
import scrapy

class CompaniesSpider(Spider):
    name = "companies"
    allowed_domains = ['startup.miami']
    # Defining the list of pages to scrape
    start_urls = ["http://startup.miami/category/startups/"]

    def parse(self, response):
        # get link & name and send item to parse_detail in meta
        rows = response.xpath('//*[@id="datafetch"]/article')
        for row in rows:
            link = row.xpath('.//@href').extract_first()
            name = row.xpath(
                './/*[@class="textoCoworking"]/text()').extract_first()
            item = CompaniesItem()
            item['link'] = link
            item['name'] = name.strip()
            yield scrapy.Request(link,
                                 callback=self.parse_detail,
                                 meta={'item': item})
        # get the next page
        next_page = response.xpath(
            '//*[@class="next page-numbers"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_detail(self, response):
        item = response.meta['item']
        # add other details to the item here
        yield item
To put the results in a csv file you can launch the scraper like this: scrapy crawl companies -o test_companies.csv

Scrapy yield only last data and merge scrapy data into one

I am scraping a news website with the Scrapy framework, but it seems to store only the last item scraped and repeat it in the loop.
I want to store the title, date, and link, which I scrape from the first page,
and also store the whole news article. So I want to merge the article, which is stored in a list, into a single string.
Item code
import scrapy

class ScrapedItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    source = scrapy.Field()
    date = scrapy.Field()
    paragraph = scrapy.Field()
Spider code
import scrapy
from ..items import ScrapedItem

class CBNCSpider(scrapy.Spider):
    name = 'kontan'
    start_urls = [
        'https://investasi.kontan.co.id/rubrik/28/Emiten'
    ]

    def parse(self, response):
        box_text = response.xpath("//ul/li/div[@class='ket']")
        items = ScrapedItem()
        for crawl in box_text:
            title = crawl.css("h1 a::text").extract()
            source = "https://investasi.kontan.co.id" + (crawl.css("h1 a::attr(href)").extract()[0])
            date = crawl.css("span.font-gray::text").extract()[0].replace("|", "")
            items['title'] = title
            items['source'] = source
            items['date'] = date
            yield scrapy.Request(url=source,
                                 callback=self.parseparagraph,
                                 meta={'item': items})

    def parseparagraph(self, response):
        items_old = response.meta['item']  # only last item stored
        paragraph = response.xpath("//p/text()").extract()
        items_old['paragraph'] = paragraph  # merge into single string
        yield items_old
I expect the date, title, and source in the output to be updated on each pass through the loop,
and the article to be merged into a single string so it can be stored in MySQL.
I defined a new, empty dictionary inside the loop for each article and put those values in it, so every request carries its own item instead of sharing a single one. I've also made some minor changes to your XPaths and CSS selectors to make them less error-prone. The script is working as desired now:
import scrapy

class CBNCSpider(scrapy.Spider):
    name = 'kontan'
    start_urls = [
        'https://investasi.kontan.co.id/rubrik/28/Emiten'
    ]

    def parse(self, response):
        for crawl in response.xpath("//*[@id='list-news']//*[@class='ket']"):
            d = {}
            d['title'] = crawl.css("h1 > a::text").get()
            d['source'] = response.urljoin(crawl.css("h1 > a::attr(href)").get())
            d['date'] = crawl.css("span.font-gray::text").get().strip("|")
            yield scrapy.Request(
                url=d['source'],
                callback=self.parseparagraph,
                meta={'item': d}
            )

    def parseparagraph(self, response):
        items_old = response.meta['item']
        items_old['paragraph'] = response.xpath("//p/text()").getall()
        yield items_old
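The question also asked for the article as a single string, while getall() returns a list of text fragments. One possible way to merge them is sketched below; the helper name merge_paragraphs and the sample fragments are assumptions made for illustration, not part of the original answer.

```python
def merge_paragraphs(paragraphs):
    """Join text fragments into one string, skipping whitespace-only pieces."""
    cleaned = (p.strip() for p in paragraphs)
    return " ".join(p for p in cleaned if p)


# Hypothetical fragments, similar in shape to what getall() returns.
sample = ["  First paragraph. ", "\n", "Second paragraph."]
article = merge_paragraphs(sample)
print(article)
```

In the spider this would be applied to the list before storing it, for example items_old['paragraph'] = merge_paragraphs(response.xpath("//p/text()").getall()).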

Scrape ASIN from Amazon's Search page

I am trying to scrape the ASIN numbers on Amazon. Please note that this is not about the product details page (like this: https://www.youtube.com/watch?v=qRVRIh3GZgI), but about the results you get when you search for a keyword (in this example "trimmer", try this:
https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2). The results contain many products, and I am able to scrape all the titles.
What is not visible is the ASIN (a unique Amazon identifier). While inspecting the HTML, I saw a link (href) that contains the ASIN. In the example below, the ASIN is B01MSHQ5IQ:
<a class="a-link-normal a-text-normal" href="/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer&qid=1554462204&s=gateway&sr=8-3">
Ending with my question: How can I retrieve all the Product Titles AND ASIN numbers on the page? For example:
Number Title ASIN
1 Braun, Beardtrimmer B07JH1LLYR
2 TNT Pro Series Waist B00R84J2PK
... ... ...
So far, I am using scrapy (but also open for other Python solutions) and I am able to scrape the Titles.
My code so far:
First run in the command line:
scrapy startproject tutorial
Then, adjust the files in the Spider (see example 1) and items.py (see example 2).
Example 1
class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    # Use working product URL below
    start_urls = [
        "https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"
    ]
    ## scrapy crawl AmazonDeals -o Asin_Titles.json

    def parse(self, response):
        items = AmazonItem()
        Title = response.css('.a-text-normal').css('::text').extract()
        items['title_Products'] = Title
        yield items
As requested by @glhr, adding the items.py code:
Example 2
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Products = scrapy.Field()
You can get the link to the product by extracting the href attribute of <a class="a-link-normal a-text-normal" href="...">:
Link = response.css('.a-text-normal').css('a::attr(href)').extract()
From a link, you can use a regular expression to extract the ASIN number from the link:
(?<=dp/)[A-Z0-9]{10}
The regular expression above will match 10 characters (either uppercase letters or numbers) preceded by dp/. See demo here: https://regex101.com/r/mLMv3k/1
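The pattern can be tried on its own with the standard re module. The sketch below wraps it in a small helper (the name extract_asin is made up for this example) that returns None when a link contains no ASIN, which is a slightly more defensive variant than indexing the result of re.findall.

```python
import re

# Ten uppercase letters or digits that directly follow "dp/".
ASIN_PATTERN = re.compile(r"(?<=dp/)[A-Z0-9]{10}")


def extract_asin(link):
    """Return the ASIN found after 'dp/' in a link, or None if there is none."""
    match = ASIN_PATTERN.search(link)
    return match.group(0) if match else None


link = "/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer"
print(extract_asin(link))
print(extract_asin("/gp/help/customer/display.html"))
```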
Here's a working implementation of the parse() method:
import re  # needed at the top of the spider module for re.findall

def parse(self, response):
    Link = response.css('.a-text-normal').css('a::attr(href)').extract()
    Title = response.css('span.a-text-normal').css('::text').extract()
    # for each product, create an AmazonItem, populate the fields and yield the item
    for result in zip(Link, Title):
        item = AmazonItem()
        item['title_Product'] = result[1]
        item['link_Product'] = result[0]
        # extract ASIN from link
        ASIN = re.findall(r"(?<=dp/)[A-Z0-9]{10}", result[0])[0]
        item['ASIN_Product'] = ASIN
        yield item
This requires extending AmazonItem with new fields:
class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Product = scrapy.Field()
    link_Product = scrapy.Field()
    ASIN_Product = scrapy.Field()
Sample output:
{'ASIN_Product': 'B01MSHQ5IQ',
'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
'title_Product': 'Philips Norelco Multigroom Series 3000, 13 attachments, '
'FFP, MG3750'}
{'ASIN_Product': 'B01MSHQ5IQ',
'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
'title_Product': 'Philips Norelco Multi Groomer MG7750/49-23 piece, beard, '
'body, face, nose, and ear hair trimmer, shaver, and clipper'}
Demo: https://repl.it/@glhr/55534679-AmazonSpider
To write the output to a JSON file, simply specify feed export settings in the spider:
class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    start_urls = ["https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"]
    custom_settings = {
        'FEED_URI' : 'Asin_Titles.json',
        'FEED_FORMAT' : 'json'
    }
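In newer Scrapy releases (2.1 and later, as far as I know) the FEED_URI and FEED_FORMAT settings are deprecated in favour of a single FEEDS dictionary. A sketch of the equivalent configuration is shown below as a plain dict so that it runs without Scrapy installed; in the spider it would be assigned to custom_settings in the class body.

```python
# Equivalent feed export configuration using the FEEDS setting.
# Keys are output paths or URIs; values are per-feed options.
custom_settings = {
    "FEEDS": {
        "Asin_Titles.json": {"format": "json"},
    },
}
print(custom_settings["FEEDS"])
```

Which form to use depends on the installed Scrapy version, so it is worth checking the feed exports documentation for that version.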

Scrapy: scrape item fields from different pages

I'm trying to get item fields info from different pages using scrapy.
What I am trying to do:
1. main_url > scrape all links from this page > go to each link
2. From each link > scrape info, put info in items list and go to another link
3. From another link > scrape info and put info in the same items list
4. Go to the next link and repeat steps 2 - 4
5. When all links are done, go to the next page and repeat steps 1 - 3
I found some information from below but, I still can't get the results I want:
How can i use multiple requests and pass items in between them in scrapy python
http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
Goal: to get the below layout results
What I've done is below
My item class
from scrapy.item import Item, Field

class myItems(Item):
    info1 = Field()
    info2 = Field()
    info3 = Field()
    info4 = Field()
My spider class
import scrapy
from scrapy.http import Request
from myProject.items import myItems

class mySpider(scrapy.Spider):
    name = 'spider1'
    start_urls = ['main_link']

    def parse(self, response):
        items = []
        list1 = response.xpath().extract()  # extract all info from here
        list2 = response.xpath().extract()  # extract all info from here
        for i, j in zip(list1, list2):
            link1 = 'http...' + i
            request = Request(link1, self.parseInfo1, dont_filter=True)
            request.meta['item'] = items
            yield request
            link2 = 'https...' + j
            request = Request(link2, self.parseInfo2, meta={'item': items}, dont_filter=True)
        # Code for crawling to next page

    def parseInfo1(self, response):
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.xpath().extract()
        item['info2'] = response.xpath().extract()
        items.append(item)
        return items

    def parseInfo2(self, response):
        item = myItems()
        items = response.meta['item']
        item['info3'] = response.xpath().extract()
        item['info4'] = response.xpath().extract()
        items.append(item)
        return items
I executed the spider by typing this on the terminal:
> scrapy crawl spider1 -o filename.csv -t csv
I got the results for all the fields, but they are not in the right order. My csv file looks like this:
Does anyone know how to get the results like in my "Goal" above?
I appreciate the help.
Thanks
Never mind, I found my mistake. I instantiated myItems class twice, which resulted in 2 new objects and gave the results that I got.
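For readers with the same problem, a common way to fill a single item from two pages is to create it once and hand that same object from one callback to the next, emitting it only at the end. The sketch below simulates that chaining with plain functions and a dict rather than real Scrapy requests; the function names and field values are invented for illustration.

```python
def parse_info1(item):
    """First 'page': fill the first two fields, then pass the same item on."""
    item["info1"] = "value from page 1"
    item["info2"] = "another value from page 1"
    # In Scrapy this hand-off would be a Request that carries the item along.
    return parse_info2(item)


def parse_info2(item):
    """Second 'page': complete the same item and emit it."""
    item["info3"] = "value from page 2"
    item["info4"] = "another value from page 2"
    return item


# One item per pair of links, created exactly once.
result = parse_info1({})
print(result)
```

In a spider, the hand-off would be a Request to the second link with the item attached via meta or cb_kwargs, so each CSV row ends up with all four fields together.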
