I am working on crawling Google search results. Here is my code:
def parse(self, response):
    all_page = response.xpath('//*[@id="main"]')
    for page in all_page:
        title = page.xpath('//*[@id="main"]/div/div/div/a/h3/div/text()').extract()
        link = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        print('title', title)
        print('link', link)
The output is:
title ['iPhone - Compare Models - Apple', 'iPhone - Compare Models - Apple (MY)',......]
link ['https://www.apple.com/iphone/compare/&sa=U&ved=2ahUKEwiKvsnnmLDxAhWZIDQIHXEdA60QFjAGegQIBRAB&usg=AOvVaw1FCyWoMh1LcbM65W6l8ypN', '/url?q=https://www.apple.com/my/iphone/compare/&sa=U&ved=2ahUKEwiKvsnnmLDxAhWZIDQIHXEdA60QFjAHegQICBAB&usg=AOvVaw3i33ED_sBrbAuNLAJsOlxe', ...]
I want output like this:
title: 'iPhone - Compare Models - Apple'
Link: 'https://www.apple.com/iphone/compare/&sa=U&ved=2ahUKEwiKvsnnmLDxAhWZIDQIHXEdA60QFjAGegQIBRAB&usg=AOvVaw1FCyWoMh1LcbM65W6l8ypN'
title: 'iPhone - Compare Models - Apple (MY)'
Link: 'https://www.apple.com/my/iphone/compare/&sa=U&ved=2ahUKEwiKvsnnmLDxAhWZIDQIHXEdA60QFjAHegQICBAB&usg=AOvVaw3i33ED_sBrbAuNLAJsOlxe'
How can I do that?
Thank you
The extract() method returns all items matched by the XPath expression, so with page.xpath('//*[@id="main"]/div/div/div/a/h3/div/text()').extract() you get every match on the page, not just the one for the current element.
You can zip the lists together, as mentioned in another answer, but to be more precise you should start from the parent element and use relative XPaths:
for page in all_page:
    for element in page.xpath('//*[@id="main"]/div/div/div'):
        title = element.xpath('a/h3/div/text()').extract_first()
        link = element.xpath('a/@href').extract_first()
        print('title', title)
        print('link', link)
Your title and link results are lists with several items, so you need to loop over them in parallel with zip, assuming you get exactly one link per title.
It looks that way from your example, but make sure it is the case.
def parse(self, response):
    all_page = response.xpath('//*[@id="main"]')
    for page in all_page:
        titles = page.xpath('//*[@id="main"]/div/div/div/a/h3/div/text()').extract()
        links = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        for title, link in zip(titles, links):
            print(f"title: '{title}'\n\n"
                  f"Link: {link}")
Related
Hello, I'm trying to build a crawler using Scrapy. My crawler code is:
import scrapy
from shop.items import ShopItem

class ShopspiderSpider(scrapy.Spider):
    name = 'shopspider'
    allowed_domains = ['www.organics.com']
    start_urls = ['https://www.organics.com/product-tag/special-offers/']

    def parse(self, response):
        items = ShopItem()
        title = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/h3').extract()
        sale_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/del/span').extract()
        product_original_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
        category = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
        items['product_name'] = ''.join(title).strip()
        items['product_sale_price'] = ''.join(sale_price).strip()
        items['product_original_price'] = ''.join(product_original_price).strip()
        items['product_category'] = ','.join(map(lambda x: x.strip(), category)).strip()
        yield items
But when I run the command scrapy crawl shopspider -o info.csv to see the output, I only find the information about the first product, not all the products on the page.
So I removed the indexes between [ ] in the XPaths, for example the XPath of the title: //*[@id="content"]/div/div/ul/li/a/h3
But I still get the same result.
The result is: <span class="amount">£40.00</span>,<h3>Halo Skincare Organic Gift Set</h3>,"<span class=""amount"">£40.00</span>","<span class=""amount"">£58.00</span>"
Kindly help, please.
If you remove the indexes from your XPaths, they will match all the items on the page:
response.xpath('//*[@id="content"]/div/div/ul/li/a/h3').extract()  # returns 7 items
However, note that this returns a list of strings of the selected HTML elements. You should add /text() to the XPath if you want the text inside the element (which it looks like you do).
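To illustrate the /text() point, a quick sketch against the same XPath (return values abbreviated):

response.xpath('//*[@id="content"]/div/div/ul/li/a/h3').extract()
# ['<h3>Halo Skincare Organic Gift Set</h3>', ...]  - HTML elements as strings
response.xpath('//*[@id="content"]/div/div/ul/li/a/h3/text()').extract()
# ['Halo Skincare Organic Gift Set', ...]  - just the inner text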
Also, the reason you only get one row is that you are concatenating all the items into a single string when assigning them to the item:
items['product_name'] = ''.join(title).strip()
Here title is a list of elements and you join them all into one string. The same logic applies to the other variables.
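For example (with made-up values):

title = ['<h3>Product A</h3>', '<h3>Product B</h3>']
''.join(title).strip()  # '<h3>Product A</h3><h3>Product B</h3>' - everything collapses into one field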
If that's really what you want, you can disregard the following, but I believe a better approach would be to loop over the products and yield them separately.
My suggestion would be:
def parse(self, response):
    products = response.xpath('//*[@id="content"]/div/div/ul/li')
    for product in products:
        items = ShopItem()
        items['product_name'] = product.xpath('a/h3/text()').get()
        items['product_sale_price'] = product.xpath('a/span/del/span/text()').get()
        items['product_original_price'] = product.xpath('a/span/ins/span/text()').get()
        items['product_category'] = product.xpath('a/span/ins/span/text()').get()
        yield items
Notice that in your original code the category variable has the same XPath as product_original_price. I kept that logic in the code, but it's probably a mistake.
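A side note on the selector API, in case the .get() calls look unfamiliar: in recent Scrapy versions .get() and .extract_first() are aliases, and .getall() is the list form:

product.xpath('a/h3/text()').get()            # first match, or None if nothing matches
product.xpath('a/h3/text()').extract_first()  # older name, same behavior
product.xpath('a/h3/text()').getall()         # list of all matches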
I'm scraping the content of articles from a site like this one, where there is no 'Next' button to follow. An ItemLoader is passed from parse_issue in the response.meta object, along with some additional data like section_name. Here is the function:
def parse_article(self, response):
    self.logger.info('Parse function called parse_article on {}'.format(response.url))
    acrobat = response.xpath('//div[@class="txt__lead"]/p[contains(text(), "Plik do pobrania w wersji (pdf) - wymagany Acrobat Reader")]')
    limiter = response.xpath('//p[@class="limiter"]')
    if not acrobat and not limiter:
        loader = ItemLoader(item=response.meta['periodical_item'].copy(), response=response)
        loader.add_value('section_name', response.meta['section_name'])
        loader.add_value('article_url', response.url)
        loader.add_xpath('article_authors', './/p[@class="l doc-author"]/b')
        loader.add_xpath('article_title', '//div[@class="cf txt "]//h1')
        loader.add_xpath('article_intro', '//div[@class="txt__lead"]//p')
        article_content = response.xpath('.//div[@class=" txt__rich-area"]//p').getall()
        # check for pagination
        next_page_url = response.xpath('//span[@class="pgr_nrs"]/span[contains(text(), 1)]/following-sibling::a[1]/@href').get()
        if next_page_url:
            # I'm not sure what should be here... Something like this: (???)
            yield response.follow(next_page_url, callback=self.parse_article, meta={
                'periodical_item': loader.load_item(),
                'article_content': article_content
            })
        else:
            # article_content holds extracted strings, so add_value, not add_xpath
            loader.add_value('article_content', article_content)
            yield loader.load_item()
The problem is in the parse_article function: I don't know how to combine the paragraph content from all the pages into a single item. Does anybody know how to solve this?
Your parse_article looks good. If the issue is just adding the article_content to the loader, you only need to fetch it from response.meta.
I would update this line:
article_content = response.meta.get('article_content', []) + response.xpath('.//div[@class=" txt__rich-area"]//p').getall()
(Note the default is an empty list: .getall() returns a list, so you are concatenating lists, not strings.)
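Putting it together, a condensed sketch of the accumulate-then-yield flow (field names and XPaths come from the question; the acrobat/limiter checks and the other loader fields are omitted for brevity):

def parse_article(self, response):
    # paragraphs collected on previous pages (empty list on the first page) plus this page's
    article_content = (response.meta.get('article_content', [])
                       + response.xpath('.//div[@class=" txt__rich-area"]//p').getall())
    next_page_url = response.xpath('//span[@class="pgr_nrs"]/span[contains(text(), 1)]/following-sibling::a[1]/@href').get()
    if next_page_url:
        # carry the accumulated content forward to the next page
        yield response.follow(next_page_url, callback=self.parse_article, meta={
            'periodical_item': response.meta['periodical_item'],
            'section_name': response.meta['section_name'],
            'article_content': article_content,
        })
    else:
        # last page: build the item with the full accumulated content
        loader = ItemLoader(item=response.meta['periodical_item'].copy(), response=response)
        loader.add_value('article_content', article_content)
        yield loader.load_item()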
Just set the next-page URL and iterate over the right number of pages.
I noticed that this article had 4 pages, but some could have more.
They are distinguished simply by appending /2 or /3 to the end of the URL, e.g.
https://www.gosc.pl/doc/791526.Zaloz-zbroje/
https://www.gosc.pl/doc/791526.Zaloz-zbroje/2
https://www.gosc.pl/doc/791526.Zaloz-zbroje/3
I don't use Scrapy, but when I need multiple pages I would normally just iterate.
When you first scrape the page, find the maximum number of pages for that article. On that site, for example, it says 1/4, so you know you will need 4 pages in total.
url = "https://www.gosc.pl/doc/791526.Zaloz-zbroje/"
data_store = ""
for i in range(1, 5):  # pages 1..4
    actual_url = "{}{}".format(url, i)
    scrape_stuff = content_you_want  # pseudocode: scrape the content you want from actual_url
    data_store += scrape_stuff
# format the collected data
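For completeness, a fleshed-out (hedged) version of that loop using requests and BeautifulSoup; the body class is borrowed from the question's XPath and may need adjusting to the live page:

import requests
from bs4 import BeautifulSoup

url = "https://www.gosc.pl/doc/791526.Zaloz-zbroje/"
data_store = ""
for i in range(1, 5):  # the article said 1/4, so pages 1..4
    actual_url = "{}{}".format(url, i)
    resp = requests.get(actual_url)
    soup = BeautifulSoup(resp.text, "html.parser")
    body = soup.find("div", class_="txt__rich-area")  # class taken from the question's XPath; verify on the live page
    if body is not None:
        data_store += body.get_text()
# format the collected data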
I have the following code which successfully pulls links, titles, etc. for podcast episodes. How would I go about pulling just the first one it comes to (i.e. the latest episode) and then immediately stopping and producing just that result? Any advice would be greatly appreciated.
def get_playable_podcast(soup):
    """
    @param: parsed html page
    """
    subjects = []
    for content in soup.find_all('item'):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print("\n\nLink: ", link)
            title = content.find('title')
            title = title.get_text()
            desc = content.find('itunes:subtitle')
            desc = desc.get_text()
            thumbnail = content.find('itunes:image')
            thumbnail = thumbnail.get('href')
        except AttributeError:
            continue
        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail
        }
        subjects.append(item)
    return subjects
def compile_playable_podcast(playable_podcast):
    """
    @param: list containing dicts of key/value pairs for playable podcasts
    """
    items = []
    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'thumbnail': podcast['thumbnail'],
            'path': podcast['url'],
            'info': podcast['desc'],
            'is_playable': True,
        })
    return items
The answer of @John Gordon is completely correct.
@John Gordon pointed out that:
soup.find()
will always return the first found item (which is perfectly fine for you, since you want to scrape the latest episode).
However, imagine you wanted to select the second, third, fourth, etc. item from your BeautifulSoup results. Then you could do it with this line of code:
soup.find_all()[0]  # works the same way as soup.find() and returns the first item
When you replace the 0 with any other index (e.g. 3), you get only the chosen item (in this example the fourth, since indexing starts at zero).
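Applied to the question's function, a minimal sketch that returns only the latest episode (same field names as the original; returns None if the feed is empty or a tag is missing):

def get_latest_podcast(soup):
    """Return a dict for the first <item> in the feed, or None."""
    content = soup.find('item')  # find() returns only the first match
    if content is None:
        return None
    try:
        return {
            'url': content.find('enclosure').get('url'),
            'title': content.find('title').get_text(),
            'desc': content.find('itunes:subtitle').get_text(),
            'thumbnail': content.find('itunes:image').get('href'),
        }
    except AttributeError:
        # one of the expected tags was missing
        return None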
I'll start with the scrapy code I'm trying to use to iterate through a collection of vehicles and extract the model and price:
def parse(self, response):
    hxs = Selector(response)
    split_url = response.url.split("/")
    listings = hxs.xpath("//div[contains(@class,'listing-item')]")
    for vehicle in listings:
        item = Vehicle()
        item['make'] = split_url[5]
        item['price'] = vehicle.xpath("//div[contains(@class,'price')]/text()").extract()
        item['description'] = vehicle.xpath("//div[contains(@class,'title-module')]/h2/a/text()").extract()
        yield item
I was expecting this to loop through the listings and return the price only for the single vehicle being parsed, but it is actually adding an array of all prices on the page to each vehicle item.
I assume the problem is in my XPath selectors: is "//div[contains(@class,'price')]/text()" somehow allowing the parser to look at divs outside the single vehicle that should be parsed each time?
For reference, listings[1] returns only one listing, so the loop should be working.
Edit: I added the line print(vehicle.extract()) above, and confirmed that vehicle is definitely a single item (and it changes on each loop iteration). How is the XPath selector applied to vehicle able to escape the vehicle object and return all prices?
I was having the same problem, and I consulted the documentation referred to below. I'm providing the modified code here so that it is helpful to beginners like me. Note the use of '.' at the start of the XPath: .//div[contains(@class,'title-module')]/h2/a/text()
def parse(self, response):
    hxs = Selector(response)
    split_url = response.url.split("/")
    listings = hxs.xpath("//div[contains(@class,'listing-item')]")
    for vehicle in listings:
        item = Vehicle()
        item['make'] = split_url[5]
        item['price'] = vehicle.xpath(".//div[contains(@class,'price')]/text()").extract()
        item['description'] = vehicle.xpath(".//div[contains(@class,'title-module')]/h2/a/text()").extract()
        yield item
I was able to solve the problem with the aid of the manual, here. In summary, the XPath was indeed escaping the iteration because I neglected to put a period in front of the //, which meant it was escaping to the root node every time.
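The difference in one line, for anyone who hits the same thing:

vehicle.xpath("//div[contains(@class,'price')]/text()")   # absolute: starts again from the document root, matches every price on the page
vehicle.xpath(".//div[contains(@class,'price')]/text()")  # relative: searches only within this vehicle's subtree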
I'm trying to scrape items from this website.
The items are: brand, model and price. Because of the complexity of the page structure, the spider uses 2 XPath selectors.
The brand and model items come from one XPath; the price comes from a different XPath. I'm using the union ( | ) operator as @har07 suggested. The XPaths were tested individually for each item, and they worked and extracted the needed items correctly. However, after joining the 2 XPaths, the price item started picking up additional items, like commas, and prices aren't matched with the brand/model items when outputting to CSV.
This is how the parse fragment of the spider looks:
def parse(self, response):
    sel = Selector(response)
    titles = sel.xpath('//table[@border="0"]//td[@class="compact"] | //table[@border="0"]//td[@class="cl-price-cont"]//span[4]')
    items = []
    for t in titles:
        item = AltaItem()
        item["brand"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        item["price"] = t.xpath('text()').extract()
        items.append(item)
    return items
And this is what the CSV looks like after scraping:
Any suggestions on how to fix this?
Thank you.
Basically, the issue is caused by your titles XPath. It goes down too deep, to the point where you had to join two XPaths in order to scrape both the brand/model field and the price field.
Changing the titles XPath to a single expression that includes both of the repeating elements for brand/model and price (and adjusting the brand, model and price XPaths accordingly) means you no longer get mismatches where the brand and model are in one item and the price is in the next.
def parse(self, response):
    sel = Selector(response)
    titles = sel.xpath('//table[@class="table products cl"]//tr[@valign="middle"]')
    items = []
    for t in titles:
        item = AltaItem()
        item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()
        items.append(item)
    return items
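As a side note, a Scrapy callback can also yield items one at a time instead of collecting them into a list; a sketch of the same loop with that change:

def parse(self, response):
    sel = Selector(response)
    for t in sel.xpath('//table[@class="table products cl"]//tr[@valign="middle"]'):
        item = AltaItem()
        item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
        item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
        item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()
        yield item  # yielding lets Scrapy process each item as soon as it is built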