Issue with xpath / regex in Scrapy spider - python

I'm trying to extract a product id from an onclick attribute inside a "preceding-sibling" element, which is a ul tag (id="ShowProductImages").
The number I'm trying to extract is directly after ?pid=, example:
...list/ViewAll?pid=234565&image=206...
Below is the content that I'm trying to extract from:
<ul id="ShowProductImages" class="imageView">
<li><img src="http://content.example.com/assets/images/products/j458jk.jpg" width="200" height="150" alt="Product image description here" border="0"></li>
</ul>
<div class="description">
Description here...
</div>
I am using XPath to select the onclick attribute along with a regular expression to extract the id. This is the code I'm using (which is not working):
def parse(self, response):
    sel = HtmlXPathSelector(response)
    products_path = sel.xpath('//div[@class="description"]')
    for product_path in products_path:
        product = Product()
        product['product_pid'] = product_path.xpath('preceding-sibling::ul[@id="ShowProductImages"][1]//li/a[1]/@onclick').re(r'(?:pid=)(.+?)(?:\'|$)')
        yield product
Any suggestions? I'm not quite sure where I went wrong.
Thanks for your help in advance.

I suggest you try this, selecting from the ul, and testing its <div class="description"> sibling in a predicate:
sel.xpath("""//ul[following-sibling::div[@class="description"]]
                 [@id="ShowProductImages"]
             /li/a[1]/@onclick""").re(r'(?:pid=)(\d+)')
I changed your regular expression to restrict to digits.
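The selection above can be tried outside a spider. This is a minimal sketch that uses lxml and re directly instead of a Scrapy selector. The question's HTML sample does not show the link, so the a element and its onclick value below are assumptions made up for illustration.

```python
import re
from lxml import html

# Hypothetical markup: the <a onclick=...> element is assumed, since the
# question's sample does not include it.
SAMPLE = """
<div>
<ul id="ShowProductImages" class="imageView">
<li><a href="#" onclick="window.open('/list/ViewAll?pid=234565&amp;image=206')"><img src="j458jk.jpg"></a></li>
</ul>
<div class="description">Description here...</div>
</div>
"""

doc = html.fromstring(SAMPLE)

# Select the ul that has a following div.description sibling, then read onclick.
onclick_values = doc.xpath(
    '//ul[following-sibling::div[@class="description"]]'
    '[@id="ShowProductImages"]/li/a[1]/@onclick'
)

# Keep only the digits that follow "pid=".
PID_RE = re.compile(r"pid=(\d+)")
pids = []
for value in onclick_values:
    match = PID_RE.search(value)
    if match:
        pids.append(match.group(1))

print(pids)
```

Restricting the capture group to \d+ stops the match at the first non-digit character, so the trailing &image=206 part is not picked up.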

Related

Scrapy Selector only getting the first element in for loop

I don't understand why the following code doesn't work when using Scrapy Selector.
In scrapy shell (to be easily replicable, but the issue remains the same in a spider):
from scrapy.selector import Selector
body = '''<html>
<body>
<li>
<p>1</p>
<p>2</p>
<p>3</p>
</li>
<li>
<p>4</p>
<p>5</p>
<p>6</p>
</li>
<li>
<p>7</p>
<p>8</p>
<p>9</p>
</li>
</body>
</html>'''
sel = Selector(text=body, type="html")
for elem in sel.xpath('//body'):
    first = elem.xpath('.//li/p[1]/text()').get()
    print(first)
And it prints:
1
while it should be printing:
1
4
7
Any idea on how to solve this problem ?
Thanks
You are calling .get(), which returns only the first match. Because the loop iterates over //body, which matches a single element, the loop body runs once and prints only 1. Replace .get() with .getall() to receive every match as a list, which you can then index or slice as needed.
Alternatively, the class name may differ in each li tag, in which case you may have to match on class="..." in your XPath expression.
Note: Rather than fetching data with elem.xpath('.//li/p[1]/text()').get(), you can get all the data with elem.xpath('.//li/p/text()').getall() and then apply your own logic to the resulting list, which is the easiest fallback if you don't get the desired output.
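Another way to get one value per li is to loop over the li elements themselves and take the first p relative to each one. The sketch below uses lxml directly so that it runs without Scrapy; the variable names are illustrative only.

```python
from lxml import html

BODY = """<html><body>
<li><p>1</p><p>2</p><p>3</p></li>
<li><p>4</p><p>5</p><p>6</p></li>
<li><p>7</p><p>8</p><p>9</p></li>
</body></html>"""

doc = html.fromstring(BODY)

firsts = []
# Iterate over each <li>, not over <body>, so the loop runs three times.
for li in doc.xpath("//li"):
    # "./p[1]" is relative to the current <li>; xpath() returns a list here.
    first = li.xpath("./p[1]/text()")
    if first:
        firsts.append(first[0])

print(firsts)
```

With a Scrapy Selector the same loop would read for li in sel.xpath('//li') with li.xpath('./p[1]/text()').get() inside it.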

Using scrapy selector with conditions

I am using "scrapy" to scrape a few articles, like these ones: https://fivethirtyeight.com/features/championships-arent-won-on-paper-but-what-if-they-were/
I am using the following code in my spider:
def parse_article(self, response):
    il = ItemLoader(item=Scrapping538Item(), response=response)
    il.add_css('article_text', '.entry-content *::text')
...which works. But I'd like to make this CSS-selector a little bit more sophisticated.
Right now, I am extracting every text passage. But looking at the article, there are tables and visualizations in there, which include text, too. The HTML structure looks like this:
<div class="entry-content single-post-content">
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<section class="viz">
<header class="viz">
<h5 class="title">TITLE-text</h5>
<p class="subtitle">SUB-TITLE-text</p>
</header>
<table class="viz full">TABLE DATA</table>
</section>
<p>text I want</p>
<p>text I want</p>
</div>
With the code snippet above, I get something like:
text I want
text I want
text I want
TITLE-text <<<< (text I don't want)
SUB-TITLE-text <<<< (text I don't want)
TABLE DATA <<<< (text I don't want)
text I want
text I want
My questions:
1. How can I modify the add_css() function so that it takes all text except the text from the table?
2. Would it be easier with the add_xpath function?
3. In general, what would be the best practice for this (extracting text under conditions)?
Feedback would be much appreciated
Use > in your CSS expression to limit it to children (direct descendants).
.entry-content > *::text
You can get output that you want with XPath and ancestor axis:
'//*[contains(@class, "entry-content")]//text()[not(ancestor::*[@class="viz"])]'
Unless I miss something crucial, the following xpath should work:
import scrapy
import w3lib.html  # import the submodule explicitly so w3lib.html.remove_tags resolves
raw = response.xpath(
    '//div[contains(@class, "entry-content") '
    'and contains(@class, "single-post-content")]/p'
).extract()
This omits the table content and only yields the text in paragraphs and links as a list. But there's a catch! Since we didn't use /text(), all <p> and <a> tags are still there. Let's remove them:
cleaned = [w3lib.html.remove_tags(block) for block in raw]
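To see how the ancestor-axis XPath suggested earlier behaves, here is a small sketch run against markup shaped like the question's sample. It uses lxml directly rather than a Scrapy response, and the sample text and variable names are assumptions for illustration.

```python
from lxml import html

SAMPLE = """
<div class="entry-content single-post-content">
<p>text I want</p>
<section class="viz">
<header class="viz">
<h5 class="title">TITLE-text</h5>
<p class="subtitle">SUB-TITLE-text</p>
</header>
<table class="viz full"><tr><td>TABLE DATA</td></tr></table>
</section>
<p>more text <a href="#">with a link</a></p>
</div>
"""

doc = html.fromstring(SAMPLE)

# Every text node under .entry-content, except those that have an
# ancestor whose class attribute is exactly "viz".
texts = doc.xpath(
    '//*[contains(@class, "entry-content")]'
    '//text()[not(ancestor::*[@class="viz"])]'
)

# Drop whitespace-only nodes left over from the line breaks in the markup.
cleaned = [t.strip() for t in texts if t.strip()]
print(cleaned)
```

Unlike the /p variant, this keeps text nested at any depth (for example inside links) while still skipping everything inside the viz section.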

Extract Text From same class name(Python web scraping)

I'm a beginner in Python web scraping using BeautifulSoup. I was trying to scrape a real estate website, but each listing has a row with different information in each column. Every column has the same class name, so when I try to scrape the columns one by one I get the same result each time.
Link of the website I was trying to scrape.
Code From The HTML
<div class="lst-middle-section resale">
<div class="item-datapoint va-middle">
<div class="lst-sub-title stub text-ellipsis">Built Up Area</div>
<div class="lst-sub-value stub text-ellipsis">2294 sq.ft.</div>
</div>
<div class="item-datapoint va-middle">
<div class="lst-sub-title stub text-ellipsis">Avg. Price</div>
<div class="lst-sub-value stub text-ellipsis"><i class="icon-rupee"></i> 6.5k / sq.ft.</div>
</div>
<div class="item-datapoint va-middle">
<div class="lst-sub-title stub text-ellipsis">Possession Date</div>
<div class="lst-sub-value stub text-ellipsis">31st Dec, 2020</div>
</div>
Code I Tried!
for item in all:
    try:
        print(item.find('span', {'class': 'lst-price'}).getText())
        print(item.find('div',{'class': 'lst-heading'}).getText())
        print(item.find('div', {'class': 'item-datapoint va-middle'}).getText())
        print('')
    except AttributeError:
        pass
If I use class 'item-datapoint va-middle' again then it shows sq.ft area not avg.price or Possession date.
Solution? TIA!
Use find_elements_by_class_name instead of find_element_by_class_name.
find_elements_by_class_name("item-datapoint.va-middle")
You will get a list of elements.
Selenium docs: Locating Elements
Edit:
from selenium import webdriver
url = 'https://housing.com/in/buy/search?f=eyJiYXNlIjpbeyJ0eXBlIjoiUE9MWSIsInV1aWQiOiJhMWE1MjFmYjUzNDdjYT' \
      'AxNWZlNyIsImxhYmVsIjoiQWhtZWRhYmFkIn1dLCJub25CYXNlQ291bnQiOjAsImV4cGVjdGVkUXVlcnkiOiIlMjBBaG1lZGFiYWQiL' \
      'CJxdWVyeSI6IiBBaG1lZGFiYWQiLCJ2IjoyLCJzIjoiZCJ9'
driver = webdriver.Chrome()
driver.get(url)
# Note: Selenium 4.3+ removed the find_elements_by_* helpers; there, use
# driver.find_elements(By.CSS_SELECTOR, ".item-datapoint.va-middle") instead.
fields = driver.find_elements_by_class_name("item-datapoint.va-middle")
for i, field in enumerate(fields):
    print(i, field.text)
driver.quit()
Now you see the index in the list (fields) for every element.
Print the elements you want like here:
poss_date = fields[2].text
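If you would rather stay with BeautifulSoup, one option is to collect every item-datapoint block with find_all and pair each title with its value. This is only a sketch against the HTML fragment from the question; the details dictionary and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

HTML = """
<div class="lst-middle-section resale">
<div class="item-datapoint va-middle">
<div class="lst-sub-title stub text-ellipsis">Built Up Area</div>
<div class="lst-sub-value stub text-ellipsis">2294 sq.ft.</div>
</div>
<div class="item-datapoint va-middle">
<div class="lst-sub-title stub text-ellipsis">Avg. Price</div>
<div class="lst-sub-value stub text-ellipsis"><i class="icon-rupee"></i> 6.5k / sq.ft.</div>
</div>
<div class="item-datapoint va-middle">
<div class="lst-sub-title stub text-ellipsis">Possession Date</div>
<div class="lst-sub-value stub text-ellipsis">31st Dec, 2020</div>
</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

details = {}
# find_all returns every matching block; find() would stop at the first one.
for point in soup.find_all("div", class_="item-datapoint"):
    title = point.find("div", class_="lst-sub-title").get_text(strip=True)
    value = point.find("div", class_="lst-sub-value").get_text(strip=True)
    details[title] = value

print(details["Possession Date"])
```

Keying the values by their title avoids relying on the position of each column in the row.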

Creating a css selector to locate multiple ids in a single-shot

I've defined css selectors within the script to get the text within span elements and I'm getting them accordingly. However, the way I tried is definitely messy. I just separated the different css selectors with a comma to let the script understand I'm after this or that.
If I opt for xpath I could have used 'div//span[.="Featured" or .="Sponsored"]' but in case of css selector I could not find anything similar to serve the same purpose. I know using 'span:contains("Featured"),span:contains("Sponsored")' I can get the text but there is the comma in between as usual.
What is the ideal way to locate the elements (within different ids) using css selectors except for comma?
My try so far with:
from lxml.html import fromstring
html = """
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/pizza-hut">
Pizza Hut
</a>
<div id="featured other-dynamic-ids">
<span>Sponsored</span>
</div>
</div>
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/salads-up">
Salads UP
</a>
<div id="other-dynamic-ids border">
<span>Featured</span>
</div>
</div>
"""
root = fromstring(html)
for item in root.cssselect("[id~='featured'] span,[id~='border'] span"):
    print(item.text)
You can do:
.rest-list-information div span
But I think it's a bad idea to consider the comma messy. You won't find many stylesheets that don't have commas.
If you are just looking to get all 'span' text from the HTML then the following should suffice:
root_spans = root.xpath('//span')
# Use a separate loop variable so the list being iterated is not shadowed.
for span in root_spans:
    span_text = span.xpath('.//text()')[0]
    print(span_text)

beautifulsoup CSS Select - find a tag in which a particular attribute (style for ex) is not present

My first post here on SO. Thanks for helping us noobs for so long. Coming straight to the point:
Scenario:
I am working on an existing program that reads the CSS selector as a string from a configuration file, which makes the program dynamic and able to scrape any site just by changing the configured CSS selector.
Problem:
I am trying to scrape a site which is rendering items as one of the 2 options below:
Option1:
.........
<div class="price">
<span class="price" style="color:red;margin-right:0.1in">
<del>$299</del>
</span>
<span class="price">
$195
</span>
</div>
soup = soup.select("span.price") - this doesn't work as I need second span tag or last span tag :(
Option2:
.........
<div class="price">
<span class="price">
$199
</span>
</div>
soup = soup.select("span.price") - this works great!
Question:
In both the above options I want to be able to get the last span tag ($195 or $199) and don't care about the $299. Basically I just want to extract the final sale price and not the original price.
So the 2 ways I know as of now are:
1) Always get the last span tag
2) Always get the span tag which doesn't have style attribute
Now, I know the not operator, last-of-type are not present in bs4 (only nth-of-type is available) so I am stuck here. Any suggestions are helpful.
Edit: - Since this is an existing program, I cant use soup.find_all() or any other method apart from soup.select(). Sorry :(
Thanks!
You can search for the span tag without the style attribute:
prices = soup.select('span.price')
no_style = [price for price in prices if 'style' not in price.attrs]
>> [<span class="price">$199</span>]
This might be a good time to use a function. In this case BeautifulSoup passes each tag to the function (a lambda below), which tests whether the tag's name is span and whether it has a style attribute. If both are true, BeautifulSoup appends the tag to its list of results. To keep only the spans without a style attribute, negate the has_attr test.
HTML = '''\
<div class='price'>
<span class='price' style='color: red; margin-right: 0.1in'>
<del>$299</del>
</span>
<span class='price'>
$195
</span>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(HTML, 'lxml')
for item in soup.find_all(lambda tag: tag.name=='span' and tag.has_attr('style')):
    print(item)
The code inside the select function needs to change to:
def select(soup, the_variable_you_pass):
    # Return the last matching tag inside the price div.
    return soup.find('div', attrs={'class': 'price'}).find_all(the_variable_you_pass)[-1]
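Since Beautiful Soup 4.7, select() is backed by the Soup Sieve library, which supports :not() and :last-of-type, so either strategy from the question can be written as a plain selector string in the configuration file. A minimal sketch, assuming a recent bs4; the sale_prices helper and the constant names are hypothetical.

```python
from bs4 import BeautifulSoup

OPTION_1 = """<div class="price">
<span class="price" style="color:red;margin-right:0.1in"><del>$299</del></span>
<span class="price">$195</span>
</div>"""

OPTION_2 = """<div class="price">
<span class="price">$199</span>
</div>"""

# Strategy 2 from the question: the span that has no style attribute.
NO_STYLE = "span.price:not([style])"
# Strategy 1 from the question: the last span inside the price div.
LAST_SPAN = "div.price > span.price:last-of-type"


def sale_prices(markup, selector):
    """Hypothetical helper: return the text of every tag matched by selector."""
    soup = BeautifulSoup(markup, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


for markup in (OPTION_1, OPTION_2):
    print(sale_prices(markup, NO_STYLE), sale_prices(markup, LAST_SPAN))
```

Both selectors skip the struck-through $299 in the first layout and still match the single span in the second, so only the configured string has to change.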
