Scrapy Selector only getting the first element in for loop - python

I don't understand why the following code doesn't work when using Scrapy Selector.
In scrapy shell (for easy reproducibility; the issue remains the same in a spider):
from scrapy.selector import Selector
body = '''<html>
<body>
<li>
<p>1</p>
<p>2</p>
<p>3</p>
</li>
<li>
<p>4</p>
<p>5</p>
<p>6</p>
</li>
<li>
<p>7</p>
<p>8</p>
<p>9</p>
</li>
</body>
</html>'''
sel = Selector(text=body, type="html")
for elem in sel.xpath('//body'):
    first = elem.xpath('.//li/p[1]/text()').get()
    print(first)
And it prints:
1
while it should be printing:
1
4
7
Any idea on how to solve this problem?
Thanks

You may be using the .get() method to fetch the data; replace it with .getall(). .get() returns only the first match, while .getall() returns every match as a list, from which you can pull the values you need with Python slicing.
Alternatively, the class name may differ in each "li" tag, in which case you would have to pass that class into your xpath expression.
Note: rather than fetching data with "elem.xpath('.//li/p[1]/text()').get()", you can simply get all the data with "elem.xpath('.//li/p/text()').getall()" and then apply your manipulation logic to the resulting list, which is the easiest way if you don't get your desired output.
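A minimal sketch in scrapy shell, reusing the body string from the question, that prints 1, 4 and 7:
from scrapy.selector import Selector

sel = Selector(text=body, type="html")
# Option 1: .getall() returns the first <p> of every <li> as a list
print(sel.xpath('//li/p[1]/text()').getall())  # ['1', '4', '7']
# Option 2: iterate over the <li> elements and call .get() on each one
for li in sel.xpath('//li'):
    print(li.xpath('./p[1]/text()').get())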

Related

Selenium starts-with searches entire page, not given WebElement

I want to search for a class name with starts-with inside a specific WebElement, but it searches the entire page instead. I do not know what is wrong.
This returns a list:
muidatagrid_rows = driver.find_elements(by=By.CLASS_NAME, value='MuiDataGrid-row')
one_row = muidatagrid_rows[0]
This HTML piece is inside the WebElement (one_row):
<div class="market-watcher-title_os_button_container__4-yG+">
<div class="market-watcher-title_tags_container__F37og"></div>
<div>
<a href="#" target="blank" rel="noreferrer" data-testid="ios download button for 1628080370">
<img class="apple-badge-icon-image"></a>
</div>
<div></div>
</div>
If I search with the full class name like this:
tags_and_marketplace_section = one_row.find_element(by=By.CLASS_NAME, value="market-watcher-title_os_button_container__4-yG+")
It gives an error:
selenium.common.exceptions.InvalidSelectorException: Message: Given css selector expression ".market-watcher-title_os_button_container__4-yG+" is invalid: InvalidSelectorError: Element.querySelector: '.market-watcher-title_os_button_container__4-yG+' is not a valid selector: ".market-watcher-title_os_button_container__4-yG+"
So I want to search with the starts-with method, but I cannot get what I want.
This should return only two WebElements, but it returns 20:
tags_and_marketplace_section = one_row.find_elements(by=By.XPATH, value='//div[starts-with(@class, "market-watcher-")]')
print(len(tags_and_marketplace_section))
>>> 20
Without seeing the page you are scraping it's difficult to help fully; however, what I've found is that "chaining" selectors can help to narrow down the returned results. Also, using the By.CSS_SELECTOR method works best for me.
For example, if what you want is inside a div and a p, then you would do something like this (note the leading dot, since MuiDataGrid-row is a class, not an id):
driver.find_elements(by=By.CSS_SELECTOR, value="div .MuiDataGrid-row p")
Then you can work with the elements that are returned as you described. You may be able to use other methods/selectors, but this is my favourite route so far.
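More directly to the question: the starts-with XPath matched 20 elements because of the absolute // prefix, which always searches from the document root even when called on a WebElement. A minimal sketch of scoping the search to one_row (locator values taken from the question):
from selenium.webdriver.common.by import By

# Relative XPath: the leading ".//" keeps the search inside one_row
divs = one_row.find_elements(by=By.XPATH, value='.//div[starts-with(@class, "market-watcher-")]')

# Equivalent CSS attribute selector ("class starts with"), also scoped to one_row
divs = one_row.find_elements(by=By.CSS_SELECTOR, value='div[class^="market-watcher-"]')
print(len(divs))  # should now count only the divs inside this row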

Creating a css selector to locate multiple ids in a single-shot

I've defined css selectors within the script to get the text within span elements, and I'm getting them accordingly. However, the way I've done it is definitely messy: I just separated the different css selectors with a comma to tell the script I'm after this or that.
If I opted for xpath, I could have used 'div//span[.="Featured" or .="Sponsored"]', but in the case of css selectors I could not find anything similar that serves the same purpose. I know that using 'span:contains("Featured"),span:contains("Sponsored")' I can get the text, but there is still the comma in between as usual.
What is the ideal way to locate the elements (within different ids) using css selectors, other than with a comma?
My try so far with:
from lxml.html import fromstring
html = """
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/pizza-hut">
Pizza Hut
</a>
<div id="featured other-dynamic-ids">
<span>Sponsored</span>
</div>
</div>
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/salads-up">
Salads UP
</a>
<div id="other-dynamic-ids border">
<span>Featured</span>
</div>
</div>
"""
root = fromstring(html)
for item in root.cssselect("[id~='featured'] span,[id~='border'] span"):
    print(item.text)
You can do:
.rest-list-information div span
But I think it's a bad idea to consider the comma messy. You won't find many stylesheets that don't have commas.
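For what it's worth, the suggested selector can be checked against the question's own HTML (root as parsed above):
# Every span under a div inside .rest-list-information, regardless of the dynamic ids
for item in root.cssselect(".rest-list-information div span"):
    print(item.text)  # Sponsored, then Featured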
If you are just looking to get all 'span' text from the HTML then the following should suffice:
root_spans = root.xpath('//span')
for root_span in root_spans:
    span_text = root_span.xpath('.//text()')[0]
    print(span_text)
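If the goal is specifically the "Featured"/"Sponsored" spans, the xpath variant mentioned in the question also works directly with lxml, and it avoids the comma entirely:
for span in root.xpath('//div//span[.="Featured" or .="Sponsored"]'):
    print(span.text)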

Navigating DOM in BeautifulSoup

I'm currently able to find certain elements using the findAll function. Is there a way to navigate to their children?
The code I have is:
data = soup.findAll(id="profile-experience")
print data[0].get_text()
And it returns a block of text (for example, some of the text isn't spaced out properly)
The DOM looks something like this:
<div id="profile-experience">
<div class="module-body">
<li class="position">
<li class="position">
<li class="position">
If I just do a findAll on class="position" I get way too much crap back. Is there a way, using BeautifulSoup, to find just the <li class="position"> elements that are nested underneath <div id="profile-experience">?
I want to do something like this:
data = soup.findAll('li', attrs={'class': 'position'})
(Where I'm only getting the nested data)
for d in data:
    print d.get_text()
Sure, you can "chain" the find* calls:
profile_experience = soup.find(id="profile-experience")
for li in profile_experience.find_all("li", class_="position"):
    print(li.get_text())
Or, you can solve it in one go with a CSS selector:
for li in soup.select("#profile-experience li.position"):
    print(li.get_text())

Issue with xpath / regex in Scrapy spider

I'm trying to extract a product id from an onclick attribute inside a preceding sibling, a ul tag (id="ShowProductImages").
The number I'm trying to extract is directly after ?pid=, example:
...list/ViewAll?pid=234565&image=206...
Below is the content that I'm trying to extract from:
<ul id="ShowProductImages" class="imageView">
<li><img src="http://content.example.com/assets/images/products/j458jk.jpg" width="200" height="150" alt="Product image description here" border="0"></li>
</ul>
<div class="description">
Description here...
</div>
I am using xpath to select the onclick attribute, along with a regular expression to extract the id. This is the code I'm using (which is not working):
def parse(self, response):
    sel = HtmlXPathSelector(response)
    products_path = sel.xpath('//div[@class="description"]')
    for product_path in products_path:
        product = Product()
        product['product_pid'] = product_path.xpath('preceding-sibling::ul[@id="ShowProductImages"][1]//li/a[1]/@onclick').re(r'(?:pid=)(.+?)(?:\'|$)')
        yield product
Any suggestions? I'm not quite sure where I went wrong.
Thanks for your help in advance.
I suggest you try this, selecting from the ul, and testing its <div class="description"> sibling in a predicate:
sel.xpath("""//ul[following-sibling::div[#class="description"]]
[#id="ShowProductImages"]
/li/a[1]/#onclick""").re(r'(?:pid=)(\d+)')
I changed your regular expression to restrict to digits.
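Putting it together, a minimal sketch of the corrected callback (Product and HtmlXPathSelector come from the question; on a modern Scrapy you would call response.xpath directly):
def parse(self, response):
    sel = HtmlXPathSelector(response)
    onclicks = sel.xpath('//ul[@id="ShowProductImages"]'
                         '[following-sibling::div[@class="description"]]'
                         '/li/a[1]/@onclick')
    # .re() runs the regex over every matched attribute value and
    # returns all captured groups as a flat list of strings
    for pid in onclicks.re(r'pid=(\d+)'):
        product = Product()
        product['product_pid'] = pid
        yield product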

How do I access an inline element inside a loop in lxml?

I am trying to screen scrape values from a website.
import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse("http://pagetoscrape.com/data.html")
# get all divs with class fruit
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')
# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    print fruit.xpath('//li[@class="fruit"]/em')[0].text
However, the Python interpreter complains that index 0 is out of bounds. That's interesting, because I am sure that the element exists. What is the proper way to access the inner <em> element with lxml?
The following code works for me with my test file.
#test.py
import lxml.html
# get the raw HTML
fruitsWebsite = lxml.html.parse('test.html')
# get all divs with class fruit
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')
# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    # Use a relative path so we don't find ALL of the li/em elements several times. Note the .//
    for item in fruit.xpath('.//li[@class="fruit"]/em'):
        print(item.text)

# Alternatively, as a single absolute query over the whole document
for item in fruitsWebsite.xpath('//div[@class="fruit"]//li[@class="fruit"]/em'):
    print(item.text)
Here is the html file I used to test against. If this doesn't work for the html you're testing against, you'll need to post a sample file that fails, as I requested in the comments above.
<html>
<body>
Blah blah
<div>Ignore me</div>
<div>Outer stuff
<div class='fruit'>Some <em>FRUITY</em> stuff.
<ol>
<li class='fruit'><em>This</em> should show</li>
<li><em>Super</em> Ignored LI</li>
<li class='fruit'><em>Rawr</em> Hear it roar.</li>
</ol>
</div>
</div>
<div class='fruit'><em>Super</em> fruity website of awesome</div>
</body>
</html>
You definitely will get too many results with the code you originally posted (the inner loop will search the entire tree rather than the subtree for each "fruit"). The error you're describing doesn't make much sense unless your input is different from what I understood.
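A tiny self-contained demonstration of that absolute-vs-relative difference (hypothetical markup, just to show the context-node behaviour):
import lxml.html

html = "<html><body><div><ul><li>1</li></ul></div><ul><li>2</li><li>3</li></ul></body></html>"
div = lxml.html.fromstring(html).xpath('//div')[0]
print(len(div.xpath('//li')))   # 3 -- a leading // always searches the whole document
print(len(div.xpath('.//li')))  # 1 -- .// stays inside the context node (the div)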
