I am trying to screen scrape values from a website.
import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse("http://pagetoscrape.com/data.html")
# get all divs with class fruit
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')
# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    print fruit.xpath('//li[@class="fruit"]/em')[0].text
However, the Python interpreter complains that index 0 is out of bounds. That's odd, because I am sure the element exists. What is the proper way to access the inner <em> element with lxml?
The following code works for me with my test file.
#test.py
import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse('test.html')
# get all divs with class fruit
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')
# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    # Use a relative path so we don't find ALL of the li/em elements several times. Note the .//
    for item in fruit.xpath('.//li[@class="fruit"]/em'):
        print(item.text)
    # Alternatively
    for item in fruit.xpath('//div[@class="fruit"]//li[@class="fruit"]/em'):
        print(item.text)
Here is the HTML file I used to test against. If this doesn't work for the HTML you're testing against, you'll need to post a sample file that fails, as I requested in the comments above.
<html>
<body>
  Blah blah
  <div>Ignore me</div>
  <div>Outer stuff
    <div class='fruit'>Some <em>FRUITY</em> stuff.
      <ol>
        <li class='fruit'><em>This</em> should show</li>
        <li><em>Super</em> Ignored LI</li>
        <li class='fruit'><em>Rawr</em> Hear it roar.</li>
      </ol>
    </div>
  </div>
  <div class='fruit'><em>Super</em> fruity website of awesome</div>
</body>
</html>
You definitely will get too many results with the code you originally posted (the inner loop will search the entire tree rather than the subtree for each "fruit"). The error you're describing doesn't make much sense unless your input is different than what I understood.
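For illustration, here is a minimal sketch (assuming the test.html shown above) of why the relative path matters: an absolute // search started from a subelement still scans the whole document, while .// stays inside that element.
import lxml.html

tree = lxml.html.parse('test.html')
for fruit in tree.xpath('//div[@class="fruit"]'):
    # Absolute path: searches the entire document on every iteration
    print(len(fruit.xpath('//li[@class="fruit"]/em')))   # 2 for both fruit divs
    # Relative path: searches only inside this particular fruit div
    print(len(fruit.xpath('.//li[@class="fruit"]/em')))  # 2, then 0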
Related
I want to search for a class name with starts-with inside a specific WebElement, but it searches the entire page. I do not know what is wrong.
This returns a list:
muidatagrid_rows = driver.find_elements(by=By.CLASS_NAME, value='MuiDataGrid-row')
one_row = muidatagrid_rows[0]
This is the HTML inside that WebElement (one_row):
<div class="market-watcher-title_os_button_container__4-yG+">
<div class="market-watcher-title_tags_container__F37og"></div>
<div>
<a href="#" target="blank" rel="noreferrer" data-testid="ios download button for 1628080370">
<img class="apple-badge-icon-image"></a>
</div>
<div></div>
</div>
If I search with the full class name like this:
tags_and_marketplace_section = one_row.find_element(by=By.CLASS_NAME, value="market-watcher-title_os_button_container__4-yG+")
It gives error:
selenium.common.exceptions.InvalidSelectorException: Message: Given css selector expression ".market-watcher-title_os_button_container__4-yG+" is invalid: InvalidSelectorError: Element.querySelector: '.market-watcher-title_os_button_container__4-yG+' is not a valid selector: ".market-watcher-title_os_button_container__4-yG+"
So I want to use the starts-with method instead, but I cannot get what I want.
This should return only two WebElements, but it returns 20:
tags_and_marketplace_section = one_row.find_elements(by=By.XPATH, value='//div[starts-with(@class, "market-watcher-")]')
print(len(tags_and_marketplace_section))
>>> 20
Without seeing the page you are scraping it's difficult to help fully; however, what I've found is that "chaining" selectors can help narrow down the returned results. Also, using the By.CSS_SELECTOR method works best for me.
For example, if what you want is inside a div and a p, then you would do something like this:
driver.find_elements(by=By.CSS_SELECTOR, value="div .MuiDataGrid-row p")
Then you can work with the elements that are returned as you described. You may be able to use other methods/selectors, but this is my favourite route so far.
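To stay within a single row rather than the whole page, a minimal sketch (assuming the class names from the question) is to chain the search off one_row, escaping the + in a CSS selector or using a relative .// XPath:
from selenium.webdriver.common.by import By

# CSS: the "+" in the class name must be escaped, otherwise the selector is invalid
container = one_row.find_element(By.CSS_SELECTOR, r"div.market-watcher-title_os_button_container__4-yG\+")

# XPath: note the leading "." - without it the search covers the whole document,
# which is why the original starts-with query returned 20 elements
sections = one_row.find_elements(By.XPATH, './/div[starts-with(@class, "market-watcher-")]')
print(len(sections))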
I don't understand why the following code doesn't work when using Scrapy Selector.
In scrapy shell (to be easily replicable, but the issue remains the same in a spider):
from scrapy.selector import Selector
body = '''<html>
<body>
<li>
<p>1</p>
<p>2</p>
<p>3</p>
</li>
<li>
<p>4</p>
<p>5</p>
<p>6</p>
</li>
<li>
<p>7</p>
<p>8</p>
<p>9</p>
</li>
</body>
</html>'''
sel = Selector(text=body, type="html")
for elem in sel.xpath('//body'):
    first = elem.xpath('.//li/p[1]/text()').get()
    print(first)
And it prints:
1
while it should be printing:
1
4
7
Any idea on how to solve this problem?
Thanks
There may be a chance that you're using the .get() method to fetch the data, which you can replace with .getall(). That method gives you all the data as a list, from which you can pull your desired items with Python slicing.
Alternatively, there may be a chance that the class name differs in each "li" tag, or that you have to pass a class="" in your XPath expression.
Note: rather than fetching data with elem.xpath('.//li/p[1]/text()').get(), you can simply get all the data using elem.xpath('.//li/p/text()').getall() and then apply your manipulation logic over the list, which is the easiest way if you don't get your desired output.
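As a minimal sketch of the .getall() suggestion against the HTML from the question (the per-<li> loop at the end is an alternative that prints the 1, 4, 7 the question expects):
from scrapy.selector import Selector

sel = Selector(text=body, type="html")  # body is the HTML string from the question

# .getall() returns every match, not just the first one
for elem in sel.xpath('//body'):
    print(elem.xpath('.//li/p[1]/text()').getall())  # ['1', '4', '7']

# Alternative: iterate over each <li> and .get() its first <p>
for li in sel.xpath('//body//li'):
    print(li.xpath('./p[1]/text()').get())  # 1, then 4, then 7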
Say I have an e-commerce site I want to scrape, and I am interested in the top ten trending products. When I dig into the HTML, the elements look like this:
<div>
  <div>
    <span>
      <a href='www.mysite/products/1'>
        Product 1
      </a>
    </span>
  </div>
  <div>
    <span>
      <a href='www.mysite/products/2'>
        Product 2
      </a>
    </span>
  </div>
  <div>
    <span>
      <a href='www.mysite/products/3'>
        Product 3
      </a>
    </span>
  </div>
  <div>
    <span>
      <a href='www.mysite/products/4'>
        Product 4
      </a>
    </span>
  </div>
</div>
My first solution was to extract the href attributes and store them in a list, then open a browser instance for each one, but that comes at a cost: I have to close and reopen the browser, and every time I open it I have to authenticate. I then tried a second solution. In solution two, the outer div is the parent, and per the Selenium way of doing things the products would be stored as follows:
product_1 = driver.find_element_by_xpath("//div/div[1]")
product_2 = driver.find_element_by_xpath("//div/div[2]")
product_3 = driver.find_element_by_xpath("//div/div[3]")
product_4 = driver.find_element_by_xpath("//div/div[4]")
So my objective is to search for a product, and after getting the list, target each box's a tag, click it, extract more details on the product, and then go back without closing the browser until my list is finished. Below is my solution:
for i in range(10):
    try:
        num = i + 1
        path = f"//div/div[{num}]/span/a"
        product_click = driver.find_element_by_xpath(path)
        driver.execute_script("arguments[0].click();", product_click)
        scrape_product_detail()  # function that scrapes the whole product detail
        driver.execute_script("window.history.go(-1)")  # go backwards to continue looping
    except NoSuchElementException:
        print('Element not found')
The problem is that it only works for the first product: it scrapes all the detail and then goes back. Despite returning to the product page, the program fails to find the second element and those coming afterwards, and I am failing to understand what the problem may be. Could you kindly assist? Thanks
Thanks @Debenjan, you did help me a lot there. Your solution is working like a charm. For those who want to know how I went about it, here is the code:
article_elements = self.find_elements_by_class_name("s-card-image")
collection = []
for news_box in article_elements:
    # Pull the product link (href) from the card's <a> tag
    slug = news_box.find_element_by_tag_name('a').get_attribute('href')
    collection.append(slug)

for i in range(len(collection)):
    self.execute_script("window.open()")
    self.switch_to.window(self.window_handles[i + 1])
    url = collection[i]
    self.get(url)
    print(self.title, url, self.current_url)
@A D, thanks so much, your solution is working too. I will just have to test and see what the best strategy is and go with it. Thanks a lot, guys.
I'm currently able to find certain elements using the findAll function. Is there a way to navigate to their children?
The code I have is:
data = soup.findAll(id="profile-experience")
print data[0].get_text()
And it returns a block of text (for example, some of the text isn't spaced out properly)
The DOM looks something like this
<div id="profile-experience>
<div class="module-body>
<li class="position">
<li class="position">
<li class="position">
If I just do a findAll on class="position" I get way too much crap back. Is there a way, using BeautifulSoup, to find just the elements that are <li class="position"> nested underneath <div id="profile-experience">?
I want to do something like this:
data = soup.findAll('li',attrs={'class':'position'})
(Where I'm only getting the nested data)
for d in data:
    print d.get_text()
Sure, you can "chain" the find* calls:
profile_experience = soup.find(id="profile-experience")
for li in profile_experience.find_all("li", class_="position"):
    print(li.get_text())
Or, you can solve it in one go with a CSS selector:
for li in soup.select("#profile-experience li.position"):
    print(li.get_text())
I want to get items according to their (preceding) <label> elements, like this:
<div>
  <ul>
    <li class="phone">
      <label>Mobile</label>
      312-999-0000
<div>
  <ul>
    <li class="phone">
      <label>Home</label>
      312-999-0001
I want to put the first number in the "Mobile" column/list, and the second in the Home list. I currently have code grabbing both of them, but I don't know the proper syntax for getting the label as it is in the source. This is what I'm using now:
for target in targets:
    item = CrawlerItem()
    item['phonenumbers'] = target.xpath('div/ul/li[@class="phone"]/text()').extract()
How should I rewrite that for item['mobilephone'] and item['homephone'], using the labels?
I found the answer while finishing up the question, and thought I should share it:
item['mobilephone'] = target.xpath('div/ul/li/label[contains(text(), "Mobile")]/following-sibling::text()').extract()
item['homephone'] = target.xpath('div/ul/li/label[contains(text(), "Home")]/following-sibling::text()').extract()
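For completeness, a minimal sketch of how this could look inside the loop; the .strip() cleanup is an assumption, since following-sibling::text() usually picks up the surrounding whitespace:
for target in targets:
    item = CrawlerItem()
    # Text node that immediately follows each matching <label>
    mobile = target.xpath('div/ul/li/label[contains(text(), "Mobile")]/following-sibling::text()').extract()
    home = target.xpath('div/ul/li/label[contains(text(), "Home")]/following-sibling::text()').extract()
    # Strip leading/trailing whitespace around the numbers (assumption)
    item['mobilephone'] = [t.strip() for t in mobile if t.strip()]
    item['homephone'] = [t.strip() for t in home if t.strip()]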