Extract h1 text from div class with scrapy or selenium - python

I am using Python along with Scrapy and Selenium. I want to extract the text from the h1 tag, which is inside a div that has a class.
For example:
<div class = "example">
<h1>
This is an example
</h1>
</div>
This is my tried code:
for single_event in range(1, length_of_alllinks):
    source_link.append(alllinks[single_event])
    driver.get(alllinks[single_event])
    s = Selector(response)
    temp = s.xpath('//div[@class="example"]//@h1').extract()
    print temp
    title.append(temp)
    print title
Each and every time I tried different methods I got an empty list.
Now, I want to extract "This is an example" i.e h1 text and store it or append it in a list i.e in my example title.
Like:
temp = ['This is an example']

Try the following to extract the intended text:
s.xpath('//div[@class="example"]/h1/text()').extract()

First, it seems that in your HTML the class attribute of the div is "example", but in your code you're looking for other class values. At least for XPath queries, keep in mind that you search by exact attribute value. You can use something like:
s.xpath('//div[contains(@class, "example")]')
To find an element that has the "example" class but may have additional classes. I'm not sure if this is a mistake or this is your actual code. In addition the fact that you have spaces in your HTML around the '=' sign of the class attribute may not be helping some parsers either.
Second, your query used in s.xpath seems wrong. Try something like this:
temp = s.xpath('//div[@class="example"]/h1').extract()
It's not clear from your code what s is, so I'm assuming the extract() method does what you think it does. Maybe a cleaner code sample would help us help you.

Related

bs4 BeautifulSoup - can't find what looks like custom tag to save my life

I'm admittedly beginner to intermediate with Python and novice to BeautifulSoup/web-scraping. However, I have successfully built a couple of scrapers. Normal tags = no problem (e.g., div, a, li, etc)
However, can't find how to reference this tag with .select or .find or attrs="" or anything:
..........
<react type="sad" msgid="25314120" num="2"
..........
I ultimately want what looks like the "num" attribute from whatever this ghastly thing is ... a "react" tag (though I don't think that's a thing?)?
.find() works the same way as you'd find other tags such as div, p and a tags. Therefore, we search for the 'react' tag.
react_tag = soup.find('react')
Then, access the num attribute like so.
num_value = react_tag['num']
Printing num_value should output:
2
As per the bs4 documentation, .find('tag') returns a single tag and .find_all('tag') returns a list of all matching tags in the HTML.
In your case, if there are multiple react tags, use this:
for reactTag in soup.find_all('react'):
    print(reactTag.get('num'))
To get only first tag use this
print(soup.find('react').get('num'))
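The calls above can be tried on a static snippet. This is only a sketch: the markup in the question is truncated, so the closing tags, the wrapping div, and the second react element (including its msgid) are assumptions made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <react ...> opening tag comes from the
# question; everything else here is assumed for the sake of the example.
html = """
<div class="post">
  <react type="sad" msgid="25314120" num="2"></react>
  <react type="happy" msgid="25314121" num="5"></react>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when the tag is absent.
first = soup.find("react")
first_num = first["num"] if first is not None else None

# find_all() returns every match; .get() avoids a KeyError on a missing attribute.
all_nums = [tag.get("num") for tag in soup.find_all("react")]

# Attribute filters narrow the search to one kind of reaction.
sad_nums = [tag["num"] for tag in soup.find_all("react", attrs={"type": "sad"})]

print(first_num, all_nums, sad_nums)
```

If find() returns None on the real page, the tags are probably added by JavaScript after load and are not present in the downloaded HTML.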
The user "s n" was spot on! These tags are created dynamically by JavaScript, which I didn't know anything about, but it was pretty easy to figure out. Using the Selenium library in Python together with a "headless" Chrome WebDriver, you can use Selenium selectors such as XPath and many others to find these tags.

How can I get the text from a class with no name?

This is the HTML I am trying to get the text 'RCOVE12776' from
<span class="">SKU</span>
": "
"RCOVE12776"
The code I am using to try and get it is:
driver.find_element_by_xpath('//*[@class=""]/[text()="SKU"]').text
I feel like I'm missing something very simple here. There may also be multiple matches, so I would need to get the text from every element with class "" that contains SKU.
If you have control over the original HTML code, use an attribute id instead of a class, and then use the driver to find the element with the id.
Try this code to get the required text:
driver.find_element_by_xpath('//*[span[@class="" and .="SKU"]]').text.split(':')[-1].strip()

scrapy returning an empty object

I am using CSS selectors and continually get a response with empty values. Here is the code:
import scrapy

class WebSpider(scrapy.Spider):
    name = 'activities'
    start_urls = [
        'http://capetown.travel/events/'
    ]

    def parse(self, response):
        all_div_activities = response.css("div.tribe-events-content")  # gdlr-core-pbf-column gdlr-core-column-60 gdlr-core-column-first
        title = all_div_activities.css("h2.tribe-events-list-event-title::text").extract()  # gdlr-core-text-box-item-content
        price = all_div_activities.css(".span.ticket-cost::text").extract()
        details = all_div_activities.css(".p::text").extract()
        yield {
            'title': title,
            'price': price,
            'details': details
        }
In your code you're selecting all events at once, so the output is a list of selectors, and calling extract() on it returns values for the whole page rather than for each event.
This is why you're not getting the data you want. You will need a for loop over each event on the page, in your case looping over all_div_activities.
Code for Script
def parse(self, response):
    all_div_activities = response.css('div.tribe-events-event-content')
    for a in all_div_activities:
        title = a.css('a.tribe-event-url::text').get()
        if a.css('span.ticket-cost::text'):
            price = a.css('span.ticket-cost::text').get()
        else:
            price = 'No price'
        details = a.css('div[class*="tribe-events-list-event-description"] > p::text').get()
        yield {
            'title': title.strip(),
            'price': price,
            'details': details
        }
Notes
Using an if statement for price because there were elements that had no price at all and so inputting some information is a good idea.
Using strip() on title when yielding the dictionary as the title had space and \n attached.
Advice
As a minor point, Scrapy suggests using the get() and getall() methods rather than extract_first() and extract(). With extract() it's not always obvious whether the output will be a list or not; in this case the output I got was a list. This is why the Scrapy docs suggest using get() instead. It's also a bit more compact. get() always returns a single string (or None when nothing matches), which meant I could strip newlines and spaces from the title, as you can see in the code above.
Another tip: if the class attribute is quite long, use a *= selector, as long as the partial attribute value you select still uniquely identifies the data you want. See here for a bit more detail.
Using items instead of yielding a dictionary may be better in the long run, as you can set default values for data that is present for some events on the page you're scraping but missing for others. You have to do this through a pipeline (again, if you don't understand this then don't worry). See the docs for items and here for a bit more on items.
Here is my one. Hope it will help you.
for item in response.css('div.tribe-events-event-content'):
    print(item.css('a.tribe-event-url::text').get())
    print(item.css('span.ticket-cost::text').get())
    print(item.css('p::text').get())
Thanks.
Here are some steps to get your code fixed:
A period before a name denotes the element's class name, NOT the HTML tag itself. So change .span.ticket-cost::text --> span.ticket-cost::text
Also .p::text --> p::text.
You are trying to get a string, so use the get() method instead of extract(), which returns a list.
Make sure to use > when the desired text is inside a direct child of the element you've selected.
Finally here is a CSS Selector Reference https://www.w3schools.com/cssref/css_selectors.asp

Python/BeautifulSoup - Getting specific attribute in the same tag/element

I am new to Python and BeautifulSoup. So please forgive me if I'm using the wrong terminology.
I am trying to get a specific 'text' from a div tag/element that has multiple attributes in the same tag.
<div class="property-item" data-id="183" data-name="Brittany Apartments" data-street_number="240" data-street_name="Brittany Drive" data-city="Ottawa" data-province="Ontario" data-postal="K1K 0R7" data-country="Canada" data-phone="613-688-2222" data-path="/apartments-for-rent/brittany-apartments-240-brittany-drive-ottawa/" data-type="High-rise-apartment" data-latitude="45.4461070" data-longitude="-75.6465360" >
Below is my code to loop through and find 'property-item'
for btnMoreDetails in citySoup.findAll(attrs= {"class":"property-item"}):
My question is, if I specifically want the 'data-name' and 'data-path' for example, how do I go about getting it?
I've searched google and even this website. Some were saying using the .contents[2]. But I still wasn't able to get any of it.
Once you have extracted the element (which findAll does one at a time) you can access attributes as though they were dictionary keys. So for example the following code:
data = """<div class="property-item" data-id="183" data-name="Brittany Apartments" data-street_number="240" data-street_name="Brittany Drive" data-city="Ottawa" data-province="Ontario" data-postal="K1K 0R7" data-country="Canada" data-phone="613-688-2222" data-path="/apartments-for-rent/brittany-apartments-240-brittany-drive-ottawa/" data-type="High-rise-apartment" data-latitude="45.4461070" data-longitude="-75.6465360" >"""
import bs4
soup = bs4.BeautifulSoup(data, "html.parser")
for btnMoreDetails in soup.findAll(attrs={"class": "property-item"}):
    print(btnMoreDetails["data-name"])
prints out
Brittany Apartments
If you want to get the data-name and data-path attributes, you can simply use the dictionary-like access to Tag's attributes:
for btnMoreDetails in citySoup.findAll(attrs={"class": "property-item"}):
    print(btnMoreDetails["data-name"])
    print(btnMoreDetails["data-path"])
Note that you can also use a CSS selector to match the property items:
for property_item in citySoup.select(".property-item"):
    print(property_item["data-name"])
    print(property_item["data-path"])
FYI, if you want to see all the attributes, use the .attrs property:
for property_item in citySoup.select(".property-item"):
    print(property_item.attrs)
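Dictionary-style access raises KeyError when an attribute is missing, so for real listings it may be safer to use .get() with a fallback. Here is a minimal sketch. The second div (id 184, "Hypothetical Towers") is invented to show the missing-attribute case, and most data-* attributes from the question are omitted for brevity.

```python
from bs4 import BeautifulSoup

# First div is abridged from the question; the second is a made-up item
# that lacks data-path.
html = """
<div class="property-item" data-id="183" data-name="Brittany Apartments"
     data-path="/apartments-for-rent/brittany-apartments-240-brittany-drive-ottawa/"></div>
<div class="property-item" data-id="184" data-name="Hypothetical Towers"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# .get(name, default) returns the default instead of raising KeyError.
rows = [
    (item.get("data-name"), item.get("data-path", ""))
    for item in soup.select(".property-item")
]

for name, path in rows:
    print(name, path)
```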

pulling multiple values from python ElementTree with lxml and xpath

I am almost certainly doing this horribly wrong, and the cause of my problem is my own ignorance, but reading python docs and examples isn't helping.
I am web-scraping. The pages I am scraping have the following salient elements:
<div class='parent'>
<span class='title'>
<a>THIS IS THE TITLE</a>
</span>
<div class='copy'>
<p>THIS IS THE COPY</p>
</div>
</div>
My objective is to pull the text nodes from 'title' and 'copy', grouped by their parent div. In the above example, I should like to retrieve a tuple ('THIS IS THE TITLE', 'THIS IS THE COPY')
Below is my code
## 'tree' is the ElementTree of the document I've just pulled
xpath = "//div[@class='parent']"
filtered_html = tree.xpath(xpath)
arr = []
for i in filtered_html:
    title_filter = "//span[@class='author']/a/text()" # xpath for title text
    copy_filter = "//div[@class='copy']/p/text()" # xpath for copy text
    title = i.getroottree().xpath(title_filter)
    copy = i.getroottree().xpath(copy_filter)
    arr.append((title, copy))
I'm expecting filtered_html to be a list of n elements (which it is). I'm then trying to iterate over that list of elements and for each one, convert it to an ElementTree and retrieve the title and copy text with another xpath expression. So at each iteration, I'm expecting title to be a list of length 1, containing the title text for element i, and copy to be a corresponding list for the copy text.
What I end up with: at every iteration, title is a list of length n containing all elements in the document matching the title_filter xpath expression, and copy is a corresponding list of length n for the copy text.
I'm sure that by now, anyone who knows what they're doing with xpath and etree can recognise I'm doing something horrible and mistaken and stupid. If so, can they please tell me how I should be doing this instead?
Your core problem is that the getroottree call you're making on each text element resets you to running your xpath over the whole tree. getroottree does exactly what it sounds like - returns the root element tree of the element you call it on. If you leave that call out it looks to me like you'll get what you want.
I personally would use the iterfind method on the element tree for my main loop, and would probably use the findtext method on the resulting elements to ensure that I receive only one title and one copy.
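My (untested!) code would look like this. Note the leading dot in the two inner filters, which keeps each search relative to the current parent div rather than the whole document:
parent_div_xpath = "//div[@class='parent']"
title_filter = ".//span[@class='title']/a"
copy_filter = ".//div[@class='copy']/p"
arr = [(i.findtext(title_filter), i.findtext(copy_filter)) for i in tree.iterfind(parent_div_xpath)]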
Alternately, you could skip explicit iteration entirely:
title_filter = "//div[@class='parent']/span[@class='title']/a/text()"
copy_filter = "//div[@class='parent']/div[@class='copy']/p/text()"
arr = zip(tree.findall(title_filter), tree.findall(copy_filter))
You might need to drop the text() call from the xpath and move it into a generator expression, I'm not sure offhand whether findall will respect it. If it doesn't, something like:
arr = zip((title.text for title in tree.findall(title_filter)), (copy.text for copy in tree.findall(copy_filter)))
And you might need to tweak that xpath if having more than one title/copy pair in a parent div is a possibility.
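The grouping idea can be sketched with searches that are relative to each parent div. To stay dependency-free, this example uses the standard library's xml.etree.ElementTree, whose iterfind and findtext methods lxml elements also provide. The two-block document is invented for illustration and is well-formed, which real scraped HTML often is not. With lxml, the equivalent per-element call would be i.xpath(".//span[@class='title']/a/text()").

```python
import xml.etree.ElementTree as ET

# Invented, well-formed document with two parent blocks.
doc = """
<html><body>
<div class='parent'>
  <span class='title'><a>FIRST TITLE</a></span>
  <div class='copy'><p>FIRST COPY</p></div>
</div>
<div class='parent'>
  <span class='title'><a>SECOND TITLE</a></span>
  <div class='copy'><p>SECOND COPY</p></div>
</div>
</body></html>
"""

root = ET.fromstring(doc)

# The leading "." makes every path relative to the element it is called on,
# so each parent div only sees its own title and copy.
pairs = [
    (
        div.findtext(".//span[@class='title']/a"),
        div.findtext(".//div[@class='copy']/p"),
    )
    for div in root.iterfind(".//div[@class='parent']")
]

print(pairs)
```

findtext returns only the first match under each parent, which gives one title and one copy per tuple.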
