Retrieve CSS tag values in Python

I am looking to retrieve the value "312 votes" from the below tag hierarchy:
<div class="rating-rank right">
<span class="rating-votes-div-65211">312 votes</span>
</div>
The problem seems to be that the span tag has a unique identifier for every value on the page, in the above case '65211'. What should I do to retrieve the required value?
I am using soup.select to get the values. But it doesn't seem to work.
for tag in soup.select('div.rating-rank right'):
    try:
        print(tag.string)
    except KeyError:
        pass

Your selector looks for a <right> element inside a div with class rating-rank, because a space in a CSS selector means "descendant of". You can select what you want like this:
soup.select("div.rating-rank.right span")
CSS selectors are read from right to left. So div.rating-rank.right span means: a span element that is a descendant of a div element having both rating-rank and right as classes. Once you have identified your span elements, you can print their contents as you already do.
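As a rough illustration of the selector above, here is a minimal sketch run against the markup from the question. It assumes Beautiful Soup 4 with the built-in html.parser; the variable names are only illustrative. The second selector is an optional alternative that matches on the stable part of the span's class name and ignores the numeric suffix.

```python
from bs4 import BeautifulSoup

html = """
<div class="rating-rank right">
<span class="rating-votes-div-65211">312 votes</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Any span that is a descendant of a div carrying both classes.
votes = [span.get_text(strip=True)
         for span in soup.select("div.rating-rank.right span")]

# Alternative: match the class prefix, whatever number follows it.
by_prefix = [span.get_text(strip=True)
             for span in soup.select('span[class^="rating-votes-div-"]')]

print(votes)
print(by_prefix)
```

Either list can then be iterated in the same way as in the original loop.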

Related

Python: exclude outer wrapping element when getting content via css/xpath selector

I tried this code to get the HTML content of element div.entry-content:
response.css('div.entry-content').get()
However, it returns the wrapping element too:
<div class="entry-content">
<p>**my content**</p>
<p>more content</p>
</div>
But I want just the contents, so in my case: <p>**my content**</p><p>more content</p>
I also tried an xpath selector response.xpath('//div[@class="entry-content"]').get(), but with the same result as above.
Based on F.Hoque's answer below I tried:
response.xpath('//article/div[@class="entry-content"]//p/text()').getall() and response.xpath('//article/div[@class="entry-content"]//p').getall()
These, however, return lists of, respectively, the text of each p element and the HTML of each p element. What I want is the HTML contents of the div.entry-content element as a single value, without the wrapping element itself.
I've tried Googling, but can't find anything.
As you said, your main div contains multiple p tags and you want to extract the text node value from those p tags. //p will select all the p tags.
response.xpath('//div[@class="entry-content"]//p').getall()
The following expression joins the results into a single string instead of a list:
p_tags = ''.join([x.get() for x in response.xpath('//article/div[@class="entry-content"]//p')])
Your content is in the <p> tags, not the <div>:
response.css('div.entry-content p').get()
or
response.xpath('//div[@class="entry-content"]/p').get()

Targeting Xpath attributes in Selenium

I'm using Selenium to scrape a Web Page and I'm having some problems targeting some attributes.
The page I'm trying to scrape looks like this:
<div>
<span abc> content </span>
<span def> content2 </span>
</div>
My goal would be to retrieve the text within the "span abc" tag, without selecting the other text included in the "span def" tag.
I've tried multiple approaches and looked at a lot of different resources but I wasn't able to find the right approach, since I don't want to select all the spans at the same time and I don't want to search based on the text within the tags.
A simple approach would be indexing, since you said:
"since I don't want to select all the spans at the same time and I don't want to search based on the text within the tags."
If abc is an attribute, please use:
//div/span[@abc]
or, with indexing:
(//div/span[@abc])[1]
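To see what an attribute predicate such as [@abc] does without starting a browser, here is a small sketch using Python's standard-library ElementTree, which supports a limited XPath subset. It is not Selenium code: the markup is rewritten as well-formed XML (abc="" instead of a bare abc) purely as an assumption for the demo.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the page fragment in the question.
doc = ET.fromstring(
    '<div>'
    '<span abc=""> content </span>'
    '<span def=""> content2 </span>'
    '</div>'
)

# [@abc] keeps only the span elements that carry an abc attribute.
matches = doc.findall("./span[@abc]")
texts = [span.text.strip() for span in matches]
print(texts)
```

In Selenium, the equivalent locator would be the //div/span[@abc] expression, followed by reading the element's .text.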
If you only want to pull the first span out of these two, you could easily do this with XPath. It would look like this:
span = driver.find_element_by_xpath("/html/body/div/span[1]").text
If you want to pull every span and run commands on each of them, you could do:
count = len(driver.find_elements_by_xpath("/html/body/div/span"))
m = 1
# XPath positions start at 1, so loop from 1 up to and including count
while m <= count:
    span = driver.find_element_by_xpath("/html/body/div/span[" + str(m) + "]")
    print(span.text)
    m = m + 1
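An XPath like //span[1]/text() selects the text node inside the <span> tag, but Selenium locators have to resolve to an element rather than a text node, so select the span itself and read its text:
span_text = driver.find_element_by_xpath("/html/body/div/span[1]").text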

Web scraping Button BeautifulSoup Python

I'm trying to scrape the span from a button that has a particular class. This is the code of the page on the website.
<button class="sqdOP yWX7d _8A5w5 " type="button">altri <span>17</span></button>
I'd like to find "17", which obviously changes every time. Thanks.
I've tried this, but it doesn't work:
for item in soup.find_all('button', {'class': 'sqdOP yWX7d _8A5w5 '}):
For complex selections, it's best to use selectors. These work very similarly to CSS selectors:
p selects an element with the type p.
p.example selects an element with type p and class example.
p span selects any span inside a p.
There are also others, but only these are needed for this example.
These can be nested as you like. For example, p.example span.foo selects any span with class foo inside any p with class example.
Now, an element can have multiple classes, and they are separated by spaces. <p class="foo bar">Hello, World!</p> has both foo and bar as class.
I think I am safe to assume the class sqdOP is unique. You can build the selector pretty easily using the above:
button.sqdOP span
Now, issue select, and BeautifulSoup will return a list of matching elements. If this is the only one, you can safely use [0] to get the first item. So, the final code to select that span:
soup.select('button.sqdOP span')[0]
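Put together, the selection might look like the sketch below. It assumes Beautiful Soup 4 with html.parser and uses select_one, which returns the first match or None, in place of indexing with [0]; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = ('<button class="sqdOP yWX7d _8A5w5 " type="button">'
        'altri <span>17</span></button>')
soup = BeautifulSoup(html, "html.parser")

# First span inside a button that has the sqdOP class, or None if absent.
span = soup.select_one("button.sqdOP span")
count = int(span.get_text()) if span is not None else None
print(count)
```

Checking for None avoids an IndexError or AttributeError on pages where the button is missing.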

How to find all anchor tags inside a div using Beautifulsoup in Python

So this is how my HTML looks that I'm parsing. It is all within a table and gets repeated multiple times and I just want the href attribute value that is inside the div with the attribute class="Special_Div_Name". All these divs are then inside table rows and there are lots of rows.
<tr>
<div class="Special_Div_Name">
text
</div>
</tr>
What I want is only the href attribute values that end in ".mp3" that are inside the div with the attribute class="Special_Div_Name".
So far I was able to come up with this code:
download = soup.find_all('a', href = re.compile('.mp3'))
for text in download:
    hrefText = (text['href'])
    print(hrefText)
This code currently prints off every href attribute value on the page that ends in ".mp3" and it's very close to doing exactly what I want. It's just that I only want the ".mp3"s that are inside that div class.
This minor adjustment should get you what you want:
special_divs = soup.find_all('div', {'class': 'Special_Div_Name'})
for div in special_divs:
    # search for matching anchors only inside this div
    download = div.find_all('a', href = re.compile(r'\.mp3$'))
    for link in download:
        hrefText = (link['href'])
        print(hrefText)
Since Beautiful Soup accepts most CSS selectors with the .select() method, I'd suggest using the attribute selector [href$=".mp3"] in order to select a elements with an href attribute ending with .mp3.
Then you can just prepend the selector .Special_Div_Name in order to only select anchor elements that are descendants:
for a in soup.select('div.Special_Div_Name a[href$=".mp3"]'):
    print(a['href'])
In a more general case, if you would just like to select a elements with an [href] attribute that are a descendant of a div element, then you would use the selector div a[href]:
for a in soup.select('div a[href]'):
    print(a)
If you don't use the code above, then based on the original code that you provided, you would need to select all the elements with a class of Special_Div_Name, then you would need to iterate over those elements and select the descendant anchor elements:
for div in soup.select('.Special_Div_Name'):
    for a in div.find_all('a', href = re.compile(r'\.mp3$')):
        print(a['href'])
As a side note, re.compile('.mp3') should be re.compile(r'\.mp3$') since . has special meaning in a regular expression. In addition, you will also want the anchor $ in order to match at the end of the string (rather than anywhere in the string).
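The difference between the two patterns can be checked with the re module alone. In this sketch the href values are made up for illustration; search is used because a compiled pattern passed to find_all is matched anywhere in the attribute value.

```python
import re

# Hypothetical href values, chosen to show the false positives.
hrefs = ["song.mp3", "song.mp3.txt", "notes-mp3-list.html", "xmp3"]

loose = re.compile('.mp3')       # "." matches any character, no end anchor
strict = re.compile(r'\.mp3$')   # literal dot, anchored at the end

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)
print(strict_hits)
```

The loose pattern accepts all four values, while the strict one keeps only the value that really ends in ".mp3".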

findAll() fails to find any elements within a given parent element

I'm trying to pull gene sequences from the NCBI website using Python and BeautifulSoup.
Upon viewing the HTML from the sequence page, I noticed that the sequence is stored within span elements stored within a pre element stored within a div element.
I've used the findAll() function in an attempt to pull the string contained inside the span element, but the findAll() function returns an empty list. I've attempted to use the findAll() function on the parent div element, and, whilst it returns the div element in question, it contains none of the HTML inside the div element; furthermore, the div element returned by the findAll() function is somewhat "corrupted" in that some of the attributes within the opening div tag are either missing or not in the correct order as given on the HTML webpage.
The following sample code is representative of the scenario:
Actual HTML:
<div id=viewercontent1 class="seq gbff" val=*some_value* sequencesize=some_sequencesize* virtualsequence style="display: block;">
<pre>
"*some gene information enclosed inside double quotation marks*
"
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE 1*</span>
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE 2*</span>
...
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE N*</span>
</pre>
</div>
My Code Snippets:
The object of my code is to pull the string contents of the pre element (both the span strings and the opening string beginning, "*some gene information...").
# Assume some predefined gene sequence url, gene_url.
page = urllib2.urlopen(gene_url)
soup = BeautifulSoup(page.read())
spans = soup.findAll('span',{'class':'ff_line'})
for span in spans:
    print(span.string)
This prints nothing because the spans list is empty. Much the same problem occurs if a findAll is applied to pre instead of span.
When I try to find the parent div element using the same procedure as above:
# ...
divs = soup.findAll('div',{'class':'seq gbff'})
for div in divs:
    print(div)
I get the following print output:
<div class="seq gbff" id="viewercontent1" sequencesize="*some_sequencesize*" val="*some_val*" virtualsequence=""></div>
The most obvious difference is that the printed result doesn't contain any of the nested HTML, but the content within the opening div tag is also different (attributes are either missing or in a different order). Compare with the equivalent line on the webpage:
<div id=viewercontent1 class="seq gbff" val=*some_value* sequencesize=some_sequencesize* virtualsequence style="display: block;">
Has this issue got something to do with the virtualsequence argument in the opening div tag?
How can I achieve my desired aim?
class is a reserved keyword in Python (used to define classes), so maybe this is causing the trouble. You can try appending an underscore and passing it as a keyword argument; perhaps this will help:
>>> soup.find_all('span',class_='ff_line')
Check out the docs.
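For reference, here is a small sketch of the class_ keyword next to the dictionary form, run on a static stand-in for the page. It assumes Beautiful Soup 4 with html.parser; the ids and sequence lines are placeholders, not real NCBI data.

```python
from bs4 import BeautifulSoup

# Placeholder markup shaped like the snippet in the question.
html = """
<div id="viewercontent1" class="seq gbff">
<pre>
<span class="ff_line" id="s1">LINE 1</span>
<span class="ff_line" id="s2">LINE 2</span>
</pre>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("span", class_="ff_line")
by_dict = soup.find_all("span", {"class": "ff_line"})

lines = [span.string for span in by_keyword]
print(lines)
print(len(by_keyword), len(by_dict))
```

On static markup like this both forms find the same spans. If the list is still empty on the real page, it may be that the spans are not part of the HTML that was downloaded, which would also fit the empty div in the printed output.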
