I am trying to get the text of the user's rank from this webpage. By "rank" I mean the text you see in the top right corner of the user's info.
In this example the rank is "Competitions Master". So how do I get that text?
If you inspect the rank, you will see an <a href="/progression"> element that contains a <p> tag, which in turn contains another <p> tag holding the rank.
First find the container <a href="/progression">, then find the <p> tags inside it, and then the <p> tags inside those. Print the text of the innermost <p>; only one <p> is nested this way inside <a href="/progression">.
Or there's a second method too: There's a button below "Home" with the name of the rank. You can try scraping that element.
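A minimal sketch of the first method, assuming the markup really nests the tags as described. The HTML below is a hypothetical stand-in for the profile page (the second link is invented for contrast), and Python's standard xml.etree.ElementTree is used only because the sample is well-formed; with a real page the same lookup would go through whichever scraping library you already use.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the profile page markup.
SAMPLE = """
<div>
  <a href="/progression">
    <p><p>Competitions Master</p></p>
  </a>
  <a href="/other"><p>Unrelated text</p></a>
</div>
"""

root = ET.fromstring(SAMPLE.strip())

# Step 1: find the container link by its href attribute.
link = root.find(".//a[@href='/progression']")

# Step 2: walk every <p> below it and keep the ones that carry text.
rank_texts = [p.text.strip() for p in link.iter("p") if p.text and p.text.strip()]

print(rank_texts)
```

Only the innermost <p> carries text in this sample, so the list holds a single entry.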
I'm using Selenium to scrape a Web Page and I'm having some problems targeting some attributes.
The page I'm trying to scrape looks like this:
<div>
<span abc> content </span>
<span def> content2 </span>
</div>
My goal would be to retrieve the text within the "span abc" tag, without selecting the other text included in the "span def" tag.
I've tried multiple approaches and looked at a lot of different resources but I wasn't able to find the right approach, since I don't want to select all the spans at the same time and I don't want to search based on the text within the tags.
A simple approach would be indexing, because you said:
"since I don't want to select all the spans at the same time and I don't want to search based on the text within the tags."
If abc is an attribute, use:
//div/span[@abc]
or, with indexing:
(//div/span[@abc])[1]
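The attribute predicate can be tried without a browser. This is only a sketch: it uses the XPath subset in Python's standard xml.etree.ElementTree rather than Selenium, the sample markup is a hypothetical stand-in for the page, and abc and def are given empty values because XML does not allow valueless attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page from the question.
SAMPLE = """
<body>
  <div>
    <span abc=""> content </span>
    <span def=""> content2 </span>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE.strip())

# [@abc] keeps only the spans that carry an "abc" attribute.
matches = root.findall(".//div/span[@abc]")
texts = [span.text.strip() for span in matches]

# ElementTree has no (...)[1] grouping, so take the first match in Python.
first_text = texts[0]

print(texts)
```

The span with the def attribute is never selected, and nothing in the path depends on the text inside the tags.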
If you only want to pull the first span out of these two, you can do it with an XPath positional index. It would look like this:
span = driver.find_element_by_xpath("/html/body/div/span[1]").text
If you want to pull every span and run commands on each of them, you could do:
span_count = len(driver.find_elements_by_xpath("/html/body/div/span"))
m = 1
while m <= span_count:  # XPath positions start at 1
    span = driver.find_element_by_xpath("/html/body/div/span[" + str(m) + "]")
    print(span.text)
    m = m + 1
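You can use an XPath like //span[1] to target the first <span> tag. Selenium locators must resolve to an element rather than a text node, so a path ending in /text() is rejected; select the element and read its .text instead:
span = driver.find_element_by_xpath("/html/body/div/span[1]").text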
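The positional index and the "every span" case can also be sketched outside Selenium. The example below is an illustration under assumptions: it uses the standard xml.etree.ElementTree module, a hypothetical well-formed copy of the page, and a relative path in place of /html/body/div/span.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page; the root element plays the role of <html>.
SAMPLE = """
<html>
  <body>
    <div>
      <span abc=""> content </span>
      <span def=""> content2 </span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

# Relative form of /html/body/div/span[1]; positions are 1-based.
first_span = root.find("./body/div/span[1]")
first_text = first_span.text.strip()

# Iterating over the matches avoids rebuilding an indexed path for each one.
all_texts = [span.text.strip() for span in root.findall("./body/div/span")]

print(first_text, all_texts)
```

In Selenium the same idea applies: the call that finds multiple elements returns a list, so it can be looped over directly and each element's .text read in turn.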
I am working with Selenium with Python to solve a problem. I want to extract information inside a paragraph (p tag). I am using "find_elements_by_tag_name" to locate all the p tags in the page. But how can I access tags that are nested inside that paragraph? For example, there is an HTML file with code like
<p> This is a paragraph <h1> but this is a h1 tag </h1></p>
I have used selenium to open the page like
br=webdriver.Chrome()
br.get('file:///C:/Users/Shady/Desktop/New%20Text%20Document.html')
I am able to access the elements of the p tag by
p_tags=br.find_elements_by_tag_name('p')
It shows only one element and when I do
print(p_tags[0].text)
it shows only
This is a paragraph
How can I access the h1 tag inside the p tag? Would XPath work? If yes, can you please share the code?
The <h1> tag is actually a descendant of the <p> tag. So in your code trials you identified the <p> tag and extracted its text, which correctly gave This is a paragraph.
So to extract the text but this is a h1 tag you have to reach the descendant <h1>, and you can use either of the following locator strategies:
Using css_selector:
print(driver.find_element_by_css_selector("p>h1").get_attribute("innerHTML"))
Using xpath:
print(driver.find_element_by_xpath("//p/h1").get_attribute("innerHTML"))
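The parent/child relationship can be illustrated without a browser. This sketch uses the standard xml.etree.ElementTree module instead of Selenium and assumes the parser keeps <h1> nested inside <p>, as the answer describes; the <body> wrapper is added only to give the sample a root element.

```python
import xml.etree.ElementTree as ET

# The paragraph from the question, wrapped in a root element.
SAMPLE = "<body><p> This is a paragraph <h1> but this is a h1 tag </h1></p></body>"

root = ET.fromstring(SAMPLE)

# Text that belongs to <p> itself, before its first child element.
p_text = root.find(".//p").text.strip()

# Same path as the XPath in the answer: an <h1> that is a child of a <p>.
h1_text = root.find(".//p/h1").text.strip()

print(p_text)
print(h1_text)
```

If the lookup for the nested <h1> finds nothing on a real page, it is worth checking in the browser's inspector whether the <h1> really ended up inside the <p>.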
So this is how my HTML looks that I'm parsing. It is all within a table and gets repeated multiple times and I just want the href attribute value that is inside the div with the attribute class="Special_Div_Name". All these divs are then inside table rows and there are lots of rows.
<tr>
<div class="Special_Div_Name">
text
</div>
</tr>
What I want is only the href attribute values that end in ".mp3" that are inside the div with the attribute class="Special_Div_Name".
So far I was able to come up with this code:
download = soup.find_all('a', href = re.compile('.mp3'))
for text in download:
    hrefText = (text['href'])
    print(hrefText)
This code currently prints every href attribute value on the page that contains ".mp3", and it's very close to doing exactly what I want. It's just that I only want the ".mp3"s that are inside that div class.
This minor adjustment should get you what you want:
special_divs = soup.find_all('div', {'class': 'Special_Div_Name'})
for div in special_divs:
    # Search only inside this div; the raw string keeps \. literal.
    download = div.find_all('a', href=re.compile(r'\.mp3$'))
    for link in download:
        hrefText = link['href']
        print(hrefText)
Since Beautiful Soup accepts most CSS selectors with the .select() method, I'd suggest using the attribute selector [href$=".mp3"] in order to select <a> elements with an href attribute ending with .mp3.
Then you can just prepend the selector .Special_Div_Name in order to only select anchor elements that are descendants of that div:
for a in soup.select('div.Special_Div_Name a[href$=".mp3"]'):
    print (a['href'])
In a more general case, if you would just like to select <a> elements with an [href] attribute that are descendants of a div element, then you would use the selector div a[href]:
for a in soup.select('div a[href]'):
    print (a)
If you don't use the code above, then based on the original code that you provided, you would need to select all the elements with a class of Special_Div_Name, then iterate over those elements and select the descendant anchor elements:
for div in soup.select('.Special_Div_Name'):
    for a in div.find_all('a', href = re.compile(r'\.mp3$')):
        print (a['href'])
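To compare the two approaches side by side, here is a small self-contained sketch. The table markup and the file paths are hypothetical (the anchors are invented, since the question's sample does not show them), and the html.parser backend is an arbitrary choice.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical rows; only the first one should be reported.
HTML = """
<table>
  <tr><td><div class="Special_Div_Name"><a href="/audio/one.mp3">one</a></div></td></tr>
  <tr><td><div class="Special_Div_Name"><a href="/docs/notes.pdf">notes</a></div></td></tr>
  <tr><td><div class="Other_Div"><a href="/audio/two.mp3">two</a></div></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Approach 1: a single CSS selector.
css_hrefs = [a["href"] for a in soup.select('div.Special_Div_Name a[href$=".mp3"]')]

# Approach 2: find the divs first, then search inside each one.
mp3 = re.compile(r"\.mp3$")
nested_hrefs = [
    a["href"]
    for div in soup.find_all("div", class_="Special_Div_Name")
    for a in div.find_all("a", href=mp3)
]

print(css_hrefs)
print(nested_hrefs)
```

Both lists contain the same single link here; the .mp3 link inside the other div and the .pdf link are skipped.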
As a side note, re.compile('.mp3') should be re.compile(r'\.mp3$') since . has special meaning in a regular expression (it matches any character). In addition, you will also want the anchor $ in order to match at the end of the string (rather than anywhere in the string).
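The difference between the two patterns is easy to check with the re module alone. The href values below are made up for illustration.

```python
import re

# Made-up href values: one real .mp3 link and two near misses.
hrefs = ["/audio/song.mp3", "/files/xmp3/page.html", "/audio/song.mp3.bak"]

loose = re.compile(".mp3")      # "." matches any character, no end anchor
strict = re.compile(r"\.mp3$")  # literal dot, must be at the end

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)
print(strict_hits)
```

The loose pattern accepts all three values, while the strict one keeps only the link that actually ends in .mp3.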
I am looking to retrieve the value "312 votes" from the below tag hierarchy:
<div class="rating-rank right">
<span class="rating-votes-div-65211">312 votes</span>
</div>
The problem seems to be that the span tag has a unique identifier for every value on the page, in the above case '65211'. What should I do to retrieve the required value?
I am using soup.select to get the values, but it doesn't seem to work.
for tag in soup.select('div.rating-rank right'):
    try:
        print(tag.string)
    except KeyError:
        pass
Your selector looks for a <right> element nested inside a div with class rating-rank, because the space makes right a separate element name. You can select what you want like this:
soup.select("div.rating-rank.right span")
CSS selectors are easiest to read from right to left. So div.rating-rank.right span means "a span element that is a descendant of a div element having both rating-rank and right as classes". Once you have identified your span elements, you can print their contents as you already do.
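A short sketch of the corrected selector on a hypothetical snippet. The second block and its id 70412 are invented to show that the changing number in the span's class does not matter, and html.parser is an arbitrary choice of backend.

```python
from bs4 import BeautifulSoup

# First block is from the question; the second one is made up.
HTML = """
<div class="rating-rank right">
  <span class="rating-votes-div-65211">312 votes</span>
</div>
<div class="rating-rank right">
  <span class="rating-votes-div-70412">87 votes</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Original selector: looks for a <right> element, so nothing matches.
wrong = soup.select("div.rating-rank right")

# Both classes chained on the div, then any descendant span.
votes = [span.get_text(strip=True) for span in soup.select("div.rating-rank.right span")]

print(wrong)
print(votes)
```

Because the selector targets the parent div's classes, the per-item suffix on the span's class never has to be known.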
I'm trying to pull gene sequences from the NCBI website using Python and BeautifulSoup.
Upon viewing the HTML from the sequence page, I noticed that the sequence is stored within span elements stored within a pre element stored within a div element.
I've used the findAll() function in an attempt to pull the string contained inside the span element, but the findAll() function returns an empty list. I've attempted to use the findAll() function on the parent div element, and, whilst it returns the div element in question, it contains none of the HTML inside the div element; furthermore, the div element returned by the findAll() function is somewhat "corrupted" in that some of the attributes within the opening div tag are either missing or not in the correct order as given on the HTML webpage.
The following sample code is representative of the scenario:
Actual HTML:
<div id=viewercontent1 class="seq gbff" val=*some_value* sequencesize=*some_sequencesize* virtualsequence style="display: block;">
<pre>
"*some gene information enclosed inside double quotation marks*
"
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE 1*</span>
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE 2*</span>
...
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE N*</span>
</pre>
</div>
My Code Snippets:
The object of my code is to pull the string contents of the pre element (both the span strings and the opening string beginning, "*some gene information...").
# Assume some predefined gene sequence url, gene_url.
page = urllib2.urlopen(gene_url)
soup = BeautifulSoup(page.read())
spans = soup.findAll('span',{'class':'ff_line'})
for span in spans:
    print(span.string)
This prints nothing because the spans list is empty. Much the same problem occurs if a findAll is applied to pre instead of span.
When I try to find the parent div element using the same procedure as above:
# ...
divs = soup.findAll('div',{'class':'seq gbff'})
for div in divs:
    print(div)
I get the following print output:
<div class="seq gbff" id="viewercontent1" sequencesize="*some_sequencesize*" val="*some_val*" virtualsequence=""></div>
The most obvious difference is that the printed result doesn't contain any of the nested HTML, but the content of the opening div tag is also different (attributes are either missing or in a different order). Compare with the equivalent line on the webpage:
<div id=viewercontent1 class="seq gbff" val=*some_value* sequencesize=*some_sequencesize* virtualsequence style="display: block;">
Has this issue got something to do with the virtualsequence attribute in the opening div tag?
How can I achieve my desired aim?
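class is a reserved keyword in Python (it is used to define classes), so it cannot be passed as a plain keyword argument, and maybe this is causing the trouble. Beautiful Soup accepts class_ with a trailing underscore instead, so you can try that form; the dictionary form used in the question should be equivalent, so if the result is still empty the cause may lie elsewhere: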
>>> soup.find_all('span',class_='ff_line')
Check out the docs.
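One way to test this suggestion is on a small snippet where the spans are known to be present. The markup, ids, and sequence strings below are hypothetical, and html.parser is an arbitrary choice of backend.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the sequence viewer markup.
HTML = (
    '<pre>'
    '<span class="ff_line" id="s1">ATGC</span>'
    '<span class="ff_line" id="s2">GGTA</span>'
    '</pre>'
)

soup = BeautifulSoup(HTML, "html.parser")

# Keyword form with the trailing underscore.
by_keyword = soup.find_all("span", class_="ff_line")

# Dictionary form, as used in the question.
by_dict = soup.find_all("span", {"class": "ff_line"})

sequence_lines = [str(span.string) for span in by_keyword]
dict_lines = [str(span.string) for span in by_dict]

print(sequence_lines)
```

Both forms find the same spans in this fragment. If the real page still yields an empty list with either form, it may be worth checking whether the spans appear at all in the HTML that was actually downloaded.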