I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that when you separate class names with a space, the CSS selector interprets the second name as a descendant element of that name, not as a second class. The correct approach is to chain the classes with dots: "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside p elements that are children of that div. If you only select the div, the selector will return the text directly inside the div, but not the text in its children. Since there is also a strong element as a child of the p, I would suggest a more general approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div and its descendants.
It's relevant to point out that extracting text with CSS selectors is a Scrapy extension of the standard selectors. Scrapy mentions this in its documentation.
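The distinction between a node's own text and its descendants' text can be sketched with the standard library alone (no Scrapy/parsel needed). The HTML below is a made-up stand-in for the page in the question:

```python
# Direct text vs. descendant text, illustrated with stdlib xml.etree.
# The markup is a hypothetical stand-in for the div in the question.
import xml.etree.ElementTree as ET

html = ('<div class="home-hero-blurb no-select">'
        '<p>First <strong>bold</strong> bit.</p>'
        '<p>Second paragraph.</p>'
        '</div>')
div = ET.fromstring(html)

# Analogue of selecting ::text on the div only: its direct text nodes.
direct = [t for t in [div.text] + [c.tail for c in div] if t]

# Analogue of *::text: text of the div and all of its descendants.
all_text = list(div.itertext())

print(direct)    # [] -- nothing sits directly inside the div
print(all_text)  # ['First ', 'bold', ' bit.', 'Second paragraph.']
```

This is why selecting only the div came back empty: all of the text lives one level down, inside the p elements.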
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
Related
I'm trying to scrape the span from a button that has a particular class. This is the code of the page on the website.
<button class="sqdOP yWX7d _8A5w5 " type="button">altri <span>17</span></button>
I'd like to find "17", which obviously changes every time. Thanks.
I've tried with this one but it doesn't work
for item in soup.find_all('button', {'class': 'sqdOP yWX7d _8A5w5 '}):
For complex selections, it's best to use selectors. These work very similarly to CSS.
p selects an element with the type p.
p.example selects an element with type p and class example.
p span selects any span inside a p.
There are also others, but only these are needed for this example.
These can be nested as you like. For example, p.example span.foo selects any span with class foo inside any p with class example.
Now, an element can have multiple classes, separated by spaces. <p class="foo bar">Hello, World!</p> has both foo and bar as classes.
I think I am safe to assume the class sqdOP is unique. You can build the selector pretty easily using the above:
button.sqdOP span
Now call select, and BeautifulSoup will return a list of matching elements. If this is the only one, you can safely use [0] to get the first item. So, the final code to select that span:
soup.select('button.sqdOP span')[0]
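If BeautifulSoup isn't at hand, the same selection (a span inside a button carrying the sqdOP class) can be sketched with the stdlib html.parser, using the exact snippet from the question:

```python
# A stdlib sketch of soup.select('button.sqdOP span'),
# using html.parser instead of BeautifulSoup.
from html.parser import HTMLParser

class ButtonSpanFinder(HTMLParser):
    """Collects the text of <span> elements nested in buttons with class sqdOP."""
    def __init__(self):
        super().__init__()
        self.in_button = False
        self.in_span = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == 'button':
            # class attributes are space-separated lists; split before matching
            classes = dict(attrs).get('class', '').split()
            self.in_button = 'sqdOP' in classes
        elif tag == 'span' and self.in_button:
            self.in_span = True
            self.results.append('')

    def handle_endtag(self, tag):
        if tag == 'button':
            self.in_button = False
        elif tag == 'span':
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.results[-1] += data

parser = ButtonSpanFinder()
parser.feed('<button class="sqdOP yWX7d _8A5w5 " type="button">'
            'altri <span>17</span></button>')
print(parser.results)  # ['17']
```

Note that matching on the split class list sidesteps the trailing-space problem in the original `{'class': 'sqdOP yWX7d _8A5w5 '}` attempt.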
I have the following:
This is my text string and this next <a href='https//somelink.org/'>part</a> is only partially enclosed in a tags.
In the above string I have to search for "next part", not only "part". Once I find "next part", I need to check whether there is an <a> tag present in the matched text (sometimes there is no <a> tag) - how can I do that?
In addition to my main question, I can't get my XPath to work to find "next part" in the elements.
I tried this:
//*[contains(text(),"next part")]
But it doesn't find anything, probably because I have spaces in there - how do I overcome this?
Thank you in advance,
Let's assume this html:
<p>This is my text string and this next <a href='https//somelink.org/'>part</a> is only partially enclosed in a tags.</p>
We can select with selenium:
p = driver.find_element_by_xpath('//p[contains(.,"next part")]')
And we can determine if it's partly in an a tag with regex (Tony the Pony notwithstanding):
html = p.get_attribute('innerHTML')
partly_in_a = 'next part' in re.sub(r'</?a.*?>', '', html) and 'next part' not in html
There's no pure xpath 1.0 solution for this, and it's a mistake in general to depend on xpath for stuff like this.
You'll need to use a nested XPath selector for this.
//*[contains(text(), 'next') and a[contains(text(), 'part')]]
This will query on any element that contains text next, then also check that the element contains nested a element with text part.
To determine whether or not there actually IS a nested a tag, you will need to write a method for this that checks against two different XPaths. There is no easy way around this, other than to evaluate the elements and see what's there.
public bool DoesElementHaveNestedTag()
{
    // Check for the presence of the locator with the nested tag:
    // if FindElements returns more than 0 matches, the nested-tag locator exists.
    return driver.FindElements(By.XPath("//*[contains(text(), 'next') and a[contains(text(), 'part')]]")).Count > 0;
}
You can change this method to fit your needs, but the idea is the same. There is no way to know if a WebElement has a nested tag or not, unless you try to find the WebElement using two XPaths -- one that checks for the tag, and one that does not.
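The same two-part presence check (element text contains "next", nested a contains "part") can be sketched without a browser at all; here is a minimal stdlib version, assuming a snippet modeled on the question's sentence:

```python
# Stdlib sketch of the nested-tag check using xml.etree.
import xml.etree.ElementTree as ET

# Hypothetical snippet modeled on the question's sentence.
snippet = ('<p>This is my text string and this next '
           '<a href="https//somelink.org/">part</a> '
           'is only partially enclosed in a tags.</p>')
p = ET.fromstring(snippet)

# Mirrors the two-locator idea: the element's own text contains "next",
# and a nested <a> child contains "part".
has_nested_a = ('next' in (p.text or '')) and any(
    'part' in (a.text or '') for a in p.findall('a'))
print(has_nested_a)  # True
```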
I am trying to loop over every <h2> tag (get the text of it) that is inside div's with the id="somediv" using this code:
for k,div1 in enumerate(tree.xpath('//div[@id="someid"]')):
print div1.xpath('.//h2['+str(k+1)+']/text()')
but it doesn't work. Why? However this works:
for i in range(5): #let's say there are 5 div's with id="someid" to make things easier
print tree.xpath('//div[@id="someid"]/div/div[1]/div[2]/h2['+str(i)+']/text()')
Problem here is, that I have to give the absolute path .../div/div[1]/div[2]... which I don't want. My first solution looks nice but is not producing the desired result, instead I can only retrieve all <h2> tags from one div="someid" at a time. Can anyone tell me what I am doing wrong?
.// will continue the search down the tree. A list of h2 text nodes subordinate to your div is just
tree.xpath('//div[@id="someid"]//h2/text()')
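The same descendant search can be sketched with the stdlib xml.etree module (its XPath support is limited, but .// works; the markup below is made up for illustration):

```python
# Collecting every <h2> below each div[@id="someid"], no positional indexing.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<body>'
    '<div id="someid"><div><div><h2>One</h2><h2>Two</h2></div></div></div>'
    '<div id="someid"><h2>Three</h2></div>'
    '</body>')

# .// searches all descendants, so the intermediate divs need not be spelled out.
titles = [h2.text
          for div in doc.findall('.//div[@id="someid"]')
          for h2 in div.findall('.//h2')]
print(titles)  # ['One', 'Two', 'Three']
```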
I have the following issue when trying to get information from some website using scrapy.
I'm trying to get all the text inside <p> tag, but my problem is that in some cases inside those tags there is not just text, but sometimes also an <a> tag, and my code stops collecting the text when it reaches that tag.
This is my Xpath expression, it's working properly when there aren't tags contained inside:
description = descriptionpath.xpath("span[@itemprop='description']/p/text()").extract()
Posting Pawel Miech's comment as an answer as it appears his comment has helped many of us thus far and contains the right answer:
Tack //text() on the end of the xpath to specify that text should be recursively extracted.
So your xpath would appear like this:
span[@itemprop='description']/p//text()
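The difference between /text() and //text() can be illustrated with the stdlib (made-up markup): direct text nodes stop at child elements, while the recursive form descends into them.

```python
# /p/text() vs /p//text(), sketched with stdlib xml.etree.
import xml.etree.ElementTree as ET

span = ET.fromstring('<span itemprop="description">'
                     '<p>Start <a href="x">link</a> end.</p></span>')
p = span.find('p')

# /p/text() analogue: the p's direct text nodes (skips text inside <a>).
direct = [t for t in [p.text] + [c.tail for c in p] if t]

# /p//text() analogue: every text node under p, including the <a>'s.
recursive = list(p.itertext())

print(direct)     # ['Start ', ' end.']
print(recursive)  # ['Start ', 'link', ' end.']
```

Note the word inside the link is missing from the first list, which is exactly the "stops collecting at the tag" symptom in the question.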
I'm trying to pull gene sequences from the NCBI website using Python and BeautifulSoup.
Upon viewing the HTML from the sequence page, I noticed that the sequence is stored within span elements stored within a pre element stored within a div element.
I've used the findAll() function in an attempt to pull the string contained inside the span elements, but findAll() returns an empty list. I've also applied findAll() to the parent div element; while it returns the div in question, it contains none of the HTML inside the div. Furthermore, the returned div is somewhat "corrupted": some of the attributes in the opening div tag are either missing or not in the order given on the HTML webpage.
The following sample code is representative of the scenario:
Actual HTML:
<div id=viewercontent1 class="seq gbff" val=*some_value* sequencesize=some_sequencesize* virtualsequence style="display: block;">
<pre>
"*some gene information enclosed inside double quotation marks*
"
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE 1*</span>
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE 2*</span>
...
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE N*</span>
</pre>
</div>
My Code Snippets:
The object of my code is to pull the string contents of the pre element (both the span strings and the opening string beginning, "*some gene information...").
# Assume some predefined gene sequence url, gene_url.
page = urllib2.urlopen(gene_url)
soup = BeautifulSoup(page.read())
spans = soup.findAll('span',{'class':'ff_line'})
for span in spans:
print span.string
This prints nothing because the spans list is empty. Much the same problem occurs if a findAll is applied to pre instead of span.
When I try to find the parent div element using the same procedure as above:
# ...
divs = soup.findAll('div',{'class':'seq gbff'})
for div in divs:
print div
I get the following print output:
<div class="seq gbff" id="viewercontent1" sequencesize="*some_sequencesize*" val="*some_val*" virtualsequence=""></div>
The most obvious difference is that the printed result doesn't contain any of the nested HTML, but the content within the opening div tag is also different (attributes are either missing or in the wrong order). Compare with the equivalent line on the webpage:
<div id=viewercontent1 class="seq gbff" val=*some_value* sequencesize=some_sequencesize* virtualsequence style="display: block;">
Has this issue got something to do with the virtualsequence attribute in the opening div tag?
How can I achieve my desired aim?
class is a reserved keyword in Python (used when defining classes), so maybe this is causing the trouble. You can append an underscore and pass it as a keyword argument; perhaps this will help:
>>> soup.find_all('span',class_='ff_line')
Check out the docs.
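A quick, BeautifulSoup-free way to see why the plain name can't work: passing class as a keyword argument is rejected by the Python parser itself, which is why the library accepts class_ instead.

```python
# `class` is a reserved word, so it can never appear as a keyword argument;
# compiling such a call raises SyntaxError before anything even runs.
try:
    compile("find_all('span', class='ff_line')", '<demo>', 'eval')
    outcome = 'compiled'
except SyntaxError:
    outcome = 'SyntaxError'
print(outcome)  # SyntaxError
```

The dict form used in the original code, {'class': 'ff_line'}, sidesteps the keyword entirely and is also valid.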