I'm trying to pull gene sequences from the NCBI website using Python and BeautifulSoup.
Upon viewing the HTML from the sequence page, I noticed that the sequence is stored within span elements stored within a pre element stored within a div element.
I've used the findAll() function in an attempt to pull the string contained inside the span element, but the findAll() function returns an empty list. I've attempted to use the findAll() function on the parent div element, and, whilst it returns the div element in question, it contains none of the HTML inside the div element; furthermore, the div element returned by the findAll() function is somewhat "corrupted" in that some of the attributes within the opening div tag are either missing or not in the correct order as given on the HTML webpage.
The following sample code is representative of the scenario:
Actual HTML:
<div id=viewercontent1 class="seq gbff" val=*some_value* sequencesize=some_sequencesize* virtualsequence style="display: block;">
<pre>
"*some gene information enclosed inside double quotation marks*
"
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE 1*</span>
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE 2*</span>
...
<span class="ff_line", id=*some id*>*GENETIC SEQUENCE LINE N*</span>
</pre>
</div>
My Code Snippets:
The object of my code is to pull the string contents of the pre element (both the span strings and the opening string beginning, "*some gene information...").
# Assume some predefined gene sequence url, gene_url.
page = urllib2.urlopen(gene_url)
soup = BeautifulSoup(page.read())
spans = soup.findAll('span',{'class':'ff_line'})
for span in spans:
    print span.string
This prints nothing because the spans list is empty. Much the same problem occurs if a findAll is applied to pre instead of span.
When I try to find the parent div element using the same procedure as above:
# ...
divs = soup.findAll('div',{'class':'seq gbff'})
for div in divs:
    print div
I get the following print output:
<div class="seq gbff" id="viewercontent1" sequencesize="*some_sequencesize*" val="*some_val*" virtualsequence=""></div>
The most obvious difference is that the printed result doesn't contain any of the nested HTML, but also the content within the opening div tag is also different (arguments are either missing or in the wrong order). Compare with equivalent line on webpage:
<div id=viewercontent1 class="seq gbff" val=*some_value* sequencesize=some_sequencesize* virtualsequence style="display: block;">
Has this issue got something to do with the virtualsequence argument in the opening div tag?
How can I achieve my desired aim?
class is a reserved keyword in Python (it is used to define classes), so it cannot be used directly as a keyword argument, and maybe this is causing the trouble. You can try appending an underscore and passing it as the class_ keyword argument; perhaps this will help:
>>> soup.find_all('span',class_='ff_line')
Check out the docs.
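A minimal sketch of the keyword form, run against a small static stand-in for the page. The HTML string, the ids and the sequence text are invented for illustration; only find_all, class_ and the ff_line class come from the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical static stand-in for the sequence page; the real page may differ.
html = """
<div id="viewercontent1" class="seq gbff">
<pre>gene information
<span class="ff_line" id="s1">ATGCATGC</span>
<span class="ff_line" id="s2">GGCCTTAA</span>
</pre>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ avoids clashing with Python's reserved word "class".
spans_kw = soup.find_all("span", class_="ff_line")

# Dict form, as used in the question, for comparison.
spans_dict = soup.find_all("span", {"class": "ff_line"})

lines = [span.string for span in spans_kw]
print(lines)
print(len(spans_kw), len(spans_dict))
```

On static HTML like this, both forms should find the same spans. If both come back empty on the live page, it may be worth printing the raw response to check whether the spans are present in the downloaded HTML at all.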
I tried this code to get the HTML content of element div.entry-content:
response.css('div.entry-content').get()
However, it returns the wrapping element too:
<div class="entry-content">
<p>**my content**</p>
<p>more content</p>
</div>
But I want just the contents, so in my case: <p>**my content**</p><p>more content</p>
I also tried an xpath selector response.xpath('//div[@class="entry-content"]').get(), but with the same result as above.
Based on F.Hoque's answer below I tried:
response.xpath('//article/div[@class="entry-content"]//p/text()').getall() and response.xpath('//article/div[@class="entry-content"]//p').getall()
These, however, return arrays of, respectively, the text of each p element and the p elements themselves. What I want is the HTML contents of the div.entry-content element as a single value, without the wrapping element itself.
I've tried Googling, but can't find anything.
As you said, your main div contains multiple p tags and you want to extract the text node value from those p tags. //p will select all the p tags.
response.xpath('//div[@class="entry-content"]//p').getall()
The following expression joins the results into a single string instead of an array:
p_tags = ''.join([x.get() for x in response.xpath('//article/div[@class="entry-content"]//p')])
Your content is in the <p> tags, not directly in the <div>:
response.css('div.entry-content p').get()
or
response.xpath('//div[@class="entry-content"]/p').get()
I am using Scrapy and want to get all the text of a child node. I am using the command below in Scrapy to get the text:
response.xpath('//div[@class="A"]/text()').get()
I am expecting the result: "1 -120u"
<div class="A">
<span id="B" class="C">
<span>1 </span>-110o</span>
<span id="B">
<span>1 </span>
-120u</span>
</div>
I have also tried the things below that I discovered on Stack Overflow:
response.xpath('//div[@class="A"]/text()').getall()
response.xpath('//div[@class="A"]/text()').extract()
response.xpath('//div[@class="A"]//text()').get()
response.xpath('//div[@class="A"]//text()').getall()
This should work to select all text inside div.A:
response.xpath('//div[@class="A"]//text()').getall()
And to filter out white-space strings:
response.xpath('//div[@class="A"]//text()[normalize-space()]').getall()
If you're looking to output "1 -120u" then you could:
substrings = response.css('span#B:not(.C)').xpath('.//text()[normalize-space()]').getall()
''.join(substrings)
This uses a css selector to select the span with an id of B but without a class of C, then chains an xpath selector to grab all the non-whitespace text inside this span. That returns a list of substrings, which you join together to get a single string like "1 -120u".
Additional explanation:
The text you're trying to select isn't a direct child of div - it's inside layers of span elements.
div/text() selects only text that's a direct child of div
div//text() selects all text that's a descendant of div
.get() is for selecting one result - if your selector yields a list of results this method will return the first item in that list
.getall() will return a list of results when your selector picks up multiple results, as is the case in your scenario
I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that a space in a CSS selector is the descendant combinator, so the selector treats no-select as the name of an element nested inside the div rather than as a second class on the div. Classes on the same element are chained with dots, so the correct approach is "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, the selector will return the text directly inside the div, but not the text in its children. Since there is also a strong element as a child of p, I would suggest using a more general approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div's descendants.
It's relevant to point out that the ::text pseudo-element is an extension to the standard CSS selectors. Scrapy mentions this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
So this is how my HTML looks that I'm parsing. It is all within a table and gets repeated multiple times and I just want the href attribute value that is inside the div with the attribute class="Special_Div_Name". All these divs are then inside table rows and there are lots of rows.
<tr>
<div class="Special_Div_Name">
text
</div>
</tr>
What I want is only the href attribute values that end in ".mp3" that are inside the div with the attribute class="Special_Div_Name".
So far I was able to come up with this code:
download = soup.find_all('a', href = re.compile('.mp3'))
for text in download:
    hrefText = (text['href'])
    print hrefText
This code currently prints every href attribute value on the page that contains ".mp3", and it's very close to doing exactly what I want. It's just that I only want the ".mp3" links that are inside that div class.
This minor adjustment should get you what you want:
special_divs = soup.find_all('div', {'class': 'Special_Div_Name'})
for div in special_divs:
    # Search only inside this div for links whose href ends in ".mp3".
    download = div.find_all('a', href=re.compile(r'\.mp3$'))
    for link in download:
        hrefText = link['href']
        print(hrefText)
Since Beautiful Soup accepts most CSS selectors with the .select() method, I'd suggest using the attribute selector [href$=".mp3"] in order to select a elements with an href attribute ending with .mp3.
Then you can just prepend the selector .Special_Div_Name in order to only select anchor elements that are descendants:
for a in soup.select('div.Special_Div_Name a[href$=".mp3"]'):
    print(a['href'])
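A self-contained sketch of that selector, for trying it out without the real page. The table rows, link targets and the Other class are invented; only Special_Div_Name and the selector itself come from the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one matching link, one non-mp3 link, one mp3 in another div.
html = """
<table>
  <tr><td><div class="Special_Div_Name"><a href="/audio/one.mp3">one</a></div></td></tr>
  <tr><td><div class="Special_Div_Name"><a href="/pages/about.html">about</a></div></td></tr>
  <tr><td><div class="Other"><a href="/audio/two.mp3">two</a></div></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Anchors inside div.Special_Div_Name whose href ends with ".mp3".
hrefs = [a["href"] for a in soup.select('div.Special_Div_Name a[href$=".mp3"]')]
print(hrefs)
```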
In a more general case, if you would just like to select a elements with an [href] attribute that are a descendant of a div element, then you would use the selector div a[href]:
for a in soup.select('div a[href]'):
    print(a)
If you don't use the code above, then based on the original code that you provided, you would need to select all the elements with a class of Special_Div_Name, then you would need to iterate over those elements and select the descendant anchor elements:
for div in soup.select('.Special_Div_Name'):
    for a in div.find_all('a', href=re.compile(r'\.mp3$')):
        print(a['href'])
As a side note, re.compile('.mp3') should be re.compile('\.mp3$') since . has special meaning in a regular expression (it matches any character). In addition, you will also want the anchor $ in order to match at the end of the string (rather than anywhere in the string).
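The difference between the two patterns can be checked with the standard re module alone. The file names below are made up, and the sketch assumes the pattern is applied with search semantics (a match anywhere in the string counts).

```python
import re

loose = re.compile('.mp3')       # "." matches any character, no end anchor
strict = re.compile(r'\.mp3$')   # literal dot, anchored at the end

# Hypothetical href values.
candidates = ['song.mp3', 'song.mp3.html', 'xmp3.html', 'notes.txt']

loose_hits = [c for c in candidates if loose.search(c)]
strict_hits = [c for c in candidates if strict.search(c)]

print(loose_hits)
print(strict_hits)
```

The loose pattern also accepts 'song.mp3.html' and 'xmp3.html', which is why the escaped, anchored version is the safer choice.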
I am looking to retrieve the value "312 votes" from the below tag hierarchy:
<div class="rating-rank right">
<span class="rating-votes-div-65211">312 votes</span>
</div>
The problem seems to be that the span tag has a unique identifier for every value on the page, in the above case '65211'. What should I do to retrieve the required value?
I am using soup.select to get the values. But it doesn't seem to work.
for tag in soup.select('div.rating-rank right'):
    try:
        print(tag.string)
    except KeyError:
        pass
Your selector looks for a right element that is a descendant of a div with class rating-rank. You can select what you want like this:
soup.select("div.rating-rank.right span")
CSS selectors are easiest to read from right to left, so div.rating-rank.right span means "a span element that is inside a div element that has both the rating-rank and right classes". Once you have selected your span elements, you can print their contents as you already do.
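A short sketch comparing the two selectors on the markup from the question, using html.parser so that no extra parser needs to be installed. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = """
<div class="rating-rank right">
<span class="rating-votes-div-65211">312 votes</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Space before "right": looks for a <right> element inside the div, finds none.
wrong = soup.select('div.rating-rank right')

# Both classes chained on the div, then any span inside it.
votes = [span.string for span in soup.select('div.rating-rank.right span')]

print(wrong)
print(votes)
```

Because the selector targets the span through its parent div, the changing number in the span's class name does not matter.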