BeautifulSoup find and find_all not working as expected - Python

I just started using BeautifulSoup and I am encountering a problem. I set up an HTML snippet below and made a BeautifulSoup object:
html_snippet = '<p class="course"><span class="text84">Ae 100. Research in Aerospace. </span><span class="text85">Units to be arranged in accordance with work accomplished. </span><span class="text83">Open to suitably qualified undergraduates and first-year graduate students under the direction of the staff. Credit is based on the satisfactory completion of a substantive research report, which must be approved by the Ae 100 adviser and by the option representative. </span> </p>'
subject = BeautifulSoup(html_snippet)
I have tried several find and find_all operations like the ones below, but all I am getting is None or an empty list:
subject.find(text = 'A')
subject.find(text = 'Research')
subject.next_element.find('A')
subject.find_all(text = 'A')
When I created the BeautifulSoup object from an HTML file on my computer before, the find and find_all operations all worked fine. However, when I pulled the html_snippet from a webpage read online through urllib2, I started getting problems.
Can anyone point out where the issue is?

Pass the argument like this:
import re
subject.find(text=re.compile('A'))
By default, the text filter matches only when it equals a text node's entire content. Passing in a regular expression lets you match on fragments.
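For instance, a minimal sketch using a trimmed version of the snippet above (trimmed only for brevity; the full snippet behaves the same way):
import re
from bs4 import BeautifulSoup
html_snippet = '<p class="course"><span class="text84">Ae 100. Research in Aerospace. </span></p>'
subject = BeautifulSoup(html_snippet, 'html.parser')
# text= must match a text node in its entirety, so a fragment returns None
print(subject.find(text='A'))                                # None
print(subject.find(text='Ae 100. Research in Aerospace. '))  # matches the whole node
# a regex is searched within each text node, so fragments match
print(subject.find(text=re.compile('A')))                    # 'Ae 100. Research in Aerospace. '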
EDIT: To match only text nodes beginning with A, you can use the following:
subject.find(text=re.compile('^A'))
To match only text nodes containing words that begin with A, you can use:
subject.find_all(text=re.compile(r'\bA'))
It's difficult to tell more specifically what you're looking for; let me know if I've misinterpreted what you're asking.

Related

BeautifulSoup find partial string in section

I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow a long time now and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see second to last line of html example at top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those functions really work. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, so the string argument can't find it? An example of a section on the page that I AM able to find has HTML like the one below, and I'm easily able to find it using the string argument and re.compile with "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking it. Just remove the regular expression part, take the text, and you will be fine. (The string argument fails here because, combined with a tag name, it matches against the tag's .string, which is None for a tag like this <section> that contains child elements.)
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want. Like so:
soup.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression matches any text node that has sentinel in it. Be careful that you may have to match some extra characters, like whitespace; that's why there is a . at the beginning of the regex. You might want a more robust regex, which you can test here:
https://regex101.com/
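Putting the pieces together, here is a minimal sketch (assuming the soup object and page structure from the question) that finds the text node by partial match and then walks up to its enclosing section to grab the link:
import re
# assumes soup was built from reqs.text as in the question
match = soup.find(string=re.compile('CIcyano'))   # partial match on any text node
if match is not None:
    section = match.find_parent('section', attrs={'class': 'onecol habonecol'})
    dwn_url = section.find('a').get('href')
    print(dwn_url)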
I ended up finding another method that does not use the string argument in find(). Instead I used something like the code below, which pulls the first instance of a section containing a partial text match.
sections = soup.find_all('section', attrs={'class':'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break

links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.

Trouble retrieving text from XML with ElementTree with tags

Right now I have some code which uses Biopython and NCBI's "Entrez" API to get XML strings from Pubmed Central. I'm trying to parse the XML with ElementTree to just have the text from the page. Although I have BeautifulSoup code that does exactly this when I scrape the lxml data from the site itself, I'm switching to the NCBI API since scrapers are apparently a no-no. But now with the XML from the NCBI API, I'm finding ElementTree extremely unintuitive and could really use some help getting it to work. Of course I've looked at other posts, but most of these deal with namespaces and in my case, I just want to use the XML tags to grab information. Even the ElementTree documentation doesn't go into this (from what I can tell). Can anyone help me figure out the syntax to grab information within certain tags rather than within certain namespaces?
Here's an example. Note: I use Python 3.4
Small snippet of the XML:
<sec sec-type="materials|methods" id="s5">
<title>Materials and Methods</title>
<sec id="s5a">
<title>Overgo design</title>
<p>In order to screen the saltwater crocodile genomic BAC library described below, four overgo pairs (forward and reverse) were designed (<xref ref-type="table" rid="pone-0114631-t002">Table 2</xref>) using saltwater crocodile sequences of MHC class I and II from previous studies <xref rid="pone.0114631-Jaratlerdsiri1" ref-type="bibr">[40]</xref>, <xref rid="pone.0114631-Jaratlerdsiri3" ref-type="bibr">[42]</xref>. The overgos were designed using OligoSpawn software, with a GC content of 50–60% and 36 bp in length (8-bp overlapping) <xref rid="pone.0114631-Zheng1" ref-type="bibr">[77]</xref>. The specificity of the overgos was checked against vertebrate sequences using the basic local alignment search tool (BLAST; <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>).</p>
<table-wrap id="pone-0114631-t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114631.t002</object-id>
<label>Table 2</label>
<caption>
<title>Four pairs of forward and reverse overgos used for BAC library screening of MHC-associated BACs.</title>
</caption>
<alternatives>
<graphic id="pone-0114631-t002-2" xlink:href="pone.0114631.t002"/>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"/>
<col align="center" span="1"/>
</colgroup>
For my project, I want all of the text in the "p" tags (not just for this snippet of the XML, but for the entire XML string).
Now, I already know that I can make the whole XML string into an ElementTree object:
>>> import xml.etree.ElementTree as ET
>>> tree = ET.ElementTree(ET.fromstring(xml_string))
>>> root = ET.fromstring(xml_string)
Now if I try to get the text using the tag like this:
>>> text = root.find('p')
>>> print("".join(text.itertext()))
or
>>> text = root.get('p').text
I can't extract the text that I want. From what I've read, this is because I'm using the tag "p" as an argument rather than a namespace.
While I feel like it should be quite simple for me to get all the text in "p" tags within an XML file, I'm currently unable to do it. Please let me know what I'm missing and how I can fix this. Thanks!
--- EDIT ---
So now I know that I should be using this code to get everything in the 'p' tags:
>>> text = root.find('.//p')
>>> print("".join(text.itertext()))
Despite the fact that I'm using itertext(), it's only returning content from the first "p" tag and not looking at any other content. Does itertext() only iterate within a tag? The documentation seems to suggest it iterates across all tags as well, so I'm not sure why it's only returning one line instead of all of the text under all of the "p" tags.
--- FINAL EDIT ---
I figured out that itertext() only iterates within the subtree of the element it is called on, and find() only returns the first match. In order to get the entire text that I want, I must use findall():
>>> all_text = root.findall('.//p')
>>> for texts in all_text:
...     print("".join(texts.itertext()))
root.get() is the wrong method, as it retrieves an attribute of the root tag, not a subtag.
root.find() is correct, as it finds the first matching subtag (alternatively, one can use root.findall() for all matching subtags).
If you want to find not only direct subtags but also indirect subtags (as in your example), the expression within root.find/root.findall has to be a subset of XPath (see https://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support). In your case it is './/p':
text = root.find('.//p')
print("".join(text.itertext()))

Is there a scrapy following-sibling count?

I'm trying to scrape the title of the following html code:
<FONT COLOR=#5FA505><B>Claim:</B></FONT> Coed makes unintentionally risqué remark about professor's "little quizzies."
<BR><BR>
<CENTER><IMG SRC="/images/content-divider.gif"></CENTER>
I'm using this code:
def parse_article(self, response):
    for href in response.xpath('//font[b = "Claim:"]/following-sibling::text()'):
        print href.extract()
and I successfully pull the correct Claim: value that I want from the aforementioned HTML, but it also (among others with a similar structure on the same page) pulls the HTML below. I am limiting my xpath() to the font tag labeled Claim:, so why is it pulling in the Origins text as well? And how can I fix it? I tried to get only the next following sibling instead of all of them, but that didn't work.
<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> Print references to the "little quizzies" tale date to 1962, but the tale itself has been around since the early 1950s. It continues to surface among college students to this day. Similar to a number of other college legends
I think your xpath is missing a text() qualifier (explained here). It should be:
'//font[b/text()="Claim:"]/following-sibling::text()'
The following-sibling axis returns all siblings following an element. If you only want the first sibling, try the XPath expression:
//font[b = "Claim:"]/following-sibling::text()[1]
Or, depending on your exact use case:
(//font[b = "Claim:"]/following-sibling::text())[1]
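The difference: text()[1] takes the first text sibling of each matching font, while the parenthesized form takes the first node of the combined result set. A quick sketch, with made-up sibling text standing in for the real page:
from scrapy.selector import Selector
html = ('<font color="#5FA505"><b>Claim:</b></font> Coed remark text. '
        '<br/><font color="#5FA505"><b>Origins:</b></font> Origins text.')
sel = Selector(text=html)
# without [1]: every following text sibling, including the Origins text
print(sel.xpath('//font[b = "Claim:"]/following-sibling::text()').extract())
# with [1]: only the first following text sibling of the matching font
print(sel.xpath('//font[b = "Claim:"]/following-sibling::text()[1]').extract())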

How can I dependably web-scrape a largely unattached line effectively?

Sorry if that was a vague title. I'm trying to scrape the number of the newest XKCD web-comic on a consistent basis. I saw that http://xkcd.com/ always has its newest comic on the front page, along with a line further down the site saying:
Permanent link to this comic: http://xkcd.com/1520/
Here, 1520 is the number of the newest comic on display. I want to scrape this number; however, I can't find a good way to do so. Currently all my attempts look really hackish, like:
soup = BeautifulSoup(urllib.urlopen('http://xkcd.com/').read())
test = soup.find_all('div')[7].get_text().split()[20][-5:-1]
I mean, that technically works, but if anything on the website gets moved in the slightest it could break horribly. I know there has to be a better way to just search for http://xkcd.com/####/ within the source of the front page and return ####, but I can't seem to find it. The Permanent link to this comic: http://xkcd.com/1520/ line just seems to be floating around without any kind of tag, class, or ID. Can anyone offer any assistance?
Usually I insist on using HTML parsers. Here, since we are looking for specific text in the HTML (not checking any tags), it is pretty much okay to apply a regular expression search for:
Permanent link to this comic: http://xkcd.com/(\d+)/
saving the digits in a capture group.
Demo:
>>> import re
>>> import requests
>>>
>>>
>>> data = requests.get("http://xkcd.com/").content
>>> pattern = re.compile(r'Permanent link to this comic: http://xkcd.com/(\d+)/')
>>> print pattern.search(data).group(1)
1520
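For what it's worth, a Python 3 version of the same idea, with the dots escaped so they cannot match arbitrary characters (the https scheme is an assumption about the current site):
import re
import requests
data = requests.get("https://xkcd.com/").text
match = re.search(r'Permanent link to this comic: https?://xkcd\.com/(\d+)/', data)
if match:
    print(match.group(1))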

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this calls for more than a simple Python replace, since what I am dealing with is HTML code, which can be quite complex. Take the following snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link: <a href="http://example.com">example.com</a>
A simple Python replace that substitutes every occurrence of example with <a href="http://example.com">example</a> would also rewrite the existing link, outputting broken markup like:
Here is an <a href="http://example.com">example</a> link: <a href="http://<a href="http://example.com">example</a>.com">example.com</a>
but I want:
Here is an <a href="http://example.com">example</a> link: <a href="http://example.com">example.com</a>
Is there any Python plugin capable of this? Thanks a lot!
This is roughly what you could do using BeautifulSoup:
from BeautifulSoup import BeautifulSoup

html_body = """
Here is an example link:<a href='http://example.com'>example.com</a>
"""

soup = BeautifulSoup(html_body)

# temporarily mark existing link texts with '|' so they are skipped below
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')

# replace the keyword in every plain text node, leaving marked words alone
for text in soup.findAll(text=True):
    text_formatted = ['<a href="http://example.com">example</a>'
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))

# restore the original link texts
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]

print soup
Basically I'm walking all the text in the post body and replacing the word example with the given link, without touching the link texts, which are protected by the '|' markers during the parsing.
This is not 100% perfect; for example, it does not work if the word you are trying to replace ends with a period. With some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub, which lets you pass in a function, but unless you are operating on plain text you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.
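As a rough sketch of that server-side route (using bs4 rather than minidom, with a hypothetical keywords mapping), you can walk the text nodes and apply re.sub only outside existing links:
import re
from bs4 import BeautifulSoup

keywords = {'example': 'http://example.com'}   # hypothetical keyword/link pairs
pattern = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, keywords)))

def linkify(html):
    soup = BeautifulSoup(html, 'html.parser')
    for text in soup.find_all(string=True):
        if text.find_parent('a'):   # never rewrite text already inside a link
            continue
        new_html = pattern.sub(
            lambda m: '<a href="%s">%s</a>' % (keywords[m.group(1)], m.group(1)),
            str(text))
        if new_html != str(text):
            text.replace_with(BeautifulSoup(new_html, 'html.parser'))
    return str(soup)

print(linkify("Here is an example link: <a href='http://example.com'>example.com</a>"))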
