I've tried to parse text out of some HTML elements using the string argument the way it is described here, but failed miserably. I tried two different ways, but each time I ran into the same AttributeError.
How can I use the string argument in this very case to fetch the text?
I've tried with:
import re
from bs4 import BeautifulSoup
htmlelement = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement,"lxml")
try:
    item = soup.find("caption",string="ASIC registration").text
    #item = soup.find("caption",string=re.compile("ASIC registration",re.I)).text
except AttributeError:
    item = ""
print(item)
Expected output (only using string argument):
ASIC registration
How can I use string argument in this very case to fetch the text?
You can't
Note:
I am assuming you mean by changing only the string parameter in
item = soup.find("caption",string="ASIC registration").text
As given in the documentation
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
import re
from bs4 import BeautifulSoup
htmlelement = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.find("caption")
print(item.string)
Output
None
Here the .string is None as caption has more than one child.
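A minimal sketch of that rule, assuming bs4 is installed and using the built-in html.parser backend instead of lxml. The single-child `<b>` snippet is made up purely for contrast:

```python
from bs4 import BeautifulSoup

# Hypothetical single-child tag: its only child is a NavigableString.
single = BeautifulSoup("<b>ASIC registration</b>", "html.parser").b

# The caption from the question: a <span> child plus a text node.
multi = BeautifulSoup(
    '<caption><span class="toggle open"></span>ASIC registration</caption>',
    "html.parser",
).caption

single_string = single.string            # the lone text child
multi_string = multi.string              # None: the tag has more than one child
multi_text = multi.get_text(strip=True)  # all text below the tag, stripped

print(single_string, multi_string, multi_text)
```

If the goal is simply the text of the tag, get_text() does not care how many children there are.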
If you are trying to get the parent (caption tag in this case) with the text, you could do
item = soup.find(string=re.compile('ASIC registration')).parent
which will give
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
Of course, calling .text on this parent tag gives the full text within that tag, even if the search pattern matched only part of it.
item = soup.find(string=re.compile('ASIC')).parent.text
will give an output
ASIC registration
The issue you're running into is that the string argument searches for strings instead of for tags as it states in the documentation you linked.
The syntax you are using:
soup.find("caption",string="ASIC registration")
is for finding tags.
For finding strings:
soup.find(string=re.compile('ASIC'))
With the first one you are saying: find a caption tag whose .string matches your string. The caption tag here has a child tag, so its .string is None and nothing is returned.
The second one is saying find the string that contains 'ASIC', so it returns the string.
Turns out the string parameter doesn't work if a tag has a child tag. The following code is stupid, but it works:
real_item = ""
try:
    items = soup.find_all("caption")
    r = re.compile(u"ASIC registration", re.I)
    for item in items:
        # item.strings yields every text node below the tag
        for s in item.strings:
            if r.search(str(s)):  # use unicode(s) on Python 2
                real_item = item
                break
except AttributeError:
    real_item = ""
print(real_item)
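A more compact variant of the same idea is to pass a function to find() instead of the string argument. This is only a sketch: it assumes bs4 is installed, uses html.parser rather than lxml, and the helper name caption_with_text is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html_element = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(html_element, "html.parser")
pattern = re.compile(r"ASIC registration", re.I)


def caption_with_text(tag):
    # Match <caption> tags whose combined text contains the pattern,
    # regardless of how many child tags they have.
    return tag.name == "caption" and pattern.search(tag.get_text()) is not None


match = soup.find(caption_with_text)
item = match.get_text(strip=True) if match is not None else ""
print(item)
```

This does not use the string argument at all, so it sidesteps the single-child restriction rather than working around it.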
Related
I have a class in my html code. I need to locate td class "Currentlocation" using python.
CODE :
<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>
Below are the codes I tried.
First attempt:
My_result = page_soup.find_element_by_class_name('CURRENTLOCATION')
Getting "TypeError: 'NoneType' object is not callable" error.
Second attempt:
My_result = page_soup.find(‘td’, attrs={‘class’: ‘CURRENTLOCATION’})
Getting "invalid character in identifier" error.
Can anyone please help me locate a class in html code using python?
from bs4 import BeautifulSoup
# raw string, so the backslashes in the src path are kept literally
sdata = r'<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
soup = BeautifulSoup(sdata, 'lxml')
mytds = soup.find_all("td", {"class": "CURRENTLOCATION"})
for td in mytds:
    print(td)
I tried your code (the second example), and the problem is the quotation marks you use. They are typographic quotes (‘ and ’, Unicode code points U+2018 and U+2019), while the Python interpreter requires straight single (') or double (") quotation marks.
After changing them, I can find the tag:
>>> bs.find('td', attrs={'class': 'CURRENTLOCATION'})
<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>
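To see why the interpreter rejects the typographic quotes, here is a small standard-library sketch. The source strings are made up for illustration and are compiled at runtime so the failure can be caught:

```python
import unicodedata

# The same assignment written with typographic quotes and with straight quotes.
bad_source = "x = \u2018td\u2019"
good_source = bad_source.replace("\u2018", "'").replace("\u2019", "'")

try:
    compile(bad_source, "<example>", "exec")
    bad_compiles = True
except SyntaxError:
    # Python 3 reports an invalid character here.
    bad_compiles = False

namespace = {}
exec(compile(good_source, "<example>", "exec"), namespace)

print(bad_compiles, namespace["x"])
print(unicodedata.name("\u2018"), "/", unicodedata.name("\u2019"))
```

Quotes like these usually sneak in when code is copied from a word processor or a web page that auto-formats punctuation.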
About your first example: I do not know where you found a reference to the method find_element_by_class_name, but it is not implemented by the BeautifulSoup class. The class instead implements __getattr__, a special method that is invoked whenever you try to access a nonexistent attribute. Here is an excerpt of the method:
def __getattr__(self, tag):
    #print "Getattr %s.%s" % (self.__class__, tag)
    if len(tag) > 3 and tag.endswith('Tag'):
        #
    # We special case contents to avoid recursion.
    elif not tag.startswith("__") and not tag == "contents":
        return self.find(tag)
When you try to access the attribute find_element_by_class_name, you are actually looking for a tag with the same name. No such tag exists, so the lookup returns None, and calling None raises the TypeError.
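The mechanism can be reproduced without BeautifulSoup at all. The class below is a hypothetical stand-in (FakeSoup is not part of any library) that copies only the attribute fallback from the excerpt:

```python
class FakeSoup:
    """Hypothetical stand-in that mimics the attribute fallback described above."""

    def find(self, name):
        # Pretend no tag with this name exists in the document.
        return None

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if not name.startswith("__") and name != "contents":
            return self.find(name)
        raise AttributeError(name)


soup = FakeSoup()
missing = soup.find_element_by_class_name  # resolved via __getattr__, so this is None

try:
    missing("CURRENTLOCATION")
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The attribute access itself succeeds silently. Only the call on the returned None fails, which is why the error mentions NoneType rather than a missing method.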
There is a function in BeautifulSoup for this.
You can get all the desired tags by specifying the attributes you are looking for in the find_all function. It returns a list of all the elements that fulfill the criteria.
import re
from bs4 import BeautifulSoup
text = r'<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
soup = BeautifulSoup(text, 'lxml')
# look for all the td tags whose class attribute is set to CURRENTLOCATION
output_list = soup.find_all('td',{"class": "CURRENTLOCATION"})
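As a sketch of what can be done with the result, the same lookup can be written with the class_ keyword and the text pulled out of each match. This assumes bs4 is installed and uses html.parser so lxml is not required:

```python
from bs4 import BeautifulSoup

text = r'<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
soup = BeautifulSoup(text, "html.parser")

# class_ (with a trailing underscore) avoids the clash with Python's class keyword.
cells = soup.find_all("td", class_="CURRENTLOCATION")
labels = [cell.get_text(strip=True) for cell in cells]

# The comparison is case-sensitive, so a lowercase class name finds nothing.
lowercase_cells = soup.find_all("td", class_="currentlocation")

print(labels, lowercase_cells)
```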
I'm trying to parse a specific "item" on a site, but I don't know if it's a class, object, id, or something else.
My code:
soup = BeautifulSoup(urllib2.urlopen(myURL))
divdata = soup.find('div')
print(divdata)
And it returns:
<div data-store='{"Auth":{"cookie":null,"user":null,"timestamp":1485297666762},"Blocked":{},"Broadcast":
{"forceUpdate":false,"failed":[],"pending":[],"error":
{"isNotFound":false,"isServerError":false,"isUnavailable":false}},"BroadcastCache":{"broadcasts":{"ID1":{"broadcast":
{"data":{"class_name":"Broadcast","id":"ID1","state":"running,
....(more)....
So I want to retrieve "running", or whatever is in "state".
I tried
statedata = soup.find('div', {"class":"state"})
But it returns nothing, what is the correct way to retrieve it?
import json
div_tag = soup.find('div', {'data-store':True})
data_string = div_tag['data-store'] # get data string
json.loads(data_string)['BroadcastCache']['broadcasts']['ID1']['broadcast']['data']['state'] # convert data string to python dict and get state
out:
'running'
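The snippet above hard-codes the broadcast ID. A hedged sketch that avoids this is shown below. The payload is a made-up, much shortened version of the real attribute, and html.parser is used so that only bs4 is required:

```python
import html
import json
from bs4 import BeautifulSoup

# Hypothetical payload shaped like the data-store attribute in the question.
payload = {
    "BroadcastCache": {
        "broadcasts": {
            "ID1": {"broadcast": {"data": {"id": "ID1", "state": "running"}}}
        }
    }
}
html_doc = '<div data-store="{}"></div>'.format(html.escape(json.dumps(payload)))

soup = BeautifulSoup(html_doc, "html.parser")
div_tag = soup.find("div", attrs={"data-store": True})  # any div carrying the attribute
store = json.loads(div_tag["data-store"])               # attribute value -> dict

# Collect the state of every broadcast without hard-coding the ID.
states = {
    broadcast_id: entry["broadcast"]["data"]["state"]
    for broadcast_id, entry in store["BroadcastCache"]["broadcasts"].items()
}
print(states)
```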
The correct syntax is soup.find_all('div', class_='state').
Note the underscore after class_.
It's unlikely to work in your case without modification, since data-store is an attribute of the div rather than a class, and the rest is just that attribute's string value, not the content of a tag. You could simply call .find('"state"') on that string.
I am trying to grab all text between a tag that has a specific class name. I believe I am very close to getting it right, so I think all it'll take is a simple fix.
In the website these are the tags I'm trying to retrieve data from. I want 'SNP'.
<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>
From what I have currently:
from lxml import html
import requests
def main():
    url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
    page = html.fromstring(requests.get(url_link).text)
    for span_tag in page.xpath("//span"):
        class_name = span_tag.get("class")
        if class_name is not None:
            if "rtq_exch" == class_name:
                print(url_link, span_tag.text)

if __name__ == "__main__":
    main()
I get this:
http://finance.yahoo.com/q?s=^GSPC&d=t None
To show that it works, when I change this line:
if "rtq_dash" == class_name:
I get this (please note the '-' which is the same content between the tags):
http://finance.yahoo.com/q?s=^GSPC&d=t -
What I think is happening is it sees the child tag and stops grabbing the data, but I'm not sure why.
I would be happy with receiving
<span class="rtq_dash">-</span>SNP
as a string for span_tag.text, as I can easily chop off what I don't want.
At a higher level: I'm trying to get the stock symbol from the page.
Here is the documentation for requests, and here is the documentation for lxml (xpath).
I want to use xpath instead of BeautifulSoup for several reasons, so please don't suggest changing to use that library instead, not that it'd be any easier anyway.
There are a few possible ways. You can find the outer span and return its direct-child text node:
>>> url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
>>> page = html.fromstring(requests.get(url_link).text)
>>> for span_text in page.xpath("//span[@class='rtq_exch']/text()"):
...     print(span_text)
...
SNP
or find the inner span and get its tail:
>>> for span_tag in page.xpath("//span[@class='rtq_dash']"):
...     print(span_tag.tail)
...
...
SNP
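The reason span_tag.text was None in the question is the ElementTree data model: .text holds only the text before the first child, and text that follows a child is stored on that child's .tail. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml, which follows the same model, so it runs without any network access:

```python
import xml.etree.ElementTree as ET

outer = ET.fromstring(
    '<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>'
)
inner = outer.find("span")  # the direct-child rtq_dash span

outer_text = outer.text     # None: nothing precedes the first child
inner_text = inner.text     # '-'
inner_tail = inner.tail     # 'SNP ': text after the child's closing tag

# itertext() walks every text node in document order.
symbol = "".join(outer.itertext()).lstrip("-").strip()

print(repr(outer_text), repr(inner_text), repr(inner_tail), symbol)
```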
Use BeautifulSoup:
import bs4
html = """<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>"""
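soup = bs4.BeautifulSoup(html, "html.parser")  # name the parser explicitly
# the second string inside the outer span is the text after the inner span
snp = list(soup.find_all("span", class_="rtq_exch")[0].strings)[1]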
I am using Python with BeautifulSoup 4 to find links in an HTML page that match a particular regular expression. I am able to find links and text matching the regex, but the two combined won't work. Here's my code:
import re
import bs4
s = '<a>Sign in&nbsp;<br /></a>'
soup = bs4.BeautifulSoup(s)
match = re.compile(r'sign\s?in', re.IGNORECASE)
print soup.find_all(text=match) # [u'Sign in\xa0']
print soup.find_all(name='a')[0].text # Sign in
print soup.find_all('a', text=match) # []
Comments are the outputs. As you can see the combined search returns no result. This is strange.
It seems to have something to do with the "br" tag (or any tag) contained inside the link text. If you delete it, everything works as expected.
You can either look for the tag or look for its text content, but not both together:
given that:
self.name = u'a'
self.text = SRE_Pattern: <_sre.SRE_Pattern object at 0xd52a58>
from the source:
# If it's text, make sure the text matches.
elif isinstance(markup, NavigableString) or \
         isinstance(markup, basestring):
    if not self.name and not self.attrs and self._matches(markup, self.text):
        found = markup
That makes @Totem's remark the way to go, by design.
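One way to combine both conditions anyway is to pass a function to find_all and test the tag name and the tag's full text inside it. This is a sketch under assumptions: the markup is made up to resemble the question's link, bs4 is assumed to be installed, and link_with_text is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a link whose text is followed by a <br /> child tag.
markup = '<p><a href="#">Sign in&nbsp;<br /></a> <a href="#">Help</a></p>'
soup = BeautifulSoup(markup, "html.parser")
pattern = re.compile(r"sign\s?in", re.IGNORECASE)


def link_with_text(tag):
    # get_text() joins all text below the tag, so child tags do not matter.
    return tag.name == "a" and pattern.search(tag.get_text()) is not None


links = soup.find_all(link_with_text)
print(links)
```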
Brief explanation: I have a script which loops through elements of a page, then returns the data. But I want it to return data which is not in an element, but in order.
import argparse, os, socket, urllib2, re
from bs4 import BeautifulSoup
pge = urllib2.urlopen("").read()
src = BeautifulSoup(pge)
body = src.findAll('body')
el = body[0].findChildren()
for s in el:
    cname = s.get('class')
    if cname[0] == "work":
        print s.text
HTML:
<body>
<div class="work">1</div>
<span class="nope">tosee</span>
<span class="work">2</span>
<span class="work">3</span>
4
<span class="work">5</span>
<span class="no">nothing</span>
</body>
It prints 1235 and misses out the 4, but I'd like it to print 12345
Simply:
print soup.find('body').text
You could do:
arr = []
# Get all text elements
for i in body[0].find_all(text=True):
    # append to array if it's 'work' element or has no class
    if not i.parent.has_attr("class") or "work" in i.parent["class"]:
        arr.append(i)
This of course only works if one of the following two rules always holds:
- a valid text element is inside a tag with class="work", or
- a valid text element is inside a tag that does not have a class attribute
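Here is a runnable Python 3 sketch of that filter applied to the HTML from the question. It assumes bs4 is installed, uses html.parser, and additionally drops the whitespace-only strings between tags, which the loop above would keep:

```python
from bs4 import BeautifulSoup

html_doc = """<body>
<div class="work">1</div>
<span class="nope">tosee</span>
<span class="work">2</span>
<span class="work">3</span>
4
<span class="work">5</span>
<span class="no">nothing</span>
</body>"""

soup = BeautifulSoup(html_doc, "html.parser")

pieces = []
# string=True selects every text node, in document order.
for node in soup.body.find_all(string=True):
    parent = node.parent
    stripped = node.strip()
    if not stripped:
        continue  # skip the newlines between tags
    # Keep text whose parent has no class, or whose class list contains "work".
    if not parent.has_attr("class") or "work" in parent["class"]:
        pieces.append(stripped)

result = "".join(pieces)
print(result)
```

The bare 4 is kept because its parent is the body tag, which has no class attribute.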
I formatted your html with line breaks to help show why 4 is not printing where you would expect.
You are iterating through the children of <body> and printing the text from any children that are of the class "work". The number 4 does not fit this criterion, because it is the text of <body> itself, not a child with a class of "work".
I don't think BeautifulSoup can decode this particular html as you would expect it to.
One solution would be to parse the html yourself, since this isn't a typical situation. One way might be to use regex to find instances of something like:
</span>(not_blank)<span class="{classregex}">(remember)</span>
Build a dictionary of {remember: not_blank}. Then, as you loop through body.children, check s.text against this dictionary. If it is a key, print the value, then print s.text.
Depending on what the actual html is this might work...