Why won't BeautifulSoup find text in a table in Python?

I'm trying to check whether a numerical value is found in a table. Why would this code not find the numerical text "699" in this table? The print statement gives a value of "None."
html = """
<table>
December 31, 1997 1996 1995 1994 1993
Allowance for credit losses--loans 699 773
Allowance for credit losses--
trading assets 285 190
Allowance for credit losses--
other liabilities 13 10
- --------------------------------------------------------------------------------
Total $ 997 $ 973 $ 992 $1,252 $1,324
================================================================================
</table>
"""
soup = BeautifulSoup(''.join(html))
table = soup.find('table')
test = table.find(text='699')
print test

table.find() will search the tags inside the table, but there are no tags inside this table. There is just a single string, which happens to be an ASCII-art table that is in no way formatted as HTML.
If you want to use BeautifulSoup to parse the table, you need to convert it into a real HTML table first. Otherwise, you can use table.string to get the string itself and parse it with a regex.

If you pass a string as an argument into a Beautiful Soup find() method, Beautiful Soup looks for that exact string. Passing in text='699' will match the string "699", but not a longer string that merely contains "699".
To find strings that contain a substring, you can use a regular expression or a custom function:
import re
table.find(text=re.compile('699'))
table.find(text=lambda x: '699' in x)
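If you go the regex route the answer suggests instead, a stdlib-only sketch (with the table shortened here for brevity) might look like this:

```python
import re

# Shortened version of the table from the question.
html = """<table>
Allowance for credit losses--loans 699 773
Total $ 997 $ 973
</table>"""

# Grab the raw text between the table tags, then search it directly.
body = re.search(r'<table>(.*?)</table>', html, re.S).group(1)
found = re.search(r'\b699\b', body) is not None
```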

Related

Python3 - Extract the text from a bs4.element.Tag and add to a dictionary

I am scraping a website which returns a bs4.element.Tag similar to the following:
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
I am trying to extract just the text from this block and add it to a dictionary. All of the examples I am seeing on the forum include some sort of common element like an 'id' or similar. I am not an HTML guy, so I may be using incorrect terms.
What I would like to do is get the text ("four door", "inline 4 engine", etc.) and add the values to a dictionary whose key is a pre-designated variable, car_model:
cars = {'528i':['four door', 'inline 4 engine']}
I can't figure out a universal way to pull out the text because there may be more or fewer span classes with different text. Thanks for your help!
You need to loop through all the matching elements with a CSS selector and extract the text value from each of them.
A selector is a path to the elements you want. In this case the selector is .attributes-value span, where .attributes-value matches the outer element by its class and span matches the tags nested inside it.
The get_text() method returns the content between an element's opening and closing tags, which is exactly what you need.
I also recommend the lxml parser because it will speed up your code.
The full code is attached below:
from bs4 import BeautifulSoup

html = '''
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
'''
soup = BeautifulSoup(html, 'lxml')
cars = {'528i': []}
for span in soup.select(".attributes-value span"):
    cars['528i'].append(span.get_text())
print(cars)
Output:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
You can use:
from collections import defaultdict

out = defaultdict(list)
soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.select(".attributes-value span"):
    out["528i"].append(tag.text)
print(dict(out))
Prints:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
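If installing Beautiful Soup is not an option, the same extraction can be sketched with the stdlib html.parser module; the class and variable names below are illustrative, not from the original question:

```python
from html.parser import HTMLParser

class SpanTextCollector(HTMLParser):
    # Collect the text of the inner spans nested inside the outer span.
    def __init__(self):
        super().__init__()
        self.depth = 0   # current <span> nesting depth
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        # depth >= 2 means we are inside one of the inner spans
        if self.depth >= 2 and data.strip():
            self.texts.append(data.strip())

parser = SpanTextCollector()
parser.feed('<span class="attributes-value">'
            '<span class="four-door">four door</span>'
            '<span class="inline-4-engine">inline 4 engine</span>'
            '<span class="24-gallons-per-mile">24 gallons per mile</span>'
            '</span>')
cars = {'528i': parser.texts}
```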

Pulling Specific Text from a Part of a Website

I'm new on web scraping and BeautifulSoup. I'm making a currency converter via using a site. I use this code to pull currency rate:
import requests
from bs4 import BeautifulSoup
from_ = input("WHICH CURRENCY DO YOU WANT TO CONVERT: ").upper()
to = input("WHICH CURRENCY DO YOU WANT TO CONVERT TO: ").upper()
url = requests.get(f'https://www.xe.com/currencyconverter/convert/?Amount=1&From={from_}&To={to}').text
soup = BeautifulSoup(url, 'lxml')
currency = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod').getText()
print(currency)
This works, but it returns the full text (e.g. "0.84311378 Euros"). I want to pull only the numbers marked in red in the picture:
Since the number is always the first element inside this tag, an easy way could be:
currency_tag = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod')
print(next(iter(currency_tag)))
And result:
0.84
You can also use .contents and get the first item from it.
currency = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod').contents
print(currency[0].strip())
0.84
From what I can see, the string you highlighted in the picture is the first four characters of the resulting rate.
This means that, as long as the rate keeps this format, the part marked in red is always the four leading characters.
We can pull out the information you need by taking a substring of the paragraph’s text. Just replace the last line you provided with:
print(currency[0:4])
This will always return a string containing exactly the characters you are looking for.
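Note that a fixed-length slice silently breaks if the rate ever has a different number of digits before the decimal point (e.g. a JPY rate like 108.64). A more defensive alternative is a regex that grabs the leading digits plus two decimals; the sample string below is taken from the question:

```python
import re

text = "0.84311378 Euros"  # full get_text() result from the question
m = re.match(r"\d+\.\d{2}", text)  # leading digits plus two decimals
rate = m.group() if m else None
```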

get text in div after specific character [xpath]

I am using XPath to scrape a website. I have been able to access most of the information I need, except for the date. The date is text in a div, formatted as below.
October 13, 2018 / 1:31 AM / Updated 5 hours ago
I just want to get the date, not the time or anything else. However, with my current code, I am getting the entire text in the div. My code is below.
item['datePublished'] = response.xpath("//div[contains(@class, 'ArticleHeader_date') and substring-before(., '/')]/text()").extract()
As hinted, there are ways to do this in XPath 2.0+. However, this should be done in the host language.
One way is to extract the date using a regex after the value has been retrieved, e.g. Regex Demo
\w+\ \d\d?,\ \d{4}
Code Sample:
import re

regex = r"\w+\ \d\d?,\ \d{4}"
test_str = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"
matches = re.search(regex, test_str)
if matches:
    print(matches.group())
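If the goal is an actual date object rather than just the matched string, the regex result can be fed to datetime.strptime (this sketch assumes English month names, i.e. the default C locale for %B):

```python
import re
from datetime import datetime

text = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"
m = re.search(r"\w+ \d{1,2}, \d{4}", text)
# %B parses the full month name, %d the day, %Y the four-digit year
published = datetime.strptime(m.group(), "%B %d, %Y").date()
```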

Why does bs4 return tags and then an empty list to this find_all() method?

Looking at US Census QFD I'm trying to grab the race % by county. The loop I'm building is outside the scope of my question, which concerns this code:
url = 'http://quickfacts.census.gov/qfd/states/48/48507.html'
#last county in TX; for some reason the qfd #'s counties w/ only odd numbers
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
c_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[0] #c = county %
s_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[1] #s = state %
Which grabs the html element including its tags, not just the text within it:
c_black_alone, s_black_alone
(<td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>,
<td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>)
Above ^, I only want the %'s inside the elements...
Furthermore, why does
test_black = soup.find_all("td", text = "Black")
not return the same element as above (or its text), but instead returns an empty bs4 ResultSet object? (Edit: I have been following along with the documentation, so I hope this question doesn't seem too vague...)
To get the text from those matches, use .text to get all contained text:
>>> soup.find_all("td", attrs={'headers':'rp9'})[0].text
u'96.9%'
>>> soup.find_all("td", attrs={'headers':'rp9'})[1].text
u'80.3%'
Your text search doesn't match anything for two reasons:
A literal string argument only matches the whole contained text, not a substring. It will only match an element whose entire content is exactly that string, e.g. <td>Black</td>.
The search uses the element's .string property, but that property is only set when the string is the element's only child. If other children are present (even a comment), .string is None and the search fails entirely.
The way around this is to use a lambda instead; it is passed each whole element, so you can test it however you like:
soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)
Demo:
>>> soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)
[<td id="rp10" valign="top">Black or African American alone, percent, 2013 (a) <!-- RHI225213 --> </td>, <td id="re6" valign="top">Black-owned firms, percent, 2007 <!-- SBO315207 --> </td>]
Both of these matches have a comment in the <td> element, making a search with a text match ineffective.

Parsing fixed-format data embedded in HTML in python

I am using google's appengine api
from google.appengine.api import urlfetch
to fetch a webpage. The result of
result = urlfetch.fetch("http://www.example.com/index.html")
is a string of the html content (in result.content). The problem is the data that I want to parse is not really in HTML form, so I don't think using a python HTML parser will work for me. I need to parse all of the plain text in the body of the html document. The only problem is that urlfetch returns a single string of the entire HTML document, removing all newlines and extra spaces.
EDIT:
Okay, I tried fetching a different URL and apparently urlfetch does not strip the newlines, it was the original webpage I was trying to parse that served the HTML file that way...
END EDIT
If the document is something like this:
<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A 288 AAA
</body></html>
result.content will be this, after urlfetch fetches it:
'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987 332 2009-01-02 JSE...A4A 288 AAA</body></html>'
Using an HTML parser will not help me with the data between the body tags, so I was going to use regular expresions to parse my data, but as you can see the last part of one line gets combined with the first part of the next line, and I don't know how to split it. I tried
result.content.split('\n')
and
result.content.split('\r')
but the resulting list was all just 1 element. I don't see any options in google's urlfetch function to not remove newlines.
Any ideas how I can parse this data? Maybe I need to fetch it differently?
Thanks in advance!
I understand that the format of the document is the one you have posted. In that case, I agree that a parser like Beautiful Soup may not be a good solution.
I assume that you are already getting the interesting data (between the BODY tags) with a regular expression like
import re

data = re.findall(r'<body>([^<]*)</body>', result.content)[0]
then, it should be as easy as:
start = 0
end = 5
while end < len(data):
    print data[start:end]
    start = end
    end = end + 5
print data[start:]
(note: I did not check this code against boundary cases. It is only here to show the generic idea)
The only suggestion I can think of is to parse it as fixed-width columns; newlines are not significant in HTML anyway.
If you have control of the source data, put it into a text file rather than HTML.
Once you have the body text as a single, long string, you can break it up as follows.
This presumes that each record is 26 characters.
body = "AAA 123 888 2008-10-30 ABCBBB 987 332 2009-01-02 JSE...A4A 288 AAA"
for i in range(0, len(body), 26):
    line = body[i:i+26]
    # parse the line
EDIT: Reading comprehension is a desirable thing. I missed the bit about the lines being run together with no separator between them, which would kinda be the whole point of this, wouldn't it? So, nevermind my answer, it's not actually relevant.
If you know that each line is 5 space-separated columns, then (once you've stripped out the html) you could do something like (untested):
def generate_lines(datastring):
    while datastring:
        splitresult = datastring.split(' ', 5)
        if len(splitresult) > 5:
            datastring = splitresult[5]
        else:
            datastring = None
        yield splitresult[:5]

for line in generate_lines(data):
    process_data_line(line)
Of course, you can change the split character and number of columns as needed (possibly even passing them into the generator function as additional parameters), and add error handling as appropriate.
Further suggestions for splitting the string s into 26-character blocks:
As a list:
>>> s = "AAA 123 888 2008-10-30 ABCBBB 987 332 2009-01-02 JSE"
>>> [s[x:x+26] for x in range(0, len(s), 26)]
['AAA 123 888 2008-10-30 ABC',
 'BBB 987 332 2009-01-02 JSE']
As a generator:
>>> for line in (s[x:x+26] for x in range(0, len(s), 26)): print line
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
Replace range() with xrange() in Python 2.x if s is very long.
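Putting the fixed-width idea together in one sketch: slice the run-together string into records, then split each record into fields. The 26-character record width is assumed from the sample data, and the function name is illustrative:

```python
def parse_fixed(body, width=26):
    # Slice the run-together string into fixed-width records,
    # then split each record into whitespace-separated fields.
    return [body[i:i + width].split() for i in range(0, len(body), width)]

# Two 26-character records from the sample data (with the "332" field present).
body = "AAA 123 888 2008-10-30 ABCBBB 987 332 2009-01-02 JSE"
rows = parse_fixed(body)
```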
