"Expected string or buffer" error using Beautiful Soup - python

I'm trying a code that will pull numbers from a URL using Beautiful Soup, then sum these numbers, but I keep getting an error that looks like this:
Expected string or buffer
I think it's related to the regular expressions, but I can't pinpoint the problem.
import re
import urllib
from BeautifulSoup import *
htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/comments_42.html').read()
soup = BeautifulSoup(htm1)
tags = soup('span')
for tag in tags:
y = re.findall ('([0-9]+)',tag.txt)
print sum(y)

I recommend bs4 instead of BeautifulSoup (which is the old version). You also need to change this line:
y = re.findall ('([0-9]+)',tag)
to something like this:
y = re.findall ('([0-9]+)',tag.text)
See if this gets you further:
sum = 0 #initialize the sum
for tag in tags:
y = re.findall ('([0-9]+)',tag.text) #get the text from the tag
print(y[0]) #y is a list, print the first element of the list
sum += int(y[0]) #convert it to an integer and add it to the sum
print('the sum is: {}'.format(sum))

Related

Find Location of All Numbers with a Comma

I have a been scraping some HTML pages with beautiful soup trying to extract some updated financial data. I only care about numbers that have a comma ie 100,000 or 12,000,000 but not 450 for example. The goal is just to find the location of the comma separated numbers within a string then I need to extract the entire sentence they are in.
I moved the entire scrape to a string list and within that list I want to extract all numbers that have a comma.
url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content)
text = soup.find_all(text = True)
strings = []
for i in range(len(text)):
text_s = str(proxy_text[i])
strings.append(text)
I thought about the follow re code but I am not sure if it will extract all instances.. ie within the list there may be multiple instances of numbers separated by commas.
number = re.sub('[^>0-9,]', "", text)
Any thoughts would be a huge help! Thank you
You can use:
from bs4 import BeautifulSoup
import requests, re
url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
soup = BeautifulSoup(requests.get(url).text, "html5lib")
for el in soup.find_all(True): # loop all element in page
if re.search(r"(?=\d+,\d+).*", el.text):
print(el.text)
# print("END OF ELEMENT\n") # debug only
If you simply want to check if a number has a comma or not, and you want to extract it if it does, then you could try the following.
new = []
for i in text:
if ',' in i:
new.append(i)
This will append all the elements in the 'text' collection that contain a comma, even if the exact same element is repeated multiple times.

scraping B in <span><span> flow text </span> B </span> using BeautifulSoup

I have the following bs4 element tag :
<span><span>some content</span> B</span>
The len of string B is unknown (I named it B for simplification)
How can I use beautifulSoup to extract "b" ? Or I just have as solution to extract the text and then use some regexp techniques
Thanks
Edit : Complete code
def get_doc_yakarouler(license_plate,url = 'https://www.yakarouler.com/car_search/immat?immat='):
response = requests.get(url+license_plate)
content = response.content
doc = BeautifulSoup(content,'html.parser')
result = doc.span.text
if 'identifié' in result :
return doc
else :
return f"La plaque {license_plate} n'est pas recensé sur yakarouler"
doc = get_doc_yakarouler('AA300AA')
span = doc.find_all('span')
motorisation_tag = span[1]
I want to extract "1.6 TDI"
I found solution using : motorisation_tag.text.replace(u'\xa0', ' ').split(' ')[1] but I would like if it is directly possible using bs4
Assuming you have a variable span which represents the outer <span> tag, you can do the following to extract 'B': span.contents[1]. This works because .contents will return a list of the tag's contents, in this case [<span>some content</span>, ' B']. And then you can access the 'B' text as the second element of the array. Be aware that if there is a space before B, like you showed in your HTML sample, the space will be included in the string
from bs4 import BeautifulSoup as bs , NavigableString
html = '<span><span>some content</span> B</span>'
soup = bs(html, 'html.parser')
span = soup.find("span")
# First approach Using Regular Expressions
outer_text_1 = span.find(text=True, recursive=False)
# Second approach is looping through the contents of the tag and check if it's the outer text and not a tag
outer_text_2 = ' '.join([t for t in span.contents if type(t)== NavigableString])
print(outer_text_1) # output B
print(outer_text_2) # output B

How to get all unique HTML tags on a webpage using regular expression?

I have html source code from an HTML page:
import requests
text = requests.get("https://en.wikipedia.org/wiki/Collatz_conjecture").text
What I would like to do is to get a count of the number of unique HTML tags on this page.
For example: <head>, <title>. Closing tags do not count (<head> and </head> would be counted only once).
Yes, I know this is much easier using HTML parsers such as Beautiful Soup but I would like to accomplish this using only Regular Expression.
I've brute force counted this and the answer is in the ballpark of around 60 unique tags.
How would I go about doing this?
I've already tried using re.findall(), to no avail.
Since the answer is around 60, I would like the output to be:
"Number of unique HTML tags: 60"
The following will yield 63 URL's from the url in question
import requests
import re
url = "https://en.wikipedia.org/wiki/Collatz_conjecture"
text = requests.get(url).text
url_pattern = r"((http(s)?://)([\w-]+\.)+[\w-]+[.com]+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)"
# Get all matching patterns of url_pattern
# this will return a list of tuples
# where we are only interested in the first item of the tuple
urls = re.findall(url_pattern, text)
# using list comprehension to get the first item of the tuple,
# and the set function to filter out duplicates
unique_urls = set([x[0] for x in urls])
print(f'Number of unique HTML tags: {len(unique_urls)} found on {url}')
out:
Number of unique HTML tags: 63 found on https://en.wikipedia.org/wiki/Collatz_conjecture
Please! Do not parse a HTML in regex use modules like bs4. But still if you insist Do that as follows:
import requests
import re
url = 'https://en.wikipedia.org/wiki/Collatz_conjecture'
text = requests.get(url).text
tags = re.findall('<[^>]*>',text)
total=[]
for i in range(len(tags)):
total.append(re.match('<[^\s\>]+',tags[i]).group())
total=[elem+'>' for elem in total]
r= re.compile('</[^<]')
unwanted =list(filter(r.match,total))
un=['<!-->','<!--[if>','<!DOCTYPE>','<![endif]-->']
unwanted.extend(un)
final=[x for x in list(set(total)) if x not in set(unwanted)]
print('Number of Unique HTML tags : ',len(final))

Beautiful Soup : How to extract data from HTML Tags from inconsistent data

I wanted to extract the data from tags which is coming in two forms :
<td><div><font> Something else</font></div></td>
and
<td><div><font> Something <br/>else</font></div></td>
I am using .string() method where in the first case it gives me the required string (Something else) but in the second case, it gives me None.
Is there any better way or alternative way to do it?
Try using .text property instead of .string
from bs4 import BeautifulSoup
html1 = '<td><div><font> Something else</font></div></td>'
html2 = '<td><div><font> Something <br/>else</font></div></td>'
if __name__ == '__main__':
soup1 = BeautifulSoup(html1, 'html.parser')
div1 = soup1.select_one('div')
print(div1.text.strip())
soup2 = BeautifulSoup(html2, 'html.parser')
div2 = soup2.select_one('div')
print(div2.text.strip())
which outputs:
Something else
Something else
You can use regular expression always for such things!
import re
result = re.search('font>(.*?)</font', str(scrapped_html))
print(result[1])
This will work your case. To avoid capturing tag, you need to manipulate string.
Check via print("<br/>" in result[1]), if string contains tag then it'll return True, in that case you need to drop the tag.
result = str(result[1]).split("<br/>") this will give you a list [' Something ', 'else'], join them to get your answer.. result = (" ").join(result)
Here is the complete snippet:
import re
result = re.search('font>(.*?)</font', str(scrapped_html))
if "<br/>" in result[1]:
result = str(result[1]).split("<br/>")
result = (" ").join(result)
print(result)
else:
print(result[1])
I understand this is a pretty poor solution, but it'll work for you!

How to web scrape all of the batters names?

I would like to scrape all of the MLB batters stats for 2018. Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml import html
#fetch url/html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml")
content = response.read()
tree = html.fromstring( content )
#parse data
comment_html = tree.xpath('//comment()[contains(., "players_standard_batting")]')[0]
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
tree = html.fromstring( comment_html )
for batter_row in tree.xpath('//table[#id="players_standard_batting"]/tbody/tr[contains(#class, "full_table")]'):
csk = batter_row.xpath('./td[#data-stat="player"]/#csk')[0]
When I scraped all of the batters there is 0.01 attached to each name. I tried to remove attached numbers using the following code:
bat_data = [csk]
string = '0.01'
result = []
for x in bat_data :
if string in x:
substring = x.replace(string,'')
if substring != "":
result.append(substring)
else:
result.append(x)
print(result)
This code removed the number, however, only the last name was printed:
Output:
['Zunino, Mike']
Also, there is a bracket and quotations around the name. The name is also in reverse order.
1) How can I print all of the batters names?
2) How can I remove the quotation marks and brackets?
3) Can I reverse the order of the names so the first name gets printed and then the last name?
The final output I am hoping for would be all of the batters names like so: Mike Zunino.
I am new to this site... I am also new to scraping/coding and will greatly appreciate any help I can get! =)
You can do the same in different ways. Here is one such approach which doesn't require post processing. You get the names how you wanted to get:
from urllib.request import urlopen
from lxml.html import fromstring
url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml"
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)
for batter_row in tree.xpath('//table[contains(#class,"stats_table")]//tr[contains(#class,"full_table")]'):
csk = batter_row.xpath('.//td[#data-stat="player"]/a')[0].text
print(csk)
Output you may get like:
Jose Abreu
Ronald Acuna
Jason Adam
Willy Adames
Austin L. Adams
You get only the last batter because you are overwriting the value of csk each time in your first loop. Initialize the empty list bat_data first and then add each batter to it.
bat_data= []
for batter_row in blah:
csk = blah
bat_data.append(csk)
This will give you a list of all batters, ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01', ...]
Then loop through this list but you don't have to check if string is in the name. Just do x.replace('0.01', '') and then check if the string is empty.
To reverse the order of the names
substring = substring.split(',')
substring.reverse()
nn = " ".join(substring)
Then append nn to the result.
You are getting the quotes and the brackets because you are printing the list. Instead iterate through the list and print each item.
Your code edited assuming you got bat_data correctly:
for x in bat_data :
substring = x.replace(string,'')
if substring != "":
substring = substring.split(',')
substring.reverse()
substring = ' '.join(substring)
result.append(substring)
for x in result:
print(x)
1) Print all batter names
print(result)
This will print everything in the result object. If it’s not printing what you expect then there’s something else wrong going on.
2) Remove quotations
The brackets are due to it being an array object. Try this...
print(result[0])
This will tell the interpreter to print result at the 0 index.
3) Reverse order of names
Try
name = result[0].split(“ “).reverse()[::-1]

Categories