I have the following bs4 element tag :
<span><span>some content</span> B</span>
The length of string B is unknown (I named it B for simplification).
How can I use BeautifulSoup to extract "B"? Or is my only option to extract the full text and then use a regular expression?
Thanks
Edit : Complete code
import requests
from bs4 import BeautifulSoup

def get_doc_yakarouler(license_plate, url='https://www.yakarouler.com/car_search/immat?immat='):
    response = requests.get(url + license_plate)
    content = response.content
    doc = BeautifulSoup(content, 'html.parser')
    result = doc.span.text
    if 'identifié' in result:
        return doc
    else:
        return f"La plaque {license_plate} n'est pas recensé sur yakarouler"

doc = get_doc_yakarouler('AA300AA')
span = doc.find_all('span')
motorisation_tag = span[1]
I want to extract "1.6 TDI".
I found a solution using motorisation_tag.text.replace(u'\xa0', ' ').split(' ')[1], but I would like to know whether this is possible directly with bs4.
Assuming you have a variable span which represents the outer <span> tag, you can extract 'B' with span.contents[1]. This works because .contents returns a list of the tag's children, in this case [<span>some content</span>, ' B'], so the 'B' text is the second element of the list. Be aware that if there is a space before B, as in your HTML sample, the space will be included in the string; call .strip() on the result to remove it.
from bs4 import BeautifulSoup as bs, NavigableString
html = '<span><span>some content</span> B</span>'
soup = bs(html, 'html.parser')
span = soup.find("span")
# First approach: find the first string that is a direct child of the tag
# (recursive=False skips nested tags; older bs4 versions call the argument text=)
outer_text_1 = span.find(string=True, recursive=False)
# Second approach: loop through the tag's contents and keep only the text nodes, not the tags
outer_text_2 = ' '.join([t for t in span.contents if isinstance(t, NavigableString)])
print(outer_text_1)  # prints " B" (the leading space is preserved)
print(outer_text_2)  # prints " B"
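For the "1.6 TDI" case from the question, the same idea could be wrapped in a small helper. This is only a sketch: the HTML below is a made-up stand-in for the real page (its actual markup is not shown in the question), and direct_text is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def direct_text(tag):
    """Return the text sitting directly inside `tag`, ignoring nested tags."""
    parts = [str(node) for node in tag.children if isinstance(node, NavigableString)]
    # str.strip() also removes non-breaking spaces (\xa0), so no replace() is needed
    return "".join(parts).strip()


# Hypothetical markup: a label in an inner span, then \xa0 and the value
html = "<span><span>Motorisation :</span>\xa01.6 TDI</span>"
outer = BeautifulSoup(html, "html.parser").find("span")
print(direct_text(outer))
```

Note that isinstance also accepts NavigableString subclasses such as comments; if the real markup contains those, a stricter check may be needed.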
When I run the code below,
link = f"https://www.ambalajstore.com/kategori/bardak-tabak?siralama=fiyat:asc&stoktakiler=1&tp=1"
response = requests.get(link)
html_icerigi = response.content
corba = BeautifulSoup(html_icerigi,"html.parser")
for a in corba.find_all("div",{"class":"paginate-content"}):
    x = corba.find_all("div",{"class":"paginate-content"})
    print(x)
I get results:
[<div class="paginate-content">
<a class="paginate-element-active" href="javascript:void(0);">1</a>
2
3
4
..
13
</div>]
What I need is just the last number (13) on the last line.
Can you help me with how to do this?
You can do it like this
corba.find("div",{"class":"paginate-content"}).find_all('a')[-1].text
This will give you the text content of the last item (13 in your case).
Since x is a list containing a single div, you can get the last number from its first element:
x[0].find_all('a')[-1].text
You should also handle the case where no anchor tag is found.
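Handling the missing-anchor case might look like the sketch below. The sample markup is an assumption (the page's real anchors are not visible in the printed output above), and last_page_number is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def last_page_number(soup):
    """Return the last pagination number as an int, or None if it cannot be found."""
    container = soup.find("div", {"class": "paginate-content"})
    if container is None:
        return None  # no pagination block on the page
    links = container.find_all("a")
    if not links:
        return None  # block exists but holds no anchors
    label = links[-1].get_text(strip=True)
    return int(label) if label.isdigit() else None


# Assumed structure: one <a> per page number
sample = (
    '<div class="paginate-content">'
    '<a href="#">1</a><a href="#">2</a><a href="#">13</a>'
    "</div>"
)
print(last_page_number(BeautifulSoup(sample, "html.parser")))
print(last_page_number(BeautifulSoup("<div></div>", "html.parser")))
```

Returning None rather than raising keeps the caller's loop simple when a category has only one page.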
There are different possible approaches to scrape the text of your element.
Using a CSS selector that targets the last element of its type:
corba.select_one('.paginate-content a:last-of-type').text
Picking the last element by its list index:
corba.find('div',{'class':'paginate-content'}).find_all('a')[-1].text
Example
from bs4 import BeautifulSoup
import requests
url = 'https://www.ambalajstore.com/kategori/bardak-tabak?siralama=fiyat:asc&stoktakiler=1&tp=1'
req = requests.get(url)
corba = BeautifulSoup(req.content, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
corba.select_one('.paginate-content a:last-of-type').text
Output
13
I have an html document like this: https://dropmefiles.com/wezmb
So I need to extract the text inside the <span id="1"> ... </span> tags, but I don't know how.
I tried writing this code:
from bs4 import BeautifulSoup
with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp, features="html.parser")
for a in soup.find_all('span'):
    print(a.string)
But it extracts the text from all 'span' tags. So, how can I extract only the text inside the <span id="1"> ... </span> tags in Python?
What you need is the .contents attribute (see the documentation).
Find the span <span id="1"> ... </span> using
for x in soup.find(id="1").contents:
    print(x)
OR
x = soup.find(id="1").contents[0]  # find() returns the single element with id 1
print(x)
This will give you:

10

that is, an empty line followed by 10 followed by another empty line. This is because the string in the HTML really contains those line breaks: as you can see in the HTML, 10 sits on its own line.
The string is, correctly, '\n10\n'.
If you want just x = '10' from x = '\n10\n', you can do x = x[1:-1], since '\n' is a single character; x.strip() gives the same result and also copes with extra whitespace. Hope this helped.
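A minimal sketch of the lookup plus whitespace cleanup, assuming markup shaped like the description above (the real file is not reproduced here, so the HTML string is invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented stand-in for the document: the value sits on its own line inside the span
html = '<p><span id="1">\n10\n</span><span id="2">20</span></p>'
soup = BeautifulSoup(html, "html.parser")

target = soup.find("span", id="1")
if target is None:
    raise SystemExit("no <span id='1'> in this document")

raw = target.contents[0]              # the text node, line breaks included
clean = target.get_text(strip=True)   # same text with surrounding whitespace removed

print(repr(str(raw)))
print(clean)
```

get_text(strip=True) is used here instead of slicing so the result stays correct if the amount of surrounding whitespace changes.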
I'm trying to extract questions from various AMC tests. Consider https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1 for example. To get the problem text, I just need the regular string text in the first <p> element and the LaTeX in the alt attribute of the <img> inside that same <p> element.
My code so far:
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latex_equation = soup.select('p img')[0].get('alt')
It works for the LaTeX equation, but there are more parts of the question before it, in double quotes. Is there a way to get the other part of the question, which is "What is the value of"? I'm thinking of using a regex, but I want to see whether Beautiful Soup has a feature that can get it for me.
Try using zip():
import requests
from bs4 import BeautifulSoup
URL = "https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select_one(".mw-parser-output p"), soup.select("p img")):
    print(text, tag.get("alt"))
    break
Output:
What is the value of $\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$
Edit:
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select(".mw-parser-output p"), soup.select("p img")):
    print(text.text.strip(), tag.get("alt"))
BS4 seems a bit awkward for this, and it took me a while to get something working. I don't think it is viable with the odd spacing in this markup, so a regex would be your best option. Let me know if this is good. I checked it on the first 2 questions and it worked fine. The AMC does have some geometry problems that rely on images, however, so I don't think it will work for those.
import bs4
import requests
import re
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    mo = latex_reg.search(i)
    if mo:
        elements[n] = mo.group(1)
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    if elements[n][0] == "$":
        elements[n] = " "+elements[n]+" "
print(elements)
print("".join(elements))
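Another option, sketched here as an alternative to both answers, is to walk the children of the <p> element in order and take text nodes as they are and the alt attribute from each <img>. The paragraph below is a simplified, invented stand-in for the wiki markup, and paragraph_with_latex is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def paragraph_with_latex(p):
    """Join a paragraph's text nodes with the alt text of its images, in document order."""
    pieces = []
    for child in p.children:
        if isinstance(child, NavigableString):
            pieces.append(str(child))
        elif isinstance(child, Tag) and child.name == "img":
            pieces.append(child.get("alt", ""))  # the wiki stores LaTeX in alt
        elif isinstance(child, Tag):
            pieces.append(child.get_text())      # e.g. <b> or <i> wrappers
    return "".join(pieces).strip()


# Invented sample: plain text interleaved with LaTeX images
html = '<p>What is the value of <img alt="$x+1$" src="a.png"/> when <img alt="$x=2$" src="b.png"/>?</p>'
p = BeautifulSoup(html, "html.parser").find("p")
print(paragraph_with_latex(p))
```

This keeps the text and the formulas in their original order, which pairing them with zip() does not guarantee when a paragraph has several images.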
I'm writing code that pulls numbers from a URL using Beautiful Soup and then sums them, but I keep getting an error that looks like this:
Expected string or buffer
I think it's related to the regular expressions, but I can't pinpoint the problem.
import re
import urllib
from BeautifulSoup import *
htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/comments_42.html').read()
soup = BeautifulSoup(htm1)
tags = soup('span')
for tag in tags:
    y = re.findall ('([0-9]+)',tag.txt)
print sum(y)
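I recommend bs4 instead of the BeautifulSoup package (which is the old version). You also need to change this line:
y = re.findall ('([0-9]+)',tag.txt)
to something like this:
y = re.findall ('([0-9]+)',tag.text)
tag.txt is not a text attribute: Beautiful Soup treats it as a lookup for a child tag named txt and returns None, which is why re.findall complains that it expected a string.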
See if this gets you further:
total = 0  # initialize the sum (named total so the built-in sum() is not shadowed)
for tag in tags:
    y = re.findall('([0-9]+)', tag.text)  # get the text from the tag
    print(y[0])  # y is a list; print its first element
    total += int(y[0])  # convert it to an integer and add it to the sum
print('the sum is: {}'.format(total))
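For reference, a self-contained Python 3 sketch of the whole task with bs4. The table below is invented sample data standing in for the course page, and unlike the loop above it also tolerates spans that contain no digits.

```python
import re
from bs4 import BeautifulSoup

# Invented sample data: one number per <span>, plus one span without digits
html = """
<table>
  <tr><td>Alice</td><td><span class="comments">97</span></td></tr>
  <tr><td>Bob</td><td><span class="comments">3</span></td></tr>
  <tr><td>Carol</td><td><span class="comments">no count</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
total = 0
for tag in soup.find_all("span"):
    # findall returns a (possibly empty) list of digit strings for this span
    for number in re.findall(r"[0-9]+", tag.get_text()):
        total += int(number)

print("the sum is: {}".format(total))
```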
I am trying to sort through HTML tags and I can't seem to get it right.
What I have done so far
import urllib
import re
s = raw_input('Enter URL: ')
f = urllib.urlopen(s)
s = f.read()
f.close()
r = re.compile('<TAG\b[^>]*>(.*?)</TAG>',)
result = re.findall(r, s)
print(result)
Where I replace "TAG" with the tag I want to see.
Thanks in advance.
You should really try using libraries which can perform HTML parsing out of the box. Beautiful Soup is one of my favorites.
An example with Beautiful Soup (shown here with the current bs4 package):
from bs4 import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
soup.find_all('b')
# [<b>one</b>, <b>two</b>]
As for a regular expression, you can use
import re
aa = doc[0]
aa
# '<html><head><title>Page title</title></head>'
pt = re.compile('(?<=<title>).*?(?=</title>)')
re.findall(pt, aa)
# ['Page title']
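One detail worth checking in patterns like the one in the question: in a normal (non-raw) Python string, \b is the backspace character, not a regex word boundary, so the pattern silently matches nothing. A small sketch, using an invented HTML snippet and the b tag as the example:

```python
import re

html = '<b class="x">one</b> and <b>two</b>'

# Non-raw string: '\b' becomes a literal backspace character (\x08) in the pattern
non_raw = re.compile('<b\b[^>]*>(.*?)</b>')
# Raw string: '\b' reaches the regex engine as a word boundary
raw = re.compile(r'<b\b[^>]*>(.*?)</b>')

print(non_raw.findall(html))  # []
print(raw.findall(html))      # ['one', 'two']
```

Writing regex patterns as raw strings (the r prefix) avoids this class of mistake.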
I'm not entirely clear on what you are trying to achieve with the regex. Capturing the contents between two div tags for instance works with
re.compile("<div.*?>.*?</div>")
Note, however, that the pattern above runs into problems with nested divs.
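The nested-div problem can be seen in a short sketch. The HTML string is invented for illustration, and the comparison uses bs4 with the built-in html.parser.

```python
import re
from bs4 import BeautifulSoup

html = "<div>outer <div>inner</div> tail</div>"

# The non-greedy pattern stops at the first closing tag, so the outer div is cut short
regex_matches = re.findall(r"<div.*?>.*?</div>", html)
print(regex_matches)  # ['<div>outer <div>inner</div>']

# A parser tracks nesting, so the outer div keeps all of its content
outer_text = BeautifulSoup(html, "html.parser").find("div").get_text()
print(outer_text)     # outer inner tail
```

The regex result drops " tail" and the final closing tag, which is the kind of breakage a real parser avoids.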