I am doing some Python work and couldn't find the answer to this question, so I'm hoping someone can help.
I am using findAll in Python and getting two numbers as text output. However, I only want the second number and not the first. How do I target the second number?
This is my code:
product_price_container_after = container.findAll("div",{"class":"discounted"})
product_price_after = product_price_container_after[0].text
print(product_price_after)
This is where I am trying to get it from:
<div class="col search_price discounted responsive_secondrow">
<span style="color: #888888;"><strike>59,98€</strike></span><br/>19,99€
</div>
So the output is:
59,98€19,99€
How do I only get 19,99€ ?
Thanks for the help.
You can use decompose() or extract() methods to remove elements from the tree.
discountedDivs = container.findAll("div", {"class": "discounted"})
for discountedDiv in discountedDivs:
    discountedDiv.find("span").extract()
    print(discountedDiv.text.strip())  # prints 19,99€
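For anyone who wants to try this without hitting the site, here is a minimal sketch of the same idea run against the markup from the question. It uses decompose() instead of extract(); the variable names (html, discounted_prices, old_price) are only illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; no network request is made.
html = """
<div class="col search_price discounted responsive_secondrow">
<span style="color: #888888;"><strike>59,98€</strike></span><br/>19,99€
</div>
"""

soup = BeautifulSoup(html, "html.parser")

discounted_prices = []
for div in soup.find_all("div", class_="discounted"):
    old_price = div.find("span")
    if old_price is not None:
        # Remove the struck-through original price from the tree.
        old_price.decompose()
    # strip=True drops the newlines that surround the remaining price.
    discounted_prices.append(div.get_text(strip=True))

print(discounted_prices)
```

Note that both decompose() and extract() modify the parsed tree, so the old price is gone from soup afterwards.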
Sorry, but I can't reproduce your code; it's incomplete.
Try this though:
product_price_after = product_price_container_after[1].text
You could use stripped strings
import requests
from bs4 import BeautifulSoup as bs
res = requests.get('https://store.steampowered.com/search/?specials=1&page=1')
soup = bs(res.content, 'lxml')
prices = soup.select('.discounted')
for price in prices:
    if price.text is not None:
        strings = [string for string in price.stripped_strings]
        print(strings[1])
Or next_sibling:
for price in prices:
    if price.text is not None:
        print(price.find('br').next_sibling)
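Both ideas can be checked offline against the snippet from the question. This is only a sketch: it assumes the discounted price is always the last text node in the div and that it directly follows the br tag, which may not hold for every entry on the live page.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; no network request is made.
html = (
    '<div class="col search_price discounted responsive_secondrow">\n'
    '<span style="color: #888888;"><strike>59,98€</strike></span><br/>19,99€\n'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.select_one(".discounted")

# Approach 1: stripped_strings yields every non-empty text node, whitespace removed.
last_string = list(div.stripped_strings)[-1]

# Approach 2: the text node that immediately follows <br/>.
br = div.find("br")
after_br = br.next_sibling.strip() if br is not None and br.next_sibling else None

print(last_string, after_br)
```

Taking the last string rather than index 1 also copes with a div that has no old price at all.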
Related
I'd like some help figuring this out: I want to scrape the text value of the salary (Confidential) using Beautiful Soup.
import requests
from bs4 import BeautifulSoup
result=requests.get("https://wuzzuf.net/jobs/p/WBHqaf7WeZYe-Senior-Python-Developer-Trufla-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3")
src=result.content
soup = BeautifulSoup(src,"lxml")
After this I used
salary = soup.find_all("span", {"class": "css-4xky9y"})
but it returns an empty list.
ـــــــــــــــــــــــــــــــــــــــــــــــــــ
import requests
from bs4 import BeautifulSoup
result = requests.get("https://wuzzuf.net/jobs/p/WBHqaf7WeZYe-Senior-Python-Developer-Trufla-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3")
src = result.content
soup = BeautifulSoup(src, "lxml")
salary = soup.find("div", {'id': 'app'})
salary_text = salary.contents[0]
h = salary_text.contents[4]
print(h)
When I print h, it gives me the value.
Please help me find the text value of the salary.
I have been trying for the past 5 days using the approaches mentioned above.
If the salary information is in a <span> tag, you can use a code snippet like this:
salary = soup.find("span", {"class": "salary"}).text
If this doesn't work, or you are unable to find the specific tag and class, you can always use the find_all() method to search for all <span> tags, then filter the resulting list to find the tag that contains the salary information.
for span in soup.find_all("span"):
    if span is not None and \
       span.text is not None and \
       "salary" in span.text:
        salary = span.text
        break
else:
    salary = None # or np.nan
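Here is a small offline sketch of that filtering idea. The markup is hypothetical (I don't know the real page's tags or class names), and the helper find_value_after_label is a made-up name. It assumes the label and its value sit in neighbouring span tags and matches the label case-insensitively.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's structure and class names are not known here.
html = """
<div id="app">
  <span class="label">Experience Needed:</span><span class="value">3 to 5 years</span>
  <span class="label">Salary:</span><span class="value">Confidential</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def find_value_after_label(soup, label):
    """Return the text of the <span> that follows the first <span> containing label."""
    for span in soup.find_all("span"):
        if label.lower() in span.get_text().lower():
            value = span.find_next_sibling("span")
            return value.get_text(strip=True) if value is not None else None
    return None  # no span mentions the label


salary = find_value_after_label(soup, "salary")
print(salary)
```

Searching by visible label text tends to survive redesigns better than generated class names such as css-4xky9y, but it only works if the text is actually present in the HTML that requests downloads.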
I'm trying to extract questions from various AMC tests. Consider https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1 for example. To get the problem text, I just need the regular string text in the first <p> element and the latex in the <img> in the first <p> element.
My code so far:
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latex_equation = soup.select('p img')[0].get('alt')
It works for getting the LaTeX equation, but there are more parts of the question before it. Is there a way to get the other part of the question, which is "What is the value of"? I'm thinking of using a regex, but I want to see if Beautiful Soup has a feature that can get it for me.
Try using zip():
import requests
from bs4 import BeautifulSoup
URL = "https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select_one(".mw-parser-output p"), soup.select("p img")):
    print(text, tag.get("alt"))
    break
Output:
What is the value of $\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$
Edit:
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select(".mw-parser-output p"), soup.select("p img")):
    print(text.text.strip(), tag.get("alt"))
Well BS4 seems to be a bit buggy. Took me a while to get this. Don't think that it is viable with these weird spacings and everything. A RegEx would be your best option. Let me know if this is good. Checked on the first 2 questions and they worked fine. The AMC does have some image problems with geometry, however, so I don't think it will work for those.
import bs4
import requests
import re
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    mo = latex_reg.search(i)
    if mo:
        elements[n] = mo.group(1)
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    if elements[n][0] == "$":
        elements[n] = " "+elements[n]+" "
print(elements)
print("".join(elements))
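As an alternative to the regex, one could walk the paragraph's children in order, keeping text nodes as they are and swapping each img for its alt attribute. The sketch below runs on a hypothetical, simplified paragraph rather than the live wiki page, and the helper name paragraph_with_latex is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical, simplified stand-in for a problem paragraph on the wiki.
html = r'<div class="mw-parser-output"><p>What is the value of <img src="a.png" class="latex" alt="$\frac{1}{2}$" /> when <img src="b.png" class="latex" alt="$a=2$" />?</p></div>'


def paragraph_with_latex(p):
    """Join a paragraph's text with the alt text of its images, in document order."""
    parts = []
    for node in p.children:
        if isinstance(node, NavigableString):
            parts.append(str(node))            # plain text between images
        elif isinstance(node, Tag) and node.name == "img":
            parts.append(node.get("alt", ""))  # LaTeX source stored in alt
        else:
            parts.append(node.get_text())      # any other inline tag, e.g. <b>
    # Collapse runs of whitespace left over from the page layout.
    return " ".join("".join(parts).split())


soup = BeautifulSoup(html, "html.parser")
question = paragraph_with_latex(soup.select_one(".mw-parser-output p"))
print(question)
```

This keeps the text and the formulas interleaved in their original order, which zip() only does when text and images strictly alternate.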
I have the following bs4 element tag :
<span><span>some content</span> B</span>
The length of string B is unknown (I named it B for simplification).
How can I use BeautifulSoup to extract "B"? Or is my only solution to extract the text and then use some regex techniques?
Thanks
Edit : Complete code
def get_doc_yakarouler(license_plate, url='https://www.yakarouler.com/car_search/immat?immat='):
    response = requests.get(url + license_plate)
    content = response.content
    doc = BeautifulSoup(content, 'html.parser')
    result = doc.span.text
    if 'identifié' in result:
        return doc
    else:
        return f"La plaque {license_plate} n'est pas recensé sur yakarouler"
doc = get_doc_yakarouler('AA300AA')
span = doc.find_all('span')
motorisation_tag = span[1]
I want to extract "1.6 TDI"
I found a solution using motorisation_tag.text.replace(u'\xa0', ' ').split(' ')[1], but I would like to know whether it is possible directly with bs4.
Assuming you have a variable span which represents the outer <span> tag, you can do the following to extract 'B': span.contents[1]. This works because .contents will return a list of the tag's contents, in this case [<span>some content</span>, ' B']. And then you can access the 'B' text as the second element of the array. Be aware that if there is a space before B, like you showed in your HTML sample, the space will be included in the string
from bs4 import BeautifulSoup as bs , NavigableString
html = '<span><span>some content</span> B</span>'
soup = bs(html, 'html.parser')
span = soup.find("span")
# First approach: find the first text node that is a direct child of the tag
outer_text_1 = span.find(string=True, recursive=False)
# Second approach: loop through the tag's contents and keep only the text nodes, not the tags
outer_text_2 = ' '.join([t for t in span.contents if isinstance(t, NavigableString)])
print(outer_text_1)  # output: " B" (including the leading space)
print(outer_text_2)  # output: " B" (including the leading space)
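Applied to the "1.6 TDI" case from the question, a sketch could look like the following. The markup is an assumption: I am guessing that the value follows the inner span and is preceded by a non-breaking space, as the \xa0 in the question's workaround suggests.

```python
from bs4 import BeautifulSoup

# Assumed shape of the tag from the question; the real page may differ.
html = "<span><span>some content</span>\xa01.6 TDI</span>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")

# Collect only the text nodes that are direct children of the outer <span>.
direct_text = "".join(outer.find_all(string=True, recursive=False))

# Normalise the non-breaking space and trim the result.
motorisation = direct_text.replace("\xa0", " ").strip()
print(motorisation)
```

Because recursive=False ignores the inner span entirely, no split() on the combined text is needed.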
I have a really silly and annoying problem: I am trying to convert HTML into Markdown, but my HTML is badly formatted. I keep getting things like this:
<strong>Ihre Aufgaben:<br/></strong>
or
<strong> <br/>Über die XXXX GmbH:<br/></strong>
which is totally valid HTML.
However my library to convert to Markdown (HTML2Text) converts it to:
**Ihre Aufgaben:\n**
and
** \nÜber die XXXX GmbH:\n**
which is an already reported issue, because the Markdown is then invalid and cannot be rendered properly.
My approach to this problem was the following:
Use BeautifulSoup to find all the strong that lead to this problem
Classify the <br/> into 2 groups: the ones coming before the text and the ones coming after the text.
Unwrap the ones after the text in order to push them out of the <strong>
My code (not well formatted yet):
soup = BeautifulSoup(html)
emphased = soup.find_all('strong')
for single in emphased:
    children = single.children
    before = 0
    foundText = None
    after = 0
    for child in children:
        if not isinstance(child, NavigableString):
            if foundText:
                after += 1
                child.unwrap()
            else:
                before += 1
                # DOES NOT WORK
                child.unwrap()
        else:
            foundText = single.get_text().strip()
What is my current problem?
I want to unwrap the <br/> before the content and put them before the <strong> element and I cannot achieve that (and didn't find how to proceed in the doc).
What do I want to achieve more generally?:
I want to transform that:
<strong> <br/>Über die XXXX GmbH: </strong>
into
# Note the space
(whitespace)<br/><strong>Über die XXXX GmbH:</strong>(whitespace)
It doesn't have to use Beautiful Soup, I'm just not aware of other solutions.
Thanks in advance!
Per your example you can extract all the br tags from the strong ones and prepend them, replacing the latest tag with the new one.
Here is a snippet:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<strong>Ihre Aufgaben:<br/></strong>", "html.parser")
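for strong in soup.find_all("strong"):
    # Pull every <br/> out of the <strong> tag
    for br in strong.find_all('br'):
        br.extract()
    strong.string = strong.get_text(strip=True)
    # Replace the tag with "<br/>" followed by the cleaned <strong>
    strong.replace_with(BeautifulSoup(" %s%s " % ("<br/>", strong), "html.parser"))
print(soup)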
Which outputs:
<br/><strong>Ihre Aufgaben:</strong>
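If the position of each br matters (before or after the text), a variation is to move the tags with insert_before() and insert_after() instead of rebuilding the markup. This is a sketch under the assumption that br tags are direct children of strong; it trims the text inside strong but does not try to re-create the surrounding whitespace the question asks for.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Example combining both cases from the question: one <br/> before and one after the text.
html = "<p><strong> <br/>Über die XXXX GmbH:<br/></strong></p>"
soup = BeautifulSoup(html, "html.parser")

for strong in soup.find_all("strong"):
    seen_text = False
    # Iterate over a copy, because the loop moves children out of the tag.
    for child in list(strong.children):
        if isinstance(child, Tag) and child.name == "br":
            child.extract()
            if seen_text:
                strong.insert_after(child)   # <br/> that came after the text
            else:
                strong.insert_before(child)  # <br/> that came before the text
        elif str(child).strip():
            seen_text = True                 # first non-blank text node reached
    # Leave only the trimmed text inside <strong>.
    strong.string = strong.get_text(strip=True)

result = str(soup)
print(result)
```

Setting strong.string replaces all remaining children with a single text node, so any nested formatting inside strong would be flattened.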
My class names are constantly different, for example:
listing-col-line-3-11 dpt 41
listing-col-block-1-22 dpt 41
listing-col-line-4-13 CWK 12
Normally I could do:
for EachPart in soup.find_all("div", {"class" : "ClassNamesHere"}):
    print(EachPart.get_text())
There are way too many class names to work with here, so listing each of them is out.
I know Python doesn't have the ".contains" I would normally use, but it does have "in", though I haven't been able to work out a way to incorporate that.
I'm hoping there's a way to do this with a regex, but my Python syntax is really letting me down. I've been trying variations on:
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all(regex):
But that doesn't seem to be doing the trick.
BeautifulSoup supports CSS selectors which allow you to select elements based on the content of particular attributes. This includes the selector *= for contains.
The following will return all div elements with a class attribute containing the text 'listing-col-':
for EachPart in soup.select('div[class*="listing-col-"]'):
    print(EachPart.get_text())
You can try this for loop:
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all("div", {"class" : regex}):
    print(EachPart.get_text())
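The attempt in the question fails because a regex passed as the first argument to find_all() is matched against tag names, not class names. Here is a quick offline comparison of the CSS and regex approaches on made-up markup (the third div is deliberately given a non-matching class).

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup; the last div should not match.
html = """
<div class="listing-col-line-3-11 dpt 41">A</div>
<div class="listing-col-block-1-22 dpt 41">B</div>
<div class="other-col CWK 12">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: class attribute contains the substring.
by_css = [div.get_text() for div in soup.select('div[class*="listing-col-"]')]

# Regex passed via class_: tested against each individual class name.
prefix = re.compile(r"^listing-col-")
by_regex = [div.get_text() for div in soup.find_all("div", class_=prefix)]

print(by_css, by_regex)
```

The anchored pattern ^listing-col- is a little stricter than '.*listing-col-.*', since it only accepts class names that start with the prefix.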
You could avoid regex by using partial matching with gazpacho...
Input:
html = """\
<div class="listing-col-line-3-11 dpt 41">A</div>
<div class="listing-col-block-1-22 dpt 41">B</div>
<div class="listing-col-line-4-13 CWK 12">C</div>
"""
Partial matching code:
from gazpacho import Soup
soup = Soup(html)
divs = soup.find("div", {"class": "listing-col-"}, partial=True)
[div.text for div in divs]
Output:
['A', 'B', 'C']