I have a really silly and annoying problem: I am trying to convert HTML into Markdown, but my HTML is badly formatted. I keep running into markup like this:
<strong>Ihre Aufgaben:<br/></strong>
or
<strong> <br/>Über die XXXX GmbH:<br/></strong>
which is perfectly valid HTML.
However, the library I use to convert to Markdown (HTML2Text) turns it into:
**Ihre Aufgaben:\n**
and
** \nÜber die XXXX GmbH:\n**
which is an already-reported issue: the resulting Markdown is invalid and cannot be rendered properly.
My approach to this problem was the following:
- Use BeautifulSoup to find all the <strong> tags that lead to this problem.
- Classify the <br/> tags into two groups: the ones coming before the text and the ones coming after it.
- Unwrap the ones after the text in order to push them out of the <strong>.
My code (not very well formatted yet):
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(html)
emphased = soup.find_all('strong')
for single in emphased:
    children = single.children
    before = 0
    foundText = None
    after = 0
    for child in children:
        if not isinstance(child, NavigableString):
            if foundText:
                after += 1
                child.unwrap()
            else:
                before += 1
                # DOES NOT WORK
                child.unwrap()
        else:
            foundText = single.get_text().strip()
What is my current problem?
I want to take the <br/> tags that come before the content and move them in front of the <strong> element, and I cannot achieve that (and didn't find how to proceed in the docs).
What do I want to achieve, more generally?
I want to transform this:
<strong> <br/>Über die XXXX GmbH: </strong>
into this (note the whitespace):
(whitespace)<br/><strong>Über die XXXX GmbH:</strong>(whitespace)
It doesn't have to use BeautifulSoup; I'm just not aware of other solutions.
Thanks in advance!
Per your example, you can extract all the <br/> tags from the <strong> ones and prepend them, replacing the original tag with the new markup. Here is a snippet:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<strong>Ihre Aufgaben:<br/></strong>", "html.parser")
for strong in soup.find_all("strong"):
    for br in strong.find_all("br"):
        br.extract()
    strong.string = strong.get_text(strip=True)
    strong.replace_with(BeautifulSoup(" %s%s " % ("<br/>", strong), "html.parser"))
print(soup)
Which outputs:
<br/><strong>Ihre Aufgaben:</strong>
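If you would rather keep the original <strong> node in place and also handle trailing <br/> tags (the asker's more general goal), extracting each <br/> and re-inserting it with insert_before()/insert_after() works too. A minimal sketch, assuming the same kind of input as above:

```python
from bs4 import BeautifulSoup, NavigableString

html = "<strong> <br/>Über die XXXX GmbH:<br/></strong>"
soup = BeautifulSoup(html, "html.parser")

for strong in soup.find_all("strong"):
    seen_text = False
    anchor = strong  # where trailing <br/> tags get re-inserted
    for child in list(strong.children):  # snapshot: we mutate while iterating
        if isinstance(child, NavigableString):
            if child.strip():
                seen_text = True
        elif child.name == "br":
            child.extract()
            if seen_text:
                anchor.insert_after(child)   # trailing <br/> goes after the tag
                anchor = child               # keep multiple <br/>s in order
            else:
                strong.insert_before(child)  # leading <br/> goes before the tag
    strong.string = strong.get_text(strip=True)  # drop stray inner whitespace

print(soup)  # <br/><strong>Über die XXXX GmbH:</strong><br/>
```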
Related
I have an HTML document like this: https://dropmefiles.com/wezmb
So I need to extract the text inside the tag <span id="1">...</span>, but I don't know how.
I'm trying this code:
from bs4 import BeautifulSoup

with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp, features="html.parser")
for a in soup.find_all('span'):
    print(a.string)
But it extracts the text from all 'span' tags. So, how can I extract only the text inside <span id="1">...</span> in Python?
What you need is the .contents attribute (see the documentation).
Find the span <span id="1">...</span> using:
for x in soup.find(id="1").contents:
    print(x)
OR
x = soup.find(id="1").contents[0]  # there will only be one element with id "1"
print(x)
This will give you :
10
that is, an empty line, then 10, then another empty line. This is because the string in the HTML really is like that, as you can see in the source, where 10 sits on its own line.
The string will correctly be '\n10\n'.
If you want just x = '10' from x = '\n10\n', you can do x = x[1:-1], since '\n' is a single character. Hope this helps.
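A related shortcut: get_text(strip=True) trims that surrounding whitespace in one call, so the slicing step isn't needed. A small sketch on made-up markup mirroring the linked document:

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the linked document.
html = '<span id="1">\n10\n</span><span id="2">\n20\n</span>'
soup = BeautifulSoup(html, "html.parser")

raw = soup.find(id="1").string                  # keeps the newlines: '\n10\n'
clean = soup.find(id="1").get_text(strip=True)  # just '10'
```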
I'm trying to extract questions from various AMC tests. Consider https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1, for example. To get the problem text, I just need the regular string text in the first <p> element plus the LaTeX from the <img> inside that same <p>.
My code so far:
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latex_equation = soup.select('p img')[0].get('alt')
It works for the LaTeX equation, but there are more parts of the question before it. Is there a way to get the other part of the question, which is "What is the value of"? I'm thinking of using a regex, but I want to see if Beautiful Soup has a feature that can get it for me.
Try using zip():
import requests
from bs4 import BeautifulSoup
URL = "https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select_one(".mw-parser-output p"), soup.select("p img")):
    print(text, tag.get("alt"))
    break
Output:
What is the value of $\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$
Edit:
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select(".mw-parser-output p"), soup.select("p img")):
    print(text.text.strip(), tag.get("alt"))
Well, BS4 seems to be a bit buggy here; it took me a while to get this working. I don't think it is viable with these weird spacings, so a regex would be your best option. Let me know if this is good. I checked the first 2 questions and they worked fine. The AMC does have some geometry problems that depend on images, however, so I don't think it will work for those.
import bs4
import requests
import re

res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    mo = latex_reg.search(i)
    if mo:
        elements[n] = mo.group(1)
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    if elements[n][0] == "$":
        elements[n] = " " + elements[n] + " "
print(elements)
print("".join(elements))
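If the regex route feels fragile, a BeautifulSoup-only alternative is to walk the first <p>'s children and swap each <img> for its alt text. A sketch on inline HTML standing in for the AoPS page (the markup here is made up to keep the example self-contained):

```python
from bs4 import BeautifulSoup, NavigableString

# Inline stand-in for the first problem paragraph on the page.
html = '<p>What is the value of <img alt="$\\frac{2}{a}$"/>?</p>'
soup = BeautifulSoup(html, "html.parser")

parts = []
for child in soup.find("p").children:
    if isinstance(child, NavigableString):
        parts.append(str(child))            # plain question text
    elif child.name == "img":
        parts.append(child.get("alt", ""))  # LaTeX lives in the alt attribute
question = "".join(parts).strip()

print(question)  # What is the value of $\frac{2}{a}$?
```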
I have the following bs4 element tag:
<span><span>some content</span> B</span>
The length of string B is unknown (I named it B for simplicity).
How can I use BeautifulSoup to extract "B"? Or is my only option to extract the text and then use some regex techniques?
Thanks
Edit: complete code
import requests
from bs4 import BeautifulSoup

def get_doc_yakarouler(license_plate, url='https://www.yakarouler.com/car_search/immat?immat='):
    response = requests.get(url + license_plate)
    content = response.content
    doc = BeautifulSoup(content, 'html.parser')
    result = doc.span.text
    if 'identifié' in result:
        return doc
    else:
        return f"La plaque {license_plate} n'est pas recensé sur yakarouler"

doc = get_doc_yakarouler('AA300AA')
span = doc.find_all('span')
motorisation_tag = span[1]
I want to extract "1.6 TDI"
I found a solution using motorisation_tag.text.replace(u'\xa0', ' ').split(' ')[1], but I would like to know if it is possible to do it directly with bs4.
Assuming you have a variable span which represents the outer <span> tag, you can do the following to extract 'B': span.contents[1]. This works because .contents returns a list of the tag's children, in this case [<span>some content</span>, ' B'], so you can access the text as the second element of the list. Be aware that if there is a space before B, as in your HTML sample, the space will be included in the string.
from bs4 import BeautifulSoup as bs, NavigableString

html = '<span><span>some content</span> B</span>'
soup = bs(html, 'html.parser')
span = soup.find("span")
# First approach: ask for the direct (non-recursive) text node
outer_text_1 = span.find(text=True, recursive=False)
# Second approach: loop through the contents of the tag and keep only the text nodes
outer_text_2 = ' '.join([t for t in span.contents if type(t) == NavigableString])
print(outer_text_1)  # output: ' B'
print(outer_text_2)  # output: ' B'
I am doing some Python work and couldn't find the answer to this question, so I'm hoping someone can help.
I am using findAll and getting two numbers as text output; however, I only want the second number, not the first. How do I target the second number?
This is my code:
product_price_container_after = container.findAll("div",{"class":"discounted"})
product_price_after = product_price_container_after[0].text
print(product_price_after)
This is where I am trying to get it from:
<div class="col search_price discounted responsive_secondrow">
<span style="color: #888888;"><strike>59,98€</strike></span><br/>19,99€
</div>
So the output is:
59,98€19,99€
How do I only get 19,99€ ?
Thanks for the help.
You can use the decompose() or extract() methods to remove elements from the tree.
discountedDivs = container.findAll("div", {"class": "discounted"})
for discountedDiv in discountedDivs:
    discountedDiv.find("span").extract()
    print(discountedDiv.text)  # prints 19,99€
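The decompose() variant works the same way but destroys the removed subtree instead of detaching it. A self-contained sketch on an inline copy of the question's price markup:

```python
from bs4 import BeautifulSoup

# Inline copy of the price markup from the question.
html = ('<div class="col search_price discounted responsive_secondrow">'
        '<span style="color: #888888;"><strike>59,98€</strike></span>'
        '<br/>19,99€</div>')
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", class_="discounted")
div.find("span").decompose()  # removes the old-price span entirely
price = div.text.strip()
print(price)  # 19,99€
```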
Sorry, but I can't reproduce your code; it's incomplete. Try this, though:
product_price_after = product_price_container_after[1].text
You could use stripped strings
import requests
from bs4 import BeautifulSoup as bs

res = requests.get('https://store.steampowered.com/search/?specials=1&page=1')
soup = bs(res.content, 'lxml')
prices = soup.select('.discounted')
for price in prices:
    if price.text is not None:
        strings = [string for string in price.stripped_strings]
        print(strings[1])
Or next_sibling:
for price in prices:
    if price.text is not None:
        print(price.find('br').next_sibling)
Thanks to the kind users of this site, I have some idea of how to use re as an alternative to a non-standard Python module so that my script will run with minimal overhead. Today I've been experimenting with parsing modules and came across BeautifulSoup. This is all great, but I don't understand it.
For educational purposes, I'd like to extract the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web crawler; I'm not trying to crawl the whole site, just to extract the information from this page in order to learn how parsing modules work!):
Movie Title
Quality
Torrent Link
There are 22 of these items, and I want them stored in lists, in order, i.e. item_1, item_2, each containing these three items. For instance:
item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]
And then, to make matters simple, I just want to print every item to the console. To make things more difficult, however, these items don't have identifiers on the page, so the info needs to be strictly ordered. This is all good, but all I'm getting is either the entire source contained in each list item, or empty items! An example item divider is as follows:
<div class="browse-info">
    <span class="info">
        <h3>James Bond: Casino Royale (2006)</h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
        <p><b>Genre:</b> Action | Crime</p>
        <p><b>IMDB Rating:</b> 7.9/10</p>
        <span>
            <p class="peers"><b>Peers:</b> 698</p>
            <p class="peers"><b>Seeds:</b> 356</p>
        </span>
    </span>
    <span class="links">
        View Info<span></span>
        Download<span></span>
    </span>
</div>
Any ideas? Would someone please do me the honour of giving me an example of how to do this? I'm not sure Beautiful Soup accommodates all of my requirements! PS: Sorry for the poor English; it's not my first language.
from urllib.request import urlopen

from bs4 import BeautifulSoup

html = urlopen('http://yify-torrents.com/browse-movie').read()
soup = BeautifulSoup(html, "html.parser")

for i in soup.findAll("div", {"class": "browse-info"}):
    name = i.find('a').text
    for x in i.findAll('b'):
        if x.text == "Quality:":
            quality = x.parent.text
    link = i.find('a', {"class": "std-btn-small mleft torrentDwl"})['href']
    print([name, quality, link])

Which prints:
['James Bond: Casino Royale (2006)', 'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
['Pitch Perfect (2012)', 'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...
or to get exactly the output you wanted:
for i in soup.findAll("div", {"class": "browse-info"}):
    name = i.find('a').text
    for x in i.findAll('b'):
        if x.text == "Quality:":
            quality = x.parent.find(text=True, recursive=False).strip()
    link = i.find('a', {"class": "std-btn-small mleft torrentDwl"})['href']
    print([name, quality, link])
As you requested, here is a simple example of a parser. As you can see, it uses lxml. With lxml you have two ways to work with the DOM tree: XPath and CSS selectors.
I preferred XPath.
import urllib.request

import lxml.html

def parse():
    url = 'https://sometotosite.com'
    doc = lxml.html.fromstring(urllib.request.urlopen(url).read())
    main_div = doc.xpath("//div[@id='line']")[0]
    main = {}
    tr = []
    for el in main_div.getchildren():
        if el.xpath("descendant::a[contains(@name,'tn')]/text()"):
            category = el.xpath("descendant::a[contains(@name,'tn')]/text()")[0]
            main[category] = ''
            tr = []
        else:
            for element in el.getchildren():
                # tostring() returns bytes by default; decode so the '—' test works
                if '—' in lxml.html.tostring(element, encoding='unicode'):
                    tr.append(element)
            print(category, tr)

parse()
LXML official site
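If you just want to experiment with the XPath syntax (note that attribute tests use @, e.g. @id and @name) without installing lxml, the standard library's xml.etree.ElementTree supports a limited XPath subset: attribute equality works, though contains() does not. A sketch on a made-up inline document:

```python
import xml.etree.ElementTree as ET

# Made-up inline document (ElementTree needs well-formed XML).
xml = """
<root>
  <div id="line">
    <a name="tn1">Category A</a>
    <a name="other">skip me</a>
  </div>
</root>
"""
root = ET.fromstring(xml)

# Attribute equality works in the stdlib; contains() is lxml-only.
main_div = root.find(".//div[@id='line']")
names = [a.text for a in main_div.findall("a[@name='tn1']")]
print(names)  # ['Category A']
```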