Beautiful Soup prettify format only string values - python

I am using Beautiful Soup 4 to parse and modify a couple of Angular templates (HTML files). I have some issues when using the prettify function to write the modified content back into the file. This issue is related to special characters such as: >, <, etc.
I need the HTML files to be formatted exactly as they were before the processing, without converting string values such as &times; to "×" and without converting attribute values such as *ngIf="alarmCount > 0" to *ngif="alarmCount &gt; 0".
Below is an example HTML template and Beautiful Soup's output for each of the three built-in output formatters (html, minimal, None). None of them produces the desired output. It is also possible to pass a custom formatter function to prettify(), but since such a function does not know whether the string passed to it is an attribute value or a text string, I'm not sure it really helps in this case.
Any suggestions on how to handle this are appreciated.
from bs4 import BeautifulSoup
test_html = """
<div id="someId" (window:resize)="sizeChanged($event)">
<span *ngIf="alarmCount > 0">&times;</span>
</div>
"""
document = BeautifulSoup(test_html, "html.parser")
print(document.prettify(formatter="html"))
Result with formatter="html":
<div (window:resize)="sizeChanged($event)" id="someId">
 <span *ngif="alarmCount &gt; 0">
  &times;
 </span>
</div>
Result with formatter=None:
<div (window:resize)="sizeChanged($event)" id="someId">
 <span *ngif="alarmCount > 0">
  ×
 </span>
</div>
Result with formatter="minimal":
<div (window:resize)="sizeChanged($event)" id="someId">
 <span *ngif="alarmCount &gt; 0">
  ×
 </span>
</div>
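One possible direction, sketched under the assumption that a reasonably recent Beautiful Soup 4 release is installed (one that ships bs4.formatter.HTMLFormatter with an overridable attribute_value method): unlike a plain formatter function, a formatter object is told when it is handling an attribute value. The class name TextOnlyEntityFormatter is made up for this illustration, and the sketch does not address html.parser lowercasing attribute names such as *ngIf.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution
from bs4.formatter import HTMLFormatter


class TextOnlyEntityFormatter(HTMLFormatter):
    """Hypothetical formatter: HTML entities in text, raw attribute values."""

    def __init__(self):
        # Text nodes get the same substitution that formatter="html" uses.
        super().__init__(entity_substitution=EntitySubstitution.substitute_html)

    def attribute_value(self, value):
        # Return attribute values unchanged, so ">" is not written as "&gt;".
        return value


test_html = '<div id="someId"><span *ngIf="alarmCount > 0">&times;</span></div>'
document = BeautifulSoup(test_html, "html.parser")
pretty = document.prettify(formatter=TextOnlyEntityFormatter())
print(pretty)
```

With this formatter the text node should be written back as &times; while the attribute value keeps its literal > character.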

Related

Nested Beautiful Soup classes

I am trying to fetch all classes (including the data inside "data_from", "data_to") from the following structure:
<div class="alldata">
<div class="data_from">
<div class="data_to">
<div class="data_to">
<div class="data_from">
</div>
So far I have tried finding all classes, without success. The "data_from", "data_to" classes are not being fetched by:
soup.find_all(class_=True)
When I try to iterate over the "alldata" class, I fetch only the first "data_from" class.
for data in soup.findAll('div', attrs={"class": "alldata"}):
    print(data.prettify())
All assistance is greatly appreciated. Thank you.
In newer code, avoid the old findAll() syntax and avoid mixing it with the new syntax; use find_all() only. For more, take a minute to check the docs.
Your HTML is not valid, but to reach your goal with valid HTML you could use a CSS selector that selects every <div> with a class attribute contained in your outer <div>:
soup.select('.alldata div[class]')
Example
from bs4 import BeautifulSoup
html='''<div class="alldata">
<div class="data_from"></div>
<div class="data_to"></div>
<div class="data_to"></div>
<div class="data_from"></div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
soup.select('.alldata div[class]')
Output
[<div class="data_from"></div>,
<div class="data_to"></div>,
<div class="data_to"></div>,
<div class="data_from"></div>]
In addition, if you would like to get their text, iterate over your ResultSet:
for e in soup.select('.alldata div[class]'):
    print(e.text)
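For comparison, the same lookup can probably be written with find_all() only. This is a minimal sketch that reuses the valid HTML from the example above; the variable names outer, inner and class_names are chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = '''<div class="alldata">
<div class="data_from"></div>
<div class="data_to"></div>
<div class="data_to"></div>
<div class="data_from"></div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Locate the outer container first, then search only inside it.
outer = soup.find("div", class_="alldata")

# class_=True matches every <div> that has a class attribute at all.
inner = outer.find_all("div", class_=True)

# The class attribute is multi-valued, so bs4 stores it as a list.
class_names = [div["class"][0] for div in inner]
print(class_names)
```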

Search for text inside a tag using beautifulsoup and returning the text in the tag after it

I'm trying to parse the following HTML in Python using Beautiful Soup. I would like to search for text inside a tag, for example "Color", and return the text of the next tag, "Slate, mykonos", and do the same for the other tags, so that for a given text category I can return its corresponding information.
However, I'm finding it very difficult to find the right code to do this.
<h2>Details</h2>
<div class="section-inner">
<div class="_UCu">
<h3 class="_mEu">General</h3>
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
</div>
<div class="_UCu">
<h3 class="_mEu">Carrying Case</h3>
<div class="_JDu">
<span class="_IDu">Type</span>
<span class="_KDu">Protective cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Recommended Use</span>
<span class="_KDu">For cell phone</span>
</div>
<div class="_JDu">
<span class="_IDu">Protection</span>
<span class="_KDu">Impact protection</span>
</div>
<div class="_JDu">
<span class="_IDu">Cover Type</span>
<span class="_KDu">Back cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Features</span>
<span class="_KDu">Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges</span>
</div>
</div>
I use the following code to retrieve my div tag
soup.find_all("div", "_JDu")
Once I have retrieved the tag I can navigate inside it but I can't find the right code that will enable me to find the text inside one tag and return the text in the tag after it.
Any help would be really really appreciated as I'm new to python and I have hit a dead end.
You can define a function to return the value for the key you enter:
def get_txt(soup, key):
    key_tag = soup.find('span', text=key).parent
    return key_tag.find_all('span')[1].text
color = get_txt(soup, 'Color')
print('Color: ' + color)
features = get_txt(soup, 'Features')
print('Features: ' + features)
Output:
Color: Slate, mykonos
Features: Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges
I hope this is what you are looking for.
Explanation:
soup.find('span', text=key) returns the <span> tag whose text=key.
.parent returns the parent tag of the current <span> tag.
Example:
When key='Color', soup.find('span', text=key).parent will return
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
Now we've stored this in key_tag. Only thing left is getting the text of second <span>, which is what the line key_tag.find_all('span')[1].text does.
Give this a go. It can also give you the corresponding values. Make sure to assign the HTML to a variable named content, wrapped in triple quotes (content = """ ... """), to see how it works.
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
for elem in soup.select("._JDu"):
    item = elem.select_one("span")
    if "Features" in item.text:  # try to see if it misses the corresponding values
        val = item.find_next("span").text
        print(val)
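If every label/value pair is needed rather than a single key, the rows could be collected into a dictionary in one pass. This is a sketch only: it assumes each ._JDu row holds one ._IDu label and one ._KDu value as in the question, uses a shortened copy of that markup, and the name details is invented here.

```python
from bs4 import BeautifulSoup

content = """
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
<div class="_JDu">
<span class="_IDu">Type</span>
<span class="_KDu">Protective cover</span>
</div>
"""

soup = BeautifulSoup(content, "html.parser")

details = {}
for row in soup.select("div._JDu"):
    label = row.select_one("span._IDu")
    value = row.select_one("span._KDu")
    # Skip rows that are missing either half of the pair.
    if label is not None and value is not None:
        details[label.get_text(strip=True)] = value.get_text(strip=True)

print(details["Color"])
```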

Python list processing to extract substrings

I parsed an HTML page via beautifulsoup, extracting all div elements with specific class names into a list.
I now have to clean out HTML strings from this list, leaving behind string tokens I need.
The list I start with looks like this:
[<div class="info-1">\nName1a <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b <span class="bold">Score3b</span>\n</div>]
The whitespaces are deliberate.
I need to reduce that list to:
[('Name1a', 'Score1a'), ('Name1b', 'Score1b'), ('Name2a', 'Score2a'), ('Name2b', 'Score2b'), ('Name3a', 'Score3a'), ('Name3b', 'Score3b')]
What's an efficient way to parse out substrings like this?
I've tried using the split method (e.g. [item.split('<div class="info-1">\n',1) for item in string_list]), but splitting just results in a substring that requires further splitting (hence inefficient). Likewise for using replace.
I feel I ought to go the other way around and extract the tokens I need, but I can't seem to wrap my head around an elegant way to do this. Being new to this hasn't helped either. I appreciate your help.
Do not convert BS object to string unless you really need to do that.
Use CSS selector to find the class that starts with info
Use stripped_strings to get all the non-empty strings under a tag
Use tuple() to convert an iterable to tuple object
import bs4
html = '''<div class="info-1">\nName1a <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b <span class="bold">Score3b</span>\n</div>'''
soup = bs4.BeautifulSoup(html, 'lxml')
for div in soup.select('div[class^="info"]'):
    t = tuple(text for text in div.stripped_strings)
    print(t)
out:
('Name1a', 'Score1a')
('Name1b', 'Score1b')
('Name2a', 'Score2a')
('Name2b', 'Score2b')
('Name3a', 'Score3a')
('Name3b', 'Score3b')
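To end up with the single list of tuples the question asks for, rather than printed lines, the same selector can feed a list comprehension. A minimal sketch, using only the first two items of the question's data and the built-in html.parser instead of lxml; the name pairs is chosen here.

```python
import bs4

html = '''<div class="info-1">\nName1a <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b <span class="bold">Score1b</span>\n</div>'''

soup = bs4.BeautifulSoup(html, "html.parser")

# One tuple per matching <div>; stripped_strings drops whitespace-only strings.
pairs = [tuple(div.stripped_strings) for div in soup.select('div[class^="info"]')]
print(pairs)
```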

Using regex on python + beautiful soup

I have an html page like this:
<td class="subject windowbg2">
<div>
<span id="msg_152617">
<a href= SOME INFO THAT I WANT </a>
</span>
</div>
<div>
<span id="msg_465412">
<a href= SOME INFO THAT I WANT</a>
</span>
</div>
As you can see, the number in id="msg_465412" varies, so this is my code:
import urllib.request, http.cookiejar,re
from bs4 import BeautifulSoup
contenturl = "http://megahd.me/peliculas-microhd/"
htmll=urllib.request.urlopen(contenturl).read()
soup = BeautifulSoup(htmll)
print (soup.find('span', attrs=re.compile(r"{'id': 'msg_\d{6}'}")))
In the last line I tried to find all the "span" tags whose id has the form msg_###### (with any number), but something is wrong in my code and it doesn't find anything.
P.S.: all the content I want is in a table with 6 columns and I want the third column of all rows, but I thought it would be easier to use regex.
You're a bit mixed up with your attrs argument ... at the moment it's a regex which contains the string representation of a dictionary, when it needs to be a dictionary containing the attribute you're searching for and a regex for its value.
This ought to work:
print (soup.find('span', attrs={'id': re.compile(r"msg_\d{6}")}))
Try using the following:
soup.find_all("span", id=re.compile(r"msg_\d{6}"))
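Both suggestions can be tried offline with a small self-contained sketch. The markup below is a made-up stand-in for the real page (the href values, link texts and the other_1 id are placeholders), and the ^ and $ anchors are an addition so that only ids of exactly the form msg_ plus six digits match.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = """
<td class="subject windowbg2">
<div><span id="msg_152617"><a href="#">first</a></span></div>
<div><span id="msg_465412"><a href="#">second</a></span></div>
<div><span id="other_1">ignored</span></div>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# The regex is the attribute's value filter, not the whole attrs argument.
pattern = re.compile(r"^msg_\d{6}$")
spans = soup.find_all("span", id=pattern)

ids = [span["id"] for span in spans]
print(ids)
```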

Scrapy, python, Xpath how to match respective items in html

I am new to XPath and am trying to scrape, with Scrapy, a website that has the format below:
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_date </div>
<div class="middle"> listed_value </div>
</div>
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_date </div>
</div>
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_value </div>
</div>
The presence of listed_value and listed_date is optional.
I need to group each tittle_name with its respective listed_date and listed_value (if available), then insert each record into MySQL.
I am using scrapy shell which gives some basic examples like
listings = hxs.select('//div[@class=\'top\']')
for listing in listings:
    tittle_name = listing.select('/a//text()').extract()
    date_values = listing.select('//div[@class=\'middle\']')
The code above gives me a list of tittle_name and a list of the available listed_date and listed_value, but how do I match them? (We cannot go by index because the format is not symmetric.)
Thanks.
Do note that those XPath expressions are absolute:
/a//text()
//div[@class=\'middle\']
You would need relative XPath expressions like these:
a
div[@class=\'middle\']
Second, it's not a good idea to select text nodes in a mixed content model like (X)HTML. You should extract the string value with the proper DOM method or with the string() function. (In the latter case, you would need to evaluate the expression for each node, because string() applied to a node set implicitly uses only its first node.)
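The effect of relative expressions can be sketched without Scrapy. The example below swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath, and wraps the question's markup in a made-up root element so that it parses as XML. Each inner lookup is evaluated relative to one listing, so the values stay grouped with their title.

```python
import xml.etree.ElementTree as ET

# The question's sample markup, wrapped in a hypothetical <root> element.
markup = """
<root>
  <div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
    <div class="middle"> listed_value </div>
  </div>
  <div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
  </div>
  <div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_value </div>
  </div>
</root>
"""

root = ET.fromstring(markup)

records = []
for listing in root.findall(".//div[@class='top']"):
    # Relative paths: only children of this particular listing are searched.
    title = listing.findtext("a", default="").strip()
    middles = [d.text.strip() for d in listing.findall("div[@class='middle']")]
    records.append((title, middles))

print(records)
```

Each record pairs one title with only its own middle values, however many of them are present.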
Well, since the website doesn't specify whether something in a div[@class='middle'] is a date or a value, you'll have to code your own way of deciding this.
I guess the dates have some specific format that you could match with some analysis, maybe using a regular expression.
Can you maybe be more specific on what are possible values for listed_date and listed_value?
