I have beautiful soup code that looks like:
for item in beautifulSoupObj.find_all('cite'):
    pagelink.append(item.get_text())
the problem is, the html code I'm trying to parse looks like:
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
My current selector above gets everything, including the <strong> tags inside it.
Thus, how can I parse only:
https://www.websiteurl.com/id=6
Note that <cite> appears multiple times throughout the page, and I want to extract and print all of them.
Thank you.
Extracting only the text portion is as easy as calling .text on the object.
We can use basic BeautifulSoup methods to traverse the tree hierarchy.
from bs4 import BeautifulSoup
html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.cite.text)
# is the same as soup.find('cite').text
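Since <cite> appears multiple times on the real page, the same idea extends to all of them with find_all. A minimal sketch reusing the names from the question (the second <cite> is made up for illustration):
from bs4 import BeautifulSoup

html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
<cite>https://www.<strong>websiteurl.com/id=7</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')

pagelink = []
for item in soup.find_all('cite'):
    # .get_text() concatenates the text of the tag and all of its
    # children, so the <strong> markup disappears
    pagelink.append(item.get_text())

print(pagelink)
# ['https://www.websiteurl.com/id=6', 'https://www.websiteurl.com/id=7']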
HTML has a concept of empty elements, as listed on MDN. However, beautiful soup doesn't seem to handle them properly:
import bs4
soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'html.parser'
)
print(soup.contents)
I get:
[<div><input name="the-input"><label for="the-input">My label</label></input></div>]
I.e. the input has wrapped the label.
Question: Is there any way to get beautiful soup to parse this properly? Or is there an official explanation of this behaviour somewhere I haven't found yet?
At the very least I'd expect something like:
[<div><input name="the-input"></input><label for="the-input">My label</label></div>]
I.e. the input automatically closed before the label.
As stated in their documentation, html5lib parses the document the same way a web browser does (like lxml does in this case). It'll try to fix your document tree by adding/closing tags where needed.
In your example I've used lxml as the parser and it gave the following result:
import bs4

soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'lxml'
)
print(soup.body.contents)
[<div><input name="the-input"/><label for="the-input">My label</label></div>]
Note that lxml added html and body tags because they weren't present in the source; that is why I've printed the body contents.
I would say soup is doing what it can to fix this html structure, and that is actually helpful on some occasions.
Anyway, for your case I would use lxml, which will parse the html structure the way you want, or maybe give parsel a try.
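For completeness, here is the html5lib route mentioned above. A minimal sketch, assuming the html5lib package is installed (pip install html5lib):
import bs4

soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'html5lib'
)
print(soup.body.contents)
# expected: [<div><input name="the-input"/><label for="the-input">My label</label></div>]
Like lxml, html5lib wraps the fragment in html and body tags, so the body contents are printed here as well.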
It's been a while since I've used regex, and I feel like this should be simple to figure out.
I have a web page full of links that look like the string_to_match in the code below. I want to grab just the number in each link, like the "58" in string_to_match. For the life of me I can't figure it out.
import re
string_to_match = '<a href="http://example.com/roster?teamId=58">Roster</a>'  # illustrative URL
re.findall('Roster', string_to_match)  # matches the link text, not the number
Instead of using regular expressions alone, you can use a combination of HTML parsing (with BeautifulSoup) to locate the desired link and extract the href attribute value, and URL parsing, which in this case we'll use a regular expression for:
import re
from bs4 import BeautifulSoup
data = """
<body>
<a href="http://example.com/roster?teamId=58">Roster</a>
</body>
"""
soup = BeautifulSoup(data, "html.parser")
link = soup.find("a", text="Roster")["href"]
print(re.search(r"teamId=(\d+)", link).group(1))
Prints 58.
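Since the teamId lives in the query string, the standard library's urllib.parse can also pull it out without a regular expression. A minimal sketch, using the href extracted above:
from urllib.parse import urlparse, parse_qs

link = "http://example.com/roster?teamId=58"  # the href extracted above (illustrative URL)
team_id = parse_qs(urlparse(link).query)["teamId"][0]
print(team_id)  # 58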
I would recommend using BeautifulSoup or lxml, it's worth the learning curve.
...But if you still want to use regexp
re.findall(r'href="[^"]*teamId=(\d+)', string_to_match)
I am using BeautifulSoup to scrape many pages of a website for comments. Each page of this website has the comment "[[commentMessage]]". I want to filter out this string so it does not print every time the code runs. I'm very new to Python and BeautifulSoup, and I couldn't seem to find this after looking for a bit, though I may be searching for the wrong thing. Any suggestions? My code is below:
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('website url').read()
soup = BeautifulSoup(r, "html.parser")
comments = soup.find_all("div", class_="commentMessage")
for element in comments:
    print element.find("span").get_text()
All of the comments are in spans within divs of the class commentMessage, including the unnecessary comment "[[commentMessage]]".
A simple if check should do:
for element in comments:
    text = element.find("span").get_text()
    if "[[commentMessage]]" not in text:
        print text
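Note that the in check also drops any real comment that merely contains the placeholder text. If you only want to skip the placeholder itself, an exact comparison is safer; a minimal sketch in the same Python 2 style as the question:
for element in comments:
    text = element.find("span").get_text()
    if text.strip() != "[[commentMessage]]":
        print text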
I am parsing a poorly designed web page using beautiful soup.
At the moment, what I need is to select the comment section of the web page, but each comment is treated as a div, and each has an id like "IAMCOMMENT_00001", but that's it. There is no class (which would have helped a lot).
So I am forced to search for all divs whose id starts with "IAMCOMMENT", but I can't figure out how to do this. The closest thing I could find is SoupStrainer, but I couldn't understand how to use it.
How would I be able to achieve this?
I would use BeautifulSoup's built in find_all function:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(yourhtml, 'html.parser')
soup.find_all('div', id=re.compile(r'^IAMCOMMENT_'))
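Since the question mentions SoupStrainer: a strainer restricts what gets parsed in the first place, which can save memory on a large page. A minimal sketch, assuming yourhtml holds the page source:
import re
from bs4 import BeautifulSoup, SoupStrainer

# Only <div> tags whose id starts with IAMCOMMENT_ are parsed into the soup
only_comment_divs = SoupStrainer('div', id=re.compile(r'^IAMCOMMENT_'))
soup = BeautifulSoup(yourhtml, 'html.parser', parse_only=only_comment_divs)

for div in soup.find_all('div'):
    print(div.get('id'))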
If you want to parse divs from inside HTML comments, you first need to find the comments in your html. A way to do this is:
import re
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(myhtml, 'html.parser')
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
Then, to find the divs inside a comment:
for comment in comments:
    cmnt_soup = BeautifulSoup(comment, 'html.parser')
    divs = cmnt_soup.find_all('div', attrs={"id": re.compile(r'IAMCOMMENT_\d+')})
    # do things with the divs
I want to load the html contents of a page into an xml tree and remove elements from it using lxml in Python. I just want to know how I would remove the elements from the content.
You can use the combination of BeautifulSoup4 and lxml to reach your goal easily.
To parse your HTML into a tree / soup, you just need to have both packages installed and do:
from bs4 import BeautifulSoup
html = """..."""
soup = BeautifulSoup(html, 'lxml')
...
Then you modify the tree; here is a whole section of the reference teaching you how to modify the contents/attributes of a tag, remove elements, and so on:
BeautifulSoup/Modify The tree
Here is an example of modifying the contents of an anchor tag and removing an element outright. A minimal sketch; the html snippet is made up for illustration:
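from bs4 import BeautifulSoup

html = '<div><a href="https://example.com">old text</a><span>remove me</span></div>'
soup = BeautifulSoup(html, 'lxml')

# Change the text and an attribute of the anchor tag
a = soup.find('a')
a.string = 'new text'
a['href'] = 'https://example.org'

# Remove an element from the tree entirely
soup.find('span').decompose()

print(soup.div)
# <div><a href="https://example.org">new text</a></div>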