I am parsing a poorly designed web page using Beautiful Soup.
At the moment, what I need is to select the comment section of the web page, but each comment is just a DIV with an ID like "IAMCOMMENT_00001" and nothing else. No class (which would have helped a lot).
So I am forced to search for all DIVs whose ID starts with "IAMCOMMENT", but I can't figure out how to do this. The closest thing I could find is SoupStrainer, but I couldn't understand how to use it.
How would I be able to achieve this?
I would use BeautifulSoup's built-in find_all function:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(yourhtml, "html.parser")
# note: unlike class_, the id keyword needs no trailing underscore
soup.find_all('div', id=re.compile(r'^IAMCOMMENT_'))
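For example, with a small made-up snippet of HTML standing in for the real page, iterating over the matches might look like this:
import re
from bs4 import BeautifulSoup

# hypothetical markup for illustration only
yourhtml = """
<div id="IAMCOMMENT_00001">First comment</div>
<div id="IAMCOMMENT_00002">Second comment</div>
<div id="sidebar">Not a comment</div>
"""

soup = BeautifulSoup(yourhtml, "html.parser")
for div in soup.find_all('div', id=re.compile(r'^IAMCOMMENT_')):
    print(div.get_text(strip=True))
# First comment
# Second comment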
If the divs you want are inside HTML comments, you first need to find the comments in your HTML. One way to do this:
import re
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(myhtml, "html.parser")
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
Then, to find the divs inside each comment:
for comment in comments:
    cmnt_soup = BeautifulSoup(comment, "html.parser")
    divs = cmnt_soup.find_all('div', attrs={"id": re.compile(r'IAMCOMMENT_\d+')})
    # do things with the divs
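Put together, a minimal runnable sketch (the HTML below is made up to show a div hidden inside an HTML comment) looks like:
import re
from bs4 import BeautifulSoup, Comment

# made-up HTML where one comment div sits inside an HTML comment
myhtml = """
<body>
  <!-- <div id="IAMCOMMENT_00001">buried comment</div> -->
  <div id="IAMCOMMENT_00002">visible comment</div>
</body>
"""

soup = BeautifulSoup(myhtml, "html.parser")
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
for comment in comments:
    cmnt_soup = BeautifulSoup(comment, "html.parser")
    for div in cmnt_soup.find_all('div', attrs={"id": re.compile(r'IAMCOMMENT_\d+')}):
        print(div.get_text())  # -> buried comment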
Related
Hi, I'm trying to update the content of a table on a Confluence page using BeautifulSoup and API requests.
This is my code. I'm able to find and update the td, but I couldn't insert the updated td into the soup variable.
content = requests.get(address, headers=headers).text
soup = BeautifulSoup(content, 'html.parser')
for td in soup.find_all('td'):
    if td == "<td> i need to update this </td>":
        td.replace_with("<td>updated</td>")
I need the updated td to be inserted into the soup variable, so that when I search soup.find_all('td') I find updated instead of i need to update this.
How can I do that?
Thanks
Note: Do not use regular expressions for parsing HTML. I dearly hope that there is an alternative method for solving this problem. But, alas, I could not think of one.
Using the following regular expression, you can find every td element with no attributes attached. This can be used for finding the elements, but don't try this for actually getting information from the elements:
<td>.*?</td>
You can then use the Pattern.sub() method to substitute every instance of a td element with <td>updated</td>.
import re
import bs4
import requests
content = requests.get(address, headers=headers).text
regex = re.compile(r"<td>.*?</td>")
new_content = regex.sub("<td>updated</td>", content)
soup = bs4.BeautifulSoup(new_content, features="html.parser")
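After re-parsing, the substitution shows up through the usual BeautifulSoup search, which is what the question asked for (a quick check, reusing the soup built above):
# every td in the re-parsed document now reads "updated"
for td in soup.find_all("td"):
    print(td.get_text())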
I have Beautiful Soup code that looks like:
for item in beautifulSoupObj.find_all('cite'):
    pagelink.append(item.get_text())
The problem is, the HTML code I'm trying to parse looks like:
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
My current selector above grabs everything, including the strong tags inside.
Thus, how can I parse only:
https://www.websiteurl.com/id=6
Note that <cite> appears multiple times throughout the page, and I want to extract and print all of them.
Thank you.
Extracting only the text portion is as easy as calling .text on the object.
We can use basic BeautifulSoup methods to traverse the tree hierarchy.
Helpful explanation on how to do that: HERE
from bs4 import BeautifulSoup
html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.cite.text)
# is the same as soup.find('cite').text
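Since <cite> appears multiple times on the page, the same idea extends to find_all (a minimal sketch, with a second made-up cite element added for illustration):
from bs4 import BeautifulSoup

html = '''
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
<cite>https://www.<strong>websiteurl.com/id=7</strong></cite>
'''

soup = BeautifulSoup(html, 'html.parser')
for cite in soup.find_all('cite'):
    # .text concatenates the tag's text and that of all its children,
    # so the <strong> markup disappears and only the URL remains
    print(cite.text)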
It's been a while since I've used regex, and I feel like this should be simple to figure out.
I have a web page full of links that look like the string_to_match in the code below. I want to grab just the numbers in the links, like the number "58" in string_to_match. For the life of me I can't figure it out.
import re
# the href below is a stand-in; the real links carry a teamId parameter like this one
string_to_match = '<a href="http://example.com/clubhouse?teamId=58">Roster</a>'
re.findall('Roster', string_to_match)
Instead of using regular expressions on the raw HTML, you can combine HTML parsing (with BeautifulSoup) to locate the desired link and extract its href attribute value, with URL parsing, which in this case we'll handle with a regular expression:
import re
from bs4 import BeautifulSoup
data = """
<body>
    <a href="http://example.com/clubhouse?teamId=58">Roster</a>
</body>
"""
soup = BeautifulSoup(data, "html.parser")
link = soup.find("a", text="Roster")["href"]
print(re.search(r"teamId=(\d+)", link).group(1))
Prints 58.
I would recommend using BeautifulSoup or lxml, it's worth the learning curve.
...But if you still want to use regexp
re.findall(r'href="[^"]*teamId=(\d+)', string_to_match)
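Assuming the string_to_match shown in the question above, this pulls out the same number:
import re

string_to_match = '<a href="http://example.com/clubhouse?teamId=58">Roster</a>'
print(re.findall(r'href="[^"]*teamId=(\d+)', string_to_match))  # -> ['58']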
I am using BeautifulSoup to scrape many pages of a website for comments. Each page of this website has the comment "[[commentMessage]]". I want to filter out this string so it does not print every time the code runs. I'm very new to Python and BeautifulSoup, but I couldn't seem to find this after looking for a bit, though I may be searching for the wrong thing. Any suggestions? My code is below:
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('website url').read()
soup = BeautifulSoup(r, "html.parser")
comments = soup.find_all("div", class_="commentMessage")
for element in comments:
    print element.find("span").get_text()
All of the comments are in spans within divs of the class commentMessage, including the unnecessary comment "[[commentMessage]]".
A simple if should do:
for element in comments:
    text = element.find("span").get_text()
    if "[[commentMessage]]" not in text:
        print text
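Put together with the scraping code from the question, a minimal self-contained sketch (the HTML below is made up to mirror the structure described) would be:
from bs4 import BeautifulSoup

# made-up HTML mirroring spans inside divs of class commentMessage
html = '''
<div class="commentMessage"><span>[[commentMessage]]</span></div>
<div class="commentMessage"><span>A real comment</span></div>
'''

soup = BeautifulSoup(html, "html.parser")
for element in soup.find_all("div", class_="commentMessage"):
    text = element.find("span").get_text()
    if "[[commentMessage]]" not in text:
        print(text)  # -> A real comment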
So let's say I have the following base URL, http://example.com/Stuff/preview/v/{id}/fl/1/t/. There are a number of URLs with different {id}s on the page being parsed. I want to find all the links matching this template in an HTML page.
I can use XPath to just match a part of the template, //a[contains(@href, "preview/v")], or just use regexes, but I was wondering if anyone knew a more elegant way to match the entire template using XPath and regexes, so it's fast and the matches are definitely correct.
Thanks.
Edit: I timed it on a sample page. With my internet connection and 100 trials, the iteration takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds.
Also, if you have Scrapy, you can use its Selectors:
from requests import get
from scrapy.selector import Selector

data = get(url).text
sel = Selector(text=data, type="html")
a = sel.xpath(r'//a[re:test(@href, "/Stuff/preview/v/\d+/fl/1/t/")]/@href').extract()
The average time for this is also 0.467 seconds.
You cannot use regexes in XPath expressions with lxml, since lxml supports XPath 1.0 and XPath 1.0 doesn't support regular expression search.
Instead, you can find all the links on a page using iterlinks(), iterate over them, and check the href attribute value:
import re
import lxml.html

tree = lxml.html.fromstring(data)  # data holds the page HTML
pattern = re.compile(r"http://example.com/Stuff/preview/v/\d+/fl/1/t/")
for element, attribute, link, pos in tree.iterlinks():
    if not pattern.match(link):
        continue
    print link
An alternative option would be to use BeautifulSoup html parser:
import re
from bs4 import BeautifulSoup
data = "your html"
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"http://example.com/Stuff/preview/v/\d+/fl/1/t/")
print soup.find_all('a', {'href': pattern})
To make BeautifulSoup parsing faster you can let it use lxml:
soup = BeautifulSoup(data, "lxml")
Also, you can make use of the SoupStrainer class, which lets you parse only specific parts of a web page instead of the whole page.
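For instance, here is a minimal sketch reusing the placeholder data and pattern from above (SoupStrainer accepts the same kind of filters as find_all):
import re
from bs4 import BeautifulSoup, SoupStrainer

data = "your html"
pattern = re.compile(r"http://example.com/Stuff/preview/v/\d+/fl/1/t/")

# only <a> tags whose href matches the pattern are parsed; the rest of the page is skipped
only_links = SoupStrainer("a", href=pattern)
soup = BeautifulSoup(data, "lxml", parse_only=only_links)
print(soup.find_all("a"))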
Hope that helps.