How to modify HTML table content using BeautifulSoup - Python

Hi, I'm trying to update the content of one of the tables on a Confluence page using BeautifulSoup and API requests.
This is my code. I'm able to find and update the td, but I couldn't insert the updated td back into the soup variable.
import requests
from bs4 import BeautifulSoup

content = requests.get(address, headers=headers).text
soup = BeautifulSoup(content, 'html.parser')
for td in soup.find_all('td'):
    if td == "<td> i need to update this </td>":
        td.replace_with("<td>updated</td>")
I need the updated td to be inserted into the soup variable, so that when I run soup.find_all('td') I find "updated" instead of "i need to update this".
How can I do that?
Thanks

Note: Do not use regular expressions for parsing HTML. I dearly hope there is an alternative method for solving this problem, but, alas, I could not think of one.
Using the following regular expression, you can find every td element that has no attributes attached. This works for locating the elements, but don't try it for actually extracting information from them:
<td>.*?</td>
You can then use the Pattern.sub() method to substitute every instance of a td element with <td>updated</td>.
import re

import bs4
import requests

content = requests.get(address, headers=headers).text

# substitute every attribute-less <td>...</td> in the raw markup, then re-parse
regex = re.compile(r"<td>.*?</td>")
new_content = regex.sub("<td>updated</td>", content)
soup = bs4.BeautifulSoup(new_content, features="html.parser")
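For what it's worth, the replacement can also be done inside BeautifulSoup itself, without touching the raw markup; a minimal sketch using new_tag() and replace_with() (matching cells on their stripped text is an assumption about what the asker wants):
import requests
from bs4 import BeautifulSoup

content = requests.get(address, headers=headers).text  # address/headers as in the question
soup = BeautifulSoup(content, 'html.parser')

for td in soup.find_all('td'):
    # compare the cell's text, not the Tag object itself
    if td.get_text(strip=True) == "i need to update this":
        new_td = soup.new_tag('td')   # build a real <td> element
        new_td.string = "updated"
        td.replace_with(new_td)       # the change is reflected in soup

print(soup.find_all('td'))            # now shows <td>updated</td>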

Related

Extracting Sub-tag text with BeautifulSoup Issue

I am having an issue with some code that I am running. The goal is to extract, and ultimately create a list of, the names that appear on a website, capturing names in markup like the following:
<th class="left " data-append-csv="David-Cornell" data-stat="player" scope="row">David Cornell</th>
Now, I've already created code to capture all of these instances; however, when I use find on the results to capture the next tag, I get errors. I suspect there is a way for me to just parse the raw text instead, but that would be rather a lot of work for the purpose, especially when there are a lot of differing pages.
from bs4 import BeautifulSoup as bsoup
import requests as reqs
page = reqs.get("https://fbref.com/en/squads/986a26c1/Northampton-Town")
parsepage = bsoup(page.content, 'html.parser')
findplayers = parsepage.find_all('th',attrs={"data-stat":"player"}).find_next('a')
print(findplayers)
So I can't for the life of me capture the next tag. I've tried a series of iterations, and the error I get when running this is the following:
AttributeError: ResultSet object has no attribute 'find_next'. You're
probably treating a list of items like a single item. Did you call
find_all() when you meant to call find()?
How do I solve this problem?
find_all() gives a ResultSet (a list) with many elements, and you have to use find_next() on every element separately, so you have to use a for loop:
from bs4 import BeautifulSoup as bsoup
import requests as reqs

page = reqs.get("https://fbref.com/en/squads/986a26c1/Northampton-Town")
parsepage = bsoup(page.content, 'html.parser')

findplayers = parsepage.find_all('th', attrs={"data-stat": "player"})
for item in findplayers:
    print(item.find_next('a'))
You could also alter your selector and use select to do the following:
players = [item.text for item in parsepage.select('#stats_player tbody th')]
The names are all in the th cells of the table body (tbody) of the table with id stats_player.
Or alternatively:
#stats_player th.left a
These are slightly faster than the alternatives that use attributes, such as:
#stats_player [data-append-csv]
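For example, the alternative selector plugs into the same list-comprehension pattern; a sketch, reusing parsepage from above:
# each <a> inside a left-aligned header cell holds a player name
players = [a.text for a in parsepage.select('#stats_player th.left a')]
print(players)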

Beautiful Soup: parse URL from messy output

I have Beautiful Soup code that looks like:
for item in beautifulSoupObj.find_all('cite'):
    pagelink.append(item.get_text())
The problem is, the HTML I'm trying to parse looks like:
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
My current selector above would get everything, including the strong tags inside it.
Thus, how can I parse only:
https://www.websiteurl.com/id=6
Note that <cite> appears multiple times throughout the page, and I want to extract and print all of them.
Thank you.
Extracting only the text portion is as easy as using .text on the object.
We can use basic BeautifulSoup methods to traverse the tree hierarchy.
from bs4 import BeautifulSoup
html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.cite.text)
# is the same as soup.find('cite').text
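Since <cite> appears multiple times on the real page, the same idea extends to every match; a minimal sketch:
for cite in soup.find_all('cite'):
    print(cite.get_text())  # nested tags are stripped, leaving only the text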

Scrapy/Python web table missing closing TR / TD Tags

I'm redoing a data scraping project. There's a website with a table of data that is missing most or all of its closing TR and TD tags. When I first did the project with JS, I just copied the site's markup and split the data into arrays of rows whenever a new <tr> tag was encountered.
I want to try to rebuild this project using Python/Scrapy, and I'm wondering if there is an easier way to access the data using selectors. I'm also a little confused about how to split the data, since response.data.split('<tr>') doesn't work.
I understand your problem. You can use Beautiful Soup's select method to query the table successfully. I made a demo for you; I hope this helps.
import requests
from bs4 import BeautifulSoup

url = 'http://killedbypolice.net/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# BeautifulSoup builds a well-formed tree even though the source omits closing tags,
# so the rows are selectable
rows = soup.select('table tr')
print(soup.select('table')[0])
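From there you can pull the cell text out row by row; a minimal sketch, assuming the rows of interest are in the page's tables:
for row in soup.select('table tr'):
    # collect every cell, whether or not its closing tag appeared in the source
    cells = [cell.get_text(strip=True) for cell in row.select('td, th')]
    print(cells)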

Can't parse data from `th` tag along with `td` tag from different tables

I've written a script in Python using XPath to parse tabular data from a webpage. Upon execution, it parses the data from the tables flawlessly. The only thing I can't fix is parsing the table headers, meaning the th tags. If I were doing the same with CSS selectors, I could have used .cssselect("th,td"), but with XPath I'm stuck. Any help on how to parse the data from the th tags as well will be highly appreciated.
Here is the script, which is able to fetch everything from the different tables except for the data within th tags:
import requests
from lxml.html import fromstring

response = requests.get("https://fantasy.premierleague.com/player-list/")
tree = fromstring(response.text)
for row in tree.xpath("//*[@class='ism-table']//tr"):
    tab_d = row.xpath('.//td/text()')
    print(tab_d)
I'm not sure I get your point, but if you want to fetch both th and td nodes with a single XPath, you can try replacing
tab_d = row.xpath('.//td/text()')
with
tab_d = row.xpath('.//*[name()="th" or name()="td"]/text()')
Change
.//td/text()
to
.//*[self::td or self::th]/text()
to include th elements too.
Note that it would be reasonable to assume that both td and th are immediate children of the tr context node, so you might further simplify your XPath to this:
*[self::td or self::th]/text()
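Putting the corrected expression back into the original loop; a minimal sketch:
for row in tree.xpath("//*[@class='ism-table']//tr"):
    # header rows now contribute their th text alongside the td cells
    tab_d = row.xpath('*[self::td or self::th]/text()')
    print(tab_d)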

How do I select only DIVs with a similar ID

I am parsing a poorly designed web page using beautiful soup.
At the moment, what I need is to select the comment section of the web page, but each comment is treated as a DIV, and each only has an ID like "IAMCOMMENT_00001". No class (that would have helped a lot).
So I am forced to search for all DIVs whose ID starts with "IAMCOMMENT", but I can't figure out how to do this. The closest thing I could find is SoupStrainer, but I couldn't understand how to use it.
How would I be able to achieve this?
I would use BeautifulSoup's built-in find_all function:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(yourhtml, 'html.parser')
soup.find_all('div', id=re.compile(r'^IAMCOMMENT_'))
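A quick usage sketch (yourhtml stands in for the page's markup):
for div in soup.find_all('div', id=re.compile(r'^IAMCOMMENT_')):
    print(div['id'], div.get_text(strip=True))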
If you want to parse divs from inside HTML comments, you first need to find the comments in your HTML. A way to do this:
import re
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(myhtml, 'html.parser')
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
To find the divs inside a comment:
for comment in comments:
    cmnt_soup = BeautifulSoup(comment, 'html.parser')
    divs = cmnt_soup.find_all('div', attrs={"id": re.compile(r'IAMCOMMENT_\d+')})
    # do things with the divs
