I am using beautifulsoup4 in Python to update a table on a Confluence page. When I use the soup.append(str) function, the < and > characters are escaped and become &lt; and &gt;, so I cannot update the table correctly. Could someone give me tips, or maybe a better solution for updating the table on the Confluence page? Thanks in advance. :)
What I expect:
<tr>"string"</tr>
What I actually get (the angle brackets come back entity-escaped):
&lt;tr&gt;"string"&lt;/tr&gt;
That happens because you append a plain string to another tag, so BeautifulSoup escapes it as text. The solution is to create a new tag with new_tag(), or to build a whole new soup from the fragment and append that.
For example:
txt = '''
<div>To this tag I append other tag</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
soup.find('div').append( BeautifulSoup('<tr>string</tr>', 'html.parser') )
print(soup.prettify())
Prints:
<div>
 To this tag I append other tag
 <tr>
  string
 </tr>
</div>
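For completeness, here is a minimal sketch of the new_tag() route mentioned above (the tag name and text are just placeholders):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>To this tag I append other tag</div>', 'html.parser')

new_tr = soup.new_tag('tr')   # build a real Tag object instead of a plain string
new_tr.string = 'string'      # so nothing gets entity-escaped on append

soup.find('div').append(new_tr)
print(soup.prettify())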
The problem that I am facing is simple: I am trying to get some data from a website where two elements share the same class name, but each contains a table with different information. The code I have only outputs the content of the very first one. It looks like this:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find("tr", {"class": "table3"})
print(results.prettify())
How can I get the code to put out either the content of both tables or only the content of the second one?
Thanks for your answers in advance!
You can use .find_all() and index [1] to get the second result. Example:
from bs4 import BeautifulSoup
txt = """
<tr class="table3"> I don't want this </tr>
<tr class="table3"> I want this! </tr>
"""
soup = BeautifulSoup(txt, "html.parser")
results = soup.find_all("tr", class_="table3")
print(results[1]) # <-- get only second one
Prints:
<tr class="table3"> I want this! </tr>
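If you want the content of both tables rather than just the second one, you can simply loop over the find_all() result; a minimal sketch:
for tr in soup.find_all("tr", class_="table3"):
    print(tr.get_text(strip=True))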
I am trying to scrape the tables from the following page:
https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml
When I reach the HTML for the batting tables, I encounter a very long comment which contains the HTML for the table:
<div id="all_WashingtonSenatorsbatting" class="table_wrapper table_controls">
<div class="section_heading">
<div class="section_heading_text">
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
.....
-->
<div class="table_outer_container mobile_table">
<div class="footer no_hide_long">
The last two divs are what I am interested in scraping, and everything between the <!-- and the --> is a comment which happens to contain a copy of the table in the table_outer_container div below.
The problem is that when I read the page source into BeautifulSoup, it will not read anything after the comment within the table_wrapper div that contains everything. The following code illustrates the problem:
batting = page_source.find('div', {'id':'all_WashingtonSenatorsbatting'})
divs = batting.find_all('div')
len(divs)
gives me
Out[1]: 3
When there are obviously 5 div children under the div id="all_WashingtonSenatorsbatting" element.
Even when I extract the comment using
from bs4 import Comment
for comments in soup.findAll(text=lambda text:isinstance(text, Comment)):
comments.extract()
The resulting soup still doesn't contain the last two div elements I want to scrape. I have been playing with the code using regular expressions, but so far no luck. Any suggestions?
I found a workable solution. Using the following code I extract the comment (which brings with it the last two div elements I wanted to scrape), process it again in BeautifulSoup, and scrape the table:
import requests
from bs4 import BeautifulSoup, Comment

s = requests.get(url).content
soup = BeautifulSoup(s, "html.parser")
wrapper = soup.find_all('div', {'class': 'table_wrapper'})[0]
# calling the tag is shorthand for find_all(); this pulls the comment node out of the wrapper
comment = wrapper(text=lambda x: isinstance(x, Comment))[0]
newsoup = BeautifulSoup(comment, 'html.parser')
table = newsoup.find('table')
It took me a while to get to this, and I would be interested to see if anyone comes up with other solutions or can offer an explanation of how this problem came about.
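For anyone hitting the same issue, here is a slightly more general sketch along the same lines (the URL and class names are the ones from the question; it simply repeats the comment-reparsing trick for every table_wrapper div):
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

tables = []
for wrapper in soup.find_all('div', class_='table_wrapper'):
    # the full table markup is hidden inside an HTML comment in each wrapper
    for comment in wrapper.find_all(text=lambda x: isinstance(x, Comment)):
        inner = BeautifulSoup(comment, 'html.parser')
        table = inner.find('table')
        if table is not None:
            tables.append(table)

print(len(tables))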
I would like to scrape the table from html code using beautifulsoup. A snippet of the html is shown below. When using table.findAll('tr') I get the entire table and not only the rows. (probably because the closing tags are missing from the html code?)
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.text)
As stated in its documentation, html5lib parses the document the same way a web browser does (as lxml does in this case). It will try to fix your document tree by adding/closing tags when needed.
In your example I've used lxml as the parser and it gave the following result:
soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.get_text(strip=True))
Note that lxml added html & body tags because they weren't present in the source (it will try to create a well-formed document, as previously stated).
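If you would rather use html5lib than lxml, the same code works once the html5lib package is installed; a minimal sketch, reusing the data variable from the question:
# pip install html5lib
soup = BeautifulSoup(data, "html5lib")   # html5lib repairs the missing closing tags
rows = soup.find("table").find_all("tr")
for tr in rows:
    print(tr.get_text(strip=True))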
So I am trying to scrape an HTML page using BeautifulSoup, but I am having problems finding a tag id using Python 3.4. I know what the tag ("tr") is, but the id is constantly changing and I would like to save the id when it changes. For example:
<div class="thisclass">
  <table id="thistable">
    <tbody>
      <tr id="what i want">
        <td class="someinfo"></td>
      </tr>
    </tbody>
  </table>
</div>
I can find the div tag and the table, and I know the tr tag is there, but I want to extract the text next to id, without knowing what the text is going to say.
so far I have this code:
soup = BeautifulSoup(url.read())
divTag = soup.find_all("table", id="thistable")
i = 0
for i in divTag:
    trtag = soup.find("tr", id)
    print(trtag)
    i = i + 1
If anyone could help me solve this problem, I would appreciate it.
You can use a css selector:
print([element.get('id') for element in soup.select('table#thistable tr[id]')])
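Put together with the HTML from the question, a minimal sketch looks like this:
from bs4 import BeautifulSoup

html = """
<div class="thisclass">
  <table id="thistable">
    <tbody>
      <tr id="what i want">
        <td class="someinfo"></td>
      </tr>
    </tbody>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
print([element.get('id') for element in soup.select('table#thistable tr[id]')])
# ['what i want']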
Here is the part of the HTML:
<td class="team-name">
<div class="goat_australia"></div>
Melbourne<br />
Today
</td>
<td class="team-name">
<div class="goat_australia"></div>
Sydney<br />
Tomorrow
</td>
So I would like to return all these td tags with the class name "team-name", but only if they contain the text "Today".
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
starting_url = urllib2.urlopen('http://www.mysite.com.au/').read()
soup = BeautifulSoup(''.join(starting_url))
soup2 = soup.findAll("td", {'class':'team-name'})
for entry in soup2:
    if "Today" in soup2:
        print entry
If I run this, nothing returns.
If I take out that last if statement and just put
print soup2
I get back all the td tags, but some have "Today" and some have "Tomorrow" etc.
So any pointers? Is there a way to add two attributes to the soup.findAll function?
I also tried running a findAll on a findAll, but that did not work.
Using the structure of the code you've got currently, try looking for "Today" with an embedded findAll:
for entry in soup2:
    if entry.findAll(text=re.compile("Today")):
        print entry
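If you are on the current bs4 package rather than the old BeautifulSoup 3 import, the same idea looks roughly like this (a minimal sketch; the HTML is just the snippet from the question):
import re
from bs4 import BeautifulSoup

html = """
<td class="team-name"><div class="goat_australia"></div>Melbourne<br />Today</td>
<td class="team-name"><div class="goat_australia"></div>Sydney<br />Tomorrow</td>
"""
soup = BeautifulSoup(html, "html.parser")

for entry in soup.find_all("td", class_="team-name"):
    # keep only the cells whose text contains "Today"
    if entry.find(string=re.compile("Today")):
        print(entry)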