I am trying to read an HTML page and extract some information from it.
In one of the lines, the information I need is inside an image's alt attribute, like so:
<img src='logo.jpg' alt='info i need'>
The problem is that, when parsing this, Beautiful Soup surrounds the contents of alt with double quotes instead of using the single quotes already present.
Because of this, the result is something like this:
<img alt="\'info" i="" need="" src="\'logo.jpg\'"/>
Currently, my code is this:
name = row.find("td", {"class": "logo"}).find("img")["alt"]
This should return "info i need" but currently returns "\'info".
What am I doing wrong?
Is there a setting I need to change so that Beautiful Soup parses this correctly?
Edit:
My code looks something like this (I also tried the standard HTML parser, but it made no difference):
import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup
def main():
    url = 'https://myhtml.html'
    with urllib.request.urlopen(url) as page:
        text = str(page.read())
        html = BeautifulSoup(page.read(), "lxml")
        table = html.find("table", {"id": "info_table"})
        rows = table.find_all("tr")
        for row in rows:
            if row.find("th") is not None:
                continue
            info = row.find("td", {"class": "logo"}).find("img")["alt"]
            print(info)
if __name__ == '__main__':
    main()
And the HTML:
<div class="table_container">
<table class="info_table" id="info_table">
<tr>
<th class="logo">Important infos</th>
<th class="useless">Other infos</th>
</tr>
<tr >
<td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>
<td class="useless">
<nobr>useless info</nobr>
</td>
</tr>
<tr >
<td class="logo"><img src='Logo2.jpg' alt='info i need too'><br></td>
<td class="useless">
<nobr>useless info</nobr>
</td>
</tr>
Sorry, I am unable to add a comment.
I have tested your case and for me the output seems correct.
HTML:
<html>
<body>
<td class="logo">
<img src='logo.jpg' alt='info i need'>
</td>
</body>
</html>
Python:
from bs4 import BeautifulSoup
with open("myhtml.html", "r") as html:
    soup = BeautifulSoup(html, 'html.parser')
name = soup.find("td", {"class": "logo"}).find("img")["alt"]
print(name)
Returns:
info i need
I think your problem is an encoding problem that occurs when the file is written back to HTML.
Please provide the full code and further information: the HTML and your Python code.
Update:
I've tested your code, and it does not work at all :/
After reworking it, I was able to get the required output.
import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup
def main():
    url = 'https://code.mytesturl.net'
    with urllib.request.urlopen(url) as page:
        soup = BeautifulSoup(page, "html.parser")
    name = soup.find("td", {"class": "logo"}).find("img")["alt"]
    print(name)
if __name__ == '__main__':
    main()
Possible problems:
Maybe your parser should be html.parser
Which Python version and Beautiful Soup version are you using?
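One more plausible explanation for the stray \' characters, assuming the real script handed the result of str(page.read()) to BeautifulSoup: in Python 3, calling str() on a bytes object returns its repr (b'...'), in which single quotes can appear backslash-escaped. The parser then sees alt=\'info i need\' as an unquoted attribute value that ends at the first space. A minimal sketch of that effect, using a made-up one-cell snippet and assuming the page is UTF-8 encoded:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the bytes returned by page.read().
raw = b"""<td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>"""

# str() on bytes gives the repr (b'...'), not the decoded markup.
as_repr = str(raw)
# Decoding gives the actual markup (UTF-8 is an assumption here).
decoded = raw.decode("utf-8")

bad = BeautifulSoup(as_repr, "html.parser").find("img")["alt"]
good = BeautifulSoup(decoded, "html.parser").find("img")["alt"]

print(repr(bad))  # a truncated value, not the full alt text
print(good)       # info i need
```

Passing the bytes (or the response object) directly to BeautifulSoup, as in the reworked code above, avoids the problem because no str() conversion is involved.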
Related
I am doing some web scraping using Beautiful Soup in Python. How can I extract the class information inside a <td> when there is no text provided? See the example I am working on. I'd like Beautiful Soup to give me the text mm_detail_N, mm_detail_N, mm_detail_SE.
<tr>
<td class="caption">Direction du vent</td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_SE" title="title.wind_SE"></div></center></td>
</tr>
I usually use the following command
data = [i.get_text(strip=True) for i in soup.find_all("td", {"title": "title_of_the_td"})]
I have tried the following commands:
data = [i.get_text(strip=True) for i in soup.find_all("div", {"title": "caption_of_the_td"})]
The command executes properly, but the result is empty.
Any ideas?
As you mentioned above, you would like to extract mm_detail_N, mm_detail_N, mm_detail_SE. You can select the divs that share the common class prefix with div[class*="mm_detail"], then call the .get() method to pull out the class value (returned as a list), as follows:
html_doc = '''
<tr>
<td class="caption">Direction du vent</td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_SE" title="title.wind_SE"></div></center></td>
</tr>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
for td in soup.select('div[class*="mm_detail"]'):
    print(td.get('class'))
Output:
['mm_detail_N']
['mm_detail_N']
['mm_detail_SE']
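Because .get('class') returns a list (Beautiful Soup treats class as a multi-valued attribute), the plain strings asked for in the question can be collected by taking the first entry of each list. The get_text() attempts come back empty simply because these divs contain no text. A small sketch along those lines, using html.parser instead of lxml only to avoid the extra dependency:

```python
from bs4 import BeautifulSoup

html_doc = """
<tr>
<td class="caption">Direction du vent</td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_SE" title="title.wind_SE"></div></center></td>
</tr>
"""

soup = BeautifulSoup(html_doc, "html.parser")
divs = soup.select('div[class*="mm_detail"]')

# Each div carries exactly one class here, so take the first list entry.
classes = [div["class"][0] for div in divs]
# The divs are empty, so their text is an empty string.
texts = [div.get_text(strip=True) for div in divs]

print(classes)  # ['mm_detail_N', 'mm_detail_N', 'mm_detail_SE']
print(texts)    # ['', '', '']
```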
The problem I am facing is simple. I am trying to get some data from a website where two elements have the same class name, but each contains a table with different information. The code I have only outputs the content of the first one. It looks like this:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find("tr", {"class": "table3"})
print(results.prettify())
How can I get the code to output either the content of both tables or only the content of the second one?
Thanks for your answers in advance!
You can use .find_all() and index [1] to get the second result. Example:
from bs4 import BeautifulSoup
txt = """
<tr class="table3"> I don't want this </tr>
<tr class="table3"> I want this! </tr>
"""
soup = BeautifulSoup(txt, "html.parser")
results = soup.find_all("tr", class_="table3")
print(results[1]) # <-- get only second one
Prints:
<tr class="table3"> I want this! </tr>
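For the other half of the question (the content of both tables), the list returned by .find_all() can simply be iterated. A short sketch with made-up markup; the length check is an added precaution for pages where the second match might be missing:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page.
txt = """
<table>
<tr class="table3"><td>first table row</td></tr>
<tr class="table3"><td>second table row</td></tr>
</table>
"""

soup = BeautifulSoup(txt, "html.parser")
results = soup.find_all("tr", class_="table3")

# Content of every match, in document order.
all_text = [row.get_text(strip=True) for row in results]

# Only the second match, guarded in case there are fewer than two.
second = results[1].get_text(strip=True) if len(results) > 1 else None

print(all_text)  # ['first table row', 'second table row']
print(second)    # second table row
```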
As I've recently started learning web scraping, I thought I would try to parse an HTML table from this site using the requests and bs4 modules.
I know I need to access the td elements inside tbody; at least, that is how the page looks in the browser.
When I try, though, it doesn't work properly: it only captures the td elements from thead and not from tbody. Hence, I cannot capture anything but the table headers.
I assume it has something to do with the requests module.
import requests
url = 'https://vstup.edbo.gov.ua/statistics/requests-by-university/?qualification=1&education-base=40'
r = requests.get(url)
print(r.text)
The result is as follows (pasting table-related part):
<table id="stats">
<caption></caption>
<thead>
<tr>
<td class="region">Регіон</td>
<td class="university">Назва закладу</td>
<td class="speciality">Спеціальність (спеціалізація)</td>
<td class="average-ball number" title="Середній конкурсний бал">СКБ</td>
<td class="requests-total number">Усього заяв</td>
<td class="requests-budget number">Заяв на бюджет</td>
</tr>
</thead>
<tbody></tbody>
</table>
So the tbody elements are missing in my response object, while they are present in the code of the web page. What am I doing wrong?
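Holdenweb suggested trying Selenium, and everything worked. The tbody rows are presumably filled in by JavaScript after the page loads, which requests does not execute, whereas a real browser does.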
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://vstup.edbo.gov.ua/statistics/requests-by-university/?qualification=1&education-base=40'
browser = webdriver.Firefox(executable_path=r'D:/folder/geckodriver.exe')
browser.get(url)
html = browser.page_source
After that, I used BeautifulSoup and managed to parse the web page.
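The parsing step itself is not shown. A minimal sketch of what it might look like, assuming the rendered page fills the tbody of the stats table with one tr per row; the sample string is a made-up stand-in for browser.page_source, and its row values are placeholders rather than real data:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for browser.page_source; values are placeholders.
html = """
<table id="stats">
<thead><tr>
<td class="region">Регіон</td>
<td class="requests-total number">Усього заяв</td>
</tr></thead>
<tbody>
<tr><td class="region">Region A</td><td class="requests-total number">10</td></tr>
<tr><td class="region">Region B</td><td class="requests-total number">20</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
body = soup.find("table", id="stats").find("tbody")

# One list of cell texts per body row; the header row in thead is skipped.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in body.find_all("tr")
]

print(rows)  # [['Region A', '10'], ['Region B', '20']]
```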
I am struggling to get the data I want, and I am sure it's very simple if you know how to use BS. I have been trying to get this right for hours, to no avail, after reading the docs.
Currently my code outputs this in Python:
[<td>0.32%</td>, <td><span class="neg color ">>-0.01</span></td>, <td>0.29%</td>, <td>0.38%</td>, <td><span class="neu">0.00</span></td>]
How would I isolate just the content of the td tags that do not contain span tags?
That is, I would like to see only 0.32%, 0.29%, 0.38%.
Thank you.
import urllib2
from bs4 import BeautifulSoup
fturl = 'http://markets.ft.com/research/Markets/Bonds'
ftcontent = urllib2.urlopen(fturl).read()
soup = BeautifulSoup(ftcontent)
ftdata = soup.find(name="div", attrs={'class':'wsodModuleContent'}).find_all(name="td", attrs={'class':''})
Is this an OK solution for you?
html_txt = """<td>0.32%</td>, <td><span class="neg color">
>-0.01</span></td>, <td>0.29%</td>, <td>0.38%</td>,
<td><span class="neu">0.00</span></td>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_txt)
print [tag.text for tag in soup.find_all('td') if tag.text.strip().endswith("%")]
output is:
[u'0.32%', u'0.29%', u'0.38%']
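The filter above relies on the wanted values ending in "%". An alternative sketch that matches the question's wording more literally (keep only td tags with no child tags) is to test td.find(True), which returns None when a tag has no descendant tags. This version is written for Python 3 and bs4, and the table wrapper around the cells is added here for illustration:

```python
from bs4 import BeautifulSoup

html_txt = """
<table><tr>
<td>0.32%</td>
<td><span class="neg color">-0.01</span></td>
<td>0.29%</td>
<td>0.38%</td>
<td><span class="neu">0.00</span></td>
</tr></table>
"""

soup = BeautifulSoup(html_txt, "html.parser")

# find(True) matches any tag, so None means the td holds only text.
values = [
    td.get_text(strip=True)
    for td in soup.find_all("td")
    if td.find(True) is None
]

print(values)  # ['0.32%', '0.29%', '0.38%']
```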
Here is the part of the HTML:
<td class="team-name">
<div class="goat_australia"></div>
Melbourne<br />
Today
</td>
<td class="team-name">
<div class="goat_australia"></div>
Sydney<br />
Tomorrow
</td>
So I would like to return all the td tags with the class name "team-name", but only those that contain the text "Today".
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
starting_url = urllib2.urlopen('http://www.mysite.com.au/').read()
soup = BeautifulSoup(''.join(starting_url))
soup2 = soup.findAll("td", {'class':'team-name'})
for entry in soup2:
    if "Today" in soup2:
        print entry
If I run this, nothing is returned.
If I take out that last if statement and just put
print soup2
I get back all the td tags, but some have "Today" and some have "Tomorrow", etc.
So, any pointers? Is there a way to pass two criteria to the soup.findAll function?
I also tried running a findAll on the result of a findAll, but that did not work.
Using the structure of the code you've got currently, try looking for "Today" with an embedded findAll:
for entry in soup2:
    if entry.findAll(text=re.compile("Today")):
        print entry
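The code in this thread targets Python 2 and BeautifulSoup 3. For reference, a rough equivalent for Python 3 and bs4 might look like the sketch below: filter by class first, then test each cell's text. The table wrapper around the cells and the space separator passed to get_text() are choices made here for illustration, not something the original answer specifies.

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td class="team-name">
<div class="goat_australia"></div>
Melbourne<br />
Today
</td>
<td class="team-name">
<div class="goat_australia"></div>
Sydney<br />
Tomorrow
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td", class_="team-name")

# Keep only the cells whose text mentions "Today".
today = [td for td in cells if "Today" in td.get_text()]
names = [td.get_text(" ", strip=True) for td in today]

print(names)  # ['Melbourne Today']
```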