I am performing some web scraping using Beautiful Soup in Python. How can I extract the class attribute of the <div> inside each <td> when the element contains no text? See the example I am working on. I'd like Beautiful Soup to give me the values mm_detail_N, mm_detail_N, mm_detail_SE.
<tr>
<td class="caption">Direction du vent</td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_SE" title="title.wind_SE"></div></center></td>
</tr>
I usually use the following command:
data = [i.get_text(strip=True) for i in soup.find_all("td", {"title": "title_of_the_td"})]
I have tried the following command:
data = [i.get_text(strip=True) for i in soup.find_all("div", {"title": "caption_of_the_td"})]
The command executes without errors, but the result is an empty list.
Any ideas?
Since you want to extract mm_detail_N, mm_detail_N, mm_detail_SE, you can select on the common part of the class attribute with the CSS selector div[class*="mm_detail"], then invoke the .get() method to pull that attribute's value, as follows:
html_doc = '''
<tr>
<td class="caption">Direction du vent</td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_SE" title="title.wind_SE"></div></center></td>
</tr>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
for td in soup.select('div[class*="mm_detail"]'):
    print(td.get('class'))
Output:
['mm_detail_N']
['mm_detail_N']
['mm_detail_SE']
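If you'd rather collect the class names into a list, as in the original data = [...] pattern from the question, the same selector works inside a comprehension. A small sketch using the built-in html.parser (the answer above used lxml; either works here):

```python
from bs4 import BeautifulSoup

html_doc = '''
<tr>
<td class="caption">Direction du vent</td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_N" title="title.wind_N"></div></center></td>
<td><center><div class="mm_detail_SE" title="title.wind_SE"></div></center></td>
</tr>
'''

soup = BeautifulSoup(html_doc, 'html.parser')
# .get('class') returns a list of classes; each div here has exactly one
data = [div.get('class')[0] for div in soup.select('div[class*="mm_detail"]')]
print(data)  # ['mm_detail_N', 'mm_detail_N', 'mm_detail_SE']
```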
The problem I am facing is simple. I am trying to get some data from a website, and there are two elements with the same class name, but each contains a table with different information. My code only outputs the content of the very first one. It looks like this:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find("tr", {"class": "table3"})
print(results.prettify())
How can I get the code to put out either the content of both tables or only the content of the second one?
Thanks for your answers in advance!
You can use .find_all() and index [1] to get the second result. Example:
from bs4 import BeautifulSoup
txt = """
<tr class="table3"> I don't want this </tr>
<tr class="table3"> I want this! </tr>
"""
soup = BeautifulSoup(txt, "html.parser")
results = soup.find_all("tr", class_="table3")
print(results[1]) # <-- get only second one
Prints:
<tr class="table3"> I want this! </tr>
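If you instead want the content of both matches, the same .find_all() call already returns every one; a minimal sketch looping over all of them:

```python
from bs4 import BeautifulSoup

txt = """
<tr class="table3"> I don't want this </tr>
<tr class="table3"> I want this! </tr>
"""

soup = BeautifulSoup(txt, "html.parser")
# Collect the text of every match instead of indexing a single one
texts = [tr.get_text(strip=True) for tr in soup.find_all("tr", class_="table3")]
print(texts)  # ["I don't want this", 'I want this!']
```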
In my previous question (How to speed up parsing using BeautifulSoup?), I asked how to crawl an HTML website more quickly, and the answer helped me a lot.
But I have encountered another problem: crawling ticket prices.
Following the answer to my previous question, I got JSON text from the webpage, and it contained almost all the festival information: title, date, location, poster image URL, and performers.
But there was no pricing info, so I tried to get the price from another part of the website.
When I turned on Google Chrome's developer mode, I found a table about pricing (it includes Korean, but you don't have to understand it):
<table cellpadding="0" cellspacing="0">
<colgroup>
<col>
<col style="width:20px;">
<col>
</colgroup>
<tbody id="divSalesPrice">
<tr>
<td>2일권(입장권)</td>
<td> </td>
<td class="costTd">
<span>140,000 원</span>
</td>
</tr>
<tr>
<td>1일권(입장권)</td>
<td> </td>
<td class="costTd">
<span>88,000 원</span>
</td>
</tr>
</tbody>
</table>
The numbers in the span tags (140,000 and 88,000) are the prices I want to extract. So I thought using Beautiful Soup would be effective:
from bs4 import BeautifulSoup
import requests

def Soup(content):
    soup = BeautifulSoup(content, 'lxml')
    return soup

def DetailLink(url):
    req = requests.get(url)
    soup = Soup(req.content)
    spans = soup.findAll('span', class_='fw_bold')
    links = [f'{url[:27]}{span.a["href"]}' for span in spans]
    return links

def Price():
    links = DetailLink('http://ticket.interpark.com/TPGoodsList.asp?Ca=Liv&SubCa=Fes')
    with requests.Session() as request:
        for link in links:
            req = request.get(link)
            soup = Soup(req.content)
            price = soup.find('tbody', id='divSalesPrice')
            print(price)

Price()
However, the result was disappointing...
<tbody id="divSalesPrice">
<!-- 등록된 기본가 가져오기 오류-->
<tr>
<td colspan="3" id="liBasicPrice">
<ul>
</ul>
</td>
</tr>
</tbody>
The comment '등록된 기본가 가져오기 오류' means 'an error occurred while getting the price'.
Does this mean the website operator has blocked other users from crawling the price info on this page?
OK, if we look carefully, the price data is not present when you request the page; it's loaded afterwards, which means we need to get the price data from somewhere else.
If you inspect the Network section in Chrome, there is this strange URL:
And it has the data you are looking for:
Now the only thing you need to do is get the place ID and the product ID. You can get these from the homepage, as you can see:
vPC is the place ID and vGC is the product ID; you can also get the product ID from the URL.
Then this code explains the rest:
import requests, re, json
# Just a random product url, you can adapt the code into yours.
url = "http://ticket.interpark.com/Ticket/Goods/GoodsInfo.asp?GroupCode=20002746"
data = requests.get(url).text
# I used regex to get the matching values `vGC` and `vPC`
vGC = re.search(r"var vGC = \"(\d+)\"", data).groups()[0]
vPC = re.search(r"var vPC = \"(\d+)\"", data).groups()[0]
# Notice that I placed placeholders to use `format`. Placeholders are `{}`.
priceUrl = "http://ticket.interpark.com/Ticket/Goods/GoodsInfoJSON.asp?Flag=SalesPrice&GoodsCode={}&PlaceCode={}"
# Looks like that url needs a referer url and that is the goods page, we will pass it as header.
lastData = requests.get(priceUrl.format(vGC, vPC), headers={"Referer": url}).text
# As the data is a javascript object but inside it is a json object,
# we can remove the callback and parse the inside of callback as json data:
lastData = re.search(r"^Callback\((.*)\);$", lastData).groups()[0]
lastData = json.loads(lastData)["JSON"]
print(lastData)
Output:
[{'DblDiscountOrNot': 'N',
'GoodsName': '뷰티풀 민트 라이프 2020 - 공식 티켓',
'PointDiscountAmt': '0',
'PriceGradeName': '입장권',
'SalesPrice': '140000',
'SeatGradeName': '2일권'},
{'DblDiscountOrNot': 'N',
'GoodsName': '뷰티풀 민트 라이프 2020 - 공식 티켓',
'PointDiscountAmt': '0',
'PriceGradeName': '입장권',
'SalesPrice': '88000',
'SeatGradeName': '1일권'}]
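Once lastData is parsed as above, it is just a list of dicts, so pulling out the prices is plain Python. A small sketch with the data hard-coded in the shape of the output above:

```python
# Sample shaped like the parsed "JSON" field printed above
lastData = [
    {'SalesPrice': '140000', 'SeatGradeName': '2일권', 'PriceGradeName': '입장권'},
    {'SalesPrice': '88000', 'SeatGradeName': '1일권', 'PriceGradeName': '입장권'},
]

# Map each ticket type to its price as an integer
prices = {item['SeatGradeName']: int(item['SalesPrice']) for item in lastData}
print(prices)  # {'2일권': 140000, '1일권': 88000}
```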
I am trying to read an HTML page and get some information from it.
In one of the lines, the information I need is inside an image's alt attribute, like so:
<img src='logo.jpg' alt='info i need'>
The problem is that, when parsing this, BeautifulSoup surrounds the contents of alt with double quotes instead of using the single quotes already present.
Because of this, the result is something like this:
<img alt="\'info" i="" need="" src="\'logo.jpg\'"/>
Currently, my code consists of this:
name = row.find("td", {"class": "logo"}).find("img")["alt"]
which should return "info i need" but is currently returning "\'info".
What could I be doing wrong?
Are there any settings I need to change for BeautifulSoup to parse this correctly?
Edit:
My code looks something like this (I used the standard html parser too, but no difference there):
import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup

def main():
    url = 'https://myhtml.html'
    with urllib.request.urlopen(url) as page:
        text = str(page.read())
        html = BeautifulSoup(page.read(), "lxml")
        table = html.find("table", {"id": "info_table"})
        rows = table.find_all("tr")
        for row in rows:
            if row.find("th") is not None:
                continue
            info = row.find("td", {"class": "logo"}).find("img")["alt"]
            print(info)

if __name__ == '__main__':
    main()
and the html:
<div class="table_container">
<table class="info_table" id="info_table">
<tr>
<th class="logo">Important infos</th>
<th class="useless">Other infos</th>
</tr>
<tr >
<td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>
<td class="useless">
<nobr>useless info</nobr>
</td>
</tr>
<tr >
<td class="logo"><img src='Logo2.jpg' alt='info i need too'><br></td>
<td class="useless">
<nobr>useless info</nobr>
</td>
</tr>
Sorry, I am unable to add a comment.
I have tested your case and for me the output seems correct.
HTML:
<html>
<body>
<td class="logo">
<img src='logo.jpg' alt='info i need'>
</td>
</body>
</html>
Python:
from bs4 import BeautifulSoup

with open("myhtml.html", "r") as html:
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.find("td", {"class": "logo"}).find("img")["alt"]
    print(name)
Returns:
info i need
I think your problem is an encoding problem while writing the file back to HTML.
Please provide the full code and further information:
your html
your Python code
Update:
I've tested your code, and it is not working at all :/
After reworking it, I was able to get the required output.
import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup

def main():
    url = 'https://code.mytesturl.net'
    with urllib.request.urlopen(url) as page:
        soup = BeautifulSoup(page, "html.parser")
        name = soup.find("td", {"class": "logo"}).find("img")["alt"]
        print(name)

if __name__ == '__main__':
    main()
Possible problems:
Maybe your parser should be html.parser
Which Python version / bs4 version are you using?
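Since the original question looped over several table rows, the reworked approach extends naturally to a loop. A sketch against a local snippet modelled on the HTML posted in the question (the structure is assumed from that example):

```python
from bs4 import BeautifulSoup

html = """
<table id="info_table">
<tr><th class="logo">Important infos</th></tr>
<tr><td class="logo"><img src='Logo.jpg' alt='info i need'></td></tr>
<tr><td class="logo"><img src='Logo2.jpg' alt='info i need too'></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
infos = []
for row in soup.find("table", id="info_table").find_all("tr"):
    if row.find("th") is not None:  # skip the header row
        continue
    infos.append(row.find("td", {"class": "logo"}).find("img")["alt"])
print(infos)  # ['info i need', 'info i need too']
```

Note that the single-quoted attributes are parsed fine; the quoting only looks different when BeautifulSoup re-serializes the tag.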
So here is my code:
import requests
from bs4 import BeautifulSoup
import lxml
r = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(r.text, "lxml")
tables = soup.find_all('table')
print(tables)
I had to do a POST request because it's an ASP page, and I had to pass the correct form data: looking in the College of Business for all tables from a specific semester. The problem is the output:
<tr class="tableback2"><td>Overall assessment of instructor</td><td align="right">0.0%</td><td align="right">56.8%</td><td align="right">27.0%</td><td align="right">13.5%</td><td align="right">2.7%</td><td align="right">0.0%</td> </tr>
</table>, <table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA 4721 </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>
I expected BeautifulSoup to be able to parse the text and return it, nice and neat, with each column separated. I would like to put it into a dataframe afterwards, or perhaps save it to a CSV file... But I have no idea how to get rid of all of these tags and attributes. I tried using this code, and it removed the ones specified, but td and tr didn't work:
for tag in soup():
    for attribute in ["class", "id", "name", "style", "td", "tr"]:
        del tag[attribute]
Then I tried a package called bleach, but when I passed the tables into it, it complained that the input must be text, so apparently I can't give it my table. This is ideally what I would like to see as my output.
So I'm truly at a loss here of how to format this in a proper way. Any help is much appreciated.
Give this a try. I suppose this is what you expected. Btw, if there is more than one table on that page and you want a different one, tweak the index, as in soup.select('table')[n]. Thanks.
import requests
from bs4 import BeautifulSoup
res = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(res.text, "lxml")
tables = soup.select('table')[0]
list_items = [[items.text.replace("\xa0", "") for items in list_item.select("td")]
              for list_item in tables.select("tr")]
for data in list_items:
    print(' '.join(data))
Partial results:
Term: 1175 - Summer 2017
Instructor Name: Elias, Desiree Department: SCHACCOUNT
Course: ACG 2021 Section: RVCC-1 Title: ACC Decisions
Enrolled: 118 Ref#: 51914 -1 Completed Forms: 36
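Since the goal was a dataframe or CSV file, the nested list list_items from the snippet above can be written out directly. A minimal sketch using the standard csv module, with sample rows shaped like the partial results (pandas.DataFrame(list_items) would work just as well):

```python
import csv
import io

# Sample rows shaped like list_items from the scraping snippet above
list_items = [
    ['Term: 1175 - Summer 2017'],
    ['Instructor Name: Elias, Desiree', 'Department: SCHACCOUNT'],
    ['Enrolled: 118', 'Ref#: 51914 -1', 'Completed Forms: 36'],
]

# Write the rows as CSV; swap io.StringIO for open("out.csv", "w", newline="")
# to produce an actual file
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(list_items)
print(buf.getvalue())
```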
I would like to scrape the table from HTML code using BeautifulSoup. A snippet of the HTML is shown below. When using table.findAll('tr') I get the entire table and not only the rows (probably because the closing tags are missing from the HTML code?).
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my Python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.text)
As stated in the documentation, html5lib parses the document the same way a web browser does (and lxml behaves similarly in this case): it tries to fix your document tree by adding/closing tags when needed.
For your example I've used lxml as the parser, and it gave the following result:
soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.get_text(strip=True))
Note that lxml added the html & body tags because they weren't present in the source (it tries to create a well-formed document, as previously stated).
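You can see this difference between parsers by checking whether the fragment gets wrapped in html/body tags. A small sketch using only the built-in html.parser, which leaves the fragment as-is (lxml or html5lib, if installed, would add those tags and close the stray cells):

```python
from bs4 import BeautifulSoup

# Unclosed <TD> cells, like in the snippet above
data = "<TABLE><TR><TD><B>Artikelbezeichnung</B><TD><B>Anbieter</B></TABLE>"

soup = BeautifulSoup(data, "html.parser")
# html.parser does not wrap the fragment in <html>/<body>
print(soup.find("html"))       # None
# Both cells are still findable regardless of how nesting was repaired
print(len(soup.find_all("td")))  # 2
```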