Get the content of multiple classes when scraping a website - python

The problem that I am facing is simple. If I am trying to get some data from a website, there are two classes with the same name. But they both contain a table with different Information. The code that I have only outputs me the content of the very first class. It looks like this:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find("tr", {"class": "table3"})
print(results.prettify())
How can I get the code to put out either the content of both tables or only the content of the second one?
Thanks for your answers in advance!

You can use .find_all() and [1] to get second result. Example:
from bs4 import BeautifulSoup
txt = """
<tr class="table3"> I don't want this </tr>
<tr class="table3"> I want this! </tr>
"""
soup = BeautifulSoup(txt, "html.parser")
results = soup.find_all("tr", class_="table3")
print(results[1]) # <-- get only second one
Prints:
<tr class="table3"> I want this! </tr>

Related

escape problem with beautifulsoup in python

I am using the beautifulsoup4 in python in order to update the table on the confluence page when I use the soup.append(str) function, the < > were escaped and became < > so that i can not update the table correctly. Could someone give me tips? Or maybe some better solution for update the table on the confluence page, thanks in advance. :)
what i expect:
<tr>"string"</tr>
what the result:
<tr >"string"< tr>
That happens because you append string to other tag. The solution is to create new_tag() or whole new soup.
For example:
txt = '''
<div>To this tag I append other tag</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
soup.find('div').append( BeautifulSoup('<tr>string</tr>', 'html.parser') )
print(soup.prettify())
Prints:
<div>
To this tag I append other tag
<tr>
string
</tr>
</div>

BeautifulSoup4 not able to scrape data from this table

Sorry for this silly question as I'm new to web scraping and have no knowledge about HTML etc.
I'm trying to scrape data from this website. Specifically, from this part/table of the page:
末"四"位数 9775,2275,4775,7275
末"五"位数 03881,23881,43881,63881,83881,16913,66913
末"六"位数 313110,563110,813110,063110
末"七"位数 4210962,9210962,9785582
末"八"位数 63262036
末"九"位数 080876872
I'm sorry that's in Chinese and it looks terrible since I can't embed the picture. However, The table is roughly in the middle(40 percentile from the top) of the page. The table id is 'tr_zqh'.
Here is my source code:
import bs4 as bs
import urllib.request
def scrapezqh(url):
source = urllib.request.urlopen(url).read()
page = bs.BeautifulSoup(source, 'html.parser')
print(page)
url = 'http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1'
print(scrapezqh(url))
It scrapes most of the table but the part that I'm interested in. Here is a part of what it returns where I think the data should be:
<td class="tdcolor">网下有效申购股数(万股)
</td>
<td class="tdwidth" id="td_wxyxsggs"> 
</td>
</tr>
<tr id="tr_zqh">
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>
<td class="tdcolor">中签号公布日期
</td>
<td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
I'd like to get the content of this table: tr id="tr_zqh" (the 6th row above). However for some reason it doesn't scrape its data(No content below). However, when I check the source code of the webpage, the data are in the table. I don't think it is a dynamic table which BeautifulSoup4 can't handle. I've tried both lxml and html parser and I've tried pandas.read_html. It returned the same results. I'd like to get some help to understand why it doesn't get the data and how I can fix it. Many thanks!
Forgot to mention that I tried page.find('tr'), it returned a part of the table but not the lines I'm interested. Page.find('tr') returns the 1st line of the screenshot. I want to get the data of the 2nd & 3rd line(highlighted in the screenshot)
If you extract a couple of variables from the initial page you can use themto make a request to the api directly. Then you get a json object which you can use to get the data.
import requests
import re
import json
from pprint import pprint
s = requests.session()
r = s.get('http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1')
gdpm = re.search('var gpdm = \'(.*)\'', r.text).group(1)
token = re.search('http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get\?type=XGSG_ZQH&token=(.*)&st=', r.text).group(1)
url = "http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?type=XGSG_ZQH&token=" + token + "&st=LASTFIGURETYPE&sr=1&filter=%28securitycode='" + gdpm + "'%29&js=var%20zqh=%28x%29"
r = s.get(url)
j = json.loads(r.text[8:])
for i in range (len(j)):
print ( j[i]['LOTNUM'])
#pprint(j)
Outputs:
9775,2275,4775,7275
03881,23881,43881,63881,83881,16913,66913
313110,563110,813110,063110
4210962,9210962,9785582
63262036
080876872
From where I look at things your question isn't clear to me. But here's what I did.
I do a lot of webscraping so I just made a package to get me beautiful soup objects of any webpage. Package is here.
So my answer depends on that. But you can take a look at the sourcecode and see that there's really nothing esoteric about it. You may drag out the soup-making part and use as you wish.
Here we go.
pip install pywebber --upgrade
from pywebber import PageRipper
page = PageRipper(url='http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1', parser='html5lib')
page_soup = page.soup
tr_zqh_table = page_soup.find('tr', id='tr_zqh')
from here you can do tr_zqh_table.find_all('td')
tr_zqh_table.find_all('td')
Output
[
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>, <td class="tdcolor">中签号公布日期
</td>, <td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
]
Going a bit further
for td in tr_zqh_table.find_all('td'):
print(td.contents)
Output
['中签号\n ']
['中签号公布日期\n ']
['\xa02018-02-22 (周四)\n ']

How to clean up the data from this webscraping script?

So here is my code:
import requests
from bs4 import BeautifulSoup
import lxml
r = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(r.text, "lxml")
tables = soup.find_all('table')
print(tables)
print(tables)
I had to do a post request due to the fact that it's an ASP page, and I had to grab the correct data. Looking in the college of Business for all tables from a specific semester. The problem is the output:
<tr class="tableback2"><td>Overall assessment of instructor</td><td align="right">0.0%</td><td align="right">56.8%</td><td align="right">27.0%</td><td align="right">13.5%</td><td align="right">2.7%</td><td align="right">0.0%</td> </tr>
</table>, <table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA 4721 </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>
I expected beautifulsoup to be able to parse the text, and return it nice and neat into a dataframe with each column separated. I would like to put it into a dataframe after, or perhaps save it to a CSV file.... But I have no idea how to get rid of all of these CSS selectors and tags. I tried using this code to do so, and it removed the ones specified, but td and tr didn't work:
for tag in soup():
for attribute in ["class", "id", "name", "style", "td", "tr"]:
del tag[attribute]
Then, I tried to use this package called bleach, but when putting the 'tables' into it but it specified that it must be a text input. So I can't put my table into it apparently. This is ideally what I would like to see with my output.
So I'm truly at a loss here of how to format this in a proper way. Any help is much appreciated.
Give this a try. I suppose this is what you expected. Btw, if there are more than one tables in that page and if you want another table then twitch the index, as in soup.select('table')[n]. Thanks.
import requests
from bs4 import BeautifulSoup
res = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(res.text, "lxml")
tables = soup.select('table')[0]
list_items = [[items.text.replace("\xa0","") for items in list_item.select("td")]
for list_item in tables.select("tr")]
for data in list_items:
print(' '.join(data))
Partial results:
Term: 1175 - Summer 2017
Instructor Name: Elias, Desiree Department: SCHACCOUNT
Course: ACG 2021 Section: RVCC-1 Title: ACC Decisions
Enrolled: 118 Ref#: 51914 -1 Completed Forms: 36

Python - beautifulsoup - how to deal with missing closing tags

I would like to scrape the table from html code using beautifulsoup. A snippet of the html is shown below. When using table.findAll('tr') I get the entire table and not only the rows. (probably because the closing tags are missing from the html code?)
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
print(tr.text)
As stated in their documentation html5lib parses the document as the web browser does (Like lxmlin this case). It'll try to fix your document tree by adding/closing tags when needed.
In your example I've used lxml as the parser and it gave the following result:
soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
print(tr.get_text(strip=True))
Note that lxml added html & body tags because they weren't present in the source (It'll try to create a well formed document as previously state).

Beautiful Soup HTML Extraction

I am struggling with getting the data I want and I am sure its very simple if you know how to use BS. I have been trying to get this right for hours without avail after reading the docs.
Currently my code outputs this in python:
[<td>0.32%</td>, <td><span class="neg color ">>-0.01</span></td>, <td>0.29%</td>, <td>0.38%</td>, <td><span class="neu">0.00</span></td>]
How would I just isolate the content of the td tags that do not contain the tags?
i.e. I would like to see 0.32%, 0.29%, 0.38% only.
Thank you.
import urllib2
from bs4 import BeautifulSoup
fturl = 'http://markets.ft.com/research/Markets/Bonds'
ftcontent = urllib2.urlopen(fturl).read()
soup = BeautifulSoup(ftcontent)
ftdata = soup.find(name="div", attrs={'class':'wsodModuleContent'}).find_all(name="td", attrs={'class':''})
Is this ok solution for you:
html_txt = """<td>0.32%</td>, <td><span class="neg color">
>-0.01</span></td>, <td>0.29%</td>, <td>0.38%</td>,
<td><span class="neu">0.00</span></td>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_txt)
print [tag.text for tag in soup.find_all('td') if tag.text.strip().endswith("%")]
output is:
[u'0.32%', u'0.29%', u'0.38%']

Categories