I cannot crawl HTML text using BeautifulSoup

In my previous question (How to speed up parsing using BeautifulSoup?), I asked how to crawl an HTML website more quickly, and the answer helped me a lot.
But I have run into another problem: crawling ticket prices.
Following the answer to my previous question, I extracted the JSON text embedded in the web page. It gave me almost all the information about each festival, such as title, date, location, poster image URL, and performers.
However, it contained no pricing information, so I tried to get the price from another part of the page.
In the Google Chrome developer tools, I can see a table with the prices (it contains Korean, but you don't need to understand it):
<table cellpadding="0" cellspacing="0">
<colgroup>
<col>
<col style="width:20px;">
<col>
</colgroup>
<tbody id="divSalesPrice">
<tr>
<td>2일권(입장권)</td>
<td> </td>
<td class="costTd">
<span>140,000 원</span>
</td>
</tr>
<tr>
<td>1일권(입장권)</td>
<td> </td>
<td class="costTd">
<span>88,000 원</span>
</td>
</tr>
</tbody>
</table>
The numbers in the span tags (140000 and 88000) are the prices I want to extract. So I thought BeautifulSoup would do the job:
from bs4 import BeautifulSoup
import requests
def Soup(content):
    soup = BeautifulSoup(content, 'lxml')
    return soup
def DetailLink(url):
    req = requests.get(url)
    soup = Soup(req.content)
    spans = soup.findAll('span', class_='fw_bold')
    links = [f'{url[:27]}{span.a["href"]}' for span in spans]
    return links
def Price():
    links = DetailLink('http://ticket.interpark.com/TPGoodsList.asp?Ca=Liv&SubCa=Fes')
    with requests.Session() as request:
        for link in links:
            req = request.get(link)
            soup = Soup(req.content)
            price = soup.find('tbody', id='divSalesPrice')
            print(price)
Price()
However, the result was disappointing...
<tbody id="divSalesPrice">
<!-- 등록된 기본가 가져오기 오류-->
<tr>
<td colspan="3" id="liBasicPrice">
<ul>
</ul>
</td>
</tr>
</tbody>
The comment '등록된 기본가 가져오기 오류' means 'An error occurred while getting the price.'
Does this mean that the website operator has blocked other users from crawling the price info on the page?

OK, if we look carefully, the price data is not included in the response when you request the page; it is loaded afterwards, which means we need to get the price data from somewhere else.
If you inspect the Network tab in Chrome, there is this strange URL:
And it has the data you are looking for:
Now the only thing you need to do is get the place ID and the product ID. You can get these from the page source, as you can see:
The vPC is the location ID and vGC is the product ID; you can get the product ID from the URL too.
The code below does the rest:
import requests, re, json
# Just a random product url, you can adapt the code into yours.
url = "http://ticket.interpark.com/Ticket/Goods/GoodsInfo.asp?GroupCode=20002746"
data = requests.get(url).text
# I used regex to get the matching values `vGC` and `vPC`
vGC = re.search(r"var vGC = \"(\d+)\"", data).groups()[0]
vPC = re.search(r"var vPC = \"(\d+)\"", data).groups()[0]
# Notice that I placed placeholders to use `format`. Placeholders are `{}`.
priceUrl = "http://ticket.interpark.com/Ticket/Goods/GoodsInfoJSON.asp?Flag=SalesPrice&GoodsCode={}&PlaceCode={}"
# Looks like that url needs a referer url and that is the goods page, we will pass it as header.
lastData = requests.get(priceUrl.format(vGC, vPC), headers={"Referer": url}).text
# As the data is a javascript object but inside it is a json object,
# we can remove the callback and parse the inside of callback as json data:
lastData = re.search(r"^Callback\((.*)\);$", lastData).groups()[0]
lastData = json.loads(lastData)["JSON"]
print(lastData)
Output:
[{'DblDiscountOrNot': 'N',
'GoodsName': '뷰티풀 민트 라이프 2020 - 공식 티켓',
'PointDiscountAmt': '0',
'PriceGradeName': '입장권',
'SalesPrice': '140000',
'SeatGradeName': '2일권'},
{'DblDiscountOrNot': 'N',
'GoodsName': '뷰티풀 민트 라이프 2020 - 공식 티켓',
'PointDiscountAmt': '0',
'PriceGradeName': '입장권',
'SalesPrice': '88000',
'SeatGradeName': '1일권'}]
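The callback-stripping step can also be wrapped in a small helper. The sketch below is only an illustration: the function name `unwrap_jsonp` and the `sample` string are made up here, and the sample merely imitates the shape of the response shown above, so no network request is needed to try it.

```python
import json
import re

def unwrap_jsonp(text):
    """Return the parsed JSON payload of a JSONP-style body such as 'Callback({...});'."""
    # Capture everything between the first "(" after the callback name and the final ")".
    match = re.search(r"^\s*[\w$.]+\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not look like a JSONP callback")
    return json.loads(match.group(1))

# Hypothetical response body, shaped like the output above.
sample = (
    'Callback({"JSON": ['
    '{"SeatGradeName": "2일권", "SalesPrice": "140000"}, '
    '{"SeatGradeName": "1일권", "SalesPrice": "88000"}]});'
)

# Map each ticket type to its price as an integer.
prices = {row["SeatGradeName"]: int(row["SalesPrice"]) for row in unwrap_jsonp(sample)["JSON"]}
print(prices)
```

Raising an error when the wrapper is missing makes it obvious when the site returns something else, for example an error page because the Referer header was not sent.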

Related

Get data from a web page using soup

I wanted to download data from a page where the link to each file is found in the rows of a table.
I wrote some code using BeautifulSoup to read the href of every row, but it did not give me the list of links to download. I guess it couldn't see the table data (td) in each table row (tr).
from bs4 import BeautifulSoup
import urllib.request
testurl = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-562-CD'
page = urllib.request.urlopen(testurl)
page_content = BeautifulSoup(page, "html.parser")
table_dt = page_content.find_all("table")
# find_all returns a list of tables, so select the rows of each table in turn
for table in table_dt:
    for tt in table.select("tr"):
        print(tt)
This prints only the header row:
<tr>
<th>Friendly Name</th>
<th colspan="2">Posted</th>
<th>Available Files</th>
</tr>
Printing table_dt shows:
[<table class="table table-condensed report-table" id="reportTable">
<thead>
<tr>
<th>Friendly Name</th>
<th colspan="2">Posted</th>
<th>Available Files</th>
</tr>
</thead>
<tbody>
</tbody>
</table>]
As can be seen, there is no data for the other rows (tr); only the header row is captured.
Could you please guide me on how to get the data link in each row so that I can download the files?

Most likely, only the structure of the table is in the original HTML page, and the row data is retrieved afterwards by a JavaScript request. If you can figure out what that JavaScript request is (probably by using your browser's "web developer" tools), you can fetch the data that way.
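A quick way to confirm this diagnosis is to count the body rows in the HTML that the server actually returns. The sketch below is an assumption-laden illustration rather than part of the original answer: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs, the class and function names are invented, and `static_html` is a hand-written stand-in for the downloaded page.

```python
from html.parser import HTMLParser

class BodyRowCounter(HTMLParser):
    """Count <tr> elements that appear inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.body_rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "tr" and self.in_tbody:
            self.body_rows += 1

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_tbody = False

def count_body_rows(html):
    """Return the number of table body rows found in an HTML string."""
    counter = BodyRowCounter()
    counter.feed(html)
    return counter.body_rows

# Stand-in for the fetched page: a header row and an empty body.
static_html = """
<table id="reportTable">
<thead><tr><th>Friendly Name</th><th>Available Files</th></tr></thead>
<tbody>
</tbody>
</table>
"""

print(count_body_rows(static_html))  # prints 0
```

If the count is zero for the raw response but the browser shows rows, the rows are being filled in by JavaScript, and the request to look for is in the browser's network panel. With BeautifulSoup, an equivalent check would be the length of `soup.select("tbody tr")`.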

Why some elements of the response object are missing? Requests module

As I've recently started learning web scraping, I thought I would try to parse an HTML table from this site using the requests and bs4 modules.
I know I need to access the td elements inside tbody; at least, this is what the web page looks like:
When I try, though, it doesn't work properly: it only captures the td elements from thead and not from tbody. Hence, I cannot capture anything but the table headers.
I assume it has something to do with the requests module.
import requests
url = 'https://vstup.edbo.gov.ua/statistics/requests-by-university/?qualification=1&education-base=40'
r = requests.get(url)
print(r.text)
The result is as follows (pasting table-related part):
<table id="stats">
<caption></caption>
<thead>
<tr>
<td class="region">Регіон</td>
<td class="university">Назва закладу</td>
<td class="speciality">Спеціальність (спеціалізація)</td>
<td class="average-ball number" title="Середній конкурсний бал">СКБ</td>
<td class="requests-total number">Усього заяв</td>
<td class="requests-budget number">Заяв на бюджет</td>
</tr>
</thead>
<tbody></tbody>
</table>
So the tbody elements are missing in my response object, while they are present in the code of the web page. What am I doing wrong?

@Holdenweb suggested trying Selenium, and everything worked.
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://vstup.edbo.gov.ua/statistics/requests-by-university/?qualification=1&education-base=40'
# Note: executable_path is deprecated in Selenium 4, which takes a Service object instead
browser = webdriver.Firefox(executable_path=r'D:/folder/geckodriver.exe')
browser.get(url)
html = browser.page_source
After that, I used BeautifulSoup and managed to parse the web page.

BeautifulSoup4 not able to scrape data from this table

Sorry for this silly question; I'm new to web scraping and have no knowledge of HTML.
I'm trying to scrape data from this website. Specifically, from this part/table of the page:
末"四"位数 9775,2275,4775,7275
末"五"位数 03881,23881,43881,63881,83881,16913,66913
末"六"位数 313110,563110,813110,063110
末"七"位数 4210962,9210962,9785582
末"八"位数 63262036
末"九"位数 080876872
I'm sorry that it's in Chinese and looks terrible, since I can't embed the picture. The table is roughly in the middle of the page (about 40% down from the top). The id of the row is 'tr_zqh'.
Here is my source code:
import bs4 as bs
import urllib.request
def scrapezqh(url):
    source = urllib.request.urlopen(url).read()
    page = bs.BeautifulSoup(source, 'html.parser')
    # Return the parsed page so the caller can print or search it
    return page
url = 'http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1'
print(scrapezqh(url))
It scrapes most of the table except the part that I'm interested in. Here is the part of the output where I think the data should be:
<td class="tdcolor">网下有效申购股数(万股)
</td>
<td class="tdwidth" id="td_wxyxsggs"> 
</td>
</tr>
<tr id="tr_zqh">
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>
<td class="tdcolor">中签号公布日期
</td>
<td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
I'd like to get the content of this row: tr id="tr_zqh" (the 6th line above). However, for some reason its data is not scraped (there is no content below it), even though the data are in the table when I check the source code of the web page. I don't think it is a dynamic table that BeautifulSoup4 can't handle. I've tried both the lxml and html.parser parsers as well as pandas.read_html, and they all returned the same result. I'd like some help understanding why the data is missing and how I can fix it. Many thanks!
I forgot to mention that I tried page.find('tr'); it returned part of the table but not the lines I'm interested in. page.find('tr') returns the 1st line of the screenshot. I want the data in the 2nd and 3rd lines (highlighted in the screenshot).

If you extract a couple of variables from the initial page, you can use them to make a request to the API directly. You then get a JSON object that contains the data.
import requests
import re
import json
from pprint import pprint
s = requests.session()
r = s.get('http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1')
# The stock code and the API token are embedded in the page's inline JavaScript
gdpm = re.search(r"var gpdm = '(.*)'", r.text).group(1)
token = re.search(r'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get\?type=XGSG_ZQH&token=(.*)&st=', r.text).group(1)
url = "http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?type=XGSG_ZQH&token=" + token + "&st=LASTFIGURETYPE&sr=1&filter=%28securitycode='" + gdpm + "'%29&js=var%20zqh=%28x%29"
r = s.get(url)
# The response starts with "var zqh="; skip those 8 characters before parsing the JSON
j = json.loads(r.text[8:])
for item in j:
    print(item['LOTNUM'])
#pprint(j)
Outputs:
9775,2275,4775,7275
03881,23881,43881,63881,83881,16913,66913
313110,563110,813110,063110
4210962,9210962,9785582
63262036
080876872

Your question isn't entirely clear to me, but here's what I did.
I do a lot of web scraping, so I made a package that gets me BeautifulSoup objects for any web page. The package is here.
My answer depends on that package, but you can look at its source code and see that there's nothing esoteric about it. You can pull out the soup-making part and use it as you wish.
Here we go.
pip install pywebber --upgrade
from pywebber import PageRipper
page = PageRipper(url='http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1', parser='html5lib')
page_soup = page.soup
tr_zqh_table = page_soup.find('tr', id='tr_zqh')
From here you can call find_all('td') on the row:
tr_zqh_table.find_all('td')
Output
[
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>, <td class="tdcolor">中签号公布日期
</td>, <td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
]
Going a bit further:
for td in tr_zqh_table.find_all('td'):
    print(td.contents)
Output
['中签号\n ']
['中签号公布日期\n ']
['\xa02018-02-22 (周四)\n ']

How to clean up the data from this webscraping script?

So here is my code:
import requests
from bs4 import BeautifulSoup
import lxml
r = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(r.text, "lxml")
tables = soup.find_all('table')
print(tables)
I had to use a POST request because it's an ASP page and I needed to submit the form values to get the correct data. I'm looking at all tables for the College of Business from a specific semester. The problem is the output:
<tr class="tableback2"><td>Overall assessment of instructor</td><td align="right">0.0%</td><td align="right">56.8%</td><td align="right">27.0%</td><td align="right">13.5%</td><td align="right">2.7%</td><td align="right">0.0%</td> </tr>
</table>, <table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA 4721 </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>
I expected BeautifulSoup to parse the text and return it neatly, with each column separated. I would like to put it into a DataFrame afterwards, or perhaps save it to a CSV file. But I have no idea how to get rid of all these HTML tags and attributes. I tried the code below; it removed the attributes specified, but td and tr were not removed:
for tag in soup():
    for attribute in ["class", "id", "name", "style", "td", "tr"]:
        del tag[attribute]
Then I tried a package called bleach, but when I passed 'tables' to it, it complained that the input must be text, so apparently I can't put my table into it. This is ideally what I would like to see as my output.
So I'm truly at a loss as to how to format this properly. Any help is much appreciated.

Give this a try; I suppose this is what you expected. By the way, if there is more than one table on the page and you want a different one, change the index, as in soup.select('table')[n]. Thanks.
import requests
from bs4 import BeautifulSoup
res = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(res.text, "lxml")
tables = soup.select('table')[0]
list_items = [[items.text.replace("\xa0","") for items in list_item.select("td")]
              for list_item in tables.select("tr")]
for data in list_items:
    print(' '.join(data))
Partial results:
Term: 1175 - Summer 2017
Instructor Name: Elias, Desiree Department: SCHACCOUNT
Course: ACG 2021 Section: RVCC-1 Title: ACC Decisions
Enrolled: 118 Ref#: 51914 -1 Completed Forms: 36
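Since the question asks for separated columns or a CSV file, one possible next step is to split each "Label: value" cell and write the result with the standard `csv` module. This is a rough sketch under stated assumptions: `list_items` is hard-coded here to mirror the partial results above (one inner list per row, one string per cell), `header_to_record` is a hypothetical helper name, and a `StringIO` buffer stands in for a real output file.

```python
import csv
import io

# Hypothetical rows, shaped like list_items in the answer above.
list_items = [
    ["Term: 1175 - Summer 2017"],
    ["Instructor Name: Elias, Desiree", "Department: SCHACCOUNT"],
    ["Course: ACG 2021", "Section: RVCC-1", "Title: ACC Decisions"],
    ["Enrolled: 118", "Ref#: 51914 -1", "Completed Forms: 36"],
]

def header_to_record(rows):
    """Split each 'Label: value' cell on its first colon into a dict entry."""
    record = {}
    for row in rows:
        for cell in row:
            label, _, value = cell.partition(":")
            record[label.strip()] = value.strip()
    return record

record = header_to_record(list_items)

# Write a header line and one data line; StringIO stands in for an open file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```

The same dictionaries, one per course section, could also be collected in a list and passed to `pandas.DataFrame` if a dataframe is preferred over a file.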

How to get next form content in python

This is my first script (and first post) in Python.
In the script I fill in a form and submit it. After submission, the result is generated on the next form. The issue is that the link to the next form is not static; it changes according to the data entered in the previous form. Part of my script is below:
import mechanize
browser = mechanize.Browser()
browser.open('https://example.com')
browser.select_form(nr=1)
browser.form["MyIDNO"] = '000D6F0004C46834'
browser.form["RuleID"] = '0109108301234567890A'
browser.submit()
The code above just fills in the data and submits it. Now I want the content of the form that opens next. I get a dynamic link like the one below:
https://example.com/index.php?option=com_gencert&task=results&tmpl=gencert&cfId=189537&MyIDNO=000D6F0004C46834&RuleID=0109108301234567890A&esKey=
As seen in the link above, it is generated based on MyIDNO and RuleID.
I tried one solution, shown below:
html = browser.response().read()
print html
It prints all the content as HTML. Now I need to parse specific data. Part of the output is below:
<tr>
<td><strong>User key: </strong></td>
<td>0200fde8a7f3d1084224962a4e7c54e69ac3f04da6b8</td>
</tr>
<tr>
<td><strong>Institute id: </strong></td>
<td>
030780ffa3641183273ad548ae09872f9dcf4b0c4267<br/>000d6f0004c468345445535453454341010910830123<br/>4567890a<br/> </td>
</tr>
<tr>
<td><strong>part id:</strong></td>
<td>00ecd01536ff66296f9d572219d7acac02d59b24c6</td>
</tr>
<tr>
From the content above, I need the following output:
User key: 0200fde8a7f3d1084224962a4e7c54e69ac3f04da6b8
Institute id: 030780ffa3641183273ad548ae09872f9dcf4b0c4267000d6f0004c4683454455354534543410109108301234567890a
part id: 00ecd01536ff66296f9d572219d7acac02d59b24c6

Once you have the HTML document, you can use BeautifulSoup to get the data you need.
from bs4 import BeautifulSoup
# submit form as per your snippet
html = browser.response().read()
soup = BeautifulSoup(html, 'html.parser')
# Process the content with BeautifulSoup.
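As one way to fill in the processing step, the rows can be read as label and value pairs. The following is a minimal sketch, not the answerer's own code: the `html` string is a static copy of the rows from the question wrapped in a table, standing in for `browser.response().read()`, and the assumption that every row of interest has exactly two cells comes only from that sample.

```python
from bs4 import BeautifulSoup

# Static copy of the rows shown in the question (assumed shape of the real page).
html = """
<table>
<tr>
<td><strong>User key: </strong></td>
<td>0200fde8a7f3d1084224962a4e7c54e69ac3f04da6b8</td>
</tr>
<tr>
<td><strong>Institute id: </strong></td>
<td>
030780ffa3641183273ad548ae09872f9dcf4b0c4267<br/>000d6f0004c468345445535453454341010910830123<br/>4567890a<br/> </td>
</tr>
<tr>
<td><strong>part id:</strong></td>
<td>00ecd01536ff66296f9d572219d7acac02d59b24c6</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
results = {}
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) != 2:
        continue  # skip rows that are not label/value pairs
    label = cells[0].get_text(strip=True).rstrip(":")
    # stripped_strings yields each text fragment between the <br/> tags, whitespace removed
    value = "".join(cells[1].stripped_strings)
    results[label] = value

for label, value in results.items():
    print(f"{label}: {value}")
```

Joining the stripped fragments is what turns the value that is split across `<br/>` tags into the single line requested in the question.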
