How to fetch some data conditionally with Python and Beautiful Soup

Sorry if you feel like this has been asked before, but I have read the related questions and, being quite new to Python, I could not work out how to write this request in a clean manner.
For now I have this minimal Python code:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re

br = Browser()
br.open("http://www.atpworldtour.com/Rankings/Singles.aspx")
html = br.response().read()

soup = BeautifulSoup(html)
links = soup.findAll('a', href=re.compile("Players"))
for link in links:
    print link['href']

FILE = open("rankings.html", "w")
FILE.writelines(html)
FILE.close()
It retrieves all the links whose href contains the word "Players".
Now the HTML I need to parse looks something like this:
<tr>
<td>1</td>
<td>Federer, Roger (SUI)</td>
<td>10,550</td>
<td>0</td>
<td>19</td>
</tr>
The 1 in the first cell is the player's rank.
I would like to be able to retrieve this data in a dictionary:
rank
name of the player
link to the detailed page (here /Tennis/Players/Top-Players/Roger-Federer.aspx)
Could you give me some pointers, or, if this is easy enough, help me build the piece of code? I am not sure how to formulate the query in Beautiful Soup.
Anthony

Searching for the players using your method will work, but it will return three results per player. It is easier to search for the table itself and then iterate over its rows (skipping the header):
table = soup.find('table', 'bioTableAlt')
for row in table.findAll('tr')[1:]:
    cells = row.findAll('td')
    # retrieve data from cells...
To get the data you need:
rank = cells[0].string
player = cells[1].a.string
link = cells[1].a['href']
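Putting those pieces together, the whole thing can be sketched as a loop that builds one dictionary per row. This uses the modern bs4 API and a static snippet modeled on the structure described above (the class name and markup are assumptions taken from this answer, not verified against the live page):

```python
from bs4 import BeautifulSoup

# Static snippet shaped like the rankings table described above.
html = """
<table class="bioTableAlt">
  <tr><th>Rank</th><th>Player</th><th>Points</th></tr>
  <tr><td>1</td>
      <td><a href="/Tennis/Players/Top-Players/Roger-Federer.aspx">Federer, Roger (SUI)</a></td>
      <td>10,550</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
players = []
table = soup.find("table", class_="bioTableAlt")
for row in table.find_all("tr")[1:]:        # skip the header row
    cells = row.find_all("td")
    players.append({
        "rank": cells[0].string,
        "name": cells[1].a.string,
        "link": cells[1].a["href"],
    })
```

Each entry of `players` then has the rank, the player's name, and the link to the detailed page.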

Related

Two classes have the same name in HTML and BeautifulSoup is only selecting the first class

The linked page below has two classes with the same name, both containing data. I'm trying to mine the player names from them and assign the positions where the players placed in the tournament. The find function in BeautifulSoup only lets me grab the first instance of the class.
I've tried a few different ways of iterating past the first instance of the class, but nothing has worked. Having two instances of Table2__tbody seems to be the problem; how do I get past the first one and mine the data from the second?
url_page = "https://www.espn.com/golf/leaderboard/_/tournamentId/401056502"
page = requests.get(url_page)
soup = BeautifulSoup(page.text, 'html.parser')
name_list = soup.find(class_='Table2__tbody')
name_list_items = name_list.find_all('a')
name_list is only capturing the data from the first instance of Table2__tbody. What I need is only the data from the second instance.
I think you are not quite targeting the right class. 'Table2__tbody' only points to the first table (the playoff-hole results). The class you are looking for is actually 'tl Table2__td'.
So when you run the following code (Python 3 with BS4):
from bs4 import BeautifulSoup
from urllib import request
url_page = "https://www.espn.com/golf/leaderboard/_/tournamentId/401056502"
page = request.urlopen(url_page)
soup = BeautifulSoup(page, 'html.parser')
name_list = soup.find_all(class_='tl Table2__td')
name_list_items = []
for i in name_list:
    name_list_items.append(i.get_text())
you actually get a list with the players' positions at the even indexes and their names at the odd indexes. Some simple data manipulation can rearrange that into whatever shape you need.
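That even/odd interleaving can be rearranged into (position, name) pairs with slicing and zip; a minimal sketch with made-up sample data:

```python
# Sample flat list in the interleaved shape described above:
# positions at even indexes, player names at odd indexes.
name_list_items = ["1", "Xander Schauffele", "2", "Tony Finau"]

# Pair each even-index item with the odd-index item that follows it.
pairs = list(zip(name_list_items[0::2], name_list_items[1::2]))
# pairs == [("1", "Xander Schauffele"), ("2", "Tony Finau")]
```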
One way to select the proper table is to use CSS selectors.
table:has(a.leaderboard_player_name) will select <table> that contains <a> with class leaderboard_player_name, which is our player list:
import requests
from bs4 import BeautifulSoup
url_page = "https://www.espn.com/golf/leaderboard/_/tournamentId/401056502"
page = requests.get(url_page)
soup = BeautifulSoup(page.text, 'html.parser')
table_with_namelist = soup.select_one('table:has(a.leaderboard_player_name)')
for a in table_with_namelist.select('.leaderboard_player_name'):
    print(a.text)
Prints:
Xander Schauffele
Tony Finau
Justin Rose
Andrew Putnam
Kiradech Aphibarnrat
Keegan Bradley
...etc.

BeautifulSoup4 not able to scrape data from this table

Sorry for this silly question, as I'm new to web scraping and have no knowledge of HTML, etc.
I'm trying to scrape data from this website. Specifically, from this part/table of the page:
末"四"位数 9775,2275,4775,7275
末"五"位数 03881,23881,43881,63881,83881,16913,66913
末"六"位数 313110,563110,813110,063110
末"七"位数 4210962,9210962,9785582
末"八"位数 63262036
末"九"位数 080876872
I'm sorry that's in Chinese and it looks terrible, since I can't embed the picture. However, the table is roughly in the middle of the page (about 40% from the top). The table id is 'tr_zqh'.
Here is my source code:
import bs4 as bs
import urllib.request
def scrapezqh(url):
    source = urllib.request.urlopen(url).read()
    page = bs.BeautifulSoup(source, 'html.parser')
    return page
url = 'http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1'
print(scrapezqh(url))
It scrapes most of the table but the part that I'm interested in. Here is a part of what it returns where I think the data should be:
<td class="tdcolor">网下有效申购股数(万股)
</td>
<td class="tdwidth" id="td_wxyxsggs"> 
</td>
</tr>
<tr id="tr_zqh">
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>
<td class="tdcolor">中签号公布日期
</td>
<td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
I'd like to get the content of the table row with tr id="tr_zqh" (the sixth row above). However, for some reason it doesn't scrape that data (there is no content below it), even though when I check the source code of the webpage, the data is in the table. I don't think it's a dynamic table, which BeautifulSoup4 can't handle. I've tried both the lxml and html parsers, and I've tried pandas.read_html; they all returned the same results. I'd like some help understanding why it doesn't get the data and how I can fix it. Many thanks!
Forgot to mention that I tried page.find('tr'); it returned part of the table, but not the lines I'm interested in. page.find('tr') returns the first line of the screenshot; I want the data from the second and third lines (highlighted in the screenshot).
If you extract a couple of variables from the initial page, you can use them to make a request to the API directly. You then get a JSON object which you can use to get the data.
import requests
import re
import json
from pprint import pprint
s = requests.session()
r = s.get('http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1')
gdpm = re.search(r"var gpdm = '(.*)'", r.text).group(1)
token = re.search(r"http://dcfm\.eastmoney\.com/em_mutisvcexpandinterface/api/js/get\?type=XGSG_ZQH&token=(.*)&st=", r.text).group(1)
url = "http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?type=XGSG_ZQH&token=" + token + "&st=LASTFIGURETYPE&sr=1&filter=%28securitycode='" + gdpm + "'%29&js=var%20zqh=%28x%29"
r = s.get(url)
j = json.loads(r.text[8:])
for i in range(len(j)):
    print(j[i]['LOTNUM'])

#pprint(j)
Outputs:
9775,2275,4775,7275
03881,23881,43881,63881,83881,16913,66913
313110,563110,813110,063110
4210962,9210962,9785582
63262036
080876872
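The `r.text[8:]` slice above strips the `var zqh=(...)` JavaScript wrapper before parsing. A slightly more robust (purely illustrative) way is to locate the bracketed JSON body with a regex; the payload below is made up, shaped like what the endpoint returns:

```python
import json
import re

# Made-up payload: JSON wrapped in a JavaScript variable assignment,
# in the same shape the endpoint returns.
payload = 'var zqh=([{"LOTNUM": "9775,2275,4775,7275"}])'

# Grab the bracketed JSON body, ignoring the JS wrapper around it.
m = re.search(r'=\(?(\[.*\]|\{.*\})\)?$', payload)
data = json.loads(m.group(1))
# data[0]["LOTNUM"] == "9775,2275,4775,7275"
```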
Your question isn't entirely clear to me, but here's what I did.
I do a lot of web scraping, so I made a package that returns Beautiful Soup objects for any webpage. The package is here.
My answer depends on that package, but you can look at the source code and see that there's really nothing esoteric about it. You may pull out the soup-making part and use it as you wish.
Here we go.
pip install pywebber --upgrade
from pywebber import PageRipper
page = PageRipper(url='http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1', parser='html5lib')
page_soup = page.soup
tr_zqh_table = page_soup.find('tr', id='tr_zqh')
From here you can do:
tr_zqh_table.find_all('td')
Output
[
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>, <td class="tdcolor">中签号公布日期
</td>, <td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
]
Going a bit further
for td in tr_zqh_table.find_all('td'):
    print(td.contents)
Output
['中签号\n ']
['中签号公布日期\n ']
['\xa02018-02-22 (周四)\n ']

scraping a table and getting more info from a link

I am using Python and BeautifulSoup to scrape a table. I have a pretty good handle on getting most of the information I need. Here is a shortened version of the table I am trying to scrape:
<tr>
  <td>Joseph Carter Abbott</td> <td>1868–1872</td> <td>North Carolina</td> <td>Republican</td>
</tr>
<tr>
  <td>James Abdnor</td> <td>1981–1987</td> <td>South Dakota</td> <td>Republican</td>
</tr>
<tr>
  <td>Hazel Abel</td> <td>1954</td> <td>Nebraska</td> <td>Republican</td>
</tr>
http://en.wikipedia.org/wiki/List_of_former_United_States_senators
I want Name, Description, Years, State, Party.
The description is the first paragraph of text on each person's page. I know how to get this independently, but I have no idea how to integrate it with Name, Years, State, and Party, because I have to navigate to a different page.
Oh, and I need to write it to a CSV file.
Thanks!
Just to expand on @anrosent's answer: sending a request mid-parse is one of the best and most consistent ways of doing this. However, the function that gets the description has to behave properly as well, because if it hits a NoneType error, the whole process is thrown into disarray.
The way I did this on my end is as follows (note that I'm using the Requests library rather than urllib or urllib2, as I'm more comfortable with it -- feel free to change that to your liking; the logic is the same anyway):
from bs4 import BeautifulSoup as bsoup
import requests as rq
import csv
ofile = open("presidents.csv", "wb")
f = csv.writer(ofile)
f.writerow(["Name","Description","Years","State","Party"])
base_url = "http://en.wikipedia.org/wiki/List_of_former_United_States_senators"
r = rq.get(base_url)
soup = bsoup(r.content)
all_tables = soup.find_all("table", class_="wikitable")
def get_description(url):
    r = rq.get(url)
    soup = bsoup(r.content)
    desc = soup.find_all("p")[0].get_text().strip().encode("utf-8")
    return desc

for table in all_tables:
    trs = table.find_all("tr")[1:]  # Ignore the header row.
    for tr in trs:
        tds = tr.find_all("td")
        first = tds[0].find("a")
        name = first.get_text().encode("utf-8")
        desc = get_description("http://en.wikipedia.org%s" % first["href"])
        years = tds[1].get_text().encode("utf-8")
        state = tds[2].get_text().encode("utf-8")
        party = tds[3].get_text().encode("utf-8")
        f.writerow([name, desc, years, state, party])
ofile.close()
However, this attempt stops at the line just after David Barton. If you check the page, it may have something to do with him occupying two rows. That is up to you to fix. The traceback is as follows:
Traceback (most recent call last):
File "/home/nanashi/Documents/Python 2.7/Scrapers/presidents.py", line 25, in <module>
name = first.get_text().encode("utf-8")
AttributeError: 'NoneType' object has no attribute 'get_text'
Also, notice that my get_description function appears before the main loop; this is simply because you have to define a function before using it. Finally, my get_description function is far from perfect: it can fail if, by some chance, the first p tag on an individual page is not the one you want.
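One cheap way to harden against that NoneType crash is a small guard helper (the name safe_text is hypothetical, and the stub class below stands in for a real tag purely for illustration):

```python
def safe_text(el, default=""):
    # Return the element's stripped text, or a default when the element
    # is missing (i.e. find() returned None), so one malformed row
    # can't crash the whole scrape.
    if el is None:
        return default
    return el.get_text().strip()

# Stub standing in for a BeautifulSoup tag, just for illustration.
class FakeTag:
    def get_text(self):
        return "  David Barton  "

print(safe_text(FakeTag()))  # David Barton
print(safe_text(None))       # empty string
```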
Sample of result:
Pay attention to the erroneous lines, like Maryon Allen's description. This is for you to fix as well.
Hope this points you in the right direction.
If you're using BeautifulSoup, you won't be navigating to the other page in a stateful, browser-like sense; you'll just be making another request for the other page, with a URL like wiki/name. So your code might look like:
import urllib, csv

# quick way to get the HTML text as a string for a given url
def get_html(url):
    return urllib.urlopen(url).read()

with open('out.csv', 'w') as f:
    csv_file = csv.writer(f)
    # loop through the rows of the table
    for row in senator_rows:
        name = get_name(row)
        ...  # extract the other data from the <tr> elt
        senator_page_url = get_url(row)
        # get description from HTML text of senator's page
        description = get_description(get_html(senator_page_url))
        # write this row to the CSV file
        csv_file.writerow([name, ..., description])
Note that in Python 3.x you'll import and use urllib.request instead of urllib, and you'll have to decode the bytes that the read() call returns.
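For Python 3, that helper might look like the sketch below (the charset fallback to UTF-8 is an assumption; servers that declare a charset will have it honored):

```python
from urllib.request import urlopen

def get_html(url):
    # read() returns bytes in Python 3; decode it to str using the
    # charset the server declares, falling back to UTF-8.
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```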
It sounds like you know how to fill in the other get_* functions I left in there, so I hope this helps!

How to get the content from a certain <table> using python?

I have some <tr>s, like this:
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
I want to fetch the content without html tags, like:
yangfanhit
3155
Accepted
344K
219MS
C++
3940B
2012-10-02 16:42:45
Now I'm using the following code to deal with it:
response = urllib2.urlopen('http://poj.org/status', timeout=10)
html = response.read()
response.close()
pattern = re.compile(r'<tr align.*</tr>')
match = pattern.findall(html)
pat = re.compile(r'<td>.*?</td>')
p = re.compile(r'<[/]?.*?>')
for item in match:
    for i in pat.findall(item):
        print p.sub(r'', i)
    print '================================================='
I'm new to regex and also new to python. So could you suggest some better methods to process it?
You could use BeautifulSoup to parse the html. To write the content of the table in csv format:
#!/usr/bin/env python
import csv
import sys
import urllib2
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
soup = BeautifulSoup(urllib2.urlopen('http://poj.org/status'))
writer = csv.writer(sys.stdout)
for tr in soup.find('table', 'a')('tr'):
    writer.writerow([td.get_text() for td in tr('td')])
Output
Run ID,User,Problem,Result,Memory,Time,Language,Code Length,Submit Time
10876151,yangfanhit,3155,Accepted,344K,219MS,C++,3940B,2012-10-02 16:42:45
10876150,BandBandRock,2503,Accepted,16348K,2750MS,G++,840B,2012-10-02 16:42:25
Also take a look at PyQuery. It's very easy to pick up if you're familiar with jQuery. Here's an example that returns the table header and data as a list of dictionaries.
import itertools
from pyquery import PyQuery as pq
# parse html
html = pq(url="http://poj.org/status")
# extract header values from table
header = [header.text for header in html(".a").find(".in").find("td")]
# extract data values from table rows in nested list
detail = [[td.text for td in tr] for tr in html(".a").children().not_(".in")]
# merge header and detail to create list of dictionaries
result = [dict(itertools.izip(header, values)) for values in detail]
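The same header/row merge works in Python 3 with the built-in zip (itertools.izip is Python 2 only); a sketch with made-up rows in the shape produced above:

```python
# Made-up header and detail rows, shaped like the table output above.
header = ["Run ID", "User", "Result"]
detail = [["10876151", "yangfanhit", "Accepted"],
          ["10876150", "BandBandRock", "Accepted"]]

# zip() pairs each header cell with the matching data cell per row.
result = [dict(zip(header, values)) for values in detail]
# result[0] == {"Run ID": "10876151", "User": "yangfanhit", "Result": "Accepted"}
```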
You really don't need to work with regex directly to parse html, see answer here.
Or see Dive into Python Chapter 8 about HTML Processing.
Why are you doing those things when HTML/XML parsers already do the job easily for you?
Use BeautifulSoup. Considering what you want, as described in the question above, it can be done in 2-3 lines of code.
Example:
>>> from bs4 import BeautifulSoup as bs
>>> html = """
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
"""
>>> soup = bs(html)
>>> soup.td
<td>10876151</td>

How do I draw out specific data from an opened url in Python using urllib2?

I'm new to Python and am playing around with making a very basic web crawler. For instance, I have made a simple function to load a page that shows the high scores for an online game. I am able to get the source code of the HTML page, but I need to draw specific numbers from that page. The webpage looks like this:
http://hiscore.runescape.com/hiscorepersonal.ws?user1=bigdrizzle13
where 'bigdrizzle13' is the unique part of the link. The numbers on that page need to be drawn out and returned. Essentially, I want to build a program that all I would have to do is type in 'bigdrizzle13' and it could output those numbers.
As another poster mentioned, BeautifulSoup is a wonderful tool for this job.
Here's the entire, ostentatiously commented program. It could use a lot more error tolerance, but as long as you enter a valid username, it will pull all the scores from the corresponding web page.
I tried to comment as well as I could. If you're fresh to BeautifulSoup, I highly recommend working through my example with the BeautifulSoup documentation handy.
The whole program...
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import sys
URL = "http://hiscore.runescape.com/hiscorepersonal.ws?user1=" + sys.argv[1]
# Grab page html, create BeatifulSoup object
html = urlopen(URL).read()
soup = BeautifulSoup(html)
# Grab the <table id="mini_player"> element
scores = soup.find('table', {'id':'mini_player'})
# Get a list of all the <tr>s in the table, skip the header row
rows = scores.findAll('tr')[1:]
# Helper function to return concatenation of all character data in an element
def parse_string(el):
    text = ''.join(el.findAll(text=True))
    return text.strip()

for row in rows:
    # Get all the text from the <td>s
    data = map(parse_string, row.findAll('td'))
    # Skip the first td, which is an image
    data = data[1:]
    # Do something with the data...
    print data
And here's a test run.
> test.py bigdrizzle13
[u'Overall', u'87,417', u'1,784', u'78,772,017']
[u'Attack', u'140,903', u'88', u'4,509,031']
[u'Defence', u'123,057', u'85', u'3,449,751']
[u'Strength', u'325,883', u'84', u'3,057,628']
[u'Hitpoints', u'245,982', u'85', u'3,571,420']
[u'Ranged', u'583,645', u'71', u'856,428']
[u'Prayer', u'227,853', u'62', u'357,847']
[u'Magic', u'368,201', u'75', u'1,264,042']
[u'Cooking', u'34,754', u'99', u'13,192,745']
[u'Woodcutting', u'50,080', u'93', u'7,751,265']
[u'Fletching', u'53,269', u'99', u'13,051,939']
[u'Fishing', u'5,195', u'99', u'14,512,569']
[u'Firemaking', u'46,398', u'88', u'4,677,933']
[u'Crafting', u'328,268', u'62', u'343,143']
[u'Smithing', u'39,898', u'77', u'1,561,493']
[u'Mining', u'31,584', u'85', u'3,331,051']
[u'Herblore', u'247,149', u'52', u'135,215']
[u'Agility', u'225,869', u'60', u'276,753']
[u'Thieving', u'292,638', u'56', u'193,037']
[u'Slayer', u'113,245', u'73', u'998,607']
[u'Farming', u'204,608', u'51', u'115,507']
[u'Runecraft', u'38,369', u'71', u'880,789']
[u'Hunter', u'384,920', u'53', u'139,030']
[u'Construction', u'232,379', u'52', u'125,708']
[u'Summoning', u'87,236', u'64', u'419,086']
Voila :)
You can use Beautiful Soup to parse the HTML.
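A minimal sketch of that idea, using the modern bs4 API on a made-up fragment shaped like the hiscore table (the real page's markup may differ):

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like the hiscore table described above.
html = """
<table id="mini_player">
  <tr><th>Skill</th><th>Rank</th><th>XP</th></tr>
  <tr><td>Overall</td><td>87,417</td><td>78,772,017</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Skip the header row, then pull the text out of every cell.
rows = soup.find("table", id="mini_player").find_all("tr")[1:]
data = [[td.get_text() for td in row.find_all("td")] for row in rows]
# data == [["Overall", "87,417", "78,772,017"]]
```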