scraping a table and getting more info from a link - python

I am using Python and BeautifulSoup to scrape a table, and I have a pretty good handle on getting most of the information I need. Here is a shortened version of the table I am trying to scrape:
<tr>
    <td>Joseph Carter Abbott</td>
    <td>1868–1872</td>
    <td>North Carolina</td>
    <td>Republican</td>
</tr>
<tr>
    <td>James Abdnor</td>
    <td>1981–1987</td>
    <td>South Dakota</td>
    <td>Republican</td>
</tr>
<tr>
    <td>Hazel Abel</td>
    <td>1954</td>
    <td>Nebraska</td>
    <td>Republican</td>
</tr>
http://en.wikipedia.org/wiki/List_of_former_United_States_senators
I want Name, Description, Years, State, Party.
The Description is the first paragraph of text on each person's page. I know how to get this on its own, but I have no idea how to integrate it with Name, Years, State, and Party, because I have to navigate to a different page to get it.
Oh, and I need to write it all to a CSV.
Thanks!

Just to expound on anrosent's answer: sending a request mid-parse is one of the best and most consistent ways of doing this. However, your function that gets the description has to behave properly as well, because if it trips over a None value, the whole process is thrown into disarray.
The way I did this on my end is this (note that I'm using the Requests library and not urllib or urllib2 as I'm more comfortable with that -- feel free to change it to your liking, the logic is the same anyway):
from bs4 import BeautifulSoup as bsoup
import requests as rq
import csv

ofile = open("presidents.csv", "wb")
f = csv.writer(ofile)
f.writerow(["Name", "Description", "Years", "State", "Party"])

base_url = "http://en.wikipedia.org/wiki/List_of_former_United_States_senators"
r = rq.get(base_url)
soup = bsoup(r.content)
all_tables = soup.find_all("table", class_="wikitable")

def get_description(url):
    r = rq.get(url)
    soup = bsoup(r.content)
    desc = soup.find_all("p")[0].get_text().strip().encode("utf-8")
    return desc

complete_list = []
for table in all_tables:
    trs = table.find_all("tr")[1:]  # Ignore the header row.
    for tr in trs:
        tds = tr.find_all("td")
        first = tds[0].find("a")
        name = first.get_text().encode("utf-8")
        desc = get_description("http://en.wikipedia.org%s" % first["href"])
        years = tds[1].get_text().encode("utf-8")
        state = tds[2].get_text().encode("utf-8")
        party = tds[3].get_text().encode("utf-8")
        f.writerow([name, desc, years, state, party])
ofile.close()
However, this attempt dies on the row just after David Barton's. If you check the page, it may have something to do with his entry occupying two rows by itself. This is up to you to fix (one possible guard is sketched after the traceback). The traceback is as follows:
Traceback (most recent call last):
  File "/home/nanashi/Documents/Python 2.7/Scrapers/presidents.py", line 25, in <module>
    name = first.get_text().encode("utf-8")
AttributeError: 'NoneType' object has no attribute 'get_text'
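One hedged way past that row, just as a sketch: inside the inner loop, skip any row whose first cell has no link before calling get_text.

        first = tds[0].find("a")
        if first is None:
            # e.g. continuation rows such as David Barton's second line
            continue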
Also, notice how my get_description function comes before the main loop. That is simply because you have to define a function before calling it. Finally, my get_description function is not nearly robust enough, as it can fail if by some chance the first p tag on an individual page is not the one you want.
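A slightly more defensive sketch of get_description, which takes the first p tag that actually contains text rather than the first one blindly (still no guarantee it is the paragraph you want):

def get_description(url):
    r = rq.get(url)
    soup = bsoup(r.content)
    # Return the first <p> with non-empty text instead of find_all("p")[0].
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text:
            return text.encode("utf-8")
    return ""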
In the sample output I got, pay attention to the erroneous lines, like Maryon Allen's description. This is for you to fix as well.
Hope this points you in the right direction.

If you're using BeautifulSoup, you won't be navigating to the other page in a stateful, browser-like sense so much as just making another request for that page, with a URL like wiki/name. So your code might look like:
import urllib, csv

# quick way to get the HTML text as a string for a given url
def get_html(url):
    return urllib.urlopen(url).read()

with open('out.csv', 'w') as f:
    csv_file = csv.writer(f)
    # loop through the rows of the table
    for row in senator_rows:
        name = get_name(row)
        ...  # extract the other data from the <tr> elt
        senator_page_url = get_url(row)
        # get description from HTML text of senator's page
        description = get_description(get_html(senator_page_url))
        # write this row to the CSV file
        csv_file.writerow([name, ..., description])
Note that in Python 3.x you'll be importing and using urllib.request instead of urllib, and you'll have to decode the bytes that the read() call returns.
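As a rough sketch, the Python 3 version of get_html might look like this (the utf-8 decoding is an assumption; pages can declare other encodings):

from urllib.request import urlopen

def get_html(url):
    # read() returns bytes in Python 3, so decode them into a str
    return urlopen(url).read().decode('utf-8')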
It sounds like you know how to fill in the other get_* functions I left in there, so I hope this helps!


How to convert data into csv file

I am unable to convert the data retrieved with bs4 into a meaningful csv file. It only takes the last set of data from what is actually retrieved.
# Beautiful Soup, or BS4, is a package I will be using to parse the HTML data I retrieve from a website.
# Parsing is the conversion of data from one format into another, structured so humans can work with it.
from bs4 import BeautifulSoup
# requests is an HTTP library which allows me to send requests to websites and retrieve data with Python.
# This is helpful as the website is written in a different language, so it lets me retrieve what I want and read it as well.
import requests
# import writer

url = "https://myanimelist.net/anime/season"
# Requesting the data using 'requests' to gain access.
# Have to check the response before moving forward to ensure there is no problem retrieving data.
page = requests.get(url)
# print(page)
# <Response [200]> -- the response was "200", meaning a successful response.
soup = BeautifulSoup(page.content, 'html.parser')

# To identify the HTML and determine what we will be retrieving for each item on the page,
# we find the parent element which contains all the info we need for our data categories.
lists = soup.select("[data-genre]")
# We add an underscore to make class_ because without it the program reads it as the Python
# keyword class, when really it is a CSS class.

all_data = []
# Must loop to find the titles separately, as there are a lot that will come up.
for list in lists:
    # Identify and find the elements which hold the show title, rating, members watching, and release date.
    # Added .text.replace in order to get rid of the "\n" spacing which was in the HTML.
    title = list.find("a", class_="link-title").text.replace("\n", "")
    rating = list.find("div", class_="score").text.replace("\n", "")
    members = list.find("div", class_="scormem-item member").text.replace("\n", "")
    release_date = list.find("span", class_="item").text.replace("\n", "")
    all_data.append(
        [title.strip(), rating.strip(), members.strip(), release_date.strip()]
    )
print(*all_data, sep="\n")
# Testing for errors and making sure the locations are correct to request the data.

# This allows us to create and close a CSV file, using 'w' to allow writing.
from csv import writer

# Organizing the chart.
header = ['Title', 'Show Rating', 'Members', 'Release Date']
info = [title.strip(), rating.strip(), members.strip(), release_date.strip()]
with open('shows.csv', 'w', encoding='utf8', newline='') as f:
    # Will write onto our file 'f'.
    writing = writer(f)
    # Use our writer to write a row in the file.
    writing.writerow(header)
    writing.writerow(info)
I tried to change the definition of list, but to no avail. The CSV currently contains only the last row of data, even though it should be much longer.
Instead of writing just the last line with
writing.writerow(info)
you need to write all of the lines:
writing.writerows(all_data)
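To see the difference in isolation, here is a tiny sketch with made-up data: writerow emits a single row, while writerows emits one row per element of the list.

import csv

rows = [["A", 1], ["B", 2], ["C", 3]]  # hypothetical data
with open("demo.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["name", "value"])  # one call, one row (the header)
    w.writerows(rows)              # one call, one row per list element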

How do I get information from a table into variables while using BeautifulSoup 4?

I have a table, found below and stored as "table". It contains the following:
http://pastebin.com/aBFLpU4U
My code captures the correct information, but I need to know how to get each piece of it into its own variable. I appreciate any help with this; I have only been playing with BeautifulSoup for a week, so forgive me. I have looked all over Stack Overflow and haven't found an answer that works for me.
This is the output I see: http://pastebin.com/fiYQvBix
import sys, locale, os, re, urllib2
import lxml.etree, requests
from bs4 import BeautifulSoup as bSoup

# Website that we are scraping:
BASE_URL = 'https://www.biddergy.com/detail.asp?id='
#ID = raw_input("Enter listing #: ")
ID = str(330998)  # defined constant for debugging

# Store response in soup:
response = requests.get(BASE_URL + ID)
soup = bSoup(response.text)

# Find auction info <table>
table = soup.find('table', cellpadding="2")

#### Everything above this line works great ####

for row in table.find_all('tr'):
    for col in row.find_all("td"):
        print(col.string)
Well, I have figured it out.
data = []
for row in table.find_all('tr'):
    for cols in row.find_all('td', text=True):
        for col in cols:
            data.append(col.strip())
Then data can be extracted from the data[] list and saved into the respective variables.
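For instance, one way to do that extraction is sequence unpacking; the variable names below are purely hypothetical, since they depend on what the listing table actually contains:

# Hypothetical field names; match them to the real cell order in data.
title, current_bid, end_time, location = data[:4]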
Thank you for all who have read my question!

Why am I getting authors, titles, abstracts, and journals separately first, and only then together? They should be grouped per title.

I am trying to pull information out of an HTML file saved from the link http://dl.acm.org/results.cfm?CFID=376026650&CFTOKEN=88529867. For every paper title, I need the authors, the journal name, and the abstract, but I am getting repetitive versions of each first before getting them together: a list of all titles, then all authors, then all journals, then all abstracts, and only after that do they appear together per title (title first, then the respective authors, journal name, and abstract). I only need them together, not individually. Please help.
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests
import re

f = open('acmpage.html', 'r')  # open HTML file stored locally
html = f.read()  # read from the HTML file and store the content in 'html'
soup = BeautifulSoup(html)
pret = soup.prettify()
soup1 = BeautifulSoup(pret)
for content in soup1.find_all("table"):
    soup2 = BeautifulSoup(str(content))
    pret2 = soup2.prettify()
    soup3 = BeautifulSoup(pret2)
    for titles in soup3.find_all('a', target='_self'):  # to print title
        print "Title: ",
        print titles.get_text()
    for auth in soup3.find_all('div', class_='authors'):  # to print authors
        print "Authors: ",
        print auth.get_text()
    for journ in soup3.find_all('div', class_='addinfo'):  # to print name of journal
        print "Journal: ",
        print journ.get_text()
    for abs in soup3.find_all('div', class_='abstract2'):  # to print abstract
        print "Abstract: ",
        print abs.get_text()
You are searching for each list of information separately, so there is little question as to why you see each type of information listed separately.
Your code is also full of redundancies: you only need to import one version of BeautifulSoup (the first import is shadowed by the second), and you don't need to re-parse the elements two times either. You import two different URL-loading libraries, then ignore both by loading a local file instead.
Search for the table rows containing the title information instead, then per table row, parse out the information contained.
For this page, with its more complex (and frankly, disorganized) layout with multiple tables, it'd be easiest just to go up to the table row per title link found:
from bs4 import BeautifulSoup
import requests

resp = requests.get('http://dl.acm.org/results.cfm',
                    params={'CFID': '376026650', 'CFTOKEN': '88529867'})
soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)

for title_link in soup.find_all('a', target='_self'):
    # find the parent row to base the rest of the search on
    row = next(p for p in title_link.parents if p.name == 'tr')
    title = title_link.get_text()
    authors = row.find('div', class_='authors').get_text()
    journal = row.find('div', class_='addinfo').get_text()
    abstract = row.find('div', class_='abstract2').get_text()
The next() call loops over a generator expression that goes over each parent of the title link until a <tr> element is found.
Now you have all the information grouped per title.
You need to find the first addinfo div, then walk forward to find the publisher in a div further on in the document: go up the tree to the enclosing tr, take the next sibling tr, and then search inside that row for the next data item (the publisher).
Once you have done this for all the items you need to display, issue a single print command for all the items you've found
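As a rough sketch of that walk, building on the row variable from the loop above (the publisher class name is a guess; check the real markup):

addinfo = row.find('div', class_='addinfo')
# climb to the enclosing <tr>, then step to the row that follows it
info_row = addinfo.find_parent('tr')
next_row = info_row.find_next_sibling('tr')
# search inside that row for the next data item (hypothetical class name)
publisher = next_row.find('div', class_='publisher').get_text()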

How do I draw out specific data from an opened url in Python using urllib2?

I'm new to Python and am playing around with making a very basic web crawler. For instance, I have made a simple function to load a page that shows the high scores for an online game. I am able to get the source code of the HTML page, but I need to draw specific numbers out of that page. The webpage looks like this:
http://hiscore.runescape.com/hiscorepersonal.ws?user1=bigdrizzle13
where 'bigdrizzle13' is the unique part of the link. The numbers on that page need to be drawn out and returned. Essentially, I want to build a program where all I have to do is type in 'bigdrizzle13' and it outputs those numbers.
As another poster mentioned, BeautifulSoup is a wonderful tool for this job.
Here's the entire, ostentatiously-commented program. It could use a lot more error handling, but as long as you enter a valid username, it will pull all the scores from the corresponding web page.
I tried to comment as well as I could. If you're fresh to BeautifulSoup, I highly recommend working through my example with the BeautifulSoup documentation handy.
The whole program...
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import sys

URL = "http://hiscore.runescape.com/hiscorepersonal.ws?user1=" + sys.argv[1]

# Grab page html, create BeautifulSoup object
html = urlopen(URL).read()
soup = BeautifulSoup(html)

# Grab the <table id="mini_player"> element
scores = soup.find('table', {'id': 'mini_player'})

# Get a list of all the <tr>s in the table, skip the header row
rows = scores.findAll('tr')[1:]

# Helper function to return concatenation of all character data in an element
def parse_string(el):
    text = ''.join(el.findAll(text=True))
    return text.strip()

for row in rows:
    # Get all the text from the <td>s
    data = map(parse_string, row.findAll('td'))
    # Skip the first td, which is an image
    data = data[1:]
    # Do something with the data...
    print data
And here's a test run.
> test.py bigdrizzle13
[u'Overall', u'87,417', u'1,784', u'78,772,017']
[u'Attack', u'140,903', u'88', u'4,509,031']
[u'Defence', u'123,057', u'85', u'3,449,751']
[u'Strength', u'325,883', u'84', u'3,057,628']
[u'Hitpoints', u'245,982', u'85', u'3,571,420']
[u'Ranged', u'583,645', u'71', u'856,428']
[u'Prayer', u'227,853', u'62', u'357,847']
[u'Magic', u'368,201', u'75', u'1,264,042']
[u'Cooking', u'34,754', u'99', u'13,192,745']
[u'Woodcutting', u'50,080', u'93', u'7,751,265']
[u'Fletching', u'53,269', u'99', u'13,051,939']
[u'Fishing', u'5,195', u'99', u'14,512,569']
[u'Firemaking', u'46,398', u'88', u'4,677,933']
[u'Crafting', u'328,268', u'62', u'343,143']
[u'Smithing', u'39,898', u'77', u'1,561,493']
[u'Mining', u'31,584', u'85', u'3,331,051']
[u'Herblore', u'247,149', u'52', u'135,215']
[u'Agility', u'225,869', u'60', u'276,753']
[u'Thieving', u'292,638', u'56', u'193,037']
[u'Slayer', u'113,245', u'73', u'998,607']
[u'Farming', u'204,608', u'51', u'115,507']
[u'Runecraft', u'38,369', u'71', u'880,789']
[u'Hunter', u'384,920', u'53', u'139,030']
[u'Construction', u'232,379', u'52', u'125,708']
[u'Summoning', u'87,236', u'64', u'419,086']
Voila :)
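If you'd rather look the numbers up by skill name than print raw lists, a small sketch reusing rows and parse_string from above (and assuming the same four-column layout) collects everything into a dict:

stats = {}
for row in rows:
    # each row becomes [skill, rank, level, xp] once the image cell is dropped
    skill, rank, level, xp = map(parse_string, row.findAll('td'))[1:]
    stats[skill] = {'rank': rank, 'level': level, 'xp': xp}

print stats['Cooking']['level']  # u'99' in the run above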
You can use Beautiful Soup to parse the HTML.

How to fetch some data conditionally with Python and Beautiful Soup

Sorry if you feel like this has been asked before, but I have read the related questions and, being quite new to Python, I could not figure out how to write this request in a clean manner.
For now I have this minimal Python code:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2

br = Browser()
br.open("http://www.atpworldtour.com/Rankings/Singles.aspx")

filename = "rankings.html"
FILE = open(filename, "w")

html = br.response().read()
soup = BeautifulSoup(html)
links = soup.findAll('a', href=re.compile("Players"))
for link in links:
    print link['href']

FILE.writelines(html)
It retrieves all the links whose href contains the word Players.
Now the HTML I need to parse looks something like this:
<tr>
    <td>1</td>
    <td>Federer, Roger (SUI)</td>
    <td>10,550</td>
    <td>0</td>
    <td>19</td>
</tr>
The 1 in the first cell is the rank of the player.
I would like to be able to retrieve this data in a dictionary:
rank
name of the player
link to the detailed page (here /Tennis/Players/Top-Players/Roger-Federer.aspx)
Could you give me some pointers, or if this is easy enough, help me build the piece of code? I am not sure how to formulate the request in Beautiful Soup.
Anthony
Searching for the players using your method will work, but it will return three results per player. It's easier to search for the table itself, and then iterate over the rows (skipping the header):
table = soup.find('table', 'bioTableAlt')
for row in table.findAll('tr')[1:]:
    cells = row.findAll('td')
    # retrieve data from cells...
To get the data you need:
rank = cells[0].string
player = cells[1].a.string
link = cells[1].a['href']
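Putting it together, a minimal sketch (assuming the bioTableAlt layout above holds) that collects each player into the kind of dictionary you described:

players = []
for row in table.findAll('tr')[1:]:
    cells = row.findAll('td')
    players.append({
        'rank': cells[0].string,
        'name': cells[1].a.string,
        'link': cells[1].a['href'],  # e.g. /Tennis/Players/Top-Players/Roger-Federer.aspx
    })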
