Cannot download certain texts from website - python

I'm trying to download all the last statements from the death row website. The basic outline is:
1. The info from the site gets imported into an SQLite database, prison.sqlite.
2. Based on the names in the table, I generate a unique URL for each name to get their last statements.
3. The program checks each generated URL; if the URL is OK, it looks for the last statement. That statement then gets saved to the database prison.sqlite (still to do).
This is my code:
import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string

URLS = ["http://www.tdcj.state.tx.us/death_row/dr_info/hernandezramontorreslast.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/garciafrankmlast.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/moselydaroycelast.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999288.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/hernandezadophlast.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/carterrobertanthonylast.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/livingstoncharleslast.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/gentrykennethlast.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/gentrykennethlast.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/wilkersonrichardlast.html",
        "http://www.tdcj.state.tx.us/death_row/dr_info/hererraleonellast.html"]

conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( Execution text, link1 text, Statements text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()

csvfile = open("prisonfile.csv", "rb")
creader = csv.reader(csvfile, delimiter=",")

for t in creader:
    cur.execute('INSERT INTO Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t)

for column in cur.execute("SELECT LastName, Firstname FROM prison"):
    lastname = column[0]
    firstname = column[1]
    name = lastname + firstname
    CleanName = name.translate(None, ",.!-#'#$" "")
    CleanName = CleanName.replace(" ", "")
    CleanName = CleanName.replace("III", "")
    CleanName = re.sub("Sr", "", CleanName)
    CleanName = re.sub("Jr", "", CleanName)
    CleanName = CleanName.lower()
    Baseurl = "http://www.tdcj.state.tx.us/death_row/dr_info/"
    Link = Baseurl + CleanName + "last.html"
    URLS.append(Link)

for Link in URLS:
    try:
        r = requests.get(Link)
        r.raise_for_status()
        print "URL OK", Link
        document = urllib2.urlopen(Link)
        html = document.read()
        soup = BeautifulSoup(html)
        Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
        print Statement
        continue
    except requests.exceptions.HTTPError as err:
        print err
        print "Offender has made no statement.", Link
        #cur.execute("INSERT OR IGNORE INTO prison(Statements) VALUES(?)"), (Statement, )

csvfile.close()
conn.commit()
conn.close()
When running the program I get:
C:\python>prison.py
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/hernandezramontorreslast.html
Can you hear me? Did I ever tell you, you have dad's eyes? I've noticed that in the last couple of days. I'm sorry for putting you through all this. Tell everyone I love them. It was good seeing the kids. I love them all; tell mom, everybody. I am very sorry for all of the pain. Tell Brenda I love her. To everybody back on the row, I know you're going through a lot over there. Keep fighting, don't give up everybody.
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/garciafrankmlast.html
Thank you, Jesus Christ. Thank you for your blessing. You are above the president. And know it is you, Jesus Christ, that is performing this miracle in my life. Hallelujah, Holy, Holy, Holy. For this reason I was born and raised. Thank you for this, my God is a God of Salvation. Only through you, Jesus Christ, people will see that you're still on the throne. Hallelujah, Holy, Holy, Holy. I invoke Your name. Thank you, Yahweh, thank you Jesus Christ. Hallelujah, Amen. Thank you, Warden.
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html
Traceback (most recent call last):
File "C:\python\prison.py", line 60, in <module>
Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
AttributeError: 'NoneType' object has no attribute 'findNext'
The first two statements print fine, but after that the program crashes. Looking at the page source of the URL where the error occurs, I see (only the relevant part shown):
<div class="return_to_div"></div>
<h1>Offender Information</h1>
<h2>Last Statement</h2>
<p class="text_bold">Date of Execution:</p>
<p> February 4, 2009</p>
<p class="text_bold"> Offender:</p>
<p> Martinez, David</p>
<p class="text_bold"> Last Statement:</p>
<p> Yes, nothing I can say can change the past. I am asking for forgiveness. Saying sorry is not going to change anything. I hope one day you can find peace. I am sorry for all of the pain that I have caused you for all those years. There is nothing else I can say, that can help you. Mija, I love you. Sis, Cynthia, and Sandy, keep on going and it will be O.K. I am sorry to put you through this as well. I can't change the past. I hope you find peace and know that I love you. I am sorry. I am sorry and I can't change it. </p>
What could be causing this issue? Do I have to change something on this line?
Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
Feel free to share improvements to my code. Right now I want to get everything working before I make it more robust.
For the people wondering about the list of URLs: it is there because of some inconsistencies on the death row site. Sometimes the URL differs from [lastname][firstname]last.html, so I added those manually for now.
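In case it helps, this is the kind of guard I could put around that line while debugging. It is only a sketch, based on the assumption that find() returns None for some pages (for example because the label text differs slightly), and I have not verified it against every page:

label = soup.find(text=re.compile("Last Statement:"))  # regex is tolerant of stray whitespace
if label is None:
    print "No 'Last Statement:' label found on", Link
else:
    Statement = label.findNext('p').contents[0]
    print Statement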

Related

Python: BeautifulSoup Scrape, Blank Descriptions For Courses Messing Up Data

I'm trying to scrape some course data from the site https://bulletins.psu.edu/university-course-descriptions/undergraduate/ for a project.
# -*- coding: utf-8 -*-
"""
Created on Mon Nov 5 20:37:33 2018
#author: DazedFury
"""
# Here, we're just importing both Beautiful Soup and the Requests library
from bs4 import BeautifulSoup
import requests

# returns a CloudflareScraper instance
#scraper = cfscrape.create_scraper()

#URL and textfile
text_file = open("Output.txt", "w", encoding='UTF-8')
page_link = 'https://bulletins.psu.edu/university-course-descriptions/undergraduate/acctg/'
page_response = requests.get(page_link)
page_content = BeautifulSoup(page_response.content, "html.parser")

#Array for storing URL's
URLArray = []

#Find links
for link in page_content.find_all('a'):
    if('/university-course-descriptions/undergraduate' in link.get('href')):
        URLArray.append(link.get('href'))

k = 1
#Parse Loop
while(k != 242):
    print("Writing " + str(k))
    completeURL = 'https://bulletins.psu.edu' + URLArray[k]

    # this is the url that we've already determined is safe and legal to scrape from.
    page_link = completeURL

    # here, we fetch the content from the url, using the requests library
    page_response = requests.get(page_link)

    #we use the html parser to parse the url content and store it in a variable.
    page_content = BeautifulSoup(page_response.content, "html.parser")
    page_content.prettify

    #Find and print all text with tag p
    paragraphs = page_content.find_all('div', {'class' : 'course_codetitle'})
    paragraphs2 = page_content.find_all('div', {'class' : 'courseblockdesc'})

    j = 0
    for i in range(len(paragraphs)):
        if i % 2 == 0:
            text_file.write(paragraphs[i].get_text())
            text_file.write("\n")
            if j < len(paragraphs2):
                text_file.write(" ".join(paragraphs2[j].get_text().split()))
                text_file.write("\n")
            text_file.write("\n")
            if(paragraphs2[j].get_text() != ""):
                j += 1
    k += 1

#FORMAT
#text_file.write("<p style=\"page-break-after: always;\"> </p>")
#text_file.write("\n\n")

#Close Text File
text_file.close()
The specific info I need is the course title and the description. The problem is that some of the courses have blank descriptions, which messes up the order and gives bad data.
I thought about just checking whether the course description is blank, but on the site the 'courseblockdesc' tag doesn't exist if the course has no description. Therefore, when I find_all on courseblockdesc, the list doesn't actually get an element for those courses, so the order ends up messed up. There are too many of these to fix manually, so I was hoping someone could help me find a solution.
The simplest solution would be to do a single find_all over the parent blocks of the items you are looking for, then pull each piece out of its parent:

for block in page_content.find_all('div', class_="courseblock"):
    title = block.find('div', {'class' : 'course_codetitle'})
    description = block.find('div', {'class' : 'courseblockdesc'})
    # do what you need with the navigable strings here.
    print(title.get_text())
    if description:
        print(description.get_text())
You may be over-complicating the procedure somewhat, but you're certainly on the right track. Instead of storing the information in an array and relying on all of the indexes to line up, write the text file as you traverse the courses, pulling title and description dynamically from each course block. If a block doesn't have a description, you can handle that on the spot. Here's a working example:
from bs4 import BeautifulSoup
import requests

url = "https://bulletins.psu.edu/university-course-descriptions/undergraduate/acctg/"

with open("out.txt", "w", encoding="UTF-8") as f:
    for link in BeautifulSoup(requests.get(url).content, "html.parser").find_all("a"):
        if "/university-course-descriptions/undergraduate" in link["href"]:
            soup = BeautifulSoup(requests.get("https://bulletins.psu.edu" + link["href"]).content, "html.parser")
            for course in soup.find_all("div", {"class": "courseblock"}):
                title = course.find("div", {"class" : "course_title"}).get_text().strip()
                try:
                    desc = course.find("div", {"class" : "courseblockdesc"}).get_text().strip()
                except AttributeError:
                    desc = "No description available"
                f.write(title + "\n" + desc + "\n\n")
Output snippet (from end of text file to validate alignment):
WLED 495: **SPECIAL TOPICS**
No description available
WLED 495B: Field Experience for World Languages Teacher Preparation in Grades 1-5
WL ED 495B Field Experience for World Languages Teacher Preparation in Grades 1-5 (3) Practicum situation where Prospective World Language teachers will demonstrate acquired knowledge on second language learning/teaching and educational theories. Prospective World Language teachers will have assigned school placements and will attend a weekly seminar where issues in World Language learning and teaching will be discussed. At their assigned school placement, prospective World Language teachers will have many opportunities to observe/work with children in grades 1-5 (1) focusing on second language learning/teaching and the socio/cultural issues associated to classroom practices while implementing and self-evaluated own designed activities and lessons; (2) weekly seminars will engage students in reflective activities that will enable them to analyze each week's events; (3) inquiry projects on teaching and learning of World Languages.
WLED 495C: Field Experience for World Languages Teacher Preparation in Grades 6-12
WL ED 495C Field Experience for World Languages Teacher Preparation in Grades 6-12 (3) Practicum situation where prospective World Language teachers will demonstrate acquired knowledge on second language learning/teaching and educational theories. Prospective World Language teachers will have assigned school placements in grades 6-12 and will attend a weekly seminar where issues in World Language learning and teaching will be discussed. At their assigned school placement, prospective World Language teachers will have many opportunities to observe/work with students in grades 6-12 (1) focusing on second language learning/teaching and the socio/cultural issues associated to classroom practices while implementing and self-evaluating their own designed activities and lessons, (2) weekly seminars will engage students in reflective activities that will enable them to analyze each week's events, and (3) inquiry projects on teaching and learning of World Languages.
Additional minor remarks:
It's a good idea to use the with keyword for file I/O. This will automatically close the file handle when done.
Verbose intermediate variables and comments that add noise like:
# Here, we're just importing both Beautiful Soup and the Requests library
from bs4 import BeautifulSoup
or
#Close Text File
text_file.close()
can always be removed, making the program logic easier to follow.

Extracting CSV from Export Button

I apologize for not being able to give out the specific URL I'm dealing with. I'm trying to extract some data from a certain site, but it's not organized well enough. However, they do have an "Export to CSV" button, and the code for that element is:
<input type="submit" name="ctl00$ContentPlaceHolder1$ExportValueCSVButton" value="Export to Value CSV" id="ContentPlaceHolder1_ExportValueCSVButton" class="smallbutton">
In this type of situation, what's the best way to go about grabbing that data when there is no specific URL to the CSV? I'm using Mechanize and BS4.
If clicking a button downloads the data as a CSV, it sounds like you might be able to fetch that same link with wget, save the file on your machine, and work with it there. I'm not sure if that's what you're getting at here, though; any more details you can offer?
You should try Selenium. Selenium is a suite of tools for automating web browsers across many platforms, and it can do a lot of things, including clicking buttons.
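For instance, a minimal sketch might look like the following; the URL is a placeholder (the real one wasn't shared), and the button id comes from the HTML you posted:

from selenium import webdriver

driver = webdriver.Firefox()                        # or webdriver.Chrome(), etc.
driver.get("http://example.com/the-report-page")    # placeholder URL

# Click the export button by the id shown in the question's markup.
driver.find_element_by_id("ContentPlaceHolder1_ExportValueCSVButton").click()

# The browser saves the CSV to its default download directory;
# pick the file up from there and parse it as usual.
driver.quit()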
Well, you need SOME starting URL to feed br.open() to even start the process.
It appears that you have an aspnetForm type control there and the below code MAY serve as a bit of a starting point, even though it does not work as-is (it's a work in progress...:-).
You'll need to look at the headers and parameters via the network tab of your browser dev tools to see them.
# Fragment: assumes mechanize's br and BeautifulSoup's BS are already set up,
# and that this runs inside a loop over `letter`.
br.open("http://media.ethics.ga.gov/search/Lobbyist/Lobbyist_results.aspx?&Year=2016&LastName="+letter+"&FirstName=&City=&FilerID=")
soup = BS(br.response().read())
table = soup.find("table", { "id" : "ctl00_ContentPlaceHolder1_Results" }) # Need to add error check here...
if table is None: # No lobbyist with last name starting with 'X' :-)
    continue
records = table.find_all('tr') # List of all results for this letter

for form in br.forms():
    print "Form name:", form.name
    print form

for row in records:
    rec_print = ""
    span = row.find_all('span', 'lblentry', 'value')
    for sname in span:
        if ',' in sname.get_text(): # They actually have a field named 'comma'!!
            continue
        rec_print = rec_print + sname.get_text() + "," # Create comma-delimited output
    print(rec_print[:-1]) # Strip final comma
    lnk = row.find('a', 'lblentrylink')
    if lnk is None: # For some reason, first record is blank.
        continue
    print("Lnk: ", lnk)
    newlnk = lnk['id']
    print("NEWLNK: ", newlnk)
    newstr = lnk['href']
    newctl = newstr[+25:-5] # Matching placeholder (strip javascript....)
    br.select_form('aspnetForm') # Tried (nr=0) also...
    print("NEWCTL: ", newctl)
    br[__EVENTTARGET] = newctl
    response = br.submit(name=newlnk).read()

scrape text from webpage using python 2.7

I'm trying to scrape data from this website:
Death Row Information
I'm having trouble scraping the last statements of all the executed offenders in the list because each last statement is located on a separate HTML page. The URL for those pages is built like this: http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname].html. I can't think of a way to scrape the last statements from these pages and put them in an SQLite database.
All the other info (except for "offender information", which I don't need) is already in my database.
Can anyone give me a pointer on how to get started on this in Python?
Thanks
Edit2: I got a little bit further:
import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string

URLS = []
Lastwords = {}

conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()

# Make some fresh tables using executescript()
cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( link1 text, link2 text,Execution text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()

csvfile = open("prisonfile.csv", "rb")
creader = csv.reader(csvfile, delimiter=",")

for t in creader:
    cur.execute('INSERT INTO Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t)

for column in cur.execute("SELECT LastName, Firstname FROM prison"):
    lastname = column[0].lower()
    firstname = column[1].lower()
    name = lastname + firstname
    CleanName = name.translate(None, ",.!-#'#$" "")
    CleanName2 = CleanName.replace(" ", "")
    Url = "http://www.tdcj.state.tx.us/death_row/dr_info/"
    Link = Url + CleanName2 + "last.html"
    URLS.append(Link)

for URL in URLS:
    try:
        page = urllib2.urlopen(URL)
    except URLError, e:
        if e.code == 404:
            continue
    soup = BeautifulSoup(page.read())
    statements = soup.findAll('p', { "class" : "Last Statement:" })
    print statements

csvfile.close()
conn.commit()
conn.close()
The code is messy, I know; once everything works I will clean it up. One problem, though: I'm trying to get all the statements by using soup.findAll, but I cannot seem to get the class right. The relevant part of the page source looks like this:
<p class="text_bold">Last Statement:</p>
<p>I don't have anything to say, you can proceed Warden Jones.</p>
However, the output of my program:
[]
[]
[]
...
What could be the problem exactly?
I will not write code that solves the problem, but will give you a simple plan for how to do it yourself:
You know that each last statement is located at the URL:
http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname]last.html
You say you already have all the other information. This presumably includes the list of executed prisoners. So you should generate a list of names in your python code. This will allow you to generate the URL to get to each page you need to get to.
Then make a For loop that iterates over each URL using the format I posted above.
Within the body of this for loop, write code to read the page and get the last statement. The last statement on each page is in the same format on each page, so you can use parsing to capture the part that you want:
<p class="text_bold">Last Statement:</p>
<p>D.J., Laurie, Dr. Wheat, about all I can say is goodbye, and for all the rest of you, although you don’t forgive me for my transgressions, I forgive yours against me. I am ready to begin my journey and that’s all I have to say.</p>
Once you have your list of last statements, you can push them to SQL.
So your code will look like this:
import urllib2

# Make a list of names ('Last1First1', 'Last2First2', 'Last3First3', ...)
names = #some_call_to_your_database

# Make a list of URLs to each inmate's last words page
# ('URL...Last1First1last.html', 'URL...Last2First2last.html', ...)
URLS = () # made from the 'names' list above

# Create a dictionary to hold all the last words:
LastWords = {}

# Iterate over each individual page
for eachURL in URLS:
    response = urllib2.urlopen(eachURL)
    html = response.read()

    ## Some prisoners had no last words, so those URLs will 404.
    if ...: # Handle those 404s here

    ## Code to parse the response, hunting specifically
    ## for the code block I mentioned above. Once you have the
    ## last words as a string, save to dictionary:
    LastWords['LastFirst'] = "LastFirst's last words."

# Now LastWords is a dictionary with all the last words!
# Write some more code to push the content of LastWords
# to your SQL database.
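If it helps with the parsing step described above, here is a rough sketch of just that fragment, using BeautifulSoup 3 to match your imports; it assumes the markup quoted earlier and is not tested against every offender page:

import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
# A regex is more forgiving than an exact string match if the label has extra whitespace.
label = soup.find(text=re.compile("Last Statement:"))
if label is not None:
    LastWords['LastFirst'] = label.findNext('p').contents[0]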

Dictionary / JSON issue using Python 2.7

I'm looking at scraping some data from Facebook using Python 2.7. My code basically increments the Facebook profile ID by 1 and then captures the details returned by each page.
An example of the page I'm looking to capture the data from is graph.facebook.com/4.
Here's my code below:
import scraperwiki
import urlparse
import simplejson

source_url = "http://graph.facebook.com/"
profile_id = 1

while True:
    try:
        profile_id += 1
        profile_url = urlparse.urljoin(source_url, str(profile_id))
        results_json = simplejson.loads(scraperwiki.scrape(profile_url))
        for result in results_json['results']:
            print result
            data = {}
            data['id'] = result['id']
            data['name'] = result['name']
            data['first_name'] = result['first_name']
            data['last_name'] = result['last_name']
            data['link'] = result['link']
            data['username'] = result['username']
            data['gender'] = result['gender']
            data['locale'] = result['locale']
            print data['id'], data['name']
            scraperwiki.sqlite.save(unique_keys=['id'], data=data)
            #time.sleep(3)
    except:
        continue
        profile_id += 1
I am using the scraperwiki site to run this, but no data is printed back to the console, even though the line print data['id'], data['name'] is there just to check that the code is working.
Any suggestions on what is wrong with this code? As said, for each returned profile, the unique data should be captured, printed to screen, and saved into the SQLite database.
Thanks
Any suggestions on what is wrong with this code?
Yes. You are swallowing all of your errors. There could be a huge number of things going wrong in the block under try. If anything goes wrong in that block, you move on without printing anything.
You should only ever use a try / except block when you are looking to handle a specific error.
modify your code so that it looks like this:
while True:
    profile_id += 1
    profile_url = urlparse.urljoin(source_url, str(profile_id))
    results_json = simplejson.loads(scraperwiki.scrape(profile_url))
    for result in results_json['results']:
        print result
        data = {}
        # ... more ...
and then you will get detailed error messages when specific things go wrong.
As for your concern in the comments:
The reason I have the error handling is because, if you look for
example at graph.facebook.com/3, this page contains no user data and
so I don't want to collate this info and skip to the next user, ie. no
4 etc
If you want to handle the case where there is no data, then find a way to handle that case specifically. It is bad practice to swallow all errors.
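As a sketch of what handling that case specifically could look like: the assumption here, which you should verify, is that a profile with no public data simply lacks the fields you want, so a missing key is the one condition to handle:

import urlparse
import simplejson
import scraperwiki

source_url = "http://graph.facebook.com/"
profile_id = 4

profile_url = urlparse.urljoin(source_url, str(profile_id))
profile = simplejson.loads(scraperwiki.scrape(profile_url))

try:
    data = {'id': profile['id'], 'name': profile['name']}
except KeyError:
    # No usable data for this profile -- skip it and move on.
    print 'No data for profile', profile_id
else:
    print data['id'], data['name']
    scraperwiki.sqlite.save(unique_keys=['id'], data=data)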

What is the right way to handle errors?

My script below scrapes a website and returns the data from a table. It's not finished but it works. The problem is that it has no error checking. Where should I have error handling in my script?
There are no unit tests. Should I write some and schedule them to run periodically? Or should the error handling be done in my script?
Any advice on the proper way to do this would be great.
#!/usr/bin/env python
''' Gets the Canadian Monthly Residential Bill Calculations table
from URL and saves the results to a sqlite database.
'''
import urllib2
from BeautifulSoup import BeautifulSoup


class Bills():
    ''' Canadian Monthly Residential Bill Calculations '''

    URL = "http://www.hydro.mb.ca/regulatory_affairs/energy_rates/electricity/utility_rate_comp.shtml"

    def __init__(self):
        ''' Initialization '''
        self.url = self.URL
        self.data = []
        self.get_monthly_residential_bills(self.url)

    def get_monthly_residential_bills(self, url):
        ''' Gets the Monthly Residential Bill Calculations table from URL '''
        doc = urllib2.urlopen(url)
        soup = BeautifulSoup(doc)
        res_table = soup.table.th.findParents()[1]
        results = res_table.findNextSibling()
        header = self.get_column_names(res_table)
        self.get_data(results)
        self.save(header, self.data)

    def get_data(self, results):
        ''' Extracts data from search result. '''
        rows = results.childGenerator()
        data = []
        for row in rows:
            if row == "\n":
                continue
            for td in row.contents:
                if td == "\n":
                    continue
                data.append(td.text)
            self.data.append(tuple(data))
            data = []

    def get_column_names(self, table):
        ''' Gets table title, subtitle and column names '''
        results = table.findAll('tr')
        title = results[0].text
        subtitle = results[1].text
        cols = results[2].childGenerator()
        column_names = []
        for col in cols:
            if col == "\n":
                continue
            column_names.append(col.text)
        return title, subtitle, column_names

    def save(self, header, data):
        pass


if __name__ == '__main__':
    a = Bills()
    for td in a.data:
        print td
See the documentation of the functions you call and check what exceptions they throw.
For example, urllib2.urlopen() is documented to raise URLError on errors, which is a subclass of IOError.
So, for urlopen(), you could do something like:
try:
    doc = urllib2.urlopen(url)
except IOError:
    print >> sys.stderr, 'Error opening URL'  # needs "import sys" at the top of the script
Similarly, do the same for the others.
You should write unit tests and you should use exception handling. But only catch the exceptions you can handle; you do no one any favors by catching everything and throwing any useful information out.
Unit tests aren't run periodically though; they're run before and after the code changes (although it is feasible for one change's "after" to become another change's "before" if they're close enough).
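To make that concrete, a unit test for the table-parsing method could run against a small, fixed HTML snippet instead of the live page. This is only a sketch; the module name bills and the fixture are invented for illustration:

import unittest
from BeautifulSoup import BeautifulSoup

from bills import Bills   # assumes the script above is saved as bills.py

FIXTURE = ("<table>"
           "<tr><th>Title</th></tr>"
           "<tr><th>Subtitle</th></tr>"
           "<tr><th>City</th><th>Bill</th></tr>"
           "</table>")

class TestableBills(Bills):
    def __init__(self):
        # Override __init__ so the test never touches the network.
        self.data = []

class GetColumnNamesTest(unittest.TestCase):
    def test_title_subtitle_and_columns(self):
        table = BeautifulSoup(FIXTURE).table
        title, subtitle, columns = TestableBills().get_column_names(table)
        self.assertEqual(title, u'Title')
        self.assertEqual(subtitle, u'Subtitle')
        self.assertEqual(columns, [u'City', u'Bill'])

if __name__ == '__main__':
    unittest.main()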
A couple of places where you need them is when importing things like tkinter:

try:
    import Tkinter as tk
except ImportError:
    import tkinter as tk

Also anywhere the user enters something with an intended type. A good way to figure this out is to run the script and try really hard to make it crash, e.g. by typing in the wrong type.
The answer to "where should I have error handling in my script?" is basically "any place where something could go wrong", which depends entirely on the logic of your program.
In general, any place where your program relies on an assumption that a particular operation worked as you intended, and there's a possibility that it may not have, you should add code to check whether or not it actually did work, and take appropriate remedial action if it didn't. In some cases, the underlying code might generate an exception on failure and you may be happy to just let the program terminate with an uncaught exception without adding any error-handling code of your own, but (1) this would be, or ought to be, rare if anyone other than you is ever going to use that program; and (2) I'd say this would fall into the "works as intended" category anyway.
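To make that concrete for the script above, here is one way its riskiest method could check its own assumptions. This is only a sketch of a rewritten get_monthly_residential_bills for the Bills class (it guards just the network fetch and the table lookup, and needs import sys added at the top of the script), not the definitive approach:

    def get_monthly_residential_bills(self, url):
        ''' Gets the Monthly Residential Bill Calculations table from URL. '''
        try:
            doc = urllib2.urlopen(url)
        except urllib2.URLError, err:
            print >> sys.stderr, 'Could not fetch %s: %s' % (url, err)
            return
        soup = BeautifulSoup(doc)
        if soup.table is None or soup.table.th is None:
            # The page layout changed, or the fetch returned something unexpected.
            print >> sys.stderr, 'Expected table not found at %s' % url
            return
        res_table = soup.table.th.findParents()[1]
        results = res_table.findNextSibling()
        header = self.get_column_names(res_table)
        self.get_data(results)
        self.save(header, self.data)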
