I'm trying to scrape data from this website:
Death Row Information
I'm having trouble scraping the last statements of all the executed offenders in the list, because each last statement is located on a separate HTML page. The URL is built like this: http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname]last.html. I can't think of a way to scrape the last statements from these pages and put them in an SQLite database.
All the other info (except for "offender information", which I don't need) is already in my database.
Can anyone give me a pointer to get started on this in Python?
Thanks
Edit2: I got a little bit further:
import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string
URLS = []
Lastwords = {}
conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()
# Make some fresh tables using executescript()
cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( link1 text, link2 text,Execution text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()
csvfile = open("prisonfile.csv","rb")
creader = csv.reader(csvfile, delimiter = ",")
for t in creader:
    cur.execute('INSERT INTO Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t, )
for column in cur.execute("SELECT LastName, Firstname FROM prison"):
    lastname = column[0].lower()
    firstname = column[1].lower()
    name = lastname+firstname
    CleanName = name.translate(None, ",.!-#'#$" "")
    CleanName2 = CleanName.replace(" ", "")
    Url = "http://www.tdcj.state.tx.us/death_row/dr_info/"
    Link = Url+CleanName2+"last.html"
    URLS.append(Link)
for URL in URLS:
    try:
        page = urllib2.urlopen(URL)
    except URLError, e:
        if e.code ==404:
            continue
    soup = BeautifulSoup(page.read())
    statements = soup.findAll ('p',{ "class" : "Last Statement:" })
    print statements
csvfile.close()
conn.commit()
conn.close()
The code is messy, I know; I will clean it up once everything works. One problem, though: I'm trying to get all the statements using soup.findAll, but I cannot seem to get the class right. The relevant part of the page source looks like this:
<p class="text_bold">Last Statement:</p>
<p>I don't have anything to say, you can proceed Warden Jones.</p>
However, the output of my program:
[]
[]
[]
...
What could be the problem exactly?
I will not write code that solves the problem, but will give you a simple plan for how to do it yourself:
You know that each last statement is located at the URL:
http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname]last.html
You say you already have all the other information. This presumably includes the list of executed prisoners, so you can generate a list of names in your Python code. From that list you can build the URL of each page you need.
Then write a for loop that iterates over each URL, using the format I posted above.
Within the body of this for loop, write code to read the page and get the last statement. The last statement is in the same format on every page, so you can parse out the part that you want:
<p class="text_bold">Last Statement:</p>
<p>D.J., Laurie, Dr. Wheat, about all I can say is goodbye, and for all the rest of you, although you don’t forgive me for my transgressions, I forgive yours against me. I am ready to begin my journey and that’s all I have to say.</p>
Once you have your list of last statements, you can push them to SQL.
So your code will look like this:
import urllib2

# Make a list of names ('Last1First1', 'Last2First2', 'Last3First3', ...)
names = []  # fill this from some call to your database

# Make a list of URLs to each inmate's last words page
# ('URL...Last1First1last.html', 'URL...Last2First2last.html', ...)
URLS = []  # made from the 'names' list above

# Create a dictionary to hold all the last words:
LastWords = {}

# Iterate over each individual page
for eachURL in URLS:
    try:
        response = urllib2.urlopen(eachURL)
    except urllib2.HTTPError:
        ## Some prisoners had no last words, so those URLs will 404.
        continue  # Handle those 404s here
    html = response.read()

    ## Code to parse the response, hunting specifically
    ## for the code block I mentioned above. Once you have the
    ## last words as a string, save to dictionary:
    LastWords['LastFirst'] = "LastFirst's last words."

# Now LastWords is a dictionary with all the last words!
# Write some more code to push the content of LastWords
# to your SQL database.
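The final step, pushing the dictionary to the database, could look roughly like the sketch below. It is written for Python 3 and makes some assumptions: the table name LastStatements and its two columns are hypothetical (the Prison table in the question has no column for statements), and the dictionary keys are assumed to be the same cleaned name strings used to build the URLs.

```python
import sqlite3


def save_last_words(conn, last_words):
    """Store a {name_key: statement} dict in a (hypothetical) LastStatements table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS LastStatements "
        "(name TEXT PRIMARY KEY, statement TEXT)"
    )
    # INSERT OR REPLACE keeps one row per name, so re-running the scraper
    # overwrites earlier results instead of duplicating them.
    conn.executemany(
        "INSERT OR REPLACE INTO LastStatements (name, statement) VALUES (?, ?)",
        last_words.items(),
    )
    conn.commit()
```

Calling save_last_words(sqlite3.connect('prison.sqlite'), LastWords) after the loop would write everything in one go. Using the name as the primary key is a simplification; if two offenders share a name, a key such as the TDCJ number would be safer.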
I have developed a webscraper with beautiful soup that scrapes news from a website and then sends them to a telegram bot. Every time the program runs it picks up all the news currently on the news web page, and I want it to just pick the new entries on the news and send only those.
How can I do this? Should I use a sorting algorithm of some sort?
Here is the code:
#Lib requests
import requests
import bs4
fonte = requests.get('https://www.noticiasaominuto.com/')
soup = bs4.BeautifulSoup(fonte.text, 'lxml')
body = soup.body
for paragrafo in body.find_all('p', class_='article-thumb-text'):
    print(paragrafo.text)
    conteudo = paragrafo.text
    id = requests.get('https://api.telegram.org/bot<TOKEN>/getUpdates')
    chat_id = id.json()['result'][0]['message']['from']['id']
    print(chat_id)
    msg = requests.post('https://api.telegram.org/bot<TOKEN>/sendMessage', data = {'chat_id': chat_id ,'text' : conteudo})
You need to keep track of articles that you have seen before, either by using a full database solution or by simply saving the information in a file. The file needs to be read before starting. The website is then scraped and compared against the existing list. Any articles not in the list are added to the list. At the end, the updated list is saved back to the file.
Rather than storing the whole text in the file, a hash of the text can be saved instead, i.e. the text is converted into a fixed-size number. Here a hex digest is used to make it easier to save to a text file. Different texts will, for practical purposes, give different hashes, and the hashes can be stored in a Python set to speed up the checking:
import hashlib
import requests
import bs4
import os
# Read in hashes of past articles
db = 'past.txt'
if os.path.exists(db):
    with open(db) as f_past:
        past_articles = set(f_past.read().splitlines())
else:
    past_articles = set()
fonte = requests.get('https://www.noticiasaominuto.com/')
soup = bs4.BeautifulSoup(fonte.text, 'lxml')
for paragrafo in soup.body.find_all('p', class_='article-thumb-text'):
    m = hashlib.md5(paragrafo.text.encode('utf-8'))
    if m.hexdigest() not in past_articles:
        print('New {} - {}'.format(m.hexdigest(), paragrafo.text))
        past_articles.add(m.hexdigest())
        # ...Update telegram here...
# Write updated hashes back to the file
with open(db, 'w') as f_past:
    f_past.write('\n'.join(past_articles))
The first time this is run, all articles will be displayed. The next time, no articles will be displayed until the website is updated.
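For the "full database solution" mentioned above, one possible shape is a small SQLite table of digests. This is only a sketch in Python 3: the table name seen and the helper names open_store and is_new are made up for illustration, and the MD5 digest is kept simply to match the file-based version.

```python
import hashlib
import sqlite3


def open_store(path):
    """Open (or create) the database that remembers already-seen articles."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (digest TEXT PRIMARY KEY)")
    return conn


def is_new(conn, text):
    """Record the article text; return True only the first time it is seen."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    # The primary key makes a repeated insert a no-op, which leaves rowcount at 0.
    cur = conn.execute("INSERT OR IGNORE INTO seen (digest) VALUES (?)", (digest,))
    conn.commit()
    return cur.rowcount == 1
```

Inside the scraping loop, the Telegram message would then be sent only when is_new(conn, paragrafo.text) returns True. Compared with the text file, nothing has to be rewritten at the end of the run, because each digest is stored as soon as it is seen.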
I apologize for not being able to give out the specific URL I'm dealing with. I'm trying to extract some data from a certain site, but it's not organized well enough. However, they do have an "Export To CSV file" button, and the code for that block is ...
<input type="submit" name="ctl00$ContentPlaceHolder1$ExportValueCSVButton" value="Export to Value CSV" id="ContentPlaceHolder1_ExportValueCSVButton" class="smallbutton">
In this type of situation, what's the best way to go about grabbing that data when there is no specific URL for the CSV? I'm using Mechanize and BS4.
If you're able to click a button that downloads the data as a CSV, it sounds like you might be able to fetch that data with wget, save it on your machine, and work with it there. I'm not sure if that's what you're getting at here, though. Are there any more details you can offer?
You should try Selenium. Selenium is a suite of tools for automating web browsers across many platforms. It can do a lot of things, including clicking buttons.
Well, you need SOME starting URL to feed br.open() to even start the process.
It appears that you have an aspnetForm type control there and the below code MAY serve as a bit of a starting point, even though it does not work as-is (it's a work in progress...:-).
You'll need to look at the headers and parameters via the network tab of your browser dev tools to see them.
# Excerpt: this runs inside a loop over `letter` (the loop header is not shown).
br.open("http://media.ethics.ga.gov/search/Lobbyist/Lobbyist_results.aspx?&Year=2016&LastName="+letter+"&FirstName=&City=&FilerID=")
soup = BS(br.response().read())
table = soup.find("table", { "id" : "ctl00_ContentPlaceHolder1_Results" }) # Need to add error check here...
if table is None: # No lobbyist with last name starting with 'X' :-)
    continue
records = table.find_all('tr') # List of all results for this letter
for form in br.forms():
    print "Form name:", form.name
    print form
for row in records:
    rec_print = ""
    span = row.find_all('span', 'lblentry', 'value')
    for sname in span:
        if ',' in sname.get_text(): # They actually have a field named 'comma'!!
            continue
        rec_print = rec_print + sname.get_text() + "," # Create comma-delimited output
    print(rec_print[:-1]) # Strip final comma
    lnk = row.find('a', 'lblentrylink')
    if lnk is None: # For some reason, first record is blank.
        continue
    print("Lnk: ", lnk)
    newlnk = lnk['id']
    print("NEWLNK: ", newlnk)
    newstr = lnk['href']
    newctl = newstr[+25:-5] # Matching placeholder (strip javascript....)
    br.select_form('aspnetForm') # Tried (nr=0) also...
    print("NEWCTL: ", newctl)
    br['__EVENTTARGET'] = newctl # The control name is a string key
    response = br.submit(name=newlnk).read()
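Independently of Mechanize, the request that such an export button triggers can be pictured as an ordinary form POST. The sketch below (Python 3, standard library only) collects the hidden inputs of a page and adds the button's own name and value, which is what a browser sends when a submit button is clicked. That the site in question accepts such a request without further cookies or headers is an assumption, and HiddenFieldParser and build_export_payload are illustrative names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_export_payload(page_html, button_name, button_value):
    """Return URL-encoded POST data that mimics clicking the given submit button."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)          # e.g. __VIEWSTATE and similar fields
    fields[button_name] = button_value    # the clicked button itself
    return urlencode(fields).encode("ascii")
```

The resulting bytes could be passed as the data argument of urllib.request.urlopen together with the page's own URL. The network tab of the browser dev tools shows which fields the real request carries, so the payload can be compared against it.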
I'm trying to download all the last statements from the Death Row website. The basic outline is like this:
1. The info from the site gets imported into an SQLite database, prison.sqlite.
2. Based on the names in the table, I generate a unique URL for each name, to get their last statements.
3. The program checks each generated URL. If the URL is OK, it checks for the last statement. This statement gets saved to the database prison.sqlite (still to do).
This is my code:
import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string
URLS = ["http://www.tdcj.state.tx.us/death_row/dr_info/hernandezramontorreslast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/garciafrankmlast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/moselydaroycelast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999288.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/hernandezadophlast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/carterrobertanthonylast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/livingstoncharleslast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/gentrykennethlast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/gentrykennethlast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/wilkersonrichardlast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/hererraleonellast.html",]
conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( Execution text, link1 text, Statements text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()
csvfile = open("prisonfile.csv","rb")
creader = csv.reader(csvfile, delimiter = ",")
for t in creader:
    cur.execute('INSERT INTO Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t, )
for column in cur.execute("SELECT LastName, Firstname FROM prison"):
    lastname = column[0]
    firstname = column[1]
    name = lastname+firstname
    CleanName = name.translate(None, ",.!-#'#$" "")
    CleanName = CleanName.replace(" ", "")
    CleanName = CleanName.replace("III","")
    CleanName = re.sub("Sr","",CleanName)
    CleanName = re.sub("Jr","",CleanName)
    CleanName = CleanName.lower()
    Baseurl = "http://www.tdcj.state.tx.us/death_row/dr_info/"
    Link = Baseurl+CleanName+"last.html"
    URLS.append(Link)
for Link in URLS:
    try:
        r = requests.get(Link)
        r.raise_for_status()
        print "URL OK", Link
        document = urllib2.urlopen(Link)
        html = document.read()
        soup = BeautifulSoup(html)
        Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
        print Statement
        continue
    except requests.exceptions.HTTPError as err:
        print err
        print "Offender has made no statement.", Link
    #cur.execute("INSERT OR IGNORE INTO prison(Statements) VALUES(?)"), (Statement, )
csvfile.close()
conn.commit()
conn.close()
When running the program I get:
C:\python>prison.py
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/hernandezramontorreslast.html
Can you hear me? Did I ever tell you, you have dad's eyes? I've noticed that in the last couple of days. I'm sorry for putting you through all this. Tell everyone I love them. It was good seeing the kids. I love them all; tell mom, everybody. I am very sorry for all of the pain. Tell Brenda I love her. To everybody back on the row, I know you're going through a lot over there. Keep fighting, don't give up everybody.
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/garciafrankmlast.html
Thank you, Jesus Christ. Thank you for your blessing. You are above the president. And know it is you, Jesus Christ, that is performing this miracle in my life. Hallelujah, Holy, Holy, Holy. For this reason I was born and raised. Thank you for this, my God is a God of Salvation. Only through you, Jesus Christ, people will see that you're still on the throne. Hallelujah, Holy, Holy, Holy. I invoke Your name. Thank you, Yahweh, thank you Jesus Christ. Hallelujah, Amen. Thank you, Warden.
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html
Traceback (most recent call last):
File "C:\python\prison.py", line 60, in <module>
Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
AttributeError: 'NoneType' object has no attribute 'findNext'
The first two statements are fine, but after that the program crashes. Looking at the page source of the URL where the error occurs, I see:
(only relevant data)
<div class="return_to_div"></div>
<h1>Offender Information</h1>
<h2>Last Statement</h2>
<p class="text_bold">Date of Execution:</p>
<p> February 4, 2009</p>
<p class="text_bold"> Offender:</p>
<p> Martinez, David</p>
<p class="text_bold"> Last Statement:</p>
<p> Yes, nothing I can say can change the past. I am asking for forgiveness. Saying sorry is not going to change anything. I hope one day you can find peace. I am sorry for all of the pain that I have caused you for all those years. There is nothing else I can say, that can help you. Mija, I love you. Sis, Cynthia, and Sandy, keep on going and it will be O.K. I am sorry to put you through this as well. I can't change the past. I hope you find peace and know that I love you. I am sorry. I am sorry and I can't change it. </p>
What could be causing this issue? Do I have to change something on this line?
Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
Feel free to share improvements to my code. Right now I want to get everything working before I will make it more robust.
For the people wondering about the list with URLs in it: it is due to some bugs on the death row site. Sometimes the URL differs from [lastname][firstname]last.html, so I added those manually for now.
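One likely cause, judging from the page source quoted above: on the failing page the label is " Last Statement:" with a leading space, so an exact match on "Last Statement:" finds nothing, find returns None, and the chained findNext call fails. With BeautifulSoup, passing a compiled regular expression such as re.compile("Last Statement:") as the text argument is one way to tolerate the extra whitespace. The same idea is sketched below in Python 3 using only the standard library, so it is a stand-in for the BeautifulSoup call rather than a copy of it; LastStatementParser and extract_last_statement are names invented for this example.

```python
from html.parser import HTMLParser


class LastStatementParser(HTMLParser):
    """Collect the text of the first <p> that follows the 'Last Statement:' label."""

    def __init__(self):
        super().__init__()
        self.in_p = False         # currently inside a <p> element
        self.buffer = []          # text pieces of the current <p>
        self.label_seen = False   # the label paragraph has been passed
        self.statement = None     # stays None if no statement is found

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "p" or not self.in_p:
            return
        self.in_p = False
        text = "".join(self.buffer).strip()   # strip() absorbs stray spaces
        if self.label_seen and self.statement is None:
            self.statement = text
        elif text == "Last Statement:":
            self.label_seen = True


def extract_last_statement(html):
    """Return the last statement as a string, or None if the label is missing."""
    parser = LastStatementParser()
    parser.feed(html)
    parser.close()
    return parser.statement
```

Because the function returns None instead of raising when the label is absent, the calling loop can treat "no statement on this page" as a normal case and move on to the next URL.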
I've got some test code I'm working on. In a separate HTML file, a button onclick event gets the URL of the page and passes it as a variable (jquery_input) to this python script. Python then scrapes the URL and identifies two pieces of data, which it then formats and concatenates together (resulting in the variable lowerCaseJoined). This concatenated variable has a corresponding entry in a MySQL database. With each entry in the db, there is an associated .gif file.
From here, what I'm trying to do is open a connection to the MySQL server and query the concatenated variable against the db to get the associated .gif file.
Once this has been accomplished, I want to print the .gif file as an alert on the webpage.
If I take out the db section of the code (connection, querying), the code runs just fine. Also, I am successfully able to execute the db part of the code independently through the Python shell. However, when the entire code resides in one file, nothing happens when I click the button. I've systematically removed the lines of code related to the db connection, and my code begins stalling out at the first line (db = MySQLdb.connection...). So it looks like as soon as I start trying to connect to the db, the program goes kaput.
Here is the code:
#!/usr/bin/python
from bs4 import BeautifulSoup as Soup
import urllib
import re
import cgi, cgitb
import MySQLdb
cgitb.enable() # for troubleshooting
# the cgi library gets the var from the .html file
form = cgi.FieldStorage()
jquery_input = form.getvalue("stuff_for_python", "nothing sent")
# the next section scrapes the URL,
# finds the call no and location,
# formats them, and concatenates them
content = urllib.urlopen(jquery_input).read()
soup = Soup(content)
extracted = soup.find_all("tr", {"class": "bibItemsEntry"})
cleaned = str(extracted)
start = cleaned.find('browse') +8
end = cleaned.find('</a>', start)
callNo = cleaned[start:end]
noSpacesCallNo = callNo.replace(' ', '')
noSpacesCallNo2 = noSpacesCallNo.replace('.', '')
startLoc = cleaned.find('field 1') + 13
endLoc = cleaned.find('</td>', startLoc)
location = cleaned[startLoc:endLoc]
noSpacesLoc = location.replace(' ', '')
joined = (noSpacesCallNo2+noSpacesLoc)
lowerCaseJoined = joined.lower()
# the next section establishes a connection
# with the mySQL db and queries it
# using the call/loc code (lowerCaseJoined)
db = MySQLdb.connect(host="localhost", user="...", "passwd="...",
db="locations")
cur = db.cursor()
queryDb = """
SELECT URL FROM locations WHERE location = %s
"""
cur.execute(queryDb, lowerCaseJoined)
result = cur.fetchall()
cur.close()
db.close()
# the next 2 'print' statements are important for web
print "Content-type: text/html"
print
print result
Any ideas what I'm doing wrong?
I'm new at programming, so I'm sure there's a lot that can be improved upon here. But prior to refining it I just want to get the thing to work!
I figured out the problem. It seems that I had a stray quotation mark before the password portion of the db connection line. Things are all good now.
I am trying to collect data from a webpage which has a bunch of select lists I need to fetch data from. Here is the page: http://www.asusparts.eu/partfinder/Asus/All In One/E Series/
And this is what i have so far:
import glob, string
from bs4 import BeautifulSoup
import urllib2, csv
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
##-page to show all selections for the E-series-##
selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/'
##-
page = urllib2.urlopen(selected_list)
soup = BeautifulSoup(page)
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
for option in option_tags:
    open(url + option['value'])
html = urllib2.urlopen("http://www.asusparts.eu/partfinder/")
soup = BeautifulSoup(html)
all = soup.find('div', id="accordion")
I am not sure if I am going about this the right way, as all the select menus make it confusing. Basically, I need to grab all the data from the selected results, such as images, price, description, etc. They are all contained within one div tag, named 'accordion', which holds all the results. Would this still gather all the data, or would I need to dig deeper and search through the tags inside this div? Also, I would have preferred to search by id rather than class, as I could fetch all the data in one go. How would I do this from what I have above? Thanks. I am also unsure whether I am using the glob function correctly.
EDIT
Here is my edited code. It returns no errors; however, I am not sure if it returns all the models for the E-series.
import string, urllib2, urllib, csv, urlparse
from bs4 import BeautifulSoup
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
base_url = 'http://www.asusparts.eu/' + url
print base_url
##-page to show all selections for the E-series-##
selected_list = urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
print urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
#selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-
page = urllib2.urlopen('http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series')
soup = BeautifulSoup(page)
print soup
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
print option_tags
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
print option_tags
for option in option_tags:
    url + option['redirectvalue']
    print " " + url + option['redirectvalue']
First of all, I'd like to point out a couple of problems in the code you posted. First, the glob module is not typically used for making HTTP requests. It is useful for iterating through a subset of files on a specified path; you can read more about it in its docs.
The second issue is that in the line:
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
you have an indentation error, because there is no indented code that follows. This will raise an error and prevent the rest of the code from being executed.
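Another problem is that you are using some of Python's "reserved" names for your variables. Strictly speaking, all and file are built-in names rather than keywords, but reusing them hides the built-ins, so you should avoid words such as all or file for variable names.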
Finally when you are looping through option_tags:
for option in option_tags:
    open(url + option['value'])
The open statement will try and open a local file whose path is url + option['value']. This will likely raise an error, as I doubt you'll have a file at that location. In addition, you should be aware that you aren't doing anything with this open file.
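To request a page instead of opening a local file, the URL has to be assembled and percent-encoded first, since names such as "All In One" contain spaces. A minimal sketch in Python 3 follows (the question's code is Python 2, where the equivalent functions live in urllib and urlparse). The helper name model_url is invented for this example, and the path layout brand/type/family/model is taken from the URLs shown in the question.

```python
from urllib.parse import quote, urljoin

BASE_URL = "http://www.asusparts.eu/partfinder/"


def model_url(brand, type_, family, model):
    """Build the results-page URL for one (brand, type, family, model) choice."""
    # Encode each path segment separately so spaces become %20.
    path = "/".join(quote(part, safe="") for part in (brand, type_, family, model))
    return urljoin(BASE_URL, path)
```

The returned string can then be handed to a function that performs an HTTP request, such as urllib.request.urlopen, rather than to open.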
Okay, so enough with the critique. I've taken a look at the asus page and I think I have an idea of what you want to accomplish. From what I understand, you want to scrape a list of parts (images, text, price, etc..) for each computer model on the asus page. Each model has its list of parts located at a unique URL (for example: http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220). This means that you need to be able to create this unique URL for each model. To make matters more complicated, each parts category is loaded dynamically, so for example the parts for the "Cooling" section are not loaded until you click on the link for "Cooling". This means we have a two part problem: 1) Get all of the valid (brand, type, family, model) combinations and 2) Figure out how to load all the parts for a given model.
I was kind of bored and decided to write up a simple program that will take care of most of the heavy lifting. It isn't the most elegant thing out there, but it'll get the job done. Step 1) is accomplished in get_model_information(). Step 2) is taken care of in parse_models() but is a little less obvious. Taking a look at the asus website, whenever you click on a parts subsection the JavaScript function getProductsBasedOnCategoryID() is run, which makes an ajax call to a formatted PRODUCT_URL (see below). The response is some JSON information that is used to populate the section you clicked on.
import urllib2
import json
import urlparse
from bs4 import BeautifulSoup
BASE_URL = 'http://www.asusparts.eu/partfinder/'
PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'
ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
               'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']


def get_options(url, select_id):
    """
    Gets all the options from a select element.
    """
    r = urllib2.urlopen(url)
    soup = BeautifulSoup(r)
    select = soup.find('select', id=select_id)
    try:
        options = [option for option in select.strings]
    except AttributeError:
        print url, select_id, select
        raise
    return options[1:]  # The first option is the menu text


def get_model_information():
    """
    Finds all the models for each family, all the families and models for each
    type, and all the types, families, and models for each brand.
    These are all added as tuples (brand, type, family, model) to the list
    models.
    """
    model_info = []
    print "Getting brands"
    brand_options = get_options(BASE_URL, 'mySelectList')
    for brand in brand_options:
        print "Getting types for {0}".format(brand)
        # brand = brand.replace(' ', '%20')  # URL encode spaces
        brand_url = urlparse.urljoin(BASE_URL, brand.replace(' ', '%20'))
        types = get_options(brand_url, 'mySelectListType')
        for _type in types:
            print "Getting families for {0}->{1}".format(brand, _type)
            bt = '{0}/{1}'.format(brand, _type)
            type_url = urlparse.urljoin(BASE_URL, bt.replace(' ', '%20'))
            families = get_options(type_url, 'myselectListFamily')
            for family in families:
                print "Getting models for {0}->{1}->{2}".format(brand,
                                                                _type, family)
                btf = '{0}/{1}'.format(bt, family)
                fam_url = urlparse.urljoin(BASE_URL, btf.replace(' ', '%20'))
                models = get_options(fam_url, 'myselectListModel')
                model_info.extend((brand, _type, family, m) for m in models)
    return model_info


def parse_models(model_information):
    """
    Get all the information for each accessory type for every
    (brand, type, family, model). accessory_info will be the python formatted
    json results. You can parse, filter, and save this information or use
    it however suits your needs.
    """
    for brand, _type, family, model in model_information:
        for accessory in ACCESSORIES:
            r = urllib2.urlopen(PRODUCTS_URL.format(model=model, family=family,
                                                    accessory=accessory,
                                                    brand=brand,))
            accessory_info = json.load(r)
            # Do something with accessory_info
            # ...


def main():
    models = get_model_information()
    parse_models(models)


if __name__ == '__main__':
    main()
Finally, one side note. I have dropped urllib2 in favor of the requests library. I personally think it provides much more functionality and has better semantics, but you can use whatever you would like.