I've got some test code I'm working on. In a separate HTML file, a button onclick event gets the URL of the page and passes it as a variable (jquery_input) to this python script. Python then scrapes the URL and identifies two pieces of data, which it then formats and concatenates together (resulting in the variable lowerCaseJoined). This concatenated variable has a corresponding entry in a MySQL database. With each entry in the db, there is an associated .gif file.
From here, what I'm trying to do is open a connection to the MySQL server and query the concatenated variable against the db to get the associated .gif file.
Once this has been accomplished, I want to print the .gif file as an alert on the webpage.
If I take out the db section of the code (connection, querying), the code runs just fine. Also, I am successfully able to execute the db part of the code independently through the Python shell. However, when the entire code resides in one file, nothing happens when I click the button. I've systematically removed the lines of code related to the db connection, and my code begins stalling out at the first line (db = MySQLdb.connection...). So it looks like as soon as I start trying to connect to the db, the program goes kaput.
Here is the code:
#!/usr/bin/python
from bs4 import BeautifulSoup as Soup
import urllib
import re
import cgi, cgitb
import MySQLdb
cgitb.enable() # for troubleshooting
# the cgi library gets the var from the .html file
form = cgi.FieldStorage()
jquery_input = form.getvalue("stuff_for_python", "nothing sent")
# the next section scrapes the URL,
# finds the call no and location,
# formats them, and concatenates them
content = urllib.urlopen(jquery_input).read()
soup = Soup(content)
extracted = soup.find_all("tr", {"class": "bibItemsEntry"})
cleaned = str(extracted)
start = cleaned.find('browse') +8
end = cleaned.find('</a>', start)
callNo = cleaned[start:end]
noSpacesCallNo = callNo.replace(' ', '')
noSpacesCallNo2 = noSpacesCallNo.replace('.', '')
startLoc = cleaned.find('field 1') + 13
endLoc = cleaned.find('</td>', startLoc)
location = cleaned[startLoc:endLoc]
noSpacesLoc = location.replace(' ', '')
joined = (noSpacesCallNo2+noSpacesLoc)
lowerCaseJoined = joined.lower()
# the next section establishes a connection
# with the mySQL db and queries it
# using the call/loc code (lowerCaseJoined)
db = MySQLdb.connect(host="localhost", user="...", "passwd="...",
db="locations")
cur = db.cursor()
queryDb = """
SELECT URL FROM locations WHERE location = %s
"""
cur.execute(queryDb, lowerCaseJoined)
result = cur.fetchall()
cur.close()
db.close()
# the next 2 'print' statements are important for web
print "Content-type: text/html"
print
print result
Any ideas what I'm doing wrong?
I'm new at programming, so I'm sure there's a lot that can be improved upon here. But prior to refining it I just want to get the thing to work!
I figured out the problem. Seems that I had quotation mark before the password portion of the db connection line. Things are all good now.
Related
I have multiple unstructured txt files in a directory and I want to insert all of them into mysql; basically, the entire content of each text file should be placed into a row . In MySQL, I have 2 columns: ID (auto increment), and LastName(nvarchar(45)). I used Python to connect to MySql; used LOAD DATA LOCAL INFILE to insert the whole content. But when I run the code I see the following messages in Python console:
.
Also, when I check MySql, I see nothing but a bunch of empty rows with Ids being automatically generated.
Here is the code:
import MySQLdb
import sys
import os
result = os.listdir("C:\\Users\\msalimi\\Google Drive\\s\\Discharge_Summary")
for x in result:
db = MySQLdb.connect("localhost", "root", "Pass", "myblog")
cursor = db.cursor()
file1 = os.path.join(r'C:\\Discharge_Summary\\'+x)
cursor.execute("LOAD DATA LOCAL INFILE '%s' INTO TABLE clamp_test" %(file1,));
db.commit()
db.close()
Can someone please tell me what is wrong with the code? What is the right way to achieve my goal?
I edited my code with:
.....cursor.execute("LOAD DATA LOCAL INFILE '%s' INTO TABLE clamp_test LINES TERMINATED BY '\r' (Lastname) SET id = NULL" %(file1,))
and it worked :)
I have written a little script using python3 that gets a RSS news feed using the feedparser library.
I then loop through the entries (dictionary) and then use a try/except block to insert the data into a MySQL db using pymysql (originally I tried to use MySQLDB but read here and other places that is does not work with Python3 or above)
I originally followed the PyMySQL example on git hub, however this did not work for me and I had to use different syntax for pymysql like they have here on digital ocean. However this worked for me when I tested out their example on their site.
But when I tried to incorporate it into my query,there was an error as it would not run the code the try block and just ran the exception code each time.
Here is my code;
#! /usr/bin/python3
# web_scraper.py 1st part of the project, to get the data from the
# websites and store it in a mysql database
import cgitb
cgitb.enable()
import requests,feedparser,pprint,pymysql,datetime
from bs4 import BeautifulSoup
conn = pymysql.connect(host="localhost",user="root",password="pass",db="stories",charset="utf8mb4")
c = conn.cursor()
def adbNews():
url = 'http://feeds.feedburner.com/adb_news'
d = feedparser.parse(url)
articles = d['entries']
for article in articles:
dt_obj = datetime.datetime.strptime(article.published,"%Y-%m-%d %H:%M:%S")
try:
sql = "INSERT INTO articles(article_title,article_desc,article_link,article_date) VALUES (%s,%s,%s,%s,%s)"
c.execute(sql,(article.title, article.summary,article.link,dt_obj.strftime('%Y-%m-%d %H:%M:%S'),))
conn.commit()
except Exception:
print("Not working")
adbNews()
I am not entirely sure what I am doing wrong. I have converted the string so that it is the format for the MySQL DATETIME type. As I originally did not have this but each time I run the program nothing gets stored in the db and the exception gets printed.
EDIT:
After reading Daniel Roseman's comments I removed the try/except block and read the errors that python gave me. It was to do with an extra argument in my sql query.
Here is he edited working code;
#! /usr/bin/python3
# web_scraper.py 1st part of the project, to get the data from the
# websites and store it in a mysql database
import cgitb
cgitb.enable()
import requests,feedparser,pprint,pymysql,datetime
from bs4 import BeautifulSoup
conn = pymysql.connect(host="localhost",user="root",password="pass",db="stories",charset="utf8mb4")
c = conn.cursor()
def adbNews():
url = 'http://feeds.feedburner.com/adb_news'
d = feedparser.parse(url)
articles = d['entries']
for article in articles:
dt_obj = datetime.datetime.strptime(article.published,"%Y-%m-%d %H:%M:%S")
#extra argument was here removed now
sql = "INSERT INTO articles(article_title,article_desc,article_link,article_date) VALUES (%s,%s,%s,%s)"
c.execute(sql,(article.title, article.summary,article.link,dt_obj.strftime('%Y-%m-%d %H:%M:%S'),))
conn.commit()
adbNews()
I'm trying to scrape data from this website:
Death Row Information
I'm having trouble to scrape the last statements from all the executed offenders in the list because the last statement is located at another HTML page. The name of the URL is built like this: http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname].html. I can't think of a way of how I can scrape the last statements from these pages and put them in an Sqlite database.
All the other info (expect for "offender information", which I don't need) is already in my datbase.
Anyone who can give me a pointer to get started getting this done in Python?
Thanks
Edit2: I got a little bit further:
import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string
URLS = []
Lastwords = {}
conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()
# Make some fresh tables using executescript()
cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( link1 text, link2 text,Execution text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()
csvfile = open("prisonfile.csv","rb")
creader = csv.reader(csvfile, delimiter = ",")
for t in creader:
cur.execute('INSERT INTO Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t, )
for column in cur.execute("SELECT LastName, Firstname FROM prison"):
lastname = column[0].lower()
firstname = column[1].lower()
name = lastname+firstname
CleanName = name.translate(None, ",.!-#'#$" "")
CleanName2 = CleanName.replace(" ", "")
Url = "http://www.tdcj.state.tx.us/death_row/dr_info/"
Link = Url+CleanName2+"last.html"
URLS.append(Link)
for URL in URLS:
try:
page = urllib2.urlopen(URL)
except URLError, e:
if e.code ==404:
continue
soup = BeautifulSoup(page.read())
statements = soup.findAll ('p',{ "class" : "Last Statement:" })
print statements
csvfile.close()
conn.commit()
conn.close()
The code is messy, I know. Once everything works I will clean it up. One problem though. I'm trying to get all the statements by using soup.findall, but I cannot seem to get the class right. The relevant part of the page source looks like this:
<p class="text_bold">Last Statement:</p>
<p>I don't have anything to say, you can proceed Warden Jones.</p>
However, the output of my program:
[]
[]
[]
...
What could be the problem exactly?
I will not write code that solves the problem, but will give you a simple plan for how to do it yourself:
You know that each last statement is located at the URL:
http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname]last.html
You say you already have all the other information. This presumably includes the list of executed prisoners. So you should generate a list of names in your python code. This will allow you to generate the URL to get to each page you need to get to.
Then make a For loop that iterates over each URL using the format I posted above.
Within the body of this for loop, write code to read the page and get the last statement. The last statement on each page is in the same format on each page, so you can use parsing to capture the part that you want:
<p class="text_bold">Last Statement:</p>
<p>D.J., Laurie, Dr. Wheat, about all I can say is goodbye, and for all the rest of you, although you don’t forgive me for my transgressions, I forgive yours against me. I am ready to begin my journey and that’s all I have to say.</p>
Once you have your list of last statements, you can push them to SQL.
So your code will look like this:
import urllib2
# Make a list of names ('Last1First1','Last2First2','Last3First3',...)
names = #some_call_to_your_database
# Make a list of URLs to each inmate's last words page
# ('URL...Last1First1last.html',URL...Last2First2last.html,...)
URLS = () # made from the 'names' list above
# Create a dictionary to hold all the last words:
LastWords = {}
# Iterate over each individual page
for eachURL in URLS:
response = urllib2.urlopen(eachURL)
html = response.read()
## Some prisoners had no last words, so those URLs will 404.
if ...: # Handle those 404s here
## Code to parse the response, hunting specifically
## for the code block I mentioned above. Once you have the
## last words as a string, save to dictionary:
LastWords['LastFirst'] = "LastFirst's last words."
# Now LastWords is a dictionary with all the last words!
# Write some more code to push the content of LastWords
# to your SQL database.
I am dealing with a CRON job that places a text file with 9000 lines device names.
The job recreates the file every day with an updated list from a network crawler in our domain.
What I was running into is when I have the following worker running my import into my database the db.[name].id kept growing with this method below
scheduler.py
# -*- coding: utf-8 -*-
from gluon.scheduler import Scheduler
def demo1():
db(db.asdf.id>0).delete()
db.commit()
with open('c:\(project)\devices.list') as f:
content = f.readlines()
for line in content:
db.asdf.insert(asdf = line)
db.commit()
mysched = Scheduler(db, tasks = dict(demo1 = demo1) )
default.py (initial kickoff)
#auth.requires_membership('!Group-IS_MASTER')
def rgroup():
mysched.queue_task('demo1',start_time=request.now,stop_time = None,prevent_drift=True,repeats=0,period=86400)
return 'you are member of a group!'
So the next time the job kicked off it would start at db.[name].id = 9001. So every day the ID number would grow by 9000 or so depending on the crawler's return. It just looked sloppy and I didn't want to run into issues years down the road with database limitations that I don't know about.
(I'm a DB newb (I know, I don't know stuff))
SOOOOOOO.....
This is what I came up with and I don't know if this is the best practice or not. And an issue that I ran into when using db.[name].drop() in the same function that is creating entries is the db tables didn't exist and my job status went to 'FAILED'. So I defined the table in the job. see below:
scheduler.py
from gluon.scheduler import Scheduler
def demo1():
db.asdf.drop() #<=====Kill db.asdf
db.commit() #<=====Commit Kill
db.define_table('asdf',Field('asdf'),auth.signature ) #<==== Phoenix Rebirth!!!
with open('c:\(project)\devices.list') as f:
content = f.readlines()
for line in content:
db.asdf.insert(asdf = line)
db.commit() #<=========== Magic
mysched = Scheduler(db, tasks = dict(demo1 = demo1) )
In the line of Phoenix Rebirth in the comments of code above. Is that the best way to achieve my goal?
It starts my ID back at 1 and that's what I want but is that how I should be going about it?
Thanks!
P.S. Forgive my example with windows dir structure as my current non-prod sandbox is my windows workstation. :(
Why wouldn't you check if the line is present prior to inserting its corresponding record ?
...
with open('c:\(project)\devices.list') as f:
content = f.readlines()
for line in content:
# distinguishing t_ for tables and f_ for fields
db_matching_entries = db(db.t_asdf.f_asdf==line).select()
if len(db_matching_entries) == 0:
db.t_asdf.insert(f_asdf = line)
else:
# here you could update your record, just in case ;-)
pass
db.commit() #<=========== Magic
Got a similar process that takes few seconds to complete with 2k-3k entries. Yours should not take longer than half a minute.
I am using Python3 and the package requests to fetch HTML data.
I have tried running the line
r = requests.get('https://github.com/timeline.json')
, which is the example on their tutorial, to no avail. However, when I run
request = requests.get('http://www.math.ksu.edu/events/grad_conf_2013/')
it works fine. I am getting errors such as
AttributeError: 'MockRequest' object has no attribute 'unverifiable'
Error in sys.excepthook:
I am thinking the errors have something to do with the type of webpage I am attempting to get, since the html page that is working is just basic html that I wrote.
I am very new to requests and Python in general. I am also new to stackoverflow.
As a little example, here is a little tool which I developed in order to fetch data from a website, in this case IP and show it:
# Import the requests module
# TODO: Make sure to install it first
import requests
# Get the raw information from the website
r = requests.get('http://whatismyipaddress.com')
raw_page_source_list = r.text
text = ''
# Join the whole list into a single string in order
# to simplify things
text = text.join(raw_page_source_list)
# Get the exact starting position of the IP address string
ip_text_pos = text.find('IP Information') + 62
# Now extract the IP address and store it
ip_address = text[ip_text_pos : ip_text_pos + 12]
# print 'Your IP address is: %s' % ip_address
# or, for Python 3 ... #
# print('Your IP address is: %s' % ip_address)