Here is a scraper I created using Python on ScraperWiki:
import lxml.html
import re
import scraperwiki
pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
data = {
'arwu_rank' : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
}
# DEBUG BEGIN
if not type(data["arwu_rank"]) is str:
print type(data["arwu_rank"])
print data["arwu_rank"]
print data["university"]
# DEBUG END
if "-" in data["arwu_rank"]:
arwu_rank_bounds = data["arwu_rank"].split("-")
data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
if not type(data["arwu_rank"]) is int:
data["arwu_rank"] = int(data["arwu_rank"])
scraperwiki.sqlite.save(unique_keys=['university'], data=data)
It works perfectly except when scraping the final data row of the table (the "York University" line), at which point instead of lines 9 through 11 of the code causing the string "401-500" to be retrieved from the table and assigned to data["arwu_rank"], those lines somehow seem instead to be causing the int 450 to be assigned to data["arwu_rank"]. You can see that I've added a few lines of "debugging" code to get a better understanding of what's going on, but also that that debugging code doesn't go very deep.
I have two questions:
What are my options for debugging scrapers run on the ScraperWiki infrastructure, e.g. for troubleshooting issues like this? E.g. is there a way to step through?
Can you tell me why the the int 450, instead of the string "401-500", is being assigned to data["arwu_rank"] for the "York University" line?
EDIT 6 May 2013, 20:07h UTC
The following scraper completes without issue, but I'm still unsure why the first one failed on the "York University" line:
import lxml.html
import re
import scraperwiki
pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
data = {
'arwu_rank' : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
}
# DEBUG BEGIN
if not type(data["arwu_rank"]) is str:
print type(data["arwu_rank"])
print data["arwu_rank"]
print data["university"]
# DEBUG END
if "-" in data["arwu_rank"]:
arwu_rank_bounds = data["arwu_rank"].split("-")
data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
if not type(data["arwu_rank"]) is int:
data["arwu_rank"] = int(data["arwu_rank"])
scraperwiki.sqlite.save(unique_keys=['university'], data=data)
There's no easy way to debug your scripts on ScraperWiki, unfortunately it just sends your code in its entirety and gets the results back, there's no way to execute the code interactively.
I added a couple more prints to a copy of your code, and it looks like the if check before the bit that assigns data
if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
doesn't trigger for "York University" so it will be keeping the int value (you set it later on) from the previous time around the loop.
Related
I'm using mathjax-node to try to convert mathjax code into an SVG. Currently, the code I have set up here is this:
const mathjax = require("mathjax-node");
process.stdin.on("data", data => {
mathjax.typeset({
math: data.slice(1),
format: [...data][0] == "Y" ? "inline-TeX" : "TeX",
svg: true
}).then(data => {
process.stdout.write(data.svg + String.fromCodePoint(0));
});
});
Which takes in input and the first character determines if it's inline or not and everything else is the code. It's used by a python file like this:
# -*- coding: utf-8 -*-
from subprocess import *
from pathlib import Path
cdir = "/".join(str(Path(__file__)).split("/")[:-1])
if cdir:
cdir += "/"
converter = Popen(["node", cdir + "mathjax-converter.js"], stdin = PIPE, stdout = PIPE)
def convert_mathjax(mathjax, inline = True):
converter.stdin.write(bytes(("Y" if inline else "N") + mathjax, "utf-8"))
converter.stdin.flush()
result = ""
while True:
char = converter.stdout.read(1)
if not char: return ""
if ord(char) == 0:
return result
result += char.decode("utf-8")
So convert_markdown is the function that takes the code and turns it into the SVG. However, when I try to render the output just using data:text/html,<svg>...</svg>, it gives this error in the console:
Error: <path> attribute d: Expected number, "…3T381 315T301241Q265 210 201 149…".
Using MathJax client-side with the _SVG config option works fine, so how do I resolve this?
I can confirm that there is an error in that SVG path. The T command is supposed to have two coordinate parameters. But there is one in the middle there that doesn't.
T 381 315 T 301241 Q ...
is probably supposed to be:
T 381 315 T 301 241 Q ...
Either there is a bug in the mathjax SVG generator, or something else in your code is accidentally stripping random characters.
I need to get all the elements on a page and iterate through them to search each element.
currently I am using, driver.find_elements_by_xpath('//*[#*]')
However, there can be a delay in completing the line of code above on larger pages. Is there a way to retrieve the results in increments of 100 elements? Or at least add a timeout?
Terminating driver.find_elements_by_xpath('//*[#*]') inside a multithread is the only why I currently think I can solve this.
I need to find all elements on a page that contain certain strings. For example. elem.get_attribute('outerHTML').find('type="submit"') != -1 … and so on and so forth … I also need their proximity to each other to compare index positions
Thanks!
import Globalz ###### globals import is an empty .py file
import threading
import time
import ctypes
def find_xpath():
for i in range(5):
print(i)
time.sleep(1)
Globalz.curr_value = 'DONE!'
### this is where the xpath retrieval goes (ABOVE loop is for example purposes only)
def stopwatch(info):
curr_time = 0
failed = False
Globalz.curr_value = ''
thread1 = threading.Thread(target=info['function'])
thread1.start()
while thread1.is_alive() is True:
if curr_time >= info['timeout']: failed = True; ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread1.ident), ctypes.py_object(SystemExit))
curr_time += 1; time.sleep(1)
if failed is True: return info['failed_returns']
if failed is False: return Globalz.curr_value
betty = stopwatch({'function': find_xpath, 'timeout': 10, 'failed_returns': 'failed'})
print(betty)
If anyone is interested here is a solution. I've created a wrapper called stopwatch()
I was making a site component scanner with Python. Unfortunately, something goes wrong when I added another value to my script. This is my script:
#!/usr/bin/python
import sys
import urllib2
import re
import time
import httplib
import random
# Color Console
W = '\033[0m' # white (default)
R = '\033[31m' # red
G = '\033[1;32m' # green bold
O = '\033[33m' # orange
B = '\033[34m' # blue
P = '\033[35m' # purple
C = '\033[36m' # cyan
GR = '\033[37m' # gray
#Bad HTTP Responses
BAD_RESP = [400,401,404]
def main(path):
print "[+] Testing:",host.split("/",1)[1]+path
try:
h = httplib.HTTP(host.split("/",1)[0])
h.putrequest("HEAD", "/"+host.split("/",1)[1]+path)
h.putheader("Host", host.split("/",1)[0])
h.endheaders()
resp, reason, headers = h.getreply()
return resp, reason, headers.get("Server")
except(), msg:
print "Error Occurred:",msg
pass
def timer():
now = time.localtime(time.time())
return time.asctime(now)
def slowprint(s):
for c in s + '\n':
sys.stdout.write(c)
sys.stdout.flush() # defeat buffering
time.sleep(8./90)
print G+"\n\t Whats My Site Component Scanner"
coms = { "index.php?option=com_artforms" : "com_artforms" + "link1","index.php?option=com_fabrik" : "com_fabrik" + "ink"}
if len(sys.argv) != 2:
print "\nUsage: python jx.py <site>"
print "Example: python jx.py www.site.com/\n"
sys.exit(1)
host = sys.argv[1].replace("http://","").rsplit("/",1)[0]
if host[-1] != "/":
host = host+"/"
print "\n[+] Site:",host
print "[+] Loaded:",len(coms)
print "\n[+] Scanning Components\n"
for com,nme,expl in coms.items():
resp,reason,server = main(com)
if resp not in BAD_RESP:
print ""
print G+"\t[+] Result:",resp, reason
print G+"\t[+] Com:",nme
print G+"\t[+] Link:",expl
print W
else:
print ""
print R+"\t[-] Result:",resp, reason
print W
print "\n[-] Done\n"
And this is the error message that comes up:
Traceback (most recent call last):
File "jscan.py", line 69, in <module>
for com,nme,expl in xpls.items():
ValueError: need more than 2 values to unpack
I already tried changing the 2 value into 3 or 1, but it doesn't seem to work.
xpls.items returns a tuple of two items, you're trying to unpack it into three. You initialize the dict yourself with two pairs of key:value:
coms = { "index.php?option=com_artforms" : "com_artforms" + "link1","index.php?option=com_fabrik" : "com_fabrik" + "ink"}
besides, the traceback seems to be from another script - the dict is called xpls there, and coms in the code you posted...
you can try
for (xpl, poc) in xpls.items():
...
...
because dict.items will return you tuple with 2 values.
You have all the information you need. As with any bug, the best place to start is the traceback. Let's:
for com,poc,expl in xpls.items():
ValueError: need more than 2 values to unpack
Python throws ValueError when a given object is of correct type but has an incorrect value. In this case, this tells us that xpls.items is an iterable an thus can be unpacked, but the attempt failed.
The description of the exception narrows down the problem: xpls has 2 items, but more were required. By looking at the quoted line, we can see that "more" is 3.
In short: xpls was supposed to have 3 items, but has 2.
Note that I never read the rest of the code. Debugging this was possible using only those 2 lines.
Learning to read tracebacks is vital. When you encounter an error such as this one again, devote at least 10 minutes to try to work with this information. You'll be repayed tenfold for your effort.
As already mentioned, dict.items() returns a tuple with two values. If you use a list of strings as dictionary values instead of a string, which should be split anyways afterwards, you can go with this syntax:
coms = { "index.php?option=com_artforms" : ["com_artforms", "link1"],
"index.php?option=com_fabrik" : ["com_fabrik", "ink"]}
for com, (name, expl) in coms.items():
print com, name, expl
>>> index.php?option=com_artforms com_artforms link1
>>> index.php?option=com_fabrik com_fabrik ink
I have a seed file of 250 URLs of IMDB's top 250 movies.
I need to crawl each one of them and get some info from it.
I've created a function that gets a URL of a movie and returns the info I need. It works great. My problem is when I'm trying to run this function on all of the 250 URLs.
After a certian amount (not constant!) of URLs that were crawled successfully, the program stops its run. The python.exe process takes 0% CPU and the memory consumption doesn't change. After some debugging, I figured that the problem is with the parsing, it just stops working and I have no idea why (stuck on a find command).
I'm using urllib2 to get the HTML content of the URL, than parse it as a string and then continue to the next URL (I'm going only once on each of these strings, linear time for all the checks and extractions).
Any idea what can cause this kind of behavior?
EDIT:
I'm attaching one of the problematic functions' code (got 1 more, but I'm guessing it's the same problem)
def getActors(html,actorsDictionary):
counter = 0
actorsLeft = 3
actorFlag = 0
imdbURL = "http://www.imdb.com"
for line in html:
# we have 3 actors, stop
if (actorsLeft == 0):
break
# current line contains actor information
if (actorFlag == 1):
endTag = str(line).find('/" >')
endTagA = str(line).find('</a>')
if (actorsLeft == 3):
actorList = str(line)[endTag+7:endTagA]
else:
actorList += ", " + str(line)[endTag+7:endTagA]
actorURL = imdbURL + str(line)[str(line).find('href=')+6:endTag]
actorFlag = 0
actorsLeft -= 1
actorsDictionary[actorURL] = str(line)[endTag+7:endTagA]
# check if next line contains actor information
if (str(line).find('<td class="name">') > -1 ):
actorFlag = 1
# convert commas and clean \n
actorList = actorList.replace(",",", ")
actorList = actorList.replace("\n","")
return actorList
I'm calling the function this way:
for url in seedFile:
moviePage = urllib.request.urlopen(url)
print(getTitleAndYear(moviePage),",",movieURL,",",getPlot(moviePage),getActors(moviePage,actorsDictionary))
This works great without the getActors function
There is no exception raised here (I removed the try and catch for now)
and it's getting stuck in the for loop after some iterations
EDIT 2: if I run only the getActors function, it works well and finishes all the URLs in the seed file (250)
I'd like to parse status.dat file for nagios3 and output as xml with a python script.
The xml part is the easy one but how do I go about parsing the file? Use multi line regex?
It's possible the file will be large as many hosts and services are monitored, will loading the whole file in memory be wise?
I only need to extract services that have critical state and host they belong to.
Any help and pointing in the right direction will be highly appreciated.
LE Here's how the file looks:
########################################
# NAGIOS STATUS FILE
#
# THIS FILE IS AUTOMATICALLY GENERATED
# BY NAGIOS. DO NOT MODIFY THIS FILE!
########################################
info {
created=1233491098
version=2.11
}
program {
modified_host_attributes=0
modified_service_attributes=0
nagios_pid=15015
daemon_mode=1
program_start=1233490393
last_command_check=0
last_log_rotation=0
enable_notifications=1
active_service_checks_enabled=1
passive_service_checks_enabled=1
active_host_checks_enabled=1
passive_host_checks_enabled=1
enable_event_handlers=1
obsess_over_services=0
obsess_over_hosts=0
check_service_freshness=1
check_host_freshness=0
enable_flap_detection=0
enable_failure_prediction=1
process_performance_data=0
global_host_event_handler=
global_service_event_handler=
total_external_command_buffer_slots=4096
used_external_command_buffer_slots=0
high_external_command_buffer_slots=0
total_check_result_buffer_slots=4096
used_check_result_buffer_slots=0
high_check_result_buffer_slots=2
}
host {
host_name=localhost
modified_attributes=0
check_command=check-host-alive
event_handler=
has_been_checked=1
should_be_scheduled=0
check_execution_time=0.019
check_latency=0.000
check_type=0
current_state=0
last_hard_state=0
plugin_output=PING OK - Packet loss = 0%, RTA = 3.57 ms
performance_data=
last_check=1233490883
next_check=0
current_attempt=1
max_attempts=10
state_type=1
last_state_change=1233489475
last_hard_state_change=1233489475
last_time_up=1233490883
last_time_down=0
last_time_unreachable=0
last_notification=0
next_notification=0
no_more_notifications=0
current_notification_number=0
notifications_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_host=1
last_update=1233491098
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}
service {
host_name=gateway
service_description=PING
modified_attributes=0
check_command=check_ping!100.0,20%!500.0,60%
event_handler=
has_been_checked=1
should_be_scheduled=1
check_execution_time=4.017
check_latency=0.210
check_type=0
current_state=0
last_hard_state=0
current_attempt=1
max_attempts=4
state_type=1
last_state_change=1233489432
last_hard_state_change=1233489432
last_time_ok=1233491078
last_time_warning=0
last_time_unknown=0
last_time_critical=0
plugin_output=PING OK - Packet loss = 0%, RTA = 2.98 ms
performance_data=
last_check=1233491078
next_check=1233491378
current_notification_number=0
last_notification=0
next_notification=0
no_more_notifications=0
notifications_enabled=1
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_service=1
last_update=1233491098
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}
It can have any number of hosts and a host can have any number of services.
Pfft, get yerself mk_livestatus. http://mathias-kettner.de/checkmk_livestatus.html
Nagiosity does exactly what you want:
http://code.google.com/p/nagiosity/
Having shamelessly stolen from the above examples,
Here's a version build for Python 2.4 that returns a dict containing arrays of nagios sections.
def parseConf(source):
conf = {}
patID=re.compile(r"(?:\s*define)?\s*(\w+)\s+{")
patAttr=re.compile(r"\s*(\w+)(?:=|\s+)(.*)")
patEndID=re.compile(r"\s*}")
for line in source.splitlines():
line=line.strip()
matchID = patID.match(line)
matchAttr = patAttr.match(line)
matchEndID = patEndID.match( line)
if len(line) == 0 or line[0]=='#':
pass
elif matchID:
identifier = matchID.group(1)
cur = [identifier, {}]
elif matchAttr:
attribute = matchAttr.group(1)
value = matchAttr.group(2).strip()
cur[1][attribute] = value
elif matchEndID and cur:
conf.setdefault(cur[0],[]).append(cur[1])
del cur
return conf
To get all Names your Host which have contactgroups beginning with 'devops':
nagcfg=parseConf(stringcontaingcompleteconfig)
hostlist=[host['host_name'] for host in nagcfg['host']
if host['contact_groups'].startswith('devops')]
Don't know nagios and its config file, but the structure seems pretty simple:
# comment
identifier {
attribute=
attribute=value
}
which can simply be translated to
<identifier>
<attribute name="attribute-name">attribute-value</attribute>
</identifier>
all contained inside a root-level <nagios> tag.
I don't see line breaks in the values. Does nagios have multi-line values?
You need to take care of equal signs within attribute values, so set your regex to non-greedy.
You can do something like this:
def parseConf(filename):
conf = []
with open(filename, 'r') as f:
for i in f.readlines():
if i[0] == '#': continue
matchID = re.search(r"([\w]+) {", i)
matchAttr = re.search(r"[ ]*([\w]+)=([\w\d]*)", i)
matchEndID = re.search(r"[ ]*}", i)
if matchID:
identifier = matchID.group(1)
cur = [identifier, {}]
elif matchAttr:
attribute = matchAttr.group(1)
value = matchAttr.group(2)
cur[1][attribute] = value
elif matchEndID:
conf.append(cur)
return conf
def conf2xml(filename):
conf = parseConf(filename)
xml = ''
for ID in conf:
xml += '<%s>\n' % ID[0]
for attr in ID[1]:
xml += '\t<attribute name="%s">%s</attribute>\n' % \
(attr, ID[1][attr])
xml += '</%s>\n' % ID[0]
return xml
Then try to do:
print conf2xml('conf.dat')
If you slightly tweak Andrea's solution you can use that code to parse both the status.dat as well as the objects.cache
def parseConf(source):
conf = []
for line in source.splitlines():
line=line.strip()
matchID = re.match(r"(?:\s*define)?\s*(\w+)\s+{", line)
matchAttr = re.match(r"\s*(\w+)(?:=|\s+)(.*)", line)
matchEndID = re.match(r"\s*}", line)
if len(line) == 0 or line[0]=='#':
pass
elif matchID:
identifier = matchID.group(1)
cur = [identifier, {}]
elif matchAttr:
attribute = matchAttr.group(1)
value = matchAttr.group(2).strip()
cur[1][attribute] = value
elif matchEndID and cur:
conf.append(cur)
del cur
return conf
It is a little puzzling why nagios chose to use two different formats for these files, but once you've parsed them both into some usable python objects you can do quite a bit of magic through the external command file.
If anybody has a solution for getting this into a a real xml dom that'd be awesome.
For the last several months I've written and released a tool that that parses the Nagios status.dat and objects.cache and builds a model that allows for some really useful manipulation of Nagios data. We use it to drive an internal operations dashboard that is a simplified 'mini' Nagios. Its under continual development and I've neglected testing and documentation but the code isn't too crazy and I feel fairly easy to follow.
Let me know what you think...
https://github.com/zebpalmer/NagParser