Crawling web pages with Python

I have a seed file of 250 URLs of IMDB's top 250 movies.
I need to crawl each one of them and get some info from it.
I've created a function that gets a URL of a movie and returns the info I need. It works great. My problem arises when I try to run this function on all of the 250 URLs.
After a certain amount (not constant!) of URLs that were crawled successfully, the program stops running. The python.exe process takes 0% CPU and the memory consumption doesn't change. After some debugging, I figured out that the problem is with the parsing: it just stops working and I have no idea why (stuck on a find command).
I'm using urllib2 to get the HTML content of the URL, then I parse it as a string and continue to the next URL (I go over each of these strings only once; linear time for all the checks and extractions).
Any idea what can cause this kind of behavior?
EDIT:
I'm attaching the code of one of the problematic functions (I've got one more, but I'm guessing it's the same problem):
def getActors(html, actorsDictionary):
    counter = 0
    actorsLeft = 3
    actorFlag = 0
    imdbURL = "http://www.imdb.com"
    for line in html:
        # we have 3 actors, stop
        if (actorsLeft == 0):
            break
        # current line contains actor information
        if (actorFlag == 1):
            endTag = str(line).find('/" >')
            endTagA = str(line).find('</a>')
            if (actorsLeft == 3):
                actorList = str(line)[endTag+7:endTagA]
            else:
                actorList += ", " + str(line)[endTag+7:endTagA]
            actorURL = imdbURL + str(line)[str(line).find('href=')+6:endTag]
            actorFlag = 0
            actorsLeft -= 1
            actorsDictionary[actorURL] = str(line)[endTag+7:endTagA]
        # check if next line contains actor information
        if (str(line).find('<td class="name">') > -1):
            actorFlag = 1
    # convert commas and clean \n
    actorList = actorList.replace(",", ", ")
    actorList = actorList.replace("\n", "")
    return actorList
I'm calling the function this way:
for url in seedFile:
    moviePage = urllib.request.urlopen(url)
    print(getTitleAndYear(moviePage), ",", movieURL, ",", getPlot(moviePage), getActors(moviePage, actorsDictionary))
This works great without the getActors function.
There is no exception raised here (I removed the try/except for now),
and it's getting stuck in the for loop after some iterations.
EDIT 2: If I run only the getActors function, it works well and finishes all 250 URLs in the seed file.
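Not part of the original question, but one thing worth noting given the loop above: a urllib2/urllib.request response is a file-like object that can only be consumed once, so passing the same moviePage object to several parsing functions means only the first one sees the full page. A minimal sketch of the fetch-then-parse loop described in the question (getTitleAndYear and getPlot assumed from the post) that reads each page into a list of lines first:

import urllib2

for url in seedFile:
    # Read the whole page once; a list of lines can be re-iterated,
    # unlike the response object itself.
    html = urllib2.urlopen(url).read().splitlines()
    titleAndYear = getTitleAndYear(html)   # assumed helper from the post
    plot = getPlot(html)                   # assumed helper from the post
    actors = getActors(html, actorsDictionary)
    print(titleAndYear, ",", url, ",", plot, actors)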

Related

Python3 with .csv files

I have generated a .csv file with a list of assets and have had to go back and make some changes, which require me to regenerate my NFT images (I've done this multiple times at this point). But now the script isn't recognizing my .csv file and won't even allow me to run the script that has worked flawlessly up until this point...
def generateOneRandRow(ADATvID):
    FILENAME = "ADA Tv" + str(ADATvID)
    NO = ADATvID
    BACKGROUND = randBackground()
    ACCESSORIES = randAccessories()
    HEAD = randHead()
    HAT = randHat()
    BODY = randBody()
    CHEST = randChest()
    ARMS = randArms()
    FACE = randFace()
    singleRow = [FILENAME, NO, BACKGROUND, ACCESSORIES, HEAD, HAT, BODY, CHEST, ARMS, FACE]

def checkIfExists(checkRow):
    aData = pd.read_csv('adalist.csv')
    index_list = aData[(aData['Background'] == checkRow[2]) &
                       (aData['Accessories'] == checkRow[3]) &
                       (aData['Head'] == checkRow[4]) &
                       (aData['Hat'] == checkRow[5]) &
                       (aData['Body'] == checkRow[6]) &
                       (aData['Chest'] == checkRow[7]) &
                       (aData['Arms'] == checkRow[8]) &
                       (aData['Face'] == checkRow[9])].index.tolist()
    if index_list == []:
        return False
    else:
        return True
df = pd.read_csv(adalist.csv)
rowCount = df["NO"].count()
print("number of rows is:" + str(rowCount))
The first appearance of adalist.csv shows fine; the last appearance is the one giving me the error. WHY!?!?!?!
The .csv is in the same source folder as everything else... this error just occurred and won't even allow me to press Run on my script.
As stated before, this script ran FLAWLESSLY multiple times, up until I muted (commented out) some code to convert the .csv into JSON for metadata. Even after removing the JSON conversion it's still not recognizing the .csv.
Please help... I don't have the days it took me to reorganize 3500 images to my liking [face palm]
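Not part of the original post, but for reference: pandas expects the path as a quoted string, so a bare adalist.csv (as in the last snippet above) is read as a Python variable name rather than a filename. A minimal sketch, with a check that the file is actually visible from the working directory:

import os
import pandas as pd

csv_path = 'adalist.csv'              # path passed as a string
if not os.path.exists(csv_path):
    # Shows exactly which folder the script is looking in
    print("can't find", os.path.abspath(csv_path))
else:
    df = pd.read_csv(csv_path)
    print("number of rows is: " + str(df["NO"].count()))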

python selenium, slow xpath 'all elements'. add timeout

I need to get all the elements on a page and iterate through them to search each element.
Currently I am using driver.find_elements_by_xpath('//*[@*]').
However, there can be a delay in completing the line of code above on larger pages. Is there a way to retrieve the results in increments of 100 elements? Or at least add a timeout?
Terminating driver.find_elements_by_xpath('//*[@*]') inside a separate thread is the only way I can currently think of to solve this.
I need to find all elements on a page that contain certain strings. For example, elem.get_attribute('outerHTML').find('type="submit"') != -1 … and so on and so forth … I also need their proximity to each other to compare index positions.
Thanks!
import Globalz  ###### globals import is an empty .py file
import threading
import time
import ctypes

def find_xpath():
    for i in range(5):
        print(i)
        time.sleep(1)
    Globalz.curr_value = 'DONE!'
    ### this is where the xpath retrieval goes (ABOVE loop is for example purposes only)

def stopwatch(info):
    curr_time = 0
    failed = False
    Globalz.curr_value = ''
    thread1 = threading.Thread(target=info['function'])
    thread1.start()
    while thread1.is_alive() is True:
        if curr_time >= info['timeout']:
            failed = True
            ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread1.ident), ctypes.py_object(SystemExit))
        curr_time += 1
        time.sleep(1)
    if failed is True:
        return info['failed_returns']
    if failed is False:
        return Globalz.curr_value

betty = stopwatch({'function': find_xpath, 'timeout': 10, 'failed_returns': 'failed'})
print(betty)
If anyone is interested, here is a solution (the code above): I've created a wrapper called stopwatch().
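Not from the original post, but an alternative sketch of the same wrapper idea using concurrent.futures (Python 3). The caller gives up waiting after the timeout; unlike the ctypes approach above, the worker thread is not killed and simply keeps running in the background.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

executor = ThreadPoolExecutor(max_workers=1)

def find_xpath():
    # The real version would return driver.find_elements_by_xpath('//*[@*]')
    return 'DONE!'

def stopwatch(function, timeout, failed_returns):
    # Run `function` in a worker thread and wait at most `timeout` seconds
    # for its return value; fall back to `failed_returns` on timeout.
    future = executor.submit(function)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return failed_returns

print(stopwatch(find_xpath, 10, 'failed'))   # 'DONE!' or 'failed'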

What's the fastest way to expand url in python

I have a checkin list which contains about 600,000 checkins, and there is a URL in each checkin. I need to expand them back to the original long ones. I do so by:
now = time.time()
files_without_url = 0
for i, checkin in enumerate(NYC_checkins):
    try:
        foursquare_url = urllib2.urlopen(re.search("(?P<url>https?://[^\s]+)", checkin[5]).group("url")).url
    except:
        files_without_url += 1
    if i % 1000 == 0:
        print("from %d to %d: %2.5f seconds" % (i-1000, i, time.time()-now))
        now = time.time()
But this takes far too long: from 0 to 1000 checkins, it takes 3241 seconds! Is this normal? What's the most efficient way to expand URLs in Python?
MODIFIED: Some URLs are from Bitly while others are not, and I am not sure where they come from. In this case, I want to simply use the urllib2 module.
for your information, here is an example of checkin[5]:
I'm at The Diner (2453 18th Street NW, Columbia Rd., Washington) w/ 4 others. http...... (this is the short url)
I thought I would expand on my comment regarding the use of multiprocessing to speed up this task.
Let's start with a simple function that will take a url and resolve it as far as possible (following redirects until it gets a 200 response code):
import requests

def resolve_url(url):
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException:
        return (url, None)
    if r.status_code != 200:
        longurl = None
    else:
        longurl = r.url
    return (url, longurl)
This will either return a (shorturl, longurl) tuple, or it will return (shorturl, None) in the event of a failure.
Now, we create a pool of workers:
import multiprocessing
pool = multiprocessing.Pool(10)
And then ask our pool to resolve a list of urls:
resolved_urls = []
for shorturl, longurl in pool.map(resolve_url, urls):
    resolved_urls.append((shorturl, longurl))
Using the above code...
With a pool of 10 workers, I can resolve 500 URLs in 900 seconds.
If I increase the number of workers to 100, I can resolve 500 URLs in 30 seconds.
If I increase the number of workers to 200, I can resolve 500 URLs in 25 seconds.
This is hopefully enough to get you started.
(NB: you could write a similar solution using the threading module rather than multiprocessing. I usually just grab for multiprocessing first, but in this case either would work, and threading might even be slightly more efficient.)
Threads are most appropriate in the case of network I/O. But you could try the following first.
import re
import urllib2                 # imports added for completeness; urllib2 as used in the question
from urllib2 import URLError

pat = re.compile("(?P<url>https?://[^\s]+)")  # always compile it
missing_urls = 0
bad_urls = 0

def check(checkin):
    match = pat.search(checkin[5])
    if not match:
        global missing_urls
        missing_urls += 1
    else:
        url = match.group("url")
        try:
            urllib2.urlopen(url)  # don't look up .url if you don't need it later
        except URLError:  # or just Exception
            global bad_urls
            bad_urls += 1

for i, checkin in enumerate(NYC_checkins):
    check(checkin)

print(bad_urls, missing_urls)
If you get no improvement, then, now that we have a nice check function, create a thread pool and feed it. Speedup is guaranteed; using processes for network I/O is pointless.
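Not part of the original answer, but a sketch of that thread-pool step under the same assumptions (pat, urllib2 and NYC_checkins as defined above). multiprocessing.dummy gives the Pool API backed by threads, and having the worker return flags instead of touching global counters avoids any locking:

from multiprocessing.dummy import Pool as ThreadPool   # threads, not processes

def check_one(checkin):
    # Returns (missing, bad) flags for one checkin.
    match = pat.search(checkin[5])
    if not match:
        return (1, 0)
    try:
        urllib2.urlopen(match.group("url"))
        return (0, 0)
    except Exception:
        return (0, 1)

pool = ThreadPool(50)                     # tune the worker count
results = pool.map(check_one, NYC_checkins)
pool.close()
pool.join()
missing_urls = sum(m for m, b in results)
bad_urls = sum(b for m, b in results)
print(bad_urls, missing_urls)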

Sublime Text 3 plugin: ValueError with Edit objects?

I'm building a Sublime Text 3 plugin to shorten URLs using the goo.gl API. Bear in mind that the following code is hacked together from other plugins and tutorial code. I have no previous experience with Python.
The plugin does actually work as it is. The URL is shortened and replaced inline. Here is the plugin code:
import sublime
import sublime_plugin
import urllib.request
import urllib.error
import json
import threading

class ShortenUrlCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        sels = self.view.sel()
        threads = []
        for sel in sels:
            url = self.view.substr(sel)
            thread = GooglApiCall(sel, url, 5)  # Send the selection, the URL and timeout to the class
            threads.append(thread)
            thread.start()
        # Wait for threads
        for thread in threads:
            thread.join()
        self.view.sel().clear()
        self.handle_threads(edit, threads, sels)

    def handle_threads(self, edit, threads, sels, offset=0, i=0, dir=1):
        next_threads = []
        for thread in threads:
            sel = thread.sel
            result = thread.result
            if thread.is_alive():
                next_threads.append(thread)
                continue
            if thread.result == False:
                continue
            offset = self.replace(edit, thread, sels, offset)
        thread = next_threads
        if len(threads):
            before = i % 8
            after = (7) - before
            if not after:
                dir = -1
            if not before:
                dir = 1
            i += dir
            self.view.set_status("shorten_url", "[%s=%s]" % (" " * before, " " * after))
            sublime.set_timeout(lambda: self.handle_threads(edit, threads, sels, offset, i, dir), 100)
            return
        self.view.erase_status("shorten_url")
        selections = len(self.view.sel())
        sublime.status_message("URL shortener successfully ran on %s URL%s" %
                               (selections, "" if selections == 1 else "s"))

    def replace(self, edit, thread, sels, offset):
        sel = thread.sel
        result = thread.result
        if offset:
            sel = sublime.Region(edit, thread.sel.begin() + offset, thread.sel.end() + offset)
        self.view.replace(edit, sel, result)
        return

class GooglApiCall(threading.Thread):
    def __init__(self, sel, url, timeout):
        self.sel = sel
        self.url = url
        self.timeout = timeout
        self.result = None
        threading.Thread.__init__(self)

    def run(self):
        try:
            apiKey = "xxxxxxxxxxxxxxxxxxxxxxxx"
            requestUrl = "https://www.googleapis.com/urlshortener/v1/url"
            data = json.dumps({"longUrl": self.url})
            binary_data = data.encode("utf-8")
            headers = {
                "User-Agent": "Sublime URL Shortener",
                "Content-Type": "application/json"
            }
            request = urllib.request.Request(requestUrl, binary_data, headers)
            response = urllib.request.urlopen(request, timeout=self.timeout)
            self.result = json.loads(response.read().decode())
            self.result = self.result["id"]
            return
        except (urllib.error.HTTPError) as e:
            err = "%s: HTTP error %s contacting API. %s." % (__name__, str(e.code), str(e.reason))
        except (urllib.error.URLError) as e:
            err = "%s: URL error %s contacting API" % (__name__, str(e.reason))
        sublime.error_message(err)
        self.result = False
The problem is that I get the following error in the console every time the plugin runs:
Traceback (most recent call last):
  File "/Users/joejoinerr/Library/Application Support/Sublime Text 3/Packages/URL Shortener/url_shortener.py", line 51, in <lambda>
    sublime.set_timeout(lambda: self.handle_threads(edit, threads, sels, offset, i, dir), 100)
  File "/Users/joejoinerr/Library/Application Support/Sublime Text 3/Packages/URL Shortener/url_shortener.py", line 39, in handle_threads
    offset = self.replace(edit, thread, sels, offset)
  File "/Users/joejoinerr/Library/Application Support/Sublime Text 3/Packages/URL Shortener/url_shortener.py", line 64, in replace
    self.view.replace(edit, sel, result)
  File "/Applications/Sublime Text.app/Contents/MacOS/sublime.py", line 657, in replace
    raise ValueError("Edit objects may not be used after the TextCommand's run method has returned")
ValueError: Edit objects may not be used after the TextCommand's run method has returned
I'm not sure what the problem is from that error. I have done some research and I understand that the solution may be held in the answer to this question, but due to my lack of Python knowledge I can't figure out how to adapt it to my use case.
I was searching for a Python autocompletion plugin for Sublime and found this question. I like your plugin idea. Did you ever get it working? The ValueError is telling you that you are trying to use the edit argument to ShortenUrlCommand.run after ShortenUrlCommand.run has returned. I think you could do this in Sublime Text 2 using begin_edit and end_edit, but in 3 your plugin has to finish all of its edits before run returns (https://www.sublimetext.com/docs/3/porting_guide.html).
In your code, the handle_threads function is checking the GoogleApiCall threads every 100 ms and executing the replacement for any thread that has finished. But handle_threads has a typo that causes it to run forever: thread = next_threads where it should be threads = next_threads. This means that finished threads are never removed from the list of active threads and all threads get processed in each invocation of handle_threads (eventually throwing the exception that you see).
You actually don't need to worry about whether the GooglApiCall threads are finished in handle_threads, though, because you call join on each one before calling handle_threads (see the Python threading docs for more detail on join: https://docs.python.org/2/library/threading.html). You know the threads are finished, so you can just do something like:
def handle_threads(self, edit, threads, sels):
    offset = 0
    for thread in threads:
        if thread.result:
            offset = self.replace(edit, thread, sels, offset)
    selections = len(threads)
    sublime.status_message("URL shortener successfully ran on %s URL%s" %
                           (selections, "" if selections == 1 else "s"))
This still has problems: it does not properly handle multiple selections and it blocks the UI thread in Sublime.
Multiple Selections
When you replace multiple selections you have to consider that the replacement text might not be the same length as the text it replaces. This shifts the text after it and you have to adjust the indexes for subsequent selected regions. For example, suppose the URLs are selected in the following text and that you are replacing them with shortened URLs:
          1         2         3         4         5         6         7
01234567890123456789012345678901234567890123456789012345678901234567890123
blah blah http://example.com/long blah blah http://example.com/longer blah
The second URL occupies indexes 44 to 68. After replacing the first URL we have:
          1         2         3         4         5         6         7
01234567890123456789012345678901234567890123456789012345678901234567890123
blah blah http://goo.gl/abc blah blah http://example.com/longer blah
Now the second URL occupies indexes 38 to 62. It is shifted by -6: the difference between the length of the string we just replaced and the length of the string we replaced it with. You need to keep track of that difference and update it after each replacement as you go along. It looks like you had this in mind with your offset argument, but never got around to implementing it.
def handle_threads(self, edit, threads, sels):
    offset = 0
    for thread in threads:
        if thread.result:
            offset = self.replace(edit, thread.sel, thread.result, offset)
    selections = len(threads)
    sublime.status_message("URL shortener successfully ran on %s URL%s" %
                           (selections, "" if selections == 1 else "s"))

def replace(self, edit, selection, replacement_text, offset):
    # Adjust the selection region to account for previous replacements
    adjusted_selection = sublime.Region(selection.begin() + offset,
                                        selection.end() + offset)
    self.view.replace(edit, adjusted_selection, replacement_text)
    # Update the offset for the next replacement
    old_len = selection.size()
    new_len = len(replacement_text)
    delta = new_len - old_len
    new_offset = offset + delta
    return new_offset
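A quick sanity check of the offset arithmetic against the worked example above (not part of the original answer):

# Hypothetical check of the offsets used in the example text.
long1, short1 = "http://example.com/long", "http://goo.gl/abc"
offset = 0
offset += len(short1) - len(long1)        # 17 - 23 = -6
# The second URL originally spans indexes 44..68; after the first replacement
# it spans 44 + offset .. 68 + offset, i.e. 38..62, matching the example.
print(offset, 44 + offset, 68 + offset)   # -6 38 62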
Blocking the UI Thread
I'm not familiar with Sublime plugins, so I looked at how this is handled in the Gist plugin (https://github.com/condemil/Gist). They block the UI thread for the duration of the HTTP requests. This seems undesirable, but I think there might be a problem if you don't block: the user could change the text buffer and invalidate the selection indexes before your plugin finishes its updates. If you want to go down this road, you might try moving the URL shortening calls into a WindowCommand. Then once you have the replacement text you could execute a replacement command on the current view for each one. This example gets the current view and executes ShortenUrlCommand on it. You will have to move the code that collects the shortened URLs out into ShortenUrlWrapperCommand.run:
class ShortenUrlWrapperCommand(sublime_plugin.WindowCommand):
    def run(self):
        view = self.window.active_view()
        view.run_command("shorten_url")

Debugging ScraperWiki scraper (producing spurious integer)

Here is a scraper I created using Python on ScraperWiki:
import lxml.html
import re
import scraperwiki
pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank' : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
        # DEBUG BEGIN
        if not type(data["arwu_rank"]) is str:
            print type(data["arwu_rank"])
            print data["arwu_rank"]
            print data["university"]
        # DEBUG END
        if "-" in data["arwu_rank"]:
            arwu_rank_bounds = data["arwu_rank"].split("-")
            data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
        if not type(data["arwu_rank"]) is int:
            data["arwu_rank"] = int(data["arwu_rank"])
        scraperwiki.sqlite.save(unique_keys=['university'], data=data)
It works perfectly except when scraping the final data row of the table (the "York University" line), at which point, instead of lines 9 through 11 of the code retrieving the string "401-500" from the table and assigning it to data["arwu_rank"], those lines somehow seem to be assigning the int 450 to data["arwu_rank"]. You can see that I've added a few lines of "debugging" code to get a better understanding of what's going on, but also that that debugging code doesn't go very deep.
I have two questions:
What are my options for debugging scrapers run on the ScraperWiki infrastructure, e.g. for troubleshooting issues like this? E.g. is there a way to step through?
Can you tell me why the int 450, instead of the string "401-500", is being assigned to data["arwu_rank"] for the "York University" line?
EDIT 6 May 2013, 20:07h UTC
The following scraper completes without issue, but I'm still unsure why the first one failed on the "York University" line:
import lxml.html
import re
import scraperwiki
pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank' : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
        # DEBUG BEGIN
        if not type(data["arwu_rank"]) is str:
            print type(data["arwu_rank"])
            print data["arwu_rank"]
            print data["university"]
        # DEBUG END
        if "-" in data["arwu_rank"]:
            arwu_rank_bounds = data["arwu_rank"].split("-")
            data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
        if not type(data["arwu_rank"]) is int:
            data["arwu_rank"] = int(data["arwu_rank"])
        scraperwiki.sqlite.save(unique_keys=['university'], data=data)
There's no easy way to debug your scripts on ScraperWiki; unfortunately, it just sends your code in its entirety and gets the results back. There's no way to execute the code interactively.
I added a couple more prints to a copy of your code, and it looks like the if check before the bit that assigns data
if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
doesn't trigger for "York University", so it keeps the int value (you set it later on) from the previous time around the loop.
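Not the original answerer's code, but a sketch of how that could be confirmed from prints alone, reusing root from the scraper above: log every row the check rejects and see whether the "York University" row shows up among them.

for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    has_rank = len(tr.cssselect("td.ranking")) > 0
    has_name = len(tr.cssselect("td.rankingname")) > 0
    if not (has_rank and has_name):
        # Rows that fail the check never rebuild `data`, so whatever `data`
        # held from the previous iteration is what would get saved.
        print "skipped row:", repr(tr.text_content().strip()[:80])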
