I have a checkin list containing about 600,000 checkins, and each checkin includes a URL that I need to expand back to the original long one. I do so as follows:
now = time.time()
files_without_url = 0

for i, checkin in enumerate(NYC_checkins):
    try:
        foursquare_url = urllib2.urlopen(re.search("(?P<url>https?://[^\s]+)", checkin[5]).group("url")).url
    except:
        files_without_url += 1

    if i % 1000 == 0:
        print("from %d to %d: %2.5f seconds" % (i - 1000, i, time.time() - now))
        now = time.time()
But this takes far too long: the first 1,000 checkins alone take 3,241 seconds! Is this normal? What is the most efficient way to expand URLs in Python?
MODIFIED: Some URLs are shortened with Bitly while others are not, and I am not sure which service produced them. In that case, I simply want to use the urllib2 module.
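For example, here is the kind of thing I have in mind (an untested sketch; HeadRequest is just a name I made up). It issues a HEAD request so the response body is never downloaded, while redirects are still followed:

import urllib2

class HeadRequest(urllib2.Request):
    # Ask the server for headers only; urllib2 still follows
    # redirects, so geturl() returns the final expanded URL.
    def get_method(self):
        return "HEAD"

def expand(short_url, timeout=10):
    try:
        return urllib2.urlopen(HeadRequest(short_url), timeout=timeout).geturl()
    except Exception:
        return None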
For your information, here is an example of checkin[5]:
I'm at The Diner (2453 18th Street NW, Columbia Rd., Washington) w/ 4 others. http...... (this is the short URL)
I thought I would expand on my comment regarding the use of multiprocessing to speed up this task.
Let's start with a simple function that will take a url and resolve it as far as possible (following redirects until it gets a 200 response code):
import requests

def resolve_url(url):
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException:
        return (url, None)

    if r.status_code != 200:
        longurl = None
    else:
        longurl = r.url

    return (url, longurl)
This will either return a (shorturl, longurl) tuple, or it will return (shorturl, None) in the event of a failure.
Now, we create a pool of workers:
import multiprocessing
pool = multiprocessing.Pool(10)
And then ask our pool to resolve a list of urls:
resolved_urls = []
for shorturl, longurl in pool.map(resolve_url, urls):
    resolved_urls.append((shorturl, longurl))
Using the above code...
With a pool of 10 workers, I can resolve 500 URLs in 900 seconds.
If I increase the number of workers to 100, I can resolve 500 URLs in 30 seconds.
If I increase the number of workers to 200, I can resolve 500 URLs in 25 seconds.
This is hopefully enough to get you started.
(NB: you could write a similar solution using the threading module rather than multiprocessing. I usually just reach for multiprocessing first, but in this case either would work, and threading might even be slightly more efficient.)
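For example, a minimal sketch of the threading variant, assuming the resolve_url function and urls list from above: multiprocessing.dummy exposes the same Pool API backed by threads, so it is essentially a one-line swap.

from multiprocessing.dummy import Pool  # thread-backed Pool, same API as multiprocessing.Pool

pool = Pool(100)  # threads are cheap for I/O-bound work
resolved_urls = pool.map(resolve_url, urls)  # same resolve_url as above
pool.close()
pool.join()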
Threads are most appropriate for network I/O, but you could try the following first.
import re
import urllib2
from urllib2 import URLError

pat = re.compile(r"(?P<url>https?://[^\s]+)")  # always compile it

missing_urls = 0
bad_urls = 0

def check(checkin):
    match = pat.search(checkin[5])
    if not match:
        global missing_urls
        missing_urls += 1
    else:
        url = match.group("url")
        try:
            urllib2.urlopen(url)  # don't look up .url if you don't need it later
        except URLError:  # or just Exception
            global bad_urls
            bad_urls += 1
for i, checkin in enumerate(NYC_checkins):
    check(checkin)

print(bad_urls, missing_urls)
If you get no improvement, then, now that we have a nice check function, create a thread pool and feed it, as sketched below. A speedup is all but guaranteed; using processes for network I/O is pointless.
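A minimal sketch, assuming a variant of check that returns its verdict instead of mutating the global counters (safer across threads, since the tallying then happens in the main thread; pat and urllib2 are as defined above):

from multiprocessing.dummy import Pool  # a thread pool with the familiar Pool API

def check_one(checkin):
    # Return "missing", "bad", or "ok" for a single checkin.
    match = pat.search(checkin[5])
    if not match:
        return "missing"
    try:
        urllib2.urlopen(match.group("url"))
        return "ok"
    except Exception:
        return "bad"

pool = Pool(50)  # tune the worker count to your bandwidth
results = pool.map(check_one, NYC_checkins)
pool.close()
pool.join()
print(results.count("bad"), results.count("missing"))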
I've heard that Python multi-threading is a bit tricky, and I am not sure of the best way to implement what I need. Let's say I have a function called IO_intensive_function that makes an API call which may take a while to return a response.
Say the process of queuing jobs can look something like this:
import thread

for job_args in jobs:
    thread.start_new_thread(IO_intensive_function, (job_args,))
Would IO_intensive_function now just execute its task in the background and allow me to queue up more jobs?
I also looked at this question, which suggests that the approach is just to do the following:
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(2)
results = pool.map(IO_intensive_function, jobs)
As I don't need those tasks to communicate with each other, the only goal is to send my API requests as fast as possible. Is this the most efficient way? Thanks.
Edit:
The way I am making the API request is through a Thrift service.
I had to create code to do something similar recently. I've tried to make it generic below. Note I'm a novice coder, so please forgive the inelegance. What you may find valuable, however, is some of the error processing I found it necessary to embed to capture disconnects, etc.
I also found it valuable to perform the json processing in a threaded manner. You have the threads working for you, so why go "serial" again for a processing step when you can extract the info in parallel.
It is possible I have mis-coded something in making it generic. Please don't hesitate to ask follow-ups and I will clarify.
import time

import requests
from multiprocessing.dummy import Pool as ThreadPool
from src_code.config import Config

with open(Config.API_PATH + '/api_security_key.pem') as f:
    my_key = f.read().rstrip("\n")

base_url = "https://api.my_api_destination.com/v1"
headers = {"Authorization": "Bearer %s" % my_key}

itm = list()
itm.append(base_url)
itm.append(headers)

def call_API(call_var):
    base_url = call_var[0]
    headers = call_var[1]
    call_specific_tag = call_var[2]
    endpoint = f'/api_path/{call_specific_tag}'

    connection_tries = 0
    for i in range(3):
        try:
            dat = requests.get((base_url + endpoint), headers=headers).json()
        except:
            connection_tries += 1
            print(f'Call for {call_specific_tag} failed after {i} attempt(s). Pausing for 240 seconds.')
            time.sleep(240)
        else:
            break

    tag = list()
    vars_to_capture_01 = list()
    vars_to_capture_02 = list()
    connection_tries = 0

    try:
        if 'record_id' in dat:
            vars_to_capture_01.append(dat['record_id'])
            vars_to_capture_02.append(dat['second_item_of_interest'])
        else:
            vars_to_capture_01.append(call_specific_tag)
            print(f'Call specific tag {call_specific_tag} is unavailable. Successful pull.')
            vars_to_capture_02.append(-1)
    except:
        print(f'{call_specific_tag} is unavailable. Unsuccessful pull.')
        vars_to_capture_01.append(call_specific_tag)
        vars_to_capture_02.append(-1)
        time.sleep(240)

    pack = list()
    pack.append(vars_to_capture_01)
    pack.append(vars_to_capture_02)
    return pack
vars_to_capture_01 = list()
vars_to_capture_02 = list()

i = 0
max_i = len(all_tags)
while i < max_i:
    ind_rng = range(i, min((i + 10), (max_i)), 1)
    itm_lst = (itm.copy())
    call_var = [itm_lst + [all_tags[q]] for q in ind_rng]
    # packed = call_API(call_var[0])  # for testing the function without pooling
    pool = ThreadPool(len(call_var))
    packed = pool.map(call_API, call_var)
    pool.close()
    pool.join()
    for pack in packed:
        try:
            vars_to_capture_01.append(pack[0][0])
        except:
            print(f'Unpacking error for {all_tags[i]}.')
        vars_to_capture_02.append(pack[1][0])
    i += 10  # advance to the next batch of tags
For network API requests you can use asyncio. Have a look at this article for an example of how to implement it: https://realpython.com/python-concurrency/#asyncio-version
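For instance, a minimal sketch using asyncio together with the third-party aiohttp package (aiohttp is an assumption on my part; the standard library alone has no async HTTP client):

import asyncio
import aiohttp  # third-party: pip install aiohttp

async def fetch(session, url):
    # One GET request; errors are returned rather than raised,
    # so one bad URL does not cancel the whole batch.
    try:
        async with session.get(url) as resp:
            return url, resp.status
    except aiohttp.ClientError as exc:
        return url, exc

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

results = asyncio.run(main(["https://example.com", "https://example.org"]))
print(results)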
I need to get all the elements on a page and iterate through them to search each element.
Currently I am using driver.find_elements_by_xpath('//*[@*]').
However, there can be a delay in completing the line of code above on larger pages. Is there a way to retrieve the results in increments of 100 elements? Or at least add a timeout?
Terminating driver.find_elements_by_xpath('//*[@*]') inside a separate thread is currently the only way I can think of to solve this.
I need to find all elements on a page that contain certain strings, for example elem.get_attribute('outerHTML').find('type="submit"') != -1, and so on and so forth. I also need their proximity to each other, to compare index positions.
Thanks!
If anyone is interested, here is a solution: I've created a wrapper called stopwatch() that runs a function in a separate thread and aborts it once a timeout is reached.

import Globalz  # globals import (an empty .py file)
import threading
import time
import ctypes

def find_xpath():
    for i in range(5):
        print(i)
        time.sleep(1)
    Globalz.curr_value = 'DONE!'
    # this is where the xpath retrieval goes (the loop above is for example purposes only)

def stopwatch(info):
    curr_time = 0
    failed = False
    Globalz.curr_value = ''
    thread1 = threading.Thread(target=info['function'])
    thread1.start()
    while thread1.is_alive():
        if curr_time >= info['timeout']:
            failed = True
            # Raise SystemExit inside the worker thread to abort it
            ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread1.ident), ctypes.py_object(SystemExit))
        curr_time += 1
        time.sleep(1)
    if failed:
        return info['failed_returns']
    return Globalz.curr_value

betty = stopwatch({'function': find_xpath, 'timeout': 10, 'failed_returns': 'failed'})
print(betty)
It is fairly easy to do parallel work with Python 3's concurrent.futures module as shown below.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    future_to = {executor.submit(do_work, input, 60): input for input in dictionary}
    for future in concurrent.futures.as_completed(future_to):
        data = future.result()
It is also very handy to insert and retrieve items into a Queue.
q = queue.Queue()

for task in tasks:
    q.put(task)

while not q.empty():
    q.get()
I have a script running in the background listening for updates. Now, in theory, as those updates arrive I would queue them and work on them concurrently using the ThreadPoolExecutor.
Individually, all of these components work in isolation and make sense, but how do I go about using them together? I am not sure whether it is possible to feed the ThreadPoolExecutor work from the queue in real time, or whether the data to work on must be predetermined.
In a nutshell, all I want to do is receive updates of, say, 4 messages a second, shove them into a queue, and have concurrent.futures work on them. Otherwise I am stuck with a sequential approach, which is slow.
Let's take the canonical example in the Python documentation below:
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
The list of URLS is fixed. Is it possible to feed this list in real time and have the workers process the URLs as they come in, perhaps from a queue for management purposes? I am a bit confused about whether my approach is actually possible.
The example from the Python docs, expanded to take its work from a queue. A change to note is that this code uses concurrent.futures.wait instead of concurrent.futures.as_completed, to allow new work to be started while waiting for other work to complete.
import concurrent.futures
import urllib.request
import time
import queue

q = queue.Queue()

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def feed_the_workers(spacing):
    """ Simulate outside actors sending in work to do, request each url twice """
    for url in URLS + URLS:
        time.sleep(spacing)
        q.put(url)
    return "DONE FEEDING"

def load_url(url, timeout):
    """ Retrieve a single page and report the URL and contents """
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:

    # start a future for a thread which sends work in through the queue
    future_to_url = {
        executor.submit(feed_the_workers, 0.25): 'FEEDER DONE'}

    while future_to_url:
        # check for status of the futures which are currently working
        done, not_done = concurrent.futures.wait(
            future_to_url, timeout=0.25,
            return_when=concurrent.futures.FIRST_COMPLETED)

        # if there is incoming work, start a new future
        while not q.empty():
            # fetch a url from the queue
            url = q.get()
            # Start the load operation and mark the future with its URL
            future_to_url[executor.submit(load_url, url, 60)] = url

        # process any completed futures
        for future in done:
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            else:
                if url == 'FEEDER DONE':
                    print(data)
                else:
                    print('%r page is %d bytes' % (url, len(data)))
            # remove the now completed future
            del future_to_url[future]
Output from fetching each url twice:
'http://www.foxnews.com/' page is 67574 bytes
'http://www.cnn.com/' page is 136975 bytes
'http://www.bbc.co.uk/' page is 193780 bytes
'http://some-made-up-domain.com/' page is 896 bytes
'http://www.foxnews.com/' page is 67574 bytes
'http://www.cnn.com/' page is 136975 bytes
DONE FEEDING
'http://www.bbc.co.uk/' page is 193605 bytes
'http://some-made-up-domain.com/' page is 896 bytes
'http://europe.wsj.com/' page is 874649 bytes
'http://europe.wsj.com/' page is 874649 bytes
At work I found a situation where I wanted to do parallel work on an unbounded stream of data. I created a small library inspired by the excellent answer already provided by Stephen Rauch.
I originally approached this problem by thinking about two separate threads, one that submits work to a queue and one that monitors the queue for any completed tasks and makes more room for new work to come in. This is similar to what Stephen Rauch proposed, where he consumes the stream using a feed_the_workers function that runs in a separate thread.
Talking to one of my colleagues, I came to realize that you can get away with doing everything in a single thread if you define a buffered iterator that lets you control how many elements are drawn from the input stream each time you are ready to submit more work to the thread pool.
So we introduce the BufferedIter class
class BufferedIter(object):
    def __init__(self, iterator):
        self.iter = iterator

    def nextN(self, n):
        vals = []
        for _ in range(n):
            vals.append(next(self.iter))
        return vals
which allows us to define the stream processor in the following way
import logging
import queue
import signal
import sys
import time

from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

level = logging.DEBUG
log = logging.getLogger(__name__)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
handler.setLevel(level)
log.addHandler(handler)
log.setLevel(level)

WAIT_SLEEP = 1  # second, adjust this based on the timescale of your tasks

def stream_processor(input_stream, task, num_workers):

    # Use a queue to signal shutdown.
    shutting_down = queue.Queue()

    def shutdown(signum, frame):
        log.warning('Caught signal %d, shutting down gracefully ...' % signum)
        # Put an item in the shutting down queue to signal shutdown.
        shutting_down.put(None)

    # Register the signal handler
    signal.signal(signal.SIGTERM, shutdown)
    signal.signal(signal.SIGINT, shutdown)

    def is_shutting_down():
        return not shutting_down.empty()

    futures = dict()
    buffer = BufferedIter(input_stream)
    with ThreadPoolExecutor(num_workers) as executor:
        num_success = 0
        num_failure = 0
        while True:
            # Submit as many new tasks as there are idle workers
            idle_workers = num_workers - len(futures)

            if not is_shutting_down():
                items = buffer.nextN(idle_workers)
                for data in items:
                    futures[executor.submit(task, data)] = data

            done, _ = wait(futures, timeout=WAIT_SLEEP, return_when=ALL_COMPLETED)
            for f in done:
                data = futures[f]
                try:
                    f.result(timeout=0)
                except Exception as exc:
                    log.error('future encountered an exception: %r, %s' % (data, exc))
                    num_failure += 1
                else:
                    log.info('future finished successfully: %r' % data)
                    num_success += 1
                del futures[f]

            if is_shutting_down() and len(futures) == 0:
                break

    log.info("num_success=%d, num_failure=%d" % (num_success, num_failure))
Below we show an example of how to use the stream processor
import itertools

def integers():
    """Simulate an infinite stream of work."""
    for i in itertools.count():
        yield i

def task(x):
    """The task we would like to perform in parallel.
    With some delay to simulate a time consuming job.
    With a baked in exception to simulate errors.
    """
    time.sleep(3)
    if x == 4:
        raise ValueError('bad luck')
    return x * x

stream_processor(integers(), task, num_workers=3)
The output for this example is shown below
2019-01-15 22:34:40,193 future finished successfully: 1
2019-01-15 22:34:40,193 future finished successfully: 0
2019-01-15 22:34:40,193 future finished successfully: 2
2019-01-15 22:34:43,201 future finished successfully: 5
2019-01-15 22:34:43,201 future encountered an exception: 4, bad luck
2019-01-15 22:34:43,202 future finished successfully: 3
2019-01-15 22:34:46,208 future finished successfully: 6
2019-01-15 22:34:46,209 future finished successfully: 7
2019-01-15 22:34:46,209 future finished successfully: 8
2019-01-15 22:34:49,215 future finished successfully: 11
2019-01-15 22:34:49,215 future finished successfully: 10
2019-01-15 22:34:49,215 future finished successfully: 9
^C <=== THIS IS WHEN I HIT Ctrl-C
2019-01-15 22:34:50,648 Caught signal 2, shutting down gracefully ...
2019-01-15 22:34:52,221 future finished successfully: 13
2019-01-15 22:34:52,222 future finished successfully: 14
2019-01-15 22:34:52,222 future finished successfully: 12
2019-01-15 22:34:52,222 num_success=14, num_failure=1
I really liked the interesting approach by @pedro above. However, when processing thousands of files, I noticed that at the end a StopIteration would be thrown and some files would always be skipped. I had to make a little modification, as follows. A very useful answer again.
class BufferedIter(object):
    def __init__(self, iterator):
        self.iter = iterator

    def nextN(self, n):
        vals = []
        try:
            for _ in range(n):
                vals.append(next(self.iter))
            return vals, False
        except StopIteration:
            return vals, True
-- Call as follows:

...
if not is_shutting_down():
    items, is_finished = buffer.nextN(idle_workers)
    if is_finished:
        stop()
...
-- where stop is a function that simply signals the shutdown:

def stop():
    shutting_down.put(None)
It is possible to gain the benefits of the executor without strictly having to use a Queue. New tasks are submitted from the main thread. The undone futures are tracked and waited on until all futures are done.
import concurrent.futures
import sys
import time

sys.setrecursionlimit(64)  # This is only for demonstration purposes to trigger a RecursionError. Do not set in practice.

def slow_factorial(n: int) -> int:
    time.sleep(0.01)
    if n == 0:
        return 1
    else:
        return n * slow_factorial(n - 1)

initial_inputs = [0, 1, 5, 20, 200, 100, 50, 51, 55, 40, 44, 21, 222, 333, 202, 1000, 10, 9000, 9009, 99, 9999]

for executor_class in (concurrent.futures.ThreadPoolExecutor, concurrent.futures.ProcessPoolExecutor):
    for max_workers in (4, 8, 16, 32):
        start_time = time.monotonic()
        with executor_class(max_workers=max_workers) as executor:
            futures_to_n = {executor.submit(slow_factorial, n): n for n in initial_inputs}
            while futures_to_n:
                futures_done, futures_not_done = concurrent.futures.wait(futures_to_n, return_when=concurrent.futures.FIRST_COMPLETED)
                # Note: Length of futures_done is often > 1.
                for future in futures_done:
                    n = futures_to_n.pop(future)
                    try:
                        factorial_n = future.result()
                    except RecursionError:
                        n_smaller = int(n ** 0.9)
                        future = executor.submit(slow_factorial, n_smaller)
                        futures_to_n[future] = n_smaller
                        # print(f'Failed to compute factorial of {n}. Trying to compute factorial of a smaller number {n_smaller} instead.')
                    else:
                        # print(f'Factorial of {n} is {factorial_n}.')
                        pass
        used_time = time.monotonic() - start_time
        executor_type = executor_class.__name__.removesuffix('PoolExecutor').lower()
        print(f'Workflow took {used_time:.1f}s with {max_workers} {executor_type} workers.')
    print()
Output:
Workflow took 9.4s with 4 thread workers.
Workflow took 6.3s with 8 thread workers.
Workflow took 5.4s with 16 thread workers.
Workflow took 5.2s with 32 thread workers.
Workflow took 9.0s with 4 process workers.
Workflow took 5.9s with 8 process workers.
Workflow took 5.1s with 16 process workers.
Workflow took 4.9s with 32 process workers.
For more clarity, uncomment the two print statements. As the output above shows, adding workers helps, but the speedup levels off asymptotically.
I'm building a Sublime Text 3 plugin to shorten URLs using the goo.gl API. Bear in mind that the following code is hacked together from other plugins and tutorial code. I have no previous experience with Python.
The plugin does actually work as it is. The URL is shortened and replaced inline. Here is the plugin code:
import sublime
import sublime_plugin
import urllib.request
import urllib.error
import json
import threading

class ShortenUrlCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        sels = self.view.sel()
        threads = []
        for sel in sels:
            url = self.view.substr(sel)
            thread = GooglApiCall(sel, url, 5)  # Send the selection, the URL and timeout to the class
            threads.append(thread)
            thread.start()

        # Wait for threads
        for thread in threads:
            thread.join()

        self.view.sel().clear()
        self.handle_threads(edit, threads, sels)

    def handle_threads(self, edit, threads, sels, offset=0, i=0, dir=1):
        next_threads = []
        for thread in threads:
            sel = thread.sel
            result = thread.result
            if thread.is_alive():
                next_threads.append(thread)
                continue
            if thread.result == False:
                continue
            offset = self.replace(edit, thread, sels, offset)
        thread = next_threads

        if len(threads):
            before = i % 8
            after = (7) - before
            if not after:
                dir = -1
            if not before:
                dir = 1
            i += dir

            self.view.set_status("shorten_url", "[%s=%s]" % (" " * before, " " * after))
            sublime.set_timeout(lambda: self.handle_threads(edit, threads, sels, offset, i, dir), 100)
            return

        self.view.erase_status("shorten_url")
        selections = len(self.view.sel())
        sublime.status_message("URL shortener successfully ran on %s URL%s" %
                               (selections, "" if selections == 1 else "s"))

    def replace(self, edit, thread, sels, offset):
        sel = thread.sel
        result = thread.result
        if offset:
            sel = sublime.Region(edit, thread.sel.begin() + offset, thread.sel.end() + offset)
        self.view.replace(edit, sel, result)
        return

class GooglApiCall(threading.Thread):
    def __init__(self, sel, url, timeout):
        self.sel = sel
        self.url = url
        self.timeout = timeout
        self.result = None
        threading.Thread.__init__(self)

    def run(self):
        try:
            apiKey = "xxxxxxxxxxxxxxxxxxxxxxxx"
            requestUrl = "https://www.googleapis.com/urlshortener/v1/url"
            data = json.dumps({"longUrl": self.url})
            binary_data = data.encode("utf-8")
            headers = {
                "User-Agent": "Sublime URL Shortener",
                "Content-Type": "application/json"
            }
            request = urllib.request.Request(requestUrl, binary_data, headers)
            response = urllib.request.urlopen(request, timeout=self.timeout)
            self.result = json.loads(response.read().decode())
            self.result = self.result["id"]
            return

        except (urllib.error.HTTPError) as e:
            err = "%s: HTTP error %s contacting API. %s." % (__name__, str(e.code), str(e.reason))
        except (urllib.error.URLError) as e:
            err = "%s: URL error %s contacting API" % (__name__, str(e.reason))

        sublime.error_message(err)
        self.result = False
The problem is that I get the following error in the console every time the plugin runs:
Traceback (most recent call last):
  File "/Users/joejoinerr/Library/Application Support/Sublime Text 3/Packages/URL Shortener/url_shortener.py", line 51, in <lambda>
    sublime.set_timeout(lambda: self.handle_threads(edit, threads, sels, offset, i, dir), 100)
  File "/Users/joejoinerr/Library/Application Support/Sublime Text 3/Packages/URL Shortener/url_shortener.py", line 39, in handle_threads
    offset = self.replace(edit, thread, sels, offset)
  File "/Users/joejoinerr/Library/Application Support/Sublime Text 3/Packages/URL Shortener/url_shortener.py", line 64, in replace
    self.view.replace(edit, sel, result)
  File "/Applications/Sublime Text.app/Contents/MacOS/sublime.py", line 657, in replace
    raise ValueError("Edit objects may not be used after the TextCommand's run method has returned")
ValueError: Edit objects may not be used after the TextCommand's run method has returned
I'm not sure what the problem is from that error. I have done some research and I understand that the solution may be held in the answer to this question, but due to my lack of Python knowledge I can't figure out how to adapt it to my use case.
I was searching for a Python autocompletion plugin for Sublime and found this question. I like your plugin idea. Did you ever get it working? The ValueError is telling you that you are trying to use the edit argument to ShortenUrlCommand.run after ShortenUrlCommand.run has returned. I think you could do this in Sublime Text 2 using begin_edit and end_edit, but in 3 your plugin has to finish all of its edits before run returns (https://www.sublimetext.com/docs/3/porting_guide.html).
In your code, the handle_threads function is checking the GooglApiCall threads every 100 ms and executing the replacement for any thread that has finished. But handle_threads has a typo that causes it to run forever: thread = next_threads where it should be threads = next_threads. This means that finished threads are never removed from the list of active threads, and all threads get processed on each invocation of handle_threads (eventually throwing the exception that you see).
You actually don't need to worry about whether the GooglApiCall threads are finished in handle_threads, though, because you call join on each one before calling handle_threads (see the Python threading docs for more detail on join: https://docs.python.org/2/library/threading.html). You know the threads are finished, so you can just do something like:
def handle_threads(self, edit, threads, sels):
    offset = 0
    for thread in threads:
        if thread.result:
            offset = self.replace(edit, thread, sels, offset)

    selections = len(threads)
    sublime.status_message("URL shortener successfully ran on %s URL%s" %
                           (selections, "" if selections == 1 else "s"))
This still has problems: it does not properly handle multiple selections and it blocks the UI thread in Sublime.
Multiple Selections
When you replace multiple selections you have to consider that the replacement text might not be the same length as the text it replaces. This shifts the text after it and you have to adjust the indexes for subsequent selected regions. For example, suppose the URLs are selected in the following text and that you are replacing them with shortened URLs:
          1         2         3         4         5         6         7
01234567890123456789012345678901234567890123456789012345678901234567890123
blah blah http://example.com/long blah blah http://example.com/longer blah
The second URL occupies indexes 44 to 68. After replacing the first URL we have:
          1         2         3         4         5         6         7
01234567890123456789012345678901234567890123456789012345678901234567890123
blah blah http://goo.gl/abc blah blah http://example.com/longer blah
Now the second URL occupies indexes 38 to 62. It is shifted by -6: the difference between the length of the string we just replaced and the length of the string we replaced it with. You need to keep track of that difference and update it after each replacement as you go along. It looks like you had this in mind with your offset argument, but never got around to implementing it.
def handle_threads(self, edit, threads, sels):
    offset = 0
    for thread in threads:
        if thread.result:
            offset = self.replace(edit, thread.sel, thread.result, offset)

    selections = len(threads)
    sublime.status_message("URL shortener successfully ran on %s URL%s" %
                           (selections, "" if selections == 1 else "s"))

def replace(self, edit, selection, replacement_text, offset):
    # Adjust the selection region to account for previous replacements
    adjusted_selection = sublime.Region(selection.begin() + offset,
                                        selection.end() + offset)

    self.view.replace(edit, adjusted_selection, replacement_text)

    # Update the offset for the next replacement
    old_len = selection.size()
    new_len = len(replacement_text)
    delta = new_len - old_len
    new_offset = offset + delta
    return new_offset
Blocking the UI Thread
I'm not familiar with Sublime plugins, so I looked at how this is handled in the Gist plugin (https://github.com/condemil/Gist). They block the UI thread for the duration of the HTTP requests. This seems undesirable, but I think there might be a problem if you don't block: the user could change the text buffer and invalidate the selection indexes before your plugin finishes its updates. If you want to go down this road, you might try moving the URL shortening calls into a WindowCommand. Then once you have the replacement text you could execute a replacement command on the current view for each one. This example gets the current view and executes ShortenUrlCommand on it. You will have to move the code that collects the shortened URLs out into ShortenUrlWrapperCommand.run:
class ShortenUrlWrapperCommand(sublime_plugin.WindowCommand):
    def run(self):
        view = self.window.active_view()
        view.run_command("shorten_url")
I have a seed file of 250 URLs of IMDB's top 250 movies.
I need to crawl each one of them and get some info from it.
I've created a function that gets a URL of a movie and returns the info I need. It works great. My problem is when I'm trying to run this function on all of the 250 URLs.
After a certain number (not constant!) of URLs have been crawled successfully, the program stops running. The python.exe process takes 0% CPU and the memory consumption doesn't change. After some debugging, I figured out that the problem is with the parsing: it just stops working and I have no idea why (it gets stuck on a find command).
I'm using urllib2 to get the HTML content of the URL, then I parse it as a string and continue to the next URL (I go over each of these strings only once; linear time for all the checks and extractions).
Any idea what can cause this kind of behavior?
EDIT:
I'm attaching the code of one of the problematic functions (I have one more, but I'm guessing it suffers from the same problem):
def getActors(html, actorsDictionary):
    counter = 0
    actorsLeft = 3
    actorFlag = 0
    imdbURL = "http://www.imdb.com"
    for line in html:
        # we have 3 actors, stop
        if (actorsLeft == 0):
            break
        # current line contains actor information
        if (actorFlag == 1):
            endTag = str(line).find('/" >')
            endTagA = str(line).find('</a>')
            if (actorsLeft == 3):
                actorList = str(line)[endTag+7:endTagA]
            else:
                actorList += ", " + str(line)[endTag+7:endTagA]
            actorURL = imdbURL + str(line)[str(line).find('href=')+6:endTag]
            actorFlag = 0
            actorsLeft -= 1
            actorsDictionary[actorURL] = str(line)[endTag+7:endTagA]
        # check if next line contains actor information
        if (str(line).find('<td class="name">') > -1):
            actorFlag = 1

    # convert commas and clean \n
    actorList = actorList.replace(",", ", ")
    actorList = actorList.replace("\n", "")
    return actorList
I'm calling the function this way:
for url in seedFile:
    moviePage = urllib.request.urlopen(url)
    print(getTitleAndYear(moviePage), ",", url, ",", getPlot(moviePage), getActors(moviePage, actorsDictionary))
This works great without the getActors function. No exception is raised here (I removed the try/except for now), and it gets stuck in the for loop after some iterations.
EDIT 2: If I run only the getActors function, it works well and finishes all 250 URLs in the seed file.