I'm working on a web crawler. The crawler is built for a web page that has many categories. These categories can have subcategories, the subcategories can have subcategories of their own, and so on, so the whole site forms a tree.
So I made a recursive method that performs a depth-first search.
def deep_search(url):
    if is_leaf(url):
        return get_data(url)
    for url in get_subcategories(url):
        deep_search(url)
This method works fine, but it takes a long time to finish, so there are situations where the connection drops or some other error is raised.
What would you do to remember the state when an error occurs, so that the next run can continue from that state?
I can't just remember the last 'url' or category, since the loops are nested and the program would not know which 'urls' and categories have already been processed in the outer loops.
If the order of the search paths is stable (every run visits the sub-categories in the same order), then you can maintain a list of branch indices in your DFS and make it persistent by saving it to a file or database:
current_path = []  # the path currently being visited

def deep_search(url, last_saved_path=None):
    if is_leaf(url):
        if last_saved_path:
            # continue where you left off
            if path_reached(last_saved_path):
                data = get_data(url)
        else:  # first run
            data = get_data(url)
        # save the whole path persistently
        save_to_file(current_path)
        # add data to result
    else:
        for index, url in enumerate(get_subcategories(url)):
            current_path.append(index)
            deep_search(url, last_saved_path)
            del current_path[-1]

def path_reached(old_path):
    # return True once the current path has reached (or passed)
    # the point that was saved in the last run
    print old_path, current_path  # debug output
    for i, index in enumerate(current_path):
        if index < old_path[i]:
            return False
        elif index > old_path[i]:
            return True
    return True
When running the crawler for a second time, you can load the saved path and start where you left off:
# first run
deep_search(url)
# subsequent runs
last_path = load_last_saved_path_from_file()
deep_search(url, last_path)
That said, I think a web crawler has two kinds of tasks: traversing the graph and downloading data. It's better to keep them separate: use the DFS above (plus the logic to skip paths that have already been visited) to traverse the links and put the download URLs into a queue, then start a group of workers that take URLs from the queue and download them. That way, if you are interrupted, you only need to record the current position in the queue. A sketch of this split is shown below.
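For illustration, here is a minimal sketch of that separation, assuming the same hypothetical helpers (is_leaf, get_subcategories, get_data) as above; the worker count is arbitrary:

import threading
from Queue import Queue   # 'queue' in Python 3

url_queue = Queue()

def traverse(url):
    # phase 1: walk the category tree and only collect leaf URLs
    if is_leaf(url):
        url_queue.put(url)
        return
    for sub_url in get_subcategories(url):
        traverse(sub_url)

def worker():
    # phase 2: each worker repeatedly takes one URL and downloads it
    while True:
        url = url_queue.get()
        try:
            get_data(url)
        finally:
            url_queue.task_done()

def run(start_url, num_workers=4):
    traverse(start_url)
    for _ in range(num_workers):
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()
    url_queue.join()   # block until every queued URL has been processed

For crash recovery you would additionally persist the collected URLs (for example, write them to a file after the traversal) and record how many of them have already been completed.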
I'd also recommend Scrapy. I haven't read the Scrapy source, but I'd guess it implements all of the above, and more.
As a simple hint, you can use a try/except statement to handle your errors and save the relevant URL. A good choice for such a task is a collections.deque with a capacity of 1, which you can check on the next iterations.
Demo:
from collections import deque

def deep_search(url, deq=deque(maxlen=1)):
    if is_leaf(url):
        return get_data(url)
    try:
        for url in get_subcategories(url):
            if deq[0] == url:
                deep_search(url, deq)
    except:  # you can put the exception type after except
        deq.append(url)
But as a more Pythonic way of dealing with networks (graphs of pages), you can use networkx.
NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
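As a purely illustrative sketch (not part of the answer above), the category structure could be recorded as a networkx directed graph and pickled between runs, so a later run knows which links have already been seen; the crawler helpers here are the hypothetical ones from the question:

import pickle
import networkx as nx

crawl_graph = nx.DiGraph()

def record_link(parent_url, child_url):
    # remember which category links to which subcategory
    crawl_graph.add_edge(parent_url, child_url)

def save_graph(path='crawl_graph.pickle'):
    # persist the traversal graph so the next run can skip known nodes
    with open(path, 'wb') as f:
        pickle.dump(crawl_graph, f)

def load_graph(path='crawl_graph.pickle'):
    with open(path, 'rb') as f:
        return pickle.load(f)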
I subscribe to a real-time stream that publishes a small JSON record at a slow rate (about 0.5 KB every 1-5 seconds). The publisher has provided a Python client that exposes these records. I write these records to a list in memory. The client is just a Python wrapper for doing a curl command on an HTTPS endpoint for a dataset. A dataset is defined by filters and fields. I can let the client run for a few days and stop it at midnight to process multiple days' worth of data as one batch.
Instead of the multi-day batches described above, I'd like to write out every n records by treating the stream as a generator. The client code is below; I just added the append() line to create a list called 'records' (in memory) to play back later:
records = []
data_set = api.get_dataset(dataset_id='abc')
for record in data_set.request_realtime():
    records.append(record)
which, as expected, shows [*] in Jupyter Notebook and keeps running.
Then, I created a generator from my in-memory list as follows, to extract one record (n=1 for initial testing):
def Generator():
    count = 1
    while count < 2:
        for r in records:
            yield r.data
        count += 1
But my generator cell also showed [*] and kept running, which I understand is because the list is still being written to in memory. I thought my generator would be able to lock the state of my list and yield the first n records, but it didn't. How can I code my generator in this case? And if a generator is not a good choice for this use case, please advise.
To give you the full picture: if my code were working, I'd instantiate the generator, print it, and get an object back as expected, like this:
>>>my_generator = Generator()
>>>print(my_generator)
<generator object Gen at 0x0000000009910510>
Then, I'd have written it to a csv file like so:
with open('myfile.txt', 'w') as f:
    cf = csv.DictWriter(f, column_headers, extrasaction='ignore')
    cf.writeheader()
    cf.writerows(i.data for i in my_generator)
Note: I know there are many tools for this (e.g. Kafka), but I am in an initial PoC phase. Please use Python 2.x. Once I get my code working, I plan on stacking generators to set up my next n-record extraction so that I don't lose data in between. Any guidance on stacking would also be appreciated.
That's not how concurrency works. Unless some magic is being used that you didn't tell us about, you can't run more code while your first cell still shows [*]. Putting the generator in another cell just adds it to a queue to run when the first cell finishes; since the first cell will never finish, the second one will never even start running!
I suggest looking into an asynchronous networking library, such as asyncio, Twisted, or Trio. They let you make functions cooperative, so while one of them is waiting for data, another can run instead of blocking. You'd probably have to rewrite the api.get_dataset code to be asynchronous as well. A sketch of the idea is below.
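Here is a minimal sketch of what that could look like with asyncio (Python 3), assuming the client could expose an asynchronous iterator; stream_records() and the batch size are hypothetical:

import asyncio
import csv

async def consume(stream, batch_size=100, path='myfile.txt'):
    # cooperative consumer: other tasks can run while we wait for records
    batch = []
    async for record in stream:
        batch.append(record.data)
        if len(batch) >= batch_size:
            write_batch(batch, path)   # flush every n records
            batch = []

def write_batch(batch, path):
    # assumes each record's data is already a row (a list of fields)
    with open(path, 'a') as f:
        csv.writer(f).writerows(batch)

# asyncio.run(consume(stream_records()))   # stream_records() is hypothetical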
I have written a class that opens a text document and searches it line by line for keywords that are input from a GUI I created in a different file. It works great; the only problem is that the text document I am searching is long (over 60,000 entries), so I have been looking at ways to make the search faster and playing around with multithreading, but without any success yet. Basically, the main program calls the search function, which takes a line and breaks it into individual words. Then, in a loop, it checks each word against the keywords from the user. If a keyword is in that word, it records a match by adding a 1 to a list. At the end, if there are as many matches as keywords, the line is added to a set that is returned at the end of main.
What I would like to do is incorporate multithreading so that it runs much faster but still returns the results at the end of the main function. Any advice or direction on how to accomplish this would be very helpful. I have read a bunch of examples and watched a bunch of YouTube videos, but it didn't seem to transfer over when I tried. Thank you for your help and your time.
import pdb
from threading import Thread

class codeBook:
    def __init__(self):
        pass

    def main(self, search):
        count = 0
        results = set()
        with open('CodeBook.txt') as current_CodeBook:
            lines = current_CodeBook.readlines()
            for line in lines:
                line = line.strip()
                new_search = self.change_search(line, search)
                line = new_search[0]
                search = new_search[1]
                #if search in line:
                if self.search(line, search) == True:
                    results.add(line)
                else:
                    pass
                count = count + 1
        results = sorted(list(results))
        return results

    def change_search(self, current_line, search):
        current_line = current_line.lower()
        search = search.lower()
        return current_line, search

    def search(self, line, keywords):
        split_line = line.split()
        split_keywords = keywords.split()
        numberOfTrue = list()
        for i in range(0, len(split_keywords)):
            if split_keywords[i] in line:
                numberOfTrue.append(1)
        if len(split_keywords) == len(numberOfTrue):
            return True
        else:
            return False
You can split the file into several parts and create a new thread to read and process each part. You can keep a data structure global to all threads and add the lines that match the search query to it from all of them; this structure should either be thread-safe, or you need to use some kind of synchronization (like a lock) to work with it.
Note: the CPython interpreter has a global interpreter lock (GIL), so if you're using it and your application is CPU-heavy (which seems to be the case here), you might not get any benefit from multithreading at all.
You can use the multiprocessing module instead. It comes with means of interprocess communication. A Queue looks like the right structure for your problem (each process can add matching lines to the queue). After that, you just need to get all the lines from the queue and do what you did with the results in your code. A rough sketch of this idea follows.
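Here is a rough sketch of that multiprocessing approach, under the assumption that the per-line keyword check from the question can be factored into a plain function; the helper names and the number of workers are illustrative:

from multiprocessing import Process, Queue

def scan_chunk(lines, keywords, queue):
    # each worker scans its own slice of the file
    for line in lines:
        lowered = line.lower()
        if all(keyword in lowered for keyword in keywords):
            queue.put(line)      # workers add matching lines to the shared queue
    queue.put(None)              # sentinel: this worker is done

def parallel_search(keyword_string, filename='CodeBook.txt', workers=4):
    keywords = keyword_string.lower().split()
    with open(filename) as f:
        lines = [line.strip() for line in f]
    queue = Queue()
    chunk = len(lines) // workers + 1
    processes = [
        Process(target=scan_chunk,
                args=(lines[i * chunk:(i + 1) * chunk], keywords, queue))
        for i in range(workers)
    ]
    for p in processes:
        p.start()
    results, done = set(), 0
    while done < workers:        # drain the queue until every sentinel is seen
        item = queue.get()
        if item is None:
            done += 1
        else:
            results.add(item)
    for p in processes:
        p.join()
    return sorted(results)

if __name__ == '__main__':       # required when using multiprocessing on Windows
    print(parallel_search('keyword1 keyword2'))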
While threading and/or multiprocessing can be beneficial and speed up execution, I would direct your attention to optimizing your current single-threaded algorithm first, before doing that.
Looking at your implementation, I believe a lot of work is done several times for no reason. To the best of my understanding, the following function will perform the same operation as your codeBook.main but with less overhead:
def search_keywords(keyword_string, filename='CodeBook.txt'):
    results = set()
    keywords = set()
    for keyword in keyword_string.lower().split():
        keywords.add(keyword)
    with open(filename) as code_book:
        for line in code_book:
            words = line.strip().lower()
            kws_present = True
            for keyword in keywords:
                kws_present = keyword in words
                if not kws_present:
                    break
            if kws_present:
                results.add(line.strip())
    return sorted(list(results))
Try this function, as is or slightly modified for your needs, and see if it gives you a sufficient speed-up. Only when that is not enough should you look into more complex solutions, since introducing more threads/processes will invariably increase the complexity of your program.
I need to try all possible paths, branching every time I hit a certain point. There are <128 possible paths for this problem, so no need to worry about exponential scaling.
I have a player that can take steps through a field. The player
takes a step, and on a step there could be an encounter.
There are two options when an encounter is found: i) Input 'B' or ii) Input 'G'.
I would like to try both and continue repeating this until the end of the field is reached. The end goal is to have tried all possibilities.
Here is the template, in Python, for what I am talking about (Step object returns the next step using next()):
from row_maker_inlined import Step

def main():
    initial_stats = {'n': 1, 'step': 250, 'o': 13, 'i': 113, 'dng': 0, 'inp': 'Empty'}
    player = Step(initial_stats)
    end_of_field = 128

    # Walk until reaching an encounter:
    while player.step['n'] < end_of_field:
        player.next()
        if player.step['enc']:
            print 'An encounter has been reached.'
            # Perform an input on an encounter step:
            player.input = 'B'
            # Make a branch of player?
            # Perform this on the branch:
            # player.input = 'G'
            # Keep doing this, and branching on each encounter, until the end is reached.
As you can see, the problem is rather simple; it's just that, as a beginner programmer, I have no idea how to solve it.
I believe I may need to use recursion in order to keep branching, but I don't really understand how one 'makes a branch' using recursion, or anything else.
What kind of solution should I be looking at?
You should be looking at search algorithms like breadth-first search (BFS) and depth-first search (DFS).
Wikipedia has this as the pseudo-code implementation of BFS:
procedure BFS(G, v) is
    let Q be a queue
    Q.enqueue(v)
    label v as discovered
    while Q is not empty
        v ← Q.dequeue()
        for all edges from v to w in G.adjacentEdges(v) do
            if w is not labeled as discovered
                Q.enqueue(w)
                label w as discovered
Essentially, when you reach an "encounter" you add that point to the end of your queue. Then you take the FIRST element off the queue and explore it, putting all of its children into the queue, and so on. It's a non-recursive solution that is simple enough to do what you want.
DFS is similar, but instead of taking the FIRST element from the queue, you take the LAST. This makes you explore a single path all the way to a dead end before coming back to explore another. A sketch of the branching idea is shown below.
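Here is a rough sketch of how that branching could look for your player, not a definitive implementation: it assumes the player state can be duplicated, e.g. with copy.deepcopy, which may or may not work for your Step class:

import copy
from collections import deque
from row_maker_inlined import Step

def explore_all_paths(initial_stats, end_of_field=128):
    queue = deque([Step(initial_stats)])        # each entry is one branch to walk
    finished = []
    while queue:
        player = queue.popleft()                # FIFO gives BFS; use pop() for DFS
        while player.step['n'] < end_of_field:
            player.next()
            if player.step['enc']:
                branch = copy.deepcopy(player)  # assumption: the state can be copied
                player.input = 'B'              # this branch chooses 'B'
                branch.input = 'G'              # the copy chooses 'G'
                queue.append(branch)            # explore the copy later
        finished.append(player)
    return finished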
Good luck!
I have a Python script that pulls from various internal network sources. With how our systems are set up, we initiate a urllib pull from a network location, and on certain parts of the network it gets hung up waiting forever for a response. I would like my script to check whether the pull has finished within, say, 5 minutes; if not, it should skip that function call, attempt to pull from the next address, and record the failure to a "bad" log so we can check which systems get hung up. There are over 20,000 IP addresses we are checking, some running older scripts that no longer work but will still try to run when requested, and they never stop trying.
I'm familiar with having a script pause at a certain point:
import time
time.sleep(300)
What I'm thinking of, from a pseudocode perspective (not proper Python, just illustrating the idea):
import time
import urllib2

url_dict = ['http://1', 'http://2', 'http://3', ...]
fail_log_path = 'C:/Temp/fail_log.txt'

for addresses in url_dict:
    clock_value = time.start()
    while clock_value <= 300:
        print str(clock_value)
        res = urllib2.retrieve(url)
    if res != []:
        pass
    else:
        fail_log = open(fail_log_path, 'a')
        fail_log.write("Failed to pull from site location: " + str(url) + "\n")
        fail_log.close()
Update: a specific option for dealing with this for URLs: timeout for urllib2.urlopen() in pre Python 2.6 versions.
I also found this answer, which is more in line with the overall problem in my question:
kill a function after a certain time in windows
Your code as written doesn't seem to do what you describe. It seems you want the if/else check inside your while loop. On top of that, you would want to loop over the IP addresses, not over a time period as your code currently does (otherwise you will keep requesting the same address over and over). Instead of keeping track of time yourself, I would suggest reading up on urllib.request.urlopen, specifically its timeout parameter. Once set, that call will raise a socket.timeout exception when the time limit is reached. Surround it with a try/except block that catches that error and handle it appropriately. A sketch is below.
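A minimal sketch of that approach, using Python 3's urllib.request (in Python 2.6+, urllib2.urlopen accepts the same timeout argument); the URLs, log path, and 300-second limit are just the placeholder values from the pseudocode above:

import socket
import urllib.request
import urllib.error

urls = ['http://1', 'http://2', 'http://3']
fail_log_path = 'C:/Temp/fail_log.txt'

for url in urls:
    try:
        # give up on this host after 300 seconds instead of hanging forever
        response = urllib.request.urlopen(url, timeout=300)
        data = response.read()
    except (socket.timeout, urllib.error.URLError) as err:
        with open(fail_log_path, 'a') as fail_log:
            fail_log.write("Failed to pull from site location: %s (%s)\n" % (url, err))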
I am having some trouble using a parallel version of map (the ppmap wrapper, implementation by Kirk Strauser).
The function I am trying to run in parallel runs a simple regular-expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.
If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze: there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but not using any CPU anymore), e.g.:
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
Furthermore, the workers do not seem to freeze on any particular data entry: if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and restarting the program multiple times.)
Is there any way to see where the problem is?
Sample of the code that I am running:
import Bio.SeqIO

def analyse_repeats(data):
    """
    Loads whole proteome in memory and then looks for repeats in sequences,
    flags both real repeats and sequences not containing particular aminoacid
    """
    (organism, organism_id, filename) = data
    import re
    letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
    try:
        handle = open(filename)
        data = Bio.SeqIO.parse(handle, "fasta")
        records = [record for record in data]
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for letter in letters:
                items = set(re.compile("(%s+)" % tuple(([letter] * 1))).findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id, len(item), uniprot_id, letter))
                else:
                    # letter not present in the string, "zero" repeat
                    store_records.append((organism_id, 0, uniprot_id, letter))
        handle.close()
        return (organism, store_records)
    except IOError as e:
        print e
        return (organism, [])
res_generator = ppmap.ppmap(
    None,
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)

for res in res_generator:
    # process the output
    pass
If I use simple map instead of the ppmap, everything works fine:
res_generator = map(
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages; it also works very well.
By default it uses as many worker processes as your computer has cores, but you can specify a higher number as well. For example:
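A minimal sketch of swapping ppmap for multiprocessing.Pool, reusing the analyse_repeats function and the same input lists from the question:

from multiprocessing import Pool

if __name__ == '__main__':            # required when using multiprocessing on Windows
    pool = Pool()                     # defaults to one worker per CPU core
    try:
        results = pool.map(
            analyse_repeats,
            zip(todo_list, organism_ids, filenames)
        )
    finally:
        pool.close()
        pool.join()
    for organism, store_records in results:
        # process the output
        pass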
May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but I hope it helps you.