I have a list consisting of IDs, about 50k per day. I have to make 50k requests per day to the server (the server is in the same city), fetch the information, and store it in a database. I've done that using a loop and threads, and I've noticed that after an unknown period of time it stops fetching and storing.
Take a look at my code fragment:
import re, urllib, urllib2
import mysql.connector as sql
import threading
from time import sleep
import idvalid

conn = sql.connect(user="example", password="example", host="127.0.0.1", database="students", collation="cp1256_general_ci")
cmds = conn.cursor()
ids = []  # the IDs are going to be stored here

def fetch():
    while len(ids) > 0:  # loop until the list of IDs is exhausted
        try:
            idnumber = ids.pop()
            content = urllib2.urlopen("http://www.example.com/fetch.php?id=" + idnumber, timeout=120).read()
            if content.find('<font color="red">') != -1:
                pass
            else:
                name = content[-20:]
                cmds.execute("INSERT INTO `students`.`basic` (`id` ,`name`) VALUES ('%s', '%s');" % (idnumber, name))
        except Exception, r:
            print r, "==>", idnumber
            sleep(0.5)  # i think sleep will help in threading? i'm not sure
            pass
        print len(ids)  # print how many IDs are left

for i in range(0, 50):  # 50 threads
    threading.Thread(target=fetch).start()
Output: it keeps printing how many IDs are left, and then at some unknown moment it stops printing, fetching, and storing.
Both networking and threading are non-trivial... most probably the cause is a networking event that results in a hanging thread. I'd be interested to hear whether people have solutions for this, because I have suffered the same problem of threads that stop responding.
But there are some things I would definitely change in your code:
I would never catch "Exception". Just catch those exceptions that you know how to deal with. If a network error occurs in one of your threads, you could retry rather than giving up on the id.
There is a race condition in your code: you first check whether there is remaining content, and then you take it out. At the second point in time, the remaining work may have disappeared, resulting in an exception. If you find this difficult to fix, there is a brilliant python object that is meant to pass objects between threads without race conditions and deadlocks: the Queue object. Check it out.
The "sleep(0.5)" is not helping threading in general. It should not be necessary. It may reduce the chance of hitting race conditions, but it is better to program race conditions totally out. On the other hand, having 50 threads at full spead banging the web server may not be a very friendly thing to do. Make sure to stay within the limits of what the service can offer.
I need to write a Python program that lets me set reminders for specific times, e.g. 'remember to take the bins out at 2pm', but I can only work out how to set a reminder for a certain length of time ahead, not for a given time. I also need to be able to set multiple reminders for multiple times.
Any help would be much appreciated :)
This looks like a homework assignment, so you need to write the code yourself.
Step 1: You know what time it is now. You know when 2pm is. How much time is there between now and 2pm? Sleep for that long.
Step 2: Keep a list of all pending alarms. Find the earliest alarm. Remove it from the list. Sleep until that alarm happens. Repeat.
You'll probably find Step 2 easier if you use an appropriate data structure like heapq or PriorityQueue. But if the number of alarms is small, a list should do just fine.
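As a rough sketch of step 2 using heapq (the reminder times and messages below are made up):

import heapq
import time
from datetime import datetime

# (timestamp, message) pairs; heapq keeps the earliest timestamp at the front
alarms = [
    (datetime(2021, 11, 16, 14, 0).timestamp(), "Take the bins out"),
    (datetime(2021, 11, 16, 9, 30).timestamp(), "Morning meeting"),
]
heapq.heapify(alarms)

while alarms:
    when, message = heapq.heappop(alarms)   # earliest pending alarm
    delay = when - time.time()
    if delay > 0:
        time.sleep(delay)                   # sleep until that alarm is due
    print(message)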
The following checks for any new reminders every second. That said, after reading Frank's answer, that would be a better solution, and the best solution of all is not to use Python at all but to let the operating system manage this by creating a cron job, or on Windows a scheduled task.
reminders = [
    # Put all of your reminders here
    ('2021-11-16 02:44:00', 'Take out the garbage'),
    ('2021-11-17 04:22:00', 'Another reminder')
]

from datetime import datetime
import time

# For performance reasons it's best to perform whatever computations we can before we go into our infinite loop
# In this case let's calculate all of the timestamps
reminders2 = {datetime.fromisoformat(reminder[0]).timestamp(): reminder[1] for reminder in reminders}

while True:
    now = time.time()
    for timestamp, reminder_msg in reminders2.items():
        if timestamp < now:
            print(reminder_msg)
            del reminders2[timestamp]
            # we're not in any hurry; instead of worrying about the consequences of deleting something from
            # the same dictionary we are iterating over, we can just break and wait for the next go-around
            # of the while loop to finish checking the remaining reminders
            break
    time.sleep(1)  # in seconds
Currently I'm making a Python bot for WhatsApp manually, without APIs or anything of that sort, because I am clueless. As such, I'm using Selenium to take in messages and auto-reply. I'm noticing that every few messages, one message doesn't get picked up because the loops run too slowly, and my computer is already pretty fast. Here's the code:
def incoming_msges():
    msges = driver.find_elements_by_class_name("message-in")
    msgq = []
    tq = []
    try:
        for msg in msges:
            txt_msg = msg.find_elements_by_class_name("copyable-text")
            time = msg.find_elements_by_class_name("_18lLQ")
            for t in time:
                tq.append(t.text.lower())
            for txt in txt_msg:
                msgq.append(txt.text.lower())
        msgq = msgq[-1]
        tq = tq[-1]
        if len(msgq) > 0:
            return (msgq, tq)
    except StaleElementReferenceException:
        pass
    return False
Previously, I didn't add the time check, and the last message would be saved; with this loop running continuously, even if the other party sent the same thing again, the code would not recognise it as a new message because it thinks it's the same one as before. So now the problem is that my code is very time-consuming and I have no idea how to speed it up. I tried doing this:
def incoming_msges():
    msges = browser.find_elements_by_class_name("message-in")
    try:
        msg = msges[-1]
        txt_msg = msg.find_element_by_xpath("/span[#class=\"copyable-text\"]").text.lower()
        time = msg.find_element_by_xpath("/span[#class=\"_18lLQ\"]").text.lower()
        return (txt_msg, time)
    except Exception:
        pass
    return False
However, like this, the code just doesn't find any messages. I have got the element types and classes right according to the WhatsApp Web page, but it just doesn't run. What's the correct way of rewriting my first code block, which does still work correctly? Thanks in advance.
First things first...
I definitely recommend using the API, because what you are trying to do here is reinvent the wheel. The API can tell you when there is a change in your status, and you can queue those changes, so I definitely recommend using it. It might be hard at the beginning, but trust me, it's worth it.
Next, I would recommend using normal variable names. msges, msgq and tq are rather unreadable, and I still don't understand what they are supposed to be after reading the code twice.
As for your speed problem: try/except blocks are quite heavy performance-wise, so I would use plain checks where possible (20 if statements might be faster, or might not be). There are also a couple of Python details that seem to be tripping you up here:
msgq = msgq[-1]  # this takes the last element and rebinds the list variable to that single element;
# to be more specific, with msgq == [1, 2, 3, 4], msgq = msgq[-1] leaves msgq == 4 (which, in my opinion, hurts performance as well)
tq = tq[-1]  # same here
This would be better :)
if len(msgq[-1]) > 0:
    return (msgq[-1], tq[-1])
If I understand your code correctly, you are trying to scrape the messages. But since you say you want to make an auto-reply bot, I would recommend either getting ready for some JS magic or switching tools. I personally noticed that Selenium has a problem with dynamic content; to be more specific, once it is at the end of the page it does not scrape it again. So unless you want to auto-refresh every 5-10 seconds to get the latest HTML, I recommend either writing this bot in JS (triggered every time an element changes) or using the API and keeping Selenium just for the responses. I was told that Selenium was created to simulate an ordinary user, to check that the user interface works as it should (that buttons exist, that the website contains everything it should, etc.). For this job, Selenium is like washing a whole car with a small flower sponge: you can do it, but it will cost you a lot of time and you might miss some spots (like those missed messages).
Lastly, working with strings is costly in general, and you are doing O(n^2) operations inside a try block, which I imagine can be really expensive. If possible, I would reduce the number of inner for loops.
I wish you good luck with this project and I hope you find the answer you seek; I hope my answer was at least a little helpful.
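As a rough illustration of that last point, something along these lines only inspects the newest message bubble instead of re-walking every message; the class names are copied from the question and may well have changed on WhatsApp Web since then:

from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException

def latest_incoming_msg(driver):
    messages = driver.find_elements_by_class_name("message-in")
    if not messages:
        return False
    try:
        last = messages[-1]                                              # only the newest bubble
        text = last.find_element_by_class_name("copyable-text").text.lower()
        sent_at = last.find_element_by_class_name("_18lLQ").text.lower()
        return (text, sent_at)
    except (NoSuchElementException, StaleElementReferenceException):
        return False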
I have a loop that runs for quite a long time (several hours). It may be that the user, looking at the current results, considers the iterations done so far sufficient and wants to stop the loop before its natural end, but without interrupting the whole program (no Ctrl+C), since some final processing of the results is necessary.
To do that, I added the possibility of creating a specific 'stop' file in the working directory. On each iteration, the code checks whether that file exists and, if so, ends the loop. I do not know whether this solution is efficient or whether better solutions exist.
Example
i = 0
while i < 1000 and not path.isfile(path.join(self.wrkdir, 'stop')):
    DoSomeStuff
    i += 1
FinalizingStuff
If the only reason for not using Ctrl+C is that you think it will stop your whole program, then the best solution is to use it instead of watching for files.
You can simply catch the exception it raises (it is called KeyboardInterrupt) in your code like any other exception and do whatever you want.
import time

try:
    while True:
        time.sleep(0.1)
except KeyboardInterrupt:
    print('Ok, user is pissed with our loop, go further')
finally:
    # if some resources need to be cleaned
    pass

print('Here we are, nothing is lost')
I'm really new to programming in general and very inexperienced, and I'm learning Python as I think it's simpler than other languages. Anyway, I'm trying to use Flask-Ask with ngrok to program an Alexa skill that checks data online (which changes a couple of times per hour). The script takes four different numbers (from a different URL), organizes them into a dictionary, and uses Selenium and PhantomJS to access the data.
Obviously, this exceeds the 8-10 second maximum runtime for an intent before Alexa decides it has taken too long and returns an error message (I know it's timing out because ngrok and the Python log would show an actual error if one occurred, and it invariably fails after 8-10 seconds even though at that point it should still be in the middle of the script). I've read that I could just reprompt it, but I don't know how, and that would only give me 8-10 more seconds; the script usually takes about 25 seconds just to get the data from the internet (and then maybe a second to turn it into a dictionary).
I tried putting the getData function right after the intent that runs when the Alexa skill is first invoked, but it only runs when I initialize my local server, and it then holds the same data for every new Alexa session. Because the data changes frequently, I want the function to run every time I start a new session for the skill with Alexa.
So, I decided to move the function that actually gets the data into another script, and make that other script run constantly in a loop. Here's the code I used.
import time
from selenium import webdriver                      # imports needed by getData()
from selenium.webdriver.common.keys import Keys

def getData():
    username = ''  # username hidden for anonymity
    password = ''  # password hidden for anonymity
    browser = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
    browser.get("https://gradebook.com")  # actual website name changed
    browser.find_element_by_name("username").clear()
    browser.find_element_by_name("username").send_keys(username)
    browser.find_element_by_name("password").clear()
    browser.find_element_by_name("password").send_keys(password)
    browser.find_element_by_name("password").send_keys(Keys.RETURN)
    global currentgrades
    currentgrades = []
    gradeids = ['2018202', '2018185', '2018223', '2018626', '2018473', '2018871', '2018886']
    for x in range(0, len(gradeids)):
        try:
            gradeurl = "https://www.gradebook.com/grades/"
            browser.get(gradeurl)
            grade = browser.find_element_by_id("currentStudentGrade[]").get_attribute('innerHTML').encode('utf8')[0:3]
            if grade[2] != "%":
                grade = browser.find_element_by_id("currentStudentGrade[]").get_attribute('innerHTML').encode('utf8')[0:4]
            if grade[1] == "%":
                grade = browser.find_element_by_id("currentStudentGrade[]").get_attribute('innerHTML').encode('utf8')[0:1]
            currentgrades.append(grade)
        except Exception:
            currentgrades.append('No assignments found')
            continue
    dictionary = {"class1": currentgrades[0], "class2": currentgrades[1], "class3": currentgrades[2], "class4": currentgrades[3], "class5": currentgrades[4], "class6": currentgrades[5], "class7": currentgrades[6]}
    return dictionary

def run():
    dictionary = getData()
    time.sleep(60)
That script runs constantly and does what I want, but in my other script I don't know how to just pick up the dictionary variable. When I use
from getdata.py import dictionary
in the Flask-Ask script, it just runs the loop and constantly gets the data. I just want the Flask-Ask script to take the variable defined in the run function and use it, without running any of the actual code in the getdata script, which has already run and fetched the correct data. If it matters, both scripts are running in Terminal on a MacBook.
Is there any way to do what I'm asking about, or are there any easier workarounds? Any and all help is appreciated!
It sounds like you want to import the function so you can run it, rather than importing the dictionary.
Try deleting the run function, and then in your other script use:
from getdata import getData
Then each time you write getData() it will run your code and get a new up-to-date dictionary.
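For instance, a minimal sketch of how that might look in the Flask-Ask script; the intent name and the spoken response are placeholders, not from the original skill:

from flask import Flask
from flask_ask import Ask, statement
from getdata import getData

app = Flask(__name__)
ask = Ask(app, "/")

@ask.intent("GetGradesIntent")          # hypothetical intent name
def get_grades():
    grades = getData()                  # runs the scrape and returns a fresh dictionary
    return statement("Your first class grade is {}".format(grades["class1"]))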
Is this what you were asking about?
This issue has been resolved.
As for the original question, I never did figure out how to import just the dictionary without first running the function that generates it. Furthermore, I realized there had to be a more practical solution than constantly running a script like that, which even then wouldn't give brand-new data.
My solution was to make the script that gets the data start running at the same time as the launch function. Here is the final script for the first intent (the rest of it remained the same):
#ask.intent("start_skill")
def start_skill():
welcome_message = 'What is the password?'
thread = threading.Thread(target=getData, args=())
thread.daemon = True
thread.start()
return question(welcome_message)
def getData():
#script to get data here
#other intents and rest of script here
By design, the skill requests a numeric passcode to make sure I am the one using it before it is willing to read the data (which was probably pointless, but this skill is at least as much for my own education as for practical use, so for the extra practice I wanted it to have as many features as I could justify). So, by the time you can actually ask for the data, the script that gets it will have finished running (I have tested this and it seems to work without fail).
I have a Python script that pulls from various internal network sources. With how our systems are set up, we initiate a urllib pull from a network location, and on certain parts of the network it gets hung up waiting forever for a response. I would like the script to check whether the pull has finished within, say, 5 minutes; if not, it should skip that address, move on to the next one, and record the failure to a "bad" repository (so we can go and check which systems get hung up; there are over 20,000 IP addresses we are checking, some running older scripts that no longer work but will still try to run when requested, and they never stop trying).
I'm familiar with having a script pause at a certain point:
import time
time.sleep(300)
What I'm thinking, from a pseudocode perspective (not proper Python, just illustrating the idea):
import time
import urllib2

url_dict = ['http://1', 'http://2', 'http://3', ...]
fail_log_path = 'C:/Temp/fail_log.txt'

for addresses in url_dict:
    clock_value = time.start()
    while clock_value <= 300:
        print str(clock_value)
        res = urllib2.retrieve(url)
    if res != []:
        pass
    else:
        fail_log = open(fail_log_path, 'a')
        fail_log.write("Failed to pull from site location: " + str(url) + "\n")
        faile_log.close
Update: a specific option for this, dealing with URL timeouts: timeout for urllib2.urlopen() in pre Python 2.6 versions.
I also found this answer, which is more in line with the overall problem in my question:
kill a function after a certain time in windows
Your code as written doesn't seem to do what you describe. It seems you want the if/else check inside your while loop. On top of that, you would want to loop over the IP addresses, not over a time period as the code currently does (otherwise you will keep requesting the same address every time). Instead of keeping track of time yourself, I would suggest reading up on urllib.request.urlopen, specifically its timeout parameter. Once set, that function call will throw a socket.timeout exception when the time limit is reached. Surround it with a try/except block catching that error and handle it appropriately.
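A rough sketch of that approach (the addresses and log path are placeholders carried over from the pseudocode above):

import socket
import urllib.error
import urllib.request

urls = ['http://1', 'http://2', 'http://3']
fail_log_path = 'C:/Temp/fail_log.txt'

for url in urls:
    try:
        response = urllib.request.urlopen(url, timeout=300)   # give up after 5 minutes
        data = response.read()
        # ... process "data" here ...
    except (socket.timeout, urllib.error.URLError) as e:
        with open(fail_log_path, 'a') as fail_log:
            fail_log.write("Failed to pull from site location: {} ({})\n".format(url, e))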