I'm trying to test a publicly available web page that takes a GET request and returns a different JSON file depending on the GET argument.
The API looks like
https://www.example.com/api/page?type=check&code=[Insert string here]
I made a program to check the results of all possible 4-letter strings on this API. My code looks something like this (with the actual URL replaced):
import time, urllib.request

for a in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    for b in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
            for d in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
                test = urllib.request.urlopen("https://www.example.com/api/page?type=check&code=" + a + b + c + d).read()
                if test != b'{"result":null}':
                    print(a + b + c + d)
                    f = open("codes", "a")
                    f.write(a + b + c + d + ",")
                    f.close()
This code is completely functional and works as expected. However, there is a problem: because the program can't progress until it receives a response, this method is very slow. If the ping time to the API is 100 ms, then each check takes 100 ms. When I modified this code so that it tested half of the results in one instance and half in another, I noticed that the speed doubled.
Because of this, I'm led to believe that the ping time of the site is the limiting factor in this script. What I want to do is basically check each code and then immediately check the next one without waiting for a response.
That would be the equivalent of opening up the page a few thousand times in my browser. It could load many tabs at the same time, since each page is less than a kilobyte.
I looked into using threading to do this, but I'm not sure whether it's relevant or helpful.
Use a worker pool, as described here: https://docs.python.org/3.7/library/multiprocessing.html
from multiprocessing import Pool

def test_url(code):
    ''' insert code to test URL '''
    pass

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(test_url, [code1, code2, code3]))
Just be aware that the website might be rate-limiting the number of requests you are making.
To be more specific with your example, I would split it into two phases: (1) generate the test codes, and (2) test the URL for one given code. Once you have the list of codes generated, you can apply the worker-pool strategy above to run the verifier over each generated code.
To generate the test codes, you can use itertools:
import itertools
import string

test_codes = [''.join(i) for i in itertools.product(string.ascii_uppercase, repeat=4)]
You have a better understanding of how to test a URL given one test code, so I assume you can write a function test_url(test_code) that makes the appropriate URL request and verifies the result as necessary. Then you can call:
with Pool(5) as p:
    print(p.map(test_url, test_codes))
On top of this, I would suggest two things: (1) keep test_codes small at first (for example, take a sublist of the generated codes) so you can confirm your code works correctly, and (2) play with the size of the worker pool so you don't overwhelm your machine or the API.
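For reference, here is a minimal sketch of what test_url could look like, reusing the URL and the b'{"result":null}' comparison from the question (adapt as needed). p.map will then return the matching codes for hits and None for misses:

import urllib.request

def test_url(code):
    # return the code when the API gives a non-null result, otherwise None
    url = "https://www.example.com/api/page?type=check&code=" + code
    body = urllib.request.urlopen(url).read()
    return code if body != b'{"result":null}' else None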
Alternatively you can use asyncio (https://docs.python.org/3/library/asyncio.html) to keep everything in a single process.
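If you go that route, a rough sketch (assuming Python 3.9+ for asyncio.to_thread, and reusing the URL and null-result check from the question) might look like this, with a semaphore capping how many requests are in flight at once:

import asyncio
import urllib.request

BASE_URL = "https://www.example.com/api/page?type=check&code="

async def check_code(code, sem):
    async with sem:
        # run the blocking urllib call in a worker thread so many requests overlap
        body = await asyncio.to_thread(
            lambda: urllib.request.urlopen(BASE_URL + code).read())
        return code if body != b'{"result":null}' else None

async def main(codes):
    sem = asyncio.Semaphore(20)  # cap concurrency so the API isn't hammered
    results = await asyncio.gather(*(check_code(c, sem) for c in codes))
    return [c for c in results if c is not None]

# hits = asyncio.run(main(test_codes))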
Related
I have done some Python programming, but I'm not a seasoned developer by any stretch of the imagination. We have a Python ETL program that was set up as a Cloud Function, but it is timing out because there is just too much data to load, and we are looking to rewrite it to work in Dataflow.
The code at the moment simply connects to an API, which returns newline-delimited JSON, and then the data is loaded into a new table in BigQuery.
This is our first time using Dataflow and we are just trying to get to grips with how it works. It seems pretty easy to get the data into BigQuery; the stumbling block we are hitting is how to get the data out of the API. It's not clear to us how we can make this work. Do we need to go down the route of developing a new I/O connector, as per [Develop IO Connector]? Or is there another option, as developing a new connector seems complex?
We've done a lot of googling, but haven't found anything obvious to help.
Here is a sample of our code, but we are not 100% sure it's on the right track. The code doesn't work, and we think it needs to be a .io.read and not a .ParDo initially, but we aren't quite sure where to go with that. Some guidance would be much appreciated!
class callAPI(beam.DoFn):
    def __init__(self, input_header, input_uri):
        self.headers = input_header
        self.remote_url = input_uri

    def process(self):
        try:
            res = requests.get(self.remote_url, headers=self.headers)
            res.raise_for_status()
        except HTTPError as message:
            logging.error(message)
            return
        return res.text
with beam.Pipeline() as p:
    data = (
        p
        | 'Call API ' >> beam.ParDo(callAPI(HEADER, REMOTE_URI))
        | beam.Map(print))
Thanks in advance.
You are on the right track, but there are a couple of things to fix.
As you point out, the root of a pipeline needs to be a read of some kind. The ParDo operation processes a set of elements (ideally in parallel), but needs some input to process. You could do
p | beam.Create(['a', 'b', 'c']) | beam.ParDo(SomeDoFn())
in which SomeDoFn will be passed a, b, and c to its process method. There is a special p | beam.Impulse() operation that produces a single None element if there is no reasonable input and you want to ensure your DoFn is called just once. You can also read elements from a file (or similar). Note that your process method should take both self and the element to be processed, and should return an iterable (to allow zero or more outputs; there are also beam.Map and beam.FlatMap, which encapsulate the simpler patterns). So you could do something like
import logging
import apache_beam as beam
import requests
from requests.exceptions import HTTPError

class CallAPI(beam.DoFn):
    def __init__(self, input_header):
        self.headers = input_header

    def process(self, input_uri):
        try:
            res = requests.get(input_uri, headers=self.headers)
            res.raise_for_status()
        except HTTPError as message:
            logging.error(message)
            return
        yield res.text
with beam.Pipeline() as p:
    data = (
        p
        | beam.Create([REMOTE_URI])
        | 'Call API ' >> beam.ParDo(CallAPI(HEADER))
        | beam.Map(print))
which would allow you to read from more than one URI (in parallel) in the same pipeline.
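As an aside, because process here yields at most one element, the same fetch can be written with beam.FlatMap instead of a DoFn class; a small sketch under the same HEADER and REMOTE_URI assumptions as the snippet above:

def call_api(input_uri, headers=HEADER):
    # return an iterable: zero outputs on failure, one output on success
    try:
        res = requests.get(input_uri, headers=headers)
        res.raise_for_status()
    except HTTPError as message:
        logging.error(message)
        return []
    return [res.text]

with beam.Pipeline() as p:
    data = (
        p
        | beam.Create([REMOTE_URI])
        | 'Call API' >> beam.FlatMap(call_api)
        | beam.Map(print))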
You could write a full IO connector if your source is such that it can be split (ideally dynamically) rather than only read in one huge request.
Can you share the code from your cloud function?
Is this a scheduled task or triggered by an event? If it is a scheduled task, Apache Airflow may be a better option; you could use the DataFlowPythonOperator and BigQueryOperator to do what you're looking for (see the links below, and the rough sketch after them):
Apache Airflow https://airflow.apache.org/
DataFlowPythonOperator https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/contrib/operators/dataflow_operator/index.html#airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator
BigQueryOperator https://airflow.apache.org/docs/apache-airflow/1.10.14/_api/airflow/contrib/operators/bigquery_operator/index.html
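To make that concrete, here is a rough sketch of the Airflow side using the 1.10 contrib operator linked above; the DAG id, schedule, GCS path and project id are all hypothetical placeholders, and the Beam pipeline itself would live in the referenced py_file:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG(dag_id='api_to_bigquery',
         start_date=datetime(2021, 1, 1),
         schedule_interval='@hourly',  # hypothetical schedule
         catchup=False) as dag:

    run_beam_pipeline = DataFlowPythonOperator(
        task_id='run_beam_pipeline',
        py_file='gs://my-bucket/pipelines/api_to_bq.py',  # hypothetical path to the Beam script
        options={'project': 'my-gcp-project'},            # hypothetical project id
    )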
I subscribe to a real-time stream which publishes a small JSON record at a slow rate (0.5 KB every 1-5 seconds). The publisher has provided a Python client that exposes these records. I write these records to a list in memory. The client is just a Python wrapper for doing a curl request against an HTTPS endpoint for a dataset. A dataset is defined by filters and fields. I can let the client go for a few days and stop it at midnight to process multiple days' worth of data as one batch.
Instead of the multi-day batches described above, I'd like to write out every n records by treating the stream as a generator. The client code is below. I just added the append() line to build a list called 'records' (in memory) to play back later:
records = []
data_set = api.get_dataset(dataset_id='abc')
for record in data_set.request_realtime():
    records.append(record)
which, as expected, gives me [*] in Jupyter Notebook and keeps running.
Then, I created a generator from my list in memory as follows to extract one record (n=1 for initial testing):
def Generator():
    count = 1
    while count < 2:
        for r in records:
            yield r.data
        count += 1
But my generator definition also gave me [*] and kept running, which I understand is because the list is still being written in memory. I thought my generator would be able to lock the state of my list and yield the first n records, but it didn't. How can I code my generator in this case? And if a generator is not a good choice for this use case, please advise.
To give you the full picture: if my code were working, I'd have instantiated the generator, printed it, and received an object as expected, like this:
>>>my_generator = Generator()
>>>print(my_generator)
<generator object Gen at 0x0000000009910510>
Then, I'd have written it to a csv file like so:
with open('myfile.txt', 'w') as f:
    cf = csv.DictWriter(f, column_headers, extrasaction='ignore')
    cf.writeheader()
    cf.writerows(i.data for i in my_generator)
Note: I know there are many tools for this, e.g. Kafka, but I am in an initial PoC phase. Please use Python 2.x. Once I get my code working, I plan on stacking generators to set up my next n-record extraction so that I don't lose data in between. Any guidance on stacking would also be appreciated.
That's not how concurrency works. Unless some magic is being used that you didn't tell us about, you can't run more code while your first cell is still showing [*]. Putting the generator in another cell just adds it to a queue to run when the first cell finishes; since the first cell will never finish, the second will never even start running!
I suggest looking into some asynchronous networking library, like asyncio, twisted or trio. They allow you to make functions cooperative so while one of them is waiting for data, the other can run, instead of blocking. You'd probably have to rewrite the api.get_dataset code to be asynchronous as well.
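If rewriting the client is not an option, a plain-threads workaround (not asyncio, just a background thread feeding a Queue) gives similar behaviour: the blocking client runs off the main thread and you pull batches of n records as they arrive. A rough sketch, assuming the api object from the question (on Python 2 the module is named Queue rather than queue):

import threading
from queue import Queue  # Python 2: from Queue import Queue

record_queue = Queue()

def producer():
    # the blocking client from the question runs here, off the main thread
    data_set = api.get_dataset(dataset_id='abc')
    for record in data_set.request_realtime():
        record_queue.put(record)

threading.Thread(target=producer).start()

def take_batch(n):
    # blocks only until n records have arrived, then returns them
    return [record_queue.get() for _ in range(n)]

# batch = take_batch(10)  # e.g. write these 10 records to CSV, then call again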
I'm really new to programming in general and very inexperienced, and I'm learning Python as I think it's simpler than other languages. Anyway, I'm trying to use Flask-Ask with ngrok to program an Alexa skill that checks data online (which changes a couple of times per hour). The script takes four different numbers (from a different URL), organizes them into a dictionary, and uses Selenium and PhantomJS to access the data.
Obviously, this exceeds the 8-10 second maximum runtime for an intent before Alexa decides it has taken too long and returns an error message (I know it's timing out because ngrok and the Python log would show if an actual error occurred, and it invariably happens after 8-10 seconds, even though at that point it should still be in the middle of the script). I've read that I could just reprompt it, but I don't know how, and that would only buy another 8-10 seconds; the script usually takes about 25 seconds just to get the data from the internet (and then maybe a second to turn it into a dictionary).
I tried putting the getData function right after the intent that runs when the Alexa skill is first invoked, but it only runs when I initialize my local server, and it then holds on to the same data for every new Alexa session. Because the data changes frequently, I want it to run the function every time I start a new session for the skill.
So, I decided just to outsource the function that actually gets the data to another script, and make that other script run constantly in a loop. Here's the code I used.
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def getData():
    username = ''  # username hidden for anonymity
    password = ''  # password hidden for anonymity
    browser = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
    browser.get("https://gradebook.com")  # actual website name changed
    browser.find_element_by_name("username").clear()
    browser.find_element_by_name("username").send_keys(username)
    browser.find_element_by_name("password").clear()
    browser.find_element_by_name("password").send_keys(password)
    browser.find_element_by_name("password").send_keys(Keys.RETURN)

    global currentgrades
    currentgrades = []
    gradeids = ['2018202', '2018185', '2018223', '2018626', '2018473', '2018871', '2018886']
    for x in range(0, len(gradeids)):
        try:
            gradeurl = "https://www.gradebook.com/grades/"
            browser.get(gradeurl)
            grade = browser.find_element_by_id("currentStudentGrade[]").get_attribute('innerHTML').encode('utf8')[0:3]
            if grade[2] != "%":
                grade = browser.find_element_by_id("currentStudentGrade[]").get_attribute('innerHTML').encode('utf8')[0:4]
            if grade[1] == "%":
                grade = browser.find_element_by_id("currentStudentGrade[]").get_attribute('innerHTML').encode('utf8')[0:1]
            currentgrades.append(grade)
        except Exception:
            currentgrades.append('No assignments found')
            continue

    dictionary = {"class1": currentgrades[0], "class2": currentgrades[1], "class3": currentgrades[2], "class4": currentgrades[3], "class5": currentgrades[4], "class6": currentgrades[5], "class7": currentgrades[6]}
    return dictionary

def run():
    dictionary = getData()
    time.sleep(60)
That script runs constantly and does what I want, but then in my other script, I don't know how to just call the dictionary variable. When I use
from getdata.py import dictionary
in the Flask-Ask script, it just runs the loop and constantly gets the data. I just want the Flask-Ask script to take the variable defined in the run function and use it, without running any of the code in the getdata script, which has already run and gotten the correct data. If it matters, both scripts are running in Terminal on a MacBook.
Is there any way to do what I'm asking about, or are there any easier workarounds? Any and all help is appreciated!
It sounds like you want to import the function so you can run it, rather than importing the dictionary.
Try deleting the run function, and then in your other script use:
from getdata import getData
Then each time you call getData() it will run your code and return a new, up-to-date dictionary.
Is this what you were asking about?
This issue has been resolved.
As for the original question, I didn't figure out how to make it just import the dictionary instead of first running the function to generate the dictionary. Furthermore, I realized there had to be a more practical solution than constantly running a script like that, and even then not getting brand new data.
My solution was to make the script that gets the data start running at the same time as the launch function. Here was the final script for the first intent (the rest of it remained the same):
@ask.intent("start_skill")
def start_skill():
    welcome_message = 'What is the password?'
    thread = threading.Thread(target=getData, args=())
    thread.daemon = True
    thread.start()
    return question(welcome_message)

def getData():
    # script to get data here

# other intents and rest of script here
By design, the skill requested a numeric passcode to make sure I was the one using it before it was willing to read the data (which was probably pointless, but this skill is at least as much for my own educational reasons as for practical reasons, so, for the extra practice, I wanted this to have as many features as I could possibly justify). So, by the time you would actually be able to ask for the data, the script to get the data will have finished running (I have tested this and it seems to work without fail).
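One more note for anyone copying this approach: because getData runs on a background thread, its return value is discarded, so the rest of the skill has to read the result from shared state. A minimal sketch of how that could look; latest_grades, build_grades_dictionary and the "read_grades" intent name are hypothetical stand-ins, not part of the original skill:

latest_grades = None  # filled in by the background thread when getData() finishes

def getData():
    global latest_grades
    latest_grades = build_grades_dictionary()  # hypothetical helper: the Selenium scraping from the question

@ask.intent("read_grades")  # hypothetical intent that reads the data back
def read_grades():
    if latest_grades is None:
        return statement("I'm still fetching the data, please try again in a moment.")
    return statement("Your grade in class one is {}".format(latest_grades["class1"]))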
I have a Python script that pulls from various internal network sources. With how our systems are set up, we will initiate a urllib pull from a network location and it will get hung up waiting forever for a response on certain parts of the network. I would like my script to check whether it has finished the pull within, let's say, 5 minutes; if not, it should skip that address, attempt to pull from the next one, and record the failure to a bad-address log (so we can go check which systems get hung up). There are over 20,000 IP addresses we are checking, some of them running older scripts that no longer work but will still try to run when requested, and they never stop trying.
I'm familiar with having a script pause at a certain point:
import time
time.sleep(300)
What I'm thinking, from a pseudocode perspective (not proper Python, just illustrating the idea):
import time
import urllib2

url_dict = ['http://1', 'http://2', 'http://3', ...]
fail_log_path = 'C:/Temp/fail_log.txt'

for addresses in url_dict:
    clock_value = time.start()
    while clock_value <= 300:
        print str(clock_value)
        res = urllib2.retrieve(url)
    if res != []:
        pass
    else:
        fail_log = open(fail_log_path, 'a')
        fail_log.write("Failed to pull from site location: " + str(url) + "\n")
        faile_log.close
Update: a specific option for this when dealing with URLs: Timeout for urllib2.urlopen() in pre Python 2.6 versions
Found this answer which is more in line with the overall problem of my question:
kill a function after a certain time in windows
Your code as written doesn't seem to do what you describe. It seems you want the if/else check inside your while loop. On top of that, you would want to loop over the IP addresses and not over a time period as your code is currently written (otherwise you will keep requesting the same address every time). Instead of keeping track of time yourself, I would suggest reading up on urllib.request.urlopen, specifically the timeout parameter (urllib2.urlopen in Python 2 takes the same argument). Once set, that call will raise a socket.timeout exception when the time limit is reached. Surround it with a try/except block catching that error and then handle it appropriately.
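A rough sketch of that approach against the pseudocode above, written for Python 2 since that is what the question uses (the URL list and log path are the placeholders from the question):

import socket
import urllib2

url_list = ['http://1', 'http://2', 'http://3']
fail_log_path = 'C:/Temp/fail_log.txt'

for url in url_list:
    try:
        res = urllib2.urlopen(url, timeout=300)  # give up after 5 minutes
        data = res.read()
    except (socket.timeout, urllib2.URLError) as e:
        # a timeout can surface as socket.timeout or be wrapped in URLError
        with open(fail_log_path, 'a') as fail_log:
            fail_log.write("Failed to pull from site location: " + str(url) + " (" + str(e) + ")\n")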
I am having some trouble with using a parallel version of map (ppmap wrapper, implementation by Kirk Strauser).
The function I am trying to run in parallel runs a simple regular expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.
If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze: there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but not using any CPU anymore).
e.g.
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
Furthermore, the workers do not seem to freeze on any particular data entry; if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and restarting the program multiple times.)
Is there any way to see where the problem is?
Sample of the code that I am running:
def analyse_repeats(data):
    """
    Loads whole proteome in memory and then looks for repeats in sequences,
    flags both real repeats and sequences not containing particular aminoacid
    """
    (organism, organism_id, filename) = data
    import re
    letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
    try:
        handle = open(filename)
        data = Bio.SeqIO.parse(handle, "fasta")
        records = [record for record in data]
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for letter in letters:
                items = set(re.compile("(%s+)" % tuple(([letter] * 1))).findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id, len(item), uniprot_id, letter))
                else:
                    # letter not present in the string, "zero" repeat
                    store_records.append((organism_id, 0, uniprot_id, letter))
        handle.close()
        return (organism, store_records)
    except IOError as e:
        print e
        return (organism, [])
res_generator = ppmap.ppmap(
    None,
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)

for res in res_generator:
    # process the output
If I use simple map instead of the ppmap, everything works fine:
res_generator = map(
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages. It also works very well.
By default, it uses as many worker processes as your computer has cores, but you can specify a higher number as well.
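For example, a minimal sketch of swapping ppmap for multiprocessing.Pool around the existing analyse_repeats call (the pool is closed explicitly since this looks like Python 2.7, where Pool is not a context manager):

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()  # defaults to one worker per CPU core; pass a number to override
    try:
        results = pool.map(analyse_repeats,
                           zip(todo_list, organism_ids, filenames))
    finally:
        pool.close()
        pool.join()

    for organism, store_records in results:
        pass  # process the output as before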
May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but hopefully it helps.