I want to use different API keys for data scraping each time my program is run.
For instance, I have the following 2 keys:
apiKey1 = "123abc"
apiKey2 = "345def"
and the following URL:
myUrl = http://myurl.com/key=...
When the program is run, I would like myUrl to be using apiKey1. Once it is run again, I would then like it to use apiKey2 and so forth... i.e:
First Run:
url = "http://myurl.com/key=" + apiKey1
Second Run:
url = "http://myurl.com/key=" + apiKey2
Sorry if this doesn't make sense, but does anyone know a way to do this? I have no idea.
EDIT:
To avoid confusion, I've had a look at this answer, but it doesn't answer my query. My goal is to cycle through the variables between executions of my script.
I would use a persistent dictionary (it's like a database but more lightweight). That way you can easily store the options and the one to visit next.
There's already a library in the standard library that provides such a persistent dictionary: shelve:
import shelve

filename = 'target.shelve'

def get_next_target():
    with shelve.open(filename) as db:
        if not db:
            # Not created yet, initialize it:
            db['current'] = 0
            db['options'] = ["123abc", "345def"]
        # Get the current option
        nxt = db['options'][db['current']]
        db['current'] = (db['current'] + 1) % len(db['options'])  # increment with wraparound
        return nxt
Each call to get_next_target() will return the next option, whether you call it several times in the same execution or once per execution.
The logic could be simplified if you never have more than 2 options:
db['current'] = 0 if db['current'] == 1 else 1
But I thought it might be worthwhile to have a way that can easily handle multiple options.
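For example, in the script from the question it might be used like this (just a sketch; myUrl is the base URL from the question):

myUrl = "http://myurl.com/key="

url = myUrl + get_next_target()
print(url)  # "123abc" is appended on the first run, "345def" on the next, and so on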
Here is an example of how you can do it with automatic file creation if no such file exists:
import os

if not os.path.exists('Checker.txt'):
    # The file doesn't exist yet, so create it and initialize it with '0'.
    # If the file already exists, nothing happens here.
    with open('Checker.txt', 'w') as f:
        f.write('0')

myUrl = 'http://myurl.com/key='
apiKeys = ["123abc", "345def"]

with open('Checker.txt', 'r') as f:
    data = int(f.read())  # read the contents and turn them into an int

myUrl = myUrl + apiKeys[data]  # pick the apiKey by index

with open('Checker.txt', 'w') as f:
    # rewrite the file, swapping the value
    if data == 1:
        f.write('0')
    else:
        f.write('1')
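If you ever need more than two keys, the same file-based approach generalizes by storing an index and advancing it with wraparound (a sketch along the same lines, reusing apiKeys from above):

with open('Checker.txt', 'r') as f:
    index = int(f.read())

myUrl = 'http://myurl.com/key=' + apiKeys[index]

with open('Checker.txt', 'w') as f:
    # advance to the next key, wrapping around at the end of the list
    f.write(str((index + 1) % len(apiKeys)))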
I would rely on an external process to hold which key was used last time, or, even simpler, I would count executions of the script and use one key when the execution count is odd and the other when it is even.
So I would introduce something like redis, which will also help a lot with other (future?) features you may want to add to your project. redis is one of those tools that always gives benefits at almost no cost; it's very practical to be able to rely on external permanent storage, and it can serve many purposes.
So here is how I would do it:
first make sure redis-server is running (it can be started automatically as a daemon, depending on your system)
install the Python redis module
then, here is some Python code for inspiration:
import redis

db = redis.Redis()

if db.hincrby('execution', 'count', 1) % 2:
    key = apiKey1
else:
    key = apiKey2
That's it!
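If you later add more keys, the same counter can index into a list (a sketch building on the snippet above; apiKeys is just the list of keys from the question):

apiKeys = ["123abc", "345def"]

count = db.hincrby('execution', 'count', 1)
key = apiKeys[count % len(apiKeys)]
url = "http://myurl.com/key=" + key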
I use arango-orm (which uses python-arango in the background) in my Python/ArangoDB back-end. I have set up a small testing util that uses a remote database to insert test data, execute the unit tests and remove the test data again.
I insert my test data with a Python for loop. Each iteration, a small piece of information changes based on a generic object, and then I insert that modified generic object into ArangoDB until I have 10 test objects. However, after that code is run, my test assertions tell me I don't have 10 objects stored inside my db, but only 8 (or sometimes 3, 7 or 9). It looks like python-arango runs these queries asynchronously, or ArangoDB already replies with an OK before the data is actually inserted. Does anyone have an idea of what is going on? When I put in a sleep of 1 second after all data is inserted, my tests run green. This obviously is no solution.
This is a little piece of example code I use:
def load_test_data(self) -> None:
    # This method is called from the setUp() method.
    logging.info("Loading test data...")
    for i in range(1, 11):
        # insertion with data object (ORM)
        user = test_utils.get_default_test_user()
        user.id = i
        user.username += str(i)
        user.name += str(i)
        db.add(user)

        # insertion with dictionary
        project = test_utils.get_default_test_project()
        project['id'] = i
        project['name'] += str(i)
        project['description'] = f"Description for project with the id {i}"
        db.insert_document("projects", project)

    # TODO: solve this dirty hack
    sleep(1)

def test_search_by_user_username(self) -> None:
    actual = dao.search("TestUser3")
    self.assertEqual(1, len(actual))
    self.assertEqual(3, actual[0].id)
Then my db is created like this in a separate module:
client = ArangoClient(hosts=f"http://{arango_host}:{arango_port}")
test_db = client.db(arango_db, arango_user, arango_password)
db = Database(test_db)
EDIT:
I had not set the sync property to true upon collection creation, but after changing the collection and setting it to true, the behaviour stays exactly the same.
After getting in touch with the people at ArangoDB, I learned that views are not updated as quickly as collections. They have given me an internal SEARCH option which also waits for syncing views. Since it's an internal option, they highly discourage using it outside of unit tests. In my case, I only use it for unit testing.
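For reference, the option in question is passed on a SEARCH in AQL; a sketch of how that might look with python-arango (the view name users_view and the field names here are illustrative, not my real schema):

# Sketch: run a search that waits for the ArangoSearch view to sync first
# (test-only usage, as discouraged above; view/field names are illustrative)
cursor = test_db.aql.execute(
    """
    FOR u IN users_view
      SEARCH u.username == @username OPTIONS { waitForSync: true }
      RETURN u
    """,
    bind_vars={"username": "TestUser3"},
)
results = list(cursor)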
I'm trying to combine two separate scripts into one program. I need to pass one string from the first part to the second part.
First:
import boto3

if __name__ == "__main__":

    bucket = 'BUCKET-NAME'
    collectionId = 'COLLECTION-ID'
    fileName = 'input.jpg'
    threshold = 70
    maxFaces = 1

    client = boto3.client('rekognition')

    response = client.search_faces_by_image(CollectionId=collectionId,
                                            Image={'S3Object': {'Bucket': bucket, 'Name': fileName}},
                                            FaceMatchThreshold=threshold,
                                            MaxFaces=maxFaces)

    faceMatches = response['FaceMatches']
    for match in faceMatches:
        print(match['Face']['FaceId'])
Second:
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('faces')

response = table.scan(
    FilterExpression=Attr('faceid').eq('FaceId')
)
items = response['Items']
print(items)
I need to pass the ID printed by print(match['Face']['FaceId']) in the first code to FaceId in the second code.
I tried to define a variable, put a value into it, and then read it later, but I could not get it to work.
Typically, you'd write your first block of code as a library/module with a function that does some unit of work and returns the result. Then the second block of code would import the first and call the function.
# lib.py
def SomeFunction(inputs):
    output = doSomething(inputs)
    return output

# main.py
import lib

data = ...
result = lib.SomeFunction(data)
moreWork(result)
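For instance, applying that pattern to the code in the question could look roughly like this (a sketch; the module name rekognition_lib and the helper find_face_id are names I made up for illustration):

# rekognition_lib.py -- the first script refactored into a function
import boto3

def find_face_id(bucket, collection_id, file_name, threshold=70, max_faces=1):
    client = boto3.client('rekognition')
    response = client.search_faces_by_image(
        CollectionId=collection_id,
        Image={'S3Object': {'Bucket': bucket, 'Name': file_name}},
        FaceMatchThreshold=threshold,
        MaxFaces=max_faces)
    # Return the first matched FaceId, or None if there was no match
    matches = response['FaceMatches']
    return matches[0]['Face']['FaceId'] if matches else None

# main.py -- the second script imports the function and uses its result
import boto3
from boto3.dynamodb.conditions import Attr
import rekognition_lib

face_id = rekognition_lib.find_face_id('BUCKET-NAME', 'COLLECTION-ID', 'input.jpg')
if face_id is not None:
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('faces')
    response = table.scan(FilterExpression=Attr('faceid').eq(face_id))
    print(response['Items'])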
If you want two separate programs that run independently and share data, you want Inter-process communication. You can get processes to share information with each other via: a file/fifo in the filesystem; a network socket; shared memory; and STDIO (and probably more). However, IPC is definitely more work than synchronous library calls.
I've written a simple Python script to log in to a forum, in order to keep the session alive and accumulate online time. The code is as follows:
logPara = {'username':user,'password':pwd}
s = requests.Session()
s.post(forumUrl,data=logPara)
homePage = requests.get(pageUrl)
I can get the correct homePage and am sure the login is successful. But I'm curious: how long will this Session() last? If my program only contains these four lines, will the Session() close, so that the online status is lost?
Yes, the session will definitely be lost.
So you have two options for making the session last longer. One is the answer posted by @Seekheart. The second is to save the session state to a file using Python's pickle and load it again when needed. But this will also depend on cookie expiration etc.
This is how you can do it.
When making the session request:
import pickle
import requests

logPara = {'username': user, 'password': pwd}
s = requests.Session()
s.post(forumUrl, data=logPara)
homePage = s.get(pageUrl)  # use the session so the login cookies are sent

# pickle needs a binary file handle
with open('temp.dat', 'wb') as f:
    pickle.dump(s, f)
When you want to get the state back later:
import pickle

# again, open in binary mode for pickle
with open('temp.dat', 'rb') as f:
    s = pickle.load(f)
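An alternative along the same lines is to pickle only the session's cookie jar rather than the whole Session object (a sketch; the file name cookies.dat is arbitrary):

import pickle
import requests

# after logging in: persist just the cookies
with open('cookies.dat', 'wb') as f:
    pickle.dump(s.cookies, f)

# in a later run: start a fresh session and restore the cookies
s = requests.Session()
with open('cookies.dat', 'rb') as f:
    s.cookies.update(pickle.load(f))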
When you run the script, unless it is told to keep running endlessly or until a certain condition is met, it will terminate almost immediately. So your script ends as soon as it has run. To keep it running, you can put your code in a loop, for example:
while 1:
    # Run your code
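For example, a minimal keep-alive loop might look like this (a sketch; the 300-second interval is an arbitrary choice, pick whatever the forum expects):

import time
import requests

logPara = {'username': user, 'password': pwd}
s = requests.Session()
s.post(forumUrl, data=logPara)

while True:
    homePage = s.get(pageUrl)  # re-use the logged-in session
    time.sleep(300)            # wait 5 minutes between requests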
I am having some trouble with using a parallel version of map (ppmap wrapper, implementation by Kirk Strauser).
The function I am trying to run in parallel runs a simple regular expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.
If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze: there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but not using any CPU anymore).
e.g.
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
Furthermore, the workers do not seem to freeze on any particular data entry; if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and re-starting the program multiple times.)
Is there any way to see where the problem is?
Sample of the code that I am running:
def analyse_repeats(data):
    """
    Loads whole proteome in memory and then looks for repeats in sequences,
    flags both real repeats and sequences not containing particular aminoacid
    """
    (organism, organism_id, filename) = data
    import re
    letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
    try:
        handle = open(filename)
        data = Bio.SeqIO.parse(handle, "fasta")
        records = [record for record in data]
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for letter in letters:
                items = set(re.compile("(%s+)" % tuple(([letter] * 1))).findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id, len(item), uniprot_id, letter))
                else:
                    # letter not present in the string, "zero" repeat
                    store_records.append((organism_id, 0, uniprot_id, letter))
        handle.close()
        return (organism, store_records)
    except IOError as e:
        print e
        return (organism, [])
res_generator = ppmap.ppmap(
    None,
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)

for res in res_generator:
    # process the output
If I use simple map instead of the ppmap, everything works fine:
res_generator = map(
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages. It also works very well.
By default, it uses as many worker processes as your computer has cores, but you can specify a higher number as well.
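As a rough illustration, the ppmap call from the question could be swapped for something like this (a sketch; analyse_repeats, todo_list, organism_ids and filenames are assumed to be defined as above):

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()  # defaults to one worker per CPU core; pass a number to override
    results = pool.map(analyse_repeats,
                       zip(todo_list, organism_ids, filenames))
    pool.close()
    pool.join()
    for organism, store_records in results:
        # process the output, as before
        pass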
May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but hopefully it helps you.
Ok, here's my scenario (be kind, I've only been using Python for a short while):
I have a service I'm calling and need to run several iterations of the same test with a different variable passed to the method. I am able to run iterations against a single method just fine, but I need the variable to change for each test, without counting the call that fetches a random variable as an iteration. I'm probably going about this the wrong way, but I'd love any help I can get.
Here's my code thus far:
data = ""
class MyTestWorkFlow:
global data
def Data(self):
low = 1
high = 1000
pid = random.randrange(low,high)
data = linecache.getline('c:/tmp/testData.csv', pid)
def Run(self):
client = Client(wsdl)
result = client.service.LookupData(data)
f = open('/tmp/content','w')
f.write (str(result))
f.close()
f = open('/tmp/content','r')
for i in f:
print i
f.close()
test = MyTestWorkFlow()
for i in range(1,2):
test.Run()
There's a lot we could talk about regarding automated testing in Python, but the problem here is that you don't seem to be invoking your Data method.
If you change your code like this:
def Run(self):
    self.Data()
    client = Client(wsdl)
    ...
does it do what you need?
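For what it's worth, a slightly tidier variant would be to have Data return the value instead of relying on the module-level global (just a sketch, keeping the rest of the structure from your question):

class MyTestWorkFlow:
    def Data(self):
        low = 1
        high = 1000
        pid = random.randrange(low, high)
        return linecache.getline('c:/tmp/testData.csv', pid)

    def Run(self):
        data = self.Data()  # fetch a fresh random line for this test run
        client = Client(wsdl)
        result = client.service.LookupData(data)
        # ... write/print the result as before ...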