Using the shelve module has given me some surprising behavior. keys(), iter(), and iteritems() don't return all the entries in the shelf! Here's the code:
cache = shelve.open('my.cache')
# ...
cache[url] = (datetime.datetime.today(), value)
later:
cache = shelve.open('my.cache')
urls = ['accounts_with_transactions.xml', 'targets.xml', 'profile.xml']
try:
    print list(cache.keys()) # doesn't return all the keys!
    print [url for url in urls if cache.has_key(url)]
    print list(cache.keys())
finally:
    cache.close()
and here's the output:
['targets.xml']
['accounts_with_transactions.xml', 'targets.xml']
['targets.xml', 'accounts_with_transactions.xml']
Has anyone run into this before, and is there a workaround without knowing all possible cache keys a priori?
According to the Python library reference:
...The database is also (unfortunately) subject to the limitations of dbm, if it is used — this means that (the pickled representation of) the objects stored in the database should be fairly small...
This correctly reproduces the 'bug':
import shelve
a = 'trxns.xml'
b = 'foobar.xml'
c = 'profile.xml'
urls = [a, b, c]
cache = shelve.open('my.cache', 'c')
try:
    cache[a] = a*1000
    cache[b] = b*10000
finally:
    cache.close()

cache = shelve.open('my.cache', 'c')
try:
    print cache.keys()
    print [url for url in urls if cache.has_key(url)]
    print cache.keys()
finally:
    cache.close()
with the output:
[]
['trxns.xml', 'foobar.xml']
['foobar.xml', 'trxns.xml']
The answer, therefore, is: don't store anything big (like raw XML) in a shelf, but rather the results of your calculations.
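As a rough illustration of that workaround (the helper and values here are made up, not from the question), parse the XML first and shelve only the small derived result:

import shelve

def extract_summary(raw_xml):
    # hypothetical parsing step: boil the large XML document down to the
    # handful of values you actually need later
    return {'length': len(raw_xml)}

cache = shelve.open('my.cache', 'c')
try:
    # store only the small computed result, not the raw XML payload
    cache['trxns.xml'] = extract_summary('<xml/>' * 10000)
finally:
    cache.close()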
Seeing your examples, my first thought is that cache.has_key() has side effects, i.e. this call will add keys to the cache. What do you get for
print cache.has_key('xxx')
print list(cache.keys())
I'm fairly new to Python and web scraping in general. The code below works, but it seems to be awfully slow for the amount of information it's actually going through. Is there any way to easily cut down on execution time? I'm not sure, but it does seem like I have typed out more, and made it more difficult, than I actually needed to; any help would be appreciated.
Currently the code starts at the sitemap, then iterates through a list of additional sitemaps. Within the new sitemaps it pulls information to construct a URL for the JSON data of a webpage. From the JSON data I pull an XML link that I use to search for a string. If the string is found, it appends it to a text file.
import io
import requests
from bs4 import BeautifulSoup

# global variables
start = 'https://www.govinfo.gov/wssearch/getContentDetail?packageId='
dash = '-'
urlSitemap = "https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml"

old_xml = requests.get(urlSitemap)
print(old_xml)
new_xml = io.BytesIO(old_xml.content).read()
final_xml = BeautifulSoup(new_xml)
linkToBeFound = final_xml.findAll('loc')
for loc in linkToBeFound:
    urlPLmap = loc.text
    old_xmlPLmap = requests.get(urlPLmap)
    print(old_xmlPLmap)
    new_xmlPLmap = io.BytesIO(old_xmlPLmap.content).read()
    final_xmlPLmap = BeautifulSoup(new_xmlPLmap)
    linkToBeFound2 = final_xmlPLmap.findAll('loc')
    for pls in linkToBeFound2:
        argh = pls.text.find('PLAW')
        theWanted = pls.text[argh:]
        thisShallWork = eval(requests.get(start + theWanted).text)
        print(requests.get(start + theWanted))
        dict1 = (thisShallWork['download'])
        finaldict = (dict1['modslink'])[2:]
        print(finaldict)
        url2 = 'https://' + finaldict
        try:
            old_xml4 = requests.get(url2)
            print(old_xml4)
            new_xml4 = io.BytesIO(old_xml4.content).read()
            final_xml4 = BeautifulSoup(new_xml4)
            references = final_xml4.findAll('identifier', {'type': 'Statute citation'})
            for sec in references:
                if sec.text == "106 Stat. 4845":
                    print(dash * 20)
                    print(sec.text)
                    print(dash * 20)
                    sec313 = open('sec313info.txt', 'a')
                    sec313.write("\n")
                    sec313.write(pls.text + '\n')
                    sec313.close()
        except:
            print('error at: ' + url2)
No idea why I spent so long on this, but I did. Your code was really hard to look through, so I started with that: I broke it up into two parts, getting the links from the sitemaps and then everything else, and broke a few bits out into separate functions too.
This is checking about 2 URLs per second on my machine, which seems about right.
How this is better (you can argue with me about this part):
You don't have to reopen and close the output file after each write.
Removed a fair bit of unneeded code.
Gave your variables better names (this does not improve speed in any way, but please do this, especially if you are asking for help with it).
Really the main thing: once you break it all up, it becomes fairly clear that what's slowing you down is waiting on the requests, which is pretty standard for web scraping. You can look into multithreading to avoid the wait (see the sketch after the code below). Once you get into multithreading, the benefit of breaking up your code will likely also become much more evident.
import requests
from bs4 import BeautifulSoup

# returns sitemap links
def get_links(s):
    old_xml = requests.get(s)
    new_xml = old_xml.text
    final_xml = BeautifulSoup(new_xml, "lxml")
    return final_xml.findAll('loc')

# gets the final url from your middle url and looks through it for the thing you are looking for
def scrapey(link):
    link_id = link[link.find("PLAW"):]
    r = requests.get('https://www.govinfo.gov/wssearch/getContentDetail?packageId={}'.format(link_id))
    print(r.url)
    try:
        r = requests.get("https://{}".format(r.json()["download"]["modslink"][2:]))
        print(r.url)
        soup = BeautifulSoup(r.text, "lxml")
        references = soup.findAll('identifier', {'type': 'Statute citation'})
        for ref in references:
            if ref.text == "106 Stat. 4845":
                return r.url
        return False
    except:
        print("bah" + r.url)
        return False

sitemap_links_el = get_links("https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml")
sitemap_links = map(lambda x: x.text, sitemap_links_el)
nlinks_el = map(get_links, sitemap_links)
links = [num.text for elem in nlinks_el for num in elem]

with open("output.txt", "a") as f:
    for link in links:
        url = scrapey(link)
        if url is False:
            print("no find")
        else:
            print("found on: {}".format(url))
            f.write("{}\n".format(url))
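As a rough sketch of the multithreading suggestion mentioned above (my addition, not part of the original answer; it reuses the scrapey function and links list defined in the code), a thread pool lets several requests wait on the network at the same time:

from concurrent.futures import ThreadPoolExecutor

# run scrapey over all links with a small pool of worker threads so the
# per-request network wait overlaps instead of adding up
with open("output.txt", "a") as f:
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url in pool.map(scrapey, links):
            if url is False:
                print("no find")
            else:
                print("found on: {}".format(url))
                f.write("{}\n".format(url))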
I'm working on a small project of retrieving information about books from the Google Books API using Python 3. For this I make a call to the API, read out the variables, and store those in a list. For a search like "linkedin" this works perfectly. However, when I enter "Google", it reads the second title from the JSON input. How can this happen?
Please find my code below (Google_Results is the class I use to initialize the variables):
import requests

def Book_Search(search_term):
    parms = {"q": search_term, "maxResults": 3}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    print(r.url)
    results = r.json()

    i = 0
    for result in results["items"]:
        try:
            isbn13 = str(result["volumeInfo"]["industryIdentifiers"][0]["identifier"])
            isbn10 = str(result["volumeInfo"]["industryIdentifiers"][1]["identifier"])
            title = str(result["volumeInfo"]["title"])
            author = str(result["volumeInfo"]["authors"])[2:-2]
            publisher = str(result["volumeInfo"]["publisher"])
            published_date = str(result["volumeInfo"]["publishedDate"])
            description = str(result["volumeInfo"]["description"])
            pages = str(result["volumeInfo"]["pageCount"])
            genre = str(result["volumeInfo"]["categories"])[2:-2]
            language = str(result["volumeInfo"]["language"])
            image_link = str(result["volumeInfo"]["imageLinks"]["thumbnail"])
            dict = Google_Results(isbn13, isbn10, title, author, publisher, published_date, description, pages, genre,
                                  language, image_link)
            gr.append(dict)
            print(gr[i].title)
            i += 1
        except:
            pass
    return

gr = []
Book_Search("Linkedin")
I am a beginner to Python, so any help would be appreciated!
It does so because there is no publisher entry in volumeInfo of the first entry, thus it raises a KeyError and your except captures it. If you're going to work with fuzzy data you have to account for the fact that it will not always have the expected structure. For simple cases you can rely on dict.get() and its default argument to return a 'valid' default entry if an entry is missing.
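For instance (a generic illustration with made-up data, not from the question's response):

volume_info = {"title": "Some Book"}  # no "publisher" key present
# volume_info["publisher"] would raise a KeyError here
publisher = volume_info.get("publisher", "Unknown")  # returns "Unknown" instead
print(publisher)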
Also, there are a few conceptual problems with your function: it relies on a global gr, which is bad design; it shadows the built-in dict type; and it captures all exceptions, guaranteeing that you cannot exit your code even with a SIGINT... I'd suggest converting it to something a bit more sane:
import requests

def book_search(search_term, max_results=3):
    results = []  # a list to store the results
    parms = {"q": search_term, "maxResults": max_results}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    try:  # just in case the server doesn't return valid JSON
        for result in r.json().get("items", []):
            if "volumeInfo" not in result:  # invalid entry - missing volumeInfo
                continue
            result_dict = {}  # a dictionary to store our discovered fields
            result = result["volumeInfo"]  # all the data we're interested in is in volumeInfo
            isbns = result.get("industryIdentifiers", None)  # capture ISBNs
            if isinstance(isbns, list) and isbns:
                for i, t in enumerate(("isbn10", "isbn13")):
                    if len(isbns) > i and isinstance(isbns[i], dict):
                        result_dict[t] = isbns[i].get("identifier", None)
            result_dict["title"] = result.get("title", None)
            authors = result.get("authors", None)  # capture authors
            if isinstance(authors, list) and len(authors) > 2:  # you're slicing from 2
                result_dict["author"] = str(authors[2:-2])
            result_dict["publisher"] = result.get("publisher", None)
            result_dict["published_date"] = result.get("publishedDate", None)
            result_dict["description"] = result.get("description", None)
            result_dict["pages"] = result.get("pageCount", None)
            genres = result.get("categories", None)  # capture genres
            if isinstance(genres, list) and len(genres) > 2:  # since you're slicing from 2
                result_dict["genre"] = str(genres[2:-2])
            result_dict["language"] = result.get("language", None)
            result_dict["image_link"] = result.get("imageLinks", {}).get("thumbnail", None)
            # make sure Google_Results accepts keyword arguments like title, author...
            # and make them optional as they might not be in the returned result
            gr = Google_Results(**result_dict)
            results.append(gr)  # add it to the results list
    except ValueError:
        return None  # invalid response returned, you may raise an error instead
    return results  # return the results
Then you can easily retrieve as much info as possible for a term:
gr = book_search("Google")
And it will be far more tolerant of data omissions, provided that your Google_Results type makes most of the entries optional.
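For illustration only (the original Google_Results class isn't shown in the question), a minimal version where every field is an optional keyword argument could look like this:

class Google_Results:
    # a minimal sketch: every field is optional, so missing JSON entries
    # simply stay None instead of raising an error
    def __init__(self, isbn13=None, isbn10=None, title=None, author=None,
                 publisher=None, published_date=None, description=None,
                 pages=None, genre=None, language=None, image_link=None):
        self.isbn13 = isbn13
        self.isbn10 = isbn10
        self.title = title
        self.author = author
        self.publisher = publisher
        self.published_date = published_date
        self.description = description
        self.pages = pages
        self.genre = genre
        self.language = language
        self.image_link = image_link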
Following #Coldspeed's recommendation it became clear that missing information in the JSON caused an exception to be raised. Since I only had a "pass" statement there, it skipped the entire result. Therefore I will have to adapt the try/except statements so errors are handled properly.
Thanks for the help guys!
Example - I have the following dictionary...
URLDict = {'OTX2': 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=OTX2&action=view_all',
           'RAB3GAP': 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=RAB3GAP1&action=view_all',
           'SOX2': 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=SOX2&action=view_all',
           'STRA6': 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=STRA6&action=view_all',
           'MLYCD': 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=MLYCD&action=view_all'}
I would like to use urllib to call each URL in a for loop; how can this be done?
I have successfully done this with the URLs in a list format like this...
OTX2 = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=OTX2&action=view_all'
RAB3GAP = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=RAB3GAP1&action=view_all'
SOX2 = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=SOX2&action=view_all'
STRA6 = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=STRA6&action=view_all'
MLYCD = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=MLYCD&action=view_all'
URLList = [OTX2,RAB3GAP,SOX2,STRA6,PAX6,MLYCD]
for URL in URLList:
    sourcepage = urllib.urlopen(URL)
    sourcetext = sourcepage.read()
but I want to also be able to print the key later when returning data. Using the list format the key would be a variable, so I would not be able to access it for printing; I would only be able to print the value.
Thanks for any help.
Tom
Have you tried (as a simple example):
for key, value in URLDict.iteritems():
    print key, value
Doesn't look like a dictionary is even necessary.
dbs = ['OTX2', 'RAB3GAP', 'SOX2', 'STRA6', 'PAX6', 'MLYCD']
urlbase = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=%s&action=view_all'
for db in dbs:
    sourcepage = urllib.urlopen(urlbase % db)
    sourcetext = sourcepage.read()
I would go about it like this:
for url_key in URLDict:
    URL = URLDict[url_key]
    sourcepage = urllib.urlopen(URL)
    sourcetext = sourcepage.read()
The URL is obviously URLDict[url_key], and you can retain the key value within the name url_key. For example:
print url_key
On the first iteration this will print OTX2.
I'm trying to fetch results in a Python 2.7 App Engine app using cursors, but each time I use with_cursor() it fetches the same result set.
query = Model.all().filter("profile =", p_key).order('-created')
if r.get('cursor'):
query = query.with_cursor(start_cursor = r.get('cursor'))
cursor = query.cursor()
objs = query.fetch(limit=10)
count = len(objs)
for obj in objs:
    ...
Each time through I'm getting the same 10 results. I'm thinking it has to do with using end_cursor, but how do I get that value if query.cursor() returns the start_cursor? I've looked through the docs, but this is poorly documented.
Your formatting is a bit screwy, by the way. Looking at your code (which is incomplete and therefore potentially leaving something out), I have to assume you have forgotten to store the cursor after fetching results (or return it to the user; I am assuming r is a request?).
So after you have fetched some data you need to call cursor() on the query. For example, this function counts all entities using a cursor.
def count_entities(kind):
    c = None
    count = 0
    q = kind.all(keys_only=True)
    while True:
        if c:
            q.with_cursor(c)
        i = q.fetch(1000)
        count = count + len(i)
        if not i:
            break
        c = q.cursor()
    return count
See how, after fetch() has been called, c = q.cursor() grabs the cursor, and it is used as the start cursor the next time through the loop.
Here's what finally worked:
query = Model.all().filter("profile =", p_key).order('-created')
if request.get('cursor'):
query = query.with_cursor(request.get('cursor'))
objs = query.fetch(limit=10)
cursor = query.cursor()
for obj in objs:
    ...
I am trying to update an atomic counter with Python Boto 2.3.0, but can find no documentation for the operation.
It seems there is no direct interface, so I tried to go to "raw" updates using the layer1 interface, but I was unable to complete even a simple update.
I tried the following variations, but all with no luck:
dynoConn.update_item(INFLUENCER_DATA_TABLE,
                     {'HashKeyElement': "9f08b4f5-d25a-4950-a948-0381c34aed1c"},
                     {'new': {'Value': {'N': "1"}, 'Action': "ADD"}})

dynoConn.update_item('influencer_data',
                     {'HashKeyElement': "9f08b4f5-d25a-4950-a948-0381c34aed1c"},
                     {'new': {'S': 'hello'}})

dynoConn.update_item("influencer_data",
                     {"HashKeyElement": "9f08b4f5-d25a-4950-a948-0381c34aed1c"},
                     {"AttributesToPut": {"new": {"S": "hello"}}})
They all produce the same error:
File "/usr/local/lib/python2.6/dist-packages/boto-2.3.0-py2.6.egg/boto/dynamodb/layer1.py", line 164, in _retry_handler
data)
boto.exception.DynamoDBResponseError: DynamoDBResponseError: 400 Bad Request
{u'Message': u'Expected null', u'__type': u'com.amazon.coral.service#SerializationException'}
I also investigated the API docs here but they were pretty spartan.
I have done a lot of searching and fiddling, and the only thing I have left is to use the PHP API and dive into the code to find where it "formats" the JSON body, but that is a bit of a pain. Please save me from that pain!
Sorry, I misunderstood what you were looking for. You can accomplish this via layer2 although there is a small bug that needs to be addressed. Here's some Layer2 code:
>>> import boto
>>> c = boto.connect_dynamodb()
>>> t = c.get_table('counter')
>>> item = t.get_item('counter')
>>> item
{u'id': 'counter', u'n': 1}
>>> item.add_attribute('n', 20)
>>> item.save()
{u'ConsumedCapacityUnits': 1.0}
>>> item # Here's the bug, local Item is not updated
{u'id': 'counter', u'n': 1}
>>> item = t.get_item('counter') # Refetch item just to verify change occurred
>>> item
{u'id': 'counter', u'n': 21}
This results in the same over-the-wire request as you are performing in your Layer1 code, as shown by the following debug output.
2012-04-27 04:17:59,170 foo [DEBUG]:StringToSign:
POST
/
host:dynamodb.us-east-1.amazonaws.com
x-amz-date:Fri, 27 Apr 2012 11:17:59 GMT
x-amz-security-token:<removed>==
x-amz-target:DynamoDB_20111205.UpdateItem
{"AttributeUpdates": {"n": {"Action": "ADD", "Value": {"N": "20"}}}, "TableName": "counter", "Key": {"HashKeyElement": {"S": "counter"}}}
If you want to avoid the initial GetItem call, you could do this instead:
>>> import boto
>>> c = boto.connect_dynamodb()
>>> t = c.get_table('counter')
>>> item = t.new_item('counter')
>>> item.add_attribute('n', 20)
>>> item.save()
{u'ConsumedCapacityUnits': 1.0}
Which will update the item if it already exists or create it if it doesn't yet exist.
For those looking for the answer I have found it.
First IMPORTANT NOTE, I am currently unaware of what is going on BUT for the moment, to get a layer1 instance I have had to do the following:
import boto

AWS_ACCESS_KEY = 'XXXXX'
AWS_SECRET_KEY = 'YYYYY'

dynoConn = boto.connect_dynamodb(AWS_ACCESS_KEY, AWS_SECRET_KEY)
dynoConnLayer1 = boto.dynamodb.layer1.Layer1(AWS_ACCESS_KEY, AWS_SECRET_KEY)
Essentially instantiating a layer2 FIRST and THEN a layer 1.
Maybe I'm doing something stupid, but at this point I'm just happy to have it working....
I'll sort the details later. THEN... to actually do the atomic update call:
dynoConnLayer1.update_item("influencer_data",
                           {"HashKeyElement": {"S": "9f08b4f5-d25a-4950-a948-0381c34aed1c"}},
                           {"direct_influence":
                               {"Action": "ADD", "Value": {"N": "20"}}
                           })
Note that in the example above Dynamo will ADD 20 to whatever the current value is, and this operation is atomic, meaning other operations happening at the "same time" will be correctly "scheduled" to happen either after the new value has been established as +20 or before this operation is executed. Either way the desired effect will be accomplished.
Be certain to do this on the layer1 connection instance, as layer2 will throw errors given it expects a different set of parameter types.
That's all there is to it!!!! Just so folks know, I figured this out using the PHP SDK. It takes a very short time to install and set up, AND THEN when you do a call, the debug data will actually show you the format of the HTTP request body, so you will be able to copy/model your layer1 parameters after the example. Here is the code I used to do the atomic update in PHP:
<?php
// Instantiate the class
$dynamodb = new AmazonDynamoDB();

$update_response = $dynamodb->update_item(array(
    'TableName' => 'influencer_data',
    'Key' => array(
        'HashKeyElement' => array(
            AmazonDynamoDB::TYPE_STRING => '9f08b4f5-d25a-4950-a948-0381c34aed1c'
        )
    ),
    'AttributeUpdates' => array(
        'direct_influence' => array(
            'Action' => AmazonDynamoDB::ACTION_ADD,
            'Value' => array(
                AmazonDynamoDB::TYPE_NUMBER => '20'
            )
        )
    )
));

// status code 200 indicates success
print_r($update_response);
?>
Hopefully this will help others until the Boto layer2 interface catches up... or someone simply figures out how to do it in layer2 :-)
I'm not sure this is truly an atomic counter, since when you increment the value by 1, another call could also increment the number by 1, so that when you "get" the value, it is not the value that you would expect.
For instance, taking the code by garnaat, which is marked as the accepted answer, I see that when you put it in a thread, it does not work:
import os
import logging
import threading

import boto.dynamodb

class ThreadClass(threading.Thread):
    def run(self):
        conn = boto.dynamodb.connect_to_region(aws_access_key_id=os.environ['AWS_ACCESS_KEY'], aws_secret_access_key=os.environ['AWS_SECRET_KEY'], region_name='us-east-1')
        t = conn.get_table('zoo_keeper_ids')
        item = t.new_item('counter')
        item.add_attribute('n', 1)
        r = item.save()  # - Item has been atomically updated!

        # Uh-Oh! The value may have changed by the time "get_item" is called!
        item = t.get_item('counter')
        self.counter = item['n']
        logging.critical('Thread has counter: ' + str(self.counter))

tcount = 3
threads = []
for i in range(tcount):
    threads.append(ThreadClass())

# Start running the threads:
for t in threads:
    t.start()

# Wait for all threads to complete:
for t in threads:
    t.join()

# - Now verify all threads have unique numbers:
results = set()
for t in threads:
    results.add(t.counter)

print len(results)
print tcount
if len(results) != tcount:
    print '***Error: All threads do not have unique values!'
else:
    print 'Success! All threads have unique values!'
Note: If you want this to truly work, change the code to this:
def run(self):
    conn = boto.dynamodb.connect_to_region(aws_access_key_id=os.environ['AWS_ACCESS_KEY'], aws_secret_access_key=os.environ['AWS_SECRET_KEY'], region_name='us-east-1')
    t = conn.get_table('zoo_keeper_ids')
    item = t.new_item('counter')
    item.add_attribute('n', 1)
    r = item.save(return_values='ALL_NEW')  # - Item has been atomically updated, and you have the correct value without having to do a "get"!
    self.counter = str(r['Attributes']['n'])
    logging.critical('Thread has counter: ' + str(self.counter))
Hope this helps!
There is no high-level function in DynamoDB for atomic counters. However, you can implement an atomic counter using the conditional write feature. For example, let's say you have a table with a string hash key, created like this:
>>> import boto
>>> c = boto.connect_dynamodb()
>>> schema = c.create_schema('id', 's')
>>> counter_table = c.create_table('counter', schema, 5, 5)
You now write an item to that table that includes an attribute called 'n' whose value is zero.
>>> n = 0
>>> item = counter_table.new_item('counter', {'n': n})
>>> item.put()
Now, if I want to update the value of my counter, I would perform a conditional write operation that will bump the value of 'n' to 1 if and only if its current value agrees with my idea of its current value.
>>> n += 1
>>> item['n'] = n
>>> item.put(expected_value={'n': n-1})
This will set the value of 'n' in the item to 1, but only if the current value in DynamoDB is zero. If the value was already incremented by someone else, the write would fail, and I would then need to increment my local counter and try again.
This is kind of complicated but all of this could be wrapped up in some code to make it much simpler to use. I did a similar thing for SimpleDB that you can find here:
http://www.elastician.com/2010/02/stupid-boto-tricks-2-reliable-counters.html
I should probably try to update that example to use DynamoDB
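In that spirit, here is a rough sketch of the read-increment-conditional-write retry loop described above (my own illustration, not code from the answer; it reuses the counter_table and item calls shown earlier and assumes a failed conditional write surfaces as a DynamoDBResponseError, the exception class seen in the question's traceback):

from boto.exception import DynamoDBResponseError

def bump_counter(counter_table, retries=5):
    # read, increment, and conditionally write back; retry on conflicts
    for attempt in range(retries):
        item = counter_table.get_item('counter')
        current = item['n']
        item['n'] = current + 1
        try:
            # the write only succeeds if nobody else changed 'n' in the meantime
            item.put(expected_value={'n': current})
            return item['n']
        except DynamoDBResponseError:
            # someone else incremented first; refetch and try again
            continue
    raise RuntimeError('could not update counter after %d attempts' % retries)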
If you want to increment a value in DynamoDB, you can achieve this by using:
import boto3
import json
import decimal

class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if o % 1 > 0:
                return float(o)
            else:
                return int(o)
        return super(DecimalEncoder, self).default(o)

ddb = boto3.resource('dynamodb')

def get_counter():
    table = ddb.Table(TableName)  # TableName: the name of your counter table
    try:
        response = table.update_item(
            Key={
                'haskey': 'counterName'
            },
            UpdateExpression="set currentValue = currentValue + :val",
            ExpressionAttributeValues={
                ':val': decimal.Decimal(1)
            },
            ReturnValues="UPDATED_NEW"
        )
        print("UpdateItem succeeded:")
    except Exception as e:
        raise e
    print(response["Attributes"]["currentValue"])
This implementation needs an extra counter table that just keeps the last used value for you.
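For completeness, here is a small sketch (my addition, with assumed names) showing how that counter item could be seeded before the first increment; the 'haskey' and 'counterName' key names simply mirror the snippet above:

import boto3
import decimal

ddb = boto3.resource('dynamodb')

def create_counter(table_name):
    # seed the counter item so the first "currentValue + :val" update has something to add to
    table = ddb.Table(table_name)
    table.put_item(Item={
        'haskey': 'counterName',
        'currentValue': decimal.Decimal(0)
    })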