If I iterate over the users collection, every user.id is printed out correctly:
user_ref = db.collection(u'users')
for user_collection in user_ref.get():
    print(user_collection.id, file = sys.stderr)
Now, when I try to iterate over a collection inside each of the documents in the users collection, the outer iteration that prints user.id does not run to completion:
user_ref = db.collection(u'users')
for user_collection in user_ref.get():
    print(user_collection.id, file = sys.stderr)
    s2_ref = user_ref.document(user_collection.id).collection(u'preferences')
    for s2 in s2_ref.get():
        try:
            print(s2.id, file = sys.stderr)
        except google.cloud.exceptions.NotFound:
            pass
I have included an exception to bypass empty collections.
How can I complete the iteration correctly?
I just had to create an array for the first set of results, and then iterate each id separately:
user_id_array = []
for user_collection in user_ref.get():
    user_id_array.append(user_collection.id)
for user_id in user_id_array:
    try:
        suscription_ref = doc_ref.document(user_id).collection(u'suscriptions').document(user_id).get()
        print(suscription_ref.id,file = sys.stderr)
    except google.cloud.exceptions.NotFound:
        pass
It takes more time, but it'll get you there.
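The same two-step pattern can be sketched without a live Firestore connection. The sketch relies only on the calls already used in the question (collection(), document(), get() and the .id attribute). FakeDoc, FakeCollection and collect_subcollection_ids are hypothetical names invented for illustration and are not part of the Firestore client.

```python
class FakeDoc:
    """Hypothetical stand-in for a Firestore document."""

    def __init__(self, doc_id, subcollections=None):
        self.id = doc_id
        self._subcollections = subcollections or {}

    def collection(self, name):
        # A missing subcollection behaves like an empty one.
        return FakeCollection(self._subcollections.get(name, []))


class FakeCollection:
    """Hypothetical stand-in for a Firestore collection reference."""

    def __init__(self, docs):
        self._docs = {doc.id: doc for doc in docs}

    def get(self):
        # Hands back a lazy iterator, similar to a streamed query result.
        return iter(self._docs.values())

    def document(self, doc_id):
        return self._docs[doc_id]


def collect_subcollection_ids(users_ref, sub_name):
    """Return {user_id: [ids of the documents in the named subcollection]}."""
    # Step 1: drain the outer query completely before starting inner queries.
    user_ids = [user.id for user in users_ref.get()]
    # Step 2: query each subcollection separately.
    result = {}
    for user_id in user_ids:
        sub_ref = users_ref.document(user_id).collection(sub_name)
        result[user_id] = [doc.id for doc in sub_ref.get()]
    return result


users = FakeCollection([
    FakeDoc("alice", {"preferences": [FakeDoc("theme"), FakeDoc("lang")]}),
    FakeDoc("bob"),  # no preferences subcollection
])
ids_by_user = collect_subcollection_ids(users, "preferences")
print(ids_by_user)
```

With a real client, users_ref would be db.collection(u'users'); whether draining the outer query first is strictly required may depend on the client version.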
I am trying to append the id of each query result to the output list in my code. I can get one id into the list, but the first one seems to get overridden. How can I change my code so that output.append(q.id) works for any number of loop iterations?
Here is the code:
@app.route('/new-mealplan', methods=['POST'])
def create_mealplan():
    data = request.get_json()
    recipes = data['recipes']
    output = []
    for recipe in recipes:
        try:
            query = Recipes.query.filter(func.lower(Recipes.recipe_name) == func.lower(recipe)).all()
            # print(recipe)
            if query:
                query = Recipes.query.filter(func.lower(Recipes.recipe_name) == func.lower(recipe)).all()
                for q in query:
                    output.append(q.id)
        finally:
            return jsonify({"data" : output})
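To fix this I removed the try and finally blocks and returned only after the for loop had completed. The return inside finally ran at the end of the first iteration, so the function exited before the remaining recipes were processed.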
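The effect is easy to reproduce outside Flask. The two functions below are a minimal sketch with made-up names (collect_with_finally, collect_after_loop) and plain strings in place of database rows.

```python
def collect_with_finally(items):
    """Buggy shape: `finally` runs after the first iteration and returns."""
    output = []
    for item in items:
        try:
            output.append(item.upper())
        finally:
            return output  # exits the function during the first iteration
    return output


def collect_after_loop(items):
    """Fixed shape: no try/finally, return once the loop has finished."""
    output = []
    for item in items:
        output.append(item.upper())
    return output


buggy = collect_with_finally(["a", "b", "c"])
fixed = collect_after_loop(["a", "b", "c"])
print(buggy)  # ['A']
print(fixed)  # ['A', 'B', 'C']
```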
I keep getting the following error when trying to parse some json:
Traceback (most recent call last):
  File "/Users/batch/projects/kl-api/api/helpers.py", line 37, in collect_youtube_data
    keywords = channel_info_response_data['items'][0]['brandingSettings']['channel']['keywords']
KeyError: 'brandingSettings'
How do I make sure that I check my JSON output for a key before assigning it to a variable? If a key isn’t found, then I just want to assign a default value. Code below:
try:
    channel_id = channel_id_response_data['items'][0]['id']
    channel_info_url = YOUTUBE_URL + '/channels/?key=' + YOUTUBE_API_KEY + '&id=' + channel_id + '&part=snippet,contentDetails,statistics,brandingSettings'
    print('Querying:', channel_info_url)
    channel_info_response = requests.get(channel_info_url)
    channel_info_response_data = json.loads(channel_info_response.content)
    no_of_videos = int(channel_info_response_data['items'][0]['statistics']['videoCount'])
    no_of_subscribers = int(channel_info_response_data['items'][0]['statistics']['subscriberCount'])
    no_of_views = int(channel_info_response_data['items'][0]['statistics']['viewCount'])
    avg_views = round(no_of_views / no_of_videos, 0)
    photo = channel_info_response_data['items'][0]['snippet']['thumbnails']['high']['url']
    description = channel_info_response_data['items'][0]['snippet']['description']
    start_date = channel_info_response_data['items'][0]['snippet']['publishedAt']
    title = channel_info_response_data['items'][0]['snippet']['title']
    keywords = channel_info_response_data['items'][0]['brandingSettings']['channel']['keywords']
except Exception as e:
    raise Exception(e)
You can either wrap the assignment in something like
try:
    keywords = channel_info_response_data['items'][0]['brandingSettings']['channel']['keywords']
except KeyError:
    keywords = "default value"
or check each level with the in operator before reading it (.has_key(...) exists only in Python 2). IMHO, in your case the first solution is preferable.
Suppose you have a dict. There are two ways to handle a missing key:
1) Get the key with a default value, like
d = {}
val = d.get('k', 10)
val will be 10, since there is no key named k.
2) try-except
d = {}
try:
    val = d['k']
except KeyError:
    val = 10
This way is far more flexible, since you can do anything in the except block, or even ignore the error with a pass statement if you really don't care about it.
How do I make sure that I check my JSON output
At this point your "JSON output" is just a plain native Python dict
for a key before assigning it to a variable? If a key isn’t found, then I just want to assign a default value
Now that you know you have a dict, browsing the official documentation for dict methods should answer the question:
https://docs.python.org/3/library/stdtypes.html#dict.get
get(key[, default])
Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.
so the general case is:
var = data.get(key, default)
Now if you have deeply nested dicts/lists where any key or index could be missing, catching KeyErrors and IndexErrors can be simpler:
try:
    var = data[key1][index1][key2][index2][keyN]
except (KeyError, IndexError):
    var = default
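If the same nested lookup is needed in several places, the try/except can be wrapped in a small helper. This is only a sketch: deep_get is a hypothetical name, not a standard library function, and the sample response dict is made up.

```python
def deep_get(data, path, default=None):
    """Follow a sequence of keys/indexes through nested dicts and lists."""
    current = data
    try:
        for step in path:
            current = current[step]
    except (KeyError, IndexError, TypeError):
        # TypeError covers indexing into None or another non-container value.
        return default
    return current


response = {"items": [{"snippet": {"title": "Demo"}}]}

title = deep_get(response, ["items", 0, "snippet", "title"], "")
keywords = deep_get(
    response, ["items", 0, "brandingSettings", "channel", "keywords"], ""
)
missing_item = deep_get(response, ["items", 3, "snippet"], {})
```

Catching TypeError as well is a design choice: it makes the helper tolerant of a None in the middle of the path, at the cost of hiding that case from the caller.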
As a side note: your code snippet is filled with repeated channel_info_response_data['items'][0]['statistics'] and channel_info_response_data['items'][0]['snippet'] expressions. Using intermediate variables will make your code more readable, easier to maintain, AND a bit faster too:
# always set a timeout if you don't want the program to hang forever
channel_info_response = requests.get(channel_info_url, timeout=30)
# always check the response status - having a response doesn't
# mean you got what you expected. Here we use the `raise_for_status()`
# shortcut which will raise an exception for any 4xx or 5xx
# status code.
channel_info_response.raise_for_status()
# requests knows how to deal with json:
channel_info_response_data = channel_info_response.json()
# we assume that the response MUST have `['items'][0]`,
# and that this item MUST have "statistics" and "snippet"
item = channel_info_response_data['items'][0]
stats = item["statistics"]
snippet = item["snippet"]
no_of_videos = int(stats.get('videoCount', 0))
no_of_subscribers = int(stats.get('subscriberCount', 0))
no_of_views = int(stats.get('viewCount', 0))
# avoid a ZeroDivisionError when videoCount is missing or 0
avg_views = round(no_of_views / no_of_videos, 0) if no_of_videos else 0
try:
    photo = snippet['thumbnails']['high']['url']
except KeyError:
    photo = None
description = snippet.get('description', "")
start_date = snippet.get('publishedAt', None)
title = snippet.get('title', "")
try:
    keywords = item['brandingSettings']['channel']['keywords']
except KeyError:
    keywords = ""
You may also want to learn about string formatting (concatenating strings is quite error-prone and barely readable), and how to pass arguments to requests.get().
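As a rough illustration of both points, the URL can be built with str.format() and the query string kept in a dict. BASE_URL, API_KEY and channel_request are placeholders invented for this sketch; the standard library's urlencode is used so that nothing is sent over the network.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.invalid/youtube"  # placeholder, not the real endpoint
API_KEY = "dummy-key"                         # placeholder credential


def channel_request(channel_id):
    """Return the URL and the query parameters for a channel lookup."""
    url = "{}/channels".format(BASE_URL)
    params = {
        "key": API_KEY,
        "id": channel_id,
        "part": "snippet,contentDetails,statistics,brandingSettings",
    }
    return url, params


url, params = channel_request("abc123")

# With requests installed, this pair could be sent as:
#     requests.get(url, params=params, timeout=30)
# urlencode() builds an equivalent, properly escaped query string.
full_url = "{}?{}".format(url, urlencode(params))
print(full_url)
```

Keeping the parameters in a dict means the escaping is done by the library rather than by hand.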
I'm working on a small project that retrieves information about books from the Google Books API using Python 3. For this I make a call to the API, read out the variables, and store them in a list. For a search like "linkedin" this works perfectly. However, when I enter "Google", it reads the second title from the JSON input. How can this happen?
Please find my code below (Google_Results is the class I use to initialize the variables):
import requests
def Book_Search(search_term):
    parms = {"q": search_term, "maxResults": 3}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    print(r.url)
    results = r.json()
    i = 0
    for result in results["items"]:
        try:
            isbn13 = str(result["volumeInfo"]["industryIdentifiers"][0]["identifier"])
            isbn10 = str(result["volumeInfo"]["industryIdentifiers"][1]["identifier"])
            title = str(result["volumeInfo"]["title"])
            author = str(result["volumeInfo"]["authors"])[2:-2]
            publisher = str(result["volumeInfo"]["publisher"])
            published_date = str(result["volumeInfo"]["publishedDate"])
            description = str(result["volumeInfo"]["description"])
            pages = str(result["volumeInfo"]["pageCount"])
            genre = str(result["volumeInfo"]["categories"])[2:-2]
            language = str(result["volumeInfo"]["language"])
            image_link = str(result["volumeInfo"]["imageLinks"]["thumbnail"])
            dict = Google_Results(isbn13, isbn10, title, author, publisher, published_date, description, pages, genre,
                                  language, image_link)
            gr.append(dict)
            print(gr[i].title)
            i += 1
        except:
            pass
    return
gr = []
Book_Search("Linkedin")
I am a beginner to Python, so any help would be appreciated!
It does so because there is no publisher entry in volumeInfo of the first entry, thus it raises a KeyError and your except captures it. If you're going to work with fuzzy data you have to account for the fact that it will not always have the expected structure. For simple cases you can rely on dict.get() and its default argument to return a 'valid' default entry if an entry is missing.
Also, there are a few conceptual problems with your function: it relies on a global gr, which is bad design; it shadows the built-in dict type; and it captures all exceptions, guaranteeing that you cannot exit your code even with a SIGINT... I'd suggest converting it to something a bit more sane:
def book_search(search_term, max_results=3):
    results = []  # a list to store the results
    parms = {"q": search_term, "maxResults": max_results}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    try:  # just in case the server doesn't return valid JSON
        for result in r.json().get("items", []):
            if "volumeInfo" not in result:  # invalid entry - missing volumeInfo
                continue
            result_dict = {}  # a dictionary to store our discovered fields
            result = result["volumeInfo"]  # all the data we're interested in is in volumeInfo
            isbns = result.get("industryIdentifiers", None)  # capture ISBNs
            if isinstance(isbns, list) and isbns:
                # same order as in the question: index 0 -> isbn13, index 1 -> isbn10
                for i, t in enumerate(("isbn13", "isbn10")):
                    if len(isbns) > i and isinstance(isbns[i], dict):
                        result_dict[t] = isbns[i].get("identifier", None)
            result_dict["title"] = result.get("title", None)
            authors = result.get("authors", None)  # capture authors
            if isinstance(authors, list) and authors:
                # join the names instead of slicing the list's string representation
                result_dict["author"] = ", ".join(authors)
            result_dict["publisher"] = result.get("publisher", None)
            result_dict["published_date"] = result.get("publishedDate", None)
            result_dict["description"] = result.get("description", None)
            result_dict["pages"] = result.get("pageCount", None)
            genres = result.get("categories", None)  # capture genres
            if isinstance(genres, list) and genres:
                result_dict["genre"] = ", ".join(genres)
            result_dict["language"] = result.get("language", None)
            result_dict["image_link"] = result.get("imageLinks", {}).get("thumbnail", None)
            # make sure Google_Results accepts keyword arguments like title, author...
            # and make them optional as they might not be in the returned result
            gr = Google_Results(**result_dict)
            results.append(gr)  # add it to the results list
    except ValueError:
        return None  # invalid response returned, you may raise an error instead
    return results  # return the results
Then you can easily retrieve as much info as possible for a term:
gr = book_search("Google")
And it will be far more tolerant of data omissions, provided that your Google_Results type makes most of the entries optional.
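One way to make the entries optional is a dataclass whose fields all default to None. The class below is a hypothetical stand-in for Google_Results, whose real definition is not shown in the question; the field names simply mirror the keys used above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GoogleResult:
    """Hypothetical stand-in for Google_Results with every field optional."""
    isbn13: Optional[str] = None
    isbn10: Optional[str] = None
    title: Optional[str] = None
    author: Optional[str] = None
    publisher: Optional[str] = None
    published_date: Optional[str] = None
    description: Optional[str] = None
    pages: Optional[int] = None
    genre: Optional[str] = None
    language: Optional[str] = None
    image_link: Optional[str] = None


# Only some fields were found in the response; the rest stay None.
partial = {"title": "Some Book", "language": "en"}
book = GoogleResult(**partial)
print(book.title, book.publisher)
```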
Following @Coldspeed's recommendation, it became clear that missing information in the JSON response caused an exception to be raised. Since I only had a "pass" statement in the except block, the entire result was skipped. I will therefore have to adapt the try/except statements so that errors are handled properly.
Thanks for the help guys!
The fields that I have in MongoDB are:
id, website_url, status.
I need to find each website_url, update its status to 3, and add a new field called err_desc.
I have a list of website_urls, with the status and err_desc for each.
Below is my code.
client = MongoClient('localhost', 9000)
db1 = client['Company_Website_Crawl']
collection1 = db1['All']
posts1 = collection1.posts
bulk = posts1.initialize_ordered_bulk_op()
website_url = ["http://www.example.com","http://example2.com/"]
err_desc = ["error1","error2"]
for i in website_url:
    parsed_uri = urlparse(i)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    final_url = domain
    final_url_strip = domain.rstrip("/")
    print i,final_url,final_url_strip,"\n"
    try:
        k = bulk.find({'website_url':i}).upsert().update({'$push':{'err_desc':err_desc,'status':3}})
        k = bulk.execute()
        print k
    except Exception as e:
        print "fail"
        print e
Error
fail batch op errors occurred
fail Bulk operations can only be executed once.
Initially I used
k = posts1.update({'website_url':final_url_strip},{'$set':{'err_desc':err_desc,'status':3}},multi=True)
It was too slow for 5M records. So I wanted to use bulk update option. Kindly help me to use bulk upsert for this scenario.
The error message is telling you that a bulk operation has to be re-initialized after execute() has been called. The real problem, though, is that execute() is called inside the loop. In your case, you need to call execute() once, after the for loop, like this:
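queued = 0  # number of operations queued on the bulk object
for url, desc in zip(website_url, err_desc):
    ...
    try:
        # $set mirrors the original update() call; $push would turn status into an array
        bulk.find({'website_url': url}).upsert().update({'$set': {'err_desc': desc, 'status': 3}})
        queued += 1
    except Exception as e:
        ...
# execute once, after the loop, and only if something was queued
if queued > 0:
    bulk.execute()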
Also note that the Bulk() API is now deprecated and replaced by bulk_write().
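A rough sketch of how the updates could be prepared for bulk_write() follows. It runs without a database: build_update_specs and chunked are hypothetical helpers, the batch size of 1000 is an arbitrary assumption, and the pymongo calls appear only in comments because they need a running server.

```python
def build_update_specs(urls, descriptions, status=3):
    """Pair each URL with its error description as (filter, update) dicts."""
    specs = []
    for url, desc in zip(urls, descriptions):
        filter_doc = {"website_url": url}
        update_doc = {"$set": {"err_desc": desc, "status": status}}
        specs.append((filter_doc, update_doc))
    return specs


def chunked(seq, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


specs = build_update_specs(
    ["http://www.example.com", "http://example2.com/"],
    ["error1", "error2"],
)
batches = list(chunked(specs, 1000))

# With pymongo installed, each batch could be sent in one round trip:
#     from pymongo import UpdateOne
#     ops = [UpdateOne(f, u, upsert=True) for f, u in batch]
#     collection.bulk_write(ops, ordered=False)
```

Sending the operations in batches keeps each request bounded, which may matter for a collection with millions of records.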
In the code below, the worker function checks whether the data passed to it is valid. If it is, it returns a dictionary that will be used in a bulk SQLAlchemy Core insert. If it's invalid, I want the None value not to be added to receiving_list, because otherwise the bulk insert will fail: a single None value cannot be mapped to the table structure.
from datetime import datetime
from sqlalchemy import Table
import multiprocessing
CONN = Engine.connect() #Engine is imported from another module
NUM_CONSUMERS = multiprocessing.cpu_count()
p = multiprocessing.Pool(NUM_CONSUMERS)
def process_data(data):
    #Long process to validate data
    if is_valid_data(data) == True:
        returned_dict = {}
        returned_dict['created_at'] = datetime.now()
        returned_dict['col1'] = data[0]
        returned_dict['colN'] = data[N]
        return returned_dict
    else:
        return None
def spawn_some_processes(data):
    table_to_insert = Table('postgresql_database_table', meta, autoload=True, autoload_with=Engine)
    while True:
        #Get some data here and pass it on to the worker
        receiving_list = p.map(process_data, data_to_process)
        try:
            if len(receiving_list) > 0:
                trans = CONN.begin()
                CONN.execute(table_to_insert.insert(), receiving_list)
                trans.commit()
        except IntegrityError:
            trans.rollback()
        except:
            trans.rollback()
Trying to rephrase the question, how can I stop a spawned process from adding to receiving_list when the value None is returned by the spawned process?
A workaround is incorporating a queue with queue.put() and queue.get() that will put only valid data. The disadvantage with this is that after the processes are over, I have to then unpack the queue which adds overhead. My ideal solution would be one where a clean list of dictionaries is returned which SQLAlchemy can use to do the bulk insert
You can just remove the None entries from the list:
# list() is needed on Python 3, where filter() returns a lazy iterator
receiving_list = list(filter(None, p.map(process_data, data_to_process)))
This is pretty quick even for really huge lists:
>>> timeit.timeit('l = filter(None, l)', 'l = range(0,10000000)', number=1)
0.47683095932006836
Note that using filter will remove anything where bool(val) is False, like empty strings, empty lists, etc. This should be fine for your use-case, though.
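If falsy values other than None must be kept, a list comprehension with an explicit identity check is a possible alternative. The sample list below is made up for illustration.

```python
results = [{"col1": 1}, None, {}, None, {"col1": 2}]

# filter(None, ...) drops every falsy value, including the empty dict.
truthy_only = list(filter(None, results))

# The comprehension drops only None and keeps the empty dict.
not_none = [row for row in results if row is not None]

print(truthy_only)  # [{'col1': 1}, {'col1': 2}]
print(not_none)     # [{'col1': 1}, {}, {'col1': 2}]
```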