Get n number of documents from a collection using MongoDB/MongoEngine

Get n number of documents from a collection using MongoDB/MongoEngine - python

Hi everyone I have a document inside a collection like this. (Ignore the absurdity of the question).
[
{
"tag": "english",
"difficulty": "hard",
"question": "What are alphabets",
"option_1": "98 billion light years",
"option_2": "23.3 trillion light years",
"option_3": "6 minutes",
"option_4": "It is still unknown",
"correct_answer": "option_1",
"id": "5f80befbaaf3c9ce2f4e2fb9"
}
]
There are multiple documents such as this one (10000).
I'm trying to write a python get to function using flask-restful to get n number of documents from this collection.
Currently, I'm confused about how to write a MongoEngine query.
This is what I do to get a single document based on it.
def get(self,id):
questions = Question.objects.get(id=id).to_json()
return Response(questions,
mimetype="application/json",
status = 200)
for n number of documents, I'm unable to figure out what to write inside.
def get_n_questions(self,n):
body = request.get_json(force =True)
questions = ???
return Response(questions,
mimetype="application/json",
status = 200)

You can use the limit(n) method (doc) on a queryset. This will let you retrieve the n firsts documents from the collection.
In your case that would mean:
questions = Question.objects().limit(n).to_json()
You may also be interested in the skip(n) method, this will allow you to do pagination (similarly to a limit/offset from MySQL for instance).

Related

Python json pulling list of items

I apologize in advance if this is simple. This is my first go at Python and I've been searching and trying things all day and just haven't been able to figure out how to accomplish what I need.
I am pulling a list of assets from an API. Below is an example of the result of this request (in reality it will return 50 sensorpoints.
There is a second request that will pull readings from a specific sensor based on sensorPointId. I need to be able to enter an assetId, and pull the readings from each sensor.
{
"assetId": 1436,
"assetName": "Pharmacy",
"groupId": "104",
"groupName": "West",
"environment": "Freezer",
"lastActivityDate": "2021-01-25T18:54:34.5970000Z",
"tags": [
"Manager: Casey",
"State: Oregon"
],
"sensorPoints": [
{
"sensorPointId": 126,
"sensorPointName": "Top Temperature",
"devices": [
"23004000080793070793",
"74012807612084533500"
]
},
{
"sensorPointId": 129,
"sensorPointName": "Bottom Temperature",
"devices": [
"86004000080793070956"
]
}
]
}
My plan was to go through the list from the first request, make a list of all the sensorpointIds in that asset then run the second request for each based on that list. The problem no matter which method I try to pull the individual sensorpointIds, it says object is not subscriptable, even when looking at a string value. These are all the things I've tried. I'm sure it's something silly I'm missing, but all of these I have seen in examples. I've written the full response to a text file just to make sure I'm getting good data, and that works fine.
r = request...
data = r.json
for sensor in data:
print (data["sensorpointId")
or
print(["sensorsPoints"]["sensorPointName"])
these give 'method' object is not iterable
I've also just tried to print a single sensorpointId
print(data["sensorpointId"][0])
print(data["sensorpointName"][0])
print(data["sensorPoints"][0]["sensorpointId"])
all of these give object is not subscriptable
print(r["sensorPoints"][0]["sensorpointName"])
'Response' object is not subscriptable
print(data["sensorPoints"][0]["sensorpointName"])
print(["sensorPoints"][0]["sensorpointName"]
string indices must be integers, not 'str'

I got it!
data = r.json()['sensorPoints']
sensors = []
for d in data:
sensor = d['sensorPointId']
sensors.append(sensor)

Python Check if Key/Value exists in JSON output

I have a JSON output and I want to create an IF statement so if it contains the value I am looking for the do something ELSE do something else.
JSON Blob 1
[
{
"domain":"www.slatergordon.co.uk",
"displayed_link":"https://www.slatergordon.co.uk/",
"description":"We Work With Thousands Of People Across The UK In All Areas Of Personal Legal Services. Regardless Of How You Have Been Injured Through Negligence We're Here To Help You. Personal Injury Experts.",
"position":1,
"block_position":"top",
"title":"Car Claims Solicitors - No Win No Fee Solicitors - SlaterGordon.co.uk",
"link":"https://www.slatergordon.co.uk/personal-injury-claim/road-traffic-accidents-solicitors/",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABABGgJwdg&sig=AOD64_3u1ct0jmXAnvemxFHh_tfK5UK8Xg&q&adurl"
},
{
"is_phone_ad":true,
"phone_number":"0333 358 0496",
"domain":"www.accident-claimsline.co.uk",
"displayed_link":"http://www.accident-claimsline.co.uk/",
"description":"Car Insurance Claims Advice - Car Accident Claims Helpline",
"sitelinks":[
{
"title":"Replacement Vehicle Hire",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABALGgJwdg&ae=2&sig=AOD64_20YjAoyMY_c6XVTnBU1vQAD2tDTA&q=&ved=2ahUKEwjvlM-djLDwAhVmJzQIHSZHDLEQvrcBegQIBRAM&adurl="
},
{
"title":"Request a Call Back",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABAOGgJwdg&ae=2&sig=AOD64_36-Pd831AXrPbh1yvUyTbhXH2irg&q=&ved=2ahUKEwjvlM-djLDwAhVmJzQIHSZHDLEQvrcBegQIBRAN&adurl="
}
],
"position":6,
"block_position":"bottom",
"title":"Car Insurance Claims Advice - Car Accident Claims Helpline",
"link":"http://www.accident-claimsline.co.uk/",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABAGGgJwdg&ae=2&sig=AOD64_09pMtWxFo9s8c1dL16NJo5ThOlrg&q&adurl"
}
]
JSON Blob 2
JSON
[
{
"domain":"www.slatergordon.co.uk",
"displayed_link":"https://www.slatergordon.co.uk/",
"description":"We Work With Thousands Of People Across The UK In All Areas Of Personal Legal Services. Regardless Of How You Have Been Injured Through Negligence We're Here To Help You. Personal Injury Experts.",
"position":1,
"block_position":"top",
"title":"Car Claims Solicitors - No Win No Fee Solicitors - SlaterGordon.co.uk",
"link":"https://www.slatergordon.co.uk/personal-injury-claim/road-traffic-accidents-solicitors/",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABABGgJwdg&sig=AOD64_3u1ct0jmXAnvemxFHh_tfK5UK8Xg&q&adurl"
},
{
"is_phone_ad":true,
"phone_number":"0333 358 0496",
"domain":"www.accident-claimsline.co.uk",
"displayed_link":"http://www.accident-claimsline.co.uk/",
"description":"Car Insurance Claims Advice - Car Accident Claims Helpline",
"sitelinks":[
{
"title":"Replacement Vehicle Hire",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABALGgJwdg&ae=2&sig=AOD64_20YjAoyMY_c6XVTnBU1vQAD2tDTA&q=&ved=2ahUKEwjvlM-djLDwAhVmJzQIHSZHDLEQvrcBegQIBRAM&adurl="
},
{
"title":"Request a Call Back",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABAOGgJwdg&ae=2&sig=AOD64_36-Pd831AXrPbh1yvUyTbhXH2irg&q=&ved=2ahUKEwjvlM-djLDwAhVmJzQIHSZHDLEQvrcBegQIBRAN&adurl="
}
],
"position":6,
"block_position":"top",
"title":"Car Insurance Claims Advice - Car Accident Claims Helpline",
"link":"http://www.accident-claimsline.co.uk/",
"tracking_link":"https://www.google.co.uk/aclk?sa=l&ai=DChcSEwj8-NSdjLDwAhXBEH0KHRYwA1MYABAGGgJwdg&ae=2&sig=AOD64_09pMtWxFo9s8c1dL16NJo5ThOlrg&q&adurl"
}
]
Desired Output
if "block_position":"bottom" in JSONBlob:
do something
else:
do something else
but I cant seem to get it to trigger for me. I want it to search through the entire output and if it contains that key/value do something and if it doesnt contain it do something else.
Blob 1 would go down the IF path
Blob 2 would go down the else path

The main problem you have here is that the JSON output is a list/array with two objects inside. As you can have the block_position key in any of the inner objects, you could do something like this:
if any([obj.get('block_position') == 'bottom' for obj in JSONBlob]):
print('I do something')
else:
print('I do somehting else')
EDIT 1: OK, I think I got your point. You only need to do something for each object with block_position set to bottom. Then the following should do it:
for obj in JSONBlob:
if obj.get('block_position') == 'bottom':
print('I do something with the object')
else:
print('I do something else with the object')
EDIT 2: As spoken in the post, if you only want to do something with the objects with block_position set as bottom, you can suppress the else clause as follows:
for obj in JSONBlob:
if obj.get('block_position') == 'bottom':
print('I do something with the object')

you can use JMESPath library. Its a query language for JSON.
Basic jmespath expression for your case would be [?block_position==bottom]. This will filter out the specific node for you.
I tried it online here with data provided by you.
If you are looking for more nested node you will only have to alter your expression to search that specific node.

How to sort paginated logs by #timestamp with Elasticsearch?

My goal is to sort millions of logs by timestamp that I receive out of Elasticsearch.
Example logs:
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:00:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:01:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:02:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:04:09.000Z"}
Unfortunately, I am not able to get all the logs sorted out of Elastic. It seems like I have to do it by myself.
Approaches I have tried to get the data sorted out of elastic:
es = Search(index="somelogs-*").using(client).params(preserve_order=True)
for hit in es.scan():
print(hit['#timestamp'])
Another approach:
notifications = (es
.query("range", **{
"#timestamp": {
'gte': 'now-48h',
'lt' : 'now'
}
})
.sort("#timestamp")
.scan()
)
So I am looking for a way to sort these logs by myself or directly through Elasticsearch. Currently, I am saving all the data in a local 'logs.json' and it seems to me I have to iter over and sort it by myself.

You should definitely let Elasticsearch do the sorting, then return the data to you already sorted.
The problem is that you are using .scan(). It uses Elasticsearch's scan/scroll API, which unfortunately only applies the sorting params on each page/slice, not the entire search result. This is noted in the elasticsearch-dsl docs on Pagination:
Pagination
...
If you want to access all the documents matched by your query you can
use the scan method which uses the scan/scroll elasticsearch API:
for hit in s.scan():
print(hit.title)
Note that in this case the results won’t be sorted.
(emphasis mine)
Using pagination is definitely an option especially when you have a "millions of logs" as you said. There is a search_after pagination API:
Search after
You can use the search_after parameter to retrieve the next page of
hits using a set of sort values from the previous page.
...
To get the first page of results, submit a search request with a sort
argument.
...
The search response includes an array of sort values for
each hit.
...
To get the next page of results, rerun the previous search using the last hit’s sort values as the search_after argument. ... The search’s query and sort arguments must remain unchanged. If provided, the from argument must be 0 (default) or -1.
...
You can repeat this process to get additional pages of results.
(omitted the raw JSON requests since I'll show a sample in Python below)
Here's a sample how to do it with elasticsearch-dsl for Python. Note that I'm limiting the fields and the number of results to make it easier to test. The important parts here are the sort and the extra(search_after=).
search = Search(using=client, index='some-index')
# The main query
search = search.extra(size=100)
search = search.query('range', **{'#timestamp': {'gte': '2020-12-29T09:00', 'lt': '2020-12-29T09:59'}})
search = search.source(fields=('#timestamp', ))
search = search.sort({
'#timestamp': {
'order': 'desc'
},
})
# Store all the results (it would be better to be wrap all this in a generator to be performant)
hits = []
# Get the 1st page
results = search.execute()
hits.extend(results.hits)
total = results.hits.total
print(f'Expecting {total}')
# Get the next pages
# Real use-case condition should be "until total" or "until no more results.hits"
while len(hits) < 1000:
print(f'Now have {len(hits)}')
last_hit_sort_id = hits[-1].meta.sort[0]
search = search.extra(search_after=[last_hit_sort_id])
results = search.execute()
hits.extend(results.hits)
with open('results.txt', 'w') as out:
for hit in hits:
out.write(f'{hit["#timestamp"]}\n')
That would lead to an already sorted data:
# 1st 10 lines
2020-12-29T09:58:57.749Z
2020-12-29T09:58:55.736Z
2020-12-29T09:58:53.627Z
2020-12-29T09:58:52.738Z
2020-12-29T09:58:47.221Z
2020-12-29T09:58:45.676Z
2020-12-29T09:58:44.523Z
2020-12-29T09:58:43.541Z
2020-12-29T09:58:40.116Z
2020-12-29T09:58:38.206Z
...
# 250-260
2020-12-29T09:50:31.117Z
2020-12-29T09:50:27.754Z
2020-12-29T09:50:25.738Z
2020-12-29T09:50:23.601Z
2020-12-29T09:50:17.736Z
2020-12-29T09:50:15.753Z
2020-12-29T09:50:14.491Z
2020-12-29T09:50:13.555Z
2020-12-29T09:50:07.721Z
2020-12-29T09:50:05.744Z
2020-12-29T09:50:03.630Z
...
# 675-685
2020-12-29T09:43:30.609Z
2020-12-29T09:43:30.608Z
2020-12-29T09:43:30.602Z
2020-12-29T09:43:30.570Z
2020-12-29T09:43:30.568Z
2020-12-29T09:43:30.529Z
2020-12-29T09:43:30.475Z
2020-12-29T09:43:30.474Z
2020-12-29T09:43:30.468Z
2020-12-29T09:43:30.418Z
2020-12-29T09:43:30.417Z
...
# 840-850
2020-12-29T09:43:27.953Z
2020-12-29T09:43:27.929Z
2020-12-29T09:43:27.927Z
2020-12-29T09:43:27.920Z
2020-12-29T09:43:27.897Z
2020-12-29T09:43:27.895Z
2020-12-29T09:43:27.886Z
2020-12-29T09:43:27.861Z
2020-12-29T09:43:27.860Z
2020-12-29T09:43:27.853Z
2020-12-29T09:43:27.828Z
...
# Last 3
2020-12-29T09:43:25.878Z
2020-12-29T09:43:25.876Z
2020-12-29T09:43:25.869Z
There are some considerations on using search_after as discussed in the API docs:
Use a Point In Time or PIT parameter
If a refresh occurs between these requests, the order of your results may change, causing inconsistent results across pages. To prevent this, you can create a point in time (PIT) to preserve the current index state over your searches.
You need to first make a POST request to get a PIT ID
Then add an extra 'pit': {'id':xxxx, 'keep_alive':5m} parameter to every request
Make sure to use the PIT ID from the last response
Use a tiebreaker
We recommend you include a tiebreaker field in your sort. This tiebreaker field should contain a unique value for each document. If you don’t include a tiebreaker field, your paged results could miss or duplicate hits.
This would depend on your Document schema
# Add some ID as a tiebreaker to the `sort` call
search = search.sort(
{'#timestamp': {
'order': 'desc'
}},
{'some.id': {
'order': 'desc'
}}
)
# Include both the sort ID and the some.ID in `search_after`
last_hit_sort_id, last_hit_route_id = hits[-1].meta.sort
search = search.extra(search_after=[last_hit_sort_id, last_hit_route_id])

Thank you Gino Mempin. It works!
But I also figured out, that a simple change does the same job.
by adding .params(preserve_order=True) elasticsearch will sort all the data.
es = Search(index="somelog-*").using(client)
notifications = (es
.query("range", **{
"#timestamp": {
'gte': 'now-48h',
'lt' : 'now'
}
})
.sort("#timestamp")
.params(preserve_order=True)
.scan()
)

Trouble when storing API data in Python list

I'm struggling with my json data that I get from an API. I've gone into several api urls to grab my data, and I've stored it in an empty list. I then want to take out all fields that say "reputation" and I'm only interested in that number. See my code here:
import json
import requests
f = requests.get('my_api_url')
if(f.ok):
data = json.loads(f.content)
url_list = [] #the list stores a number of urls that I want to request data from
for items in data:
url_list.append(items['details_url']) #grab the urls that I want to enter
total_url = [] #stores all data from all urls here
for index in range(len(url_list)):
url = requests.get(url_list[index])
if(url.ok):
url_data = json.loads(url.content)
total_url.append(url_data)
print(json.dumps(total_url, indent=2)) #only want to see if it's working
Thus far I'm happy and can enter all urls and get the data. It's in the next step I get trouble. The above code outputs the following json data for me:
[
[
{
"id": 316,
"name": "storabro",
"url": "https://storabro.net",
"customer": true,
"administrator": false,
"reputation": 568
}
],
[
{
"id": 541,
"name": "sega",
"url": "https://wedonthaveanyyet.com",
"customer": true,
"administrator": false,
"reputation": 45
},
{
"id": 90,
"name": "Villa",
"url": "https://brandvillas.co.uk",
"customer": true,
"administrator": false,
"reputation": 6
}
]
]
However, I only want to print out the reputation, and I cannot get it working. If I in my code instead use print(total_url['reputation']) it doesn't work and says "TypeError: list indices must be integers or slices, not str", and if I try:
for s in total_url:
print(s['reputation'])
I get the same TypeError.
Feels like I've tried everything but I can't find any answers on the web that can help me, but I understand I still have a lot to learn and that my error will be obvious to some people here. It seems very similar to other things I've done with Python, but this time I'm stuck. To clarify, I'm expecting an output similar to: [568, 45, 6]
Perhaps I used the wrong way to do this from the beginning and that's why it's not working all the way for me. Started to code with Python in October and it's still very new to me but I want to learn. Thank you all in advance!

It looks like your total_url is a list of lists, so you might write a function like:
def get_reputations(data):
for url in data:
for obj in url:
print(obj.get('reputation'))
get_reputations(total_url)
# output:
# 568
# 45
# 6
If you'd rather not work with a list of lists in the first place, you can extend the list with each result instead of append in the expression used to construct total_url

You can also use json.load and try to read the response
def get_rep():
response = urlopen(api_url)
r = response.read().decode('utf-8')
r_obj = json.loads(r)
for item in r_obj['response']:
print("Reputation: {}".format(item['reputation']))

Json file to dictionary

I am using the yelp dataset and I want to parse the review json file to a dictionary. I tried loading it on a pandas DataFrame and then creating the dictionary, but because the file is too big it is time consuming. I want to keep only the user_id and stars values. A line of the json file looks like this:
{
"votes": {
"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17",
"text": (
"dr. goldberg offers everything i look for in a general practitioner. "
"he's nice and easy to talk to without being patronizing; he's always on "
"time in seeing his patients; he's affiliated with a top-notch hospital (nyu) "
"which my parents have explained to me is very important in case something "
"happens and you need surgery; and you can get referrals to see specialists "
"without having to see him first. really, what more do you need? i'm "
"sitting here trying to think of any complaints i have about him, but i'm "
"really drawing a blank."
),
"type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
How can i iterate over every 'field' (for the lack o a better word)? So far i can only iterate over each line.
EDIT
As requested pandas code :
reading the json
with open('yelp_academic_dataset_review.json') as f:
df = pd.DataFrame(json.loads(line) for line in f)
Creating the dictionary
dict = {}
for i, row in df.iterrows():
business_id = row['business_id']
user_id = row['user_id']
rating = row['stars']
key = (business_id, user_id)
dict[key] = rating

You don't need to read this into a DataFrame. json.load() returns a dictionary. For example:
sample.json
{
"votes": {
"funny": 0,
"useful": 2,
"cool": 1
},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
read_json.py
import json
with open('sample.json', 'r') as fh:
result_dict = json.load(fh)
print(result_dict['user_id'])
print(result_dict['stars'])
output
Xqd0DzHaiyRqVH3WRG7hzg
5
With that output you can easily create a DataFrame.
There are several good discussions about parsing json as a stream on SO, but the gist is it's not possible natively, although some tools seem to attempt it.
In the interest of keeping your code simple and with minimal dependencies, you might see if reading the json directory into a dictionary is a sufficient improvement.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Get n number of documents from a collection using MongoDB/MongoEngine - python

Related

Python json pulling list of items

Python Check if Key/Value exists in JSON output

How to sort paginated logs by #timestamp with Elasticsearch?

Trouble when storing API data in Python list

Json file to dictionary

Categories

Resources