Bulk MongoDB / Pymongo Insert with Datetime - python

I have code that looks like the following, which executes every minute:
huge_list = query_results() # Returns a long list of dictionaries.
db.objects.insert(huge_list)
I need the current datetime appended to each object in the list before insertion. Is there some way I can modify the insert command so it also appends a 'datetime' field, and if not what would be the most efficient way of doing this? There are several thousand records in the response list, so I feel like visiting each index of the list and appending a field may not be the most efficient method.
Later I will need to be able to query for records individually from the whole group, and also for records within a specific datetime range.
Thanks in advance!
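A minimal sketch of one way to do this, not a confirmed answer from the thread: stamp every dictionary with the same timestamp in a single pass before the bulk insert. The helper name stamp_records and the use of a UTC timestamp are assumptions for illustration. A single pass over a few thousand dictionaries is typically cheap compared with the insert itself.

```python
import datetime


def stamp_records(records, now=None):
    """Add the same 'datetime' value to every dict in records, in place."""
    # One shared timestamp lets the whole batch be queried as a group later.
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    for record in records:
        record["datetime"] = now
    return records


# Stand-in for query_results(); the real list comes from the question's code.
huge_list = [{"name": "a"}, {"name": "b"}]
stamp_records(huge_list)
# The stamped list can then be passed to the bulk insert from the question,
# e.g. db.objects.insert(huge_list).
```

Storing a real datetime object rather than a string is what should allow a later range query on the field with the $gte and $lt operators.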

Related

Query for 2 indices in Elasticsearch?

I'm wondering if it's possible to query for 2 indices in Elasticsearch, and display the results mixed together in 1 table. For example:
Indices:
food-american-burger
food-italian-pizza
food-japanese-ramen
food-mexican-burritos
#query here for burger and pizza, and display the results in a csv file
#i.e. if there was a timestamp field, display results starting from the most recent
I know you can do a query for food-*, but it would give 2 indices that I wouldn't want.
I looked up the multisearch module for Elasticsearch DSL, but the documentation shows only an instance of 1 index query:
ms = MultiSearch(index='blogs')
ms = ms.add(Search().filter('term', tags='python'))
ms = ms.add(Search().filter('term', tags='elasticsearch'))
Part 1:
Is it possible to use this for multiple indices? Ultimately, I would like to query for x number of indices and display all the data in a single human-readable format (csv, json, etc.), but I'm not sure how to perform a single query for only the indices I want.
I currently have the functionality to perform queries and write out the data, but each data file would only consist of that index I queried for. I would like to display all the data into one file.
Part 2:
The data is stored in a dictionary, and then I am writing it to a csv. It is currently being ordered by timestamp. The code:
sorted_rows = sorted(rows, key=lambda x: x['#timestamp'], reverse=True)
for row in sorted_rows:
    writer.writerow(row.values())
When writing to the csv, the timestamp field is not the first column. I'm storing the fields in a dictionary, and updating that dictionary for every Elasticsearch hit, then writing it to the csv. Is there a way to move the timestamp field to the first column?
Thanks!
According to the Elasticsearch docs, you can query a single index (e.g. food-american-burger), multiple comma-separated indices (e.g. food-american-burger,food-italian-pizza), or all indices using the _all keyword.
I haven't personally used the Python client, but this is an API convention and should apply to any of the official Elasticsearch clients.
For part 2, you should probably submit a separate question to keep things to a single topic per question, since the two topics are not directly related.
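For Part 1, the comma-separated convention described above can be sketched without any client library. The helper name search_path is hypothetical, and the snippet only builds the request path; it does not contact a cluster.

```python
def search_path(indices):
    """Build the _search path that targets only the given indices."""
    # Elasticsearch accepts several index names joined by commas.
    return "/" + ",".join(indices) + "/_search"


wanted = ["food-american-burger", "food-italian-pizza"]
print(search_path(wanted))
# /food-american-burger,food-italian-pizza/_search
```

Since hits from both indices come back in one response, a sort on the timestamp field should order them together rather than per index.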

How to create an updatable list of tuples in Python?

I'm writing a script in Python 3, where I go through a file, and collect information about the duration of various tasks. I need to maintain a list of summations of these durations (in the form of datetime.timedelta objects), split by date and which task was done. Each task is identified by an ID string.
This means that while going through the file I build a list of records, where each record consists of a date, an ID string and a duration. When adding a new record I first check if the date and ID string combination is already present in the list. If it is I add the new duration to the current duration in the list. If the date and ID string combination doesn't exist, I append the record to the list.
I don't know in advance how many different combinations of date and ID string there are, so I can't pre-allocate them.
At the end I would like to be able to sort the list on date and ID string before printing it to standard out.
I tried doing it in a list of tuples, but tuples are immutable, so I can't add a new duration to an existing duration I found.
If pressed I could create a new ID string by concatenating a string representation of the date and the ID string. But I would really prefer to keep those two values separate.
Is this possible? And if so: How?
I wouldn't use a list in this case, but rather a dict. Here's a simple example:
import datetime

data = {}
with open("myfile.txt") as file:
    for line in file:
        # Parse the line for the following:
        #   tid: The task ID we read
        #   date: The date we read
        #   duration: The duration we read (a datetime.timedelta)
        # Once the data has been parsed out, store it.
        # Start each total at a zero timedelta, since 0 + timedelta raises TypeError:
        data.setdefault((date, tid), datetime.timedelta())
        data[(date, tid)] += duration
After parsing the file you can get the keys to the dict (data.keys()), sort them, and print out the results.
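The sorting step can be sketched as follows, with a few hard-coded records standing in for the parsed file (the sample dates, task IDs, and durations are made up for illustration). Tuples compare element by element, so sorting on the (date, tid) keys orders by date first and task ID second.

```python
import datetime

# Hypothetical parsed records: (date, task ID, duration).
records = [
    (datetime.date(2020, 1, 2), "B", datetime.timedelta(minutes=30)),
    (datetime.date(2020, 1, 1), "A", datetime.timedelta(hours=1)),
    (datetime.date(2020, 1, 2), "B", datetime.timedelta(minutes=15)),
]

totals = {}
for date, tid, duration in records:
    key = (date, tid)
    # dict.get with a zero timedelta handles the first occurrence of a key.
    totals[key] = totals.get(key, datetime.timedelta()) + duration

# Keys are unique, so sorting the items orders them by (date, tid).
for (date, tid), total in sorted(totals.items()):
    print(date, tid, total)
```

The date and the ID string stay separate values inside the key, so no concatenated ID is needed.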

Is there a way to modify datetime objects through the Django ORM Query?

We have a Django/PostgreSQL database that contains objects with:
object_date = models.DateTimeField()
as a field.
We need to count the objects by hour per day, so we need to remove some of the extra time data, for example: minutes, seconds and microseconds.
We can remove the extra time data in python:
query = MyModel.objects.values('object_date')
data = [obj['object_date'].replace(minute=0, second=0, microsecond=0) for obj in query]
Which leaves us with a list containing the date and hour.
My Question: Is there a better, faster, cleaner way to do this in the query itself?
If you simply want to obtain the dates without the time data, you can use extra to declare calculated fields:
query = (
    MyModel.objects
    .extra(select={
        'object_date_group': 'CAST(object_date AS DATE)',
        'object_hour_group': 'EXTRACT(HOUR FROM object_date)',
    })
    .values('object_date_group', 'object_hour_group')
)
You don't gain too much from just that, though; the database is now sending you even more data.
However, with these additional fields, you can use aggregation to instantly get the counts you were looking for, by adding one line:
from django.db.models import Count

query = (
    MyModel.objects
    .extra(select={
        'object_date_group': 'CAST(object_date AS DATE)',
        'object_hour_group': 'EXTRACT(HOUR FROM object_date)',
    })
    .values('object_date_group', 'object_hour_group')
    .annotate(count=Count('*'))
)
Alternatively, you could use any valid SQL to combine the two fields I created into a single field, for example by formatting them into a string. The nice thing about doing that is that you can then use the resulting tuples to construct a Counter for convenient querying (use values_list()).
This query will certainly be more efficient than doing the counting in Python. For a background job that may not be so important, however.
One downside is that this code is not portable; for one, it does not work on SQLite, which you may still be using for testing purposes. In that case, you might save yourself the trouble and write a raw query right away, which will be just as unportable but more readable.
Update
As of Django 1.10 it is possible to perform this query nicely using expressions, thanks to the addition of TruncHour. Here's a suggestion for how the solution could look:
from collections import Counter
from django.db.models import Count
from django.db.models.functions import TruncHour
counts_by_group = Counter(dict(
    MyModel.objects
    .annotate(object_group=TruncHour('object_date'))
    .values_list('object_group')
    .annotate(count=Count('object_group'))
))  # query with counts_by_group[datetime.datetime(year, month, day, hour)]
It's elegant, efficient and portable. :)
count = len(MyModel.objects.filter(object_date__range=(beginning_of_hour, end_of_hour)))
or
count = MyModel.objects.filter(object_date__range=(beginning_of_hour, end_of_hour)).count()
Assuming I understand what you're asking for, this returns the number of objects that have a date within a specific time range. Set the range to be from the beginning of the hour until the end of the hour and you will get the count of all objects created in that hour. Either count() or len() works, but len() loads every matching object into memory while count() lets the database do the counting, so count() is usually the better choice when only the number is needed. For more information on that check out https://docs.djangoproject.com/en/1.9/ref/models/querysets/#count
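The beginning_of_hour and end_of_hour values are not defined in the snippet above. A minimal sketch of how they might be computed with the standard library follows; the helper name hour_bounds is an assumption. Since Django's range lookup is inclusive at both ends, the end bound stops one microsecond short of the next hour so that an object stamped exactly on the hour is not counted in two adjacent hours.

```python
import datetime


def hour_bounds(moment):
    """Return the first and last representable instants of moment's hour."""
    start = moment.replace(minute=0, second=0, microsecond=0)
    end = start + datetime.timedelta(hours=1) - datetime.timedelta(microseconds=1)
    return start, end


beginning_of_hour, end_of_hour = hour_bounds(datetime.datetime(2016, 5, 4, 13, 27, 9))
print(beginning_of_hour, end_of_hour)
# 2016-05-04 13:00:00 2016-05-04 13:59:59.999999
```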

Python Date Time with JSON API

I have a Python 2.7 API that queries a SQL db and delivers a JSON list of dictionaries that is then used in a bootstrap/Django site.
Dates in the DB are strings in the format '2017-04-20 00:00:00', but sometimes the time of the source data instead has a decimal, which causes trouble with strptime, so I'm removing the seconds by keeping only the first 10 characters of the string.
import datetime
dict_list = response['my_list_of_dicts']
for dt_to_cmpr in dict_list:
    dt_to_cmpr['date_key'] = dt_to_cmpr['date_key'][:10]
Before I can compare date ranges, the dates need to be date time not strings. (Note: For production, I plan to account for exceptions such as null values.)
dt_to_cmpr['date_key'] = datetime.datetime.strptime(dt_to_cmpr['date_key'],
                                                    '%Y-%m-%d')
I want to know things about dictionaries where date_key is roughly no more than 90 days from today. (i.e. the total number in the time frame, or the sum of every dictionary's price_key.)
under_days = datetime.timedelta(days=-1)
over_days = datetime.timedelta(days=91)
now = datetime.datetime.now()
ttl_within_90days = sum(
    1 for d in response['my_list_of_dicts']
    if under_days < (d.get('date_key') - now) < over_days
)
One problem is that now that I've converted my dates, they are not JSON serializable. So, now I have to put them back into a string again:
for dt_to_cmpr in dict_list:
    dt_to_cmpr['date_key'] = dt_to_cmpr['date_key'].strftime("%Y-%m-%d")
I cleaned up the above for simplicity, but that should all work. When it gets to Django, the view is going to convert them all back to date time again for use in a template.
Can I have Python just treat my date strings as date times for the 90 day comparison, but leave them alone? Or, maybe have JSON use the Python date times? That much iteration on every page load is slow, and can't be the best way.
The main problem is the way you're storing the datetimes. You should probably be storing them as actual datetimes in your database, not strings. You can't do proper date queries on string fields. Instead, you have to use the inefficient method of querying all the records and then filtering all of them in Python after the fact. Database data types were created for a reason; use them.
There's no reason to convert datetimes to strings except at the very last moment when you need to format it for json or html, and the only bit of code that should need to do that is the Django app. That means:
Your Django app should almost entirely be using datetimes. It only converts to strings when it needs to render out html or json.
Your API should only use python datetimes.
Your database should only use datetimes as well.
If you don't control the database, the best case is going to be 2 conversions:
string -> datetime when pulling data out of the database.
datetime -> string when serializing to html or json.
If you can fix the database, then you only need to do the 2nd conversion.
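The "convert only at the last moment" advice can be sketched with the default hook of json.dumps, which is called for any object the encoder cannot serialize on its own. The helper name encode_datetime and the sample row are assumptions for illustration.

```python
import datetime
import json


def encode_datetime(value):
    """Turn dates and datetimes into ISO 8601 strings during serialization."""
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    raise TypeError("Cannot serialize %r" % (value,))


# Hypothetical row that keeps a real datetime until the JSON is produced.
rows = [{"date_key": datetime.datetime(2017, 4, 20), "price_key": 10.0}]
payload = json.dumps(rows, default=encode_datetime)
print(payload)
# [{"date_key": "2017-04-20T00:00:00", "price_key": 10.0}]
```

With this in place the dictionaries never need a separate loop that rewrites date_key back into a string before serialization.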

storing python list into mysql and accessing it

How can I store python 'list' values into MySQL and access it later from the same database like a normal list?
I tried storing the list as a varchar type and it did store it. However, while accessing the data from MySQL I couldn't access the same stored value as a list; instead it acts as a string. So, accessing the list with an index was no longer possible. Is it perhaps easier to store some data in the form of the set datatype? I see the MySQL datatype 'set' but I'm unable to use it from Python. When I try to store a set from Python into MySQL, it throws the following error: 'MySQLConverter' object has no attribute '_set_to_mysql'. Any help is appreciated.
P.S. I have to store co-ordinate of an image within the list along with the image number. So, it is going to be in the form [1,157,421]
Use a serialization library like json:
import json
l1 = [1,157,421]
s = json.dumps(l1)
l2 = json.loads(s)
Are you using an ORM like SQLAlchemy?
Anyway, to answer your question directly, you can use json or pickle to convert your list to a string and store that. Then to get it back, you can parse it (as JSON or a pickle) and get the list back.
However, if your list is always a 3 point coordinate, I'd recommend making separate x, y, and z columns in your table. You could easily write functions to store a list in the correct columns and convert the columns to a list, if you need that.
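The separate-columns suggestion can be sketched as below. The standard-library sqlite3 module is used only as a stand-in so the example runs without a MySQL server. The table name, column names, and helper names are all assumptions, with the columns following the question's description of an image number plus a coordinate pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute("CREATE TABLE points (image_no INTEGER, x INTEGER, y INTEGER)")


def store_point(values):
    """Store a list such as [1, 157, 421] across three columns."""
    conn.execute("INSERT INTO points (image_no, x, y) VALUES (?, ?, ?)", values)


def load_points():
    """Read every row back as a plain Python list."""
    rows = conn.execute("SELECT image_no, x, y FROM points ORDER BY image_no")
    return [list(row) for row in rows]


store_point([1, 157, 421])
print(load_points())
# [[1, 157, 421]]
```

Each value keeps its integer type this way, so indexing works as it would on the original list and the columns can be filtered in SQL. MySQL drivers generally use a different placeholder style than the ? shown here.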
