Best way to speed up PyMongo loop - python

I'm using a MongoDB database to store product data. I currently run a for loop over roughly 50 IDs; on each iteration I search for the ID, add it if it doesn't exist, and if it does exist and another column has a specific value, I run a function.
for id in ids:
    value = db.find_one({"value": id})
    if value:
        # It checks some other columns here using both the ID and the returned document
        ...
    else:
        # It adds the ID and some other information to the database
        ...
The problem is that this is incredibly inefficient. When searching for other ways to do this, all the results show how to get a list of results, but I'm not sure how that would fit my scenario, since I'm running functions and checks with each result and ID.
Thank you!

You can improve this by doing only one find request, and then adding all the missing documents to the database in a second step, perhaps with insert_many.
values = db.find({"value": {"$in": ids}})
for value in values:
    # It checks some other columns here using both the ID and the returned document
    ids.remove(value["value"])

# Do all your inserts
# with a loop
for id in ids:
    db.insert_one({"value": id})  # plus the other information you store
# or with insert_many
db.insert_many([{"value": id} for id in ids])  # plus the other information you store
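As a minimal alternative sketch, the same idea can be expressed as a single $in query plus one bulk_write with upserts, so all inserts go over the wire in one round trip. The database/collection names and the $setOnInsert fields below are placeholders, not from the question:
from pymongo import MongoClient, UpdateOne

client = MongoClient()
db = client["mydb"]["products"]  # hypothetical names; `db` is the collection, as in the question

# One query for everything that already exists, keyed by "value"
existing = {doc["value"]: doc for doc in db.find({"value": {"$in": ids}})}

ops = []
for id in ids:
    doc = existing.get(id)
    if doc:
        pass  # run your per-document checks here using `doc`
    else:
        # Insert only if still missing; $setOnInsert keeps this idempotent
        ops.append(UpdateOne({"value": id},
                             {"$setOnInsert": {"value": id}},  # add your other fields here
                             upsert=True))

if ops:
    db.bulk_write(ops, ordered=False)  # one round trip for all the inserts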

Related

API Get method to get all tweets with hashtag count greater than within MongoDB in JSON format

I have a MongoDB database that contains a number of tweets. I want my API to return, as a JSON list, all the tweets whose hashtag count is greater than the number specified by the user in the URL (e.g. http://localhost:5000/tweets?morethan=5, which is 5 in this case).
The hashtags are contained inside the entities column in the database, along with other columns such as user_mentions, urls, symbols and media. Here is the code I've written so far, but it doesn't return anything.
#!flask/bin/python
from flask import Flask, request
from pymongo import MongoClient

app = Flask(__name__)
client = MongoClient()  # adjust connection details as needed

@app.route('/tweets', methods=['GET'])
def get_tweets():
    # Connect to database and pull back collections
    db = client['mongo']
    collection = db['collection']
    parameter = request.args.get('morethan')
    if parameter:
        gt_parameter = int(parameter) + 1  # question said greater than, not greater or equal
        key_im_looking_for = "entities.hashtags.{}".format(gt_parameter)  # create the namespace
        cursor = collection.find({key_im_looking_for: {"$exists": True}})
        # ... build and return the JSON response from `cursor`
EDIT: IT WORKS!
The line in my original code that was causing the problem was this one:
cursor = collection.find({"entities": {"hashtags": parameter}})
This answer explains why it is impossible to directly perform what you ask.
mongodb query: $size with $gt returns always 0
That answer also describes potential (but poor) ideas to get around it.
The best suggestion is to modify all your documents, put a "num_hashtags" key in them, index that, and query against it.
Using the Twitter JSON API you could update all your documents and put the num_hashtags key in the entities document.
Alternatively, you could solve your immediate problem with a very slow full collection scan on every query, abusing MongoDB dot notation to check whether the array element one past your parameter exists.
gt_parameter = int(parameter) + 1  # question said greater than, not greater or equal
key_im_looking_for = "entities.hashtags.{}".format(gt_parameter)  # create the namespace
# py2.7 => key_im_looking_for = "entities.hashtags.%s" % (gt_parameter)
# in this example it would be "entities.hashtags.6"
cursor = collection.find({key_im_looking_for: {"$exists": True}})
The best answer (and the key reason to use a NoSQL database in the first place) is that you should modify your data to suit your retrieval. If possible, you should perform an in-place update adding the num_hashtags key.
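For reference, here is a minimal sketch of that kind of in-place update, assuming PyMongo and MongoDB 4.2+ (which accepts an aggregation pipeline in update_many); the database and collection names are just placeholders:
from pymongo import MongoClient, ASCENDING

collection = MongoClient()["mongo"]["collection"]  # placeholder names

# Compute and store num_hashtags server-side for every document (MongoDB 4.2+)
collection.update_many(
    {},
    [{"$set": {"num_hashtags": {"$size": {"$ifNull": ["$entities.hashtags", []]}}}}]
)
collection.create_index([("num_hashtags", ASCENDING)])

# The endpoint then becomes a simple indexed range query:
# collection.find({"num_hashtags": {"$gt": int(parameter)}})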

Checking millions of MySQL rows in Python

I have a Python program that downloads a text file of over 100 million unique values and applies the following logic to each value:
If the value already exists in the table, update the entry's last_seen date (SELECT id WHERE <col> = <value>;)
If the value does not exist in the table, insert the value into the table
I queue up entries that need to be added and then insert them in a bulk statement after a few hundred have been gathered.
Currently, the program takes over 24 hours to run. I've created an index on the column that stores the values.
I'm currently using MySQLdb.
It seems that checking for value existence is taking the lion's share of the runtime. What avenues can I pursue to make this faster?
Thank you.
You could try loading the values into a set, so you can do the lookups without hitting the database every time, assuming the table is not being updated by anyone else and you have sufficient memory.
# Let's assume you have a function runquery that executes the
# provided statement and returns a collection of values as strings.
existing_values = set(runquery('SELECT DISTINCT value FROM table'))

with open('big_file.txt') as f:
    inserts = []
    updates = []
    for line in f:
        value = line.strip()
        if value in existing_values:
            updates.append(value)
        else:
            existing_values.add(value)
            inserts.append(value)
        if len(inserts) > THRESHOLD or len(updates) > THRESHOLD:
            # Do the bulk inserts/updates here and clear both lists
            ...
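For completeness, a hypothetical flush step for that threshold branch, assuming a MySQLdb connection `conn` and a table `values_table(value, last_seen)` with a unique index on `value` (all of these names are illustrative):
def flush(conn, inserts, updates):
    """Write the queued values in two bulk statements, then clear the queues."""
    cur = conn.cursor()
    if inserts:
        cur.executemany(
            "INSERT INTO values_table (value, last_seen) VALUES (%s, NOW())",
            [(v,) for v in inserts])
    if updates:
        cur.executemany(
            "UPDATE values_table SET last_seen = NOW() WHERE value = %s",
            [(v,) for v in updates])
    conn.commit()
    inserts.clear()  # both lists are cleared in place
    updates.clear()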

Optimizing an Update statement with many records in SQLAlchemy

I am trying to update many records at a time using SQLAlchemy, but am finding it to be very slow. Is there an optimal way to perform this?
For some reference, I am performing an update on 40,000 records and it took about 1 hour.
Below is the code I am using. The table_name refers to the table which is loaded, the column is the single column which is to be updated, and the pairs refer to the primary key and new value for the column.
def update_records(table_name, column, pairs):
    table = Table(table_name, db.MetaData, autoload=True,
                  autoload_with=db.engine)
    conn = db.engine.connect()

    values = []
    for id, value in pairs:
        values.append({'row_id': id, 'match_value': str(value)})

    stmt = table.update().where(table.c.id == bindparam('row_id')).values({column: bindparam('match_value')})
    conn.execute(stmt, values)
Passing a list of arguments to execute() essentially issues 40k individual UPDATE statements, which is going to have a lot of overhead. The solution for this is to increase the number of rows per query. For MySQL, this means inserting into a temp table and then doing an update:
# assuming the temp table has already been created
conn.execute(temp_table.insert().values(values))
conn.execute(table.update()
             .values({column: temp_table.c.match_value})
             .where(table.c.id == temp_table.c.row_id))
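The snippet above assumes the temp table already exists; a hypothetical sketch of creating it with SQLAlchemy follows (the column names mirror the bind parameters used above, and the table name is arbitrary):
from sqlalchemy import Table, Column, Integer, String, MetaData

metadata = MetaData()
temp_table = Table(
    "temp_update_values", metadata,
    Column("row_id", Integer, primary_key=True),
    Column("match_value", String(255)),
    prefixes=["TEMPORARY"],  # emits CREATE TEMPORARY TABLE on MySQL
)
temp_table.create(bind=conn)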
Or, alternatively, you can use INSERT ... ON DUPLICATE KEY UPDATE to avoid creating the temp table, but SQLAlchemy does not support that natively, so you'll need to use a custom compiled construct for that (e.g. this gist).
According to the fast-execution-helpers documentation, batched update statements can be issued as one statement. In my experiments, this trick reduced update and deletion time from 30 minutes to 1 minute.
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='values_plus_batch',
    executemany_values_page_size=5000,
    executemany_batch_page_size=5000)

DynamoDB Querying in Python (Count with GroupBy)

This may be trivial, but I loaded a local DynamoDB instance with 30GB worth of Twitter data that I aggregated.
The primary key is id (tweet_id from the Tweet JSON), and I also store the date/text/username/geocode.
I basically am interested in mentions of two topics (let's say "Bees" and "Booze"). I want to get a count of each of those by state by day.
So by the end, I should know for each state, how many times each was mentioned on a given day. And I guess it'd be nice to export that as a CSV or something for later analysis.
Some issues I had with doing this...
First, the geocode info is a tuple of [latitude, longitude] so for each entry, I need to map that to a state. That I can do.
Second, is the most efficient way to do this to go through each entry, manually check whether it mentions either keyword, and then keep a dictionary for each keyword that maps date/location to a count?
EDIT:
Since it took me 20 hours to load all the data into my table, I don't want to delete and re-create it. Perhaps I should create a global secondary index (?) and use that to search other fields in a query? That way I don't have to scan everything. Is that the right track?
EDIT 2:
Well, since the table is on my computer locally, I should be OK with just using expensive operations like a Scan, right?
So if I did something like this:
query = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='id, text, date, geo',
    Limit=100)
And did one scan for each keyword, then I would be able to go through the resulting filtered list and get a count of mentions of each topic for each state on a given day, right?
EDIT3:
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    Limit=100)
# do something with this set

while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('text').contains("Booze"),
        Limit=100,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    # do something with each batch of up to 100 entries
So something like that, for both keywords. That way I'll be able to go through the resulting filtered set and do what I want (in this case, figure out the location and day and create a final dataset with that info). Right?
EDIT 4
If I add:
ProjectionExpression='date, location, user, text'
into the scan request, I get an error saying "botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the Scan operation: Invalid ProjectionExpression: Attribute name is a reserved keyword; reserved keyword: location". How do I fix that?
NVM I got it. Answer is to look into ExpressionAttributeNames (see: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ExpressionPlaceholders.html)
Yes, scanning the table for "Booze" and counting the items in the result should give you the total count. Please note that you need to keep scanning, passing ExclusiveStartKey each time, until LastEvaluatedKey is no longer present in the response (see the Scan documentation).
EDIT:
Yes, the code looks good. One thing to note: the result set won't always contain 100 items. Please refer to the Limit definition below (it is not the same as SQL's LIMIT).
Limit — (Integer) The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.
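Putting the pieces together, here is a hedged sketch of that paginated scan pattern, tallying mentions per (state, day); the table name, the local endpoint URL, and the state_from_geo helper are all hypothetical, and "#d" is used because date is a reserved keyword (as noted in EDIT 4):
from collections import Counter
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb", endpoint_url="http://localhost:8000")  # local instance
table = dynamodb.Table("tweets")  # hypothetical table name

counts = Counter()  # (state, day) -> number of mentions
scan_kwargs = {
    "FilterExpression": Attr("text").contains("Booze"),
    "ProjectionExpression": "#d, geo",
    "ExpressionAttributeNames": {"#d": "date"},  # 'date' is a reserved keyword
}
while True:
    response = table.scan(**scan_kwargs)
    for item in response["Items"]:
        state = state_from_geo(item["geo"])  # hypothetical [lat, lon] -> state mapping
        counts[(state, item["date"])] += 1
    if "LastEvaluatedKey" not in response:
        break
    scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]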

How to delete a specific row in wxListCtrl after sorting?

I'm using wx.ListCtrl for live reporting in my app: the status updates continuously, with a new row inserted when a task starts and the related row deleted when the task ends. Since the list gets sorted every now and then, you cannot simply delete rows by the row id you started with. You can assign a unique id using SetItemData, so you know exactly which row should be deleted when a task is done, but there does NOT seem to be any method for deleting a row by that unique id, nor even a method to get the row id from the unique id; the only method I found is GetItemData, which returns the unique id for a given row.
So the only way that came to my mind is to iterate over all rows, checking each unique id and comparing it against the given id; if it matches, delete that row. But this sounds way too clumsy, so is there a better way to delete a specific row after sorting?
If you can upgrade your wxPython to the 2.9.x series then there is a simple answer: use a DataViewListCtrl, in which the display reflects your data rather than actually containing it. As a result, if your data model changes because a data item (line) is deleted, your display will lose that line regardless of how the display is sorted. If you can't upgrade, then I suspect you will have to tag lines with a unique id and then find them for deletion.
If you need to do the latter, I would suggest having a (possibly hidden) RowID column containing your unique ID, plus either a dictionary that you maintain with the subprocess PID as the key and the unique ID as the value, or a function that maps the process id to the unique row ID.
Obviously adding the new row on process creation is not a problem; just remember to include your unique ID. When your process ends, get the row ID and do something like:
def FindRow(self, ID):
    """Find the row that matches the ID."""
    match = None
    for index in range(self.TheGrid.GetNumberRows()):
        if self.TheGrid.GetCellValue(index, IdColNo) == ID:
            match = index
            break
    return match

# In your process end handler
Line = self.FindRow(GetID(pid))
if Line is not None:
    self.TheGrid.DeleteRows(Line, 1)
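For a plain wx.ListCtrl, here is a minimal sketch of the iterate-and-compare approach mentioned in the question, assuming the unique ids were stored with SetItemData when the rows were inserted (the function name is illustrative):
def delete_row_by_id(list_ctrl, unique_id):
    """Delete the row whose item data matches unique_id; return True if found."""
    for row in range(list_ctrl.GetItemCount()):
        if list_ctrl.GetItemData(row) == unique_id:
            list_ctrl.DeleteItem(row)
            return True
    return False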
I ended up using ObjectListView to do the job. Basically you build an index for your objects in the list, and then you are able to operate on any row you want. It's way more convenient than wx.ListCtrl.
