I'm trying to sort a collection, and then print the first 5 docs to make sure it has worked:
#!/usr/bin/env python
import pymongo
# Establish a connection to the mongo database.
connection = pymongo.MongoClient('mongodb://localhost')
# Get a handle to the students database.
db = connection.school
students = db.students
def order_homework():
    projection = {'scores': {'$elemMatch': {'type': 'homework'}}}
    cursor = students.find({}, projection)
    # Sort each item's scores.
    for each in cursor:
        each['scores'].sort()
    # Sort by _id.
    cursor = sorted(cursor, key=lambda x: x['_id'])
    # Print the first five items.
    count = 0
    for each in cursor:
        print(each)
        count += 1
        if count == 5:
            break
if __name__ == '__main__':
    order_homework()
When I run this, nothing prints.
If I take out the sorts, then it prints.
Each sort works when run individually.
Please teach me what I'm doing wrong / educate me.
You're trying to treat the cursor like a list, which you can iterate several times from the start. PyMongo cursors don't act that way - once you've iterated it in for each in cursor, the cursor is completed and you can't iterate it again.
You can turn the cursor into a list like:
data = list(students.find({}, projection))
For efficiency, get results pre-sorted from MongoDB:
list(students.find({}, projection).sort('_id'))
This sends the sort criterion to the server, which then streams the results back to you pre-sorted, instead of requiring you to sort them client-side. You can then delete the "Sort by _id" step from your code.
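The one-pass behaviour described above can be reproduced without a database. The following is a minimal sketch in which a plain generator (the hypothetical `fake_cursor`, not part of PyMongo) stands in for the cursor; the documents in it are made up for illustration.

```python
def fake_cursor():
    """Stand-in for a database cursor: yields each document once, then is exhausted."""
    yield {'_id': 2, 'scores': [{'type': 'homework', 'score': 70.0}]}
    yield {'_id': 1, 'scores': [{'type': 'homework', 'score': 55.5}]}

cursor = fake_cursor()
first_pass = [doc['_id'] for doc in cursor]   # consumes everything
second_pass = [doc['_id'] for doc in cursor]  # nothing left to iterate

# Materialising the results into a list allows repeated iteration and sorting.
docs = sorted(fake_cursor(), key=lambda doc: doc['_id'])
sorted_ids = [doc['_id'] for doc in docs]
```

Here `second_pass` ends up empty, which is the same effect as the second loop in the question printing nothing.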
Related
This is the code :
import mysql.connector as mariadb
import time
import random
mariadb_connection = mariadb.connect(user='root', password='xxx', database='UniqueCode',
port='3306', host='192.168.xx.xx')
cursor = mariadb_connection.cursor()
FullChar = 'CFLMNPRTVWXYK123456789' # i just need that char
total = 5000
count = 10
limmit = 0
count = int(count)
entries = []
uq_id = 0
total_all = 0
def inputDatabase(data):
    try:
        maria_insert_query = "INSERT INTO SN_UNIQUE_CODE(unique_code) VALUES (%s)"
        cursor.executemany(maria_insert_query, data)
        mariadb_connection.commit()
        print("Commiting " + str(total) + " entries..")
    except Exception:
        maria_alter_query = "ALTER TABLE UniqueCode.SN_UNIQUE_CODE AUTO_INCREMENT=0"
        cursor.execute(maria_alter_query)
        print("UniqueCode Increment Altered")
while (0 < 1) :
    for i in range(total):
        unique_code = ''.join(random.sample(FullChar, count))
        entry = (unique_code)
        entries.append(entry)
    inputDatabase(entries)
    #print(entries)
    entries.clear()
    time.sleep(0.1)
Output:
Id unique_code
1 N5LXK2V7CT
2 7C4W3Y8219
3 XR9M6V31K7
The code above runs well and generates the codes quickly. The problem appears when the generated unique_code values are inserted into MariaDB: to avoid duplicate data, I added a unique index on the unique_code column.
The more rows the table holds, the more checking each new unique code requires, so inserting data takes longer and longer.
Given that, how can I insert 1 billion codes into the database in a short time?
Note: the process slows down once the table holds more than 150 million unique codes.
Thanks a lot
The quick way
If you want to insert many records into the database, you can bulk-insert them as you do now.
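I would recommend you disable the keys on the table before inserting and skip the check for uniqueness. Otherwise you will have a bad time, as @CryptoFool mentioned. Note that DISABLE KEYS only disables non-unique indexes, and only on storage engines that support it (such as MyISAM and Aria), so the unique index itself is still checked.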
ALTER TABLE SN_UNIQUE_CODE DISABLE KEYS;
<run code>
ALTER TABLE SN_UNIQUE_CODE ENABLE KEYS;
If I were you, then I would try to play around with the maximum you can insert at once. Try changing max_allowed_packet variable in MariaDB if necessary.
The table
It seems like your unique_code could be a natural key. You could therefore remove the auto-increment column; it won't improve performance much, but it is a start.
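To experiment with how many rows to send per insert, it helps to separate code generation from batching. The sketch below is one way to do that; `generate_codes` and `batches` are hypothetical helper names, and the batch size of 5 is only for illustration. Each code is wrapped in a one-element tuple, since `executemany` expects one parameter sequence per row.

```python
import random
from itertools import islice

FULL_CHAR = 'CFLMNPRTVWXYK123456789'

def generate_codes(n, length=10):
    """Yield n parameter tuples, each holding one random code."""
    for _ in range(n):
        yield (''.join(random.sample(FULL_CHAR, length)),)

def batches(iterable, size):
    """Yield lists of at most `size` items taken from the iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Each chunk is what would be passed to cursor.executemany(query, chunk).
batch_sizes = [len(chunk) for chunk in batches(generate_codes(12), 5)]
```

Changing the `size` argument is then the only thing needed to try different bulk-insert sizes.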
I have a function that is called to calculate a response every time the user inputs something. The function gets the response from a database. What I don't understand is why I have to redefine my variable (I have called it intents_db), which contains all the data from the database, each time the function is called. I have tried putting it outside the function, but then my program only works the first time and returns an empty answer the second time the user inputs something.
def response(sentence, user_id='1'):
results = classify_intent(sentence)
intents_db = c.execute("SELECT row_num, responses, tag, responses, intent_type, response_type, context_set,\
context_filter FROM intents")
if results:
# loop as long as there are matches to process
while results:
if results[0][1] > answer_threshold:
for i in intents_db:
# print('tag:', i[2])
if i[2] == results[0][0]:
print(i[6])
if i[6] != 'N/A':
if show_details:
print('context: ', i[6])
context[user_id] = i[6]
responses = i[1].split('&/&')
print(random.choice(responses))
if i[7] == 'N/A' in i or \
(user_id in context and i[7] in i and i[7] == context[
user_id]):
# a random response from the intent
responses = i[1].split('&/&')
print(random.choice(responses))
print(i[4], i[5])
print(results[0][1])
elif results[0][1] <= answer_threshold:
print(results[0][1])
for i in intents_db:
if i[2] == 'unknown':
# a random response from the intent
responses = i[1].split('&/&')
print(random.choice(responses))
initial_comm_output = random.choice(responses)
return initial_comm_output
else:
initial_comm_output = "Something unexpected happened when calculating response. Please restart me"
return initial_comm_output
results.pop(0)
return results
Also, I started getting into databases and sqlite3 because I want to make a massive database long term. Therefore it also seems inefficient that I have to load the whole database at all. Is there some way I can only load the row of data I need? I got a row_number column in my database, so if it was somehow possible to say like:
"SELECT WHERE row_num=2 FROM intents"
that would be great, but I can't figure out how to do it.
cursor.execute() returns an iterator, and you can only loop over it once.
If you want to reuse it, turn it into a list:
intents_db = list(c.execute("..."))
Therefore it also seems inefficient that I have to load the whole database at all. Is there some way I can only load the row of data I need? I got a row_number column in my database, so if it was somehow possible to say like: "SELECT WHERE row_num=2 FROM intents" that would be great, but I can't figure out how to do it.
You nearly got it: it is
intents_db = c.execute("""SELECT row_num, responses, tag, responses, intent_type,
                                 response_type, context_set, context_filter
                          FROM intents WHERE row_num=2""")
But don't make the mistake other database beginners make and put a variable from your program directly into that string. This makes the program prone to SQL injection.
Rather, do
row_num = 2
intents_db = c.execute("""SELECT row_num, responses, tag, responses, intent_type,
                                 response_type, context_set, context_filter
                          FROM intents WHERE row_num=?""", (row_num,))
Of course, you can also set conditions for other fields.
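As a self-contained illustration of the parameterized query above, here is a minimal sketch against an in-memory SQLite database. The table has fewer columns than the one in the question and the rows are invented; `fetch_intent` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE intents (row_num INTEGER, tag TEXT, responses TEXT)")
c.executemany(
    "INSERT INTO intents VALUES (?, ?, ?)",
    [(1, 'greeting', 'Hi&/&Hello'),
     (2, 'goodbye', 'Bye&/&See you'),
     (3, 'unknown', 'Sorry?')],
)

def fetch_intent(cursor, row_num):
    """Load only the requested row; the value is bound via the ? placeholder."""
    cursor.execute(
        "SELECT row_num, tag, responses FROM intents WHERE row_num = ?",
        (row_num,),
    )
    return cursor.fetchone()  # a tuple, or None if no row matches

row = fetch_intent(c, 2)
missing = fetch_intent(c, 99)
```

Because the value is passed separately from the SQL text, it is never interpreted as SQL.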
I have a function that populates a database table using python and sqlalchemy. The function is running fairly slowly right now, taking around 17 minutes. I think the main problem is I am looping through two large sets of data to build the new table. I have included the record count in the code below.
How can I speed this up? Should I try to convert the nested for loop into one big sqlalchemy query? I profiled this function with pycharm but am not sure I fully understand the results.
def populate(self):
    """Core function to populate positions."""
    # get raw annotations with tag Org
    # returns 11,659 records
    organizations = model.session.query(model.Annotation) \
        .filter(model.Annotation.tag == 'Org')\
        .filter(model.Annotation.organization_id.isnot(None)).all()
    # get raw annotations with tags Support or Oppose
    # returns 2,947 records
    annotations = model.session.query(model.Annotation) \
        .filter((model.Annotation.tag == 'Support') | (model.Annotation.tag == 'Oppose')).all()
    for org in organizations:
        for anno in annotations:
            # Org overlaps with Support or Oppose tag
            # start and end columns are integers
            if org.start >= anno.start and org.end <= anno.end:
                position = model.Position()
                # set to de-duplicated organization
                position.organization_id = org.organization_id
                position.disposition = anno.tag
                # look up bill_id from document_bill table
                document = model.session.query(model.document_bill)\
                    .filter_by(document_id=anno.document_id).first()
                position.bill_id = document.bill_id
                position.document_id = anno.document_id
                model.session.add(position)
                logging.info('org: {}, disposition: {}, bill: {}'.format(
                    position.organization_id, position.disposition, position.bill_id)
                )
                continue
    logging.info('committing to database')
    model.session.commit()
My bets, in order of descending probability:
1. Autocommit is ON, so you're waiting for disk.
2. The query inside the loop "document = model.session.query(model.document_bill)...." is slow (use EXPLAIN ANALYZE).
3. Most of the time is actually spent printing logs to the terminal in the inner loop (you should profile).
4. model.session.add(position) is slow (no idea what that does).
5. (and this one should really be first) Could a SQL query like INSERT INTO SELECT do this in a couple tens of milliseconds? If so, why make a loop in the application?
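To make the INSERT INTO ... SELECT idea concrete, here is a rough sketch using an in-memory SQLite database. The schema is an assumption reconstructed from the question's loop: the table and column names (including start_pos and end_pos in place of start and end) are hypothetical, and it assumes one bill per document. Like the original loop, it compares offsets only and does not check that both annotations belong to the same document.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE annotation (
    id INTEGER PRIMARY KEY, tag TEXT, organization_id INTEGER,
    document_id INTEGER, start_pos INTEGER, end_pos INTEGER);
CREATE TABLE document_bill (document_id INTEGER, bill_id INTEGER);
CREATE TABLE position (
    organization_id INTEGER, disposition TEXT,
    bill_id INTEGER, document_id INTEGER);
""")
conn.executemany(
    "INSERT INTO annotation (tag, organization_id, document_id, start_pos, end_pos) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ('Org', 7, 1, 10, 20),        # lies inside the Support span below
        ('Org', 8, 1, 200, 210),      # lies outside every span
        ('Org', None, 1, 12, 18),     # no de-duplicated organization: skipped
        ('Support', None, 1, 0, 50),
        ('Oppose', None, 2, 300, 400),
    ],
)
conn.executemany("INSERT INTO document_bill VALUES (?, ?)", [(1, 101), (2, 202)])

# One statement replaces the nested Python loops and the per-row lookup.
conn.execute("""
INSERT INTO position (organization_id, disposition, bill_id, document_id)
SELECT org.organization_id, anno.tag, doc_bill.bill_id, anno.document_id
FROM annotation AS org
JOIN annotation AS anno
  ON org.start_pos >= anno.start_pos AND org.end_pos <= anno.end_pos
JOIN document_bill AS doc_bill
  ON doc_bill.document_id = anno.document_id
WHERE org.tag = 'Org'
  AND org.organization_id IS NOT NULL
  AND anno.tag IN ('Support', 'Oppose')
""")
conn.commit()
positions = conn.execute("SELECT * FROM position").fetchall()
```

With SQLAlchemy the same statement could be sent through the session as raw SQL; whether it is actually faster on the real data is something only profiling can confirm.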
I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
Here is how I am currently doing this:
# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata:  # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1
# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves
def determine(tup):
    if self.message in tup[2]: return True  # matched message
    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time-tup[1]) <= tdiff:
                return True
    return False
cdata[:] = [tup for tup in cdata if determine(tup)]
This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
This data came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python, thinking that would be faster. Maybe I was wrong.
Can anyone give me some suggestions on speeding this up?
Updating my post to show what I did in SQL as was suggested.
What I did in SQL was pretty straightforward. The first query was something like:
SELECT tool, date_time, message
FROM event_log
WHERE message LIKE '%foo%'
AND other selection criteria
That was fast enough, but it may return 20 or 30 thousand rows. So then I looped through the result set, and for each row ran a query like this (where dt and t are the date_time and tool from a row from the above select):
SELECT date_time, message
FROM event_log
WHERE tool = t
AND ABS(TIMESTAMPDIFF(SECOND, date_time, dt)) <= timediff
That was taking about an hour.
I also tried doing in one nested query where the inner query selected the rows from my first query, and the outer query selected the time diff rows. That took even longer.
So now I am selecting without the message LIKE '%foo%' clause and I am getting back 600,000 rows and trying to pull out the rows I want from python.
The way to optimize the SQL is to do it all in one query, instead of iterating over 20K rows and doing another query for each one.
Usually this means you need to add a JOIN, or occasionally a sub-query. And yes, you can JOIN a table to itself, as long as you rename one or both copies. So, something like this:
SELECT el2.date_time, el2.message
FROM event_log AS el1 JOIN event_log AS el2
WHERE el1.message LIKE '%foo%'
AND other selection criteria
AND el2.tool = el1.tool
AND ABS(TIMESTAMPDIFF(SECOND, el2.date_time, el1.date_time)) <= timediff
Now, this probably won't be fast enough out of the box, so there are two steps to improve it.
First, look for any columns that obviously need to be indexed. Clearly tool and date_time need simple indices. message may benefit from either a simple index or, if your database offers one, a full-text index, but given that the initial query was fast enough, you probably don't need to worry about it.
Occasionally, that's sufficient. But usually, you can't guess everything correctly. And there may also be a need to rearrange the order of the queries, etc. So you're going to want to EXPLAIN the query, and look through the steps the DB engine is taking, and see where it's doing a slow iterative lookup when it could be doing a fast index lookup, or where it's iterating over a large collection before a small collection.
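The self-join can be tried out on a small scale. The sketch below uses an in-memory SQLite database rather than MySQL, so date_time is stored as plain integer seconds and the difference is computed with subtraction instead of TIMESTAMPDIFF; the rows and the index name are invented. DISTINCT is an addition here, so that a row close to several matching messages is returned only once.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE event_log (tool TEXT, date_time INTEGER, message TEXT)")
# Composite index covering the join column and the time column.
conn.execute("CREATE INDEX idx_tool_time ON event_log (tool, date_time)")
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?, ?)",
    [
        ('A', 100, 'foo happened'),
        ('A', 105, 'nearby event'),
        ('A', 500, 'far away event'),
        ('B', 102, 'other tool event'),
    ],
)

timediff = 10  # seconds
rows = conn.execute(
    """
    SELECT DISTINCT el2.tool, el2.date_time, el2.message
    FROM event_log AS el1
    JOIN event_log AS el2 ON el2.tool = el1.tool
    WHERE el1.message LIKE '%foo%'
      AND ABS(el2.date_time - el1.date_time) <= ?
    ORDER BY el2.date_time
    """,
    (timediff,),
).fetchall()
```

The result contains the matching message itself plus the one event of the same tool within 10 seconds of it.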
For tabular data, you can't go past the Python pandas library, which contains highly optimised code for queries like this.
I fixed this by changing my code as follows:
-first I made messageTimes a dict of lists keyed by the tool:
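from collections import defaultdict
import bisect
# a dict of lists; each list is sorted only if cdata is already ordered by time
messageTimes = defaultdict(list)
for row in cdata:  # tool, time, module, message
    if self.message in row[3]:
        messageTimes[row[0]].append(row[1])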
-then in the determine function I used bisect:
def determine(tup):
    if self.message in tup[3]: return True  # matched message
    times = messageTimes[tup[0]]
    le = bisect.bisect_right(times, tup[1])
    ge = bisect.bisect_left(times, tup[1])
    return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
With these changes the code that was taking over 2 hours took under 20 minutes, and even better, a query that was taking 40 minutes took 8 seconds!
I made 2 more changes and now that 20 minute query is taking 3 minutes:
found = defaultdict(int)
def determine(tup):
    if self.message in tup[3]: return True  # matched message
    times = messageTimes[tup[0]]
    # start the search where the previous one for this tool ended
    # (only valid if cdata is processed in ascending time order)
    idx = found[tup[0]]
    le = bisect.bisect_right(times, tup[1], idx)
    found[tup[0]] = le
    return (le and tup[1]-times[le-1] <= tdiff) or (le != len(times) and times[le]-tup[1] <= tdiff)
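For readers who want to try the bisect approach in isolation, here is a small standalone sketch. It uses three-field rows (tool, time, message) with integer times, and `filter_near_matches` is a hypothetical function name. Unlike the snippets above, it sorts each per-tool list explicitly, so it does not rely on the input being ordered by time.

```python
import bisect
from collections import defaultdict

def filter_near_matches(cdata, message, tdiff):
    """Keep rows containing `message`, plus rows of the same tool within tdiff of one."""
    match_times = defaultdict(list)
    for tool, when, text in cdata:
        if message in text:
            match_times[tool].append(when)
    for times in match_times.values():
        times.sort()  # bisect requires sorted input

    def keep(row):
        tool, when, text = row
        if message in text:
            return True
        times = match_times.get(tool, [])
        i = bisect.bisect_left(times, when)  # index of first match time >= when
        before_ok = i > 0 and when - times[i - 1] <= tdiff
        after_ok = i < len(times) and times[i] - when <= tdiff
        return before_ok or after_ok

    return [row for row in cdata if keep(row)]

cdata = [
    ('A', 100, 'foo happened'),
    ('A', 104, 'close after'),
    ('A', 97, 'close before'),
    ('A', 200, 'too late'),
    ('B', 101, 'different tool'),
]
kept = filter_near_matches(cdata, 'foo', tdiff=5)
```

Only the nearest match before and the nearest match after each row need to be checked, which is why a binary search replaces the full scan over all matched messages.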
I'm creating a Python program that connects to MySQL.
I need to check if a table contains the number 1 to show that it has connected successfully. This is my code thus far:
xcnx.execute('CREATE TABLE settings(status INT(1) NOT NULL)')
xcnx.execute('INSERT INTO settings(status) VALUES(1)')
cnx.commit()
sqlq = "SELECT * FROM settings WHERE status = '1'"
xcnx.execute(sqlq)
results = xcnx.fetchall()
if results =='1':
    print 'yep its connected'
else:
    print 'nope not connected'
What have I missed? I am an SQL noob, thanks guys.
I believe the most efficient "does it exist" query is just to do a count:
sqlq = "SELECT COUNT(1) FROM settings WHERE status = '1'"
xcnx.execute(sqlq)
if xcnx.fetchone()[0]:
    pass  # exists
Instead of asking the database to return the matching records, you are asking it for a single number: the count of matching rows, which is 0 when nothing matches. This is much more efficient than returning the actual records and counting them client side, because it saves serialization and deserialization on both sides, as well as the data transfer.
In [22]: c.execute("select count(1) from settings where status = 1")
Out[22]: 1L # rows
In [23]: c.fetchone()[0]
Out[23]: 1L # count found a match
In [24]: c.execute("select count(1) from settings where status = 2")
Out[24]: 1L # rows
In [25]: c.fetchone()[0]
Out[25]: 0L # count did not find a match
count(*) is going to be the same as count(1). In your case because you are creating a new table, it is going to show 1 result. If you have 10,000 matches it would be 10000. But all you care about in your test is whether it is NOT 0, so you can perform a bool truth test.
Update
Actually, it is even faster to just use the rowcount, and not even fetch results:
In [15]: if c.execute("select (1) from settings where status = 1 limit 1"):
   ....:     print True
True
In [16]: if c.execute("select (1) from settings where status = 10 limit 1"):
   ....:     print True
In [17]:
This is also how django's ORM does a queryObject.exists().
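The session above relies on the MySQLdb cursor's execute() returning the number of rows. Other drivers behave differently, so a more portable existence check fetches a single row instead. The following sketch uses an in-memory SQLite database as a stand-in for MySQL; `status_exists` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE settings (status INTEGER NOT NULL)")
cur.execute("INSERT INTO settings (status) VALUES (1)")
conn.commit()

def status_exists(cursor, status):
    """Return True if at least one row has the given status."""
    # LIMIT 1 lets the database stop at the first matching row.
    cursor.execute("SELECT 1 FROM settings WHERE status = ? LIMIT 1", (status,))
    return cursor.fetchone() is not None

found = status_exists(cur, 1)
not_found = status_exists(cur, 10)
```

fetchone() returns None when the query produced no row, so no counting is needed on either side.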
If all you want to do is check if you have successfully established a connection then why are you trying to create a table, insert a row, and then retrieve data from it?
You could simply do the following...
sqlq = "SELECT * FROM settings WHERE status = '1'"
xcnx.execute(sqlq)
results = xcnx.fetchone()
if results is not None and results[0] == 1:
    print 'yep its connected'
else:
    print 'nope not connected'
In fact, if your program has not thrown an exception by this point, you have already established the connection successfully. (fetchone returns a single row as a tuple, or None when there is no row, so the check compares the first column of that row rather than comparing the whole result to the string '1'.)
By the way, if for some reason you do need to create the table, I would suggest dropping it before you exit so that your program runs successfully the second time.
When you run results = xcnx.fetchall(), the return value is a sequence of tuples that contain the row values. Therefore when you check if results == '1', you are comparing a sequence to a constant, which will always be False. If you use a SELECT COUNT(*) query instead, a single row is always returned, holding 0 when nothing matches, so you could try this:
results = xcnx.fetchall()
# Get the value of the returned row, which will be 0 with a non-match
if results[0][0]:
    print 'yep its connected'
else:
    print 'nope not connected'
You could alternatively use a DictCursor (when creating the cursor, use .cursor(MySQLdb.cursors.DictCursor)), which makes things a bit easier to interpret codewise, but the result is the same:
if results[0]['COUNT(*)']:
    # Continues...
Also, not a big deal in this case, but you are comparing an integer value to a string. MySQL will do the type conversion, but you could use SELECT COUNT(*) FROM settings WHERE status = 1 and save a (very small) bit of processing.
I recently improved my efficiency by skipping the SELECT altogether: I added a primary (unique) index to the column and then simply ran the INSERT. MySQL will only add the row if it doesn't already exist.
So instead of 2 statements:
Query MySQL for exists:
Query MySQL insert data
Just do 1 and it will only work if it's unique:
Query MySQL insert data
1 Query is better than 2.
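The single-statement approach can be sketched as follows. This example uses an in-memory SQLite database, where the statement is spelled INSERT OR IGNORE; in MySQL the closest equivalent is INSERT IGNORE, while a plain INSERT raises a duplicate-key error instead. The table name `codes` and the helper `insert_if_new` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# The primary key enforces uniqueness, so no separate "exists" query is needed.
conn.execute("CREATE TABLE codes (unique_code TEXT PRIMARY KEY)")

def insert_if_new(connection, code):
    """Insert the code; return True if a row was added, False if it already existed."""
    cur = connection.execute(
        "INSERT OR IGNORE INTO codes (unique_code) VALUES (?)", (code,)
    )
    return cur.rowcount == 1

first = insert_if_new(conn, 'N5LXK2V7CT')
second = insert_if_new(conn, 'N5LXK2V7CT')  # duplicate, silently skipped
total = conn.execute("SELECT COUNT(*) FROM codes").fetchone()[0]
```

The database does the uniqueness check as part of the insert, so the application needs only one round trip per row.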