I have a Python program that downloads a text file of over 100 million unique values and applies the following logic to each one:
If the value already exists in the table, update the entry's last_seen date (SELECT id WHERE <col> = <value>;)
If the value does not exist in the table, insert the value into the table
I queue up entries that need to be added and then insert them in a bulk statement after a few hundred have been gathered.
Currently, the program takes over 24 hours to run. I've created an index on the column that stores the values.
I'm currently using the MySQLdb library.
It seems that checking for value existence is taking the lion's share of the runtime. What avenues can I pursue to make this faster?
Thank you.
You could try loading the values into a set, so you can do the lookups without hitting the database every time, assuming that the table is not being updated by anyone else and that you have sufficient memory.
# Let's assume you have a function runquery that executes the
# provided statement and returns a collection of values as strings.
existing_values = set(runquery('SELECT DISTINCT value FROM table'))

with open('big_file.txt') as f:
    inserts = []
    updates = []
    for line in f:
        value = line.strip()
        if value in existing_values:
            updates.append(value)
        else:
            existing_values.add(value)
            inserts.append(value)
        if len(inserts) > THRESHOLD or len(updates) > THRESHOLD:
            # Do the bulk inserts/updates here, then clear inserts and updates
            pass
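For the bulk flush itself, here is a minimal sketch using MySQLdb's executemany; the table name mytable and the columns value and last_seen are placeholders rather than names from the original question:

from datetime import datetime
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='mydb')
cur = conn.cursor()

def flush(inserts, updates):
    # Placeholder schema: mytable(value, last_seen)
    now = datetime.utcnow()
    if inserts:
        cur.executemany(
            "INSERT INTO mytable (value, last_seen) VALUES (%s, %s)",
            [(v, now) for v in inserts])
    if updates:
        cur.executemany(
            "UPDATE mytable SET last_seen = %s WHERE value = %s",
            [(now, v) for v in updates])
    conn.commit()
    # Clear the queues in place so the caller's lists are reset too.
    del inserts[:]
    del updates[:]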
I'm using a MongoDB database where I store product data. I have a for loop over around ~50 IDs, and with each iteration I search for the ID; if the ID doesn't exist, I add it, and if it exists and another column has a specific value, I run a function.
for id in ids:
    value = db.find_one({"value": id})
    if value:
        # It checks some other columns here using both the ID and the returned document
        pass
    else:
        # It adds the ID and some other information to the database
        pass
The problem is that this is incredibly inefficient. When searching for other ways to do this, all the results show how to get a list of results, but I'm not sure how that would work in my scenario, since I'm running functions and checks with each result and ID.
Thank you!
You can improve this by doing only one find request, and then adding all the new documents in a second step, perhaps with insert_many:
values = db.find({"value": {"$in": ids}})
for value in values:
    # Check the other columns here using both the ID and the returned document
    ids.remove(value["value"])

# Do all your inserts
# either with a loop
for id in ids:
    db.insert_one(...)
# or with insert_many
db.insert_many(...)
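Putting it together, a minimal sketch of the whole pattern might look like this; it assumes, as in the question, that db is the collection and that a new document only needs its "value" field, so adjust it to your real schema:

# One round trip to fetch every existing document for this batch of ids.
existing = {doc["value"]: doc for doc in db.find({"value": {"$in": ids}})}

new_docs = []
for id in ids:
    doc = existing.get(id)
    if doc:
        # Run your per-document checks here using id and doc
        pass
    else:
        new_docs.append({"value": id})  # plus whatever other fields you need

# One round trip to add all the missing ids.
if new_docs:
    db.insert_many(new_docs)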
I am trying to update many records at a time using SQLAlchemy, but am finding it to be very slow. Is there an optimal way to perform this?
For some reference, I am performing an update on 40,000 records and it took about 1 hour.
Below is the code I am using. The table_name refers to the table which is loaded, the column is the single column which is to be updated, and the pairs refer to the primary key and new value for the column.
def update_records(table_name, column, pairs):
    table = Table(table_name, db.MetaData, autoload=True,
                  autoload_with=db.engine)
    conn = db.engine.connect()

    values = []
    for id, value in pairs:
        values.append({'row_id': id, 'match_value': str(value)})

    stmt = table.update().where(table.c.id == bindparam('row_id')).values({column: bindparam('match_value')})
    conn.execute(stmt, values)
Passing a list of arguments to execute() essentially issues 40k individual UPDATE statements, which is going to have a lot of overhead. The solution for this is to increase the number of rows per query. For MySQL, this means inserting into a temp table and then doing an update:
# assuming the temp table has already been created
conn.execute(temp_table.insert().values(values))
conn.execute(table.update()
             .values({column: temp_table.c.match_value})
             .where(table.c.id == temp_table.c.row_id))
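If you need to create that temporary table from SQLAlchemy as well, a rough sketch could look like the following; the column types and the name temp_update_values are assumptions, so adjust them to your schema:

from sqlalchemy import Table, Column, Integer, String, MetaData

metadata = MetaData()
# prefixes=['TEMPORARY'] makes SQLAlchemy emit CREATE TEMPORARY TABLE,
# so the table only lives for the duration of the connection.
temp_table = Table(
    'temp_update_values', metadata,
    Column('row_id', Integer, primary_key=True),
    Column('match_value', String(255)),
    prefixes=['TEMPORARY'],
)
temp_table.create(bind=conn)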
Or, alternatively, you can use INSERT ... ON DUPLICATE KEY UPDATE to avoid creating the temp table, but SQLAlchemy does not support that natively, so you'll need to use a custom compiled construct for that (e.g. this gist).
According to the SQLAlchemy documentation on psycopg2 fast execution helpers, batched update statements can be issued as one statement. In my experiments, this trick reduced update and deletion time from about 30 minutes to 1 minute.
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='values_plus_batch',
    executemany_values_page_size=5000,
    executemany_batch_page_size=5000)
I have a Python program that uses Pytables and queries a table in this simple manner:
def get_element(table, somevar):
    rows = table.where("colname == somevar")
    row = next(rows, None)
    if row:
        return elem_from_row(row)
To reduce the query time, I decided to try to sort the table with table.copy(sortby='colname'). This indeed improved the query time (spent in where), but it increased the time spent in the next() built-in function by several orders of magnitude! What could be the reason?
This slowdown occurs only when there is another column in the table, and the slowdown increases with the element size of that other column.
To help me understand the problem and make sure this was not related to something else in my program, I made this minimum working example reproducing the problem:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tables
import time
import sys

def create_set(sort, withdata):
    # Table description with or without data
    tabledesc = {
        'id': tables.UIntCol()
    }
    if withdata:
        tabledesc['data'] = tables.Float32Col(2000)

    # Create table with CSI'ed id
    fp = tables.open_file('tmp.h5', mode='w')
    table = fp.create_table('/', 'myset', tabledesc)
    table.cols.id.create_csindex()

    # Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()

    # Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')

    fp.flush()
    return fp

def get_element(table, i):
    # By construction, i always exists in the table
    rows = table.where('id == i')
    row = next(rows, None)
    if row:
        return {'id': row['id']}
    return None

sort = sys.argv[1] == 'sort'
withdata = sys.argv[2] == 'withdata'

fp = create_set(sort, withdata)
start_time = time.time()
table = fp.root.myset
for i in xrange(500):
    get_element(table, i)
print("Queried the set in %.3fs" % (time.time() - start_time))
fp.close()
And here is some console output showing the figures:
$ ./timedset.py nosort nodata
Queried the set in 0.718s
$ ./timedset.py sort nodata
Queried the set in 0.003s
$ ./timedset.py nosort withdata
Queried the set in 0.597s
$ ./timedset.py sort withdata
Queried the set in 5.846s
Some notes:
The rows are actually sorted in all cases, so it seems to be linked to the table being aware of the sort rather than just the data being sorted.
If, instead of creating the file, I read it from disk, the results are the same.
The issue occurs only when the data column is present, even though I never write to it or read it. I noticed that the time difference increases "in stages" as the size of the column (the number of floats) increases, so the slowdown must be linked to internal data movements or I/O.
If I don't use the next function, but instead use a for row in rows loop and trust that there is only one result, the slowdown still occurs.
Accessing an element from a table by some sort of id (sorted or not) sounds like a basic feature, so I must be missing the typical way of doing it with pytables. What is it?
And why such a terrible slowdown? Is it a bug that I should report?
I finally understood what's going on.
Long story short
The root cause is a bug and it was on my side: I was not flushing the data before making the copy in case of sort. As a result, the copy was based on data that was not complete, and so was the new sorted table. This is what caused the slowdown, and flushing when appropriate led to a less surprising result:
...
    # Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()
    fp.flush()  # <--

    # Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')

    fp.flush()  # <--
    return fp
...
But why?
I realized my mistake when I decided to inspect and compare the structure and data of the "not sorted" vs "sorted" tables. I noticed that in the sorted case, the table had fewer rows. The number varied seemingly randomly from 0 to about 450 depending on the size of the data column. Moreover, in the sorted table, the id of all the rows was set to 0. I guess that when creating a table, pytables initializes the columns and may or may not pre-create some of the rows with some initial value. This "may or may not" probably depends on the size of the row and the computed chunksize.
As a result, when querying the sorted table, all queries but the one with id == 0 had no result. I initially thought that raising and catching the StopIteration error was what caused the slowdown, but that would not explain why the slowdown depends on the size of the data column.
After reading some of the code from pytables (notably table.py and tableextension.pyx), I think what happens is the following: when a column is indexed, pytables will first try to use the index to speed up the search. If some matching rows are found, only those rows are read. But if the index indicates that no row matches the query, for some reason pytables falls back to an "in kernel" search, which iterates over and reads all the rows. This requires reading the full rows from disk in multiple I/Os, and this is why the size of the data column mattered. Also, under a certain size of that column, pytables did not "pre-create" any rows on disk, resulting in a sorted table with no rows at all. This is why the search is very fast when the column size is under 525: iterating over 0 rows doesn't take much time.
I am not clear on why the iterator falls back to an "in kernel" search. If the searched id is clearly out of the index bounds, I don't see any reason to search for it anyway... Edit: after a closer look at the code, it turns out this is because of a bug. It is present in the version I am using (3.1.1) but has been fixed in 3.2.0.
The irony
What really makes me cry is that I forgot to flush before copying only in the example in the question; in my actual program, this bug is not present! What I also did not know, but found out while investigating, is that by default pytables does not propagate indexes when copying a table. This has to be requested explicitly with propindexes=True. This is why the search was slower after sorting in my application...
So moral of the story:
Indexing is good: use it
But don't forget to propagate your indexes when sorting (copying) a table
Make sure your data is on disk before reading it...
I'm using boto.dynamodb2, and it seems I can use Table.query_count(). However, it raises an exception when no query filter is applied.
What can I do to fix this?
By the way, where is the documentation for the filters that boto.dynamodb2.table.Table.query can use? I tried searching for it but found nothing.
There are two ways you can get a row count in DynamoDB.
The first is performing a full table scan and counting the rows as you go. For a table of any reasonable size this is generally a horrible idea as it will consume all of your provisioned read throughput.
The other way is to use the Describe Table request to get an estimate of the number of rows in the table. This will return instantly, but will only be updated periodically per the AWS documentation.
The number of items in the specified index. DynamoDB updates this
value approximately every six hours. Recent changes might not be
reflected in this value.
As per the boto3 documentation:
"The number of items in the specified table. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value."
import boto3
dynamoDBResource = boto3.resource('dynamodb')
table = dynamoDBResource.Table('tableName')
print(table.item_count)
or you can use DescribeTable:
import boto3
dynamoDBClient = boto3.client('dynamodb')
table = dynamoDBClient.describe_table(
    TableName='tableName'
)
print(table)
If you want to count the number of items:
import boto3
client = boto3.client('dynamodb','us-east-1')
response = client.describe_table(TableName='test')
print(response['Table']['ItemCount'])
#ItemCount (integer) --The number of items in the specified table.
# DynamoDB updates this value approximately every six hours.
# Recent changes might not be reflected in this value.
Ref: Boto3 Documentation (under ItemCount in describe_table())
You can use this to get the item count of the entire table:
from boto.dynamodb2.table import Table
dynamodb_table = Table('Users')
dynamodb_table.count() # updated roughly every 6 hours
Refer here: http://boto.cloudhackers.com/en/latest/ref/dynamodb2.html#module-boto.dynamodb2.table
The query_count method will return the item count based on the indexes you provide.
For example,
from boto.dynamodb2.table import Table

dynamodb_table = Table('Users')
print dynamodb_table.query_count(
    index='first_name-last_name-index',  # Get indexes from the indexes tab in the DynamoDB console
    first_name__eq='John',               # add __eq to your index name for a specific search
    last_name__eq='Smith'                # This is your range key
)
You can add the primary index or global secondary indexes along with range keys.
Possible comparison operators:
__eq for equal
__lt for less than
__gt for greater than
__gte for greater than or equal
__lte for less than or equal
__between for between
__beginswith for begins with
Example for between:
print dynamodb_table.query_count(
    index='first_name-last_name-index',  # Get indexes from the indexes tab in the DynamoDB console
    first_name__eq='John',               # add __eq to your index name for a specific search
    age__between=[30, 50]                # This is your range key
)
I have a working program that does some transforms, but I'm worried about what will happen if the database gets too big.
To be clear: if the program below bombs in the middle, how would I get it to recover, or to resume from the point where it stopped?
The process may well get killed partway through executing the code.
Will the program be able to continue from where it hit an error, or from the position where it was killed?
import sqlite3 as sqlite
import ConfigParser

config = ConfigParser.RawConfigParser()
config.read('removespecial.ini')

con = sqlite.connect('listofcomp.sqlite')
cur = con.cursor()

def transremovechars():
    char_cfg = open('specialchars.txt', 'r')  # Reads all the special chars to be removed from specialchars.txt
    special_chars = char_cfg.readline()
    char_cfg.close()
    cur.execute('select * from originallist')
    for row in cur:  # Applies transformation to remove chars for each row in a loop
        company = row[0]
        for bad_char in special_chars:
            company = company.replace(bad_char, '')
        cur.execute('Create table transform1 (Names Varchar, Transformtype Varchar')
        cur.execute('Insert into transform1 (Names)', company)

def transtolower():
    cur.execute('select * from transform1')  # Transformation to convert all the names to lower case
    for row in cur:
        company = row[0]
        company = company.lower()
        cur.execute('Create table transform2 (Names Varchar, Transformtype Varchar')  # Creates another table named transform2
        cur.execute('Insert into transform2 (Names)', company)  # Copies all the lower cased names to transform2

if __name__ == "__main__":
    transremovechars()
    transtolower()
if the program below bombs in the middle, how would I get it to recover, or to resume from the point where it stopped
You can't.
Your code is utterly mysterious because the create table will get an error prior to every insert except the first one.
If, however, you want to do a long series of inserts from one old table into one new table, and you're worried about the possibility that it doesn't finish correctly, you have three choices for maintaining the required state information:
Unique Keys.
Batches.
Query.
Unique Keys.
If each row has a unique key, then some inserts will get an error because the row is a duplicate.
If the program "bombs", you just restart. You get a lot of duplicates (which you expected). This is not inefficient.
Batches.
Another technique we use is to query all the rows from the old table, and include a "batch" number that increments every 1000 rows. batch_number = row_count // 1000.
You create a "batch number" file with the number -1.
Your program starts, it reads the batch number. That's the last batch that was finished.
You read the source data until you get to a batch number greater than the last one that finished.
You then do all the inserts from a batch.
When the batch number changes, do a commit, and write the batch number to a file. That way you can restart on any batch.
When restarting after a "bomb", you may get some duplicates from the partial batch (which you expected). This is not inefficient.
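A minimal sketch of that bookkeeping, where batch_number.txt and insert_row() are placeholder names rather than part of the original program:

BATCH_SIZE = 1000

def read_last_batch():
    # -1 means no batch has been committed yet
    try:
        with open('batch_number.txt') as f:
            return int(f.read())
    except IOError:
        return -1

def write_last_batch(batch_number):
    with open('batch_number.txt', 'w') as f:
        f.write(str(batch_number))

last_finished = read_last_batch()
read_cur = con.cursor()    # one cursor for reading...
write_cur = con.cursor()   # ...and another for the inserts
for row_count, row in enumerate(read_cur.execute('select * from originallist')):
    batch_number = row_count // BATCH_SIZE
    if batch_number <= last_finished:
        continue                        # this batch was already committed
    insert_row(write_cur, row)          # placeholder for the actual insert
    if (row_count + 1) % BATCH_SIZE == 0:
        con.commit()                    # make the whole batch durable first
        write_last_batch(batch_number)  # then record that it finished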
Query.
You can query before each insert to see if the row exists. This is inefficient.
If you don't have a unique key, then you must do a complex query to see if the row was created by a previous run of the program.
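For completeness, the query-before-each-insert variant looks something like this (again assuming a Names column you can match on):

cur.execute('SELECT 1 FROM transform1 WHERE Names = ?', (company,))
if cur.fetchone() is None:
    cur.execute('INSERT INTO transform1 (Names) VALUES (?)', (company,))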