I'm loading a large pandas DataFrame into a DynamoDB table with the boto3 batch_writer context. The hash key is symbol and the sort key is date.
with table.batch_writer() as batch:
    for row in df.itertuples(index=False, name="R"):
        batch.put_item(Item=row._asdict())
My network connection was lost and the job stopped. I want to start putting records where I left off.
The DynamoDB table has 1_400_899 items. My DataFrame has 5_998_099 rows.
When I inspect the DataFrame at and around index 1_400_899, those records do not exist in DynamoDB, which makes me think rows are not being inserted sequentially.
How can I determine where I left off so I can slice the DataFrame appropriately and restart the job?
DynamoDB's put_item doesn't guarantee that items are inserted in a sequential fashion, so you cannot rely on the order of inserted items. Now, coming back to your question: how can you determine where you left off so you can slice the DataFrame appropriately and restart the job?
The only way to know for sure is to scan the entire table, retrieve the primary key attributes of the items that were already inserted, drop those keys from the original DataFrame, and then restart the batch write operation.
Here is some code that will help you get the job done:
def scan_table(table, keys, **kwargs):
    # Paginate through the whole table, fetching only the key attributes.
    resp = table.scan(ProjectionExpression=', '.join(keys), **kwargs)
    yield from resp['Items']
    while 'LastEvaluatedKey' in resp:
        resp = table.scan(ProjectionExpression=', '.join(keys),
                          ExclusiveStartKey=resp['LastEvaluatedKey'])
        yield from resp['Items']
keys = ['symbol', 'date']
df_saved = pd.DataFrame(scan_table(table, keys))
i1 = df.set_index(keys).index
i2 = df_saved.set_index(keys).index
df_not_saved = df[~i1.isin(i2)]
Now you can restart the batch write operation on df_not_saved instead of df:
with table.batch_writer() as batch:
    for row in df_not_saved.itertuples(index=False, name="R"):
        batch.put_item(Item=row._asdict())
The Python batch_writer() is a utility around DynamoDB's BatchWriteItem operation. It splits your work into smallish sets of items (BatchWriteItem is limited to 25 items), and writes each batch using BatchWriteItem.
Normally, these writes are sequential in a sense: if your client managed to send a batch of writes to DynamoDB, they will all be done, even if you lose your connection. However, there is a snag here: BatchWriteItem is not guaranteed to succeed in writing all the items. When it can't, often because you have used more than your reserved capacity, it returns UnprocessedItems, a list of items that need to be resent. batch_writer() will resend those items later, but if you interrupt it at that point, it is possible that only a random subset of the last 25 items was written. So make sure to go back at least 25 items to be sure you have reached the position where batch_writer() wrote everything successfully.
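If you prefer to resume by position rather than diffing against a full table scan, a minimal sketch of that rewind idea is below. resume_pos is an assumed estimate of where the upload stopped (it is not something you have in your code yet); since PutItem simply overwrites an existing item with the same key, rewriting a few already-written rows is harmless.
# Sketch only: resume_pos is your own estimate of where the upload stopped.
SAFETY_MARGIN = 25  # one full BatchWriteItem batch
restart_from = max(0, resume_pos - SAFETY_MARGIN)

with table.batch_writer() as batch:
    for row in df.iloc[restart_from:].itertuples(index=False, name="R"):
        # PutItem overwrites items with the same key, so re-writing the tail is safe.
        batch.put_item(Item=row._asdict())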
Another question: where did you get the information that the DynamoDB table has 1_400_899 items? DynamoDB does expose such a number (the table's ItemCount), but it is documented to be updated only about once every six hours. Did you wait six hours?
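For reference, a quick way to read that (stale) count with boto3; 'my_table' is a placeholder name here:
import boto3

table = boto3.resource('dynamodb').Table('my_table')  # placeholder table name
table.load()             # refresh the cached table description (DescribeTable)
print(table.item_count)  # this count is updated only about every six hours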
My Firebase realtime database schema:
Let's suppose the Firebase database schema above.
I want to get data with order_by_key(), but only the items after the first 5 and up to the first 10, no more. The range should be 5-10, like in the image.
My keys always start with -.
I tried this, but it fails and returns nothing. How can I do this?
snapshot = ref.child('tracks').order_by_key().start_at('-\5').end_at(u'-\10').get()
Firebase queries are based on cursor/anchor values, not on offsets. This means that the start_at and end_at calls expect values of the thing you order on, so when ordering by key they expect the keys of those nodes.
To get the slice you indicate you'll need:
ref.child('tracks').order_by_key().start_at('-MQJ7P').end_at('-MQJ8O').get()
If you don't know either of those values, you can't specify them and can only start from the first item or end on the last item.
The only exception is that you can specify a limit_to_first instead of end_at to get a number of items at the start of the slice:
ref.child('tracks').order_by_key().start_at('-MQJ7P').limit_to_first(5).get()
Alternatively if you know only the key of the last item, you can get the five items before that with:
ref.child('tracks').order_by_key().end_at('-MQJ8O').limit_to_last(5).get()
But you'll need to know at least one of the keys, typically because you've shown it as the last item on the previous page/first item on the next page.
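Putting that together, a common pattern is to page through the list by feeding the last key of one page into start_at for the next page. A rough sketch with the firebase_admin Python SDK, assuming the app is already initialized with a databaseURL; PAGE_SIZE and the processing step are illustrative:
from firebase_admin import db

PAGE_SIZE = 5
ref = db.reference('tracks')

last_key = None
while True:
    query = ref.order_by_key()
    if last_key is not None:
        query = query.start_at(last_key)
    # Ask for one extra item when resuming, because start_at(last_key) is inclusive.
    page = query.limit_to_first(PAGE_SIZE + 1 if last_key else PAGE_SIZE).get() or {}
    items = [(k, v) for k, v in page.items() if k != last_key]
    if not items:
        break
    # ... process this page of items here ...
    last_key = items[-1][0]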
In my Cloud Bigtable table, I have millions of requests per second. I get a unique row key and then I need to modify the row with an atomic mutation.
When I filter by column to get the key, will it be atomic for each request?
col1_filter = row_filters.ColumnQualifierRegexFilter(b'customerId')
label1_filter = row_filters.ValueRegexFilter('')
chain1 = row_filters.RowFilterChain(filters=[col1_filter, label1_filter])
partial_rows = table.read_rows(filter_=chain1)
for data in partial_rows:
    row_cond = table.row(data.cell[row_key])
    row_cond.set_cell(u'data', b'customerId', b'value', state=True)
    row_cond.commit()
CheckAndMutateRow operations are atomic, BUT it is check and mutate row, not rows. So the way you have this set up won't create an atomic operation.
You need to create a conditional row object using the row key and your filter, supply the modification, then commit. Like so:
col1_filter = row_filters.ColumnQualifierRegexFilter(b'customerId')
label1_filter = row_filters.ValueRegexFilter('')
chain1 = row_filters.RowFilterChain(filters=[col1_filter, label1_filter])
partial_rows = table.read_rows()
for data in partial_rows:
    # Build a conditional row keyed by the row's key, with the filter attached.
    row_cond = table.row(data.row_key, filter_=chain1)  # use the filter here
    row_cond.set_cell(u'data', b'customerId', b'value', state=True)
    row_cond.commit()
So you would have to do a full table scan and apply the filter to each row. Since you were already applying that filter to a scan, there shouldn't be a performance difference. For best practices with Cloud Bigtable, though, you want to avoid full table scans. If this is a one-time program you need to run, that is fine; otherwise you may want to find a different approach if you are going to do this regularly.
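If the rows you care about share a key prefix or fall in a known key range, one way to avoid the full scan is to restrict read_rows to that range. A rough sketch; the b'customer#' prefix and its end key are purely illustrative:
# Sketch: only scan rows in a known key range instead of the whole table.
partial_rows = table.read_rows(start_key=b'customer#', end_key=b'customer$')

for data in partial_rows:
    row_cond = table.row(data.row_key, filter_=chain1)
    row_cond.set_cell(u'data', b'customerId', b'value', state=True)
    row_cond.commit()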
Note that we are updating the API to provide more clarity on the different kinds of mutations.
I have a Python program that downloads a text file of over 100 million unique values and follows the following logic:
If the value already exists in the table, update the entry's last_seen date (SELECT id WHERE <col> = <value>;)
If the value does not exist in the table, insert the value into the table
I queue up entries that need to be added and then insert them in a bulk statement after a few hundred have been gathered.
Currently, the program takes over 24 hours to run. I've created an index on the column that stores the values.
I'm currently using MySQLdb.
It seems that checking for value existence is taking the lion's share of the runtime. What avenues can I pursue to make this faster?
Thank you.
You could try loading the existing values into a set, so you can do the lookups without querying the database every time. This assumes that the table is not being updated by anyone else and that you have sufficient memory.
# Let's assume you have a function runquery that executes the
# provided statement and returns a collection of values as strings.
existing_values = set(runquery('SELECT DISTINCT value FROM table'))

with open('big_file.txt') as f:
    inserts = []
    updates = []
    for line in f:
        value = line.strip()
        if value in existing_values:
            updates.append(value)
        else:
            existing_values.add(value)
            inserts.append(value)
        if len(inserts) > THRESHOLD or len(updates) > THRESHOLD:
            # Do the bulk inserts and updates here, then clear both lists
            ...
    # Remember to flush any remaining inserts/updates after the loop as well
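The bulk step above could look something like the sketch below, using MySQLdb's executemany. The table name values_table and the columns value/last_seen are hypothetical; adjust them to your schema.
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='mydb')
cur = conn.cursor()

def flush(inserts, updates):
    # Bulk-insert new values and bulk-update last_seen for existing ones.
    if inserts:
        cur.executemany(
            "INSERT INTO values_table (value, last_seen) VALUES (%s, NOW())",
            [(v,) for v in inserts])
    if updates:
        cur.executemany(
            "UPDATE values_table SET last_seen = NOW() WHERE value = %s",
            [(v,) for v in updates])
    conn.commit()
    inserts.clear()
    updates.clear()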
This may be trivial, but I loaded a local DynamoDB instance with 30GB worth of Twitter data that I aggregated.
The primary key is id (tweet_id from the Tweet JSON), and I also store the date/text/username/geocode.
I basically am interested in mentions of two topics (let's say "Bees" and "Booze"). I want to get a count of each of those by state by day.
So by the end, I should know for each state, how many times each was mentioned on a given day. And I guess it'd be nice to export that as a CSV or something for later analysis.
Some issues I had with doing this...
First, the geocode info is a tuple of [latitude, longitude] so for each entry, I need to map that to a state. That I can do.
Second, is the most efficient way to do this simply to go through each entry, manually check whether it mentions either keyword, and then keep a dictionary per keyword that maps date/location to a count?
EDIT:
Since it took me 20 hours to load all the data into my table, I don't want to delete and re-create it. Perhaps I should create a global secondary index (?) and use that to search other fields in a query? That way I don't have to scan everything. Is that the right track?
EDIT 2:
Well, since the table is on my computer locally, I should be OK with just using expensive operations like a Scan, right?
So if I did something like this:
query = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='id, text, date, geo',
    Limit=100)
And did one scan for each keyword, then I would be able to go through the resulting filtered list and get a count of mentions of each topic for each state on a given day, right?
EDIT 3:
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    Limit=100)

# do something with this first page

while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('text').contains("Booze"),
        Limit=100,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    # do something with each batch of up to 100 entries
So something like that, for both keywords. That way I'll be able to go through the resulting filtered set and do what I want (in this case, figure out the location and day and create a final dataset with that info). Right?
EDIT 4
If I add:
ProjectionExpression='date, location, user, text'
into the scan request, I get an error saying "botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the Scan operation: Invalid ProjectionExpression: Attribute name is a reserved keyword; reserved keyword: location". How do I fix that?
NVM, I got it. The answer is to look into ExpressionAttributeNames (see: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ExpressionPlaceholders.html).
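In case it helps others hitting the same error, a minimal sketch of that fix for the scan in EDIT 4; the #dt/#loc/#usr/#txt placeholder names are arbitrary, and this can be combined with the FilterExpression shown earlier:
response = table.scan(
    ProjectionExpression='#dt, #loc, #usr, #txt',
    ExpressionAttributeNames={
        '#dt': 'date',
        '#loc': 'location',  # 'location' is the reserved word from the error;
        '#usr': 'user',      # aliasing the other names is harmless
        '#txt': 'text',
    },
    Limit=100)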
Yes, scanning the table for "Booze" and counting the items in the result should give you the total count. Please note that you need to keep scanning, passing ExclusiveStartKey each time, until LastEvaluatedKey is no longer returned. Refer to ExclusiveStartKey and the Scan operation documentation as well.
EDIT:
Yes, the code looks good. One thing to note: the result set won't always contain 100 items. Please refer to the Limit definition below (it is not the same as LIMIT in a SQL database).
Limit — (Integer) The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.
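To turn those scan pages into per-state, per-day counts, a rough sketch of the aggregation step; latlon_to_state is a hypothetical helper you would supply (the lat/long-to-state mapping you said you can do), and the date slicing assumes an ISO-style date string:
from collections import Counter
from boto3.dynamodb.conditions import Attr

def count_mentions(table, keyword):
    counts = Counter()  # keyed by (state, day)
    kwargs = {'FilterExpression': Attr('text').contains(keyword)}
    while True:
        response = table.scan(**kwargs)
        for item in response['Items']:
            state = latlon_to_state(item['geo'])  # hypothetical helper
            day = item['date'][:10]               # assumes 'YYYY-MM-DD...' strings
            counts[(state, day)] += 1
        if 'LastEvaluatedKey' not in response:
            break
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
    return counts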
I have the following table:
CREATE TABLE prosfiles (
    name_file text,
    beginpros timestamp,
    humandate timestamp,
    lastpros timestamp,
    originalname text,
    pros int,
    uploaded int,
    uploader text,
    PRIMARY KEY (name_file)
);
CREATE INDEX prosfiles_pros_idx ON prosfiles (pros);
In this table I keep the location of several CSV files which are processed by Python scripts. As I have several scripts running at the same time processing those files, I use this table to keep control and avoid two scripts starting to process the same file at the same time (in the pros column, 0 means the file has not been processed, 1 means processed, and 1010 means the file is currently being processed by another script).
Each script runs the following query to pick the file to process:
"select name_file from prosfiles where pros = 0 limit 1"
but this always returns the first row matching that condition.
I would like to run a query that returns a random row from all the ones with pros = 0.
In MySQL I've used "order by rand()", but in Cassandra I don't know how to randomly order the results.
It looks like you're using Cassandra as a queue, and that is not the best usage pattern for it; use RabbitMQ/SQS/any other queue service. Also, Cassandra does not support arbitrary sorting of results, for two reasons:
sorting would require a lot of computation inside the database if you are trying to sort a billion rows.
sorting is not an easy task in a distributed environment: you have to ask all nodes holding the data to perform it.
But if you know what you are doing, you can revisit your database schema to be more suitable for this type of workload:
split your source table into two tables: the first with the full file information and the second with the queue itself, containing only the ids of files to process.
your worker process reads a random row from the queue table (see below how to read a ~random row from Cassandra by primary key).
the worker deletes the target id from the queue and updates your targets table with processing information.
This way of doing things can still lead to errors:
multiple workers can pick up the same target at once.
if you have a lot of workers and targets, Cassandra's compaction process will kill the performance of your DIY queue.
To read a pseudo-random row from a table by its primary key you can use this query: select * from some_table where token(id_column) > some_random_long_value limit 1, but it also has its cons:
if you have a small set of targets, it will sporadically return an empty result because some_random_long_value will be higher than the token of any existing key.
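A minimal sketch of that random-token trick with the DataStax cassandra-driver, including a wrap-around for the empty-result case mentioned above. The keyspace mykeyspace and the queue table file_queue (keyed by name_file) follow the schema suggested above and are otherwise hypothetical:
import random
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('mykeyspace')

# Murmur3 partitioner tokens range from -2**63 to 2**63 - 1.
rand_token = random.randint(-2**63, 2**63 - 1)
row = session.execute(
    "SELECT name_file FROM file_queue WHERE token(name_file) > %s LIMIT 1",
    (rand_token,)).one()
if row is None:
    # The random token landed past the last key; wrap around to the lowest token.
    row = session.execute("SELECT name_file FROM file_queue LIMIT 1").one()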