I have the following table:
CREATE TABLE prosfiles (
    name_file text,
    beginpros timestamp,
    humandate timestamp,
    lastpros timestamp,
    originalname text,
    pros int,
    uploaded int,
    uploader text,
    PRIMARY KEY (name_file)
);

CREATE INDEX prosfiles_pros_idx ON prosfiles (pros);
In this table I keep the locations of several CSV files which are processed by a Python script. As I have several scripts running at the same time processing those files, I use this table to keep control and avoid two scripts starting to process the same file at the same time (in the 'pros' column, 0 means the file has not been processed, 1 means it has been processed, and 1010 means it is currently being processed by another script).
Each script runs the following query to pick the file to process:
"select name_file from prosfiles where pros = 0 limit 1"
But this always returns the first row that matches the condition.
I would like to run a query that returns a random row from all the ones with pros = 0.
In MySQL I've used "order by rand()", but in Cassandra I don't know how to randomly order the results.
It looks like you're using Cassandra as a queue, and that's not the best usage pattern for it; use RabbitMQ/SQS/any other queue service. Also, Cassandra does not support arbitrary sorting of results (only ordering by clustering columns within a partition), and that is by design:
sorting would require a lot of computation inside the database if you are trying to sort 1B rows.
sorting is not an easy task in a distributed environment: you have to ask all nodes holding the data to perform it.
But if you know what you are doing, you can revisit your database schema to be more suitable for this type of workload:
split your source table into two different tables: the first with the full file information and the second with the queue itself, containing only the ids of files to process (a CQL sketch of this split follows after this list).
your worker process reads a ~random row from the queue table (see below how to read a ~random row from Cassandra by primary key).
the worker deletes the target id from the queue and updates your targets table with processing information.
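One possible shape for that split, sketched in CQL (the table and column names here are illustrative, not taken from the question):

-- full file information, keyed by file name (roughly the original table)
CREATE TABLE files (
    name_file text PRIMARY KEY,
    originalname text,
    uploader text,
    uploaded int,
    beginpros timestamp,
    lastpros timestamp,
    pros int
);

-- the queue itself: only the ids of files that still need processing
CREATE TABLE files_to_process (
    name_file text PRIMARY KEY
);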
This way of doing things can still lead to problems:
multiple workers can get the same target at once.
if you have a lot of workers and targets, Cassandra's compaction process will kill the performance of your DIY queue.
To read a pseudo-random row from a table by its primary key you can use this query: select * from some_table where token(id_column) > some_random_long_value limit 1. But it also has its cons:
if you have a small set of targets, it will sporadically return an empty result because your some_random_long_value will be higher than the token of any existing key.
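For completeness, here is a rough Python sketch of that token() read with the DataStax cassandra-driver; the contact point, keyspace, and the files_to_process table from the sketch above are assumptions, and the wrap-around handles the empty-result case just described:

import random
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('files_ks')   # assumed contact point / keyspace

def pick_pseudo_random_file():
    # Murmur3 tokens span the whole signed 64-bit range
    start = random.randint(-2**63, 2**63 - 1)
    row = session.execute(
        "SELECT name_file FROM files_to_process WHERE token(name_file) > %s LIMIT 1",
        (start,),
    ).one()
    if row is None:
        # the random token landed past every existing key: wrap around to the start
        row = session.execute("SELECT name_file FROM files_to_process LIMIT 1").one()
    return row.name_file if row else None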
I'm loading a large pandas DataFrame into a DynamoDB table with the boto3 batch_writer context. The hash key is symbol and the sort key is date.
with table.batch_writer() as batch:
    for row in df.itertuples(index=False, name="R"):
        batch.put_item(Item=row)
My network connection was lost and the job stopped. I want to start putting records where I left off.
The DynamoDB table has 1_400_899 items. My DataFrame has 5_998_099 rows.
When I inspect the DataFrame at and around index 1_400_899, those records do not exist in DynamoDB. That makes me think rows are not inserted sequentially.
How can I determine where I left off so I can slice the DataFrame appropriately and restart the job?
DynamoDB's put_item doesn't guarantee that items will be inserted sequentially, so you cannot rely on the order of inserted items. Now, coming back to your question: how can you determine where you left off so you can slice the DataFrame appropriately and restart the job?
The only way to know for sure is to scan the entire table, retrieve the values of the primary key columns that are already inserted, then drop those keys from the original DataFrame and start the batch write operation again.
Here is some code that will help you get the job done:
def scan_table(table, keys, **kwargs):
    # 'date' is a DynamoDB reserved word, so project the key attributes
    # through ExpressionAttributeNames aliases.
    names = {f'#{k}': k for k in keys}
    resp = table.scan(ProjectionExpression=', '.join(names),
                      ExpressionAttributeNames=names, **kwargs)
    yield from resp['Items']
    # keep paginating until DynamoDB stops returning LastEvaluatedKey
    if 'LastEvaluatedKey' in resp:
        yield from scan_table(table, keys, ExclusiveStartKey=resp['LastEvaluatedKey'])
keys = ['symbol', 'date']
df_saved = pd.DataFrame(scan_table(table, keys))

# drop every (symbol, date) pair that is already in DynamoDB
i1 = df.set_index(keys).index
i2 = df_saved.set_index(keys).index
df_not_saved = df[~i1.isin(i2)]
Now you can restart the batch write operation on df_not_saved instead of df:
with table.batch_writer() as batch:
    for row in df_not_saved.itertuples(index=False, name="R"):
        batch.put_item(Item=row)
The Python batch_writer() is a utility around DynamoDB's BatchWriteItem operation. It splits your work into small sets of items (BatchWriteItem is limited to 25 items) and writes each batch using BatchWriteItem.
Normally these writes are sequential in a sense: if your client managed to send a batch of writes to DynamoDB, they will all be done, even if you lose your connection. However, there is a snag here: BatchWriteItem is not guaranteed to succeed in writing all the items. When it can't, often because you have used more than your provisioned capacity, it returns UnprocessedItems, a list of items that need to be resent. batch_writer() will resend those items later, but if you interrupt it at that point, it is possible that a random subset of the last 25 items was written, but not all of them. So make sure to go back at least 25 items to be sure you have reached a position where batch_writer() wrote everything successfully.
Another question is where you got the information that the DynamoDB table has 1_400_899 items. DynamoDB does keep such a count, but it is documented to be updated only about once every six hours. Did you wait six hours?
I have Python code that uses SQLAlchemy for batch insertion with bulk_insert_mappings. The dataset I need to load may contain duplicate values, so what is the best way to deal with this situation if there are multiple processes doing parallel imports? It's easy enough for a single batch insert: I can keep the keys to be added in a set and never add one more than once using this set:
for line in fs.open(json_path[3:]):
    # get dictionary from JSON line
    json_dict = json.loads(line)

    # get the values we'll store in the database columns
    id_hash = json_dict['key']['AddressKeyHash']
    address1 = json_dict['address']['Address1']

    if id_hash not in distinct_hashes:
        distinct_hashes.add(id_hash)

        # hold a dictionary mapping columns to values for each row
        row_mapping = {
            "id": id_hash,
            "address1": address1,
        }
        mappings.append(row_mapping)
        count += 1

        # if we've reached the batch size (count is evenly divisible)
        # then we save the batch
        if (count % batch_size) == 0:
            self.session.bulk_insert_mappings(Address, mappings)
            self.session.commit()

            # clear the mappings list for the next batch
            mappings = []

if len(mappings) > 0:
    self.session.bulk_insert_mappings(Address, mappings)
    self.session.commit()
If I use the above in parallel, there is no way for the parallel processes to communicate about which keys have already been added. So I wonder if there's a setting that controls how duplicate inserts are handled, other than a simple failure/error/crash. For example, how do I specify "ignore on duplicate" or "update on duplicate" behavior?
I am using Python with SQLAlchemy. Each job runs on an AWS EC2 cluster and executes a single script (much of it above) to perform batch insertion from data in a JSON file on S3 into an Aurora (Postgres) cluster.
Maybe there's a way to use multiprocessing for this instead, using multiple cores rather than multiple Batch/EC2 jobs, with a shared-memory object that lets the batch insertion processes communicate? If not, then perhaps there's something in place that allows duplicate inserts to be ignored or treated as an update?
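Something like the following is the kind of behavior I'm after; this is only a rough sketch assuming Postgres and SQLAlchemy's dialect-specific insert construct (Address and mappings are the names from the snippet above, and "id" is assumed to be the primary key), and I don't know if it is the right approach:

from sqlalchemy.dialects.postgresql import insert

def save_batch_ignoring_duplicates(session, mappings):
    stmt = insert(Address).values(mappings)
    # skip rows whose primary key already exists;
    # on_conflict_do_update(...) would give "update on duplicate" instead
    stmt = stmt.on_conflict_do_nothing(index_elements=['id'])
    session.execute(stmt)
    session.commit()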
I have to parse data from XML files, and each file has over 170k lines. The structure is as follows:
<el1>
  <el2>
    <el3>
      <el3-1>...</el3-1>
      <el3-2>...</el3-2>
    </el3>
    <el4>
      <el4-1>...</el4-1>
      <el4-2>...</el4-2>
    </el4>
    <el5>
      <el5-1>...</el5-1>
      <el5-2>...</el5-2>
    </el5>
  </el2>
</el1>
Basically I need to save the attributes from the "el" tags, and each tag represents a table in the database. So I get the values from the attributes in el1 and insert them into the table, etc. The thing is, there are a lot of el3, el4, el5, ... tags, and I think calling INSERT after each of them slows down the whole process, because one file takes about 3 minutes to be saved into the database.
I thought I could solve this by saving the values into a list and, once the file is processed, running one INSERT command to save all the values. But the problem is that before I run the INSERT command, I select the last inserted id from the previous element and put it into the INSERT command of the next element as its foreign key. If I want to collect the values in a list first, I can't get the last id from the database, because the insert only happens at the end of the file.
I don't know if the whole process is slowed down by the large number of INSERT commands; this is just my guess.
If you haven't already done so, run all the INSERT statements in a single transaction; that makes a big difference.
For a better answer, provide more details, such as your code, your system, and how many transactions per second you are reaching.
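For illustration, a rough sketch of that single-transaction approach, assuming Python with sqlite3 (any DB-API driver looks much the same) and made-up table/column names; lastrowid gives you the parent's id without a separate SELECT:

import sqlite3

def save_file(conn, parsed_elements):
    # parsed_elements is assumed to be an iterable of
    # (parent_attrs, child_rows) pairs produced by your XML parser
    cur = conn.cursor()
    for parent_attrs, child_rows in parsed_elements:
        cur.execute("INSERT INTO el3 (attr1, attr2) VALUES (?, ?)", parent_attrs)
        parent_id = cur.lastrowid          # id of the row we just inserted
        cur.executemany(
            "INSERT INTO el3_child (el3_id, attr1) VALUES (?, ?)",
            [(parent_id, attr) for attr in child_rows],
        )
    conn.commit()                          # one commit = one transaction per file

with sqlite3.connect("data.db") as conn:
    save_file(conn, parsed_elements)       # parsed_elements comes from your parser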
A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type's spec has gained enhancements, such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decades-long evolution of these file types, it makes sense to inspect all 14 million of them iteratively, calmly, but before dying beneath their crushing weight.
I built a running counter such that every time I see a property (familiar or not) I add 1 to its tally. The sqlite tally board has one row per property and one count column per file type (typeA, typeB, typeC).
In the special event that I see an unfamiliar property, I add a new row for it to the tally and increment the column for the file type being processed.
I've got this system down! But it's slow: about 3M files per 36 hours in one process. Originally I was using this trick to pass sqlite a list of properties needing an increment:
placeholder = '?'  # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join(placeholder for dummy_var in properties)
sql = """UPDATE tally_board
         SET %s = %s + 1
         WHERE property IN (%s)""" % (type_name, type_name, placeholders)
cursor.execute(sql, properties)
I learned that's a bad idea because:
sqlite string search is much slower than indexed search.
several hundred properties (some 160 characters long) make for really long sql queries.
using %s instead of ? is bad security practice... (not a concern ATM)
A "fix" was to maintain a script side property-rowid hash of the tally used in this loop:
Read file for new_properties
Read tally_board for rowid, property
Generate script side client_hash from 2's read
Write rows to tally_board for every new_property not in property (nothing incremented yet). Update client_hash with new properties
Lookup rowid for every row in new_properties using the client_hash
Write increment to every rowid (now a proxy for property) to tally_board
Step 6. looks like
sql = """UPDATE tally_board
SET %s = %s + 1
WHERE rowid IN %s""" %(type_name, type_name, tuple(target_rows))
cur.execute
The problems with this are:
It's still slow!
It manifests a race condition in parallel processing that introduces duplicates in the property column whenever threadA starts step 2 right before threadB completes step 6.
A solution to the race condition is to give steps 2-6 an exclusive lock on the db, though it doesn't look like reads can acquire one.
Another attempt uses a genuine UPSERT to increment preexisting property rows AND insert (and increment) new property rows in one fell swoop.
There may be luck in something like this but I'm unsure how to rewrite it to increment the tally.
How about a change of table schema? Instead of a column per type, have a type column. Then you have unique rows identified by property and type, like this:
|rowid|prop    |type |count|
============================
|1    |prop_foo|typeA|215  |
|2    |prop_foo|typeB|456  |
This means you can enter a transaction for each and every property of each and every file separately and let sqlite worry about races. So for each property you encounter, immediately issue a complete transaction that computes the next total and upserts the record identified by the property name and file type.
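A hedged sketch of such a per-property transaction, assuming SQLite 3.24+ (which added ON CONFLICT ... DO UPDATE) and a UNIQUE constraint on (prop, type):

import sqlite3

conn = sqlite3.connect("tally.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS tally_board (
        prop  TEXT NOT NULL,
        type  TEXT NOT NULL,
        count INTEGER NOT NULL DEFAULT 0,
        UNIQUE (prop, type)
    )""")

def bump(prop, file_type):
    # each call is one small transaction; SQLite serialises concurrent writers
    with conn:
        conn.execute(
            """INSERT INTO tally_board (prop, type, count) VALUES (?, ?, 1)
               ON CONFLICT (prop, type) DO UPDATE SET count = count + 1""",
            (prop, file_type),
        )

bump("prop_foo", "typeA")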
The following sped things up immensely:
Writing to SQLite less often. Holding most of my intermediate results in memory and then updating the DB with them every 50k files cut the execution time to about a third (35 hours to 11.5 hours); a rough sketch of that pattern follows below.
Moving the data onto my PC (for some reason my USB 3.0 port was transferring data well below USB 2.0 rates). This cut the execution time to about a fifth (11.5 hours to 2.5 hours).
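A rough sketch of the batching in the first point, reusing the (prop, type, count) layout and upsert from the sketch above; the flush interval is from the description, while all_files, read_properties and file_type_of are placeholder helpers, not real functions:

from collections import Counter
import sqlite3

FLUSH_EVERY = 50_000
tally = Counter()                      # (prop, file_type) -> pending increments

def flush(conn, tally):
    with conn:                         # one transaction per flush
        conn.executemany(
            """INSERT INTO tally_board (prop, type, count) VALUES (?, ?, ?)
               ON CONFLICT (prop, type) DO UPDATE
               SET count = count + excluded.count""",
            [(p, t, n) for (p, t), n in tally.items()],
        )
    tally.clear()

conn = sqlite3.connect("tally.db")
for i, path in enumerate(all_files, 1):        # all_files: your own file walker
    for prop in read_properties(path):         # read_properties: your parser
        tally[(prop, file_type_of(path))] += 1
    if i % FLUSH_EVERY == 0:
        flush(conn, tally)
flush(conn, tally)                             # write out the final partial batch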
I am optimising my code and reducing the number of queries. These used to be in a loop, but I am trying to restructure my code so it is done like this. How do I get the second query working so that it uses the id generated by the first query for each row? Assume that the datasets are in the right order too.
self.c.executemany("INSERT INTO nodes (node_value, node_group) values (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)", new_values)
#my problem is here
new_id = self.c.lastrowid
connection_values.append((node_id, new_id))
#insert entry
self.c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?,?,1)", connection_values)
These queries used to be in a for loop but were taking too long, so I am trying to avoid running the query individually for each row. I believe there might be a way to combine this into one query, but I am unsure how it would be done.
You will need to either insert rows one at a time or read back the rowids that were picked by SQLite's ID assignment logic; as documented in Autoincrement in SQLite, there is no guarantee that the IDs generated will be consecutive and trying to guess them in client code is a bad idea.
You can do this implicitly if your program is single-threaded as follows:
Set the AUTOINCREMENT keyword in your table definition. This will guarantee that any generated row IDs will be higher than any that appear in the table currently.
Immediately before the first statement, determine the highest ROWID in use in the table.
oldmax ← Execute("SELECT max(ROWID) from nodes").
Perform the first insert as before.
Read back the row IDs that were actually assigned with a select statement:
NewNodes ← Execute("SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", oldmax) .
Construct the connection_values array by combining the parent ID from new_values and the child ID from NewNodes.
Perform the second insert as before.
This may or may not be faster than your original code; AUTOINCREMENT can slow down performance, and without actually doing the experiment there's no way to tell.
If your program is writing to nodes from multiple threads, you'll need to guard this algorithm with a mutex as it will not work at all with multiple concurrent writers.
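A hedged sqlite3 sketch of that read-back algorithm, using the table and column names from the question and assuming nodes was created with AUTOINCREMENT and that new_values holds (node_value, parent_node_id) pairs:

import sqlite3

def insert_nodes_and_connections(conn, new_values):
    cur = conn.cursor()
    # highest rowid currently in use (0 for an empty table)
    oldmax = cur.execute("SELECT ifnull(max(rowid), 0) FROM nodes").fetchone()[0]

    cur.executemany(
        "INSERT INTO nodes (node_value, node_group) VALUES "
        "(?, (SELECT node_group FROM nodes WHERE node_id = ?) + 1)",
        new_values,
    )

    # read back the rowids SQLite actually assigned, in insertion order
    new_ids = [r[0] for r in cur.execute(
        "SELECT rowid FROM nodes WHERE rowid > ? ORDER BY rowid ASC", (oldmax,))]

    # pair each parent id from new_values with the child id just assigned
    connection_values = [(parent_id, child_id)
                         for (_, parent_id), child_id in zip(new_values, new_ids)]
    cur.executemany(
        "INSERT INTO connections (parent, child, strength) VALUES (?, ?, 1)",
        connection_values,
    )
    conn.commit()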