Slow saving of big files into the database (Python)

I have to parse data from XML files and each file has over 170k lines. The structure is following:
<el1>
<el2>
<el3>
<el3-1>...</el3-1>
<el3-2>...</el3-2>
</el3>
<el4>
<el4-1>...</el4-1>
<el4-2>...</el4-2>
</el4>
<el5>
<el5-1>...</el5-1>
<el5-2>...</el5-2>
</el5>
</el2>
</el1>
Basically I need to save the attributes from the "el" tags, and each tag corresponds to a table in the database. So I take the attribute values from el1, insert them into its table, and so on. The thing is, there are a lot of el3, el4, el5, ... tags, and I think calling INSERT after each of them slows the whole process down, because one file takes about 3 minutes to be saved into the database.
I thought I could solve this by collecting the values in a list and, once the file is processed, running a single INSERT to save them all. The problem is that before each INSERT I select the last inserted id of the previous element and use it in the next element's INSERT as its foreign key. If I buffer the values in a list, I can't get that last id from the database, because the inserts only happen at the end of the file.
I don't know whether the large number of INSERT commands is really what slows the whole process down; that's just my guess.

If you haven't already done so, run all the INSERT statements in a single transaction; that makes a big difference.
For a better answer, provide more detail: your code, your system, and how many transactions per second you are currently reaching.
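As a rough illustration of the single-transaction idea (a sketch only: it assumes SQLite via Python's sqlite3 module and made-up table and column names, since the original schema isn't shown), buffering the work per file and committing once still lets you use cursor.lastrowid for the foreign keys:
import sqlite3

conn = sqlite3.connect("data.db")  # assumption: SQLite; the actual DB engine isn't stated
cur = conn.cursor()

def save_file(parsed_elements):
    # One transaction per file: everything below is committed in a single step.
    for el1 in parsed_elements:  # hypothetical structure produced by the XML parser
        cur.execute("INSERT INTO el1_table (attr) VALUES (?)", (el1["attr"],))
        el1_id = cur.lastrowid  # available immediately, even before the commit
        for child in el1["children"]:
            cur.execute(
                "INSERT INTO child_table (el1_id, attr) VALUES (?, ?)",
                (el1_id, child["attr"]),
            )
    conn.commit()  # a single commit per file instead of one per INSERT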

Related

best way to upsert 300 million entries into postgres?

I have a new csv file every day with 400 million+ entries which I need to upsert into my database (3 tables with 2 foreign keys, indexed). The majority of the entries are already in the table, in which case I need to update a column. Some entries, which are not already in the table need to be inserted.
I tried to insert the CSV each day into a temptable then run:
INSERT INTO restaurants (name, food_id, street_id, datecreated, lastdayobservedopen)
SELECT DISTINCT temptable.name, typesoffood.food_id, location.street_id,
       temptable.datecreated, temptable.lastdayobservedopen
FROM temptable
INNER JOIN typesoffood ON typesoffood.food_type = temptable.food_type
INNER JOIN location ON location.street_name = temptable.street_name
ON CONFLICT ON CONSTRAINT restaurants_pk
DO UPDATE SET lastdayobservedopen = EXCLUDED.lastdayobservedopen
But it takes over 6 hrs.
Is it possible to make this faster?
Edit:
Some more details: 3 tables:
restaurants(name, food_id, street_id, datecreated, lastdayobservedopen) with pk (name, street_id) and fks (food_id and street_id);
typesoffood(food_id, food_type) with pk (food_id) and an index on food_type;
location(street_id, street_name) with pk (street_id) and an index on street_name.
As for the csv file, I don't know which entries are new or old, but I do know that the majority of the entries are already in the database, which would require me to update the lastdayobserved date. The rest are to be inserted with the lastdayobserved date as today. This is supposed to help distinguish between restaurants that are no longer in operation (in which case their lastdayobserved column would not be updated) and currently operating restaurants, whose date in that column should always match today's date. Open to more efficient schema suggestions as well. Thanks to all!
There is a SQL feature called BULK INSERT that can handle large volumes of data:
bulk insert #temp
from "file location path"
If you can change your Postgres settings, you could take advantage of parallelism in Postgres. Otherwise, you could at least speed up the csv upload using Postgres's bulk upload, otherwise known as the COPY command.
Without more details it's hard to give better advice.
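As a rough sketch of the COPY-based upload (assuming psycopg2 and the temptable layout from the question; the connection string, file name and column list are guesses), the daily CSV can be loaded like this before running the INSERT ... ON CONFLICT statement:
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

# Bulk-load the daily CSV into the staging table with COPY,
# which is far faster than row-by-row INSERTs.
with open("daily_restaurants.csv") as f:  # hypothetical file name
    cur.copy_expert(
        "COPY temptable (name, food_type, street_name, datecreated, lastdayobservedopen) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
conn.commit()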

Using the Python dbf package 0.96.8, can I insert a blank record at any position within a DBF file?

I have been stuck on this problem for a few days and cannot find a solution so far.
Not easily.
To answer the question asked:
The only way to physically insert a record (blank or otherwise) at a particular location would be to create a new dbf, copy all the records that go before the new one over, then create the new one, then copy all the following records, and finally delete the original file and rename the new one to take its place.
To answer the question I guess you were answering:
You can create temporary, in-memory indexes and use those to iterate over your records:
import dbf

table = dbf.Table('sometablehere.dbf')
member_order = table.create_index(lambda record: record.member_no)
for record in member_order:
    print(record)  # this happens in member_no order

DynamoDB Querying in Python (Count with GroupBy)

This may be trivial, but I loaded a local DynamoDB instance with 30GB worth of Twitter data that I aggregated.
The primary key is id (tweet_id from the Tweet JSON), and I also store the date/text/username/geocode.
I basically am interested in mentions of two topics (let's say "Bees" and "Booze"). I want to get a count of each of those by state by day.
So by the end, I should know for each state, how many times each was mentioned on a given day. And I guess it'd be nice to export that as a CSV or something for later analysis.
Some issues I had with doing this...
First, the geocode info is a tuple of [latitude, longitude] so for each entry, I need to map that to a state. That I can do.
Second, is the most efficient way to do this to go through each entry, manually check whether it contains a mention of either keyword, and then keep a dictionary for each keyword that maps date/location to a count?
EDIT:
Since it took me 20 hours to load all the data into my table, I don't want to delete and re-create it. Perhaps I should create a global secondary index (?) and use that to search other fields in a query? That way I don't have to scan everything. Is that the right track?
EDIT 2:
Well, since the table is on my computer locally I should be OK with just using expensive operations like a Scan right?
So if I did something like this:
from boto3.dynamodb.conditions import Attr

query = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='id, text, date, geo',
    Limit=100)
And did one scan for each keyword, then I would be able to go through the resulting filtered list and get a count of mentions of each topic for each state on a given day, right?
EDIT3:
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    Limit=100)
# do something with this set
while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('text').contains("Booze"),
        Limit=100,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    # do something with each batch of 100 entries
So something like that, for both keywords. That way I'll be able to go through the resulting filtered set and do what I want (in this case, figure out the location and day and create a final dataset with that info). Right?
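To make that last step concrete, here is a rough sketch of the per-state, per-day counting (the latlon_to_state helper and the item field layout are assumptions based on the description above, not a definitive implementation):
from collections import Counter

counts = Counter()  # keyed by (state, day)

def tally(items):
    for item in items:
        state = latlon_to_state(item['geo'])  # hypothetical helper mapping [lat, lon] to a state
        day = item['date'][:10]               # assumes an ISO-8601 date string
        counts[(state, day)] += 1

# Call tally(response['Items']) for each batch returned by the scan loop above,
# then dump `counts` to CSV for later analysis.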
EDIT 4
If I add:
ProjectionExpression='date, location, user, text'
into the scan request, I get an error saying "botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the Scan operation: Invalid ProjectionExpression: Attribute name is a reserved keyword; reserved keyword: location". How do I fix that?
NVM I got it. Answer is to look into ExpressionAttributeNames (see: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ExpressionPlaceholders.html)
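For reference, a minimal sketch of that fix; aliasing every attribute name is harmless even for the ones that aren't actually reserved:
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    # Alias the attribute names so reserved words such as `location` are accepted.
    ProjectionExpression='#d, #loc, #u, #t',
    ExpressionAttributeNames={
        '#d': 'date',
        '#loc': 'location',
        '#u': 'user',
        '#t': 'text',
    },
    Limit=100,
)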
Yes, scanning the table for "Booze" and counting the items in the result should give you the total count. Please note that you need to repeat the scan until LastEvaluatedKey is no longer returned.
Refer to ExclusiveStartKey and the Scan documentation as well.
EDIT:
Yes, the code looks good. One thing to note: the result set won't always contain 100 items. Please refer to the Limit definition below (it is not the same as LIMIT in a SQL database).
Limit — (Integer) The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.

Profile Millions of Text Files In Parallel Using An Sqlite Counter?

A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type spec has had enhancements such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decade's long evolution of these file types it makes sense to inspect all 14 million of them iteratively, calmly, but before dying beneath their crushing weight.
I built a running counter such that every time I see properties (familiar or not) I add 1 to its tally. The sqlite tally board looks like this:
In the special event I see an unfamiliar property I add them to the tally. On a typeA file that looks like:
I've got this system down! But it's slow: about 3M files per 36 hours in one process. Originally I was using this trick to pass SQLite a list of properties needing an increment.
placeholder = '?'  # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join(placeholder for dummy_var in properties)
sql = """UPDATE tally_board
SET %s = %s + 1
WHERE property IN (%s)""" % (type_name, type_name, placeholders)
cursor.execute(sql, properties)
I learned that's a bad idea because
sqlite string search is much slower than indexed search
several hundreds of properties (some 160 characters long) make for really long sql queries
using %s instead of ? is bad security practice... (not a concern ATM)
A "fix" was to maintain a script side property-rowid hash of the tally used in this loop:
Read file for new_properties
Read tally_board for rowid, property
Generate script side client_hash from 2's read
Write rows to tally_board for every new_property not in property (nothing incremented yet). Update client_hash with new properties
Lookup rowid for every row in new_properties using the client_hash
Write increment to every rowid (now a proxy for property) to tally_board
Step 6. looks like
sql = """UPDATE tally_board
SET %s = %s + 1
WHERE rowid IN %s""" %(type_name, type_name, tuple(target_rows))
cur.execute
The problem with this is
It's still slow!
It manifests a race condition in parallel processing that introduces duplicates in the property column whenever threadA starts step 2 right before threadB completes step 6.
A solution to the race condition is to give steps 2-6 an exclusive lock on the db, though it doesn't look like reads can acquire those locks.
Another attempt uses a genuine UPSERT to increment preexisting property rows AND insert (and increment) new property rows in one fell swoop.
There may be luck in something like this but I'm unsure how to rewrite it to increment the tally.
How about a change of table schema? Instead of a column per type, have a type column. Then you have unique rows identified by property and type, like this:
|rowid|prop    |type |count|
============================
|1    |prop_foo|typeA|215  |
|2    |prop_foo|typeB|456  |
This means you can enter a transaction for each and every property of each and every file separately and let sqlite worry about races. So for each property you encounter, immediately issue a complete transaction that computes the next total and upserts the record identified by the property name and file type.
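A minimal sketch of that per-property transaction, assuming the restructured table has a UNIQUE constraint on (prop, type) and a SQLite new enough (3.24+) to support ON CONFLICT ... DO UPDATE:
import sqlite3

conn = sqlite3.connect("tally.db", isolation_level=None)  # autocommit: each statement is its own transaction
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS tally_board (
                   prop  TEXT,
                   type  TEXT,
                   count INTEGER DEFAULT 0,
                   UNIQUE (prop, type))""")

def bump(prop, file_type):
    # Insert the row if it is new, otherwise increment the existing tally.
    cur.execute(
        """INSERT INTO tally_board (prop, type, count) VALUES (?, ?, 1)
           ON CONFLICT (prop, type) DO UPDATE SET count = count + 1""",
        (prop, file_type),
    )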
The following sped things up immensely:
Wrote less often to SQLite. Holding most of my intermediate results in memory then updating the DB with them every 50k files resulted in about a third of the execution time (35 hours to 11.5 hours)
Moving data onto my PC (for some reason my USB3.0 port was transferring data well below USB2.0 rates). This resulted in about a fifth of the execution time (11.5 hours to 2.5 hours).
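As a rough sketch of the "write less often" approach (read_properties and file_type_of are hypothetical helpers; the flush uses the prop/type/count schema suggested above), tallies are buffered in a Counter and pushed to SQLite every 50,000 files:
from collections import Counter

FLUSH_EVERY = 50000
pending = Counter()  # (prop, type) -> increments accumulated in memory

def flush(cur, conn):
    cur.executemany(
        """INSERT INTO tally_board (prop, type, count) VALUES (?, ?, ?)
           ON CONFLICT (prop, type) DO UPDATE SET count = count + excluded.count""",
        [(prop, ftype, n) for (prop, ftype), n in pending.items()],
    )
    conn.commit()
    pending.clear()

def process(files, cur, conn):
    for i, path in enumerate(files, 1):
        ftype = file_type_of(path)            # hypothetical: typeA / typeB / typeC detection
        for prop in read_properties(path):    # hypothetical parser for a single file
            pending[(prop, ftype)] += 1
        if i % FLUSH_EVERY == 0:
            flush(cur, conn)
    flush(cur, conn)                          # flush the remainder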

select a random row from cassandra

I have the following table:
CREATE TABLE prosfiles (
name_file text,
beginpros timestamp,
humandate timestamp,
lastpros timestamp,
originalname text,
pros int,
uploaded int,
uploader text,
PRIMARY KEY (name_file)
)
CREATE INDEX prosfiles_pros_idx ON prosfiles (pros);
In this table I keep the locations of several csv files which are processed by a Python script. Since I have several scripts running at the same time processing those files, I use this table to keep control and avoid two scripts starting to process the same file at the same time (in the 'pros' column, 0 means the file has not been processed, 1 is for processed files, and 1010 is for files that are currently being processed by another script).
Each script runs the following query to pick the file to process:
"select name_file from prosfiles where pros = 0 limit 1"
but this always returns the first row matching that condition.
I would like to run a query that returns a random row from all the ones with pros = 0.
In MySQL I've used "order by rand()", but in Cassandra I don't know how to randomly sort the results.
It looks like you're using Cassandra as a queue, and that's not the best usage pattern for it; use RabbitMQ/SQS/any other queue service. Also, Cassandra does not support sorting at all, and that is by design:
sorting would require a lot of computation inside the database if you are trying to sort 1B rows.
sorting is not an easy task in a distributed environment: you have to ask all nodes holding the data to perform it.
But if you know what you are doing, you can revisit your database schema to be more suitable for this type of workload:
split your source table into two different tables: the first one with the full file information and the second one with the queue itself, containing only the ids of files to process.
your worker process reads a random row from the queue table (see below how to read a ~random row from Cassandra by primary key).
the worker deletes the target id from the queue and updates your targets table with processing information.
This way of doing things can still lead to problems:
multiple workers can get the same target at once.
if you have a lot of workers and targets, Cassandra's compaction process will kill the performance of your DIY queue.
To read a pseudo-random row from a table by its primary key you can use this query: select * from some_table where token(id_column) > some_random_long_value limit 1, but it also has its cons:
if you have a small set of targets, it will sporadically return an empty result because your some_random_long_value will be higher than the token of any existing key.
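For illustration, a small sketch of that token trick against the prosfiles table using the Python cassandra-driver (the keyspace name and the wrap-around fallback are assumptions, and it omits the pros = 0 filter for brevity):
import random
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('mykeyspace')  # hypothetical keyspace name

def pick_pseudo_random_file():
    # Murmur3 tokens are signed 64-bit values, so draw a random one.
    rand_token = random.randint(-2**63, 2**63 - 1)
    row = session.execute(
        "SELECT name_file, pros FROM prosfiles "
        "WHERE token(name_file) > %s LIMIT 1",
        (rand_token,),
    ).one()
    if row is None:
        # Landed past the highest token: wrap around and take the first row instead.
        row = session.execute("SELECT name_file, pros FROM prosfiles LIMIT 1").one()
    return row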
