Compress XML column in Sqlite with Python is SLOW! - python

I'm new to Python and Sqlite, so I'm sure there's a better way to do this. I have a DB with 6000 rows, where 1 column is a 14K XML string. I wanted to compress all those XML strings to make the DB smaller. Unfortunately, the script below is much, much slower than this simple command line (which takes a few seconds).
sqlite3 weather.db .dump | gzip -c > backup.gz
I know it's not the same thing, but it does read/convert the DB to text and run gzip. So I was hoping this script would be within 10X performance, but it is more like 1000X slower. Is there a way to make the following script more efficient? Thanks.
import zlib, sqlite3
conn = sqlite3.connect(r"weather.db")
r = conn.cursor()
w = conn.cursor()
rows = r.execute("select date,location,xml_data from forecasts")
for row in rows:
data = zlib.compress(row[2])
w.execute("update forecasts set xml_data=? where date=? and location=?", (data, row[0], row[1]))
conn.commit()
conn.close()

Not sure you can increase the performance by doing an update after the fact. there's too much overhead between doing the compress and updating the record. you won't gain any space savings unless you do a vacuum after you're done with the updates. the best solution would probably be to do the compress when the records are first inserted. then you get the space savings and the performance hit won't likely be as noticeable. if you can't do it on the insert, then i think you've explored the two possibilities and seen the results.

You are comparing apples to oranges here. The big difference between the sqlite3|gzip and python version is that the later writes the changes back to the DB!
What sqlite3|gzip does is:
read the db
gzip the text
in addition to the above the python version writes the gzipped text back into the db with one UPDATE per read record.

Sorry, but are you implicitly starting a transaction in your code? If you're autocommitting after each UPDATE that will slow you down substantially.
Do you have an appropriate index on date and/or location? What kind of variation do you have in those columns? Can you use an autonumbered integer primary key in this table?
Finally, can you profile how much time you're spending in the zlib calls and how much in the UPDATEs? In addition to the database writes that will slow this process down, your database version involves 6000 calls (with 6000 initializations) of the zip algorithm.

Related

How to check the number of rows i have inserted into a table while the insertion is still ongoing

I have a dataframe with 4 million rows and 53 columns. I am trying to write the dataframe to an oracle table. see below a snipet of my code in python;
import pandas as pd
import cx_Oracle
conn = (--------------)
df = pd.read_sql(------)
#write to oracle table
df.to_sql(---)
This code has been running for over three days now with no end in sight. Please how can i get the progress of the insertion?
PS: My connection is working well and i already confirmed that the "to_sql()" is working cos i tried it on a dataframe with 10 rows and it worked.
Edited:
Thanks everyone, this link helped.
Did explicit conversion of the str and my code executed in 26mins!
You can check if the insert is already running or it is still the pd.read_sql step by if there is a table lock on your target table. With some bad luck you are still loading the data from the database. Here you should check if it's faster to use push down technology. Getting all data out of a database to insert it back can be slow sometimes.
Checking the sessions will not help because you are probably inserting the rows 1 by 1 ...
I don't know if any of undo/redo or final segments for the table is filled during the insert. But it seems that your oracle database is quite big, so probably someone can help you.
Next time you should use some additional parameters when doing such a big operation.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html#pandas.read_sql
-> Read by iterating over data ... reading the total 4 GB Data should be possible but splitting in parts will be faster.
think of using parameter chunksize
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql
As written by #lekz use chunksize, but when doing many insert you should also
think of using parameter method ('multi' or callable) + chunk size
this should increase the speed as well.

How to reduce time it takes to append to SQL database in python

I want to append about 700 millions rows and 2 columns to a database. Using the code below:
disk_engine = create_engine('sqlite:///screen-user.db')
chunksize = 1000000
j = 0
index_start = 1
for df in pd.read_csv('C:/Users/xxx/Desktop/jjj.tsv', chunksize=chunksize, header = None, names=['screen','user'],sep='\t', iterator=True, encoding='utf-8'):
df.to_sql('data', disk_engine, if_exists='append')
count = j*chunksize
print(count)
print(j)
It is taking a really long time (I estimate it would take days). Is there a more efficient way to do this? In R, I have have been using the data.table package to load large data sets and it only take 1 minute. Is there a similar package in Python? As a tangential point, I want to also physically store this file on my Desktop. Right now, I am assuming 'data' is being stored as a temporary file. How would I do this?
Also assuming I load the data into a database, I want the queries to execute in a minute or less. Here is some pseudocode of what I want to do using Python + SQL:
#load data(600 million rows * 2 columns) into database
#def count(screen):
#return count of distinct list of users for a given set of screens
Essentially, I am returning the number of screens for a given set of users.Is the data too big for this task? I also want to merge this table with another table. Is there a reason why the fread function in R is much faster?
If your goal is to import data from your TSV file into SQLite, you should try the native import functionality in SQLite itself. Just open the sqlite console program and do something like this:
sqlite> .separator "\t"
sqlite> .import C:/Users/xxx/Desktop/jjj.tsv screen-user
Don't forget to build appropriate indexes before doing any queries.
As #John Zwinck has already said, you should probably use native RDBMS's tools for loading such amount of data.
First of all I think SQLite is not a proper tool/DB for 700 millions rows especially if you want to join/merge this data afterwards.
Depending of what kind of processing you want to do with your data after loading, I would either use free MySQL or if you can afford having a cluster - Apache Spark.SQL and parallelize processing of your data on multiple cluster nodes.
For loading you data into MySQL DB you can and should use native LOAD DATA tool.
Here is a great article showing how to optimize data load process for MySQL (for different: MySQL versions, MySQL options, MySQL storage engines: MyISAM and InnoDB, etc.)
Conclusion: use native DB's tools for loading big amount of CSV/TSV data efficiently instead of pandas, especially if your data doesn't fit into memory and if you want to process (join/merge/filter/etc.) your data after loading.

Efficient way to aggregate and remove duplicates from very large (password) lists

Context:
I am attempting to combine a large amount of separate password list text files into a single file for use in dictionary based password cracking.
Each text file is line delimited (a single password per line) and there are 82 separate files at the moment. Most (66) files are in the 1-100Mb filesize range, 12 are 100-700Mb, 3 are 2Gb, and 1 (the most problematic) is 11.2Gb.
In total I estimate 1.75 billion non-unique passwords need processing; of these I estimate ~450 million (%25) will be duplicates and ultimately need to be discarded.
I am attempting to do this on a device which has a little over 6Gb of RAM free to play with (i.e. 8Gb with 2Gb already consumed).
Problem:
I need a way to a) aggregate all of these passwords together and b) remove exact duplicates, within my RAM memory constrains and within a reasonable (~7 days, ideally much less but really I don't care if it takes weeks and then I never need to run it again) time window.
I am a competent Python programmer and thus gave it a crack several times already. My most successful attempt used sqlite3 to store processed passwords on the hard disk as it progressed. However this meant that keeping track of which files had already been completed between processing instances (I cancelled and restarted several times to make changes) was tediously achieved by hashing each completed file and maintaining/comparing these each time a new file was opened. For the very large files however, any progress would be lost.
I was processing the text files in blocks of ~1 billion (at most) lines at a time to prevent memory exhaustion without having no feedback for extended periods of time. I know that I could, given a lot of time populate my database fully as I achieved a DB filesize of ~4.5Gb in 24 hours of runtime so I estimate that left to run it would take about 4 days at most to get through everything, but I don't know if/how to most efficiently read/write to it nor do I have any good ideas on how to tackle the removing of duplicates (do it as I am populating the DB or make additional passes afterwards...? Is there a much faster means of doing lookups for uniqueness in a database configuration I don't know about?).
My request here today is for advice / solutions to a programming and optimisation approach on how to achieve my giant, unique password list (ideally with Python). I am totally open to taking a completely different tack if I am off the mark already.
Two nice to haves are:
A way to add more passwords in the future without having to rebuild the whole list; and
A database < 20Gb at the end of all this so that it isn't a huge pain to move around.
Solution
Based on CL's solution which was ultimately a lot more elegant than what I was thinking I came up with a slightly modified method.
Following CL's advice I setup a sqlite3 DB and fed the text files into a Python script which consumed them and then output a command to insert them into the DB. Straight off the bat this ~did~ work but was extremely (infeasibly) slow.
I solved this by a few simple DB optimisations which was much easier to implement and frankly cleaner to just do all from the core Python script included below which builds upon CL's skeleton code. The fact that the original code was generating sooooooo many I/O operations was causing something funny on my (Win7) OS causing BSODs and lost data. I solved this by making the insertion of a whole password file one SQL transaction plus a couple of pragma changes. In the end the code runs at about 30,000 insertions / sec which is not the best but is certainly acceptable for my purposes.
It may be the case that this will still fail on the largest of files but if/when that is the case, I will simply chunk the file down into smaller 1Gb portions and consume them individually.
import sys
import apsw
i = 0
con = apsw.Connection("passwords_test.db")
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS Passwords(password TEXT PRIMARY KEY) WITHOUT ROWID;")
cur.execute("PRAGMA journal_mode = MEMORY;")
cur.execute("PRAGMA synchronous = OFF;")
cur.execute("BEGIN TRANSACTION")
for line in sys.stdin:
escaped = line.rstrip().replace("'", "''")
cur.execute("INSERT OR IGNORE INTO Passwords VALUES(?);", (escaped,))
i += 1
if i % 100000 == 0: # Simple line counter to show how far through a file we are
print i
cur.execute("COMMIT")
con.close(True)
This code is then run from command line:
insert_passwords.py < passwordfile1.txt
And automated by:
for %%f in (*.txt) do (
insert_passwords.py < %%f
)
All in all, the DB file itself is not growing too quickly, the insertion rate is sufficient, I can break/resume operations at the drop of a hat, duplicate values are being accurately discarded, and the current limiting factor is the lookup speed of the DB not the CPU or disk space.
When storing the passwords in an SQL database, being able to detect duplicates requires an index.
This implies that the passwords are stored twice, in the table and in the index.
However, SQLite 3.8.2 or later supports WITHOUT ROWID tables (called "clustered index" or "index-organized tables" in other databases), which avoid the separate index for the primary key.
There is no Python version that already has SQLite 3.8.2 included.
If you are not using APSW, you can still use Python to create the SQL commands:
Install the newest sqlite3 command-line shell (download page).
Create a database table:
$ sqlite3 passwords.db
SQLite version 3.8.5 2014-06-02 21:00:34
Enter ".help" for usage hints.
sqlite> CREATE TABLE MyTable(password TEXT PRIMARY KEY) WITHOUT ROWID;
sqlite> .exit
Create a Python script to create the INSERT statements:
import sys
print "BEGIN;"
for line in sys.stdin:
escaped = line.rstrip().replace("'", "''")
print "INSERT OR IGNORE INTO MyTable VALUES('%s');" % escaped
print "COMMIT;"
(The INSERT OR IGNORE statement will not insert a row if a duplicate would violate the primary key's unique constraint.)
Insert the passwords by piping the commands into the database shell:
$ python insert_passwords.py < passwords.txt | sqlite3 passwords.db
There is no need to split up input files; fewer transaction have less overhead.

What is the best way to loop through and process a large (10GB+) text file?

I need to loop through a very large text file, several gigabytes in size (a zone file to be exact). I need to run a few queries for each entry in the zone file, and then store the results in a searchable database.
My weapons of choice at the moment, mainly because I know them, are Python and MySQL. I'm not sure how well either will deal with files of this size, however.
Does anyone with experience in this area have any suggestions on the best way to open and loop through the file without overloading my system? How about the most efficient way to process the file once I can open it (threading?) and store the processed data?
You shouldn't have any real trouble storing that amount of data in MySQL, although you will probably not be able to store the entire database in memory, so expect some IO performance issues. As always, make sure you have the appropriate indices before running your queries.
The most important thing is to not try to load the entire file into memory. Loop through the file, don't try to use a method like readlines which will load the whole file at once.
Make sure to batch the requests. Load up a few thousand lines at a time and send them all in one big SQL request.
This approach should work:
def push_batch(batch):
# Send a big INSERT request to MySQL
current_batch = []
with open('filename') as f:
for line in f:
batch.append(line)
if len(current_batch) > 1000:
push_batch(current_batch)
current_batch = []
push_batch(current_batch)
Zone files are pretty normally formatted, consider if you can get away with just using LOAD DATA INFILE. You might also consider creating a named pipe, pushing partially formatted data in to it from python, and using LOAD DATA INFILE to read it in with MySQL.
MySQL has some great tips on optimizing inserts, some highlights:
Use multiple value lists in each insert statement.
Use INSERT DELAYED, particularly if you are pushing from multiple clients at once (e.g. using threading).
Lock your tables before inserting.
Tweak the key_buffer_size and bulk_insert_buffer_size.
The fastest processing will be done in MySQL, so consider if you can get away with doing the queries you need after the data is in the db, not before. If you do need to do operations in Python, threading is not going to help you. Only one thread of Python code can execute at a time (GIL), so unless you're doing something which spends a considerable amount of time in C, or interfaces with external resources, you're only going to ever be running in one thread anyway.
The most important optimization question is what is bounding the speed, there's no point spinning up a bunch of threads to read the file, if the database is the bounding factor. The only way to really know is to try it and make tweaks until it is fast enough for your purpose.
#Zack Bloom's answer is excellent and I upvoted it. Just a couple of thoughts:
As he showed, just using with open(filename) as f: / for line in f is all you need to do. open() returns an iterator that gives you one line at a time from the file.
If you want to slurp every line into your database, do it in the loop. If you only want certain lines that match a certain regular expression, that's easy.
import re
pat = re.compile(some_pattern)
with open(filename) as f:
for line in f:
if not pat.search(line):
continue
# do the work to insert the line here
With a file that is multiple gigabytes, you are likely to be I/O bound. So there is likely no reason to worry about multithreading or whatever. Even running several regular expressions is likely to crunch through the data faster than the file can be read or the database updated.
Personally, I'm not much of a database guy and I like using an ORM. The last project I did database work on, I was using Autumn with SQLite. I found that the default for the ORM was to do one commit per insert, and it took forever to insert a bunch of records, so I extended Autumn to let you explicitly bracket a bunch of inserts with a single commit; it was much faster that way. (Hmm, I should extend Autumn to work with a Python with statement, so that you could wrap a bunch of inserts into a with block and Autumn would automatically commit.)
http://autumn-orm.org/
Anyway, my point was just that with database stuff, doing things the wrong way can be very slow. If you are finding that the database inserting is your bottleneck, there might be something you can do to fix it, and Zack Bloom's answer contains several ideas to start you out.

is there a limit to the (CSV) filesize that a Python script can read/write?

I will be writing a little Python script tomorrow, to retrieve all the data from an old MS Access database into a CSV file first, and then after some data cleansing, munging etc, I will import the data into a mySQL database on Linux.
I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.
The db has IIRC well over half a million rows of data. My questions are:
Is the number of records a cause for concern? (i.e. Will I hit some limits)?
Is there a better file format for the transitory data (instead of CSV)?
I chose CSv because it is quite simple and straightforward (and I am a Python newbie) - but
I would like to hear from someone who may have done something similar before.
Memory usage for csvfile.reader and csvfile.writer isn't proportional to the number of records, as long as you iterate correctly and don't try to load the whole file into memory. That's one reason the iterator protocol exists. Similarly, csvfile.writer writes directly to disk; it's not limited by available memory. You can process any number of records with these without memory limitations.
For simple data structures, CSV is fine. It's much easier to get fast, incremental access to CSV than more complicated formats like XML (tip: pulldom is painfully slow).
Yet another approach if you have Access available ...
Create a table in MySQL to hold the data.
In your Access db, create an ODBC link to the MySQL table.
Then execute a query such as:
INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;
Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.
I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
The only limit should be operating system file size.
That said, make sure when you send the data to the new database, you're writing it a few records at a time; I've seen people do things where they try to load the entire file first, then write it.

Categories