I need to read data from a huge table (>1 million rows, 16 columns of raw text) and do some processing on it. Reading it row by row (Python, MySQLdb) is very slow, and I would like to read multiple rows at a time (and possibly parallelize the work).
Just FYI, my code currently looks something like this:
cursor.execute('select * from big_table')
rows = int(cursor.rowcount)
for i in range(rows):
    row = cursor.fetchone()
    # ... do processing ...
I tried to run multiple instances of the program to iterate over different sections of the table (for example, the 1st instance would iterate over the first 200k rows, the 2nd instance over rows 200k-400k, and so on), but the problem is that the 2nd instance (and 3rd instance, etc.) takes FOREVER to get to the stage where it starts looking at row 200k onwards. It almost seems like it is still doing the processing of the first 200k rows instead of skipping over them. The code I use (for the 2nd instance) in this case is something like:
for i in range(rows):
    # Fetch the row but do nothing (need to skip over the first 200k rows)
    row = cur.fetchone()
    if i not in range(200000, 400000):
        continue
    # ... do processing ...
How can I speed up this process? Is there a clean way to do faster/parallel reads from MySQL database through python?
EDIT 1: I tried the "LIMIT" approach based on the suggestions below. For some reason, though, when I start 2 processes on my quad-core server, it seems like only a single process is running at a time (the CPU seems to be time-sharing between these processes, as opposed to each core running a separate process). The 2 Python processes are using 14% and 9% of the CPU, respectively. Any thoughts on what might be wrong?
The LIMIT clause can take two parameters, where the first is the start row and the second is the row count.
SELECT ...
...
LIMIT 200000,200000
You may also run into I/O contention on the DB server (even though you are getting the data in chunks, the disks need to serialize the reads at some level). So, rather than reading from MySQL in parallel, a single read may work better for you.
Rather than reading 200K rows at a time, you could dump the whole of the data in one hit and process it (possibly in parallel) in memory, in Python.
Potentially, you could use something like psycopg2's copy_expert() (though that is PostgreSQL-specific). Alternatively, do a mysqldump to a single file and use csv.reader to iterate over it (or over sections of it if you're processing in parallel).
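A minimal sketch of that dump-then-process approach, assuming the table has already been exported to a tab-separated file (the file name and the process_row function are placeholders for your own data and logic):

import csv
from multiprocessing import Pool

def process_row(row):
    # placeholder for the real per-row processing
    return len(row)

def iter_rows(path):
    # stream the dumped file instead of loading it all into memory
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter='\t'):
            yield row

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # a large chunksize keeps inter-process overhead low for millions of small rows
        for result in pool.imap_unordered(process_row, iter_rows('big_table_dump.tsv'), chunksize=10000):
            pass  # aggregate results here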
You're exactly right that your attempt to parallelize the second chunk is requesting the first 200k records before it begins processing. You need to use the LIMIT keyword to ask the server to return different results:
select * from big_table LIMIT 0,200000
...
select * from big_table LIMIT 200000,200000
...
select * from big_table LIMIT 400000,200000
...
And so on. Pick the numbers however you wish -- but be aware that memory, network, and disk bandwidths might not give you perfect scaling. In fact, I'd be wary of starting more than two or three of these simultaneously.
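If it helps, here is a rough sketch of how those per-slice queries might be driven from Python with multiprocessing (connection parameters, the table name, and the chunk size are placeholders):

import MySQLdb
from multiprocessing import Process

CHUNK = 200000

def process_chunk(offset):
    # each worker opens its own connection and reads only its LIMIT slice
    conn = MySQLdb.connect(host='localhost', user='user', passwd='pass', db='mydb')
    cur = conn.cursor()  # MySQLdb.cursors.SSCursor would stream rows instead of buffering them
    cur.execute('SELECT * FROM big_table LIMIT %s, %s', (offset, CHUNK))
    for row in cur:
        pass  # ... do processing ...
    conn.close()

if __name__ == '__main__':
    workers = [Process(target=process_chunk, args=(i * CHUNK,)) for i in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

Note that a large OFFSET still makes MySQL read and discard the skipped rows, so the later slices take progressively longer to start returning data.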
Related
My code is acquiring data from a sensor and it's performing some operations based on the last N-minutes of data.
At the moment I just initialize a list at the beginning of my code as
x = []
while running:
new_value = acquire_new_point()
x.append(new_value)
do_something_with_x(x)
Despite working, this has some intrinsic flaws that make it sub-optimal. The most important are:
If the code crashes or restart, the whole time-history is reset
There is no record or log of the past time-history
The memory consumption could grow out of control and exceed the available memory
Some obvious solutions exist, such as:
log each new element to a csv file and read it when the code starts
divide the data in N-minutes chunks and drop from the memory chunks that are more than N-minutes old
I have the feeling, however, that this is a problem for which a more specific solution has already been created. My first thought went to HDF5, but I'm not sure it's the best candidate for this problem.
So, what is the best strategy/format for a database that needs to be written (appended to) at run-time and needs to be accessed partially (only the last N minutes)? Is there any tool that is specific for a task like this one?
I'd simply just use SQLite with a two-column table (timestamp, value) with an index on the timestamp.
This has the additional bonus that you can use SQL for data analysis, which will likely be faster than doing it by hand in Python.
If you don't need the historical data, you can periodically DELETE FROM data WHERE TIMESTAMP < ....
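A minimal sketch of that setup, assuming a single sensor value per sample (the file, table, and index names are just placeholders):

import sqlite3
import time

conn = sqlite3.connect('sensor.db')
conn.execute('CREATE TABLE IF NOT EXISTS data (timestamp REAL, value REAL)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_data_timestamp ON data (timestamp)')

def append_point(value):
    # one small transaction per sample, so the history survives crashes/restarts
    with conn:
        conn.execute('INSERT INTO data (timestamp, value) VALUES (?, ?)', (time.time(), value))

def last_n_minutes(n):
    # the index on timestamp turns this into a range scan rather than a full-table scan
    cutoff = time.time() - n * 60
    return conn.execute('SELECT timestamp, value FROM data WHERE timestamp >= ? ORDER BY timestamp',
                        (cutoff,)).fetchall()

def prune_older_than(minutes):
    # optional: keep the database bounded if the historical data is not needed
    cutoff = time.time() - minutes * 60
    with conn:
        conn.execute('DELETE FROM data WHERE timestamp < ?', (cutoff,))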
I have two files. File A contains 1 million records. File B contains approximately 2,000 strings, each on a separate line.
I have a Python script that takes each string in File B in turn and searches for a match in File A. The logic is as follows:
for string in File B:
    for record in File A:
        if record contains string:  # I use regex for this
            write record to a separate file
This is currently running as a single thread of execution and takes a few hours to complete.
I’d like to implement concurrency to speed up this script. What is the best way to approach it? I have looked into multi-threading but my scenario doesn’t seem to represent the producer-consumer problem as my machine has an SSD and I/O is not an issue. Would multiprocessing help with this?
Running such a problem with multiple threads poses a couple of challenges:
We have to go over all of the records in file A to get the job done.
We have to synchronize the writing to the output file, so we don't overwrite each other's records.
I'd suggest:
Assign a single thread just for writing, so your output file won't get messed up.
Open as many threads as your machine can support (n), and give each of them a different 1,000,000/n slice of the records to work on, as in the sketch below.
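A minimal sketch of that layout, assuming file A fits in memory; note that for CPU-bound regex matching the GIL limits how much threads can help, so swapping the threads for processes may work better in practice:

import re
import threading
from queue import Queue

NUM_WORKERS = 4
SENTINEL = None

def writer(out_path, q):
    # a single thread owns the output file, so writes never interleave
    with open(out_path, 'w') as out:
        while True:
            record = q.get()
            if record is SENTINEL:
                break
            out.write(record)

def worker(records, patterns, q):
    # each worker scans its own slice of file A against every pattern
    for record in records:
        if any(p.search(record) for p in patterns):
            q.put(record)

def run(file_a, file_b, out_path):
    with open(file_b) as f:
        patterns = [re.compile(line.strip()) for line in f if line.strip()]
    with open(file_a) as f:
        records = f.readlines()

    q = Queue()
    writer_thread = threading.Thread(target=writer, args=(out_path, q))
    writer_thread.start()

    chunk = len(records) // NUM_WORKERS + 1
    workers = [threading.Thread(target=worker, args=(records[i * chunk:(i + 1) * chunk], patterns, q))
               for i in range(NUM_WORKERS)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

    q.put(SENTINEL)  # tell the writer we are done
    writer_thread.join()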
The processing you want to do requires checking whether any of the 2,000 strings occurs in each of the 1,000,000 records, which amounts to 2,000,000,000 such "checks" in total. There's no way around that. Your current logic with the nested for loops simply iterates over all the possible combinations of items in the two files, one by one, and does the checking (and output-file writing).
You need to determine the way (if any) that this could be accomplished concurrently. For example, you could have "N" tasks, each checking for one string in each of the million records. The outputs from all these tasks represent the desired output and would likely need to be aggregated together into a single file. Since the results will come back in a relatively random order, you may also want to sort them. A rough sketch of that per-string layout follows.
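Here is one possible sketch of that, using a multiprocessing pool where each worker process loads file A once and then handles one pattern per task (file names and the worker count are placeholders):

import re
from multiprocessing import Pool

_records = None

def init_worker(file_a):
    # each worker process loads file A once, instead of once per pattern
    global _records
    with open(file_a) as f:
        _records = f.readlines()

def match_one_pattern(pattern_text):
    # one task = one string from file B checked against every record of file A
    pattern = re.compile(pattern_text)
    return [i for i, record in enumerate(_records) if pattern.search(record)]

def run(file_a, file_b, out_path, workers=4):
    with open(file_b) as f:
        patterns = [line.strip() for line in f if line.strip()]

    with Pool(workers, initializer=init_worker, initargs=(file_a,)) as pool:
        per_pattern_hits = pool.map(match_one_pattern, patterns)

    # aggregate, de-duplicate, and restore the original file order
    hit_indices = sorted({i for hits in per_pattern_hits for i in hits})
    with open(file_a) as f:
        records = f.readlines()
    with open(out_path, 'w') as out:
        out.writelines(records[i] for i in hit_indices)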
I want to append about 700 million rows and 2 columns to a database, using the code below:
import pandas as pd
from sqlalchemy import create_engine

disk_engine = create_engine('sqlite:///screen-user.db')
chunksize = 1000000
j = 0
index_start = 1
for df in pd.read_csv('C:/Users/xxx/Desktop/jjj.tsv', chunksize=chunksize, header=None,
                      names=['screen', 'user'], sep='\t', iterator=True, encoding='utf-8'):
    df.to_sql('data', disk_engine, if_exists='append')
    j += 1
    count = j * chunksize
    print(count)
    print(j)
It is taking a really long time (I estimate it would take days). Is there a more efficient way to do this? In R, I have been using the data.table package to load large data sets and it only takes about 1 minute. Is there a similar package in Python? As a tangential point, I also want to physically store this file on my Desktop. Right now, I am assuming 'data' is being stored as a temporary file. How would I do this?
Also assuming I load the data into a database, I want the queries to execute in a minute or less. Here is some pseudocode of what I want to do using Python + SQL:
# load data (600 million rows * 2 columns) into the database
# def count(screen):
#     return count of the distinct list of users for a given set of screens
Essentially, I am returning the number of distinct users for a given set of screens. Is the data too big for this task? I also want to merge this table with another table. Is there a reason why the fread function in R is so much faster?
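For reference, a minimal sketch of what that count query could look like against the SQLite table written by to_sql above (the helper name is made up; the table and column names follow the to_sql call):

import sqlite3

def count_distinct_users(db_path, screens):
    # screens is an iterable of screen names; one placeholder is generated per screen
    conn = sqlite3.connect(db_path)
    placeholders = ','.join('?' for _ in screens)
    sql = 'SELECT COUNT(DISTINCT user) FROM data WHERE screen IN ({})'.format(placeholders)
    (n,) = conn.execute(sql, list(screens)).fetchone()
    conn.close()
    return n

# An index on screen is what keeps a query like this fast on hundreds of millions of rows:
#   CREATE INDEX IF NOT EXISTS idx_data_screen ON data (screen);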
If your goal is to import data from your TSV file into SQLite, you should try the native import functionality in SQLite itself. Just open the sqlite console program and do something like this:
sqlite> .separator "\t"
sqlite> .import C:/Users/xxx/Desktop/jjj.tsv screen-user
Don't forget to build appropriate indexes before doing any queries.
As #John Zwinck has already said, you should probably use native RDBMS's tools for loading such amount of data.
First of all I think SQLite is not a proper tool/DB for 700 millions rows especially if you want to join/merge this data afterwards.
Depending of what kind of processing you want to do with your data after loading, I would either use free MySQL or if you can afford having a cluster - Apache Spark.SQL and parallelize processing of your data on multiple cluster nodes.
For loading you data into MySQL DB you can and should use native LOAD DATA tool.
Here is a great article showing how to optimize data load process for MySQL (for different: MySQL versions, MySQL options, MySQL storage engines: MyISAM and InnoDB, etc.)
Conclusion: use native DB's tools for loading big amount of CSV/TSV data efficiently instead of pandas, especially if your data doesn't fit into memory and if you want to process (join/merge/filter/etc.) your data after loading.
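As a rough illustration of that route, this is roughly what issuing LOAD DATA from Python could look like (connection parameters, the table name, and the column types are assumptions, not from the question; the MySQL server must also have local_infile enabled):

import MySQLdb

# placeholders: adjust host/user/password/database and the table definition to your setup
conn = MySQLdb.connect(host='localhost', user='user', passwd='pass', db='mydb', local_infile=1)
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS screen_user (screen VARCHAR(255), `user` VARCHAR(255))')
cur.execute("""
    LOAD DATA LOCAL INFILE 'C:/Users/xxx/Desktop/jjj.tsv'
    INTO TABLE screen_user
    FIELDS TERMINATED BY '\\t'
    LINES TERMINATED BY '\\n'
    (screen, `user`)
""")
conn.commit()
conn.close()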
I have a 20 gb file which looks like the following:
Read name, Start position, Direction, Sequence
Note that read names are not necessarily unique.
E.g. a snippet of my file would look like
Read1, 40009348, +, AGTTTTCGTA
Read2, 40009349, -, AGCCCTTCGG
Read1, 50994530, -, AGTTTTCGTA
I want to be able to store these lines in a way that allows me to:
1) keep the data sorted based on the second value (the start position), and
2) iterate over the sorted data.
It seems that databases can be used for this.
The documentation seems to imply that dbm cannot be used to sort the file and iterate over it.
Therefore I'm wondering whether SQLite3 will be able to do 1) and 2). I know that I will be able to sort my file with a SQL query and iterate over the result set with sqlite3. However, will I be able to do this without running out of memory on a computer with 4 GB of RAM?
SQLite is able to do both 1) and 2).
I recommend you try it and report any problems you encounter.
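In case it helps, a minimal sketch of how this could look with sqlite3, streaming the file in and iterating back out in sorted order (the database, table, and index names are made up; the file format follows the snippet above):

import csv
import sqlite3

conn = sqlite3.connect('reads.db')
conn.execute('CREATE TABLE IF NOT EXISTS reads (name TEXT, start INTEGER, direction TEXT, sequence TEXT)')

def load(path, batch_size=100000):
    # stream the 20 GB file and insert in batches, never holding it all in RAM
    with open(path, newline='') as f:
        reader = csv.reader(f, skipinitialspace=True)
        batch = []
        for name, start, direction, sequence in reader:
            batch.append((name, int(start), direction, sequence))
            if len(batch) >= batch_size:
                with conn:
                    conn.executemany('INSERT INTO reads VALUES (?, ?, ?, ?)', batch)
                batch = []
        if batch:
            with conn:
                conn.executemany('INSERT INTO reads VALUES (?, ?, ?, ?)', batch)
    # with this index, ORDER BY start walks the index instead of sorting in memory
    conn.execute('CREATE INDEX IF NOT EXISTS idx_reads_start ON reads (start)')

def iter_sorted():
    # sqlite3 fetches rows incrementally, so iteration stays well within 4 GB of RAM
    for row in conn.execute('SELECT name, start, direction, sequence FROM reads ORDER BY start'):
        yield row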
With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (2^41 bytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file and many filesystems limit the maximum size of files to something less than this. So if you are contemplating databases of this magnitude, you would do well to consider using a client/server database engine that spreads its content across multiple disk files, and perhaps across multiple volumes.
See this question about large SQLite databases.
The important bit:
I tried to insert multiple rows into a SQLite file with just one table. When the file was about 7GB (sorry I can't be specific about row counts) insertions were taking far too long. I had estimated that my test to insert all my data would take 24 hours or so, but it did not complete even after 48 hours.
The sample used was ~50GB of data, though system specs are not mentioned.
I'm new to Python and Sqlite, so I'm sure there's a better way to do this. I have a DB with 6000 rows, where 1 column is a 14K XML string. I wanted to compress all those XML strings to make the DB smaller. Unfortunately, the script below is much, much slower than this simple command line (which takes a few seconds).
sqlite3 weather.db .dump | gzip -c > backup.gz
I know it's not the same thing, but it does read/convert the DB to text and run gzip. So I was hoping this script would be within 10X performance, but it is more like 1000X slower. Is there a way to make the following script more efficient? Thanks.
import zlib, sqlite3

conn = sqlite3.connect(r"weather.db")
r = conn.cursor()
w = conn.cursor()
rows = r.execute("select date, location, xml_data from forecasts")
for row in rows:
    # zlib works on bytes, so the XML text needs encoding first (on Python 3)
    data = zlib.compress(row[2].encode('utf-8'))
    w.execute("update forecasts set xml_data=? where date=? and location=?",
              (data, row[0], row[1]))
conn.commit()
conn.close()
I'm not sure you can increase the performance by doing an update after the fact. There's too much overhead between doing the compression and updating the record, and you won't gain any space savings unless you do a VACUUM after you're done with the updates. The best solution would probably be to do the compression when the records are first inserted; then you get the space savings and the performance hit won't be as noticeable. If you can't do it on the insert, then I think you've explored the two possibilities and seen the results.
You are comparing apples to oranges here. The big difference between the sqlite3|gzip pipeline and the Python version is that the latter writes the changes back to the DB!
What sqlite3|gzip does is:
read the db
gzip the text
In addition to the above, the Python version writes the gzipped text back into the db, with one UPDATE per record read.
Sorry, but are you implicitly starting a transaction in your code? If you're autocommitting after each UPDATE that will slow you down substantially.
Do you have an appropriate index on date and/or location? What kind of variation do you have in those columns? Can you use an autonumbered integer primary key in this table?
Finally, can you profile how much time you're spending in the zlib calls and how much in the UPDATEs? In addition to the database writes that slow this process down, your database version involves 6000 separate calls (each with its own initialization) of the compression routine.
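To separate those two costs, here is a small sketch (using the same assumed schema as the question) that times the compression step apart from the write step and batches all the UPDATEs into a single executemany call inside one transaction:

import sqlite3
import time
import zlib

conn = sqlite3.connect('weather.db')
rows = conn.execute('SELECT date, location, xml_data FROM forecasts').fetchall()

# time the compression on its own
t0 = time.time()
updates = [(zlib.compress(xml.encode('utf-8')), date, location) for date, location, xml in rows]
t1 = time.time()

# time the writes on their own; one transaction and one executemany for all rows
with conn:
    conn.executemany('UPDATE forecasts SET xml_data = ? WHERE date = ? AND location = ?', updates)
t2 = time.time()

print('compress: %.2fs  update: %.2fs' % (t1 - t0, t2 - t1))
conn.close()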