I am retrieving data from a remote web application and I store it in a local sqlite DB. From time to time I perform a commit() call on the connection, which results in committing about 50 inserts to 8 tables. There are about 1000000 records in most tables. I don't use explicit BEGIN and END commands.
I measured the commit() call time and got 100-150 ms. But during the commit, my PC freezes for ~5-15 seconds. The INSERTs themselves are naive (i.e. one execute() call per insert) but are performed fast enough (their rate is limited by the speed of record retrieval anyway, which is fairly low). I'm using Arch Linux x64 on a PC with an AMD FX 6200 CPU, 8 GB RAM and a SATA HDD, Python 3.4.1, sqlite 3.8.4.3.
Does anyone have an idea why this could happen? I guess it has something to do with HDD caching. If so, is there something I could optimize?
UPD: switched to WAL and synchronous=1, no improvements.
UPD2: I had seriously underestimated the number of INSERTs per commit. I measured it using sqlite3.Connection's total_changes property, and it appears there are 30000-60000 changes per commit. Is it possible to optimize the inserts, or is it about time to switch to postgres?
If the call itself is quick enough, as you say, it certainly sounds like an IO problem. You could use a tool such as iotop to check this. If possible, I would suggest that you divide the inserts into smaller and more frequent batches instead of large chunks. If that is not possible, you should consider investing in an SSD instead of a traditional hard disk, since write speeds are typically much higher.
There may also be system parameters worth investigating. You should at least make sure you mount your disk with the noatime and nodiratime flags. You could also try data=writeback as a mount parameter. See the following for more details:
https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
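If you go the smaller-and-more-frequent route, here is a minimal sketch of committing every N rows with the WAL/synchronous settings already mentioned in the question (the records table and its columns are made up):

```python
import sqlite3

conn = sqlite3.connect("local.db")
conn.execute("PRAGMA journal_mode=WAL")    # persists in the database file
conn.execute("PRAGMA synchronous=NORMAL")  # must be set per connection

BATCH_SIZE = 1000  # tune so each commit stays small enough not to stall the disk
pending = 0

def store(record):
    """Insert one retrieved record; commit once every BATCH_SIZE rows."""
    global pending
    conn.execute(
        "INSERT INTO records (id, payload) VALUES (?, ?)",  # hypothetical table
        (record["id"], record["payload"]),
    )
    pending += 1
    if pending >= BATCH_SIZE:
        conn.commit()
        pending = 0
```

That keeps each commit's set of dirty pages small, so the flush to disk is spread out instead of arriving as one burst.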
I have a python script which hits dozens of API endpoints every 10s to write climate data to a database. Let's say on average I insert 1,500 rows every 10 seconds, from 10 different threads.
I am thinking of making a batch system whereby the insert queries aren't written to the db as they come in, but are added to a waiting list; when the list reaches a certain size it is inserted as one batch and then emptied.
Is this justified due to the overhead with frequently writing small numbers of rows to the db?
If so, would a plain list be wise? I am worried about what happens if my program terminates unexpectedly; perhaps some form of serialized data would be better?
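Roughly, the kind of buffering I have in mind (just a sketch - the readings table and its columns are made up, and conn is assumed to be a DB-API connection that is safe to share across threads):

```python
import threading

class BatchWriter:
    """Buffer rows in memory and flush them to the database in one batch.

    Rows still sitting in the buffer are lost if the process dies, which is
    the trade-off I'm asking about above.
    """

    def __init__(self, conn, flush_size=500):
        self.conn = conn
        self.flush_size = flush_size
        self.lock = threading.Lock()  # the rows arrive from 10 threads
        self.buffer = []

    def add(self, row):
        with self.lock:
            self.buffer.append(row)
            if len(self.buffer) >= self.flush_size:
                self._flush()

    def _flush(self):
        # executemany sends the whole batch inside a single transaction
        self.conn.executemany(
            "INSERT INTO readings (station, ts, value) VALUES (?, ?, ?)",
            self.buffer,
        )
        self.conn.commit()
        self.buffer = []
```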
150 inserts per second can be a load on a database and can affect performance. There are pros and cons to changing the approach that you have. Here are some things to consider:
Databases implement ACID, so inserts are secure. This is harder to achieve with buffering schemes.
How important is up-to-date information for queries?
What is the query load?
insert is pretty simple. Alternative mechanisms may require re-inventing the wheel.
Do you have other requirements on the inserts, such as ensuring they are in particular order?
No doubt, there are other considerations.
Here are some possible alternative approaches:
If recent data is not a concern, snapshot the database for querying purposes -- say once per day or once per hour.
Batch inserts in the application threads. A single insert can insert multiple rows (see the sketch after this list).
Invest in larger hardware. An insert load that slows down a single processor may have little effect on a larger machine.
Invest in better hardware. More memory and a faster disk (particularly solid state) can have a big impact.
No doubt, there are other approaches as well.
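As a sketch of the multi-row insert mentioned above (the readings table and its columns are hypothetical; the '?' placeholders are sqlite3-style - MySQL drivers use '%s' and a cursor instead of conn.execute):

```python
def insert_batch(conn, rows):
    """Insert many rows with a single multi-row INSERT statement."""
    placeholders = ", ".join(["(?, ?, ?)"] * len(rows))
    sql = "INSERT INTO readings (station, ts, value) VALUES " + placeholders
    flat = [value for row in rows for value in row]  # flatten [(a, b, c), ...]
    conn.execute(sql, flat)
    conn.commit()
```

One statement per batch means one round trip and one transaction, which is where most of the per-row overhead goes.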
I'm tinkering with some big-data queries in the ipython shell using the Django ORM. This is on a Debian 6 VM in VMware Fusion on OS X; the VM is allowed access to 4 or 8 cores (I've played with the settings) of the 4-core HT i7 on the host.
When I watch progress in top while doing, for example, a 'for result in results: do_query()' in the python shell, it seems that python and one of the postgres processes are always co-located on the same physical CPU core: their total CPU usage never adds up to more than 100%, with python usually around 65% to postgres' 25% or so. iowait on the VM isn't excessively high.
I'm not positive they're always on the same core, but it sure looks that way. Given how I plan to scale this eventually, I'd prefer that the python process(es) and postgres workers be scheduled more optimally. Any insight?
Right now, if your code works the way I think it works, Postgres is always waiting for Python to send it a query, or Python is waiting for Postgres to come back with a response. There's no situation where they'd both be doing work at once, so only one ever runs at a time.
To start using your machine more heavily, you'll need to implement some sort of multithreading on the Python end. Since you haven't given many details on what your queries are, it's hard to say what that might look like.
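As an illustration, something like the following would fan the work out over a thread pool. It's only a sketch: it assumes do_query() takes one result as an argument, and it relies on the fact that drivers like psycopg2 release the GIL while waiting on the server, so the threads genuinely overlap:

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(results, workers=4):
    # Each worker thread gets its own Django database connection, so several
    # Postgres backends can be busy at the same time instead of just one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(do_query, results))
```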
I am running several thousand python processes on multiple servers which go off, look up a website, do some analysis and then write the results to a central MySQL database.
It all works fine for about 8 hours and then my scripts start to wait for a MySQL connection.
On checking top it's clear that the MySQL daemon is overloaded as it is using up to 90% of most of the CPUs.
When I stop all my scripts, MySQL continues to use resources for some time afterwards.
I assume it is still updating the indexes? If so, is there any way of determining which indexes it is working on, or if not, what it is actually doing?
Many thanks in advance.
Try enabling the slow query log: http://dev.mysql.com/doc/refman/5.1/en/slow-query-log.html
Also, take a look at the output of SHOW PROCESSLIST; on the mysql shell, it should give you some more information.
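If it's handier to poll this from Python while the daemon is under load, a quick sketch (pymysql is just an assumed driver here, and the connection credentials are placeholders):

```python
import pymysql  # assumed driver; MySQLdb works much the same way

conn = pymysql.connect(host="localhost", user="root", password="...", database="mysql")
with conn.cursor() as cur:
    cur.execute("SHOW FULL PROCESSLIST")
    for row in cur.fetchall():
        print(row)  # Id, User, Host, db, Command, Time, State, Info
```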
There are a lot of tweaks that can be done to improve the performance of MySQL. Given your workload, you would probably benefit a lot from MySQL 5.5 and higher, which improved performance on multiprocessor machines. Is the machine in question hitting virtual memory? If it is paging out, then the performance of MySQL will be horrible.
My suggestions:
Check your version of MySQL. If possible, get the latest 5.5 version.
Look at the MySQL config file, my.cnf. Make sure that it makes sense on your machine. There are example config files for small, medium, large, etc. machines running MySQL. I think the default setup is for a machine with < 1 GB of RAM.
As the other answer suggests, turn on slow query logging.
Is there a way to reduce the I/O associated with either MySQL or a python script? I am thinking of using EC2, and the costs seem okay except that I can't really predict my I/O usage and I am worried it might blindside me with costs.
I basically develop a python script to parse data and upload it into MySQL. Once it's in MySQL, I do some fairly heavy analytics on it (creating new columns and tables; basically a lot of math and financial analysis on a large dataset). So are there any design best practices to avoid heavy I/O? I think memcached stores everything in memory and accesses it from there; is there a way to get MySQL or other scripts to do the same?
I am running the scripts fine right now on another host with 2 gigs of ram, but the ec2 instance I was looking at had about 8 gigs so I was wondering if I could use the extra memory to save me some money.
By IO I assume you mean disk IO, and I'm assuming you can fit everything into memory comfortably. You could:
Disable swap on your box†
Use MySQL MEMORY tables while you are processing (or perhaps consider using an SQLite in-memory store if you are only using the database for the convenience of SQL queries; see the sketch at the end of this answer)
Also: unless you are using EBS, I didn't think Amazon charged for IO on your instance. EBS is much slower than your instance storage, so only use it when you need the persistence, i.e. not while you are crunching data.
†probably bad idea
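A sketch of the SQLite in-memory idea (the prices table and its columns are made up):

```python
import sqlite3

# Everything lives in RAM; nothing is written to disk (and nothing survives the process).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (symbol TEXT, ts INTEGER, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [("ABC", 1, 10.5), ("ABC", 2, 10.7)],
)
# Run the heavy analytical SQL here, then write only the final results to durable storage.
for row in conn.execute("SELECT symbol, AVG(close) FROM prices GROUP BY symbol"):
    print(row)
```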
You didn't really specify whether it was writes or reads. My guess is that you can do it all in a MySQL instance on a ramdisk (tmpfs under Linux).
Operations such as ALTER TABLE and copying big data around end up creating a lot of IO requests because they move a lot of data. This is not the same as just having a lot of random (or more predictable) queries.
If it's a batch operation, maybe you can do it entirely in a tmpfs instance.
It is possible to run more than one MySQL instance on the machine, and it's pretty easy to start up an instance on a tmpfs: just use mysql_install_db with datadir pointing at the tmpfs, then run mysqld with appropriate parameters. Stick that in some shell scripts and you'll get it to start up. As the data directory is in a tmpfs, the instance won't need much memory for its buffers - just set them fairly small.
I have a script with a main for loop that repeats about 15k times. In this loop it queries a local MySQL database and does an SVN update on a local repository. I placed the SVN repository on a RAM disk, since previously most of the time seemed to be spent reading from and writing to disk.
Now I have a script that runs at basically the same speed but CPU utilization for that script never goes over 10%.
ProcessExplorer shows that mysqld is also barely taking any CPU time and isn't reading from or writing much to disk.
What steps would you take to figure out where the bottleneck is?
Doing SQL queries in a for loop 15k times is a bottleneck in every language.
Is there any reason you query again every time? If you do a single query before the for loop and then loop over the resultset and the SVN part, you will see a dramatic increase in speed.
But I doubt that you will get a higher CPU usage. The reason is that you are not doing calculations, but mostly IO.
Btw, you can't see this in mysqld's CPU usage, since the cost isn't in the complexity of the queries but in their count and the latency of each round trip to the server. So you will only see very short, inexpensive queries - which nevertheless add up over time.
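A rough sketch of the single-query approach (MySQLdb is assumed as the driver; the query and svn_update() are stand-ins for your actual code):

```python
import MySQLdb  # assumed driver

conn = MySQLdb.connect(host="localhost", user="user", passwd="...", db="mydb")
cur = conn.cursor()
cur.execute("SELECT revision, path FROM work_items")  # one round trip instead of ~15k
rows = cur.fetchall()

for revision, path in rows:
    svn_update(path, revision)  # stand-in for the existing SVN update step
```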
Profile your Python code. That will show you how long each function/method call takes. If that's the method call querying the MySQL database, you'll have a clue where to look. But it also may be something else. In any case, profiling is the usual approach to solve such problems.
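For example (main() stands in for your script's entry point):

```python
import cProfile
import pstats

cProfile.run("main()", "loop.prof")  # profile the whole run and save the stats
stats = pstats.Stats("loop.prof")
stats.sort_stats("cumulative").print_stats(20)  # functions with the most cumulative time
```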
It is "well known", so to speak, that svn update waits up to a whole second after it has finished running, so that file modification timestamps get "in the past" (since many filesystems don't have a timestamp granularity finer than one second). You can find more information about it by Googling for "svn sleep_for_timestamps".
I don't have any obvious solution to suggest. If this is really performance critical you could either: 1) not update as often as you are doing, or 2) try to use a lower-level Subversion API (good luck).