Pandas using too much memory with read_sql_table - python

I am trying to read a table from my Postgres database into Python. The table has around 8 million rows and 17 columns, and is about 622 MB in the database.
I can export the entire table to CSV using psql and then read it with pd.read_csv(), which works perfectly fine: the Python process only uses around 1 GB of memory and everything is good.
Now the task requires this pull to be automated, so I thought I could read the table in with pd.read_sql_table() directly from the DB, using the following code:
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("postgresql://username:password@hostname:5432/db")
the_frame = pd.read_sql_table(table_name='table_name', con=engine, schema='schemaname')
This approach uses a huge amount of memory. When I track the memory usage in Task Manager, I can see the Python process climb and climb until it hits 16 GB and freezes the computer.
Any ideas on why this might be happening would be appreciated.

You need to set the chunksize argument so that pandas will iterate over smaller chunks of data. See this post: https://stackoverflow.com/a/31839639/3707607
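A minimal sketch of what that looks like, reusing the connection from the question (the chunk size is illustrative and should be tuned to your memory budget):
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://username:password@hostname:5432/db")

# With chunksize set, read_sql_table returns an iterator of DataFrames
# instead of materializing the whole table at once.
chunks = pd.read_sql_table(table_name='table_name', con=engine,
                           schema='schemaname', chunksize=50000)

total_rows = 0
for chunk in chunks:
    # Do the real work here, one chunk at a time, then let the chunk go.
    total_rows += len(chunk)
print(total_rows)
Note that collecting every chunk and concatenating them back into one DataFrame would put you right back at the original memory footprint; the savings come from processing each chunk and discarding it.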

Related

Store MySQL query results for faster reuse

I'm doing analysis on data from a MySQL database in Python. I query the database for about 200,000 rows of data, then analyze it in Python using pandas. I often do many iterations over the same data, changing variables, parameters, and such. Each time I run the program, I query the remote database (about a 10-second query), then discard the results when the program finishes. I'd like to save the results of the last query in a local file, check on each run whether the query is the same, and if so just use the saved results. I guess I could just write the pandas DataFrame to a CSV, but is there a better/easier/faster way to do this?
If for any reason the MySQL query cache doesn't help, I'd recommend saving the latest result set in either HDF5 or Feather format. Both formats are pretty fast. You can find some demos and tests here:
https://stackoverflow.com/a/37929007/5741205
https://stackoverflow.com/a/42750132/5741205
https://stackoverflow.com/a/42022053/5741205
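A rough sketch of both options (the file names are arbitrary; Feather needs pyarrow installed, and HDF5 needs PyTables):
import pandas as pd

df = pd.DataFrame({"a": range(200000)})  # stand-in for the query result

# Feather: a fast columnar format, well suited to short-lived caches.
df.to_feather("last_query.feather")
df = pd.read_feather("last_query.feather")

# HDF5: store the frame under a named key inside the file.
df.to_hdf("last_query.h5", key="last_query", mode="w")
df = pd.read_hdf("last_query.h5", key="last_query")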
Just use pickle to write the DataFrame to a file, and read it back out ("unpickle") on the next run.
https://docs.python.org/3/library/pickle.html
This would be the "easy way".
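For example, using pandas' built-in pickle helpers (the file name is arbitrary):
import pandas as pd

df = pd.DataFrame({"a": range(200000)})  # stand-in for the query result

df.to_pickle("last_query.pkl")         # pickle the DataFrame to disk
df = pd.read_pickle("last_query.pkl")  # unpickle it on the next run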

Loading data to Netezza as a list is very slow

I have about a million records in a list that I would like to write to a Netezza table. I have been using the executemany() command with pyodbc, which seems to be very slow (I can load much faster if I save the records to Excel and load into Netezza from the Excel file). Are there any faster alternatives to loading a list than executemany()?
PS1: The list is generated by a proprietary DAG in our company, so writing to the list is very fast.
PS2: I have also tried calling executemany() in chunks, each chunk being a list of 100 records. It takes approximately 60 seconds to load, which still seems very slow.
From Python I have had great performance loading millions of rows to Netezza using transient external tables. Basically Python creates a CSV file on the local machine, and then tells the ODBC driver to load the CSV file into the remote server.
The simplest example:
SELECT *
FROM EXTERNAL '/tmp/test.txt'
SAMEAS test_table
USING (DELIM ',');
Behind the scenes this is equivalent to the nzload command, but it does not require nzload. This worked great for me on Windows where I did not have nzload.
Caveat: be careful with the formatting of the CSV, the values in the CSV, and the options to the command. Netezza gives obscure error messages for invalid values.
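A hedged sketch of driving this from Python with pyodbc (the DSN, target table, and file path are placeholders, and the INSERT wrapper plus the REMOTESOURCE 'ODBC' option are my additions for loading a client-side file; check them against your driver's documentation):
import csv
import pyodbc

rows = [(1, "a"), (2, "b")]  # stand-in for the list produced by the DAG

# 1) Dump the list to a local CSV file.
with open("/tmp/test.txt", "w", newline="") as f:
    csv.writer(f, delimiter=",").writerows(rows)

# 2) Ask the Netezza ODBC driver to bulk-load it through a transient external table.
conn = pyodbc.connect("DSN=NZSQL")  # placeholder DSN
cur = conn.cursor()
cur.execute("""
    INSERT INTO test_table
    SELECT *
    FROM EXTERNAL '/tmp/test.txt'
    SAMEAS test_table
    USING (DELIM ',' REMOTESOURCE 'ODBC')
""")
conn.commit()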
Netezza is built for bulk loads rather than the batch-of-rows inserts you get with executemany(). The best way to load millions of rows is the nzload utility, which can be scheduled with VBScript or an Excel macro on Windows, or a shell script on Linux.

How to perform defragmentation on a Cassandra table

I am playing around with Python and some NoSQL databases to create a file store (mainly because of the built-in replication). I tried it with MongoDB and it works, but due to the "write greedy" nature of MongoDB I moved to Cassandra and implemented the same thing. While it works, I want to know (pointers to docs are fine) how to defragment the data in Cassandra. To explain with an example: say I upload a 200 MB file, then a 20 MB file; the data size in Cassandra is now ~220 MB. If I then delete the 200 MB file, I still see a data size of ~200 MB, so that space is not reclaimed. MongoDB has a command to reclaim that space (reuse it for new files), and I want to know how the same can be achieved in Cassandra. I am also getting confused between compression and compaction.
To store the data, I split each file into parts and store the parts as blobs in a table.
Cassandra cleans up deleted and expired data using a process called compaction.
While you can force compactions yourself using nodetool compact, I would not recommend this as it is better to tune compaction and let it happen in the background.
That may not completely do the trick, as Cassandra has a configuration property named gc_grace_seconds which prevents data marked as deleted (with a tombstone) from being purged until gc_grace_seconds has passed. The default is 10 days, but you can configure a smaller value, or even 0 to disable tombstones altogether.
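For example, gc_grace_seconds can be lowered per table with a CQL ALTER TABLE statement, here issued through the Python driver (the contact point, keyspace, table name, and one-hour value are all illustrative):
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # your Cassandra contact point
session = cluster.connect("filestore")  # illustrative keyspace

# Shorten the tombstone grace period for this table (default is 864000 s = 10 days).
session.execute("ALTER TABLE file_blobs WITH gc_grace_seconds = 3600")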

Python/Cassandra: insert vs. CSV import

I am generating load test data in a Python script for Cassandra.
Is it better to insert directly into Cassandra from the script, or to write a CSV file and then load that via Cassandra?
This is for a couple million rows.
For a few million rows, I'd say just use CSV (assuming the rows aren't huge) and see if it works. If not, inserts it is :)
For more heavy-duty stuff, you might want to create SSTables and use sstableloader.
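If you do go the direct-insert route from the script, a prepared statement plus the driver's concurrent helper is the usual pattern; a sketch under assumed keyspace, table, and column names:
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("loadtest")   # illustrative keyspace

# Prepare once, bind many times - much cheaper than re-parsing each INSERT.
insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")

rows = [(i, "payload-%d" % i) for i in range(100000)]  # generated test data; scale up as needed
execute_concurrent_with_args(session, insert, rows, concurrency=100)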

Is there a limit to the (CSV) file size that a Python script can read/write?

I will be writing a little Python script tomorrow to retrieve all the data from an old MS Access database into a CSV file first, and then, after some data cleansing, munging, etc., import the data into a MySQL database on Linux.
I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.
The db has IIRC well over half a million rows of data. My questions are:
Is the number of records a cause for concern? (i.e. Will I hit some limits)?
Is there a better file format for the transitory data (instead of CSV)?
I chose CSV because it is quite simple and straightforward (and I am a Python newbie), but I would like to hear from someone who may have done something similar before.
Memory usage for csv.reader and csv.writer isn't proportional to the number of records, as long as you iterate correctly and don't try to load the whole file into memory. That's one reason the iterator protocol exists. Similarly, csv.writer writes directly to disk; it isn't limited by available memory. You can process any number of records with these without memory limitations.
For simple data structures, CSV is fine. It's much easier to get fast, incremental access to CSV than more complicated formats like XML (tip: pulldom is painfully slow).
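A minimal sketch of that streaming pattern (the file names and the cleanup step are placeholders):
import csv

with open("access_dump.csv", newline="") as src, \
     open("cleaned.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:                              # one row in memory at a time
        cleaned = [field.strip() for field in row]  # stand-in for real cleansing
        writer.writerow(cleaned)                    # written straight to disk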
Yet another approach if you have Access available ...
Create a table in MySQL to hold the data.
In your Access db, create an ODBC link to the MySQL table.
Then execute a query such as:
INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;
Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.
I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
The only limit should be the operating system's file size limit.
That said, when you send the data to the new database, make sure you write it a few records at a time; I've seen people try to load the entire file into memory first and then write it all at once.
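A rough sketch of batching the inserts (the DSN, table, columns, and chunk size are illustrative; it assumes a pyodbc connection to MySQL):
import pyodbc

conn = pyodbc.connect("DSN=mysql_dsn")  # placeholder DSN
cur = conn.cursor()

def chunks(rows, size=1000):
    """Yield successive slices of rows, size records each."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

cleaned_rows = [("a", "b", "c")] * 500000  # stand-in for the cleansed Access rows

for batch in chunks(cleaned_rows, 1000):
    cur.executemany(
        "INSERT INTO target_table (field1, field2, field3) VALUES (?, ?, ?)",
        batch,
    )
    conn.commit()  # committing per batch keeps memory and transaction size small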
