I'm looking for advice on efficient ways to stream data incrementally from a Postgres table into Python. I'm in the process of implementing an online learning algorithm and I want to read batches of training examples from the database table into memory to be processed. Any thoughts on good ways to maximize throughput? Thanks for your suggestions.
If you are using psycopg2, you will want to use a named (server-side) cursor; otherwise it will try to read the entire result set into memory at once.
cursor = conn.cursor("some_unique_name")
cursor.execute("SELECT aid FROM pgbench_accounts")
for record in cursor:
    something(record)
This will fetch the records from the server in batches of 2000 (the default value of itersize) and then parcel them out to the loop one at a time.
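If you would rather receive whole batches than single rows, named cursors also support fetchmany(). A rough sketch (the cursor name, batch size, and process_batch() are placeholders for your own code):

cursor = conn.cursor("training_stream")            # any unique name works
cursor.execute("SELECT aid FROM pgbench_accounts")
while True:
    batch = cursor.fetchmany(500)                  # pull one training batch per call
    if not batch:
        break
    process_batch(batch)                           # hypothetical online-learning step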
You may want to look into the Postgres LISTEN/NOTIFY functionality https://www.postgresql.org/docs/9.1/static/sql-notify.html
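LISTEN/NOTIFY is useful if you want to be told when new training rows arrive instead of polling. A minimal psycopg2 listener sketch, following the pattern in the psycopg2 documentation (the DSN and the channel name new_examples are placeholders):

import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=mydb")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute("LISTEN new_examples;")

while True:
    # wait up to 5 seconds for the server to signal a notification
    if select.select([conn], [], [], 5) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        print("new data signalled:", notify.payload)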
I'm a very novice web developer and I am currently building a website from scratch. I have most of the frontend set up, but I am really struggling with the backend and databases.
The point of the website is to display a graph with class completion status (for each class, it will display what percent is complete/incomplete, and how many total users). It will retrieve this data from a CSV file on an SFTP server. The issue I am having is that when I try to access the data directly, it loads incredibly slowly.
Here is the code I am using to retrieve the data:
import pandas

Courses = ['']
Total = [0]
Compl = [0]
i = 0

csvreal = pandas.read_csv(file)   # 'file' is the CSV fetched from the SFTP server
for index, row in csvreal.iterrows():
    string = csvreal.loc[[index]].to_string(index=False, header=False)
    if Courses[i] != string.split(' ')[0]:
        i += 1
        Courses.append(string.split(' ')[0])
        Total.append(0)
        Compl.append(0)
    if len(string.split(' ')[2]) > 3:
        Compl[i] += 1
    Total[i] += 1
To explain it a little bit: the CSV file has the roster information, i.e. each row has a course name, user name, completion date, and course code. The course name is the first column, which is why the code uses string.split(' ')[0], as it is the first part of the string. If the user has not completed the course, the third column (completion date) is empty, which is why the code checks whether that field is longer than 3 characters: if it is, the user has completed it.
This takes entirely too long to compute. About 30 seconds with around 7,000 entries. Recently the CSV size was increased to something like 36,000.
I was advised to setup a database using SQL and have a nightly cronjob to parse the data and have the website retrieve the data from the database, instead of the CSV.
Any advice on where to even begin, or how to do this would be greatly appreciated.
Before I recommend using a database: how fast is the connection to the SFTP server you are getting the data from? Would it be faster to host the file on the local machine? If this isn't the issue, see below.
Yes, in this case a database would speed up your computation and retrieval time. You need to set up a SQL database, have a way to put data into it, and then retrieve it. I included resources at the bottom that will help you familiarize yourself with SQL. Knowledge of PHP will be needed in order to interact with and manipulate the database.
SQL will also be much simpler for you to interact with. For example, you needed to check whether a cell is empty. In SQL, this can be done with:
SELECT * FROM table WHERE some_col IS NULL OR some_col = '';
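Once the roster is in a table, the numbers the graph needs can be computed in one aggregate query instead of a Python loop. A rough sketch using Python's built-in sqlite3 module (the roster table and its column names are assumptions about your data):

import sqlite3

conn = sqlite3.connect("roster.db")        # hypothetical database file
rows = conn.execute(
    "SELECT course, COUNT(*) AS total, "
    "SUM(CASE WHEN completion_date IS NOT NULL AND completion_date != '' "
    "THEN 1 ELSE 0 END) AS completed "
    "FROM roster GROUP BY course").fetchall()

for course, total, completed in rows:
    print(course, completed, "of", total, "complete")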
https://www.khanacademy.org/computing/computer-programming/sql
https://www.w3schools.com/sql/
https://www.guru99.com/introduction-to-database-sql.html
Let's say I have a class User which inherits from the Document class (I am using Mongoengine). Now, I want to retrieve all users who signed up after some timestamp. Here is the method I am using:
@classmethod
def get_users(cls, start_timestamp):
    return cls.objects(ts__gte=start_timestamp)
1000 documents are returned in 3 seconds. This is extremely slow. I have done similar queries in SQL in a couple of milliseconds. I am new to MongoDB and NoSQL in general, so I guess I am doing something terribly wrong.
I suspect the retrieval is slow because it is done in several batches. I read somewhere that for PyMongo the batch size is 101, but I do not know if that is the same for Mongoengine.
Can I change the batch size so I can get all documents at once? I will know approximately how much data will be retrieved in total.
Any other suggestions are very welcome.
Thank you!
As you suggest, there is no way it should take 3 seconds to run this query. However, the issue is not going to be the performance of the PyMongo driver. Some things to consider (a sketch follows the list):
Make sure that the ts field is included in the indexes for the user collection
Mongoengine does some aggressive de-referencing, so if the 1000 returned user documents have one or more ReferenceFields, then each of those results in additional queries. There are ways to avoid this.
Mongoengine provides a direct interface to the PyMongo method for the MongoDB aggregation framework; this is by far the most efficient way to query MongoDB.
MongoDB recently released an official Python ODM, PyMODM, in part to provide better default performance than Mongoengine.
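A minimal sketch of the first two points, assuming ts is a DateTimeField on the User document; the no_dereference() call is one way to skip the extra reference queries:

from mongoengine import Document, DateTimeField, StringField

class User(Document):
    name = StringField()
    ts = DateTimeField()
    meta = {'indexes': ['ts']}   # index the field used in the range query

    @classmethod
    def get_users(cls, start_timestamp):
        # no_dereference() stops Mongoengine from following ReferenceFields,
        # which would otherwise cost extra queries per returned document
        return cls.objects(ts__gte=start_timestamp).no_dereference()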
I'd like to create a snapshot of a database periodically and execute some queries on the snapshot data to generate data for the next step. Finally, I want to discard the snapshot.
I currently read all data from the database and convert it into an in-memory data structure (a Python dict), then execute queries (implemented by my own code) on that data structure.
The program has a bottleneck at the "execute query" step now that the data size has increased.
How can I query on data snapshot elegantly? Thanks much for any advice.
You can get all tables from your database with:
SHOW TABLES FROM <yourDBname>
After that, you can create copies of the tables in a new DB via:
CREATE TABLE copy.tableA AS SELECT * FROM <yourDBname>.tableA
Afterwards, you can query the copy database instead of the real data.
If you run queries on the copied tables, please add indexes, since they are not copied.
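If you want to drive the whole snapshot step from Python, here is a rough sketch using the PyMySQL driver (the connection details and database names are placeholders):

import pymysql

SRC = "yourDBname"      # the live database
SNAP = "snapshot"       # the throwaway copy; drop it when you are done

conn = pymysql.connect(host="localhost", user="user", password="secret")
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE IF NOT EXISTS " + SNAP)
    cur.execute("SHOW TABLES FROM " + SRC)
    for (table,) in cur.fetchall():
        # CREATE TABLE ... AS SELECT copies the data but not the indexes
        cur.execute("CREATE TABLE {0}.{1} AS SELECT * FROM {2}.{1}".format(SNAP, table, SRC))
conn.close()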
I have some programming background, but I'm in the process of both learning Python and making a web app. I'm a long-time lurker but first-time poster on Stack Overflow, so please bear with me.
I know that SQLite (or another database, seems like PostgreSQL is popular) is the way to store data between sessions. But what's the most efficient way to store large amounts of data during a session?
I'm building a script to identify the strongest groups of employees to work on various projects in a company. I have received one SQLite database per department containing employee data including skill sets, achievements, performance, and pay.
My script currently runs one SQL query on each database in response to an initial query by the user, pulling all the potentially-relevant employees and their data. It stores all of that data in a list of Python dicts so the end-user can mix-and-match relevant people.
I see two other options: I could still run the comprehensive initial queries but instead of storing it in Python dicts, dump it all into SQLite temporary tables; my guess is that this would save some space and computing because I wouldn't have to store all the joins with each record. Or I could just load employee name and column/row references, which would save a lot of joins on the first pass, then pull the data on the fly from the original databases as the user requests additional data, storing little if any data in Python data structures.
What's going to be the most efficient? Or, at least, what is the most common/proper way of handling large amounts of data during a session?
Thanks in advance!
Aren't you over-optimizing? You don't need the best solution, you need a solution which is good enough.
Implement the simplest one, using dicts; it has a fair chance to be adequate. If you test it and then find it inadequate, try SQLite or Mongo (both have downsides) and see if it suits you better. But I suspect that buying more RAM instead would be the most cost-effective solution in your case.
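For the dict route, a minimal sketch of pulling the candidate employees from each department's SQLite file into one list of dicts (the employees table and its columns are assumptions about your schema):

import sqlite3

def load_candidates(db_paths, skill):
    # run the initial query against each department database and merge the
    # results into a single list of dicts for the session
    people = []
    for path in db_paths:
        conn = sqlite3.connect(path)
        conn.row_factory = sqlite3.Row      # rows can be turned into dicts
        rows = conn.execute(
            "SELECT name, skills, performance, pay FROM employees "
            "WHERE skills LIKE ?", ('%' + skill + '%',))
        people.extend(dict(r) for r in rows)
        conn.close()
    return people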
(Not-a-real-answer disclaimer applies.)
My python project involves an externally provided database: A text file of approximately 100K lines.
This file will be updated daily.
Should I load it into an SQL database, and deal with the diff daily? Or is there an effective way to "query" this text file?
ADDITIONAL INFO:
Each "entry", or line, contains three fields - any one of which can be used as an index.
The update is in the form of the entire database - I would have to manually generate a diff
The queries are just looking up records and displaying the text.
Querying the database will be a fundamental task of the application.
How often will the data be queried? At one extreme, if only once per day, a sequential search might be more efficient than maintaining a database or index.
For more frequent queries and a daily update, you could build and maintain your own index for more efficient lookups. Most likely, though, it would be worth a negligible (if any) sacrifice in speed to use an SQL database (or another database, depending on your needs) in return for simpler and more maintainable code.
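If you do go the SQL route, rebuilding an SQLite database from the nightly file is only a few lines. A sketch assuming three whitespace-separated fields per line (the table and column names are made up):

import sqlite3

def rebuild(text_path, db_path="entries.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS entries")
    conn.execute("CREATE TABLE entries (f1 TEXT, f2 TEXT, f3 TEXT)")
    with open(text_path) as fh:
        rows = (line.strip().split(None, 2) for line in fh if line.strip())
        conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    # any of the three fields can serve as a lookup key, so index all of them
    for col in ("f1", "f2", "f3"):
        conn.execute("CREATE INDEX idx_{0} ON entries ({0})".format(col))
    conn.commit()
    conn.close()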
What I've done before is create SQLite databases from txt files which were created from database extracts, one SQLite db for each day.
One can query across SQLite databases to check the values, etc., and create additional tables of data.
I added an additional column of data that was the SHA1 of the text line so that I could easily identify lines that were different.
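A rough sketch of that loading step (the file layout and table name are assumptions):

import hashlib
import sqlite3

def load_extract(txt_path, db_path):
    # load one day's extract, storing a SHA-1 of each raw line so rows that
    # changed between days can be found by comparing hashes across databases
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS extract (line TEXT, sha1 TEXT)")
    with open(txt_path) as fh:
        rows = ((line.rstrip("\n"),
                 hashlib.sha1(line.rstrip("\n").encode("utf-8")).hexdigest())
                for line in fh)
        conn.executemany("INSERT INTO extract VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

To compare two days, you can ATTACH one day's database to another and join on the sha1 column.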
It worked in my situation and hopefully may form the barest sniff of an acorn of an idea for you.