I have an application that needs to interface with another app's database. I have read access but not write.
Currently I'm using SQL statements via pyodbc to grab the rows and manipulating the data in Python. Since I don't cache anything, this can be quite costly.
I'm thinking of using an ORM to solve my problem. The question is: if I use an ORM like SQLAlchemy, would it be smart enough to pick up changes in the other database?
E.g. SQLAlchemy accesses a table and retrieves a row. If that row got modified outside of SQLAlchemy, would it be smart enough to pick that up?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit: To be more clear
I have one application that is simply a reporting tool; let's call it App A.
I have another application that handles various financial transactions called App B.
A has access to B's database to retrieve the transactions and generates various reports. There are hundreds of thousands of transactions. We're currently caching this info manually in Python; if we need an updated report, we refresh the cache. If we get rid of the cache, the SQL queries combined with the calculations become unscalable.
I don't think an ORM is the solution to your performance problem. By default, ORMs tend to be less efficient than raw SQL because they might fetch data that you're not going to use (e.g. doing a SELECT * when you need only one field), although SQLAlchemy allows fine-grained control over the SQL generated.
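For example, a rough sketch of that fine-grained control (the table and column names and the connection URL are placeholders, not your actual schema) might reflect App B's table and select only the columns a report needs:

```python
from sqlalchemy import MetaData, Table, create_engine, select

# Placeholder connection string; adjust for your actual DSN/driver.
engine = create_engine("mssql+pyodbc://user:password@my_dsn")
metadata = MetaData()

# Reflect the existing table from App B's database (read-only access is enough).
transactions = Table("transactions", metadata, autoload_with=engine)

with engine.connect() as conn:
    # Ask only for the columns the report needs instead of SELECT *.
    stmt = select(transactions.c.id, transactions.c.amount).where(
        transactions.c.posted_date >= "2023-01-01"
    )
    rows = conn.execute(stmt).fetchall()
```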
Now to implement a caching mechanism, depending on your application, you could use a simple dictionary in memory or a specialized system such as memcached or Redis.
To keep your cached data relatively fresh, you can poll the source at regular intervals, which might be OK if your application can tolerate a little delay. Otherwise you'll need the application that has write access to the db to notify your application or your cache system when an update occurs.
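If a little staleness is acceptable, a minimal polling-style sketch could look like this (the query, the TTL, and the `load_transactions` helper are invented for illustration):

```python
import time

CACHE_TTL = 300  # seconds; tune to how stale a report is allowed to be
_cache = {"data": None, "loaded_at": 0.0}

def load_transactions(conn):
    # Hypothetical loader: pull the rows the reports need from App B's database.
    cursor = conn.cursor()
    cursor.execute("SELECT id, amount, posted_date FROM transactions")
    return cursor.fetchall()

def get_transactions(conn):
    # Serve from the in-memory cache; refresh only when it is older than CACHE_TTL.
    now = time.time()
    if _cache["data"] is None or now - _cache["loaded_at"] > CACHE_TTL:
        _cache["data"] = load_transactions(conn)
        _cache["loaded_at"] = now
    return _cache["data"]
```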
Edit: since you seem to have control over app B, and you've already got a cache system in app A, the simplest way to solve your problem is probably to create a callback in app A that app B can call to expire cached items. Both apps need to agree on a convention to identify cached items.
I'm new to Python and was trying to create a Python bot. I wanted an optimized way to modify and access my bot configs per server. I had two ideas on how/when to fetch configs from the database.
The first is what you would normally do: just fetch the data variables (one variable at a time) for each command. This would keep the bot simple and minimize unused resources.
In the second, whenever a user uses a command for the first time, the bot fetches the entire config table and stores it in a dict, from which you can read the config. You can also update the config in the dict, and every 30 minutes to an hour it writes the values back to the table and empties the dict. The benefit of this one is fewer SQL calls, but potentially less scalability because of unused objects in the dict.
Can someone help me decide which one is better? I don't know how Discord bots are normally made or what the convention is.
Your second approach is called caching the data. You're basically creating a cached database in your application (the dictionary) and saving frequently needed data so you can access it quickly. It is what almost every major service (like Steam) does in order to minimize calls to the main database.
I think this is the better practice; however, it has its drawbacks.
First, from time to time you have to compare the cached data with what's in the original database, because your bot won't have a single user: while the cached data is being served to one user, another user might alter the data in the original database.
Second, it is harder to implement than the first approach. You need to decide which data to store and which data to update frequently, and you also need some notification mechanism so that users' caches are updated whenever the main data is altered in the database.
If I were you and I just wanted to mess around with bots, I would go with just fetching the data each time from the database. It's easier and it is good enough for most applications.
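A minimal sketch of that simpler approach might look like this (the `guild_config` table, sqlite3, and the usage line are assumptions; substitute whatever database your bot actually uses):

```python
import sqlite3

def get_config(db_path, guild_id, key):
    # Fetch a single config value for one server, each time a command needs it.
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT value FROM guild_config WHERE guild_id = ? AND key = ?",
            (guild_id, key),
        ).fetchone()
    return row[0] if row else None

# Usage inside a hypothetical command handler:
# prefix = get_config("bot.db", ctx.guild.id, "prefix")
```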
I have a question regarding the Python IBM_DB package (though I think it applies to any package that employs the connection/cursor logic, e.g. pyodbc).
When the cursor.execute() method is called, it executes an SQL query against the database. However, to access that data, you need to use fetchall() or one of the other fetch methods. I want to time the hit on the database.
Does the query completely finish running at the execute step, with the results held in memory just for Python to fetch? Or does the fetch method keep calling back to the database? I have scoured the documentation and am unable to find anything definitive on this subject.
Most or all of the Db2 open source drivers are based on the Call Level Interface (CLI). The CLI functions and details are part of the overall Db2 documentation. Each Fetch() on a result set retrieves the next row.
AFAIK, the result set can be buffered client-side or each fetch can go back to the engine. It makes sense to bring in a few (dozen) rows at a time, but not some millions of rows.
You would need insights and understanding of how drivers and database query processing work in order to measure something useful and interpret it correctly.
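That said, for a rough feel of where the time goes, you can simply time the two calls separately (a sketch with placeholder credentials; note that driver-side prefetching means the split is only indicative):

```python
import time
import pyodbc  # the same idea applies to ibm_db_dbi or any other DB-API driver

conn = pyodbc.connect("DSN=mydsn;UID=user;PWD=secret")  # placeholder credentials
cursor = conn.cursor()

start = time.perf_counter()
cursor.execute("SELECT * FROM some_table")  # time spent submitting/executing the query
t_execute = time.perf_counter() - start

start = time.perf_counter()
rows = cursor.fetchall()                    # time spent pulling rows back to the client
t_fetch = time.perf_counter() - start

print(f"execute: {t_execute:.3f}s, fetchall: {t_fetch:.3f}s, rows: {len(rows)}")
```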
BTW: There is some form of CLI tracing available.
This question is more on architecture and libs, than on implementation.
I am currently working on a project which requires a local long-term cache storage (updated once a day) on the client, kept in sync with a remote db at the server. For the client side, SQLite has been chosen as a lightweight approach, and PostgreSQL as the feature-rich db at the server. Postgres's native replication mechanisms are not an option because I need to keep the client really lightweight and free of reliance on external components like db servers.
The implementation language would be Python. Now I'm looking at ORMs like SQLAlchemy, but haven't worked with any before.
Does SQLAlchemy have any tools to keep sqlite and postgres dbs in sync?
If not, are there any other Python libraries which have such tools?
Any ideas about what the architecture should look like if the task must be solved "by hand"?
Added:
It's like telemetry, because the client would have an internet connection for only approximately 20 minutes a day.
So the main question is about the architecture of such a system.
It doesn't usually fall within the tasks of an ORM to sync data between databases, so you will likely have to implement it yourself. I don't know of any solution that will handle syncing for you given your choice of databases.
There are a couple important design choices to consider:
how do you figure out what data changed (i.e. inserted, updated or deleted)
what is the most efficient way to package the change-log
will you have to deal with conflicts, and how will you resolve them?
The most efficient way to figure out what changed is to have the database tell you directly. Bottled Water can offer some inspiration in this regard. The idea is to tap into the event log Postgres would use for replication. You will need something like Kafka to keep track of what each of your clients already knows. This will allow you to optimize your server for writes, as you won't have clients querying to figure out what changed since they were last online.
The same can be achieved on the SQLite end with event callbacks; you'll just have to trade some storage space on the client to retain the changes to be sent to the server. If that sounds like too much infrastructure for your needs, it's something you can easily implement with SQL and polling as well, but I would still think of it as an event log and consider how it's implemented a detail, possibly allowing for a more efficient implementation later on.
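One simple, SQL-only way to get such an event log on the SQLite side is a changelog table filled by triggers; a sketch (with made-up table and column names) could look like this:

```python
import sqlite3

conn = sqlite3.connect("client.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS readings (id INTEGER PRIMARY KEY, value REAL);

-- Every local change is appended here and later shipped to the server.
CREATE TABLE IF NOT EXISTS change_log (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT NOT NULL,
    row_id     INTEGER NOT NULL,
    op         TEXT NOT NULL,            -- 'I', 'U' or 'D'
    changed_at TEXT DEFAULT (datetime('now'))
);

CREATE TRIGGER IF NOT EXISTS readings_ins AFTER INSERT ON readings
BEGIN
    INSERT INTO change_log (table_name, row_id, op) VALUES ('readings', NEW.id, 'I');
END;

CREATE TRIGGER IF NOT EXISTS readings_upd AFTER UPDATE ON readings
BEGIN
    INSERT INTO change_log (table_name, row_id, op) VALUES ('readings', NEW.id, 'U');
END;

CREATE TRIGGER IF NOT EXISTS readings_del AFTER DELETE ON readings
BEGIN
    INSERT INTO change_log (table_name, row_id, op) VALUES ('readings', OLD.id, 'D');
END;
""")
conn.commit()
```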
The best way to structure and package your change log will depend on your application's requirements, available bandwidth, etc. You could use standard formats such as JSON, and compress and encrypt if needed.
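For instance, packaging could be as simple as JSON plus gzip (a sketch, assuming the change rows have already been read into dictionaries):

```python
import gzip
import json

def package_changes(changes):
    # `changes` is a list of dicts read from the client's change_log table.
    payload = json.dumps(changes, separators=(",", ":")).encode("utf-8")
    return gzip.compress(payload)

def unpack_changes(blob):
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```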
It will be much simpler to design your application so as to avoid conflicts, possibly by flowing data in a single direction, or by partitioning your data so that it always flows in a single direction for a specific partition.
One final thought is that with such an architecture you would be getting incremental updates, some of which might be missed for unplanned reasons (system failure, bugs, dropped messages, etc.). You could have some built-in heuristic to check that your data matches, like at least comparing the number of records on each side, with some way to recover from such a fault; at a minimum, a way to manually re-fetch the data from the authoritative source, i.e. if the server is authoritative, the client should be able to discard its data and re-fetch it. You might need such a mechanism anyway for cases when the client is reinstalled, etc.
I've been using PostgreSQL for the longest time. All of my data lives inside Postgres. I've recently looked into Redis, and it has a lot of powerful features that would otherwise take a couple of lines in Django (Python) to do. Redis data is persistent as long as the machine it's running on doesn't go down, and you can configure it to write out the data it's storing to disk every 1000 keys or every 5 minutes or so, depending on your choice.
Redis would make a great cache and it would certainly replace a lot of functions I have written in Python (upvoting a user's post, viewing their friends list, etc.). But my concern is that all of this data would somehow need to be translated over to Postgres. I don't trust storing this data in Redis. I see Redis as a temporary storage solution for quick retrieval of information. It's extremely fast, and this far outweighs doing repetitive queries against Postgres.
I'm assuming the only way I could technically write the Redis data to the database is to save() whatever I get from a 'get' query against Redis to the Postgres database through Django.
That's the only solution I could think of. Do you know of any other solutions to this problem?
Redis is increasingly used as a caching layer, much like a more sophisticated memcached, and is very useful in this role. You usually use Redis as a write-through cache for data you want to be durable, and as a write-back cache for data you might want to accumulate and then batch-write (where you can afford to lose recent data).
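A write-through arrangement in that spirit might be sketched like this (using redis-py and psycopg2; the `posts` table, key scheme, and connection settings are invented for the example):

```python
import psycopg2
import redis

r = redis.Redis()
pg = psycopg2.connect("dbname=app user=app")  # placeholder connection settings

def save_post_score(post_id, score):
    # Write-through: PostgreSQL is updated first (it stays authoritative),
    # then the cached copy is refreshed so reads stay fast and consistent.
    with pg, pg.cursor() as cur:
        cur.execute("UPDATE posts SET score = %s WHERE id = %s", (score, post_id))
    r.set(f"post:{post_id}:score", score)

def get_post_score(post_id):
    cached = r.get(f"post:{post_id}:score")
    if cached is not None:
        return int(cached)
    with pg, pg.cursor() as cur:
        cur.execute("SELECT score FROM posts WHERE id = %s", (post_id,))
        row = cur.fetchone()
    if row:
        r.set(f"post:{post_id}:score", row[0])
        return row[0]
    return None
```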
PostgreSQL's LISTEN and NOTIFY system is very useful for doing selective cache invalidation, letting you purge records from Redis when they're updated in PostgreSQL.
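A small invalidation worker built on LISTEN/NOTIFY could look roughly like this (a sketch; the `post_changed` channel and the cache key format are assumptions, and the writer side would need a trigger or an explicit NOTIFY on updates):

```python
import select
import psycopg2
import psycopg2.extensions
import redis

r = redis.Redis()
conn = psycopg2.connect("dbname=app user=app")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN post_changed;")  # the writer does: NOTIFY post_changed, '<post_id>'

while True:
    # Block until PostgreSQL has something for us, then drain the notifications.
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timeout, loop again
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        r.delete(f"post:{note.payload}:score")  # purge the stale cache entry
```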
For combining it with PostgreSQL, you will find the Redis foreign data wrapper that Andrew Dunstan and Dave Page are working on very interesting.
I'm not aware of any tool that makes Redis into a transparent write-back cache for PostgreSQL. Their data models are probably too different for this to work well. Usually you write changes to PostgreSQL and invalidate the corresponding Redis cache entries via listen/notify to a cache manager worker, or you queue changes in Redis and then have your app read them out and write them into Pg in chunks.
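The queue-then-batch pattern mentioned above can be sketched as follows (the list name, table, and batch size are made up; a real version would also need to handle items lost between the pop and the commit):

```python
import json
import psycopg2
import redis

r = redis.Redis()
pg = psycopg2.connect("dbname=app user=app")  # placeholder connection settings

def record_vote(post_id, delta):
    # Fast path: push the change onto a Redis list instead of hitting PostgreSQL.
    r.rpush("pending_votes", json.dumps({"post_id": post_id, "delta": delta}))

def flush_votes(batch_size=500):
    # Periodically drain the queue and apply the changes to PostgreSQL in one transaction.
    items = [r.lpop("pending_votes") for _ in range(batch_size)]
    items = [json.loads(i) for i in items if i is not None]
    if not items:
        return
    with pg, pg.cursor() as cur:
        for item in items:
            cur.execute(
                "UPDATE posts SET score = score + %s WHERE id = %s",
                (item["delta"], item["post_id"]),
            )
```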
Redis is persistent if configured to be so, both through snapshots and a kind of WAL called AOF. Loads of people use it as a primary datastore.
https://redis.io/topics/persistence
If one is referring to the greater world of Redis compatible (resp protocol) datastores, many are not limited to in-memory storage:
https://keydb.dev/
http://ssdb.io/
and many more...
I seem to remember reading somewhere that Google App Engine automatically caches the results of very frequent queries in memory so that they are retrieved faster.
Is this correct?
If so, is there still a charge for datastore reads on these queries?
If you're using Python and the new ndb API, it DOES have automatic caching of entities, so if you fetch entities by key, they will be cached:
http://code.google.com/appengine/docs/python/ndb/cache.html
As the comments say, queries are not cached.
Cached requests don't hit the datastore, so you save on reads there.
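For example, with ndb a key lookup like the one below can be answered from that automatic cache on repeat fetches (the model is hypothetical):

```python
from google.appengine.ext import ndb

class Account(ndb.Model):
    email = ndb.StringProperty()

# The first get() reads from the datastore; subsequent gets of the same key
# can be served from ndb's in-context cache / memcache without a datastore read.
account = ndb.Key(Account, 12345).get()
```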
If you're using Java, or the other APIs for accessing the datastore, then no, there's no caching.
Edit: fixed my mistake about queries getting cached.
I think that app engine does not cache anything for you. While it could be that, internally, it caches some things for a split second, I don't think you should rely on that.
I think you will be charged the normal number of read operations for every entity you read from every query.
No, it doesn't. However, depending on what framework you use to access the datastore, memcache may be used. Are you developing in Java or Python? On the Java side, Objectify will cache GETs automatically but not queries. Keep in mind that there is a big difference in terms of performance and cacheability between gets and queries, in both Python and Java.
You are not charged for datastore reads for memcache hits.