I am working on a system where a bunch of modules connect to an MS SQL Server DB to read/write data. Each of these modules is written in a different language (C#, Java, C++), as each language serves the purpose of its module best.
My question, however, is about DB connectivity. As of now, all these modules use the language-specific SQL connectivity API to connect to the DB. Is this a good way of doing it?
Or alternatively, is it better to have a Python (or some other scripting lang) script take over the responsibility of connecting to the DB? The modules would then send in input parameters and the name of a stored procedure to the Python Script and the script would run it on the database and send the output back to the respective module.
Are there any advantages of the second method over the first ?
Thanks for helping out!
If we assume that each language you use will have an optimized set of classes to interact with databases, then there shouldn't be a real need to pass all database calls through a centralized module.
Using a "middle-ware" for database manipulation does offer a very significant advantage. You can control, monitor and manipulate your database calls from a central and single location. So, for example, if one day you wake up and decide that you want to log certain elements of the database calls, you'll need to apply the logical/code change only in a single piece of code (the middle-ware). You can also implement different caching techniques using middle-ware, so if the different systems share certain pieces of data, you'd be able to keep that data in the middle-ware and serve it as needed to the different modules.
The above is a very advanced edge-case and it's not commonly used in small applications, so please evaluate the need for the above in your specific application and decide if that's the best approach.
Doing things the way you do them now is fine (if we follow the above assumption) :)
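To make the middle-ware idea from this answer more concrete, here is a minimal sketch of a single entry point that logs and optionally caches stored-procedure calls. The class name `DbGateway`, the `execute` callable and the procedure name `get_customer` are all hypothetical; the real database call is injected so the sketch does not assume any particular driver.

```python
import logging
import time
from typing import Any, Callable, Dict, Tuple

logger = logging.getLogger("db_gateway")

# Hypothetical signature: (procedure name, parameters) -> result rows.
Executor = Callable[[str, Tuple[Any, ...]], Any]


class DbGateway:
    """Single place where every stored-procedure call passes through."""

    def __init__(self, execute: Executor) -> None:
        # `execute` is whatever really talks to the database (assumption:
        # a thin wrapper around a driver cursor).
        self._execute = execute
        self._cache: Dict[Tuple[str, Tuple[Any, ...]], Any] = {}

    def call(self, proc_name: str, params=(), cacheable: bool = False) -> Any:
        # Parameters must be hashable to be usable as a cache key.
        key = (proc_name, tuple(params))
        if cacheable and key in self._cache:
            logger.info("cache hit for %s%r", proc_name, key[1])
            return self._cache[key]
        started = time.perf_counter()
        result = self._execute(proc_name, key[1])
        elapsed = time.perf_counter() - started
        # Logging lives here only, so changing it touches one piece of code.
        logger.info("%s%r took %.4fs", proc_name, key[1], elapsed)
        if cacheable:
            self._cache[key] = result
        return result


# Usage with a fake backend standing in for the real database.
backend_calls = []


def fake_execute(proc_name, params):
    backend_calls.append((proc_name, params))
    return [("row for", proc_name, params)]


gateway = DbGateway(fake_execute)
first = gateway.call("get_customer", (42,), cacheable=True)
second = gateway.call("get_customer", (42,), cacheable=True)  # served from cache
```

The second call never reaches the backend, which is the kind of shared caching the answer describes. Whether that is worth an extra hop for every module is the trade-off to evaluate.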
This is more of an architecture question, which I can't solve properly as I don't have enough experience with such architectures. I'm currently running the solution with Python and SQLAlchemy, but the question is generic and the answer doesn't have to address those technologies.
I will try to explain it using the example of a public library. So imagine a public library with a server holding tables with all the books, scans (large binary images), and users. I've already made client and server parts which work great, but only locally for a single library.
Now I would like to have this set of server and clients for another public library (with more public libraries to come later). Having a local server for each library is desired, as there is a lot of data to be transferred to and from the local server.
The complication comes from the requirement to share users (with their member cards) between libraries: if a user registers at library A, he should be able to go to library B without a new registration. There's no need to see a user's other data in a library he didn't originally register at, just his member account (id, login and password).
The simple solution would be:
having large data on local server
having users on cloud (some public server on internet)
The problem is that there are queries (for statistics, views, and so on) which run on the local server and need access to users, so I can't have users on a different server and database, because I then couldn't do select + join on such an architecture.
The solution left behind by the previous developer, which other developers think is wrong, is to set up the users table as a replicated table (MariaDB + Galera). The users table would then be the same in the cloud and at each library site, so the existing code would work as if everything were local, while users are shared with the other libraries in the background.
One of the problems with this is that the current version of our database (MariaDB) doesn't support partial replication (only some tables or some databases), or has it broken. It would need a patched MariaDB, and distributing this patched database server to the cloud and the other sites, which smells of various problems now and in the future, when new versions of MariaDB come out.
What would be the proper way of sharing these users between sites, while retaining the ability to do local selects and joins with the user table?
(Maybe there's a known design / architecture pattern for this, but I just don't know what to search for as I'm new to this.)
Thanks,
Miro
schema - sharing table between sites
Start with a single source of truth for the user registrations. That is one server (or Galera cluster, for HA) somewhere (in HQ, in Cloud, wherever). Login queries remotely access that server.
Think about any place you log in -- you are going to some central site. My point is, that is the way everyone does it because it is fast, reliable, efficient, etc, with today's networks.
Next, what about images, etc? If they are shared across your sites, you may as well do them the same way. Look at any search engine for the last two decades -- images (etc) are fetched from a single site. (Actually a small number of sites, for redundancy, etc). Even the biggest web providers have no more than perhaps a dozen datacenters to service the entire world.
After that, you need to decide on Cloud vs dedicated (or even run your own datacenter).
For HA, Cloud providers do a lot. For do-it-yourself, there are various replication scenarios, Galera being one of the best (today). For true HA, you need two copies of your data geographically separated -- to protect from hurricanes, fires, floods, earthquakes, etc. Consider a WAN deployment of Galera, or some asynchronous replication (possibly even between two Galera clusters).
Another choice is whether the Users and Images tables need to be on separate servers. Only if the traffic and size are high do you need to consider separating them. For a huge Image library, you may need a large number of servers, at which point they should probably be living on servers with the sole purpose of delivering images -- no Users, no HTML pages, etc. Even the "meta" info about images could be elsewhere in MySQL; the Images are in files, and just a web server tuned to deliver images runs there. (I can think of multiple 'big guys' that do it this way.)
This question is more on architecture and libs, than on implementation.
I am currently working on a project which requires local long-term cache storage (updated once a day) on the client, kept in sync with a remote DB on the server. For the client side, SQLite has been chosen as a lightweight approach, and PostgreSQL as a feature-rich DB on the server. The native replication mechanisms of PostgreSQL are not an option, because I need to keep the client really lightweight and free of external components like DB servers.
The implementation language would be Python. Now I'm looking at ORMs like SQLAlchemy, but haven't worked with any before.
Does SQLAlchemy have any tools to keep sqlite and postgres dbs in sync?
If not, are there any other Python libraries which have such tools?
Any ideas about what the architecture should look like if the task must be solved "by hand"?
Added:
It's like telemetry, because the client would have an internet connection for only approximately 20 minutes a day.
So, the main question is about the architecture of such a system.
It doesn't usually fall within the tasks of an ORM to sync data between databases, so you will likely have to implement it yourself. I don't know of any solution that will handle syncing for you given your choice of databases.
There are a couple important design choices to consider:
how do you figure out what data changed ( i.e. inserted, updated or deleted )
what is the most efficient way to package the change-log
will you have to deal with conflicts ? and how will you do that.
The most efficient way to figure out what changed is to have the database tell you that directly. Bottled Water can offer some inspiration in this regard. The idea is to tap into the event log PostgreSQL would use for replication. You will need something like Kafka to keep track of what each of your clients already knows. This will allow you to optimize your server for writes, as you won't have clients querying it to figure out what changed since they were last online.
The same can be achieved on the SQLite end with event callbacks; you'll just have to trade some storage space on the client to retain the changes to be sent to the server. If that sounds like too much infrastructure for your needs, it's something that you can easily implement with SQL and polling as well, but I would still think of it as an event log and consider how it's implemented a detail, possibly allowing for a more efficient implementation later on.
The best way to structure and package your change log will depend on your application's requirements, available bandwidth, etc. You could use standard formats such as JSON, and compress and encrypt if needed.
It will be much simpler to design your application so that it avoids conflicts, and possibly to flow data in a single direction, or to partition your data so that it always flows in a single direction for a specific partition.
One final thought is that with such an architecture you would be getting incremental updates, some of which might be missed for unplanned reasons (system failure, bugs, dropped messages, etc). You could have some built-in heuristic to check that your data matches, such as at least checking the number of records on each side, with some way to recover from such a fault: at a minimum, a way to manually re-fetch the data from the authoritative source. That is, if the server is authoritative, the client should be able to discard its data and re-fetch it. You might need such a mechanism anyway for cases when the client is reinstalled, etc.
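As a rough illustration of the "SQL and polling" variant of the event log on the client, the sketch below uses SQLite triggers to append every change to a log table, which the sync code can then read from a remembered position. The table names `readings` and `change_log` and the helper `changes_since` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL);

-- Append-only event log; seq gives a total order of changes.
CREATE TABLE change_log (
    seq    INTEGER PRIMARY KEY AUTOINCREMENT,
    op     TEXT NOT NULL,
    row_id INTEGER NOT NULL
);

CREATE TRIGGER readings_ins AFTER INSERT ON readings
BEGIN INSERT INTO change_log (op, row_id) VALUES ('insert', NEW.id); END;

CREATE TRIGGER readings_upd AFTER UPDATE ON readings
BEGIN INSERT INTO change_log (op, row_id) VALUES ('update', NEW.id); END;

CREATE TRIGGER readings_del AFTER DELETE ON readings
BEGIN INSERT INTO change_log (op, row_id) VALUES ('delete', OLD.id); END;
""")


def changes_since(connection, last_seq):
    """Return log entries newer than the last one already sent to the server."""
    return connection.execute(
        "SELECT seq, op, row_id FROM change_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()


conn.execute("INSERT INTO readings (id, value) VALUES (1, 20.5)")
conn.execute("UPDATE readings SET value = 21.0 WHERE id = 1")
conn.execute("DELETE FROM readings WHERE id = 1")
pending = changes_since(conn, 0)
```

After a successful upload the client would store the highest `seq` it sent and could prune older log rows to reclaim the storage mentioned above.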
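For instance, a change log could be packaged as compressed JSON using only the standard library. This is just one possible format; the entry layout (`seq`, `op`, `row_id`) and the function names `pack_changes` / `unpack_changes` are assumptions for the example, and encryption is left out.

```python
import json
import zlib


def pack_changes(changes):
    """Serialize a list of change entries to compact JSON, then compress it."""
    text = json.dumps(changes, separators=(",", ":"))
    return zlib.compress(text.encode("utf-8"))


def unpack_changes(blob):
    """Reverse of pack_changes: decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


batch = [
    {"seq": 1, "op": "insert", "row_id": 1},
    {"seq": 2, "op": "update", "row_id": 1},
]
payload = pack_changes(batch)
restored = unpack_changes(payload)
```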
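The record-count check with a full re-fetch as the fallback could look roughly like this. Both sides are simulated with in-memory SQLite databases purely to keep the sketch runnable; in the scenario from the question the server side would be PostgreSQL behind some network API, and the `items` table and all function names are invented for the example.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database with a small example table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


def full_refetch(client, server):
    """Discard the client's copy and reload it from the authoritative server."""
    rows = server.execute("SELECT id, name FROM items").fetchall()
    with client:  # delete and reload in one transaction
        client.execute("DELETE FROM items")
        client.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)


def verify_and_repair(client, server):
    """Cheap heuristic: compare record counts, re-fetch on mismatch."""
    if row_count(client) == row_count(server):
        return False
    full_refetch(client, server)
    return True


server = make_db([(1, "a"), (2, "b"), (3, "c")])
client = make_db([(1, "a"), (2, "b")])  # one update was missed
repaired = verify_and_repair(client, server)
```

A count comparison will not notice a missed update to an existing row, so a stronger check (for example per-table checksums) may be needed depending on the data.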
I'm currently building a web service using python / flask and would like to build my data layer on top of neo4j, since my core data structure is inherently a graph.
I'm a bit confused by the different technologies offered by neo4j for that case. Especially :
I originally planned on using the REST API through py2neo, but the lack of transactions is a bit of a problem.
The "embedded database" neo4j doesn't seem to suit my case very well. I guess it's useful when you're working with batch and one-time analytics, and don't need to store the database on a different server from the web server.
I've stumbled upon the neo4django project, but I'm not sure it offers transaction support (since there is no native client to neo4j for Python), or whether it would be a problem to use it outside Django itself. In fact, after having looked at the project's documentation, I feel like it has exactly the same limitation, i.e. no transactions (but then, how can you build a real-world service when you can corrupt your model upon a single connection timeout?). I don't even understand what the use of that project is.
Could anyone recommend anything? I feel completely stuck.
Thanks
None of the REST API clients will be able to explicitly support (proper) transactions since that functionality is not available through the Neo4j REST API interface. There are a few alternatives such as Cypher queries and batched execution which all operate within a single atomic transaction on the server side; however, my general approach for client applications is to try to build code which can gracefully handle partially complete data, removing the need for explicit transaction control.
Often, this approach will make heavy use of unique indexing and this is one reason that I have provided a large number of "get_or_create" type methods within py2neo. Cypher itself is incredibly powerful and also provides uniqueness capabilities, in particular through the CREATE UNIQUE clause. Using these, you can make your writes idempotent and you can err on the side of "doing it more than once" safe in the knowledge that you won't end up with duplicate data.
Agreed, this approach doesn't give you transactions per se but in most cases it can give you an equivalent end result. It's certainly worth challenging yourself as to where in your application transactions are truly necessary.
Hope this helps
Nigel
I think neo4django makes use of neo4j-rest-client, which does support transactions through the batch resource in the Neo4j REST interface.
The syntax is quite similar to the one used by the Neo4j Python embedded API:
>>> n = gdb.nodes.create()
>>> n["age"] = 25
>>> n["place"] = "Houston"
>>> n.properties
{'age': 25, 'place': 'Houston'}
>>> with gdb.transaction():
...     n.delete("age")
...
>>> n.properties
{u'place': u'Houston'}
More information can be found in the neo4j-rest-client documentation about transactions.
I have an application that needs to interface with another app's database. I have read access but not write.
Currently I'm using SQL statements via pyodbc to grab the rows and using Python to manipulate the data. Since I don't cache anything, this can be quite costly.
I'm thinking of using an ORM to solve my problem. The question is: if I use an ORM like SQLAlchemy, would it be smart enough to pick up changes in the other database?
E.g. SQLAlchemy accesses a table and retrieves a row. If that row got modified outside of SQLAlchemy, would it be smart enough to pick it up?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit: To be more clear
I have one application that is simply a reporting tool; let's call it App A.
I have another application that handles various financial transactions, called App B.
A has access to B's database to retrieve the transactions and generates various reports. There are hundreds of thousands of transactions. We're currently caching this info manually in Python; if we need an updated report, we refresh the cache. If we get rid of the cache, the SQL queries combined with the calculations become unscalable.
I don't think an ORM is the solution to your performance problem. By default, ORMs tend to be less efficient than raw SQL because they might fetch data that you're not going to use (e.g. doing a SELECT * when you need only one field), although SQLAlchemy allows fine-grained control over the SQL generated.
Now to implement a caching mechanism, depending on your application, you could use a simple dictionary in memory or a specialized system such as memcached or Redis.
To keep your cached data relatively fresh, you can poll the source at regular intervals, which might be OK if your application can tolerate a little delay. Otherwise you'll need the application that has write access to the db to notify your application or your cache system when an update occurs.
Edit: since you seem to have control over app B, and you've already got a cache system in app A, the simplest way to solve your problem is probably to create a callback in app A that app B can call to expire cached items. Both apps need to agree on a convention to identify cached items.
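The two freshness strategies from this answer, time-based refresh and explicit expiry via a callback, can be combined in one small in-memory cache. This is only a sketch: `ReportCache`, `load_report` and the key `"2024-Q1"` are hypothetical, and how app B actually reaches `expire` (an HTTP endpoint, a message queue, ...) is left open.

```python
import time


class ReportCache:
    """In-memory cache with a maximum age and an explicit expiry hook."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader          # runs the expensive queries/calculations
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so it can be faked in tests
        self._entries = {}             # key -> (loaded_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and self._clock() - entry[0] < self._max_age:
            return entry[1]
        # Missing or too old: reload from the source database.
        value = self._loader(key)
        self._entries[key] = (self._clock(), value)
        return value

    def expire(self, key):
        """Callback for app B: drop a cached item after it was changed."""
        self._entries.pop(key, None)


loads = []


def load_report(key):
    loads.append(key)
    return f"report for {key}"


cache = ReportCache(load_report, max_age_seconds=60)
cache.get("2024-Q1")
cache.get("2024-Q1")       # served from the cache, no second load
cache.expire("2024-Q1")    # app B signals that the data changed
cache.get("2024-Q1")       # reloaded
```

The key passed to `expire` is the convention both apps have to agree on.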
I was wondering when dealing with a web service API that returns XML, whether it's better (faster) to just call the external service each time and parse the XML (using ElementTree) for display on your site or to save the records into the database (after parsing it once or however many times you need to each day) and make database calls instead for that same information.
First off -- measure. Don't just assume that one is better or worse than the other.
Second, if you really don't want to measure, I'd guess the database is a bit faster (assuming the database is relatively local compared to the web service). Network latency usually is more than parse time unless we're talking a really complex database or really complex XML.
Everyone is being very polite in answering this question: "it depends"... "you should test"... and so forth.
True, the question does not go into great detail about the application and network topologies involved, but if the question is even being asked, then it's likely a) the DB is "local" to the application (on the same subnet, or the same machine, or in memory), and b) the webservice is not. After all, the OP uses the phrases "external service" and "display on your site." The phrase "parsing it once or however many times you need to each day" also suggests a set of data that doesn't exactly change every second.
The classic SOA myth is that the network is always available; going a step further, I'd say it's a myth that the network is always available with low latency. Unless your own internal systems are crap, sending an HTTP query across the Internet will always be slower than a query to a local DB or DB cluster. There are any number of reasons for this: number of hops to the remote server, outage or degradation issues that you can't control on the remote end, and the internal processing time for the remote web service application to analyze your request, hit its own persistence backend (aka DB), and return a result.
Fire up your app. Measure latency and response times against your DB. Now do the same against a remote web service. Unless your DB is also across the Internet, you'll notice a huge difference.
It's not at all hard for a competent technologist to scale a DB, or for you to bypass the DB entirely by caching with memcached and similar tools; the latency between servers sitting near each other in the datacentre is monumentally less than between machines over the Internet (and the connection is more secure, to boot). Even if achieving this scale requires some thought, it's under your control, unlike a remote web service whose scaling and latency are totally opaque to you. I, for one, would not be too happy with the idea that the availability and responsiveness of my site depend entirely on someone else.
Finally, what happens if the remote web service is unavailable? Imagine a world where every request to your site involves a request over the Internet to some other site. What happens if that other site is unavailable? Do your users watch a spinning cursor of death for several hours? Do they enjoy an Error 500 while your site borks on this unexpected external dependency?
If you find yourself adopting an architecture whose fundamental features depend on a remote Internet call for every request, think very carefully about your application before deciding if you can live with the consequences.
Consuming the webservices is more efficient because there are a lot more things you can do to scale your webservices and webserver (via caching, etc.). By consuming the middle layer, you also have the option to change the returned data format (e.g. you can decide to use JSON rather than XML). Scaling a database is much harder (involving replication, etc.), so in general, reduce hits on the DB if you can.
There is not enough information to be able to say for sure in the general case. Why don't you do some tests and find out? Since it sounds like you are using python you will probably want to use the timeit module.
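A minimal way to run such a comparison with `timeit` is shown below. The two fetch functions are placeholders doing arbitrary work; in a real test they would be replaced with the actual web service call plus XML parsing and the actual database query.

```python
import timeit


def fetch_from_service():
    # Placeholder (assumption): stands in for "call the service and parse XML".
    return sum(range(1000))


def fetch_from_database():
    # Placeholder (assumption): stands in for "run the equivalent DB query".
    return sum(range(100))


def best_time(func, repeat=5, number=100):
    """Best average seconds per call over several timing runs."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


service_seconds = best_time(fetch_from_service)
database_seconds = best_time(fetch_from_database)
print(f"service:  {service_seconds:.6f} s per call")
print(f"database: {database_seconds:.6f} s per call")
```

Taking the minimum of several repeats reduces noise from other processes. For network calls it is also worth looking at the spread of timings, not only the best case.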
Some things that could affect the result:
Performance of the web service you are using
Reliability of the web service you are using
Distance between servers
Amount of data being returned
I would guess that if it is cacheable, that a cached version of the data will be faster, but that does not necessarily mean using a local RDBMS, it might mean something like memcached or an in memory cache in your application.
It depends - who is calling the web service? Is the web service called every time the user hits the page? If that's the case, I'd recommend introducing a caching layer of some sort - many web service APIs throttle the number of hits you can make per hour.
Whether you choose to parse the cached XML on the fly or call the data from a database probably won't matter (unless we are talking enterprise scaling here). Personally, I'd much rather make a simple SQL call than write a DOM Parser (which is much more prone to exceptional scenarios).
It depends from case to case, you'll have to measure (or at least make an educated guess).
You'll have to consider several things.
Web service
it might hit database itself
it can be cached
it will introduce network latency and might be unreliable
or it could be in local network and faster than accessing even local disk
DB
might be slow since it needs to access disk (although databases have internal caches, but those are usually not targeted)
should be reliable
Technology itself doesn't mean much in terms of speed - in one case the database parses SQL, in the other an XML parser parses XML, and the database is usually accessed via a socket as well, so you have both parsing and network in either case.
Caching data in your application if applicable is probably a good idea.
As a few people have said, it depends, and you should test it.
Often external services are slow, and caching them locally (in a database or in memory, e.g., with memcached) is faster. But perhaps not.
Fortunately, it's cheap and easy to test.
Test definitely. As a rule of thumb, XML is good for communicating between apps, but once you have the data inside of your app, everything should go into a database table. This may not apply in all cases, but 95% of the time it has for me. Anytime I tried to store data any other way (e.g. XML in a content management system), I ended up wishing I had just used good old sprocs and SQL Server.
It sounds like you essentially want to cache results, and are wondering if it's worth it. But if so, I would NOT use a database (I assume you are thinking of a relational DB): RDBMSs are not good for caching, even though many use them that way. You need neither persistence nor ACID.
If the choice were between Oracle/MySQL and an external web service, I would start with just using the service.
Instead, consider real caching systems; local or not (memcache, simple in-memory caches etc).
Or if you must use a DB, use a key/value store; BDB works well. Store the response message in its serialized form (XML); try to fetch it from the cache and, if it's not there, fetch it from the service, then parse. Or if there's a convenient and more compact serialization, store and fetch that.
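The fetch-from-cache-else-from-service flow can be sketched as follows, with a plain dict standing in for the key/value store (memcached, BDB, ...). The function `fetch_xml_from_service` and the record layout are invented for the example; only the ElementTree calls are real library API.

```python
import xml.etree.ElementTree as ET

# Assumption: a dict stands in for the real key/value store.
_cache = {}


def fetch_xml_from_service(record_id):
    # Hypothetical stand-in for the HTTP call to the external web service.
    return f"<record id='{record_id}'><title>Example {record_id}</title></record>"


def get_record(record_id, fetch=fetch_xml_from_service, cache=_cache):
    """Return the parsed record, storing the serialized XML in the cache."""
    xml_text = cache.get(record_id)
    if xml_text is None:
        # Cache miss: go to the service and keep the raw response.
        xml_text = fetch(record_id)
        cache[record_id] = xml_text
    root = ET.fromstring(xml_text)
    return {"id": root.get("id"), "title": root.findtext("title")}


record = get_record(7)
record_again = get_record(7)  # parsed again, but not fetched again
```

Storing the raw XML keeps the cache independent of how the application parses it, at the cost of re-parsing on every hit; caching the parsed dict instead is the "more compact serialization" variant.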