I'm looking for open-ended advice on the best approach to rewriting a simple document control app I developed, which is really just a custom file log generator that looks for and logs files that have a certain naming format and file location. E.g., we name all our Change Orders with the format "CO#3 brief description.docx". When they're issued, they get moved to an "issued" folder under another folder that has the project name. So, by logging the file and querying its path, we can tell which project it's associated with and whether it's been issued.
I wrote it with Python 3.3. Works well, but the code's tough to support because I'm building the reports while walking the file structure, which can get pretty messy. I'm thinking it would be better to build a DB of most/all of the files first and then query the DB with SQL to build the reports.
Sorry for the open-ended question, but I'm hoping not to reinvent the wheel. Does anyone have advice about going down this road? E.g., existing apps I should look at, or packages that might help? I have lots of C/C++ coding experience but am still new to Python and MySQL. Any advice would be greatly appreciated.
Really nice answer by @GCord. I'd add just two bits:
If it's a relatively small database, consider sqlite3 instead of MySQL: it is nicely supported out of the box, multiplatform, and has no dependency on a running RDBMS (a minimal ingestion sketch follows below).
If it's expected to grow, and/or you just want to play with some new technology, try writing automated ingestion scripts for a real document management system (e.g., http://www.alfresco.com/). I'd recommend Apache Solr (based on Apache Lucene) as a full-text indexing service, and then you could use Apache Tika to automatically extract text and metadata from your documents (see http://wiki.apache.org/solr/ExtractingRequestHandler).
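To make the sqlite3 suggestion concrete, here is a minimal ingestion sketch. The CO#-style filename pattern, the project/issued folder layout, and the table schema are all assumptions taken from the question above, not a definitive design:

```python
import os
import re
import sqlite3

# Hypothetical schema: one row per document, keyed by full path.
conn = sqlite3.connect("doclog.db")
conn.execute("""CREATE TABLE IF NOT EXISTS documents (
                    path TEXT PRIMARY KEY,
                    project TEXT,
                    doc_type TEXT,
                    doc_number INTEGER,
                    description TEXT,
                    issued INTEGER)""")

# Assumed naming convention from the question: "CO#3 brief description.docx"
CO_PATTERN = re.compile(r"CO#(\d+)\s+(.*)\.docx$", re.IGNORECASE)

def ingest(root):
    """Walk the project tree and upsert matching files into the database."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            match = CO_PATTERN.match(name)
            if not match:
                continue
            full_path = os.path.join(dirpath, name)
            # Assumed layout: <root>/<project>/.../issued/CO#n description.docx
            rel_parts = os.path.relpath(full_path, root).split(os.sep)
            project = rel_parts[0]
            issued = int("issued" in (p.lower() for p in rel_parts))
            conn.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?, ?, ?)",
                (full_path, project, "CO", int(match.group(1)),
                 match.group(2), issued))
    conn.commit()

ingest(r"/path/to/projects")

# Reports then become plain SQL, e.g. all unissued change orders per project:
for row in conn.execute(
        "SELECT project, doc_number, description FROM documents "
        "WHERE issued = 0 ORDER BY project, doc_number"):
    print(row)
```

Once the table is populated, the report logic becomes plain SQL instead of code interleaved with os.walk, which was the maintainability complaint in the first place.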
Firstly, if it works well as you suggest, then why fix it?
Secondly, before making any changes to your code, I would ask myself the following questions:
What are the improvements/new requirements I want to implement that I can't easily do with the current structure?
Do I have a test suite of the current solution, so that I can regression-test any refactoring? When re-implementing something it is easy to overlook some specific behaviors which are not very well documented but that you/users rely on.
Do those improvements warrant an SQL database? For instance:
Do you often need to run reports straight out of an SQL database, rather than by walking the directory structure each time?
Is there a problem with walking the directories?
Do you have network or performance issues?
Are you facing an increase in usage?
When implementing an SQL solution, you will need a new task to update the SQL data. If I understand correctly, the reports are currently generated on the fly, and are therefore always up to date. That won't be the case with SQL reports, so you need to make sure they stay up to date too. How frequently will you update the SQL database?
a) In real time? That would necessitate a background service, which could be an operational hassle.
b) On demand? Then what would be the difference from the current solution?
c) At scheduled times? Then your data may not be up to date between updates (an incremental rescan, sketched below, can keep each update cheap).
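That incremental rescan could be keyed on file modification times, so a refresh is cheap enough to run right before each report. A rough sketch, with a hypothetical minimal schema:

```python
import os
import sqlite3

conn = sqlite3.connect("doclog.db")
conn.execute("""CREATE TABLE IF NOT EXISTS documents (
                    path TEXT PRIMARY KEY,
                    mtime REAL,
                    issued INTEGER)""")

def refresh(root):
    """Re-scan the tree, touching only files that are new or have changed."""
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            seen.add(path)
            mtime = os.path.getmtime(path)
            row = conn.execute(
                "SELECT mtime FROM documents WHERE path = ?", (path,)).fetchone()
            if row is not None and row[0] == mtime:
                continue  # unchanged since the last scan
            issued = int("issued" in dirpath.lower())
            conn.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
                (path, mtime, issued))
    # Remove rows for files that no longer exist on disk.
    for (path,) in conn.execute("SELECT path FROM documents").fetchall():
        if path not in seen:
            conn.execute("DELETE FROM documents WHERE path = ?", (path,))
    conn.commit()

refresh(r"/path/to/projects")  # call this right before generating reports
```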
I don't have any packages or technical approaches to recommend; I just thought I'd offer that general software-management advice.
In any case, I also have extensive C++, Python, and SQL experience, and I would just stick with Python on this one.
On the SQL side, why stick to traditional SQL engines? Why not MongoDB, for instance? It would be well suited to storing structured data such as file information.
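If you do try that route, a minimal pymongo sketch might look like the following; the database, collection, and field names are made up for illustration, and it assumes a MongoDB server running locally:

```python
from pymongo import MongoClient

# Assumes a local MongoDB instance on the default port.
client = MongoClient("mongodb://localhost:27017/")
files = client["doccontrol"]["files"]

# One document per file; no schema to declare up front.
files.replace_one(
    {"path": "/projects/Acme/issued/CO#3 brief description.docx"},
    {"path": "/projects/Acme/issued/CO#3 brief description.docx",
     "project": "Acme",
     "doc_type": "CO",
     "number": 3,
     "issued": True},
    upsert=True)

# Query example: all issued change orders for one project.
for doc in files.find({"project": "Acme", "issued": True}):
    print(doc["path"])
```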
I've just discovered Sir, a database based on text files, but it's far from ready and it's written in JS (i.e., not for me).
My first intuition was to ask if there's something like this available for Python or C++, but since that's not the kind of question one should ask on Stack Overflow, let me put it more generally:
I like the way e.g. git is made: it stores data as easy-to-handle separate files, and it's astonishingly fast at the same time. Moreover, git does not require a server that holds data in memory to be fast (the filesystem cache does a good enough job), and, maybe the best part, the way git keeps data in "memory" (the filesystem) is intrinsically language-agnostic.
Of course git is not a database, and databases have different challenges to master, but I still dare to ask: are there generic approaches to making databases as transparent and manually modifiable as git is?
Are there keywords, examples, generally accepted concepts, or working projects (like Sir, but preferably Python- or C++-based) I should get to know if I want to enhance my fuzzy filesystem-polluting project with fast, database-like technology, providing a nice query language, without having to sacrifice the simplicity of just manually editing/copying/overwriting files on the filesystem?
SQLite is exactly what you are looking for. It is built into Python as well: sqlite3.
It's just not human-readable, but neither is git's object store. It is, however, purely serverless and based on files, just like git.
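To make the serverless point concrete: the entire database below is one ordinary file on disk that you can copy, back up, or delete like any other file (the file and table names are just examples):

```python
import sqlite3

# The whole database is the single file "notes.db"; no server involved.
conn = sqlite3.connect("notes.db")
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello from a plain file",))
conn.commit()

for row in conn.execute("SELECT id, body FROM notes"):
    print(row)

conn.close()
```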
I've tried various ways of extracting reports from Oracle Business Intelligence (not hosted locally, version 11g), and the best I've come up with so far is the pyobiee library, which is pretty good: https://github.com/kazei92/pyobiee. I've managed to log in and extract reports that I've already written, but in an ideal world I would be able to interrogate the SQL directly. I've tried this using the executeSQL function in pyobiee, but I can only manage to extract a column or two before it can't do any more. I think I'm limited by my understanding of the SQL syntax, which is not a familiar one (it's Logical SQL, with no GROUP BY requirement), and I can't find a decent summary of how to use it. Where I have found summaries, I've followed them and it doesn't work (https://docs.oracle.com/middleware/12212/biee/BIESQ/toc.htm#BIESQ102). Please can you advise where I can find a better summary of the Logical SQL syntax?
The other possibility is that there is something wrong with the pyobiee library (it hasn't been maintained since August). I would be open to using pyodbc or cx_Oracle instead, but I can't work out how to log in using those routes. Please can you advise?
The reason I'm taking this route is that my organisation has mapping tables that are not held in OBIEE, and there is no prospect of getting them in there. So I'm working on extracting the data with Python so that I can add the mapping tables in SQL Server.
I advise you to rethink what you are doing. First of all, the Python library is a wrapper around the OBI web services, which in itself isn't wrong, but it is an additional layer of abstraction that hides most of the web services and their functionality. There are far more than three of them...
Second, the real question is "What exactly are you trying to achieve?". If you simply want data from the OBI server, then you can just as well get that over ODBC. There's no need for 50 additional technologies in the middle.
As far as LSQL is concerned: Yes, there is a reference: https://docs.oracle.com/middleware/12212/biee/BIESQ/BIESQ.pdf
BUT you will definitely need to know what you want to access, since what governs things is the RPD. A metadata layer. Not a database.
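If you go the ODBC route, a rough pyodbc sketch is below. It assumes the Oracle BI Server ODBC client driver is installed and a DSN is configured for your server; the DSN, credentials, subject area, and column names are all placeholders, and the query is Logical SQL against the presentation layer defined in the RPD:

```python
import pyodbc

# Placeholder DSN and credentials; the DSN must point at the BI Server
# through the Oracle BI ODBC client driver.
conn = pyodbc.connect("DSN=OBIEE_DSN;UID=my_user;PWD=my_password", autocommit=True)
cursor = conn.cursor()

# Logical SQL runs against subject areas/columns from the RPD,
# not against physical tables.
logical_sql = '''
SELECT
    "Sales"."Time"."Year",
    "Sales"."Facts"."Revenue"
FROM "Sales"
'''
cursor.execute(logical_sql)
for row in cursor.fetchall():
    print(row)

conn.close()
```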
I have been working on an analytical tool to help interpret and analyze a database that is bundled within the package. It is very important for us to secure the database in such a way that it can only be accessed with our software. What is the best way of achieving this in Python?
I am aware that there may not be a definitive solution, but deterrence is what really matters here.
Thank you very much.
Someone has gotten Python and SQLCipher working together by rebuilding SQLCipher as a DLL and replacing Python's sqlite3.dll here.
This question comes up on the SQLite users mailing list about once a month.
No matter how much encryption etc. you do, if the database is on the client machine then the key to decrypt it will also be on that machine at some point. An attacker will be able to get that key, since it is their machine.
A better way of looking at this is in terms of money: how much would a bad guy need to spend in order to get the data? This will generally be a few hundred dollars at most. And all it takes is any one person to get the key, and they can then publish the database for everyone.
So either go for a web service as mentioned by Donal, or just spend a few minutes obfuscating the database. For example, if you use APSW then you can write a VFS in a few lines that XORs the database content, so regular SQLite will not open it, nor will a file viewer show the normal SQLite header. (There is example code in APSW showing how to do this.)
Consequently, anyone who does have the database content had to knowingly extract it.
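A rough sketch of that XOR idea, loosely following the pattern in APSW's example code; the VFS name and key are arbitrary, and this is deterrence only, exactly as discussed above:

```python
import apsw

KEY = 0xA5  # arbitrary single-byte XOR key; deterrence, not security

def obfuscate(data):
    """XOR every byte; applying it twice restores the original data."""
    return bytes(b ^ KEY for b in data)

class ObfuscatedVFSFile(apsw.VFSFile):
    def __init__(self, inherit_from, filename, flags):
        apsw.VFSFile.__init__(self, inherit_from, filename, flags)

    def xRead(self, amount, offset):
        return obfuscate(super().xRead(amount, offset))

    def xWrite(self, data, offset):
        super().xWrite(obfuscate(data), offset)

class ObfuscatedVFS(apsw.VFS):
    def __init__(self, name="obfuscated", base=""):
        self.base = base
        apsw.VFS.__init__(self, name, base)

    def xOpen(self, name, flags):
        return ObfuscatedVFSFile(self.base, name, flags)

obfu_vfs = ObfuscatedVFS()  # registers the VFS under the name "obfuscated"

# Regular SQLite tools will refuse to open this file, since the header
# (and everything else) is XORed on disk.
conn = apsw.Connection("protected.db", vfs="obfuscated")
conn.cursor().execute("CREATE TABLE IF NOT EXISTS t (x)")
```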
I have a situation where various analysis programs output large amounts of data, but I may only need to manipulate or access certain parts of the data in a particular Excel workbook.
The numbers might often change as well as newer analyses are run, and I'd like these changes to be reflected in Excel in as automated a manner as possible. Another important consideration is that I'm using Python to process some of the data too, so putting the data somewhere where it's easy for Python and Excel to access would be very beneficial.
I know only a little about databases, but I'm wondering if using one would be a good solution for my needs. Excel has database interaction capability as far as I'm aware, as does Python. The devil is in the details of course, so I need some help figuring out what system I'd actually set up.
From what I've currently read (in the last hour), here's the simple plan I've come up with so far:
1) Set up an SQLite managed database. Why SQLite? Well, I don't need a database that can manage large volumes of concurrent accesses, but I do need something that is simple to set up, easy to maintain and good enough for use by 3-4 people at most. I can also use the SQLite Administrator to help design the database files.
2 a) Use ODBC/ADO.NET (I have yet to figure out the difference between the two) to help Excel access the database. This is going to be the trickiest part, I think.
2 b) Python already has the built-in sqlite3 module, so no worries with the interface there. I can use it to load the output data into an SQLite-managed database as well!
Putting down some concrete questions:
1) Is a server-less database a good solution for managing my data given my access requirements? If not, I'd appreciate alternative suggestions. Suggested reading? Things worth looking at?
2) Excel-SQLite interaction: I could do with some help fleshing out the details there... ODBC or ADO.NET? Pointers to some good tutorials? etc.
3) Last, but not least, and definitely of concern: will it be easy enough to teach a non-programmer how to set up spreadsheets using queries to the database (assuming they're willing to put in some time with familiarization, but not very much)?
I think that about covers it for now, thank you for your time!
Although you could certainly use a database to do what you're asking, I'm not sure you really want to add that complexity. I don't see much benefit in adding a database to your mix. If you were already pulling data from a database as well, then it would make more sense to add some tables for this and use it.
From what I currently understand of your requirements, since you're using Python anyway, you could do your preprocessing in Python, then just dump out the processed/augmented values into other CSV files for Excel to import. For a more automated solution, you could even write the results directly to the spreadsheets from Python using something like xlwt.
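For the xlwt option, a small sketch along these lines would write the processed results straight into a workbook (the sheet layout and result values here are invented):

```python
import xlwt

# Hypothetical processed results: (case name, value) pairs from the analysis.
results = [("case_A", 12.7), ("case_B", 98.4), ("case_C", 3.2)]

workbook = xlwt.Workbook()
sheet = workbook.add_sheet("Results")

# Header row.
sheet.write(0, 0, "Case")
sheet.write(0, 1, "Value")

# One row per processed result.
for row_index, (name, value) in enumerate(results, start=1):
    sheet.write(row_index, 0, name)
    sheet.write(row_index, 1, value)

workbook.save("analysis_results.xls")
```

If you need .xlsx rather than the older .xls format, openpyxl offers a very similar cell-by-cell workflow.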
I'm working on a little project, the aim being to generate a report from a database for a server. The database is SQLite and contains tables like 'connections', 'downloads', etc.
The report I produce will ultimately contain a number of graphs displaying things like 'connections per day', 'top downloads this month', etc.
I plan to use flot for the graphs because the graphs it makes look very nice.
This is my current plan for how my reports will work:
A static .html file, which is the report. This will contain headings, embedded flot graphs, etc.
JSON data files. These will be generated by my report-generation Python script; they will basically contain a JSON variable for each graph, representing the dataset that the graph should plot ([100,2009-2-2],[192,2009-2-3]...).
A report-generation Python script. This will load the SQLite database, run a fixed list of SQL queries, and spit out the JSON data files; a rough sketch follows.
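A minimal sketch of what that script might look like; the connections and downloads table names come from the database described above, but the column names and queries are purely illustrative:

```python
import json
import sqlite3

conn = sqlite3.connect("server.db")

# One entry per graph: output file name -> SQL that yields (label, value) rows.
REPORT_QUERIES = {
    "connections_per_day.json":
        "SELECT date(timestamp), COUNT(*) FROM connections GROUP BY date(timestamp)",
    "top_downloads_this_month.json":
        "SELECT filename, COUNT(*) AS hits FROM downloads "
        "WHERE timestamp >= date('now', 'start of month') "
        "GROUP BY filename ORDER BY hits DESC LIMIT 10",
}

for filename, sql in REPORT_QUERIES.items():
    rows = conn.execute(sql).fetchall()
    # flot expects a list of [x, y] pairs per series.
    with open(filename, "w") as out:
        json.dump([list(row) for row in rows], out)
```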
Does this sound like a sensible setup? I can't help but feel it could be improved, but I don't see how. I want the reports to be static: the server they run on cannot take heavy loads, so a dynamically generated report is out of the question, and also unnecessary for this application.
My concerns are:
I feel that the Python script is largely pointless, all of the processing performed is done by SQLite, my script is basically going to be used to store SQL queries and package up the output. With a bit more work SQLite could probably do this for me.
It seems I'm solving a problem that must have been solved many times before 'take sql queries, spit out pretty graphs in a daily report' must have been done hundreds of times. I'm just having trouble tracking down any broad implementations.
It sounds sensible to me.
You need some programming language to talk to SQLite. You could do it in C, but if you can write the necessary glue code easily in Python, why not? You'll almost certainly save more time writing it than you'll lose from not having the most efficient possible program.
There are definitely programs to analyse logs for you - I've heard of Piwik, for instance. That's for dynamic reports, but no doubt there are projects to do static reports too. But they may not fit the data and output you're after. Writing it yourself means you know precisely what you're getting, so if it's not too much work, carry on.
I feel that the Python script is largely pointless, all of the processing performed is done by SQLite, my script is basically going to be used to store SQL queries and package up the output. With a bit more work SQLite could probably do this for me.
Maybe so, but even then, Python is a great glue language. Also, if you need to do some processing SQLite isn't good at, Python is already there.
It seems I'm solving a problem that must have been solved many times before 'take sql queries, spit out pretty graphs in a daily report' must have been done hundreds of times. I'm just having trouble tracking down any broad implementations.
I think you're leaning towards the general class of HTTP-served reporting. One thing out there that overlaps your problem set is Django, which provides a Python interface between database (SQLite is supported) and web server, along with a templating system for your outputs.
If you just want one or two pieces of a solution, then I recommend looking at SQLAlchemy for interfacing with the database, Jinja2 for templating, and/or Werkzeug for HTTP server interface.
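If you only want the templating piece for a static report, a Jinja2 sketch along these lines could wrap your JSON data files in the report HTML; the template text and variable names are invented:

```python
from jinja2 import Template

# A toy template; in practice this would live in its own .html file
# alongside the flot <script> includes.
template = Template("""
<html>
  <body>
    <h1>{{ title }}</h1>
    {% for graph in graphs %}
      <h2>{{ graph.heading }}</h2>
      <div class="plot" data-source="{{ graph.json_file }}"></div>
    {% endfor %}
  </body>
</html>
""")

html = template.render(
    title="Daily server report",
    graphs=[
        {"heading": "Connections per day", "json_file": "connections_per_day.json"},
        {"heading": "Top downloads this month", "json_file": "top_downloads_this_month.json"},
    ])

with open("report.html", "w") as out:
    out.write(html)
```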