Password protect sqlite3 database with python [duplicate] - python

I have been developing an analytical tool to help interpret and analyze a database that is bundled within the package. It is very important for us to secure the database so that it can only be accessed with our software. What is the best way of achieving this in Python?
I am aware that there may not be a definitive solution, but deterrence is what really matters here.
Thank you very much.

Someone has gotten Python and SQLCipher working together by rebuilding SQLCipher as a DLL and replacing Python's sqlite3.dll here.

This question comes up on the SQLite users mailing list about once a month.
No matter how much encryption you do, if the database is on the client machine then the key to decrypt it will also be on that machine at some point, and an attacker will be able to get that key since it is their machine.
A better way of looking at this is in terms of money: how much would a bad guy need to spend to get the data? That will generally be a few hundred dollars at most. And it only takes one person to get the key before they can publish the database for everyone.
So either go for a web service as mentioned by Donal or just spend a few minutes obfuscating the database. For example if you use APSW then you can write a VFS in a few lines that XORs the database content so regular SQLite will not open it, nor will a file viewer show the normal SQLite header. (There is example code in APSW showing how to do this.)
Consequently, anyone who does get at the database content has to have done so knowingly.
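For what it's worth, here is a minimal sketch of such an XOR-obfuscating VFS. It assumes APSW's VFS/VFSFile subclassing interface (xOpen/xRead/xWrite) as described in its documentation; the key value and file name are illustrative, and a single-byte XOR is deterrence only, not encryption:

    import apsw

    KEY = 0x5A  # illustrative single-byte "key" -- deterrence only, not encryption

    class ObfuscatedVFSFile(apsw.VFSFile):
        def __init__(self, inheritfromvfsname, filename, flags):
            apsw.VFSFile.__init__(self, inheritfromvfsname, filename, flags)

        def xRead(self, amount, offset):
            # XOR every byte on its way from disk
            data = apsw.VFSFile.xRead(self, amount, offset)
            return bytes(b ^ KEY for b in data)

        def xWrite(self, data, offset):
            # XOR every byte on its way to disk
            apsw.VFSFile.xWrite(self, bytes(b ^ KEY for b in data), offset)

    class ObfuscatedVFS(apsw.VFS):
        def __init__(self, name="obfu", base=""):
            self.base = base
            apsw.VFS.__init__(self, name, base)

        def xOpen(self, name, flags):
            return ObfuscatedVFSFile(self.base, name, flags)

    vfs = ObfuscatedVFS()                      # registers itself under the name "obfu"
    con = apsw.Connection("bundled.db", vfs="obfu")

A file written through this VFS no longer starts with the normal SQLite header, so a stock sqlite3 or a hex viewer won't recognise it, while your own program opens it transparently.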

Related

How to share sensitive data among programs while keeping the possibility of comparing them with other local data?

Context
As part of my studies, I am creating a bot capable of detecting scam messages, in Python 3. One of the problems I am facing is the detection of fraudulent websites.
Currently, I have a list of domain names saved in a CSV file, containing both known domains considered safe (discord.com, google.com, etc.), and known fraudulent domains (free-nitro.ru etc.)
To share this list between my personal computer and my server, I regularly "deploy" it over FTP. But since my bot also uses GitHub and a MySQL database, I'm looking for a better system to synchronize this list of domain names without letting just anyone access it.
I feel like I'm looking for a miracle solution that doesn't exist, but I don't want to overestimate my knowledge so I'm coming to you for advice, thanks in advance!
My considered solutions:
Put the domain names in a MySQL table
Advantages: no public access, live synchronization
Disadvantages: my scam-detection script needs to be able to work offline
Hash the domain names before putting them on git
Advantages: no public access, easy to do, supports equality comparison
Disadvantages: does not support similarity comparison, which is an important part of the program
Hash domain names with locality-sensitive hashing (see the sketch after this list)
Advantages: no easy public access, supports both equality and similarity comparison
Disadvantages: similarity comparisons are less precise than with plaintext, and the server cannot hash a new string without knowing at least the random seed, so the public-access problem comes back
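For illustration, here is a minimal, dependency-free MinHash-style sketch over character 3-grams (this is not the notebook's algorithm; the signature length and seed handling are assumptions). It supports an approximate similarity comparison between domain names while only a seed has to be shared:

    import hashlib

    NUM_HASHES = 64                  # signature length: precision vs. size trade-off
    SEED = b"shared-secret-seed"     # must be known by every bot instance

    def shingles(domain, n=3):
        # Character n-grams of the lower-cased domain
        domain = domain.lower()
        return {domain[i:i + n] for i in range(max(1, len(domain) - n + 1))}

    def minhash(domain):
        # One signature slot per salted hash function
        sig = []
        for i in range(NUM_HASHES):
            salt = SEED + i.to_bytes(2, "big")
            sig.append(min(
                int.from_bytes(hashlib.sha1(salt + s.encode()).digest()[:8], "big")
                for s in shingles(domain)
            ))
        return sig

    def similarity(sig_a, sig_b):
        # Fraction of matching slots approximates the Jaccard similarity of the shingle sets
        return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

    # Example: compare a suspicious domain against a known scam domain
    print(similarity(minhash("free-nitro.ru"), minhash("free-nitr0.ru")))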
My opinion
It seems to me that the last solution, with the LSH, is the one that causes the least problems. But it is far from satisfying me, and I hope to find better.
For the LSH algorithm, I have reproduced it here (from this notebook). I get similarity coefficients between 10% and 40% lower than those obtained with the current plain method.
EDIT: for clarification purposes, maybe my intentions weren't clear enough (I'm sorry, English is not my native language and I'm bad at explaining things lol). The database and GitHub are just convenient ways to share info between my different bot instances. I could have one running locally on my PC, one on my VPS, another god knows where… and this is why I don't want FTP or any kind of synchronisation process involving an IP and/or a fixed destination folder. Ideally I'd like to be able to take my program at any time, download it wherever I want (by git clone) and just run it.
Please tell me if this isn’t clear enough, thanks :)
In the end I think I'll use yet another solution. I'm thinking of using the MySQL database to store the domain names, but only using it in my script for synchronization, keeping a local CSV copy.
In short, the workflow I'm imagining:
I edit my SQL table when I want to add or remove items
When the bot is launched, the script connects to the DB and retrieves all the information from the table
Once the information is retrieved, it saves it in a CSV file and finishes running the rest of the script
If at launch no internet connection is available, the synchronization to the DB is not done and only the CSV file is used.
This way I get the advantages of no public access, automatic synchronization, and offline access after the first start, and I keep support for similarity comparison since no hashing is done.
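A rough sketch of that launch-time synchronization, assuming PyMySQL and an illustrative domains(name, is_scam) table; the connection details and file names are placeholders, not a definitive implementation:

    import csv
    import pymysql

    CSV_PATH = "domains.csv"  # local fallback copy

    def load_domains():
        try:
            conn = pymysql.connect(host="db.example.org", user="bot",
                                   password="...", database="scamdb",
                                   connect_timeout=5)
            with conn, conn.cursor() as cur:
                cur.execute("SELECT name, is_scam FROM domains")
                rows = cur.fetchall()
            # Refresh the local CSV so the bot still works offline next time
            with open(CSV_PATH, "w", newline="") as f:
                csv.writer(f).writerows(rows)
            return rows
        except pymysql.MySQLError:
            # No connection: fall back to the last synchronized CSV
            with open(CSV_PATH, newline="") as f:
                return [tuple(row) for row in csv.reader(f)]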
If you think you can improve my idea, I'm interested!

login to obiee and execute SQL using python

I've tried various ways of extracting reports from Oracle Business Intelligence (not hosted locally, version 11g), and the best I've come up with so far is the pyobiee library here, which is pretty good: https://github.com/kazei92/pyobiee. I've managed to login and extract reports that I've already written, but in an ideal world I would be able to interrogate the SQL directly. I've tried this using the executeSQL function in pyobiee, but I can only manage to extract a column or two and then it can't do any more. I think I'm limited by my understanding of the SQL syntax which is not a familiar one (it's more logical, no GROUP BY requirement), and I can't find a decent summary of how to use it. Where I have found summaries, I've followed them and it doesn't work (https://docs.oracle.com/middleware/12212/biee/BIESQ/toc.htm#BIESQ102). Please can you advise where I can find a better summary of the logical SQL syntax? The other possibility is that there is something wrong with the pyobiee library (hasn't been maintained since August). I would be open to using pyodbc or cx_Oracle instead, but I can't work out how to login using these routes. Please can you advise?
The reason I'm taking this route is because my organisation has mapping tables that are not held in obiee and there is no prospect of getting them in there. So I'm working on extracting using python so that I can add the mapping tables in SQL server.
I advise you to rethink what you are doing. First of all, the Python library is a wrapper around the OBI web services, which in itself isn't wrong, but it is an additional layer of abstraction that hides most of the web services and their functionality. There are way more than three...
Second - the real question is "What exactly are you trying to achieve?". If you simply want data from the OBI server, then you can just as well get that over ODBC. No need for 50 additional technologies in the middle.
As far as LSQL is concerned: Yes, there is a reference: https://docs.oracle.com/middleware/12212/biee/BIESQ/BIESQ.pdf
BUT you will definitely need to know what you want to access, since what governs things is the RPD: a metadata layer, not a database.
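If you do go the ODBC route, a hedged sketch might look like this; it assumes a DSN already configured for the OBI server's ODBC driver, and the credentials and the Logical SQL statement (subject area and column names) are purely illustrative:

    import pyodbc

    # DSN, credentials and the logical query below are placeholders
    conn = pyodbc.connect("DSN=OBIEE;UID=analyst;PWD=secret", autocommit=True)
    cur = conn.cursor()

    # Logical SQL addresses presentation-layer columns, not physical tables,
    # so there is no GROUP BY: the BI server works out the aggregation itself.
    cur.execute('SELECT "Products"."Product Name", "Sales Measures"."Revenue" '
                'FROM "A - Sample Sales"')
    for row in cur.fetchall():
        print(row)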

Need advice on writing a document control software with Python and MySQL

I'm looking for open-ended advice on the best approach to re-write a simple document control app I developed, which is really just a custom file log generator that looks for and logs files that have a certain naming format and file location. E.g., we name all our Change Orders with the format "CO#3 brief description.docx". When they're issued, they get moved to an "issued" folder under another folder that has the project name. So, by logging the file and querying its path, we can tell which project it's associated with and whether it's been issued.
I wrote it with Python 3.3. Works well, but the code's tough to support because I'm building the reports while walking the file structure, which can get pretty messy. I'm thinking it would be better to build a DB of most/all of the files first and then query the DB with SQL to build the reports.
Sorry for the open-ended question, but I'm hoping not to reinvent the wheel. Anyone have any advice as to going down this road? E.g., existing apps I should look at or bundles that might help? I have lots of C/C++ coding experience but am still new to Python and MySQL. Any advice would be greatly appreciated.
Really nice answer by @GCord. I'd add just two bits:
If it's a relatively small database, consider sqlite3 instead of MySQL (it is nicely supported out of the box, multiplatform, no dependencies on a running RDBMS).
If it's expected to grow, and/or you just want to play with some new technology, try to write automated ingestion scripts for a real document management system (e.g., http://www.alfresco.com/). I'd recommend Apache Solr (based on Apache Lucene) as a full-text indexing service, and then you could use Apache Tika to automatically extract text and metadata from your documents (see http://wiki.apache.org/solr/ExtractingRequestHandler).
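As a hedged sketch of the sqlite3 suggestion (the root path, naming convention and schema are assumptions based on the question): walk the project tree once, index the files, then build reports with plain SQL instead of while walking:

    import os
    import re
    import sqlite3

    ROOT = r"\\server\projects"  # illustrative root of the project folders

    con = sqlite3.connect("doclog.db")
    con.execute("""CREATE TABLE IF NOT EXISTS docs
                   (project TEXT, number INTEGER, name TEXT, issued INTEGER)""")

    pattern = re.compile(r"^CO#(\d+)\s+.*\.docx$", re.IGNORECASE)
    rows = []
    for dirpath, dirnames, filenames in os.walk(ROOT):
        folders = os.path.normpath(dirpath).split(os.sep)
        issued = folders[-1].lower() == "issued"
        project = folders[-2] if issued and len(folders) > 1 else folders[-1]
        for fn in filenames:
            m = pattern.match(fn)
            if m:
                rows.append((project, int(m.group(1)), fn, int(issued)))

    con.executemany("INSERT INTO docs VALUES (?, ?, ?, ?)", rows)
    con.commit()

    # Reporting becomes a query instead of more directory walking
    for row in con.execute(
            "SELECT project, COUNT(*) FROM docs WHERE issued = 1 GROUP BY project"):
        print(row)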
Firstly, if it works well as you suggest, then why fix it?
Secondly, before doing any changes to your code I would ask myself the following questions:
What are the improvements/new requirements I want to implement that I can't easily do with the current structure?
Do I have a test suite of the current solution, so that I can regression-test any refactoring? When re-implementing something it is easy to overlook some specific behaviors which are not very well documented but that you/users rely on.
Do those improvements warrant an SQL database? For instance:
Do you need to often run reports out of an SQL database without walking the directory structure?
Is there a problem with walking the directories?
Do you have network or performance issues?
Are you facing an increase in usage?
When implementing an SQL solution, you will need a new task to update the SQL data. If I understand correctly, the reports are currently generated on-the-fly, and therefore are always up-to-date. That won't be the case with SQL reports, so you need to make sure they are up-to-date too. How frequently will you update the SQL database:
a) In real-time? That will necessitate a background service. That could be an operational hassle.
b) On-demand? Then what would be the difference with the current solution?
c) At scheduled times? Then your data may be not up-to-date between the updates.
I don't have any packages or technical approaches to recommend to you, I just thought I'd give you this general software-management advice.
In any case, I also have extensive C++ and Python and SQL experience, and I would just stick to Python on this one.
On the database side, why stick to traditional SQL engines? Why not MongoDB, for instance, which would be well suited to storing structured data such as file information?

Using Excel to work with large amounts of output data: is an Excel-database interaction the right solution for the problem?

I have a situation where various analysis programs output large amounts of data, but I may only need to manipulate or access certain parts of the data in a particular Excel workbook.
The numbers might often change as well as newer analyses are run, and I'd like these changes to be reflected in Excel in as automated a manner as possible. Another important consideration is that I'm using Python to process some of the data too, so putting the data somewhere where it's easy for Python and Excel to access would be very beneficial.
I know only a little about databases, but I'm wondering if using one would be a good solution for my needs - Excel has database interaction capability as far as I'm aware, as does Python. The devil is in the details of course, so I need some help figuring out what system I'd actually set up.
From what I've currently read (in the last hour), here's the simple plan I've come up with so far:
1) Set up an SQLite managed database. Why SQLite? Well, I don't need a database that can manage large volumes of concurrent accesses, but I do need something that is simple to set up, easy to maintain and good enough for use by 3-4 people at most. I can also use the SQLite Administrator to help design the database files.
2 a) Use ODBC/ADO.NET (I have yet to figure out the difference between the two) to help Excel access the database. This is going to be the trickiest part, I think.
2 b) Python already has the built-in sqlite3 module, so no worries with the interface there. I can use it to load the output data into the SQLite-managed database as well!
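For 2 b), a minimal sketch of dumping analysis output into such an SQLite file (the table, columns and values are invented for illustration); Excel would then read the same file through whatever ODBC/ADO connection you settle on:

    import sqlite3

    # results: (run_id, parameter, value) tuples produced by the analysis code
    results = [("run42", "max_stress", 101.3), ("run42", "deflection", 0.07)]

    con = sqlite3.connect("analysis.db")
    con.execute("""CREATE TABLE IF NOT EXISTS results
                   (run_id TEXT, parameter TEXT, value REAL)""")
    con.executemany("INSERT INTO results VALUES (?, ?, ?)", results)
    con.commit()
    con.close()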
Putting down some concrete questions:
1) Is a server-less database a good solution for managing my data given my access requirements? If not, I'd appreciate alternative suggestions. Suggested reading? Things worth looking at?
2) Excel-SQLite interaction: I could do with some help fleshing out the details there... ODBC or ADO.NET? Pointers to some good tutorials? etc.
3) Last, but not least, and definitely of concern: will it be easy enough to teach a non-programmer how to setup spreadsheets using queries to the database (assuming they're willing to put in some time with familiarization, but not very much)?
I think that about covers it for now, thank you for your time!
Although you could certainly use a database to do what you're asking, I'm not sure you really want to add that complexity. I don't see much benefit in adding a database to your mix. If you were already pulling data from a database as well, then it'd make more sense to add some tables for this and use it.
From what I currently understand of your requirements, since you're using python anyway, you could do your preprocessing in python, then just dump out the processed/augmented values into other csv files for Excel to import. For a more automated solution, you could even write the results directly to the spreadsheets from Python using something like xlwt.
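A hedged sketch of that last option with xlwt (the sheet layout, values and file names are invented; note xlwt writes legacy .xls files, so you'd want something like openpyxl for .xlsx):

    import xlwt

    processed = [("case_1", 12.5), ("case_2", 9.8)]  # illustrative processed values

    wb = xlwt.Workbook()
    ws = wb.add_sheet("Results")
    ws.write(0, 0, "Case")
    ws.write(0, 1, "Value")
    for row, (name, value) in enumerate(processed, start=1):
        ws.write(row, 0, name)
        ws.write(row, 1, value)
    wb.save("results.xls")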

Best DataMining Database

I am an occasional Python programmer who has so far only worked with MySQL or SQLite databases. I am the computer person for everything in a small company and I have started a new project where I think it is about time to try new databases.
The sales department makes a CSV dump every week and I need to make a small scripting application that allows people from other departments to mix the information, mostly by linking the records. I have all this solved; my problem is the speed. I am using just plain text files for all this and, unsurprisingly, it is very slow.
I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with big amounts of data in a decent time.
Update: I think I was not very detailed about my database usage, and thus explained my problem badly. I am reading all the data (~900 MB or more) from a CSV into a Python dictionary and then working with it. My problem is storing, and mostly reading, the data quickly.
Many thanks!
Quick Summary
You need enough memory (RAM) to solve your problem efficiently, so I think you should upgrade your memory. When reading the excellent High Scalability blog you will notice that big sites, in order to solve their problem efficiently, store the complete problem set in memory.
You do need a central database solution. I don't think doing this by hand with Python dictionaries alone will get the job done.
How to solve "your problem" depends on your "queries". What I would try first is putting your data in Elasticsearch (see below) and querying the database to see how it performs. I think this is the easiest way to tackle your problem, but as you can read below there are a lot of ways to do it.
We know:
You use Python as your programming language.
Your database is ~900 MB (I think that's pretty large, but absolutely manageable).
You have loaded all the data into a Python dictionary. Here is where I assume the problem lies: Python tries to store the dictionary (and Python dictionaries aren't the most memory friendly) in your memory, but you don't have enough memory (how much do you have?). When that happens you end up using a lot of virtual memory, and when you attempt to read the dictionary you are constantly swapping data from your disk into memory. This swapping causes "thrashing". I am assuming that your computer does not have enough RAM. If so, then I would first upgrade your memory with at least 2 GB of extra RAM. Once your problem set is able to fit in memory, solving the problem is going to be a lot faster. I opened my computer architecture book, where it (the memory hierarchy) says that main memory access time is about 40-80 ns while disk access time is about 5 ms. That is a BIG difference.
Missing information
Do you have a central server? You should use/have one.
What operating system does your server run? Linux/Unix/Windows/Mac OS X? In my opinion your server should run Linux/Unix/Mac OS X.
How much memory does your server have?
Could you describe your data set (the CSV) a little better?
What kind of data mining are you doing? Do you need full-text-search capabilities? I am assuming you are not doing any complicated (SQL) queries; performing that task with only Python dictionaries would be a complicated problem. Could you formalize the queries that you would like to perform? For example:
"get all users who work for departement x"
"get all sales from user x"
Database needed
"I am the computer person for everything in a small company and I have started a new project where I think it is about time to try new databases."
You are surely right that you need a database to solve your problem. Doing it yourself using only Python dictionaries is difficult, especially when your problem set can't fit in memory.
MySQL
"I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with big amounts of data in a decent time."
A centralized(Client-server architecture) database is exactly what you need to solve your problem. Let all the users access the database from 1 PC which you manage. You can use MySQL to solve your problem.
Tokyo Tyrant
You could also use Tokyo Tyrant to store all your data. Tokyo Tyrant is pretty fast and the data does not have to be stored in RAM. It handles retrieving data more efficiently than Python dictionaries. However, if your problem can fit completely in memory, I think you should have a look at Redis (below).
Redis:
You could, for example, use Redis (quick start in 5 minutes) to store all sales in memory. Redis is extremely fast and powerful and can do these kinds of queries insanely quickly. The only problem with Redis is that the data set has to fit completely in RAM, but I believe the author is working on that (the nightly build already supports going beyond RAM). Also, as I said previously, solving your problem set completely from memory is how big sites solve their problems in a timely manner.
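To give a feel for it, a small sketch with the redis-py client (the key names, fields and indexing scheme are illustrative assumptions, not a recommended schema):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # One hash per sale record, plus a per-user index set
    r.hset("sale:1001", mapping={"user": "alice", "amount": "250", "week": "2010-W12"})
    r.sadd("sales:by_user:alice", "sale:1001")

    # "get all sales from user alice"
    for key in r.smembers("sales:by_user:alice"):
        print(r.hgetall(key))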
Document stores
This article compares KV stores with document stores like CouchDB/Riak/MongoDB. Document stores are better at searching (though a little slower than KV stores), but aren't good at full-text search.
Full-text-search
If you want to do full-text-search queries you could look at:
elasticsearch (videos): When I saw the video demonstration of elasticsearch it looked pretty cool. You could try putting (POSTing simple JSON) your data into elasticsearch and see how fast it is. I am following elasticsearch on GitHub and the author is committing a lot of new code to it.
solr (tutorial): A lot of big companies are using Solr (GitHub, Digg) to power their search. They got a big boost going from MySQL full-text search to Solr.
You probably do need a full relational DBMS, if not right now then very soon. If you start now, while your problems and data are simple and straightforward, then when they become complex and difficult you will have plenty of experience with at least one DBMS to help you. You probably don't need MySQL on all desktops; you might install it on a server, for example, and feed data out over your network, but you perhaps need to provide more information about your requirements, toolset and equipment to get better suggestions.
And, while the other DBMSes have their strengths and weaknesses too, there's nothing wrong with MySQL for large and complex databases. I don't know enough about SQLite to comment knowledgeably about it.
EDIT: @Eric, from your comments to my answer and the other answers, I hold even more strongly the view that it is time you moved to a database. I'm not surprised that trying to do database operations on a 900 MB Python dictionary is slow. I think you first have to convince yourself, then your management, that you have reached the limits of what your current toolset can cope with, and that future developments are threatened unless you rethink matters.
If your network really can't support a server-based database then (a) you really need to make your network robust, reliable and performant enough for such a purpose, but (b) if that is not an option, or not an early option, you should be thinking along the lines of a central database server passing out digests/extracts/reports to other users, rather than a simultaneous, full RDBMS working in a client-server configuration.
The problems you are currently experiencing are problems of not having the right tools for the job. They are only going to get worse. I wish I could suggest a magic way in which this is not the case, but I can't and I don't think anyone else will.
Have you done any benchmarking to confirm that it is the text files that are slowing you down? If you haven't, there's a good chance that tweaking some other part of the code will speed things up so that it's fast enough.
It sounds like each department has their own feudal database, and this implies a lot of unnecessary redundancy and inefficiency.
Instead of transferring hundreds of megabytes to everyone across your network, why not keep your data in MySQL and have the departments upload their data to the database, where it can be normalized and accessible by everyone?
As your organization grows, having completely different departmental databases that are unaware of each other, and contain potentially redundant or conflicting data, is going to become very painful.
Does the machine this process runs on have sufficient memory and bandwidth to handle this efficiently? Putting MySQL on a slow machine and recoding the tool to use MySQL rather than text files could potentially be far more costly than simply adding memory or upgrading the machine.
Here is a performance benchmark of different database suites:
Database Speed Comparison
I'm not sure how objective the above comparison is though, seeing as it's hosted on sqlite.org. SQLite only seems to be a bit slower when dropping tables; otherwise you shouldn't have any problems using it. Both SQLite and MySQL seem to have their own strengths and weaknesses: in some tests the one is faster than the other, in other tests the reverse is true.
If you've been experiencing lower than expected performance, perhaps it is not SQLite that is causing this; have you done any profiling or similar to make sure nothing else is causing your program to misbehave?
EDIT: Updated with a link to a slightly more recent speed comparison.
It has been a couple of months since I posted this question and I wanted to let you all know how I solved this problem. I am using Berkeley DB with the module bsddb instead of loading all the data into a Python dictionary. I am not fully happy, but my users are.
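This isn't the author's actual code, but a minimal sketch of the dict-like bsddb usage described (assuming the legacy bsddb/bsddb3 hashopen API; the file name and record format are invented):

    import bsddb  # "bsddb3" (with bytes keys/values) on Python 3

    db = bsddb.hashopen("sales.db", "c")   # create the file if it does not exist

    # Store one record per key instead of one giant in-memory dictionary
    db["order:1001"] = "alice;250;2010-W12"
    db["order:1002"] = "bob;120;2010-W12"

    print(db["order:1001"])   # records are read from disk on demand
    db.close()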
My next step is trying to get a shared server with Redis, but unless users start complaining about speed, I doubt I will get it.
Many thanks everybody who helped here, and I hope this question and answers are useful to somebody else.
If you have that problem with a CSV file, maybe you can just pickle the dictionary and generate a pickle "binary" file with the pickle.HIGHEST_PROTOCOL option. It can be faster to read and you get a smaller file. You can load the CSV file once and then generate the pickled file, allowing faster loads on subsequent accesses.
Anyway, with 900 MB of information, loading it into memory is going to take some time. Another approach is not to load it all in one step, but to load only the information when needed, maybe splitting it into different files by date or any other category (company, type, etc.).
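A small sketch of that pickling approach (file names are illustrative): parse the CSV dump once, cache it as a pickle, and let later runs just unpickle it:

    import csv
    import pickle

    # One-off conversion: parse the weekly CSV dump and cache it as a pickle
    with open("dump.csv", newline="") as f:
        data = {row[0]: row[1:] for row in csv.reader(f)}

    with open("dump.pickle", "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Subsequent runs load the binary cache, which is much faster than re-parsing
    with open("dump.pickle", "rb") as f:
        data = pickle.load(f)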
Take a look at mongodb.
